[TOC]
- Title: Using Kedro and Optuna for Your Project
- Review Date: Wed, Mar 27, 2024
- url: https://neptune.ai/blog/kedro-pipelines-with-optuna-hyperparameter-sweeps
Use Kedro and Optuna for your ML project
Example pyproject.toml
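A minimal sketch of what it might look like (package name and versions are placeholders); the [tool.kedro] table is what marks the directory as a Kedro project:

```toml
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = "my_kedro_project"  # placeholder
version = "0.1.0"
dependencies = [
    "kedro~=0.19.0",
    "optuna~=3.5",
]

[tool.kedro]
package_name = "my_kedro_project"
project_name = "my-kedro-project"
kedro_init_version = "0.19.0"
```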
The default Kedro project structure is as follows:
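(abbreviated; exact files vary slightly across Kedro versions)

```
project-name/
├── conf/
│   ├── base/
│   │   ├── catalog.yml
│   │   └── parameters.yml
│   └── local/
├── data/
│   ├── 01_raw/
│   ├── 02_intermediate/
│   ├── ...
│   └── 08_reporting/
├── docs/
├── notebooks/
├── src/
│   └── <package_name>/
│       ├── pipelines/
│       ├── pipeline_registry.py
│       └── settings.py
├── pyproject.toml
└── README.md
```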
Use Jupyter Lab
ref: https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html
Running kedro jupyter lab starts Jupyter with the catalog, context, pipelines, and session variables preloaded, which is very useful for prototyping data preprocessing interactively.
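For example (a sketch; the dataset name is a hypothetical catalog entry):

```python
# Inside a notebook started with `kedro jupyter lab`, the catalog,
# context, pipelines, and session variables are already defined.
df = catalog.load("model_input_table")  # hypothetical catalog entry
df.head()

# Parameters are exposed through the context:
context.params.get("model_options")
```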
Global configs and catalog factory issue
To have global configs that can be used in both parameters.yml and catalog.yml, you can create a globals.yml file and reference its entries with the ${globals:...} resolver of Kedro's OmegaConfigLoader. The sketches below show the pattern; the model_type key and its value are placeholders:
globals.yml
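```yaml
# conf/base/globals.yml -- key and value are placeholders
model_type: regressor
```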
parameters.yml
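```yaml
# conf/base/parameters.yml -- pulls the value in via the globals resolver
model_options:
  model: ${globals:model_type}
```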
catalog.yml
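```yaml
# conf/base/catalog.yml -- the same global can drive file paths
regressor:
  type: pickle.PickleDataset
  filepath: data/06_models/${globals:model_type}.pkl
```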
However, combining globals interpolation with catalog factory patterns is not always allowed: Kedro can fail to parse the node that consumes such an entry.
Some hints on using Kedro namespaces and structuring configs:
- if you have a so-called "alternative" module that is very likely to swap out the current module, give it a separate namespace
- if you have two modules that need to be compared with each other, instantiate the same pipeline under different namespaces, as in the code below
```python
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline

# split_data and train_model are assumed to be defined in the project's nodes module
from .nodes import split_data, train_model


def create_pipeline(**kwargs) -> Pipeline:
    pipeline_instance = pipeline(
        [
            node(
                func=split_data,
                inputs=["model_input_table", "params:model_options"],
                outputs=["X_train", "y_train"],
                name="split_data_node",
            ),
            node(
                func=train_model,
                inputs=["X_train", "y_train"],
                outputs="regressor",
                name="train_model_node",
            ),
        ]
    )
    # Instantiate the same pipeline twice; declaring "model_input_table" as an
    # input keeps it shared (un-namespaced) across both instances.
    ds_pipeline_1 = pipeline(
        pipe=pipeline_instance,
        inputs="model_input_table",
        namespace="active_modelling_pipeline",
    )
    ds_pipeline_2 = pipeline(
        pipe=pipeline_instance,
        inputs="model_input_table",
        namespace="candidate_modelling_pipeline",
    )
    return ds_pipeline_1 + ds_pipeline_2
```
catalog.yml
"{namespace}.regressor": type: pickle.PickleDataset filepath: data/06_models/regressor_{namespace}.pkl versioned: true
- if you have parallel environments that will not impact each other, control them using global configs (see the sketch below)
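One way to do this (a sketch; the staging environment name and the data_root key are hypothetical) is to give each configuration environment its own globals.yml and pick the environment at run time:

```yaml
# conf/base/globals.yml
data_root: data

# conf/staging/globals.yml -- overrides the base value for this environment
data_root: s3://my-bucket/data
```

```bash
kedro run --env=staging
```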
Troubleshooting
- how to save the intermediate data output
ref: https://www.youtube.com/watch?v=sll_LhZE-p8
The official team suggests storing intermediate datasets as Parquet, which is faster to access; to save an intermediate data output, just register it in the catalog file.
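A sketch of such a catalog entry (the dataset name and path are placeholders):

```yaml
# conf/base/catalog.yml
model_input_table:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/model_input_table.pq
```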