[TOC]
- Title: Using Kedro and Optuna for Your Project
- Review Date: Wed, Mar 27, 2024
- url: https://neptune.ai/blog/kedro-pipelines-with-optuna-hyperparameter-sweeps
Use Kedro and Optuna for your ML project
Example pyproject.toml
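A minimal sketch of what this could look like for a Kedro + Optuna project (the `my_project` / `my-project` names are placeholders; a Kedro 0.19-era starter generates something similar, usually with dependencies kept in requirements.txt instead):

```toml
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = "my_project"            # placeholder package name
version = "0.1"
dependencies = [
    "kedro~=0.19",
    "kedro-datasets[pandas]",  # dataset implementations used in the catalog
    "optuna",                  # hyperparameter sweeps
]

[tool.kedro]
package_name = "my_project"
project_name = "my-project"
kedro_init_version = "0.19.0"
```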
The default Kedro project structure is as follows:
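Roughly like this (exact contents vary with the Kedro version and the starter you pick):

```
my-project/
├── conf/
│   ├── base/          # shared configuration (catalog.yml, parameters.yml, ...)
│   └── local/         # user-specific / secret configuration (git-ignored)
├── data/              # 01_raw ... 08_reporting data layers
├── notebooks/
├── src/
│   └── my_project/
│       ├── pipelines/
│       ├── pipeline_registry.py
│       └── settings.py
├── pyproject.toml
└── README.md
```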
Use JupyterLab
ref: https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html
It is very useful for prototyping data preprocessing interactively.
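Kedro ships a Jupyter integration that starts a lab session with the project's `catalog`, `context`, `pipelines`, and `session` variables preloaded:

```bash
kedro jupyter lab
```

```python
# Inside the notebook, the Kedro extension injects these variables, so you can
# pull any registered dataset straight into a DataFrame for exploration.
df = catalog.load("model_input_table")  # "model_input_table" is a placeholder dataset name
df.head()
```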
Global configs and catalog factory issue
To have global values that can be used in both parameters.yml and catalog.yml, create a globals.yml file and do the following:
globals.yml
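For instance (the keys and values below are placeholders):

```yaml
# conf/base/globals.yml
model_name: regressor
data_root: data
```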
parameters.yml
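With the OmegaConfigLoader (Kedro's default loader from 0.19), globals are pulled in through the `globals` resolver:

```yaml
# conf/base/parameters.yml
model_options:
  model_name: ${globals:model_name}
  test_size: 0.2           # placeholder option
```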
catalog.yml
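And the same globals can drive catalog entries, for example file paths:

```yaml
# conf/base/catalog.yml
regressor:
  type: pickle.PickleDataset
  filepath: ${globals:data_root}/06_models/${globals:model_name}.pkl
  versioned: true
```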
However, the following is not allowed
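For example (an illustrative guess, not the original snippet: as far as I know, factory-style `{placeholder}` keys are only resolved for catalog entries, never in parameters.yml):

```yaml
# conf/base/parameters.yml -- NOT allowed (illustrative guess)
"{namespace}.model_options":
  model_name: ${globals:model_name}
```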
This is because the corresponding node cannot be parsed:
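Again as an illustration (assumed, not the original snippet): a `{...}` placeholder is never substituted inside node inputs, so a parameter reference like this can never be resolved:

```python
from kedro.pipeline import node

# Illustrative guess: "{namespace}" stays a literal string in the node input,
# so Kedro cannot parse/resolve this parameter reference.
node(
    func=split_data,  # split_data as defined in the pipeline's nodes module
    inputs=["model_input_table", "params:{namespace}.model_options"],
    outputs=["X_train", "y_train"],
    name="split_data_node",
)
```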
Some hints on using Kedro namespaces and structuring configs
- If you have a so-called "alternative" module that will very likely swap out a current module, give it a separate namespace.
- If you have two modules that need to be compared with each other, reuse the same pipeline under two different namespaces:

```python
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline

from .nodes import split_data, train_model  # node functions defined in this pipeline's nodes.py


def create_pipeline(**kwargs) -> Pipeline:
    pipeline_instance = pipeline(
        [
            node(
                func=split_data,
                inputs=["model_input_table", "params:model_options"],
                outputs=["X_train", "y_train"],
                name="split_data_node",
            ),
            node(
                func=train_model,
                inputs=["X_train", "y_train"],
                outputs="regressor",
                name="train_model_node",
            ),
        ]
    )
    # The same pipeline instantiated twice under different namespaces
    ds_pipeline_1 = pipeline(
        pipe=pipeline_instance,
        inputs="model_input_table",
        namespace="active_modelling_pipeline",
    )
    ds_pipeline_2 = pipeline(
        pipe=pipeline_instance,
        inputs="model_input_table",
        namespace="candidate_modelling_pipeline",
    )
    return ds_pipeline_1 + ds_pipeline_2
```

catalog.yml

```yaml
"{namespace}.regressor":
  type: pickle.PickleDataset
  filepath: data/06_models/regressor_{namespace}.pkl
  versioned: true
```
- If you have parallel environments that will not impact each other, control them using global configs, as sketched below.
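One way to do that (a sketch, assuming Kedro's configuration environments; the environment name `cloud` is made up): keep per-environment overrides of globals.yml under conf/ and pick one at run time.

```
conf/
├── base/
│   └── globals.yml    # defaults, e.g. data_root: data
└── cloud/             # hypothetical extra environment
    └── globals.yml    # overrides, e.g. data_root: s3://my-bucket/data
```

```bash
kedro run --env cloud   # applies conf/cloud/ overrides on top of conf/base/
```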
Troubleshooting
- How to save an intermediate data output
ref: https://www.youtube.com/watch?v=sll_LhZE-p8
The Kedro team suggests using Parquet for intermediate datasets, since it is faster to access; to save an intermediate output, simply register it in the catalog file.
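For example (dataset name and path are placeholders; `pandas.ParquetDataset` comes from the kedro-datasets package):

```yaml
# conf/base/catalog.yml
model_input_table:
  type: pandas.ParquetDataset
  filepath: data/03_primary/model_input_table.parquet
```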