[TOC]

  1. Title: Using Kedro and Optuna for Your Project
  2. Review Date: Wed, Mar 27, 2024
  3. url: https://neptune.ai/blog/kedro-pipelines-with-optuna-hyperparameter-sweeps

Use Kedro and Optuna for your ML project

  • Kedro - manages the ML pipeline
  • Optuna - hyperparameter tuning framework (a minimal integration sketch follows this list)
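
As a minimal sketch of how the two tools fit together (the tune_hyperparameters function and its toy objective are hypothetical stand-ins, not from the original post), an Optuna study can run inside the body of a Kedro node:

import optuna


def tune_hyperparameters(n_trials: int) -> dict:
    """Kedro node body: run an Optuna study and return the best hyperparameters."""

    def objective(trial: optuna.Trial) -> float:
        # search space for two toy hyperparameters
        lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
        n_layers = trial.suggest_int("n_layers", 1, 4)
        # stand-in for a real training/validation run; return the metric to minimize
        return (lr - 1e-3) ** 2 + 0.01 * n_layers

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params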

Example pyproject.toml

[build-system]
requires = [ "setuptools",]
build-backend = "setuptools.build_meta"

[project]
name = "kedro_hyperparameter_sweep_test"
authors = [
    {name = "Sukai Huang", email = "hsk6808065@163.com"}
]
readme = "README.md"
dynamic = [ "dependencies", "version",]

[project.scripts]
kedro-hyperparameter-sweep-test = "kedro_hyperparameter_sweep_test.__main__:main"

[project.optional-dependencies]
docs = [ "docutils<0.18.0", "sphinx~=3.4.3", "sphinx_rtd_theme==0.5.1", "nbsphinx==0.8.1", "sphinx-autodoc-typehints==1.11.1", "sphinx_copybutton==0.3.1", "ipykernel>=5.3, <7.0", "Jinja2<3.1.0", "myst-parser~=0.17.2",]

[tool.kedro]
package_name = "kedro_hyperparameter_sweep_test"
project_name = "kedro_hyperparameter_sweep_test"
kedro_init_version = "0.19.3"
tools = [ "Linting", "Custom Logging", "Documentation", "Data Structure", "Kedro Viz",]
example_pipeline = "False"
source_dir = "src"

[tool.ruff]
line-length = 88
show-fixes = true
select = [ "F", "W", "E", "I", "UP", "PL", "T201",]
ignore = [ "E501",]

[project.entry-points."kedro.hooks"]

[tool.ruff.format]
docstring-code-format = true

[tool.setuptools.dynamic.dependencies]
file = "requirements.txt"

[tool.setuptools.dynamic.version]
attr = "kedro_hyperparameter_sweep_test.__version__"

[tool.setuptools.packages.find]
where = [ "src",]
namespaces = false

[tool.setuptools.package-data]
kedro_hyperparameter_sweep_test = ["*.csv", "*.md", "*.log"]

The default Kedro project structure is as follows:

project-dir         # Parent directory of the template
├── .gitignore      # Hidden file that prevents staging of unnecessary files to `git`
├── conf            # Project configuration files
├── data            # Local project data (not committed to version control)
├── docs            # Project documentation
├── notebooks       # Project-related Jupyter notebooks (can be used for experimental code before moving the code to src)
├── pyproject.toml  # Identifies the project root and contains configuration information
├── README.md       # Project README
└── src             # Project source code

Use JupyterLab

ref: https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html

It is very useful for prototyping data preprocessing code before moving it into pipeline nodes.
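
For example, inside a notebook started via kedro jupyter lab, the Kedro IPython extension injects catalog, context, session, and pipelines (the dataset name debug_data is taken from the catalog example later in these notes):

# `catalog` is injected by the Kedro IPython extension
df = catalog.load("debug_data")  # any dataset registered in catalog.yml
df.head()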

Global configs and catalog factory issue

To have global configs that can be used in both parameters.yml and catalog.yml, create a globals.yml file (typically under conf/base/) and reference its keys with the ${globals:...} resolver:

globals.yml

env_name: "crafter" # crafter | minigrid 
env_purpose: "lang" # lang | policy, lang is for language model training, policy is for policy training

parameters.yml

env_name: "${globals:env_name}"
env_purpose: "${globals:env_purpose}"

catalog.yml

debug_data:
  type: pandas.CSVDataset
  filepath: "data/03_traj_instr_pairs/${globals:env_name}/test_debug_data.csv"

However, the following, where the resolver appears in the dataset (factory) name itself, is not allowed:

"${globals:env_name}_debug_data#csv":
  type: pandas.CSVDataset
  filepath: "data/03_traj_instr_pairs/${globals:env_name}-nested/test_debug_data.csv"

This is because the node below cannot be parsed: the ${globals:...} resolver is applied when configuration files are loaded, so the templated string in the Python node definition is never resolved into a valid dataset name.

node(
    func=generate_traj_instr_pairs,
    inputs=[
        "expert_model",
        "expert_model_eval_env",
        "eval_env_init_obs",
        "parameters",
        "params:traj_instr_pairs_params",
    ],
    outputs="${globals:env_name}_debug_data#csv",
    name="generate_traj_instr_pairs_node",
)
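
One workaround, consistent with the allowed catalog entry above, is to keep the dataset name static in the Python node and confine the ${globals:...} interpolation to the filepath in catalog.yml (a sketch, not from the original post):

node(
    func=generate_traj_instr_pairs,
    inputs=[
        "expert_model",
        "expert_model_eval_env",
        "eval_env_init_obs",
        "parameters",
        "params:traj_instr_pairs_params",
    ],
    # static dataset name; the env_name global is applied only in catalog.yml,
    # as in the allowed `debug_data` entry above
    outputs="debug_data",
    name="generate_traj_instr_pairs_node",
)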

Some hints on using Kedro namespaces and structuring configs

  • if you have a so-called “alternative” module that is likely to swap out the current module, give it a separate namespace

  • if you have two modules that need to be compared with each other, instantiate the same pipeline under different namespaces (a matching parameters.yml sketch follows this list):

    from kedro.pipeline import Pipeline, node
    from kedro.pipeline.modular_pipeline import pipeline

    from .nodes import split_data, train_model  # node functions defined in nodes.py

    def create_pipeline(**kwargs) -> Pipeline:
        pipeline_instance = pipeline(
            [
                node(
                    func=split_data,
                    inputs=["model_input_table", "params:model_options"],
                    outputs=["X_train", "y_train"],
                    name="split_data_node",
                ),
                node(
                    func=train_model,
                    inputs=["X_train", "y_train"],
                    outputs="regressor",
                    name="train_model_node",
                ),
            ]
        )
        ds_pipeline_1 = pipeline(
            pipe=pipeline_instance,
            inputs="model_input_table",
            namespace="active_modelling_pipeline",
        )
        ds_pipeline_2 = pipeline(
            pipe=pipeline_instance,
            inputs="model_input_table",
            namespace="candidate_modelling_pipeline",
        )
    
        return ds_pipeline_1 + ds_pipeline_2
    

    catalog.yml

    "{namespace}.regressor":
      type: pickle.PickleDataset
      filepath: data/06_models/regressor_{namespace}.pkl
      versioned: true
    
  • if you have parallel environments that do not impact each other, control them using global configs
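
Because the namespace also prefixes parameters (params:model_options becomes params:active_modelling_pipeline.model_options), parameters.yml needs one block per namespace; a sketch with illustrative values:

active_modelling_pipeline:
  model_options:
    test_size: 0.2
    random_state: 3

candidate_modelling_pipeline:
  model_options:
    test_size: 0.2
    random_state: 8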

Troubleshooting

  1. how to save the intermediate data output

ref: https://www.youtube.com/watch?v=sll_LhZE-p8

The official team suggested that intermediate datasets can use Parquet, which is faster to access; to save an intermediate data output, just register it in the catalog file.
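
For example, a catalog.yml entry (the dataset name is hypothetical; pandas.ParquetDataset ships with kedro-datasets):

intermediate_features:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/intermediate_features.parquet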