[TOC]

  1. Title: Using Kedro and Optuna for Your Project
  2. Review Date: Wed, Mar 27, 2024
  3. url: https://neptune.ai/blog/kedro-pipelines-with-optuna-hyperparameter-sweeps

Use Kedro and Optuna for your ML project

  • Kedro - manages the ML pipeline
  • Optuna - hyperparameter tuning framework (a minimal integration sketch follows this list)
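
As a minimal sketch of how the two tools fit together (the tune_hyperparameters function and its toy objective are hypothetical stand-ins, not from the original post), an Optuna study can run inside the body of a Kedro node:

import optuna


def tune_hyperparameters(n_trials: int) -> dict:
    """Kedro node body: run an Optuna study and return the best hyperparameters."""

    def objective(trial: optuna.Trial) -> float:
        # search space for two toy hyperparameters
        lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
        n_layers = trial.suggest_int("n_layers", 1, 4)
        # stand-in for a real training/validation run; return the metric to minimize
        return (lr - 1e-3) ** 2 + 0.01 * n_layers

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params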

Example pyproject.toml

[build-system]
requires = [ "setuptools",]
build-backend = "setuptools.build_meta"

[project]
name = "kedro_hyperparameter_sweep_test"
authors = [
    {name = "Sukai Huang", email = "hsk6808065@163.com"}
]
readme = "README.md"
dynamic = [ "dependencies", "version",]

[project.scripts]
kedro-hyperparameter-sweep-test = "kedro_hyperparameter_sweep_test.__main__:main"

[project.optional-dependencies]
docs = [ "docutils<0.18.0", "sphinx~=3.4.3", "sphinx_rtd_theme==0.5.1", "nbsphinx==0.8.1", "sphinx-autodoc-typehints==1.11.1", "sphinx_copybutton==0.3.1", "ipykernel>=5.3, <7.0", "Jinja2<3.1.0", "myst-parser~=0.17.2",]

[tool.kedro]
package_name = "kedro_hyperparameter_sweep_test"
project_name = "kedro_hyperparameter_sweep_test"
kedro_init_version = "0.19.3"
tools = [ "Linting", "Custom Logging", "Documentation", "Data Structure", "Kedro Viz",]
example_pipeline = "False"
source_dir = "src"

[tool.ruff]
line-length = 88
show-fixes = true
select = [ "F", "W", "E", "I", "UP", "PL", "T201",]
ignore = [ "E501",]

[project.entry-points."kedro.hooks"]

[tool.ruff.format]
docstring-code-format = true

[tool.setuptools.dynamic.dependencies]
file = "requirements.txt"

[tool.setuptools.dynamic.version]
attr = "kedro_hyperparameter_sweep_test.__version__"

[tool.setuptools.packages.find]
where = [ "src",]
namespaces = false

[tool.setuptools.package-data]
kedro_hyperparameter_sweep_test = ["*.csv", "*.md", "*.log"]

The default Kedro project structure is as follows:

project-dir         # Parent directory of the template
├── .gitignore      # Hidden file that prevents staging of unnecessary files to `git`
├── conf            # Project configuration files
├── data            # Local project data (not committed to version control)
├── docs            # Project documentation
├── notebooks       # Project-related Jupyter notebooks (can be used for experimental code before moving the code to src)
├── pyproject.toml  # Identifies the project root and contains configuration information
├── README.md       # Project README
└── src             # Project source code

Use JupyterLab

ref: https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html

It is very useful for prototyping data preprocessing code before moving it into pipeline nodes.
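
For example, inside a notebook started via kedro jupyter lab, the Kedro IPython extension injects catalog, context, session, and pipelines (the dataset name debug_data is taken from the catalog example later in these notes):

# `catalog` is injected by the Kedro IPython extension
df = catalog.load("debug_data")  # any dataset registered in catalog.yml
df.head()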

Global configs and catalog factory issue

To have global configs that can be used in both parameters.yml and catalog.yml, create a globals.yml file (typically under conf/base/) and reference its keys with the ${globals:...} resolver:

globals.yml

env_name: "crafter" # crafter | minigrid 
env_purpose: "lang" # lang | policy, lang is for language model training, policy is for policy training

parameters.yml

env_name: "${globals:env_name}"
env_purpose: "${globals:env_purpose}"

catalog.yml

debug_data:
  type: pandas.CSVDataset
  filepath: "data/03_traj_instr_pairs/${globals:env_name}/test_debug_data.csv"

However, the following, where the resolver appears in the dataset (factory) name itself, is not allowed:

"${globals:env_name}_debug_data#csv":
  type: pandas.CSVDataset
  filepath: "data/03_traj_instr_pairs/${globals:env_name}-nested/test_debug_data.csv"

This is because the node below cannot be parsed: the ${globals:...} resolver is applied when configuration files are loaded, so the templated string in the Python node definition is never resolved into a valid dataset name.

node(
    func=generate_traj_instr_pairs,
    inputs=[
        "expert_model",
        "expert_model_eval_env",
        "eval_env_init_obs",
        "parameters",
        "params:traj_instr_pairs_params",
    ],
    outputs="${globals:env_name}_debug_data#csv",
    name="generate_traj_instr_pairs_node",
)
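
One workaround, consistent with the allowed catalog entry above, is to keep the dataset name static in the Python node and confine the ${globals:...} interpolation to the filepath in catalog.yml (a sketch, not from the original post):

node(
    func=generate_traj_instr_pairs,
    inputs=[
        "expert_model",
        "expert_model_eval_env",
        "eval_env_init_obs",
        "parameters",
        "params:traj_instr_pairs_params",
    ],
    # static dataset name; the env_name global is applied only in catalog.yml,
    # as in the allowed `debug_data` entry above
    outputs="debug_data",
    name="generate_traj_instr_pairs_node",
)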

Some hints on using Kedro namespaces and structuring configs

  • if you have a so-called “alternative” module that is likely to swap out the current module, give it a separate namespace

  • if you have two modules that need to be compared with each other, instantiate the same pipeline under different namespaces (a matching parameters.yml sketch follows this list):

    from kedro.pipeline import Pipeline, node
    from kedro.pipeline.modular_pipeline import pipeline

    from .nodes import split_data, train_model  # node functions defined in nodes.py

    def create_pipeline(**kwargs) -> Pipeline:
        pipeline_instance = pipeline(
            [
                node(
                    func=split_data,
                    inputs=["model_input_table", "params:model_options"],
                    outputs=["X_train", "y_train"],
                    name="split_data_node",
                ),
                node(
                    func=train_model,
                    inputs=["X_train", "y_train"],
                    outputs="regressor",
                    name="train_model_node",
                ),
            ]
        )
        ds_pipeline_1 = pipeline(
            pipe=pipeline_instance,
            inputs="model_input_table",
            namespace="active_modelling_pipeline",
        )
        ds_pipeline_2 = pipeline(
            pipe=pipeline_instance,
            inputs="model_input_table",
            namespace="candidate_modelling_pipeline",
        )
    
        return ds_pipeline_1 + ds_pipeline_2
    

    catalog.yml

    "{namespace}.regressor":
      type: pickle.PickleDataset
      filepath: data/06_models/regressor_{namespace}.pkl
      versioned: true
    
  • if you have parallel environments that do not impact each other, control them using global configs
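
Because the namespace also prefixes parameters (params:model_options becomes params:active_modelling_pipeline.model_options), parameters.yml needs one block per namespace; a sketch with illustrative values:

active_modelling_pipeline:
  model_options:
    test_size: 0.2
    random_state: 3

candidate_modelling_pipeline:
  model_options:
    test_size: 0.2
    random_state: 8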

Troubleshooting

  1. how to save the intermediate data output

ref: https://www.youtube.com/watch?v=sll_LhZE-p8

The official team suggested that intermediate datasets can use Parquet, which is faster to access; to save an intermediate data output, just register it in the catalog file.
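
For example, a catalog.yml entry (the dataset name is hypothetical; pandas.ParquetDataset ships with kedro-datasets):

intermediate_features:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/intermediate_features.parquet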