[TOC]

  1. Title: Using Kedro And Optuna for Your Project
  2. Review Date: Wed, Mar 27, 2024
  3. url: https://neptune.ai/blog/kedro-pipelines-with-optuna-hyperparameter-sweeps

Use Kedro and Optuna for your ML project

Example pyproject.toml

[build-system]
requires = [ "setuptools",]
build-backend = "setuptools.build_meta"

[project]
name = "kedro_hyperparameter_sweep_test"
authors = [
    {name = "Sukai Huang", email = "hsk6808065@163.com"}
]
readme = "README.md"
dynamic = [ "dependencies", "version",]

[project.scripts]
kedro-hyperparameter-sweep-test = "kedro_hyperparameter_sweep_test.__main__:main"

[project.optional-dependencies]
docs = [ "docutils<0.18.0", "sphinx~=3.4.3", "sphinx_rtd_theme==0.5.1", "nbsphinx==0.8.1", "sphinx-autodoc-typehints==1.11.1", "sphinx_copybutton==0.3.1", "ipykernel>=5.3, <7.0", "Jinja2<3.1.0", "myst-parser~=0.17.2",]

[tool.kedro]
package_name = "kedro_hyperparameter_sweep_test"
project_name = "kedro_hyperparameter_sweep_test"
kedro_init_version = "0.19.3"
tools = [ "Linting", "Custom Logging", "Documentation", "Data Structure", "Kedro Viz",]
example_pipeline = "False"
source_dir = "src"

[tool.ruff]
line-length = 88
show-fixes = true
select = [ "F", "W", "E", "I", "UP", "PL", "T201",]
ignore = [ "E501",]

[project.entry-points."kedro.hooks"]

[tool.ruff.format]
docstring-code-format = true

    
[tool.setuptools.dynamic.dependencies]
file = "requirements.txt"

[tool.setuptools.dynamic.version]
attr = "kedro_hyperparameter_sweep_test.__version__"

[tool.setuptools.packages.find]
where = [ "src",]
namespaces = false

[tool.setuptools.package-data]
kedro_hyperparameter_sweep_test = ["*.csv", "*.md", "*.log"]
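
Note that dependencies are declared dynamically via `[tool.setuptools.dynamic.dependencies]`, so they live in a separate `requirements.txt` at the project root. A minimal requirements file for a Kedro + Optuna project might look like the following (the pins are illustrative, not taken from the original project):

```text
kedro~=0.19.3
kedro-datasets[pandas]>=2.0
optuna
pandas
```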

The default Kedro project structure is as follows:

project-dir         # Parent directory of the template
├── .gitignore      # Hidden file that prevents staging of unnecessary files to `git`
├── conf            # Project configuration files
├── data            # Local project data (not committed to version control)
├── docs            # Project documentation
├── notebooks       # Project-related Jupyter notebooks (can be used for experimental code before moving the code to src)
├── pyproject.toml  # Identifies the project root and contains configuration information
├── README.md       # Project README
└── src             # Project source code

Use JupyterLab

ref: https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html

It is very useful for data preprocessing and for prototyping code before moving it into pipeline nodes.
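
In practice, running `kedro jupyter lab` from the project root (see the reference above) starts JupyterLab with the `catalog`, `context`, `pipelines`, and `session` variables preloaded, so datasets registered in `catalog.yml` can be inspected interactively with `catalog.load(...)`.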

Global configs and catalog factory issue

To share global config values between `parameters.yml` and `catalog.yml`, you can create a `globals.yml` file and reference its entries with the `${globals:...}` resolver, as follows.

globals.yml

env_name: "crafter" # crafter | minigrid 
env_purpose: "lang" # lang | policy, lang is for language model training, policy is for policy training

parameters.yml

env_name: "${globals:env_name}"
env_purpose: "${globals:env_purpose}"

catalog.yml

debug_data:
  type: pandas.CSVDataset
  filepath: "data/03_traj_instr_pairs/${globals:env_name}/test_debug_data.csv"

However, the following is not allowed

"${globals:env_name}_debug_data#csv":
  type: pandas.CSVDataset
  filepath: "data/03_traj_instr_pairs/${globals:env_name}-nested/test_debug_data.csv"

This is because the following node cannot be parsed: the `${globals:...}` syntax is only resolved inside configuration files, so the templated output name in the node definition below is never expanded into a real catalog entry.

node(
    func=generate_traj_instr_pairs,
    inputs=[
        "expert_model",
        "expert_model_eval_env",
        "eval_env_init_obs",
        "parameters",
        "params:traj_instr_pairs_params",
    ],
    outputs="${globals:env_name}_debug_data#csv",
    name="generate_traj_instr_pairs_node",
)
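
One possible workaround (a sketch, not from the original post) is to resolve the environment name in the pipeline code itself, for example by reading `conf/base/globals.yml` directly and building a concrete dataset name that is registered (or matched by a dataset factory) in the catalog. `generate_traj_instr_pairs` and its module are assumed from the snippets above, and the path handling assumes `kedro run` is executed from the project root:

```python
# pipeline.py -- hypothetical workaround: resolve the global in Python,
# since ${globals:...} only works inside configuration files.
from pathlib import Path

import yaml
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import generate_traj_instr_pairs  # assumed node module

# Assumes the working directory is the project root (as with `kedro run`).
_globals = yaml.safe_load(Path("conf/base/globals.yml").read_text())
ENV_NAME = _globals["env_name"]


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=generate_traj_instr_pairs,
                inputs=[
                    "expert_model",
                    "expert_model_eval_env",
                    "eval_env_init_obs",
                    "parameters",
                    "params:traj_instr_pairs_params",
                ],
                # Concrete name, e.g. "crafter_debug_data#csv", which must be
                # registered (or matched by a dataset factory) in catalog.yml.
                outputs=f"{ENV_NAME}_debug_data#csv",
                name="generate_traj_instr_pairs_node",
            ),
        ]
    )
```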

Some hints on the usage of Kedro namespaces and the config structure

Troubleshoot

  1. How to save the intermediate data output

ref: https://www.youtube.com/watch?v=sll_LhZE-p8

The official team suggests storing intermediate datasets as Parquet, which is faster to access. To save an intermediate data output, simply register it in the catalog file.
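
For example, an intermediate dataset could be registered in `catalog.yml` along these lines (dataset name and path are made up for illustration; the class is `pandas.ParquetDataset` in recent `kedro-datasets` releases, `pandas.ParquetDataSet` in older ones):

```yaml
# conf/base/catalog.yml -- hypothetical intermediate dataset saved as Parquet
preprocessed_trajectories:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/preprocessed_trajectories.parquet
```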