I am using Hydra to configure deep learning projects. I want to combine several datasets for training. Since the number of datasets is not known in advance, I want to get them as a list, and I want to select them through the defaults list in the parent config.yaml file.
I have found a working solution. I am posting it here because it was hard to find and could be useful to others. I am also wondering whether there is a better solution to the problem.
After some searching, I arrived at the oc.dict.values interpolation (see this resolver, and particularly this solution).
My project structure is:
├── configs
│   ├── config.yaml
│   └── data_repository
│       ├── data1.yaml
│       ├── data2.yaml
│       └── data3.yaml
└── test_hydra.py
All my dataset configurations are in the data_repository subfolder.
As an example, I want to use only data1 and data2, as shown in the config.yaml file:
# config.yaml
defaults:
  - _self_
  - data_repository/data1
  - data_repository/data2
hydra:
  job:
    chdir: True
data_used: ${oc.dict.values:data_repository}
data1.yaml:
# @package data_repository.data1
dataset_name: data1
number_layers: 1
data2.yaml:
# @package data_repository.data2
dataset_name: data2
number_layers: 1
data3.yaml:
# @package data_repository.data3
dataset_name: data3
number_layers: 1
test_hydra.py:
# test_hydra.py
import hydra
from omegaconf import OmegaConf


@hydra.main(config_name='config', version_base="1.1", config_path="configs")
def train(config):
    config = OmegaConf.structured(config)
    print("\n")
    print(config)
    print("\n" + OmegaConf.to_yaml(config) + "\n")
    print("config.data_used = ", config["data_used"])
    for i, data in enumerate(config.data_used):
        print(f"config.data_used[{i}] = {data}")


if __name__ == "__main__":
    train()
Running test_hydra.py (with hydra-core 1.3.2) gives the output:
{'data_used': '${oc.dict.values:data_repository}', 'data_repository': {'data1': {'dataset_name': 'data1', 'number_layers': 1}, 'data2': {'dataset_name': 'data2', 'number_layers': 1}}}
data_used: ${oc.dict.values:data_repository}
data_repository:
  data1:
    dataset_name: data1
    number_layers: 1
  data2:
    dataset_name: data2
    number_layers: 1
config.data_used = ['${data_repository.data1}', '${data_repository.data2}']
config.data_used[0] = {'dataset_name': 'data1', 'number_layers': 1}
config.data_used[1] = {'dataset_name': 'data2', 'number_layers': 1}
This is the desired output. The config contains the dictionary data_repository, which I will not use directly, and the list data_used, which holds the datasets I want.
It works, but it could be more elegant: the data appears twice (in data_repository and in data_used), and the line data_used: ${oc.dict.values:data_repository} in config.yaml is a little cryptic. Do you have any suggestions for improvement?