During a job in my MLOps pipeline, I want to read a dataset's content, modify it, and upload the new content, overwriting the old file. Is there any way to do this? I have tried different approaches, such as Dataset.Tabular.register_pandas_dataframe, but it creates a new directory for the new file, and in my case I need to replace the existing file.
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()          # load the workspace
ds = ws.get_default_datastore()       # datastore holding the parquet file

# Read the existing file into a pandas DataFrame
dataset_test = Dataset.Tabular.from_parquet_files(path=[(ds, 'main.parquet')])
file_dataframe = dataset_test.to_pandas_dataframe()

# Modify the content
file_dataframe['column'] = 'new_value'

# Try to write it back to the same location
file_path = 'test_dir/main.parquet'
file_dataset = Dataset.Tabular.register_pandas_dataframe(
    dataframe=file_dataframe,
    target=(ds, file_path),
    name='main.parquet',
    description='Test upload new file'
)
According to the documentation below:
azureml.data.dataset_factory.TabularDatasetFactory class - Azure Machine Learning Python | Microsoft Learn
The target field of the register_pandas_dataframe function requires a datastore object and the folder path where the parquet files should be stored, not the path of a parquet file. The documentation also notes that, to avoid conflicts, a GUID folder is created under that path, so you cannot overwrite an existing file inside the folder.
However, you can register the dataset under the same name, using the name field of register_pandas_dataframe, which creates multiple versions of the data. Later, you can access the latest version, which is the updated dataset. Example:
The code below was run three times with the same name, Par_updates. Each run writes the data to a different GUID folder and registers a new version of the dataset.
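A minimal sketch of that registration step, assuming a workspace loaded from config, the default datastore, and an illustrative target folder test_dir (the DataFrame here is a stand-in for the modified data from the question):

import pandas as pd
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
ds = ws.get_default_datastore()

# Stand-in for the modified data
file_dataframe = pd.DataFrame({'column': ['new_value']})

# Registering under the same name on each run creates a new dataset
# version; the parquet files land in a fresh GUID folder under target.
updated = Dataset.Tabular.register_pandas_dataframe(
    dataframe=file_dataframe,
    target=(ds, 'test_dir'),  # a folder path, not a file path
    name='Par_updates',
    description='Updated parquet data'
)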
And in the data asset view in Azure Machine Learning Studio, the same asset lists all three versions.
The third version is the latest altered version of your data, and you can access it using the code below.
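A minimal sketch of retrieving it, assuming the workspace is loaded from config; Dataset.get_by_name returns the latest version when no version is specified:

from azureml.core import Workspace, Dataset

ws = Workspace.from_config()

# Without a version argument, get_by_name resolves to the latest version
latest = Dataset.get_by_name(ws, name='Par_updates')
print(latest.version)  # e.g. 3 after three registrations

df = latest.to_pandas_dataframe()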