Update the content of a Dataset in AzureML studio


In a job in my MLOps pipeline, I want to read a dataset's content, modify it, and upload the new content, overwriting the old file. Is there any way to do this? I have tried different approaches, such as Dataset.Tabular.register_pandas_dataframe, but it creates a new directory for the new file, and in my case I need to replace the existing file.

   from azureml.core import Dataset

   # 'ds' is the Datastore that holds main.parquet
   dataset_test = Dataset.Tabular.from_parquet_files(path=[(ds, 'main.parquet')])
   file_dataframe = dataset_test.to_pandas_dataframe()
   file_dataframe['column'] = 'new_value'

   # Intended to replace test_dir/main.parquet, but a new directory is created instead
   file_path = 'test_dir/main.parquet'
   file_dataset = Dataset.Tabular.register_pandas_dataframe(
       dataframe=file_dataframe,
       target=(ds, file_path),
       name='main.parquet',
       description='Test upload new file'
   )

Answered by JayashankarGS:

According to the documentation below:

azureml.data.dataset_factory.TabularDatasetFactory class - Azure Machine Learning Python | Microsoft Learn

The target parameter of the register_pandas_dataframe function expects a datastore object and the folder path where the parquet files should be stored, not the path of a parquet file.

Required, the datastore path where the dataframe parquet data will be uploaded to. A guid folder will be generated under the target path to avoid conflict.

As the documentation notes, a GUID folder is created to avoid conflicts, so you cannot overwrite an existing file inside the target folder.
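For illustration, here is a minimal sketch (assuming a configured workspace and its default datastore; the dataframe and names are made up) of where the data actually lands:

import pandas as pd
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

df = pd.DataFrame({'column': ['new_value']})

# target takes (datastore, folder_path); a GUID subfolder is generated
# under 'test_dir', so nothing already inside the folder is overwritten
registered = Dataset.Tabular.register_pandas_dataframe(
    dataframe=df,
    target=(datastore, 'test_dir'),
    name='test_dataset',
    description='Data lands in test_dir/<guid>/, not directly in test_dir'
)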

However, you can register the dataset under the same name, using the name parameter of the register_pandas_dataframe function, which creates multiple versions of the data. You can then access the latest version, which holds the updated data.

Example:

The code below was run three times with the same name, Par_updates.

file_dataframe = dataset_test.to_pandas_dataframe()
file_path = 'updated'  # folder on the datastore; a GUID subfolder is generated under it

# Registering under the same name each time creates a new version of the data asset
file_dataset = Dataset.Tabular.register_pandas_dataframe(
    dataframe=file_dataframe,
    target=(datastore, file_path),
    name="Par_updates",
    description='Test upload new file'
)

This creates the data in three different GUID folders, registered as three different versions.

(Screenshot: the datastore with three GUID folders.)

And in the data asset:

(Screenshot: the Par_updates data asset with versions 1-3.)
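As a quick check (a sketch assuming an azureml.core Workspace object named ws), you can list the registered versions with the v1 SDK:

from azureml.core import Dataset

for version in (1, 2, 3):
    ds_version = Dataset.get_by_name(ws, name="Par_updates", version=version)
    print(version, ds_version.id)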

The third version is the latest altered version of your data, and you can access it using the code below.

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Fetch version 3 of the registered data asset
data_asset = ml_client.data.get("Par_updates", version="3")

# Load the underlying table and convert it to pandas
tbl = mltable.load(f'azureml:/{data_asset.id}')

df = tbl.to_pandas_dataframe()
df

(Screenshot: the dataframe showing the updated values.)
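If you prefer not to hard-code the version number, ml_client.data.get also accepts a label; to my knowledge, "latest" resolves to the newest registered version:

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Resolve the newest version instead of pinning version="3"
data_asset = ml_client.data.get("Par_updates", label="latest")
print(data_asset.version)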