The code block below is part of the best-trial notebook that Databricks AutoML auto-generates for a run.
import mlflow
import os
import uuid
import shutil
import pandas as pd
# Create temp directory to download input data from MLflow
input_temp_dir = os.path.join(os.environ["SPARK_LOCAL_DIRS"], "tmp", str(uuid.uuid4())[:8])
os.makedirs(input_temp_dir)
# Download the artifact and read it into a pandas DataFrame
input_data_path = mlflow.artifacts.download_artifacts(run_id="e2a4a93aafb24aa9956e83f6b7ab3e28", artifact_path="data", dst_path=input_temp_dir)
df_loaded = pd.read_parquet(os.path.join(input_data_path, "training_data"))
# Delete the temp data
shutil.rmtree(input_temp_dir)
# Preview data
df_loaded.head(5)
Can I get the run_id in the code block above, e2a4a93aafb24aa9956e83f6b7ab3e28, from the AutoMLSummary returned by automl.regress? If I use summary.best_trial.mlflow_run_id, I get a different value. So which run does this run_id belong to, and how do I get it programmatically?
Aside from the code block above, is there another way to get the dataset that is loaded into df_loaded? It is essentially the input dataset that I passed to automl.regress, except that it has an extra column indicating whether each row belongs to the training, validation, or test subset.
I am fairly new to Databricks AutoML, so I am not sure what the best way to do this is.
Thanks ahead of time.
As mentioned above, I tried getting the run_id from summary.best_trial.mlflow_run_id, but the values do not match. I have also read the documentation for automl and mlflow, but had no luck.
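
For reference, the comparison I tried looks roughly like the sketch below. `compare_run_ids` is a hypothetical helper written only for this post, and the `SimpleNamespace` object is a stand-in for the AutoMLSummary that automl.regress returns, with a placeholder best-trial ID, so the snippet runs outside Databricks. In the real notebook, `summary` would be the actual return value of automl.regress.

```python
from types import SimpleNamespace

# run_id hard-coded in the auto-generated best trial notebook (from the code above)
NOTEBOOK_RUN_ID = "e2a4a93aafb24aa9956e83f6b7ab3e28"


def compare_run_ids(summary, notebook_run_id):
    """Return the best trial's MLflow run ID and whether it equals notebook_run_id.

    `summary` is assumed to expose `best_trial.mlflow_run_id`, as the
    AutoMLSummary in my run does.
    """
    best_trial_run_id = summary.best_trial.mlflow_run_id
    return best_trial_run_id, best_trial_run_id == notebook_run_id


# Stand-in for the real AutoMLSummary; the ID here is a placeholder, not a real run.
fake_summary = SimpleNamespace(
    best_trial=SimpleNamespace(mlflow_run_id="best-trial-run-id-placeholder")
)

best_id, matches = compare_run_ids(fake_summary, NOTEBOOK_RUN_ID)
print(f"best trial run_id: {best_id}, matches notebook run_id: {matches}")
```

With the real summary object I get the same outcome as with the stand-in: the two IDs differ.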