I have a data asset in Azure Machine Learning that I want to convert into a PySpark dataframe. The Consume tab of the data asset gives me code to convert it into a pandas dataframe. However, this data is huge (1 TB+), so it will not fit into a pandas dataframe.
This is the code I am using:
import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get("data_asset_name", version="1")
tbl = mltable.load(data_asset.path)
df = tbl.to_pandas_dataframe()
df
The function to_pandas_dataframe() converts the MLTable into a pandas dataframe.
Is there any function or other way to convert it into a PySpark dataframe instead?
You can pass the mltable data asset directly to Spark jobs; there is no need to create a Spark DataFrame. Here is the sample code:
Here, I have given the input type as mltable and the path as the data asset ID. The output is written to the wrangled folder.
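A minimal sketch of such a job submission, using the spark job builder from azure-ai-ml. It assumes, as described above, that the job accepts an input of type mltable whose path is the data asset ID. The ./src folder, the wrangle.py entry script, the input and output names, the instance type, the runtime version, and the output datastore path are all placeholders chosen for illustration, not values from the original post.

```python
def job_args(input_name: str, output_name: str) -> str:
    """Bind the job's named input and output to command-line flags of the entry script."""
    inp = "--" + input_name + " ${{inputs." + input_name + "}}"
    out = "--" + output_name + " ${{outputs." + output_name + "}}"
    return inp + " " + out


def build_spark_job(data_asset_id: str):
    """Build a Spark job that takes the data asset as an mltable input."""
    # Imported here so the helper above stays usable without azure-ai-ml installed.
    from azure.ai.ml import Input, Output, spark

    return spark(
        code="./src",                    # placeholder: folder holding the script
        entry={"file": "wrangle.py"},    # placeholder: PySpark entry script
        driver_cores=1,
        driver_memory="2g",
        executor_cores=2,
        executor_memory="2g",
        executor_instances=2,
        # Placeholder sizing; giving resources without a named compute
        # targets serverless Spark compute.
        resources={"instance_type": "Standard_E8S_V3", "runtime_version": "3.3.0"},
        inputs={
            # Assumption from the answer: type "mltable", path = data asset ID.
            "input_data": Input(type="mltable", path=data_asset_id, mode="direct"),
        },
        outputs={
            # Placeholder datastore path ending in the "wrangled" folder.
            "wrangled_data": Output(
                type="uri_folder",
                path="azureml://datastores/workspaceblobstore/paths/data/wrangled/",
                mode="direct",
            ),
        },
        args=job_args("input_data", "wrangled_data"),
    )


# Usage, reusing ml_client and data_asset from the question:
# job = build_spark_job(data_asset.id)
# ml_client.jobs.create_or_update(job)
```

The entry script then receives the resolved input and output locations through the --input_data and --wrangled_data flags and does the actual Spark processing.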
If you still want to create a dataframe, the only way is to go through a pandas dataframe on serverless Spark compute.
Refer to this GitHub repository for more information.