Read "Integrated Dataset" into a Dataframe with Azure Synapse Notebook


I know how to read a file in Azure Data Lake Storage, but I don't know how to read a file that is listed under the Integration Dataset section in Azure.


There are 2 best solutions below

DileeprajnarayanThumula On

An integration dataset acts as a blueprint for your data, helping you specify its structure and whereabouts in a data store. It is a guide that tells your data pipelines and data flows how to interact with your stored information.

For example, suppose you have a CSV file with customer data in Azure Blob Storage. You create an integration dataset to define details like the file's structure (such as column names and data types) and its location in Blob Storage. Later on, you can use this dataset within Synapse pipeline activities or data flows to read from or write to the CSV file.

An integration dataset does the heavy lifting of telling your systems "what" your data looks like and "where" to find it in a data store.
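As a rough sketch of the same idea in notebook code (the customers.csv file, container name, data lake name, and schema here are hypothetical, not from the question), you can spell out the "what" and the "where" yourself with an explicit schema and a path:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# The "what": column names and types, which an integration dataset would capture
customer_schema = StructType([
    StructField("Id", IntegerType(), True),
    StructField("Name", StringType(), True),
])

# The "where": the file's location in the data store (hypothetical names)
customer_path = "abfss://data@mylake.dfs.core.windows.net/customers.csv"

# Reading with an explicit schema mirrors what the dataset definition provides
customers_df = spark.read.csv(customer_path, schema=customer_schema, header=True)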

If you want to read the data from your Spark notebooks:

I have tried the below approach:

# Read the delimited file straight from the data lake into a dataframe
df = spark.read.load("abfss://folder02@synpgen.dfs.core.windows.net/jan19.csv", format='csv', header=True)
df.show(10)

Results:

+---+------+--------+
| Id| Name |   Movie|
+---+------+--------+
|  1|Thomas|   peaky|
|  2|  Tony|Sopranos|
+---+------+--------+
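Note that header=True alone reads every column as a string; if you also want Spark to guess the column types, one hedged variant of the same read is:

# inferSchema makes Spark scan the file and guess column types (e.g. Id as int)
df = spark.read.load("abfss://folder02@synpgen.dfs.core.windows.net/jan19.csv",
                     format='csv', header=True, inferSchema=True)
df.printSchema()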


I agree with lcolbert: in the connection blade, examine the values for "File path," specifically the components within the brackets, such as [container], [directory], and [file name]. This file path contains the information that Spark requires to read the data.

The file path directly pointing to the delimited file in the data lake is:

abfss://folder02@synpgen.dfs.core.windows.net/jan19.csv

Here:

  • synpgen is the name of the data lake
  • folder02 is the specific container within the data lake
  • jan19.csv is the file located within that container
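As a small sketch, the same path can be assembled from those components in the notebook, which makes it easy to substitute the values from your own dataset's connection blade:

# Components read off the integration dataset's connection blade
account = "synpgen"      # data lake (storage account) name
container = "folder02"   # container within the data lake
file_name = "jan19.csv"  # file within that container

path = f"abfss://{container}@{account}.dfs.core.windows.net/{file_name}"
df = spark.read.csv(path, header=True)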

lcolbert On

The integration dataset itself does not hold CSV data. The delimited text dataset SFTP_Five9_CALL_SEGMENT most likely points to a file in your data lake. In the connection blade, look at the File path's [container]/[directory]/[file name] values. That file path is the information Spark needs to read the data.

In a notebook you can build your file path using abfss (Azure Blob File System Secure). To read this data into a dataframe, set the file path and read the data in using spark.read.csv.

Five9_FILEPATH = 'abfss://<container>@<datalake_name>.dfs.core.windows.net/<rest_of_file_path>'

Then create the dataframe using spark.read.csv()

df = spark.read.csv(Five9_FILEPATH, header=True)

Make sure you format the abfss path properly:

The container name goes to the left of the @ symbol. Do not include it again to the right of the @ symbol; only include the directories you need to descend through after the container.

abfss://<container>@<datalake_name>.dfs.core.windows.net/< directories within container to the file >
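As a quick illustration of that rule (reusing the synpgen/folder02 names from the example below):

# Correct: container to the left of the @, data lake name to the right
path = "abfss://folder02@synpgen.dfs.core.windows.net/jan19.csv"

# Wrong: repeating the container after the @ makes Spark look for a
# directory named folder02 inside the container, which does not exist
wrong_path = "abfss://folder02@synpgen.dfs.core.windows.net/folder02/jan19.csv"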

In DileeprajnarayanThumula's example, they are using a file path leading directly to the delimited file in the data lake:

abfss://folder02@synpgen.dfs.core.windows.net/jan19.csv

  • synpgen is the data lake name
  • folder02 is the container within the data lake
  • jan19.csv is the file inside of that container.

You could also have a folder within the container, say one for the month of January (jan). The path would then be 'abfss://folder02@synpgen.dfs.core.windows.net/jan/jan19.csv'

You could read all the files in a given directory into one dataframe:

jan_filepath = 'abfss://folder02@synpgen.dfs.core.windows.net/jan/*'
# The wildcard reads every file under the jan directory into one dataframe
jan_df = spark.read.csv(jan_filepath, header=True)
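If you are unsure which files actually sit under that directory before reading them, Synapse notebooks also ship with mssparkutils, which can list a path (a small sketch, assuming the same jan folder):

from notebookutils import mssparkutils

# List everything under the jan directory before reading it into one dataframe
for f in mssparkutils.fs.ls('abfss://folder02@synpgen.dfs.core.windows.net/jan/'):
    print(f.name, f.size)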