Here's my situation:
I have a storage account in Azure that contains my tables from Dynamics 365 F&O, and I have a JSON file with the column names and types. This is the 'header' file, and I have another CSV file (there can be one or more CSV files for the same table) with the data.
So, I need to combine these two for each table and then load the result into my Fabric Lakehouse. So far, I'm trying to do this using this code:
import json
import os

def get_cdm_files(directory_path):
    cdm_files = []
    for root, dirs, files in os.walk(directory_path):
        for file in files:
            if file.endswith('.cdm.json'):
                cdm_files.append(os.path.join(root, file))
    return cdm_files

def load_table_cdm_file(cdm_file_path):
    with open(cdm_file_path.replace("abfss://[email protected]/", "/dbfs/mnt/dynamics/")) as f:
        cdm_json = json.load(f)
    cols = []
    for item in cdm_json['definitions'][0]['hasAttributes']:
        cols.append(item["name"])
    return spark.read.csv(cdm_file_path.replace("cdm.json", ".csv"), header=False, inferSchema=True)

def load_all_tables(cdm_files):
    tables = {}
    for cdm_file in cdm_files:
        table_name = cdm_file.split("/")[-1].replace(".cdm.json", "").lower()
        tables[table_name] = load_table_cdm_file(cdm_file)
    return tables

def write_table_delta(table_name, table_df):
    spark.sql(f"DROP TABLE IF EXISTS Lakehousename.Dynamics365_{table_name}")
    table_df.write.mode("overwrite").format("delta").saveAsTable(f"Dynamics365_{table_name}")

def main():
    cdm_files = get_cdm_files("abfss://[email protected]/domainname.operations.dynamics.com/Tables/")
    if "TABLENAME1.cdm.json" in cdm_files:
        cdm_files.remove("abfss://[email protected]/domainname.operations.dynamics.com/Tables/Custom/TABLENAME1.cdm.json")
    if "TABLENAME2.cdm.json" in cdm_files:
        cdm_files.remove("abfss://[email protected]/domainname.operations.dynamics.com/Tables/Custom/TABLENAME2.cdm.json")
    if "TABLE3.cdm.json" in cdm_files:
        cdm_files.remove("abfss://[email protected]/domainname.operations.dynamics.com/Tables/Custom/TABLE3.cdm.json")
    tables = load_all_tables(cdm_files)
    for table_name, table_df in tables.items():
        write_table_delta(table_name, table_df)
I tried looking for guides, but since this is quite new there isn't much to search for, and even AI couldn't help at all.
Alter each of your functions as below.
get_cdm_files will get the .cdm.json files.
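As a minimal sketch (the answer's original code block is not reproduced here), assuming you run this in a Fabric or Synapse notebook where mssparkutils is available, since os.walk cannot list an abfss:// path; in Databricks, dbutils.fs.ls can be used the same way:

from notebookutils import mssparkutils  # built into Fabric/Synapse notebooks

def get_cdm_files(directory_path):
    """Recursively collect every .cdm.json file under the given abfss:// folder."""
    cdm_files = []
    for item in mssparkutils.fs.ls(directory_path):
        if item.isDir:
            # Walk into sub-folders such as Tables/Custom/
            cdm_files.extend(get_cdm_files(item.path))
        elif item.name.endswith('.cdm.json'):
            cdm_files.append(item.path)
    return cdm_files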
Next, load_table_cdm_file, for reading the CSV files using the schema from the JSON files.
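A rough sketch of that idea follows; the CDM_TO_SPARK type mapping and the csv_path layout (data files in a folder named after the table, next to the header) are assumptions you may need to adjust for your export. The function keeps the same name and signature, so load_all_tables is untouched.

import json
from pyspark.sql.types import (StructType, StructField, StringType, IntegerType,
                               LongType, DoubleType, DecimalType, BooleanType,
                               TimestampType, DateType)

# Assumed mapping from CDM dataFormat values to Spark types; extend as needed.
CDM_TO_SPARK = {
    "string": StringType(),
    "int32": IntegerType(),
    "int64": LongType(),
    "double": DoubleType(),
    "decimal": DecimalType(38, 6),
    "boolean": BooleanType(),
    "datetime": TimestampType(),
    "date": DateType(),
    "guid": StringType(),
}

def load_table_cdm_file(cdm_file_path):
    """Build a schema from the .cdm.json header and read the table's CSV data with it."""
    # Read the header through Spark so the abfss:// path works without mounting.
    lines = spark.read.text(cdm_file_path).collect()
    cdm_json = json.loads("\n".join(row.value for row in lines))

    fields = []
    for item in cdm_json['definitions'][0]['hasAttributes']:
        spark_type = CDM_TO_SPARK.get(str(item.get('dataFormat', '')).lower(), StringType())
        fields.append(StructField(item['name'], spark_type, True))
    schema = StructType(fields)

    # Assumption: the CSV partitions live in a folder named after the table,
    # next to the header file; change this line to match your folder layout.
    csv_path = cdm_file_path.replace(".cdm.json", "/*.csv")
    return spark.read.csv(csv_path, header=False, schema=schema)

With a schema built from the header, the CSV columns get proper names and types instead of _c0, _c1 with inferred types.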
There is no change in load_all_tables; keep it as it is. Now, for writing the table to the lakehouse: if you are using a notebook in the Lakehouse itself, the write_table_delta function works fine.
Or, if you are using a notebook in Databricks, use the code below to write. Before running this code, make sure to check Enable credential passthrough for user-level data access under the Advanced options.
Copy the abfss path to the lakehouse table. Go to the properties of the table and copy the path; it is something similar to the one below:
abfss://<kjfneldqw>@msit-onelake.dfs.fabric.microsoft.com/<6382ey398e>/Tables
Then change write_table_delta to write to that path.
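A minimal sketch of that version of write_table_delta, assuming abfss_path holds the Tables path you copied above (the <...> segments are placeholders for your own workspace and lakehouse):

# Path copied from the lakehouse table properties; the <...> parts are placeholders.
abfss_path = "abfss://<...>@msit-onelake.dfs.fabric.microsoft.com/<...>/Tables"

def write_table_delta(table_name, table_df):
    """Write the DataFrame as a Delta table directly into the lakehouse Tables folder."""
    (table_df.write
        .mode("overwrite")
        .format("delta")
        .save(f"{abfss_path}/Dynamics365_{table_name}"))

Delta folders written under Tables are picked up by the lakehouse and show up as tables.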
Now run your main code.
Output: [screenshot], and in the lakehouse: [screenshot].
Again, you can read this table back using spark.read.format("delta").load("abfss_path"), providing the lakehouse abfss table path.