Assistance Refactoring Pyspark Replace Code for Synapse Notebook (working with JSON)

53 Views Asked by Jess P At 15 February 2024 at 08:11

Could anyone help with advice on how to refactor this code to make it faster? We are ingesting JSON and transforming in preparation for a later step but this particular section of the Notebook is taking a lot longer than expected (and we're new to python). Code is below, thanks!

    import pandas as pd

    df_main = pd.DataFrame(table_data)
    df_parameters = pd.DataFrame(parameter_data)

    # Convert df_parameters to dictionary for faster lookups
    parameter_dict = dict(zip(df_parameters['abc'], df_parameters['def']))

    def replace_parameters(row):
        if isinstance(row['expression'], str) and row['xyz'] == row['xyz']:
            for parameter_name, value in parameter_dict.items():
                if row['expression']:
                    row['expression'] = row['expression'].replace(f'|{parameter_name}|', str(value))
                if row['sp']:
                    row['sp'] = row['sp'].replace(f'|{parameter_name}|', str(value))
        return row

    # Apply the replacement function only when xyz is equal to xyz
    df_main = df_main.apply(replace_parameters, axis=1)

Original Q&A

There are 1 best solutions below

JayashankarGS On 15 February 2024 at 10:36 BEST ANSWER

Instead of using a for loop, use pattern matching. I believe this kind of expression you are using; if not, adjust the pattern accordingly.

Expressions I used:

|param1| + |param2|
|param2| - |param3|
|param1| * |param3|
|param2| / |param3|

You can use the code below.

Data I used:

import pandas as pd
import re

table_data = {
    'expression': ['|param1| + |param2|', '|param2| - |param3|'],
    'sp': ['|param1| * |param3|', '|param2| / |param3|'],
    'xyz': [1, 2]  # Sample values for 'xyz'
}

parameter_data = {
    'abc': ['param1', 'param2', 'param3'],
    'def': [10, 20, 30]  # Sample values for parameters
}

df_main = pd.DataFrame(table_data)
df_parameters = pd.DataFrame(parameter_data)

Output:

enter image description here

Modify your function like below:

import re
parameter_dict = dict(zip(parameter_data['abc'], parameter_data['def']))

tmp = [ "\|" +i+ "\|" for i in parameter_data['abc']]
param_pattern = '|'.join(tmp)

# Define a function to perform replacements
def replace_parameters(row):
    if isinstance(row['expression'], str) and row['xyz'] == row['xyz']:
        row['expression'] = re.sub(param_pattern,lambda match: str(parameter_dict[match.group(0).strip('|')]), row['expression'])
        row['sp'] = re.sub(param_pattern, lambda match: str(parameter_dict[match.group(0).strip('|')]), row['sp'])
    return row

df_main = df_main.apply(replace_parameters, axis=1)

df_main

Output:

enter image description here

You can see the parameters are replaced. Once again, I am saying, according to your expressions, you create the pattern. For the above kind of expressions, below is the pattern I created: '\\|param1\\||\\|param2\\||\\|param3\\|'.

Assistance Refactoring Pyspark Replace Code for Synapse Notebook (working with JSON)

There are 1 best solutions below

Related Questions in PYTHON

Related Questions in JSON

Related Questions in REPLACE

Related Questions in AZURE-SYNAPSE

Related Questions in SPARK-NOTEBOOK

Trending Questions

Popular # Hahtags

Popular Questions