Assistance Refactoring Pyspark Replace Code for Synapse Notebook (working with JSON)

53 Views Asked by At

Could anyone help with advice on how to refactor this code to make it faster? We are ingesting JSON and transforming in preparation for a later step but this particular section of the Notebook is taking a lot longer than expected (and we're new to python). Code is below, thanks!

    import pandas as pd

    df_main = pd.DataFrame(table_data)
    df_parameters = pd.DataFrame(parameter_data)

    # Convert df_parameters to dictionary for faster lookups
    parameter_dict = dict(zip(df_parameters['abc'], df_parameters['def']))

    def replace_parameters(row):
        if isinstance(row['expression'], str) and row['xyz'] == row['xyz']:
            for parameter_name, value in parameter_dict.items():
                if row['expression']:
                    row['expression'] = row['expression'].replace(f'|{parameter_name}|', str(value))
                if row['sp']:
                    row['sp'] = row['sp'].replace(f'|{parameter_name}|', str(value))
        return row

    # Apply the replacement function only when xyz is equal to xyz
    df_main = df_main.apply(replace_parameters, axis=1)
1

There are 1 best solutions below

1
JayashankarGS On BEST ANSWER

Instead of using a for loop, use pattern matching. I believe this kind of expression you are using; if not, adjust the pattern accordingly.

Expressions I used:

  1. |param1| + |param2|
  2. |param2| - |param3|
  3. |param1| * |param3|
  4. |param2| / |param3|

You can use the code below.

Data I used:

import pandas as pd
import re

table_data = {
    'expression': ['|param1| + |param2|', '|param2| - |param3|'],
    'sp': ['|param1| * |param3|', '|param2| / |param3|'],
    'xyz': [1, 2]  # Sample values for 'xyz'
}

parameter_data = {
    'abc': ['param1', 'param2', 'param3'],
    'def': [10, 20, 30]  # Sample values for parameters
}

df_main = pd.DataFrame(table_data)
df_parameters = pd.DataFrame(parameter_data)

Output:

enter image description here

Modify your function like below:

import re
parameter_dict = dict(zip(parameter_data['abc'], parameter_data['def']))

tmp = [ "\|" +i+ "\|" for i in parameter_data['abc']]
param_pattern = '|'.join(tmp)

# Define a function to perform replacements
def replace_parameters(row):
    if isinstance(row['expression'], str) and row['xyz'] == row['xyz']:
        row['expression'] = re.sub(param_pattern,lambda match: str(parameter_dict[match.group(0).strip('|')]), row['expression'])
        row['sp'] = re.sub(param_pattern, lambda match: str(parameter_dict[match.group(0).strip('|')]), row['sp'])
    return row

df_main = df_main.apply(replace_parameters, axis=1)

df_main

Output:

enter image description here

You can see the parameters are replaced. Once again, I am saying, according to your expressions, you create the pattern. For the above kind of expressions, below is the pattern I created: '\\|param1\\||\\|param2\\||\\|param3\\|'.