The Pandas-on-Spark 'apply' returns incorrect results


To solve difference equations of the form below with pandas-on-Spark, I created a class named "DiffEqns".

x[k+1] = v[k] * y[k] + x[k]

y[k+1] = 0.01 * z[k] + y[k]

where k ranges from 0 to N-1, x[0] = 0.0, y[0] = 1.0
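The recurrence is inherently sequential — each step depends on the previous one — so as a reference I compute it with a plain Python loop over the sample v and z values used in the example further down (this loop is only a sanity check, not the Spark code):

```python
# Sequential reference for the recurrence, using the sample data below.
v = [1, 1, 2, 2, 2, 1, 1, 2, 2, 2]
z = [10, 20, 10, 20, 30, 10, 20, 10, 20, 30]

x, y = 0.0, 1.0          # x[0] = 0.0, y[0] = 1.0
xs, ys = [], []
for vk, zk in zip(v, z):
    x = vk * y + x       # x[k+1] = v[k] * y[k] + x[k]  (uses y[k], before updating y)
    y = 0.01 * zk + y    # y[k+1] = 0.01 * z[k] + y[k]
    xs.append(x)
    ys.append(y)

print(xs[-1], ys[-1])
```

Note that x must be updated before y within each step, since x[k+1] depends on y[k], not y[k+1].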

class DiffEqns:
    def __init__(self):
        self.x = 0.0
        self.y = 1.0
        self.prev_second = None

    def calculate_values(self, row):
        # Calculate x and y
        self.x = row["v"] * self.y + self.x
        row["x"] = self.x
        self.y = 0.01 * row["z"] + self.y
        row["y"] = self.y
        
        print(self.x, self.y)
        print(row["x"], row["y"])
        
        return row

I wrote the below code on Databricks, and that gave me good results:

import pyspark.pandas as ps
import pandas as pd

# Sample input data
data = [(0.0, 1, 10),(0.5, 1, 20),(1.0, 2, 10),(1.3, 2, 20),(1.6, 2, 30),
        (2.0, 1, 10),(2.5, 1, 20),(3.0, 2, 10),(3.3, 2, 20),(3.6, 2, 30),
        ]

# Create a pyspark.pandas DataFrame with input columns t, v, and z
df = ps.DataFrame(data, columns=["t", "v", "z"])
        
# Initialize the calculator
de_obj = DiffEqns()

# Apply the calculate_values function to the DataFrame
df = df.apply(de_obj.calculate_values, axis=1)

print(df)

However, when I ran this on a new data set at least 10,000 times larger than the sample above (for an industrial application), the output was completely wrong.

For the sake of debugging, I added the print statements shown above. These statements always print the correct result at every computational stage.

However, when "df" is printed after the computation, it contains garbage values or stale results from previous computations. My speculation is that returning "row" from the function is responsible for this issue.
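For comparison, running the same class through plain pandas (which applies the function to rows one at a time, in order) gives the results I expect. This is a sanity check assuming a recent pandas version, where `apply` invokes the function exactly once per row:

```python
import pandas as pd

class DiffEqns:
    def __init__(self):
        self.x = 0.0
        self.y = 1.0

    def calculate_values(self, row):
        # Update x first (it depends on the previous y), then y.
        self.x = row["v"] * self.y + self.x
        row["x"] = self.x
        self.y = 0.01 * row["z"] + self.y
        row["y"] = self.y
        return row

data = [(0.0, 1, 10), (0.5, 1, 20), (1.0, 2, 10), (1.3, 2, 20), (1.6, 2, 30),
        (2.0, 1, 10), (2.5, 1, 20), (3.0, 2, 10), (3.3, 2, 20), (3.6, 2, 30)]
pdf = pd.DataFrame(data, columns=["t", "v", "z"])

# Plain pandas evaluates rows sequentially on one machine,
# so the shared state in DiffEqns accumulates in row order.
pdf = pdf.apply(DiffEqns().calculate_values, axis=1)
print(pdf[["x", "y"]])
```

This works locally precisely because there is a single process iterating rows in order; the same stateful object cannot be shared that way across Spark partitions.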

Could anyone please help me understand the pitfall in my implementation that is causing the problem?
