The Pandas-on-Spark 'apply' returns incorrect results


To solve difference equations of the form below with pandas-on-Spark, I created a class named "DiffEqns".

x[k+1] = v[k] * y[k] + x[k]

y[k+1] = 0.01 * z[k] + y[k]

where k ranges from 0 to N-1, x[0] = 0.0, y[0] = 1.0
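The recurrence is inherently sequential — each step depends on the previous one — so as a reference I compute it with a plain Python loop over the sample v and z values used in the example further down (this loop is only a sanity check, not the Spark code):

```python
# Sequential reference for the recurrence, using the sample data below.
v = [1, 1, 2, 2, 2, 1, 1, 2, 2, 2]
z = [10, 20, 10, 20, 30, 10, 20, 10, 20, 30]

x, y = 0.0, 1.0          # x[0] = 0.0, y[0] = 1.0
xs, ys = [], []
for vk, zk in zip(v, z):
    x = vk * y + x       # x[k+1] = v[k] * y[k] + x[k]  (uses y[k], before updating y)
    y = 0.01 * zk + y    # y[k+1] = 0.01 * z[k] + y[k]
    xs.append(x)
    ys.append(y)

print(xs[-1], ys[-1])
```

Note that x must be updated before y within each step, since x[k+1] depends on y[k], not y[k+1].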

class DiffEqns:
    def __init__(self):
        self.x = 0.0
        self.y = 1.0
        self.prev_second = None

    def calculate_values(self, row):
        # Calculate x and y
        self.x = row["v"] * self.y + self.x
        row["x"] = self.x
        self.y = 0.01 * row["z"] + self.y
        row["y"] = self.y
        
        print(self.x, self.y)
        print(row["x"], row["y"])
        
        return row

I wrote the below code on Databricks, and that gave me good results:

import pyspark.pandas as ps
import pandas as pd

# Sample input data
data = [(0.0, 1, 10),(0.5, 1, 20),(1.0, 2, 10),(1.3, 2, 20),(1.6, 2, 30),
        (2.0, 1, 10),(2.5, 1, 20),(3.0, 2, 10),(3.3, 2, 20),(3.6, 2, 30),
        ]

# Create a pyspark.pandas DataFrame with input columns t, v, and z
df = ps.DataFrame(data, columns=["t", "v", "z"])
        
# Initialize the calculator
de_obj = DiffEqns()

# Apply the calculate_values function to the DataFrame
df = df.apply(de_obj.calculate_values, axis=1)

print(df)

However, when I ran this on a new data set at least 10,000 times larger than the sample above (for an industrial application), the output was completely wrong.

For the sake of debugging, I added the print statements shown above. These statements always print the correct result at every computational stage.

However, when "df" is printed after the computation, it contains garbage values or stale results from previous computations. My speculation is that returning "row" from the function is responsible for this issue.
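For comparison, running the same class through plain pandas (which applies the function to rows one at a time, in order) gives the results I expect. This is a sanity check assuming a recent pandas version, where `apply` invokes the function exactly once per row:

```python
import pandas as pd

class DiffEqns:
    def __init__(self):
        self.x = 0.0
        self.y = 1.0

    def calculate_values(self, row):
        # Update x first (it depends on the previous y), then y.
        self.x = row["v"] * self.y + self.x
        row["x"] = self.x
        self.y = 0.01 * row["z"] + self.y
        row["y"] = self.y
        return row

data = [(0.0, 1, 10), (0.5, 1, 20), (1.0, 2, 10), (1.3, 2, 20), (1.6, 2, 30),
        (2.0, 1, 10), (2.5, 1, 20), (3.0, 2, 10), (3.3, 2, 20), (3.6, 2, 30)]
pdf = pd.DataFrame(data, columns=["t", "v", "z"])

# Plain pandas evaluates rows sequentially on one machine,
# so the shared state in DiffEqns accumulates in row order.
pdf = pdf.apply(DiffEqns().calculate_values, axis=1)
print(pdf[["x", "y"]])
```

This works locally precisely because there is a single process iterating rows in order; the same stateful object cannot be shared that way across Spark partitions.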

Could anyone please help me understand the pitfall in my implementation that is causing the problem?
