I want to be able to consecutively go through a table using the value calculated in the previous row in the current row. It seems a window function could do this.

from pyspark.sql import SparkSession
from pyspark.sql import Window
import pyspark.sql.functions as F

# Create a sample DataFrame
data = [
    (1, 1, 3),
    (2, 2, 4),
    (3, 3, 5),
    (4, 4, 6),
    (5, 8, 10)
]

columns = ["id", "start", "stop"]

spark = SparkSession.builder.master("local").appName("SelfJoin").getOrCreate()

df = spark.createDataFrame(data, columns).sort('start')

# 1. Sort the DataFrame by 'start'
df = df.sort("start").withColumn('prev_start', col('start'))

# 2. Initialize a window that looks back one record
window = Window.orderBy(['start']).rowsBetween(-1, -1)

df = (
    df
    .withColumn("prev_start", F.lag("prev_start", 1).over(window))
)

df.toPandas().style.hide_index()

At row one

id start stop prev_start
1 1 3 NaN

At row two

Extra: I need to specify that if value in previous row is missing, then use the value in the current row: 1.0

id start stop prev_start
1 1 3 NaN
2 2 4 1.0

At Row 3:

id start stop prev_start
1 1 3 NaN
2 2 4 1.0
3 3 5 2.0

Look back, prev_start assigned to previous row value of 1

id start stop prev_start
1 1 3 NaN
2 2 4 1.0
3 3 5 1.0

At Row 4

id start stop prev_start
1 1 3 NaN
2 2 4 1.0
3 3 5 1.0
4 4 6 3.0

Assign value of prev_start to what is now 1, not what it was originally: 2

id start stop prev_start
1 1 3 NaN
2 2 4 1.0
3 3 5 1.0
4 4 6 1.0

so now, the actually application, I also do this for prev_end = max(prev_end, end)

  • as soon as this prev_end is less than the current row start (and the data is ordered correctly) I now have a non-contiguous time frame (start to stop), that will be from prev_start to prev_end

Result

id start stop prev_start
1 1 3 NaN
2 2 4 1.0
3 3 5 2.0
4 4 6 3.0
5 8 10 4.0

But I want

Result

id start stop prev_start
1 1 3 NaN
2 2 4 1.0
3 3 5 1.0
4 4 6 1.0
5 8 10 1.0

Because in this case, the values are calculated row by row, so once the value is set to 1, all prev_start values would consecutively be made 1

0

There are 0 best solutions below