How can I access both columns' values from a rolling window of a pandas DataFrame?


My current goal is to find the distance between two points based on a latitude and longitude system, in order to track the trajectory of a flight. I have a pandas dataframe that contains changing latitude and longitude values. In order to find the distance between these points, I use the haversine distance function that takes these values as input in order to find the distance in kilometers.

I first tried to implement a for loop that iterates over the length of the flight and calculates the distance similar to the code below:

    for i in range(len(df) - 1):
        row1 = df.iloc[i]
        row2 = df.iloc[i + 1]
        result = haversine_distance(row1, row2)

However, the dataset is very large, and because this loop was too slow I moved to a different strategy.

I then tried to implement a rolling window using the df.rolling function in pandas, along with a .apply with a lambda function as below:

df['DISTANCE'] = df[['Latitude', 'Longitude']].rolling(window=2).apply(lambda x: haversine_distance(x), raw = True)

My understanding of what happens here is that a 2D array (from raw = True) is passed into the haversine function, holding the 4 latitude and longitude values from the window.

However, the function receives a 1D array with the 2 values from one column, rather than a 2D array with the 4 values from both columns. What I mean by this is:

df = pd.DataFrame({'Latitude': [40.7128, 37.7749, 34.0522],
                   'Longitude': [-74.0060, -122.4194, -118.2437]})

If the dataframe is as shown above, I would expect to get the array [[40.7128, -74.0060], [37.7749, -122.4194]].

How can I fix my code or go about it differently in order to get these values? Attached below is the haversine function:

def haversine_distance(ndarray):
    lat1, lat2 = ndarray[0][0], ndarray[0][1]
    lon1, lon2 = ndarray[1][0], ndarray[1][1]

    # Convert latitude and longitude from degrees to radians
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])

    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6371 * c
    return km

and this is the desired output:

df = pd.DataFrame({'Latitude': [40.7128, 37.7749, 34.0522],
                   'Longitude': [-74.0060, -122.4194, -118.2437],
                   'DISTANCE': [0, 4129.0861, 559.1205]})

There are 2 best solutions below

mozway

You need to vectorize your haversine function, then craft an array with 4 columns in the correct order (with shift+concat+to_numpy) and pass this to the function:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Latitude': [40.7128, 37.7749, 34.0522],
                   'Longitude': [-74.0060, -122.4194, -118.2437]})

def haversine_distance(ndarray):
    # get the coordinates as 4 vectors
    # Convert latitude and longitude from degrees to radians
    lat1, lat2, lon1, lon2 = np.radians(ndarray.T)

    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6371 * c
    return km

a = (pd.concat([df[['Latitude', 'Longitude']],
                df[['Latitude', 'Longitude']].shift()
               ], axis=1)
       .iloc[:, [0,2,1,3]].to_numpy()
    )

df['DISTANCE'] = haversine_distance(a)

Output:

   Latitude  Longitude     DISTANCE
0   40.7128   -74.0060          NaN
1   37.7749  -122.4194  4129.086165
2   34.0522  -118.2437   559.120577

NB. Instead of reordering the columns with .iloc[:, [0,2,1,3]], you could also unpack in the concatenated order inside the function: lat1, lon1, lat2, lon2 = np.radians(ndarray.T).

Intermediate a:

#            lat1        lat2      lon1       lon2
array([[  40.7128,       nan,  -74.006 ,       nan],
       [  37.7749,   40.7128, -122.4194,  -74.006 ],
       [  34.0522,   37.7749, -118.2437, -122.4194]])
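
The NaN values in the first row of the intermediate array come from shift(): the first point has no previous point, so its DISTANCE is NaN. The desired output in the question shows 0 there instead. A minimal sketch of one way to get that, assuming a distance of 0 is the intended value for the first point; the variable names prev and km are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Latitude': [40.7128, 37.7749, 34.0522],
                   'Longitude': [-74.0060, -122.4194, -118.2437]})

# Previous row's coordinates; the first row has no predecessor, hence NaN
prev = df[['Latitude', 'Longitude']].shift()

lat1, lon1 = np.radians(prev['Latitude']), np.radians(prev['Longitude'])
lat2, lon2 = np.radians(df['Latitude']), np.radians(df['Longitude'])

# Haversine formula, evaluated for all rows at once
a = (np.sin((lat2 - lat1) / 2.0) ** 2
     + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
km = 6371 * 2 * np.arcsin(np.sqrt(a))

# Assumption: the first point of a flight has travelled 0 km
df['DISTANCE'] = km.fillna(0)
print(df)
```

Note that fillna(0) would also hide NaN values caused by missing coordinates elsewhere in the data, so replacing only the first row may be safer on real flight data.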

Alternatively, a function that takes the df as input:

def haversine_distance_df(df, cols=('Latitude', 'Longitude')):
    cols = list(cols)  # tuple default avoids a shared mutable default argument
    # get the current and previous coordinates as 4 vectors
    # Convert latitude and longitude from degrees to radians
    lat1, lon1, lat2, lon2 = np.radians(
        pd.concat([df[cols], df[cols].shift()], axis=1)
          .to_numpy().T
    )
    
    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6371 * c
    return km

df['DISTANCE'] = haversine_distance_df(df)
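
As background on why the rolling approach from the question does not work as written: rolling(...).apply(...) appears to process each column separately, so the function never sees latitude and longitude together. A minimal sketch for checking this on your own pandas version; seen and record are hypothetical helper names, not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({'Latitude': [40.7128, 37.7749, 34.0522],
                   'Longitude': [-74.0060, -122.4194, -118.2437]})

seen = []  # hypothetical helper list collecting every window passed in

def record(window):
    # With raw=True each window arrives as a NumPy array
    seen.append(window.copy())
    return 0.0  # rolling.apply expects a single number back

df[['Latitude', 'Longitude']].rolling(window=2).apply(record, raw=True)

# Each recorded window holds values from one column only
for w in seen:
    print(w.shape, w)
```

Because every window is one-dimensional and comes from a single column, indexing it like a 2D array cannot return both coordinates, which is why the answer builds the 4-column array up front instead.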

Onyambu

You can directly use numpy:

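def haversine_distance(arr):
  # arr: DataFrame with latitude in the first column and longitude in the second
  arr = np.radians(arr).to_numpy()
  # row-to-row differences in latitude and longitude
  dlat, dlon = np.diff(arr, axis = 0).T
  a = np.sin(dlat / 2.0) ** 2 + \
      np.cos(arr[1:,0]) * np.cos(arr[:-1,0]) * np.sin(dlon / 2.0) ** 2
  # the first row has no previous point, so prepend NaN
  return np.r_[np.nan, 6371 * 2 * np.arcsin(np.sqrt(a))]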

df.assign(dist = haversine_distance(df))

Output:

   Latitude  Longitude         dist
0   40.7128   -74.0060          NaN
1   37.7749  -122.4194  4129.086165
2   34.0522  -118.2437   559.120577
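
When swapping a row-by-row loop for a vectorized version, it can be worth checking that both give the same distances. A minimal sketch of such a check, assuming the column names from the question; haversine_scalar and haversine_vectorized are illustrative names, and the vectorized function restates the diff-based idea so the block runs on its own:

```python
import math
import numpy as np
import pandas as pd

df = pd.DataFrame({'Latitude': [40.7128, 37.7749, 34.0522],
                   'Longitude': [-74.0060, -122.4194, -118.2437]})

def haversine_scalar(lat1, lon1, lat2, lon2):
    # Reference implementation for a single pair of points, result in km
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))

def haversine_vectorized(frame):
    # Distances between consecutive rows, computed for all rows at once
    arr = np.radians(frame[['Latitude', 'Longitude']].to_numpy())
    dlat, dlon = np.diff(arr, axis=0).T
    a = (np.sin(dlat / 2.0) ** 2
         + np.cos(arr[1:, 0]) * np.cos(arr[:-1, 0]) * np.sin(dlon / 2.0) ** 2)
    # First row has no previous point, so it gets NaN
    return np.r_[np.nan, 6371 * 2 * np.arcsin(np.sqrt(a))]

vec = haversine_vectorized(df)

# Row-by-row loop, equivalent to the first attempt in the question
coords = df[['Latitude', 'Longitude']].to_numpy()
loop = [haversine_scalar(*coords[i], *coords[i + 1])
        for i in range(len(coords) - 1)]

print(np.allclose(vec[1:], loop))
```

Selecting the two coordinate columns explicitly inside the function also keeps it working if the DataFrame later gains extra columns such as DISTANCE.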