Handling missing values in lag column of dataframe

32 Views Asked by At

I need to find difference between sucessive datetimes ... and split the file when I find a time difference of more than 5 minutes. When I subtract periods I get answers in milliseconds and there is also a missing value in the lagged column. If I skip the missing value, then I can't update the dataframe. Not sure how to proceed.

using CSV
using DataFrames
using ShiftedArrays: lag 
using Dates
df = CSV.read(IOBuffer("""
date
2024-01-25 19:15
2024-01-25 19:20
2024-01-25 19:25
2024-01-25 21:20
2024-01-25 21:25
"""),DataFrame; dateformat=DateFormat("yy-mm-dd H:M"))
condf=transform(df,[:date,:date] => ((x,y)->Dates.CompoundPeriod.(skipmissing(x-lag(y,1)))) =>:ldate)
1

There are 1 best solutions below

8
Bogumił Kamiński On

It is not fully clear what you want to do with the rows with missing value.

If you want to get rid of them then use the dropmissing function before your operation.

If you want to keep it it depends on what do you assume that the missing actually held. The easiest is to probably use Impute.locf or Imputel.nocb function from Impute.jl to fill missing with the value that will give you the desired split.

If what I write is not clear then please comment (best would be if you would show the input and desired output - including missing values in the example).

Some additional cases to consider:

  • what would you want if the data started/ended with missing values?
  • what would you want if there were several consecutive missing values?
  • is it guaranteed that the data frame is sorted by timestamp column?