Add row, copying previous row values, creating new value only for one column, assigning dummy only to new row

12 Views Asked by At

I have a huge data set where I have the columns shown below (and other columns). 5mln rows and 11 columns.

SYM Date Hour Minute Second
A 01/10/2020 6 10 1
A 01/10/2020 6 15 22
A 01/10/2020 7 4 23
A 02/10/2020 7 23 23
B 01/10/2020 6 22 45
B 01/10/2020 12 56 47
B 02/10/2020 16 43 2
C 01/10/2020 6 31 23
C 02/10/2020 6 41 24
C 02/10/2020 7 9 33
D 01/10/2020 6 2 22
D 02/10/2020 6 4 12
D 03/10/2020 7 54 12
D 03/10/2020 8 53 12
D 03/10/2020 8 55 12
D 03/10/2020 9 4 12
D 03/10/2020 9 7 24

I have also anoterh dataset which instead looks like this:

Date Hour Minute Second
01/10/2020 6 10 1
01/10/2020 6 22 45
02/10/2020 16 43 2
02/10/2020 6 41 24
02/10/2020 7 9 33
03/10/2020 7 54 12
03/10/2020 8 53 12
03/10/2020 8 55 12
04/10/2020 9 4 12
04/10/2020 9 7 24

lets call it df1 and df2, respectively.

i want df1 to look at df2, and put a dummy = 1 where the date,hour,minute,second are matching. However, in df2 (which is the reference) there are certain references that are not present in df1. Therefore, I would like df1 to insert a row right after the date,h,m,s nearest to the one shown in df2, copying all the values, but creating a d,h,m,s equal to the one shown in df2 and give a dummy = 1 to that row only.

Example: df2 has 04:14:08 and df1 has either 04:14:05 or 04:14:09. If this happens: create a new row with time = 04:14:08 and give it all the values present in 04:14:05, and give dummy = 1 only to new row 04:14:08.

How could I do this?

I tried the following:

df1<- dplyr::inner_join( df1
                           , df2
                           , by = c( "date") # only use date as a key
                           , suffix = c("", "_b")
) %>% 
  mutate( 
    datetime_a = ymd_hms( paste0( date, " "
                                  , sprintf("%02d", hour), ":"
                                  , sprintf("%02d", minute), ":"
                                  , sprintf("%02d", second)
    ))
    , datetime_b = ymd_hms( paste0( date, " "
                                    , sprintf("%02d", hour_b), ":"
                                    , sprintf("%02d", minute_b), ":"
                                    , sprintf("%02d", second_b)
    ) )
    , diff_seconds = abs(as.integer(difftime(datetime_a, datetime_b, units = "secs")))
    
  ) %>% 
  # Remove rows where the difference between datetime_a and datetime_b is
  # greater than 3 seconds:
  filter(diff_seconds <= 3 ) %>%
  # Then add the dummy variable:
  mutate(dummy = 1) %>% 
  # Remove any unnecessary columns:
  select(date, sym, hour, minute, second,bidPrice.x,offerPrice.x,mid.x,lag.x,lagsym.x,ret.x, dummy) %>% 
  # Add the dummy column to A using a left_join:
  dplyr::left_join(swd, ., by = c("date", "sym", "hour", "minute", "second")) %>% 
  # Rows without any matches have dummy = NA. Recode:
  mutate(dummy = ifelse( is.na(dummy), yes = 0, no = dummy))

this is to make an approximation, that is, looking at a range of +/- 3 seconds. However, this gives multiple dummy = 1 which is not good for me.

0

There are 0 best solutions below