I have a huge data set where I have the columns shown below (and other columns). 5mln rows and 11 columns.
| SYM | Date | Hour | Minute | Second |
|---|---|---|---|---|
| A | 01/10/2020 | 6 | 10 | 1 |
| A | 01/10/2020 | 6 | 15 | 22 |
| A | 01/10/2020 | 7 | 4 | 23 |
| A | 02/10/2020 | 7 | 23 | 23 |
| B | 01/10/2020 | 6 | 22 | 45 |
| B | 01/10/2020 | 12 | 56 | 47 |
| B | 02/10/2020 | 16 | 43 | 2 |
| C | 01/10/2020 | 6 | 31 | 23 |
| C | 02/10/2020 | 6 | 41 | 24 |
| C | 02/10/2020 | 7 | 9 | 33 |
| D | 01/10/2020 | 6 | 2 | 22 |
| D | 02/10/2020 | 6 | 4 | 12 |
| D | 03/10/2020 | 7 | 54 | 12 |
| D | 03/10/2020 | 8 | 53 | 12 |
| D | 03/10/2020 | 8 | 55 | 12 |
| D | 03/10/2020 | 9 | 4 | 12 |
| D | 03/10/2020 | 9 | 7 | 24 |
I have also anoterh dataset which instead looks like this:
| Date | Hour | Minute | Second |
|---|---|---|---|
| 01/10/2020 | 6 | 10 | 1 |
| 01/10/2020 | 6 | 22 | 45 |
| 02/10/2020 | 16 | 43 | 2 |
| 02/10/2020 | 6 | 41 | 24 |
| 02/10/2020 | 7 | 9 | 33 |
| 03/10/2020 | 7 | 54 | 12 |
| 03/10/2020 | 8 | 53 | 12 |
| 03/10/2020 | 8 | 55 | 12 |
| 04/10/2020 | 9 | 4 | 12 |
| 04/10/2020 | 9 | 7 | 24 |
lets call it df1 and df2, respectively.
i want df1 to look at df2, and put a dummy = 1 where the date,hour,minute,second are matching. However, in df2 (which is the reference) there are certain references that are not present in df1. Therefore, I would like df1 to insert a row right after the date,h,m,s nearest to the one shown in df2, copying all the values, but creating a d,h,m,s equal to the one shown in df2 and give a dummy = 1 to that row only.
Example: df2 has 04:14:08 and df1 has either 04:14:05 or 04:14:09. If this happens: create a new row with time = 04:14:08 and give it all the values present in 04:14:05, and give dummy = 1 only to new row 04:14:08.
How could I do this?
I tried the following:
df1<- dplyr::inner_join( df1
, df2
, by = c( "date") # only use date as a key
, suffix = c("", "_b")
) %>%
mutate(
datetime_a = ymd_hms( paste0( date, " "
, sprintf("%02d", hour), ":"
, sprintf("%02d", minute), ":"
, sprintf("%02d", second)
))
, datetime_b = ymd_hms( paste0( date, " "
, sprintf("%02d", hour_b), ":"
, sprintf("%02d", minute_b), ":"
, sprintf("%02d", second_b)
) )
, diff_seconds = abs(as.integer(difftime(datetime_a, datetime_b, units = "secs")))
) %>%
# Remove rows where the difference between datetime_a and datetime_b is
# greater than 3 seconds:
filter(diff_seconds <= 3 ) %>%
# Then add the dummy variable:
mutate(dummy = 1) %>%
# Remove any unnecessary columns:
select(date, sym, hour, minute, second,bidPrice.x,offerPrice.x,mid.x,lag.x,lagsym.x,ret.x, dummy) %>%
# Add the dummy column to A using a left_join:
dplyr::left_join(swd, ., by = c("date", "sym", "hour", "minute", "second")) %>%
# Rows without any matches have dummy = NA. Recode:
mutate(dummy = ifelse( is.na(dummy), yes = 0, no = dummy))
this is to make an approximation, that is, looking at a range of +/- 3 seconds. However, this gives multiple dummy = 1 which is not good for me.