Add row, copying previous row values, creating new value only for one column, assigning dummy only to new row

12 Views Asked by Mining At 28 August 2023 at 18:14

I have a huge data set where I have the columns shown below (and other columns). 5mln rows and 11 columns.

SYM	Date	Hour	Minute	Second
A	01/10/2020	6	10	1
A	01/10/2020	6	15	22
A	01/10/2020	7	4	23
A	02/10/2020	7	23	23
B	01/10/2020	6	22	45
B	01/10/2020	12	56	47
B	02/10/2020	16	43	2
C	01/10/2020	6	31	23
C	02/10/2020	6	41	24
C	02/10/2020	7	9	33
D	01/10/2020	6	2	22
D	02/10/2020	6	4	12
D	03/10/2020	7	54	12
D	03/10/2020	8	53	12
D	03/10/2020	8	55	12
D	03/10/2020	9	4	12
D	03/10/2020	9	7	24

I have also anoterh dataset which instead looks like this:

Date	Hour	Minute	Second
01/10/2020	6	10	1
01/10/2020	6	22	45
02/10/2020	16	43	2
02/10/2020	6	41	24
02/10/2020	7	9	33
03/10/2020	7	54	12
03/10/2020	8	53	12
03/10/2020	8	55	12
04/10/2020	9	4	12
04/10/2020	9	7	24

lets call it df1 and df2, respectively.

i want df1 to look at df2, and put a dummy = 1 where the date,hour,minute,second are matching. However, in df2 (which is the reference) there are certain references that are not present in df1. Therefore, I would like df1 to insert a row right after the date,h,m,s nearest to the one shown in df2, copying all the values, but creating a d,h,m,s equal to the one shown in df2 and give a dummy = 1 to that row only.

Example: df2 has 04:14:08 and df1 has either 04:14:05 or 04:14:09. If this happens: create a new row with time = 04:14:08 and give it all the values present in 04:14:05, and give dummy = 1 only to new row 04:14:08.

How could I do this?

I tried the following:

df1<- dplyr::inner_join( df1
                           , df2
                           , by = c( "date") # only use date as a key
                           , suffix = c("", "_b")
) %>% 
  mutate( 
    datetime_a = ymd_hms( paste0( date, " "
                                  , sprintf("%02d", hour), ":"
                                  , sprintf("%02d", minute), ":"
                                  , sprintf("%02d", second)
    ))
    , datetime_b = ymd_hms( paste0( date, " "
                                    , sprintf("%02d", hour_b), ":"
                                    , sprintf("%02d", minute_b), ":"
                                    , sprintf("%02d", second_b)
    ) )
    , diff_seconds = abs(as.integer(difftime(datetime_a, datetime_b, units = "secs")))
    
  ) %>% 
  # Remove rows where the difference between datetime_a and datetime_b is
  # greater than 3 seconds:
  filter(diff_seconds <= 3 ) %>%
  # Then add the dummy variable:
  mutate(dummy = 1) %>% 
  # Remove any unnecessary columns:
  select(date, sym, hour, minute, second,bidPrice.x,offerPrice.x,mid.x,lag.x,lagsym.x,ret.x, dummy) %>% 
  # Add the dummy column to A using a left_join:
  dplyr::left_join(swd, ., by = c("date", "sym", "hour", "minute", "second")) %>% 
  # Rows without any matches have dummy = NA. Recode:
  mutate(dummy = ifelse( is.na(dummy), yes = 0, no = dummy))

this is to make an approximation, that is, looking at a range of +/- 3 seconds. However, this gives multiple dummy = 1 which is not good for me.

Original Q&A

Add row, copying previous row values, creating new value only for one column, assigning dummy only to new row

There are 0 best solutions below

Related Questions in FOR-LOOP

Related Questions in IF-STATEMENT

Related Questions in ROW

Related Questions in DUMMY-VARIABLE

Related Questions in PANEL-DATA

Trending Questions

Popular # Hahtags

Popular Questions