Data Filtering. Remove zero rows but only until the first non zero value for each ID

156 Views Asked by At

I have a data table that captures the monthly returns for different stocks over a period of time. The data was collected in a way that if the sock was delisted, the rest of the return values are filled with zeroes (which does not align with reality). The filtering I need is the following:

for each stock: remove all lines from the end of the data collecting period which have a zero return until the first non-zero return. All other returns after the first non-zero return are valid and should not be deleted.

My idea was to order the data by date in descending order, group It by Id and then apply a function that removes all the zeroes until the first non-zero. However, I am not sure how to apply this. I wanted to use the data.table package so I created this function.

i <- 1
nzero <- i + 1

while (i <= count(stocks)+1) {
  if (stocks[i,6] == 0 | is.na(stocks[i,6])) {
    while (stocks[nzero,6] == 0 | is.na(stocks[nzero,6])) {
      nzero <- nzero + 1
    }
    test89 <- rbind(test89, tocks[nzero, count(stocks)])
    i <- nzero
  }
  else{
    if (stocks[i,1] != stocks[(i+1),1]) {
      test89 <- rbind(test89,stocks[(i+1):count(stocks)])
    }
  }
  i <- i+1
}

the first column of stocks refers to the Id of the stock and the 6. refers to the return.

Does this make sense? how can I change it for it to work and how can I apply a personalized function after grouping the data?

Thanks for the support and sorry for the not so qualified question Best

1

There are 1 best solutions below

0
Andrew Brown On

Perhaps something like this is what you are looking for? I made some dummy data for 12 months where returns are either 1 or 0. First one has a run of 6 non-zero values, second has 4.

library(data.table)

stocks <- data.table(
  stock = do.call('c', lapply(letters[1:2], rep, 12)),
  month = rep(1:12, 2),
  return = c(rep(1, 6), rep(0, 6),
             rep(1, 4), rep(0, 8))
)

# reverse date order
stocks <- stocks[order(-month)]

# calculate (first) run of zero values
zero_run_length <- function(x) {
  r <- rle(x)
  if (r$values[1] == 0) {
    return(r$lengths[1])
  } else return(0)
}

# calculate "zero run length" (on reversed order table)
zero_idx <- stocks[, list(zrl = zero_run_length(return)), by = "stock"]
zero_idx
#>    stock zrl
#> 1:     b   8
#> 2:     a   6

# select target rows
stocks2 <- stocks[zero_idx, on = "stock"][, .SD[((zrl[1] + 1):.N),], by = "stock"]

# put back in order
stocks2[order(stock, month),]
#>     stock month return zrl
#>  1:     a     1      1   6
#>  2:     a     2      1   6
#>  3:     a     3      1   6
#>  4:     a     4      1   6
#>  5:     a     5      1   6
#>  6:     a     6      1   6
#>  7:     b     1      1   8
#>  8:     b     2      1   8
#>  9:     b     3      1   8
#> 10:     b     4      1   8