Calculate cumulative sum by row, skipping zeros, and restarting using column names pattern


I have a function that calculates a row-wise cumulative sum and is meant to restart after every column whose name starts with Y3.

There should be a sum consisting of Y1 in the first cell, Y1+Y2 in the second cell, Y1+Y2+Y3 in the third cell. Then the code restarts - starting from Y1.

All the column names follow a pattern: they start with Y1, Y2, or Y3.

I would like to skip cells that contain zeros and continue calculating. That is, if a cell equals zero, it should stay as it is instead of being overwritten by the cumulative sum. The code should skip the 0 cell and add the value of the current cell to the value of the previous, non-zero one.

For instance, if Y2=0, the sum in the Y3 cell should be Y3=Y3 (+Y2)+Y1, as usual, while cell Y2=0 remains unchanged instead of being overwritten by the cumulative Y2=Y2+Y1.

I wrote a function (see below), but it produces two wrong results in the row for rating 3C.

Here is the reproducible code chunk:

df=structure(list(Ratingstufe = c("10", "1A", "1B", "2A", "2B", 
     "3A", "3C", "9C"), 
     Y1.Base = c(0, 0, 0, 0, 0, 0, 0, 1), 
     Y2.Base = c(0, 0, 0, 0, 0, 0, 0, -1), 
     Y3.Base = c(0, 0, 0, 0, 0, 0, 0, 0), 
     Y1.Scenario_1 = c(0, 0, 2, 2, 2, 2, 2, 1), 
     Y2.Scenario_1 = c(0, 0, 0, 0, 0, 0, 0, -1), 
     Y3.Scenario_1 = c(0, 0, 0, 0, 0, 0, -1, 0), 
     # Y1.Scenario_2 = c(0, 0, 1, 1, 1, 1, 1, 1),
     # Y2.Scenario_2 = c(0, 1, 1, 0, 1, 0, 1, 0),
     # Y3.Scenario_2 = c(0, 0, 0, 0, 0, 0, -1, 0),
     Y1.Scenario_3 = c(0, 0, 1, 1, 0, 1, 1, 1),
     Y2.Scenario_3 = c(0, 0, 1, 1, 1, 2, 2, -1),
     Y3.Scenario_3 = c(0, 0, 0, 0, 0, 0, 1, 1)),
     row.names = c(NA,  8L), class = "data.frame")

This is what I tried.

# Function to calculate cumulative sum by row, ignoring zeros and restarting at "Y1"
cumsum_ignore_zeros <- function(row) {
  # Initialize variables
  cumulative <- 0
  result <- numeric(length(row))

  # Iterate through the row elements
  for (i in seq_along(row)) {
    if (row[i] != 0) {
      cumulative <- cumulative + row[i]
      result[i] <- cumulative
    } else {
      result[i] <- row[i]
    }

    # Check if the column name contains 'Y1' and reset cumulative sum if it does
    if (grepl("Y1", names(df)[i])) {
      cumulative <- 0
    }
  }
  return(result)
}

# Apply the function to each row
df[-1] <- t(apply(df[-1], 1, cumsum_ignore_zeros))

Problem 1: column Y3.Scenario_1. The value at the intersection with rating 3C is incorrectly calculated as -1. I suspect this is because the preceding cell (Y2.Scenario_1) holds 0: instead of skipping Y2 and computing Y3=Y3+Y1, the function seemingly restarts and computes Y3=Y3.

Once 1 was added to -1 (Y2=Y2+Y1=1+(-1)=0) and the cell at the intersection of rating 3C and Y2.Scenario_1 was replaced with 0, did the function ignore the 0 in Y2 and restart? It then shows Y3=-1, seemingly the result of Y3=Y3 (the cell on its own) rather than Y3=Y3+Y1.

Problem 2: The same issue seems to show up again where, I believe, Y1.Scenario_3 adds the value of Y3.Scenario_1 to itself when it is not supposed to, so that Y1.Scenario_3=Y1.Scenario_3+Y3.Scenario_1. It should simply have been Y1.Scenario_3=Y1.Scenario_3.

Instead, -1 is added to 1 at the intersection of rating 3C and Y1.Scenario_3, giving 0.

Y1.Scenario_3 was not supposed to look back at Y3.Scenario_1. I cannot see any other explanation for this result.

Here is the outcome I have currently.

tail(df)
  Ratingstufe Y1.Base Y2.Base Y3.Base Y1.Scenario_1 Y2.Scenario_1 Y3.Scenario_1 Y1.Scenario_3 Y2.Scenario_3 Y3.Scenario_3
3          1B       0       0       0             2             0             0             1             2             0
4          2A       0       0       0             2             0             0             1             2             0
5          2B       0       0       0             2             0             0             0             1             0
6          3A       0       0       0             2             0             0             1             3             0
7          3C       0       0       0             2             0            -1             0             2             1
8          9C       1       0       0             1             0             0             1             0             1

And this is what I would like to achieve (I marked the changes with stars):

tail(df)
  Ratingstufe Y1.Base Y2.Base Y3.Base Y1.Scenario_1 Y2.Scenario_1 Y3.Scenario_1 Y1.Scenario_3 Y2.Scenario_3 Y3.Scenario_3
3          1B       0       0       0             2             0             0             1             2             0
4          2A       0       0       0             2             0             0             1             2             0
5          2B       0       0       0             2             0             0             0             1             0
6          3A       0       0       0             2             0             0             1             3             0
7          3C       0       0       0             2             0             1*            1*            3*            4*
8          9C       1       0       0             1             0             0             1             0             1

My original dataset for your convenience (dput can be found above):

tail(df)
  Ratingstufe Y1.Base Y2.Base Y3.Base Y1.Scenario_1 Y2.Scenario_1 Y3.Scenario_1 Y1.Scenario_3 Y2.Scenario_3 Y3.Scenario_3
3          1B       0       0       0             2             0             0             1             1             0
4          2A       0       0       0             2             0             0             1             1             0
5          2B       0       0       0             2             0             0             0             1             0
6          3A       0       0       0             2             0             0             1             2             0
7          3C       0       0       0             2             0            -1             1             2             1
8          9C       1      -1       0             1            -1             0             1            -1             1

Any other solution besides correcting the loop is also welcome.

There are 3 best solutions below

2
M-- On BEST ANSWER

We can convert the data to long format and get the cumsum (with an additional condition) for each "scenario" and "rating" (assuming that ratings are unique), and then convert back to wide format.

library(dplyr)
library(tidyr)

df %>% 
  pivot_longer(-Ratingstufe) %>% 
  separate(name, into = c("Y", "Scen"), sep = "\\.", remove = FALSE) %>% 
  mutate(csum = cumsum(value) * (value != 0), 
            .by = c(Ratingstufe, Scen)) %>% 
  pivot_wider(id_cols = Ratingstufe, names_from = name, values_from = csum)

#> # A tibble: 8 × 10
#>   Ratingstufe Y1.Base Y2.Base Y3.Base Y1.Scenario_1 Y2.Scenario_1 Y3.Scenario_1
#>   <chr>         <dbl>   <dbl>   <dbl>         <dbl>         <dbl>         <dbl>
#> 1 10                0       0       0             0             0             0
#> 2 1A                0       0       0             0             0             0
#> 3 1B                0       0       0             2             0             0
#> 4 2A                0       0       0             2             0             0
#> 5 2B                0       0       0             2             0             0
#> 6 3A                0       0       0             2             0             0
#> 7 3C                0       0       0             2             0             1
#> 8 9C                1       0       0             1             0             0
#> # ℹ 3 more variables: Y1.Scenario_3 <dbl>, Y2.Scenario_3 <dbl>,
#> #   Y3.Scenario_3 <dbl>

Created on 2024-02-29 with reprex v2.0.2
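
The key step above is `cumsum(value) * (value != 0)`: the running total is taken over every cell of the group, zeros included, and the cells that were originally zero are then multiplied back down to 0. Here is a minimal base-R sketch of that idea on a single Y1..Y3 group, using the Scenario_1 values of rating 3C from the question. The helper name `masked_cumsum` is made up for illustration and is not part of the answer's code.

```r
# Hypothetical helper: running total over the whole group,
# with originally-zero cells masked back to 0.
masked_cumsum <- function(x) cumsum(x) * (x != 0)

# Y1, Y2, Y3 of Scenario_1 for rating 3C (values from the question)
x <- c(2, 0, -1)

cumsum(x)         # plain running total: 2 2 1
masked_cumsum(x)  # Y2 stays 0, Y3 still includes Y1: 2 0 1
```

The printed tibble truncates the three Scenario_3 columns; piping the result into `print(width = Inf)` should display them as well.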

0
Friede On

I came up with

X = lapply(split.default(df[-1], gl(ncol(df[-1]) / 3, 3)), 
           \(x) t(apply(x, 1, cumsum))) |>
  do.call(cbind, args = _)
df[-1] = ifelse(df[-1] == 0, 0, X)

which is a split-apply-combine approach that applies cumsum row-wise (MARGIN = 1 in apply()).


I recently saw a beautiful approach (an alternative to ifelse()) for conditionally replacing values of a data frame/matrix with values from another one of the same dimensions, but I cannot remember it.
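
One possible alternative to ifelse(), not necessarily the one meant above, is to index the cumsum matrix with a logical matrix and overwrite the matching cells. The sketch below assumes a two-row subset of the question's data (ratings 3C and 9C only) and avoids the `|>` placeholder so it does not depend on a recent R version.

```r
# Two-row subset of the question's data (ratings 3C and 9C); an assumption
# made only to keep the example short.
df <- data.frame(
  Ratingstufe   = c("3C", "9C"),
  Y1.Base       = c(0, 1),  Y2.Base       = c(0, -1), Y3.Base       = c(0, 0),
  Y1.Scenario_1 = c(2, 1),  Y2.Scenario_1 = c(0, -1), Y3.Scenario_1 = c(-1, 0),
  Y1.Scenario_3 = c(1, 1),  Y2.Scenario_3 = c(2, -1), Y3.Scenario_3 = c(1, 1)
)

# Row-wise cumsum within each block of three columns, as in the answer
X <- do.call(cbind, lapply(split.default(df[-1], gl(ncol(df[-1]) / 3, 3)),
                           function(x) t(apply(x, 1, cumsum))))

# Logical-matrix indexing: cells that were 0 in the original stay 0
X[df[-1] == 0] <- 0
df[-1] <- X
df
```

Multiplying by the mask, `X * (df[-1] != 0)`, should give the same values.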

0
Ingrid On

I found a way to keep the function from overwriting a cell with the row-wise cumsum whenever the cell contains 0, while it keeps accumulating until it hits Y1 again:

cumsum_w_reset <- function(df) {
  # Convert columns that need to be summed to numeric
  # this adds forced NA writing issue in the Rating column. Instead, subset df[-1] in the "Apply" row.
  #df[] <- lapply(df, function(x) if(!is.numeric(x)) as.numeric(as.character(x)) else x)

  # Apply the cumulative sum row-wise with resets
  t(apply(df, 1, function(row) {
    # Initialize the running sum
    running_sum <- 0
    # Initialize the flag to track if we should start summing
    start_sum <- FALSE

    # Iterate over each element of the row
    for (i in seq_along(row)) {
      # Check if the current column is a reset point
      if (grepl("^Y1", names(df)[i])) {
        running_sum <- row[i] # Set the running sum to the current value
        start_sum <- TRUE # Start summing from this point
      } else if (start_sum && row[i] != 0) {
        # Add the current value to the running sum if it's not 0
        running_sum <- running_sum + row[i]
        row[i] <- running_sum # Update the row with the new value
      }
      # If the value is 0, do not add to running_sum and do not update the value of the cell
    }
    return(row)
  }))
}

# Apply the function to the numeric columns of the dataframe
#df[-1] <- cumsum_w_reset(df[-1])
df[, sapply(df, is.numeric)] <- cumsum_w_reset(df[, sapply(df, is.numeric)])
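
For comparison, the original loop from the question appears to go wrong for two reasons. First, it looks up `names(df)[i]` in the full data frame, which still contains `Ratingstufe`, while `row` comes from `df[-1]`, so the name lags one column behind the value. Second, the reset is checked after the value has already been added. Together these make the reset fire after each Y2 column, which matches both symptoms in the 3C row. Below is a minimal sketch of a corrected loop, assuming a two-row subset of the question's data; the function name `cumsum_skip_zeros` is made up for illustration.

```r
# Two-row subset of the question's data (ratings 3C and 9C); an assumption
# made only to keep the example short.
df <- data.frame(
  Ratingstufe   = c("3C", "9C"),
  Y1.Base       = c(0, 1),  Y2.Base       = c(0, -1), Y3.Base       = c(0, 0),
  Y1.Scenario_1 = c(2, 1),  Y2.Scenario_1 = c(0, -1), Y3.Scenario_1 = c(-1, 0),
  Y1.Scenario_3 = c(1, 1),  Y2.Scenario_3 = c(2, -1), Y3.Scenario_3 = c(1, 1)
)

# Hypothetical corrected version of the question's loop
cumsum_skip_zeros <- function(row) {
  cumulative <- 0
  result <- numeric(length(row))
  for (i in seq_along(row)) {
    # Reset BEFORE adding, using the names carried by the row itself
    if (grepl("^Y1", names(row)[i])) {
      cumulative <- 0
    }
    cumulative <- cumulative + row[[i]]
    # Cells that are 0 in the input stay 0; all others get the running total
    result[i] <- if (row[[i]] != 0) cumulative else 0
  }
  result
}

# apply() passes each row as a vector named by the column names
df[-1] <- t(apply(df[-1], 1, cumsum_skip_zeros))
df
```

Taking the names from `row` rather than from a global `df` also means the function no longer depends on which columns were dropped before calling apply().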