Manually setting the last value of difftime output in an R script


I am coding in R on Databricks.

I want the time intervals, in hours, between consecutive date1 entries, with rows arranged by pid, med and date1.

The interval for the latest date1 entry in each sequence of events per date should be set manually to 24 hours.

A cohort is a set of rows where pid, med and the date of date1 are the same.
Any change in these ends the cohort, and the last row of that cohort should have hours_output == 24.

df

  pid       med                date1                    
1  1       drugA             2023-02-02 09:00:00         
2  1       drugA             2023-02-02 12:00:00         
3  1       drugA             2023-02-02 14:00:00        
4  1       drugB             2023-02-03 10:00:00         
5  1       drugB             2023-02-03 18:00:00 

The script I attempted:

df1 <- df %>%
  arrange(pid, med, date1) %>%
  mutate(hours_output = as.numeric(difftime(lead(date1), date1, units = "hours")))

# Replace the last duration value with 24 hours
df1$hours_output[last(nrow(df1))] <- 24

df1 <- df1 %>% select(med, date1, hours_output)
head(df1)

Actual output

  pid       med                date1                    hours_output
1  1       drugA             2023-02-02 09:00:00         3.00
2  1       drugA             2023-02-02 12:00:00         2.00
3  1       drugA             2023-02-02 14:00:00        20.00
4  1       drugB             2023-02-03 10:00:00         8.00
5  1       drugB             2023-02-03 18:00:00        18.00 (18 hours to the next row - not shown)

Desired output

  pid       med                date1                    hours_output
1  1       drugA             2023-02-02 09:00:00         3.00
2  1       drugA             2023-02-02 12:00:00         2.00
3  1       drugA             2023-02-02 14:00:00        24.00
4  1       drugB             2023-02-03 10:00:00         8.00
5  1       drugB             2023-02-03 18:00:00        24.00

db2020 On BEST ANSWER

The following works on the dummy data (shown at the top of the question) when run in Databricks, and it gives the desired output.

It also works on the actual data in Databricks.

library(dplyr)
library(lubridate)

# Convert datetime column to POSIXct object
df$date1 <- ymd_hms(df$date1)

df <- df %>% arrange(date1)

# Calculate duration between consecutive datetime values, including last interval
durations <- c(diff(df$date1), 0)

# Convert durations to hours and round to 2 decimal places
durations <- round(as.numeric(durations, units = "hours"), 2)

# Replace any negative values with 0
durations[durations < 0] <- 0

# Find last timestamp for each date and replace duration with 24 hours
last_times <- df %>%
  group_by(Date = as.Date(date1)) %>%
  slice_tail(n = 1) %>%
  ungroup()

durations[df$date1 %in% last_times$date1] <- 24

df$duration <- durations

df1 <- df %>% select(date1, duration)

head(df1, 10)

r2evans On

Since you want to do it per patient and per med, you should group by those columns so that diff() never computes a difference across two different groups.

library(dplyr) # 1.1.0 for .by=
df %>%
  mutate(date1 = as.POSIXct(date1)) %>% # may not be needed with your real data
  mutate(
    hours_output = as.numeric(c(diff(date1), 24), units="hours"),
    .by = c(pid, med)
  )
#   pid   med               date1 hours_output
# 1   1 drugA 2023-02-02 09:00:00            3
# 2   1 drugA 2023-02-02 12:00:00            2
# 3   1 drugA 2023-02-02 14:00:00           24
# 4   1 drugB 2023-02-03 10:00:00            8
# 5   1 drugB 2023-02-03 18:00:00           24

I'm using .by= which is new to dplyr_1.1.0; if you have a version before that, then use group_by explicitly:

df %>%
  mutate(date1 = as.POSIXct(date1)) %>%
  group_by(pid, med) %>%
  mutate(hours_output = as.numeric(c(diff(date1), 24), units="hours"))
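One caveat with c(diff(date1), 24): diff() picks its units automatically, so if a group's gaps were all under an hour the appended 24 could end up being read in minutes rather than hours. A minimal sketch that avoids this by requesting hours explicitly and filling the missing last interval afterwards is below. It assumes that "the same date1" in the question means the same calendar day, so it adds a helper column (called day here, a name I made up) to the grouping; drop it from group_by() if you only want pid and med.

```r
library(dplyr)

# Dummy data from the question; tz = "UTC" is an assumption to keep the dates stable
df <- data.frame(
  pid = 1,
  med = c("drugA", "drugA", "drugA", "drugB", "drugB"),
  date1 = as.POSIXct(
    c("2023-02-02 09:00:00", "2023-02-02 12:00:00", "2023-02-02 14:00:00",
      "2023-02-03 10:00:00", "2023-02-03 18:00:00"),
    tz = "UTC"
  )
)

res <- df %>%
  arrange(pid, med, date1) %>%
  # 'day' is a hypothetical helper column: one cohort per pid, med and calendar day
  group_by(pid, med, day = as.Date(date1)) %>%
  mutate(
    # hours to the next row in the same cohort; NA for the last row of each cohort
    hours_output = as.numeric(difftime(lead(date1), date1, units = "hours")),
    # the last row of each cohort has no next row, so set it to 24 by hand
    hours_output = coalesce(hours_output, 24)
  ) %>%
  ungroup() %>%
  select(-day)

res
```

Because the units are fixed in difftime() rather than inferred, the 24 is always 24 hours regardless of how close together the rows in a group are.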
db2020 On

I can make this work as a sample in Databricks.

Working example using dummy data

library(dplyr)

# Sample dataframe with datetime values
df <- data.frame(datetime_col = c("2023-02-02 09:00:00", "2023-02-02 12:00:00", "2023-02-02 14:00:00"))

# Convert datetime column to POSIXct object
df$datetime_col <- as.POSIXct(df$datetime_col, format = "%Y-%m-%d %H:%M:%S")

# Sort by datetime so that consecutive differences are taken in time order
df <- df %>% arrange(datetime_col)

# Calculate duration between consecutive datetime values, including last interval
durations <- c(diff(df$datetime_col, units = "hours"), 0)

# Convert durations to hours
durations <- as.numeric(durations, units = "hours")

# Replace last duration value with 24 hours
durations[length(durations)] <- 24

durations <- round(durations, 2)

df$duration <- durations

df_f <- df %>% select(datetime_col, duration)
df_f
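The example above only sets the very last row of the whole data frame to 24. If the same idea is needed per group without dplyr, something like the following base R sketch might work. The grp column is a made-up stand-in for whatever defines a cohort in the real data (for example pid and med pasted together), and tz = "UTC" is an assumption.

```r
# Dummy data with a hypothetical grouping column 'grp'
df <- data.frame(
  grp = c("a", "a", "a", "b", "b"),
  datetime_col = as.POSIXct(
    c("2023-02-02 09:00:00", "2023-02-02 12:00:00", "2023-02-02 14:00:00",
      "2023-02-03 10:00:00", "2023-02-03 18:00:00"),
    tz = "UTC"
  )
)

# Sort so that differences are taken in time order within each group
df <- df[order(df$grp, df$datetime_col), ]

# as.numeric() on POSIXct gives seconds since the epoch, so diff() / 3600 is hours.
# ave() applies the function within each group and keeps the original row order.
df$duration <- ave(
  as.numeric(df$datetime_col),
  df$grp,
  FUN = function(secs) c(diff(secs) / 3600, 24)  # last row of each group gets 24
)

df$duration <- round(df$duration, 2)
df
```

Working in plain seconds sidesteps difftime's automatic unit selection, at the cost of doing the hour conversion by hand.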