I am working in R with a large dataset. The dataset records when bike rides start and end throughout a year with columns "started_at" and "ended_at" respectively. I want to create an area graph in ggplot that shows the activity of the bike rides over a 24-hour period. Therefore, the x-axis should start with 00:00:00 and end with 23:59:59. Ideally, I would have accuracy down to the second. The y-axis would show all active bike rides during any set time. For example, if the dataset only contained two rides: 2023-01-01 23:00:00--2023-01-02 01:00:00 2023-01-05 00:00:00--2023-01-05 01:00:00 Then the chart should show a straight value of 2 from 00:00:00 to 01:00:00, drop to 0, then rise to 1 from 23:00:00 to 23:59:59.
To accomplish this task, I first need to make a dataframe that lists all times from 00:00:00 to 23:59:59. I did that with this code:
df <- data.frame(time = format(seq(from = as.POSIXct("00:00:00", format = "%T"), to = as.POSIXct(
"23:59:59", format = "%T"), by = "1 sec"), "%H:%M:%S"))
Then, I need to record how many trips are active at any one time. I currently have a method for doing this as long as ended_at > started_at (when a trip doesn't occur over midnight or during a DST roll back). I'm not trying to fix that problem yet. The code I came up with looks like this:
sum(mapply(between, "03:00:00", format(as.POSIXct(sample$started_at), "%H:%M:%S"), format(as.POSIXct(sample$ended_at), "%H:%M:%S")))
With a 100,000 row sample, the code takes ~10 seconds to complete. I need to work with 4 million rows with 86400 different times. I could lower my accuracy by only looking at each hour, but the code is still too slow. Is there a more efficient way to code this?