I have a dataset that looks like this:
| Session | UserID | Duration(min) | id | Page# |
|---|---|---|---|---|
| 123 | 345 | 40 | 1 | Pg1 |
| 123 | 345 | 40 | 2 | Pg2 |
| 123 | 345 | 40 | 3 | Pg2 |
| 123 | 345 | 40 | 4 | Pg3 |
| 123 | 345 | 40 | 5 | Pg2 |
| 123 | 345 | 40 | 6 | Pg3 |
| 124 | 346 | 100 | 1 | Pg2 |
| 124 | 346 | 100 | 2 | Pg2 |
| 124 | 346 | 100 | 3 | Pg3 |
| 124 | 346 | 100 | 4 | Pg4 |
| 125 | 347 | 59 | 1 | Pg1 |
| 125 | 347 | 59 | 2 | Pg5 |
I would like the dataset to look like this (below) because it would eventually create a more meaningful user journey.
| Session | UserID | Duration(min) | id | Page# |
|---|---|---|---|---|
| 123 | 345 | 40 | 1 | Pg1 |
| 123 | 345 | 40 | 2 | Pg2 |
| 123 | 345 | 40 | 3 | Pg3 |
| 123 | 345 | 40 | 4 | Pg2 |
| 123 | 345 | 40 | 5 | Pg3 |
| 124 | 346 | 100 | 1 | Pg2 |
| 124 | 346 | 100 | 2 | Pg3 |
| 124 | 346 | 100 | 3 | Pg4 |
| 125 | 347 | 59 | 1 | Pg1 |
| 125 | 347 | 59 | 2 | Pg5 |
Duration (min) = the number of minutes from the beginning of the path to the last page viewed. id = path number or path length number.
I am trying to take out subsequent repetitive events based on the session and user id, leaving the duration as is.
Result: