I have some hospital visit healthcare data in a dataframe of the form:
| record_id | client_id | date_of_encounter | hospital_id |
|---|---|---|---|
| 1 | MK456 | 2014-01-01 | 01J |
| 2 | JJ103 | 2016-04-01 | 02J |
| 3 | MK456 | 2014-02-26 | 01J |
| 4 | JJ103 | 2016-05-01 | 02H |
| 5 | MK456 | 2014-03-01 | 02H |
| 6 | JJ103 | 2016-06-06 | 02J |
I want to create a column hospital_count which is a cumulative count of the UNIQUE hospitals visits by each client at the date_of_counter. I have already sorted is by the client_id and
date_of_counter. The result transformation would be
| record_id | client_id | date_of_encounter | hospital_id | hospital_count |
|---|---|---|---|---|
| 1 | MK456 | 2014-01-01 | 01J | 1 |
| 3 | MK456 | 2014-02-26 | 01J | 1 |
| 5 | MK456 | 2014-03-01 | 02H | 2 |
| 2 | JJ103 | 2016-04-01 | 02J | 1 |
| 4 | JJ103 | 2016-05-01 | 02H | 2 |
| 6 | JJ103 | 2016-06-06 | 02J | 2 |
Some suggest using a combination of a groupby and cumsum() but I am not too sure how?
Using
GoupBy.cumcountCumulative count of the number of distinct hospitals visited by each client
Explanation: We drop successive visits of the same client to the same hospital using
drop_duplicates; then we can simply count the visits of each client usinggroupbyandcumcount. However, this leavesNaNvalues in the lines that were dropped; we fill those values usingfillna.Cumulative count of the number of visits of each client to each hospital