I have a dataframe of visits that includes a person ID column, and a given person may have more than one visit. I want to number the visits for a given person.
I can sort the dataframe by visit date, then iterate and count the number of times a person ID has occurred, or its occurrences between the first index and current index. Is there a more efficient (ideally vectorized) way? (Note that I don't want the total counts as from value_counts(), I want a new series that is a running count of visits per person.)
Here is a dummy dataframe where person 'a' has two visits:
In[1]: df = pd.DataFrame({'Visit': [0, 1, 2, 3, 4, 5], 'Person_ID': ['a', 'b', 'c', 'a', 'd', 'e']})
In[2]: df['Person_ID'].value_counts()
Out[2]:
a 2
b 1
c 1
d 1
e 1
Name: Person_ID, dtype: int64
Example 1: incrementing dictionary counter
In[3]: person_visit_count = {}
for index, row in df.iterrows():
if row['Person_ID'] not in person_visit_count:
person_visit_count[row['Person_ID']] = 1
else:
person_visit_count[row['Person_ID']] += 1
df.loc[index, 'Person_Visit'] = person_visit_count[row['Person_ID']]
In[4]: df
Out[4]:
Visit Person_ID Person_Visit
0 0 a 1.0
1 1 b 1.0
2 2 c 1.0
3 3 a 2.0
4 4 d 1.0
5 5 e 1.0
Example 2: index-based counter
In[5]: for index, row in df.iterrows():
df.loc[index, 'Person_Visit2'] = (df.loc[0:index, 'Person_ID'] == row['Person_ID']).sum()
In[6]: df
Out[6]:
Visit Person_ID Person_Visit Person_Visit2
0 0 a 1.0 1.0
1 1 b 1.0 1.0
2 2 c 1.0 1.0
3 3 a 2.0 2.0
4 4 d 1.0 1.0
5 5 e 1.0 1.0
Iterating is rarely the best option, you can group and do a cumulative count of the rows
Just for comparison, here are the timings for the solutions you gave and this one (given by the timeit module)
groupby with cumcount
incrementing dictionary counter
index-based counter