I have a Series as follows:
(multi-index) | Count |
Group | Source | Name |
______________________________________
A | X | Jolly | 2
| | Stone | 1
| Y | Sand | 1
B | X | Sand | 1
| Y | Beach | 1
I also have the precursor dataframe to this, which looks like:
Group | Source | Name
___________________________
A | X | Jolly
A | X | Stone
A | X | Jolly
A | Y | Sand
B | X | Sand
B | X | Beach
B | Y | Stone
Each Group is guaranteed to have a Source X and a Source Y, where the corresponding Names for X and Y within a Group do not overlap. That is to say if a Name is found for Source X, it won't be found for Source Y within that group.
For example, I want to count how many times Stone and Sand are seen together for different Sources within a Group. So, Jolly is seen twice along with Stone once for Source Y Name Sand within Group A. If I see Jolly and Sand frequently together across Groups, where each is from a different Source this may mean they reference the same entity.
How do I accomplish this with pandas?
I don't have a particular expected output in mind, perhaps this:
Y
Stone | Sand | Beach | Water
Stone NaN 1 NaN NaN
X Sand 1 NaN NaN NaN
... ... ... ... ...
EDIT #1: I was able to arrive at this based on @mozway answer:
def get_source_grouped_combos(df):
src_x = df[df['Source'] == 'X'].groupby('Name').count()
src_y = df[df['Source'] == 'Y'].groupby('Name').count()
return list(product(src_x.index.values, src_y.index.values))
((dfex.loc[dfex['Group'].map(
dfex.groupby(['Group','Source'])['Name']
.agg(set)
.groupby(level=0)
.agg(lambda x: len(set.intersection(*x))==0))])
.groupby(['Group'])[['Source','Name']]
.apply(get_source_grouped_combos)
.explode('Name')
.agg(set)
.value_counts()
)
Which outputs this:
{Sand, Stone} 2
{Jolly, Sand} 1
{Beach, Stone} 1
dtype: int64
Apply is pretty slow, however. Is there a faster way?
EDIT #2: When applying this to my actual dataset (which has all the same dtypes per column) the last 2 lines of the above solution:
.agg(set)
.value_counts()
Throws the following error:
AttributeError: 'set' object has no attribute 'value_counts'
.agg(set) seems to be wrapping the entire Series in a set instead of each value. How can I fix it so it counts the occurence of each set of values?
EDIT #3:
Seems as though the appearance of NaN values caused this. I would be interested to know why it wraps the series in a set when NaN values are in the series instead of the individual elements.
The expected output is unclear, but do you want something like:
output: