Sorting the words within bigrams, removing duplicates and summing their frequencies


I have a dataframe which has columns bigrams and counts. The dataframe looks like:

bigrams counts
('asset', 'experience') 1
('qualifications', 'your') 1
('your', 'contribution') 1
('contribution', 'bilingual') 1
('your', 'qualifications') 1
('bilingual', 'contribution') 1

So, for ('contribution', 'bilingual'), it should sort the words of the bigram in alphabetical order and output ('bilingual', 'contribution') in the bigrams column, remove one of the duplicates, and add the counts so that the counts column shows 2. This should happen for every occurrence of ('contribution', 'bilingual') and for all other such bigrams throughout the dataframe (for example, ('your', 'contribution') in the dataframe above). Finally, the bigram with the highest frequency should be at the top, followed by the remaining bigrams in decreasing order of frequency.

I also want to preserve the format in which the bigrams are there in the above dataframe.

I want my output to be like this:

bigrams counts
('qualifications', 'your') 2
('bilingual', 'contribution') 2
('contribution', 'your') 1
('asset', 'experience') 1

I tried to solve this based on two SO questions, Q1 and Q2, but the results are not what I need.

The code with which I tried:

import pandas as pd

df = pd.read_csv('emplois_df_FonctionsStagiaire_bigrams_counts.csv')

# Split each string in the `bigrams` column on ', ', sort the resulting list and join it again, so that both orderings of a bigram end up identical
#df_new = pd.DataFrame(columns = ['bigrams', 'counts'])
df['bigrams'] = df['bigrams'].apply(lambda x: ' '.join(sorted(x.strip().split(', '))))
# OR (alternative attempt)
df['bigrams'] = df['bigrams'].apply(lambda x: tuple(sorted(x.strip('()').split(', '))))
# Group by bigram and sum the `counts` column
df_new = df.groupby('bigrams', as_index=False)['counts'].sum()

df_new.to_csv('emplois_df_FonctionsStagiaire_bigrams_sorted_counts.csv', index= False)

type(df.bigrams.values[1]) gives <class 'str'>, so each bigram is stored as a string rather than as a tuple.
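Because each cell is a string, plain string splitting leaves the parentheses and quote characters attached to the words. A minimal sketch of what the two attempts above produce for a single cell (the sample value is the first row of the table above; the variable names are only illustrative):

```python
# One cell exactly as it is read from the CSV: a string, not a tuple
cell = "('asset', 'experience')"

# Attempt 1: the parentheses stay glued to the words, so the pieces
# sort by their punctuation and the joined result is scrambled
attempt_1 = ' '.join(sorted(cell.strip().split(', ')))
print(attempt_1)

# Attempt 2: the parentheses are stripped, but every word keeps its
# own quote characters inside the resulting tuple
attempt_2 = tuple(sorted(cell.strip('()').split(', ')))
print(attempt_2)
```

The first attempt yields a scrambled string, and the second yields a tuple whose elements still contain the quotes, which would explain the odd-looking output.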

Answer by Andrej Kesely:

Try:

from ast import literal_eval

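# The bigrams are stored as strings; parse them into real tuples first
# (skip this step if the column already holds tuples):
df['bigrams'] = df['bigrams'].apply(literal_eval)

# Sort the words inside each bigram so both orderings map to the same key
df['bigrams'] = df['bigrams'].apply(lambda x: tuple(sorted(x)))
# Sum the counts of identical bigrams
df = df.groupby('bigrams').agg({'counts':'sum'}).reset_index()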

print(df)

Prints:

                     bigrams  counts
0        (asset, experience)       1
1  (bilingual, contribution)       2
2       (contribution, your)       1
3     (qualifications, your)       2
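
The output above is ordered by bigram, while the question also asks for decreasing counts and for the original ('a', 'b') string format. A minimal, self-contained sketch of one way to cover both, assuming the column holds strings as in the question (the inline sample data and the helper name `normalize` are only for illustration, and ordering ties alphabetically is an arbitrary choice):

```python
from ast import literal_eval

import pandas as pd

# Sample data mirroring the table in the question; in practice this
# would come from pd.read_csv(...)
df = pd.DataFrame({
    "bigrams": [
        "('asset', 'experience')",
        "('qualifications', 'your')",
        "('your', 'contribution')",
        "('contribution', 'bilingual')",
        "('your', 'qualifications')",
        "('bilingual', 'contribution')",
    ],
    "counts": [1, 1, 1, 1, 1, 1],
})


def normalize(cell):
    """Parse the string into a tuple, sort its words, and turn it back
    into the same "('a', 'b')" text format (hypothetical helper)."""
    words = literal_eval(cell)
    return str(tuple(sorted(words)))


df["bigrams"] = df["bigrams"].apply(normalize)

result = (
    df.groupby("bigrams", as_index=False)["counts"].sum()
      # highest counts first; ties are ordered alphabetically by bigram
      .sort_values(["counts", "bigrams"], ascending=[False, True])
      .reset_index(drop=True)
)

print(result)
```

Converting back to a string before grouping keeps the column in the same format as the input, so it can be written out with to_csv unchanged.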