Sorting the order of bigrams, removing duplicates and summing up their frequencies

55 Views Asked by gmohor21 At 19 June 2023 at 23:17

I have a dataframe which has columns bigrams and counts. The dataframe looks like:

bigrams	counts
('asset', 'experience')	1
('qualifications', 'your')	1
('your', 'contribution')	1
('contribution', 'bilingual')	1
('your', 'qualifications')	1
('bilingual', 'contribution')	1

So, for ('contribution', 'bilingual') it should sort the words of the bigram alphabetically and output ('bilingual', 'contribution') in the bigrams column, remove one of the duplicates, and add up the counts so that the counts column shows 2. This should happen for every occurrence of ('contribution', 'bilingual') and for every such bigram throughout the dataframe (for example, ('your', 'contribution') in the dataframe above). Finally, the bigram with the highest frequency should be at the top, followed by the remaining bigrams in decreasing order of frequency.

I also want to preserve the format in which the bigrams are there in the above dataframe.

I want my output to be like this:

bigrams	counts
('qualifications', 'your')	2
('bilingual', 'contribution')	2
('your', 'contribution')	1
('asset', 'experience')	1

I tried to solve this based on these two SO questions Q1 and Q2, but they are giving me weird answers, not what I need.

The code with which I tried:

import pandas as pd

df = pd.read_csv('emplois_df_FonctionsStagiaire_bigrams_counts.csv')

# splitting the strings in `bigrams` column by space, sort the resulting list and join again. This will help to order the jumbled bigrams
#df_new = pd.DataFrame(columns = ['bigrams', 'counts']
df['bigrams'] = df['bigrams'].apply(lambda x: ' ' .join(sorted(x.strip().split(', '))))
#OR
df['bigrams'] = df['bigrams'].apply(lambda x: tuple(sorted(x.strip('()').split(', '))))
# Group by `bigrams` and sum the `counts` column
df_new = df.groupby('bigrams', as_index=False)['counts'].sum()

df_new.to_csv('emplois_df_FonctionsStagiaire_bigrams_sorted_counts.csv', index= False)

type(df.bigrams.values[1]) gives <class 'str'>.
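Because each value is a string rather than a tuple, the two attempts above behave differently from what their comments suggest. The sketch below is only an illustration of one likely cause of the "weird answers": it applies each lambda to two sample strings taken from the dataframe above. The helper names `first_attempt` and `second_attempt` are hypothetical and were introduced here for readability.

```python
# Two rows from the question that should end up as the same bigram.
sample = "('qualifications', 'your')"
swapped = "('your', 'qualifications')"

def first_attempt(x):
    # Splits on ', ' while the parentheses are still attached to the words,
    # so '(' and ')' take part in the sort and end up in the joined string.
    return ' '.join(sorted(x.strip().split(', ')))

def second_attempt(x):
    # Strips the outer parentheses first; the words are sorted correctly,
    # but each element still carries its literal quote characters.
    return tuple(sorted(x.strip('()').split(', ')))

print(first_attempt(sample))
print(first_attempt(swapped))
print(second_attempt(sample))
print(second_attempt(swapped))

# If both lines are run one after the other, as posted, the second one
# receives the already mangled output of the first one.
print(second_attempt(first_attempt(sample)))
```

With the first lambda the two rows produce different strings, so groupby cannot merge them. The second lambda does produce matching keys when it is run on the original strings, which is why it matters whether the line after `#OR` replaces the first line or runs in addition to it.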

There is 1 best solution below

Andrej Kesely On 20 June 2023 at 00:12

Try:

from ast import literal_eval

# apply ast.literal_eval if necessary:
df['bigrams'] = df['bigrams'].apply(literal_eval)

df['bigrams'] = df['bigrams'].apply(lambda x: tuple(sorted(x)))
df = df.groupby('bigrams').agg({'counts':'sum'}).reset_index()

print(df)

Prints:

                     bigrams  counts
0        (asset, experience)       1
1  (bilingual, contribution)       2
2       (contribution, your)       1
3     (qualifications, your)       2
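The printed result is ordered alphabetically by bigram, while the question also asks for decreasing order of frequency and for the original string format. A minimal sketch of those two extra steps follows, assuming the six sample rows from the question in place of the CSV file. The `sort_values` call and the `repr` conversion are additions to the answer, not part of it.

```python
from ast import literal_eval

import pandas as pd

# Sample rows from the question; each bigram is stored as a string.
df = pd.DataFrame({
    "bigrams": [
        "('asset', 'experience')",
        "('qualifications', 'your')",
        "('your', 'contribution')",
        "('contribution', 'bilingual')",
        "('your', 'qualifications')",
        "('bilingual', 'contribution')",
    ],
    "counts": [1, 1, 1, 1, 1, 1],
})

# Parse each string into a tuple, then sort the words inside the tuple.
df["bigrams"] = df["bigrams"].apply(literal_eval)
df["bigrams"] = df["bigrams"].apply(lambda x: tuple(sorted(x)))

# Merge duplicates, then put the most frequent bigrams first.
result = (
    df.groupby("bigrams").agg({"counts": "sum"}).reset_index()
      .sort_values("counts", ascending=False, kind="stable")
      .reset_index(drop=True)
)

# Convert the tuples back to strings such as "('bilingual', 'contribution')".
result["bigrams"] = result["bigrams"].apply(repr)

print(result)
```

Note that sorting the words alphabetically turns ('your', 'contribution') into ('contribution', 'your'), which differs from the row shown in the question's desired output. The order among bigrams with equal counts is not specified in the question; here the stable sort is intended to leave ties in the order groupby produced them.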
