Need help grouping data from multiple columns to an Index column in Python


I have a "grocery store transactions" csv file loaded into Python that currently looks like this:

import pandas as pd

txns = pd.read_csv('transactions.csv')
txns.head(10)

[Image: Grocery transactions]

*** My goal is to group all Products purchased by Transaction number, i.e. the Transaction column will serve as the index column. ***

*** I want each row to represent a unique Transaction # and all of the Product purchases associated with that transaction. ***

Currently, however, a transaction involving multiple products spans multiple rows. This is preventing me from doing my grocery store market basket analysis.

If anyone has any tips or feedback on how I can make this change happen, please comment below!

Reoun On BEST ANSWER

As @Nick said, you can use groupby followed by .sum() to collapse the rows so that Transaction becomes a unique index.

new_txns = txns.groupby('Transaction').sum()
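To see what this step does, here is a minimal sketch on a made-up frame. The product columns Bread and Milk and the 0/1 row layout are assumptions, since the real data is only shown in the picture.

```python
import pandas as pd

# Hypothetical data: several rows per transaction, products as 0/1 columns.
toy = pd.DataFrame({
    "Transaction": [1, 1, 1, 2],
    "Bread":       [1, 1, 0, 0],
    "Milk":        [0, 0, 1, 1],
})

# One row per Transaction; each product column now holds a count.
toy_grouped = toy.groupby("Transaction").sum()
print(toy_grouped)
```

In this toy frame Bread sums to 2 for transaction 1, so the result holds counts rather than 0/1 flags.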

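Summing can leave counts greater than 1 when a product appears more than once in a transaction, so convert the result back to one-hot encoding (0 or 1) for basket analysis.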

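def onehot_encode(x):
    # Any count of 1 or more means the product was in the basket.
    return 1 if x >= 1 else 0

# In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.
new_txns = new_txns.applymap(onehot_encode)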

Note: If you want the one-hot values as True/False instead of 1/0, cast to bool:

new_txns = new_txns.astype('bool')
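The same result can probably be reached without a Python-level function by comparing the summed frame against zero. This is only a sketch on made-up data: the Bread and Milk columns are hypothetical, and it assumes every product column holds non-negative integer flags, in which case "greater than 0" matches the `x >= 1` test above.

```python
import pandas as pd

# Hypothetical data: several rows per transaction, products as 0/1 columns.
toy = pd.DataFrame({
    "Transaction": [1, 1, 1, 2],
    "Bread":       [1, 1, 0, 0],
    "Milk":        [0, 0, 1, 1],
})

# Sum per transaction, then flag every product bought at least once.
basket = toy.groupby("Transaction").sum().gt(0)   # True/False values

# Optional: the same table as 1/0 integers.
basket_int = basket.astype(int)
print(basket_int)
```

Whether the boolean or the integer form is the better input depends on the basket analysis library you use.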