How to display the variable name in a pandas DataFrame instead of the column name?


I'm currently studying the basics of data analysis with Python in Colab, and for that I'm using my IMDb watchlist as a dataset.

In the Genres column, a single cell can hold several genres, which makes things more difficult. I'm trying to calculate the proportion of each genre in the dataset and then plot it, perhaps as a pie or horizontal bar (barh) chart.

[Image: dataset]

So I created one variable per genre, each storing the value_counts() of a True/False mask that tells whether the genre appears in a row, as you can see below:

action = df['Genres'].str.contains('Action').value_counts()
animation = df['Genres'].str.contains('Animation').value_counts()
biography = df['Genres'].str.contains('Biography').value_counts()
comedy = df['Genres'].str.contains('Comedy').value_counts()
crime = df['Genres'].str.contains('Crime').value_counts()
drama = df['Genres'].str.contains('Drama').value_counts()
documentary = df['Genres'].str.contains('Documentary').value_counts()
family = df['Genres'].str.contains('Family').value_counts()
fantasy = df['Genres'].str.contains('Fantasy').value_counts()
film_noir = df['Genres'].str.contains('Film-Noir').value_counts()
history = df['Genres'].str.contains('History').value_counts()
horror = df['Genres'].str.contains('Horror').value_counts()
mystery = df['Genres'].str.contains('Mystery').value_counts()
music = df['Genres'].str.contains('Music').value_counts()
musical = df['Genres'].str.contains('Musical').value_counts()
romance = df['Genres'].str.contains('Romance').value_counts()
scifi = df['Genres'].str.contains('Sci-Fi').value_counts()
sport = df['Genres'].str.contains('Sport').value_counts()
thriller = df['Genres'].str.contains('Thriller').value_counts()
war = df['Genres'].str.contains('War').value_counts()
western = df['Genres'].str.contains('Western').value_counts()

Then I put these variables into a DataFrame:

genres = pd.DataFrame(
    [action, animation, biography,
     comedy, crime, drama,
     documentary, family, fantasy,
     film_noir, history, horror,
     mystery, music, musical,
     romance, scifi, sport,
     thriller, war, western],
    )
genres.head(5)

The problem is in the output:

[Image: output]

I'd like it to display the variable names instead of 'Genres', which is what is being shown in the first column. Is that possible?


There are 2 best solutions below

Laurent B. On BEST ANSWER

To avoid using a relatively slow for loop, you can split and explode the Genres column and then count with crosstab.

Suppose we have the following DataFrame:

                       Genres
0              Comedy, Horror
1          Comedy, Drama, War
2  Mistery, Romance, Thriller

Proposed code

import pandas as pd

# create the original DataFrame
df = pd.DataFrame({'Genres': ['Comedy, Horror', 'Comedy, Drama, War', 'Mistery, Romance, Thriller']})

# split the genres by comma and remove leading spaces
df['Genres'] = df['Genres'].str.split(',').apply(lambda x: [i.strip() for i in x])

# explode the list into separate rows
df = df.explode('Genres')

# Counting Matrix using crosstab method
genre_counts = pd.crosstab(index=df.index, columns=df['Genres'], margins=False).to_dict('index')

genre_counts = pd.DataFrame(genre_counts)

# count the number of 0s and 1s in each row
counts = ( genre_counts.apply(lambda row: [sum(row == 0), sum(row == 1)], axis=1) )

# Final count with 2 columns 'False' and 'True'
counts = pd.DataFrame(counts.tolist(), index=counts.index, columns=['False', 'True'])

print(counts)

Output

          False  True
Comedy        1     2
Drama         2     1
Horror        2     1
Mistery       2     1
Romance       2     1
Thriller      2     1
War           2     1
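
A shorter route to the same kind of table is sketched below. It is not part of the code above: it swaps the split/explode/crosstab steps for Series.str.get_dummies, and it assumes every cell separates genres with exactly a comma followed by a space. The names indicators and true_counts are made up for illustration.

```python
import pandas as pd

# Same sample data as in the answer above (including its spelling of 'Mistery').
df = pd.DataFrame({'Genres': ['Comedy, Horror', 'Comedy, Drama, War', 'Mistery, Romance, Thriller']})

# One 0/1 indicator column per distinct genre name.
# Assumption: the separator is always ', ' (comma plus one space).
indicators = df['Genres'].str.get_dummies(sep=', ')

# Number of rows that contain each genre.
true_counts = indicators.sum()

# Rows are labelled by genre; the columns hold the False and True counts.
counts = pd.DataFrame({'False': len(df) - true_counts, 'True': true_counts})

print(counts)
```

Because the genre names become the index, the row labels show the genres rather than the original column name.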
Marilyn Smith On

I think you can achieve this by creating a DataFrame using a dictionary where keys are the genre names, and values are the corresponding Series containing the counts. Here's an example:

import pandas as pd

# Sample DataFrame
data = {'Genres': ['Action, Drama', 'Comedy, Romance', 'Action, Comedy', 'Drama', 'Comedy']}
df = pd.DataFrame(data)

# List of genres
genre_list = ['Action', 'Animation', 'Biography', 'Comedy', 'Crime', 'Drama', 'Documentary', 'Family',
              'Fantasy', 'Film-Noir', 'History', 'Horror', 'Mystery', 'Music', 'Musical', 'Romance',
              'Sci-Fi', 'Sport', 'Thriller', 'War', 'Western']

# Create a dictionary to store genre counts
genre_counts = {}

# Populate the dictionary with counts
for genre in genre_list:
    genre_counts[genre] = df['Genres'].str.contains(genre).sum()

# Create a DataFrame from the dictionary
genres_df = pd.DataFrame(list(genre_counts.items()), columns=['Genre', 'Count'])

# Display the DataFrame
print(genres_df)

This code creates a dictionary (genre_counts) where keys are genre names, and values are the counts of each genre in the 'Genres' column. Then, it converts the dictionary into a DataFrame (genres_df) and displays it. This way, the DataFrame will have 'Genre' and 'Count' columns instead of 'Genres'.
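
One caveat with str.contains is that it matches substrings, so 'Music' also counts rows that only contain 'Musical'. The sketch below is one possible way around that: it compares whole genre names and keeps separate False and True counts as in the question, with the genre name as the row label. It assumes genres are separated by a comma and a space and that no cell is missing; the sample data and the names genre_lists and rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the real watchlist is not shown in the thread.
df = pd.DataFrame({'Genres': ['Action, Drama', 'Comedy, Musical', 'Action, Comedy', 'Drama', 'Comedy']})

# Shortened genre list for the example.
genre_list = ['Action', 'Comedy', 'Drama', 'Music', 'Musical']

# Split each cell into a list of exact genre names.
# Assumption: the separator is always ', ' and there are no missing values.
genre_lists = df['Genres'].str.split(', ')

rows = {}
for genre in genre_list:
    # True where the row lists this exact genre (no substring matching).
    has_genre = genre_lists.apply(lambda names: genre in names)
    n_true = int(has_genre.sum())
    # The dictionary key becomes the row label in the final DataFrame.
    rows[genre] = {'False': len(df) - n_true, 'True': n_true}

# orient='index' turns each dictionary key into a row.
genres = pd.DataFrame.from_dict(rows, orient='index')

print(genres)
```

Counting explicitly instead of calling value_counts() also means a genre that never appears still gets a 0 in the True column rather than a missing value.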