Why pandas pivot_table count over a dataframe with 2 columns works with aggfunc=len, but results in empty dataset with count?

Question

Why pandas pivot_table count over a dataframe with 2 columns works with aggfunc=len, but results in empty dataset with count?

132 Views Asked by Luxspes At 16 March 2024 at 22:02

Why I cannot replace the len below with "count" and keep same output behavior?. I mean, it runs without error, but the output is very different, if I change the len to count, I get an empty dataset.

import pandas as pd

data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
        'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]}

df = pd.DataFrame(data)

pivot_table = df.pivot_table(index='Category', columns='Value', values='Value', aggfunc=len)

print(pivot_table)

Results in:

Value      10   20   30   40   50   60   70   80   90
Category                                             
A         1.0  NaN  NaN  1.0  NaN  NaN  1.0  NaN  NaN
B         NaN  1.0  NaN  NaN  1.0  NaN  NaN  1.0  NaN
C         NaN  NaN  1.0  NaN  NaN  1.0  NaN  NaN  1.0

size produces same result as len:

pivot_table = df.pivot_table(index='Category', columns='Value', values='Value', aggfunc="size")

Even weirder, I can make "count" work like "size" (or len) if I do this strange workaround: create a copy of Value with a different "ValueCopy" name :

df["ValueCopy"] = df["Value"]
pivot_table = df.pivot_table(index='Category', columns='Value', values='ValueCopy', aggfunc="count")

But if I change the pivot_table call to use "count" without the extra dummy column workaround:

pivot_table = df.pivot_table(index='Category', columns='Value', values='Value', aggfunc="count")

it results in:

Empty DataFrame
Columns: []
Index: [A, B, C]

Questions:

Have I hit a Pandas bug?
Is this expected Pandas behavior?
How does "count" work if it does not call len? (I tried to read the pivot_table Pandas code, but I found it too convoluted)
Why creating a dummy copy of the value column makes count work like size (and len) ?
Is the fact that count works fine in Polars & DuckDb indicative of this being a defect in Pandas? (see below)

In Polars, "count" works perfectly fine as:

import polars as pl

# Create a DataFrame
data = {
    'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]
}
df = pl.DataFrame(data)

# Pivot the DataFrame
pivot_table = df.pivot(index='Category', columns='Value', values='Value', aggregate_function="count")

# Print the pivot table
print(pivot_table)

results in:

Shape: (3, 10)
┌──────────┬──────┬──────┬──────┬───┬──────┬──────┬──────┬──────┐
│ Category ┆ 10   ┆ 20   ┆ 30   ┆ … ┆ 60   ┆ 70   ┆ 80   ┆ 90   │
│ ---      ┆ ---  ┆ ---  ┆ ---  ┆   ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ str      ┆ u32  ┆ u32  ┆ u32  ┆   ┆ u32  ┆ u32  ┆ u32  ┆ u32  │
╞══════════╪══════╪══════╪══════╪═══╪══════╪══════╪══════╪══════╡
│ A        ┆ 1    ┆ null ┆ null ┆ … ┆ null ┆ 1    ┆ null ┆ null │
│ B        ┆ null ┆ 1    ┆ null ┆ … ┆ null ┆ null ┆ 1    ┆ null │
│ C        ┆ null ┆ null ┆ 1    ┆ … ┆ 1    ┆ null ┆ null ┆ 1    │
└──────────┴──────┴──────┴──────┴───┴──────┴──────┴──────┴──────┘

Likewise DuckDb SQL pivot count has no trouble counting:

import pandas as pd
import duckdb

# Your existing DataFrame
data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
        'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]}
df = pd.DataFrame(data)

# Create a DuckDB connection
con = duckdb.connect()

# Register the DataFrame with DuckDB
con.register('df_pivot', df)

# Perform the pivot operation in DuckDB
query = """

PIVOT df_pivot
ON Value
USING Count(Value)
GROUP BY Category

"""
pivot_table = con.execute(query).fetchdf()

print(pivot_table)

results in:

  Category  10  20  30  40  50  60  70  80  90
0        A   1   0   0   1   0   0   1   0   0
1        B   0   1   0   0   1   0   0   1   0
2        C   0   0   1   0   0   1   0   0   1

Original Q&A

There are 1 best solutions below

**e-motta** · Accepted Answer · 2024-03-17T20:42:51.117000

In Pandas, pivot_table at some point calls groupby(index+columns).

In your example that means grouping on the only two existing columns, so there are no values to count.

When using len in this situation it returns the length of each group as a DataFrame (using DataFrame.__len__, which returns len(self.index)) -- which IMO seems more odd than returning nothing.

Why pandas pivot_table count over a dataframe with 2 columns works with aggfunc=len, but results in empty dataset with count?

There are 1 best solutions below

Related Questions in PANDAS

Related Questions in COUNT

Related Questions in PIVOT-TABLE

Trending Questions

Popular # Hahtags

Popular Questions