Why I cannot replace the len below with "count" and keep same output behavior?. I mean, it runs without error, but the output is very different, if I change the len to count, I get an empty dataset.
import pandas as pd
data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]}
df = pd.DataFrame(data)
pivot_table = df.pivot_table(index='Category', columns='Value', values='Value', aggfunc=len)
print(pivot_table)
Results in:
Value 10 20 30 40 50 60 70 80 90
Category
A 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN
B NaN 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN
C NaN NaN 1.0 NaN NaN 1.0 NaN NaN 1.0
size produces same result as len:
pivot_table = df.pivot_table(index='Category', columns='Value', values='Value', aggfunc="size")
Even weirder, I can make "count" work like "size" (or len) if I do this strange workaround: create a copy of Value with a different "ValueCopy" name :
df["ValueCopy"] = df["Value"]
pivot_table = df.pivot_table(index='Category', columns='Value', values='ValueCopy', aggfunc="count")
But if I change the pivot_table call to use "count" without the extra dummy column workaround:
pivot_table = df.pivot_table(index='Category', columns='Value', values='Value', aggfunc="count")
it results in:
Empty DataFrame
Columns: []
Index: [A, B, C]
Questions:
- Have I hit a Pandas bug?
- Is this expected Pandas behavior?
- How does
"count"work if it does not calllen? (I tried to read the pivot_table Pandas code, but I found it too convoluted) - Why creating a dummy copy of the value column makes count work like size (and len) ?
- Is the fact that count works fine in Polars & DuckDb indicative of this being a defect in Pandas? (see below)
In Polars, "count" works perfectly fine as:
import polars as pl
# Create a DataFrame
data = {
'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]
}
df = pl.DataFrame(data)
# Pivot the DataFrame
pivot_table = df.pivot(index='Category', columns='Value', values='Value', aggregate_function="count")
# Print the pivot table
print(pivot_table)
results in:
Shape: (3, 10)
┌──────────┬──────┬──────┬──────┬───┬──────┬──────┬──────┬──────┐
│ Category ┆ 10 ┆ 20 ┆ 30 ┆ … ┆ 60 ┆ 70 ┆ 80 ┆ 90 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ u32 ┆ u32 ┆ ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
╞══════════╪══════╪══════╪══════╪═══╪══════╪══════╪══════╪══════╡
│ A ┆ 1 ┆ null ┆ null ┆ … ┆ null ┆ 1 ┆ null ┆ null │
│ B ┆ null ┆ 1 ┆ null ┆ … ┆ null ┆ null ┆ 1 ┆ null │
│ C ┆ null ┆ null ┆ 1 ┆ … ┆ 1 ┆ null ┆ null ┆ 1 │
└──────────┴──────┴──────┴──────┴───┴──────┴──────┴──────┴──────┘
Likewise DuckDb SQL pivot count has no trouble counting:
import pandas as pd
import duckdb
# Your existing DataFrame
data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]}
df = pd.DataFrame(data)
# Create a DuckDB connection
con = duckdb.connect()
# Register the DataFrame with DuckDB
con.register('df_pivot', df)
# Perform the pivot operation in DuckDB
query = """
PIVOT df_pivot
ON Value
USING Count(Value)
GROUP BY Category
"""
pivot_table = con.execute(query).fetchdf()
print(pivot_table)
results in:
Category 10 20 30 40 50 60 70 80 90
0 A 1 0 0 1 0 0 1 0 0
1 B 0 1 0 0 1 0 0 1 0
2 C 0 0 1 0 0 1 0 0 1
In Pandas,
pivot_tableat some point callsgroupby(index+columns).In your example that means grouping on the only two existing columns, so there are no values to count.
When using
lenin this situation it returns the length of each group as a DataFrame (usingDataFrame.__len__, which returnslen(self.index)) -- which IMO seems more odd than returning nothing.