import polars as pl
df = pl.DataFrame(
    {
        "X": [4, 2, 3, 4],
        "Y": ["p", "p", "p", "p"],
        "Z": ["b", "b", "b", "b"],
    }
)
We know that the equivalent of pandas's df.drop_duplicates() is df.unique() in python-polars. But each time I execute my query I get the rows back in a different order:
print(df.unique())

Run 1:
 X    Y    Z
 i64  str  str
 3    "p"  "b"
 2    "p"  "b"
 4    "p"  "b"

Run 2:
 X    Y    Z
 i64  str  str
 4    "p"  "b"
 2    "p"  "b"
 3    "p"  "b"

Run 3:
 X    Y    Z
 i64  str  str
 2    "p"  "b"
 3    "p"  "b"
 4    "p"  "b"
Is this intentional and what is the reason behind it?
Yes, this is intentional behavior.

If you need a consistent row order, pass maintain_order=True to polars.DataFrame.unique.
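A minimal sketch using the DataFrame from the question (maintain_order is a documented parameter of DataFrame.unique):

import polars as pl

df = pl.DataFrame(
    {
        "X": [4, 2, 3, 4],
        "Y": ["p", "p", "p", "p"],
        "Z": ["b", "b", "b", "b"],
    }
)

# maintain_order=True keeps the rows in the same order as the original
# frame, so repeated runs print identical output.
print(df.unique(maintain_order=True))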
Maintaining order is not streaming-friendly, as it requires bringing all of the chunks together in memory to compare the order of the rows. With this change of default, the developers want to ensure that Polars is ready to work with datasets of all sizes, while still allowing users to choose a different behaviour if desired.
A related point is the choice of which row within each duplicated group is kept by unique. In pandas this defaults to the first row of each duplicated group (keep="first"). In Polars the default is keep="any", as this again allows more optimizations; a sketch of the pandas-like behaviour follows the list below.

Other functions that have this behavior include:
1. group_by(maintain_order: bool = False)
2. partition_by(maintain_order: bool = True)
3. pivot(maintain_order: bool = True)
4. upsample(maintain_order: bool = False)

There is a detailed article on this by @LiamBrannigan: https://www.rhosignal.com/posts/polars-ordering/
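A minimal sketch of the keep behaviour mentioned above, reusing the df from the question (keep and maintain_order are documented parameters of DataFrame.unique; group_by's maintain_order follows the same pattern):

import polars as pl

df = pl.DataFrame(
    {
        "X": [4, 2, 3, 4],
        "Y": ["p", "p", "p", "p"],
        "Z": ["b", "b", "b", "b"],
    }
)

# keep="first" together with maintain_order=True mimics pandas'
# df.drop_duplicates(): the first row of each duplicated group is kept,
# in the original row order.
print(df.unique(keep="first", maintain_order=True))

# keep="any" (the default) lets Polars keep whichever duplicate is
# cheapest to produce, which is what enables the extra optimizations.
print(df.unique(keep="any"))

# The same trade-off applies to group_by: pass maintain_order=True when
# the group order must match the order of first appearance.
print(df.group_by("X", maintain_order=True).agg(pl.col("Y").first()))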