df.drop_duplicates() in polars?

import polars as pl

df = pl.DataFrame(
    {
        "X": [4, 2, 3, 4],
        "Y": ["p", "p", "p", "p"],
        "Z": ["b", "b", "b", "b"],
    }
)

We know that the equivalent of pandas's df.drop_duplicates() is df.unique() in polars.

But each time I execute my query, I get the rows in a different order:

print(df.unique())

X   Y   Z
i64 str str
3   "p" "b"
2   "p" "b"
4   "p" "b"

X   Y   Z
i64 str str
4   "p" "b"
2   "p" "b"
3   "p" "b"

X   Y   Z
i64 str str
2   "p" "b"
3   "p" "b"
4   "p" "b"

Is this intentional and what is the reason behind it?

Talha Tayyab On BEST ANSWER

Yes, this is intentional behavior.

If you need a consistent row order, use:

df.unique(maintain_order=True)

polars.DataFrame.unique

maintain_order

Keep the same order as the original DataFrame. This is more expensive to compute. Setting this to True blocks the possibility to run on the streaming engine.

Maintaining order is not streaming-friendly as it requires bringing together all the chunks in memory to compare the order of the rows.

With this change of default the developers want to ensure that Polars is ready to work with datasets of all sizes while allowing users to choose different behaviour if desired.

A related point is the choice of which row within each duplicated group is kept by unique. In pandas this defaults to the first row of each duplicated group. In Polars the default is keep="any", as this again allows more optimizations.

Other functions that have this behavior include:

1. group_by (maintain_order: bool = False)

2. partition_by (maintain_order: bool = True)

3. pivot (maintain_order: bool = True)

4. upsample (maintain_order: bool = False)

A detailed article here by @LiamBrannigan: https://www.rhosignal.com/posts/polars-ordering/