Is this the most performant way to rename a Polars DF column?

75 Views Asked by JJ Fantini At 01 March 2024 at 11:46

Issue:

I have a column name that can change its prefix and suffix based on some function arguments, but there is a section of the column name that is always the same. I need to rename that column to something easy for reference in a different workflow. I am in search of the quickest way to find the column I am looking for and rename it to my desired name.

I am using a for loop to check whether that fixed substring appears in each column name, but I doubt that this is the most performant way to rename a column based on pattern matching.

Solution + Reprex

This is what I have come up with:


import polars as pl

data = pl.DataFrame({
    "foo": [1, 2, 3, 4, 5],
    "bar": [5, 4, 3, 2, 1],
    "std_volatility_pct_21D": [0.1, 0.2, 0.15, 0.18, 0.16]
})

# Rename the first column whose name contains the fixed substring, then stop.
for col in data.columns:
    if "volatility_pct" in col:
        new_data = data.rename({col: "realized_volatility"})
        break

Performance

import polars as pl
import polars.selectors as cs

data = pl.DataFrame(
    {
        "foo": [1, 2, 3, 4, 5],
        "bar": [5, 4, 3, 2, 1],
        "std_volatility_pct_21D": [0.1, 0.2, 0.15, 0.18, 0.16],
    }
)


# 1
def rename_volatility_column(data):
    for col in data.columns:
        if "volatility_pct" in col:
            return data.rename({col: "realized_volatility"})
    return data


%timeit rename_volatility_column(data)


# 2
def adjust_volatility_column(data):
    return data.select(
        ~cs.contains("volatility_pct"),
        cs.contains("volatility_pct").alias("realized_volatility"),
    )


%timeit adjust_volatility_column(data)

# 3
%timeit data.rename(lambda col: "realized_volatility" if "volatility_pct" in col else col)

#1
18.8 µs ± 636 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

#2
330 µs ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

#3
133 µs ± 7.71 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


There are 2 best solutions below

Hericks On 01 March 2024 at 11:52 BEST ANSWER

You can use polars' column selectors.

~cs.contains("volatility_pct") selects all columns whose names do not contain volatility_pct
cs.contains("volatility_pct").alias("realized_volatility") selects all columns whose names contain volatility_pct and renames them to realized_volatility

import polars.selectors as cs

(
    data
    .select(
        ~cs.contains("volatility_pct"),
        cs.contains("volatility_pct").alias("realized_volatility"),
    )
)

jqurious On 01 March 2024 at 12:05

.rename() also accepts a Callable - which could perhaps be nicer to write.

data.rename(
    lambda col: "realized_volatility" if "volatility_pct" in col else col
)

shape: (5, 3)
┌─────┬─────┬─────────────────────┐
│ foo ┆ bar ┆ realized_volatility │
│ --- ┆ --- ┆ ---                 │
│ i64 ┆ i64 ┆ f64                 │
╞═════╪═════╪═════════════════════╡
│ 1   ┆ 5   ┆ 0.1                 │
│ 2   ┆ 4   ┆ 0.2                 │
│ 3   ┆ 3   ┆ 0.15                │
│ 4   ┆ 2   ┆ 0.18                │
│ 5   ┆ 1   ┆ 0.16                │
└─────┴─────┴─────────────────────┘

It doesn't seem like there would be much difference performance-wise with any of the approaches.
