In polars select all column ending with pattern and add new columns without pattern

117 Views Asked by At

I have the following dataframe:

import polars as pl
import numpy as np

df = pl.DataFrame({
    "nrs": [1, 2, 3, None, 5],
    "names_A0": ["foo", "ham", "spam", "egg", None],
    "random_A0": np.random.rand(5),
    "A_A2": [True, True, False, False, False],
})
digit = 0

For each column X whose name ends with the string suf =f'_A{digit}', I want to add an identical column to df, whose name is the same as X, but without suf.

In the example, I need to add columns names and random to the original dataframe df, whose content is identical to that of columns names_A0 and random_A0 respectively.

2

There are 2 best solutions below

0
Cameron Riddell On BEST ANSWER

You can you Polars Selectors along with some basic strings operations to accomplish this. Depending on what you how you expect this problem to evolve, you can jump straight to regular expressions, or use polars.selectors.ends_with/string.removesuffix

String Suffix Operations

This approach uses

- polars.selectors.ends_with # find columns ending with string
- string.removesuffix        # remove suffix from end of string

translating to

import polars as pl
from polars import selectors as cs
import numpy as np
import re
from functools import partial

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names_A0": ["foo", "ham", "spam", "egg", None],
        "random_A0": np.random.rand(5),
        "A_A2": [True, True, False, False, False],
    }
)
digit = 0
suffix = f'_A{digit}'

print(
    # keep original A0 columns
    df.with_columns(
        cs.ends_with(suffix).name.map(lambda s: s.removesuffix(suffix))
    ),
    # shape: (5, 6)
    # ┌──────┬──────────┬───────────┬───────┬───────┬──────────┐
    # │ nrs  ┆ names_A0 ┆ random_A0 ┆ A_A2  ┆ names ┆ random   │
    # │ ---  ┆ ---      ┆ ---       ┆ ---   ┆ ---   ┆ ---      │
    # │ i64  ┆ str      ┆ f64       ┆ bool  ┆ str   ┆ f64      │
    # ╞══════╪══════════╪═══════════╪═══════╪═══════╪══════════╡
    # │ 1    ┆ foo      ┆ 0.713324  ┆ true  ┆ foo   ┆ 0.713324 │
    # │ 2    ┆ ham      ┆ 0.980031  ┆ true  ┆ ham   ┆ 0.980031 │
    # │ 3    ┆ spam     ┆ 0.242768  ┆ false ┆ spam  ┆ 0.242768 │
    # │ null ┆ egg      ┆ 0.528783  ┆ false ┆ egg   ┆ 0.528783 │
    # │ 5    ┆ null     ┆ 0.583206  ┆ false ┆ null  ┆ 0.583206 │
    # └──────┴──────────┴───────────┴───────┴───────┴──────────┘


    # drop original A0 columns
    df.select(
        ~cs.ends_with(suffix),
        cs.ends_with(suffix).name.map(lambda s: s.removesuffix(suffix))
    ),
    # shape: (5, 4)
    # ┌──────┬───────┬───────┬──────────┐
    # │ nrs  ┆ A_A2  ┆ names ┆ random   │
    # │ ---  ┆ ---   ┆ ---   ┆ ---      │
    # │ i64  ┆ bool  ┆ str   ┆ f64      │
    # ╞══════╪═══════╪═══════╪══════════╡
    # │ 1    ┆ true  ┆ foo   ┆ 0.713324 │
    # │ 2    ┆ true  ┆ ham   ┆ 0.980031 │
    # │ 3    ┆ false ┆ spam  ┆ 0.242768 │
    # │ null ┆ false ┆ egg   ┆ 0.528783 │
    # │ 5    ┆ false ┆ null  ┆ 0.583206 │
    # └──────┴───────┴───────┴──────────┘

    sep='\n\n'
)

Regular Expressions

Alternatively you can use regular expressions to detect a range of suffix patterns

- polars.selectors.matches  # find columns matching a pattern
- re.sub                    # substitute in string based on pattern

We will need to ensure our pattern ends with a '$' to anchor the pattern to the end of the string.

import polars as pl
from polars import selectors as cs
import numpy as np
import re
from functools import partial

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names_A0": ["foo", "ham", "spam", "egg", None],
        "random_A0": np.random.rand(5),
        "A_A2": [True, True, False, False, False],
    }
)
digit=0
suffix = fr'_A{digit}$'

print(
    # keep original A0 columns
    df.with_columns(
        cs.matches(suffix).name.map(lambda s: re.sub(suffix, '', s))
    ),
    # shape: (5, 6)
    # ┌──────┬──────────┬───────────┬───────┬───────┬──────────┐
    # │ nrs  ┆ names_A0 ┆ random_A0 ┆ A_A2  ┆ names ┆ random   │
    # │ ---  ┆ ---      ┆ ---       ┆ ---   ┆ ---   ┆ ---      │
    # │ i64  ┆ str      ┆ f64       ┆ bool  ┆ str   ┆ f64      │
    # ╞══════╪══════════╪═══════════╪═══════╪═══════╪══════════╡
    # │ 1    ┆ foo      ┆ 0.713324  ┆ true  ┆ foo   ┆ 0.713324 │
    # │ 2    ┆ ham      ┆ 0.980031  ┆ true  ┆ ham   ┆ 0.980031 │
    # │ 3    ┆ spam     ┆ 0.242768  ┆ false ┆ spam  ┆ 0.242768 │
    # │ null ┆ egg      ┆ 0.528783  ┆ false ┆ egg   ┆ 0.528783 │
    # │ 5    ┆ null     ┆ 0.583206  ┆ false ┆ null  ┆ 0.583206 │
    # └──────┴──────────┴───────────┴───────┴───────┴──────────┘


    # drop original A0 columns
    df.select(
        ~cs.matches(suffix),
        cs.matches(suffix).name.map(lambda s: re.sub(suffix, '', s))
    ),
    # shape: (5, 4)
    # ┌──────┬───────┬───────┬──────────┐
    # │ nrs  ┆ A_A2  ┆ names ┆ random   │
    # │ ---  ┆ ---   ┆ ---   ┆ ---      │
    # │ i64  ┆ bool  ┆ str   ┆ f64      │
    # ╞══════╪═══════╪═══════╪══════════╡
    # │ 1    ┆ true  ┆ foo   ┆ 0.713324 │
    # │ 2    ┆ true  ┆ ham   ┆ 0.980031 │
    # │ 3    ┆ false ┆ spam  ┆ 0.242768 │
    # │ null ┆ false ┆ egg   ┆ 0.528783 │
    # │ 5    ┆ false ┆ null  ┆ 0.583206 │
    # └──────┴───────┴───────┴──────────┘

    sep='\n\n'
)
6
Hericks On

You can use polars' column selectors to select the corresponding columns and then use .name.map to rename the output of the selector expression.

import polars.selectors as cs

df.with_columns(cs.matches(f"_A{digit}$").name.map(lambda name: name[:-3]))
shape: (5, 6)
┌──────┬──────────┬───────────┬───────┬───────┬──────────┐
│ nrs  ┆ names_A0 ┆ random_A0 ┆ A_A2  ┆ names ┆ random   │
│ ---  ┆ ---      ┆ ---       ┆ ---   ┆ ---   ┆ ---      │
│ i64  ┆ str      ┆ f64       ┆ bool  ┆ str   ┆ f64      │
╞══════╪══════════╪═══════════╪═══════╪═══════╪══════════╡
│ 1    ┆ foo      ┆ 0.626253  ┆ true  ┆ foo   ┆ 0.626253 │
│ 2    ┆ ham      ┆ 0.480437  ┆ true  ┆ ham   ┆ 0.480437 │
│ 3    ┆ spam     ┆ 0.789309  ┆ false ┆ spam  ┆ 0.789309 │
│ null ┆ egg      ┆ 0.126665  ┆ false ┆ egg   ┆ 0.126665 │
│ 5    ┆ null     ┆ 0.522989  ┆ false ┆ null  ┆ 0.522989 │
└──────┴──────────┴───────────┴───────┴───────┴──────────┘

Note. In the example above, we select all columns with names that contain the string "_A" followed a digit, followed by the string end ($). As the suffix is guaranteed to be of length 3, the new name is equal to the original name without the last 3 letters.