In Polars, select all columns ending with a pattern and add new columns without the pattern

Question

Asked by DeltaIV on 23 February 2024 at 16:19

I have the following dataframe:

import polars as pl
import numpy as np

df = pl.DataFrame({
    "nrs": [1, 2, 3, None, 5],
    "names_A0": ["foo", "ham", "spam", "egg", None],
    "random_A0": np.random.rand(5),
    "A_A2": [True, True, False, False, False],
})
digit = 0

For each column X whose name ends with the string suf = f'_A{digit}', I want to add an identical column to df whose name is the same as X, but without suf.

In the example, I need to add columns names and random to the original dataframe df, whose content is identical to that of columns names_A0 and random_A0 respectively.

Original Q&A

Hericks On 23 February 2024 at 16:27

You can use polars' column selectors to select the corresponding columns and then use .name.map to rename the output of the selector expression.

import polars.selectors as cs

df.with_columns(cs.matches(f"_A{digit}$").name.map(lambda name: name[:-3]))

shape: (5, 6)
┌──────┬──────────┬───────────┬───────┬───────┬──────────┐
│ nrs  ┆ names_A0 ┆ random_A0 ┆ A_A2  ┆ names ┆ random   │
│ ---  ┆ ---      ┆ ---       ┆ ---   ┆ ---   ┆ ---      │
│ i64  ┆ str      ┆ f64       ┆ bool  ┆ str   ┆ f64      │
╞══════╪══════════╪═══════════╪═══════╪═══════╪══════════╡
│ 1    ┆ foo      ┆ 0.626253  ┆ true  ┆ foo   ┆ 0.626253 │
│ 2    ┆ ham      ┆ 0.480437  ┆ true  ┆ ham   ┆ 0.480437 │
│ 3    ┆ spam     ┆ 0.789309  ┆ false ┆ spam  ┆ 0.789309 │
│ null ┆ egg      ┆ 0.126665  ┆ false ┆ egg   ┆ 0.126665 │
│ 5    ┆ null     ┆ 0.522989  ┆ false ┆ null  ┆ 0.522989 │
└──────┴──────────┴───────────┴───────┴───────┴──────────┘

Note: the pattern selects every column whose name ends with "_A" followed by the value of digit; the trailing $ anchors the match to the end of the name. Because digit is a single character here, the suffix is exactly 3 characters long, so the new name is the original name without its last 3 characters.
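The hard-coded 3 only holds while digit is a single character. Below is a minimal sketch of a length-agnostic variant. The name strip_suffix is a hypothetical helper, not part of Polars; it is written as a plain function so it can be tried without a dataframe.

```python
def strip_suffix(name: str, suf: str) -> str:
    """Return `name` without a trailing `suf`; leave other names unchanged."""
    # Slice by the suffix's actual length instead of a hard-coded 3.
    if suf and name.endswith(suf):
        return name[: -len(suf)]
    return name


# Works for one-digit and multi-digit suffixes alike.
for digit in (0, 12):
    suf = f"_A{digit}"
    print(suf, "->", strip_suffix(f"names{suf}", suf))
```

With the dataframe from the question, the helper could replace the slicing lambda, for example .name.map(lambda name: strip_suffix(name, f"_A{digit}")).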

**Cameron Riddell** · Accepted Answer · 2024-02-23T17:11:20.520000

You can use Polars selectors along with some basic string operations to accomplish this. Depending on how you expect this problem to evolve, you can jump straight to regular expressions, or use polars.selectors.ends_with together with str.removesuffix.

String Suffix Operations

This approach uses

- polars.selectors.ends_with # find columns ending with string
- str.removesuffix           # remove suffix from end of string (Python 3.9+)

translating to

import polars as pl
from polars import selectors as cs
import numpy as np

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names_A0": ["foo", "ham", "spam", "egg", None],
        "random_A0": np.random.rand(5),
        "A_A2": [True, True, False, False, False],
    }
)
digit = 0
suffix = f'_A{digit}'

print(
    # keep original A0 columns
    df.with_columns(
        cs.ends_with(suffix).name.map(lambda s: s.removesuffix(suffix))
    ),
    # shape: (5, 6)
    # ┌──────┬──────────┬───────────┬───────┬───────┬──────────┐
    # │ nrs  ┆ names_A0 ┆ random_A0 ┆ A_A2  ┆ names ┆ random   │
    # │ ---  ┆ ---      ┆ ---       ┆ ---   ┆ ---   ┆ ---      │
    # │ i64  ┆ str      ┆ f64       ┆ bool  ┆ str   ┆ f64      │
    # ╞══════╪══════════╪═══════════╪═══════╪═══════╪══════════╡
    # │ 1    ┆ foo      ┆ 0.713324  ┆ true  ┆ foo   ┆ 0.713324 │
    # │ 2    ┆ ham      ┆ 0.980031  ┆ true  ┆ ham   ┆ 0.980031 │
    # │ 3    ┆ spam     ┆ 0.242768  ┆ false ┆ spam  ┆ 0.242768 │
    # │ null ┆ egg      ┆ 0.528783  ┆ false ┆ egg   ┆ 0.528783 │
    # │ 5    ┆ null     ┆ 0.583206  ┆ false ┆ null  ┆ 0.583206 │
    # └──────┴──────────┴───────────┴───────┴───────┴──────────┘


    # drop original A0 columns
    df.select(
        ~cs.ends_with(suffix),
        cs.ends_with(suffix).name.map(lambda s: s.removesuffix(suffix))
    ),
    # shape: (5, 4)
    # ┌──────┬───────┬───────┬──────────┐
    # │ nrs  ┆ A_A2  ┆ names ┆ random   │
    # │ ---  ┆ ---   ┆ ---   ┆ ---      │
    # │ i64  ┆ bool  ┆ str   ┆ f64      │
    # ╞══════╪═══════╪═══════╪══════════╡
    # │ 1    ┆ true  ┆ foo   ┆ 0.713324 │
    # │ 2    ┆ true  ┆ ham   ┆ 0.980031 │
    # │ 3    ┆ false ┆ spam  ┆ 0.242768 │
    # │ null ┆ false ┆ egg   ┆ 0.528783 │
    # │ 5    ┆ false ┆ null  ┆ 0.583206 │
    # └──────┴───────┴───────┴──────────┘

    sep='\n\n'
)
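To illustrate what the renaming lambda does to each column name, here is a small plain-Python sketch (no Polars needed). It also contrasts str.removesuffix with str.rstrip, which is sometimes reached for by mistake. The extra column name "A_A0" is a hypothetical addition, included only to expose the difference.

```python
suffix = "_A0"
columns = ["nrs", "names_A0", "random_A0", "A_A2", "A_A0"]

# removesuffix (Python 3.9+) drops the exact trailing string, at most once,
# and leaves names that do not end with it untouched.
with_removesuffix = [c.removesuffix(suffix) for c in columns]
print(with_removesuffix)  # ['nrs', 'names', 'random', 'A_A2', 'A']

# rstrip treats its argument as a *set of characters* ('_', 'A', '0'),
# so it can strip more than the suffix.
with_rstrip = [c.rstrip(suffix) for c in columns]
print(with_rstrip)  # ['nrs', 'names', 'random', 'A_A2', '']
```

The empty string in the last result is why removesuffix is the safer choice for the function passed to .name.map.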

Regular Expressions

Alternatively you can use regular expressions to detect a range of suffix patterns

- polars.selectors.matches  # find columns matching a pattern
- re.sub                    # substitute in string based on pattern

We will need to ensure our pattern ends with a '$' to anchor the pattern to the end of the string.

import polars as pl
from polars import selectors as cs
import numpy as np
import re

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names_A0": ["foo", "ham", "spam", "egg", None],
        "random_A0": np.random.rand(5),
        "A_A2": [True, True, False, False, False],
    }
)
digit = 0
suffix = fr'_A{digit}$'

print(
    # keep original A0 columns
    df.with_columns(
        cs.matches(suffix).name.map(lambda s: re.sub(suffix, '', s))
    ),
    # shape: (5, 6)
    # ┌──────┬──────────┬───────────┬───────┬───────┬──────────┐
    # │ nrs  ┆ names_A0 ┆ random_A0 ┆ A_A2  ┆ names ┆ random   │
    # │ ---  ┆ ---      ┆ ---       ┆ ---   ┆ ---   ┆ ---      │
    # │ i64  ┆ str      ┆ f64       ┆ bool  ┆ str   ┆ f64      │
    # ╞══════╪══════════╪═══════════╪═══════╪═══════╪══════════╡
    # │ 1    ┆ foo      ┆ 0.713324  ┆ true  ┆ foo   ┆ 0.713324 │
    # │ 2    ┆ ham      ┆ 0.980031  ┆ true  ┆ ham   ┆ 0.980031 │
    # │ 3    ┆ spam     ┆ 0.242768  ┆ false ┆ spam  ┆ 0.242768 │
    # │ null ┆ egg      ┆ 0.528783  ┆ false ┆ egg   ┆ 0.528783 │
    # │ 5    ┆ null     ┆ 0.583206  ┆ false ┆ null  ┆ 0.583206 │
    # └──────┴──────────┴───────────┴───────┴───────┴──────────┘


    # drop original A0 columns
    df.select(
        ~cs.matches(suffix),
        cs.matches(suffix).name.map(lambda s: re.sub(suffix, '', s))
    ),
    # shape: (5, 4)
    # ┌──────┬───────┬───────┬──────────┐
    # │ nrs  ┆ A_A2  ┆ names ┆ random   │
    # │ ---  ┆ ---   ┆ ---   ┆ ---      │
    # │ i64  ┆ bool  ┆ str   ┆ f64      │
    # ╞══════╪═══════╪═══════╪══════════╡
    # │ 1    ┆ true  ┆ foo   ┆ 0.713324 │
    # │ 2    ┆ true  ┆ ham   ┆ 0.980031 │
    # │ 3    ┆ false ┆ spam  ┆ 0.242768 │
    # │ null ┆ false ┆ egg   ┆ 0.528783 │
    # │ 5    ┆ false ┆ null  ┆ 0.583206 │
    # └──────┴───────┴───────┴──────────┘

    sep='\n\n'
)
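The effect of the '$' anchor, and of widening the pattern to a range of suffixes, can be checked on plain strings with re alone. This is a small sketch; the names "x_A0_total" and "score_A07" and the pattern _A\d+$ are hypothetical examples, not taken from the question.

```python
import re

digit = 0
names = ["names_A0", "x_A0_total", "A_A2", "score_A07"]

anchored = re.compile(rf"_A{digit}$")   # only a trailing "_A0"
unanchored = re.compile(rf"_A{digit}")  # "_A0" anywhere in the name
any_digits = re.compile(r"_A\d+$")      # any trailing "_A" plus digits

anchored_out = [anchored.sub("", n) for n in names]
unanchored_out = [unanchored.sub("", n) for n in names]
any_digits_out = [any_digits.sub("", n) for n in names]

print(anchored_out)    # ['names', 'x_A0_total', 'A_A2', 'score_A07']
print(unanchored_out)  # ['names', 'x_total', 'A_A2', 'score7']
print(any_digits_out)  # ['names', 'x_A0_total', 'A', 'score']
```

Without the anchor, "_A0" is also removed from the middle of a name. If a wider pattern such as the last one is used, two columns may map to the same stripped name, so it is worth checking the resulting names for duplicates before applying them.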
