Using either melt or wide_to_long for a large dataset with inconsistent naming

I have a large dataset that is difficult to investigate without analysis tools. Its general form is shown below, but with 16 "ItemType0" columns and 16 each of "ItemType1", "ItemType2", etc.

It represents the (many) properties of up to 16 different items recorded at a single timestep, followed by properties of that timestep itself.

Time  ItemType0[0].property  ItemType0[1].property  Property
1     1                      0                      2
2     0                      1                      2
3     3                      3                      2

I'd like to receive:

Time  ItemType0.property  Property
1     1                   2
2     0                   2
3     3                   2
1     0                   2
2     1                   2
3     3                   2

A minimal reproducible example:

import pandas as pd

wide_df = pd.DataFrame({
    "Time": [1, 2, 3],
    "ItemType0[0].property": [1, 0, 3],
    "ItemType0[1].property": [0, 1, 3],
    "Property": [2, 2, 2],
})

What I've tried:

  1. Melt:

    ids = [col for col in wide_df.columns if "[" not in col]
    inter_df = pd.melt(wide_df, id_vars=ids, var_name="Source")
    

    MemoryError: Unable to allocate 28.3 GiB for an array with shape (15,506831712) and data type uint32

  2. I wouldn't even know where to begin with pd.wide_to_long, since the columns don't all start with the same stub (though, as sketched below, it could work after a renaming pass).
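
For reference, wide_to_long could apply if the bracketed index were first moved into a uniform suffix. A minimal sketch against the sample frame above; the renaming step and the single stub name are assumptions, and the real data would need every stub listed:

import pandas as pd

# Move the item index to a suffix: "ItemType0[0].property" -> "ItemType0.property_0"
df = wide_df.copy()
df.columns = df.columns.str.replace(r"^(\w+)\[(\d+)\](.+)$", r"\1\3_\2", regex=True)

long_df = pd.wide_to_long(
    df,
    stubnames=["ItemType0.property"],  # assumption: extend with every stub in the real data
    i=["Time", "Property"],
    j="item",
    sep="_",
    suffix=r"\d+",
).reset_index()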

2 Answers

Andrej Kesely:

IIUC, you can try to group the properties by ItemTypeX and then explode:

# Drop the bracketed item index so same-named property columns group together
df = wide_df.copy()
df.columns = df.columns.str.replace(r"\[\d+\]", "", regex=True)

# Collect each group of identically named columns into lists, then explode
df = df.set_index(["Time", "Property"])
df = df.T.groupby(df.columns).agg(list).T

print(df.reset_index().explode(df.columns.to_list()))

Prints:

   Time  Property ItemType0.property
0     1         2                  1
0     1         2                  0
1     2         2                  0
1     2         2                  1
2     3         2                  3
2     3         2                  3
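
One caveat: after agg(list) the column holds Python lists, so the exploded column comes back with object dtype. Assuming the properties are numeric, infer_objects() restores a numeric dtype:

out = df.reset_index().explode(df.columns.to_list()).infer_objects()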

sammywemmy:

One option is with pivot_longer, where you pass multiple .value entries to the names_to argument, matching the multiple groups in the names_pattern argument:

# pip install pyjanitor
import janitor
import pandas as pd

(wide_df
.pivot_longer(
    column_names="*[*", 
    names_to=('.value','.value'), 
    names_pattern=r"(.+)\[\d+\](.+)"
    )
)
   Time  Property  ItemType0.property
0     1         2                   1
1     2         2                   0
2     3         2                   3
3     1         2                   0
4     2         2                   1
5     3         2                   3

pivot_longer abstracts the reshaping process. Another option, solely within Pandas, is to reshape the columns into a MultiIndex, followed by stack:

ids = [col for col in wide_df if '[' not in col]
reshaped = wide_df.set_index(ids)
# Split each name into (stub, "[n]", rest); the capture group keeps "[n]"
reshaped.columns = (reshaped
                    .columns
                    .str
                    .split(r'(\[\d+\])', expand=True)
                    .set_names([None, 'drop', None])
                   )
# Move the "[n]" level into the rows, then discard it
reshaped = reshaped.stack(level='drop').droplevel('drop')
# Glue the remaining levels back together: "ItemType0" + ".property"
reshaped.columns = reshaped.columns.map(''.join)
reshaped.reset_index()

   Time  Property  ItemType0.property
0     1         2                   1
1     1         2                   0
2     2         2                   0
3     2         2                   1
4     3         2                   3
5     3         2                   3
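
To see what the split is doing, run it on one of the sample names; because the pattern has a capture group, the bracketed index becomes its own level:

pd.Index(["ItemType0[0].property"]).str.split(r"(\[\d+\])", expand=True)
# MultiIndex([('ItemType0', '[0]', '.property')], )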

Of course, with melting you are increasing the number of rows, which, depending on your dataset, will ultimately eat up memory. If you still get memory errors, you may have to think of a way to solve your issue without blowing the frame up, and it would help if you shared more insight into your end goal.
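
If memory remains the bottleneck, one option is to reshape a single item slot at a time and process (or write out) each chunk rather than materializing everything at once. A minimal sketch, assuming bracketed names like "ItemType0[0].property"; the helper iter_item_frames is hypothetical, not part of pandas:

import re
import pandas as pd

def iter_item_frames(df: pd.DataFrame):
    """Yield one narrow frame per bracketed item slot."""
    ids = [c for c in df.columns if "[" not in c]
    slots = sorted({m.group(1) for c in df.columns
                    if (m := re.search(r"\[(\d+)\]", c))}, key=int)
    for slot in slots:
        cols = [c for c in df.columns if f"[{slot}]" in c]
        part = df[ids + cols].copy()
        part.columns = ids + [re.sub(r"\[\d+\]", "", c) for c in cols]
        yield part  # process or persist each chunk here instead of concatenating

long_df = pd.concat(iter_item_frames(wide_df), ignore_index=True)

Note the final concat still builds the full result; to truly cap peak memory, replace it with a loop that writes each yielded chunk to disk.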