Using either melt or wide_to_long for a large dataset with inconsistent naming

I have a large dataset that is difficult to investigate without analysis tools. Its general form is shown below, but with 16 "ItemType0" columns and 16 each of "ItemType1", "ItemType2", etc.

It represents the (many) properties of up to 16 different items recorded at a single timestep, followed by properties of that timestep itself.

Time  ItemType0[0].property  ItemType0[1].property  Property
1     1                      0                      2
2     0                      1                      2
3     3                      3                      2

I'd like to receive:

Time  ItemType0.property  Property
1     1                   2
2     0                   2
3     3                   2
1     0                   2
2     1                   2
3     3                   2

A minimal reproducible example:

import pandas as pd

wide_df = pd.DataFrame({
    "Time": [1, 2, 3],
    "ItemType0[0].property": [1, 0, 3],
    "ItemType0[1].property": [0, 1, 3],
    "Property": [2, 2, 2],
})

What I've tried:

  1. Melt:

    ids = [col for col in wide_df.columns if "[" not in col]
    inter_df = pd.melt(wide_df, id_vars=ids, var_name="Source")
    

    MemoryError: Unable to allocate 28.3 GiB for an array with shape (15,506831712) and data type uint32

  2. I wouldn't even know where to begin with pd.wide_to_long, since the columns don't all start with the same stub (though, as sketched below, it could work after a renaming pass).
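
For reference, wide_to_long could apply if the bracketed index were first moved into a uniform suffix. A minimal sketch against the sample frame above; the renaming step and the single stub name are assumptions, and the real data would need every stub listed:

import pandas as pd

# Move the item index to a suffix: "ItemType0[0].property" -> "ItemType0.property_0"
df = wide_df.copy()
df.columns = df.columns.str.replace(r"^(\w+)\[(\d+)\](.+)$", r"\1\3_\2", regex=True)

long_df = pd.wide_to_long(
    df,
    stubnames=["ItemType0.property"],  # assumption: extend with every stub in the real data
    i=["Time", "Property"],
    j="item",
    sep="_",
    suffix=r"\d+",
).reset_index()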

2 Answers

Andrej Kesely:

IIUC, you can try to group the properties by ItemTypeX and then explode:

# Drop the bracketed item index so same-named property columns group together
df = wide_df.copy()
df.columns = df.columns.str.replace(r"\[\d+\]", "", regex=True)

# Collect each group of identically named columns into lists, then explode
df = df.set_index(["Time", "Property"])
df = df.T.groupby(df.columns).agg(list).T

print(df.reset_index().explode(df.columns.to_list()))

Prints:

   Time  Property ItemType0.property
0     1         2                  1
0     1         2                  0
1     2         2                  0
1     2         2                  1
2     3         2                  3
2     3         2                  3
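
One caveat: after agg(list) the column holds Python lists, so the exploded column comes back with object dtype. Assuming the properties are numeric, infer_objects() restores a numeric dtype:

out = df.reset_index().explode(df.columns.to_list()).infer_objects()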

sammywemmy:

One option is with pivot_longer, where you pass multiple .value entries to the names_to argument, matching the multiple groups in the names_pattern argument:

# pip install pyjanitor
import janitor
import pandas as pd

(wide_df
.pivot_longer(
    column_names="*[*", 
    names_to=('.value','.value'), 
    names_pattern=r"(.+)\[\d+\](.+)"
    )
)
   Time  Property  ItemType0.property
0     1         2                   1
1     2         2                   0
2     3         2                   3
3     1         2                   0
4     2         2                   1
5     3         2                   3

pivot_longer abstracts the reshaping process. Another option, solely within Pandas, is to reshape the columns into a MultiIndex, followed by stack:

ids = [col for col in wide_df if '[' not in col]
reshaped = wide_df.set_index(ids)
# Split each name into (stub, "[n]", rest); the capture group keeps "[n]"
reshaped.columns = (reshaped
                    .columns
                    .str
                    .split(r'(\[\d+\])', expand=True)
                    .set_names([None, 'drop', None])
                   )
# Move the "[n]" level into the rows, then discard it
reshaped = reshaped.stack(level='drop').droplevel('drop')
# Glue the remaining levels back together: "ItemType0" + ".property"
reshaped.columns = reshaped.columns.map(''.join)
reshaped.reset_index()

   Time  Property  ItemType0.property
0     1         2                   1
1     1         2                   0
2     2         2                   0
3     2         2                   1
4     3         2                   3
5     3         2                   3
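
To see what the split is doing, run it on one of the sample names; because the pattern has a capture group, the bracketed index becomes its own level:

pd.Index(["ItemType0[0].property"]).str.split(r"(\[\d+\])", expand=True)
# MultiIndex([('ItemType0', '[0]', '.property')], )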

Of course, with melting you are increasing the number of rows, which, depending on your dataset, will ultimately eat up memory. If you still get memory errors, you may have to think of a way to solve your issue without blowing the frame up, and it would help if you shared more insight into your end goal.
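
If memory remains the bottleneck, one option is to reshape a single item slot at a time and process (or write out) each chunk rather than materializing everything at once. A minimal sketch, assuming bracketed names like "ItemType0[0].property"; the helper iter_item_frames is hypothetical, not part of pandas:

import re
import pandas as pd

def iter_item_frames(df: pd.DataFrame):
    """Yield one narrow frame per bracketed item slot."""
    ids = [c for c in df.columns if "[" not in c]
    slots = sorted({m.group(1) for c in df.columns
                    if (m := re.search(r"\[(\d+)\]", c))}, key=int)
    for slot in slots:
        cols = [c for c in df.columns if f"[{slot}]" in c]
        part = df[ids + cols].copy()
        part.columns = ids + [re.sub(r"\[\d+\]", "", c) for c in cols]
        yield part  # process or persist each chunk here instead of concatenating

long_df = pd.concat(iter_item_frames(wide_df), ignore_index=True)

Note the final concat still builds the full result; to truly cap peak memory, replace it with a loop that writes each yielded chunk to disk.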