Collapse together pandas rows that respect a list of conditions

91 Views Asked by oettam_oisolliv At 10 February 2023 at 13:39

So, I have a dataframe of this form:

Doc	String
A	abc
A	def
A	ghi
B	jkl
B	mnop
B	qrst
B	uv

What I'm trying to do is to merge/collapse rows according to two conditions:

they must be from the same document
they should be merged together up to a max length

So that, for example, with max_len == 6 I would get:

Doc	String
A	abcdef
A	ghi
B	jkl
B	mnop
B	qrstuv

The output doesn't have to be that strict. To explain why: I have a document that I was able to split into sentences, and I'd now like to have it in a dataframe with each "new sentence" being as long as possible, up to the maximum length.


There are 2 best solutions below

Timus On 11 February 2023 at 14:29 BEST ANSWER

I couldn't find a pure Pandas solution (i.e. do the grouping only by using Pandas methods). You could try the following though:

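import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"Doc": list("AAABBBB"),
                   "String": ["abc", "def", "ghi", "jkl", "mnop", "qrst", "uv"]})

def group(col, max_len=6):
    # col holds the string lengths of one Doc; returns one group label per row.
    groups = []
    group_id = acc = 0  # current label and the length accumulated under it
    for length in col.values:
        acc += length
        if max_len < acc:
            # Limit exceeded: start a new group with the current string.
            group_id, acc = group_id + 1, length
        groups.append(group_id)
    return groups

# Label the rows per Doc, then join the strings within each (Doc, label) pair.
groups = df["String"].str.len().groupby(df["Doc"]).transform(group)
res = df.groupby(["Doc", groups], as_index=False).agg("".join)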

The group function takes a column of string lengths for one Doc group and walks through them in order, starting a new group whenever adding the next string would push the running total past max_len. Based on that, a second groupby over Doc and groups then aggregates the strings by joining them.

Result for the sample:

  Doc  String
0   A  abcdef
1   A     ghi
2   B     jkl
3   B    mnop
4   B  qrstuv
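
To see where the groups behind this result come from, here is a minimal sketch that runs the same accumulation on plain lists of lengths, outside pandas. The name label_chunks is made up for illustration, and the hard-coded lengths are simply those of the sample strings for Doc A and Doc B.

```python
def label_chunks(lengths, max_len=6):
    """Assign a chunk number to each length (hypothetical helper, same logic as group)."""
    labels = []
    chunk = acc = 0
    for length in lengths:
        acc += length
        if max_len < acc:
            # Adding this string would exceed max_len: start a new chunk with it.
            chunk, acc = chunk + 1, length
        labels.append(chunk)
    return labels


# Lengths of "abc", "def", "ghi" (Doc A) and "jkl", "mnop", "qrst", "uv" (Doc B).
labels_a = label_chunks([3, 3, 3])
labels_b = label_chunks([3, 4, 4, 2])
print(labels_a)  # [0, 0, 1]
print(labels_b)  # [0, 1, 2, 2]
```

Rows that share a label within a Doc end up joined. Note that a string which is itself longer than max_len is not split; it just gets a chunk of its own.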

SebDL On 10 February 2023 at 14:37

I have not tried to run this code so there might be bugs, but essentially:

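max_length = 6  # maximum length of each concatenated string

# One output row per document, in order of first appearance.
uniques = df['Doc'].unique()
new_df = pd.DataFrame(index=uniques, columns=df.columns)

for doc in uniques:
    # Strings belonging to this document (a Series, not a DataFrame).
    strings = df.loc[df['Doc'] == doc, 'String']
    # Join them and cut the result off at max_length characters.
    concatenated = "".join(strings)[:max_length]
    new_df.loc[doc, 'Doc'] = doc
    new_df.loc[doc, 'String'] = concatenated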
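
The loop above yields a single string per document, cut at the maximum length. If several rows per document are wanted, as in the question, a loop-based variant could pack the strings greedily instead. This is only a sketch in plain Python, assuming the rows arrive as (doc, string) pairs with each document's rows adjacent; pack_rows is a made-up name.

```python
def pack_rows(rows, max_len=6):
    """Greedily merge consecutive strings of the same doc up to max_len characters."""
    packed = []  # list of [doc, merged_string]
    for doc, text in rows:
        same_doc = bool(packed) and packed[-1][0] == doc
        if same_doc and len(packed[-1][1]) + len(text) <= max_len:
            packed[-1][1] += text  # still fits: extend the current row
        else:
            packed.append([doc, text])  # new doc or too long: start a new row
    return [tuple(row) for row in packed]


# Sample rows from the question.
rows = [("A", "abc"), ("A", "def"), ("A", "ghi"),
        ("B", "jkl"), ("B", "mnop"), ("B", "qrst"), ("B", "uv")]
result = pack_rows(rows)
print(result)
```

With a DataFrame, the pairs could come from df[["Doc", "String"]].itertuples(index=False), and the result can be turned back into a frame with pd.DataFrame(result, columns=["Doc", "String"]).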
