So, I have a dataframe of the type:
| Doc | String |
|---|---|
| A | abc |
| A | def |
| A | ghi |
| B | jkl |
| B | mnop |
| B | qrst |
| B | uv |
What I'm trying to do is merge/collapse rows according to two conditions:
- they must be from the same document
- they should be merged together up to a max length
So, for example, with `max_len == 6` I would get:
| Doc | String |
|---|---|
| A | abcdef |
| A | ghi |
| B | jkl |
| B | mnop |
| B | qrstuv |
The output doesn't have to be that strict. To explain why: I have a document that I was able to split into sentences, and I'd now like to have it in a dataframe with each "new sentence" being of maximal length.
I couldn't find a pure Pandas solution (i.e. one that does the grouping using only Pandas methods). You could try the following though:
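A minimal sketch of this approach, assuming the rows are merged greedily in their existing order. The sample frame `df` is rebuilt from the question, and the wrapper name `merge_rows` is only illustrative:

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "Doc": ["A", "A", "A", "B", "B", "B", "B"],
    "String": ["abc", "def", "ghi", "jkl", "mnop", "qrst", "uv"],
})


def group(lengths, max_len=6):
    """Number the rows of one Doc so that the string lengths inside
    each numbered group sum to at most max_len (greedy, in row order)."""
    groups = []
    current = total = 0
    for length in lengths:
        total += length
        if total > max_len:
            # This string no longer fits: start a new group with it.
            current += 1
            total = length
        groups.append(current)
    return pd.Series(groups, index=lengths.index)


def merge_rows(df, max_len=6):
    # Group numbers per row, computed separately for each Doc.
    groups = (
        df["String"].str.len()
        .groupby(df["Doc"])
        .transform(group, max_len=max_len)
        .rename("group")
    )
    # Concatenate the strings that share the same Doc and group number.
    return (
        df.groupby(["Doc", groups])["String"]
        .agg("".join)
        .droplevel("group")
        .reset_index()
    )


result = merge_rows(df, max_len=6)
```

Note that in this sketch a single string longer than `max_len` is not split; it simply ends up in a group of its own.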
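The `group` function takes a column of string lengths for a `Doc` group and builds `groups` that meet the `max_len` condition. Based on that, another `groupby` over `Doc` and `groups` then aggregates the strings.

Result for the sample (with `max_len = 6`):
| Doc | String |
|---|---|
| A | abcdef |
| A | ghi |
| B | jkl |
| B | mnop |
| B | qrstuv |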