Situation
I've run Google's NLP sentiment analysis, which returned a 'sentiment' column containing key-value pairs for magnitude and score, as shown below:
Sentiment Analysis Results
These are my results, in the sentiment column, for dataframe df03:
| index | text02 | sentiment |
|---|---|---|
| 01 | Max Muncy is great! | magnitude: 0.8999999761581421\nscore: 0.8999999761581421 |
| 02 | The worst Dodger is Max muncy. | magnitude: 0.800000011920929\nscore: -0.800000011920929 |
| 03 | Max Muncy was great, but not so much now. | magnitude: 0.4000000059604645\nscore: -0.4000000059604645 |
| 04 | What a fantastic guy, that Max muncy. | magnitude: 0.8999999761581421\nscore: 0.8999999761581421 |
Goal
I want to split the sentiment column into two columns, titled sentiment - magnitude and sentiment - score, with the column values listed accordingly.
The data format is newline delimited:
magnitude: 0.8999999761581421\nscore: 0.899999…
So I'm trying the Series.str.split method, like this:
df03['sentiment'].str.split(pat="\n", expand=True)
I'm not very familiar with regex, but I did note that \n represents a line feed, so I figured that would be the right value to insert for the pat parameter.
The result is that all values are returned as NaN:
| index | 0 |
|---|---|
| 01 | NaN |
| 02 | NaN |
| 03 | NaN |
| 04 | NaN |
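For what it's worth, a split that finds no match does not normally produce NaN; it returns the original string in column 0. All-NaN output from a .str method usually means the values are not strings at all. A small sketch with stand-in data, assuming the \n in the repr is a literal backslash followed by n (two characters):

```python
import pandas as pd

# Stand-in data where "\n" is a literal backslash + n (two characters),
# matching the printed repr above.
s = pd.Series(["magnitude: 0.9\\nscore: 0.9"])

# No actual line feed exists, so nothing splits and the whole string
# lands in column 0 -- note: *not* NaN. All-NaN output from .str
# methods usually means the values are not strings at all.
print(s.str.split(pat="\n", expand=True))

# Escaping the backslash in the regex splits on the literal sequence:
print(s.str.split(pat=r"\\n", expand=True))
```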
I've tried a couple of different things, but none worked.
df03['sentiment'].str.split(r"\n", expand=True)
df03['sentiment'].str.split(pat=r"\n", expand=True)
I presume the problem is that \ is creating some kind of regex escape that's nullifying the n, but I'm not seeing anything on regexr.com to confirm that.
There's also the issue of splitting out the terms magnitude and score and moving them into the column headers; I don't know whether expand=True covers that or not.
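One way to get the key names into the headers in a single step is str.extract with named groups; a sketch, assuming the values are strings with a literal backslash-n separator:

```python
import pandas as pd

# Assumed stand-in data: strings with a literal backslash-n separator.
s = pd.Series(["magnitude: 0.8999999761581421\\nscore: -0.800000011920929"])

# Named groups become column names directly, so no renaming is needed.
out = s.str.extract(r"magnitude: (?P<magnitude>[-.\d]+)\\nscore: (?P<score>[-.\d]+)")
out = out.astype(float)
```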
Greatly appreciate any input on what I'm doing wrong and where to focus troubleshooting.
doug
APPENDED
Original created dataframe:
| index | text02 |
|---|---|
| 01 | Max Muncy is great! |
| 02 | The worst Dodger is Max muncy. |
| 03 | Max Muncy was great, but not so much now. |
| 04 | What a fantastic guy, that Max muncy. |
df03['sentiment']
01 magnitude: 0.8999999761581421\nscore: 0.899999...
02 magnitude: 0.800000011920929\nscore: -0.800000...
03 magnitude: 0.4000000059604645\nscore: -0.40000...
04 magnitude: 0.8999999761581421\nscore: 0.899999...
Name: sentiment, dtype: object
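For reference, an all-NaN result is what .str methods return when the underlying values are not Python strings. This is a hypothetical repro with dict values standing in for whatever the API actually returned:

```python
import pandas as pd

# Hypothetical stand-in: non-string objects in the column.
s = pd.Series([{"magnitude": 0.9, "score": 0.9}])

# Every .str method yields NaN for non-string elements.
nan_out = s.str.split(pat="\n", expand=True)

# Casting to str first makes the accessor work on the string repr.
ok_out = s.astype(str).str.split(expand=True)
```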
APPENDED 02
Ran this
df03['sentiment'].astype(str).str.split(pat=r"\\n| ", expand=True)
Returned this:
| index | 0 | 1 | 2 |
|---|---|---|---|
| 01 | magnitude: | 0.8999999761581421\nscore: | 0.8999999761581421 |
| 02 | magnitude: | 0.800000011920929\nscore: | -0.800000011920929 |
| 03 | magnitude: | 0.4000000059604645\nscore: | -0.4000000059604645 |
| 04 | magnitude: | 0.8999999761581421\nscore: | 0.8999999761581421 |
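A possible reading of that three-column shape (hedged, since the real data isn't visible): if the cells contain a real line feed rather than a literal backslash-n, the r"\\n" half of the pattern matches nothing and only the spaces split, leaving the value and score: fused in one cell. A sketch:

```python
import pandas as pd

# Stand-in with a *real* newline inside the cell.
s = pd.Series(["magnitude: 0.9\nscore: -0.4"])

# r"\\n" matches a literal backslash-n, which is absent here, so only
# the two spaces split -> three columns, with "0.9\nscore:" fused.
three = s.str.split(pat=r"\\n| ", expand=True)

# Splitting on any whitespace (the default) separates all four parts.
four = s.str.split(expand=True)
```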
You need to escape the backslash in the regular expression (two backslashes, as a raw string):

df['sentiment'].str.split(pat=r"\\n", expand=True)

Here df (which I think is your df03) has sentiment values that contain a literal backslash followed by n, not an actual line feed, so the pattern "\n" alone will never match. With the escaped pattern, the split produces two columns, one per key-value pair.

To rename the columns to Magnitude and Score, and remove those strings from the dataframe, you can modify the regular expression to split on either a line feed or a space, then rename the columns and select only the ones you want to keep.
Addendum

OP had to make some additional tweaks to achieve the result I got: using astype(str) to explicitly cast the values to strings, and removing the regex entirely. By default, Series.str.split() splits on any whitespace, which suggests the actual inputs have some unusual formatting where the last cell contains a real line break rather than a literal \n; without seeing the original data, it's still a bit unclear.
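Putting the pieces together, a minimal end-to-end sketch of the approach that worked for OP (cast to str, split on any whitespace, keep only the numeric parts under the desired names), using stand-in data with a real newline in each cell:

```python
import pandas as pd

# Stand-in data with a *real* newline in each sentiment cell,
# mirroring the final diagnosis above.
df03 = pd.DataFrame({
    "text02": ["Max Muncy is great!", "The worst Dodger is Max muncy."],
    "sentiment": [
        "magnitude: 0.8999999761581421\nscore: 0.8999999761581421",
        "magnitude: 0.800000011920929\nscore: -0.800000011920929",
    ],
})

# Default split on any whitespace -> 0="magnitude:", 1=value,
# 2="score:", 3=value.
parts = df03["sentiment"].astype(str).str.split(expand=True)

# Keep only the numeric columns, under the desired names.
df03["sentiment - magnitude"] = parts[1].astype(float)
df03["sentiment - score"] = parts[3].astype(float)
df03 = df03.drop(columns=["sentiment"])
```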