Situation
I've run Google's NLP sentiment analysis, which returned a 'sentiment' column containing key-value pairs for magnitude and score, as shown below:
Sentiment Analysis Results
These are my results, in the sentiment column, for dataframe df03:
| index | text02 | sentiment |
|---|---|---|
| 01 | Max Muncy is great! | magnitude: 0.8999999761581421\nscore: 0.8999999761581421 |
| 02 | The worst Dodger is Max muncy. | magnitude: 0.800000011920929\nscore: -0.800000011920929 |
| 03 | Max Muncy was great, but not so much now. | magnitude: 0.4000000059604645\nscore: -0.4000000059604645 |
| 04 | What a fantastic guy, that Max muncy. | magnitude: 0.8999999761581421\nscore: 0.8999999761581421 |
Goal
I want to split the sentiment column into two columns, titled sentiment - magnitude and sentiment - score, with the column values listed accordingly.
The data format is newline delimited:
magnitude: 0.8999999761581421\nscore: 0.899999…
So I'm trying the Series.str.split method, like this:
df03['sentiment'].str.split(pat="\n", expand=True)
I'm not very familiar with regex, but I did note that \n represents a line feed, so I figured that would be the right value to insert for the pat parameter.
The result is that all values are returned as NaN:
| index | 0 |
|---|---|
| 01 | NaN |
| 02 | NaN |
| 03 | NaN |
| 04 | NaN |
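For what it's worth, a split that finds no match does not normally produce NaN; it returns the original string in column 0. All-NaN output from a .str method usually means the values are not strings at all. A small sketch with stand-in data, assuming the \n in the repr is a literal backslash followed by n (two characters):

```python
import pandas as pd

# Stand-in data where "\n" is a literal backslash + n (two characters),
# matching the printed repr above.
s = pd.Series(["magnitude: 0.9\\nscore: 0.9"])

# No actual line feed exists, so nothing splits and the whole string
# lands in column 0 -- note: *not* NaN. All-NaN output from .str
# methods usually means the values are not strings at all.
print(s.str.split(pat="\n", expand=True))

# Escaping the backslash in the regex splits on the literal sequence:
print(s.str.split(pat=r"\\n", expand=True))
```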
I've tried a couple of different things, but none worked.
df03['sentiment'].str.split(r"\n", expand=True)
df03['sentiment'].str.split(pat=r"\n", expand=True)
I presume the problem is that \ is creating some kind of regex escape that's nullifying the n, but I'm not seeing anything on regexr.com to confirm that.
There's also the issue of splitting out the terms magnitude and score and moving them into the column headers; I don't know whether expand=True covers that or not.
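One way to get the key names into the headers in a single step is str.extract with named groups; a sketch, assuming the values are strings with a literal backslash-n separator:

```python
import pandas as pd

# Assumed stand-in data: strings with a literal backslash-n separator.
s = pd.Series(["magnitude: 0.8999999761581421\\nscore: -0.800000011920929"])

# Named groups become column names directly, so no renaming is needed.
out = s.str.extract(r"magnitude: (?P<magnitude>[-.\d]+)\\nscore: (?P<score>[-.\d]+)")
out = out.astype(float)
```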
Greatly appreciate any input on what I'm doing wrong and where to focus troubleshooting.
doug
APPENDED
Original created dataframe:
| index | text02 |
|---|---|
| 01 | Max Muncy is great! |
| 02 | The worst Dodger is Max muncy. |
| 03 | Max Muncy was great, but not so much now. |
| 04 | What a fantastic guy, that Max muncy. |
df03['sentiment']
01 magnitude: 0.8999999761581421\nscore: 0.899999...
02 magnitude: 0.800000011920929\nscore: -0.800000...
03 magnitude: 0.4000000059604645\nscore: -0.40000...
04 magnitude: 0.8999999761581421\nscore: 0.899999...
Name: sentiment, dtype: object
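For reference, an all-NaN result is what .str methods return when the underlying values are not Python strings. This is a hypothetical repro with dict values standing in for whatever the API actually returned:

```python
import pandas as pd

# Hypothetical stand-in: non-string objects in the column.
s = pd.Series([{"magnitude": 0.9, "score": 0.9}])

# Every .str method yields NaN for non-string elements.
nan_out = s.str.split(pat="\n", expand=True)

# Casting to str first makes the accessor work on the string repr.
ok_out = s.astype(str).str.split(expand=True)
```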
APPENDED 02
Ran this
df03['sentiment'].astype(str).str.split(pat=r"\\n| ", expand=True)
Returned this:
| index | 0 | 1 | 2 |
|---|---|---|---|
| 01 | magnitude: | 0.8999999761581421\nscore: | 0.8999999761581421 |
| 02 | magnitude: | 0.800000011920929\nscore: | -0.800000011920929 |
| 03 | magnitude: | 0.4000000059604645\nscore: | -0.4000000059604645 |
| 04 | magnitude: | 0.8999999761581421\nscore: | 0.8999999761581421 |
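A possible reading of that three-column shape (hedged, since the real data isn't visible): if the cells contain a real line feed rather than a literal backslash-n, the r"\\n" half of the pattern matches nothing and only the spaces split, leaving the value and score: fused in one cell. A sketch:

```python
import pandas as pd

# Stand-in with a *real* newline inside the cell.
s = pd.Series(["magnitude: 0.9\nscore: -0.4"])

# r"\\n" matches a literal backslash-n, which is absent here, so only
# the two spaces split -> three columns, with "0.9\nscore:" fused.
three = s.str.split(pat=r"\\n| ", expand=True)

# Splitting on any whitespace (the default) separates all four parts.
four = s.str.split(expand=True)
```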
You need to escape the backslash in the regular expression (two backslashes, as a raw string):

df['sentiment'].str.split(pat=r"\\n", expand=True)

Here df (which I think is your df03) has sentiment values that contain a literal backslash followed by n, not an actual line feed, so the pattern "\n" alone will never match. With the escaped pattern, the split produces two columns, one per key-value pair.

To rename the columns to Magnitude and Score, and remove those strings from the dataframe, you can modify the regular expression to split on either a line feed or a space, then rename the columns and select only the ones you want to keep.
Addendum

OP had to make some additional tweaks to achieve the result I got: using astype(str) to explicitly cast the values to strings, and removing the regex entirely. By default, Series.str.split() splits on any whitespace, which suggests the actual inputs have some unusual formatting where the last cell contains a real line break rather than a literal \n; without seeing the original data, it's still a bit unclear.
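Putting the pieces together, a minimal end-to-end sketch of the approach that worked for OP (cast to str, split on any whitespace, keep only the numeric parts under the desired names), using stand-in data with a real newline in each cell:

```python
import pandas as pd

# Stand-in data with a *real* newline in each sentiment cell,
# mirroring the final diagnosis above.
df03 = pd.DataFrame({
    "text02": ["Max Muncy is great!", "The worst Dodger is Max muncy."],
    "sentiment": [
        "magnitude: 0.8999999761581421\nscore: 0.8999999761581421",
        "magnitude: 0.800000011920929\nscore: -0.800000011920929",
    ],
})

# Default split on any whitespace -> 0="magnitude:", 1=value,
# 2="score:", 3=value.
parts = df03["sentiment"].astype(str).str.split(expand=True)

# Keep only the numeric columns, under the desired names.
df03["sentiment - magnitude"] = parts[1].astype(float)
df03["sentiment - score"] = parts[3].astype(float)
df03 = df03.drop(columns=["sentiment"])
```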