How to remove \n and empty string in a column in a dataframe?


I have a DataFrame in which one column holds list values. Below is the column as it appears in Excel, followed by how it looks in the DataFrame.

column
"[
""Hello""
]"
"[
""Hello"", 
 ""Hi""
]"
"[
""Hello"", 
 ""Hi"",
 """"
]"
"[
"""",
""Hello"", 
 ""Hi""
]"
"[
""Hello"",
""""
]"
"[
"""",
""Hello""

]"

The column value looks like

column
------
[\n "Hello" \n]
[\n "Hello", \n "Hi"\n]
[\n "Hello", \n "Hi"\n, \n ""\n]
[\n ""\n, \n "Hello", \n "Hi"\n]
[\n "Hello" \n, \n ""\n]
[\n ""\n, \n "Hello" \n]

I want to remove \n and "" from each list so that the values become

column
------
["Hello"]
["Hello", "Hi"]
["Hello", "Hi"]
["Hello", "Hi"]
["Hello"]
["Hello"]

How can I obtain this result using pandas and Python?

There are 3 best solutions below

8
Utkonos On

I'm not sure how to handle your input data exactly as posted, because it is not valid Python. However, I think there are a couple of ways to solve the problem.

Input data (as correct Python)

column = [
    ['\n "Hello" \n'],
    ['\n "Hello"', '\n "Hi"\n'],
    ['\n "Hello"',' \n "Hi"\n', '\n ""\n'],
    ['\n ""\n', '\n "Hello"', '\n "Hi"\n'],
    ['\n "Hello" \n', '\n ""\n'],
    ['\n ""\n', '\n "Hello" \n']
]

Code: First map then List Comprehension

The map strips the surrounding whitespace (including the newline \n characters) and the enclosing double quotes from each item. The list comprehension then removes the entries that end up empty (the original "" items) from each row.

def stripper(text):
    return text.strip().strip('"')

for row in column:
    output = list(map(stripper, row))
    print([i for i in output if i])

Output

['Hello']
['Hello', 'Hi']
['Hello', 'Hi']
['Hello', 'Hi']
['Hello']
['Hello']

Note that the end result has single quotes rather than double quotes. Let me know if this matters for what you're doing.
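To apply the same idea to a pandas column rather than a plain list, a minimal sketch might look like the following. It assumes each cell of the column already holds a Python list of strings, as in the reconstructed input above; the names df and clean_row are illustrative and not from the question.

```python
import pandas as pd


def stripper(text):
    # Remove surrounding whitespace (including newlines), then the double quotes.
    return text.strip().strip('"')


def clean_row(row):
    # Hypothetical helper: strip every item, then drop the ones that end up empty.
    return [item for item in map(stripper, row) if item]


# Assumption: each cell already holds a list of strings.
df = pd.DataFrame({"column": [
    ['\n "Hello" \n'],
    ['\n "Hello"', '\n "Hi"\n'],
    ['\n "Hello"', ' \n "Hi"\n', '\n ""\n'],
    ['\n ""\n', '\n "Hello"', '\n "Hi"\n'],
]})

# Series.map calls clean_row once per cell.
df["column"] = df["column"].map(clean_row)
result = df["column"].tolist()
print(result)
```

If the cells are strings rather than lists, they would need to be parsed into lists first.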

For fun

Just for fun, I took your input data absolutely literally and wrote a chain of replacements that produces exactly the output shown in the question.

Input data

column = r"""[\n "Hello" \n]
[\n "Hello", \n "Hi"\n]
[\n "Hello", \n "Hi"\n, \n ""\n]
[\n ""\n, \n "Hello", \n "Hi"\n]
[\n "Hello" \n, \n ""\n]
[\n ""\n, \n "Hello" \n]""".splitlines()

Code

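for row in column:
    # Order matters: unwrap the items, drop the empty "" entries, then trim the
    # trailing \n (replacing it with '"' so the closing quote is kept).
    print(row.replace('\\n "', '"').replace('" \\n', '"').replace('""\\n, ', '').replace(', ""\\n', '').replace('"\\n', '"'))

Output

["Hello"]
["Hello", "Hi"]
["Hello", "Hi"]
["Hello", "Hi"]
["Hello"]
["Hello"]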
0
Dhruv Awasthi On

Using the example you provided, with a DataFrame df whose column is named column, you can use the following snippet:

def remove_empty_line(row):
    # Assumes each cell is already a list of strings.
    updated_list = []
    for elem in row:
        cleaned = elem.replace("\n", "").strip()
        if cleaned:  # skip items that are empty after cleaning
            updated_list.append(cleaned)
    return updated_list

df["column"] = df["column"].apply(remove_empty_line)

Now you can check your df with df.head()
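The function above assumes every cell is already a list. If the column was read from Excel or CSV, each cell is more likely a single string such as '[\n "Hello", \n "Hi"\n]'. Here is a hedged sketch for that case, assuming those strings are valid JSON arrays (the sample in the question appears to be); to_clean_list is a hypothetical helper name.

```python
import json

import pandas as pd


def to_clean_list(cell):
    # Assumption: a string cell contains a valid JSON array of strings.
    items = json.loads(cell) if isinstance(cell, str) else cell
    # Strip whitespace from each item and drop the ones that are empty.
    return [item.strip() for item in items if item.strip()]


# Cells stored as text, with real newline characters inside the brackets.
df = pd.DataFrame({"column": [
    '[\n "Hello" \n]',
    '[\n "Hello", \n "Hi"\n, \n ""\n]',
]})

df["column"] = df["column"].apply(to_clean_list)
cleaned = df["column"].tolist()
print(cleaned)
```

The isinstance check lets the same helper accept cells that are already lists, so it should be safe to run on either form of the column.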

0
Ian Thompson On

How about this?

from ast import literal_eval

import pandas as pd


# Recreating data.
column = [
    '[\n "Hello" \n]',
    '[\n "Hello", \n "Hi"\n]',
    '[\n "Hello", \n "Hi"\n, \n ""\n]',
    '[\n ""\n, \n "Hello", \n "Hi"\n]',
    '[\n "Hello" \n, \n ""\n]',
    '[\n ""\n, \n "Hello" \n]',
]
df = pd.DataFrame({"column": column})

out = df.assign(
    # We map the following operations to "column".
    column=df.column
    .map(
        # Iterate over each value -- each will become a list of literals (in this case, strings).
        lambda x: [
            # We unpack the filtered values into the list.
            *filter(
                # Check the bool value of each item produced by `literal_eval`, keeping only those that are True.
                bool, literal_eval(x)
            )
        ]
    )
)
print(out)

Output

        column
0      [Hello]
1  [Hello, Hi]
2  [Hello, Hi]
3  [Hello, Hi]
4      [Hello]
5      [Hello]
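
One caveat: filter(bool, ...) only drops items that are exactly empty. If a parsed item could be whitespace-only (for example " ") or padded (" Hi "), it would survive unchanged. Below is a small sketch that strips each item before filtering, under the assumption that every cell is a valid Python list literal; parse_and_clean is an illustrative name.

```python
from ast import literal_eval

import pandas as pd


def parse_and_clean(cell):
    # Parse the text into a list, then strip whitespace from each item.
    stripped = (item.strip() for item in literal_eval(cell))
    # Keep only the items that are non-empty after stripping.
    return [item for item in stripped if item]


# Hypothetical data with a whitespace-only item and a padded item.
df = pd.DataFrame({"column": [
    '[\n "Hello" \n, \n " "\n]',
    '[\n ""\n, \n " Hi "\n]',
]})

out = df.assign(column=df["column"].map(parse_and_clean))
parsed = out["column"].tolist()
print(parsed)
```

Note that literal_eval raises an exception on text that is not a valid Python literal, so malformed cells would need separate handling.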