PySpark from_json function equivalent


I have a dataframe with a column containing a JSON string, which is converted to a dictionary using the from_json function. The problem occurs when the JSON contains an atypical string inside, like '\\"cde\\"'; the full JSON is '{"key":"abc","value":"\\"cde\\""}'.

When from_json is applied, it returns null; I think this is because it treats \\ as a single character and cannot parse the value due to the extra " characters inside.

Here is simple code snippet:

from pyspark.sql.functions import col, from_json

df = spark.createDataFrame(
    [
        (1, '{"key":"abc","value":"\\\\"cde\\\\""}')
    ],
    ["id", "text"]
)

# json_schema is assumed to be defined elsewhere, e.g.:
# json_schema = StructType([StructField("key", StringType()), StructField("value", StringType())])
df = df.withColumn('dictext', from_json(col('text'), json_schema))

display(df)

Is there a way to clean such JSON, or maybe to encode it somehow before calling from_json, or another function that is able to parse such a string?
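As a plain-Python sanity check (outside Spark), the literal stored in the column can be shown to be invalid JSON, which is consistent with from_json returning null:

```python
import json

# The literal stored in the "text" column: two real backslashes before each inner quote
raw = '{"key":"abc","value":"\\\\"cde\\\\""}'

try:
    json.loads(raw)
    is_valid = True
except json.JSONDecodeError:
    is_valid = False

print(is_valid)  # → False: the string is not valid JSON, so from_json yields null
```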

Answer by Bartosz Gajda

Is there a way to clean such JSON

For your case, I would suggest creating a UDF that captures the cleaning rules relevant to your data. For the single line of data you included, I created a sample UDF that removes the incorrect tokens so the JSON parses correctly:

from pyspark.sql.functions import udf

@udf("string")
def clean_json(text: str):
    # Drop stray backslashes, then collapse the doubled quotes they leave behind
    return text.replace("\\", "").replace('""', '"')

# Applying the UDF
df = df.withColumn('dictext', from_json(clean_json(col('text')), json_schema))
display(df)

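As a quick check outside Spark, the same two replacements applied with plain str.replace turn the sample literal into valid JSON (the clean helper below is illustrative and mirrors the UDF body):

```python
import json

# Hypothetical plain-Python helper mirroring the UDF's cleaning rules
def clean(text: str) -> str:
    return text.replace("\\", "").replace('""', '"')

raw = '{"key":"abc","value":"\\\\"cde\\\\""}'  # same literal as the sample row
print(json.loads(clean(raw)))  # → {'key': 'abc', 'value': 'cde'}
```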

Cleaning rows with regex

If you can capture all the unwanted characters with a regular expression, then you don't need the UDF - you can use your regex with the regexp_replace function directly, like this:

from pyspark.sql.functions import regexp_replace

# Strip backslashes, then collapse the doubled quotes they leave behind
df = df.withColumn(
    'dictext',
    from_json(regexp_replace(regexp_replace('text', r'\\', ''), '""', '"'), json_schema)
)

Docs for regexp_replace
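The same two-step cleaning (strip backslashes, collapse doubled quotes) can be checked outside Spark with Python's re module, which uses a similar regex dialect:

```python
import json
import re

# Same literal as the sample row: two real backslashes before each inner quote
raw = '{"key":"abc","value":"\\\\"cde\\\\""}'

# Strip backslashes, then collapse the doubled quotes they leave behind
cleaned = re.sub('""', '"', re.sub(r'\\', '', raw))
print(json.loads(cleaned))  # → {'key': 'abc', 'value': 'cde'}
```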