I have a dataframe with a column containing a JSON string, which is converted to a dictionary using the from_json function. The problem occurs when the JSON contains an atypical string such as '\\"cde\\"'; the full JSON is '{"key":"abc","value":"\\"cde\\""}'.
When from_json is applied, it returns null. I think this is because it treats \\ as a single character and cannot parse the value due to the extra " characters inside it.
Here is a simple code snippet:

    from pyspark.sql.functions import from_json, col

    df = spark.createDataFrame(
        [
            (1, '{"key":"abc","value":"\\\\"cde\\\\""}')
        ],
        ["id", "text"]
    )
    # json_schema is the schema for the parsed column (defined elsewhere)
    df = df.withColumn('dictext', from_json(col('text'), json_schema))
    display(df)
Is there a way to clean such JSON, or perhaps encode it somehow before calling from_json, or another function that is able to parse such a string?
For your case, I would suggest creating a UDF that captures the cleaning rules relevant to your data. For the single line of data you included, a sample UDF can rewrite the invalid tokens so the JSON parses correctly:
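A minimal sketch of such a UDF, assuming the only defect is the doubled backslash before each inner quote (the \\" sequence from the sample row); the Spark wiring is shown in comments because it needs a live SparkSession and your json_schema:

```python
import json

def clean_json(text):
    """Rewrite the invalid \\\\" token into a properly escaped \\" quote."""
    if text is None:
        return None
    return text.replace('\\\\"', '\\"')

# To use it in Spark (assumes a running SparkSession and json_schema):
# from pyspark.sql.functions import udf, from_json, col
# from pyspark.sql.types import StringType
# clean_json_udf = udf(clean_json, StringType())
# df = df.withColumn('dictext', from_json(clean_json_udf(col('text')), json_schema))

# Pure-Python check on the sample row from the question:
raw = '{"key":"abc","value":"\\\\"cde\\\\""}'
print(json.loads(clean_json(raw)))  # the value field now parses, quotes included
```

The choice to escape rather than delete the inner quotes preserves them in the parsed value; if you prefer to drop them entirely, replace the token with an empty string instead.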
Cleaning rows with regex
If you can capture all your unwanted characters with a regular expression, then you don't need a UDF: you can use your regex with the regexp_replace function directly. See the docs for regexp_replace.
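A sketch of that approach for the sample row, assuming the same defect as above (the \\" sequence). The Spark call is commented out since it needs a live session; the regex itself is demonstrated with Python's re module, which uses the same pattern syntax here:

```python
import re

# In Spark (assumes df and json_schema from the question):
# from pyspark.sql.functions import regexp_replace, from_json, col
# df = df.withColumn(
#     'dictext',
#     from_json(regexp_replace(col('text'), r'\\\\"', r'\\"'), json_schema)
# )

pattern = r'\\\\"'    # matches a literal \\" sequence (two backslashes, then a quote)
replacement = r'\\"'  # a single properly escaped quote, \"
raw = '{"key":"abc","value":"\\\\"cde\\\\""}'
print(re.sub(pattern, replacement, raw))  # now valid JSON
```

This avoids the serialization overhead of a Python UDF, since regexp_replace runs natively inside Spark.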