Removing special character in data in databricks

1.5k Views Asked by Eimis Pacheco At 11 February 2022 at 14:33

My company is migrating from MapR to Databricks. The following piece of code worked fine on MapR, but it stopped working once we moved it to Databricks. The failure seems specific to this regular expression, because other regular expressions run without any error.

The error is "Error while obtaining a new communication channel". After it appears, we cannot continue writing and testing code; something breaks.

I am attaching a screenshot for reference: [Screenshot: Error while obtaining a new communication channel]

import pyspark.sql.functions as pyfunc

df=spark.read.parquet("/mnt/gpdipedlstgamrasp50565/stg_db/intermediate/ODX/ODW/STUDY_REPORT/Current/Data/")

df.count()

df = df.withColumn('CSR_RESULTS_SUMMARY', pyfunc.regexp_replace(pyfunc.col('CSR_RESULTS_SUMMARY'),u'([\ud800-\udfff\ufdd0-\ufdef\ufffe-\uffff+])',''))

df.show()

Thank you very much in advance.


There is 1 best solution below

blackbishop On 11 February 2022 at 15:44

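I suspect the error is caused by the u'' prefix on the regex pattern you pass to the regexp_replace function. In a u'' (or plain) string, Python itself expands escapes such as \ud800 into actual characters, including lone surrogates, before the pattern ever reaches Spark. Use an r'' Python raw string instead, so the \uXXXX sequences are passed through literally and interpreted by the regex engine: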

df = df.withColumn(
    'column',
    pyfunc.regexp_replace(pyfunc.col('column'), r'[\ud800-\udfff\ufdd0-\ufdef\ufffe-\uffff]+', '')
)

Alternatively, keep a regular string and escape each backslash with a second one (\\):

df = df.withColumn(
    'column',
    pyfunc.regexp_replace(pyfunc.col('column'), '[\\ud800-\\udfff\\ufdd0-\\ufdef\\ufffe-\\uffff]+', '')
)
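To see why the two forms above differ from the original u'' pattern, the following minimal sketch compares the three literals in plain Python, without Spark. The variable names and the can_encode_utf8 helper are hypothetical, and Python's re module is used only as a stand-in for the Java regex engine that Spark uses, so this illustrates the string handling rather than reproducing the Databricks error itself.

```python
import re

# Non-raw literal: Python expands \ud800 etc. into real code points,
# so the pattern string itself contains lone surrogates.
expanded = u'[\ud800-\udfff\ufdd0-\ufdef\ufffe-\uffff]+'

# Raw literal and double-backslash literal: the backslashes stay in the
# string, so the pattern is plain ASCII text for the regex engine to parse.
raw = r'[\ud800-\udfff\ufdd0-\ufdef\ufffe-\uffff]+'
escaped = '[\\ud800-\\udfff\\ufdd0-\\ufdef\\ufffe-\\uffff]+'


def can_encode_utf8(s):
    """Hypothetical helper: True if the string can be encoded as UTF-8."""
    try:
        s.encode('utf-8')
        return True
    except UnicodeEncodeError:
        return False


print(raw == escaped)             # the two suggested forms are the same string
print(can_encode_utf8(expanded))  # lone surrogates cannot be encoded
print(can_encode_utf8(raw))       # the raw pattern is plain ASCII

# Python's re also understands \uXXXX escapes, so the raw pattern can be
# tried locally on a sample value containing a noncharacter (U+FDD0).
cleaned = re.sub(raw, '', 'abc\ufdd0def')
print(cleaned)
```

If the expanded pattern cannot be encoded as UTF-8, it is plausible that sending it from Python to the JVM is what breaks the session, which would be consistent with the error disappearing once the raw or double-escaped form is used.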
