Hello stackoverflowers,
Could you please help me figure out how to replace emoticons in a Scala DataFrame?
import spark.implicits._
val df = Seq(
(8, "bat★ ⛱ ✨♂️⛷❤️"),
(64, "bb")
).toDF("number", "word")
df.show(false)
+------+-----------------------+
|number|word |
+------+-----------------------+
|8 |bat★ ⛱ ✨♂️⛷❤️|
|64 |bb |
+------+-----------------------+
df.select($"word", regexp_replace($"word", "[^\u0000-\uFFFF]", "").alias("word_revised")).show(false)
+-----------------------+---------------+
|word |word_revised |
+-----------------------+---------------+
|bat★ ⛱ ✨♂️⛷❤️|bat★ ⛱ ✨♂️⛷❤️|
|bb |bb |
+-----------------------+---------------+
The expected result is
+-----------------------+---------------+
|word |word_revised |
+-----------------------+---------------+
|bat★ ⛱ ✨♂️⛷❤️|bat|
|bb |bb |
+-----------------------+---------------+
Thank you so much for your help, @fonkap. I am sorry for joining the thread so late; I had another sprint story to onboard during the past month. The approach you posted works well for most of the emoticons, but the source data from our upstream contains some unusual icons. Do you have any suggestions on how to replace them?
scala> val df = Seq(
| (8, "♥♥♥♥♥☆ Condo֎۞۩ᴥ★Ąrt Ħouse Ŀocation")
| ).toDF("airPlaneId", "airPlaneName")
df: org.apache.spark.sql.DataFrame = [airPlaneId: int, airPlaneName: string]
scala> df.select($"airPlaneId", $"airPlaneName", regexp_replace($"airPlaneName", "[^\u0000-\u20CF]", "").alias("airPlaneName_revised")).show(false)
+----------+-----------------------------------+----------------------------+
|airPlaneId|airPlaneName |airPlaneName_revised |
+----------+-----------------------------------+----------------------------+
|8 |♥♥♥♥♥☆ Condo֎۞۩ᴥ★Ąrt Ħouse Ŀocation| Condo֎۞۩ᴥĄrt Ħouse Ŀocation|
+----------+-----------------------------------+----------------------------+
It looks like some symbols (the ones I marked with an underline) unexpectedly remain.
Thank you for sharing, @mck. The proposed new approach works, but an unwanted replacement occurs:
scala> df.selectExpr(
| "airPlaneId",
| "airPlaneName",
| "replace(decode(encode(airPlaneName, 'ascii'), 'ascii'), '?', '?') airPlaneName_revised"
| ).show(false)
+----------+------------+--------------------+
|airPlaneId|airPlaneName|airPlaneName_revised|
+----------+------------+--------------------+
|8 |la Cité |la Cit? |
|9 |Aéroport |A?roport |
|10 |München |M?nchen |
|11 |la Tête |la T?te |
|12 |Sarrià |Sarri? |
+----------+------------+--------------------+
Is there an enhanced approach that leaves valid accented letters like these alone and only processes emoji or symbols?

regexp_replace is doing it right. It is just that some of the "characters" you wrote are indeed in the \u0000-\uFFFF interval. Proof:
Open emoji.txt with your browser and you'll see:
(It is worth noting that some characters are combinations)
The "filtered" string looks like:
So, everything looks right!
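If you want to check this yourself without Spark, here is a rough sketch in plain Scala. The object and method names are only illustrative, and the sample string is a shortened version of the one in the question. It lists the code points of a string and reports whether any of them lies above U+FFFF:

```scala
object CodePoints {
  // Format each Unicode code point of the string as U+XXXX.
  def codePoints(s: String): Seq[String] =
    s.codePoints().toArray.toSeq.map(cp => f"U+$cp%04X")

  // True if any code point lies outside the Basic Multilingual Plane (above U+FFFF).
  def hasSupplementary(s: String): Boolean =
    s.codePoints().toArray.exists(_ > 0xFFFF)

  def main(args: Array[String]): Unit = {
    // "bat" followed by a black star, an umbrella on ground and sparkles,
    // written as escapes so the source file encoding does not matter.
    val word = "bat\u2605\u26F1\u2728"
    println(codePoints(word).mkString(" "))
    println(hasSupplementary(word))
  }
}
```

For this sample none of the code points is above U+FFFF, which is why a pattern that only matches characters outside that interval leaves the string unchanged.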
Finally, to answer your question, you may want to use a narrower character interval, for example:
[^\u0000-\u20CF], and you will get the expected result.
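As a minimal sketch of the narrower interval, assuming Spark's regexp_replace follows Java regex semantics, the same pattern can be tried with String.replaceAll in plain Scala. The names StripSymbols and stripSymbols are made up for illustration:

```scala
object StripSymbols {
  // Remove every character whose code point is above U+20CF.
  // The pattern is the one suggested in the answer.
  def stripSymbols(s: String): String =
    s.replaceAll("[^\u0000-\u20CF]", "")

  def main(args: Array[String]): Unit = {
    // "bat" followed by a black star, an umbrella on ground and sparkles.
    println(stripSymbols("bat\u2605\u26F1\u2728"))
    // Accented Latin letters sit far below U+20CF and are kept.
    println(stripSymbols("M\u00FCnchen"))
  }
}
```

In Spark the same pattern would be passed to regexp_replace, as in the question. Note that any symbol whose code point is below U+20CF is also kept by this interval, so it may need further narrowing or extra ranges depending on your data.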
Take a look at: https://jrgraphix.net/research/unicode_blocks.php