This question is about stopword removal. In the code below, the stopwords need to be removed from the inputDF shown below. The code works only partially: stopwords that consist of a single special character (# and $) and stopwords that start with a special character (such as "& c.v") are not removed. Kindly guide me on this. Thanks.
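//imports (assumes a SparkSession named spark is in scope, as in spark-shell)
import java.util.regex.Pattern
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

//input data
val inputDF = Seq(
  "1D Express ab ac",
  "2D Express a.c.e.",
  "3D Express #",
  "4D Express & c.v",
  "5D Express enc (etc)",
  "6D Express $",
  "7D Express gm & g",
  "8D Express r. k.",
  "9D Express 1",
  "10D Express ba.bc"
).toDF("input")

//stopwords list
val stopwords = Array("ab ac", "a.c.e.", "enc (etc)", "gm & g", "r. k.", "& c.v", "#", "$", "1", "2")

//code logic: one case-insensitive alternative per stopword, quoted so that
//characters such as . ( ) $ are matched literally
val stopwordsRegex = stopwords.map(s => s"(?i)\\b${Pattern.quote(s)}(?![\\p{L}\\p{N}])").mkString("|")

//UDF that deletes every match and trims the leftover whitespace
val stopwordsremoval = udf { input: String =>
  input.replaceAll(stopwordsRegex, "").trim()
}

val outputDF = inputDF.withColumn("output", stopwordsremoval(col("input")))
outputDF.show(false)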
The result I am getting is:
|input               |output           |
|--------------------|-----------------|
|1D Express ab ac    |1D Express       |
|2D Express a.c.e.   |2D Express       |
|3D Express #        |3D Express #     |
|4D Express & c.v    |4D Express & c.v |
|5D Express enc (etc)|5D Express       |
|6D Express $        |6D Express $     |
|7D Express gm & g   |7D Express       |
|8D Express r. k.    |8D Express       |
|9D Express 1        |9D Express       |
|10D Express ba.bc   |10D Express ba.bc|
|--------------------|-----------------|
The expected result is:
|input               |output           |
|--------------------|-----------------|
|1D Express ab ac    |1D Express       |
|2D Express a.c.e.   |2D Express       |
|3D Express #        |3D Express       |
|4D Express & c.v    |4D Express       |
|5D Express enc (etc)|5D Express       |
|6D Express $        |6D Express       |
|7D Express gm & g   |7D Express       |
|8D Express r. k.    |8D Express       |
|9D Express 1        |9D Express       |
|10D Express ba.bc   |10D Express ba.bc|
|--------------------|-----------------|
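
A likely cause is the leading \b. In Java regex, \b matches only between a word character and a non-word character, so there is no boundary between a space and #, $ or &, and those alternatives can never match. The following is a minimal sketch of one possible fix, not a definitive implementation: it replaces \b with a negative lookbehind that mirrors the existing lookahead. The object name StopwordRemoval and the method removeStopwords are hypothetical names introduced for illustration, and the regex logic is kept as plain Scala (no Spark) so it can be checked on its own.

```scala
import java.util.regex.Pattern

// Hypothetical helper object; the names here are illustrative, not from the original post.
object StopwordRemoval {
  val stopwords: Array[String] =
    Array("ab ac", "a.c.e.", "enc (etc)", "gm & g", "r. k.", "& c.v", "#", "$", "1", "2")

  // "Not preceded by a letter or digit" replaces the leading \b,
  // so stopwords that begin with a special character can still match.
  private val notAfterWordChar = """(?<![\p{L}\p{N}])"""
  // "Not followed by a letter or digit", same as in the original code.
  private val notBeforeWordChar = """(?![\p{L}\p{N}])"""

  // One quoted alternative per stopword, matched case-insensitively.
  val stopwordsRegex: String =
    stopwords
      .map(s => notAfterWordChar + Pattern.quote(s) + notBeforeWordChar)
      .mkString("(?i)(?:", "|", ")")

  // Deletes every stopword match and trims the leftover whitespace.
  def removeStopwords(input: String): String =
    input.replaceAll(stopwordsRegex, "").trim
}
```

Under these assumptions, the existing udf wrapper can call StopwordRemoval.removeStopwords in place of the inline replaceAll. The lookarounds still protect "1D" and "10D", because the digit 1 is followed by a letter or another digit there.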