What is the correct way to use regexp_replace to replace a '|' with a '-' in a column named details in Spark Java?


I have a column named details which contains the special character |, and I want to replace it with -. How can I do this in Spark Java?

I have tried:

  1. regexp_replace(details, "|", "-")
  2. regexp_replace(details, "\\|", "-")

I would like to know which one is correct, the first or the second, and what the \\ before the special character is for. If I don't include \\, will the | in the details string still be replaced?


There are 2 best solutions below

vilalabinot On

If you don't include \\, which escapes the | for the regular expression engine, you will get a - inserted between every single character of your string. For example, if you have:

my | text, and you use regexp_replace(details, "|", "-"), you will end up with: -m-y- -|- -t-e-x-t- because | has a special meaning in regular expressions (a|b is alternation: match either a or b). With nothing on either side, the pattern matches the empty string at every position, and the | itself is never replaced.

Therefore, you must use your second option, which will return my - text.
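The difference between the two patterns can be checked without a Spark session. The following is a minimal sketch that assumes Spark's regexp_replace follows Java regular expression semantics, so plain String.replaceAll is used as a stand-in; the class and method names are made up for illustration.

```java
class PipeReplaceDemo {
    // Unescaped: "|" is an alternation of two empty patterns,
    // so it matches the empty string at every position.
    static String replaceUnescaped(String s) {
        return s.replaceAll("|", "-");
    }

    // Escaped: the regex \| matches a literal pipe character.
    // The backslash is doubled because this is a Java string literal.
    static String replaceEscaped(String s) {
        return s.replaceAll("\\|", "-");
    }

    public static void main(String[] args) {
        System.out.println(replaceUnescaped("my | text")); // -m-y- -|- -t-e-x-t-
        System.out.println(replaceEscaped("my | text"));   // my - text
    }
}
```

In Spark Java the second pattern would typically be passed as functions.regexp_replace(col("details"), "\\|", "-"), with the column name taken from the question.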

Reilas On

There are a few things to consider.

In many programming languages, a string literal may include what are called escape sequences.
An escape sequence is a syntax for writing a character that is hard to type directly, such as the non-printing new-line character.

Consider the following string value.

string = "stack\noverflow"

The reverse-solidus (backslash) here is called an escape character. It tells the compiler to treat the character that immediately follows it as part of an escape sequence, rather than as a literal n.
In this case, \n is mapped to the line-feed character, commonly referred to as a "newline" character.

If you print string to standard output, you get the following.

stack
overflow
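As a small illustration of the escape sequence described above, the sketch below compares a string containing a real line-feed with one containing a literal backslash followed by n. The class and method names are hypothetical.

```java
class EscapeSequenceDemo {
    // "\n" in the source is one character: the line-feed.
    static String withNewline() {
        return "stack\noverflow";
    }

    // "\\n" in the source is two characters: a backslash and the letter n.
    static String withLiteralBackslashN() {
        return "stack\\noverflow";
    }

    public static void main(String[] args) {
        System.out.println(withNewline());                    // printed on two lines
        System.out.println(withLiteralBackslashN());          // stack\noverflow
        System.out.println(withNewline().length());           // 14
        System.out.println(withLiteralBackslashN().length()); // 15
    }
}
```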

In your column function, regexp_replace, the second parameter (after the column) is a regular expression pattern.
A regular expression is a pattern syntax that describes a set of strings to match, rather than a single literal value.
For example, the expression [a-z] will match any lower-case letter, a through z.
So, in regex, you can type "[a-z]at", and it will match "bat", "cat", "mat", "vat", etc.
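The character-class example can be tried directly in Java. This is a minimal sketch using String.matches, which requires the whole input to match the pattern; the class and method names are invented for illustration.

```java
class CharClassDemo {
    // True only when the entire word is one lower-case letter followed by "at".
    static boolean isLowerLetterPlusAt(String word) {
        return word.matches("[a-z]at");
    }

    public static void main(String[] args) {
        System.out.println(isLowerLetterPlusAt("bat"));  // true
        System.out.println(isLowerLetterPlusAt("vat"));  // true
        System.out.println(isLowerLetterPlusAt("Bat"));  // false, upper-case B
        System.out.println(isLowerLetterPlusAt("chat")); // false, two letters before "at"
    }
}
```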

In regular expression patterns, the | character is a meta-character: it is the alternation operator, similar to the || or-operator used in most programming languages.

If you want to represent a literal | character in a regular expression pattern, you must escape it, using the \, reverse-solidus, escape-character.

So the regular expression you need is \|, an escaped meta-character. Because that pattern is written inside a Java string literal, the reverse-solidus itself must also be escaped, as \\.

So you arrive at 2 reverse-solidi: "\\|".
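The two levels of escaping can be made visible by inspecting the string the regex engine actually receives. The sketch below is an illustration under the assumption that the pattern is handled as a Java regex; it also shows Pattern.quote and String.replace as ways to avoid hand-written escapes in plain Java. The class and member names are hypothetical.

```java
import java.util.regex.Pattern;

class EscapingLevelsDemo {
    // Source text "\\|" compiles to a two-character string: backslash, pipe.
    static final String ESCAPED_PIPE = "\\|";

    // Pattern.quote wraps the text so the regex engine treats it literally.
    static String replaceWithQuote(String s) {
        return s.replaceAll(Pattern.quote("|"), "-");
    }

    // String.replace takes a literal target, with no regex involved at all.
    static String replaceLiteral(String s) {
        return s.replace("|", "-");
    }

    public static void main(String[] args) {
        System.out.println(ESCAPED_PIPE);                 // \|
        System.out.println(ESCAPED_PIPE.length());        // 2
        System.out.println(replaceWithQuote("a|b|c"));    // a-b-c
        System.out.println(replaceLiteral("a|b|c"));      // a-b-c
    }
}
```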

As for what happens if you do not include the double reverse-solidi:

Regular expression pattern matching occurs in a left-to-right traversal of the valued string.

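An unescaped | means "nothing, or nothing", so the pattern matches the empty string just before your first character and places the replacement value there, in this case a -.

It then moves past the next character and, since the empty string matches again, places another replacement value.
This continues until the end of the string is reached, where one final empty match is replaced. The original characters, including any |, are left in place.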

So, if your string value is "stack\noverflow", you would get the following replacement string, where \n stands for the actual line-feed character.

-s-t-a-c-k-\n-o-v-e-r-f-l-o-w-
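To see where the matches occur, the start index of each match can be listed with java.util.regex.Matcher. This is a minimal sketch assuming Java regex semantics; the helper name matchStarts is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Collects the start index of every match of regex in input.
    static List<Integer> matchStarts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // Unescaped pipe: an empty match at every position, including the end.
        System.out.println(matchStarts("|", "a|b"));   // [0, 1, 2, 3]
        // Escaped pipe: one match, at the pipe itself.
        System.out.println(matchStarts("\\|", "a|b")); // [1]
    }
}
```

With the unescaped pattern, a string of n characters yields n + 1 empty matches, which is why the replacement appears before every character and once more at the end.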