Replacing " " space with "" nothing using gsub( " ", "", chr)


Here is the format of my data, imported from a CSV.

print(donnees_ventes3$V8[1:10])
 [1] "0,00000"     "0,00000"     "0,00000"     "0,00000"     "0,00000"     "0,00000"     "4,22476"     "0,00000"     "1 086,16998" "0,00000"

I can successfully replace the "," with ".", but removing the space between the thousands and the hundreds does not work.

Here is what I tried:

gsub(" ", "", donnees_ventes3$V8)
gsub(" ", "\\", donnees_ventes3$V8)
gsub("\\ ", "\\", donnees_ventes3$V8)
gsub("\\ ", "", donnees_ventes3$V8)
gsub("\\  ", "\\", donnees_ventes3$V8) # thought there might be two spaces

Also tried:

str_replace_all(donnees_ventes3$V8, " ", "")

I can't give a reproducible example, because the problem disappears when I create an example vector by hand:

exemple <- c("0,00000", "0,00000", "0,00000", "0,00000", "0,00000", "0,00000", "4,22476", "0,00000", "1 086,16998", "0,00000")

With this vector, gsub(" ", "", exemple) works and changes the 9th element to "1086,16998".

So the problem must have something to do with how the CSV is imported and with the kind of blank " " character it contains. Here is how I import it:

read.csv("csv path", encoding="UTF-8", header=FALSE)

Might it be the encoding?

Does anyone have a clue about what's wrong here?

SamR On

All that is spacy is not whitespace

Base R functions like gsub() use POSIX Extended Regular Expressions (when perl=FALSE). In these \s is defined as the character set [ \t\r\n\v\f], i.e. space, tab, carriage return, new line, vertical tab, form feed.

However, Unicode text can contain characters that print like a space but are not included in this set, such as No-Break Space (NBSP), i.e. \U00A0.

space  <- "This is a space"
not_space  <- "This\U00A0is\U00A0not"
print(space) # [1] "This is a space"
print(not_space) # [1] "This is not"

You cannot use "\\s" to replace these characters:

gsub("\\s", "", not_space)
# [1] "This is not"

Perl Compatible Regular Expressions (PCRE)

You can capture these spacy characters with PCRE. Interestingly, the Perl regex docs indicate that \s includes NBSP, but with perl = TRUE this does not happen by default in R:

gsub("\\s", "", not_space, perl= TRUE)
# [1] "This is not"
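
A possible explanation, offered here as an assumption rather than something established above: R's ?regex help page describes \d, \s and \w as matching only ASCII characters by default when perl = TRUE, and says that starting the pattern with (*UCP) switches them to Unicode properties. A minimal sketch, reusing the not_space string:

```r
not_space <- "This\U00A0is\U00A0not"

# (*UCP) asks PCRE to use Unicode properties for \s, \d and \w.
# This relies on the input being treated as UTF-8, which is the case
# for strings written with \U escapes.
cleaned <- gsub("(*UCP)\\s", "", not_space, perl = TRUE)
print(cleaned)
```

If this behaves as documented, the no-break spaces are removed along with ordinary whitespace.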

However, the PCRE docs note:

The sequences \h, \H, \v, and \V are features that were added to Perl at release 5.10

These represent horizontal and vertical spaces, and their negation. \h captures the 19 Unicode horizontal space characters, including NBSP:

gsub("\\h", "", space, perl= TRUE)
# [1] "Thisisaspace"
gsub("\\h", "", not_space, perl= TRUE)
# [1] "Thisisnot"

Identifying other characters

In the event you have a character not captured by \h or \v, you can easily see what it is with stringr::str_view():

stringr::str_view(space) 
# [1] │ This is a space
stringr::str_view(not_space) 
# [1] │ This{\u00a0}is{\u00a0}not

Or if you prefer only base R you can identify them slightly less clearly:

# Unicode for space (0020)
as.hexmode(utf8ToInt(" ")) # 20
# Unicode for NBSP, the 5th character of not_space
as.hexmode(utf8ToInt(substr(not_space, 5,5))) # a0
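
When you do not know where the odd character sits, the same idea can be applied to a whole vector. The sketch below uses a hypothetical helper, odd_code_points(), which is not part of base R or of the answer above. It assumes the column should contain only digits, commas and full stops, and reports the code point of anything else:

```r
# Hypothetical helper: list the code points of every character in x
# that is not an ASCII digit, a comma or a full stop.
odd_code_points <- function(x) {
  # Split each string into single characters and pool the distinct ones
  chars <- unique(unlist(strsplit(x, "")))
  # Keep only the unexpected characters
  odd <- chars[!grepl("^[0-9,.]$", chars)]
  # Format each one as U+XXXX
  sprintf("U+%04X", vapply(odd, utf8ToInt, integer(1), USE.NAMES = FALSE))
}

# Made-up sample data: one NBSP separator and one ordinary space
v_check <- c("0,00000", "1\U00A0086,16998", "2 500,5")
found <- odd_code_points(v_check)
print(found)
# [1] "U+00A0" "U+0020"
```

Running it on the imported column, for example odd_code_points(donnees_ventes3$V8), should show which separator the CSV actually uses.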

Individual replacements

It is laborious but you can replace individual characters by their character code:

gsub("\U00A0", "", not_space)
# [1] "Thisisnot"

Replacing all non-digits

Depending on your ultimate goal, you might want to remove all characters that are not digits:

v  <- c("1\U{00A0}086,16998", "1\U{00A0}086,16998")
print(v) # [1] "1 086,16998" "1 086,16998"
gsub("\\D+", "", v)
# [1] "108616998" "108616998"

Or if you are trying to remove spaces and change commas to decimal points you could remove all characters that are not digits and commas, then substitute the commas:

v |>
    gsub("[^0-9,]", "", x=_) |>
    gsub(",", ".", x=_)
# [1] "1086.16998" "1086.16998"
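
If the end goal is a numeric column, the cleaned strings can then be passed to as.numeric(). This is a sketch on made-up values in the same format as the question. Note that the pattern "[^0-9,]" would also strip a minus sign, so it assumes there are no negative values:

```r
v_sales <- c("0,00000", "4,22476", "1\U00A0086,16998")

# Drop everything except digits and commas, then switch to a decimal point
cleaned_sales <- gsub(",", ".", gsub("[^0-9,]", "", v_sales), fixed = TRUE)

# Convert to numeric; anything that still fails to parse becomes NA
# (with a warning)
sales <- as.numeric(cleaned_sales)
print(sales)
```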

Replacing all Unicode space characters

Based on the question linked by Roland, "\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF" seems to be a comprehensive list of the 19 Unicode characters for spaces.

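This list overlaps heavily with the 19 horizontal space characters in the PCRE specification but is not identical to it: \h also covers tab and the ordinary space, while the list above adds U+200B and U+FEFF. Most of these characters should therefore be covered by \h with perl = TRUE. However, if you can't use PCRE, or you want to add in extra characters, you can list the relevant character codes and replace them with a space using chartr():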

v <- chartr(
    "\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF",
    paste(rep(" ", 19), collapse = ""),
    v
)

Then apply whatever code you had written for data which contains spaces.
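
As an illustration, here is a shortened sketch of that workflow on made-up values. Only two of the space characters are listed to keep it brief, and the last step is the gsub(" ", "", ...) call from the question:

```r
v_mixed <- c("1\U00A0086,16998", "2\u202F500,5")

# Translate NBSP and narrow NBSP to ordinary spaces
# (old and new must be the same length)
v_spaces <- chartr("\u00A0\u202F", "  ", v_mixed)

# The original attempt from the question now works
v_clean <- gsub(" ", "", v_spaces, fixed = TRUE)
print(v_clean)
# [1] "1086,16998" "2500,5"
```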