Here is the format of my data, imported from a CSV.
print(donnees_ventes3$V8[1:10])
[1] "0,00000" "0,00000" "0,00000" "0,00000" "0,00000" "0,00000" "4,22476" "0,00000" "1 086,16998" "0,00000"
I can successfully change the "," to ".", but removing the space between the thousands and the hundreds is not working.
Here is what I tried:
gsub(" ", "", donnees_ventes3$V8)
gsub(" ", "\\", donnees_ventes3$V8)
gsub("\\ ", "\\", donnees_ventes3$V8)
gsub("\\ ", "", donnees_ventes3$V8)
gsub("\\ ", "\\", donnees_ventes3$V8) # thought there might be two spaces
Also tried:
str_replace_all(donnees_ventes3$V8, " ", "")
I can't give a reproducible example, because when I create an example vector:
exemple <- c("0,00000", "0,00000", "0,00000", "0,00000", "0,00000", "0,00000", "4,22476", "0,00000", "1 086,16998", "0,00000")
gsub(" ", "", exemple) works and changes the 9th element to "1086,16998".
So it must have something to do with how the CSV is imported and with which character the " " blank space actually is. Here is how I import it:
read.csv("csv path", encoding="UTF-8", header=FALSE)
Might it be the encoding?
Does anyone have a clue about what's wrong here?
All that is spacy is not whitespace
Base R functions like gsub() use POSIX Extended Regular Expressions (when perl=FALSE). In these, \s is defined as the character set [ \t\r\n\v\f], i.e. space, tab, carriage return, new line, vertical tab, form feed.
However, Unicode text can contain characters that print like a space but are not included in this set, such as No-Break Space (NBSP), i.e. \U00A0.
You cannot use "\\s" to replace these characters.
Perl Compatible Regular Expressions (PCRE)
You can capture these spacy characters with PCRE. Interestingly, the Perl regex docs indicate that \s includes NBSP, but R seems not yet to have implemented this.
However, the PCRE docs note:
These represent horizontal and vertical spaces, and their negation.
\h captures the 19 Unicode horizontal space characters, including NBSP, so a call such as gsub("\\h", "", x, perl = TRUE) removes them.
Identifying other characters
In the event you have a character not captured by \h or \v, you can easily see what it is with stringr::str_view().
Or, if you prefer only base R, you can identify them slightly less clearly:
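The original base R snippet is not shown here, so the following is only a sketch of one way to do it. It assumes a hypothetical value `x` containing a no-break space and uses `utf8ToInt()` to list the code points that are neither digits nor commas:

```r
# Hypothetical example value: the "space" is really U+00A0 (no-break space).
x <- "1\u00a0086,16998"

# Convert the string to its Unicode code points.
code_points <- utf8ToInt(x)

# Code points we expect in a number written with a decimal comma.
expected <- utf8ToInt("0123456789,")

# Anything else is suspicious; format it as U+XXXX for lookup.
unexpected <- code_points[!code_points %in% expected]
unexpected_labels <- sprintf("U+%04X", unexpected)
print(unexpected_labels)
```

The printed label can then be looked up in a Unicode table to see which character actually came in with the CSV.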
Individual replacements
It is laborious but you can replace individual characters by their character code:
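As a minimal sketch, assuming the offending character is a no-break space (U+00A0) and using a made-up vector `x` rather than the asker's real data:

```r
# Hypothetical sample; "\u00a0" is a no-break space, not an ordinary space.
x <- c("0,00000", "1\u00a0086,16998")

# fixed = TRUE treats the pattern as a literal string, not a regex.
cleaned <- gsub("\u00a0", "", x, fixed = TRUE)
print(cleaned)
```

Each additional space-like character needs its own call (or its own entry in a character class), which is what makes this approach laborious.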
Replacing all non-digits
Depending on your ultimate goal, you might want to remove all characters that are not digits:
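A sketch of this idea, using the bracket expression `[^0-9]` as one possible way to say "not a digit" and a hypothetical vector `x`:

```r
# Hypothetical sample containing a no-break space (U+00A0).
x <- c("0,00000", "1\u00a0086,16998")

# Drop every character that is not an ASCII digit.
digits_only <- gsub("[^0-9]", "", x)
print(digits_only)
```

Note that this also drops the comma, so the position of the decimal separator is lost.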
Or if you are trying to remove spaces and change commas to decimal points you could remove all characters that are not digits and commas, then substitute the commas:
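This could look roughly as follows. The vector `x` is a hypothetical sample, and the final `as.numeric()` step is an assumption about the end goal:

```r
# Hypothetical sample containing a no-break space (U+00A0).
x <- c("0,00000", "4,22476", "1\u00a0086,16998")

# Keep only digits and commas; this removes any kind of space.
kept <- gsub("[^0-9,]", "", x)

# Turn the decimal comma into a decimal point, then convert to numbers.
values <- as.numeric(sub(",", ".", kept, fixed = TRUE))
print(values)
```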
Replacing all Unicode space characters
Based on the question linked by Roland, "\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF" seems to be a comprehensive list of 19 Unicode characters for spaces.
This list largely overlaps with the 19 horizontal space characters in the PCRE specification (which also counts tab and the ordinary space, but not \u200B or \uFEFF), so most of them should be covered by \h with perl = TRUE. However, if you can't use PCRE, or you want to add in extra characters, you can list the relevant character codes and replace them with a space using chartr().
Then apply whatever code you had written for data which contains spaces.
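A sketch of the chartr() approach, compared with the PCRE route. The vector `x` and the short list in `spacy` are assumptions for illustration only; extend `spacy` with whichever characters turn up in your data:

```r
# Hypothetical sample with a no-break space and a narrow no-break space.
x <- c("0,00000", "1\u00a0086,16998", "2\u202f500,5")

# Assumed subset of space-like characters to normalise.
spacy <- "\u00a0\u2009\u202f"

# chartr() maps character by character, so build a string of
# ordinary spaces with the same number of characters.
plain <- strrep(" ", nchar(spacy))
normalised <- chartr(spacy, plain, x)

# Code written for ordinary spaces now works.
no_spaces <- gsub(" ", "", normalised, fixed = TRUE)

# For comparison: the PCRE route with \h.
via_pcre <- gsub("\\h", "", x, perl = TRUE)

print(no_spaces)
print(via_pcre)
```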