Here's an example of two strings that look identical but are not technically the same. I know for a fact that they come from the same original value and have just been processed differently.
str1 <- "Tapajós"
str2 <- "Tapajós"
# The 2 strings are different
str1 == str2
#> [1] FALSE
# Indeed, the underlying UTF-8 bytes are different:
charToRaw(str1)
#> [1] 54 61 70 61 6a 6f cc 81 73
charToRaw(str2)
#> [1] 54 61 70 61 6a c3 b3 73
# Somehow stringi can figure out that they're actually the same
stringi::stri_cmp_equiv(str1, str2)
#> [1] TRUE
Created on 2022-03-16 by the reprex package (v2.0.0)
I'd like to turn str1 and str2 into the same string -- which I understand is referred to as their canonical form.
I tried a few things playing around with stringi and encodings, but I couldn't figure out how to do it.
The fact that stringi::stri_cmp_equiv recognises them as equivalent gives me hope though! I feel like I just lack the right keywords to find the answer.
I was indeed just missing a keyword: "unicode normalisation forms".
This is easily done with the stri_trans_* functions from stringi, typically stringi::stri_trans_nfc.
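A minimal sketch of what that looks like for the two strings from the question. I've written them with \u escapes here (my own choice, so the decomposed and precomposed forms survive copy-pasting); the variable names nfc1 and nfc2 are just for illustration.

```r
# str1 is the decomposed form: "o" followed by U+0301 (combining acute accent)
str1 <- "Tapajo\u0301s"
# str2 is the precomposed form: U+00F3 (o with acute) as a single code point
str2 <- "Tapaj\u00f3s"

# As in the question, the raw strings do not compare equal
str1 == str2
#> [1] FALSE

# Normalise both to NFC (canonical composition)
nfc1 <- stringi::stri_trans_nfc(str1)
nfc2 <- stringi::stri_trans_nfc(str2)

# After normalisation they are the same string, byte for byte
nfc1 == nfc2
#> [1] TRUE
charToRaw(nfc1)
#> [1] 54 61 70 61 6a c3 b3 73
```

stringi::stri_trans_nfd goes the other way (canonical decomposition), so applying it to both strings would also make them equal, just with the 6f cc 81 byte sequence instead of c3 b3. Which form to pick mostly depends on what the rest of your data uses; NFC is the more common choice.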