Convert string to canonical form with R


Here's an example of 2 strings that are not technically the same. I know for a fact that these 2 strings come from the same original value and have just been processed differently.

str1 <- "Tapajo\u0301s"  # "o" followed by a combining acute accent (U+0301)
str2 <- "Tapaj\u00f3s"   # precomposed "ó" (U+00F3)

# The 2 strings are different
str1 == str2
#> [1] FALSE

# Indeed, the underlying UTF-8 bytes are different:
charToRaw(str1)
#> [1] 54 61 70 61 6a 6f cc 81 73
charToRaw(str2)
#> [1] 54 61 70 61 6a c3 b3 73

# Somehow stringi can figure out that they're actually the same
stringi::stri_cmp_equiv(str1, str2)
#> [1] TRUE

Created on 2022-03-16 by the reprex package (v2.0.0)

I'd like to turn str1 and str2 into the same string -- which I understand is referred to as their canonical form.

I tried a few things playing around with stringi and encodings, but I couldn't figure out how to do it.

The fact that stringi::stri_cmp_equiv recognises them as equivalent gives me hope though! I feel like I just lack the right keywords to find the answer.

Answered by asachet

I was indeed just missing a keyword: "Unicode normalisation forms".

This is easily done with the stri_trans_nf* functions from stringi (stri_trans_nfc, stri_trans_nfd and so on), typically stringi::stri_trans_nfc.

str1 <- "Tapajo\u0301s"  # "o" followed by a combining acute accent (U+0301)
str2 <- "Tapaj\u00f3s"   # precomposed "ó" (U+00F3)

# The 2 strings are different
str1 == str2
#> [1] FALSE

# But their "normalisation form C" or NFC is the same
stringi::stri_trans_nfc(str1)
#> [1] "Tapajós"
stringi::stri_trans_nfc(str1) == stringi::stri_trans_nfc(str2)
#> [1] TRUE

Created on 2022-03-16 by the reprex package (v2.0.0)
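
To make the difference between the two forms easier to see, here is a minimal sketch that builds both strings from explicit Unicode escapes, so the result does not depend on how the text was copied. The variable names decomposed, precomposed and x are only illustrative. It assumes stringi is installed and uses stri_trans_nfc, stri_trans_nfd and stri_trans_isnfc.

```r
library(stringi)

# Escapes make the two representations explicit
decomposed  <- "Tapajo\u0301s"  # "o" + U+0301 COMBINING ACUTE ACCENT
precomposed <- "Tapaj\u00f3s"   # single code point U+00F3

# The strings differ in their number of code points
length(utf8ToInt(decomposed))   # 8
length(utf8ToInt(precomposed))  # 7

# NFC composes, NFD decomposes; either one yields a single canonical form
stri_trans_nfc(decomposed) == precomposed
stri_trans_nfd(precomposed) == decomposed

# Check which elements are already in NFC
stri_trans_isnfc(c(decomposed, precomposed))

# Normalise a whole vector before comparing, deduplicating or joining on it
x <- c(decomposed, precomposed)
unique(stri_trans_nfc(x))
```

Which form to pick is mostly a matter of convention. NFC is probably the more common choice for stored text, but the important part is to apply the same form to every string before comparing them.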