Wrong levenshtein UTF8

86 Views Asked by At

From here Levenshtein distance on diacritic characters I'm using this PHP function to calculate the levenshtein distance for UTF8 characters.

function levenshtein_php($str1, $str2){
  $length1 = mb_strlen( $str1, 'UTF-8');
  $length2 = mb_strlen( $str2, 'UTF-8');
  if( $length1 < $length2) return levenshtein_php($str2, $str1);
  if( $length1 == 0 ) return $length2;
  if( $str1 === $str2) return 0;
  $prevRow = range( 0, $length2);
  $currentRow = array();
  for ( $i = 0; $i < $length1; $i++ ) {
    $currentRow=array();
    $currentRow[0] = $i + 1;
    $c1 = mb_substr( $str1, $i, 1, 'UTF-8') ;
    for ( $j = 0; $j < $length2; $j++ ) {
      $c2 = mb_substr( $str2, $j, 1, 'UTF-8' );
      $insertions = $prevRow[$j+1] + 1;
      $deletions = $currentRow[$j] + 1;
      $substitutions = $prevRow[$j] + (($c1 != $c2)?1:0);
      $currentRow[] = min($insertions, $deletions, $substitutions);
    }
    $prevRow = $currentRow;
  }
  return $prevRow[$length2];
}

When I'm using it with $string1 = 'încat de counștițe'; $string2= 'încântat';

I get that the levensthein difference is 13 , which I consider to be wrong.

The only 2 options that I see are the following changes to $string1:

    1)
    a) add the characters 'ânt' to reach $string1 = 'încântat de counștițe';
    b) delete the characters ' de counștițe' to reach $string1 = 'încântat' ;

which would lead to a difference of 17 changes


    2) 
    a) replace 'a' with 'â' to reach $string1 = 'încât de counștițe';
    b) delete 't de cou' to reach $string1 = 'încânștițe';
    c) delete 'ș' to reach $string1 = 'încântițe';
    d) add characters 'at' to reach to reach $string1 = 'încântatițe';
    e) remove characters 'ițe' to reach $string1 = 'încântat';

with an levensthein distance of 15

Could you please help me correct the above levensthein_php code to return the correct difference. If there is any other existing PHP function I'd be happy to use it, however I unerstood that there is no function mb_levenshtein.

If it matters I'm running it on PHP 7.4.33.

2

There are 2 best solutions below

0
lukas.j On BEST ANSWER

Your second example of calculating the distance contains an error in my opinion:

a) replace 'a' with 'â' to reach $string1 = 'încât de counștițe';
b) delete 't de cou' to reach $string1 = 'încânștițe';
c) delete 'ș' to reach $string1 = 'încântițe';
d) add characters 'at' to reach $string1 = 'încântatițe';
e) remove characters 'ițe' to reach $string1 = 'încântat';

The last two lines should read:

d) replace characters 'iț' with 'at' to reach $string1 = 'încântate';
e) remove character 'e' to reach $string1 = 'încântat';

Therefore the result is 13 and not 15, and the function works correctly.

6
Rick James On

Your example text implies that the character set is limited to certain Western European alphabets. I suggest you use the suitable conversion routine to change from UTF-8 to Latin1 (or other suitable encoding). Then the characters would be single-byte, and no "mb" routines are needed.

Try MySQL's CHARACTER SET cp1250 and latin2 -- they seem to have the desired s or t with cedilla. Try the _general_ci and _croatian_ci COLLATIONs if you need case and accent folding. Or _bin to be case and accent sensitive.