Match county names to predefined list


I import large lists of county names that people wrote down by hand, and these now need to be matched to a predefined list of counties.

I understand that the matches cannot be perfect, but the operator should be shown a list of "wrong" county names together with their best matches, and ideally only needs to click "ok" to accept them and proceed.

So far I have only used Levenshtein distance, which is great for catching typos but not for abbreviations. Real-world examples that would be obvious to a human, but which Levenshtein does not match correctly:

  • Input: Siegen
  • Should be: Siegen-Wittgenstein
  • Levenshtein: Hagen

or

  • Input: Rhein.-Berg. Kreis
  • Should be: Rheinisch-Bergischer Kreis
  • Levenshtein: Rhein-Sieg-Kreis
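To make the mismatch concrete, here is a minimal sketch (in Python rather than PHP, purely for illustration; `levenshtein` and `best_match` are hypothetical helper names) that ranks the two candidates from the first example by plain edit distance. The short, unrelated name wins because completing the abbreviation costs one insertion per missing character:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def best_match(name: str, candidates: list[str]) -> str:
    """Return the candidate with the smallest edit distance to name."""
    return min(candidates, key=lambda candidate: levenshtein(name, candidate))


# Only the two names from the first example; a real list would hold all counties.
counties = ["Siegen-Wittgenstein", "Hagen"]

print(levenshtein("Siegen", "Hagen"))                # 3
print(levenshtein("Siegen", "Siegen-Wittgenstein"))  # 13
print(best_match("Siegen", counties))                # Hagen
```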

How can I catch those abbreviations as well?

I use PHP, but this is more a question about the right algorithm than about PHP.

Saaru Lindestøkke

As you're dealing with geographic data, perhaps a geocoding approach could work? An example API would be Nominatim, but there are many other (paid) options.

If I use your two examples in the debugging interface of Nominatim I get the following:

Example 1:

  • Input: Siegen
  • Expected output: Siegen-Wittgenstein
  • Nominatim outputs:
    • Siegen, Kreis Siegen-Wittgenstein, North Rhine-Westphalia, Germany
    • Siegen, Haguenau-Wissembourg, Bas-Rhin, Grand Est, Metropolitan France, 67160, France
    • Siegen, Kreis Siegen-Wittgenstein, North Rhine-Westphalia, 57072, Germany

Example 2:

  • Input: Rhein.-Berg. Kreis
  • Expected output: Rheinisch-Bergischer Kreis
  • Nominatim outputs:
    • Geschäftsstelle DRK Kreisverband Rhein.-Berg. Kreis, 261, Hauptstraße, Heidkamp, Bergisch Gladbach, Rheinisch-Bergischer Kreis, North Rhine-Westphalia, 51465, Germany
    • Auf dem Kreis, Bad Berleburg, Kreis Siegen-Wittgenstein, North Rhine-Westphalia, Germany
    • Region Rhein-Neckar (HE), Kreis Bergstraße, Hesse, Germany

The quality is mixed, and the results definitely require additional filtering. Without it, you get various result types (administrative regions, a mountain peak, a charity) and also results from various countries (Germany and France).

However, you can filter the detailed output so that only county names in a specific country are returned, which gives you the output you want.
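A minimal sketch of such a filter, in Python for illustration only. It assumes the public Nominatim search endpoint with `format=jsonv2`, `addressdetails=1` and `countrycodes`, and it assumes that the county appears under the `county` key of each result's `address` object; check both against a real response for your data. The function names and the User-Agent string are hypothetical.

```python
import json
import urllib.parse
import urllib.request

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(query: str, country_code: str = "de", limit: int = 10) -> str:
    """Build a search URL that asks for address details, restricted to one country."""
    params = {
        "q": query,
        "format": "jsonv2",
        "addressdetails": 1,
        "countrycodes": country_code,
        "limit": limit,
    }
    return NOMINATIM_SEARCH + "?" + urllib.parse.urlencode(params)


def county_candidates(results: list[dict], country_code: str = "de") -> list[str]:
    """Collect distinct county names from parsed results, keeping their order."""
    counties: list[str] = []
    for item in results:
        address = item.get("address", {})
        if address.get("country_code") != country_code:
            continue  # drop hits from other countries (e.g. the French Siegen)
        county = address.get("county")  # assumption: counties are reported here
        if county and county not in counties:
            counties.append(county)
    return counties


def lookup(query: str, country_code: str = "de") -> list[str]:
    """Query Nominatim and return the county candidates for one input name."""
    request = urllib.request.Request(
        build_search_url(query, country_code),
        # The public instance expects an identifying User-Agent; placeholder value.
        headers={"User-Agent": "county-matcher-example"},
    )
    with urllib.request.urlopen(request) as response:
        return county_candidates(json.load(response), country_code)
```

Calling `lookup("Siegen")` would then return county names as Nominatim spells them (possibly with a prefix such as "Kreis"), so a final comparison against the predefined list, for instance with the Levenshtein distance already in use, is still needed before the operator confirms the match.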