I am trying to find an algorithm that calculates the minimal Levenshtein distance against a database very quickly, doing as much work as possible before the input string is known.
I have an application that reads strings with an OCR algorithm, which makes a lot of mistakes. I also have a database containing the strings I am trying to recognize, and this database is constantly growing. It is quite possible, even likely, that a read string is not part of the database at all, even if the OCR read it perfectly. In most cases, the database entry with the smallest Levenshtein distance to the read string is the most likely match.

So far I have calculated the Levenshtein distance between the read string and every single entry in the database using the Wagner-Fischer algorithm (sketched below). Unfortunately, this is the only time-critical part of the application, and it takes too long. I was therefore wondering whether I could speed it up by pre-structuring the database in some way that lets me find the smallest Levenshtein distance faster. This pre-structuring may itself take as long as it needs to; the critical part is the time between reading a string and finding its closest match in the database.
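For reference, my current brute-force approach looks roughly like this (a minimal Python sketch; `levenshtein` and `closest_match` are illustrative names, not my actual code):

```python
def levenshtein(a: str, b: str) -> int:
    """Wagner-Fischer dynamic programming, keeping only one previous row.

    Runs in O(len(a) * len(b)) time and O(min(len(a), len(b))) space.
    """
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the row short
    previous = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def closest_match(query: str, database: list[str]) -> tuple[str, int]:
    """Brute force: compare the query against every database entry."""
    best_entry, best_distance = None, float("inf")
    for entry in database:
        d = levenshtein(query, entry)
        if d < best_distance:
            best_entry, best_distance = entry, d
    return best_entry, best_distance
```

With N database entries of average length m and a query of length n, this scan costs O(N * m * n) per query, which is exactly the part that is too slow for me.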
I am looking for a way to pre-structure the database so that I only have to touch a minimal number of entries per query.
Is there a good way to reduce the number of database strings I have to touch? Can the strings be organized so that large parts of the database are excluded efficiently, even when the query strings are short?