I am finding suburbs in an address string using the difflib module. It is important that the index position of each suburb within the string is recorded along with the suburb, so I store both the suburb and its index. However, the two-word suburbs (Lower Hutt and Point Howard) are matched on different words: one on its first word and the other on its second. The print statement shows this.
import difflib
# Sample address line with multiple suburbs
addressLine = '15 Long St,Lower Hutt, Wool Merch Sorter Newby Road, Point Howard Captain Short St, Eastbourne, Farmer'
# print("address line:",addressLine)
# A list of the separate words of addressLine
splits = []
# Provide a ref list of correct suburb spelling
burbRefs = ['Days Bay', 'Eastbourne', 'Lowry Bay', 'Lower Hutt', 'Point Howard', 'Wainuiomata', 'York Bay']
# A place to store found suburbs and string positions as tuples
burbList = ()
# Replace all commas, in the copied string, with spaces to treat each word equally
addressLine = addressLine.replace(',',' ')
# For a double space remove one of the spaces
addressLine = addressLine.replace('  ',' ')
# Begin by splitting the string, loaded from the list, using spaces (most reliable method)
splits = addressLine.split(' ')
print("splits check:",splits)
# Load each string token from list of strings
for i in range(len(splits)):
    # Test for a suburb - difflib returns a list with a single string if found
    burbResult = difflib.get_close_matches(splits[i], burbRefs, n=1)
    print("index",i,"split",splits[i],"burb result:", burbResult)
    # If test for suburb successful
    if burbResult != []:
        # print("Suburbs found:",burbResult[0],"index:",i)
        # Extract string from list
        burb = burbResult[0]
        # Store the suburb and string index where found
        burbList += burb,i
        # Re-initialize result variable
        burbResult = ''
Output:
splits check: ['15', 'Long', 'St', 'Lower', 'Hutt', 'Wool', 'Merch', 'Sorter', 'Newby', 'Road', 'Point', 'Howard', 'Captain', 'Short', 'St', 'Eastbourne', 'Farmer']
index 0 split 15 burb result: []
index 1 split Long burb result: []
index 2 split St burb result: []
index 3 split Lower burb result: ['Lower Hutt']
index 4 split Hutt burb result: []
index 5 split Wool burb result: []
index 6 split Merch burb result: []
index 7 split Sorter burb result: []
index 8 split Newby burb result: []
index 9 split Road burb result: []
index 10 split Point burb result: []
index 11 split Howard burb result: ['Point Howard']
index 12 split Captain burb result: []
index 13 split Short burb result: []
index 14 split St burb result: []
index 15 split Eastbourne burb result: ['Eastbourne']
index 16 split Farmer burb result: []
With reference to the docs, the word argument to difflib.get_close_matches is splits[i]. However, my example involves two-word suburbs, and joining the words with an underscore is not an option. The possibilities argument is burbRefs, which contains the matching two-word suburbs. The third argument, n=1, is the maximum number of close matches to return.
Why is the index for Point Howard 11 when it should be 10 to be consistent with the other two-word suburb, Lower Hutt?
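One way to investigate is to print the similarity score behind each comparison. The docs say get_close_matches only keeps candidates that score at least cutoff (default 0.6), and the score is the SequenceMatcher ratio. The following is a minimal diagnostic sketch under that assumption; the names pairs, CUTOFF and scores are introduced here purely for illustration.

```python
import difflib

# Token / suburb pairs taken from the output above
pairs = [
    ('Lower', 'Lower Hutt'),
    ('Hutt', 'Lower Hutt'),
    ('Point', 'Point Howard'),
    ('Howard', 'Point Howard'),
]

# Assumption: the default cutoff of get_close_matches (0.6) is in effect
CUTOFF = 0.6

scores = {}
for token, suburb in pairs:
    # get_close_matches compares each possibility (seq1) against the word (seq2)
    ratio = difflib.SequenceMatcher(None, suburb, token).ratio()
    scores[(token, suburb)] = ratio
    print(f"{token!r} vs {suburb!r}: ratio={ratio:.3f} passes={ratio >= CUTOFF}")
```

If 'Lower' and 'Howard' score above the cutoff while 'Hutt' and 'Point' fall below it, then the recorded index depends only on which single word is long enough relative to the full suburb name, not on whether it is the first or second word.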