Interpreting Combined Address Fields using known list of correct data - Efficiency Question

43 Views Asked by At

One table contains:

ADDR varchar,
CONTACT varchar, 
REFER varchar,
TEMP_FK number

The Foreign Key and Type are our delineators for unique outputs, related to existing data relationships. Contact remains unchanged, and the Address column does not contain unique addresses.

The ADDR Column is the address column formatted as "1234 Main St Apt 12 City State ZIP-4ZIP" where '-4ZIP' is optional. This is the main problem we're dealing with. We want to parse this into it's components, formatted as:

Line1: "1234 Main St"
Line2: "Apt 12"
City: "City"
State "State Abbreviation"
ZIP: "5ZIP"

I've assembled 2 other tables provided by USPS, one containing every City / State / Zip Code combination, and the other for every Street Delineation used in the Unites States (eg. Street, St, Place, Pl, etc).

For this question, we're only dealing with the first table of City / State / ZipCode combinations.

Thankfully this process needs to happen just once and going forward the data should be in this new format. So far I've come up with the following for Cleaning up everything but Lines 1&2:

select *
from(
select  substr(ADDR, 1, instr(ADDR, ZP.CITY)-1) as LINES_1_2,
        CONTACT,
        ZP.CITY,
        ZP.ST,
        ZP.ZIP,
        ADDR,
        row_number() over (partition by ADDR order by ZP.CITY DESC) as ADD_UNIQ,
        row_number() over (partition by TEMP_FK || REFER order by TEMP_FK DESC) as KEY_UNIQ,
        TEMP_FK,
        REFER
from address_temp
cross apply(
    select  physical_city as CITY,
            physical_state_abv as ST,
            physical_zip as ZIP
            from ZIP_USPS
            where   ADDR LIKE ('%' || ZIP_USPS.PHYSICAL_CITY ||  '%')
            AND (ADDR LIKE ('%' || ZIP_USPS.PHYSICAL_STATE_ABV || '%')
            OR ADDR LIKE ('%' || ZIP_USPS.PHYSICAL_STATE || '%'))
            AND ADDR LIKE ('%' || ZIP_USPS.PHYSICAL_ZIP || '%')
)ZP
)
where KEY_UNIQ = 1

KEY_UNIQ Solves the Duplicating Rows issue, as the ZIP_USPS table has the same City, State, Zip Fields multiple times, as that table is unique based upon the last 4 of a Zip, eg XXXXX-XXXX. The last 4 of the ZIP is its own Column. ADD_UNIQ I was using to group addresses, and obtain a value for uniqueness, for later interpolating that we'll be doing.

My first iteration of this was using inner join instead of Cross Apply, but Cross Apply is faster. I'm assuming all of my lost time (creating a sample of 40 rows processes in around 80s) is from my where clause with all the wild cards and variable matches it has to check for (the ZIP_USPS table has 44045 rows). I thought of using distinct, on my cross apply, but it is around 0.66s slower when I did it that way. That statement was:

select  substr(ADDR, 1, instr(ADDR, ZP.CITY)-1) as LINES_1_2,
        CONTACT,
        ZP.CITY,
        ZP.ST,
        ZP.ZIP,
        ADDR,
        TEMP_FK,
        REFER
from address_temp
cross apply(
    select distinct  physical_city as CITY,
            physical_state_abv as ST,
            physical_zip as ZIP
            from ZIP_USPS
            where   ADDR LIKE ('%' || ZIP_USPS.PHYSICAL_CITY ||  '%')
            AND (ADDR LIKE ('%' || ZIP_USPS.PHYSICAL_STATE_ABV || '%')
            OR ADDR LIKE ('%' || ZIP_USPS.PHYSICAL_STATE || '%'))
            AND ADDR LIKE ('%' || ZIP_USPS.PHYSICAL_ZIP || '%')
)ZP

My data sample is not clean enough to fully just search by ZIP for my output, and I'd be worried about those few occasions where ZIPs overlap cities in larger metropolitan areas.

My main question here is efficiency. I think I'm doing something wrong, but can't really see a better way to do it.

What improvements could / should I make here?

Sample Data:

ADDR CONTACT REFER TEMP_FK
534 MAIN ST HAMPDEN MA 01036 156 final 200
1 WILLIAMS ST WILLIAMSBURG MA 01096 100 final 3000
650 DWIGHT ST HOLYOKE MA 01040 95 inv 3500
650 DWIGHT ST HOLYOKE MA 01040 95 final 3500
83 WINSOR ST LUDLOW MA 01056 300 inv 2333
40 POST OFFICE PARK WILBRAHAM MA 01095 250 inv 3333

Sample Output:

LINES_1_2 CITY ST ZIP ADDR CONTACT REFER TEMP_FK ADD_UNIQ KEY_UNIQ
534 MAIN ST HAMPDEN MA 01036 534 MAIN ST HAMPDEN MA 01036 156 final 200 1 1
1 WILLIAMS ST WILLIAMSBURG MA 01096 1 WILLIAMS ST WILLIAMSBURG MA 01096 100 final 3000 1 1
650 DWIGHT ST HOLYOKE MA 01040 650 DWIGHT ST HOLYOKE MA 01040 95 inv 3500 1 1
650 DWIGHT ST HOLYOKE MA 01040 650 DWIGHT ST HOLYOKE MA 01040 95 final 3501 2 1
83 WINSOR ST LUDLOW MA 01056 83 WINSOR ST LUDLOW MA 01056 300 inv 2333 1 1
40 POST OFFICE PARK WILBRAHAM MA 01095 40 POST OFFICE PARK WILBRAHAM MA 01095 250 inv 3333 1 1
0

There are 0 best solutions below