One table contains:
ADDR varchar,
CONTACT varchar,
REFER varchar,
TEMP_FK number
The Foreign Key and Type are our delineators for unique outputs, related to existing data relationships. Contact remains unchanged, and the Address column does not contain unique addresses.
The ADDR Column is the address column formatted as "1234 Main St Apt 12 City State ZIP-4ZIP" where '-4ZIP' is optional. This is the main problem we're dealing with. We want to parse this into it's components, formatted as:
Line1: "1234 Main St"
Line2: "Apt 12"
City: "City"
State "State Abbreviation"
ZIP: "5ZIP"
I've assembled 2 other tables provided by USPS, one containing every City / State / Zip Code combination, and the other for every Street Delineation used in the Unites States (eg. Street, St, Place, Pl, etc).
For this question, we're only dealing with the first table of City / State / ZipCode combinations.
Thankfully this process needs to happen just once and going forward the data should be in this new format. So far I've come up with the following for Cleaning up everything but Lines 1&2:
select *
from(
select substr(ADDR, 1, instr(ADDR, ZP.CITY)-1) as LINES_1_2,
CONTACT,
ZP.CITY,
ZP.ST,
ZP.ZIP,
ADDR,
row_number() over (partition by ADDR order by ZP.CITY DESC) as ADD_UNIQ,
row_number() over (partition by TEMP_FK || REFER order by TEMP_FK DESC) as KEY_UNIQ,
TEMP_FK,
REFER
from address_temp
cross apply(
select physical_city as CITY,
physical_state_abv as ST,
physical_zip as ZIP
from ZIP_USPS
where ADDR LIKE ('%' || ZIP_USPS.PHYSICAL_CITY || '%')
AND (ADDR LIKE ('%' || ZIP_USPS.PHYSICAL_STATE_ABV || '%')
OR ADDR LIKE ('%' || ZIP_USPS.PHYSICAL_STATE || '%'))
AND ADDR LIKE ('%' || ZIP_USPS.PHYSICAL_ZIP || '%')
)ZP
)
where KEY_UNIQ = 1
KEY_UNIQ Solves the Duplicating Rows issue, as the ZIP_USPS table has the same City, State, Zip Fields multiple times, as that table is unique based upon the last 4 of a Zip, eg XXXXX-XXXX. The last 4 of the ZIP is its own Column. ADD_UNIQ I was using to group addresses, and obtain a value for uniqueness, for later interpolating that we'll be doing.
My first iteration of this was using inner join instead of Cross Apply, but Cross Apply is faster. I'm assuming all of my lost time (creating a sample of 40 rows processes in around 80s) is from my where clause with all the wild cards and variable matches it has to check for (the ZIP_USPS table has 44045 rows). I thought of using distinct, on my cross apply, but it is around 0.66s slower when I did it that way. That statement was:
select substr(ADDR, 1, instr(ADDR, ZP.CITY)-1) as LINES_1_2,
CONTACT,
ZP.CITY,
ZP.ST,
ZP.ZIP,
ADDR,
TEMP_FK,
REFER
from address_temp
cross apply(
select distinct physical_city as CITY,
physical_state_abv as ST,
physical_zip as ZIP
from ZIP_USPS
where ADDR LIKE ('%' || ZIP_USPS.PHYSICAL_CITY || '%')
AND (ADDR LIKE ('%' || ZIP_USPS.PHYSICAL_STATE_ABV || '%')
OR ADDR LIKE ('%' || ZIP_USPS.PHYSICAL_STATE || '%'))
AND ADDR LIKE ('%' || ZIP_USPS.PHYSICAL_ZIP || '%')
)ZP
My data sample is not clean enough to fully just search by ZIP for my output, and I'd be worried about those few occasions where ZIPs overlap cities in larger metropolitan areas.
My main question here is efficiency. I think I'm doing something wrong, but can't really see a better way to do it.
What improvements could / should I make here?
Sample Data:
| ADDR | CONTACT | REFER | TEMP_FK |
|---|---|---|---|
| 534 MAIN ST HAMPDEN MA 01036 | 156 | final | 200 |
| 1 WILLIAMS ST WILLIAMSBURG MA 01096 | 100 | final | 3000 |
| 650 DWIGHT ST HOLYOKE MA 01040 | 95 | inv | 3500 |
| 650 DWIGHT ST HOLYOKE MA 01040 | 95 | final | 3500 |
| 83 WINSOR ST LUDLOW MA 01056 | 300 | inv | 2333 |
| 40 POST OFFICE PARK WILBRAHAM MA 01095 | 250 | inv | 3333 |
Sample Output:
| LINES_1_2 | CITY | ST | ZIP | ADDR | CONTACT | REFER | TEMP_FK | ADD_UNIQ | KEY_UNIQ |
|---|---|---|---|---|---|---|---|---|---|
| 534 MAIN ST | HAMPDEN | MA | 01036 | 534 MAIN ST HAMPDEN MA 01036 | 156 | final | 200 | 1 | 1 |
| 1 WILLIAMS ST | WILLIAMSBURG | MA | 01096 | 1 WILLIAMS ST WILLIAMSBURG MA 01096 | 100 | final | 3000 | 1 | 1 |
| 650 DWIGHT ST | HOLYOKE | MA | 01040 | 650 DWIGHT ST HOLYOKE MA 01040 | 95 | inv | 3500 | 1 | 1 |
| 650 DWIGHT ST | HOLYOKE | MA | 01040 | 650 DWIGHT ST HOLYOKE MA 01040 | 95 | final | 3501 | 2 | 1 |
| 83 WINSOR ST | LUDLOW | MA | 01056 | 83 WINSOR ST LUDLOW MA 01056 | 300 | inv | 2333 | 1 | 1 |
| 40 POST OFFICE PARK | WILBRAHAM | MA | 01095 | 40 POST OFFICE PARK WILBRAHAM MA 01095 | 250 | inv | 3333 | 1 | 1 |