I am trying to optimize a similarity (word_similarity) query, which searches multiple columns on 2 tables.
The tables and their indices look like this:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Table definitions
CREATE TABLE usr (
id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
loc_id bigint,
first_name text,
last_name text
);
CREATE TABLE loc (
id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
country text,
city text
);
-- Primary keys
ALTER TABLE ONLY usr
ADD CONSTRAINT usr_pkey PRIMARY KEY (id);
ALTER TABLE ONLY loc
ADD CONSTRAINT loc_pkey PRIMARY KEY (id);
-- Foreign key
ALTER TABLE ONLY usr.loc_id
ADD CONSTRAINT usr_loc_id_fkey FOREIGN KEY (loc_id) REFERENCES loc(id);
-- Indices
CREATE INDEX usr_loc_id_idx ON usr
USING btree (loc_id);
CREATE INDEX usr_first_name_last_name_idx ON usr
USING gist ((first_name || ' ' || last_name) gist_trgm_ops);
CREATE INDEX loc_country_city_idx ON loc
USING gist ((country || ' ' || city) gist_trgm_ops);
-- Insert dummy data
INSERT INTO loc (country, city)
SELECT substr(md5(random()::text), 0, 25), substr(md5(random()::text), 0, 25)
FROM generate_series(1, 100000) AS t;
INSERT INTO usr (first_name, last_name, loc_id)
SELECT substr(md5(random()::text), 0, 25), substr(md5(random()::text), 0, 25), trunc(random() * 99999 + 1)
FROM generate_series(1, 1000000) AS t;
When searching the columns of just one table, everything seems to be optimized fine:
SELECT
usr.id AS usr_id, usr.first_name, usr.last_name,
loc.id AS loc_id, loc.country, loc.city
FROM usr
LEFT JOIN loc ON usr.loc_id = loc.id
WHERE 'baker' <% (usr.first_name || ' ' || usr.last_name)
ORDER BY 'baker' <<-> (usr.first_name || ' ' || usr.last_name)
LIMIT 20;
This produces the below query plan:
Limit (cost=0.70..236.19 rows=20 width=118) (actual time=397.086..596.842 rows=1 loops=1)
-> Nested Loop Left Join (cost=0.70..1178.16 rows=100 width=118) (actual time=397.084..596.839 rows=1 loops=1)
-> Index Scan using usr_first_name_last_name_idx on usr (cost=0.41..422.41 rows=100 width=66) (actual time=397.038..596.792 rows=1 loops=1)
Index Cond: (((first_name || ' '::text) || last_name) %> 'baker'::text)
Order By: (((first_name || ' '::text) || last_name) <->> 'baker'::text)
-> Index Scan using loc_pkey on loc (cost=0.29..7.55 rows=1 width=56) (actual time=0.037..0.037 rows=1 loops=1)
Index Cond: (id = usr.loc_id)
Planning Time: 1.252 ms
Execution Time: 596.931 ms
My attempt to write a query that searches columns on both the usr and loc tables is:
SELECT
usr.id AS usr_id, usr.first_name, usr.last_name,
loc.id AS loc_id, loc.country, loc.city
FROM usr
LEFT JOIN loc ON usr.loc_id = loc.id
WHERE
'baker' <% (usr.first_name || ' ' || usr.last_name)
OR
'baker' <% (loc.country || ' ' || loc.city)
ORDER BY
('baker' <<-> (usr.first_name || ' ' || usr.last_name))
+
('baker' <<-> (loc.country || ' ' || loc.city));
And the query plan for this one is:
Gather Merge (cost=21071.24..21090.61 rows=166 width=118) (actual time=6297.322..6301.774 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=20071.22..20071.42 rows=83 width=118) (actual time=6286.205..6286.208 rows=0 loops=3)
Sort Key: ((('baker'::text <<-> ((usr.first_name || ' '::text) || usr.last_name)) + ('baker'::text <<-> ((loc.country || ' '::text) || loc.city))))
Sort Method: quicksort Memory: 25kB
Worker 0: Sort Method: quicksort Memory: 25kB
Worker 1: Sort Method: quicksort Memory: 25kB
-> Parallel Hash Left Join (cost=2460.57..20068.57 rows=83 width=118) (actual time=4197.337..6284.058 rows=0 loops=3)
Hash Cond: (usr.loc_id = loc.id)
Filter: (('baker'::text <% ((usr.first_name || ' '::text) || usr.last_name)) OR ('baker'::text <% ((loc.country || ' '::text) || loc.city)))
Rows Removed by Filter: 333335
-> Parallel Seq Scan on usr (cost=0.00..16512.69 rows=416669 width=66) (actual time=0.078..214.231 rows=333335 loops=3)
-> Parallel Hash (cost=1725.25..1725.25 rows=58825 width=56) (actual time=23.072..23.073 rows=33334 loops=3)
Buckets: 131072 Batches: 1 Memory Usage: 10432kB
-> Parallel Seq Scan on loc (cost=0.00..1725.25 rows=58825 width=56) (actual time=0.007..5.594 rows=33334 loops=3)
Planning Time: 3.470 ms
Execution Time: 6301.859 ms
Is there a way I can improve this query? Or is such a "multi-table similarity" query not possible in Postgres?
Thank you!
It looks like
ORprevented your index from kicking in. You could instead run your select with each of the conditions separately,UNIONthat, then apply your desiredORDERafter: demo