Postgres: How to optimize similarity queries searching columns on multiple tables? (pg_trgm)

Question

Postgres: How to optimize similarity queries searching columns on multiple tables? (pg_trgm)

54 Views Asked by Marnix.hoh At 22 February 2024 at 08:24

I am trying to optimize a similarity (word_similarity) query, which searches multiple columns on 2 tables.

The tables and their indices look like this:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Table definitions
CREATE TABLE usr (
    id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
    loc_id bigint,
    first_name text,
    last_name text
);
CREATE TABLE loc (
    id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
    country text,
    city text
);

-- Primary keys
ALTER TABLE ONLY usr
    ADD CONSTRAINT usr_pkey PRIMARY KEY (id);
ALTER TABLE ONLY loc
    ADD CONSTRAINT loc_pkey PRIMARY KEY (id);

-- Foreign key
ALTER TABLE ONLY usr.loc_id
    ADD CONSTRAINT usr_loc_id_fkey FOREIGN KEY (loc_id) REFERENCES loc(id);

-- Indices
CREATE INDEX usr_loc_id_idx ON usr
    USING btree (loc_id);
    
CREATE INDEX usr_first_name_last_name_idx ON usr
    USING gist ((first_name || ' ' || last_name) gist_trgm_ops);
CREATE INDEX loc_country_city_idx ON loc
    USING gist ((country || ' ' || city) gist_trgm_ops);

-- Insert dummy data
INSERT INTO loc (country, city)
SELECT substr(md5(random()::text), 0, 25), substr(md5(random()::text), 0, 25)
FROM generate_series(1, 100000) AS t;

INSERT INTO usr (first_name, last_name, loc_id)
SELECT substr(md5(random()::text), 0, 25), substr(md5(random()::text), 0, 25), trunc(random() * 99999 + 1)
FROM generate_series(1, 1000000) AS t;

When searching the columns of just one table, everything seems to be optimized fine:

SELECT
    usr.id AS usr_id, usr.first_name, usr.last_name,
    loc.id AS loc_id, loc.country, loc.city
FROM usr
LEFT JOIN loc ON usr.loc_id = loc.id
WHERE 'baker' <% (usr.first_name || ' ' || usr.last_name)
ORDER  BY 'baker' <<-> (usr.first_name || ' ' || usr.last_name)
LIMIT 20;

This produces the below query plan:

Limit  (cost=0.70..236.19 rows=20 width=118) (actual time=397.086..596.842 rows=1 loops=1)
  ->  Nested Loop Left Join  (cost=0.70..1178.16 rows=100 width=118) (actual time=397.084..596.839 rows=1 loops=1)
        ->  Index Scan using usr_first_name_last_name_idx on usr  (cost=0.41..422.41 rows=100 width=66) (actual time=397.038..596.792 rows=1 loops=1)
              Index Cond: (((first_name || ' '::text) || last_name) %> 'baker'::text)
              Order By: (((first_name || ' '::text) || last_name) <->> 'baker'::text)
        ->  Index Scan using loc_pkey on loc  (cost=0.29..7.55 rows=1 width=56) (actual time=0.037..0.037 rows=1 loops=1)
              Index Cond: (id = usr.loc_id)
Planning Time: 1.252 ms
Execution Time: 596.931 ms

My attempt to write a query that searches columns on both the usr and loc tables is:

SELECT
    usr.id AS usr_id, usr.first_name, usr.last_name,
    loc.id AS loc_id, loc.country, loc.city
FROM usr
LEFT JOIN loc ON usr.loc_id = loc.id
WHERE
    'baker' <% (usr.first_name || ' ' || usr.last_name)
    OR
    'baker' <% (loc.country || ' ' || loc.city)
ORDER BY
    ('baker' <<-> (usr.first_name || ' ' || usr.last_name))
    +
    ('baker' <<-> (loc.country || ' ' || loc.city));

And the query plan for this one is:

Gather Merge  (cost=21071.24..21090.61 rows=166 width=118) (actual time=6297.322..6301.774 rows=1 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Sort  (cost=20071.22..20071.42 rows=83 width=118) (actual time=6286.205..6286.208 rows=0 loops=3)
        Sort Key: ((('baker'::text <<-> ((usr.first_name || ' '::text) || usr.last_name)) + ('baker'::text <<-> ((loc.country || ' '::text) || loc.city))))
        Sort Method: quicksort  Memory: 25kB
        Worker 0:  Sort Method: quicksort  Memory: 25kB
        Worker 1:  Sort Method: quicksort  Memory: 25kB
        ->  Parallel Hash Left Join  (cost=2460.57..20068.57 rows=83 width=118) (actual time=4197.337..6284.058 rows=0 loops=3)
              Hash Cond: (usr.loc_id = loc.id)
              Filter: (('baker'::text <% ((usr.first_name || ' '::text) || usr.last_name)) OR ('baker'::text <% ((loc.country || ' '::text) || loc.city)))
              Rows Removed by Filter: 333335
              ->  Parallel Seq Scan on usr  (cost=0.00..16512.69 rows=416669 width=66) (actual time=0.078..214.231 rows=333335 loops=3)
              ->  Parallel Hash  (cost=1725.25..1725.25 rows=58825 width=56) (actual time=23.072..23.073 rows=33334 loops=3)
                    Buckets: 131072  Batches: 1  Memory Usage: 10432kB
                    ->  Parallel Seq Scan on loc  (cost=0.00..1725.25 rows=58825 width=56) (actual time=0.007..5.594 rows=33334 loops=3)
Planning Time: 3.470 ms
Execution Time: 6301.859 ms

Is there a way I can improve this query? Or is such a "multi-table similarity" query not possible in Postgres?

Thank you!

Original Q&A

There are 1 best solutions below

**Zegarek** · Accepted Answer · 2024-02-22T12:41:27.197000

It looks like OR prevented your index from kicking in. You could instead run your select with each of the conditions separately, UNION that, then apply your desired ORDER after: demo

EXPLAIN ANALYSE VERBOSE
SELECT * FROM
(   SELECT
        usr.id AS usr_id, usr.first_name, usr.last_name,
        loc.id AS loc_id, loc.country, loc.city
    FROM usr
    LEFT JOIN loc ON usr.loc_id = loc.id
    WHERE 'baker' <% (loc.country || ' ' || loc.city)
    UNION
    SELECT
        usr.id AS usr_id, usr.first_name, usr.last_name,
        loc.id AS loc_id, loc.country, loc.city
    FROM usr
    LEFT JOIN loc ON usr.loc_id = loc.id
    WHERE 'baker' <% (usr.first_name || ' ' || usr.last_name)
) ORDER BY 
    ('baker' <<-> (first_name || ' ' || last_name))
    +
    ('baker' <<-> (country || ' ' || city));

| QUERY PLAN |
|:-----------|
| Sort  (cost=5112.22..5115.00 rows=1111 width=148) (actual time=34.165..34.171 rows=0 loops=1) |
|   Output: \_.usr\_id, \_.first\_name, \_.last\_name, \_.loc\_id, \_.country, \_.city, ((('baker'::text \<\<-> ((\_.first\_name \|\| ' '::text) \|\| \_.last\_name)) + ('baker'::text \<\<-> ((\_.country \|\| ' '::text) \|\| \_.city)))) |
|   Sort Key: ((('baker'::text \<\<-> ((\_.first\_name \|\| ' '::text) \|\| \_.last\_name)) + ('baker'::text \<\<-> ((\_.country \|\| ' '::text) \|\| \_.city)))) |
|   Sort Method: quicksort  Memory: 25kB |
|   ->  Subquery Scan on \_  (cost=5025.47..5056.02 rows=1111 width=148) (actual time=34.159..34.164 rows=0 loops=1) |
|         Output: \_.usr\_id, \_.first\_name, \_.last\_name, \_.loc\_id, \_.country, \_.city, (('baker'::text \<\<-> ((\_.first\_name \|\| ' '::text) \|\| \_.last\_name)) + ('baker'::text \<\<-> ((\_.country \|\| ' '::text) \|\| \_.city))) |
|         ->  HashAggregate  (cost=5025.47..5036.58 rows=1111 width=144) (actual time=34.158..34.163 rows=0 loops=1) |
|               Output: usr.id, usr.first\_name, usr.last\_name, loc.id, loc.country, loc.city |
|               Group Key: usr.id, usr.first\_name, usr.last\_name, loc.id, loc.country, loc.city |
|               Batches: 1  Memory Usage: 73kB |
|               ->  Append  (cost=798.13..5008.80 rows=1111 width=144) (actual time=34.144..34.148 rows=0 loops=1) |
|                     ->  Hash Join  (cost=798.13..2240.77 rows=555 width=144) (actual time=16.292..16.294 rows=0 loops=1) |
|                           Output: usr.id, usr.first\_name, usr.last\_name, loc.id, loc.country, loc.city |
|                           Inner Unique: true |
|                           Hash Cond: (usr.loc\_id = loc.id) |
|                           ->  Seq Scan on usr.usr  (cost=0.00..1296.75 rows=55575 width=80) (actual time=0.007..0.007 rows=1 loops=1) |
|                                 Output: usr.id, usr.loc\_id, usr.first\_name, usr.last\_name |
|                           ->  Hash  (cost=791.23..791.23 rows=552 width=72) (actual time=16.280..16.281 rows=0 loops=1) |
|                                 Output: loc.id, loc.country, loc.city |
|                                 Buckets: 1024  Batches: 1  Memory Usage: 8kB |
|                                 ->  Bitmap Heap Scan on usr.loc  (cost=104.56..791.23 rows=552 width=72) (actual time=16.279..16.280 rows=0 loops=1) |
|                                       Output: loc.id, loc.country, loc.city |
|                                       Filter: ('baker'::text \<% ((loc.country \|\| ' '::text) \|\| loc.city)) |
|                                       ->  Bitmap Index Scan on loc\_country\_city\_idx  (cost=0.00..104.42 rows=552 width=0) (actual time=16.275..16.275 rows=0 loops=1) |
|                                             Index Cond: (((loc.country \|\| ' '::text) \|\| loc.city) %> 'baker'::text) |
|                     ->  Hash Left Join  (cost=2029.53..2762.48 rows=556 width=144) (actual time=17.844..17.845 rows=0 loops=1) |
|                           Output: usr\_1.id, usr\_1.first\_name, usr\_1.last\_name, loc\_1.id, loc\_1.country, loc\_1.city |
|                           Inner Unique: true |
|                           Hash Cond: (usr\_1.loc\_id = loc\_1.id) |
|                           ->  Bitmap Heap Scan on usr.usr usr\_1  (cost=104.59..836.07 rows=556 width=80) (actual time=17.843..17.843 rows=0 loops=1) |
|                                 Output: usr\_1.id, usr\_1.loc\_id, usr\_1.first\_name, usr\_1.last\_name |
|                                 Filter: ('baker'::text \<% ((usr\_1.first\_name \|\| ' '::text) \|\| usr\_1.last\_name)) |
|                                 ->  Bitmap Index Scan on usr\_first\_name\_last\_name\_idx  (cost=0.00..104.45 rows=556 width=0) (actual time=17.838..17.839 rows=0 loops=1) |
|                                       Index Cond: (((usr\_1.first\_name \|\| ' '::text) \|\| usr\_1.last\_name) %> 'baker'::text) |
|                           ->  Hash  (cost=1234.42..1234.42 rows=55242 width=72) (never executed) |
|                                 Output: loc\_1.id, loc\_1.country, loc\_1.city |
|                                 ->  Seq Scan on usr.loc loc\_1  (cost=0.00..1234.42 rows=55242 width=72) (never executed) |
|                                       Output: loc\_1.id, loc\_1.country, loc\_1.city |
| Planning Time: 0.314 ms |
| Execution Time: 34.307 ms |

Postgres: How to optimize similarity queries searching columns on multiple tables? (pg_trgm)

There are 1 best solutions below

Related Questions in POSTGRESQL

Related Questions in PERFORMANCE

Related Questions in INDEXING

Related Questions in QUERY-OPTIMIZATION

Related Questions in PG-TRGM

Trending Questions

Popular # Hahtags

Popular Questions