Is this query slow because I'm using a lot of id values in the ANY() clause?


I have this table in Postgres:

CREATE TABLE public.products 
(
    id character varying NOT NULL,
    created_at timestamp with time zone DEFAULT CURRENT_TIMESTAMP NOT NULL,
    name character varying NOT NULL
);

For the id I'm using ULIDs (note that the column type in Postgres is varchar with no fixed length).

I'm ready to change the column type if needed.
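For example, since a canonical ULID is always 26 Crockford base32 characters encoding 128 bits (the same width as a uuid), something like this could work (a sketch, assuming every stored id is a canonical ULID and unique):

ALTER TABLE public.products
    ALTER COLUMN id TYPE character(26);

-- assuming id is unique: declaring the key also creates the supporting unique btree index
ALTER TABLE public.products
    ADD PRIMARY KEY (id);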

SELECT "products".* 
FROM "products" 
WHERE "id" = ANY('{01HQDMCF0S7QWQYBP2FW9HK8DS,01HQDMCF11PB3N1Q9TXME6KW1B,01HQDMCF0YFZBAV5K4ZQ3495FR,01HQDMCF0S7QWQYBP2FW9HK8DS,01HQDMCF0WJ53MH0WKNR65CBX1,01HQDMCF0ZV7TYZWRR6RN5V4QT,01HQDMCF0YRF7CB3DSC6DSAY54,01HQDMCF0YFZBAV5K4ZQ3495FR,01HQDMCF11Z9VKEQXNZKEWXRDN,01HQDMCF11Z9VKEQXNZKEWXRDN,01HQDMCF0S7QWQYBP2FW9HK8DS,01HQDMCF0S7QWQYBP2FW9HK8DS,01HQDMCF0YCJ8EFYCBKHWN4VQN,01HQDMCF0Z46AY42FC8FR953D8,01HQDMCF0S7QWQYBP2FW9HK8DS,01HQDMCF11Z9VKEQXNZKEWXRDN,01HQDMCF11Z9VKEQXNZKEWXRDN,01HQDMCF11Z9VKEQXNZKEWXRDN,01HQDMCF0ZBQ3A70GE8K41RW1V,01HQDMCF0WNYW4ZH5G0M8MDAQA,01HQDMCF0ZM8B6BJJK38PH7M2F,01HQDMCF11Z9VKEQXNZKEWXRDN,01HQDMCF0WT7K7W5GBFF3HVXHE,01HQDMCF11Z9VKEQXNZKEWXRDN,01HQDMCF0W24YQHH4EKN91S0JY,01HQDMCF0WT7K7W5GBFF3HVXHE,01HQDMCF0ZM8B6BJJK38PH7M2F,01HQDMCF0YEQR7NJJJW88XNTW9,01HQDMCF11Z9VKEQXNZKEWXRDN,01HQDMCF0YRF7CB3DSC6DSAY54,01HQDMCF0WBY0YT5MR53KG9HS8,01HQDMCF0WBY0YT5MR53KG9HS8,01HQDMCF106M1667NNPTDBKQDB,01HQDMCF0S7QWQYBP2FW9HK8DS,01HQDMCF0YFZBAV5K4ZQ3495FR,01HQDMCF0YFZBAV5K4ZQ3495FR,01HQDMCF0ZMPTQ0GHG0V87W0J4,01HQDMCF0ZM8B6BJJK38PH7M2F,01HQDMCF0ZM8B6BJJK38PH7M2F,01HQDMCF0WBY0YT5MR53KG9HS8,01HQDMCF0WHNJ0P9BGVEFJE2JG,01HQDMCF109V71KW4MTFWVRTFR,and a lot more ~ 20k of them}') 
LIMIT 37520 OFFSET 0

which is slow.

IMPORTANT

I'm using a lot (~20k) of id values in the ANY() clause, and I think this is the problem.

When I run EXPLAIN ANALYZE, it says:

Limit  (cost=85.51..90.82 rows=154 width=172) (actual time=1.765..1.799 rows=138 loops=1)
  ->  Seq Scan on products  (cost=85.51..90.82 rows=154 width=172) (actual time=1.764..1.791 rows=138 loops=1)
        Filter: ((id)::text = ANY ('{01HQDMCF0S7QWQYBP2FW9HK8DS,01HQDMCF11PB3N1Q9TXME6KW1B,01HQDMCF0YFZBAV5K4ZQ3495FR,01HQDMCF0S7QWQYBP2FW9HK8DS,01HQDMCF0WJ53MH0WKNR65CBX1,01HQDMCF0ZV7TYZWRR6RN5V4QT,01HQDMCF0YRF7CB3DSC6DSAY54,01HQDMCF0YFZBAV5K4ZQ3495FR,01HQDMCF11Z9VKEQXNZKEWXRDN,01HQDMCF11Z9VKEQXNZKEWXRDN,01HQDMCF0S7QWQYBP2FW9HK8DS,01HQDMCF0S7QWQYBP2FW9HK8DS,01HQDMCF0YCJ8EFYCBKHWN4VQN,01HQDMCF0Z46AY42FC8FR953D8,01HQDMCF0S7QWQYBP2FW9HK8DS,01HQDMCF11Z9VKEQXNZKEWXRDN,01HQDMCF11Z9VKEQXNZKEWXRDN,01HQDMCF11Z9VKEQXNZKEWXRDN,01HQDMCF0ZBQ3A70GE8K41RW1V,01HQDMCF0WNYW4ZH5G0M8MDAQA,01HQDMCF0ZM8B6BJJK38PH7M2F,01HQDMCF11Z9VKEQXNZKEWXRDN,01HQDMCF0WT7K7W5GBFF3HVXHE,01HQDMCF11Z9VKEQXNZKEWXRDN,01HQDMCF0W24YQHH4EKN91S0JY,01HQDMCF0WT7K7W5GBFF3HVXHE,01HQDMCF0ZM8B6BJJK38PH7M2F,01HQDMCF0YEQR7NJJJW88XNTW9,01HQDMCF11Z9VKEQXNZKEWXRDN,01HQDMCF0YRF7CB3DSC6DSAY54,01HQDMCF0WBY0YT5MR53KG9HS8,01HQDMCF0WBY0YT5MR53KG9HS8,01HQDMCF106M1667NNPTDBKQDB,01HQDMCF0S7QWQYBP2FW9HK8DS,01HQDMCF0YFZBAV5K4ZQ3495FR,01HQDMCF0YFZBAV5K4ZQ3495FR,01HQDMCF0ZMPTQ0GHG0V87W0J4,01HQDMCF0ZM8B6BJJK38PH7M2F,01HQDMCF0ZM8B6BJJK38PH7M2F,01HQDMCF0WBY0YT5MR53KG9HS8,01HQDMCF0WHNJ0P9BGVEFJE2JG,01HQDMCF109V71KW4MTFWVRTFR,and a lot more ~ 20k of them}'::text[]))
        Rows Removed by Filter: 16
Planning Time: 9.769 ms
Execution Time: 1.878 ms

I tried:

CREATE INDEX product_id_idx ON products(id);

and

CREATE INDEX product_id_idx ON products USING HASH(id);

and

CREATE INDEX product_id_pattern_idx ON products USING btree (id text_pattern_ops);

But none of them fixes the extreme slowness.
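One way to check whether the planner can use any of these indexes at all, rather than just preferring the sequential scan, is to disable seq scans for one transaction and re-run EXPLAIN (a sketch, with a shortened id list):

BEGIN;
-- force the planner away from seq scans for this transaction only
SET LOCAL enable_seqscan = off;
EXPLAIN ANALYZE
SELECT * FROM products
WHERE id = ANY('{01HQDMCF0S7QWQYBP2FW9HK8DS,01HQDMCF11PB3N1Q9TXME6KW1B}');
ROLLBACK;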

What index can I create to improve the query?

1 Answer

Answered by Zegarek:

I didn't go all the way up to 20k, but =ANY(ARRAY[...]) does seem reluctant to use the indexes you made available, sequentially scanning a freshly VACUUM ANALYZEd table with 200k rows. Demo 1:

SELECT "products".* 
FROM "products" 
WHERE "id" = ANY(ARRAY[(random())::text,(random())::text,...around 400 of these ...(random())::text]) 
LIMIT 37520 OFFSET 0
QUERY PLAN
Limit  (cost=0.00..748971.00 rows=426 width=59) (actual time=17519.944..17519.952 rows=0 loops=1)
  Output: id, created_at, name
  ->  Seq Scan on public.products  (cost=0.00..748971.00 rows=426 width=59) (actual time=14365.282..14365.286 rows=0 loops=1)
        Output: id, created_at, name
        Filter: (products.id = ANY (ARRAY[(random())::text,(random())::text,...around 400 of these ...(random())::text]))
        Rows Removed by Filter: 200000
Planning Time: 1.387 ms
JIT:
  Functions: 4
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 10.140 ms, Inlining 81.166 ms, Optimization 1052.392 ms, Emission 2021.084 ms, Total 3164.782 ms
Execution Time: 17710.315 ms

It's resolved if you trade that for the equivalent IN (SELECT UNNEST(ARRAY[...])). Demo 2:

SELECT "products".* 
FROM "products" 
WHERE "id" IN (SELECT UNNEST(ARRAY[(random())::text,(random())::text,...around 400 of these ...(random())::text])) 
LIMIT 37520 OFFSET 0
QUERY PLAN
Limit  (cost=6.41..1481.91 rows=426 width=59) (actual time=0.816..0.817 rows=0 loops=1)
  Output: products.id, products.created_at, products.name
  ->  Nested Loop  (cost=6.41..1481.91 rows=426 width=59) (actual time=0.815..0.816 rows=0 loops=1)
        Output: products.id, products.created_at, products.name
        ->  HashAggregate  (cost=6.41..8.41 rows=200 width=32) (actual time=0.198..0.251 rows=426 loops=1)
              Output: (unnest(ARRAY[(random())::text, ...400 of these..., (random())::text]))
              Group Key: unnest(ARRAY[(random())::text, ...400 of these..., (random())::text])
              Batches: 1  Memory Usage: 77kB
              ->  ProjectSet  (cost=0.00..5.34 rows=426 width=32) (actual time=0.069..0.102 rows=426 loops=1)
                    Output: unnest(ARRAY[(random())::text, ...400 of these..., (random())::text])
                    ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
        ->  Index Scan using product_id_idx2 on public.products  (cost=0.00..7.36 rows=1 width=59) (actual time=0.001..0.001 rows=0 loops=426)
              Output: products.id, products.created_at, products.name
              Index Cond: (products.id = (unnest(ARRAY[(random())::text, ...400 of these..., (random())::text])))
Planning Time: 1.050 ms
Execution Time: 0.989 ms

Now that it uses the index, it goes down from 17 s to 1 ms.
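In application code you would typically pass the whole list as a single array parameter instead of interpolating ~20k literals; a sketch using a prepared statement (the statement name products_by_ids is made up here):

-- pass the id list as one array parameter and unnest it,
-- so the planner can do per-id index lookups as in the plan above
PREPARE products_by_ids(text[]) AS
SELECT p.*
FROM products AS p
WHERE p.id IN (SELECT unnest($1))
LIMIT 37520 OFFSET 0;

EXECUTE products_by_ids('{01HQDMCF0S7QWQYBP2FW9HK8DS,01HQDMCF11PB3N1Q9TXME6KW1B}');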


@Turntables: I'll leave the above in for context, but unless you're generating your ids the same way, it shouldn't apply. There still might be some advantage to IN over ANY, but your index should already work as it is. ANY: Demo 3

explain analyze verbose
WITH cte AS MATERIALIZED
  (select array_agg(r) as search_for_this
   from (select left(random()::text,9) as r
         from generate_series(1,2e4))_)
SELECT "products".* 
FROM "products", cte
WHERE "id" = ANY(search_for_this)
LIMIT 37520 OFFSET 0;
QUERY PLAN
Limit  (cost=62.59..100.77 rows=10 width=50) (actual time=26.927..217.212 rows=634 loops=1)
  Output: products.id, products.created_at, products.name
  CTE cte
    ->  Aggregate  (cost=22.50..22.52 rows=1 width=32) (actual time=13.186..13.188 rows=1 loops=1)
          Output: array_agg(("left"((random())::text, 9)))
          ->  Function Scan on pg_catalog.generate_series  (cost=0.00..20.00 rows=1000 width=32) (actual time=3.154..10.602 rows=20000 loops=1)
                Output: "left"((random())::text, 9)
                Function Call: generate_series('1'::numeric, '20000'::numeric)
  ->  Nested Loop  (cost=40.08..78.25 rows=10 width=50) (actual time=26.923..217.113 rows=634 loops=1)
        Output: products.id, products.created_at, products.name
        ->  CTE Scan on cte  (cost=0.00..0.02 rows=1 width=32) (actual time=13.293..13.295 rows=1 loops=1)
              Output: cte.search_for_this
        ->  Bitmap Heap Scan on public.products  (cost=40.08..78.13 rows=10 width=50) (actual time=13.622..203.663 rows=634 loops=1)
              Output: products.id, products.created_at, products.name
              Recheck Cond: (products.id = ANY (cte.search_for_this))
              Rows Removed by Index Recheck: 2
              Heap Blocks: exact=541
              ->  Bitmap Index Scan on product_id_idx2  (cost=0.00..40.08 rows=10 width=0) (actual time=13.093..13.093 rows=636 loops=1)
                    Index Cond: (products.id = ANY (cte.search_for_this))
Planning Time: 0.358 ms
Execution Time: 217.468 ms

IN: Demo 4

explain analyze verbose
WITH cte AS MATERIALIZED
  (select unnest(arr) as search_for_this 
   from(select array_agg(r) arr --app would pass the array here
        from (select left(random()::text,9) as r
              from generate_series(1,2e4))_)_)
SELECT "products".* 
FROM "products"
WHERE "id" in (select search_for_this from cte)
LIMIT 37520 OFFSET 0;
QUERY PLAN
Limit  (cost=22.80..103.17 rows=10 width=50) (actual time=24.485..47.409 rows=634 loops=1)
  Output: products.id, products.created_at, products.name
  CTE cte
    ->  ProjectSet  (cost=22.50..22.57 rows=10 width=32) (actual time=12.157..14.241 rows=20000 loops=1)
          Output: unnest((array_agg(("left"((random())::text, 9)))))
          ->  Aggregate  (cost=22.50..22.52 rows=1 width=32) (actual time=12.137..12.138 rows=1 loops=1)
                Output: array_agg(("left"((random())::text, 9)))
                ->  Function Scan on pg_catalog.generate_series  (cost=0.00..20.00 rows=1000 width=32) (actual time=3.152..10.016 rows=20000 loops=1)
                      Output: "left"((random())::text, 9)
                      Function Call: generate_series('1'::numeric, '20000'::numeric)
  ->  Nested Loop  (cost=0.23..80.60 rows=10 width=50) (actual time=24.484..47.333 rows=634 loops=1)
        Output: products.id, products.created_at, products.name
        ->  HashAggregate  (cost=0.23..0.33 rows=10 width=32) (actual time=24.343..28.019 rows=19976 loops=1)
              Output: cte.search_for_this
              Group Key: cte.search_for_this
              Batches: 1  Memory Usage: 1833kB
              ->  CTE Scan on cte  (cost=0.00..0.20 rows=10 width=32) (actual time=12.161..18.573 rows=20000 loops=1)
                    Output: cte.search_for_this
        ->  Index Scan using product_id_idx2 on public.products  (cost=0.00..8.02 rows=1 width=50) (actual time=0.001..0.001 rows=0 loops=19976)
              Output: products.id, products.created_at, products.name
              Index Cond: (products.id = cte.search_for_this)
              Rows Removed by Index Recheck: 0
Planning Time: 0.934 ms
Execution Time: 47.807 ms
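
So with 20k ids both spellings end up using the index, but IN (SELECT unnest(...)) still comes out well ahead here (217 ms vs 48 ms). A third equivalent spelling is a join against unnest() directly, which the planner handles much like the IN form (a sketch, with a shortened id list):

-- join the id array against the table; the planner turns this
-- into per-id index lookups, much like the IN (SELECT unnest(...)) form
SELECT p.*
FROM unnest('{01HQDMCF0S7QWQYBP2FW9HK8DS,01HQDMCF11PB3N1Q9TXME6KW1B}'::text[]) AS ids(id)
JOIN products AS p USING (id)
LIMIT 37520 OFFSET 0;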