How To Speed Up a SQL IN Query With Thousands of IDs


My application needs to fetch rows from a table for a specific set of ids. The ids are not ordered and not predictable, so the query uses IN. The list can be very large, around 100,000 ids.

The query looks like this and takes approximately 10 seconds:

SELECT col1, col2, col3, ...
FROM table 
WHERE created_at BETWEEN date1 AND date2 
AND table_id IN (1, ..., 100000)

When I run EXPLAIN ANALYZE, this is the output:

Gather (cost=1002.79..3322893.01 rows=823578 width=24) (actual time=761.500..5453.013 rows=219908 loops=1)
  Workers Planned: 2
  Workers Launched: 2
Planning Time: 2.433 ms
Execution Time: 5463.276 ms

Rewritten like this, the query takes approximately 4 seconds:

SELECT col1, col2, col3, ...
FROM table
WHERE created_at BETWEEN date1 AND date2
AND table_id = ANY (
    SELECT value
    FROM UNNEST(ARRAY[1, ..., 100000]) AS value
)

Here's the EXPLAIN ANALYZE output:

Nested Loop (cost=14.52..2879232.45 rows=872265 width=24) (actual time=0.485..548.387 rows=219908 loops=1)
Planning Time: 0.938 ms
Execution Time: 558.358 ms

I have indexed table_id and created_at with BTREE indexes.

My questions: how can I make the query faster? Is my second query already the most efficient approach? And if it is, are there any trade-offs I'm not aware of, or any pitfalls in using ANY (SELECT value FROM UNNEST(ARRAY[...]))?

UPDATED

EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS) for the first query:

QUERY PLAN 
Gather  (cost=1002.79..3324157.83 rows=823891 width=24) (actual time=2059.074..5390.528 rows=219908 loops=1) 
Output: col1, col2, col3 
Workers Planned: 2 
Workers Launched: 2 
Buffers: shared hit=315848 read=522241 dirtied=35 
I/O Timings: read=1103.008 
->  Parallel Seq Scan on table_name  (cost=2.79..3240768.73 rows=343288 width=24) (actual time=2056.255..5363.859 rows=73303 loops=3) 
        Output: col1, col2, col3 
        Filter: (table_id = ANY ('{1,......, 100000}'::integer[])) AND (created_at >= date1::timestamp with time zone) AND (created_at <= date2::timestamp with time zone) 
        Rows Removed by Filter: 33444492 
        Buffers: shared hit=315848 read=522241 dirtied=35 
        I/O Timings: read=1103.008 
        Worker 0:  actual time=2054.371..5372.919 rows=69788 loops=1 
        Buffers: shared hit=105747 read=174198 dirtied=3 
        I/O Timings: read=369.176 
        Worker 1:  actual time=2055.867..5369.968 rows=78986 loops=1 
        Buffers: shared hit=105794 read=173495 
        I/O Timings: read=367.726 
Settings: effective_cache_size = '32568248kB', jit = 'off', search_path = 'public' 
Query Identifier: -2771525684851677483 
Planning Time: 1.892 ms 
Execution Time: 5403.505 ms 

EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS) for the second query:

QUERY PLAN 
Nested Loop  (cost=14.52..2880414.99 rows=872605 width=24) (actual time=0.432..553.665 rows=219908 loops=1) 
Output: col1, col2, col3 
Buffers: shared hit=392247 dirtied=8 
->  HashAggregate  (cost=13.95..15.95 rows=200 width=4) (actual time=0.407..0.849 rows=1116 loops=1) 
        Output: value.value 
        Group Key: value.value 
        Batches: 1  Memory Usage: 145kB 
        ->  Function Scan on pg_catalog.unnest value  (cost=0.00..11.16 rows=1116 width=4) (actual time=0.071..0.143 rows=1116 loops=1) 
            Output: value.value 
            Function Call: unnest('{1,....,100000}'::integer[]) 
->  Index Scan using table_name_table_id_created_at_idx on table_name  (cost=0.57..14328.73 rows=782 width=16) (actual time=0.047..0.409 rows=197 loops=1116) 
        Output: col1, col2, col3 
        Index Cond: (ss.vehicle_id = value.value) 
        Filter: (created_at >= date1::timestamp with time zone) AND (created_at <= date2::timestamp with time zone) 
        Rows Removed by Filter: 229 
        Buffers: shared hit=392247 dirtied=8 
Settings: effective_cache_size = '32568248kB', jit = 'off', search_path = 'public' 
Query Identifier: 328163023760902820 
Planning Time: 0.374 ms 
Execution Time: 566.648 ms 

1 Answer

Answer by bobflux:

Create test data:

CREATE UNLOGGED TABLE foo( id INT NOT NULL, created_at INT NOT NULL, data INT NOT NULL );
INSERT INTO foo SELECT n, random()*10000000, n FROM generate_series(1,40000000) n;
CREATE INDEX ON foo(id);
CREATE INDEX ON foo(created_at);
VACUUM ANALYZE foo;

Due to the large number of ids in the query, I'll use Python:

import psycopg2

# connect (same test database as the parallel example below) and run the EXPLAIN
# with the id array passed as a bind parameter
db = psycopg2.connect("dbname=test")
cursor = db.cursor()

ids = [ n*100 for n in range(100000) ]
cursor.execute( """
    EXPLAIN ANALYZE SELECT * FROM foo
    WHERE id = ANY(%s) AND created_at BETWEEN 1000000 AND 3000000
    """, (ids,) )
for row in cursor:
    print(row[0][:200])

Index Scan using foo_id_idx on foo  (cost=0.56..334046.00 rows=20121 width=12) (actual time=8.092..331.779 rows=19845 loops=1)
  Index Cond: (id = ANY ('{0,100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600,1700,1800,1900,2000,2100,2200,2300,2400,2500,2600,2700,2800,2900,3000,3100,3200,3300,3400,3500,3600,
  Filter: ((created_at >= 1000000) AND (created_at <= 3000000))
  Rows Removed by Filter: 80154
Planning Time: 39.578 ms
Execution Time: 358.758 ms

Planning is slow, due to the large array.

It is using the index on id to fetch rows, then filters them based on created_at. Thus rows not satisfying the condition on created_at still require heap fetches. Including created_at in the index would be useful.

An index on (created_at, id) would allow scanning the requested range of created_at, but it cannot index on the ids, so the ids would have to be pulled out of the index and filtered. This would only be useful if the condition on created_at is very narrow and the most selective one in the query. Looking at the row counts in your EXPLAIN, I don't feel this is the case.
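
For reference, that alternative would look like this (shown only for contrast with the id-leading indexes below; I did not create it in the test):

CREATE INDEX ON foo( created_at, id );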

An index with id as the first column allows fetching the rows for each id directly; then created_at has to be compared with the requested range. I feel this is more useful.

CREATE INDEX ON foo( id ) INCLUDE ( created_at );

Index Scan using foo_id_created_at_idx on foo  (cost=0.56..334046.00 rows=20121 width=12) (actual time=3.955..278.250 rows=19845 loops=1)
  Index Cond: (id = ANY ('{0,100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600,1700,1800,1900,2000,2100,2200,2300,2400,2500,2600,2700,2800,2900,3000,3100,3200,3300,3400,3500,3600,
  Filter: ((created_at >= 1000000) AND (created_at <= 3000000))
  Rows Removed by Filter: 80154
Planning Time: 37.395 ms
Execution Time: 299.370 ms

This pulls created_at from the index, avoiding heap fetches for rows that will be rejected, so it is slightly faster.

CREATE INDEX ON foo( id, created_at );

This would be useful if there were many rows for each id, each having a different created_at value, which is not the case here.
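
As a rough illustration of when that composite index would pay off (a hypothetical data shape, not the test table above): if each id had thousands of rows spread over time and the query only wanted a narrow window, both predicates could go into the index condition instead of a post-scan filter:

-- hypothetical: with the (id, created_at) index, both conditions can appear in Index Cond
EXPLAIN ANALYZE
SELECT * FROM foo
WHERE id = ANY ('{100,200,300}'::int[])
  AND created_at BETWEEN 1000000 AND 1001000;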

This query may cause lots of random IOs, so if the table is on spinning disk and not SSD, it will take a lot longer.

Using IN() instead of =ANY() does not change anything.
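
That's because the planner normalizes a literal IN list into the same = ANY(array) form; you can see it in the first plan in the question, where the query was written with IN but the Filter line shows table_id = ANY ('{...}'::integer[]). For example, these two are planned identically:

SELECT * FROM foo WHERE id IN (100, 200, 300);
SELECT * FROM foo WHERE id = ANY ('{100,200,300}'::int[]);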

Besides including created_at in the index to avoid extra IO, there's not much opportunity to make it faster. This needs one index scan per id; there are 100k of them, so it comes down to about 3 µs per id, which is pretty fast. Transferring that many rows to the client will also take time.

If you really need it faster, I'd recommend splitting the batch of ids into smaller chunks and executing them in parallel over several connections. This has the advantage of also parallelizing data encoding and decoding, as well as processing on the client.

The following parallel Python code runs in 100 ms, which is quite a bit faster.

import time
import psycopg2
from multiprocessing import Pool

db = None   # one connection per worker process

def query( ids ):
    if not ids: return
    global db
    if not db:
        db = psycopg2.connect("user= password=  dbname=test")
        # prepare the plan once per connection: $1 = id array, $2/$3 = created_at range
        db.cursor().execute( "PREPARE myplan AS SELECT * FROM unnest($1::INTEGER[]) get_id JOIN foo ON (foo.id=get_id AND foo.created_at BETWEEN $2 AND $3)")
    cursor = db.cursor()
    cursor.execute( "EXECUTE myplan(%s,1000000,3000000)", (ids,) )

if __name__ == "__main__":
    ids = [ n*100 for n in range(100000) ]
    # split the 100k ids into chunks of 1000 and run them over 10 parallel connections
    chunks = [ids[offset:(offset+1000)] for offset in range( 0, len(ids), 1000 )]
    st = time.time()
    with Pool(10) as p:
        p.map(query, chunks)
    print( time.time()-st )