Postgres - How to find id's that are not used in different multiple tables (inactive id's) - badly written query

524 Views Asked by At

I have table towns which is main table. This table contains so many rows and it became so 'dirty' (someone inserted 5 milions rows) that I would like to get rid of unused towns.

There are 3 referent table that are using my town_id as reference to towns.

And I know there are many towns that are not used in this tables, and only if town_id is not found in neither of these 3 tables I am considering it as inactive and I would like to remove that town (because it's not used).

as you can see towns is used in this 2 different tables:

  • employees
  • offices

and for table * vendors there is vendor_id in table towns since one vendor can have multiple towns.

so if vendor_id in towns is null and town_id is not found in any of these 2 tables it is safe to remove it :)

I created a query which might work but it is taking tooooo much time to execute, and it looks something like this:

select count(*) 
from towns
where vendor_id is null 
    and id not in (select town_id from banks) 
    and id not in (select town_id from employees)

So basically I said, if vendor_is is null it means this town is definately not related to vendors and in the same time if same town is not in banks and employees, than it will be safe to remove it.. but query took too long, and never executed successfully...since towns has 5 milions rows and that is reason why it is so dirty..

In face I'm not able to execute given query since server terminated abnormally..

Here is full error message:

ERROR: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.

Any kind of help would be awesome Thanks!

3

There are 3 best solutions below

0
nbk On

You can try an JOIN on big tables it would be faster then two IN

you could also try UNION ALL and live with the duplicates, as it is faster as UNION

Finally you can use a combined Index on id and vendor_id, to speed up the query

CREATE TABLe towns (id int , vendor_id int)
CREATE TABLE
CREATE tABLE banks (town_id int)
CREATE TABLE
CREATE tABLE employees (town_id int)
CREATE TABLE
select count(*) 
from towns t1 JOIN (select town_id from banks UNION select town_id from employees) t2 on t1.id <> t2.town_id
where vendor_id is null
count
0
SELECT 1

fiddle

4
Edouard On

You can join the tables using LEFT JOIN so that to identify the town_id for which there is no row in tables banks and employee in the WHERE clause :

WITH list AS
( SELECT t.town_id
    FROM towns AS t
    LEFT JOIN tbl.banks AS b ON b.town_id = t.town_id
    LEFT JOIN tbl.employees AS e ON e.town_id = t.town_id
   WHERE t.vendor_id IS NULL
     AND b.town_id IS NULL
     AND e.town_id IS NULL
   LIMIT 1000
)
DELETE FROM tbl.towns AS t
  USING list AS l
 WHERE t.town_id = l.town_id ;

Before launching the DELETE, you can check the indexes on your tables. Adding an index as follow can be usefull :

CREATE INDEX town_id_nulls ON towns (town_id NULLS FIRST) ;

Last but not least you can add a LIMIT clause in the cte so that to limit the number of rows you detele when you execute the DELETE and avoid the unexpected termination. As a consequence, you will have to relaunch the DELETE several times until there is no more row to delete.

0
deroby On

The trick is to first make a list of all the town_id's you want to keep and then start removing those that are not there. By looking in 2 tables you're making life harder for the server so let's just create 1 single list first.

-- build empty temp-table
CREATE TEMPORARY TABLE TEMP_must_keep 
AS
SELECT town_id       
  FROM tbl.towns
 WHERE 1 = 2;
 
-- get id's from first table
INSERT TEMP_must_keep (town_id)
SELECT DISTINCT town_id 
  FROM tbl.banks;
  
-- add index to speed up the EXCEPT below
CREATE UNIQUE INDEX idx_uq_must_keep_town_id ON TEMP_must_keep (town_id);

-- add new ones from second table
INSERT TEMP_must_keep (town_id)
SELECT town_id 
  FROM tbl.employees
 
EXCEPT -- auto-distincts

SELECT town_id 
  FROM TEMP_must_keep;
  
-- rebuild index simply to ensure little fragmentation
REINDEX TABLE TEMP_must_keep;

-- optional, but might help: create a temporary index on the towns table to speed up the delete
CREATE INDEX idx_towns_town_id_where_vendor_null ON tbl.towns (town_id) WHERE vendor IS NULL;

-- Now do actual delete
-- You can do a `SELECT COUNT(*)` rather than a `DELETE` first if you feel like it, both will probably take some time depending on your hardware.
DELETE 
  FROM tbl.towns as del      
 WHERE vendor_id is null 
   AND NOT EXISTS ( SELECT * 
                      FROM TEMP_must_keep mk
                     WHERE mk.town_id = del.town_id);
                     
                     
-- cleanup
DROP INDEX tbl.idx_towns_town_id_where_vendor_null;
DROP TABLE TEMP_must_keep;

                     

The idx_towns_town_id_where_vendor_null is optional and I'm not sure if it will actaully lower the total time but IMHO it will help out with the DELETE operation if only because the index should give the Query Optimizer a better view on what volumes to expect.