MySQL - Most performant way to prune a lot of orphaned rows

175 Views Asked by At

Refactoring some old code we've discovered that objects of class X are being created far too often and ~80% of them are being left without a reference.

I have ~10 tables that reference rows in the table of class X. I can easily identify the number of orphaned rows. The table has around 7 million rows and only around 1.5 million represent legitimately referenced objects.

After fixing the code causing this issue I need to delete all of these orphaned rows in a moderately performant way.

With my limited DB admin experience the only way I know to do this is to select all the legitimate IDs of 'X-type' objects and then do something like: DELETE FROM x WHERE id NOT IN (valid_references). That's something like 10 trillion comparisons, there has to be a better way.

1

There are 1 best solutions below

0
Ryan On

Posting this for any future finders. After doing some research I found this to be about as ideal a solution I could put together.

Step 1: Create a new temporary table with the exact same structure as the one we're trying to trim down in size.

Step 2: Identify tables/columns that reference the class in question (X).

SELECT * 
FROM information_schema.COLUMNS 
WHERE table_schema = 'my_db_name' 
AND column_name LIKE '%reference_column%'

Step 3: For each row in those tables/columns with a non-null reference, get the row of X that they reference and copy it to the new table, if it doesn't already exist in the new table. I did this table by table, and used a chunked approach of 1000 records at a time to save system memory. I wrote this logic in my application layer to better control it. I used INSERT IGNORE to avoid creating primary key conflicts since some of the references were to the same X object.

Step 4: Delete the old table, rename the new one to the same name as the old one.

SET FOREIGN_KEY_CHECKS=0;
DROP TABLE old_table; 
RENAME TABLE new_table TO old_table;
SET FOREIGN_KEY_CHECKS=1;

All told this takes about an hour to run in my environment where table X has ~7 million rows and I have ~10 other tables making reference to it.