While refactoring some old code, we've discovered that objects of class X are being created far too often and that ~80% of them are left without a reference.
I have ~10 tables that reference rows in the table of class X. I can easily identify the number of orphaned rows. The table has around 7 million rows and only around 1.5 million represent legitimately referenced objects.
After fixing the code causing this issue, I need to delete all of these orphaned rows in a moderately performant way.
With my limited DB admin experience, the only way I know to do this is to select all the legitimate IDs of 'X-type' objects and then do something like: DELETE FROM x WHERE id NOT IN (valid_references). That's something like 10 trillion comparisons; there has to be a better way.
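For concreteness, the naive version would look something like the sketch below; ref_table_1, ref_table_2, and the x_id columns are placeholder names standing in for the real referencing tables and columns.

```sql
-- Naive approach: one big NOT IN delete against every referencing table.
-- Filtering out NULLs matters: a single NULL in a NOT IN subquery
-- makes the whole DELETE match nothing.
DELETE FROM x
WHERE id NOT IN (
    SELECT x_id FROM ref_table_1 WHERE x_id IS NOT NULL
    UNION
    SELECT x_id FROM ref_table_2 WHERE x_id IS NOT NULL
    -- ... one SELECT per referencing table
);
```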
Posting this for any future finders. After doing some research, I found this to be about as ideal a solution as I could put together.
Step 1: Create a new temporary table with the exact same structure as the one we're trying to trim down in size.
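In MySQL (where INSERT IGNORE is available), step 1 can be a single statement; x_new is just the name I'm using here for the new table.

```sql
-- Create an empty table with the same columns, indexes, and primary key as x.
CREATE TABLE x_new LIKE x;
```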
Step 2: Identify tables/columns that reference the class in question (X).
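If the references happen to be declared as foreign keys (mine partly weren't, so some had to be found by reading the schema and code), MySQL's information_schema can list them; adjust the schema name to your database.

```sql
-- List every table/column with a foreign key pointing at x.id.
SELECT TABLE_NAME, COLUMN_NAME
FROM information_schema.KEY_COLUMN_USAGE
WHERE REFERENCED_TABLE_SCHEMA = 'my_schema'  -- your database name
  AND REFERENCED_TABLE_NAME = 'x'
  AND REFERENCED_COLUMN_NAME = 'id';
```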
Step 3: For each row in those tables/columns with a non-null reference, copy the referenced row of X into the new table if it isn't already there. I did this table by table, using a chunked approach of 1000 records at a time to save system memory. I wrote this logic in my application layer to better control it, and used INSERT IGNORE to avoid primary key conflicts, since many of the references point to the same X object (see the sketch after these steps).

Step 4: Drop the old table and rename the new one to the old table's name.
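A rough sketch of one chunk of step 3 for a single referencing table, plus the final swap in step 4. In my case the chunking loop lives in the application layer; ref_table, its x_id column, and the :last_seen_id bind parameter (which the application advances after each chunk) are illustrative names, not my actual schema.

```sql
-- Step 3, one chunk: copy the X rows referenced by the next 1000 distinct ids.
-- INSERT IGNORE skips rows whose primary key is already in x_new.
INSERT IGNORE INTO x_new
SELECT x.*
FROM x
JOIN (
    SELECT DISTINCT x_id
    FROM ref_table               -- one of the ~10 referencing tables
    WHERE x_id IS NOT NULL
      AND x_id > :last_seen_id   -- keyset pagination instead of OFFSET
    ORDER BY x_id
    LIMIT 1000
) AS refs ON refs.x_id = x.id;

-- Step 4: once every referencing table has been processed, swap the tables.
-- RENAME TABLE changes both names in one atomic statement.
RENAME TABLE x TO x_old, x_new TO x;
DROP TABLE x_old;
```

Keeping the old table around as x_old until the swap has been verified is a little safer than dropping it outright, which is why the sketch renames first and drops last.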
All told, this takes about an hour to run in my environment, where table X has ~7 million rows and ~10 other tables reference it.