We have two tables, one containing a full list of appearances of URLs, the other containing a list of unique urls, as enforced an MD5 uniqueness index. Simplified DDL:
create table myschema.posted_url (
record_id serial primary key,
url varchar not null
)
;
create table myschema.unique_urls (
record_id serial primary key,
url varchar not null,
record_added timestamptz not null default now()
)
;
create unique index unique_urls_md5 on myschema.unique_urls (md5(url));
The uniqueness constraint has been working as expected for millions of rows in both tables. We have a single (python) process that adds new appearances of urls to posted_url, and then immediately runs this:
insert into myschema.unique_urls (url) (
select distinct url from myschema.posted_url where url not in (
select url from myschema.unique_urls
)
)
;
We got a single instance of failure:
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "unique_urls_md5"
DETAIL: Key (md5(url::text))=(<myurlmd5hash>) already exists.
The commit flow is that urls from a single entity are added, then the unique urls update is done, then commit, then move on to the next entity. In the above case, the crash prevented commit of the posted_url record, but the commit to unique_urls appears to have happened i.e. the unique_url record is timestamped as just before the crash and the posted_url records are timestamped as of the next time the process ran (next day).
How is this possible?
Something other process interacting with these tables? No - other processes only query. We even have checks in place to ensure we cannot run two of these processes in parallel.
Ok, lets assume I'm wrong about the above and something else came in and added the unique record. Isn't it still impossible for the insert statement to fail in this way? Doesn't it take place within a single transaction, so if the hash wasn't in the unique table when it was selected from, it shouldn't be in there at time of insert?
Thanks very much, and apologies I cannot post more full code for data privacy reasons.