I need to pseudonymize ids in dataset, in order to comply with GDPR. The IDs in question are integers from 0 to 10^7. I am looking form some elegant way to achieve this. The process must be repeatable and easily transferable, therefore I would like to avoid any addition of random seeds. I would also like to avoid having lookup table. In the end, I am looking for elegant function to transform those numbers in non trivial way, so a person who will be able to identify one or two id-pseudonymized id pairs from the data is not able to guess the function.
So far, I came up with splitting the number in two, adding different constants to each of those new numbers, following by modulo operation and recombination of the two numbers into one new id. I am hoping that someone here will suggest better approach.
edit: The ids in database are not static, some are removed while others are added.
I’m not sure there’s any feasible way to do this without having some sort of secret information kept separately.
Here’s why. Your IDs range from 0 to 9,999,999, inclusive. If you use any fixed hashing algorithm to map those IDs to hashes, it would be very computationally easy for an attacker to simply compute the hash of each of the ten million possible IDs, then cross-compare those hash outputs against the hashes of the real IDs to determine what those IDs are. (You could conceivably make this harder for an attacker to do by making your hash function very, very hard to compute, but then it’s not very useful for your own purposes.)
On the other hand, if there is some sort of secret information you can keep from an attacker, then you have more options available. For example, if you’re allowed a secret key, you could use that key as part of the hash so that an attacker couldn’t compute all possible hashes as described above without having the key. (@President James K. Polk’s comment mentions format-preserving encryption, which is typically done by having a secret key.)