PDF file dedupe issue: same content, but generated at different times from a DOCX


I am working on a PDF file dedupe project and have analyzed many Python libraries that read each file, generate a hash value for it, and compare that against the next file to detect duplicates, similar to the logic below, or that use Python's filecmp library. But the issue I found with this logic is that when the same source DOCX is saved to PDF more than once ("Save as PDF"), the outputs are not considered duplicates, even though the content is exactly the same. Why does this happen? Is there any other way to read the content and create a unique hash value based on the actual content?

import hashlib

def calculate_hash_val(path, blocks=65536):
    # Hash the raw bytes of the file, reading it in fixed-size chunks
    # so large PDFs do not have to fit in memory at once.
    hasher = hashlib.md5()
    with open(path, 'rb') as file:
        data = file.read(blocks)
        while len(data) > 0:
            hasher.update(data)
            data = file.read(blocks)
    return hasher.hexdigest()

There are 2 best solutions below

btilly (BEST ANSWER)

One of the things that happens is that metadata, including the creation time, is saved into the file. It is invisible when you view the PDF, but it makes the byte-level hash different.
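You can confirm this yourself by dumping the document information dictionary of two exports. Below is a minimal sketch, assuming the third-party pypdf package is installed (pip install pypdf); the two filenames are hypothetical stand-ins for the same DOCX exported twice:

from pypdf import PdfReader

for path in ("export_monday.pdf", "export_tuesday.pdf"):
    info = PdfReader(path).metadata
    # /CreationDate and /ModDate typically differ between exports,
    # so a byte-level hash of the two files can never match.
    print(path, info.creation_date, info.modification_date)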

Here is an explanation of how to find and strip out that data with at least one tool. I am sure that there are many others.
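Alternatively, instead of hashing the raw bytes, you can hash only the text the PDF contains. Here is a sketch, again assuming pypdf; note that text extraction can vary slightly between PDF producers, so whitespace is normalized before hashing:

import hashlib
from pypdf import PdfReader  # third-party: pip install pypdf

def content_hash(path):
    # Hash only the extracted page text, so metadata such as
    # /CreationDate no longer affects the result.
    hasher = hashlib.md5()
    for page in PdfReader(path).pages:
        text = page.extract_text() or ""  # image-only pages yield nothing
        # Collapse whitespace: different producers break lines differently.
        hasher.update(" ".join(text.split()).encode("utf-8"))
    return hasher.hexdigest()

Two PDFs exported from the same DOCX at different times then hash identically, as long as the extractable text is the same. Keep in mind this ignores images and layout, so it only suits text-centric documents.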

kimstik

You can take a look at another existing PDF optimization project: https://github.com/pts/pdfsizeopt

What’s interesting is that you can see the amount of redundant data in its logs. In my experience, PDFs can be reduced to 20-80% of their original size by removing redundancy.