PDF file dedupe issue: same content, but generated at different times from a DOCX


I am working on a PDF file dedupe project and have analyzed many Python libraries that read each file, generate a hash value for it, and compare that against the next file to detect duplicates, similar to the logic below, or that use Python's filecmp library. But the issue I found with this logic is that when the same source DOCX is saved to PDF more than once ("Save as PDF"), the outputs are not considered duplicates, even though the content is exactly the same. Why does this happen? Is there any other way to read the content and create a unique hash value based on the actual content?

import hashlib

def calculate_hash_val(path, blocks=65536):
    # Hash the raw bytes of the file, reading it in fixed-size chunks
    # so large PDFs do not have to fit in memory at once.
    hasher = hashlib.md5()
    with open(path, 'rb') as file:
        data = file.read(blocks)
        while len(data) > 0:
            hasher.update(data)
            data = file.read(blocks)
    return hasher.hexdigest()

There are 2 best solutions below

btilly (BEST ANSWER)

One of the things that happens is that metadata, including the creation time, is saved into the file. It is invisible when you view the PDF, but it makes the byte-level hash different.
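You can confirm this yourself by dumping the document information dictionary of two exports. Below is a minimal sketch, assuming the third-party pypdf package is installed (pip install pypdf); the two filenames are hypothetical stand-ins for the same DOCX exported twice:

from pypdf import PdfReader

for path in ("export_monday.pdf", "export_tuesday.pdf"):
    info = PdfReader(path).metadata
    # /CreationDate and /ModDate typically differ between exports,
    # so a byte-level hash of the two files can never match.
    print(path, info.creation_date, info.modification_date)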

Here is an explanation of how to find and strip out that data with at least one tool. I am sure that there are many others.
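Alternatively, instead of hashing the raw bytes, you can hash only the text the PDF contains. Here is a sketch, again assuming pypdf; note that text extraction can vary slightly between PDF producers, so whitespace is normalized before hashing:

import hashlib
from pypdf import PdfReader  # third-party: pip install pypdf

def content_hash(path):
    # Hash only the extracted page text, so metadata such as
    # /CreationDate no longer affects the result.
    hasher = hashlib.md5()
    for page in PdfReader(path).pages:
        text = page.extract_text() or ""  # image-only pages yield nothing
        # Collapse whitespace: different producers break lines differently.
        hasher.update(" ".join(text.split()).encode("utf-8"))
    return hasher.hexdigest()

Two PDFs exported from the same DOCX at different times then hash identically, as long as the extractable text is the same. Keep in mind this ignores images and layout, so it only suits text-centric documents.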

kimstik

You can take a look at another existing PDF optimization project: https://github.com/pts/pdfsizeopt

What’s interesting is that you can see the amount of redundant data in its logs. In my experience, PDFs can be reduced to 20-80% of their original size by removing redundancy.