How to check type of files using the header file signature (magic numbers)?


When I give it a file name with its extension, my code successfully detects the file's type from its "magic number".

magic_numbers = {'png': bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]),
                 'jpg': bytes([0xFF, 0xD8, 0xFF, 0xE0]),
                 #*********************#
                 'doc': bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1]),
                 'xls': bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1]),
                 'ppt': bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1]),
                 #*********************#
                 'docx': bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00]),
                 'xlsx': bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00]),
                 'pptx': bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00]),
                 #*********************#
                 'pdf': bytes([0x25, 0x50, 0x44, 0x46]),
                 #*********************#
                 'dll': bytes([0x4D, 0x5A, 0x90, 0x00]),
                 'exe': bytes([0x4D, 0x5A]),

                 }

max_read_size = max(len(m) for m in magic_numbers.values()) 
 
with open('file.pdf', 'rb') as fd:
    file_head = fd.read(max_read_size)
 
if file_head.startswith(magic_numbers['pdf']):
    print("It's a PDF File")
else:
    print("It's not a PDF file")

I want to know how to modify it so that I don't have to hard-code the part below, i.e. once I pass in a file, it tells me the file's type directly.

if file_head.startswith(magic_numbers['pdf']):
    print("It's a PDF File")
else:
    print("It's not a PDF file")

I hope you understand me.


There are 2 best solutions below

Geoduck On BEST ANSWER

You most likely just want to loop over the entries and test them all.

You may be able to optimize or add some error checking by using the extension as well. If you take the extension from the filename and check its signature first, you'll succeed most of the time. When the content does not match, you may not want to accept "baby.png" as an xlsx file: that would be suspicious and worthy of an error.
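The extension check described above could look something like the following. This is only a sketch: `extension_matches` is a hypothetical helper name, the table is a shortened copy of the one in the question, and lowercasing the extension is an added assumption.

```python
# Shortened copy of the question's table, for illustration only
magic_numbers = {
    'png': bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]),
    'pdf': bytes([0x25, 0x50, 0x44, 0x46]),
    'xlsx': bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00]),
}

def extension_matches(filename, file_head):
    """Hypothetical helper.

    Returns True if the header fits the file's extension, False if it
    does not, and None if the extension is not in the table at all.
    """
    ext = filename.rsplit('.', 1)[-1].lower()
    if ext not in magic_numbers:
        return None
    return file_head.startswith(magic_numbers[ext])
```

With this, a file named "baby.png" whose first bytes are the xlsx signature gives `False`, which the caller can treat as an error.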

But, if you ignore extension, just loop over the entries:

for ext in magic_numbers:
    if file_head.startswith(magic_numbers[ext]):
        print("It's a {} File".format(ext))

You probably want to put this in a function that returns the type, so you could just return the type instead of printing it out.
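A function version might look like this. It is a sketch under a few assumptions: `guess_types` and `read_head` are made-up names, the table is a shortened copy of the question's, and the function collects every matching extension instead of stopping at the first one.

```python
# Shortened copy of the question's table, for illustration only
magic_numbers = {
    'pdf': bytes([0x25, 0x50, 0x44, 0x46]),
    'doc': bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1]),
    'xls': bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1]),
    'dll': bytes([0x4D, 0x5A, 0x90, 0x00]),
    'exe': bytes([0x4D, 0x5A]),
}

def read_head(path):
    """Read just enough bytes to compare against the longest signature."""
    size = max(len(m) for m in magic_numbers.values())
    with open(path, 'rb') as fd:
        return fd.read(size)

def guess_types(file_head):
    """Return all extensions whose signature file_head starts with."""
    return [ext for ext, magic in magic_numbers.items()
            if file_head.startswith(magic)]
```

Calling `guess_types(read_head('file.pdf'))` then gives a list the caller can inspect, and an empty list means no signature matched.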

EDIT: Since some types share magic numbers, we need to assume the extension is correct until we know that it isn't. I would extract the extension from the filename. This could be done with pathlib or just a string split:

ext = filename.rsplit('.', 1)[-1]
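With pathlib, the same extraction might look like the sketch below. `get_extension` is a hypothetical helper name, and the lowercase conversion is an added assumption so that a name like `FILE.PDF` matches the lowercase keys in `magic_numbers`.

```python
from pathlib import Path

def get_extension(filename):
    # Path.suffix includes the leading dot ('.pdf'), or is '' if there is none
    return Path(filename).suffix.lstrip('.').lower()
```

One difference from the string split: for a name without a dot, such as `README`, this returns an empty string, whereas `rsplit('.', 1)[-1]` returns the whole name.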

Then test it specifically:

if ext in magic_numbers:
    if file_head.startswith(magic_numbers[ext]):
        return ext

Put the extension test first. Putting it all together:

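def detect_type(filename, file_head):
    # Trust the extension first, since several types share a signature
    ext = filename.rsplit('.', 1)[-1]
    if ext in magic_numbers:
        if file_head.startswith(magic_numbers[ext]):
            return ext

    # Otherwise fall back to the first signature that matches
    for ext in magic_numbers:
        if file_head.startswith(magic_numbers[ext]):
            return ext

    # No signature matched
    return None

Ryan Mediocre On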

So I took my shot at iterating over the magic_numbers and using the file extension to confirm the correct one. This is working for me, hopefully it helps others.

from tkinter import filedialog as di

#Defines Magic Numbers by file type for comparison

magic_numbers = {'png': bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]),
                 'jpg': bytes([0xFF, 0xD8, 0xFF, 0xE0]),
                 #*********************#
                 'doc': bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1]),
                 'xls': bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1]),
                 'ppt': bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1]),
                 #*********************#
                 'docx': bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00]),
                 'xlsx': bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00]),
                 'pptx': bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00]),
                 #*********************#
                 'pdf': bytes([0x25, 0x50, 0x44, 0x46]),
                 #*********************#
                 'dll': bytes([0x4D, 0x5A, 0x90, 0x00]),
                 'exe': bytes([0x4D, 0x5A]),

                 }



max_read_size = max(len(m) for m in magic_numbers.values())

#Get file path from user
def open_file_selection():
    # askopenfilename returns the path without opening the file,
    # so no extra file handle is left open
    return di.askopenfilename()

#File scan: opens the file in binary mode and reads the leading bytes
with open(open_file_selection(), 'rb') as fd:
    file_head = fd.read(max_read_size)
    print(file_head)
    

#Compares file to magic numbers to determine file type
for ext in magic_numbers:
    if file_head.startswith(magic_numbers[ext]) and fd.name.rsplit('.', 1)[-1] == ext:
        print(f"It's a {ext} File")
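For reference, the comparison step can also be written as a function that takes a path, which makes it possible to try out without the file dialog. This is only a sketch: `describe_file` is a hypothetical name, the table is a shortened copy of the one above, and the two fallback messages are an addition that the code above does not have (it prints nothing when the extension and the header disagree).

```python
import os
import tempfile

# Shortened copy of the table above, for illustration only
magic_numbers = {
    'png': bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]),
    'xlsx': bytes([0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00]),
    'pdf': bytes([0x25, 0x50, 0x44, 0x46]),
}
max_read_size = max(len(m) for m in magic_numbers.values())

def describe_file(path):
    """Hypothetical helper: compare a file's header with its extension."""
    with open(path, 'rb') as fd:
        file_head = fd.read(max_read_size)

    ext = os.path.splitext(path)[1].lstrip('.').lower()
    matches = [e for e, m in magic_numbers.items() if file_head.startswith(m)]

    if ext in matches:
        return f"It's a {ext} File"
    if matches:
        # Added fallback: the header is known but the extension disagrees
        return f"Extension '{ext}' does not match; header looks like: {', '.join(matches)}"
    return "Unknown file type"

# Small demo using a throwaway file instead of the file dialog
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'sample.pdf')
    with open(sample, 'wb') as fd:
        fd.write(b'%PDF-1.7\n')
    print(describe_file(sample))
```

Returning a string instead of printing inside the loop keeps the check separate from the user interface, so the same function could be called with the path chosen in the dialog.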