How to read a file fast for checking signature/magic number?

I'm a student and I'm pretty new to C++ and security. I've been given an assignment about checking the signature/magic number in a file, and I'm having trouble speeding up the reading time.

My idea is to read the file in binary mode using ifstream, store its data in a vector, then translate it to a hexadecimal string. Finally, I'll check whether the given signature exists in the hex string.

In theory this works fine, except that the whole process of allocating the vector's memory, reading the file, and converting its data takes ages. The reading part alone takes 44ms.

How can I improve this? Here is my code:

UINT CheckForSignature(CString source, CString dest_path) {

    // source is the hex string to look for in the file; dest_path is the path of the file to check

    ifstream file(dest_path, ios::binary);
    if (file.is_open()) {

        // check the size of the file
        file.seekg(0, ios::end);
        int iFileSize = file.tellg();
        // if the file is larger than 50MB, skip it
        if (iFileSize > 50000000) {
            // return -1: the file exceeds 50MB and does not need to be checked
            return -1; 
        }

        // read the file and store its data as a hex string

        file.seekg(0, ios::beg);
        vector<char> memblock(iFileSize); // 18ms alloc memory
        file.read(((char*)memblock.data()), iFileSize); // 44ms read file

        ostringstream ostrData;
        // that adds up to a total of 62ms so far;
        // once the time needed to translate the whole memblock is included,
        // this gets long as hell
        // need to improve this
        for (size_t i = 0; i < memblock.size(); i++) {
            int z = memblock[i] & 0xff;
            ostrData << hex << setfill('0') << setw(2) << z;
        }

        string strDataHex = ostrData.str();
        string strHexSource = (CT2A)source;
        if (strDataHex.find(strHexSource) != string::npos) {
            // return 1: the signature exists in the file
            return 1;
        }
        else {
            // return 0: the signature is not in the file
            return 0;
        }

    }
    // the file could not be opened (assumption: report it the same as "not found")
    return 0;
}

I'm open to all help and suggestions about solutions and code improvement. Thank you very much!

Answer by Manuel:

There are much more performant ways to read and examine file content.
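
One case worth separating out first: if the signature is a true magic number at a known, fixed offset (usually the very start of the file), there is no need to read the whole file at all. The following is only a minimal sketch under that assumption. `has_magic_at` is a hypothetical helper name, not something from the question, and it does not cover signatures that may appear anywhere in the file.

```cpp
#include <algorithm>
#include <cassert>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: returns true if the bytes found at `offset` in the file
// are equal to `magic`. Only magic.size() bytes are read from disk.
bool has_magic_at(const std::string& path,
                  const std::vector<unsigned char>& magic,
                  std::streamoff offset = 0)
{
    assert(offset >= 0);
    std::ifstream file(path, std::ios::binary);
    if (!file || magic.empty()) {
        return false;
    }
    file.seekg(offset, std::ios::beg);

    std::vector<unsigned char> head(magic.size());
    file.read(reinterpret_cast<char*>(head.data()),
              static_cast<std::streamsize>(head.size()));
    // A short read means the file is too small to hold the magic number here.
    if (file.gcount() != static_cast<std::streamsize>(head.size())) {
        return false;
    }
    return std::equal(head.begin(), head.end(), magic.begin());
}
```

For instance, `has_magic_at("report.pdf", {0x25, 0x50, 0x44, 0x46})` would check for the `%PDF` header (the file name is made up). The cost does not depend on the file size, so the 50MB limit would not matter for this kind of check.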

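Below is one simple, naive way (just an example): read the file in fixed-size blocks and compare the raw bytes directly. The timings come first, followed by the script and the program.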

I've created a 51M file with "0000" at the end (I've removed the size limit):

~/projects$ l data.bin 
-rw-r--r-- 1 manuel manuel 51M jul 27 02:51 data.bin

(Showing last two lines.)

~/projects$ tail data.bin | hexdump

0000b80 11b9 dddd 8fe9 bab1 134d 5645 eb74 81ce
0000b90 3030 3030 000a                         
0000b95
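
How data.bin was generated is not shown here. If you want to reproduce the test, the following sketch writes a comparable file. It rests on assumptions: random filler bytes, then the text `0000` and a newline as the last five bytes, which matches the tail of the hexdump above. `write_test_file` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <random>
#include <string>

// Hypothetical helper: writes `total_size` bytes to `path`, random filler
// followed by the marker "0000\n" as the last five bytes.
bool write_test_file(const std::string& path, std::size_t total_size)
{
    const std::string marker = "0000\n";
    assert(total_size >= marker.size());

    std::ofstream out(path, std::ios::binary);
    if (!out) {
        return false;
    }
    std::mt19937 rng(12345);  // fixed seed so every run produces the same file
    std::uniform_int_distribution<int> byte(0, 255);
    for (std::size_t i = 0; i < total_size - marker.size(); ++i) {
        int b = byte(rng);
        if (b == 0x30) {
            b = 0x31;  // keep '0' out of the filler so the marker occurs only at the end
        }
        out.put(static_cast<char>(b));
    }
    out << marker;
    return static_cast<bool>(out);
}
```

Calling `write_test_file("data.bin", 51 * 1024 * 1024)` should give a file of roughly the size used below. Putting the marker at the very end is the worst case for a linear search, which is what makes the comparison meaningful.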

Running your code (20 runs; each pair is the run index followed by the elapsed time in milliseconds):

~/projects$ ./runtest.sh 131072 20
0 2360 1 2333 2 2355 3 2360 4 2349 5 2350 6 2353 7 2346 8 2342 9 2381 10 2378 11 2394 12 2338 13 2363 14 2392 15 2374 16 2365 17 2433 18 2426 19 2397 
Average: 2369

Running my example (20 runs):

~/projects$ ./runtest.sh 131072 20 mio
0 105 1 103 2 104 3 104 4 104 5 105 6 104 7 104 8 104 9 102 10 102 11 104 12 104 13 103 14 102 15 103 16 103 17 105 18 104 19 104 
Average: 103

With a 5M file.

Yours:

~/projects$ ./runtest.sh 131072 20
0 238 1 243 2 244 3 242 4 243 5 244 6 239 7 245 8 243 9 246 10 239 11 246 12 243 13 242 14 240 15 243 16 242 17 245 18 240 19 243 
Average: 242

My example:

~/projects$ ./runtest.sh 131072 20 mio
0 10 1 10 2 10 3 11 4 10 5 10 6 10 7 10 8 10 9 10 10 11 11 10 12 10 13 10 14 10 15 10 16 10 17 10 18 10 19 10 
Average: 10

Script to compile and run (you can try several buffer sizes for my example):

#! /bin/bash

n=10
mio=""
bs=1024

if [ "$1" != "" ]
then
    bs=$1
fi

if [ "$2" == "" ]
then
    echo "Ups. Repeating? Will try with 10"
else
    n=$2
fi

if [ "$3" != "" ]
then
    mio="-DMIO"
fi

rm -f main

g++ -Wall -Wextra -g main.cc -o main -Wpedantic -std=c++2a -DBLOCK_SIZE=$bs $mio

tot=0
run=0
while [ "$run" != "$n" ]
do
    text=$(./main)
    mic=$(echo $text | cut - -d' ' -f 4)
    echo -n "$run $mic "
    tot=$(($tot + $mic))
    run=$(($run + 1))
done
echo
tot=$(($tot / $run))

echo "Average: $tot"
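
main.cc:

#include <chrono>
#include <cstring>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

int main()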
{
    string dest_path{"data.bin"};
    const unsigned char hex[] = {0x30, 0x30, 0x30, 0x30, 0x00 }; //  what to look for
#ifdef MIO
    ifstream file(dest_path, ios::binary);
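    size_t numblocks = 0;                 // complete blocks read before the one with the match
    std::chrono::high_resolution_clock::time_point init;
    std::chrono::high_resolution_clock::time_point finish;
    bool found = false;
    bool you_bet = false;                 // true once the last block has been read
    unsigned char memblock[BLOCK_SIZE];
    size_t posf = 0;                      // offset of the match inside its block
    const size_t sizeofhex = sizeof(hex) - 1;            // signature length without the trailing 0x00
    const size_t stride = BLOCK_SIZE - (sizeofhex - 1);  // new bytes consumed per block
    
    if (file.is_open()) {
        init = std::chrono::high_resolution_clock::now();
        do {
            file.read((char *)memblock, BLOCK_SIZE);
            const size_t got = static_cast<size_t>(file.gcount());
            if (!file) {
                you_bet = true; // short read: end of file (or a read error)
            }
            // compare only where the whole signature fits in the bytes just read
            for (size_t i = 0; i + sizeofhex <= got; ++i) {
                if (memblock[i] == hex[0] && std::memcmp(&memblock[i], hex, sizeofhex) == 0) {
                    finish = std::chrono::high_resolution_clock::now();
                    found = true;
                    posf = i;
                    break; // the first match is enough
                }
            }
            if (!found && !you_bet) {
                // step back so a signature split across two blocks is not missed
                file.seekg(-static_cast<std::streamoff>(sizeofhex - 1), ios::cur);
                ++numblocks;
            }
        } while (!you_bet && !found);
    }

    if (found) {
        auto res = std::chrono::duration_cast<std::chrono::milliseconds>(finish - init).count();
        cout << "Yep! Found! Milliseconds: " << res
             << " at block " << numblocks
             << " byte " << posf
             << ", total " << ((numblocks * stride) + posf)
             << endl;
    } else {
        cout << "Hmm... not found"  << endl;
    }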
#else
    std::chrono::high_resolution_clock::time_point init;
    std::chrono::high_resolution_clock::time_point finish;

    ifstream file(dest_path, ios::binary);
    if (file.is_open()) {

        // check for size of the file
        file.seekg(0, ios::end);
        int iFileSize = file.tellg();

        file.seekg(0, ios::beg);

        init = std::chrono::high_resolution_clock::now();
        vector<char> memblock(iFileSize); // 18ms alloc memory
        file.read(((char*)memblock.data()), iFileSize); // 44ms read file

        ostringstream ostrData;
        // that adds up to a total of 62ms so far;
        // once the time needed to translate the whole memblock is included,
        // this gets long as hell
        // need to improve this
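        for (size_t i = 0; i < memblock.size(); i++) {
            int z = memblock[i] & 0xff;
            // std::hex is spelled out because the local array named hex hides the manipulator here
            ostrData << std::hex << setfill('0') << setw(2) << z;
        }

        string strDataHex = ostrData.str();
        string strHexSource = "30303030"; // hex text of the four bytes 0x30 0x30 0x30 0x30 ("0000")
        if (strDataHex.find(strHexSource) != string::npos) {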
            // return 1: the signature exists in the file
            finish = std::chrono::high_resolution_clock::now();
            auto res = std::chrono::duration_cast<std::chrono::milliseconds>(finish - init).count();
            cout << "Yep! Found! Milliseconds: " << res
                 << endl;
            return 1;
        }
        else {
            // return 0; means there isn't the signature in the file
            return 0;
        }

    }
#endif
    return 1;
}
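
To connect this back to the original `CheckForSignature`, which receives the signature as a hex string: instead of converting the whole file to hex text, it is usually cheaper to convert the short signature to bytes once and search the raw data. The sketch below assumes the signature is plain hex digits with no spaces or `0x` prefix; `parse_hex`, `contains_signature` and `file_contains_hex` are hypothetical names. It reads the whole file in one go for brevity, and the block-wise loop above could be used in its place.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: turns hex text such as "30303030" into raw bytes.
// Returns an empty vector for odd-length input or any non-hex character.
std::vector<unsigned char> parse_hex(const std::string& hex_text)
{
    std::vector<unsigned char> bytes;
    if (hex_text.size() % 2 != 0) {
        return bytes;
    }
    auto nibble = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    };
    for (std::size_t i = 0; i < hex_text.size(); i += 2) {
        const int hi = nibble(hex_text[i]);
        const int lo = nibble(hex_text[i + 1]);
        if (hi < 0 || lo < 0) {
            return {};
        }
        bytes.push_back(static_cast<unsigned char>(hi * 16 + lo));
    }
    assert(bytes.size() == hex_text.size() / 2);
    return bytes;
}

// True if `signature` occurs anywhere in `data` (byte-aligned comparison).
bool contains_signature(const std::vector<unsigned char>& data,
                        const std::vector<unsigned char>& signature)
{
    if (signature.empty()) {
        return false;
    }
    return std::search(data.begin(), data.end(),
                       signature.begin(), signature.end()) != data.end();
}

// Reads the whole file into memory and searches it for the hex signature.
bool file_contains_hex(const std::string& path, const std::string& hex_text)
{
    const std::vector<unsigned char> signature = parse_hex(hex_text);
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file || signature.empty()) {
        return false;
    }
    const std::streamoff size = file.tellg();
    if (size <= 0) {
        return false;
    }
    std::vector<unsigned char> data(static_cast<std::size_t>(size));
    file.seekg(0, std::ios::beg);
    file.read(reinterpret_cast<char*>(data.data()), size);
    data.resize(static_cast<std::size_t>(file.gcount()));
    return contains_signature(data, signature);
}
```

A side benefit of comparing bytes: searching hex text can match across byte boundaries. The bytes 0x13 0x03 become the text "1303", which contains "30" even though no 0x30 byte is present. The byte-wise search does not have that problem.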