Test if the lines in file1 are a subset of the lines in file2


I have tried searching for an answer online but unfortunately without success. Therefore I am asking here:

I am trying to figure out if all lines in file1 are present in file2. Luckily I can compare entire lines rather than individual words. Unluckily I am dealing with GB-sized files, so a few of the elementary solutions I tried gave me memory errors.

At the moment I have the following code which does not work. Some guidance would be much appreciated.

import mmap

# Checks if all lines in file1 are present in file2
def isFile1SubsetOfFile2(file1, file2):
    with open(file1, "r") as f1:
        for line1 in f1:
            # Re-open and re-map file2 for every line of file1 --
            # this works, but is very slow on large files.
            with open(file2, "r+b") as f:
                mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
                result = mm.find(line1.strip().encode())
                if result == -1:
                    return False
    return True

Sample file2:

This is line1.
This is line2.
This is line3.
This is line4.
This is line5.
This is line6.
This is line7.
This is line8.
This is line9.

Should pass if e.g. file1 is:

This is line4.
This is line5.

Should fail if e.g. file1 is:

This is line4.
This is line10.

Edit: I have just added a working version of my code above for others' benefit. No memory errors, but it's very slow.
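For reference, a sketch of the same mmap idea that maps file2 only once per call instead of once per line of file1, which should remove most of the overhead (`is_subset_mmap` is a hypothetical name; note that `mmap.find` is a substring search, so it can report a false positive when one line happens to appear inside a longer line):

```python
import mmap

def is_subset_mmap(file1_path, file2_path):
    """Map file2 into memory once, then scan it for every line of file1.
    file2 must be non-empty (mmap of an empty file raises ValueError).
    Caveat: mm.find is a substring search, not a whole-line match."""
    with open(file1_path, "r") as f1, open(file2_path, "rb") as f2:
        mm = mmap.mmap(f2.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for line in f1:
                needle = line.strip().encode()
                # skip blank lines; they would trivially match anywhere
                if needle and mm.find(needle) == -1:
                    return False
            return True
        finally:
            mm.close()
```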

MSeifert's answer:

I'm not sure why it doesn't work, but I think I know a way you can solve it:

def is_subset_of(file1, file2):
    with open(file1, 'r') as f1, open(file2, 'r') as f2:
        for line in f1:
            line = line.strip()
            f2.seek(0)   # go to the start of f2
            if line not in (line2.strip() for line2 in f2):
                return False
    return True

This avoids opening the second file multiple times: it simply seeks back to the start of f2 for each line of f1, and at any moment only two lines are held in memory. That should be very memory-friendly.

Another way (that could be faster) would be to sort both file1 and file2. Then you could compare them line by line, moving to the next line in the other file whenever its string is lexicographically smaller than the one from the first file. Instead of O(n**2) that could be performed in O(n*log(n)). However, that's much more complicated, and I don't know if sorting GB-sized files makes sense (it could use too much memory!).

o11c's answer:

Dealing with files that don't fit in memory is always hard.

If file1 fits in memory but file2 is too large, here is a solution:

# file1 and file2 are open file-like objects
unseen = set(file1)
for line in file2:
    unseen.discard(line)  # discard avoids the KeyError that set.remove raises
# if unseen is empty, every line of file1 was found in file2
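Wrapping that fragment up as a self-contained function might look like this (a sketch; `is_subset_in_memory` is my name for it, and it strips trailing newlines so a missing newline on a file's last line doesn't cause a spurious mismatch):

```python
def is_subset_in_memory(file1_path, file2_path):
    """Load file1's lines into a set, then stream file2 once.
    Strips trailing newlines so the last line of either file still
    matches even without a terminating newline character."""
    with open(file1_path) as f1:
        unseen = {line.rstrip("\n") for line in f1}
    with open(file2_path) as f2:
        for line in f2:
            unseen.discard(line.rstrip("\n"))
            if not unseen:
                return True  # every file1 line has been seen; stop early
    return not unseen
```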

Otherwise, you should sort (or perhaps CFBS-sort) at least one of the files.