I have tried searching for an answer online but unfortunately without success. Therefore I am asking here:
I am trying to figure out if all lines in file1 are present in file2. Luckily I can just compare entire lines rather than individual words etc. Unluckily I am dealing with GB files so a few of the elementary solutions that I have tried gave me memory errors.
At the moment I have the following code which does not work. Some guidance would be much appreciated.
# Checks if all lines in file1 are present in file2
def isFile1SubsetOfFile2(file1 , file2):
file1 = open(file1, "r")
for line1 in file1:
with open(file2, "r+b") as f:
mm=mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
my_str_as_bytes = str.encode(line1)
result = mm.find(line1.strip().encode())
print(result)
if result == -1:
return False
return True
Sample file2:
This is line1.
This is line2.
This is line3.
This is line4.
This is line5.
This is line6.
This is line7.
This is line8.
This is line9.
Should pass if e.g. file1 is:
This is line4.
This is line5.
Should fail if e.g. file1 is:
This is line4.
This is line10.
Edit: I have just added a working version of my code for others benefit. No memory errors but its very slow.
I'm not sure why it doesn't work but I think I know a way how you can solve it:
This avoids opening the second file multiple times by always seeking to the start again for each line and at any moment you only hold 2 lines in memory. That should be very memory-friendly.
Another way (that could be faster) would be to sort both
file1andfile2. That way you could compare line by line and move to the next line in the other file if the string is lexically smaller than the one from the first file. Instead ofO(n**2)that could be performed inO(n*log(n)). However that's much more complicated and I don't know if sorting GB of files makes sense (could use too much memory!).