How do I copy all the duplicate lines of a file to a new file in Python?


I'm trying to write some code that copies all the duplicate lines of a file to a new file. The program I wrote checks the first 3 characters of each line and compares them to the next line.

f=open(r'C:\Users\xamer\Desktop\file.txt','r')
data=f.readlines()
f.close()
lines=data.copy()
dup=open(r'C:\Users\xamer\Desktop\duplicate.txt','a')
for x in data:
    for y in data:
        if (y[0]==x[0]) and (y[1]==x[1]) and (y[2]==x[2]):
            lines.append(y)
        else:
            lines.remove(y)
dup.write(lines)
dup.close()

I'm getting the following error:

Traceback (most recent call last):
  File "C:\Users\xamer\Desktop\file.py", line 80, in <module>
    lines.remove(y)
ValueError: list.remove(x): x not in list

Any suggestions?

Best answer (Antonino):

These snippets should do the job you were asking for. At first I thought of creating a duplicated_lines list and writing it all out at the end, but then I realized I could improve performance and avoid an additional final loop by writing the repeated items on the fly.

As another user pointed out, it is not entirely clear whether you want to check only adjacent duplicate entries or repeated lines regardless of their position.

In the first case, where repetitions immediately follow each other, this is the code:

# opening the source file
with open('hello.txt', 'r') as f:
    # readlines() returns a list containing the original lines (with trailing newlines)
    data = f.readlines()

# creating the file to host the repeated lines
with open('duplicated.txt', 'a') as f:
    for i in range(0, len(data) - 1):
        # stripping the newline avoids a mismatch when the last line is a repeated item,
        # since the last line of a file often has no trailing '\n'
        if data[i].strip('\n') == data[i + 1].strip('\n'):
            print("Lines {}: {}".format(i, data[i]))
            print("Lines {}: {}".format(i + 1, data[i + 1]))
            # duplicated_lines.append(data[i])  # the list approach I initially considered
            print("Line repeated: " + data[i])
            # write the stripped line plus a single newline, so lines that already
            # end in '\n' don't get a blank line after them
            f.write(data[i].strip('\n') + '\n')

If instead you want to check for repeated lines anywhere in the file, this is the code:

# opening the source file
with open('hello.txt', 'r') as f:
    # readlines() returns a list containing the original lines (with trailing newlines)
    data = f.readlines()

# creating the file to host the repeated lines
with open('duplicated.txt', 'a') as f:
    for i in range(0, len(data) - 1):
        for j in range(i + 1, len(data)):
            # stripping the newline avoids a mismatch when the last line is a repeated item
            if data[i].strip('\n') == data[j].strip('\n'):
                print("Lines {}: {}".format(i, data[i]))
                print("Lines {}: {}".format(j, data[j]))
                # duplicated_lines.append(data[i])
                print("Line repeated: " + data[i])
                # write the stripped line plus a single newline to avoid doubled newlines
                f.write(data[i].strip('\n') + '\n')
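As a side note, the nested loop above is O(n²) in the number of lines. Below is a minimal alternative sketch, not part of the accepted answer, that finds repeated lines in a single pass by keeping a set of lines already seen; it writes each repeated occurrence after the first. The file names reuse hello.txt and duplicated.txt from the snippets above, and the key is the full stripped line; if you really only want to compare the first 3 characters, as in your original code, you could use the first 3 characters of the stripped line as the key instead.

# sketch only: single-pass duplicate detection using a set of lines already seen
with open('hello.txt', 'r') as src, open('duplicated.txt', 'a') as dup:
    seen = set()
    for i, line in enumerate(src):
        key = line.strip('\n')  # or line.strip('\n')[:3] to mimic the 3-character check
        if key in seen:
            print("Line {} repeated: {}".format(i, key))
            dup.write(key + '\n')
        else:
            seen.add(key)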