I have a Python script that compares two texts and highlights the differences between them. However, the comparison is thrown off by newline characters, causing mismatches when the same text is broken across lines differently. For instance, "arti\ncle" and "article" are being treated as different.
I'm currently using the difflib module to perform the comparison.
Here's a simplified version of my current code:
import difflib

def compare_texts(old_text, new_text):
    # Split both texts into lines and diff them line by line.
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    d = difflib.Differ()
    diff = d.compare(old_lines, new_lines)
    added_lines = []
    deleted_lines = []
    for line in diff:
        # Differ prefixes added lines with '+ ' and deleted lines with '- '.
        if line.startswith('+ '):
            added_lines.append(line[2:])
        elif line.startswith('- '):
            deleted_lines.append(line[2:])
    return added_lines, deleted_lines

if __name__ == "__main__":
    old_text = "arti\ncle\nthis is some old text."
    new_text = "article\nthis is some new text."
    added_lines, deleted_lines = compare_texts(old_text, new_text)
    print("Added lines:")
    print('\n'.join(added_lines))
    print("\nDeleted lines:")
    print('\n'.join(deleted_lines))
Can someone suggest an effective way to compare the texts that handles newline characters appropriately, so that "arti\ncle" and "article" are treated as the same during the comparison?
EDIT1: In fact, a lot of "\n" characters are introduced by a PDF reading function. The idea may be the following: if there is a "\n", we can try to delete it; if we then have a match, we can consider the two strings to be the same.
So "article" and "arti\ncle" are the same, while "article" and "arti\nficial" are not.
I can't remove all "\n" characters, because many of them are still useful.
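A minimal sketch of that matching rule might look like this (the helper name same_ignoring_one_newline is made up here for illustration, it is not part of my actual code):

def same_ignoring_one_newline(a, b):
    # Two strings count as the same if they are equal, or if removing a
    # single "\n" from one of them makes them equal.
    # "arti\ncle" vs "article" -> True; "arti\nficial" vs "article" -> False
    if a == b:
        return True
    return a.replace("\n", "", 1) == b or b.replace("\n", "", 1) == a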
EDIT2: Knowing the origin of the bug, we may also try another approach: some random "\n" characters were added by the PDF reading function, so we can try to delete these meaningless "\n" characters first, before diffing.
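As a rough sketch of this preprocessing idea (the heuristic and the function name remove_spurious_newlines are only one possible guess, since the right rule depends on how the PDF reader breaks lines), one could delete a "\n" only when joining the fragments around it produces a word that occurs in the other text:

import re

def remove_spurious_newlines(text, reference_text):
    # Words that appear in the other (reference) text.
    reference_words = set(re.findall(r"\w+", reference_text))

    def maybe_join(match):
        # Join the two fragments only if the result is a known word;
        # otherwise keep the original text (and its "\n") untouched.
        joined = match.group(1) + match.group(2)
        return joined if joined in reference_words else match.group(0)

    # A "\n" sitting directly between two word fragments is a candidate.
    return re.sub(r"(\w+)\n(\w+)", maybe_join, text)

With the example above, remove_spurious_newlines(old_text, new_text) rejoins "arti\ncle" into "article" (because "article" occurs in new_text) while keeping the "\n" before "this".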
Here's a suggested solution:
You need to handle the other cases; I only implemented the "+ - -" case. This solution assumes only one line break can occur inside a word, and all 'good' line breaks are lost.
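Reading "+ - -" as the case where one added line corresponds to two consecutive deleted lines (i.e. the old text had a single stray "\n" inside a word), a rough sketch of that post-processing step, applied to the lists returned by compare_texts above, could be the following (the function name merge_split_lines and the exact matching rule are my own assumptions):

def merge_split_lines(added_lines, deleted_lines):
    # If one added line equals two consecutive deleted lines joined
    # together, the only difference is a stray "\n": drop the line from
    # both result lists. Only this "+ - -" pattern is handled, and at most
    # one line break per word is assumed.
    remaining_added = []
    deleted = list(deleted_lines)
    for added in added_lines:
        merged = False
        for i in range(len(deleted) - 1):
            if deleted[i] + deleted[i + 1] == added:
                # The two deleted fragments rejoin into the added line.
                del deleted[i:i + 2]
                merged = True
                break
        if not merged:
            remaining_added.append(added)
    return remaining_added, deleted

With the example texts, "article" matches "arti" + "cle", so those entries are dropped, leaving only the genuinely changed lines ("this is some new text." added, "this is some old text." deleted).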