I have a Python script that compares two texts and highlights the differences between them. However, the comparison is thrown off by newline characters, causing mismatches when the same text is broken across lines differently. For instance, "arti\ncle" and "article" are being treated as different.
I'm currently using the difflib module to perform the comparison.
Here's a simplified version of my current code:
import difflib

def compare_texts(old_text, new_text):
    # Split both texts into lines and diff them line by line.
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    d = difflib.Differ()
    diff = d.compare(old_lines, new_lines)
    added_lines = []
    deleted_lines = []
    for line in diff:
        # Differ prefixes added lines with '+ ' and deleted lines with '- '.
        if line.startswith('+ '):
            added_lines.append(line[2:])
        elif line.startswith('- '):
            deleted_lines.append(line[2:])
    return added_lines, deleted_lines

if __name__ == "__main__":
    old_text = "arti\ncle\nthis is some old text."
    new_text = "article\nthis is some new text."
    added_lines, deleted_lines = compare_texts(old_text, new_text)
    print("Added lines:")
    print('\n'.join(added_lines))
    print("\nDeleted lines:")
    print('\n'.join(deleted_lines))
Can someone suggest an effective way to compare the texts that handles newline characters appropriately, so that "arti\ncle" and "article" are treated as the same during the comparison?
EDIT1: In fact, a lot of "\n" characters are introduced by a PDF reading function. The idea may be the following: if there is a "\n", we can try to delete it; if we then have a match, we can consider the two strings to be the same.
So "article" and "arti\ncle" are the same, while "article" and "arti\nficial" are not.
I can't remove all "\n" characters, because many of them are still useful.
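A minimal sketch of that matching rule might look like this (the helper name same_ignoring_one_newline is made up here for illustration, it is not part of my actual code):

def same_ignoring_one_newline(a, b):
    # Two strings count as the same if they are equal, or if removing a
    # single "\n" from one of them makes them equal.
    # "arti\ncle" vs "article" -> True; "arti\nficial" vs "article" -> False
    if a == b:
        return True
    return a.replace("\n", "", 1) == b or b.replace("\n", "", 1) == a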
EDIT2: Knowing the origin of the bug, we may also try another approach: some random "\n" characters were added by the PDF reading function, so we can try to delete these meaningless "\n" characters first, before diffing.
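As a rough sketch of this preprocessing idea (the heuristic and the function name remove_spurious_newlines are only one possible guess, since the right rule depends on how the PDF reader breaks lines), one could delete a "\n" only when joining the fragments around it produces a word that occurs in the other text:

import re

def remove_spurious_newlines(text, reference_text):
    # Words that appear in the other (reference) text.
    reference_words = set(re.findall(r"\w+", reference_text))

    def maybe_join(match):
        # Join the two fragments only if the result is a known word;
        # otherwise keep the original text (and its "\n") untouched.
        joined = match.group(1) + match.group(2)
        return joined if joined in reference_words else match.group(0)

    # A "\n" sitting directly between two word fragments is a candidate.
    return re.sub(r"(\w+)\n(\w+)", maybe_join, text)

With the example above, remove_spurious_newlines(old_text, new_text) rejoins "arti\ncle" into "article" (because "article" occurs in new_text) while keeping the "\n" before "this".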
Here's a suggested solution:
You need to handle the other cases; I only implemented the "+ - -" case. This solution assumes only one line break can occur inside a word, and all 'good' line breaks are lost.
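Reading "+ - -" as the case where one added line corresponds to two consecutive deleted lines (i.e. the old text had a single stray "\n" inside a word), a rough sketch of that post-processing step, applied to the lists returned by compare_texts above, could be the following (the function name merge_split_lines and the exact matching rule are my own assumptions):

def merge_split_lines(added_lines, deleted_lines):
    # If one added line equals two consecutive deleted lines joined
    # together, the only difference is a stray "\n": drop the line from
    # both result lists. Only this "+ - -" pattern is handled, and at most
    # one line break per word is assumed.
    remaining_added = []
    deleted = list(deleted_lines)
    for added in added_lines:
        merged = False
        for i in range(len(deleted) - 1):
            if deleted[i] + deleted[i + 1] == added:
                # The two deleted fragments rejoin into the added line.
                del deleted[i:i + 2]
                merged = True
                break
        if not merged:
            remaining_added.append(added)
    return remaining_added, deleted

With the example texts, "article" matches "arti" + "cle", so those entries are dropped, leaving only the genuinely changed lines ("this is some new text." added, "this is some old text." deleted).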