Git word diff missing deletion of newline


I use git's word diffing to find changes between texts on a per-character basis:

git diff --word-diff=porcelain --word-diff-regex='\[[^]]*\]?|.' --no-index original.txt changed.txt

(If you're wondering, the custom regex I use ensures that characters within brackets are never broken up – credit to jthill.)

The resulting diff does not indicate deletions or additions of newlines, with or without my custom regex. And when I replace a newline with, say, a space, it only indicates the addition of the space, not the deletion of the newline.

Given the following original

foo

bar

baz

and the following changed text (I removed one line break in the top half and added one in the bottom half)

foo
bar


baz

I get this porcelain-style diff, where ~ represents newlines:

@@ -1,5 +1,5 @@
 foo
~
~
 bar
~
 
~
~
 baz
~

But I want the following diff:

@@ <whatever> @@
 foo
-\n
~
 bar
~
~
+\n
 baz
~

I have tried adding |\n to my regex, to no avail. (Btw git uses POSIX "extended" regular expressions.) The docs say that "[a] match that contains a newline is silently truncated(!) at the newline." I don't fully understand what this means but I suspect it could be the cause of the issue.

Is there any way to get git to produce the desired diff?

Answer by Simeon:

Currently, Git does not allow newlines to be words [1]. I'm hoping there's a more elegant solution than this one, perhaps involving tweaking the git settings. Regardless, here's a preprocessor-based solution:

sed -ze "s/\n/$(echo -ne '\ufffd')\n/g" original.txt > temp1.txt
sed -ze "s/\n/$(echo -ne '\ufffd')\n/g" changed.txt > temp2.txt
git diff --word-diff=porcelain --word-diff-regex='\[[^]]*\]?|.' --no-index temp1.txt temp2.txt | sed -zE "s/([^\+\-])$(echo -ne '\ufffd')/\1/g" | sed -ze "s/$(echo -ne '\ufffd')\n~\?/\\\n/g"
rm -rf temp1.txt temp2.txt

Basically it

  1. Replaces each "\n" with "\ufffd\n" (inserts a temporary Unicode character before every newline), writing the results to temp1.txt and temp2.txt.
    • According to [2], there isn't yet a known way to git diff two string inputs, even with the latest version of git (without requiring .git), which is why temporary files are used rather than a one-liner.
  2. Runs git diff on the two files and post-processes the output:
    • Removes any "\ufffd" that doesn't directly follow a "+" or "-".
    • Replaces the remaining ones with a literal "\n".
  3. Cleans up the intermediate files.

Output:

@@ -1,5 +1,5 @@
 foo
~
-\n
 bar
~
 
~
+\n
 baz
~

Assumption: the Unicode character (U+FFFD) must not already occur in the input files, which makes the solution less elegant.
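That assumption can be checked before running the diff. The following is a minimal sketch, not part of the original commands: the function name marker_is_free is hypothetical, and the marker is written as the three UTF-8 bytes of U+FFFD in octal so that it does not depend on the shell's \u support.

```shell
# U+FFFD encoded as UTF-8 (bytes EF BF BD), written as octal escapes.
MARKER=$(printf '\357\277\275')

# Hypothetical helper: succeeds only if none of the given files
# already contains the marker character.
marker_is_free() {
  for f in "$@"; do
    # -F: match the marker as a fixed string, -q: no output, status only.
    if grep -q -F -e "$MARKER" -- "$f"; then
      echo "marker already present in $f" >&2
      return 1
    fi
  done
  return 0
}
```

For example, `marker_is_free original.txt changed.txt || exit 1` would stop a script before the temporary files are created. If the check fails, another character that is absent from both files could be used as the marker instead.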

Git chose to attach the "-\n" to the second "~" rather than the first. That is simply how git diff aligns the change and it shouldn't affect the meaning of the output.

Appending | sed -ze "s/\(\n *\)\+/\n/g" to the end of line 3 (the git diff pipeline) will collapse the whitespace-only line in the middle, but again this would deviate from git diff's natural output.
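The marking, diffing and cleanup steps of this answer could also be wrapped in functions. This is only a sketch under several assumptions: the function names are hypothetical, the marking is done line by line with plain sed instead of sed -z (so a final line without a trailing newline also receives a marker), and the two post-processing sed -z commands are swapped for an awk filter that should behave the same way on output like the one shown above.

```shell
# U+FFFD encoded as UTF-8 (bytes EF BF BD), written as octal escapes.
MARKER=$(printf '\357\277\275')

# Step 1: append the marker to the end of every line,
# i.e. directly before each newline.
mark_newlines() {
  sed "s/\$/${MARKER}/"
}

# Step 2 (post-processing): turn "+<marker>" / "-<marker>" lines into a
# literal "+\n" / "-\n", drop the "~" that follows them, and strip the
# marker from the end of every other line.
unmark_diff() {
  awk -v m="$MARKER" '
    skip && $0 == "~" { skip = 0; next }
    { skip = 0 }
    $0 == "+" m || $0 == "-" m {
      print substr($0, 1, 1) "\\n"
      skip = 1
      next
    }
    {
      n = length(m)
      if (length($0) >= n && substr($0, length($0) - n + 1) == m)
        $0 = substr($0, 1, length($0) - n)
      print
    }
  '
}

# Runs in a subshell so the temporary directory is removed on exit (step 3).
newline_word_diff() (
  tmpdir=$(mktemp -d) || exit 1
  trap 'rm -rf "$tmpdir"' EXIT
  mark_newlines < "$1" > "$tmpdir/original.txt"
  mark_newlines < "$2" > "$tmpdir/changed.txt"
  git diff --word-diff=porcelain --word-diff-regex='\[[^]]*\]?|.' \
      --no-index "$tmpdir/original.txt" "$tmpdir/changed.txt" | unmark_diff
)
```

It would be called as `newline_word_diff original.txt changed.txt`. The header lines of the diff will show the temporary paths rather than the original file names.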


For additional research, word boundaries are computed in git's code at [3], which is called at [4] where the \n delimiter is hardcoded.