I use git's word diffing to find changes between texts on a per-character basis:
git diff --word-diff=porcelain --word-diff-regex='\[[^]]*\]?|.' --no-index original.txt changed.txt
(If you're wondering, the custom regex I use ensures that characters within brackets are never broken up – credit to jthill.)
The resulting diff does not indicate deletions or additions of newlines (neither with nor without my custom regex). And when I replace a newline with, say, a space, it only indicates the addition of the space, not the deletion of the newline.
Given the following original
foo

bar

baz
and the following changed text (I removed one line break in the top half and added one in the bottom half)
foo
bar


baz
I get this porcelain-style diff, where ~ represents newlines:
@@ -1,5 +1,5 @@
foo
~
~
bar
~
~
~
baz
~
But I want the following diff:
@@ <whatever> @@
foo
-\n
~
bar
~
~
+\n
baz
~
I have tried adding |\n to my regex, to no avail. (Btw git uses POSIX "extended" regular expressions.) The docs say that "[a] match that contains a newline is silently truncated(!) at the newline." I don't fully understand what this means but I suspect it could be the cause of the issue.
Is there any way to get git to produce the desired diff?
As things stand, Git does not allow newlines to be words [1]. I'm hoping there's a more elegant solution than this one, perhaps involving tweaks to the git settings. Regardless, here's a preprocessor-based solution:
Basically, it marks the newlines in both files with a Unicode placeholder character and then runs git diff on the two files.
Output:
Assumption: the Unicode placeholder character must not occur in the initial files, which makes the solution less elegant.
git diff chose the second "~" as the place for "-\n". That is git's natural choice, so the output should be left as it is.
Adding | sed -ze "s/\(\n *\)\+/\n/g" to the end of line 3 will remove the double white space in the middle, but again this would deviate from git diff's natural output.
For additional research: word boundaries are computed in git's code at [3], which is called at [4], where the \n delimiter is hardcoded.
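The preprocessing idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the exact script from this answer: the function names mark_newlines and newline_diff are hypothetical, and the placeholder is the ASCII unit separator (0x1F) rather than a Unicode character, chosen here so that the "." in the word regex matches it as a single character regardless of locale. As with any placeholder, it must not occur in either input file.

```shell
#!/bin/sh
# Sketch only: make newlines visible to git's word diff by tagging each
# line end with a placeholder character before diffing.

# Hypothetical placeholder: ASCII unit separator (0x1F).
# Assumption: it does not occur in either input file.
MARK=$(printf '\037')

# Append the placeholder to the end of every line, so that each newline is
# preceded by a character the word regex can match as its own word.
mark_newlines() {
    sed "s/\$/${MARK}/" "$1"
}

# Word-diff two files with line ends made visible. In the result, the
# placeholder is rewritten as a literal backslash followed by "n".
newline_diff() {
    tmpdir=$(mktemp -d) || return 1
    mark_newlines "$1" > "$tmpdir/original"
    mark_newlines "$2" > "$tmpdir/changed"
    # git diff --no-index exits with status 1 when the files differ;
    # that is expected here and not treated as an error.
    git diff --no-index --word-diff=porcelain \
        --word-diff-regex='\[[^]]*\]?|.' \
        "$tmpdir/original" "$tmpdir/changed" |
        sed "s/${MARK}/\\\\n/g"
    rm -rf "$tmpdir"
}
```

After sourcing these definitions, it would be called as newline_diff original.txt changed.txt. Every line end then shows up as a "\n" word in front of its "~", and added or removed line ends should carry a "+" or "-" prefix. Where exactly git places them among neighbouring "~" lines is up to git's diff heuristics, so the result may not match the desired output line for line.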