I have a program in Golang that does some string manipulation on text from a file. It reads in the file, does the manipulations, and then tries to set up the text it displays in the terminal with a red background for any text that was removed and a green background for any text that was added. This works fine until the input text has Unicode characters in it.
For example, characters like …, –, and ◇ are occasionally present in the text that gets compared and formatted for display in the terminal. I get â\u0080¦, â\u0080\u0093, and â\u0097\u0087 respectively when I run the text through the following logic:
package stringdiff

import (
	"bytes"
	"strings"

	"github.com/andreyvit/diff"
	"github.com/fatih/color"
)

var (
	red   = color.New(color.BgRed, color.FgBlack).SprintFunc()
	green = color.New(color.BgGreen, color.FgBlack).SprintFunc()
)

// GetPrettyDiffString gets the diff string of the 2 passed-in values, where removals have a red background and additions have a green background.
func GetPrettyDiffString(original, new string) string {
	diffString := diff.CharacterDiff(original, new)
	var buff bytes.Buffer
	var diffsLen = len(diffString)
	var char, nextChar, nextNextChar, section string
	var inSection bool
	// diff.CharacterDiff wraps removals in "(~~ ... ~~)" and additions in "(++ ... ++)",
	// so walk the string and colour the wrapped sections accordingly.
	for i := 0; i < len(diffString); {
		char = string(diffString[i])
		if char == "(" && i+2 < diffsLen && !inSection {
			// possible opening marker of an addition or removal section
			nextChar = string(diffString[i+1])
			nextNextChar = string(diffString[i+2])
			if nextChar == "+" && nextNextChar == "+" {
				inSection = true
				i += 3
				continue
			} else if nextChar == "~" && nextNextChar == "~" {
				inSection = true
				i += 3
				continue
			}
		} else if char == "~" && i+2 < diffsLen && string(diffString[i+1]) == "~" && string(diffString[i+2]) == ")" {
			// end of a removal section
			inSection = false
			buff.WriteString(red(section))
			section = ""
			i += 3
			continue
		} else if char == "+" && i+2 < diffsLen && string(diffString[i+1]) == "+" && string(diffString[i+2]) == ")" {
			// end of an addition section
			inSection = false
			buff.WriteString(green(section))
			section = ""
			i += 3
			continue
		}
		if inSection {
			section += char
		} else {
			buff.WriteString(char)
		}
		i++
	}
	return convertUnicodeStringsToVisualRepresentations(buff.String())
}

func convertUnicodeStringsToVisualRepresentations(val string) string {
	val = strings.ReplaceAll(val, "â\u0080¦", "…")
	val = strings.ReplaceAll(val, "â\u0080\u0093", "–")
	val = strings.ReplaceAll(val, "â\u0097\u0087", "◇")
	return val
}
I have had to add convertUnicodeStringsToVisualRepresentations to handle the Unicode characters that I commonly encounter. I suspect I am hitting an issue where Golang expects the text to be valid UTF-8, so the characters get mangled somewhere on the way into the bytes buffer, but I am not 100% certain of that.
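To illustrate what I mean (this is just a standalone snippet, not part of my actual code): indexing a Go string gives raw bytes, and converting a single byte of a multi-byte character to a string produces exactly the kind of output I am seeing.
package main

import "fmt"

func main() {
	s := "…" // U+2026, stored as the three bytes 0xE2 0x80 0xA6 in UTF-8
	var out string
	for i := 0; i < len(s); i++ {
		// s[i] is a single byte; string(s[i]) treats that byte value as a rune,
		// so 0xE2 becomes "â" instead of staying the first byte of "…"
		out += string(s[i])
	}
	fmt.Printf("%q\n", out) // prints "â\u0080¦"
}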
Is there a good way to fix this issue?
I have tests present for this logic here if you would like to see what some input looks like for the function.
Please let me know if there is a better way to ask this or if you need more information. Thanks for the help!
Edit: just for simplicity's sake, I thought I would add the solution that worked for me here, since this issue is being listed as a duplicate.
I looked at the associated question and found that none of the suggested answers worked for me, but one of the comments, from @ANisus, did. It referenced the following Go playground: https://play.golang.org/p/dBrx_ZmrsMN
The logic from that Go playground that helped me was:
// repairLatin1 undoes an accidental Latin-1 decode: each rune in the mangled
// string corresponds to a single byte of the original UTF-8 text, so the rune
// values are packed back into bytes. (Needs the "fmt" import for the error.)
func repairLatin1(s string) (string, error) {
	buf := make([]byte, 0, len(s))
	for i, r := range s {
		if r > 255 {
			return "", fmt.Errorf("character %s at index %d is not part of latin1", string(r), i)
		}
		buf = append(buf, byte(r))
	}
	return string(buf), nil
}
I can't say I like adding an error return to the logic I am using, but it passes my tests. I still need to confirm it works in a real scenario, but according to the unit tests it does the trick.
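For anyone who wants to see it in action, here is a minimal example (assuming the repairLatin1 function above is in the same package; the sample string is just the mangled output from my earlier illustration, not real program output):
package main

import "fmt"

func main() {
	mangled := "â\u0080¦" // what the byte-by-byte loop currently produces for "…"
	fixed, err := repairLatin1(mangled)
	if err != nil {
		fmt.Println("repair failed:", err)
		return
	}
	fmt.Println(fixed) // prints …
}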