Golang Bytes.Buffer Converting Unicode to Weird Characters


I have a Go program that does some string manipulation on text read from a file. It reads the file, applies the manipulations, and then formats the text for display in the terminal, with a red background for any text that was removed and a green background for any text that was added. This works fine until the input text contains non-ASCII Unicode characters. For example, characters like …, –, and ◇ are occasionally present in the text that gets compared and formatted for the terminal. They come out as â\u0080¦, â\u0080\u0093, and â\u0097\u0087 respectively when I run the text through the following logic:

package stringdiff

import (
    "bytes"
    "strings"

    "github.com/andreyvit/diff"
    "github.com/fatih/color"
)

var (
    red   = color.New(color.BgRed, color.FgBlack).SprintFunc()
    green = color.New(color.BgGreen, color.FgBlack).SprintFunc()
)

// GetPrettyDiffString gets the diff string of the 2 passed in values where removals have a red background and additions have a green background
func GetPrettyDiffString(original, new string) string {
    diffString := diff.CharacterDiff(original, new)

    var buff bytes.Buffer
    var diffsLen = len(diffString)
    var char, nextChar, nextNextChar, section string
    var inSection bool
    for i := 0; i < len(diffString); {
        char = string(diffString[i])
        if char == "(" && i+2 < diffsLen && !inSection {
            nextChar = string(diffString[i+1])
            nextNextChar = string(diffString[i+2])
            if nextChar == "+" && nextNextChar == "+" {
                inSection = true

                i += 3
                continue
            } else if nextChar == "~" && nextNextChar == "~" {
                inSection = true

                i += 3
                continue
            }
        } else if char == "~" && i+2 < diffsLen && string(diffString[i+1]) == "~" && string(diffString[i+2]) == ")" {
            inSection = false
            buff.WriteString(red(section))
            section = ""

            i += 3
            continue
        } else if char == "+" && i+2 < diffsLen && string(diffString[i+1]) == "+" && string(diffString[i+2]) == ")" {
            inSection = false
            buff.WriteString(green(section))
            section = ""

            i += 3
            continue
        }

        if inSection {
            section += char
        } else {
            buff.WriteString(char)
        }

        i++
    }

    return convertUnicodeStringsToVisualRepresentations(buff.String())
}

func convertUnicodeStringsToVisualRepresentations(val string) string {
    val = strings.ReplaceAll(val, "â\u0080¦", "…")
    val = strings.ReplaceAll(val, "â\u0080\u0093", "–")
    val = strings.ReplaceAll(val, "â\u0097\u0087", "◇")

    return val
}

I had to add convertUnicodeStringsToVisualRepresentations to handle the Unicode characters that I commonly encounter. My guess is that Go expects the text to contain only UTF-8 and displays it improperly once it has been written to the bytes.Buffer, but I am not 100% certain of that.
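A small sketch that reproduces the symptom without the diff library may help narrow this down. It assumes the cause is the `string(diffString[i])` conversion in the loop rather than `bytes.Buffer` itself: indexing a Go string yields a single byte, and converting a byte to a string encodes it as if it were a code point. The names `byteWise` and `runeWise` are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// byteWise mimics the loop in the question: it indexes the string one byte
// at a time and converts each byte to a string. A byte value such as 0xE2 is
// then treated as the code point U+00E2 and re-encoded as two UTF-8 bytes.
func byteWise(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		b.WriteString(string(s[i]))
	}
	return b.String()
}

// runeWise iterates with range, which decodes whole runes, so multi-byte
// characters are written back unchanged.
func runeWise(s string) string {
	var b strings.Builder
	for _, r := range s {
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	in := "a…b"
	fmt.Printf("byte-wise: %q\n", byteWise(in))
	fmt.Printf("rune-wise: %q\n", runeWise(in))
}
```

If the byte-wise output matches the â\u0080¦ sequence above, the conversion inside the loop is the likely culprit, not the buffer.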

Is there a good way to fix this issue?

I have tests present for this logic here if you would like to see what some input looks like for the function.

Please let me know if there is a better way to ask this or if you need more information. Thanks for the help!

Edit: for simplicity's sake, I thought I would add the solution that worked for me here, since this question has been marked as a duplicate.

I looked at the associated question and found that none of the suggested answers worked for me. But then I looked at the answers in the comments and found that one from @ANisus worked. He suggested a solution in a comment that referenced the following golang playground: https://play.golang.org/p/dBrx_ZmrsMN

The logic from that Go playground that helped me was:

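// repairLatin1 turns each rune of s back into a single byte, which undoes
// text whose UTF-8 bytes were each widened into a separate code point.
// It returns an error if s contains a rune above U+00FF.
// Requires the "fmt" import.
func repairLatin1(s string) (string, error) {
    buf := make([]byte, 0, len(s))
    for i, r := range s {
        if r > 255 {
            return "", fmt.Errorf("character %s at index %d is not part of latin1", string(r), i)
        }
        buf = append(buf, byte(r))
    }
    return string(buf), nil
}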
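As a quick check of what this function does, the sketch below feeds it one of the damaged sequences from the question and one string that is already correct. `repairLatin1` is copied from the playground snippet; the `main` function and the sample inputs are only illustrative.

```go
package main

import "fmt"

// repairLatin1 is the function from the playground: every rune must fit in
// one byte, and those bytes together form the original UTF-8 text.
func repairLatin1(s string) (string, error) {
	buf := make([]byte, 0, len(s))
	for i, r := range s {
		if r > 255 {
			return "", fmt.Errorf("character %s at index %d is not part of latin1", string(r), i)
		}
		buf = append(buf, byte(r))
	}
	return string(buf), nil
}

func main() {
	// U+00E2 U+0080 U+00A6 is how the ellipsis appeared after the loop.
	fixed, err := repairLatin1("\u00e2\u0080\u00a6")
	fmt.Printf("%q %v\n", fixed, err)

	// A string that is already correct contains a rune above 255,
	// so the function reports an error instead of repairing it.
	_, err = repairLatin1("…")
	fmt.Println(err)
}
```

One consequence worth keeping in mind: a string that mixes damaged and intact non-Latin-1 characters is rejected as a whole, so the repair only fits output that was damaged uniformly.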

I can't say I like adding an error return to the logic I am using, but it does pass my unit tests. I still need to confirm that it works in a real scenario.
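An alternative that may avoid the repair step (and its error) altogether is to copy substrings of the diff output instead of converting single bytes. This is only a sketch under assumptions: it takes the damage to come from the byte-by-byte `string(diffString[i])` conversions, and it takes additions to be wrapped in `(++` and `++)` and removals in `(~~` and `~~)`, as the loop in the question expects. `highlight` and the bracket formatters are hypothetical stand-ins, so the example runs without the diff and color packages.

```go
package main

import (
	"fmt"
	"strings"
)

// highlight scans text that already contains diff markers and passes each
// marked section to the matching formatter. It slices only at ASCII marker
// positions, so multi-byte UTF-8 characters are never split.
func highlight(marked string, added, removed func(string) string) string {
	var out strings.Builder
	for len(marked) > 0 {
		var closer string
		var format func(string) string
		switch {
		case strings.HasPrefix(marked, "(++"):
			closer, format = "++)", added
		case strings.HasPrefix(marked, "(~~"):
			closer, format = "~~)", removed
		}
		if closer != "" {
			if end := strings.Index(marked[3:], closer); end >= 0 {
				out.WriteString(format(marked[3 : 3+end]))
				marked = marked[3+end+len(closer):]
				continue
			}
		}
		// Not a complete marker: copy text up to the next '(' unchanged.
		next := strings.IndexByte(marked[1:], '(')
		if next < 0 {
			out.WriteString(marked)
			break
		}
		out.WriteString(marked[:1+next])
		marked = marked[1+next:]
	}
	return out.String()
}

func main() {
	plus := func(s string) string { return "[+" + s + "]" }
	minus := func(s string) string { return "[-" + s + "]" }
	fmt.Println(highlight("a(++…++)b(~~–~~)c", plus, minus))
}
```

Unlike the original loop, this sketch closes a section only with the closer that matches its opener, and it leaves an unterminated marker in the output as plain text. To use it with the `red` and `green` values from the question, each would probably need a small wrapper so that it fits the `func(string) string` parameter.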
