Read a text file encoded as UCS-2 little endian using Go


I have a Go program to read a text file similar to the code below:

package main

import (
    "bufio"
    "log"
    "os"
)

func main() {
    file, err := os.Open("test.txt")

    if err != nil {
        log.Fatalf("failed opening file: %s", err)
    }

    scanner := bufio.NewScanner(file)
    scanner.Split(bufio.ScanLines)
    var txtlines []string

    for scanner.Scan() {
        txtlines = append(txtlines, scanner.Text())
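    }
    // Scan returns false on both EOF and error, so check which one it was.
    if err := scanner.Err(); err != nil {
        log.Fatalf("failed reading file: %s", err)
    }

    file.Close()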
}

Playground: https://play.golang.org/p/cnDOEFaT0lr

The code works fine for all text files except those encoded as UCS-2 little endian. How can I convert such a file to UTF-8 so that I can read it?

Answer by peterSO:

I have a Go program to read a text file. How can I convert the [UCS-2 little endian] file to UTF-8 format to read it?


Unicode

FAQ: UTF-8, UTF-16, UTF-32 & BOM

Q: What is the difference between UCS-2 and UTF-16?

A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.

UCS-2 does not describe a data format distinct from UTF-16, because both use exactly the same 16-bit code unit representations. However, UCS-2 does not interpret surrogate code points, and thus cannot be used to conformantly represent supplementary characters.

Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters.

UCS-2 is a proper subset of UTF-16.
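
To illustrate the FAQ passage above, here is a minimal sketch that uses only the standard library's unicode/utf16 package. The helper name codeUnits is made up for this example. It prints the 16-bit code units for one character in the Basic Multilingual Plane and for one supplementary character; the first is what a UCS-2 file would also contain, the second is the surrogate pair that only a UTF-16 reader interprets as a single character.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// codeUnits is a hypothetical helper: it returns the UTF-16 code units for s.
func codeUnits(s string) []uint16 {
	return utf16.Encode([]rune(s))
}

func main() {
	// U+00E9 fits in one 16-bit code unit, identical in UCS-2 and UTF-16.
	fmt.Printf("%04X\n", codeUnits("\u00e9")) // prints [00E9]

	// U+1F600 lies outside the BMP, so UTF-16 encodes it as a surrogate pair.
	fmt.Printf("%04X\n", codeUnits("\U0001F600")) // prints [D83D DE00]
}
```

Because every valid UCS-2 text is also valid UTF-16, decoding a UCS-2 file with a UTF-16 decoder is safe.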


For example,

package main

import (
    "bufio"
    "fmt"
    "os"

    "golang.org/x/text/encoding/unicode"
)

func main() {
    // "Language Learning and Teaching" written in 16 or more languages: UCS-2
    // http://www.humancomp.org/unichtm/unilang.htm
    f, err := os.Open("unilang.htm")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer f.Close()

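    // UCS-2 LE uses the same 16-bit code units as UTF-16 LE, so decode it as UTF-16 LE.
    // dec.Reader wraps f and yields UTF-8, which bufio.Scanner can split into lines.
    dec := unicode.UTF16(unicode.LittleEndian, unicode.IgnoreBOM).NewDecoder()
    scn := bufio.NewScanner(dec.Reader(f))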
    for scn.Scan() {
        fmt.Println(scn.Text())
    }
    if err := scn.Err(); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}

Playground: https://play.golang.org/p/3VombFxUNb1
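
If adding the golang.org/x/text dependency is not an option, the same conversion can be sketched with the standard library alone. This is an illustration under assumptions, not a replacement for the streaming decoder above: the helper decodeUTF16LE is a made-up name, it needs the whole file in memory, and the byte slice in main stands in for the contents of test.txt. Files saved as "UCS-2 LE" often begin with a byte order mark (bytes FF FE), so the sketch drops a leading U+FEFF explicitly.

```go
package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"strings"
	"unicode/utf16"
)

// decodeUTF16LE is a hypothetical helper: it converts UTF-16 little-endian
// bytes to a Go (UTF-8) string and drops a leading byte order mark, if any.
func decodeUTF16LE(b []byte) (string, error) {
	if len(b)%2 != 0 {
		return "", fmt.Errorf("odd byte count %d: not valid UTF-16", len(b))
	}
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i < len(b); i += 2 {
		units = append(units, binary.LittleEndian.Uint16(b[i:]))
	}
	if len(units) > 0 && units[0] == 0xFEFF {
		units = units[1:] // skip the BOM
	}
	return string(utf16.Decode(units)), nil
}

func main() {
	// Stand-in for the file contents: BOM, then "Hi\r\nGo" in UTF-16 LE.
	data := []byte{0xFF, 0xFE, 'H', 0, 'i', 0, '\r', 0, '\n', 0, 'G', 0, 'o', 0}

	text, err := decodeUTF16LE(data)
	if err != nil {
		fmt.Println(err)
		return
	}

	// The decoded text is UTF-8, so the question's line scanner works unchanged.
	scanner := bufio.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		fmt.Printf("%q\n", scanner.Text()) // prints "Hi", then "Go"
	}
}
```

For large files the streaming decoder from golang.org/x/text is the better fit, since it never holds the whole file in memory. That package also documents a unicode.UseBOM policy for input that may start with a byte order mark.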