How to print characters from double byte character sets


Here is how to output every character of a single-byte character set, printable or not. The output file will contain Japanese characters such as チホヤツセ.

Encoding enc = Encoding.GetEncoding("shift_jis");
byte[] m_bytes = new byte[1];

// The using block closes and disposes the writer.
using (StreamWriter sw = new StreamWriter(@"C:\shift_jis.txt"))
{
    for (int i = 0; i < 256; i++)
    {
        // Decode each possible byte value on its own.
        m_bytes[0] = (byte)i;
        string output = enc.GetString(m_bytes);
        sw.WriteLine(output);
    }
}

Here is my attempt to do the same with a double-byte character set.

Encoding enc = Encoding.GetEncoding("iso-2022-jp");
byte[] m_bytes = new byte[2];

using (StreamWriter sw = new StreamWriter(@"C:\iso-2022-jp.txt"))
{
    for (int i = 0; i < 256; i++)
    {
        m_bytes[0] = (byte)i;

        for (int j = 0; j < 256; j++)
        {
            // Decode every two-byte combination.
            m_bytes[1] = (byte)j;
            string output = enc.GetString(m_bytes);
            sw.WriteLine(output);
        }
    }
}

The problem is that the output file still only contains the first 255 characters. Each byte is evaluated separately and yields the character for that byte alone, so the output string always contains two characters rather than one. Since characters in this character set are represented with two bytes, you have to specify them with two bytes, right?

So how do you iterate through and print all the characters of a double-byte character set?

3 Answers

Answer by xanatos (best answer)

If it is OK to have them in Unicode order, you could do this:

Encoding enc = (Encoding)Encoding.GetEncoding("iso-2022-jp").Clone();
enc.EncoderFallback = new EncoderReplacementFallback("");
char[] chars = new char[1];
byte[] bytes = new byte[16];

using (StreamWriter sw = new StreamWriter(@"C:\temp\iso-2022-jp.txt"))
{
    for (int i = 0; i <= char.MaxValue; i++)
    {
        chars[0] = (char)i;
        int count = enc.GetBytes(chars, 0, 1, bytes, 0);

        if (count != 0)
        {
            sw.WriteLine(chars[0]);
        }
    }
}

If you want them ordered by byte sequence, you could do this:

Encoding enc = (Encoding)Encoding.GetEncoding("iso-2022-jp").Clone();
enc.EncoderFallback = new EncoderReplacementFallback("");
char[] chars = new char[1];
byte[] bytes = new byte[16];

var lst = new List<Tuple<byte[], char>>();

for (int i = 0; i <= char.MaxValue; i++)
{
    chars[0] = (char)i;
    int count = enc.GetBytes(chars, 0, 1, bytes, 0);

    if (count != 0)
    {
        var bytes2 = new byte[count];
        Array.Copy(bytes, bytes2, count);
        lst.Add(Tuple.Create(bytes2, chars[0]));
    }
}

lst.Sort((x, y) =>
{
    int min = Math.Min(x.Item1.Length, y.Item1.Length);

    for (int i = 0; i < min; i++)
    {
        int cmp = x.Item1[i].CompareTo(y.Item1[i]);

        if (cmp != 0)
        {
            return cmp;
        }
    }

    return x.Item1.Length.CompareTo(y.Item1.Length);
});

using (StreamWriter sw = new StreamWriter(@"C:\temp\iso-2022-jp.txt"))
{
    foreach (var tuple in lst)
    {
        sw.WriteLine(tuple.Item2);

        // This will print the full byte sequence necessary to
        // generate the char (it needs "using System.Linq;" for Select).
        // Note that iso-2022-jp uses escape sequences to "activate"
        // subtables and to deactivate them.
        //sw.WriteLine("{0}: {1}", tuple.Item2, string.Join(",", tuple.Item1.Select(x => x.ToString("x2"))));
    }
}

Or, with a different sorting order (length first):

lst.Sort((x, y) =>
{
    int cmp2 = x.Item1.Length.CompareTo(y.Item1.Length);

    if (cmp2 != 0)
    {
        return cmp2;
    }

    int min = Math.Min(x.Item1.Length, y.Item1.Length);

    for (int i = 0; i < min; i++)
    {
        int cmp = x.Item1[i].CompareTo(y.Item1[i]);

        if (cmp != 0)
        {
            return cmp;
        }
    }

    return 0;
});

Note that all of these examples only generate characters from the Basic Multilingual Plane (BMP). I don't think characters outside the BMP are included in legacy encodings like this one. If necessary, I can modify the code to support them.

Just out of curiosity, here is the first version of the code extended to handle non-BMP characters (which aren't present in iso-2022-jp):

Encoding enc = (Encoding)Encoding.GetEncoding("iso-2022-jp").Clone();
enc.EncoderFallback = new EncoderReplacementFallback("");
byte[] bytes = new byte[16];

using (StreamWriter sw = new StreamWriter(@"C:\temp\iso-2022-jp.txt"))
{
    int max = -1;
    for (int i = 0; i <= 0x10FFFF; i++)
    {
        if (i >= 0xD800 && i <= 0xDFFF)
        {
            continue;
        }

        string chars = char.ConvertFromUtf32(i);

        int count = enc.GetBytes(chars, 0, chars.Length, bytes, 0);

        if (count != 0)
        {
            sw.WriteLine(chars);
            max = i;
        }
    }

    Console.WriteLine("maximum codepoint: {0}", max);
}
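
As an alternative to sorting the encoder output, byte-sequence order could also be produced by decoding instead of encoding. The sketch below wraps every two-byte combination in the range 0x21 to 0x7E with the escape sequences ESC $ B (switch to JIS X 0208) and ESC ( B (switch back to ASCII) and keeps whatever decodes to a non-empty string. It rests on assumptions: that the decoder reports unassigned cells through the DecoderFallback, that only the JIS X 0208 table is of interest, and that the output file name (hypothetical here) is acceptable.

```csharp
using System;
using System.IO;
using System.Text;

class Iso2022JpByByteOrder
{
    static void Main()
    {
        // Assumption: on .NET Core / .NET 5+ the System.Text.Encoding.CodePages
        // provider must be registered first; .NET Framework does not need this.
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Clone the encoding so that its fallback can be changed.
        Encoding enc = (Encoding)Encoding.GetEncoding("iso-2022-jp").Clone();

        // Assumption: byte pairs with no assigned character go through the
        // fallback, so an empty replacement makes them decode to "".
        enc.DecoderFallback = new DecoderReplacementFallback("");

        // Layout: ESC $ B, lead byte, trail byte, ESC ( B
        byte[] seq = { 0x1B, 0x24, 0x42, 0, 0, 0x1B, 0x28, 0x42 };

        // Hypothetical output path, written as UTF-8 (the StreamWriter default).
        using (StreamWriter sw = new StreamWriter("iso-2022-jp-by-bytes.txt"))
        {
            for (int lead = 0x21; lead <= 0x7E; lead++)
            {
                seq[3] = (byte)lead;

                for (int trail = 0x21; trail <= 0x7E; trail++)
                {
                    seq[4] = (byte)trail;
                    string s = enc.GetString(seq);

                    // Skip byte pairs that produced nothing.
                    if (s.Length > 0)
                    {
                        sw.WriteLine("{0:x2} {1:x2}: {2}", lead, trail, s);
                    }
                }
            }
        }
    }
}
```

Because the loops run over the lead byte and then the trail byte, the lines come out in byte order without a separate sort. Single-byte characters would need a second pass without the escape sequences.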

Answer by nepdev

This is an issue with the specific encoding you chose.

ISO-2022 encodings cannot simply be enumerated one byte value at a time in isolation; this is not Unicode. What a given sequence of bytes means is determined by the escape sequences that precede it in the byte stream.

From the Wikipedia article (ISO/IEC 2022):

To represent multiple character sets, the ISO/IEC 2022 character encodings include escape sequences which indicate the character set for characters which follow.
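
To make the quoted point concrete, here is a minimal sketch that dumps the bytes produced for a short string and then decodes the same two data bytes with and without the surrounding escape sequences. Per RFC 1468, a stream starts in ASCII, ESC $ B (1B 24 42) switches to JIS X 0208, and ESC ( B (1B 28 42) switches back to ASCII. The class and helper names are illustrative only, and the exact output depends on the runtime's implementation of the encoding.

```csharp
using System;
using System.Text;

class EscapeSequenceDemo
{
    static void Main()
    {
        // Assumption: on .NET Core / .NET 5+ the System.Text.Encoding.CodePages
        // provider must be registered first; .NET Framework does not need this.
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding enc = Encoding.GetEncoding("iso-2022-jp");

        // Encode one ASCII letter followed by the hiragana letter U+3042
        // and dump the resulting bytes in hex.
        byte[] encoded = enc.GetBytes("A\u3042");
        Console.WriteLine(BitConverter.ToString(encoded));

        // The same two data bytes, first bare and then wrapped in
        // ESC $ B ... ESC ( B.
        byte[] bare = { 0x24, 0x22 };
        byte[] wrapped = { 0x1B, 0x24, 0x42, 0x24, 0x22, 0x1B, 0x28, 0x42 };
        Console.WriteLine(Describe(enc.GetString(bare)));
        Console.WriteLine(Describe(enc.GetString(wrapped)));
    }

    // Formats each UTF-16 code unit as U+XXXX so the result does not
    // depend on the console's own encoding.
    static string Describe(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            sb.AppendFormat("U+{0:X4} ", (int)c);
        }
        return sb.ToString().TrimEnd();
    }
}
```

The bare pair is read in the initial ASCII state, one character per byte, which matches what the question observed: two characters out for every two bytes in.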

Answer by g.pickardou

You should use a writer configured with your encoding:

Encoding encoding = Encoding.GetEncoding("iso-2022-jp");
using (var stream = new FileStream(@"C:\iso-2022-jp.txt", FileMode.Create))
{
    using (StreamWriter writer = new StreamWriter(stream, encoding))
    {
        for (int i = 0; i <= char.MaxValue; i++)
        {
            // Each char goes on a separate line. Some will be only 1 byte, others
            // longer because of the leading escape sequence:
            writer.WriteLine(((char) i).ToString());
        }
    }
}