Why is Reader in Java reading characters wrong?


I have been trying to read different characters from a file with my own extension (.xs), but it doesn't seem to work. For example, it reads the character '¸' as 65533, although its character code is 184. Is it a problem with my code or with the encoding (I am programming in IntelliJ)? Here is my code:

import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

public class Main {
    public static void main(String[] args) throws IOException {
        Reader reading = new FileReader("ala.xs");

        int character;
        while ((character = reading.read()) != -1) {
            char ch = (char) character;
            System.out.println((int)ch);
        }

        reading.close();
    }
}

Here is the file ala.xs: "š ° otmkla¸HR8"

Here is the output of my program: "65533 32 65533 32 111 116 109 107 108 97 65533 72 82 56"

I tried changing the encoding, but it doesn't seem to work, and I am honestly losing hope. Is this error because the reader is reading wrong, or is it my mistake?

g00se On BEST ANSWER

Here is the file ala.xs:"š ° otmkla¸HR8"

How did you produce it and write it? 65533 (U+FFFD) is the Unicode replacement character, and it appears several times in your output, which indicates an encoding problem. FileReader, as you use it, assumes the platform's default encoding, so you should specify the encoding explicitly when you read the file: use an InputStreamReader with UTF-8.

import java.io.InputStreamReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class Main {
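    public static void main(String[] args) throws IOException {
        // Decode the file explicitly as UTF-8 instead of relying on the platform default;
        // try-with-resources closes the reader even if read() throws.
        try (Reader reading = new InputStreamReader(new FileInputStream("ala.xs"), StandardCharsets.UTF_8)) {
            int character;
            // read() returns one UTF-16 code unit per call, or -1 at end of file
            while ((character = reading.read()) != -1) {
                System.out.println(character);
            }
        }
    }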
}

Showing correct file and execution:

goose@t410:/tmp$ cat ala.xs;echo 
š ° otmkla¸HR8
goose@t410:/tmp$ java Main
353
32
176
32
111
116
109
107
108
97
184
72
82
56
goose@t410:/tmp$ 

Obviously, make sure the file is actually saved as UTF-8 in the first place.
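To illustrate why the saved encoding matters, here is a minimal sketch that decodes the same text as UTF-8 from two different byte representations. It assumes, as a guess based on the question's output, that the original file was saved in a single-byte encoding where '¸' is the byte 0xB8; ISO-8859-1 stands in for that encoding here. The class name `ReplacementDemo` and the helper `codes` are made up for the example.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    // Decodes the bytes as UTF-8 and returns the numeric code of each
    // resulting char, separated by spaces.
    static String codes(byte[] bytes) {
        // This constructor replaces malformed input with U+FFFD (65533)
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < decoded.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append((int) decoded.charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00b8HR8"; // "¸HR8", written with an escape to avoid source-encoding issues

        // Assumption: a single-byte encoding, where '¸' is stored as the lone byte 0xB8
        byte[] singleByte = text.getBytes(StandardCharsets.ISO_8859_1);
        // The same text saved as UTF-8, where '¸' is stored as the two bytes 0xC2 0xB8
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(codes(singleByte)); // prints "65533 72 82 56"
        System.out.println(codes(utf8));       // prints "184 72 82 56"
    }
}
```

A lone 0xB8 byte is not a valid UTF-8 sequence, so the decoder substitutes the replacement character, which matches the 65533 values in the question.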

Michael Gantman On

Here is a tool you can use for diagnostics:

String testStr1 = "š ° otmkla¸HR8";
String encoded1 = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(testStr1);
String restored = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(encoded1);
System.out.println(testStr1 + "\n" + encoded1 + "\n" + restored);

And the output I got from that code:

š ° otmkla¸HR8
\u0161\u0020\u00b0\u0020\u006f\u0074\u006d\u006b\u006c\u0061\u00b8\u0048\u0052\u0038
š ° otmkla¸HR8

Note that the fourth symbol from the end has the code 00b8, which is hexadecimal for 184, so your data is intact. The answer from @g00se is correct: you need to use an explicit encoding. If you would like to use this utility, it comes with the open-source MgntUtils library (written and maintained by me). Here is the StringUnicodeEncoderDecoder Javadoc. The library can be obtained as a Maven artifact or from GitHub (including source code and Javadoc).
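If adding a library is not an option, a similar diagnostic can be sketched with the JDK alone. This is not how MgntUtils implements it; the class `UnicodeEscapes` and the method `toUnicodeSequence` are hypothetical names for this example, and the sketch only covers the encoding direction.

```java
public class UnicodeEscapes {
    // Renders every UTF-16 code unit of the string as a \\uXXXX escape.
    static String toUnicodeSequence(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "š ° otmkla¸HR8", written with escapes to avoid source-encoding issues
        String text = "\u0161 \u00b0 otmkla\u00b8HR8";
        System.out.println(toUnicodeSequence(text));
    }
}
```

Applied to a string read from the file, any `\ufffd` in the result points to the position where decoding failed.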