Here's an excerpt from java.text.CharacterIterator documentation:
This interface defines a protocol for bidirectional iteration over text. The iterator iterates over a bounded sequence of characters. [...] The methods previous() and next() are used for iteration. They return DONE if [...], signaling that the iterator has reached the end of the sequence.
static final char DONE: Constant that is returned when the iterator has reached either the end or the beginning of the text. The value is \uFFFF, the "not a character" value which should not occur in any valid Unicode string.
The italicized part (that \uFFFF "should not occur in any valid Unicode string") is what I'm having trouble understanding. From my tests, a Java String can most certainly contain \uFFFF, and there doesn't seem to be any problem with it, except with the prescribed CharacterIterator traversal idiom, which breaks because of a false positive (e.g. next() returns '\uFFFF' == DONE when it's not really "done").
Here's a snippet to illustrate the "problem" (see also on ideone.com):
    import java.text.*;

    public class CharacterIteratorTest {
        // this is the prescribed traversal idiom from the documentation
        public static void traverseForward(CharacterIterator iter) {
            for (char c = iter.first(); c != CharacterIterator.DONE; c = iter.next()) {
                System.out.print(c);
            }
        }

        public static void main(String[] args) {
            String s = "abc\uFFFFdef";
            System.out.println(s);
            // abc?def
            System.out.println(s.indexOf('\uFFFF'));
            // 3
            traverseForward(new StringCharacterIterator(s));
            // abc
        }
    }
So what is going on here?
- Is the prescribed traversal idiom "broken" because it makes the wrong assumption about \uFFFF?
- Is the StringCharacterIterator implementation "broken" because it doesn't e.g. throw an IllegalArgumentException if in fact \uFFFF is forbidden in valid Unicode strings?
- Is it actually true that valid Unicode strings should not contain \uFFFF?
- If that's true, then is Java "broken" for violating the Unicode specification by (for the most part) allowing String to contain \uFFFF anyway?
EDIT (2013-12-17): Peter O. brings up an excellent point below, which renders this answer wrong. Old answer below, for historical accuracy.
Answering your questions:
Is the prescribed traversal idiom "broken" because it makes the wrong assumption about \uFFFF?
No. U+FFFF is a so-called non-character. From Section 16.7 of the Unicode Standard:
Is the StringCharacterIterator implementation "broken" because it doesn't e.g. throw an IllegalArgumentException if in fact \uFFFF is forbidden in valid Unicode strings?
Not quite. Applications are allowed to use those code points internally in any way they want. Quoting the standard again:
So while you should never encounter such a string from the user, another application, or a file, you may well put it into a Java String if you know what you're doing (though this basically means that you cannot use the CharacterIterator traversal idiom on that string).
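If such a string does have to be traversed, one possible workaround is to stop on the iterator's index rather than on the returned char. The sketch below is an assumption on my part, not an idiom taken from the documentation; the class and method names are hypothetical, and it relies only on getIndex() and getEndIndex() from CharacterIterator.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

class IndexBoundedTraversal {
    // Hypothetical helper: collects every char by comparing the iterator's
    // index with its end index instead of comparing the char with DONE,
    // so an embedded '\uFFFF' does not end the loop early.
    static String collectForward(CharacterIterator iter) {
        StringBuilder out = new StringBuilder();
        for (char c = iter.first(); iter.getIndex() < iter.getEndIndex(); c = iter.next()) {
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "abc\uFFFFdef";
        String collected = collectForward(new StringCharacterIterator(s));
        System.out.println(collected.length());   // 7
        System.out.println(collected.equals(s));  // true
    }
}
```

The loop still ends correctly on an empty string, because first() leaves the index equal to the end index.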
Is it actually true that valid Unicode strings should not contain \uFFFF?
As quoted above, any string used for interchange must not contain them. Within your application you're free to use them in whatever way you want.
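If you want to enforce that before text leaves your application, a check along these lines might do. It is only a sketch: the class and method names are hypothetical, and it assumes the usual definition of the 66 noncharacters (U+FDD0..U+FDEF plus the last two code points of each plane).

```java
class NoncharacterCheck {
    // True for U+FDD0..U+FDEF and for any code point whose low 16 bits
    // are FFFE or FFFF (the last two code points of every plane).
    static boolean isNoncharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
                || (codePoint & 0xFFFE) == 0xFFFE;
    }

    // Scans by code point so supplementary-plane noncharacters are seen too.
    static boolean containsNoncharacter(String s) {
        return s.codePoints().anyMatch(NoncharacterCheck::isNoncharacter);
    }

    public static void main(String[] args) {
        System.out.println(containsNoncharacter("abcdef"));       // false
        System.out.println(containsNoncharacter("abc\uFFFFdef")); // true
    }
}
```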
Of course, a Java char, being just a 16-bit unsigned integer, doesn't really care about the value it holds either.
If that's true, then is Java "broken" for violating the Unicode specification by (for the most part) allowing String to contain \uFFFF anyway?
No. In fact, the section on noncharacters even suggests the use of U+FFFF as sentinel value:
CharacterIterator follows this in that it returns U+FFFF when no more characters are available. Of course, this means that if you have another use for that code point in your application you may consider using a different non-character for that purpose since U+FFFF is already taken – at least if you're using CharacterIterator.
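To see the sentinel in action, here is a small sketch (hypothetical class and method names) showing that StringCharacterIterator returns DONE when stepping past either end of the text, and from first() on an empty string.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

class DoneSentinelDemo {
    // Moves to the last char, then steps once more past the end.
    static boolean nextPastEndIsDone(String s) {
        CharacterIterator it = new StringCharacterIterator(s);
        it.last();
        return it.next() == CharacterIterator.DONE;
    }

    // Moves to the first char, then steps once before the beginning.
    static boolean previousBeforeStartIsDone(String s) {
        CharacterIterator it = new StringCharacterIterator(s);
        it.first();
        return it.previous() == CharacterIterator.DONE;
    }

    public static void main(String[] args) {
        System.out.println(CharacterIterator.DONE == '\uFFFF');  // true
        System.out.println(nextPastEndIsDone("ab"));             // true
        System.out.println(previousBeforeStartIsDone("ab"));     // true
        // An empty text has no current char, so first() reports DONE as well.
        System.out.println(new StringCharacterIterator("").first() == CharacterIterator.DONE); // true
    }
}
```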