Why is the maximum string literal length in Java 65534?


I compiled this code with javac 21.0.1:

public static final String MAX = "AAAAA ...";

If I repeat 'A' 65534 times in the literal, compilation succeeds.
But if I repeat 'A' 65535 times in the literal, I get the compile error 'constant string too long'.
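
Typing such a literal by hand is impractical, so here is a minimal sketch of how the two test files could be generated. The class name LongLiteralGenerator, the method sourceWithLiteral, and the output file names Ok.java and TooLong.java are made up for illustration and are not part of the original setup.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class LongLiteralGenerator {
    // Builds the source of a class whose only field is a string literal
    // consisting of n 'A' characters (hypothetical helper for this experiment).
    static String sourceWithLiteral(String className, int n) {
        return "public class " + className + " {\n"
             + "    public static final String MAX = \"" + "A".repeat(n) + "\";\n"
             + "}\n";
    }

    public static void main(String[] args) throws IOException {
        // 65534 characters: the case reported above as compiling fine.
        Files.writeString(Path.of("Ok.java"), sourceWithLiteral("Ok", 65534));
        // 65535 characters: the case reported above as "constant string too long".
        Files.writeString(Path.of("TooLong.java"), sourceWithLiteral("TooLong", 65535));
    }
}
```

Running it and then compiling Ok.java and TooLong.java with javac should reproduce the two outcomes described above.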

Why is the length limit 65534 instead of 65535?

Specification

CONSTANT_Utf8_info {
    u1 tag;
    u2 length;
    u1 bytes[length];
}

length: The value of the length item gives the number of bytes in the bytes array (not the length of the resulting string).

Chapter 4. The class File Format

The maximum value of a u2 is 0xFFFF = 65535, not 65534.
In UTF-8, "A" is 1 byte, so shouldn't the string length limit be 65535?

Javac source code

jdk/langtools/src/share/classes/com/sun/tools/javac/jvm/Gen.java · openjdk/jdk

private void checkStringConstant(DiagnosticPosition pos, Object constValue) {
    if (nerrs != 0 || // only complain about a long string once
        constValue == null ||
        !(constValue instanceof String str) ||
        str.length() < PoolWriter.MAX_STRING_LENGTH)
        return;
    log.error(pos, Errors.LimitString);
    nerrs++;
}

Is the correct code str.length() <= PoolWriter.MAX_STRING_LENGTH?
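
To make the boundary concrete, here is a small standalone model of that condition. It is not javac code: the class LengthCheckSketch is made up, and the constant is assumed to have the same value as PoolWriter.MAX_STRING_LENGTH, which I take to be 0xFFFF.

```java
class LengthCheckSketch {
    // Assumption: same value as PoolWriter.MAX_STRING_LENGTH in javac (0xFFFF).
    static final int MAX_STRING_LENGTH = 0xFFFF;

    // Models only the length comparison from checkStringConstant:
    // true means the method returns early without reporting an error.
    static boolean passesLengthCheck(String s) {
        return s.length() < MAX_STRING_LENGTH;
    }

    public static void main(String[] args) {
        System.out.println(passesLengthCheck("A".repeat(65534))); // true
        System.out.println(passesLengthCheck("A".repeat(65535))); // false
    }
}
```

With the strict <, a string of exactly 65535 characters is rejected, which matches the behavior I observe.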


There are 2 best solutions below

1
aled On

The comparison has been there since at least 2007, so it may be a bug that no one reported because no one uses string literals that big. I'm guessing the question is asked out of curiosity, because I don't think anyone really needs that extra byte.

2
Holger On

Is the correct code str.length() <= PoolWriter.MAX_STRING_LENGTH?

No, whether you use < or <= here makes no difference for the correctness of this check.

The cited format and its limitation apply to the number of encoded bytes in the modified UTF-8 format, not the length of the actual (UTF-16) string.

The number of bytes happens to be equal to the string length if you use only ASCII characters (excluding NUL), like your As. But when you replace the 65534th A with, e.g., an Ä, it will still pass the “maximum string length” check and create a constant pool entry with 65535 bytes.
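
The difference between the two lengths can be checked with a short sketch. The class ModifiedUtf8Length and its encodedLength method are made up for this illustration (this is not javac's code). It applies the modified UTF-8 rules per UTF-16 code unit: 1 byte for U+0001 to U+007F, 2 bytes for U+0000 and U+0080 to U+07FF, 3 bytes for everything else.

```java
class ModifiedUtf8Length {
    // Counts the bytes a string needs in modified UTF-8, the encoding
    // used by CONSTANT_Utf8_info. Works on UTF-16 code units, so each
    // half of a surrogate pair is counted as 3 bytes.
    static int encodedLength(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                bytes += 1;          // ASCII except NUL
            } else if (c <= 0x07FF) {
                bytes += 2;          // NUL and U+0080..U+07FF, e.g. Ä (U+00C4)
            } else {
                bytes += 3;          // U+0800..U+FFFF, including surrogates
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String ascii = "A".repeat(65534);
        String oneUmlaut = "A".repeat(65533) + "\u00C4";

        // Both strings have 65534 chars, but different encoded sizes.
        System.out.println(ascii.length() + " chars, " + encodedLength(ascii) + " bytes");         // 65534 chars, 65534 bytes
        System.out.println(oneUmlaut.length() + " chars, " + encodedLength(oneUmlaut) + " bytes"); // 65534 chars, 65535 bytes
    }
}
```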

Now, if you replace another A with Ä, it will still pass the string length check but a new error will appear:

error: UTF8 representation for string "AAAAAAAAAAAAAAAAAAAA..." is too long for the constant pool

which demonstrates that a check for “the real thing” is in place, which makes the preceding string length check entirely obsolete.
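
The same byte-based limit can be observed outside the compiler. As a rough analogue (not javac's own code path), java.io.DataOutputStream.writeUTF writes a string in modified UTF-8 behind a two-byte length prefix and throws UTFDataFormatException when the encoded form exceeds 65535 bytes. The class WriteUtfLimit and the helper fitsInU2Length below are made up for this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;

class WriteUtfLimit {
    // Returns true if the string's modified UTF-8 form fits a u2 length,
    // using writeUTF as the judge instead of javac.
    static boolean fitsInU2Length(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            return false; // encoded form is longer than 65535 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
    }

    public static void main(String[] args) {
        String oneUmlaut = "A".repeat(65533) + "\u00C4";      // 65534 chars, 65535 bytes
        String twoUmlauts = "A".repeat(65532) + "\u00C4\u00C4"; // 65534 chars, 65536 bytes

        System.out.println(fitsInU2Length(oneUmlaut));  // true
        System.out.println(fitsInU2Length(twoUmlauts)); // false
    }
}
```

Both strings have the same String.length(), so only a byte-based check can tell them apart. By the same measure, a plain ASCII string of 65535 characters fits the format, even though javac's length check rejects it.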