I THINK "Latin characters" is what I mean in my question, but I'm not entirely sure what the correct classification is. I'm trying to use a regex Pattern to test whether a string contains non-Latin characters. I'm expecting the following results:
"abcDE 123"; // Yes, this should match
"!@#$%^&*"; // Yes, this should match
"aaàààäää"; // Yes, this should match
"ベビードラ"; // No, this shouldn't match
""; // No, this shouldn't match
My understanding is that the built-in {IsLatin} preset simply detects if any of the characters are Latin. I want to detect if any characters are not Latin.
Pattern LatinPattern = Pattern.compile("\\p{IsLatin}");
Matcher matcher = LatinPattern.matcher(str);
if (!matcher.find()) {
    System.out.println("is NON latin");
    return;
}
System.out.println("is latin");
TL;DR: Use regex
^[\p{Print}\p{IsLatin}]*$
You want a regex that matches if the string consists only of printable ASCII characters (letters, digits, punctuation, and spaces) and characters in the Latin script.
The easiest way is to combine \p{IsLatin} with \p{Print}, where Pattern defines \p{Print} as:
\p{Print} - A printable character: [\p{Graph}\x20]
\p{Graph} - A visible character: [\p{Alnum}\p{Punct}]
\p{Alnum} - An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Alpha} - An alphabetic character: [\p{Lower}\p{Upper}]
\p{Lower} - A lower-case alphabetic character: [a-z]
\p{Upper} - An upper-case alphabetic character: [A-Z]
\p{Digit} - A decimal digit: [0-9]
\p{Punct} - Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\x20 - A space
Which makes \p{Print} the same as [\p{ASCII}&&\P{Cntrl}], i.e. ASCII characters that are not control characters.
The \p{Alpha} part overlaps with \p{IsLatin}, but that's fine, since the character class eliminates duplicates.
So, regex is:
^[\p{Print}\p{IsLatin}]*$
Test
Output
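A minimal sketch of how the regex could be exercised against the inputs from the question. The class name LatinCheck and the helper names are made up for illustration. The non-ASCII samples are written as Unicode escapes, assuming the accented letters are precomposed code points. The second pattern, which uses + instead of *, is an added variant for the case where an empty string should be rejected.

```java
import java.util.regex.Pattern;

// Hypothetical class and method names, chosen for illustration only.
class LatinCheck {
    // Printable ASCII plus any character in the Unicode Latin script.
    static final Pattern LATIN_ONLY =
            Pattern.compile("^[\\p{Print}\\p{IsLatin}]*$");

    // Same character class, but '+' requires at least one character.
    static final Pattern LATIN_ONLY_NON_EMPTY =
            Pattern.compile("^[\\p{Print}\\p{IsLatin}]+$");

    // True if every character is printable ASCII or Latin script
    // (the empty string is accepted because of '*').
    static boolean isLatinOnly(String s) {
        return LATIN_ONLY.matcher(s).matches();
    }

    // As above, but the empty string is rejected.
    static boolean isLatinOnlyNonEmpty(String s) {
        return LATIN_ONLY_NON_EMPTY.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "abcDE 123",
            "!@#$%^&*",
            // Accented sample from the question, as precomposed code points.
            "aa\u00E0\u00E0\u00E0\u00E4\u00E4\u00E4",
            // Katakana sample from the question.
            "\u30D9\u30D3\u30FC\u30C9\u30E9",
            ""
        };
        for (String input : inputs) {
            String verdict = isLatinOnly(input) ? "is latin" : "is NON latin";
            System.out.println("\"" + input + "\": " + verdict);
        }
    }
}
```

With these definitions, isLatinOnly accepts the first three inputs and the empty string, and rejects the katakana string. Because * also allows zero characters, isLatinOnlyNonEmpty is the variant to reach for if an empty input should count as a non-match.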