Detect non Latin characters with regex Pattern in Java

Question

2.6k Views Asked by XtevensChannel At 07 January 2021 at 22:12

I think "Latin characters" is the right term for what I mean, but I'm not entirely sure of the correct classification. I'm trying to use a regex Pattern to test whether a string contains non-Latin characters. I expect the following results:

"abcDE 123";  // Yes, this should match
"!@#$%^&*";   // Yes, this should match
"aaàààäää";   // Yes, this should match
"ベビードラ";   // No, this shouldn't match
"";  // No, this shouldn't match

My understanding is that the built-in \p{IsLatin} class simply detects whether any of the characters are Latin. I want to detect whether any characters are not Latin.

Pattern LatinPattern = Pattern.compile("\\p{IsLatin}");
Matcher matcher = LatinPattern.matcher(str);
if (!matcher.find()) {
    System.out.println("is NON latin");
    return;
}
System.out.println("is latin");

There are 2 best solutions below

Wiktor Stribiżew On 07 January 2021 at 23:52

The Latin Unicode blocks covering U+0000–U+024F are:

\p{InBasic_Latin}: U+0000–U+007F
\p{InLatin-1_Supplement}: U+0080–U+00FF
\p{InLatin_Extended-A}: U+0100–U+017F
\p{InLatin_Extended-B}: U+0180–U+024F

So, the answer is either

// Option 1: list the four Unicode blocks by name
Pattern LatinPattern = Pattern.compile("^[\\p{InBasicLatin}\\p{InLatin-1Supplement}\\p{InLatinExtended-A}\\p{InLatinExtended-B}]+$");
// Option 2: the same code point range, U+0000-U+024F
Pattern LatinPattern = Pattern.compile("^[\\x00-\\x{024F}]+$");

Note that underscores are removed from the Unicode property class names in Java.
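
As a cross-check that does not depend on how the regex engine spells block names, the same four blocks can be tested with Character.UnicodeBlock. This is only a sketch: the class name LatinBlockCheck and the method isLatinOnly are made up for illustration, and rejecting the empty string is an assumption taken from the expected results in the question.

```java
import java.lang.Character.UnicodeBlock;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class LatinBlockCheck {
    // The four blocks listed above, U+0000 through U+024F.
    // A HashSet is used so that contains(null) is safe when
    // UnicodeBlock.of returns null for an unassigned code point.
    private static final Set<UnicodeBlock> LATIN_BLOCKS = new HashSet<>(Arrays.asList(
            UnicodeBlock.BASIC_LATIN,
            UnicodeBlock.LATIN_1_SUPPLEMENT,
            UnicodeBlock.LATIN_EXTENDED_A,
            UnicodeBlock.LATIN_EXTENDED_B));

    // True if the string is non-empty and every code point lies in one of the four blocks.
    static boolean isLatinOnly(String s) {
        return !s.isEmpty()
                && s.codePoints().allMatch(cp -> LATIN_BLOCKS.contains(UnicodeBlock.of(cp)));
    }

    public static void main(String[] args) {
        System.out.println(isLatinOnly("abcDE 123"));
        // The katakana sample from the question, written with escapes.
        System.out.println(isLatinOnly("\u30D9\u30D3\u30FC\u30C9\u30E9"));
        System.out.println(isLatinOnly(""));
    }
}
```

Working on code points rather than char values keeps the check correct for characters outside the Basic Multilingual Plane.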

See the Java demo:

List<String> strs = Arrays.asList(
        "abcDE 123",  // Yes, this should match
        "!@#$%^&*",   // Yes, this should match
        "aaàààäää",   // Yes, this should match
        "ベビードラ", // No, this shouldn't match
        "");     // No, this shouldn't match  
Pattern LatinPattern = Pattern.compile("^[\\p{InBasicLatin}\\p{InLatin-1Supplement}\\p{InLatinExtended-A}\\p{InLatinExtended-B}]+$");
//Pattern LatinPattern = Pattern.compile("^[\\x00-\\x{024F}]+$"); //U+0000-U+024F
for (String str : strs) {
    Matcher matcher = LatinPattern.matcher(str);
    if (!matcher.find()) {
        System.out.println(str + " => is NON Latin");
        //return;
    } else {
        System.out.println(str + " => is Latin");
    }
}

Note: if you replace .find() with .matches(), you can throw away ^ and $ in the pattern.
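
A minimal sketch of that variant, assuming the code point range form of the pattern. The class name LatinMatchesDemo and the helper isLatin are illustrative only:

```java
import java.util.regex.Pattern;

class LatinMatchesDemo {
    // No ^ or $ here: matches() only succeeds if the whole input matches.
    private static final Pattern LATIN = Pattern.compile("[\\x00-\\x{024F}]+");

    static boolean isLatin(String s) {
        return LATIN.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Non-ASCII samples are written with escapes to avoid source encoding issues.
        String[] inputs = { "abcDE 123", "!@#$%^&*", "aa\u00E0\u00E4", "\u30D9\u30D3", "" };
        for (String input : inputs) {
            System.out.println("\"" + input + "\" => "
                    + (isLatin(input) ? "is Latin" : "is NON Latin"));
        }
    }
}
```

Because of the + quantifier, the empty string is still rejected, just as with the anchored find() version.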

Output:

abcDE 123 => is Latin
!@#$%^&* => is Latin
aaàààäää => is Latin
ベビードラ => is NON Latin
 => is NON Latin

Andreas (accepted answer) On 07 January 2021 at 23:07

TL;DR: Use regex ^[\p{Print}\p{IsLatin}]*$

You want a regex that matches if the string consists of:

Spaces
Digits
Punctuation
Latin characters (Unicode script "Latin")

The easiest way is to combine \p{IsLatin} with \p{Print}, which Pattern defines as:

\p{Print} - A printable character: [\p{Graph}\x20]
- \p{Graph} - A visible character: [\p{Alnum}\p{Punct}]
  - \p{Alnum} - An alphanumeric character: [\p{Alpha}\p{Digit}]
    - \p{Alpha} - An alphabetic character: [\p{Lower}\p{Upper}]
      - \p{Lower} - A lower-case alphabetic character: [a-z]
      - \p{Upper} - An upper-case alphabetic character: [A-Z]
    - \p{Digit} - A decimal digit: [0-9]
  - \p{Punct} - Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
- \x20 - A space character

Which makes \p{Print} the same as [\p{ASCII}&&\P{Cntrl}], i.e. ASCII characters that are not control characters.
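
That equivalence can be checked by brute force over every char value. The sketch below assumes the default Pattern flags; with the UNICODE_CHARACTER_CLASS flag the POSIX classes take on Unicode meanings and the result would differ. The class name PrintClassCheck is illustrative only.

```java
import java.util.regex.Pattern;

class PrintClassCheck {
    private static final Pattern PRINT = Pattern.compile("\\p{Print}");
    private static final Pattern ASCII_NOT_CNTRL = Pattern.compile("[\\p{ASCII}&&\\P{Cntrl}]");

    // Returns true if both classes accept exactly the same char values (U+0000 to U+FFFF).
    static boolean classesAgree() {
        for (int c = 0; c <= 0xFFFF; c++) {
            String s = String.valueOf((char) c);
            if (PRINT.matcher(s).matches() != ASCII_NOT_CNTRL.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("classes agree: " + classesAgree());
    }
}
```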

The \p{Alpha} part overlaps with \p{IsLatin}, but that's fine, since the character class eliminates duplicates.

So, regex is: ^[\p{Print}\p{IsLatin}]*$

Test

Pattern latinPattern = Pattern.compile("^[\\p{Print}\\p{IsLatin}]*$");

String[] inputs = { "abcDE 123", "!@#$%^&*", "aaàààäää", "ベビードラ", "" };
for (String input : inputs) {
    System.out.print("\"" + input + "\": ");
    Matcher matcher = latinPattern.matcher(input);
    if (! matcher.find()) {
        System.out.println("is NON latin");
    } else {
        System.out.println("is latin");
    }
}

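Output

"abcDE 123": is latin
"!@#$%^&*": is latin
"aaàààäää": is latin
"ベビードラ": is NON latin
"": is latin

Note that the empty string counts as a match here, because * allows zero characters. To reject the empty string, use + instead of *.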

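The question asks whether any character is not Latin. One way to test that directly, sketched here as an assumption rather than as part of the original answer, is to negate the same character class and use find(). The class name NonLatinFinder and the method containsNonLatin are illustrative only.

```java
import java.util.regex.Pattern;

class NonLatinFinder {
    // Matches one character that is neither printable ASCII nor in the Latin script.
    private static final Pattern NON_LATIN = Pattern.compile("[^\\p{Print}\\p{IsLatin}]");

    // True if at least one such character occurs anywhere in the input.
    static boolean containsNonLatin(String s) {
        return NON_LATIN.matcher(s).find();
    }

    public static void main(String[] args) {
        // Non-ASCII samples are written with escapes to avoid source encoding issues.
        String[] inputs = { "abcDE 123", "!@#$%^&*", "aa\u00E0\u00E4", "\u30D9\u30D3\u30FC\u30C9\u30E9", "" };
        for (String input : inputs) {
            System.out.println("\"" + input + "\": "
                    + (containsNonLatin(input) ? "contains non-Latin" : "no non-Latin found"));
        }
    }
}
```

An empty string contains no non-Latin character, so a caller that must reject empty input needs a separate isEmpty() check.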