awk and equivalence classes


Does GNU awk (gawk) support POSIX equivalence classes?

Is it possible to match [[=a=]] using awk as it is done in grep?

$ echo ábÅ | grep '[[=a=]]'
ábÅ

$ echo ábÅ | grep -o '[[=a=]]'
á
Å

There are 3 best solutions below

2
Raymond Hettinger On BEST ANSWER

Per the GAWK User's Guide: "Caution: The library functions that gawk uses for regular expression matching currently only recognize POSIX character classes; they do not recognize collating symbols or equivalence classes."

Accordingly, you'll have to write out the allowed equivalents in the regex yourself, e.g. /[aáÅ]/ or whatever set you're looking for.
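
As a rough sketch of that workaround, the snippet below emulates `grep -o '[[=a=]]'` by listing the equivalents explicitly. The function name `match_a_like` and the set a, á, Å are illustrative assumptions, not anything gawk provides; extend the list to whatever your locale treats as equivalent. It uses alternation rather than a bracket list so that it should also behave when awk is processing the input as bytes.

```shell
# Hypothetical helper: print every "a-like" character found in the input,
# one per line, similar to `grep -o '[[=a=]]'`.
# The equivalents (a, á, Å) are listed by hand because gawk does not
# recognize equivalence classes. Alternation is used instead of [aáÅ]
# so multibyte characters are matched as whole sequences even in byte mode.
match_a_like() {
    awk '{
        s = $0
        while (match(s, /a|á|Å/)) {
            print substr(s, RSTART, RLENGTH)   # the matched character
            s = substr(s, RSTART + RLENGTH)    # continue after the match
        }
    }'
}

printf '%s\n' 'ábÅ' | match_a_like
```

For the sample input from the question this should print á and Å on separate lines, assuming the script and the input use the same encoding.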

There are locale-aware character ranges, but that doesn't seem to be what you're asking about.

2
James Brown On

See the section on bracket expressions in the GAWK User's Guide, towards the end:

Locale-specific names for a list of characters that are equal. The name is enclosed between ‘[=’ and ‘=]’. For example, the name ‘e’ might be used to represent all of “e,” “ê,” “è,” and “é.” In this case, ‘[[=e=]]’ is a regexp that matches any of ‘e’, ‘ê’, ‘é’, or ‘è’.

These features are very valuable in non-English-speaking locales.

CAUTION: The library functions that gawk uses for regular expression matching currently recognize only POSIX character classes; they do not recognize collating symbols or equivalence classes.

0
RARE Kpop Manifesto On

You'll be surprised what gawk is willing to do these days:

 echo 'eÅêéAEè' \
                 \
 | mawk 'BEGIN { FS=RS="^$"
                   ORS=  ""
   } sub(/[\n]$/,"") +\
    gsub("[ \t]+|[\000-\b\v-\37!-\177]|"\
                 "[\200-\277]+","&\n")'  \
                                          \
 | gtee >( gpaste -s -d':' - | ecp >&2; )  \
                                            \
 | LC_ALL=C gawk -b -e '/[=[:lower:]=]|[=Å=]/'
  • e:Å:ê:é:A:E:è

     1   e
     2   Å
     3   ê
     4   é
     5   è
    

Even when I forced both the non-multibyte "C" locale and gawk's byte-mode flag (-b), it's willing to match at the larger class level. However, it's unwilling to match the ASCII "A" when I specify only the Scandinavian A-ring.
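
One way to check what the byte-mode pattern is actually matching is to look at the raw bytes of each test character. The sketch below assumes the input is UTF-8 with precomposed characters; `utf8_hex` is a hypothetical helper name. If gawk parses `[=Å=]` in byte mode as an ordinary bracket list containing `=`, 0xc3, and 0x85 (which is what the manual's caution suggests), then any character whose encoding contains the byte 0xc3 would match. That may account for ê, é, and è matching while "A" does not, rather than real equivalence-class support.

```shell
# Hypothetical helper: print the bytes of its argument as contiguous hex.
# Assumes the script itself is saved as UTF-8 (precomposed characters).
utf8_hex() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# Dump each test character from the pipeline above next to its bytes.
for ch in e Å ê é A E è; do
    printf '%s: %s\n' "$ch" "$(utf8_hex "$ch")"
done
```

Under those assumptions, Å, ê, é, and è all share the lead byte c3, while e, A, and E are single ASCII bytes.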