It seems a little strange to me that \w matches [a-zA-Z0-9_]. I wonder why 0-9 and _ are counted among word characters and why - is not.
If I want to split the sentence:
This is counter-example.
with (\w*\b), the word counter-example is split into two parts. Similarly, (count.*?\b) matches only counter.
Would it be possible to have something like \b that treats - as a word character, as if it were part of \w?
Or did I misunderstand the usage of \b? Are there some examples of its standard usage?
The fact that \w matches the underscore along with uppercase and lowercase letters and digits is historical: it was first introduced to match C identifiers. Well, this is true for Java's \w (yes, by default \w will not match accented characters in Java).
\b, however, is an anchor, and it is not necessarily defined as the boundary between a word character and a non-word character; in fact, its exact definition is implementation-dependent.
There is not really an anchor which does what you want, but if you want to match words and dashes, your best bet is \w*(-\w*)*. Again, the normal* (special normal*)* pattern!
(And BTW, \b is a "word anchor" in some dialects only; other implementations define \< and \> instead, for the beginning-of-word and end-of-word anchors respectively.)
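To make the difference concrete, here is a minimal Java sketch run against the sentence from the question. The class and method names (HyphenWords, plainWords, hyphenatedWords) are made up for illustration. It also uses + where the pattern above uses *, which is an adjustment on my part: with *, the pattern can match the empty string, so find() would return empty matches between words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class HyphenWords {
    // \w alone: a hyphen is not a word character, so it ends the match.
    private static final Pattern PLAIN_WORD = Pattern.compile("\\w+");

    // normal+ (special normal+)* with normal = \w and special = "-".
    // Assumption: + instead of * so that empty matches are never returned.
    private static final Pattern HYPHENATED_WORD = Pattern.compile("\\w+(?:-\\w+)*");

    // Collects every non-overlapping match of the pattern, left to right.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    static List<String> plainWords(String input) {
        return findAll(PLAIN_WORD, input);
    }

    static List<String> hyphenatedWords(String input) {
        return findAll(HYPHENATED_WORD, input);
    }

    public static void main(String[] args) {
        String sentence = "This is counter-example.";
        System.out.println(plainWords(sentence));      // [This, is, counter, example]
        System.out.println(hyphenatedWords(sentence)); // [This, is, counter-example]
    }
}
```

Note that with this variant a hyphen only counts when it sits between two word characters, so a leading, trailing, or doubled hyphen is not swallowed into the match. Whether that is what you want depends on your input.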