Word boundaries with extended set of characters

185 Views Asked by xralf At 16 December 2011 at 19:22

It seems a little strange to me that \w matches [a-zA-Z0-9_]. I wonder why 0-9 and _ are counted among word characters and why - is not.

If I want to split the sentence:

This is counter-example.

with (\w*\b), it will split the word counter-example into two parts. Similarly, (count.*?\b) matches only counter.
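To make the behaviour above concrete, here is a minimal sketch using Python's re module (an assumption, since the question does not name an engine; other engines may differ in detail). It applies both patterns from the question to the sample sentence.

```python
import re

sentence = "This is counter-example."

# (\w*\b) also produces empty matches at some boundaries, so filter those out.
pieces = [m for m in re.findall(r"(\w*\b)", sentence) if m]

# The lazy .*? stops at the first boundary, which is right before the hyphen.
first = re.search(r"(count.*?\b)", sentence).group(1)

print(pieces)  # the hyphenated word comes back as two separate pieces
print(first)
```

The hyphen is not matched by \w, so \b sees a boundary on each side of it.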

Would it be possible to have something like \b with the result that - is included in word characters (\w)?

Or did I misunderstand the usage of \b? Are there some examples of its standard usage?


There are 2 best solutions below

fge On 16 December 2011 at 19:30 BEST ANSWER

The fact that \w matches the underscore along with uppercase and lowercase letters is historical: it is due to the fact that it was first introduced to match C identifiers.

Well, this is true for Java's \w (yes, by default \w will not match accented characters in Java).
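The statement above is about Java. As a rough analogue only, Python's re module (an assumption, not the engine the answer talks about) lets you see both behaviours side by side: its \w is Unicode-aware by default for str patterns, and the re.ASCII flag restricts it to [a-zA-Z0-9_].

```python
import re

word = "caf\u00e9"  # "café", written with an escape to avoid encoding issues

# Default for str patterns in Python 3: \w is Unicode-aware.
unicode_match = re.fullmatch(r"\w+", word)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the accented letter is excluded.
ascii_match = re.fullmatch(r"\w+", word, re.ASCII)
ascii_pieces = re.findall(r"\w+", word, re.ASCII)

print(unicode_match is not None, ascii_match is not None, ascii_pieces)
```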

\b, however, is an anchor, and it is not necessarily defined as the frontier between a word character and a non-word character; in fact, its exact definition is implementation-dependent.

There is not really an anchor that does what you want, but if you want to match words and dashes, your best bet is \w*(-\w*)*.

Again, the normal* (special normal*)* pattern!
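A minimal sketch of that pattern in Python's re module (an assumption; the answer itself refers to Java). The group is made non-capturing so the whole word is returned, and the "strict" variant is an illustrative tweak of mine, not part of the answer.

```python
import re

sentence = "This is counter-example."

# The answer's pattern: normal = \w, special = "-".
# Every part is optional, so it also matches the empty string; drop those.
loose = re.compile(r"\w*(?:-\w*)*")
words = [m.group(0) for m in loose.finditer(sentence) if m.group(0)]

# Stricter variant (assumption): require a word character on both sides
# of each hyphen, so a lone "-" is not reported as a word.
strict = re.compile(r"\w+(?:-\w+)*")
strict_words = strict.findall(sentence)

lone_loose = [m.group(0) for m in loose.finditer("a - b") if m.group(0)]
lone_strict = strict.findall("a - b")

print(words, strict_words, lone_loose, lone_strict)
```

The loose form is closer to what the answer wrote; the strict form is probably what you want if stray dashes appear in the text.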

(and BTW, \b is a "word anchor" in some dialects only; other implementations define \< and \> instead, for the beginning-of-word and end-of-word anchors respectively)
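If your engine supports lookarounds, start-of-word and end-of-word anchors over an extended character set can be emulated. This is only a sketch in Python's re module (an assumption); the names WORD, START and END are hypothetical helpers, not built-in anchors.

```python
import re

# Hypothetical helpers: characters treated as part of a word, hyphen included.
WORD = r"[\w-]"
START = rf"(?<!{WORD})(?={WORD})"  # plays the role of \< with the extended set
END = rf"(?<={WORD})(?!{WORD})"    # plays the role of \> with the extended set

sentence = "This is counter-example."

# The question's (count.*?\b), with the custom end-of-word anchor instead of \b.
hit = re.search(rf"{START}count.*?{END}", sentence).group(0)

# Splitting the sentence into words using the custom anchors.
tokens = re.findall(rf"{START}.+?{END}", sentence)

print(hit)
print(tokens)
```

Because the hyphen now counts as a word character, no boundary is seen on either side of it.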

[edit for a gross error]

noob On 16 December 2011 at 19:28

Use this: [\w-]*

For example you want to match something which ends with e and starts with co

String:

This is counter-example.

Regex:

co[\w-]*e

Match:

counter-example
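A quick check of the example above, sketched with Python's re module (an assumption; the answer does not name an engine). Note that the pattern is unanchored, so it can also match in the middle of a longer word.

```python
import re

sentence = "This is counter-example."

# [\w-]* treats the hyphen as part of the word.
match = re.search(r"co[\w-]*e", sentence).group(0)

# Unanchored: the same pattern also matches inside a longer word.
inner = re.search(r"co[\w-]*e", "recode").group(0)

print(match)
print(inner)
```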
