Regex word boundary: <|wb> v. <?wb>


Edit: @ikegami was the first person to respond and pointed out my typo.

The Raku regex docs say:

To match any word boundary, use <|w> or <?wb>. This is similar to \b in other languages.

This is what I'm seeing in rakudo:

[309] > "apa pz" ~~ / <|wb> p. /
「pa」

[310] > "apa pz" ~~ / <?wb> p. /
「pz」

<?wb> behaves the way I would expect. What is <|wb> doing?

In Perl:

use feature 'say';
"apa pz" =~ / \b p. /xms;
say $&;   # pz


codesections

<?wb> does what you want. <|wb> and <|w> are both, IIUC, unsupported syntax that should throw an error. And, indeed, they both do using the soon-to-be-released Raku AST compiler frontend:

$ RAKUDO_RAKUAST=1 raku -e 'say "apa pz" ~~ / <|w> p. /'
> ===SORRY!===
> Cannot find method 'apply-sink' on object of type NQPMu

More officially, no regex syntax starting with <| is spec'ed in Roast (Raku's test suite/specification).

(This means, of course, that it's an error for the docs to refer to <|w> syntax; I submitted a PR with a fix).
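For comparison, here is a minimal sketch of the supported boundary forms, run against the same input as the question. Note that <!wb> and << do not appear in this thread; they are taken from the word-boundary section of the Raku regex docs. The variable names are only illustrative.

```raku
# Sketch: supported word-boundary assertions on the question's input.
my $s = "apa pz";

# <?wb> is zero-width and succeeds only at a word boundary.
my $at-boundary = ($s ~~ / <?wb> p. /).Str;

# <!wb> is the negation: it succeeds only where there is NO word boundary
# (assumption: not discussed in this thread, taken from the regex docs).
my $not-at-boundary = ($s ~~ / <!wb> p. /).Str;

# << matches only a left word boundary, i.e. the start of a word
# (assumption: likewise taken from the regex docs).
my $at-word-start = ($s ~~ / << p. /).Str;

say $at-boundary;       # pz
say $not-at-boundary;   # pa
say $at-word-start;     # pz
```

The 「pa」 that <|wb> produced in the question is the same result <!wb> gives here, which is consistent with <|wb> not being treated as a word-boundary assertion at all.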

jubilatious1

Seems to be a difference between <|w> (specc'ed?) and <|wb> (not specc'ed?):

% raku
Welcome to Rakudo™ v2023.05.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2023.05.

To exit type 'exit' or '^D'
[0] > "apa pz" ~~ / <|wb> p. /
「pa」
[1] > "apa pz" ~~ / <|w> p. /
「pz」

The docs say: "To match any word boundary, use <|w> or <?wb>." I don't see <|wb> mentioned on that page (first link at bottom). But maybe it has been edited away?

The only guess I can venture is that somehow <|wb is mis-interpreted as a quoted list. See second link at bottom.

https://docs.raku.org/language/regexes#Word_boundary
https://docs.raku.org/language/regexes#Quoted_lists_are_LTM_matches

raiph

TL;DR <|wb> is a typo.¹ If you're trying to figure out some Raku feature, design.raku.org is a key resource.²˒³

S05: Regexes and rules

I pulled up S05: Regexes and rules and did an in-page search for <|⁴. The first match nailed it:

A leading | [in an assertion, ie starting with a < and ending with a >] indicates some kind of a zero-width boundary. You can refer to backslash sequences with this syntax; <|h> will match between a \h and a \H, for instance. Some examples:

  • <|w> word boundary
  • <|g> grapheme boundary (always matches in grapheme mode)
  • <|c> codepoint boundary (always matches in grapheme/codepoint mode)

I hope you can see why this instantly suggested to me a simple explanation: the syntax is (presumably) for use with a single-character backslash sequence (e.g. \w, but not \wb).⁵

Footnotes

¹ I suspect you wrote <|wb> because you were confusing <|w> with the <wb> assertion.

² By "spec" I mean "speculation" about "specification". The design documents were, and to a degree still are, authoritative about our evolution toward many current and future Raku features. We may have gone another route, or not yet gotten to where the "spec" pointed, but they're a key resource for understanding Raku's design.

³ Another key resource is the Raku project's IRC logs home. Discussion of Raku's design has been happening on IRC, and logged, since early 2005. TimToady (Larry Wall) comments, and bot logging of design doc commits, are often golden.

⁴ The <| search string has an initial space. (Otherwise you'll get a boatload of irrelevant matches instead because the source is written in RakuDoc.)

⁵ I checked a somewhat recent Rakudo and it didn't implement <|h>. And there's no support for \g or \c backslash sequences, so they're out too. Unless and until someone does the corresponding implementation work, the "specs" are, first and foremost, speculation.