Using spaces as delimiters in ANTLR4

Question


Asked by Ray Offiah on 29 February 2024 at 10:06 · 36 views

I’m a bit new to ANTLR4 (and parsing in general), and I’m wondering if I’ve bitten off a little more than I can chew.

I’m trying to parse this line into a series of tokens:

This is some text @1 @2   @/searchstring/

Looks simple enough, but the problem is that I need to separate them into:

This is some text

@1
@2
@/searchstring/

I’ve come up with this for the grammar:

grammar Callout;

callout        : phrase parameterlist ;

phrase         : ~'@'+? ;

parameterlist  : param  (WHITESPACE+ param)* ;

param          : numericparam | searchparam ;

numericparam   : '@' DIGITS ;

DIGITS         : [0-9]+ ;

searchparam    : '@/' SEARCH '/' ;

SEARCH         : ~[/]+ ;

WHITESPACE     : [\t ] ;

When I try it out, it seems to match the whole string without the @ symbols (I imagine that’s because the longest match is the whole string, though I’m not sure what happened to the @ characters).

**Bart Kiers** · Answer 1 · 2024-02-29T17:11:42.210000

> and I’m wondering if I’ve bitten off a little more than I can chew.

Maybe. Looking at your grammar, I think it would be good to start by learning and understanding the basics.

The part ~'@'+ in the rule phrase : ~'@'+? ; does not do what you think it does. Using literal tokens (like '@') inside parser rules causes ANTLR to create tokens (lexer rules) for you. Your rule phrase : ~'@'+? ; is translated into the following:

phrase
 : ~T__0+?
 ;

T__0
 : '@'
 ;

The ~ inside a parser rule does not exclude characters; it excludes tokens. In other words, ~T__0 does not mean "match any character other than an @", but rather "match any token other than the T__0 token". It's best not to use these literal tokens inside parser rules unless you know what you're doing (which you don't ;))

Also, ANTLR's lexer rules consume as many characters as possible, and the rule that matches the most characters "wins". The lexer runs independently of the parser: it does not try a different token just because the parser expects one at that point. So the rule SEARCH : ~[/]+ ; is too greedy: it consumes the input "This is some text @1 @2   @" (everything up to the first /) as a single token.
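To see the longest-match behaviour concretely, here is a rough Python simulation of the lexer for the question's grammar. It is not ANTLR itself: each lexer rule is approximated by a regular expression, the tokenize helper is hypothetical, and the names T__1 and T__2 for the implicit '@/' and '/' tokens are assumptions (only T__0 is named above). It also assumes that implicit literal tokens are tried before the explicit rules when two matches have the same length.

```python
import re

# Regex approximations of the lexer rules in the question's grammar.
# T__0..T__2 stand in for the implicit tokens ANTLR creates for the
# literals '@', '@/' and '/' used in parser rules (names assumed).
RULES = [
    ("T__0", r"@"),
    ("T__1", r"@/"),
    ("T__2", r"/"),
    ("DIGITS", r"[0-9]+"),
    ("SEARCH", r"[^/]+"),
    ("WHITESPACE", r"[\t ]"),
]


def tokenize(text, rules):
    """Longest match wins; on a tie, the rule listed first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly greater, so an earlier rule keeps a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


tokens = tokenize("This is some text @1 @2   @/searchstring/", RULES)
for name, value in tokens:
    print(name, repr(value))
```

Under these assumptions the first token is a SEARCH token containing everything before the first /, including all three @ characters, so no separate '@' token ever reaches the parser.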

Try something like this instead:

grammar Callout;

callout        : phrase parameterlist EOF;
phrase         : WORD+;
parameterlist  : param+;
param          : NUMERICPARAM | SEARCHPARAM ;

NUMERICPARAM   : '@' [0-9]+ ;
SEARCHPARAM    : '@/' WORD '/' ;
WORD           : ~[ @/]+ ;
WHITESPACE     : [\t ] -> skip;
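
As a quick sanity check of the suggested lexer rules, the following Python sketch approximates them with regular expressions and applies the same longest-match rule. It is only a simulation under that assumption, not generated ANTLR code; the lex helper and the SKIPPED set (which mimics -> skip) are hypothetical names.

```python
import re

# Regex approximations of the lexer rules in the suggested grammar, in order.
RULES = [
    ("NUMERICPARAM", r"@[0-9]+"),
    ("SEARCHPARAM", r"@/[^ @/]+/"),
    ("WORD", r"[^ @/]+"),
    ("WHITESPACE", r"[\t ]"),
]
SKIPPED = {"WHITESPACE"}  # mimics "-> skip": matched but never emitted


def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        matches = [(name, re.compile(p).match(text, pos)) for name, p in RULES]
        candidates = [(name, m.group()) for name, m in matches if m]
        if not candidates:
            raise ValueError(f"no rule matches at position {pos}")
        # Longest match wins; max() keeps the earliest rule on a tie.
        name, value = max(candidates, key=lambda c: len(c[1]))
        if name not in SKIPPED:
            tokens.append((name, value))
        pos += len(value)
    return tokens


tokens = lex("This is some text @1 @2   @/searchstring/")
# Mirror the parser rules: phrase is WORD+, parameterlist is param+.
phrase = " ".join(v for n, v in tokens if n == "WORD")
params = [v for n, v in tokens if n in ("NUMERICPARAM", "SEARCHPARAM")]
print(phrase)
print(params)
```

Because whitespace is skipped, the phrase comes back as a sequence of WORD tokens; joining them with single spaces, as done here, does not preserve the original spacing.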
