Lexical Analyzer with table-driven and character classification in rust

67 Views Asked by At

I am writing a lexer for C language as an exercise in Rust, for now I succedded implementing a lexer using backtracking, as follows:

lex_keyword(input)
.or_else(|| lex_identifier(input))
.or_else(|| lex_integer(input))
...
.unwrap_or(lex_unexpected(input));

This works as expected, however it is a bit slow, because it tries every possible solution until it finds one, and every lex_function, checks the whole input if it is valid or not and then tries to map to a token, for example on identifiers:

 let end = input.iter().position(|byte| byte.is_ascii_whitespace() || /* is a punctuator */).unwrap_or(input.len());

// And then

if input[..end].iter().all(|byte| /* is valid identifier member */) { /* Return an identifer */ } else { None }

I have read about how lexer generators implement a lexer analyzer.

So they basically convert the grammar to a NFA and then to a DFA and performs the main loop (which is good in theory).

I have read about lex and flex and yacc, and have seen that they do not generate just one lookup table for the DFA, but a bunch of them and also use character classes. And also their lookup tables aren't so huge, if you would implement a DFA, where a row(state) would have 128-columns (ASCII characters).

My questions is, how to implement character classes, and how to use them in the lookup table for the DFA, or can i create different lookup tables for different rules, for example:

token ::=
  keyword
| identifeer
| integer_literal
| float_literal
| ...

DFA_FOR_KEYWORD = [...]
DFA_FOR_IDENTIFIER = [...]
DFA_FOR_INTEGER = [...]
DFA_FOR_FLOAT = [...]
LEXER_DFA = [/* Which connects all the DFAs */]

Also about character_classes, how are they encoded, and can you reuse the same class for different parts?


For example:

Let's say that an identifier has to end with the letter 'p'

identifer ::= [a-zA-Z]*p

How many classes would I have?

  1. One for a-z, one for A-Z and one for 'p'
  2. One for a-zA-Z, and one for 'p'
  3. One for a-zA-Z, and check if it end with 'p'

Another question would be, how do you assemble them in the main loop?

Given the transition table, and given the character class and the last state, how do you now that is the next transition (in other words how to you codify the transitions table to move according to the character class)?

I look up on this link https://www.cs.man.ac.uk/%5C~pjj/cs211/ho/node6.html, but couldn't understand where the character classes are used or how are they encodded.

Also if you have some resources regarding this problem, please feel free to share

Thank you very much for your advice

P.S: I understand that a lexer generator does all of these things, however I would like to implement it by hand as a part of exercise.

0

There are 0 best solutions below