I have written a Java lexical analyzer, shown below.
Token.java looks like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public enum Token {
    TK_MINUS("-"),
    TK_PLUS("\\+"),
    TK_MUL("\\*"),
    TK_DIV("/"),
    TK_NOT("~"),
    TK_AND("&"),
    TK_OR("\\|"),
    TK_LESS("<"),
    TK_LEQ("<="),
    TK_GT(">"),
    TK_GEQ(">="),
    TK_EQ("=="),
    TK_ASSIGN("="),
    TK_OPEN("\\("),
    TK_CLOSE("\\)"),
    TK_SEMI(";"),
    TK_COMMA(","),
    TK_KEY_DEFINE("define"),
    TK_KEY_AS("as"),
    TK_KEY_IS("is"),
    TK_KEY_IF("if"),
    TK_KEY_THEN("then"),
    TK_KEY_ELSE("else"),
    TK_KEY_ENDIF("endif"),
    OPEN_BRACKET("\\{"),
    CLOSE_BRACKET("\\}"),
    STRING("\"[^\"]+\""),
    TK_FLOAT("[+-]?([0-9]*[.])?[0-9]+"),
    TK_DECIMAL("(?:0|[1-9](?:_*[0-9])*)[lL]?"),
    TK_OCTAL("0[0-7](?:_*[0-7])*[lL]?"),
    TK_HEXADECIMAL("0x[a-fA-F0-9](?:_*[a-fA-F0-9])*[lL]?"),
    TK_BINARY("0[bB][01](?:_*[01])*[lL]?"),
    IDENTIFIER("\\w+");

    private final Pattern pattern;

    Token(String regex) {
        pattern = Pattern.compile("^" + regex);
    }

    int endOfMatch(String s) {
        Matcher m = pattern.matcher(s);
        if (m.find()) {
            return m.end();
        }
        return -1;
    }
}
The Lexer class, Lexer.java, looks like this:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;
public class Lexer {
    private StringBuilder input = new StringBuilder();
    private Token token;
    private String lexema;
    private boolean exausthed = false;
    private String errorMessage = "";
    private Set<Character> blankChars = new HashSet<Character>();

    public Lexer(String filePath) {
        try (Stream<String> st = Files.lines(Paths.get(filePath))) {
            st.forEach(input::append);
        } catch (IOException ex) {
            exausthed = true;
            errorMessage = "Could not read file: " + filePath;
            return;
        }
        blankChars.add('\r');
        blankChars.add('\n');
        blankChars.add((char) 8);
        blankChars.add((char) 9);
        blankChars.add((char) 11);
        blankChars.add((char) 12);
        blankChars.add((char) 32);
        moveAhead();
    }

    public void moveAhead() {
        if (exausthed) {
            return;
        }
        if (input.length() == 0) {
            exausthed = true;
            return;
        }
        ignoreWhiteSpaces();
        if (findNextToken()) {
            return;
        }
        exausthed = true;
        if (input.length() > 0) {
            errorMessage = "Unexpected symbol: '" + input.charAt(0) + "'";
        }
    }

    private void ignoreWhiteSpaces() {
        int charsToDelete = 0;
        while (blankChars.contains(input.charAt(charsToDelete))) {
            charsToDelete++;
        }
        if (charsToDelete > 0) {
            input.delete(0, charsToDelete);
        }
    }

    private boolean findNextToken() {
        for (Token t : Token.values()) {
            int end = t.endOfMatch(input.toString());
            if (end != -1) {
                token = t;
                lexema = input.substring(0, end);
                input.delete(0, end);
                return true;
            }
        }
        return false;
    }

    public Token currentToken() {
        return token;
    }

    public String currentLexema() {
        return lexema;
    }

    public boolean isSuccessful() {
        return errorMessage.isEmpty();
    }

    public String errorMessage() {
        return errorMessage;
    }

    public boolean isExausthed() {
        return exausthed;
    }
}
And I created a class named Try.java that can be used to test this lexical analyzer:
package draft;
public class Try {
    public static void main(String[] args) {
        Lexer lexer = new Lexer("C:/Users/eimom/Documents/Input.txt");
        System.out.println("Lexical Analysis");
        System.out.println("-----------------");
        while (!lexer.isExausthed()) {
            System.out.printf("%-18s : %s \n", lexer.currentLexema(), lexer.currentToken());
            lexer.moveAhead();
        }
        if (lexer.isSuccessful()) {
            System.out.println("Ok! :D");
        } else {
            System.out.println(lexer.errorMessage());
        }
    }
}
So, let's say the Input.txt file contains:
>=
0x10
()
11001100
-433
0125
0x3B
Then the output I expect is
>= TK_GEQ
0x10 TK_HEXADECIMAL
( TK_OPEN
) TK_CLOSE
11001100 TK_BINARY
-433 TK_DECIMAL
0125 TK_OCTAL
0x3B TK_HEXADECIMAL
But instead I get
Lexical Analysis
------------------
> :TK_GT
= :TK_ASSIGN
0 :TK_FLOAT
x10 :IDENTIFIER
( :TK_OPEN
) :TK_CLOSE
11001100 :TK_FLOAT
- :TK_MINUS
43301250 :TK_FLOAT
x3B :IDENTIFIER
What can I do to correct these issues? It seems the lexer doesn't stop at the end of a line; instead it continues with the first character of the next line.
This is your own doing: by using Files.lines(Path), the stream contains the content of each line without its line ending, so when you then concatenate all the lines back into input, you end up with the file content stripped of line breaks. Maybe you want to use Files.readString(Path) instead (available since Java 11). I also wonder why you don't use a Reader to read character by character; that is usually much more memory efficient than reading the entire file into memory (although that only becomes important when you analyse very large files).
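A minimal, self-contained sketch of the difference (the class name LineEndingDemo and the temporary file are illustrative, standing in for your Input.txt; it assumes Java 11+ for Files.readString/Files.writeString). It shows how Files.lines fuses the lines together, and two possible fixes — reading the file verbatim, or re-inserting a separator per line:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LineEndingDemo {
    public static void main(String[] args) throws IOException {
        // Hypothetical sample file standing in for Input.txt.
        Path tmp = Files.createTempFile("input", ".txt");
        Files.writeString(tmp, ">=\n0x10\n");

        // What the current constructor does: Files.lines strips the
        // separators, so tokens from different lines touch each other.
        String fused;
        try (Stream<String> st = Files.lines(tmp)) {
            fused = st.collect(Collectors.joining());
        }
        System.out.println(fused); // prints ">=0x10"

        // Fix 1: read the file verbatim, separators included.
        String verbatim = Files.readString(tmp);
        System.out.println(verbatim.contains("\n")); // prints "true"

        // Fix 2: keep Files.lines, but re-append a separator per line
        // (your lexer already skips '\n' as whitespace).
        StringBuilder input = new StringBuilder();
        try (Stream<String> st = Files.lines(tmp)) {
            st.forEach(line -> input.append(line).append('\n'));
        }
        System.out.println(input.toString().equals(verbatim)); // prints "true"

        Files.delete(tmp);
    }
}
```

Either fix keeps the rest of the Lexer unchanged; whitespace skipping then separates tokens that sit on different lines.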