How do I find and print invalid lexemes and assign each one an appropriate error message?


I'm trying to create a simple lexical analyzer that recognizes an int or float declaration as input and then splits it into lexemes and tokens.

I'm using regex to match patterns and detect whether each lexeme is valid or invalid. I already have the logic for detecting valid lexemes and correctly assigning tokens to them.

private static void printLexemes(String input, String dataType) {
    Map<String, String> lexemes = new HashMap<>();
    lexemes.put(dataType, "Data_Type");

    String[] tokens = input.split("\\s+|(?<=[=,;])(?=[^\\s])|(?=[=,;])(?<=[^\\s])");

    boolean afterDataType = false;

    String currentToken = "";

    for (String token : tokens) {
        if (!token.isEmpty()) {
            if (!afterDataType) {
                if (token.equals(dataType)) {
                    afterDataType = true;
                } else {
                    lexemes.put(token, "invalid lexeme");
                }
            } else {
                if (token.matches("[a-zA-Z_][a-zA-Z0-9_]*")) {
                    lexemes.put(token, "IDENTIFIER");
                } else if (token.matches("\\d+(\\.\\d+)?")) {
                    if (dataType.equals("float")) {
                        float floatValue = Float.parseFloat(token);
                        currentToken = Float.toString(floatValue);
                        lexemes.put(currentToken, "Constant(F)");

                    } else {
                        currentToken = token;
                        lexemes.put(currentToken, "Constant(I)");

                    }
                } else if (token.equals(";")) {
                    lexemes.put(token, "Semi_Colon");
                } else if (token.equals(",")) {
                    lexemes.put(token, "Comma");
                } else if (token.equals("=")) {
                    lexemes.put(token, "Equal_Sign");
                } else {
                    lexemes.put(token, "invalid lexeme");
                }
            }
        }
    }

However, I'm having trouble with the part that finds invalid lexemes. I have tried a method that takes an input string and identifies invalid lexemes by splitting the input into tokens, checking for valid data types, reserved keywords, expected identifiers, equal signs, and semicolons, and adding any invalid lexemes found to a set.

private static Set<String> findInvalidLexemes(String input) {
    Set<String> invalidLexemes = new HashSet<>();
    String[] tokens = input.split("\\s+|(?=[=,;])|(?<=[=,;])|(?<=\\b|\\B)");
    String dataType = tokens[0]; // Extract the data type from the first token

    if (!DataTypeChecker.isDataType(dataType)) {
        invalidLexemes.add(dataType + " (invalid data type)");
    }

    boolean identifierExpected = false;
    boolean equalSignExpected = false;
    boolean semicolonExpected = false;

    for (String token : tokens) {
        if (!token.isEmpty() && !((token.matches("[a-zA-Z_][a-zA-Z0-9_]*") || token.matches("\\d+(\\.\\+)?") || token.equals(";") || token.equals(",") || token.equals("=") || token.equals(".")))) {
            if (token.contains(".")) {
                try {
                    Double.parseDouble(token);
                } catch (NumberFormatException e) {
                    invalidLexemes.add(dataType + " " + token);
                }
            } else {
                int val;
                try {
                    val = Integer.parseInt(token);
                } catch (NumberFormatException e) {
                    invalidLexemes.add(dataType + " " + token);
                }
            }
        } else if (isReservedKeyword(token)) {
            invalidLexemes.add(dataType + " " + token);
        } else {
            if (identifierExpected && !token.matches("[a-zA-Z_][a-zA-Z0-9_]*")) {
                invalidLexemes.add(dataType + " (missing identifier)");
            }
            if (equalSignExpected && !token.equals("=")) {
                invalidLexemes.add(dataType + " (missing equal sign)");
            }
            if (semicolonExpected && !token.equals(";")) {
                invalidLexemes.add(dataType + " (missing semicolon)");
            }

            identifierExpected = equalSignExpected = semicolonExpected = false;

            if (token.equals(dataType)) {
                identifierExpected = true;
            } else if (token.matches("[a-zA-Z_][a-zA-Z0-9_]*")) {
                if (!equalSignExpected) {
                    identifierExpected = true;
                }
            } else if (token.equals("=")) {
                equalSignExpected = true;
            }
        }
    }

    if (identifierExpected && !equalSignExpected) {
        invalidLexemes.add(dataType + " (missing identifier)");
    }
    if (equalSignExpected && !semicolonExpected) {
        invalidLexemes.add(dataType + " (missing equal sign)");
    }
    if (semicolonExpected) {
        invalidLexemes.add(dataType + " (missing semicolon)");
    }

    return invalidLexemes;
}

The problem with this code is that it mishandles input containing invalid lexemes. The issue lies in the logic that checks for missing equal signs, missing identifiers, and invalid data types. The code assumes that the data type is always the first token, which may not be the case. It also does not handle input that contains multiple invalid tokens.
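One way to avoid depending on token position is to scan the whole input with a Matcher and classify every lexeme on its own, so that several invalid lexemes can be reported in one pass. The following is only a minimal sketch: the class name LexemeScanner and its methods are hypothetical, the token names are reused from printLexemes above, and constants are classified by their form (digits only vs. digits with a fraction), which is an assumption that differs from the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: scans the whole input and classifies every lexeme. */
class LexemeScanner {
    // A lexeme is either a single delimiter (=, comma, semicolon) or a
    // maximal run of characters that are neither whitespace nor delimiters.
    private static final Pattern LEXEME = Pattern.compile("[=,;]|[^\\s=,;]+");

    /** Returns the token name for one lexeme, or "invalid lexeme". */
    static String classify(String lexeme) {
        if (lexeme.equals("int") || lexeme.equals("float")) return "Data_Type";
        if (lexeme.matches("[A-Za-z_][A-Za-z0-9_]*")) return "IDENTIFIER";
        if (lexeme.matches("\\d+")) return "Constant(I)";
        if (lexeme.matches("\\d+\\.\\d+")) return "Constant(F)";
        if (lexeme.equals("=")) return "Equal_Sign";
        if (lexeme.equals(",")) return "Comma";
        if (lexeme.equals(";")) return "Semi_Colon";
        return "invalid lexeme";
    }

    /** Returns {lexeme, token} pairs in the order they appear in the input. */
    static List<String[]> scan(String input) {
        List<String[]> result = new ArrayList<>();
        Matcher m = LEXEME.matcher(input);
        while (m.find()) {
            String lexeme = m.group();
            result.add(new String[] { lexeme, classify(lexeme) });
        }
        return result;
    }

    public static void main(String[] args) {
        for (String[] pair : scan("int 23jordan=23;")) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Because `23jordan` is kept together as one lexeme, it fails every valid pattern and is reported as a single invalid lexeme instead of being split into `23` and `jordan`. A list is used rather than a HashMap so that the original order is kept and repeated lexemes are not lost.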

Then I tried:

private static ArrayList<String> findInvalidLexemes(String[] lexemes) {
    ArrayList<String> invalidLexemes = new ArrayList<>();

    // Check for invalid data types and invalid identifiers
    boolean isInvalidIdentifier = false;
    for (String lexeme : lexemes) {
        if (!(lexeme.equals("int") || lexeme.equals("float"))) {
            if (!isValidIdentifier(lexeme)) {
                invalidLexemes.add("Invalid identifier: " + lexeme);
                isInvalidIdentifier = true;
            } else {
                if (!isInvalidIdentifier) {
                    invalidLexemes.add(lexeme);
                }
                isInvalidIdentifier = false;
            }
        }
    }

    // Check for invalid constant values
    for (String lexeme : lexemes) {
        if (lexeme.matches("\\d+(\\.\\d+)?")) {
            if (!isValidFloat(lexeme)) {
                invalidLexemes.add("Constant(F)");
            } else if (!isValidInt(lexeme)) {
                invalidLexemes.add("Constant(I)");
            }
        }
    }

    return invalidLexemes;
}

private static boolean isValidLexeme(String lexeme) {
    // Add validation logic
    return !lexeme.isBlank();
}

private static boolean isValidIdentifier(String identifier) {
    String regex = "^([a-zA-Z_$][a-zA-Z\\d_$]*)$";
    Pattern p = Pattern.compile(regex);

    String[] reservedKeywords = {"int", "float", "double", "char", "short", "long", "unsigned", "signed",
        "void", "for", "while", "do", "if", "else", "switch", "case", "break", "continue",
        "return", "goto", "struct", "union", "typedef", "enum", "static", "extern", "const",
        "volatile", "register", "auto"};
    
    if (identifier == null) {
        return false;
    }

    Matcher m = p.matcher(identifier);
    if (m.matches()) {
        if(Arrays.asList(reservedKeywords).contains(identifier)) {
            return false;
        }
        return true;
    }

    return false;
}

private static boolean isValidFloat(String lexeme) {
    try {
        float floatValue = Float.parseFloat(lexeme);
        return floatValue >= 0 && floatValue <= 3.4028235E38;
    } catch (NumberFormatException e) {
        return false;
    }
}

private static boolean isValidInt(String lexeme) {
    try {
        int intValue = Integer.parseInt(lexeme);
        return intValue >= 0 && intValue <= 2147483647;
    } catch (NumberFormatException e) {
        return false;
    }
}

This version treats each part of the input string as a separate lexeme, rather than considering the entire input as a single lexeme. As a result, it flags each character as an invalid identifier because it doesn't match the expected identifier pattern. The code also does not handle assignments or the presence of an equal sign (=) in the input.

Then I tried to keep it simple with a method that takes a string input representing a series of tokens and identifies invalid lexemes based on certain rules. It splits the input into tokens, checks for correctness based on data types, equals signs, and semicolons, and adds invalid tokens to a list.

private static List<String> findInvalidLexemes(String input) {
    List<String> invalidLexemes = new ArrayList<>();
    String[] tokens = input.split("\\s+|(?<=,)|(?=,)|(?<=;)|(?=;)"); // Split by whitespace or commas (including preserving commas)

    boolean afterDataType = false;
    boolean afterEqualsSign = false;
    boolean encounteredSemicolon = false;

    for (String token : tokens) {
        if (token.trim().isEmpty()) {
            continue; // Skip whitespace tokens
        }

        if (!afterDataType) {
            // Skip the data type
            if (token.equals("int") || token.equals("float")) {
                afterDataType = true;
            } else {
                invalidLexemes.add(token);
            }
        } else if (!afterEqualsSign) {
            if (token.equals("=")) {
                afterEqualsSign = true;
            } else if (!token.matches("[a-zA-Z_][a-zA-Z0-9_]*")) {
                invalidLexemes.add(token);
            }
        } else if (!encounteredSemicolon) {
            if (token.equals(";")) {
                encounteredSemicolon = true;
            } else {
                invalidLexemes.add(token);
            }
        } else {
            if (!token.matches("[a-zA-Z_][a-zA-Z0-9_]*") && !token.matches("\\d+(\\.\\d+)?")) {
                invalidLexemes.add(token);
            }
        }
    }

    return invalidLexemes;
}

But it incorrectly assumes that the first token after the data type should be an equals sign, which may not always be the case. The regular expression used to validate tokens as identifiers or numbers is also overly restrictive and does not allow for certain valid tokens, potentially flagging them as invalid.

Here's what I expect for input and output:

Input: int 23jordan=23; Output: ‘Invalid identifier 23jordan’

Input: int x=; Output: ‘Invalid, Missing constant’
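The two expected results above can be produced by walking the lexemes against a small declaration grammar, `type identifier [= constant] {, identifier [= constant]} ;`, and recording a message whenever the next lexeme is not what the grammar expects. This is a sketch under assumptions, not a definitive implementation: the class DeclarationChecker is hypothetical, and every message other than the two quoted in the question is invented for illustration. It also does not check that an int is not given a fractional constant.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical checker for declarations such as "int x = 23, y;". */
class DeclarationChecker {
    private static final Pattern LEXEME = Pattern.compile("[=,;]|[^\\s=,;]+");
    private static final String IDENT = "[A-Za-z_][A-Za-z0-9_]*";
    private static final String NUMBER = "\\d+(\\.\\d+)?";

    private static boolean isType(String s) {
        return s.equals("int") || s.equals("float");
    }

    private static boolean isDelimiter(String s) {
        return s.equals("=") || s.equals(",") || s.equals(";");
    }

    /** Returns all error messages for the input; an empty list means it is valid. */
    static List<String> check(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = LEXEME.matcher(input);
        while (m.find()) tokens.add(m.group());

        List<String> errors = new ArrayList<>();
        if (tokens.isEmpty()) {
            errors.add("Invalid, Missing data type"); // assumed wording
            return errors;
        }

        // 1. The declaration must start with a data type.
        if (!isType(tokens.get(0))) {
            errors.add("Invalid data type " + tokens.get(0)); // assumed wording
        }
        int i = 1;

        // 2. One or more declarators separated by commas.
        boolean more = true;
        while (more) {
            if (i >= tokens.size() || isDelimiter(tokens.get(i))) {
                errors.add("Invalid, Missing identifier"); // assumed wording
            } else {
                String name = tokens.get(i++);
                if (!name.matches(IDENT) || isType(name)) {
                    errors.add("Invalid identifier " + name);
                }
            }
            // Optional initializer: '=' must be followed by a constant.
            if (i < tokens.size() && tokens.get(i).equals("=")) {
                i++;
                if (i >= tokens.size() || isDelimiter(tokens.get(i))) {
                    errors.add("Invalid, Missing constant");
                } else {
                    String value = tokens.get(i++);
                    if (!value.matches(NUMBER)) {
                        errors.add("Invalid constant " + value); // assumed wording
                    }
                }
            }
            more = i < tokens.size() && tokens.get(i).equals(",");
            if (more) i++;
        }

        // 3. The declaration must end with a semicolon and nothing after it.
        if (i < tokens.size() && tokens.get(i).equals(";")) {
            i++;
        } else {
            errors.add("Invalid, Missing semicolon"); // assumed wording
        }
        if (i < tokens.size()) {
            errors.add("Invalid, Unexpected " + tokens.get(i)); // assumed wording
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(check("int 23jordan=23;"));
        System.out.println(check("int x=;"));
    }
}
```

The main point of the design is that the checker always knows which kind of lexeme it expects next, so a missing piece (such as the constant after `=`) can be named directly rather than inferred from flags after the fact.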

Answer by Reilas:

This concept is sometimes referred to as interpreting.

I recommend reviewing the concept of an Abstract Syntax Tree.
I believe it is what most compilers use to generate pre-bytecode data.
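To make the idea concrete for the declarations in the question, here is a rough sketch of what a very small syntax tree could look like. It assumes the input has already been split into lexemes, and the names AstSketch, Declaration, and Declarator are hypothetical, chosen only for illustration. The parser builds one Declaration node per statement and throws as soon as a lexeme does not fit the grammar.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, minimal syntax tree for "type name [= constant] {, ...} ;". */
class AstSketch {
    /** Leaf node: one declared variable; initializer is null when absent. */
    static final class Declarator {
        final String name;
        final String initializer;

        Declarator(String name, String initializer) {
            this.name = name;
            this.initializer = initializer;
        }

        @Override
        public String toString() {
            return initializer == null ? name : name + " = " + initializer;
        }
    }

    /** Root node: a data type plus the variables declared with it. */
    static final class Declaration {
        final String type;
        final List<Declarator> declarators = new ArrayList<>();

        Declaration(String type) {
            this.type = type;
        }

        @Override
        public String toString() {
            return "Declaration(" + type + ", " + declarators + ")";
        }
    }

    /** Builds the tree, or throws IllegalArgumentException on the first syntax error. */
    static Declaration parse(List<String> tokens) {
        int i = 0;
        Declaration decl = new Declaration(expect(tokens, i++, "int|float", "data type"));
        while (true) {
            String name = expect(tokens, i++, "[A-Za-z_][A-Za-z0-9_]*", "identifier");
            String init = null;
            if (i < tokens.size() && tokens.get(i).equals("=")) {
                i++;
                init = expect(tokens, i++, "\\d+(\\.\\d+)?", "constant");
            }
            decl.declarators.add(new Declarator(name, init));
            String separator = expect(tokens, i++, "[,;]", "',' or ';'");
            if (separator.equals(";")) {
                break; // end of the declaration; anything after it is ignored here
            }
        }
        return decl;
    }

    /** Returns the token at index if it matches the regex, otherwise throws. */
    private static String expect(List<String> tokens, int index, String regex, String what) {
        if (index >= tokens.size() || !tokens.get(index).matches(regex)) {
            String found = index < tokens.size() ? tokens.get(index) : "end of input";
            throw new IllegalArgumentException("Expected " + what + " but found " + found);
        }
        return tokens.get(index);
    }

    public static void main(String[] args) {
        System.out.println(parse(Arrays.asList("float", "a", "=", "1.5", ",", "b", ";")));
    }
}
```

Once the input is in tree form, later checks (for example, that an int is not initialized with a fractional constant) can work on the nodes instead of on raw strings.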

There are numerous books on writing interpreters as well. Packt Publishing offers "Build Your Own Programming Language", which includes a chapter on syntax trees (chapter 5).

Furthermore, you can access the Java compiler class via the links below.
The compiler contains interpreting classes, which use an abstract syntax tree.

jdk.compiler (Java SE 20 & JDK 20).
com.sun.source.tree (Java SE 20 & JDK 20).
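As a rough illustration of what these packages offer, the sketch below asks the JDK's own compiler to parse a declaration and collects the syntax errors it reports. It assumes a JDK 9 or later (a plain runtime without the compiler returns null from ToolProvider.getSystemJavaCompiler()). The class JavacSyntaxCheck, the StringSource helper, and the wrapper class name Snippet are hypothetical; the declaration is wrapped in a class because javac parses whole compilation units. The error texts come from javac, so they will not match the custom messages asked for in the question.

```java
import com.sun.source.util.JavacTask;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.tools.Diagnostic;
import javax.tools.DiagnosticCollector;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

/** Hypothetical helper that lets javac report syntax errors in a declaration. */
class JavacSyntaxCheck {
    /** In-memory source file, so nothing has to be written to disk. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Returns javac's syntax error messages for one field declaration. */
    static List<String> syntaxErrors(String declaration) {
        String code = "class Snippet { " + declaration + " }";
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler available (a JDK is required)");
        }
        DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, diagnostics, null, null,
                Collections.singletonList(new StringSource("Snippet", code)));
        try {
            // Parse only: builds the com.sun.source.tree syntax trees
            // without resolving names or checking types.
            task.parse();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> errors = new ArrayList<>();
        for (Diagnostic<? extends JavaFileObject> d : diagnostics.getDiagnostics()) {
            if (d.getKind() == Diagnostic.Kind.ERROR) {
                errors.add(d.getMessage(null));
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        for (String error : syntaxErrors("int 23jordan=23;")) {
            System.out.println(error);
        }
    }
}
```

JavacTask.parse() returns the parsed CompilationUnitTree objects, which can be walked with the visitor classes in com.sun.source.util if the tree itself is of interest rather than only the diagnostics.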

And, you can access the OpenJDK source code from GitHub.
GitHub – jdk/src/jdk.compiler/share/classes/com/sun/source.
GitHub – jdk/src/jdk.compiler/share/classes/com/sun/source/tree.