The code I've written is in Python, but the problem isn't Python-specific.

Effectively I'd like to check if a one line string that is meant to represent a mathematical expression comes out to be valid. I have heard of the Shunting Yard Algorithm and I intend to attempt my own version of it after I get over my initial problem which is potential whitespace between operators.

The approach I am taking is to go through character by character and tokenize them based on whether they are valid mathematical symbols or not. Parentheses get special treatment.

Overall, the approach I came up with does work, but I feel there could be some improvements or optimizations.

Step 1: Get the string into a state that can be tokenized

The symbols I want to use are the basic arithmetic operations: +-*/ as well as % for modular division, ^&| for bitwise, and later << and >> for shifting.

I also want to allow functions and user-defined variables in the string so text is also valid.

Examples of valid input with operations, functions, and vars:

1+sqrt(2)*abc
(sin(3)-cos(4))/def
  • The first one would tokenize into: 1, +, sqrt(, 2, ), *, abc
  • The second one would tokenize into: (, sin(, 3, ), -, cos(, 4, ), ), /, def

Whitespace complicates things

As stated, the current issue I'm facing is how to deal with whitespace. Using the above examples, the following would also be valid and would evaluate to the previous example after stripping whitespace between symbols:

1 + sqrt(2) * abc
( sin(3) - cos(4) ) / def

However, doing something like this would not due to the space between function name and parentheses:

1 + sqrt (2) * abc

Also, dealing with wrong input like:

1 + sq rt(2)
1 + + sqrt(2)
  • The first would evaluate to a function called sq rt() and would be flagged later as invalid
  • The second would create + + and that too would be flagged later as invalid

Current Process

    val = val.strip()
    
    mathTokens = ('+', '-', '*', '/', '%', '^', '&', '|', '~', '<' , '>')
    token = ''
    tokens = []

    mathMode = bool(val[0] in mathTokens)

    for j in range(0, len(val)):
        char = val[j]

        #Left paren
        if(char == '('):
            if(mathMode):
                token = token.strip()
                tokens.append(token)
                tokens.append(char)

            else:
                token += char
                tokens.append(token)

            token = ''
            continue
    
        #Right paren
        if(char == ')'):
            token = token.strip()
            tokens.append(token)
            tokens.append(char)
            token = ''
            continue

        #All others
        if(char == ' ' or (mathMode and char in mathTokens) or (not mathMode and char not in mathTokens)):
            token += char

        else:
            token = token.strip()

            #If a series of spaces only was found, don't append
            if(token):
                tokens.append(token)
            
            token = char
            mathMode = not mathMode
    
    #Append the last token unless it's spaces only (e.g. after right parentheses)
    token = token.strip()
    if(token):
        tokens.append(token)

    print(tokens)

Sample Input and Result

Here's some sample input (complete with spaces) which is in val in the code above:

1 + sqrt(2) * abc

This gets stored into the tokens list as:

[
    '1', 
    '+', 
    'sqrt(', 
    '2', 
    ')', 
    '*', 
    'abc'
]

Afterwards

Ultimately, after parsing and figuring out if it's a valid expression then the end result will be fully calculated. The calculation is going to be the second part of this. For the calculation, the Shunting Yard algorithm is what I'm thinking of going with. In general, from what I gather, I can take my tokens and evaluate accordingly.

Conclusion

As I said at the beginning, this approach does work but constantly doing strip() and looking up the symbols feels like it could be slow in situations where there potentially are lots of values to evaluate. A regex approach sounds like it'd complicate things way too much and I'm not sure if that'd even offer better performance.

Any chance there might be any algorithms for this particular situation and selectively purging whitespace?

I'm also wondering if anything should be changed for potential negative numbers as well.

That's my situation. Insight appreciated, thanks.

1

There are 1 best solutions below

0
flyakite On

If you only aim at checking the validity, one option is to use Abstract Syntax Trees to check the validation of e.g. mathematical equations:

import ast

def checkEquation(equation):
    try:
        ast.parse(equation)
        return 'valid'
    except SyntaxError:
        return 'invalid'

print( checkEquation('-1 - sqrt(2) * abc') )   #output: valid
print( checkEquation('-1 - sqrt(2 * abc') )    #output: invalid