Computational complexity for regular expressions


Regular expressions rapidly become too complex (for me) to understand. Even something as simple as [ab][cd] has several logical branches. My goal is to improve the maintainability of our code base, so answers to these questions could help us detect and fix complex code:

  1. Are there computational complexity metrics (similar to cyclomatic complexity) that include the complexity inherent in regular expressions?
  2. Are there any tools that produce a complexity number for regular expressions?
  3. Are there tools that can suggest simplifications of regular expressions?

2 Answers

Dirk Herrmann

You might take the compiled form of the regexp and map code complexity metrics onto it, such as lines of code or cyclomatic complexity. To see what I mean, look at the following Stack Overflow answer, which shows how you can access the compiled form of a regexp in Perl: https://stackoverflow.com/a/2348725/5747415. Another example is shown here: http://perldoc.perl.org/perldebguts.html#Debugging-Regular-Expressions; quoting the tool output from that page:

Compiling REx '[bc]d(ef*g)+h[ij]k$'
1: ANYOF[bc](12)
12: EXACT <d>(14)
14: CURLYX[0] {1,32767}(28)
16:   OPEN1(18)
18:     EXACT <e>(20)
20:     STAR(23)
21:       EXACT <f>(0)
23:     EXACT <g>(25)
25:   CLOSE1(27)
27:   WHILEM[1/1](0)
28: NOTHING(29)
29: EXACT <h>(31)
31: ANYOF[ij](42)
42: EXACT <k>(44)
44: EOL(45)
45: END(0)
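
If you work in Python rather than Perl, a comparable dump is available through the re.DEBUG flag, which prints the parsed (and, in recent versions, compiled) pattern during compilation. The following sketch is only illustrative (the function name and the line-count metric are my own); it captures that dump and counts its lines as a rough "lines of compiled code" figure:

import io
import re
from contextlib import redirect_stdout

def regex_debug_dump(pattern):
    """Return the textual dump that re.DEBUG prints while compiling."""
    buf = io.StringIO()
    with redirect_stdout(buf):          # the dump is written to stdout
        re.compile(pattern, re.DEBUG)
    return buf.getvalue()

dump = regex_debug_dump(r"[bc]d(ef*g)+h[ij]k$")
print(dump)
# A crude metric in the spirit of "lines of compiled code":
print("dump lines:", len(dump.strip().splitlines()))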

By the way, congratulations on the decision to improve code maintainability. That said, I have to express my doubts that any formal metric provides better guidance than (or can even come close to) the judgement of experienced developers...

Patrick87

If you are dealing with a regular expression system equivalent to formal regular expressions (which describe regular languages: no counting, no lookbehind, no matching pairs of parentheses, etc.), or if you will only use the features of that fragment even though your regex system can describe non-regular languages, then there is a precise notion of complexity (or at least you could derive one), and there is a certain sense in which regular expressions can be "minimized".

By the Myhill-Nerode theorem, all regular languages have a finite number of equivalence classes under the indistinguishability relation on strings. These equivalence classes correspond directly to states in a minimal deterministic finite automaton for the regular language. You could take the number of states of a minimal deterministic finite automaton for the language to be the "fundamental" complexity of the language itself.
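
To make the state-counting idea concrete, here is a small Python sketch (the DFA encoding, function name and example are my own illustration, not taken from any library). It computes the Myhill-Nerode equivalence classes of a DFA's reachable states by partition refinement and returns their number; applied to a hand-written DFA for the question's [ab][cd], it reports 4 states, including the dead state:

def minimal_state_count(alphabet, delta, start, accepting):
    # Number of Myhill-Nerode classes = number of states of the minimal DFA.
    # Restrict to states reachable from the start state.
    reachable, frontier = {start}, [start]
    while frontier:
        q = frontier.pop()
        for a in alphabet:
            r = delta[(q, a)]
            if r not in reachable:
                reachable.add(r)
                frontier.append(r)
    # Start from the accepting / non-accepting split and refine blocks
    # until no block is split by where its states go under some symbol.
    partition = [b for b in (reachable & accepting, reachable - accepting) if b]
    changed = True
    while changed:
        changed = False
        refined = []
        for block in partition:
            groups = {}
            for q in block:
                key = tuple(next(i for i, b in enumerate(partition)
                                 if delta[(q, a)] in b) for a in alphabet)
                groups.setdefault(key, set()).add(q)
            refined.extend(groups.values())
            changed = changed or len(groups) > 1
        partition = refined
    return len(partition)

# Hand-built DFA for [ab][cd] over the alphabet {a, b, c, d}:
alphabet = "abcd"
delta = {}
for a in alphabet:
    delta[("start", a)] = "got_first" if a in "ab" else "dead"
    delta[("got_first", a)] = "accept" if a in "cd" else "dead"
    delta[("accept", a)] = "dead"
    delta[("dead", a)] = "dead"
print(minimal_state_count(alphabet, delta, "start", {"accept"}))  # prints 4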

There are algorithms which can take you from a (formal) regular expression to a minimal deterministic finite automaton, and then back again to a regular expression. Doing this should give you a canonical regular expression for every regular language. I imagine - but have not worked out a proof - that the process of producing a regular expression from a minimal deterministic finite automaton could be modified so that it produces the shortest possible regular expression (in terms of number of operations).

The complexity of the language could be the number of operations in such a canonical regular expression. The actual complexity of any given regular expression could be the number of operations in it. The ratio could give you a sense of how "inefficient" or "unnecessarily complex" your regular expression is.
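
As a toy illustration of that ratio (my own crude approximation, which ignores escaping, character classes and implicit concatenation), one could count explicit operators in an expression and compare against a hand-picked compact form of the same language:

def operation_count(regex):
    # Very rough count of explicit regex operations: union, star, plus, optional.
    return sum(regex.count(op) for op in "|*+?")

canonical = "[ab][cd]"            # compact form of the question's example
verbose = "ac|ad|bc|bd"           # same language, spelled out
print(operation_count(verbose), operation_count(canonical))      # 3 0
print(operation_count(verbose) / max(operation_count(canonical), 1))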

If you really need the non-regular features of regexes, then you are out of luck; there is no notion of computable minimization for richer language classes. You can invent complexity metrics all day, but you'll never get a general algorithmic answer to "how inefficient is this compared to the baseline?" Another way to say what I mean is this: making a cake might be harder than making popcorn, but if you need a cake, you have to expend the extra effort to get what you need.