I'm playing with an integer input validator I decided to make while taking a C++ course. It's not an assignment, it's just a tangent I am exploring.
I thought it would be neat to be able to take all the strange stuff you can get from unsanitized input from various data sources and return an integer, so that sloppy input still works. I came up with this, and it works pretty well:
#include <cmath>
#include <iostream>
#include <regex>
#include <string>
using namespace std;
...
...
int validate_integer(const string& input) {
    return round(stof(regex_replace(input, regex(R"([^\-0-9.]+)"), "")));
}
<2112>, {2112} [(2112)], "2112,", 2112.0, 2112 are all parsed to 2112 and returned as an integer.
I just thought it would be nice to throw an exception if the input was too garbled to parse correctly. I thought it would be pretty easy to create a regular expression to match with something like if (regex_match....) {throw std::invalid_argument ...}, but I'm finding this is not that easy to construct. I've been trying out my tests on regexr.com (https://regexr.com/7t7bb).
Here are some of the things I have tried. I've been copying them and pasting them into a text editor every time I give up and start over, so some of these might be really goofy, as each expression is from the end of a tweak cycle, just before I gave up and started over again:
([0-9]+^0-9\.-\d+)
(.*[\.\-0-9.]*[^\d]+[^\d].*)
(.+|[\-\.0-9]+[^0-9]+(?![0-9]+))
((.+^[\-\.]*[0-9]+)|(^[\-\.]*[0-9]))+[^0-9]+[^0-9]+
(([^0-9]+[\-\.]*[0-9]+)|(^[\-\.]*[0-9]+)(?![^0-9]+[0-9]+))
.*[\d]+(?<![^0-9]+).*
These should be valid:
<2112>
[(2112)]
"2112,"
2112.0
-2112
<span style = "numeral">2112</span>
These may be valid or not valid, I'd prefer valid:
.2112 (should return 0)
21.12 (should return 21)
98.89 (should return 99)
These should NOT be valid:
21,12
"21","12,"
<span style = "font-size:18.0pt">2112</span>
21TwentyOne12
Am I even on the right track?
I would start by writing a grammar for the "forgiving parser" you are coding. It is not clear from your examples, for instance, whether <2112 is acceptable. Must the brackets be paired? Ditto for quotes, etc.
Assuming that brackets and quotes do not need to be paired, you might have the following grammar:
sign
    + | -
digit
    0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
unsigned-integer
    digit { digit }
signed-integer
    sign unsigned-integer
non-digit
    any-character-that-is-not-a-digit
non-sign-non-digit
    any-character-that-is-not-a-sign-or-digit
any-sequence-without-a-digit
    non-digit { non-digit }
prefix
    [ any-sequence-without-a-digit ]
prefix-not-ending-with-sign
    [ [ prefix ] non-sign-non-digit ]
suffix
    [ any-sequence-without-a-digit ]
forgiving-integer
    prefix signed-integer suffix
    | prefix-not-ending-with-sign unsigned-integer suffix
Notes:
| separates alternatives, exactly one of which must be chosen
{ } encloses items that may be repeated zero or more times
[ ] encloses optional items
One subtlety of this grammar is that integers can have only one sign. When more than one sign is present, all except the last are treated as part of the prefix, and, thus, are ignored.
Are the following interpretations acceptable? If not, then the grammar must be altered.
++42 parses as +42
--42 parses as -42
+-42 parses as -42
-+42 parses as +42
Another subtlety is that whitespace following a sign causes the sign to be treated as part of the prefix, and, thus, to be ignored. This is perhaps counterintuitive, and, frankly, may be unacceptable. Nevertheless, it is how the grammar works. (It would be relatively easy to erase or ignore whitespace following a sign, if that were desired.)
In the example below, the negative sign is ignored, because it is part of the prefix.
- 42 parses as 42
Some of these corner cases also cause problems for the regular expression given in the OP. See the output section below.
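To make the difference concrete, here is a small sketch that wraps the OP's one-liner so its behavior on these corner cases can be checked. The names op_validate_integer and op_throws are hypothetical, chosen only for this illustration; the regular expression and the stof/round conversion are taken from the question.

```cpp
#include <cmath>
#include <exception>
#include <regex>
#include <string>

// The OP's approach: delete every character that is not '-', a digit, or '.',
// then convert what is left with std::stof and round to the nearest integer.
// (op_validate_integer is a hypothetical name used only in this sketch.)
int op_validate_integer(const std::string& input) {
    static const std::regex junk(R"([^\-0-9.]+)");
    const std::string cleaned = std::regex_replace(input, junk, "");
    return static_cast<int>(std::round(std::stof(cleaned)));
}

// Hypothetical helper: true when the OP's function throws for this input.
bool op_throws(const std::string& input) {
    try {
        op_validate_integer(input);
        return false;
    } catch (const std::exception&) {
        return true;
    }
}
```

With this wrapper, "- 42" and "-+42" both come back as -42 (the grammar above gives 42 in both cases), because the space and the plus sign are simply deleted. "--42" is left as "--42", which std::stof rejects with std::invalid_argument. "21,12" and "21TwentyOne12" are silently accepted as 2112.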
A solution without std::regex
With a grammar in hand, it should be easier to figure out an appropriate regular expression.
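As a rough sketch of what such a regular expression might look like, the grammar can be collapsed into a single pattern: a lazy run of non-digits (the prefix), an optional sign directly followed by digits (captured), and a run of non-digits (the suffix). The function name parse_forgiving_integer and the choice of std::invalid_argument and std::stoi are assumptions made for this example, not something taken from the OP's code.

```cpp
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical regex-based take on the grammar above.
// [^0-9]*?             prefix: as few non-digits as possible, so that a sign
//                      directly in front of the digits is left for the capture
// ([+\-]?[0-9]+)       the optionally signed integer
// [^0-9]*              suffix: any characters, as long as none is a digit
int parse_forgiving_integer(const std::string& input) {
    static const std::regex pattern(R"([^0-9]*?([+\-]?[0-9]+)[^0-9]*)");
    std::smatch match;
    // regex_match requires the whole input to fit the pattern, so a second
    // run of digits anywhere in the input makes the match fail.
    if (!std::regex_match(input, match, pattern)) {
        throw std::invalid_argument("cannot parse an integer from: " + input);
    }
    // std::stoi accepts a leading '+' or '-' and throws std::out_of_range
    // when the digits do not fit in an int.
    return std::stoi(match[1].str());
}
```

The lazy quantifier on the prefix matters: a greedy prefix would swallow the sign, and -2112 would come back as 2112. Like the grammar, this pattern rejects 2112.0, because the trailing 0 is a second run of digits.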
My solution, however, is to avoid the inefficiencies of std::regex, in favor of coding a simple "parser."
In the following program, function validate_integer implements the foregoing grammar. When validate_integer succeeds, it returns the integer it parsed. When it fails, it throws a std::runtime_error.
Because validate_integer uses std::from_chars to convert the integer sequence, it will not convert the test case 2112.0 from the OP. The trailing .0 is treated as a second integer. All the other test cases work as expected.
The only tricky part is the initial loop that skips over non-numeric characters. When it encounters a sign (+ or -), it has to check the following character to decide whether the sign should be interpreted as the start of a numeric sequence. That is reflected in the "tricky" grammar for prefix and prefix-not-ending-with-sign given above.
Output
The "hole" in the output, below the entry for 2112.0, is the failed conversion of the null-string.
The value at the end of each line is what the regular expression given in the OP produces.
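For readers who want something to experiment with, here is a minimal sketch of a parser along the lines described above. It should be read as one possible reconstruction under stated assumptions rather than the exact program: the helper names is_digit and is_sign, the error messages, and the manual handling of a leading plus sign (std::from_chars accepts a minus sign but not a plus sign) are choices made for this example. It requires C++17.

```cpp
#include <cctype>
#include <charconv>
#include <stdexcept>
#include <string_view>
#include <system_error>

// Hypothetical helpers for this sketch.
inline bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
inline bool is_sign(char c) { return c == '+' || c == '-'; }

// Parses exactly one optionally signed integer surrounded by non-digit
// characters; throws std::runtime_error otherwise.
int validate_integer(std::string_view input) {
    const char* p = input.data();
    const char* const end = p + input.size();

    // Skip the prefix. Stop at a digit, or at a sign that is immediately
    // followed by a digit; any other sign is just part of the prefix.
    while (p != end && !is_digit(*p)) {
        if (is_sign(*p) && p + 1 != end && is_digit(p[1])) break;
        ++p;
    }
    if (p == end) throw std::runtime_error("no integer found");

    // std::from_chars understands '-' but not '+', so step over a plus sign.
    if (*p == '+') ++p;

    int value = 0;
    const auto [next, ec] = std::from_chars(p, end, value);
    if (ec != std::errc{}) throw std::runtime_error("integer out of range");

    // The suffix must not contain digits; otherwise there is a second integer.
    for (const char* q = next; q != end; ++q) {
        if (is_digit(*q)) throw std::runtime_error("more than one integer found");
    }
    return value;
}
```

In this sketch, validate_integer("--42") returns -42 and validate_integer("- 42") returns 42, matching the grammar, while "2112.0", "21,12" and the empty string all end in a std::runtime_error.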