Regex: match a pattern that contains only one set of numerals, and not more


I'm playing with an integer input validator I decided to make while taking a C++ course. It's not an assignment, it's just a tangent I am exploring.

I thought it would be neat to be able to take all the strange stuff you can get from unsanitized input from various data sources and return an integer. Something that allows sloppy input to still work. I came up with this, and it works pretty well:

#include <cmath>
#include <iostream>
#include <regex>
#include <string>

using namespace std;

...
...

int validate_integer(const string& input) {
        return round(stof(regex_replace(input, regex(R"([^\-0-9.]+)"), "")));
}

<2112>, {2112} [(2112)], "2112,", 2112.0, 2112 are all parsed to 2112 and returned as an integer.
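
To see what the filter actually hands to stof, it can help to look at the intermediate string on its own. A minimal sketch, using the same pattern as above (the helper name strip_non_numeric is made up for illustration):

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns what is left after the filter runs,
// i.e. the string that would be passed to stof().
std::string strip_non_numeric(const std::string& input) {
    static const std::regex junk(R"([^\-0-9.]+)");
    return std::regex_replace(input, junk, "");
}
```

With this, strip_non_numeric("[(2112)]") gives 2112, but strip_non_numeric("21,12") also gives 2112: two separate runs of digits are silently glued together, which is the case the validation needs to catch.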

I just thought it would be nice to throw an exception if the input was too garbled to parse correctly. I thought it would be pretty easy to create a regular expression to match with something like if (regex_match....) {throw std::invalid_argument ...}, but I'm finding this is not that easy to construct. I've been trying out my tests on regexr.com (https://regexr.com/7t7bb).

Here are some of the things I have tried. I've been copying them into a text editor every time I give up and start over, so some of these might be really goofy, since each expression is the end of a tweak cycle, just before I gave up and started over again:

([0-9]+^0-9\.-\d+)
(.*[\.\-0-9.]*[^\d]+[^\d].*)
(.+|[\-\.0-9]+[^0-9]+(?![0-9]+))
((.+^[\-\.]*[0-9]+)|(^[\-\.]*[0-9]))+[^0-9]+[^0-9]+
(([^0-9]+[\-\.]*[0-9]+)|(^[\-\.]*[0-9]+)(?![^0-9]+[0-9]+))
.*[\d]+(?<![^0-9]+).*

These should be valid:

<2112>
[(2112)]
"2112,"
2112.0
-2112
<span style = "numeral">2112</span>

These may be valid or not valid, I'd prefer valid:

.2112  (should return 0)
21.12  (should return 21)
98.89  (Should return 99)

These should NOT be valid:

21,12
"21","12,"
<span style = "font-size:18.0pt">2112</span>
21TwentyOne12

Am I even on the right track?


There are 2 best solutions below

tbxfreeware On

I thought it would be pretty easy to create a regular expression to match with something like if (regex_match....) {throw std::invalid_argument ...}, but I'm finding this is not that easy to construct.

I would start by writing a grammar for the "forgiving parser" you are coding. It is not clear from your examples, for instance, whether <2112 is acceptable. Must the brackets be paired? Ditto for quotes, etc.

Assuming that brackets and quotes do not need to be paired, you might have the following grammar:

sign

+ | -

digit

0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

unsigned-integer

digit  { digit }

signed-integer

sign  unsigned-integer

non-digit

any-character-that-is-not-a-digit

non-sign-non-digit

any-character-that-is-not-a-sign-or-digit

any-sequence-without-a-digit

non-digit  { non-digit }

prefix

[ any-sequence-without-a-digit ]

prefix-not-ending-with-sign

[ [ prefix ]  non-sign-non-digit ]

suffix

[ any-sequence-without-a-digit ]

forgiving-integer

prefix   signed-integer   suffix

prefix-not-ending-with-sign   unsigned-integer   suffix

Notes:
  • Items within square brackets are optional. They may appear either 0 or 1 time.
  • Items within curly braces are optional. They may appear 0 or more times.
  • Items separated by | are alternatives from which 1 must be chosen.
  • Items on separate lines are alternatives from which 1 must be chosen.

One subtlety of this grammar is that integers can have only one sign. When more than one sign is present, all except the last are treated as part of the prefix, and, thus, are ignored.

Are the following interpretations acceptable? If not, then the grammar must be altered.

  • ++42 parses as +42
  • --42 parses as -42
  • +-42 parses as -42
  • -+42 parses as +42

Another subtlety is that whitespace following a sign causes the sign to be treated as part of the prefix, and, thus, to be ignored. This is perhaps counterintuitive, and, frankly, may be unacceptable. Nevertheless, it is how the grammar works. (It would be relatively easy to erase or ignore whitespace following a sign, if that were desired.)

In the example below, the negative sign is ignored, because it is part of the prefix.

  • - 42 parses as 42
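
If joining a sign to the number that follows it were desired, one way would be a small pre-processing pass before validation. This is only a sketch under that assumption; the function name join_sign_to_number is hypothetical, and it drops whitespace after every sign, whether or not a digit follows:

```cpp
#include <cctype>
#include <string>

// Hypothetical pre-processing step: remove whitespace that directly
// follows a '+' or '-', so that "- 42" becomes "-42".
std::string join_sign_to_number(const std::string& s) {
    std::string out;
    out.reserve(s.size());
    bool after_sign = false;  // true while skipping blanks after a sign
    for (unsigned char c : s) {
        if (after_sign && std::isspace(c))
            continue;  // drop the blank, keep waiting for the next character
        out.push_back(static_cast<char>(c));
        after_sign = (c == '+' || c == '-');
    }
    return out;
}
```

Running the result through the validator would then let - 42 parse as -42 instead of 42.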

Some of these corner cases also cause problems for the regular expression given in the OP. See the output section below.

A solution without std::regex

With a grammar in hand, it should be easier to figure out an appropriate regular expression.
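
As a rough sketch of what such an expression might look like, the two alternatives of forgiving-integer can be transcribed more or less directly. The function name parse_forgiving_integer is made up for illustration, and the pattern is one possible transcription, not a tuned or definitive one:

```cpp
#include <regex>
#include <stdexcept>
#include <string>

// Sketch only: one possible regex for the grammar above.
//   alternative 1: prefix, signed-integer                        -> group 1
//   alternative 2: prefix-not-ending-with-sign, unsigned-integer -> group 2
// In both cases a digit-free suffix follows.
int parse_forgiving_integer(const std::string& s) {
    static const std::regex re(
        R"((?:[^0-9]*([+\-][0-9]+)|(?:[^0-9]*[^0-9+\-])?([0-9]+))[^0-9]*)");
    std::smatch m;
    if (!std::regex_match(s, m, re))
        throw std::runtime_error("validation failed");
    // std::stoi accepts a leading sign and throws std::out_of_range on overflow.
    return std::stoi(m[1].matched ? m[1].str() : m[2].str());
}
```

Because regex_match must consume the whole string, a second run of digits anywhere in the input (as in 21,12 or 2112.0) makes the match fail.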

My solution, however, is to avoid the inefficiencies of std::regex, in favor of coding a simple "parser."

In the following program, function validate_integer implements the foregoing grammar. When validate_integer succeeds, it returns the integer it parsed. When it fails, it throws a std::runtime_error.

Because validate_integer uses std::from_chars to convert the integer sequence, it will not convert the test case 2112.0 from the OP. The trailing .0 is treated as a second integer. All the other test cases work as expected.
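
To illustrate the point about std::from_chars: it stops at the first character that cannot be part of an integer and reports where it stopped. A small sketch (the helper leading_int is hypothetical and is not part of the program below):

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>
#include <utility>

// Illustration: parse a leading integer and return it together with
// whatever std::from_chars left unparsed.
std::pair<int, std::string_view> leading_int(std::string_view text) {
    int value{};
    auto const first = text.data();
    auto const last = text.data() + text.size();
    auto const [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{})
        return { 0, text };  // nothing parsed, or value out of range
    return { value, std::string_view(ptr, static_cast<std::size_t>(last - ptr)) };
}
```

For 2112.0 this yields the value 2112 and the remainder .0, and that remainder still contains a digit, which is why the suffix check rejects the input.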

The only tricky part is the initial loop that skips over non-numeric characters. When it encounters a sign (+ or -), it has to check the following character to decide whether the sign should be interpreted as the start of a numeric sequence. That is reflected in the "tricky" grammar for prefix and prefix-not-ending-with-sign given above.

// main.cpp
#include <cctype>
#include <charconv>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <regex>
#include <stdexcept>
#include <string>
#include <string_view>

std::string OP_validate_integer(const std::string& input)
try {
    double d = std::stof(std::regex_replace(input, std::regex(R"([^\-0-9.]+)"), ""));
    return std::to_string(static_cast<int>(std::round(d)));
}
catch (std::exception const& e) {
    return e.what();
}
bool is_digit(const unsigned char c) {
    return std::isdigit(c);
}
bool is_sign(const char c) {
    return c == '+' || c == '-';
}
int validate_integer(std::string const& s)
{
    enum : std::string::size_type { one = 1u };
    std::string::size_type i{};

    // skip over prefix
    while (i < s.length())
    {
        // stop at a digit, or at a sign that is immediately followed by a digit
        if (is_digit(s[i]) || (is_sign(s[i])
            && i + one < s.length()
            && is_digit(s[i + one])))
            break;
        ++i;
    }

    // throw if nothing remains
    if (i == s.length())
        throw std::runtime_error("validation failed");

    // parse integer
    // due to the foregoing checks, this can fail only if the value overflows int
    if (s[i] == '+')
        ++i;  // `std::from_chars` does not accept a leading plus sign.
    auto const first{ s.data() + i };
    auto const last{ s.data() + s.length() };
    int n;
    auto [end, ec] { std::from_chars(first, last, n) };
    if (ec != std::errc{})
        throw std::runtime_error("validation failed");
    i += end - first;

    // skip over suffix
    while (i < s.length() && !is_digit(s[i]))
        ++i;

    // throw if anything remains
    if (i != s.length())
        throw std::runtime_error("validation failed");

    return n;
}
void test(std::ostream& log, bool const expect, std::string s)
{
    std::streamsize const w{ 46 };
    try {
        auto n = validate_integer(s);
        log << std::setw(w) << s << " : " << std::setw(10) << n
            << ", OP : " << OP_validate_integer(s) << '\n';
    }
    catch (std::exception const& e) {
        log << std::setw(w) << s << " : " << e.what()
            << (expect ? "" : "  (as expected)")
            << ", OP : " << OP_validate_integer(s) << '\n';
    }
}
int main()
{
    auto& log{ std::cout };
    log << std::left;

    test(log, true, "<2112>");
    test(log, true, "[(2112)]");
    test(log, true, "\"2112, \"");
    test(log, true, "-2112");
    test(log, true, ".2112");
    test(log, true, "<span style = \"numeral\">2112</span>");
    log.put('\n');

    test(log, true, "++42");
    test(log, true, "--42");
    test(log, true, "+-42");
    test(log, true, "-+42");
    test(log, true, "- 42");
    log.put('\n');

    test(log, false, "2112.0");
    test(log, false, "");
    test(log, false, "21,12");
    test(log, false, "\"21\",\"12, \"");
    test(log, false, "<span style = \"font - size:18.0pt\">2112</span>");
    test(log, false, "Distance to sun = 9.3e+7 miles.");
    log.put('\n');

    return 0;
}
// end file: main.cpp

Output

The "hole" in the output, below the entry for 2112.0, is the failed conversion of the empty string.

The value at the end of each line is what the regular expression given in the OP produces.

// From the OP:
std::round(std::stof(std::regex_replace(input, std::regex(R"([^\-0-9.]+)"), "")));
<2112>                                         : 2112      , OP : 2112
[(2112)]                                       : 2112      , OP : 2112
"2112, "                                       : 2112      , OP : 2112
-2112                                          : -2112     , OP : -2112
.2112                                          : 2112      , OP : 0
<span style = "numeral">2112</span>            : 2112      , OP : 2112

++42                                           : 42        , OP : 42
--42                                           : -42       , OP : invalid stof argument
+-42                                           : -42       , OP : -42
-+42                                           : 42        , OP : -42
- 42                                           : 42        , OP : -42

2112.0                                         : validation failed  (as expected), OP : 2112
                                               : validation failed  (as expected), OP : invalid stof argument
21,12                                          : validation failed  (as expected), OP : 2112
"21","12, "                                    : validation failed  (as expected), OP : 2112
<span style = "font - size:18.0pt">2112</span> : validation failed  (as expected), OP : -18
Distance to sun = 9.3e+7 miles.                : validation failed  (as expected), OP : 9
Stephan Peters On

The Answer:

[^0-9.-]*[-.0-9]+[^0-9.-]+[-.0-9]+.*

Or, using the \d shorthand (std::regex accepts it in its default ECMAScript grammar, provided the backslash survives the string literal, for example in a raw string):

[^\d.-]*[-.\d]+[^\d.-]+[-.\d]+.*
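
A minimal sketch of the \d form used from C++, assuming the default ECMAScript grammar of std::regex and a raw string literal so the backslashes reach the regex engine. The function name has_second_numeral is made up for illustration, and the hyphens are escaped inside the brackets to stay on the safe side:

```cpp
#include <regex>
#include <string>

// Sketch: the rejection pattern written with \d.
// Returns true when the input contains a second numeric run after
// some non-numeric text, i.e. when it should be rejected.
bool has_second_numeral(const std::string& input) {
    static const std::regex re(R"([^\d.\-]*[\-.\d]+[^\d.\-]+[\-.\d]+.*)");
    return std::regex_match(input, re);
}
```

Here has_second_numeral("21,12") is true, while has_second_numeral("<2112>") is false.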

I had to add a second filter to handle a hyphen that is not intended to be a sign:

[^0-9-]*[-]+[^0-9]+.*

This will cause an exception with input like:

<span class="crimson-text">-2112</span>

It took me a while, but this works:

#include <cmath>
#include <iostream>
#include <regex>
#include <stdexcept>
#include <string>
using namespace std;
int validate_integer(const string& input);
...
...
int validate_integer(const string& input) {
    if(regex_match(input, regex("[^0-9.-]*[-.0-9]+[^0-9.-]+[-.0-9]+.*")) ||
       regex_match(input, regex("[^0-9-]*[-]+[^0-9]+.*"))) {
            throw std::invalid_argument("invalid input: " + input + " \n");
    }
    return round(stof(regex_replace(input, regex(R"([^\-0-9.]+)"), "")));
}

In the comments @xaxxon mentioned I should be using pcre (pcre2). This is definitely correct. I haven’t gotten it to work on my machine yet; I will have to investigate a compile-time error related to a macro definition. (As it turns out, std::regex's default ECMAScript grammar does support \d \w \s \D \W \S; in an ordinary string literal the backslashes have to be doubled, or a raw string literal used.) This is the test output:

The input was: "<2112>"; The parsed integer is 2112
The input was: "{2112}"; The parsed integer is 2112
The input was: "[(2112)]"; The parsed integer is 2112
The input was: ""2112,""; The parsed integer is 2112
The input was: "-2112"; The parsed integer is -2112                           
The input was: "<span style = "numeral">2112</span>"; The parsed integer is 2112
The input was: "yyz=2112"; The parsed integer is 2112
The input was: "The number is 2112."; The parsed integer is 2112
The input was: ".2112"; The parsed integer is 0    
The input was: "2112.0"; The parsed integer is 2112
The input was: "21.12"; The parsed integer is 21
The input was: "98.89"; The parsed integer is 99
The input was: "21,12"; Threw an invalid argument exception
The input was: ""21","12","; Threw an invalid argument exception
The input was: "<span style = "font-size:18.0pt">2112</span>"; Threw an invalid argument exception
The input was: "21TwentyOne12"; Threw an invalid argument exception

That said, a parser like the one @tbxfreeware wrote for this thread would be a better solution in many situations, for example if the brackets need to be balanced.
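
For the balanced-brackets case, a regex is not really the right tool, but a small stack-based check is. A hedged sketch, assuming the bracket pairs of interest are (), [], {} and <> (the function name brackets_balanced is hypothetical):

```cpp
#include <stack>
#include <string>

// Hypothetical check: true when every opening bracket is closed by the
// matching closer, in the right order. All other characters are ignored.
bool brackets_balanced(const std::string& s) {
    std::stack<char> expected;  // closers we still need to see
    for (char c : s) {
        switch (c) {
        case '(': expected.push(')'); break;
        case '[': expected.push(']'); break;
        case '{': expected.push('}'); break;
        case '<': expected.push('>'); break;
        case ')': case ']': case '}': case '>':
            if (expected.empty() || expected.top() != c)
                return false;  // closer without a matching opener
            expected.pop();
            break;
        default:
            break;
        }
    }
    return expected.empty();  // leftover openers mean imbalance
}
```

Such a check could run before the numeric validation, so that [(2112)] is accepted while <2112 is rejected.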