Loose Regex for capturing Phone number

I am trying to build a feature that goes through messages between users and stores all U.S. phone numbers that may have been shared in a message. I want to be very loose about which phone numbers I store. To do that, I came up with the following regex in PHP (explained piece by piece below):

"/(?:\+?1[.\s-]*)?(?:\(?\d{1,3}\)[.\s-]*)?(?:\d{3}[.\s-]+)(?:\d{4}[.\s-]*)(?:(ext|ext\.|Ext|Ext\.|extension|Extension)?[.\s-]*\d{1,6})?|(?:\+?1?\d{10})/",

(?:\+?1[.\s-]*)?: This part handles an optional country code (+1) with an optional separator (dot, space, or hyphen). It's optional because I want to capture phone numbers without the country code as well.

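(?:\(?\d{1,3}\)[.\s-]*)?: This part handles an optional area code of one to three digits enclosed in parentheses (as written, the opening parenthesis is optional and the closing one is required), followed by an optional separator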

(?:\d{3}[.\s-]+): This part matches the first three digits of the phone number followed by a separator (can be '.' '-' or spaces)

(?:\d{4}[.\s-]*): This part matches the next four digits of the phone number followed by an optional separator (can be '.' '-' or spaces)

(?:(ext|ext\.|Ext|Ext\.|extension|Extension)?[.\s-]*\d{1,6})?: This part captures an optional extension: an optional label (ext, ext., or extension, with a lowercase or capitalized first letter), an optional separator, and one to six digits.

|: This is an alternation operator, allowing the regular expression to match either the pattern before or after it.

(?:\+?1?\d{10}): This part handles an alternative pattern for phone numbers without explicit separators, where there could be an optional country code (+1) and 10 digits.

However, this regex also finds a match in the following non-U.S. number:

+44 20 7123 4567 (the match is 123 4567)
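
A minimal way to reproduce this, assuming the pattern is run with preg_match_all() (the sample message text is made up for illustration):

```php
<?php
// The regex from the question, unchanged.
$pattern = '/(?:\+?1[.\s-]*)?(?:\(?\d{1,3}\)[.\s-]*)?(?:\d{3}[.\s-]+)(?:\d{4}[.\s-]*)(?:(ext|ext\.|Ext|Ext\.|extension|Extension)?[.\s-]*\d{1,6})?|(?:\+?1?\d{10})/';

// A UK-style number that should not be captured.
$message = 'Call me on +44 20 7123 4567';

preg_match_all($pattern, $message, $matches);

// $matches[0] holds the full-match strings.
var_dump($matches[0]); // contains the single entry "123 4567"
```

The engine cannot match at the + or at 44, 20, or 7, so it moves on and finds a seven-digit match starting at 123.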

What should I use to avoid capturing this?

Valentin Marolf On

Not sure if this matches all your cases, but if you add (?!\+\d{0,2}[^1]) at the beginning, you can ensure that the string doesn't start with a + symbol followed by up to 2 digits and a character other than 1.

ThW On

It might be possible inside the regular expression, but why not just filter the result in PHP? Not everything has to be solved with a single regular expression.

One problem here is that a lookbehind assertion (aka "(not) prefixed by ...") needs to have a fixed length, but a country code can have different lengths.
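
As a rough illustration of that limitation, the sketch below tries a lookbehind with an unbounded \d+ for the country code. The pattern and the sample input are hypothetical and only serve to show the compile failure; how bounded variants behave may depend on the PCRE version behind your PHP build.

```php
<?php
// Hypothetical pattern: reject a 10-digit number preceded by "+<digits> ".
// The \d+ inside the lookbehind has no upper bound, so PCRE refuses to compile it.
$pattern = '/(?<!\+\d+ )\d{10}/';

// @ suppresses the compilation warning so the return value can be inspected.
$result = @preg_match($pattern, '+44 2071234567');

// preg_match() returns false when the pattern fails to compile.
var_dump($result === false);
```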

I would suggest first matching anything that could be a phone number. This consumes the characters that would otherwise produce partial matches. Then iterate over the matches and use a specific pattern to check whether each one is a US phone number in any variant you require.

Note: In the following example I am using the x (Extended) modifier. This makes it possible to format, indent, and comment the pattern.

$patternPhoneMaybe = '(
    # optional prefixing +
    \+?
    # digits and separator characters
    (?:
      \d+[- .]*
      |
      \(\d+\)[- .]*
    )+
    # optional extension  
    (?:
      (?:[eE]xt(?:[.]|ension)\s*)
      \d{1,6} 
    )? 
)x';

if (preg_match_all($patternPhoneMaybe, getData(), $matches)) {
    $filtered = array_filter(
        array_map(fn($match) => parseNumberMatch($match), $matches[0]),
        'is_array'
    );
    var_dump($filtered);
}


function parseNumberMatch(string $input): ?array {

    $patternPhoneUS = '(
        ^
        # optional country code
        (?<country>(?:00|\+)1)? 
        # optional separator
        [- .]? 
         # area code
        (?<area>\(\d{3}\)|\d{3})
        # optional separator
        [- .]? 
        # 7 digit phone number with optional separator
        (?<number>
          \d{3}
          [- .]?
          \d{4}
        )
        # optional extension  
        (?: 
          [- ] # mandatory separator
          (?:[eE]xt(?:[.]|ension)\s*)?
          (?<extension>\d{1,6}) 
        )?
        $
    )x';

    if (preg_match($patternPhoneUS, trim($input), $match)) {
        return $match;
    }
    
    return null; 
    
}


function getData(): string {
    return <<<'TEXT'
US

+1 718 123 4567
+1 (718) 123-4567
+17181234567
001 (718) 123 4567

Other Country

+23 (718) 123 4567
0023 (718) 123 4567

+1 (718) 123 4567-1
+1 (718) 123 4567-123456
+1 (718) 123 4567 ext.1
+1 (718) 123 4567 Extension 1

US Variants

2124567890
212-456-7890
(212)456-7890
(212)-456-7890
212.456.7890
212 456 7890
+12124567890
+12124567890
+1 212.456.7890

Other Numbers

718
123.45

TEXT;
}
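
parseNumberMatch() returns the raw preg_match() array, which holds every named group twice (once by name, once by index). If only the named parts are wanted, a small helper along these lines could filter it. Both namedGroups() and the simplified pattern are assumptions for illustration, not part of the code above.

```php
<?php
// Hypothetical helper: keep only the non-empty named groups of a preg_match() result.
function namedGroups(array $match): array {
    return array_filter(
        $match,
        fn($value, $key) => is_string($key) && $value !== '',
        ARRAY_FILTER_USE_BOTH
    );
}

// Simplified US pattern, for illustration only.
$pattern = '/^(?<country>\+1)?[- .]?(?<area>\d{3})[- .]?(?<number>\d{3}[- .]?\d{4})$/';

preg_match($pattern, '+1 718 123 4567', $match);
$parts = namedGroups($match);

var_dump($parts); // keys: country, area, number
```

Dropping empty strings also removes optional groups that did not take part in the match, such as a missing country code.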