I am building a feature that goes through messages between users and stores every U.S. phone number that may have been shared in a message. I want to be very loose about which phone numbers I store. To do that, I came up with the following regex in PHP (explanation given below):
"/(?:\+?1[.\s-]*)?(?:\(?\d{1,3}\)[.\s-]*)?(?:\d{3}[.\s-]+)(?:\d{4}[.\s-]*)(?:(ext|ext\.|Ext|Ext\.|extension|Extension)?[.\s-]*\d{1,6})?|(?:\+?1?\d{10})/",
(?:\+?1[.\s-]*)?: This part handles an optional country code (+1) with an optional separator (dot, space, or hyphen). It's optional because I want to capture phone numbers without the country code as well
(?:\(?\d{1,3}\)[.\s-]*)?: This part handles an optional area code enclosed in parentheses
(?:\d{3}[.\s-]+): This part matches the first three digits of the phone number followed by a separator (can be '.' '-' or spaces)
(?:\d{4}[.\s-]*): This part matches the next four digits of the phone number followed by an optional separator (can be '.' '-' or spaces)
(?:(ext|ext\.|Ext|Ext\.|extension|Extension)?[.\s-]*\d{1,6})?: This part captures an optional extension (keyword in lowercase or capitalized) with an optional separator and up to six digits.
|: This is an alternation operator, allowing the regular expression to match either the pattern before or after it.
(?:\+?1?\d{10}): This part handles an alternative pattern for phone numbers without explicit separators, where there could be an optional country code (+1) and 10 digits.
However, this regex also finds a match in the following string:
+44 20 7123 4567 (the match is 123 4567)
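For reference, here is a minimal script that reproduces this. The variable names are only for illustration, and the pattern is the one above, unchanged.

```php
<?php
// The pattern from the question, unchanged.
$pattern = '/(?:\+?1[.\s-]*)?(?:\(?\d{1,3}\)[.\s-]*)?(?:\d{3}[.\s-]+)(?:\d{4}[.\s-]*)(?:(ext|ext\.|Ext|Ext\.|extension|Extension)?[.\s-]*\d{1,6})?|(?:\+?1?\d{10})/';

// A UK number that should not be stored.
$message = '+44 20 7123 4567';

// preg_match_all scans the whole message, so a match can begin
// in the middle of a longer run of digits.
preg_match_all($pattern, $message, $matches);

// $matches[0] holds the full-match strings.
print_r($matches[0]);
```

The single entry printed is 123 4567: the leading 7 is skipped and the match starts at the following 1.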
What should I use to avoid capturing this?
Not sure if this matches all your cases, but if you add
(?!\+\d{0,2}[^1]) at the beginning, you can ensure that the string doesn't start with a + symbol followed by up to 2 digits and a character other than 1.
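One thing worth checking with this approach: a lookahead at the start of the pattern only guards the position where a match attempt begins. When the regex is run unanchored over a whole message, the engine can still begin a match later in the number. Below is a minimal sketch that compares the lookahead with a lookbehind variant. The lookbehind (?<![\d+]) is my own assumption, not part of the suggestion above, and the variable names are hypothetical.

```php
<?php
// Body of the question's pattern, without delimiters.
$body = '(?:\+?1[.\s-]*)?(?:\(?\d{1,3}\)[.\s-]*)?(?:\d{3}[.\s-]+)(?:\d{4}[.\s-]*)(?:(ext|ext\.|Ext|Ext\.|extension|Extension)?[.\s-]*\d{1,6})?|(?:\+?1?\d{10})';

// Suggested lookahead, prepended as-is (it only guards the first alternative).
$withLookahead = '/(?!\+\d{0,2}[^1])' . $body . '/';

// Assumed variant: reject any match that starts right after a digit or a "+".
// The extra (?: ... ) makes the lookbehind apply to both alternatives.
$withLookbehind = '/(?<![\d+])(?:' . $body . ')/';

$message = '+44 20 7123 4567';

preg_match_all($withLookahead, $message, $a);
preg_match_all($withLookbehind, $message, $b);

print_r($a[0]); // still contains "123 4567"
print_r($b[0]); // empty for this message
```

With the sample string, the first call still returns 123 4567 and the second returns nothing. The lookbehind only rejects matches that start right after a digit or a +, so it may still need tuning for other international formats.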