Regex pattern for tweets

186 Views Asked by user1059114207 At 07 December 2022 at 01:24

I am building a tweet classification model, and I am having trouble finding a regex pattern that fits what I am looking for. What I want the regex pattern to pick up:

Any hashtags used in the tweet but without the hash mark (example - #omg to just omg)
Any mentions used in the tweet but without the @ symbol (example - @username to just username)
I don't want any numbers or any words containing numbers returned (this is the most difficult task for me)
Other than that, I just want all words returned

Thank you in advance if you can help

Currently I am using this pattern: r"(?u)\b\w\w+\b", but it is failing to remove numbers.


There are 1 best solutions below

dc-ddfe On 07 December 2022 at 02:54

This regex should work.

(#|@)?(?![^ ]*\d[^ ]*)([^ ]+)

Explanation:

(#|@)?: A 'hash' or 'at' character. Match 0 or 1 times.

(?!...): Negative lookahead. Check ahead to see if the pattern in the brackets matches. If it does, the match fails at this position.

[^ ]*\d[^ ]*: any number of non-space characters, followed by a digit, followed by any number of non-space characters. This is nested in the negative lookahead, so if a digit is found anywhere in the username, hashtag, or word, the match is rejected.

([^ ]+): One or more not space characters. The negative lookahead is a 0-length match, so if it passes, fetch the rest of the username/hashtag (Grouped with brackets so you can replace with $2).
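Since the question is tagged Python, here is a minimal sketch of how the pattern might be used with the re module. The function name extract_words and the sample tweet are made up for illustration. The leading (?<!\S) is an assumed addition, not part of the pattern above: without it, the engine can resume scanning after the last digit of a token and return its tail (for example "def" from "abc123def").

```python
import re

# The answer's pattern with one assumed addition: (?<!\S) at the front,
# so a match can only begin at the start of a whitespace-delimited token.
TOKEN = re.compile(r"(?<!\S)(#|@)?(?![^ ]*\d[^ ]*)([^ ]+)")


def extract_words(tweet):
    """Return tokens that contain no digits, minus a leading '#' or '@'."""
    # Group 2 holds the token text after the optional prefix.
    return [m.group(2) for m in TOKEN.finditer(tweet)]


sample = "Loving #omg with @username in 2022 b4 abc123def"  # made-up tweet
print(extract_words(sample))
# ['Loving', 'omg', 'with', 'username', 'in']
```

Punctuation stays attached to tokens with this approach, because [^ ]+ accepts anything except a space. If that is a problem, a variant of the question's own pattern such as r"\b[^\W\d]{2,}\b" may be worth trying; it drops the prefix symbols and any word containing a digit, but it is a different technique from the one in this answer.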
