We have a keyword database, for example:
1 | MySQL
2 | Database Developer
3 | Software Engineer
4 | Director of Development
And we need a fast way to scan a large set of text (say a resume in base text) to match these exact words, with spaces.
Granted, most of the text we're scanning is well over 1,000 characters, it could be something like:
I am a Senior Software Engineer with Director of Development experience specializing in JavaScript, JSON, and Web Development. I also have Management experience and have worked in Banking, Cloud, and custom CRM development environments. I am also familiar with .NET and Kittens.
But they need to be exact matches; for example we want "Director of Development" to match, and not "Director" or "Development" by themselves.
Right now we are using REGEXP like this:
SELECT
*
FROM
sit_keyword sk
WHERE
@large_resume_text REGEXP CONCAT('\\b',sk.keyword,'\\b');
It works fine, but it's very slow, and has issues with + and -. But it works - matching the FULL TEXT of the keyword and not just pieces of it.
However, the fastest way to do this would be to use MATCH AGAINST. Note the sit_keyword table has a FULLTEXT index.
e.g.
SELECT
*
FROM
sit_keyword sk
WHERE
MATCH(sk.keyword) AGAINST (@large_resume_text);
The problem is the MATCH AGAINST system breaks the words from the MATCH(col1) up into separate words. For example "Software Engineer" will return a match against the text if either word "Software" or "Engineer" is in the AGAINST target. Even if we put quotes around the col1 keywords, it's the same. This is causing our MATCH AGAINST to pull out keywords that are not true matches.
How can we force MATCH AGAINST to match the entire MATCH(col1) value against the target text, and not the individual words in col1?
Notes:
the)+on other words.MATCHshould be fast, but (as you say) not adequate. Add theLIKEto filter out the rest.