MySQL - MATCH Keywords with Spaces AGAINST Large Text

797 Views Asked by At

We have a keyword database, for example:

1 | MySQL
2 | Database Developer
3 | Software Engineer
4 | Director of Development

And we need a fast way to scan a large set of text (say a resume in base text) to match these exact words, with spaces.

Granted, most of the text we're scanning is well over 1,000 characters, it could be something like:

I am a Senior Software Engineer with Director of Development experience specializing in JavaScript, JSON, and Web Development. I also have Management experience and have worked in Banking, Cloud, and custom CRM development environments. I am also familiar with .NET and Kittens.

But they need to be exact matches; for example we want "Director of Development" to match, and not "Director" or "Development" by themselves.

Right now we are using REGEXP like this:

SELECT
    *
FROM
    sit_keyword sk
WHERE
    @large_resume_text REGEXP CONCAT('\\b',sk.keyword,'\\b');

It works fine, but it's very slow, and has issues with + and -. But it works - matching the FULL TEXT of the keyword and not just pieces of it.

However, the fastest way to do this would be to use MATCH AGAINST. Note the sit_keyword table has a FULLTEXT index.

e.g.

SELECT
    *
FROM
    sit_keyword sk
WHERE
    MATCH(sk.keyword) AGAINST (@large_resume_text);

The problem is the MATCH AGAINST system breaks the words from the MATCH(col1) up into separate words. For example "Software Engineer" will return a match against the text if either word "Software" or "Engineer" is in the AGAINST target. Even if we put quotes around the col1 keywords, it's the same. This is causing our MATCH AGAINST to pull out keywords that are not true matches.

How can we force MATCH AGAINST to match the entire MATCH(col1) value against the target text, and not the individual words in col1?

1

There are 1 best solutions below

8
Rick James On
WHERE MATCH(resume) AGAINST ("+Director +Development" IN BOOLEAN MODE)
  AND resume LIKE "%Director of Development%"

Notes:

  • Remove words shorter than the min word length established for your FULLTEXT index.
  • Remove any "stop" words (eg, the)
  • Put + on other words.
  • The MATCH should be fast, but (as you say) not adequate. Add the LIKE to filter out the rest.