How to use REGEXP_REPLACE to convert a verbal BOOLEAN collection of words to MATCH AGAINST format? [MySQL 8.0.36]

80 Views Asked by At

I need to convert a text query from standard "verbal" boolean to the FULLLTEXT MATCH AGAINST + and - system.

For example; this value:

((word1 AND word2) OR (word3 AND word4 AND word5)) OR word6 NOT word7

Needs to be converted to:

((+word1 +word2) (+word3 +word4 +word5) word6 -word7

The issue I'm having is when I try to use REGEXP_REPLACE with a pattern like this:

\\b([[:alnum:]_]+)\\b

I can't figure out how to have the function use the word that matched to append/amend to it. I've seen other examples trying to use \1 or something like that to represent the matching value, but it simply doesn't work.

So how can I command REGEXP_REPLACE to dynamically scan a string for "(word +word" and replace it with "(+word +word" ? Also needs to do a leading simple one like:

It's also a bit more complex, because if we have:

(word1 AND word2) AND (word3 OR word4)

That would need to be:

+(+word1 +word2) +(word3 word4)

Any guidance on the best way to do this without writing an ultra slow parser would be appreciated.

2

There are 2 best solutions below

0
Floobinator On BEST ANSWER

I figured out how to convert a AND/OR/NOT boolean statement like the below to a compatible MATCH AGAINST solution, and hope this helps other people.

This converts:

(word1 OR word2) OR (word3 AND word4 AND word5) OR word6 NOT word7

TO

(WORD1 WORD2) (+WORD3 +WORD4 +WORD5) WORD6 -WORD7

CODE:

SET in_text_search = '(word1 OR word2) OR (word3 AND word4 AND word5) OR word6 NOT word7';
SET text_reformatted = UCASE(TRIM(in_text_search));
SET text_reformatted = REPLACE(text_reformatted,'  ',' '); -- Strip extra spaces

-- Make sure the count of parentheses (if exist) is equal.
IF (INSTR(text_reformatted,'(') > 0 OR INSTR(text_reformatted,')') > 0) THEN
        SET count_character = LENGTH(text_reformatted) - LENGTH(REPLACE(text_reformatted,'(',''));
        IF (count_character <> (LENGTH(text_reformatted) - LENGTH(REPLACE(text_reformatted,')','')))) THEN
                -- TRIGGER ERROR 'imbalanced parentheses';
                LEAVE SP;
        END IF;
END IF;

SET text_reformatted = REPLACE(text_reformatted,' AND ',' +');
SET text_reformatted = REPLACE(text_reformatted,' OR ',' ');
SET text_reformatted = REPLACE(text_reformatted,' NOT ',' -');

-- Primary replace
SET text_reformatted = REGEXP_REPLACE(text_reformatted, '(?<![-+])\\b(\\w+)\\b(?=(?:\\s*\\+|\\s*$))', '+$1', 1, 0, 'c');

-- Cleanup the cart before the horse, for words like .NET etc.
SET text_reformatted = REPLACE(text_reformatted,'.+','+.');
SET text_reformatted = REPLACE(text_reformatted,'.-','-.');
0
Andrei Odegov On

In a way, this SQL-style solution was made for fun, but maybe it will help someone.

with RECURSIVE
  t1(in_text_search) as (
    values
      row('(word1 OR word2) and .net AND (word3 AND word4 AND word5) OR word6 NOT word7')
  ),
  r as (
    select
      0 as occ,
      cast('' as char(127)) as tok,
      in_text_search,
      0 as parentheses,
      cast('process' as char(30)) as status
    from t1
    union all
    select
      r.occ+1,
      case
        when t.tok = 'AND' then '+'
        when t.tok = 'NOT' then '-'
        when t.tok = 'OR' then ' '
        else coalesce(t.tok, '')
      end,
      r.in_text_search,
      p.parentheses,
      case
        when p.parentheses < 0 then 'imbalanced parentheses'
        when t.tok is null then
          case
            when p.parentheses != 0 then 'imbalanced parentheses'
            else 'done'
          end
        else 'process'
      end
    from r,
    LATERAL(
      select
        ucase(regexp_substr(
          r.in_text_search,
          '[()]|[^()[:space:]]+',
          1,
          r.occ+1
        )) as tok
    ) as t,
    LATERAL(
      select
        r.parentheses +
          case
            when t.tok like '(' then 1
            when t.tok like ')' then -1
            else 0
          end as parentheses
    ) as p
    where status = 'process'
  ),
  t2 as (
    select
      occ,
      tok,
      lead(tok, 2) over(order by occ) as tok_2nd,
      status
    from r
  ),
  t3 as (
    select
      occ,
      case
        when tok = '(' and tok_2nd in ('+', '-') then concat(tok, tok_2nd)
        when tok in ('+', '-') then concat(' ', tok)
        else tok
      end as tok,
      status
    from t2
  )
select
  GROUP_CONCAT(tok order by occ separator '') as text_reformatted
from t3
;
+----------------------------------------------------------+
|                     text_reformatted                     |
+----------------------------------------------------------+
| (WORD1 WORD2) +.NET +(+WORD3 +WORD4 +WORD5) WORD6 -WORD7 |
+----------------------------------------------------------+

db<>fiddle