Implementing a simple fulltext search I encountered a problem with the combination of boolean mode and phrases. Also worth noting is that the column has a binary collation (utf8_bin) whilst the table does not have this.
Given the following setup:
CREATE TABLE `test` (
`test_id` int(11) NOT NULL AUTO_INCREMENT,
`text_bin` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`test_id`),
FULLTEXT KEY `text_bin` (`text_bin`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `test` (`test_id`, `text_bin`) VALUES
(1, 'Lorem Ipsum Dolor Sit Amet.'),
(2, 'Consectetuer Adipiscing Elit.'),
(3, 'Amet Sit Dolor Ipsum Lorem.')
;
Then running this query:
SELECT t.test_id, t.text_bin,
MATCH(t.text_bin) AGAINST ('Lorem Ipsum' IN BOOLEAN MODE) as m_Words,
MATCH(t.text_bin) AGAINST ('"Lorem Ipsum"' IN BOOLEAN MODE) as m_Phrase,
MATCH(t.text_bin) AGAINST ('Lorem' IN BOOLEAN MODE) as m_Lorem,
MATCH(t.text_bin) AGAINST ('Ipsum' IN BOOLEAN MODE) as m_Ipsum
FROM test t
;
This yields the following results:
| test_id | text_bin | m_Words | m_Phrase | m_Lorem | m_Ipsum |
|---|---|---|---|---|---|
| 1 | Lorem Ipsum Dolor Sit Amet. | 0.0620 | 0 | 0.0310 | 0.0310 |
| 2 | Consectetuer Adipiscing Elit. | 0 | 0 | 0 | 0 |
| 3 | Amet Sit Dolor Ipsum Lorem. | 0.0620 | 0 | 0.0310 | 0.0310 |
(Note: I shortened the numbers to 4 decimal places for better readability.)
For the column m_Phrase I would expect a value greater then 0 on the first row. Is this a bug or can someone explain why the result is 0?
DB Fiddle: https://www.db-fiddle.com/f/8qxR3SiPVtESU3saebhgBG/0
My gut instinct is that this is to do with
spacedelimiter. The same happens withutf8_unicode_ciEg in this fiddle different collations give different results with both case sensitivity and matching of spaces as part of literals - https://www.db-fiddle.com/f/pi78uuA1RCFeToaRH9skcK/2
Update - Fiddle above updated with special chars (
.#*), which match phrases as expected.I don't know the solution to your problem other than use a different collation - just use
utf8_general_ciIt would be interesting to know your use case or even the performance considerations that you face that (assumably) prompt you to use
utf8_bin.Update 2:
This looks like a reported bug. Although, from 2018... One fix appears to be to downgrade to version
5.7.19.It doesn't appear as though there is a working latest version approach for both a FULLTEXT based query with literal phrases and case sensitivity. Very very strange. The simplest (not perfect) approach would be the prepend each word with a plus,
+Lorem +Ipsum, but would obviously lead to some false positives. (See above fiddle withm_allandm_AllLowerfields.