What is the best approach for searching for keywords inside a field that contains a long text in an Elasticsearch index?
I have some words that I want to search inside a field named my_field with these constraints:
- I can pass the list of words as separate elements or together as a single string with a delimiter (like a space); what matters is that each one is searched.
- The words can contain typos or can be written in different ways; for example, OpenAI can be written as open ai or openai (in lowercase). I want all of these combinations to be searched, with exact matches prioritized in the results.
Let's take an example. My words are:
- cto
- open
- ai
So I can keep them separate or treat them as a single string "cto open ai", Google-search style. The words can also be:
- cto
- openai
because they come from an algorithm that extracts keywords from a text and may or may not split a single keyword into 2 "common" words.
The document I want as the first result has a my_field that contains a long text with: ".....cto.....open ai...". So I tried a match query, since I read that its fuzziness parameter controls the allowed Levenshtein distance.
With these 2 queries the result is found:
Query ok 1 (fuzziness 0 with 3 terms):✅
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "0" }}},
        { "match": { "my_field": { "query": "open", "fuzziness": "0" }}},
        { "match": { "my_field": { "query": "ai", "fuzziness": "0" }}}
      ],
      "minimum_should_match": 1
    }
  }
}
Query ok 2 (fuzziness 0 with 1 string):✅
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto open ai", "fuzziness": "0" }}}
      ],
      "minimum_should_match": 1
    }
  }
}
(even if I change the order of the words in the query).
But I want to find the same result even if:
- the text contains open ai
- my query has openai, because it's only a small change/typo.
So I tried with:
Query error 3 (fuzziness AUTO with 2 terms and typo):❌
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}},
        { "match": { "my_field": { "query": "openai", "fuzziness": "AUTO" }}}
      ],
      "minimum_should_match": 1
    }
  }
}
But it finds other results before it. The strange thing is that if I use the same query as in case 1, but with AUTO in place of 0, it also ranks other documents first, some of which may contain only 1 of the 3 words in my_field rather than all 3. I know that one document contains all 3 words exactly, so I don't understand why it is not prioritized:
Query error 4 (fuzziness AUTO with the 3 original terms that worked before with 0):❌
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}},
        { "match": { "my_field": { "query": "open", "fuzziness": "AUTO" }}},
        { "match": { "my_field": { "query": "ai", "fuzziness": "AUTO" }}}
      ],
      "minimum_should_match": 1
    }
  }
}
I also tried a mixed approach, giving a boost to the matches without "fuzziness": "AUTO", but with no luck:
Query error 5 (mixed fuzziness with 2 terms and typo):❌
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "boost": 10 }}},
        { "match": { "my_field": { "query": "openai", "boost": 10 }}},
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}},
        { "match": { "my_field": { "query": "openai", "fuzziness": "AUTO" }}}
      ],
      "minimum_should_match": 1
    }
  }
}
So how can I make a query that tolerates all of these typos/little changes and still prioritizes the documents that contain the possible combinations exactly?
I would index my_field twice: once as is, and a second time with an analyzer that first splits words on case changes and then combines adjacent words into bigrams using the shingle filter. At search time I would query both the original field and the bigrams field, giving the original field a higher boost.
There are different ways of doing this, depending on how many joined words you want to match, the boost level, etc., but hopefully this example will get you started:
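A minimal sketch of the two-field idea, written in Python only so it can be run without a cluster. The names split_on_case, joined_bigrams, shingle_analyzer and the my_field.shingles sub-field are made up for this sketch, and the boost of 2 is arbitrary. INDEX_BODY is what you would send when creating the index, and build_search returns the search body. simulate_tokens is only a rough local approximation of what such an analyzer might emit, not Elasticsearch's actual analysis.

```python
import json
import re

# Hypothetical index definition: all filter/analyzer/sub-field names are illustrative.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "filter": {
                # Split "OpenAI" into "Open" + "AI" at the case change.
                "split_on_case": {
                    "type": "word_delimiter_graph",
                    "split_on_case_change": True,
                },
                # Emit single words plus adjacent pairs glued together ("open" + "ai" -> "openai").
                "joined_bigrams": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 2,
                    "output_unigrams": True,
                    "token_separator": "",
                },
            },
            "analyzer": {
                "shingle_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["split_on_case", "lowercase", "joined_bigrams"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "my_field": {
                "type": "text",  # original field, default analyzer
                "fields": {
                    "shingles": {"type": "text", "analyzer": "shingle_analyzer"}
                },
            }
        }
    },
}


def build_search(keywords):
    """Search both fields; the original field gets the higher (arbitrary) boost."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"my_field": {"query": keywords, "boost": 2}}},
                    {"match": {"my_field.shingles": {"query": keywords}}},
                ],
                "minimum_should_match": 1,
            }
        }
    }


def simulate_tokens(text):
    """Rough approximation of shingle_analyzer output (assumption, for illustration)."""
    # Insert a space at lower->upper case changes, then lowercase every word.
    words = [w.lower() for w in re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text).split()]
    # Glue each pair of adjacent words together with no separator.
    bigrams = [a + b for a, b in zip(words, words[1:])]
    return set(words) | set(bigrams)


if __name__ == "__main__":
    print(json.dumps(build_search("cto openai"), indent=2))
    # Tokens shared by a document containing "cto open ai" and the query "cto openai".
    print(sorted(simulate_tokens("cto open ai") & simulate_tokens("cto openai")))
```

In this sketch the same analyzer is applied at index and search time, so a query for openai can meet the glued bigram produced from open ai in the document, and a query for open ai can meet a document that contains openai as one word. Exact word matches additionally score on the boosted my_field clause, which is what should push them to the top.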