ElasticSearch search list of keywords with possible typos

Question

Asked by Paolo Magnani on 27 September 2023 at 15:03 (57 views)

What is the best approach for searching for some keywords inside a field that contains a large text in an Elasticsearch index?

I have some words that I want to search inside a field named my_field with these constraints:

I can pass the list of words either as separate elements or together as a single string with a delimiter (such as a space); what matters is that each word is searched.
The words can contain typos or be written in different ways: OpenAI can also appear as open ai or openai (in lowercase). I want all of these variants to match, with exact matches ranked first.

Let's make an example. My words are:

cto
open
ai

So I can keep them separate or treat them as a single string "cto open ai", Google-search style. The words can also be:

cto
openai

because they come from an algorithm that extracts keywords from a text, and it may or may not split a single keyword into two "common" words.

The document I want as the first result has a my_field containing a long text like ".....cto.....open ai...". So I tried a match query, since I read that its fuzziness parameter controls the Levenshtein distance.

With these 2 queries the result is found:

Query ok 1 (fuzziness 0 with 3 terms):✅

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "0" }}}, 
        { "match": { "my_field": { "query": "open", "fuzziness": "0"  }}},
        { "match": { "my_field": { "query": "ai", "fuzziness": "0"  }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

Query ok 2 (fuzziness 0 with 1 string):✅

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto open ai", "fuzziness": "0" }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

(even if I change the order of the words in the query).

But I want to find the same result even if:

the text contains open ai
my query has openai, since that is only a small change/typo.

So I tried with:

Query error 3 (fuzziness AUTO with 2 terms and typo):❌

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}}, 
        { "match": { "my_field": { "query": "openai", "fuzziness": "AUTO"  }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

But it returns other results before the expected one. Strangely, if I reuse the query from case 1 with AUTO in place of 0, it also ranks other documents first, some of which may contain only 1 of the 3 words in my_field rather than all 3. I know that one document contains all 3 words exactly, so I don't understand why it is not prioritized:

Query error 4 (fuzziness AUTO with the 3 original terms that worked before with 0):❌

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}}, 
        { "match": { "my_field": { "query": "open", "fuzziness": "AUTO"  }}},
        { "match": { "my_field": { "query": "ai", "fuzziness": "AUTO"  }}}
      ],
      "minimum_should_match" : 1
    }
  }
}
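
One likely contributor to the extra hits is how AUTO is resolved: Elasticsearch allows 0 edits for terms of 1-2 characters, 1 edit for terms of 3-5 characters, and 2 edits for longer terms. The Python sketch below is only an illustration under assumptions: it uses plain Levenshtein distance (Elasticsearch additionally counts a transposition of two adjacent characters as a single edit by default), and the helper names and the sample term list are made up for this example.

```python
def auto_max_edits(term):
    """Edits allowed by fuzziness AUTO (default thresholds 3 and 6)."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def fuzzy_hits(term, indexed_terms):
    """Hypothetical helper: indexed terms within the AUTO edit budget."""
    limit = auto_max_edits(term)
    return [t for t in indexed_terms if levenshtein(term, t) <= limit]


# Hypothetical terms that might exist in the index for my_field.
terms = ["cto", "ceo", "to", "open", "ai", "opened", "openapi"]
print(fuzzy_hits("cto", terms))     # ['cto', 'ceo', 'to']
print(fuzzy_hits("openai", terms))  # ['open', 'opened', 'openapi']
print(fuzzy_hits("ai", terms))      # ['ai']
```

Under these assumptions cto also matches ceo and to, and openai is within two edits of open, so documents that contain only one loosely related word can enter the result set and compete on score.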

I also tried a mixed approach, giving a boost to the match clauses without "fuzziness": "AUTO", but with no luck:

Query error 5 (mixed fuzziness with 2 terms and typo):❌

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "boost": 10 }}}, 
        { "match": { "my_field": { "query": "openai", "boost": 10  }}},
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}}, 
        { "match": { "my_field": { "query": "openai", "fuzziness": "AUTO" }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

So how can I make a query that tolerates all of these typos/small changes and still ranks first the documents that contain an exact match for one of the possible combinations?

**imotov** · Answer 1 · 2023-09-27T20:23:39.353000

I would index my_field twice: once as is, and a second time with an analyzer that first splits words on case changes and then joins adjacent words into bigrams using a shingle filter. At search time I would query both the original field and the bigram field, giving the original field a higher boost.

There are different ways of doing this, depending on how many mingled words you want to match, the boost level, etc., but hopefully this example will get you started:

DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "tuples_index": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false,
          "token_separator": ""
        },
        "tuples_search": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true,
          "token_separator": ""
        }
      }, 
      "analyzer": {
        "standard_shingle_index": {
          "tokenizer": "standard",
          "filter": [ "word_delimiter", "lowercase", "tuples_index" ]
        },
        "standard_shingle_search": {
          "tokenizer": "standard",
          "filter": [ "word_delimiter", "lowercase", "tuples_search" ]
        }
      }
    }
  }, 
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "fields": {
          "tuples": {
            "type": "text",
            "analyzer": "standard_shingle_index",
            "search_analyzer": "standard_shingle_search"
          }
        }
      }
    }
  }
}
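
To get a feel for what the two analyzers above emit, here is a rough Python imitation. It is a sketch under assumptions, not the real Lucene analysis chain: the regex-based splitting only approximates the standard tokenizer plus word_delimiter, and the function names are invented for illustration. The point it shows is that OpenAI, Open AI and Open A.I. end up sharing at least one token in the tuples field.

```python
import re


def split_words(text):
    """Approximate standard tokenizer + word_delimiter + lowercase."""
    tokens = []
    for chunk in re.findall(r"[A-Za-z0-9]+", text):
        # Split on lower-to-upper case changes, e.g. "OpenAI" -> "Open AI".
        chunk = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", chunk)
        tokens.extend(part.lower() for part in chunk.split())
    return tokens


def bigrams(tokens, output_unigrams):
    """Approximate a 2-word shingle filter with an empty token_separator."""
    out = []
    for i, tok in enumerate(tokens):
        if output_unigrams:
            out.append(tok)
        if i + 1 < len(tokens):
            out.append(tok + tokens[i + 1])
    return out


def index_tokens(text):
    # Mirrors standard_shingle_index: bigrams only.
    return bigrams(split_words(text), output_unigrams=False)


def search_tokens(text):
    # Mirrors standard_shingle_search: unigrams plus bigrams.
    return bigrams(split_words(text), output_unigrams=True)


print(index_tokens("officer of OpenAI"))  # ['officerof', 'ofopen', 'openai']
print(index_tokens("CTO of Open A.I."))   # ['ctoof', 'ofopen', 'opena', 'ai']
print(search_tokens("OpenAI"))            # ['open', 'openai', 'ai']
print(search_tokens("openai"))            # ['openai']
```

In a real cluster the _analyze API with analyzer set to standard_shingle_index or standard_shingle_search is the authoritative way to inspect the tokens.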

PUT my_index/_bulk?refresh
{"index": {}}
{"my_field": "Mira Murati (born 1988) is a United States-based, Albanian-born engineer, researcher and business executive. She is currently the chief technology officer of OpenAI, the artificial intelligence research company that develops ChatGPT." }
{"index": {}}
{"my_field": "Women You Should Know: Mira Murati, CTO of Open A.I." }

GET my_index/_validate/query?explain

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_field": {
              "query": "OpenAI",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "my_field.tuples": {
              "query": "OpenAI"
            }
          }
        }
      ]
    }
  }
}

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_field": {
              "query": "Open AI",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "my_field.tuples": {
              "query": "Open AI"
            }
          }
        }
      ]
    }
  }
}
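
Since the question starts from a list of keywords, the same request body can be generated from that list. The following Python sketch assumes the keywords are simply joined with spaces and reuses the field names and boost from the example above; build_query and its original_boost parameter are hypothetical names, and the resulting dict would still need to be sent with whatever client you use.

```python
import json


def build_query(keywords, original_boost=2.0):
    """Build a bool/should body over my_field and my_field.tuples."""
    text = " ".join(keywords)
    return {
        "query": {
            "bool": {
                "should": [
                    # Original field: exact word matches get the higher boost.
                    {"match": {"my_field": {"query": text,
                                            "boost": original_boost}}},
                    # Bigram field: lets "openai" meet "open ai" and vice versa.
                    {"match": {"my_field.tuples": {"query": text}}},
                ]
            }
        }
    }


# Both keyword lists from the question produce the same query shape.
print(json.dumps(build_query(["cto", "open", "ai"]), indent=2))
print(json.dumps(build_query(["cto", "openai"]), indent=2))
```

With the analyzers defined above, either keyword list should yield the token openai at search time (as a bigram of open and ai, or as the single word openai), so both forms can match documents that contain OpenAI or Open AI.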
