Elasticsearch highlighter false positives

Question

Elasticsearch highlighter false positives

783 Views Asked by mbatchkarov At 04 March 2018 at 14:28

I am using an nGram tokenizer in ES 6.1.1 and getting some weird highlights:

multiple adjacent character ngram highlights are not merged into one
tra is incorrectly highlighted in doc 9

The query auftrag matches documents 7 and 9 as expected, but in doc 9 betrag is highlighted incorrectly. That's a problem with the highlighter - if the problem was with the query doc 8 would have also been returned.

Example code

#!/usr/bin/env bash

# Example based on  
# https://www.elastic.co/guide/en/elasticsearch/guide/current/ngrams-compound-words.html
# with suggestions from from 
# https://github.com/elastic/elasticsearch/issues/21000

DELETE INDEX IF EXISTS

curl -sS -XDELETE 'localhost:9200/my_index'
printf '\n-------------\n'

CREATE NEW INDEX

curl -sS -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
{
    "settings": {
    "analysis": {
      "analyzer": {
        "trigrams": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "3",
          "max_gram": "3",
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation"
          ]
        }
      }
    }
},
    "mappings": {
        "my_type": {
            "properties": {
                "text": {
                    "type":     "text",
                    "analyzer": "trigrams",
                    "term_vector": "with_positions_offsets"
                }
            }
        }
    }
}
'
printf '\n-------------\n'

POPULATE INDEX

curl -sS -XPOST 'localhost:9200/my_index/my_type/_bulk?pretty' -H 'Content-Type: application/json' -d'
{ "index": { "_id": 7 }}
{ "text": "auftragen" }
{ "index": { "_id": 8 }}
{ "text": "betrag" }
{ "index": { "_id": 9 }}
{ "text": "betrag auftragen" }
'
printf '\n-------------\n'
sleep 1  # Give ES time to index

QUERY

curl -sS -XGET 'localhost:9200/my_index/my_type/_search?pretty' -H 'Content-Type: application/json' -d'
{
    "query": {
        "match": {
            "text": {
                "query": "auftrag",
                "minimum_should_match": "100%"
            }
        }
    },
      "highlight": {
        "fields": {
          "text": {
            "fragment_size": 120,
            "type": "fvh"
          }
        }
      }
}
'

The hits I get are (abbreviated):

"hits" : [
      {
        "_id" : "9",
        "_source" : {
          "text" : "betrag auftragen"
        },
        "highlight" : {
          "text" : [
            "be<em>tra</em>g <em>auf</em><em>tra</em>gen"
          ]
        }
      },
      {
        "_id" : "7",
        "_source" : {
          "text" : "auftragen"
        },
        "highlight" : {
          "text" : [
            "<em>auf</em><em>tra</em>gen"
          ]
        }
      }
    ]

I have tried various workarounds, such as using the unified/fvh highlighter and setting all options that seemed relevant, but no luck. Any hints are greatly appreciated.

Original Q&A

There are 2 best solutions below

Yorling On 07 January 2021 at 06:44

I have the same problem here, with ngram(trigram) tokenizer, got incomplete highlight like:

query with `match`: samp
field data: sample
result highlight: <em>sam</em>ple
expected highlight: <em>samp</em>le

Use match_phrase and use fvh highlight type when set the field's term_vector to with_positions_offsets, this may get the correct highlight.

<em>samp</em>le

I hope this can help you as you do not need to change the tokenizer nor increase max_gram.

But my problem is that I want to use simple_query_string which does not support using phrase for default field query, the only way is using quote to wrap the string like "samp", but as there is some logic in query string so I cant do it for users, and require users to do it neither.

Solution from @piotr-pradzynski may not help me as I have a lot of data, increase the max_gram will lead to lots of storage usage.

**Piotr Pradzynski** · Accepted Answer · 2018-03-14T07:11:21.327000

The problem here is not with highlighting but with this how you are using nGram analyzer.

First of all when you are configure mapping this way:

"mappings": {
  "my_type": {
    "properties": {
      "text": {
        "type"       : "text",
        "analyzer"   : "trigrams",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}

you are saying to Elasticsearch that you want to use it for both indexed text and provided a search term. In your case, this simply means that:

your text from the document 9 = "betrag auftragen" is split for trigrams so in the index you have something like: [bet, etr, tra, rag, auf, uft, ftr, tra, rag, age, gen]
your text from the document 7 = "auftragen" is split for trigrams so in the index you have something like: [auf, uft, ftr, tra, rag, age, gen]
your search term = "auftrag" is also split for trigrams and Elasticsearch is see it as: [auf, uft, ftr, tra, rag]
at the end Elasticsearch matches all the trigrams from search with those from your index and because of this you have 'auf' and 'tra' highlighted separately. 'ufa', 'ftr', and 'rag' also matches, but they overlaps 'auf' and 'tra' and are not highlighted.

First what you need to do is to say to Elasticsearch that you do not want to split search term to grams. All you need to do is to add search_analyzer property to your mapping:

"mappings": {
  "my_type": {
    "properties": {
      "text": {
        "type"           : "text",
        "analyzer"       : "trigrams",
        "search_analyzer": "standard",
        "term_vector"    : "with_positions_offsets"
      }
    }
  }
}

Now words from a search term are treated by standard analyzer as separate words so in your case, it will be just "auftrag".

But this single change will not help you. It will even break the search because "auftrag" is not matching to any trigram from your index.

Now you need to improve your nGram tokenizer by increasing max_gram:

"tokenizer": {
  "my_ngram_tokenizer": {
    "type": "nGram",
    "min_gram": "3",
    "max_gram": "10",
    "token_chars": [
      "letter",
      "digit",
      "symbol",
      "punctuation"
    ]
  }
}

This way texts in your index will be split into 3-grams, 4-grams, 5-grams, 6-grams, 7-grams, 8-grams, 9-grams, and 10-grams. Among these 7-grams you will find "auftrag" which is your search term.

After this two improvements, highlighting in your search result should look as below:

"betrag <em>auftrag</em>en"

for document 9 and:

"<em>auftrag</em>en"

for document 7.

This is how ngrams and highlighting works together. I know that ES documentation is saying:

It usually makes sense to set min_gram and max_gram to the same value. The smaller the length, the more documents will match but the lower the quality of the matches. The longer the length, the more specific the matches. A tri-gram (length 3) is a good place to start.

This is true. For performance reason, you need to experiment with this configuration but I hope that I explained to you how it is working.

Elasticsearch highlighter false positives

There are 2 best solutions below

Related Questions in ELASTICSEARCH

Related Questions in HIGHLIGHTING

Trending Questions

Popular # Hahtags

Popular Questions