I am using an nGram tokenizer in ES 6.1.1 and getting some weird highlights:
- multiple adjacent character ngram highlights are not merged into one
trais incorrectly highlighted in doc 9
The query auftrag matches documents 7 and 9 as expected, but in doc 9 betrag is highlighted incorrectly. That's a problem with the highlighter - if the problem was with the query doc 8 would have also been returned.
Example code
#!/usr/bin/env bash
# Example based on
# https://www.elastic.co/guide/en/elasticsearch/guide/current/ngrams-compound-words.html
# with suggestions from from
# https://github.com/elastic/elasticsearch/issues/21000
DELETE INDEX IF EXISTS
curl -sS -XDELETE 'localhost:9200/my_index'
printf '\n-------------\n'
CREATE NEW INDEX
curl -sS -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
{
"settings": {
"analysis": {
"analyzer": {
"trigrams": {
"tokenizer": "my_ngram_tokenizer",
"filter": ["lowercase"]
}
},
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "3",
"max_gram": "3",
"token_chars": [
"letter",
"digit",
"symbol",
"punctuation"
]
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"text": {
"type": "text",
"analyzer": "trigrams",
"term_vector": "with_positions_offsets"
}
}
}
}
}
'
printf '\n-------------\n'
POPULATE INDEX
curl -sS -XPOST 'localhost:9200/my_index/my_type/_bulk?pretty' -H 'Content-Type: application/json' -d'
{ "index": { "_id": 7 }}
{ "text": "auftragen" }
{ "index": { "_id": 8 }}
{ "text": "betrag" }
{ "index": { "_id": 9 }}
{ "text": "betrag auftragen" }
'
printf '\n-------------\n'
sleep 1 # Give ES time to index
QUERY
curl -sS -XGET 'localhost:9200/my_index/my_type/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": {
"match": {
"text": {
"query": "auftrag",
"minimum_should_match": "100%"
}
}
},
"highlight": {
"fields": {
"text": {
"fragment_size": 120,
"type": "fvh"
}
}
}
}
'
The hits I get are (abbreviated):
"hits" : [
{
"_id" : "9",
"_source" : {
"text" : "betrag auftragen"
},
"highlight" : {
"text" : [
"be<em>tra</em>g <em>auf</em><em>tra</em>gen"
]
}
},
{
"_id" : "7",
"_source" : {
"text" : "auftragen"
},
"highlight" : {
"text" : [
"<em>auf</em><em>tra</em>gen"
]
}
}
]
I have tried various workarounds, such as using the unified/fvh highlighter and setting all options that seemed relevant, but no luck. Any hints are greatly appreciated.
The problem here is not with highlighting but with this how you are using nGram analyzer.
First of all when you are configure mapping this way:
you are saying to Elasticsearch that you want to use it for both indexed text and provided a search term. In your case, this simply means that:
First what you need to do is to say to Elasticsearch that you do not want to split search term to grams. All you need to do is to add
search_analyzerproperty to your mapping:Now words from a search term are treated by
standardanalyzer as separate words so in your case, it will be just "auftrag".But this single change will not help you. It will even break the search because "auftrag" is not matching to any trigram from your index.
Now you need to improve your nGram tokenizer by increasing
max_gram:This way texts in your index will be split into 3-grams, 4-grams, 5-grams, 6-grams, 7-grams, 8-grams, 9-grams, and 10-grams. Among these 7-grams you will find "auftrag" which is your search term.
After this two improvements, highlighting in your search result should look as below:
for document 9 and:
for document 7.
This is how ngrams and highlighting works together. I know that ES documentation is saying:
This is true. For performance reason, you need to experiment with this configuration but I hope that I explained to you how it is working.