Let's consider this analyzer: I want to split my text on the @ character first, and then edge-n-gram each token from the 1st character up to the nth character.
GET my_idx/_analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "@"
    ]
  },
  "filter": [
    "trim",
    "lowercase",
    "asciifolding",
    "edge_ngram_1_60_filter"
  ],
  "text": ["m-h@dl.fr"]
}
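(For reference, edge_ngram_1_60_filter is a custom edge_ngram token filter declared in the settings of my_idx; as the name suggests it goes from 1 to 60 characters, so its definition is roughly the sketch below, the exact values may differ slightly.)

PUT my_idx
{
  "settings": {
    "analysis": {
      "filter": {
        "edge_ngram_1_60_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 60
        }
      }
    }
  }
}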
I get the following result, which is OK:
{
  "tokens" : [
    {
      "token" : "m",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "m-",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "m-h",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "d",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "dl",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "dl.",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "dl.f",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "dl.fr",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    }
  ]
}
This analyzer only produces 2 positions, because the positions come from the char_group tokenizer (one per token) and the edge_ngram token filter keeps every n-gram at its parent token's position. I would like each n-gram to get its own position instead, like this:
{
  "tokens" : [
    {
      "token" : "m",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "m-",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "m-h",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "d",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "dl",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "dl.",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "dl.f",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dl.fr",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 7
    }
  ]
}
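One direction I have been wondering about, without being sure it is the right one, is to do the n-gramming in the tokenizer itself rather than in a token filter, since a tokenizer assigns a new position to every token it emits. A rough sketch of what I mean, where the edge_ngram tokenizer keeps letters, digits and the custom characters "-" and "." and therefore splits on "@" (the custom_token_chars value is just a guess for my data, it needs Elasticsearch 7.6+, and the offsets would differ from the desired output above):

GET my_idx/_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 60,
    "token_chars": ["letter", "digit", "custom"],
    "custom_token_chars": "-."
  },
  "filter": ["trim", "lowercase", "asciifolding"],
  "text": ["m-h@dl.fr"]
}

With this, the 8 grams above should come out at positions 0 through 7, but I am not sure it is the cleanest way to get there.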
Any ideas? Thanks!