Let's consider this analyzer: I want to split my text on the @ character first, and then edge-n-gram each token from the 1st character up to the nth character.
GET my_idx/_analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "@"
    ]
  },
  "filter": [
    "trim",
    "lowercase",
    "asciifolding",
    "edge_ngram_1_60_filter"
  ],
  "text": ["m-h@dl.fr"]
}
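(For reference, edge_ngram_1_60_filter is a custom edge_ngram token filter declared in the settings of my_idx; as the name suggests it goes from 1 to 60 characters, so its definition is roughly the sketch below, the exact values may differ slightly.)

PUT my_idx
{
  "settings": {
    "analysis": {
      "filter": {
        "edge_ngram_1_60_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 60
        }
      }
    }
  }
}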
I get the following result, which is OK:
{
  "tokens" : [
    {
      "token" : "m",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "m-",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "m-h",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "d",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "dl",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "dl.",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "dl.f",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "dl.fr",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    }
  ]
}
This analyzer only produces 2 positions, because the positions come from the char_group tokenizer (one per token) and the edge_ngram token filter keeps every n-gram at its parent token's position. I would like each n-gram to get its own position instead, like this:
{
  "tokens" : [
    {
      "token" : "m",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "m-",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "m-h",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "d",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "dl",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "dl.",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "dl.f",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dl.fr",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 7
    }
  ]
}
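One direction I have been wondering about, without being sure it is the right one, is to do the n-gramming in the tokenizer itself rather than in a token filter, since a tokenizer assigns a new position to every token it emits. A rough sketch of what I mean, where the edge_ngram tokenizer keeps letters, digits and the custom characters "-" and "." and therefore splits on "@" (the custom_token_chars value is just a guess for my data, it needs Elasticsearch 7.6+, and the offsets would differ from the desired output above):

GET my_idx/_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 60,
    "token_chars": ["letter", "digit", "custom"],
    "custom_token_chars": "-."
  },
  "filter": ["trim", "lowercase", "asciifolding"],
  "text": ["m-h@dl.fr"]
}

With this, the 8 grams above should come out at positions 0 through 7, but I am not sure it is the cleanest way to get there.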
Any ideas? Thanks!