Tokenize each word from any start_offset


I would like to tokenize the following text:

  "text": "king martin"

into

[k, ki, kin, king, i, in, ing, ng, g, m, ma, mar, mart, martin, ar, art, arti, artin, r, rt, rti, rtin, t, ti, tin, i, in, n]

But more specifically into:

 [kin, king, ing, mar, mart, martin, art, arti, artin, rti, rtin, tin]

Is there a way to get these tokens? I have tried the following tokenizer, but how do I tell it to start at any start_offset?

  "ngram_tokenizer": {
        "type": "edge_ngram",
        "min_gram": "3",
        "max_gram": "15",
        "token_chars": [
          "letter",
          "digit"
        ]
      }
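
As far as I understand, this edge_ngram configuration only emits prefixes of each word, something like:

  [kin, king, mar, mart, marti, martin]

so the grams that start at other offsets (ing, art, rtin, ...) are never produced.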

Thank you!


There is 1 answer below.

Accepted answer by Musab Dogan:

You can use the ngram tokenizer rather than edge_ngram.

PUT test_ngram_stack
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "index.max_ngram_diff": 10
  }
}
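
The index.max_ngram_diff setting is required here because it defaults to 1, while the difference between max_gram and min_gram in this tokenizer is 10 - 3 = 7; without raising it, the index creation would be rejected.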

POST test_ngram_stack/_analyze
{
  "analyzer": "my_analyzer",
  "text": "king martin"
}

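The _analyze call should return a token list along these lines (with the ngram tokenizer, every substring of length 3 up to 10 characters, starting at any offset within each word):

  [kin, king, ing, mar, mart, marti, martin, art, arti, artin, rti, rtin, tin]

which includes all of the tokens you asked for (plus marti).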