Ingest pipeline should only work on new incoming documents and output to new index


I have an index with millions of documents, and it receives new documents periodically. I created an ingest pipeline for it, but I only want the pipeline to run on new incoming documents because the number of existing documents is huge.

I connected my index and ingest pipeline using _reindex like this:

POST _reindex
{
  "source": {
    "index": "index*"
  },
  "dest": {
    "index": "new_index",
    "pipeline": "pipeline"
  }
}

My current pipeline is as follows:

{
  "processors": [
    {
      "gsub": {
        "field": "my_field",
        "pattern": "regex",
        "replacement": ""
      }
    }
  ]
}

This setup runs the ingest pipeline on every document in the index, but I only want it to run on new incoming data. How can I achieve this?

Val On

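You don't need a _reindex for this; reindexing runs the pipeline on all existing documents.

You simply need to configure your index with a default_pipeline setting. The pipeline is then applied to every document indexed from that point on, and existing documents are left untouched: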

PUT index*/_settings
{
   "index.default_pipeline": "pipeline"
}
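
Before relying on the setting, it can help to dry-run the pipeline and read the setting back. The requests below are a minimal sketch, assuming the pipeline is named pipeline as in the question; the sample field value is made up for illustration.

```
# Dry-run the pipeline against a sample document (nothing gets indexed)
POST _ingest/pipeline/pipeline/_simulate
{
  "docs": [
    { "_source": { "my_field": "sample value" } }
  ]
}

# Read the setting back from the matching indices
GET index*/_settings/index.default_pipeline
```

Note that the PUT only touches indices that already exist when it is sent. Indices created later that match index* would need the same setting, for example through an index template.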

UPDATE:

There's no feature in Elasticsearch that automatically indexes a document into i2 whenever a document is indexed into i1. You could probably get close to what you expect with something like Logstash, which polls the first index on a schedule (every minute) for documents that arrived during the last minute and sends those documents to a second index through your pipeline, but that's a solution outside of Elasticsearch:

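input {
  elasticsearch {
    hosts => "localhost:9200"
    index => "i1"
    # cron expression: run the query once every minute
    schedule => "* * * * *"
    # only fetch documents whose @timestamp falls within the last minute
    query => '{ "query": { "range": { "@timestamp": { "gt": "now-1m"} } } }'
  }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "i2"
    # ingest pipeline applied to each document written to i2
    pipeline => "my_pipeline"
  }
}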
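
If running Logstash is not an option, a similar effect could be approximated by restricting the original _reindex call to recent documents and triggering it from an external scheduler such as cron. This is only a sketch under assumptions: the documents carry an @timestamp field set at index time, and the index and pipeline names are the placeholders used above.

```
# Reindex only the documents from the last minute through the pipeline
POST _reindex
{
  "source": {
    "index": "i1",
    "query": {
      "range": { "@timestamp": { "gt": "now-1m" } }
    }
  },
  "dest": {
    "index": "i2",
    "pipeline": "my_pipeline"
  }
}
```

Because _reindex keeps the original document IDs by default, a document picked up by two overlapping runs is overwritten in i2 rather than duplicated. The periodic trigger still has to come from outside Elasticsearch.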