Atomic alias swap fails with index_not_found_exception on a totally unrelated index


I want to replace an index with zero downtime, as described in the ES documentation.

I am doing so by:

  • creating a new index my_index_v2 with the new data (sketched after the swap request below)
  • refreshing the new index so the new data is searchable
  • then swapping the two in a single atomic operation, by performing the following request:

POST /_aliases
{
    "actions": [
        { "remove": { "index": "*", "alias": "my_index" }},
        { "add":    { "index": "my_index_v2", "alias": "my_index" }}
    ]
}
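
For context, the first two steps are ordinary requests. A minimal sketch (the settings/mappings body is elided; fill in whatever your index needs):

PUT /my_index_v2
{ ... settings and mappings for the new index ... }

POST /my_index_v2/_refresh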

This works as expected, except when it randomly fails with a 404 response. The error message is:

{
   "error": {
      "root_cause": ... (same)
      "type": "index_not_found_exception",
      "reason": "no such index",
      "resource.type": "index_or_alias",
      "resource.id": "my_unrelated_index_v13",
      "index": "my_unrelated_index_v13"
   },
   "status": 404
}
  • Afterwards, and only if the swap worked, we delete the now unused indices that were associated with this alias and only this alias.
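
Concretely: after verifying that the alias now points only at the new index, the old index can be deleted by name. A sketch, where my_index_v1 is a hypothetical name for the superseded index:

GET /_alias/my_index

DELETE /my_index_v1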

The whole operation runs periodically, every few minutes. Similar operations on other aliases/indices may be running in the cluster at the same time. The error occurs randomly, once every several hours.

Is there a reason why these operations would interfere with each other? What is going on?

EDIT: clarified the DELETE step at the end.

1 Answer

BEST ANSWER

This is difficult to reproduce in a local environment because it seems to happen only in highly concurrent scenarios. However, as pointed out by @Eirini Graonidou in the comments, this really looks like an ES bug, fixed in PR 23153.

From the pull request (emphasis mine):

This either leads to puzzling responses when a bad request is sent to Elasticsearch (if an index named "bad-request" does not exist then it produces an index not found exception and otherwise responds with the index settings for the index named "bad-request").

This does not explain what the bad request was in the first place, but it definitely explains why the error message points at a seemingly unrelated index.

More importantly: upgrading Elasticsearch fixes this issue.
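
If you are unsure whether your cluster already includes the fix, the root endpoint reports the running version, which you can compare against the release containing PR 23153 (check the PR's milestone or the release notes; the exact version is not stated here):

GET /

The response includes a "version": { "number": "..." } object with the cluster version.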