---
title: The strange case of Elasticsearch allocation failure
url: the-strange-case-of-elasticsearch-allocation-failure.html
date: 2020-03-29T12:00:00+02:00
type: post
draft: false
---

I've been using Elasticsearch in production for 5 years now and never had a
single problem with it. Hell, I never even knew there could be a problem. It just
worked. All this time. The first node that I deployed is still being used in
production, never updated, upgraded, or touched in any way.

All this bliss came to an abrupt end this Friday when I got a notification that
the Elasticsearch cluster went warm. Well, warm is not that bad, right? Wrong!
Quickly after that I got another email which sent chills down my spine. The cluster
was now red. RED! Now shit really hit the fan!
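
For reference, this red/yellow/green status (along with the number of unassigned shards) comes from the cluster health endpoint, which is the quickest way to confirm what the alert is telling you:

```yaml
GET /_cluster/health?format=json
```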

I tried googling what the problem could be and, after querying the allocation
API, noticed that some shards were unassigned and 5 allocation attempts had already
been made (which, BTW, to my luck is the maximum), which meant I was basically fucked.
They also advised that one should wait for the cluster to re-balance itself. So, I
waited. One hour, two hours, several hours. Nothing, still RED.
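
For context: that 5-attempt ceiling comes from the `index.allocation.max_retries` setting, which defaults to 5. You can list which shards are unassigned, and why, via the cat shards API (the column selection below is just a sketch):

```yaml
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason
```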

The strangest thing about it all was that queries were still being fulfilled.
Data was coming out. On the outside it looked like nothing was wrong, but
anybody who looked at the cluster would know immediately that something
was very, very wrong and we were living on borrowed time.
 29
 30> **Please, DO NOT do what I did.** Seriously! Please ask someone on official
 31forums or if you know an expert please consult him. There could be million of
 32reasons and these solution fit my problem. Maybe in your case it would
 33disastrous. I had all the data backed up and even if I would fail spectacularly
 34I would be able to restore the data. It would be a huge pain and I would loose
 35couple of days but I had a plan B.

Querying allocation told me what the problem was, but gave no clear solution yet.

```yaml
GET /_cat/allocation?format=json
```

I got a message that allocation was `ALLOCATION_FAILED` with the additional info `failed to create
shard, failure ioexception[failed to obtain in-memory shard lock]`. Well,
splendid! I must also say that our cluster is more than capable of handling
the traffic. JVM memory pressure was never an issue either. So what really
happened then?
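
If you hit something similar, the cluster allocation explain API gives a much more detailed reason for why a specific shard is unassigned (the index name and shard number below are placeholders):

```yaml
GET /_cluster/allocation/explain
{
  "index": "myindex",
  "shard": 0,
  "primary": true
}
```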

I also tried re-routing the failed shards, with no success due to AWS restrictions on
managed Elasticsearch clusters (they lock some of the functions).

```yaml
POST /_cluster/reroute?retry_failed=true
```

I got a message that significantly reduced my options.

```json
{
  "Message": "Your request: '/_cluster/reroute' is not allowed."
}
```

After that I went on a hunt again. I won't bother you with all the details,
because hours and days went by until I was finally able to re-index the problematic
index and hope for the best. Until that moment, even re-indexing was giving me
errors.

```yaml
POST _reindex
{
  "source": {
    "index": "myindex"
  },
  "dest": {
    "index": "myindex-new"
  }
}
```
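
Before deleting anything, it's worth comparing document counts between the old and new index to make sure nothing was dropped along the way:

```yaml
GET /_cat/count/myindex?format=json
GET /_cat/count/myindex-new?format=json
```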

I needed to run this multiple times to get all the documents re-indexed. Then I
dropped the original index with the following command.

```yaml
DELETE /myindex
```

And then re-indexed the new one back into the original (well, by name only).

```yaml
POST _reindex
{
  "source": {
    "index": "myindex-new"
  },
  "dest": {
    "index": "myindex"
  }
}
```

On the surface it looks like everything is working, but I have a long road ahead of
me to get things fully working again. The cluster now shows that it is green,
but I am also getting a notification that the cluster has a "processing" status,
which could mean a million things.

Godspeed!
