diff options
| author | Mitja Felicijan <m@mitjafelicijan.com> | 2023-07-08 23:25:41 +0200 |
|---|---|---|
| committer | Mitja Felicijan <m@mitjafelicijan.com> | 2023-07-08 23:25:41 +0200 |
| commit | cd6644ea4ddc78597934ab0ef5ba50e3c3daa927 (patch) | |
| tree | 03de331a8db6386dfd6fa75155bfbcea6b4feaf3 /content/2020-03-29-the-strange-case-of-elasticsearch-allocation-failure.md | |
| parent | 84ed124529ffeee1590295b8de3a8faf51848680 (diff) | |
| download | mitjafelicijan.com-cd6644ea4ddc78597934ab0ef5ba50e3c3daa927.tar.gz | |
Moved to a simpler SSG
Diffstat (limited to 'content/2020-03-29-the-strange-case-of-elasticsearch-allocation-failure.md')
| -rw-r--r-- | content/2020-03-29-the-strange-case-of-elasticsearch-allocation-failure.md | 108 |
1 file changed, 108 insertions(+), 0 deletions(-)
diff --git a/content/2020-03-29-the-strange-case-of-elasticsearch-allocation-failure.md b/content/2020-03-29-the-strange-case-of-elasticsearch-allocation-failure.md
new file mode 100644
index 0000000..efe88fa
--- /dev/null
+++ b/content/2020-03-29-the-strange-case-of-elasticsearch-allocation-failure.md
@@ -0,0 +1,108 @@
---
title: The strange case of Elasticsearch allocation failure
url: the-strange-case-of-elasticsearch-allocation-failure.html
date: 2020-03-29T12:00:00+02:00
type: post
draft: false
---

I've been using Elasticsearch in production for 5 years now and never had a
single problem with it. Hell, I never even knew there could be a problem. It
just worked. All this time. The first node I deployed is still being used in
production; it has never been updated, upgraded, or touched in any way.

All this bliss came to an abrupt end this Friday when I got a notification
that the Elasticsearch cluster went warm. Well, warm is not that bad, right?
Wrong! Quickly after that I got another email which sent chills down my spine.
The cluster was now red. RED! Now the shit had really hit the fan!

I tried googling what the problem could be, and after querying the allocation
API I noticed that some shards were unassigned and that 5 allocation attempts
had already been made (which, just my luck, is the maximum), which meant I was
basically fucked. People also advised waiting for the cluster to re-balance
itself. So, I waited. One hour, two hours, several hours. Nothing, still RED.

The strangest thing about it all was that queries were still being fulfilled.
Data was coming out. From the outside it looked like nothing was wrong, but
anybody who looked at the cluster would know immediately that something was
very, very wrong and that we were living on borrowed time.

> **Please, DO NOT do what I did.** Seriously! Please ask someone on the
official forums, or if you know an expert, consult them. There could be a
million reasons, and this solution happened to fit my problem. Maybe in your
case it would be disastrous. I had all the data backed up, so even if I failed
spectacularly I would be able to restore it. It would have been a huge pain
and I would have lost a couple of days, but I had a plan B.

Querying the allocation endpoint told me what the problem was, but offered no
clear solution yet.

```yaml
GET /_cat/allocation?format=json
```
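
In hindsight, two more read-only endpoints would have helped pinpoint the
stuck shards (both are standard Elasticsearch APIs, not something from my
original notes, and a managed service may block these too): `_cat/shards`
lists the state of every shard, and `_cluster/allocation/explain` reports why
an unassigned shard cannot be allocated.

```yaml
# List every shard with its state and the reason it is unassigned (if any)
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason

# Explain the allocation decision for an unassigned shard
GET /_cluster/allocation/explain
```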

I got a message saying `ALLOCATION_FAILED`, with the additional info `failed
to create shard, failure ioexception[failed to obtain in-memory shard lock]`.
Well, splendid! I must also say that our cluster is more than capable of
handling the traffic, and JVM memory pressure was never an issue. So what
really happened, then?

I also tried re-routing the failed shards, with no success due to the
restrictions AWS puts on its managed Elasticsearch clusters (they lock down
some of the functions).

```yaml
POST /_cluster/reroute?retry_failed=true
```

I got a message that significantly reduced my options.

```json
{
  "Message": "Your request: '/_cluster/reroute' is not allowed."
}
```

After that I went on the hunt again. I won't bother you with all the details,
because hours (days, even) went by until I was finally able to re-index the
problematic index and hope for the best. Until that moment even re-indexing
was giving me errors.

```yaml
POST _reindex
{
  "source": {
    "index": "myindex"
  },
  "dest": {
    "index": "myindex-new"
  }
}
```
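
On a big index, a re-index can run for a long time. A pattern that would have
made the repeated attempts less painful (standard Elasticsearch task APIs;
the task id shown below is made up for illustration) is to run the re-index
as a background task and poll its progress:

```yaml
# Returns immediately with a task id instead of blocking
POST _reindex?wait_for_completion=false
{
  "source": { "index": "myindex" },
  "dest": { "index": "myindex-new" }
}

# Poll the task's progress (created/updated counts, failures)
GET /_tasks/oTUltX4IQMOUUVeiohTt8A:12345
```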

I needed to do this multiple times to get all the documents re-indexed. Then
I dropped the original index with the following command.

```yaml
DELETE /myindex
```
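
Before a destructive delete like this, I would strongly suggest comparing the
document counts of both indices (`_count` is a standard endpoint) and only
dropping the original once the numbers match:

```yaml
GET /myindex/_count

GET /myindex-new/_count
```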

And then I re-indexed the new index back into the original one (well,
original by name only).

```yaml
POST _reindex
{
  "source": {
    "index": "myindex-new"
  },
  "dest": {
    "index": "myindex"
  }
}
```
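
To confirm the cluster really went back to green, and to see which individual
indices are still yellow or red rather than just the cluster-wide worst case,
health can be queried per index (again a standard endpoint):

```yaml
GET /_cluster/health?level=indices
```

In hindsight, an index alias pointing at `myindex-new` would have avoided
this second re-index entirely: clients keep querying the alias while the
underlying index gets swapped out.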

On the surface it looks like everything is working, but I have a long road
ahead of me to get everything back in order. The cluster now shows that it is
in Green mode, but I am also getting a notification that the cluster has a
"Processing" status, which could mean a million things.

Godspeed!