---
title: The strange case of Elasticsearch allocation failure
permalink: /the-strange-case-of-elasticsearch-allocation-failure.html
date: 2020-03-29T12:00:00+02:00
layout: post
type: post
draft: false
---

I've been using Elasticsearch in production for 5 years now and never had a
single problem with it. Hell, I never even knew there could be a problem. It
just worked. All this time. The first node that I deployed is still being used
in production, never updated, upgraded or touched in any way.

All this bliss came to an abrupt end this Friday when I got a notification that
the Elasticsearch cluster went warm. Well, warm is not that bad, right? Wrong!
Quickly after that I got another email which sent chills down my spine. The
cluster was now red. RED! Now the shit had really hit the fan!

I tried googling what the problem could be, and after running the allocation
query I noticed that some shards were unassigned and that 5 allocation attempts
had already been made (which, BTW, to my luck is the maximum), which meant I was
basically fucked. The answers I found also implied that one should wait for the
cluster to re-balance itself. So I waited. One hour, two hours, several hours.
Nothing, still RED.
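
If you land in the same spot, the cluster allocation explain API is, as far as
I can tell, the most direct way to see *why* a shard is unassigned; the exact
response varies, but it should mention that the maximum number of retries has
been exhausted. A minimal sketch (assuming AWS does not block it the way it
blocks reroute):

```yaml
GET /_cluster/allocation/explain
```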

The strangest thing about it all was that queries were still being fulfilled.
Data was coming out. On the outside it looked like nothing was wrong, but
anybody who looked at the cluster would know immediately that something was
very, very wrong and that we were living on borrowed time here.

> **Please, DO NOT do what I did.** Seriously! Please ask someone on the
official forums or, if you know an expert, please consult them. There could be
a million reasons and this solution fit my problem. Maybe in your case it would
be disastrous. I had all the data backed up, and even if I failed spectacularly
I would be able to restore it. It would be a huge pain and I would lose a
couple of days, but I had a plan B.

Executing the allocation query told me what the problem was, but offered no
clear solution yet.

```yaml
GET /_cat/allocation?format=json
```

I got a message saying `ALLOCATION_FAILED` with the additional info `failed to
create shard, failure ioexception[failed to obtain in-memory shard lock]`.
Well, splendid! I must also say that our cluster is more than capable of
handling the traffic, and JVM memory pressure was never an issue either. So
what really happened then?
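
To see which shards were actually stuck, `_cat/shards` is also handy; if I
understand the columns correctly, `unassigned.reason` should show
`ALLOCATION_FAILED` for the problematic ones. A minimal sketch:

```yaml
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason
```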

I also tried re-routing the failed shards, with no success, due to AWS
restrictions on the managed Elasticsearch cluster (they lock down some of the
functions).

```yaml
POST /_cluster/reroute?retry_failed=true
```

I got a message that significantly reduced my options.

```json
{
  "Message": "Your request: '/_cluster/reroute' is not allowed."
}
```

After that I went on a hunt again. I won't bother you with all the details,
because hours (days, really) went by until I was finally able to re-index the
problematic index and hope for the best. Up until that moment even re-indexing
was giving me errors.

```yaml
POST _reindex
{
  "source": {
    "index": "myindex"
  },
  "dest": {
    "index": "myindex-new"
  }
}
```
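
Before dropping anything, it is worth checking that the new index actually
holds all the documents. A quick sanity check, using the same index names as
above, is to compare the counts:

```yaml
GET /myindex/_count
GET /myindex-new/_count
```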

I needed to do this multiple times to get all the documents re-indexed. Then I
dropped the original one with the following command.

```yaml
DELETE /myindex
```

And then I re-indexed the new one back into the original one (well, "original"
by name only).

```yaml
POST _reindex
{
  "source": {
    "index": "myindex-new"
  },
  "dest": {
    "index": "myindex"
  }
}
```

On the surface it looks like everything is working, but I have a long road in
front of me to get all the things working again. The cluster now shows that it
is green, but I am also getting a notification that the cluster has a
"processing" status, which could mean a million things.
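
If you would rather keep an eye on it yourself instead of waiting for the next
notification, the cluster health API is the quickest check; I am mostly
watching fields like `status` and `unassigned_shards`:

```yaml
GET /_cluster/health
```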

Godspeed!