author     Mitja Felicijan <mitja.felicijan@gmail.com>  2023-06-27 14:50:20 +0200
committer  Mitja Felicijan <mitja.felicijan@gmail.com>  2023-06-27 14:50:20 +0200
commit     8697555125c57ae64a0c9b78514b4aac4fd523de (patch)
tree       a699df53a7c35a4425f30bca86982c4341f6de40 /content/posts/2020-03-29-the-strange-case-of-elasticsearch-allocation-failure.md
parent     33b2615a5038bc85036081e8b5e0da8584d88097 (diff)
download   mitjafelicijan.com-8697555125c57ae64a0c9b78514b4aac4fd523de.tar.gz

Massive formatting and added figcaption
Diffstat (limited to 'content/posts/2020-03-29-the-strange-case-of-elasticsearch-allocation-failure.md')
 -rw-r--r--  content/posts/2020-03-29-the-strange-case-of-elasticsearch-allocation-failure.md | 74
 1 file changed, 37 insertions, 37 deletions
diff --git a/content/posts/2020-03-29-the-strange-case-of-elasticsearch-allocation-failure.md b/content/posts/2020-03-29-the-strange-case-of-elasticsearch-allocation-failure.md
index d0f4bac..bf1d710 100644
--- a/content/posts/2020-03-29-the-strange-case-of-elasticsearch-allocation-failure.md
+++ b/content/posts/2020-03-29-the-strange-case-of-elasticsearch-allocation-failure.md
@@ -5,33 +5,33 @@ date: 2020-03-29T12:00:00+02:00
draft: false
---

I've been using Elasticsearch in production for 5 years now and never had a
single problem with it. Hell, I never even knew there could be a problem. It
just worked. All this time. The first node that I deployed is still being used
in production, never updated, upgraded, or touched in any way.

All this bliss came to an abrupt end this Friday, when I got a notification
that the Elasticsearch cluster went warm. Well, warm is not that bad, right?
Wrong! Quickly after that I got another email which sent chills down my spine.
The cluster is now red. RED! Now, shit really hit the fan!

I tried googling what the problem could be, and after running the allocation
query I noticed that some shards were unassigned and that 5 attempts had
already been made (which, just my luck, is the maximum), which meant I was
basically fucked. The advice I found also implied that one should wait for the
cluster to re-balance itself. So I waited. One hour, two hours, several hours.
Nothing, still RED.
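
If you are in the same boat, the quickest way to see which shards are stuck
and why is the cat shards API (a generic call with hypothetical column
choices, not the exact one I ran):

```yaml
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason
```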

The strangest thing about it all was that queries were still being fulfilled.
Data was coming out. On the outside it looked like nothing was wrong, but
anybody who looked at the cluster would know immediately that something was
very, very wrong and that we were living on borrowed time here.

> **Please, DO NOT do what I did.** Seriously! Please ask someone on the
> official forums, or if you know an expert, consult them. There could be a
> million reasons, and this solution fit my problem. Maybe in your case it
> would be disastrous. I had all the data backed up, and even if I failed
> spectacularly I would be able to restore it. It would have been a huge pain
> and I would lose a couple of days, but I had a plan B.

Running the allocation query told me what the problem was, but gave no clear solution yet.

@@ -39,14 +39,14 @@ Executing allocation and told me what the problem was but no clear solution yet.
```yaml
GET /_cat/allocation?format=json
```

I got an `ALLOCATION_FAILED` status with the additional info `failed to create
shard, failure ioexception[failed to obtain in-memory shard lock]`. Well,
splendid! I must also say that our cluster is more than capable of handling
the traffic, and JVM memory pressure was never an issue. So what really
happened then?
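
Digging into why one particular shard is stuck is easier with the allocation
explain API (a sketch, assuming a version of Elasticsearch that has it;
`myindex` and the shard number are placeholders):

```yaml
GET /_cluster/allocation/explain
{
  "index": "myindex",
  "shard": 0,
  "primary": true
}
```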

I also tried re-routing the failed shards, but with no success, due to AWS
restrictions on managed Elasticsearch clusters (they lock some of the
functionality).

```yaml
POST /_cluster/reroute?retry_failed=true
```
@@ -60,10 +60,10 @@ I got a message that significantly reduced my options.
}
```
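
Worth noting: the five-attempt ceiling from earlier is the
`index.allocation.max_retries` index setting. On a cluster where the settings
API is not locked down (a sketch; a managed AWS cluster may well reject this,
and `myindex` is a placeholder), raising it gives the cluster fresh allocation
attempts:

```yaml
PUT /myindex/_settings
{
  "index.allocation.max_retries": 10
}
```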

After that I went on the hunt again. I won't bother you with all the details,
because hours and days went by until I was finally able to re-index the
problematic index and hope for the best. Until that moment even re-indexing
was giving me errors.

```yaml
POST _reindex
@@ -77,8 +77,8 @@ POST _reindex
}
```
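
For reference, the general shape of such a re-index call is just a source and
a destination index (the names below are placeholders, not my actual indices):

```yaml
POST _reindex
{
  "source": { "index": "myindex" },
  "dest": { "index": "myindex-v2" }
}
```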

I needed to do this multiple times to get all the documents re-indexed. Then I
dropped the original index with the following command.

```yaml
DELETE /myindex
```
@@ -98,10 +98,10 @@ POST _reindex
}
```

On the surface it looks like everything is working, but I have a long road
ahead of me to get things fully working again. The cluster now shows green
status, but I am also getting a notification that the cluster has a processing
status, which could mean a million things.
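
All I can really do now is keep an eye on the health endpoint until that
clears (a generic call, nothing specific to my setup):

```yaml
GET /_cluster/health?wait_for_status=green&timeout=30s
```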

Godspeed!