commit 8697555125c57ae64a0c9b78514b4aac4fd523de (patch)
parent 33b2615a5038bc85036081e8b5e0da8584d88097
author/committer: Mitja Felicijan <mitja.felicijan@gmail.com>, 2023-06-27 14:50:20 +0200

Massive formatting and added figcaption

content/posts/2020-03-29-the-strange-case-of-elasticsearch-allocation-failure.md | 74 +-
1 file changed, 37 insertions(+), 37 deletions(-)
date: 2020-03-29T12:00:00+02:00
draft: false
---

I've been using Elasticsearch in production for five years now and never had a
single problem with it. Hell, I never even knew there could be a problem. It
just worked, all this time. The first node I deployed is still being used in
production, never updated, upgraded, or touched in any way.

All this bliss came to an abrupt end this Friday, when I got a notification
that the Elasticsearch cluster had gone warm. Well, warm is not that bad,
right? Wrong! Quickly after that I got another email which sent chills down my
spine. The cluster was now red. RED! Now shit really hit the fan!

I tried googling what the problem could be, and after running the allocation
query I noticed that some shards were unassigned and 5 allocation attempts had
already been made (which, just my luck, is the maximum), and that meant I was
basically fucked. The advice also implied that one should wait for the cluster
to re-balance itself. So I waited. One hour, two hours, several hours. Nothing,
still RED.

The strangest thing about it all was that queries were still being fulfilled.
Data was coming out. From the outside it looked like nothing was wrong, but
anybody who looked at the cluster would know immediately that something was
very, very wrong and that we were living on borrowed time.

> **Please, DO NOT do what I did.** Seriously! Please ask someone on the
> official forums, or if you know an expert, consult them. There could be a
> million reasons, and this solution fit my problem. Maybe in your case it
> would be disastrous. I had all the data backed up, so even if I failed
> spectacularly I would be able to restore it. It would have been a huge pain
> and I would have lost a couple of days, but I had a plan B.

Running the allocation query told me what the problem was, but offered no clear
solution yet.

```yaml
GET /_cat/allocation?format=json
```

I got an `ALLOCATION_FAILED` message with the additional info `failed to create
shard, failure ioexception[failed to obtain in-memory shard lock]`. Well,
splendid! I must also say that our cluster is more than capable of handling the
traffic. JVM memory pressure was never an issue either. So what really
happened, then?

I also tried re-routing the failed shards, with no success, due to AWS
restrictions on managed Elasticsearch clusters (they lock some of the
functions).

```yaml
POST /_cluster/reroute?retry_failed=true
```

I got a message that significantly reduced my options.

```yaml
{
  …
}
```

After that I went on the hunt again. I won't bother you with all the details,
because hours and days went by until I was finally able to re-index the
problematic index and hope for the best. Until that moment, even re-indexing
was giving me errors.

```yaml
POST _reindex
{
  …
}
```

I needed to do this multiple times to get all the documents re-indexed. Then I
dropped the original index with the following command.

```yaml
DELETE /myindex
```

…

```yaml
POST _reindex
{
  …
}
```

On the surface it looks like everything is working, but I have a long road
ahead of me to get everything working again. The cluster now shows Green, but I
am also getting a notification that the cluster has a processing status, which
could mean a million things.
| 105 | 105 | ||
Godspeed!
