---
title: The strange case of Elasticsearch allocation failure
url: the-strange-case-of-elasticsearch-allocation-failure.html
date: 2020-03-29T12:00:00+02:00
draft: false
---

I've been using Elasticsearch in production for 5 years now and never had a
single problem with it. Hell, I never even knew there could be a problem. It
just worked. All this time. The first node I deployed is still being used in
production, never updated, upgraded, or touched in any way.

All this bliss came to an abrupt end this Friday when I got a notification
that the Elasticsearch cluster went warm. Well, warm is not that bad, right?
Wrong! Quickly after that I got another email which sent chills down my
spine. The cluster is now red. RED! Now shit really hit the fan!
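
For context, Elasticsearch reports cluster health as green, yellow or red,
and red means at least one primary shard is unassigned. The status comes
from a plain health call:

```yaml
GET /_cluster/health
```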

I tried googling what the problem could be, and after running the allocation
query I noticed that some shards were unassigned and that 5 allocation
attempts had already been made (which is, BTW, to my luck the maximum). That
meant I was basically fucked. The answers I found also implied that one
should wait for the cluster to re-balance itself. So I waited. One hour, two
hours, several hours. Nothing, still RED.
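
I didn't keep my exact history, but generic calls along these lines will
show you which shards are unassigned and why:

```yaml
# List every shard and its state; broken ones show up as UNASSIGNED
GET /_cat/shards?v

# Ask the cluster to explain the first unassigned shard it finds
GET /_cluster/allocation/explain
```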

The strangest thing about it all was that queries were still being
fulfilled. Data was coming out. From the outside it looked like nothing was
wrong, but anybody who looked at the cluster would know immediately that
something was very, very wrong and that we were living on borrowed time.

> **Please, DO NOT do what I did.** Seriously! Please ask someone on the
> official forums, or if you know an expert, consult them. There could be a
> million reasons, and this solution fit my problem; maybe in your case it
> would be disastrous. I had all the data backed up, so even if I failed
> spectacularly I would be able to restore it. It would have been a huge
> pain and I would have lost a couple of days, but I had a plan B.

Executing the allocation query told me what the problem was, but offered no
clear solution yet.

```yaml
GET /_cat/allocation?format=json
```

I got back `ALLOCATION_FAILED` with the additional info `failed to create
shard, failure ioexception[failed to obtain in-memory shard lock]`. Well,
splendid! I must also say that our cluster is more than capable of handling
the traffic, and JVM memory pressure was never an issue. So what really
happened then?
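
For reference, the five-attempt limit that bit me is the
`index.allocation.max_retries` index setting, which defaults to 5. In theory
you can raise it with a call like the one below (`myindex` being the broken
index), though I never verified whether AWS allows it:

```yaml
PUT /myindex/_settings
{
  "index.allocation.max_retries": 10
}
```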

I also tried re-routing the failed shards, with no success, due to the
restrictions AWS puts on managed Elasticsearch clusters (they lock some of
the functions).

```yaml
POST /_cluster/reroute?retry_failed=true
```

I got a message that significantly reduced my options.

```yaml
{
  ...
}
```
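
On a self-managed cluster you could go one step further and hand-place a
shard with an explicit reroute command. A sketch of what that looks like
(AWS rejects it just the same, and the index, shard and node values here are
made up):

```yaml
POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_replica": {
        "index": "myindex",
        "shard": 0,
        "node": "node-1"
      }
    }
  ]
}
```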

After that I went on the hunt again. I won't bother you with all the
details, because hours and days went by until I was finally able to re-index
the problematic index and hope for the best. Until that moment, even
re-indexing was giving me errors.

```yaml
POST _reindex
{
  ...
}
```
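
A reindex from the broken index into a fresh one generically looks like this
(the index names are placeholders, not my real ones):

```yaml
POST _reindex
{
  "source": {
    "index": "myindex"
  },
  "dest": {
    "index": "myindex-v2"
  }
}
```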

I needed to do this multiple times to get all the documents re-indexed. Then
I dropped the original index with the following command.

```yaml
DELETE /myindex
```

[…]

```yaml
POST _reindex
{
  ...
}
```
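
If you try this yourself, compare document counts on both sides before
deleting anything; a quick sanity check looks like this (placeholder names
again):

```yaml
GET /myindex/_count
GET /myindex-v2/_count
```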

On the surface it looks like everything is working, but I still have a long
road in front of me to get all the things working again. The cluster now
shows Green status, but I am also getting a notification that the cluster
has a processing status, which could mean a million things.
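
In the meantime, a quick way to keep an eye on things is to list the health
of each index directly; if the cluster is really green, every index in this
list should be green too:

```yaml
GET /_cat/indices?v
```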

Godspeed!