---
title: The strange case of Elasticsearch allocation failure
permalink: /the-strange-case-of-elasticsearch-allocation-failure.html
date: 2020-03-29T12:00:00+02:00
layout: post
type: post
draft: false
---

I've been using Elasticsearch in production for 5 years now and never had a
single problem with it. Hell, I never even knew there could be a problem. It
just worked. All this time. The first node that I deployed is still being used
in production, never updated, upgraded or touched in any way.

All this bliss came to an abrupt end this Friday when I got a notification that
the Elasticsearch cluster went warm. Well, warm is not that bad, right? Wrong!
Quickly after that I got another email which sent chills down my spine. The
cluster was now red. RED! Now the shit had really hit the fan!

I tried googling what the problem could be, and after running the allocation
query I noticed that some shards were unassigned and that 5 allocation attempts
had already been made (which, BTW, to my luck is the maximum), which meant I was
basically fucked. The answers I found also implied that one should wait for the
cluster to re-balance itself. So I waited. One hour, two hours, several hours.
Nothing, still RED.
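
If you land in the same spot, the cluster allocation explain API is, as far as
I can tell, the most direct way to see *why* a shard is unassigned; the exact
response varies, but it should mention that the maximum number of retries has
been exhausted. A minimal sketch (assuming AWS does not block it the way it
blocks reroute):

```yaml
GET /_cluster/allocation/explain
```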

The strangest thing about it all was that queries were still being fulfilled.
Data was coming out. On the outside it looked like nothing was wrong, but
anybody who looked at the cluster would know immediately that something was
very, very wrong and that we were living on borrowed time here.

> **Please, DO NOT do what I did.** Seriously! Please ask someone on the
official forums or, if you know an expert, please consult them. There could be
a million reasons and this solution fit my problem. Maybe in your case it would
be disastrous. I had all the data backed up, and even if I failed spectacularly
I would be able to restore it. It would be a huge pain and I would lose a
couple of days, but I had a plan B.

Executing the allocation query told me what the problem was, but offered no
clear solution yet.

```yaml
GET /_cat/allocation?format=json
```

I got a message saying `ALLOCATION_FAILED` with the additional info `failed to
create shard, failure ioexception[failed to obtain in-memory shard lock]`.
Well, splendid! I must also say that our cluster is more than capable of
handling the traffic, and JVM memory pressure was never an issue either. So
what really happened then?
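
To see which shards were actually stuck, `_cat/shards` is also handy; if I
understand the columns correctly, `unassigned.reason` should show
`ALLOCATION_FAILED` for the problematic ones. A minimal sketch:

```yaml
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason
```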

I also tried re-routing the failed shards, with no success, due to AWS
restrictions on the managed Elasticsearch cluster (they lock down some of the
functions).

```yaml
POST /_cluster/reroute?retry_failed=true
```

I got a message that significantly reduced my options.

```json
{
  "Message": "Your request: '/_cluster/reroute' is not allowed."
}
```

After that I went on a hunt again. I won't bother you with all the details,
because hours (days, really) went by until I was finally able to re-index the
problematic index and hope for the best. Up until that moment even re-indexing
was giving me errors.

```yaml
POST _reindex
{
  "source": {
    "index": "myindex"
  },
  "dest": {
    "index": "myindex-new"
  }
}
```
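
Before dropping anything, it is worth checking that the new index actually
holds all the documents. A quick sanity check, using the same index names as
above, is to compare the counts:

```yaml
GET /myindex/_count
GET /myindex-new/_count
```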

I needed to do this multiple times to get all the documents re-indexed. Then I
dropped the original one with the following command.

```yaml
DELETE /myindex
```

And then I re-indexed the new one back into the original one (well, "original"
by name only).

```yaml
POST _reindex
{
  "source": {
    "index": "myindex-new"
  },
  "dest": {
    "index": "myindex"
  }
}
```

On the surface it looks like everything is working, but I have a long road in
front of me to get all the things working again. The cluster now shows that it
is green, but I am also getting a notification that the cluster has a
"processing" status, which could mean a million things.
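
If you would rather keep an eye on it yourself instead of waiting for the next
notification, the cluster health API is the quickest check; I am mostly
watching fields like `status` and `unassigned_shards`:

```yaml
GET /_cluster/health
```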

Godspeed!