Recently I ran out of space on a 5-node Elasticsearch cluster. Events were not being indexed, and Logstash had amassed a 10GB disk-backed queue. It was not pretty.
I discovered that the fifth node was configured incorrectly and was storing the ES data on one of the smaller disk partitions. I stopped the Elasticsearch service on this node while I formulated a plan.
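If you want to spot the same mismatch on your own cluster, it shows up quickly from the shell. This is just a sketch and assumes an unsecured cluster answering on localhost:9200 and a stock data path; adjust for your setup:

    # how much disk does Elasticsearch think each node has left?
    curl -s 'localhost:9200/_cat/allocation?v'

    # where is each node actually writing its data?
    curl -s 'localhost:9200/_nodes/stats/fs?pretty'

    # and what the OS says about the partition behind that path
    df -h /var/lib/elasticsearch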
Unfortunately, I didn’t have the time (or confidence) to move the entire /var directory to the large partition (which happened to be serving the /home folder, mounted as /dev/mapper/centos-home), so I instead created a new folder at /home/elasticsearch (so it would sit on the large partition) and symlinked /var/lib/elasticsearch to the new folder on the larger partition: ln -s /home/elasticsearch/elasticsearch /var/lib/elasticsearch
After creating the symlink, I started the Elasticsearch service and watched the logs. After some time, I noticed that there were still no primary shards assigned to this new node (despite it being the only node with disk space utilization below the threshold), so I dug in a bit more.
This is where I learned about /_cluster/allocation/explain, which provides details about why certain shards may have an allocation problem. Ah ha! After 5 failed attempts to assign the unassigned shards to my new node, Elasticsearch just needed a little kick to re-run the allocation process: I opened up the Kibana console and ran POST /_cluster/reroute?retry_failed=true to force the algorithm to re-evaluate the placement of the shards.
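From the shell, those same two calls look like this (again assuming an unsecured cluster on localhost:9200; the index name in the second request is only a placeholder):

    # explain the first unassigned shard the cluster finds...
    curl -s 'localhost:9200/_cluster/allocation/explain?pretty'

    # ...or ask about a specific shard; look for the max_retry decider in the response
    curl -s -H 'Content-Type: application/json' \
         'localhost:9200/_cluster/allocation/explain?pretty' \
         -d '{"index": "logstash-2019.01.01", "shard": 0, "primary": true}'

    # retry the shards that already burned through their 5 allocation attempts
    curl -s -X POST 'localhost:9200/_cluster/reroute?retry_failed=true'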
Within about 90 seconds, the Elasticsearch cluster began rerouting all of the unassigned shards, and my Logstash disk-backed queue began to shrink as events poured into the freshly allocated shards on my new node.
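If you want to watch that recovery happen on your own cluster, a couple of read-only calls make the progress visible (same localhost:9200 assumption as above):

    # unassigned_shards should trend toward zero
    curl -s 'localhost:9200/_cluster/health?pretty'

    # and the shard recoveries currently in flight
    curl -s 'localhost:9200/_cat/recovery?v&active_only=true'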
Problem solved.
Stay tuned for next week when I pay off the technical debt incurred by placing my Elasticsearch shards on a symlink 😬