Recently I ran out of space on a 5-node Elasticsearch cluster. Events were not being indexed, and Logstash had amassed a 10GB disk-backed queue. It was not pretty.
I discovered that the fifth node was misconfigured: it was storing its Elasticsearch data on one of the smaller disk partitions. I stopped the Elasticsearch service on that node while I formulated a plan.
Unfortunately, I didn’t have the time (or confidence) to move the entire /var directory to the large partition (which happened to be serving the /home folder, mounted as /dev/mapper/centos-home). Instead, I created a new folder under /home/elasticsearch (so it would live on the large partition) and symlinked /var/lib/elasticsearch to it:

ln -s /home/elasticsearch/elasticsearch /var/lib/elasticsearch
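The relocation boils down to three steps: make a data directory on the big partition, hand it to the service user, and point the old path at it. Here's a minimal sketch; the PREFIX variable is my addition so the steps can be rehearsed in a scratch directory without root (on the real host PREFIX would be empty and the chown uncommented):

```shell
# Rehearse the move under a scratch prefix; use PREFIX="" on the real host.
PREFIX=$(mktemp -d)

# 1. Create the new data directory on the large partition.
mkdir -p "$PREFIX/home/elasticsearch/elasticsearch"

# 2. On the real host: give the elasticsearch user ownership (needs root).
# chown -R elasticsearch:elasticsearch "$PREFIX/home/elasticsearch"

# 3. Point the old path at the new location with a symlink.
mkdir -p "$PREFIX/var/lib"
ln -s "$PREFIX/home/elasticsearch/elasticsearch" "$PREFIX/var/lib/elasticsearch"

# Confirm where the symlink resolves.
readlink "$PREFIX/var/lib/elasticsearch"
```

Note that if the old /var/lib/elasticsearch directory already exists with data in it, it has to be moved or removed first, or the `ln -s` will land the link inside it instead of replacing it.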
After creating the symlink, I started the Elasticsearch service and watched the logs. After some time, I noticed that there were still no primary shards assigned to this new node (despite it being the only node with disk utilization below the threshold), so I dug in a bit more.
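The threshold in question is Elasticsearch’s disk-based allocation watermark (cluster.routing.allocation.disk.watermark.low and .high, 85% and 90% of disk by default). If memory serves, the effective defaults can be inspected from the Kibana console with something like:

```
GET _cluster/settings?include_defaults=true&filter_path=defaults.cluster.routing.allocation.disk
```

Once a node crosses the low watermark, the cluster stops allocating new shards to it, which is exactly the state every other node in my cluster was in.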
This is where I learned about the /_cluster/allocation/explain API, which provides details about why certain shards may have an allocation problem. Ah ha! After 5 failed attempts to allocate the unassigned shards to my new node, Elasticsearch just needed a little kick to re-run the allocation process. I opened up the Kibana console and ran
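Called with no body, the explain API picks an unassigned shard and reports why it can’t be placed (the allocate_explanation and unassigned_info fields of the response are the interesting parts). You can also ask about a specific shard; the index name below is a placeholder:

```
GET _cluster/allocation/explain

GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": true
}
```

In my case, the explanation showed the shards had hit the allocation retry limit, which is what pointed me at retry_failed.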
POST /_cluster/reroute?retry_failed=true to force the algorithm to re-evaluate the location of shards
Within about 90 seconds, the Elasticsearch cluster began rerouting all of the unassigned shards, and my Logstash disk queue began to shrink as events poured into the freshly allocated shards on the new node.
Stay tuned for next week when I pay off the technical debt incurred by placing my Elasticsearch shards on a symlink 😬