Reviewing some case studies
This section discusses some real-world scenarios of Elasticsearch node failure and how to address them.
The ES process quits unexpectedly
A few weeks ago, we noticed in Marvel that the Elasticsearch process was down on one of our nodes. We restarted Elasticsearch on this node, and everything seemed to return to normal. However, when we checked Marvel later that week, the node was down again. We looked at the Elasticsearch log files but found no exceptions. Because nothing in the Elasticsearch log explained the shutdown, we suspected that the operating system had killed the Elasticsearch process. Checking /var/log/syslog, we saw the error:
Out of memory: Kill process 5969 (java) score 446 or sacrifice child ...
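If this happens more than once, it can help to scan the log for such entries rather than reading it by eye. The following is a minimal sketch, not a definitive tool: the names OOM_PATTERN, OomKill, and find_oom_kills are hypothetical, and the pattern assumes the message has the same shape as the line shown above. The wording of this kernel message differs between kernel versions, so the pattern may need adjusting on your system.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Assumption: the message matches the shape of the syslog line shown above.
# Other kernel versions word this message differently.
OOM_PATTERN = re.compile(
    r"Out of memory: Kill process (?P<pid>\d+) "
    r"\((?P<name>[^)]+)\) score (?P<score>\d+)"
)


class OomKill(NamedTuple):
    """One process selected by the kernel's out-of-memory killer."""
    pid: int
    name: str
    score: int


def find_oom_kills(lines: Iterable[str]) -> Iterator[OomKill]:
    """Yield an OomKill for every line that matches OOM_PATTERN."""
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            yield OomKill(
                pid=int(match["pid"]),
                name=match["name"],
                score=int(match["score"]),
            )


# Demonstration using the message from the case study.
sample = [
    "Out of memory: Kill process 5969 (java) score 446 or sacrifice child",
    "some unrelated log line",
]
for kill in find_oom_kills(sample):
    print(kill.pid, kill.name, kill.score)
```

Because find_oom_kills accepts any iterable of strings, an open file handle for /var/log/syslog can be passed to it directly. Filtering the results for the name java narrows the output to kills that may have affected Elasticsearch.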