An Unstable Index Cluster and How to Fix It with Splunk
Buckets of Tears from the Index Tier: When Out-of-the-Box Configs Don’t Cut It
I once arrived at a customer site where their index cluster was extremely unstable. The indexers were often “crashing” and sometimes they would bring the whole cluster down with them. They tried adding cores, which worked for a little while. They added RAM to no avail. They tried adding indexers and things just kept getting worse!
It wasn’t difficult to discover what was causing the problem. They had too many buckets!
This customer had a lot of data. They were ingesting several terabytes per day, and some of their indexes held more than a year’s worth. They also had a lot of indexers, 75 of them to be precise, with more than 70 active indexes spread across them. All told, they had close to half a million buckets.
There comes a point in the life of a growing Splunk installation where all the standard configurations need tuning for scale, and this installation had hit that point quite some time ago. As use of their system increased and each individual component became busier, the cluster master (CM) just couldn’t keep track of all those buckets anymore.
Tailoring the Cluster to the Workload
As it turned out, those indexers weren’t crashing, not really. The CM was simply losing track of their heartbeats amidst all the other work it had to do. Once it thought it had lost an indexer, it started trying to fix up the replication and search factors, and that additional workload caused the problem to snowball.
Since there are dozens of config files and thousands of settings, where do we start?
Heartbeats
When your CM thinks an indexer is down, but a quick SSH check shows that the indexer is chugging along just fine, it means the CM is missing heartbeats.
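For example, a quick check on the indexer itself usually settles it (the hostname here is made up, and the path assumes a standard /opt/splunk install):

   ssh idx07.example.com /opt/splunk/bin/splunk status
   splunkd is running (PID: 12345).

If splunkd reports itself as running while the CM lists that peer as down, the heartbeats are what’s getting lost.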
The CM considers an indexer “down” if more than 60 seconds pass without a heartbeat from that peer. Once you go north of 50 indexers or 100,000 buckets, you’re going to want to turn these settings up!
There are two settings you’ll want to adjust in server.conf, one for the indexers and one for the CM:
- heartbeat_period (Indexer): This controls how often each indexer attempts to send a heartbeat to the CM. The default is 1 second, so if your indexers are already busy with searching, indexing, and replication, this certainly isn’t helping. It can take a little testing, but a value between 5 and 30 seconds is a good starting point.
- heartbeat_timeout (CM): This determines the point at which the CM considers an indexer to be down, and once an indexer is down the CM starts performing fix-ups to replicate buckets from the dead indexer to its peers. If your CM couldn’t keep up with heartbeats BEFORE fix-ups, things are about to get worse. Set this to a multiple of the heartbeat period, typically between 20x and 60x (see the sketch below for example values).
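As a rough sketch of what this looks like in server.conf, assuming a 10-second heartbeat period and a 30x timeout multiplier (both values are illustrative starting points, not prescriptions):

   # On each indexer (peer node), in server.conf
   [clustering]
   heartbeat_period = 10

   # On the cluster master, in server.conf
   [clustering]
   heartbeat_timeout = 300

With a 10-second period and a 300-second timeout, each indexer sends a heartbeat a tenth as often, and the CM waits through five minutes of silence before declaring a peer down and kicking off fix-ups, instead of one minute out of the box.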
Job Frequency
Having more buckets means the cluster master has a lot more work to do. It has to walk every bucket to schedule a variety of jobs: replication and search fix-ups to meet the replication and search factors, primary fix-ups to ensure every bucket has a primary copy at each site, and other maintenance work like rolling buckets through their various stages.
How often these jobs run is controlled by a server.conf setting on the CM called service_interval, which is set to 1 second by default. A good guideline for tuning is to increase it by 1 second for every 50,000 buckets or so.
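As a quick sketch of that math: the customer above had close to half a million buckets, which works out to roughly 10 seconds (treat the exact value as a starting point to validate, not a hard rule):

   # On the cluster master, in server.conf
   [clustering]
   service_interval = 10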
Embracing Splunk With August Schell
Of course, there are more than a few other configuration settings that can really get things humming, and August Schell Splunk engineers are experts at performance tuning installations of all sizes.
Whether you’re struggling with clusters, indexes, indexers, or something else in your Splunk installation, or you’re just starting to explore Splunk for your IT environment, August Schell can help your IT and security team. Partnering with a Splunk consultant such as the engineering team at August Schell will help keep your Splunk clusters stable and make your environment run as smoothly and efficiently as possible.
If your agency needs assistance with your Splunk environment, get in touch with an August Schell specialist, or call us at (301)-838-9470.