When we talk about performance problems, two use cases stand out from the end-user perspective.
- Search performance is slow, i.e. you execute a search from Kibana and the response is slow or times out.
- Indexing performance is slow, i.e. you don’t see your indexed data in time or the data is backlogged.
As always, compute resources are the prime suspects in any kind of performance issue. CPU and memory top the list, though other factors such as disk I/O and network performance also contribute.
We need to ensure that the basic requirements are taken care of in terms of resource allocation to the Elasticsearch and Logstash processes. By default, the ES and Logstash processes start with 2G of heap space. For most production installations this may not suffice. The following recommendations should be adhered to before searching for the actual cause of a performance issue.
- For production installations, we need analyzer VMs with a minimum of 32GB memory, while 64GB is recommended for data-intensive or cluster installations.
- A minimum of 16 CPU Cores.
- The Elasticsearch heap (Xmx and Xms) should be set to half of the available memory. For example, set it to 16GB on a 32GB system and 32GB on a 64GB system. Please note that on a 64GB system, to keep efficient heap management (compressed object pointers), we need to set Xmx/Xms to 32766m rather than 32G. For more details, refer to Heap-Sizing-Elasticsearch. See the jvm.options sketch after this list.
- The Logstash heap depends entirely on Logstash's processing activity, the number of filters, and the data volume. It is usually safe to start at 4G of heap and monitor usage over time.
- It is also recommended to leave at least 30-50% of the total memory unoccupied so that Lucene (the library on top of which ES is built) can use it as cache, which in turn speeds up searches.
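As a reference, heap sizes for both processes are set in their respective jvm.options files. The following is a minimal sketch for a 32GB analyzer VM; the exact file paths shown here are the usual package locations and may differ depending on how ES and Logstash were installed.
# /etc/elasticsearch/jvm.options (path may vary by installation)
# Set Xms and Xmx to the same value, half of the 32GB system memory
-Xms16g
-Xmx16g
# /etc/logstash/jvm.options (path may vary by installation)
# Start Logstash at 4g and adjust after monitoring usage
-Xms4g
-Xmx4g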
Let’s come back to the two performance use cases that we discussed.
Search Performance is slow
Things that need to be checked when search performance is bad:
- Usually index data is distributed across multiple shards. In a cluster setup, these shards can be on any of the nodes that are part of the cluster. When ES gets a search request, it needs to collate the requested data from all the shards and send the results back to Kibana.
- Do check whether the search request is trying to fetch a huge amount of data. Usually Kibana waits for 30 seconds before timing out. The search can get genuinely slow if ES is trying to fetch a huge amount of data (across days/months).
- If the amount of data is not large and the search is still slow, there are two possibilities: the network is slow (quite unlikely in production), or ES is busy and starved for resources. ES uses thread pools for search and indexing, and the size of each thread pool is decided automatically from the available cores. However, we can set the queue_size to a reasonable value so that requests are queued until a thread is available to process the query. The following elasticsearch.yml setting sets the search queue size to 5000 (note that on Elasticsearch 5.x and later the key is spelled thread_pool.search.queue_size).
threadpool.search.queue_size : 5000
- Another possible reason is frequent garbage collection. ES is a Java application and uses the CMS garbage collector to avoid long “stop the world” pauses while collecting objects. It takes minor pauses, but a pause can still be long if objects fill up the tenured (old) generation quickly. The more work ES does, the more the heap fills up and, subsequently, GC pauses follow. ES GC activity is tough to monitor from the command line, and we need scripts that send ES monitoring data to vuSmartMaps. There are a bunch of scripts and dashboards used at NPCI/NaCH/RBL for ES performance, and we can make use of them.
- Look for errors in the Elasticsearch logs indicating failed shards related to the index being searched.
- Check the cluster state and the state of the index being searched. The index should be green or yellow. The Elasticsearch cluster monitoring commands are described in the next section.
- Do look carefully at the output of the top and free commands on the CLI to see whether any other processes (not related to ES) are hogging memory and CPU. We have seen this a couple of times at NPCI.
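For a quick first pass over the checks above, the following commands (a minimal sketch, assuming ES is reachable on localhost:9200) show the cluster health, the search thread pool queue and rejections, and per-node JVM heap/GC statistics.
# Cluster health: status should be green (or at worst yellow)
curl -s localhost:9200/_cluster/health?pretty
# Search thread pool: a growing queue or a non-zero rejected count means ES is starved for search threads
curl -s 'localhost:9200/_cat/thread_pool/search?v&h=node_name,active,queue,rejected'
# Per-node JVM heap usage and GC collection counts/times
curl -s localhost:9200/_nodes/stats/jvm?pretty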
Indexing Performance
When we say indexing is slow, we are talking about issues where we don’t see updated data in Kibana and we are certain that there are no issues between the data source and the analyzer Logstash. There are two possibilities here:
- The data is not indexed in ES at all. You will most likely see an error in the ES logs, along with a reason, if indexing is failing for particular data. For example, you have defined a string mapping for a specific field and you are sending an integer or object type.
- The data is indexed in ES but backlogged. Usually an indexing delay at ES starts backlogging data at Logstash and, from there, at Redis. ES logs are verbose to the point that it is impractical to debug anything quickly from them, so the ES monitoring dashboards give better insight into the rate at which events are coming from Redis to Logstash and the rate at which ES is currently indexing data. From our experience, a single node (64GB, 16 cores, 32G heap) should be able to comfortably index about 1500-2000 events per second. A quick way to gauge the current indexing rate from the CLI is shown below.
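One way to measure the live indexing rate (a minimal sketch, assuming ES is reachable on localhost:9200; grep is used instead of jq to avoid extra dependencies) is to sample the cumulative index_total counter from the indices stats twice and divide by the interval.
#!/bin/bash
# Sample the cluster-wide indexing counter twice, 60 seconds apart,
# and print the average indexing rate over that interval.
c1=$(curl -s localhost:9200/_stats/indexing | grep -o '"index_total":[0-9]*' | head -1 | cut -d: -f2)
sleep 60
c2=$(curl -s localhost:9200/_stats/indexing | grep -o '"index_total":[0-9]*' | head -1 | cut -d: -f2)
echo "Indexing rate: $(( (c2 - c1) / 60 )) events/second"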
- Replication: Please note that ES is configured for shard replication (1 replica by default). This means every shard created will have an exact replica distributed across the cluster nodes. Replication is itself an indexing operation in ES, so if you are facing peak-hour indexing issues, consider disabling replication during the day and enabling it back at non-peak hours (night/midnight). You may set up a cron job for this; a sketch of such an entry follows the script below. Please note that having data with no replica shards brings the risk of losing data if any of the shards become corrupt or unrecoverable. At NPCI, the default ES template has number_of_replicas set to 0, which means the daily index is created with no replicas, and the setting is changed for that index on the night of the same day. This gives a big improvement in ES indexing performance. A simple shell script to enable replicas for a given index is shown below (for disabling, just change the number from 1 to 0).
#!/bin/bash
# Enable one replica for the index passed as the first argument
index=$1
curl -XPUT -H 'Content-Type: application/json' "localhost:9200/${index}/_settings" -d '{
  "index": {
    "number_of_replicas": "1"
  }
}'
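A sketch of the matching cron entry, assuming the script above is saved as /opt/scripts/enable_replicas.sh (a hypothetical path) and that the daily index follows the vunet-1-1-synthlog-YYYY.MM.DD naming pattern shown later in this section:
# crontab entry: at 23:30 every day, enable replicas for the current day's index
# (% must be escaped as \% inside crontab commands)
30 23 * * * /opt/scripts/enable_replicas.sh vunet-1-1-synthlog-$(date +\%Y.\%m.\%d)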
- Refresh Interval: Please ensure the Elasticsearch index refresh interval is 30s. You can query any index's settings to see this value, as shown below. In general, the vunet ES template is configured with a refresh_interval of 30s. A 30s refresh interval means search queries may show data that is at most 30s old. In a real-time production environment this has been found to be acceptable in practice, given the performance improvement it brings in a data-intensive environment.
ubuntu@dipikab-172-31-6-243:~$ curl -s localhost:9200/vunet-1-1-synthlog-2019.04.30/_settings?pretty
{
  "vunet-1-1-synthlog-2019.04.30" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "30s",
        "number_of_shards" : "2",
        "provided_name" : "vunet-1-1-synthlog-2019.04.30",
        "creation_date" : "1556582408091",
        "number_of_replicas" : "1",
        "uuid" : "rR7Qnk7_TRy6uXYsHjF7aA",
        "version" : {
          "created" : "6010299"
        }
      }
    }
  }
}
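If an index was created with a different value, the refresh interval can be changed on the fly through the same _settings endpoint (a minimal sketch; replace the index name with your own):
curl -XPUT -H 'Content-Type: application/json' localhost:9200/vunet-1-1-synthlog-2019.04.30/_settings -d '{
  "index": {
    "refresh_interval": "30s"
  }
}'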