One fine day, your ElasticSearch cluster starts rejecting write requests, and you find the following error in the logs:
elasticsearch.helpers.errors.BulkIndexError: ('287 document(s) failed to index.', [{'index': {'_index': 'foo-bar', '_type': '_doc', '_id': '0086f6dbb767f5fa3eac0adf97031259', 'status': 429, 'error': {'type': 'es_rejected_execution_exception', 'reason': 'rejected execution of processing of [1095797768][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[foo-bar][3]] containing [61] requests, target allocation id: Rhcq_0fjR5eTqj0LCOtIQw, primary term: 3 on EsThreadPoolExecutor[name = node-01/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@7a93b5a8[Running, pool size = 2, active threads = 2, queued tasks = 200, completed tasks = 268040145]]'}
What is happening exactly?
Each ElasticSearch data node maintains a thread-pool with a queue for write operations. The default size of this queue is 200. If the queue is full and the node receives more write requests, it rejects them, which is what the 429 status in the error above indicates.
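On the client side, you can soften the impact of these rejections by retrying requests that come back with a 429. Here is a minimal sketch, assuming the official Python client (elasticsearch-py) whose helpers raised the BulkIndexError above; the host URL, index name, and documents are placeholders.

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

# Placeholder connection and documents -- adjust for your cluster.
es = Elasticsearch("http://localhost:9200")
docs = ({"_index": "foo-bar", "_source": {"field": i}} for i in range(10000))

# streaming_bulk retries documents rejected with 429 when max_retries > 0,
# backing off between attempts instead of failing the whole bulk request.
for ok, result in streaming_bulk(
    es,
    docs,
    chunk_size=500,      # smaller chunks put less pressure on the write queue
    max_retries=5,       # retry a document up to 5 times on 429
    initial_backoff=2,   # wait 2s before the first retry, doubling each time
    max_backoff=60,
    raise_on_error=False,
):
    if not ok:
        print("Failed:", result)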
How to check the write thread-pool queue size?
ElasticSearch provides the _cat/thread_pool API to inspect its different thread-pools. For our use case, we can call it with the write filter.
GET /_cat/thread_pool/write?v
Response:
node_name name active queue rejected
node-04 write 0 0 0
node-02 write 2 8 0
node-03 write 0 0 0
node-01 write 2 199 191958
Looking at the response, node-01 is receiving far more write requests than the other nodes. This can happen when most of the write-heavy indexes or shards are allocated on a single node.
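You can confirm the imbalance by checking which node hosts which shards, for example with GET /_cat/shards?v. Below is a minimal sketch of the same check with the Python client's cat.shards API; the host URL is a placeholder.

from collections import Counter
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder URL

# List every shard with its index, primary/replica flag and node,
# then count how many primary shards each node holds.
shards = es.cat.shards(format="json", h="index,shard,prirep,node")
primaries_per_node = Counter(s["node"] for s in shards if s["prirep"] == "p")
print(primaries_per_node)   # e.g. Counter({'node-01': 14, 'node-02': 5, ...})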
Solutions
Solution 1: Distribute Write-Heavy Shards Evenly
In the case above, you need to move some shards, or even entire indexes, to other nodes. You can achieve this with an index-level shard allocation routing setting, as shown below.
PUT foo-bar/_settings
{
  "index" : {
    "routing" : {
      "allocation" : {
        "include" : {
          "_name" : "node-03,node-04,node-02"
        }
      }
    }
  }
}
With this setting, you tell ElasticSearch to allocate the shards of index foo-bar only on the other three nodes. We take this decision because node-01 already hosts other write-heavy indexes.
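If you prefer applying the setting from code, here is a minimal sketch using the Python client; the host URL is a placeholder, and with the 8.x client the body keyword is replaced by settings.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder URL

# Restrict allocation of foo-bar shards to the three less loaded nodes.
# Equivalent to the PUT foo-bar/_settings request above.
es.indices.put_settings(
    index="foo-bar",
    body={"index.routing.allocation.include._name": "node-03,node-04,node-02"},
)

Once the setting is applied, ElasticSearch relocates the affected shards automatically; you can follow the relocation with GET /_cat/recovery?v or the earlier cat.shards check.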
Solution 2: Increase Refresh Interval of Index
By default, the refresh interval of an ElasticSearch index is 1 second. Each refresh writes buffered documents into a new segment so they become searchable, which costs write throughput. If near-real-time search is not critical for the index, you can increase the refresh interval with an index-level setting.
PUT foo-bar/_settings
{
  "index.refresh_interval": "10s"
}
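For a one-off heavy ingestion, you can also raise the interval just for the duration of the load and restore it afterwards. A minimal sketch, again with the Python client; run_bulk_load is a hypothetical placeholder for your own ingestion code, and the host URL is a placeholder.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder URL

# Relax the refresh interval while the bulk load runs ...
es.indices.put_settings(index="foo-bar", body={"index.refresh_interval": "30s"})

try:
    run_bulk_load(es)  # hypothetical ingestion function
finally:
    # ... and restore the default 1s interval once it is done.
    es.indices.put_settings(index="foo-bar", body={"index.refresh_interval": "1s"})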
Solution 3: Add More Nodes
If the write queues of all nodes are approaching the 200 limit, the write load is already spread evenly across the cluster, and you should consider adding more nodes. In that case, the thread-pool response looks like this:
GET /_cat/thread_pool/write?v
Response:
node_name name active queue rejected
node-04 write 2 200 19151
node-02 write 2 189 1990
node-03 write 2 192 11958
node-01 write 2 199 91959
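Before adding hardware, it is worth verifying that every node is saturated and that the rejections are not caused by a single hot node (which Solution 1 would address). A minimal sketch with the Python client's cat.thread_pool API; the host URL and the queue threshold are placeholders.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder URL

# Same data as GET /_cat/thread_pool/write?v, but as JSON.
pools = es.cat.thread_pool(thread_pool_patterns="write", format="json")

# If every node's write queue is close to the 200 limit, the load is already
# spread evenly and scaling out is the remaining option.
saturated = [p for p in pools if int(p["queue"]) > 150]  # illustrative threshold
if len(saturated) == len(pools):
    print("All nodes are saturated -- consider adding nodes.")
else:
    print("Only some nodes are saturated:", [p["node_name"] for p in saturated])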