It appears that the aforementioned issue still exists... Unfortunately, I could not reproduce the issue at all on a testing instance, so I need to use the live instance for debugging. Cluster 8 (approximately 3700 servers) may be down or unresponsive during this period. Sorry about that.
I give up. Since having more clusters makes the issue occur less often, I'm just gonna let that be for now. ¯_(ツ)_/¯
I found the cause! There are apparently a lot of CPU spikes, jumping from 30% to 100% for less than a second each time. It is really hard to figure out why this happens, so give me some time.
I am still unable to figure out the issue. I will attempt to increase the number of clusters and restart the entire bot.