If you’ve ever worked on performance issues with an IO-intensive app, there is a good chance you already know that application performance degrades when the disks are stressed. This fact is usually well known, but the reasons behind it aren’t always clear. I’d like to try and clarify what’s going on behind the scenes.
In a typical scenario, when data is written to a file, it is first written to a memory area reserved as the page cache. The page holding the newly written data is considered dirty. After a period of time determined by the kernel’s IO policy, the kernel flushes the dirty data to the device queue to be persisted to the hard disk. Once the data gets to the queue, the rest is mechanical: the device driver picks up the IO requests, and the disk spins, seeks, and writes to the physical blocks where the file resides. If journaling is enabled, the journal is written first, then the actual file.
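To make this concrete, here is a minimal Python sketch (the path and size are arbitrary) showing that a normal write returns as soon as the data sits in the page cache, and that an explicit fsync is what forces the dirty pages down through the device queue:

```python
import os

path = "/tmp/pagecache_demo.dat"           # arbitrary demo path

with open(path, "wb") as f:
    # Returns as soon as the 64 MB is in the page cache; the pages are now
    # "dirty" and will be flushed later, per the kernel's writeback policy.
    f.write(b"\0" * (64 * 1024 * 1024))
    f.flush()
    # fsync() forces the kernel to flush this file's dirty pages to the
    # device queue now, instead of waiting for the flush policy to kick in.
    os.fsync(f.fileno())
```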
In a recent discussion with a few other engineers, the idea of disabling file system journaling came up as a way to improve disk write latency. While this does eliminate one disk operation per write, the actual time gained is negligible because the journal file is in the same block as the file to be written. The benefit of having the journal to restore the file system after a crash far outweighs the little latency saved.
More importantly, the bottleneck of an IO-intensive app usually occurs when the system flushes the dirty pages to disk, not during the journaling step. The throughput of flushing is limited by the device bandwidth: a typical 15K RPM disk can reach about 120 MB/sec in the best case of sequential access, and with random IO the actual bandwidth is even lower.

To better illustrate, assume the system uses the default Red Hat Linux flush policy of 30 seconds and the application writes at a rate of 20 MB/sec. After 30 seconds, the system would have accumulated 600 MB of dirty data to flush to disk (assuming the dirty page ratio has not been crossed before then). In Linux, the flushing is done by the pdflush daemon. Even in the best case of sequential writes, it would take all of the pdflush threads (8 by default) a full 5 seconds to flush the dirty data from the page cache to the device queue. The side effect during these 5 seconds is twofold: the page cache is being accessed exclusively by the flush, and the disk bandwidth is exhausted. One way to monitor this is to check the disk queue length: on Linux, the “Writeback” value in /proc/meminfo or the “avgqu-sz” value in sar.
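As a rough way to watch this from the outside, the sketch below (the polling interval and duration are arbitrary) samples the “Dirty” and “Writeback” counters from /proc/meminfo; a “Writeback” value that stays large means the flusher threads cannot keep up with the application’s write rate:

```python
import time

def meminfo(fields=("Dirty", "Writeback")):
    """Return selected /proc/meminfo counters (reported in kB)."""
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in fields:
                values[key] = int(rest.split()[0])
    return values

# Sample once per second for 30 seconds.
for _ in range(30):
    print(meminfo())
    time.sleep(1)
```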
The scenario gets more complicated when the JVM’s garbage collector is in the middle of cleaning the heap and is caught in this kernel-busily-flushing moment. Some GC events are stop-the-world (STW) events that require all application threads to pause in order to reach a safe state so that objects can be moved around in the heap. If one of the application threads is trying to write while the kernel is busy flushing, that thread gets stuck behind the flushing job and cannot respond to the request to pause. This causes a ripple effect: the busy disk blocks the writing thread, the blocked thread prolongs the GC pause, and the long pause makes the application appear unresponsive. The problem can be identified in the GC log as a long STW pause with little CPU time spent in “usr” and “sys”, correlated with disk activity.
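As a hypothetical illustration (the exact format depends on the JVM version and GC flags), such a pause shows up in a HotSpot GC log with a wall-clock (“real”) time far larger than the CPU time reported in “user” and “sys”:

```
2016-04-12T08:31:02.123+0000: [GC (Allocation Failure) ..., 4.5312 secs]
   [Times: user=0.06 sys=0.02, real=4.53 secs]
```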
The situation gets even worse when the system is also memory bound. By default in Linux, swapping is enabled with “swappiness” set to 60. With this setting, when the system is under memory pressure and IO is busy, it aggressively swaps out “idle” memory pages of running processes to disk (yes, more disk IO!) to make more room for the page cache. A process whose pages have been swapped out suffers when it becomes active again, because it first has to swap those pages back in. On the other hand, if swapping is effectively disabled (by setting /proc/sys/vm/swappiness to 0) to keep processes’ pages in memory as much as possible, the kernel has no choice but to flush the dirty data more frequently to free up pages. Premature and frequent flushes increase the IOPS pressure and decrease the efficiency of disk access, making the original cause of the problem even worse.
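For reference, here is a minimal sketch for checking the current value (changing it requires root and should follow testing against your own workload):

```python
# Read the current swappiness; 60 is the common default, while 0 tells the
# kernel to avoid swapping process pages out except under severe pressure.
with open("/proc/sys/vm/swappiness") as f:
    print("vm.swappiness =", f.read().strip())
```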
So what’s the mitigation strategy?
At the system level, always review the kernel’s flushing policy for your specific IO workload. Tuning for optimal performance usually requires a few rounds of testing. The recommended knobs to adjust on Linux are listed below (a short sketch for inspecting them follows the list):
- /proc/sys/vm/dirty_expire_centisecs
- /proc/sys/vm/dirty_background_ratio
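As a small sketch for inspecting the current values (reading works as any user; writing new values requires root and, as noted above, should only follow testing for your workload):

```python
import os

# The sysctl files behind the knobs above, plus the foreground dirty ratio,
# which is often reviewed together with them.
KNOBS = [
    "/proc/sys/vm/dirty_expire_centisecs",
    "/proc/sys/vm/dirty_background_ratio",
    "/proc/sys/vm/dirty_ratio",
]

for knob in KNOBS:
    with open(knob) as f:
        print(os.path.basename(knob), "=", f.read().strip())
```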
At the device configuration level, storing frequently accessed files on different devices can also help avoid the problem of a single congested device queue. Alternatively, if applicable, having multiple sets of RAID1 yields more device queues, which is better than a single volume over all the disks, which has only one device queue.
e.g. The figures below show that with 4 disks, the 2x RAID1 setup (on the left) provides two device queues accessible to the system, whereas the single 4-disk RAID10 setup (on the right) has only one.
If cost is not prohibitive, consider upgrading to SSDs, which have 6 to 7 times the bandwidth of spinning disks. The caveat with SSDs, however, is their occasional need to perform data compaction. The performance impact of compaction can be significant, and that discussion is best saved for a separate topic. Also pay attention to the logical volume manager (LVM). The use of LVM introduces a small latency overhead that is negligible compared to SAS disk access times; however, that same latency becomes much more noticeable when SSDs are used, because of their blazing speed compared to SAS.
At the application level, review any configurations to avoid double buffering of dirty pages. e.g. In MySQL you can use direct IO (innodb_flush_method = O_DIRECT) so that dirty data is cached only in MySQL’s own memory and not also in the system’s page cache.
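To show what direct IO looks like at the application level, here is a minimal, Linux-only Python sketch (the path and block size are assumptions; O_DIRECT requires the buffer, file offset, and length to be aligned to the device’s logical block size):

```python
import mmap
import os

BLOCK = 4096                              # assumed logical block size
path = "/tmp/direct_io_demo.dat"          # arbitrary demo path

# O_DIRECT needs a block-aligned buffer; an anonymous mmap is page-aligned.
buf = mmap.mmap(-1, BLOCK)
buf.write(b"x" * BLOCK)

fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
try:
    os.write(fd, buf)                     # goes straight to the device queue,
finally:                                  # never dirtying the page cache
    os.close(fd)
```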
Last but not least, always avoid unnecessary disk access.
Looking for more information on this topic? Check out the Mule ESB Performance Tuning Guide for best practices and tips from MuleSoft performance experts.