Reading Time: 15 minutes

Today I'm going to share some valuable lessons learned about developing highly concurrent software. These are real-life lessons that come straight from the development of the Mule ESB. This is a story about deadlocks, context switches, CPU usage and profiling, focusing on how to diagnose these issues, which is often the hardest part of the solution.

So the story begins a couple of weeks ago when I started working on a new feature for Mule 3.5.0. You'll hear the details about it soon enough, but basically it's a new feature that aims to address integration use cases that require processing huge amounts of data. As you can imagine, this feature has to deal with parallelization, process management, atomicity, consistency, distributed locks, etc… All the fun stuff!

Initial symptoms

So after two iterations of development I had my first alpha version, which I proudly handed over to my friendly QA co-worker (Luciano Gandini, you're the man!). He coded a couple of sample apps and started testing. Of course he came back with a list full of bugs, but one of them was about a performance bottleneck that in some cases turned into a deadlock. The issue also had this image attached:

Diagnosis

Solving these kinds of problems is pretty much like being a veterinarian: your patient cannot tell you where it hurts or how it feels. All I had at the moment was a worried farmer saying “my Mule is walking slowly and doesn't have much appetite”, which makes images like the one above extremely valuable… as long as we're able to interpret them, of course. That's the hard part!

So, what does the image above tell us? It's a screenshot from VisualVM, a profiling application that monitors a JVM instance and gives you useful insights into memory, CPU usage, and the active threads and their states. Here we can see:

  • The threads shown belong to a thread pool that my new feature creates. The tool showed a lot of other threads, but for illustrative purposes I'm only showing these.
  • Next to each thread's name there's a timeline that shows that thread's state over a period of time.
  • When a thread is green, it means it was running and doing actual work.
  • When it's yellow, it means the thread called wait() and is waiting for a notify() or notifyAll() invocation to wake it up.
  • Finally, red means that the thread is blocked trying to acquire a monitor (in simpler words, it has reached a synchronized block or is waiting for some kind of lock). The sketch right after this list shows both situations in code.
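If you've never mapped those colors to actual code, here's a minimal, self-contained sketch (not taken from the Mule code base, just an illustration) that produces one thread of each problematic kind: one that shows up red because it's stuck on a synchronized monitor, and one that shows up yellow because it called wait():

    public class ThreadStatesDemo {

        private static final Object MONITOR = new Object();

        public static void main(String[] args) throws InterruptedException {
            // One thread grabs the monitor and holds it for a while, so anyone else
            // entering the same synchronized block goes BLOCKED ("red").
            Thread holder = new Thread(() -> {
                synchronized (MONITOR) {
                    sleepQuietly(30_000);           // hold the lock for 30 seconds
                }
            }, "lock-holder");

            Thread blocked = new Thread(() -> {
                synchronized (MONITOR) {            // BLOCKED here while lock-holder owns the monitor
                    // nothing to do; we only wanted to contend for the lock
                }
            }, "blocked-red");

            // Another thread calls wait() and shows up WAITING ("yellow")
            // until somebody calls notify()/notifyAll() on the same object.
            Object signal = new Object();
            Thread waiting = new Thread(() -> {
                synchronized (signal) {
                    try {
                        signal.wait();              // WAITING here
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }, "waiting-yellow");

            holder.start();
            Thread.sleep(100);                      // let the holder acquire the lock first
            blocked.start();
            waiting.start();

            // Attach VisualVM (or take a thread dump) now: "blocked-red" shows as
            // BLOCKED and "waiting-yellow" shows as WAITING.
            holder.join();
            synchronized (signal) {
                signal.notifyAll();                 // wake the waiting thread so the demo can exit
            }
            blocked.join();
            waiting.join();
        }

        private static void sleepQuietly(long millis) {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }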

So at this point we can already draw our first conclusions:

  • All the threads seem to turn red at the same time, which most likely means they're all trying to enter the same synchronized block at once. Also, they're red most of the time, which explains why work is being done so slowly.
  • Also, some threads spend quite some time being yellow, aka waiting for something else to happen. Because there are so many blocked threads, it's hard to deduce why that's happening at this point, so for now we'll only focus on the red threads.

Alright! So now we know what's happening, which is a lot! However, we still don't know why it's happening… meaning we don't know which pieces of code are causing this. That's what a thread dump is for. Basically, a thread dump gives you the current stack trace of each of these threads at a given time, so you can see what they were actually doing. The dump showed this:


    at org.mule.module.batch.engine.RecordBuffer$BufferHolder.access$3(RecordBuffer.java:128)
    at org.mule.module.batch.engine.RecordBuffer.add(RecordBuffer.java:88)
    at org.mule.module.batch.BatchStepAggregate.add(BatchStepAggregate.java:101)
    at org.mule.module.batch.DefaultBatchStep.onRecord(DefaultBatchStep.java:144)
    at org.mule.module.batch.BatchRecordWork.run(BatchRecordWork.java:62)
    at org.mule.work.WorkerContext.run(WorkerContext.java:294)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:921)
    at java.lang.Thread.run(Thread.java:680)

So that's it! Threads are getting locked at line 88 of the RecordBuffer class. Congratulations: You just found your tumor!
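By the way, you don't need a full profiler to capture a dump like this: jstack <pid> prints one from the command line, most profilers have a "thread dump" button, and the JDK's ThreadMXBean lets you grab one from inside the running JVM. A minimal sketch of that last, in-process option (not what I actually used, just a handy alternative):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class ThreadDumper {

        /** Prints the name, state and stack trace of every live thread in this JVM. */
        public static void dump() {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            // true, true => also report which monitors and locks each thread holds
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                System.out.println("\"" + info.getThreadName() + "\" " + info.getThreadState());
                for (StackTraceElement frame : info.getStackTrace()) {
                    System.out.println("    at " + frame);
                }
                System.out.println();
            }
        }

        public static void main(String[] args) {
            dump();
        }
    }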

Treatment

Luckily, this tumor was operable, so with a little refactoring I was able to avoid the contention and remove the deadlock. But as you know, no surgical procedure is without an aftermath, so before declaring the bug fixed I ran the test again and did a profiling session of my own. This time I used another profiler called YourKit instead of VisualVM, mainly to take advantage of its usability features.

NOTE: I chose YourKit because I like its usability features and because MuleSoft has purchased licenses for it, so it came in handy. However, I do want to emphasize that other products such as JProfiler or Mission Control could have gotten the job done just as well. Just choose the one you like the most!

After repeating the test, I got this second image:

So the red is mostly gone… The ones you still see are due to network I/O and have reasonable lengths. However, what's the deal with all the yellow? Most of my threads went from being blocked to waiting. That's not cool either! The deadlocks are gone, so the app will no longer hang, but performance hasn't gone up; it's just as slow as it was before. Additionally, at this point I noticed a new symptom I hadn't seen before: CPU usage was really low, like under 20%.

What do we do now? The same as before: take a thread dump of the waiting threads. The results were even more surprising! Remember that I told you that these threads were part of a thread pool that my new feature creates? Well, it turns out that the threads were actually idle, just sitting there waiting for more work to come. I went through the code many times asking myself “how in hell is this possible?”… The answer was quite simple: while fighting the high complexity of the code, I lost sight of the simplicity of the configuration.

In short, I have a thread pool that does the processing. We'll call the threads in that pool “worker threads”. Plus, there's another thread whose responsibility is to dispatch work to that pool (we'll call it the dispatcher thread). The dispatcher thread was able to generate work way faster than the worker threads were able to finish it, so the pool got exhausted pretty quickly. Thread pools often have configurable strategies to deal with exhaustion: some reject the excess work, some wait… It turns out that the default configuration of the thread pool I was using was to execute the job in the invoking thread. That means the dispatcher thread could not keep dispatching work because it was busy processing the work it was supposed to hand to others. So, by the time the worker threads were ready to take in more work, they had to wait for the dispatcher thread to become available again.
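The pools involved here are Mule's own and are tuned through Mule's configuration, but the same trap is easy to reproduce with a plain java.util.concurrent pool, so here's an illustrative analogy (the names and numbers are made up). CallerRunsPolicy is the standard-library version of “execute the job in the invoking thread”, and with it the dispatcher ends up doing the workers' job every time the pool is exhausted:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class DispatcherStarvationDemo {

        public static void main(String[] args) throws InterruptedException {
            // A small pool with a small queue so that it exhausts quickly.
            // CallerRunsPolicy = "when full, run the task in the submitting thread".
            ThreadPoolExecutor workers = new ThreadPoolExecutor(
                    4, 4,
                    0L, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<>(8),
                    new ThreadPoolExecutor.CallerRunsPolicy());

            // The main thread plays the dispatcher: it can generate work
            // far faster than the workers can finish it...
            for (int i = 0; i < 1_000; i++) {
                workers.execute(DispatcherStarvationDemo::simulateSlowRecord);
                // ...so once the queue fills up, execute() runs simulateSlowRecord()
                // right here in the dispatcher. No new work gets dispatched until it
                // finishes, and meanwhile the workers drain the queue and sit idle.
            }

            workers.shutdown();
            workers.awaitTermination(1, TimeUnit.MINUTES);
        }

        private static void simulateSlowRecord() {
            try {
                Thread.sleep(50); // pretend each record takes a while to process
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

Profile something like this and you'll see the same pattern: mostly idle workers and one very busy dispatcher.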

As simple as this bug is, actually seeing it can be quite tricky. How did I discover this? By looking at the thread dump for the dispatcher thread. Once again, profilers are your friend.

The solution was easy: I just configured the thread pool to wait in case of exhaustion. That way, as soon as a worker thread becomes available, the dispatcher thread wakes up and hands work to it. The profiler now looked like this:

Now we see that the worker threads are making efficient use of the CPU and there are no red dots other than the inevitable I/O. CPU usage was now around 100% and the app was 58% faster.
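For reference, in plain java.util.concurrent terms (again, the real fix was a Mule pool setting, not this exact code) “wait in case of exhaustion” boils down to a rejection handler that blocks the dispatcher until the queue has room:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.RejectedExecutionException;
    import java.util.concurrent.RejectedExecutionHandler;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class WaitOnExhaustion {

        /** Rejection handler that blocks the submitting (dispatcher) thread until a queue slot frees up. */
        static class BlockWhenFull implements RejectedExecutionHandler {
            @Override
            public void rejectedExecution(Runnable task, ThreadPoolExecutor executor) {
                if (executor.isShutdown()) {
                    throw new RejectedExecutionException("Executor has been shut down");
                }
                try {
                    // put() blocks instead of failing, so the dispatcher just waits
                    // for a worker to free a slot and then keeps dispatching.
                    executor.getQueue().put(task);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RejectedExecutionException("Interrupted while waiting for queue space", e);
                }
            }
        }

        public static ThreadPoolExecutor newBlockingPool(int workerCount, int queueSize) {
            return new ThreadPoolExecutor(
                    workerCount, workerCount,
                    0L, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<>(queueSize),
                    new BlockWhenFull());
        }
    }

The effect is what the profile shows: the dispatcher pauses only while the workers are saturated and wakes up the moment one of them frees a slot.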

Less concurrency is more

One last thing: when I was checking CPU usage I noticed that although usage was around 100%, more than half of it was spent on “system operations” and only around 45% was actually being used by Mule. This happened because the thread pool was way bigger than necessary. Parallelism is good, but you can never lose sight of the fact that the more threads you have, the more context-switch overhead you'll suffer. So I took my thread pool down from 100 threads to only 16. The result? My test was now one minute faster. So remember: when you're tuning for maximum CPU usage, don't go for as many threads as you can allocate. Instead, search for the number that gives you maximum CPU usage with the lowest context-switch overhead.
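There's no universal magic number here; 16 just happened to be the sweet spot for this workload on that machine. If you need a starting point before profiling, the number of cores the JVM sees is a reasonable baseline for CPU-bound work, with a small multiple of it when tasks also block on I/O. A purely illustrative sketch:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSizing {

        public static ExecutorService newWorkerPool() {
            // Start from the number of cores instead of a big arbitrary number
            // like 100; oversizing the pool only buys you context switches.
            int cores = Runtime.getRuntime().availableProcessors();

            // Illustrative factor, not a rule: ~1x cores for CPU-bound work,
            // ~2x when tasks also spend time blocked on I/O. Profile and adjust.
            int poolSize = cores * 2;

            return Executors.newFixedThreadPool(poolSize);
        }
    }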

Summary take-aways

I hope you find this useful. This is how we make our Mule gallop!

The key things you should remember from this post are:

  • When testing performance, use a profiler to make sure you're making efficient use of your resources
  • Try to keep synchronized blocks to a minimum
  • Try to reduce wait times
  • Do not abuse your ability to start threads. Reduce the context switch overhead
  • But most importantly: It doesn't matter how much data you have if you're not able to interpret it and turn it into information

Thank you for reading. See you next post around!