Chasing the bottleneck: True story about fighting thread contention in your code


Today I’m going to share some valuable lessons learned about developing highly concurrent software. These are real-life lessons that come straight from the development of the Mule ESB. This is a story about deadlocks, context switches, CPU usage and profiling, focusing on how to diagnose these issues, which is often the hardest part of the solution.

So the story begins a couple of weeks ago when I started working on a new feature for Mule 3.5.0. You’ll hear the details about it soon enough, but basically it’s a new feature that aims to address integration use cases which require processing huge amounts of data. As you can imagine, this feature has to deal with parallelization, process management, atomicity, consistency, distributed locks, etc. All the fun stuff!

Initial symptoms

So after 2 iterations of development I had my first alpha version, which I proudly handed over to my friendly QA co-worker (Luciano Gandini, you’re the man!). He coded a couple of sample apps and started testing. Of course he came back with a list full of bugs, but one of them was about a performance bottleneck that in some cases turned into a deadlock. The issue also had this image attached:

Diagnosis

Solving this kind of problem is pretty much like being a veterinarian. Your patient cannot tell you where it hurts or how it feels. All I had at the moment was a worried farmer saying “my Mule is walking slowly and doesn’t have much appetite”, which makes images like the one above extremely valuable… that is, of course, as long as we’re able to interpret them. That’s the hard part!

So, what does the image above tell us? The image is a screenshot from a tool called VisualVM, a profiling application that monitors a JVM instance and gives you useful insights about memory, CPU usage, and the active threads and their state. Here we can see:

  • The threads shown belong to a thread pool that my new feature creates. The tool showed a lot of other threads, but for illustrative purposes I’m only showing these ones.
  • Next to each thread name there’s a timeline that shows that thread’s state over a period of time:
  • When a thread is green, it means it was processing.
  • When it’s yellow, it means that the wait() method was called on it and it’s waiting for a notify() or notifyAll() invocation to wake it up.
  • Finally, red means that the thread is blocked waiting to acquire a monitor (which in simpler words means that it has reached a synchronized block or is waiting for some kind of lock), as the small sketch after this list illustrates.
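If you want to see these states for yourself, here is a stand-alone toy example of mine (not Mule code) that starts three threads a profiler would paint in those colors: one holding a monitor, one blocked trying to enter the same synchronized block (red), and one parked in wait() (yellow):

    public class ThreadStateDemo {

        private static final Object contendedLock = new Object();
        private static final Object waitLock = new Object();

        public static void main(String[] args) throws InterruptedException {
            // Holds the monitor for a while so the "blocked" thread cannot enter the synchronized block.
            Thread holder = new Thread(() -> {
                synchronized (contendedLock) {
                    sleepQuietly(10_000);
                }
            }, "holder");

            // Shows up as BLOCKED (red in VisualVM) while trying to acquire the same monitor.
            Thread blocked = new Thread(() -> {
                synchronized (contendedLock) {
                    // never gets here until "holder" releases the monitor
                }
            }, "blocked");

            // Shows up as WAITING (yellow) after calling wait().
            Thread waiter = new Thread(() -> {
                synchronized (waitLock) {
                    try {
                        waitLock.wait();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }, "waiter");

            holder.start();
            Thread.sleep(200); // let "holder" grab the monitor first
            blocked.start();
            waiter.start();

            Thread.sleep(500);
            System.out.println("blocked thread state: " + blocked.getState()); // BLOCKED
            System.out.println("waiter thread state:  " + waiter.getState());  // WAITING
        }

        private static void sleepQuietly(long millis) {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

Attach any profiler to this little program and you’ll see the same green/yellow/red pattern on a much smaller scale.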

So at this point we can already have our first conclusions:

  • All threads seem to be turning red at the same time, which most likely means that they’re all trying to enter the same synchronized block at the same time. Also, they’re red most of the time, which explains why work is being done so slowly.
  • Also, there are some threads that spend quite some time being yellow, i.e. waiting for something else to happen. Because there are so many blocked threads, it’s hard to deduce why that’s happening at this point, so for now we’ll only focus on the red threads.

Alright! So now we know what’s happening, which is a lot! However, we still don’t know why it’s happening… meaning we don’t know which pieces of code are causing this. That’s what a thread dump is useful for. Basically, a thread dump gives you the current stack trace for each of these threads at a given time, so that you can see what they were actually doing. The dump showed this:

    waiting to lock <7e69a3f18> (a org.mule.module.batch.engine.RecordBuffer$BufferHolder)
    at org.mule.module.batch.engine.RecordBuffer$BufferHolder.access$3(RecordBuffer.java:128)
    at org.mule.module.batch.engine.RecordBuffer.add(RecordBuffer.java:88)
    at org.mule.module.batch.BatchStepAggregate.add(BatchStepAggregate.java:101)
    at org.mule.module.batch.DefaultBatchStep.onRecord(DefaultBatchStep.java:144)
    at org.mule.module.batch.BatchRecordWork.run(BatchRecordWork.java:62)
    at org.mule.work.WorkerContext.run(WorkerContext.java:294)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:921)
    at java.lang.Thread.run(Thread.java:680)

So that’s it! Threads are getting locked at line 88 of the RecordBuffer class. Congratulations: You just found your tumor!
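In my case the dump came straight from the profiler, but you can also grab one yourself with the jstack tool that ships with the JDK (jstack <pid>), or even programmatically. A minimal sketch of the programmatic route, purely as an illustration:

    import java.util.Map;

    public class ThreadDumpSketch {

        public static void main(String[] args) {
            // Prints the name, state and current stack trace of every live thread in this JVM,
            // which is essentially what a thread dump gives you.
            for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
                Thread thread = entry.getKey();
                System.out.println("\"" + thread.getName() + "\" state=" + thread.getState());
                for (StackTraceElement frame : entry.getValue()) {
                    System.out.println("    at " + frame);
                }
                System.out.println();
            }
        }
    }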

Treatment

Luckily, this tumor was operable, so with a little refactoring I was able to avoid the contention and remove the deadlock. But as you know, no surgical procedure is without an aftermath, so before declaring the bug fixed I did the test again and ran a profiling session of my own. This time I used another profiler called YourKit instead of VisualVM, mainly to take advantage of its usability features.
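The post doesn’t show the actual refactoring, but purely as an illustration of the general idea (hypothetical names, not the real Mule code), removing this kind of contention usually comes down to shrinking the synchronized section or replacing it with a concurrent data structure:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Hypothetical illustration only: a buffer whose add() no longer funnels every
    // worker thread through a single synchronized block.
    public class RecordBufferSketch {

        // Before (contended): every caller blocks on the same monitor.
        private final Object lock = new Object();
        private final List<Object> records = new ArrayList<>();

        public void addContended(Object record) {
            synchronized (lock) {          // all worker threads queue up here
                records.add(record);
            }
        }

        // After (mostly lock-free): a concurrent queue absorbs the writes,
        // and any expensive work happens outside a critical section.
        private final Queue<Object> buffer = new ConcurrentLinkedQueue<>();

        public void add(Object record) {
            buffer.offer(record);          // no monitor acquisition on the hot path
        }
    }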

NOTE: I chose YourKit because I like its usability features and because MuleSoft has purchased licenses for it, so it came in handy to do so. However, I do want to emphasise that other products such as JProfiler or Mission Control could have got the job done just as well. Just choose the one you like the most!

After repeating the test, I got this second image:

So the red is mostly gone… the blocks you still see are caused by network I/O and have reasonable lengths. However, what’s the deal with all the yellow? Most of my threads went from being blocked to waiting. That’s not cool either! The deadlocks are gone, so the app won’t hang anymore, but performance hasn’t gone up; it’s just as slow as it was before. Additionally, at this point I noticed a new symptom I hadn’t seen before: CPU usage was really low, like under 20%.

What do we do now? The same as before: get a thread dump of the waiting threads. The results were even more surprising! Remember that I told you that these threads were part of a thread pool that my new feature creates? Well, it turns out that the threads were actually idle, just sitting there waiting for more work to come. I went through the code many times asking myself “how in hell is this possible?”… The answer was quite simple: while fighting the high complexity of the code, I lost sight of the simplicity of the configuration.

In short, I have a thread pool to do the processing. We’ll call the threads in that pool “worker threads”. Plus, there’s another thread which has the responsibility of dispatching work to that pool (we’ll call it the dispatcher thread). The dispatcher thread was able to generate work way faster than the worker threads were able to finish it, so the pool got exhausted pretty quickly. Thread pools usually have configurable strategies to deal with exhaustion: some strategies reject the excess work, some wait… It turns out that the default configuration of the thread pool I was using was to execute the job in the invoking thread. That means that the dispatcher thread could not keep dispatching work because it was busy processing the work it was supposed to give to others. So, by the time the worker threads were ready to take more work in, they had to wait for the dispatcher thread to be available again.
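If you’ve never seen this behaviour, here is a minimal, self-contained sketch of the same failure mode using a plain java.util.concurrent ThreadPoolExecutor with CallerRunsPolicy (the JDK’s version of “run the excess work in the invoking thread”); it illustrates the effect, it is not the actual Mule pool configuration:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class CallerRunsDemo {

        public static void main(String[] args) throws InterruptedException {
            // Tiny pool and tiny queue so exhaustion happens almost immediately.
            ThreadPoolExecutor workers = new ThreadPoolExecutor(
                    2, 2, 0L, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<>(2),
                    new ThreadPoolExecutor.CallerRunsPolicy()); // excess work runs in the submitting thread

            for (int i = 0; i < 10; i++) {
                final int job = i;
                workers.execute(() -> {
                    System.out.println("job " + job + " on " + Thread.currentThread().getName());
                    try {
                        Thread.sleep(100); // simulate some processing
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }

            workers.shutdown();
            workers.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

Run it and you’ll see some of the jobs printed with the main thread’s name: those are the moments when the “dispatcher” stopped dispatching and started working.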

As simple as this bug is, actually seeing it can be quite tricky. How did I discover this? By looking at the thread dump for the dispatcher thread. Once again, profilers are your friend.

The solution was easy: I just configured the thread pool to wait in case of exhaustion. That way, as soon as a worker thread becomes available, the dispatcher thread wakes up and hands work to it. The profiler now looked like this:

Now we see that the worker threads are making efficient use of the CPU and there are no red dots other than the inevitable I/O. CPU usage was now around 100% and the app was 58% faster.
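For reference, the “wait in case of exhaustion” behaviour can be reproduced with the plain JDK executor as well, by supplying a RejectedExecutionHandler that blocks until the queue has room; again, this is a sketch of the idea rather than Mule’s actual configuration:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.RejectedExecutionException;
    import java.util.concurrent.RejectedExecutionHandler;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class WaitOnExhaustionDemo {

        public static void main(String[] args) throws InterruptedException {
            // A rejection handler that blocks the dispatcher until the queue has room,
            // instead of running the task in the dispatching thread.
            RejectedExecutionHandler blockUntilFree = (task, executor) -> {
                try {
                    executor.getQueue().put(task); // waits for a free slot
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RejectedExecutionException("Interrupted while waiting for queue space", e);
                }
            };

            ThreadPoolExecutor workers = new ThreadPoolExecutor(
                    4, 4, 0L, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<>(8),
                    blockUntilFree);

            for (int i = 0; i < 100; i++) {
                final int job = i;
                workers.execute(() -> doWork(job)); // dispatcher blocks here only when the queue is full
            }

            workers.shutdown();
            workers.awaitTermination(1, TimeUnit.MINUTES);
        }

        static void doWork(int job) {
            try {
                Thread.sleep(50); // placeholder for the real record processing
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }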

Less concurrency is more

One last thing: when I was checking CPU usage I noticed that although usage was around 100%, more than half of it was being spent in “system operations” and only around 45% was actually being used by Mule. This happened because the thread pool was way bigger than necessary. Parallelism is good, but you can never lose sight of the fact that the more threads you have, the more context-switch overhead you’ll suffer. So, I took my thread pool down from 100 threads to only 16. The result? My test was now one minute faster. So remember: when you’re tuning for maximum CPU usage, don’t go for as many threads as you can allocate. Instead, search for the number that gives you maximum CPU usage with the lowest context-switch overhead.
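There is no universal number of threads (the 16 above came out of measuring this particular workload), but a sensible starting point is to derive the pool size from the available cores and then tune with the profiler. A tiny sketch of that starting point, with the multiplier as an assumption to be tuned:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSizingSketch {

        public static void main(String[] args) {
            // Start from the number of available cores instead of an arbitrary large number.
            int cores = Runtime.getRuntime().availableProcessors();

            // The multiplier is a tuning knob, not a rule: CPU-bound work wants roughly 1x cores,
            // IO-heavy work tolerates more. Measure with a profiler and adjust.
            int poolSize = Math.max(2, cores * 2);

            ExecutorService workers = Executors.newFixedThreadPool(poolSize);
            System.out.println("cores=" + cores + ", poolSize=" + poolSize);
            workers.shutdown();
        }
    }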

Summary take-aways

I hope you find this useful. This is how we make our Mule gallop!

The key things you should remember from this post are:

  • When testing performance, use a profiler to make sure you’re making efficient use of your resources
  • Try to keep synchronized blocks to a minimum
  • Try to reduce wait times
  • Do not abuse your ability to start threads. Reduce the context switch overhead
  • But most importantly: it doesn’t matter how much data you have if you’re not able to interpret it and turn it into information

Thank you for reading. See you next post around!


30 Responses to “Chasing the bottleneck: True story about fighting thread contention in your code”

  1. Hi,

    Good post.

    A few typos.

    -> You’ll hear the details about it soon enough but basically it’s a new feature that aims to integration use cases which requires processing huge amount of data.

    … aims to address integration use cases …

    -> The threads that are showing belong to a thread poll that my new feature creates.

    .. thread pool …

    Cheers
    Rob

  2. Thank you Rob! Fixed now 😉

  3. Nice.

    The graphics help visualize what is going on and your explanation concisely and simply explained the issues.

  4. Buen articulo

  5. I have a few questions:
    1. You mentioned CPU usage; how did you measure that? I can’t seem to find good ways to measure it.
    2. How did you know the architecture of your server, i.e. that there is one dispatcher vs. the worker threads?

    Thanks!

    • Hello Mayumi,

      1) The easiest way to measure load is to simply open the system monitor that comes with your OS and check the overall CPU usage and which processes are responsible for it. Some monitors are better than others; for example, on Windows the monitor doesn’t provide much information, while on OSX it gives more info by differentiating process load from system load. In either case, most Java profilers have a CPU tab which also provides this information in greater detail.

      2) I knew the architecture of the server because I wrote it :). However, if you go back to the screenshot which shows the threads on the application and their running state, you’ll see that each thread has a name. If whoever wrote the server you’re using followed minimum best practices, then hopefully the threads on your app also have a descriptive name. By looking at those names and the relation between their running states you’re likely to figure out your server’s threading model.

      Hope this helps,

      Cheers

  6. Good one!!!

  7. Excellent Information. Thanks Mariano

  8. Hi Mariano,

    The blog is fantastic; it’s exactly the way we performance engineers verify the performance of the applications given to us, except for one thing: when CPU usage was reaching 100%, wasn’t that considered a potential problem for the application’s performance?

    Thanks,
    Andy

    • Hello Andy,

      Great question! 100% usage of CPU is not always a sign of degrading performance. In this case, this was a batch job eating a lot of records and performing IO intensive tasks. Also the machine was not doing anything else.

      So, in this particular case, if the CPU nearly tops out, it means that we’re not wasting time waiting for IO, and it was in that situation that we got the best throughput. However, you’re absolutely right about 100% being a potential issue in different scenarios.

      Thanks!

  9. This is a really well written article…thanks. It describes the issue I am seeing in my app to a T. I have many threads sitting around WAITING + lower CPU – many other blogs/articles suggest this is normal & not an issue. But performance is not as it should be.
    I do have one question though: I can’t tell which options you are referring to which would allow it to “wait in case of exhaustion”. We currently have a bounded-queue-thread-pool defined in Jboss. Thank you for any further insight you can give!

    • Hello Nick,

      To be honest I haven’t used JBoss in ages so I don’t really remember how to do it there, but what I was referring to by “options” was the RejectedExecutionHandler used to create a standard Java ThreadPoolExecutor. There are many default implementations; some throw exceptions, some wait, etc… In my particular case, the best performance came from waiting, since otherwise the dispatcher thread would start executing tasks instead of dispatching them. As a result, when the actual worker threads became available, they went idle because the dispatcher thread responsible for giving them more work was busy doing something else.

  10. Hi , thanks for the post.

    Please let me know how to use a profiler with the Mule server to see these red, green and yellow colors. What software do I have to install to get the thread dump? I am using Mule 3.5.2.

    • Hello Anil,

      In this particular case I used a tool called Yourkit, but any Profiler tool such as VisualVM, Oracle Mission Control or JProfiler can pretty much provide the same information.

      Regards

  11. This was sooo helpful.
    The clearest and most concise explanation of how to read this particular tool’s data.

    Thanks a lot for sharing this!

  12. Hi,

    I am running into an issue with my batch programs where, when I use the default batch threading profile, I am running into high CPU Utilization issue.

    I am running into the same issue when I configure a threading profile as follows. Good performance but high CPU usage.

    maxThreadsActive=”8″ poolExhaustedAction=”WAIT” maxBufferSize=”25″ threadWaitTimeout=”-1″

    I noticed that the CPU utilization is high when the dispatcher thread is in runnable status (does that mean it is waiting?) as seen in the YourKit profiler. I wonder why the dispatcher takes so much CPU when it is waiting.

    When I change the above threading profile to: maxThreadsActive=”4″ poolExhaustedAction=”RUN”, the CPU utilization falls down (because the dispatcher thread is busy working and not waiting) to single digits but I am losing the performance.

    Would really appreciate any tips on resolving the issue here.

    Thanks.

    • Hello,

      Why do you see high CPU usage and good performance as a bad thing? High CPU usage means that you’re not wasting processing time blocked waiting for IO events or any other stops. High CPU usage and good performance is what you should want for a batch process.

      • If the batch programs use up all of the CPU when running, what happens when we have 10 batch programs and 3 of them are trying to run concurrently? Will that not be an issue? Will that not bring the server down?

        Thats the only fear… otherwise, I completely agree with your point that ‘High CPU usage means that you’re not wasting processing time blocked waiting for IO events or any other stops’.

        Thanks.

        • The batch:job element has an attribute called scheduling-strategy. With its default value, no two instances of that job will run concurrently; they will be executed one at a time in the order they were triggered. You can change it to ROUND_ROBIN, in which case the thread pool available for that job is shared between all the available job instances following a round-robin algorithm, preventing the scenario you mention.

          If you still have a sizing issue, then you have to treat it as a sizing issue: add more boxes to your cluster, try to do a better job at defining execution windows, try to make your job lighter, etc.

          • Scheduling-strategy works in case of multiple runs of the same batch program. However, I was referring to runs from different batch programs… what happens when different batch programs are run at the same time and all of them are trying to use 100% CPU?

            In that case, I think I agree with your 2nd point. We will have to define better execution windows and increase the cores as we build more apps.

          • Exactly. In addition to that, you can also try playing with the pooling profiles of each job, limiting how many resources each job gets to use.

  13. Hi,
    Nice post.
    I have one query: how did you identify the optimal number of threads for your thread pool? Was it trial and error or some other heuristic approach?

    • Hello, glad you liked the post. The optimal config came from a trial and error process. We defined 3 canonical apps which we thought captured at least 80% of the most common use cases for the feature. We ran performance tests and profiles on several configurations and found a default which made sense to us.

      Regards

  14. Thanks for the response.
    I have an additional query. I need to improve the performance of an application which reads data from files and subsequently performs operations on it. The files are read for about 25 source systems, so I created a threadPool with coreThreads=5 and queue size=25(used spring integration).
    The application has several daemon threads named Dispatcher, but I do not see them performing any operations.
    However, there is a thread task-scheduler which is reading the file(the task which was supposed to be done by my thread pool). I am not sure if this is the dispatcher thread.
    Is there any way I can confirm if this is the dispatcher thread and know its configuration(i.e. whether it will execute the task itself if the threads in pool are not idle)

    • I’m not sure how Spring Integration works under the hood, but what I’d do is use a profiler and try to see if there’s some kind of pattern between the amount of time that the worker threads are idle and the amount of time that the dispatchers are idle.

  15. Detailed way on Analysing bottleneck !!
    Thanks a lot that helps

  16. This is an excellent post by Mariano Gonzalez. Thanks a ton !!!

    • hi Mariano, This is an excellent post with the exact required details for analysis. One question on dispatcher thread “wait in case of exhaustion”. How to set this configuration and where, what is your environment?
      Would like to know configuration of dispatcher thread before and after fix.
      Thanks in Advance,
      Ananth

      • Hello,

        There’s a batch:threading-profile element which allows you to configure it. Check the docs for more info.

        Best,

  17. Hi, This is an excellent post and bang on point.
    I am trying to create a program to demonstrate this; I am only stuck at one point: how and where to configure the dispatcher to “wait in case of exhaustion”. If you can share which environment we can use to simulate this and the configuration details, it would help.
    Thanks,
    Ananth