Hello from MuleSoft’s performance team!
This post describes a real-world tuning example in which we worked with a customer to optimize their Mule ESB application.
A customer presented us with an application that was to be a proxy to several endpoints. As such, it needed to be very lightweight since the endpoints introduced their own latency. We required the application to provide high throughput and minimal latency.
This real-world example shows how we helped the customer tune their application from a number of angles. We had quite an adventure: the performance metrics were a crime, the usual suspects were innocent, and there were some unexpected twists. But our story has a happy ending. What started at 125 tps and high latency ended at 7600 tps and low latency.
For more info on the tips and tricks we describe here, please see our Performance Tuning Guide.
The original synopsis of this tuning case was recorded by Wai Ip. Additional contributors include Daniel Feist and Rupesh Ramachandran. Edited by Mohammed Abouzahr.
1. The Use Case
- Application input/output: The application took REST/JSON requests, proxying either a back-end REST/JSON or REST/XML service, depending on the service type. The response payload was between 10kB and 300kB.
- Test Client: JMeter was our load client of choice. For quick validations and fast test iterations, we used Apache Bench (AB). We then used JMeter to get more consistent and realistic measurements with ramp-up, ramp-down, think time between requests, etc.
- Mule ESB Host: We ran Mule ESB on a 32-core Amazon EC2 Red Hat Linux instance.
- Load Driver: We placed the JMeter test client and the back-end web service simulator (Tomcat with a pass-through static response served from memory) on 32-core Amazon EC2 Red Hat Linux servers.
- Profiler: We used YourKit, a Java profiler, to inspect Mule ESB.
2. Priority Metrics
The main requirement was to keep latency consistently low. That is, large latency spikes were unacceptable. As long as that requirement was met, we aimed to maximize throughput.
3. Detailed Tuning Notes
- Heap Size: We tested various min/max heap sizes, including 2 GB, 8 GB, and 16 GB. The application was not memory-bound. So increasing the heap size did not significantly improve performance metrics. We settled on 2 GB.
- Garbage Collection: Occasional stop-the-world GC pauses caused latency spikes. Recall that minimizing latency spikes was required for this use case. We used CMS, concurrent mark-and-sweep, to reduce both maximum and average latency. (Mule ESB had the lowest average latency among all of the API vendors benchmarked.) See the next section for specific GC parameters.
- HTTP Connector: We used a RAML-first design approach, which required APIKit. Our initial setup used APIKit set to Jetty inbound (APIKit’s default is HTTP inbound). Unfortunately, that gave only half the throughput of using Jetty inbound directly, without APIKit. So we did not use APIKit for the final benchmark project. Instead, we employed jetty:inbound with http:outbound for the API proxy scenario.
- Jetty Tuning: For Jetty, we set maxThreads to 255 (in some cases, 1000+ may be required), leaving the minThreads value at 10. (We made these adjustments in the jetty-config.xml file, which the Jetty connector points to.)
- Log Level: We reduced logging overhead by setting all logging levels to WARN. (We did this at the application level in log4j.properties and at the Mule ESB server level using the log level properties listed in section 4.)
- Scale Out: We avoided H/A clusters in MMC and kept the Mule instances as standalone. That helped avoid H/A overhead.
- Ulimits: Under high load, we found a “too many files opened” error in the Mule ESB log. Increasing ulimit -n to 65535 solved the problem (the default was only 1024). We also used keepalive on the test client. (This was done through the Use KeepAlive setting in JMeter. We used the -k flag in Apache Bench.) Keepalive ensures that the test client does not open up too many connections to Mule and run out of file descriptors. We then encountered an “OutOfMemory: Unable to create new native thread” error. Increasing ulimit -u to 65535 fixed the problem. (On Linux, each thread counts against the per-user process limit, and our high-load tests hit the default.)
- Network: Large payloads in a high-throughput performance test led to network saturation well before CPU or RAM. (For instance, 1 Gigabit ethernet, 1GbE, is common at the time of writing. 1GbE expressed in terms of bytes is 125MB/s in each direction, inbound and outbound. If the payload size is 1MB, the application will be unable to achieve more than 125tps when hit from a remote client in the same network.) Switching to Amazon’s extravagant EC2 10 Gigabit dedicated network option increased throughput tenfold before saturating the network again. We set up the test client (JMeter), the Mule ESB and the back-end simulator on the same dedicated 10GbE network.
- TCP/IP: We tuned the Linux TCP/IP stack with the following values to minimize the number of sockets in the TIME_WAIT state. Sockets lingering in that state reduce the number of ephemeral ports available for new HTTP connections.
net.ipv4.netfilter.ip_conntrack_max = 32768
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_orphan_retries = 1
net.ipv4.tcp_fin_timeout = 25
net.ipv4.tcp_max_orphans = 8192
net.ipv4.ip_local_port_range = 32768 61000
- Tune HTTP outbound: We observed some unacceptable latency spikes, as high as 1.5s. The high latency was caused by threads waiting for objects in the dispatcher pool. Adding a dispatcher-threading-profile for the outbound/back-end HTTP connector resolved the problem: increasing the dispatcher max threads value from the default of 16 to 1000 cut the latency spikes from 1.5s to 10ms and gave much higher throughput as well.
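The Ulimits note above is easy to verify before a load test. Here is a minimal sketch, assuming a Linux host with Python available, that reads the soft limits for open files and processes and compares them to the 65535 values we used. The helper names (limit_ok, check_limits) are ours for illustration and are not part of Mule ESB. Run it as the same user that starts Mule, since limits are per user and per session.

```python
import resource

# Thresholds taken from the Ulimits note above; adjust for your own load.
REQUIRED_OPEN_FILES = 65535
REQUIRED_PROCESSES = 65535


def limit_ok(soft_limit, required):
    """Return True if a soft limit is unlimited or at least `required`."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= required


def check_limits():
    """Return a dict mapping limit name to (soft_limit, is_sufficient)."""
    nofile_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)  # ulimit -n
    nproc_soft, _ = resource.getrlimit(resource.RLIMIT_NPROC)    # ulimit -u
    return {
        "open files (ulimit -n)": (nofile_soft, limit_ok(nofile_soft, REQUIRED_OPEN_FILES)),
        "processes (ulimit -u)": (nproc_soft, limit_ok(nproc_soft, REQUIRED_PROCESSES)),
    }


if __name__ == "__main__":
    for name, (soft, ok) in check_limits().items():
        shown = "unlimited" if soft == resource.RLIM_INFINITY else soft
        print(f"{name}: {shown} -> {'ok' if ok else 'too low'}")
```

The script only reports the limits. Raising them is still done with ulimit or the system's limits configuration.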
4. Mule ESB Properties
This is the final list of properties we used in Mule ESB’s
# GC Logging
wrapper.java.additional.4=-XX:+PrintGCApplicationStoppedTime
wrapper.java.additional.5=-XX:+PrintGCDetails
wrapper.java.additional.6=-XX:+PrintGCDateStamps
wrapper.java.additional.7=-XX:+PrintTenuringDistribution
wrapper.java.additional.8=-XX:ErrorFile=%MULE_HOME%/logs/err.log
wrapper.java.additional.9=-Xloggc:%MULE_HOME%/logs/gc.log
# Mule Java flags for low latency
wrapper.java.additional.10=-XX:+AlwaysPreTouch
wrapper.java.additional.11=-server
wrapper.java.additional.12=-XX:PermSize=128m
wrapper.java.additional.13=-XX:MaxPermSize=128m
wrapper.java.additional.14=-XX:NewRatio=3
wrapper.java.additional.15=-XX:+UseConcMarkSweepGC
wrapper.java.additional.16=-XX:CMSInitiatingOccupancyFraction=60
wrapper.java.additional.17=-XX:+UseCMSInitiatingOccupancyOnly
wrapper.java.additional.19=-XX:CompileThreshold=1000
wrapper.java.additional.20=-XX:MaxTenuringThreshold=8
wrapper.java.additional.21=-XX:TargetSurvivorRatio=90
wrapper.java.additional.22=-XX:SurvivorRatio=8
wrapper.java.additional.23=-XX:+CMSScavengeBeforeRemark
wrapper.java.additional.24=-XX:PretenureSizeThreshold=512m
wrapper.java.additional.25=-XX:CMSFullGCsBeforeCompaction=1
wrapper.java.additional.26=-XX:CMSTriggerPermRatio=80
wrapper.java.additional.27=-XX:CMSMaxAbortablePrecleanTime=6000
wrapper.java.additional.28=-XX:+CMSConcurrentMTEnabled
wrapper.java.additional.29=-XX:+UseParNewGC
# Optimize GC threads for 32 core machines
wrapper.java.additional.30=-XX:ConcGCThreads=20
wrapper.java.additional.31=-XX:ParallelGCThreads=20
wrapper.java.additional.32=-XX:+AggressiveOpts
wrapper.java.additional.33=-Xss228k
# OTHER PARAMS OF INTEREST IN THIS FILE
# Initial Java Heap Size (in MB)
wrapper.java.initmemory=2048
# Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=2048
# Log Level for console output. (See docs for log levels)
wrapper.console.loglevel=WARN
# Log Level for log file output. (See docs for log levels)
wrapper.logfile.loglevel=WARN
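Numbered properties such as wrapper.java.additional.N are easy to mis-number when a list like this is edited by hand. The following is a minimal sketch, assuming the file is plain key=value text, that reports indices used more than once and indices skipped in the sequence. A repeated index may mean that one of the two flags is silently dropped, so it is worth a second look. The function names and the sample fragment are hypothetical and only for illustration.

```python
import re
from collections import Counter

# Matches lines such as "wrapper.java.additional.15=-XX:+UseConcMarkSweepGC".
ADDITIONAL = re.compile(r"^\s*wrapper\.java\.additional\.(\d+)\s*=")


def additional_indices(text):
    """Return the index N of every wrapper.java.additional.N line, in order."""
    indices = []
    for line in text.splitlines():
        match = ADDITIONAL.match(line)
        if match:
            indices.append(int(match.group(1)))
    return indices


def repeated_indices(indices):
    """Return the sorted indices that appear more than once."""
    counts = Counter(indices)
    return sorted(n for n, count in counts.items() if count > 1)


def missing_indices(indices):
    """Return indices skipped between the lowest and highest index used."""
    if not indices:
        return []
    used = set(indices)
    return [n for n in range(min(used), max(used) + 1) if n not in used]


# Hypothetical fragment for illustration only; not the customer's file.
sample = """\
# comment lines are ignored
wrapper.java.additional.4=-XX:+PrintGCDetails
wrapper.java.additional.5=-XX:+PrintGCDateStamps
wrapper.java.additional.7=-server
wrapper.java.additional.7=-XX:NewRatio=3
"""

found = additional_indices(sample)
print("repeated:", repeated_indices(found))  # repeated: [7]
print("missing:", missing_indices(found))    # missing: [6]
```

To check a real file, read its contents into a string and pass that to additional_indices instead of the sample.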
5. Final Test Results
- Throughput: 7600 tps
10GbE network saturated at this point.
- Average Transaction Latency: 5 ms
This is the round trip time measured from JMeter.
- API Gateway Latency: 2 ms
This is the latency introduced with Mule ESB in the middle, versus going directly to the back-end service.
- CPU Usage: 35%
There were 32 cores.
- Memory Usage: 2 GB Java heap
60GB RAM was available.
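The network ceiling behind these numbers is simple arithmetic, and it is worth running before any large-payload test. The sketch below is a rough upper bound, assuming the payload alone fills the link in one direction and ignoring protocol overhead. The function names are ours for illustration.

```python
def link_bytes_per_second(gigabits_per_second):
    """Convert a nominal link speed in Gbit/s to bytes/s (1 Gbit = 10**9 bits)."""
    return gigabits_per_second * 1_000_000_000 / 8


def max_tps(gigabits_per_second, payload_bytes):
    """Upper bound on transactions/s when the payload alone fills the link."""
    return link_bytes_per_second(gigabits_per_second) / payload_bytes


def payload_at_saturation(gigabits_per_second, tps):
    """Average payload size in bytes that would saturate the link at a given tps."""
    return link_bytes_per_second(gigabits_per_second) / tps


MB = 1_000_000

print(max_tps(1, 1 * MB))    # 125.0 tps on 1GbE with a 1MB payload
print(max_tps(10, 1 * MB))   # 1250.0 tps on 10GbE with a 1MB payload
print(round(payload_at_saturation(10, 7600) / 1000))  # 164 (kB per transaction)
```

By this estimate, saturating 10GbE at 7600 tps corresponds to an average payload of about 164kB, which falls inside the 10kB to 300kB response range of the use case.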
125 tps to 7600! Here’s a summary of how we achieved that roughly 60-fold increase. CPU and memory, common performance bottlenecks, were not our primary hurdle. Instead, the key was to increase network bandwidth to handle the massive payload. Using CMS instead of the standard parallel GC helped reduce latency. Other adjustments ensured that we could handle and fully utilize that incredible throughput.
Stay tuned for more posts from MuleSoft’s performance team. We’re not as fast as Mule ESB, but we aim for about one transaction per month.