Fact: Batch Jobs are tricky to handle when exceptions raise. The problem is the huge amounts of data that these jobs are designed to take. If you're processing 1 million records you simply can't log everything. Logs would become huge and unreadable. Not to mention the performance toll it would take. On the other hand, if you log too little then it's impossible to know what went wrong, and if 30 thousand records failed, not knowing what's wrong with them can be a royal pain. This is a trade-off not simple to overcome.
We took the feedback we got from the early releases around this issue, and now that Mule 3.5.0 and the Batch Module finally went GA with the May 2014 release, small but significant improvements took place to help you with these errors. Let's take a look at some simple scenarios.
Before we start
This post assumes you're familiar with the new Batch module included in mule 3.5. If that's not the case, here're some links you might want to read before continuing:
Records Failing in a Step
So, let's take a simple scenario. I have a simple step and records tend to fail there. This is the simplest manifestation of the trade-off we talked about at the beginning of this post. Logging a whole lot of unmanageable text or log so little that is barely useful. The strategy we took here is to not log exceptions but to log stack traces. Yeap, you read correctly. When we found an exception in a step we perform the following algorithm:
- Get the exception's full stack trace
- Strip that stack trace from any messages. Even if all records always throw the same exception, those messages probably contain record specific information. E.g: ‘user Walter White is not eligible because he's wanted by the DEA'. The user Jesse Pinkman would throw a slightly different message, but bottom line is that the exception is actually the same.
- Verify if this stack trace was already logged in the current step. If this is the first time that stack trace is found on the current step, then we log it adding a nice message saying that we will not log it again for the current step. Otherwise, if we already saw that stack trace in the given step, then we simply ignore it but keep count of how many times that exception was found
Let's see an example of how the WantedByDEAException would be logged:
GOTCHA: This is done on a by step basis. If another step also throws the same exception, it WILL be logged once more for that step.
The above is good. We get to see the exception without clogging the logs. However, because I only see the exception once, I don't really know how many times it happened in total. For that we added an exception summary which is logged when the on-complete phase is reached. This summary shows for every exception type, how many times it happened in each step. Suppose that now we have a job with two steps, and both are capable of throwing BatchException. Then the exception summary would look like this:
Here we can see that the first step failed 10 times and the second failed 9.
Getting more out of the logs
As this post starts saying, the motivation for all of this is the trade-off between large chunks of data and logging efficiency, that's why all the items described above work with Mule's default configuration of only logging INFO level messages (if you're not familiar with logging levels, take a look at this) .
There're special cases like debugging or when you're motivation to use batch is not the ability to handle large datasets but recoverability, resilience, management, etc (see this post for more examples). For those cases, you can get a whole lot of extra and extremely verbose information by setting the batch category to DEBUG level in your app's log4j configuration like this:
CAUTION: Use the DEBUG level with care. Definitively not advised for production environments unless you're absolutely sure that your jobs will not handle significantly large data sets.
Error Handling MEL functions
The early access release of the Batch module already shipped with some error handling MEL functions that can be used in the context of a batch step. Let's refresh that real quick:
- #[isSuccessfulRecord()]: Boolean functions which returns true if the current record has not thrown exceptions in any prior step.
- #[isFailedRecord()]: Boolean functions which returns true if the current record has thrown exceptions in any prior step.
- #[failureExceptionForStep(String)]: This function receives the name of a step as a String argument. If the current record threw exception on that step, then it returns the actual Exception object. Otherwise it returns null
Again, the above is just fine. However, we quickly learn that those felt short in some cases. Let's consider the following scenario:
- A batch job that polls files containing contact information. Those contacts are transformer using DataMapper and fed into the job
- The first step transforms them again and inserts them into Salesforce
- The second step transforms again and inserts them into Google Contacts
So far so good. The challenging new requirement is to be able to write into a JMS dead-letter queue per each record that has failed. To keep it simple, let's just say that the message will be the exception itself. In case you haven't noticed, there's a gotcha in this requirement: each record could have failed in both steps, which means that the same record would translate into two JMS messages.
To support this and other similar scenarios, we added these additional functions which also require to be executed in the context of a batch step:
- #[getFirstException()]: Returns the Exception for the very first step in which the current record has failed. If the record hasn't failed in any step, then it returns null.
- #[getLastException()]: Returns the Exception for the last step in which the current record has failed. If the record hasn't failed in any step, then it returns null.
So let's take a look at how this application would look like:
As you see, a pretty standard Batch Job. Because this is an error handling post, I won't go into detail of the whole app and I'll just focus on the last step. Let's take a closer look:
Since the goal of this step is to gather failures, it makes sense to skip the ones who haven't got any, thus we set the accept policy to failures only.
The first thing we do is to use the set-payload processor and the getStepExceptions() MEL function together so that the message payload is set to the exceptions map.
Remember how we said before that a given record could have many exceptions? That's the nice thing about this map, it gives you access to the exceptions found in all steps. Since our goal is to send the exceptions through JMS and we don't care about the steps, we can simply use a foreach to iterate over the map's values (the exceptions) and send them through a JMS outbound endpoint:
And we're done! We have happily aggregated exceptions!
The above features were introduced in the GA release of Mule 3.5.0. We tried our best to help you find equilibrium between useful and meaningful messages without generating undesired verbosity. Have better ideas or suggestions? Please reach out!
Hope you liked this!