Handle Errors in your Batch Job… Like a Champ!

Fact: batch jobs are tricky to handle when exceptions are raised. The problem is the huge amount of data these jobs are designed to process. If you’re processing 1 million records you simply can’t log everything; the logs would become huge and unreadable, not to mention the performance toll it would take. On the other hand, if you log too little then it’s impossible to know what went wrong, and if 30 thousand records failed, not knowing what’s wrong with them can be a royal pain. This is a trade-off that is not simple to overcome.

We took the feedback we received on this issue from the early access releases, and now that Mule 3.5.0 and the Batch module have finally gone GA with the May 2014 release, small but significant improvements are in place to help you with these errors. Let’s take a look at some simple scenarios.

Before we start

This post assumes you’re familiar with the new Batch module included in 3.5. If that’s not the case, here are some links you might want to read before continuing:

Records Failing in a Step

So, let’s take a simple scenario: I have a simple step and records tend to fail there. This is the simplest manifestation of the trade-off we talked about at the beginning of this post: either log a whole lot of unmanageable text, or log so little that it’s barely useful. The strategy we took here is to not log exceptions but to log stack traces. Yep, you read that correctly. When we find an exception in a step we perform the following algorithm:

  • Get the exception’s full stack trace.
  • Strip any messages from that stack trace. Even if all records always throw the same exception, those messages probably contain record-specific information, e.g. ‘user Walter White is not eligible because he’s wanted by the DEA’. The user Jesse Pinkman would produce a slightly different message, but the bottom line is that the exception is actually the same.
  • Check whether that stack trace was already logged in the current step. If this is the first time the stack trace is found in the current step, we log it along with a nice message saying that it will not be logged again for this step. Otherwise, if we already saw that stack trace in the given step, we simply skip it but keep count of how many times that exception was found.

Let’s see an example of how the WantedByDEAException would be logged:

GOTCHA: This is done on a per-step basis. If another step also throws the same exception, it WILL be logged once more for that step.

Exception Summary

The above is good: we get to see the exception without clogging the logs. However, because I only see the exception once, I don’t really know how many times it happened in total. For that we added an exception summary which is logged when the on-complete phase is reached. This summary shows, for every exception type, how many times it happened in each step. Suppose that we now have a job with two steps, and both are capable of throwing BatchException. Then the exception summary would look like this:

Here we can see that the first step failed 10 times and the second failed 9.

Getting more out of the logs

As this post said at the start, the motivation for all of this is the trade-off between large amounts of data and logging efficiency. That’s why everything described above works with Mule’s default configuration of only logging INFO level messages (if you’re not familiar with logging levels, take a look at this).

There are special cases, like debugging, or when your motivation to use batch is not the ability to handle large datasets but recoverability, resilience, management, etc. (see this post for more examples). For those cases, you can get a whole lot of extra and extremely verbose information by setting the batch category to DEBUG level in your app’s log4j configuration like this:
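For reference, a minimal sketch of what that could look like in your application’s log4j.properties. The logger category name below is an assumption (the batch module’s classes are assumed to live under com.mulesoft.module.batch); check the documentation for your Mule version:

    # Hypothetical example: raise the batch module's logging category to DEBUG.
    # The package name is an assumption; verify it for your Mule distribution.
    log4j.logger.com.mulesoft.module.batch=DEBUG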

CAUTION: Use the DEBUG level with care. It is definitely not advised for production environments unless you’re absolutely sure that your jobs will not handle significantly large data sets.

Error Handling MEL functions

The early access release of the Batch module already shipped with some error handling MEL functions that can be used in the context of a batch step. Let’s refresh them real quick (a small sketch follows the list):

  • #[isSuccessfulRecord()]: Boolean function which returns true if the current record has not thrown exceptions in any prior step.
  • #[isFailedRecord()]: Boolean function which returns true if the current record has thrown exceptions in any prior step.
  • #[failureExceptionForStep(String)]: This function receives the name of a step as a String argument. If the current record threw an exception in that step, it returns the actual Exception object. Otherwise it returns null.
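For illustration, here is a minimal, hypothetical step that uses two of these functions. The step names are made up, and the accept policy is set to ALL so that failed records also reach the loggers:

    <batch:step name="reportStatusStep" accept-policy="ALL">
        <!-- Log whether the current record failed in any previous step -->
        <logger level="INFO" message="Failed so far? #[isFailedRecord()]" />
        <!-- Log the exception (or null) thrown by a hypothetical previous step -->
        <logger level="WARN" message="Exception in insertIntoSalesforce: #[failureExceptionForStep('insertIntoSalesforce')]" />
    </batch:step>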

Again, the above is just fine. However, we quickly learned that those fell short in some cases. Let’s consider the following scenario:

  • A batch job polls files containing contact information. Those contacts are transformed using DataMapper and fed into the job.
  • The first step transforms them again and inserts them into Salesforce.
  • The second step transforms them again and inserts them into Google Contacts.

So far so good. The challenging new requirement is to write a message into a JMS dead-letter queue for each record that has failed. To keep it simple, let’s just say that the message will be the exception itself. In case you haven’t noticed, there’s a gotcha in this requirement: each record could have failed in both steps, which means that the same record could translate into two JMS messages.

To support this and other similar scenarios, we added the following additional functions, which also need to be executed in the context of a batch step:

  • #[getStepExceptions()]: Returns a java Map<String, Exception> in which the keys are the names of the batch steps in which the current record has failed and the values are the exceptions themselves. If the record hasn’t failed in any step, this Map will be empty but will never be null. The Map contains no entries for steps in which the record hasn’t failed.
  • #[getFirstException()]: Returns the Exception for the very first step in which the current record failed. If the record hasn’t failed in any step, it returns null.
  • #[getLastException()]: Returns the Exception for the last step in which the current record failed. If the record hasn’t failed in any step, it returns null.

So let’s take a look at what this application would look like:

As you can see, it’s a pretty standard batch job. Because this is an error handling post, I won’t go into detail about the whole app; I’ll just focus on the last step. Let’s take a closer look:

Since the goal of this step is to gather failures, it makes sense to skip the records that don’t have any, so we set the step’s accept policy to only accept failed records.

The first thing we do is use the set-payload processor together with the getStepExceptions() MEL function so that the message payload is set to the exceptions map.

Remember how we said before that a given record could have many exceptions? That’s the nice thing about this map: it gives you access to the exceptions found in all steps. Since our goal is to send the exceptions through JMS and we don’t care about the steps, we can simply use a foreach to iterate over the map’s values (the exceptions) and send them through a JMS outbound endpoint:
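Putting it together, a minimal sketch of what that last step could look like (the step name, queue name and connector name are hypothetical; the original app was shown as Studio screenshots):

    <batch:step name="deadLetterStep" accept-policy="ONLY_FAILURES">
        <!-- Set the payload to the Map<String, Exception> of this record's failures -->
        <set-payload value="#[getStepExceptions()]" />
        <!-- Iterate over the map's values (the exceptions) and publish each one to a dead-letter queue -->
        <foreach collection="#[payload.values()]">
            <jms:outbound-endpoint queue="failedRecords" connector-ref="jmsConnector" />
        </foreach>
    </batch:step>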

And we’re done! We have happily aggregated exceptions!

Summary

The above features were introduced in the GA release of Mule 3.5.0. We tried our best to help you strike a balance between useful, meaningful messages and undesired verbosity. Have better ideas or suggestions? Please reach out!

Hope you liked this!




8 Responses to “Handle Errors in your Batch Job… Like a Champ!”

  1. Hi Mariano,

    I am trying to understand exception handling in Mule, in particular the catch exception strategy.
    Use case:

    HTTP > REST component, where the REST component throws the exception. Here I need to handle the exception and set the header status and business error codes as well, which will be shown to the consumer accessing the Mule server.

    Could you please share your thoughts on how to configure this in Anypoint Studio.

    Thanks,
    Srikanth N

    • Hello,

      In the exception strategy you can use the HTTP response builder to set headers and other properties.
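
      A minimal sketch of that idea, using outbound properties as an alternative to the response builder (the status code and header name are made up):

        <catch-exception-strategy>
            <!-- Hypothetical example: the http.status outbound property sets the HTTP response status -->
            <set-property propertyName="http.status" value="500" />
            <!-- A custom header carrying a made-up business error code -->
            <set-property propertyName="X-Business-Error-Code" value="APP-001" />
            <set-payload value="#[exception.message]" />
        </catch-exception-strategy>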

      Regards

  2. Hi,

    I need to catch exceptions for every failed record. I am following the procedure you mentioned, but in the “foreach” I am routing this flow to another flow where I have to log errors for other flows as well, combined, into a Salesforce object. Do I need to add a variable (if yes, what would be the expression), or can we directly use DataMapper with this as a source? Since the output is a map, please let me know how to configure it to route it to Salesforce.

    • Hello hemantharjinder@gmail.com,

      I don’t quite understand the scenario you describe, but if what you need is to route failed records you can just add a step with accept-policy=”ONLY_FAILURES”. That would cause that step to only process records which have failed in any prior step(s).
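
      For example (the step name here is made up):

        <batch:step name="processFailedRecords" accept-policy="ONLY_FAILURES">
            <!-- Only records that failed in a prior step reach this point -->
            <logger level="WARN" message="Record failed with: #[getLastException()]" />
        </batch:step>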

  3. Is there a way to get access to the stack trace in an ‘ONLY-FAILURES’ step? The issue I see is that I can retrieve the first or last exception but it’s only the simple terse exception message in the case of a SFDC error. For instance, I see ‘Failed to invoke update. Message payload is of type: RegularImmutableList (org.mule.api.MessagingException)’ but the useful fault is in the stack trace which doesn’t seem to be available in getStepExceptions() either. I see the trace in the batch logs, but I want to potentially make decisions on the useful exception message.

    • The exceptions returned by the MEL functions are plain old Java exceptions. That means that if you want to get to the root of one, you need to navigate it using the getCause() method. This utility method might be handy: org.apache.commons.lang.exception.ExceptionUtils#getRootCause

      Also, remember that you’re not forced to just look at the first or last exception. You can use #[getStepExceptions()] or #[failureExceptionForStep(String)] to get all the exceptions or the one for a particular step.
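
      For example, a hypothetical expression along those lines (note that getRootCause() may return null when the exception has no cause):

        #[org.apache.commons.lang.exception.ExceptionUtils.getRootCause(getLastException())]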

  4. Hi Mariano,

    I have a batch job with three steps. A record fails in both the first and second steps. The third step accepts only failures, and I was expecting that this record would be processed twice in the third step, once for each step failure, but it is processed only once. However, when I print #[getStepExceptions()] in the third step, I can see map entries for both exceptions. Can you explain why only one failed record is processed in the third step?

    • The behavior you see is the intended one. Each record goes through each step only once.