Whether for data migration, pulling data from legacy systems, compiling reports, or simple aggregation, most projects eventually need nightly, weekly, or even up-to-the-minute data processing. Executing batch jobs is the best way to complete these tasks without devoting excessive work hours to each.

When it comes to batch processing, one oversight is almost guaranteed: nobody bothers building it in until ¾ of the way through the project. Spring Batch is an easy-to-use system that will save work hours and hair pulling for all of a project’s batching needs. The following examples provide an introduction to Spring Batch as well as some guidance on adding Spring Batch to upcoming projects.

Should I even be using Spring Batch?

  1. Is the project already using Spring? – Use Spring Batch
  2. Are there multiple forms of data being batched, such as CSV, XML, REST calls, and flat files? – Use Spring Batch
  3. Is there a need for incremental polling of new data? – Use Spring Batch
  4. Hate writing boilerplate code such as paging and fail-over? – Yep, use Spring Batch.

Reasons to probably avoid using Spring Batch:

  1. Is the project written in a different language than Spring/Java?
    - It is possible to run Spring Batch as a stand-alone project, made easier with Spring Boot, but chances are good that the language of choice already has an easier option to integrate.
  2. Does the project already employ Quartz Scheduling?
    - There is no need to reinvent the wheel while the car is being driven. The exception to this may be having a large number of batch programs to write, or, if the batch programs must process large amounts of data.
  3. Is there a need to schedule events dynamically?
    - One example would be an order that is completed, so it becomes necessary to schedule a report to be run three days from now. Quartz Scheduler can be easier to set up for these cases, and it handles dynamic event scheduling better.

Core Concepts

Tons of documentation exists to explain the underlying technology behind Spring Batch (see the links at the end of this article). The following is a high-level overview that can get any new user up and running.

There are three stages for batch processing: reader, processor, and writer. The reader reads in data. The data is then processed, and finally, the data is written out. The processing stage is optional.

A notable item about Spring Batch: it doesn’t care what the data structure is, be it flat text files, CSV, or a database. Even better, Spring Batch comes packaged with pre-configured readers and writers, meaning typical file formats work out of the box.

The base structure of Spring Batch revolves around jobs. Jobs, in turn, are composed of steps, such as the reader, writer, etc. Most of this is simple to describe in XML.
Another item worth noting: Spring Batch has paging and fail-over built in. No code, not even much XML to write, as we’ll see in a minute.

Setting up Spring Batch

A solid suggestion is to create the project using Spring Batch 3.0. This ensures support for JSR-352, which standardized much of the batch programming model in the Java specification, so objects such as jobs can be used without the heavy library importing that pre-3.0 versions required. Spring Batch 3.0 also provides some nice future-proofing, as it supports both Spring 4 and Java 8. With added support for Hadoop, additional out-of-the-box readers and writers, better scaling, and stronger dependency injection, 3.0 becomes the clear choice.

The immediate shortcoming is that Spring Batch 3.0 expects Spring 4 in the project, which may mean upgrading or juggling multiple Spring versions. Using older versions of Spring Batch is fine, though, and most of the examples provided below should still work.

The following assumes that the project in question is a Spring-based project, meaning there should already be a transaction manager set up.

Data Source
The transaction manager should be set up like so:

<bean id="txSpringManager" class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
   <property name="dataSource" ref="dataSource"/>
</bean>

Then add Spring Batch into your project:

<bean id="jobRepository" class="org.springframework.batch.core.repository.support.JobRepositoryFactoryBean" >
   <property name="dataSource" ref="dataSource"/>
   <property name="transactionManager" ref="txSpringManager"/>
   <property name="isolationLevelForCreate" value="ISOLATION_READ_COMMITTED" />
   <property name="tablePrefix" value="BATCH_"/>
</bean>
 
<bean id="jobLauncher" class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
   <property name="jobRepository" ref="jobRepository"/>
</bean>

Next, declare the jobRepository (via a JobRepositoryFactoryBean) and then the jobLauncher. The factory combines the data source with the transaction manager and lets you specify a wide array of parameters, mostly for handling the data source. The repository is then passed into the jobLauncher. Done! Easy, right?
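With the jobLauncher bean in place, launching a job from application code is a one-liner plus parameters. The sketch below is a hedged example, assuming the XML above lives in a file named batch-context.xml and that a job bean named exampleBatchJob exists (as declared in the Jobs section); both names are placeholders for whatever the project actually uses.

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class BatchJobRunner {
    public static void main(String[] args) throws Exception {
        // "batch-context.xml" is an assumed file name for the XML shown above
        ApplicationContext context =
                new ClassPathXmlApplicationContext("batch-context.xml");
        JobLauncher jobLauncher = context.getBean("jobLauncher", JobLauncher.class);
        Job job = context.getBean("exampleBatchJob", Job.class);

        // A unique parameter (here, a timestamp) lets the same job be re-launched;
        // identical parameters are treated as the same job instance.
        JobParameters params = new JobParametersBuilder()
                .addLong("run.timestamp", System.currentTimeMillis())
                .toJobParameters();

        JobExecution execution = jobLauncher.run(job, params);
        System.out.println("Batch job finished with status: " + execution.getStatus());
    }
}
```

The unique timestamp parameter matters: Spring Batch identifies job instances by their parameters, so launching with the same parameters twice is treated as restarting the same instance rather than starting a new one.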

Database Tables
The final step in the data source setup is the actual tables. Spring Batch uses these tables to control the start and execution of each batch job. It is possible to have these jobs running on multiple machines — as long as they are using the same data source a job will never launch more than one instance at a time.

Databases are also great for auditing and re-launching failed job instances.
Setup scripts are provided for a wide variety of databases. The Meta-Data Schema documentation explains the table setup in more detail. The scripts come pre-packaged within the Spring Batch .jar: simply unpack the .jar and look under org.springframework.batch.core, where most of the necessary setup scripts can be found.
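Rather than unpacking the .jar and running the script by hand, the same classpath script can be executed at startup with Spring's database populator. This is a hedged sketch: the script name is database-specific, and schema-mysql.sql here assumes MySQL (other databases have their own script in the same package).

```java
import javax.sql.DataSource;
import org.springframework.core.io.ClassPathResource;
import org.springframework.jdbc.datasource.init.DatabasePopulatorUtils;
import org.springframework.jdbc.datasource.init.ResourceDatabasePopulator;

public class BatchSchemaInitializer {
    // Creates the BATCH_* meta-data tables using the script shipped
    // inside the Spring Batch jar. Run once against an empty schema.
    public static void initialize(DataSource dataSource) {
        ResourceDatabasePopulator populator = new ResourceDatabasePopulator();
        // schema-mysql.sql is an assumption; pick the script for your database
        populator.addScript(new ClassPathResource(
                "org/springframework/batch/core/schema-mysql.sql"));
        DatabasePopulatorUtils.execute(populator, dataSource);
    }
}
```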

Jobs
Setting up jobs is fairly straightforward. Create a separate XML file for job declarations. Doing so keeps jobs less confusing and allows all jobs to be housed under one roof.

<job id="exampleBatchJob" xmlns="http://www.springframework.org/schema/batch">
   <step id="processData">
      <tasklet>
         <chunk reader="exampleReader" processor="exampleProcessor" writer="exampleWriter" commit-interval="1000" />
      </tasklet>
   </step>
   <step id="generateReportForExampleJob">
      {report logic}
   </step>
</job>

Each job has a unique id, and each step within it has a unique step id. Within the tasklet for each step are the declarations for the reader, writer, and processor. As mentioned, the processor is optional, so leave it out if not necessary. If multiple pieces of work must run in order, create multiple steps within one job. If they are not sequential, create separate jobs.

A great feature to note from the code above is the commit interval. This sets how many items are processed in a single chunk, and therefore how many rows are committed at once. 1,000 rows is a good starting point for a simple job. The lower the number, the less strain on the server, but the slower the process, too.

There are several ways to declare jobs. The preferred method is to write a separate XML document for each job. Putting those XML documents into their own folders reduces clutter and allows searching by the title of the job. If there are only a couple of jobs, feel free to place them all on the same XML.

<bean id="exampleReader" class="org.springframework.batch.item.database.JdbcCursorItemReader" scope="step">
   <property name="dataSource" ref="dataSource" />
   <property name="sql" value="SELECT * FROM orders WHERE status='completed'" />
   <property name="rowMapper">
      <bean class="com.example.OrderMapper" />
   </property>
</bean>
 
<bean id="exampleProcessor" class="com.example.OrderProcessor" />
<bean id="exampleWriter" class="com.example.OrderWriter" />

ItemReaders
ItemReaders do exactly what their title suggests: read data, one item at a time. Both readers and writers can be set up in two ways, either through XML, or with simple bean declarations backed by Java code.

In the XML above, we declare our SQL and a row mapper that turns each row into an item. This could just as easily have been a FlatFileItemReader pointed at a CSV or .txt file. The mapper can be declared as a Java class with a bean declaration, or written inline in the XML. If preprocessing is needed, such as dropping results from the list of items passed to the processor, follow the example of the ItemWriter below, but implement ItemReader rather than ItemWriter.
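For the flat-file case, the reader can be wired up in Java instead of XML. This is a hedged sketch, assuming a CSV file named orders.csv with id and amount columns, and an Order class with matching setters (both assumptions for illustration); the bean it produces would fill the same reader slot in the job XML.

```java
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.core.io.FileSystemResource;

public class OrderFileReaderFactory {
    public static FlatFileItemReader<Order> csvOrderReader() {
        FlatFileItemReader<Order> reader = new FlatFileItemReader<>();
        reader.setResource(new FileSystemResource("orders.csv")); // assumed path

        // Split each comma-delimited line and name the columns
        DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer();
        tokenizer.setNames(new String[] { "id", "amount" });

        // Map the named columns onto Order's bean properties
        BeanWrapperFieldSetMapper<Order> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
        fieldSetMapper.setTargetType(Order.class);

        DefaultLineMapper<Order> lineMapper = new DefaultLineMapper<>();
        lineMapper.setLineTokenizer(tokenizer);
        lineMapper.setFieldSetMapper(fieldSetMapper);
        reader.setLineMapper(lineMapper);
        return reader;
    }
}
```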

ItemProcessor
An ItemProcessor is almost always a bean declaration with logic falling into a Java class. Declare the ItemProcessor as above, linking to the necessary Java class. The class itself should have the following structure:

public class OrderProcessor implements ItemProcessor<Order, Report> {
    @Override
    public Report process(Order item) throws Exception {
        // filter out orders with an amount under 10
        if (item.getAmount() < 10) {
            return null; // null = drop this item
        }
        Report report = new Report();
        report.setOrderId(item.getId());
        // more report logic
        return report;
    }
}

The example above illustrates a commonly used, basic pattern.
First, implement ItemProcessor, declaring the incoming and outgoing types; in this case, taking in an Order object and spitting out a Report object. The overridden process method is then called once per item, as if inside a for loop. If it returns null, the item is dropped. If an item is returned, it is passed on to the writer.

ItemWriter
This is the reverse of the ItemReader. It uses a similar structure to the ItemProcessor, but receives a whole chunk of items at a time and typically ends by writing them out to the database.

public class OrderWriter implements ItemWriter<Report> {
    @Autowired
    private ReportDao reportDao;

    @Override
    public void write(List<? extends Report> items) throws Exception {
        for (Report item : items) {
            reportDao.save(item);
        }
    }
}

Data can be manipulated in both the reader and the writer, but the preferred approach keeps most of the logic in the processor phase. This keeps the process cleaner than mixing data manipulation into the reader and the writer. Projects may not even need the processor if, for instance, data is simply being taken from a CSV and moved into a specific database.

Final Thoughts

Working with Spring Batch is relatively painless. Although simple, the examples above showcase how quickly Spring Batch can be set up and running smoothly. For individuals with no prior Spring Batch knowledge, implementing Spring Batch and creating a functioning job may take as little as a couple of workdays. After the initial setup, building fairly complex batch jobs can take around another day. This kind of quick prototyping, coupled with the ability to work within agile methodologies, is what really makes a new technology so attractive. So much so, in fact, that it sparked a need to write something about it!

Relevant Links

Spring Batch Home - http://projects.spring.io/spring-batch/

Excellent Mkyong Tutorial - http://www.mkyong.com/tutorials/spring-batch-tutorial/

Github - https://github.com/spring-projects/spring-batch

Obligatory Wikipedia - http://en.wikipedia.org/wiki/Spring_Batch