Spring One Wrap-up: Spring Batch, Spring Hadoop and Spring XD - codecentric AG Blog
Here it comes, the second part of my Spring One wrap-up, this time not from sunny California but from rainy Germany. The first one was about Spring IO and Spring Boot, and it’ll be all about batch now. I’ll focus on three projects here, one of them around for quite a long time (Spring Batch), one fairly new (Spring for Hadoop) and one brand new (Spring XD).
Spring Batch
If you haven’t heard of it yet: JSR-352 standardizes batch application development in Java SE and EE. Spring Batch was deeply involved in bringing the JSR forward, so it’s no surprise that Spring Batch 3.0 will be fully compliant with the spec. And yes, even the read-process-write cycle, which differs between the current implementation of Spring Batch and the spec, will optionally be adapted. In the future you’ll have the choice between read-process-read-process-bulkwrite (JSR-352 style) and read-read-process-process-bulkwrite (classic Spring Batch style).
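To make the difference between the two cycles concrete, here is a minimal sketch in plain Java. It does not use the Spring Batch or JSR-352 APIs; the class and method names (`ChunkCycles`, `interleaved`, `readAllFirst`) are made up for illustration, and the only thing it tries to show is the order in which read, process and write are called for one chunk.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only: not the Spring Batch or JSR-352 API.
class ChunkCycles {

    // JSR-352 style: read one item, process it, repeat, then write the whole chunk.
    static List<String> interleaved(Iterator<Integer> reader, int chunkSize, List<String> log) {
        List<String> processed = new ArrayList<>();
        while (processed.size() < chunkSize && reader.hasNext()) {
            Integer item = reader.next();
            log.add("read " + item);
            processed.add(process(item, log));
        }
        log.add("write " + processed);
        return processed;
    }

    // Classic Spring Batch style: read the whole chunk, process every item, then write.
    static List<String> readAllFirst(Iterator<Integer> reader, int chunkSize, List<String> log) {
        List<Integer> items = new ArrayList<>();
        while (items.size() < chunkSize && reader.hasNext()) {
            Integer item = reader.next();
            log.add("read " + item);
            items.add(item);
        }
        List<String> processed = new ArrayList<>();
        for (Integer item : items) {
            processed.add(process(item, log));
        }
        log.add("write " + processed);
        return processed;
    }

    // Stand-in for an item processor; records the call and transforms the item.
    static String process(Integer item, List<String> log) {
        log.add("process " + item);
        return "item-" + item;
    }

    public static void main(String[] args) {
        List<String> jsrLog = new ArrayList<>();
        interleaved(List.of(1, 2).iterator(), 2, jsrLog);
        System.out.println(jsrLog);
        // [read 1, process 1, read 2, process 2, write [item-1, item-2]]

        List<String> classicLog = new ArrayList<>();
        readAllFirst(List.of(1, 2).iterator(), 2, classicLog);
        System.out.println(classicLog);
        // [read 1, read 2, process 1, process 2, write [item-1, item-2]]
    }
}
```

Both variants write the same chunk in one go; only the interleaving of reads and processing differs.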
It’s worth mentioning (as Michael Minella did in his talk) that Spring Batch is much more than just an implementation of the spec. First of all, it offers a wide range of ready-to-use components, like ItemWriters for almost every technology. It also offers more parallelization options, and with Spring Batch Admin there is a ready-to-use management tool. And with Spring for Hadoop and Spring XD it’s ready to work in both the classic and the big data space.
Spring for Apache Hadoop
This project is about making it easier to work with the Hadoop APIs, letting you use dependency injection and so on. In addition, it provides Spring Batch Tasklet implementations to plug Hadoop tasks into a Spring Batch workflow, like a ScriptTasklet or tasklets for MapReduce. This way Spring Batch becomes the driver for Big Data processing.
Spring YARN is really brand new, just announced at the conference, as a sub-project of Spring for Apache Hadoop. As you might know, YARN is the foundational framework for Hadoop 2. Basically it’s there to distribute work over a Hadoop cluster, but this time the work may be anything, not just Mapper and Reducer tasks. There was a very interesting presentation showing how a partitioned Spring Batch job ran on a Hadoop cluster, with the partitions distributed all over the cluster. You heard right: not starting a Hadoop MapReduce job from outside the cluster, but running a partitioned Spring Batch job IN the cluster. You can find the example code here.
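If partitioning is new to you, the idea can be sketched in a few lines of plain Java. This is not the example code from the talk and it uses neither Spring Batch nor YARN: a local thread pool stands in for the cluster nodes, and all names (`PartitionedJobSketch`, `partition`, `runWorkerStep`) are invented for this illustration. A master splits an id range into partitions, every worker handles its own range, and the master collects the results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: threads stand in for cluster nodes.
class PartitionedJobSketch {

    // Splits the id range [min, max] into at most gridSize contiguous partitions.
    static List<long[]> partition(long min, long max, int gridSize) {
        List<long[]> partitions = new ArrayList<>();
        long total = max - min + 1;
        long size = (total + gridSize - 1) / gridSize; // ceiling division
        for (long start = min; start <= max; start += size) {
            partitions.add(new long[] {start, Math.min(start + size - 1, max)});
        }
        return partitions;
    }

    // The work one worker does for its partition; here it just sums the ids.
    static long runWorkerStep(long[] range) {
        long sum = 0;
        for (long id = range[0]; id <= range[1]; id++) {
            sum += id;
        }
        return sum;
    }

    // The master: hands every partition to a worker and aggregates the results.
    static long runPartitioned(long min, long max, int gridSize) {
        ExecutorService pool = Executors.newFixedThreadPool(gridSize);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (long[] range : partition(min, max, gridSize)) {
                Callable<Long> workerStep = () -> runWorkerStep(range);
                results.add(pool.submit(workerStep));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("partitioned run failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runPartitioned(1, 100, 4)); // prints 5050
    }
}
```

In the setup shown at the conference the workers were spread over the Hadoop cluster instead of running as local threads, but the split, work and aggregate shape is the same.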
Spring XD
Spring XD belongs to the Execution part of the Spring IO platform. If you have to put it in one sentence: it’s a runtime environment for data processing.
- You may define data streams reading from sources and writing to sinks, with processors in between to transform the data. The source could be Twitter, the sink could be HDFS, and the processor could convert the data to a certain format. Behind the scenes it uses Spring Integration, as you might have guessed.
- You may define batch jobs that can be started by all kinds of triggers, including streams. Spring Batch is used here, which means you can include Hadoop in your processing.
- You may create custom components (jobs, sources, processors, sinks) and deploy them to the Spring XD server, where each component gets its own isolated classloader.
- You may run Spring XD in single node mode or distributed.
- And with certain analytics components you may easily define various metrics on your streams and jobs.
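The source, processor and sink idea behind the streams can be sketched in plain Java as well. To be clear, this is neither the Spring XD DSL nor Spring Integration; the names (`StreamSketch`, `run`) and the hard-coded messages are assumptions for the sake of a small runnable example. A source produces messages, a processor transforms each one, and a sink stores the result.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical illustration only: not the Spring XD or Spring Integration API.
class StreamSketch {

    // Pulls every message from the source, transforms it, and hands it to the sink.
    // Returns the number of messages that went through the stream.
    static <I, O> int run(Iterator<I> source, Function<I, O> processor, Consumer<O> sink) {
        int count = 0;
        while (source.hasNext()) {
            sink.accept(processor.apply(source.next()));
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for a Twitter source.
        List<String> tweets = List.of("Spring XD announced", "Batch on YARN");
        // Stand-in for a sink such as HDFS or a relational database.
        List<String> stored = new ArrayList<>();

        Function<String, String> processor = String::toUpperCase;
        Consumer<String> sink = stored::add;

        int count = run(tweets.iterator(), processor, sink);
        System.out.println(count + " messages stored: " + stored);
    }
}
```

Swapping the source, the processor or the sink does not touch the other two parts, which is pretty much the appeal of defining streams this way.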
Why is that a big thing?
- First of all, this project is built on very mature technologies that have been around for ages: Spring Batch and Spring Integration. Spring XD adds a runtime environment and a lot of gimmicks around it (for example the DSL), but in the end, it’s Spring Integration and Spring Batch, and we know that they do their job very well.
- Over the last few years, a lot of new technologies and projects have emerged, from Big Data processing platforms like Hadoop to many NoSQL stores like MongoDB, Redis, Neo4j and so on. Most of them can easily be integrated into Spring Batch and Spring Integration through the Spring Data projects, including Spring for Apache Hadoop.
- There’s no need to start big. Not everything is Big Data (TM). Not everybody needs Hadoop. So if you just wanna know what’s happening under your company’s hashtag on Twitter, create a stream that reads from Twitter and writes into a relational database, and run Spring XD in single node mode. That’s a perfectly valid use case. But if you one day see the need for Big Data storage, it’s no problem to add Hadoop to the processing.
- You may even use Spring XD as your central platform for running Spring Batch jobs. Spring Batch Admin will be integrated into the server, and triggering batch jobs can be done in many different ways, from cron to stream to custom.
- All of it is open source under the Apache 2.0 license.
For me Spring XD has been the most exciting new thing at Spring One, because I think something like it has really been missing. A central data processing platform capable of integrating all the exciting new technologies, all open source and free: I don’t think there is anything similar out there.
With all the new things this year at or around Spring One it really feels like the Spring ecosystem is gaining momentum, an impression that I didn’t get, for example, two years ago at Spring One in Chicago. It’ll be very interesting to see where all the new stuff stands one year from now!