Big Data - What to do with it? (Part 1 of 2) - codecentric AG Blog

Analyze it, of course!

Uh, right….. but how?

That’s what I’m going to talk about in this first part of a two-part blog series on Big Data analytics. So, if you are interested in some ideas and answers, please bear with me 🙂

Data’s the new currency

We all know the hype: companies like Google or Facebook offer quality services “free of charge” and still make billions of dollars by leveraging the data their customers provide. This scared the established players and, in doing so, created a whole new breed of data applications around one problem: Big Data.

But, what exactly is Big Data?

Unfortunately, that’s one of those questions where you ask two people and get three different opinions. One definition that has come to be more widely accepted is the 3 V’s:

– High volume (amount of data),
– High velocity (speed of data in and out),
– High variety (range of data types, sources and structures).

As a rule of thumb: if it becomes a problem to get your data into your RDBMS of choice, then most likely you’ve encountered Big Data.

In my opinion, the first V (volume) is overrated (in Germany, at least). It may be called Big Data, but it’s the third V (variety) that truly distinguishes Big Data storage systems from traditional RDBMS.

Now, the good things about Big Data and, more to the point, about most of the systems that sprang up around it are

– that you can store the data as it happens to come along, no matter its structure
– that the systems scale virtually without limit, and
– that you therefore have the possibility to discover new insights, trends and opportunities in your business to act upon.

Of course, at the end of the day, “act upon” means “earn money”.

Data is useless unless combined with other data to create information

That is a simple fact, and it is also the point where it gets bad: not only do we face (possibly huge amounts of) semi-structured data, but querying and interconnecting Big Data is much harder than with traditional RDBMS that support SQL. Most of the Big Data systems out there use their own mechanism and/or language to query the data (as of today; maybe in the future there will be something like SQL for Big Data, or traditional RDBMS will catch up). This means you’ll need at least one expert in the particular system at hand to query the data efficiently.
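To make the contrast concrete, here is a minimal sketch in Python. The first half answers a question ("how much did each customer spend?") with a single SQL statement, using the standard library's sqlite3 module. The second half answers the same question over semi-structured records whose shape varies. The table, the records and all field names are invented for illustration; a real Big Data store would bring its own query mechanism instead of plain Python.

```python
import sqlite3
from collections import defaultdict

# Relational side: a fixed schema and one declarative query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)],
)
sql_totals = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)

# Semi-structured side: the same facts, but every record looks different
# (hypothetical shapes, chosen only to show the extra work involved).
documents = [
    {"customer": "alice", "amount": 10.0},
    {"user": {"name": "bob"}, "total": 5.0},
    {"customer": "alice", "items": [{"price": 2.0}, {"price": 0.5}]},
]


def customer_of(doc):
    """Find the customer name wherever this record happens to keep it."""
    return doc.get("customer") or doc.get("user", {}).get("name")


def amount_of(doc):
    """Find the order amount, summing line items if there is no total."""
    if "amount" in doc:
        return doc["amount"]
    if "total" in doc:
        return doc["total"]
    return sum(item["price"] for item in doc.get("items", []))


doc_totals = defaultdict(float)
for doc in documents:
    doc_totals[customer_of(doc)] += amount_of(doc)
doc_totals = dict(doc_totals)

print(sql_totals)
print(doc_totals)
```

Both halves arrive at the same totals, but the second one only works because someone knew every shape the records can take. That knowledge is exactly what the expert brings to the table.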

Analyzing the data to extract the required information to decide on the right action is the real challenge

And this is where it gets ugly: as of today, the support for data analysts who need to analyze data stored in Big Data storage systems efficiently is far from optimal. Usually, it looks like this:

The data analyst and the expert work together, with the former asking the questions and the latter doing his best to provide the answers by querying the system. This is easier said than done: depending on the complexity of the question, it may take quite a while before an answer can be provided, with multiple map-reduce jobs required in the process. And that’s not all: usually the data analyst needs to present his findings in a way that business people can understand (i.e. reports), which is a lot easier with the right tools.
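For readers who have never seen one, the shape of a map-reduce job can be sketched in a few lines of plain Python. This is only an illustration of the idea (map each record to key/value pairs, group by key, reduce each group); it is not the API of Hadoop or any other system, and the event records and function names are made up.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    # Records without a "type" field are counted under "unknown".
    yield (record.get("type", "unknown"), 1)


def shuffle(pairs):
    """Group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values of one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


events = [
    {"type": "click", "page": "/home"},
    {"type": "view"},
    {"type": "click", "page": "/cart"},
    {"page": "/orphan"},
]

counts = run_job(events)
print(counts)
```

Even this trivial question ("how many events of each type?") needs three hand-written functions. A more complex question typically chains several such jobs, each consuming the output of the previous one, which is where the waiting time comes from.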

As we all know, time is money and the time required to analyze Big Data and extract useful information out of it can become such a big factor that Big Data can overwhelm an organization.

I’m not saying that with traditional RDBMS you won’t face the same problems. I’m saying that with those systems

– data analysts can work mostly without the need for additional experts, because most of them know SQL and
– there exist a lot of tools that assist them in their work

making it a lot easier and therefore way faster for them to extract and provide useful information out of the data.

So, now what? Stick with traditional RDBMS whenever possible and only try Big Data if you really, really know what you are doing and getting yourself into?

Well… it’s always best if you know what you are doing, isn’t it?

Enter: JasperReports

Huh? Where did that come from?

First of all: the heading might also have been “Enter: Pentaho” or maybe even “Enter: BIRT”. JasperReports is not the only tool suite providing features for Big Data analysis; it’s simply the one I know best.

The point is: there really are tools out there that assist data analysts when working with Big Data, but that fact doesn’t seem to be widely known — even among people trying to sell Big Data.

For all of you who don’t know: JasperReports has been around for quite a while (since 2001) and is at its heart an open source Java reporting library. Over the years, a whole ecosystem of tools has grown around this core, so that today JasperReports offers a complete BI suite.

Now, how does JasperReports aid data analysts in working with Big Data?

In many ways; some are part of the Community Edition (Open Source), others are only available in the professional editions and have to be paid for.

At the core is still the open source reporting library, which has connectors available for many Big Data storage systems like MongoDB, Hadoop (Hive and HBase) or Google BigQuery. The main advantage is that you are much more likely to find someone who can create a useful JasperReports report than someone who can extract useful information directly out of said storage systems. The second advantage is iReport, the open source graphical report design tool, which aids in creating the reports and provides additional features for Big Data, such as auto-discovery of available fields.
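To give an idea of what field auto-discovery means for schemaless data, here is a small Python sketch. It scans a handful of sample documents and collects every field path together with the value types seen for it. This is an assumption about the general idea only, not a description of how iReport implements the feature; the function name and the sample documents are invented.

```python
from collections import defaultdict


def discover_fields(documents, prefix=""):
    """Return {dotted field path: set of type names} seen across documents."""
    fields = defaultdict(set)
    for doc in documents:
        for key, value in doc.items():
            path = f"{prefix}{key}"
            if isinstance(value, dict):
                # Descend into nested documents and merge what is found there.
                nested = discover_fields([value], path + ".")
                for sub_path, types in nested.items():
                    fields[sub_path] |= types
            else:
                fields[path].add(type(value).__name__)
    return dict(fields)


sample = [
    {"name": "alice", "age": 30},
    {"name": "bob", "address": {"city": "Berlin"}},
    {"name": "carol", "age": "unknown"},
]

schema = discover_fields(sample)
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

Note that the "age" field shows up with two different types. With high-variety data that is the normal case, and it is something a report designer needs to see before building a report on top of it.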

There is one caveat: you still need to know the query functionality of the storage system at hand in order to create the connection to your data and to keep that connection efficient (so you should probably keep the expert). However, once that is done, the report designer (aka data analyst) can work with the data the way he is used to. In this blog I provide a JasperReports tutorial connecting to a MongoDB instance, which demonstrates how that works.

However, it is still a lot of work to create the report (if you took a peek at the tutorial, you’d agree), which makes one wonder whether there isn’t an easier way to work with the data.
Well, there is!

The JasperReports ecosystem features a server component which, at the community edition level, functions as a central repository for report storage and provides a GUI and APIs for report execution and scheduling. At the professional edition level, this component provides (among other things) an ad-hoc reporting GUI that allows data analysts to explore the data on the fly, literally “playing around” with it.

I find this feature to be extremely powerful, especially when considering the nature of Big Data: data analysts don’t have to know up front how the data in storage is structured; they can explore the data as it is made available by the initial connection query (remember the caveat above!). In the second part of this blog series I will explore this in more detail.

In closing, I would like to repeat that, in my opinion, the discussion around Big Data focuses too much on the “huge amount/high velocity of data” part when it should focus more on the “high variety of data” part. Variety offers much greater possibilities than sheer amount or velocity of data, but it also bears greater risks, because useful information is so much harder to extract.

Ok, on to part two.