Capping IT Off

What is Big Data?

2011 has been a year rich with new trends in the Business Intelligence (BI) space, some of which have rocketed this year (#1):

  • Mobile BI: Adapting analytic tools to mobile characteristics (always carried, rarely switched off, sometimes geographically localized, small screens, multiple device versions, private usage…)
  • In-memory analytics: Keeping all data in memory to dramatically improve the response time of analytic tools and to allow new kinds of usage
  • Big Data: Which is…?
Of course, I will not answer this question in just one sentence; otherwise it would have been strange to put it in the title.

I have just Googled this to look for some good definitions of “What is Big Data” from major vendors (sorry, I don’t have all the vendors, only the best Google ranks on this topic; it is a good way to be non-exhaustive, and I will reuse it):

  • For IBM, Big Data spans 3 dimensions: variety, velocity and volume,
  • For Teradata, Mayank Bawa states that Big Data means different analytics, data structure and diversity,
  • For EMC, Bill Schmarzo agrees with Gartner, Forrester, IDC and McKinsey that big data is more than just data volume; it includes data velocity, data variety and data complexity.
I could continue like this for a long time. For vendors, Big Data is always presented as a great opportunity. Like Gartner, they all evoke:
  • Data volume: terabytes, petabytes, billions of rows per day, hour or minute,
  • Data variety: mixing point-of-sale data, call data records, machine-generated data, scanned documents, social networking data, smart metering data, structured and unstructured data. This includes data complexity, not only because of the variety itself but also because some data may be really complex to analyze, like video or binary data from M2M communications…
  • Need for velocity: a big amount of data means data becomes out of date very quickly, so it is important to use it as fast as possible.
I would like to change the point of view and look at Big Data from the business side instead of the marketing or the professional side. And from that point of view, Big Data is never an opportunity; it is first of all an issue to face. They (we) don’t manipulate terabytes or petabytes of data because they (we) like it; they (we) do it because they (we) have no choice. They (we) would prefer to be able to do the same while manipulating only a few gigabytes… or even less.

The Big Data issue occurs when data volume has increased so much that IT is no longer able to satisfy your data needs, like having the sales dashboard ready at 8 am or giving the value rank of a customer in a call center in less than a minute…

  • It may happen with low data volume if the architecture has not been designed to be scalable, like a website that goes down during the first day of a discount promotion.
  • It may happen without data variety; I guess all the big retailers will claim their point-of-sale data has a uniform model, yet running analytics and data mining on it still takes too much time.
  • It may happen without a need for velocity, like insurance companies that have to keep contracts and documents for more than 30 years!
Facing data variety needs a dedicated architecture, a way to model (or even a way NOT to model), and a real business case that justifies it and explains how the unstructured data will be used. I’m always suspicious when the goal is “first we store, then we’ll see how to use it”. I’m also doubtful when people tell me that storing a URL is full of variety… And every time I meet a strong variety of data (like telecommunication data, where we’ll find files defined by the hardware vendors, different by model and by generation), understanding the need always requires understanding the data, and so modeling it (and therefore transforming it) at least partially, even if in the end a search engine may be enough. But I’ll come back later to the big data vault concept and focus first on the velocity constraint and the volume increase.

It is important to explain that nowadays, the technology is able to face scalability issues (volume and velocity, since from the architecture point of view velocity is only a performance constraint). In the end, the technical answer to Big Data needs is always a scalable solution. I’m not speaking about a 100% scalable solution; I’m just speaking about a solution that is able to support massive increases (data volume, user quantity, usage increase…).

Nowadays, 3 main solutions exist to face scalability needs:

  • Using server farms when the work can be split into fully independent and unitary small jobs. This is easily done for web servers or application servers and is the best way to guarantee scalability for the application part. For every user request, a load balancer chooses a server, and that server is able to answer the request without involving the other servers of the farm (a minimal sketch of this round-robin principle follows after this list). This solution is perfect for the front-end part, even for the data integration part (EAI, ESB, ETL…). But it can’t work like this on the database part.
  • Having a specialized database appliance. Solutions like (non-exhaustive, alphabetical sort) EMC Greenplum, HP Vertica, IBM DB2 ISAS, IBM Netezza, Kognitio WX2, Microsoft Parallel Data Warehouse, Oracle Endeca, Oracle Exadata, SAP Sybase IQ or Teradata have been built to support very massive quantities of data.
  • Using Hadoop. Its distributed file system (HDFS) allows fully scalable storage of unstructured or structured data. This solution is used by eBay, Yahoo, LinkedIn or Facebook to store and manipulate their huge volumes of structured and unstructured data. Hadoop is based on 2 main concepts: the data is split between the servers (HDFS), and each job uses “MapReduce” to be split into many small jobs launched on different servers (a small MapReduce sketch also follows after this list). These 2 main concepts guarantee the scalability of the platform. For a high-level vision of the Hadoop architecture, you can read my post What is Hadoop?
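To illustrate the server farm bullet, here is a minimal Python sketch of the round-robin principle: each incoming request is answered by exactly one server of the farm, chosen independently of the others. The server names and requests are invented for the illustration; a real front end would use a dedicated load balancer in front of real web or application servers, but the principle is the same.

    # Minimal sketch of the "server farm + load balancer" principle.
    # Server names and requests below are purely illustrative.
    from itertools import cycle

    class LoadBalancer:
        def __init__(self, servers):
            self._servers = cycle(servers)      # round-robin over the farm

        def handle(self, request):
            server = next(self._servers)        # one server answers, alone
            return f"{server} answered '{request}'"

    farm = LoadBalancer(["app-01", "app-02", "app-03"])
    for req in ["GET /dashboard", "GET /report", "GET /dashboard"]:
        print(farm.handle(req))

Because each request is self-contained, adding servers to the farm simply adds capacity, which is why this approach works so well for the application tier but not for a shared database.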
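And to make the MapReduce idea from the Hadoop bullet concrete, here is a small, self-contained Python sketch that simulates the map and reduce phases locally for a word count, with a multiprocessing pool standing in for the servers of the cluster. It only illustrates the principle (the function names are mine, not Hadoop APIs); on a real cluster, Hadoop distributes the data chunks and the map/reduce tasks across the nodes for you.

    # Word count expressed as map + shuffle + reduce, simulated locally.
    # A multiprocessing pool plays the role of the cluster's servers.
    from collections import defaultdict
    from multiprocessing import Pool

    def map_phase(chunk):
        """Map: turn one chunk of text into (word, 1) pairs."""
        return [(word.lower(), 1) for word in chunk.split()]

    def reduce_phase(item):
        """Reduce: sum all the counts collected for one word."""
        word, counts = item
        return word, sum(counts)

    if __name__ == "__main__":
        documents = [
            "big data is more than data volume",
            "data variety and data velocity matter too",
        ]
        with Pool(2) as workers:                        # 2 "servers"
            mapped = workers.map(map_phase, documents)  # map step in parallel
            # Shuffle: group the (word, 1) pairs by key before reducing
            grouped = defaultdict(list)
            for pairs in mapped:
                for word, count in pairs:
                    grouped[word].append(count)
            counts = dict(workers.map(reduce_phase, grouped.items()))
        print(counts)   # e.g. {'big': 1, 'data': 4, ...}

The shuffle step in the middle, grouping every (word, 1) pair by key, is what Hadoop performs between the map and reduce phases; because both phases work on independent pieces of data, they can be spread over as many servers as needed.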
This means that the data volume explosion is no longer a technical issue. It may be a financial issue, as putting in place a new architecture is always costly, but it is a good sign that all the solutions evoked are based on commodity hardware. It means they work with standard PCs (server-grade PCs, of course) instead of large and expensive mainframe servers and SAN bays (#2).

A dedicated blog post lists all Big Data vendors and technologies.

My advice is that for each new project or major evolution, it is mandatory to take into account the capacity of the solution to scale and to plan how large, unplanned data volume increases can be handled.


#1: About the BI trends of 2011, in case you want to add other topics: I consider that BI in the cloud will probably be the next trend (not yet one, as the questions are starting to come but the potential is still largely ahead). I also consider that Agile BI and Real-Time BI are no longer trends; they’re reality. If you see other BI trends, please tell me; I don’t pretend to be exhaustive.

#2: Note that it is funny to see that the solutions designed to manipulate very large volumes of data are based on small PCs instead of large UNIX machines.

About the author

Manuel Sevilla
