2011 has been a year rich with new trends in the Business Intelligence (BI) space, some of which have really taken off this year (#1):
- Mobile BI: Adapting analytic tools to the characteristics of mobile devices (always carried, rarely switched off, sometimes location-aware, small screen, multiple versions, private usage …)
- In-memory analytics: Keeping all data in memory to dramatically improve the response time of analytic tools and to allow new kinds of usage
- Big Data: Which is…?
I have just Googled this to find some good definitions of “What is Big Data” from major vendors (sorry, I don’t cover all the vendors, only the ones that rank best on Google for this topic; it is a good way to be non-exhaustive, and I will reuse it):
- For IBM, Big Data spans 3 dimensions: variety, velocity and volume,
- For Teradata, Mayank Bawa states that Big Data means different analytics, data structure and diversity,
- For EMC, Bill Schmarzo agrees with Gartner, Forrester, IDC and McKinsey that big data is more than just data volume; it includes data velocity, data variety and data complexity.
- Data volume: terabytes, petabytes, billions of rows per day, hour or minute,
- Data variety: mixing point-of-sale data, call data records, machine-generated data, scanned documents, social networking data, smart metering data, structured and unstructured data. This includes data complexity, not only because of the variety itself but also because some data may be really complex to analyze, like video or binary data from M2M communications…
- Need for velocity: a large amount of data means data becomes out of date very quickly, so it is important to use data as fast as possible.
The Big Data issue occurs when data volume has increased so much that IT is no longer able to satisfy your data needs, like having the sales dashboard ready at 8 am or giving the value rank of a customer in a call center in less than a minute…
- It may happen with low data volume if the architecture has not been designed to be scalable, like a website that goes down on the first day of a discount promotion.
- It may happen without data variety: I guess all the big retailers will claim their point-of-sale data follows a uniform model, yet running analytics and data mining on it takes too much time.
- It may happen without a need for velocity, like insurance companies that have to keep contracts and documents for more than 30 years!
It is important to explain that nowadays technology is able to deal with scalability issues (both volume and velocity, since from the architecture point of view velocity is only a performance constraint). In the end, the technical answer to Big Data needs is always a scalable solution. I’m not speaking about a 100% scalable solution; I’m just speaking about a solution that is able to support massive increases (in data volume, number of users, usage …)
Nowadays, 3 main solutions exist to address scalability needs:
- Using a server farm, when the work can be split into small, fully independent, unitary jobs. This is easily done for web servers or application servers and is the best way to guarantee scalability for the application part. For every user request, a load balancer chooses a server, and that server is able to answer the request without the other servers of the farm. This solution is perfect for the front-end part, and even for the data integration part (EAI, ESB, ETL …). But it can’t work like this for the database part.
- Having a specialized database appliance. Solutions like (non-exhaustive, in alphabetical order) EMC Greenplum, Oracle Endeca, HP Vertica, IBM DB2 ISAS, IBM Netezza, Kognitio WX2, Microsoft Parallel Data Warehouse, Oracle Exadata, SAP Sybase IQ or Teradata have been built to support very large quantities of data.
- Using the Hadoop Distributed File System (HDFS), which allows fully scalable storage of unstructured or structured data. This solution is used by eBay, Yahoo, LinkedIn or Facebook to store and manipulate their huge volumes of structured and unstructured data. Hadoop is based on 2 main concepts: the data is split between the servers (this is the HDFS part), and each job uses a “Map Reduce” framework to split itself into many small jobs launched on different servers. These 2 main concepts guarantee the scalability of the platform. For a high-level view of the Hadoop architecture, see my post What is Hadoop?
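To make the two Hadoop concepts above more concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API: the function names (split_into_blocks, map_block, shuffle, reduce_group, word_count), the round-robin split and the word-count job are all assumptions chosen for illustration, and threads simply stand in for separate servers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, n_servers):
    """Concept 1: spread the data across the servers (round-robin here)."""
    blocks = [[] for _ in range(n_servers)]
    for i, record in enumerate(records):
        blocks[i % n_servers].append(record)
    return blocks


def map_block(block):
    """Map step: one small job that only reads its own block of data."""
    pairs = []
    for line in block:
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs


def shuffle(all_pairs):
    """Group the intermediate (key, value) pairs by key."""
    groups = defaultdict(list)
    for pairs in all_pairs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_group(values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)


def word_count(records, n_servers=3):
    """Concept 2: split one job into small map jobs, then reduce."""
    blocks = split_into_blocks(records, n_servers)
    # Threads play the role of independent servers in this sketch.
    with ThreadPoolExecutor(max_workers=n_servers) as pool:
        mapped = list(pool.map(map_block, blocks))
    groups = shuffle(mapped)
    return {key: reduce_group(values) for key, values in groups.items()}


if __name__ == "__main__":
    logs = ["big data", "big volume", "data variety", "data velocity"]
    print(word_count(logs))
```

The point of the sketch is that adding servers only changes n_servers: each map job works on its own block without talking to the others, which is why this model scales by adding machines.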
A dedicated blog post lists all Big Data vendors and technologies.
My advice is that for each new project or major evolution, it is mandatory to take into account the ability of the solution to scale, and to plan how large, unplanned data volume increases can be handled.
#1: About the BI trends of 2011, in case you want to add other topics: I consider that BI in the cloud will probably be the next new trend (it is not one yet: the questions are starting to come, but most of the potential is still ahead). I also consider that Agile BI and Real-Time BI are no longer trends; they are reality. If you see other BI trends, please tell me; I don’t claim to be exhaustive.
#2: Note that it is funny to see that the solutions designed to manipulate very large volumes of data are based on small PCs instead of large UNIX machines.