Not Only SQL, Not Only Hadoop…
I think the Big Data topic is moving from hype to reality. A few months ago, many of my customers thought of Big Data as something designed only for large internet companies like Google, Yahoo, Groupon etc. More recently, they have started to discuss how Big Data can help them solve some of their problems and do things that were not possible before. More importantly, they are really listening to the market and meeting many vendors or IS to better understand what Big Data really is.
Big Data solutions were born to solve two problems that classic databases could not handle:
- Scaling massively at low cost
- Working with non-modeled and non-structured data (i.e. internet data originally)
To designate this new family of solutions, the word NoSQL was coined and used for the first time in 1998. NoSQL doesn’t mean No SQL, but Not only SQL! And the word SQL here stands for relational databases, not the SQL language. The expression NoSQL may be confusing, but it sounds really good, and this is why it is still used today. It groups a lot of technologies like Cassandra, Neo4j, MongoDB, HBase, and by extension, Hadoop (remember, Hadoop is not just one tool but a combination of many tools, see my “What is Hadoop” post).
To be direct and clear, I think Hadoop has won the war, even if I frequently meet Cassandra or Neo4j fans and I see MongoDB used in start-ups and internet companies... but for a company like Capgemini, these companies are not our main market.
Hadoop won the war because all the big or BIM-specialized vendors (IBM, Microsoft, Oracle, SAP, EMC, HP, Teradata, Informatica, SAS…) have built solutions on Hadoop, or at least connectors to Hadoop, whereas I am not aware of such massive involvement in any other NoSQL technology. There is also an ecosystem of very specialized Hadoop players: Cloudera (Oracle and IBM partner), Hortonworks (Teradata partner), MapR (EMC partner), plus all the support of Apache. These companies are very aggressive and active. And every time I meet customers who are aware of Big Data, they have always heard about Hadoop, so no more battles (I know this may trigger reactions; please prove me wrong, I’d be happy to learn the war is not over yet, even if I’m peaceful).
I have met many customers who think Big Data is Hadoop.
Whilst it is extreme (and wrong) to think that relational databases are dead and that NoSQL solutions should be used everywhere, it is also wrong to think that Big Data is only Hadoop.
A Big Data project is first and foremost a business plan to demonstrate the value of investing in Big Data. And the implementation by itself involves a few steps:
- Data acquisition: From internal databases, from external sources, from machines, from people, with a full portfolio of tools and with all the intellectual property (IP), legal and privacy constraints
- Data marshaling: All the data acquired has to be sorted, then either discarded (non-useful data) or stored in the most suitable format (through Hadoop or NoSQL solutions but also BI appliances, in-memory solutions…)
- Analytics: All this data has to be mined and used for predictive analysis, for alerting, and for finding innovative correlations
- Action: Once something is discovered thanks to the Analytics phase, it has to be used to feed the transactional systems to transform the insights into money (cost reduction or revenue increase)
- Data governance: All of this is utopian without data quality and an efficient master data management solution
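To make these steps more concrete, here is a minimal sketch in plain Python of how they could chain together. Everything in it is an assumption made for illustration: the sample records, the field names, the function names and the spending threshold are hypothetical, and governance is reduced to a simple quality filter applied before storage, whereas in a real project it spans every phase.

```python
# Hypothetical sample records; the field names are illustrative assumptions.
RAW_RECORDS = [
    {"customer": "A", "amount": 120.0},
    {"customer": "B", "amount": None},   # unusable: no amount
    {"customer": "A", "amount": 80.0},
    {"customer": "", "amount": 50.0},    # unusable: no customer
    {"customer": "C", "amount": 300.0},
]


def acquire():
    """Data acquisition: stand-in for reading databases, feeds or machines."""
    return list(RAW_RECORDS)


def govern(records):
    """Data governance: keep only records that pass basic quality checks."""
    return [r for r in records if r["customer"] and r["amount"] is not None]


def marshal(records):
    """Data marshaling: store the useful data grouped by customer."""
    store = {}
    for record in records:
        store.setdefault(record["customer"], []).append(record["amount"])
    return store


def analyze(store, threshold=150.0):
    """Analytics: flag customers whose total spend reaches the threshold."""
    return sorted(c for c, amounts in store.items() if sum(amounts) >= threshold)


def act(flagged):
    """Action: turn insights into something a transactional system can use."""
    return [f"send offer to customer {c}" for c in flagged]


def run_pipeline():
    """Chain the phases: acquire, check quality, store, analyze, act."""
    return act(analyze(marshal(govern(acquire()))))


if __name__ == "__main__":
    for action in run_pipeline():
        print(action)
```

In a real project each function would of course be a full toolset rather than a few lines, which is precisely the point: no single product covers the whole chain.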
I think Hadoop is part of the solution: it covers part of the data storage thanks to HBase and HDFS, and part of the Analytics phase with MapReduce. But it doesn’t cover the other phases, and very often it cannot cover 100% of Data Marshaling and Analytics.
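For readers who have never seen the MapReduce model mentioned above, here is a minimal word-count sketch in plain Python. It is not Hadoop's API: the function names are hypothetical, and everything runs in a single process, whereas Hadoop distributes the same three stages (map, shuffle, reduce) across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def map_reduce(lines):
    """Run the three stages in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(map_reduce(["big data", "Big Hadoop"]))
```

The model fits counting and aggregation very well, but other kinds of analysis need other tools.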
So the solution, and therefore Big Data, is Not only Hadoop!