(or Hadoop for dummy architects like me)
I’m sure you’ve heard about Big Data. If not, I recommend my blog post “What is Big Data?”
The best-known technology for Big Data is Hadoop. It is used by Yahoo, eBay, LinkedIn and Facebook. It was inspired by Google’s publications on MapReduce, GoogleFS and BigTable. Because Hadoop runs on commodity hardware (usually Intel PCs running Linux, with one or two CPUs and a few TB of disk, without any RAID), it lets them store huge quantities of data (petabytes or even more) at very low cost compared to SAN storage arrays.
Hadoop is an open source suite hosted by the Apache Software Foundation: http://hadoop.apache.org/.
The Hadoop “brand” contains many different tools. Two of them are core parts of Hadoop:
- Hadoop Distributed File System (HDFS) is a virtual file system that looks like any other file system, except that when you move a file onto HDFS, the file is split into many smaller pieces (blocks), and each block is replicated and stored on 3 servers (the usual default, which can be customized) for fault tolerance.
- Hadoop MapReduce is a way to split every request into smaller requests that are sent to many small servers, allowing a truly scalable use of CPU power (describing MapReduce would deserve a dedicated post).
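To make these two core parts a bit more concrete, here is a minimal sketch in plain Python (no Hadoop involved). It cuts a "file" into blocks, assigns each block to 3 servers, then counts words with a map step per block and a reduce step that merges the results. Everything in it is an assumption made for illustration: the function names, the server names `s1` to `s4`, the block size counted in lines, and above all the round-robin placement, which is not how HDFS actually picks servers.

```python
from collections import Counter

REPLICATION = 3  # usual HDFS default, per the text above


def split_into_blocks(lines, lines_per_block):
    """Cut a 'file' (a list of lines) into fixed-size blocks."""
    return [lines[i:i + lines_per_block]
            for i in range(0, len(lines), lines_per_block)]


def place_replicas(num_blocks, servers, replication=REPLICATION):
    """Assign each block to `replication` distinct servers, round-robin.

    Assumption: round-robin is only for illustration; it is not how
    HDFS actually chooses servers.
    """
    if len(servers) < replication:
        raise ValueError("need at least as many servers as replicas")
    return {
        block_id: [servers[(block_id + k) % len(servers)]
                   for k in range(replication)]
        for block_id in range(num_blocks)
    }


def map_block(block):
    """Map step: count the words of one block, independently of the others."""
    return Counter(word for line in block for word in line.split())


def reduce_counts(partial_counts):
    """Reduce step: merge the per-block counts into one result."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


if __name__ == "__main__":
    lines = ["big data", "hadoop stores big data", "hadoop scales"]
    blocks = split_into_blocks(lines, lines_per_block=2)
    print(place_replicas(len(blocks), ["s1", "s2", "s3", "s4"]))
    print(reduce_counts(map_block(b) for b in blocks))
```

The point of the sketch is that `map_block` only ever looks at one block, so each call could run on a different server that already holds a copy of that block. Only the small per-block counts need to travel to the reduce step.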
Some other components are often installed on Hadoop solutions:
- HBase is inspired by Google’s BigTable. HBase is a non-relational, scalable, and fault-tolerant database that is layered on top of HDFS. HBase is written in Java. Each row is identified by a key and consists of an arbitrary number of columns that can be grouped into column families.
- ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. ZooKeeper is used by HBase, and can be used by MapReduce programs.
- Solr / Lucene as the search engine. This search library has been developed at Apache for more than 10 years.
- Languages. Two languages are identified as the original Hadoop languages: Pig and Hive. You can use them, for instance, to write MapReduce jobs at a higher level than hand-coded MapReduce procedures. Other languages may be used, like C, Java or JAQL. SQL can be used too, through JDBC or ODBC connectors (or directly in those languages).
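The HBase data model from the list above (a row key, column families, and an arbitrary number of columns inside each family) can be pictured as nested maps. Here is a small sketch of that idea in plain Python. It does not use the HBase API: `TinyTable` and its `put`, `get` and `scan` methods are hypothetical names chosen for this example, and real HBase features such as cell versions are left out.

```python
class TinyTable:
    """Toy model: row key -> column family -> column -> value."""

    def __init__(self, families):
        # Column families are fixed when the table is created;
        # the columns inside a family are free-form.
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, family, column, value):
        """Store one cell, creating the row on first use."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {}).setdefault(family, {})[column] = value

    def get(self, row_key, family=None):
        """Return a whole row, or only one of its column families."""
        row = self.rows.get(row_key, {})
        return row if family is None else row.get(family, {})

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in key order."""
        return [(key, self.rows[key])
                for key in sorted(self.rows) if start <= key < stop]


if __name__ == "__main__":
    table = TinyTable(["info", "stats"])
    table.put("user1", "info", "name", "Ann")
    table.put("user1", "stats", "visits", 3)
    table.put("user2", "info", "name", "Bob")
    table.put("user2", "info", "city", "Paris")  # user1 has no "city" column
    print(table.get("user2", "info"))
    print(table.scan("user1", "user2"))
```

Notice that two rows of the same table do not need the same columns, which is the main difference from a relational table with a fixed schema.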
Even though the best-known Hadoop distribution comes from a very specialized player named Cloudera (others come from MapR, Hortonworks and of course Apache itself), the big vendors are positioning themselves on this technology:
- IBM has BigInsights (the Cloudera distribution plus GPFS, IBM’s own file system offered as an alternative to HDFS) and has recently acquired many niche players in the analytics and big data market (like Platform Computing, which has a product that enhances the capabilities and performance of MapReduce).
- Oracle has launched the Big Data Appliance. Also based on Cloudera, this machine is dedicated to storing and using unstructured content (structured content stays on Exadata).
- Informatica has a tool called HParser that complements PowerCenter. It is built to run Informatica processing in MapReduce mode, distributed across the Hadoop servers.
- Microsoft has a dedicated Hadoop version, supported by Apache, for Microsoft Windows and for Azure (their cloud solution), with tight native integration with SQL Server 2012.
- Some very large database solutions like EMC Greenplum (partnering with MapR), HP Vertica (partnering with Cloudera), Teradata Aster Data (partnering with Hortonworks) or SAP Sybase IQ can connect directly to HDFS.
… and if you want to know more about other Big Data solutions, here is a blog post listing all big data vendors and technologies.