What is Hadoop?



(or Hadoop for dummy architects like me)

I’m sure you’ve heard about Big Data. If not, I recommend my blog post “What is Big Data?”

The best-known technology used for Big Data is Hadoop. It is used by Yahoo, eBay, LinkedIn and Facebook. It was inspired by Google’s publications on MapReduce, GoogleFS and BigTable. Because Hadoop can run on commodity hardware (usually Intel PCs running Linux, with one or two CPUs and a few TB of hard disk, without any RAID replication technology), it allows them to store huge quantities of data (petabytes or even more) at very low cost (compared to SAN storage systems).

Hadoop is an open source suite hosted by the Apache Software Foundation: http://hadoop.apache.org/.

The Hadoop “brand” contains many different tools. Two of them are core parts of Hadoop:

  • Hadoop Distributed File System (HDFS) is a virtual file system that looks like any other file system, except that when you move a file to HDFS, the file is split into many blocks, and each block is replicated and stored on several servers (3 by default, but this can be customized) for fault tolerance.
  • Hadoop MapReduce is a way to split every request into smaller requests which are sent to many small servers, allowing a truly scalable use of CPU power (describing MapReduce would be worth a dedicated post).
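To make the two core ideas above a bit more concrete, here is a minimal sketch in plain Python. It is not Hadoop code and uses no Hadoop API: the function names, the node names and the round-robin replica placement are all assumptions made for illustration (real HDFS splits files by size into large blocks and uses a smarter placement policy). It only shows the principle: split the data into blocks, keep 3 copies of each block, then count words with a map step per block and a reduce step per word.

```python
from collections import defaultdict


def split_into_blocks(lines, block_size):
    """Split a list of lines into blocks of at most block_size lines."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (toy round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def map_block(block):
    """Map step: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_word(word, values):
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(values)


def word_count(lines, block_size=2):
    """Run the toy pipeline: split, map each block, shuffle, reduce."""
    pairs = []
    # On a real cluster, each block would be mapped on a server holding a replica.
    for block in split_into_blocks(lines, block_size):
        pairs.extend(map_block(block))
    return dict(reduce_word(word, values) for word, values in shuffle(pairs).items())
```

Calling `place_replicas(2, ["n1", "n2", "n3", "n4"])` shows which hypothetical servers would hold each block, and `word_count(lines)` returns a dictionary of word totals. Here everything runs in one process; the point of Hadoop is that the map calls run in parallel on the machines that already store the blocks.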

Some other components are often installed on Hadoop solutions:

  • HBase is inspired by Google’s BigTable. HBase is a non-relational, scalable, and fault-tolerant database that is layered on top of HDFS. HBase is written in Java. Each row is identified by a key and consists of an arbitrary number of columns that can be grouped into column families.
  • ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. ZooKeeper is used by HBase, and can be used by MapReduce programs.
  • Solr / Lucene as a search engine. This search library has been developed at Apache for more than 10 years.
  • Languages. Two languages are identified as original Hadoop languages: Pig and Hive. For instance, you can use them to write MapReduce processes at a higher level than hand-coded MapReduce procedures. Other languages may be used, like C, Java or JAQL. Through JDBC or ODBC connectors (or directly in the languages), SQL can be used too.
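The HBase row model mentioned in the list above (a row key, then columns grouped into column families) can be pictured as nested dictionaries. The sketch below is plain Python, not the HBase client API: the `ToyTable` class and its `put` and `get` methods are hypothetical names chosen for illustration, and it ignores versions, timestamps and distribution entirely.

```python
class ToyTable:
    """Toy model of an HBase-like table: row key -> column family -> column -> value."""

    def __init__(self, column_families):
        # Column families are fixed when the table is created.
        self.column_families = set(column_families)
        self.rows = {}

    def put(self, row_key, family, column, value):
        """Store one cell; columns inside a family can be added freely."""
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {}).setdefault(family, {})[column] = value

    def get(self, row_key, family=None):
        """Return a whole row, or only one column family of that row."""
        row = self.rows.get(row_key, {})
        if family is None:
            return row
        return row.get(family, {})


# Two rows with different columns: rows do not need to share a schema.
table = ToyTable(["info", "stats"])
table.put("user1", "info", "name", "Alice")
table.put("user1", "stats", "logins", 12)
table.put("user2", "info", "email", "bob@example.com")
```

The thing to notice is that each row can hold a different set of columns, which is what "an arbitrary number of columns" means in practice, while the column families stay fixed for the whole table.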

 

 

[Figure: Hadoop Architecture]

 

Although the best-known Hadoop distribution is provided by a very specialized vendor named Cloudera (others come from MapR, HortonWorks, and of course Apache), big vendors are positioning themselves on this technology:

  • IBM has BigInsights (Cloudera distribution plus their own custom version of Hadoop called GPFS) and has recently acquired many niche players in the analytics and big data market (like Platform Computing, which has a product that enhances the capabilities and performance of MapReduce).
  • Oracle has launched its Big Data machine. Also based on Cloudera, this server is dedicated to the storage and use of unstructured content (structured content stays on Exadata).
  • Informatica has a tool called HParser to complement PowerCenter. This tool is built to launch Informatica processes in MapReduce mode, distributed across the Hadoop servers.
  • Microsoft has a dedicated Hadoop version, supported by Apache, for Microsoft Windows and for Azure (their cloud solution), with strong native integration with SQL Server 2012.
  • Some very large database solutions like EMC Greenplum (partnering with MapR), HP Vertica (partnering with Cloudera), Teradata Aster Data (partnering with HortonWorks) or SAP Sybase IQ are able to connect directly to HDFS.

Now that you know what Hadoop is, take a look at what No Hadoop is…

… and if you want to know more about other Big Data solutions, here is a blog post listing all Big Data vendors and technologies.
