The Hadoop Ecosystem at a Glance

Hadoop is a large-scale, distributed, scalable batch-processing framework for Big Data (at the petabyte level) that can run on clusters of thousands of servers.

The Hadoop ecosystem has kept growing over the past few years, and a flood of terminology around new tools and frameworks has emerged. Many organizations are also researching and innovating on top of Hadoop, making it better and easier to use. Based on several weeks of dedicated research, the author provides a mind map of the Hadoop ecosystem that can help us grasp its overall landscape.

Original English article: Hadoop Ecosystem at a glance

Core

  • HDFS: The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on a cluster of commodity machines. HDFS is highly fault tolerant and is useful for processing large data sets. A MapReduce job, described below, typically processes data stored in HDFS. Files in HDFS are split into blocks, typically 64 MB or 128 MB, and stored across nodes in the cluster. Each block of data is also replicated to multiple nodes (three by default) to avoid data loss in case of a node failure. (A minimal HDFS API sketch appears after this list.)
  • MapReduce: MapReduce is a software framework for processing large data sets, at petabyte scale, on a cluster of commodity hardware. When a MapReduce job is run, Hadoop splits the input and locates the nodes in the cluster that hold the data. The tasks are then run on or close to the nodes where the data resides, keeping the computation as close to the data as possible. This avoids transferring huge amounts of data across the network, so the network does not become a bottleneck or get flooded. (See the word-count sketch after this list.)
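
To make the HDFS description concrete, here is a minimal sketch of writing a file through the HDFS Java client API and reading back its replication factor. The NameNode address and the file path are hypothetical placeholders; only the standard org.apache.hadoop.fs.FileSystem API is assumed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the NameNode (hypothetical host and port)
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits large files into blocks behind the scenes
        Path path = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("hello HDFS");
        }

        // Each block is replicated across nodes; the factor is typically three by default
        System.out.println("Replication factor: " + fs.getFileStatus(path).getReplication());
        fs.close();
    }
}
```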
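
And to illustrate MapReduce itself, below is the classic word-count job, closely following the canonical example from the Apache Hadoop documentation. The input and output paths are passed on the command line and would typically point into HDFS.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on map nodes
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```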

Distributions

  • Apache: The purely open-source distribution of Hadoop, maintained by the community at the Apache Software Foundation.
  • Cloudera: Cloudera’s distribution of Hadoop that is built on top of Apache Hadoop. The distribution includes capabilities such as management, security, high availability and integration with a wide variety of hardware and software solutions. Cloudera is the leading distributor of Hadoop.
  • Hortonworks: This also builds on open-source Apache Hadoop, with claims to enterprise readiness. It also claims to be the only distribution available for Windows servers.
  • MapR: Hadoop distribution with some unique features, most notably the ability to mount the Hadoop cluster over NFS.
  • Intel: Intel’s open-source distribution of Hadoop.
  • Greenplum: Greenplum’s distribution is called Pivotal HD. One of the highlights of this distribution is a SQL-based database engine on Hadoop that allows querying of data in Hadoop using SQL.
  • Amazon EMR: Amazon’s hosted version of MapReduce is called Elastic MapReduce (EMR) and is part of Amazon Web Services (AWS). EMR allows a Hadoop cluster to be deployed and MapReduce jobs to be run in the AWS cloud with just a few clicks.

Related Projects

  • Avro: Avro is a data serialization framework that is useful in Hadoop and other systems. The framework lets one define a language-independent schema so that data can be interchanged between programs written in different languages. For example, a Hadoop client written in another language can use Avro as the data serialization framework to communicate with the Hadoop server, which is written in Java. (A minimal Avro sketch appears after this list.)
  • Pig: A framework for analyzing large data sets using a high-level language called Pig Latin. Scripts written in Pig Latin are compiled by the framework into MapReduce jobs, which are run on the Hadoop cluster. Pig eases development of MapReduce jobs: a set of MapReduce jobs that may take hundreds of lines of code can be written in just a few lines of Pig Latin. At Yahoo!, more than 60% of Hadoop usage is through Pig. (See the Pig sketch after this list.)
  • Hive: Hive is a data warehouse framework that supports querying of large data sets stored in Hadoop. To do this, Hive provides a high-level, SQL-like language called HiveQL. Traditional MapReduce programs can be plugged into HiveQL where it is more efficient to use them instead of HiveQL. (See the Hive JDBC sketch after this list.)
  • HBase: HBase is a scalable, distributed data store built on top of Hadoop. It is a versioned, column-oriented database modeled after Google’s BigTable. (See the HBase sketch after this list.)
  • Mahout: Mahout is a scalable machine learning library. Mahout utilizes Hadoop to achieve massive scalability.
  • YARN: YARN is the next generation of MapReduce, a.k.a. MapReduce 2. The MapReduce framework was overhauled in YARN to overcome the scalability bottlenecks of the earlier version of MapReduce when run on very large clusters (thousands of nodes).
  • Oozie: In a real-life scenario, a MapReduce deployment typically involves running a sequence of MapReduce and other pre- and post-processing jobs at scheduled times or based on data availability. Oozie is a workflow scheduler system that eases the creation and management of these workflows. A workflow is defined in XML and can perform HDFS operations, run MapReduce jobs, Pig scripts, and streaming jobs, and supports branching, chaining, etc. (See the Oozie client sketch after this list.)
  • Flume: A distributed, reliable, and available service for collecting, aggregating, and moving log data to HDFS. This is typically useful in systems where log data needs to be moved to HDFS periodically for processing.
  • Sqoop: Sqoop is designed for transferring data between Hadoop and relational databases.
  • Cascading: Application framework for building applications using Hadoop.
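
To illustrate the Avro entry above, here is a minimal sketch of defining a language-independent schema in JSON and serializing a record to Avro's binary encoding; a reader written in any other supported language could decode the bytes using the same schema. The User record is purely illustrative.

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // The schema is plain JSON, so it is not tied to any programming language
        // (the User record here is purely illustrative)
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Serialize to Avro's compact binary encoding; any language with
        // an Avro implementation can decode this using the same schema
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();

        System.out.println("Serialized " + out.size() + " bytes");
    }
}
```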
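
The Pig entry's claim that a few lines of Pig Latin replace hundreds of lines of MapReduce code can be seen in this sketch, which embeds a word-count script in Java through Pig's PigServer API; the input and output paths are hypothetical.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        // Run the Pig Latin statements as MapReduce jobs on the cluster
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Word count in four lines of Pig Latin (hypothetical input path)
        pig.registerQuery("lines = LOAD '/logs/access.log' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // Triggers compilation into MapReduce jobs and writes the result to HDFS
        pig.store("counts", "/output/wordcounts");
    }
}
```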
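
For Hive, here is a minimal sketch of running a HiveQL query from Java over the HiveServer2 JDBC driver; the host name, credentials, and the access_logs table are hypothetical, and Hive compiles the query into MapReduce jobs behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, database, and table are hypothetical
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Plain HiveQL; Hive turns this into one or more MapReduce jobs
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```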
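
For HBase, a sketch of its column-oriented data model using the classic HTable client API of that era (later releases replaced it with Connection/Table): values are addressed by row key, column family, and column qualifier. The users table with an info column family is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Assumes a 'users' table with an 'info' column family already exists
        HTable table = new HTable(conf, "users");

        // Write one cell: row key -> column family:qualifier -> value
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
        table.put(put);

        // Read it back by row key
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
        System.out.println("name = " + Bytes.toString(value));

        table.close();
    }
}
```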
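
And for Oozie, a sketch of submitting a workflow through the Oozie Java client; the Oozie server URL and the HDFS application path, which holds the workflow's XML definition, are hypothetical placeholders.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;

public class OozieSubmit {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oozie server URL
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // HDFS directory holding the workflow.xml definition (hypothetical path)
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/me/my-workflow");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "jobtracker:8021");

        // Submit and start the workflow; Oozie schedules the actions it contains
        String jobId = oozie.run(conf);
        System.out.println("Workflow job submitted: " + jobId);
    }
}
```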

Related Technologies

Below is a list of related Big Data technologies that follow architectures different from Hadoop’s.

  • Twitter Storm: As opposed to Hadoop, which is a batch-processing system, Storm is a distributed real-time processing system developed by Twitter. Storm is fast, scalable, and easy to use.
  • HPCC: High Performance Computing Cluster (HPCC) is an MPP (Massively Parallel Processing) computing platform for solving Big Data problems. It follows an architecture different from Hadoop’s and boasts a few differences and advantages over Hadoop.
  • Dremel: A scalable, interactive, ad-hoc query system from Google for the analysis of read-only nested data. Google’s BigQuery service is reportedly based on Dremel.

This is not an exhaustive list, and there may be many other projects, tools, and organizations related to Hadoop; I have tried to touch upon the most popular ones here.

In future posts, I wish to talk in detail about some of the projects in the list above that are less talked about but can be really useful in the day-to-day running and management of a Hadoop cluster.