Category: Hadoop

Hadoop Ecosystem Overview

7/11/2016

Before you can find your way around the Hadoop environment, it is important to identify and learn about the key players. In this post I will provide an overview of the applications, tools and interfaces currently available in the Hadoop ecosystem. I will categorize each product by its functionality (Storage, Processing, Querying, External Integration & Coordination) and provide a description along with its architecture, uses and resource links. Keep in mind that within a few months or years some of these products may become obsolete and be replaced by others.

1. Storage

HDFS
The primary distributed file system used by Hadoop applications, designed to run on large clusters of commodity machines. An HDFS cluster consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.

Architecture:
HDFS Architecture

Uses:
-Storage of large imported files from applications outside of the Hadoop ecosystem
-Staging of imported files to be processed by Hadoop applications

Resource:
http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
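
As a rough illustration of the NameNode/DataNode split described above, the toy model below (plain Python, not the HDFS API) cuts a file into fixed-size blocks and records which DataNodes hold each replica. The 128 MB block size and replication factor of 3 are the usual Hadoop 2 defaults; the function name, the node names and the round-robin placement are assumptions made for this sketch, and real HDFS placement is rack-aware.

```python
import math

BLOCK_SIZE_MB = 128      # common Hadoop 2 default (dfs.blocksize)
REPLICATION = 3          # common default (dfs.replication)


def plan_blocks(file_size_mb, datanodes, block_size_mb=BLOCK_SIZE_MB,
                replication=REPLICATION):
    """Return a NameNode-style block map: block index -> list of DataNodes.

    Placement is simple round-robin, an assumption for illustration only.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    block_map = {}
    for block in range(num_blocks):
        # Each replica of a block goes to a different DataNode.
        block_map[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return block_map


nodes = ["dn1", "dn2", "dn3", "dn4"]       # hypothetical DataNode names
block_map = plan_blocks(300, nodes)        # a 300 MB file needs 3 blocks
for block, holders in block_map.items():
    print(f"block {block}: {holders}")
```

The block map is the kind of metadata the NameNode keeps; the block contents themselves would live only on the listed DataNodes.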

HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).

Architecture:
HBase Architecture

Uses:
-Storage of large data volumes (billions of rows) atop clusters of commodity hardware
-Bulk storage of logs, documents, real-time activity feeds and raw imported data
-Consistent performance of reads/writes to data used by Hadoop applications
-Data store that can be aggregated or processed using MapReduce functionality
-Data platform for Analytics and Machine Learning

Resource:
http://hbase.apache.org/book/architecture.html
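
To make the column-oriented data model and the idea of point queries more concrete, here is a minimal in-memory sketch. It is not the HBase client API: the ToyTable class, the row keys and the column names are all invented for illustration. It only mirrors two properties of HBase, namely that rows are kept sorted by row key and that each row holds values addressed as family:qualifier.

```python
import bisect


class ToyTable:
    """In-memory stand-in for an HBase-style table (not the HBase API)."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order
        self._rows = {}   # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Write one cell, creating the row if it does not exist yet."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Point query: fetch one row by its key."""
        return self._rows.get(row_key, {})

    def scan(self, start, stop):
        """Range scan over sorted row keys, start inclusive, stop exclusive."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


table = ToyTable()
table.put("user#001", "info:name", "Ada")
table.put("user#002", "info:name", "Grace")
table.put("user#001", "activity:last_login", "2016-07-11")
table.put("video#001", "info:title", "Intro")

print(table.get("user#001"))
print(table.scan("user#000", "user#999"))
```

Because the keys are sorted, a scan over a key prefix touches only neighbouring rows, which is why row key design matters so much in practice.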

HCatalog
A table and storage management layer for Hadoop that enables Hadoop applications (Pig, MapReduce, and Hive) to read and write data in tabular form rather than as files. It also provides REST APIs so that external systems can access the metadata of these tables.

Architecture:
HCatalog Architecture

Uses:
-Centralized location of storage for data used by Hadoop applications
-Reusable data store for sequenced and iterated Hadoop processes (ex: ETL)
-Storage of data in a relational abstraction

Resource:
https://cwiki.apache.org/confluence/display/Hive/HCatalog

2. Processing

MapReduce
A distributed data processing model and execution environment that runs on large clusters of commodity machines. It uses the MapReduce programming model, which breaks every operation down into Map and Reduce functions.

Architecture:
MapReduce Architecture

Uses:
-Aggregation (Counting, Sorting, Filtering, Stitching) on large and disparate data sets
-Scalable parallelism of Map or Reduce tasks
-Distributed task execution
-Machine learning

Resource:
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Purpose
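
The Map and Reduce split described above can be sketched in a few lines of plain Python. This is a single-process word count, not Hadoop code: the function names and sample lines are made up, and the shuffle step that Hadoop performs across the cluster is simulated with a dictionary.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


lines = ["Hadoop stores data", "Hadoop processes data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

On a cluster, each call to map_phase and reduce_phase could run on a different machine, which is where the scalable parallelism comes from.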

Pig
A scripting language and execution environment for creating complex MapReduce transformations. Scripts are written in Pig Latin, a data-flow language with SQL-like operators, and translated into executable MapReduce jobs. Pig also lets the user write user-defined functions (UDFs) in Java.

Architecture:
Pig Architecture

Uses:
-Scripting environment to execute ETL tasks/procedures on raw data in HDFS
-SQL-like language for creating and running complex MapReduce functions
-Data processing, stitching, schematizing on large and disparate data sets

Resource:
http://pig.apache.org/docs/r0.12.1/index.html
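
A Pig script is a sequence of named transformation steps. The sketch below shows a small Pig Latin script in the comments and then walks through the same steps in plain Python so the data flow is visible. The file name, field names and sample records are hypothetical, and the Python is only a local imitation of what Pig would compile into MapReduce jobs.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical Pig Latin script that this sketch imitates:
#   logs    = LOAD 'logs.txt' USING PigStorage(',') AS (user:chararray, bytes:int);
#   big     = FILTER logs BY bytes > 100;
#   grouped = GROUP big BY user;
#   totals  = FOREACH grouped GENERATE group, SUM(big.bytes);

# LOAD: sample (user, bytes) records standing in for the input file.
records = [("ann", 120), ("bob", 80), ("ann", 300), ("bob", 150)]

# FILTER: keep only records with more than 100 bytes.
big = [r for r in records if r[1] > 100]

# GROUP: groupby needs its input sorted by the grouping key.
grouped = groupby(sorted(big, key=itemgetter(0)), key=itemgetter(0))

# FOREACH ... GENERATE: sum the bytes within each group.
totals = {user: sum(b for _, b in rows) for user, rows in grouped}
print(totals)
```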

3. Querying

Hive
A distributed data warehouse built on top of HDFS to manage and organize large amounts of data. Hive provides a query language based on SQL semantics (HiveQL) which is translated by the runtime engine to MapReduce jobs for querying the data.

Architecture:
Hive Architecture

Uses:
-Schematized data store for housing large amounts of raw data
-SQL-like Environment to execute analysis and querying tasks on raw data in HDFS
-Integration with outside RDBMS applications

Resource:
https://cwiki.apache.org/confluence/display/Hive/Home#Home-ApacheHive
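
To give a feel for the SQL-style querying described above, the sketch below runs an aggregate query of the kind one would typically write in HiveQL. It uses Python's built-in SQLite module purely as a local stand-in, so nothing here touches Hive, HDFS or MapReduce. The table name, columns and rows are invented for the example.

```python
import sqlite3

# SQLite in memory stands in for a Hive table; this is an assumption
# made so the example can run anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_views (user_id TEXT, url TEXT, view_date TEXT)"
)
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [
        ("u1", "/home", "2016-07-10"),
        ("u2", "/home", "2016-07-10"),
        ("u1", "/docs", "2016-07-11"),
    ],
)

# A grouped count: in Hive, a query of this shape would be planned
# as one or more MapReduce jobs rather than executed locally.
query = """
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC, url
"""
top_pages = conn.execute(query).fetchall()
print(top_pages)
conn.close()
```

The point of the comparison is the workflow: declare a schema over the data, then ask questions in SQL instead of writing Map and Reduce functions by hand.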

4. External integration

Flume
A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data into HDFS. Flume transports large quantities of event data using a streaming data flow architecture that is fault tolerant and supports failover and recovery.

Architecture:
Flume Architecture

Uses:
-Transportation of large amounts of event data (network traffic, logs, email messages)
-Stream data from multiple sources into HDFS
-Guaranteed and reliable real-time data streaming to Hadoop applications

Resource:
http://flume.apache.org/FlumeUserGuide.html
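
Flume moves events from a source, through a buffering channel, to a sink, and an event leaves the channel only after the sink has accepted it. The toy model below imitates that hand-off in plain Python. It is not Flume configuration or the Flume API: the Channel class, the drain function, the flaky sink and the sample events are all assumptions made for the sketch.

```python
from collections import deque


class Channel:
    """Buffer between source and sink; events stay until delivery succeeds."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def peek(self):
        return self._events[0] if self._events else None

    def commit(self):
        """Drop the oldest event once the sink has confirmed it."""
        self._events.popleft()

    def __len__(self):
        return len(self._events)


def drain(channel, sink):
    """Deliver events in order; stop at the first failure so nothing is lost."""
    delivered = 0
    while len(channel):
        if not sink(channel.peek()):
            break
        channel.commit()
        delivered += 1
    return delivered


store = []              # stands in for HDFS
attempts = {"n": 0}


def flaky_sink(event):
    """Sink that fails on its second call to simulate a write error."""
    attempts["n"] += 1
    if attempts["n"] == 2:
        return False
    store.append(event)
    return True


channel = Channel()
for line in ["GET /home", "GET /docs", "POST /login"]:
    channel.put(line)   # the source side: push incoming log lines

first = drain(channel, flaky_sink)    # stops at the simulated failure
second = drain(channel, flaky_sink)   # the retry delivers the rest
print(first, second, store)
```

The failed event is never removed from the channel, so the retry picks up exactly where the first attempt stopped.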


Big Data Landscape 2016

2/9/2016


I just came across this interesting visualization of the top tools and technologies for 2016, broken out by functionality. I hope it gives you a good understanding of the key players in each space. Use it as a template for exploring new tools and technologies. I expect that employers will start requiring skills with some of these tools real soon. I will try to post some interesting articles based on this image. If there is something specific you want me to cover, please mention it in the comments.


Hadoop example hardware

9/22/2014


Hardware architecture

Given the diversity of uses for Hadoop, it can be difficult for organizations to settle on which commodity hardware to use for a Hadoop cluster. This post provides an overview of the hardware architecture as well as examples of hardware that can be, and is, used. Every installation and use case is different, so the hardware specified in this post should be taken only as examples.

Slave Nodes

Slave nodes make up the vast majority of machines and do all the dirty work of storing the data and running the computations. Each slave runs both a DataNode and a TaskTracker daemon, which communicate with and receive instructions from the master nodes.

Listed below is an example of mid-grade hardware for slave nodes.

Processors: Mid-grade processors (ex: 2 x 6-core 3 GHz)
Memory: 48-96GB RAM
Network: 1Gb Ethernet
Drives: 6 x 2TB drives per node (Non-RAID)
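
The drive figures above translate into usable HDFS capacity only after replication and working space are taken into account. The helper below is a back-of-the-envelope sketch: the function name is made up, and the replication factor of 3, the 25% of disk reserved for temporary and non-HDFS data, and the 10-node cluster are assumptions, not figures from this post.

```python
def usable_capacity_tb(nodes, drives_per_node=6, drive_tb=2.0,
                       replication=3, reserved_fraction=0.25):
    """Estimate usable HDFS capacity in TB for a set of slave nodes.

    replication and reserved_fraction are assumed values; adjust them
    to match the actual cluster configuration.
    """
    raw_tb = nodes * drives_per_node * drive_tb
    available_tb = raw_tb * (1 - reserved_fraction)
    # Every block is stored `replication` times, so divide by that factor.
    return available_tb / replication


# Ten slave nodes with 6 x 2TB drives each: 120 TB raw.
print(usable_capacity_tb(10))
```

With these assumptions, 120 TB of raw disk yields roughly 30 TB of usable space, which is worth keeping in mind when sizing a cluster from per-node drive counts.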

Master Nodes

The master nodes oversee the two key functional pieces that make up Hadoop: storing lots of data (HDFS) and running parallel computations on all that data (MapReduce). The NameNode oversees and coordinates the data storage function (HDFS), while the JobTracker oversees and coordinates the parallel processing of data using MapReduce.

Listed below is an example of mid-grade hardware for master nodes.

Processors: Mid-grade processors (ex: 2 x 6-core 3 GHz)
Memory: 64-128GB RAM
Network: 10Gb Ethernet
Drives: 12 x 3TB drives per node (RAID)


Getting Started With Big Data

3/17/2014


When I first started on my journey, I was overwhelmed by all of the terminology and concepts associated with Big Data. I knew that I wanted to start learning, but I had no idea where to begin. At the time there were no introductory resources I could use as a starting point. In this space it is extremely important to grasp the fundamentals before proceeding further.

For your convenience, I have gathered a collection of useful, visual and easy-to-understand resources to help you get started. The resources below cover fundamentals and concepts, sandbox environments and tutorials. Together, they give individuals at any level of understanding a place to start their Big Data learning journey. Please feel free to add this to your getting-started toolkit, because you will refer back to these resources occasionally.

My advice as you learn from the resources below is to pace yourself and take one step at a time. Make sure that you understand the concepts before proceeding to the sandboxes and tutorials. Please feel free to reach out with any questions that you may have, and I will do my best to provide an answer.

Good Luck!

Understanding Big Data Concepts:
What is NoSQL
Links:
- http://youtu.be/qUV2j3XBRHc
- http://youtu.be/pHAItWE7QMU
- http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQL-Whitepaper.pdf

What is Hadoop
Links:
- http://youtu.be/3Wmdy80QOvw
- http://youtu.be/xYnS9PQRXTg
- http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf

What is Big Data
Links:
- http://youtu.be/j-0cUmUyb-Y
- http://youtu.be/ahZGEusG13A
- http://www.slideshare.net/remyavivek/big-data-ppt-23276173

Big Data Sandboxes:
Hortonworks
Link:(http://hortonworks.com/products/hortonworks-sandbox/)

Cloudera
Link:(http://www.cloudera.com/content/support/en/downloads/quickstart_vms/cdh-5-0-x.html)

Tutorials:
Hortonworks Hadoop, Hive and Pig tutorials
Link:(http://hortonworks.com/tutorials/)

Cloudera Hadoop and MapReduce tutorials
Link:(http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial.html)

Spark tutorials
Link:(https://spark.apache.org/documentation.html)

Flume tutorials
Link:(http://www.openscg.com/2013/09/using-hadoop-to-flume-twitter-data/)

Hive tutorials
Link:(https://cwiki.apache.org/confluence/display/Hive/Tutorial)

Sample Data Resources:
Tableau Data Sets
Link:(http://www.tableausoftware.com/public/community/sample-data-sets)

InfoChimps Data Sets
Link:(http://www.infochimps.com/datasets)

World Health Organization Data Sets
Link:(http://www.who.int/research/en/)

Sports Data Sets
Link:(http://www.amstat.org/sections/sis/sports%20data%20resources/)

Miscellaneous Data Sets
Link:(http://mathforum.org/workshops/sum96/data.collections/datalibrary/data.set6.html)

