The Big Data Blog

Hadoop Ecosystem Overview

7/11/2016

Before you can navigate the Hadoop environment, it is important to identify and learn about the key players. In this post I will provide an overview of the applications, tools and interfaces currently available in the Hadoop ecosystem. I will categorize each product by its functionality (Storage, Processing, Querying, External Integration and Coordination) and provide a description along with its architecture, uses and resource links. It is important to note that in a few months or years some of these products may become obsolete or be replaced by others.

1. Storage

HDFS
The primary distributed file system used by Hadoop applications which runs on large clusters of commodity machines. HDFS clusters consist of a NameNode that manages the file system metadata and DataNodes that store the actual data.

Architecture:

[Diagram: HDFS Architecture]

Uses:
-Storage of large imported files from applications outside of the Hadoop ecosystem
-Staging of imported files to be processed by Hadoop applications

Resource:

http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
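
To make the staging use concrete, here is a minimal Java sketch using the Hadoop FileSystem API to copy a local file into HDFS. The NameNode address and file paths are hypothetical placeholders; in practice the address comes from your cluster's core-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsStageFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; normally read from core-site.xml
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);
            // Stage a local file into HDFS for downstream Hadoop processing
            fs.copyFromLocalFile(new Path("/tmp/events.log"),
                                 new Path("/staging/events.log"));
            fs.close();
        }
    }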

HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).

Architecture:

[Diagram: HBase Architecture]

Uses:
-Storage of large data volumes (billions of rows) atop clusters of commodity hardware
-Bulk storage of logs, documents, real-time activity feeds and raw imported data
-Consistent performance of reads/writes to data used by Hadoop applications
-Data store that can be aggregated or processed using MapReduce functionality
-Data platform for Analytics and Machine Learning

Resource:

http://hbase.apache.org/book/architecture.html
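
As a rough illustration of the point-query side, here is a minimal Java sketch using the HBase client API (the Connection/Table style introduced in HBase 1.x) to write and then randomly read a single cell. The table name, column family and row key are made up for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePointQuery {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("activity_feed"))) {
                // Write one cell: row key -> column family "d", qualifier "event"
                Put put = new Put(Bytes.toBytes("user123#20160711"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("event"),
                              Bytes.toBytes("login"));
                table.put(put);
                // Point query (random read) of the same row
                Result result = table.get(new Get(Bytes.toBytes("user123#20160711")));
                System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("d"), Bytes.toBytes("event"))));
            }
        }
    }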

HCatalog
A table and storage management layer for Hadoop that enables Hadoop applications (Pig, MapReduce and Hive) to read and write data in tabular form rather than as files. It also provides REST APIs so that external systems can access these tables' metadata.

Architecture:

[Diagram: HCatalog Architecture]

Uses:
-Centralized location of storage for data used by Hadoop applications
-Reusable data store for sequenced and iterated Hadoop processes (ex: ETL)
-Storage of data in a relational abstraction

Resource:

https://cwiki.apache.org/confluence/display/Hive/HCatalog
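
The REST side is served by WebHCat. As a rough sketch of how an external system might read table metadata over HTTP (the host, database and table names here are placeholders, and 50111 is only the usual WebHCat default port):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class HCatalogMetadataFetch {
        public static void main(String[] args) throws Exception {
            // Hypothetical WebHCat endpoint: metadata for table "events" in database "default"
            URL url = new URL("http://webhcat-host:50111/templeton/v1/"
                    + "ddl/database/default/table/events?user.name=hadoop");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);  // JSON describing columns, location, etc.
                }
            }
        }
    }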

2. Processing

MapReduce
A distributed data processing model and execution environment that runs on large clusters of commodity machines. It uses the MapReduce algorithm, which breaks all operations down into map and reduce functions.

Architecture:
[Diagram: MapReduce Architecture]

Uses:
-Aggregation (Counting, Sorting, Filtering, Stitching) on large and disparate data sets
-Scalable parallelism of Map or Reduce tasks
-Distributed task execution
-Machine learning

Resource:
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Purpose
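
To make the model concrete, here is a lightly commented version of the classic word-count job using the newer org.apache.hadoop.mapreduce API (the tutorial linked above shows the same program written against the older mapred API): the map function emits a (word, 1) pair for every word, and the reduce function sums the counts for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: for every word in an input line, emit (word, 1)
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce: after the shuffle/sort groups pairs by word, sum the 1s
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }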

Pig
A high-level scripting language and execution environment for creating complex MapReduce transformations. Scripts are written in Pig Latin (the language) and translated into executable MapReduce jobs. Pig also allows users to create extended functions (UDFs) in Java; see the sketch below.

Architecture:
[Diagram: Pig Architecture]

Uses:
-Scripting environment for executing ETL tasks/procedures on raw data in HDFS
-High-level language for creating and running complex MapReduce transformations
-Data processing, stitching and schematizing on large and disparate data sets

Resource:
http://pig.apache.org/docs/r0.12.1/index.html
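
Since UDFs came up above: here is a minimal Java UDF following the standard EvalFunc pattern from the Pig documentation. It simply upper-cases a chararray field; the class name is made up for the example.

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // A trivial Pig UDF: returns its first argument upper-cased
    public class UpperCase extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;  // Pig treats null as a missing value
            }
            return ((String) input.get(0)).toUpperCase();
        }
    }

Once compiled into a jar, a Pig Latin script would load it with REGISTER and call it inside a FOREACH ... GENERATE, using the UDF's fully qualified class name.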


3. Querying

Hive
A distributed data warehouse built on top of HDFS to manage and organize large amounts of data. Hive provides a query language based on SQL semantics (HiveQL) which is translated by the runtime engine to MapReduce jobs for querying the data.

Architecture:
[Diagram: Hive Architecture]

Uses:
-Schematized data store for housing large amounts of raw data
-SQL-like Environment to execute analysis and querying tasks on raw data in HDFS
-Integration with outside RDBMS applications 

Resource:
https://cwiki.apache.org/confluence/display/Hive/Home#Home-ApacheHive
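
For the RDBMS-integration use, Hive exposes a standard JDBC interface through HiveServer2. A minimal Java sketch (the host, credentials and table are placeholders; 10000 is only the usual HiveServer2 default port):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Hypothetical HiveServer2 endpoint and credentials
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hadoop", "");
                 Statement stmt = conn.createStatement();
                 // The HiveQL below is translated into MapReduce jobs by the runtime engine
                 ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) FROM raw_logs GROUP BY page")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }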

4. External Integration

Flume
A distributed, reliable and highly available service for efficiently collecting, aggregating and moving large amounts of log data into HDFS. Flume transports large quantities of event data using a streaming data-flow architecture that is fault tolerant and supports failover and recovery.

Architecture:

[Diagram: Flume Architecture]

Uses:
-Transportation of large amounts of event data (network traffic, logs, email messages)
-Stream data from multiple sources into HDFS
-Guaranteed and reliable real-time data streaming to Hadoop applications

Resource:

http://flume.apache.org/FlumeUserGuide.html
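
On the producing side, an application can hand events to a Flume agent through the client SDK. A minimal Java sketch using the Avro RPC client described in the Flume developer guide (the agent hostname and port are placeholders; the agent would need an Avro source listening there):

    import java.nio.charset.Charset;
    import org.apache.flume.Event;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeEventSender {
        public static void main(String[] args) {
            // Hypothetical Flume agent with an Avro source on port 41414
            RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent", 41414);
            try {
                Event event = EventBuilder.withBody("app started",
                                                    Charset.forName("UTF-8"));
                client.append(event);  // deliver one event to the agent's channel
            } catch (Exception e) {
                e.printStackTrace();   // in production: reconnect and retry
            } finally {
                client.close();
            }
        }
    }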


Map Reduce Explained Using SQL

6/2/2016

One of the key components in the Hadoop framework is Map Reduce. It is a distributed data processing model and execution environment that runs on large clusters of commodity machines. It uses the MapReduce algorithm, which breaks all operations down into Map, Shuffle/Sort and Reduce functions. Coming from a non-programming background can make this a difficult concept to understand. A good analogy to use when trying to understand Map Reduce is to associate it with SQL-based concepts. Most of the world is familiar with relational databases and SQL, so putting Map Reduce in that context gives it a tangible definition.

Below I have broken out each of the components of Map Reduce and provided a SQL construct as an association. It is important to note that Map Reduce jobs and SQL queries have their similarities but can also serve very different purposes. As such, there are times when the SQL analogy cannot be used verbatim or may not align 100%.

Map (FROM, WHERE, UNION)

Maps are individual tasks that seek out and retrieve key-value pairs matching a given criterion. Although maps are functions, when using the SQL analogy it is easiest to think of them as the FROM, WHERE and UNION operations in a SQL query. Maps act to seek out data sets before any reducing, shuffling or sorting is performed. Likewise in SQL, the FROM, WHERE and UNION operations seek out data sets before any reducing (SUM, COUNT, MIN, etc.), shuffling or sorting (GROUP BY, ORDER BY) is performed.

Reduce (Aggregation)

Reduces are tasks that aggregate, summarize and reduce the shuffled and sorted outputs of the mappers. Using the SQL analogy, the reduce task is similar to the aggregate functions (SUM, COUNT, MIN, MAX, AVG, etc.) in a SQL query. Reducers act to condense a full data set into the smaller, more manageable output originally specified by the user or job. Likewise in SQL, the aggregate functions reduce the data across one or more tables into the smaller output specified by the query or user.

Shuffle & Sort (GROUP BY, ORDER BY)

Shuffles and sorts are tasks that group, sort and organize the outputs of the mappers. Using the SQL analogy, the shuffle and sort tasks closely resemble the GROUP BY and ORDER BY operations in a SQL query. Shuffles and sorts are key to organizing the data into the formats and orders that make the reduce functions easiest to perform. Likewise in SQL, the GROUP BY and ORDER BY operations organize and group data so that reducing (SUM, COUNT, MIN, etc.) functions can be performed easily.
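
Putting the three analogies together, here is a sketch of a Java mapper/reducer pair roughly equivalent to SELECT region, SUM(amount) FROM sales GROUP BY region. The "region,amount" input layout is an assumption for the example, and the job driver (which would mirror any standard MapReduce driver) is omitted. The shuffle/sort between the two classes is what plays the GROUP BY role.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map ~ FROM/WHERE: parse each "region,amount" line and emit (region, amount)
    class SalesMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            context.write(new Text(fields[0]),
                          new DoubleWritable(Double.parseDouble(fields[1])));
        }
    }

    // Reduce ~ SUM(): the shuffle/sort (~ GROUP BY) has already grouped
    // the amounts by region; the reducer just adds them up
    class SalesReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        public void reduce(Text region, Iterable<DoubleWritable> amounts, Context context)
                throws IOException, InterruptedException {
            double total = 0;
            for (DoubleWritable amount : amounts) {
                total += amount.get();
            }
            context.write(region, new DoubleWritable(total));
        }
    }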


Massively Parallel Processing (MPP) Appliances

5/16/2016


Overview

In today's world, data exceeds the processing capacity of conventional database systems: the data we need and use is generated too quickly for traditional storage and processing systems. To gain value from this rapidly generated data we need to adopt an alternative way of harnessing it. The most commonly adopted tool in the Hadoop/Big Data framework is MapReduce, a distributed data processing model and execution environment that runs on large clusters of commodity machines. Using the MapReduce algorithm, it breaks all operations down into map and reduce functions and distributes a processing or storage task across a series of compute nodes. Each node processes its data in parallel, allowing large and complex tasks to be completed in a fraction of the time.

Although MapReduce is a solid solution to the "Volume" problem of Big Data, there is still a strong need to keep data in a relational, SQL-accessible environment. Applications such as data warehouses, OLAP cubes, OLTP systems and Business Intelligence platforms still drive big demand in enterprises all over the world, and for these use cases the traditional tools are not able to keep up with the volumes of data. For these applications, Massively Parallel Processing (MPP) appliances have been created. MPP appliances provide parallel and distributed processing across an integrated set of servers and storage. They also integrate with relational DBMSs and Business Intelligence/Data Warehousing tools to provide a SQL interface and store data in relational form. The appliance package provides the ability to scale performance, storage and memory by adding servers, and it arrives at your data center pre-configured for your networking environment, which means there is no need to manage disk systems, software configuration, hardware configuration or optimization.

Although many of these appliances are growing in demand and popularity, the market is currently dominated and led by the following offerings:
  1. Microsoft Parallel Data Warehouse
  2. IBM Netezza
  3. Teradata Data Warehouse Appliance
  4. Oracle Exadata
  5. SAP HANA
  6. EMC Greenplum

SMP vs. MPP

Symmetric Multi-Processing (SMP)
A system architecture where all of the processors connect to shared resources (memory, I/O and network) under a single operating system. Each processor has a private cache and access to the main memory. Processors are interconnected using buses, switches or on-chip networks.

[Diagram: SMP Architecture]

Advantages:
- Relatively inexpensive single machine design (no racks needed)
- Symmetric distributed computing
- Efficient/quick processing of small to medium data volumes

Disadvantages:
- Inefficient/slow at processing large data volumes
- Scaling up or down requires a machine upgrade/downgrade
- Resource/memory contention between processors
- External interrupts impact all processing
- Operating system limitation on scalability (an OS can typically support only 64-100 processors)
- Expensive (in time and cost) hardware upgrades


Massively Parallel Processing (MPP)
A system architecture where processing is distributed in parallel across an integrated set of servers known as compute nodes. Each compute node contains its own set of processors, memory and bus. Each server also runs its own operating system and DBMS, allowing it to act as an independent processing unit. Compute nodes are interconnected through a control/management node, which splits, distributes and manages processing. Compute nodes can be added or removed by adding or removing servers from the rack.

[Diagram: MPP Architecture]

Advantages:
- Relatively inexpensive hardware needed to scale (a new server is cheaper than a whole new machine)
- No resource contention across compute nodes
- Scaling up or down is easy and can be performed without taking down the system
- Ability to add failover and backup servers
- Efficient/quick processing of large data volumes
- Few practical limits on the number of compute nodes that can be added

Disadvantages:
- Additional maintenance required (rack space, cooling, monitoring)
- Additional operating costs (power, cooling, hardware upgrades)
- Resources sit idle on small and medium data volumes


As mentioned earlier, traditional BI, Data Warehouse and DBMS tools are not able to keep up with today's data volumes: Moore's Law cannot keep pace with the velocity and volume of data growth, and shared resources and memory prevent scalability and distributed processing. These tools use a Symmetric Multi-Processing (SMP) architecture.
To accommodate this hardware lag (the disparity between Moore's Law and data growth), hardware must be integrated and coupled. The Massively Parallel Processing (MPP) architecture allows hardware resources to be pooled into a more powerful system that can meet the demands of processing and storing Big Data while still offering the usability of tools built on the SMP architecture.
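
To make the divide-and-combine idea behind MPP concrete, here is a toy Java sketch: threads stand in for compute nodes, an in-memory array stands in for partitioned storage, and the main thread plays the control node that splits the work and merges the partial aggregates. It illustrates the principle only, not how a real appliance is built.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ToyMppSum {
        public static void main(String[] args) throws Exception {
            double[] rows = new double[1_000_000];
            for (int i = 0; i < rows.length; i++) rows[i] = i * 0.5;  // fake fact table

            int nodes = 4;  // pretend compute nodes
            ExecutorService pool = Executors.newFixedThreadPool(nodes);
            List<Future<Double>> partials = new ArrayList<>();
            int chunk = rows.length / nodes;

            // "Control node" splits the work: each node sums only its own partition
            for (int n = 0; n < nodes; n++) {
                final int start = n * chunk;
                final int end = (n == nodes - 1) ? rows.length : start + chunk;
                partials.add(pool.submit(() -> {
                    double sum = 0;
                    for (int i = start; i < end; i++) sum += rows[i];
                    return sum;
                }));
            }

            // "Control node" merges the partial aggregates into the final answer
            double total = 0;
            for (Future<Double> partial : partials) total += partial.get();
            pool.shutdown();
            System.out.println("SUM = " + total);
        }
    }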

Big Data Landscape 2016

2/9/2016

Just came across this interesting visualization of the top tools and technologies for 2016, broken out by functionality. I hope this gives you a good understanding of the key players in each space. Use it as a template for exploring new tools and technologies; I guarantee that employers will start requiring skills with some of these tools very soon. I will try to post some interesting articles based on this image. If there is something specific you want me to cover, please mention it in the comments.
[Image: Big Data Landscape 2016]