Hadoop Ecosystem Overview

7/11/2016

Before you can navigate the Hadoop environment, it is important to identify and learn about the key players. In this post I will provide an overview of the applications, tools and interfaces currently available in the Hadoop ecosystem. I will categorize each product by its functionality (Storage, Processing, Querying, External Integration & Coordination) and provide a description along with its architecture, uses and resource links. It is important to note that within a few months or years some of these products might become obsolete and be replaced by others.

1. Storage

HDFS
The primary distributed file system used by Hadoop applications which runs on large clusters of commodity machines. HDFS clusters consist of a NameNode that manages the file system metadata and DataNodes that store the actual data.

Architecture:
HDFS Architecture

Uses:
-Storage of large imported files from applications outside of the Hadoop ecosystem
-Staging of imported files to be processed by Hadoop applications

Resource:
http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
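
To make the NameNode/DataNode split concrete, here is a toy sketch in Python. It is not the HDFS API: the class names, the 4-byte block size and the round-robin placement are all assumptions made for illustration, and replication is left out entirely. It only shows the idea that the NameNode keeps metadata while the DataNodes keep the bytes.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # bytes; tiny on purpose (an assumption, real HDFS blocks are far larger)


@dataclass
class DataNode:
    """Stores raw block contents, keyed by block id."""
    name: str
    blocks: dict = field(default_factory=dict)


@dataclass
class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""
    files: dict = field(default_factory=dict)  # path -> [(block_id, datanode name)]


def write_file(namenode, datanodes, path, data):
    """Split data into blocks and spread them across DataNodes round-robin."""
    placements = []
    for offset in range(0, len(data), BLOCK_SIZE):
        index = offset // BLOCK_SIZE
        block_id = f"{path}#blk{index}"
        node = datanodes[index % len(datanodes)]
        node.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
        placements.append((block_id, node.name))
    namenode.files[path] = placements


def read_file(namenode, datanodes, path):
    """Ask the NameNode where the blocks are, then fetch them from the DataNodes."""
    by_name = {node.name: node for node in datanodes}
    return b"".join(by_name[name].blocks[block_id]
                    for block_id, name in namenode.files[path])


namenode = NameNode()
datanodes = [DataNode("dn1"), DataNode("dn2")]
write_file(namenode, datanodes, "/logs/app.log", b"hello hadoop")
print(namenode.files["/logs/app.log"])   # metadata only
print(read_file(namenode, datanodes, "/logs/app.log"))
```

Note that the NameNode object never touches the file contents; losing it would lose the map of where everything is, which is why the metadata role matters so much.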

HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).

Architecture:
HBase Architecture

Uses:
-Storage of large data volumes (billions of rows) atop clusters of commodity hardware
-Bulk storage of logs, documents, real-time activity feeds and raw imported data
-Consistent performance of reads/writes to data used by Hadoop applications
-Data store that can be aggregated or processed using MapReduce functionality
-Data platform for Analytics and Machine Learning

Resource:
http://hbase.apache.org/book/architecture.html
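
The difference between a point query and a batch-style pass over the data can be illustrated with a small in-memory model. This is a sketch only, not the HBase client API: the `ToyTable` class, its method names and the sample row keys are hypothetical. It assumes rows are kept sorted by row key and that each row holds a map of "family:qualifier" columns.

```python
from bisect import bisect_left, insort


class ToyTable:
    """Hypothetical in-memory stand-in for a sorted, column-oriented table."""

    def __init__(self):
        self._row_keys = []   # kept sorted so range scans are cheap
        self._rows = {}       # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            insort(self._row_keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Point query (random read) by row key."""
        return self._rows.get(row_key, {})

    def scan(self, start, stop):
        """Yield rows with start <= row key < stop, in key order."""
        i = bisect_left(self._row_keys, start)
        while i < len(self._row_keys) and self._row_keys[i] < stop:
            key = self._row_keys[i]
            yield key, self._rows[key]
            i += 1


table = ToyTable()
table.put("user2#2016-07-01", "activity:page", "/cart")
table.put("user1#2016-07-02", "activity:page", "/pricing")
table.put("user1#2016-07-01", "activity:page", "/home")
table.put("user1#2016-07-01", "profile:country", "US")

print(table.get("user1#2016-07-01"))                    # random read of one row
print([key for key, _ in table.scan("user1", "user2")])  # ordered range scan
```

Rows may carry different columns, which is one reason this kind of store suits logs and activity feeds where records are not uniform.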

HCatalog
A table and storage management layer for Hadoop that enables Hadoop applications (Pig, MapReduce, and Hive) to read and write data in a tabular form as opposed to files. It also provides REST APIs so that external systems can access these tables’ metadata.

Architecture:
HCatalog Architecture

Uses:
-Centralized location of storage for data used by Hadoop applications
-Reusable data store for sequenced and iterated Hadoop processes (ex: ETL)
-Storage of data in a relational abstraction

Resource:
https://cwiki.apache.org/confluence/display/Hive/HCatalog

2. Processing

MapReduce
A distributed data processing model and execution environment that runs on large clusters of commodity machines. It uses the MapReduce algorithm which breaks down all operations into Map or Reduce functions.

Architecture:
MapReduce Architecture

Uses:
-Aggregation (Counting, Sorting, Filtering, Stitching) on large and disparate data sets
-Scalable parallelism of Map or Reduce tasks
-Distributed task execution
-Machine learning

Resource:
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Purpose
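
The classic way to see the model in action is a word count. The sketch below runs in a single Python process and is not Hadoop code; the function names are my own. It only mimics the flow: a map step emits key/value pairs, the pairs are grouped by key, and a reduce step collapses each group.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key and order the keys, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


lines = ["big data big clusters", "data moves"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)  # {'big': 2, 'clusters': 1, 'data': 2, 'moves': 1}
```

On a real cluster each call to the map function could run on a different machine, which is where the scalable parallelism listed above comes from.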

Pig
A SQL-like scripting language and execution environment for creating complex MapReduce transformations. Scripts are written in Pig Latin (the language) and translated into executable MapReduce jobs. Pig also allows the user to create user-defined functions (UDFs) using Java.

Architecture:
Pig Architecture

Uses:
-Scripting environment to execute ETL tasks/procedures on raw data in HDFS
-SQL-like language for creating and running complex MapReduce functions
-Data processing, stitching, schematizing on large and disparate data sets

Resource:
http://pig.apache.org/docs/r0.12.1/index.html

3. Querying

Hive
A distributed data warehouse built on top of HDFS to manage and organize large amounts of data. Hive provides a query language based on SQL semantics (HiveQL) which is translated by the runtime engine to MapReduce jobs for querying the data.

Architecture:
Hive Architecture

Uses:
-Schematized data store for housing large amounts of raw data
-SQL-like Environment to execute analysis and querying tasks on raw data in HDFS
-Integration with outside RDBMS applications

Resource:
https://cwiki.apache.org/confluence/display/Hive/Home#Home-ApacheHive

4. External integration

Flume
A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data into HDFS. Flume transports large quantities of event data using a streaming data flow architecture that is fault tolerant and supports failover and recovery.

Architecture:
Flume Architecture

Uses:
-Transportation of large amounts of event data (network traffic, logs, email messages)
-Stream data from multiple sources into HDFS
-Guaranteed and reliable real-time data streaming to Hadoop applications

Resource:
http://flume.apache.org/FlumeUserGuide.html
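
A Flume data flow is built from three kinds of components: a source that receives events, a channel that buffers them, and a sink that delivers them. The Python sketch below imitates that flow in one process. It is a toy under stated assumptions, not Flume's API or configuration format: the `Channel` class, the batch size and the list standing in for an HDFS file are all made up for illustration.

```python
from queue import Queue


def source(lines):
    """Turn raw log lines into events (here: plain dicts)."""
    for line in lines:
        yield {"body": line.strip()}


class Channel:
    """Buffers events between the source and the sink."""

    def __init__(self):
        self._queue = Queue()

    def put(self, event):
        self._queue.put(event)

    def take_batch(self, size):
        """Remove and return up to `size` buffered events."""
        batch = []
        while len(batch) < size and not self._queue.empty():
            batch.append(self._queue.get())
        return batch


def sink(batch, destination):
    """Deliver a batch of events to the destination (a list standing in for HDFS)."""
    destination.extend(event["body"] for event in batch)


raw_lines = ["GET /home 200\n", "GET /cart 500\n", "POST /login 200\n"]
channel = Channel()
for event in source(raw_lines):
    channel.put(event)

hdfs_file = []
while True:
    batch = channel.take_batch(2)
    if not batch:
        break
    sink(batch, hdfs_file)

print(hdfs_file)
```

The channel is what decouples the two ends: the source can keep accepting events even when the sink is briefly slower, because events wait in the buffer until they are delivered.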

What is Marketing Automation

7/6/2016

Overview

Marketing Automation is a tool that automates and measures marketing actions, tasks and workflows across multiple online channels (email, social media, websites, blogs, etc.).

Components

Central Marketing Database: Repository for all prospect, customer and interaction information
Engagement Marketing Engine: An environment to create, manage and automate marketing processes across different channels.
Analytics Engine: An environment to measure the impact of all executed marketing processes across different channels

Use Cases

Enterprise - B2B, B2C
Small and Mid-Sized Businesses - B2B, B2C

Features

Email Marketing
Campaign Management
Lead Management
Social Marketing
Marketing Analytics

Email Marketing

Batch Email Marketing: The ability to send emails to groups of customers and prospects
Real-Time Triggered Emails: The ability to respond to specific customer behaviors or events with appropriate real-time emails
Email Specific Landing Pages: The ability to create customized landing pages for directed links from emails
Customized Emails: The ability to customize emails (text, images and calls-to-action) for specific customer segments
Mobile Optimized Emails: The ability to send emails that are optimized and rendered properly for mobile devices
Personalized "from addresses" and signatures: The ability to personalize "from addresses" and signatures so that emails appear to come from specific sales owners or individuals

Campaign Management

Program Management: The ability to manage and track marketing campaigns and programs across different channels
Event Management: The ability to manage and automate all elements of events and event marketing, including sending personalized invitations and reminders, tracking registration and attendance, and sending post-event follow-ups
Program Cloning: The ability to clone, re-purpose or re-use marketing programs
Program Import/Export: The ability to import and export marketing programs

Lead Management

Marketing Database: The ability to store and provide a single view of all leads and contacts with all interactions (website visits, email clicks, scoring changes, data updates, etc.)
Segmentation: The ability to segment leads and contacts in your marketing database into lists based on filters (demographic, title, company size, location, etc.)
Drip Marketing: The ability to automate drip marketing, sending messages to leads over time
Lead Scoring: The ability to score and qualify leads based on BANT criteria (budget, authority, need, timeline) along with demographics, behaviors and interactions
CRM Integration: The ability to automatically sync information in near real-time with CRM systems
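
Segmentation and lead scoring are easy to picture as a filter plus a points formula. The Python sketch below is purely illustrative: the field names, point weights and the qualification threshold of 50 are assumptions I made up, not values taken from any marketing automation product.

```python
leads = [
    {"email": "a@example.com", "title": "VP Marketing", "company_size": 500,
     "email_clicks": 4, "site_visits": 6},
    {"email": "b@example.com", "title": "Analyst", "company_size": 40,
     "email_clicks": 1, "site_visits": 2},
    {"email": "c@example.com", "title": "Director of Sales", "company_size": 1200,
     "email_clicks": 0, "site_visits": 1},
]


def in_segment(lead, min_company_size):
    """Segmentation: keep leads whose company meets a size filter."""
    return lead["company_size"] >= min_company_size


def score(lead):
    """Lead scoring: add points for seniority and for observed behavior."""
    points = 0
    if "VP" in lead["title"] or "Director" in lead["title"]:
        points += 20                      # hypothetical weight for seniority
    points += 5 * lead["email_clicks"]    # hypothetical weight per email click
    points += 2 * lead["site_visits"]     # hypothetical weight per website visit
    return points


QUALIFIED_THRESHOLD = 50  # hypothetical cut-off for handing a lead to sales

segment = [lead["email"] for lead in leads if in_segment(lead, min_company_size=100)]
scores = {lead["email"]: score(lead) for lead in leads}
qualified = [email for email, points in scores.items() if points >= QUALIFIED_THRESHOLD]

print(segment)
print(scores)
print(qualified)
```

In practice the weights would be tuned against which leads actually converted, but the structure (filter into a list, score, compare to a threshold) stays the same.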

Social Marketing

Social Listening and Tracking: The ability to track what leads and contacts say on social media websites (Facebook, Twitter, YouTube, LinkedIn, etc.)
Social Campaigns: The ability to automate posts to one or more social media websites
Social Analytics: The ability to track conversion rates and sharing of your content posted on social media websites

Marketing Analytics

Web Analytics: The ability to track web interactions and users
SEO/Keyword Analytics: The ability to monitor and track ranking of your website and keywords on major search engines
Reporting: The ability to build custom reports and dashboards to measure leads and contacts by source, campaign, segment, etc., and to track metrics around any of the lead management features
ROI Analytics: The ability to measure and compare revenue performance by channel or program

Benefits

Improve Lead Management
Automate Processes
Higher Lead Conversion
Improve Customer Experience
Reduce time for Campaign Organization
Individualized Customer Targeting

Leading Marketing Automation Tools

What is Lead Generation

6/14/2016

Overview

Lead generation is the process of converting strangers into leads. Leads represent individuals that have expressed interest in a company's product or service. The lead generation process is broken up into two types: Inbound and Outbound.

Inbound Lead Generation

Inbound lead generation is all about helping potential customers find your company, product or service using channels that are owned or managed by the company. Building a stronger presence and greater awareness can lead to higher conversion into a lead or win. A variety of channels and tactics can be used to do this. Listed below are a few examples:

Search Engine Optimization: Improving the searchability and visibility of your site tends to increase visits and engagement
New Content: Creating content that is relevant and impactful to a desired audience can easily lead to visits, engagement and sharing
Website Organization: The website is the key place where visitors need to be converted to leads and opportunities. Strategically organizing the layout, design, content and calls-to-action can make a big difference between visitors and leads
Company Blogs: A blog can be a great tool for educating readers on your company, products or services without explicitly advertising. Keeping readers engaged, interested and returning will allow you to build a strong base of users that can potentially be converted
Social Media: Social networks and channels have now become the new channels for people to research and learn about products and services. Ads and peers can have a big influence on generating buzz or awareness for a particular company, product or service. Strategically using social media to be in front of your audience and customers can help establish credibility and generate awareness.

Outbound Lead Generation

Outbound lead generation is also about helping potential customers find your company, product or service. It complements inbound lead generation activities. While inbound lead generation utilizes channels that are owned or managed by the company, outbound lead generation utilizes channels that are external and paid for. Listed below are a few examples:

Email Marketing: Despite the prevalence of social media and networks, email remains one of the more prominent forces in communicating and promoting. By using email to stay in touch with customers, educate them on your company or lead them to one of the inbound lead generation channels, you can ensure your company remains top-of-mind
Display Ads: Display ads allow for highly targeted visibility on high-exposure search networks. These ads can be used to drive leads to new content on one of your inbound lead generation channels. Display ads create short sparks of interest that should be nurtured and directed to inbound lead generation channels and eventually to conversion.
Events: Similar to display ads, events can create short sparks of interest that should be used to drive leads to inbound lead generation channels. Beyond acting as a driver, events are very powerful because face-to-face contact can build a much stronger connection with leads. These interactions are the most important because there is no buffer of anonymity to polish or refine an image. If done correctly, an event can quickly turn an individual into a strong lead.

Importance

Social media and the availability of online content have created huge barriers for marketers trying to cut through the noise and reach leads directly. Because of this saturation of content, traditional mass advertising and marketing techniques are ineffective. Lead conversion needs to be broken into stages, with emphasis on nurturing a lead all the way to a customer. Lead generation is key to customer conversion and sales because even with a robust and refined funnel, the challenge still remains to acquire and attract potential leads. Lead generation targets the greater challenge of identifying, attracting and channeling leads into the funnel. Without this key process, the only lead source is word of mouth, and with the current saturation of content in the market, this becomes increasingly unreliable.

Map Reduce Explained Using SQL

6/2/2016

One of the key components in the Hadoop framework is Map Reduce. It is a distributed data processing model and execution environment that runs on large clusters of commodity machines. It uses the MapReduce algorithm, which breaks down all operations into Map, Sorting and Reduce functions. Coming from a non-programming background can make this a difficult concept to understand. A good way to approach Map Reduce is to relate it to SQL-based concepts. Many people are familiar with relational databases and SQL, so putting Map Reduce in that context gives it a tangible definition.

Below I have broken out each of the components of Map Reduce and paired it with an analogous SQL construct. It is important to note that Map Reduce jobs and SQL queries have their similarities but can also have very different purposes. As such, there are times when the SQL analogy cannot be used verbatim or may not align 100%.

MAP (From, Where, Union)

Maps are individual tasks that seek out and retrieve Key Value pairs that match a set of criteria. Although Maps are functions, when using the analogy of SQL it is easiest to think of them as the FROM, WHERE and UNION operations in a SQL query. Maps act to seek out data sets before any reducing, shuffling or sorting is performed. In SQL, the FROM, WHERE and UNION operations seek out data sets before reducing (SUM, COUNT, MIN, etc.), shuffling or sorting (GROUP BY, ORDER BY) is performed.
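
To see the Map side of the analogy in code, the sketch below runs the same selection twice: once as a Python map function that emits Key Value pairs, and once as a SQL query against an in-memory SQLite table. The `orders` table, its columns and the sample rows are made up for illustration; this is plain Python, not Hadoop code.

```python
import sqlite3

rows = [
    ("east", "shipped", 120),
    ("west", "cancelled", 80),
    ("east", "shipped", 50),
    ("west", "shipped", 200),
]


def map_order(record):
    """Map task: keep matching records and emit (key, value) pairs."""
    region, status, amount = record
    if status == "shipped":      # plays the role of WHERE status = 'shipped'
        yield region, amount     # key = region, value = amount


mapped = [pair for record in rows for pair in map_order(record)]

# The same selection expressed with FROM and WHERE.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, status TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
selected = conn.execute(
    "SELECT region, amount FROM orders WHERE status = 'shipped'"
).fetchall()

print(mapped)
print(sorted(mapped) == sorted(selected))
```

Neither version has aggregated anything yet; both have only picked out the rows of interest and shaped them for the later steps.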

Reduce (Aggregation)

Reduces are tasks that aggregate, summarize and reduce the shuffled and sorted outputs of the mappers. Using the analogy of SQL, the Reduce task is similar to aggregate functions like SUM, DISTINCT, COUNT, MIN, MAX, AVG, etc. in a SQL query. Reducers act to aggregate and summarize the data from a full data set into a smaller and manageable output that was originally specified by the user or job. In SQL, the aggregate functions (SUM, DISTINCT, COUNT, MIN, MAX, AVG, etc.) act to reduce the data across multiple tables into a smaller output specified by the query or user.

Shuffle & Sort (Group By, ORDER BY)

Shuffles and Sorts are tasks that group, sort and organize the outputs of the mappers. Using the analogy of SQL the Shuffle and Sort tasks closely resemble the GROUP BY and ORDER BY operations in a SQL query. Shuffles and Sorts are key to organizing the data in formats and orders that are easiest to perform reduce functions upon. In SQL, the GROUP BY and ORDER BY operations act similarly to organize and group data so that reducing (SUM, COUNT, MIN...etc) functions can be easily performed.
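
The sketch below follows mapper output through a shuffle and sort step and then a reduce step, and checks the result against a SQL query that uses GROUP BY and ORDER BY. It is a single-process Python illustration, not Hadoop code; the `mapped_output` table name and the sample pairs are assumptions.

```python
import sqlite3
from itertools import groupby
from operator import itemgetter

# Key Value pairs as they might come out of the mappers, in no particular order.
mapped = [("west", 200), ("east", 120), ("east", 50), ("north", 30)]

# Shuffle & sort: order the pairs by key, then collect the values for each key.
ordered = sorted(mapped, key=itemgetter(0))
shuffled = [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

# Reduce: collapse each group of values into one number (here, a SUM).
reduced = [(key, sum(values)) for key, values in shuffled]

# The same result expressed with GROUP BY, ORDER BY and an aggregate function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mapped_output (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO mapped_output VALUES (?, ?)", mapped)
sql_result = conn.execute(
    "SELECT region, SUM(amount) FROM mapped_output GROUP BY region ORDER BY region"
).fetchall()

print(shuffled)
print(reduced)
print(reduced == sql_result)
```

The reduce step is trivial once the values for each key sit together, which is the point of the shuffle and sort: it does the organizing so that the aggregation can be a simple pass over each group.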
