The Big Data Blog

Hadoop Ecosystem Overview

7/11/2016

267 Comments

 
Before you can navigate the Hadoop environment, it is important to identify and learn about the key players. In this post I will provide an overview of the applications, tools and interfaces currently available in the Hadoop ecosystem. I will categorize each product by its functionality (Storage, Processing, Querying, External Integration & Coordination) and provide a description along with its architecture, uses and resource links. It is important to note that within a few months or years any of these products might become obsolete and be replaced by another.

1. Storage

HDFS
HDFS is the primary distributed file system used by Hadoop applications, and it runs on large clusters of commodity machines. An HDFS cluster consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
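The split between metadata (NameNode) and data (DataNodes) can be illustrated with a small in-memory toy. This is only a sketch: the class names, the tiny block size and the round-robin replica placement are assumptions made for illustration, and they do not reflect HDFS's real placement policy or any HDFS API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataNode:
    """Stores raw block contents only; it knows nothing about file names."""
    name: str
    blocks: dict = field(default_factory=dict)  # block_id -> bytes


class ToyCluster:
    """Hypothetical stand-in for an HDFS cluster (illustration only)."""

    def __init__(self, block_size, replication, node_names):
        self.block_size = block_size
        self.replication = replication  # assumed <= number of nodes
        self.datanodes = [ToyDataNode(n) for n in node_names]
        # NameNode role: path -> list of (block_id, [datanode names])
        self.metadata = {}

    def write(self, path, data):
        entries = []
        for offset in range(0, len(data), self.block_size):
            index = offset // self.block_size
            block_id = f"{path}#blk{index}"
            chunk = data[offset:offset + self.block_size]
            # Toy round-robin placement, NOT the real HDFS policy.
            targets = [
                self.datanodes[(index + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            for node in targets:
                node.blocks[block_id] = chunk
            entries.append((block_id, [node.name for node in targets]))
        self.metadata[path] = entries

    def read(self, path):
        by_name = {node.name: node for node in self.datanodes}
        # Look up block locations in the metadata, then fetch each block
        # from the first DataNode that holds a replica.
        return b"".join(
            by_name[names[0]].blocks[block_id]
            for block_id, names in self.metadata[path]
        )


cluster = ToyCluster(block_size=4, replication=2, node_names=["dn1", "dn2", "dn3"])
cluster.write("/logs/a.txt", b"0123456789")
print(cluster.metadata["/logs/a.txt"])
print(cluster.read("/logs/a.txt"))
```

The point of the toy is that a read first consults the metadata to find where the blocks live and only then touches the nodes that hold the data.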

Architecture:

HDFS Architecture

Uses:
-Storage of large imported files from applications outside of the Hadoop ecosystem
-Staging of imported files to be processed by Hadoop applications

Resource:

http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html

HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
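To make the difference between point queries and range scans concrete, here is a minimal in-memory sketch of a table keyed by row key with "family:qualifier" columns. `ToyTable` and its methods are hypothetical names for illustration; this is not the HBase client API, and it ignores regions, versions and persistence.

```python
from bisect import bisect_left, insort


class ToyTable:
    """Hypothetical sorted key-value table (illustration only)."""

    def __init__(self):
        self._rows = {}  # row_key -> {"family:qualifier": value}
        self._keys = []  # row keys kept in sorted order

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Point query: fetch a single row by its key."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start, stop):
        """Range scan over sorted row keys, start inclusive, stop exclusive."""
        lo = bisect_left(self._keys, start)
        hi = bisect_left(self._keys, stop)
        return [(key, dict(self._rows[key])) for key in self._keys[lo:hi]]


table = ToyTable()
table.put("user#1", "info:name", "Ann")
table.put("user#3", "info:name", "Cy")
table.put("user#2", "info:name", "Bo")
table.put("user#2", "stats:visits", 4)

print(table.get("user#2"))
print(table.scan("user#1", "user#3"))
```

Keeping row keys sorted is what makes the scan cheap in this toy: a range of keys is a contiguous slice rather than a search across the whole table.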

Architecture:

HBase Architecture

Uses:
-Storage of large data volumes (billions of rows) atop clusters of commodity hardware
-Bulk storage of logs, documents, real-time activity feeds and raw imported data
-Consistent performance of reads/writes to data used by Hadoop applications
-Data store that can be aggregated or processed using MapReduce functionality
-Data platform for Analytics and Machine Learning

Resource:

http://hbase.apache.org/book/architecture.html

HCatalog
A table and storage management layer for Hadoop that enables Hadoop applications (Pig, MapReduce, and Hive) to read and write data in tabular form rather than as files. It also provides REST APIs so that external systems can access these tables’ metadata.

Architecture:

HCatalog Architecture

Uses:
-Centralized location of storage for data used by Hadoop applications
-Reusable data store for sequenced and iterated Hadoop processes (ex: ETL)
-Storage of data in a relational abstraction

Resource:

https://cwiki.apache.org/confluence/display/Hive/HCatalog

2. Processing

MapReduce
A distributed data processing model and execution environment that runs on large clusters of commodity machines. It uses the MapReduce programming model, which breaks operations down into Map and Reduce functions.
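As a rough illustration of the model, the classic word count can be written as three plain Python functions. This is a single-process sketch, not Hadoop code: the function names are made up for this example, and a real job would run many map and reduce tasks in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle and sort: group all values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(word_count(["big data", "big cluster"]))
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both phases can be spread over many machines, which is where the scalability comes from.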

Architecture:
MapReduce Architecture

Uses:
-Aggregation (Counting, Sorting, Filtering, Stitching) on large and disparate data sets
-Scalable parallelism of Map or Reduce tasks
-Distributed task execution
-Machine learning

Resource:
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Purpose

Pig
A scripting language and execution environment for creating complex MapReduce transformations. Scripts are written in Pig Latin, a data flow language with SQL-like operators, and translated into executable MapReduce jobs. Pig also allows the user to write user-defined functions (UDFs) in Java.

Architecture:
Pig Architecture

Uses:
-Scripting environment to execute ETL tasks/procedures on raw data in HDFS
-SQL-like, high-level language for creating and running complex MapReduce jobs
-Data processing, stitching, schematizing on large and disparate data sets

Resource:
http://pig.apache.org/docs/r0.12.1/index.html


3. Querying

Hive
A distributed data warehouse built on top of HDFS to manage and organize large amounts of data. Hive provides a query language based on SQL semantics (HiveQL) which is translated by the runtime engine to MapReduce jobs for querying the data.

Architecture:
Hive Architecture

Uses:
-Schematized data store for housing large amounts of raw data
-SQL-like environment to execute analysis and querying tasks on raw data in HDFS
-Integration with outside RDBMS applications

Resource:
https://cwiki.apache.org/confluence/display/Hive/Home#Home-ApacheHive

4. External Integration

Flume
A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data into HDFS. Flume transports large quantities of event data using a streaming data flow architecture that is fault tolerant and provides failover and recovery mechanisms.

Architecture:

Flume Architecture

Uses:
-Transportation of large amounts of event data (network traffic, logs, email messages)
-Stream data from multiple sources into HDFS
-Guaranteed and reliable real-time data streaming to Hadoop applications

Resource:

http://flume.apache.org/FlumeUserGuide.html

Apache Hadoop Ecosystem

What is Marketing Automation

7/6/2016

5 Comments

 

Overview

Marketing Automation is a category of tools that automate and measure marketing tasks and workflows across multiple online channels (email, social media, websites, blogs, etc.).

Components

  1. Central Marketing Database: Repository for all prospect, customer and interaction information
  2. Engagement Marketing Engine: An environment to create, manage and automate marketing processes across different channels.
  3. Analytics Engine: An environment to measure the impact of all executed marketing processes across different channels

Use Cases

  1. Enterprise - B2B, B2C
  2. Small and Mid-Sized Businesses - B2B, B2C

Features

  1. Email Marketing
  2. Campaign Management
  3. Lead Management
  4. Social Marketing
  5. Marketing Analytics

Email Marketing

  • Batch Email Marketing: The ability to send emails to groups of customers and prospects
  • Real-Time Triggered Emails: The ability to respond to specific customer behaviors or events with appropriate real-time emails
  • Email Specific Landing Pages: The ability to create customized landing pages for directed links from emails
  • Customized Emails: The ability to customize emails (text, images and calls-to-action) for specific customer segments
  • Mobile Optimized Emails: The ability to send emails that are optimized for, and render properly on, mobile devices
  • Personalized "from addresses" and signatures: The ability to personalize "from addresses" and signatures so that emails appear to come from specific sales owners or individuals

Campaign Management

  • Program Management: The ability to manage and track marketing campaigns and programs across different channels
  • Event Management: The ability to manage and automate all elements of events and event marketing, including sending personalized invitations and reminders, tracking registration and attendance, and sending post-event follow-ups
  • Program Cloning: The ability to clone, re-purpose or re-use marketing programs
  • Program Import/Export: The ability to import and export marketing programs

Lead Management

  • Marketing Database: The ability to store and provide a single view of all leads and contacts with all interactions (website visits, email clicks, scoring changes, data updates, etc.)
  • Segmentation: The ability to segment leads and contacts in your marketing database into lists based on filters (demographic, title, company size, location, etc.)
  • Drip Marketing: The ability to automate drip marketing, sending a series of messages to leads over time
  • Lead Scoring: The ability to score and qualify leads based on criteria such as BANT (budget, authority, need, timeline), demographics, behaviors and interactions
  • CRM Integration: The ability to automatically sync information in near real-time with CRM systems

Social Marketing

  • Social Listening and Tracking: The ability to track what leads and contacts say on social media websites (Facebook, Twitter, YouTube, LinkedIn, etc.)
  • Social Campaigns: The ability to automate posts to one or more social media websites
  • Social Analytics: The ability to track conversion rates and sharing of your content posted on social media websites

Marketing Analytics

  • Web Analytics: The ability to track web interactions and users
  • SEO/Keyword Analytics: The ability to monitor and track ranking of your website and keywords on major search engines
  • Reporting: The ability to build custom reports and dashboards to measure leads and contacts by source, campaign, segment, etc., and to track metrics around any of the Lead Management features
  • ROI Analytics: The ability to measure and compare revenue performance by channel or program

Benefits

  1. Improve Lead Management
  2. Automate Processes
  3. Higher Lead Conversion
  4. Improve Customer Experience
  5. Reduce time for Campaign Organization
  6. Individualized Customer Targeting

Leading Marketing Automation Tools

Picture

What is Lead Generation

6/14/2016

8 Comments

 

Overview

Lead generation is the process of converting strangers into leads. Leads represent individuals that have expressed interest in a company's product or service. The lead generation process is broken up into two types: Inbound and Outbound.

Inbound Lead Generation

Inbound lead generation is all about helping potential customers find your company, product or service using channels that are owned or managed by the company. Building a presence that is easy to find, and raising awareness, can lead to higher conversion into a lead or a win. A variety of channels and tactics can be used to do this. Listed below are a few examples:
  • Search Engine Optimization: By improving the searchability and awareness of your site, visits and engagement are likely to increase
  • New Content: Creating content that is relevant and impactful to a desired audience can easily lead to visits, engagement and sharing
  • Website Organization: The website is the key place where visitors need to be converted to leads and opportunities. Strategically organizing the layout, design, content and calls-to-action can make a big difference between visitors and leads
  • Company Blogs: A blog can be a great tool for educating readers on your company, products or services without explicitly advertising. Keeping readers engaged, interested and returning will allow you to build a strong base of users that can potentially be converted  
  • Social Media: Social networks and channels have now become the new channels for people to research and learn about products and services. Ads and peers can have a big influence on generating buzz or awareness for a particular company, product or service. Strategically using social media to be in front of your audience and customers can help establish credibility and generate awareness.

Outbound Lead Generation

Outbound lead generation is also about helping potential customers find your company, product or service. It complements inbound lead generation activities. While inbound lead generation utilizes channels that are owned or managed by the company, outbound lead generation utilizes channels that are external and paid for. Listed below are a few examples:
  • Email Marketing: Despite the prevalence of social media and networks, email remains one of the more prominent channels for communicating and promoting. By using email to stay in touch with customers, educate them on your company or lead them to one of the inbound lead generation channels, you can ensure your company remains top of mind
  • Display Ads: Display ads allow for highly targeted visibility on high-exposure advertising networks. These ads can be used to drive leads to new content on one of your inbound lead generation channels. Display ads create short sparks of interest that should be nurtured and directed to inbound lead generation channels and eventually to conversion.
  • Events: Similar to display ads, events create short sparks of interest that should be used to drive leads to inbound lead generation channels. Beyond acting as a driver, events are very powerful because face-to-face interaction builds a much stronger connection with leads. These interactions are the most important because there is no buffer of anonymity to polish or refine an image. If done correctly, an event can quickly turn an individual into a strong lead.

Importance

Social media and the availability of online content have created a great deal of noise that marketers must cut through to reach leads directly. Because of this saturation of content, traditional mass advertising and marketing techniques are ineffective. Lead conversion needs to be broken into stages, with emphasis on nurturing a lead all the way to becoming a customer. Lead generation is key to customer conversion and sales because even with a robust and refined funnel, the challenge still remains to acquire and attract potential leads. Lead generation targets the greater challenge of identifying, attracting and channeling leads into the funnel. Without this key process, the only lead source is word of mouth, and with the current saturation of content in the market, that source becomes increasingly unreliable.

Map Reduce Explained Using SQL

6/2/2016

21 Comments

 
One of the key components in the Hadoop framework is Map Reduce. It is a distributed data processing model and execution environment that runs on large clusters of commodity machines. It breaks operations down into Map, Shuffle & Sort, and Reduce steps. For someone coming from a non-programming background this can be a difficult concept to understand. A good way to understand Map Reduce is to relate it to SQL concepts. Many people are familiar with relational databases and SQL, so putting Map Reduce in that context gives it a tangible definition.

Below I have broken out each of the components of Map Reduce and paired it with a corresponding SQL construct. It is important to note that Map Reduce jobs and SQL queries have their similarities but can also have very different purposes. As such, there are times when the SQL analogy cannot be used verbatim or may not align 100%.

Map (FROM, WHERE, UNION)

Maps are individual tasks that read input records, select the ones that match given criteria and emit them as key-value pairs. Although Maps are functions, when using the SQL analogy it is easiest to think of them as the FROM, WHERE and UNION operations in a SQL query. Maps seek out data sets before any shuffling, sorting or reducing is performed. In SQL, the FROM, WHERE and UNION operations likewise seek out data sets before grouping and sorting (GROUP BY, ORDER BY) or aggregation (SUM, COUNT, MIN, etc.) is performed.

Reduce (Aggregation)

Reduces are tasks that aggregate and summarize the shuffled and sorted outputs of the mappers. Using the analogy of SQL, the Reduce task is similar to aggregate functions such as SUM, COUNT, MIN, MAX and AVG in a SQL query. Reducers condense a full data set into the smaller, more manageable output specified by the user or job. In SQL, the aggregate functions likewise reduce the rows selected from one or more tables into a smaller output specified by the query or user.

Shuffle & Sort (GROUP BY, ORDER BY)

Shuffles and Sorts are tasks that group, sort and organize the outputs of the mappers so that all values for the same key arrive at the same reducer. Using the analogy of SQL, the Shuffle and Sort tasks closely resemble the GROUP BY and ORDER BY operations in a SQL query. Shuffles and Sorts are key to organizing the data into a form and order on which reduce functions are easiest to perform. In SQL, the GROUP BY and ORDER BY operations similarly organize and group data so that aggregate functions (SUM, COUNT, MIN, etc.) can be easily applied.
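The analogy can be checked side by side with a small sketch. It runs one SQL query through Python's built-in sqlite3 module and then computes the same answer with explicit map, shuffle & sort and reduce steps. The `orders` table, its columns and the sample rows are made up for illustration, and the second function is a single-process imitation of the phases rather than a real Hadoop job.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, status, amount)
ROWS = [
    ("east", "open", 10),
    ("west", "open", 5),
    ("east", "open", 7),
    ("west", "closed", 99),
    ("north", "open", 3),
]


def sql_totals(rows):
    """The SQL side of the analogy, run against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, status TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT region, SUM(amount) FROM orders "
        "WHERE status = 'open' GROUP BY region ORDER BY region"
    ).fetchall()
    conn.close()
    return result


def mapreduce_totals(rows):
    """The same question answered with explicit Map Reduce style steps."""
    # Map (FROM, WHERE): select matching records, emit (key, value) pairs.
    mapped = [(region, amount) for region, status, amount in rows if status == "open"]
    # Shuffle & Sort (GROUP BY, ORDER BY): gather the values for each key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce (SUM): collapse each key's values into one output row.
    return [(key, sum(groups[key])) for key in sorted(groups)]


print(sql_totals(ROWS))
print(mapreduce_totals(ROWS))
```

Both functions print the same per-region totals for open orders, which is the sense in which the WHERE clause plays the role of the map step, GROUP BY and ORDER BY the shuffle and sort, and SUM the reduce.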

    Disclaimer
    All content represented in this blog is that of the owner. They do not represent any connection with Apache, HortonWorks, Cloudera or any other company. This blog does not claim ownership of any of the content as original thought. This blog will not be held accountable or take any responsibility for any content. All views and recommendations are based upon the opinion of the owner.
