One of the key components of the Hadoop framework is MapReduce, a distributed data processing model and execution environment that runs on large clusters of commodity machines. It breaks every operation down into Map, Shuffle and Sort, and Reduce functions. If you come from a non-programming background, this can be a difficult concept to understand. A good analogy is to associate MapReduce with SQL-based concepts: many people who work with data are already familiar with relational databases and SQL, so putting MapReduce in that context gives it a tangible definition.

Listed below, I have broken out each of the components of MapReduce and paired it with a SQL construct. It is important to note that MapReduce jobs and SQL queries have similarities but can also serve very different purposes. As such, there are times when the SQL analogy cannot be used verbatim or will not align 100%.

Maps are individual tasks that seek out and retrieve key-value pairs matching given criteria. Although Maps are functions, in the SQL analogy it is easiest to think of them as the FROM, WHERE and UNION operations in a query. Maps seek out data sets before any shuffling, sorting or reducing is performed. In SQL, the FROM, WHERE and UNION operations likewise seek out data sets before grouping or sorting (GROUP BY, ORDER BY) and aggregation (SUM, COUNT, MIN, etc.) are performed.

Reduces are tasks that aggregate and summarize the shuffled and sorted outputs of the mappers. In the SQL analogy, the Reduce task is similar to aggregate functions such as SUM, COUNT, MIN, MAX and AVG, as well as the DISTINCT keyword. Reducers condense a full data set into the smaller, more manageable output that was originally specified by the user or job. In SQL, aggregate functions similarly reduce the data from one or more tables into a smaller output specified by the query or user.

Shuffles and Sorts are tasks that group, sort and organize the outputs of the mappers, and they run between the Map and Reduce phases. In the SQL analogy, the Shuffle and Sort tasks closely resemble the GROUP BY and ORDER BY operations in a query. Shuffles and Sorts are key to arranging the data in the format and order on which reduce functions are easiest to perform. In SQL, the GROUP BY and ORDER BY operations act similarly, organizing and grouping data so that aggregate functions (SUM, COUNT, MIN, etc.) can be applied easily.
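To make the analogy concrete, here is a minimal in-memory sketch in plain Python of how the three phases might line up with a query such as SELECT region, SUM(amount) FROM sales WHERE amount > 0 GROUP BY region ORDER BY region. It does not use the Hadoop API. The `sales` data, the query, and the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`, `run_job`) are all assumptions made up for illustration, and a real job would spread this work across many machines.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a `sales` table: (region, amount).
SALES = [
    ("east", 100),
    ("west", 250),
    ("east", 50),
    ("north", 0),
    ("west", 75),
]


def map_phase(rows):
    """FROM / WHERE: read each row and emit the (key, value) pairs that match."""
    for region, amount in rows:
        if amount > 0:  # WHERE amount > 0
            yield region, amount


def shuffle_and_sort(pairs):
    """GROUP BY / ORDER BY: collect the values for each key, then order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """SUM: collapse each key's list of values into a single result."""
    return [(key, sum(values)) for key, values in grouped]


def run_job(rows):
    """Chain the phases in order: map, then shuffle and sort, then reduce."""
    return reduce_phase(shuffle_and_sort(map_phase(rows)))


if __name__ == "__main__":
    print(run_job(SALES))  # [('east', 150), ('west', 325)]
```

The "north" row is dropped by the map step, just as the WHERE clause would drop it, and the reduce step only ever sees values that have already been grouped by key. As noted above, the mapping is loose: in Hadoop the mapper is arbitrary code rather than a fixed filter, so this sketch shows only one way the pieces can correspond.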
Disclaimer
All content represented in this blog is that of the owner. It does not represent any connection with Apache, HortonWorks, Cloudera or any other company. This blog does not claim ownership of any of the content as original thought. This blog will not be held accountable or take any responsibility for any content. All views and recommendations are based upon the opinion of the owner.