My Articles
Dynamo: Amazon’s Highly Available Key-value Store
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels
The simplicity of Dynamo’s design enables a high degree of scalability and availability with a minimal impact on latency. Dynamo has been in production use for more than two years in a number of missi...
Sprocket: A Scalable, Distributed, Multi-User, Network File System
Margo Seltzer, Keith Bostic, Marshall Kirk McKusick, Carl Staelin
Sprocket is a distributed file system designed to provide efficient, reliable, and convenient access to large, shared data sets. It is designed to be scalable, to tolerate failures, and to be easy to ...
CloneCloud: Elastic Execution between Mobile Device and Cloud
Byung-Gon Chun, Petros Maniatis, Mayur Naik, and Ashwin Patti
CloneCloud is a system that enables elastic execution between mobile device and cloud. CloneCloud clones and instantiates mobile applications on demand in the cloud and renders the application’s user ...
Heron: Stream Processing at Scale
Karthik Ramasamy, Sanjeev Kulkarni, Sijie Guo, James Galford, Heron Team
Heron is a realtime, distributed, fault-tolerant stream processing engine developed at Twitter. Heron is backward compatible with Apache Storm, with a wide array of improvements in performance, usabil...
Twine: A Unified Cluster Management System for Shared Infrastructure
Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, John Wilkes
Twine is a cluster management system that enables sharing clusters between organizations. Twine provides a unified interface to the cluster, and enforces resource allocations and sharing policies.
Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge
Yuhan Quan, Ming Liu, Qi Alfred Chen, Xue Liu, Bo Li
Neurosurgeon is a system that enables collaborative intelligence between the cloud and mobile edge. Neurosurgeon is designed to be lightweight, flexible, and efficient.
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica
Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. Mesos is designed for scalability to support tens of thousands of node...
MapReduce: Simplified Data Processing on Large Clusters
Introduction
The paper describes distributing computation across many machines, i.e., parallel computation over large amounts of data. It introduces an abstraction that hides the underlying parallelization, data distribution, and result aggregation involved in large-scale data computation. The paper addresses the challenge of building a scalable, effective, and user-friendly system for processing massive amounts of data in a short time. Existing parallel processing systems were difficult to set up and maintain, and their complexity made them impractical for non-experts to use effectively. Furthermore, older systems were difficult to scale to enormous amounts of data, and their performance was not up to the mark.

The MapReduce framework, introduced in the paper as an effort to address these difficulties, offers a straightforward and adaptable programming model for handling big data sets in parallel across a cluster of computers. The paper shows that MapReduce is capable of handling a large number of data processing jobs efficiently. Additionally, the framework features fault tolerance, data distribution, and intelligent load balancing. MapReduce has been widely embraced for large-scale data processing tasks such as indexing the entire web, analyzing log files, and large-scale machine learning.

The programming model is built on the Map and Reduce primitives present in many functional languages. The Map function converts an initial set of key-value pairs into an intermediate set of key-value pairs. The Reduce function uses the intermediate key-value pairs to create the final result by aggregating the values for each key. These two functions run concurrently on many machines in the cluster, and the results from each machine are then combined to give the final output.
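The Map/Reduce semantics described above can be sketched with the paper's canonical word-count example. This is a minimal single-process sketch in Python, not the distributed implementation; the function and variable names (`map_fn`, `reduce_fn`, `run_mapreduce`) are illustrative, and the shuffle step that a real cluster performs over the network is simulated with an in-memory dictionary:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in text.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: aggregate all counts emitted for the same word.
    return sum(values)

def run_mapreduce(inputs):
    # Shuffle: group intermediate values by key before reducing.
    # In the real system this grouping happens across machines.
    groups = defaultdict(list)
    for doc_id, text in inputs.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

counts = run_mapreduce({"d1": "the quick fox", "d2": "the lazy dog"})
# counts["the"] == 2, since "the" appears once in each document
```

In the actual framework the inputs are partitioned into splits processed by separate map workers, and the intermediate pairs are partitioned by key across reduce workers, but the programmer only writes the two functions above.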
Novelty
The paper discusses the fault-tolerance feature of the MapReduce framework, which ensures that a computation keeps going even if a machine malfunctions. This is accomplished through replica management and the re-execution of failed tasks, meaning the framework offers a high level of dependability and robustness for applications involving huge amounts of data processing. MapReduce was used in search indexing at Google, where it significantly reduced code length while providing simplicity; the paper indicates a reduction of up to roughly 4 times the original code length once MapReduce was implemented. This highlights how straightforward the concept is, making it simple for programmers to adopt and use. Because of this simplicity, programs can be created in a number of programming languages, such as Python, C++, and Java.
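The re-execution mechanism mentioned above can be sketched as a simple retry loop: when a worker fails, the master reschedules the same deterministic task elsewhere. The sketch below is illustrative only; the names `run_with_retries` and `flaky_task` are hypothetical and not from the paper, and a crashed worker is simulated with a raised exception:

```python
def run_with_retries(task, max_attempts=3):
    # Re-run a task until it succeeds, up to max_attempts times.
    # In MapReduce the master would reschedule the task on another worker.
    last_error = None
    for _ in range(max_attempts):
        try:
            return task()
        except RuntimeError as error:
            last_error = error  # worker "failed"; try again
    raise last_error

# Hypothetical flaky task: fails on the first attempt, succeeds afterwards.
attempts = {"count": 0}

def flaky_task():
    attempts["count"] += 1
    if attempts["count"] < 2:
        raise RuntimeError("simulated worker crash")
    return "done"

result = run_with_retries(flaky_task)
# result == "done" after one simulated failure and one successful re-execution
```

Because map and reduce functions are deterministic and side-effect-free, re-executing a failed task produces the same output as the original run, which is what makes this simple strategy safe.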
Evaluation
The evaluation compares two computations run on large clusters of machines with identical configurations: one searched through a terabyte of data for a particular pattern, and the other sorted one terabyte of data. The pattern search (MR-grep) completed in about 150 seconds, and the overall sort (MR-sort) took approximately 800 seconds, which shows that MapReduce is highly efficient at performing such large-scale operations. The design of the MapReduce system details how it manages errors, schedules tasks, and distributes workload among the worker machines. It also emphasizes the system's adaptability and usability, as well as its capacity to handle a range of data processing activities, such as sorting, searching, and graph algorithms.
Performance
The implementation of MapReduce in Google's indexing system demonstrated good performance and scalability, processing a large amount of data in a short time. For example, making one change to the old indexing system took a few months; after the adoption of MapReduce, this was reduced to a few days. The authors conclude that MapReduce is a simple yet powerful tool for data processing that can provide significant benefits in terms of scalability and reliability.
Opinion
Reviewing the implementation, which performs re-computation on failed nodes for fault tolerance, I presume it can be slow for certain iterative computations. The model also cannot handle real-time computation, since data needs to be parsed, sorted, and shuffled among workers during the operation, all of which is time-consuming. In this model, fault tolerance is reliable only if the machines are accurate and have very limited downtime; in fact, the systems essentially need to be homogeneous to produce the ideal computation result in a short amount of time. In my opinion, MapReduce does not present a fresh method, in terms of the theory of computation, for decomposing a difficult task into manageable parts; rather, it shows that a small set of streamlined operations is suitable for a particular class of problems. This opinion is also expressed in one of the questions in the later section of my review, concerning the additional overhead involved in combining all the results of the computation.
Questions
1. How is the correctness of the map and reduce operations justified at the output end?
2. If the reduce results are stored in separate files, does the user need an additional step to combine the results? Doesn't this act as an overhead operation?
3. What happens on a large-scale failure, i.e., if the majority of the worker machines performing map and reduce operations fail? Will this lead to re-computation every time?
4. The output of a failed machine's map function becomes inaccessible, as it is stored on the local disk, and the function is re-executed on a new worker. Doesn't this clutter the failed worker's storage, since its result goes unused? And if these operations are performed on cloud machines, doesn't that increase the storage cost for unwanted data (in terms of failed workers only)?