MapReduce is a programming model developed at Google for large-scale data analysis. A familiar example is Google Search, which scans trillions of web pages to compute the most relevant results for a query. MapReduce distributes this kind of computation across a cluster of thousands of servers, each of which analyzes the portion of the dataset stored locally.

After an initial round of independent parallel computation, the machines exchange intermediate results, from which the final results are computed, again in parallel. Google has reported using MapReduce to analyze about an exabyte of data every month. The model saw much wider adoption once Hadoop, the open-source Java implementation of MapReduce, became available; it can be installed on any cluster and used in any application domain. Hadoop/MapReduce has given researchers a powerful tool for tackling large-data problems in areas such as machine learning, text processing, and bioinformatics.

There are two phases in the MapReduce model. First, a map function processes a key/value pair and generates a set of intermediate key/value pairs. Then a reduce function merges all intermediate values associated with the same intermediate key; the sketch below shows the shape of these two functions. To see the model in action, we will then walk through the very simple example discussed in the paper.
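A minimal sketch of the two user-supplied functions, with illustrative generic types (these interfaces are not Hadoop's actual API; they only summarize the model):

```java
import java.util.List;
import java.util.Map;

// Illustrative signatures for the two phases of the MapReduce model.
interface MapFunction<K1, V1, K2, V2> {
    // Processes one input key/value pair and emits intermediate key/value pairs.
    List<Map.Entry<K2, V2>> map(K1 key, V1 value);
}

interface ReduceFunction<K2, V2, R> {
    // Merges all intermediate values that share the same intermediate key.
    R reduce(K2 key, Iterable<V2> values);
}
```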

[Figure 1: MapReduce web-log example — four nodes running the map phase and two nodes running the reduce phase]

Consider a web log file with data collected over a period of time. It is a very large file that records every hit on the site, and our interest is the number of hits at each timestamp. We split the work into two sub-tasks: (i) map, which extracts the timestamp from each log line and emits a (timestamp, 1) pair, and (ii) reduce, which accumulates the hits for each timestamp, using the timestamp as the key. The output of the map stage is a stream of pairs such as <1345, 1>, <1345, 1>, <1345, 1>, <1345, 1>, <1346, 1>, ..., which the reduce stage turns into <1345, 4>, <1346, 1>. Depending on the data volume, the map and reduce tasks can be spread over many computational units. The example is depicted in Figure 1, which shows the work split across four nodes running the map function and two nodes running the reduce function.
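In Hadoop, these two sub-tasks would be written as a Mapper and a Reducer. The sketch below assumes the timestamp is the first whitespace-separated field of each log line; the class names (HitCount, HitMapper, HitReducer) are illustrative, not from the paper:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class HitCount {

    public static class HitMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text timestamp = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Emit <timestamp, 1> for every log line, e.g. <1345, 1>.
            String[] fields = line.toString().split("\\s+");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                timestamp.set(fields[0]);
                context.write(timestamp, ONE);
            }
        }
    }

    public static class HitReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text timestamp, Iterable<IntWritable> counts,
                              Context context)
                throws IOException, InterruptedException {
            // Sum all the 1s emitted for the same timestamp, e.g. <1345, 4>.
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(timestamp, new IntWritable(sum));
        }
    }
}
```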

Hadoop allows us to easily write and run applications that process vast amounts of data. The Hadoop Distributed File System (HDFS) is the storage system used by Hadoop. A typical HDFS file is on the order of terabytes, stored in blocks of 128 MB. Blocks are replicated for reliability, and rack-aware placement is used to improve bandwidth. HDFS operates in a master-slave mode: one master node, the NameNode, keeps the metadata and manages the file system, while a number of slave nodes, the DataNodes, store the actual data blocks. When a client executes a query, it first contacts the NameNode to get the file's metadata and then goes directly to the DataNodes to read the data blocks.
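As a rough illustration of that read path, the sketch below uses Hadoop's Java FileSystem API: the client library contacts the NameNode for block locations and then streams the blocks from the DataNodes. The path /logs/access.log is a placeholder:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster settings from core-site.xml / hdfs-site.xml.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/logs/access.log"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            // The stream reads each block from whichever DataNode holds it.
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```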
