MapReduce, Simple Programming for Big Results. MapReduce is a programming model for the Hadoop ecosystem. It relies on YARN to schedule and execute parallel processing over the distributed file blocks in HDFS. There are several tools that use the MapReduce model to provide a higher-level interface to other programming models. Hive has a SQL-like interface that adds capabilities that help with relational data modeling. And Pig is a high-level data flow language that adds capabilities that help with process map modeling.

Traditional parallel programming requires expertise in a number of computing and systems concepts. For example, synchronization mechanisms like locks, semaphores, and monitors are essential, and using them incorrectly can either crash your program or severely impact performance. This steep learning curve makes traditional parallel programming difficult. It is also error prone, since your code can run on hundreds or thousands of nodes, each having many cores, and any problem related to these parallel processes needs to be handled by your parallel program. The MapReduce programming model greatly simplifies running code in parallel, since you don't have to deal with any of these issues. Instead, you only need to create map and reduce tasks, and you don't have to worry about multiple threads, synchronization, or concurrency issues.

So, what are map and reduce? Map and reduce are two concepts based on functional programming, where the output of a function depends solely on its input. Just like in a mathematical function, f(x) = y, y depends on x. You provide a function, or operation, for map and for reduce, and the runtime executes it over the data. For map, the operation is applied to each data element. For reduce, the operation summarizes elements in some manner. An example using map and reduce will make these concepts clearer.

Hello World is the traditional first program you write when you start learning a programming language. The first program to learn, or the Hello World of MapReduce, is often WordCount. WordCount reads one or more text files and counts the number of occurrences of each word in these files. The output will be a text file with a list of words and their occurrence frequencies in the input data.

Let's examine each step of WordCount. For simplicity, we assume we have one big file as input. Before WordCount runs, the input file is stored in HDFS. As you know by now, HDFS partitions the blocks across multiple nodes in the cluster. In this case, there are four partitions, labeled A, B, C, and D. The first step in MapReduce is to run a map operation on each node. As the input partitions are read from HDFS, map is called for each line in the input.

Let's look at the first lines of input partitions A and B, and start counting the words. The first line in the partition on node A says, "My apple is red and my rose is blue." Similarly, the first line on partition B says, "You are the apple of my eye." Let's now see what happens in the first map node for partition A. Map creates a key-value pair for each word on the line, with the word as the key and 1 as the value. In this example, the word apple is read from the line in partition A, and map produces a key-value pair of (apple, 1). Similarly, the word my is seen on the first line of A twice, so two key-value pairs of (my, 1) are created. Note that map goes to each node containing a data block for the file, instead of the data moving to map. This is moving computation to data. Let's now see what the same map operation generates for partition B.
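To make the map step concrete, here is a minimal sketch of it in Python. This is illustrative code of our own, not Hadoop's actual API (real Hadoop mappers are typically written in Java or run as streaming scripts), and the function name map_wordcount is just our own choice.

```python
# A minimal sketch of the WordCount map step: for each line of input,
# emit one (word, 1) key-value pair per word on the line.

def map_wordcount(line):
    """Emit a (word, 1) pair for every word on the line."""
    for word in line.lower().split():
        yield (word, 1)

# Applied to the first line of partition A:
line_a = "My apple is red and my rose is blue"
print(list(map_wordcount(line_a)))
# [('my', 1), ('apple', 1), ('is', 1), ('red', 1), ('and', 1),
#  ('my', 1), ('rose', 1), ('is', 1), ('blue', 1)]
```

Notice that the repeated word my simply produces two separate (my, 1) pairs; map does no summarizing of its own.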
Since each word on this line happens to occur only once, map generates a list of all the words, with one key-value pair each. Please take a moment to observe the outputs of map and the key-value pair associated with each word.

Next, all the key-value pairs that were output from map are sorted based on their keys, and the key-value pairs with the same word are moved, or shuffled, to the same node. To simplify this figure, each node only has a single word, in orange boxes, but in general a node will have many different words, just like in our example from the two lines in the A and B partitions. Here we see that the words "you" and "apple" are assigned to the first node, the word "is" to the second node, and the words "rose" and "red" to the third. Although for simplicity we drew four map nodes and three shuffle nodes, the number of nodes can be scaled up as much as the application demands.

Next, the reduce operation executes on these nodes to add the values for key-value pairs with the same key. For example, (apple, 1) and another (apple, 1) become (apple, 2). The result of reduce is a single key-value pair for each word that was read in the input file. The key is the word, and the value is the number of occurrences.

If we look back at our WordCount example, we see that there were three distinct steps, namely the map step, the shuffle and sort step, and the reduce step. Although the WordCount example is pretty simple, it represents a large number of applications to which these three steps can be applied to achieve data-parallel scalability. For example, now that you have seen the WordCount application, consider changing the WordCount algorithm to index all the URLs by words after a web crawl. This means that instead of each word pointing to a count, it would point to the URLs on which that word occurs. After the map with this new function, which by the way is called a user-defined function, the output of shuffle and sort would look like this. Now, when we reduce, all the URLs that mention apple get collected together, which would look like this. This is, in fact, one of the ways a search engine like Google works. So now, if somebody came to the interface built for this application to search for the word apple, and entered apple, it would be easy to retrieve all the URLs, since the word itself is the key. No wonder the first MapReduce paper was produced by Google. We will give you a link to the original Google paper on MapReduce from 2004 at the end of this lecture. It is pretty technical, but it gives you a clear overview of the original design, without the details of current system implementations.

We just saw how MapReduce can be used in search engines, in addition to counting the words in documents. Although it's possible to add many more applications, let's stop here for a general discussion of how data parallelism arises in this three-step pattern. There is definitely parallelization during the map step. This parallelization is over the input, as each partition gets processed one line at a time. To achieve this type of data parallelism, we must decide on the data granularity of each parallel computation. In this case, it is a line. We also see parallel grouping of data in the shuffle and sort phase. This time, the parallelization is over the intermediate products, that is, the individual key-value pairs. And after the grouping of the intermediate products, the reduce step gets parallelized to construct one output file. You have probably noticed that the data gets reduced to a smaller set at each step. This overview gives us an idea of what kinds of tasks MapReduce is good for.
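Putting the three steps together, the following single-process Python sketch simulates the whole WordCount pipeline on the two example lines. Again, this is our own illustrative code under simplifying assumptions, not Hadoop's API: sorting the key-value pairs and grouping equal keys stands in for the shuffle and sort phase, and the reduce function sums the values for each key.

```python
from itertools import groupby
from operator import itemgetter

# A single-process sketch of the WordCount pipeline: map, then shuffle
# and sort, then reduce. In a real Hadoop job these phases run in
# parallel across nodes; here they run sequentially so the data flow
# is easy to follow.

def map_wordcount(line):
    # Map step: emit a (word, 1) pair for every word on the line.
    for word in line.lower().split():
        yield (word, 1)

def reduce_wordcount(word, counts):
    # Reduce step: sum the values for one key,
    # e.g. (apple, 1) and (apple, 1) become (apple, 2).
    return (word, sum(counts))

lines = [
    "My apple is red and my rose is blue",  # first line of partition A
    "You are the apple of my eye",          # first line of partition B
]

# Map: every line emits its (word, 1) pairs.
pairs = [pair for line in lines for pair in map_wordcount(line)]

# Shuffle and sort: sorting by key makes equal words adjacent, which is
# what routing all copies of a key to the same node achieves on a cluster.
pairs.sort(key=itemgetter(0))

# Reduce: one output key-value pair per distinct word.
counts = [reduce_wordcount(word, (v for _, v in group))
          for word, group in groupby(pairs, key=itemgetter(0))]
print(counts)
# Includes ('apple', 2), ('is', 2), ('my', 3) among the results.
```

On a real cluster, the grouping is typically done by hashing each key to a reducer node, so every (apple, 1) pair arrives at the same reducer no matter which map node emitted it.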
While MapReduce excels at independent batch tasks like the applications we just discussed, there are certain kinds of tasks for which you would not want to use MapReduce. For example, if your data changes frequently, MapReduce is slow, since it reads the entire input data set each time it runs. The MapReduce model also requires that maps and reduces execute independently of each other. This greatly simplifies your job as a designer, since you do not have to deal with synchronization issues. However, it means that computations that do have dependencies cannot be expressed with MapReduce. Finally, MapReduce does not return any results until the entire process is finished; it must read the entire input data set first. This makes it unsuitable for interactive applications, where results must be presented to the user very quickly while the user waits.

As a summary, MapReduce hides the complexities of parallel programming and greatly simplifies building parallel applications. Tasks suitable for MapReduce include search engine page ranking and topic mapping. Please see the reading after this lecture on making Pasta Sauce with MapReduce for another fun application using the MapReduce programming model.