CSC/ECE 506 Fall 2007/wiki2 4 BV
MapReduce
One of the major problems faced in the networked world is the need to distill huge amounts of data into easily understandable smaller data sets. Whether the task is classifying hits to a website by their sources or counting how often words occur in a body of text, the underlying problem is the same: analyzing a large amount of data and reducing it to small, understandable data sets.
MapReduce is a programming model for processing and generating large data sets, developed by Google as a scalable multiprocessing model with high fault tolerance. MapReduce maps a large data set down to a small number of keys, reducing the data set itself to more manageable proportions. The user supplies two functions: a map function that processes a key/value pair and generates a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
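As a concrete illustration, consider the classic word-count job. A minimal Python sketch of the two user-supplied functions might look as follows; the emit-style hooks and function names are assumptions for illustration, not Google's actual interface (which is C++):

    # Stand-ins for the framework's output hooks (assumed here so the
    # sketch is self-contained; the real runtime provides these).
    intermediate = []
    results = []

    def emit_intermediate(key, value):
        intermediate.append((key, value))

    def emit(key, value):
        results.append((key, value))

    # Map: called once per input split. key: document name,
    # value: document contents. Emits (word, 1) for every word.
    def map_fn(key, value):
        for word in value.split():
            emit_intermediate(word, 1)

    # Reduce: called once per intermediate key with all of its
    # values. Sums the partial counts into a total for the word.
    def reduce_fn(key, values):
        emit(key, sum(values))

Everything else -- splitting the input, scheduling workers, grouping intermediate values by key -- is handled by the framework, which is what makes the model attractive.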
Some examples where MapReduce can be employed are:
- Distributed Grep
- Count of URL Access Frequency
- Reverse Web-Link Graph
- Term-Vector per Host
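Distributed grep, for instance, fits the same mold: the map function emits a line whenever it matches the pattern, and the reduce function is essentially an identity that copies the intermediate data to the output. A hedged sketch, reusing the stub hooks from above (the pattern is illustrative):

    import re

    PATTERN = re.compile(r"error")   # illustrative pattern to grep for

    # Map: emit each matching line, keyed by its line number.
    def grep_map(filename, contents):
        for lineno, line in enumerate(contents.splitlines(), 1):
            if PATTERN.search(line):
                emit_intermediate(lineno, line)

    # Reduce: identity -- pass the matching lines straight through.
    def grep_reduce(key, values):
        for line in values:
            emit(key, line)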
Steps of Parallelization
Programs written in this model can be parallelized and executed concurrently on a number of systems. The model provides the machinery for partitioning the data, assigning the pieces to different processes, and merging the results.
[Figure: Execution Overview]
Decomposition
The data set is decomposed into M splits; typically a master process takes care of the decomposition. In the context of a problem like counting the number of occurrences of a specific word or group of words in a document, the document would be split into M parts, each of which is searched in parallel for the specific patterns. The number of parts, M, depends on the size of the entire document (the data set) and the number of different words (the keys) being searched.
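As a sketch, the master's decomposition step can be written as simple chunking. The file name is an assumption, and a real implementation would split on file-system block boundaries rather than in memory; splitting on whitespace here just keeps any word from straddling two parts:

    # Split a document into roughly M equal parts for the map phase.
    def decompose(document, M):
        words = document.split()                     # word boundaries, so no word is cut in half
        chunk = max(1, (len(words) + M - 1) // M)    # ceiling division: words per part
        return [" ".join(words[i:i + chunk])
                for i in range(0, len(words), chunk)]

    splits = decompose(open("document.txt").read(), M=8)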
Assignment
The tasks are assigned to various processes, which execute them to generate partial or complete output. For the problem above, each process would be assigned a part of the document together with the word(s) to be searched; the word(s) form the keys of the input key/value pairs. Each process searches its part of the document and generates a partial count of occurrences. The output is again generated as key/value pairs, where the key is the word searched and the value is the number of occurrences in the assigned part of the document.
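In code, one worker's task for this problem amounts to a partial tally over its split. A minimal sketch (the target words are illustrative, and `splits` comes from the decomposition sketch above):

    from collections import Counter

    # One map task: count the target words in the assigned split and
    # return the partial (word, count) pairs as intermediate output.
    def map_task(split, targets):
        counts = Counter(w for w in split.split() if w in targets)
        return list(counts.items())

    partials = [map_task(s, targets={"alpha", "beta"}) for s in splits]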
Orchestration
The main purpose of orchestration is to reduce the synchronization and communication overhead between the processes; the high I/O traffic generated can become a crucial problem. In the example above, the intermediate values generated by the map processes are consolidated into a final output. The reduce workers read the data from the map workers through remote procedure calls. The data read is sorted on the intermediate keys so that all occurrences of the same key are grouped together, ensuring that each reduce process sees all intermediate key/value pairs for a given key at once.
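The sort-and-group step the reduce workers perform can be sketched in memory as follows; this is a stand-in for what the real system does over RPC-fetched intermediate files:

    from itertools import groupby
    from operator import itemgetter

    # Gather intermediate pairs from all map tasks, sort by key so that
    # equal keys become adjacent, then hand each group to the reduce step.
    def shuffle_and_reduce(partials):
        pairs = sorted((kv for task in partials for kv in task),
                       key=itemgetter(0))
        return [(word, sum(count for _, count in group))
                for word, group in groupby(pairs, key=itemgetter(0))]

    final_counts = shuffle_and_reduce(partials)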
[Figure: Parallel Execution]
Mapping
This phase assigns the processes to different processors with the intent of minimizing the communication between them. In the problem above, the document splits are mapped to different map workers (i.e., processors) by the master. It is quite possible that much of the time is spent transferring data (i.e., splits and intermediate key/value pairs) across the network. One way of attacking the problem is to assign each split to a processor that already holds a local replica of that split. This eliminates the file transfer from the master to the map worker and thereby reduces communication.
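A locality-aware assignment can be sketched as preferring a processor that already holds a replica of the split. The replica map is an assumed input here; in practice it would come from the underlying distributed file system:

    # replicas: split id -> set of processors holding a local copy.
    # Assumes there are at least as many idle processors as splits.
    def assign_splits(split_ids, idle_procs, replicas):
        assignment = {}
        for s in split_ids:
            local = replicas.get(s, set()) & idle_procs
            proc = (local or idle_procs).pop()   # prefer a local replica holder
            assignment[s] = proc
            idle_procs.discard(proc)
        return assignment

Only when no idle processor holds a replica does the master fall back to an arbitrary idle processor and pay the cost of the network transfer.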