CSC/ECE 506 Fall 2007/wiki2 4 BV

Revision as of 22:55, 24 September 2007

Map Reduce

Google uses "Map Reduce", a programming model for processing and generating large data sets. The model is known for its high scalability and fault tolerance. At a bird's-eye view, users specify a "map" function that processes a key/value pair and generates a set of intermediate key/value pairs, and a "reduce" function that merges the intermediate values associated with the same intermediate key. Programs written in this model can be parallelized and executed concurrently on a number of systems. The model provides the capability to partition the data, assign work to different processes, and merge the results.
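The map/reduce idea above can be sketched as a minimal single-machine word-count driver in Python. The function names (`map_fn`, `reduce_fn`, `map_reduce`) are illustrative assumptions, not Google's actual API; a real implementation distributes the two phases across machines.

```python
from collections import defaultdict

# User-specified map function: emit (word, 1) for every word in a document chunk.
def map_fn(key, value):
    for word in value.split():
        yield word, 1

# User-specified reduce function: merge all counts for one intermediate key.
def reduce_fn(key, values):
    return key, sum(values)

# Simplified sequential driver standing in for the distributed runtime.
def map_reduce(documents):
    intermediate = defaultdict(list)
    for key, value in documents:                 # map phase
        for k, v in map_fn(key, value):
            intermediate[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())  # reduce phase

print(map_reduce([("doc1", "to be or not to be")]))
```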

The steps of parallelization are explained below.


==Decomposition==

The data set is decomposed into N splits; typically a master process assumes this responsibility. For a problem such as counting the occurrences of a specific word or a group of words in a document, the document would be split into N pieces. The number of pieces depends on the size of the entire document and the number of distinct words. This phase determines the number of concurrent tasks.
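The master's decomposition step can be sketched as below. The `decompose` helper is hypothetical; production systems typically split by fixed-size byte ranges rather than by words.

```python
# Sketch: the master decomposes a document into N roughly equal splits.
def decompose(text, n_splits):
    words = text.split()
    size = -(-len(words) // n_splits)   # ceiling division: words per split
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

splits = decompose("the quick brown fox jumps over the lazy dog", 3)
print(splits)   # → ['the quick brown', 'fox jumps over', 'the lazy dog']
```

Each element of `splits` becomes one concurrent task, so N controls the degree of parallelism.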


==Assignment==

The tasks are assigned to various processes, which execute them to generate partial or complete output. In the example above, each process would be assigned a split together with a key/value pair that represents the group of words to look for. Each process reads the key/value pair and generates an intermediate result in the form of a key and an intermediate number of occurrences.
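A single assigned map task might look like the sketch below, where `targets` stands in for the key/value pair naming the words of interest (an assumed representation, not a fixed format).

```python
from collections import Counter

# Sketch: one map task counts occurrences, within its split, of the
# words it was asked to look for, and emits intermediate key/count pairs.
def map_task(split_text, targets):
    counts = Counter(w for w in split_text.split() if w in targets)
    return [(word, counts[word]) for word in sorted(targets)]

print(map_task("the cat sat on the mat", {"the", "cat"}))
# → [('cat', 1), ('the', 2)]
```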


==Orchestration==

The main purpose of orchestration is to reduce the synchronization and communication overhead between the tasks. The high I/O traffic generated may become a crucial problem, as the network can become the bottleneck. In the example above, the intermediate values generated by the map processes are reduced to generate the final output. The reduce workers read the data from the map workers through remote procedure calls. The data read is sorted by intermediate key so that all occurrences of the same key are grouped together; this increases the probability that each reduce process works on a single key's intermediate key/value pairs at a time.
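The sort-then-group step described above can be sketched as follows; this omits the remote procedure calls and simply assumes the intermediate pairs have already been gathered from the map workers.

```python
from itertools import groupby
from operator import itemgetter

# Sketch: sort intermediate pairs by key so equal keys are adjacent,
# then reduce each group of values to a single count per key.
def shuffle_and_reduce(intermediate_pairs):
    ordered = sorted(intermediate_pairs, key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(ordered, key=itemgetter(0))}

pairs = [("the", 2), ("cat", 1), ("the", 1), ("mat", 1)]
print(shuffle_and_reduce(pairs))   # → {'cat': 1, 'mat': 1, 'the': 3}
```

Sorting first is what lets a reduce worker stream through its input one key at a time instead of holding all keys in memory.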


==Mapping==

This phase assigns the processes to different processors with the intent of minimizing communication between them. In the problem above, the document splits are mapped to different map workers (i.e., processors) by the master. It is quite possible that much of the time is spent transferring data (i.e., splits and intermediate key/value pairs) across the network. One way to mitigate this is to assign each split to a processor that already holds a local replica of that input split. This eliminates the actual file transfer from the master to the map worker, thereby reducing communication.
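The locality-aware assignment described above can be sketched as below. The `replicas` table (split id to workers holding a local copy) and the least-loaded tie-breaking rule are assumptions for illustration.

```python
# Sketch of locality-aware mapping: prefer a worker that already holds a
# replica of the split; otherwise fall back to any worker. Ties are broken
# by picking the least-loaded candidate.
def assign_splits(splits, workers, replicas):
    load = {w: 0 for w in workers}
    assignment = {}
    for split in splits:
        local = [w for w in replicas.get(split, []) if w in load]
        candidates = local or workers
        chosen = min(candidates, key=lambda w: load[w])
        assignment[split] = chosen
        load[chosen] += 1
    return assignment

print(assign_splits(
    ["s1", "s2", "s3"],
    ["w1", "w2"],
    {"s1": ["w2"], "s2": ["w1"], "s3": ["w2"]},
))   # → {'s1': 'w2', 's2': 'w1', 's3': 'w2'}
```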