Expertiza_Wiki - User contributions [en]

File:Wiki invalidate states.jpg

2013-03-27T18:19:48Z

Amahaba:

File:Wiki invalidate bus.jpg

2013-03-27T18:17:34Z

Amahaba:

CSC/ECE 506 Spring 2013/8b ap

2013-03-27T18:14:48Z

Amahaba: /* Cache Coherence Protocols on Real Architectures */

=Cache Coherence Protocols on Real Architectures=

In parallel computer architectures, [http://en.wikipedia.org/wiki/Cache_coherence '''cache coherence'''] refers to the consistency of data stored in the caches of individual processors and in shared memory. The problem arises because multiple processors each have their own cache. When a processor updates its cached copy of a shared memory location, every other cache holding that location must observe the new value. Figure 1 illustrates the situation.

[[File:Wiki_first_shared.jpg‎]] 
'''Figure 1. Multiple Caches of Shared Resource'''

There are two ways to maintain cache consistency<ref name="glasco">Glasco, D.B.; Delagi, B.A.; Flynn, M.J.; , "Update-based cache coherence protocols for scalable shared-memory multiprocessors," System Sciences, 1994. Proceedings of the Twenty-Seventh Hawaii International Conference on , vol.1, no., pp.534-545, 4-7 Jan. 1994 doi: 10.1109/HICSS.1994.323135
[http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=323135&isnumber=7709 Paper]</ref>, invalidation based and update based.

An invalidation-based protocol purges the copies of the line from the other caches, which leaves a single copy of the line, whereas an update-based protocol forwards the written value to the other caches, after which all caches are consistent. One of the drawbacks of an invalidation-based protocol is that it incurs a high number of coherence misses. To address this, one can use an update coherence protocol or a newer type of protocol called an adaptive coherence protocol. We shall discuss all three below.

== Invalidation Coherence Protocols ==
An invalidation protocol invalidates all remote copies of a cache block when a local copy is updated. Under the invalidation scheme, updates are only propagated when data are read, and several updates can take place before communication is necessary. In the multiple-reader, single-writer scheme described here, this is potentially expensive. However, if the read/write ratio is sufficiently high, the parallelism obtained by allowing multiple simultaneous readers offsets this cost. On the other hand, when several processors read a block after each write, every reader misses and must fetch the block again.

=== Write-Through Write Invalidate Caches ===
The copy of a cache block held by a local processor can be in one of two states:
* '''Valid''':
** All processors can read safely. The cache block is valid and clean, i.e. the cached value is the same as that at a lower memory level.
** The local processor can also write.
* '''Invalid''' (not in cache):
** The block has been invalidated.
** The block has been replaced.
** Any request to the cache block results in a cache miss.
The processor requests are:
* '''PrRd''' (Processor-Read) - processor-side request to read data from a cache block
* '''PrWr''' (Processor-Write) - request sent when the processor wants to write into a cache block
The snooped bus requests are:
* '''BusRd''' (Bus-Read) - request that indicates there is a read request to a block made by another processor
* '''BusWr''' (Bus-Write) - request that indicates there is a write request to a block made by another processor. In the case of a write-through cache, the BusWr is a write-through to the main memory performed by another processor.
Initially, a cache block is in the Invalid state. When a remote processor writes to its cache copy, all other cache copies become invalidated.

When there is a processor-side read request (PrRd) to an invalid block, the processor suffers a cache miss. This results in a BusRd request on the bus, the memory block is fetched from main memory, and the block goes to the Valid state. When there is a PrWr request to an invalid block, a BusWr is sent on the bus, other caches that have the block invalidate their copies, and the value is written through to main memory. The block is not brought into the requesting cache, so its state remains Invalid.

When the block is in the Valid state, a PrRd is a cache hit because the block resides in the cache, and the state remains Valid. A PrWr is also a cache hit, but because the cache is write-through, the write is still propagated to main memory with a BusWr, which invalidates any copies held by other caches. The state remains Valid.

On a BusRd, a cache block in the Valid state remains Valid, and a block in the Invalid state remains Invalid.

On a BusWr, a cache block in the Valid state becomes Invalid, and an invalid block remains Invalid.
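
The two-state machine described above can be sketched as a pair of lookup tables. This is a minimal illustration, not a definitive implementation: it assumes a no-write-allocate policy (a write miss leaves the block Invalid) and that every write, including a write hit, posts a BusWr because the cache is write-through. The names <code>State</code>, <code>PROCESSOR_SIDE</code>, <code>BUS_SIDE</code> and <code>simulate</code> are hypothetical.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    VALID = "V"


# (current state, processor request) -> (next state, bus request or None)
PROCESSOR_SIDE = {
    (State.INVALID, "PrRd"): (State.VALID, "BusRd"),
    (State.INVALID, "PrWr"): (State.INVALID, "BusWr"),  # assumed no-write-allocate
    (State.VALID, "PrRd"): (State.VALID, None),         # read hit, no bus traffic
    (State.VALID, "PrWr"): (State.VALID, "BusWr"),      # write-through on a hit
}

# (current state, snooped bus request) -> next state
BUS_SIDE = {
    (State.INVALID, "BusRd"): State.INVALID,
    (State.INVALID, "BusWr"): State.INVALID,
    (State.VALID, "BusRd"): State.VALID,
    (State.VALID, "BusWr"): State.INVALID,
}


def simulate(num_caches, accesses):
    """Track one block across caches; accesses is a list of (cpu, request)."""
    states = [State.INVALID] * num_caches
    trace = []
    for cpu, request in accesses:
        states[cpu], bus = PROCESSOR_SIDE[(states[cpu], request)]
        if bus is not None:
            # Every other cache snoops the bus request.
            for other in range(num_caches):
                if other != cpu:
                    states[other] = BUS_SIDE[(states[other], bus)]
        trace.append((bus, "".join(s.value for s in states)))
    return trace
```

Calling <code>simulate(2, [(0, "PrRd"), (1, "PrRd"), (0, "PrWr"), (1, "PrRd")])</code> shows the second cache losing its copy on the write and missing again on its next read.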

==== Advantages ====
A processor can modify a cache block after invalidating the copies in other caches. The block being modified then resides in only one cache, which gives that processor exclusive ownership.

=== Write Back Write Invalidate caches ===
* '''Processor/cache operations''': PrRd, PrWr, block replacement
* '''States''':
** Invalid, Valid (clean) - same as above
** Modified (dirty) - a block that is valid and has had a value written into it; thus, it is different from the copy in main memory
* '''Bus transactions''': Bus Read (BusRd), Write-Back (BusWB)

Advantage over write-through:
* When a processor continuously writes into the same cache block, only a single command is needed to invalidate the copies in other caches.
Disadvantage of write-through:
* It requires a high amount of bandwidth.

==Update Coherence Protocol==

===Introduction===
Update-based cache coherence protocols work by directly updating all the cache values in the system. This differs from the invalidation-based protocols because it achieves write propagation without having to invalidate data and thus not resulting in a cache miss. This saves on numerous coherence misses, time spent to correct the miss, and bandwidth usage. The update-based protocols we will be discussing in this section are
* [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]
* [http://en.wikipedia.org/wiki/Firefly_protocol Firefly protocol].

===Dragon Protocol===
Dragon protocol saves on bandwidth by updating the specific words within the cache instead of the entire block. The caches use write allocate and write update policies. The Dragon Protocol is made up of four states and does not include an invalidation state.

* '''Modified (M)''' - cache block is exclusively owned, however it can be different from main memory
* '''Exclusive (E)''' - cache block is clean and is only in one cache
* '''Shared Modified (Sm)''' - cache block resides in multiple caches and is possibly dirty
* '''Shared Clean (Sc)''' - cache block resides in multiple caches and is clean

There is no invalid state, because if a block is cached then it is assumed to be valid. However, it can differ from main memory. Below are the finite state machines for the processor-side calls and bus-side calls. The Dragon protocol uses snoopy caches so that memory appears as a uniform memory space even though there are multiple processors.
[[File:Dragon Protocol Processor-Side.png|550px|center]]
<center> '''Figure 2. Dragon Protocol Processor-Side''' </center>

[[File:Dragon Protocol Bus-Side.png|500px|center]]
<center> '''Figure 3. Dragon Protocol Bus-Side''' </center>
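
The four states and their transitions can also be sketched in code. The following is a minimal sketch of the commonly described Dragon transitions, not a transcription of Figures 2 and 3. The names <code>DragonState</code>, <code>processor_side</code>, <code>bus_side</code> and the <code>copies_exist</code> flag (standing for the shared-line signal) are assumptions made for illustration, and an uncached block is represented by <code>None</code>.

```python
from enum import Enum


class DragonState(Enum):
    E = "Exclusive"
    SC = "Shared Clean"
    SM = "Shared Modified"
    M = "Modified"


def processor_side(state, request, copies_exist):
    """Return (next_state, bus_transactions) for a PrRd or PrWr.

    state is None when the block is not cached; copies_exist is True
    when another cache signals that it holds the block.
    """
    if state is None:
        if request == "PrRd":
            return (DragonState.SC if copies_exist else DragonState.E), ["BusRd"]
        # Write miss: fetch the block, then update sharers if there are any.
        if copies_exist:
            return DragonState.SM, ["BusRd", "BusUpd"]
        return DragonState.M, ["BusRd"]
    if request == "PrRd":
        return state, []  # read hit in any state
    if state in (DragonState.E, DragonState.M):
        return DragonState.M, []  # sole copy, write locally
    # Write hit on a shared block: broadcast only the written word.
    next_state = DragonState.SM if copies_exist else DragonState.M
    return next_state, ["BusUpd"]


def bus_side(state, transaction):
    """Return (next_state, must_flush) for a snooped BusRd or BusUpd."""
    if transaction == "BusRd":
        if state == DragonState.E:
            return DragonState.SC, False
        if state == DragonState.M:
            return DragonState.SM, True
        return state, state == DragonState.SM
    if transaction == "BusUpd" and state in (DragonState.SC, DragonState.SM):
        # The writer becomes the owner; this copy takes the new word.
        return DragonState.SC, False
    raise ValueError(f"unexpected {transaction} in state {state}")
```

Note that no transition ever removes a block from a cache, which is why the sketch needs no invalid state.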

The Dragon Protocol is implemented in the [http://en.wikipedia.org/wiki/Cray_CS6400 Cray CS6400], which was available with either 60 MHz or 85 MHz processors. The protocol takes its name from the Xerox Dragon multiprocessor workstation, which was designed as a research machine with numerous processors.

===Firefly Protocol===
The Firefly protocol is another example of an update-based cache coherence protocol. However, unlike the Dragon Protocol, it uses a write-through policy for shared blocks, so changes to shared blocks are also written to memory. The following states can be assigned to a block in this protocol.

* '''Valid Exclusive (VE)''' - cache block is exclusively owned, cache block is clean
* '''Dirty (D)''' - exclusive rights to the cache block, cache block is dirty
* '''Shared (S)''' - cache block is shared but is not modified

The Firefly Protocol uses a dedicated bus line called SharedLine to allow quick detection of copies of the block in other caches. It plays the same role as the COPIES_EXIST (C) and !COPIES_EXIST conditions, and is shown that way in the finite state machines below. As in the Dragon protocol, there is no invalid state because no cache blocks are ever invalidated.

[[File:Firefly Protocol Processor-Side.png|550px|center]]
<center> '''Figure 4. Firefly Protocol Processor-Side''' </center>

[[File:Firefly Protocol Bus-Side.png|550px|center]]
<center> '''Figure 5. Firefly Protocol Bus-Side''' </center>

The Firefly protocol is used in the [http://en.wikipedia.org/wiki/DEC_Firefly DEC Firefly] multiprocessor workstation, developed by [http://en.wikipedia.org/wiki/Digital_Equipment_Corporation Digital Equipment Corporation]. The system is asymmetric and the cache is direct-mapped to support multiprocessing. The cache capacity was 16 KB for the original [http://en.wikipedia.org/wiki/MicroVAX_78032 MicroVAX 78032] microprocessor and was later increased to 64 KB when the system was upgraded to the [http://en.wikipedia.org/wiki/CVAX#CVAX_78034 CVAX 78034] microprocessor.

==== Advantages of Write Update Protocol ====
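Write update is the simplest and most obvious way to propagate a write, and the fastest for the other caches, because they receive the new value without taking a miss.
Also, for a bandwidth-restricted architecture, using write-back caches does not prevent scalability.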

==== Disadvantages ====
* Multiple writes to the same word (with no intervening read) need only one invalidate message but would require an update for each write.
* Writes to the same block in a (usual) multi-word cache block require only one invalidate but would require multiple updates. Due to both spatial and temporal locality, the two cases above occur often.
* Bus bandwidth is a precious commodity in shared-memory multiprocessors.
* Experience has shown that invalidate protocols use significantly less bandwidth.
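
The first two points can be made concrete with a small count of bus messages. This is an illustrative sketch only: it assumes that each write run is followed by exactly one re-read from another cache, and that an update, an invalidate and a re-read each count as one message. The function name <code>coherence_messages</code> is hypothetical.

```python
def coherence_messages(write_run_lengths):
    """Count bus messages for one shared block under each policy.

    Each entry is the number of consecutive writes one processor makes
    before another processor reads the block again.
    """
    # Write update: every write is broadcast to the sharers.
    update = sum(write_run_lengths)
    # Write invalidate: one invalidate per run, plus one re-read miss
    # by the next reader (assumed to be a single reader).
    invalidate = sum(2 for run in write_run_lengths if run > 0)
    return {"update": update, "invalidate": invalidate}
```

For write runs of 4, 1 and 10 writes, this gives 15 messages under write update and 6 under write invalidate. With runs of length 1, update needs fewer messages, which is the case update protocols are designed for.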

==Adaptive Coherence Protocols==

===Introduction===
Even though both update protocols and invalidate protocols have clear advantages, each also has disadvantages. In a write-invalidate protocol, any write by a processor invalidates the shared cache blocks of other processors. This forces the other caches to issue read requests to retrieve the new data. This tends to cause high bus traffic and can be especially bad if a few processors frequently update the cache. Write-update protocols mitigate this issue by updating all other caches at the same time the write is propagated. Unfortunately, this creates a different problem: some updates to the caches of other processors are unnecessary. This tends to increase conflict and capacity cache misses.

Adaptive protocols try to mitigate these problems. They still tend to incur some high bus traffic as well as some unnecessary updates, but both can be reduced depending on how the adaptive algorithm switches between write-invalidate and write-update. There are also adaptive directory-based protocols, but these are not discussed here.

===Subblock protocol===
This snoopy-based protocol combines the features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to gain the advantages of both small and large cache block sizes. <ref>[http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCIQFjAA&url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&ei=2n9kT8gjhPDSAaO2nb4P&usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]</ref>

'''Block states'''

'''Invalid''': All subblocks are invalid

'''Valid Exclusive''': All valid subblocks in this block are not present in any other caches. All subblocks that are Clean Shared may be written without a bus transaction. Any subblocks in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.

'''Clean Shared''': The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.

'''Dirty Shared''': Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.<ref>[http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCIQFjAA&url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&ei=2n9kT8gjhPDSAaO2nb4P&usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]</ref>

'''Finite state diagram of block/ line states is as follows:'''

<center> [[File:line_protocol.png]]</center>
<center> '''Figure 4. Finite state diagram of block''' </center>

'''Subblock states'''

'''Invalid''': The subblock is invalid

'''Clean Shared''': A read access to the block will succeed. Unless the block the subblock is a part of is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.

'''Dirty Shared''': The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.

'''Dirty''': The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.

'''Finite state diagram of subblock is as follows:'''
<center>[[File:subblock_protocol.png]]</center>
<center> '''Figure 5. Finite state diagram of Sub-block''' </center>

The basic idea of this protocol is to do more data transfers between caches and fewer off-chip memory accesses.
In contrast with the Illinois protocol, on a read miss the cache that holds the block sends the requested subblock and also all other Clean Shared and Dirty Shared subblocks in that block.
If the subblock is in main memory, the cache snoopers pass along information about which subblocks are cached, and memory provides the requested subblock along with the subblocks that are not currently cached.
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.

In contrast to the Illinois MESI Protocol, which requires extra power cycles to maintain the extra states and additional logic, the subblock protocol reduces the number of cache blocks compared to any cache update protocol, and thus reduces power consumption. <ref>[http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCIQFjAA&url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&ei=2n9kT8gjhPDSAaO2nb4P&usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]</ref>

===Read-snarfing protocol===
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.

In contrast with the MESI protocol, if one processor reloads data into its cache because of a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block to all caches in which it was invalidated.

The protocol modifies the normal update protocol by updating only those subblocks that are modified, and updates need to be broadcast only when the data is actively shared.
The read-snarfing protocol maintains a counter and an invalidation threshold (Tb) for each cache block "b" to overcome the write-after-read problem of write-invalidate (WI). The protocol predicts the number of write operations that happen on a single cache block before a read request to the same cache block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the program.

A simple algorithm for the read-snarfing Random Walk protocol is as follows.
Initially, Tb of each cache block b is set to 0. <ref>[http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCIQFjAA&url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&ei=2n9kT8gjhPDSAaO2nb4P&usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]</ref>

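<pre>
// R  = invalidation ratio = (Ci + Cr) / Cu
// Ci = cost in bus cycles of an invalidation transaction
// Cu = cost in bus cycles of an update transaction
// Cr = cost in bus cycles of reading a cache block
// write_run = number of writes made before the block was
//             accessed by another processor (most recent run)

if (write_run > R) {
    // Long write run: invalidating sooner would have been cheaper.
    // The lower bound is 0 so that Tb can reach 0, as the text below describes.
    if (Tb > 0) {
        Tb--;
    }
} else {
    // Short write run: updating was the right choice, allow more updates.
    if (R > Tb) {
        Tb++;
    }
}
</pre>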

When the threshold drops to 0, the block is invalidated immediately, with no wasted updates.
When a block is actively shared, Tb is adjusted upward so that the block is not invalidated. <ref>[http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCIQFjAA&url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&ei=2n9kT8gjhPDSAaO2nb4P&usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]</ref>

=====Example 1 =====
Suppose the invalidation ratio (R) is 5 and the current threshold of the block (Tb) is 3.
If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb becomes 4.
This means Tb is at the best possible value for this write run, and only updates are issued.

=====Example 2 =====
Consider R = 5 and Tb = 3 for a particular block.
If the processor writes 10 times before the block is accessed by another processor, Tb decreases to 2.
So the protocol can incur a cost of 2 updates, 1 invalidate and 1 reread.
After two more such write runs, Tb will be 0 and invalidation will occur immediately.
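
The threshold adjustment used in both examples can be sketched as a small function. This is a sketch under one stated assumption: the lower bound of Tb is taken to be 0, so that the threshold can reach 0 as Example 2 describes. The name <code>adjust_threshold</code> and its parameters are hypothetical.

```python
def adjust_threshold(tb, write_run, r):
    """Return the new invalidation threshold after one write run.

    tb: current threshold of the block
    write_run: writes made before another processor accessed the block
    r: invalidation ratio (Ci + Cr) / Cu
    """
    if write_run > r:
        # Long run: move toward invalidating earlier (assumed floor of 0).
        return tb - 1 if tb > 0 else tb
    # Short run: allow more updates, but never beyond r.
    return tb + 1 if r > tb else tb


# Example 1: R = 5, Tb = 3, write run of 4
example_1 = adjust_threshold(3, 4, 5)

# Example 2: R = 5, Tb = 3, three consecutive write runs of 10
tb = 3
history = []
for _ in range(3):
    tb = adjust_threshold(tb, 10, 5)
    history.append(tb)
```

Here <code>example_1</code> is 4 and <code>history</code> is <code>[2, 1, 0]</code>, matching the two examples.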

===Competitive Update Protocol===
A competitive-update protocol is one of the "..hybrid protocols between write-invalidate and write-update.."<ref name="nilsson">
H. Nilsson, P. Stenström "An adaptive update-based cache coherence protocol for reduction of miss rate and traffic"
Proc. Parallel Architectures and Languages Europe (PARLE) Conf., Lecture Notes in Computer Science, Athens, Greece, 817, Springer-Verlag, Berlin (Jul. 1994), pp. 363–374
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.26.7116&rep=rep1&type=pdf Paper]</ref>
These hybrid protocols are used to reduce the coherence miss rate caused by invalidation or update alone. The sole issue here is that there can be high traffic peaks, and these peaks can offset the performance gain.<ref name="nilsson"></ref>
According to Nilsson in <ref name="nilsson2">H. Nilsson, P. Stenström, and M. Dubois, “Implementation and Evaluation of Update-
Based Cache Protocols Under Relaxed Memory Consistency Models”, Technical Report,
Dept. of Computer Engineering, Lund University, Sweden, July 1993</ref>, competitive-update protocols will outperform write-invalidate protocols under relaxed memory consistency. The concept is very simple. The first write to a block causes an update to the other copies of the block. If the local processor holding a copy does not access it before further updates arrive, that copy is invalidated instead. The effect is that only regularly accessed copies of the memory block keep being updated. The limitation here is that migratory data makes this protocol sub-optimal. The latest research done in this area is Competitive Update Protocol with Migratory Detection<ref name="nilsson"></ref>, which recognizes when there is migratory data and compensates.
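
The per-copy decision can be sketched with a counter. This is an illustration rather than the mechanism from the cited papers: it assumes each cached copy keeps a counter that is reset by any local access and decremented by each incoming update, and that the copy invalidates itself once the counter is exhausted. The class name <code>CachedCopy</code> and the <code>threshold</code> parameter are hypothetical.

```python
class CachedCopy:
    """One cache's copy of a block under an assumed competitive-update scheme."""

    def __init__(self, threshold):
        self.threshold = threshold  # updates tolerated without a local access
        self.counter = threshold
        self.valid = True

    def local_access(self):
        """Local read or write; returns True on a hit."""
        if not self.valid:
            return False
        self.counter = self.threshold  # the copy is in use, keep updating it
        return True

    def remote_update(self):
        """Handle a write by another processor; returns the action taken."""
        if not self.valid:
            return "ignored"
        if self.counter == 0:
            # Too many updates without a local access: drop the copy.
            self.valid = False
            return "invalidate"
        self.counter -= 1
        return "update"
```

With <code>threshold=1</code>, the first remote write updates the copy and a second remote write with no local access in between invalidates it, while a copy that is accessed locally between writes keeps receiving updates.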

<center>[[Image:CompetitiveUpdateProtocolWithMigratoryDetection.jpg|800px]]</center>
<center> '''Figure 6. Competitive Update Protocol With Migratory Detection<ref name="nilsson"></ref>''' </center>
<center> '''Coherence actions for detection of migratory blocks (left) and coherence actions for read misses to migratory blocks (right).''' </center>

This is only one of many ways to deal with migratory data. For further reading, a Google Scholar search on "Adaptive Protocols and Migratory" will return many papers on different ways to deal with the migratory data issue that arises when using adaptive protocols.

===Cachet===
Cachet is an adaptive cache coherence protocol that uses micro-protocols <ref name="shen">Xiaowei Shen, Arvind, and Larry Rudolph. 1999. CACHET: an adaptive cache coherence protocol for distributed shared-memory systems. In Proceedings of the 13th international conference on Supercomputing (ICS '99). ACM, New York, NY, USA, 135-144. DOI=10.1145/305138.305187[http://doi.acm.org/10.1145/305138.305187 Paper]</ref> Cachet recognizes that shared-memory programs have various access patterns and no fixed cache coherence protocol works well for all access patterns.<ref name="bennet">J. K. Bennett, J. B. Carter, and W. Zwaenepoel. Adaptive Software Cache Management for Distributed Shared Memory Architectures. In Proceedings of the 17th Annual International Symposium on Computer Architecture, May 1990</ref><ref name="eggers">S. Eggers and R. H. Katz. Evaluating the Performance for Four Snooping Cache Coherency Protocols. In Proceedings of the 16th Annual International Symposium on Computer Architecture, May 1989</ref><ref name="falsafi">B. Falsafi, A. R. Lebeck, S. K. Reinhardt, I. Schoinas, M. D. Hill, J. R. Larus, A. Rogers, and D. A. Wood. Application specific protocols for user-level shared memory. In Supercomputing, Nov. 1994</ref><ref name="weber">W. D. Weber and A. Gupta. Analysis of Cache Invalidation Patterns in Multiprocessors. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, 1989</ref>. What cachet attempts to do is either take in the access pattern through program annotations from the programmer or recognition by the compiler.

So how does it work?

<blockquote>"Cachet-Base: The most straightforward implementation simply uses the memory as the rendezvous. When a Commit instruction is executed for an address that is cached in the Dirty state, the data must be written back to the memory before the instruction can complete. A Reconcile instruction for an address cached in the Clean state requires the data be purged from the cache before the instruction can complete. An attractive characteristic of Cachet-Base is its simplicity; no state needs to be maintained at the memory side."<ref name=shen></ref></blockquote>

<blockquote>"Cachet-WriterPush: Since load operations are usually more frequent than store operations, it is desirable to allow a Reconcile instruction to complete even when the address is cached in the Clean state. Thus, the following load access to the address causes no cache miss. Correspondingly, when a Commit instruction is performed on a dirty cell, it cannot complete before clean copies of the address are purged from all other caches. Therefore, it can be a lengthy process to commit an address that is cached in the Dirty state."<ref name=shen></ref></blockquote>

<blockquote>"Cachet-Migratory: When an address is exclusively accessed by one processor for a reasonable time period, it makes sense to give the cache the exclusive ownership so that all instructions on the address become local operations. This is reminiscent of the exclusive state in conventional MESI like protocols. The protocol ensures that an address can be cached in at most one cache at any time. Therefore, a Commit instruction can complete even when the address is cached in the Dirty state, and a Reconcile instruction can complete even when the address is cached in the Clean state. The exclusive ownership can migrate among different caches whenever necessary."<ref name=shen></ref></blockquote>

<blockquote>"Different micro-protocols are optimized for different access patterns. Cachet-Base is ideal when the location is randomly accessed by multiple processors, and only necessary commit and reconcile operations are invoked. A conventional implementation of release consistency usually requires that all addresses be indistinguishably committed before a release, and reconciled after an acquire. Such excessive use of commit and reconcile operations can result in performance degradation under Cachet-Base."<ref name=shen></ref></blockquote>

<blockquote>"Cachet-WriterPush is appropriate when certain processors are likely to read an address many times before another processor writes the address. A reconcile operation performed on a clean copy causes no purge operation, regardless of whether the reconcile is necessary. Thus, subsequent load operations to the address can continually use the cached data without causing any cache miss. Cachet Migratory fits well when one processor is likely to read and write an address many times before another processor accesses the address."<ref name=shen></ref></blockquote>
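
The Cachet-Base rules quoted above can be sketched in a few lines. This is a toy model based only on the quoted description: memory is the rendezvous, Commit writes a Dirty cell back, and Reconcile purges a Clean cell. The class name <code>CachetBaseCache</code>, the dictionary-based memory and the method names are assumptions for illustration and are not taken from the paper.

```python
class CachetBaseCache:
    """Toy model of one cache under the Cachet-Base micro-protocol."""

    def __init__(self, memory):
        self.memory = memory  # shared dict: address -> value
        self.cells = {}       # address -> [value, "Clean" or "Dirty"]

    def load(self, addr):
        if addr not in self.cells:
            # Miss: fetch from memory and cache the value as Clean.
            self.cells[addr] = [self.memory[addr], "Clean"]
        return self.cells[addr][0]

    def store(self, addr, value):
        self.cells[addr] = [value, "Dirty"]

    def commit(self, addr):
        """A Dirty cell must be written back before Commit completes."""
        cell = self.cells.get(addr)
        if cell is not None and cell[1] == "Dirty":
            self.memory[addr] = cell[0]
            cell[1] = "Clean"

    def reconcile(self, addr):
        """A Clean cell must be purged before Reconcile completes."""
        cell = self.cells.get(addr)
        if cell is not None and cell[1] == "Clean":
            del self.cells[addr]
```

In this model a reader that has cached a value keeps seeing it until it executes <code>reconcile</code>, after which its next <code>load</code> fetches whatever a writer has committed to memory.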

What is so interesting about Cachet is its ability to switch between these micro-protocols. This excerpt from the paper does the best job of explaining it.

==Power Considerations==
A major issue when considering power is how many bus transactions occur on the bus. Different protocols require different bus transactions, so we can loosely demonstrate how much power each technique uses by comparing their bus transactions over the same read/write pattern. We will use the patterns found in '''Ch 8 of Solihin'''.

[[File:MSI_Protocol_operation.png|600px]]

[[File:MOESI_Protocol_operation.png|600px]]

[[File:Dragon_Protocol_operation.png|600px]]

As you can see from the above spreadsheets, the update-based protocol demonstrates better power consumption in terms of bus transactions and memory accesses. The invalidation protocols require 6 bus transactions plus 4 memory accesses for the MSI protocol, and 5 bus transactions plus 1 memory access for the MOESI protocol. This is higher than the Dragon protocol, which used only 4 bus transactions and 1 memory access. By counting the number of bus transactions and memory accesses over the same sequence of processor calls, we are able to put the different protocols on the same footing and compare them. The Dragon protocol also requires less time on the bus because it passes only modified words instead of whole cache blocks.

Unfortunately, we are unable to provide rough estimates for adaptive protocols. In most cases with competitive update and Cachet, the result depends on the problem itself. For example, bus traffic can vary under competitive update, which invalidates cache blocks that are not regularly used, and how often this happens varies from program to program. What makes power estimates hard for adaptive protocols is their adaptive nature: depending on the program involved, the performance and power used can vary drastically.

==References==
<references />

File:Wiki first shared.jpg

2013-03-27T18:13:58Z

Amahaba:

File:Diags.png

2013-03-22T16:19:16Z

Amahaba:

2013-03-20T23:30:58Z

Amahaba: /* Update and Adaptive Coherence Protocols on Real Architectures */

=Update and Adaptive Coherence Protocols on Real Architectures=

In parallel computer architectures, [http://en.wikipedia.org/wiki/Cache_coherence '''cache coherence'''] refers to the consistency of data stored in the caches of individual processors and in shared memory. The problem arises because multiple processors each have their own cache. When a processor updates its cached copy of a shared memory location, every other cache holding that location must observe the new value. Figure 1 illustrates the situation.

<center>[[Image:Cache Coherency Generic.png|400px| Multiple Caches of Shared Resource]]</center>
<center> '''Figure 1. Multiple Caches of Shared Resource''' </center>

There are two ways to maintain cache consistency<ref name="glasco">Glasco, D.B.; Delagi, B.A.; Flynn, M.J.; , "Update-based cache coherence protocols for scalable shared-memory multiprocessors," System Sciences, 1994. Proceedings of the Twenty-Seventh Hawaii International Conference on , vol.1, no., pp.534-545, 4-7 Jan. 1994 doi: 10.1109/HICSS.1994.323135
[http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=323135&isnumber=7709 Paper]</ref>, invalidation based and update based.

An invalidation-based protocol purges the copies of the line from the other caches, which leaves a single copy of the line, whereas an update-based protocol forwards the written value to the other caches, after which all caches are consistent. One of the drawbacks of an invalidation-based protocol is that it incurs a high number of coherence misses. To address this, one can use an update coherence protocol or a newer type of protocol called an adaptive coherence protocol.

==Update Coherence Protocol==

===Introduction===
Update-based cache coherence protocols work by directly updating all the cache values in the system. This differs from the invalidation-based protocols because it achieves write propagation without having to invalidate data, which avoids the resulting cache misses. This saves on numerous coherence misses, time spent to correct the miss, and bandwidth usage. The update-based protocols we will be discussing in this section are the [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol] and the [http://en.wikipedia.org/wiki/Firefly_protocol Firefly protocol].

===Dragon Protocol===
The [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol] is an implementation of update-based coherence protocol. It further saves on bandwidth by updating the specific words within the cache instead of the entire block. The caches use write allocate and write update policies. The Dragon Protocol is made up of four states ('''Modified (M)''', '''Exclusive (E)''', '''Shared Modified (Sm)''', '''Shared Clean (Sc)''') and does not include an invalidation state, because if the block is in cache it is valid.

* '''Modified (M)''' - cache block is exclusively owned, however it can be different from main memory
* '''Exclusive (E)''' - cache block is clean and is only in one cache
* '''Shared Modified (Sm)''' - cache block resides in multiple caches and is possibly dirty
* '''Shared Clean (Sc)''' - cache block resides in multiple caches and is clean

There is no invalid state, because if a block is cached then it is assumed to be valid. However, it can differ from main memory. Below are the finite state machines for the processor-side calls and bus-side calls. The Dragon protocol uses snoopy caches so that memory appears as a uniform memory space even though there are multiple processors.
[[File:Dragon Protocol Processor-Side.png|550px|center]]
<center> '''Figure 2. Dragon Protocol Processor-Side''' </center>

[[File:Dragon Protocol Bus-Side.png|500px|center]]
<center> '''Figure 3. Dragon Protocol Bus-Side''' </center>

The Dragon Protocol is implemented in the [http://en.wikipedia.org/wiki/Cray_CS6400 Cray CS6400], which was available with either 60 MHz or 85 MHz processors. The protocol was developed by Xerox and takes its name from the Xerox Dragon multiprocessor workstation, which was designed as a research machine with numerous processors.

===Firefly Protocol===
The [http://en.wikipedia.org/wiki/Firefly_protocol Firefly protocol] is another example of an update-based cache coherence protocol. However, unlike the [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol], it uses a write-through policy for shared blocks, so changes to shared blocks are also written to memory.

* '''Valid Exclusive (VE)''' - cache block is exclusively owned, cache block is clean
* '''Dirty (D)''' - exclusive rights to the cache block, cache block is dirty
* '''Shared (S)''' - cache block is shared but is not modified

The Firefly Protocol uses a dedicated bus line called SharedLine to allow quick detection of copies of the block in other caches. It plays the same role as the COPIES_EXIST (C) and !COPIES_EXIST conditions, and is shown that way in the finite state machines below. As in the Dragon protocol, there is no invalid state because no cache blocks are ever invalidated.

[[File:Firefly Protocol Processor-Side.png|550px|center]]
<center> '''Figure 4. Firefly Protocol Processor-Side''' </center>

[[File:Firefly Protocol Bus-Side.png|550px|center]]
<center> '''Figure 5. Firefly Protocol Bus-Side''' </center>
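
The processor-side behaviour can be sketched in the same style. This is a hedged illustration, not code from the DEC Firefly: the function name <code>firefly_processor_side</code>, the event names <code>BusRd</code> and <code>BusWr</code>, and the use of <code>None</code> for a block that is not cached are assumptions. It follows the common description in which a write to a shared block is written through on the bus and the SharedLine decides whether the block stays shared.

```python
# States follow the list above: Valid Exclusive, Dirty, Shared.
VE, D, S = "VE", "D", "S"


def firefly_processor_side(state, op, shared_line):
    """Return (next_state, bus_actions) for a local PrRd or PrWr.

    shared_line is True when another cache asserts SharedLine.
    """
    if state is None:                      # miss: the block must be fetched
        if op == "PrRd":
            return (S if shared_line else VE), ["BusRd"]
        # Write miss: write through if others hold the block, else own it.
        return (S, ["BusRd", "BusWr"]) if shared_line else (D, ["BusRd"])
    if op == "PrRd":                       # read hits never change state
        return state, []
    if state in (VE, D):                   # exclusive copy: write locally
        return D, []
    # Write to a shared block: written through so other copies are updated.
    return (S if shared_line else VE), ["BusWr"]
```

In this sketch a write to a Shared block whose sharers have disappeared (<code>shared_line</code> is false) moves the block to Valid Exclusive, since the write-through has left memory up to date.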

The Firefly protocol is used in the [http://en.wikipedia.org/wiki/DEC_Firefly DEC Firefly] multiprocessor workstation, developed by [http://en.wikipedia.org/wiki/Digital_Equipment_Corporation Digital Equipment Corporation]. The system is asymmetric and the cache is direct-mapped to support multiprocessing. The cache capacity was 16 KB for the original [http://en.wikipedia.org/wiki/MicroVAX_78032 MicroVAX 78032] microprocessor and was later increased to 64 KB when the system was upgraded to the [http://en.wikipedia.org/wiki/CVAX#CVAX_78034 CVAX 78034] microprocessor.

==Adaptive Coherence Protocols==

===Introduction===
Even though update protocols and invalidate protocols each have clear advantages, each also has disadvantages. In a write-invalidate protocol, any write by a processor invalidates the shared copies of the block in the other processors' caches. This forces the other caches to issue read requests to retrieve the new data. This tends to consume a lot of bus bandwidth and can be especially bad if a few processors frequently update the block. Write-update protocols mitigate this issue by updating all other cached copies at the time of the write. Unfortunately, this creates a different problem: some of these updates are unnecessary, because the other processors may never use the updated data. Keeping such unused blocks in the cache also tends to increase conflict and capacity misses.

Adaptive protocols try to mitigate these problems. An adaptive protocol will still incur some bus traffic from misses as well as some unnecessary updates, but both can be reduced depending on how the adaptive algorithm switches between write-invalidate and write-update. There are also adaptive directory-based protocols, but these are not discussed here.

===Subblock protocol===
This snoopy-based protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] with subblock-level validation, using subblocks to gain the advantages of both small and large cache block sizes. <ref>[http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCIQFjAA&url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&ei=2n9kT8gjhPDSAaO2nb4P&usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]</ref>

<dt>
=====Block states=====

<dd> Invalid: All subblocks are invalid

<dd> Valid Exclusive: All valid subblocks in this block are not present in any other caches. All subblocks that are clean shared may be written without a bus transaction. Any subblocks in the block may be invalid. There also may be Dirty blocks which must be written back upon replacement.

<dd> Clean Shared: The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.

<dd> Dirty Shared: Subblocks in this block may be in any state. There may be Dirty blocks which must be written back on replacement.<ref>[http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCIQFjAA&url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&ei=2n9kT8gjhPDSAaO2nb4P&usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]</ref>

<dt>

'''Finite state diagram of block/ line states is as follows:'''

<center> [[File:line_protocol.png]]</center>
<center> '''Figure 4. Finite state diagram of block''' </center>

=====Subblock states=====

<dd> Invalid: The subblock is invalid

<dd> Clean Shared: A read access to the block will succeed. Unless the block the subblock is a part of is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.

<dd> Dirty Shared: The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.

<dd> Dirty: The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.

'''Finite state diagram of subblock is as follows:'''
<center>[[File:subblock_protocol.png]]</center>
<center> '''Figure 5. Finite state diagram of Sub-block''' </center>

The basic idea of this protocol is to perform more cache-to-cache data transfers and fewer off-chip memory accesses.
In contrast with the Illinois protocol, on a read miss the cache that holds the block sends the requested subblock along with all other clean shared and dirty shared subblocks in that block.
If the subblock is supplied by main memory, the snooping caches indicate which subblocks they hold, and memory provides the requested subblock along with the subblocks that are not currently cached.
On a write to a clean subblock and on a write miss, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.

In contrast to the Illinois MESI protocol, which requires extra power to maintain the extra states and additional logic, the subblock protocol reduces the number of cache blocks, compared to cache update protocols, and thus reduces power consumption. <ref>[http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCIQFjAA&url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&ei=2n9kT8gjhPDSAaO2nb4P&usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]</ref>
</dl>
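
The effect of invalidating at subblock granularity can be shown with a very small model. This Python sketch is only an illustration: the class <code>CacheBlock</code>, its methods, and the short state codes are hypothetical names, and the bus, tags and data are omitted entirely.

```python
# Subblock states from the list above (short codes are an assumption).
INVALID, CLEAN_SHARED, DIRTY_SHARED, DIRTY = "I", "CS", "DS", "D"


class CacheBlock:
    """A cache block that tracks a state for each of its subblocks."""

    def __init__(self, num_subblocks):
        self.sub = [CLEAN_SHARED] * num_subblocks

    def snoop_invalidate(self, index):
        # A remote write invalidates only the addressed subblock,
        # so the neighbouring subblocks stay readable.
        self.sub[index] = INVALID

    def read_hits(self, index):
        return self.sub[index] != INVALID


block = CacheBlock(num_subblocks=4)
block.snoop_invalidate(2)          # another processor writes subblock 2
print(block.read_hits(0), block.read_hits(2))
```

With whole-block invalidation the read of subblock 0 would also miss even though its data never changed, which is the false-sharing penalty the subblock protocol is trying to avoid.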

===Read-snarfing protocol===
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.

In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, read snarfing also supplies the block to the other processors whose copies were invalidated earlier. Only one read is required to restore the block in all caches whose copy was invalidated.
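
A minimal sketch of this behaviour, assuming a single shared bus on which every cache observes every read reply. The class <code>SnarfingCache</code>, the function <code>bus_read</code>, and the dictionary-based memory are hypothetical names chosen for illustration.

```python
class SnarfingCache:
    """A cache that keeps the tag of invalidated lines so it can snarf."""

    def __init__(self):
        self.lines = {}                    # address -> (valid, data)

    def snoop_read_reply(self, address, data):
        # Take the data off the bus only if we hold an invalidated copy.
        if address in self.lines and not self.lines[address][0]:
            self.lines[address] = (True, data)


def bus_read(requester, others, memory, address):
    """One bus read that refills the requester and lets the others snarf."""
    data = memory[address]
    requester.lines[address] = (True, data)
    for cache in others:                   # every cache sees the reply
        cache.snoop_read_reply(address, data)
    return data


memory = {0x40: 7}
a, b, c = SnarfingCache(), SnarfingCache(), SnarfingCache()
b.lines[0x40] = (False, None)              # b's copy was invalidated earlier
bus_read(a, [b, c], memory, 0x40)
print(b.lines[0x40], 0x40 in c.lines)
```

Cache <code>b</code> is refilled by the same transaction that served <code>a</code>, while <code>c</code>, which never held the block, is left untouched.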

The protocol modifies the normal update protocol by updating only those subblocks that are modified, and updates need to be broadcast only while the data is actively shared.
The read-snarfing protocol maintains a counter and an invalidation threshold (Tb) for each cache block “b” to overcome the write-after-read drawback of write-invalidate (WI). The protocol predicts the number of write operations that occur on a single cache block before a read request to the same cache block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the executing program.

A simple version of the read-snarfing Random Walk algorithm is as follows:
Initially, Tb of each cache block b is set to 0. <ref>[http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCIQFjAA&url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&ei=2n9kT8gjhPDSAaO2nb4P&usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]</ref>

<pre>

// write_run: number of writes to block b before it is accessed by another processor
if (write_run > R) {
    // Long write run: lower the threshold (but not below 1)
    if (Tb > 1) {
        Tb--;
    }
} else {
    // Short write run: raise the threshold while it is below R
    if (R > Tb) {
        Tb++;
    }
}

// R  = invalidation ratio, which is (Ci + Cr) / Cu
// Ci = cost in bus cycles of an invalidation transaction
// Cu = cost in bus cycles of an update transaction
// Cr = cost in bus cycles of reading a cache block

</pre>

A block is invalidated immediately, with no wasted updates, when its threshold drops to 0.
When a block is actively shared, adjusting Tb upward keeps the block from being invalidated. <ref>[http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCIQFjAA&url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&ei=2n9kT8gjhPDSAaO2nb4P&usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]</ref>
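
One way to read the invalidation ratio R, offered here as an interpretation rather than a statement taken from the cited paper: suppose a processor writes a block n times before another processor reads it (n is a symbol introduced here for illustration). Sending n updates costs n·Cu bus cycles, while invalidating once and letting the reader re-fetch the block costs Ci + Cr. Updating is then the cheaper choice exactly when

```latex
n \cdot C_u < C_i + C_r
\quad\Longleftrightarrow\quad
n < \frac{C_i + C_r}{C_u} = R
```

Under this reading, R is the break-even write-run length, which is why the algorithm compares the most recent write run against R before moving Tb.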
<dl>
<dd>
=====Example 1 =====
<dd>Suppose invalidation ratio (R) = 5
<dd>Current threshold of the block (Tb) = 3
<dd>If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb will be 4.
<dd>This means Tb is at the best possible value and only updates are issued.

=====Example 2 =====
<dd>Consider R = 5 and Tb = 3 for a particular block.
<dd>If the processor writes 10 times before the block is accessed by another processor,
<dd>Tb will be 2 (decreased).
<dd>So the protocol can incur a cost of 2 updates, 1 invalidate and 1 reread.
<dd>After 2 more writes, Tb will be 0 and invalidation will occur immediately.

</dl>
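
The adjustment step used in the two examples can be checked with a direct transcription of the pseudocode into Python. This is a sketch for illustration; the function name <code>adjust_threshold</code> and its parameter names are not from the source.

```python
def adjust_threshold(tb, write_run, r):
    """Apply one random-walk adjustment step to the threshold Tb.

    tb        -- current invalidation threshold of the block
    write_run -- writes to the block before another processor accessed it
    r         -- invalidation ratio (Ci + Cr) / Cu
    """
    if write_run > r:
        if tb > 1:          # long write run: lower the threshold
            tb -= 1
    elif r > tb:            # short write run: raise the threshold
        tb += 1
    return tb


R = 5
print(adjust_threshold(3, 4, R))    # Example 1: write run of 4 -> 4
print(adjust_threshold(3, 10, R))   # Example 2: write run of 10 -> 2
```

As transcribed, the <code>tb > 1</code> guard means that this step by itself never lowers the threshold below 1.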

===Competitive Update Protocol===
A competitive-update protocol is a "..hybrid protocols between write-invalidate and write-update.."<ref name="nilsson">
H. Nilsson, P. Stenström "An adaptive update-based cache coherence protocol for reduction of miss rate and traffic"
Proc. Parallel Architectures and Languages Europe (PARLE) Conf., Lecture Notes in Computer Science, Athens, Greece, 817, Springer-Verlag, Berlin (Jul. 1994), pp. 363–374
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.26.7116&rep=rep1&type=pdf Paper]</ref>
These hybrid protocols are used to reduce the coherence miss rate caused by invalidation or update alone. The sole issue here is that there can be high traffic peaks, and these peaks can offset the performance gain.<ref name="nilsson"></ref>
According to Nilsson in <ref name="nilsson2">H. Nilsson, P. Stenström, and M. Dubois, “Implementation and Evaluation of Update-
Based Cache Protocols Under Relaxed Memory Consistency Models”, Technical Report,
Dept. of Computer Engineering, Lund University, Sweden, July 1993</ref>, competitive-update protocols outperform write-invalidate protocols under relaxed memory consistency models. The concept is simple. The first write to a block sends an update to the other copies of the block. If the local processor of a copy does not access that copy between updates, the copy is invalidated instead. The effect is that only regularly accessed copies of a memory block keep being updated. The limitation is that migratory data makes this protocol sub-optimal. One extension is the Competitive Update Protocol with Migratory Detection<ref name="nilsson"></ref>, which recognizes migratory data and compensates for it.

<center>[[Image:CompetitiveUpdateProtocolWithMigratoryDetection.jpg|800px]]</center>
<center> '''Figure 6. Competitive Update Protocol With Migratory Detection<ref name="nilsson"></ref>''' </center>
<center> '''Coherence actions for detection of migratory blocks (left) and coherence actions for read misses to migratory blocks (right).''' </center>

This is only one of many ways to deal with migratory data. For further reading, a Google Scholar search on "Adaptive Protocols and Migratory" will return many papers on different ways to deal with the migratory data issue that arises when using adaptive protocols.
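
The basic competitive-update idea (without migratory detection) can be sketched with a per-copy counter. This Python sketch is an assumption-laden illustration, not the implementation from the cited papers: the class <code>CompetitiveCopy</code>, the <code>threshold</code> parameter, and the method names are hypothetical.

```python
class CompetitiveCopy:
    """One cache's copy of a block under a competitive-update policy."""

    def __init__(self, threshold):
        self.threshold = threshold   # hypothetical tuning parameter
        self.unused_updates = 0      # updates received since the last local access
        self.valid = True

    def local_access(self):
        # The local processor used the block, so it is worth keeping updated.
        self.unused_updates = 0

    def remote_update(self):
        # Another processor wrote the block and broadcast an update.
        if not self.valid:
            return
        self.unused_updates += 1
        if self.unused_updates >= self.threshold:
            self.valid = False       # stop receiving updates for this block


copy = CompetitiveCopy(threshold=2)
copy.remote_update()
copy.local_access()                  # the block is in use: counter resets
copy.remote_update()
still_valid = copy.valid
copy.remote_update()                 # second unused update in a row
print(still_valid, copy.valid)
```

A copy that is read between updates keeps receiving them, while a copy that goes unused for <code>threshold</code> consecutive updates drops out, which is how the protocol limits unnecessary update traffic.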

===Cachet===
Cachet is an adaptive cache coherence protocol that uses micro-protocols.<ref name="shen">Xiaowei Shen, Arvind, and Larry Rudolph. 1999. CACHET: an adaptive cache coherence protocol for distributed shared-memory systems. In Proceedings of the 13th international conference on Supercomputing (ICS '99). ACM, New York, NY, USA, 135-144. DOI=10.1145/305138.305187[http://doi.acm.org/10.1145/305138.305187 Paper]</ref> Cachet recognizes that shared-memory programs have various access patterns and no fixed cache coherence protocol works well for all access patterns.<ref name="bennet">J. K. Bennett, J. B. Carter, and W. Zwaenepoel. Adaptive Software Cache Management for Distributed Shared Memory Architectures. In Proceedings of the 17th Annual International Symposium on Computer Architecture, May 1990</ref><ref name="eggers">S. Eggers and R. H. Katz. Evaluating the Performance for Four Snooping Cache Coherency Protocols. In Proceedings of the 16th Annual International Symposium on Computer Architecture, May 1989</ref><ref name="falsafi">B. Falsafi, A. R. Lebeck, S. K. Reinhardt, I. Schoinas, M. D. Hill, J. R. Larus, A. Rogers, and D. A. Wood. Application specific protocols for user-level shared memory. In Supercomputing, Nov. 1994</ref><ref name="weber">W. D. Weber and A. Gupta. Analysis of Cache Invalidation Patterns in Multiprocessors. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, 1989</ref> Cachet obtains the access pattern either through program annotations supplied by the programmer or through recognition by the compiler.

So how does it work?

<blockquote>"Cachet-Base: The most straightforward implementation simply uses the memory as the rendezvous. When a Commit instruction is executed for an address that is cached in the Dirty state, the data must be written back to the memory before the instruction can complete. A Reconcile instruction for an address cached in the Clean state requires the data be purged from the cache before the instruction can complete. An attractive characteristic of Cachet-Base is its simplicity; no state needs to be maintained at the memory side."<ref name=shen></ref></blockquote>

<blockquote>"Cachet-WriterPush: Since load operations are usually more frequent than store operations, it is desirable to allow a Reconcile instruction to complete even when the address is cached in the Clean state. Thus, the following load access to the address causes no cache miss. Correspondingly, when a Commit instruction is performed on a dirty cell, it cannot complete before clean copies of the address are purged from all other caches. Therefore, it can be a lengthy process to commit an address that is cached in the Dirty state."<ref name=shen></ref></blockquote>

<blockquote>"Cachet-Migratory: When an address is exclusively accessed by one processor for a reasonable time period, it makes sense to give the cache the exclusive ownership so that all instructions on the address become local operations. This is reminiscent of the exclusive state in conventional MESI like protocols. The protocol ensures that an address can be cached in at most one cache at any time. Therefore, a Commit instruction can complete even when the address is cached in the Dirty state, and a Reconcile instruction can complete even when the address is cached in the Clean state. The exclusive ownership can migrate among different caches whenever necessary."<ref name=shen></ref></blockquote>

<blockquote>"Different micro-protocols are optimized for different access patterns. Cachet-Base is ideal when the location is randomly accessed by multiple processors, and only necessary commit and reconcile operations are invoked. A conventional implementation of release consistency usually requires that all addresses be indistinguishably committed before a release, and reconciled after an acquire. Such excessive use of commit and reconcile operations can result in performance degradation under Cachet-Base."<ref name=shen></ref></blockquote>

<blockquote>"Cachet-WriterPush is appropriate when certain processors are likely to read an address many times before another processor writes the address. A reconcile operation performed on a clean copy causes no purge operation, regardless of whether the reconcile is necessary. Thus, subsequent load operations to the address can continually use the cached data without causing any cache miss. Cachet Migratory fits well when one processor is likely to read and write an address many times before another processor accesses the address."<ref name=shen></ref></blockquote>
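
The rules quoted above can be summarised as two small decision functions. This Python sketch is a paraphrase for illustration only: the function names and the returned strings are hypothetical, and the cases the quotations do not spell out (a Commit on a block that is not Dirty, a Reconcile on a block that is not Clean) are assumed here to complete locally.

```python
def commit_action(protocol, state):
    """What a Commit must do before it can complete."""
    if state != "Dirty":
        return "complete locally"            # assumption: nothing to write back
    if protocol == "Base":
        return "write data back to memory"
    if protocol == "WriterPush":
        return "purge clean copies from all other caches"
    return "complete locally"                # Migratory: exclusive owner


def reconcile_action(protocol, state):
    """What a Reconcile must do before it can complete."""
    if state == "Clean" and protocol == "Base":
        return "purge the address from this cache"
    return "complete locally"                # WriterPush, Migratory, or not Clean


for protocol in ("Base", "WriterPush", "Migratory"):
    print(protocol, "|", commit_action(protocol, "Dirty"),
          "|", reconcile_action(protocol, "Clean"))
```

The printed table makes the trade-off visible: Cachet-Base pays on both operations, Cachet-WriterPush makes Reconcile cheap at the expense of Commit, and Cachet-Migratory makes both local as long as a single cache owns the address.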

What makes Cachet interesting is its ability to switch between these micro-protocols; the paper describes this mechanism in detail.<ref name=shen></ref>

==Power Considerations==
A major issue when considering power is how many bus transactions occur. Different protocols require different bus transactions, so we can loosely compare how much power each technique uses by counting their bus transactions over the same read/write pattern. We will use the patterns found in '''Ch 8 of Solihin'''.

[[File:MSI_Protocol_operation.png|600px]]

[[File:MOESI_Protocol_operation.png|600px]]

[[File:Dragon_Protocol_operation.png|600px]]

As you can see from the above spreadsheets, the update-based protocol demonstrates better power consumption in terms of bus transactions and memory accesses. The invalidation protocols require 6 bus transactions plus 4 memory accesses for the MSI protocol, and 5 bus transactions plus 1 memory access for the MOESI protocol. This is higher than the Dragon protocol, which used only 4 bus transactions and 1 memory access. By counting the number of bus transactions and memory accesses over the same sequence of processor requests, we can compare the protocols on an equal footing. The Dragon protocol also requires less time on the bus because it passes only modified words instead of whole cache blocks.

Unfortunately, we are unable to provide rough estimates for adaptive protocols. With competitive update and Cachet, the result in most cases depends on the program itself. For example, bus traffic can vary under competitive update, which invalidates cache blocks that are not regularly used, and how many blocks fall into that category varies from program to program. What makes power estimates hard for adaptive protocols is their adaptive nature: depending on the program involved, the performance and power used can vary drastically.

==Quiz==
Question 1 : What write protocol is used in the Dragon Protocol?
a) write-permeate
b) write-through
c) write-back
d) write-update

Question 2 : How does the Dragon Protocol save on bandwidth?
a) does not have an invalid state and cannot incur read/write misses
b) consistently updates the memory
c) updates specific words within cache instead of entire blocks
d) only flushes the memory when in the exclusive state

Question 3 : How do the states of the Dragon protocol differ from those of the MOESI protocol?
a) The Dragon shared states can allow for copies to exist in multiple caches
b) split up the share state into clean and modified
c) Exclusive state in the Dragon protocol can be dirty
d) The Modified state is exclusively owned in the MOESI protocol

Question 4 : Why do update-based protocols not require an invalidation state?
a) because it does not allow for an invalid block
b) if the block is in the cache then it is valid
c) (a) and (b) are correct
d) None of the above

Question 5 : Which of the following is NOT true of the states in the Firefly protocol?
a) The states use the SharedLine technique to detect copies of the cache block
b) Dirty has multiple modified cache blocks
c) Shared allows for multiple copies of the cache block to exist
d) Valid Exclusive allows for only one copy of the cache to be in the state

Question 6 : What are two adaptive protocols?
a) Competitive Update
b) Competitive Invalidate
c) Cachet
d) Roulette

Question 7 : Competitive Update Protocol uses what protocol(s) as its basis?
a) Update
b) Invalidate
c) Both (a) and (b)
d) None of the above

Question 8 : What is the major drawback for adaptive protocols?
a) Write-Updates
b) Write-Invalidates
c) Overall Hardware Limitations
d) Migratory Data

Question 9 : What is not a sub-protocol of Cachet?
a) Cachet-Base
b) Cachet-WriterPush
c) Cachet-ReadPush
d) Cachet-Migratory

Question 10 : What is the basis for Cachet protocol?
a) It switches between update and invalidate
b) It switches between three sub-protocols
c) It switches between update and competitive update protocols
d) It switches between invalidate and competitive update protocols

==References==
<references />

CSC/ECE 506 Spring 2013/8b ap

2013-03-20T23:28:51Z

Amahaba: /* Update and Adaptive Coherence Protocols on Real Architectures */

=Update and Adaptive Coherence Protocols on Real Architectures=

In parallel computer architectures, [http://en.wikipedia.org/wiki/Cache_coherence '''cache coherence'''] refers to the consistency of data that is stored throughout the caches on individual processors or throughout the shared memory. The problem here is that we have multiple caches on multiple processors. When an update to a single cache makes changes to a shared memory, you will need to have all the caches be coherent on the value change. This is better shown below.

<center>[[Image:Cache Coherency Generic.png|400px| Multiple Caches of Shared Resource]]</center>
<center> '''Figure 1. Multiple Caches of Shared Resource''' </center>

There are two ways to maintain cache consistency<ref name="glasco">Glasco, D.B.; Delagi, B.A.; Flynn, M.J.; , "Update-based cache coherence protocols for scalable shared-memory multiprocessors," System Sciences, 1994. Proceedings of the Twenty-Seventh Hawaii International Conference on , vol.1, no., pp.534-545, 4-7 Jan. 1994
doi: 10.1109/HICSS.1994.323135
[http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=323135&isnumber=7709 Paper]</ref>, invalidation based and update based.

An invalidation-based protocol purges the copies of the line from the other caches, which results in a single copy of the line, whereas an update-based protocol forwards the written value to the other caches, after which all caches are consistent.<ref name="glasco"></ref>

One of the drawbacks of an invalidate-based protocol is that it incurs a high number of coherence misses. To solve this, one can use an update coherence protocol, or a newer type of protocol called an adaptive coherence protocol.

==Update Coherence Protocol==

===Introduction===
Update-based cache coherence protocols work by directly updating all the cached copies in the system. This differs from invalidation-based protocols because write propagation is achieved without invalidating copies and incurring the resulting misses. This saves on numerous coherence misses, the time spent servicing those misses, and bandwidth usage. The update-based protocols we will be discussing in this section are the [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol] and the [http://en.wikipedia.org/wiki/Firefly_protocol Firefly protocol].

===Dragon Protocol===
The [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol] is an implementation of update-based coherence protocol. It further saves on bandwidth by updating the specific words within the cache instead of the entire block. The caches use write allocate and write update policies. The Dragon Protocol is made up of four states ('''Modified (M)''', '''Exclusive (E)''', '''Shared Modified (Sm)''', '''Shared Clean (Sc)''') and does not include an invalidation state, because if the block is in cache it is valid.

* '''Modified (M)''' - cache block is exclusively owned, however it can be different from main memory
* '''Exclusive (E)''' - cache block is clean and is only in one cache
* '''Shared Modified (Sm)''' - cache block resides in multiple caches and is possibly dirty
* '''Shared Clean (Sc)''' - cache block resides in multiple caches and is clean

There is no invalid state, because a block that is cached is assumed to be valid. However, its contents can differ from main memory. Below are the finite state machines for the processor-side requests and bus-side requests. The Dragon protocol uses snoopy caches so that the system appears to have a single uniform memory space even though there are multiple processors.
[[File:Dragon Protocol Processor-Side.png|550px|center]]
<center> '''Figure 2. Dragon Protocol Processor-Side''' </center>

[[File:Dragon Protocol Bus-Side.png|500px|center]]
<center> '''Figure 3. Dragon Protocol Bus-Side''' </center>

The Dragon Protocol is implemented in the [http://en.wikipedia.org/wiki/Cray_CS6400 Cray CS6400] (also known as the Xerox Dragon multiprocessor workstation) and was developed by Xerox. It was available with either 60 MHz or 85 MHz processors. The Xerox Dragon was designed to be a research numerous processors

===Firefly Protocol===
The [http://en.wikipedia.org/wiki/Firefly_protocol Firefly protocol] is another example of an update-based cache coherence protocol. However, unlike the [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol], it uses a write-through policy (which writes all changes back to memory).

* '''Valid Exclusive (VE)''' - cache block is exclusively owned, cache block is clean
* '''Dirty (D)''' - exclusive rights to the cache block, cache block is dirty
* '''Shared (S)''' - cache block is shared but is not modified

The Firefly Protocol uses a special bus line called SharedLine to allow quick detection of copies of the block in other caches. It plays the same role as the COPIES_EXIST (C) and !COPIES_EXIST signals, and is shown that way in the finite state machines below. As in the Dragon protocol, there is no invalid state because cache blocks are never invalidated.

[[File:Firefly Protocol Processor-Side.png|550px|center]]
<center> '''Figure 4. Firefly Protocol Processor-Side''' </center>

[[File:Firefly Protocol Bus-Side.png|550px|center]]
<center> '''Figure 5. Firefly Protocol Bus-Side''' </center>

The Firefly protocol is used in the [http://en.wikipedia.org/wiki/DEC_Firefly DEC Firefly] multiprocessor workstation, developed by [http://en.wikipedia.org/wiki/Digital_Equipment_Corporation Digital Equipment Corporation]. The system is asymmetric and the cache is direct-mapped to support multiprocessing. The cache capacity was 16 KB for the original [http://en.wikipedia.org/wiki/MicroVAX_78032 MicroVAX 78032] microprocessor and was later increased to 64 KB when the system was upgraded to the [http://en.wikipedia.org/wiki/CVAX#CVAX_78034 CVAX 78034] microprocessor.

==Adaptive Coherence Protocols==

===Introduction===
Even though update protocols and invalidate protocols each have clear advantages, each also has disadvantages. In a write-invalidate protocol, any write by a processor invalidates the shared copies of the block in the other processors' caches. This forces the other caches to issue read requests to retrieve the new data. This tends to consume a lot of bus bandwidth and can be especially bad if a few processors frequently update the block. Write-update protocols mitigate this issue by updating all other cached copies at the time of the write. Unfortunately, this creates a different problem: some of these updates are unnecessary, because the other processors may never use the updated data. Keeping such unused blocks in the cache also tends to increase conflict and capacity misses.

Adaptive protocols try to mitigate these problems. An adaptive protocol will still incur some bus traffic from misses as well as some unnecessary updates, but both can be reduced depending on how the adaptive algorithm switches between write-invalidate and write-update. There are also adaptive directory-based protocols, but these are not discussed here.

===Subblock protocol===
This snoopy-based protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] with subblock-level validation, using subblocks to gain the advantages of both small and large cache block sizes. <ref>[http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCIQFjAA&url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&ei=2n9kT8gjhPDSAaO2nb4P&usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]</ref>

<dt>
=====Block states=====

<dd> Invalid: All subblocks are invalid

<dd> Valid Exclusive: All valid subblocks in this block are not present in any other caches. All subblocks that are clean shared may be written without a bus transaction. Any subblocks in the block may be invalid. There also may be Dirty blocks which must be written back upon replacement.

<dd> Clean Shared: The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.

<dd> Dirty Shared: Subblocks in this block may be in any state. There may be Dirty blocks which must be written back on replacement.<ref>[http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCIQFjAA&url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&ei=2n9kT8gjhPDSAaO2nb4P&usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]</ref>

<dt>

'''Finite state diagram of block/ line states is as follows:'''

<center> [[File:line_protocol.png]]</center>
<center> '''Figure 4. Finite state diagram of block''' </center>

=====Subblock states=====

<dd> Invalid: The subblock is invalid

<dd> Clean Shared: A read access to the block will succeed. Unless the block the subblock is a part of is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.

<dd> Dirty Shared: The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.

<dd> Dirty: The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.

'''Finite state diagram of subblock is as follows:'''
<center>[[File:subblock_protocol.png]]</center>
<center> '''Figure 5. Finite state diagram of Sub-block''' </center>

The basic idea of this protocol is to perform more cache-to-cache data transfers and fewer off-chip memory accesses.
In contrast with the Illinois protocol, on a read miss the cache that holds the block sends the requested subblock along with all other clean shared and dirty shared subblocks in that block.
If the subblock is supplied by main memory, the snooping caches indicate which subblocks they hold, and memory provides the requested subblock along with the subblocks that are not currently cached.
On a write to a clean subblock and on a write miss, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.

In contrast to the Illinois MESI protocol, which requires extra power to maintain the extra states and additional logic, the subblock protocol reduces the number of cache blocks, compared to cache update protocols, and thus reduces power consumption. <ref>[http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCIQFjAA&url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&ei=2n9kT8gjhPDSAaO2nb4P&usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]</ref>
</dl>

===Read-snarfing protocol===
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.

In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, read snarfing also supplies the block to the other processors whose copies were invalidated earlier. Only one read is required to restore the block in all caches whose copy was invalidated.

The protocol modifies the normal update protocol by updating only those subblocks that are modified, and updates need to be broadcast only while the data is actively shared.
The read-snarfing protocol maintains a counter and an invalidation threshold (Tb) for each cache block “b” to overcome the write-after-read drawback of write-invalidate (WI). The protocol predicts the number of write operations that occur on a single cache block before a read request to the same cache block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the executing program.

A simple version of the read-snarfing Random Walk algorithm is as follows:
Initially, Tb of each cache block b is set to 0. <ref>[http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCIQFjAA&url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&ei=2n9kT8gjhPDSAaO2nb4P&usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]</ref>

<pre>

// "most recent write run" = number of writes to the block before it is accessed by another processor
if (most recent write run > R) {
    if (Tb > 1) {
        Tb--;
    }
} else {
    if (R > Tb) {
        Tb++;
    }
}

R  = invalidation ratio, which is (Ci + Cr) / Cu
Ci : the cost in bus cycles of an invalidation transaction
Cu : the cost in bus cycles of an update transaction
Cr : the cost in bus cycles of reading a cache block

</pre>

The block is invalidated immediately, with no wasted updates, when the threshold drops to 0.
When a block is actively shared, Tb is adjusted upward so that the block is not invalidated. <ref>[http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCIQFjAA&url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&ei=2n9kT8gjhPDSAaO2nb4P&usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]</ref>
<dl>
<dd>
=====Example 1 =====
<dd>Suppose the invalidation ratio (R) = 5
<dd>Current threshold of the block (Tb) = 3
<dd>If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb will be 4.
<dd>This means Tb is at the best possible value and only updates are issued.

=====Example 2 =====
<dd>Consider R = 5 and Tb = 3 for a particular block.
<dd>If the processor writes 10 times before the block is accessed by another processor,
<dd>Tb will be 2 (decreased).
<dd>So the protocol can incur a cost of 2 updates, 1 invalidate and 1 reread.
<dd>After 2 more writes, Tb will be 0 and invalidation will occur immediately.

</dl>
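The threshold adjustment above can be sketched in a few lines of Python. This is only an illustration of the pseudocode as written: the function names <code>adjust_threshold</code> and <code>invalidation_ratio</code> are hypothetical, and the lower bound of 1 on the decrement is taken directly from the pseudocode.

```python
def invalidation_ratio(ci: int, cr: int, cu: int) -> float:
    """R = (Ci + Cr) / Cu, with all costs given in bus cycles."""
    return (ci + cr) / cu


def adjust_threshold(tb: int, write_run: int, r: int) -> int:
    """Return the new threshold Tb after one completed write run.

    write_run is the number of writes that happened before the block
    was accessed by another processor.
    """
    if write_run > r:
        # Long write run: the updates were wasted, so lean towards invalidation.
        return tb - 1 if tb > 1 else tb
    # Short write run: the block is actively shared, so lean towards updates.
    return tb + 1 if r > tb else tb


print(adjust_threshold(tb=3, write_run=4, r=5))   # Example 1: prints 4
print(adjust_threshold(tb=3, write_run=10, r=5))  # Example 2: prints 2
```

The two calls reproduce Example 1 and Example 2 with R = 5 and Tb = 3.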

===Competitive Update Protocol===
A competitive-update protocol is a "..hybrid protocols between write-invalidate and write-update.."<ref name="nilsson">
H. Nilsson, P. Stenström "An adaptive update-based cache coherence protocol for reduction of miss rate and traffic"
Proc. Parallel Architectures and Languages Europe (PARLE) Conf., Lecture Notes in Computer Science, Athens, Greece, 817, Springer-Verlag, Berlin (Jul. 1994), pp. 363–374
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.26.7116&rep=rep1&type=pdf Paper]</ref>
These hybrid protocols are used to reduce the coherence miss rate caused by invalidation or update alone. The main issue here is that there can be high traffic peaks, and these peaks can offset the performance gain.<ref name="nilsson"></ref>
According to Nilsson et al.<ref name="nilsson2">H. Nilsson, P. Stenström, and M. Dubois, “Implementation and Evaluation of Update-Based Cache Protocols Under Relaxed Memory Consistency Models”, Technical Report, Dept. of Computer Engineering, Lund University, Sweden, July 1993</ref>, competitive-update protocols outperform write-invalidate protocols under relaxed memory consistency. The concept is very simple. The first write to a block causes an update to the other cached copies of the block. If the local processor of a copy does not access it before further updates arrive, that copy is invalidated instead. Effectively, only regularly accessed copies of the memory block keep being updated. The limitation is that migratory data makes this protocol sub-optimal. More recent research in this area is the Competitive Update Protocol with Migratory Detection<ref name="nilsson"></ref>, which recognizes migratory data and compensates for it.
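The update-then-invalidate behaviour can be illustrated with a small per-copy counter. This is a minimal sketch and not the design from the cited papers: the class <code>CachedCopy</code>, its method names and the default <code>threshold</code> of one unused update are assumptions made for illustration.

```python
class CachedCopy:
    """One processor's copy of a block under a competitive-update policy."""

    def __init__(self, threshold: int = 1):
        # Hypothetical parameter: updates tolerated without a local access.
        self.threshold = threshold
        self.unused_updates = 0
        self.valid = True
        self.value = None

    def local_access(self):
        """The local processor uses the copy, so it counts as regularly accessed."""
        if not self.valid:
            raise LookupError("coherence miss: the copy was invalidated")
        self.unused_updates = 0
        return self.value

    def remote_update(self, value) -> str:
        """Another processor wrote the block; update or invalidate this copy."""
        if not self.valid:
            return "ignored"
        if self.unused_updates >= self.threshold:
            self.valid = False          # nobody read the earlier updates
            return "invalidated"
        self.value = value
        self.unused_updates += 1
        return "updated"
```

With the default threshold, a copy that is read between remote writes keeps receiving updates, while a copy that is never read is updated once and then invalidated by the next remote write.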

<center>[[Image:CompetitiveUpdateProtocolWithMigratoryDetection.jpg|800px]]</center>
<center> '''Figure 6. Competitive Update Protocol With Migratory Detection<ref name="nilsson"></ref>''' </center>
<center> '''Coherence actions for detection of migratory blocks (left) and coherence actions for read misses to migratory blocks (right).''' </center>

This is only one of many ways to deal with migratory data. For further reading, a Google Scholar search on "Adaptive Protocols and Migratory" will return many papers on different ways to deal with the migratory data issue that arises when using adaptive protocols.

===Cachet===
Cachet is an adaptive cache coherence protocol that uses micro-protocols.<ref name="shen">Xiaowei Shen, Arvind, and Larry Rudolph. 1999. CACHET: an adaptive cache coherence protocol for distributed shared-memory systems. In Proceedings of the 13th international conference on Supercomputing (ICS '99). ACM, New York, NY, USA, 135-144. DOI=10.1145/305138.305187 [http://doi.acm.org/10.1145/305138.305187 Paper]</ref> Cachet recognizes that shared-memory programs have various access patterns and that no fixed cache coherence protocol works well for all of them.<ref name="bennet">J. K. Bennett, J. B. Carter, and W. Zwaenepoel. Adaptive Software Cache Management for Distributed Shared Memory Architectures. In Proceedings of the 17th Annual International Symposium on Computer Architecture, May 1990</ref><ref name="eggers">S. Eggers and R. H. Katz. Evaluating the Performance of Four Snooping Cache Coherency Protocols. In Proceedings of the 16th Annual International Symposium on Computer Architecture, May 1989</ref><ref name="falsafi">B. Falsafi, A. R. Lebeck, S. K. Reinhardt, I. Schoinas, M. D. Hill, J. R. Larus, A. Rogers, and D. A. Wood. Application specific protocols for user-level shared memory. In Supercomputing, Nov. 1994</ref><ref name="weber">W. D. Weber and A. Gupta. Analysis of Cache Invalidation Patterns in Multiprocessors. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, 1989</ref> Cachet obtains the access pattern either through program annotations supplied by the programmer or through recognition by the compiler.

So how does it work?

<blockquote>"Cachet-Base: The most straightforward implementation simply uses the memory as the rendezvous. When a Commit instruction is executed for an address that is cached in the Dirty state, the data must be written back to the memory before the instruction can complete. A Reconcile instruction for an address cached in the Clean state requires the data be purged from the cache before the instruction can complete. An attractive characteristic of Cachet-Base is its simplicity; no state needs to be maintained at the memory side."<ref name=shen></ref></blockquote>

<blockquote>"Cachet-WriterPush: Since load operations are usually more frequent than store operations, it is desirable to allow a Reconcile instruction to complete even when the address is cached in the Clean state. Thus, the following load access to the address causes no cache miss. Correspondingly, when a Commit instruction is performed on a dirty cell, it cannot complete before clean copies of the address are purged from all other caches. Therefore, it can be a lengthy process to commit an address that is cached in the Dirty state."<ref name=shen></ref></blockquote>

<blockquote>"Cachet-Migratory: When an address is exclusively accessed by one processor for a reasonable time period, it makes sense to give the cache the exclusive ownership so that all instructions on the address become local operations. This is reminiscent of the exclusive state in conventional MESI like protocols. The protocol ensures that an address can be cached in at most one cache at any time. Therefore, a Commit instruction can complete even when the address is cached in the Dirty state, and a Reconcile instruction can complete even when the address is cached in the Clean state. The exclusive ownership can migrate among different caches whenever necessary."<ref name=shen></ref></blockquote>

<blockquote>"Different micro-protocols are optimized for different access patterns. Cachet-Base is ideal when the location is randomly accessed by multiple processors, and only necessary commit and reconcile operations are invoked. A conventional implementation of release consistency usually requires that all addresses be indistinguishably committed before a release, and reconciled after an acquire. Such excessive use of commit and reconcile operations can result in performance degradation under Cachet-Base."<ref name=shen></ref></blockquote>

<blockquote>"Cachet-WriterPush is appropriate when certain processors are likely to read an address many times before another processor writes the address. A reconcile operation performed on a clean copy causes no purge operation, regardless of whether the reconcile is necessary. Thus, subsequent load operations to the address can continually use the cached data without causing any cache miss. Cachet Migratory fits well when one processor is likely to read and write an address many times before another processor accesses the address."<ref name=shen></ref></blockquote>
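As a rough illustration of the Commit and Reconcile rules quoted for Cachet-Base, the toy model below treats memory as a shared dictionary. It is a sketch under stated assumptions, not the protocol from the paper: the class <code>CachetBaseCache</code> and its methods are hypothetical, and it is assumed that a written-back cell becomes Clean.

```python
class CachetBaseCache:
    """Toy model of one cache under Cachet-Base; memory is a shared dict."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.cells = {}  # address -> (value, "Clean" or "Dirty")

    def load(self, addr):
        if addr not in self.cells:
            # Cache miss: fetch a clean copy from memory.
            self.cells[addr] = (self.memory[addr], "Clean")
        return self.cells[addr][0]

    def store(self, addr, value):
        # The store stays local until a Commit is executed.
        self.cells[addr] = (value, "Dirty")

    def commit(self, addr):
        """Dirty data must be written back before Commit can complete."""
        if addr in self.cells and self.cells[addr][1] == "Dirty":
            value = self.cells[addr][0]
            self.memory[addr] = value
            self.cells[addr] = (value, "Clean")  # assumption: clean after write-back

    def reconcile(self, addr):
        """A Clean copy must be purged before Reconcile can complete."""
        if addr in self.cells and self.cells[addr][1] == "Clean":
            del self.cells[addr]


memory = {"x": 0}
p0, p1 = CachetBaseCache(memory), CachetBaseCache(memory)
p1.load("x")            # p1 caches the old value
p0.store("x", 42)
p0.commit("x")          # the write reaches memory
stale = p1.load("x")    # still the cached copy
p1.reconcile("x")       # purge the clean copy
fresh = p1.load("x")    # re-fetched from memory
```

Memory acts as the rendezvous: <code>p1</code> only observes the new value after it reconciles, and no state is kept on the memory side.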

What is so interesting about Cachet is its ability to switch between these micro-protocols. This excerpt from the paper does the best job of explaining it.

==Power Considerations==
A major issue when considering power is how many transactions are incurred over the bus. Different protocols require different bus transactions, so we can loosely demonstrate how much power each technique uses by comparing their bus transactions over the same read/write pattern. We will use the patterns found in '''Ch 8 of Solihin'''.

[[File:MSI_Protocol_operation.png|600px]]

[[File:MOESI_Protocol_operation.png|600px]]

[[File:Dragon_Protocol_operation.png|600px]]

As you can see from the above spreadsheets, the update-based protocol demonstrates better power consumption in terms of bus transactions and memory accesses. The invalidation protocols require 6 bus transactions plus 4 memory accesses for the MSI protocol, and 5 bus transactions plus 1 memory access for the MOESI protocol. This is higher than the Dragon protocol, which used only 4 bus transactions and 1 memory access. By counting the number of bus transactions and memory accesses over the same sequence of processor requests, we can put the different protocols on the same footing and compare them. The Dragon protocol also requires less time on the bus because it passes only modified words instead of whole cache blocks.
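The comparison can be written down as a small tally. The counts are the ones quoted above; the weights are purely hypothetical placeholders (set to 1.0), since the text gives no energy figures per bus transaction or memory access.

```python
# Bus transactions and memory accesses quoted in the text for the same access pattern.
COUNTS = {
    "MSI":    {"bus": 6, "mem": 4},
    "MOESI":  {"bus": 5, "mem": 1},
    "Dragon": {"bus": 4, "mem": 1},
}


def relative_cost(counts, bus_weight=1.0, mem_weight=1.0):
    """Weighted sum of events; the weights are assumptions, not measurements."""
    return bus_weight * counts["bus"] + mem_weight * counts["mem"]


# Protocols ordered from cheapest to most expensive under equal weights.
ranking = sorted(COUNTS, key=lambda name: relative_cost(COUNTS[name]))
print(ranking)  # ['Dragon', 'MOESI', 'MSI']
```

Changing the weights lets a reader see how sensitive the ranking is to the relative cost of a memory access versus a bus transaction.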

Unfortunately, we are unable to provide rough estimates for adaptive protocols. With competitive update and Cachet, power use depends in most cases on the problem itself. For example, bus traffic can vary under competitive update, which invalidates cache blocks that are not regularly used, and how many such blocks exist varies from program to program. What makes power estimates hard for adaptive protocols is their adaptive nature: depending on the program involved, the performance and power used can vary drastically.

==Quiz==
Question 1 : What write protocol is used in the Dragon Protocol?
a) write-permeate
b) write-through
c) write-back
d) write-update

Question 2 : How does the Dragon Protocol save on bandwidth?
a) does not have an invalid state and cannot incur read/write misses
b) consistently updates the memory
c) updates specific words within cache instead of entire blocks
d) only flushes the memory when in the exclusive state

Question 3 : How do the states of the Dragon protocol differ from those of the MOESI protocol?
a) The Dragon shared states can allow for copies to exist in multiple caches
b) split up the share state into clean and modified
c) Exclusive state in the Dragon protocol can be dirty
d) The Modified state is exclusively owned in the MOESI protocol

Question 4 : Why do update-based protocols not require an invalidation state?
a) because it does not allow for an invalid block
b) if the block is in the cache then it is valid
c) (a) and (b) are correct
d) None of the above

Question 5 : Which of the following is NOT true of the states in the Firefly protocol?
a) The states use the SharedLine technique to detect copies of the cache block
b) Dirty has multiple modified cache blocks
c) Shared allows for multiple copies of the cache block to exist
d) Valid Exclusive allows for only one copy of the cache to be in the state

Question 6 : What are two adaptive protocols?
a) Competitive Update
b) Competitive Invalidate
c) Cachet
d) Roulette

Question 7 : Competitive Update Protocol uses what protocol(s) as its basis?
a) Update
b) Invalidate
c) Both (a) and (b)
d) None of the above

Question 8 : What is the major drawback for adaptive protocols?
a) Write-Updates
b) Write-Invalidates
c) Overall Hardware Limitations
d) Migratory Data

Question 9 : What is not a sub-protocol of Cachet?
a) Cachet-Base
b) Cachet-WriterPush
c) Cachet-ReadPush
d) Cachet-Migratory

Question 10 : What is the basis for Cachet protocol?
a) It switches between update and invalidate
b) It switches between three sub-protocols
c) It switches between update and competitive update protocols
d) It switches between invalidate and competitive update protocols

==References==
<references />

2013-02-21T01:04:29Z

Amahaba: /* Apache’s Hadoop MapReduce */

= Introduction to MapReduce =
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers.

The MapReduce programming model consists of two major steps:

* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers.

* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.
 

= Overview of the Programming Model =
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations.

The input data format is application-specific, and is specified by the user. The output is a set of <key,value> pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate <key,value> pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.

The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of concurrency management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for locality.

''Sample Code''

The following pseudo-code shows the basic structure of a MapReduce program.

A program that counts the number of occurrences of each word in a collection of documents:
<pre>
// Input: a document
// Intermediate output: key = word, value = 1
Map(void *input) {
    for each word w in input
        EmitIntermediate(w, 1)
}

// Intermediate output: key = word, value = 1
// Output: key = word, value = occurrences
Reduce(String key, Iterator values) {
    int result = 0
    for each v in values
        result += v
    Emit(key, result)
}
</pre>
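For readers who want to run the example, the pseudo-code above can be approximated in plain Python. This is a single-process sketch, not a real MapReduce runtime: the helper <code>map_reduce</code> stands in for the grouping and sorting that the runtime would provide, and lowercasing the words is an added assumption.

```python
from collections import defaultdict


def map_fn(document: str):
    """Emit an intermediate (word, 1) pair for every word in the document."""
    for word in document.split():
        yield word.lower(), 1


def reduce_fn(key: str, values):
    """Sum all counts that were emitted for one word."""
    return key, sum(values)


def map_reduce(documents):
    """Stand-in for the runtime: group pairs by key, reduce, sort by key."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


result = map_reduce(["the quick fox", "The lazy dog the end"])
print(result)
# [('dog', 1), ('end', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 3)]
```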

= Role of the Run-time System =
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.

= Implementations =
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines.

* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement MapReduce for large clusters of commodity PCs connected with switched Ethernet.

* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.

* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).

== Google's MapReduce ==

=== Execution Overview ===
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. <ref>http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters</ref>
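A partitioning function of the form hash(key) mod R might look like the sketch below. The function name <code>partition</code> and the choice of CRC32 are assumptions for illustration; CRC32 is used because Python's built-in <code>hash()</code> for strings is randomized per process, which would send the same key to different partitions on different machines.

```python
import zlib


def partition(key: str, r: int) -> int:
    """Map an intermediate key to one of r reduce partitions (0 .. r-1)."""
    return zlib.crc32(key.encode("utf-8")) % r


# Every occurrence of the same key lands in the same partition.
for word in ["apple", "banana", "apple"]:
    print(word, partition(word, 4))
```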

[[File:Google Map Reduce.jpg|center|Google's MapReduce]] 
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):

# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory.
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers.
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.
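The data flow of the steps above can be simulated in a single process. The sketch below is only an illustration: it ignores the master, the workers, disks and remote procedure calls, and the function <code>run_job</code> with its parameters <code>m</code> and <code>r</code> is hypothetical.

```python
import zlib
from itertools import groupby


def run_job(records, map_fn, reduce_fn, m=3, r=2):
    """Simulate M map tasks and R reduce tasks; returns R output dictionaries."""
    # Step 1: split the input into M pieces.
    splits = [records[i::m] for i in range(m)]

    # Steps 3-4: each map task partitions its intermediate pairs into R regions.
    regions = [[] for _ in range(r)]
    for split in splits:
        for record in split:
            for key, value in map_fn(record):
                regions[zlib.crc32(key.encode("utf-8")) % r].append((key, value))

    # Steps 5-6: each reduce task sorts its region, groups by key and reduces.
    outputs = []
    for region in regions:
        region.sort(key=lambda kv: kv[0])
        out = {}
        for key, group in groupby(region, key=lambda kv: kv[0]):
            out[key] = reduce_fn(key, [value for _, value in group])
        outputs.append(out)

    # Step 8: one output "file" per reduce task.
    return outputs


outputs = run_job(
    ["a b a", "b c", "a"],
    map_fn=lambda rec: [(w, 1) for w in rec.split()],
    reduce_fn=lambda key, values: sum(values),
)
```

Because the partitioning function sends each key to exactly one region, no key appears in more than one of the R outputs.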

=== Data Structures: Master ===

The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.
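The bookkeeping described above could be modelled roughly as follows. All names here (<code>State</code>, <code>Task</code>, <code>Master</code>, <code>map_completed</code>, <code>regions_for</code>) are hypothetical and only mirror the prose; Google's actual implementation is not public in this form.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple


class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"


@dataclass
class Task:
    state: State = State.IDLE
    worker: Optional[str] = None  # identity of the worker, for non-idle tasks
    # Only for completed map tasks: (location, size) of each of the R regions.
    regions: List[Tuple[str, int]] = field(default_factory=list)


@dataclass
class Master:
    map_tasks: Dict[int, Task] = field(default_factory=dict)
    reduce_tasks: Dict[int, Task] = field(default_factory=dict)

    def map_completed(self, task_id: int, regions: List[Tuple[str, int]]) -> None:
        """Record where a finished map task left its R intermediate regions."""
        task = self.map_tasks[task_id]
        task.state, task.regions = State.COMPLETED, regions

    def regions_for(self, partition: int) -> List[Tuple[str, int]]:
        """Locations that the reduce task for this partition can fetch so far."""
        return [t.regions[partition] for t in self.map_tasks.values()
                if t.state is State.COMPLETED]
```

Calling <code>regions_for</code> repeatedly as map tasks finish corresponds to the incremental push of location information to in-progress reduce tasks.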

=== Fault Tolerance ===

Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.

* '''Master Failure'''

It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.

* '''Worker Failure'''

The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.
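The rescheduling rule for a failed worker can be sketched as a single function. The task representation (a dictionary with <code>kind</code>, <code>state</code> and <code>worker</code> fields) and the function name <code>handle_worker_failure</code> are assumptions made for this illustration.

```python
def handle_worker_failure(tasks, failed_worker):
    """Reset the tasks that must be re-run; returns the ids that became idle.

    tasks maps a task id to {"kind": "map" | "reduce",
                             "state": "idle" | "in-progress" | "completed",
                             "worker": str | None}.
    """
    reset = []
    for task_id, task in tasks.items():
        if task["worker"] != failed_worker:
            continue
        in_progress = task["state"] == "in-progress"
        # Completed map output lives on the failed machine's local disk.
        completed_map = task["kind"] == "map" and task["state"] == "completed"
        if in_progress or completed_map:
            task["state"], task["worker"] = "idle", None
            reset.append(task_id)
    return reset


tasks = {
    "m0": {"kind": "map", "state": "completed", "worker": "A"},
    "m1": {"kind": "map", "state": "in-progress", "worker": "B"},
    "r0": {"kind": "reduce", "state": "completed", "worker": "A"},
    "r1": {"kind": "reduce", "state": "in-progress", "worker": "A"},
}
reset = handle_worker_failure(tasks, "A")
print(reset)  # ['m0', 'r1']
```

The completed reduce task <code>r0</code> is left alone because its output is already in the global file system.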

=== Pros and Cons ===
* '''Advantages'''

# A large variety of problems are easily expressible as Map-Reduce computations.
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.
# An implementation of Map-Reduce can be scaled to large clusters comprising thousands of machines.

* '''Disadvantages'''

# The restricted programming model limits how a computation can be expressed within the framework.
# Since network bandwidth is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.

=== Apache’s Hadoop MapReduce ===

After Google published the papers on MapReduce and the Google File System (GFS<ref>http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf</ref>), Apache introduced its own implementation of the same ideas. The important thing to note here is that Apache made this framework open-source. This framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.
(Good Read!<ref>http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop</ref>)

Hadoop MapReduce is based on a “pull” model where multiple “TaskTrackers” poll the “JobTracker” for tasks (either map tasks or reduce tasks).

[[File:HMR.png|center|Apache Hadoop MapReduce]]
The figure above depicts the execution of the job.
* The client program uploads files to a Hadoop Distributed File System (HDFS) location and notifies the JobTracker, which in turn returns the job ID to the client.
* The JobTracker allocates map tasks to the TaskTrackers.
* The JobTracker determines appropriate jobs based on how busy each TaskTracker is.
* The TaskTracker forks a MapTask, which extracts the input data and invokes the user-provided "map" function; this fills a buffer with key/value pairs until it is full.
* The buffer is eventually flushed into two files.
* After all the MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.
* When done, the JobTracker notifies the TaskTracker to move to the reduce phase. This follows the same method, with a reduce task being forked.
* The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temporary output file is renamed atomically to its final output filename.
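The “pull” model can be illustrated with a toy scheduler. This is not Hadoop's API: only the names JobTracker and TaskTracker come from the text, while the class below and its <code>poll</code> and <code>report_done</code> methods are hypothetical, and it is assumed that reduce tasks are handed out only after every map task has finished.

```python
from collections import deque


class JobTracker:
    """Toy job tracker that hands out work only when a tracker asks for it."""

    def __init__(self, map_tasks, reduce_tasks):
        self.pending_maps = deque(map_tasks)
        self.pending_reduces = deque(reduce_tasks)
        self.running_maps = set()

    def poll(self, tracker_id):
        """Called by a TaskTracker; returns a task name or None if nothing is ready."""
        if self.pending_maps:
            task = self.pending_maps.popleft()
            self.running_maps.add(task)
            return task
        if not self.running_maps and self.pending_reduces:
            return self.pending_reduces.popleft()
        return None  # maps are still running, or the job is finished

    def report_done(self, task):
        """Called by a TaskTracker when one of its map tasks completes."""
        self.running_maps.discard(task)


jt = JobTracker(["m0", "m1"], ["r0"])
first = jt.poll("tt1")      # 'm0'
second = jt.poll("tt2")     # 'm1'
waiting = jt.poll("tt1")    # None: no reduce work until the maps are done
jt.report_done("m0")
jt.report_done("m1")
reduce_task = jt.poll("tt1")  # 'r0'
```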

== Phoenix ==
Phoenix<ref>
http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems</ref> implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.

=== Phoenix API ===
The current Phoenix implementation provides an API for C and C++. The API consists of two sets of functions.
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). 
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). 

Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as void pointers wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.

The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use stack-allocated and heap-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure valid data are communicated between user and run-time code, ultimately it is the task of the user to provide functionally correct code.

[[File:Phoenix.jpg|center|Phoenix MapReduce]] 

The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system.

* The run-time is controlled by the scheduler, which is initiated by user code.
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication.
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure.
* After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.

To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate <key,value> pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.

'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.
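The Map and Reduce stages described above can be mimicked with a thread pool. Phoenix itself is written in C on top of POSIX threads; the Python sketch below only illustrates the scheduling idea, and the function names (<code>splitter</code>, <code>map_task</code>, <code>reduce_task</code>, <code>run</code>) and the word-count workload are assumptions. In standard CPython the threads will not actually run the computation in parallel because of the global interpreter lock.

```python
import os
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def splitter(data, unit_size):
    """Divide the input into equally sized units, one per Map task."""
    return [data[i:i + unit_size] for i in range(0, len(data), unit_size)]


def map_task(unit):
    """Count words in one unit and return the intermediate <key, value> pairs."""
    counts = defaultdict(int)
    for word in unit:
        counts[word] += 1
    return counts


def reduce_task(item):
    """All values for one key are processed in a single Reduce task."""
    key, values = item
    return key, sum(values)


def run(data, unit_size=2):
    workers = os.cpu_count() or 1  # one worker thread per core
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map stage: the pool hands each unit to whichever worker is free.
        partials = list(pool.map(map_task, splitter(data, unit_size)))
        # All Map tasks are complete here, so the Reduce stage may start.
        grouped = defaultdict(list)
        for partial in partials:
            for key, value in partial.items():
                grouped[key].append(value)
        reduced = list(pool.map(reduce_task, grouped.items()))
    # Final step: merge the Reduce outputs into one buffer sorted by key.
    return sorted(reduced)


print(run(["a", "b", "a", "c", "b", "a"]))  # [('a', 3), ('b', 2), ('c', 1)]
```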

=== Buffer Management ===
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., to split them across tasks), the run-time manipulates pointers instead of the actual pairs, which may be large. The intermediate buffers are not directly visible to user code.

Map-Reduce buffers are used to store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key order.

Reduce-Merge buffers are used to store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.

=== Pros and Cons ===
* '''Advantages'''

# Phoenix is fast and scalable across all workloads.
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, resulting in a substantial number of applications being scalable.

* '''Disadvantages'''<ref>http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems</ref>

# Because of shared memory, key-value storage is inefficient: containers must provide fast lookup and retrieval over potentially large data sets, all the while coordinating accesses across multiple threads.
# Ineffective combiner: on SMP machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since generated key-value pairs must still be stored. Further, by the time the combiners are run, those pairs may no longer be in the cache, causing expensive memory access penalties.
# Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables user-implemented optimizations. However, it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code to deal with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so will affect the efficiency of the map function.

== Map Reduce on Graphics Processors ==
=== Challenges ===

Compared with CPUs, the hardware architecture of GPUs differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks.

Because of these architectural differences, implementing the MapReduce framework on the GPU poses three technical challenges:
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism.
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.

'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.<ref>http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors</ref>

=== Mars API ===

Mars provides a small set of APIs that are similar to those of CPU-based MapReduce. The run-time system uses a large number of GPU threads for Map or Reduce tasks and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme that guarantees the correctness of parallel execution with little synchronization overhead.

Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls.

* Mars has the following user-implemented APIs, which are written in C/C++. The ''void*'' type is used so that the developer can manipulate strings and other complex data types conveniently.

<pre>
//MAP_COUNT counts result size of the map function.
void MAP_COUNT(void *key, void *val, int keySize, int valSize);
//The map function.
void MAP(void *key, void *val, int keySize, int valSize);
//REDUCE_COUNT counts result size of the reduce function.
void REDUCE_COUNT(void *key, void *vals, int keySize, int valCount);
//The reduce function.
void REDUCE(void *key, void *vals, int keySize, int valCount);
</pre>

* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.
<pre>
//Emit the key size and the value size in MAP_COUNT.
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);
//Emit an intermediate result in MAP.
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);
//Emit the key size and the value size in REDUCE_COUNT.
void EMIT_COUNT(int keySize, int valSize);
//Emit a final result in REDUCE.
void EMIT(void *key, void* val, int keySize, int valSize);
</pre>
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.

=== Implementation Details ===

* Since the GPU does not support dynamic memory allocation on the device memory during the execution of the GPU code, arrays are used as the main data structure.
* The input data, the intermediate result and the final result are stored in three kinds of arrays, i.e., the key array, the value array and the directory index. The directory index consists of an entry of <key offset, key size, value offset, value size> for each key/value pair.
* Given a directory index entry, the key or the value at the corresponding offset in the key array or the value array is fetched.
* With the array structure, space on the device memory for both the input data and the result output must be allocated before the GPU program executes. However, the sizes of the outputs of the map and reduce stages are not known in advance, so Mars uses a two-step output scheme. The output scheme for the map stage, described next, is similar to that for the reduce stage.

First, each map task outputs three counts: the number of intermediate results, the total size of keys (in bytes) and the total size of values (in bytes) that it generates. Based on the key sizes (or value sizes) of all map tasks, the run-time system computes a prefix sum on these sizes and produces an array of write locations. A write location is the start location in the output array at which the corresponding map task writes. Based on the number of intermediate results, the run-time system likewise computes a prefix sum and produces an array of start locations in the output directory index, one for each map task. These prefix sums also yield the sizes of the arrays for the intermediate result, so the run-time allocates arrays in the device memory with the exact size needed to store the intermediate results.

Second, each map task outputs the intermediate key/value pairs to the output array and updates the directory index. Since each map has its deterministic and non-overlapping positions to write to, the write conflicts are avoided. This two-step scheme does not require the hardware support of atomic functions. It is suitable for the massive thread parallelism on the GPU. However, it doubles the map computation in the worst case. The overhead of this scheme is application dependent, and is usually much smaller than that in the worst case.
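The two-step scheme can be illustrated with a small sequential sketch. This is not Mars code: the function names, the word-splitting map task and the omission of the value array are assumptions made to keep the example short, and the loops run on the CPU rather than as GPU threads.

```python
from itertools import accumulate


def map_count(record):
    """Step 1: report how many pairs and key bytes a map task will emit."""
    words = record.split()
    return len(words), sum(len(w) for w in words)


def exclusive_prefix_sum(sizes):
    """Return each task's start offset and the total size."""
    inclusive = list(accumulate(sizes))
    total = inclusive[-1] if inclusive else 0
    return [0] + inclusive[:-1], total


def two_step_map(records):
    counts = [map_count(r) for r in records]
    dir_starts, n_pairs = exclusive_prefix_sum([c[0] for c in counts])
    key_starts, n_key_bytes = exclusive_prefix_sum([c[1] for c in counts])

    # Exact-size allocation, done once before any output is written.
    key_array = bytearray(n_key_bytes)
    directory = [None] * n_pairs  # entries of (key offset, key size)

    # Step 2: each task writes only inside its own precomputed range,
    # so no two tasks ever touch the same location.
    for task, record in enumerate(records):
        offset, slot = key_starts[task], dir_starts[task]
        for word in record.split():
            data = word.encode("ascii")
            key_array[offset:offset + len(data)] = data
            directory[slot] = (offset, len(data))
            offset += len(data)
            slot += 1
    return key_array, directory
```

For the records `"a bb"` and `"ccc"`, the first task writes at key offset 0 and the second at key offset 3, so the writes cannot overlap even if the tasks run concurrently.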

=== Optimization Techniques ===
==== Memory Optimizations ====
Two memory optimizations are used to reduce the number of memory requests in order to improve the memory bandwidth utilization.
* '''Coalesced accesses'''
The GPU feature of coalesced accesses is utilized to improve the memory performance. The memory accesses of each thread to the data arrays are designed according to the coalesced access pattern when applicable. Suppose there are T threads in total and the number of key/value pairs is N in the map stage. Thread i processes the (i + T • k )th (k=0,..,N/T) key/value pair. Due to the SIMD property of the GPU, the memory addresses from the threads within a thread group are consecutive and these accesses are coalesced into one. The figure below illustrates the map stage with and without the coalesced access optimization. 
[[File:Mars.jpg]] 

* '''Accesses using built-in vector types'''
Accessing the values in the device memory can be costly, because the data values are often of different sizes and the accesses are rarely coalesced. Fortunately, GPUs such as the G80 support built-in vector types such as char4 and int4. Reading a built-in vector fetches the entire vector in a single memory request. Compared with reading char or int values one at a time, this greatly reduces the number of memory requests and improves memory performance.
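The index pattern behind coalesced accesses can be shown with a short sketch. The helper names are hypothetical, and the code only lists which pair each thread touches at each step; it does not model the GPU memory system.

```python
def strided_indices(thread_id, num_threads, num_pairs):
    """Coalesced pattern: thread i handles pairs i, i + T, i + 2T, ..."""
    return list(range(thread_id, num_pairs, num_threads))


def blocked_indices(thread_id, num_threads, num_pairs):
    """Contrast case: each thread owns one contiguous chunk of pairs."""
    chunk = -(-num_pairs // num_threads)  # ceiling division
    start = thread_id * chunk
    return list(range(start, min(start + chunk, num_pairs)))


def indices_per_step(assign, num_threads, num_pairs):
    """For each step k, list the pairs that all threads touch together."""
    per_thread = [assign(t, num_threads, num_pairs) for t in range(num_threads)]
    steps = max(len(p) for p in per_thread)
    return [[p[k] for p in per_thread if k < len(p)] for k in range(steps)]
```

With 4 threads and 8 pairs, the strided assignment touches pairs 0 to 3 in the first step and 4 to 7 in the second, so neighbouring threads read neighbouring entries. The blocked assignment touches pairs 0, 2, 4, 6 in the first step, which leaves gaps between the addresses requested at the same time.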

==== Thread parallelism ====
The thread configuration, i.e., the number of thread groups and the number of threads per thread group, depends on multiple factors, including (1) the hardware configuration, such as the number of multiprocessors and the on-chip computation resources (e.g., the number of registers on each multiprocessor), and (2) the computation characteristics of the map and reduce tasks, e.g., whether they are memory- or computation-intensive. Since the map and reduce functions are implemented by the developer and their costs are unknown to the runtime system, it is difficult to find the optimal thread configuration at run time.

==== Handling variable-sized types ====
The variable-sized types are supported with the directory index. If two key/value pairs need to be swapped, their corresponding entries in the directory index are swapped without modifying the key and the value arrays. This choice is to save the swapping cost since the directory entries are typically much smaller than the key/value pairs. Even though swapping changes the order of entries in the directory index, the array layout is preserved and therefore accesses to the directory index can still be coalesced after swaps. Since strings are a typical variable-sized type, and string processing is common in web data analysis tasks, a GPU-based string manipulation library was developed for Mars. The operations in the library include strcmp, strcat, memset and so on. The APIs of these operations are consistent with those in C/C++ library on the CPU. The difference is that simple algorithms for these GPU-based string operations were used, since they usually handle small strings within a map or a reduce task. In addition, char4 is used to implement strings to optimize the memory performance.
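A minimal sketch of reordering through the directory index follows. The sample arrays and the helper names are made up for illustration; the entry layout follows the <key offset, key size, value offset, value size> description given earlier.

```python
key_array = b"pearfigapple"
val_array = b"\x01\x02\x03"
# One entry per pair: (key offset, key size, value offset, value size)
directory = [(0, 4, 0, 1), (4, 3, 1, 1), (7, 5, 2, 1)]


def key_of(entry):
    """Fetch the variable-sized key that a directory entry points to."""
    key_off, key_size, _, _ = entry
    return key_array[key_off:key_off + key_size]


def sort_by_key(entries):
    """Reorder fixed-size directory entries; key and value bytes never move."""
    return sorted(entries, key=key_of)
```

Sorting moves only the four-integer entries, so the cost of a swap does not depend on how long the keys or values are.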

==== Hashing ====
Hashing is used in the sort algorithm to store the results with the same key value consecutively. For this purpose, the results do not need to be in strict ascending or descending order of their keys. A hashing technique that hashes each key into a 32-bit integer is used, and the records are sorted according to their hash values. When two records are compared, their hash values are compared first; their keys are fetched and compared only when the hash values are the same. Given a good hash function, the probability of having to compare the keys is low.
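The hash-first comparison can be sketched as follows. The source does not name the hash function, so CRC-32 from Python's `zlib` module stands in as an assumed 32-bit hash; the function names are illustrative only.

```python
import zlib
from functools import cmp_to_key


def hash32(key: bytes) -> int:
    """Assumed stand-in for the 32-bit hash (the source does not name one)."""
    return zlib.crc32(key) & 0xFFFFFFFF


def compare(a, b):
    """Compare (hash, key) records: hashes first, full keys only on a tie."""
    if a[0] != b[0]:
        return -1 if a[0] < b[0] else 1
    return (a[1] > b[1]) - (a[1] < b[1])


def group_by_hash_sort(keys):
    """Sort so that equal keys end up adjacent, not in alphabetical order."""
    records = [(hash32(k), k) for k in keys]
    records.sort(key=cmp_to_key(compare))
    return [k for _, k in records]
```

The result places identical keys next to each other, which is all the grouping step needs, while most comparisons touch only two integers.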

==== File manipulation ====
Currently, the GPU cannot directly access the data in the hard disk. Thus, the file manipulation with the assistance of the CPU is performed in three phases. First, the file I/O on the CPU is performed and the file data is loaded into a buffer in the main memory. To reduce the I/O stall, multiple threads are used to perform the I/O task. Second, the preprocessing on the buffered data is performed and the input key/value pairs are obtained. Finally, the input key/value pairs are copied to the GPU device memory.

=== Pros and Cons ===
* '''Advantages'''

# Built-in vector types speed up data access: they reduce the number of memory requests and improve bandwidth utilization.
# Applications written on Mars may omit the reduce stage when they do not need it, which avoids unnecessary work.

* '''Disadvantages'''

# GPU-based applications are much more complex to develop.
# Mars currently handles data that fits into the device memory; it has not yet been shown to support massive data sets.

= More Examples =
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. 
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair. 
*[http://books.google.com/books?id=gJrmszNHQV4C&pg=PA376&lpg=PA376&dq=what+is+reverse+web+link+graph&source=bl&ots=rLQ2yuV6oc&sig=wimcG_7MR7d9g-ePGXkEK1ANmws&hl=en&sa=X&ei=BtxBT5HkN42DtgefhbXRBQ&ved=0CEwQ6AEwBg#v=onepage&q=what%20is%20reverse%20web%20link%20graph&f=false Reverse Web-Link Graph]: The map function outputs <target, source> pairs for each link to a target URL found in a page named "source". The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: <target, list(source)>. 
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs. The map function emits a <hostname, term vector> pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final <hostname, term vector> pair. 
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
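
Two of the examples above, URL access frequency and the inverted index, can be sketched with a tiny in-memory driver. The driver `run_mapreduce`, the log-line format (URL as the first field) and the `(document ID, text)` input shape are assumptions for illustration, not part of any real framework.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


# Count of URL access frequency: emit <URL, 1>, then sum per URL.
def url_map(log_line):
    yield log_line.split()[0], 1


def url_reduce(url, counts):
    return sum(counts)


# Inverted index: emit <word, document ID>, then sort the IDs per word.
def index_map(doc):
    doc_id, text = doc
    for word in set(text.split()):
        yield word, doc_id


def index_reduce(word, doc_ids):
    return sorted(doc_ids)
```

Only the map and reduce functions differ between the two programs; the grouping in the middle is shared, which is the part a real runtime distributes.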

= Summary =
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occur by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective for distributed computing, it leads to very high overheads when used with shared-memory systems, which facilitate communication through memory and are typically of much smaller scale.

Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer can provide a simple, functional expression of the algorithm and leave parallelization and scheduling to the runtime system. Phoenix leads to scalable performance for both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is close to that of parallel code written directly with the P-threads API. Nevertheless, there are also applications that do not fit naturally in the MapReduce model, for which P-threads code performs significantly better.

Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort. The difficulty is even greater for complex and performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, one can use a GPU-based MapReduce for these applications. With the GPU-based framework, developers write their code using the simple and familiar MapReduce interfaces. The runtime on the GPU is completely hidden from the developer by the framework.

The framework has drawn criticism as well. Google was awarded the patent for MapReduce, but it can be argued that this technology is similar to many already existing ones. Similar programming models include Algorithm Skeletons (Parallelism Patterns), Sector/Sphere, and the Datameer Analytics Solution. Algorithm Skeletons are a high-level parallel programming model for parallel and distributed computing, and skeleton framework libraries are used in a number of applications. Sector/Sphere is a distributed file system targeting data storage over a large number of commodity computers. Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector. Additionally, Sector/Sphere is unique in its ability to operate in a wide area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop. It includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts and dashboards.

= References =
<references />

= Interesting Read =
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]

File:HMR.png

2013-02-21T01:00:22Z

Amahaba:

CSC/ECE 506 Spring 2013/3b xz

2013-02-21T01:00:15Z

Amahaba: /* Apache’s Hadoop MapReduce */

= Introduction to MapReduce =
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers.

The MapReduce programming model consists of two major steps:

* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers.

* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.
 

= Overview of the Programming Model =
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations.

The input data format is application-specific, and is specified by the user. The output is a set of <key,value> pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate <key,value> pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.

The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of concurrency management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for locality.

''Sample Code''

The following pseudo-code shows the basic structure of a MapReduce program.

The program counts the number of occurrences of each word in a collection of documents.
<pre>
//Input: a document
//Intermediate output: key = word, value = 1
Map(void *input) {
    for each word w in input
        EmitIntermediate(w, 1);
}

//Input: intermediate pairs with key = word, value = 1
//Output: key = word, value = occurrences
Reduce(String key, Iterator values) {
    int result = 0;
    for each v in values
        result += v;
    Emit(key, result);
}
</pre>
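
For readers who want to run the idea, here is a minimal single-process Python sketch of the same word count. The function names are illustrative; the sort-then-group step in `word_count` stands in for the work a MapReduce runtime does between the two user functions.

```python
from itertools import groupby
from operator import itemgetter


def map_words(document):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in document.split():
        yield word, 1


def reduce_counts(key, values):
    """Reduce: add up all counts emitted for one word."""
    return key, sum(values)


def word_count(documents):
    intermediate = [pair for doc in documents for pair in map_words(doc)]
    # Stand-in for the runtime: sort so that equal keys become adjacent,
    # then hand each key and its values to the reduce function.
    intermediate.sort(key=itemgetter(0))
    return [reduce_counts(key, (value for _, value in group))
            for key, group in groupby(intermediate, key=itemgetter(0))]
```

Calling `word_count(["a b a", "b c"])` yields one `(word, count)` pair per distinct word, sorted by word.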

= Role of the Run-time System =
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.

= Implementations =
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines.

* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement MapReduce for large clusters of commodity PCs connected together with switched Ethernet.

* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.

* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).

== Google's MapReduce ==

=== Execution Overview ===
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. <ref>http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters</ref>

[[File:Google Map Reduce.jpg|center|Google's MapReduce]] 
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):

# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory.
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers.
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns to the user code.
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.
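
Steps 4 and 5 above hinge on the partitioning function. The sketch below shows one possible hash(key) mod R partitioner and how a reduce task gathers its region from every map task. The function names are hypothetical, in-memory lists replace local disks and remote procedure calls, and CRC-32 is used as the hash only because Python's built-in string hash changes between runs.

```python
import zlib


def partition(key, num_reduce_tasks):
    """hash(key) mod R, with CRC-32 as a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def map_side_regions(pairs, num_reduce_tasks):
    """Step 4: split one map task's buffered pairs into R regions."""
    regions = [[] for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        regions[partition(key, num_reduce_tasks)].append((key, value))
    return regions


def reduce_side_input(all_map_outputs, reduce_task):
    """Step 5: read this task's region from every map task, sort by key."""
    fetched = [pair for regions in all_map_outputs
               for pair in regions[reduce_task]]
    return sorted(fetched)
```

Because the partitioner depends only on the key, every occurrence of a key lands in the same region index on every map worker, so exactly one reduce task sees all of its values.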

=== Data Structures: Master ===

The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.
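
The bookkeeping described above could be modelled roughly as follows. The class and method names are assumptions for illustration and are not taken from Google's implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"


@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None   # set only for non-idle tasks
    # For completed map tasks: region index -> (location, size in bytes)
    regions: dict = field(default_factory=dict)


@dataclass
class Master:
    tasks: dict = field(default_factory=dict)  # task id -> Task

    def assign(self, task_id, worker):
        task = self.tasks[task_id]
        task.state, task.worker = State.IN_PROGRESS, worker

    def complete_map(self, task_id, regions):
        """Record region locations and list the reduce workers to notify."""
        task = self.tasks[task_id]
        task.state, task.regions = State.COMPLETED, dict(regions)
        return [(t.worker, task_id, task.regions) for t in self.tasks.values()
                if t.kind == "reduce" and t.state is State.IN_PROGRESS]
```

The return value of `complete_map` models the incremental push: only workers that currently run a reduce task receive the new locations.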

=== Fault Tolerance ===

Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.

* '''Master Failure'''

It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.

* '''Worker Failure'''

The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.
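
The rescheduling rules for a failed worker can be written down as a short sketch. The task table layout (a dictionary with "kind", "state" and "worker" fields) and the function name are assumptions for illustration.

```python
def handle_worker_failure(tasks, failed_worker):
    """Reset the tasks that must be rerun after a worker stops responding.

    `tasks` maps task id -> dict with keys "kind", "state" and "worker".
    Returns the ids of completed map tasks that were reset, so that reduce
    workers can be told to read from the re-execution instead.
    """
    reexecuted_maps = []
    for task_id, task in tasks.items():
        if task["worker"] != failed_worker:
            continue
        in_progress = task["state"] == "in-progress"
        # Completed map output lived on the failed machine's local disk.
        completed_map = task["kind"] == "map" and task["state"] == "completed"
        if in_progress or completed_map:
            if completed_map:
                reexecuted_maps.append(task_id)
            task["state"], task["worker"] = "idle", None
        # Completed reduce tasks are left alone: their output is already
        # in the global file system.
    return reexecuted_maps
```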

=== Pros and Cons ===
* '''Advantages'''

# A large variety of problems are easily expressible as Map-Reduce computations. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing.
# Implementations of Map-Reduce scale to large clusters comprising thousands of machines.

* '''Disadvantages'''

# The restricted programming model limits the ways in which a computation can be expressed.
# Network bandwidth is a scarce resource, so a number of optimizations in the system must be targeted at reducing the amount of data sent across the network.

=== Apache’s Hadoop MapReduce ===

After Google published its papers on MapReduce and the Google File System (GFS <ref>http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf</ref>), Apache introduced its own implementation of the same ideas. The important thing to note here is that Apache made this framework open-source. The framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.
(Good Read!<ref>http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop</ref>)

Hadoop MapReduce is based on a “pull” model in which multiple “TaskTrackers” poll the “JobTracker” for tasks (either map tasks or reduce tasks).

[[File:HMR.png|center|Apache Hadoop MapReduce]]
The figure above depicts the execution of a job. The client program uploads files to the HDFS location and notifies the JobTracker, which in turn returns the job ID to the client. The JobTracker allocates map tasks to the TaskTrackers, choosing assignments based on how busy each TaskTracker is.

A TaskTracker forks a MapTask, which extracts the input data and invokes the user-provided "map" function. The function fills a buffer with key/value pairs until the buffer is full, and the buffer is eventually flushed into two files. After all MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.

When the map phase is done, the JobTracker notifies the TaskTrackers to move to the reduce phase. This phase follows the same method: a reduce task is forked. The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temporary output file is renamed atomically to its final output filename.
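
The write-then-rename pattern can be demonstrated on a local file system. This sketch does not use HDFS or any Hadoop API: the function name, the tab-separated output format and the temporary file prefix are assumptions, and `os.replace` from the Python standard library stands in for the atomic rename.

```python
import os
import tempfile


def write_reduce_output(final_path, pairs):
    """Write all pairs to a temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, temp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-reduce-")
    try:
        with os.fdopen(fd, "w") as out:
            for key, value in pairs:
                out.write(f"{key}\t{value}\n")
        # The rename is a single step on POSIX file systems, so a reader
        # sees either no final file or the complete one.
        os.replace(temp_path, final_path)
    except BaseException:
        os.unlink(temp_path)
        raise
```

If the reducer fails part-way through, only the temporary file is affected, so a partially written result never appears under the final name.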

== Phoenix ==
Phoenix<ref>
http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems</ref> implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.

=== Phoenix API ===
The current Phoenix implementation provides an API for C and C++. The API consists of two sets of functions:
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions). 
* The second set includes the functions that the programmer defines (3 required and 2 optional functions). 

Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as void pointers wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.

The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use stack-allocated and heap-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure that valid data are communicated between user and run-time code, it is ultimately the user's task to provide functionally correct code.

[[File:Phoenix.jpg|center|Phoenix MapReduce]] 

The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system.

* The run-time is controlled by the scheduler, which is initiated by user code.
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication.
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure.
* After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.

To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate <key,value> pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.

'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.
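The data flow described above can be sketched in a few lines of Python. This is only an illustration of the Splitter, Map, Partition, Reduce and merge steps on top of a thread pool. It is not the Phoenix API (which is C/C++), and the names `splitter`, `map_task`, `partition`, `reduce_task` and `run_job`, as well as the parameters `num_workers` and `unit_size`, are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib


def splitter(data, unit_size):
    """Divide the input into equally sized units, one per Map task."""
    return [data[i:i + unit_size] for i in range(0, len(data), unit_size)]


def map_task(unit):
    """Emit an intermediate (key, value) pair for every word in the unit."""
    return [(word, 1) for word in unit]


def partition(key, num_reduce_tasks):
    """Send all values of the same key to the same Reduce unit."""
    return zlib.crc32(key.encode()) % num_reduce_tasks


def reduce_task(unit):
    """Sum the values of each key; the task output is sorted by key."""
    return sorted((key, sum(values)) for key, values in unit.items())


def run_job(data, num_workers=4, unit_size=2):
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Map stage: the pool hands tasks to idle workers dynamically.
        map_outputs = list(pool.map(map_task, splitter(data, unit_size)))
        # All Map tasks are complete here, so partitioning can begin.
        units = [defaultdict(list) for _ in range(num_workers)]
        for pairs in map_outputs:
            for key, value in pairs:
                units[partition(key, num_workers)][key].append(value)
        # Reduce stage: one task per unit, again assigned dynamically.
        reduce_outputs = list(pool.map(reduce_task, units))
    # Final step: merge the per-task outputs into one buffer sorted by key.
    return sorted(pair for output in reduce_outputs for pair in output)


words = "a b a c b a".split()
result = run_job(words)
```

Because every value of a key lands in one unit, a single popular key can make one Reduce task much larger than the others, which is the imbalance the text mentions.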

=== Buffer Management ===
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., to split them across tasks), the run-time manipulates pointers instead of the actual pairs, which may be large. The intermediate buffers are not directly visible to user code.

* '''Map-Reduce buffers''' store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key.
* '''Reduce-Merge buffers''' store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.

=== Pros and Cons ===
* '''Advantages'''

# Phoenix scales well across a range of workloads on both multi-core chips and conventional symmetric multiprocessors.
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. Combiners contribute to better data locality and lower memory allocation pressure, which allows a substantial number of applications to scale.

* '''Disadvantages'''<ref>http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems</ref>

# Inefficient key-value storage: because memory is shared, containers must provide fast lookup and retrieval over a potentially large data set while coordinating accesses across multiple threads.
# Ineffective combiner: on SMP machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since the generated key-value pairs must still be stored. Further, by the time the combiners run, those pairs may no longer be in the cache, causing expensive memory access penalties.
# Exposed task chunking: Phoenix groups tasks into chunks to reduce scheduling costs and amortize per-task overhead, and exposes these chunks to user code. This design enables user-implemented optimizations, but it has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code that deals with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.

== Map Reduce on Graphics Processors ==
=== Challenges ===

Compared with CPUs, the hardware architecture of GPUs differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks.

Due to these architectural differences, there are three technical challenges in implementing the MapReduce framework on the GPU:
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism.
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.

'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.<ref>http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors</ref>

=== Mars API ===

Mars provides a small set of APIs that are similar to those of CPU-based MapReduce. The run-time system uses a large number of GPU threads for Map or Reduce tasks and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme with low run-time overhead. This scheme guarantees the correctness of parallel execution with little synchronization overhead.

Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls.

* Mars has the following user-implemented APIs. These APIs are implemented in C/C++. The ''void*'' type is used so that the developer can manipulate strings and other complex data types conveniently.

<pre>
//MAP_COUNT counts result size of the map function.
void MAP_COUNT(void *key, void *val, int keySize, int valSize);
//The map function.
void MAP(void *key, void* val, int keySize, int valSize);
//REDUCE_COUNT counts result size of the reduce function.
void REDUCE_COUNT(void* key, void* vals, int keySize, int valCount);
//The reduce function.
void REDUCE(void* key, void* vals, int keySize, int valCount);
</pre>

* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.
<pre>
//Emit the key size and the value size in MAP_COUNT.
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);
//Emit an intermediate result in MAP.
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);
//Emit the key size and the value size in REDUCE_COUNT.
void EMIT_COUNT(int keySize, int valSize);
//Emit a final result in REDUCE.
void EMIT(void *key, void* val, int keySize, int valSize);
</pre>
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.

=== Implementation Details ===

* Since the GPU does not support dynamic memory allocation on the device memory during the execution of the GPU code, arrays are used as the main data structure.
* The input data, the intermediate result and the final result are stored in three kinds of arrays, i.e., the key array, the value array and the directory index. The directory index consists of an entry of <key offset, key size, value offset, value size> for each key/value pair.
* Given a directory index entry, the key or the value at the corresponding offset in the key array or the value array is fetched.
* With the array structure, the space on the device memory for both the input data and the result output must be allocated before executing the GPU program. However, the sizes of the output from the map and the reduce stages are not known in advance. Mars therefore writes the output in two steps, described below for the map stage; the scheme for the reduce stage is similar.

First, each map task outputs three counts, i.e., the number of intermediate results, the total size of keys (in bytes) and the total size of values (in bytes) generated by the map task. Based on the key sizes (or value sizes) of all map tasks, the run-time system computes a prefix sum on these sizes and produces an array of write locations. A write location is the start location in the output array for the corresponding map task to write. Based on the number of intermediate results, the run-time system computes a prefix sum and produces an array of start locations in the output directory index for the corresponding map task. Through these prefix sums, the sizes of the arrays for the intermediate results are also known. Thus, the run-time allocates arrays in the device memory with the exact size for storing the intermediate results.

Second, each map task outputs the intermediate key/value pairs to the output array and updates the directory index. Since each map has its deterministic and non-overlapping positions to write to, the write conflicts are avoided. This two-step scheme does not require the hardware support of atomic functions. It is suitable for the massive thread parallelism on the GPU. However, it doubles the map computation in the worst case. The overhead of this scheme is application dependent, and is usually much smaller than that in the worst case.
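The two-step scheme can be illustrated on the CPU with a short Python sketch. It only models the key array: a counting pass, an exclusive prefix sum that yields the write locations, and an emit pass in which each task writes to its own region. The function names (`exclusive_prefix_sum`, `map_count`, `map_emit`) are hypothetical and the loop stands in for GPU threads, so this is not Mars code.

```python
from itertools import accumulate


def exclusive_prefix_sum(counts):
    """Return the start offset of each task and the total output size."""
    inclusive = list(accumulate(counts))
    starts = ([0] + inclusive[:-1]) if inclusive else []
    total = inclusive[-1] if inclusive else 0
    return starts, total


def map_count(record):
    """Step 1: report how many bytes this task's output key will occupy."""
    return len(record.encode())


def map_emit(record, out, start):
    """Step 2: write the output at this task's private start offset."""
    data = record.encode()
    out[start:start + len(data)] = data


records = ["gpu", "map", "reduce"]
sizes = [map_count(r) for r in records]
starts, total = exclusive_prefix_sum(sizes)
key_array = bytearray(total)  # allocated once, with the exact size
for record, start in zip(records, starts):
    # Each task writes only to [start, start + size), so no two tasks overlap.
    map_emit(record, key_array, start)
```

The map logic runs twice (once to count, once to emit), which is the source of the worst-case doubling mentioned above.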

=== Optimization Techniques ===
==== Memory Optimizations ====
Two memory optimizations are used to reduce the number of memory requests in order to improve the memory bandwidth utilization.
* '''Coalesced accesses'''
The GPU feature of coalesced accesses is utilized to improve the memory performance. The memory accesses of each thread to the data arrays are designed according to the coalesced access pattern when applicable. Suppose there are T threads in total and the number of key/value pairs is N in the map stage. Thread i processes the (i + T • k )th (k=0,..,N/T) key/value pair. Due to the SIMD property of the GPU, the memory addresses from the threads within a thread group are consecutive and these accesses are coalesced into one. The figure below illustrates the map stage with and without the coalesced access optimization. 
[[File:Mars.jpg]] 
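The index pattern behind coalesced accesses can be checked with a small Python sketch. It compares the strided assignment described above, where thread i handles pairs i, i + T, i + 2T and so on, with a hypothetical blocked assignment in which each thread takes one contiguous chunk. The function names and the blocked variant are illustrative assumptions, not part of Mars.

```python
def coalesced_indices(thread_id, num_threads, num_pairs):
    """Pairs handled by one thread: i, i + T, i + 2T, ..."""
    return list(range(thread_id, num_pairs, num_threads))


def blocked_indices(thread_id, num_threads, num_pairs):
    """Alternative layout: each thread takes one contiguous chunk."""
    chunk = -(-num_pairs // num_threads)  # ceiling division
    start = thread_id * chunk
    return list(range(start, min(start + chunk, num_pairs)))


T, N = 4, 8
# Indices touched by threads 0..T-1 during the same step (k = 0).
coalesced_step0 = [coalesced_indices(i, T, N)[0] for i in range(T)]
blocked_step0 = [blocked_indices(i, T, N)[0] for i in range(T)]
```

In the strided layout the threads of a group touch consecutive indices in each step, which is what allows the hardware to merge the requests; in the blocked layout the same step touches indices that are a whole chunk apart.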

* '''Accesses using built-in vector types'''
Accessing the values in the device memory can be costly, because the data values are often of different sizes and the accesses are rarely coalesced. Fortunately, GPUs such as the G80 support built-in vector types such as char4 and int4. Reading a built-in vector fetches the entire vector in a single memory request. Compared with reading char or int, the number of memory requests is greatly reduced and the memory performance is improved.

==== Thread parallelism ====
The thread configuration, i.e., the number of thread groups and the number of threads per thread group, depends on multiple factors, including (1) the hardware configuration, such as the number of multiprocessors and the on-chip computation resources (e.g., the number of registers on each multiprocessor), and (2) the computation characteristics of the map and the reduce tasks, e.g., whether they are memory- or computation-intensive. Since the map and the reduce functions are implemented by the developer, and their costs are unknown to the run-time system, it is difficult to find the optimal thread configuration at run time.

==== Handling variable-sized types ====
Variable-sized types are supported through the directory index. If two key/value pairs need to be swapped, their corresponding entries in the directory index are swapped without modifying the key and the value arrays. This choice saves swapping cost, since the directory entries are typically much smaller than the key/value pairs. Even though swapping changes the order of entries in the directory index, the array layout is preserved and therefore accesses to the directory index can still be coalesced after swaps. Since strings are a typical variable-sized type, and string processing is common in web data analysis tasks, a GPU-based string manipulation library was developed for Mars. The operations in the library include strcmp, strcat, memset and so on. The APIs of these operations are consistent with those in the C/C++ library on the CPU. The difference is that simple algorithms are used for these GPU-based string operations, since they usually handle small strings within a map or a reduce task. In addition, char4 is used to implement strings to optimize the memory performance.
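A minimal Python sketch of the directory index idea follows. The `Entry` tuple mirrors the <key offset, key size, value offset, value size> entry described earlier; the sample data and the helper `fetch` are made up for illustration.

```python
from collections import namedtuple

Entry = namedtuple("Entry", "key_offset key_size val_offset val_size")

# Variable-sized keys packed back to back, plus one-byte values.
keys = b"bananaapple"
vals = b"\x02\x01"
directory = [Entry(0, 6, 0, 1), Entry(6, 5, 1, 1)]  # banana -> 2, apple -> 1


def fetch(entry):
    """Look up the key and value that a directory entry points to."""
    key = keys[entry.key_offset:entry.key_offset + entry.key_size]
    val = vals[entry.val_offset:entry.val_offset + entry.val_size]
    return key, val


# Putting the pairs in key order swaps only the fixed-size directory entries;
# the key and value arrays are left untouched.
directory[0], directory[1] = directory[1], directory[0]
ordered = [fetch(entry) for entry in directory]
```

Each entry has the same fixed size regardless of how long the key or value is, which is why swapping entries is cheaper than moving the pairs themselves.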

==== Hashing ====
Hashing is used in the sort algorithm to store the results with the same key value consecutively. In that case, the results do not need to be in strict ascending or descending key order. Each key is hashed into a 32-bit integer, and the records are sorted according to their hash values. When two records are compared, their hash values are compared first. Only when the hash values are equal are the keys fetched and compared. Given a good hash function, the probability of comparing the keys is low.
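The hash-first comparison can be sketched in Python as follows. CRC-32 from the standard `zlib` module is used here purely as a stand-in 32-bit hash, since the text does not say which hash function Mars uses; the counter `key_fetches` is an illustrative addition that records how often the full keys had to be compared.

```python
import zlib
from functools import cmp_to_key

key_fetches = 0  # how many comparisons had to look at the full keys


def hash32(key):
    """Hash a key into a 32-bit integer (CRC-32 is only a stand-in)."""
    return zlib.crc32(key)


def compare(a, b):
    """Compare (hash, key) records: hashes first, keys only on a tie."""
    global key_fetches
    if a[0] != b[0]:
        return -1 if a[0] < b[0] else 1
    key_fetches += 1
    return (a[1] > b[1]) - (a[1] < b[1])


keys = [b"gpu", b"cpu", b"gpu", b"map", b"cpu"]
records = [(hash32(k), k) for k in keys]
records.sort(key=cmp_to_key(compare))
grouped = [k for _, k in records]
```

After sorting, equal keys sit next to each other in `grouped`, although the groups themselves appear in hash order rather than alphabetical order.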

==== File manipulation ====
Currently, the GPU cannot directly access data on the hard disk. Thus, file manipulation is performed with the assistance of the CPU in three phases. First, the file I/O is performed on the CPU and the file data is loaded into a buffer in main memory. To reduce I/O stalls, multiple threads are used to perform the I/O task. Second, the buffered data is preprocessed to obtain the input key/value pairs. Finally, the input key/value pairs are copied to the GPU device memory.

=== Pros and Cons ===
* '''Advantages'''

# Accessing data through built-in vector types provides a speedup. These vector types reduce the number of memory requests and improve the bandwidth utilization.
# Applications written on Mars can omit the reduce stage when they do not need it, which avoids the cost of that stage.

* '''Disadvantages'''

# GPU-based applications are much more complex to develop.
# Mars currently handles data that fits into the device memory; it has not yet been shown to support massive data sets.

= More Examples =
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. 
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair. 
*[http://books.google.com/books?id=gJrmszNHQV4C&pg=PA376&lpg=PA376&dq=what+is+reverse+web+link+graph&source=bl&ots=rLQ2yuV6oc&sig=wimcG_7MR7d9g-ePGXkEK1ANmws&hl=en&sa=X&ei=BtxBT5HkN42DtgefhbXRBQ&ved=0CEwQ6AEwBg#v=onepage&q=what%20is%20reverse%20web%20link%20graph&f=false Reverse Web-Link Graph]: The map function outputs <target, source> pairs for each link to a target URL found in a page named "source". The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: <target, list(source)>. 
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs. The map function emits a <hostname, term vector> pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final <hostname, term vector> pair. 
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
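As an illustration, the inverted index example can be written against a tiny in-memory driver in Python. The driver `map_reduce` is a hypothetical single-process stand-in for a real MapReduce runtime; it only groups intermediate pairs by key and calls the reduce function once per key. Removing duplicate document IDs is an extra assumption made here.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Emit a (word, document ID) pair for every word in the document."""
    for word in text.lower().split():
        yield word, doc_id


def reduce_fn(word, doc_ids):
    """Sort the document IDs of a word and emit (word, list(document ID))."""
    return word, sorted(set(doc_ids))  # set() drops repeated IDs (assumption)


def map_reduce(inputs, map_fn, reduce_fn):
    """Group intermediate pairs by key, then reduce each group in key order."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


docs = [(1, "map reduce"), (2, "reduce on GPU"), (3, "map on GPU")]
index = dict(map_reduce(docs, map_fn, reduce_fn))
```

The other examples in the list fit the same driver by changing only `map_fn` and `reduce_fn`.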

= Summary =
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occur by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective for distributed computing, it leads to very high overheads if used with shared-memory systems, which facilitate communication through memory and are typically of much smaller scale.

Phoenix, an implementation of MapReduce, uses shared memory and minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance for both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is close to that of parallel code written with the P-threads API. Nevertheless, there are also applications that do not fit naturally into the MapReduce model, for which P-threads code performs significantly better.

Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort. The difficulty is even greater for complex, performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, one can use a GPU-based MapReduce for these applications. With the GPU-based framework, developers write their code using the simple and familiar MapReduce interfaces. The runtime on the GPU is completely hidden from the developer by the framework.

The framework has also drawn criticism. Google was awarded a patent for MapReduce, but it can be argued that the technology is similar to many existing ones. Programming models similar to MapReduce include Algorithm Skeletons (Parallelism Patterns), Sector/Sphere, and Datameer Analytics Solution. Algorithm Skeletons are a high-level parallel programming model for parallel and distributed computing, and skeleton framework libraries are used for a number of applications. Sector is a distributed file system targeting data storage over a large number of commodity computers, and Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector. Additionally, Sector/Sphere is unique in its ability to operate in a wide area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop. It includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts and dashboards.

= References =
<references />

= Interesting Read =
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytics Solution]

CSC/ECE 506 Spring 2013/3b xz

2013-02-21T00:58:30Z

Amahaba: /* Apache’s Hadoop MapReduce */

= Introduction to MapReduce =
MapReduce is a software framework introduced by Google in 2004 to support [http://publib.boulder.ibm.com/infocenter/txformp/v6r0m0/index.jsp?topic=%2Fcom.ibm.cics.te.doc%2Ferziaz0015.htm distributed computing] on large data sets on clusters of computers.

The MapReduce programming model consists of two major steps:

* In the '''map''' step, the problem being solved is divided into a series of sub-problems and distributed to different workers.

* After collecting results from workers, the computation enters the '''reduce''' step to combine and produce the final result.
 

= Overview of the Programming Model =
[[File:Mapreduce.png|thumbnail|MapReduce for a Shape Counter]]
The MapReduce programming model is inspired by [http://enfranchisedmind.com/blog/posts/what-is-a-functional-programming-language/ functional languages] and targets data-intensive computations.

The input data format is application-specific, and is specified by the user. The output is a set of <key,value> pairs. The user expresses an algorithm using two functions, Map and Reduce. The Map function is applied on the input data and produces a list of intermediate <key,value> pairs. The Reduce function is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs. Finally, the output pairs are sorted by their key value. In the simplest form of MapReduce programs, the programmer provides just the Map function. All other functionality, including the grouping of the intermediate pairs which have the same key and the final sorting, is provided by the runtime.

The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelization. The actual parallelization and the details of concurrency management are left to the runtime system. Hence the program code is generic and easily portable across systems. Nevertheless, the model provides sufficient high-level information for parallelization. The Map function can be executed in parallel on non-overlapping portions of the input data and the Reduce function can be executed in parallel on each set of intermediate pairs with the same key. Similarly, since it is explicitly known which pairs each function will operate upon, one can employ pre-fetching or other scheduling optimizations for locality.

''Sample Code''

The following pseudo-code shows the basic structure of a MapReduce program.

The program counts the number of occurrences of each word in a collection of documents.
<pre>
// Input: a document
// Intermediate output: key = word, value = 1
Map(void *input) {
    for each word w in input
        EmitIntermediate(w, 1);
}

// Input: key = word, values = list of 1s emitted for that word
// Output: key = word, value = number of occurrences
Reduce(String key, Iterator values) {
    int result = 0;
    for each v in values
        result += v;
    Emit(key, result);
}
</pre>
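For readers who want to run the example, the same word count can be written as a small single-process Python sketch. The grouping step inside `word_count` plays the role of the runtime; all function names here are illustrative and do not belong to any particular MapReduce library.

```python
from collections import defaultdict


def map_fn(document):
    """Emit (word, 1) for every word in the document."""
    for word in document.split():
        yield word, 1


def reduce_fn(key, values):
    """Add up all the counts emitted for one word."""
    return key, sum(values)


def word_count(documents):
    # Runtime's job: group the intermediate pairs that share a key.
    intermediate = defaultdict(list)
    for document in documents:
        for key, value in map_fn(document):
            intermediate[key].append(value)
    # Reduce each group and return the output sorted by key.
    return [reduce_fn(key, intermediate[key]) for key in sorted(intermediate)]


counts = word_count(["the quick fox", "the lazy dog", "the fox"])
```

Only `map_fn` and `reduce_fn` are specific to word counting; the grouping and sorting would be provided by the runtime in a real system.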

= Role of the Run-time System =
In both steps of MapReduce, the run-time must decide on factors such as the size of the units, the number of nodes involved, how units are assigned to nodes dynamically, and how buffer space is allocated. The decisions can be fully automatic or guided by the programmer given application specific knowledge. These decisions allow the run-time to execute a program efficiently across a wide range of machines and data-set scenarios without modifications to the source code. Finally, the run-time must merge and sort the output pairs from all Reduce tasks.

= Implementations =
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large [http://msdn.microsoft.com/en-us/library/ms178144.aspx NUMA] multi-processor, and yet another for an even larger collection of networked machines.

* Google's MapReduce and [http://en.wikipedia.org/wiki/Apache_Hadoop Hadoop] implement MapReduce for large clusters of commodity PCs connected together with switched Ethernet.

* [http://mapreduce.stanford.edu/ Phoenix] implements MapReduce for shared-memory systems.

* [http://www.cse.ust.hk/gpuqp/Mars.html Mars] is a MapReduce framework on graphics processors ([http://www.nvidia.com/object/gpu.html GPUs]).

== Google's MapReduce ==

=== Execution Overview ===
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user. Below is a detailed look. <ref>http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf Simplified Data Processing on Large Clusters</ref>

[[File:Google Map Reduce.jpg|center|Google's MapReduce]] 
The figure above shows the overall flow of a MapReduce operation in Google's implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the figure above correspond to the numbers in the list below):

# The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.
# One of the copies of the program is special, ''the master''. The rest are workers that are assigned work by the master. There are ''M'' map tasks and ''R'' reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
# A worker who is assigned a map task reads the contents of the corresponding input split. It parses ''key/value'' pairs out of the input data and passes each pair to the user-defined Map function. The intermediate ''key/value'' pairs produced by the Map function are buffered in memory.
# Periodically, the buffered pairs are written to local disk, partitioned into ''R'' regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers.
# When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.
# The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
# When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns to the user code.
# After successful completion, the output of the MapReduce execution is available in the ''R'' output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these ''R'' output files into one file. They often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.
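The partitioning used in step 4 can be sketched in Python. The example follows the hash(key) mod R form mentioned earlier, with CRC-32 chosen as an assumed hash because it gives the same result in every process; the helper `split_into_regions` is hypothetical.

```python
import zlib


def partition(key, R):
    """Map a key to one of R regions using hash(key) mod R."""
    return zlib.crc32(key.encode("utf-8")) % R


def split_into_regions(pairs, R):
    """Distribute buffered intermediate pairs over R regions."""
    regions = [[] for _ in range(R)]
    for key, value in pairs:
        regions[partition(key, R)].append((key, value))
    return regions


pairs = [("url_a", 1), ("url_b", 1), ("url_a", 1), ("url_c", 1)]
regions = split_into_regions(pairs, R=3)
```

Since the region depends only on the key, every map worker sends a given key to the same reduce task, which is what lets one reduce task see all values for that key.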

=== Data Structures: Master ===

The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).
The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the ''R'' intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.
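A possible shape for this bookkeeping is sketched below in Python. The class and field names (`Task`, `State`, `regions`) and the sample locations are illustrative assumptions, not taken from Google's implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"


@dataclass
class Task:
    kind: str                                    # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None                 # set for non-idle tasks
    regions: list = field(default_factory=list)  # (location, size) per region


def complete_map_task(task, regions, reduce_inboxes):
    """Record a finished map task and push its region info to reducers."""
    task.state = State.COMPLETED
    task.regions = regions
    for r, inbox in enumerate(reduce_inboxes):
        inbox.append(regions[r])  # reduce task r only needs region r


task = Task(kind="map", state=State.IN_PROGRESS, worker="worker-7")
inboxes = [[], []]  # R = 2 in-progress reduce tasks
complete_map_task(
    task,
    [("worker-7:/tmp/m0-r0", 120), ("worker-7:/tmp/m0-r1", 80)],
    inboxes,
)
```

Pushing the locations as each map task completes lets reduce workers start fetching data before the whole map phase has finished.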

=== Fault Tolerance ===

Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.

* '''Master Failure'''

It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpoint. However, given that there is only a single master, its failure is unlikely; therefore Google's current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.

* '''Worker Failure'''

The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker ''A'' and then later executed by worker ''B'' (because ''A'' failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker ''A'' will read the data from worker ''B''. MapReduce is resilient to large-scale worker failures.
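The rescheduling rules above are small enough to state as code. The following Python sketch uses plain dictionaries for the task table; the field names and the function `handle_worker_failure` are hypothetical.

```python
def handle_worker_failure(tasks, failed_worker):
    """Reset the tasks that must run again after a worker is marked failed."""
    rescheduled = []
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        # Completed map output lived on the failed machine's local disk.
        lost_map_output = task["kind"] == "map" and task["state"] == "completed"
        if task["state"] == "in-progress" or lost_map_output:
            task["state"], task["worker"] = "idle", None
            rescheduled.append(task["id"])
        # Completed reduce tasks are kept: their output is in the global file system.
    return rescheduled


tasks = [
    {"id": "m0", "kind": "map", "state": "completed", "worker": "A"},
    {"id": "m1", "kind": "map", "state": "in-progress", "worker": "A"},
    {"id": "r0", "kind": "reduce", "state": "completed", "worker": "A"},
    {"id": "r1", "kind": "reduce", "state": "in-progress", "worker": "B"},
]
rescheduled = handle_worker_failure(tasks, "A")
```

Here `m0` and `m1` go back to idle, while `r0` stays completed and `r1` is unaffected because it runs on a different worker.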

=== Pros and Cons ===
* '''Advantages'''

# Large variety of problems are easily expressible as Map-Reduce computations.
# The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing. For example, Map-Reduce is used for the generation of data for Google’s production Web search service, for sorting, data mining, machine learning, and many other systems.
# An implementation of Map-Reduce can be scaled to large clusters comprising thousands of machines.

* '''Disadvantages'''

# The restricted programming model constrains how a computation can be implemented within the framework.
# Since network bandwidth is scarce, a number of optimizations in the system are targeted at reducing the amount of data sent across the network.

=== Apache’s Hadoop MapReduce ===

After Google published the papers on MapReduce and the Google File System (GFS <ref>http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf</ref>), Apache introduced its own implementation of the same, Hadoop. The important thing to note here is that Apache made this framework open-source. The framework transparently provides both reliability and data motion to applications. Hadoop has prominent users such as Yahoo! and Facebook.
(Good Read!<ref>http://en.wikipedia.org/wiki/Apache_Hadoop Apache Hadoop</ref>)

Hadoop MapReduce is based on a “pull” model in which multiple “TaskTrackers” poll the “JobTracker” for tasks (either map tasks or reduce tasks).

[[File:HMP.jpg|center|Google's MapReduce]]
The figure above depicts the execution of the job. The client program uploads files to the HDFS location and notifies the JobTracker, which in turn returns the job ID to the client. The JobTracker allocates map tasks to the TaskTrackers, choosing assignments based on how busy each TaskTracker is.

Each TaskTracker forks a MapTask, which extracts the input data and invokes the user-provided "map" function. The function fills a buffer with key/value pairs until it is full. The buffer is eventually flushed into two files. After all MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job.

When the map phase is done, the JobTracker notifies the TaskTrackers to move on to the reduce phase. This follows the same method, with a reduce task being forked. The output of each reducer is written to a temporary output file in HDFS. When the reducer finishes processing all keys, the temporary output file is renamed atomically to its final output filename.
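
The pull model described above can be sketched in a few lines of Python. This is an illustration only: the <code>JobTracker</code> and <code>TaskTracker</code> classes below are simplified stand-ins with hypothetical method names (<code>poll</code>, <code>report_done</code>, <code>heartbeat</code>), not Hadoop's actual Java API, and the sketch assumes reduce tasks are handed out only after every map task has been reported done.

```python
from collections import deque


class JobTracker:
    def __init__(self, map_tasks, reduce_tasks):
        self.pending_maps = deque(map_tasks)
        self.pending_reduces = deque(reduce_tasks)
        self.maps_outstanding = len(map_tasks)

    def poll(self):
        """Called by a TaskTracker asking for work; returns a task or None."""
        if self.pending_maps:
            return ("map", self.pending_maps.popleft())
        # Reduce tasks are released only once all map tasks are reported done.
        if self.maps_outstanding == 0 and self.pending_reduces:
            return ("reduce", self.pending_reduces.popleft())
        return None

    def report_done(self, kind):
        if kind == "map":
            self.maps_outstanding -= 1


class TaskTracker:
    def __init__(self, name, job_tracker):
        self.name = name
        self.job_tracker = job_tracker
        self.log = []

    def heartbeat(self):
        """Pull one task, run it (here: just record it), and report completion."""
        task = self.job_tracker.poll()
        if task is None:
            return False
        kind, payload = task
        self.log.append((kind, payload))
        self.job_tracker.report_done(kind)
        return True


jt = JobTracker(["split-0", "split-1", "split-2"], ["part-0", "part-1"])
trackers = [TaskTracker("tt1", jt), TaskTracker("tt2", jt)]

progress = True
while progress:
    progress = False
    for tracker in trackers:
        if tracker.heartbeat():
            progress = True

history = trackers[0].log + trackers[1].log
```

The trackers ask for work rather than having it pushed to them, so a busy tracker simply polls less often.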

== Phoenix ==
Phoenix<ref>
http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems</ref> implements MapReduce for shared-memory systems. Its goal is to support efficient execution on multiple cores without burdening the programmer with concurrency management. Phoenix consists of a simple API that is visible to application programmers and an efficient runtime that handles parallelization, resource management, and fault recovery.

=== Phoenix API ===
The current Phoenix implementation provides an API for C and C++. The API includes two sets of functions:
* The first set is provided by Phoenix and is used by the programmer’s application code to initialize the system and emit output pairs (1 required and 2 optional functions).
* The second set includes the functions that the programmer defines (3 required and 2 optional functions).

Apart from the Map and Reduce functions, the user provides functions that partition the data before each step and a function that implements key comparison. The function arguments are declared as void pointers wherever possible to provide flexibility in their declaration and fast use without conversion overhead. The data structure used to communicate basic function information and buffer allocation between the user code and run-time is of type ''scheduler_args_t'' ([http://pages.cs.wisc.edu/~gibson/mapreduceexample/MapReduceScheduler.h.html MapReduce Header File]). There are additional data structure types to facilitate communication between the Splitter, Map, Partition, and Reduce functions. These types use pointers whenever possible to implement communication without actually copying significant amounts of data.

The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely use stack-allocated and heap-allocated structures for private data. It also assumes that there is no communication through shared-memory structures other than the input/output buffers for these functions. For C/C++, these assumptions cannot be checked statically for arbitrary programs. Although there are stringent checks within the system to ensure that valid data are communicated between user and run-time code, it is ultimately the user's task to provide functionally correct code.

[[File:Phoenix.jpg|center|Phoenix MapReduce]] 

The Phoenix run-time was developed on top of [http://en.wikipedia.org/wiki/POSIX_Threads POSIX threads], but can be easily ported to other shared memory thread packages. The figure above shows the basic data flow for the run-time system.

* The run-time is controlled by the scheduler, which is initiated by user code.
* The scheduler creates and manages the threads that run all Map and Reduce tasks. It also manages the buffers used for task communication.
* The programmer provides the scheduler with all the required data and function pointers through the ''scheduler_args_t'' structure.
* After initialization, the scheduler determines the number of cores to use for this computation. For each core, it spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.

To start the '''Map''' stage, the scheduler uses the ''Splitter'' to divide input pairs into equally sized units to be processed by the Map tasks. The Splitter is called once per Map task and returns a pointer to the data the Map task will process. The Map tasks are allocated dynamically to workers and each one emits intermediate <key,value> pairs. The ''Partition'' function splits the intermediate pairs into units for the Reduce tasks. The function ensures all values of the same key go to the same unit. Within each buffer, values are ordered by key to assist with the final sorting. At this point, the Map stage is over. The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.

'''Reduce''' tasks are also assigned to workers dynamically, similar to Map tasks. The one difference is that, while with Map tasks there is complete freedom in distributing pairs across tasks, with Reduce all values for the same key must be processed in one task. Hence, the Reduce stage may exhibit higher imbalance across workers and dynamic scheduling is more important. The output of each Reduce task is already sorted by key. As the last step, the final output from all tasks is merged into a single buffer, sorted by keys.
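
The data flow through the Splitter, Map, Partition, and Reduce functions can be sketched as follows. This is a Python analogue written for illustration, not the Phoenix C API: the function names, the word-count workload, and the use of a thread pool in place of the Phoenix scheduler are all assumptions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def splitter(data, num_units):
    """Divide the input into roughly equal units, one per Map task."""
    size = max(1, -(-len(data) // num_units))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def map_task(unit):
    """Emit intermediate (key, value) pairs; here (word, 1)."""
    return [(word, 1) for word in unit]


def partition(key, num_reduce_tasks):
    """Deterministically send every value of a key to the same Reduce unit."""
    return sum(key.encode()) % num_reduce_tasks


def reduce_task(bucket):
    """Combine all values of each key; the output is sorted by key."""
    return sorted((key, sum(values)) for key, values in bucket.items())


def run(data, num_workers=4, num_reduce_tasks=2):
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Map stage: every Map task finishes before the Reduce stage starts.
        mapped = list(pool.map(map_task, splitter(data, num_workers)))
        buckets = [defaultdict(list) for _ in range(num_reduce_tasks)]
        for pairs in mapped:
            for key, value in pairs:
                buckets[partition(key, num_reduce_tasks)][key].append(value)
        # Reduce stage: one task per partition unit.
        reduced = list(pool.map(reduce_task, buckets))
    # Final step: merge the per-task outputs into one buffer sorted by key.
    return sorted(pair for part in reduced for pair in part)


result = run("the quick fox jumps over the lazy dog the end".split())
```

Because all values for one key land in one partition, a skewed key distribution would make one Reduce task much larger than the others, which is the imbalance mentioned above.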

=== Buffer Management ===
Two types of temporary buffers are necessary to store data between the various stages. All buffers are allocated in shared memory but are accessed in a well-specified way by a few functions. To re-arrange buffers (e.g., to split them across tasks), pointers are manipulated instead of the actual pairs, which may be large. The intermediate buffers are not directly visible to user code.

* '''Map-Reduce buffers''' store the intermediate output pairs. Each worker has its own set of buffers. The buffers are initially sized to a default value and then resized dynamically as needed. At this stage, there may be multiple pairs with the same key. To accelerate the Partition function, the Emit intermediate function stores all values for the same key in the same buffer. At the end of the Map task, each buffer is sorted by key.
* '''Reduce-Merge buffers''' store the outputs of Reduce tasks before they are sorted. At this stage, each key has only one value associated with it. After sorting, the final output is available in the user-allocated Output data buffer.

=== Pros and Cons ===
* '''Advantages'''

# Phoenix is fast and scalable across all workloads.
# On clusters of machines, the combiner function reduces the number of key-value pairs that must be exchanged between machines. These combiners contribute to better data locality and lower memory allocation pressure, which allows a substantial number of applications to scale.

* '''Disadvantages'''<ref>http://csl.stanford.edu/~christos/publications/2011.phoenixplus.mapreduce.pdf Phoenix++: Modular MapReduce for Shared Memory Systems</ref>

# Inefficient key-value storage: because memory is shared, containers must provide fast lookup and retrieval over potentially large data sets while coordinating accesses across multiple threads.
# Ineffective combiner: on SMP machines, memory allocation costs tend to dominate, even more than memory traffic. Combiners fail to reduce the memory allocation pressure, since generated key-value pairs must still be stored. Further, by the time the combiners are run, those pairs may no longer be in the cache, causing expensive memory access penalties.
# Exposed task chunking: Phoenix internally groups tasks into chunks to reduce scheduling costs and amortize per-task overhead. This design enables user-implemented optimizations, but it also has two drawbacks. First, since the code for grouping tasks is pushed into user code, the map function becomes more complicated due to the extra code needed to deal with chunks. Second, if the user leverages the exposed chunk to improve performance, the framework can no longer freely adjust the chunk size, since doing so would affect the efficiency of the map function.

== Map Reduce on Graphics Processors ==
=== Challenges ===

Compared with CPUs, the hardware architecture of GPUs differs significantly. For instance, current GPUs have over one hundred [http://encyclopedia.jrank.org/articles/pages/6904/SIMD-Single-Instruction-Multiple-Data-Processing.html SIMD (Single Instruction Multiple Data)] processors whereas current multi-core CPUs offer a much smaller number of cores. Moreover, most GPUs do not support atomic operations or locks.

Due to these architectural differences, there are three technical challenges in implementing the MapReduce framework on the GPU.
# The synchronization overhead in the run-time system of the framework must be low so that the system can scale to hundreds of processors.
# Due to the lack of dynamic thread scheduling on current GPUs, it is essential to allocate work evenly across threads on the GPU to exploit its massive thread parallelism.
# The core tasks of MapReduce programs, including string processing, file manipulation and concurrent reads and writes, are unconventional to GPUs and must be handled efficiently.

'''''Mars''''', a MapReduce framework on the GPU, was designed and implemented with these challenges in mind.<ref>http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/mars-pact-08.pdf Mars: A MapReduce Framework on Graphic Processors</ref>

=== Mars API ===

Mars provides a small set of APIs that are similar to those of CPU-based MapReduce. The run-time system uses a large number of GPU threads for Map or Reduce tasks, and automatically assigns each thread a small number of key/value pairs to work on. As a result, the massive thread parallelism on the GPU is well utilized. To avoid conflicts between concurrent writes, Mars uses a lock-free scheme with low run-time overhead. This scheme guarantees the correctness of parallel execution with little synchronization overhead.

Mars has two kinds of APIs, the ''user-implemented APIs'', which the users implement, and the ''system-provided APIs'', which the users can use as library calls.

* Mars has the following user-implemented APIs. These APIs are implemented in C/C++. The ''void*'' type is used so that the developer can manipulate strings and other complex data types conveniently.

<pre>
// MAP_COUNT counts the result size of the map function.
void MAP_COUNT(void *key, void *val, int keySize, int valSize);
// The map function.
void MAP(void *key, void *val, int keySize, int valSize);
// REDUCE_COUNT counts the result size of the reduce function.
void REDUCE_COUNT(void *key, void *vals, int keySize, int valCount);
// The reduce function.
void REDUCE(void *key, void *vals, int keySize, int valCount);
</pre>

* Mars has the following four system-provided APIs. The emit functions are used in user-implemented map and reduce functions to output the intermediate/final results.
<pre>
//Emit the key size and the value size in MAP_COUNT.
void EMIT_INTERMEDIATE_COUNT(int keySize, int valSize);
//Emit an intermediate result in MAP.
void EMIT_INTERMEDIATE(void* key, void* val, int keySize, int valSize);
//Emit the key size and the value size in REDUCE_COUNT.
void EMIT_COUNT(int keySize, int valSize);
//Emit a final result in REDUCE.
void EMIT(void *key, void* val, int keySize, int valSize);
</pre>
Overall, the APIs in Mars are similar to those in the existing MapReduce frameworks such as Hadoop and Phoenix. The major difference is that Mars needs two APIs to implement the functionality of each CPU-based API. One is to count the size of results, and the other one is to output the results. This is because the GPU does not support atomic operations, and the Mars runtime uses a two-step design for the result output.

=== Implementation Details ===

* Since the GPU does not support dynamic memory allocation on the device memory during the execution of the GPU code, arrays are used as the main data structure.
* The input data, the intermediate result and the final result are stored in three kinds of arrays, i.e., the key array, the value array and the directory index. The directory index consists of an entry of <key offset, key size, value offset, value size> for each key/value pair.
* Given a directory index entry, the key or the value at the corresponding offset in the key array or the value array is fetched.
* With the array structure, space in the device memory for both the input data and the result output is allocated before the GPU program executes. However, the sizes of the output from the map and the reduce stages are unknown in advance, so the output is produced in two steps. The output scheme for the map stage is similar to that for the reduce stage, so only the map stage is described here.

First, each map task outputs three counts, i.e., the number of intermediate results, the total size of keys (in bytes) and the total size of values (in bytes) generated by the map task. Based on the key sizes (or value sizes) of all map tasks, the run-time system computes a prefix sum on these sizes and produces an array of write locations. A write location is the start location in the output array for the corresponding map task to write. Based on the number of intermediate results, the run-time system computes a prefix sum and produces an array of start locations in the output directory index for the corresponding map task. Through these prefix sums, the sizes of the arrays for the intermediate results are also known. Thus, the run-time allocates arrays in the device memory with the exact size for storing the intermediate results.

Second, each map task outputs the intermediate key/value pairs to the output array and updates the directory index. Since each map task has its own deterministic and non-overlapping positions to write to, write conflicts are avoided. This two-step scheme does not require hardware support for atomic functions, and it is suitable for the massive thread parallelism on the GPU. However, it doubles the map computation in the worst case. The overhead of this scheme is application dependent, and is usually much smaller than in the worst case.
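
The two-step scheme can be illustrated with a sequential Python sketch. On the GPU each loop iteration would be a separate thread; here the loops run one after another, and only the key array is shown (the value array and the directory index would be sized and filled the same way). The helper names <code>map_count</code>, <code>map_emit</code>, and <code>exclusive_prefix_sum</code> are hypothetical and do not belong to the Mars API.

```python
from itertools import accumulate


def exclusive_prefix_sum(counts):
    """Start offset of each task: the sum of all counts before it."""
    return list(accumulate(counts, initial=0))[:-1]


def map_count(record):
    """Step 1: report how many key bytes the map task would emit."""
    return sum(len(word.encode()) for word in record.split())


def map_emit(record, out, start):
    """Step 2: write the keys into this task's private, precomputed region."""
    pos = start
    for word in record.split():
        data = word.encode()
        out[pos:pos + len(data)] = data
        pos += len(data)


records = ["ab cde", "", "f", "gh ij"]               # one record per map task

key_bytes = [map_count(r) for r in records]          # counts per task
write_locations = exclusive_prefix_sum(key_bytes)    # non-overlapping start offsets
key_array = bytearray(sum(key_bytes))                # exact-size allocation

for record, start in zip(records, write_locations):
    map_emit(record, key_array, start)
```

Note that each record is scanned twice, once in <code>map_count</code> and once in <code>map_emit</code>, which is the doubled map computation mentioned above.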

=== Optimization Techniques ===
==== Memory Optimizations ====
Two memory optimizations are used to reduce the number of memory requests in order to improve the memory bandwidth utilization.
* '''Coalesced accesses'''
The GPU feature of coalesced accesses is utilized to improve the memory performance. The memory accesses of each thread to the data arrays are designed according to the coalesced access pattern when applicable. Suppose there are T threads in total and N key/value pairs in the map stage. Thread i processes the (i + T·k)th key/value pair, for k = 0, ..., N/T. Due to the SIMD property of the GPU, the memory addresses from the threads within a thread group are consecutive, and these accesses are coalesced into one. The figure below illustrates the map stage with and without the coalesced access optimization.
[[File:Mars.jpg]] 

* '''Accesses using built-in vector types'''
Accessing the values in the device memory can be costly, because the data values are often of different sizes and the accesses are rarely coalesced. Fortunately, GPUs such as G80 support built-in vector types such as char4 and int4. Reading built-in vectors fetches the entire vector in a single memory request. Compared with reading char or int, the number of memory requests is greatly reduced and the memory performance is improved.
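
The strided assignment used for coalesced accesses (thread i handles pairs i, i + T, i + 2T, ...) can be checked with plain index arithmetic. The Python sketch below does not run on a GPU; it only lists which pair indices the threads touch at each step, and its function names are hypothetical.

```python
def strided_assignment(num_threads, num_pairs):
    """Pair indices handled by each thread: i, i + T, i + 2T, ..."""
    return [list(range(i, num_pairs, num_threads)) for i in range(num_threads)]


def indices_per_step(assignment):
    """Indices touched by all threads at the same step k."""
    steps = max(len(a) for a in assignment)
    return [[a[k] for a in assignment if k < len(a)] for k in range(steps)]


assignment = strided_assignment(num_threads=4, num_pairs=10)
steps = indices_per_step(assignment)
```

At every step the threads read a run of consecutive indices, which is the pattern that the hardware can merge into a single memory request.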

==== Thread parallelism ====
The thread configuration, i.e., the number of thread groups and the number of threads per thread group, depends on multiple factors, including (1) the hardware configuration, such as the number of multiprocessors and the on-chip computation resources (e.g., the number of registers on each multiprocessor), and (2) the computation characteristics of the map and the reduce tasks, e.g., whether they are memory- or computation-intensive. Since the map and the reduce functions are implemented by the developer, and their costs are unknown to the runtime system, it is difficult to find the optimal setting for the thread configuration at run time.

==== Handling variable-sized types ====
The variable-sized types are supported with the directory index. If two key/value pairs need to be swapped, their corresponding entries in the directory index are swapped without modifying the key and the value arrays. This choice saves swapping cost, since the directory entries are typically much smaller than the key/value pairs. Even though swapping changes the order of entries in the directory index, the array layout is preserved and therefore accesses to the directory index can still be coalesced after swaps. Since strings are a typical variable-sized type, and string processing is common in web data analysis tasks, a GPU-based string manipulation library was developed for Mars. The operations in the library include strcmp, strcat, memset and so on. The APIs of these operations are consistent with those in the C/C++ library on the CPU. The difference is that simple algorithms were used for these GPU-based string operations, since they usually handle small strings within a map or a reduce task. In addition, char4 is used to implement strings to optimize the memory performance.
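
A small Python sketch shows how the directory index lets variable-sized pairs be reordered without moving their bytes. Mars keeps these structures as arrays in GPU device memory; the Python <code>bytes</code> objects, the tuples, and the helper names below are stand-ins chosen for illustration.

```python
# Keys and values are packed back to back; sizes differ per pair.
key_array = b"pearfigapple"
value_array = b"\x03\x01\x02"

# One entry per pair: (key offset, key size, value offset, value size).
directory = [(0, 4, 0, 1), (4, 3, 1, 1), (7, 5, 2, 1)]


def fetch(entry):
    """Look up the key and value bytes that a directory entry points to."""
    key_off, key_size, val_off, val_size = entry
    key = key_array[key_off:key_off + key_size]
    value = value_array[val_off:val_off + val_size]
    return key, value


def sort_by_key(entries):
    """Reorder directory entries only; the key and value arrays stay untouched."""
    return sorted(entries, key=lambda entry: fetch(entry)[0])


sorted_directory = sort_by_key(directory)
ordered_keys = [fetch(entry)[0] for entry in sorted_directory]
```

Each directory entry has a fixed size regardless of how long its key or value is, which is what keeps swaps cheap.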

==== Hashing ====
Hashing is used in the sort algorithm to store the results with the same key value consecutively. In that case, the results do not need to be in strict ascending or descending order of their key values. A hashing technique that hashes a key into a 32-bit integer is used, and the records are sorted according to their hash values. When two records are compared, their hash values are compared first. Only when the hash values are equal are the keys fetched and compared. Given a good hash function, the probability of having to compare the keys is low.
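
The comparison rule can be sketched in Python as follows. CRC-32 from the standard <code>zlib</code> module is used here purely as a stand-in 32-bit hash; the hash function that Mars actually uses is not specified in this article, and the helper names are hypothetical.

```python
import zlib
from functools import cmp_to_key


def hash32(key: bytes) -> int:
    """Stand-in 32-bit hash (CRC-32); an assumption, not the Mars hash function."""
    return zlib.crc32(key) & 0xFFFFFFFF


def compare(a, b):
    """Compare (hash, key) records: hashes first, full keys only on a hash tie."""
    if a[0] != b[0]:
        return -1 if a[0] < b[0] else 1
    return (a[1] > b[1]) - (a[1] < b[1])


keys = [b"url2", b"url1", b"url2", b"url3", b"url1"]
records = sorted(((hash32(k), k) for k in keys), key=cmp_to_key(compare))
grouped = [k for _, k in records]
```

The resulting order follows the hash values rather than the keys themselves, but equal keys always end up next to each other, which is all the grouping step requires.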

==== File manipulation ====
Currently, the GPU cannot directly access data on the hard disk. Thus, file manipulation is performed with the assistance of the CPU in three phases. First, the file I/O is performed on the CPU and the file data is loaded into a buffer in the main memory. To reduce the I/O stall, multiple threads are used to perform the I/O task. Second, the buffered data is preprocessed to obtain the input key/value pairs. Finally, the input key/value pairs are copied to the GPU device memory.

=== Pros and Cons ===
* '''Advantages'''

# Speeds up data access by using built-in vector types. These vector types reduce the number of memory requests and improve the bandwidth utilization.
# Applications written on Mars do not have to include the reduce stage; omitting it when it is not needed improves the speedup.

* '''Disadvantages'''

# GPU-based applications are much more complex.
# Mars currently handles data that can fit into the device memory, but it has not yet been verified to support massive data sets.

= More Examples =
Below are a few simple examples of programs that can be easily expressed as MapReduce computations.
*Distributed [http://unixhelp.ed.ac.uk/CGI/man-cgi?grep Grep]: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output. 
*Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair. 
*[http://books.google.com/books?id=gJrmszNHQV4C&pg=PA376&lpg=PA376&dq=what+is+reverse+web+link+graph&source=bl&ots=rLQ2yuV6oc&sig=wimcG_7MR7d9g-ePGXkEK1ANmws&hl=en&sa=X&ei=BtxBT5HkN42DtgefhbXRBQ&ved=0CEwQ6AEwBg#v=onepage&q=what%20is%20reverse%20web%20link%20graph&f=false Reverse Web-Link Graph]: The map function outputs <target, source> pairs for each link to a target URL found in a page named "source". The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: <target, list(source)>. 
*Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs. The map function emits a <hostname, term vector> pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final <hostname, term vector> pair. 
*[http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html Inverted Index]: The map function parses each document, and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
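
Three of the examples above can be written against a tiny in-memory helper. The <code>map_reduce</code> function below is a hypothetical single-process stand-in written for illustration; it is not the API of Google's MapReduce, Hadoop, Phoenix, or Mars, and the sample inputs are made up.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Run map_fn over every input, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


# Count of URL access frequency: emit <URL, 1>, then sum per URL.
log = ["/home", "/about", "/home", "/home"]
url_counts = map_reduce(
    log,
    lambda url: [(url, 1)],
    lambda url, ones: sum(ones),
)

# Reverse web-link graph: each page is (source, [targets]); emit <target, source>.
pages = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"])]
reverse_links = map_reduce(
    pages,
    lambda page: [(target, page[0]) for target in page[1]],
    lambda target, sources: sources,
)

# Inverted index: each document is (doc_id, text); emit <word, doc_id>.
docs = [(2, "map reduce"), (1, "reduce tasks")]
inverted_index = map_reduce(
    docs,
    lambda doc: [(word, doc[0]) for word in doc[1].split()],
    lambda word, doc_ids: sorted(doc_ids),
)
```

In each case only the two small lambda functions change; the grouping by key in the middle is the part a real framework distributes across machines, cores, or GPU threads.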

= Summary =
Google’s MapReduce runtime implementation targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls. Buffering and communication occur by reading and writing files on a distributed file system. The locality optimizations focus mostly on avoiding remote file accesses. While such a system is effective for distributed computing, it leads to very high overheads if used with shared-memory systems, which facilitate communication through memory and are typically of much smaller scale.

Phoenix, an implementation of MapReduce for shared memory, minimizes the overheads of task spawning and data communication. With Phoenix, the programmer provides a simple, functional expression of the algorithm and leaves parallelization and scheduling to the runtime system. Phoenix leads to scalable performance for both multi-core chips and conventional symmetric multiprocessors, and it automatically handles key scheduling decisions during parallel execution. Despite runtime overheads, results have shown that the performance of Phoenix is close to that of parallel code written with the P-threads API. Nevertheless, there are also applications that do not fit naturally in the MapReduce model, for which P-threads code performs significantly better.

Graphics processors have emerged as a commodity platform for parallel computing. However, developing GPU applications requires knowledge of the GPU architecture and considerable effort. The difficulty is even greater for complex and performance-centric tasks such as web data analysis. Since MapReduce has been successful in easing the development of web data analysis tasks, one can use a GPU-based MapReduce for these applications. With the GPU-based framework, developers write their code using the simple and familiar MapReduce interfaces. The runtime on the GPU is completely hidden from the developer by the framework.

The framework has drawn criticism as well. Google was awarded a patent for MapReduce, but it can be argued that the technology is similar to many already existing ones. Programming models similar to MapReduce include Algorithm Skeletons (Parallelism Patterns), Sector/Sphere, and Datameer Analytics Solution. Algorithm Skeletons are a high-level parallel programming model for parallel and distributed computing, and their framework libraries are used for a number of applications. Sector is a distributed file system targeting data storage over a large number of commodity computers, and Sphere is the programming framework that supports massive in-storage parallel data processing for data stored in Sector. Additionally, Sector/Sphere is notable for its ability to operate in a wide area network (WAN) setting. Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that includes data source integration, an analytics engine with a spreadsheet interface designed for business users with over 180 analytic functions, and visualization including reports, charts and dashboards.

= References =
<references />

= Interesting Read =
# [http://en.wikipedia.org/wiki/Algorithmic_skeleton#Frameworks_and_libraries Algorithm Skeleton]
# [http://en.wikipedia.org/wiki/Sector/Sphere Sector/Sphere]
# [http://en.wikipedia.org/wiki/Datameer Datameer Analytic Solution]