'''Prefetching and Memory Consistency Models'''


== '''Overview''' ==
This wiki article explores two topics: sequential prefetching and memory consistency models. It first describes prefetching and explains the different kinds of sequential prefetching, such as fixed and adaptive, in detail. This is followed by a discussion of memory consistency models, including sequential consistency and relaxed consistency, illustrated with examples and with comments from the original authors and researchers.


= '''Prefetching''' =


Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution<ref>http://129.16.20.23/~pers/pub/j5.pdf</ref>.


Prefetching is a common technique to reduce the read miss penalty. It relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache before the reference that would trigger the miss. Prefetching approaches proposed in the literature are either software- or hardware-based.


Software-controlled prefetching schemes rely on the programmer/compiler to insert prefetch instructions prior to the instructions that trigger a miss. In addition, both the processor and the memory system must be able to support prefetch instructions, which can increase the code size and the run-time overhead. By contrast, hardware-controlled prefetching relieves the programmer/compiler of the burden of deciding what and when to prefetch. Usually, these schemes take advantage of the regularity of data accesses in scientific computations by dynamically detecting access strides.


Basically, there are two types of sequential prefetching techniques:


1. Fixed sequential prefetching


2. Adaptive sequential prefetching


These techniques are discussed in more detail in the following sections, along with other prefetching models.


== Simulated Processing Node Architecture ==
According to Fig. 2.1, the processing node consists of a processor, a first-level cache (FLC), a second-level cache (SLC), a first- and second-level write buffer (FLWB and SLWB), a local bus, a network interface controller, and a memory module. The FLC is a direct-mapped write-through cache with no allocation of blocks on write misses and is blocking on read misses. Writes and read miss requests are buffered in the FLWB. The second-level cache (SLC) is a direct-mapped write-back cache. For both prefetching techniques, we only prefetch into the SLC.  In addition, a second-level write buffer (SLWB) keeps track of outstanding requests (SLC read miss, prefetch, and write requests). No more than one request to the same block is allowed to be issued to the system; others are just kept in the SLWB while waiting for the pending request to that block to complete. Moreover, a read miss request may bypass write requests if they are for different blocks.
[[File:Procenv.png|thumb|center|upright|350px|Figure 2.1. Processor environment and simulated architecture]]


==Fixed Sequential Prefetching<ref>http://www.springerlink.com/content/lu0755310187318n/</ref>==
By fixed sequential prefetching we mean that K consecutive blocks are prefetched into the SLC on a reference to a block, i.e., blocks n + 1 ... n + K are prefetched upon a reference to block n, if they are not present in the cache. Sequential prefetching has been extensively studied in the context of uniprocessors, but to our knowledge it has never been considered for general applications on multiprocessors. Although many sequential strategies have been proposed for uniprocessors, we have restricted ourselves to prefetching on a miss in the SLC. When a reference misses in the SLC, the miss request is sent to memory, and the cache is searched for the K consecutive blocks directly following the missing block in the address space. The blocks among these K consecutive blocks that are not present in the SLC and have no pending requests in the SLWB are prefetched. We refer to K as the degree of prefetching.
[[File:untitled1.png|thumb|center|upright|350px|Figure 2.2. The fixed sequential prefetching mechanism]]


Fig. 2.2 shows the mechanism of the fixed sequential prefetching scheme. As a cache lookup is made for block address n, the next block address (n + 1) is calculated. On a read miss, a read request is issued to the memory system and is kept in the SLWB. In the next cache cycle, the calculated address (n + 1) is directed to the cache, and a cache lookup is made. If the block is not present in the cache, a prefetch request is issued and is kept in the SLWB. During that time, the subsequent block address (n + 2) is calculated. The number of iterations is determined by the degree of prefetching. The processor is blocked only during the time it takes to handle the first read miss. Since the prefetch requests are issued one at a time and are pipelined in the memory system, they can be overlapped with the original read request. Besides the simple extensions to the SLC needed to incorporate fixed sequential prefetching, the memory system must be able to handle three new network commands: a prefetch request and two reply messages, denoted PreData and PreNeg. Whereas PreData carries the prefetched block, PreNeg tells the cache that the prefetch request cannot be satisfied because the memory copy is in a transient state (some other cache is reading or writing it).
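To make the mechanism concrete, the following is a minimal C sketch of the fixed sequential prefetching decision on an SLC miss. It is not the simulator described above: the cache geometry, the pending[] array standing in for the SLWB, and issue_request() are illustrative assumptions.
 /* Minimal sketch of fixed sequential prefetching with degree K.
  * The direct-mapped SLC model, the pending[] stand-in for the SLWB,
  * and issue_request() are illustrative assumptions, not the paper's code. */
 #include <stdio.h>
 #include <stdbool.h>
 
 #define SLC_SETS 1024               /* direct-mapped: one block per set */
 #define K        2                  /* degree of prefetching            */
 
 static long slc_tag[SLC_SETS];      /* -1 means the set is empty        */
 static bool pending[SLC_SETS];      /* request outstanding in the SLWB  */
 
 static bool slc_present(long block) { return slc_tag[block % SLC_SETS] == block; }
 
 static void issue_request(long block, bool is_prefetch) {
     pending[block % SLC_SETS] = true;
     printf("%s block %ld\n", is_prefetch ? "prefetch" : "demand fetch", block);
 }
 
 /* Called on every SLC reference to block n. */
 static void slc_reference(long n) {
     if (slc_present(n))
         return;                                   /* hit: no prefetch issued  */
     if (!pending[n % SLC_SETS])
         issue_request(n, false);                  /* ordinary read-miss       */
     for (long b = n + 1; b <= n + K; b++)         /* look up blocks n+1..n+K  */
         if (!slc_present(b) && !pending[b % SLC_SETS])
             issue_request(b, true);               /* prefetch missing blocks  */
 }
 
 int main(void) {
     for (int i = 0; i < SLC_SETS; i++) slc_tag[i] = -1;
     slc_reference(100);    /* miss on block 100: fetch 100, prefetch 101, 102 */
     return 0;
 }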


==Adaptive Sequential Prefetching<ref>http://web.cecs.pdx.edu/~walpole/papers/mmcn1998b.pdf</ref>==
The mechanism behind the adaptive scheme is basically the same as that of fixed sequential prefetching. For example, prefetching is activated by a read miss and blocks are prefetched into the SLC. In contrast to fixed sequential prefetching, however, the degree of prefetching is not fixed; rather, it is controlled by a register, the LookaheadCounter. The adaptive sequential prefetching scheme relies on adjusting the degree of prefetching (the value of the LookaheadCounter) dynamically by counting the useful prefetches, i.e., prefetched blocks that are actually referenced during their lifetime in the cache. To explain how this is achieved, we will first focus on how the algorithm measures the prefetch efficiency and then on how the LookaheadCounter is adjusted to track a certain prefetch efficiency. The mechanisms needed to achieve these tasks (two bits per cache line and three counters per cache) appear in Table 1.


{|class="wikitable"
|-
|PrefetchBit (per Cache Line)
|Used to detect useful prefetches (needed when prefetching is turned on.)
|-
|ZeroBit (per cache line)
|Used to detect when a prefetch would have been useful (needed when prefetching is turned off.)
|-
|LookaheadCounter (per cache)
|The current degree of prefetching (per cache)
|-
|PrefetchCounter (per cache)
|Counts the number of prefetches that have been returned after each read miss
|-
|UsefulCounter (per cache)
|Counts the number of useful prefetches
|}
Conceptually, the algorithm measures the prefetch efficiency by counting the fraction of prefetched blocks that are referenced by the processors. If this fraction exceeds a preset threshold, the degree of prefetching is increased and, if it is below another preset threshold, the degree of prefetching is decreased.


The basic mechanisms used to measure the prefetch efficiency consist of two counters (the PrefetchCounter and the UsefulCounter) and a PrefetchBit per cache line, all of which are initially cleared. The fraction of useful prefetches is established as the ratio of the UsefulCounter to the PrefetchCounter as follows. The number of prefetched blocks is counted by incrementing the PrefetchCounter whenever a prefetch acknowledgment is received from the memory system, independent of whether the prefetch was accepted (PreData) or rejected (PreNeg, e.g., because the memory block was in a transient state and neither clean nor dirty). To count the number of prefetched blocks that are referenced, the PrefetchBit of a prefetched block is set; when a block is accessed with its PrefetchBit set, the UsefulCounter is incremented and the PrefetchBit is cleared.


Every time the PrefetchCounter reaches its maximum (i.e., it wraps around), the value of the UsefulCounter is compared against two preset thresholds to determine whether the LookaheadCounter, which is initially set to one, should be changed. If the UsefulCounter exceeds the upper threshold, we are in a phase of execution where the program could benefit from a higher degree of prefetching, and therefore the LookaheadCounter is incremented. If the UsefulCounter is lower than the lower threshold, the amount of prefetching is too high and the LookaheadCounter is decremented. Finally, if the UsefulCounter has a value between the two thresholds, the LookaheadCounter is not affected. In all cases, the UsefulCounter is cleared. In our evaluation, we have considered counters modulo 16 (4 bits).


When the LookaheadCounter reaches zero, prefetching is turned off. To turn it back on, we use the following mechanism. When a block is received on a read miss and prefetching is turned off, the ZeroBit in the corresponding SLC block frame, which is initially cleared, is set to indicate that the following block in the address space could have been prefetched
and the PrefetchCounter is incremented. On a read miss, a cache lookup is made to the previous block (by address); if it
hits and the ZeroBit is set, the UsefulCounter is incremented and the ZeroBit is cleared. The ZeroBit of a block is also
cleared when the block is accessed and the LookaheadCounter is not zero to keep the number of ZeroBits that have been previously set to a minimum.
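As a rough illustration of the counter logic described above, the following C sketch adjusts the degree of prefetching when the PrefetchCounter wraps around. The threshold values and the cap on the degree are assumptions made for the example; the actual values used in the original evaluation are not reproduced here.
 /* Sketch of the adaptive degree-adjustment logic with modulo-16 counters.
  * HI_THRESHOLD, LO_THRESHOLD, and MAX_DEGREE are assumed example values. */
 #include <stdio.h>
 
 #define COUNTER_MAX  16      /* 4-bit counters, as in the evaluation     */
 #define HI_THRESHOLD 12      /* assumed upper threshold                  */
 #define LO_THRESHOLD  4      /* assumed lower threshold                  */
 #define MAX_DEGREE    8      /* assumed cap on the degree of prefetching */
 
 static int prefetch_counter  = 0;   /* prefetch acknowledgments since wrap */
 static int useful_counter    = 0;   /* useful prefetches since wrap        */
 static int lookahead_counter = 1;   /* current degree K, initially one     */
 
 /* Called when a PreData/PreNeg acknowledgment arrives (or, while prefetching
  * is off, when the ZeroBit mechanism counts a missed prefetch opportunity). */
 void count_prefetch(void) {
     if (++prefetch_counter == COUNTER_MAX) {      /* counter wraps around */
         if (useful_counter > HI_THRESHOLD && lookahead_counter < MAX_DEGREE)
             lookahead_counter++;                  /* prefetching is paying off   */
         else if (useful_counter < LO_THRESHOLD && lookahead_counter > 0)
             lookahead_counter--;                  /* too many useless prefetches */
         prefetch_counter = 0;
         useful_counter   = 0;                     /* cleared in all cases */
     }
 }
 
 /* Called when a block with its PrefetchBit (or ZeroBit) set is referenced. */
 void count_useful(void) { useful_counter++; }
 
 int main(void) {
     for (int i = 0; i < 64; i++) { count_useful(); count_prefetch(); }
     printf("degree of prefetching is now %d\n", lookahead_counter);
     return 0;
 }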


==Chip Multiprocessing Prefetching (CMP)==
[[File:PrefetchImp.png|thumb|right|upright|300px|Figure 2.3. Prefetch Implementation]]
Prefetching the lowest miss address stream in the cache hierarchy has many advantages, particularly in a CMP system. First, in a CMP, the L2 cache is often shared by all processors on the chip. Consequently, prefetching the L2 miss address stream can share prefetch history among the processors, resulting in larger history tables. Second, prefetching L2 miss addresses reduces contention on the cache ports, which is becoming increasingly important as the number of processors per chip grows. Before a prefetch is sent to the memory subsystem, it must access the L2 directory. Since the L2 miss address stream has the fewest memory references, it generates fewer prefetches and accesses the cache ports less often. Last, prefetching into the L1 is relatively insignificant, since modern out-of-order processors can tolerate most L1 data cache misses with relatively little performance degradation. Prefetching in a CMP is more difficult than in a uniprocessor system. In addition to limited bandwidth and increased latency (as described earlier), cache coherence protocols play an important role in CMP prefetching.


==Disadvantages of Prefetching==


1. Increased complexity and overhead of handling the prefetching algorithms. Performance must improve significantly to counterbalance this overhead and complexity, or the effort is not worthwhile.


2. With multiple cores, prefetch requests can originate from a variety of different cores. This puts additional stress on the memory system, which must handle not only regular requests but also prefetches from different sources, greatly increasing the overhead and complexity of the logic.


3. If prefetched data is stored in the data cache, then cache conflicts, or cache pollution, can become a significant burden.


= '''Consistency models''' <ref>http://titanium.cs.berkeley.edu/papers/kamil-su-yelick-sc05.pdf</ref> <ref>http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf</ref>=
The interface for memory in a [http://en.wikipedia.org/wiki/Shared_memory shared memory] multiprocessor is called a memory consistency model. As the interface between the programmer and the system, the effect of the memory consistency model is pervasive in a shared memory system. The model affects programmability because programmers must use it to reason about the correctness of their programs. The model affects the performance of the system because it determines the types of optimizations that may be exploited by the hardware and the system software. Finally, due to a lack of [http://en.wikipedia.org/wiki/Consensus consensus] on a single model, portability can be affected when moving software across systems supporting different models.


A memory consistency model specification is required for every level at which an interface is defined between the programmer and the system. At the machine code interface, the memory model specification affects the designer of the machine hardware and the programmer who writes or reasons about machine code. At the high level language interface, the specification affects the programmers who use the high level language and the designers of both the software that converts high-level language code into machine code and the hardware that executes this code. Therefore, the programmability, performance, and portability concerns may be present at several different levels.


In summary, the memory model influences the writing of parallel programs from the programmer’s perspective, and virtually all aspects of designing a parallel system (including the processor, memory system, interconnection network, compiler, and programming languages) from a system designer’s perspective.


The memory consistency model in shared memory parallel programming controls the order in which memory operations performed by one thread may be observed by another. Consistency models are also used in distributed systems such as distributed shared memory systems or distributed data stores (filesystems, databases, optimistic replication systems, or Web caches). A system supports a given model if operations on memory follow its specific rules. The data consistency model specifies a contract between programmer and system, wherein the system guarantees that if the programmer follows the rules, memory will be consistent and the results of memory operations will be predictable.


==''Sequential Consistency Model (SC)'' <ref>A Primer on Memory Consistency and Cache Coherence. - Daniel J. Sorin, Mark D. Hill, and David A. Wood</ref>==


[[Image:2.1.jpg|thumb|right|300px|Figure 3.1. Sequential Consistency Example]]
Arguably the most intuitive memory consistency model is [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency] (SC). Sequential consistency was first formalized by [http://en.wikipedia.org/wiki/Leslie_Lamport Leslie Lamport]. Lamport first called a single processor (core) sequential if “the result of an execution is the same as if the operations had been executed in the order specified by the program.” He then called a multiprocessor sequentially consistent if “the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.” This total order of operations is called memory order. In SC, memory order respects each core’s program order, but other consistency models may permit memory orders that do not always respect the program orders.




An easy way to enforce sequential consistency is to insert memory fences after each shared memory access. This forbids all reordering of shared memory operations, which prevents optimizations such as prefetching and code motion, resulting in an unacceptable performance penalty. Various techniques have been proposed to minimize the number of fences (the delay set) required to enforce sequential consistency.
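As a small illustration of this idea, the sketch below uses C11 atomics and places a full fence after each shared write in a Dekker-style store-buffering test. The variable names and the test itself are illustrative assumptions, not taken from any particular system.
 /* With a sequentially consistent fence after each write, the outcome
  * r1 == 0 && r2 == 0 is forbidden, just as under SC. Removing the fences
  * allows a write buffer to let each load bypass the preceding store. */
 #include <stdatomic.h>
 #include <pthread.h>
 #include <stdio.h>
 
 atomic_int x = 0, y = 0;
 int r1, r2;
 
 void *thread1(void *arg) {
     atomic_store_explicit(&x, 1, memory_order_relaxed);
     atomic_thread_fence(memory_order_seq_cst);      /* fence after the write */
     r1 = atomic_load_explicit(&y, memory_order_relaxed);
     return NULL;
 }
 
 void *thread2(void *arg) {
     atomic_store_explicit(&y, 1, memory_order_relaxed);
     atomic_thread_fence(memory_order_seq_cst);      /* fence after the write */
     r2 = atomic_load_explicit(&x, memory_order_relaxed);
     return NULL;
 }
 
 int main(void) {
     pthread_t a, b;
     pthread_create(&a, NULL, thread1, NULL);
     pthread_create(&b, NULL, thread2, NULL);
     pthread_join(a, NULL);
     pthread_join(b, NULL);
     printf("r1=%d r2=%d\n", r1, r2);   /* never prints r1=0 r2=0 */
     return 0;
 }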




Figure 3.2 depicts the abstraction of memory provided to programmers by a sequentially consistent system. Multiple processes appear to share a single logical memory, even though in the real machine main memory may be distributed across multiple processors, each with its private caches and write buffers.


[[Image:3.2.jpg|thumb|center|800px|Figure 3.2. Programmer’s abstraction of the memory subsystem under the sequential consistency model.]]


Every processor appears to issue and complete memory operations one at a time and atomically in program order; that is, a memory operation does not appear to be issued until the previous one has completed, and the common memory appears to service these requests one at a time in an interleaved manner according to an arbitrary (but hopefully fair) schedule. Memory operations appear atomic in this interleaved order; that is, it should appear globally (to all processors) as if one operation in the consistent interleaved order executes and completes before the next one begins.




'''Sequential Consistency Example''':


  From figure 3.1, P2 does not see P1’s write of z = 5 before its first read of z, so it happens to have an out-of-date value.
  However, the write propagates to P2 before its second read of z.  
  This is legal under SC because,
      -> Processors do not always have to see up-to-date values
      -> Processors just need to see writes in the order they happen
  Note: it would also have been legal under SC for the write of z = 5 to propagate to P2 before its first read of z or after its second read of z.
  A pictorial representation appears in figure 3.3.


[[Image:2.2_new.jpg|thumb|center|1100px|Figure 3.3. Sequential Consistency Example
(a) P1 writes z = 5. Then this updated value is sent over the interconnection network to the other processors.
(b) P2's first read of z returns the old value, because P1's write has not yet propagated to it.
(c) By the time P2 reads z the second time, the remote write by P1 has propagated through the interconnection network. ]]


===''Importance of Write Atomicity''<ref>Parallel Computer Architecture: A Hardware/Software Approach. - David E. Culler, University of California, Berkeley; Jaswinder Pal Singh, Princeton University</ref>===




'''Write atomicity''', included in the definition of SC above, implies that the position in the total order at which a write appears to perform should be the same with respect to all processors. It ensures that nothing a processor does after it has seen the new value produced by a write becomes visible to other processes before they too have seen the new value for that write. In effect, while coherence (write serialization) says that writes to the same location should appear to all processors to have occurred in the same order, sequential consistency says that all writes (to any location) should appear to all processors to have occurred in the same order. The following example shows why write atomicity is important.




'''Example:'''


[[Image:Table3.4.jpg|thumb|right|600px|Figure 3.4. Example illustrating the importance of write atomicity for sequential consistency]]


Consider the three processes in Figure 3.4; the example shows how not preserving write atomicity violates sequential consistency.
Since P2 waits until '''X''' becomes 3 and then sets '''Y''' to 80, and since P3 waits until '''Y''' becomes 80 and only then reads value of '''X''', from transitivity we would infer that P3 should find the value of '''X''' to be 3. If P2 is allowed to go on past the read of '''X''' and write '''Y''' before it is guaranteed that P3 has seen the new value of '''X''', then P3 may read the new value of '''Y''' but the old value of '''X''' from its cache, violating our sequentially consistent intuition.
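The same pattern can be expressed with C11 sequentially consistent atomics, which do provide write atomicity. The sketch below is illustrative; the variable names and values simply follow the figure.
 /* Under SC (which seq_cst atomics provide here), P3 must read X == 3.
  * A machine without write atomicity could let P3 read the stale X. */
 #include <stdatomic.h>
 #include <pthread.h>
 #include <stdio.h>
 
 atomic_int X = 0, Y = 0;
 int r3;
 
 void *p1(void *arg) { atomic_store(&X, 3); return NULL; }
 
 void *p2(void *arg) {
     while (atomic_load(&X) != 3) ;   /* wait until P1's write is visible  */
     atomic_store(&Y, 80);
     return NULL;
 }
 
 void *p3(void *arg) {
     while (atomic_load(&Y) != 80) ;  /* wait until P2's write is visible  */
     r3 = atomic_load(&X);            /* transitivity: must observe X == 3 */
     return NULL;
 }
 
 int main(void) {
     pthread_t t1, t2, t3;
     pthread_create(&t1, NULL, p1, NULL);
     pthread_create(&t2, NULL, p2, NULL);
     pthread_create(&t3, NULL, p3, NULL);
     pthread_join(t1, NULL); pthread_join(t2, NULL); pthread_join(t3, NULL);
     printf("P3 read X = %d\n", r3);  /* always 3 */
     return 0;
 }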


Each process’s program order imposes a partial order on the set of all operations; that is, it imposes an ordering on the subset of the operations that are issued by that process. An interleaving (in the above sense) of the operations from different processes defines a total order on the set of all operations. Since the exact interleaving is not defined by SC, interleaving the partial (program) orders for different processes may yield a large number of possible total orders.  


The following definitions apply:
*''Sequentially Consistent Execution'' - An execution of a program is said to be sequentially consistent if the results it produces are the same as those produced by any one of these possible total orders (interleavings as defined earlier). That is, there should exist a total order or interleaving of program orders from processes that yields the same result as that actual execution.
*''Sequentially Consistent System'' - A system is sequentially consistent if any possible execution on that system corresponds to (produces the same results as) some possible total order as defined above.


Of course, an implicit assumption throughout is that a read returns the last value that was written to that same location (by any process) in the interleaved total order.


===''Sufficient Conditions for Preserving Sequential Consistency ''===


It is possible to define a set of sufficient conditions that, if obeyed by the system, will guarantee sequential consistency in a multiprocessor, whether bus-based or distributed, cache-coherent or not. The following conditions, adapted from their original formulation, are commonly used because they are relatively simple without being overly restrictive:


* Every process issues memory requests in the order specified by the program.


* After a write operation is issued, the issuing process waits for the write to complete before issuing its next operation.


* After a read operation is issued, the issuing process waits for the read to complete, and for the write whose value is being returned by the read to complete, before issuing its next operation. That is, if the write whose value is being returned has performed with respect to this processor (as it must have if its value is being returned) then the processor should wait until the write has performed with respect to all processors.


==''Relaxed Consistency Models'' <ref>http://indigo.ece.neu.edu/~jmankin/docs/jmankin_mem_const.pdf</ref>==


Once it became clear that sequential consistency wasn’t always necessary, and that the hardware overhead and performance degradation were not justified by ease-of-programming, there was a trend to create less strict models. The result is the group of memory consistency models known collectively as Relaxed Consistency Models, as these models relax one or more of the requirements of sequential consistency. Relaxed consistency models can be partitioned into subgroups using four comparisons: Type of Relaxation, Synchronizing vs. Non-Synchronizing, Issue vs. View-Based, and Relative Model Strength.


===Relaxation===
A simple and effective way of categorizing relaxed consistency models is by defining which requirement of sequential consistency is relaxed. Using the sufficient conditions for sequential consistency specified above, we can say that a relaxed consistency model relaxes either the program order or the write atomicity requirement. When relaxing program order, the model may relax any or all of the orderings on memory operation pairs: read-after-write, write-after-write, or read/write-after-read. When relaxing write atomicity, a model may allow a processor to view its own writes before another processor can, or allow a processor to view another processor’s write before the rest of the processors in the system can.


===Synchronizing vs. Non-Synchronizing===
A synchronizing model divides shared memory accesses into at least two groups and assigns a different consistency restriction to each group. For example, one type of memory access may only need a weak consistency model, whereas another type of memory access may require a strict consistency model. A non-synchronizing model does not differentiate between individual memory accesses and assigns the same consistency model to all accesses collectively.


===Issue vs. View-Based===
An issue-based relaxation focuses on how the ordering of an instruction issue will be seen by the entire system, as a collective unit. On the other hand, a view-based method does not aim to simulate sequential consistency; in this category of models, each processor is allowed its own view of the ordering of events in the system, and these views do not need to match.


===Relative Model Strength===
Some models are inherently stricter than others. Sequential consistency, described above, is one of the strictest, in that it allows no relaxations. Other models may be weaker but demand more programmer effort. If the strength of a model is rated by the program order or atomicity requirements it relaxes, it may be possible to compare the strength of different models directly, as in the case where one model relaxes everything another model relaxes, plus more. At other times, models cannot be directly related, because they relax different requirements and no relative importance can be derived.


===The intuition behind Relaxed memory consistency models===
The intuition behind relaxed models is that SC is usually too conservative, in that many of the orders it preserves are not really needed to satisfy a programmer’s intuition in most situations.


'''Consider the simple example shown in Figure 3.5. '''
[[Image:Relax.jpg|thumb|right|700px|Figure 3.5. Intuition behind relaxed memory consistency models
(a) Orderings Maintained By Sequential Consistency.
(b) Orderings Necessary For Correct Program Semantics. ]]


On the left are the orderings that will be maintained by an SC implementation. On the right are the orderings that are necessary for intuitively correct program semantics. The latter are far fewer. For example, writes to variables '''X''' and '''Y''' by P1 can be reordered without affecting the results observed by the program; all we must ensure is that both of them complete before variable flag is set to 1.  


Similarly, reads to variables '''X''' and '''Y''' can be reordered at P2, once flag has been observed to change to value 1. Even with these re-orderings, the results look just like those of an SC execution.  


On the other hand, while the accesses to flag are also simple variable accesses, a model that allowed them to be reordered with respect to '''X''' and '''Y''' at either process would compromise the intuitive semantics and SC results. It would be wonderful if system software or hardware could automatically detect which program orders are critical to maintaining SC semantics, and allow the others to be violated for higher performance.  


However, the problem is undecidable for general programs, and inexact solutions are often too conservative to be very useful.
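In practice, programmers express the critical orders explicitly through synchronization. A minimal C11 sketch of the flag pattern from Figure 3.5 is shown below; the data values and variable names are assumptions for illustration, and only the release/acquire pair on flag constrains the ordering.
 /* Stores to X and Y may be reordered with each other, and so may the
  * loads in the consumer, but the release/acquire pair on flag guarantees
  * that both writes are visible once flag is observed to be 1. */
 #include <stdatomic.h>
 #include <pthread.h>
 #include <stdio.h>
 
 int X, Y;                  /* ordinary data: no ordering of their own */
 atomic_int flag = 0;
 
 void *producer(void *arg) {
     X = 10;
     Y = 20;
     atomic_store_explicit(&flag, 1, memory_order_release);
     return NULL;
 }
 
 void *consumer(void *arg) {
     while (atomic_load_explicit(&flag, memory_order_acquire) != 1) ;
     printf("X=%d Y=%d\n", X, Y);     /* guaranteed to print X=10 Y=20 */
     return NULL;
 }
 
 int main(void) {
     pthread_t p, c;
     pthread_create(&p, NULL, producer, NULL);
     pthread_create(&c, NULL, consumer, NULL);
     pthread_join(p, NULL);
     pthread_join(c, NULL);
     return 0;
 }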


===Program Orders===


Let us discuss some of the specifications of consistency models, using the relaxations in [http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf program order] that they allow as our primary axis for grouping models together.
 
====''Relaxing the Write-to-Read Program Order''====
The main motivation for this class of models is to allow the hardware to hide the latency of write operations that miss in the first-level cache. While the write miss is still in the write buffer and not yet visible to other processors, the processor can issue and complete reads that hit in its cache or even a single read that misses in its cache.
 
Examples: [https://blogs.oracle.com/d/entry/total_store_ordering_tso Total store ordering], [http://en.wikipedia.org/wiki/PRAM_consistency processor consistency]
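The classic store-buffering test makes this relaxation visible. The C11 sketch below is the counterpart of the fence example in the sequential consistency section, with the fences removed; the relaxed atomics permit the same surprising outcome that a TSO-style write buffer allows. Variable names are illustrative.
 /* Each load may bypass the preceding buffered store, so r1 == 0 && r2 == 0
  * is a legal outcome here, although it is impossible under SC. */
 #include <stdatomic.h>
 #include <pthread.h>
 #include <stdio.h>
 
 atomic_int x = 0, y = 0;
 int r1, r2;
 
 void *thread1(void *arg) {
     atomic_store_explicit(&x, 1, memory_order_relaxed);   /* write ...       */
     r1 = atomic_load_explicit(&y, memory_order_relaxed);  /* ... then a read */
     return NULL;
 }
 
 void *thread2(void *arg) {
     atomic_store_explicit(&y, 1, memory_order_relaxed);
     r2 = atomic_load_explicit(&x, memory_order_relaxed);
     return NULL;
 }
 
 int main(void) {
     pthread_t a, b;
     pthread_create(&a, NULL, thread1, NULL);
     pthread_create(&b, NULL, thread2, NULL);
     pthread_join(a, NULL);
     pthread_join(b, NULL);
     printf("r1=%d r2=%d\n", r1, r2);   /* r1=0 r2=0 is now possible */
     return 0;
 }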
 
====''Relaxing the Write-to-Read and Write-to-Write Program Orders''====
Allowing writes as well to bypass previous outstanding writes to different locations allows multiple writes to be merged in the write buffer before updating main memory, and thus multiple write misses to be overlapped and become visible out of order. The motivation is to further reduce the impact of write latency on processor stall time, and to improve communication efficiency between processors by making new data values visible to other processors sooner.
 
Example: [http://docs.oracle.com/cd/E19963-01/html/819-3196/hwovr-15.html Partial store ordering].
 
====''Relaxing All Program Orders: Weak Ordering and Release Consistency''====
 
''Weak Ordering'': The motivation behind the weak ordering model (also known as the weak consistency model) is quite simple. Most parallel programs use synchronization operations to coordinate accesses to data when this is necessary. Between synchronization operations, they do not rely on the order of accesses being preserved.
 
''Release Consistency'': [http://en.wikipedia.org/wiki/Release_consistency Release consistency] observes that weak ordering does not go far enough. It extends the weak ordering model by distinguishing among the types of synchronization operations and exploiting their associated semantics. In particular, it divides synchronization operations into acquires and releases. An acquire is a read operation (it can also be a read-modify-write) that is performed to gain access to a set of variables.
 
 
='''Prefetching under Consistency Models'''<ref>http://www.csl.cornell.edu/courses/ee572/gharachorloo.icpp91.pdf</ref>=
 
In the hardware prefetching technique, hardware issues prefetches for any memory operations that are in the reorder buffer but are not yet permitted by the consistency model to actually issue to the memory system.
 
For example, the processor might prefetch a read that is preceded by another incomplete memory operation in SC, or by an incomplete acquire operation in RC, thus overlapping them. For writes, prefetching allows the data and ownership to be brought to the cache before the write actually gets to the head of the buffer and can be issued to the memory system. These prefetches are nonbinding, which means that the data are brought into the cache, not into a register or the reorder buffer, and are still visible to the coherence protocol, so they do not violate the consistency model.
 
Let us assume the invalidation-based cache coherence scheme, with cache hit latency of 1 cycle and cache miss latency of 100 cycles. Assume the memory system can accept an access on every cycle (e.g., caches are lockup-free). We also assume no other processes are writing to the locations used in the examples and that the lock synchronizations succeed (i.e., the lock is free).
 
Example 4.1:
  lock L (miss)
  write A (miss)
  write B (miss)
  unlock L (hit)
 
Example 4.2:
  lock L (miss)
  read C (miss)
  read D (hit)
  read E[D] (miss)
  unlock L (hit)
 
 
Let us first consider the code segment in example 4.1. This code segment resembles a producer process updating the values of two memory locations. In a system with sequential consistency, each access is delayed until the previous access has performed. The first three accesses miss in the cache, while the unlock access hits because exclusive ownership was gained by the previous lock access. Therefore, the four accesses take a total of 301 cycles to perform. In a system with release consistency, the write accesses are delayed until the lock access has performed, and the unlock access is delayed until the write accesses have performed. However, the write accesses are pipelined. Therefore, the accesses take 202 cycles.
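A rough cycle tally, using the assumed 1-cycle hit and 100-cycle miss latencies, is consistent with the totals above (the exact overlap depends on the simulated timing):
  Under SC:          100 (lock) + 100 (write A) + 100 (write B) + 1 (unlock)        = 301 cycles
  Under RC:          100 (lock) + ~101 (write A and write B pipelined) + 1 (unlock) = 202 cycles
  With prefetching:  A and B are prefetched read-exclusive during the lock miss,
                     so 100 (lock) + 1 + 1 (writes now hit) + 1 (unlock)            = 103 cycles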
 
 
The prefetch technique described in this section boosts the performance of both the sequential and release consistent systems. Concerning the loop that would be used to implement the lock synchronization, we assume the branch predictor takes the path that assumes the lock synchronization succeeds. Thus, the lookahead into the instruction stream allows locations A and B to be prefetched in read-exclusive mode. Regardless of the consistency model, the lock access is serviced in parallel with prefetch for the two write accesses. Once the result for the lock access returns, the two write accesses will be satisfied quickly since the locations are prefetched into the cache. Therefore, with prefetching, the accesses complete in 103 cycles for both SC and RC. For this example, prefetching boosts the performance of both SC and RC
and also equalizes the performance of the two models.
 
 
We now consider the second code segment, in example 4.2. This code segment resembles a consumer process reading several memory locations. There are three read accesses within the critical section. As shown, the read of location D is assumed to hit in the cache, and the read of array E depends on the value of D to access the appropriate element. For simplicity, we ignore the delay due to address calculation for accessing the array element. Under SC, the accesses take 302 cycles to perform. Under RC, they take 203 cycles. With the prefetch technique, the accesses take 203 cycles under SC and 202 cycles under RC. Although the performance of both SC and RC is enhanced by prefetching, the maximum performance is not achieved for either model. The reason is simply that the address of the read access to array E depends on the value of D, and although the read access to D is a cache hit, this access is not allowed to perform (i.e., the value cannot be used by the processor) until the read of C completes (under SC) or until the lock access completes (under RC). Thus, while prefetching can boost performance by pipelining several accesses that are delayed due to consistency constraints, it fails to remedy the cases where out-of-order consumption of return values is important to allow the processor to proceed efficiently.
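A similar rough tally for example 4.2 (again, the exact overlap depends on the simulated timing):
  Under SC:          100 (lock) + 100 (read C) + 1 (read D) + 100 (read E[D]) + 1 (unlock) = 302 cycles
  Under RC:          100 (lock) + reads C and D overlapped, then ~100 (read E[D], which
                     must wait for D's value) + 1 (unlock)                                 = 203 cycles
  With prefetching:  C is prefetched during the lock miss, but E[D]'s address is unknown
                     until D may be consumed, giving ~203 cycles under SC and ~202 under RC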
 
= '''Conclusions''' <ref>ftp://ftp.cs.wisc.edu/markhill/Papers/icpp90_seqcon.pdf</ref>=
In this article, the different prefetching techniques and their relation to memory consistency have been discussed in terms of how they support multiprocessors today. An effort has been made to discuss the relative strengths and weaknesses of the different models. What we can conclude is that there is no single best model, only the model that best fits the application, architecture, and programmer. The choice is between a very strict consistency model such as SC and the more relaxed consistency models. The sequential memory model is the easiest model to use for parallel computing.
 
For a novice programmer who does not fully understand the problems inherent in memory consistency and cannot implement explicit mechanisms for synchronization, a sequentially consistent system may be the only model that will result in program correctness. There will be a significant penalty in performance, however, which could negate even the advantage of using a parallel system in the first place. Across a wide range of applications, the relaxed consistency models would average the best performance, but at greater programmer effort.
 
A memory model for shared-memory multiprocessor systems most commonly assumed by programmers is that of sequential consistency. Strong ordering has been believed to be a sufficient condition for implementing sequential consistency. As a necessary condition for a practical implementation of sequential consistency, it has been widely believed that a processor should not be allowed to issue an access until all effects of its previous access are observed.
 
= '''Quiz''' =
* Which of the following techniques is not a practical way to improve the performance of sequential consistency?
 
# Making stores speculative, so that a store is allowed to proceed to the cache before we are sure that a previous store will complete without being invalidated.
# Allowing a load to bypass a previous load, but marking it as speculative, and canceling it if the block has been invalidated by the time the previous load completes.-
# Overlapping data fetches generated by loads and stores, without actually overlapping the execution of loads and stores.
# Prefetching blocks in order to make loads complete more quickly.
 
 
* Which of the following is '''not''' a performance implication of sequential consistency?
 
# Non-blocking caches lose their advantage.-
# Only one instruction can be allowed to proceed at a time.
# A load cannot be allowed to complete before a preceding store reaches the cache.
# Compiler optimizations involving reordered memory accesses are not acceptable.
 
 
*Is write Atomicity important for SC?
 
# Yes
# No
# It is not at all related to SC.
 
= External links =
1. [http://129.16.20.23/~pers/pub/j5.pdf Sequential hardware prefetching in shared-memory multiprocessors]
 
2. [http://titanium.cs.berkeley.edu/papers/kamil-su-yelick-sc05.pdf Making Sequential Consistency Practical in Titanium]
 
3. [http://static.usenix.org/event/usenix08/tech/full_papers/baek/baek.pdf Prefetching with Adaptive Cache Culling for Striped Disk Arrays]
 
4. [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=669044&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D669044 An adaptive network prefetch scheme]
 
5. [http://www.csl.cornell.edu/courses/ee572/gharachorloo.icpp91.pdf Two Techniques to Enhance the Performance of Memory Consistency Models]
 
6. [http://research.cs.wisc.edu/multifacet/papers/computer98_sccase_pdf.pdf Multiprocessors Should Support Simple Memory Consistency Models]
 
7. [http://www.ee.ryerson.ca/~courses/ee8207/prefetchprj3.pdf Hardware Prefetching Schemes]
 
8. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.40.1422&rep=rep1&type=pdf Sequential Prefetching in Shared Memory Processors ]
 
9. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.1229&rep=rep1&type=pdf Software Controlled prefetching in Shared Memory Processors]
 
10. [http://www.cs.cmu.edu/~tcm/thesis/subsection2_10_3_2.html Relaxed consistency models]
 
=References=
<references/>
