Expertiza_Wiki - User contributions [en]

https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&feedformat=atom&user=Cslingaf Expertiza_Wiki - User contributions [en] 2026-07-23T22:13:14Z User contributions MediaWiki 1.41.0 https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch8_cl&diff=44637 CSC/ECE 506 Spring 2011/ch8 cl 2011-03-27T15:46:09Z

<p>Cslingaf: added in the performance section (missed it when I was first writing the chapter)</p> <hr /> <div>=Introduction to bus-based cache coherence in real machines=<br /> <br /> <br /> ==SMP Architecture==<br /> Most parallel software in the commercial market relies on the shared-memory programming model in which all processors access the same physical address space. And the most common multiprocessors today use [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture which use a common bus as the interconnect. In the case of multicore processors ("chip multiprocessors," or CMP) the SMP architecture applies to the cores treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies. This is called '''''cache coherence problem'''''. It is critical to achieve correctness and performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.<br /> <br /> <center>[[Image:Busbased SMP.jpg]]</center><br /> <br /> <br /> At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. 
The coherence states used by most cache coherence protocols are shown in ''Table 1'':<br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''States'''<br /> | '''Access Type'''<br /> | '''Invariant'''<br /> |-<br /> | '''Modified'''<br /> | read, write<br /> | all other caches in I state<br /> |-<br /> | '''Exclusive'''<br /> | read<br /> | all other caches in I state<br /> |-<br /> | '''Owned'''<br /> | read<br /> | all other caches in I or S state<br /> |-<br /> | '''Shared'''<br /> | read<br /> | no other cache in M or E state<br /> |-<br /> | '''Invalid'''<br /> | -<br /> | -<br /> |}<br /> <br /> <br /> The first widely adopted approach to cache coherence is snooping on a bus. We will now discuss how some real machines maintain cache coherence using '''''snooping-based coherence protocols'''''. For more information on snooping-based protocols, refer to Chapter 8 of the Solihin textbook.<br /> <br /><br /> <br /><br /> <br /> <br /> <br /> =Snooping Protocols=<br /> ==MSI Protocol==<br /> <br /> '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdx causes other caches to invalidate (demote) their copies to the '''I''' state. If the line is in the '''M''' state in another cache, that cache flushes it. 
A write that hits in the '''S''' state still issues a BusRdx, and the line is promoted (upgraded) to the '''M''' state.<br /> <br /> The following state transition diagram for the MSI protocol explains the working of the protocol:<br /> <br /> <center>[[Image:MSI.jpg]]</center><br /> <br /> <br /> <br /> ===Synapse protocol===<br /> From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the line goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, giving up the block entirely. Choosing '''S''' over '''I''' reflects the designer's assertion that the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> <br /> In the Synapse protocol the '''M''' state is called the '''D''' (Dirty) state. The following state transition diagram shows how the Synapse protocol works.<br /> <br /> <center>[[Image:Synapse1.jpg]]</center><br /> <br /> ==MESI==<br /> MSI has a major drawback: each read-write sequence incurs 2 bus transactions, irrespective of whether the cache line is stored in only one cache or not. This is a significant penalty for highly parallel programs that have little data sharing. 
The '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.<br /> Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, pg. 215.<br /> <br /> The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. <br /> * '''Invalid''' : The cache line is either not present or is invalid<br /> * '''Exclusive''' : The cache line is clean and is owned by this core/processor only<br /> * '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is stale.<br /> * '''Shared''' : The cache line is clean and may be shared by more than one core/processor<br /> <br /> In a nutshell, the MESI protocol works as follows: <br /> A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal, or as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). 
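<br /> <br /> The difference between MSI and MESI on a read followed by a write can be sketched in a few lines of Python. This is only a minimal illustration for a single cache line under simplifying assumptions; the function names are hypothetical and do not come from any real simulator, and the invalidating transaction is labelled BusUpgr even though a basic MSI implementation may issue a BusRdx instead.<br /> <br />

```python
# Minimal sketch of requester-side MSI vs. MESI behaviour for one cache line.
# All names are illustrative assumptions, not taken from a real simulator.

def load_state(protocol, other_copies_exist):
    """Return the state a line receives when fetched on a read miss (BusRd)."""
    if protocol == "MESI" and not other_copies_exist:
        return "E"  # sole clean copy in the system
    return "S"      # MSI has no E state, so a clean line is always S

def bus_transactions_for_read_then_write(protocol, other_copies_exist):
    """Count the bus transactions for a read miss followed by a write."""
    transactions = ["BusRd"]  # the read miss always goes to the bus
    state = load_state(protocol, other_copies_exist)
    if state == "S":
        # Other caches may hold the line, so they must be invalidated first.
        transactions.append("BusUpgr")
    # From E the write upgrades silently (E -> M) with no bus traffic.
    state = "M"
    return state, transactions

if __name__ == "__main__":
    for protocol in ("MSI", "MESI"):
        final, txns = bus_transactions_for_read_then_write(protocol, False)
        print(protocol, final, txns)
```

With no other cached copies, the MSI case issues two bus transactions while the MESI case issues one, which is the saving the Exclusive state is meant to provide.<br /> <br />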
The table below summarizes the MESI protocol.<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The transition diagram from the lecture slides is given below for reference.<br /> <br /> <center>[[Image:MESI.jpg]]</center> <br /><br /> <br /> The '''Pentium Pro''' microprocessor, introduced in 1995, was the '''first''' Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations. SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in '''Intel's (Nehalem-EP) quad-core x86-64'''. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the '''QuickPath Interconnect''', a point-to-point interconnection technology based on a distributed shared memory architecture. It uses a modified version of the MESI protocol called '''MESIF''', which introduces an additional state, F, the forward state. <br /> <br /> The '''Intel architecture''' uses the MESI protocol as the '''basis''' for cache coherence, whether on the older processors that communicate over a '''common bus''' or on processors using the newer Intel '''QuickPath''' point-to-point interconnection technology. 
<br /> <br /> Let us now walk through the '''MESIF protocol''':<br /> <br /> The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. <br /> As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache line in the F state responds to the request, while all the S state caches remain dormant. Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding line transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.<br /> All M to S and E to S state transitions now become '''M to F''' and '''E to F''' transitions. 
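<br /> <br /> The migration of the F state on a read can be illustrated with a small Python sketch. It models one cache line whose copies are only in the F, S or I states; the function name, the dictionary representation and the choice to give the requester the F state when memory supplies the data are assumptions made for illustration, not a description of Intel's implementation.<br /> <br />

```python
# Minimal sketch of a MESIF read miss for one cache line.
# `states` maps a (hypothetical) cache id to "F", "S" or "I".

def mesif_read_miss(states, requester):
    """Serve a read miss and return (new_states, responders)."""
    if states[requester] != "I":
        raise ValueError("requester must not already hold the line")
    # Only the copy in the F state answers; S copies stay silent.
    responders = [cache for cache, state in states.items() if state == "F"]
    new_states = dict(states)
    for cache in responders:
        new_states[cache] = "S"  # the old forwarder drops back to S
    # The requester now holds the newest copy and becomes the forwarder.
    # Assumption: this also applies when no F copy exists and memory responds.
    new_states[requester] = "F"
    return new_states, responders or ["memory"]

if __name__ == "__main__":
    before = {"cache0": "F", "cache1": "S", "cache2": "I"}
    after, who = mesif_read_miss(before, "cache2")
    print(who, after)
```

In this sketch a single responder supplies the data regardless of how many S copies exist, which is the traffic reduction the Forward state is intended to achieve.<br /> <br />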
<br /> The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol in that it is '''not''' a unique copy, because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. <br /> <br /> More information on the QuickPath Interconnect and MESIF protocol can be found at<br /> '''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''<br /> <br /> <br /> ==MOESI==<br /> The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die. Cache coherence poses bigger problems on such multiprocessors, and an appropriate coherence protocol was needed to address them. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing counterpart from Intel, used the MESI protocol to handle cache coherence. MESI came with the drawback of using much time and bandwidth in certain situations. <br /> <br /> [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state to the MESI protocol called '''“Owned”'''. MOESI addresses the bandwidth problem faced in the MESI protocol when a processor having invalid data in its cache wants to access data that another processor has modified. The processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. This drawback is removed in MOESI by allowing dirty sharing. When the data is held by a processor in the new state '''“Owned”''', it can provide other processors the modified data without or even before writing it to the main memory. This is called '''''dirty sharing'''''. 
The processor with the data in '''"Owned"''' stays responsible to update the main memory later when the cache line is evicted.<br /> <br /> MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture. The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.<br /> <br /> The five different states of the MOESI protocol are:<br /> * '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.<br /> * '''Owned (O)''' : The cache line has the most recent correct copy of the data . This can be shared by other processors. The processor in this state for this cache line is responsible to update the correct value in the main memory before it gets evicted. <br /> * '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory. <br /> * '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. 
<br /> * '''Invalid (I)''' : A cache line does not hold a valid copy of the data.<br /> <br /> A detailed explanation of this protocol's implementation on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]<br /> <br /> The following table summarizes the MOESI protocol:<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Owned'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | Maybe<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The state transitions for MOESI are shown below: <br /> <br /> <br /> <center>[[Image:MOESI_State_Transition_Diagram.jpg]]</center><br /> <br /> <center> MOESI State transition Diagram</center><br /> <br /><br /> <br /><br /> <br /> ===Optimization techniques on MOESI===<br /> <br /> In real machines, applying optimization techniques to the standard cache coherence protocol can improve the performance of the machine. 
For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. <br /> <br /> The optimization focuses on a small subset of compute problems which behave like Producer and Consumer programs. In such a computing problem, a thread of a program running on a single core produces data, which is consumed by a thread that is running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.<br /> <br /> When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Now, the producer thread circles the ring buffer to arrive back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). 
This will slow down the process.<br /> <br /> Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.<br /> <br /> You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> <br /> <br /> <br /> <br /> ==Dragon Protocol==<br /> The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update based coherence protocol which does not invalidate other cached copies like what we have seen in the coherence protocols so far. Write propagation is achieved by updating the cached copies instead of invalidating them. But the Dragon Protocol does not update memory on a cache to cache transfer and delays the memory and cache consistency until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover only the written '''byte''' or the '''word''' is '''communicated''' to the '''other caches''' instead of the whole block which further '''reduces''' the '''bandwidth''' usage. It has the ability to detect dynamically, the sharing status of a block and use a write through policy for shared blocks and write back for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''. <br /> * '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. 
<br /> * '''Shared Modified (Sm)''' - Only one cache in the system can hold the line in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.<br /> * '''Shared Clean (Sc)''' - Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).<br /> Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.<br /> <br /> <center>[[Image:Dragon.jpg]]</center><br /> <br /> The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses. The Dragon system was designed to support 4 to 8 Dragon processors. <br /> <br /> <br /> <br /> <br /> <br /> =Prefetching=<br /> <br /><br /> Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessors, however, prefetching can come at a cost: because of prefetching, data can be modified in such a way that the memory coherence protocol cannot handle the effects. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. <br /> <br /><br /> An example of this type of situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. 
The physical-memory references for the page tables are different from the physical-memory references for the data. Because of prefetching, there may be a problem with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:<br /> # The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.<br /> # Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.<br /> # Data in virtual-page A is accessed.<br /> <br /> Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual-page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical-page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.<br /> <br /> To prevent such errors, the processor provides special instructions, which software executes immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. 
It is not necessary to perform a TLB invalidation operation preceding the table update.<br /> <br /> More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> <br /> <br /> <br /> <br /> = CMP Implementation in Intel Architecture =<br /> <br /> Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. <br /> <br /> <br /> '''Uniprocessor Architecture'''<br /> <br /> The diagram below shows the structure of the memory cluster in Intel Pentium M processor.<br /> <br /> <center>[[Image:intel_cache1.jpg]]</center> <br /><br /> <br /> <br /> In this structure we have,<br /> * A unified on-chip '''L1 cache''' with the '''processor/core''',<br /> * A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,<br /> * The second level '''L2 cache''' along with the '''prefetch unit''' and<br /> * '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent across.These wide buses bring in multiple data bytes at a time. <br /> <br /> As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''', were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. 
It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.<br /> <br /> <br /> '''CMP Architecture'''<br /> <br /> For the CMP implementation, Intel chose the bus-based architecture using snoopy protocols over the '''directory protocol''' because, although a directory protocol reduces the active power due to reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for the processors in the mobility family, a directory-based solution was less favorable since battery life depends mainly on static power consumption and less on dynamic power.<br /> Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. <br /> The general CMP implementation structure of the Intel Core Duo is shown below<br /> <br /> <center>[[Image:intel_cache2.jpg]]</center> <br /><br /> <br /> This structure has the following changes when compared to the uniprocessor memory cluster structure. <br /> * The '''L1 cache''' and '''processor/core''' structure is '''duplicated''' to give 2 cores.<br /> * The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. 
The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.<br /> * The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.<br /> * A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.<br /> <br /> This new '''partitioned structure''' for the memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. <br /> For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to <br /> [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> <br /> The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol; From using a '''single shared bus''' to '''dual independent buses (DIB)''' doubling the available bandwidth and to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, the snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on next generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects using MESIF protocol'''.<br /> <br /> <br /> =Implementation Complexities=<br /> ==MESI==<br /> There are two possible causes of complexity with the MESI protocol during replacement of a cache line. 
Some MESI implementations require a message to be sent to memory when a cache line is flushed - an '''E''' to '''I''' transition, as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from the memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).<br /> <br/>Source: http://rsim.cs.illinois.edu/rsim/Manual/node109.html<br /> <br /> ==Word Invalidation==<br /> One complexity problem applying to a number of the protocols deals with invalidation. In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line. The other processors will thus have a mostly correct cache line, with only a word difference. This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared. This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line. One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is "not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent." 
In other words, if the application can tolerate multiple 'incorrect' words, several transitions to and from the invalid state may be avoided.<br /> <br/>Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps<br /> <br /> <br /> In essence, the solution proposed here is to extend the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.<br /> <br /> <br /> =Performance: MOESI vs MEI/MESI=<br /> Early multiprocessors (such as the PowerPC processors) were designed to work with three states: Modified, Exclusive and Invalid. This is similar to the MSI protocol discussed earlier; the term MEI is used to remain consistent with the sources used for reference. As time progressed, more multiprocessors transitioned to the MESI protocol. This is most likely due to the characteristics of MEI - it is "easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used" [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&csi=155278&sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source]. As a result, systems with two or more MEI processors will most likely not see the 'full' potential of the extra processors, due to the inefficient bus transactions.<br /> <br /> In contrast, the five-state MOESI protocol is more complex to implement than MESI and MEI/MSI. However, advantages arise from using it. As Andy Keane (VP of Marketing for PMC-Sierra) put it, "this [fifth] state allows shared data that is dirty to remain in the cache. Without this state, any shared line would be written back to memory to change the original state from modified. 
Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower" [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&csi=155278&sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source]. This advantage can be implemented by allowing MOESI to make some processor to processor transfers from level 2 caches.<br /> <br /> Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&csi=155278&sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). "Architects wrestle with multiprocessor options". Electronic engineering times (0192-1541), (1178), p. 48.]<br /> <br /> =References=<br /> # [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]<br /> # [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]<br /> # [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> # [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]<br /> # [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&%20MESI.pdf Cache consistency with MESI on Intel processor]<br /> # [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> # [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=4913 
Silicon Graphics Computer Systems]<br /> # [http://books.google.com/books?id=g82fofiqa5IC&printsec=frontcover&dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&source=bl&ots=COrdamlfVn&sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&hl=en&ei=0ZO6S4TJGcOclgejzI3BBw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CAgQ6AEwAA#v=onepage&q=&f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]<br /> # [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]<br /> # [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> # [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]<br /> # [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]<br /> # [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]<br /> # [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=289691 XDBus]</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch8_cl&diff=44516 CSC/ECE 506 Spring 2011/ch8 cl 2011-03-20T19:23:51Z

<p>Cslingaf: </p> <hr /> <div>=Introduction to bus-based cache coherence in real machines=<br /> <br /> <br /> ==SMP Architecture==<br /> Most parallel software in the commercial market relies on the shared-memory programming model in which all processors access the same physical address space. And the most common multiprocessors today use [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture which use a common bus as the interconnect. In the case of multicore processors ("chip multiprocessors," or CMP) the SMP architecture applies to the cores treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies. This is called '''''cache coherence problem'''''. It is critical to achieve correctness and performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.<br /> <br /> <center>[[Image:Busbased SMP.jpg]]</center><br /> <br /> <br /> At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. 
The different coherent states used by most of the cache coherent protocols are as shown in ''Table 1'':<br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''States'''<br /> | '''Access Type'''<br /> | '''Invariant'''<br /> |-<br /> | '''Modified'''<br /> | read, write<br /> | all other caches in I state<br /> |-<br /> | '''Exclusive'''<br /> | read<br /> | all other caches in I state<br /> |-<br /> | '''Owned'''<br /> | read<br /> | all other caches in I or S state<br /> |-<br /> | '''Shared'''<br /> | read<br /> | no other cache in M or E state<br /> |-<br /> | '''Invalid'''<br /> | -<br /> | -<br /> |}<br /> <br /> <br /> The first widely adopted approach to cache coherence is snooping on a bus. We will now discuss how some real time machines maintain cache coherence using '''''snooping based coherence protocols'''''. For more information on snooping based protocols refer to Solihin text book Chapter 8.<br /> <br /><br /> <br /><br /> <br /> <br /> <br /> =Snooping Protocols=<br /> ==MSI Protocol==<br /> <br /> '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' which is one of the earliest snooping-based cache coherence-protocols. It marks the cache line in '''Modified (M) ,Shared (S)''' and '''Invalid (I)''' state. '''Invalid''' means the cache line is either not present or is invalid state. If the cache line is clean and is shared by more than one processor , it is marked '''shared'''. If cache line is dirty and the processor has exclusive ownership of the cache line, it is present in '''Modified''' state. BusRdx causes others to invalidate (demote) to '''I''' state. If it is present in '''M''' state in another cache, it will flush. 
A processor write that hits a line in '''S''' state also issues a BusRdX, after which the line is promoted (upgraded) to '''M''' state.<br /> <br /> The following state transition diagram for the MSI protocol shows how the protocol works:<br /> <br /> <center>[[Image:MSI.jpg]]</center><br /> <br /> <br /> <br /> ===Synapse protocol===<br /> From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the line goes to '''S''' state. In certain cases it would be more appropriate to move to '''I''' state, giving up the block entirely. The choice between '''S''' and '''I''' reflects the designer's assumption about sharing patterns: going to '''S''' assumes that the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from '''M''' state to '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> <br /> In the Synapse protocol the '''M''' state is called the '''D''' (Dirty) state. The following state transition diagram for the Synapse protocol shows how it works.<br /> <br /> <center>[[Image:Synapse1.jpg]]</center><br /> <br /> ==MESI==<br /> MSI has a major drawback: each read-write sequence incurs 2 bus transactions, regardless of whether the cache line is stored in only one cache. This is a huge setback for highly parallel programs that have little data sharing. 
The '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.<br /> Let us briefly see how the MESI protocol works. For a more detailed version refer to the Solihin textbook, pg. 215.<br /> <br /> The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. <br /> * '''Invalid''' : The cache line is either not present or is invalid<br /> * '''Exclusive''' : The cache line is clean and is held by this core/processor only<br /> * '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is stale.<br /> * '''Shared''' : The cache line is clean and may be held by more than one core/processor<br /> <br /> In a nutshell, the MESI protocol works as follows: <br /> A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). 
A table is shown below to summarize the MESI protocol.<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The transition diagram from the lecture slides is given below for reference.<br /> <br /> <center>[[Image:MESI.jpg]]</center> <br /><br /> <br /> The '''Pentium Pro''' microprocessor, introduced in 1995, was the '''first''' Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations. SMP with the MESI protocol was used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in '''Intel's (Nehalem-EP) quad-core x86-64''' processors. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the '''QuickPath Interconnect''', a point-to-point interconnection technology based on a distributed shared memory architecture. It uses a modified version of the MESI protocol called '''MESIF''', which introduces an additional state, F, the forward state. <br /> <br /> The '''Intel architecture''' uses the MESI protocol as the '''basis''' for cache coherence, whether on the older processors that communicate over a '''common bus''' or on the new Intel '''QuickPath''' point-to-point interconnection technology. 
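<br /> <br /> The last row of the MESI summary table above (which writes reach the bus) can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions: it tracks the state of a single cache line, carries no data, and ignores eviction. The names <code>MesiLine</code>, <code>read</code>, <code>write</code> and <code>bus_log</code> are hypothetical and chosen for illustration; the transaction names BusRd, BusRdX and BusUpgr follow the text.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class MesiLine:
    """State of ONE cache line as held by several caches on a shared bus.

    Hypothetical teaching model: no data, no eviction, no timing.
    """

    def __init__(self, n_caches):
        self.states = [State.I] * n_caches
        self.bus_log = []  # (cache id, bus transaction) in issue order

    def _others(self, cid):
        return [i for i in range(len(self.states)) if i != cid]

    def read(self, cid):
        if self.states[cid] is not State.I:
            return  # read hit in M, E or S: nothing goes to the bus
        self.bus_log.append((cid, "BusRd"))
        shared = False
        for i in self._others(cid):
            if self.states[i] is not State.I:
                shared = True
                # An M copy would flush its data here; M and E drop to S.
                self.states[i] = State.S
        self.states[cid] = State.S if shared else State.E

    def write(self, cid):
        st = self.states[cid]
        if st is State.S:
            self.bus_log.append((cid, "BusUpgr"))
        elif st is State.I:
            self.bus_log.append((cid, "BusRdX"))
        # Writes in M or E are silent: no bus transaction is logged.
        if st in (State.S, State.I):
            for i in self._others(cid):
                self.states[i] = State.I
        self.states[cid] = State.M


# Usage: a private read-then-write sequence needs a single bus transaction.
line = MesiLine(n_caches=2)
line.read(0)   # miss, no other copy: I -> E
line.write(0)  # E -> M without touching the bus
print(line.states[0].value, line.bus_log)
```

In this sketch the read-then-write sequence on an unshared line logs only the initial BusRd, which is the saving over MSI that the Exclusive state is meant to provide.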
<br /> <br /> Let us now briefly walk through the '''MESIF protocol''':<br /> <br /> The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. <br /> As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the S state caches remain dormant. Hence, by designating a single cache to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding cache transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.<br /> All M to S and E to S state transitions will now be from '''M to F''' and '''E to F'''. 
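<br /> <br /> The migration of the F state described above can be sketched in a few lines. This is a hedged illustration rather than Intel's implementation: it models read misses on a single line only, assumes the F designation always moves to the most recent requester, and uses the hypothetical names <code>MesifLine</code>, <code>read</code> and <code>suppliers</code>.

```python
class MesifLine:
    """Read-only sketch of ONE cache line under MESIF (hypothetical model).

    States are single letters; writes and evictions are not modelled.
    """

    def __init__(self, n_caches):
        self.states = ["I"] * n_caches
        self.suppliers = []  # who answered each read miss, in order

    def read(self, cid):
        if self.states[cid] != "I":
            return  # read hit: no request is sent
        supplier = "memory"
        any_copy = False
        for i, st in enumerate(self.states):
            if i == cid or st == "I":
                continue
            any_copy = True
            if st in ("M", "E", "F"):
                # At most one cache can be in M, E or F, so there is a
                # single designated responder. (A dirty M line is assumed
                # to be written back as part of the transfer.)
                supplier = i
                self.states[i] = "S"  # the responder gives up its role
            # Caches holding the line in S stay silent.
        self.suppliers.append(supplier)
        # Assumption: the newest copy takes over the Forward role.
        self.states[cid] = "F" if any_copy else "E"


# Usage: three caches read the same line one after another.
line = MesifLine(n_caches=4)
for cid in (0, 1, 2):
    line.read(cid)
print(line.states, line.suppliers)
```

Each read miss in this sketch is answered by exactly one node, and the forwarding duty moves from cache to cache instead of staying with the first holder.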
<br /> The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol in that a line in the F state is clean: a valid copy is also stored in memory. Thus, unlike a line in the Owned state of the MOESI protocol, whose data is dirty and must be written back to memory before it is dropped, a line in the F state can be evicted or converted to the S state, if desired. <br /> <br /> More information on the QuickPath Interconnect and MESIF protocol can be found at<br /> '''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''<br /> <br /> <br /> ==MOESI==<br /> [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die. Cache coherence becomes a bigger problem on such multiprocessors, and an appropriate coherence protocol was needed to address it. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence. MESI has the drawback of consuming considerable time and bandwidth in certain situations. <br /> <br /> [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state to the MESI protocol called '''“Owned”'''. MOESI addresses the bandwidth problem that arises in the MESI protocol when a processor with an invalid copy in its cache wants to access data that another processor has modified. The processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. This drawback is removed in MOESI by allowing dirty sharing. When the data is held by a processor in the new state '''“Owned”''', it can provide other processors the modified data without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. 
The processor with the data in '''"Owned"''' stays responsible to update the main memory later when the cache line is evicted.<br /> <br /> MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture. The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.<br /> <br /> The five different states of the MOESI protocol are:<br /> * '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.<br /> * '''Owned (O)''' : The cache line has the most recent correct copy of the data . This can be shared by other processors. The processor in this state for this cache line is responsible to update the correct value in the main memory before it gets evicted. <br /> * '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory. <br /> * '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. 
<br /> * '''Invalid (I)''' : A cache line does not hold a valid copy of the data.<br /> <br /> A detailed explanation of this protocol's implementation on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]<br /> <br /> The following table summarizes the MOESI protocol:<br /> <br /> <br /> {| class="wikitable" border="1"<br /> <br /> |-<br /> <br /> | '''Cache Line State:'''<br /> <br /> | '''Modified'''<br /> <br /> | '''Owned''' <br /> <br /> | '''Exclusive'''<br /> <br /> | '''Shared'''<br /> <br /> | '''Invalid'''<br /> <br /> |-<br /> <br /> | '''This cache line is valid?'''<br /> <br /> | Yes<br /> <br /> | Yes<br /> <br /> | Yes<br /> <br /> | Yes<br /> <br /> | No<br /> <br /> |-<br /> <br /> | '''The memory copy is…'''<br /> <br /> | out of date<br /> <br /> | out of date<br /> <br /> | valid<br /> <br /> | valid<br /> <br /> | -<br /> <br /> |-<br /> <br /> | '''Copies exist in caches of other processors?'''<br /> <br /> | No<br /> <br /> | Maybe (in S state)<br /> <br /> | No<br /> <br /> | Maybe<br /> <br /> | Maybe<br /> <br /> |-<br /> <br /> | '''A write to this line'''<br /> <br /> | does not go to bus<br /> <br /> | goes to bus and updates cache<br /> <br /> | does not go to bus<br /> <br /> | goes to bus and updates cache<br /> <br /> | goes directly to bus<br /> <br /> |}<br /> <br /> <br /> <br /> The state transitions for MOESI are shown below: <br /> <br /> <br /> <center>[[Image:MOESI_State_Transition_Diagram.jpg]]</center><br /> <br /> <center> MOESI State transition Diagram</center><br /> <br /><br /> <br /><br /> <br /> ===Optimization techniques on MOESI===<br /> <br /> In real machines, applying optimization techniques to the standard cache coherence protocol improves the performance of the machine. 
For example [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0×10) which is AMD’s first generation to incorporate 4 distinct cores on a single die, and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. <br /> <br /> It focuses on a small subset of compute problems which behave like Producer and Consumer programs. In such a computing problem, a thread of a program running on a single core produces data, which is consumed by a thread that is running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to/from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.<br /> <br /> When the producer thread , writes a new entry, it allocates cache-lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Now, the producer thread circles the ring buffer to arrive back to the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). 
This will slow down the process.<br /> <br /> Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.<br /> <br /> You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> <br /> <br /> <br /> <br /> ==Dragon Protocol==<br /> The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update based coherence protocol which does not invalidate other cached copies like what we have seen in the coherence protocols so far. Write propagation is achieved by updating the cached copies instead of invalidating them. But the Dragon Protocol does not update memory on a cache to cache transfer and delays the memory and cache consistency until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover only the written '''byte''' or the '''word''' is '''communicated''' to the '''other caches''' instead of the whole block which further '''reduces''' the '''bandwidth''' usage. It has the ability to detect dynamically, the sharing status of a block and use a write through policy for shared blocks and write back for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''. <br /> * '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. 
<br /> * '''Shared Modified (Sm)''' - Only one cache in the system can hold a given block in the Shared Modified state. Potentially two or more caches have this block, memory is not up to date, and this processor's cache has modified the block.<br /> * '''Shared Clean (Sc)''' - Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in Sm state, memory is up to date; otherwise it is not).<br /> Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.<br /> <br /> <center>[[Image:Dragon.jpg]]</center><br /> <br /> The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses. The Dragon system was designed to support 4 to 8 Dragon processors. <br /> <br /> <br /> <br /> <br /> <br /> =Prefetching=<br /> <br /><br /> Instruction prefetching is a technique used to speed up the execution of a program. But in multiprocessors, prefetching comes at a cost. Due to prefetching, the data can be modified in such a way that the memory coherence protocol will not be able to handle the effects. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee subsequent data accesses are coherent. <br /> <br /><br /> An example of this type of situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. 
The physical-memory references for the page tables are different from the physical-memory references for the data. Because of prefetching, there may be a problem with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:<br /> # The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.<br /> # Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.<br /> # Data in virtual-page A is accessed.<br /> <br /> Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual-page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical-page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.<br /> <br /> To prevent such errors, special instructions are provided, which software executes immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. 
It is not necessary to perform a TLB invalidation operation preceding the table update.<br /> <br /> More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> <br /> <br /> <br /> <br /> = CMP Implementation in Intel Architecture =<br /> <br /> Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. <br /> <br /> <br /> '''Uniprocessor Architecture'''<br /> <br /> The diagram below shows the structure of the memory cluster in Intel Pentium M processor.<br /> <br /> <center>[[Image:intel_cache1.jpg]]</center> <br /><br /> <br /> <br /> In this structure we have,<br /> * A unified on-chip '''L1 cache''' with the '''processor/core''',<br /> * A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,<br /> * The second level '''L2 cache''' along with the '''prefetch unit''' and<br /> * '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent across.These wide buses bring in multiple data bytes at a time. <br /> <br /> As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''', were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. 
It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version on the cache line exists in any other cache in the system.<br /> <br /> <br /> '''CMP Architecture'''<br /> <br /> For CMP implementation, Intel chose the bus-based architecture using snoopy protocols vs the '''directory protocol''' because though directory protocol reduces the active power due to reduced snoop activity, it '''increased''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for the processors in the mobility family, directory-based solution was less favorable since battery life mainly depends on static power consumption and less on dynamic power.<br /> Let us examine how '''CMP''' was implemented in '''Intel Core Duo''', which was one of the first dual-core processor for the budget/entry-level market. <br /> The general CMP implementation structure of the Intel Core Duo is shown below<br /> <br /> <center>[[Image:intel_cache2.jpg]]</center> <br /><br /> <br /> This structure has the following changes when compared to the uniprocessor memory cluster structure. <br /> * '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.<br /> * The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. 
The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.<br /> * The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.<br /> * A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.<br /> <br /> This new '''partitioned structure''' for the memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. <br /> For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to <br /> [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> <br /> The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol; From using a '''single shared bus''' to '''dual independent buses (DIB)''' doubling the available bandwidth and to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, the snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on next generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects using MESIF protocol'''.<br /> <br /> <br /> =Implementation Complexities=<br /> ==MESI==<br /> There are two possible causes of complexity with the MESI protocol during replacement of a cache line. 
Some MESI implementations require a message to be sent to memory when a cache line is flushed - an '''E''' to '''I''' transition, as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from the memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).<br /> <br/>Source: http://rsim.cs.illinois.edu/rsim/Manual/node109.html<br /> <br /> ==Word Invalidation==<br /> One complexity problem applying to a number of the protocols deals with invalidation. In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line. The other processors will thus have a mostly correct cache line, with only a word difference. This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared. This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line. One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is "not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent." 
In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.<br /> <br/>Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps<br /> <br /> <br /> In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.<br /> <br /> =References=<br /> # [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]<br /> # [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]<br /> # [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> # [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]<br /> # [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&%20MESI.pdf Cache consistency with MESI on Intel processor]<br /> # [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> # [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=4913 Silicon Graphics Computer Systems]<br /> # 
[http://books.google.com/books?id=g82fofiqa5IC&printsec=frontcover&dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&source=bl&ots=COrdamlfVn&sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&hl=en&ei=0ZO6S4TJGcOclgejzI3BBw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CAgQ6AEwAA#v=onepage&q=&f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]<br /> # [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]<br /> # [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> # [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]<br /> # [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]<br /> # [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]<br /> # [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=289691 XDBus]</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch8_cl&diff=44515 CSC/ECE 506 Spring 2011/ch8 cl 2011-03-20T19:22:52Z

<p>Cslingaf: </p> <hr /> <div>=Introduction to bus-based cache coherence in real machines=<br /> <br /> <br /> ==SMP Protocol==<br /> Most parallel software in the commercial market relies on the shared-memory programming model in which all processors access the same physical address space. And the most common multiprocessors today use [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture which use a common bus as the interconnect. In the case of multicore processors ("chip multiprocessors," or CMP) the SMP architecture applies to the cores treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies. This is called '''''cache coherence problem'''''. It is critical to achieve correctness and performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.<br /> <br /> <center>[[Image:Busbased SMP.jpg]]</center><br /> <br /> <br /> At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. 
The different coherent states used by most of the cache coherent protocols are as shown in ''Table 1'':<br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''States'''<br /> | '''Access Type'''<br /> | '''Invariant'''<br /> |-<br /> | '''Modified'''<br /> | read, write<br /> | all other caches in I state<br /> |-<br /> | '''Exclusive'''<br /> | read<br /> | all other caches in I state<br /> |-<br /> | '''Owned'''<br /> | read<br /> | all other caches in I or S state<br /> |-<br /> | '''Shared'''<br /> | read<br /> | no other cache in M or E state<br /> |-<br /> | '''Invalid'''<br /> | -<br /> | -<br /> |}<br /> <br /> <br /> The first widely adopted approach to cache coherence is snooping on a bus. We will now discuss how some real time machines maintain cache coherence using '''''snooping based coherence protocols'''''. For more information on snooping based protocols refer to Solihin text book Chapter 8.<br /> <br /><br /> <br /><br /> <br /> <br /> <br /> =Snooping Protocols=<br /> ==MSI Protocol==<br /> <br /> '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' which is one of the earliest snooping-based cache coherence-protocols. It marks the cache line in '''Modified (M) ,Shared (S)''' and '''Invalid (I)''' state. '''Invalid''' means the cache line is either not present or is invalid state. If the cache line is clean and is shared by more than one processor , it is marked '''shared'''. If cache line is dirty and the processor has exclusive ownership of the cache line, it is present in '''Modified''' state. BusRdx causes others to invalidate (demote) to '''I''' state. If it is present in '''M''' state in another cache, it will flush. 
A processor write that hits a line in '''S''' state still issues a BusRdx, and the line is promoted (upgraded) to '''M''' state.<br /> <br /> The following state transition diagram for the MSI protocol explains the working of the protocol:<br /> <br /> <center>[[Image:MSI.jpg]]</center><br /> <br /> <br /> <br /> ===Synapse protocol===<br /> From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before going to '''S''' state. In certain cases it would be more appropriate to move to '''I''' state, thus giving up the block entirely. Choosing '''S''' over '''I''' reflects the designer's assertion that the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from '''M''' state to '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> <br /> In the Synapse protocol the '''M''' state is called the '''D''' (Dirty) state. The following state transition diagram shows how the Synapse protocol works.<br /> <br /> <center>[[Image:Synapse1.jpg]]</center><br /> <br /> ==MESI==<br /> MSI has a major drawback in that each read-write sequence incurs 2 bus transactions irrespective of whether the cache line is stored in only one cache or not. This is a huge setback for highly parallel programs that have little data sharing. 
The '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.<br /> Let us briefly see how the MESI protocol works. For a more detailed version refer to the Solihin textbook, pg. 215.<br /> <br /> The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. <br /> * '''Invalid''' : The cache line is either not present or is invalid<br /> * '''Exclusive''' : The cache line is clean and is owned by this core/processor only<br /> * '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of the cache line, exclusive of the memory also.<br /> * '''Shared''' : The cache line is clean and is shared by more than one core/processor<br /> <br /> In a nutshell, the MESI protocol works as follows: <br /> A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). 
A table is shown below to summarize the MESI protocol.<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The transition diagram from the lecture slides is given below for reference.<br /> <br /> <center>[[Image:MESI.jpg]]</center> <br /><br /> <br /> The '''Pentium Pro''' microprocessor, introduced in 1995, was the '''first''' Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations. SMP and the MESI protocol were used consistently until the introduction of the 45-nm Hi-k Core micro-architecture in '''Intel's (Nehalem-EP) quad-core x86-64'''. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the '''QuickPath Interconnect''', which uses point-to-point interconnection technology based on a distributed shared memory architecture. It uses a modified version of the MESI protocol called '''MESIF''', which introduces an additional state, F, the forward state. <br /> <br /> The '''Intel architecture''' uses the MESI protocol as the '''basis''' to ensure cache coherence, which is true whether you're on one of the older processors that use a '''common bus''' to communicate or using the new Intel '''QuickPath''' point-to-point interconnection technology. 
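<br /> <br />
The MESI rules summarized in the table above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not Intel's implementation: the function names read_miss and write, the dictionary that maps a cache name to its state for one line, and the split of the RFO into BusUpgr (issued from '''S''') and BusRdX (issued from '''I''') are all choices made for this example.<br />

```python
def read_miss(line, requester):
    """Fetch a line into `requester`: E if no other cache has it, else S."""
    sharers = [c for c, s in line.items() if c != requester and s != "I"]
    for cache in sharers:
        # A Modified holder flushes its data; every former holder ends in S.
        line[cache] = "S"
    line[requester] = "S" if sharers else "E"


def write(line, writer):
    """Write to a line; return the bus request issued, or None if silent."""
    state = line[writer]
    if state in ("M", "E"):
        request = None  # sole holder already: no bus traffic
    else:
        # Naming assumption: upgrade from S, read-exclusive from I.
        request = "BusUpgr" if state == "S" else "BusRdX"
        for cache in line:
            if cache != writer:
                line[cache] = "I"
    line[writer] = "M"
    return request


line = {"P0": "I", "P1": "I"}
read_miss(line, "P0")          # P0 is the only holder -> E
silent = write(line, "P0")     # E -> M without using the bus
read_miss(line, "P1")          # P0 flushes and drops to S, P1 gets S
upgrade = write(line, "P1")    # S -> M needs a bus request, P0 -> I
print(silent, upgrade, line)   # None BusUpgr {'P0': 'I', 'P1': 'M'}
```

The first write is silent because P0 holds the line in '''E''', which is the saving MESI offers over MSI; the second write has to go to the bus because the line is shared.<br />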
<br /> <br /> Let us now briefly walk through the '''MESIF protocol''':<br /> <br /> The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. <br /> As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the caches holding it in the S state remain dormant. Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding line transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.<br /> All '''M to S''' and '''E to S''' state transitions now become '''M to F''' and '''E to F''' transitions. 
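<br /> <br />
The forwarding rule described above can be illustrated with a small Python sketch. It is a simplified model, not Intel's implementation, and it covers only clean lines in the F, S, and I states (the M and E cases are left out). The names State and read_request, the dictionary of per-cache states, and the choice to let memory answer and give the requester the F state when no cache holds F are assumptions made for this example.<br />

```python
from enum import Enum


class State(Enum):
    FORWARD = "F"
    SHARED = "S"
    INVALID = "I"


def read_request(line_states, requester):
    """Serve a read of one clean cache line and return who supplied the data.

    line_states maps a cache name to its State for that line and is updated
    in place. Only the cache holding the line in F answers; S copies stay
    silent. If no cache holds F, memory answers (an assumption of this sketch).
    """
    if line_states[requester] is not State.INVALID:
        return requester  # read hit, no coherence traffic needed

    forwarder = next(
        (name for name, state in line_states.items() if state is State.FORWARD),
        None,
    )
    if forwarder is not None:
        # The old forwarder drops back to S once the line has been copied.
        line_states[forwarder] = State.SHARED
    # The newest copy always takes over the F role.
    line_states[requester] = State.FORWARD
    return forwarder if forwarder is not None else "memory"


caches = {"P0": State.FORWARD, "P1": State.SHARED, "P2": State.INVALID}
supplier = read_request(caches, "P2")
print(supplier, {name: state.value for name, state in caches.items()})
# P0 {'P0': 'S', 'P1': 'S', 'P2': 'F'}
```

Only P0 answers the request even though P1 also holds a valid copy, and the F role then moves to P2, the newest copy.<br />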
<br /> The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol in that a line in the F state is clean: a valid copy is also stored in memory. A line in the O state is dirty and the memory copy is out of date, so it must be written back before it is given up, whereas the data in the F state can simply be evicted or converted to the S state, if desired. <br /> <br /> More information on the QuickPath Interconnect and MESIF protocol can be found at<br /> '''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''<br /> <br /> <br /> ==MOESI==<br /> [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, which had 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die. Cache coherence produces bigger problems on such multiprocessors, and an appropriate coherence protocol was needed to address them. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence. MESI came with the drawback of using much time and bandwidth in certain situations. <br /> <br /> [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI added a fifth state to the MESI protocol called '''“Owned”'''. MOESI addresses the bandwidth problem faced in the MESI protocol when a processor with an invalid copy in its cache requests data that another processor has modified. Under MESI, the requesting processor has to wait for the processor which modified the data to write it back to main memory, which takes time and bandwidth. This drawback is removed in MOESI: when the data is held by a processor in the new state '''“Owned”''', it can provide other processors the modified data without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. 
The processor with the data in '''"Owned"''' stays responsible to update the main memory later when the cache line is evicted.<br /> <br /> MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture. The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.<br /> <br /> The five different states of the MOESI protocol are:<br /> * '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.<br /> * '''Owned (O)''' : The cache line has the most recent correct copy of the data . This can be shared by other processors. The processor in this state for this cache line is responsible to update the correct value in the main memory before it gets evicted. <br /> * '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory. <br /> * '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. 
<br /> * '''Invalid (I)''' : A cache line does not hold a valid copy of the data.<br /> <br /> A detailed explanation of this protocol's implementation on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]<br /> <br /> The following table summarizes the MOESI protocol:<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Owned'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | out of date<br /> | valid<br /> | valid, unless another cache holds the line in Owned<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | Maybe<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The state transitions for MOESI are shown below: <br /> <br /> <br /> <center>[[Image:MOESI_State_Transition_Diagram.jpg]]</center><br /> <br /> <center> MOESI State transition Diagram</center><br /> <br /><br /> <br /><br /> <br /> ===Optimization techniques on MOESI===<br /> <br /> In real machines, applying optimization techniques to the standard cache coherence protocol improves the performance of the machine. 
For example [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0×10) which is AMD’s first generation to incorporate 4 distinct cores on a single die, and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. <br /> <br /> It focuses on a small subset of compute problems which behave like Producer and Consumer programs. In such a computing problem, a thread of a program running on a single core produces data, which is consumed by a thread that is running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to/from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.<br /> <br /> When the producer thread , writes a new entry, it allocates cache-lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Now, the producer thread circles the ring buffer to arrive back to the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). 
This will slow down the process.<br /> <br /> Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.<br /> <br /> You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> <br /> <br /> <br /> <br /> ==Dragon Protocol==<br /> The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update based coherence protocol which does not invalidate other cached copies like what we have seen in the coherence protocols so far. Write propagation is achieved by updating the cached copies instead of invalidating them. But the Dragon Protocol does not update memory on a cache to cache transfer and delays the memory and cache consistency until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover only the written '''byte''' or the '''word''' is '''communicated''' to the '''other caches''' instead of the whole block which further '''reduces''' the '''bandwidth''' usage. It has the ability to detect dynamically, the sharing status of a block and use a write through policy for shared blocks and write back for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''. <br /> * '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. 
<br /> * '''Shared Modified (Sm)''' - For a given block, only one cache in the system can hold the line in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.<br /> * '''Shared Clean (Sc)''' - Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).<br /> Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory, in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.<br /> <br /> <center>[[Image:Dragon.jpg]]</center><br /> <br /> The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses. The Dragon system was designed to support 4 to 8 Dragon processors. <br /> <br /> <br /> <br /> <br /> <br /> =Prefetching=<br /> <br /><br /> Instruction prefetching is a technique used to speed up the execution of a program. But in multiprocessors, prefetching comes at a cost. Due to prefetching, data can be modified in such a way that the memory coherence protocol is not able to handle the effects. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. <br /> <br /><br /> An example of this type of situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. 
The physical-memory references for the page tables are different from the physical-memory references for the data. Because of prefetching, this can cause a correctness problem. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:<br /> # The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.<br /> # Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.<br /> # Data in virtual-page A is accessed.<br /> <br /> Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual-page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical-page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.<br /> <br /> To prevent such errors, the processor provides special instructions that software must execute immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. 
It is not necessary to perform a TLB invalidation operation preceding the table update.<br /> <br /> More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> <br /> <br /> <br /> <br /> = CMP Implementation in Intel Architecture =<br /> <br /> Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. <br /> <br /> <br /> '''Uniprocessor Architecture'''<br /> <br /> The diagram below shows the structure of the memory cluster in Intel Pentium M processor.<br /> <br /> <center>[[Image:intel_cache1.jpg]]</center> <br /><br /> <br /> <br /> In this structure we have,<br /> * A unified on-chip '''L1 cache''' with the '''processor/core''',<br /> * A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,<br /> * The second level '''L2 cache''' along with the '''prefetch unit''' and<br /> * '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent across.These wide buses bring in multiple data bytes at a time. <br /> <br /> As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''', were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. 
It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version on the cache line exists in any other cache in the system.<br /> <br /> <br /> '''CMP Architecture'''<br /> <br /> For CMP implementation, Intel chose the bus-based architecture using snoopy protocols vs the '''directory protocol''' because though directory protocol reduces the active power due to reduced snoop activity, it '''increased''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for the processors in the mobility family, directory-based solution was less favorable since battery life mainly depends on static power consumption and less on dynamic power.<br /> Let us examine how '''CMP''' was implemented in '''Intel Core Duo''', which was one of the first dual-core processor for the budget/entry-level market. <br /> The general CMP implementation structure of the Intel Core Duo is shown below<br /> <br /> <center>[[Image:intel_cache2.jpg]]</center> <br /><br /> <br /> This structure has the following changes when compared to the uniprocessor memory cluster structure. <br /> * '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.<br /> * The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. 
The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.<br /> * The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.<br /> * A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.<br /> <br /> This new '''partitioned structure''' for the memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. <br /> For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to <br /> [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> <br /> The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol; From using a '''single shared bus''' to '''dual independent buses (DIB)''' doubling the available bandwidth and to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, the snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on next generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects using MESIF protocol'''.<br /> <br /> <br /> =Implementation Complexities=<br /> ==MESI==<br /> There are two possible causes of complexity with the MESI protocol during replacement of a cache line. 
Some MESI implementations require a message to be sent to memory when a cache line is flushed - an '''E''' to '''I''' transition, as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from the memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).<br /> <br/>Source: http://rsim.cs.illinois.edu/rsim/Manual/node109.html<br /> <br /> ==Word Invalidation==<br /> One complexity problem applying to a number of the protocols deals with invalidation. In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line. The other processors will thus have a mostly correct cache line, with only a word difference. This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared. This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line. One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is "not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent." 
In other words, if the application can tolerate several 'incorrect' words in a line, several transitions to and from the invalid state may be avoided.<br /> <br/>Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps<br /> <br /> <br /> In essence, the solution proposed there is to extend the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated on the first dirty word.<br /> <br /> =References=<br /> # [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]<br /> # [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]<br /> # [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> # [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]<br /> # [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&%20MESI.pdf Cache consistency with MESI on Intel processor]<br /> # [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> # [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=4913 Silicon Graphics Computer Systems]<br /> # 
[http://books.google.com/books?id=g82fofiqa5IC&printsec=frontcover&dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&source=bl&ots=COrdamlfVn&sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&hl=en&ei=0ZO6S4TJGcOclgejzI3BBw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CAgQ6AEwAA#v=onepage&q=&f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]<br /> # [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]<br /> # [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> # [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]<br /> # [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]<br /> # [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]<br /> # [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=289691 XDBus]</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch8_cl&diff=44514 CSC/ECE 506 Spring 2011/ch8 cl 2011-03-20T19:12:15Z

<p>Cslingaf: </p> <hr /> <div>=Introduction to bus-based cache coherence in real machines=<br /> <br /> <br /> ==SMP Protocol==<br /> Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect. In the case of multicore processors ("chip multiprocessors," or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies. This is called the '''''cache coherence problem'''''. Solving it is critical for correctness and is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.<br /> <br /> <center>[[Image:Busbased SMP.jpg]]</center><br /> <br /> <br /> At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. 
The coherence states used by most cache coherence protocols are shown in ''Table 1'':<br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''States'''<br /> | '''Access Type'''<br /> | '''Invariant'''<br /> |-<br /> | '''Modified'''<br /> | read, write<br /> | all other caches in I state<br /> |-<br /> | '''Exclusive'''<br /> | read<br /> | all other caches in I state<br /> |-<br /> | '''Owned'''<br /> | read<br /> | all other caches in I or S state<br /> |-<br /> | '''Shared'''<br /> | read<br /> | no other cache in M or E state<br /> |-<br /> | '''Invalid'''<br /> | -<br /> | -<br /> |}<br /> <br /> <br /> The first widely adopted approach to cache coherence is snooping on a bus. We will now discuss how some real machines maintain cache coherence using '''''snooping-based coherence protocols'''''. For more information on snooping-based protocols, refer to Chapter 8 of the Solihin textbook.<br /> <br /><br /> <br /><br /> <br /> <br /> <br /> =Snooping Protocols=<br /> ==MSI Protocol==<br /> <br /> '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdx causes other caches to invalidate (demote) their copies to the '''I''' state. If the line is present in the '''M''' state in another cache, that cache flushes it. 
A write to a line in the '''S''' state, although it is a cache hit, still places a BusRdx on the bus so that the line can be promoted (upgraded) to the '''M''' state.<br /> <br /> The following state transition diagram for MSI protocol explains the working of the protocol:<br /> <br /> <center>[[Image:MSI.jpg]]</center><br /> <br /> <br /> <br /> ===Synapse protocol===<br /> From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before going to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, thus giving up the block entirely. Moving to '''S''' rather than '''I''' reflects the designer's assumption that the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> <br /> In the Synapse protocol the '''M''' state is called the '''D''' (Dirty) state. The following state transition diagram for the Synapse protocol shows how it works.<br /> <br /> <center>[[Image:Synapse1.jpg]]</center><br /> <br /> ==MESI==<br /> MSI has a major drawback in that each read-write sequence incurs 2 bus transactions irrespective of whether the cache line is stored in only one cache or not. This is a huge setback for highly parallel programs that have little data sharing. 
The '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.<br /> Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, pg. 215.<br /> <br /> The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. <br /> * '''Invalid''' : The cache line is either not present or is invalid<br /> * '''Exclusive''' : The cache line is clean and is owned by this core/processor only<br /> * '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is stale.<br /> * '''Shared''' : The cache line is clean and is shared by more than one core/processor<br /> <br /> In a nutshell, the MESI protocol works as follows: <br /> A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). 
The table below summarizes the MESI protocol.<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The transition diagram from the lecture slides is given below for reference.<br /> <br /> <center>[[Image:MESI.jpg]]</center> <br /><br /> <br /> The '''Pentium Pro''' microprocessor, introduced in 1995, was the '''first''' Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations. SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core micro-architecture in '''Intel's (Nehalem-EP) quad-core x86-64'''. The 45-nm Hi-k Intel Core microarchitecture uses a new framework called the '''QuickPath Interconnect''', which uses point-to-point interconnection technology based on a distributed shared memory architecture. It uses a modified version of the MESI protocol called '''MESIF''', which introduces an additional state, F, the forward state. <br /> <br /> The '''Intel architecture''' uses the MESI protocol as the '''basis''' to ensure cache coherence, which is true whether you're on one of the older processors that use a '''common bus''' to communicate or using the new Intel '''QuickPath''' point-to-point interconnection technology. 
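The MESI rules summarized above can be sketched as two small transition functions: one for accesses by the local processor and one for transactions snooped from the bus. This is a minimal illustration rather than any vendor's implementation; the function names, the string labels for bus transactions, and the choice of BusRdX for a write miss are assumptions made for this example.

```python
from enum import Enum

class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"

def on_processor_access(state, is_write, others_have_copy):
    """Return (next_state, bus_transaction) for a local read or write.

    bus_transaction is None when the access needs no bus traffic.
    """
    if is_write:
        if state in (State.M, State.E):
            return State.M, None        # E -> M is silent: no other copies exist
        if state is State.S:
            return State.M, "BusUpgr"   # other sharers must be invalidated
        return State.M, "BusRdX"        # write miss (assumed label): fetch for ownership
    if state is State.I:
        # Read miss: E if no other cache holds the line, otherwise S.
        return (State.S if others_have_copy else State.E), "BusRd"
    return state, None                  # read hit in M, E or S

def on_snoop(state, transaction):
    """Return (next_state, must_flush) when another cache's transaction is seen."""
    if state is State.I:
        return State.I, False
    must_flush = state is State.M       # only a dirty line has to supply its data
    if transaction == "BusRd":
        return State.S, must_flush
    return State.I, must_flush          # BusRdX or BusUpgr invalidates this copy
```

For instance, a read miss with no other sharers followed by a write shows the benefit of the E state: the write completes without a second bus transaction.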
<br /> <br /> Let us now walk briefly through the '''MESIF protocol''':<br /> <br /> The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. <br /> As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the caches holding it in the S state remain dormant. Hence, by designating a single cache to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding copy transitions from the F to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.<br /> All M to S state transitions and E to S state transitions will now be from '''M to F''' and '''E to F'''. 
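The migration of the F state described above can be illustrated with a small sketch that tracks one cache line across several caches. It only models the F, S and I states of a clean shared line; the function name `mesif_read`, the cache names, and the rule that memory responds (and the requester takes F) when no cache holds F are assumptions made for this example, not details taken from the Intel documentation.

```python
def mesif_read(states, requester):
    """Serve a read miss for one cache line.

    states maps a cache name to "F", "S" or "I" for this line.
    Returns the name of the single responder ("memory" if no cache holds F).
    """
    if states.get(requester, "I") != "I":
        raise ValueError("requester already holds the line")
    # At most one cache holds the line in F; caches in S stay silent.
    forwarder = next((c for c, s in states.items() if s == "F"), None)
    if forwarder is not None:
        states[forwarder] = "S"   # the older copy drops back to Shared
    states[requester] = "F"       # the newest copy becomes the forwarder
    return forwarder if forwarder is not None else "memory"

line = {"c0": "F", "c1": "S", "c2": "I"}
responder = mesif_read(line, "c2")
```

After the call, `responder` is `"c0"` and the line is held as `{"c0": "S", "c1": "S", "c2": "F"}`: exactly one cache answered the request, and the forwarding role moved to the most recent requester.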
<br /> The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. <br /> <br /> More information on the QuickPath Interconnect and MESIF protocol can be found at<br /> '''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''<br /> <br /> <br /> ==MOESI==<br /> [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was the AMD’s first-generation dual core which had 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die. Cache coherence produces bigger problems on such multiprocessors. It was necessary to use an appropriate coherence protocol to address this problem. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], which was the competitive counterpart from Intel used the MESI protocol to handle cache coherence. MESI came with the drawback of using much time and bandwidth in certain situations. <br /> <br /> [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was the AMD’s answer to this problem. MOESI added a fifth state to MESI protocol called '''“Owned”''' . MOESI addresses the bandwidth problem faced in MESI protocol when processor having invalid data in its cache wants to modify the data. The processor seeking the data access will have to wait for the processor which modified this data to write back to the main memory, which takes time and bandwidth. This drawback is removed in MOESI by allowing dirty sharing. When the data is held by a processor in the new state '''“Owned”''', it can provide other processors the modified data without or even before writing it to the main memory. This is called '''''dirty sharing'''''. 
The processor with the data in '''"Owned"''' stays responsible to update the main memory later when the cache line is evicted.<br /> <br /> MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture. The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.<br /> <br /> The five different states of the MOESI protocol are:<br /> * '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.<br /> * '''Owned (O)''' : The cache line has the most recent correct copy of the data . This can be shared by other processors. The processor in this state for this cache line is responsible to update the correct value in the main memory before it gets evicted. <br /> * '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory. <br /> * '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. 
<br /> * '''Invalid (I)''' : A cache line does not hold a valid copy of the data.<br /> <br /> A detailed explanation of this protocol's implementation on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]<br /> <br /> The following table summarizes the MOESI protocol:<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Owned'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | out of date<br /> | valid<br /> | valid (unless another cache holds the line in Owned)<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | Maybe (Shared copies)<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The state transitions for MOESI are shown below: <br /> <br /> <br /> <center>[[Image:MOESI_State_Transition_Diagram.jpg]]</center><br /> <br /> <center> MOESI State transition Diagram</center><br /> <br /><br /> <br /><br /> <br /> ===Optimization techniques on MOESI===<br /> <br /> In real machines, applying optimization techniques to the standard cache coherence protocol can improve performance. 
For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 10h), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. <br /> <br /> The optimization targets a small subset of compute problems that behave like producer and consumer programs. In such a problem, a thread running on one core produces data (written into a ring buffer), which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to/from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.<br /> <br /> When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Now, the producer thread circles the ring buffer to arrive back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). 
This will slow down the process.<br /> <br /> Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.<br /> <br /> You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> <br /> <br /> <br /> <br /> ==Dragon Protocol==<br /> The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update based coherence protocol which does not invalidate other cached copies like what we have seen in the coherence protocols so far. Write propagation is achieved by updating the cached copies instead of invalidating them. But the Dragon Protocol does not update memory on a cache to cache transfer and delays the memory and cache consistency until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover only the written '''byte''' or the '''word''' is '''communicated''' to the '''other caches''' instead of the whole block which further '''reduces''' the '''bandwidth''' usage. It has the ability to detect dynamically, the sharing status of a block and use a write through policy for shared blocks and write back for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''. <br /> * '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. 
<br /> * '''Shared Modified (Sm)''' - Only one cache in the system can hold a given block in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.<br /> * '''Shared Clean (Sc)''' - Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).<br /> Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.<br /> <br /> <center>[[Image:Dragon.jpg]]</center><br /> <br /> The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses. The Dragon system was designed to support 4 to 8 Dragon processors. <br /> <br /> <br /> <br /> <br /> <br /> =Prefetching=<br /> <br /><br /> Instruction prefetching is a technique used to speed up the execution of a program. But in multiprocessors, prefetching comes at a cost. Because of prefetching, data can be modified in such a way that the memory coherence protocol is not able to handle the effects. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. <br /> <br /><br /> An example of this type of situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. 
The physical-memory references for the page tables are different from the physical-memory references for the data. Because of prefetching, there may be a problem with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:<br /> # The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.<br /> # Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.<br /> # Data in virtual-page A is accessed.<br /> <br /> Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual-page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical-page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.<br /> <br /> To prevent such errors, software must execute a translation-invalidating instruction (the AMD64 manual names INVLPG and MOV CR3) immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. 
It is not necessary to perform a TLB invalidation operation preceding the table update.<br /> <br /> More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> <br /> <br /> <br /> <br /> = CMP Implementation in Intel Architecture =<br /> <br /> Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. <br /> <br /> <br /> '''Uniprocessor Architecture'''<br /> <br /> The diagram below shows the structure of the memory cluster in Intel Pentium M processor.<br /> <br /> <center>[[Image:intel_cache1.jpg]]</center> <br /><br /> <br /> <br /> In this structure we have,<br /> * A unified on-chip '''L1 cache''' with the '''processor/core''',<br /> * A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,<br /> * The second level '''L2 cache''' along with the '''prefetch unit''' and<br /> * '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent across.These wide buses bring in multiple data bytes at a time. <br /> <br /> As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''', were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. 
It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.<br /> <br /> <br /> '''CMP Architecture'''<br /> <br /> For the CMP implementation, Intel chose the bus-based architecture using snoopy protocols over the '''directory protocol''' because, although a directory protocol reduces active power due to reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, as battery life depends mainly on static power consumption and less on dynamic power.<br /> Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. <br /> The general CMP implementation structure of the Intel Core Duo is shown below<br /> <br /> <center>[[Image:intel_cache2.jpg]]</center> <br /><br /> <br /> This structure has the following changes when compared to the uniprocessor memory cluster structure. <br /> * The '''L1 cache''' and '''processor/core''' structure is '''duplicated''' to give 2 cores.<br /> * The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: the '''L2 controller''' and the '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the cores and the snoop requests from the FSB. 
The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.<br /> * The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.<br /> * A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.<br /> <br /> This new '''partitioned structure''' for the memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. <br /> For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to <br /> [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> <br /> The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol; From using a '''single shared bus''' to '''dual independent buses (DIB)''' doubling the available bandwidth and to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, the snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on next generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects using MESIF protocol'''.<br /> <br /> <br /> =Implementation Complexities=<br /> ==MESI==<br /> There are two possible causes of complexity with the MESI protocol during replacement of a cache line. 
Some MESI implementations require a message to be sent to memory when a cache line is flushed - an '''E''' to '''I''' transition, as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from the memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).<br /> <br/>Source: http://rsim.cs.illinois.edu/rsim/Manual/node109.html<br /> <br /> ==MOESI==<br /> <br /> <br /> =References=<br /> # [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]<br /> # [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]<br /> # [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> # [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]<br /> # [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&%20MESI.pdf Cache consistency with MESI on Intel processor]<br /> # [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> # [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=4913 Silicon Graphics Computer Systems]<br /> # 
[http://books.google.com/books?id=g82fofiqa5IC&printsec=frontcover&dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&source=bl&ots=COrdamlfVn&sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&hl=en&ei=0ZO6S4TJGcOclgejzI3BBw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CAgQ6AEwAA#v=onepage&q=&f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]<br /> # [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]<br /> # [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> # [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]<br /> # [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]<br /> # [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]<br /> # [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=289691 XDBus]</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch8_cl&diff=44513 CSC/ECE 506 Spring 2011/ch8 cl 2011-03-20T19:08:02Z

<p>Cslingaf: </p> <hr /> <div>=Introduction to bus-based cache coherence in real machines=<br /> <br /> <br /> ==SMP Protocol==<br /> Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect. In the case of multicore processors ("chip multiprocessors," or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies. This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.<br /> <br /> <center>[[Image:Busbased SMP.jpg]]</center><br /> <br /> <br /> At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. 
The coherence states used by most cache coherence protocols are shown in ''Table 1'':<br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''States'''<br /> | '''Access Type'''<br /> | '''Invariant'''<br /> |-<br /> | '''Modified'''<br /> | read, write<br /> | all other caches in I state<br /> |-<br /> | '''Exclusive'''<br /> | read<br /> | all other caches in I state<br /> |-<br /> | '''Owned'''<br /> | read<br /> | all other caches in I or S state<br /> |-<br /> | '''Shared'''<br /> | read<br /> | no other cache in M or E state<br /> |-<br /> | '''Invalid'''<br /> | -<br /> | -<br /> |}<br /> <br /> <br /> The first widely adopted approach to cache coherence is snooping on a bus. We will now discuss how some real machines maintain cache coherence using '''''snooping-based coherence protocols'''''. For more information on snooping-based protocols, refer to Chapter 8 of the Solihin textbook.<br /> <br /><br /> <br /><br /> <br /> <br /> <br /> =Snooping Protocols=<br /> ==MSI Protocol==<br /> <br /> '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as being in the '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)''' state. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and is shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdx causes other caches to invalidate (demote) their copies to the '''I''' state. If the line is present in the '''M''' state in another cache, that cache will flush it. 
A processor write that hits a line in the '''S''' state still places a BusRdx on the bus, after which the line is promoted (upgraded) to the '''M''' state.<br /> <br /> The following state transition diagram shows how the MSI protocol works:<br /> <br /> <center>[[Image:MSI.jpg]]</center><br /> <br /> <br /> <br /> ===Synapse protocol===<br /> From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before going to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, thus giving up the block entirely. The choice between moving to '''S''' or '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the late 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> <br /> In the Synapse protocol the '''M''' state is called the '''D''' (Dirty) state. The following state transition diagram for the Synapse protocol shows how it works.<br /> <br /> <center>[[Image:Synapse1.jpg]]</center><br /> <br /> ==MESI==<br /> MSI has a major drawback in that each read-write sequence incurs 2 bus transactions irrespective of whether the cache line is stored in only one cache or not. This is a huge setback for highly parallel programs that have little data sharing. 
The '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.<br /> Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, pg. 215.<br /> <br /> The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. <br /> * '''Invalid''' : The cache line is either not present or is invalid<br /> * '''Exclusive''' : The cache line is clean and is owned by this core/processor only<br /> * '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is stale.<br /> * '''Shared''' : The cache line is clean and is shared by more than one core/processor<br /> <br /> In a nutshell, the MESI protocol works as follows: <br /> A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). 
The table below summarizes the MESI protocol.<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The transition diagram from the lecture slides is given below for reference.<br /> <br /> <center>[[Image:MESI.jpg]]</center> <br /><br /> <br /> The '''Pentium Pro''' microprocessor, introduced in 1995, was the '''first''' Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations. SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in '''Intel's (Nehalem-EP) quad-core x86-64'''. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the '''QuickPath Interconnect''', which uses point-to-point interconnection technology based on a distributed shared memory architecture. It uses a modified version of the MESI protocol called '''MESIF''', which introduces an additional state, F, the forward state. <br /> <br /> The '''Intel architecture''' uses the MESI protocol as the '''basis''' to ensure cache coherence, whether on one of the older processors that use a '''common bus''' to communicate or on the new Intel '''QuickPath''' point-to-point interconnection technology. 
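As a rough illustration of the MESI rules summarized in this section, the transitions for a single cache line can be written as three small functions. This is only a sketch under stated assumptions: the function names are hypothetical, the bus transaction names (BusRd, BusRdX, BusUpgr) follow the textbook convention rather than any Intel documentation, and a write miss from the '''I''' state is assumed to issue a BusRdX.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


def on_processor_read(state, others_have_copy):
    """Local read: return (new_state, bus_transaction or None)."""
    if state is State.I:
        # Read miss: the line is fetched as S if another cache holds it, else E.
        return (State.S if others_have_copy else State.E), "BusRd"
    return state, None  # read hit in M, E or S needs no bus traffic


def on_processor_write(state):
    """Local write: return (new_state, bus_transaction or None)."""
    if state is State.I:
        return State.M, "BusRdX"   # assumed name for a write miss
    if state is State.S:
        return State.M, "BusUpgr"  # the RFO request described in the text
    return State.M, None           # E or M: the write does not go to the bus


def on_snoop(state, bus_transaction):
    """Snooped transaction from another cache: return (new_state, must_flush)."""
    if state is State.I:
        return State.I, False
    must_flush = state is State.M  # only a dirty line has to supply data
    if bus_transaction == "BusRd":
        return State.S, must_flush
    return State.I, must_flush     # BusRdX or BusUpgr invalidates this copy
```

For example, a write to a line held in '''E''' returns `(State.M, None)`, which is the saving over MSI: no bus transaction is needed when no other cache holds the line.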
<br /> <br /> Let us now briefly walk through the '''MESIF protocol''':<br /> <br /> The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. <br /> As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds, while all the caches holding it in the S state remain dormant. Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding line transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.<br /> All M to S and E to S state transitions will now be '''M to F''' and '''E to F'''. 
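The forwarding behaviour described above can be sketched as a toy model of one cache line across several caches. This is a minimal sketch, not Intel's implementation: the function name and the dictionary representation are hypothetical, it follows the "F migrates to the newest copy" description, and it assumes that a cache in '''M''' or '''E''' also answers a read and that memory supplies the data when no cache can forward.

```python
def mesif_read(states, requester):
    """Model a read miss for one line.

    states maps a cache id to one of 'M', 'E', 'S', 'I', 'F' and is updated
    in place. Returns the id of the cache that supplied the data, or None
    when memory supplied it. Assumes the requester starts in 'I'.
    """
    if states[requester] != "I":
        raise ValueError("a read hit needs no bus request")

    # At most one cache answers: the holder of the F copy, or (assumption)
    # a sole owner in M or E. Caches in S stay silent.
    responder = next(
        (cid for cid, st in states.items()
         if cid != requester and st in ("M", "E", "F")),
        None,
    )

    if responder is None:
        # No forwarder left (for example the F copy was evicted).
        shared_elsewhere = any(
            st == "S" for cid, st in states.items() if cid != requester
        )
        states[requester] = "F" if shared_elsewhere else "E"
        return None

    # The F designation migrates to the newest copy; the old one becomes S.
    states[responder] = "S"
    states[requester] = "F"
    return responder
```

Starting from `{0: "F", 1: "S", 2: "I"}`, a read by cache 2 is answered by cache 0 alone and leaves `{0: "S", 1: "S", 2: "F"}`, so exactly one copy remains responsible for forwarding.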
<br /> The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol in that it is '''not''' a unique copy: a valid copy is also stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the line is dirty with respect to memory and must be written back when it is evicted, the data in the F state can be evicted or converted to the S state, if desired. <br /> <br /> More information on the QuickPath Interconnect and MESIF protocol can be found at<br /> '''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''<br /> <br /> <br /> ==MOESI==<br /> The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die. Cache coherence produces bigger problems on such multiprocessors, so it was necessary to use an appropriate coherence protocol. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing counterpart from Intel, used the MESI protocol to handle cache coherence. MESI came with the drawback of using much time and bandwidth in certain situations. <br /> <br /> [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state to the MESI protocol, called '''“Owned”'''. MOESI addresses the bandwidth problem faced in the MESI protocol when a processor that has invalid data in its cache wants to access data that another processor has modified. Under MESI, the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. This drawback is removed in MOESI by allowing dirty sharing. When the data is held by a processor in the new state '''“Owned”''', it can provide other processors the modified data without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. 
The processor with the data in '''"Owned"''' stays responsible to update the main memory later when the cache line is evicted.<br /> <br /> MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture. The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.<br /> <br /> The five different states of the MOESI protocol are:<br /> * '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.<br /> * '''Owned (O)''' : The cache line has the most recent correct copy of the data . This can be shared by other processors. The processor in this state for this cache line is responsible to update the correct value in the main memory before it gets evicted. <br /> * '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory. <br /> * '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. 
<br /> * '''Invalid (I)''' : A cache line does not hold a valid copy of the data.<br /> <br /> A detailed explanation of this protocol's implementation on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]<br /> <br /> The following table summarizes the MOESI protocol:<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Owned'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | Maybe (shared copies)<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | goes to bus (other copies must be invalidated)<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The state transitions for MOESI are shown below: <br /> <br /> <br /> <center>[[Image:MOESI_State_Transition_Diagram.jpg]]</center><br /> <br /> <center> MOESI State transition Diagram</center><br /> <br /><br /> <br /><br /> <br /> ===Optimization techniques on MOESI===<br /> <br /> In real machines, applying optimization techniques to the standard cache coherence protocol can improve the performance of the machine. 
For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 10h), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. <br /> <br /> The optimization focuses on a small subset of compute problems that behave like producer and consumer programs. In such a problem, a thread of a program running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.<br /> <br /> When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Now, the producer thread circles the ring buffer to arrive back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). 
This will slow down the process.<br /> <br /> Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.<br /> <br /> You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> <br /> <br /> <br /> <br /> ==Dragon Protocol==<br /> The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update based coherence protocol which does not invalidate other cached copies like what we have seen in the coherence protocols so far. Write propagation is achieved by updating the cached copies instead of invalidating them. But the Dragon Protocol does not update memory on a cache to cache transfer and delays the memory and cache consistency until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover only the written '''byte''' or the '''word''' is '''communicated''' to the '''other caches''' instead of the whole block which further '''reduces''' the '''bandwidth''' usage. It has the ability to detect dynamically, the sharing status of a block and use a write through policy for shared blocks and write back for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''. <br /> * '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. 
<br /> * '''Shared Modified (Sm)''' - Only one cache line in the system can be in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.<br /> * '''Shared Clean (Sc)''' - Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).<br /> Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory, in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.<br /> <br /> <center>[[Image:Dragon.jpg]]</center><br /> <br /> The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses. The Dragon system was designed to support 4 to 8 Dragon processors. <br /> <br /> <br /> <br /> <br /> <br /> =Prefetching=<br /> <br /><br /> Instruction prefetching is a technique used to speed up the execution of a program. But in multiprocessors, prefetching comes at the cost of performance. Due to prefetching, the data can be modified in such a way that the memory coherence protocol will not be able to handle the effects. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee subsequent data accesses are coherent. <br /> <br /><br /> An example of this type of situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. 
The physical-memory references for the page tables are different from the physical-memory references for the data. Because of prefetching, there may be a problem with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:<br /> # The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.<br /> # The page-table entry for virtual-page A is changed by the software in main memory to point to physical-page N rather than physical-page M.<br /> # Data in virtual-page A is accessed.<br /> <br /> Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual-page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical-page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.<br /> <br /> To prevent such errors, software must execute a special instruction (for example, an invalidating instruction such as INVLPG on AMD64) immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. 
It is not necessary to perform a TLB invalidation operation preceding the table update.<br /> <br /> More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> <br /> <br /> <br /> <br /> = CMP Implementation in Intel Architecture =<br /> <br /> Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. <br /> <br /> <br /> '''Uniprocessor Architecture'''<br /> <br /> The diagram below shows the structure of the memory cluster in Intel Pentium M processor.<br /> <br /> <center>[[Image:intel_cache1.jpg]]</center> <br /><br /> <br /> <br /> In this structure we have,<br /> * A unified on-chip '''L1 cache''' with the '''processor/core''',<br /> * A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,<br /> * The second level '''L2 cache''' along with the '''prefetch unit''' and<br /> * '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent across.These wide buses bring in multiple data bytes at a time. <br /> <br /> As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''', were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. 
It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.<br /> <br /> <br /> '''CMP Architecture'''<br /> <br /> For the CMP implementation, Intel chose the bus-based architecture using snoopy protocols over the '''directory protocol''' because, although a directory protocol reduces active power through reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, as battery life depends mainly on static power consumption and less on dynamic power.<br /> Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. <br /> The general CMP implementation structure of the Intel Core Duo is shown below<br /> <br /> <center>[[Image:intel_cache2.jpg]]</center> <br /><br /> <br /> This structure has the following changes when compared to the uniprocessor memory cluster structure. <br /> * The '''L1 cache''' and '''processor/core''' structure is '''duplicated''' to give 2 cores.<br /> * The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. 
The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.<br /> * The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.<br /> * A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.<br /> <br /> This new '''partitioned structure''' for the memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. <br /> For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to <br /> [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> <br /> The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol; From using a '''single shared bus''' to '''dual independent buses (DIB)''' doubling the available bandwidth and to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, the snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on next generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects using MESIF protocol'''.<br /> <br /> <br /> =Implementation Complexities of MESI=<br /> There are two possible causes of complexity with the MESI protocol during replacement of a cache line. 
Some MESI implementations require a message to be sent to memory when a cache line is flushed - an '''E''' to '''I''' transition, as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from the memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).<br /> <br/>Source: http://rsim.cs.illinois.edu/rsim/Manual/node109.html<br /> <br /> =References=<br /> # [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]<br /> # [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]<br /> # [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> # [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]<br /> # [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&%20MESI.pdf Cache consistency with MESI on Intel processor]<br /> # [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> # [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=4913 Silicon Graphics Computer Systems]<br /> # 
[http://books.google.com/books?id=g82fofiqa5IC&printsec=frontcover&dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&source=bl&ots=COrdamlfVn&sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&hl=en&ei=0ZO6S4TJGcOclgejzI3BBw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CAgQ6AEwAA#v=onepage&q=&f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]<br /> # [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]<br /> # [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> # [http://portal.acm.org/citation.cfm?id=6514Cache Coherence protocols: evaluation using a multiprocessor simulation model]<br /> # [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]<br /> # [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]<br /> # [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=289691 XDBus]</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch8_cl&diff=44512 CSC/ECE 506 Spring 2011/ch8 cl 2011-03-20T19:07:37Z

<p>Cslingaf: </p> <hr /> <div>=Introduction to bus-based cache coherence in real machines=<br /> <br /> <br /> ==SMP Protocol==<br /> Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect. In the case of multicore processors ("chip multiprocessors," or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies. This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.<br /> <br /> <center>[[Image:Busbased SMP.jpg]]</center><br /> <br /> <br /> At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. 
The coherence states used by most cache coherence protocols are shown in ''Table 1'':<br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''States'''<br /> | '''Access Type'''<br /> | '''Invariant'''<br /> |-<br /> | '''Modified'''<br /> | read, write<br /> | all other caches in I state<br /> |-<br /> | '''Exclusive'''<br /> | read<br /> | all other caches in I state<br /> |-<br /> | '''Owned'''<br /> | read<br /> | all other caches in I or S state<br /> |-<br /> | '''Shared'''<br /> | read<br /> | no other cache in M or E state<br /> |-<br /> | '''Invalid'''<br /> | -<br /> | -<br /> |}<br /> <br /> <br /> The first widely adopted approach to cache coherence is snooping on a bus. We will now discuss how some real machines maintain cache coherence using '''''snooping-based coherence protocols'''''. For more information on snooping-based protocols, refer to Chapter 8 of the Solihin textbook.<br /> <br /><br /> <br /><br /> <br /> <br /> <br /> =Snooping Protocols=<br /> ==MSI Protocol==<br /> <br /> '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''', or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or holds invalid data. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdx causes the other caches to invalidate (demote) their copies to the '''I''' state. If the line is present in the '''M''' state in another cache, that cache flushes it. 
A processor write that hits a line in the '''S''' state still issues a BusRdx, after which the line is promoted (upgraded) to the '''M''' state.<br /> <br /> The following state transition diagram for the MSI protocol shows how the protocol works:<br /> <br /> <center>[[Image:MSI.jpg]]</center><br /> <br /> <br /> <br /> ===Synapse protocol===<br /> In the state transition diagram of MSI, a block moves from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the transition to '''S'''. In certain cases it would be more appropriate to move to the '''I''' state instead, giving up the block entirely. The choice between '''S''' and '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].<br /> <br /> In the Synapse protocol, the '''M''' state is called the '''D''' (Dirty) state. The following state transition diagram shows how the Synapse protocol works.<br /> <br /> <center>[[Image:Synapse1.jpg]]</center><br /> <br /> ==MESI==<br /> MSI has a major drawback: each read-write sequence incurs 2 bus transactions regardless of whether the cache line is stored in only one cache. This is a huge setback for highly parallel programs that have little data sharing. 
The '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.<br /> Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.<br /> <br /> The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. <br /> * '''Invalid''' : The cache line is either not present or is invalid.<br /> * '''Exclusive''' : The cache line is clean and is held by this core/processor only.<br /> * '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in main memory is out of date.<br /> * '''Shared''' : The cache line is clean and may be shared by more than one core/processor.<br /> <br /> In a nutshell, the MESI protocol works as follows: <br /> A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). 
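<br /> <br /> The behavior described above can be illustrated with a minimal sketch that tracks the MESI state of a single cache line across several caches on one bus. The class and method names (<code>SnoopBus</code>, <code>read</code>, <code>write</code>) are hypothetical and not taken from any real simulator. The sketch assumes that a write from the '''S''' state issues a BusUpgr while a write from the '''I''' state issues a BusRdx, and it ignores data movement and write-backs entirely.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class SnoopBus:
    """Hypothetical model: one cache line, n caches, one shared snooping bus."""

    def __init__(self, n_caches):
        self.states = [State.I] * n_caches
        self.transactions = []  # log of bus transactions, in order

    def read(self, cpu):
        if self.states[cpu] is not State.I:
            return  # read hit in M, E or S: no bus transaction
        self.transactions.append("BusRd")
        others_have_copy = False
        for other, st in enumerate(self.states):
            if other != cpu and st is not State.I:
                others_have_copy = True
                # A holder in M (after flushing) or E is demoted to S.
                self.states[other] = State.S
        # The fetched line is E if no other cache holds it, otherwise S.
        self.states[cpu] = State.S if others_have_copy else State.E

    def write(self, cpu):
        st = self.states[cpu]
        if st is State.S:
            self.transactions.append("BusUpgr")  # the RFO in Intel's terms
        elif st is State.I:
            self.transactions.append("BusRdx")   # assumption: fetch and invalidate
        if st in (State.S, State.I):
            # Every other copy must end up in the I state.
            for other in range(len(self.states)):
                if other != cpu:
                    self.states[other] = State.I
        # From E or M the write is silent: nothing was posted on the bus.
        self.states[cpu] = State.M


bus = SnoopBus(2)
bus.read(0)    # no other copy exists, so cache 0 gets E
bus.write(0)   # E -> M without a bus transaction
bus.read(1)    # cache 0 is demoted to S, cache 1 gets S
bus.write(1)   # S -> M needs a BusUpgr; cache 0 is invalidated
print(bus.states, bus.transactions)
```

In this run the read-then-write sequence on cache 0 costs a single bus transaction, which is the saving that the '''E''' state provides over MSI.<br /> <br />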
A table summarizing the MESI protocol is shown below.<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The transition diagram from the lecture slides is given below for reference.<br /> <br /> <center>[[Image:MESI.jpg]]</center> <br /><br /> <br /> The '''Pentium Pro''' microprocessor, introduced in 1995, was the '''first''' Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations. SMP and the MESI protocol were used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in '''Intel's (Nehalem-EP) quad-core x86-64''' processors. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the '''QuickPath Interconnect''', which uses point-to-point interconnection technology based on a distributed shared memory architecture. It uses a modified version of the MESI protocol called '''MESIF''', which introduces an additional state, F, the Forward state. <br /> <br /> The '''Intel architecture''' uses the MESI protocol as the '''basis''' for cache coherence, whether on the older processors that use a '''common bus''' to communicate or on the new Intel '''QuickPath''' point-to-point interconnection technology. 
<br /> <br /> Let us now briefly walk through the '''MESIF protocol''':<br /> <br /> The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. <br /> As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the caches holding it in the S state remain dormant. Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding line transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.<br /> All M to S and E to S state transitions will now be '''M to F''' and '''E to F''' transitions. 
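<br /> <br /> The forwarding rule can be sketched in a few lines. This is only an illustration under stated assumptions: the function names <code>mesif_read</code> and <code>all_valid_holders</code> are hypothetical, states are plain one-letter strings, and the sketch follows the "F migrates to the newest copy" reading of the text, so the responding cache always drops to '''S''' and the requester takes '''F'''. It also assumes that memory supplies the data, and the requester still takes '''F''', when only S copies remain.

```python
def all_valid_holders(states, requester):
    """Caches that could answer a read if every valid copy responded."""
    return [i for i, s in enumerate(states) if i != requester and s != "I"]


def mesif_read(states, requester):
    """Apply one read request to `states` in place.

    Returns the caches that respond with data. An empty list means a
    read hit, or (assumption) that memory supplied the line.
    """
    if states[requester] != "I":
        return []  # read hit: nothing is sent
    # Only a holder in M, E or F answers; holders in S stay silent.
    responders = [i for i, s in enumerate(states)
                  if i != requester and s in ("M", "E", "F")]
    if responders:
        states[responders[0]] = "S"   # the older copy drops back to S
        states[requester] = "F"       # the newest copy becomes the forwarder
    elif all_valid_holders(states, requester):
        states[requester] = "F"       # assumption: only S copies are left
    else:
        states[requester] = "E"       # no other cache holds the line
    return responders


states = ["I", "I", "I"]
print(mesif_read(states, 0), states)   # memory supplies the line
print(mesif_read(states, 1), states)   # cache 0 forwards, then drops to S
print(all_valid_holders(states, 2))    # copies that could answer without F
print(mesif_read(states, 2), states)   # only the F holder answers
```

Comparing the last two calls shows the point of the F state: two caches hold a valid copy, but only one of them sends data.<br /> <br />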
<br /> The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. <br /> <br /> More information on the QuickPath Interconnect and MESIF protocol can be found at<br /> '''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''<br /> <br /> <br /> ==MOESI==<br /> [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was the AMD’s first-generation dual core which had 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die. Cache coherence produces bigger problems on such multiprocessors. It was necessary to use an appropriate coherence protocol to address this problem. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], which was the competitive counterpart from Intel used the MESI protocol to handle cache coherence. MESI came with the drawback of using much time and bandwidth in certain situations. <br /> <br /> [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was the AMD’s answer to this problem. MOESI added a fifth state to MESI protocol called '''“Owned”''' . MOESI addresses the bandwidth problem faced in MESI protocol when processor having invalid data in its cache wants to modify the data. The processor seeking the data access will have to wait for the processor which modified this data to write back to the main memory, which takes time and bandwidth. This drawback is removed in MOESI by allowing dirty sharing. When the data is held by a processor in the new state '''“Owned”''', it can provide other processors the modified data without or even before writing it to the main memory. This is called '''''dirty sharing'''''. 
The processor with the data in '''"Owned"''' stays responsible to update the main memory later when the cache line is evicted.<br /> <br /> MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture. The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.<br /> <br /> The five different states of the MOESI protocol are:<br /> * '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.<br /> * '''Owned (O)''' : The cache line has the most recent correct copy of the data . This can be shared by other processors. The processor in this state for this cache line is responsible to update the correct value in the main memory before it gets evicted. <br /> * '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory. <br /> * '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. 
<br /> * '''Invalid (I)''' : A cache line does not hold a valid copy of the data.<br /> <br /> A detailed explanation of how this protocol is implemented on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core].<br /> <br /> The following table summarizes the MOESI protocol:<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Owned'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | Maybe<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The state transitions for MOESI are shown below: <br /> <br /> <br /> <center>[[Image:MOESI_State_Transition_Diagram.jpg]]</center><br /> <br /> <center> MOESI State transition Diagram</center><br /> <br /><br /> <br /><br /> <br /> ===Optimization techniques on MOESI===<br /> <br /> In real machines, applying optimization techniques to the standard cache coherence protocol can improve the performance of the machine. 
For example [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0×10) which is AMD’s first generation to incorporate 4 distinct cores on a single die, and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. <br /> <br /> It focuses on a small subset of compute problems which behave like Producer and Consumer programs. In such a computing problem, a thread of a program running on a single core produces data, which is consumed by a thread that is running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to/from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.<br /> <br /> When the producer thread , writes a new entry, it allocates cache-lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Now, the producer thread circles the ring buffer to arrive back to the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). 
This will slow down the process.<br /> <br /> Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.<br /> <br /> You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> <br /> <br /> <br /> <br /> ==Dragon Protocol==<br /> The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update based coherence protocol which does not invalidate other cached copies like what we have seen in the coherence protocols so far. Write propagation is achieved by updating the cached copies instead of invalidating them. But the Dragon Protocol does not update memory on a cache to cache transfer and delays the memory and cache consistency until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover only the written '''byte''' or the '''word''' is '''communicated''' to the '''other caches''' instead of the whole block which further '''reduces''' the '''bandwidth''' usage. It has the ability to detect dynamically, the sharing status of a block and use a write through policy for shared blocks and write back for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''. <br /> * '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. 
<br /> * '''Shared Modified (Sm)''' - Only one cache in the system can hold a given block in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.<br /> * '''Shared Clean (Sc)''' - Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).<br /> A Shared Modified block is written back to main memory only when it is evicted from the cache on a cache miss, in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.<br /> <br /> <center>[[Image:Dragon.jpg]]</center><br /> <br /> The Dragon protocol is implemented with snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses. The Dragon system was designed to support 4 to 8 Dragon processors. <br /> <br /> <br /> <br /> <br /> <br /> =Prefetching=<br /> <br /><br /> Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessors, however, prefetching comes at a cost. Due to prefetching, the data can be modified in such a way that the memory coherence protocol will not be able to handle the effects. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. <br /> <br /><br /> An example of this type of situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. 
The physical-memory references for the page tables are different from the physical-memory references for the data. Because of prefetching, there may be a problem with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:<br /> # The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.<br /> # Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.<br /> # Data in virtual-page A is accessed.<br /> <br /> Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual-page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical-page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.<br /> <br /> To prevent such errors, the architecture provides special instructions, which software executes immediately after the page-table update, to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. 
It is not necessary to perform a TLB invalidation operation preceding the table update.<br /> <br /> More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> <br /> <br /> <br /> <br /> = CMP Implementation in Intel Architecture =<br /> <br /> Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. <br /> <br /> <br /> '''Uniprocessor Architecture'''<br /> <br /> The diagram below shows the structure of the memory cluster in Intel Pentium M processor.<br /> <br /> <center>[[Image:intel_cache1.jpg]]</center> <br /><br /> <br /> <br /> In this structure we have,<br /> * A unified on-chip '''L1 cache''' with the '''processor/core''',<br /> * A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,<br /> * The second level '''L2 cache''' along with the '''prefetch unit''' and<br /> * '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent across.These wide buses bring in multiple data bytes at a time. <br /> <br /> As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''', were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. 
It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version on the cache line exists in any other cache in the system.<br /> <br /> <br /> '''CMP Architecture'''<br /> <br /> For CMP implementation, Intel chose the bus-based architecture using snoopy protocols vs the '''directory protocol''' because though directory protocol reduces the active power due to reduced snoop activity, it '''increased''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for the processors in the mobility family, directory-based solution was less favorable since battery life mainly depends on static power consumption and less on dynamic power.<br /> Let us examine how '''CMP''' was implemented in '''Intel Core Duo''', which was one of the first dual-core processor for the budget/entry-level market. <br /> The general CMP implementation structure of the Intel Core Duo is shown below<br /> <br /> <center>[[Image:intel_cache2.jpg]]</center> <br /><br /> <br /> This structure has the following changes when compared to the uniprocessor memory cluster structure. <br /> * '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.<br /> * The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. 
The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.<br /> * The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.<br /> * A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.<br /> <br /> This new '''partitioned structure''' for the memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. <br /> For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to <br /> [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> <br /> The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol; From using a '''single shared bus''' to '''dual independent buses (DIB)''' doubling the available bandwidth and to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, the snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on next generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects using MESIF protocol'''.<br /> <br /> <br /> =Implementation Complexities of MESI=<br /> There are two possible causes of complexity with the MESI protocol during replacement of a cache line. 
Some MESI implementations require a message to be sent to memory when a cache line is flushed - an '''E''' to '''I''' transition, as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from the memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).<br /> Source: http://rsim.cs.illinois.edu/rsim/Manual/node109.html<br /> <br /> =References=<br /> # [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]<br /> # [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]<br /> # [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> # [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]<br /> # [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&%20MESI.pdf Cache consistency with MESI on Intel processor]<br /> # [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> # [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=4913 Silicon Graphics Computer Systems]<br /> # 
[http://books.google.com/books?id=g82fofiqa5IC&printsec=frontcover&dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&source=bl&ots=COrdamlfVn&sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&hl=en&ei=0ZO6S4TJGcOclgejzI3BBw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CAgQ6AEwAA#v=onepage&q=&f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]<br /> # [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]<br /> # [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> # [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]<br /> # [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]<br /> # [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]<br /> # [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=289691 XDBus]</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch8_cl&diff=44383 CSC/ECE 506 Spring 2011/ch8 cl 2011-03-17T23:49:29Z

<p>Cslingaf: </p> <hr /> <div>=Introduction to bus-based cache coherence in real machines=<br /> <br /> <br /> ==SMP Protocol==<br /> Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect. In multicore processors ("chip multiprocessors," or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies. This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is also a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.<br /> <br /> <center>[[Image:Busbased SMP.jpg]]</center><br /> <br /> <br /> At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures that the invariants of the states are maintained. 
The coherence states used by most cache coherence protocols are shown in ''Table 1'':<br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''States'''<br /> | '''Access Type'''<br /> | '''Invariant'''<br /> |-<br /> | '''Modified'''<br /> | read, write<br /> | all other caches in I state<br /> |-<br /> | '''Exclusive'''<br /> | read<br /> | all other caches in I state<br /> |-<br /> | '''Owned'''<br /> | read<br /> | all other caches in I or S state<br /> |-<br /> | '''Shared'''<br /> | read<br /> | no other cache in M or E state<br /> |-<br /> | '''Invalid'''<br /> | -<br /> | -<br /> |}<br /> <br /> <br /> The first widely adopted approach to cache coherence is snooping on a bus. We will now discuss how some real machines maintain cache coherence using '''''snooping-based coherence protocols'''''. For more information on snooping-based protocols, refer to Chapter 8 of the Solihin textbook.<br /> <br /><br /> <br /><br /> <br /> <br /> <br /> =Snooping Protocols=<br /> ==MSI Protocol==<br /> <br /> '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''', or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or holds invalid data. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdx causes the other caches to invalidate (demote) their copies to the '''I''' state. If the line is present in the '''M''' state in another cache, that cache flushes it. 
A write that hits a line in the '''S''' state still issues a BusRdX, and the line is promoted (upgraded) to the '''M''' state.<br /> <br /> The following state transition diagram for the MSI protocol explains the working of the protocol:<br /> <br /> <center>[[Image:MSI.jpg]]</center><br /> <br /> <br /> <br /> ===Synapse protocol===<br /> From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the line goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, giving up the block entirely. The choice between moving to '''S''' and moving to '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].<br /> <br /> In the Synapse protocol the '''M''' state is called the '''D''' (Dirty) state. The following state transition diagram for the Synapse protocol shows how it works.<br /> <br /> <center>[[Image:Synapse1.jpg]]</center><br /> <br /> ==MESI==<br /> MSI has a major drawback: each read-write sequence incurs 2 bus transactions, irrespective of whether the cache line is stored in only one cache or not. This is a huge setback for highly parallel programs that have little data sharing. 
The '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.<br /> Let us briefly see how the MESI protocol works. For a more detailed version, refer to page 215 of the Solihin textbook.<br /> <br /> The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. <br /> * '''Invalid''' : The cache line is either not present or is invalid<br /> * '''Exclusive''' : The cache line is clean and is held by this core/processor only<br /> * '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is out of date as well.<br /> * '''Shared''' : The cache line is clean and may be shared by more than one core/processor<br /> <br /> In a nutshell, the MESI protocol works as follows: <br /> A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in this cache and is in the '''I''' state in all other processors on the bus (if any). 
The table below summarizes the MESI protocol.<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The transition diagram from the lecture slides is given below for reference.<br /> <br /> <center>[[Image:MESI.jpg]]</center> <br /><br /> <br /> The '''Pentium Pro''' microprocessor, introduced in 1995, was the '''first''' Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations. SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in '''Intel's (Nehalem-EP) quad-core x86-64'''. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the '''QuickPath Interconnect''', which uses point-to-point interconnection technology based on a distributed shared memory architecture. It uses a modified version of the MESI protocol called '''MESIF''', which introduces an additional state, F, the Forward state. <br /> <br /> The '''Intel architecture''' uses the MESI protocol as the '''basis''' for ensuring cache coherence. This holds whether the processor is one of the older ones that communicate over a '''common bus''' or one that uses the new Intel '''QuickPath''' point-to-point interconnection technology. 
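The MESI rules summarized above can be illustrated with a small simulation. This is only a minimal sketch, assuming a single atomic bus and one state per address; the class names <code>Bus</code> and <code>Cache</code> are hypothetical, and the use of BusRdX for a write to an Invalid line follows the MSI naming used earlier rather than any particular Intel implementation.<br />

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class Bus:
    """Hypothetical shared bus: every attached cache snoops every transaction."""

    def __init__(self):
        self.caches = []
        self.log = []  # bus transactions in the order they were issued

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast(self, kind, addr, sender):
        """Send a transaction; return True if another cache held a valid copy."""
        self.log.append(kind)
        shared = False
        for cache in self.caches:
            if cache is not sender:
                shared |= cache.snoop(kind, addr)
        return shared


class Cache:
    """Hypothetical per-processor cache that tracks only the MESI state."""

    def __init__(self, bus):
        self.lines = {}  # address -> State; a missing address means Invalid
        self.bus = bus
        bus.attach(self)

    def state(self, addr):
        return self.lines.get(addr, State.I)

    def read(self, addr):
        if self.state(addr) is State.I:
            # Fetch: E if no other cache has the line, S otherwise.
            shared = self.bus.broadcast("BusRd", addr, self)
            self.lines[addr] = State.S if shared else State.E

    def write(self, addr):
        current = self.state(addr)
        if current is State.I:
            self.bus.broadcast("BusRdX", addr, self)
        elif current is State.S:
            self.bus.broadcast("BusUpgr", addr, self)
        # A write to an E or M line needs no bus transaction.
        self.lines[addr] = State.M

    def snoop(self, kind, addr):
        current = self.state(addr)
        if current is State.I:
            return False
        if kind == "BusRd":
            self.lines[addr] = State.S  # an M line would also be flushed here
        else:  # BusRdX or BusUpgr: another cache wants to write
            self.lines[addr] = State.I
        return True


bus = Bus()
p0, p1 = Cache(bus), Cache(bus)
p0.read(0x40)   # no other copy exists, so p0 gets the line in E
p0.write(0x40)  # E -> M without a second bus transaction
```

In this sketch the read-then-write sequence on <code>p0</code> costs a single bus transaction, which is the saving over MSI that the Exclusive state is meant to provide.<br />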
<br /> <br /> Let us now walk through the '''MESIF protocol''':<br /> <br /> The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. <br /> As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all caches holding it in the S state remain dormant. Hence, by designating a single cache to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding copy transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.<br /> All M-to-S and E-to-S state transitions of MESI now become '''M to F''' and '''E to F''' transitions. 
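The migration of the F state described above can be sketched in a few lines. This is a simplified illustration, not Intel's implementation: the function name <code>read_request</code> and the dictionary representation are hypothetical, only the F, S and I states of one cache line are modelled, and the case where no cache holds the line in F (memory responds and the requester takes F) is an assumption made for the sketch.<br />

```python
def read_request(states, requester):
    """Serve a read miss for a single cache line under a simplified MESIF model.

    states maps a cache name to 'F', 'S' or 'I' for that line.
    Returns (responder, new_states) and leaves the input unchanged.
    """
    new_states = dict(states)
    forwarders = [name for name, st in states.items() if st == "F"]
    if forwarders:
        # Only the F copy answers; the S copies stay silent.
        responder = forwarders[0]
        new_states[responder] = "S"  # the older copy drops back to S
    else:
        # Assumption for this sketch: with no forwarder, memory supplies the data.
        responder = "memory"
    new_states[requester] = "F"  # the newest copy becomes the forwarder
    return responder, new_states


line = {"c0": "F", "c1": "S", "c2": "I"}
responder, line_after = read_request(line, "c2")
```

Here a single cache (<code>c0</code>) answers the request even though two caches hold the line, and the forwarding role then moves to the requester <code>c2</code>.<br />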
<br /> The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol in that a line in the F state is '''clean''': a valid copy is also stored in memory. Thus, unlike a line in the Owned state of the MOESI protocol, for which the memory copy is out of date and the owner must eventually write the data back, a line in the F state can be evicted or converted to the S state, if desired. <br /> <br /> More information on the QuickPath Interconnect and MESIF protocol can be found at<br /> '''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''<br /> <br /> <br /> ==MOESI==<br /> The dual-core [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die. Cache coherence produces bigger problems on such multiprocessors, so it was necessary to use an appropriate coherence protocol to address them. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], which was the competitive counterpart from Intel, used the MESI protocol to handle cache coherence. MESI came with the drawback of using much time and bandwidth in certain situations. <br /> <br /> [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI added a fifth state to the MESI protocol called '''“Owned”'''. MOESI addresses the bandwidth problem faced in the MESI protocol when a processor that has invalid data in its cache wants to access data that another processor has modified. The processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. This drawback is removed in MOESI by allowing dirty sharing. When the data is held by a processor in the new state '''“Owned”''', it can provide other processors the modified data without or even before writing it to the main memory. This is called '''''dirty sharing'''''. 
The processor with the data in the '''"Owned"''' state remains responsible for updating the main memory later, when the cache line is evicted.<br /> <br /> MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture. The AMD dual-core Opteron can maintain cache coherence in systems of up to 8 processors using this protocol.<br /> <br /> The five different states of the MOESI protocol are:<br /> * '''Modified (M)''' : The most recent copy of the data is present in the cache line, and it is not present in any other processor's cache.<br /> * '''Owned (O)''' : The cache line has the most recent, correct copy of the data, which can be shared by other processors. The processor holding the line in this state is responsible for writing the correct value to main memory before the line is evicted. <br /> * '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor, and a copy is present in the main memory. <br /> * '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. 
<br /> * '''Invalid (I)''' : A cache line does not hold a valid copy of the data.<br /> <br /> A detailed explanation of how this protocol is implemented on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]<br /> <br /> The following table summarizes the MOESI protocol:<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Owned'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | Maybe<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The state transition diagram for MOESI is shown below: <br /> <br /> <br /> <center>[[Image:MOESI_State_Transition_Diagram.jpg]]</center><br /> <br /> <center> MOESI State Transition Diagram</center><br /> <br /><br /> <br /><br /> <br /> ===Optimization techniques on MOESI===<br /> <br /> In real machines, applying optimization techniques to the standard cache coherence protocol can improve the performance of the machine. 
For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. <br /> <br /> The optimization focuses on a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two cores to communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.<br /> <br /> When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. Now, the producer thread circles the ring buffer that holds the entries and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). 
This slows down the process.<br /> <br /> Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Better performance can thus be achieved by applying such optimization techniques to standard protocols when they are implemented in real machines.<br /> <br /> You can find more information on how this is implemented, and on various other optimizations, in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> <br /> <br /> <br /> <br /> ==Dragon Protocol==<br /> The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol, which does not invalidate other cached copies as the coherence protocols we have seen so far do. Write propagation is achieved by updating the cached copies instead of invalidating them. The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the cache until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. The protocol can dynamically detect the sharing status of a block and use a write-through policy for shared blocks and a write-back policy for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''. <br /> * '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. 
<br /> * '''Shared Modified (Sm)''' - Only one cache in the system can hold a given block in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.<br /> * '''Shared Clean (Sc)''' - Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).<br /> Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory, in order to keep memory consistent. For more information on the Dragon protocol, refer to page 229 of the Solihin textbook. The state transition diagram is given below for reference.<br /> <br /> <center>[[Image:Dragon.jpg]]</center><br /> <br /> The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses. The Dragon system was designed to support 4 to 8 Dragon processors. <br /> <br /> <br /> <br /> <br /> <br /> =Prefetching=<br /> <br /><br /> Instruction prefetching is a technique used to speed up the execution of a program. But in multiprocessors, prefetching comes at the cost of performance. Due to prefetching, the data can be modified in such a way that the memory coherence protocol will not be able to handle the effects. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. <br /> <br /><br /> An example of this type of situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. 
The physical-memory references for the page tables are different from the physical-memory references for the data. Because of prefetching, there may be a problem with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:<br /> # The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.<br /> # Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.<br /> # Data in virtual-page A is accessed.<br /> <br /> Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual-page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical-page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.<br /> <br /> To prevent such errors, software must execute a translation-invalidating instruction (on AMD64, for example, INVLPG or a MOV to CR3) immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. 
It is not necessary to perform a TLB invalidation operation preceding the table update.<br /> <br /> More information about this can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> <br /> <br /> <br /> <br /> = CMP Implementation in Intel Architecture =<br /> <br /> Let us now see how the Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. <br /> <br /> <br /> '''Uniprocessor Architecture'''<br /> <br /> The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.<br /> <br /> <center>[[Image:intel_cache1.jpg]]</center> <br /><br /> <br /> <br /> In this structure we have,<br /> * A unified on-chip '''L1 cache''' with the '''processor/core''',<br /> * A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,<br /> * The second level '''L2 cache''' along with the '''prefetch unit''' and<br /> * The '''front side bus (FSB)''', a single shared bi-directional bus across which all the traffic is sent. This wide bus brings in multiple data bytes at a time. <br /> <br /> As Intel explains it, using this structure, processor requests were first sought in the '''L2 cache''' and only on a '''miss''' were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. 
It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.<br /> <br /> <br /> '''CMP Architecture'''<br /> <br /> For the CMP implementation, Intel chose the bus-based architecture using snoopy protocols over the '''directory protocol''' because, although a directory protocol reduces the active power due to reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, since battery life depends mainly on static power consumption and less on dynamic power.<br /> Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. <br /> The general CMP implementation structure of the Intel Core Duo is shown below.<br /> <br /> <center>[[Image:intel_cache2.jpg]]</center> <br /><br /> <br /> This structure has the following changes when compared to the uniprocessor memory cluster structure. <br /> * The '''L1 cache''' and '''processor/core''' structure is '''duplicated''' to give 2 cores.<br /> * The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. 
The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.<br /> * The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.<br /> * A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.<br /> <br /> This new '''partitioned structure''' for the memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. <br /> For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to <br /> [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> <br /> The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''', which doubled the available bandwidth, and on to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. 
With the production of processors based on next generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects using MESIF protocol'''.<br /> <br /> <br /> <br /> <br /> =References=<br /> # [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]<br /> # [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]<br /> # [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> # [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]<br /> # [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&%20MESI.pdf Cache consistency with MESI on Intel processor]<br /> # [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> # [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=4913 Silicon Graphics Computer Systems]<br /> # [http://books.google.com/books?id=g82fofiqa5IC&printsec=frontcover&dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&source=bl&ots=COrdamlfVn&sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&hl=en&ei=0ZO6S4TJGcOclgejzI3BBw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CAgQ6AEwAA#v=onepage&q=&f=false Parallel computer architecture: a 
hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]<br /> # [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]<br /> # [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> # [http://portal.acm.org/citation.cfm?id=6514Cache Coherence protocols: evaluation using a multiprocessor simulation model]<br /> # [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]<br /> # [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]<br /> # [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=289691 XDBus]</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch8_cl&diff=44382 CSC/ECE 506 Spring 2011/ch8 cl 2011-03-17T23:48:05Z

<p>Cslingaf: </p> <hr /> <div>=Introduction to bus-based cache coherence in real machines=<br /> <br /> <br /> ==SMP Architecture==<br /> Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect. In the case of multicore processors ("chip multiprocessors," or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies. This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.<br /> <br /> <center>[[Image:Busbased SMP.jpg]]</center><br /> <br /> <br /> At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures that the invariants of the states are maintained. 
The different coherence states used by most cache coherence protocols are shown in ''Table 1'':<br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''States'''<br /> | '''Access Type'''<br /> | '''Invariant'''<br /> |-<br /> | '''Modified'''<br /> | read, write<br /> | all other caches in I state<br /> |-<br /> | '''Exclusive'''<br /> | read<br /> | all other caches in I state<br /> |-<br /> | '''Owned'''<br /> | read<br /> | all other caches in I or S state<br /> |-<br /> | '''Shared'''<br /> | read<br /> | no other cache in M or E state<br /> |-<br /> | '''Invalid'''<br /> | -<br /> | -<br /> |}<br /> <br /> <br /> The first widely adopted approach to cache coherence is snooping on a bus. We will now discuss how some real machines maintain cache coherence using '''''snooping-based coherence protocols'''''. For more information on snooping-based protocols, refer to Chapter 8 of the Solihin textbook.<br /> <br /><br /> <br /><br /> <br /> <br /> <br /> =Snooping Protocols=<br /> ==MSI Protocol==<br /> <br /> '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as being in the '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)''' state. '''Invalid''' means the cache line is either not present or is present but invalid. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes the other caches to invalidate (demote) their copies to the '''I''' state. If the line is present in the '''M''' state in another cache, that cache flushes it. 
A write that hits a line in the '''S''' state still issues a BusRdX, and the line is promoted (upgraded) to the '''M''' state.<br /> <br /> The following state transition diagram for the MSI protocol explains the working of the protocol:<br /> <br /> <center>[[Image:MSI.jpg]]</center><br /> <br /> <br /> <br /> ===Synapse protocol===<br /> From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the line goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, giving up the block entirely. The choice between moving to '''S''' and moving to '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].<br /> <br /> In the Synapse protocol the '''M''' state is called the '''D''' (Dirty) state. The following state transition diagram for the Synapse protocol shows how it works.<br /> <br /> <center>[[Image:Synapse1.jpg]]</center><br /> <br /> ==MESI==<br /> MSI has a major drawback: each read-write sequence incurs 2 bus transactions, irrespective of whether the cache line is stored in only one cache or not. This is a huge setback for highly parallel programs that have little data sharing. 
The '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.<br /> Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, pg. 215.<br /> <br /> The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. <br /> * '''Invalid''' : The cache line is either not present or is invalid<br /> * '''Exclusive''' : The cache line is clean and is held by this core/processor only<br /> * '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is stale as well.<br /> * '''Shared''' : The cache line is clean and may be held by more than one core/processor<br /> <br /> In a nutshell, the MESI protocol works as follows: <br /> A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). 
The table below summarizes the MESI protocol.<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The transition diagram from the lecture slides is given below for reference.<br /> <br /> <center>[[Image:MESI.jpg]]</center> <br /><br /> <br /> The '''Pentium Pro''' microprocessor, introduced in 1995, was the '''first''' Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations. SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core micro-architecture in '''Intel's (Nehalem-EP) quad-core x86-64'''. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the '''QuickPath Interconnect''', a point-to-point interconnection technology based on a distributed shared memory architecture. It uses a modified version of the MESI protocol called '''MESIF''', which introduces an additional state, F, the forward state. <br /> <br /> The '''Intel architecture''' uses the MESI protocol as the '''basis''' for ensuring cache coherence, whether on one of the older processors that use a '''common bus''' to communicate or on the new Intel '''QuickPath''' point-to-point interconnection technology. 
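To make the MESI summary above concrete, the following is a minimal sketch of a snoopy bus that tracks the state of a single cache line and counts bus transactions. It is a toy model for illustration, not a description of any Intel implementation; the names <code>Bus</code>, <code>read</code> and <code>write</code> are hypothetical, and data movement, flushes and evictions are not modeled.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class Bus:
    """Toy snoopy bus for ONE cache line (hypothetical model, no data, no evictions)."""

    def __init__(self, n_caches):
        self.states = [State.I] * n_caches
        self.transactions = 0  # number of BusRd / BusRdX / BusUpgr posted

    def read(self, cid):
        if self.states[cid] is not State.I:
            return  # read hit: M, E and S all allow reads without using the bus
        self.transactions += 1  # BusRd
        others = [i for i, s in enumerate(self.states)
                  if i != cid and s is not State.I]
        for i in others:
            # A snooping M copy would flush here; every valid copy ends up in S.
            self.states[i] = State.S
        # E if no other cache holds the line, S otherwise.
        self.states[cid] = State.S if others else State.E

    def write(self, cid):
        state = self.states[cid]
        if state is State.M:
            return  # write hit, already the only (dirty) copy
        if state is State.E:
            self.states[cid] = State.M  # silent upgrade: no bus transaction
            return
        self.transactions += 1  # BusUpgr from S, BusRdX from I
        for i in range(len(self.states)):
            if i != cid:
                self.states[i] = State.I  # all other copies are invalidated
        self.states[cid] = State.M


bus = Bus(n_caches=2)
bus.read(0)    # miss, no sharers: cache 0 gets E
bus.write(0)   # E -> M without touching the bus
print(bus.transactions, [s.name for s in bus.states])
```

In this sketch a read followed by a write from the same processor costs one bus transaction, where the same sequence under MSI would cost two; that difference is what the Exclusive state is for.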
<br /> <br /> Let us now briefly walk through the '''MESIF protocol''':<br /> <br /> The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor needs only a single copy of the data, the system would be wasting bandwidth. <br /> As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds, while all the caches holding it in the S state stay silent. Hence, by designating a single cache to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding copy transitions from F to S. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always the one in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.<br /> All M to S and E to S state transitions now become '''M to F''' and '''E to F''' transitions. 
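The forwarding rule described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the function name <code>mesif_read</code> is hypothetical, only the F, S and I states of one cache line are modeled, and when no cache holds the line in F the sketch assumes memory supplies the data and the requester still becomes the forwarder.

```python
def mesif_read(states, requester):
    """Serve a read miss for one cache line under the F-state rule.

    states maps a cache id to "F", "S" or "I" (M and E are left out of this sketch).
    Returns the id of the cache that supplies the data, or None when memory does.
    """
    assert states[requester] == "I", "this sketch only models read misses"
    forwarders = [c for c, s in states.items() if s == "F"]
    assert len(forwarders) <= 1, "at most one copy may be in the F state"

    responder = forwarders[0] if forwarders else None
    if responder is not None:
        states[responder] = "S"   # the older copy drops back to S
    # Assumption of this sketch: the requester always ends up as the forwarder.
    states[requester] = "F"       # the newest copy takes over the F state
    return responder


# Hypothetical four-cache system: cache 0 forwards, caches 1 and 2 share silently.
line = {0: "F", 1: "S", 2: "S", 3: "I"}
supplier = mesif_read(line, requester=3)
print(supplier, line)
```

However many S copies exist, exactly one cache answers the request, which is the bandwidth saving the F state is meant to provide.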
<br /> The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol: a line in F is clean, so it is '''not''' the only up-to-date copy because a valid copy is also stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which memory is stale and the owner must eventually write the data back, the data in the F state can be evicted or converted to the S state, if desired. <br /> <br /> More information on the QuickPath Interconnect and MESIF protocol can be found at<br /> '''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''<br /> <br /> <br /> ==MOESI==<br /> The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die. Cache coherence poses bigger problems on such multiprocessors, so an appropriate coherence protocol was needed. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence. MESI came with the drawback of using much time and bandwidth in certain situations. <br /> <br /> [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. MOESI addresses the bandwidth problem that MESI faces when a processor with an invalid copy of a line in its cache wants to access data that another processor has modified. The processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing. When the data is held by a processor in the new state '''“Owned”''', that processor can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. 
The processor holding the data in the '''"Owned"''' state remains responsible for updating main memory later, when the cache line is evicted.<br /> <br /> MOESI has become one of the most popular snoop-based protocols and is supported in the AMD64 architecture. The AMD dual-core Opteron can maintain cache coherence in systems of up to 8 processors using this protocol.<br /> <br /> The five different states of the MOESI protocol are:<br /> * '''Modified (M)''' : The cache line holds the most recent copy of the data, and no other processor cache holds a copy.<br /> * '''Owned (O)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for writing the correct value to main memory when the line is evicted. <br /> * '''Exclusive (E)''' : The cache line holds the most recent, correct copy of the data, which is present only on this processor; main memory also holds a valid copy. <br /> * '''Shared (S)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. 
<br /> * '''Invalid (I)''' : The cache line does not hold a valid copy of the data.<br /> <br /> A detailed explanation of the implementation of this protocol on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]<br /> <br /> The following table summarizes the MOESI protocol:<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Owned'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | Maybe (in S state)<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The state transitions for MOESI are shown below: <br /> <br /> <br /> <center>[[Image:MOESI_State_Transition_Diagram.jpg]]</center><br /> <br /> <center> MOESI State Transition Diagram</center><br /> <br /><br /> <br /><br /> <br /> <br /> ==Dragon Protocol==<br /> The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols we have seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them. 
The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays bringing memory in line with the caches until the data is evicted and written back, which saves time and reduces the number of memory accesses. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. The protocol detects the sharing status of a block dynamically and uses a write-through (update) policy towards the other caches for shared blocks and a write-back policy for blocks that are currently not shared. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''. <br /> * '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. <br /> * '''Shared Modified (Sm)''' - Only one cache in the system can hold a given line in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.<br /> * '''Shared Clean (Sc)''' - Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).<br /> Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.<br /> <br /> <center>[[Image:Dragon.jpg]]</center><br /> <br /> The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. 
The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses. The Dragon system was designed to support 4 to 8 Dragon processors. <br /> <br /> <br /> <br /> <br /> <br /> =Prefetching=<br /> <br /><br /> Instruction prefetching is a technique used to speed up the execution of a program. But in multiprocessors, prefetching comes at a cost: because of prefetching, data can be modified in ways whose effects the memory coherence protocol is not able to handle. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. <br /> <br /><br /> An example of this type of situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different from the physical-memory references for the data. Because of prefetching, there may be a problem with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:<br /> # The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.<br /> # Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.<br /> # Data in virtual-page A is accessed.<br /> <br /> Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual-page A is updated. 
Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical-page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.<br /> <br /> To prevent such errors, the architecture provides special instructions (in the AMD64 manual cited below, INVLPG or a MOV to CR3) that software must execute immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.<br /> <br /> More information about this can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> <br /> =Optimization techniques on MOESI=<br /> <br /> In real machines, applying optimization techniques to the standard cache coherence protocol improves the performance of the machine. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0×10), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. <br /> <br /> The optimization described here focuses on a small subset of compute problems that behave like producer and consumer programs. In such a problem, a thread of a program running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. 
Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.<br /> <br /> Suppose the producer and consumer communicate through a ring buffer. When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Now, the producer thread circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.<br /> <br /> Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by applying such optimization techniques to standard protocols when they are implemented in real machines.<br /> <br /> You can find more information on how this is implemented, and on various other optimizations, in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> <br /> <br /> <br /> <br /> = CMP Implementation in Intel Architecture =<br /> <br /> Let us now see how the Intel architecture, using the MESI protocol, progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. 
<br /> <br /> <br /> '''Uniprocessor Architecture'''<br /> <br /> The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.<br /> <br /> <center>[[Image:intel_cache1.jpg]]</center> <br /><br /> <br /> <br /> In this structure we have,<br /> * A unified on-chip '''L1 cache''' with the '''processor/core''',<br /> * A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,<br /> * The second level '''L2 cache''' along with the '''prefetch unit''' and<br /> * '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent. These wide buses bring in multiple data bytes at a time. <br /> <br /> As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''' were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.<br /> <br /> <br /> '''CMP Architecture'''<br /> <br /> For the CMP implementation, Intel chose the bus-based architecture with snoopy protocols over the '''directory protocol''' because, although a directory protocol reduces active power through reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. 
Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, since battery life depends mainly on static power consumption and less on dynamic power.<br /> Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. <br /> The general CMP implementation structure of the Intel Core Duo is shown below<br /> <br /> <center>[[Image:intel_cache2.jpg]]</center> <br /><br /> <br /> This structure has the following changes when compared to the uniprocessor memory cluster structure. <br /> * The '''L1 cache''' and '''processor/core''' structure is '''duplicated''' to give 2 cores.<br /> * The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: the '''L2 controller''' and the '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the cores and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.<br /> * The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.<br /> * A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.<br /> <br /> This new '''partitioned structure''' for the memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. 
<br /> For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to <br /> [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> <br /> The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol; From using a '''single shared bus''' to '''dual independent buses (DIB)''' doubling the available bandwidth and to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, the snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on next generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects using MESIF protocol'''.<br /> <br /> <br /> <br /> <br /> =References=<br /> # [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]<br /> # [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]<br /> # [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> # [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]<br /> # [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&%20MESI.pdf Cache consistency with MESI on Intel processor]<br /> # [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]<br /> # 
[http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> # [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=4913 Silicon Graphics Computer Systems]<br /> # [http://books.google.com/books?id=g82fofiqa5IC&printsec=frontcover&dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&source=bl&ots=COrdamlfVn&sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&hl=en&ei=0ZO6S4TJGcOclgejzI3BBw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CAgQ6AEwAA#v=onepage&q=&f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]<br /> # [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]<br /> # [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> # [http://portal.acm.org/citation.cfm?id=6514Cache Coherence protocols: evaluation using a multiprocessor simulation model]<br /> # [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]<br /> # [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]<br /> # [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=289691 XDBus]</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch8_cl&diff=44381 CSC/ECE 506 Spring 2011/ch8 cl 2011-03-17T23:46:32Z

<p>Cslingaf: </p> <hr /> <div>=Introduction to bus-based cache coherence in real machines=<br /> <br /> <br /> ==SMP Protocol==<br /> Most parallel software in the commercial market relies on the shared-memory programming model in which all processors access the same physical address space. And the most common multiprocessors today use [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture which use a common bus as the interconnect. In the case of multicore processors ("chip multiprocessors," or CMP) the SMP architecture applies to the cores treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies. This is called '''''cache coherence problem'''''. It is critical to achieve correctness and performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.<br /> <br /> <center>[[Image:Busbased SMP.jpg]]</center><br /> <br /> <br /> At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. 
The different coherent states used by most of the cache coherent protocols are as shown in ''Table 1'':<br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''States'''<br /> | '''Access Type'''<br /> | '''Invariant'''<br /> |-<br /> | '''Modified'''<br /> | read, write<br /> | all other caches in I state<br /> |-<br /> | '''Exclusive'''<br /> | read<br /> | all other caches in I state<br /> |-<br /> | '''Owned'''<br /> | read<br /> | all other caches in I or S state<br /> |-<br /> | '''Shared'''<br /> | read<br /> | no other cache in M or E state<br /> |-<br /> | '''Invalid'''<br /> | -<br /> | -<br /> |}<br /> <br /> <br /> The first widely adopted approach to cache coherence is snooping on a bus. We will now discuss how some real time machines maintain cache coherence using '''''snooping based coherence protocols'''''. For more information on snooping based protocols refer to Solihin text book Chapter 8.<br /> <br /><br /> <br /><br /> <br /> <br /> <br /> =Snooping Protocols=<br /> ==MSI Protocol==<br /> <br /> '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' which is one of the earliest snooping-based cache coherence-protocols. It marks the cache line in '''Modified (M) ,Shared (S)''' and '''Invalid (I)''' state. '''Invalid''' means the cache line is either not present or is invalid state. If the cache line is clean and is shared by more than one processor , it is marked '''shared'''. If cache line is dirty and the processor has exclusive ownership of the cache line, it is present in '''Modified''' state. BusRdx causes others to invalidate (demote) to '''I''' state. If it is present in '''M''' state in another cache, it will flush. 
A write that hits a line in '''S''' state still posts a BusRdx (an upgrade), after which the line is promoted to '''M''' state.<br /> <br /> The following state transition diagram shows how the MSI protocol works:<br /> <br /> <center>[[Image:MSI.jpg]]</center><br /> <br /> <br /> <br /> ===Synapse protocol===<br /> From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the block goes to '''S''' state. In certain cases it may be more appropriate to move to '''I''' state instead, giving up the block entirely. Choosing '''S''' over '''I''' reflects the designer's assumption that the original processor is more likely to keep reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from '''M''' state to '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> <br /> In the Synapse protocol the '''M''' state is called the '''D''' (Dirty) state. The following state transition diagram shows how the Synapse protocol works.<br /> <br /> <center>[[Image:Synapse1.jpg]]</center><br /> <br /> ==MESI==<br /> MSI has a major drawback: a read followed by a write to the same line incurs 2 bus transactions, irrespective of whether the cache line is stored in only one cache or not. This is a huge setback for highly parallel programs that have little data sharing. 
The '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.<br /> Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, pg. 215.<br /> <br /> The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. <br /> * '''Invalid''' : The cache line is either not present or is invalid<br /> * '''Exclusive''' : The cache line is clean and is held by this core/processor only<br /> * '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is stale as well.<br /> * '''Shared''' : The cache line is clean and may be held by more than one core/processor<br /> <br /> In a nutshell, the MESI protocol works as follows: <br /> A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). 
The table below summarizes the MESI protocol.<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The transition diagram from the lecture slides is given below for reference.<br /> <br /> <center>[[Image:MESI.jpg]]</center> <br /><br /> <br /> The '''Pentium Pro''' microprocessor, introduced in 1995, was the '''first''' Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations. SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core micro-architecture in '''Intel's (Nehalem-EP) quad-core x86-64'''. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the '''QuickPath Interconnect''', a point-to-point interconnection technology based on a distributed shared memory architecture. It uses a modified version of the MESI protocol called '''MESIF''', which introduces an additional state, F, the forward state. <br /> <br /> The '''Intel architecture''' uses the MESI protocol as the '''basis''' for ensuring cache coherence, whether on one of the older processors that use a '''common bus''' to communicate or on the new Intel '''QuickPath''' point-to-point interconnection technology. 
<br /> <br /> Let us now briefly walk through the '''MESIF protocol''':<br /> <br /> The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor needs only a single copy of the data, the system would be wasting bandwidth. <br /> As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the caches holding it in the S state remain dormant. Hence, by designating a single cache to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding copy transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the F state to the new copy '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data is spread across several nodes.<br /> All M-to-S and E-to-S state transitions of MESI now become '''M to F''' and '''E to F''' transitions. 
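The forwarding rule described above can be sketched as follows. This is a simplified illustration that models only the F, S and I states of one cache line; the function name <code>serve_read</code> and the dictionary layout are hypothetical, and the fallback (memory supplies the data and the requester takes F when no F copy exists) is an assumption made for the sketch, not something stated in the Intel documentation cited here.

```python
def serve_read(states, requester):
    """Serve a read of one cache line and return the list of responders.

    `states` maps a cache id to "F", "S" or "I" and is updated in place.
    The M and E states are deliberately left out of this sketch.
    """
    if states[requester] != "I":
        return []  # read hit: nobody has to respond
    forwarders = [c for c, s in states.items() if s == "F"]
    if forwarders:
        old = forwarders[0]
        states[old] = "S"        # the older copy drops back to S
        states[requester] = "F"  # the F state migrates to the newest copy
        return [old]             # S-state caches stay silent
    # Assumption: with no F copy (for example, it was evicted) memory
    # supplies the data and the newest copy takes the F state.
    states[requester] = "F"
    return ["memory"]


def naive_responders(states, requester):
    """Every cache holding a valid copy answers, as in the redundant case above."""
    return [c for c, s in states.items() if c != requester and s != "I"]


caches = {0: "F", 1: "S", 2: "S", 3: "I"}
redundant = naive_responders(caches, 3)  # caches 0, 1 and 2 would all reply
actual = serve_read(caches, 3)           # only cache 0 replies
```

With three sharers, the naive scheme produces three data responses for one request, while the F-state rule produces one and leaves cache 3 as the new forwarder.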
<br /> The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. <br /> <br /> More information on the QuickPath Interconnect and MESIF protocol can be found at<br /> '''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''<br /> <br /> == CMP Implementation in Intel Architecture ==<br /> <br /> Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. <br /> <br /> <br /> '''Uniprocessor Architecture'''<br /> <br /> The diagram below shows the structure of the memory cluster in Intel Pentium M processor.<br /> <br /> <center>[[Image:intel_cache1.jpg]]</center> <br /><br /> <br /> <br /> In this structure we have,<br /> * A unified on-chip '''L1 cache''' with the '''processor/core''',<br /> * A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,<br /> * The second level '''L2 cache''' along with the '''prefetch unit''' and<br /> * '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent across.These wide buses bring in multiple data bytes at a time. <br /> <br /> As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''', were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. 
It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and the internal buffers) in parallel. It also handles RFO requests (BusUpgr) and lets the operation continue only after it has guaranteed that no other version of the cache line exists in any other cache in the system.<br /> <br /> <br /> '''CMP Architecture'''<br /> <br /> For its CMP implementation, Intel chose a bus-based architecture with snoopy protocols over a '''directory protocol''' because, although a directory protocol reduces active power by reducing snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, as battery life depends mainly on static power consumption and less on dynamic power.<br /> Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. <br /> The general CMP implementation structure of the Intel Core Duo is shown below.<br /> <br /> <center>[[Image:intel_cache2.jpg]]</center> <br /><br /> <br /> This structure has the following changes when compared to the uniprocessor memory cluster structure. <br /> * The '''L1 cache''' and '''processor/core''' structure is '''duplicated''' to give 2 cores.<br /> * The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: the '''L2 controller''' and the '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the cores and the snoop requests from the FSB. 
The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.<br /> * The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.<br /> * A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.<br /> <br /> This new '''partitioned structure''' for the memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. <br /> For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to <br /> [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> <br /> The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol; From using a '''single shared bus''' to '''dual independent buses (DIB)''' doubling the available bandwidth and to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, the snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. 
With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.<br /> <br /> <br /> ==MOESI==<br /> The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die. Cache coherence becomes a bigger problem on such multiprocessors, so an appropriate coherence protocol was needed to address it. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence. MESI has the drawback of consuming considerable time and bandwidth in certain situations. <br /> <br /> [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem that MESI faces when a processor holding an invalid copy of a line wants to access data that another processor has modified. Under MESI, the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing. When a processor holds the data in the new '''“Owned”''' state, it can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''"Owned"''' state remains responsible for updating main memory later, when the cache line is evicted.<br /> <br /> MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture. 
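The effect of dirty sharing on memory traffic, as described above, can be sketched by counting write-backs for one modified line that is read by a second processor. This is a minimal sketch under simplifying assumptions (one holder, one reader, one line); the class <code>Line</code> and its attribute names are hypothetical and do not correspond to any AMD interface.

```python
class Line:
    """One cache line held dirty by a 'holder' cache and read by a 'reader' cache."""

    def __init__(self, protocol):
        self.protocol = protocol  # "MESI" or "MOESI"
        self.holder = "M"         # the holder starts with a modified copy
        self.reader = "I"
        self.memory_writes = 0    # number of write-backs to main memory

    def remote_read(self):
        """The reader requests the line while the holder has it dirty."""
        if self.holder == "M":
            if self.protocol == "MOESI":
                self.holder = "O"        # dirty sharing: memory stays stale
            else:
                self.holder = "S"        # MESI: the dirty data is written back first
                self.memory_writes += 1
        self.reader = "S"

    def evict_holder(self):
        """The holder's copy is evicted; a dirty copy (M or O) must be written back."""
        if self.holder in ("M", "O"):
            self.memory_writes += 1
        self.holder = "I"


mesi, moesi = Line("MESI"), Line("MOESI")
mesi.remote_read()   # holder M -> S, one write-back at the time of the read
moesi.remote_read()  # holder M -> O, no write-back yet
```

In this sketch MOESI does not eliminate the write-back; it defers it until the owner's copy is evicted, which takes it off the critical path of the read.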
The AMD dual-core Opteron can maintain cache coherence in systems of up to 8 processors using this protocol.<br /> <br /> The five different states of the MOESI protocol are:<br /> * '''Modified (M)''' : The cache line holds the most recent copy of the data, and no other processor cache holds a copy.<br /> * '''Owned (O)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for writing the correct value back to main memory when the line is evicted. <br /> * '''Exclusive (E)''' : The cache line holds the most recent, correct copy of the data, which is present only in this processor's cache; main memory also holds a valid copy. <br /> * '''Shared (S)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. <br /> * '''Invalid (I)''' : The cache line does not hold a valid copy of the data.<br /> <br /> A detailed explanation of how this protocol is implemented on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core].<br /> <br /> The following table summarizes the MOESI protocol:<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Owned'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | out of date<br /> | valid<br /> | valid, unless another cache holds the line in the Owned state<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | Maybe (in the Shared state)<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The state transitions for MOESI are shown below: <br /> <br /> <br /> <center>[[Image:MOESI_State_Transition_Diagram.jpg]]</center><br /> <br /> <center> MOESI State Transition Diagram</center><br /> <br /><br /> <br /><br /> <br /> <br /> ==Dragon Protocol==<br /> The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols we have seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them. The Dragon Protocol does not update memory on a cache-to-cache transfer; it defers making memory consistent with the caches until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. It can dynamically detect the sharing status of a block and use a write-through policy for shared blocks and a write-back policy for currently non-shared blocks. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''. <br /> * '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. <br /> * '''Shared Modified (Sm)''' - Only one cache in the system can hold a given block in the Shared Modified state. 
Potentially two or more caches have this block, memory is not up to date, and this cache, which modified the block, is responsible for updating memory when the block is replaced.<br /> * '''Shared Clean (Sc)''' - Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).<br /> Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.<br /> <br /> <center>[[Image:Dragon.jpg]]</center><br /> <br /> The Dragon system implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses. The Dragon system was designed to support 4 to 8 Dragon processors. <br /> <br /> <br /> <br /> <br /> <br /> =Prefetching=<br /> <br /><br /> Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessors, however, prefetching can introduce coherence hazards. Because of prefetching, data can be modified in ways whose effects the memory coherence protocol cannot handle. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. <br /> <br /><br /> An example of this type of situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different from the physical-memory references for the data. Prefetching can therefore cause correctness problems. 
The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:<br /> # The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.<br /> # Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.<br /> # Data in virtual-page A is accessed.<br /> <br /> Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual-page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical-page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.<br /> <br /> To prevent this problem, software must execute a translation-invalidating instruction (INVLPG or MOV CR3 on AMD64) immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.<br /> <br /> More information about this can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual].<br /> <br /> =Optimization techniques on MOESI=<br /> <br /> In real machines, applying optimization techniques to the standard cache coherence protocol can improve the performance of the machine. 
For example [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0×10) which is AMD’s first generation to incorporate 4 distinct cores on a single die, and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. <br /> <br /> It focuses on a small subset of compute problems which behave like Producer and Consumer programs. In such a computing problem, a thread of a program running on a single core produces data, which is consumed by a thread that is running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to/from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.<br /> <br /> When the producer thread , writes a new entry, it allocates cache-lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Now, the producer thread circles the ring buffer to arrive back to the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). 
This will slow down the process.<br /> <br /> Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.<br /> <br /> You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> <br /> <br /> <br /> <br /> <br /> =References=<br /> # [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]<br /> # [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]<br /> # [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> # [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]<br /> # [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&%20MESI.pdf Cache consistency with MESI on Intel processor]<br /> # [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> # [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=4913 Silicon Graphics Computer Systems]<br /> # 
[http://books.google.com/books?id=g82fofiqa5IC&printsec=frontcover&dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&source=bl&ots=COrdamlfVn&sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&hl=en&ei=0ZO6S4TJGcOclgejzI3BBw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CAgQ6AEwAA#v=onepage&q=&f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]<br /> # [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]<br /> # [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> # [http://portal.acm.org/citation.cfm?id=6514Cache Coherence protocols: evaluation using a multiprocessor simulation model]<br /> # [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]<br /> # [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]<br /> # [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=289691 XDBus]</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch8_cl&diff=44380 CSC/ECE 506 Spring 2011/ch8 cl 2011-03-17T23:25:07Z

<p>Cslingaf: moved dragon protocol to correct location</p> <hr /> <div>=Introduction to bus-based cache coherence in real machines=<br /> <br /> <br /> ==SMP Protocol==<br /> Most parallel software in the commercial market relies on the shared-memory programming model in which all processors access the same physical address space. And the most common multiprocessors today use [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture which use a common bus as the interconnect. In the case of multicore processors ("chip multiprocessors," or CMP) the SMP architecture applies to the cores treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies. This is called '''''cache coherence problem'''''. It is critical to achieve correctness and performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.<br /> <br /> <center>[[Image:Busbased SMP.jpg]]</center><br /> <br /> <br /> At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. 
The different coherent states used by most of the cache coherent protocols are as shown in ''Table 1'':<br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''States'''<br /> | '''Access Type'''<br /> | '''Invariant'''<br /> |-<br /> | '''Modified'''<br /> | read, write<br /> | all other caches in I state<br /> |-<br /> | '''Exclusive'''<br /> | read<br /> | all other caches in I state<br /> |-<br /> | '''Owned'''<br /> | read<br /> | all other caches in I or S state<br /> |-<br /> | '''Shared'''<br /> | read<br /> | no other cache in M or E state<br /> |-<br /> | '''Invalid'''<br /> | -<br /> | -<br /> |}<br /> <br /> <br /> The first widely adopted approach to cache coherence is snooping on a bus. We will now discuss how some real time machines maintain cache coherence using '''''snooping based coherence protocols'''''. For more information on snooping based protocols refer to Solihin text book Chapter 8.<br /> <br /><br /> <br /><br /> <br /> <br /> <br /> =Snooping Protocols=<br /> ==MSI Protocol==<br /> <br /> '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' which is one of the earliest snooping-based cache coherence-protocols. It marks the cache line in '''Modified (M) ,Shared (S)''' and '''Invalid (I)''' state. '''Invalid''' means the cache line is either not present or is invalid state. If the cache line is clean and is shared by more than one processor , it is marked '''shared'''. If cache line is dirty and the processor has exclusive ownership of the cache line, it is present in '''Modified''' state. BusRdx causes others to invalidate (demote) to '''I''' state. If it is present in '''M''' state in another cache, it will flush. 
A BusRdx, even if it causes a cache hit in '''S''' state, is promoted to '''M''' (upgrade) state.<br /> <br /> The following state transition diagram for MSI protocol explains the working of the protocol:<br /> <br /> <center>[[Image:MSI.jpg]]</center><br /> <br /> <br /> <br /> ===Synapse protocol===<br /> From the state transition diagram of MSI, we observe that there is transition to state '''S''' from state '''M''' when a BusRd is observed for that block. The contents of the block is flushed to the bus before going to '''S''' state. It would look more appropriate to move to '''I''' state thus giving up the block entirely in certain cases. This choice of moving to '''S''' or '''I''' reflects the designer's assertion that the original processor is more likely to continue reading the block than the new processor to write to the block. In synapse protocol, used in the early Synapse multiprocessor, made this alternate choice of going directly from '''M''' state to '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in late 1980's [http://portal.acm.org/citation.cfm?id=6514Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> <br /> In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol which clearly shows its working.<br /> <br /> <center>[[Image:Synapse1.jpg]]</center><br /> <br /> ==MESI==<br /> MSI has a major drawback in that each read-write sequence incurs 2 bus transactions irrespective of whether the cache line is stored in only one cache or not. This is a huge setback for highly parallel programs that have little data sharing. 
'''[http://en.wikipedia.org/wiki/MESI_protocol MESI'''] protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.<br /> Let us briefly see how the MESI protocol works. For a more detailed version refer Solihin textbook pg. 215.<br /> <br /> MESI coherence protocol marks each cache line in of the Modified, Exclusive, Shared, or Invalid state. <br /> * '''Invalid''' : The cache line is either not present or is invalid<br /> * '''Exclusive''' : The cache line is clean and is owned by this core/processor only<br /> * '''Modified''' : This implies that the cache line is dirty and the core/processor has exclusive ownership of the cache line,exclusive of the memory also.<br /> * '''Shared''' : The cache line is clean and is shared by more than one core/processor<br /> <br /> In a nutshell, the MESI protocol works as follows: <br /> A line that is fetched, receives '''E''', or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in '''E''' or '''M'''-state prior to writing it, the cache sends a Bus Upgrade (BusUpgr) signal or as the Intel manuals term it, “Read-For-Ownership (RFO) request” that ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). 
A table is shown below to summarize the MESI protocol'''.<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''| No<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''| does not go to bus<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The transition diagram from the lecture slides is given below for reference.<br /> <br /> <center>[[Image:MESI.jpg]]</center> <br /><br /> <br /> The '''Pentium Pro''' microprocessor, introduced in 1992 was the '''first''' Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations. SMP and MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core micro-architecture in '''Intel's (Nehalem-EP) quad-core x86-64'''. The 45-nm Hi-k Intel Core microarchitecture utilizes a new system of framework called the '''QuickPath Interconnect''' which uses point-to-point interconnection technology based on distributed shared memory architecture. It uses a modified version of MESI protocol called '''MESIF''', by introducing an additional state, F, the forward state. <br /> <br /> The '''Intel architecture''' uses the MESI protocol as the '''basis''' to ensure cache coherence, which is true whether you're on one of the older processors that use a '''common bus''' to communicate or using the new Intel '''QuickPath''' point-to-point interconnection technology. 
<br /> <br /> Let us now walk through a briefing on the '''MESIF protocl''':<br /> <br /> The '''MESIF''' protocol, used in the latest Intel multi-core processors was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the the requesting processor only needs a single copy of the data, the system would be wasting the bandwidth. <br /> As a solution to this problem, an additional state, '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache line in the F state will respond to the request, while all the S state caches remain dormant. Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the F state transitions from F to S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.<br /> All M to S state transition and E to S state transitions will now be from '''M to F''' and '''E to F'''. 
<br /> The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. <br /> <br /> More information on the QuickPath Interconnect and MESIF protocol can be found at<br /> '''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''<br /> <br /> == CMP Implementation in Intel Architecture ==<br /> <br /> Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. <br /> <br /> <br /> '''Uniprocessor Architecture'''<br /> <br /> The diagram below shows the structure of the memory cluster in Intel Pentium M processor.<br /> <br /> <center>[[Image:intel_cache1.jpg]]</center> <br /><br /> <br /> <br /> In this structure we have,<br /> * A unified on-chip '''L1 cache''' with the '''processor/core''',<br /> * A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,<br /> * The second level '''L2 cache''' along with the '''prefetch unit''' and<br /> * '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent across.These wide buses bring in multiple data bytes at a time. <br /> <br /> As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''', were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. 
It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.<br /> <br /> <br /> '''CMP Architecture'''<br /> <br /> For the CMP implementation, Intel chose the bus-based architecture using snoopy protocols over the '''directory protocol''' because, although the directory protocol reduces active power due to reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, since battery life depends mainly on static power consumption and less on dynamic power.<br /> Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. <br /> The general CMP implementation structure of the Intel Core Duo is shown below.<br /> <br /> <center>[[Image:intel_cache2.jpg]]</center> <br /><br /> <br /> This structure has the following changes when compared to the uniprocessor memory cluster structure. <br /> * The '''L1 cache''' and '''processor/core''' structure is '''duplicated''' to give 2 cores.<br /> * The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. 
The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.<br /> * The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.<br /> * A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.<br /> <br /> This new '''partitioned structure''' for the memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. <br /> For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to <br /> [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> <br /> The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol; From using a '''single shared bus''' to '''dual independent buses (DIB)''' doubling the available bandwidth and to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, the snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. 
With the production of processors based on next generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects using MESIF protocol'''.<br /> <br /> <br /> ==MOESI==<br /> [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was the AMD’s first-generation dual core which had 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die. Cache coherence produces bigger problems on such multiprocessors. It was necessary to use an appropriate coherence protocol to address this problem. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], which was the competitive counterpart from Intel used the MESI protocol to handle cache coherence. MESI came with the drawback of using much time and bandwidth in certain situations. <br /> <br /> [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was the AMD’s answer to this problem. MOESI added a fifth state to MESI protocol called '''“Owned”''' . MOESI addresses the bandwidth problem faced in MESI protocol when processor having invalid data in its cache wants to modify the data. The processor seeking the data access will have to wait for the processor which modified this data to write back to the main memory, which takes time and bandwidth. This drawback is removed in MOESI by allowing dirty sharing. When the data is held by a processor in the new state '''“Owned”''', it can provide other processors the modified data without or even before writing it to the main memory. This is called '''''dirty sharing'''''. The processor with the data in '''"Owned"''' stays responsible to update the main memory later when the cache line is evicted.<br /> <br /> MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture. 
The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.<br /> <br /> The five different states of the MOESI protocol are:<br /> * '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.<br /> * '''Owned (O)''' : The cache line has the most recent correct copy of the data . This can be shared by other processors. The processor in this state for this cache line is responsible to update the correct value in the main memory before it gets evicted. <br /> * '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory. <br /> * '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. <br /> * '''Invalid (I)''' : A cache line does not hold a valid copy of the data.<br /> <br /> A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]<br /> <br /> The following table summarizes the MOESI protocol:<br /> <br /> <br /> {| class="wikitable" border="1"<br /> <br /> |-<br /> <br /> | '''Cache Line State:'''<br /> <br /> | '''Modified'''<br /> <br /> | '''Owner''' <br /> <br /> | '''Exclusive'''<br /> <br /> | '''Shared'''<br /> <br /> | '''Invalid'''<br /> <br /> |-<br /> <br /> | '''This cache line is valid?'''<br /> <br /> | Yes<br /> <br /> | Yes<br /> <br /> | Yes<br /> <br /> | Yes<br /> <br /> | No<br /> <br /> |-<br /> <br /> | '''The memory copy is…'''<br /> <br /> | out of date<br /> <br /> | out of date<br /> <br /> | valid<br /> <br /> | valid<br /> <br /> | -<br /> <br /> |-<br /> <br /> | '''Copies exist in caches of other processors?'''<br /> <br /> | No<br /> <br /> | No<br 
/> | Yes (out of date values)<br /> | Maybe<br /> <br /> | Maybe<br /> <br /> |-<br /> <br /> | '''A write to this line'''<br /> <br /> | does not go to bus<br /> | does not go to bus<br /> | does not go to bus<br /> <br /> | goes to bus and updates cache<br /> <br /> | goes directly to bus<br /> <br /> |}<br /> <br /> <br /> <br /> State transition for MOESI is as shown below : <br /> <br /> <br /> <center>[[Image:MOESI_State_Transition_Diagram.jpg]]</center><br /> <br /> <center> MOESI State transition Diagram</center><br /> <br /><br /> <br /><br /> <br /> <br /> ==Dragon Protocol==<br /> The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update based coherence protocol which does not invalidate other cached copies like what we have seen in the coherence protocols so far. Write propagation is achieved by updating the cached copies instead of invalidating them. But the Dragon Protocol does not update memory on a cache to cache transfer and delays the memory and cache consistency until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover only the written '''byte''' or the '''word''' is '''communicated''' to the '''other caches''' instead of the whole block which further '''reduces''' the '''bandwidth''' usage. It has the ability to detect dynamically, the sharing status of a block and use a write through policy for shared blocks and write back for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''. <br /> * '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. <br /> * '''Shared Modified (Sm)''' - Only one cache line in the system can be in the Shared Modified state. 
Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.<br /> * '''Shared Clean (Sc)''' - Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).<br /> Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.<br /> <br /> <center>[[Image:Dragon.jpg]]</center><br /> <br /> The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses. The Dragon system was designed to support 4 to 8 Dragon processors. <br /> <br /> <br /> <br /> <br /> <br /> =Prefetching=<br /> <br /><br /> Instruction prefetching is a technique used to speed up the execution of a program. But in multiprocessors, prefetching comes at the cost of performance. Due to prefetching, the data can be modified in such a way that the memory coherence protocol will not be able to handle the effects. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. <br /> <br /><br /> An example of this type of situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different from the physical-memory references for the data. Because of prefetching, there may be a problem with correctness. 
The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:<br /> # The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.<br /> # The page-table entry for virtual-page A is changed by the software in main memory to point to physical-page N rather than physical-page M.<br /> # Data in virtual-page A is accessed.<br /> <br /> Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual-page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical-page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.<br /> <br /> To prevent such errors, software must execute special instructions immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.<br /> <br /> More information about this can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> <br /> =Optimization techniques on MOESI=<br /> <br /> In real machines, applying optimization techniques to the standard cache coherence protocol can improve the performance of the machine. 
For example [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0×10) which is AMD’s first generation to incorporate 4 distinct cores on a single die, and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. <br /> <br /> It focuses on a small subset of compute problems which behave like Producer and Consumer programs. In such a computing problem, a thread of a program running on a single core produces data, which is consumed by a thread that is running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to/from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.<br /> <br /> When the producer thread , writes a new entry, it allocates cache-lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Now, the producer thread circles the ring buffer to arrive back to the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). 
This will slow down the process.<br /> <br /> Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.<br /> <br /> You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> <br /> <br /> <br /> <br /> <br /> =References=<br /> # [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]<br /> # [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]<br /> # [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> # [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]<br /> # [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&%20MESI.pdf Cache consistency with MESI on Intel processor]<br /> # [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> # [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=4913 Silicon Graphics Computer Systems]<br /> # 
[http://books.google.com/books?id=g82fofiqa5IC&printsec=frontcover&dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&source=bl&ots=COrdamlfVn&sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&hl=en&ei=0ZO6S4TJGcOclgejzI3BBw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CAgQ6AEwAA#v=onepage&q=&f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]<br /> # [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]<br /> # [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> # [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]<br /> # [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]<br /> # [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]<br /> # [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=289691 XDBus]</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch8_cl&diff=44379 CSC/ECE 506 Spring 2011/ch8 cl 2011-03-17T23:23:57Z

<p>Cslingaf: </p> <hr /> <div>=Introduction to bus-based cache coherence in real machines=<br /> <br /> <br /> ==SMP Protocol==<br /> Most parallel software in the commercial market relies on the shared-memory programming model in which all processors access the same physical address space. And the most common multiprocessors today use [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture which use a common bus as the interconnect. In the case of multicore processors ("chip multiprocessors," or CMP) the SMP architecture applies to the cores treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies. This is called '''''cache coherence problem'''''. It is critical to achieve correctness and performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.<br /> <br /> <center>[[Image:Busbased SMP.jpg]]</center><br /> <br /> <br /> At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. 
The different coherent states used by most of the cache coherent protocols are as shown in ''Table 1'':<br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''States'''<br /> | '''Access Type'''<br /> | '''Invariant'''<br /> |-<br /> | '''Modified'''<br /> | read, write<br /> | all other caches in I state<br /> |-<br /> | '''Exclusive'''<br /> | read<br /> | all other caches in I state<br /> |-<br /> | '''Owned'''<br /> | read<br /> | all other caches in I or S state<br /> |-<br /> | '''Shared'''<br /> | read<br /> | no other cache in M or E state<br /> |-<br /> | '''Invalid'''<br /> | -<br /> | -<br /> |}<br /> <br /> <br /> The first widely adopted approach to cache coherence is snooping on a bus. We will now discuss how some real time machines maintain cache coherence using '''''snooping based coherence protocols'''''. For more information on snooping based protocols refer to Solihin text book Chapter 8.<br /> <br /><br /> <br /><br /> <br /> <br /> <br /> =Snooping Protocols=<br /> ==MSI Protocol==<br /> <br /> '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' which is one of the earliest snooping-based cache coherence-protocols. It marks the cache line in '''Modified (M) ,Shared (S)''' and '''Invalid (I)''' state. '''Invalid''' means the cache line is either not present or is invalid state. If the cache line is clean and is shared by more than one processor , it is marked '''shared'''. If cache line is dirty and the processor has exclusive ownership of the cache line, it is present in '''Modified''' state. BusRdx causes others to invalidate (demote) to '''I''' state. If it is present in '''M''' state in another cache, it will flush. 
A BusRdx, even if it causes a cache hit in '''S''' state, is promoted to '''M''' (upgrade) state.<br /> <br /> The following state transition diagram for MSI protocol explains the working of the protocol:<br /> <br /> <center>[[Image:MSI.jpg]]</center><br /> <br /> <br /> <br /> ===Synapse protocol===<br /> From the state transition diagram of MSI, we observe that there is transition to state '''S''' from state '''M''' when a BusRd is observed for that block. The contents of the block is flushed to the bus before going to '''S''' state. It would look more appropriate to move to '''I''' state thus giving up the block entirely in certain cases. This choice of moving to '''S''' or '''I''' reflects the designer's assertion that the original processor is more likely to continue reading the block than the new processor to write to the block. In synapse protocol, used in the early Synapse multiprocessor, made this alternate choice of going directly from '''M''' state to '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in late 1980's [http://portal.acm.org/citation.cfm?id=6514Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> <br /> In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol which clearly shows its working.<br /> <br /> <center>[[Image:Synapse1.jpg]]</center><br /> <br /> ==MESI==<br /> MSI has a major drawback in that each read-write sequence incurs 2 bus transactions irrespective of whether the cache line is stored in only one cache or not. This is a huge setback for highly parallel programs that have little data sharing. 
'''[http://en.wikipedia.org/wiki/MESI_protocol MESI'''] protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.<br /> Let us briefly see how the MESI protocol works. For a more detailed version refer Solihin textbook pg. 215.<br /> <br /> MESI coherence protocol marks each cache line in of the Modified, Exclusive, Shared, or Invalid state. <br /> * '''Invalid''' : The cache line is either not present or is invalid<br /> * '''Exclusive''' : The cache line is clean and is owned by this core/processor only<br /> * '''Modified''' : This implies that the cache line is dirty and the core/processor has exclusive ownership of the cache line,exclusive of the memory also.<br /> * '''Shared''' : The cache line is clean and is shared by more than one core/processor<br /> <br /> In a nutshell, the MESI protocol works as follows: <br /> A line that is fetched, receives '''E''', or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in '''E''' or '''M'''-state prior to writing it, the cache sends a Bus Upgrade (BusUpgr) signal or as the Intel manuals term it, “Read-For-Ownership (RFO) request” that ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). 
The table below summarizes the MESI protocol.<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The transition diagram from the lecture slides is given below for reference.<br /> <br /> <center>[[Image:MESI.jpg]]</center> <br /><br /> <br /> The '''Pentium Pro''' microprocessor, introduced in 1995, was the '''first''' Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations. SMP with the MESI protocol was used consistently until the introduction of the 45-nm Hi-k Core micro-architecture in '''Intel's (Nehalem-EP) quad-core x86-64'''. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the '''QuickPath Interconnect''', which uses point-to-point interconnection technology based on a distributed shared memory architecture. It uses a modified version of the MESI protocol called '''MESIF''', which introduces an additional state, F, the forward state. <br /> <br /> The '''Intel architecture''' uses the MESI protocol as the '''basis''' to ensure cache coherence, whether on the older processors that use a '''common bus''' to communicate or on the new Intel '''QuickPath''' point-to-point interconnection technology. 
<br /> <br /> Let us now briefly walk through the '''MESIF protocol''':<br /> <br /> The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. <br /> As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache line in the F state will respond to the request, while all the S state caches remain dormant. Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding line transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.<br /> All M to S state transitions and E to S state transitions will now be from '''M to F''' and '''E to F'''. 
<br /> The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. <br /> <br /> More information on the QuickPath Interconnect and MESIF protocol can be found at<br /> '''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''<br /> <br /> == CMP Implementation in Intel Architecture ==<br /> <br /> Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. <br /> <br /> <br /> '''Uniprocessor Architecture'''<br /> <br /> The diagram below shows the structure of the memory cluster in Intel Pentium M processor.<br /> <br /> <center>[[Image:intel_cache1.jpg]]</center> <br /><br /> <br /> <br /> In this structure we have,<br /> * A unified on-chip '''L1 cache''' with the '''processor/core''',<br /> * A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,<br /> * The second level '''L2 cache''' along with the '''prefetch unit''' and<br /> * '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent across.These wide buses bring in multiple data bytes at a time. <br /> <br /> As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''', were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. 
It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and lets the operation continue only after it guarantees that no other version of the cache line exists in any other cache in the system.<br /> <br /> <br /> '''CMP Architecture'''<br /> <br /> For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over the '''directory protocol'''. Although a directory protocol reduces active power because there is less snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' because of its larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable: battery life depends mainly on static power consumption and less on dynamic power.<br /> Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. <br /> The general CMP implementation structure of the Intel Core Duo is shown below.<br /> <br /> <center>[[Image:intel_cache2.jpg]]</center> <br /><br /> <br /> This structure has the following changes when compared to the uniprocessor memory cluster structure. <br /> * The '''L1 cache''' and '''processor/core''' structure is '''duplicated''' to give 2 cores.<br /> * The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: the '''L2 controller''' and the '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the cores and the snoop requests from the FSB. 
The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.<br /> * The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.<br /> * A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.<br /> <br /> This new '''partitioned structure''' for the memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. <br /> For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to <br /> [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> <br /> The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol; From using a '''single shared bus''' to '''dual independent buses (DIB)''' doubling the available bandwidth and to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, the snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. 
With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.<br /> <br /> <br /> ==MOESI==<br /> The dual-core [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die. Cache coherence becomes a bigger problem on such multiprocessors, so an appropriate coherence protocol was needed. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence. MESI has the drawback of consuming considerable time and bandwidth in certain situations. <br /> <br /> [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem that MESI faces when a processor whose cached copy is invalid wants to access data that another processor has modified. Under MESI, the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing '''''dirty sharing''''': a processor that holds the data in the new '''“Owned”''' state can provide the modified data to other processors without, or even before, writing it to main memory. The processor holding the data in the '''“Owned”''' state remains responsible for updating main memory later, when the cache line is evicted.<br /> <br /> MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture. 
The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.<br /> <br /> The five different states of the MOESI protocol are:<br /> * '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.<br /> * '''Owned (O)''' : The cache line has the most recent correct copy of the data . This can be shared by other processors. The processor in this state for this cache line is responsible to update the correct value in the main memory before it gets evicted. <br /> * '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory. <br /> * '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. <br /> * '''Invalid (I)''' : A cache line does not hold a valid copy of the data.<br /> <br /> A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]<br /> <br /> The following table summarizes the MOESI protocol:<br /> <br /> <br /> {| class="wikitable" border="1"<br /> <br /> |-<br /> <br /> | '''Cache Line State:'''<br /> <br /> | '''Modified'''<br /> <br /> | '''Owner''' <br /> <br /> | '''Exclusive'''<br /> <br /> | '''Shared'''<br /> <br /> | '''Invalid'''<br /> <br /> |-<br /> <br /> | '''This cache line is valid?'''<br /> <br /> | Yes<br /> <br /> | Yes<br /> <br /> | Yes<br /> <br /> | Yes<br /> <br /> | No<br /> <br /> |-<br /> <br /> | '''The memory copy is…'''<br /> <br /> | out of date<br /> <br /> | out of date<br /> <br /> | valid<br /> <br /> | valid<br /> <br /> | -<br /> <br /> |-<br /> <br /> | '''Copies exist in caches of other processors?'''<br /> <br /> | No<br /> <br /> | No<br 
/> | Yes (out of date values)<br /> | Maybe<br /> <br /> | Maybe<br /> <br /> |-<br /> <br /> | '''A write to this line'''<br /> <br /> | does not go to bus<br /> | does not go to bus<br /> | does not go to bus<br /> <br /> | goes to bus and updates cache<br /> <br /> | goes directly to bus<br /> <br /> |}<br /> <br /> <br /> <br /> State transition for MOESI is as shown below : <br /> <br /> <br /> <center>[[Image:MOESI_State_Transition_Diagram.jpg]]</center><br /> <br /> <center> MOESI State transition Diagram</center><br /> <br /><br /> <br /><br /> <br /> <br /> =Prefetching=<br /> <br /><br /> Instruction prefetching is a technique used to speedup the execution of the program. But in multiprocessors, prefetching comes at the cost of performance. Due to prefetching, the data can be modified in such a way that the memory coherence protocol will not be able to handle the effects. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee subsequent data accesses are coherent. <br /> <br /><br /> An example of this type of a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different than the physical-memory references for the data. Because of prefetching there maybe problem with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:<br /> # The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache ae invalidated.<br /> # Page-table entry is changed by the software for virtual-page A in main memory to point to physical page N rather than physical-page M.<br /> # Data in virtual-page A is accessed.<br /> <br /> Software expects the processor to access the data from physical-page N after the update. 
However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual-page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical-page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.<br /> <br /> To prevent this problem, software must execute a TLB-invalidating instruction (on AMD64, INVLPG or a MOV to CR3) immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.<br /> <br /> More information about this can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> <br /> =Optimization techniques on MOESI=<br /> <br /> In real machines, applying optimization techniques to the standard cache coherence protocol can improve performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 10h), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimizations incorporated. <br /> <br /> One such optimization targets a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to/from main memory. 
The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can limit bandwidth for such programs. Keeping the cache line in the '''‘M’''' state for these workloads gives better performance.<br /> <br /> When the producer thread writes a new entry into the ring buffer it shares with the consumer, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Later, the producer thread circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the write.<br /> <br /> Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. 
Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.<br /> <br /> You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> <br /> <br /> <br /> ==Dragon Protocol==<br /> The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update based coherence protocol which does not invalidate other cached copies like what we have seen in the coherence protocols so far. Write propagation is achieved by updating the cached copies instead of invalidating them. But the Dragon Protocol does not update memory on a cache to cache transfer and delays the memory and cache consistency until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover only the written '''byte''' or the '''word''' is '''communicated''' to the '''other caches''' instead of the whole block which further '''reduces''' the '''bandwidth''' usage. It has the ability to detect dynamically, the sharing status of a block and use a write through policy for shared blocks and write back for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''. <br /> * '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. <br /> * '''Shared Modified (Sm)''' - Only one cache line in the system can be in the Shared Modified state. 
Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.<br /> * '''Shared Clean (Sc)''' - Potentially two or more caches have this block and memory may or may not be up to date (memory is up to date only if no other cache holds the block in the Sm state).<br /> A Shared Modified block is written back to main memory only when it is evicted from the cache on a cache miss, which keeps memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.<br /> <br /> <center>[[Image:Dragon.jpg]]</center><br /> <br /> The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses. The Dragon system was designed to support 4 to 8 Dragon processors. 
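<br /> <br /> The update-based behaviour of the four Dragon states can be sketched in Python as follows. This is an illustrative model of a single block only, not the Xerox Dragon hardware: the classes <code>DragonSystem</code> and <code>DragonCache</code>, the bus log, and the use of <code>None</code> for a block that is not cached are all assumptions made for the example.<br />

```python
class DragonSystem:
    """Hypothetical container: main memory, the caches, and a bus log for one block."""

    def __init__(self, memory_value=0):
        self.memory = memory_value
        self.caches = []
        self.bus_log = []

    def holders(self, skip):
        return [c for c in self.caches if c is not skip and c.state is not None]


class DragonCache:
    def __init__(self, system):
        self.system = system
        self.state = None  # None = block not cached; else "E", "Sc", "Sm" or "M"
        self.value = None
        system.caches.append(self)

    def _fetch(self):
        """Miss: BusRd. A cache holding the block dirty supplies it, else memory does."""
        system = self.system
        system.bus_log.append("BusRd")
        holders = system.holders(self)
        owner = next((c for c in holders if c.state in ("M", "Sm")), None)
        self.value = owner.value if owner else system.memory
        for c in holders:
            # A dirty holder keeps responsibility (Sm); memory is NOT updated here.
            c.state = "Sm" if c.state in ("M", "Sm") else "Sc"
        self.state = "Sc" if holders else "E"

    def read(self):
        if self.state is None:
            self._fetch()
        return self.value

    def write(self, value):
        if self.state is None:
            self._fetch()
        self.value = value
        if self.state in ("E", "M"):
            self.state = "M"  # sole copy: write-back behaviour, no bus traffic
            return
        # Shared block: broadcast the written word instead of invalidating.
        self.system.bus_log.append("BusUpd")
        holders = self.system.holders(self)
        for c in holders:
            c.value = value
            c.state = "Sc"  # at most one Sm copy: the latest writer
        self.state = "Sm" if holders else "M"

    def evict(self):
        if self.state in ("M", "Sm"):
            self.system.bus_log.append("Flush")
            self.system.memory = self.value  # memory is updated only now
        self.state = None
        self.value = None
```

In this sketch a write to a shared block leaves the other copies valid and updated, and main memory changes only when the dirty copy is evicted, which mirrors the delayed write-back described above.<br />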
<br /> <br /> <br /> <br /> =References=<br /> # [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]<br /> # [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]<br /> # [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> # [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]<br /> # [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&%20MESI.pdf Cache consistency with MESI on Intel processor]<br /> # [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> # [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=4913 Silicon Graphics Computer Systems]<br /> # [http://books.google.com/books?id=g82fofiqa5IC&printsec=frontcover&dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&source=bl&ots=COrdamlfVn&sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&hl=en&ei=0ZO6S4TJGcOclgejzI3BBw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CAgQ6AEwAA#v=onepage&q=&f=false Parallel computer architecture: a hardware/software approach By David E. 
Culler, Jaswinder Pal Singh, Anoop Gupta]<br /> # [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]<br /> # [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> # [http://portal.acm.org/citation.cfm?id=6514 Cache coherence protocols: evaluation using a multiprocessor simulation model]<br /> # [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]<br /> # [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]<br /> # [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=289691 XDBus]</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch8_cl&diff=44378 CSC/ECE 506 Spring 2011/ch8 cl 2011-03-17T23:22:30Z

<p>Cslingaf: fixed open paarinthesis spacing issues</p> <hr /> <div>=Introduction to bus-based cache coherence in real machines=<br /> <br /> <br /> ==SMP Protocol==<br /> Most parallel software in the commercial market relies on the shared-memory programming model in which all processors access the same physical address space. And the most common multiprocessors today use [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture which use a common bus as the interconnect. In the case of multicore processors ("chip multiprocessors," or CMP) the SMP architecture applies to the cores treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies. This is called '''''cache coherence problem'''''. It is critical to achieve correctness and performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.<br /> <br /> <center>[[Image:Busbased SMP.jpg]]</center><br /> <br /> <br /> At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. 
The different coherent states used by most of the cache coherent protocols are as shown in ''Table 1'':<br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''States'''<br /> | '''Access Type'''<br /> | '''Invariant'''<br /> |-<br /> | '''Modified'''<br /> | read, write<br /> | all other caches in I state<br /> |-<br /> | '''Exclusive'''<br /> | read<br /> | all other caches in I state<br /> |-<br /> | '''Owned'''<br /> | read<br /> | all other caches in I or S state<br /> |-<br /> | '''Shared'''<br /> | read<br /> | no other cache in M or E state<br /> |-<br /> | '''Invalid'''<br /> | -<br /> | -<br /> |}<br /> <br /> <br /> The first widely adopted approach to cache coherence is snooping on a bus. We will now discuss how some real time machines maintain cache coherence using '''''snooping based coherence protocols'''''. For more information on snooping based protocols refer to Solihin text book Chapter 8.<br /> <br /><br /> <br /><br /> <br /> <br /> <br /> =Snooping Protocols=<br /> ==MSI Protocol==<br /> <br /> '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' which is one of the earliest snooping-based cache coherence-protocols. It marks the cache line in '''Modified (M) ,Shared (S)''' and '''Invalid (I)''' state. '''Invalid''' means the cache line is either not present or is invalid state. If the cache line is clean and is shared by more than one processor , it is marked '''shared'''. If cache line is dirty and the processor has exclusive ownership of the cache line, it is present in '''Modified''' state. BusRdx causes others to invalidate (demote) to '''I''' state. If it is present in '''M''' state in another cache, it will flush. 
On a write that hits a line in the '''S''' state, the cache still issues a BusRdx and the line is promoted (upgraded) to the '''M''' state.<br /> <br /> The following state transition diagram for MSI protocol explains the working of the protocol:<br /> <br /> <center>[[Image:MSI.jpg]]</center><br /> <br /> <br /> ==SYNAPSE Multiprocessor==<br /> ===Synapse protocol and Synapse multiprocessor===<br /> From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before going to the '''S''' state. In certain cases it may be more appropriate to move to the '''I''' state instead, giving up the block entirely. The choice between '''S''' and '''I''' reflects the designer's expectation about whether the original processor is more likely to continue reading the block or the new processor is more likely to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> <br /> In the Synapse protocol the '''M''' state is called the '''D''' (Dirty) state. The following state transition diagram shows how the Synapse protocol works.<br /> <br /> <center>[[Image:Synapse1.jpg]]</center><br /> <br /> ==MESI==<br /> MSI has a major drawback in that each read-write sequence incurs 2 bus transactions irrespective of whether the cache line is stored in only one cache or not. 
This is a huge setback for highly parallel programs that have little data sharing. The '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.<br /> Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, pg. 215.<br /> <br /> The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. <br /> * '''Invalid''' : The cache line is either not present or is invalid<br /> * '''Exclusive''' : The cache line is clean and is owned by this core/processor only<br /> * '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is stale.<br /> * '''Shared''' : The cache line is clean and is shared by more than one core/processor<br /> <br /> In a nutshell, the MESI protocol works as follows: <br /> A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors' caches in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO)” request, which ensures that the line exists in this cache and is in the '''I''' state in all other processors on the bus (if any). 
The table below summarizes the MESI protocol.<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The transition diagram from the lecture slides is given below for reference.<br /> <br /> <center>[[Image:MESI.jpg]]</center> <br /><br /> <br /> The '''Pentium Pro''' microprocessor, introduced in 1995, was the '''first''' Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations. SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core micro-architecture in '''Intel's (Nehalem-EP) quad-core x86-64'''. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the '''QuickPath Interconnect''', a point-to-point interconnection technology based on a distributed shared memory architecture. It uses a modified version of the MESI protocol called '''MESIF''', which introduces an additional state, F, the forward state. <br /> <br /> The '''Intel architecture''' uses the MESI protocol as the '''basis''' for cache coherence, whether on the older processors that communicate over a '''common bus''' or on the new Intel '''QuickPath''' point-to-point interconnection technology. 
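<br /> <br /> The MESI behaviour summarized in the table above can be sketched as a small Python simulation of one cache block. It is a simplified illustration rather than Intel's implementation: the <code>Bus</code> and <code>Cache</code> classes are hypothetical, and evictions, data values and timing are ignored.<br />

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class Bus:
    """Toy snooping bus for a single cache block (hypothetical helper)."""

    def __init__(self):
        self.caches = []
        self.log = []  # bus transactions, in order

    def others(self, requester):
        return [c for c in self.caches if c is not requester]


class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.state = State.I
        bus.caches.append(self)

    def read(self):
        if self.state is not State.I:
            return  # read hit in M, E or S: no bus traffic
        self.bus.log.append("BusRd")
        holders = [c for c in self.bus.others(self) if c.state is not State.I]
        for c in holders:
            c.state = State.S  # an M copy is flushed first; E and S just share
        self.state = State.S if holders else State.E

    def write(self):
        if self.state in (State.S, State.I):
            # S needs an upgrade, I needs the data as well as ownership
            self.bus.log.append("BusUpgr" if self.state is State.S else "BusRdX")
            for c in self.bus.others(self):
                c.state = State.I
        self.state = State.M  # from E or M the write is silent
```

With a single cache, a read followed by a write puts only one transaction on the bus, because the line is loaded in E and upgraded to M silently. This is the saving over MSI that motivates the Exclusive state.<br />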
<br /> <br /> Let us now briefly walk through the '''MESIF protocol''':<br /> <br /> The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. Because the requesting processor needs only a single copy of the data, the extra responses waste bandwidth. <br /> As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the caches holding it in the S state remain dormant. Hence, by designating a single cache to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding copy transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.<br /> All M to S and E to S state transitions now become '''M to F''' and '''E to F''' transitions. 
<br /> The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. <br /> <br /> More information on the QuickPath Interconnect and MESIF protocol can be found at<br /> '''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''<br /> <br /> == CMP Implementation in Intel Architecture ==<br /> <br /> Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. <br /> <br /> <br /> '''Uniprocessor Architecture'''<br /> <br /> The diagram below shows the structure of the memory cluster in Intel Pentium M processor.<br /> <br /> <center>[[Image:intel_cache1.jpg]]</center> <br /><br /> <br /> <br /> In this structure we have,<br /> * A unified on-chip '''L1 cache''' with the '''processor/core''',<br /> * A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,<br /> * The second level '''L2 cache''' along with the '''prefetch unit''' and<br /> * '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent across.These wide buses bring in multiple data bytes at a time. <br /> <br /> As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''', were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. 
It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and lets the operation continue only after it guarantees that no other version of the cache line exists in any other cache in the system.<br /> <br /> <br /> '''CMP Architecture'''<br /> <br /> For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over the '''directory protocol'''. Although a directory protocol reduces active power because there is less snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' because of its larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable: battery life depends mainly on static power consumption and less on dynamic power.<br /> Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. <br /> The general CMP implementation structure of the Intel Core Duo is shown below.<br /> <br /> <center>[[Image:intel_cache2.jpg]]</center> <br /><br /> <br /> This structure has the following changes when compared to the uniprocessor memory cluster structure. <br /> * The '''L1 cache''' and '''processor/core''' structure is '''duplicated''' to give 2 cores.<br /> * The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: the '''L2 controller''' and the '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the cores and the snoop requests from the FSB. 
The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.<br /> * The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.<br /> * A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.<br /> <br /> This new '''partitioned structure''' for the memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. <br /> For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to <br /> [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> <br /> The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol; From using a '''single shared bus''' to '''dual independent buses (DIB)''' doubling the available bandwidth and to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, the snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. 
With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.<br /> <br /> <br /> ==MOESI==<br /> The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die. Cache coherence becomes a bigger problem on such multiprocessors, so an appropriate coherence protocol was needed. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence. MESI has the drawback of costing considerable time and bandwidth in certain situations. <br /> <br /> [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem that arises in MESI when a processor holding an invalid copy of a line wants to access data that another processor has modified. Under MESI, the requesting processor has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing. When a processor holds the data in the new '''“Owned”''' state, it can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''“Owned”''' state remains responsible for updating main memory later, when the cache line is evicted.<br /> <br /> MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture. 
The AMD dual-core Opteron can maintain cache coherence in systems of up to 8 processors using this protocol.<br /> <br /> The five states of the MOESI protocol are:<br /> * '''Modified (M)''' : The cache line holds the most recent copy of the data, and no other processor cache holds a copy. The copy in main memory is out of date.<br /> * '''Owned (O)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for updating main memory when the line is evicted. <br /> * '''Exclusive (E)''' : The cache line holds the most recent, correct copy of the data, which is present in this processor's cache only; main memory also holds a valid copy. <br /> * '''Shared (S)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. <br /> * '''Invalid (I)''' : The cache line does not hold a valid copy of the data.<br /> <br /> A detailed explanation of this protocol's implementation on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]<br /> <br /> The following table summarizes the MOESI protocol:<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Owned'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | out of date<br /> | valid<br /> | valid (unless another cache holds the line as Owned)<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | Maybe<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The state transitions for MOESI are shown below: <br /> <br /> <br /> <center>[[Image:MOESI_State_Transition_Diagram.jpg]]</center><br /> <br /> <center> MOESI state transition diagram</center><br /> <br /><br /> <br /><br /> <br /> <br /> =Prefetching=<br /> <br /><br /> Prefetching of instructions and data is a technique used to speed up the execution of a program. In multiprocessors, however, prefetching can come at a cost: memory can be modified in ways whose effects the coherence protocol cannot handle on its own. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. <br /> <br /><br /> An example of this type of situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different from the physical-memory references for the data, so prefetching can cause a correctness problem. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:<br /> # The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.<br /> # Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.<br /> # Data in virtual-page A is accessed.<br /> <br /> Software expects the processor to access the data from physical-page N after the update. 
However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual-page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical-page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.<br /> <br /> To prevent such errors, software must execute a translation-invalidating instruction (on AMD64, INVLPG or MOV CR3) immediately after the page-table update, which ensures that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.<br /> <br /> More information about this can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> <br /> =Optimization techniques on MOESI=<br /> <br /> In real machines, applying optimization techniques on top of the standard cache coherence protocol can improve performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 10h), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. <br /> <br /> The optimization described here focuses on a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two cores to communicate through the shared cache, to avoid round trips to and from main memory. 
The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth in this pattern. By keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.<br /> <br /> When the producer thread writes a new entry into a ring buffer shared with the consumer, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of that line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the cache line now marked '''‘O’''', it finds that it cannot, because a line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.<br /> <br /> Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In that case, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', and it can write to the line safely without coherence concerns. 
Thus, applying such optimization techniques to standard protocols can yield better performance when the protocols are implemented in real machines.<br /> <br /> More information on how this is implemented, and on various other optimizations, can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> <br /> <br /> <br /> ==Dragon Protocol==<br /> The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols we have seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them. The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the caches until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. The protocol detects the sharing status of a block dynamically and uses a write-through (update) policy for shared blocks and a write-back policy for blocks that are currently not shared. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''. <br /> * '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. <br /> * '''Shared Modified (Sm)''' - Only one cache in the system can hold a given block in the Shared Modified state. 
Potentially two or more caches have this block; main memory is not up to date, and this cache, which modified the block, is responsible for writing it back.<br /> * '''Shared Clean (Sc)''' - Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in Sm state, memory is up to date; otherwise it is not).<br /> Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.<br /> <br /> <center>[[Image:Dragon.jpg]]</center><br /> <br /> The Dragon system implemented snoopy caches that provided the appearance of a uniform memory space to multiple processors. Each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses. The Dragon system was designed to support 4 to 8 Dragon processors. 
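<br /> <br /> The update-based behaviour described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the Xerox implementation: the class <code>DragonBlock</code>, its methods, and the "not present" pseudo-state are hypothetical names introduced here, the model tracks a single block, and it updates whole values rather than individual words.

```python
from enum import Enum


class State(Enum):
    NOT_PRESENT = "-"          # hypothetical marker: block is not in this cache
    E = "Exclusive"
    SC = "Shared Clean"
    SM = "Shared Modified"
    M = "Modified"


class DragonBlock:
    """One memory block cached by n_caches processors on a snooping bus."""

    def __init__(self, n_caches, memory_value=0):
        self.state = [State.NOT_PRESENT] * n_caches
        self.data = [None] * n_caches
        self.memory = memory_value

    def _sharers(self, me):
        # Other caches that currently hold the block.
        return [c for c, s in enumerate(self.state)
                if c != me and s is not State.NOT_PRESENT]

    def _bus_read(self, me):
        """BusRd on a miss: a dirty owner (M or Sm) supplies the block, else memory."""
        value = self.memory
        sharers = self._sharers(me)
        for c in sharers:
            if self.state[c] in (State.M, State.SM):
                value = self.data[c]      # cache-to-cache transfer; memory stays stale
                self.state[c] = State.SM
            else:
                self.state[c] = State.SC  # E or Sc becomes Sc
        self.data[me] = value
        return bool(sharers)              # models the "shared" bus line

    def read(self, me):
        if self.state[me] is State.NOT_PRESENT:
            shared = self._bus_read(me)
            self.state[me] = State.SC if shared else State.E
        return self.data[me]

    def write(self, me, value):
        if self.state[me] is State.NOT_PRESENT:
            self._bus_read(me)            # write miss: fetch the block first
        sharers = self._sharers(me)
        for c in sharers:                 # BusUpd: update other copies, never invalidate
            self.data[c] = value
            self.state[c] = State.SC
        self.data[me] = value
        self.state[me] = State.SM if sharers else State.M

    def evict(self, me):
        if self.state[me] in (State.M, State.SM):
            self.memory = self.data[me]   # memory is updated only on eviction
        self.state[me] = State.NOT_PRESENT
        self.data[me] = None
```

In this sketch, a write by one cache while another holds the block leaves the writer in Sm and the other cache in Sc with the new value, and main memory changes only when the Sm (or M) copy is evicted.<br />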
<br /> <br /> <br /> <br /> =References=<br /> # [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]<br /> # [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]<br /> # [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> # [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]<br /> # [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&%20MESI.pdf Cache consistency with MESI on Intel processor]<br /> # [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> # [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=4913 Silicon Graphics Computer Systems]<br /> # [http://books.google.com/books?id=g82fofiqa5IC&printsec=frontcover&dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&source=bl&ots=COrdamlfVn&sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&hl=en&ei=0ZO6S4TJGcOclgejzI3BBw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CAgQ6AEwAA#v=onepage&q=&f=false Parallel computer architecture: a hardware/software approach By David E. 
Culler, Jaswinder Pal Singh, Anoop Gupta]<br /> # [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]<br /> # [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> # [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]<br /> # [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]<br /> # [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]<br /> # [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=289691 XDBus]</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch8_cl&diff=44377 CSC/ECE 506 Spring 2011/ch8 cl 2011-03-17T23:19:28Z

<p>Cslingaf: edits, removing extraneous information</p> <hr /> <div>=Introduction to bus-based cache coherence in real machines=<br /> <br /> <br /> ==SMP Protocol==<br /> Most parallel software in the commercial market relies on the shared-memory programming model in which all processors access the same physical address space. And the most common multiprocessors today use [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture which use a common bus as the interconnect. In the case of multicore processors ("chip multiprocessors," or CMP) the SMP architecture applies to the cores treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies. This is called '''''cache coherence problem'''''. It is critical to achieve correctness and performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.<br /> <br /> <center>[[Image:Busbased SMP.jpg]]</center><br /> <br /> <br /> At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. 
The different coherent states used by most of the cache coherent protocols are as shown in ''Table 1'':<br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''States'''<br /> | '''Access Type'''<br /> | '''Invariant'''<br /> |-<br /> | '''Modified'''<br /> | read, write<br /> | all other caches in I state<br /> |-<br /> | '''Exclusive'''<br /> | read<br /> | all other caches in I state<br /> |-<br /> | '''Owned'''<br /> | read<br /> | all other caches in I or S state<br /> |-<br /> | '''Shared'''<br /> | read<br /> | no other cache in M or E state<br /> |-<br /> | '''Invalid'''<br /> | -<br /> | -<br /> |}<br /> <br /> <br /> The first widely adopted approach to cache coherence is snooping on a bus. We will now discuss how some real time machines maintain cache coherence using '''''snooping based coherence protocols'''''. For more information on snooping based protocols refer to Solihin text book Chapter 8.<br /> <br /><br /> <br /><br /> <br /> <br /> <br /> =Snooping Protocols=<br /> ==MSI Protocol==<br /> <br /> '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' which is one of the earliest snooping-based cache coherence-protocols. It marks the cache line in '''Modified(M) ,Shared(S)''' and '''Invalid(I)''' state. '''Invalid''' means the cache line is either not present or is invalid state. If the cache line is clean and is shared by more than one processor , it is marked '''shared'''. If cache line is dirty and the processor has exclusive ownership of the cache line, it is present in '''Modified''' state. BusRdx causes others to invalidate (demote) to '''I''' state. If it is present in '''M''' state in another cache, it will flush. 
A write that hits a line in '''S''' state still issues a BusRdX, and the line is promoted (upgraded) to '''M''' state.<br /> <br /> The following state transition diagram for the MSI protocol explains the working of the protocol:<br /> <br /> <center>[[Image:MSI.jpg]]</center><br /> <br /> <br /> ==SYNAPSE Multiprocessor==<br /> ===Synapse protocol and Synapse multiprocessor===<br /> From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the block goes to '''S''' state. In certain cases it would be more appropriate to move to '''I''' state, giving up the block entirely. The choice between moving to '''S''' or to '''I''' reflects the designer's assertion about which is more likely: that the original processor continues reading the block, or that the new processor soon writes to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from '''M''' state to '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> <br /> In the Synapse protocol the '''M''' state is called the '''D''' (Dirty) state. The following is the state transition diagram for the Synapse protocol.<br /> <br /> <center>[[Image:Synapse1.jpg]]</center><br /> <br /> ==MESI==<br /> MSI has a major drawback in that each read-write sequence incurs 2 bus transactions irrespective of whether the cache line is stored in only one cache or not. 
This is a huge setback for highly parallel programs that have little data sharing. The '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.<br /> Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, pg. 215.<br /> <br /> The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. <br /> * '''Invalid''' : The cache line is either not present or is invalid<br /> * '''Exclusive''' : The cache line is clean and is owned by this core/processor only<br /> * '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of the cache line, exclusive of the memory also.<br /> * '''Shared''' : The cache line is clean and is shared by more than one core/processor<br /> <br /> In a nutshell, the MESI protocol works as follows: <br /> A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). 
The table below summarizes the MESI protocol.<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The transition diagram from the lecture slides is given below for reference.<br /> <br /> <center>[[Image:MESI.jpg]]</center> <br /><br /> <br /> The '''Pentium Pro''' microprocessor, introduced in 1995, was the '''first''' Intel architecture microprocessor to support symmetric multiprocessing ('''[http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP]''') in various multiprocessor configurations. SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core micro-architecture in '''Intel's (Nehalem-EP) quad-core x86-64'''. The 45-nm Hi-k Intel Core microarchitecture uses a new framework called the '''QuickPath Interconnect''', which uses point-to-point interconnection technology based on a distributed shared memory architecture. It uses a modified version of the MESI protocol called '''MESIF''', which introduces an additional state, F, the forward state. <br /> <br /> The '''Intel architecture''' uses the MESI protocol as the '''basis''' for ensuring cache coherence, whether on one of the older processors that use a '''common bus''' to communicate or on the newer Intel '''QuickPath''' point-to-point interconnection technology. 
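<br /> <br /> To make the MESI transitions summarized above concrete, here is a minimal Python sketch of the textbook state machine for a single cache line. The function names <code>processor_access</code> and <code>snoop</code> and the string encoding of states and bus transactions are hypothetical choices made for this illustration; real controllers add transient states and arbitration that are not modelled here.

```python
def processor_access(state, op, others_have_copy=False):
    """Local read ("PrRd") or write ("PrWr").

    Returns (next_state, bus_transaction); bus_transaction is None when
    the access is satisfied without using the bus.
    """
    if op == "PrRd":
        if state == "I":
            # Read miss: E if no other cache holds the line, otherwise S.
            return ("S" if others_have_copy else "E"), "BusRd"
        return state, None                 # read hit in M, E or S
    if op == "PrWr":
        if state == "I":
            return "M", "BusRdX"           # write miss: read for ownership
        if state == "S":
            return "M", "BusUpgr"          # must invalidate the other copies
        return "M", None                   # E or M: silent transition to M
    raise ValueError(f"unknown operation: {op}")


def snoop(state, bus_op):
    """Reaction to a bus transaction issued by another cache.

    Returns (next_state, flush); flush is True when this cache must put
    its dirty copy on the bus.
    """
    if state == "I":
        return "I", False
    flush = state == "M"
    if bus_op == "BusRd":
        return "S", flush
    if bus_op in ("BusRdX", "BusUpgr"):
        return "I", flush
    raise ValueError(f"unknown bus transaction: {bus_op}")
```

With this model, a read followed by a write to a line that no other cache holds goes I to E to M with a single bus transaction (the BusRd), whereas the same sequence on a shared line goes I to S to M and needs a second transaction (the BusUpgr). That difference is the saving that the Exclusive state brings over MSI.<br />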
<br /> <br /> Let us now briefly walk through the '''MESIF protocol''':<br /> <br /> The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. Since the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. <br /> As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the caches holding it in the S state remain dormant. Hence, by designating a single cache to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding copy transitions from F to S. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.<br /> All transitions that were M to S or E to S in MESI now become '''M to F''' and '''E to F'''. 
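<br /> <br /> The migration of the F state on a read can be sketched as follows. This is a simplified illustration, not Intel's implementation: the function <code>mesif_read</code> and the dictionary encoding are hypothetical, only clean lines (S, F and I copies) are modelled, and the case where no cache holds the line in F is assumed to be served by memory with the requester receiving the line in F.

```python
def mesif_read(states, requester):
    """Serve a read request for one clean cache line.

    states maps a cache id to "F", "S" or "I". Returns (new_states, source),
    where source is the id of the single responding cache, or "memory".
    """
    new_states = dict(states)
    forwarders = [c for c, s in states.items() if s == "F" and c != requester]
    if forwarders:
        source = forwarders[0]
        new_states[source] = "S"     # the older copy drops back to S
    else:
        source = "memory"            # assumption: S copies never respond
    new_states[requester] = "F"      # the newest copy becomes the forwarder
    return new_states, source
```

For example, with cache A in F, cache B in S and cache C in I, a read by C is answered by A alone; afterwards A and B are in S and C holds the line in F, so at most one cache ever responds to the next request.<br />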
<br /> The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. <br /> <br /> More information on the QuickPath Interconnect and MESIF protocol can be found at<br /> '''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''<br /> <br /> == CMP Implementation in Intel Architecture ==<br /> <br /> Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor(CMP) using the bus as the interconnect. <br /> <br /> <br /> '''Uniprocessor Architecture'''<br /> <br /> The diagram below shows the structure of the memory cluster in Intel Pentium M processor.<br /> <br /> <center>[[Image:intel_cache1.jpg]]</center> <br /><br /> <br /> <br /> In this structure we have,<br /> * A unified on-chip '''L1 cache''' with the '''processor/core''',<br /> * A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,<br /> * The second level '''L2 cache''' along with the '''prefetch unit''' and<br /> * '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent across.These wide buses bring in multiple data bytes at a time. <br /> <br /> As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''', were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. 
It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and on the internal buffers) in parallel. It also handles RFO requests (BusUpgr) and lets the operation continue only after it has guaranteed that no other copy of the cache line exists in any other cache in the system.<br /> <br /> <br /> '''CMP Architecture'''<br /> <br /> For its CMP implementation, Intel chose a bus-based architecture with snoopy protocols over a '''directory protocol''': although a directory protocol reduces active power because of reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' because of larger tag arrays. Because Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, since battery life depends mainly on static power consumption and less on dynamic power.<br /> Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. <br /> The general CMP implementation structure of the Intel Core Duo is shown below.<br /> <br /> <center>[[Image:intel_cache2.jpg]]</center> <br /><br /> <br /> This structure has the following changes when compared to the uniprocessor memory cluster structure. <br /> * The '''L1 cache''' and '''processor/core''' structure is '''duplicated''' to give 2 cores.<br /> * The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: the '''L2 controller''' and the '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. 
The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.<br /> * The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.<br /> * A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.<br /> <br /> This new '''partitioned structure''' for the memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. <br /> For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to <br /> [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> <br /> The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol; From using a '''single shared bus''' to '''dual independent buses (DIB)''' doubling the available bandwidth and to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, the snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. 
With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.<br /> <br /> <br /> ==MOESI==<br /> The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die. Cache coherence becomes a bigger problem on such multiprocessors, so an appropriate coherence protocol was needed. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence. MESI has the drawback of costing considerable time and bandwidth in certain situations. <br /> <br /> [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem that arises in MESI when a processor holding an invalid copy of a line wants to access data that another processor has modified. Under MESI, the requesting processor has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing. When a processor holds the data in the new '''“Owned”''' state, it can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''“Owned”''' state remains responsible for updating main memory later, when the cache line is evicted.<br /> <br /> MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture. 
The AMD dual-core Opteron can maintain cache coherence in systems of up to 8 processors using this protocol.<br /> <br /> The five states of the MOESI protocol are:<br /> * '''Modified (M)''' : The cache line holds the most recent copy of the data, and no other processor's cache holds a copy.<br /> * '''Owned (O)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for writing the correct value back to main memory when the line is evicted. <br /> * '''Exclusive (E)''' : The cache line holds the most recent, correct copy of the data, which is present only in this processor's cache and in main memory. <br /> * '''Shared (S)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. <br /> * '''Invalid (I)''' : The cache line does not hold a valid copy of the data.<br /> <br /> A detailed explanation of how this protocol is implemented on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]<br /> <br /> The following table summarizes the MOESI protocol:<br /> <br /> <br /> {| class="wikitable" border="1"<br /> <br /> |-<br /> <br /> | '''Cache Line State:'''<br /> <br /> | '''Modified'''<br /> <br /> | '''Owned''' <br /> <br /> | '''Exclusive'''<br /> <br /> | '''Shared'''<br /> <br /> | '''Invalid'''<br /> <br /> |-<br /> <br /> | '''This cache line is valid?'''<br /> <br /> | Yes<br /> <br /> | Yes<br /> <br /> | Yes<br /> <br /> | Yes<br /> <br /> | No<br /> <br /> |-<br /> <br /> | '''The memory copy is…'''<br /> <br /> | out of date<br /> <br /> | out of date<br /> <br /> | valid<br /> <br /> | valid<br /> <br /> | -<br /> <br /> |-<br /> <br /> | '''Copies exist in caches of other processors?'''<br /> <br /> | No<br /> <br /> | Maybe<br
/> | No<br /> | Maybe<br /> <br /> | Maybe<br /> <br /> |-<br /> <br /> | '''A write to this line'''<br /> <br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | does not go to bus<br /> <br /> | goes to bus and updates cache<br /> <br /> | goes directly to bus<br /> <br /> |}<br /> <br /> <br /> <br /> The state transitions for MOESI are shown below: <br /> <br /> <br /> <center>[[Image:MOESI_State_Transition_Diagram.jpg]]</center><br /> <br /> <center> MOESI State Transition Diagram</center><br /> <br /><br /> <br /><br /> <br /> <br /> =Prefetching=<br /> <br /><br /> Prefetching is a technique used to speed up the execution of a program. In multiprocessors, however, prefetching can introduce correctness hazards: because of prefetching, data can be modified in a way that the memory coherence protocol cannot handle. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. <br /> <br /><br /> An example of such a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different from the physical-memory references for the data, so prefetching can cause a correctness problem. The following sequence of events shows such a situation, in which software changes the translation of virtual-page A from physical-page M to physical-page N:<br /> # The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.<br /> # Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.<br /> # Data in virtual-page A is accessed.<br /> <br /> Software expects the processor to access the data from physical-page N after the update. 
However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual-page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical-page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.<br /> <br /> To prevent such errors, software must execute an instruction that invalidates the stale translation (on AMD64, for example, INVLPG or a MOV to CR3) immediately after the page-table update, so that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.<br /> <br /> More information can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> <br /> =Optimization techniques on MOESI=<br /> <br /> In real machines, applying optimization techniques to the standard cache coherence protocol improves the performance of the machine. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 10h), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. <br /> <br /> The optimization discussed here focuses on a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to have the two cores communicate through the shared cache, to avoid round trips to and from main memory. 
The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.<br /> <br /> Consider a producer and a consumer that communicate through a ring buffer. When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to this cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.<br /> <br /> Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In that case, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. 
Thus, applying such optimizations to standard protocols can yield better performance when the protocols are implemented in real machines.<br /> <br /> More information on how this is implemented, and on various other optimizations, can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> <br /> <br /> <br /> ==Dragon Protocol==<br /> The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols we have seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them. The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the caches until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. The protocol detects the sharing status of a block dynamically: writes to shared blocks are propagated to the other caches in a write-through-like fashion, while currently non-shared blocks are handled with a write-back policy. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''. <br /> * '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. <br /> * '''Shared Modified (Sm)''' - For a given block, at most one cache in the system can hold it in the Shared Modified state. 
Potentially two or more caches have this block, main memory is not up to date, and this processor's cache has modified the block; this cache is responsible for writing the block back to memory when it is replaced.<br /> * '''Shared Clean (Sc)''' - Potentially two or more caches have this block, and memory may or may not be up to date (memory is up to date if no other cache holds the block in the Sm state; otherwise it is not).<br /> Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory, which brings memory up to date. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.<br /> <br /> <center>[[Image:Dragon.jpg]]</center><br /> <br /> The Dragon system implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses. The Dragon system was designed to support 4 to 8 Dragon processors. 
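<br /> <br /> As a rough illustration of the state descriptions above, the following Python sketch tracks one memory block across several caches under Dragon-style rules. It is a simplified model under stated assumptions, not the Xerox Dragon implementation: the names <code>DragonBus</code>, <code>read</code>, <code>write</code> and <code>evict</code> are hypothetical, <code>None</code> is used as a stand-in for "block not cached", and every bus transaction is treated as atomic.<br /> <br />

```python
from enum import Enum


class State(Enum):
    E = "Exclusive"
    SC = "Shared Clean"
    SM = "Shared Modified"
    M = "Modified"


class DragonBus:
    """One memory block shared by several caches on a snooping bus.

    Hypothetical teaching model: None marks "block not cached"
    (Dragon itself has no explicit Invalid state).
    """

    def __init__(self, n_caches, initial=0):
        self.memory = initial
        self.state = [None] * n_caches
        self.value = [None] * n_caches

    def _others(self, me):
        # Indices of the other caches that currently hold the block.
        return [i for i, s in enumerate(self.state)
                if s is not None and i != me]

    def read(self, me):
        if self.state[me] is None:  # read miss: BusRd
            others = self._others(me)
            data = self.memory
            for i in others:
                if self.state[i] in (State.M, State.SM):
                    data = self.value[i]       # owner supplies the dirty data
                    self.state[i] = State.SM   # M -> Sm, Sm stays Sm
                elif self.state[i] is State.E:
                    self.state[i] = State.SC   # E -> Sc
            self.value[me] = data
            self.state[me] = State.SC if others else State.E
        return self.value[me]

    def write(self, me, new_value):
        if self.state[me] is None:
            self.read(me)  # write miss: fetch the block first
        others = self._others(me)
        self.value[me] = new_value
        if others:
            # BusUpd: other copies are updated, not invalidated,
            # and main memory is left untouched.
            for i in others:
                self.value[i] = new_value
                self.state[i] = State.SC
            self.state[me] = State.SM
        else:
            self.state[me] = State.M

    def evict(self, me):
        # Only M and Sm lines are written back to memory on eviction.
        if self.state[me] in (State.M, State.SM):
            self.memory = self.value[me]
        self.state[me] = None
        self.value[me] = None
```

In this sketch, a write to a block that other caches also hold updates their copies and leaves the writer in Sm, while <code>memory</code> changes only inside <code>evict</code>, which mirrors the delayed write-back described above.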
<br /> <br /> <br /> <br /> =References=<br /> # [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]<br /> # [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]<br /> # [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> # [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]<br /> # [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&%20MESI.pdf Cache consistency with MESI on Intel processor]<br /> # [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> # [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=4913 Silicon Graphics Computer Systems]<br /> # [http://books.google.com/books?id=g82fofiqa5IC&printsec=frontcover&dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&source=bl&ots=COrdamlfVn&sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&hl=en&ei=0ZO6S4TJGcOclgejzI3BBw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CAgQ6AEwAA#v=onepage&q=&f=false Parallel computer architecture: a hardware/software approach By David E. 
Culler, Jaswinder Pal Singh, Anoop Gupta]<br /> # [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]<br /> # [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> # [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]<br /> # [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]<br /> # [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]<br /> # [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=289691 XDBus]</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch8_cl&diff=44376 CSC/ECE 506 Spring 2011/ch8 cl 2011-03-17T23:08:31Z

<p>Cslingaf: </p> <hr /> <div>=Introduction to bus-based cache coherence in real machines=<br /> <br /> Most parallel software in the commercial market relies on the shared-memory<br /> programming model in which all processors access the same physical address space. And the most common multiprocessors today use [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture which use a common bus as the interconnect. In the case of multicore processors ("chip multiprocessors," or CMP) the SMP architecture applies to the cores treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies. This is called '''''cache coherence problem'''''. It is critical to achieve correctness and performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.<br /> <br /> <center>[[Image:Busbased SMP.jpg]]</center><br /> <br /> <br /> At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. 
The different coherent states used by most of the cache coherent protocols are as shown in ''Table 1'':<br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''States'''<br /> | '''Access Type'''<br /> | '''Invariant'''<br /> |-<br /> | '''Modified'''<br /> | read, write<br /> | all other caches in I state<br /> |-<br /> | '''Exclusive'''<br /> | read<br /> | all other caches in I state<br /> |-<br /> | '''Owned'''<br /> | read<br /> | all other caches in I or S state<br /> |-<br /> | '''Shared'''<br /> | read<br /> | no other cache in M or E state<br /> |-<br /> | '''Invalid'''<br /> | -<br /> | -<br /> |}<br /> <br /> <br /> The first widely adopted approach to cache coherence is snooping on a bus. We will now discuss how some real time machines maintain cache coherence using '''''snooping based coherence protocols'''''. For more information on snooping based protocols refer to Solihin text book Chapter 8.<br /> <br /><br /> <br /><br /> <br /> =Cache Coherence in real machines=<br /> <br /> <br /> ===MSI Protocol===<br /> <br /> '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' which is one of the earliest snooping-based cache coherence-protocols. It marks the cache line in '''Modified(M) ,Shared(S)''' and '''Invalid(I)''' state. '''Invalid''' means the cache line is either not present or is invalid state. If the cache line is clean and is shared by more than one processor , it is marked '''shared'''. If cache line is dirty and the processor has exclusive ownership of the cache line, it is present in '''Modified''' state. BusRdx causes others to invalidate (demote) to '''I''' state. If it is present in '''M''' state in another cache, it will flush. 
A BusRdx, even if it causes a cache hit in '''S''' state, is promoted to '''M''' (upgrade) state.<br /> <br /> The following state transition diagram for MSI protocol explains the working of the protocol:<br /> <br /> <center>[[Image:MSI.jpg]]</center><br /> <br /> MSI protocol was first used in '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' IRIS 4D series. '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' produced a broad range of '''[http://en.wikipedia.org/wiki/MIPS_architecture MIPS]'''-based(Microprocessor without Interlocked Pipeline Stages) workstations and servers during the 1990s, running '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''''s version of UNIX System V, now called '''[http://en.wikipedia.org/wiki/IRIX IRIX]'''. The 4D-MP graphics superworkstation brought 40 mips(million instructions per second) of computing performance to a graphics superworkstation. The unprecedented level of computing and graphics processing in an office-environment workstation was made possible by the fastest available Risc microprocessors in a single shared memory multiprocessor design driving a tightly coupled, highly parallel graphics system. Aggregate sustained data rates of over one gigabyte per second were achieved by a hierarchy of buses in a balanced system designed to avoid bottlenecks.<br /> <br /> The Multiprocessor bus used in 4D-MP graphics superworkstation is a pipelined, block transfer bus that supports the cache coherence protocol as well as providing 64 megabytes of sustained data bandwidth between the processors, the memory and I/O system, and the graphics subsystem. Because the sync bus provides for efficient synchronization between processors, the cache coherence protocol was designed to support efficient data sharing between processors. 
If a cache coherence protocol has to support synchronization as well as sharing, a compromise in the efficiency of the data sharing protocol may be necessary to improve the efficiency of the synchronization operations. Hence it uses the simple cache coherence protocol which is the MSI protocol.<br /> <br /> With the simple rules of MSI enforced by the hardware protocols of the sync bus and the Multiprocessor bus, efficient synchronization and efficient data sharing are achieved in a simple shared memory model of parallel processing in the 4D-MP graphics superworkstation.<br /> <br /> ==SYNAPSE Multiprocessor==<br /> ===Synapse protocol and Synapse multiprocessor===<br /> From the state transition diagram of MSI, we observe that there is transition to state '''S''' from state '''M''' when a BusRd is observed for that block. The contents of the block is flushed to the bus before going to '''S''' state. It would look more appropriate to move to '''I''' state thus giving up the block entirely in certain cases. This choice of moving to '''S''' or '''I''' reflects the designer's assertion that the original processor is more likely to continue reading the block than the new processor to write to the block. In synapse protocol, used in the early Synapse multiprocessor, made this alternate choice of going directly from '''M''' state to '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in late 1980's [http://portal.acm.org/citation.cfm?id=6514Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> <br /> In Synapse protocol '''M''' state is called '''D''' (Dirty) state. 
The following state transition diagram for the Synapse protocol shows how it works.<br /> <br /> <center>[[Image:Synapse1.jpg]]</center><br /> <br /> ===MESI & Intel Processors===<br /> MSI has a major drawback in that each read-write sequence incurs 2 bus transactions irrespective of whether the cache line is stored in only one cache or not. This is a huge setback for highly parallel programs that have little data sharing. The '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.<br /> Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, pg. 215.<br /> <br /> The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. <br /> * '''Invalid''' : The cache line is either not present or is invalid<br /> * '''Exclusive''' : The cache line is clean and is owned by this core/processor only<br /> * '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in main memory is stale.<br /> * '''Shared''' : The cache line is clean and is shared by more than one core/processor<br /> <br /> In a nutshell, the MESI protocol works as follows: <br /> A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). 
The table below summarizes the MESI protocol.<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The transition diagram from the lecture slides is given below for reference.<br /> <br /> <center>[[Image:MESI.jpg]]</center> <br /><br /> <br /> The '''Pentium Pro''' microprocessor, introduced in 1995, was the '''first''' Intel architecture microprocessor to support symmetric multiprocessing ('''[http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP]''') in various multiprocessor configurations. '''[http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP]''' with the '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in '''Intel's (Nehalem-EP) quad-core x86-64'''. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the '''QuickPath Interconnect''', which uses point-to-point interconnection technology based on a distributed shared-memory architecture. It uses a modified version of the MESI protocol called '''MESIF''', which introduces an additional state, F, the forward state. 
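<br /> <br /> The MESI rules summarized in the table above can be sketched as a small Python model that tracks the state of a single cache line in each cache and counts bus transactions. This is only an illustrative sketch under simplifying assumptions (one line, atomic bus transactions, no data values); the class name <code>MesiBus</code> and its methods are hypothetical and do not correspond to any Intel interface.<br /> <br />

```python
class MesiBus:
    """Tracks the MESI state of one cache line in each cache.

    Hypothetical teaching model: states are the strings
    "M", "E", "S" and "I"; data values are not modeled.
    """

    def __init__(self, n_caches):
        self.state = ["I"] * n_caches
        self.bus_transactions = 0

    def read(self, me):
        if self.state[me] != "I":
            return  # read hit in M, E or S: no bus traffic
        self.bus_transactions += 1  # BusRd
        shared = False
        for i, s in enumerate(self.state):
            if i != me and s != "I":
                shared = True
                self.state[i] = "S"  # M (after flushing) and E drop to S
        # E if no other cache has the line, S otherwise
        self.state[me] = "S" if shared else "E"

    def write(self, me):
        if self.state[me] in ("M", "E"):
            self.state[me] = "M"  # silent upgrade: does not go to bus
            return
        # From S or I the write goes to the bus (BusUpgr / RFO)
        self.bus_transactions += 1
        for i in range(len(self.state)):
            if i != me:
                self.state[i] = "I"
        self.state[me] = "M"
```

With this model, a read followed by a write to an unshared line costs a single bus transaction, because the line is loaded in '''E''' and upgraded to '''M''' silently, which is the saving over MSI that the Exclusive state provides.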
<br /> <br /> The '''Intel architecture''' uses the MESI protocol as the '''basis''' for ensuring cache coherence, whether on one of the older processors that use a '''common bus''' to communicate or on the new Intel '''QuickPath''' point-to-point interconnection technology. <br /> <br /> Let us now briefly walk through the '''MESIF protocol''':<br /> <br /> The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be '''wasting bandwidth'''. <br /> As a '''solution''' to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the caches holding it in the S state remain dormant. Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding copy transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. 
The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.<br /> All M to S state transition and E to S state transitions will now be from '''M to F''' and '''E to F'''. <br /> The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a '''unique''' copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. <br /> <br /> More information on the QuickPath Interconnect and MESIF protocol can be found at<br /> '''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''<br /> <br /> === CMP Implementation in Intel Architecture ===<br /> <br /> Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor(CMP) using the bus as the interconnect. <br /> <br /> <br /> '''Uniprocessor Architecture'''<br /> <br /> The diagram below shows the structure of the memory cluster in Intel Pentium M processor.<br /> <br /> <center>[[Image:intel_cache1.jpg]]</center> <br /><br /> <br /> <br /> In this structure we have,<br /> * A unified on-chip '''L1 cache''' with the '''processor/core''',<br /> * A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,<br /> * The second level '''L2 cache''' along with the '''prefetch unit''' and<br /> * '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent across.These wide buses bring in multiple data bytes at a time. 
<br /> <br /> As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''', were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version on the cache line exists in any other cache in the system.<br /> <br /> <br /> '''CMP Architecture'''<br /> <br /> For CMP implementation, Intel chose the bus-based architecture using snoopy protocols vs the '''directory protocol''' because though directory protocol reduces the active power due to reduced snoop activity, it '''increased''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for the processors in the mobility family, directory-based solution was less favorable since battery life mainly depends on static power consumption and less on dynamic power.<br /> Let us examine how '''CMP''' was implemented in '''Intel Core Duo''', which was one of the first dual-core processor for the budget/entry-level market. <br /> The general CMP implementation structure of the Intel Core Duo is shown below<br /> <br /> <center>[[Image:intel_cache2.jpg]]</center> <br /><br /> <br /> This structure has the following changes when compared to the uniprocessor memory cluster structure. <br /> * '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.<br /> * The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. 
The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.<br /> * The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.<br /> * A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.<br /> <br /> This new '''partitioned structure''' for the memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. <br /> For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to <br /> [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> <br /> The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol; From using a '''single shared bus''' to '''dual independent buses (DIB)''' doubling the available bandwidth and to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, the snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. 
With the production of processors based on next generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects using MESIF protocol'''.<br /> <br /> ==AMD - Advanced Micro Devices Processors==<br /> <br /> <br /> ===MOESI & AMD Processors===<br /> <br /><br /> [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was the AMD’s first-generation dual core which had 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die. Cache coherence produces bigger problems on such multiprocessors. It was necessary to use an appropriate coherence protocol to address this problem. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], which was the competitive counterpart from Intel used the MESI protocol to handle cache coherence. MESI came with the drawback of using much time and bandwidth in certain situations. <br /> <br /> [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was the AMD’s answer to this problem. MOESI added a fifth state to MESI protocol called '''“Owned”''' . MOESI addresses the bandwidth problem faced in MESI protocol when processor having invalid data in its cache wants to modify the data. The processor seeking the data access will have to wait for the processor which modified this data to write back to the main memory, which takes time and bandwidth. This drawback is removed in MOESI by allowing dirty sharing. When the data is held by a processor in the new state '''“Owned”''', it can provide other processors the modified data without or even before writing it to the main memory. This is called '''''dirty sharing'''''. 
The processor holding the data in the '''"Owned"''' state remains responsible for updating main memory later, when the cache line is evicted.<br /> <br /> MOESI has become one of the most popular snoop-based protocols and is supported in the AMD64 architecture. The AMD dual-core Opteron can maintain cache coherence in systems of up to 8 processors using this protocol.<br /> <br /> The five states of the MOESI protocol are:<br /> * '''Modified (M)''' : The cache line holds the most recent copy of the data, and no other processor cache holds a copy.<br /> * '''Owned (O)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for writing the correct value back to main memory when the line is evicted. <br /> * '''Exclusive (E)''' : The cache line holds the most recent, correct copy of the data, which is present only in this processor's cache and in main memory. <br /> * '''Shared (S)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. 
<br /> * '''Invalid (I)''' : The cache line does not hold a valid copy of the data.<br /> <br /> A detailed explanation of how this protocol is implemented on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core].<br /> <br /> The following table summarizes the MOESI protocol:<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Owned'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | out of date<br /> | valid<br /> | valid, unless another cache holds the line in Owned<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | Maybe (in Shared state)<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | goes to bus to invalidate other copies<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The state transitions for MOESI are shown below:<br /> <br /> <br /> <center>[[Image:MOESI_State_Transition_Diagram.jpg]]</center><br /> <br /> <center> MOESI state transition diagram</center><br /> <br /><br /> <br /><br /> <br /> ===AMD Opteron memory Architecture===<br /> <br /><br /> <br /> <center>[[Image:Untitled.jpg]]</center><br /> <br /> <br /> <br /> The AMD processor’s high-performance cache architecture includes an integrated, 64-bit, dual-ported 128-Kbyte split-L1 cache with a separate snoop port, multi-level translation lookaside buffers (TLBs), a scalable L2 
cache controller with a 72-bit (64-bit data + 8-bit ECC) interface to as much as 8 Mbytes of industry-standard SDR or DDR SRAMs, and an integrated tag for the most cost-effective 512-Kbyte L2 configurations. The AMD Athlon processor’s integrated L1 cache comprises two separate 64-Kbyte, two-way set-associative data and instruction caches.<br /> <br /> More information can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's Manual].<br /> <br /> ===Special Coherence Considerations in AMD64 architectures===<br /> <br /><br /> Instruction prefetching is a technique used to speed up program execution. However, prefetching can also create coherence hazards: because of prefetching, data can be modified in a way that the memory coherence protocol cannot handle. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. <br /> <br /><br /> An example of such a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different from the physical-memory references for the data, so prefetching may cause a correctness problem. The following sequence of events shows such a situation, in which software changes the translation of virtual-page A from physical-page M to physical-page N:<br /> # The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.<br /> # Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.<br /> # Data in virtual-page A is accessed.<br /> <br /> Software expects the processor to access the data from physical-page N after the update. 
However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual-page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical-page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.<br /> <br /> To prevent such errors, software must execute an instruction such as INVLPG or MOV CR3 immediately after the page-table update, to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation before the table update.<br /> <br /> More information can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual].<br /> <br /> ===Optimization techniques on MOESI when implemented on AMD Phenom processors===<br /> <br /><br /> <br /> In real machines, applying optimization techniques to the standard cache coherence protocol can improve performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 10h), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. <br /> <br /> The optimization targets a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. 
With such programs, it is desirable to have the two cores communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.<br /> <br /> When the producer thread writes a new entry (for example, into a ring buffer), it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the cache line now marked '''‘O’''', it finds that it cannot, since a line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request in the MOESI protocol. To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.<br /> <br /> Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In that case, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. 
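The ring-buffer scenario above can be sketched with a toy model. The Python below is only an illustration under simplifying assumptions: it tracks a single cache line, all names (Line, producer_write, consumer_read, keep_modified, run) are hypothetical, and a simple probe counter stands in for the coherence traffic. It does not describe how the Phenom hardware or AMD's optimization guide actually implements the technique.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One cache line as seen by the shared L3 (toy model, not real hardware)."""
    state: str = "I"   # one of the MOESI letters
    probes: int = 0    # coherence probes/misses counted so far


def producer_write(line: Line) -> None:
    """Producer stores to the line; only the M state allows a silent write."""
    if line.state != "M":
        # From O (or S, or I) the write needs bus activity first:
        # other copies must be probed/invalidated, or the line allocated.
        line.probes += 1
    line.state = "M"


def consumer_read(line: Line, keep_modified: bool) -> None:
    """Consumer loads the line.

    keep_modified=False models the plain MOESI behaviour from the text
    (M -> O in L3, the consumer takes an S copy).
    keep_modified=True is a hypothetical stand-in for the optimization
    that leaves the L3 copy in M.
    """
    if line.state == "M" and not keep_modified:
        line.state = "O"


def run(laps: int, keep_modified: bool) -> int:
    """Produce and consume the same ring-buffer slot `laps` times."""
    line = Line()
    for _ in range(laps):
        producer_write(line)
        consumer_read(line, keep_modified)
    return line.probes


if __name__ == "__main__":
    print("plain MOESI probes:", run(4, keep_modified=False))
    print("line kept in M    :", run(4, keep_modified=True))
```

In this sketch the plain MOESI variant pays for a probe on every lap around the ring buffer, while the variant that keeps the line in M pays only for the initial allocation.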
Such optimizations to standard protocols can thus yield better performance when implemented in real machines.<br /> <br /> More information on how this is implemented, and on various other optimizations, can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors].<br /> <br /> <br /> <br /> ===Dragon Protocol===<br /> The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols we have seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them. However, the Dragon Protocol does not update memory on a cache-to-cache transfer; it delays bringing memory and cache into agreement until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. The protocol can dynamically detect the sharing status of a block, using a write-through policy for shared blocks and write-back for currently non-shared blocks. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''. <br /> * '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. <br /> * '''Shared Modified (Sm)''' - Only one cache in the system can hold a given block in the Shared Modified state. 
Potentially two or more caches hold this block, memory may or may not be up to date, and this processor's cache has modified the block.<br /> * '''Shared Clean (Sc)''' - Potentially two or more caches hold this block, and memory may or may not be up to date (if no other cache holds it in the Sm state, memory is up to date; otherwise it is not).<br /> Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory, in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.<br /> <br /> <center>[[Image:Dragon.jpg]]</center><br /> <br /> <br /> The Dragon Protocol was developed by the Xerox Palo Alto Research Center ('''[http://en.wikipedia.org/wiki/Xerox_PARC Xerox PARC]'''), a subsidiary of Xerox Corporation. The protocol was used in the '''[http://en.wikipedia.org/w/index.php?title=Xerox_Dragon&redirect=no Xerox PARC Dragon]''' multiprocessor workstation, a VLSI research computer that could support multiple processors on a central high-bandwidth memory bus. The Dragon design implemented snoopy caches that provided the appearance of a uniform memory space to multiple processors. Each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses. The Dragon system was designed to support 4 to 8 Dragon processors. The memory bus used in the Xerox Dragon evolved to become the '''[http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=289691 XDBus]''', a low-cost, synchronous, '''packet-switched VLSI bus''' designed for use in high-performance multiprocessors. 
This was used as the interconnect in many multiprocessor server systems like '''Cray Superserver 6400''', '''Sun Microsystems' SPARCcenter 2000''' and '''SPARCserver 1000''' and '''Sun4d''' systems.<br /> <br /> =References=<br /> # [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]<br /> # [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]<br /> # [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> # [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]<br /> # [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&%20MESI.pdf Cache consistency with MESI on Intel processor]<br /> # [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> # [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=4913 Silicon Graphics Computer Systems]<br /> # [http://books.google.com/books?id=g82fofiqa5IC&printsec=frontcover&dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&source=bl&ots=COrdamlfVn&sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&hl=en&ei=0ZO6S4TJGcOclgejzI3BBw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CAgQ6AEwAA#v=onepage&q=&f=false Parallel computer architecture: a hardware/software approach By David E. 
Culler, Jaswinder Pal Singh, Anoop Gupta]<br /> # [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]<br /> # [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> # [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]<br /> # [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]<br /> # [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]<br /> # [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=289691 XDBus]</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch8_cl&diff=44375 CSC/ECE 506 Spring 2011/ch8 cl 2011-03-17T22:52:44Z

<p>Cslingaf: original import</p> <hr /> <div>=Introduction to bus-based cache coherence in real machines=<br /> <br /> Most parallel software in the commercial market relies on the shared-memory<br /> programming model in which all processors access the same physical address space. And the most common multiprocessors today use [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture which use a common bus as the interconnect. In the case of multicore processors ("chip multiprocessors," or CMP) the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture applies to the cores treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies. This is called '''''cache coherence problem'''''. It is critical to achieve correctness and performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.<br /> <br /> <center>[[Image:Busbased SMP.jpg]]</center><br /> <br /> <br /> At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. 
The different coherent states used by most of the cache coherent protocols are as shown in ''Table 1'':<br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''States'''<br /> | '''Access Type'''<br /> | '''Invariant'''<br /> |-<br /> | '''Modified'''<br /> | read, write<br /> | all other caches in I state<br /> |-<br /> | '''Exclusive'''<br /> | read<br /> | all other caches in I state<br /> |-<br /> | '''Owned'''<br /> | read<br /> | all other caches in I or S state<br /> |-<br /> | '''Shared'''<br /> | read<br /> | no other cache in M or E state<br /> |-<br /> | '''Invalid'''<br /> | -<br /> | -<br /> |}<br /> <br /> <br /> The first widely adopted approach to cache coherence is snooping on a bus. We will now discuss how some real time machines by '''Intel''' , '''AMD''' and other processors maintain cache coherence using '''''snooping based coherence protocols'''''. For more information on snooping based protocols refer to Solihin text book Chapter 8.<br /> <br /><br /> <br /><br /> <br /> =Cache Coherence in real machines=<br /> <br /> ==SGI - Silicon Graphics, Inc==<br /> ===MSI & SGI IRIS 4D Processors===<br /> <br /> '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' which is one of the earliest snooping-based cache coherence-protocols. It marks the cache line in '''Modified(M) ,Shared(S)''' and '''Invalid(I)''' state. '''Invalid''' means the cache line is either not present or is invalid state. If the cache line is clean and is shared by more than one processor , it is marked '''shared'''. If cache line is dirty and the processor has exclusive ownership of the cache line, it is present in '''Modified''' state. BusRdx causes others to invalidate (demote) to '''I''' state. If it is present in '''M''' state in another cache, it will flush. 
BusRdx even if hit in '''S''' state, it is promoted to '''M''' (upgrade) state.<br /> <br /> The following state transition diagram for MSI protocol explains the working of the protocol:<br /> <br /> <center>[[Image:MSI.jpg]]</center><br /> <br /> '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol was first used in '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' IRIS 4D series. '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' produced a broad range of '''[http://en.wikipedia.org/wiki/MIPS_architecture MIPS]'''-based(Microprocessor without Interlocked Pipeline Stages) workstations and servers during the 1990s, running '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''''s version of UNIX System V, now called '''[http://en.wikipedia.org/wiki/IRIX IRIX]'''. The 4D-MP graphics superworkstation brought 40 mips(million instructions per second) of computing performance to a graphics superworkstation. The unprecedented level of computing and graphics processing in an office-environment workstation was made possible by the fastest available Risc microprocessors in a single shared memory multiprocessor design driving a tightly coupled, highly parallel graphics system. Aggregate sustained data rates of over one gigabyte per second were achieved by a hierarchy of buses in a balanced system designed to avoid bottlenecks.<br /> <br /> The Multiprocessor bus used in 4D-MP graphics superworkstation is a pipelined, block transfer bus that supports the cache coherence protocol as well as providing 64 megabytes of sustained data bandwidth between the processors, the memory and I/O system, and the graphics subsystem. Because the sync bus provides for efficient synchronization between processors, the cache coherence protocol was designed to support efficient data sharing between processors. 
If a cache coherence protocol has to support synchronization as well as sharing, a compromise in the efficiency of the data sharing protocol may be necessary to improve the efficiency of the synchronization operations. Hence it uses the simple cache coherence protocol which is the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol.<br /> <br /> With the simple rules of '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' enforced by the hardware protocols of<br /> the sync bus and the Multiprocessor bus, efficient synchronization and efficient data sharing are achieved in a simple shared memory model of parallel processing in the 4D-MP graphics superworkstation.<br /> <br /> ==SYNAPSE Multiprocessor==<br /> ===Synapse protocol and Synapse multiprocessor===<br /> From the state transition diagram of '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''', we observe that for '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' there is transition to state '''S''' from state '''M''' when a BusRd is observed for that block. The contents of the block is flushed to the bus before going to '''S''' state. It would look more appropriate to move to '''I''' state thus giving up the block entirely in certain cases. This choice of moving to '''S''' or '''I''' reflects the designer's assertion that the original processor is more likely to continue reading the block than the new processor to write to the block. In synapse protocol, used in the early Synapse multiprocessor, made this alternate choice of going directly from '''M''' state to '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. 
More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> <br /> In the Synapse protocol, the '''M''' state is called the '''D''' (Dirty) state. The following state transition diagram shows how the Synapse protocol works.<br /> <br /> <center>[[Image:Synapse1.jpg]]</center><br /> <br /> ==Intel==<br /> ===MESI & Intel Processors===<br /> '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' has a major drawback in that each read-write sequence incurs 2 bus transactions, irrespective of whether the cache line is stored in only one cache or not. This is a huge setback for highly parallel programs that have little data sharing. The '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.<br /> Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, pg. 215.<br /> <br /> The '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. 
<br /> * '''Invalid''' : The cache line is either not present or is invalid<br /> * '''Exclusive''' : The cache line is clean and is owned by this core/processor only<br /> * '''Modified''' : This implies that the cache line is dirty and the core/processor has exclusive ownership of the cache line,exclusive of the memory also.<br /> * '''Shared''' : The cache line is clean and is shared by more than one core/processor<br /> <br /> In a nutshell, the '''[http://en.wikipedia.org/wiki/MESI_protocol MESI'''] protocol works as follows: <br /> A line that is fetched, receives '''E''', or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in '''E''' or '''M'''-state prior to writing it, the cache sends a Bus Upgrade(BusUpgr) signal or as the Intel manuals term it, “Read-For-Ownership (RFO) request” that ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). 
The table below summarizes the '''[http://en.wikipedia.org/wiki/MESI_protocol MESI] protocol'''.<br /> <br /> <br /> {| class="wikitable" border="1"<br /> |-<br /> | '''Cache Line State:'''<br /> | '''Modified'''<br /> | '''Exclusive'''<br /> | '''Shared'''<br /> | '''Invalid'''<br /> |-<br /> | '''This cache line is valid?'''<br /> | Yes<br /> | Yes<br /> | Yes<br /> | No<br /> |-<br /> | '''The memory copy is…'''<br /> | out of date<br /> | valid<br /> | valid<br /> | -<br /> |-<br /> | '''Copies exist in caches of other processors?'''<br /> | No<br /> | No<br /> | Maybe<br /> | Maybe<br /> |-<br /> | '''A write to this line'''<br /> | does not go to bus<br /> | does not go to bus<br /> | goes to bus and updates cache<br /> | goes directly to bus<br /> |}<br /> <br /> <br /> <br /> The transition diagram from the lecture slides is given below for reference.<br /> <br /> <center>[[Image:MESI.jpg]]</center> <br /><br /> <br /> The '''Pentium Pro''' microprocessor, introduced in 1995, was the '''first''' Intel architecture microprocessor to support symmetric multiprocessing ('''[http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP]''') in various multiprocessor configurations. '''[http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP]''' with the '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in '''Intel's (Nehalem-EP) quad-core x86-64'''. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the '''QuickPath Interconnect''', which uses point-to-point interconnection technology based on a distributed shared memory architecture. It uses a modified version of the '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' protocol called '''MESIF''', which introduces an additional state, F, the forward state. 
<br /> <br /> The '''Intel architecture''' uses the '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' protocol as the '''basis''' for ensuring cache coherence, whether on one of the older processors that use a '''common bus''' to communicate or on the new Intel '''QuickPath''' point-to-point interconnection technology. <br /> <br /> Let us now briefly walk through the '''MESIF protocol''':<br /> <br /> The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be '''wasting bandwidth'''. <br /> As a '''solution''' to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the caches holding it in the S state remain dormant. Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding copy transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. 
The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.<br /> All M to S state transition and E to S state transitions will now be from '''M to F''' and '''E to F'''. <br /> The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a '''unique''' copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. <br /> <br /> More information on the QuickPath Interconnect and MESIF protocol can be found at<br /> '''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''<br /> <br /> === CMP Implementation in Intel Architecture ===<br /> <br /> Let us now see how Intel architecture using the [http://en.wikipedia.org/wiki/MESI_protocol MESI] protocol progressed from a uniprocessor architecture to a Chip MultiProcessor(CMP) using the bus as the interconnect. <br /> <br /> <br /> '''Uniprocessor Architecture'''<br /> <br /> The diagram below shows the structure of the memory cluster in Intel Pentium M processor.<br /> <br /> <center>[[Image:intel_cache1.jpg]]</center> <br /><br /> <br /> <br /> In this structure we have,<br /> * A unified on-chip '''L1 cache''' with the '''processor/core''',<br /> * A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,<br /> * The second level '''L2 cache''' along with the '''prefetch unit''' and<br /> * '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent across.These wide buses bring in multiple data bytes at a time. 
<br /> <br /> As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''', were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version on the cache line exists in any other cache in the system.<br /> <br /> <br /> '''CMP Architecture'''<br /> <br /> For CMP implementation, Intel chose the bus-based architecture using snoopy protocols vs the '''directory protocol''' because though directory protocol reduces the active power due to reduced snoop activity, it '''increased''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for the processors in the mobility family, directory-based solution was less favorable since battery life mainly depends on static power consumption and less on dynamic power.<br /> Let us examine how '''CMP''' was implemented in '''Intel Core Duo''', which was one of the first dual-core processor for the budget/entry-level market. <br /> The general CMP implementation structure of the Intel Core Duo is shown below<br /> <br /> <center>[[Image:intel_cache2.jpg]]</center> <br /><br /> <br /> This structure has the following changes when compared to the uniprocessor memory cluster structure. <br /> * '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.<br /> * The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. 
The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.<br /> * The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.<br /> * A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.<br /> <br /> This new '''partitioned structure''' for the memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. <br /> For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to <br /> [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> <br /> The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the '''same [http://en.wikipedia.org/wiki/MESI_protocol MESI] protocol'''; From using a '''single shared bus''' to '''dual independent buses (DIB)''' doubling the available bandwidth and to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, the snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. 
With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.<br /> <br /> ==AMD - Advanced Micro Devices Processors==<br /> <br /> <br /> ===MOESI & AMD Processors===<br /> <br /><br /> [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, which had 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die. Cache coherence becomes a bigger problem on such multiprocessors, so an appropriate coherence protocol was needed to address it. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], which was Intel's competitive counterpart to the AMD dual-core Opteron, used the [http://en.wikipedia.org/wiki/MESI_protocol MESI] protocol to handle cache coherence. [http://en.wikipedia.org/wiki/MESI_protocol MESI] came with the drawback of consuming considerable time and bandwidth in certain situations. <br /> <br /> [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] added a fifth state, called '''“Owned”''', to the [http://en.wikipedia.org/wiki/MESI_protocol MESI] protocol. [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] addresses the bandwidth problem faced in the MESI protocol when a processor with an invalid copy of the data in its cache wants to modify the data. The processor seeking access to the data must wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. This drawback is removed in MOESI by allowing dirty sharing. 
When the data is held by a processor in the new state '''“Owned”''', it can provide other processors with the modified data without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''“Owned”''' state remains responsible for updating main memory later, when the cache line is evicted.<br /> <br /> [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] has become one of the most popular snoop-based protocols supported in the AMD64 architecture. The AMD dual-core Opteron can maintain cache coherence in systems of up to 8 processors using this protocol.<br /> <br /> The five states of the [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] protocol are:<br /> * '''Modified (M)''' : The cache line holds the most recent copy of the data, and the data is not present in any other processor's cache.<br /> * '''Owned (O)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for writing the correct value back to main memory before the line is evicted. <br /> * '''Exclusive (E)''' : The cache line holds the most recent, correct copy of the data, which is present only on this processor; a copy is also present in main memory. <br /> * '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. 
<br /> * '''Invalid (I)''' : A cache line does not hold a valid copy of the data.<br /> <br /> A detailed explanation of how this protocol is implemented on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]<br /> <br /> The following table summarizes the [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] protocol:<br /> <br /> <br /> {| class="wikitable" border="1"<br /> <br /> |-<br /> <br /> | '''Cache Line State:'''<br /> <br /> | '''Modified'''<br /> <br /> | '''Owned''' <br /> <br /> | '''Exclusive'''<br /> <br /> | '''Shared'''<br /> <br /> | '''Invalid'''<br /> <br /> |-<br /> <br /> | '''This cache line is valid?'''<br /> <br /> | Yes<br /> <br /> | Yes<br /> <br /> | Yes<br /> <br /> | Yes<br /> <br /> | No<br /> <br /> |-<br /> <br /> | '''The memory copy is…'''<br /> <br /> | out of date<br /> <br /> | out of date<br /> <br /> | valid<br /> <br /> | valid unless another cache holds the line in Owned<br /> <br /> | -<br /> <br /> |-<br /> <br /> | '''Copies exist in caches of other processors?'''<br /> <br /> | No<br /> <br /> | Maybe (in the Shared state)<br /> | No<br /> | Maybe<br /> <br /> | Maybe<br /> <br /> |-<br /> <br /> | '''A write to this line'''<br /> <br /> | does not go to bus<br /> | goes to bus to invalidate shared copies<br /> | does not go to bus<br /> <br /> | goes to bus and updates cache<br /> <br /> | goes directly to bus<br /> <br /> |}<br /> <br /> <br /> <br /> The state transitions for [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] are shown below: <br /> <br /> <br /> <center>[[Image:MOESI_State_Transition_Diagram.jpg]]</center><br /> <br /> <center> [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] state transition diagram</center><br /> <br /><br /> <br /><br /> <br /> ===AMD Opteron memory Architecture===<br /> <br /><br /> <br /> <center>[[Image:Untitled.jpg]]</center><br /> <br /> <br /> <br /> The AMD processor’s high-performance cache architecture includes an integrated, 
64-bit, dual-ported 128-Kbyte split-L1 cache with a separate snoop port, multi-level translation lookaside buffers (TLBs), a scalable L2 cache controller with a 72-bit (64-bit data + 8-bit ECC) interface to as much as 8 Mbytes of industry-standard SDR or DDR SRAMs, and an integrated tag for the most cost-effective 512-Kbyte L2 configurations. The AMD Athlon processor’s integrated L1 cache comprises two separate 64-Kbyte, two-way set-associative data and instruction caches.<br /> <br /> More information about this can be found in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's Manual]<br /> <br /> ===Special Coherence Considerations in AMD64 architectures===<br /> <br /><br /> Instruction prefetching is a technique used to speed up the execution of a program. But in multiprocessors, prefetching can come at the cost of correctness. Due to prefetching, data can be modified in such a way that the memory coherence protocol is not able to handle the effects. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. <br /> <br /><br /> An example of this type of situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different from the physical-memory references for the data, so prefetching may cause a problem with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:<br /> # The tables that translate virtual-page A to physical-page M are now held only in main memory. 
The copies in the cache are invalidated.<br /> # The page-table entry for virtual-page A is changed by software in main memory to point to physical-page N rather than physical-page M.<br /> # Data in virtual-page A is accessed.<br /> <br /> Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual-page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical-page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.<br /> <br /> To prevent such errors, software must execute an INVLPG or MOV CR3 instruction immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.<br /> <br /> More information about this can be found in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> <br /> ===Optimization techniques on MOESI when implemented on AMD Phenom processors===<br /> <br /><br /> <br /> In real machines, applying optimization techniques to the standard cache coherence protocol can improve the performance of the machine. 
For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 10h), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] protocol with some optimization techniques incorporated. <br /> <br /> The optimization discussed here focuses on a small subset of compute problems which behave like producer and consumer programs. In such a computing problem, a thread of a program running on a single core produces data, which is consumed by a thread that is running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.<br /> <br /> When the producer thread writes a new entry (for example, into a ring buffer shared with the consumer), it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads the cache line, the [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Now, the producer thread circles the ring buffer to arrive back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] protocol). 
To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This will slow down the process.<br /> <br /> Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.<br /> <br /> You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> <br /> <br /> ==Xerox Corporation==<br /> ===Dragon Protocol & Xerox Dragon Processors===<br /> The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update based coherence protocol which does not invalidate other cached copies like what we have seen in the coherence protocols so far. Write propagation is achieved by updating the cached copies instead of invalidating them. But the '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' does not update memory on a cache to cache transfer and delays the memory and cache consistency until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover only the written '''byte''' or the '''word''' is '''communicated''' to the '''other caches''' instead of the whole block which further '''reduces''' the '''bandwidth''' usage. It has the ability to detect dynamically, the sharing status of a block and use a write through policy for shared blocks and write back for currently non-shared blocks. 
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''. <br /> * '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. <br /> * '''Shared Modified (Sm)''' - For a given block, only one cache in the system can hold it in the Shared Modified state. Potentially two or more caches have this block, memory is not up to date, and this cache, which modified the block, is responsible for writing it back.<br /> * '''Shared Clean (Sc)''' - Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).<br /> Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.<br /> <br /> <center>[[Image:Dragon.jpg]]</center><br /> <br /> <br /> The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' was developed by the Xerox Palo Alto Research Center ('''[http://en.wikipedia.org/wiki/Xerox_PARC Xerox PARC]'''), a subsidiary of Xerox Corporation. This protocol was used in the '''[http://en.wikipedia.org/w/index.php?title=Xerox_Dragon&redirect=no Xerox PARC Dragon]''' multiprocessor workstation, a VLSI research computer that could support multiple processors on a central high-bandwidth memory bus. The Dragon design implemented snoopy caches that provided the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses. 
The Dragon system was designed to support 4 to 8 Dragon processors. The memory bus used in the Xerox Dragon evolved to become the '''[http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=289691 XDBus]''', a low-cost, synchronous, '''packet-switched VLSI bus''' designed for use in high-performance multiprocessors. This was used as the interconnect in many multiprocessor server systems, such as the '''Cray Superserver 6400''', '''Sun Microsystems' SPARCcenter 2000''' and '''SPARCserver 1000''', and '''Sun4d''' systems.<br /> <br /> =References=<br /> # [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]<br /> # [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]<br /> # [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]<br /> # [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]<br /> # [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&%20MESI.pdf Cache consistency with MESI on Intel processor]<br /> # [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]<br /> # [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]<br /> # [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=4913 Silicon Graphics Computer Systems]<br /> # 
[http://books.google.com/books?id=g82fofiqa5IC&printsec=frontcover&dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&source=bl&ots=COrdamlfVn&sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&hl=en&ei=0ZO6S4TJGcOclgejzI3BBw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CAgQ6AEwAA#v=onepage&q=&f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]<br /> # [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]<br /> # [http://portal.acm.org/citation.cfm?id=1499317&dl=GUIDE&coll=GUIDE&CFID=83027384&CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]<br /> # [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]<br /> # [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]<br /> # [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]<br /> # [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]<br /> # [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&arnumber=289691 XDBus]</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_cl&diff=43604 CSC/ECE 506 Spring 2011/ch2 cl 2011-02-07T20:46:24Z

<p>Cslingaf: </p> <hr /> <div>=Supplement to Chapter 2: The Data Parallel Programming Model=<br /> <br /> Chapter 2 of [[#References | Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [[#Definitions | ''data parallel'']] model, another commonly recognized parallel programming model covered in other treatments like [[#References | Foster (1995)]] and [[#References | Culler (1999)]]. Whereas the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can therefore be used in conjunction with either. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References | Solihin (2008)]]. The [[#Definitions | ''task parallel'']] model will also be introduced as a point of contrast.<br /> <br /> =Overview=<br /> <br /> Whereas the shared memory and message passing models focus on how parallel tasks access common data, the [[#Definitions | ''data parallel'']] model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data. <br /> [[#References | Hillis (1986)]] points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks. [[#References | Haveraaen (2000)]] also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. <br /> <br /> == Example of Data Parallel Programming Model ==<br /> <br /> This section shows a simple example adapted from the Solihin textbook (pp. 24 - 27) that illustrates the data-parallel programming model. 
Each of the codes below are written in pseudo-code style.<br /> <br /> <br /> Suppose we want to perform the following task on an array <code>a</code>: updating each element of <code>a</code> by the product of itself and its index, and adding together the elements of <code>a</code> into the variable <code>sum</code>. The corresponding code is shown below.<br /> <br /> <br /> // simple sequential task<br /> sum = 0;<br /> '''for''' (i = 0; i < a.length; i++)<br /> {<br /> a[i] = a[i] * i;<br /> sum = sum + a[i];<br /> }<br /> <br /> <br /> When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (In our example data reorganization is summing up values across different processing elements). Since data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples for the first part of the program, shared-memory version of the second part, and message passing for the second part, respectively.<br /> <br /> <br /> // data parallel programming: let each PE perform the same task on different pieces of distributed data<br /> pe_id = getid();<br /> my_sum = 0;<br /> '''for''' (i = pe_id; i < a.length; i += number_of_pe) //separate elements of the array are assigned to each PE <br /> {<br /> a[i] = a[i] * i;<br /> my_sum = my_sum + a[i]; //all PEs accumulate elements assigned to them into local variable my_sum<br /> }<br /> <br /> <br /> In the above code, data parallelism is achieved by letting each processing element perform actions on array's separate elements, which are identified using the PE's id. 
For instance, if three processing elements are used then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements then the index of the array for each will increase by three on each iteration until the task is complete (note that in our example elements assigned to each PE are interleaved instead of continuous). If the length of the array is a multiple of three then each processing element takes the same amount of time to execute its portion of the task.<br /> <br /> <br /> The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.<br /> <br /> [[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming(adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]<br /> <br /> ==Task Parallel Overview==<br /> The logical opposite of data parallel is [[#Definitions | ''task parallel,'']] in which a number of distinct tasks operate on common data. 
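<br /> <br /> For contrast with the task parallel code in the next section, the interleaved data parallel assignment described earlier (an array of length 7 split across 3 PEs) can be sketched in Python. This is only a minimal sketch under stated assumptions: the function names <code>indices_for_pe</code> and <code>data_parallel_sum</code> are hypothetical, the PEs are simulated by an ordinary loop rather than run in parallel, and the final summation stands in for the reduction step that a real program would carry out through shared memory or message passing.

```python
def indices_for_pe(pe_id, number_of_pe, length):
    """Indices handled by one PE under interleaved (cyclic) assignment."""
    return list(range(pe_id, length, number_of_pe))


def data_parallel_sum(a, number_of_pe):
    """Simulate the data parallel example: every PE runs the same loop body
    on its own elements, then the partial sums are combined.

    The PEs are simulated sequentially here (an assumption of this sketch).
    """
    a = list(a)  # work on a copy so the caller's list is untouched
    partial_sums = []
    for pe_id in range(number_of_pe):  # each iteration stands in for one PE
        my_sum = 0
        for i in indices_for_pe(pe_id, number_of_pe, len(a)):
            a[i] = a[i] * i
            my_sum += a[i]
        partial_sums.append(my_sum)
    # Reduction step: combine the per-PE partial sums into one total.
    return a, partial_sums, sum(partial_sums)


if __name__ == "__main__":
    for pe in range(3):
        print("PE%d works on indices" % pe, indices_for_pe(pe, 3, 7))
    updated, partial, total = data_parallel_sum([1] * 7, 3)
    print(updated, partial, total)
```

With seven elements and three PEs, the sketch assigns indices 0, 3, 6 to PE0, indices 1, 4 to PE1 and indices 2, 5 to PE2, matching the illustration; every simulated PE executes identical code and only the data it touches differs.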
<br /> <br /> ==Example of Task Parallel Programming Model==<br /> An example of task parallel code, written in the same pseudo-code style, follows below.<br /> <br /> // Task parallel code.<br /> <br /> int id = getmyid(); // assume id = 0 for thread 0, id = 1 for thread 1<br /> <br /> if (id == 0)<br /> {<br /> for (i = 0; i < 8; i++)<br /> {<br /> a[i] = b[i] + c[i];<br /> send_msg(P1, a[i]);<br /> }<br /> }<br /> else<br /> {<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> {<br /> recv_msg(P0, a[i]);<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> }<br /> Print sum;<br /> }<br /> <br /> In the code above, work is divided into two parallel tasks. The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.'' The other sums the positive elements of ''a.'' These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed by each thread is different (rather than identical).<br /> <br /> Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated. This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements. In addition, unique tasks are likely to have significantly different run times, making it more challenging to balance load across processors. [[#References | Haveraaen (2000)]] also notes that task parallel algorithms are inherently more complex, requiring a greater degree of communication and synchronization. In the task parallel code above, after thread 0 computes an element of ''a'' it must send it to thread 1. To support this, sends and receives occur in every iteration of the two loops, resulting in a total of 8 messages being sent between the threads. 
In contrast, the data parallel code sends only 2 messages, one at the beginning and one at the end. <br /> <br /> <br /> <br /> ==Comparison between Data and Task Parallel Programming Models==<br /> <br /> <br /> {| class="wikitable" border="1" align="center"<br /> |+ '''Comparison between data parallel and task parallel programming models.'''<br /> |-<br /> ! Aspects<br /> ! Data Parallel<br /> ! Task Parallel<br /> |-<br /> | Decomposition<br /> | Partition data into subsets<br /> | Partition program into subtasks<br /> |-<br /> | Parallel tasks<br /> | Identical<br /> | Unique<br /> |-<br /> | Degree of parallelism<br /> | Scales easily<br /> | Fixed<br /> |-<br /> | Load balancing<br /> | Easier<br /> | Harder<br /> |-<br /> | Communication overhead<br /> | Lower<br /> | Higher<br /> |}<br /> <br /> ===Synchronous vs Asynchronous===<br /> While the [http://en.wikipedia.org/wiki/Lockstep_(computing) lockstep] imposed by data parallelism on all data streams ensures synchronous computation (all PEs perform their tasks at the exact same pace), every processor in task parallelism performs its task at their own pace, which we call asynchronous computation. Thus, at a certain point of a task parallel program's execution, communication and synchronization primitives are needed to allow different instruction streams to coordinate their efforts, and that is where variable-sharing and message-passing come into play.<br /> <br /> ===Determinism vs. Non-Determinism===<br /> Data parallelism's synchronous nature and task parallelism's asynchronism give rise to another pair of features that add to the difference between these two models: determinism versus non-determinism. Data parallelism is deterministic, i.e. computing with the same input will always yield the same result, since its synchronism ensures that issues like relative timing between PEs will not arise. 
In contrast, task parallelism's asynchronous updates of common data can give rise to non-determinism, i.e., the same input won't always yield the same computation result (the result of a computation will also depend on factors outside the program's control, such as scheduling and timing of other PEs). Obviously, non-determinism makes it harder to write and maintain correct programs. This partially explains the advantage of the data parallel programming model over task parallelism in terms of development effort (also discussed in section 4.2).<br /> <br /> ===Comparison Diagram===<br /> The following diagram may be of use in conceptually distinguishing between data parallelism (SIMD: Single Instruction, Multiple Data) and task parallelism (MIMD: Multiple Instruction, Multiple Data). In the SIMD case, a single instruction stream feeds multiple processors, which then access multiple connections to the data. In contrast, the MIMD case has multiple instruction streams (evidenced by two groups of processors) which interact, again, with multiple connections to the data.<br /> [[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism (PU (Processing Unit) and PE (Processing Element) are synonymous)]]<br /> <br /> =History of Parallel Programming Models=<br /> ==Interesting Dates in Parallel Computing History (with a focus on IBM, Cray, and ILLIAC)==<br /> ===1950's===<br /> *1955<br /> **IBM introduces the 704. Principal architect is Gene Amdahl; it is the first commercial machine with floating-point hardware, and is capable of approximately 5 k[http://en.wikipedia.org/wiki/FLOPS FLOPS]. <br /> <br /> *1956<br /> **IBM starts the 7030 project (known as STRETCH) to produce a supercomputer for Los Alamos National Laboratory (LANL). Its goal is to produce a machine with 100 times the performance of any available at the time. 
<br /> <br /> *1958<br /> **Bull of France announces the Gamma 60 with multiple functional units and fork & join operations in its instruction set. 19 are later built. <br /> **John Cocke and Daniel Slotnick discuss use of parallelism in numerical calculations in an IBM research memo. Slotnick later proposes SOLOMON, a SIMD machine with 1024 1-bit PEs, each with memory for 128 32-bit values. The machine is never built, but the design is the starting point for much later work. <br /> ===1960's===<br /> *1960<br /> **Atlas computer becomes operational. It is the first machine to use virtual memory and paging; its instruction execution is pipelined, and it contains separate fixed- and floating-point arithmetic units, capable of approximately 200 kFLOPS. <br /> **Burroughs introduces the D825 symmetrical MIMD multiprocessor. 1 to 4 CPUs access 1 to 16 memory modules using a crossbar switch. The CPUs are similar to the later B5000; the operating system is symmetrical, with a shared ready queue. <br /> <br /> *1964<br /> **Daniel Slotnick proposes building a massively-parallel machine for the Lawrence Livermore National Laboratory (LLNL); the Atomic Energy Commission gives the contract to CDC instead, who build the STAR-100 to fulfil it. Slotnick's design funded by the Air Force, and evolves into the ILLIAC-IV. The machine is built at the University of Illinois, with Burroughs and Texas Instruments as primary subcontractors. Texas Instruments' Advanced Scientific Computer (ASC) also grows out of this initiative. <br /> <br /> *1966<br /> **Michael Flynn publishes a paper describing the architectural taxonomy which bears his name. <br /> <br /> *1967<br /> **IBM produces the 360/91 (later model 95) with dynamic instruction reordering. 20 of these are produced over the next several years; the line is eventually supplanted by the slower Model <br /> **Gene Amdahl and Daniel Slotnick have published debate at AFIPS Conference about the feasibility of parallel processing. 
Amdahl's argument about limits to parallelism becomes known as "Amdahl's Law"; he also propounds a corollary about system balance (sometimes called "Amdahl's Other Law"), which states that a balanced machine has the same number of MIPS, Mbytes, and Mbit/s of I/O bandwidth. <br /> <br /> *1968<br /> **IBM 2938 Array Processor delivered to Western Geophysical (who promptly paint racing stripes on it). First commercial machine to sustain 10 MFLOPS on 32-bit floating-point operations. A programmable digital signal processor, it proves very popular in the petroleum industry. <br /> **Edsger Dijkstra describes semaphores, and introduces the dining philosophers problem, which later becomes a standard example in concurrency theory. <br /> <br /> *1969<br /> **George Paul, M. Wayne Wilson, and Charles Cree begin work at IBM on VECTRAN, an extension to FORTRAN 66 with array-valued operators, functions, and I/O facilities. <br /> **Work begins at Compass Inc. on a parallelizing FORTRAN compiler for the ILLIAC-IV called IVTRAN. <br /> <br /> ===1970's===<br /> *1971<br /> **Intel produces the world's first single-chip CPU, the 4004 microprocessor. <br /> <br /> *1972<br /> **Seymour Cray leaves Control Data Corporation to found Cray Research Inc. CDC cancels the 8600 project, a follow-on to the 7600. <br /> **Quarter-sized (64 PEs) ILLIAC-IV installed at NASA Ames. Each processor has a peak speed of 4 MFLOPS; the machine's I/O system is capable of 500 Mbit/s. <br /> **Paper studies of massive bit-level parallelism done by Stewart Reddaway at ICL. These later lead to development of ICL DAP. <br /> <br /> *1974<br /> **Leslie Lamport's paper "Parallel Execution of Do-Loops" lays the theoretical foundation for most later research on automatic vectorization and shared-memory parallelization. Much of the work was done in 1971-2 while Lamport was at Compass Inc. <br /> **IBM delivers the first 3838 array processor, a general-purpose digital signal processor. 
<br /> <br /> *1975<br /> **ILLIAC-IV becomes operational at NASA Ames after concerted check-out effort. <br /> <br /> *1976<br /> **Cray Research delivers the first Freon-cooled CRAY-1 to Los Alamos National Laboratory. <br /> <br /> *1979<br /> **IBM's John Cocke designs the 801, the first of what are later called RISC architectures. <br /> <br /> ===1980's===<br /> *1980<br /> **PFC (Parallel FORTRAN Compiler) developed at Rice University under the direction of Ken Kennedy. <br /> **David Padua and David Kuck at the University of Illinois develop the DOACROSS parallel construct to be used as a target in program transformation. The name DOACROSS is due to Robert Kuhn. <br /> <br /> *1982<br /> **Steve Chen's group at Cray Research produces the first X-MP, containing two pipelined processors compatible with the CRAY-1 and shared memory. <br /> **ILLIAC-IV decommissioned. <br /> <br /> *1983<br /> **J. R. Allen's Ph.D. thesis at Rice University introduces the concepts of loop-carried and loop-independent dependencies, and formalizes the process of vectorization. <br /> **Scientific Computer Systems founded to design and market Cray-compatible minisupercomputers. <br /> **CRAY-1 with 1 processor achieves 12.5 MFLOPS on the 100x100 [http://searchdatacenter.techtarget.com/definition/Linpack-benchmark LINPACK benchmark]. <br /> <br /> *1984<br /> **The CRAY X-MP family is expanded to include 1- and 4-processor machines. A CRAY X-MP running CX-OS, the first Unix-like operating system for supercomputers, is delivered to NASA Ames. <br /> **CRAY X-MP with 1 processor achieves 21 MFLOPS on 100x100 LINPACK. <br /> <br /> *1985<br /> **Cray Research produces the CRAY-2, with four background processors, a single foreground processor, a 4.1 nsec clock cycle, and 256 Mword memory. The machine is cooled by an inert fluorocarbon previously used as a blood substitute. 
<br /> <br /> *1986<br /> **CRAY X-MP with 4 processors achieves 713 MFLOPS (against a peak of 840) on 1000x1000 LINPACK. <br /> **Alan Karp offers $100 prize to first person to demonstrate speedup of 200 or more on general purpose parallel processor. Benner, Gustafson, and Montry begin work to win it, and are later awarded the Gordon Bell Prize. <br /> <br /> *1987<br /> **The first Gordon Bell Prizes for parallel performance is awarded. The recipients are Brenner, Gustafson, and Montry, for a speedup of 400-600 on variety of applications running on a 1024-node nCUBE, and Chen, De Benedictis, Fox, Li, and Walker, for speedups of 39-458 on various hypercubes. <br /> <br /> *1988<br /> **John Gustafson and Gary Montry argue that Amdahl's Law can be invalidated by increasing problem size. <br /> **CRAY Y-MP with 1 processor achieves 74 MFLOPS on 100x100 LINPACK; the same machine with 8 processors achieves 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> <br /> *1989<br /> **CRAY Y-MP with 8 processors achieves 275 MFLOPS on 100x100 LINPACK, and 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> **Gordon Bell Prize for absolute performance awarded to a team from Mobil and Thinking Machines Corporation, who achieve 6 GFLOPS on a CM-2 Connection Machine; prize in price/performance category awarded to Emeagwali, who achieves 400 MFLOPS per million dollars on the same platform. <br /> **Seymour Cray leaves Cray Research to found Cray Computer Corporation. <br /> <br /> ===1990's===<br /> *1990<br /> **Cray Research, Inc., purchases Supertek Computers Inc., makers of the S-1, a minisupercomputer compatible with the CRAY X-MP. <br /> **Gordon Bell Prize in price/performance category awarded to Geist, Stocks, Ginatempo, and Shelton, who achieves 800 MFLOPS per million dollars in a high-temperature superconductivity program on a 128-node Intel iPSC/860. 
The prize in the compiler parallelization category is awarded to Sabot, Tennies, and Vasilevsky, who achieve 1.5 GFLOPS on a CM-2 Connection Machine with FORTRAN 90 code derived from FORTRAN 77. <br /> **National Energy Research Supercomputer Center (NERSC) at LLNL places order with Cray Computer Corporation for CRAY-3 supercomputer. The order includes a unique 8-processor CRAY-2 computer system that is installed in April. <br /> <br /> *1991<br /> **CRAY Y-MP C90 with 16 processors achieves 403 MFLOPS on 100x100 LINPACK; a Fujitsu VP-2600 with 1 processor achieves 4 GFLOPS (against a peak of 5 GFLOPS) on 1000x1000 LINPACK. <br /> <br /> *1993<br /> **Cray Research delivers a Y-MP M90 with 32 Gbyte of memory to the U.S. Government, after delivering a similar machine with 8 Gbyte of memory in the previous year to the Minnesota Supercomputer Center. <br /> <br /> ===References===<br /> http://ei.cs.vt.edu/~history/Parallel.html<br /> <br /> ==Vector Machines==<br /> <br /> First appearing in the 1970s, vector machines were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields and in multimedia.<br /> <br /> The Solomon project at Westinghouse produced one of the first designs to use vector operations. Its CPU had a large number of ALUs that would each be fed different data each cycle. Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.<br /> <br /> An innovation came with the Cray-1 supercomputer in 1976. Its designers realized that large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. 
The Cray computer was only able to load 12 data points, but by completing multiple instructions before continuing the total number of memory accesses decreased. The Cray-1 could perform at 240 MFLOPS.<br /> <br /> ===References for this section===<br /> *Wikipedia, Vector processor http://en.wikipedia.org/w/index.php?title=Vector_processor&oldid=405209552<br /> *Wikipedia, Cray-1 http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730<br /> <br /> <br /> <br /> <br /> =Comparing the Data Parallel Model with the Shared Memory and Message Passing Models=<br /> <br /> Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References | Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.<br /> <br /> As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). 
In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.<br /> <br /> Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. [[#Definitions | ''SIMD (single-instruction-multiple-data)'']] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix: C for CUDA Example Code | Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model&mdash;like the message passing model&mdash;does not require hardware support.<br /> <br /> Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.<br /> <br /> Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. 
The regular data chunks also make it easier to reason about where to locate data and how to organize it.<br /> <br /> =Definitions=<br /> <br /> * ''Data parallel.'' A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.<br /> * ''Task parallel.'' A task parallel algorithm is composed of a set of differing tasks which operate on common data.<br /> * ''SIMD (single instruction, multiple data).'' A processor which executes a single instruction simultaneously on multiple data locations.<br /> * ''MIMD (multiple instruction, multiple data).'' A processor which executes multiple instructions simultaneously on multiple data locations.<br /> * ''PE.'' A processing element.<br /> * ''PU.'' A processing unit (synonymous with PE).<br /> * ''FLOPS.'' Floating-point operations per second (a performance measure allowing comparisons between different parallel system architectures).<br /> <br /> =References=<br /> <br /> * David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.<br /> * Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.<br /> * Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 "Machine and collection abstractions for user-implemented data-parallel programming,"] ''Scientific Programming,'' 8(4):231-246, 2000.<br /> * W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 "Data parallel algorithms,"] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.<br /> * Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 "A comparison of message passing and shared memory architectures for data parallel programs,"] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 
94-105.<br /> * Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.<br /> <br /> =Appendix: C for CUDA Example Code=<br /> <br /> The following code is a data parallel implementation of the sequential Code 2.3 from [[#References | Solihin (2008)]] using [http://www.nvidia.com/object/cuda_learn.html C for CUDA]. It is presented to give an impression of programming for a SIMD architecture, but a detailed discussion is beyond the scope of this supplement. Ignoring memory allocation issues, the code is very similar to the data parallel example, Code 2.5 from [[#References | Solihin (2008)]], discussed earlier. The main difference is the presence of a control thread that sends the parallel tasks to the CUDA device.<br /> <br /> // Data parallel implementation of the example code using C for CUDA.<br /> <br /> #include <iostream><br /> <br /> // Each thread processes its own contiguous chunk of local_iter elements.<br /> __global__ void kernel(float* a, float* b, float* c, float* local_sum)<br /> {<br /> int id = threadIdx.x;<br /> int local_iter = 4;<br /> int start_iter = id * local_iter;<br /> int end_iter = start_iter + local_iter;<br /> <br /> // Begin data parallel section<br /> <br /> for (int i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum[id] = 0;<br /> for (int i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum[id] = local_sum[id] + a[i];<br /> <br /> // End data parallel section<br /> }<br /> <br /> int main()<br /> {<br /> float h_a[8], h_b[8], h_c[8], h_sum[2];<br /> float *d_a, *d_b, *d_c, *d_sum;<br /> float sum;<br /> <br /> // Placeholder input values for illustration only<br /> for (int i = 0; i < 8; i++)<br /> {<br /> h_b[i] = (float)i;<br /> h_c[i] = 1.0f;<br /> }<br /> <br /> size_t size = 8 * sizeof(float);<br /> size_t size2 = 2 * sizeof(float);<br /> <br /> // Allocate device memory for the arrays and the per-thread partial sums<br /> cudaMalloc((void**)&d_a, size);<br /> cudaMalloc((void**)&d_b, size);<br /> cudaMalloc((void**)&d_c, size);<br /> cudaMalloc((void**)&d_sum, size2);<br /> <br /> // Copy the inputs from the host to the device<br /> cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);<br /> cudaMemcpy(d_c, h_c, size, cudaMemcpyHostToDevice);<br /> <br /> // Launch one block of two threads<br /> kernel<<<1, 2>>>(d_a, 
d_b, d_c, d_sum);<br /> <br /> cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);<br /> cudaMemcpy(h_sum, d_sum, size2, cudaMemcpyDeviceToHost);<br /> <br /> sum = h_sum[0] + h_sum[1];<br /> std::cout << sum;<br /> <br /> cudaFree(d_a);<br /> cudaFree(d_b);<br /> cudaFree(d_c);<br /> cudaFree(d_sum);<br /> }</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_cl&diff=43603 CSC/ECE 506 Spring 2011/ch2 cl 2011-02-07T20:34:17Z

<p>Cslingaf: </p> <hr /> <div>=Supplement to Chapter 2: The Data Parallel Programming Model=<br /> <br /> Chapter 2 of [[#References | Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [[#Definitions | ''data parallel'']] model, another commonly recognized parallel programming model covered in other treatments like [[#References | Foster (1995)]] and [[#References | Culler (1999)]]. Whereas the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can therefore be used in conjunction with either. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References | Solihin (2008)]]. The [[#Definitions | ''task parallel'']] model will also be introduced as a point of contrast.<br /> <br /> =Overview=<br /> <br /> Whereas the shared memory and message passing models focus on how parallel tasks access common data, the [[#Definitions | ''data parallel'']] model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data. <br /> [[#References | Hillis (1986)]] points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks. [[#References | Haveraaen (2000)]] also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. <br /> <br /> == Example of Data Parallel Programming Model ==<br /> <br /> This section shows a simple example, adapted from the Solihin textbook (pp. 24 - 27), that illustrates the data-parallel programming model. 
Each of the code fragments below is written in pseudo-code style.<br /> <br /> <br /> Suppose we want to perform the following task on an array <code>a</code>: updating each element of <code>a</code> to the product of itself and its index, and adding together the elements of <code>a</code> into the variable <code>sum</code>. The corresponding code is shown below.<br /> <br /> <br /> // simple sequential task<br /> sum = 0;<br /> '''for''' (i = 0; i < a.length; i++)<br /> {<br /> a[i] = a[i] * i;<br /> sum = sum + a[i];<br /> }<br /> <br /> <br /> When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (in our example, data reorganization means summing up values across different processing elements). Since the data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or through message passing. The code fragment below shows the first part of the program.<br /> <br /> <br /> // data parallel programming: let each PE perform the same task on different pieces of distributed data<br /> pe_id = getid();<br /> my_sum = 0;<br /> '''for''' (i = pe_id; i < a.length; i += number_of_pe) //separate elements of the array are assigned to each PE <br /> {<br /> a[i] = a[i] * i;<br /> my_sum = my_sum + a[i]; //all PEs accumulate elements assigned to them into local variable my_sum<br /> }<br /> <br /> <br /> In the above code, data parallelism is achieved by letting each processing element operate on separate elements of the array, which are identified using the PE's id. 
For instance, if three processing elements are used, one starts at i = 0, one at i = 1, and the last at i = 2. Because there are three processing elements, each one advances its array index by three on every iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved rather than contiguous). If the length of the array is a multiple of three, each processing element is assigned the same number of elements and so takes roughly the same amount of time to execute its portion of the task.<br /> <br /> <br /> The picture below illustrates how the elements of the array are assigned to the PEs for a specific case: the array has length 7 and 3 PEs are available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 works on the elements with index 0, 3, and 6; PE1 is in charge of the elements with index 1 and 4; and the elements with index 2 and 5 are assigned to PE2. In this way, the 3 PEs work collectively on the array while each PE works on different elements. Thus, data parallelism is achieved.<br /> <br /> [[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming (adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]<br /> <br /> ==Task Parallel Overview==<br /> The logical opposite of data parallel is [[#Definitions | ''task parallel,'']] in which a number of distinct tasks operate on common data. 
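<br /> <br />
As a rough illustration of this contrast, the following C sketch places the two decompositions side by side, using the array-scaling task from the example above. It is a sequential stand-in under stated assumptions, not code from Solihin (2008): the function names <code>data_parallel_step</code>, <code>task_scale</code>, and <code>task_sum</code> are hypothetical, and each call stands in for work that a real PE would perform concurrently. In the data parallel routine every PE runs the same loop on its own interleaved indices; in the task parallel pair, two different routines each traverse the whole array.

```c
#include <stddef.h>

/* Data parallel (hypothetical helper): every PE runs this same routine,
   but only on indices pe_id, pe_id + num_pe, pe_id + 2*num_pe, ...
   It returns the PE's partial sum. */
double data_parallel_step(double *a, size_t n, size_t pe_id, size_t num_pe)
{
    double my_sum = 0.0;
    for (size_t i = pe_id; i < n; i += num_pe) {
        a[i] = a[i] * (double)i;   /* same operation, different elements */
        my_sum += a[i];
    }
    return my_sum;
}

/* Task parallel, task 1 (hypothetical helper): scale every element. */
void task_scale(double *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * (double)i;
}

/* Task parallel, task 2 (hypothetical helper): sum every element. */
double task_sum(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

Calling <code>data_parallel_step</code> once per PE id and adding the returned partial sums should give the same total as calling <code>task_scale</code> followed by <code>task_sum</code>. In a real task parallel run the second task would additionally have to wait for the first task's results, which is the communication the next section makes explicit.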
<br /> <br /> ==Example of Task Parallel Programming Model==<br /> An example of task parallel code follows below. It computes the element-wise sum of two arrays and then adds up the positive elements of the result, which is the computation performed by the sequential Code 2.3 from [[#References | Solihin (2008)]].<br /> <br /> // Task parallel code.<br /> <br /> int id = getmyid(); // assume id = 0 for thread 0, id = 1 for thread 1<br /> <br /> if (id == 0)<br /> {<br /> for (i = 0; i < 8; i++)<br /> {<br /> a[i] = b[i] + c[i];<br /> send_msg(P1, a[i]);<br /> }<br /> }<br /> else<br /> {<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> {<br /> recv_msg(P0, a[i]);<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> }<br /> Print sum;<br /> }<br /> <br /> In the code above, work is divided into two parallel tasks. The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.'' The other sums the positive elements of ''a.'' These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed by each thread is different (rather than identical).<br /> <br /> Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated. This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements. In addition, unique tasks are likely to have significantly different run times, making it more challenging to balance load across processors. [[#References | Haveraaen (2000)]] also notes that task parallel algorithms are inherently more complex, requiring a greater degree of communication and synchronization. In the task parallel code above, after thread 0 computes an element of ''a'' it must send it to thread 1. To support this, a send and a receive occur in every iteration of the two loops, resulting in a total of 8 messages being sent between the threads. 
In contrast, the data parallel code sends only 2 messages, one at the beginning and one at the end. <br /> <br /> <br /> <br /> ==Comparison between Data and Task Parallel Programming Models==<br /> <br /> <br /> {| class="wikitable" border="1" align="center"<br /> |+ '''Comparison between data parallel and task parallel programming models.'''<br /> |-<br /> ! Aspects<br /> ! Data Parallel<br /> ! Task Parallel<br /> |-<br /> | Decomposition<br /> | Partition data into subsets<br /> | Partition program into subtasks<br /> |-<br /> | Parallel tasks<br /> | Identical<br /> | Unique<br /> |-<br /> | Degree of parallelism<br /> | Scales easily<br /> | Fixed<br /> |-<br /> | Load balancing<br /> | Easier<br /> | Harder<br /> |-<br /> | Communication overhead<br /> | Lower<br /> | Higher<br /> |}<br /> <br /> ===Synchronous vs Asynchronous===<br /> While the [http://en.wikipedia.org/wiki/Lockstep_(computing) lockstep] imposed by data parallelism on all data streams ensures synchronous computation (all PEs perform their tasks at exactly the same pace), every processor in task parallelism performs its task at its own pace, which we call asynchronous computation. Thus, at certain points in a task parallel program's execution, communication and synchronization primitives are needed to allow the different instruction streams to coordinate their efforts, and that is where variable-sharing and message-passing come into play.<br /> <br /> ===Determinism vs. Non-Determinism===<br /> Data parallelism's synchronous nature and task parallelism's asynchronism give rise to another pair of features that add to the difference between these two models: determinism versus non-determinism. Data parallelism is deterministic, i.e., computing with the same input will always yield the same result, since its synchronism ensures that issues like relative timing between PEs will not arise. 
In contrast, task parallelism's asynchronous updates of common data can give rise to non-determinism, i.e., the same input will not always yield the same result (the result of a computation also depends on factors outside the program's control, such as scheduling and the timing of other PEs). Non-determinism makes it harder to write and maintain correct programs. This partially explains the advantage of the data parallel programming model over the task parallel model in terms of development effort (also discussed in section 4.2).<br /> <br /> ===Comparison Diagram===<br /> The following diagram may help to distinguish conceptually between data parallelism (SIMD: Single Instruction, Multiple Data) and task parallelism (MIMD: Multiple Instruction, Multiple Data). In the SIMD case, a single instruction stream is issued to multiple processors, which then access the data through multiple connections. In contrast, the MIMD case has multiple instruction streams (evidenced by two groups of processors), which again interact with the data through multiple connections.<br /> [[Image:Smid.png|frame|center|425px|Contrast between data parallelism and task parallelism (PU (Processing Unit) and PE (Processing Element) are synonymous)]]<br /> <br /> =History of Parallel Programming Models=<br /> ==Interesting Dates in Parallel Computing History (with a focus on IBM, Cray, and ILLIAC)==<br /> ===1950's===<br /> *1955<br /> **IBM introduces the 704. Principal architect is Gene Amdahl; it is the first commercial machine with floating-point hardware, and is capable of approximately 5 kFLOPS. <br /> <br /> *1956<br /> **IBM starts the 7030 project (known as STRETCH) to produce a supercomputer for Los Alamos National Laboratory (LANL). Its goal is to produce a machine with 100 times the performance of any available at the time. <br /> <br /> *1958<br /> **Bull of France announces the Gamma 60 with multiple functional units and fork & join operations in its instruction set. 19 are later built. 
<br /> **John Cocke and Daniel Slotnick discuss use of parallelism in numerical calculations in an IBM research memo. Slotnick later proposes SOLOMON, a SIMD machine with 1024 1-bit PEs, each with memory for 128 32-bit values. The machine is never built, but the design is the starting point for much later work. <br /> ===1960's===<br /> *1960<br /> **Atlas computer becomes operational. It is the first machine to use virtual memory and paging; its instruction execution is pipelined, and it contains separate fixed- and floating-point arithmetic units, capable of approximately 200 kFLOPS. <br /> **Burroughs introduces the D825 symmetrical MIMD multiprocessor. 1 to 4 CPUs access 1 to 16 memory modules using a crossbar switch. The CPUs are similar to the later B5000; the operating system is symmetrical, with a shared ready queue. <br /> <br /> *1964<br /> **Daniel Slotnick proposes building a massively-parallel machine for the Lawrence Livermore National Laboratory (LLNL); the Atomic Energy Commission gives the contract to CDC instead, who build the STAR-100 to fulfil it. Slotnick's design funded by the Air Force, and evolves into the ILLIAC-IV. The machine is built at the University of Illinois, with Burroughs and Texas Instruments as primary subcontractors. Texas Instruments' Advanced Scientific Computer (ASC) also grows out of this initiative. <br /> <br /> *1966<br /> **Michael Flynn publishes a paper describing the architectural taxonomy which bears his name. <br /> <br /> *1967<br /> **IBM produces the 360/91 (later model 95) with dynamic instruction reordering. 20 of these are produced over the next several years; the line is eventually supplanted by the slower Model <br /> **Gene Amdahl and Daniel Slotnick have published debate at AFIPS Conference about the feasibility of parallel processing. 
Amdahl's argument about limits to parallelism becomes known as "Amdahl's Law"; he also propounds a corollary about system balance (sometimes called "Amdahl's Other Law"), which states that a balanced machine has the same number of MIPS, Mbytes, and Mbit/s of I/O bandwidth. <br /> <br /> *1968<br /> **IBM 2938 Array Processor delivered to Western Geophysical (who promptly paint racing stripes on it). First commercial machine to sustain 10 MFLOPS on 32-bit floating-point operations. A programmable digital signal processor, it proves very popular in the petroleum industry. <br /> **Edsger Dijkstra describes semaphores, and introduces the dining philosophers problem, which later becomes a standard example in concurrency theory. <br /> <br /> *1969<br /> **George Paul, M. Wayne Wilson, and Charles Cree begin work at IBM on VECTRAN, an extension to FORTRAN 66 with array-valued operators, functions, and I/O facilities. <br /> **Work begins at Compass Inc. on a parallelizing FORTRAN compiler for the ILLIAC-IV called IVTRAN. <br /> <br /> ===1970's===<br /> *1971<br /> **Intel produces the world's first single-chip CPU, the 4004 microprocessor. <br /> <br /> *1972<br /> **Seymour Cray leaves Control Data Corporation to found Cray Research Inc. CDC cancels the 8600 project, a follow-on to the 7600. <br /> **Quarter-sized (64 PEs) ILLIAC-IV installed at NASA Ames. Each processor has a peak speed of 4 MFLOPS; the machine's I/O system is capable of 500 Mbit/s. <br /> **Paper studies of massive bit-level parallelism done by Stewart Reddaway at ICL. These later lead to development of ICL DAP. <br /> <br /> *1974<br /> **Leslie Lamport's paper "Parallel Execution of Do-Loops" lays the theoretical foundation for most later research on automatic vectorization and shared-memory parallelization. Much of the work was done in 1971-2 while Lamport was at Compass Inc. <br /> **IBM delivers the first 3838 array processor, a general-purpose digital signal processor. 
<br /> <br /> *1975<br /> **ILLIAC-IV becomes operational at NASA Ames after concerted check-out effort. <br /> <br /> *1976<br /> **Cray Research delivers the first Freon-cooled CRAY-1 to Los Alamos National Laboratory. <br /> <br /> *1979<br /> **IBM's John Cocke designs the 801, the first of what are later called RISC architectures. <br /> <br /> ===1980's===<br /> *1980<br /> **PFC (Parallel FORTRAN Compiler) developed at Rice University under the direction of Ken Kennedy. <br /> **David Padua and David Kuck at the University of Illinois develop the DOACROSS parallel construct to be used as a target in program transformation. The name DOACROSS is due to Robert Kuhn. <br /> <br /> *1982<br /> **Steve Chen's group at Cray Research produces the first X-MP, containing two pipelined processors compatible with the CRAY-1 and shared memory. <br /> **ILLIAC-IV decommissioned. <br /> <br /> *1983<br /> **J. R. Allen's Ph.D. thesis at Rice University introduces the concepts of loop-carried and loop-independent dependencies, and formalizes the process of vectorization. <br /> **Scientific Computer Systems founded to design and market Cray-compatible minisupercomputers. <br /> **CRAY-1 with 1 processor achieves 12.5 MFLOPS on the 100x100 [http://searchdatacenter.techtarget.com/definition/Linpack-benchmark LINPACK benchmark]. <br /> <br /> *1984<br /> **The CRAY X-MP family is expanded to include 1- and 4-processor machines. A CRAY X-MP running CX-OS, the first Unix-like operating system for supercomputers, is delivered to NASA Ames. <br /> **CRAY X-MP with 1 processor achieves 21 MFLOPS on 100x100 LINPACK. <br /> <br /> *1985<br /> **Cray Research produces the CRAY-2, with four background processors, a single foreground processor, a 4.1 nsec clock cycle, and 256 Mword memory. The machine is cooled by an inert fluorocarbon previously used as a blood substitute. 
<br /> <br /> *1986<br /> **CRAY X-MP with 4 processors achieves 713 MFLOPS (against a peak of 840) on 1000x1000 LINPACK. <br /> **Alan Karp offers $100 prize to first person to demonstrate speedup of 200 or more on a general-purpose parallel processor. Benner, Gustafson, and Montry begin work to win it, and are later awarded the Gordon Bell Prize. <br /> <br /> *1987<br /> **The first Gordon Bell Prizes for parallel performance are awarded. The recipients are Benner, Gustafson, and Montry, for a speedup of 400-600 on a variety of applications running on a 1024-node nCUBE, and Chen, De Benedictis, Fox, Li, and Walker, for speedups of 39-458 on various hypercubes. <br /> <br /> *1988<br /> **John Gustafson and Gary Montry argue that Amdahl's Law can be invalidated by increasing problem size. <br /> **CRAY Y-MP with 1 processor achieves 74 MFLOPS on 100x100 LINPACK; the same machine with 8 processors achieves 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> <br /> *1989<br /> **CRAY Y-MP with 8 processors achieves 275 MFLOPS on 100x100 LINPACK, and 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> **Gordon Bell Prize for absolute performance awarded to a team from Mobil and Thinking Machines Corporation, who achieve 6 GFLOPS on a CM-2 Connection Machine; prize in price/performance category awarded to Emeagwali, who achieves 400 MFLOPS per million dollars on the same platform. <br /> **Seymour Cray leaves Cray Research to found Cray Computer Corporation. <br /> <br /> ===1990's===<br /> *1990<br /> **Cray Research, Inc., purchases Supertek Computers Inc., makers of the S-1, a minisupercomputer compatible with the CRAY X-MP. <br /> **Gordon Bell Prize in price/performance category awarded to Geist, Stocks, Ginatempo, and Shelton, who achieve 800 MFLOPS per million dollars in a high-temperature superconductivity program on a 128-node Intel iPSC/860. 
The prize in the compiler parallelization category is awarded to Sabot, Tennies, and Vasilevsky, who achieve 1.5 GFLOPS on a CM-2 Connection Machine with FORTRAN 90 code derived from FORTRAN 77. <br /> **National Energy Research Supercomputer Center (NERSC) at LLNL places order with Cray Computer Corporation for CRAY-3 supercomputer. The order includes a unique 8-processor CRAY-2 computer system that is installed in April. <br /> <br /> *1991<br /> **CRAY Y-MP C90 with 16 processors achieves 403 MFLOPS on 100x100 LINPACK; a Fujitsu VP-2600 with 1 processor achieves 4 GFLOPS (against a peak of 5 GFLOPS) on 1000x1000 LINPACK. <br /> <br /> *1993<br /> **Cray Research delivers a Y-MP M90 with 32 Gbyte of memory to the U.S. Government, after delivering a similar machine with 8 Gbyte of memory in the previous year to the Minnesota Supercomputer Center. <br /> <br /> ===References===<br /> http://ei.cs.vt.edu/~history/Parallel.html<br /> <br /> ==Vector Machines==<br /> <br /> First appearing in the 1970s, vector machines were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.<br /> <br /> The Solomon project at Westinghouse was one of the first efforts to use vector operations. Its CPU had a large number of ALUs that would each be fed different data each cycle. Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.<br /> <br /> An innovation came with the Cray-1 supercomputer in 1976. Its designers realized that large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. 
The Cray computer was only able to load 12 data points, but by completing multiple instructions before continuing the total number of memory accesses decreased. The Cray-1 could perform at 240 MFLOPS.<br /> <br /> ===References for this section===<br /> *Wikipedia, Vector processor http://en.wikipedia.org/w/index.php?title=Vector_processor&oldid=405209552<br /> *Wikipedia, Cray-1 http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730<br /> <br /> <br /> <br /> <br /> =Comparing the Data Parallel Model with the Shared Memory and Message Passing Models=<br /> <br /> Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References | Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.<br /> <br /> As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). 
In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.<br /> <br /> Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. [[#Definitions | ''SIMD (single-instruction-multiple-data)'']] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix: C for CUDA Example Code | Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model&mdash;like the message passing model&mdash;does not require hardware support.<br /> <br /> Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.<br /> <br /> Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. 
The regular data chunks also make it easier to reason about where to locate data and how to organize it.<br /> <br /> =Definitions=<br /> <br /> * ''Data parallel.'' A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.<br /> * ''Task parallel.'' A task parallel algorithm is composed of a set of differing tasks which operate on common data.<br /> * ''SIMD (single instruction, multiple data).'' A processor which executes a single instruction simultaneously on multiple data locations.<br /> * ''MIMD (multiple instruction, multiple data).'' A processor which executes multiple instructions simultaneously on multiple data locations.<br /> * ''PE.'' A Processing Element.<br /> * ''PU.'' A Processing Unit (synonymous with PE).<br /> * ''FLOPS.'' FLoating point Operations Per Second (a performance measure that allows comparisons between different parallel system architectures).<br /> <br /> =References=<br /> <br /> * David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.<br /> * Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.<br /> * Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 "Machine and collection abstractions for user-implemented data-parallel programming,"] ''Scientific Programming,'' 8(4):231-246, 2000.<br /> * W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 "Data parallel algorithms,"] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.<br /> * Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 "A comparison of message passing and shared memory architectures for data parallel programs,"] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 
94-105.<br /> * Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.<br /> <br /> =Appendix: C for CUDA Example Code=<br /> <br /> The following code is a data parallel implementation of the sequential Code 2.3 from [[#References | Solihin (2008)]] using [http://www.nvidia.com/object/cuda_learn.html C for CUDA]. It is presented to give an impression of programming for a SIMD architecture, but a detailed discussion is beyond the scope of this supplement. Ignoring memory allocation issues, the code is very similar to the data parallel example, Code 2.5 from [[#References | Solihin (2008)]], discussed earlier. The main difference is the presence of a control thread that sends the parallel tasks to the CUDA device.<br /> <br /> // Data parallel implementation of the example code using C for CUDA.<br /> <br /> #include <iostream><br /> <br /> __global__ void kernel(float* a, float* b, float* c, float* local_sum)<br /> {<br /> int id = threadIdx.x;<br /> int local_iter = 4;<br /> int start_iter = id * local_iter;<br /> int end_iter = start_iter + local_iter;<br /> <br /> // Begin data parallel section<br /> <br /> for (int i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum[id] = 0;<br /> for (int i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum[id] = local_sum[id] + a[i];<br /> <br /> // End data parallel section<br /> }<br /> <br /> int main()<br /> {<br /> // h_b and h_c are assumed to hold the input data (initialization omitted here).<br /> float h_a[8], h_b[8], h_c[8], h_sum[2];<br /> float *d_a, *d_b, *d_c, *d_sum;<br /> float sum;<br /> <br /> size_t size = 8 * sizeof(float);<br /> size_t size2 = 2 * sizeof(float);<br /> <br /> cudaMalloc((void**)&d_a, size);<br /> cudaMalloc((void**)&d_b, size);<br /> cudaMalloc((void**)&d_c, size);<br /> cudaMalloc((void**)&d_sum, size2);<br /> <br /> cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);<br /> cudaMemcpy(d_c, h_c, size, cudaMemcpyHostToDevice);<br /> <br /> kernel<<<1, 2>>>(d_a, 
d_b, d_c, d_sum);<br /> <br /> cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);<br /> cudaMemcpy(h_sum, d_sum, size2, cudaMemcpyDeviceToHost);<br /> <br /> sum = h_sum[0] + h_sum[1];<br /> std::cout << sum;<br /> <br /> cudaFree(d_a);<br /> cudaFree(d_b);<br /> cudaFree(d_c);<br /> cudaFree(d_sum);<br /> }</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_cl&diff=43602 CSC/ECE 506 Spring 2011/ch2 cl 2011-02-07T20:30:36Z

<p>Cslingaf: /* Definitions */</p> <hr /> <div>=Supplement to Chapter 2: The Data Parallel Programming Model=<br /> <br /> Chapter 2 of [[#References | Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [[#Definitions | ''data parallel'']] model, another commonly recognized parallel programming model covered in other treatments like [[#References | Foster (1995)]] and [[#References | Culler (1999)]]. Whereas the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can therefore be used in conjunction with either. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References | Solihin (2008)]]. The [[#Definitions | ''task parallel'']] model will also be introduced as a point of contrast.<br /> <br /> =Overview=<br /> <br /> Whereas the shared memory and message passing models focus on how parallel tasks access common data, the [[#Definitions | ''data parallel'']] model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data. <br /> [[#References | Hillis (1986)]] points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks. [[#References | Haveraaen (2000)]] also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. <br /> <br /> == Example of Data Parallel Programming Model ==<br /> <br /> This section shows a simple example adapted from the Solihin textbook (pp. 24-27) that illustrates the data-parallel programming model. 
Each of the code fragments below is written in pseudo-code style.<br /> <br /> <br /> Suppose we want to perform the following task on an array <code>a</code>: updating each element of <code>a</code> by the product of itself and its index, and adding together the elements of <code>a</code> into the variable <code>sum</code>. The corresponding code is shown below.<br /> <br /> <br /> // simple sequential task<br /> sum = 0;<br /> '''for''' (i = 0; i < a.length; i++)<br /> {<br /> a[i] = a[i] * i;<br /> sum = sum + a[i];<br /> }<br /> <br /> <br /> When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (in our example, data reorganization means summing up values across different processing elements). Since the data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples for the first part of the program, shared-memory version of the second part, and message passing for the second part, respectively.<br /> <br /> <br /> // data parallel programming: let each PE perform the same task on different pieces of distributed data<br /> pe_id = getid();<br /> my_sum = 0;<br /> '''for''' (i = pe_id; i < a.length; i += number_of_pe) //separate elements of the array are assigned to each PE <br /> {<br /> a[i] = a[i] * i;<br /> my_sum = my_sum + a[i]; //all PEs accumulate elements assigned to them into local variable my_sum<br /> }<br /> <br /> <br /> In the above code, data parallelism is achieved by letting each processing element perform actions on separate elements of the array, which are identified using the PE's id. 
For instance, if three processing elements are used then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Because there are three processing elements, the array index for each increases by three on each iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved rather than contiguous). If the length of the array is a multiple of three then each processing element performs the same number of iterations and so takes roughly the same amount of time to execute its portion of the task.<br /> <br /> <br /> The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.<br /> <br /> [[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming (adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]<br /> <br /> ==Task Parallel Overview==<br /> The logical opposite of data parallel is [[#Definitions | ''task parallel,'']] in which a number of distinct tasks operate on common data. 
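<br /> <br /> Before turning to a task parallel example, the interleaved assignment used in the data parallel example above can be checked with a short sketch in plain C. This is only an illustration under stated assumptions: the function names <code>pe_partial_sum</code> and <code>data_parallel_sum</code> are hypothetical (they do not come from the Solihin text), and the PEs are simulated one after another in a sequential loop rather than run in parallel.

```c
#include <stddef.h>

/* Work done by ONE processing element under the interleaved assignment:
   PE pe_id handles indices pe_id, pe_id + num_pe, pe_id + 2*num_pe, ...
   As in the pseudo-code above, each element is first multiplied by its
   index and then accumulated into a local partial sum.
   (Hypothetical helper; num_pe is assumed to be at least 1.) */
double pe_partial_sum(double *a, size_t n, size_t pe_id, size_t num_pe)
{
    double my_sum = 0.0;
    for (size_t i = pe_id; i < n; i += num_pe) {
        a[i] = a[i] * (double)i;
        my_sum += a[i];
    }
    return my_sum;
}

/* Sequential stand-in for the parallel run: every PE executes the same
   loop on its own elements, then the partial sums are combined. On a real
   machine this final reduction would use shared memory or message passing. */
double data_parallel_sum(double *a, size_t n, size_t num_pe)
{
    double sum = 0.0;
    for (size_t pe = 0; pe < num_pe; pe++)
        sum += pe_partial_sum(a, n, pe, num_pe);
    return sum;
}
```

For the 7-element, 3-PE case in the picture above, with every element initialized to 1, PE0 (indices 0, 3, 6) produces a partial sum of 9, PE1 (indices 1, 4) produces 5, PE2 (indices 2, 5) produces 7, and the combined sum is 21, matching the sequential code.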
<br /> <br /> ==Example of Task Parallel Programming Model==<br /> An example of a task parallel code which is functionally equivalent to the sequential and data parallel codes given above follows below.<br /> <br /> // Task parallel code.<br /> <br /> int id = getmyid(); // assume id = 0 for thread 0, id = 1 for thread 1<br /> <br /> if (id == 0)<br /> {<br /> for (i = 0; i < 8; i++)<br /> {<br /> a[i] = b[i] + c[i];<br /> send_msg(P1, a[i]);<br /> }<br /> }<br /> else<br /> {<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> {<br /> recv_msg(P0, a[i]);<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> }<br /> Print sum;<br /> }<br /> <br /> In the code above, work is divided into two parallel tasks. The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.'' The other sums the elements of ''a.'' These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed by each thread is different (rather than identical).<br /> <br /> Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated. This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements. In addition, unique tasks are likely to have significantly different run times, making it more challenging to balance load across processors. [[#References | Haveraaen (2000)]] also notes that task parallel algorithms are inherently more complex, requiring a greater degree of communication and synchronization. In the task parallel code above, after thread 0 computes an element of ''a'' it must send it to thread 1. To support this, sends and receives occur every iteration of the two loops, resulting in a total of 8 messages being sent between the threads. 
In contrast, the data parallel code sends only 2 messages, one at the beginning and one at the end. <br /> <br /> <br /> <br /> ==Comparison between Data and Task Parallel Programming Models==<br /> <br /> <br /> {| class="wikitable" border="1" align="center"<br /> |+ '''Comparison between data parallel and task parallel programming models.'''<br /> |-<br /> ! Aspects<br /> ! Data Parallel<br /> ! Task Parallel<br /> |-<br /> | Decomposition<br /> | Partition data into subsets<br /> | Partition program into subtasks<br /> |-<br /> | Parallel tasks<br /> | Identical<br /> | Unique<br /> |-<br /> | Degree of parallelism<br /> | Scales easily<br /> | Fixed<br /> |-<br /> | Load balancing<br /> | Easier<br /> | Harder<br /> |-<br /> | Communication overhead<br /> | Lower<br /> | Higher<br /> |}<br /> <br /> ===Synchronous vs Asynchronous===<br /> While the [http://en.wikipedia.org/wiki/Lockstep_(computing) lockstep] imposed by data parallelism on all data streams ensures synchronous computation (all PEs perform their tasks at the exact same pace), every processor in task parallelism performs its task at its own pace, which we call asynchronous computation. Thus, at certain points of a task parallel program's execution, communication and synchronization primitives are needed to allow different instruction streams to coordinate their efforts, and that is where variable-sharing and message-passing come into play.<br /> <br /> ===Determinism vs. Non-Determinism===<br /> Data parallelism's synchronous nature and task parallelism's asynchronism give rise to another pair of features that add to the difference between these two models: determinism versus non-determinism. Data parallelism is deterministic, i.e., computing with the same input will always yield the same result, since its synchronism ensures that issues like relative timing between PEs will not arise. 
In contrast, task parallelism's asynchronous updates of common data can give rise to non-determinism, i.e., the same input will not always yield the same computation result (the result of a computation will also depend on factors outside the program's control, such as scheduling and timing of other PEs). Non-determinism makes it harder to write and maintain correct programs. This partially explains the advantage of the data parallel programming model over task parallelism in terms of development effort (also discussed in section 4.2).<br /> <br /> ===Comparison Diagram===<br /> The following diagram may be of use in conceptually distinguishing between data parallelism (SIMD: Single Instruction, Multiple Data) and task parallelism (MIMD: Multiple Instruction, Multiple Data). In the SIMD case, a single instruction stream is issued to multiple processors, which then access multiple connections to the data. In contrast, the MIMD case has multiple instruction streams (evidenced by two groups of processors) which interact, again, with multiple connections to the data.<br /> [[Image:Smid.png|frame|center|425px|Contrast between data parallelism and task parallelism (PU (Processing Unit) and PE (Processing Element) are synonymous)]]<br /> <br /> =History of Parallel Programming Models=<br /> ==Interesting Dates in Parallel Computing History (with a focus on IBM, Cray, and ILLIAC)==<br /> ===1950's===<br /> *1955<br /> **IBM introduces the 704. Principal architect is Gene Amdahl; it is the first commercial machine with floating-point hardware, and is capable of approximately 5 kFLOPS. <br /> <br /> *1956<br /> **IBM starts 7030 project (known as STRETCH) to produce supercomputer for Los Alamos National Laboratory (LANL). Its goal is to produce a machine with 100 times the performance of any available at the time. <br /> <br /> *1958<br /> **Bull of France announces the Gamma 60 with multiple functional units and fork & join operations in its instruction set. 19 are later built. 
<br /> **John Cocke and Daniel Slotnick discuss use of parallelism in numerical calculations in an IBM research memo. Slotnick later proposes SOLOMON, a SIMD machine with 1024 1-bit PEs, each with memory for 128 32-bit values. The machine is never built, but the design is the starting point for much later work. <br /> ===1960's===<br /> *1960<br /> **Atlas computer becomes operational. It is the first machine to use virtual memory and paging; its instruction execution is pipelined, and it contains separate fixed- and floating-point arithmetic units, capable of approximately 200 kFLOPS. <br /> **Burroughs introduces the D825 symmetrical MIMD multiprocessor. 1 to 4 CPUs access 1 to 16 memory modules using a crossbar switch. The CPUs are similar to the later B5000; the operating system is symmetrical, with a shared ready queue. <br /> <br /> *1964<br /> **Daniel Slotnick proposes building a massively-parallel machine for the Lawrence Livermore National Laboratory (LLNL); the Atomic Energy Commission gives the contract to CDC instead, who build the STAR-100 to fulfil it. Slotnick's design funded by the Air Force, and evolves into the ILLIAC-IV. The machine is built at the University of Illinois, with Burroughs and Texas Instruments as primary subcontractors. Texas Instruments' Advanced Scientific Computer (ASC) also grows out of this initiative. <br /> <br /> *1966<br /> **Michael Flynn publishes a paper describing the architectural taxonomy which bears his name. <br /> <br /> *1967<br /> **IBM produces the 360/91 (later model 95) with dynamic instruction reordering. 20 of these are produced over the next several years; the line is eventually supplanted by the slower Model <br /> **Gene Amdahl and Daniel Slotnick have published debate at AFIPS Conference about the feasibility of parallel processing. 
Amdahl's argument about limits to parallelism becomes known as "Amdahl's Law"; he also propounds a corollary about system balance (sometimes called "Amdahl's Other Law"), which states that a balanced machine has the same number of MIPS, Mbytes, and Mbit/s of I/O bandwidth. <br /> <br /> *1968<br /> **IBM 2938 Array Processor delivered to Western Geophysical (who promptly paint racing stripes on it). First commercial machine to sustain 10 MFLOPS on 32-bit floating-point operations. A programmable digital signal processor, it proves very popular in the petroleum industry. <br /> **Edsger Dijkstra describes semaphores, and introduces the dining philosophers problem, which later becomes a standard example in concurrency theory. <br /> <br /> *1969<br /> **George Paul, M. Wayne Wilson, and Charles Cree begin work at IBM on VECTRAN, an extension to FORTRAN 66 with array-valued operators, functions, and I/O facilities. <br /> **Work begins at Compass Inc. on a parallelizing FORTRAN compiler for the ILLIAC-IV called IVTRAN. <br /> <br /> ===1970's===<br /> *1971<br /> **Intel produces the world's first single-chip CPU, the 4004 microprocessor. <br /> <br /> *1972<br /> **Seymour Cray leaves Control Data Corporation to found Cray Research Inc. CDC cancels the 8600 project, a follow-on to the 7600. <br /> **Quarter-sized (64 PEs) ILLIAC-IV installed at NASA Ames. Each processor has a peak speed of 4 MFLOPS; the machine's I/O system is capable of 500 Mbit/s. <br /> **Paper studies of massive bit-level parallelism done by Stewart Reddaway at ICL. These later lead to development of ICL DAP. <br /> <br /> *1974<br /> **Leslie Lamport's paper "Parallel Execution of Do-Loops" lays the theoretical foundation for most later research on automatic vectorization and shared-memory parallelization. Much of the work was done in 1971-2 while Lamport was at Compass Inc. <br /> **IBM delivers the first 3838 array processor, a general-purpose digital signal processor. 
<br /> <br /> *1975<br /> **ILLIAC-IV becomes operational at NASA Ames after concerted check-out effort. <br /> <br /> *1976<br /> **Cray Research delivers the first Freon-cooled CRAY-1 to Los Alamos National Laboratory. <br /> <br /> *1979<br /> **IBM's John Cocke designs the 801, the first of what are later called RISC architectures. <br /> <br /> ===1980's===<br /> *1980<br /> **PFC (Parallel FORTRAN Compiler) developed at Rice University under the direction of Ken Kennedy. <br /> **David Padua and David Kuck at the University of Illinois develop the DOACROSS parallel construct to be used as a target in program transformation. The name DOACROSS is due to Robert Kuhn. <br /> <br /> *1982<br /> **Steve Chen's group at Cray Research produces the first X-MP, containing two pipelined processors compatible with the CRAY-1 and shared memory. <br /> **ILLIAC-IV decommissioned. <br /> <br /> *1983<br /> **J. R. Allen's Ph.D. thesis at Rice University introduces the concepts of loop-carried and loop-independent dependencies, and formalizes the process of vectorization. <br /> **Scientific Computer Systems founded to design and market Cray-compatible minisupercomputers. <br /> **CRAY-1 with 1 processor achieves 12.5 MFLOPS on the 100x100 LINPACK benchmark. <br /> <br /> *1984<br /> **The CRAY X-MP family is expanded to include 1- and 4-processor machines. A CRAY X-MP running CX-OS, the first Unix-like operating system for supercomputers, is delivered to NASA Ames. <br /> **CRAY X-MP with 1 processor achieves 21 MFLOPS on 100x100 LINPACK. <br /> <br /> *1985<br /> **Cray Research produces the CRAY-2, with four background processors, a single foreground processor, a 4.1 nsec clock cycle, and 256 Mword memory. The machine is cooled by an inert fluorocarbon previously used as a blood substitute. <br /> <br /> *1986<br /> **CRAY X-MP with 4 processors achieves 713 MFLOPS (against a peak of 840) on 1000x1000 LINPACK. 
<br /> **Alan Karp offers $100 prize to first person to demonstrate speedup of 200 or more on a general-purpose parallel processor. Benner, Gustafson, and Montry begin work to win it, and are later awarded the Gordon Bell Prize. <br /> <br /> *1987<br /> **The first Gordon Bell Prizes for parallel performance are awarded. The recipients are Benner, Gustafson, and Montry, for a speedup of 400-600 on a variety of applications running on a 1024-node nCUBE, and Chen, De Benedictis, Fox, Li, and Walker, for speedups of 39-458 on various hypercubes. <br /> <br /> *1988<br /> **John Gustafson and Gary Montry argue that Amdahl's Law can be invalidated by increasing problem size. <br /> **CRAY Y-MP with 1 processor achieves 74 MFLOPS on 100x100 LINPACK; the same machine with 8 processors achieves 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> <br /> *1989<br /> **CRAY Y-MP with 8 processors achieves 275 MFLOPS on 100x100 LINPACK, and 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> **Gordon Bell Prize for absolute performance awarded to a team from Mobil and Thinking Machines Corporation, who achieve 6 GFLOPS on a CM-2 Connection Machine; prize in price/performance category awarded to Emeagwali, who achieves 400 MFLOPS per million dollars on the same platform. <br /> **Seymour Cray leaves Cray Research to found Cray Computer Corporation. <br /> <br /> ===1990's===<br /> *1990<br /> **Cray Research, Inc., purchases Supertek Computers Inc., makers of the S-1, a minisupercomputer compatible with the CRAY X-MP. <br /> **Gordon Bell Prize in price/performance category awarded to Geist, Stocks, Ginatempo, and Shelton, who achieve 800 MFLOPS per million dollars in a high-temperature superconductivity program on a 128-node Intel iPSC/860. The prize in the compiler parallelization category is awarded to Sabot, Tennies, and Vasilevsky, who achieve 1.5 GFLOPS on a CM-2 Connection Machine with FORTRAN 90 code derived from FORTRAN 77. 
<br /> **National Energy Research Supercomputer Center (NERSC) at LLNL places order with Cray Computer Corporation for CRAY-3 supercomputer. The order includes a unique 8-processor CRAY-2 computer system that is installed in April. <br /> <br /> *1991<br /> **CRAY Y-MP C90 with 16 processors achieves 403 MFLOPS on 100x100 LINPACK; a Fujitsu VP-2600 with 1 processor achieves 4 GFLOPS (against a peak of 5 GFLOPS) on 1000x1000 LINPACK. <br /> <br /> *1993<br /> **Cray Research delivers a Y-MP M90 with 32 Gbyte of memory to the U.S. Government, after delivering a similar machine with 8 Gbyte of memory in the previous year to the Minnesota Supercomputer Center. <br /> <br /> ===References===<br /> http://ei.cs.vt.edu/~history/Parallel.html<br /> <br /> ==Vector Machines==<br /> <br /> First appearing in the 1970s, vector machines were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.<br /> <br /> The Solomon project at Westinghouse was one of the first efforts to use vector operations. Its CPU had a large number of ALUs that would each be fed different data each cycle. Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.<br /> <br /> An innovation came with the Cray-1 supercomputer in 1976. Its designers realized that large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray computer was only able to load 12 data points, but by completing multiple instructions before continuing the total number of memory accesses decreased. 
The Cray-1 could perform at 240 MFLOPS.<br /> <br /> ===References for this section===<br /> *Wikipedia, Vector processor http://en.wikipedia.org/w/index.php?title=Vector_processor&oldid=405209552<br /> *Wikipedia, Cray-1 http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730<br /> <br /> <br /> <br /> <br /> =Comparing the Data Parallel Model with the Shared Memory and Message Passing Models=<br /> <br /> Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References | Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.<br /> <br /> As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.<br /> <br /> Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. 
[[#Definitions | ''SIMD (single-instruction-multiple-data)'']] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix: C for CUDA Example Code | Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model&mdash;like the message passing model&mdash;does not require hardware support.<br /> <br /> Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.<br /> <br /> Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. 
The regular data chunks also make it easier to reason about where to locate data and how to organize it.<br /> <br /> =Definitions=<br /> <br /> * ''Data parallel.'' A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.<br /> * ''Task parallel.'' A task parallel algorithm is composed of a set of differing tasks which operate on common data.<br /> * ''SIMD (single instruction, multiple-data).'' A processor which executes a single instruction simultaneously on multiple data locations.<br /> * ''MIMD (multiple instruction, multiple data).'' A processor which executes multiple instructions simultaneously on multiple data locations<br /> * ''PE'' A Processing Element<br /> * ''PU'' A Processing Unit (synonymous with PE)<br /> * ''FLOPS'' FLoating point Operations Per Second (a 'benchmark' allowing comparisons between different parallel system architectures)<br /> <br /> =References=<br /> <br /> * David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan-Kauffman, 1999.<br /> * Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.<br /> * Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 "Machine and collection abstractions for user-implemented data-parallel programming,"] ''Scientific Programming,'' 8(4):231-246, 2000.<br /> * W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 "Data parallel algorithms,"] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.<br /> * Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 "A comparison of message passing and shared memory architectures for data parallel programs,"] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 
94-105.<br /> * Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.<br /> <br /> =Appendix: C for CUDA Example Code=<br /> <br /> The following code is a data parallel implementation of the sequential Code 2.3 from [[#References | Solihin (2008)]] using [http://www.nvidia.com/object/cuda_learn.html C for CUDA]. It is presented to give an impression of programming for a SIMD architecture, but a detailed discussion is beyond the scope of this supplement. Ignoring memory allocation issues, the code is very similar to the data parallel example, Code 2.5 from [[#References | Solihin (2008)]], discussed earlier. The main difference is the presence of a control thread that sends the parallel tasks to the CUDA device.<br /> <br /> // Data parallel implementation of the example code using C for CUDA.<br /> <br /> #include <iostream><br /> <br /> __global__ void kernel(float* a, float* b, float* c, float* local_sum)<br /> {<br /> int id = threadIdx.x;<br /> int local_iter = 4;<br /> int start_iter = id * local_iter;<br /> int end_iter = start_iter + local_iter;<br /> <br /> // Begin data parallel section<br /> <br /> for (int i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum[id] = 0;<br /> for (int i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum[id] = local_sum[id] + a[i];<br /> <br /> // End data parallel section<br /> }<br /> <br /> int main()<br /> {<br /> float h_a[8], h_b[8], h_c[8], h_sum[2];<br /> float *d_a, *d_b, *d_c, *d_sum;<br /> float sum;<br /> <br /> // Example input values (assumed here so that the host arrays are initialized)<br /> for (int i = 0; i < 8; i++)<br /> {<br /> h_b[i] = (float)i;<br /> h_c[i] = 1.0f;<br /> }<br /> <br /> size_t size = 8 * sizeof(float);<br /> size_t size2 = 2 * sizeof(float);<br /> <br /> // Allocate device memory for the arrays and the per-thread sums<br /> cudaMalloc((void**)&d_a, size);<br /> cudaMalloc((void**)&d_b, size);<br /> cudaMalloc((void**)&d_c, size);<br /> cudaMalloc((void**)&d_sum, size2);<br /> <br /> // Copy the input arrays to the device<br /> cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);<br /> cudaMemcpy(d_c, h_c, size, cudaMemcpyHostToDevice);<br /> <br /> // Launch one block of two threads<br /> kernel<<<1, 2>>>(d_a, 
d_b, d_c, d_sum);<br /> <br /> cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);<br /> cudaMemcpy(h_sum, d_sum, size2, cudaMemcpyDeviceToHost);<br /> <br /> sum = h_sum[0] + h_sum[1];<br /> std::cout << sum;<br /> <br /> cudaFree(d_a);<br /> cudaFree(d_b);<br /> cudaFree(d_c);<br /> cudaFree(d_sum);<br /> }</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_cl&diff=43601 CSC/ECE 506 Spring 2011/ch2 cl 2011-02-07T20:27:09Z

<p>Cslingaf: corrected graphic subtitle</p> <hr /> <div>=Supplement to Chapter 2: The Data Parallel Programming Model=<br /> <br /> Chapter 2 of [[#References | Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [[#Definitions | ''data parallel'']] model, another commonly recognized parallel programming model covered in other treatments like [[#References | Foster (1995)]] and [[#References | Culler (1999)]]. Whereas the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can therefore be used in conjunction with either. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References | Solihin (2008)]]. The [[#Definitions | ''task parallel'']] model will also be introduced as a point of contrast.<br /> <br /> =Overview=<br /> <br /> Whereas the shared memory and message passing models focus on how parallel tasks access common data, the [[#Definitions | ''data parallel'']] model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data. <br /> [[#References | Hillis (1986)]] points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks. [[#References | Haveraaen (2000)]] also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. <br /> <br /> == Example of Data Parallel Programming Model ==<br /> <br /> This section shows a simple example adapted from the Solihin textbook (pp. 24-27) that illustrates the data-parallel programming model. 
Each of the code fragments below is written in pseudo-code style.<br /> <br /> <br /> Suppose we want to perform the following task on an array <code>a</code>: updating each element of <code>a</code> by the product of itself and its index, and adding together the elements of <code>a</code> into the variable <code>sum</code>. The corresponding code is shown below.<br /> <br /> <br /> // simple sequential task<br /> sum = 0;<br /> '''for''' (i = 0; i < a.length; i++)<br /> {<br /> a[i] = a[i] * i;<br /> sum = sum + a[i];<br /> }<br /> <br /> <br /> When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (in our example, data reorganization means summing up values across different processing elements). Since the data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The code fragment below is an example of the first part of the program; the second part is not shown here.<br /> <br /> <br /> // data parallel programming: let each PE perform the same task on different pieces of distributed data<br /> pe_id = getid();<br /> my_sum = 0;<br /> '''for''' (i = pe_id; i < a.length; i += number_of_pe) //separate elements of the array are assigned to each PE <br /> {<br /> a[i] = a[i] * i;<br /> my_sum = my_sum + a[i]; //all PEs accumulate elements assigned to them into local variable my_sum<br /> }<br /> <br /> <br /> In the above code, data parallelism is achieved by letting each processing element perform actions on separate elements of the array, which are identified using the PE's id. 
For instance, if three processing elements are used, then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements, the array index used by each one increases by three on each iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved instead of contiguous). If the length of the array is a multiple of three, then each processing element handles the same number of elements and so does the same amount of work.<br /> <br /> <br /> The picture below illustrates how elements of the array are assigned among different PEs for the specific case in which the length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.<br /> <br /> [[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming (adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]<br /> <br /> ==Task Parallel Overview==<br /> The logical opposite of data parallel is [[#Definitions | ''task parallel,'']] in which a number of distinct tasks operate on common data. 
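For comparison with the task parallel code that follows, the interleaved data parallel assignment illustrated above (7 elements, 3 PEs) can be reproduced in plain C. This is only a minimal sketch under stated assumptions: the function names <code>pe_local_sum</code> and <code>data_parallel_sum</code> are hypothetical and do not come from Solihin (2008), the PEs are simulated one after another rather than run in parallel, and the final loop stands in for whatever shared memory or message passing step would combine the local sums.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the source): runs the data parallel loop
   body for one PE over its interleaved elements and returns its local sum. */
int pe_local_sum(int pe_id, int num_pe, int a[], size_t n)
{
    assert(num_pe > 0 && pe_id >= 0 && pe_id < num_pe);
    int my_sum = 0;
    for (size_t i = (size_t)pe_id; i < n; i += (size_t)num_pe) {
        a[i] = a[i] * (int)i;   /* same update as the sequential code */
        my_sum += a[i];         /* accumulate into this PE's local sum */
    }
    return my_sum;
}

/* Simulates the PEs one after another (no real parallelism) and then
   performs the second part of the program: combining the local sums. */
int data_parallel_sum(int num_pe, int a[], size_t n)
{
    int sum = 0;
    for (int pe = 0; pe < num_pe; pe++)
        sum += pe_local_sum(pe, num_pe, a, n);
    return sum;
}
```

With a seven-element array of ones and three PEs, PE0 touches indices 0, 3, 6, PE1 touches 1, 4, and PE2 touches 2, 5. The local sums are 9, 5, and 7, and the combined sum is 21, the same value the sequential loop produces.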
<br /> <br /> ==Example of Task Parallel Programming Model==<br /> An example of a task parallel code which is functionally equivalent to the sequential and data parallel codes given above follows below.<br /> <br /> // Task parallel code.<br /> <br /> int id = getmyid(); // assume id = 0 for thread 0, id = 1 for thread 1<br /> <br /> if (id == 0)<br /> {<br /> for (i = 0; i < 8; i++)<br /> {<br /> a[i] = b[i] + c[i];<br /> send_msg(P1, a[i]);<br /> }<br /> }<br /> else<br /> {<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> {<br /> recv_msg(P0, a[i]);<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> }<br /> Print sum;<br /> }<br /> <br /> In the code above, work is divided into two parallel tasks. The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.'' The other sums the elements of ''a.'' These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed by each thread is different (rather than identical).<br /> <br /> Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated. This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements. In addition, unique tasks are likely to have significantly different run times, making it more challenging to balance load across processors. [[#References | Haveraaen (2000)]] also notes that task parallel algorithms are inherently more complex, requiring a greater degree of communication and synchronization. In the task parallel code above, after thread 0 computes an element of ''a'' it must send it to thread 1. To support this, sends and receives occur every iteration of the two loops, resulting in a total of 8 messages being sent between the threads. 
In contrast, the data parallel code sends only 2 messages, one at the beginning and one at the end. <br /> <br /> <br /> <br /> ==Comparison between Data and Task Parallel Programming Models==<br /> <br /> <br /> {| class="wikitable" border="1" align="center"<br /> |+ '''Comparison between data parallel and task parallel programming models.'''<br /> |-<br /> ! Aspects<br /> ! Data Parallel<br /> ! Task Parallel<br /> |-<br /> | Decomposition<br /> | Partition data into subsets<br /> | Partition program into subtasks<br /> |-<br /> | Parallel tasks<br /> | Identical<br /> | Unique<br /> |-<br /> | Degree of parallelism<br /> | Scales easily<br /> | Fixed<br /> |-<br /> | Load balancing<br /> | Easier<br /> | Harder<br /> |-<br /> | Communication overhead<br /> | Lower<br /> | Higher<br /> |}<br /> <br /> ===Synchronous vs Asynchronous===<br /> While the [http://en.wikipedia.org/wiki/Lockstep_(computing) lockstep] imposed by data parallelism on all data streams ensures synchronous computation (all PEs perform their tasks at the exact same pace), every processor in task parallelism performs its task at its own pace, which we call asynchronous computation. Thus, at certain points of a task parallel program's execution, communication and synchronization primitives are needed to allow different instruction streams to coordinate their efforts, and that is where variable-sharing and message-passing come into play.<br /> <br /> ===Determinism vs. Non-Determinism===<br /> Data parallelism's synchronous nature and task parallelism's asynchronism give rise to another pair of features that add to the difference between these two models: determinism versus non-determinism. Data parallelism is deterministic, i.e., computing with the same input will always yield the same result, since its synchronism ensures that issues like relative timing between PEs will not arise. 
In contrast, task parallelism's asynchronous updates of common data can give rise to non-determinism, i.e., the same input won't always yield the same computation result (the result of a computation will depend also on factors outside the program's control, such as scheduling and timing of other PEs). Obviously, non-determinism makes it harder to write and maintain correct programs. This partially explains the advantage of the data parallel programming model over task parallelism in terms of development effort (also discussed in section 4.2).<br /> <br /> ===Comparison Diagram===<br /> The following diagram may be of use in conceptually distinguishing between data parallelism (SIMD: Single Instruction, Multiple Data) and task parallelism (MIMD: Multiple Instruction, Multiple Data). In the SIMD case, a single instruction stream is issued to multiple processors, which then access the data through multiple connections. In contrast, the MIMD case has multiple instruction streams (evidenced by two groups of processors) which interact, again, with multiple connections to the data.<br /> [[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism (PU (Processing Unit) and PE (Processing Element) are synonymous)]]<br /> <br /> =History of Parallel Programming Models=<br /> ==Interesting Dates in Parallel Computing History (with a focus on IBM, Cray, and ILLIAC)==<br /> ===1950's===<br /> *1955<br /> **IBM introduces the 704. Principal architect is Gene Amdahl; it is the first commercial machine with floating-point hardware, and is capable of approximately 5 kFLOPS. <br /> <br /> *1956<br /> **IBM starts 7030 project (known as STRETCH) to produce supercomputer for Los Alamos National Laboratory (LANL). Its goal is to produce a machine with 100 times the performance of any available at the time. <br /> <br /> *1958<br /> **Bull of France announces the Gamma 60 with multiple functional units and fork & join operations in its instruction set. 19 are later built. 
<br /> **John Cocke and Daniel Slotnick discuss use of parallelism in numerical calculations in an IBM research memo. Slotnick later proposes SOLOMON, a SIMD machine with 1024 1-bit PEs, each with memory for 128 32-bit values. The machine is never built, but the design is the starting point for much later work. <br /> ===1960's===<br /> *1960<br /> **Atlas computer becomes operational. It is the first machine to use virtual memory and paging; its instruction execution is pipelined, and it contains separate fixed- and floating-point arithmetic units, capable of approximately 200 kFLOPS. <br /> **Burroughs introduces the D825 symmetrical MIMD multiprocessor. 1 to 4 CPUs access 1 to 16 memory modules using a crossbar switch. The CPUs are similar to the later B5000; the operating system is symmetrical, with a shared ready queue. <br /> <br /> *1964<br /> **Daniel Slotnick proposes building a massively-parallel machine for the Lawrence Livermore National Laboratory (LLNL); the Atomic Energy Commission gives the contract to CDC instead, who build the STAR-100 to fulfil it. Slotnick's design is funded by the Air Force and evolves into the ILLIAC-IV. The machine is built at the University of Illinois, with Burroughs and Texas Instruments as primary subcontractors. Texas Instruments' Advanced Scientific Computer (ASC) also grows out of this initiative. <br /> <br /> *1966<br /> **Michael Flynn publishes a paper describing the architectural taxonomy which bears his name. <br /> <br /> *1967<br /> **IBM produces the 360/91 (later model 95) with dynamic instruction reordering. 20 of these are produced over the next several years; the line is eventually supplanted by the slower Model <br /> **Gene Amdahl and Daniel Slotnick have a published debate at the AFIPS Conference about the feasibility of parallel processing. 
Amdahl's argument about limits to parallelism becomes known as "Amdahl's Law"; he also propounds a corollary about system balance (sometimes called "Amdahl's Other Law"), which states that a balanced machine has the same number of MIPS, Mbytes, and Mbit/s of I/O bandwidth. <br /> <br /> *1968<br /> **IBM 2938 Array Processor delivered to Western Geophysical (who promptly paint racing stripes on it). First commercial machine to sustain 10 MFLOPS on 32-bit floating-point operations. A programmable digital signal processor, it proves very popular in the petroleum industry. <br /> **Edsger Dijkstra describes semaphores, and introduces the dining philosophers problem, which later becomes a standard example in concurrency theory. <br /> <br /> *1969<br /> **George Paul, M. Wayne Wilson, and Charles Cree begin work at IBM on VECTRAN, an extension to FORTRAN 66 with array-valued operators, functions, and I/O facilities. <br /> **Work begins at Compass Inc. on a parallelizing FORTRAN compiler for the ILLIAC-IV called IVTRAN. <br /> <br /> ===1970's===<br /> *1971<br /> **Intel produces the world's first single-chip CPU, the 4004 microprocessor. <br /> <br /> *1972<br /> **Seymour Cray leaves Control Data Corporation to found Cray Research Inc. CDC cancels the 8600 project, a follow-on to the 7600. <br /> **Quarter-sized (64 PEs) ILLIAC-IV installed at NASA Ames. Each processor has a peak speed of 4 MFLOPS; the machine's I/O system is capable of 500 Mbit/s. <br /> **Paper studies of massive bit-level parallelism done by Stewart Reddaway at ICL. These later lead to development of ICL DAP. <br /> <br /> *1974<br /> **Leslie Lamport's paper "Parallel Execution of Do-Loops" lays the theoretical foundation for most later research on automatic vectorization and shared-memory parallelization. Much of the work was done in 1971-2 while Lamport was at Compass Inc. <br /> **IBM delivers the first 3838 array processor, a general-purpose digital signal processor. 
<br /> <br /> *1975<br /> **ILLIAC-IV becomes operational at NASA Ames after concerted check-out effort. <br /> <br /> *1976<br /> **Cray Research delivers the first Freon-cooled CRAY-1 to Los Alamos National Laboratory. <br /> <br /> *1979<br /> **IBM's John Cocke designs the 801, the first of what are later called RISC architectures. <br /> <br /> ===1980's===<br /> *1980<br /> **PFC (Parallel FORTRAN Compiler) developed at Rice University under the direction of Ken Kennedy. <br /> **David Padua and David Kuck at the University of Illinois develop the DOACROSS parallel construct to be used as a target in program transformation. The name DOACROSS is due to Robert Kuhn. <br /> <br /> *1982<br /> **Steve Chen's group at Cray Research produces the first X-MP, containing two pipelined processors compatible with the CRAY-1 and shared memory. <br /> **ILLIAC-IV decommissioned. <br /> <br /> *1983<br /> **J. R. Allen's Ph.D. thesis at Rice University introduces the concepts of loop-carried and loop-independent dependencies, and formalizes the process of vectorization. <br /> **Scientific Computer Systems founded to design and market Cray-compatible minisupercomputers. <br /> **CRAY-1 with 1 processor achieves 12.5 MFLOPS on the 100x100 LINPACK benchmark. <br /> <br /> *1984<br /> **The CRAY X-MP family is expanded to include 1- and 4-processor machines. A CRAY X-MP running CX-OS, the first Unix-like operating system for supercomputers, is delivered to NASA Ames. <br /> **CRAY X-MP with 1 processor achieves 21 MFLOPS on 100x100 LINPACK. <br /> <br /> *1985<br /> **Cray Research produces the CRAY-2, with four background processors, a single foreground processor, a 4.1 nsec clock cycle, and 256 Mword memory. The machine is cooled by an inert fluorocarbon previously used as a blood substitute. <br /> <br /> *1986<br /> **CRAY X-MP with 4 processors achieves 713 MFLOPS (against a peak of 840) on 1000x1000 LINPACK. 
<br /> **Alan Karp offers a $100 prize to the first person to demonstrate a speedup of 200 or more on a general purpose parallel processor. Benner, Gustafson, and Montry begin work to win it, and are later awarded the Gordon Bell Prize. <br /> <br /> *1987<br /> **The first Gordon Bell Prizes for parallel performance are awarded. The recipients are Benner, Gustafson, and Montry, for a speedup of 400-600 on a variety of applications running on a 1024-node nCUBE, and Chen, De Benedictis, Fox, Li, and Walker, for speedups of 39-458 on various hypercubes. <br /> <br /> *1988<br /> **John Gustafson and Gary Montry argue that Amdahl's Law can be invalidated by increasing problem size. <br /> **CRAY Y-MP with 1 processor achieves 74 MFLOPS on 100x100 LINPACK; the same machine with 8 processors achieves 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> <br /> *1989<br /> **CRAY Y-MP with 8 processors achieves 275 MFLOPS on 100x100 LINPACK, and 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> **Gordon Bell Prize for absolute performance awarded to a team from Mobil and Thinking Machines Corporation, who achieve 6 GFLOPS on a CM-2 Connection Machine; prize in price/performance category awarded to Emeagwali, who achieves 400 MFLOPS per million dollars on the same platform. <br /> **Seymour Cray leaves Cray Research to found Cray Computer Corporation. <br /> <br /> ===1990's===<br /> *1990<br /> **Cray Research, Inc., purchases Supertek Computers Inc., makers of the S-1, a minisupercomputer compatible with the CRAY X-MP. <br /> **Gordon Bell Prize in price/performance category awarded to Geist, Stocks, Ginatempo, and Shelton, who achieve 800 MFLOPS per million dollars in a high-temperature superconductivity program on a 128-node Intel iPSC/860. The prize in the compiler parallelization category is awarded to Sabot, Tennies, and Vasilevsky, who achieve 1.5 GFLOPS on a CM-2 Connection Machine with FORTRAN 90 code derived from FORTRAN 77. 
<br /> **National Energy Research Supercomputer Center (NERSC) at LLNL places an order with Cray Computer Corporation for a CRAY-3 supercomputer. The order includes a unique 8-processor CRAY-2 computer system that is installed in April. <br /> <br /> *1991<br /> **CRAY Y-MP C90 with 16 processors achieves 403 MFLOPS on 100x100 LINPACK; a Fujitsu VP-2600 with 1 processor achieves 4 GFLOPS (against a peak of 5 GFLOPS) on 1000x1000 LINPACK. <br /> <br /> *1993<br /> **Cray Research delivers a Y-MP M90 with 32 Gbyte of memory to the U.S. Government, after delivering a similar machine with 8 Gbyte of memory in the previous year to the Minnesota Supercomputer Center. <br /> <br /> ===References===<br /> http://ei.cs.vt.edu/~history/Parallel.html<br /> <br /> ==Vector Machines==<br /> <br /> First appearing in the 1970s, vector machines were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.<br /> <br /> The Solomon project at Westinghouse was one of the first efforts to use vector operations. Its CPU had a large number of ALUs that would each be fed different data each cycle. Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.<br /> <br /> An innovation came with the Cray-1 supercomputer in 1976. It was realized that large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray computer was only able to load 12 data points, but by completing multiple instructions before continuing, the total number of memory accesses decreased. 
The Cray-1 could perform at 240 MFLOPS.<br /> <br /> ===References for this section===<br /> *Wikipedia, Vector processor http://en.wikipedia.org/w/index.php?title=Vector_processor&oldid=405209552<br /> *Wikipedia, Cray-1 http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730<br /> <br /> <br /> <br /> <br /> =Comparing the Data Parallel Model with the Shared Memory and Message Passing Models=<br /> <br /> Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References | Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.<br /> <br /> As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.<br /> <br /> Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. 
[[#Definitions | ''SIMD (single-instruction-multiple-data)'']] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix: C for CUDA Example Code | Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model&mdash;like the message passing model&mdash;does not require hardware support.<br /> <br /> Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.<br /> <br /> Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. 
The regular data chunks also make it easier to reason about where to locate data and how to organize it.<br /> <br /> =Definitions=<br /> <br /> * ''Data parallel.'' A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.<br /> * ''Task parallel.'' A task parallel algorithm is composed of a set of differing tasks which operate on common data.<br /> * ''SIMD (single instruction, multiple data).'' A processor which executes a single instruction simultaneously on multiple data locations.<br /> * ''MIMD (multiple instruction, multiple data).'' A processor which executes multiple instructions simultaneously on multiple data locations.<br /> * ''PE.'' A processing element.<br /> * ''PU.'' A processing unit (synonymous with PE).<br /> <br /> =References=<br /> <br /> * David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.<br /> * Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.<br /> * Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 "Machine and collection abstractions for user-implemented data-parallel programming,"] ''Scientific Programming,'' 8(4):231-246, 2000.<br /> * W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 "Data parallel algorithms,"] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.<br /> * Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 "A comparison of message passing and shared memory architectures for data parallel programs,"] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 
94-105.<br /> * Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.<br /> <br /> =Appendix: C for CUDA Example Code=<br /> <br /> The following code is a data parallel implementation of the sequential Code 2.3 from [[#References | Solihin (2008)]] using [http://www.nvidia.com/object/cuda_learn.html C for CUDA]. It is presented to give an impression of programming for a SIMD architecture, but a detailed discussion is beyond the scope of this supplement. Ignoring memory allocation issues, the code is very similar to the data parallel example, Code 2.5 from [[#References | Solihin (2008)]], discussed earlier. The main difference is the presence of a control thread that sends the parallel tasks to the CUDA device.<br /> <br /> // Data parallel implementation of the example code using C for CUDA.<br /> <br /> #include <iostream><br /> <br /> // Each device thread handles a contiguous chunk of 4 elements.<br /> __global__ void kernel(float* a, float* b, float* c, float* local_sum)<br /> {<br /> int id = threadIdx.x;<br /> int local_iter = 4;<br /> int start_iter = id * local_iter;<br /> int end_iter = start_iter + local_iter;<br /> <br /> // Begin data parallel section<br /> <br /> for (int i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum[id] = 0;<br /> for (int i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum[id] = local_sum[id] + a[i];<br /> <br /> // End data parallel section<br /> }<br /> <br /> int main()<br /> {<br /> float h_a[8], h_b[8], h_c[8], h_sum[2];<br /> float *d_a, *d_b, *d_c, *d_sum;<br /> float sum;<br /> <br /> size_t size = 8 * sizeof(float);<br /> size_t size2 = 2 * sizeof(float);<br /> <br /> cudaMalloc((void**)&d_a, size);<br /> cudaMalloc((void**)&d_b, size);<br /> cudaMalloc((void**)&d_c, size);<br /> cudaMalloc((void**)&d_sum, size2);<br /> <br /> // h_b and h_c are assumed to have been filled with input data before these copies.<br /> cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);<br /> cudaMemcpy(d_c, h_c, size, cudaMemcpyHostToDevice);<br /> <br /> kernel<<<1, 2>>>(d_a, 
d_b, d_c, d_sum);<br /> <br /> cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);<br /> cudaMemcpy(h_sum, d_sum, size2, cudaMemcpyDeviceToHost);<br /> <br /> sum = h_sum[0] + h_sum[1];<br /> std::cout << sum;<br /> <br /> cudaFree(d_a);<br /> cudaFree(d_b);<br /> cudaFree(d_c);<br /> cudaFree(d_sum);<br /> }</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_cl&diff=43600 CSC/ECE 506 Spring 2011/ch2 cl 2011-02-07T20:23:31Z

<p>Cslingaf: Added PE/PU definitions</p> <hr /> <div>=Supplement to Chapter 2: The Data Parallel Programming Model=<br /> <br /> Chapter 2 of [[#References | Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [[#Definitions | ''data parallel'']] model, another commonly recognized parallel programming model covered in other treatments like [[#References | Foster (1995)]] and [[#References | Culler (1999)]]. Whereas the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can therefore be used in conjunction with either. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References | Solihin (2008)]]. The [[#Definitions | ''task parallel'']] model will also be introduced as a point of contrast.<br /> <br /> =Overview=<br /> <br /> Whereas the shared memory and message passing models focus on how parallel tasks access common data, the [[#Definitions | ''data parallel'']] model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data. <br /> [[#References | Hillis (1986)]] points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks. [[#References | Haveraaen (2000)]] also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. <br /> <br /> == Example of Data Parallel Programming Model ==<br /> <br /> This section shows a simple example, adapted from the Solihin textbook (pp. 24-27), that illustrates the data-parallel programming model. 
Each of the code fragments below is written in a pseudo-code style.<br /> <br /> <br /> Suppose we want to perform the following task on an array <code>a</code>: updating each element of <code>a</code> by the product of itself and its index, and adding together the elements of <code>a</code> into the variable <code>sum</code>. The corresponding code is shown below.<br /> <br /> <br /> // simple sequential task<br /> sum = 0;<br /> '''for''' (i = 0; i < a.length; i++)<br /> {<br /> a[i] = a[i] * i;<br /> sum = sum + a[i];<br /> }<br /> <br /> <br /> When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (in our example, data reorganization means summing up the values held by the different processing elements). Since the data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished through either shared memory or message passing. The code fragment below shows the first part of the program.<br /> <br /> <br /> // data parallel programming: let each PE perform the same task on different pieces of distributed data<br /> pe_id = getid();<br /> my_sum = 0;<br /> '''for''' (i = pe_id; i < a.length; i += number_of_pe) //separate elements of the array are assigned to each PE <br /> {<br /> a[i] = a[i] * i;<br /> my_sum = my_sum + a[i]; //all PEs accumulate elements assigned to them into local variable my_sum<br /> }<br /> <br /> <br /> In the above code, data parallelism is achieved by letting each processing element act on separate elements of the array, which are identified using the PE's id. 
For instance, if three processing elements are used then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements then the index of the array for each will increase by three on each iteration until the task is complete (note that in our example elements assigned to each PE are interleaved instead of continuous). If the length of the array is a multiple of three then each processing element takes the same amount of time to execute its portion of the task.<br /> <br /> <br /> The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.<br /> <br /> [[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming(adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]<br /> <br /> ==Task Parallel Overview==<br /> The logical opposite of data parallel is [[#Definitions | ''task parallel,'']] in which a number of distinct tasks operate on common data. 
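<br /> <br /> Before moving on to the task parallel example, the interleaved assignment and the final summation from the data parallel example above can be sketched in plain C++. This is only an illustrative sketch under stated assumptions: the function names <code>indices_for_pe</code>, <code>pe_work</code>, and <code>data_parallel_sum</code> are hypothetical and do not come from Solihin (2008), and the PEs are simulated one after another in an ordinary loop rather than run on separate processors, so the sketch shows the division of work and the reduction step, not real parallel execution.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Indices handled by one PE under the interleaved assignment
// i = pe_id, pe_id + num_pe, pe_id + 2 * num_pe, ...
// (hypothetical helper, named here only for illustration)
std::vector<std::size_t> indices_for_pe(std::size_t pe_id,
                                        std::size_t length,
                                        std::size_t num_pe)
{
    assert(num_pe > 0 && pe_id < num_pe);
    std::vector<std::size_t> idx;
    for (std::size_t i = pe_id; i < length; i += num_pe)
        idx.push_back(i);
    return idx;
}

// Part 1: every PE runs the same code on its own elements and
// accumulates a private partial sum (my_sum in the pseudo-code).
double pe_work(std::vector<double>& a, std::size_t pe_id, std::size_t num_pe)
{
    double my_sum = 0.0;
    for (std::size_t i : indices_for_pe(pe_id, a.size(), num_pe)) {
        a[i] = a[i] * static_cast<double>(i);
        my_sum += a[i];
    }
    return my_sum;
}

// Part 2: reorganize data by adding the partial sums together.
// Assumption: PEs are simulated sequentially; on real hardware each
// call to pe_work would run concurrently, followed by a barrier.
double data_parallel_sum(std::vector<double>& a, std::size_t num_pe)
{
    std::vector<double> partial(num_pe, 0.0);
    for (std::size_t pe = 0; pe < num_pe; ++pe)
        partial[pe] = pe_work(a, pe, num_pe);

    double sum = 0.0;
    for (double p : partial)
        sum += p;
    return sum;
}
```

For the case discussed above (an array of length 7 and 3 PEs), <code>indices_for_pe(0, 7, 3)</code> yields the indices 0, 3, 6, matching the assignment described for PE0. Keeping one partial sum per PE mirrors the <code>my_sum</code> variable in the pseudo-code and means that no PE writes to a location owned by another until the final reduction.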
<br /> <br /> ==Example of Task Parallel Programming Model==<br /> An example of a task parallel code which is functionally equivalent to the sequential and data parallel codes given above follows below.<br /> <br /> // Task parallel code.<br /> <br /> int id = getmyid(); // assume id = 0 for thread 0, id = 1 for thread 1<br /> <br /> if (id == 0)<br /> {<br /> for (i = 0; i < 8; i++)<br /> {<br /> a[i] = b[i] + c[i];<br /> send_msg(P1, a[i]);<br /> }<br /> }<br /> else<br /> {<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> {<br /> recv_msg(P0, a[i]);<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> }<br /> Print sum;<br /> }<br /> <br /> In the code above, work is divided into two parallel tasks. The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.'' The other sums the elements of ''a.'' These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed by each thread is different (rather than identical).<br /> <br /> Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated. This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements. In addition, unique tasks are likely to have significantly different run times, making it more challenging to balance load across processors. [[#References | Haveraaen (2000)]] also notes that task parallel algorithms are inherently more complex, requiring a greater degree of communication and synchronization. In the task parallel code above, after thread 0 computes an element of ''a'' it must send it to thread 1. To support this, sends and receives occur every iteration of the two loops, resulting in a total of 8 messages being sent between the threads. 
In contrast, the data parallel code sends only 2 messages, one at the beginning and one at the end. <br /> <br /> <br /> <br /> ==Comparison between Data and Task Parallel Programming Models==<br /> <br /> <br /> {| class="wikitable" border="1" align="center"<br /> |+ '''Comparison between data parallel and task parallel programming models.'''<br /> |-<br /> ! Aspects<br /> ! Data Parallel<br /> ! Task Parallel<br /> |-<br /> | Decomposition<br /> | Partition data into subsets<br /> | Partition program into subtasks<br /> |-<br /> | Parallel tasks<br /> | Identical<br /> | Unique<br /> |-<br /> | Degree of parallelism<br /> | Scales easily<br /> | Fixed<br /> |-<br /> | Load balancing<br /> | Easier<br /> | Harder<br /> |-<br /> | Communication overhead<br /> | Lower<br /> | Higher<br /> |}<br /> <br /> ===Synchronous vs Asynchronous===<br /> While the [http://en.wikipedia.org/wiki/Lockstep_(computing) lockstep] imposed by data parallelism on all data streams ensures synchronous computation (all PEs perform their tasks at exactly the same pace), every processor in task parallelism performs its task at its own pace, which we call asynchronous computation. Thus, at certain points in a task parallel program's execution, communication and synchronization primitives are needed to allow the different instruction streams to coordinate their efforts, and that is where variable-sharing and message-passing come into play.<br /> <br /> ===Determinism vs. Non-Determinism===<br /> Data parallelism's synchronous nature and task parallelism's asynchronism give rise to another pair of features that add to the difference between these two models: determinism versus non-determinism. Data parallelism is deterministic, i.e., computing with the same input will always yield the same result, since its synchronism ensures that issues like relative timing between PEs will not arise. 
In contrast, task parallelism's asynchronous updates of common data can give rise to non-determinism, i.e., the same input won't always yield the same computation result (the result of a computation will depend also on factors outside the program's control, such as scheduling and the timing of other PEs). Non-determinism makes it harder to write and maintain correct programs. This partially explains the advantage of the data parallel programming model over the task parallel model in terms of development effort (also discussed in section 4.2).<br /> <br /> ===Comparison Diagram===<br /> The following diagram may be of use in conceptually distinguishing between data parallelism (SIMD: Single Instruction, Multiple Data) and task parallelism (MIMD: Multiple Instruction, Multiple Data). In the SIMD case, a single instruction stream is issued to multiple processors, which then access the data through multiple connections. In contrast, the MIMD case has multiple instruction streams (evidenced by two groups of processors) which interact, again, with the data through multiple connections.<br /> [[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]<br /> <br /> =History of Parallel Programming Models=<br /> ==Interesting Dates in Parallel Computing History (with a focus on IBM, Cray, and ILLIAC)==<br /> ===1950's===<br /> *1955<br /> **IBM introduces the 704. Principal architect is Gene Amdahl; it is the first commercial machine with floating-point hardware, and is capable of approximately 5 kFLOPS. <br /> <br /> *1956<br /> **IBM starts 7030 project (known as STRETCH) to produce supercomputer for Los Alamos National Laboratory (LANL). Its goal is to produce a machine with 100 times the performance of any available at the time. <br /> <br /> *1958<br /> **Bull of France announces the Gamma 60 with multiple functional units and fork & join operations in its instruction set. 19 are later built. 
<br /> **John Cocke and Daniel Slotnick discuss use of parallelism in numerical calculations in an IBM research memo. Slotnick later proposes SOLOMON, a SIMD machine with 1024 1-bit PEs, each with memory for 128 32-bit values. The machine is never built, but the design is the starting point for much later work. <br /> ===1960's===<br /> *1960<br /> **Atlas computer becomes operational. It is the first machine to use virtual memory and paging; its instruction execution is pipelined, and it contains separate fixed- and floating-point arithmetic units, capable of approximately 200 kFLOPS. <br /> **Burroughs introduces the D825 symmetrical MIMD multiprocessor. 1 to 4 CPUs access 1 to 16 memory modules using a crossbar switch. The CPUs are similar to the later B5000; the operating system is symmetrical, with a shared ready queue. <br /> <br /> *1964<br /> **Daniel Slotnick proposes building a massively-parallel machine for the Lawrence Livermore National Laboratory (LLNL); the Atomic Energy Commission gives the contract to CDC instead, who build the STAR-100 to fulfil it. Slotnick's design funded by the Air Force, and evolves into the ILLIAC-IV. The machine is built at the University of Illinois, with Burroughs and Texas Instruments as primary subcontractors. Texas Instruments' Advanced Scientific Computer (ASC) also grows out of this initiative. <br /> <br /> *1966<br /> **Michael Flynn publishes a paper describing the architectural taxonomy which bears his name. <br /> <br /> *1967<br /> **IBM produces the 360/91 (later model 95) with dynamic instruction reordering. 20 of these are produced over the next several years; the line is eventually supplanted by the slower Model <br /> **Gene Amdahl and Daniel Slotnick have published debate at AFIPS Conference about the feasibility of parallel processing. 
Amdahl's argument about limits to parallelism becomes known as "Amdahl's Law"; he also propounds a corollary about system balance (sometimes called "Amdahl's Other Law"), which states that a balanced machine has the same number of MIPS, Mbytes, and Mbit/s of I/O bandwidth. <br /> <br /> *1968<br /> **IBM 2938 Array Processor delivered to Western Geophysical (who promptly paint racing stripes on it). First commercial machine to sustain 10 MFLOPS on 32-bit floating-point operations. A programmable digital signal processor, it proves very popular in the petroleum industry. <br /> **Edsger Dijkstra describes semaphores, and introduces the dining philosophers problem, which later becomes a standard example in concurrency theory. <br /> <br /> *1969<br /> **George Paul, M. Wayne Wilson, and Charles Cree begin work at IBM on VECTRAN, an extension to FORTRAN 66 with array-valued operators, functions, and I/O facilities. <br /> **Work begins at Compass Inc. on a parallelizing FORTRAN compiler for the ILLIAC-IV called IVTRAN. <br /> <br /> ===1970's===<br /> *1971<br /> **Intel produces the world's first single-chip CPU, the 4004 microprocessor. <br /> <br /> *1972<br /> **Seymour Cray leaves Control Data Corporation to found Cray Research Inc. CDC cancels the 8600 project, a follow-on to the 7600. <br /> **Quarter-sized (64 PEs) ILLIAC-IV installed at NASA Ames. Each processor has a peak speed of 4 MFLOPS; the machine's I/O system is capable of 500 Mbit/s. <br /> **Paper studies of massive bit-level parallelism done by Stewart Reddaway at ICL. These later lead to development of ICL DAP. <br /> <br /> *1974<br /> **Leslie Lamport's paper "Parallel Execution of Do-Loops" lays the theoretical foundation for most later research on automatic vectorization and shared-memory parallelization. Much of the work was done in 1971-2 while Lamport was at Compass Inc. <br /> **IBM delivers the first 3838 array processor, a general-purpose digital signal processor. 
<br /> <br /> *1975<br /> **ILLIAC-IV becomes operational at NASA Ames after concerted check-out effort. <br /> <br /> *1976<br /> **Cray Research delivers the first Freon-cooled CRAY-1 to Los Alamos National Laboratory. <br /> <br /> *1979<br /> **IBM's John Cocke designs the 801, the first of what are later called RISC architectures. <br /> <br /> ===1980's===<br /> *1980<br /> **PFC (Parallel FORTRAN Compiler) developed at Rice University under the direction of Ken Kennedy. <br /> **David Padua and David Kuck at the University of Illinois develop the DOACROSS parallel construct to be used as a target in program transformation. The name DOACROSS is due to Robert Kuhn. <br /> <br /> *1982<br /> **Steve Chen's group at Cray Research produces the first X-MP, containing two pipelined processors compatible with the CRAY-1 and shared memory. <br /> **ILLIAC-IV decommissioned. <br /> <br /> *1983<br /> **J. R. Allen's Ph.D. thesis at Rice University introduces the concepts of loop-carried and loop-independent dependencies, and formalizes the process of vectorization. <br /> **Scientific Computer Systems founded to design and market Cray-compatible minisupercomputers. <br /> **CRAY-1 with 1 processor achieves 12.5 MFLOPS on the 100x100 LINPACK benchmark. <br /> <br /> *1984<br /> **The CRAY X-MP family is expanded to include 1- and 4-processor machines. A CRAY X-MP running CX-OS, the first Unix-like operating system for supercomputers, is delivered to NASA Ames. <br /> **CRAY X-MP with 1 processor achieves 21 MFLOPS on 100x100 LINPACK. <br /> <br /> *1985<br /> **Cray Research produces the CRAY-2, with four background processors, a single foreground processor, a 4.1 nsec clock cycle, and 256 Mword memory. The machine is cooled by an inert fluorocarbon previously used as a blood substitute. <br /> <br /> *1986<br /> **CRAY X-MP with 4 processors achieves 713 MFLOPS (against a peak of 840) on 1000x1000 LINPACK. 
<br /> **Alan Karp offers $100 prize to first person to demonstrate speedup of 200 or more on general purpose parallel processor. Benner, Gustafson, and Montry begin work to win it, and are later awarded the Gordon Bell Prize. <br /> <br /> *1987<br /> **The first Gordon Bell Prizes for parallel performance are awarded. The recipients are Benner, Gustafson, and Montry, for a speedup of 400-600 on a variety of applications running on a 1024-node nCUBE, and Chen, De Benedictis, Fox, Li, and Walker, for speedups of 39-458 on various hypercubes. <br /> <br /> *1988<br /> **John Gustafson and Gary Montry argue that Amdahl's Law can be invalidated by increasing problem size. <br /> **CRAY Y-MP with 1 processor achieves 74 MFLOPS on 100x100 LINPACK; the same machine with 8 processors achieves 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> <br /> *1989<br /> **CRAY Y-MP with 8 processors achieves 275 MFLOPS on 100x100 LINPACK, and 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> **Gordon Bell Prize for absolute performance awarded to a team from Mobil and Thinking Machines Corporation, who achieve 6 GFLOPS on a CM-2 Connection Machine; prize in price/performance category awarded to Emeagwali, who achieves 400 MFLOPS per million dollars on the same platform. <br /> **Seymour Cray leaves Cray Research to found Cray Computer Corporation. <br /> <br /> ===1990's===<br /> *1990<br /> **Cray Research, Inc., purchases Supertek Computers Inc., makers of the S-1, a minisupercomputer compatible with the CRAY X-MP. <br /> **Gordon Bell Prize in price/performance category awarded to Geist, Stocks, Ginatempo, and Shelton, who achieve 800 MFLOPS per million dollars in a high-temperature superconductivity program on a 128-node Intel iPSC/860. The prize in the compiler parallelization category is awarded to Sabot, Tennies, and Vasilevsky, who achieve 1.5 GFLOPS on a CM-2 Connection Machine with FORTRAN 90 code derived from FORTRAN 77. 
<br /> **National Energy Research Supercomputer Center (NERSC) at LLNL places order with Cray Computer Corporation for CRAY-3 supercomputer. The order includes a unique 8-processor CRAY-2 computer system that is installed in April. <br /> <br /> *1991<br /> **CRAY Y-MP C90 with 16 processors achieves 403 MFLOPS on 100x100 LINPACK; a Fujitsu VP-2600 with 1 processor achieves 4 GFLOPS (against a peak of 5 GFLOPS) on 1000x1000 LINPACK. <br /> <br /> *1993<br /> **Cray Research delivers a Y-MP M90 with 32 Gbyte of memory to the U.S. Government, after delivering a similar machine with 8 Gbyte of memory in the previous year to the Minnesota Supercomputer Center. <br /> <br /> ===References===<br /> http://ei.cs.vt.edu/~history/Parallel.html<br /> <br /> ==Vector Machines==<br /> <br /> First appearing in the 1970s, vector machines were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.<br /> <br /> The Solomon project at Westinghouse produced one of the first designs to use vector operations. Its CPU had a large number of ALUs that would each be fed different data each cycle. Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.<br /> <br /> An innovation came with the Cray-1 supercomputer in 1976. It was realized that large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray computer was only able to load 12 data points, but by completing multiple instructions before continuing, the total number of memory accesses decreased. 
The Cray-1 could perform at 240 MFLOPS.<br /> <br /> ===References for this section===<br /> *Wikipedia, Vector processor http://en.wikipedia.org/w/index.php?title=Vector_processor&oldid=405209552<br /> *Wikipedia, Cray-1 http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730<br /> <br /> <br /> <br /> <br /> =Comparing the Data Parallel Model with the Shared Memory and Message Passing Models=<br /> <br /> Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References | Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.<br /> <br /> As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.<br /> <br /> Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. 
[[#Definitions | ''SIMD (single-instruction-multiple-data)'']] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix: C for CUDA Example Code | Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model&mdash;like the message passing model&mdash;does not require hardware support.<br /> <br /> Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.<br /> <br /> Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. 
The regular data chunks also make it easier to reason about where to locate data and how to organize it.<br /> <br /> =Definitions=<br /> <br /> * ''Data parallel.'' A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.<br /> * ''Task parallel.'' A task parallel algorithm is composed of a set of differing tasks which operate on common data.<br /> * ''SIMD (single instruction, multiple data).'' A processor which executes a single instruction simultaneously on multiple data locations.<br /> * ''MIMD (multiple instruction, multiple data).'' A processor which executes multiple instructions simultaneously on multiple data locations.<br /> * ''PE.'' A processing element.<br /> * ''PU.'' A processing unit (synonymous with PE).<br /> <br /> =References=<br /> <br /> * David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.<br /> * Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.<br /> * Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 "Machine and collection abstractions for user-implemented data-parallel programming,"] ''Scientific Programming,'' 8(4):231-246, 2000.<br /> * W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 "Data parallel algorithms,"] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.<br /> * Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 "A comparison of message passing and shared memory architectures for data parallel programs,"] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 
94-105.<br /> * Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.<br /> <br /> =Appendix: C for CUDA Example Code=<br /> <br /> The following code is a data parallel implementation of the sequential Code 2.3 from [[#References | Solihin (2008)]] using [http://www.nvidia.com/object/cuda_learn.html C for CUDA]. It is presented to give an impression of programming for a SIMD architecture, but a detailed discussion is beyond the scope of this supplement. Ignoring memory allocation issues, the code is very similar to the data parallel example, Code 2.5 from [[#References | Solihin (2008)]], discussed earlier. The main difference is the presence of a control thread that sends the parallel tasks to the CUDA device.<br /> <br /> // Data parallel implementation of the example code using C for CUDA.<br /> <br /> #include <iostream><br /> <br /> // Each device thread handles a contiguous chunk of 4 elements.<br /> __global__ void kernel(float* a, float* b, float* c, float* local_sum)<br /> {<br /> int id = threadIdx.x;<br /> int local_iter = 4;<br /> int start_iter = id * local_iter;<br /> int end_iter = start_iter + local_iter;<br /> <br /> // Begin data parallel section<br /> <br /> for (int i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum[id] = 0;<br /> for (int i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum[id] = local_sum[id] + a[i];<br /> <br /> // End data parallel section<br /> }<br /> <br /> int main()<br /> {<br /> float h_a[8], h_b[8], h_c[8], h_sum[2];<br /> float *d_a, *d_b, *d_c, *d_sum;<br /> float sum;<br /> <br /> size_t size = 8 * sizeof(float);<br /> size_t size2 = 2 * sizeof(float);<br /> <br /> cudaMalloc((void**)&d_a, size);<br /> cudaMalloc((void**)&d_b, size);<br /> cudaMalloc((void**)&d_c, size);<br /> cudaMalloc((void**)&d_sum, size2);<br /> <br /> // h_b and h_c are assumed to have been filled with input data before these copies.<br /> cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);<br /> cudaMemcpy(d_c, h_c, size, cudaMemcpyHostToDevice);<br /> <br /> kernel<<<1, 2>>>(d_a, 
d_b, d_c, d_sum);<br /> <br /> cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);<br /> cudaMemcpy(h_sum, d_sum, size2, cudaMemcpyDeviceToHost);<br /> <br /> sum = h_sum[0] + h_sum[1];<br /> std::cout << sum;<br /> <br /> cudaFree(d_a);<br /> cudaFree(d_b);<br /> cudaFree(d_c);<br /> cudaFree(d_sum);<br /> }</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_cl&diff=43599 CSC/ECE 506 Spring 2011/ch2 cl 2011-02-07T20:19:35Z

<p>Cslingaf: moved comparison diagram to correct section</p> <hr /> <div>=Supplement to Chapter 2: The Data Parallel Programming Model=<br /> <br /> Chapter 2 of [[#References | Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [[#Definitions | ''data parallel'']] model, another commonly recognized parallel programming model covered in other treatments like [[#References | Foster (1995)]] and [[#References | Culler (1999)]]. Whereas the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can therefore be used in conjunction with either. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References | Solihin (2008)]]. The [[#Definitions | ''task parallel'']] model will also be introduced as a point of contrast.<br /> <br /> =Overview=<br /> <br /> Whereas the shared memory and message passing models focus on how parallel tasks access common data, the [[#Definitions | ''data parallel'']] model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data. <br /> [[#References | Hillis (1986)]] points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks. [[#References | Haveraaen (2000)]] also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. <br /> <br /> == Example of Data Parallel Programming Model ==<br /> <br /> This section shows a simple example adapted from the Solihin textbook (pp. 24-27) that illustrates the data-parallel programming model. 
Each of the code fragments below is written in pseudo-code style.<br /> <br /> <br /> Suppose we want to perform the following task on an array <code>a</code>: updating each element of <code>a</code> to the product of itself and its index, and adding together the elements of <code>a</code> into the variable <code>sum</code>. The corresponding code is shown below.<br /> <br /> <br /> // simple sequential task<br /> sum = 0;<br /> '''for''' (i = 0; i < a.length; i++)<br /> {<br /> a[i] = a[i] * i;<br /> sum = sum + a[i];<br /> }<br /> <br /> <br /> When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (in our example, data reorganization means summing up values across different processing elements). Since the data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The code fragment below is an example of the first part of the program.<br /> <br /> <br /> // data parallel programming: let each PE perform the same task on different pieces of distributed data<br /> pe_id = getid();<br /> my_sum = 0;<br /> '''for''' (i = pe_id; i < a.length; i += number_of_pe) //separate elements of the array are assigned to each PE <br /> {<br /> a[i] = a[i] * i;<br /> my_sum = my_sum + a[i]; //all PEs accumulate elements assigned to them into local variable my_sum<br /> }<br /> <br /> <br /> In the above code, data parallelism is achieved by letting each processing element perform actions on separate elements of the array, which are identified using the PE's id. 
For instance, if three processing elements are used then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements, the array index for each increases by three on each iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved rather than contiguous). If the length of the array is a multiple of three then each processing element is assigned the same number of elements and, all else being equal, takes roughly the same amount of time to execute its portion of the task.<br /> <br /> <br /> The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.<br /> <br /> [[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming (adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]<br /> <br /> ==Task Parallel Overview==<br /> The logical opposite of data parallel is [[#Definitions | ''task parallel,'']] in which a number of distinct tasks operate on common data. 
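As a rough illustration of "distinct tasks operating on common data," the following C++ sketch runs two different functions concurrently over the same read-only array. It is only a minimal sketch under stated assumptions: the names <code>TaskResults</code>, <code>sum_positive_task</code>, <code>max_task</code>, and <code>run_task_parallel</code> are hypothetical and do not come from Solihin (2008), and standard C++11 threads stand in for whatever threading mechanism a real system would use.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical result holder for this sketch (not from the source text).
struct TaskResults {
    float sum_of_positives = 0.0f;
    float maximum = 0.0f;
};

// Task A: add up the positive elements of the shared array.
void sum_positive_task(const std::vector<float>& data, float& out) {
    float sum = 0.0f;
    for (float x : data)
        if (x > 0.0f)
            sum += x;
    out = sum;
}

// Task B: find the largest element of the same shared array
// (reports 0 for an empty array, an arbitrary choice for this sketch).
void max_task(const std::vector<float>& data, float& out) {
    out = data.empty() ? 0.0f : *std::max_element(data.begin(), data.end());
}

// Run the two distinct tasks concurrently on common, read-only data.
// Each task writes only to its own result field, so no locking is needed.
TaskResults run_task_parallel(const std::vector<float>& data) {
    TaskResults results;
    std::thread a([&] { sum_positive_task(data, results.sum_of_positives); });
    std::thread b([&] { max_task(data, results.maximum); });
    a.join();  // wait for both tasks before reading the results
    b.join();
    return results;
}
```

Calling <code>run_task_parallel</code> on an array holding 1, -2, 3, 4 would be expected to give a positive sum of 8 and a maximum of 4. Note that only two tasks exist in this sketch, so at most two threads can be kept busy regardless of the array length.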
<br /> <br /> ==Example of Task Parallel Programming Model==<br /> An example of a task parallel code which is functionally equivalent to the sequential and data parallel codes given above follows below.<br /> <br /> // Task parallel code.<br /> <br /> int id = getmyid(); // assume id = 0 for thread 0, id = 1 for thread 1<br /> <br /> if (id == 0)<br /> {<br /> for (i = 0; i < 8; i++)<br /> {<br /> a[i] = b[i] + c[i];<br /> send_msg(P1, a[i]);<br /> }<br /> }<br /> else<br /> {<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> {<br /> recv_msg(P0, a[i]);<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> }<br /> Print sum;<br /> }<br /> <br /> In the code above, work is divided into two parallel tasks. The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.'' The other sums the elements of ''a.'' These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed by each thread is different (rather than identical).<br /> <br /> Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated. This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements. In addition, unique tasks are likely to have significantly different run times, making it more challenging to balance load across processors. [[#References | Haveraaen (2000)]] also notes that task parallel algorithms are inherently more complex, requiring a greater degree of communication and synchronization. In the task parallel code above, after thread 0 computes an element of ''a'' it must send it to thread 1. To support this, sends and receives occur every iteration of the two loops, resulting in a total of 8 messages being sent between the threads. 
In contrast, the data parallel code sends only 2 messages, one at the beginning and one at the end. <br /> <br /> <br /> <br /> ==Comparison between Data and Task Parallel Programming Models==<br /> <br /> <br /> {| class="wikitable" border="1" align="center"<br /> |+ '''Comparison between data parallel and task parallel programming models.'''<br /> |-<br /> ! Aspects<br /> ! Data Parallel<br /> ! Task Parallel<br /> |-<br /> | Decomposition<br /> | Partition data into subsets<br /> | Partition program into subtasks<br /> |-<br /> | Parallel tasks<br /> | Identical<br /> | Unique<br /> |-<br /> | Degree of parallelism<br /> | Scales easily<br /> | Fixed<br /> |-<br /> | Load balancing<br /> | Easier<br /> | Harder<br /> |-<br /> | Communication overhead<br /> | Lower<br /> | Higher<br /> |}<br /> <br /> ===Synchronous vs Asynchronous===<br /> While the [http://en.wikipedia.org/wiki/Lockstep_(computing) lockstep] imposed by data parallelism on all data streams ensures synchronous computation (all PEs perform their tasks at the exact same pace), every processor in task parallelism performs its task at their own pace, which we call asynchronous computation. Thus, at a certain point of a task parallel program's execution, communication and synchronization primitives are needed to allow different instruction streams to coordinate their efforts, and that is where variable-sharing and message-passing come into play.<br /> <br /> ===Determinism vs. Non-Determinism===<br /> Data parallelism's synchronous nature and task parallelism's asynchronism give rise to another pair of features that add to the difference between these two models: determinism versus non-determinism. Data parallelism is deterministic, i.e. computing with the same input will always yield the same result, since its synchronism ensures that issues like relative timing between PEs will not arise. 
In contrast, task parallelism's asynchronous updates of common data can give rise to non-determinism, i.e., the same input won't always yield the same computation result (the result of a computation will depend also on factors outside the program's control, such as scheduling and timing of other PEs). Obviously, non-determinism makes it harder to write and maintain correct programs. This partially explains the advantage of the data parallel programming model over the task parallel model in terms of development effort (also discussed in section 4.2).<br /> <br /> ===Comparison Diagram===<br /> The following diagram may be of use in conceptually distinguishing between data parallelism (SIMD: Single Instruction, Multiple Data) and task parallelism (MIMD: Multiple Instruction, Multiple Data). In the SIMD case, a single instruction is issued to multiple processors, which then access multiple connections to the data. In contrast, the MIMD case has multiple instruction streams (evidenced by two groups of processors) which interact, again, with multiple connections to the data.<br /> [[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]<br /> <br /> =History of Parallel Programming Models=<br /> ==Interesting Dates in Parallel Computing History (with a focus on IBM, Cray, and ILLIAC)==<br /> ===1950's===<br /> *1955<br /> **IBM introduces the 704. Principal architect is Gene Amdahl; it is the first commercial machine with floating-point hardware, and is capable of approximately 5 kFLOPS. <br /> <br /> *1956<br /> **IBM starts 7030 project (known as STRETCH) to produce supercomputer for Los Alamos National Laboratory (LANL). Its goal is to produce a machine with 100 times the performance of any available at the time. <br /> <br /> *1958<br /> **Bull of France announces the Gamma 60 with multiple functional units and fork & join operations in its instruction set. 19 are later built. 
<br /> **John Cocke and Daniel Slotnick discuss use of parallelism in numerical calculations in an IBM research memo. Slotnick later proposes SOLOMON, a SIMD machine with 1024 1-bit PEs, each with memory for 128 32-bit values. The machine is never built, but the design is the starting point for much later work. <br /> ===1960's===<br /> *1960<br /> **Atlas computer becomes operational. It is the first machine to use virtual memory and paging; its instruction execution is pipelined, and it contains separate fixed- and floating-point arithmetic units, capable of approximately 200 kFLOPS. <br /> **Burroughs introduces the D825 symmetrical MIMD multiprocessor. 1 to 4 CPUs access 1 to 16 memory modules using a crossbar switch. The CPUs are similar to the later B5000; the operating system is symmetrical, with a shared ready queue. <br /> <br /> *1964<br /> **Daniel Slotnick proposes building a massively-parallel machine for the Lawrence Livermore National Laboratory (LLNL); the Atomic Energy Commission gives the contract to CDC instead, who build the STAR-100 to fulfil it. Slotnick's design is funded by the Air Force and evolves into the ILLIAC-IV. The machine is built at the University of Illinois, with Burroughs and Texas Instruments as primary subcontractors. Texas Instruments' Advanced Scientific Computer (ASC) also grows out of this initiative. <br /> <br /> *1966<br /> **Michael Flynn publishes a paper describing the architectural taxonomy which bears his name. <br /> <br /> *1967<br /> **IBM produces the 360/91 (later model 95) with dynamic instruction reordering. 20 of these are produced over the next several years; the line is eventually supplanted by the slower Model <br /> **Gene Amdahl and Daniel Slotnick have a published debate at the AFIPS Conference about the feasibility of parallel processing. 
Amdahl's argument about limits to parallelism becomes known as "Amdahl's Law"; he also propounds a corollary about system balance (sometimes called "Amdahl's Other Law"), which states that a balanced machine has the same number of MIPS, Mbytes, and Mbit/s of I/O bandwidth. <br /> <br /> *1968<br /> **IBM 2938 Array Processor delivered to Western Geophysical (who promptly paint racing stripes on it). First commercial machine to sustain 10 MFLOPS on 32-bit floating-point operations. A programmable digital signal processor, it proves very popular in the petroleum industry. <br /> **Edsger Dijkstra describes semaphores, and introduces the dining philosophers problem, which later becomes a standard example in concurrency theory. <br /> <br /> *1969<br /> **George Paul, M. Wayne Wilson, and Charles Cree begin work at IBM on VECTRAN, an extension to FORTRAN 66 with array-valued operators, functions, and I/O facilities. <br /> **Work begins at Compass Inc. on a parallelizing FORTRAN compiler for the ILLIAC-IV called IVTRAN. <br /> <br /> ===1970's===<br /> *1971<br /> **Intel produces the world's first single-chip CPU, the 4004 microprocessor. <br /> <br /> *1972<br /> **Seymour Cray leaves Control Data Corporation to found Cray Research Inc. CDC cancels the 8600 project, a follow-on to the 7600. <br /> **Quarter-sized (64 PEs) ILLIAC-IV installed at NASA Ames. Each processor has a peak speed of 4 MFLOPS; the machine's I/O system is capable of 500 Mbit/s. <br /> **Paper studies of massive bit-level parallelism done by Stewart Reddaway at ICL. These later lead to development of ICL DAP. <br /> <br /> *1974<br /> **Leslie Lamport's paper "Parallel Execution of Do-Loops" lays the theoretical foundation for most later research on automatic vectorization and shared-memory parallelization. Much of the work was done in 1971-2 while Lamport was at Compass Inc. <br /> **IBM delivers the first 3838 array processor, a general-purpose digital signal processor. 
<br /> <br /> *1975<br /> **ILLIAC-IV becomes operational at NASA Ames after concerted check-out effort. <br /> <br /> *1976<br /> **Cray Research delivers the first Freon-cooled CRAY-1 to Los Alamos National Laboratory. <br /> <br /> *1979<br /> **IBM's John Cocke designs the 801, the first of what are later called RISC architectures. <br /> <br /> ===1980's===<br /> *1980<br /> **PFC (Parallel FORTRAN Compiler) developed at Rice University under the direction of Ken Kennedy. <br /> **David Padua and David Kuck at the University of Illinois develop the DOACROSS parallel construct to be used as a target in program transformation. The name DOACROSS is due to Robert Kuhn. <br /> <br /> *1982<br /> **Steve Chen's group at Cray Research produces the first X-MP, containing two pipelined processors compatible with the CRAY-1 and shared memory. <br /> **ILLIAC-IV decommissioned. <br /> <br /> *1983<br /> **J. R. Allen's Ph.D. thesis at Rice University introduces the concepts of loop-carried and loop-independent dependencies, and formalizes the process of vectorization. <br /> **Scientific Computer Systems founded to design and market Cray-compatible minisupercomputers. <br /> **CRAY-1 with 1 processor achieves 12.5 MFLOPS on the 100x100 LINPACK benchmark. <br /> <br /> *1984<br /> **The CRAY X-MP family is expanded to include 1- and 4-processor machines. A CRAY X-MP running CX-OS, the first Unix-like operating system for supercomputers, is delivered to NASA Ames. <br /> **CRAY X-MP with 1 processor achieves 21 MFLOPS on 100x100 LINPACK. <br /> <br /> *1985<br /> **Cray Research produces the CRAY-2, with four background processors, a single foreground processor, a 4.1 nsec clock cycle, and 256 Mword memory. The machine is cooled by an inert fluorocarbon previously used as a blood substitute. <br /> <br /> *1986<br /> **CRAY X-MP with 4 processors achieves 713 MFLOPS (against a peak of 840) on 1000x1000 LINPACK. 
<br /> **Alan Karp offers $100 prize to first person to demonstrate speedup of 200 or more on general purpose parallel processor. Benner, Gustafson, and Montry begin work to win it, and are later awarded the Gordon Bell Prize. <br /> <br /> *1987<br /> **The first Gordon Bell Prizes for parallel performance are awarded. The recipients are Benner, Gustafson, and Montry, for a speedup of 400-600 on a variety of applications running on a 1024-node nCUBE, and Chen, De Benedictis, Fox, Li, and Walker, for speedups of 39-458 on various hypercubes. <br /> <br /> *1988<br /> **John Gustafson and Gary Montry argue that Amdahl's Law can be invalidated by increasing problem size. <br /> **CRAY Y-MP with 1 processor achieves 74 MFLOPS on 100x100 LINPACK; the same machine with 8 processors achieves 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> <br /> *1989<br /> **CRAY Y-MP with 8 processors achieves 275 MFLOPS on 100x100 LINPACK, and 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> **Gordon Bell Prize for absolute performance awarded to a team from Mobil and Thinking Machines Corporation, who achieve 6 GFLOPS on a CM-2 Connection Machine; prize in price/performance category awarded to Emeagwali, who achieves 400 MFLOPS per million dollars on the same platform. <br /> **Seymour Cray leaves Cray Research to found Cray Computer Corporation. <br /> <br /> ===1990's===<br /> *1990<br /> **Cray Research, Inc., purchases Supertek Computers Inc., makers of the S-1, a minisupercomputer compatible with the CRAY X-MP. <br /> **Gordon Bell Prize in price/performance category awarded to Geist, Stocks, Ginatempo, and Shelton, who achieve 800 MFLOPS per million dollars in a high-temperature superconductivity program on a 128-node Intel iPSC/860. The prize in the compiler parallelization category is awarded to Sabot, Tennies, and Vasilevsky, who achieve 1.5 GFLOPS on a CM-2 Connection Machine with FORTRAN 90 code derived from FORTRAN 77. 
<br /> **National Energy Research Supercomputer Center (NERSC) at LLNL places order with Cray Computer Corporation for CRAY-3 supercomputer. The order includes a unique 8-processor CRAY-2 computer system that is installed in April. <br /> <br /> *1991<br /> **CRAY Y-MP C90 with 16 processors achieves 403 MFLOPS on 100x100 LINPACK; a Fujitsu VP-2600 with 1 processor achieves 4 GFLOPS (against a peak of 5 GFLOPS) on 1000x1000 LINPACK. <br /> <br /> *1993<br /> **Cray Research delivers a Y-MP M90 with 32 Gbyte of memory to the U.S. Government, after delivering a similar machine with 8 Gbyte of memory in the previous year to the Minnesota Supercomputer Center. <br /> <br /> ===References===<br /> http://ei.cs.vt.edu/~history/Parallel.html<br /> <br /> ==Vector Machines==<br /> <br /> First appearing in the 1970s, vector machines were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.<br /> <br /> The Solomon project at Westinghouse was one of the first machines to use vector operations. Its CPU had a large number of ALUs that would each be fed different data each cycle. Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.<br /> <br /> An innovation came with the Cray-1 supercomputer in 1976. It was realized that large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray computer was only able to load 12 data points, but by completing multiple instructions before continuing, the total number of memory accesses decreased. 
The Cray-1 could perform at 240 MFLOPS.<br /> <br /> ===References for this section===<br /> *Wikipedia, Vector processor http://en.wikipedia.org/w/index.php?title=Vector_processor&oldid=405209552<br /> *Wikipedia, Cray-1 http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730<br /> <br /> <br /> <br /> <br /> =Comparing the Data Parallel Model with the Shared Memory and Message Passing Models=<br /> <br /> Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References | Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.<br /> <br /> As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.<br /> <br /> Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. 
[[#Definitions | ''SIMD (single-instruction-multiple-data)'']] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix: C for CUDA Example Code | Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model&mdash;like the message passing model&mdash;does not require hardware support.<br /> <br /> Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.<br /> <br /> Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. 
The regular data chunks also make it easier to reason about where to locate data and how to organize it.<br /> <br /> =Definitions=<br /> <br /> * ''Data parallel.'' A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.<br /> * ''Task parallel.'' A task parallel algorithm is composed of a set of differing tasks which operate on common data.<br /> * ''SIMD (single instruction, multiple data).'' A processor which executes a single instruction simultaneously on multiple data locations.<br /> * ''MIMD (multiple instruction, multiple data).'' A processor which executes multiple instructions simultaneously on multiple data locations.<br /> <br /> =References=<br /> <br /> * David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.<br /> * Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.<br /> * Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 "Machine and collection abstractions for user-implemented data-parallel programming,"] ''Scientific Programming,'' 8(4):231-246, 2000.<br /> * W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 "Data parallel algorithms,"] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.<br /> * Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 "A comparison of message passing and shared memory architectures for data parallel programs,"] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 
94-105.<br /> * Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.<br /> <br /> =Appendix: C for CUDA Example Code=<br /> <br /> The following code is a data parallel implementation of the sequential Code 2.3 from [[#References | Solihin (2008)]] using [http://www.nvidia.com/object/cuda_learn.html C for CUDA]. It is presented to give an impression of programming for a SIMD architecture, but a detailed discussion is beyond the scope of this supplement. Ignoring memory allocation issues, the code is very similar to the data parallel example, Code 2.5 from [[#References | Solihin (2008)]], discussed earlier. The main difference is the presence of a control thread that sends the parallel tasks to the CUDA device.<br /> <br /> // Data parallel implementation of the example code using C for CUDA.<br /> <br /> #include <iostream><br /> <br /> __global__ void kernel(float* a, float* b, float* c, float* local_sum)<br /> {<br /> int id = threadIdx.x;<br /> int local_iter = 4;<br /> int start_iter = id * local_iter;<br /> int end_iter = start_iter + local_iter;<br /> <br /> // Begin data parallel section<br /> <br /> for (int i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum[id] = 0;<br /> for (int i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum[id] = local_sum[id] + a[i];<br /> <br /> // End data parallel section<br /> }<br /> <br /> int main()<br /> {<br /> float h_a[8], h_b[8], h_c[8], h_sum[2];<br /> float *d_a, *d_b, *d_c, *d_sum;<br /> float sum;<br /> <br /> size_t size = 8 * sizeof(float);<br /> size_t size2 = 2 * sizeof(float);<br /> <br /> cudaMalloc((void**)&d_a, size);<br /> cudaMalloc((void**)&d_b, size);<br /> cudaMalloc((void**)&d_c, size);<br /> cudaMalloc((void**)&d_sum, size2);<br /> <br /> cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);<br /> cudaMemcpy(d_c, h_c, size, cudaMemcpyHostToDevice);<br /> <br /> kernel<<<1, 2>>>(d_a, 
d_b, d_c, d_sum);<br /> <br /> cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);<br /> cudaMemcpy(h_sum, d_sum, size2, cudaMemcpyDeviceToHost);<br /> <br /> sum = h_sum[0] + h_sum[1];<br /> std::cout << sum;<br /> <br /> cudaFree(d_a);<br /> cudaFree(d_b);<br /> cudaFree(d_c);<br /> cudaFree(d_sum);<br /> }</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_cl&diff=43591 CSC/ECE 506 Spring 2011/ch2 cl 2011-01-31T22:14:13Z

<p>Cslingaf: /* Definitions */</p> <hr /> <div>=Supplement to Chapter 2: The Data Parallel Programming Model=<br /> <br /> Chapter 2 of [[#References | Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [[#Definitions | ''data parallel'']] model, another commonly recognized parallel programming model covered in other treatments like [[#References | Foster (1995)]] and [[#References | Culler (1999)]]. Whereas the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can therefore be used in conjunction with either. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References | Solihin (2008)]]. The [[#Definitions | ''task parallel'']] model will also be introduced as a point of contrast.<br /> <br /> =Overview=<br /> <br /> Whereas the shared memory and message passing models focus on how parallel tasks access common data, the [[#Definitions | ''data parallel'']] model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data. <br /> [[#References | Hillis (1986)]] points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks. [[#References | Haveraaen (2000)]] also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. <br /> <br /> == Example of Data Parallel Programming Model ==<br /> <br /> This section shows a simple example adapted from the Solihin textbook (pp. 24-27) that illustrates the data-parallel programming model. 
Each of the code fragments below is written in pseudo-code style.<br /> <br /> <br /> Suppose we want to perform the following task on an array <code>a</code>: updating each element of <code>a</code> to the product of itself and its index, and adding together the elements of <code>a</code> into the variable <code>sum</code>. The corresponding code is shown below.<br /> <br /> <br /> // simple sequential task<br /> sum = 0;<br /> '''for''' (i = 0; i < a.length; i++)<br /> {<br /> a[i] = a[i] * i;<br /> sum = sum + a[i];<br /> }<br /> <br /> <br /> When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (in our example, data reorganization means summing up values across different processing elements). Since the data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The code fragment below is an example of the first part of the program.<br /> <br /> <br /> // data parallel programming: let each PE perform the same task on different pieces of distributed data<br /> pe_id = getid();<br /> my_sum = 0;<br /> '''for''' (i = pe_id; i < a.length; i += number_of_pe) //separate elements of the array are assigned to each PE <br /> {<br /> a[i] = a[i] * i;<br /> my_sum = my_sum + a[i]; //all PEs accumulate elements assigned to them into local variable my_sum<br /> }<br /> <br /> <br /> In the above code, data parallelism is achieved by letting each processing element perform actions on separate elements of the array, which are identified using the PE's id. 
For instance, if three processing elements are used, one starts at i = 0, one at i = 1, and the last at i = 2. Because there are three processing elements, each one advances its array index by three on every iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved rather than contiguous). If the length of the array is a multiple of three, every processing element handles the same number of elements and so takes roughly the same amount of time to execute its portion of the task.<br /> <br /> <br /> The picture below illustrates how elements of the array are assigned among different PEs for a specific case in which the array has 7 elements and 3 PEs are available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.<br /> <br /> [[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming (adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]<br /> <br /> ==Task Parallel Overview==<br /> The logical opposite of data parallel is [[#Definitions | ''task parallel,'']] in which a number of distinct tasks operate on common data. 
<br /> <br /> ==Example of Task Parallel Programming Model==<br /> An example of task parallel code follows below. Note that it performs a different computation from the example above: it adds arrays ''b'' and ''c'' element-wise into ''a'' and sums the positive elements of ''a.''<br /> <br /> // Task parallel code.<br /> <br /> int id = getmyid(); // assume id = 0 for thread 0, id = 1 for thread 1<br /> <br /> if (id == 0)<br /> {<br /> for (i = 0; i < 8; i++)<br /> {<br /> a[i] = b[i] + c[i];<br /> send_msg(P1, a[i]);<br /> }<br /> }<br /> else<br /> {<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> {<br /> recv_msg(P0, a[i]);<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> }<br /> Print sum;<br /> }<br /> <br /> In the code above, work is divided into two parallel tasks. The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.'' The other sums the positive elements of ''a.'' These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed by each thread is different (rather than identical).<br /> <br /> Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated. This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements. In addition, unique tasks are likely to have significantly different run times, making it more challenging to balance load across processors. [[#References | Haveraaen (2000)]] also notes that task parallel algorithms are inherently more complex, requiring a greater degree of communication and synchronization. In the task parallel code above, after thread 0 computes an element of ''a'' it must send it to thread 1. To support this, sends and receives occur every iteration of the two loops, resulting in a total of 8 messages being sent between the threads. 
In contrast, a message passing version of the data parallel code would send only 2 messages, one at the beginning and one at the end. <br /> <br /> The following diagram may help to distinguish conceptually between data parallelism (SIMD: Single Instruction, Multiple Data) and task parallelism (MIMD: Multiple Instruction, Multiple Data). In the SIMD case, a single instruction stream is issued to multiple processors, each of which accesses the data through its own connection. In contrast, the MIMD case has multiple instruction streams (evidenced by two groups of processors) which, again, access the data through multiple connections.<br /> [[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]<br /> <br /> ==Comparison between Data and Task Parallel Programming Models==<br /> <br /> <br /> {| class="wikitable" border="1" align="center"<br /> |+ '''Comparison between data parallel and task parallel programming models.'''<br /> |-<br /> ! Aspects<br /> ! Data Parallel<br /> ! Task Parallel<br /> |-<br /> | Decomposition<br /> | Partition data into subsets<br /> | Partition program into subtasks<br /> |-<br /> | Parallel tasks<br /> | Identical<br /> | Unique<br /> |-<br /> | Degree of parallelism<br /> | Scales easily<br /> | Fixed<br /> |-<br /> | Load balancing<br /> | Easier<br /> | Harder<br /> |-<br /> | Communication overhead<br /> | Lower<br /> | Higher<br /> |}<br /> <br /> ===Synchronous vs Asynchronous===<br /> While the [http://en.wikipedia.org/wiki/Lockstep_(computing) lockstep] imposed by data parallelism on all data streams ensures synchronous computation (all PEs perform their tasks at the exact same pace), every processor in task parallelism performs its task at its own pace, which we call asynchronous computation. 
Thus, at certain points in a task parallel program's execution, communication and synchronization primitives are needed to allow different instruction streams to coordinate their efforts, and that is where variable-sharing and message-passing come into play.<br /> <br /> ===Determinism vs. Non-Determinism===<br /> Data parallelism's synchronous nature and task parallelism's asynchronism give rise to another pair of features that add to the difference between these two models: determinism versus non-determinism. Data parallelism is deterministic, i.e., computing with the same input will always yield the same result, since its synchronism ensures that issues like relative timing between PEs will not arise. In contrast, task parallelism's asynchronous updates of common data can give rise to non-determinism, i.e., the same input won't always yield the same computation result (the result of a computation will depend also on factors outside the program's control, such as scheduling and timing of other PEs). Non-determinism makes it harder to write and maintain correct programs. This partially explains the advantage of the data parallel programming model over task parallelism in terms of development effort (also discussed in section 4.2).<br /> <br /> <br /> =History of Parallel Programming Models=<br /> ==Interesting Dates in Parallel Computing History (with a focus on IBM, Cray, and ILLIAC)==<br /> ===1950's===<br /> *1955<br /> **IBM introduces the 704. Principal architect is Gene Amdahl; it is the first commercial machine with floating-point hardware, and is capable of approximately 5 kFLOPS. <br /> <br /> *1956<br /> **IBM starts 7030 project (known as STRETCH) to produce supercomputer for Los Alamos National Laboratory (LANL). Its goal is to produce a machine with 100 times the performance of any available at the time. 
<br /> <br /> *1958<br /> **Bull of France announces the Gamma 60 with multiple functional units and fork & join operations in its instruction set. 19 are later built. <br /> **John Cocke and Daniel Slotnick discuss use of parallelism in numerical calculations in an IBM research memo. Slotnick later proposes SOLOMON, a SIMD machine with 1024 1-bit PEs, each with memory for 128 32-bit values. The machine is never built, but the design is the starting point for much later work. <br /> ===1960's===<br /> *1960<br /> **Atlas computer becomes operational. It is the first machine to use virtual memory and paging; its instruction execution is pipelined, and it contains separate fixed- and floating-point arithmetic units, capable of approximately 200 kFLOPS. <br /> **Burroughs introduces the D825 symmetrical MIMD multiprocessor. 1 to 4 CPUs access 1 to 16 memory modules using a crossbar switch. The CPUs are similar to the later B5000; the operating system is symmetrical, with a shared ready queue. <br /> <br /> *1964<br /> **Daniel Slotnick proposes building a massively-parallel machine for the Lawrence Livermore National Laboratory (LLNL); the Atomic Energy Commission gives the contract to CDC instead, who build the STAR-100 to fulfil it. Slotnick's design funded by the Air Force, and evolves into the ILLIAC-IV. The machine is built at the University of Illinois, with Burroughs and Texas Instruments as primary subcontractors. Texas Instruments' Advanced Scientific Computer (ASC) also grows out of this initiative. <br /> <br /> *1966<br /> **Michael Flynn publishes a paper describing the architectural taxonomy which bears his name. <br /> <br /> *1967<br /> **IBM produces the 360/91 (later model 95) with dynamic instruction reordering. 20 of these are produced over the next several years; the line is eventually supplanted by the slower Model <br /> **Gene Amdahl and Daniel Slotnick have published debate at AFIPS Conference about the feasibility of parallel processing. 
Amdahl's argument about limits to parallelism becomes known as "Amdahl's Law"; he also propounds a corollary about system balance (sometimes called "Amdahl's Other Law"), which states that a balanced machine has the same number of MIPS, Mbytes, and Mbit/s of I/O bandwidth. <br /> <br /> *1968<br /> **IBM 2938 Array Processor delivered to Western Geophysical (who promptly paint racing stripes on it). First commercial machine to sustain 10 MFLOPS on 32-bit floating-point operations. A programmable digital signal processor, it proves very popular in the petroleum industry. <br /> **Edsger Dijkstra describes semaphores, and introduces the dining philosophers problem, which later becomes a standard example in concurrency theory. <br /> <br /> *1969<br /> **George Paul, M. Wayne Wilson, and Charles Cree begin work at IBM on VECTRAN, an extension to FORTRAN 66 with array-valued operators, functions, and I/O facilities. <br /> **Work begins at Compass Inc. on a parallelizing FORTRAN compiler for the ILLIAC-IV called IVTRAN. <br /> <br /> ===1970's===<br /> *1971<br /> **Intel produces the world's first single-chip CPU, the 4004 microprocessor. <br /> <br /> *1972<br /> **Seymour Cray leaves Control Data Corporation to found Cray Research Inc. CDC cancels the 8600 project, a follow-on to the 7600. <br /> **Quarter-sized (64 PEs) ILLIAC-IV installed at NASA Ames. Each processor has a peak speed of 4 MFLOPS; the machine's I/O system is capable of 500 Mbit/s. <br /> **Paper studies of massive bit-level parallelism done by Stewart Reddaway at ICL. These later lead to development of ICL DAP. <br /> <br /> *1974<br /> **Leslie Lamport's paper "Parallel Execution of Do-Loops" lays the theoretical foundation for most later research on automatic vectorization and shared-memory parallelization. Much of the work was done in 1971-2 while Lamport was at Compass Inc. <br /> **IBM delivers the first 3838 array processor, a general-purpose digital signal processor. 
<br /> <br /> *1975<br /> **ILLIAC-IV becomes operational at NASA Ames after concerted check-out effort. <br /> <br /> *1976<br /> **Cray Research delivers the first Freon-cooled CRAY-1 to Los Alamos National Laboratory. <br /> <br /> *1979<br /> **IBM's John Cocke designs the 801, the first of what are later called RISC architectures. <br /> <br /> ===1980's===<br /> *1980<br /> **PFC (Parallel FORTRAN Compiler) developed at Rice University under the direction of Ken Kennedy. <br /> **David Padua and David Kuck at the University of Illinois develop the DOACROSS parallel construct to be used as a target in program transformation. The name DOACROSS is due to Robert Kuhn. <br /> <br /> *1982<br /> **Steve Chen's group at Cray Research produces the first X-MP, containing two pipelined processors compatible with the CRAY-1 and shared memory. <br /> **ILLIAC-IV decommissioned. <br /> <br /> *1983<br /> **J. R. Allen's Ph.D. thesis at Rice University introduces the concepts of loop-carried and loop-independent dependencies, and formalizes the process of vectorization. <br /> **Scientific Computer Systems founded to design and market Cray-compatible minisupercomputers. <br /> **CRAY-1 with 1 processor achieves 12.5 MFLOPS on the 100x100 LINPACK benchmark. <br /> <br /> *1984<br /> **The CRAY X-MP family is expanded to include 1- and 4-processor machines. A CRAY X-MP running CX-OS, the first Unix-like operating system for supercomputers, is delivered to NASA Ames. <br /> **CRAY X-MP with 1 processor achieves 21 MFLOPS on 100x100 LINPACK. <br /> <br /> *1985<br /> **Cray Research produces the CRAY-2, with four background processors, a single foreground processor, a 4.1 nsec clock cycle, and 256 Mword memory. The machine is cooled by an inert fluorocarbon previously used as a blood substitute. <br /> <br /> *1986<br /> **CRAY X-MP with 4 processors achieves 713 MFLOPS (against a peak of 840) on 1000x1000 LINPACK. 
<br /> **Alan Karp offers $100 prize to first person to demonstrate speedup of 200 or more on general purpose parallel processor. Benner, Gustafson, and Montry begin work to win it, and are later awarded the Gordon Bell Prize. <br /> <br /> *1987<br /> **The first Gordon Bell Prizes for parallel performance are awarded. The recipients are Benner, Gustafson, and Montry, for a speedup of 400-600 on a variety of applications running on a 1024-node nCUBE, and Chen, De Benedictis, Fox, Li, and Walker, for speedups of 39-458 on various hypercubes. <br /> <br /> *1988<br /> **John Gustafson and Gary Montry argue that Amdahl's Law can be invalidated by increasing problem size. <br /> **CRAY Y-MP with 1 processor achieves 74 MFLOPS on 100x100 LINPACK; the same machine with 8 processors achieves 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> <br /> *1989<br /> **CRAY Y-MP with 8 processors achieves 275 MFLOPS on 100x100 LINPACK, and 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> **Gordon Bell Prize for absolute performance awarded to a team from Mobil and Thinking Machines Corporation, who achieve 6 GFLOPS on a CM-2 Connection Machine; prize in price/performance category awarded to Emeagwali, who achieves 400 MFLOPS per million dollars on the same platform. <br /> **Seymour Cray leaves Cray Research to found Cray Computer Corporation. <br /> <br /> ===1990's===<br /> *1990<br /> **Cray Research, Inc., purchases Supertek Computers Inc., makers of the S-1, a minisupercomputer compatible with the CRAY X-MP. <br /> **Gordon Bell Prize in price/performance category awarded to Geist, Stocks, Ginatempo, and Shelton, who achieve 800 MFLOPS per million dollars in a high-temperature superconductivity program on a 128-node Intel iPSC/860. The prize in the compiler parallelization category is awarded to Sabot, Tennies, and Vasilevsky, who achieve 1.5 GFLOPS on a CM-2 Connection Machine with FORTRAN 90 code derived from FORTRAN 77. 
<br /> **National Energy Research Supercomputer Center (NERSC) at LLNL places order with Cray Computer Corporation for CRAY-3 supercomputer. The order includes a unique 8-processor CRAY-2 computer system that is installed in April. <br /> <br /> *1991<br /> **CRAY Y-MP C90 with 16 processors achieves 403 MFLOPS on 100x100 LINPACK; a Fujitsu VP-2600 with 1 processor achieves 4 GFLOPS (against a peak of 5 GFLOPS) on 1000x1000 LINPACK. <br /> <br /> *1993<br /> **Cray Research delivers a Y-MP M90 with 32 Gbyte of memory to the U.S. Government, after delivering a similar machine with 8 Gbyte of memory in the previous year to the Minnesota Supercomputer Center. <br /> <br /> ===References===<br /> http://ei.cs.vt.edu/~history/Parallel.html<br /> <br /> ==Vector Machines==<br /> <br /> First appearing in the 1970s, vector machines were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.<br /> <br /> The Solomon project at Westinghouse was one of the first machines to use vector operations. Its CPU had a large number of ALUs that would each be fed different data each cycle. Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.<br /> <br /> An innovation came with the Cray-1 supercomputer in 1976. It was realized that large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray computer was only able to load 12 data points, but by completing multiple instructions before continuing the total number of memory accesses decreased. 
The Cray-1 could perform at 240 MFLOPS.<br /> <br /> ===References for this section===<br /> *Wikipedia, Vector processor http://en.wikipedia.org/w/index.php?title=Vector_processor&oldid=405209552<br /> *Wikipedia, Cray-1 http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730<br /> <br /> <br /> <br /> <br /> =Comparing the Data Parallel Model with the Shared Memory and Message Passing Models=<br /> <br /> Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References | Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.<br /> <br /> As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.<br /> <br /> Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. 
[[#Definitions | ''SIMD (single-instruction-multiple-data)'']] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix: C for CUDA Example Code | Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model&mdash;like the message passing model&mdash;does not require hardware support.<br /> <br /> Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.<br /> <br /> Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. 
The regular data chunks also make it easier to reason about where to locate data and how to organize it.<br /> <br /> =Definitions=<br /> <br /> * ''Data parallel.'' A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.<br /> * ''Task parallel.'' A task parallel algorithm is composed of a set of differing tasks which operate on common data.<br /> * ''SIMD (single instruction, multiple data).'' A processor which executes a single instruction simultaneously on multiple data locations.<br /> * ''MIMD (multiple instruction, multiple data).'' A processor which executes multiple instructions simultaneously on multiple data locations.<br /> <br /> =References=<br /> <br /> * David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.<br /> * Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.<br /> * Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 "Machine and collection abstractions for user-implemented data-parallel programming,"] ''Scientific Programming,'' 8(4):231-246, 2000.<br /> * W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 "Data parallel algorithms,"] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.<br /> * Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 "A comparison of message passing and shared memory architectures for data parallel programs,"] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 
94-105.<br /> * Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.<br /> <br /> =Appendix: C for CUDA Example Code=<br /> <br /> The following code is a data parallel implementation of the sequential Code 2.3 from [[#References | Solihin (2008)]] using [http://www.nvidia.com/object/cuda_learn.html C for CUDA]. It is presented to give an impression of programming for a SIMD architecture, but a detailed discussion is beyond the scope of this supplement. Ignoring memory allocation issues, the code is very similar to the data parallel example, Code 2.5 from [[#References | Solihin (2008)]], discussed earlier. The main difference is the presence of a control thread that sends the parallel tasks to the CUDA device.<br /> <br /> // Data parallel implementation of the example code using C for CUDA.<br /> <br /> #include <iostream><br /> <br /> __global__ void kernel(float* a, float* b, float* c, float* local_sum)<br /> {<br /> int id = threadIdx.x;<br /> int local_iter = 4;<br /> int start_iter = id * local_iter;<br /> int end_iter = start_iter + local_iter;<br /> <br /> // Begin data parallel section<br /> <br /> for (int i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum[id] = 0;<br /> for (int i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum[id] = local_sum[id] + a[i];<br /> <br /> // End data parallel section<br /> }<br /> <br /> int main()<br /> {<br /> float h_a[8], h_b[8], h_c[8], h_sum[2];<br /> float *d_a, *d_b, *d_c, *d_sum;<br /> float sum;<br /> <br /> size_t size = 8 * sizeof(float);<br /> size_t size2 = 2 * sizeof(float);<br /> <br /> cudaMalloc((void**)&d_a, size);<br /> cudaMalloc((void**)&d_b, size);<br /> cudaMalloc((void**)&d_c, size);<br /> cudaMalloc((void**)&d_sum, size2);<br /> <br /> cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);<br /> cudaMemcpy(d_c, h_c, size, cudaMemcpyHostToDevice);<br /> <br /> kernel<<<1, 2>>>(d_a, 
d_b, d_c, d_sum);<br /> <br /> cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);<br /> cudaMemcpy(h_sum, d_sum, size2, cudaMemcpyDeviceToHost);<br /> <br /> sum = h_sum[0] + h_sum[1];<br /> std::cout << sum;<br /> <br /> cudaFree(d_a);<br /> cudaFree(d_b);<br /> cudaFree(d_c);<br /> cudaFree(d_sum);<br /> }</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_cl&diff=43590 CSC/ECE 506 Spring 2011/ch2 cl 2011-01-31T22:11:24Z

<p>Cslingaf: </p> <hr /> <div>=Supplement to Chapter 2: The Data Parallel Programming Model=<br /> <br /> Chapter 2 of [[#References | Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [[#Definitions | ''data parallel'']] model, another commonly recognized parallel programming model covered in other treatments like [[#References | Foster (1995)]] and [[#References | Culler (1999)]]. Whereas the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can therefore be used in conjunction with either. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References | Solihin (2008)]]. The [[#Definitions | ''task parallel'']] model will also be introduced as a point of contrast.<br /> <br /> =Overview=<br /> <br /> Whereas the shared memory and message passing models focus on how parallel tasks access common data, the [[#Definitions | ''data parallel'']] model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data. <br /> [[#References | Hillis (1986)]] points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks. [[#References | Haveraaen (2000)]] also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. <br /> <br /> == Example of Data Parallel Programming Model ==<br /> <br /> This section shows a simple example adapted from the Solihin textbook (pp. 24 - 27) that illustrates the data-parallel programming model. 
Each of the code fragments below is written in pseudocode.<br /> <br /> <br /> Suppose we want to perform the following task on an array <code>a</code>: replace each element of <code>a</code> with the product of itself and its index, and accumulate the updated elements of <code>a</code> into the variable <code>sum</code>. The corresponding code is shown below.<br /> <br /> <br /> // simple sequential task<br /> sum = 0;<br /> '''for''' (i = 0; i < a.length; i++)<br /> {<br /> a[i] = a[i] * i;<br /> sum = sum + a[i];<br /> }<br /> <br /> <br /> When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (in our example, data reorganization means summing up values across different processing elements). Since the data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The code fragment below is an example of the first part of the program.<br /> <br /> <br /> // data parallel programming: let each PE perform the same task on different pieces of distributed data<br /> pe_id = getid();<br /> my_sum = 0;<br /> '''for''' (i = pe_id; i < a.length; i += number_of_pe) //separate elements of the array are assigned to each PE <br /> {<br /> a[i] = a[i] * i;<br /> my_sum = my_sum + a[i]; //all PEs accumulate elements assigned to them into local variable my_sum<br /> }<br /> <br /> <br /> In the above code, data parallelism is achieved by letting each processing element operate on separate elements of the array, which are identified using the PE's id. 
For instance, if three processing elements are used, one starts at i = 0, one at i = 1, and the last at i = 2. Because there are three processing elements, each one advances its array index by three on every iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved rather than contiguous). If the length of the array is a multiple of three, every processing element handles the same number of elements and so takes roughly the same amount of time to execute its portion of the task.<br /> <br /> <br /> The picture below illustrates how elements of the array are assigned among different PEs for a specific case in which the array has 7 elements and 3 PEs are available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.<br /> <br /> [[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming (adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]<br /> <br /> ==Task Parallel Overview==<br /> The logical opposite of data parallel is [[#Definitions | ''task parallel,'']] in which a number of distinct tasks operate on common data. 
<br /> <br /> ==Example of Task Parallel Programming Model==<br /> An example of task parallel code follows below. Note that it performs a different computation from the example above: it adds arrays ''b'' and ''c'' element-wise into ''a'' and sums the positive elements of ''a.''<br /> <br /> // Task parallel code.<br /> <br /> int id = getmyid(); // assume id = 0 for thread 0, id = 1 for thread 1<br /> <br /> if (id == 0)<br /> {<br /> for (i = 0; i < 8; i++)<br /> {<br /> a[i] = b[i] + c[i];<br /> send_msg(P1, a[i]);<br /> }<br /> }<br /> else<br /> {<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> {<br /> recv_msg(P0, a[i]);<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> }<br /> Print sum;<br /> }<br /> <br /> In the code above, work is divided into two parallel tasks. The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.'' The other sums the positive elements of ''a.'' These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed by each thread is different (rather than identical).<br /> <br /> Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated. This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements. In addition, unique tasks are likely to have significantly different run times, making it more challenging to balance load across processors. [[#References | Haveraaen (2000)]] also notes that task parallel algorithms are inherently more complex, requiring a greater degree of communication and synchronization. In the task parallel code above, after thread 0 computes an element of ''a'' it must send it to thread 1. To support this, sends and receives occur every iteration of the two loops, resulting in a total of 8 messages being sent between the threads. 
In contrast, a message passing version of the data parallel code would send only 2 messages, one at the beginning and one at the end. <br /> <br /> The following diagram may help to distinguish conceptually between data parallelism (SIMD: Single Instruction, Multiple Data) and task parallelism (MIMD: Multiple Instruction, Multiple Data). In the SIMD case, a single instruction stream is issued to multiple processors, each of which accesses the data through its own connection. In contrast, the MIMD case has multiple instruction streams (evidenced by two groups of processors) which, again, access the data through multiple connections.<br /> [[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]<br /> <br /> ==Comparison between Data and Task Parallel Programming Models==<br /> <br /> <br /> {| class="wikitable" border="1" align="center"<br /> |+ '''Comparison between data parallel and task parallel programming models.'''<br /> |-<br /> ! Aspects<br /> ! Data Parallel<br /> ! Task Parallel<br /> |-<br /> | Decomposition<br /> | Partition data into subsets<br /> | Partition program into subtasks<br /> |-<br /> | Parallel tasks<br /> | Identical<br /> | Unique<br /> |-<br /> | Degree of parallelism<br /> | Scales easily<br /> | Fixed<br /> |-<br /> | Load balancing<br /> | Easier<br /> | Harder<br /> |-<br /> | Communication overhead<br /> | Lower<br /> | Higher<br /> |}<br /> <br /> ===Synchronous vs Asynchronous===<br /> While the [http://en.wikipedia.org/wiki/Lockstep_(computing) lockstep] imposed by data parallelism on all data streams ensures synchronous computation (all PEs perform their tasks at the exact same pace), every processor in task parallelism performs its task at its own pace, which we call asynchronous computation. 
Thus, at a certain point of a task parallel program's execution, communication and synchronization primitives are needed to allow different instruction streams to coordinate their efforts, and that is where variable-sharing and message-passing come into play.<br /> <br /> ===Determinism vs. Non-Determinism===<br /> Data parallelism's synchronous nature and task parallelism's asynchronism give rise to another pair of features that add to the difference between these two models: determinism versus non-determinism. Data parallelism is deterministic, i.e., computing with the same input will always yield the same result, since its synchronism ensures that issues like relative timing between PEs will not arise. In contrast, task parallelism's asynchronous updates of common data can give rise to non-determinism, i.e., the same input won't always yield the same computation result (the result of a computation will depend also on factors outside the program's control, such as scheduling and timing of other PEs). Obviously, non-determinism makes it harder to write and maintain correct programs. This partially explains the advantage of the data parallel programming model over the task parallel model in terms of development effort (also discussed in section 4.2).<br /> <br /> <br /> =History of Parallel Programming Models=<br /> ==Interesting Dates in Parallel Computing History (with a focus on IBM, Cray, and ILLIAC)==<br /> ===1950's===<br /> *1955<br /> **IBM introduces the 704. Principal architect is Gene Amdahl; it is the first commercial machine with floating-point hardware, and is capable of approximately 5 kFLOPS. <br /> <br /> *1956<br /> **IBM starts 7030 project (known as STRETCH) to produce supercomputer for Los Alamos National Laboratory (LANL). Its goal is to produce a machine with 100 times the performance of any available at the time. 
<br /> <br /> *1958<br /> **Bull of France announces the Gamma 60 with multiple functional units and fork & join operations in its instruction set. 19 are later built. <br /> **John Cocke and Daniel Slotnick discuss use of parallelism in numerical calculations in an IBM research memo. Slotnick later proposes SOLOMON, a SIMD machine with 1024 1-bit PEs, each with memory for 128 32-bit values. The machine is never built, but the design is the starting point for much later work. <br /> ===1960's===<br /> *1960<br /> **Atlas computer becomes operational. It is the first machine to use virtual memory and paging; its instruction execution is pipelined, and it contains separate fixed- and floating-point arithmetic units, capable of approximately 200 kFLOPS. <br /> **Burroughs introduces the D825 symmetrical MIMD multiprocessor. 1 to 4 CPUs access 1 to 16 memory modules using a crossbar switch. The CPUs are similar to the later B5000; the operating system is symmetrical, with a shared ready queue. <br /> <br /> *1964<br /> **Daniel Slotnick proposes building a massively-parallel machine for the Lawrence Livermore National Laboratory (LLNL); the Atomic Energy Commission gives the contract to CDC instead, who build the STAR-100 to fulfil it. Slotnick's design is funded by the Air Force and evolves into the ILLIAC-IV. The machine is built at the University of Illinois, with Burroughs and Texas Instruments as primary subcontractors. Texas Instruments' Advanced Scientific Computer (ASC) also grows out of this initiative. <br /> <br /> *1966<br /> **Michael Flynn publishes a paper describing the architectural taxonomy which bears his name. <br /> <br /> *1967<br /> **IBM produces the 360/91 (later model 95) with dynamic instruction reordering. 20 of these are produced over the next several years; the line is eventually supplanted by the slower Model <br /> **Gene Amdahl and Daniel Slotnick have a published debate at the AFIPS Conference about the feasibility of parallel processing. 
Amdahl's argument about limits to parallelism becomes known as "Amdahl's Law"; he also propounds a corollary about system balance (sometimes called "Amdahl's Other Law"), which states that a balanced machine has the same number of MIPS, Mbytes, and Mbit/s of I/O bandwidth. <br /> <br /> *1968<br /> **IBM 2938 Array Processor delivered to Western Geophysical (who promptly paint racing stripes on it). First commercial machine to sustain 10 MFLOPS on 32-bit floating-point operations. A programmable digital signal processor, it proves very popular in the petroleum industry. <br /> **Edsger Dijkstra describes semaphores, and introduces the dining philosophers problem, which later becomes a standard example in concurrency theory. <br /> <br /> *1969<br /> **George Paul, M. Wayne Wilson, and Charles Cree begin work at IBM on VECTRAN, an extension to FORTRAN 66 with array-valued operators, functions, and I/O facilities. <br /> **Work begins at Compass Inc. on a parallelizing FORTRAN compiler for the ILLIAC-IV called IVTRAN. <br /> <br /> ===1970's===<br /> *1971<br /> **Intel produces the world's first single-chip CPU, the 4004 microprocessor. <br /> <br /> *1972<br /> **Seymour Cray leaves Control Data Corporation to found Cray Research Inc. CDC cancels the 8600 project, a follow-on to the 7600. <br /> **Quarter-sized (64 PEs) ILLIAC-IV installed at NASA Ames. Each processor has a peak speed of 4 MFLOPS; the machine's I/O system is capable of 500 Mbit/s. <br /> **Paper studies of massive bit-level parallelism done by Stewart Reddaway at ICL. These later lead to development of ICL DAP. <br /> <br /> *1974<br /> **Leslie Lamport's paper "Parallel Execution of Do-Loops" lays the theoretical foundation for most later research on automatic vectorization and shared-memory parallelization. Much of the work was done in 1971-2 while Lamport was at Compass Inc. <br /> **IBM delivers the first 3838 array processor, a general-purpose digital signal processor. 
<br /> <br /> *1975<br /> **ILLIAC-IV becomes operational at NASA Ames after concerted check-out effort. <br /> <br /> *1976<br /> **Cray Research delivers the first Freon-cooled CRAY-1 to Los Alamos National Laboratory. <br /> <br /> *1979<br /> **IBM's John Cocke designs the 801, the first of what are later called RISC architectures. <br /> <br /> ===1980's===<br /> *1980<br /> **PFC (Parallel FORTRAN Compiler) developed at Rice University under the direction of Ken Kennedy. <br /> **David Padua and David Kuck at the University of Illinois develop the DOACROSS parallel construct to be used as a target in program transformation. The name DOACROSS is due to Robert Kuhn. <br /> <br /> *1982<br /> **Steve Chen's group at Cray Research produces the first X-MP, containing two pipelined processors compatible with the CRAY-1 and shared memory. <br /> **ILLIAC-IV decommissioned. <br /> <br /> *1983<br /> **J. R. Allen's Ph.D. thesis at Rice University introduces the concepts of loop-carried and loop-independent dependencies, and formalizes the process of vectorization. <br /> **Scientific Computer Systems founded to design and market Cray-compatible minisupercomputers. <br /> **CRAY-1 with 1 processor achieves 12.5 MFLOPS on the 100x100 LINPACK benchmark. <br /> <br /> *1984<br /> **The CRAY X-MP family is expanded to include 1- and 4-processor machines. A CRAY X-MP running CX-OS, the first Unix-like operating system for supercomputers, is delivered to NASA Ames. <br /> **CRAY X-MP with 1 processor achieves 21 MFLOPS on 100x100 LINPACK. <br /> <br /> *1985<br /> **Cray Research produces the CRAY-2, with four background processors, a single foreground processor, a 4.1 nsec clock cycle, and 256 Mword memory. The machine is cooled by an inert fluorocarbon previously used as a blood substitute. <br /> <br /> *1986<br /> **CRAY X-MP with 4 processors achieves 713 MFLOPS (against a peak of 840) on 1000x1000 LINPACK. 
<br /> **Alan Karp offers $100 prize to first person to demonstrate speedup of 200 or more on general purpose parallel processor. Benner, Gustafson, and Montry begin work to win it, and are later awarded the Gordon Bell Prize. <br /> <br /> *1987<br /> **The first Gordon Bell Prizes for parallel performance are awarded. The recipients are Benner, Gustafson, and Montry, for a speedup of 400-600 on a variety of applications running on a 1024-node nCUBE, and Chen, De Benedictis, Fox, Li, and Walker, for speedups of 39-458 on various hypercubes. <br /> <br /> *1988<br /> **John Gustafson and Gary Montry argue that Amdahl's Law can be invalidated by increasing problem size. <br /> **CRAY Y-MP with 1 processor achieves 74 MFLOPS on 100x100 LINPACK; the same machine with 8 processors achieves 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> <br /> *1989<br /> **CRAY Y-MP with 8 processors achieves 275 MFLOPS on 100x100 LINPACK, and 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> **Gordon Bell Prize for absolute performance awarded to a team from Mobil and Thinking Machines Corporation, who achieve 6 GFLOPS on a CM-2 Connection Machine; prize in price/performance category awarded to Emeagwali, who achieves 400 MFLOPS per million dollars on the same platform. <br /> **Seymour Cray leaves Cray Research to found Cray Computer Corporation. <br /> <br /> ===1990's===<br /> *1990<br /> **Cray Research, Inc., purchases Supertek Computers Inc., makers of the S-1, a minisupercomputer compatible with the CRAY X-MP. <br /> **Gordon Bell Prize in price/performance category awarded to Geist, Stocks, Ginatempo, and Shelton, who achieve 800 MFLOPS per million dollars in a high-temperature superconductivity program on a 128-node Intel iPSC/860. The prize in the compiler parallelization category is awarded to Sabot, Tennies, and Vasilevsky, who achieve 1.5 GFLOPS on a CM-2 Connection Machine with FORTRAN 90 code derived from FORTRAN 77. 
<br /> **National Energy Research Supercomputer Center (NERSC) at LLNL places order with Cray Computer Corporation for CRAY-3 supercomputer. The order includes a unique 8-processor CRAY-2 computer system that is installed in April. <br /> <br /> *1991<br /> **CRAY Y-MP C90 with 16 processors achieves 403 MFLOPS on 100x100 LINPACK; a Fujitsu VP-2600 with 1 processor achieves 4 GFLOPS (against a peak of 5 GFLOPS) on 1000x1000 LINPACK. <br /> <br /> *1993<br /> **Cray Research delivers a Y-MP M90 with 32 Gbyte of memory to the U.S. Government, after delivering a similar machine with 8 Gbyte of memory in the previous year to the Minnesota Supercomputer Center. <br /> <br /> ===References===<br /> http://ei.cs.vt.edu/~history/Parallel.html<br /> <br /> ==Vector Machines==<br /> <br /> First appearing in the 1970s, vector machines were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.<br /> <br /> The Solomon project at Westinghouse was one of the first machines to use vector operations. Its CPU had a large number of ALUs that would each be fed different data each cycle. Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.<br /> <br /> An innovation came with the Cray-1 supercomputer in 1976. It was realized that large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray computer was only able to load 12 data points, but by completing multiple instructions before continuing, the total number of memory accesses decreased. 
The Cray-1 could perform at 240 MFLOPS.<br /> <br /> ===References for this section===<br /> *Wikipedia, Vector processor http://en.wikipedia.org/w/index.php?title=Vector_processor&oldid=405209552<br /> *Wikipedia, Cray-1 http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730<br /> <br /> <br /> <br /> <br /> =Comparing the Data Parallel Model with the Shared Memory and Message Passing Models=<br /> <br /> Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References | Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.<br /> <br /> As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.<br /> <br /> Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. 
[[#Definitions | ''SIMD (single-instruction-multiple-data)'']] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix: C for CUDA Example Code | Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model&mdash;like the message passing model&mdash;does not require hardware support.<br /> <br /> Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.<br /> <br /> Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. 
The regular data chunks also make it easier to reason about where to locate data and how to organize it.<br /> <br /> =Definitions=<br /> <br /> * ''Data parallel.'' A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.<br /> * ''Task parallel.'' A task parallel algorithm is composed of a set of differing tasks which operate on common data.<br /> * ''SIMD (single-instruction-multiple-data).'' A processor which executes a single instruction simultaneously on multiple data locations.<br /> <br /> =References=<br /> <br /> * David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.<br /> * Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.<br /> * Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 "Machine and collection abstractions for user-implemented data-parallel programming,"] ''Scientific Programming,'' 8(4):231-246, 2000.<br /> * W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 "Data parallel algorithms,"] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.<br /> * Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 "A comparison of message passing and shared memory architectures for data parallel programs,"] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.<br /> * Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.<br /> <br /> =Appendix: C for CUDA Example Code=<br /> <br /> The following code is a data parallel implementation of the sequential Code 2.3 from [[#References | Solihin (2008)]] using [http://www.nvidia.com/object/cuda_learn.html C for CUDA]. 
It is presented to give an impression of programming for a SIMD architecture, but a detailed discussion is beyond the scope of this supplement. Ignoring memory allocation issues, the code is very similar to the data parallel example, Code 2.5 from [[#References | Solihin (2008)]], discussed earlier. The main difference is the presence of a control thread that sends the parallel tasks to the CUDA device.<br /> <br /> // Data parallel implementation of the example code using C for CUDA.<br /> <br /> #include <iostream><br /> <br /> __global__ void kernel(float* a, float* b, float* c, float* local_sum)<br /> {<br /> int id = threadIdx.x;<br /> int local_iter = 4;<br /> int start_iter = id * local_iter;<br /> int end_iter = start_iter + local_iter;<br /> <br /> // Begin data parallel section<br /> <br /> for (int i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum[id] = 0;<br /> for (int i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum[id] = local_sum[id] + a[i];<br /> <br /> // End data parallel section<br /> }<br /> <br /> int main()<br /> {<br /> float h_a[8], h_b[8], h_c[8], h_sum[2];<br /> float *d_a, *d_b, *d_c, *d_sum;<br /> float sum;<br /> <br /> size_t size = 8 * sizeof(float);<br /> size_t size2 = 2 * sizeof(float);<br /> <br /> cudaMalloc((void**)&d_a, size);<br /> cudaMalloc((void**)&d_b, size);<br /> cudaMalloc((void**)&d_c, size);<br /> cudaMalloc((void**)&d_sum, size2);<br /> <br /> cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);<br /> cudaMemcpy(d_c, h_c, size, cudaMemcpyHostToDevice);<br /> <br /> kernel<<<1, 2>>>(d_a, d_b, d_c, d_sum);<br /> <br /> cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);<br /> cudaMemcpy(h_sum, d_sum, size2, cudaMemcpyDeviceToHost);<br /> <br /> sum = h_sum[0] + h_sum[1];<br /> std::cout << sum;<br /> <br /> cudaFree(d_a);<br /> cudaFree(d_b);<br /> cudaFree(d_c);<br /> cudaFree(d_sum);<br /> }</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_cl&diff=43589 CSC/ECE 506 Spring 2011/ch2 cl 2011-01-31T22:08:51Z

<p>Cslingaf: </p> <hr /> <div>=Supplement to Chapter 2: The Data Parallel Programming Model=<br /> <br /> Chapter 2 of [[#References | Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [[#Definitions | ''data parallel'']] model, another commonly recognized parallel programming model covered in other treatments like [[#References | Foster (1995)]] and [[#References | Culler (1999)]]. Whereas the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can therefore be used in conjunction with either. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References | Solihin (2008)]]. The [[#Definitions | ''task parallel'']] model will also be introduced as a point of contrast.<br /> <br /> =Data Parallel Overview=<br /> <br /> Whereas the shared memory and message passing models focus on how parallel tasks access common data, the [[#Definitions | ''data parallel'']] model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data. <br /> [[#References | Hillis (1986)]] points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks. [[#References | Haveraaen (2000)]] also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. <br /> <br /> == Example of Data Parallel Programming Model ==<br /> <br /> This section shows a simple example adapted from the Solihin textbook (pp. 24 - 27) that illustrates the data-parallel programming model. 
Each of the code fragments below is written in pseudo-code style.<br /> <br /> <br /> Suppose we want to perform the following task on an array <code>a</code>: updating each element of <code>a</code> by the product of itself and its index, and adding together the elements of <code>a</code> into the variable <code>sum</code>. The corresponding code is shown below.<br /> <br /> <br /> // simple sequential task<br /> sum = 0;<br /> '''for''' (i = 0; i < a.length; i++)<br /> {<br /> a[i] = a[i] * i;<br /> sum = sum + a[i];<br /> }<br /> <br /> <br /> When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (in our example, data reorganization is summing up values across different processing elements). Since the data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples for the first part of the program, shared-memory version of the second part, and message passing for the second part, respectively.<br /> <br /> <br /> // data parallel programming: let each PE perform the same task on different pieces of distributed data<br /> pe_id = getid();<br /> my_sum = 0;<br /> '''for''' (i = pe_id; i < a.length; i += number_of_pe) //separate elements of the array are assigned to each PE <br /> {<br /> a[i] = a[i] * i;<br /> my_sum = my_sum + a[i]; //all PEs accumulate elements assigned to them into local variable my_sum<br /> }<br /> <br /> <br /> In the above code, data parallelism is achieved by letting each processing element perform actions on separate elements of the array, which are identified using the PE's id. 
For instance, if three processing elements are used then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements, the index of the array for each increases by three on each iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved instead of contiguous). If the length of the array is a multiple of three, then each processing element takes the same amount of time to execute its portion of the task.<br /> <br /> <br /> The picture below illustrates how elements of the array are assigned among different PEs for the specific case in which the length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.<br /> <br /> [[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming (adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]<br /> <br /> ==Task Parallel Overview==<br /> The logical opposite of data parallel is [[#Definitions | ''task parallel,'']] in which a number of distinct tasks operate on common data. 
<br /> <br /> ==Example of Task Parallel Programming Model==<br /> An example of a task parallel code which is functionally equivalent to the sequential and data parallel codes given above follows below.<br /> <br /> // Task parallel code.<br /> <br /> int id = getmyid(); // assume id = 0 for thread 0, id = 1 for thread 1<br /> <br /> if (id == 0)<br /> {<br /> for (i = 0; i < 8; i++)<br /> {<br /> a[i] = b[i] + c[i];<br /> send_msg(P1, a[i]);<br /> }<br /> }<br /> else<br /> {<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> {<br /> recv_msg(P0, a[i]);<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> }<br /> Print sum;<br /> }<br /> <br /> In the code above, work is divided into two parallel tasks. The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.'' The other sums the elements of ''a.'' These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed by each thread is different (rather than identical).<br /> <br /> Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated. This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements. In addition, unique tasks are likely to have significantly different run times, making it more challenging to balance load across processors. [[#References | Haveraaen (2000)]] also notes that task parallel algorithms are inherently more complex, requiring a greater degree of communication and synchronization. In the task parallel code above, after thread 0 computes an element of ''a'' it must send it to thread 1. To support this, sends and receives occur every iteration of the two loops, resulting in a total of 8 messages being sent between the threads. 
In contrast, the data parallel code sends only 2 messages, one at the beginning and one at the end. <br /> <br /> The following diagram may help in conceptually distinguishing between data parallelism (SIMD: Single Instruction, Multiple Data) and task parallelism (MIMD: Multiple Instruction, Multiple Data). In the SIMD case, a single instruction stream is issued to multiple processors, each of which has its own connection to the data. In contrast, the MIMD case has multiple instruction streams (evidenced by two groups of processors) which, again, interact with the data through multiple connections.<br /> [[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]<br /> <br /> <br /> The table below summarizes the key differences between data parallel and task parallel programming models.<br /> <br /> {| class="wikitable" border="1" align="center"<br /> |+ '''Comparison between data parallel and task parallel programming models.'''<br /> |-<br /> ! Aspects<br /> ! Data Parallel<br /> ! Task Parallel<br /> |-<br /> | Decomposition<br /> | Partition data into subsets<br /> | Partition program into subtasks<br /> |-<br /> | Parallel tasks<br /> | Identical<br /> | Unique<br /> |-<br /> | Degree of parallelism<br /> | Scales easily<br /> | Fixed<br /> |-<br /> | Load balancing<br /> | Easier<br /> | Harder<br /> |-<br /> | Communication overhead<br /> | Lower<br /> | Higher<br /> |}<br /> <br /> ==Synchronous vs Asynchronous==<br /> While the [http://en.wikipedia.org/wiki/Lockstep_(computing) lockstep] imposed by data parallelism on all data streams ensures synchronous computation (all PEs perform their tasks at the exact same pace), every processor in task parallelism performs its task at its own pace, which we call asynchronous computation. 
Thus, at a certain point of a task parallel program's execution, communication and synchronization primitives are needed to allow different instruction streams to coordinate their efforts, and that is where variable-sharing and message-passing come into play.<br /> <br /> == Determinism vs. Non-Determinism ==<br /> Data parallelism's synchronous nature and task parallelism's asynchronism give rise to another pair of features that add to the difference between these two models: determinism versus non-determinism. Data parallelism is deterministic, i.e., computing with the same input will always yield the same result, since its synchronism ensures that issues like relative timing between PEs will not arise. In contrast, task parallelism's asynchronous updates of common data can give rise to non-determinism, i.e., the same input won't always yield the same computation result (the result of a computation will depend also on factors outside the program's control, such as scheduling and timing of other PEs). Obviously, non-determinism makes it harder to write and maintain correct programs. This partially explains the advantage of the data parallel programming model over the task parallel model in terms of development effort (also discussed in section 4.2).<br /> <br /> <br /> =History of Parallel Programming Models=<br /> ==Interesting Dates in Parallel Computing History (with a focus on IBM, Cray, and ILLIAC)==<br /> ===1950's===<br /> *1955<br /> **IBM introduces the 704. Principal architect is Gene Amdahl; it is the first commercial machine with floating-point hardware, and is capable of approximately 5 kFLOPS. <br /> <br /> *1956<br /> **IBM starts 7030 project (known as STRETCH) to produce supercomputer for Los Alamos National Laboratory (LANL). Its goal is to produce a machine with 100 times the performance of any available at the time. 
<br /> <br /> *1958<br /> **Bull of France announces the Gamma 60 with multiple functional units and fork & join operations in its instruction set. 19 are later built. <br /> **John Cocke and Daniel Slotnick discuss use of parallelism in numerical calculations in an IBM research memo. Slotnick later proposes SOLOMON, a SIMD machine with 1024 1-bit PEs, each with memory for 128 32-bit values. The machine is never built, but the design is the starting point for much later work. <br /> ===1960's===<br /> *1960<br /> **Atlas computer becomes operational. It is the first machine to use virtual memory and paging; its instruction execution is pipelined, and it contains separate fixed- and floating-point arithmetic units, capable of approximately 200 kFLOPS. <br /> **Burroughs introduces the D825 symmetrical MIMD multiprocessor. 1 to 4 CPUs access 1 to 16 memory modules using a crossbar switch. The CPUs are similar to the later B5000; the operating system is symmetrical, with a shared ready queue. <br /> <br /> *1964<br /> **Daniel Slotnick proposes building a massively-parallel machine for the Lawrence Livermore National Laboratory (LLNL); the Atomic Energy Commission gives the contract to CDC instead, who build the STAR-100 to fulfil it. Slotnick's design is funded by the Air Force and evolves into the ILLIAC-IV. The machine is built at the University of Illinois, with Burroughs and Texas Instruments as primary subcontractors. Texas Instruments' Advanced Scientific Computer (ASC) also grows out of this initiative. <br /> <br /> *1966<br /> **Michael Flynn publishes a paper describing the architectural taxonomy which bears his name. <br /> <br /> *1967<br /> **IBM produces the 360/91 (later model 95) with dynamic instruction reordering. 20 of these are produced over the next several years; the line is eventually supplanted by the slower Model <br /> **Gene Amdahl and Daniel Slotnick have a published debate at the AFIPS Conference about the feasibility of parallel processing. 
Amdahl's argument about limits to parallelism becomes known as "Amdahl's Law"; he also propounds a corollary about system balance (sometimes called "Amdahl's Other Law"), which states that a balanced machine has the same number of MIPS, Mbytes, and Mbit/s of I/O bandwidth. <br /> <br /> *1968<br /> **IBM 2938 Array Processor delivered to Western Geophysical (who promptly paint racing stripes on it). First commercial machine to sustain 10 MFLOPS on 32-bit floating-point operations. A programmable digital signal processor, it proves very popular in the petroleum industry. <br /> **Edsger Dijkstra describes semaphores, and introduces the dining philosophers problem, which later becomes a standard example in concurrency theory. <br /> <br /> *1969<br /> **George Paul, M. Wayne Wilson, and Charles Cree begin work at IBM on VECTRAN, an extension to FORTRAN 66 with array-valued operators, functions, and I/O facilities. <br /> **Work begins at Compass Inc. on a parallelizing FORTRAN compiler for the ILLIAC-IV called IVTRAN. <br /> <br /> ===1970's===<br /> *1971<br /> **Intel produces the world's first single-chip CPU, the 4004 microprocessor. <br /> <br /> *1972<br /> **Seymour Cray leaves Control Data Corporation to found Cray Research Inc. CDC cancels the 8600 project, a follow-on to the 7600. <br /> **Quarter-sized (64 PEs) ILLIAC-IV installed at NASA Ames. Each processor has a peak speed of 4 MFLOPS; the machine's I/O system is capable of 500 Mbit/s. <br /> **Paper studies of massive bit-level parallelism done by Stewart Reddaway at ICL. These later lead to development of ICL DAP. <br /> <br /> *1974<br /> **Leslie Lamport's paper "Parallel Execution of Do-Loops" lays the theoretical foundation for most later research on automatic vectorization and shared-memory parallelization. Much of the work was done in 1971-2 while Lamport was at Compass Inc. <br /> **IBM delivers the first 3838 array processor, a general-purpose digital signal processor. 
<br /> <br /> *1975<br /> **ILLIAC-IV becomes operational at NASA Ames after concerted check-out effort. <br /> <br /> *1976<br /> **Cray Research delivers the first Freon-cooled CRAY-1 to Los Alamos National Laboratory. <br /> <br /> *1979<br /> **IBM's John Cocke designs the 801, the first of what are later called RISC architectures. <br /> <br /> ===1980's===<br /> *1980<br /> **PFC (Parallel FORTRAN Compiler) developed at Rice University under the direction of Ken Kennedy. <br /> **David Padua and David Kuck at the University of Illinois develop the DOACROSS parallel construct to be used as a target in program transformation. The name DOACROSS is due to Robert Kuhn. <br /> <br /> *1982<br /> **Steve Chen's group at Cray Research produces the first X-MP, containing two pipelined processors compatible with the CRAY-1 and shared memory. <br /> **ILLIAC-IV decommissioned. <br /> <br /> *1983<br /> **J. R. Allen's Ph.D. thesis at Rice University introduces the concepts of loop-carried and loop-independent dependencies, and formalizes the process of vectorization. <br /> **Scientific Computer Systems founded to design and market Cray-compatible minisupercomputers. <br /> **CRAY-1 with 1 processor achieves 12.5 MFLOPS on the 100x100 LINPACK benchmark. <br /> <br /> *1984<br /> **The CRAY X-MP family is expanded to include 1- and 4-processor machines. A CRAY X-MP running CX-OS, the first Unix-like operating system for supercomputers, is delivered to NASA Ames. <br /> **CRAY X-MP with 1 processor achieves 21 MFLOPS on 100x100 LINPACK. <br /> <br /> *1985<br /> **Cray Research produces the CRAY-2, with four background processors, a single foreground processor, a 4.1 nsec clock cycle, and 256 Mword memory. The machine is cooled by an inert fluorocarbon previously used as a blood substitute. <br /> <br /> *1986<br /> **CRAY X-MP with 4 processors achieves 713 MFLOPS (against a peak of 840) on 1000x1000 LINPACK. 
<br /> **Alan Karp offers $100 prize to first person to demonstrate speedup of 200 or more on general purpose parallel processor. Benner, Gustafson, and Montry begin work to win it, and are later awarded the Gordon Bell Prize. <br /> <br /> *1987<br /> **The first Gordon Bell Prizes for parallel performance are awarded. The recipients are Benner, Gustafson, and Montry, for a speedup of 400-600 on a variety of applications running on a 1024-node nCUBE, and Chen, De Benedictis, Fox, Li, and Walker, for speedups of 39-458 on various hypercubes. <br /> <br /> *1988<br /> **John Gustafson and Gary Montry argue that Amdahl's Law can be invalidated by increasing problem size. <br /> **CRAY Y-MP with 1 processor achieves 74 MFLOPS on 100x100 LINPACK; the same machine with 8 processors achieves 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> <br /> *1989<br /> **CRAY Y-MP with 8 processors achieves 275 MFLOPS on 100x100 LINPACK, and 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> **Gordon Bell Prize for absolute performance awarded to a team from Mobil and Thinking Machines Corporation, who achieve 6 GFLOPS on a CM-2 Connection Machine; prize in price/performance category awarded to Emeagwali, who achieves 400 MFLOPS per million dollars on the same platform. <br /> **Seymour Cray leaves Cray Research to found Cray Computer Corporation. <br /> <br /> ===1990's===<br /> *1990<br /> **Cray Research, Inc., purchases Supertek Computers Inc., makers of the S-1, a minisupercomputer compatible with the CRAY X-MP. <br /> **Gordon Bell Prize in price/performance category awarded to Geist, Stocks, Ginatempo, and Shelton, who achieve 800 MFLOPS per million dollars in a high-temperature superconductivity program on a 128-node Intel iPSC/860. The prize in the compiler parallelization category is awarded to Sabot, Tennies, and Vasilevsky, who achieve 1.5 GFLOPS on a CM-2 Connection Machine with FORTRAN 90 code derived from FORTRAN 77. 
<br /> **National Energy Research Supercomputer Center (NERSC) at LLNL places order with Cray Computer Corporation for CRAY-3 supercomputer. The order includes a unique 8-processor CRAY-2 computer system that is installed in April. <br /> <br /> *1991<br /> **CRAY Y-MP C90 with 16 processors achieves 403 MFLOPS on 100x100 LINPACK; a Fujitsu VP-2600 with 1 processor achieves 4 GFLOPS (against a peak of 5 GFLOPS) on 1000x1000 LINPACK. <br /> <br /> *1993<br /> **Cray Research delivers a Y-MP M90 with 32 Gbyte of memory to the U.S. Government, after delivering a similar machine with 8 Gbyte of memory in the previous year to the Minnesota Supercomputer Center. <br /> <br /> ===References===<br /> http://ei.cs.vt.edu/~history/Parallel.html<br /> <br /> ==Vector Machines==<br /> <br /> First appearing in the 1970s, vector machines were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.<br /> <br /> The Solomon project at Westinghouse was one of the first efforts to build a machine around vector operations. Its CPU had a large number of ALUs that would each be fed different data each cycle. Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.<br /> <br /> An innovation came with the Cray-1 supercomputer in 1976. Its designers recognized that large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray computer was only able to load 12 data points, but by completing multiple instructions before continuing, it decreased the total number of memory accesses. 
The Cray-1 could perform at 240 MFLOPS.<br /> <br /> ===References for this section===<br /> *Wikipedia, Vector processor http://en.wikipedia.org/w/index.php?title=Vector_processor&oldid=405209552<br /> *Wikipedia, Cray-1 http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730<br /> <br /> <br /> <br /> <br /> =Comparing the Data Parallel Model with the Shared Memory and Message Passing Models=<br /> <br /> Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References | Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.<br /> <br /> As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.<br /> <br /> Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. 
[[#Definitions | ''SIMD (single-instruction-multiple-data)'']] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix: C for CUDA Example Code | Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model&mdash;like the message passing model&mdash;does not require hardware support.<br /> <br /> Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.<br /> <br /> Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. 
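One way to split data into regular chunks is a block partition, sketched below in C. This is only an illustration under stated assumptions: the names `chunk_t` and `block_chunk` are hypothetical and not taken from any of the cited sources, `num_threads` is assumed to be at least 1, and any remainder is spread over the lowest-numbered threads.

```c
#include <stddef.h>

/* Half-open index range [start, end) owned by one thread.
 * chunk_t is a hypothetical helper type for this sketch. */
typedef struct {
    size_t start;
    size_t end;
} chunk_t;

/* Block partition of n elements over num_threads threads (assumed >= 1).
 * Each thread gets n / num_threads elements; the first n % num_threads
 * threads get one extra element so that chunk sizes differ by at most one. */
chunk_t block_chunk(size_t n, size_t num_threads, size_t id)
{
    size_t base = n / num_threads;
    size_t extra = n % num_threads;
    chunk_t c;

    c.start = id * base + (id < extra ? id : extra);
    c.end = c.start + base + (id < extra ? 1 : 0);
    return c;
}
```

With 8 elements and 2 threads this yields the ranges [0, 4) and [4, 8), the same bounds that the appendix kernel computes as `start_iter` and `end_iter`.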
The regular data chunks also make it easier to reason about where to locate data and how to organize it.<br /> <br /> =Definitions=<br /> <br /> * ''Data parallel.'' A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.<br /> * ''Task parallel.'' A task parallel algorithm is composed of a set of differing tasks which operate on common data.<br /> * ''SIMD (single-instruction-multiple-data).'' A processor which executes a single instruction simultaneously on multiple data locations.<br /> <br /> =References=<br /> <br /> * David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.<br /> * Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.<br /> * Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 "Machine and collection abstractions for user-implemented data-parallel programming,"] ''Scientific Programming,'' 8(4):231-246, 2000.<br /> * W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 "Data parallel algorithms,"] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.<br /> * Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 "A comparison of message passing and shared memory architectures for data parallel programs,"] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.<br /> * Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.<br /> <br /> =Appendix: C for CUDA Example Code=<br /> <br /> The following code is a data parallel implementation of the sequential Code 2.3 from [[#References | Solihin (2008)]] using [http://www.nvidia.com/object/cuda_learn.html C for CUDA]. 
It is presented to give an impression of programming for a SIMD architecture, but a detailed discussion is beyond the scope of this supplement. Ignoring memory allocation issues, the code is very similar to the data parallel example, Code 2.5 from [[#References | Solihin (2008)]], discussed earlier. The main difference is the presence of a control thread that sends the parallel tasks to the CUDA device.<br /> <br /> // Data parallel implementation of the example code using C for CUDA.<br /> <br /> #include <iostream><br /> <br /> // Each device thread handles one contiguous chunk of 4 elements.<br /> __global__ void kernel(float* a, float* b, float* c, float* local_sum)<br /> {<br /> int id = threadIdx.x;<br /> int local_iter = 4;<br /> int start_iter = id * local_iter;<br /> int end_iter = start_iter + local_iter;<br /> <br /> // Begin data parallel section<br /> <br /> for (int i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum[id] = 0;<br /> for (int i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum[id] = local_sum[id] + a[i];<br /> <br /> // End data parallel section<br /> }<br /> <br /> int main()<br /> {<br /> float h_a[8], h_b[8], h_c[8], h_sum[2];<br /> float *d_a, *d_b, *d_c, *d_sum;<br /> float sum;<br /> <br /> // Placeholder input values so the host arrays are not copied uninitialized.<br /> for (int i = 0; i < 8; i++)<br /> {<br /> h_b[i] = (float)i;<br /> h_c[i] = 1.0f;<br /> }<br /> <br /> size_t size = 8 * sizeof(float);<br /> size_t size2 = 2 * sizeof(float);<br /> <br /> // Allocate device memory for the arrays and the per-thread partial sums.<br /> cudaMalloc((void**)&d_a, size);<br /> cudaMalloc((void**)&d_b, size);<br /> cudaMalloc((void**)&d_c, size);<br /> cudaMalloc((void**)&d_sum, size2);<br /> <br /> cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);<br /> cudaMemcpy(d_c, h_c, size, cudaMemcpyHostToDevice);<br /> <br /> // Launch one block of two threads.<br /> kernel<<<1, 2>>>(d_a, d_b, d_c, d_sum);<br /> <br /> cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);<br /> cudaMemcpy(h_sum, d_sum, size2, cudaMemcpyDeviceToHost);<br /> <br /> // Combine the two partial sums on the host.<br /> sum = h_sum[0] + h_sum[1];<br /> std::cout << sum << std::endl;<br /> <br /> cudaFree(d_a);<br /> cudaFree(d_b);<br /> cudaFree(d_c);<br /> cudaFree(d_sum);<br /> }</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_cl&diff=43588 CSC/ECE 506 Spring 2011/ch2 cl 2011-01-31T20:37:11Z

<p>Cslingaf: </p> <hr /> <div>=Supplement to Chapter 2: The Data Parallel Programming Model=<br /> <br /> Chapter 2 of [[#References | Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [[#Definitions | ''data parallel'']] model, another commonly recognized parallel programming model covered in other treatments like [[#References | Foster (1995)]] and [[#References | Culler (1999)]]. Whereas the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can therefore be used in conjunction with either. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References | Solihin (2008)]]. The [[#Definitions | ''task parallel'']] model will also be introduced briefly as a point of contrast.<br /> <br /> =Data Parallel Overview=<br /> <br /> Whereas the shared memory and message passing models focus on how parallel tasks access common data, the [[#Definitions | ''data parallel'']] model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data. <br /> [[#References | Hillis (1986)]] points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks. [[#References | Haveraaen (2000)]] also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. <br /> <br /> == Example of Data Parallel Programming Model ==<br /> <br /> This section shows a simple example adapted from the Solihin textbook (pp. 24 - 27) that illustrates the data-parallel programming model. 
Each code fragment below is written in pseudo-code style.<br /> <br /> <br /> Suppose we want to perform the following task on an array <code>a</code>: replace each element of <code>a</code> with the product of itself and its index, and add the elements of <code>a</code> together into the variable <code>sum</code>. The corresponding code is shown below.<br /> <br /> <br /> // simple sequential task<br /> sum = 0;<br /> '''for''' (i = 0; i < a.length; i++)<br /> {<br /> a[i] = a[i] * i;<br /> sum = sum + a[i];<br /> }<br /> <br /> <br /> When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (in our example, data reorganization is summing up values across different processing elements). Since the data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The code fragment below is an example of the first part of the program; the second part is not shown.<br /> <br /> <br /> // data parallel programming: let each PE perform the same task on different pieces of distributed data<br /> pe_id = getid();<br /> my_sum = 0;<br /> '''for''' (i = pe_id; i < a.length; i += number_of_pe) //separate elements of the array are assigned to each PE <br /> {<br /> a[i] = a[i] * i;<br /> my_sum = my_sum + a[i]; //all PEs accumulate elements assigned to them into local variable my_sum<br /> }<br /> <br /> <br /> In the above code, data parallelism is achieved by letting each processing element operate on separate elements of the array, which are identified using the PE's id. 
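The interleaved assignment in the pseudo-code above can be sketched in plain C. This is a minimal, sequential stand-in rather than the textbook's code: the function names `pe_sum_cyclic` and `data_parallel_sum` are hypothetical, and the PEs are simulated one after another by an ordinary loop instead of running in parallel.

```c
#include <stddef.h>

/* Work done by a single PE: update and accumulate every num_pe-th element,
 * starting at index pe_id. pe_sum_cyclic is a hypothetical name; num_pe is
 * assumed to be at least 1. */
double pe_sum_cyclic(double *a, size_t n, size_t pe_id, size_t num_pe)
{
    double my_sum = 0.0;

    for (size_t i = pe_id; i < n; i += num_pe) {
        a[i] = a[i] * (double)i;  /* first part: same operation on this PE's elements */
        my_sum += a[i];           /* local accumulation into my_sum */
    }
    return my_sum;
}

/* Sequential simulation of all PEs followed by the reduction step
 * (the "second part"), here a plain addition of the partial sums. */
double data_parallel_sum(double *a, size_t n, size_t num_pe)
{
    double sum = 0.0;

    for (size_t pe_id = 0; pe_id < num_pe; pe_id++)
        sum += pe_sum_cyclic(a, n, pe_id, num_pe);
    return sum;
}
```

For a 7-element array of ones and 3 PEs, PE0 handles indices 0, 3, 6 (partial sum 9), PE1 handles indices 1, 4 (partial sum 5), and PE2 handles indices 2, 5 (partial sum 7), giving a total of 21.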
For instance, if three processing elements are used, then one processing element starts at i = 0, one starts at i = 1, and the last starts at i = 2. Since there are three processing elements, the array index for each increases by three on each iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved rather than contiguous). If the length of the array is a multiple of three, then each processing element performs the same number of iterations for its portion of the task.<br /> <br /> <br /> The picture below illustrates how elements of the array are assigned among different PEs for a specific case: the length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indices (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.<br /> <br /> [[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming (adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]<br /> <br /> ==Task Parallel Overview==<br /> The logical opposite of data parallel is [[#Definitions | ''task parallel,'']] in which a number of distinct tasks operate on common data. 
<br /> <br /> ==Example of Task Parallel Programming Model==<br /> An example of a task parallel code which is functionally equivalent to the sequential and data parallel codes given above follows below.<br /> <br /> // Task parallel code.<br /> <br /> int id = getmyid(); // assume id = 0 for thread 0, id = 1 for thread 1<br /> <br /> if (id == 0)<br /> {<br /> for (i = 0; i < 8; i++)<br /> {<br /> a[i] = b[i] + c[i];<br /> send_msg(P1, a[i]);<br /> }<br /> }<br /> else<br /> {<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> {<br /> recv_msg(P0, a[i]);<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> }<br /> Print sum;<br /> }<br /> <br /> In the code above, work is divided into two parallel tasks. The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.'' The other sums the elements of ''a.'' These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed by each thread is different (rather than identical).<br /> <br /> Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated. This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements. In addition, unique tasks are likely to have significantly different run times, making it more challenging to balance load across processors. [[#References | Haveraaen (2000)]] also notes that task parallel algorithms are inherently more complex, requiring a greater degree of communication and synchronization. In the task parallel code above, after thread 0 computes an element of ''a'' it must send it to thread 1. To support this, sends and receives occur every iteration of the two loops, resulting in a total of 8 messages being sent between the threads. 
In contrast, the data parallel code sends only 2 messages, one at the beginning and one at the end. <br /> <br /> The following diagram may help to distinguish conceptually between data parallelism (SIMD: Single Instruction, Multiple Data) and task parallelism (MIMD: Multiple Instruction, Multiple Data). In the SIMD case, a single instruction stream is issued to multiple processors, which then access the data through multiple connections. In contrast, the MIMD case has multiple instruction streams (shown as two groups of processors), which again access the data through multiple connections.<br /> [[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]<br /> <br /> <br /> The table below summarizes the key differences between data parallel and task parallel programming models.<br /> <br /> {| class="wikitable" border="1" align="center"<br /> |+ '''Comparison between data parallel and task parallel programming models.'''<br /> |-<br /> ! Aspects<br /> ! Data Parallel<br /> ! Task Parallel<br /> |-<br /> | Decomposition<br /> | Partition data into subsets<br /> | Partition program into subtasks<br /> |-<br /> | Parallel tasks<br /> | Identical<br /> | Unique<br /> |-<br /> | Degree of parallelism<br /> | Scales easily<br /> | Fixed<br /> |-<br /> | Load balancing<br /> | Easier<br /> | Harder<br /> |-<br /> | Communication overhead<br /> | Lower<br /> | Higher<br /> |}<br /> <br /> ==Synchronous vs Asynchronous==<br /> While the [http://en.wikipedia.org/wiki/Lockstep_(computing) lockstep] imposed by data parallelism on all data streams ensures synchronous computation (all PEs perform their tasks at exactly the same pace), every processor in task parallelism performs its task at its own pace, which we call asynchronous computation. 
Thus, at a certain point of a task parallel program's execution, communication and synchronization primitives are needed to allow different instruction streams to coordinate their efforts, and that is where variable-sharing and message-passing come into play.<br /> <br /> == Determinism vs. Non-Determinism ==<br /> Data parallelism's synchronous nature and task parallelism's asynchronous nature give rise to another distinction between the two models: determinism versus non-determinism. Data parallelism is deterministic, i.e., computing with the same input will always yield the same result, since its synchronism ensures that issues like relative timing between PEs do not arise. In contrast, task parallelism's asynchronous updates of common data can give rise to non-determinism, i.e., the same input will not always yield the same result, because the result also depends on factors outside the program's control, such as scheduling and the timing of other PEs. Non-determinism makes it harder to write and maintain correct programs. This partially explains the advantage of the data parallel programming model over the task parallel model in terms of development effort (also discussed in section 4.2).<br /> <br /> <br /> =History of Parallel Programming Models=<br /> ==Interesting Dates in Parallel Computing History (with a focus on IBM, Cray, and ILLIAC)==<br /> ===1950's===<br /> *1955<br /> **IBM introduces the 704. Principal architect is Gene Amdahl; it is the first commercial machine with floating-point hardware, and is capable of approximately 5 kFLOPS. <br /> <br /> *1956<br /> **IBM starts 7030 project (known as STRETCH) to produce supercomputer for Los Alamos National Laboratory (LANL). Its goal is to produce a machine with 100 times the performance of any available at the time. 
<br /> <br /> *1958<br /> **Bull of France announces the Gamma 60 with multiple functional units and fork & join operations in its instruction set. 19 are later built. <br /> **John Cocke and Daniel Slotnick discuss use of parallelism in numerical calculations in an IBM research memo. Slotnick later proposes SOLOMON, a SIMD machine with 1024 1-bit PEs, each with memory for 128 32-bit values. The machine is never built, but the design is the starting point for much later work. <br /> ===1960's===<br /> *1960<br /> **Atlas computer becomes operational. It is the first machine to use virtual memory and paging; its instruction execution is pipelined, and it contains separate fixed- and floating-point arithmetic units, capable of approximately 200 kFLOPS. <br /> **Burroughs introduces the D825 symmetrical MIMD multiprocessor. 1 to 4 CPUs access 1 to 16 memory modules using a crossbar switch. The CPUs are similar to the later B5000; the operating system is symmetrical, with a shared ready queue. <br /> <br /> *1964<br /> **Daniel Slotnick proposes building a massively-parallel machine for the Lawrence Livermore National Laboratory (LLNL); the Atomic Energy Commission gives the contract to CDC instead, who build the STAR-100 to fulfil it. Slotnick's design funded by the Air Force, and evolves into the ILLIAC-IV. The machine is built at the University of Illinois, with Burroughs and Texas Instruments as primary subcontractors. Texas Instruments' Advanced Scientific Computer (ASC) also grows out of this initiative. <br /> <br /> *1966<br /> **Michael Flynn publishes a paper describing the architectural taxonomy which bears his name. <br /> <br /> *1967<br /> **IBM produces the 360/91 (later model 95) with dynamic instruction reordering. 20 of these are produced over the next several years; the line is eventually supplanted by the slower Model <br /> **Gene Amdahl and Daniel Slotnick have published debate at AFIPS Conference about the feasibility of parallel processing. 
Amdahl's argument about limits to parallelism becomes known as "Amdahl's Law"; he also propounds a corollary about system balance (sometimes called "Amdahl's Other Law"), which states that a balanced machine has the same number of MIPS, Mbytes, and Mbit/s of I/O bandwidth. <br /> <br /> *1968<br /> **IBM 2938 Array Processor delivered to Western Geophysical (who promptly paint racing stripes on it). First commercial machine to sustain 10 MFLOPS on 32-bit floating-point operations. A programmable digital signal processor, it proves very popular in the petroleum industry. <br /> **Edsger Dijkstra describes semaphores, and introduces the dining philosophers problem, which later becomes a standard example in concurrency theory. <br /> <br /> *1969<br /> **George Paul, M. Wayne Wilson, and Charles Cree begin work at IBM on VECTRAN, an extension to FORTRAN 66 with array-valued operators, functions, and I/O facilities. <br /> **Work begins at Compass Inc. on a parallelizing FORTRAN compiler for the ILLIAC-IV called IVTRAN. <br /> <br /> ===1970's===<br /> *1971<br /> **Intel produces the world's first single-chip CPU, the 4004 microprocessor. <br /> <br /> *1972<br /> **Seymour Cray leaves Control Data Corporation to found Cray Research Inc. CDC cancels the 8600 project, a follow-on to the 7600. <br /> **Quarter-sized (64 PEs) ILLIAC-IV installed at NASA Ames. Each processor has a peak speed of 4 MFLOPS; the machine's I/O system is capable of 500 Mbit/s. <br /> **Paper studies of massive bit-level parallelism done by Stewart Reddaway at ICL. These later lead to development of ICL DAP. <br /> <br /> *1974<br /> **Leslie Lamport's paper "Parallel Execution of Do-Loops" lays the theoretical foundation for most later research on automatic vectorization and shared-memory parallelization. Much of the work was done in 1971-2 while Lamport was at Compass Inc. <br /> **IBM delivers the first 3838 array processor, a general-purpose digital signal processor. 
<br /> <br /> *1975<br /> **ILLIAC-IV becomes operational at NASA Ames after concerted check-out effort. <br /> <br /> *1976<br /> **Cray Research delivers the first Freon-cooled CRAY-1 to Los Alamos National Laboratory. <br /> <br /> *1979<br /> **IBM's John Cocke designs the 801, the first of what are later called RISC architectures. <br /> <br /> ===1980's===<br /> *1980<br /> **PFC (Parallel FORTRAN Compiler) developed at Rice University under the direction of Ken Kennedy. <br /> **David Padua and David Kuck at the University of Illinois develop the DOACROSS parallel construct to be used as a target in program transformation. The name DOACROSS is due to Robert Kuhn. <br /> <br /> *1982<br /> **Steve Chen's group at Cray Research produces the first X-MP, containing two pipelined processors compatible with the CRAY-1 and shared memory. <br /> **ILLIAC-IV decommissioned. <br /> <br /> *1983<br /> **J. R. Allen's Ph.D. thesis at Rice University introduces the concepts of loop-carried and loop-independent dependencies, and formalizes the process of vectorization. <br /> **Scientific Computer Systems founded to design and market Cray-compatible minisupercomputers. <br /> **CRAY-1 with 1 processor achieves 12.5 MFLOPS on the 100x100 LINPACK benchmark. <br /> <br /> *1984<br /> **The CRAY X-MP family is expanded to include 1- and 4-processor machines. A CRAY X-MP running CX-OS, the first Unix-like operating system for supercomputers, is delivered to NASA Ames. <br /> **CRAY X-MP with 1 processor achieves 21 MFLOPS on 100x100 LINPACK. <br /> <br /> *1985<br /> **Cray Research produces the CRAY-2, with four background processors, a single foreground processor, a 4.1 nsec clock cycle, and 256 Mword memory. The machine is cooled by an inert fluorocarbon previously used as a blood substitute. <br /> <br /> *1986<br /> **CRAY X-MP with 4 processors achieves 713 MFLOPS (against a peak of 840) on 1000x1000 LINPACK. 
<br /> **Alan Karp offers a $100 prize to the first person to demonstrate a speedup of 200 or more on a general-purpose parallel processor. Benner, Gustafson, and Montry begin work to win it, and are later awarded the Gordon Bell Prize. <br /> <br /> *1987<br /> **The first Gordon Bell Prizes for parallel performance are awarded. The recipients are Benner, Gustafson, and Montry, for a speedup of 400-600 on a variety of applications running on a 1024-node nCUBE, and Chen, De Benedictis, Fox, Li, and Walker, for speedups of 39-458 on various hypercubes. <br /> <br /> *1988<br /> **John Gustafson and Gary Montry argue that Amdahl's Law can be invalidated by increasing problem size. <br /> **CRAY Y-MP with 1 processor achieves 74 MFLOPS on 100x100 LINPACK; the same machine with 8 processors achieves 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> <br /> *1989<br /> **CRAY Y-MP with 8 processors achieves 275 MFLOPS on 100x100 LINPACK, and 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> **Gordon Bell Prize for absolute performance awarded to a team from Mobil and Thinking Machines Corporation, who achieve 6 GFLOPS on a CM-2 Connection Machine; prize in price/performance category awarded to Emeagwali, who achieves 400 MFLOPS per million dollars on the same platform. <br /> **Seymour Cray leaves Cray Research to found Cray Computer Corporation. <br /> <br /> ===1990's===<br /> *1990<br /> **Cray Research, Inc., purchases Supertek Computers Inc., makers of the S-1, a minisupercomputer compatible with the CRAY X-MP. <br /> **Gordon Bell Prize in price/performance category awarded to Geist, Stocks, Ginatempo, and Shelton, who achieve 800 MFLOPS per million dollars in a high-temperature superconductivity program on a 128-node Intel iPSC/860. The prize in the compiler parallelization category is awarded to Sabot, Tennies, and Vasilevsky, who achieve 1.5 GFLOPS on a CM-2 Connection Machine with FORTRAN 90 code derived from FORTRAN 77. 
<br /> **National Energy Research Supercomputer Center (NERSC) at LLNL places order with Cray Computer Corporation for CRAY-3 supercomputer. The order includes a unique 8-processor CRAY-2 computer system that is installed in April. <br /> <br /> *1991<br /> **CRAY Y-MP C90 with 16 processors achieves 403 MFLOPS on 100x100 LINPACK; a Fujitsu VP-2600 with 1 processor achieves 4 GFLOPS (against a peak of 5 GFLOPS) on 1000x1000 LINPACK. <br /> <br /> *1993<br /> **Cray Research delivers a Y-MP M90 with 32 Gbyte of memory to the U.S. Government, after delivering a similar machine with 8 Gbyte of memory in the previous year to the Minnesota Supercomputer Center. <br /> <br /> ===References===<br /> http://ei.cs.vt.edu/~history/Parallel.html<br /> <br /> ==Vector Machines==<br /> <br /> First appearing in the 1970s, vector machines were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.<br /> <br /> The Solomon project at Westinghouse produced one of the first designs to use vector operations. Its CPU had a large number of ALUs that would each be fed different data each cycle. Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.<br /> <br /> An innovation came with the Cray-1 supercomputer in 1976. It was realized that large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray-1 instead kept its operands in eight vector registers of 64 elements each; by completing multiple instructions on those registers before returning to memory, it decreased the total number of memory accesses. 
The Cray-1 could perform at 240 MFLOPS.<br /> <br /> ===References for this section===<br /> *Wikipedia, Vector processor http://en.wikipedia.org/w/index.php?title=Vector_processor&oldid=405209552<br /> *Wikipedia, Cray-1 http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730<br /> <br /> <br /> <br /> <br /> =Comparing the Data Parallel Model with the Shared Memory and Message Passing Models=<br /> <br /> Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References | Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.<br /> <br /> As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.<br /> <br /> Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. 
[[#Definitions | ''SIMD (single-instruction-multiple-data)'']] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix: C for CUDA Example Code | Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model&mdash;like the message passing model&mdash;does not require hardware support.<br /> <br /> Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.<br /> <br /> Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. 
The regular data chunks also make it easier to reason about where to locate data and how to organize it.<br /> <br /> =Definitions=<br /> <br /> * ''Data parallel.'' A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.<br /> * ''Task parallel.'' A task parallel algorithm is composed of a set of differing tasks which operate on common data.<br /> * ''SIMD (single-instruction-multiple-data).'' A processor which executes a single instruction simultaneously on multiple data locations.<br /> <br /> =References=<br /> <br /> * David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.<br /> * Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.<br /> * Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 "Machine and collection abstractions for user-implemented data-parallel programming,"] ''Scientific Programming,'' 8(4):231-246, 2000.<br /> * W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 "Data parallel algorithms,"] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.<br /> * Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 "A comparison of message passing and shared memory architectures for data parallel programs,"] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.<br /> * Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.<br /> <br /> =Appendix: C for CUDA Example Code=<br /> <br /> The following code is a data parallel implementation of the sequential Code 2.3 from [[#References | Solihin (2008)]] using [http://www.nvidia.com/object/cuda_learn.html C for CUDA]. 
It is presented to give an impression of programming for a SIMD architecture, but a detailed discussion is beyond the scope of this supplement. Ignoring memory allocation issues, the code is very similar to the data parallel example, Code 2.5 from [[#References | Solihin (2008)]], discussed earlier. The main difference is the presence of a control thread that sends the parallel tasks to the CUDA device.<br /> <br /> // Data parallel implementation of the example code using C for CUDA.<br /> <br /> #include <iostream><br /> <br /> __global__ void kernel(float* a, float* b, float* c, float* local_sum)<br /> {<br /> int id = threadIdx.x;<br /> int local_iter = 4;<br /> int start_iter = id * local_iter;<br /> int end_iter = start_iter + local_iter;<br /> <br /> // Begin data parallel section<br /> <br /> for (int i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum[id] = 0;<br /> for (int i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum[id] = local_sum[id] + a[i];<br /> <br /> // End data parallel section<br /> }<br /> <br /> int main()<br /> {<br /> float h_a[8], h_b[8], h_c[8], h_sum[2];<br /> float *d_a, *d_b, *d_c, *d_sum;<br /> float sum;<br /> <br /> size_t size = 8 * sizeof(float);<br /> size_t size2 = 2 * sizeof(float);<br /> <br /> cudaMalloc((void**)&d_a, size);<br /> cudaMalloc((void**)&d_b, size);<br /> cudaMalloc((void**)&d_c, size);<br /> cudaMalloc((void**)&d_sum, size2);<br /> <br /> // h_b and h_c are assumed to hold the input data at this point (initialization omitted).<br /> cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);<br /> cudaMemcpy(d_c, h_c, size, cudaMemcpyHostToDevice);<br /> <br /> kernel<<<1, 2>>>(d_a, d_b, d_c, d_sum);<br /> <br /> cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);<br /> cudaMemcpy(h_sum, d_sum, size2, cudaMemcpyDeviceToHost);<br /> <br /> sum = h_sum[0] + h_sum[1];<br /> std::cout << sum;<br /> <br /> cudaFree(d_a);<br /> cudaFree(d_b);<br /> cudaFree(d_c);<br /> cudaFree(d_sum);<br /> }</div>
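The arithmetic of the kernel above can be checked without a GPU. The following is a minimal sketch in plain C, under the assumption that the two CUDA threads can simply be run one after the other on the host; the names <code>chunk_kernel</code> and <code>emulate_data_parallel_sum</code> are hypothetical and do not come from Solihin (2008) or from the CUDA toolkit.

```c
#include <assert.h>

#define N_ELEMS   8   /* array length used in the appendix code */
#define N_THREADS 2   /* thread count used in the kernel launch */

/* Same body as the CUDA kernel, with the thread id passed in explicitly
   instead of being read from threadIdx.x. */
static void chunk_kernel(int id, float *a, const float *b, const float *c,
                         float *local_sum)
{
    int local_iter = N_ELEMS / N_THREADS;
    int start_iter = id * local_iter;
    int end_iter = start_iter + local_iter;

    for (int i = start_iter; i < end_iter; i++)
        a[i] = b[i] + c[i];
    local_sum[id] = 0.0f;
    for (int i = start_iter; i < end_iter; i++)
        if (a[i] > 0)
            local_sum[id] += a[i];
}

/* Hypothetical host-side driver: runs the "threads" sequentially, then
   combines the per-thread partial sums as main() does in the appendix. */
float emulate_data_parallel_sum(float *a, const float *b, const float *c)
{
    float local_sum[N_THREADS];
    float sum = 0.0f;

    assert(N_ELEMS % N_THREADS == 0);   /* chunks must divide evenly */
    for (int id = 0; id < N_THREADS; id++)
        chunk_kernel(id, a, b, c, local_sum);
    for (int id = 0; id < N_THREADS; id++)
        sum += local_sum[id];
    return sum;
}
```

Because each chunk touches a disjoint range of <code>a</code> and its own slot of <code>local_sum</code>, the order in which the chunks run does not affect the result, which is why a sequential loop is a fair stand-in for the parallel launch here.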

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2010/ch_2_maf&diff=43587 CSC/ECE 506 Spring 2010/ch 2 maf 2011-01-31T20:31:40Z

<p>Cslingaf: </p> <hr /> <div>=Supplement to Chapter 2: The Data Parallel Programming Model=<br /> <br /> Chapter 2 of [[#References | Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [[#Definitions | ''data parallel'']] model, another commonly recognized parallel programming model covered in other treatments like [[#References | Foster (1995)]] and [[#References | Culler (1999)]]. Whereas the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can therefore be used in conjunction with either. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References | Solihin (2008)]]. The [[#Definitions | ''task parallel'']] model will also be introduced briefly as a point of contrast.<br /> <br /> =Overview=<br /> <br /> Whereas the shared memory and message passing models focus on how parallel tasks access common data, the [[#Definitions | ''data parallel'']] model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data. An example of a data parallel code can be seen in Code 2.5 from [[#References | Solihin (2008)]], which is reproduced below. It has been annotated with comments identifying the region of the code which is data parallel.<br /> <br /> // Data parallel code, adapted from [[#References|Solihin (2008), p. 
27.]]<br /> <br /> id = getmyid(); // Assume id = 0 for thread 0, id = 1 for thread 1<br /> local_iter = 4;<br /> start_iter = id * local_iter;<br /> end_iter = start_iter + local_iter;<br /> <br /> if (id == 0)<br /> send_msg(P1, b[4..7], c[4..7]);<br /> else<br /> recv_msg(P0, b[4..7], c[4..7]);<br /> <br /> // Begin data parallel section<br /> <br /> for (i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum = 0;<br /> for (i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum = local_sum + a[i];<br /> <br /> // End data parallel section<br /> <br /> if (id == 0)<br /> {<br /> recv_msg(P1, &local_sum1);<br /> sum = local_sum + local_sum1;<br /> Print sum;<br /> }<br /> else<br /> send_msg(P0, local_sum);<br /> <br /> In the code above, the three 8 element arrays are each divided into two 4 element chunks. In the data parallel section, the code executed by the two threads is identical, but each thread operates on a different chunk of data.<br /> <br /> [[#References | Hillis (1986)]] points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks. [[#References | Haveraaen (2000)]] also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. Comparison of the data parallel section of code identified above with the sequential Code 2.3 of [[#References | Solihin (2008)]], which is reproduced below, supports this assertion. The only differences between the two codes are the start and end indices and that, in the data parallel example, the variable sum is replaced by a private variable. Structurally the two codes are identical.<br /> <br /> // Sequential code, from [[#References|Solihin (2008), p. 
25.]]<br /> <br /> for (i = 0; i < 8; i++)<br /> a[i] = b[i] + c[i];<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> Print sum;<br /> <br /> == Example of Data Parallel Programming Model ==<br /> <br /> <br /> This section shows a simple example, adapted from the Solihin textbook (pp. 24-27), that illustrates the data-parallel programming model. Each of the code fragments below is written in pseudocode style.<br /> <br /> <br /> Suppose we want to perform the following task on an array <code>a</code>: replacing each element of <code>a</code> with the product of itself and its index, and adding together the elements of <code>a</code> into the variable <code>sum</code>. The corresponding code is shown below.<br /> <br /> <br /> // simple sequential task<br /> sum = 0;<br /> '''for''' (i = 0; i < a.length; i++)<br /> {<br /> a[i] = a[i] * i;<br /> sum = sum + a[i];<br /> }<br /> <br /> <br /> When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (in our example, data reorganization means summing up values across different processing elements). Since the data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. 
The code fragment below is an example of the first part of the program; the second part can be written in either a shared-memory or a message-passing style.<br /> <br /> <br /> // data parallel programming: let each PE perform the same task on different pieces of distributed data<br /> pe_id = getid();<br /> my_sum = 0;<br /> '''for''' (i = pe_id; i < a.length; i += number_of_pe) //separate elements of the array are assigned to each PE <br /> {<br /> a[i] = a[i] * i;<br /> my_sum = my_sum + a[i]; //all PEs accumulate elements assigned to them into local variable my_sum<br /> }<br /> <br /> <br /> In the above code, data parallelism is achieved by letting each processing element operate on separate elements of the array, which are identified using the PE's id. For instance, if three processing elements are used, then one processing element starts at i = 0, one starts at i = 1, and the last starts at i = 2. Since there are three processing elements, the array index for each increases by three on each iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved rather than contiguous). If the length of the array is a multiple of three, then each processing element performs the same number of iterations.<br /> <br /> <br /> The picture below illustrates how elements of the array are assigned among different PEs for the specific case in which the length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. 
Thus, data parallelism is achieved.<br /> <br /> [[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming(adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]<br /> <br /> ==Example of Task Parallel Programming Model==<br /> The logical opposite of data parallel is [[#Definitions | ''task parallel,'']] in which a number of distinct tasks operate on common data. An example of a task parallel code which is functionally equivalent to the sequential and data parallel codes given above follows below.<br /> <br /> // Task parallel code.<br /> <br /> int id = getmyid(); // assume id = 0 for thread 0, id = 1 for thread 1<br /> <br /> if (id == 0)<br /> {<br /> for (i = 0; i < 8; i++)<br /> {<br /> a[i] = b[i] + c[i];<br /> send_msg(P1, a[i]);<br /> }<br /> }<br /> else<br /> {<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> {<br /> recv_msg(P0, a[i]);<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> }<br /> Print sum;<br /> }<br /> <br /> In the code above, work is divided into two parallel tasks. The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.'' The other sums the elements of ''a.'' These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed by each thread is different (rather than identical).<br /> <br /> Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated. This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements. In addition, unique tasks are likely to have significantly different run times, making it more challenging to balance load across processors. 
[[#References | Haveraaen (2000)]] also notes that task parallel algorithms are inherently more complex, requiring a greater degree of communication and synchronization. In the task parallel code above, after thread 0 computes an element of ''a'' it must send it to thread 1. To support this, sends and receives occur every iteration of the two loops, resulting in a total of 8 messages being sent between the threads. In contrast, the data parallel code sends only 2 messages, one at the beginning and one at the end. The table below summarizes the key differences between data parallel and task parallel programming models.<br /> <br /> {| class="wikitable" border="1" align="center"<br /> |+ '''Comparison between data parallel and task parallel programming models.'''<br /> |-<br /> ! Aspects<br /> ! Data Parallel<br /> ! Task Parallel<br /> |-<br /> | Decomposition<br /> | Partition data into subsets<br /> | Partition program into subtasks<br /> |-<br /> | Parallel tasks<br /> | Identical<br /> | Unique<br /> |-<br /> | Degree of parallelism<br /> | Scales easily<br /> | Fixed<br /> |-<br /> | Load balancing<br /> | Easier<br /> | Harder<br /> |-<br /> | Communication overhead<br /> | Lower<br /> | Higher<br /> |}<br /> <br /> =History of Parallel Programming Models=<br /> <br /> ==Vector Machines==<br /> <br /> First appearing in the 1970s, vector machines were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.<br /> <br /> The Solomon project at Westinghouse produced one of the first designs to use vector operations. Its CPU had a large number of ALUs that would each be fed different data each cycle. Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. 
The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.<br /> <br /> An innovation came with the Cray-1 supercomputer in 1976. It was realized that large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray-1 instead kept its operands in eight vector registers of 64 elements each; by completing multiple instructions on those registers before returning to memory, it decreased the total number of memory accesses. The Cray-1 could perform at 240 MFLOPS.<br /> <br /> ==References for this section==<br /> *Wikipedia, Vector processor http://en.wikipedia.org/w/index.php?title=Vector_processor&oldid=405209552<br /> *Wikipedia, Cray-1 http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730<br /> <br /> =Comparing the Data Parallel Model with the Shared Memory and Message Passing Models=<br /> <br /> Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References | Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.<br /> <br /> As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. 
Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.<br /> <br /> Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. [[#Definitions | ''SIMD (single-instruction-multiple-data)'']] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix: C for CUDA Example Code | Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model&mdash;like the message passing model&mdash;does not require hardware support.<br /> <br /> Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.<br /> <br /> Once written, data parallel programs can scale easily to large numbers of processors. 
The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.<br /> <br /> =Definitions=<br /> <br /> * ''Data parallel.'' A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.<br /> * ''Task parallel.'' A task parallel algorithm is composed of a set of differing tasks which operate on common data.<br /> * ''SIMD (single-instruction-multiple-data).'' A processor which executes a single instruction simultaneously on multiple data locations.<br /> <br /> =References=<br /> <br /> * David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.<br /> * Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.<br /> * Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 "Machine and collection abstractions for user-implemented data-parallel programming,"] ''Scientific Programming,'' 8(4):231-246, 2000.<br /> * W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 "Data parallel algorithms,"] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.<br /> * Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 "A comparison of message passing and shared memory architectures for data parallel programs,"] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 
94-105.<br /> * Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.<br /> <br /> =Appendix: C for CUDA Example Code=<br /> <br /> The following code is a data parallel implementation of the sequential Code 2.3 from [[#References | Solihin (2008)]] using [http://www.nvidia.com/object/cuda_learn.html C for CUDA]. It is presented to give an impression of programming for a SIMD architecture, but a detailed discussion is beyond the scope of this supplement. Ignoring memory allocation issues, the code is very similar to the data parallel example, Code 2.5 from [[#References | Solihin (2008)]], discussed earlier. The main difference is the presence of a control thread that sends the parallel tasks to the CUDA device.<br /> <br /> // Data parallel implementation of the example code using C for CUDA.<br /> <br /> #include <iostream><br /> <br /> __global__ void kernel(float* a, float* b, float* c, float* local_sum)<br /> {<br /> int id = threadIdx.x;<br /> int local_iter = 4;<br /> int start_iter = id * local_iter;<br /> int end_iter = start_iter + local_iter;<br /> <br /> // Begin data parallel section<br /> <br /> for (int i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum[id] = 0;<br /> for (int i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum[id] = local_sum[id] + a[i];<br /> <br /> // End data parallel section<br /> }<br /> <br /> int main()<br /> {<br /> float h_a[8], h_b[8], h_c[8], h_sum[2];<br /> float *d_a, *d_b, *d_c, *d_sum;<br /> float sum;<br /> <br /> size_t size = 8 * sizeof(float);<br /> size_t size2 = 2 * sizeof(float);<br /> <br /> cudaMalloc((void**)&d_a, size);<br /> cudaMalloc((void**)&d_b, size);<br /> cudaMalloc((void**)&d_c, size);<br /> cudaMalloc((void**)&d_sum, size2);<br /> <br /> // h_b and h_c are assumed to hold the input data at this point (initialization omitted).<br /> cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);<br /> cudaMemcpy(d_c, h_c, size, cudaMemcpyHostToDevice);<br /> <br /> kernel<<<1, 2>>>(d_a, 
d_b, d_c, d_sum);<br /> <br /> cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);<br /> cudaMemcpy(h_sum, d_sum, size2, cudaMemcpyDeviceToHost);<br /> <br /> sum = h_sum[0] + h_sum[1];<br /> std::cout << sum;<br /> <br /> cudaFree(d_a);<br /> cudaFree(d_b);<br /> cudaFree(d_c);<br /> cudaFree(d_sum);<br /> }</div>
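The interleaved PE example earlier on this page describes a second step, in which the per-PE partial sums are combined, but only gives pseudocode for the first step. The following is a minimal sketch in plain C of both steps, assuming the PEs are emulated one after another on a single processor; on a real machine the combining step would use either a shared variable protected by synchronization or messages sent to one PE. The function names <code>pe_partial_sum</code> and <code>data_parallel_total</code> are hypothetical.

```c
#include <assert.h>

/* First part: the PE with id pe_id updates its interleaved elements
   (pe_id, pe_id + num_pe, pe_id + 2 * num_pe, ...) and returns its
   private partial sum, as in the my_sum loop of the pseudocode. */
double pe_partial_sum(int pe_id, int num_pe, double *a, int n)
{
    double my_sum = 0.0;

    assert(num_pe > 0 && pe_id >= 0 && pe_id < num_pe);
    for (int i = pe_id; i < n; i += num_pe) {
        a[i] = a[i] * i;
        my_sum += a[i];
    }
    return my_sum;
}

/* Second part, emulated sequentially: combine the partial sums of all
   PEs into the global sum. */
double data_parallel_total(double *a, int n, int num_pe)
{
    double sum = 0.0;

    for (int pe_id = 0; pe_id < num_pe; pe_id++)
        sum += pe_partial_sum(pe_id, num_pe, a, n);
    return sum;
}
```

With an array of seven ones and three PEs, the case shown in the picture, PE0 handles indexes 0, 3, 6, PE1 handles 1, 4, and PE2 handles 2, 5, so the three partial sums are 9, 5, and 7.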

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_cl&diff=43586 CSC/ECE 506 Spring 2011/ch2 cl 2011-01-31T20:21:44Z

<p>Cslingaf: </p> <hr /> <div>=Supplement to Chapter 2: The Data Parallel Programming Model=<br /> <br /> Chapter 2 of [[#References | Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [[#Definitions | ''data parallel'']] model, another commonly recognized parallel programming model covered in other treatments like [[#References | Foster (1995)]] and [[#References | Culler (1999)]]. Whereas the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can therefore be used in conjunction with either. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References | Solihin (2008)]]. The [[#Definitions | ''task parallel'']] model will also be introduced briefly as a point of contrast.<br /> <br /> =Overview=<br /> <br /> Whereas the shared memory and message passing models focus on how parallel tasks access common data, the [[#Definitions | ''data parallel'']] model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data. An example of a data parallel code can be seen in Code 2.5 from [[#References | Solihin (2008)]], which is reproduced below. It has been annotated with comments identifying the region of the code which is data parallel.<br /> <br /> // Data parallel code, adapted from [[#References|Solihin (2008), p. 
27.]]<br /> <br /> id = getmyid(); // Assume id = 0 for thread 0, id = 1 for thread 1<br /> local_iter = 4;<br /> start_iter = id * local_iter;<br /> end_iter = start_iter + local_iter;<br /> <br /> if (id == 0)<br /> send_msg(P1, b[4..7], c[4..7]);<br /> else<br /> recv_msg(P0, b[4..7], c[4..7]);<br /> <br /> // Begin data parallel section<br /> <br /> for (i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum = 0;<br /> for (i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum = local_sum + a[i];<br /> <br /> // End data parallel section<br /> <br /> if (id == 0)<br /> {<br /> recv_msg(P1, &local_sum1);<br /> sum = local_sum + local_sum1;<br /> Print sum;<br /> }<br /> else<br /> send_msg(P0, local_sum);<br /> <br /> In the code above, the three 8 element arrays are each divided into two 4 element chunks. In the data parallel section, the code executed by the two threads is identical, but each thread operates on a different chunk of data.<br /> <br /> [[#References | Hillis (1986)]] points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks. [[#References | Haveraaen (2000)]] also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. Comparison of the data parallel section of code identified above with the sequential Code 2.3 of [[#References | Solihin (2008)]], which is reproduced below, supports this assertion. The only differences between the two codes are the start and end indices and that, in the data parallel example, the variable sum is replaced by a private variable. Structurally the two codes are identical.<br /> <br /> // Sequential code, from [[#References|Solihin (2008), p. 
25.]]<br /> <br /> for (i = 0; i < 8; i++)<br /> a[i] = b[i] + c[i];<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> Print sum;<br /> <br /> The logical opposite of data parallel is [[#Definitions | ''task parallel,'']] in which a number of distinct tasks operate on common data. An example of a task parallel code which is functionally equivalent to the sequential and data parallel codes given above follows below.<br /> <br /> // Task parallel code.<br /> <br /> int id = getmyid(); // assume id = 0 for thread 0, id = 1 for thread 1<br /> <br /> if (id == 0)<br /> {<br /> for (i = 0; i < 8; i++)<br /> {<br /> a[i] = b[i] + c[i];<br /> send_msg(P1, a[i]);<br /> }<br /> }<br /> else<br /> {<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> {<br /> recv_msg(P0, a[i]);<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> }<br /> Print sum;<br /> }<br /> <br /> In the code above, work is divided into two parallel tasks. The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.'' The other sums the elements of ''a.'' These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed by each thread is different (rather than identical).<br /> <br /> Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated. This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements. In addition, unique tasks are likely to have significantly different run times, making it more challenging to balance load across processors. [[#References | Haveraaen (2000)]] also notes that task parallel algorithms are inherently more complex, requiring a greater degree of communication and synchronization. 
In the task parallel code above, after thread 0 computes an element of ''a'' it must send it to thread 1. To support this, sends and receives occur every iteration of the two loops, resulting in a total of 8 messages being sent between the threads. In contrast, the data parallel code sends only 2 messages, one at the beginning and one at the end. <br /> <br /> The following diagram may help to distinguish conceptually between data parallelism (SIMD: Single Instruction, Multiple Data) and task parallelism (MIMD: Multiple Instruction, Multiple Data). In the SIMD case, a single instruction stream feeds multiple processors, which then access the data over multiple connections. In contrast, the MIMD case has multiple instruction streams (evidenced by two groups of processors), which again access the data over multiple connections.<br /> [[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]<br /> <br /> <br /> The table below summarizes the key differences between data parallel and task parallel programming models.<br /> <br /> {| class="wikitable" border="1" align="center"<br /> |+ '''Comparison between data parallel and task parallel programming models.'''<br /> |-<br /> ! Aspects<br /> ! Data Parallel<br /> ! 
Task Parallel<br /> |-<br /> | Decomposition<br /> | Partition data into subsets<br /> | Partition program into subtasks<br /> |-<br /> | Parallel tasks<br /> | Identical<br /> | Unique<br /> |-<br /> | Degree of parallelism<br /> | Scales easily<br /> | Fixed<br /> |-<br /> | Load balancing<br /> | Easier<br /> | Harder<br /> |-<br /> | Communication overhead<br /> | Lower<br /> | Higher<br /> |}<br /> <br /> ==Synchronous vs Asynchronous==<br /> While the [http://en.wikipedia.org/wiki/Lockstep_(computing) lockstep] imposed by data parallelism on all data streams ensures synchronous computation (all PEs perform their tasks at the exact same pace), every processor in task parallelism performs its task at their own pace, which we call asynchronous computation. Thus, at a certain point of a task parallel program's execution, communication and synchronization primitives are needed to allow different instruction streams to coordinate their efforts, and that is where variable-sharing and message-passing come into play.<br /> <br /> == Determinism vs. Non-Determinism ==<br /> Data parallelism's synchronous nature and task parallelism's asynchronism give rise to another pair of features that add to the difference between these two models: determinism versus non-determinism. Data parallelism is deterministic, i.e. computing with the same input will always yield the same result, since its synchronism ensures that issues like relative timing between PEs will not arise. In contrast, task parallelism's asynchronous updates of common data can give rise to non-determinism, i.e, the same input won't always yield the same computation result (the result of a computation will depend also on factors outside the program control, such as scheduling and timing of other PEs). Obviously, non-determinism makes it harder to write and maintain correct programs. 
This partially explains the advantage of the data parallel programming model over the task parallel model in terms of development effort (also discussed in section 4.2).<br /> <br /> <br /> =History of Parallel Programming Models=<br /> ==Interesting Dates in Parallel Computing History (with a focus on IBM, Cray, and ILLIAC)==<br /> ===1950's===<br /> *1955<br /> **IBM introduces the 704. Principal architect is Gene Amdahl; it is the first commercial machine with floating-point hardware, and is capable of approximately 5 kFLOPS. <br /> <br /> *1956<br /> **IBM starts 7030 project (known as STRETCH) to produce supercomputer for Los Alamos National Laboratory (LANL). Its goal is to produce a machine with 100 times the performance of any available at the time. <br /> <br /> *1958<br /> **Bull of France announces the Gamma 60 with multiple functional units and fork & join operations in its instruction set. 19 are later built. <br /> **John Cocke and Daniel Slotnick discuss use of parallelism in numerical calculations in an IBM research memo. Slotnick later proposes SOLOMON, a SIMD machine with 1024 1-bit PEs, each with memory for 128 32-bit values. The machine is never built, but the design is the starting point for much later work. <br /> ===1960's===<br /> *1960<br /> **Atlas computer becomes operational. It is the first machine to use virtual memory and paging; its instruction execution is pipelined, and it contains separate fixed- and floating-point arithmetic units, capable of approximately 200 kFLOPS. <br /> **Burroughs introduces the D825 symmetrical MIMD multiprocessor. 1 to 4 CPUs access 1 to 16 memory modules using a crossbar switch. The CPUs are similar to the later B5000; the operating system is symmetrical, with a shared ready queue. 
<br /> <br /> *1964<br /> **Daniel Slotnick proposes building a massively-parallel machine for the Lawrence Livermore National Laboratory (LLNL); the Atomic Energy Commission gives the contract to CDC instead, who build the STAR-100 to fulfil it. Slotnick's design is funded by the Air Force, and evolves into the ILLIAC-IV. The machine is built at the University of Illinois, with Burroughs and Texas Instruments as primary subcontractors. Texas Instruments' Advanced Scientific Computer (ASC) also grows out of this initiative. <br /> <br /> *1966<br /> **Michael Flynn publishes a paper describing the architectural taxonomy which bears his name. <br /> <br /> *1967<br /> **IBM produces the 360/91 (later model 95) with dynamic instruction reordering. 20 of these are produced over the next several years; the line is eventually supplanted by the slower Model <br /> **Gene Amdahl and Daniel Slotnick have a published debate at the AFIPS Conference about the feasibility of parallel processing. Amdahl's argument about limits to parallelism becomes known as "Amdahl's Law"; he also propounds a corollary about system balance (sometimes called "Amdahl's Other Law"), which states that a balanced machine has the same number of MIPS, Mbytes, and Mbit/s of I/O bandwidth. <br /> <br /> *1968<br /> **IBM 2938 Array Processor delivered to Western Geophysical (who promptly paint racing stripes on it). First commercial machine to sustain 10 MFLOPS on 32-bit floating-point operations. A programmable digital signal processor, it proves very popular in the petroleum industry. <br /> **Edsger Dijkstra describes semaphores, and introduces the dining philosophers problem, which later becomes a standard example in concurrency theory. <br /> <br /> *1969<br /> **George Paul, M. Wayne Wilson, and Charles Cree begin work at IBM on VECTRAN, an extension to FORTRAN 66 with array-valued operators, functions, and I/O facilities. <br /> **Work begins at Compass Inc. 
on a parallelizing FORTRAN compiler for the ILLIAC-IV called IVTRAN. <br /> <br /> ===1970's===<br /> *1971<br /> **Intel produces the world's first single-chip CPU, the 4004 microprocessor. <br /> <br /> *1972<br /> **Seymour Cray leaves Control Data Corporation to found Cray Research Inc. CDC cancels the 8600 project, a follow-on to the 7600. <br /> **Quarter-sized (64 PEs) ILLIAC-IV installed at NASA Ames. Each processor has a peak speed of 4 MFLOPS; the machine's I/O system is capable of 500 Mbit/s. <br /> **Paper studies of massive bit-level parallelism done by Stewart Reddaway at ICL. These later lead to development of ICL DAP. <br /> <br /> *1974<br /> **Leslie Lamport's paper "Parallel Execution of Do-Loops" lays the theoretical foundation for most later research on automatic vectorization and shared-memory parallelization. Much of the work was done in 1971-2 while Lamport was at Compass Inc. <br /> **IBM delivers the first 3838 array processor, a general-purpose digital signal processor. <br /> <br /> *1975<br /> **ILLIAC-IV becomes operational at NASA Ames after concerted check-out effort. <br /> <br /> *1976<br /> **Cray Research delivers the first Freon-cooled CRAY-1 to Los Alamos National Laboratory. <br /> <br /> *1979<br /> **IBM's John Cocke designs the 801, the first of what are later called RISC architectures. <br /> <br /> ===1980's===<br /> *1980<br /> **PFC (Parallel FORTRAN Compiler) developed at Rice University under the direction of Ken Kennedy. <br /> **David Padua and David Kuck at the University of Illinois develop the DOACROSS parallel construct to be used as a target in program transformation. The name DOACROSS is due to Robert Kuhn. <br /> <br /> *1982<br /> **Steve Chen's group at Cray Research produces the first X-MP, containing two pipelined processors compatible with the CRAY-1 and shared memory. <br /> **ILLIAC-IV decommissioned. <br /> <br /> *1983<br /> **J. R. Allen's Ph.D. 
thesis at Rice University introduces the concepts of loop-carried and loop-independent dependencies, and formalizes the process of vectorization. <br /> **Scientific Computer Systems founded to design and market Cray-compatible minisupercomputers. <br /> **CRAY-1 with 1 processor achieves 12.5 MFLOPS on the 100x100 LINPACK benchmark. <br /> <br /> *1984<br /> **The CRAY X-MP family is expanded to include 1- and 4-processor machines. A CRAY X-MP running CX-OS, the first Unix-like operating system for supercomputers, is delivered to NASA Ames. <br /> **CRAY X-MP with 1 processor achieves 21 MFLOPS on 100x100 LINPACK. <br /> <br /> *1985<br /> **Cray Research produces the CRAY-2, with four background processors, a single foreground processor, a 4.1 nsec clock cycle, and 256 Mword memory. The machine is cooled by an inert fluorocarbon previously used as a blood substitute. <br /> <br /> *1986<br /> **CRAY X-MP with 4 processors achieves 713 MFLOPS (against a peak of 840) on 1000x1000 LINPACK. <br /> **Alan Karp offers $100 prize to first person to demonstrate speedup of 200 or more on general purpose parallel processor. Benner, Gustafson, and Montry begin work to win it, and are later awarded the Gordon Bell Prize. <br /> <br /> *1987<br /> **The first Gordon Bell Prizes for parallel performance are awarded. The recipients are Benner, Gustafson, and Montry, for a speedup of 400-600 on a variety of applications running on a 1024-node nCUBE, and Chen, De Benedictis, Fox, Li, and Walker, for speedups of 39-458 on various hypercubes. <br /> <br /> *1988<br /> **John Gustafson and Gary Montry argue that Amdahl's Law can be invalidated by increasing problem size. <br /> **CRAY Y-MP with 1 processor achieves 74 MFLOPS on 100x100 LINPACK; the same machine with 8 processors achieves 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. 
<br /> <br /> *1989<br /> **CRAY Y-MP with 8 processors achieves 275 MFLOPS on 100x100 LINPACK, and 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> **Gordon Bell Prize for absolute performance awarded to a team from Mobil and Thinking Machines Corporation, who achieve 6 GFLOPS on a CM-2 Connection Machine; prize in price/performance category awarded to Emeagwali, who achieves 400 MFLOPS per million dollars on the same platform. <br /> **Seymour Cray leaves Cray Research to found Cray Computer Corporation. <br /> <br /> ===1990's===<br /> *1990<br /> **Cray Research, Inc., purchases Supertek Computers Inc., makers of the S-1, a minisupercomputer compatible with the CRAY X-MP. <br /> **Gordon Bell Prize in price/performance category awarded to Geist, Stocks, Ginatempo, and Shelton, who achieve 800 MFLOPS per million dollars in a high-temperature superconductivity program on a 128-node Intel iPSC/860. The prize in the compiler parallelization category is awarded to Sabot, Tennies, and Vasilevsky, who achieve 1.5 GFLOPS on a CM-2 Connection Machine with FORTRAN 90 code derived from FORTRAN 77. <br /> **National Energy Research Supercomputer Center (NERSC) at LLNL places order with Cray Computer Corporation for CRAY-3 supercomputer. The order includes a unique 8-processor CRAY-2 computer system that is installed in April. <br /> <br /> *1991<br /> **CRAY Y-MP C90 with 16 processors achieves 403 MFLOPS on 100x100 LINPACK; a Fujitsu VP-2600 with 1 processor achieves 4 GFLOPS (against a peak of 5 GFLOPS) on 1000x1000 LINPACK. <br /> <br /> *1993<br /> **Cray Research delivers a Y-MP M90 with 32 Gbyte of memory to the U.S. Government, after delivering a similar machine with 8 Gbyte of memory in the previous year to the Minnesota Supercomputer Center. 
<br /> <br /> ===References===<br /> http://ei.cs.vt.edu/~history/Parallel.html<br /> <br /> ==Vector Machines==<br /> <br /> First appearing in the 1970s, vector machines were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.<br /> <br /> The Solomon project at Westinghouse was one of the first to use vector operations. Its CPU had a large number of ALUs that would each be fed different data each cycle. Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.<br /> <br /> An innovation came with the Cray-1 supercomputer in 1976. It was realized that large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray computer was only able to load 12 data points, but by completing multiple instructions before continuing, it decreased the total number of memory accesses. The Cray-1 could perform at 240 MFLOPS.<br /> <br /> ===References for this section===<br /> *Wikipedia, Vector processor http://en.wikipedia.org/w/index.php?title=Vector_processor&oldid=405209552<br /> *Wikipedia, Cray-1 http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730<br /> <br /> <br /> <br /> <br /> =Comparing the Data Parallel Model with the Shared Memory and Message Passing Models=<br /> <br /> Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). 
In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References | Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.<br /> <br /> As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.<br /> <br /> Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. [[#Definitions | ''SIMD (single-instruction-multiple-data)'']] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix: C for CUDA Example Code | Appendix]]. 
However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model&mdash;like the message passing model&mdash;does not require hardware support.<br /> <br /> Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.<br /> <br /> Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.<br /> <br /> =Definitions=<br /> <br /> * ''Data parallel.'' A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.<br /> * ''Task parallel.'' A task parallel algorithm is composed of a set of differing tasks which operate on common data.<br /> * ''SIMD (single-instruction-multiple-data).'' A processor which executes a single instruction simultaneously on multiple data locations.<br /> <br /> =References=<br /> <br /> * David E. 
Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.<br /> * Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.<br /> * Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 "Machine and collection abstractions for user-implemented data-parallel programming,"] ''Scientific Programming,'' 8(4):231-246, 2000.<br /> * W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 "Data parallel algorithms,"] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.<br /> * Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 "A comparison of message passing and shared memory architectures for data parallel programs,"] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.<br /> * Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.<br /> <br /> =Appendix: C for CUDA Example Code=<br /> <br /> The following code is a data parallel implementation of the sequential Code 2.3 from [[#References | Solihin (2008)]] using [http://www.nvidia.com/object/cuda_learn.html C for CUDA]. It is presented to give an impression of programming for a SIMD architecture, but a detailed discussion is beyond the scope of this supplement. Ignoring memory allocation issues, the code is very similar to the data parallel example, Code 2.5 from [[#References | Solihin (2008)]], discussed earlier. 
The main difference is the presence of a control thread that sends the parallel tasks to the CUDA device.<br /> <br /> // Data parallel implementation of the example code using C for CUDA.<br /> <br /> #include <iostream><br /> <br /> __global__ void kernel(float* a, float* b, float* c, float* local_sum)<br /> {<br /> int id = threadIdx.x;<br /> int local_iter = 4;<br /> int start_iter = id * local_iter;<br /> int end_iter = start_iter + local_iter;<br /> <br /> // Begin data parallel section<br /> <br /> for (int i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum[id] = 0;<br /> for (int i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum[id] = local_sum[id] + a[i];<br /> <br /> // End data parallel section<br /> }<br /> <br /> int main()<br /> {<br /> // h_b and h_c are assumed to be filled with input data before they are copied to the device<br /> float h_a[8], h_b[8], h_c[8], h_sum[2];<br /> float *d_a, *d_b, *d_c, *d_sum;<br /> float sum;<br /> <br /> size_t size = 8 * sizeof(float);<br /> size_t size2 = 2 * sizeof(float);<br /> <br /> cudaMalloc((void**)&d_a, size);<br /> cudaMalloc((void**)&d_b, size);<br /> cudaMalloc((void**)&d_c, size);<br /> cudaMalloc((void**)&d_sum, size2);<br /> <br /> cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);<br /> cudaMemcpy(d_c, h_c, size, cudaMemcpyHostToDevice);<br /> <br /> // Launch one block of two threads; each thread handles one 4-element chunk<br /> kernel<<<1, 2>>>(d_a, d_b, d_c, d_sum);<br /> <br /> cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);<br /> cudaMemcpy(h_sum, d_sum, size2, cudaMemcpyDeviceToHost);<br /> <br /> sum = h_sum[0] + h_sum[1];<br /> std::cout << sum;<br /> <br /> cudaFree(d_a);<br /> cudaFree(d_b);<br /> cudaFree(d_c);<br /> cudaFree(d_sum);<br /> }</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_cl&diff=43585 CSC/ECE 506 Spring 2011/ch2 cl 2011-01-31T20:19:50Z

<p>Cslingaf: </p> <hr /> <div>=Supplement to Chapter 2: The Data Parallel Programming Model=<br /> <br /> Chapter 2 of [[#References | Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [[#Definitions | ''data parallel'']] model, another commonly recognized parallel programming model covered in other treatments like [[#References | Foster (1995)]] and [[#References | Culler (1999)]]. Whereas the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can therefore be used in conjunction with either. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References | Solihin (2008)]]. The [[#Definitions | ''task parallel'']] model will also be introduced briefly as a point of contrast.<br /> <br /> =Overview=<br /> <br /> Whereas the shared memory and message passing models focus on how parallel tasks access common data, the [[#Definitions | ''data parallel'']] model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data. An example of data parallel code can be seen in Code 2.5 from [[#References | Solihin (2008)]], which is reproduced below. It has been annotated with comments identifying the region of the code which is data parallel.<br /> <br /> // Data parallel code, adapted from [[#References|Solihin (2008), p. 
27.]]<br /> <br /> id = getmyid(); // Assume id = 0 for thread 0, id = 1 for thread 1<br /> local_iter = 4;<br /> start_iter = id * local_iter;<br /> end_iter = start_iter + local_iter;<br /> <br /> if (id == 0)<br /> send_msg(P1, b[4..7], c[4..7]);<br /> else<br /> recv_msg(P0, b[4..7], c[4..7]);<br /> <br /> // Begin data parallel section<br /> <br /> for (i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum = 0;<br /> for (i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum = local_sum + a[i];<br /> <br /> // End data parallel section<br /> <br /> if (id == 0)<br /> {<br /> recv_msg(P1, &local_sum1);<br /> sum = local_sum + local_sum1;<br /> Print sum;<br /> }<br /> else<br /> send_msg(P0, local_sum);<br /> <br /> In the code above, the three 8 element arrays are each divided into two 4 element chunks. In the data parallel section, the code executed by the two threads is identical, but each thread operates on a different chunk of data.<br /> <br /> [[#References | Hillis (1986)]] points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks. [[#References | Haveraaen (2000)]] also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. Comparison of the data parallel section of code identified above with the sequential Code 2.3 of [[#References | Solihin (2008)]], which is reproduced below, supports this assertion. The only differences between the two codes are the start and end indices and that, in the data parallel example, the variable sum is replaced by a private variable. Structurally the two codes are identical.<br /> <br /> // Sequential code, from [[#References|Solihin (2008), p. 
25.]]<br /> <br /> for (i = 0; i < 8; i++)<br /> a[i] = b[i] + c[i];<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> Print sum;<br /> <br /> The logical opposite of data parallel is [[#Definitions | ''task parallel,'']] in which a number of distinct tasks operate on common data. An example of a task parallel code which is functionally equivalent to the sequential and data parallel codes given above follows below.<br /> <br /> // Task parallel code.<br /> <br /> int id = getmyid(); // assume id = 0 for thread 0, id = 1 for thread 1<br /> <br /> if (id == 0)<br /> {<br /> for (i = 0; i < 8; i++)<br /> {<br /> a[i] = b[i] + c[i];<br /> send_msg(P1, a[i]);<br /> }<br /> }<br /> else<br /> {<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> {<br /> recv_msg(P0, a[i]);<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> }<br /> Print sum;<br /> }<br /> <br /> In the code above, work is divided into two parallel tasks. The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.'' The other sums the elements of ''a.'' These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed by each thread is different (rather than identical).<br /> <br /> Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated. This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements. In addition, unique tasks are likely to have significantly different run times, making it more challenging to balance load across processors. [[#References | Haveraaen (2000)]] also notes that task parallel algorithms are inherently more complex, requiring a greater degree of communication and synchronization. 
In the task parallel code above, after thread 0 computes an element of ''a'' it must send it to thread 1. To support this, sends and receives occur every iteration of the two loops, resulting in a total of 8 messages being sent between the threads. In contrast, the data parallel code sends only 2 messages, one at the beginning and one at the end. <br /> <br /> The following diagram may help to distinguish conceptually between data parallelism (SIMD: Single Instruction, Multiple Data) and task parallelism (MIMD: Multiple Instruction, Multiple Data). In the SIMD case, a single instruction stream feeds multiple processors, which then access the data over multiple connections. In contrast, the MIMD case has multiple instruction streams (evidenced by two groups of processors), which again access the data over multiple connections.<br /> [[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]<br /> <br /> <br /> The table below summarizes the key differences between data parallel and task parallel programming models.<br /> <br /> {| class="wikitable" border="1" align="center"<br /> |+ '''Comparison between data parallel and task parallel programming models.'''<br /> |-<br /> ! Aspects<br /> ! Data Parallel<br /> ! 
Task Parallel<br /> |-<br /> | Decomposition<br /> | Partition data into subsets<br /> | Partition program into subtasks<br /> |-<br /> | Parallel tasks<br /> | Identical<br /> | Unique<br /> |-<br /> | Degree of parallelism<br /> | Scales easily<br /> | Fixed<br /> |-<br /> | Load balancing<br /> | Easier<br /> | Harder<br /> |-<br /> | Communication overhead<br /> | Lower<br /> | Higher<br /> |}<br /> <br /> ==Synchronous vs Asynchronous==<br /> While the [http://en.wikipedia.org/wiki/Lockstep_(computing) lockstep] imposed by data parallelism on all data streams ensures synchronous computation (all PEs perform their tasks at the exact same pace), every processor in task parallelism performs its task at their own pace, which we call asynchronous computation. Thus, at a certain point of a task parallel program's execution, communication and synchronization primitives are needed to allow different instruction streams to coordinate their efforts, and that is where variable-sharing and message-passing come into play.<br /> <br /> == Determinism vs. Non-Determinism ==<br /> Data parallelism's synchronous nature and task parallelism's asynchronism give rise to another pair of features that add to the difference between these two models: determinism versus non-determinism. Data parallelism is deterministic, i.e. computing with the same input will always yield the same result, since its synchronism ensures that issues like relative timing between PEs will not arise. In contrast, task parallelism's asynchronous updates of common data can give rise to non-determinism, i.e, the same input won't always yield the same computation result (the result of a computation will depend also on factors outside the program control, such as scheduling and timing of other PEs). Obviously, non-determinism makes it harder to write and maintain correct programs. 
This partially explains the advantage of the data parallel programming model over the task parallel model in terms of development effort (also discussed in section 4.2).<br /> <br /> <br /> =History of Parallel Programming Models=<br /> <br /> ==Vector Machines==<br /> <br /> First appearing in the 1970s, vector machines were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.<br /> <br /> The Solomon project at Westinghouse was one of the first to use vector operations. Its CPU had a large number of ALUs that would each be fed different data each cycle. Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.<br /> <br /> An innovation came with the Cray-1 supercomputer in 1976. It was realized that large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray computer was only able to load 12 data points, but by completing multiple instructions before continuing, it decreased the total number of memory accesses. The Cray-1 could perform at 240 MFLOPS.<br /> <br /> ==References for this section==<br /> *Wikipedia, Vector processor http://en.wikipedia.org/w/index.php?title=Vector_processor&oldid=405209552<br /> *Wikipedia, Cray-1 http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730<br /> <br /> ==Interesting Dates in Parallel Computing History (with a focus on IBM, Cray, and ILLIAC)==<br /> ===1950's===<br /> *1955<br /> **IBM introduces the 704. 
Principal architect is Gene Amdahl; it is the first commercial machine with floating-point hardware, and is capable of approximately 5 kFLOPS. <br /> <br /> *1956<br /> **IBM starts 7030 project (known as STRETCH) to produce supercomputer for Los Alamos National Laboratory (LANL). Its goal is to produce a machine with 100 times the performance of any available at the time. <br /> <br /> *1958<br /> **Bull of France announces the Gamma 60 with multiple functional units and fork & join operations in its instruction set. 19 are later built. <br /> **John Cocke and Daniel Slotnick discuss use of parallelism in numerical calculations in an IBM research memo. Slotnick later proposes SOLOMON, a SIMD machine with 1024 1-bit PEs, each with memory for 128 32-bit values. The machine is never built, but the design is the starting point for much later work. <br /> ===1960's===<br /> *1960<br /> **Atlas computer becomes operational. It is the first machine to use virtual memory and paging; its instruction execution is pipelined, and it contains separate fixed- and floating-point arithmetic units, capable of approximately 200 kFLOPS. <br /> **Burroughs introduces the D825 symmetrical MIMD multiprocessor. 1 to 4 CPUs access 1 to 16 memory modules using a crossbar switch. The CPUs are similar to the later B5000; the operating system is symmetrical, with a shared ready queue. <br /> <br /> *1964<br /> **Daniel Slotnick proposes building a massively-parallel machine for the Lawrence Livermore National Laboratory (LLNL); the Atomic Energy Commission gives the contract to CDC instead, who build the STAR-100 to fulfil it. Slotnick's design funded by the Air Force, and evolves into the ILLIAC-IV. The machine is built at the University of Illinois, with Burroughs and Texas Instruments as primary subcontractors. Texas Instruments' Advanced Scientific Computer (ASC) also grows out of this initiative. 
<br /> <br /> *1966<br /> **Michael Flynn publishes a paper describing the architectural taxonomy which bears his name. <br /> <br /> *1967<br /> **IBM produces the 360/91 (later model 95) with dynamic instruction reordering. 20 of these are produced over the next several years; the line is eventually supplanted by the slower Model <br /> **Gene Amdahl and Daniel Slotnick have published debate at AFIPS Conference about the feasibility of parallel processing. Amdahl's argument about limits to parallelism becomes known as "Amdahl's Law"; he also propounds a corollary about system balance (sometimes called "Amdahl's Other Law"), which states that a balanced machine has the same number of MIPS, Mbytes, and Mbit/s of I/O bandwidth. <br /> <br /> *1968<br /> **IBM 2938 Array Processor delivered to Western Geophysical (who promptly paint racing stripes on it). First commercial machine to sustain 10 MFLOPS on 32-bit floating-point operations. A programmable digital signal processor, it proves very popular in the petroleum industry. <br /> **Edsger Dijkstra describes semaphores, and introduces the dining philosophers problem, which later becomes a standard example in concurrency theory. <br /> <br /> *1969<br /> **George Paul, M. Wayne Wilson, and Charles Cree begin work at IBM on VECTRAN, an extension to FORTRAN 66 with array-valued operators, functions, and I/O facilities. <br /> **Work begins at Compass Inc. on a parallelizing FORTRAN compiler for the ILLIAC-IV called IVTRAN. <br /> <br /> ===1970's===<br /> *1971<br /> **Intel produces the world's first single-chip CPU, the 4004 microprocessor. <br /> <br /> *1972<br /> **Seymour Cray leaves Control Data Corporation to found Cray Research Inc. CDC cancels the 8600 project, a follow-on to the 7600. <br /> **Quarter-sized (64 PEs) ILLIAC-IV installed at NASA Ames. Each processor has a peak speed of 4 MFLOPS; the machine's I/O system is capable of 500 Mbit/s. 
<br /> **Paper studies of massive bit-level parallelism done by Stewart Reddaway at ICL. These later lead to development of ICL DAP. <br /> <br /> *1974<br /> **Leslie Lamport's paper "Parallel Execution of Do-Loops" lays the theoretical foundation for most later research on automatic vectorization and shared-memory parallelization. Much of the work was done in 1971-2 while Lamport was at Compass Inc. <br /> **IBM delivers the first 3838 array processor, a general-purpose digital signal processor. <br /> <br /> *1975<br /> **ILLIAC-IV becomes operational at NASA Ames after concerted check-out effort. <br /> <br /> *1976<br /> **Cray Research delivers the first Freon-cooled CRAY-1 to Los Alamos National Laboratory. <br /> <br /> *1979<br /> **IBM's John Cocke designs the 801, the first of what are later called RISC architectures. <br /> <br /> ===1980's===<br /> *1980<br /> **PFC (Parallel FORTRAN Compiler) developed at Rice University under the direction of Ken Kennedy. <br /> **David Padua and David Kuck at the University of Illinois develop the DOACROSS parallel construct to be used as a target in program transformation. The name DOACROSS is due to Robert Kuhn. <br /> <br /> *1982<br /> **Steve Chen's group at Cray Research produces the first X-MP, containing two pipelined processors compatible with the CRAY-1 and shared memory. <br /> **ILLIAC-IV decommissioned. <br /> <br /> *1983<br /> **J. R. Allen's Ph.D. thesis at Rice University introduces the concepts of loop-carried and loop-independent dependencies, and formalizes the process of vectorization. <br /> **Scientific Computer Systems founded to design and market Cray-compatible minisupercomputers. <br /> **CRAY-1 with 1 processor achieves 12.5 MFLOPS on the 100x100 LINPACK benchmark. <br /> <br /> *1984<br /> **The CRAY X-MP family is expanded to include 1- and 4-processor machines. A CRAY X-MP running CX-OS, the first Unix-like operating system for supercomputers, is delivered to NASA Ames. 
<br /> **CRAY X-MP with 1 processor achieves 21 MFLOPS on 100x100 LINPACK. <br /> <br /> *1985<br /> **Cray Research produces the CRAY-2, with four background processors, a single foreground processor, a 4.1 nsec clock cycle, and 256 Mword memory. The machine is cooled by an inert fluorocarbon previously used as a blood substitute. <br /> <br /> *1986<br /> **CRAY X-MP with 4 processors achieves 713 MFLOPS (against a peak of 840) on 1000x1000 LINPACK. <br /> **Alan Karp offers a $100 prize to the first person to demonstrate a speedup of 200 or more on a general-purpose parallel processor. Benner, Gustafson, and Montry begin work to win it, and are later awarded the Gordon Bell Prize. <br /> <br /> *1987<br /> **The first Gordon Bell Prizes for parallel performance are awarded. The recipients are Benner, Gustafson, and Montry, for a speedup of 400-600 on a variety of applications running on a 1024-node nCUBE, and Chen, De Benedictis, Fox, Li, and Walker, for speedups of 39-458 on various hypercubes. <br /> <br /> *1988<br /> **John Gustafson and Gary Montry argue that Amdahl's Law can be invalidated by increasing problem size. <br /> **CRAY Y-MP with 1 processor achieves 74 MFLOPS on 100x100 LINPACK; the same machine with 8 processors achieves 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> <br /> *1989<br /> **CRAY Y-MP with 8 processors achieves 275 MFLOPS on 100x100 LINPACK, and 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> **Gordon Bell Prize for absolute performance awarded to a team from Mobil and Thinking Machines Corporation, who achieve 6 GFLOPS on a CM-2 Connection Machine; prize in price/performance category awarded to Emeagwali, who achieves 400 MFLOPS per million dollars on the same platform. <br /> **Seymour Cray leaves Cray Research to found Cray Computer Corporation. 
<br /> <br /> ===1990's===<br /> *1990<br /> **Cray Research, Inc., purchases Supertek Computers Inc., makers of the S-1, a minisupercomputer compatible with the CRAY X-MP. <br /> **Gordon Bell Prize in price/performance category awarded to Geist, Stocks, Ginatempo, and Shelton, who achieves 800 MFLOPS per million dollars in a high-temperature superconductivity program on a 128-node Intel iPSC/860. The prize in the compiler parallelization category is awarded to Sabot, Tennies, and Vasilevsky, who achieve 1.5 GFLOPS on a CM-2 Connection Machine with FORTRAN 90 code derived from FORTRAN 77. <br /> **National Energy Research Supercomputer Center (NERSC) at LLNL places order with Cray Computer Corporation for CRAY-3 supercomputer. The order includes a unique 8-processor CRAY-2 computer system that is installed in April. <br /> <br /> *1991<br /> **CRAY Y-MP C90 with 16 processors achieves 403 MFLOPS on 100x100 LINPACK; a Fujitsu VP-2600 with 1 processor achieves 4 GFLOPS (against a peak of 5 GFLOPS) on 1000x1000 LINPACK. <br /> <br /> *1993<br /> **Cray Research delivers a Y-MP M90 with 32 Gbyte of memory to the U.S. Government, after delivering a similar machine with 8 Gbyte of memory in the previous year to the Minnesota Supercomputer Center. <br /> <br /> ===References===<br /> http://ei.cs.vt.edu/~history/Parallel.html<br /> <br /> <br /> <br /> =Comparing the Data Parallel Model with the Shared Memory and Message Passing Models=<br /> <br /> Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. 
In fact, [[#References | Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.<br /> <br /> As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.<br /> <br /> Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. [[#Definitions | ''SIMD (single-instruction-multiple-data)'']] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix: C for CUDA Example Code | Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model&mdash;like the message passing model&mdash;does not require hardware support.<br /> <br /> Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. 
However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.<br /> <br /> Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.<br /> <br /> =Definitions=<br /> <br /> * ''Data parallel.'' A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.<br /> * ''Task parallel.'' A task parallel algorithm is composed of a set of differing tasks which operate on common data.<br /> * ''SIMD (single-instruction-multiple-data).'' A processor which executes a single instruction simultaneously on multiple data locations.<br /> <br /> =References=<br /> <br /> * David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan-Kauffman, 1999.<br /> * Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.<br /> * Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 "Machine and collection abstractions for user-implemented data-parallel programming,"] ''Scientific Programming,'' 8(4):231-246, 2000.<br /> * W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 "Data parallel algorithms,"] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.<br /> * Alexander C. Klaiber and Henry M. 
Levy, [http://portal.acm.org/citation.cfm?id=192020 "A comparison of message passing and shared memory architectures for data parallel programs,"] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.<br /> * Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.<br /> <br /> =Appendix: C for CUDA Example Code=<br /> <br /> The following code is a data parallel implementation of the sequential Code 2.3 from [[#References | Solihin (2008)]] using [http://www.nvidia.com/object/cuda_learn.html C for CUDA]. It is presented to give an impression of programming for a SIMD architecture, but a detailed discussion is beyond the scope of this supplement. Ignoring memory allocation issues, the code is very similar to the data parallel example, Code 2.5 from [[#References | Solihin (2008)]], discussed earlier. The main difference is the presence of a control thread that sends the parallel tasks to the CUDA device.<br /> <br /> // Data parallel implementation of the example code using C for CUDA.<br /> <br /> #include <iostream><br /> <br /> __global__ void kernel(float* a, float* b, float* c, float* local_sum)<br /> {<br /> int id = threadIdx.x;<br /> int local_iter = 4;<br /> int start_iter = id * local_iter;<br /> int end_iter = start_iter + local_iter;<br /> <br /> // Begin data parallel section<br /> <br /> for (int i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum[id] = 0;<br /> for (int i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum[id] = local_sum[id] + a[i];<br /> <br /> // End data parallel section<br /> }<br /> <br /> int main()<br /> {<br /> float h_a[8], h_b[8], h_c[8], h_sum[2];<br /> float *d_a, *d_b, *d_c, *d_sum;<br /> float sum;<br /> <br /> size_t size = 8 * sizeof(float);<br /> size_t size2 = 2 * sizeof(float);<br /> <br /> cudaMalloc((void**)&d_a, size);<br /> cudaMalloc((void**)&d_b, 
size);<br /> cudaMalloc((void**)&d_c, size);<br /> cudaMalloc((void**)&d_sum, size2); // one partial sum per thread<br /> <br /> // h_b and h_c must be filled with input values before these copies (initialization omitted here).<br /> cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);<br /> cudaMemcpy(d_c, h_c, size, cudaMemcpyHostToDevice);<br /> <br /> kernel<<<1, 2>>>(d_a, d_b, d_c, d_sum); // 1 block of 2 threads, 4 elements each<br /> <br /> cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);<br /> cudaMemcpy(h_sum, d_sum, size2, cudaMemcpyDeviceToHost);<br /> <br /> sum = h_sum[0] + h_sum[1];<br /> std::cout << sum;<br /> <br /> cudaFree(d_a);<br /> cudaFree(d_b);<br /> cudaFree(d_c);<br /> cudaFree(d_sum);<br /> }</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_cl&diff=43584 CSC/ECE 506 Spring 2011/ch2 cl 2011-01-31T20:19:06Z

<p>Cslingaf: </p> <hr /> <div>=Supplement to Chapter 2: The Data Parallel Programming Model=<br /> <br /> Chapter 2 of [[#References | Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [[#Definitions | ''data parallel'']] model, another commonly recognized parallel programming model covered in other treatments like [[#References | Foster (1995)]] and [[#References | Culler (1999)]]. Whereas the shared memory and message passing models are often present as competing models, the data parallel model addresses fundamentally different programming concerns and can therefore be used in conjunction with either. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References | Solihin (2008)]]. The [[#Definitions | ''task parallel'']] model will also be introduced briefly as a point of contrast.<br /> <br /> =Overview=<br /> <br /> Whereas the shared memory and message passing models focus on how parallel tasks access common data, the [[#Definitions | ''data parallel'']] model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data. An example of a data parallel code can be seen in Code 2.5 from [[#References | Solihin (2008)]] which is reproduced below. It has been annotated with comments identifying the region of the code which is data parallel.<br /> <br /> // Data parallel code, adapted from [[#References|Solihin (2008), p. 
27.]]<br /> <br /> id = getmyid(); // Assume id = 0 for thread 0, id = 1 for thread 1<br /> local_iter = 4;<br /> start_iter = id * local_iter;<br /> end_iter = start_iter + local_iter;<br /> <br /> if (id == 0)<br /> send_msg(P1, b[4..7], c[4..7]);<br /> else<br /> recv_msg(P0, b[4..7], c[4..7]);<br /> <br /> // Begin data parallel section<br /> <br /> for (i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum = 0;<br /> for (i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum = local_sum + a[i];<br /> <br /> // End data parallel section<br /> <br /> if (id == 0)<br /> {<br /> recv_msg(P1, &local_sum1);<br /> sum = local_sum + local_sum1;<br /> Print sum;<br /> }<br /> else<br /> send_msg(P0, local_sum);<br /> <br /> In the code above, the three 8 element arrays are each divided into two 4 element chunks. In the data parallel section, the code executed by the two threads is identical, but each thread operates on a different chunk of data.<br /> <br /> [[#References | Hillis (1986)]] points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks. [[#References | Haveraaen (2000)]] also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. Comparison of the data parallel section of code identified above with the sequential Code 2.3 of [[#References | Solihin (2008)]], which is reproduced below, supports this assertion. The only differences between the two codes are the start and end indices and that, in the data parallel example, the variable sum is replaced by a private variable. Structurally the two codes are identical.<br /> <br /> // Sequential code, from [[#References|Solihin (2008), p. 
25.]]<br /> <br /> for (i = 0; i < 8; i++)<br /> a[i] = b[i] + c[i];<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> Print sum;<br /> <br /> The logical opposite of data parallel is [[#Definitions | ''task parallel,'']] in which a number of distinct tasks operate on common data. An example of a task parallel code which is functionally equivalent to the sequential and data parallel codes given above follows below.<br /> <br /> // Task parallel code.<br /> <br /> int id = getmyid(); // assume id = 0 for thread 0, id = 1 for thread 1<br /> <br /> if (id == 0)<br /> {<br /> for (i = 0; i < 8; i++)<br /> {<br /> a[i] = b[i] + c[i];<br /> send_msg(P1, a[i]);<br /> }<br /> }<br /> else<br /> {<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> {<br /> recv_msg(P0, a[i]);<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> }<br /> Print sum;<br /> }<br /> <br /> In the code above, work is divided into two parallel tasks. The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.'' The other sums the elements of ''a.'' These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed by each thread is different (rather than identical).<br /> <br /> Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated. This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements. In addition, unique tasks are likely to have significantly different run times, making it more challenging to balance load across processors. [[#References | Haveraaen (2000)]] also notes that task parallel algorithms are inherently more complex, requiring a greater degree of communication and synchronization. 
In the task parallel code above, after thread 0 computes an element of ''a'' it must send it to thread 1. To support this, sends and receives occur every iteration of the two loops, resulting in a total of 8 messages being sent between the threads. In contrast, the data parallel code sends only 2 messages, one at the beginning and one at the end. <br /> <br /> The following diagram may be of use conceptually distinguishing between data parallelism (SIMD: Single Instruction, Multiple Data) and task parallelism (MIMD: Multiple Instruction, Multiple Data). In the SIMD, it is observed that a single instruction runs to multiple processors which then access multiple connections to the data. In contrast, the MIMD has multiple instruction streams (evidenced by two groups of processors) which interact, again, with multiple connections to the data<br /> [[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]<br /> <br /> <br /> The table below summarizes the key differences between data parallel and task parallel programming models.<br /> <br /> {| class="wikitable" border="1" align="center"<br /> |+ '''Comparison between data parallel and task parallel programming models.'''<br /> |-<br /> ! Aspects<br /> ! Data Parallel<br /> ! 
Task Parallel<br /> |-<br /> | Decomposition<br /> | Partition data into subsets<br /> | Partition program into subtasks<br /> |-<br /> | Parallel tasks<br /> | Identical<br /> | Unique<br /> |-<br /> | Degree of parallelism<br /> | Scales easily<br /> | Fixed<br /> |-<br /> | Load balancing<br /> | Easier<br /> | Harder<br /> |-<br /> | Communication overhead<br /> | Lower<br /> | Higher<br /> |}<br /> <br /> ==Synchronous vs Asynchronous==<br /> While the [http://en.wikipedia.org/wiki/Lockstep_(computing) lockstep] imposed by data parallelism on all data streams ensures synchronous computation (all PEs perform their tasks at the exact same pace), every processor in task parallelism performs its task at their own pace, which we call asynchronous computation. Thus, at a certain point of a task parallel program's execution, communication and synchronization primitives are needed to allow different instruction streams to coordinate their efforts, and that is where variable-sharing and message-passing come into play.<br /> <br /> == Determinism vs. Non-Determinism ==<br /> Data parallelism's synchronous nature and task parallelism's asynchronism give rise to another pair of features that add to the difference between these two models: determinism versus non-determinism. Data parallelism is deterministic, i.e. computing with the same input will always yield the same result, since its synchronism ensures that issues like relative timing between PEs will not arise. In contrast, task parallelism's asynchronous updates of common data can give rise to non-determinism, i.e, the same input won't always yield the same computation result (the result of a computation will depend also on factors outside the program control, such as scheduling and timing of other PEs). Obviously, non-determinism makes it harder to write and maintain correct programs. 
This partially explains the advantage of the data parallel programming model over task parallelism in terms of development effort (also discussed in section 4.2).<br /> <br /> <br /> =History of Parallel Programming Models=<br /> <br /> ==Vector Machines==<br /> <br /> First appearing in the 1970s, vector machines were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.<br /> <br /> The Solomon project at Westinghouse was one of the first efforts to use vector operations. Its CPU had a large number of ALUs that would each be fed different data each cycle. Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.<br /> <br /> An innovation came with the Cray-1 supercomputer in 1976. It was realized that large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray computer was only able to load 12 data points, but by completing multiple instructions before continuing, the total number of memory accesses decreased. The Cray-1 could perform at 240 MFLOPS.<br /> <br /> ==References for this section==<br /> *Wikipedia, Vector processor http://en.wikipedia.org/w/index.php?title=Vector_processor&oldid=405209552<br /> *Wikipedia, Cray-1 http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730<br /> <br /> ==Interesting Dates in Parallel Computing History (with a focus on IBM, Cray, and ILLIAC)==<br /> ===1950's===<br /> *1955<br /> **IBM introduces the 704. 
Principal architect is Gene Amdahl; it is the first commercial machine with floating-point hardware, and is capable of approximately 5 kFLOPS. <br /> <br /> *1956<br /> **IBM starts 7030 project (known as STRETCH) to produce supercomputer for Los Alamos National Laboratory (LANL). Its goal is to produce a machine with 100 times the performance of any available at the time. <br /> <br /> *1958<br /> **Bull of France announces the Gamma 60 with multiple functional units and fork & join operations in its instruction set. 19 are later built. <br /> **John Cocke and Daniel Slotnick discuss use of parallelism in numerical calculations in an IBM research memo. Slotnick later proposes SOLOMON, a SIMD machine with 1024 1-bit PEs, each with memory for 128 32-bit values. The machine is never built, but the design is the starting point for much later work. <br /> ===1960's===<br /> *1960<br /> **Atlas computer becomes operational. It is the first machine to use virtual memory and paging; its instruction execution is pipelined, and it contains separate fixed- and floating-point arithmetic units, capable of approximately 200 kFLOPS. <br /> **Burroughs introduces the D825 symmetrical MIMD multiprocessor. 1 to 4 CPUs access 1 to 16 memory modules using a crossbar switch. The CPUs are similar to the later B5000; the operating system is symmetrical, with a shared ready queue. <br /> <br /> *1964<br /> **Daniel Slotnick proposes building a massively-parallel machine for the Lawrence Livermore National Laboratory (LLNL); the Atomic Energy Commission gives the contract to CDC instead, who build the STAR-100 to fulfil it. Slotnick's design funded by the Air Force, and evolves into the ILLIAC-IV. The machine is built at the University of Illinois, with Burroughs and Texas Instruments as primary subcontractors. Texas Instruments' Advanced Scientific Computer (ASC) also grows out of this initiative. 
<br /> <br /> *1966<br /> **Michael Flynn publishes a paper describing the architectural taxonomy which bears his name. <br /> <br /> *1967<br /> **IBM produces the 360/91 (later model 95) with dynamic instruction reordering. 20 of these are produced over the next several years; the line is eventually supplanted by the slower Model <br /> **Gene Amdahl and Daniel Slotnick have published debate at AFIPS Conference about the feasibility of parallel processing. Amdahl's argument about limits to parallelism becomes known as "Amdahl's Law"; he also propounds a corollary about system balance (sometimes called "Amdahl's Other Law"), which states that a balanced machine has the same number of MIPS, Mbytes, and Mbit/s of I/O bandwidth. <br /> <br /> *1968<br /> **IBM 2938 Array Processor delivered to Western Geophysical (who promptly paint racing stripes on it). First commercial machine to sustain 10 MFLOPS on 32-bit floating-point operations. A programmable digital signal processor, it proves very popular in the petroleum industry. <br /> **Edsger Dijkstra describes semaphores, and introduces the dining philosophers problem, which later becomes a standard example in concurrency theory. <br /> <br /> *1969<br /> **George Paul, M. Wayne Wilson, and Charles Cree begin work at IBM on VECTRAN, an extension to FORTRAN 66 with array-valued operators, functions, and I/O facilities. <br /> **Work begins at Compass Inc. on a parallelizing FORTRAN compiler for the ILLIAC-IV called IVTRAN. <br /> <br /> ===1970's===<br /> *1971<br /> **Intel produces the world's first single-chip CPU, the 4004 microprocessor. <br /> <br /> *1972<br /> **Seymour Cray leaves Control Data Corporation to found Cray Research Inc. CDC cancels the 8600 project, a follow-on to the 7600. <br /> **Quarter-sized (64 PEs) ILLIAC-IV installed at NASA Ames. Each processor has a peak speed of 4 MFLOPS; the machine's I/O system is capable of 500 Mbit/s. 
<br /> **Paper studies of massive bit-level parallelism done by Stewart Reddaway at ICL. These later lead to development of ICL DAP. <br /> <br /> *1974<br /> **Leslie Lamport's paper "Parallel Execution of Do-Loops" lays the theoretical foundation for most later research on automatic vectorization and shared-memory parallelization. Much of the work was done in 1971-2 while Lamport was at Compass Inc. <br /> **IBM delivers the first 3838 array processor, a general-purpose digital signal processor. <br /> <br /> *1975<br /> **ILLIAC-IV becomes operational at NASA Ames after concerted check-out effort. <br /> <br /> *1976<br /> **Cray Research delivers the first Freon-cooled CRAY-1 to Los Alamos National Laboratory. <br /> <br /> *1979<br /> **IBM's John Cocke designs the 801, the first of what are later called RISC architectures. <br /> <br /> ===1980's===<br /> *1980<br /> **PFC (Parallel FORTRAN Compiler) developed at Rice University under the direction of Ken Kennedy. <br /> **David Padua and David Kuck at the University of Illinois develop the DOACROSS parallel construct to be used as a target in program transformation. The name DOACROSS is due to Robert Kuhn. <br /> <br /> *1982<br /> **Steve Chen's group at Cray Research produces the first X-MP, containing two pipelined processors compatible with the CRAY-1 and shared memory. <br /> **ILLIAC-IV decommissioned. <br /> <br /> *1983<br /> **J. R. Allen's Ph.D. thesis at Rice University introduces the concepts of loop-carried and loop-independent dependencies, and formalizes the process of vectorization. <br /> **Scientific Computer Systems founded to design and market Cray-compatible minisupercomputers. <br /> **CRAY-1 with 1 processor achieves 12.5 MFLOPS on the 100x100 LINPACK benchmark. <br /> <br /> *1984<br /> **The CRAY X-MP family is expanded to include 1- and 4-processor machines. A CRAY X-MP running CX-OS, the first Unix-like operating system for supercomputers, is delivered to NASA Ames. 
<br /> **CRAY X-MP with 1 processor achieves 21 MFLOPS on 100x100 LINPACK. <br /> <br /> *1985<br /> **Cray Research produces the CRAY-2, with four background processors, a single foreground processor, a 4.1 nsec clock cycle, and 256 Mword memory. The machine is cooled by an inert fluorocarbon previously used as a blood substitute. <br /> <br /> *1986<br /> **CRAY X-MP with 4 processors achieves 713 MFLOPS (against a peak of 840) on 1000x1000 LINPACK. <br /> **Alan Karp offers a $100 prize to the first person to demonstrate a speedup of 200 or more on a general-purpose parallel processor. Benner, Gustafson, and Montry begin work to win it, and are later awarded the Gordon Bell Prize. <br /> <br /> *1987<br /> **The first Gordon Bell Prizes for parallel performance are awarded. The recipients are Benner, Gustafson, and Montry, for a speedup of 400-600 on a variety of applications running on a 1024-node nCUBE, and Chen, De Benedictis, Fox, Li, and Walker, for speedups of 39-458 on various hypercubes. <br /> <br /> *1988<br /> **John Gustafson and Gary Montry argue that Amdahl's Law can be invalidated by increasing problem size. <br /> **CRAY Y-MP with 1 processor achieves 74 MFLOPS on 100x100 LINPACK; the same machine with 8 processors achieves 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> <br /> *1989<br /> **CRAY Y-MP with 8 processors achieves 275 MFLOPS on 100x100 LINPACK, and 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> **Gordon Bell Prize for absolute performance awarded to a team from Mobil and Thinking Machines Corporation, who achieve 6 GFLOPS on a CM-2 Connection Machine; prize in price/performance category awarded to Emeagwali, who achieves 400 MFLOPS per million dollars on the same platform. <br /> **Seymour Cray leaves Cray Research to found Cray Computer Corporation. 
<br /> <br /> ===1990's===<br /> *1990<br /> **Cray Research, Inc., purchases Supertek Computers Inc., makers of the S-1, a minisupercomputer compatible with the CRAY X-MP. <br /> **Gordon Bell Prize in price/performance category awarded to Geist, Stocks, Ginatempo, and Shelton, who achieve 800 MFLOPS per million dollars in a high-temperature superconductivity program on a 128-node Intel iPSC/860. The prize in the compiler parallelization category is awarded to Sabot, Tennies, and Vasilevsky, who achieve 1.5 GFLOPS on a CM-2 Connection Machine with FORTRAN 90 code derived from FORTRAN 77. <br /> **National Energy Research Supercomputer Center (NERSC) at LLNL places order with Cray Computer Corporation for CRAY-3 supercomputer. The order includes a unique 8-processor CRAY-2 computer system that is installed in April. <br /> <br /> *1991<br /> **CRAY Y-MP C90 with 16 processors achieves 403 MFLOPS on 100x100 LINPACK; a Fujitsu VP-2600 with 1 processor achieves 4 GFLOPS (against a peak of 5 GFLOPS) on 1000x1000 LINPACK. <br /> <br /> *1993<br /> **Cray Research delivers a Y-MP M90 with 32 Gbyte of memory to the U.S. Government, after delivering a similar machine with 8 Gbyte of memory in the previous year to the Minnesota Supercomputer Center. <br /> <br /> ===References===<br /> http://ei.cs.vt.edu/~history/Parallel.html<br /> <br /> <br /> <br /> =Comparing the Data Parallel Model with the Shared Memory and Message Passing Models=<br /> <br /> Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. 
In fact, [[#References | Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.<br /> <br /> As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.<br /> <br /> Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. [[#Definitions | ''SIMD (single-instruction-multiple-data)'']] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix: C for CUDA Example Code | Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model&mdash;like the message passing model&mdash;does not require hardware support.<br /> <br /> Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. 
However, data parallel code also requires writing code to split program data into chunks and assign them to different threads. In addition, a problem may not decompose easily into subproblems that rely on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.<br /> <br /> Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.<br /> <br /> =Definitions=<br /> <br /> * ''Data parallel.'' A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.<br /> * ''Task parallel.'' A task parallel algorithm is composed of a set of differing tasks which operate on common data.<br /> * ''SIMD (single-instruction-multiple-data).'' A processor which executes a single instruction simultaneously on multiple data locations.<br /> <br /> =References=<br /> <br /> * David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.<br /> * Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.<br /> * Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 "Machine and collection abstractions for user-implemented data-parallel programming,"] ''Scientific Programming,'' 8(4):231-246, 2000.<br /> * W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 "Data parallel algorithms,"] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.<br /> * Alexander C. Klaiber and Henry M. 
Levy, [http://portal.acm.org/citation.cfm?id=192020 "A comparison of message passing and shared memory architectures for data parallel programs,"] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.<br /> * Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.<br /> <br /> =Appendix: C for CUDA Example Code=<br /> <br /> The following code is a data parallel implementation of the sequential Code 2.3 from [[#References | Solihin (2008)]] using [http://www.nvidia.com/object/cuda_learn.html C for CUDA]. It is presented to give an impression of programming for a SIMD architecture, but a detailed discussion is beyond the scope of this supplement. Ignoring memory allocation issues, the code is very similar to the data parallel example, Code 2.5 from [[#References | Solihin (2008)]], discussed earlier. The main difference is the presence of a control thread that sends the parallel tasks to the CUDA device.<br /> <br /> // Data parallel implementation of the example code using C for CUDA.<br /> <br /> #include <iostream><br /> <br /> __global__ void kernel(float* a, float* b, float* c, float* local_sum)<br /> {<br /> int id = threadIdx.x;<br /> int local_iter = 4;<br /> int start_iter = id * local_iter;<br /> int end_iter = start_iter + local_iter;<br /> <br /> // Begin data parallel section<br /> <br /> for (int i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum[id] = 0;<br /> for (int i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum[id] = local_sum[id] + a[i];<br /> <br /> // End data parallel section<br /> }<br /> <br /> int main()<br /> {<br /> float h_a[8], h_b[8], h_c[8], h_sum[2];<br /> float *d_a, *d_b, *d_c, *d_sum;<br /> float sum;<br /> <br /> size_t size = 8 * sizeof(float);<br /> size_t size2 = 2 * sizeof(float);<br /> <br /> cudaMalloc((void**)&d_a, size);<br /> cudaMalloc((void**)&d_b, 
size);<br /> cudaMalloc((void**)&d_c, size);<br /> cudaMalloc((void**)&d_sum, size2);<br /> <br /> cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);<br /> cudaMemcpy(d_c, h_c, size, cudaMemcpyHostToDevice);<br /> <br /> kernel<<<1, 2>>>(d_a, d_b, d_c, d_sum);<br /> <br /> cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);<br /> cudaMemcpy(h_sum, d_sum, size2, cudaMemcpyDeviceToHost);<br /> <br /> sum = h_sum[0] + h_sum[1];<br /> std::cout << sum;<br /> <br /> cudaFree(d_a);<br /> cudaFree(d_b);<br /> cudaFree(d_c);<br /> cudaFree(d_sum);<br /> }</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_cl&diff=43583 CSC/ECE 506 Spring 2011/ch2 cl 2011-01-31T20:10:53Z

<p>Cslingaf: </p> <hr /> <div>=Supplement to Chapter 2: The Data Parallel Programming Model=<br /> <br /> Chapter 2 of [[#References | Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [[#Definitions | ''data parallel'']] model, another commonly recognized parallel programming model covered in other treatments like [[#References | Foster (1995)]] and [[#References | Culler (1999)]]. Whereas the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can therefore be used in conjunction with either. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References | Solihin (2008)]]. The [[#Definitions | ''task parallel'']] model will also be introduced briefly as a point of contrast.<br /> <br /> =Overview=<br /> <br /> Whereas the shared memory and message passing models focus on how parallel tasks access common data, the [[#Definitions | ''data parallel'']] model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data. An example of data parallel code can be seen in Code 2.5 from [[#References | Solihin (2008)]], which is reproduced below. It has been annotated with comments identifying the region of the code which is data parallel.<br /> <br /> // Data parallel code, adapted from [[#References|Solihin (2008), p. 
27.]]<br /> <br /> id = getmyid(); // Assume id = 0 for thread 0, id = 1 for thread 1<br /> local_iter = 4;<br /> start_iter = id * local_iter;<br /> end_iter = start_iter + local_iter;<br /> <br /> if (id == 0)<br /> send_msg(P1, b[4..7], c[4..7]);<br /> else<br /> recv_msg(P0, b[4..7], c[4..7]);<br /> <br /> // Begin data parallel section<br /> <br /> for (i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum = 0;<br /> for (i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum = local_sum + a[i];<br /> <br /> // End data parallel section<br /> <br /> if (id == 0)<br /> {<br /> recv_msg(P1, &local_sum1);<br /> sum = local_sum + local_sum1;<br /> Print sum;<br /> }<br /> else<br /> send_msg(P0, local_sum);<br /> <br /> In the code above, the three 8 element arrays are each divided into two 4 element chunks. In the data parallel section, the code executed by the two threads is identical, but each thread operates on a different chunk of data.<br /> <br /> [[#References | Hillis (1986)]] points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks. [[#References | Haveraaen (2000)]] also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. Comparison of the data parallel section of code identified above with the sequential Code 2.3 of [[#References | Solihin (2008)]], which is reproduced below, supports this assertion. The only differences between the two codes are the start and end indices and that, in the data parallel example, the variable sum is replaced by a private variable. Structurally the two codes are identical.<br /> <br /> // Sequential code, from [[#References|Solihin (2008), p. 
25.]]<br /> <br /> for (i = 0; i < 8; i++)<br /> a[i] = b[i] + c[i];<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> Print sum;<br /> <br /> The logical opposite of data parallel is [[#Definitions | ''task parallel,'']] in which a number of distinct tasks operate on common data. An example of a task parallel code which is functionally equivalent to the sequential and data parallel codes given above follows below.<br /> <br /> // Task parallel code.<br /> <br /> int id = getmyid(); // assume id = 0 for thread 0, id = 1 for thread 1<br /> <br /> if (id == 0)<br /> {<br /> for (i = 0; i < 8; i++)<br /> {<br /> a[i] = b[i] + c[i];<br /> send_msg(P1, a[i]);<br /> }<br /> }<br /> else<br /> {<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> {<br /> recv_msg(P0, a[i]);<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> }<br /> Print sum;<br /> }<br /> <br /> In the code above, work is divided into two parallel tasks. The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.'' The other sums the elements of ''a.'' These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed by each thread is different (rather than identical).<br /> <br /> Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated. This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements. In addition, unique tasks are likely to have significantly different run times, making it more challenging to balance load across processors. [[#References | Haveraaen (2000)]] also notes that task parallel algorithms are inherently more complex, requiring a greater degree of communication and synchronization. 
In the task parallel code above, after thread 0 computes an element of ''a'' it must send it to thread 1. To support this, sends and receives occur every iteration of the two loops, resulting in a total of 8 messages being sent between the threads. In contrast, the data parallel code sends only 2 messages, one at the beginning and one at the end. <br /> <br /> The following diagram may help distinguish conceptually between data parallelism (SIMD: Single Instruction, Multiple Data) and task parallelism (MIMD: Multiple Instruction, Multiple Data). In the SIMD case, a single instruction stream is issued to multiple processors, each of which accesses the data through its own connection. In contrast, the MIMD case has multiple instruction streams (evidenced by two groups of processors) which interact, again, with multiple connections to the data.<br /> [[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]<br /> <br /> <br /> The table below summarizes the key differences between data parallel and task parallel programming models.<br /> <br /> {| class="wikitable" border="1" align="center"<br /> |+ '''Comparison between data parallel and task parallel programming models.'''<br /> |-<br /> ! Aspects<br /> ! Data Parallel<br /> ! 
Task Parallel<br /> |-<br /> | Decomposition<br /> | Partition data into subsets<br /> | Partition program into subtasks<br /> |-<br /> | Parallel tasks<br /> | Identical<br /> | Unique<br /> |-<br /> | Degree of parallelism<br /> | Scales easily<br /> | Fixed<br /> |-<br /> | Load balancing<br /> | Easier<br /> | Harder<br /> |-<br /> | Communication overhead<br /> | Lower<br /> | Higher<br /> |}<br /> <br /> ==Synchronous vs Asynchronous==<br /> While the [http://en.wikipedia.org/wiki/Lockstep_(computing) lockstep] imposed by data parallelism on all data streams ensures synchronous computation (all PEs perform their tasks at exactly the same pace), every processor in task parallelism performs its task at its own pace, which we call asynchronous computation. Thus, at certain points in a task parallel program's execution, communication and synchronization primitives are needed to allow different instruction streams to coordinate their efforts, and that is where variable-sharing and message-passing come into play.<br /> <br /> == Determinism vs. Non-Determinism ==<br /> Data parallelism's synchronous nature and task parallelism's asynchronism give rise to another pair of features that add to the difference between these two models: determinism versus non-determinism. Data parallelism is deterministic, i.e., computing with the same input will always yield the same result, since its synchronism ensures that issues like relative timing between PEs will not arise. In contrast, task parallelism's asynchronous updates of common data can give rise to non-determinism, i.e., the same input won't always yield the same computation result (the result of a computation will depend also on factors outside the program's control, such as scheduling and timing of other PEs). Obviously, non-determinism makes it harder to write and maintain correct programs. 
This partially explains the advantage of the data parallel programming model over task parallelism in terms of development effort (also discussed in section 4.2).<br /> <br /> <br /> =History of Parallel Programming Models=<br /> <br /> ==Vector Machines==<br /> <br /> First appearing in the 1970s, vector machines were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.<br /> <br /> The Solomon project at Westinghouse was one of the first machines to use vector operations. Its CPU had a large number of ALUs that would each be fed different data each cycle. Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.<br /> <br /> An innovation came with the Cray-1 supercomputer in 1976. It was realized that large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray computer was only able to load 12 data points, but by completing multiple instructions before continuing, it decreased the total number of memory accesses. The Cray-1 could perform at 240 MFLOPS.<br /> <br /> ==References for this section==<br /> *Wikipedia, Vector processor http://en.wikipedia.org/w/index.php?title=Vector_processor&oldid=405209552<br /> *Wikipedia, Cray-1 http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730<br /> <br /> ==Interesting Dates in Parallel Computing History (with a focus on IBM, Cray, and ILLIAC)==<br /> ===1955===<br /> IBM introduces the 704. Principal architect is Gene Amdahl; it is the first commercial machine with floating-point hardware, and is capable of approximately 5 kFLOPS. 
<br /> ===1956===<br /> IBM starts 7030 project (known as STRETCH) to produce supercomputer for Los Alamos National Laboratory (LANL). Its goal is to produce a machine with 100 times the performance of any available at the time. <br /> ===1958===<br /> Bull of France announces the Gamma 60 with multiple functional units and fork & join operations in its instruction set. 19 are later built. <br /> John Cocke and Daniel Slotnick discuss use of parallelism in numerical calculations in an IBM research memo. Slotnick later proposes SOLOMON, a SIMD machine with 1024 1-bit PEs, each with memory for 128 32-bit values. The machine is never built, but the design is the starting point for much later work. <br /> ===1962===<br /> Atlas computer becomes operational. It is the first machine to use virtual memory and paging; its instruction execution is pipelined, and it contains separate fixed- and floating-point arithmetic units, capable of approximately 200 kFLOPS. <br /> Burroughs introduces the D825 symmetrical MIMD multiprocessor. 1 to 4 CPUs access 1 to 16 memory modules using a crossbar switch. The CPUs are similar to the later B5000; the operating system is symmetrical, with a shared ready queue. <br /> ===1964===<br /> Daniel Slotnick proposes building a massively-parallel machine for the Lawrence Livermore National Laboratory (LLNL); the Atomic Energy Commission gives the contract to CDC instead, who build the STAR-100 to fulfil it. Slotnick's design funded by the Air Force, and evolves into the ILLIAC-IV. The machine is built at the University of Illinois, with Burroughs and Texas Instruments as primary subcontractors. Texas Instruments' Advanced Scientific Computer (ASC) also grows out of this initiative. <br /> ===1966===<br /> Michael Flynn publishes a paper describing the architectural taxonomy which bears his name. <br /> ===1967===<br /> IBM produces the 360/91 (later model 95) with dynamic instruction reordering. 
20 of these are produced over the next several years; the line is eventually supplanted by the slower Model <br /> Gene Amdahl and Daniel Slotnick have published debate at AFIPS Conference about the feasibility of parallel processing. Amdahl's argument about limits to parallelism becomes known as "Amdahl's Law"; he also propounds a corollary about system balance (sometimes called "Amdahl's Other Law"), which states that a balanced machine has the same number of MIPS, Mbytes, and Mbit/s of I/O bandwidth. <br /> ===1968===<br /> IBM 2938 Array Processor delivered to Western Geophysical (who promptly paint racing stripes on it). First commercial machine to sustain 10 MFLOPS on 32-bit floating-point operations. A programmable digital signal processor, it proves very popular in the petroleum industry. <br /> Edsger Dijkstra describes semaphores, and introduces the dining philosophers problem, which later becomes a standard example in concurrency theory. <br /> ===1969===<br /> George Paul, M. Wayne Wilson, and Charles Cree begin work at IBM on VECTRAN, an extension to FORTRAN 66 with array-valued operators, functions, and I/O facilities. <br /> Work begins at Compass Inc. on a parallelizing FORTRAN compiler for the ILLIAC-IV called IVTRAN. <br /> ===1971===<br /> Intel produces the world's first single-chip CPU, the 4004 microprocessor. <br /> ===1972===<br /> Seymour Cray leaves Control Data Corporation to found Cray Research Inc. CDC cancels the 8600 project, a follow-on to the 7600. <br /> Quarter-sized (64 PEs) ILLIAC-IV installed at NASA Ames. Each processor has a peak speed of 4 MFLOPS; the machine's I/O system is capable of 500 Mbit/s. <br /> Paper studies of massive bit-level parallelism done by Stewart Reddaway at ICL. These later lead to development of ICL DAP. <br /> ===1974===<br /> Leslie Lamport's paper "Parallel Execution of Do-Loops" lays the theoretical foundation for most later research on automatic vectorization and shared-memory parallelization. 
Much of the work was done in 1971-2 while Lamport was at Compass Inc. <br /> IBM delivers the first 3838 array processor, a general-purpose digital signal processor. <br /> ===1975===<br /> ILLIAC-IV becomes operational at NASA Ames after concerted check-out effort. <br /> ===1976===<br /> Cray Research delivers the first Freon-cooled CRAY-1 to Los Alamos National Laboratory. <br /> ===1979===<br /> IBM's John Cocke designs the 801, the first of what are later called RISC architectures. <br /> ===1980===<br /> PFC (Parallel FORTRAN Compiler) developed at Rice University under the direction of Ken Kennedy. <br /> David Padua and David Kuck at the University of Illinois develop the DOACROSS parallel construct to be used as a target in program transformation. The name DOACROSS is due to Robert Kuhn. <br /> ===1982===<br /> Steve Chen's group at Cray Research produces the first X-MP, containing two pipelined processors compatible with the CRAY-1 and shared memory. <br /> ILLIAC-IV decommissioned. <br /> ===1983===<br /> J. R. Allen's Ph.D. thesis at Rice University introduces the concepts of loop-carried and loop-independent dependencies, and formalizes the process of vectorization. <br /> Scientific Computer Systems founded to design and market Cray-compatible minisupercomputers. <br /> CRAY-1 with 1 processor achieves 12.5 MFLOPS on the 100x100 LINPACK benchmark. <br /> ===1984===<br /> The CRAY X-MP family is expanded to include 1- and 4-processor machines. A CRAY X-MP running CX-OS, the first Unix-like operating system for supercomputers, is delivered to NASA Ames. <br /> CRAY X-MP with 1 processor achieves 21 MFLOPS on 100x100 LINPACK. <br /> ===1985===<br /> Cray Research produces the CRAY-2, with four background processors, a single foreground processor, a 4.1 nsec clock cycle, and 256 Mword memory. The machine is cooled by an inert fluorocarbon previously used as a blood substitute. 
<br /> ===1986===<br /> CRAY X-MP with 4 processors achieves 713 MFLOPS (against a peak of 840) on 1000x1000 LINPACK. <br /> Alan Karp offers a $100 prize to the first person to demonstrate a speedup of 200 or more on a general-purpose parallel processor. Benner, Gustafson, and Montry begin work to win it, and are later awarded the Gordon Bell Prize. <br /> ===1987===<br /> The first Gordon Bell Prizes for parallel performance are awarded. The recipients are Benner, Gustafson, and Montry, for a speedup of 400-600 on a variety of applications running on a 1024-node nCUBE, and Chen, De Benedictis, Fox, Li, and Walker, for speedups of 39-458 on various hypercubes. <br /> ===1988===<br /> John Gustafson and Gary Montry argue that Amdahl's Law can be invalidated by increasing problem size. <br /> CRAY Y-MP with 1 processor achieves 74 MFLOPS on 100x100 LINPACK; the same machine with 8 processors achieves 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> ===1989===<br /> CRAY Y-MP with 8 processors achieves 275 MFLOPS on 100x100 LINPACK, and 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> Gordon Bell Prize for absolute performance awarded to a team from Mobil and Thinking Machines Corporation, who achieve 6 GFLOPS on a CM-2 Connection Machine; prize in price/performance category awarded to Emeagwali, who achieves 400 MFLOPS per million dollars on the same platform. <br /> Seymour Cray leaves Cray Research to found Cray Computer Corporation. <br /> ===1990===<br /> Cray Research, Inc., purchases Supertek Computers Inc., makers of the S-1, a minisupercomputer compatible with the CRAY X-MP. <br /> Gordon Bell Prize in price/performance category awarded to Geist, Stocks, Ginatempo, and Shelton, who achieve 800 MFLOPS per million dollars in a high-temperature superconductivity program on a 128-node Intel iPSC/860. 
The prize in the compiler parallelization category is awarded to Sabot, Tennies, and Vasilevsky, who achieve 1.5 GFLOPS on a CM-2 Connection Machine with FORTRAN 90 code derived from FORTRAN 77. <br /> National Energy Research Supercomputer Center (NERSC) at LLNL places order with Cray Computer Corporation for CRAY-3 supercomputer. The order includes a unique 8-processor CRAY-2 computer system that is installed in April. <br /> ===1991===<br /> CRAY Y-MP C90 with 16 processors achieves 403 MFLOPS on 100x100 LINPACK; a Fujitsu VP-2600 with 1 processor achieves 4 GFLOPS (against a peak of 5 GFLOPS) on 1000x1000 LINPACK. <br /> ===1993===<br /> Cray Research delivers a Y-MP M90 with 32 Gbyte of memory to the U.S. Government, after delivering a similar machine with 8 Gbyte of memory in the previous year to the Minnesota Supercomputer Center. <br /> <br /> ====References====<br /> http://ei.cs.vt.edu/~history/Parallel.html<br /> <br /> <br /> <br /> =Comparing the Data Parallel Model with the Shared Memory and Message Passing Models=<br /> <br /> Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References | Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.<br /> <br /> As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. 
Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.<br /> <br /> Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. [[#Definitions | ''SIMD (single-instruction-multiple-data)'']] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix: C for CUDA Example Code | Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model&mdash;like the message passing model&mdash;does not require hardware support.<br /> <br /> Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.<br /> <br /> Once written, data parallel programs can scale easily to large numbers of processors. 
The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.<br /> <br /> =Definitions=<br /> <br /> * ''Data parallel.'' A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.<br /> * ''Task parallel.'' A task parallel algorithm is composed of a set of differing tasks which operate on common data.<br /> * ''SIMD (single-instruction-multiple-data).'' A processor which executes a single instruction simultaneously on multiple data locations.<br /> <br /> =References=<br /> <br /> * David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.<br /> * Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.<br /> * Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 "Machine and collection abstractions for user-implemented data-parallel programming,"] ''Scientific Programming,'' 8(4):231-246, 2000.<br /> * W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 "Data parallel algorithms,"] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.<br /> * Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 "A comparison of message passing and shared memory architectures for data parallel programs,"] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 
94-105.<br /> * Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.<br /> <br /> =Appendix: C for CUDA Example Code=<br /> <br /> The following code is a data parallel implementation of the sequential Code 2.3 from [[#References | Solihin (2008)]] using [http://www.nvidia.com/object/cuda_learn.html C for CUDA]. It is presented to give an impression of programming for a SIMD architecture, but a detailed discussion is beyond the scope of this supplement. Ignoring memory allocation issues, the code is very similar to the data parallel example, Code 2.5 from [[#References | Solihin (2008)]], discussed earlier. The main difference is the presence of a control thread that sends the parallel tasks to the CUDA device.<br /> <br /> // Data parallel implementation of the example code using C for CUDA.<br /> <br /> #include <iostream><br /> <br /> __global__ void kernel(float* a, float* b, float* c, float* local_sum)<br /> {<br /> int id = threadIdx.x;<br /> int local_iter = 4;<br /> int start_iter = id * local_iter;<br /> int end_iter = start_iter + local_iter;<br /> <br /> // Begin data parallel section<br /> <br /> for (int i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum[id] = 0;<br /> for (int i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum[id] = local_sum[id] + a[i];<br /> <br /> // End data parallel section<br /> }<br /> <br /> int main()<br /> {<br /> // Note: h_b and h_c are assumed to be filled with input data (omitted here).<br /> float h_a[8], h_b[8], h_c[8], h_sum[2];<br /> float *d_a, *d_b, *d_c, *d_sum;<br /> float sum;<br /> <br /> size_t size = 8 * sizeof(float);<br /> size_t size2 = 2 * sizeof(float);<br /> <br /> cudaMalloc((void**)&d_a, size);<br /> cudaMalloc((void**)&d_b, size);<br /> cudaMalloc((void**)&d_c, size);<br /> cudaMalloc((void**)&d_sum, size2);<br /> <br /> cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);<br /> cudaMemcpy(d_c, h_c, size, cudaMemcpyHostToDevice);<br /> <br /> kernel<<<1, 2>>>(d_a, 
d_b, d_c, d_sum);<br /> <br /> cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);<br /> cudaMemcpy(h_sum, d_sum, size2, cudaMemcpyDeviceToHost);<br /> <br /> sum = h_sum[0] + h_sum[1];<br /> std::cout << sum;<br /> <br /> cudaFree(d_a);<br /> cudaFree(d_b);<br /> cudaFree(d_c);<br /> cudaFree(d_sum);<br /> }</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_cl&diff=43568 CSC/ECE 506 Spring 2011/ch2 cl 2011-01-31T05:02:53Z

<p>Cslingaf: </p> <hr /> <div>=Supplement to Chapter 2: The Data Parallel Programming Model=<br /> <br /> Chapter 2 of [[#References | Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [[#Definitions | ''data parallel'']] model, another commonly recognized parallel programming model covered in other treatments like [[#References | Foster (1995)]] and [[#References | Culler (1999)]]. Whereas the shared memory and message passing models are often present as competing models, the data parallel model addresses fundamentally different programming concerns and can therefore be used in conjunction with either. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References | Solihin (2008)]]. The [[#Definitions | ''task parallel'']] model will also be introduced briefly as a point of contrast.<br /> <br /> =Overview=<br /> <br /> Whereas the shared memory and message passing models focus on how parallel tasks access common data, the [[#Definitions | ''data parallel'']] model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data. An example of a data parallel code can be seen in Code 2.5 from [[#References | Solihin (2008)]] which is reproduced below. It has been annotated with comments identifying the region of the code which is data parallel.<br /> <br /> // Data parallel code, adapted from [[#References|Solihin (2008), p. 
27.]]<br /> <br /> id = getmyid(); // Assume id = 0 for thread 0, id = 1 for thread 1<br /> local_iter = 4;<br /> start_iter = id * local_iter;<br /> end_iter = start_iter + local_iter;<br /> <br /> if (id == 0)<br /> send_msg(P1, b[4..7], c[4..7]);<br /> else<br /> recv_msg(P0, b[4..7], c[4..7]);<br /> <br /> // Begin data parallel section<br /> <br /> for (i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum = 0;<br /> for (i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum = local_sum + a[i];<br /> <br /> // End data parallel section<br /> <br /> if (id == 0)<br /> {<br /> recv_msg(P1, &local_sum1);<br /> sum = local_sum + local_sum1;<br /> Print sum;<br /> }<br /> else<br /> send_msg(P0, local_sum);<br /> <br /> In the code above, the three 8 element arrays are each divided into two 4 element chunks. In the data parallel section, the code executed by the two threads is identical, but each thread operates on a different chunk of data.<br /> <br /> [[#References | Hillis (1986)]] points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks. [[#References | Haveraaen (2000)]] also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. Comparison of the data parallel section of code identified above with the sequential Code 2.3 of [[#References | Solihin (2008)]], which is reproduced below, supports this assertion. The only differences between the two codes are the start and end indices and that, in the data parallel example, the variable sum is replaced by a private variable. Structurally the two codes are identical.<br /> <br /> // Sequential code, from [[#References|Solihin (2008), p. 
25.]]<br /> <br /> for (i = 0; i < 8; i++)<br /> a[i] = b[i] + c[i];<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> Print sum;<br /> <br /> The logical opposite of data parallel is [[#Definitions | ''task parallel,'']] in which a number of distinct tasks operate on common data. An example of a task parallel code which is functionally equivalent to the sequential and data parallel codes given above follows below.<br /> <br /> // Task parallel code.<br /> <br /> int id = getmyid(); // assume id = 0 for thread 0, id = 1 for thread 1<br /> <br /> if (id == 0)<br /> {<br /> for (i = 0; i < 8; i++)<br /> {<br /> a[i] = b[i] + c[i];<br /> send_msg(P1, a[i]);<br /> }<br /> }<br /> else<br /> {<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> {<br /> recv_msg(P0, a[i]);<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> }<br /> Print sum;<br /> }<br /> <br /> In the code above, work is divided into two parallel tasks. The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.'' The other sums the elements of ''a.'' These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed by each thread is different (rather than identical).<br /> <br /> Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated. This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements. In addition, unique tasks are likely to have significantly different run times, making it more challenging to balance load across processors. [[#References | Haveraaen (2000)]] also notes that task parallel algorithms are inherently more complex, requiring a greater degree of communication and synchronization. 
In the task parallel code above, after thread 0 computes an element of ''a'' it must send it to thread 1. To support this, sends and receives occur in every iteration of the two loops, resulting in a total of 8 messages being sent between the threads. In contrast, the data parallel code sends only 2 messages, one at the beginning and one at the end. The table below summarizes the key differences between data parallel and task parallel programming models.<br /> <br /> {| class="wikitable" border="1" align="center"<br /> |+ '''Comparison between data parallel and task parallel programming models.'''<br /> |-<br /> ! Aspects<br /> ! Data Parallel<br /> ! Task Parallel<br /> |-<br /> | Decomposition<br /> | Partition data into subsets<br /> | Partition program into subtasks<br /> |-<br /> | Parallel tasks<br /> | Identical<br /> | Unique<br /> |-<br /> | Degree of parallelism<br /> | Scales easily<br /> | Fixed<br /> |-<br /> | Load balancing<br /> | Easier<br /> | Harder<br /> |-<br /> | Communication overhead<br /> | Lower<br /> | Higher<br /> |}<br /> <br /> =History of Parallel Programming Models=<br /> <br /> ==Vector Machines==<br /> <br /> First appearing in the 1970s, vector machines were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.<br /> <br /> The Solomon project at Westinghouse was one of the first designs to use vector operations. Its CPU had a large number of ALUs that would each be fed different data each cycle. Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.<br /> <br /> An innovation came with the Cray-1 supercomputer in 1976. It was realized that the large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. 
In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray computer was only able to load 12 data points, but by completing multiple instructions before continuing the total number of memory accesses decreased. The Cray-1 could perform at 240 MFLOPS.<br /> <br /> ==References for this section==<br /> *Wikipedia, Vector processor http://en.wikipedia.org/w/index.php?title=Vector_processor&oldid=405209552<br /> *Wikipedia, Cray-1 http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730<br /> <br /> ==Interesting Dates in Parallel Computing History (with a focus on IBM, Cray, and ILLIAC)==<br /> ===1955===<br /> IBM introduces the 704. Principal architect is Gene Amdahl; it is the first commercial machine with floating-point hardware, and is capable of approximately 5 kFLOPS. <br /> ===1956===<br /> IBM starts 7030 project (known as STRETCH) to produce supercomputer for Los Alamos National Laboratory (LANL). Its goal is to produce a machine with 100 times the performance of any available at the time. <br /> ===1958===<br /> Bull of France announces the Gamma 60 with multiple functional units and fork & join operations in its instruction set. 19 are later built. <br /> John Cocke and Daniel Slotnick discuss use of parallelism in numerical calculations in an IBM research memo. Slotnick later proposes SOLOMON, a SIMD machine with 1024 1-bit PEs, each with memory for 128 32-bit values. The machine is never built, but the design is the starting point for much later work. <br /> ===1962===<br /> Atlas computer becomes operational. It is the first machine to use virtual memory and paging; its instruction execution is pipelined, and it contains separate fixed- and floating-point arithmetic units, capable of approximately 200 kFLOPS. <br /> Burroughs introduces the D825 symmetrical MIMD multiprocessor. 1 to 4 CPUs access 1 to 16 memory modules using a crossbar switch. 
The CPUs are similar to the later B5000; the operating system is symmetrical, with a shared ready queue. <br /> ===1964===<br /> Daniel Slotnick proposes building a massively-parallel machine for the Lawrence Livermore National Laboratory (LLNL); the Atomic Energy Commission gives the contract to CDC instead, who build the STAR-100 to fulfil it. Slotnick's design is funded by the Air Force and evolves into the ILLIAC-IV. The machine is built at the University of Illinois, with Burroughs and Texas Instruments as primary subcontractors. Texas Instruments' Advanced Scientific Computer (ASC) also grows out of this initiative. <br /> ===1966===<br /> Michael Flynn publishes a paper describing the architectural taxonomy which bears his name. <br /> ===1967===<br /> IBM produces the 360/91 (later model 95) with dynamic instruction reordering. 20 of these are produced over the next several years; the line is eventually supplanted by the slower Model <br /> Gene Amdahl and Daniel Slotnick have a published debate at the AFIPS Conference about the feasibility of parallel processing. Amdahl's argument about limits to parallelism becomes known as "Amdahl's Law"; he also propounds a corollary about system balance (sometimes called "Amdahl's Other Law"), which states that a balanced machine has the same number of MIPS, Mbytes, and Mbit/s of I/O bandwidth. <br /> ===1968===<br /> IBM 2938 Array Processor delivered to Western Geophysical (who promptly paint racing stripes on it). First commercial machine to sustain 10 MFLOPS on 32-bit floating-point operations. A programmable digital signal processor, it proves very popular in the petroleum industry. <br /> Edsger Dijkstra describes semaphores, and introduces the dining philosophers problem, which later becomes a standard example in concurrency theory. <br /> ===1969===<br /> George Paul, M. Wayne Wilson, and Charles Cree begin work at IBM on VECTRAN, an extension to FORTRAN 66 with array-valued operators, functions, and I/O facilities. 
<br /> Work begins at Compass Inc. on a parallelizing FORTRAN compiler for the ILLIAC-IV called IVTRAN. <br /> ===1971===<br /> Intel produces the world's first single-chip CPU, the 4004 microprocessor. <br /> ===1972===<br /> Seymour Cray leaves Control Data Corporation to found Cray Research Inc. CDC cancels the 8600 project, a follow-on to the 7600. <br /> Quarter-sized (64 PEs) ILLIAC-IV installed at NASA Ames. Each processor has a peak speed of 4 MFLOPS; the machine's I/O system is capable of 500 Mbit/s. <br /> Paper studies of massive bit-level parallelism done by Stewart Reddaway at ICL. These later lead to development of ICL DAP. <br /> ===1974===<br /> Leslie Lamport's paper "Parallel Execution of Do-Loops" lays the theoretical foundation for most later research on automatic vectorization and shared-memory parallelization. Much of the work was done in 1971-2 while Lamport was at Compass Inc. <br /> IBM delivers the first 3838 array processor, a general-purpose digital signal processor. <br /> ===1975===<br /> ILLIAC-IV becomes operational at NASA Ames after concerted check-out effort. <br /> ===1976===<br /> Cray Research delivers the first Freon-cooled CRAY-1 to Los Alamos National Laboratory. <br /> ===1979===<br /> IBM's John Cocke designs the 801, the first of what are later called RISC architectures. <br /> ===1980===<br /> PFC (Parallel FORTRAN Compiler) developed at Rice University under the direction of Ken Kennedy. <br /> David Padua and David Kuck at the University of Illinois develop the DOACROSS parallel construct to be used as a target in program transformation. The name DOACROSS is due to Robert Kuhn. <br /> ===1982===<br /> Steve Chen's group at Cray Research produces the first X-MP, containing two pipelined processors compatible with the CRAY-1 and shared memory. <br /> ILLIAC-IV decommissioned. <br /> ===1983===<br /> J. R. Allen's Ph.D. 
thesis at Rice University introduces the concepts of loop-carried and loop-independent dependencies, and formalizes the process of vectorization. <br /> Scientific Computer Systems founded to design and market Cray-compatible minisupercomputers. <br /> CRAY-1 with 1 processor achieves 12.5 MFLOPS on the 100x100 LINPACK benchmark. <br /> ===1984===<br /> The CRAY X-MP family is expanded to include 1- and 4-processor machines. A CRAY X-MP running CX-OS, the first Unix-like operating system for supercomputers, is delivered to NASA Ames. <br /> CRAY X-MP with 1 processor achieves 21 MFLOPS on 100x100 LINPACK. <br /> ===1985===<br /> Cray Research produces the CRAY-2, with four background processors, a single foreground processor, a 4.1 nsec clock cycle, and 256 Mword memory. The machine is cooled by an inert fluorocarbon previously used as a blood substitute. <br /> ===1986===<br /> CRAY X-MP with 4 processors achieves 713 MFLOPS (against a peak of 840) on 1000x1000 LINPACK. <br /> Alan Karp offers $100 prize to first person to demonstrate speedup of 200 or more on general purpose parallel processor. Benner, Gustafson, and Montry begin work to win it, and are later awarded the Gordon Bell Prize. <br /> ===1987===<br /> The first Gordon Bell Prizes for parallel performance are awarded. The recipients are Benner, Gustafson, and Montry, for a speedup of 400-600 on a variety of applications running on a 1024-node nCUBE, and Chen, De Benedictis, Fox, Li, and Walker, for speedups of 39-458 on various hypercubes. <br /> ===1988===<br /> John Gustafson and Gary Montry argue that Amdahl's Law can be invalidated by increasing problem size. <br /> CRAY Y-MP with 1 processor achieves 74 MFLOPS on 100x100 LINPACK; the same machine with 8 processors achieves 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. <br /> ===1989===<br /> CRAY Y-MP with 8 processors achieves 275 MFLOPS on 100x100 LINPACK, and 2.1 GFLOPS (against a peak of 2.6) on 1000x1000 LINPACK. 
<br /> Gordon Bell Prize for absolute performance awarded to a team from Mobil and Thinking Machines Corporation, who achieve 6 GFLOPS on a CM-2 Connection Machine; prize in price/performance category awarded to Emeagwali, who achieves 400 MFLOPS per million dollars on the same platform. <br /> Seymour Cray leaves Cray Research to found Cray Computer Corporation. <br /> ===1990===<br /> Cray Research, Inc., purchases Supertek Computers Inc., makers of the S-1, a minisupercomputer compatible with the CRAY X-MP. <br /> Gordon Bell Prize in price/performance category awarded to Geist, Stocks, Ginatempo, and Shelton, who achieves 800 MFLOPS per million dollars in a high-temperature superconductivity program on a 128-node Intel iPSC/860. The prize in the compiler parallelization category is awarded to Sabot, Tennies, and Vasilevsky, who achieve 1.5 GFLOPS on a CM-2 Connection Machine with FORTRAN 90 code derived from FORTRAN 77. <br /> National Energy Research Supercomputer Center (NERSC) at LLNL places order with Cray Computer Corporation for CRAY-3 supercomputer. The order includes a unique 8-processor CRAY-2 computer system that is installed in April. <br /> ===1991===<br /> CRAY Y-MP C90 with 16 processors achieves 403 MFLOPS on 100x100 LINPACK; a Fujitsu VP-2600 with 1 processor achieves 4 GFLOPS (against a peak of 5 GFLOPS) on 1000x1000 LINPACK. <br /> ===1993===<br /> Cray Research delivers a Y-MP M90 with 32 Gbyte of memory to the U.S. Government, after delivering a similar machine with 8 Gbyte of memory in the previous year to the Minnesota Supercomputer Center. 
<br /> <br /> ====References====<br /> http://ei.cs.vt.edu/~history/Parallel.html<br /> <br /> <br /> <br /> =Comparing the Data Parallel Model with the Shared Memory and Message Passing Models=<br /> <br /> Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References | Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.<br /> <br /> As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.<br /> <br /> Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. [[#Definitions | ''SIMD (single-instruction-multiple-data)'']] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. 
Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix: C for CUDA Example Code | Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model&mdash;like the message passing model&mdash;does not require hardware support.<br /> <br /> Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.<br /> <br /> Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.<br /> <br /> =Definitions=<br /> <br /> * ''Data parallel.'' A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.<br /> * ''Task parallel.'' A task parallel algorithm is composed of a set of differing tasks which operate on common data.<br /> * ''SIMD (single-instruction-multiple-data).'' A processor which executes a single instruction simultaneously on multiple data locations.<br /> <br /> =References=<br /> <br /> * David E. 
Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.<br /> * Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.<br /> * Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 "Machine and collection abstractions for user-implemented data-parallel programming,"] ''Scientific Programming,'' 8(4):231-246, 2000.<br /> * W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 "Data parallel algorithms,"] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.<br /> * Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 "A comparison of message passing and shared memory architectures for data parallel programs,"] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.<br /> * Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.<br /> <br /> =Appendix: C for CUDA Example Code=<br /> <br /> The following code is a data parallel implementation of the sequential Code 2.3 from [[#References | Solihin (2008)]] using [http://www.nvidia.com/object/cuda_learn.html C for CUDA]. It is presented to give an impression of programming for a SIMD architecture, but a detailed discussion is beyond the scope of this supplement. Ignoring memory allocation issues, the code is very similar to the data parallel example, Code 2.5 from [[#References | Solihin (2008)]], discussed earlier. 
The main difference is the presence of a control thread that sends the parallel tasks to the CUDA device.<br /> <br /> // Data parallel implementation of the example code using C for CUDA.<br /> <br /> #include <iostream><br /> <br /> __global__ void kernel(float* a, float* b, float* c, float* local_sum)<br /> {<br /> int id = threadIdx.x;<br /> int local_iter = 4;<br /> int start_iter = id * local_iter;<br /> int end_iter = start_iter + local_iter;<br /> <br /> // Begin data parallel section<br /> <br /> for (int i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum[id] = 0;<br /> for (int i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum[id] = local_sum[id] + a[i];<br /> <br /> // End data parallel section<br /> }<br /> <br /> int main()<br /> {<br /> // Note: h_b and h_c are assumed to be filled with input data (omitted here).<br /> float h_a[8], h_b[8], h_c[8], h_sum[2];<br /> float *d_a, *d_b, *d_c, *d_sum;<br /> float sum;<br /> <br /> size_t size = 8 * sizeof(float);<br /> size_t size2 = 2 * sizeof(float);<br /> <br /> cudaMalloc((void**)&d_a, size);<br /> cudaMalloc((void**)&d_b, size);<br /> cudaMalloc((void**)&d_c, size);<br /> cudaMalloc((void**)&d_sum, size2);<br /> <br /> cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);<br /> cudaMemcpy(d_c, h_c, size, cudaMemcpyHostToDevice);<br /> <br /> kernel<<<1, 2>>>(d_a, d_b, d_c, d_sum);<br /> <br /> cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);<br /> cudaMemcpy(h_sum, d_sum, size2, cudaMemcpyDeviceToHost);<br /> <br /> sum = h_sum[0] + h_sum[1];<br /> std::cout << sum;<br /> <br /> cudaFree(d_a);<br /> cudaFree(d_b);<br /> cudaFree(d_c);<br /> cudaFree(d_sum);<br /> }</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&diff=43567 ECE506 Main Page 2011-01-31T04:44:22Z

<p>Cslingaf: </p> <hr /> <div>This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.<br /> <br /> =Supplements to Solihin Text=<br /> <br /> Post links to the textbook supplements in this section.<br /> <br /> *Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]<br /> *Chapter 2 (Still being revised) [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]</div>

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_cl&diff=43566 CSC/ECE 506 Spring 2011/ch2 cl 2011-01-31T04:43:45Z

<p>Cslingaf: </p> <hr /> <div>=Supplement to Chapter 2: The Data Parallel Programming Model=<br /> <br /> Chapter 2 of [[#References | Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [[#Definitions | ''data parallel'']] model, another commonly recognized parallel programming model covered in other treatments like [[#References | Foster (1995)]] and [[#References | Culler (1999)]]. Whereas the shared memory and message passing models are often present as competing models, the data parallel model addresses fundamentally different programming concerns and can therefore be used in conjunction with either. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References | Solihin (2008)]]. The [[#Definitions | ''task parallel'']] model will also be introduced briefly as a point of contrast.<br /> <br /> =Overview=<br /> <br /> Whereas the shared memory and message passing models focus on how parallel tasks access common data, the [[#Definitions | ''data parallel'']] model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data. An example of a data parallel code can be seen in Code 2.5 from [[#References | Solihin (2008)]] which is reproduced below. It has been annotated with comments identifying the region of the code which is data parallel.<br /> <br /> // Data parallel code, adapted from [[#References|Solihin (2008), p. 
27.]]<br /> <br /> id = getmyid(); // Assume id = 0 for thread 0, id = 1 for thread 1<br /> local_iter = 4;<br /> start_iter = id * local_iter;<br /> end_iter = start_iter + local_iter;<br /> <br /> if (id == 0)<br /> send_msg(P1, b[4..7], c[4..7]);<br /> else<br /> recv_msg(P0, b[4..7], c[4..7]);<br /> <br /> // Begin data parallel section<br /> <br /> for (i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum = 0;<br /> for (i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum = local_sum + a[i];<br /> <br /> // End data parallel section<br /> <br /> if (id == 0)<br /> {<br /> recv_msg(P1, &local_sum1);<br /> sum = local_sum + local_sum1;<br /> Print sum;<br /> }<br /> else<br /> send_msg(P0, local_sum);<br /> <br /> In the code above, the three 8 element arrays are each divided into two 4 element chunks. In the data parallel section, the code executed by the two threads is identical, but each thread operates on a different chunk of data.<br /> <br /> [[#References | Hillis (1986)]] points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks. [[#References | Haveraaen (2000)]] also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. Comparison of the data parallel section of code identified above with the sequential Code 2.3 of [[#References | Solihin (2008)]], which is reproduced below, supports this assertion. The only differences between the two codes are the start and end indices and that, in the data parallel example, the variable sum is replaced by a private variable. Structurally the two codes are identical.<br /> <br /> // Sequential code, from [[#References|Solihin (2008), p. 
25.]]<br /> <br /> for (i = 0; i < 8; i++)<br /> a[i] = b[i] + c[i];<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> Print sum;<br /> <br /> The logical opposite of data parallel is [[#Definitions | ''task parallel,'']] in which a number of distinct tasks operate on common data. An example of task parallel code that is functionally equivalent to the sequential and data parallel codes given above follows.<br /> <br /> // Task parallel code.<br /> <br /> int id = getmyid(); // assume id = 0 for thread 0, id = 1 for thread 1<br /> <br /> if (id == 0)<br /> {<br /> for (i = 0; i < 8; i++)<br /> {<br /> a[i] = b[i] + c[i];<br /> send_msg(P1, a[i]);<br /> }<br /> }<br /> else<br /> {<br /> sum = 0;<br /> for (i = 0; i < 8; i++)<br /> {<br /> recv_msg(P0, a[i]);<br /> if (a[i] > 0)<br /> sum = sum + a[i];<br /> }<br /> Print sum;<br /> }<br /> <br /> In the code above, work is divided into two parallel tasks. The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.'' The other sums the elements of ''a.'' These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed by each thread is different (rather than identical).<br /> <br /> Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated. This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements. In addition, unique tasks are likely to have significantly different run times, making it more challenging to balance load across processors. [[#References | Haveraaen (2000)]] also notes that task parallel algorithms are inherently more complex, requiring a greater degree of communication and synchronization. 
In the task parallel code above, after thread 0 computes an element of ''a'' it must send it to thread 1. To support this, sends and receives occur every iteration of the two loops, resulting in a total of 8 messages being sent between the threads. In contrast, the data parallel code sends only 2 messages, one at the beginning and one at the end. The table below summarizes the key differences between data parallel and task parallel programming models.<br /> <br /> {| class="wikitable" border="1" align="center"<br /> |+ '''Comparison between data parallel and task parallel programming models.'''<br /> |-<br /> ! Aspects<br /> ! Data Parallel<br /> ! Task Parallel<br /> |-<br /> | Decomposition<br /> | Partition data into subsets<br /> | Partition program into subtasks<br /> |-<br /> | Parallel tasks<br /> | Identical<br /> | Unique<br /> |-<br /> | Degree of parallelism<br /> | Scales easily<br /> | Fixed<br /> |-<br /> | Load balancing<br /> | Easier<br /> | Harder<br /> |-<br /> | Communication overhead<br /> | Lower<br /> | Higher<br /> |}<br /> <br /> =History of Parallel Programming Models=<br /> <br /> ==Vector Machines==<br /> <br /> First appearing in the 1970s, vector machines were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific computing and in multimedia.<br /> <br /> The Solomon project at Westinghouse was one of the first machines to use vector operations. Its CPU had a large number of ALUs that would each be fed different data each cycle. Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.<br /> <br /> An innovation came with the Cray-1 supercomputer in 1976. Its designers recognized that large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. 
In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray-1 instead kept vectors in registers (eight vector registers of 64 elements each), so several instructions could be completed on the same data before continuing, which decreased the total number of memory accesses. The Cray-1 could perform at 240 MFLOPS.<br /> <br /> ==References for this section==<br /> *Wikipedia, Vector processor http://en.wikipedia.org/w/index.php?title=Vector_processor&oldid=405209552<br /> *Wikipedia, Cray-1 http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730<br /> <br /> =Comparing the Data Parallel Model with the Shared Memory and Message Passing Models=<br /> <br /> Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References | Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.<br /> <br /> As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would need to exchange 8 signals (instead of 8 messages). 
In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.<br /> <br /> Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. [[#Definitions | ''SIMD (single-instruction-multiple-data)'']] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by NVIDIA and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix: C for CUDA Example Code | Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model, like the message passing model, does not require hardware support.<br /> <br /> Since data parallel code tends to simplify communication and synchronization, it may be easier to develop than task parallel code. However, data parallel code also requires writing code to split program data into chunks and assign them to different threads. In addition, a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.<br /> <br /> Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. 
The regular data chunks also make it easier to reason about where to locate data and how to organize it.<br /> <br /> =Definitions=<br /> <br /> * ''Data parallel.'' A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.<br /> * ''Task parallel.'' A task parallel algorithm is composed of a set of differing tasks which operate on common data.<br /> * ''SIMD (single-instruction-multiple-data).'' A processor which executes a single instruction simultaneously on multiple data locations.<br /> <br /> =References=<br /> <br /> * David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.<br /> * Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.<br /> * Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 "Machine and collection abstractions for user-implemented data-parallel programming,"] ''Scientific Programming,'' 8(4):231-246, 2000.<br /> * W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 "Data parallel algorithms,"] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.<br /> * Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 "A comparison of message passing and shared memory architectures for data parallel programs,"] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.<br /> * Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.<br /> <br /> =Appendix: C for CUDA Example Code=<br /> <br /> The following code is a data parallel implementation of the sequential Code 2.3 from [[#References | Solihin (2008)]] using [http://www.nvidia.com/object/cuda_learn.html C for CUDA]. 
It is presented to give an impression of programming for a SIMD architecture, but a detailed discussion is beyond the scope of this supplement. Ignoring memory allocation issues, the code is very similar to the data parallel example, Code 2.5 from [[#References | Solihin (2008)]], discussed earlier. The main difference is the presence of a control thread on the host that launches the parallel tasks on the CUDA device.<br /> <br /> // Data parallel implementation of the example code using C for CUDA.<br /> <br /> #include <iostream><br /> <br /> // Kernel: each CUDA thread processes its own 4-element chunk.<br /> __global__ void kernel(float* a, float* b, float* c, float* local_sum)<br /> {<br /> int id = threadIdx.x;<br /> int local_iter = 4;<br /> int start_iter = id * local_iter;<br /> int end_iter = start_iter + local_iter;<br /> <br /> // Begin data parallel section<br /> <br /> for (int i = start_iter; i < end_iter; i++)<br /> a[i] = b[i] + c[i];<br /> local_sum[id] = 0;<br /> for (int i = start_iter; i < end_iter; i++)<br /> if (a[i] > 0)<br /> local_sum[id] = local_sum[id] + a[i];<br /> <br /> // End data parallel section<br /> }<br /> <br /> // Host (control) code<br /> int main()<br /> {<br /> // h_b and h_c are assumed to hold the input data; their initialization is omitted here.<br /> float h_a[8], h_b[8], h_c[8], h_sum[2];<br /> float *d_a, *d_b, *d_c, *d_sum;<br /> float sum;<br /> <br /> size_t size = 8 * sizeof(float);<br /> size_t size2 = 2 * sizeof(float);<br /> <br /> // Allocate device memory<br /> cudaMalloc((void**)&d_a, size);<br /> cudaMalloc((void**)&d_b, size);<br /> cudaMalloc((void**)&d_c, size);<br /> cudaMalloc((void**)&d_sum, size2);<br /> <br /> // Copy the inputs to the device<br /> cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);<br /> cudaMemcpy(d_c, h_c, size, cudaMemcpyHostToDevice);<br /> <br /> // Launch one block of two threads<br /> kernel<<<1, 2>>>(d_a, d_b, d_c, d_sum);<br /> <br /> // Copy the results back; these copies wait for the kernel to finish<br /> cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);<br /> cudaMemcpy(h_sum, d_sum, size2, cudaMemcpyDeviceToHost);<br /> <br /> // Combine the two local sums on the host<br /> sum = h_sum[0] + h_sum[1];<br /> std::cout << sum << std::endl;<br /> <br /> cudaFree(d_a);<br /> cudaFree(d_b);<br /> cudaFree(d_c);<br /> cudaFree(d_sum);<br /> }</div>
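The comparison section notes that a shared memory version of the data parallel code needs only a single barrier before the local sums are combined. The following is a minimal sketch of that idea in C, assuming OpenMP is used to create the threads (OpenMP is not used in the original examples). The names `data_parallel_sum` and `NUM_CHUNKS` are hypothetical and introduced here only for illustration. The implicit barrier at the end of the `parallel for` region plays the role of the single barrier.

```c
#include <assert.h>

/* Hypothetical: one chunk per thread, as in the two-thread examples. */
#define NUM_CHUNKS 2

/* Computes a[i] = b[i] + c[i] for all i and returns the sum of the
 * positive elements of a. Each chunk is handled by one iteration of the
 * outer loop, which OpenMP may run on a separate thread. */
float data_parallel_sum(float *a, const float *b, const float *c, int n)
{
    assert(n % NUM_CHUNKS == 0);  /* chunks must be equally sized */

    float local_sum[NUM_CHUNKS];
    int local_iter = n / NUM_CHUNKS;

    /* Begin data parallel section: identical code, different chunks. */
    #pragma omp parallel for
    for (int id = 0; id < NUM_CHUNKS; id++) {
        int start_iter = id * local_iter;
        int end_iter = start_iter + local_iter;
        float s = 0.0f;

        for (int i = start_iter; i < end_iter; i++) {
            a[i] = b[i] + c[i];
            if (a[i] > 0)
                s = s + a[i];
        }
        local_sum[id] = s;  /* each thread writes only its own slot */
    }
    /* End data parallel section. The parallel for ends with an implicit
     * barrier, so every local_sum[id] is complete at this point. */

    float sum = 0.0f;
    for (int id = 0; id < NUM_CHUNKS; id++)
        sum = sum + local_sum[id];
    return sum;
}
```

Called with 8-element arrays as in the earlier examples, `data_parallel_sum(a, b, c, 8)` fills `a` and returns the sum of its positive elements. If the compiler is invoked without OpenMP support, the pragma is ignored and the chunks run one after another, which gives the same result.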

Cslingaf https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&diff=43565 ECE506 Main Page 2011-01-31T04:43:09Z

<p>Cslingaf: </p> <hr /> <div>This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.<br /> <br /> =Supplements to Solihin Text=<br /> <br /> Post links to the textbook supplements in this section.<br /> <br /> *Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]<br /> [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]</div>

Cslingaf