CSC/ECE 506 Spring 2014/8a at - Revision history

Admin: put spaces before parens that needed them

2014-04-07T21:13:56Z

put spaces before parens that needed them

← Older revision		Revision as of 21:13, 7 April 2014
Line 23:		Line 23:

	==Implementation Complexities==		==Implementation Complexities==
	The Synapse Expansion Bus includes an ownership level protocol between processor caches. The caches employ a non-write-through algorithm to minimize the bandwidth between cache and shared memory and to reduce memory contention. This protocol does not require a great deal of hardware complexity. However, because an extra bit is added to the main memory to indicate whether a cache has an exclusive(Dirty) copy of the block, this bit must be implemented correctly to prevent the protocol from malfunctioning.<ref>[http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors: a new approach to solve old problems]</ref>		The Synapse Expansion Bus includes an ownership level protocol between processor caches. The caches employ a non-write-through algorithm to minimize the bandwidth between cache and shared memory and to reduce memory contention. This protocol does not require a great deal of hardware complexity. However, because an extra bit is added to the main memory to indicate whether a cache has an exclusive (Dirty) copy of the block, this bit must be implemented correctly to prevent the protocol from malfunctioning.<ref>[http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors: a new approach to solve old problems]</ref>
	<br>		<br>
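The role of the extra main-memory bit described above can be illustrated with a small sketch. The class and method names here (`SynapseMemory`, `grant_ownership`, `write_back`) are hypothetical and not taken from the Synapse paper; the model only assumes that memory must refuse to supply a block while some cache holds it dirty.

```python
class SynapseMemory:
    """Toy model of main memory with one extra 'dirty elsewhere' bit per block.

    Hypothetical sketch: names and behavior are assumptions for illustration.
    """

    def __init__(self):
        self.data = {}             # address -> value stored in main memory
        self.dirty_elsewhere = {}  # address -> True if a cache owns a dirty copy

    def grant_ownership(self, addr):
        # A cache takes an exclusive (Dirty) copy: set the extra bit.
        self.dirty_elsewhere[addr] = True

    def write_back(self, addr, value):
        # The owning cache returns the block: update memory and clear the bit.
        self.data[addr] = value
        self.dirty_elsewhere[addr] = False

    def read(self, addr):
        # Memory may only answer when no cache holds a dirty copy;
        # None means "memory cannot supply this block right now".
        if self.dirty_elsewhere.get(addr, False):
            return None
        return self.data.get(addr, 0)


mem = SynapseMemory()
mem.write_back(0x100, 5)      # memory holds 5, bit is clear
mem.grant_ownership(0x100)    # some cache now owns the block
stale_guard = mem.read(0x100) # memory declines to supply possibly stale data
mem.write_back(0x100, 9)      # owner writes back the new value
fresh = mem.read(0x100)
```

If the bit were ever left clear while a cache held a dirty copy, `read` would return the stale value from memory, which is the kind of malfunction the paragraph warns about.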

Line 30:		Line 30:

	==SGI 4D MP==		==SGI 4D MP==
	[[File:MSI.jpg|612px|thumbnail|Architecture of SGI 4D MP]]The '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol was first used in the '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' IRIS 4D series. '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' produced a broad range of '''[http://en.wikipedia.org/wiki/MIPS_architecture MIPS]'''-based (Microprocessor without Interlocked Pipeline Stages) workstations and servers during the 1990s, running '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''''s version of UNIX System V, now called '''[http://en.wikipedia.org/wiki/IRIX IRIX]'''. The 4D-MP brought 40 MIPS(million instructions per second) of computing performance to a graphics superworkstation.		[[File:MSI.jpg|612px|thumbnail|Architecture of SGI 4D MP]]The '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol was first used in the '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' IRIS 4D series. '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' produced a broad range of '''[http://en.wikipedia.org/wiki/MIPS_architecture MIPS]'''-based (Microprocessor without Interlocked Pipeline Stages) workstations and servers during the 1990s, running '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''''s version of UNIX System V, now called '''[http://en.wikipedia.org/wiki/IRIX IRIX]'''. The 4D-MP brought 40 MIPS (million instructions per second) of computing performance to a graphics superworkstation.

	Such a high degree of computing and graphics processing was made possible by an intelligent Computing System Architecture. The sync bus provides the required synchronization among the main processors(4 in this system) of the system. Processor buses provide full speed access to the L1 instruction and data caches. Each of the L1 caches is 64 KB in size, providing 512 KB of L1 cache in total. The L2 Cache accounts for another 512 KB memory of the system and is made up of four individual 64 KB caches. One important highlight of the 4D-MP is that the memory hierarchy is inclusive, i.e., the L1 cache is a subset of the L2 cache. The multiprocessor(MP) bus used in the 4D-MP graphics superworkstation is a pipelined, block transfer bus that supports the cache coherence protocol and provides 64 megabytes per second of sustained data bandwidth between the processors, the memory and I/O system, and the graphics subsystem.		Such a high degree of computing and graphics processing was made possible by an intelligent Computing System Architecture. The sync bus provides the required synchronization among the main processors (4 in this system) of the system. Processor buses provide full speed access to the L1 instruction and data caches. Each of the L1 caches is 64 KB in size, providing 512 KB of L1 cache in total. The L2 Cache accounts for another 512 KB memory of the system and is made up of four individual 64 KB caches. One important highlight of the 4D-MP is that the memory hierarchy is inclusive, i.e., the L1 cache is a subset of the L2 cache. The multiprocessor (MP) bus used in the 4D-MP graphics superworkstation is a pipelined, block transfer bus that supports the cache coherence protocol and provides 64 megabytes per second of sustained data bandwidth between the processors, the memory and I/O system, and the graphics subsystem.

	Every transaction on this MP bus is monitored by the L2 cache. The state for each cache line is maintained by it. It checks if the transactions involve data in its storage through a tag-matching mechanism and changes the state of the cache lines accordingly (States will be M/E/S/I depending on the requests). Write propagation is via the invalidation operation and Write serialization is via the MP bus. Consistency is guaranteed due to the Inclusion property of the Memory Hierarchy system.		Every transaction on this MP bus is monitored by the L2 cache. The state for each cache line is maintained by it. It checks if the transactions involve data in its storage through a tag-matching mechanism and changes the state of the cache lines accordingly (States will be M/E/S/I depending on the requests). Write propagation is via the invalidation operation and Write serialization is via the MP bus. Consistency is guaranteed due to the Inclusion property of the Memory Hierarchy system.
Line 39:		Line 39:

	==Implementation complexities==		==Implementation complexities==
	In the MSI system, an explicit upgrade message is required for a read followed by a write, even if there are no other sharers. When a processor reads in and modifies a data item, two bus transactions are generated in this protocol even in the absence of sharers. The first is a BusRd that gets the memory block in S state, and the second is a BusRdX(or BusUpgr) that converts the block from S to M state. In this protocol, the complexity of the mechanism that determines the exclusiveness of the block is an aspect that needs attention. Also, in snoop-based cache-coherence protocols, the overall set of actions for memory operations is not atomic. This could lead to race conditions, and the issues of deadlock, serialization, etc. make it harder to implement.<br>		In the MSI system, an explicit upgrade message is required for a read followed by a write, even if there are no other sharers. When a processor reads in and modifies a data item, two bus transactions are generated in this protocol even in the absence of sharers. The first is a BusRd that gets the memory block in S state, and the second is a BusRdX (or BusUpgr) that converts the block from S to M state. In this protocol, the complexity of the mechanism that determines the exclusiveness of the block is an aspect that needs attention. Also, in snoop-based cache-coherence protocols, the overall set of actions for memory operations is not atomic. This could lead to race conditions, and the issues of deadlock, serialization, etc. make it harder to implement.<br>
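The two-transaction cost of a read followed by a write can be traced with a minimal sketch. It models a single cache and a single block, assumes the BusUpgr variant is used for the S to M upgrade, and uses hypothetical names (`MSICache`, `read`, `write`, `read_then_write`) chosen only for illustration.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    S = "Shared"
    I = "Invalid"


class MSICache:
    """One cache's view of a single memory block under MSI."""

    def __init__(self):
        self.state = State.I


def read(cache, bus_log):
    # Read miss: MSI has no Exclusive state, so the block always arrives in S.
    if cache.state is State.I:
        bus_log.append("BusRd")
        cache.state = State.S
    # Reads in S or M hit locally and generate no bus traffic.


def write(cache, bus_log):
    if cache.state is State.I:
        # Write miss: fetch the block with intent to modify.
        bus_log.append("BusRdX")
        cache.state = State.M
    elif cache.state is State.S:
        # Upgrade is posted on the bus even when no other cache shares the block.
        bus_log.append("BusUpgr")
        cache.state = State.M
    # A write in M is a hit and needs no bus transaction.


def read_then_write():
    cache, bus_log = MSICache(), []
    read(cache, bus_log)
    write(cache, bus_log)
    return bus_log, cache.state


log, final_state = read_then_write()
print(log, final_state.name)
```

The second entry in the log is the transaction that the Exclusive state of MESI is designed to avoid when there are no other sharers.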

	= MESI Protocol=		= MESI Protocol=

Tthampy: /* Introduction */

2014-03-30T19:33:32Z

Introduction

← Older revision		Revision as of 19:33, 30 March 2014
Line 6:		Line 6:
	[[Image:Busbased SMP.jpg|frame|center|<b>Figure 1:</b> Typical Bus-Based Processor Model]]		[[Image:Busbased SMP.jpg|frame|center|<b>Figure 1:</b> Typical Bus-Based Processor Model]]
	<br>		<br>
	If each processor has a cache that reflects the state of various parts of memory, it is possible that two or more caches may have copies of the same line. It is also possible that a given line may contain more than one lockable data item. If two threads make appropriately serialized changes to those data items, the result could be that both caches end up with different, incorrect versions of the line of memory. In other words, the system's state is no longer coherent because the system contains two different versions of what is supposed to be the content of a specific area of memory. Various protocols have been devised to address the cache coherence problem, such as MSI, MESI, MOESI, [http://www.enotes.com/topic/MERSI_protocol MERSI], MESIF, Synapse, Berkeley, [https://parasol.tamu.edu/~rwerger/Courses/654/cachecoherence1.pdf Firefly] and [http://www.cs.utah.edu/~rajeev/cs7820/pres/08-7820-03.pdf Dragon]. This wiki article discusses MSI, MESI, MESIF, MOESI and Synapse protocol implementations on real architectures.<ref>[http://www.windowsnetworking.com/articles_tutorials/cache-coherency.html Cache Coherence]</ref>		If each processor has a cache that reflects the state of various parts of memory, it is possible that two or more caches may have copies of the same line. It is also possible that a given line may contain more than one lockable data item. If two threads make appropriately serialized changes to those data items, the result could be that both caches end up with different, incorrect versions of the line of memory. In other words, the system's state is no longer coherent because the system contains two different versions of what is supposed to be the content of a specific area of memory. 
Various protocols have been devised to address the cache coherence problem, such as MSI, MESI, MOESI, [http://www.enotes.com/topic/MERSI_protocol MERSI], MESIF, Synapse, [http://ctho.org/toread/forclass/18-742/3/p273-archibald.pdf Berkeley], [https://parasol.tamu.edu/~rwerger/Courses/654/cachecoherence1.pdf Firefly] and [http://www.cs.utah.edu/~rajeev/cs7820/pres/08-7820-03.pdf Dragon]. This wiki article discusses MSI, MESI, MESIF, MOESI and Synapse protocol implementations on real architectures.<ref>[http://www.windowsnetworking.com/articles_tutorials/cache-coherency.html Cache Coherence]</ref>
	<br>		<br>
	<br>		<br>

Tthampy: /* Implementation complexities */

2014-03-30T19:29:59Z

Implementation complexities

← Older revision		Revision as of 19:29, 30 March 2014
Line 39:		Line 39:

	==Implementation complexities==		==Implementation complexities==
	In the MSI system, an explicit upgrade message is required for a read followed by a write, even if there are no other sharers. When a processor reads in and modifies a data item, two bus transactions are generated in this protocol even in the absence of sharers. The first is a BusRd that gets the memory block in S state, and the second is a BusRdX(or BusUpgr) that converts the block from S to M state. In this protocol, the complexity of the mechanism that determines the exclusiveness of the block is an aspect that needs attention. Also, in snoop-based cache-coherence protocols, the overall set of actions for memory operations is not atomic. This could lead to race conditions, and the issues of deadlock, serialization, etc. make it harder to implement.~~<ref>[http://www.cse.ohio-state.edu/~panda/875/class_slides/C6_2.pdf Snoop-based Multiprocessor Design]</ref>~~<br>		In the MSI system, an explicit upgrade message is required for a read followed by a write, even if there are no other sharers. When a processor reads in and modifies a data item, two bus transactions are generated in this protocol even in the absence of sharers. The first is a BusRd that gets the memory block in S state, and the second is a BusRdX(or BusUpgr) that converts the block from S to M state. In this protocol, the complexity of the mechanism that determines the exclusiveness of the block is an aspect that needs attention. Also, in snoop-based cache-coherence protocols, the overall set of actions for memory operations is not atomic. This could lead to race conditions, and the issues of deadlock, serialization, etc. make it harder to implement.<br>

	= MESI Protocol=		= MESI Protocol=

Tthampy: /* MESIF in Intel Nehalem Computer */

2014-03-30T19:26:44Z

MESIF in Intel Nehalem Computer

← Older revision		Revision as of 19:26, 30 March 2014
Line 112:		Line 112:

	== MESIF in Intel Nehalem Computer==		== MESIF in Intel Nehalem Computer==
	[http://rolfed.com/nehalem/nehalemPaper.pdf Intel Nehalem Computer] uses the MESIF protocol. In the Nehalem architecture each core has its own L1 and L2 cache. Nehalem ~~does~~ has a shared cache, implemented as L3 cache. This cache is shared among all cores and is relatively large. This cache is inclusive, meaning that it duplicates all data stored in each individual L1 and L2 cache. This duplication greatly improves inter-core communication efficiency because any given core does not have to locate data in another processor’s cache. If the requested data is not found in the L3 cache, the core knows the data is also not present in any other core’s cache. To ensure coherency across all caches, the L3 cache has additional flags that keep track of which core the data came from. If the data is modified in L3 cache, then the L3 cache knows whether the data came from a different core than last time, and that the data in the first core needs its L1/L2 values updated with the new data. This greatly reduces the amount of traditional “snooping” coherency traffic between cores.<ref>[http://www.cs.uwaterloo.ca/~brecht/courses/856/Possible-Readings/multicore/cache-performance-x86-2009.pdf Comparing Cache Organization and Memory Management of the Intel Nehalem Computer Architecture]</ref>		[http://rolfed.com/nehalem/nehalemPaper.pdf Intel Nehalem Computer] uses the MESIF protocol. In the Nehalem architecture each core has its own L1 and L2 cache. Nehalem also has a shared cache, implemented as L3 cache. This cache is shared among all cores and is relatively large. This cache is inclusive, meaning that it duplicates all data stored in each individual L1 and L2 cache. This duplication greatly improves inter-core communication efficiency because any given core does not have to locate data in another processor’s cache. If the requested data is not found in the L3 cache, the core knows the data is also not present in any other core’s cache. 
To ensure coherency across all caches, the L3 cache has additional flags that keep track of which core the data came from. If the data is modified in L3 cache, then the L3 cache knows whether the data came from a different core than last time, and that the data in the first core needs its L1/L2 values updated with the new data. This greatly reduces the amount of traditional “snooping” coherency traffic between cores.<ref>[http://www.cs.uwaterloo.ca/~brecht/courses/856/Possible-Readings/multicore/cache-performance-x86-2009.pdf Comparing Cache Organization and Memory Management of the Intel Nehalem Computer Architecture]</ref>
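The way an inclusive shared cache with per-core flags cuts down snoop traffic can be sketched roughly as follows. This is a simplified model, not Intel's implementation: the class `InclusiveL3` and its methods are hypothetical, and the per-core flags are represented as a set of core IDs per line.

```python
class InclusiveL3:
    """Toy inclusive last-level cache that tracks which cores may hold a line.

    Hypothetical sketch: the per-core flags are modeled as a set of core IDs.
    """

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.lines = {}  # address -> set of core IDs flagged for that line

    def fill(self, addr, core):
        # Record that `core` brought `addr` into its private L1/L2.
        self.lines.setdefault(addr, set()).add(core)

    def cores_to_snoop(self, addr, requester):
        # Inclusion: a line absent from L3 cannot be in any private cache,
        # so an L3 miss requires no snoops at all.
        if addr not in self.lines:
            return set()
        # On an L3 hit, only the flagged cores (other than the requester)
        # need to be probed, instead of broadcasting to every core.
        return self.lines[addr] - {requester}


l3 = InclusiveL3(num_cores=4)
l3.fill(0x40, core=1)
miss_targets = l3.cores_to_snoop(0x80, requester=0)  # not in L3: no snoops
hit_targets = l3.cores_to_snoop(0x40, requester=0)   # only core 1 is probed
```

Without the flags, a request that hits in L3 would have to probe all three other cores; with them, the probe set shrinks to the cores that actually loaded the line.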
	<br><br>		<br><br>
	The cache organization of an 8-core Intel Nehalem Processor is shown below:<br>		The cache organization of an 8-core Intel Nehalem Processor is shown below:<br>

Tthampy: /* AM486 */

2014-03-30T19:26:09Z

AM486

← Older revision		Revision as of 19:26, 30 March 2014
Line 49:		Line 49:
	The AM486 processor implements a 32-bit architecture, encompassing the complete 486 microprocessor instruction set with several extensions. The AM486 also uses a modified MESI cache coherence protocol with write-back and write-through and read-allocation. Caches in the AM486 follow the Pseudo-LRU block replacement policy.		The AM486 processor implements a 32-bit architecture, encompassing the complete 486 microprocessor instruction set with several extensions. The AM486 also uses a modified MESI cache coherence protocol with write-back and write-through and read-allocation. Caches in the AM486 follow the Pseudo-LRU block replacement policy.

	The AM486 introduces the concept of multi-master environment, allowing it reduce unnecessary bus traffic through dynamic identification of shared blocks. This multi-master environment with the MESI cache coherence model allows the system to appear as a single unified memory structure, facilitating even programs written without cache support.		The AM486 introduces the concept of multi-master environment, allowing it to reduce unnecessary bus traffic through dynamic identification of shared blocks. This multi-master environment with the MESI cache coherence model allows the system to appear as a single unified memory structure, facilitating even programs written without cache support.

	The modified MESI protocol in AM486 differs in one respect from traditional MESI. When a block is hit, either by the processor that currently has the block, or by an external master (another processor), main memory is updated. However, if a modified block is requested by a processor other than the one holding it, the request is first cancelled and then the modified block is flushed to main memory. The requester must then resend a request, this time to main memory for the block in question.<ref>[http://www.ece.ufrgs.br/~fetter/eng04476/datasheets/Am486.pdf AM486 Datasheet]</ref>		The modified MESI protocol in AM486 differs in one respect from traditional MESI. When a block is hit, either by the processor that currently has the block, or by an external master (another processor), main memory is updated. However, if a modified block is requested by a processor other than the one holding it, the request is first cancelled and then the modified block is flushed to main memory. The requester must then resend a request, this time to main memory for the block in question.<ref>[http://www.ece.ufrgs.br/~fetter/eng04476/datasheets/Am486.pdf AM486 Datasheet]</ref>
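The cancel, flush, and retry sequence described above might be modeled as in the sketch below. The function `external_read` and the dictionary layout are hypothetical, and the state the owner's line takes after the flush (Shared here) is an assumption, since the paragraph does not specify it.

```python
def external_read(addr, owner_cache, memory):
    """Model another bus master reading `addr`; return (data, attempts).

    Hypothetical sketch: `owner_cache` maps address -> {"state", "data"}.
    """
    attempts = 0
    while True:
        attempts += 1
        line = owner_cache.get(addr)
        if line is not None and line["state"] == "M":
            # The request is cancelled and the owner flushes the modified
            # block to main memory before anything is returned.
            memory[addr] = line["data"]
            line["state"] = "S"  # assumption: owner keeps a clean shared copy
            continue             # the requester resends its request
        # No modified copy exists, so main memory supplies the block.
        return memory[addr], attempts


memory = {0x10: 1}
owner = {0x10: {"state": "M", "data": 7}}
data, attempts = external_read(0x10, owner, memory)
print(data, attempts)
```

The extra attempt is the cost of this variant: the requester never receives data directly from the owning cache, only from main memory after the write-back.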

Tthampy at 17:41, 24 March 2014

2014-03-24T17:41:17Z

Tthampy at 20:27, 22 March 2014

2014-03-22T20:27:19Z

← Older revision		Revision as of 20:27, 22 March 2014
Line 1:		Line 1:
	Wiki Writeup: [https://docs.google.com/a/ncsu.edu/document/d/1sstpcoUFmbwGkCGfVlp5P2fsPAzjPKhFUtOJcrE6XJM/edit https://docs.google.com/a/ncsu.edu/document/d/1sstpcoUFmbwGkCGfVlp5P2fsPAzjPKhFUtOJcrE6XJM/edit]		Wiki Writeup: [https://docs.google.com/a/ncsu.edu/document/d/1sstpcoUFmbwGkCGfVlp5P2fsPAzjPKhFUtOJcrE6XJM/edit https://docs.google.com/a/ncsu.edu/document/d/1sstpcoUFmbwGkCGfVlp5P2fsPAzjPKhFUtOJcrE6XJM/edit]

	Parent Wiki:[http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/8a_an http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/8a_an]		Parent Wiki:[http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/8a_an http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/8a_an]
	= Introduction =		= Introduction =

Tthampy at 20:27, 22 March 2014

2014-03-22T20:27:06Z

← Older revision		Revision as of 20:27, 22 March 2014
Line 1:		Line 1:
	[https://docs.google.com/a/ncsu.edu/document/d/1sstpcoUFmbwGkCGfVlp5P2fsPAzjPKhFUtOJcrE6XJM/edit https://docs.google.com/a/ncsu.edu/document/d/1sstpcoUFmbwGkCGfVlp5P2fsPAzjPKhFUtOJcrE6XJM/edit]		Wiki Writeup: [https://docs.google.com/a/ncsu.edu/document/d/1sstpcoUFmbwGkCGfVlp5P2fsPAzjPKhFUtOJcrE6XJM/edit https://docs.google.com/a/ncsu.edu/document/d/1sstpcoUFmbwGkCGfVlp5P2fsPAzjPKhFUtOJcrE6XJM/edit]
			Parent Wiki:[http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/8a_an http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2013/8a_an]
	= Introduction =		= Introduction =
	Symmetric multiprocessing ([http://searchdatacenter.techtarget.com/definition/SMP SMP]) involves a multiprocessor computer hardware architecture where two or more identical processors are connected to a single shared main memory via system bus. An SMP provides symmetric access to all of main memory from any processor and is the building block for larger parallel systems.		Symmetric multiprocessing ([http://searchdatacenter.techtarget.com/definition/SMP SMP]) involves a multiprocessor computer hardware architecture where two or more identical processors are connected to a single shared main memory via system bus. An SMP provides symmetric access to all of main memory from any processor and is the building block for larger parallel systems.

Tthampy at 20:25, 22 March 2014

2014-03-22T20:25:08Z

← Older revision		Revision as of 20:25, 22 March 2014
Line 1:		Line 1:
	[https://docs.google.com/a/ncsu.edu/document/d/1sstpcoUFmbwGkCGfVlp5P2fsPAzjPKhFUtOJcrE6XJM/edit]		[https://docs.google.com/a/ncsu.edu/document/d/1sstpcoUFmbwGkCGfVlp5P2fsPAzjPKhFUtOJcrE6XJM/edit https://docs.google.com/a/ncsu.edu/document/d/1sstpcoUFmbwGkCGfVlp5P2fsPAzjPKhFUtOJcrE6XJM/edit]
	= Introduction =		= Introduction =
	Symmetric multiprocessing ([http://searchdatacenter.techtarget.com/definition/SMP SMP]) involves a multiprocessor computer hardware architecture where two or more identical processors are connected to a single shared main memory via system bus. An SMP provides symmetric access to all of main memory from any processor and is the building block for larger parallel systems.		Symmetric multiprocessing ([http://searchdatacenter.techtarget.com/definition/SMP SMP]) involves a multiprocessor computer hardware architecture where two or more identical processors are connected to a single shared main memory via system bus. An SMP provides symmetric access to all of main memory from any processor and is the building block for larger parallel systems.

Tthampy at 20:24, 22 March 2014

2014-03-22T20:24:38Z

← Older revision		Revision as of 20:24, 22 March 2014
Line 1:		Line 1:
			[https://docs.google.com/a/ncsu.edu/document/d/1sstpcoUFmbwGkCGfVlp5P2fsPAzjPKhFUtOJcrE6XJM/edit]
	= Introduction =		= Introduction =
	Symmetric multiprocessing ([http://searchdatacenter.techtarget.com/definition/SMP SMP]) involves a multiprocessor computer hardware architecture where two or more identical processors are connected to a single shared main memory via system bus. An SMP provides symmetric access to all of main memory from any processor and is the building block for larger parallel systems.		Symmetric multiprocessing ([http://searchdatacenter.techtarget.com/definition/SMP SMP]) involves a multiprocessor computer hardware architecture where two or more identical processors are connected to a single shared main memory via system bus. An SMP provides symmetric access to all of main memory from any processor and is the building block for larger parallel systems.