<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Pmpatel</id>
	<title>Expertiza_Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Pmpatel"/>
	<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Special:Contributions/Pmpatel"/>
	<updated>2026-05-14T09:22:24Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch7_jp&amp;diff=44592</id>
		<title>CSC/ECE 506 Spring 2011/ch7 jp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch7_jp&amp;diff=44592"/>
		<updated>2011-03-22T02:42:41Z</updated>

		<summary type="html">&lt;p&gt;Pmpatel: /* Introduction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction=&lt;br /&gt;
&lt;br /&gt;
Shared memory architecture is one of the two major classes of large-scale computer architecture, the other being message passing.  In a shared memory architecture, the physical memory of all the processors is mapped into a single address space.  Every processor can potentially access all the memory in the system, although access times may differ (e.g., NUMA).  Different threads or processes communicate by reading and writing shared memory.&lt;br /&gt;
&lt;br /&gt;
==Advantages of shared memory system==&lt;br /&gt;
&lt;br /&gt;
* Shared memory parallel programs and multi-threaded programs run on a shared memory system without modification.&lt;br /&gt;
* A single OS runs on the whole system, which simplifies maintenance, memory management, and scheduling.&lt;br /&gt;
* The total amount of memory is large, since it is simply the sum of the memory of the individual nodes.&lt;br /&gt;
* Communicating data between caches is faster than communication in the message passing model.&lt;br /&gt;
* Sharing can occur at a finer granularity than in message passing, which adds the overhead of creating and sending a message for each piece of information.&lt;br /&gt;
* Common data can be shared between threads running on different processors (e.g., shared data that is read-only).&lt;br /&gt;
&lt;br /&gt;
==Disadvantages of shared memory system==&lt;br /&gt;
&lt;br /&gt;
* The cost of providing shared memory grows super-linearly with the number of processors, whereas the cost of a message passing system grows linearly.&lt;br /&gt;
&lt;br /&gt;
==Hardware Support==&lt;br /&gt;
Unlike the message passing model, which can rely on software to send and receive messages, a shared memory architecture needs hardware support.  Three mechanisms are necessary for correct execution of a shared memory parallel program on a multiprocessor system: &lt;br /&gt;
# Cache coherence protocol&lt;br /&gt;
# Memory consistency model&lt;br /&gt;
# Hardware synchronization&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
In a single-processor (single-core) system, maintaining cache coherence is simple, but in a multiprocessor system it is not.  Data can be present in any processor's cache, and the coherence protocol needs to ensure that the data is the same in all caches.  If it cannot ensure that all copies are the same, it must flag the affected cache lines to indicate that they are out of date.&lt;br /&gt;
&lt;br /&gt;
The figure shows a 4-processor shared memory system in which each processor has its own cache.  Suppose processor P1 reads memory location M1 and stores it in its local cache.  Then, if P2 reads the same memory location, M1 is also stored in P2’s cache.  Now, if P1 changes the value of M1, the two copies of the same data residing in different caches become different.  When P2 operates on M1, it uses the stale value of M1 stored in its cache.  It is the responsibility of the cache coherence protocol to prevent this, and hardware support is needed to provide a coherent view of the data in multiple caches.  This is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
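The scenario just described can be written as a toy model (illustrative only: the two cache variables stand in for the processors' private cache copies, and the comments describe a hypothetical machine with no coherence hardware):&lt;br /&gt;
&lt;br /&gt;
 // Toy model of the figure: cache_P1 and cache_P2 stand in for&lt;br /&gt;
 // private cached copies of the shared memory location M1.&lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
    int M1 = 0;               // shared memory location&lt;br /&gt;
    int cache_P1, cache_P2;   // each processor's private cached copy&lt;br /&gt;
 &lt;br /&gt;
    cache_P1 = M1;            // P1 reads M1 into its cache&lt;br /&gt;
    cache_P2 = M1;            // P2 reads M1 into its cache&lt;br /&gt;
    cache_P1 = 7;             // P1 writes; only P1's copy changes&lt;br /&gt;
 &lt;br /&gt;
    // Without write propagation, cache_P2 is now stale (still 0).&lt;br /&gt;
    // A coherence protocol would invalidate or update P2's copy here.&lt;br /&gt;
    return cache_P2;          // the stale value&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;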
One might think that the cache write policy can provide cache coherence, but this is not true.  The write policy only controls how a change in a cache's value is propagated to the lower-level cache or main memory.  It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations.  In a single-processor system, code executes correctly as long as the compiler preserves the order of accesses to synchronization variables and other dependent variables.  But in a shared memory model with multiple processors, two threads may access shared data (such as a synchronization variable), and the output of the threads can change depending on which thread reaches the shared data first.  If this happens, the program's output on a uniprocessor system and on a multiprocessor system will differ.&lt;br /&gt;
&lt;br /&gt;
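A standard two-thread sketch illustrates the problem (assumed setup: producer and consumer run concurrently on different processors; the names are for illustration only):&lt;br /&gt;
&lt;br /&gt;
 int data = 0;   // payload&lt;br /&gt;
 int flag = 0;   // becomes 1 when the payload is ready&lt;br /&gt;
 &lt;br /&gt;
 void producer(void)&lt;br /&gt;
 {&lt;br /&gt;
    data = 42;   // (1) write the payload&lt;br /&gt;
    flag = 1;    // (2) signal that the payload is ready&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 void consumer(void)&lt;br /&gt;
 {&lt;br /&gt;
    while (flag == 0)&lt;br /&gt;
    {&lt;br /&gt;
       // busy wait for the signal&lt;br /&gt;
    }&lt;br /&gt;
    // If the hardware or compiler reorders (1) and (2), this read&lt;br /&gt;
    // can still observe data == 0 even though flag == 1.&lt;br /&gt;
    int observed = data;&lt;br /&gt;
    (void)observed;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;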
Maintaining program order is very important for memory consistency, but it comes at the cost of performance.  The various memory consistency models trade off performance against ease of programming.&lt;br /&gt;
&lt;br /&gt;
=Peterson's Algorithm=&lt;br /&gt;
Peterson’s algorithm, also known as Peterson’s solution, addresses the critical section problem by meeting its three criteria: mutual exclusion, progress, and bounded waiting&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 int turn;            // whose turn it is to yield&lt;br /&gt;
 int interested[2];   // interested[i] == 1 when process i wants the critical section&lt;br /&gt;
 void enter_region(int process)   // process is 0 or 1&lt;br /&gt;
 {&lt;br /&gt;
    int otherProcess = 1 - process;   // ID of the other process&lt;br /&gt;
    interested[process] = 1;          // announce interest&lt;br /&gt;
    turn = process;                   // offer the turn to the other process&lt;br /&gt;
    while (turn == process &amp;amp;&amp;amp; interested[otherProcess] == 1)&lt;br /&gt;
    {&lt;br /&gt;
       // busy wait until the other process leaves or yields the turn&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
 void leave_region(int process)&lt;br /&gt;
 {&lt;br /&gt;
    interested[process] = 0;          // no longer in the critical section&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
As seen above, the algorithm uses two shared variables, interested and turn.  Setting a process's interested flag to 1 indicates that the process wants to enter the critical section.  The variable turn records which process must yield: each process sets turn to its own ID and then waits while turn still equals that ID, so the process that wrote turn last defers to the other.&lt;br /&gt;
&lt;br /&gt;
To ensure mutual exclusion, interested and turn are set so that only one process can be in the critical section at a time.  To ensure progress, once a process has left the critical section, the remaining processes can determine which one is allowed in next.  To ensure bounded waiting, as the code above shows, a process waits no more than one turn to enter the critical section: once the other process sets turn to its own ID, the waiting process's loop condition becomes false.&lt;br /&gt;
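&lt;br /&gt;
A sketch of how the two routines are used (shared_counter and worker are illustrative names; the only valid process IDs are 0 and 1):&lt;br /&gt;
&lt;br /&gt;
 int shared_counter = 0;   // illustrative shared data&lt;br /&gt;
 &lt;br /&gt;
 void worker(int process)  // process must be 0 or 1&lt;br /&gt;
 {&lt;br /&gt;
    enter_region(process);                 // request the critical section&lt;br /&gt;
    shared_counter = shared_counter + 1;   // critical section&lt;br /&gt;
    leave_region(process);                 // release it&lt;br /&gt;
 }&lt;br /&gt;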
&lt;br /&gt;
=Page tables=&lt;br /&gt;
An issue with multiprocessor systems is how to enter the critical section.  One software solution is to use page tables to ensure mutual exclusion, progress, and bounded waiting, the three criteria that must be met&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Memory Coherence==&lt;br /&gt;
As with other forms of synchronization, page tables for multiprocessor systems must have strategies for handling data.  As discussed in Li and Hudak's article, there are two basic approaches to page synchronization: invalidation and write-broadcast.  In an invalidation approach, each page has exactly one owner, which has either read or write access to the page.  In a write-broadcast approach, as the name implies, a write is broadcast to all copies of the page's data, after which the faulting instruction returns&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
As mentioned for the invalidation approach, every page must have an owner.  There are two basic ownership approaches for page tables: fixed and dynamic.  In a fixed ownership approach, one processor always owns the same page; other processors that want read or write access to the page must negotiate for it.  For dynamic ownership, there are two subcategories: centralized and distributed management&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
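&lt;br /&gt;
As a sketch of the invalidation approach, a write page fault handler might look like the following (the helper names are hypothetical; the handlers in Li and Hudak's paper also manipulate the lock and copy set fields described below):&lt;br /&gt;
&lt;br /&gt;
 // Hypothetical write-fault handler for the invalidation approach.&lt;br /&gt;
 // How the owner is located depends on the management scheme below.&lt;br /&gt;
 void write_fault(int page)&lt;br /&gt;
 {&lt;br /&gt;
    acquire_ownership(page);         // become the page's single owner&lt;br /&gt;
    invalidate_other_copies(page);   // every other copy is discarded&lt;br /&gt;
    set_access(page, WRITE_ACCESS);  // the faulting write can now be retried&lt;br /&gt;
 }&lt;br /&gt;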
&lt;br /&gt;
==Centralized Management==&lt;br /&gt;
In centralized management of dynamic pages, a single central processor maintains a table with an entry for each page, and each entry has three fields: owner, copy set, and lock.  The owner field holds the single processor that owns the page, which can also be thought of as the most recent processor to have write access.  The copy set field lists all processors that have a copy of the page; this field helps to avoid a broadcast.  The lock field is used for synchronizing requests&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
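&lt;br /&gt;
The manager's per-page entry can be pictured as a small record (a sketch; the field types are assumptions, with the copy set shown as a bitmask of processors):&lt;br /&gt;
&lt;br /&gt;
 struct page_entry&lt;br /&gt;
 {&lt;br /&gt;
    int owner;                // processor with the most recent write access&lt;br /&gt;
    unsigned long copy_set;   // bitmask of processors holding a copy&lt;br /&gt;
    int lock;                 // serializes requests for this page&lt;br /&gt;
 };&lt;br /&gt;
&lt;br /&gt;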
==Distributed Management==&lt;br /&gt;
There are two schemes for distributed management: fixed and dynamic.  In fixed distributed management, each processor in the system is given a predetermined subset of pages to manage.  A simple, straightforward distribution of responsibility is to divide the pages evenly among the processors, although other distributions can be tailored to an application's needs&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a dynamic distributed management system, each processor keeps track of page ownership in its own local table.  Instead of the owner field of the centralized management scheme, there is a probOwner field that holds the probable owner of the page; on a page fault, the request is forwarded from processor to processor along these hints until the true owner is found&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
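The forwarding step can be sketched as a small loop (probOwner, is_owner, and prob_owner_at are hypothetical names):&lt;br /&gt;
&lt;br /&gt;
 // Hypothetical owner lookup for dynamic distributed management:&lt;br /&gt;
 // forward the request along probOwner hints until the owner is found.&lt;br /&gt;
 int find_owner(int page)&lt;br /&gt;
 {&lt;br /&gt;
    int p = probOwner[page];        // our local hint&lt;br /&gt;
    while (!is_owner(p, page))      // p may only hold another hint&lt;br /&gt;
    {&lt;br /&gt;
       p = prob_owner_at(p, page);  // ask p for its probOwner hint&lt;br /&gt;
    }&lt;br /&gt;
    return p;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;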
=References=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Memory_coherence  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Peterson%27s_algorithm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Li, K. and Hudak, P. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, 7(4):321-359, November 1989.  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html&lt;/div&gt;</summary>
		<author><name>Pmpatel</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch7_jp&amp;diff=44591</id>
		<title>CSC/ECE 506 Spring 2011/ch7 jp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch7_jp&amp;diff=44591"/>
		<updated>2011-03-22T02:41:59Z</updated>

		<summary type="html">&lt;p&gt;Pmpatel: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction=&lt;br /&gt;
&lt;br /&gt;
Shared memory architecture is one of the two major classes of large computer systems.  In a shared memory architecture, physical memory from all the processors is mapped into a single memory map.  All the processors can potentially access to all the memory on the system, although access time could be different (eg NUMA).  Different threads or processes can communicate by reading/writing to shared memory.&lt;br /&gt;
&lt;br /&gt;
==Advantages of shared memory system==&lt;br /&gt;
&lt;br /&gt;
* Shared memory parallel programs and multi-threaded programs will automatically run on shared memory system.&lt;br /&gt;
* Single OS runs on shared memory system which simplifies maintenance, memory management and scheduler tasks.&lt;br /&gt;
* Amount of total memory is large as it is simply sum of all memory of individual nodes.&lt;br /&gt;
* Communicating data between caches is faster than that in message passing model.&lt;br /&gt;
* Ability of finer granularity sharing compared to message passing which adds overhead of creating/sending messages for each bit of information.&lt;br /&gt;
* Common data types can be shared between different threads running on different processors. (eg. shared data which is Read Only)&lt;br /&gt;
&lt;br /&gt;
==Disadvantages of shared memory system==&lt;br /&gt;
&lt;br /&gt;
* Cost of providing shared memory grows super linearly with number of processors compared to message passing model which grows linearly.&lt;br /&gt;
&lt;br /&gt;
==Hardware Support ==&lt;br /&gt;
Shared memory architecture needs some hardware support for the implementation unlike a message passing model which can rely on software for send and receive messages.  Three things necessary with respect to hardware support for correct execution of a shared memory parallel program on a multiprocessor system are: &lt;br /&gt;
# Cache Coherence protocol&lt;br /&gt;
# Memory consistency model&lt;br /&gt;
# Hardware Synchronization&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with single processor (single core), maintaining cache coherency is simple and easy but in a multiprocessor system, it is not as simple.  Data can be present in any processors cache and protocol needs to ensure that the data is same in all caches.  If it cannot ensure that all the caches are same, then it needs to flag a cache line indicating that it is not updated.  &lt;br /&gt;
&lt;br /&gt;
In the figure shown here, this is a 4 processor shared memory system where each processor has its own cache.  Supposed processor P1 reads memory location M1 and stores it in its local cache.  Then, if P2 reads same location memory location then M1 gets stored in P2’s cache.  Now, if P1 changes value of M1, two copies of same data, residing in different caches will become different.  When P2 operates on M1, it uses the stale value of M1 that was stored in its cache.  It is responsibility of Cache Coherence Protocol to prevent this.  Hardware support is needed to provide a coherent view of data in multiple caches.  This is known as write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that cache write policy can provide cache coherence but it is not true.  Cache write policy only controls how a change in value of cache is propagated to lower level cache or main memory.  It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (load and store) to different memory locations.  In a single processor system, code will execute correctly if the compiler preservers the order of the access to synchronization variables and other dependent variables.  But in shared memory model with multiple processors, two threads could be access a shared data (something like a synchronization variable) and the output of the threads could change based on which thread can get to the shared data earlier.  If this happens, then the program output on uni-processor system and multi-processor program will be different.&lt;br /&gt;
&lt;br /&gt;
Maintaining program order is very important for memory consistency but it comes with performance degradation.  Various memory consistency models trades off performance to make programming easy.   &lt;br /&gt;
&lt;br /&gt;
=Peterson's Algorithm=&lt;br /&gt;
Peterson’s algorithm, also known as Peterson’s solution, is an algorithm that addresses the critical section problem by meeting the 3 criteria of the problem: mutual exclusion, progress, and bounded waiting&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 int turn;  // turn for execution&lt;br /&gt;
 int interested[2];   // the processors interested&lt;br /&gt;
 void enter_region(int process) &lt;br /&gt;
 { &lt;br /&gt;
   int otherProcess; &lt;br /&gt;
   otherProcess = 1 - process;    &lt;br /&gt;
   interested[process]=1;  &lt;br /&gt;
   turn = process;         &lt;br /&gt;
   while(turn == process &amp;amp;&amp;amp; interested[otherProcess] == 1)&lt;br /&gt;
      {&lt;br /&gt;
  	// busy wait&lt;br /&gt;
      }&lt;br /&gt;
 } &lt;br /&gt;
 void leave_region(int process) &lt;br /&gt;
 { &lt;br /&gt;
   interested[process] = 0; &lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The algorithm uses two variables as seen above, interested and turn.  If a flag value for interested is set to 1, it indicates the process wants to enter the critical section.  The variable turn is a placeholder for the process whose turn will be next allowed to enter the critical section.  As seen above, Process1 has been given priority over P0.&lt;br /&gt;
&lt;br /&gt;
To ensure mutual exclusion, the variable interested and turn are set to ensure only 1 can enter the critical section at a time.  To ensure Progress, once a processor has left the critical section the other processors can determine which processor is allowed into the critical section next.  To ensure bounded waiting, as seen with the above code, a process will wait no longer than 1 turn to enter the critical section as the variable turn=1 will be enabled.&lt;br /&gt;
&lt;br /&gt;
=Page tables=&lt;br /&gt;
An issue with multiprocessor systems is how to enter the critical section.  One software solution is to use Page tables to ensure mutual exclusion, progress, and bounded waiting which are the 3 criteria needed to be met&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Memory Coherence==&lt;br /&gt;
As with other forms of synchronization, page tables for multiprocessor systems must have strategies for handling data.  As discussed in Li and Hudaks article, there are two basic approaches for page synchronization, invalidation and write-broadcast.  In an invalidation approach, there is only 1 owner for each page, having either read or write access to the page.  In a write-broadcast, as the name implies, it broadcasts or writes all copies for the page data and returns the faulty instruction&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
As mentioned in the invalidation approach, there must be an owner for the page.  There are two basic ownership approaches for page tables, fixed or dynamic.  In a fixed ownership approach, 1 processor always owns the same page.  Other processors who want read or write access to the page location must negotiate for the access.  For dynamic ownership, there are two subcategories, centralized or distributed management&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Centralized Management==&lt;br /&gt;
In centralized management for dynamic pages, there is a single, central processor that maintains a page table which has a table for each page and each page has 3 fields: owner, copy set, and lock.  The owner field consists of the single processor which owns the page, which could also be thought as the most recent processor to have write access.  The copy set field has a list of all processors that have a copy of the page.  This field helps to avoid a broadcast.  The lock field is used for synchronizing requests&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
==Distributed Management==&lt;br /&gt;
There are 2 schemes for Distributed Management: fixed and dynamic.  For a fixed distributed management, each processor in the system is given a predetermined subset of pages to manage.  A simple, straightforward distribution of responsibility is the divide pages evenly among the system, although there are other solutions which could be tailored to the applications needs&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a dynamic distributed management system, each processor keeps track of ownership of all pages within their local table.  Instead of the owner field in the Centralized Management scheme, instead there is a probOwner field which has the probable owner of the page for a page fault&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Memory_coherence  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Peterson%27s_algorithm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Li, K. and Hudak, P. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, 7(4):321{359, November 1989.  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html&lt;/div&gt;</summary>
		<author><name>Pmpatel</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch7_jp&amp;diff=44590</id>
		<title>CSC/ECE 506 Spring 2011/ch7 jp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch7_jp&amp;diff=44590"/>
		<updated>2011-03-22T02:34:31Z</updated>

		<summary type="html">&lt;p&gt;Pmpatel: /* Peterson's Algorithm */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction=&lt;br /&gt;
&lt;br /&gt;
Shared memory architecture is one of the two major classes of large computer systems.  In a shared memory architecture, physical memory from all the processors is mapped into a single memory map.  All the processors can potentially access to all the memory on the system, although access time could be different (eg NUMA).&lt;br /&gt;
&lt;br /&gt;
==Advantages of shared memory system==&lt;br /&gt;
&lt;br /&gt;
* Shared memory parallel programs and multi-threaded programs will automatically run on shared memory system.&lt;br /&gt;
* Single OS runs on shared memory system which simplifies maintenance, memory management and scheduler tasks.&lt;br /&gt;
* Amount of total memory is large as it is simply sum of all memory of individual nodes.&lt;br /&gt;
* Communicating data between caches is faster than that in message passing model.&lt;br /&gt;
* Ability of finer granularity sharing compared to message passing which adds overhead of creating/sending messages for each bit of information.&lt;br /&gt;
* Common data types can be shared between different threads running on different processors. (eg. shared data which is Read Only)&lt;br /&gt;
&lt;br /&gt;
==Disadvantages of shared memory system==&lt;br /&gt;
&lt;br /&gt;
* Cost of providing shared memory grows super linearly with number of processors compared to message passing model which grows linearly.&lt;br /&gt;
&lt;br /&gt;
==Hardware Support ==&lt;br /&gt;
Shared memory architecture needs some hardware support for the implementation unlike a message passing model which can rely on software for send and receive messages.  Three things necessary with respect to hardware support for correct execution of a shared memory parallel program on a multiprocessor system are: &lt;br /&gt;
# Cache Coherence protocol&lt;br /&gt;
# Memory consistency model&lt;br /&gt;
# Hardware Synchronization&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with single processor (single core), maintaining cache coherency is simple and easy but in a multiprocessor system, it is not as simple.  Data can be present in any processors cache and protocol needs to ensure that the data is same in all caches.  If it cannot ensure that all the caches are same, then it needs to flag a cache line indicating that it is not updated.  &lt;br /&gt;
&lt;br /&gt;
In the figure shown here, this is a 4 processor shared memory system where each processor has its own cache.  Supposed processor P1 reads memory location M1 and stores it in its local cache.  Then, if P2 reads same location memory location then M1 gets stored in P2’s cache.  Now, if P1 changes value of M1, two copies of same data, residing in different caches will become different.  When P2 operates on M1, it uses the stale value of M1 that was stored in its cache.  It is responsibility of Cache Coherence Protocol to prevent this.  Hardware support is needed to provide a coherent view of data in multiple caches.  This is known as write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that cache write policy can provide cache coherence but it is not true.  Cache write policy only controls how a change in value of cache is propagated to lower level cache or main memory.  It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (load and store) to different memory locations.  In a single processor system, code will execute correctly if the compiler preservers the order of the access to synchronization variables and other dependent variables.  But in shared memory model with multiple processors, two threads could be access a shared data (something like a synchronization variable) and the output of the threads could change based on which thread can get to the shared data earlier.  If this happens, then the program output on uni-processor system and multi-processor program will be different.&lt;br /&gt;
&lt;br /&gt;
Maintaining program order is very important for memory consistency but it comes with performance degradation.  Various memory consistency models trades off performance to make programming easy.   &lt;br /&gt;
&lt;br /&gt;
=Peterson's Algorithm=&lt;br /&gt;
Peterson’s algorithm, also known as Peterson’s solution, is an algorithm that addresses the critical section problem by meeting the 3 criteria of the problem: mutual exclusion, progress, and bounded waiting&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 int turn;  // turn for execution&lt;br /&gt;
 int interested[2];   // the processors interested&lt;br /&gt;
 void enter_region(int process) &lt;br /&gt;
 { &lt;br /&gt;
   int otherProcess; &lt;br /&gt;
   otherProcess = 1 - process;    &lt;br /&gt;
   interested[process]=1;  &lt;br /&gt;
   turn = process;         &lt;br /&gt;
   while(turn == process &amp;amp;&amp;amp; interested[otherProcess] == 1)&lt;br /&gt;
      {&lt;br /&gt;
  	// busy wait&lt;br /&gt;
      }&lt;br /&gt;
 } &lt;br /&gt;
 void leave_region(int process) &lt;br /&gt;
 { &lt;br /&gt;
   interested[process] = 0; &lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The algorithm uses two variables as seen above, interested and turn.  If a flag value for interested is set to 1, it indicates the process wants to enter the critical section.  The variable turn is a placeholder for the process whose turn will be next allowed to enter the critical section.  As seen above, Process1 has been given priority over P0.&lt;br /&gt;
&lt;br /&gt;
To ensure mutual exclusion, the variable interested and turn are set to ensure only 1 can enter the critical section at a time.  To ensure Progress, once a processor has left the critical section the other processors can determine which processor is allowed into the critical section next.  To ensure bounded waiting, as seen with the above code, a process will wait no longer than 1 turn to enter the critical section as the variable turn=1 will be enabled.&lt;br /&gt;
&lt;br /&gt;
=Page tables=&lt;br /&gt;
An issue with multiprocessor systems is how to enter the critical section.  One software solution is to use Page tables to ensure mutual exclusion, progress, and bounded waiting which are the 3 criteria needed to be met&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Memory Coherence==&lt;br /&gt;
As with other forms of synchronization, page tables for multiprocessor systems must have strategies for handling data.  As discussed in Li and Hudaks article, there are two basic approaches for page synchronization, invalidation and write-broadcast.  In an invalidation approach, there is only 1 owner for each page, having either read or write access to the page.  In a write-broadcast, as the name implies, it broadcasts or writes all copies for the page data and returns the faulty instruction&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
As mentioned in the invalidation approach, there must be an owner for the page.  There are two basic ownership approaches for page tables, fixed or dynamic.  In a fixed ownership approach, 1 processor always owns the same page.  Other processors who want read or write access to the page location must negotiate for the access.  For dynamic ownership, there are two subcategories, centralized or distributed management&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Centralized Management==&lt;br /&gt;
In centralized management for dynamic pages, there is a single, central processor that maintains a page table which has a table for each page and each page has 3 fields: owner, copy set, and lock.  The owner field consists of the single processor which owns the page, which could also be thought as the most recent processor to have write access.  The copy set field has a list of all processors that have a copy of the page.  This field helps to avoid a broadcast.  The lock field is used for synchronizing requests&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
==Distributed Management==&lt;br /&gt;
There are 2 schemes for Distributed Management: fixed and dynamic.  For a fixed distributed management, each processor in the system is given a predetermined subset of pages to manage.  A simple, straightforward distribution of responsibility is the divide pages evenly among the system, although there are other solutions which could be tailored to the applications needs&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a dynamic distributed management system, each processor keeps track of ownership of all pages within their local table.  Instead of the owner field in the Centralized Management scheme, instead there is a probOwner field which has the probable owner of the page for a page fault&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Memory_coherence  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Peterson%27s_algorithm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Li, K. and Hudak, P. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, 7(4):321{359, November 1989.  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html&lt;/div&gt;</summary>
		<author><name>Pmpatel</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch7_jp&amp;diff=44589</id>
		<title>CSC/ECE 506 Spring 2011/ch7 jp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch7_jp&amp;diff=44589"/>
		<updated>2011-03-22T02:31:09Z</updated>

		<summary type="html">&lt;p&gt;Pmpatel: /* Cache Coherence Problem */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction=&lt;br /&gt;
&lt;br /&gt;
Shared memory architecture is one of the two major classes of large computer systems.  In a shared memory architecture, physical memory from all the processors is mapped into a single memory map.  All the processors can potentially access to all the memory on the system, although access time could be different (eg NUMA).&lt;br /&gt;
&lt;br /&gt;
==Advantages of shared memory system==&lt;br /&gt;
&lt;br /&gt;
* Shared memory parallel programs and multi-threaded programs will automatically run on shared memory system.&lt;br /&gt;
* Single OS runs on shared memory system which simplifies maintenance, memory management and scheduler tasks.&lt;br /&gt;
* Amount of total memory is large as it is simply sum of all memory of individual nodes.&lt;br /&gt;
* Communicating data between caches is faster than that in message passing model.&lt;br /&gt;
* Ability of finer granularity sharing compared to message passing which adds overhead of creating/sending messages for each bit of information.&lt;br /&gt;
* Common data types can be shared between different threads running on different processors. (eg. shared data which is Read Only)&lt;br /&gt;
&lt;br /&gt;
==Disadvantages of shared memory system==&lt;br /&gt;
&lt;br /&gt;
* Cost of providing shared memory grows super linearly with number of processors compared to message passing model which grows linearly.&lt;br /&gt;
&lt;br /&gt;
==Hardware Support ==&lt;br /&gt;
Shared memory architecture needs some hardware support for the implementation unlike a message passing model which can rely on software for send and receive messages.  Three things necessary with respect to hardware support for correct execution of a shared memory parallel program on a multiprocessor system are: &lt;br /&gt;
# Cache Coherence protocol&lt;br /&gt;
# Memory consistency model&lt;br /&gt;
# Hardware Synchronization&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with single processor (single core), maintaining cache coherency is simple and easy but in a multiprocessor system, it is not as simple.  Data can be present in any processors cache and protocol needs to ensure that the data is same in all caches.  If it cannot ensure that all the caches are same, then it needs to flag a cache line indicating that it is not updated.  &lt;br /&gt;
&lt;br /&gt;
In the figure shown here, this is a 4 processor shared memory system where each processor has its own cache.  Supposed processor P1 reads memory location M1 and stores it in its local cache.  Then, if P2 reads same location memory location then M1 gets stored in P2’s cache.  Now, if P1 changes value of M1, two copies of same data, residing in different caches will become different.  When P2 operates on M1, it uses the stale value of M1 that was stored in its cache.  It is responsibility of Cache Coherence Protocol to prevent this.  Hardware support is needed to provide a coherent view of data in multiple caches.  This is known as write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that cache write policy can provide cache coherence but it is not true.  Cache write policy only controls how a change in value of cache is propagated to lower level cache or main memory.  It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (load and store) to different memory locations.  In a single processor system, code will execute correctly if the compiler preservers the order of the access to synchronization variables and other dependent variables.  But in shared memory model with multiple processors, two threads could be access a shared data (something like a synchronization variable) and the output of the threads could change based on which thread can get to the shared data earlier.  If this happens, then the program output on uni-processor system and multi-processor program will be different.&lt;br /&gt;
&lt;br /&gt;
Maintaining program order is very important for memory consistency but it comes with performance degradation.  Various memory consistency models trades off performance to make programming easy.   &lt;br /&gt;
&lt;br /&gt;
=Peterson's Algorithm=&lt;br /&gt;
Peterson’s algorithm, also known as Peterson’s solution, is an algorithm that addresses the critical section problem by meeting the 3 criteria of the problem: mutual exclusion, progress, and bounded waiting&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 int turn;  // turn for execution&lt;br /&gt;
 int interested[2];   // the processors interested&lt;br /&gt;
 void enter_region(int process) &lt;br /&gt;
 { &lt;br /&gt;
   int otherProcess; &lt;br /&gt;
   otherProcess = 1 - process;    &lt;br /&gt;
   interested[process]=1;  &lt;br /&gt;
   turn = process;         &lt;br /&gt;
   while(turn == process &amp;amp;&amp;amp; interested[otherProcess] == 1)&lt;br /&gt;
      {&lt;br /&gt;
  	// busy wait&lt;br /&gt;
      }&lt;br /&gt;
 } &lt;br /&gt;
 void leave_region(int process) &lt;br /&gt;
 { &lt;br /&gt;
   interested[process] = 0; &lt;br /&gt;
 }&lt;br /&gt;
The algorithm uses two variables as seen above, interested and turn.  If a flag value for interested is set to 1, it indicates the process wants to enter the critical section.  The variable turn is a placeholder for the process whose turn will be next allowed to enter the critical section.  As seen above, Process1 has been given priority over P0.&lt;br /&gt;
&lt;br /&gt;
To ensure mutual exclusion, the variable interested and turn are set to ensure only 1 can enter the critical section at a time.  To ensure Progress, once a processor has left the critical section the other processors can determine which processor is allowed into the critical section next.  To ensure bounded waiting, as seen with the above code, a process will wait no longer than 1 turn to enter the critical section as the variable turn=1 will be enabled.&lt;br /&gt;
&lt;br /&gt;
=Page tables=&lt;br /&gt;
An issue with multiprocessor systems is how to enter the critical section.  One software solution is to use Page tables to ensure mutual exclusion, progress, and bounded waiting which are the 3 criteria needed to be met&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Memory Coherence==&lt;br /&gt;
As with other forms of synchronization, page tables for multiprocessor systems must have strategies for handling data.  As discussed in Li and Hudaks article, there are two basic approaches for page synchronization, invalidation and write-broadcast.  In an invalidation approach, there is only 1 owner for each page, having either read or write access to the page.  In a write-broadcast, as the name implies, it broadcasts or writes all copies for the page data and returns the faulty instruction&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
As mentioned in the invalidation approach, there must be an owner for the page.  There are two basic ownership approaches for page tables, fixed or dynamic.  In a fixed ownership approach, 1 processor always owns the same page.  Other processors who want read or write access to the page location must negotiate for the access.  For dynamic ownership, there are two subcategories, centralized or distributed management&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Centralized Management==&lt;br /&gt;
In centralized management for dynamic pages, there is a single, central processor that maintains a page table which has a table for each page and each page has 3 fields: owner, copy set, and lock.  The owner field consists of the single processor which owns the page, which could also be thought as the most recent processor to have write access.  The copy set field has a list of all processors that have a copy of the page.  This field helps to avoid a broadcast.  The lock field is used for synchronizing requests&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
==Distributed Management==&lt;br /&gt;
There are 2 schemes for Distributed Management: fixed and dynamic.  For a fixed distributed management, each processor in the system is given a predetermined subset of pages to manage.  A simple, straightforward distribution of responsibility is the divide pages evenly among the system, although there are other solutions which could be tailored to the applications needs&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a dynamic distributed management system, each processor keeps track of ownership of all pages within their local table.  Instead of the owner field in the Centralized Management scheme, instead there is a probOwner field which has the probable owner of the page for a page fault&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Memory_coherence  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Peterson%27s_algorithm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Li, K. and Hudak, P. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, 7(4):321{359, November 1989.  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html&lt;/div&gt;</summary>
		<author><name>Pmpatel</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch7_jp&amp;diff=44588</id>
		<title>CSC/ECE 506 Spring 2011/ch7 jp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch7_jp&amp;diff=44588"/>
		<updated>2011-03-22T02:30:48Z</updated>

		<summary type="html">&lt;p&gt;Pmpatel: /* Cache Coherence Problem */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction=&lt;br /&gt;
&lt;br /&gt;
Shared memory architecture is one of the two major classes of large computer systems.  In a shared memory architecture, physical memory from all the processors is mapped into a single memory map.  All the processors can potentially access to all the memory on the system, although access time could be different (eg NUMA).&lt;br /&gt;
&lt;br /&gt;
==Advantages of shared memory system==&lt;br /&gt;
&lt;br /&gt;
* Shared memory parallel programs and multi-threaded programs will automatically run on shared memory system.&lt;br /&gt;
* Single OS runs on shared memory system which simplifies maintenance, memory management and scheduler tasks.&lt;br /&gt;
* Amount of total memory is large as it is simply sum of all memory of individual nodes.&lt;br /&gt;
* Communicating data between caches is faster than that in message passing model.&lt;br /&gt;
* Ability of finer granularity sharing compared to message passing which adds overhead of creating/sending messages for each bit of information.&lt;br /&gt;
* Common data types can be shared between different threads running on different processors. (eg. shared data which is Read Only)&lt;br /&gt;
&lt;br /&gt;
==Disadvantages of shared memory system==&lt;br /&gt;
&lt;br /&gt;
* Cost of providing shared memory grows super linearly with number of processors compared to message passing model which grows linearly.&lt;br /&gt;
&lt;br /&gt;
==Hardware Support ==&lt;br /&gt;
Shared memory architecture needs some hardware support for the implementation unlike a message passing model which can rely on software for send and receive messages.  Three things necessary with respect to hardware support for correct execution of a shared memory parallel program on a multiprocessor system are: &lt;br /&gt;
# Cache Coherence protocol&lt;br /&gt;
# Memory consistency model&lt;br /&gt;
# Hardware Synchronization&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|700px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with single processor (single core), maintaining cache coherency is simple and easy but in a multiprocessor system, it is not as simple.  Data can be present in any processors cache and protocol needs to ensure that the data is same in all caches.  If it cannot ensure that all the caches are same, then it needs to flag a cache line indicating that it is not updated.  &lt;br /&gt;
&lt;br /&gt;
In the figure shown here, this is a 4 processor shared memory system where each processor has its own cache.  Supposed processor P1 reads memory location M1 and stores it in its local cache.  Then, if P2 reads same location memory location then M1 gets stored in P2’s cache.  Now, if P1 changes value of M1, two copies of same data, residing in different caches will become different.  When P2 operates on M1, it uses the stale value of M1 that was stored in its cache.  It is responsibility of Cache Coherence Protocol to prevent this.  Hardware support is needed to provide a coherent view of data in multiple caches.  This is known as write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that cache write policy can provide cache coherence but it is not true.  Cache write policy only controls how a change in value of cache is propagated to lower level cache or main memory.  It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (load and store) to different memory locations.  In a single processor system, code will execute correctly if the compiler preservers the order of the access to synchronization variables and other dependent variables.  But in shared memory model with multiple processors, two threads could be access a shared data (something like a synchronization variable) and the output of the threads could change based on which thread can get to the shared data earlier.  If this happens, then the program output on uni-processor system and multi-processor program will be different.&lt;br /&gt;
&lt;br /&gt;
Maintaining program order is very important for memory consistency but it comes with performance degradation.  Various memory consistency models trades off performance to make programming easy.   &lt;br /&gt;
&lt;br /&gt;
=Peterson's Algorithm=&lt;br /&gt;
Peterson’s algorithm, also known as Peterson’s solution, is an algorithm that addresses the critical section problem by meeting the 3 criteria of the problem: mutual exclusion, progress, and bounded waiting&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 int turn;  // turn for execution&lt;br /&gt;
 int interested[2];   // the processors interested&lt;br /&gt;
 void enter_region(int process) &lt;br /&gt;
 { &lt;br /&gt;
   int otherProcess; &lt;br /&gt;
   otherProcess = 1 - process;    &lt;br /&gt;
   interested[process]=1;  &lt;br /&gt;
   turn = process;         &lt;br /&gt;
   while(turn == process &amp;amp;&amp;amp; interested[otherProcess] == 1)&lt;br /&gt;
      {&lt;br /&gt;
  	// busy wait&lt;br /&gt;
      }&lt;br /&gt;
 } &lt;br /&gt;
 void leave_region(int process) &lt;br /&gt;
 { &lt;br /&gt;
   interested[process] = 0; &lt;br /&gt;
 }&lt;br /&gt;
The algorithm uses two variables as seen above, interested and turn.  If a flag value for interested is set to 1, it indicates the process wants to enter the critical section.  The variable turn is a placeholder for the process whose turn will be next allowed to enter the critical section.  As seen above, Process1 has been given priority over P0.&lt;br /&gt;
&lt;br /&gt;
To ensure mutual exclusion, the variable interested and turn are set to ensure only 1 can enter the critical section at a time.  To ensure Progress, once a processor has left the critical section the other processors can determine which processor is allowed into the critical section next.  To ensure bounded waiting, as seen with the above code, a process will wait no longer than 1 turn to enter the critical section as the variable turn=1 will be enabled.&lt;br /&gt;
&lt;br /&gt;
=Page tables=&lt;br /&gt;
An issue with multiprocessor systems is how to enter the critical section.  One software solution is to use Page tables to ensure mutual exclusion, progress, and bounded waiting which are the 3 criteria needed to be met&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Memory Coherence==&lt;br /&gt;
As with other forms of synchronization, page tables for multiprocessor systems must have strategies for handling data.  As discussed in Li and Hudaks article, there are two basic approaches for page synchronization, invalidation and write-broadcast.  In an invalidation approach, there is only 1 owner for each page, having either read or write access to the page.  In a write-broadcast, as the name implies, it broadcasts or writes all copies for the page data and returns the faulty instruction&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
As mentioned in the invalidation approach, there must be an owner for the page.  There are two basic ownership approaches for page tables, fixed or dynamic.  In a fixed ownership approach, 1 processor always owns the same page.  Other processors who want read or write access to the page location must negotiate for the access.  For dynamic ownership, there are two subcategories, centralized or distributed management&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Centralized Management==&lt;br /&gt;
In centralized management for dynamic pages, there is a single, central processor that maintains a page table which has a table for each page and each page has 3 fields: owner, copy set, and lock.  The owner field consists of the single processor which owns the page, which could also be thought as the most recent processor to have write access.  The copy set field has a list of all processors that have a copy of the page.  This field helps to avoid a broadcast.  The lock field is used for synchronizing requests&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
==Distributed Management==&lt;br /&gt;
There are 2 schemes for Distributed Management: fixed and dynamic.  For a fixed distributed management, each processor in the system is given a predetermined subset of pages to manage.  A simple, straightforward distribution of responsibility is the divide pages evenly among the system, although there are other solutions which could be tailored to the applications needs&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a dynamic distributed management system, each processor keeps track of ownership of all pages within their local table.  Instead of the owner field in the Centralized Management scheme, instead there is a probOwner field which has the probable owner of the page for a page fault&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Memory_coherence  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Peterson%27s_algorithm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Li, K. and Hudak, P. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, 7(4):321{359, November 1989.  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html&lt;/div&gt;</summary>
		<author><name>Pmpatel</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch7_jp&amp;diff=44587</id>
		<title>CSC/ECE 506 Spring 2011/ch7 jp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch7_jp&amp;diff=44587"/>
		<updated>2011-03-22T02:29:11Z</updated>

		<summary type="html">&lt;p&gt;Pmpatel: /* Introduction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction=&lt;br /&gt;
&lt;br /&gt;
Shared memory architecture is one of the two major classes of large computer systems.  In a shared memory architecture, physical memory from all the processors is mapped into a single memory map.  All the processors can potentially access to all the memory on the system, although access time could be different (eg NUMA).&lt;br /&gt;
&lt;br /&gt;
==Advantages of shared memory system==&lt;br /&gt;
&lt;br /&gt;
* Shared memory parallel programs and multi-threaded programs run on a shared memory system without modification.&lt;br /&gt;
* A single OS runs on a shared memory system, which simplifies maintenance, memory management, and scheduling.&lt;br /&gt;
* The total amount of memory is large, since it is simply the sum of the memory of the individual nodes.&lt;br /&gt;
* Communicating data between caches is faster than communication in the message passing model.&lt;br /&gt;
* Sharing can be done at a finer granularity than in message passing, which adds the overhead of creating and sending a message for each piece of information.&lt;br /&gt;
* Common data can be shared between threads running on different processors (e.g., read-only shared data).&lt;br /&gt;
&lt;br /&gt;
==Disadvantages of shared memory system==&lt;br /&gt;
&lt;br /&gt;
* The cost of providing shared memory grows superlinearly with the number of processors, whereas the cost of a message passing system grows only linearly.&lt;br /&gt;
&lt;br /&gt;
==Hardware Support ==&lt;br /&gt;
A shared memory architecture needs some hardware support for its implementation, unlike a message passing model, which can rely on software to send and receive messages.  Three things are necessary, with respect to hardware support, for the correct execution of a shared memory parallel program on a multiprocessor system: &lt;br /&gt;
# Cache Coherence protocol&lt;br /&gt;
# Memory consistency model&lt;br /&gt;
# Hardware Synchronization&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|500px|Shared Memory&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple and easy, but in a multiprocessor system it is not as simple.  Data can be present in any processor's cache, and the protocol needs to ensure that the data is the same in all caches.  If it cannot ensure that all the caches hold the same data, it needs to flag the affected cache line to indicate that it is not up to date.&lt;br /&gt;
&lt;br /&gt;
The figure here shows a 4-processor shared memory system in which each processor has its own cache.  Suppose processor P1 reads memory location M1 and stores it in its local cache.  Then, if P2 reads the same memory location, M1 also gets stored in P2's cache.  Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different.  When P2 operates on M1, it uses the stale value of M1 that was stored in its cache.  It is the responsibility of the cache coherence protocol to prevent this; hardware support is needed to provide a coherent view of the data in multiple caches.  This is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but that is not true.  The cache write policy only controls how a change in the value of a cache line is propagated to a lower level cache or to main memory; it is not responsible for propagating changes to other caches.&lt;br /&gt;
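&lt;br /&gt;
For contrast, the following is a hedged sketch of the two common write policies; note that neither one notifies the other processors' caches, which is why a write policy alone cannot provide coherence.  The cache_line type and the write_next_level helper are invented for the example.&lt;br /&gt;
&lt;br /&gt;
 typedef struct { unsigned long addr; int data; int dirty; } cache_line;&lt;br /&gt;
 void write_next_level(unsigned long addr, int value);   /* assumed helper */&lt;br /&gt;
 &lt;br /&gt;
 /* Write-through: every store also updates the next level immediately. */&lt;br /&gt;
 void store_write_through(cache_line *line, int value)&lt;br /&gt;
 {&lt;br /&gt;
     line-&amp;gt;data = value;&lt;br /&gt;
     write_next_level(line-&amp;gt;addr, value);   /* lower cache or main memory */&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 /* Write-back: mark dirty now; write to the next level only on eviction. */&lt;br /&gt;
 void store_write_back(cache_line *line, int value)&lt;br /&gt;
 {&lt;br /&gt;
     line-&amp;gt;data  = value;&lt;br /&gt;
     line-&amp;gt;dirty = 1;&lt;br /&gt;
 }&lt;br /&gt;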
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations.  In a single processor system, code will execute correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables.  But in a shared memory model with multiple processors, two threads could access shared data (such as a synchronization variable), and the output of the threads could change based on which thread reaches the shared data first.  If this happens, the program's output on a uniprocessor system and on a multiprocessor system can differ.&lt;br /&gt;
&lt;br /&gt;
Maintaining program order is very important for memory consistency, but it comes at a performance cost.  The various memory consistency models trade off performance against ease of programming, as the example below illustrates.&lt;br /&gt;
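&lt;br /&gt;
As a concrete illustration (not taken from the text above), consider the common flag-passing idiom below.  Under a relaxed consistency model the two stores in the producer may become visible to the other processor out of program order, so the consumer can observe ready == 1 and still read a stale value of data; sequential consistency forbids that outcome.  The variable names are invented for the example.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 int data  = 0;     /* payload written by the producer */&lt;br /&gt;
 int ready = 0;     /* flag signalling that data is valid */&lt;br /&gt;
 &lt;br /&gt;
 void producer(void)          /* runs on processor P1 */&lt;br /&gt;
 {&lt;br /&gt;
     data  = 42;              /* store 1 */&lt;br /&gt;
     ready = 1;               /* store 2: may be seen before store 1 */&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 void consumer(void)          /* runs on processor P2 */&lt;br /&gt;
 {&lt;br /&gt;
     while (ready == 0)&lt;br /&gt;
         ;                    /* busy wait for the flag */&lt;br /&gt;
     printf(&amp;quot;%d\n&amp;quot;, data);    /* may print 0 under a relaxed model */&lt;br /&gt;
 }&lt;br /&gt;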
&lt;br /&gt;
=Peterson's Algorithm=&lt;br /&gt;
Peterson’s algorithm, also known as Peterson’s solution, is an algorithm that addresses the critical section problem by meeting its three criteria: mutual exclusion, progress, and bounded waiting&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 int turn;            // whose turn it is to yield&lt;br /&gt;
 int interested[2];   // interested[i] == 1 if process i wants to enter&lt;br /&gt;
 void enter_region(int process)&lt;br /&gt;
 {&lt;br /&gt;
   int otherProcess;&lt;br /&gt;
   otherProcess = 1 - process;       // the only other process&lt;br /&gt;
   interested[process] = 1;          // announce interest&lt;br /&gt;
   turn = process;                   // offer to wait if there is contention&lt;br /&gt;
   while (turn == process &amp;amp;&amp;amp; interested[otherProcess] == 1)&lt;br /&gt;
   {&lt;br /&gt;
      // busy wait until the other process leaves or defers&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 void leave_region(int process)&lt;br /&gt;
 {&lt;br /&gt;
   interested[process] = 0;          // exit the critical section&lt;br /&gt;
 }&lt;br /&gt;
The algorithm uses two variables, as seen above: interested and turn.  Setting interested[i] to 1 indicates that process i wants to enter the critical section.  The variable turn records which process most recently offered to wait; when both processes are interested, the process that set turn last busy-waits, so the other process is allowed into the critical section first.&lt;br /&gt;
&lt;br /&gt;
To ensure mutual exclusion, the variables interested and turn are set so that only one process can be in the critical section at a time.  To ensure progress, once a process has left the critical section, the waiting process can immediately determine that it may enter next.  To ensure bounded waiting, as the code above shows, a process waits no longer than one turn to enter the critical section: as soon as the other process leaves and clears its interested flag, the waiting process's loop condition becomes false.&lt;br /&gt;
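&lt;br /&gt;
A minimal usage sketch follows, with two threads protecting a shared counter through enter_region/leave_region from the listing above.  On modern out-of-order hardware the plain int flags are not sufficient by themselves (stores can be reordered, much as in the consistency example earlier), so real code would need memory barriers or atomics; this sketch assumes the sequentially consistent machine the algorithm was designed for, and the pthread scaffolding is an assumption made for the example.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 int counter = 0;    /* shared resource guarded by Peterson's algorithm */&lt;br /&gt;
 &lt;br /&gt;
 void *worker(void *arg)&lt;br /&gt;
 {&lt;br /&gt;
     int me = (int)(long)arg;       /* process id: 0 or 1 */&lt;br /&gt;
     for (int i = 0; i &amp;lt; 100000; i++) {&lt;br /&gt;
         enter_region(me);          /* from the listing above */&lt;br /&gt;
         counter++;                 /* critical section */&lt;br /&gt;
         leave_region(me);&lt;br /&gt;
     }&lt;br /&gt;
     return NULL;&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     pthread_t t0, t1;&lt;br /&gt;
     pthread_create(&amp;amp;t0, NULL, worker, (void *)0L);&lt;br /&gt;
     pthread_create(&amp;amp;t1, NULL, worker, (void *)1L);&lt;br /&gt;
     pthread_join(t0, NULL);&lt;br /&gt;
     pthread_join(t1, NULL);&lt;br /&gt;
     printf(&amp;quot;counter = %d\n&amp;quot;, counter);   /* 200000 if exclusion holds */&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;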
&lt;br /&gt;
=Page tables=&lt;br /&gt;
An issue with multiprocessor systems is how to control entry into the critical section.  One software solution is to use page tables to ensure mutual exclusion, progress, and bounded waiting, the three criteria that need to be met&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Memory Coherence==&lt;br /&gt;
As with other forms of synchronization, page tables for multiprocessor systems must have strategies for handling data.  As discussed in Li and Hudak's article, there are two basic approaches to page synchronization: invalidation and write-broadcast.  In the invalidation approach, there is only one owner for each page, with either read or write access; a write fault invalidates all other copies of the page.  In write-broadcast, as the name implies, a write is broadcast to all copies of the page, after which execution returns to the faulting instruction&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
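&lt;br /&gt;
The difference is easiest to see in the write-fault path.  The sketch below illustrates the two strategies; it is not code from the paper, and the copy-set query, the messaging calls, and resume() are all assumed primitives.&lt;br /&gt;
&lt;br /&gt;
 /* Invalidation: make the writer the sole owner before it proceeds. */&lt;br /&gt;
 void write_fault_invalidate(unsigned long page, int writer)&lt;br /&gt;
 {&lt;br /&gt;
     for (int p = 0; p &amp;lt; MAX_PROCS; p++)&lt;br /&gt;
         if (in_copy_set(page, p) &amp;amp;&amp;amp; p != writer)&lt;br /&gt;
             send_invalidate(p, page);     /* remove every other copy */&lt;br /&gt;
     set_owner(page, writer);&lt;br /&gt;
     resume(writer);                       /* re-run the faulting store */&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 /* Write-broadcast: update every copy instead of removing them. */&lt;br /&gt;
 void write_fault_broadcast(unsigned long page, unsigned long addr, int value, int writer)&lt;br /&gt;
 {&lt;br /&gt;
     for (int p = 0; p &amp;lt; MAX_PROCS; p++)&lt;br /&gt;
         if (in_copy_set(page, p))&lt;br /&gt;
             send_update(p, addr, value);  /* push the new value everywhere */&lt;br /&gt;
     resume(writer);                       /* return to the faulting instruction */&lt;br /&gt;
 }&lt;br /&gt;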
&lt;br /&gt;
As mentioned in the invalidation approach, each page must have an owner.  There are two basic ownership approaches for page tables: fixed and dynamic.  In a fixed ownership approach, one processor always owns the same page; other processors that want read or write access to the page must negotiate with the owner for that access.  Dynamic ownership has two subcategories: centralized and distributed management&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Centralized Management==&lt;br /&gt;
In centralized management for dynamic ownership, a single, central manager processor maintains a page table with an entry for each page, and each entry has three fields: owner, copy set, and lock.  The owner field holds the single processor that owns the page, which can also be thought of as the most recent processor to have had write access.  The copy set field lists all processors that have a copy of the page; this field lets the manager avoid a broadcast.  The lock field is used for synchronizing requests&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
==Distributed Management==&lt;br /&gt;
There are two schemes for distributed management: fixed and dynamic.  In fixed distributed management, each processor in the system is given a predetermined subset of pages to manage.  A simple, straightforward distribution of responsibility is to divide the pages evenly among the processors, although other distributions could be tailored to the application's needs&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a dynamic distributed management system, each processor keeps track of the ownership of all pages in its own local table.  In place of the owner field of the centralized management scheme, each entry has a probOwner field, which holds the probable owner of the page; when a page fault occurs, the request is sent toward that processor and forwarded along probOwner links until the true owner is found&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Memory_coherence  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Peterson%27s_algorithm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Li, K. and Hudak, P. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, 7(4):321-359, November 1989.  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html&lt;/div&gt;</summary>
		<author><name>Pmpatel</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Sharedmem.jpg&amp;diff=44562</id>
		<title>File:Sharedmem.jpg</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Sharedmem.jpg&amp;diff=44562"/>
		<updated>2011-03-20T23:26:24Z</updated>

		<summary type="html">&lt;p&gt;Pmpatel: Shared Memory&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Shared Memory&lt;/div&gt;</summary>
		<author><name>Pmpatel</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch7_jp&amp;diff=44560</id>
		<title>CSC/ECE 506 Spring 2011/ch7 jp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch7_jp&amp;diff=44560"/>
		<updated>2011-03-20T23:20:07Z</updated>

		<summary type="html">&lt;p&gt;Pmpatel: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction:=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;copy introduction to shared memory architecture from previous chapters &amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Advantages of shared memory system:==&lt;br /&gt;
&lt;br /&gt;
* Shared memory parallel programs and multi-threaded programs run on a shared memory system without modification.&lt;br /&gt;
* A single OS runs on a shared memory system, which simplifies maintenance, memory management, and scheduling.&lt;br /&gt;
* The total amount of memory is large, since it is simply the sum of the memory of the individual nodes.&lt;br /&gt;
* Communicating data between caches is faster than communication in the message passing model.&lt;br /&gt;
* Sharing can be done at a finer granularity than in message passing, which adds the overhead of creating and sending a message for each piece of information.&lt;br /&gt;
* Common data can be shared between threads running on different processors (e.g., read-only shared data).&lt;br /&gt;
&lt;br /&gt;
==Disadvantages of shared memory system:==&lt;br /&gt;
&lt;br /&gt;
* The cost of providing shared memory grows superlinearly with the number of processors, whereas the cost of a message passing system grows only linearly.&lt;br /&gt;
&lt;br /&gt;
==Hardware Support ==&lt;br /&gt;
A shared memory architecture needs some hardware support for its implementation, unlike a message passing model, which can rely on software to send and receive messages.  Three things are necessary, with respect to hardware support, for the correct execution of a shared memory parallel program on a multiprocessor system: &lt;br /&gt;
# Cache Coherence protocol&lt;br /&gt;
# Memory consistency model&lt;br /&gt;
# Hardware Synchronization&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
	Insert image 1,  source   http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple and easy, but in a multiprocessor system it is not as simple.  Data can be present in any processor's cache, and the protocol needs to ensure that the data is the same in all caches.  If it cannot ensure that all the caches hold the same data, it needs to flag the affected cache line to indicate that it is not up to date.&lt;br /&gt;
&lt;br /&gt;
The figure here shows a 4-processor shared memory system in which each processor has its own cache.  Suppose processor P1 reads memory location M1 and stores it in its local cache.  Then, if P2 reads the same memory location, M1 also gets stored in P2's cache.  Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different.  When P2 operates on M1, it uses the stale value of M1 that was stored in its cache.  It is the responsibility of the cache coherence protocol to prevent this; hardware support is needed to provide a coherent view of the data in multiple caches.  This is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but that is not true.  The cache write policy only controls how a change in the value of a cache line is propagated to a lower level cache or to main memory; it is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations.  In a single processor system, code will execute correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables.  But in a shared memory model with multiple processors, two threads could access shared data (something like a synchronization variable), and the output of the threads could change based on which thread reaches the shared data earlier.  If this happens, the program output on a uniprocessor system and on a multiprocessor system will differ.&lt;br /&gt;
&lt;br /&gt;
Maintaining program order is very important for memory consistency, but it comes with performance degradation.  The various memory consistency models trade off performance against ease of programming.   &lt;br /&gt;
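&lt;br /&gt;
The classic flag synchronization idiom shows why program order matters.  The sketch below is a hedged illustration in C with POSIX threads (the variable names are invented for this example); if the compiler or the hardware reorders the producer's two stores, the consumer can observe ready == 1 while data still holds its old value.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 int data = 0;&lt;br /&gt;
 int ready = 0;&lt;br /&gt;
 &lt;br /&gt;
 void *producer(void *arg) {&lt;br /&gt;
   data = 42;    // store 1&lt;br /&gt;
   ready = 1;    // store 2: must not become visible before store 1&lt;br /&gt;
   return 0;&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 void *consumer(void *arg) {&lt;br /&gt;
   while (ready == 0) { /* busy wait for the flag */ }&lt;br /&gt;
   // Under a relaxed consistency model this may print 0 unless a&lt;br /&gt;
   // fence (or C11 atomics) enforces the intended ordering.&lt;br /&gt;
   printf("data = %d\n", data);&lt;br /&gt;
   return 0;&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main(void) {&lt;br /&gt;
   pthread_t p, c;&lt;br /&gt;
   pthread_create(&amp;amp;c, 0, consumer, 0);&lt;br /&gt;
   pthread_create(&amp;amp;p, 0, producer, 0);&lt;br /&gt;
   pthread_join(p, 0);&lt;br /&gt;
   pthread_join(c, 0);&lt;br /&gt;
   return 0;&lt;br /&gt;
 }&lt;br /&gt;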
&lt;br /&gt;
=Hardware Synchronization=&lt;br /&gt;
&lt;br /&gt;
=Peterson's Algorithm=&lt;br /&gt;
Peterson’s algorithm, also known as Peterson’s solution, addresses the critical section problem by meeting its 3 criteria: mutual exclusion, progress, and bounded waiting&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 volatile int turn;                    // whose turn it is to yield on a tie&lt;br /&gt;
 volatile int interested[2] = {0, 0};  // 1 if that process wants the critical section&lt;br /&gt;
 &lt;br /&gt;
 void enter_region(int process)        // process is 0 or 1&lt;br /&gt;
 {&lt;br /&gt;
   int otherProcess = 1 - process;&lt;br /&gt;
   interested[process] = 1;            // announce interest&lt;br /&gt;
   turn = process;                     // the last process to write turn waits&lt;br /&gt;
   while (turn == process &amp;amp;&amp;amp; interested[otherProcess] == 1)&lt;br /&gt;
   {&lt;br /&gt;
     // busy wait until the other process leaves or defers&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 void leave_region(int process)&lt;br /&gt;
 {&lt;br /&gt;
   interested[process] = 0;            // withdraw interest; the other process may enter&lt;br /&gt;
 }&lt;br /&gt;
The algorithm uses two variables, interested and turn, as seen above.  Setting a process's interested flag to 1 indicates that the process wants to enter the critical section.  The variable turn records which process most recently asked for its turn; because each process sets turn to its own id and then waits while turn still equals that id, the process that wrote turn last is the one that defers, so neither process is permanently prioritized.&lt;br /&gt;
&lt;br /&gt;
To ensure mutual exclusion, the variables interested and turn together guarantee that only one process can be in the critical section at a time.  To ensure progress, once a process has left the critical section, the waiting process can immediately determine that it is allowed in next.  To ensure bounded waiting, as seen in the code above, a process waits no longer than one turn to enter the critical section, because the competing process sets turn to its own id on its next attempt and thereby yields.&lt;br /&gt;
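&lt;br /&gt;
A minimal usage sketch follows, assuming the enter_region/leave_region routines above and POSIX threads; the shared counter and thread bodies are invented for illustration.  Note that on real multiprocessors Peterson's algorithm also needs memory fences (or atomic accesses), for exactly the ordering reasons described in the memory consistency section.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 int counter = 0;   // shared state protected by Peterson's algorithm&lt;br /&gt;
 &lt;br /&gt;
 void *worker(void *arg) {&lt;br /&gt;
   int me = (int)(long)arg;            // process id: 0 or 1&lt;br /&gt;
   for (int i = 0; i != 100000; i++) {&lt;br /&gt;
     enter_region(me);                 // acquire the critical section&lt;br /&gt;
     counter = counter + 1;            // the critical section itself&lt;br /&gt;
     leave_region(me);                 // release it&lt;br /&gt;
   }&lt;br /&gt;
   return 0;&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main(void) {&lt;br /&gt;
   pthread_t t0, t1;&lt;br /&gt;
   pthread_create(&amp;amp;t0, 0, worker, (void *)0L);&lt;br /&gt;
   pthread_create(&amp;amp;t1, 0, worker, (void *)1L);&lt;br /&gt;
   pthread_join(t0, 0);&lt;br /&gt;
   pthread_join(t1, 0);&lt;br /&gt;
   printf("counter = %d (expected 200000)\n", counter);&lt;br /&gt;
   return 0;&lt;br /&gt;
 }&lt;br /&gt;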
&lt;br /&gt;
=Page tables=&lt;br /&gt;
An issue with multiprocessor systems is how to enter the critical section.  One software solution is to use page tables to ensure mutual exclusion, progress, and bounded waiting, the 3 criteria that need to be met&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Memory Coherence==&lt;br /&gt;
As with other forms of synchronization, page tables for multiprocessor systems must have strategies for handling data.  As discussed in Li and Hudak's article, there are two basic approaches to page synchronization: invalidation and write-broadcast.  In an invalidation approach, there is only 1 owner for each page, holding either read or write access to it.  In write-broadcast, as the name implies, a write is broadcast to all copies of the page data before control returns to the faulting instruction&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
As mentioned for the invalidation approach, each page must have an owner.  There are two basic ownership approaches for page tables: fixed and dynamic.  In a fixed ownership approach, a page is always owned by the same processor, and other processors that want read or write access to the page must negotiate for it.  For dynamic ownership, there are two subcategories: centralized and distributed management&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Centralized Management==&lt;br /&gt;
In centralized management of dynamic pages, a single, central processor maintains a page table with an entry for each page, and each entry has 3 fields: owner, copy set, and lock.  The owner field holds the single processor that owns the page, which can also be thought of as the most recent processor to have write access.  The copy set field lists all processors that have a copy of the page; this field helps avoid a broadcast.  The lock field is used for synchronizing requests&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
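&lt;br /&gt;
A sketch of such a page table entry in C follows (a hedged illustration: the types, widths, and all names other than owner, copy set, and lock are invented; Li and Hudak do not prescribe a concrete layout).&lt;br /&gt;
&lt;br /&gt;
 #define NUM_PAGES 1024&lt;br /&gt;
 &lt;br /&gt;
 typedef struct {&lt;br /&gt;
   int      owner;    // processor with write access (most recent writer)&lt;br /&gt;
   unsigned copyset;  // bitmask: bit p is set if processor p holds a copy&lt;br /&gt;
   int      lock;     // synchronizes concurrent requests for this page&lt;br /&gt;
 } page_entry_t;&lt;br /&gt;
 &lt;br /&gt;
 page_entry_t page_table[NUM_PAGES];  // kept on the central manager only&lt;br /&gt;
&lt;br /&gt;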
==Distributed Management==&lt;br /&gt;
There are 2 schemes for distributed management: fixed and dynamic.  In fixed distributed management, each processor in the system is given a predetermined subset of pages to manage.  A simple, straightforward distribution of responsibility is to divide the pages evenly among the processors, although there are other distributions that could be tailored to the application's needs&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a dynamic distributed management system, each processor keeps track of the ownership of all pages in its local table.  In place of the owner field of the centralized management scheme, there is a probOwner field, which gives the probable owner of the page to contact on a page fault&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
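&lt;br /&gt;
Because probOwner is only a hint, a fault handler may have to follow a chain of hints to reach the true owner.  A hedged sketch follows, reusing NUM_PAGES from the sketch above; the helper routines is_true_owner and prob_owner_at are invented for this illustration.&lt;br /&gt;
&lt;br /&gt;
 int prob_owner[NUM_PAGES];            // this processor's hint table&lt;br /&gt;
 &lt;br /&gt;
 // On a page fault, follow probOwner hints until the true owner is&lt;br /&gt;
 // found; the local hint is then updated for the next fault.&lt;br /&gt;
 int find_owner(int page) {&lt;br /&gt;
   int p = prob_owner[page];           // local hint: probable owner&lt;br /&gt;
   while (!is_true_owner(p, page)) {   // p may itself hold only a hint&lt;br /&gt;
     p = prob_owner_at(p, page);       // follow p's hint one hop&lt;br /&gt;
   }&lt;br /&gt;
   prob_owner[page] = p;               // remember the real owner locally&lt;br /&gt;
   return p;&lt;br /&gt;
 }&lt;br /&gt;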
&lt;br /&gt;
=References=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Peterson%27s_algorithm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Memory_coherence  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Li, K. and Hudak, P.  Memory Coherence in Shared Virtual Memory Systems.  ACM Transactions on Computer Systems, 7(4):321–359, November 1989.  &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Pmpatel</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_jp&amp;diff=43860</id>
		<title>CSC/ECE 506 Spring 2011/ch6a jp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_jp&amp;diff=43860"/>
		<updated>2011-02-27T00:52:27Z</updated>

		<summary type="html">&lt;p&gt;Pmpatel: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Cache Hierarchy=&lt;br /&gt;
[[Image: memchart.jpg|thumbnail|right|Memory Hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
In a simple computer model, the processor reads data and instructions from memory and operates on the data.  The operating frequency of CPUs has increased faster than the speed of memory and memory interconnects.  For example, the cores in Intel's first generation i7 processors run at a 3.2 GHz frequency, while the memory runs at only 1.3 GHz.  Also, multi-core architectures started putting more demand on memory bandwidth.  This increases the latency of memory accesses, and the CPU would have to sit idle much of the time.  Because of this, memory became a performance bottleneck.   &lt;br /&gt;
&lt;br /&gt;
To solve this problem, the “cache” was invented.  A cache is simply a temporary volatile storage space, like primary memory, but running at a speed similar to the core frequency.  The CPU can access data and instructions from the cache in a few clock cycles, while accessing data from main memory can take more than 50 cycles.  In the early days of computing, the cache was implemented as a stand-alone chip outside the processor.  In today’s processors, the cache is implemented on the same die as the cores.  &lt;br /&gt;
&lt;br /&gt;
There can be multiple levels of caches, each level farther from the core and larger in size.  L1 is closest to the CPU and, as a result, fastest to access.  Next to L1 is the L2 cache, and then L3.  The L1 cache is divided into an instruction cache and a data cache.  This is better than having a combined larger cache, since the instruction cache, being read-only, is easy to implement, while the data cache is read-write.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{| border='1' class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center&amp;quot;&lt;br /&gt;
|+style=&amp;quot;white-space:nowrap&amp;quot;| Table 1: Cache on different Microprocessors&lt;br /&gt;
|-&lt;br /&gt;
! Company &amp;amp; Processor !! # cores !! L1 cache per core !! L2 cache !! L3 cache !! Year&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core || 2 || I:32KB   D:32KB || 1MB 8 way set assoc.  || -  ||  2006&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Clovertown || 2 || I:4*32KB   D:4*32KB || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon 3200 series || 4 || - || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64FX || 2 || I:64KB     D:4KB  Both 2-way set assoc. || 1MB 16way set assoc. || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64X2 || 2 || I:64KB     D:4KB  Both 2-way set assoc. || 512KB/1MB 16way set assoc. || 2MB || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Barcelona || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Aug 2007&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems Ultra Sparc T2 || 8 ||  I:16KB   D:8KB || 4MB (8 banks) 16way set assoc. || - || Oct 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Wolfdale DP || 2 ||  D:96KB  || 6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Hapertown || 4 ||  D:96KB  || 2*6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Phenom || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Nov 2007    Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo || 2 ||  I:32KB   D:32KB  || 2/4MB 8 way set assoc. || - ||  2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Penryn Wolfdale DP || 4 ||  -  || 6-12MB || 6MB || Mar 2008     Aug 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Quad Yorkfield || 4 ||  D:96KB  || 12MB  || - ||  Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| AMD Toliman || 3K10 || I:64KB   D:64KB  || 512KB || 2MB Shared || Apr 2008&lt;br /&gt;
|-&lt;br /&gt;
| Azul Systems Vega3 7300 Series || 864 || 768GB  || - || - || May 2008&lt;br /&gt;
|-&lt;br /&gt;
| IBM RoadRunner || 8+1 || 32KB  || 512KB || - || Jun 2008&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies=&lt;br /&gt;
&lt;br /&gt;
==Write hit policies==&lt;br /&gt;
*Write-through: Also known as store-through, this policy writes to main memory whenever a write is performed to the cache.&lt;br /&gt;
*Write-back: Also known as store-in or copy-back, this policy writes to main memory only when a block of data is purged from the cache (see the sketch below).&lt;br /&gt;
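&lt;br /&gt;
The difference between the two hit policies is easiest to see in a toy single-line cache model.  The C sketch below is a hedged illustration only; it ignores tags, associativity, and misses, and all names are invented.&lt;br /&gt;
&lt;br /&gt;
 typedef struct {&lt;br /&gt;
   int value;&lt;br /&gt;
   int dirty;            // used only by the write-back policy&lt;br /&gt;
 } line_t;&lt;br /&gt;
 &lt;br /&gt;
 line_t line;&lt;br /&gt;
 int main_memory;&lt;br /&gt;
 &lt;br /&gt;
 void write_through(int v) {&lt;br /&gt;
   line.value = v;&lt;br /&gt;
   main_memory = v;      // every cache write also goes to memory&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 void write_back(int v) {&lt;br /&gt;
   line.value = v;&lt;br /&gt;
   line.dirty = 1;       // memory is updated later, on eviction&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 void evict(void) {&lt;br /&gt;
   if (line.dirty) {&lt;br /&gt;
     main_memory = line.value;  // copy back the purged block&lt;br /&gt;
     line.dirty = 0;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;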
&lt;br /&gt;
==Write miss policies==&lt;br /&gt;
*Write-allocate vs. no-write-allocate: When a write misses in the cache, a line may or may not be allocated for the block.  With write-allocate, a line is allocated in the cache for the written data; this policy is typically associated with write-back caches.  With no-write-allocate, no line is allocated.&lt;br /&gt;
*Fetch-on-write vs. no-fetch-on-write: Fetch-on-write causes the block to be fetched from a lower level of the memory hierarchy on every write miss.  &lt;br /&gt;
*Write-before-hit vs. no-write-before-hit:  Write-before-hit writes the data to the cache before checking the cache tags for a match.  In case of a miss, this displaces the block already in the cache.&lt;br /&gt;
&lt;br /&gt;
==Combination Policies==&lt;br /&gt;
* Write-validate: A combination of no-fetch-on-write and write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  The policy allows partial lines to be written to the cache on a miss.  It provides better performance, works with machines that have various line sizes, and does not add instruction execution overhead to the program being run.  Write-validate requires that the lower level of the memory system supports writes of partial lines.&lt;br /&gt;
* Write-invalidate: A combination of write-before-hit, no-fetch-on-write, and no-write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy invalidates the line when there is a miss.&lt;br /&gt;
* Write-around:  A combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy uses a non-blocking write scheme: it writes the data to the next lower level of the hierarchy without modifying the cache line.&lt;br /&gt;
&lt;br /&gt;
=Prefetching=&lt;br /&gt;
A cache is very efficient in terms of access time once the data or instructions are in the cache.  But when a process tries to access something that is not already in the cache, a cache miss occurs and the data needs to be brought into the cache from memory.  Cache misses are generally expensive for performance, as the processor has to wait for the data (in parallel processing, a processor can execute other tasks while it is waiting on data, but there is some overhead for this).  Prefetching is a technique in which data is brought into the cache before the program needs it; in other words, it is a way to reduce cache misses.  Prefetching uses some type of prediction mechanism to anticipate the data that will be needed next and brings it into the cache.  It is not guaranteed that the prefetched data will be used.  The goal is to reduce cache misses and so improve overall performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Some architectures have instructions to prefetch data into the cache.  Programmers and compilers can insert these prefetch instructions into the code; this is known as software prefetching.  In hardware prefetching, the processor observes the system's behavior and issues the prefetch requests itself.  The Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch.   Graphics processing units benefit from prefetching due to the spatial locality of their data&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.   &lt;br /&gt;
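&lt;br /&gt;
As a concrete software prefetching example, GCC and Clang expose a __builtin_prefetch(addr, rw, locality) intrinsic that compiles to the target's prefetch instruction where one exists.  The loop below is an invented illustration; the prefetch distance of 16 elements is a tunable guess, not a recommendation.&lt;br /&gt;
&lt;br /&gt;
 // Sum an array while prefetching ahead of the current element.&lt;br /&gt;
 long sum(const int *a, int n) {&lt;br /&gt;
   long s = 0;&lt;br /&gt;
   for (int i = 0; i != n; i++) {&lt;br /&gt;
     // Hint that a[i + 16] will be read soon (rw = 0) with high&lt;br /&gt;
     // temporal locality (3).  A useless or out-of-range hint is&lt;br /&gt;
     // harmless: data prefetches do not fault, they are simply wasted.&lt;br /&gt;
     __builtin_prefetch(&amp;amp;a[i + 16], 0, 3);&lt;br /&gt;
     s += a[i];&lt;br /&gt;
   }&lt;br /&gt;
   return s;&lt;br /&gt;
 }&lt;br /&gt;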
&lt;br /&gt;
==Advantages==&lt;br /&gt;
*Improves overall performance by reducing cache misses.&lt;br /&gt;
==Disadvantages==&lt;br /&gt;
* Wastes bandwidth when prefetched data is not used.&lt;br /&gt;
* Hardware prefetching requires complex architecture.  Second order effect is cost of implementation on silicon and validation costs.&lt;br /&gt;
* Software prefetching adds additional instructions to the program, making the program larger.&lt;br /&gt;
* If same cache is used for prefetching, then prefetching could cause other cache blocks to be evicted.  If the evicted blocks are needed, then that will generate a cache miss.  This can be prevented by having a separate cache for prefetching but it comes with hardware costs.&lt;br /&gt;
* When scheduler changes the task running on a processor, prefetched data may become useless.&lt;br /&gt;
&lt;br /&gt;
==Effectiveness==&lt;br /&gt;
Prefetching effectiveness can be tracked by the following metrics:&lt;br /&gt;
# Coverage is the fraction of original cache misses that prefetching turns into cache hits.&lt;br /&gt;
# Accuracy is the fraction of prefetches that are useful.&lt;br /&gt;
# Timeliness measures how early the prefetched data arrives.&lt;br /&gt;
Ideally, a system should have high coverage, high accuracy, and optimum timeliness.  Realistically, aggressive prefetching can increase coverage but decrease accuracy, and vice versa.  Also, if prefetching is done too early, the fetched data may have to be evicted before being used, and if done too late, it will not prevent the cache miss.&lt;br /&gt;
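&lt;br /&gt;
In code, the first two metrics reduce to simple ratios of event counters (a hedged sketch; the counter names are invented):&lt;br /&gt;
&lt;br /&gt;
 // useful_prefetches: prefetched blocks that were later referenced&lt;br /&gt;
 // baseline_misses:   misses the same run would take with no prefetching&lt;br /&gt;
 // total_prefetches:  all prefetches issued&lt;br /&gt;
 double coverage(long useful_prefetches, long baseline_misses) {&lt;br /&gt;
   return (double)useful_prefetches / baseline_misses;&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 double accuracy(long useful_prefetches, long total_prefetches) {&lt;br /&gt;
   return (double)useful_prefetches / total_prefetches;&lt;br /&gt;
 }&lt;br /&gt;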
&lt;br /&gt;
==Stream Buffer Prefetching==&lt;br /&gt;
This is a prefetching technique which uses a FIFO buffer in which each entry is a cache line with an address (or tag) and an available bit.  The system prefetches a stream of sequential data into a stream buffer, and multiple stream buffers can be used to prefetch multiple streams in parallel.  On a cache access, the head entries of the stream buffers are checked for a match along with the cache check.  If the block is not found in the cache but is found at the head of a stream buffer, it is moved to the cache and the next entry in the buffer becomes the head.  If the block is found in neither the cache nor the head of a buffer, the data is brought from memory into the cache and the subsequent address is assigned to a new stream buffer.  Only the heads of the stream buffers are checked during a cache access, not the whole buffer; checking all the entries in all the buffers would increase hardware complexity.&lt;br /&gt;
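&lt;br /&gt;
A hedged C sketch of the head-only check follows (the FIFO depth, field names, and the helper routines move_to_cache and prefetch_into are invented; real designs add tag comparators and allocation logic):&lt;br /&gt;
&lt;br /&gt;
 #define DEPTH 4&lt;br /&gt;
 &lt;br /&gt;
 typedef struct {&lt;br /&gt;
   unsigned addr[DEPTH];    // block address (tag) of each entry&lt;br /&gt;
   int      avail[DEPTH];   // available bit: prefetched data has arrived&lt;br /&gt;
   int      head;           // only this entry is compared on a cache access&lt;br /&gt;
 } stream_buf_t;&lt;br /&gt;
 &lt;br /&gt;
 // Returns 1 on a stream buffer hit: the block moves to the cache, the&lt;br /&gt;
 // next entry becomes the head, and the next sequential block is&lt;br /&gt;
 // prefetched into the freed slot.&lt;br /&gt;
 int sbuf_lookup(stream_buf_t *sb, unsigned block) {&lt;br /&gt;
   int h = sb-&amp;gt;head;&lt;br /&gt;
   if (sb-&amp;gt;avail[h] &amp;amp;&amp;amp; sb-&amp;gt;addr[h] == block) {&lt;br /&gt;
     move_to_cache(block);&lt;br /&gt;
     sb-&amp;gt;head = (h + 1) % DEPTH;&lt;br /&gt;
     prefetch_into(sb, block + DEPTH);&lt;br /&gt;
     return 1;&lt;br /&gt;
   }&lt;br /&gt;
   return 0;   // miss: caller fetches from memory and may allocate&lt;br /&gt;
               // a new stream buffer starting at block + 1&lt;br /&gt;
 }&lt;br /&gt;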
&lt;br /&gt;
[Insert image here: cache hit improvement vs. number of stream buffers]&lt;br /&gt;
&lt;br /&gt;
The plot above shows the cache hit improvement with respect to the number of stream buffers for different programs.  The graph on the left compares 8 programs from the NAS suite, while the graph on the right shows programs from the Unix utility suite&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  &lt;br /&gt;
==Prefetching in Parallel Computing==&lt;br /&gt;
On a uniprocessor system, prefetching definitely helps performance.  On a multiprocessor system prefetching is also useful, but there are tighter constraints on the implementation because data is shared between different processors or cores.  In the message passing parallel model, each parallel thread has its own memory space, and prefetching can be implemented in the same way as on a uniprocessor system.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In shared memory parallel programming, multiple threads running on different processors share a common memory space.  If multiple processors share a common cache, then the prefetching implementation is similar to a uniprocessor system.  Difficulties arise when each core has its own cache.  Some of the scenarios that can occur are:&lt;br /&gt;
# Processor P1 has prefetched some data D1 into its stream buffer but has not used it yet.  Meanwhile, processor P2 reads D1 into its cache and modifies it, and one of the cache coherence protocols informs P1 about the change.  Since D1 is not in P1's cache, P1 may simply ignore the notification.  Now, when P1 tries to read D1, it will get the stale data from its stream buffer.  One way to prevent this is to extend the stream buffers so that they can update their data just like a cache; this adds complexity to the architecture and increases cost&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
# The prefetching mechanism can fetch data D1, D2, D3, ..., D10 into P1's buffer.  Due to the parallel decomposition, P1 only needs to operate on D1 to D5, while P2 operates on the remaining data.  Bandwidth was wasted fetching D6 to D10 into P1's buffer even though P1 did not use them.  There is a trade-off to be made here: if prefetching is very conservative it will lead to misses, and if it is aggressive it will waste bandwidth.&lt;br /&gt;
# In a multiprocessor system, if threads are not bound to a core, the operating system can rebalance the threads across cores.  This requires the prefetched buffers to be discarded.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Computer Design &amp;amp; Technology, lecture slides by Prof. Eric Rotenberg  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Fundamentals of Parallel Computer Architecture by Prof. Yan Solihin  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; Architecture of Parallel Computers, lecture slides by Prof. Edward Gehringer &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; “Parallel Computer Architecture: A Hardware/Software Approach” by David E. Culler, Jaswinder Pal Singh, and Anoop Gupta (pg 887)    &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; “Evaluating Stream Buffers as a Secondary Cache Replacement” by Subbarao Palacharla and R. E. Kessler &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Instruction_prefetch &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Pmpatel</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_jp&amp;diff=43859</id>
		<title>CSC/ECE 506 Spring 2011/ch6a jp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_jp&amp;diff=43859"/>
		<updated>2011-02-27T00:51:47Z</updated>

		<summary type="html">&lt;p&gt;Pmpatel: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Cache Hierarchy=&lt;br /&gt;
[[Image: memchart.jpg|thumbnail|right|Memory Hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
In a simple computer model, processor reads data and instructions from the memory and operates on the data.  Operating frequency of CPU increased faster than the speed of memory and memory interconnects.  For example, cores in Intel first generation i7 processors run at 3.2 GHz frequency, while the memory only runs at 1.3GHz frequency.  Also, multi-core architecture started putting more demand on memory bandwidth.  This increases the latency in memory access and CPU will have to be idle for most of the time.  Due to this, memory became a bottle neck in performance.   &lt;br /&gt;
&lt;br /&gt;
To solve this problem, “cache” was invented.  Cache is simply a temporary volatile storage space like primary memory but runs at the speed similar to core frequency.  CPU can access data and instructions from cache in few clock cycles while accessing data from main memory can take more than 50 cycles.  In early days of computing, cache was implemented as a stand alone chip outside the processor.  In today’s processors, cache is implemented on same die as core.  &lt;br /&gt;
&lt;br /&gt;
There can be multiple levels of caches, each cache subsequently away from the core and larger in size.  L1 is closest to the CPU and as a result, fastest to excess.  Next to L1 is L2 cache and then L3.  L1 cache is divided into instruction cache and data cache. This is better than having a combined larger cache as instruction cache being read-only is easy to implement while data cache is read-write.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{| border='1' class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center&amp;quot;&lt;br /&gt;
|+style=&amp;quot;white-space:nowrap&amp;quot;| Processor Architecture&lt;br /&gt;
|-&lt;br /&gt;
! Company &amp;amp; Processor !! # cores !! L1 cache Per core !! L1 cache Per core !! L3 cache  !! Year&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core || 2 || I:32KB   D:32KB || 1MB 8 way set assoc.  || -  ||  2006&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Clovertown || 2 || I:4*32KB   D:4*32KB || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon 3200 series || 4 || - || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64FX || 2 || I:64KB     D:64KB  Both 2-way set assoc. || 1MB 16-way set assoc. || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64X2 || 2 || I:64KB     D:64KB  Both 2-way set assoc. || 512KB/1MB 16-way set assoc. || 2MB || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Barcelona || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Aug 2007&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC T2 || 8 ||  I:16KB   D:8KB || 4MB (8 banks) 16-way set assoc. || - || Oct 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Wolfdale DP || 2 ||  D:96KB  || 6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Harpertown || 4 ||  D:96KB  || 2*6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Phenom || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Nov 2007    Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo || 2 ||  I:32KB   D:32KB  || 2/4MB 8-way set assoc. || - ||  2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Penryn Wolfdale DP || 4 ||  -  || 6-12MB || 6MB || Mar 2008     Aug 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Quad Yorkfield || 4 ||  D:96KB  || 12MB  || - ||  Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| AMD Toliman || 3 (K10) || I:64KB   D:64KB  || 512KB || 2MB Shared || Apr 2008&lt;br /&gt;
|-&lt;br /&gt;
| Azul Systems Vega3 7300 Series || 864 || 768GB  || - || - || May 2008&lt;br /&gt;
|-&lt;br /&gt;
| IBM RoadRunner || 8+1 || 32KB  || 512KB || - || Jun 2008&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies=&lt;br /&gt;
&lt;br /&gt;
==Write hit policies==&lt;br /&gt;
*Write-through: Also known as store-through, this policy writes to main memory whenever a write is performed to the cache.&lt;br /&gt;
*Write-back: Also known as store-in or copy-back, this policy writes to main memory only when a block of data is purged from the cache.  (The sketch after the policy lists below contrasts the two.)&lt;br /&gt;
&lt;br /&gt;
==Write miss policies==&lt;br /&gt;
*Write-allocate vs. no-write-allocate: When a write misses in the cache, a line may or may not be allocated for the block.  With write-allocate, a line is allocated in the cache for the written data; this policy is typically associated with write-back caches.  With no-write-allocate, no line is allocated.&lt;br /&gt;
*Fetch-on-write vs. no-fetch-on-write: Fetch-on-write causes the block to be fetched from a lower level of the memory hierarchy on every write miss.  &lt;br /&gt;
*Write-before-hit vs. no-write-before-hit:  Write-before-hit writes the data into the cache before checking the cache tags for a match.  In case of a miss, this displaces the block already in the cache.&lt;br /&gt;
&lt;br /&gt;
==Combination Policies==&lt;br /&gt;
* Write-validate: A combination of no-fetch-on-write and write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  The policy allows partial lines to be written to the cache on a miss.  It performs well, works with machines that have various line sizes, and adds no instruction-execution overhead to the running program.  Write-validate requires that the lower levels of the memory system support writes of partial lines.&lt;br /&gt;
* Write-invalidate: A combination of write-before-hit, no-fetch-on-write, and no-write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy invalidates the line when there is a miss.&lt;br /&gt;
* Write-around:  A combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy uses a non-blocking write scheme: it writes the data to the next lower level without modifying the cache line.&lt;br /&gt;
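&lt;br /&gt;
To make the interaction of the write-hit and write-miss policies concrete, the sketch below simulates a tiny direct-mapped cache over a made-up trace of store addresses and counts how many writes reach main memory under two policy pairs.  The cache size and trace are assumptions for illustration, not a model of any particular machine.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
#include &amp;lt;string.h&amp;gt;&lt;br /&gt;
&lt;br /&gt;
#define LINES 4   /* toy direct-mapped cache: 4 lines (assumed) */&lt;br /&gt;
&lt;br /&gt;
static int tag[LINES], valid[LINES], dirty[LINES];&lt;br /&gt;
static int mem_writes;    /* stores that actually reach main memory */&lt;br /&gt;
&lt;br /&gt;
/* Apply one store to 'block' under the chosen policy pair. */&lt;br /&gt;
static void store(int block, int write_back, int write_allocate) {&lt;br /&gt;
    int set = block % LINES, t = block / LINES;&lt;br /&gt;
    int hit = valid[set] &amp;amp;&amp;amp; tag[set] == t;&lt;br /&gt;
    if (!hit &amp;amp;&amp;amp; write_allocate) {&lt;br /&gt;
        if (write_back &amp;amp;&amp;amp; valid[set] &amp;amp;&amp;amp; dirty[set])&lt;br /&gt;
            mem_writes++;             /* write back the evicted dirty line */&lt;br /&gt;
        tag[set] = t; valid[set] = 1; dirty[set] = 0;&lt;br /&gt;
        hit = 1;                      /* the line is now present */&lt;br /&gt;
    }&lt;br /&gt;
    if (hit &amp;amp;&amp;amp; write_back)&lt;br /&gt;
        dirty[set] = 1;               /* defer the memory write until eviction */&lt;br /&gt;
    else&lt;br /&gt;
        mem_writes++;                 /* write-through, or a no-allocate miss */&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
int main(void) {&lt;br /&gt;
    int trace[6] = { 1, 1, 1, 5, 5, 2 };   /* hypothetical store addresses */&lt;br /&gt;
    for (int p = 0; p != 2; p++) {&lt;br /&gt;
        memset(valid, 0, sizeof valid);&lt;br /&gt;
        memset(dirty, 0, sizeof dirty);&lt;br /&gt;
        mem_writes = 0;&lt;br /&gt;
        for (int i = 0; i != 6; i++) store(trace[i], p, p);&lt;br /&gt;
        /* dirty lines still in the cache would be written back when evicted */&lt;br /&gt;
        printf(&amp;quot;%s: %d memory writes\n&amp;quot;,&lt;br /&gt;
               p ? &amp;quot;write-back + write-allocate&amp;quot; : &amp;quot;write-through + no-allocate&amp;quot;,&lt;br /&gt;
               mem_writes);&lt;br /&gt;
    }&lt;br /&gt;
    return 0;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
The repeated stores to block 1 are where write-back wins: only the first store dirties the line, and memory sees a single write-back at eviction instead of one write per store.&lt;br /&gt;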
&lt;br /&gt;
=Prefetching=&lt;br /&gt;
A cache is very efficient in terms of access time once the data or instructions are in it.  But when the processor tries to access something that is not already in the cache, a cache miss occurs and the block must be brought into the cache from memory.  Cache misses are generally expensive because the processor has to wait for the data (in parallel processing, a processor can execute other tasks while waiting, but this adds some overhead).  Prefetching is a technique in which data is brought into the cache before the program needs it; in other words, it is a way to reduce cache misses.  Prefetching uses some type of prediction mechanism to anticipate which data will be needed next and brings it into the cache.  There is no guarantee that the prefetched data will be used; the goal is simply to reduce cache misses and improve overall performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Some architectures have instructions to prefetch data into the cache.  Programmers and compilers can insert these prefetch instructions in the code; this is known as software prefetching.  In hardware prefetching, the processor observes the system’s behavior and issues prefetch requests itself.  The Intel 8086 and Motorola 68000 were among the first microprocessors to implement instruction prefetch.  Graphics processing units benefit from prefetching because of the spatial locality of their data&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.   &lt;br /&gt;
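&lt;br /&gt;
As an example of software prefetching, GCC and Clang expose the __builtin_prefetch builtin, which compiles to the target’s prefetch instruction where one exists.  A minimal sketch follows; the prefetch distance of 16 elements is an arbitrary assumption that would need tuning per machine.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;stddef.h&amp;gt;&lt;br /&gt;
&lt;br /&gt;
/* Sum an array, prefetching 16 elements ahead of the current load.&lt;br /&gt;
   Too small a distance and the data arrives late; too large and it&lt;br /&gt;
   may be evicted before use (the timeliness trade-off below). */&lt;br /&gt;
double sum(const double *a, size_t n) {&lt;br /&gt;
    double s = 0.0;&lt;br /&gt;
    for (size_t i = 0; i != n; i++) {&lt;br /&gt;
        /* args: address, 0 = prefetch for read, 1 = low temporal locality.&lt;br /&gt;
           Prefetching past the end is safe; the builtin never faults. */&lt;br /&gt;
        __builtin_prefetch(a + i + 16, 0, 1);&lt;br /&gt;
        s += a[i];&lt;br /&gt;
    }&lt;br /&gt;
    return s;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;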
&lt;br /&gt;
==Advantages==&lt;br /&gt;
*Improves overall performance by reducing cache misses.&lt;br /&gt;
==Disadvantages==&lt;br /&gt;
* Wastes bandwidth when prefetched data is not used.&lt;br /&gt;
* Hardware prefetching requires a more complex architecture.  Second-order effects are the cost of implementing it in silicon and the cost of validating it.&lt;br /&gt;
* Software prefetching adds instructions to the program, making the program larger.&lt;br /&gt;
* If the same cache is used for prefetching, prefetches can evict other cache blocks.  If the evicted blocks are needed again, that generates cache misses.  This can be prevented by using a separate buffer for prefetched data, but at additional hardware cost.&lt;br /&gt;
* When the scheduler changes the task running on a processor, the prefetched data may become useless.&lt;br /&gt;
&lt;br /&gt;
==Effectiveness==&lt;br /&gt;
Prefetching effectiveness can be tracked with the following metrics:&lt;br /&gt;
# Coverage is the fraction of the original cache misses that become hits because the data was prefetched in time.&lt;br /&gt;
# Accuracy is the fraction of prefetches that are useful.&lt;br /&gt;
# Timeliness measures how early the prefetches arrive.&lt;br /&gt;
Ideally, a system should have high coverage, high accuracy, and optimal timeliness.  Realistically, aggressive prefetching can increase coverage but decrease accuracy, and vice versa.  Also, if prefetching is done too early, the fetched data may be evicted before it is used; if done too late, the access still results in a cache miss.&lt;br /&gt;
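In counter form, the first two metrics are simple ratios.  A sketch follows; how a simulator decides that a prefetch was “useful” varies by methodology.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/* Prefetch-effectiveness counters, following the definitions above. */&lt;br /&gt;
struct pf_stats {&lt;br /&gt;
    long misses_without_pf;  /* misses the program would incur with no prefetching */&lt;br /&gt;
    long useful_pf;          /* prefetched blocks that were referenced in time */&lt;br /&gt;
    long total_pf;           /* all prefetches issued */&lt;br /&gt;
};&lt;br /&gt;
&lt;br /&gt;
/* Fraction of the original misses removed by prefetching. */&lt;br /&gt;
double coverage(struct pf_stats s) {&lt;br /&gt;
    return (double)s.useful_pf / (double)s.misses_without_pf;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
/* Fraction of issued prefetches that turned out to be useful. */&lt;br /&gt;
double accuracy(struct pf_stats s) {&lt;br /&gt;
    return (double)s.useful_pf / (double)s.total_pf;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Issuing more prefetches tends to grow useful_pf (raising coverage) but to grow total_pf even faster (lowering accuracy), which is the tension noted above.&lt;br /&gt;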
&lt;br /&gt;
==Stream Buffer Prefetching==&lt;br /&gt;
This prefetching technique uses a FIFO buffer in which each entry holds a cache line along with its address (tag) and an available bit.  The system prefetches a stream of sequential data into a stream buffer, and multiple stream buffers can be used to prefetch multiple streams in parallel.  On a cache access, the head entries of the stream buffers are checked for a match in parallel with the cache lookup.  If the block is not found in the cache but is found at the head of a stream buffer, it is moved into the cache and the next entry in the buffer becomes the head.  If the block is found in neither the cache nor at the head of any buffer, the data is brought from memory into the cache and the subsequent address is assigned to a new stream buffer.  Only the heads of the stream buffers are checked during a cache access, not the whole buffer; checking all entries in all buffers would increase hardware complexity.&lt;br /&gt;
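&lt;br /&gt;
A sketch of this lookup path follows; the number and depth of the buffers are assumed, and the cache itself and the prefetch engine that refills the tail entries are left out.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#define NBUF  4   /* number of stream buffers (assumed) */&lt;br /&gt;
#define DEPTH 8   /* entries per buffer (assumed) */&lt;br /&gt;
&lt;br /&gt;
struct sbuf { int tag[DEPTH]; int avail[DEPTH]; int head; };&lt;br /&gt;
static struct sbuf sb[NBUF];&lt;br /&gt;
&lt;br /&gt;
/* Probe only the head entry of each buffer, in parallel with the cache&lt;br /&gt;
   lookup in hardware (sequentially here, for clarity).  Returns 1 on a&lt;br /&gt;
   stream-buffer hit. */&lt;br /&gt;
int sbuf_lookup(int block) {&lt;br /&gt;
    for (int b = 0; b != NBUF; b++) {&lt;br /&gt;
        int h = sb[b].head;&lt;br /&gt;
        if (sb[b].avail[h] &amp;amp;&amp;amp; sb[b].tag[h] == block) {&lt;br /&gt;
            sb[b].avail[h] = 0;              /* entry moves into the cache */&lt;br /&gt;
            sb[b].head = (h + 1) % DEPTH;    /* next entry becomes the head */&lt;br /&gt;
            return 1;&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    return 0;   /* miss everywhere: fetch the block from memory and allocate&lt;br /&gt;
                   a new stream buffer starting at block + 1 (not shown) */&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;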
 INSERT IMAGE HERE&lt;br /&gt;
The plot above shows the cache-hit improvement with respect to the number of stream buffers for different programs.  The graph on the left compares eight programs from the NAS suite, while the graph on the right shows programs from the Unix utility suite&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  &lt;br /&gt;
==Prefetching in Parallel Computing==&lt;br /&gt;
On a uniprocessor system, prefetching is clearly helpful for performance.  On a multiprocessor system prefetching is still useful, but there are tighter constraints on its implementation because data is shared between different processors or cores.  In the message-passing parallel model, each parallel thread has its own memory space, and prefetching can be implemented the same way as on a uniprocessor system.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In shared-memory parallel programming, multiple threads running on different processors share a common memory space.  If the processors share a common cache, prefetching is implemented much as on a uniprocessor system.  Difficulties arise when each core has its own cache.  Some of the scenarios that can occur are:&lt;br /&gt;
# Processor P1 has prefetched some data D1 into its stream buffer but has not used it yet.  Meanwhile, processor P2 reads D1 into its cache and modifies it, and the cache coherence protocol informs P1 of the change.  Since D1 is not in P1’s cache, P1 may simply ignore the message; when P1 later tries to read D1, it gets stale data from its stream buffer.  One way to prevent this is to extend the stream buffers so that they can modify their data just like a cache, but this adds complexity to the architecture and increases cost&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  A cheaper alternative is sketched after this list.&lt;br /&gt;
# The prefetching mechanism may fetch data D1, D2, D3, ..., D10 into P1’s buffer.  Due to the parallel decomposition, P1 only needs to operate on D1 to D5 while P2 operates on the rest, so the bandwidth spent fetching D6 to D10 into P1 is wasted.  There is a trade-off here: very conservative prefetching leads to misses, while aggressive prefetching wastes bandwidth.&lt;br /&gt;
# If threads are not bound to a core, the operating system can rebalance threads across cores.  This requires the prefetched buffers to be discarded.&lt;br /&gt;
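&lt;br /&gt;
Reusing the sbuf declarations from the stream-buffer sketch above, one cheaper alternative for scenario 1 is to probe every buffer entry on an invalidation and simply drop stale copies, rather than making the buffers writable.  This is a sketch of the idea, not a description of any real implementation.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/* Called when the coherence protocol invalidates 'block' (scenario 1).&lt;br /&gt;
   Probing all entries is acceptable here because invalidations are much&lt;br /&gt;
   rarer than cache accesses, which still check only the buffer heads. */&lt;br /&gt;
void on_invalidate(int block) {&lt;br /&gt;
    for (int b = 0; b != NBUF; b++)&lt;br /&gt;
        for (int e = 0; e != DEPTH; e++)&lt;br /&gt;
            if (sb[b].avail[e] &amp;amp;&amp;amp; sb[b].tag[e] == block)&lt;br /&gt;
                sb[b].avail[e] = 0;   /* next access refetches fresh data */&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;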
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Computer Design &amp;amp; Technology, lecture slides by Prof. Eric Rotenberg  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Fundamentals of Parallel Computer Architecture by Prof. Yan Solihin  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; Architecture of Parallel Computers, Lecture slides by Prof. Edward Gehringer &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; “Parallel Computer Architecture: A Hardware/Software Approach” by David E. Culler, Jaswinder Pal Singh, and Anoop Gupta (pg. 887)  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; “Evaluating Stream Buffers as a Secondary Cache Replacement” by Subbarao Palacharla and R. E. Kessler.  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Instruction_prefetch &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Pmpatel</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_jp&amp;diff=43858</id>
		<title>CSC/ECE 506 Spring 2011/ch6a jp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_jp&amp;diff=43858"/>
		<updated>2011-02-27T00:51:08Z</updated>

		<summary type="html">&lt;p&gt;Pmpatel: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Cache Hierarchy=&lt;br /&gt;
[[Image: memchart.jpg|thumbnail|right|Memory Hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
In a simple computer model, processor reads data and instructions from the memory and operates on the data.  Operating frequency of CPU increased faster than the speed of memory and memory interconnects.  For example, cores in Intel first generation i7 processors run at 3.2 GHz frequency, while the memory only runs at 1.3GHz frequency.  Also, multi-core architecture started putting more demand on memory bandwidth.  This increases the latency in memory access and CPU will have to be idle for most of the time.  Due to this, memory became a bottle neck in performance.   &lt;br /&gt;
&lt;br /&gt;
To solve this problem, “cache” was invented.  Cache is simply a temporary volatile storage space like primary memory but runs at the speed similar to core frequency.  CPU can access data and instructions from cache in few clock cycles while accessing data from main memory can take more than 50 cycles.  In early days of computing, cache was implemented as a stand alone chip outside the processor.  In today’s processors, cache is implemented on same die as core.  &lt;br /&gt;
&lt;br /&gt;
There can be multiple levels of caches, each cache subsequently away from the core and larger in size.  L1 is closest to the CPU and as a result, fastest to excess.  Next to L1 is L2 cache and then L3.  L1 cache is divided into instruction cache and data cache. This is better than having a combined larger cache as instruction cache being read-only is easy to implement while data cache is read-write.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies=&lt;br /&gt;
&lt;br /&gt;
==Write hit policies==&lt;br /&gt;
*Write-through: Also known as store-through, this policy will write to main memory whenever a write is performed to cache.&lt;br /&gt;
*Write-back: Also known as store-in or copy-back, this policy will write to main memory only when a block of data is purged from the cache storage.&lt;br /&gt;
&lt;br /&gt;
==Write miss policies==&lt;br /&gt;
*Write-allocate vs Write no-allocate: When a write misses in the cache, there may or may not be a line in the cache allocated to the block.  For write-allocate, there will be a line in the cache for the written data.  This policy is typically associated with write-back caches.  For no-write-allocate, there will not be a line in the cache.&lt;br /&gt;
*Fetch-on-write vs no-fetch-on-write: The fetch-on-write will cause the block of data to be fetched from a lower memory hierarchy if the write misses.  The policy fetches a block on every write miss.  &lt;br /&gt;
*Write-before-hit vs no-write-before-hit:  The write-before-hit will write data to the cache before checking the cache tags for a match.  In case of a miss, the policy will displace the block of data already in the cache.&lt;br /&gt;
&lt;br /&gt;
==Combination Policies==&lt;br /&gt;
* Write-validate: It is a combination of no-fetch-on-write and write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  The policy allows partial lines to be written to the cache on a miss.  It provides for better performance as well as works with machines that have various line sizes and does not add instruction execution overhead to the program being run.  Write-validate requires that the lower level system memory can support writes of partial lines.&lt;br /&gt;
* Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy invalidates lines when there is a miss.&lt;br /&gt;
* Write-around:  Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy uses a non-blocking write scheme to write to cache.  It writes data to the next lower cache without modifying the data of the cache line.&lt;br /&gt;
&lt;br /&gt;
=Prefetching=&lt;br /&gt;
Cache is very efficient in terms on access time once that data or instructions are in the cache.  But when a process tries to access something that is not already in the cache, a cache miss occurs and those pages need to be brought into the cache from memory.  Generally cache miss are expensive to the performance as the processor has to wait for that data (In parallel processing, a process can execute other tasks while it is waiting on data, but there will be some overhead for this).  Prefetching is a technique in which data is brought into cache before the program needs it.  In other words, it is a way to reduce cache misses.  Prefetching uses some type of prediction to mechanism to anticipate the next data that will be needed and brings them into cache.  It is not guaranteed that the perfected data will be used.  Goal here is to reduce cache misses to improve overall performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Some architectures have instructions to prefetch data into cache.  Programmers and compliers can insert this prefect instruction in the code.  This is known as software prefetching.  In hardware prefetching, processor observers the system behavior and issues requests for prefetching.  Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch.   Graphics Processing Units benefit from prefetching due to spatial locality property of the data&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.   &lt;br /&gt;
&lt;br /&gt;
==Advantages==&lt;br /&gt;
*Improves overall performance by reducing cache misses.&lt;br /&gt;
==Disadvantages==&lt;br /&gt;
* Wastes bandwidth when prefetched data is not used.&lt;br /&gt;
* Hardware prefetching requires complex architecture.  Second order effect is cost of implementation on silicon and validation costs.&lt;br /&gt;
* Software prefetching adds additional instructions to the program, making the program larger.&lt;br /&gt;
* If same cache is used for prefetching, then prefetching could cause other cache blocks to be evicted.  If the evicted blocks are needed, then that will generate a cache miss.  This can be prevented by having a separate cache for prefetching but it comes with hardware costs.&lt;br /&gt;
* When scheduler changes the task running on a processor, prefetched data may become useless.&lt;br /&gt;
&lt;br /&gt;
==Effectiveness==&lt;br /&gt;
Prefetching effectiveness can be tracked by following matrices&lt;br /&gt;
# Coverage is defined as fraction of original cache misses that were prefetched resulting in cache hit.&lt;br /&gt;
# Accuracy is defined as fraction of prefetches that are useful.&lt;br /&gt;
# Timeliness measures how early the prefetches arrive.&lt;br /&gt;
Ideally, a system should have high coverage, high accuracy and optimum timeliness.  Realistically, aggressive prefetches can increase coverage but decrease accuracy and vice versa.  Also, if prefetching is done too early, the fetched data may have to be evicted without being used and if done too late, it can cause cache miss.&lt;br /&gt;
&lt;br /&gt;
==Stream Buffer Prefetching==&lt;br /&gt;
This is a technique for prefetching which uses a FIFO buffer in which each entry is a cacheline and has address (or tag) and a available bit.  System prefetches a stream of sequential data into a stream buffer and multiple stream buffers can be used to prefetch multiple streams in parallel.  On a cache access, head entries of stream buffers are check for match along with cache check. If the block is not found in cache but found at the head of stream buffer, it is moved to cache and next entry in the buffer becomes the head.  If the block is not found in cache or as head of buffer, data is brought from memory into cache and the subsequent address as assigned to a new stream buffer.   Only the heads of the stream buffers are checked during cache access and not the whole buffer.  Checking all the entries in all the buffers will increase hardware complexity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
 INSERT IMAGE HERE&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Above plot shows the cache hit improvements for with respect to number of stream buffers on different programs.  Graph on left compares 8 programs from NAS suite while graph on right shows programs from Unix unity suite&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  &lt;br /&gt;
==Prefetching in Parallel Computing==&lt;br /&gt;
On a uniprocessor system, prefetching is definitely helpful to improve performance.  On a multiprocessor system prefetching is useful, but there are tighter constrains in implementing because of the fact that data will be shared between different processors or cores.  In message passing parallel model, each parallel thread has its own memory space and prefetching can be implemented in same way as for uniprocessor system.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In shared memory parallel programming, multiple threads that run on different processors share common memory space.  If multiple processors use common cache, then prefetching implementation is similar to uniprocessor system.  Difficulties arise when each core has its own cache.  Some of the case-scenarios that can occur are:&lt;br /&gt;
# Processor P1 has prefetched some data D1 into its stream buffer but is not used it.  At the same time processor P2 reads data D1 into its cache and modifies it and one of the cache coherency protocols would be used to inform processor P1 about this change.  D1 is not in P1’s cache so it many simply ignore this.  Now, when P1 ties to read D1, it will get the stale data from its stream buffer.  One way to prevent this is by improving stream buffers so that they can modify their data just like a cache.  This adds complexity to the architecture and increases cost&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
# Prefetching mechanism can fetch data D1, D2, D3,….D10 into P1’s buffer.  Now due to parallel processing, P1 only needs to operate on D1 to D5 while P2 will operate on remaining data.  Some bandwidth was wasted in fetching D6 to D10 into P1 even though it did not use it.  There is a trade off to be made here, if prefetching is very conservative then it will lead to miss and if not then it will waste bandwidth.&lt;br /&gt;
# In a multiprocessor system, if threads are not bound to a core, Operation system can rebalance the treads of different cores.  This will require the prefetched buffers to be trashed.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Computer Design &amp;amp; Technology- Lectures slides by Prof.Eric Rotenberg  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Fundamentals of Parallel Computer Architecture by Prof.Yan Solihin  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; Architecture of Parallel Computers, Lecture slides by Prof. Edward Gehringer &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; “Parallel computer architecture: a hardware/software approach” by David. E. Culler, Jaswinder Pal Singh, Anoop Gupta  (pg 887)    &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; “Evaluating Stream Buffers as a Secondary Cache Replacement” by Subbarao Palacharla and R. E. Kessler.   (ref#2) &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Instruction_prefetch &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border='1' class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center&amp;quot;&lt;br /&gt;
|+style=&amp;quot;white-space:nowrap&amp;quot;| Processor Architecture&lt;br /&gt;
|-&lt;br /&gt;
! Company &amp;amp; Processor !! # cores !! L1 cache Per core !! L1 cache Per core !! L3 cache  !! Year&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core || 2 || I:32KB   D:32KB || 1MB 8 way set assoc.  || -  ||  2006&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Clovertown || 2 || I:4*32KB   D:4*32KB || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon 3200 series || 4 || - || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64FX || 2 || I:64KB     D:4KB  Both 2-way set assoc. || 1MB 16way set assoc. || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64X2 || 2 || I:64KB     D:4KB  Both 2-way set assoc. || 512KB/1MB 16way set assoc. || 2MB || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Barcelona || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Aug 2007&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems Ultra Sparc T2 || 8 ||  I:16KB   D:8KB || 4MB (8 banks) 16way set assoc. || - || Oct 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Wolfdale DP || 2 ||  D:96KB  || 6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Hapertown || 4 ||  D:96KB  || 2*6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Phenom || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Nov 2007    Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo || 2 ||  I:32KB   D:32KB  || 2/4MB 8 way set assoc. || - ||  2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Penryn Wolfdale DP || 4 ||  -  || 6-12MB || 6MB || Mar 2008     Aug 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Quad Yorkfield || 4 ||  D:96KB  || 12MB  || - ||  Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| AMD Toliman || 3K10 || I:64KB   D:64KB  || 512KB || 2MB Shared || Apr 2008&lt;br /&gt;
|-&lt;br /&gt;
| Azul Systems Vega3 7300 Series || 864 || 768GB  || - || - || May 2008&lt;br /&gt;
|-&lt;br /&gt;
| IBM RoadRunner || 8+1 || 32KB  || 512KB || - || Jun 2008&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Pmpatel</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_jp&amp;diff=43857</id>
		<title>CSC/ECE 506 Spring 2011/ch6a jp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_jp&amp;diff=43857"/>
		<updated>2011-02-27T00:49:55Z</updated>

		<summary type="html">&lt;p&gt;Pmpatel: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Cache Hierarchy=&lt;br /&gt;
[[Image: memchart.jpg|thumbnail|right|Memory Hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
In a simple computer model, processor reads data and instructions from the memory and operates on the data.  Operating frequency of CPU increased faster than the speed of memory and memory interconnects.  For example, cores in Intel first generation i7 processors run at 3.2 GHz frequency, while the memory only runs at 1.3GHz frequency.  Also, multi-core architecture started putting more demand on memory bandwidth.  This increases the latency in memory access and CPU will have to be idle for most of the time.  Due to this, memory became a bottle neck in performance.   &lt;br /&gt;
&lt;br /&gt;
To solve this problem, “cache” was invented.  Cache is simply a temporary volatile storage space like primary memory but runs at the speed similar to core frequency.  CPU can access data and instructions from cache in few clock cycles while accessing data from main memory can take more than 50 cycles.  In early days of computing, cache was implemented as a stand alone chip outside the processor.  In today’s processors, cache is implemented on same die as core.  &lt;br /&gt;
&lt;br /&gt;
There can be multiple levels of caches, each cache subsequently away from the core and larger in size.  L1 is closest to the CPU and as a result, fastest to excess.  Next to L1 is L2 cache and then L3.  L1 cache is divided into instruction cache and data cache. This is better than having a combined larger cache as instruction cache being read-only is easy to implement while data cache is read-write.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies=&lt;br /&gt;
&lt;br /&gt;
==Write hit policies==&lt;br /&gt;
*Write-through: Also known as store-through, this policy will write to main memory whenever a write is performed to cache.&lt;br /&gt;
*Write-back: Also known as store-in or copy-back, this policy will write to main memory only when a block of data is purged from the cache storage.&lt;br /&gt;
&lt;br /&gt;
==Write miss policies==&lt;br /&gt;
*Write-allocate vs Write no-allocate: When a write misses in the cache, there may or may not be a line in the cache allocated to the block.  For write-allocate, there will be a line in the cache for the written data.  This policy is typically associated with write-back caches.  For no-write-allocate, there will not be a line in the cache.&lt;br /&gt;
*Fetch-on-write vs no-fetch-on-write: The fetch-on-write will cause the block of data to be fetched from a lower memory hierarchy if the write misses.  The policy fetches a block on every write miss.  &lt;br /&gt;
*Write-before-hit vs no-write-before-hit:  The write-before-hit will write data to the cache before checking the cache tags for a match.  In case of a miss, the policy will displace the block of data already in the cache.&lt;br /&gt;
&lt;br /&gt;
==Combination Policies==&lt;br /&gt;
* Write-validate: It is a combination of no-fetch-on-write and write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  The policy allows partial lines to be written to the cache on a miss.  It provides for better performance as well as works with machines that have various line sizes and does not add instruction execution overhead to the program being run.  Write-validate requires that the lower level system memory can support writes of partial lines.&lt;br /&gt;
* Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy invalidates lines when there is a miss.&lt;br /&gt;
* Write-around:  Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy uses a non-blocking write scheme to write to cache.  It writes data to the next lower cache without modifying the data of the cache line.&lt;br /&gt;
&lt;br /&gt;
=Prefetching=&lt;br /&gt;
Cache is very efficient in terms on access time once that data or instructions are in the cache.  But when a process tries to access something that is not already in the cache, a cache miss occurs and those pages need to be brought into the cache from memory.  Generally cache miss are expensive to the performance as the processor has to wait for that data (In parallel processing, a process can execute other tasks while it is waiting on data, but there will be some overhead for this).  Prefetching is a technique in which data is brought into cache before the program needs it.  In other words, it is a way to reduce cache misses.  Prefetching uses some type of prediction to mechanism to anticipate the next data that will be needed and brings them into cache.  It is not guaranteed that the perfected data will be used.  Goal here is to reduce cache misses to improve overall performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Some architectures have instructions to prefetch data into cache.  Programmers and compliers can insert this prefect instruction in the code.  This is known as software prefetching.  In hardware prefetching, processor observers the system behavior and issues requests for prefetching.  Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch.   Graphics Processing Units benefit from prefetching due to spatial locality property of the data&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.   &lt;br /&gt;
&lt;br /&gt;
==Advantages==&lt;br /&gt;
*Improves overall performance by reducing cache misses.&lt;br /&gt;
==Disadvantages==&lt;br /&gt;
* Wastes bandwidth when prefetched data is not used.&lt;br /&gt;
* Hardware prefetching requires complex architecture.  Second order effect is cost of implementation on silicon and validation costs.&lt;br /&gt;
* Software prefetching adds additional instructions to the program, making the program larger.&lt;br /&gt;
* If same cache is used for prefetching, then prefetching could cause other cache blocks to be evicted.  If the evicted blocks are needed, then that will generate a cache miss.  This can be prevented by having a separate cache for prefetching but it comes with hardware costs.&lt;br /&gt;
* When scheduler changes the task running on a processor, prefetched data may become useless.&lt;br /&gt;
&lt;br /&gt;
==Effectiveness==&lt;br /&gt;
Prefetching effectiveness can be tracked by following matrices&lt;br /&gt;
# Coverage is defined as fraction of original cache misses that were prefetched resulting in cache hit.&lt;br /&gt;
# Accuracy is defined as fraction of prefetches that are useful.&lt;br /&gt;
# Timeliness measures how early the prefetches arrive.&lt;br /&gt;
Ideally, a system should have high coverage, high accuracy and optimum timeliness.  Realistically, aggressive prefetches can increase coverage but decrease accuracy and vice versa.  Also, if prefetching is done too early, the fetched data may have to be evicted without being used and if done too late, it can cause cache miss.&lt;br /&gt;
&lt;br /&gt;
==Stream Buffer Prefetching==&lt;br /&gt;
This is a technique for prefetching which uses a FIFO buffer in which each entry is a cacheline and has address (or tag) and a available bit.  System prefetches a stream of sequential data into a stream buffer and multiple stream buffers can be used to prefetch multiple streams in parallel.  On a cache access, head entries of stream buffers are check for match along with cache check. If the block is not found in cache but found at the head of stream buffer, it is moved to cache and next entry in the buffer becomes the head.  If the block is not found in cache or as head of buffer, data is brought from memory into cache and the subsequent address as assigned to a new stream buffer.   Only the heads of the stream buffers are checked during cache access and not the whole buffer.  Checking all the entries in all the buffers will increase hardware complexity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
 INSERT IMAGE HERE&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Above plot shows the cache hit improvements for with respect to number of stream buffers on different programs.  Graph on left compares 8 programs from NAS suite while graph on right shows programs from Unix unity suite&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  &lt;br /&gt;
==Prefetching in Parallel Computing==&lt;br /&gt;
On a uniprocessor system, prefetching is definitely helpful to improve performance.  On a multiprocessor system prefetching is useful, but there are tighter constrains in implementing because of the fact that data will be shared between different processors or cores.  In message passing parallel model, each parallel thread has its own memory space and prefetching can be implemented in same way as for uniprocessor system.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In shared memory parallel programming, multiple threads that run on different processors share common memory space.  If multiple processors use common cache, then prefetching implementation is similar to uniprocessor system.  Difficulties arise when each core has its own cache.  Some of the case-scenarios that can occur are:&lt;br /&gt;
# Processor P1 has prefetched some data D1 into its stream buffer but is not used it.  At the same time processor P2 reads data D1 into its cache and modifies it and one of the cache coherency protocols would be used to inform processor P1 about this change.  D1 is not in P1’s cache so it many simply ignore this.  Now, when P1 ties to read D1, it will get the stale data from its stream buffer.  One way to prevent this is by improving stream buffers so that they can modify their data just like a cache.  This adds complexity to the architecture and increases cost&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
# Prefetching mechanism can fetch data D1, D2, D3,….D10 into P1’s buffer.  Now due to parallel processing, P1 only needs to operate on D1 to D5 while P2 will operate on remaining data.  Some bandwidth was wasted in fetching D6 to D10 into P1 even though it did not use it.  There is a trade off to be made here, if prefetching is very conservative then it will lead to miss and if not then it will waste bandwidth.&lt;br /&gt;
# In a multiprocessor system, if threads are not bound to a core, Operation system can rebalance the treads of different cores.  This will require the prefetched buffers to be trashed.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Computer Design &amp;amp; Technology- Lectures slides by Prof.Eric Rotenberg  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Fundamentals of Parallel Computer Architecture by Prof.Yan Solihin  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; Architecture of Parallel Computers, Lecture slides by Prof. Edward Gehringer &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; “Parallel computer architecture: a hardware/software approach” by David. E. Culler, Jaswinder Pal Singh, Anoop Gupta  (pg 887)    &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; “Evaluating Stream Buffers as a Secondary Cache Replacement” by Subbarao Palacharla and R. E. Kessler.   (ref#2) &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Instruction_prefetch &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border='1' class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center&amp;quot;&lt;br /&gt;
|+style=&amp;quot;white-space:nowrap&amp;quot;| Processor Architecture&lt;br /&gt;
|-&lt;br /&gt;
! Company &amp;amp; Processor !! # cores !! L1 cache Per core !! L1 cache Per core !! L3 cache  !! Year&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core || 2 || I:32KB   D:32KB || 1MB 8 way set assoc.  || -  ||  2006&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Clovertown || 2 || I:4*32KB   D:4*32KB || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon 3200 series || 4 || - || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64FX || 2 || I:64KB   D:4KB  Both 2-way set assoc. || 1MB 16way set assoc. || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64X2 || 2 || I:64KB   D:4KB  Both 2-way set assoc. || 512KB/1MB 16way set assoc. || 2MB || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Barcelona || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Aug 2007&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems Ultra Sparc T2 || 8 ||  I:16KB   D:8KB || 4MB (8 banks) 16way set assoc. || - || Oct 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Wolfdale DP || 2 ||  D:96KB  || 6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Hapertown || 4 ||  D:96KB  || 2*6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Phenom || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Nov 2007    Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo || 2 ||  I:32KB   D:32KB  || 2/4MB 8 way set assoc. || - ||  2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Penryn Wolfdale DP || 4 ||  -  || 6-12MB || 6MB || Mar 2008   Aug 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Quad Yorkfield || 4 ||  D:96KB  || 12MB  || - ||  Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| AMD Toliman || 3K10 || I:64KB   D:64KB  || 512KB || 2MB Shared || Apr 2008&lt;br /&gt;
|-&lt;br /&gt;
| Azul Systems Vega3 7300 Series || 864 || 768GB  || - || - || May 2008&lt;br /&gt;
|-&lt;br /&gt;
| IBM RoadRunner || 8+1 || 32KB  || 512KB || - || Jun 2008&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Pmpatel</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_jp&amp;diff=43856</id>
		<title>CSC/ECE 506 Spring 2011/ch6a jp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_jp&amp;diff=43856"/>
		<updated>2011-02-27T00:43:07Z</updated>

		<summary type="html">&lt;p&gt;Pmpatel: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Cache Hierarchy=&lt;br /&gt;
[[Image: memchart.jpg|thumbnail|right|Memory Hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
In a simple computer model, processor reads data and instructions from the memory and operates on the data.  Operating frequency of CPU increased faster than the speed of memory and memory interconnects.  For example, cores in Intel first generation i7 processors run at 3.2 GHz frequency, while the memory only runs at 1.3GHz frequency.  Also, multi-core architecture started putting more demand on memory bandwidth.  This increases the latency in memory access and CPU will have to be idle for most of the time.  Due to this, memory became a bottle neck in performance.   &lt;br /&gt;
&lt;br /&gt;
To solve this problem, “cache” was invented.  Cache is simply a temporary volatile storage space like primary memory but runs at the speed similar to core frequency.  CPU can access data and instructions from cache in few clock cycles while accessing data from main memory can take more than 50 cycles.  In early days of computing, cache was implemented as a stand alone chip outside the processor.  In today’s processors, cache is implemented on same die as core.  &lt;br /&gt;
&lt;br /&gt;
There can be multiple levels of caches, each cache subsequently away from the core and larger in size.  L1 is closest to the CPU and as a result, fastest to excess.  Next to L1 is L2 cache and then L3.  L1 cache is divided into instruction cache and data cache. This is better than having a combined larger cache as instruction cache being read-only is easy to implement while data cache is read-write.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies=&lt;br /&gt;
&lt;br /&gt;
==Write hit policies==&lt;br /&gt;
*Write-through: Also known as store-through, this policy will write to main memory whenever a write is performed to cache.&lt;br /&gt;
*Write-back: Also known as store-in or copy-back, this policy will write to main memory only when a block of data is purged from the cache storage.&lt;br /&gt;
&lt;br /&gt;
==Write miss policies==&lt;br /&gt;
*Write-allocate vs Write no-allocate: When a write misses in the cache, there may or may not be a line in the cache allocated to the block.  For write-allocate, there will be a line in the cache for the written data.  This policy is typically associated with write-back caches.  For no-write-allocate, there will not be a line in the cache.&lt;br /&gt;
*Fetch-on-write vs no-fetch-on-write: The fetch-on-write will cause the block of data to be fetched from a lower memory hierarchy if the write misses.  The policy fetches a block on every write miss.  &lt;br /&gt;
*Write-before-hit vs no-write-before-hit:  The write-before-hit will write data to the cache before checking the cache tags for a match.  In case of a miss, the policy will displace the block of data already in the cache.&lt;br /&gt;
&lt;br /&gt;
==Combination Policies==&lt;br /&gt;
* Write-validate: It is a combination of no-fetch-on-write and write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  The policy allows partial lines to be written to the cache on a miss.  It provides for better performance as well as works with machines that have various line sizes and does not add instruction execution overhead to the program being run.  Write-validate requires that the lower level system memory can support writes of partial lines.&lt;br /&gt;
* Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy invalidates lines when there is a miss.&lt;br /&gt;
* Write-around:  Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy uses a non-blocking write scheme to write to cache.  It writes data to the next lower cache without modifying the data of the cache line.&lt;br /&gt;
&lt;br /&gt;
=Prefetching=&lt;br /&gt;
Cache is very efficient in terms on access time once that data or instructions are in the cache.  But when a process tries to access something that is not already in the cache, a cache miss occurs and those pages need to be brought into the cache from memory.  Generally cache miss are expensive to the performance as the processor has to wait for that data (In parallel processing, a process can execute other tasks while it is waiting on data, but there will be some overhead for this).  Prefetching is a technique in which data is brought into cache before the program needs it.  In other words, it is a way to reduce cache misses.  Prefetching uses some type of prediction to mechanism to anticipate the next data that will be needed and brings them into cache.  It is not guaranteed that the perfected data will be used.  Goal here is to reduce cache misses to improve overall performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Some architectures have instructions to prefetch data into cache.  Programmers and compliers can insert this prefect instruction in the code.  This is known as software prefetching.  In hardware prefetching, processor observers the system behavior and issues requests for prefetching.  Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch.   Graphics Processing Units benefit from prefetching due to spatial locality property of the data&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.   &lt;br /&gt;
&lt;br /&gt;
==Advantages==&lt;br /&gt;
*Improves overall performance by reducing cache misses.&lt;br /&gt;
==Disadvantages==&lt;br /&gt;
* Wastes bandwidth when prefetched data is not used.&lt;br /&gt;
* Hardware prefetching requires complex architecture.  Second order effect is cost of implementation on silicon and validation costs.&lt;br /&gt;
* Software prefetching adds additional instructions to the program, making the program larger.&lt;br /&gt;
* If same cache is used for prefetching, then prefetching could cause other cache blocks to be evicted.  If the evicted blocks are needed, then that will generate a cache miss.  This can be prevented by having a separate cache for prefetching but it comes with hardware costs.&lt;br /&gt;
* When scheduler changes the task running on a processor, prefetched data may become useless.&lt;br /&gt;
&lt;br /&gt;
==Effectiveness==&lt;br /&gt;
Prefetching effectiveness can be tracked by following matrices&lt;br /&gt;
# Coverage is defined as fraction of original cache misses that were prefetched resulting in cache hit.&lt;br /&gt;
# Accuracy is defined as fraction of prefetches that are useful.&lt;br /&gt;
# Timeliness measures how early the prefetches arrive.&lt;br /&gt;
Ideally, a system should have high coverage, high accuracy and optimum timeliness.  Realistically, aggressive prefetches can increase coverage but decrease accuracy and vice versa.  Also, if prefetching is done too early, the fetched data may have to be evicted without being used and if done too late, it can cause cache miss.&lt;br /&gt;
&lt;br /&gt;
==Stream Buffer Prefetching==&lt;br /&gt;
This is a technique for prefetching which uses a FIFO buffer in which each entry is a cacheline and has address (or tag) and a available bit.  System prefetches a stream of sequential data into a stream buffer and multiple stream buffers can be used to prefetch multiple streams in parallel.  On a cache access, head entries of stream buffers are check for match along with cache check. If the block is not found in cache but found at the head of stream buffer, it is moved to cache and next entry in the buffer becomes the head.  If the block is not found in cache or as head of buffer, data is brought from memory into cache and the subsequent address as assigned to a new stream buffer.   Only the heads of the stream buffers are checked during cache access and not the whole buffer.  Checking all the entries in all the buffers will increase hardware complexity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
 INSERT IMAGE HERE&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Above plot shows the cache hit improvements for with respect to number of stream buffers on different programs.  Graph on left compares 8 programs from NAS suite while graph on right shows programs from Unix unity suite&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  &lt;br /&gt;
==Prefetching in Parallel Computing==&lt;br /&gt;
On a uniprocessor system, prefetching is definitely helpful to improve performance.  On a multiprocessor system prefetching is useful, but there are tighter constrains in implementing because of the fact that data will be shared between different processors or cores.  In message passing parallel model, each parallel thread has its own memory space and prefetching can be implemented in same way as for uniprocessor system.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In shared memory parallel programming, multiple threads that run on different processors share common memory space.  If multiple processors use common cache, then prefetching implementation is similar to uniprocessor system.  Difficulties arise when each core has its own cache.  Some of the case-scenarios that can occur are:&lt;br /&gt;
# Processor P1 has prefetched some data D1 into its stream buffer but is not used it.  At the same time processor P2 reads data D1 into its cache and modifies it and one of the cache coherency protocols would be used to inform processor P1 about this change.  D1 is not in P1’s cache so it many simply ignore this.  Now, when P1 ties to read D1, it will get the stale data from its stream buffer.  One way to prevent this is by improving stream buffers so that they can modify their data just like a cache.  This adds complexity to the architecture and increases cost&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
# Prefetching mechanism can fetch data D1, D2, D3,….D10 into P1’s buffer.  Now due to parallel processing, P1 only needs to operate on D1 to D5 while P2 will operate on remaining data.  Some bandwidth was wasted in fetching D6 to D10 into P1 even though it did not use it.  There is a trade off to be made here, if prefetching is very conservative then it will lead to miss and if not then it will waste bandwidth.&lt;br /&gt;
# In a multiprocessor system, if threads are not bound to a core, Operation system can rebalance the treads of different cores.  This will require the prefetched buffers to be trashed.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Computer Design &amp;amp; Technology- Lectures slides by Prof.Eric Rotenberg  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Fundamentals of Parallel Computer Architecture by Prof.Yan Solihin  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; Architecture of Parallel Computers, Lecture slides by Prof. Edward Gehringer &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; “Parallel computer architecture: a hardware/software approach” by David. E. Culler, Jaswinder Pal Singh, Anoop Gupta  (pg 887)    &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; “Evaluating Stream Buffers as a Secondary Cache Replacement” by Subbarao Palacharla and R. E. Kessler.   (ref#2) &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Instruction_prefetch &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border='1' class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center&amp;quot;&lt;br /&gt;
|+style=&amp;quot;white-space:nowrap&amp;quot;| Processor Architecture&lt;br /&gt;
|-&lt;br /&gt;
! Company &amp;amp; Processor !! # cores !! L1 cache per core !! L2 cache per core !! L3 cache !! Year&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core || 2 || I:32KB   D:32KB || 1MB 8 way set assoc.  || -  ||  2006&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Clovertown || 2 || I:4*32KB   D:4*32KB || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon 3200 series || 4 || - || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64FX || 2 || I:64KB   D:64KB  Both 2-way set assoc. || 1MB 16-way set assoc. || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64X2 || 2 || I:64KB   D:64KB  Both 2-way set assoc. || 512KB/1MB 16-way set assoc. || 2MB || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Barcelona || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Aug 2007&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems Ultra Sparc T2 || 8 || D:8KB   I:16KB  || 4MB (8 banks) 16way set assoc. || - || Oct 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Wolfdale DP || 2 ||  D:96KB  || 6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Hapertown || 4 ||  D:96KB  || 2*6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Phenom || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Nov 2007 / Mar 2008&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Pmpatel</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_jp&amp;diff=43855</id>
		<title>CSC/ECE 506 Spring 2011/ch6a jp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_jp&amp;diff=43855"/>
		<updated>2011-02-27T00:35:14Z</updated>

		<summary type="html">&lt;p&gt;Pmpatel: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Cache Hierarchy=&lt;br /&gt;
[[Image: memchart.jpg|thumbnail|right|Memory Hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
In a simple computer model, the processor reads data and instructions from memory and operates on the data.  CPU operating frequencies increased faster than the speed of memory and memory interconnects.  For example, cores in Intel’s first generation i7 processors run at 3.2 GHz, while the memory runs at only 1.3 GHz.  Multi-core architectures also put more demand on memory bandwidth.  Relative to the CPU, memory accesses therefore take ever more cycles, leaving the CPU idle much of the time.  Because of this, memory became a bottleneck for performance.&lt;br /&gt;
&lt;br /&gt;
To solve this problem, the “cache” was invented.  A cache is a temporary volatile storage space like primary memory, but it runs at a speed similar to the core frequency.  The CPU can access data and instructions in the cache within a few clock cycles, while accessing main memory can take more than 50 cycles.  In the early days of computing, the cache was implemented as a stand-alone chip outside the processor.  In today’s processors, the cache is implemented on the same die as the core.&lt;br /&gt;
&lt;br /&gt;
There can be multiple levels of caches, each successive level farther from the core and larger in size.  L1 is closest to the CPU and, as a result, fastest to access.  Next to L1 is the L2 cache, and then L3.  The L1 cache is divided into an instruction cache and a data cache.  This is better than having a combined larger cache, since the instruction cache, being read-only, is easy to implement, while the data cache is read-write.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies=&lt;br /&gt;
&lt;br /&gt;
==Write hit policies==&lt;br /&gt;
*Write-through: Also known as store-through, this policy writes to main memory whenever a write is performed to the cache.&lt;br /&gt;
*Write-back: Also known as store-in or copy-back, this policy writes to main memory only when a block of data is purged from the cache, as sketched below.&lt;br /&gt;
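The difference between the two hit policies can be made concrete with a minimal C sketch; the one-word cache line and the single memory cell are illustrative assumptions, not a model of real hardware:&lt;br /&gt;
 /* Illustrative cache line: one word of data plus a dirty flag. */&lt;br /&gt;
 typedef struct {&lt;br /&gt;
     int data;&lt;br /&gt;
     int dirty;   /* set when the cached copy differs from memory */&lt;br /&gt;
 } cache_line;&lt;br /&gt;
 int main_memory;                  /* stand-in for the backing store */&lt;br /&gt;
 /* Write-through: every cache write also updates memory. */&lt;br /&gt;
 void write_through(cache_line *line, int value) {&lt;br /&gt;
     line-&amp;gt;data = value;&lt;br /&gt;
     main_memory = value;          /* memory is always up to date */&lt;br /&gt;
 }&lt;br /&gt;
 /* Write-back: the write only marks the line dirty; memory is&lt;br /&gt;
    updated later, when the line is purged from the cache. */&lt;br /&gt;
 void write_back(cache_line *line, int value) {&lt;br /&gt;
     line-&amp;gt;data = value;&lt;br /&gt;
     line-&amp;gt;dirty = 1;&lt;br /&gt;
 }&lt;br /&gt;
 void evict(cache_line *line) {&lt;br /&gt;
     if (line-&amp;gt;dirty) {&lt;br /&gt;
         main_memory = line-&amp;gt;data; /* deferred write to memory */&lt;br /&gt;
         line-&amp;gt;dirty = 0;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
The trade-off is visible directly: write-through pays a memory write per store but keeps memory consistent, while write-back batches the traffic into one write at eviction.&lt;br /&gt;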
&lt;br /&gt;
==Write miss policies==&lt;br /&gt;
*Write-allocate vs no-write-allocate: When a write misses in the cache, a line may or may not be allocated for the block.  With write-allocate, a line is allocated in the cache for the written data; this policy is typically associated with write-back caches.  With no-write-allocate, no line is allocated.&lt;br /&gt;
*Fetch-on-write vs no-fetch-on-write: Fetch-on-write causes the block to be fetched from a lower level of the memory hierarchy on every write miss.&lt;br /&gt;
*Write-before-hit vs no-write-before-hit:  Write-before-hit writes data to the cache before checking the cache tags for a match.  In case of a miss, it displaces the block already in the cache.&lt;br /&gt;
&lt;br /&gt;
==Combination Policies==&lt;br /&gt;
* Write-validate: A combination of no-fetch-on-write and write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  The policy allows partial lines to be written to the cache on a miss.  It performs well, works with machines that have various line sizes, and adds no instruction execution overhead to the running program.  Write-validate requires that the lower levels of the memory system support writes of partial lines.&lt;br /&gt;
* Write-invalidate: A combination of write-before-hit, no-fetch-on-write, and no-write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  Because data is written before the tag check, the line is invalidated when the access turns out to be a miss.&lt;br /&gt;
* Write-around: A combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy uses a non-blocking write scheme: the write goes around the cache to the next lower level without modifying the cache line.&lt;br /&gt;
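Since the three miss-policy decisions above are independent, each named combination can be read as a setting of three flags.  A small C sketch of that encoding (a simplification for illustration; real designs couple these choices to the rest of the hierarchy):&lt;br /&gt;
 typedef struct {&lt;br /&gt;
     int fetch_on_write;   /* fetch the block from below on a write miss */&lt;br /&gt;
     int write_allocate;   /* allocate a cache line for the written block */&lt;br /&gt;
     int write_before_hit; /* write data before the tag check completes */&lt;br /&gt;
 } write_miss_policy;&lt;br /&gt;
 /* The named combinations from the text: */&lt;br /&gt;
 const write_miss_policy write_validate   = { 0, 1, 0 };&lt;br /&gt;
 const write_miss_policy write_invalidate = { 0, 0, 1 };&lt;br /&gt;
 const write_miss_policy write_around     = { 0, 0, 0 };&lt;br /&gt;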
&lt;br /&gt;
=Prefetching=&lt;br /&gt;
The cache is very efficient in terms of access time once the data or instructions are in it.  But when a process tries to access something that is not already in the cache, a cache miss occurs and the block must be brought into the cache from memory.  Cache misses are generally expensive for performance because the processor has to wait for the data (in parallel processing, a process can execute other tasks while it waits, but that adds some overhead).  Prefetching is a technique in which data is brought into the cache before the program needs it; in other words, it is a way to reduce cache misses.  Prefetching uses some type of prediction mechanism to anticipate which data will be needed next and brings it into the cache.  There is no guarantee that the prefetched data will be used.  The goal is to reduce cache misses and thereby improve overall performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Some architectures have instructions to prefetch data into the cache.  Programmers and compilers can insert these prefetch instructions in the code; this is known as software prefetching.  In hardware prefetching, the processor observes the system’s behavior and issues prefetch requests on its own.  The Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch.  Graphics Processing Units benefit from prefetching due to the spatial locality of their data&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
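As a concrete example of software prefetching, GCC and Clang provide the __builtin_prefetch() intrinsic, which compiles to the target machine’s prefetch instruction where one exists.  The loop below is only a sketch; the prefetch distance of 8 iterations is an illustrative choice, not a tuned value:&lt;br /&gt;
 /* Sum an array while prefetching a few iterations ahead. */&lt;br /&gt;
 long sum_with_prefetch(const int *a, int n) {&lt;br /&gt;
     long sum = 0;&lt;br /&gt;
     for (int i = 0; i &amp;lt; n; i++) {&lt;br /&gt;
         if (i + 8 &amp;lt; n)&lt;br /&gt;
             /* arguments: address, 0 = read, 1 = low temporal locality */&lt;br /&gt;
             __builtin_prefetch(&amp;amp;a[i + 8], 0, 1);&lt;br /&gt;
         sum += a[i];&lt;br /&gt;
     }&lt;br /&gt;
     return sum;&lt;br /&gt;
 }&lt;br /&gt;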
&lt;br /&gt;
==Advantages==&lt;br /&gt;
*Improves overall performance by reducing cache misses.&lt;br /&gt;
==Disadvantages==&lt;br /&gt;
* Wastes bandwidth when prefetched data is not used.&lt;br /&gt;
* Hardware prefetching requires a more complex architecture.  Second-order effects are the cost of implementing it in silicon and the cost of validation.&lt;br /&gt;
* Software prefetching adds additional instructions to the program, making the program larger.&lt;br /&gt;
* If the same cache is used for prefetching, prefetching can cause other cache blocks to be evicted.  If the evicted blocks are needed again, that generates a cache miss.  This can be prevented by having a separate cache for prefetched data, but that comes at a hardware cost.&lt;br /&gt;
* When the scheduler changes the task running on a processor, prefetched data may become useless.&lt;br /&gt;
&lt;br /&gt;
==Effectiveness==&lt;br /&gt;
Prefetching effectiveness can be tracked with the following metrics:&lt;br /&gt;
# Coverage is the fraction of the original cache misses that prefetching turns into cache hits.&lt;br /&gt;
# Accuracy is the fraction of prefetches that are useful.&lt;br /&gt;
# Timeliness measures how early the prefetched data arrives.&lt;br /&gt;
Ideally, a system should have high coverage, high accuracy, and optimal timeliness.  Realistically, aggressive prefetching can increase coverage but decrease accuracy, and vice versa.  Also, if prefetching is done too early, the fetched data may be evicted before being used; if done too late, it can cause a cache miss.&lt;br /&gt;
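To make the first two metrics concrete, they can be written as simple ratios of event counts (the counter names are illustrative: prefetch_hits are misses that prefetching turned into hits, remaining_misses are misses it did not cover):&lt;br /&gt;
 coverage = prefetch_hits / (prefetch_hits + remaining_misses)&lt;br /&gt;
 accuracy = prefetch_hits / prefetches_issued&lt;br /&gt;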
&lt;br /&gt;
==Stream Buffer Prefetching==&lt;br /&gt;
This prefetching technique uses a FIFO buffer in which each entry holds a cache line together with its address (or tag) and an available bit.  The system prefetches a stream of sequential data into a stream buffer, and multiple stream buffers can be used to prefetch multiple streams in parallel.  On a cache access, the head entries of the stream buffers are checked for a match alongside the cache lookup.  If the block is not found in the cache but is found at the head of a stream buffer, it is moved into the cache and the next entry in the buffer becomes the head.  If the block is found neither in the cache nor at the head of a buffer, the data is brought from memory into the cache and the subsequent address is assigned to a new stream buffer.  Only the heads of the stream buffers are checked during a cache access, not the whole buffer; checking all the entries in all the buffers would increase hardware complexity.&lt;br /&gt;
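The lookup path just described can be sketched in C; the buffer sizes and the stub helper functions are assumptions standing in for the surrounding cache machinery, and addresses are in units of cache blocks:&lt;br /&gt;
 #define NUM_BUFFERS 4&lt;br /&gt;
 #define BUF_DEPTH   8&lt;br /&gt;
 typedef struct {&lt;br /&gt;
     long tag[BUF_DEPTH];   /* block address of each entry */&lt;br /&gt;
     int  avail[BUF_DEPTH]; /* available bit: data has arrived */&lt;br /&gt;
     int  head;             /* only the head is checked on access */&lt;br /&gt;
 } stream_buffer;&lt;br /&gt;
 /* Hypothetical helpers standing in for the cache and memory machinery. */&lt;br /&gt;
 static void move_head_to_cache(stream_buffer *s) { (void)s; }&lt;br /&gt;
 static void refill_tail(stream_buffer *s) { (void)s; }&lt;br /&gt;
 static void allocate_new_stream(stream_buffer *b, long a) { (void)b; (void)a; }&lt;br /&gt;
 /* Called on a cache miss for block addr: probe only the buffer heads. */&lt;br /&gt;
 int stream_buffer_lookup(stream_buffer bufs[NUM_BUFFERS], long addr) {&lt;br /&gt;
     for (int b = 0; b &amp;lt; NUM_BUFFERS; b++) {&lt;br /&gt;
         stream_buffer *s = &amp;amp;bufs[b];&lt;br /&gt;
         if (s-&amp;gt;avail[s-&amp;gt;head] &amp;amp;&amp;amp; s-&amp;gt;tag[s-&amp;gt;head] == addr) {&lt;br /&gt;
             move_head_to_cache(s);               /* hit: promote to cache */&lt;br /&gt;
             s-&amp;gt;head = (s-&amp;gt;head + 1) % BUF_DEPTH; /* next entry becomes head */&lt;br /&gt;
             refill_tail(s);                      /* keep the stream full */&lt;br /&gt;
             return 1;&lt;br /&gt;
         }&lt;br /&gt;
     }&lt;br /&gt;
     allocate_new_stream(bufs, addr + 1); /* miss: start a stream at the next block */&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;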
 INSERT IMAGE HERE (plot: cache hit rate vs. number of stream buffers)&lt;br /&gt;
The plot above shows the cache hit improvement with respect to the number of stream buffers for different programs.  The graph on the left compares 8 programs from the NAS suite, while the graph on the right shows programs from the Unix utility suite&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
==Prefetching in Parallel Computing==&lt;br /&gt;
On a uniprocessor system, prefetching clearly helps performance.  On a multiprocessor system prefetching is still useful, but the implementation faces tighter constraints because data is shared between different processors or cores.  In the message passing parallel model, each parallel thread has its own memory space, so prefetching can be implemented in the same way as on a uniprocessor system.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In shared memory parallel programming, multiple threads running on different processors share a common memory space.  If the processors share a common cache, then the prefetching implementation is similar to that of a uniprocessor system.  Difficulties arise when each core has its own cache.  Some of the scenarios that can occur are:&lt;br /&gt;
# Processor P1 has prefetched some data D1 into its stream buffer but has not used it yet.  At the same time, processor P2 reads D1 into its cache and modifies it, and the cache coherence protocol informs processor P1 of the change.  Since D1 is not in P1’s cache, P1 may simply ignore the message.  When P1 later tries to read D1, it will get stale data from its stream buffer.  One way to prevent this is to enhance the stream buffers so that they can update or invalidate their entries just like a cache.  This adds complexity to the architecture and increases cost&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
# The prefetching mechanism can fetch data D1, D2, D3, …, D10 into P1’s buffer.  Due to the parallel decomposition, P1 only needs to operate on D1 to D5, while P2 operates on the remaining data.  The bandwidth spent fetching D6 to D10 into P1’s buffer was wasted.  There is a trade-off to be made here: prefetching too conservatively leads to misses, while prefetching too aggressively wastes bandwidth.&lt;br /&gt;
# In a multiprocessor system, if threads are not bound to a core, the operating system can rebalance the threads across cores.  This requires the prefetched buffers to be discarded.&lt;br /&gt;
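A hedged sketch of the fix mentioned in scenario 1, reusing the illustrative stream_buffer type from the earlier sketch: when a coherence invalidation for a block arrives, the stream buffers are probed alongside the cache and any matching entries are dropped.  This simple variant invalidates rather than updates, and probing every entry is exactly the added hardware complexity the scenario warns about:&lt;br /&gt;
 /* Called when the coherence protocol reports that another processor&lt;br /&gt;
    modified block addr: drop any stale prefetched copy. */&lt;br /&gt;
 void on_invalidate(stream_buffer bufs[NUM_BUFFERS], long addr) {&lt;br /&gt;
     for (int b = 0; b &amp;lt; NUM_BUFFERS; b++)&lt;br /&gt;
         for (int e = 0; e &amp;lt; BUF_DEPTH; e++)&lt;br /&gt;
             if (bufs[b].tag[e] == addr)&lt;br /&gt;
                 bufs[b].avail[e] = 0; /* entry can no longer supply data */&lt;br /&gt;
 }&lt;br /&gt;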
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Computer Design &amp;amp; Technology, lecture slides by Prof. Eric Rotenberg  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Fundamentals of Parallel Computer Architecture by Prof. Yan Solihin  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; Architecture of Parallel Computers, Lecture slides by Prof. Edward Gehringer &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; “Parallel Computer Architecture: A Hardware/Software Approach” by David E. Culler, Jaswinder Pal Singh, and Anoop Gupta (p. 887)  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; “Evaluating Stream Buffers as a Secondary Cache Replacement” by Subbarao Palacharla and R. E. Kessler.   (ref#2) &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Instruction_prefetch &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border='1' class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center&amp;quot;&lt;br /&gt;
|+style=&amp;quot;white-space:nowrap&amp;quot;| Processor Architecture&lt;br /&gt;
|-&lt;br /&gt;
! Company &amp;amp; Processor !! # cores !! L1 cache per core !! L2 cache per core !! L3 cache !! Year&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core || 2 || I:32KB   D:32KB || 1MB 8 way set assoc.  || -  ||  2006&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Clovertown || 2 || I:4*32KB   D:4*32KB || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon 3200 series || 4 || - || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64FX || 2 || I:64KB   D:64KB  Both 2-way set assoc. || 1MB 16-way set assoc. || - || Jan 2007&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Pmpatel</name></author>
	</entry>
</feed>