CSC/ECE 506 Spring 2010/Ch 9/Synchronization
Hardware Support For Synchronization
Hardware Implementations
Hardware implementations of synchronization traditionally include locks, barriers, and mutual exclusion. These hardware synchronization mechanisms use a method called busy-waiting, or spinning, which prevents threads from continuing to execute. Spinning simply means that a thread continuously checks whether it is okay to continue; the thread accomplishes no other useful work while it waits. In the case of a lock, multiple threads compete for access to the lock, and only one thread at a time is allowed to execute the critical code protected by the lock. For barriers, multiple threads reach the barrier at different times, and no thread can continue until all threads have reached the barrier. Spinning is implemented in hardware through mutexes or semaphores, which prevent multiple processes from accessing shared data at the same time. Atomic instructions, which are supported by the processor's Instruction Set Architecture (ISA), are typically used as the hardware synchronization mechanism. Examples of these mechanisms include Test-and-Set, Fetch-and-Increment, and Test-and-Test-and-Set; these atomic instructions are described in the Mutual Exclusion section below.
Software Implementations
The software implementations discussed here are used specifically in Networks of Workstations (NOWs), where synchronization can be implemented in software using explicit messages; the sketch below illustrates one such message-based operation.
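As a concrete illustration (a sketch not drawn from the source), message-passing libraries such as MPI, which commonly run on networks of workstations, expose synchronization operations built entirely out of explicit messages; the collective MPI_Barrier call is one example:

 #include <mpi.h>
 #include <stdio.h>
 
 int main(int argc, char **argv)
 {
     int rank;
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 
     printf("process %d: before the barrier\n", rank);
 
     // Implemented with explicit messages among the processes;
     // no process continues until every process has made this call.
     MPI_Barrier(MPI_COMM_WORLD);
 
     printf("process %d: after the barrier\n", rank);
     MPI_Finalize();
     return 0;
 }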
Mutual Exclusion
Mutual exclusion is the process of preventing concurrent access to shared variables in a parallel program. Mutual exclusion is implemented with locks; a lock allows only a single thread at a time to execute a critical section of code. Below are several hardware primitives used to build locks that provide mutual exclusion:
Test-and-Set Rx, M: Read the value stored in memory location M, test the value against a constant, and if they match, write the value in register Rx to memory location M.
Fetch-and-Increment M: Read the value stored in memory location M, increment it, and then store the new value to the memory location.
Test-and-Test-and-Set: This is the same as Test-and-Set, except that there is an extra testing phase: the atomic test-and-set instruction is executed only when it has a good chance of succeeding, i.e., after an ordinary load has observed that the lock is free. See the example code below:
 lock:    ld   R1, &lockvar     // R1 = lockvar
          bnz  R1, lock         // jump to lock if R1 != 0
          t&s  R1, &lockvar     // test-and-set atomic instruction:
                                // R1 = lockvar; if (R1 == 0) lockvar = 1
          bnz  R1, lock         // jump to lock if R1 != 0
          ret                   // return to caller
 
 unlock:  st   &lockvar, #0     // lockvar = 0
          ret                   // return to caller
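A rough C11 equivalent of the loop above (an illustrative sketch, not code from the source; the names tts_lock and tts_unlock are invented here) spins on an ordinary atomic load and issues the atomic exchange, the test-and-set step, only once the lock appears free. Fetch-and-increment corresponds to atomic_fetch_add from the same header.

 #include <stdatomic.h>
 
 static atomic_int lockvar = 0;   // 0 = free, 1 = held
 
 void tts_lock(void)
 {
     for (;;) {
         // Test phase: cheap read-only spin while the lock is held.
         while (atomic_load_explicit(&lockvar, memory_order_relaxed) != 0)
             ;
         // Test-and-set phase: atomically swap in 1; reading back 0
         // means this thread acquired the lock.
         if (atomic_exchange_explicit(&lockvar, 1, memory_order_acquire) == 0)
             return;
     }
 }
 
 void tts_unlock(void)
 {
     atomic_store_explicit(&lockvar, 0, memory_order_release);   // lockvar = 0
 }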
Overhead
Test and Set
The performance of test-and-set degrades significantly as the number of processors increases, for two reasons. First, the lock holder must contend with the spinning threads in order to release the lock. Second, the constant stream of requests from spinning threads trying to acquire the lock slows the lock holder down. The overhead of this approach, on the other hand, is minimal: test-and-set is trivial and the easiest of these schemes to implement, but it has the worst performance.
Test and Test and Set
The performance of test-and-test-and-set is slightly better than that of test-and-set. Its downside is the separation between detecting that the lock has been released and actually trying to acquire it: this window allows multiple threads to see that the lock is free, after which all of the threads that pass the test attempt a test-and-set instruction at once to try to acquire the lock. The overhead is about the same as test-and-set; there is still not a significant amount of overhead associated with test-and-test-and-set. In general, lower overhead comes at the cost of some performance.
API Synchronization
There are APIs for parallel architectures that provide specific types of synchronization. If these APIs are used the way they were designed, performance can be maximized while minimizing overhead. In general, there is a trade-off between performance and overhead; see the Choosing Appropriate Synchronization Primitives article [4]. A small usage sketch follows.
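For instance (a minimal sketch assuming POSIX threads; the counter and function names are invented for illustration), the pthreads API provides a blocking mutex so that callers need not write their own spin loops:

 #include <pthread.h>
 
 // One API-provided lock: a POSIX mutex protecting a shared counter.
 static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
 static long shared_count = 0;
 
 void increment_shared_count(void)
 {
     pthread_mutex_lock(&count_lock);     // blocks (rather than spins) if held
     shared_count++;                      // critical section
     pthread_mutex_unlock(&count_lock);
 }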
The figure below graphically shows these two methods and their actual performance (time versus number of threads) compared to the ideal desired performance of multiprocessing:
Improved Hardware Primitives
Load Locked
Load-locked (LL, also called load-linked) reads a value from a memory location and, as a side effect, has the hardware remember the address so that any intervening write to it can be detected.
Store Conditional
Store-conditional (SC) writes to the location touched by the matching load-locked only if no other processor has written to it in the meantime, and it returns a flag indicating whether the store succeeded. Together, an LL/SC pair lets software build atomic read-modify-write operations such as locks while spinning only on ordinary loads; the sketch below illustrates the idea.
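This section gives no code, but as an illustrative sketch (not from the source), C11's atomic_compare_exchange_weak is the portable analog of an LL/SC pair: on LL/SC architectures, compilers typically lower the loop below to a load-locked, a compare, and a store-conditional, retrying on spurious failure.

 #include <stdatomic.h>
 
 static atomic_int llsc_lockvar = 0;   // 0 = free, 1 = held
 
 void llsc_lock(void)
 {
     int expected;
     do {
         // "Load-locked" side: wait until the lock appears free.
         while (atomic_load_explicit(&llsc_lockvar, memory_order_relaxed) != 0)
             ;
         expected = 0;
         // "Store-conditional" side: store 1 only if llsc_lockvar is still 0;
         // the weak form may fail spuriously, so we simply retry.
     } while (!atomic_compare_exchange_weak_explicit(
                  &llsc_lockvar, &expected, 1,
                  memory_order_acquire, memory_order_relaxed));
 }
 
 void llsc_unlock(void)
 {
     atomic_store_explicit(&llsc_lockvar, 0, memory_order_release);
 }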
Barriers
Centralized Barrier
In a centralized barrier algorithm, each processor updates a counter to indicate that it has arrived at the barrier and then repeatedly polls a flag that is set when all threads have reached the barrier. Once all threads have arrived, each of them is allowed to continue past the barrier. The flag can be a sense-reversal flag, to prevent interference between adjacent barrier operations. A disadvantage of this type of barrier is that it does not scale well to a large number of processors; its advantage is that it uses a small amount of memory. See the code below:
 // Shared data:
 barrier_count   // number of threads that have arrived
 barrier_flag    // indicates that all threads have arrived
 team_size       // number of threads
 
 procedure central_barrier
     barrier_flag = team->barrier_flag
     new_count = atomic_inc(team->barrier_count)
     if (new_count == team_size)
         team->barrier_count = 0
         team->barrier_flag = barrier_flag ^ 1    // flip the sense
     else
         while (team->barrier_flag == barrier_flag)   // spin until flipped
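For comparison, here is a runnable C11 sketch of the same sense-reversing centralized barrier (an illustration under invented names, not code from the source):

 #include <stdatomic.h>
 
 // Hypothetical sense-reversing centralized barrier for team_size threads.
 typedef struct {
     atomic_int count;       // number of threads that have arrived
     atomic_int flag;        // flipped by the last thread to arrive
     int        team_size;   // number of participating threads
 } central_barrier_t;
 
 // Each thread passes the address of its own local_sense, initially 0.
 void central_barrier(central_barrier_t *b, int *local_sense)
 {
     *local_sense ^= 1;   // sense reversal: the flag value to wait for
     if (atomic_fetch_add(&b->count, 1) + 1 == b->team_size) {
         atomic_store(&b->count, 0);            // reset for the next episode
         atomic_store(&b->flag, *local_sense);  // release the spinning threads
     } else {
         while (atomic_load(&b->flag) != *local_sense)
             ;   // spin until the last arriver flips the flag
     }
 }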
Tree Barrier
In tree barriers, each thread is assigned to a node in the tree. The tree structure is such that a parent node notifies each of its child nodes by setting a flag, and similarly a child node responds to its parent node to let the parent know that it has arrived at the barrier. The parent node clears the flag only after ALL of its child nodes have reached the barrier. Eventually every level of the tree has all of its child nodes at the barrier, at which point the barrier can be crossed. The code below shows an implementation of a tree barrier:
 // Shared data:
 typedef struct {
     volatile boolean *parentflag;
     boolean *child_notify[2];
     volatile long havechild;
     volatile long childnotready;
     boolean wakeup_sense;
 } treenode;
 
 treenode shared_array[P];
 
 // Private data for each thread:
 volatile int parity;
 volatile boolean sense;
 treenode *mynode;
 
 procedure tree_barrier
     vpid = thread_id;
     treenode *mynode_reg;
     mynode_reg = cur_thread->mynode;
     while (mynode_reg->childnotready);           // wait for children to arrive
     mynode_reg->childnotready = mynode_reg->havechild;
     *mynode_reg->parentflag = False;             // notify parent of arrival
     if (vpid)
         while (mynode_reg->wakeup_sense != cur_thread->sense);  // await wakeup
     *mynode_reg->child_notify[0] = cur_thread->sense;           // wake children
     *mynode_reg->child_notify[1] = cur_thread->sense;
     cur_thread->sense ^= True;
The figure below shows how each barrier implementation performs for a given number of threads, plotting the time each barrier type takes to complete against the number of threads.
References
1. http://www-inst.eecs.berkeley.edu/~n252/su06/CS252Midterm_Hermoso.pdf
2. ftp://ftp.cs.wisc.edu/wwt/ics96_synch.pdf
3. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf
4. http://software.intel.com/en-us/articles/use-non-blocking-locks-when-possible/