CSC/ECE 506 Spring 2010/Ch 9/Synchronization: Difference between revisions

Revision as of 03:20, 19 April 2010

Hardware Support For Synchronization

Hardware Implementations

Hardware implementations for synchronization traditionally include locks, barriers, and mutual exclusion. These types of hardware synchronizations use a method called 'busy-waiting' or 'spinning' which prevents threads from continuing to execute. Spinning simply means that a thread continuously checks if it is okay to continue (no other useful work is accomplished by the thread). In the case for a lock, multiple threads are competing for access to the lock. Only one thread is allowed to access the critical code protected by the lock at a time. For barriers, multiple threads reach the barrier at different times and no thread can continue until all threads have reached the barrier. Spinning is implemented in hardware through mutexes or semaphores, which prevent multiple processes from being accessed at the same time. Atomic instructions, which are support by the processor Instruction Set Architecture(ISA), are typically used for hardware synchronization mechanism. Examples of these hardware synchronization mechanisms include: Test-and-Set, Fetch-and-Increment, Test-and-Test-and-Set. Below are descriptions of these atomic instructions:

Software Implementations

The software implementations discussed are specifically used in Networks of Workstations (NOWs). Using explicit messages, synchronization can be implemented using software.

Mutual Exclusion

Mutual Exclusion is the process of preventing concurrent access to shared variables in a parallel program. Mutual exclusion is implement with locks. A lock only allows a single thread access to a critical section of code. Below are several hardware locks that provide mutual exclusion:

Test-and-Set Rx, M: Read the value stored in memory location M, test the value against a constant, and if they match, write the value in register Rx to memory location M.

Fetch-and-Increment M: Read the value stored in memory location M, increment it, and then store the new value to the memory location.

Test-and-Test-and-Set: This is the the same as Test-and-Set, except that there is an extra testing phase. Only when the atomic Test-and-set instruction has a good chance of succeeding, the atomic instruction is executed. See example code below:

 lock: ld R1, &lockvar // R1 = lockvar
       bnz R1, lock // jump to lock if R1 != 0
       t&s R1, &lockvar // test-and-set atomic instruction
                                     // R1=lockvar; if(R1 == 0) lockvar=1
       bnz R1, lcok //jump to lock if R1 != 0
       ret     //return to caller
 unlock: st &lockvar, #0 //lockvar = 0
         ret  //return to caller

Overhead

Test and Set

The performance of test-and-set degrades significantly as the number of processors increases. There are two reasons for this performance decrease. First, a thread must contend with other threads in order to release the lock. The second reason is that the requests by the spinning threads trying to access the lock slow the lock holder. The overhead of this approach is very minimal. There is not much overhead associated with test-and-set. This approach is very trivial. It is the easiest to implement, but has the worst performance.

Test and Test and Set

The performance of test-and-test-and-set is slightly better than test-and-set. The downside to test-and-test-and-set performance is that there is a separation between detecting that the lock has been released and trying to acquire the lock. The separation allows multiple threads to see that the lock is free, and then the threads that pass the test attempt a test-and-set instruction to try to acquire the lock. The overhead is about the same as test-and-set. There is still not a significant amount of overhead associated with test-and-test-and-set. Performance is sacrificed when overhead is smaller.

API Synchronization

There are API's that exist for parallel architectures that provide specific types of synchronization. If the API are used they way they were design, performance can be maximized while minimizing overhead. In general, there is a trade-off between performance and overhead. See Choosing Appropriate Synchronization Primitives article

The figure below shows graphically these two methods and there actual performance (time vs number of threads) compared to the ideal desired performance of multiprocessing:

Improved Hardware Primitives

Load Locked, Store Conditonal

Load Locked(LL) and Store Conditional(SC) are a pair of instructions that are used for lock-free read-modify-write operation. Performance-wise the LL/SC implementation is similar to the test-and-test-and-set, however, it is much simpler with the extra linked registers and it can be used to implement many atomic instructions. The advantage of this method is that there less wait traffic compared to test-and-test-and-set.

Barriers

Centralized Barrier

In a centralized barrier algorithm, each processor updates a counter to indicate that it has arrived at the barrier and then repeatedly polls a flag that is set when all threads have reached the barrier. Once all threads have arrived, each of them is allowed to continue past the barrier. The flag can be a sense reversal flag, to prevent intervention of adjacent barrier operations. A disadvantage to this type of barrier is that it does not scale well with a large number of processors, however, its advantage is that it uses a small amount of memory. See code below:

 // Shared data. 
 barrier_count // not yet arrived 
 barrier_flag // indicate all have arrived 
 team_size // number of threads
 procedure central_barrier 
    barrier_flag=team->barrier-flag 
    new_count = atomic_inc(barrier_count)
 if (new_count == team_size) 
    team->barrier_count = 0 
    team->barrier_flag = barrier_flag ^ 1
 else 
    while (team->barrier_flag == barrier_flag)

Tree Barrier

In tree barriers, each thread is assigned to a node in the tree. The tree structure is such that a parent node notifies each of its child nodes by setting a flag, and similarly the child node responds to its parent node in order to let the parent node know that it has arrived at the barrier. The parent node clears the flag only after ALL of its child nodes have reached the barrier. Eventually all levels of tree will have all its child node reach the barrier, at which point the barrier can be crossed. The code below shows an implementation of a tree barrier:

 // Shared data:
 typedef struct { volatile boolean *parentflag; 
                  boolean *child_notify[2]; 
                  volatile long havechild; 
                  volatile long childnotready; 
                  boolean wakeup_sense;
 } treenode; 
 treenode shared_array[P]; 
 
 Private data for each thread: 
   volatile int parity; 
   volatile boolean sense; 
   treenode *mynode;
 
 procedure tree_barrier 
   vpid=thread_id; 
   treenode *mynode_reg; 
   mynode_reg = cur_thread->mynode;
 
   while (mynode_reg->childnotready);
 
   mynode_reg->childnotready = mynode_reg->havechild; 
   *mynode_reg->parentflag = False;
 
   if (vpid) 
     while(mynode_reg->wakeup_sense != cur_thread->sense);
 
   *mynode_reg->child_notify[0] = cur_thread->sense; 
   *mynode_reg->child_notify[1] = cur_thread->sense; 
   cur_thread->sense ^= True;

The Figure below shows how each barrier implementation performs with the certain number of threads. The graph shows how long it takes each barrier type to implement the barrier with respect to the number of threads.

References

1. http://www-inst.eecs.berkeley.edu/~n252/su06/CS252Midterm_Hermoso.pdf

2. ftp://ftp.cs.wisc.edu/wwt/ics96_synch.pdf

3. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf

4. http://software.intel.com/en-us/articles/choosing-appropriate-synchronization-primitives-to-minimize-overhead/

5. http://software.intel.com/en-us/articles/use-synchronization-routines-provided-by-the-threading-api-rather-than-hand-coded-synchronization/

6. http://software.intel.com/en-us/articles/use-non-blocking-locks-when-possible/

7. http://www.cs.washington.edu/homes/tom/pubs/spinlock.pdf

8. http://www.cs.utah.edu/avalanche/sync.ps

@@ Line 3: / Line 3: @@
 ===Hardware Implementations===
-Hardware implementations for synchronization traditionally include locks, barriers, and mutual exclusion.  These types of hardware synchronizations use a method called busy-waiting, or spinning which prevents threads from continuing to execute.  Spinning simply means that a thread continuously checks if it is okay to continue (no other useful work is accomplished by the thread).  In the case for a lock, multiple threads are competing for access to the lock.  Only one thread is allowed to access the critical code protected by the lock at a time.  For barriers, multiple threads reach the barrier at different times and no thread can continue until all threads have reached the barrier.  Spinning is implemented in hardware through mutexes or semaphores, which prevent multiple processes from being accessed at the same time.  Atomic instructions, which are support by the processor Instruction Set Architecture(ISA), are typically used for hardware synchronization mechanism.  Examples of these hardware synchronization mechanisms include: Test-and-Set, Fetch-and-Increment, Test-and-Test-and-Set.  Below are descriptions of these atomic instructions:
+Hardware implementations for synchronization traditionally include locks, barriers, and mutual exclusion.  These types of hardware synchronizations use a method called 'busy-waiting' or 'spinning' which prevents threads from continuing to execute.  Spinning simply means that a thread continuously checks if it is okay to continue (no other useful work is accomplished by the thread).  In the case for a lock, multiple threads are competing for access to the lock.  Only one thread is allowed to access the critical code protected by the lock at a time.  For barriers, multiple threads reach the barrier at different times and no thread can continue until all threads have reached the barrier.  Spinning is implemented in hardware through mutexes or semaphores, which prevent multiple processes from being accessed at the same time.  Atomic instructions, which are support by the processor Instruction Set Architecture(ISA), are typically used for hardware synchronization mechanism.  Examples of these hardware synchronization mechanisms include: Test-and-Set, Fetch-and-Increment, Test-and-Test-and-Set.  Below are descriptions of these atomic instructions:
 ===Software Implementations===

CSC/ECE 506 Spring 2010/Ch 9/Synchronization: Difference between revisions

Revision as of 03:20, 19 April 2010

Contents

Hardware Support For Synchronization

Hardware Implementations

Software Implementations

Mutual Exclusion

Overhead

Test and Set

Test and Test and Set

API Synchronization

Improved Hardware Primitives

Load Locked, Store Conditonal

Barriers

Centralized Barrier

Tree Barrier

References

Navigation menu

CSC/ECE 506 Spring 2010/Ch 9/Synchronization: Difference between revisions

Revision as of 03:20, 19 April 2010

Hardware Support For Synchronization

Hardware Implementations

Software Implementations

Mutual Exclusion

Overhead

Test and Set

Test and Test and Set

API Synchronization

Improved Hardware Primitives

Load Locked, Store Conditonal

Barriers

Centralized Barrier

Tree Barrier

References

Navigation menu

Search