CSC/ECE 506 Spring 2010/Ch 9/Synchronization

Hardware Support For Synchronization

Hardware Implementations

Hardware implementations for synchronization traditionally include locks, barriers, and mutual exclusion. These types of hardware synchronizations use a method called busy-waiting, or spinning which prevents threads from continuing to execute. Spinning simply means that a thread continuously checks if it is okay to continue (no other useful work is accomplished by the thread). In the case for a lock, multiple threads are competing for access to the lock. Only one thread is allowed to access the critical code protected by the lock at a time. For barriers, multiple threads reach the barrier at different times and no thread can continue until all threads have reached the barrier. Spinning is implemented in hardware through mutexes or semaphores, which prevent multiple processes from being accessed at the same time. Atomic instructions, which are support by the processor Instruction Set Architecture(ISA), are typically used for hardware synchronization mechanism. Examples of these hardware synchronization mechanisms include: Test-and-Set, Fetch-and-Increment, Test-and-Test-and-Set. Below are descriptions of these atomic instructions:

Software Implementations

The software implementations discussed are specifically used in Networks of Workstations (NOWs). Using explicit messages, synchronization can be implemented using software.

Mutual Exclusion

Mutual Exclusion is the process of preventing concurrent access to shared variables in a parallel program. Mutual exclusion is implement with locks. A lock only allows a single thread access to a critical section of code. Below are several hardware locks that provide mutual exclusion:

Test-and-Set Rx, M: Read the value stored in memory location M, test the value against a constant, and if they match, write the value in register Rx to memory location M.

Fetch-and-Increment M: Read the value stored in memory location M, increment it, and then store the new value to the memory location.

Test-and-Test-and-Set: This is the the same as Test-and-Set, except that there is an extra testing phase. Only when the atomic Test-and-set instruction has a good chance of succeeding, the atomic instruction is executed. See example code below:

 lock: ld R1, &lockvar // R1 = lockvar
       bnz R1, lock // jump to lock if R1 != 0
       t&s R1, &lockvar // test-and-set atomic instruction
                                     // R1=lockvar; if(R1 == 0) lockvar=1
       bnz R1, lcok //jump to lock if R1 != 0
       ret     //return to caller
 unlock: st &lockvar, #0 //lockvar = 0
         ret  //return to caller

Overhead

Improved Hardware Primitives

Barriers

Centralized Barrier

In a centralized barrier algorithm, each processor updates a counter to indicate that it has arrived at the barrier and then repeatedly polls a flag that is set when all threads have reached the barrier. Once all threads have arrived, each of them is allowed to continue past the barrier. The flag can be a sense reversal flag, to prevent intervention of adjacent barrier operations. A disadvantage to this type of barrier is that it does not scale well with a large number of processors, however, its advantage is that it uses a small amount of memory. See code below:

 // Shared data. 
 barrier_count // not yet arrived 
 barrier_flag // indicate all have arrived 
 team_size // number of threads
 procedure central_barrier 
    barrier_flag=team->barrier-flag 
    new_count = atomic_inc(barrier_count)
 if (new_count == team_size) 
    team->barrier_count = 0 
    team->barrier_flag = barrier_flag ^ 1
 else 
    while (team->barrier_flag == barrier_flag)

Tree Barrier

In tree barriers, each thread is assigned to a node in the tree. The tree structure is such that a parent node notifies each of its child nodes by setting a flag, and similarly the child node responds to its parent node in order to let the parent node know that it has arrived at the barrier. The parent node clears the flag only after ALL of its child nodes have reached the barrier. Eventually all levels of tree will have all its child node reach the barrier, at which point the barrier can be crossed. The code below shows an implementation of a tree barrier:

 // Shared data:
 typedef struct { volatile boolean *parentflag; 
                  boolean *child_notify[2]; 
                  volatile long havechild; 
                  volatile long childnotready; 
                  boolean wakeup_sense;
 } treenode; 
 treenode shared_array[P]; 
 
 Private data for each thread: 
   volatile int parity; 
   volatile boolean sense; 
   treenode *mynode;
 
 procedure tree_barrier 
   vpid=thread_id; 
   treenode *mynode_reg; 
   mynode_reg = cur_thread->mynode;
 
   while (mynode_reg->childnotready);
 
   mynode_reg->childnotready = mynode_reg->havechild; 
   *mynode_reg->parentflag = False;
 
   if (vpid) 
     while(mynode_reg->wakeup_sense != cur_thread->sense);
 
   *mynode_reg->child_notify[0] = cur_thread->sense; 
   *mynode_reg->child_notify[1] = cur_thread->sense; 
   cur_thread->sense ^= True;