CSC/ECE 506 Spring 2011/ch5 LL: Difference between revisions

From Expertiza_Wiki
Jump to navigation Jump to search
No edit summary
No edit summary
Line 57: Line 57:
We have presented the pseudo code for the insert and the remove operations below. The linked list assumes a head pointer, nodes are added to the head. The algorithm permits any node in the list to be removed. A node once removed from the list is not returned back to the list. The list is null terminated.
We have presented the pseudo code for the insert and the remove operations below. The linked list assumes a head pointer, nodes are added to the head. The algorithm permits any node in the list to be removed. A node once removed from the list is not returned back to the list. The list is null terminated.


The insert_at_head function, reads the head pointer and tries to atomically update it to the node being inserted in the list.  The CAS can fail in case of contention, in that case the head is re-read inside the <code>do-while</code> loop and the CAS is retried.


   void insert_at_head (Node* new_node) {
   void insert_at_head (Node* new_node) {
Line 64: Line 65:
     } while  (old_head != CAS (new_head, &head, new_node))  // CAS head to the new_node, re-execute loop if CAS fails.
     } while  (old_head != CAS (new_head, &head, new_node))  // CAS head to the new_node, re-execute loop if CAS fails.
   }
   }
The remove function, walks down the list till it finds the remove_node i.e  the node to be removed from the list. The next pointer from the previous node is then atomically updated to the remove_node's next pointer.  Each node has a <code>is_removed</code> flag, this is to distinguish a node that has been removed, from the final node on the list. This flag is set after the CAS succeeds, a list traversal at this point will not find the node. The next pointer of the remove_node is then set to <code>NULL</code>. The order of the writes to these fields i.e. the flag and the next pointer fields, is important.  In an architecture, that allows for store re-ordering [], a reader could see the NULL value of the next pointer before it sees the correct value of the flag. This would result in the list being corrupted, since the remove function would assume that this is the final node of the list. To avoid this, we need to make use of memory fence instructions[] which enforce the store ordering.  This is covered in good detail in chapter 10 of the textbook [].
The stores can also be re-ordered by the compiler since these are writes to independent locations. To avoid store reordering by the compiler, both the flag field and the next pointer field should be declared as <code>volatile</code>.


   void remove(Node* remove_node) {
   void remove(Node* remove_node) {
Line 84: Line 88:




==Insert non-unique case aka the ABA problem==
==Insert non-unique elements case aka the ABA problem==
==Tag-based solution==
===Tag-based solution===
==Hazard pointers==
===Hazard pointers===
==DCAS based solutions==
===DCAS based solutions===
 
=Memory model and barrier instructions=
=Memory model and barrier instructions=
==The X86 memory model==
==The X86 memory model==

Revision as of 08:02, 28 February 2011

Supplement to Chapter 5: Parallel Programming For Linked Data Structures

Introduction

Chapter 5 of Solihin (2008) discusses parallel algorithms for operating on Linked data structures. The focus is primarily on thread-safe accesses based on locks. The semantics of locks imply that a single thread can access the data structure at any given time. The chapter discusses optimizations such as finer grained locks and read-write locks to enable more parallelization. This supplements discusses non-blocking algorithms to access linked data structures and the related issues.

Non blocking algorithms

Motivation

Parallel algorithms enable threads (or processes) to concurrently access shared data structures in a thread-safe manner. All lock based algorithms are examples of blocking algorithms, since only one thread can be present in the critical section while other threads trying to get in will block on the lock. All the methods outlined in chapter 5 Solihin (2008) are lock based and hence are blocking in nature. Lock based algorithms have the following drawbacks:

  • Blocking limits the extent of scalability, especially when the granularity of the lock is too coarse.
  • Locks can result in deadlocks and livelocks [].
  • Locks can cause priority inversion i.e. high priority threads/processes cannot proceed if a low priority thread/process is holding the common lock.
  • Locks can result in convoying. All other threads have to wait if a thread holding a lock is descheduled due to a time-slice interrupt or page fault (See lock convoy)
  • Locks can cause other issues such as thundering herd problem []

Non-blocking algorithms permit truly concurrent thread-safe access to shared data structures where threads don't block on access. A thread might however have to retry to achieve it's objective. In the modern computing era, multi-core architecture [] is ubiquitous. In these architectures non-blocking algorithms can make use of the available compute rather than the blocking approach where the extra cores would generally be wasted. Thus non-blocking algorithms have significant throughput benefits on current architectures.

Categories

Non-blocking algorithms fall into three main categories. This taxonomy is primarily driven by the level of ‘progress guarantee’ for an individual thread level or a system as a whole.

Wait-freedom: A synchronization technique is wait-free if it ensures that every thread will continue to make progress in the face of arbitrary delay (or even failure) of other threads. Thus it combines guaranteed system-wide throughput with starvation-freedom and is the strongest non-blocking guarantee of progress.

Lock-freedom: An algorithm is lock-free if it ensures only that some thread always makes progress. Thus it allows individual threads to starve but guarantees system-wide throughput. All wait-free algorithms are lock-free

Obstruction-freedom [Herlihy-obs-free]: A synchronization technique is obstruction-free if it guarantees progress for any thread that eventually executes in isolation. Even though other threads may be in the midst of executing operations, a thread is considered to execute in isolation as long as the other threads do not take any steps. (Pragmatically, it is enough for the thread to run long enough without encountering a synchronization conflict from a concurrent thread.) Like the wait-free and lock-free conditions, obstruction-free synchronization ensures that no thread can be blocked by delays or failures of other threads. This property is weaker than lock-free synchronization, because it does not guarantee progress when two or more conflicting threads are executing concurrently. All lock-free algorithms are obstruction-free.

In practice, lock-free and obstruction free algorithms are the most common. Wait-freedom comes at a substantial performance cost [Herlihy-obs-free]

Building blocks

All non-blocking algorithms are built on atomic primitives provided by the underlying architecture. The most commonly found primitives are:

  • Compare And Swap (CAS []): A CAS instruction is used to swap in a new value at an address location. The instruction succeeds only if the value at the address location matches the expected old-value at that location. The instruction returns the value at the address location. The instruction is atomic i.e. if there are multiple threads trying to CAS-in a value at a location, only one of them would succeed. An example for the use case of the CAS instruction is shown below:

  do {
         old_value = *addr;                // read old_value
         new_value = old_value + x; // compute new_value
   } while (old_value != CAS(new_value, addr, old_value) // re-execute loop if CAS doesn't succeed 


  • Load Linked/Store Conditional (LL-SC []):

These are a pair of instructions that together implement a lock-free atomic read-modify-write operation. Load-link returns the current value of a memory location. A subsequent store-conditional to the same memory location will store a new value only if no updates have occurred to that location since the load-link.

Most modern architectures provide the CAS instruction. The power-pc architecture provides a LL-SC implementation. Most compiler/tool chains provide convenient wrappers around these instructions, e.g. gcc[]. It is important to note that the atomic primitives like other memory instructions operate at the granularity of a single memory word. 64 bit machines have 64 bit memory words, while 32 bit machines operate on 32 bit memory words. Algorithms that use these atomic primitives, frequently rely on being able to atomically update two distinct fields of a data structure. This is generally achieved by concatenating the two fields to be part of a single memory word.

Lock-free singly linked list implementation

We discuss a lock-free implementation of a singly linked list. This algorithm being lock-free, ensures that at least one of the threads in the system will make progress. Starvation is not ruled out. The algorithm supports insert, remove and iterate operations on the list. Iteration is safe as long as elements are not deleted i.e. the underlying memory is freed. Removing an element from the list is distinct from freeing up the underlying memory for the node and is permitted. Nodes can be completely recycled by garbage collection at safe points where no thread is accessing the list. We consider two distinct cases, depending upon whether a node can be returned to the list. The algorithm relies on the fact that the insertion and removal operations require an atomically updating a single pointer. This pointer being the size of the memory word and can be updated using the CAS instruction.

Insert unique elements case

We have presented the pseudo code for the insert and the remove operations below. The linked list assumes a head pointer, nodes are added to the head. The algorithm permits any node in the list to be removed. A node once removed from the list is not returned back to the list. The list is null terminated.

The insert_at_head function, reads the head pointer and tries to atomically update it to the node being inserted in the list. The CAS can fail in case of contention, in that case the head is re-read inside the do-while loop and the CAS is retried.

  void insert_at_head (Node* new_node) {
   do {
          old_head =  head;   // read old_head
          new_node->_next = old_head;  // assign the next pointer of the new-node to the head          
    } while  (old_head != CAS (new_head, &head, new_node))   // CAS head to the new_node, re-execute loop if CAS fails.
  }

The remove function, walks down the list till it finds the remove_node i.e the node to be removed from the list. The next pointer from the previous node is then atomically updated to the remove_node's next pointer. Each node has a is_removed flag, this is to distinguish a node that has been removed, from the final node on the list. This flag is set after the CAS succeeds, a list traversal at this point will not find the node. The next pointer of the remove_node is then set to NULL. The order of the writes to these fields i.e. the flag and the next pointer fields, is important. In an architecture, that allows for store re-ordering [], a reader could see the NULL value of the next pointer before it sees the correct value of the flag. This would result in the list being corrupted, since the remove function would assume that this is the final node of the list. To avoid this, we need to make use of memory fence instructions[] which enforce the store ordering. This is covered in good detail in chapter 10 of the textbook []. The stores can also be re-ordered by the compiler since these are writes to independent locations. To avoid store reordering by the compiler, both the flag field and the next pointer field should be declared as volatile.

  void remove(Node* remove_node) {
   new_next_node =  remove_node->_next
   if (new_next_node  == NULL) { 
       if (remove_node->_is_removed) 
         return; // node has already been removed so exit
   }
   do {
          old_head =  head;   // read old_head
          while (old_head->_next != remove_node) {
             old_head = old_head->_next;  // walk the list till you find the node
          }          
    } while  (remove_node != CAS (new_next_node, &(old_head->_next), remove_node))   // CAS the next pointer to the node after the node to be removed
    remove_node->is_removed = true; 
    remove_node->_next = NULL
  }



Insert non-unique elements case aka the ABA problem

Tag-based solution

Hazard pointers

DCAS based solutions

Memory model and barrier instructions

The X86 memory model

Barrier instructions

Linked list example of use of barrier instructions

Skip lists

References



Appendix

// Sequential code, from Solihin (2008), p. 25.

for (i = 0; i < 8; i++)
    a[i] = b[i] + c[i];
sum = 0;
for (i = 0; i < 8; i++)
    if (a[i] > 0)
        sum = sum + a[i];
Print sum;