CSC/ECE 506 Spring 2011/ch5 LL
Supplement to Chapter 5: Parallel Programming For Linked Data Structures
Introduction
Chapter 5 of Solihin (2008) discusses parallel algorithms for operating on linked data structures. The focus is primarily on thread-safe accesses based on locks. The semantics of locks imply safety based on mutual exclusion, i.e., only a single thread can access the data structure at any given time. The chapter discusses optimizations such as finer-grained locks and read-write locks to enable more parallelism. This supplement discusses non-blocking algorithms for accessing linked data structures and the related issues.
Non-blocking algorithms
Motivation
Parallel algorithms enable threads (or processes) to concurrently access shared data structures in a thread-safe manner. All lock-based algorithms are blocking algorithms, since only one thread can be inside the critical section while other threads trying to enter block on the lock. All the methods outlined in Chapter 5 of Solihin (2008) are lock-based and hence blocking in nature. Lock-based algorithms have the following drawbacks:
- Blocking limits the extent of scalability, especially when the granularity of the lock is too coarse.
- Locks can result in deadlocks and livelocks.
- Locks can cause priority inversion, i.e., high-priority threads/processes cannot proceed if a low-priority thread/process is holding the common lock.
- Locks can result in convoying: all other threads have to wait if a thread holding a lock is descheduled due to a time-slice interrupt or page fault (see lock convoy).
- Locks can cause other issues, such as the thundering herd problem.
Non-blocking algorithms permit truly concurrent thread-safe access to shared data structures, where threads do not block on access. A thread might, however, have to retry to achieve its objective. Multi-core architectures are now ubiquitous, and on them non-blocking algorithms can make use of the available cores, whereas under a blocking approach those extra cores would largely sit idle. Thus non-blocking algorithms can offer significant throughput benefits on current architectures.
Categories
Non-blocking algorithms fall into three main categories. This taxonomy is driven primarily by the level of progress guarantee provided for an individual thread or for the system as a whole.
Wait-freedom: A synchronization technique is wait-free if it ensures that every thread will continue to make progress in the face of arbitrary delay (or even failure) of other threads. Thus it combines guaranteed system-wide throughput with starvation-freedom and is the strongest non-blocking guarantee of progress.
Lock-freedom: An algorithm is lock-free if it ensures only that some thread always makes progress. Thus it allows individual threads to starve but guarantees system-wide throughput. All wait-free algorithms are lock-free.
Obstruction-freedom [Herlihy-obs-free]: A synchronization technique is obstruction-free if it guarantees progress for any thread that eventually executes in isolation. Even though other threads may be in the midst of executing operations, a thread is considered to execute in isolation as long as the other threads do not take any steps. (Pragmatically, it is enough for the thread to run long enough without encountering a synchronization conflict from a concurrent thread.) Like the wait-free and lock-free conditions, obstruction-free synchronization ensures that no thread can be blocked by delays or failures of other threads. This property is weaker than lock-free synchronization, because it does not guarantee progress when two or more conflicting threads are executing concurrently. All lock-free algorithms are obstruction-free.
In practice, lock-free and obstruction-free algorithms are the most common; wait-freedom comes at a substantial performance cost [Herlihy-obs-free].
Building blocks
All non-blocking algorithms are built on atomic primitives provided by the underlying architecture. The most commonly found primitives are:
- Compare And Swap (CAS): A CAS instruction swaps a new value into an address location. The instruction succeeds only if the value at the address location matches the expected old value, and it returns the value found at the address location. The instruction is atomic, i.e., if multiple threads try to CAS-in a value at a location, only one of them will succeed. The pseudocode in this chapter uses the convention CAS(new_value, addr, old_value): write new_value to *addr only if *addr equals old_value, returning the value found at *addr. An example use of the CAS instruction is shown below:
    do {
        old_value = *addr;            // read the current value
        new_value = old_value + x;    // compute the new value
    } while (old_value != CAS(new_value, addr, old_value)); // re-execute the loop if the CAS doesn't succeed
- Load Linked/Store Conditional (LL-SC):
These are a pair of instructions that together implement a lock-free atomic read-modify-write operation. Load-link returns the current value of a memory location. A subsequent store-conditional to the same memory location will store a new value only if no updates have occurred to that location since the load-link.
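A minimal sketch of how an LL-SC pair is typically used is shown below. The LL and SC functions here are hypothetical stand-ins for the underlying hardware instructions (on PowerPC, the lwarx/stwcx. pair), not real C library calls:

    /* Hypothetical LL/SC primitives; a real implementation maps these to
       the architecture's instructions (e.g., lwarx/stwcx. on PowerPC). */
    extern int LL(int* addr);            // load-linked: read *addr and set a reservation on it
    extern int SC(int* addr, int value); // store-conditional: store only if the reservation still holds; non-zero on success

    void atomic_increment(int* addr) {
        int value;
        do {
            value = LL(addr) + 1;    // read the current value and compute the increment
        } while (!SC(addr, value));  // retry if another thread wrote *addr since the LL
    }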
Most modern architectures provide a CAS instruction; the PowerPC architecture provides an LL-SC implementation. Most compiler toolchains provide convenient wrappers around these instructions, e.g., GCC's atomic builtins. It is important to note that atomic primitives, like other memory instructions, operate at the granularity of a single memory word: 64-bit machines have 64-bit memory words, while 32-bit machines operate on 32-bit memory words. Algorithms that use these atomic primitives frequently rely on being able to atomically update two distinct fields of a data structure; this is generally achieved by packing the two fields into a single memory word.
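As a concrete illustration, the CAS loop above can be written with GCC's __sync_val_compare_and_swap builtin (see the GCC atomic builtins reference below). Note that the builtin takes its arguments as (addr, old_value, new_value), unlike the CAS(new_value, addr, old_value) convention used in this chapter's pseudocode:

    #include <stdint.h>

    /* Atomically add x to *addr using GCC's CAS builtin.
       __sync_val_compare_and_swap(ptr, oldval, newval) stores newval into *ptr
       only if *ptr == oldval, and returns the value it found at *ptr. */
    void atomic_add(volatile uint64_t* addr, uint64_t x) {
        uint64_t old_value, new_value;
        do {
            old_value = *addr;            // read the current value
            new_value = old_value + x;    // compute the new value
        } while (__sync_val_compare_and_swap(addr, old_value, new_value) != old_value);
    }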
Lock-free singly linked list implementation
We discuss a lock-free implementation of a singly linked list. Being lock-free, this algorithm ensures that at least one of the threads in the system makes progress; starvation is not ruled out. The algorithm supports insert, remove, and iterate operations on the list. Iteration is safe as long as elements are not deleted, i.e., the underlying memory is not freed. Removing an element from the list is distinct from freeing the node's underlying memory, and is permitted. Nodes can be completely recycled by garbage collection at safe points where no thread is accessing the list. We consider two distinct cases, depending on whether a node can be returned to the list. The algorithm relies on the fact that the insert and remove operations each require atomically updating a single pointer; since this pointer is the size of a memory word, it can be updated using the CAS instruction. The list has a head pointer, and nodes are added at the head. The algorithm listed below permits any node in the list to be removed.
Insert unique elements
We present the pseudocode for the insert and remove operations below. A node, once removed from the list, is not returned to the list. The list is null terminated.
The insert_at_head function reads the head pointer and tries to atomically update it to point to the node being inserted. The CAS can fail under contention; in that case the head is re-read inside the do-while loop and the CAS is retried.
    void insert_at_head(Node* new_node) {
        do {
            old_head = head;               // read the current head
            new_node->_next = old_head;    // point the new node's next pointer at the current head
        } while (old_head != CAS(new_node, &head, old_head)); // CAS head to new_node; re-execute the loop if the CAS fails
    }
The remove function walks down the list until it finds remove_node, i.e., the node to be removed. The next pointer of the previous node is then atomically updated to remove_node's next pointer. Each node has an _is_removed flag, which distinguishes a node that has been removed from the final node of the list. This flag is set after the CAS succeeds; a list traversal at this point will no longer find the node. The next pointer of remove_node is then set to NULL. The order of the writes to these two fields, the flag and the next pointer, is important. On an architecture that allows store reordering, a reader could see the NULL next pointer before it sees the updated flag and would wrongly conclude that it has reached the final node of the list, corrupting the traversal. To avoid this, we must use memory fence instructions, which enforce the store ordering. This is covered in detail in Chapter 10 of Solihin (2008).
The stores can also be reordered by the compiler, since they are writes to independent locations. To prevent the compiler from reordering them, both the flag field and the next pointer field should be declared volatile. A sketch of the fence placement follows the code below.
    void remove(Node* remove_node) {
        new_next_node = remove_node->_next;
        if (new_next_node == NULL) {
            if (remove_node->_is_removed)
                return;                        // the node has already been removed, so exit
        }
        do {
            old_head = head;                   // read the current head
            while (old_head->_next != remove_node) {
                old_head = old_head->_next;    // walk the list until the predecessor is found
            }
        } while (remove_node != CAS(new_next_node, &(old_head->_next), remove_node)); // CAS the predecessor's next pointer to the node after remove_node
        remove_node->_is_removed = true;       // mark the node as removed first ...
        remove_node->_next = NULL;             // ... then clear its next pointer (see the ordering discussion above)
    }
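As a minimal sketch of the ordering fix discussed above, assuming GCC's __sync_synchronize() full memory barrier (the mark_removed helper name is ours, introduced for illustration):

    typedef struct Node {
        struct Node* volatile _next;    // volatile prevents compiler reordering of these stores
        volatile int _is_removed;       // set once the node has been unlinked
    } Node;

    /* Publish the removal in the required order: set the flag first,
       issue a fence, and only then clear the next pointer. */
    void mark_removed(Node* remove_node) {
        remove_node->_is_removed = 1;   // step 1: mark the node as logically removed
        __sync_synchronize();           // full memory barrier: the flag store becomes visible first
        remove_node->_next = NULL;      // step 2: clear the next pointer
    }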
Allowing elements to be returned to the list, a.k.a. the ABA problem
We now consider the case where a node can be returned to the list, and both inserts and removes happen at the head of the list. Assume two threads are updating the list: thread-1 is trying to remove the head node, while thread-2 executes both remove and insert operations.
The state of the list according to thread-1 is: head->A->B->C. Thread-1 reads A's next pointer, which points to B, and is then preempted by the scheduler before it executes its CAS instruction. In the meanwhile, thread-2 does multiple updates of the list and transitions it to: head->A->C->D; that is, it removes A from the list and later inserts it back at the head. Thread-1 now gets to execute its CAS instruction, which updates the head pointer to A's next pointer, node B according to thread-1. The CAS succeeds, since the head still points to A. The list now looks like: head->B. We have inadvertently removed nodes C and D from the list and linked in the already-removed node B, corrupting the list. This is referred to as the ABA problem.
There are a few methods to resolve this issue, and we highlight some of them below.
Tag-based solution
A counter, or tag, field is concatenated with the head pointer. This counter is incremented on each insert onto the list. Thus, in the above case, even if node A is returned to the list, the CAS by thread-1 will fail, since the counter value was incremented by thread-2 when it re-inserted node A into the list. The sequence of events listed below demonstrates how adding the tag to the head pointer field resolves the ABA issue.
- thread 1, reads head, head is (A, 1)
- thread 2 removes node A, head becomes (B, 1)
- thread 2 inserts A, head becomes (A, 2) // increment tag
- thread 1 tries to remove A, but the CAS fails because of the tag field mismatch
It is important to note that the counter and the pointer field are part of the same memory word. Current 64-bit x86 architectures use only 48 bits for addressing, which allows the top 16 bits of the memory word to be used for tagging purposes. This solution does, however, have a flaw: it is possible for the counter to overflow and wrap back to the same value that thread-1 saw in the above example. With 16 bits used for tagging, this would take a very long time, and the likelihood of it happening is quite remote. In most practical systems, this is the solution used to get around the ABA issue; the failure caused by counter overflow is rare enough to be equated with any other random failure caused by hardware malfunction. A sketch of a tagged head pointer is shown below.
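The sketch below shows one way to pack a tag into the head pointer, assuming user-space x86-64 addresses that fit in 48 bits and GCC's __sync_val_compare_and_swap builtin; the pack, ptr_of, and tag_of helper names are ours, introduced for illustration:

    #include <stdint.h>

    typedef struct Node { struct Node* _next; } Node;
    typedef uint64_t tagged_ptr;    // low 48 bits: Node pointer; high 16 bits: tag

    volatile tagged_ptr head;       // head pointer and tag packed into one memory word

    tagged_ptr pack(Node* ptr, uint16_t tag) {
        return ((uint64_t)tag << 48) | ((uint64_t)(uintptr_t)ptr & 0xFFFFFFFFFFFFULL);
    }
    Node* ptr_of(tagged_ptr tp)    { return (Node*)(uintptr_t)(tp & 0xFFFFFFFFFFFFULL); }
    uint16_t tag_of(tagged_ptr tp) { return (uint16_t)(tp >> 48); }

    /* Insert at the head, bumping the tag on every insert so that a stale
       (pointer, tag) pair read by a preempted thread never matches. */
    void insert_at_head(Node* new_node) {
        tagged_ptr old_head, new_head;
        do {
            old_head = head;                                  // read pointer and tag together
            new_node->_next = ptr_of(old_head);               // link the new node in
            new_head = pack(new_node, tag_of(old_head) + 1);  // increment the tag
        } while (__sync_val_compare_and_swap(&head, old_head, new_head) != old_head);
    }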
Hazard pointers
DCAS based solutions
Memory model and barrier instructions
The X86 memory model
Barrier instructions
Linked list example of use of barrier instructions
Skip lists
References
- Yan Solihin, Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems, Solihin Books, August 2009.
- David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Gulf Professional Publishing, August 1998.
- Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton, "Thread Scheduling for Multiprogrammed Multiprocessors", 1998. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.130.3853
- [Herlihy-obs-free] Maurice Herlihy, Victor Luchangco, and Mark Moir, "Obstruction-Free Synchronization: Double-Ended Queues as an Example", Proceedings of the 23rd International Conference on Distributed Computing Systems (ICDCS), 2003.
- Wikipedia, Non-blocking algorithm, http://en.wikipedia.org/wiki/Non-blocking_algorithm
- Wikipedia, ABA problem, http://en.wikipedia.org/wiki/ABA_problem
- Wikipedia, Hazard pointer, http://en.wikipedia.org/wiki/Hazard_pointer
- Wikipedia, Compare-and-swap, http://en.wikipedia.org/wiki/Compare-And-Swap
- Wikipedia, Load-link/store-conditional, http://en.wikipedia.org/wiki/Load-Link/Store-Conditional
- GCC, Atomic builtins http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html#Atomic-Builtins
- Thundering Herd, http://catb.org/~esr/jargon/html/T/thundering-herd-problem.html