CSC/ECE 506 Spring 2011/ch5 LL
Supplement to Chapter 5: Parallel Programming For Linked Data Structures
Introduction
Chapter 5 of Solihin (2008) discusses parallel algorithms for operating on linked data structures. The focus is primarily on thread-safe accesses based on locks. The semantics of locks imply safety based on mutual exclusion, i.e., only a single thread can access the data structure at any given time. The chapter discusses optimizations such as finer-grained locks and read-write locks to enable more parallelism. This supplement discusses non-blocking algorithms for accessing linked data structures and the related issues.
Non-blocking algorithms
Motivation
Parallel algorithms enable threads (or processes) to concurrently access shared data structures in a thread-safe manner. All lock-based algorithms are blocking algorithms, since only one thread can be inside the critical section while other threads trying to enter block on the lock. All the methods outlined in Chapter 5 of Solihin (2008) are lock-based and hence blocking in nature. Lock-based algorithms have the following drawbacks:
- Blocking limits the extent of scalability, especially when the granularity of the lock is too coarse.
- Locks can result in deadlocks and livelocks.
- Locks can cause priority inversion, i.e., high-priority threads/processes cannot proceed if a low-priority thread/process is holding the common lock.
- Locks can result in convoying: all other threads have to wait if a thread holding a lock is descheduled due to a time-slice interrupt or page fault (see lock convoy).
- Locks can cause other issues, such as the thundering herd problem.
Non-blocking algorithms permit truly concurrent thread-safe access to shared data structures, where threads do not block on access. A thread might, however, have to retry to achieve its objective. Multi-core architectures are now ubiquitous, and on them non-blocking algorithms can make use of the available cores, whereas under a blocking approach those extra cores would largely sit idle. Thus non-blocking algorithms can offer significant throughput benefits on current architectures.
Categories
Non-blocking algorithms fall into three main categories. This taxonomy is driven primarily by the level of progress guarantee provided for an individual thread or for the system as a whole.
Wait-freedom: A synchronization technique is wait-free if it ensures that every thread will continue to make progress in the face of arbitrary delay (or even failure) of other threads. Thus it combines guaranteed system-wide throughput with starvation-freedom and is the strongest non-blocking guarantee of progress.
Lock-freedom: An algorithm is lock-free if it ensures only that some thread always makes progress. Thus it allows individual threads to starve but guarantees system-wide throughput. All wait-free algorithms are lock-free.
Obstruction-freedom [Herlihy-obs-free]: A synchronization technique is obstruction-free if it guarantees progress for any thread that eventually executes in isolation. Even though other threads may be in the midst of executing operations, a thread is considered to execute in isolation as long as the other threads do not take any steps. (Pragmatically, it is enough for the thread to run long enough without encountering a synchronization conflict from a concurrent thread.) Like the wait-free and lock-free conditions, obstruction-free synchronization ensures that no thread can be blocked by delays or failures of other threads. This property is weaker than lock-free synchronization, because it does not guarantee progress when two or more conflicting threads are executing concurrently. All lock-free algorithms are obstruction-free.
In practice, lock-free and obstruction-free algorithms are the most common; wait-freedom comes at a substantial performance cost [Herlihy-obs-free].
Building blocks
All non-blocking algorithms are built on atomic primitives provided by the underlying architecture. The most commonly found primitives are:
- Compare And Swap (CAS): A CAS instruction swaps a new value into an address location. The instruction succeeds only if the value at the address location matches the expected old value, and it returns the value found at the address location. The instruction is atomic, i.e., if multiple threads try to CAS-in a value at a location, only one of them will succeed. The pseudocode in this chapter uses the convention CAS(new_value, addr, old_value): write new_value to *addr only if *addr equals old_value, returning the value found at *addr. An example use of the CAS instruction is shown below:
    do {
        old_value = *addr;            // read the current value
        new_value = old_value + x;    // compute the new value
    } while (old_value != CAS(new_value, addr, old_value)); // re-execute the loop if the CAS doesn't succeed
- Load Linked/Store Conditional (LL-SC):
These are a pair of instructions that together implement a lock-free atomic read-modify-write operation. Load-link returns the current value of a memory location. A subsequent store-conditional to the same memory location will store a new value only if no updates have occurred to that location since the load-link.
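A minimal sketch of how an LL-SC pair is typically used is shown below. The LL and SC functions here are hypothetical stand-ins for the underlying hardware instructions (on PowerPC, the lwarx/stwcx. pair), not real C library calls:

    /* Hypothetical LL/SC primitives; a real implementation maps these to
       the architecture's instructions (e.g., lwarx/stwcx. on PowerPC). */
    extern int LL(int* addr);            // load-linked: read *addr and set a reservation on it
    extern int SC(int* addr, int value); // store-conditional: store only if the reservation still holds; non-zero on success

    void atomic_increment(int* addr) {
        int value;
        do {
            value = LL(addr) + 1;    // read the current value and compute the increment
        } while (!SC(addr, value));  // retry if another thread wrote *addr since the LL
    }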
Most modern architectures provide a CAS instruction; the PowerPC architecture provides an LL-SC implementation. Most compiler toolchains provide convenient wrappers around these instructions, e.g., GCC's atomic builtins. It is important to note that atomic primitives, like other memory instructions, operate at the granularity of a single memory word: 64-bit machines have 64-bit memory words, while 32-bit machines operate on 32-bit memory words. Algorithms that use these atomic primitives frequently rely on being able to atomically update two distinct fields of a data structure; this is generally achieved by packing the two fields into a single memory word.
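As a concrete illustration, the CAS loop above can be written with GCC's __sync_val_compare_and_swap builtin (see the GCC atomic builtins reference below). Note that the builtin takes its arguments as (addr, old_value, new_value), unlike the CAS(new_value, addr, old_value) convention used in this chapter's pseudocode:

    #include <stdint.h>

    /* Atomically add x to *addr using GCC's CAS builtin.
       __sync_val_compare_and_swap(ptr, oldval, newval) stores newval into *ptr
       only if *ptr == oldval, and returns the value it found at *ptr. */
    void atomic_add(volatile uint64_t* addr, uint64_t x) {
        uint64_t old_value, new_value;
        do {
            old_value = *addr;            // read the current value
            new_value = old_value + x;    // compute the new value
        } while (__sync_val_compare_and_swap(addr, old_value, new_value) != old_value);
    }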
Lock-free singly linked list implementation
We discuss a lock-free implementation of a singly linked list. Being lock-free, this algorithm ensures that at least one of the threads in the system makes progress; starvation is not ruled out. The algorithm supports insert, remove, and iterate operations on the list. Iteration is safe as long as elements are not deleted, i.e., the underlying memory is not freed. Removing an element from the list is distinct from freeing the node's underlying memory, and is permitted. Nodes can be completely recycled by garbage collection at safe points where no thread is accessing the list. We consider two distinct cases, depending on whether a node can be returned to the list. The algorithm relies on the fact that the insert and remove operations each require atomically updating a single pointer; since this pointer is the size of a memory word, it can be updated using the CAS instruction. The list has a head pointer, and nodes are added at the head. The algorithm listed below permits any node in the list to be removed.
Insert unique elements
We present the pseudocode for the insert and remove operations below. A node, once removed from the list, is not returned to the list. The list is null terminated.
The insert_at_head function reads the head pointer and tries to atomically update it to point to the node being inserted. The CAS can fail under contention; in that case the head is re-read inside the do-while loop and the CAS is retried.
    void insert_at_head(Node* new_node) {
        do {
            old_head = head;               // read the current head
            new_node->_next = old_head;    // point the new node's next pointer at the current head
        } while (old_head != CAS(new_node, &head, old_head)); // CAS head to new_node; re-execute the loop if the CAS fails
    }
The remove function walks down the list until it finds remove_node, i.e., the node to be removed. The next pointer of the previous node is then atomically updated to remove_node's next pointer. Each node has an _is_removed flag, which distinguishes a node that has been removed from the final node of the list. This flag is set after the CAS succeeds; a list traversal at this point will no longer find the node. The next pointer of remove_node is then set to NULL. The order of the writes to these two fields, the flag and the next pointer, is important. On an architecture that allows store reordering, a reader could see the NULL next pointer before it sees the updated flag and would wrongly conclude that it has reached the final node of the list, corrupting the traversal. To avoid this, we must use memory fence instructions, which enforce the store ordering. This is covered in detail in Chapter 10 of Solihin (2008).
The stores can also be reordered by the compiler, since they are writes to independent locations. To prevent the compiler from reordering them, both the flag field and the next pointer field should be declared volatile. A sketch of the fence placement follows the code below.
    void remove(Node* remove_node) {
        new_next_node = remove_node->_next;
        if (new_next_node == NULL) {
            if (remove_node->_is_removed)
                return;                        // the node has already been removed, so exit
        }
        do {
            old_head = head;                   // read the current head
            while (old_head->_next != remove_node) {
                old_head = old_head->_next;    // walk the list until the predecessor is found
            }
        } while (remove_node != CAS(new_next_node, &(old_head->_next), remove_node)); // CAS the predecessor's next pointer to the node after remove_node
        remove_node->_is_removed = true;       // mark the node as removed first ...
        remove_node->_next = NULL;             // ... then clear its next pointer (see the ordering discussion above)
    }
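As a minimal sketch of the ordering fix discussed above, assuming GCC's __sync_synchronize() full memory barrier (the mark_removed helper name is ours, introduced for illustration):

    typedef struct Node {
        struct Node* volatile _next;    // volatile prevents compiler reordering of these stores
        volatile int _is_removed;       // set once the node has been unlinked
    } Node;

    /* Publish the removal in the required order: set the flag first,
       issue a fence, and only then clear the next pointer. */
    void mark_removed(Node* remove_node) {
        remove_node->_is_removed = 1;   // step 1: mark the node as logically removed
        __sync_synchronize();           // full memory barrier: the flag store becomes visible first
        remove_node->_next = NULL;      // step 2: clear the next pointer
    }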
Allowing elements to be returned to the list, a.k.a. the ABA problem
We now consider the case where a node can be returned to the list, and both inserts and removes happen at the head of the list. Assume two threads are updating the list: thread-1 is trying to remove the head node, while thread-2 executes both remove and insert operations.
The state of the list according to thread-1 is: head->A->B->C. Thread-1 reads A's next pointer, which points to B, and is then preempted by the scheduler before it executes its CAS instruction. In the meanwhile, thread-2 does multiple updates of the list and transitions it to: head->A->C->D; that is, it removes A from the list and later inserts it back at the head. Thread-1 now gets to execute its CAS instruction, which updates the head pointer to A's next pointer, node B according to thread-1. The CAS succeeds, since the head still points to A. The list now looks like: head->B. We have inadvertently removed nodes C and D from the list and linked in the already-removed node B, corrupting the list. This is referred to as the ABA problem.
There are a few methods to resolve this issue, and we highlight some of them below.
Tag-based solution
A counter, or tag, field is concatenated with the head pointer. This counter is incremented on each insert onto the list. Thus, in the above case, even if node A is returned to the list, the CAS by thread-1 will fail, since the counter value was incremented by thread-2 when it re-inserted node A into the list. The sequence of events listed below demonstrates how adding the tag to the head pointer field resolves the ABA issue.
- thread 1, reads head, head is (A, 1)
- thread 2 removes node A, head becomes (B, 1)
- thread 2 inserts A, head becomes (A, 2) // increment tag
- thread 1 tries to remove A, but the CAS fails because of the tag field mismatch
It is important to note that the counter and the pointer field are part of the same memory word. Current 64-bit x86 architectures use only 48 bits for addressing, which allows the top 16 bits of the memory word to be used for tagging purposes. This solution does, however, have a flaw: it is possible for the counter to overflow and wrap back to the same value that thread-1 saw in the above example. With 16 bits used for tagging, this would take a very long time, and the likelihood of it happening is quite remote. In most practical systems, this is the solution used to get around the ABA issue; the failure caused by counter overflow is rare enough to be equated with any other random failure caused by hardware malfunction. A sketch of a tagged head pointer is shown below.
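The sketch below shows one way to pack a tag into the head pointer, assuming user-space x86-64 addresses that fit in 48 bits and GCC's __sync_val_compare_and_swap builtin; the pack, ptr_of, and tag_of helper names are ours, introduced for illustration:

    #include <stdint.h>

    typedef struct Node { struct Node* _next; } Node;
    typedef uint64_t tagged_ptr;    // low 48 bits: Node pointer; high 16 bits: tag

    volatile tagged_ptr head;       // head pointer and tag packed into one memory word

    tagged_ptr pack(Node* ptr, uint16_t tag) {
        return ((uint64_t)tag << 48) | ((uint64_t)(uintptr_t)ptr & 0xFFFFFFFFFFFFULL);
    }
    Node* ptr_of(tagged_ptr tp)    { return (Node*)(uintptr_t)(tp & 0xFFFFFFFFFFFFULL); }
    uint16_t tag_of(tagged_ptr tp) { return (uint16_t)(tp >> 48); }

    /* Insert at the head, bumping the tag on every insert so that a stale
       (pointer, tag) pair read by a preempted thread never matches. */
    void insert_at_head(Node* new_node) {
        tagged_ptr old_head, new_head;
        do {
            old_head = head;                                  // read pointer and tag together
            new_node->_next = ptr_of(old_head);               // link the new node in
            new_head = pack(new_node, tag_of(old_head) + 1);  // increment the tag
        } while (__sync_val_compare_and_swap(&head, old_head, new_head) != old_head);
    }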
Hazard pointers
DCAS based solutions
Memory model and barrier instructions
The X86 memory model
Barrier instructions
Linked list example of use of barrier instructions
Skip lists
References
- Yan Solihin, Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems, Solihin Books, August 2009.
- David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Gulf Professional Publishing, August 1998.
- Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton, "Thread Scheduling for Multiprogrammed Multiprocessors", 1998. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.130.3853
- [Herlihy-obs-free] Maurice Herlihy, Victor Luchangco, and Mark Moir, "Obstruction-Free Synchronization: Double-Ended Queues as an Example", Proceedings of the 23rd International Conference on Distributed Computing Systems (ICDCS), 2003.
- Wikipedia, Non-blocking algorithm, http://en.wikipedia.org/wiki/Non-blocking_algorithm
- Wikipedia, ABA problem, http://en.wikipedia.org/wiki/ABA_problem
- Wikipedia, Hazard pointer, http://en.wikipedia.org/wiki/Hazard_pointer
- Wikipedia, Compare-and-swap, http://en.wikipedia.org/wiki/Compare-And-Swap
- Wikipedia, Load-link/store-conditional, http://en.wikipedia.org/wiki/Load-Link/Store-Conditional
- GCC, Atomic builtins http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html#Atomic-Builtins
- Thundering Herd, http://catb.org/~esr/jargon/html/T/thundering-herd-problem.html