CSC/ECE 506 Spring 2011/ch5 LL
Supplement to Chapter 5: Parallel Programming For Linked Data Structures
Introduction
Chapter 5 of Solihin (2009) discusses parallel algorithms for operating on linked data structures. The focus is primarily on thread-safe accesses based on locks. The semantics of locks implies safety based on mutual exclusion, i.e., only a single thread can access the data structure at any given time. The chapter discusses optimizations such as finer-grained locks and read-write locks to enable more parallelization. This supplement discusses non-blocking algorithms for accessing linked data structures and the issues they raise. We categorize non-blocking algorithms, discuss the building blocks required to implement them, and present an example of a lock-free singly-linked list. The article also briefly discusses the memory model of the X86 architecture and data structures that are more amenable to lock-free algorithms.
Non-blocking algorithms
Motivation
Parallel algorithms enable threads (or processes) to concurrently access shared data structures in a thread-safe manner. All lock-based algorithms are blocking algorithms, since only one thread can be present in the critical section while other threads trying to get in block on the lock. All the methods outlined in Chapter 5 of Solihin (2009) are lock based and hence are blocking in nature. Lock-based algorithms have the following drawbacks:
- Blocking limits the extent of scalability, especially when the granularity of the lock is too coarse.
- Locks can result in deadlocks and livelocks .
- Locks can cause priority inversion, i.e., high-priority threads/processes cannot proceed if a low-priority thread/process is holding the common lock.
- Locks can result in convoying. All other threads have to wait if a thread holding a lock is descheduled due to a time-slice interrupt or page fault (See lock convoy)
- Locks can cause other issues such as the thundering-herd problem
Non-blocking algorithms permit truly concurrent thread-safe access to shared data structures: threads do not block on access, although a thread may have to retry to achieve its objective. In the modern computing era, multi-core architectures are ubiquitous. On these architectures, non-blocking algorithms can make use of all the available cores, whereas under a blocking approach the extra cores are generally left idle. Thus non-blocking algorithms have significant throughput benefits on current architectures.
Categories
Non-blocking algorithms fall into three main categories. This taxonomy is primarily driven by the level of progress guarantee offered to an individual thread or to the system as a whole.
Wait-freedom: A synchronization technique is wait-free if it ensures that every thread will continue to make progress in the face of arbitrary delay (or even failure) of other threads. Thus it combines guaranteed system-wide throughput with starvation-freedom and is the strongest non-blocking guarantee of progress.
Lock-freedom: An algorithm is lock-free if it ensures only that some thread always makes progress. Thus it allows individual threads to starve but guarantees system-wide throughput. All wait-free algorithms are lock-free.

Obstruction-freedom: Herlihy et al. (2003) define obstruction freedom as a synchronization technique that guarantees progress for any thread that eventually executes in isolation. Even though other threads may be in the midst of executing operations, a thread is considered to execute in isolation as long as the other threads do not take any steps. (Pragmatically, it is enough for the thread to run long enough without encountering a synchronization conflict from a concurrent thread.) Like the wait-free and lock-free conditions, obstruction-free synchronization ensures that no thread can be blocked by delays or failures of other threads. This property is weaker than lock-free synchronization, because it does not guarantee progress when two or more conflicting threads are executing concurrently. All lock-free algorithms are obstruction-free. Livelocks are a possibility with this set of algorithms, and it is the responsibility of a contention manager to prevent them.
In practice, lock-free and obstruction-free algorithms are the most common. Wait-freedom comes at a substantial performance cost.
Building blocks
All non-blocking algorithms are built on atomic primitives provided by the underlying architecture. The most commonly found primitives are compare-and-swap (CAS) and load-link/store-conditional (LL-SC).

Compare-and-swap (CAS): A CAS instruction is used to swap in a new value at an address location. The instruction succeeds only if the value at the address matches the expected old value for that location. The instruction returns the value found at the address. The instruction is atomic, i.e., if multiple threads try to CAS-in a value at a location, only one of them succeeds. An example use of the CAS instruction is shown below:
do {
    old_value = *addr;                    // read the current value
    new_value = old_value + x;            // compute the new value
} while (old_value != CAS(new_value, addr, old_value));   // re-execute the loop if the CAS doesn't succeed
Load Linked/Store Conditional (LL-SC): These are a pair of instructions that together implement a lock-free atomic read-modify-write operation. Load-link returns the current value of a memory location. A subsequent store-conditional to the same memory location will store a new value only if no updates have occurred to that location since the load-link. Section 9.1.5 of Solihin (2009) discusses the semantics of the LL-SC instructions and a lock implementation based on them.
Most modern architectures provide the CAS instruction; the PowerPC architecture provides an LL-SC implementation. Most compiler tool chains provide convenient wrappers around these instructions, e.g., GCC atomic builtins. It is important to note that the atomic primitives, like other memory instructions, operate at the granularity of a single memory word: 64-bit machines have 64-bit memory words, while 32-bit machines operate on 32-bit memory words. Algorithms that use these atomic primitives frequently rely on being able to atomically update two distinct fields of a data structure. This is generally achieved by packing the two fields into a single memory word.
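As a concrete illustration, the CAS retry loop shown earlier can be written with the GCC builtins. This is a minimal sketch assuming GCC; note that the builtin's argument order is (address, expected old value, new value), unlike the CAS(new_value, addr, old_value) convention of the pseudocode, and atomic_add is our own illustrative name.

#include <stdint.h>

/* Atomically add x to *addr using a GCC __sync builtin.
 * __sync_val_compare_and_swap stores the new value at *addr only if *addr
 * still equals the expected old value, and returns the value it found. */
static void atomic_add(uint64_t *addr, uint64_t x) {
    uint64_t old_value, observed;
    do {
        old_value = *addr;                          /* read the current value    */
        observed  = __sync_val_compare_and_swap(addr, old_value, old_value + x);
    } while (observed != old_value);                /* another thread won; retry */
}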
Lock-free singly linked list implementation
We discuss a lock-free implementation of a singly linked list. This algorithm, being lock-free, ensures that at least one of the threads in the system makes progress; starvation is not ruled out. The algorithm supports insert, remove, and iterate operations on the list. Iteration is safe as long as elements are not deleted, i.e., as long as the underlying memory is not freed. Removing an element from the list is distinct from freeing the node's underlying memory and is permitted. Nodes can be completely recycled by garbage collection at safe points where no thread is accessing the list. We consider two distinct cases, depending upon whether a node can be returned to the list. The algorithm relies on the fact that the insertion and removal operations require atomically updating a single pointer. This pointer is the size of a memory word and can be updated using the CAS instruction. The linked list has a head pointer, and nodes are added at the head. The algorithm listed below permits any node in the list to be removed.
Insert unique elements
We present the pseudocode for the insert and remove operations below. A node, once removed from the list, is not returned to it. The list is null terminated.
The insert_at_head function reads the head pointer and tries to atomically update it to point to the node being inserted. The CAS can fail under contention; in that case the head is re-read inside the do-while loop and the CAS is retried.
void insert_at_head(Node* new_node) {
    do {
        old_head = head;               // read the current head
        new_node->_next = old_head;    // point the new node's next pointer at the current head
    } while (old_head != CAS(new_node, &head, old_head));   // CAS the head to new_node; re-execute the loop if the CAS fails
}
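For comparison, here is how the same operation might look in portable C11 using <stdatomic.h>. This is a minimal sketch; the Node layout and the payload field are assumptions for illustration.

#include <stdatomic.h>

typedef struct Node {
    struct Node *_next;
    int          _data;            /* illustrative payload field */
} Node;

static _Atomic(Node *) head;       /* list head, updated only through CAS */

void insert_at_head(Node *new_node) {
    Node *old_head = atomic_load(&head);
    do {
        new_node->_next = old_head;    /* link the new node in front of the current head */
    } while (!atomic_compare_exchange_weak(&head, &old_head, new_node));
    /* on failure, atomic_compare_exchange_weak reloads the current head into
     * old_head, so the loop simply retries with the fresh value */
}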
The remove function walks down the list until it finds remove_node, i.e., the node to be removed from the list. The previous node's next pointer is then atomically updated to remove_node's next pointer. Each node has an _is_removed flag, which distinguishes a node that has been removed from the final node of the list. This flag is set after the CAS succeeds; a list traversal at that point will no longer find the node. The next pointer of remove_node is then set to NULL. The order of these two writes, the _is_removed flag and the next pointer, is important. On an architecture that allows store reordering, a reader could see the NULL value of the next pointer before it sees the correct value of the flag, and would wrongly conclude that it has reached the final node of the list, corrupting the traversal. To avoid this, we need to make use of memory fence instructions, which enforce store ordering. This is covered in good detail in Chapter 10 of Solihin (2009); we briefly discuss fence instructions further down in the article.

The stores can also be reordered by the compiler, since they are writes to independent locations. To prevent the compiler from reordering them, both the flag field and the next pointer field should be declared volatile.
void remove(Node* remove_node) {
    new_next_node = remove_node->_next;
    if (new_next_node == NULL) {
        if (remove_node->_is_removed)
            return;                            // node has already been removed, so exit
    }
    do {
        old_head = head;                       // start at the head
        while (old_head->_next != remove_node) {
            old_head = old_head->_next;        // walk the list until the predecessor is found
        }
    } while (remove_node != CAS(new_next_node, &(old_head->_next), remove_node));
                                               // CAS the predecessor's next pointer to the node after
                                               // the one being removed; re-execute the loop on failure
    remove_node->_is_removed = true;           // mark the node as removed
    /* a store fence is needed here on architectures that reorder stores */
    remove_node->_next = NULL;                 // then clear the next pointer
}
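To make the required store ordering explicit, the tail of remove could be written as below. This is a sketch assuming GCC; __sync_synchronize is GCC's full-barrier builtin (it typically compiles to MFENCE on X86), and the field names follow the pseudocode above.

remove_node->_is_removed = true;    /* mark the node as removed ...        */
__sync_synchronize();               /* full fence: make the flag visible   */
                                    /* before the next pointer is cleared  */
remove_node->_next = NULL;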
Allow elements to be returned to the list
Allowing elements to be returned to the list can cause list corruption in a subtle manner. This is generally referred to as the ABA problem: the head of the list transitions from A to B and back to A. We consider the case where a node can be returned to the list and both the insert and the remove happen at the head of the list. Assume two threads are updating the list: thread-1 is trying to remove the head node, while thread-2 executes both remove and insert operations.
The state of the list according to thread-1 is: head->A->B->C. Thread-1 is preempted by the scheduler after it has read A's next pointer (which points to B) but before it executes the CAS instruction. Meanwhile, thread-2 does multiple updates of the list, removing A and later inserting it back at the head, transitioning the list to: head->A->C->D. Thread-1 now gets to execute its CAS instruction, which updates the head pointer to what thread-1 believes is A's next pointer, namely node B. This CAS succeeds since the head still points to A. The list now looks like: head->B, and nodes C and D have inadvertently been removed from the list, corrupting it. This issue is referred to as the ABA issue, indicating the transition states of the head pointer.
There are a few methods to resolve this issue and we highlight some of them below.
Tag-based solution
A counter or tag field is concatenated with the head pointer. This counter is incremented on each insert onto the list. Thus, in the above case, even if node A is returned to the list, the CAS by thread 1 will fail, since the counter value was incremented by thread 2 when it re-inserted node A into the list. The sequence of events listed below demonstrates how adding the tag to the head pointer field resolves the ABA issue.
- thread 1 reads head; head is (A, 1)
- thread 2 removes node A; head becomes (B, 1)
- thread 2 inserts A; head becomes (A, 2) // increment tag
- thread 1 tries to remove A, but the CAS fails because of the tag field mismatch
It is important to note that the counter and the pointer field are part of the same single memory word. Current 64-bit X86 architectures use only 48 bits for addressing, which allows the top 16 bits of the memory word to be used for tagging purposes. This solution does, however, have a flaw: it is possible for the counter to overflow and wrap around to the same value that thread 1 saw in the above example. With 16 bits used for tagging, this would take a very long time, and the likelihood of it happening is quite remote. In most practical systems, this is the solution used to get around the ABA issue; the failure caused by counter overflow is rare enough to be equated to any other random failure caused by hardware malfunction.
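A minimal sketch of the packing described above, assuming user-space pointers whose top 16 bits are zero; the helper names are our own.

#include <stdint.h>

#define TAG_SHIFT 48
#define PTR_MASK  ((1ULL << TAG_SHIFT) - 1)    /* low 48 bits hold the pointer */

/* Pack a node pointer and a 16-bit tag into one 64-bit word that a single
 * CAS can swap atomically. */
static inline uint64_t pack(void *ptr, uint16_t tag) {
    return ((uint64_t)tag << TAG_SHIFT) | ((uint64_t)(uintptr_t)ptr & PTR_MASK);
}

static inline void    *unpack_ptr(uint64_t w) { return (void *)(uintptr_t)(w & PTR_MASK); }
static inline uint16_t unpack_tag(uint64_t w) { return (uint16_t)(w >> TAG_SHIFT); }

The head now stores a packed word: every insert builds the new word with an incremented tag, so a CAS against a stale (pointer, tag) pair fails even when the pointer alone has returned to its old value.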
Hazard pointers
Michael (2004) came up with hazard pointers to get around the ABA issue. Each thread maintains a small set of hazard pointers indicating which nodes it is currently accessing; a node may not be freed or reused while any thread's hazard pointer still refers to it. The owning thread has write permission to its hazard pointers, while other threads only have read permission.
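A minimal sketch of the read-side protocol, assuming one hazard-pointer slot per thread and reusing the atomic head and Node declarations from the earlier C11 sketch; the names here are our own, not Michael's.

#define MAX_THREADS 64

/* One published slot per thread: only the owner writes its slot,
 * reclaiming threads merely read all the slots. */
static _Atomic(Node *) hazard_slot[MAX_THREADS];

Node *protect_head(int my_id) {
    Node *h;
    do {
        h = atomic_load(&head);                  /* candidate node               */
        atomic_store(&hazard_slot[my_id], h);    /* publish: "I am using h"      */
    } while (h != atomic_load(&head));           /* re-check that h was not
                                                    removed before we published  */
    return h;  /* safe to dereference until the slot is cleared */
}

A thread that removes a node defers freeing it until a scan of all hazard slots shows that no other thread has the node published.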
DCAS based solutions
The double compare-and-swap instruction allows atomic update of a double word in memory. This could be used to implement a much larger tag, further decreasing the probability of the tag wrapping around.
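Where the hardware offers a double-width CAS (e.g., CMPXCHG16B on X86-64, reachable through GCC's generic __atomic builtins when compiling with -mcx16), the tag can be a full 64-bit counter. A hedged sketch:

#include <stdint.h>

/* Pointer plus a full-width tag; 16-byte alignment is required for a
 * double-width compare-and-swap on X86-64. */
typedef struct {
    void    *ptr;
    uint64_t tag;    /* a 64-bit counter makes wrap-around a non-issue in practice */
} __attribute__((aligned(16))) TaggedPtr;

/* Returns non-zero on success.  For 16-byte objects GCC lowers this to
 * CMPXCHG16B or a libatomic call, depending on compiler flags and version. */
static int dcas(TaggedPtr *addr, TaggedPtr expected, TaggedPtr desired) {
    return __atomic_compare_exchange(addr, &expected, &desired,
                                     0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}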
Other approaches
A correct but expensive approach is to use intermediate nodes that are not data elements and thus assure invariants as elements are inserted and removed.
Memory model and barrier instructions
The memory model, or memory consistency model, is at the heart of the concurrency semantics of a shared-memory program or system. Adve defines the memory model as an interface between a program and any hardware or software that may transform that program (e.g., the compiler, the virtual machine, or any dynamic optimizer). It defines the set of values that a read in a program is allowed to return, thereby defining the basic semantics of shared variables. Defining a memory model is generally a series of tradeoffs between the performance gained by reordering memory operations to distinct addresses, the complexity created by a complex memory model, and the constraints placed on compilers that must generate code conforming to the underlying memory model. The memory model is defined at the hardware level; modern languages such as Java also define a memory model at the language level. Chapter 10 of Solihin (2009) covers this in good detail, including concepts such as sequential consistency, and a corresponding wiki chapter is a better place to discuss the topic further. The goal of this wiki is to provide just enough content to understand the effect of the memory model on a parallel implementation of a linked list.
We briefly discuss the X86 memory model, which serves as an example of a hardware memory model. We then discuss the barrier instructions provided by the hardware to implement a tighter memory model in software; we discussed the use of these instructions in our implementation of the remove function earlier in this article. Since the underlying hardware memory models differ quite a bit, modern programming languages such as Java expose the memory model at the programming-language level. The Java virtual machine then becomes responsible for emitting code that conforms to the underlying hardware memory model while providing consistent semantics for programmers.
The X86 memory model
This talk by Rick Hudson is an excellent summary of the X86 memory model. Some of its key aspects are listed below:
- Loads are not reordered with OTHER loads
- Stores are not reordered with OTHER stores
- Stores are not reordered with OLDER loads
- Loads may be reordered with OLDER stores to different locations
- In a multiprocessor system, memory ordering obeys causality, i.e., visibility of memory operations is transitive
- Stores to the same location have total order
- Locked instructions have total order
- Loads and stores are NOT reordered with locked instructions
This hardware memory model is quite close to the Total Store Ordering (TSO) model that SPARC implements.
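The relaxation that most often surprises programmers is the fourth rule: a load may be reordered with an older store to a different location. The classic litmus test, sketched below in C-like form, shows the consequence; on X86 the outcome r1 == 0 && r2 == 0 is permitted unless a fence or a locked instruction is inserted in each thread.

int x = 0, y = 0;      /* shared, both initially zero */
int r1, r2;

void thread1(void) { x = 1; r1 = y; }
void thread2(void) { y = 1; r2 = x; }

/* Each thread's load may be satisfied before its own earlier store has
 * drained from the store buffer, so both r1 and r2 can observe 0: an
 * outcome impossible under sequential consistency.  An MFENCE between
 * the store and the load in each thread rules it out. */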
Barrier instructions
Most architectures provide barrier instructions, also referred to as fence instructions, that enable software to force an order on memory operations. For example, in the X86 architecture, loads can be moved above stores as long as they are to different locations; this reordering can be prevented by inserting a fence instruction between the store and the load.
The barrier instructions in the X86 architecture are:
- LFENCE: serialize loads
- SFENCE: serialize stores
- MFENCE: serialize loads and stores
Other architectures provide similar fence instructions. GCC also provides built-in macros that are wrappers for these instructions for easy use.
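For instance, with GCC one can emit a full barrier through a builtin, or a specific X86 fence with inline assembly. A minimal sketch (the "memory" clobber also stops the compiler from reordering across the fence):

__sync_synchronize();                  /* full barrier; typically compiles to MFENCE on X86 */
asm volatile("sfence" ::: "memory");   /* order stores only                                 */
asm volatile("lfence" ::: "memory");   /* order loads only                                  */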
Skip lists
Some linked data structures are more amenable to parallel algorithms than the singly or doubly linked list. The skip list is one such example. It uses a hierarchy of linked lists that allow some of the nodes of the list to be skipped over. A parallel algorithm could have multiple threads iterating over this data structure, with each thread skipping through different offsets, which allows for more traditional data parallelization. The data structure does, however, have a higher memory overhead, since it needs to store the extra pointers that allow skipping.
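A minimal sketch of a skip-list node, assuming a fixed maximum height; forward[0] is the ordinary next pointer, and the higher levels are the extra pointers responsible for the memory overhead noted above.

#define MAX_LEVEL 16

typedef struct SkipNode {
    int              key;
    struct SkipNode *forward[MAX_LEVEL];   /* forward[i] skips ahead at level i;
                                              forward[0] links every node        */
} SkipNode;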
References
- Yan Solihin, Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems, Solihin Books, August 2009.
- David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Gulf Professional Publishing, August 1998.
- Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton, "Thread Scheduling for Multiprogrammed Multiprocessors," 1998. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.130.3853
- GCC, Atomic builtins http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html#Atomic-Builtins
- Thundering Herd, http://catb.org/~esr/jargon/html/T/thundering-herd-problem.html
- M. Herlihy, V. Luchangco and M. Moir. "Obstruction-Free Synchronization: Double-Ended Queues as an Example." 23rd International Conference on Distributed Computing Systems, 2003, p.522
- Maged Michael (2004). "Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects". IEEE Transactions on Parallel and Distributed Systems 15 (6): 491–504.
- S. V. Adve and H.-J. Boehm, "Memory Models: A Case for Rethinking Parallel Languages and Hardware," Communications of the ACM, 53(8):90-101, August 2010.