CSC/ECE 506 Spring 2011/ch5 LL
Supplement to Chapter 5: Parallel Programming For Linked Data Structures
Introduction
Chapter 5 of Solihin (2008) discusses parallel algorithms for operating on linked data structures. The focus is primarily on thread-safe accesses based on locks. The semantics of locks imply that only a single thread can access the data structure at any given time. The chapter discusses optimizations such as finer-grained locks and read-write locks to enable more parallelization. This supplement discusses non-blocking algorithms for accessing linked data structures and the related issues.
Non-blocking algorithms
Motivation
Parallel algorithms enable threads (or processes) to concurrently access shared data structures in a thread-safe manner. All lock-based algorithms are examples of blocking algorithms, since only one thread can be in the critical section at a time while other threads trying to enter block on the lock. All the methods outlined in Chapter 5 of Solihin (2008) are lock-based and hence blocking in nature. Lock-based algorithms have the following drawbacks:
- Blocking limits the extent of scalability, especially when the granularity of the lock is too coarse.
- Locks can result in deadlocks and livelocks.
- Locks can cause priority inversion, i.e., a high-priority thread/process cannot proceed if a low-priority thread/process is holding the common lock.
- Locks can result in convoying: all other threads have to wait if a thread holding a lock is descheduled due to a time-slice interrupt or a page fault (see lock convoy).
- Locks can cause other issues such as the thundering herd problem.
Non-blocking algorithms permit truly concurrent, thread-safe access to shared data structures, where threads do not block on access. A thread might, however, have to retry to achieve its objective. In the modern computing era, multi-core architectures are ubiquitous. On these architectures, non-blocking algorithms can make use of the available cores, whereas with a blocking approach the extra cores would generally sit idle. Thus non-blocking algorithms have significant throughput benefits on current architectures.
Categories
Non-blocking algorithms fall into three main categories. This taxonomy is primarily driven by the level of 'progress guarantee' provided to an individual thread or to the system as a whole.
Wait-freedom: A synchronization technique is wait-free if it ensures that every thread will continue to make progress in the face of arbitrary delay (or even failure) of other threads. Thus it combines guaranteed system-wide throughput with starvation-freedom and is the strongest non-blocking guarantee of progress.
Lock-freedom: An algorithm is lock-free if it ensures only that some thread always makes progress. Thus it allows individual threads to starve but guarantees system-wide throughput. All wait-free algorithms are lock-free.
Obstruction-freedom [Herlihy-obs-free]: A synchronization technique is obstruction-free if it guarantees progress for any thread that eventually executes in isolation. Even though other threads may be in the midst of executing operations, a thread is considered to execute in isolation as long as the other threads do not take any steps. (Pragmatically, it is enough for the thread to run long enough without encountering a synchronization conflict from a concurrent thread.) Like the wait-free and lock-free conditions, obstruction-free synchronization ensures that no thread can be blocked by delays or failures of other threads. This property is weaker than lock-free synchronization, because it does not guarantee progress when two or more conflicting threads are executing concurrently. All lock-free algorithms are obstruction-free.
In practice, lock-free and obstruction-free algorithms are the most common; wait-freedom comes at a substantial performance cost [Herlihy-obs-free].
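To make the distinction concrete, below is a minimal sketch (in C, using the GCC atomic builtins listed in the references; the function names and counter variable are illustrative, not from any library) contrasting two ways of incrementing a shared counter. The CAS retry loop is lock-free: whichever thread's CAS succeeds makes progress, while an unlucky thread may retry indefinitely. A single atomic fetch-and-add, on hardware that provides such an instruction (e.g., x86's LOCK XADD), completes in a bounded number of steps for every thread, which is the wait-free property.
{{{
/* Lock-free increment: some thread always succeeds, but a given thread
   may see its CAS fail and retry an unbounded number of times. */
void increment_lockfree(volatile unsigned *counter)
{
    unsigned old_value, new_value;
    do {
        old_value = *counter;
        new_value = old_value + 1;
        /* __sync_val_compare_and_swap returns the value found at the
           address; the swap took place only if that equals old_value. */
    } while (__sync_val_compare_and_swap(counter, old_value, new_value)
             != old_value);
}

/* Wait-free increment on hardware with an atomic add instruction:
   every thread finishes in a bounded number of steps. */
void increment_waitfree(volatile unsigned *counter)
{
    __sync_fetch_and_add(counter, 1);
}
}}}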
Building blocks
All non-blocking algorithms are built on atomic primitives provided by the underlying architecture. The most commonly found primitives are:
Compare And Swap (CAS): A CAS instruction is used to swap in a new value at an address location. The instruction succeeds only if the current value at the address matches the expected old value for that location. The instruction returns the value found at the address. The instruction is atomic, i.e., if multiple threads try to CAS-in a value at a location at the same time, only one of them will succeed. Example:
{{{
old_value = *addr;
new_value = old_value + x;
if (old_value == CAS(new_value, addr, old_value)) {
    // CAS succeeded
} else {
    // CAS failed
}
}}}
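The same example can be written in compilable C using the __sync_val_compare_and_swap builtin from the GCC atomic builtins listed in the references. Note that the builtin's argument order is (addr, old_value, new_value), unlike the pseudocode's CAS(new_value, addr, old_value); the variable names shared and x below are illustrative.
{{{
int shared = 5;                     /* illustrative shared location */

void cas_example(int x)
{
    int old_value = shared;
    int new_value = old_value + x;

    /* Atomically store new_value only if shared still holds old_value;
       the builtin returns the value it found at the address. */
    if (__sync_val_compare_and_swap(&shared, old_value, new_value)
            == old_value) {
        /* CAS succeeded */
    } else {
        /* CAS failed: another thread modified shared in the meantime */
    }
}
}}}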
Load Linked/Store Conditional (LL-SC): A load-linked instruction reads the value at an address location; the paired store-conditional writes a new value into that location only if the value hasn't changed since the reader last performed the load-linked, i.e., no other thread has written to the location in the meantime. If the store-conditional fails, the operation must be retried.
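A rough sketch of how the pair is typically used, again as an atomic increment, is shown below. The load_linked and store_conditional functions are hypothetical stand-ins for the architecture's instruction pair (for example, lwarx/stwcx. on PowerPC or ldrex/strex on ARM) and are not real library calls.
{{{
/* Hypothetical wrappers around the architecture's LL/SC instructions;
   store_conditional is assumed to return nonzero on success. */
unsigned load_linked(volatile unsigned *addr);
int store_conditional(volatile unsigned *addr, unsigned value);

/* Atomic increment built on LL/SC: the store succeeds only if no other
   write to *counter occurred since the load-linked; otherwise retry. */
void llsc_increment(volatile unsigned *counter)
{
    unsigned old_value;
    do {
        old_value = load_linked(counter);
    } while (!store_conditional(counter, old_value + 1));
}
}}}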
Lock-free linked list implementation
Pop-only case
Push-Pop case aka the ABA problem
Tag-based solution
Hazard pointers
DCAS-based solutions
Memory model and barrier instructions
The x86 memory model
Barrier instructions
Linked list example of use of barrier instructions
Skip lists
References
- Yan Solihin, Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems, Solihin Books, August 2009.
- David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Gulf Professional Publishing, August 1998.
- Nimar S. Arora, Robert D. Blumofe, C. Greg Plaxton, "Thread Scheduling for Multiprogrammed Multiprocessors" (1998), http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.130.3853
- Wikipedia, Non-blocking algorithm, http://en.wikipedia.org/wiki/Non-blocking_algorithm
- Wikipedia, ABA problem, http://en.wikipedia.org/wiki/ABA_problem
- Wikipedia, Hazard pointer, http://en.wikipedia.org/wiki/Hazard_pointer
- Wikipedia, Compare-and-swap, http://en.wikipedia.org/wiki/Compare-And-Swap
- Wikipedia, Load-link/store-conditional, http://en.wikipedia.org/wiki/Load-Link/Store-Conditional
- GCC, Atomic Builtins, http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html#Atomic-Builtins
- The Jargon File, Thundering herd problem, http://catb.org/~esr/jargon/html/T/thundering-herd-problem.html
Appendix
{{{
// Sequential code, from Solihin (2008), p. 25.
for (i = 0; i < 8; i++)
    a[i] = b[i] + c[i];
sum = 0;
for (i = 0; i < 8; i++)
    if (a[i] > 0)
        sum = sum + a[i];
Print sum;
}}}