CSC/ECE 506 Fall 2007/wiki4 2 helperThreads
A helper thread is a thread that does some of the work of the main thread in advance of the main thread so that the main thread can work more quickly. The Olukotun text only scratches the surface on all the different ways that helper threads can be used. Survey these ways, making sure to include the Slipstream approach developed at NCSU.
What is a helper thread?
The potential of chip multiprocessors (CMPs) can be exploited in a better way if the applications are divided into semi - independent parts, or threads that can operate simultaneously across the processors within a system. The simplest way to use parallel threads within a CMP to increase the performance of a single thread is to have helper threads. A 'helper' thread performs work on behalf of main thread in an effort to accelerate its performance.
Uses of helper threads
Predicting branch at early stages:
Helper threads are made up of program copies of the main thread stripped of all unessential parts which are not of absolute necessity for the helper thread to achieve it tasks. Hence helper threads runs ahead of main thread. Hence helper threads run ahead of main thread. Hence helper threads can do computations that are necessary to decide the direction of the branch. This will inturn remove branch mispredictions which may occur when branch predictions are used.
Prefetching of Data:
Since helper threads run ahead of main thread, they can predict the which data in memory is needed in future by the main thread.they prefetch the data required by the main thread. The helper thread prefetches the data and places it in the nearest cache accessible to the main thread. This avoids most of the L1 cache misses that would have been encountered by the main thread had if helper thread was not present. As the memory latency is reduced, the execution of main thread speeds up.
Memory Bug Reduction:
Memory related bugs such as reads from unitialised memory, read or writes using dangling pointers or memory leaks are difficult to detect by code inspection because they may invoive different code fragments and exist in differnt modules or source code files. Compilers are of little help becuase it fails to disambiguate pointers. Hence, in practice memory bug detection relies on run time checkers that insert monitor code to the application during testing. In CMPs, the program which detects the memory bugs can run as a helper thread.
Ex: Heapmon, a memory bug checker monitors application heap space to detect heap memory bugs. The memory access events in the application thread are automatically forwarded to helper thread using appropriate hardware mechanisms. Also the redundant and unnecessary memory accessess are filtered out. Hence the helper thread approach completely decouples bug monitoring from appication execution and the filtering mechanism reduces bug - check frequency. Conseqently, HeapMon achieves a very low performance overhead.
Slipstream approach for helper threads:
Slipstream approach is developed at North Carolina State University. Slipstream execution provides a simple means to use a second processor in a CMP to accelerate a single program.Slipstream execution involves running two redundant copies of the program. A significant number of dynamic instructions are speculatively removed from one of the program copies, without sacrificing its ability to make correct forward progress. The second program verifies the forward progress of the first and is also sped up in the process.
The speculative reduced program is called A-stream (Advanced Stream) and runs ahead of the unreduced program (main thread or Redundant stream).The R-stream exploits the A-stream to get an accurate picture of the future. For example, the A-stream provides very accurate branch and value predictions. The predictions are more accurate than predictions made by conventional history-based predictors because they are produced by future program computation. The speculative A-stream occasionally (but infrequently) goes astray. The R-stream serves as a checker of the speculative A-stream and redirects it when needed.
Synchronization between an R-stream and its corresponding A-stream is required for two reasons.First, an A-stream that has taken a completely wrong control path and is generating useless data access predictions has to be corrcted. Second, limit how far the A-stream gets ahead, so that its prefetches are not issued so early that the prefetched lines are often replaced or invalidated in the L2 cache before the R-stream uses them. A single semaphore is required between each Astream/
R-stream pair to control their synchronization. It can be a shared hardware register, but any shared location that supports an atomic read-modify-write operation is sufficient. Using this semaphore, the following are controlled
(a) how many synchronization events the A-stream can skip without waiting for the Rstream
(b) whether the synchronization is local (involving only the companion R-stream) or global (involving all R-streams). The initial value of the semaphore indicates how many sessions the A-stream may proceed ahead of the Rstream.
This can be viewed as creating an initial pool of tokens that are consumed as the A-stream enters a new session. When there are no tokens, the Astream may not proceed. The R-stream issues tokens by incrementing the semaphore counter.Once the A thread gets the semaphore, A threads advances and gets past the R thread.
Disadvantages of helper threads:
A very tight synchronisation is needed between the main thread and the helper threads in oder to keep them the proper distance ahead of main thread. If the helper threads are too far ahead, then they will cause cache thrashing by prefetching data and then replaing it with subsequent prefetches beofre the main thread can even use it. If they are not far enough ahead, they might not be able to prefetch cache lines in time.
Only certain types of single threaded programs can be sped up with these techniques. Most integer applications have regular data access patterns and hence there will be only few misses to eliminate. The applications with larger memory footprints are easily parellelisable and hence doesn't run on a single main thread. Most floating point applications the data access patterns are fairly regular and hence can be prefetched easily using hardware or software prefetch mechanisms. Hence the range of programs that can be accelerated using helper threads is fairly limited.
References
1)ChipMultiprocessor Architecture:Techniques to Improve Throughput and Latency by Kunle Olukotun, Lance Hammond, and James Laudon
2)Slipstream Execution Mode for CMP-Based Multiprocessors by Khaled Z. Ibrahim, Gregory T. Byrd, and Eric Rotenberg