CSC/ECE 506 Fall 2007/wiki4 7 jp07

From Expertiza_Wiki
Revision as of 03:37, 29 November 2007 by Japoovey (talk | contribs)
Jump to navigation Jump to search

Helper Threads

One of the problems when using parallel machines is that the machine is only trying to execute sequential code. Therefore, much of the benefit of having the ability to run multiple threads simultaneously is lost. This is true of many multi-threading paradigms including Simultaneous Multithread Systems (SMTs) (link), Symmetric Multiprocessors (SMPs) (link), and Chip Multiprocessors (CMPs) (link).

The natural solution it seems would be to rewrite or recompile the programs to make use of parallel execution. But, in some cases this may be too time consuming or even unfeasible due to the nature of the program. Therefore, there is a middle ground where the program is not truly parallelized but the multithreading capabilities are utilized to improve execution time. This technique is known as helper threads.

Helper threads run in parallel to the main thread, and do work for the main thread to improve it's performance [Olokuton]. Typically these threads will execute parts of the program "ahead" of the main thread, in an attempt to predict branches and/or values before the main thread completes. This is done to help shadow the penalty of long latency instructions. The figure below illustrates the basic concepts of helper thread execution.

Note that in a CMP the two contexts would be separate chips, or in an SMT they would be separate thread contexts. The main sequential program will have some knowledge from previous training or via the compiler [1] that a potential long latency instruction such as a cache miss is upcoming. Through some history or prior knowledge the helper thread runs ahead of the main thread executing the instruction vital only to the long-latency instruction. This thread completes ahead of the main thread, so that when the main thread finally reaches the long latency instruction, the helper thread can forward the computed result.

Applications of Helper Threads

Speculative Branch Prediction

One class of helper thread applications is prediction helper threads early [2] The technique proposed uses Simultaneous Subordinate Multithreading (SSMT) to run helper threads called "microthreads". These threads consist of microcode, which is code written specifically for manipulating hardware structures within the processor. A SPAWN instruction within the program is used to indicate when a microthread should be initiated. For the branch prediction mechanism the scheme dynamically decides on likely mispredicted branches and constructs microthreads to predict these branches.

Difficult to predict branches are determined using a path cache that stores information about previous branch mispredictions and tracks the difficulty of these branches to predict. Before a microthread is created a branch must go through a training interval where the difficulty is determined.

The microthreads are created using a post retirement buffer (PRB). Upon retirement, instructions are inserted into the post retirement buffer along with dependence information. When a difficult branch retires a scanner scans the PRB for the recent dependent instructions that the branch depended on and creates a microthread. This microthread is stored and used when the microthread is spawned before the next branch occurs. This way the helper thread computes the branch target before the actual thread reaches the branch.

Speculative Prefetching

Another application of helper thread technology is the use of speculative prefetching. This attempts to mask the long latency of an L2 miss by handling it prior to the main thread execution.[3]

In these ideas, the techniques will use helper threads to follow pointer chains or procedure calls into the future to encounter memory accesses ahead of time. This way if there are L2 misses during this pre-execution, then the helper thread will have prefetched this data for the main thread.

Potential Problems

In speculative prefetching, the key is that runahead execution can cause major cache thrashing. If the execution runahead prefetches data before the data is used in the main thread, then misses will occur that did not exist previously. Furthermore, if speculative execution is too aggressive, then multiple prefetches to the same cache line may occur and the prefetched data may be removed before the main thread ever uses it, thus causing cache misses and cache thrashing.

Alternatively, if the thread is not aggressive enough, then there will be no benefit for the prefetching. The prefetch will not cover the L2 miss latency and thus the usefulness of the work is lost. It is possible that the prefetch start slightly ahead of the main thread and partially covers the L2 miss penalty but not fully.

Therefore, the key area of research in this field is to balance the aggressiveness of speculation with the amount of thread synchronization.

Slipstream Technology

The Slipstream processor [4] uses helper threads to run two versions of a program simulataneously. An instruction removal predictor and confidence are used to generate an "A-stream" program that executes a subset of the instructions of the "R-stream" program (the full program execution). Both run simulatenously alongside one another with the A-stream providing data values and branch predictions ahead of time to the R-stream.

A delay buffer is placed between the two instruction streams to communicate the necessary information. This allows the A-stream to mask some of the latencies incurred by the R-stream execution. The name slipstream occurs because the A-stream is running slightly faster than the R-stream, and the residual trail is pulling the R-stream along, similar to how a car is pulled along when it is drafting another car at high speeds.

This technology not only provides improvement in execution time for a single threaded program, but also provide fault tolerance. At high clock speeds and tighter transistor packing, the likelihood of transient faults are increasing. Therefore redundant execution can be used to assist in detecting transient faults. Since the A-stream and R-stream have overlap, the two can validate execution and detect transient faults. However, this will only work for instructions that exist in both versions of the program.


Additional Links