CSC/ECE 506 Fall 2007/wiki4 7 jp07: Difference between revisions

* [http://csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/dl/proceedings/&toc=comp/proceedings/isca/2002/1605/00/1605toc.xml&DOI=10.1109/ISCA.2002.1003588 Difficult Path Branch Prediction using Helper Threads]
* [http://www.google.com/url?sa=t&ct=res&cd=1&url=http%3A%2F%2Fpact05.ce.ucsc.edu%2Fdatacentric.pdf&ei=8SpOR_2oNJvMeMXjpY8N&usg=AFQjCNGKl9aloiJ_PrQHobjnHMmHvwzo0g&sig2=klWNxGwt7yrPpjciV9MpzQ Compiler Support for Helper Threading]
* [http://www.google.com/url?sa=t&ct=res&cd=1&url=http%3A%2F%2Fwww.cs.cmu.edu%2F~luk%2Fluk_papers%2Fisca01.ps.gz&ei=ci5OR4GPGKiMeu3R2JIN&usg=AFQjCNFJr4YY6GLRhUe9kp4ynBdYtqRknw&sig2=oePO2l1PgzkQhI5agm6zlQ Tolerating Memory Latency through prefetching with helper threads]

Revision as of 03:21, 29 November 2007

Helper Threads

One of the problems with parallel machines is that they are often asked to execute purely sequential code, so much of the benefit of running multiple threads simultaneously is lost. This is true of many multithreading paradigms, including simultaneous multithreading (SMT) processors, symmetric multiprocessors (SMPs), and chip multiprocessors (CMPs).

The natural solution, it seems, would be to rewrite or recompile programs to make use of parallel execution. In some cases, however, this is too time-consuming or even infeasible due to the nature of the program. There is therefore a middle ground in which the program is not truly parallelized, but the multithreading hardware is still used to improve execution time. This technique is known as helper threads.

Helper threads run in parallel with the main thread and do work on its behalf to improve its performance [Olukotun]. Typically these threads execute parts of the program "ahead" of the main thread, in an attempt to resolve branches and/or compute values before the main thread reaches them. This helps hide the penalty of long-latency instructions. The figure below illustrates the basic concepts of helper thread execution.

<div align="center">[[Image:Helper_threads.png]]</div>

Note that in a CMP the two contexts would be separate cores on the same chip, whereas in an SMT they would be separate thread contexts within one core. The main sequential program has some knowledge, from previous training or via the compiler [1], that a potentially long-latency instruction such as a cache miss is upcoming. Using this history or prior knowledge, the helper thread runs ahead of the main thread, executing only the instructions that the long-latency instruction depends on. The helper thread completes ahead of the main thread, so that when the main thread finally reaches the long-latency instruction, the helper thread can forward the computed result.
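
The run-ahead-and-forward idea can be loosely illustrated in software. The sketch below is only an analogy using Python's standard thread pool, not a model of the hardware mechanism: `slow_load`, `main_thread_work`, and `run_with_helper` are hypothetical names, and the sleeps merely stand in for latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_load(key):
    """Hypothetical stand-in for a long-latency operation such as a cache miss."""
    time.sleep(0.05)
    return key * 2

def main_thread_work():
    """Hypothetical work the main thread does before it needs the slow value."""
    time.sleep(0.05)
    return 1

def run_with_helper(key):
    with ThreadPoolExecutor(max_workers=1) as helper:
        pending = helper.submit(slow_load, key)  # helper starts the slow part early
        partial = main_thread_work()             # main thread keeps going meanwhile
        return partial + pending.result()        # result is forwarded when needed
```

Because the slow operation overlaps with the main thread's other work, the main thread waits less when it finally asks for the value, which is the effect the paragraph above describes.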

Applications of Helper Threads

Speculative Branch Prediction

One class of helper thread applications is early branch prediction [2]. The proposed technique uses Simultaneous Subordinate Multithreading (SSMT) to run helper threads called "microthreads". These threads consist of microcode, which is code written specifically to manipulate hardware structures within the processor. A SPAWN instruction within the program indicates when a microthread should be initiated. For branch prediction, the scheme dynamically identifies branches that are likely to be mispredicted and constructs microthreads to predict them.

Difficult-to-predict branches are identified using a path cache, which stores information about previous branch mispredictions and tracks how difficult each branch is to predict. Before a microthread is created, a branch must go through a training interval during which its difficulty is determined.
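
The training step can be sketched as a small table of counters. This is a minimal illustration under stated assumptions, not the structure from [2]: the class layout, the key of (branch address, path identifier), and the `training_interval` and `threshold` values are all hypothetical.

```python
class PathCache:
    """Toy model: count occurrences and mispredictions per (branch, path)."""

    def __init__(self, training_interval=8, threshold=0.25):
        self.training_interval = training_interval  # assumed: occurrences needed before judging
        self.threshold = threshold                  # assumed: misprediction rate that counts as difficult
        self.entries = {}                           # (branch_pc, path_id) -> [occurrences, mispredictions]

    def record(self, branch_pc, path_id, mispredicted):
        entry = self.entries.setdefault((branch_pc, path_id), [0, 0])
        entry[0] += 1
        entry[1] += int(mispredicted)

    def is_difficult(self, branch_pc, path_id):
        occurrences, mispredictions = self.entries.get((branch_pc, path_id), (0, 0))
        if occurrences < self.training_interval:
            return False                            # still in the training interval
        return mispredictions / occurrences > self.threshold


path_cache = PathCache()
for i in range(8):
    path_cache.record(0x400, "path_a", mispredicted=(i % 2 == 0))  # 4 of 8 mispredicted
    path_cache.record(0x400, "path_b", mispredicted=(i == 0))      # 1 of 8 mispredicted
for _ in range(3):
    path_cache.record(0x500, "path_a", mispredicted=True)          # only 3 occurrences so far
```

Only the first entry would qualify for a microthread here: the second is predicted well enough, and the third has not yet completed its training interval.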

Microthreads are built from a post-retirement buffer (PRB). As instructions retire, they are inserted into the PRB along with their dependence information. When a difficult branch retires, a scanner searches the PRB for the recent instructions that the branch depended on and assembles them into a microthread. The microthread is stored and then spawned ahead of the next occurrence of the branch, so that the helper thread computes the branch outcome before the main thread reaches the branch.
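
The scanner's job amounts to extracting a backward dependence slice. The following is a minimal sketch, assuming each PRB entry records one destination register and its source registers; the `Instr` record, the register names, and `build_microthread` are hypothetical, and memory dependences are ignored.

```python
from collections import namedtuple

# Hypothetical PRB entry: instruction text, destination register, source registers.
Instr = namedtuple("Instr", ["text", "dest", "srcs"])

def build_microthread(prb, branch_index):
    """Scan backward from a retired branch and keep only the instructions
    that its condition (transitively) depends on, in program order."""
    needed = set(prb[branch_index].srcs)   # registers whose producers are still missing
    picked = [branch_index]
    for i in range(branch_index - 1, -1, -1):
        ins = prb[i]
        if ins.dest is not None and ins.dest in needed:
            needed.discard(ins.dest)       # this instruction produces a needed value
            needed.update(ins.srcs)        # so its own inputs are needed in turn
            picked.append(i)
    return [prb[i] for i in sorted(picked)]

prb = [
    Instr("load r1, [r9]",   "r1", ("r9",)),
    Instr("add  r2, r2, 1",  "r2", ("r2",)),
    Instr("load r3, [r1+8]", "r3", ("r1",)),
    Instr("mul  r5, r6, r7", "r5", ("r6", "r7")),
    Instr("cmp  r4, r3, 0",  "r4", ("r3",)),
    Instr("bne  r4, target", None, ("r4",)),
]
microthread = build_microthread(prb, 5)
```

The `add` and `mul` instructions are dropped because the branch does not depend on them, which is what keeps the microthread shorter than the main thread's path to the same branch.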

Speculative Prefetching

Another application of helper threads is speculative prefetching, which attempts to hide the long latency of an L2 miss by triggering the miss before the main thread reaches it [3].

These techniques use helper threads to follow pointer chains or procedure calls ahead of the main thread, so that memory accesses are encountered early. If the pre-execution incurs an L2 miss, the helper thread has effectively prefetched that data for the main thread.
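
Pointer-chain pre-execution can be sketched with a toy simulation. The model below is an assumption-heavy illustration: the cache is an unbounded set, prefetches complete instantly, and `Node`, `build_list`, `run`, and the `lead` parameter are hypothetical names.

```python
class Node:
    def __init__(self, addr, next_node=None):
        self.addr = addr
        self.next = next_node

def build_list(n):
    """Build a linked list whose nodes have addresses 0..n-1."""
    head = None
    for addr in reversed(range(n)):
        head = Node(addr, head)
    return head

def run(head, lead):
    """Return the misses seen by the main thread when a helper stays up to
    `lead` nodes ahead of it (lead=0 means no helper thread)."""
    cache = set()              # addresses already brought in (unbounded, for simplicity)
    main, helper = head, head
    ahead = 0                  # how far the helper currently is ahead of the main thread
    misses = 0
    while main is not None:
        # Helper: execute only the pointer-chasing slice and touch each node.
        while helper is not None and ahead < lead:
            cache.add(helper.addr)
            helper = helper.next
            ahead += 1
        # Main thread: access the node (its real work would follow here).
        if main.addr not in cache:
            misses += 1
            cache.add(main.addr)
        main = main.next
        ahead = max(ahead - 1, 0)
    return misses
```

Calling `run(build_list(50), 0)` makes the main thread take every miss itself, while any positive `lead` lets the helper absorb them. Because this model ignores timing and cache capacity, it cannot show the failure modes that arise in practice.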

Potential Problems

The main risk in speculative prefetching is that run-ahead execution can cause severe cache thrashing. If the helper thread prefetches data long before the main thread uses it, the prefetched lines can evict data the main thread still needs, causing misses that would not otherwise have occurred. Furthermore, if speculation is too aggressive, multiple prefetches to the same cache line may be issued, and prefetched data may be evicted before the main thread ever uses it, causing additional cache misses.

Conversely, if the helper thread is not aggressive enough, prefetching provides little or no benefit. A prefetch issued too late does not cover the L2 miss latency, and the work is wasted. A prefetch that starts only slightly ahead of the main thread covers part of the L2 miss penalty, but not all of it.

The key research question in this area is therefore how to balance the aggressiveness of speculation against the amount of synchronization between the helper and main threads.
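
The trade-off can be made concrete with a small simulation. This is a sketch under simplifying assumptions rather than a model from the cited papers: the main thread touches one new address per step, the helper issues one prefetch per step, the cache is a single LRU structure, and `simulate` with its `lead`, `latency`, and `capacity` parameters is hypothetical.

```python
from collections import OrderedDict

def simulate(n, lead, latency, capacity):
    """Main thread touches addresses 0..n-1, one per step. At step t the helper
    prefetches address t+lead, which arrives `latency` steps later.
    Returns (full_misses, late_prefetches, hits) for the main thread."""
    cache = OrderedDict()      # LRU order: least recently used first
    in_flight = {}             # address -> step at which the prefetch arrives
    full = late = hits = 0

    def fill(addr):
        cache[addr] = True
        cache.move_to_end(addr)
        while len(cache) > capacity:
            cache.popitem(last=False)          # evict the least recently used line

    for t in range(n):
        # Deliver prefetches that have arrived by this step.
        for addr in [a for a, when in in_flight.items() if when <= t]:
            del in_flight[addr]
            fill(addr)
        # Helper thread issues its prefetch for this step.
        target = t + lead
        if lead > 0 and target < n and target not in cache and target not in in_flight:
            in_flight[target] = t + latency
        # Main thread accesses address t.
        if t in cache:
            hits += 1
            cache.move_to_end(t)
        elif t in in_flight:
            late += 1                          # prefetch only partly hides the miss
            del in_flight[t]
            fill(t)
        else:
            full += 1                          # never prefetched, or evicted before use
            fill(t)
    return full, late, hits
```

With `latency=10` and `capacity=16`, a lead shorter than the latency shows up as late prefetches, a lead slightly longer than the latency turns most accesses into hits, and a very large lead pushes prefetched lines out of the cache before they are used, so the main thread misses as often as it would with no helper at all.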

Slipstream Technology

Pre-computation Slices

Additional Links