CSC/ECE 506 Fall 2007/wiki1 12 dp3

Sections 1.3.3 and 1.3.4: Most changes here are probably related to performance metrics. Cite other models for measuring artifacts such as data-transfer time, overhead, occupancy, and communication cost. Focus on the models that are most useful in practice.

Communication and Replication

Performance

Introduction

In this introductory section, we describe why performance is important for parallel computer architects and what the metrics of performance are. Performance is already a central issue in uniprocessor systems, where architects constantly try to improve execution time through techniques such as minimizing memory access time (to reach data quickly) and designing hardware that can execute many instructions in parallel, possibly faster (micro-level parallelism extraction). Performance plays an equally big role in parallel computing, because data is shared among many processors, and processes on different processors need to communicate efficiently and coherently.

To make the point more precise, consider the following example. Assume we want to run a program that takes 100 ms on a uniprocessor with no pipelining. We also know that the full program can be decomposed into many processes, and these processes can run on different machines. In the best case we get a speedup of n, where n is the number of processors available. We may have divided the computing load, but these processes are not independent: in order to complete the program they need to communicate with each other for data sharing, synchronization, and so on. So there is communication overhead involved in parallel computing. A wise architect would not want a parallel system where communication overhead overwhelms the speedup achieved by dividing the workload, so we need to analyze performance considering various trade-offs.

There are three basic performance metrics:

1. Latency: the time taken by an operation to complete (seconds per operation).
2. Bandwidth: the rate at which operations are performed (operations per second).
3. Cost: the measure of the impact operations have on the total execution time of the program (latency times the number of operations).

In parallel computing many operations are performed concurrently, so the relationship among these metrics is not simple; we need to consider how communication operations in parallel computing are measured using these three metrics. As data transfer operations are the most frequent type of communication operation, we will discuss them first.
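To make the overhead trade-off concrete, here is a minimal sketch in Python that revisits the 100 ms example above. The overhead model (1 ms of communication per processor) is an illustrative assumption, not a measurement.

    # Ideal vs. overhead-adjusted speedup for the 100 ms program above.
    # The per-processor communication overhead (1 ms) is an assumed figure.

    def speedup(serial_time_ms, n_procs, comm_overhead_ms):
        # Parallel time = divided compute time + communication overhead.
        parallel_time = serial_time_ms / n_procs + comm_overhead_ms
        return serial_time_ms / parallel_time

    for n in (2, 4, 8, 16, 32):
        print(n, round(speedup(100.0, n, comm_overhead_ms=1.0 * n), 2))

With overhead that grows with the number of processors, the speedup peaks and then degrades instead of following the ideal speedup of n.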

Artifacts of Measuring Performance

Data Transfer

How to Measure: Mathematical Equation [1]
Quantitative Measure
Implications of Such Measurement
Advantages and Disadvantages [2]
New Advances

For any data transfer we would like to estimate the time it consumes, so that we can improve the overall performance of the system by reducing the data transfer time. To estimate the data transfer time, a simple linear model is used:

Total Transfer Time (T) = Start-up Time (T0) + Data Transfer Time (TD)    (Equation I)

Total transfer time has two components. The first is a constant term (T0) called the start-up cost; we will return to it in more detail shortly. First, let us look at the second part, the data transfer time, which is estimated as follows:

Data Transfer Time (TD) = Size of Data (n) / Bandwidth (B)

Bandwidth (B) is also called data transfer rate.
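As a minimal sketch of this linear model (the function and parameter names are our own, chosen for illustration):

    # Linear data-transfer model (Equation I): T = T0 + n / B

    def transfer_time(n_bytes, startup_s, bandwidth_bytes_per_s):
        # Total time = constant start-up cost + size / peak bandwidth.
        return startup_s + n_bytes / bandwidth_bytes_per_s

    def effective_bandwidth(n_bytes, startup_s, bandwidth_bytes_per_s):
        # Bandwidth actually delivered for an n-byte transfer.
        return n_bytes / transfer_time(n_bytes, startup_s, bandwidth_bytes_per_s)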

Here are a couple of observations that should be made in order to understand this model properly.

1. If we have only one pair of hosts, the data transfer rate is simply the bandwidth of the link connecting those hosts.

2. However, if there are many hosts between the source host and the destination host, the bottleneck is the link with the lowest bandwidth.

Coming back to the start-up cost: notice that while T0 is a constant term for a particular data transfer, it varies as we consider data transfers over different entities. For example, in a memory operation the start-up cost is the memory access time. In message passing, the start-up cost can be estimated as the time taken by the first bit to reach the destination. For pipelined operations, the start-up cost is simply the time taken to fill the pipeline. For bus transactions, it is the arbitration and command phases.

As parallel computing has grown, one major focus has been reducing the start-up cost. There are many ways to do so; we describe a few of them here. As stated earlier, for memory operations the start-up cost is basically the memory access time. To reduce memory access time, architects introduced a costly (hence small) but fast storage area called a cache. By exploiting spatial and temporal locality, the cache is filled with useful items, so the processor does not have to go to long-latency main memory every time it needs data. The average access time is governed by the following formula:

Average memory access time = Hit time for the cache + (Miss Rate × Miss Penalty)

Therefore, architects often try to reduce all three components by adopting various cache optimizations such as multilevel caches, larger block sizes, and higher associativity.

Just to give a quantitative feel, we can quote the access times of cache and main memory to see how useful a cache is if we manage to get high hit rates. Cache access time is typically 0.5-25 ns, while for main memory it is 50-250 ns. Cache bandwidth ranges around 5,000-20,000 MB/s, while for main memory it is as low as 2,500-10,000 MB/s.
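A small worked example of the formula above; the specific hit time, miss rate, and miss penalty are assumed figures chosen from within the ranges just quoted.

    # Average memory access time = hit time + miss rate * miss penalty.
    # The numbers below are illustrative assumptions.

    def amat(hit_time_ns, miss_rate, miss_penalty_ns):
        return hit_time_ns + miss_rate * miss_penalty_ns

    print(amat(hit_time_ns=1.0, miss_rate=0.05, miss_penalty_ns=100.0))  # 6.0 ns
    print(amat(hit_time_ns=1.0, miss_rate=0.20, miss_penalty_ns=100.0))  # 21.0 ns

A higher hit rate (lower miss rate) brings the average access time close to the cache access time rather than the main-memory access time.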

Similarly, for bus transactions the start-up cost is the arbitration and command phases. Suppose that on a 3.6 GHz Intel Xeon machine (year 2004) it takes 3 cycles to arbitrate the bus and present the address; the start-up cost is then around 0.83 nanoseconds. Assuming that around 1994-95 the same operation took 3 cycles on a 0.3 GHz Alpha 21064A, we can see that the start-up cost has been reduced by more than a factor of 10.
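The arithmetic behind that comparison, as a quick check (the 3-cycle figure is the assumption stated above):

    # Start-up cost of a bus transaction = cycles / clock rate.
    # GHz is cycles per nanosecond, so cycles / GHz gives nanoseconds.

    def startup_ns(cycles, clock_ghz):
        return cycles / clock_ghz

    xeon  = startup_ns(3, 3.6)   # ~0.83 ns (3.6 GHz Xeon, 2004)
    alpha = startup_ns(3, 0.3)   # 10 ns (0.3 GHz Alpha 21064A, ~1994-95)
    print(xeon, alpha, alpha / xeon)  # ratio ~12, i.e. more than 10x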

Pipelining is another way to reduce the data transfer time; for pipelined systems, filling the pipeline is the total start-up cost. Although introducing a pipeline seems to add extra start-up cost, more importantly the pipeline allows multiple operations to take place concurrently, and this indeed helps in achieving higher bandwidth.
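A back-of-the-envelope sketch of why filling the pipeline is a one-time cost; the stage and operation counts are illustrative assumptions.

    # k-stage pipeline, m operations, one cycle per stage.

    def unpipelined_cycles(k_stages, m_ops):
        return k_stages * m_ops        # each operation runs start to finish

    def pipelined_cycles(k_stages, m_ops):
        return k_stages + (m_ops - 1)  # fill the pipeline once, then 1 op/cycle

    for m in (1, 10, 1000):
        print(m, unpipelined_cycles(5, m), pipelined_cycles(5, m))

For a single operation the pipeline only adds fill cost, but as the number of operations grows, throughput approaches one operation per cycle, i.e. much higher bandwidth.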

An important point to note is that the achievable bandwidth depends on the transfer size; that is why bandwidth (B) is sometimes called the peak bandwidth. To make this clearer, suppose we have two hosts connected by a link with a bandwidth of 20 MB/s, and the start-up cost of communication is 2 microseconds. If we want to transfer an image of size 40 MB, the total transfer time is 2 seconds plus 2 microseconds. Given the available peak bandwidth of 20 MB/s, one might have expected the transfer to complete in 2 seconds, achieving the peak bandwidth, but the start-up cost prevents this. Clearly, as the amount of data increases, the achievable bandwidth approaches the asymptotic rate B; the start-up cost determines how quickly the asymptotic rate is approached.


As a special case, the amount of data required to achieve half of the bandwidth B is equal to T0 × B: setting the delivered bandwidth n / (T0 + n/B) from Equation I equal to B/2 and solving for n gives n = T0 × B.
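To see both effects numerically, the following sketch uses the 20 MB/s link with a 2-microsecond start-up cost from the example above:

    # Effective bandwidth n / (T0 + n/B) approaches B as n grows;
    # it reaches B/2 exactly at n = T0 * B (the half-power point).

    T0 = 2e-6   # start-up cost: 2 microseconds
    B  = 20e6   # peak bandwidth: 20 MB/s

    def effective_bw(n_bytes):
        return n_bytes / (T0 + n_bytes / B)

    n_half = T0 * B                    # 40 bytes for this link
    print(effective_bw(n_half) / B)    # 0.5: half the peak bandwidth
    for n in (40, 4000, 400000, 40000000):
        print(n, round(effective_bw(n) / B, 6))  # fraction of peak achieved

The 40 MB image achieves essentially the full 20 MB/s, while small transfers are dominated by the start-up cost.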

However, this data transfer model has a few shortcomings. It does not indicate when the next operation can be performed, which is particularly important because the bandwidth delivered depends on how frequently operations are initiated. It also does not tell us whether other useful work can be done during the transfer.

Calculating the start-up cost is often challenging; as technology advances, the focus is on reducing the start-up cost and increasing the bandwidth.

In a parallel computing environment, data transfers usually take place across the network. A transfer is invoked by the processor through the communication assist, so next we look at how communication cost is estimated and what factors are important to consider.

Overhead and Occupancy

Communication Cost

Linking to Previous Artifacts
How to Measure: Mathematical Equation
Quantitative Measure
Implications of Such Measurement
Advantages and Disadvantages
New Advances

New Artifact

not decided yet!