CSC/ECE 506 Fall 2007/wiki1 12 dp3

Sections 1.3.3 and 1.3.4: Most changes here are probably related to performance metrics. Cite other models for measuring artifacts such as data-transfer time, overhead, occupancy, and communication cost. Focus on the models that are most useful in practice.

Communication and Replication

In this section, we describe two terms Communication and Replication, simultaneously we also make distinction between these two terms. Later we use an example to show how communication and replication understandings help in designing a parallel computing architecture.

Communication between any two processes is said to occur when data written by one process is read by second process. This causes a data transfer between the processes however, if data is just stored at one process (because initially data was configured to be on this process or it was too large to fit at any other place) and transfer only makes another copy of the data at second process then it called replication. For example, on processor’s request of data if we copy something from main memory and put it in cache this operation is replication of data. On the contrast if a data is produced by a sender process and it is transferred to a receiver process by message passing then it an example of communication.

Communication and replication both involves data transfer, which can be defined as transfer of data across different memory locations. For interprocess(or) communication the data is transferred across the memory local to the communicating processor or from a remote storage device. When a miss occurs in cache, the data is transferred from the memory to the cache. In case, where the cache content, as a result of replication, is updated or changed, these changes must be transported to all the other hidden replication. This is another aspect of data transfer.

Performance

Introduction

In this introduction section, we describe why performance issue is important for parallel computer architects and what the metrics of performance are. As we already know performance is a very important issue in uniprocessor system where architects always try to improve performance of system in terms of execution time by using several techniques such as minimizing memory access time (to access the data fast), designing hardware which can execute many instruction in parallel and possibly faster ( micro level parallelism extraction).Consequently, performance issue plays a big role in parallel computing for several reasons like data is shared among many processors and hence processes on different processors need to communicate efficiently and coherently.

To make our point more precise, let us consider the following example:

Assume we want to run a program which takes 100ms on uniprocessor with no pipeline scheme. However, we also know that the full program can be decomposed in many processes and these processes can be run on different machine. So basically, in best case we get the speed up of n where n is the number of processors available. We might have virtually divided the computing load but these processes are not independent so in order to complete the program they do need to communicate to each other for data sharing, synchronization etc. So there is communication overhead involved in parallel computing. A wise architect would not like to have any parallel system where communication overhead overwhelms speed up achieved by dividing the work load. So we need to analyze the performance considering various trade-offs.

There are three basic performance metrics.

Latency : Time taken by a operation to get completed.(seconds per operation)
Bandwidth: The rate at which operations are performed ( operations per second)
Cost: It is the measure of impact operations have on total execution time of the program. (latency times number of operations)

In parallel computing many operations are performed concurrently so relationship among performance metrics is not simple and we need to consider the performance for communication operations in parallel computing and see how they are measured using these three matrices. As data transfer operations are the most frequent type of communication operation, we will first discuss that.

Artifacts of Measuring Performance

Data Transfer

For any data transfer we would like to estimate the time it consumes, so that we can improve the overall performance of the system by reducing data transfer time. To estimate the data transfer time a simple linear model is used which is following as discussed in the text book:

Total Transfer Time (T) = Start-up Time (T₀) + Data Transfer Time (T_D)

Total transfer time has two components: first part is a constant term (T0) which is called start up cost. We will shortly return to this with more details, first let us look at second part: data transfer time which is estimated as following:

Data Transfer Time (T_D)) = Size of Data (n) / Bandwidth (B)

Bandwidth (B) is also called data transfer rate.

Here couple of observations which we should make in order to understand this model properly.

If we have only one pair of host then data transfer rate is simply the bandwidth of the link connecting those hosts.
However if there are many hosts between the source host and destination host, bottleneck is the link with lowest bandwidth.

Coming back to start-up cost, notice that T₀ is a constant term for a particular data transfer,but it might vary as we consider data transfer over different entities. For example, in memory operation start up cost is memory access time. In message passing the start up cost can be estimated as time taken by fist bit to reach destination. For pipelined operations , start up cost is simply time taken to fill up the pipeline. For bus transactions it is arbitration and command phases.

As parallel computing has grown, one of major focus has been to reduce start up cost.There are many ways to do so, we describe few of them here. As stated earlier start up cost for memory operations is basically the memory access time. To reduce memory access time, architects have introduced costly (hence small size) but fast storage area called cache. Depending upon the spatial and temporal locality cache is filled with useful items and hence processor does not have to go to memory (long latency ) every time it needs data. Average access time is governed by the following formulaComputer Architecture: A Quantitative Approach, Appendix C:

Average memory access time = Hit time for the cache + (Miss Rate X Miss Penalty)

Therefore architects often try to reduce all three components by adopting different cache optimization like: multilevel cache, larger blocks size, higher associatively etc.

Just to give a quantitative feeling, we can quote the access time of cache and main memory to see how useful it is to introduce cache if we manage to get higher hit rates. Cache access time is typically 0.5-25 ns while for the main memory it is 50-250 ns. Bandwidth for caches range around 5,000-20,000 MB/sec but for memory its as low as 2,500-10,000MB/sec. Computer Architecture: A Quantitative Approach, Appendix C

Similarly for bus transactions, start up cost is arbitration and command phases, suppose on a 3.6 GHz Intel Xeon machine (year 2004) it takes 3 cycles to arbitrate the bus and present the address the start up cost is around 0.83 nano seconds. However assuming that around year 1994-95 it took same 3 cycles on Alpha 21064A 0.3 GHz processor we can see that start up cost has been reduced by more than 10 times.Computer Architecture: A Quantitative Approach, Chapter 1

Pipelining is another way to reduce the data transfer time, for pipelined systems filling up the pipeline is the total start up cost. Though it seems that introducing pipeline adds extra start up cost, however more importantly pipeline allows multiple operations to take place concurrently and this indeed helps in achieving higher bandwidth..Computer Architecture: A Quantitative Approach, Appendix A

Important point to note is that the achievable bandwidth depends on the the transfer size, that is why sometimes bandwidth (B) is called as peak bandwidth. To make this clearer:

Suppose we have two hosts connected by a link with bandwidth of 20MB/s and start up cost of communication is 2 micro seconds. We want to transfer an image of size 40MB then the total transfer time is 2 seconds plus 2 micro seconds. Given the available peak bandwidth of 20MB/s, one might have expected to complete the transfer in 2 seconds achieving the peak bandwidth but start up cost prohibits this. Clearly as you increase the amount of data achievable bandwidth approaches the asymptotic rate of bandwidth (B), start up cost determines how fast the asymptotic rate would be achieved.

As a special case, the amount of data required to achieve half of peak bandwidth (B) is equal to T₀XB.

However, this data transfer model has few shortcomings too.

This model does not indicate when the next operation can be performed, this is particularly very useful because bandwidth delivered depends on how frequently operations are initiated. This model does not tell about whether other useful work can be done during transfer or not. Startup cost calculation is many times challenging, with enhancement in technology focus is too reduce the start-up cost and increase bandwidth.

Data transfers usually take place across the network in parallel computing environment. It is invoked by processor through the communication assist so next, we look at how communication and cost is estimated and what are important factors to consider.

Overhead and Occupancy

One of the three components of Processor execution time, apart from Computation Time and Idle Time, is Communication Time. Communication time is the time spent by the processor on sending and receiving messages. There can be two different types of communication, interprocessor and intraprocessor. In interprocessor communication the two communicating tasks are handled by two different processors. While in intraprocessor communication the communicating tasks are handled by the same processor. Generally both intraprocessor and interprocessor communication costs same, provided the intraprocessor communication is not highly optimized.

The communication time is function of number of bytes transferred across. It can be given as below

T(n) = Overhead + Occupancy + Network Delay + Message Size / Bandwidth + Delay due to Contention

Communication Overhead includes time spent to

Create messages

Execute communications protocols

Physically send messages

Run through the protocol sets and decode the message on the receiving node.

During this period the processor cannot do any useful or computational work. Parallel programs, running on different processors, need coordination of work among themselves. This results in increased rate of interprocessor communication, which in turn increases the net overhead cost.

Occupancy is the time spent at the slowest component in the communication assist and it affects performance in couple of ways. It delays the current request and indirectly contributes to the delays of subsequent requests. The occupancy gets to set the upper limit on how frequently communication operation can be initiated by the processor.

From processor’s point of view there are number of other network delays which can be categorized as occupancy. Contention for resources can be view as occupancy. The net bandwidth reduces as a result of this. If P concurrent processors are using a network of Bandwidth B, then the effective bandwidth would be B/P.

Some of the recent trends/designs helped reduce these communication costs. IBM Blue Gene (L) uses Global Collective Network, which carries out operations within the network itself. This saves the processors time to decode messages with intermediate values, calculate new intermediate values, create new messages, and send them on to other nodes. The overhead now is primarily because of communication protocol. It also has a dedicated communications network, Global Barrier and Interrupt Network, to speed up task-to-task coordination activities. IBM BG/L also employs Torus Network, which results in linear growth of the path length while nodes (processors) scale as a cube. Torus Network also gives the ability to send messages in either direction, something like a ring network and hence reduces the distance between furthest points to half. This in turn reduces the network delays.

Communication Cost

At the end of the day we want to reduce the communication cost. Communication cost is given by following equation:

Communication cost = Frequency of communication x (Communication Time – Overlap)

Frequency of communication is self explanatory, which depends on the machine architecture and program design. Some architecture like scale-up symmetric multiprocessing (SMP) and scale-out massively parallel processing (MPP) systems supports tightly coupled parallel applications. This results in high frequency of communication, which makes it important to have the other parts of communication time like overhead and network delays to be small. Loosely coupled parallel applications, on inherently parallel system architecture, requires minimal inter process communication.

The portion of the communication operation which is performed concurrently with processor engaged in other useful work (computation and other communication) is the overlap. This reduces the communication cost. But, in case where the context switch(s) of work takes up a large portion of the communication period then the net overlap time is less.

CSC/ECE 506 Fall 2007/wiki1 12 dp3

Contents

Communication and Replication

Performance

Introduction

Artifacts of Measuring Performance

Data Transfer

Overhead and Occupancy

Communication Cost

Navigation menu

CSC/ECE 506 Fall 2007/wiki1 12 dp3

Communication and Replication

Performance

Introduction

Artifacts of Measuring Performance

Data Transfer

Overhead and Occupancy

Communication Cost

Navigation menu

Search