CSC/ECE 506 Fall 2007/wiki1 1.3.3 1.3.4 chase2007
Introduction to Parallel Computer Architecture -> Fundamental Design Issues -> Communication and Replication (Section 1.3.3)
Communication and Replication
Communication occurs when data written by one process is read by another process.
Replication is the creation of a local copy of data to help enable parallelism.
Replication helps avoid unnecessary communication: once a process holds a local copy, repeated reads can be satisfied without any further data transfer.
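As a toy sketch of this idea, the snippet below replicates a remote value on first read so that later reads involve no communication. The stores and the fetch_remote function are hypothetical illustrations, not any particular API:

    # Replication sketch: keep a local copy of remote data so repeated
    # reads need no further communication. All names here are hypothetical.
    remote_store = {"x": 42}   # data owned by another process
    local_replica = {}         # this process's local copies

    def fetch_remote(key):
        # Stands in for an actual inter-process data transfer.
        print(f"communication: fetching '{key}' from the owning process")
        return remote_store[key]

    def read(key):
        if key not in local_replica:            # first read: communicate
            local_replica[key] = fetch_remote(key)
        return local_replica[key]               # later reads: local copy only

    read("x")   # triggers one communication
    read("x")   # served from the replica, no communication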
Introduction to Parallel Computer Architecture -> Fundamental Design Issues -> Performance (Section 1.3.4)
Computer Performance:
Computer performance is a measure of the work a computer accomplishes with respect to the time and resources used.
Performance metrics:
The following are the important metrics used to measure a computer's performance:
1. Latency: the time taken to perform an operation
2. Bandwidth: the rate at which operations are performed
3. Cost: the impact these operations have on the execution time of the program
All of the above metrics can be used to characterize a uniprocessor system, where a single CPU operates.
In the context of parallel computers, however, these metrics alone are not sufficient. The reason is the communication between processors, which occurs mostly in the form of data transfers. So, to completely characterize the performance of a parallel computer, the following metrics are also considered.
Data Transfer Time:
It is the sum of the time taken to initiate a data transfer and the time required for the actual transfer. The Data Transfer Time can be given as:
Transfer Time(n) = T + n/B
where
n = amount of data (bytes)
B = transfer rate of the component moving the data (bytes per second)
T = start-up cost, a constant
This is a very convenient model, and it is used to describe a diverse collection of operations, including messages, memory accesses, bus transactions, and vector operations. For message passing, the start-up cost can be thought of as the time for the first bit to get to the destination. For memory operations, it is essentially the access time. For bus transactions, it reflects the bus arbitration and command phases. For any sort of pipelined operation, including pipelined instruction processing or vector operations, it is the time to fill the pipeline.
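As an illustration, the sketch below evaluates the model for a few transfer sizes. The start-up cost T and peak rate B are assumed values chosen only to make the shape of the model visible:

    # Linear transfer-time model: Transfer Time(n) = T + n/B.
    # T and B below are illustrative assumptions, not measured values.
    def transfer_time(n, T=10e-6, B=100e6):
        """Seconds to move n bytes: fixed start-up cost plus n/B."""
        return T + n / B

    for n in (64, 4096, 1_048_576):
        t = transfer_time(n)
        print(f"n = {n:>7} bytes: time = {t * 1e6:8.1f} us, "
              f"effective bandwidth = {n / t / 1e6:6.1f} MB/s")

Small transfers are dominated by the start-up cost T, and the delivered bandwidth approaches the peak rate B only as n grows large.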
Overhead and Occupancy:
Data transfer operations are initiated by the processor through the communication assist. The essential components of such an operation can be described by the following simple model, which is generic enough to explain data transfers in many places in modern, highly pipelined computer systems:
Communication Time = Overhead + Occupancy + Network Delay
The overhead is the time the processor spends initiating the transfer of data. This may be a fixed cost, if the processor simply has to tell the communication assist to start; it may also grow linearly with the size of the transfer, if the processor has to copy the data into the assist. The key point is that this is time the processor is busy with the communication event; it cannot do other useful work or initiate other communication during this time. The remaining portion of the communication time is considered network latency; it is the part that can be hidden by other processor operations.
The occupancy is the time it takes for the data to pass through the slowest component on the communication path. The data will occupy other resources as well, including buffers, switches, and the communication assist. Often the communication assist is the bottleneck that determines the occupancy. The occupancy limits how frequently communication operations can be initiated: the next data transfer will have to wait until the critical resource is no longer occupied before it can use that same resource. If there is buffering between the processor and the bottleneck, the processor may be able to issue a burst of transfers at a frequency greater than 1/occupancy; however, once this buffer is full, the processor must slow to the rate set by the occupancy, starting a new transfer only when an older one finishes.
The remaining communication time is lumped into the network delay, which includes the time for a bit to be routed across the actual network as well as many other factors, such as the time to get through the communication assists.
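The sketch below makes the roles of these terms concrete. The parameter values are assumptions for illustration; the point is that a single transfer pays the full three-term latency, while a stream of transfers is ultimately rate-limited by the occupancy (or by the overhead, if that is larger):

    # Three-part communication-time model with assumed parameter values.
    overhead = 2e-6        # s: processor busy initiating each transfer
    occupancy = 5e-6       # s: bottleneck resource held per transfer
    network_delay = 10e-6  # s: remaining latency across the network

    single = overhead + occupancy + network_delay  # one isolated transfer

    def stream_time(k):
        # k pipelined transfers: after the first, completions are spaced
        # by the larger of the overhead and the occupancy.
        return single + (k - 1) * max(overhead, occupancy)

    print(f"1 transfer:    {single * 1e6:6.1f} us")
    print(f"100 transfers: {stream_time(100) * 1e6:6.1f} us "
          f"(~{max(overhead, occupancy) * 1e6:.1f} us each in steady state)")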
Communication Cost:
It is the time that the processor spends communicating with other processors. It can be given by the following:
Communication Cost = Frequency of Communication * (Communication Time - Overlap)
where
Frequency of Communication = the number of communication operations per unit of work
Communication Time = Overhead + Occupancy + Network Delay
Overlap = the portion of the communication operation that is performed concurrently with other useful work
Frequency of communication depends on many programming factors and many hardware design factors. In particular, hardware may limit the transfer size and thereby determine the minimum number of messages. It may automatically replicate data or migrate it to where it is used. However, a certain amount of communication is inherent to parallel execution, since data must be shared and processors must coordinate their work. In general, for a machine to support programs with a high communication frequency, the other parts of the communication cost equation must be small: low overhead, low network delay, and small occupancy.
The overlap is the portion of the communication operation that is performed concurrently with other useful work, including computation or other communication. The reduction of the effective cost is possible because much of the communication time involves work done by components of the system other than the processor, such as the communication assist, the bus, the network, or the remote processor or memory. Overlapping communication with other work is a form of small-scale parallelism, as is the instruction-level parallelism exploited by fast microprocessors. In effect, some available parallelism is invested to hide the actual cost of communication.
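A short worked example ties the equation together. All of the numbers below are assumptions chosen for illustration:

    # Worked example of Communication Cost = Frequency * (Comm Time - Overlap).
    # Every value here is an assumed, illustrative number.
    comm_time = 2e-6 + 5e-6 + 10e-6  # overhead + occupancy + delay = 17 us
    overlap = 8e-6                   # time hidden behind useful computation
    frequency = 1000                 # communication operations per unit of work

    cost = frequency * (comm_time - overlap)
    print(f"Communication cost: {cost * 1e3:.0f} ms per unit of work")  # 9 ms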
BSP Model:
The Bulk Synchronous Parallel (BSP) model structures a computation as a series of supersteps.
The cost of a superstep on a BSP machine is given by W + G*H + L, where
W is the maximum amount of local computation performed by any processor
H is the maximum number of words sent or received by any processor
G is the communication cost per word, reflecting the throughput of the network
L is the time required for the barrier synchronization at the end of the superstep
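The sketch below evaluates the superstep cost for a small assumed configuration; G and L are machine parameters, and the values used here are illustrative only:

    # BSP superstep cost: W + G*H + L, with assumed machine parameters.
    def superstep_cost(work, words, G=0.1e-6, L=50e-6):
        """W = slowest processor's work; H = largest communication volume."""
        W = max(work)    # seconds of local computation on the slowest processor
        H = max(words)   # most words sent or received by any processor
        return W + G * H + L

    work = [100e-6, 120e-6, 90e-6, 110e-6]  # per-processor compute time (s)
    words = [200, 500, 300, 400]            # per-processor words sent/received

    print(f"Superstep cost: {superstep_cost(work, words) * 1e6:.0f} us")
    # 120 us work + 0.1 us/word * 500 words + 50 us barrier = 220 us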
There are other, similar models, such as LogP, that refine this basic approach.
Terminology:
Saturation Point: the point at which a parallel system can no longer keep up with the load placed on it; beyond this point, response times grow without bound. Saturation is reached when the load completely utilizes the scarcest resource, often the CPU.
Scalability: a parallel system is often rated not by its speed alone but by how it performs under varying loads. A scalable system shows a gradual increase in response time as load grows, until it reaches saturation.
Load: the measure of the work per unit time demanded of the system. It depends on the number of actions taking place, the time between those actions, and the complexity of each action.
References:
Parallel Computer Architecture: A Hardware/Software Approach by David E. Culler, Jaswinder Pal Singh, and Anoop Gupta
http://en.wikipedia.org/wiki/Computer_performance
The BSP Model http://wwwcs.uni-paderborn.de/fachbereich/AG/agmadh/WWW/bono/paper/nestedbsp/node6.html
LogP: Towards a Realistic Model of Parallel Computation http://cs315b-wiki.stanford.edu/images/8/8b/Logp.pdf