<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Schakra8</id>
	<title>Expertiza_Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Schakra8"/>
	<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Special:Contributions/Schakra8"/>
	<updated>2026-07-01T22:51:38Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74692</id>
		<title>CSC/ECE 506 Spring 2013/9b sc</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74692"/>
		<updated>2013-04-04T00:14:22Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Combining Tree Barrier */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Instruction and Previous Work =&lt;br /&gt;
Changes are made as per the instructions given in https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM/edit&lt;br /&gt;
&lt;br /&gt;
Previous Version can be found here: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2011/ch9_ms&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
In addition to a proper cache coherency model, it is important for a multiprocessor system to provide support for synchronization. This must be provided at the hardware level, and can also be implemented to some degree in software. The most common types of hardware synchronization are locks and barriers, which are discussed in this chapter.&lt;br /&gt;
&lt;br /&gt;
== Hardware vs. Operating System Synchronization ==&lt;br /&gt;
Synchronizations are characterized by three parts: acquire, wait, and release. Both hardware and operating system synchronization rely on atomic instructions for acquire and release, but the implementations for the wait portion of a synchronization can vary. Hardware implementations usually use busy-waiting, where a process repeatedly tests a variable waiting for it to change. The operating system can wait using blocking, where the process suspends itself until woken up by another process.&lt;br /&gt;
&lt;br /&gt;
Blocking has the advantage of freeing up the processor for use by other processes, but requires operating system functions to work and thus has a lot of overhead. Busy-waiting has much less overhead, but uses processor and cache resources while waiting. Because of these tradeoffs, it is usually better to use busy-waiting for short wait periods, and blocking for long wait periods.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
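&lt;br /&gt;
As a rough illustration of this tradeoff, the sketch below spins for a bounded number of attempts and then falls back to a blocking acquire. It is a minimal Python sketch and is not taken from the cited text: the function name acquire_spin_then_block and the spin_limit parameter are assumptions made for illustration, and threading.Lock stands in for both the hardware lock variable and the operating system's blocking primitive.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time


def acquire_spin_then_block(lock, spin_limit=100):
    """Busy-wait for a bounded number of attempts, then block."""
    for _ in range(spin_limit):
        # Non-blocking attempt: low overhead, but keeps the processor busy.
        if lock.acquire(blocking=False):
            return "spun"
        time.sleep(0)  # give other threads a chance to run
    # Long wait: let the operating system suspend this thread.
    lock.acquire()
    return "blocked"
```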
&lt;br /&gt;
= Lock Implementations =&lt;br /&gt;
&lt;br /&gt;
Locks are an important concept when programming for multiprocessor systems. The purpose of a lock is to protect the code inside the lock, known as a critical section.  It must be certain that while some thread X has entered the critical section, another thread Y is not also inside of the critical section and possibly modifying critical values. While thread X has entered the critical section, thread Y must wait until X has exited before entering. This can be accomplished in a variety of ways of varying complexity and performance.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate lock performance, we can use these four criteria:&lt;br /&gt;
&lt;br /&gt;
*Acquisition Latency - How much time does it take to acquire the lock?&lt;br /&gt;
*Traffic - How much bus traffic is generated by threads attempting to acquire the lock?&lt;br /&gt;
*Fairness - Are threads granted the lock in the order they requested it (FIFO), or by luck?&lt;br /&gt;
*Storage - How much storage is needed compared to the number of threads?&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Acquisition Latency ===&lt;br /&gt;
We want a low acquisition latency, especially for applications that repeatedly acquire and release locks. Over the run time of a program, the acquisition latency compounded many times could have a large performance effect. However, low acquisition latency has to be balanced against other factors when implementing a lock.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
Traffic is an important consideration when evaluating lock performance. If acquiring a lock causes a lot of bus traffic, the lock will not scale well as the number of threads increases. Eventually the bus will become choked.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Fairness ===&lt;br /&gt;
In general, threads should acquire a lock in the same order in which they requested it. If this does not happen, as the number of threads increases so does the chance that a thread will become starved.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
The amount of storage needed per lock should be small enough such that it scales to a large number of threads without any problems. If a large amount of storage is required, then multiple threads could cause locks to consume too much memory.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
&lt;br /&gt;
Two common lock implementations are test&amp;amp;set and test-and-test&amp;amp;set. In terms of performance, both have a low storage cost (scalable) and are not fair. Test&amp;amp;set has a low acquisition latency but is not scalable due to high bus traffic when there is a lot of contention for the lock. Test-and-test&amp;amp;set has a higher acquisition latency but scales better because it generates less bus traffic. The next figure shows performance comparisons for these two lock implementations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:lockperf.png|frame|none|Performance comparison of test&amp;amp;set and test-and-test&amp;amp;set (spin on read)&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
=== Improving performance ===&lt;br /&gt;
Optimizing software can improve the performance of locks. For instance, inserting a delay between a processor noticing a release and then trying to acquire the lock can reduce contention and bus traffic, and so increase lock performance. TCP-like backoff algorithms can be used to adjust this delay depending on the number of threads contending for the lock.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
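&lt;br /&gt;
The delay idea can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation evaluated in the cited paper: the class name BackoffLock and the delay bounds are hypothetical, and threading.Lock stands in for the hardware test&amp;amp;set instruction.&lt;br /&gt;
&lt;br /&gt;
```python
import random
import threading
import time


class BackoffLock:
    """Spin lock with randomized exponential backoff (illustrative only)."""

    def __init__(self, min_delay=1e-6, max_delay=1e-3):
        # threading.Lock is an assumed stand-in for an atomic test-and-set.
        self._flag = threading.Lock()
        self._min_delay = min_delay
        self._max_delay = max_delay

    def acquire(self):
        delay = self._min_delay
        while True:
            # Test: spin on a plain read while the lock looks busy.
            while self._flag.locked():
                time.sleep(0)
            # Test-and-set: a single atomic attempt to take the lock.
            if self._flag.acquire(blocking=False):
                return
            # Lost the race: wait a random time, then double the bound.
            time.sleep(random.uniform(0, delay))
            delay = min(2 * delay, self._max_delay)

    def release(self):
        self._flag.release()


def run_demo(num_threads=4, increments=200):
    lock = BackoffLock()
    total = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            total[0] += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```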
&lt;br /&gt;
Another way to improve performance is to insert a delay between memory references so as to limit the bandwidth (traffic) each processor can use for spinning.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One way to guarantee fairness is to queue lock attempts in shared memory so that they happen in order.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
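&lt;br /&gt;
One simple way to obtain this first-come, first-served ordering is a ticket lock, sketched below in Python. It is a different and simpler technique than the queue described in the cited paper, and the names TicketLock and run_demo are assumptions made for illustration; a small threading.Lock stands in for an atomic fetch-and-increment instruction.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time


class TicketLock:
    """FIFO lock: threads are served in the order they took a ticket."""

    def __init__(self):
        self._next_ticket = 0
        self._now_serving = 0
        # Assumed stand-in for an atomic fetch-and-increment instruction.
        self._atomic = threading.Lock()

    def acquire(self):
        with self._atomic:
            my_ticket = self._next_ticket
            self._next_ticket += 1
        while self._now_serving != my_ticket:
            time.sleep(0)  # yield while waiting for our turn
        return my_ticket

    def release(self):
        # Only the current holder writes this field.
        self._now_serving += 1


def run_demo(num_threads=4, rounds=50):
    lock = TicketLock()
    served = []

    def worker():
        for _ in range(rounds):
            ticket = lock.acquire()
            served.append(ticket)  # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return served
```

In this sketch the tickets recorded inside the critical section come out in strictly increasing order, which is the fairness property discussed above.&lt;br /&gt;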
&lt;br /&gt;
== Atomic Instructions ==&lt;br /&gt;
&lt;br /&gt;
Since a multiprocessor system cannot disable interrupts as an effective method to execute code atomically, there must be hardware support for atomic operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ability to execute an atomic instruction is a requirement for most lock implementations. It is important that when a processor attempts to set a lock, either the lock is fully set and the thread is able to enter into the critical section, or the lock is not set and it appears that none of the instructions required to set the lock have executed.  &lt;br /&gt;
&lt;br /&gt;
In the x86 instruction set the opcode CMPXCHG (compare and exchange) can be used in a lock implementation in order to guarantee atomicity. The instruction takes a destination and a source operand. The accumulator is compared to the destination and, if they are equal, the destination is loaded with the source. If they are not equal, the accumulator is loaded with the destination value. To ensure that this is executed atomically, the opcode must be issued with the LOCK prefix. This is useful in implementing some locks, such as ticket locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; In the AT&amp;amp;T syntax used in the example below, the instruction is written LOCK CMPXCHG SOURCE, DESTINATION. The source has to be a register, while the destination can be either a register or memory. An example of LOCK CMPXCHG appears in the disassembly of the Linux kernel's rtc_cmos_read(). To take the lock, a CPU reads the lock variable and, if it is zero, updates it to a nonzero value. If the read and the update were separate steps, another CPU could change the variable after it was read but before it was updated, causing a race condition. LOCK CMPXCHG prevents this race by performing the comparison and the update as a single atomic step inside the busy-wait loop. The pseudocode is given below&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 spin: # see if the lock-variable is clear&lt;br /&gt;
 mov cmos_lock, %eax&lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
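 # ok, now we try to grab the lock (%eax is 0 here; %edx is assumed to hold a nonzero value)&lt;br /&gt;
 lock cmpxchg %edx, cmos_lock&lt;br /&gt;
 # did another CPU grab it first? if so, cmpxchg loaded the nonzero lock value into %eax&lt;br /&gt;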
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
&lt;br /&gt;
== Hand-off Lock ==&lt;br /&gt;
&lt;br /&gt;
Another type of lock that is not discussed in the text is known as the &amp;quot;hand-off&amp;quot; lock. In this lock, the first thread acquires the lock because no other thread currently holds it. When another thread attempts to acquire the lock, it sees that the lock is in use and adds itself to the queue. This thread can then sleep until it is woken by the thread holding the lock. Once the thread holding the lock is finished, it passes the lock to the next thread in the queue&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. A waiting thread gives up the processor so that other threads (e.g. the thread with the lock) can run more quickly, and it is woken up when the lock becomes free. Keeping the waiting queue separate from the ready queue speeds up execution. In the example below, the unlock method hands the lock directly to a waiting thread. Interrupts are disabled while the lock state and the queues are updated and are enabled again afterwards.&lt;br /&gt;
&lt;br /&gt;
 lock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     if (value == FREE) {&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     } else {&lt;br /&gt;
         add thread to queue of threads waiting for this lock&lt;br /&gt;
         switch to next runnable thread&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
 unlock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     value = FREE&lt;br /&gt;
     if (any thread is waiting for this lock) {&lt;br /&gt;
         move waiting thread from waiting queue to ready queue&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
However, this scheme fails if lock() re-enables interrupts before calling switch and an interrupt then occurs. The sequence is: lock() adds the thread to the wait queue; lock() enables interrupts; an interrupt causes pre-emption, i.e. a switch to another thread; pre-emption moves the thread to the ready queue. The thread is now on two queues (wait and ready)! In addition, switch is itself likely to be a critical section. Adding the thread to the wait queue and switching to the next thread must therefore be atomic. The solution is for the waiting thread to leave interrupts disabled when it calls switch. The next thread to run has the responsibility of re-enabling interrupts before returning to user code. When the waiting thread wakes up, it returns from switch with interrupts disabled (by the last thread).&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Avoiding Locks ==&lt;br /&gt;
&lt;br /&gt;
There are many reasons why a programmer should attempt to write programs in such a way as to avoid locks if possible.  There are many problems that can arise with the use of locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One of the most well-known issues is deadlock.  This can occur when threads are waiting to acquire a lock that will never be released. For example, suppose two processes A and B each need the same two shared resources (1 and 2) to complete a task. A deadlock arises when A grabs 1 and B grabs 2: each waits for the resource held by the other, so neither A nor B can change its state. One solution is to impose an ordering, so that a process can lock resource 2 only if it has already locked resource 1.&lt;br /&gt;
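&lt;br /&gt;
The ordering rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names lock_1, lock_2, balance and transfer are hypothetical, and the two locks play the role of resources 1 and 2.&lt;br /&gt;
&lt;br /&gt;
```python
import threading

lock_1 = threading.Lock()
lock_2 = threading.Lock()
balance = {"a": 100, "b": 100}


def transfer(src, dst, amount):
    # Every thread takes lock_1 before lock_2, whatever the direction of
    # the transfer, so two threads can never each hold one lock while
    # waiting for the other.
    with lock_1:
        with lock_2:
            balance[src] -= amount
            balance[dst] += amount


def run_demo(rounds=500):
    def repeat(src, dst):
        for _ in range(rounds):
            transfer(src, dst, 1)

    threads = [
        threading.Thread(target=repeat, args=("a", "b")),
        threading.Thread(target=repeat, args=("b", "a")),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(balance)
```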
&lt;br /&gt;
Another problem with using locks is that the performance is not optimal, as often a lock is used when there is only a chance of conflict.  This approach to programming yields slower performance than what might be possible with other methods.  This also leads to questions of granularity, that is, how much of the code should be protected under the critical section.  The programmer must decide between many small (fine-grained) locks or fewer, more encompassing locks.  This decision can greatly affect the performance of the program.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid locks, lock-free programming techniques can be used; these are built on atomic operations such as compare-and-swap (CAS) and rely on a sequentially consistent view of memory. The pattern typically involves copying a shared variable to a local variable, performing some speculative work, and attempting to publish the changes using CAS&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 void LockFreeQueue::push(Node* newHead)&lt;br /&gt;
 {&lt;br /&gt;
     for (;;)&lt;br /&gt;
     {&lt;br /&gt;
         // Copy a shared variable (m_Head) to a local.&lt;br /&gt;
         Node* oldHead = m_Head;&lt;br /&gt;
 &lt;br /&gt;
         // Do some speculative work, not yet visible to other threads.&lt;br /&gt;
         newHead-&amp;gt;next = oldHead;&lt;br /&gt;
 &lt;br /&gt;
         // Next, attempt to publish our changes to the shared variable.&lt;br /&gt;
         // If the shared variable hasn't changed, the CAS succeeds and we return.&lt;br /&gt;
         // Otherwise, repeat.&lt;br /&gt;
         if (_InterlockedCompareExchange(&amp;amp;m_Head, newHead, oldHead) == oldHead)&lt;br /&gt;
            return;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
The code above follows a read-modify-write pattern: each thread works on its own local copy and publishes the result only if the shared variable still holds the value that was read, so that the next thread sees a correct value. If another thread has changed the variable in the meantime, the CAS fails and the loop retries.&lt;br /&gt;
&lt;br /&gt;
= Barrier Implementations =&lt;br /&gt;
&lt;br /&gt;
In many ways, barriers are simpler than locks.  A barrier is simply a point in a program that one or more threads must reach before the parallel program is allowed to continue.  When using barriers, a programmer does not have to be concerned with advanced topics, such as fairness, that arise when programming with locks.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate barrier performance, we can use these two criteria:&lt;br /&gt;
&lt;br /&gt;
*Latency - The time required to enter and exit a barrier.&lt;br /&gt;
*Traffic - Communications overhead required by the barrier.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Latency ===&lt;br /&gt;
&lt;br /&gt;
Ideally the latency of a barrier should be small, but for barriers that don't scale well latency can increase as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
&lt;br /&gt;
Low traffic is also good, as we do not want excess bus traffic to prevent scalability.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+Performance Comparison of Barrier Implementations&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! Criteria&lt;br /&gt;
! Sense Reversal&lt;br /&gt;
Centralized Barrier&lt;br /&gt;
! Combining Tree&lt;br /&gt;
Barrier&lt;br /&gt;
! Barrier Network&lt;br /&gt;
(Hardware)&lt;br /&gt;
|-&lt;br /&gt;
| Latency&lt;br /&gt;
| O(1)&lt;br /&gt;
| O(log p)&lt;br /&gt;
| O(log p)&lt;br /&gt;
|-&lt;br /&gt;
| Traffic&lt;br /&gt;
| O(p&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;)&lt;br /&gt;
| O(p)&lt;br /&gt;
| moved to a separate network&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Sense-Reversal Barrier ==&lt;br /&gt;
&lt;br /&gt;
This barrier is a centralized barrier in which a single count variable, protected by a lock, is shared among all the threads. Each thread, on reaching the barrier, increments the count and waits until it reaches the number of threads for which the barrier was set up. Since all the threads spin on a single variable, the number of cache misses scales quadratically with the number of processors.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki01.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A sense-reversing barrier is a more elegant and practical solution to the problem of reusing barriers. A phase’s sense is a Boolean value: true for even-numbered phases and false otherwise. Each SenseBarrier object has a Boolean sense field indicating the sense of the currently executing phase. Each thread keeps its current sense as a thread-local object. Initially the barrier’s sense is the complement of the local sense of all the threads. When a thread calls await(), it checks whether it is the last thread to decrement the counter. If so, it reverses the barrier’s sense and continues. Otherwise, it spins waiting for the barrier’s sense field to change to match its own local sense.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;13body&amp;quot;&amp;gt;[[#13foot|[13]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 struct barrier {&lt;br /&gt;
     shared int count;   // a Fetch&amp;amp;Inc/Dec() object with initial value n;&lt;br /&gt;
                         // this object also supports read and write&lt;br /&gt;
     boolean sense;      // initially FALSE&lt;br /&gt;
     boolean mysense[n]; // initially, mysense[i] = TRUE for each 1 ≤ i ≤ n&lt;br /&gt;
 };&lt;br /&gt;
 void await(struct barrier *B) { // code for process pi&lt;br /&gt;
     int position = Get&amp;amp;Dec(B-&amp;gt;count);&lt;br /&gt;
     if (position == 1) { // last process to arrive&lt;br /&gt;
         B-&amp;gt;count = n; // reset for the next phase&lt;br /&gt;
         B-&amp;gt;sense = B-&amp;gt;mysense[i]; // release the waiting processes&lt;br /&gt;
     }&lt;br /&gt;
     else {&lt;br /&gt;
         while (B-&amp;gt;sense != B-&amp;gt;mysense[i])&lt;br /&gt;
             noop;&lt;br /&gt;
     }&lt;br /&gt;
     B-&amp;gt;mysense[i] = 1 - B-&amp;gt;mysense[i]; // flip the local sense for the next phase&lt;br /&gt;
 }&lt;br /&gt;
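&lt;br /&gt;
For readers who want something executable, the same idea can be sketched in Python. This is a minimal sketch under stated assumptions rather than the cited implementation: the wait method and the run_demo helper are hypothetical names, a threading.Lock stands in for the atomic fetch-and-decrement, and threading.local holds each thread's private sense.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time


class SenseBarrier:
    """Reusable centralized barrier based on sense reversal."""

    def __init__(self, n):
        self._n = n
        self._count = n
        self._sense = False
        # Assumed stand-in for an atomic fetch-and-decrement.
        self._count_lock = threading.Lock()
        self._local = threading.local()

    def wait(self):
        # Each thread starts with local sense True, the complement of
        # the barrier's initial sense.
        my_sense = getattr(self._local, "sense", True)
        with self._count_lock:
            position = self._count
            self._count -= 1
        if position == 1:
            self._count = self._n   # reset for the next phase
            self._sense = my_sense  # release the waiting threads
        else:
            while self._sense != my_sense:
                time.sleep(0)       # spin, yielding to other threads
        self._local.sense = not my_sense


def run_demo(num_threads=4, phases=5):
    barrier = SenseBarrier(num_threads)
    arrived = [0] * phases
    guard = threading.Lock()
    ok = []

    def worker():
        for p in range(phases):
            with guard:
                arrived[p] += 1
            barrier.wait()
            # Past the barrier, every thread must have arrived in phase p.
            ok.append(arrived[p] == num_threads)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ok
```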
&lt;br /&gt;
== Combining Tree Barrier ==&lt;br /&gt;
&lt;br /&gt;
This barrier is a distributed barrier in which groups of processors form clusters, and each cluster updates its own local variable. When a local variable reaches a value equal to the number of threads updating it, the last thread to update it proceeds to increment another variable higher in the hierarchy of the combining tree. When the variable at the highest level of the combining tree reaches its maximum value, all the threads have reached the barrier and synchronization is complete.&lt;br /&gt;
&lt;br /&gt;
Since in this case the threads update local variables in smaller groups, the miss rate is not as high as in the sense-reversal barrier.&lt;br /&gt;
The diagram below shows the operation of a combining tree barrier with the threads divided into two groups. The variables C0 and C1 are local to each group, and C2 is the variable at the higher level of the hierarchy in the tree.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki02.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another example, in the form of pseudocode, is given below:&lt;br /&gt;
&lt;br /&gt;
 procedure combining_barrier&lt;br /&gt;
     combining_barrier_aux(mynode) // join the barrier&lt;br /&gt;
     sense := not sense // for next barrier&lt;br /&gt;
 procedure combining_barrier_aux(nodepointer : ^node)&lt;br /&gt;
     with nodepointer^ do&lt;br /&gt;
         if fetch_and_decrement(&amp;amp;count) = 1 // last to reach this node&lt;br /&gt;
             if parent != nil&lt;br /&gt;
                 combining_barrier_aux(parent)&lt;br /&gt;
             count := k // prepare for next barrier&lt;br /&gt;
             nodesense := not nodesense // release waiting processors&lt;br /&gt;
         repeat until nodesense = sense&lt;br /&gt;
&lt;br /&gt;
Here, each processor starts at a leaf node of the tree and decrements the leaf's count value. The last descendant to reach each node in the tree continues further up. The processor that reaches the root then retraces its path down through the tree and unblocks the siblings waiting at each node along the path.&lt;br /&gt;
&lt;br /&gt;
== Tournament Barrier ==&lt;br /&gt;
&lt;br /&gt;
In a tournament barrier the threads are the leaves of a binary tree, and each node represents a flag. Two threads compete with each other; the loser is allowed to set the flag, move to the next level, and compete to lose against the loser from the other section of the binary tree. Thus the thread that completes last sets the highest flag in the binary tree. Setting this flag indicates to all the threads that the barrier has been completed, and synchronization is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki03.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
An example of a tournament barrier is given as follows:&lt;br /&gt;
&lt;br /&gt;
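 void await(boolean mySense) {&lt;br /&gt;
   if (top) {&lt;br /&gt;
     return; // reached the root: the barrier is complete&lt;br /&gt;
   } else if (parent != null) {&lt;br /&gt;
     while (flag != mySense) {}; // winner: wait for the partner to arrive&lt;br /&gt;
     parent.await(mySense); // advance to the next level&lt;br /&gt;
     partner.flag = mySense; // release the partner on the way back&lt;br /&gt;
   } else {&lt;br /&gt;
     partner.flag = mySense; // loser: announce arrival to the partner&lt;br /&gt;
     while (flag != mySense) {}; // wait to be released&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
Here, the parent variable distinguishes winners from losers. A node whose parent is not null is a winner: it waits for its partner (signalled through flag), advances to its parent, and releases its partner once the barrier has completed. A node whose parent is null is a natural loser: it notifies its partner and then waits to be released.&lt;br /&gt;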
&lt;br /&gt;
== Disseminating Barrier ==&lt;br /&gt;
&lt;br /&gt;
In this barrier each thread maintains a record of the activity of other threads. In every round i, with n threads, thread A notifies thread (A + 2&amp;lt;sup&amp;gt;i&amp;lt;/sup&amp;gt;) mod n. Thus after ⌈log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; n⌉ rounds every thread knows that every other thread has reached the barrier. Overall synchronization is achieved by implication from a carefully chosen sequence of pairwise synchronizations.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki04.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
An example of this type of barrier can be found below:&lt;br /&gt;
&lt;br /&gt;
 for [s = 0 to stages-1] { ## there will be ceiling(log_2 p) stages&lt;br /&gt;
     work out who my out-going partner is at stage s;&lt;br /&gt;
     &amp;lt;await (arrive[partner][s] == 0);&amp;gt;&lt;br /&gt;
     arrive[partner][s] = 1;&lt;br /&gt;
     &amp;lt;await (arrive[myid][s] == 1);&amp;gt;&lt;br /&gt;
     arrive[myid][s] = 0;&lt;br /&gt;
 }&lt;br /&gt;
Here, each thread executes the same code, choosing partners for the pairwise synchronizations as a function of its own identifier and the internal iteration. Our own arrive flag is now set by an “in-coming” partner, who is distinct from our “out-going” partner (except when p=2).&lt;br /&gt;
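&lt;br /&gt;
The claim that a logarithmic number of rounds is enough can be checked with a small Python simulation. This is only a sketch under stated assumptions: it runs in a single thread, the function name dissemination_rounds is hypothetical, and each notification is modelled as passing on the set of arrivals the sender already knows about, which is the knowledge the pairwise synchronizations imply.&lt;br /&gt;
&lt;br /&gt;
```python
def dissemination_rounds(n):
    """Simulate which arrivals each of n threads has learned about."""
    # Integer form of ceiling(log2 n) for n of at least 1.
    rounds = (n - 1).bit_length()
    # knows[t] is the set of threads whose arrival thread t has learned of.
    knows = [{t} for t in range(n)]
    for i in range(rounds):
        # All notifications of a round use the state from its start.
        snapshot = [set(k) for k in knows]
        for a in range(n):
            partner = (a + 2 ** i) % n
            # Thread a notifies its out-going partner for round i.
            knows[partner] |= snapshot[a]
    return rounds, knows
```

For example, dissemination_rounds(8) uses 3 rounds, after which every thread's set contains all 8 threads.&lt;br /&gt;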
&lt;br /&gt;
= References = &lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; David Culler, J.P. Singh and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann. August 1998.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Yan Solihin. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin Books. August 2009.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-memory Multiprocessors. http://www.cc.gatech.edu/classes/AY2010/cs4210_fall/papers/anderson-spinlock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://www.statemaster.com/encyclopedia/Lock-(computer-science)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://faydoc.tripod.com/cpu/cmpxchg.htm&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/slides/threads3.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; Scalability evaluation of barrier algorithms for OpenMP. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; Carwyn Ball and Mark Bull. Barrier Synchronization in Java. http://www.ukhec.ac.uk/publications/reports/synch_java.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.cs.brown.edu/courses/cs176/barrier.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; http://heather.cs.ucdavis.edu/~matloff/50/PLN/lock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/handouts/threads3.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;12foot&amp;quot;&amp;gt;[[#12body|12.]]&amp;lt;/span&amp;gt; http://preshing.com/20120612/an-introduction-to-lock-free-programming&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;13foot&amp;quot;&amp;gt;[[#13body|13.]]&amp;lt;/span&amp;gt; http://my.safaribooksonline.com/book/software-engineering-and-development/9780123705914/barriers/ch17lev1sec3#X2ludGVybmFsX0h0bWxWaWV3P3htbGlkPTk3ODAxMjM3MDU5MTQlMkZjaDE3bGV2MXNlYzMmcXVlcnk9&amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74687</id>
		<title>CSC/ECE 506 Spring 2013/9b sc</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74687"/>
		<updated>2013-04-04T00:08:36Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Tournament Barrier */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Instruction and Previous Work =&lt;br /&gt;
Changes are made as per the instructions given in https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM/edit&lt;br /&gt;
&lt;br /&gt;
Previous Version can be found here: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2011/ch9_ms&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
In addition to a proper cache coherency model, it is important for a multiprocessor system to provide support for synchronization. This must be provided at the hardware level, and can also be implemented to some degree in software. The most common types of hardware synchronization are locks and barriers, which are discussed in this chapter.&lt;br /&gt;
&lt;br /&gt;
== Hardware vs. Operating System Synchronization ==&lt;br /&gt;
Synchronizations are characterized by three parts: acquire, wait, and release. Both hardware and operating system synchronization rely on atomic instructions for acquire and release, but the implementations for the wait portion of a synchronization can vary. Hardware implementations usually use busy-waiting, where a process repeatedly tests a variable waiting for it to change. The operating system can wait using blocking, where the process suspends itself until woken up by another process.&lt;br /&gt;
&lt;br /&gt;
Blocking has the advantage of freeing up the processor for use by other processes, but requires operating system functions to work and thus has a lot of overhead. Busy-waiting has much less overhead, but uses processor and cache resources while waiting. Because of these tradeoffs, it is usually better to use busy-waiting for short wait periods, and blocking for long wait periods.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Lock Implementations =&lt;br /&gt;
&lt;br /&gt;
Locks are an important concept when programming for multiprocessor systems. The purpose of a lock is to protect the code inside the lock, known as a critical section.  It must be certain that while some thread X has entered the critical section, another thread Y is not also inside of the critical section and possibly modifying critical values. While thread X has entered the critical section, thread Y must wait until X has exited before entering. This can be accomplished in a variety of ways of varying complexity and performance.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate lock performance, we can use these four criteria:&lt;br /&gt;
&lt;br /&gt;
*Acquisition Latency - How much time does it take to acquire the lock?&lt;br /&gt;
*Traffic - How much bus traffic is generated by threads attempting to acquire the lock?&lt;br /&gt;
*Fairness - Are threads granted the lock in the order they requested it (FIFO), or by luck?&lt;br /&gt;
*Storage - How much storage is needed compared to the number of threads?&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Acquisition Latency ===&lt;br /&gt;
We want a low acquisition latency, especially for applications that repeatedly acquire and release locks. Over the run time of a program, the acquisition latency compounded many times could have a large effect on performance. However, low acquisition latency has to be balanced against other factors when implementing a lock.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
Traffic is an important consideration when evaluating lock performance. If acquiring a lock causes a lot of bus traffic, the lock will not scale well as the number of threads increases. Eventually the bus will become choked.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Fairness ===&lt;br /&gt;
In general, threads should acquire a lock in the same order in which they requested it. If this does not happen, the chance that a thread will be starved grows as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
The amount of storage needed per lock should be small enough such that it scales to a large number of threads without any problems. If a large amount of storage is required, then multiple threads could cause locks to consume too much memory.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
&lt;br /&gt;
Two common lock implementations are test&amp;amp;set and test-and-test&amp;amp;set. In terms of performance, both have a low storage cost (scalable) and are not fair. Test&amp;amp;set has a low acquisition latency but is not scalable due to high bus traffic when there is a lot of contention for the lock. Test-and-test&amp;amp;set has a higher acquisition latency but scales better because it generates less bus traffic. The next figure shows performance comparisons for these two lock implementations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:lockperf.png|frame|none|Performance comparison of test&amp;amp;set and test-and-test&amp;amp;set (spin on read)&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
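&lt;br /&gt;
The difference in bus traffic can be made concrete with a small software model. The sketch below is an assumption for illustration, not a real lock: ModelLock, acquire_ts, acquire_tts and contended are hypothetical names, there is a single waiter, and the holder is assumed to release the lock after a fixed number of polls (release_after, which must be at least 1). Every test&amp;amp;set is counted as an atomic read-modify-write, while a plain read is counted separately because it can be served from the local cache:&lt;br /&gt;
&lt;br /&gt;
```python
class ModelLock:
    """Software model of a lock word that counts the operations applied to it."""

    def __init__(self):
        self.value = 0        # 0 = free, 1 = held
        self.atomic_ops = 0   # test-and-set operations (bus transactions)
        self.reads = 0        # plain reads (cache hits while the lock is held)

    def test_and_set(self):
        self.atomic_ops += 1
        old = self.value
        self.value = 1
        return old

    def read(self):
        self.reads += 1
        return self.value

def acquire_ts(lock, release_after):
    """test-and-set lock: every poll is an atomic operation."""
    polls = 0
    while lock.test_and_set() != 0:
        polls += 1
        if polls == release_after:
            lock.value = 0    # the modeled holder releases the lock

def acquire_tts(lock, release_after):
    """test-and-test-and-set lock: spin on reads, then try one atomic operation."""
    polls = 0
    while True:
        while lock.read() != 0:
            polls += 1
            if polls == release_after:
                lock.value = 0
        if lock.test_and_set() == 0:
            return

def contended(acquire, release_after=5):
    """Run one acquisition against a held lock; return (atomic_ops, reads)."""
    lock = ModelLock()
    lock.value = 1            # another thread currently holds the lock
    acquire(lock, release_after)
    return lock.atomic_ops, lock.reads
```
&lt;br /&gt;
In this model the test-and-set waiter issues one atomic operation per poll, while the test-and-test-and-set waiter issues a single atomic operation once it has seen the lock free.&lt;br /&gt;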
&lt;br /&gt;
=== Improving performance ===&lt;br /&gt;
Optimizing software can improve the performance of locks. For instance, inserting a delay between a processor noticing a release and then trying to acquire a lock can reduce contention and bus traffic, and increase the performance of the locks. TCP-like back off algorithms can be used to adjust this delay depending on the number of threads contending for the lock.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
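&lt;br /&gt;
Such a delay can be sketched as a capped exponential backoff. The Python functions below are an illustrative assumption: the names, the doubling factor, the cap and the random jitter are chosen for the example and are not taken from the cited paper:&lt;br /&gt;
&lt;br /&gt;
```python
import random

def backoff_ceilings(attempts, base=1.0, cap=64.0):
    """Upper bound of the delay before each retry: doubles until it hits the cap."""
    return [min(cap, base * 2 ** k) for k in range(attempts)]

def backoff_delay(attempt, base=1.0, cap=64.0, rng=random):
    """Pick a random delay below the ceiling so that waiters do not retry in step."""
    ceiling = min(cap, base * 2 ** attempt)
    return rng.uniform(0, ceiling)
```
&lt;br /&gt;
A thread that fails to acquire the lock on its k-th attempt would wait for backoff_delay(k) time units before trying again, so heavier contention leads to longer average delays.&lt;br /&gt;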
&lt;br /&gt;
Another way to improve performance is to insert a delay between memory references so as to limit the bandwidth (traffic) each processor can use for spinning.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One way to guarantee fairness is to queue lock attempts in shared memory so that they happen in order.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
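&lt;br /&gt;
One way to picture such queuing is a ticket lock, sketched below in Python. This is a model under stated assumptions: TicketLock and run_workers are hypothetical names, and because Python exposes no atomic fetch-and-increment instruction, that step is modeled with a small internal threading.Lock:&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time

class TicketLock:
    """FIFO lock: each thread takes a ticket and waits until it is served."""

    def __init__(self):
        self._guard = threading.Lock()  # models an atomic fetch-and-increment
        self._next_ticket = 0
        self.now_serving = 0

    def acquire(self):
        with self._guard:
            ticket = self._next_ticket
            self._next_ticket += 1
        while self.now_serving != ticket:   # spin until our number comes up
            time.sleep(0)
        return ticket

    def release(self):
        self.now_serving += 1               # only the current holder writes this

def run_workers(n_threads=4, iters=50):
    """Increment a shared counter under the lock and record the service order."""
    lock = TicketLock()
    state = {"counter": 0, "served": []}

    def worker():
        for _ in range(iters):
            ticket = lock.acquire()
            state["served"].append(ticket)
            state["counter"] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state
```
&lt;br /&gt;
Because tickets are handed out in request order and served in increasing order, the recorded service order is strictly first come, first served.&lt;br /&gt;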
&lt;br /&gt;
== Atomic Instructions ==&lt;br /&gt;
&lt;br /&gt;
Since a multiprocessor system cannot disable interrupts as an effective method to execute code atomically, there must be hardware support for atomic operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ability to execute an atomic instruction is a requirement for most lock implementations. It is important that when a processor attempts to set a lock, either the lock is fully set and the thread is able to enter into the critical section, or the lock is not set and it appears that none of the instructions required to set the lock have executed.  &lt;br /&gt;
&lt;br /&gt;
In the x86 instruction set, the CMPXCHG (compare and exchange) instruction can be used in a lock implementation to guarantee atomicity. The instruction takes a destination and a source operand. The accumulator is compared with the destination and, if they are equal, the destination is loaded with the source. If they are not equal, the accumulator is loaded with the destination value. To ensure that this executes atomically on a multiprocessor, the instruction must be issued with the LOCK prefix. This is useful in implementing some locks, such as ticket locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; In the AT&amp;amp;T syntax used in the listing below, the instruction is written LOCK CMPXCHG SOURCE, DESTINATION. The source has to be a register, while the destination can be either a register or a memory location. An example of LOCK CMPXCHG appears in the disassembly of the Linux kernel's rtc_cmos_read(). To take the lock, a CPU must read the lock variable and, if it is zero, update it to a nonzero value. If the read and the update were separate steps, another CPU could change the variable after it was read but before it was updated, which is a race condition. LOCK CMPXCHG prevents this race by performing the comparison and the update as one atomic operation, which keeps the busy-wait loop brief. The assembly code is given below&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 spin: # see if the lock-variable is clear&lt;br /&gt;
 mov cmos_lock, %eax&lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
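 # ok, now we try to grab the lock: %eax is 0 here, so the store happens&lt;br /&gt;
 # only if cmos_lock is still 0 (%edx is assumed to hold a nonzero value)&lt;br /&gt;
 lock cmpxchg %edx, cmos_lock&lt;br /&gt;
 # did another CPU grab it first? if so, %eax now holds the nonzero lock value&lt;br /&gt;
 test %eax, %eax&lt;br /&gt;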
 jnz spin&lt;br /&gt;
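&lt;br /&gt;
The data flow of CMPXCHG can be followed in a short Python model. This is only a sketch: the function names are hypothetical, the registers are plain integers, and Python cannot express the atomicity that the LOCK prefix provides in hardware:&lt;br /&gt;
&lt;br /&gt;
```python
def cmpxchg(accumulator, destination, source):
    """Model of CMPXCHG: return (new_accumulator, new_destination, zero_flag)."""
    if accumulator == destination:
        return accumulator, source, True       # match: destination takes the source
    return destination, destination, False     # mismatch: accumulator takes destination

def try_lock(lock_word, new_value=1):
    """One lock attempt with EAX equal to 0, as after the first test in the listing."""
    eax, lock_word, _ = cmpxchg(0, lock_word, new_value)
    return eax == 0, lock_word                 # EAX still 0 means the lock was taken
```
&lt;br /&gt;
Calling try_lock(0) models a successful acquisition, while try_lock(1) models the case where another CPU grabbed the lock first and the caller has to spin again.&lt;br /&gt;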
&lt;br /&gt;
== Hand-off Lock ==&lt;br /&gt;
&lt;br /&gt;
Another type of lock that is not discussed in the text is known as the &amp;quot;hand-off&amp;quot; lock. With this lock, the first thread acquires the lock because no other thread currently holds it. When another thread attempts to acquire the lock, it sees that the lock is in use and adds itself to a queue. This thread can then sleep until it is woken by the thread that holds the lock. Once the holder is finished, it passes the lock to the next thread in the queue&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. A waiting thread gives up the processor so that other threads (e.g. the thread with the lock) can run more quickly, and it is woken up when the lock becomes free. Keeping the waiting queue separate from the ready queue speeds up execution, since the scheduler never runs a thread that cannot make progress. In the example below, the unlock method gives the lock directly to a waiting thread. Interrupts are disabled while the lock state and the queues are updated, and re-enabled before returning.&lt;br /&gt;
&lt;br /&gt;
 lock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     if (value == FREE) {&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     } else {&lt;br /&gt;
         add thread to queue of threads waiting for this lock&lt;br /&gt;
         switch to next runnable thread&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
 unlock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     value = FREE&lt;br /&gt;
     if (any thread is waiting for this lock) {&lt;br /&gt;
         move waiting thread from waiting queue to ready queue&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
It is tempting to re-enable interrupts before switching, but this fails if an interrupt occurs after the thread enables interrupts and before it switches: lock() adds the thread to the wait queue, lock() enables interrupts, and an interrupt then causes pre-emption, i.e. a switch to another thread. Pre-emption moves the thread to the ready queue, so the thread is now on two queues (wait and ready). The switch itself is also likely to be a critical section. Adding the thread to the wait queue and switching to the next thread must therefore be atomic. The solution is for the waiting thread to leave interrupts disabled when it calls switch. The next thread to run has the responsibility of re-enabling interrupts before returning to user code. When the waiting thread wakes up, it returns from switch with interrupts disabled (by the last thread).&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
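&lt;br /&gt;
The hand-off idea can also be sketched at user level in Python. The sketch rests on assumptions: HandOffLock and run_counter are hypothetical names, a small internal threading.Lock stands in for disabling interrupts, and each waiter sleeps on its own threading.Event. Because an Event remembers a wake-up that arrives before wait() is called, the window between joining the queue and going to sleep is harmless here:&lt;br /&gt;
&lt;br /&gt;
```python
import collections
import threading

class HandOffLock:
    """Lock whose unlock() passes ownership directly to the longest waiter."""

    def __init__(self):
        self._guard = threading.Lock()   # stands in for "disable interrupts"
        self._busy = False
        self._waiters = collections.deque()

    def lock(self):
        with self._guard:
            if not self._busy:
                self._busy = True
                return
            wakeup = threading.Event()
            self._waiters.append(wakeup)
        wakeup.wait()   # sleep; on wake-up this thread already owns the lock

    def unlock(self):
        with self._guard:
            if self._waiters:
                self._waiters.popleft().set()   # hand off; the lock stays busy
            else:
                self._busy = False

def run_counter(n_threads=4, iters=100):
    """Increment a shared counter under the lock from several threads."""
    lock = HandOffLock()
    state = {"counter": 0}

    def worker():
        for _ in range(iters):
            lock.lock()
            state["counter"] += 1
            lock.unlock()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"]
```
&lt;br /&gt;
As in the pseudocode, the lock is never marked free while a thread is waiting, so a newly arriving thread cannot overtake the queue.&lt;br /&gt;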
&lt;br /&gt;
== Avoiding Locks ==&lt;br /&gt;
&lt;br /&gt;
There are many reasons why a programmer should attempt to write programs in such a way as to avoid locks if possible.  There are many problems that can arise with the use of locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One of the most well-known issues is deadlock. This can occur when threads are waiting to acquire a lock that will never be released. For example, suppose two processes A and B each need two shared resources (1 and 2) to complete a task. A deadlock arises when A grabs 1 and B grabs 2: each then waits for the resource the other holds, and neither can change state. One solution is to impose an order on acquisition, so that a process can lock resource 2 only if it has already locked resource 1.&lt;br /&gt;
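&lt;br /&gt;
Ordered acquisition can be sketched in Python as follows. The helper names are hypothetical, and ordering the locks by id() is just one convenient assumption for obtaining a global order:&lt;br /&gt;
&lt;br /&gt;
```python
import threading

def acquire_in_order(*locks):
    """Acquire all locks in one global order, whatever order the caller names them in."""
    ordered = sorted(locks, key=id)
    for lock in ordered:
        lock.acquire()
    return ordered

def release_all(ordered):
    """Release the locks in the reverse of the acquisition order."""
    for lock in reversed(ordered):
        lock.release()

def ordering_demo(rounds=200):
    """A asks for (1, 2) and B asks for (2, 1); both finish because of the ordering."""
    lock_1, lock_2 = threading.Lock(), threading.Lock()
    done = []

    def worker(first, second, name):
        for _ in range(rounds):
            held = acquire_in_order(first, second)
            release_all(held)
        done.append(name)

    a = threading.Thread(target=worker, args=(lock_1, lock_2, "A"))
    b = threading.Thread(target=worker, args=(lock_2, lock_1, "B"))
    a.start()
    b.start()
    a.join()
    b.join()
    return sorted(done)
```
&lt;br /&gt;
If each worker instead acquired its locks in the order it named them, A could hold lock 1 while B holds lock 2, and neither would ever proceed.&lt;br /&gt;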
&lt;br /&gt;
Another problem with using locks is that the performance is not optimal, as often a lock is used when there is only a chance of conflict. This approach to programming yields slower performance than what might be possible with other methods. This also leads to questions of granularity, that is, how much of the code should be protected under the critical section. The programmer must decide between many small (fine grain) locks or fewer, more encompassing locks. This decision can greatly affect the performance of the program.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid locks, lock-free programming techniques can be used. These are built on atomic primitives such as compare-and-swap (CAS) and rely on a sequentially consistent view of memory. The pattern typically involves copying a shared variable to a local variable, performing some speculative work, and attempting to publish the changes using CAS&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 void LockFreeQueue::push(Node* newHead)&lt;br /&gt;
 {&lt;br /&gt;
     for (;;)&lt;br /&gt;
     {&lt;br /&gt;
         // Copy a shared variable (m_Head) to a local.&lt;br /&gt;
         Node* oldHead = m_Head;&lt;br /&gt;
 &lt;br /&gt;
         // Do some speculative work, not yet visible to other threads.&lt;br /&gt;
         newHead-&amp;gt;next = oldHead;&lt;br /&gt;
 &lt;br /&gt;
         // Next, attempt to publish our changes to the shared variable.&lt;br /&gt;
         // If the shared variable hasn't changed, the CAS succeeds and we return.&lt;br /&gt;
         // Otherwise, repeat.&lt;br /&gt;
         if (_InterlockedCompareExchange(&amp;amp;m_Head, newHead, oldHead) == oldHead)&lt;br /&gt;
            return;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
The above code follows a read-modify-write pattern: each thread works on its own local copy and publishes the result only if the shared variable still holds the value that was read, so the next thread always sees a correct value. If another thread has changed the shared variable in the meantime, the CAS fails and the loop retries.&lt;br /&gt;
&lt;br /&gt;
= Barrier Implementations =&lt;br /&gt;
&lt;br /&gt;
In many ways, barriers are simpler than locks. A barrier is simply a point in a program that all participating threads must reach before any of them is allowed to continue. When using barriers, a programmer does not have to be concerned with advanced topics, such as fairness, that arise when programming with locks.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate barrier performance, we can use these two criteria:&lt;br /&gt;
&lt;br /&gt;
*Latency - The time required to enter and exit a barrier.&lt;br /&gt;
*Traffic - Communications overhead required by the barrier.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Latency ===&lt;br /&gt;
&lt;br /&gt;
Ideally the latency of a barrier should be small, but for barriers that don't scale well latency can increase as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
&lt;br /&gt;
Low traffic is also desirable, as we do not want excess bus traffic to prevent scalability.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+Performance Comparison of Barrier Implementations&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! Criteria&lt;br /&gt;
! Sense Reversal&lt;br /&gt;
Centralized Barrier&lt;br /&gt;
! Combining Tree&lt;br /&gt;
Barrier&lt;br /&gt;
! Barrier Network&lt;br /&gt;
(Hardware)&lt;br /&gt;
|-&lt;br /&gt;
| Latency&lt;br /&gt;
| O(1)&lt;br /&gt;
| O(log p)&lt;br /&gt;
| O(log p)&lt;br /&gt;
|-&lt;br /&gt;
| Traffic&lt;br /&gt;
| O(p&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;)&lt;br /&gt;
| O(p)&lt;br /&gt;
| moved to a separate network&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Sense-Reversal Barrier ==&lt;br /&gt;
&lt;br /&gt;
This barrier is a centralized barrier in which a single count variable, protected by a lock, is shared among all the threads. Each thread, on reaching the barrier, increments the count and waits until it reaches the number of threads for which the barrier was set up. Since all the threads spin on a single variable, the number of cache misses scales quadratically with the number of processors.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki01.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A sense-reversing barrier is a more elegant and practical solution to the problem of reusing barriers. A phase’s sense is a Boolean value: true for even-numbered phases and false otherwise. Each SenseBarrier object has a Boolean sense field indicating the sense of the currently executing phase. Each thread keeps its current sense as a thread-local object. Initially the barrier’s sense is the complement of the local sense of all the threads. When a thread calls await(), it checks whether it is the last thread to decrement the counter. If so, it reverses the barrier’s sense and continues. Otherwise, it spins waiting for the barrier’s sense field to change to match its own local sense.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;13body&amp;quot;&amp;gt;[[#13foot|[13]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 struct barrier {&lt;br /&gt;
     shared int count;     // a Fetch&amp;Inc/Dec() object with initial value n;&lt;br /&gt;
                           // this object also supports read and write&lt;br /&gt;
     boolean sense;        // initially FALSE&lt;br /&gt;
     boolean mysense[n];   // initially, mysense[i] = TRUE for each 1 ≤ i ≤ n&lt;br /&gt;
 };&lt;br /&gt;
 void await(struct barrier *B) {   // code for process pi&lt;br /&gt;
     int position = Get&amp;Dec(B-&gt;count);   // returns the value before the decrement&lt;br /&gt;
     if (position == 1) {                // last process to arrive&lt;br /&gt;
         B-&gt;count = n;                   // reset the counter for the next phase&lt;br /&gt;
         B-&gt;sense = B-&gt;mysense[i];       // release the waiting processes&lt;br /&gt;
     }&lt;br /&gt;
     else {&lt;br /&gt;
         while (B-&gt;sense != B-&gt;mysense[i])&lt;br /&gt;
             noop;                       // spin until the barrier sense matches&lt;br /&gt;
     }&lt;br /&gt;
     B-&gt;mysense[i] = !B-&gt;mysense[i];     // reverse the local sense for the next phase&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
== Combining Tree Barrier ==&lt;br /&gt;
&lt;br /&gt;
This barrier is a distributed barrier in which groups of processors form clusters, and each cluster updates its own local variable. When a local variable reaches a value equal to the number of threads updating it, the last thread to arrive proceeds to increment another variable higher in the hierarchy of the combining tree. When the variable at the highest level of the combining tree reaches its maximum value, all the threads have reached the barrier and synchronization is complete.&lt;br /&gt;
&lt;br /&gt;
Since the threads update local variables in smaller groups, the miss rate is not as high as in the sense-reversal barrier.&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the operation of a combining tree barrier with the threads divided into two groups. The variables C0 and C1 are local to each group, and C2 is the variable at the higher level of the tree.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki02.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
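&lt;br /&gt;
The two-level arrangement with C0, C1 and C2 can be sketched in Python. The code is an illustrative assumption rather than the implementation from the cited report: Node and run_tree_barrier are hypothetical names, sense reversal is used so that the barrier can be reused, and the atomic fetch-and-decrement is modeled with a small internal threading.Lock:&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time

class Node:
    """One counter of the combining tree, shared by a small group of arrivals."""

    def __init__(self, arity, parent=None):
        self.arity = arity
        self.parent = parent
        self.count = arity
        self.sense = False
        self._guard = threading.Lock()   # models an atomic fetch-and-decrement

    def wait(self, my_sense):
        with self._guard:
            position = self.count
            self.count -= 1
        if position == 1:                # last arrival in this group
            if self.parent is not None:
                self.parent.wait(my_sense)   # combine one level up
            self.count = self.arity      # reset for the next phase
            self.sense = my_sense        # release the rest of the group
        else:
            while self.sense != my_sense:
                time.sleep(0)            # spin on the local node only

def run_tree_barrier(phases=5):
    """Four threads, two per leaf; check that nobody leaves a phase early."""
    root = Node(2)                               # C2
    leaves = [Node(2, root), Node(2, root)]      # C0 and C1
    arrived = [0] * phases
    tally = threading.Lock()
    ok = []

    def worker(leaf):
        sense = True
        for phase in range(phases):
            with tally:
                arrived[phase] += 1
            leaf.wait(sense)
            ok.append(arrived[phase] == 4)       # everyone must have arrived
            sense = not sense

    threads = [threading.Thread(target=worker, args=(leaves[i // 2],))
               for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ok
```
&lt;br /&gt;
Each waiting thread spins only on the sense field of its own leaf, which is why the contention on any single variable stays small.&lt;br /&gt;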
&lt;br /&gt;
== Tournament Barrier ==&lt;br /&gt;
&lt;br /&gt;
In a tournament barrier, the threads are the leaves of a binary tree and each node represents a flag. Two threads compete at each node; the loser, that is, the thread that arrives later, sets the flag and moves up to the next level, where it competes against the loser from the other subtree. The thread that arrives last overall therefore sets the flag at the root of the tree. Setting this flag indicates to all the threads that the barrier is complete, and synchronization is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki03.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
An example of tournament barrier is given as follows:&lt;br /&gt;
&lt;br /&gt;
 void await(boolean mySense) {&lt;br /&gt;
  if (top) {&lt;br /&gt;
   return;&lt;br /&gt;
  } else if (parent != null) {&lt;br /&gt;
   while (flag != mySense) {};&lt;br /&gt;
   parent.await(mySense);&lt;br /&gt;
   partner.flag = mySense;&lt;br /&gt;
  } else {&lt;br /&gt;
   partner.flag = mySense;&lt;br /&gt;
   while (flag != mySense) {};&lt;br /&gt;
  }&lt;br /&gt;
 }&lt;br /&gt;
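Here, parent is non-null for a thread that wins its current round and null for a loser. A winner spins until its partner sets its flag, continues the tournament at its parent node, and then releases its partner by setting the partner's flag. A loser sets its partner's flag to announce its arrival and then spins until the partner releases it. The thread at the top node returns immediately, which starts the chain of releases.&lt;br /&gt;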
&lt;br /&gt;
== Disseminating Barrier ==&lt;br /&gt;
&lt;br /&gt;
In this barrier, each thread maintains a record of the activity of other threads. In every round i, with n threads, thread A notifies thread (A + 2&amp;lt;sup&amp;gt;i&amp;lt;/sup&amp;gt;) mod n. Thus, after ⌈log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; n⌉ rounds, every thread knows that every other thread has reached the barrier. Overall synchronization is achieved by implication from a carefully chosen sequence of pairwise synchronizations.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki04.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
An example of this type of barrier can be found below:&lt;br /&gt;
&lt;br /&gt;
 for [s = 0 to stages-1] { ## there will be ceiling(log_2 p) stages&lt;br /&gt;
 work out who my out-going partner is at stage s;&lt;br /&gt;
 &amp;lt;await (arrive[partner][s] == 0);&amp;gt;&lt;br /&gt;
 arrive[partner][s] = 1;&lt;br /&gt;
 &amp;lt;await (arrive[myid][s] == 1);&amp;gt;&lt;br /&gt;
 arrive[myid][s] = 0;&lt;br /&gt;
 }&lt;br /&gt;
Here, each thread executes the same code, choosing partners for the pairwise synchronizations as a function of its own identifier and the internal iteration. A thread's own arrive flag is now set by an “in-coming” partner, who is distinct from its “out-going” partner (except when p=2).&lt;br /&gt;
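&lt;br /&gt;
The choice of out-going partner and the number of stages can be checked with a short Python sketch. The function names are hypothetical; threads are assumed to be numbered from 0 to n-1, and stage k is assumed to use a distance of 2 to the power k:&lt;br /&gt;
&lt;br /&gt;
```python
def stages(n):
    """Number of stages needed for n threads: the ceiling of log2(n)."""
    return (n - 1).bit_length()

def partners(thread_id, n):
    """Out-going partner of thread_id at each stage."""
    return [(thread_id + 2 ** k) % n for k in range(stages(n))]

def all_informed(n):
    """Simulate the stages and check that every thread learns of every arrival."""
    known = [{t} for t in range(n)]          # known[t]: arrivals thread t knows of
    for k in range(stages(n)):
        nxt = [set(s) for s in known]
        for t in range(n):
            nxt[(t + 2 ** k) % n] |= known[t]    # t passes on what it knows
        known = nxt
    return all(len(s) == n for s in known)
```
&lt;br /&gt;
For example, with eight threads, thread 0 signals threads 1, 2 and 4 in turn, and after those three stages every thread has heard, directly or indirectly, from all the others.&lt;br /&gt;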
&lt;br /&gt;
= References = &lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; David Culler, J.P. Singh and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann. August 1998.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Yan Solihin. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin Books. August 2009.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-memory Multiprocessors. http://www.cc.gatech.edu/classes/AY2010/cs4210_fall/papers/anderson-spinlock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://www.statemaster.com/encyclopedia/Lock-(computer-science)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://faydoc.tripod.com/cpu/cmpxchg.htm&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/slides/threads3.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; Scalability evaluation of barrier algorithms for OpenMP. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; Carwyn Ball and Mark Bull. Barrier Synchronization in Java. http://www.ukhec.ac.uk/publications/reports/synch_java.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.cs.brown.edu/courses/cs176/barrier.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; http://heather.cs.ucdavis.edu/~matloff/50/PLN/lock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/handouts/threads3.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;12foot&amp;quot;&amp;gt;[[#12body|12.]]&amp;lt;/span&amp;gt; http://preshing.com/20120612/an-introduction-to-lock-free-programming&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;13foot&amp;quot;&amp;gt;[[#13body|13.]]&amp;lt;/span&amp;gt;http://my.safaribooksonline.com/book/software-engineering-and-development/9780123705914/barriers/ch17lev1sec3#X2ludGVybmFsX0h0bWxWaWV3P3htbGlkPTk3ODAxMjM3MDU5MTQlMkZjaDE3bGV2MXNlYzMmcXVlcnk9&amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74682</id>
		<title>CSC/ECE 506 Spring 2013/9b sc</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74682"/>
		<updated>2013-04-03T23:59:01Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Disseminating Barrier */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Instruction and Previous Work =&lt;br /&gt;
Changes are made as per the instructions given in https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM/edit&lt;br /&gt;
&lt;br /&gt;
Previous Version can be found here: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2011/ch9_ms&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
In addition to a proper cache coherency model, it is important for a multiprocessor system to provide support for synchronization. This must be provided at the hardware level, and can also be implemented to some degree in software. The most common types of hardware synchronization are locks and barriers, which are discussed in this chapter.&lt;br /&gt;
&lt;br /&gt;
== Hardware vs. Operating System Synchronization ==&lt;br /&gt;
Synchronizations are characterized by three parts: acquire, wait, and release. Both hardware and operating system synchronization rely on atomic instructions for acquire and release, but the implementations for the wait portion of a synchronization can vary. Hardware implementations usually use busy-waiting, where a process repeatedly tests a variable waiting for it to change. The operating system can wait using blocking, where the process suspends itself until woken up by another process.&lt;br /&gt;
&lt;br /&gt;
Blocking has the advantage of freeing up the processor for use by other processes, but requires operating system functions to work and thus has a lot of overhead. Busy-waiting has much less overhead, but uses processor and cache resources while waiting. Because of these tradeoffs, it is usually better to use busy-waiting for short wait periods, and blocking for long wait periods.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Lock Implementations =&lt;br /&gt;
&lt;br /&gt;
Locks are an important concept when programming for multiprocessor systems. The purpose of a lock is to protect the code inside the lock, known as a critical section.  It must be certain that while some thread X has entered the critical section, another thread Y is not also inside of the critical section and possibly modifying critical values. While thread X has entered the critical section, thread Y must wait until X has exited before entering. This can be accomplished in a variety of ways of varying complexity and performance.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate lock performance, we can use these four criteria:&lt;br /&gt;
&lt;br /&gt;
*Acquisition Latency - How much time does it take to acquire the lock?&lt;br /&gt;
*Traffic - How much bus traffic is generated by threads attempting to acquire the lock?&lt;br /&gt;
*Fairness - Are threads granted the lock in the order they requested it (FIFO), or is the winner left to luck?&lt;br /&gt;
*Storage - How much storage is needed compared to the number of threads?&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Acquisition Latency ===&lt;br /&gt;
We want a low acquisition latency, especially for applications that repeatedly acquire and release locks. Over the run time of a program, the acquisition latency compounded many times could have a large effect on performance. However, low acquisition latency has to be balanced against other factors when implementing a lock.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
Traffic is an important consideration when evaluating lock performance. If acquiring a lock causes a lot of bus traffic, the lock will not scale well as the number of threads increases. Eventually the bus will become choked.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Fairness ===&lt;br /&gt;
In general, threads should acquire a lock in the same order in which they requested it. If this does not happen, the chance that a thread will be starved grows as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
The amount of storage needed per lock should be small enough such that it scales to a large number of threads without any problems. If a large amount of storage is required, then multiple threads could cause locks to consume too much memory.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
&lt;br /&gt;
Two common lock implementations are test&amp;amp;set and test-and-test&amp;amp;set. In terms of performance, both have a low storage cost (scalable) and are not fair. Test&amp;amp;set has a low acquisition latency but is not scalable due to high bus traffic when there is a lot of contention for the lock. Test-and-test&amp;amp;set has a higher acquisition latency but scales better because it generates less bus traffic. The next figure shows performance comparisons for these two lock implementations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:lockperf.png|frame|none|Performance comparison of test&amp;amp;set and test-and-test&amp;amp;set (spin on read)&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
=== Improving performance ===&lt;br /&gt;
Optimizing software can improve the performance of locks. For instance, inserting a delay between a processor noticing a release and then trying to acquire a lock can reduce contention and bus traffic, and increase the performance of the locks. TCP-like back off algorithms can be used to adjust this delay depending on the number of threads contending for the lock.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another way to improve performance is to insert a delay between memory references so as to limit the bandwidth (traffic) each processor can use for spinning.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One way to guarantee fairness is to queue lock attempts in shared memory so that they happen in order.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Atomic Instructions ==&lt;br /&gt;
&lt;br /&gt;
Since a multiprocessor system cannot disable interrupts as an effective method to execute code atomically, there must be hardware support for atomic operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ability to execute an atomic instruction is a requirement for most lock implementations. It is important that when a processor attempts to set a lock, either the lock is fully set and the thread is able to enter into the critical section, or the lock is not set and it appears that none of the instructions required to set the lock have executed.  &lt;br /&gt;
&lt;br /&gt;
In the x86 instruction set, the CMPXCHG (compare and exchange) instruction can be used in a lock implementation to guarantee atomicity. The instruction takes a destination and a source operand. The accumulator is compared with the destination and, if they are equal, the destination is loaded with the source. If they are not equal, the accumulator is loaded with the destination value. To ensure that this executes atomically on a multiprocessor, the instruction must be issued with the LOCK prefix. This is useful in implementing some locks, such as ticket locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; In the AT&amp;amp;T syntax used in the listing below, the instruction is written LOCK CMPXCHG SOURCE, DESTINATION. The source has to be a register, while the destination can be either a register or a memory location. An example of LOCK CMPXCHG appears in the disassembly of the Linux kernel's rtc_cmos_read(). To take the lock, a CPU must read the lock variable and, if it is zero, update it to a nonzero value. If the read and the update were separate steps, another CPU could change the variable after it was read but before it was updated, which is a race condition. LOCK CMPXCHG prevents this race by performing the comparison and the update as one atomic operation, which keeps the busy-wait loop brief. The assembly code is given below&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
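 spin: # see if the lock-variable is clear&lt;br /&gt;
 mov cmos_lock, %eax&lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
 # ok, now we try to grab the lock&lt;br /&gt;
 # %eax is 0 here; %edx is assumed to hold the non-zero locked value&lt;br /&gt;
 # if cmos_lock == %eax, store %edx in cmos_lock; otherwise load cmos_lock into %eax&lt;br /&gt;
 lock cmpxchg %edx, cmos_lock&lt;br /&gt;
 # did another CPU grab it first? &lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;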
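&lt;br /&gt;
The compare-and-exchange semantics described above can be modelled in a few lines of Python. This is a minimal single-threaded sketch, not the kernel's implementation: the dictionary stands in for memory, the function names cmpxchg and try_acquire are hypothetical, and atomicity is simply assumed because nothing else runs between the compare and the store.&lt;br /&gt;
&lt;br /&gt;
```python
def cmpxchg(memory, dest, accumulator, source):
    """Model one LOCK CMPXCHG; returns (new accumulator, zero flag)."""
    current = memory[dest]
    if current == accumulator:
        memory[dest] = source   # match: the source is stored in the destination
        return accumulator, True
    return current, False       # mismatch: the destination is loaded into the accumulator


def try_acquire(memory, name="cmos_lock"):
    """One pass of the spin loop body; True means the lock was taken."""
    if memory[name] != 0:       # mov / test / jnz spin
        return False
    eax, _ = cmpxchg(memory, name, accumulator=0, source=1)
    return eax == 0             # a non-zero accumulator means another CPU won


locks = {"cmos_lock": 0}
print(try_acquire(locks))   # True: the free lock is taken
print(try_acquire(locks))   # False: the lock is already held
```
&lt;br /&gt;
On real hardware it is the LOCK prefix that makes the compare and the store indivisible; the model only illustrates which operand is updated in each case.&lt;br /&gt;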
&lt;br /&gt;
== Hand-off Lock ==&lt;br /&gt;
&lt;br /&gt;
Another type of lock that is not discussed in the text is known as the &amp;quot;hand-off&amp;quot; lock. The first thread to request the lock acquires it immediately, since no other thread holds it. When another thread attempts to acquire the lock, it sees that the lock is in use and adds itself to a queue of waiting threads. It then sleeps until it is woken by the thread that holds the lock. When the holder is finished, it passes the lock directly to the next thread in the queue&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. A waiting thread gives up the processor so that other threads (e.g. the thread with the lock) can run more quickly, and it is woken up when the lock becomes free for it. Keeping the waiting queue separate from the ready queue means the scheduler does not spend time on threads that cannot yet proceed. In the example below, the unlock method gives the lock to a waiting thread. Interrupts are disabled while the lock state and the queues are being modified and are re-enabled afterwards.&lt;br /&gt;
&lt;br /&gt;
 lock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     if (value == FREE) {&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     } else {&lt;br /&gt;
         add thread to queue of threads waiting for this lock&lt;br /&gt;
         switch to next runnable thread&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
 unlock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     value = FREE&lt;br /&gt;
     if (any thread is waiting for this lock) {&lt;br /&gt;
         move waiting thread from waiting queue to ready queue&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
However, this scheme fails if lock() re-enables interrupts before it calls switch. Suppose lock() adds the thread to the wait queue and then enables interrupts. An interrupt at that point causes pre-emption, i.e. a switch to another thread, and pre-emption moves the current thread to the ready queue. The thread is now on two queues (wait and ready). In addition, switch itself is likely to be a critical section. Adding the thread to the wait queue and switching to the next thread must therefore be atomic. The solution is for the waiting thread to leave interrupts disabled when it calls switch. The next thread to run has the responsibility of re-enabling interrupts before returning to user code. When the waiting thread wakes up, it returns from switch with interrupts disabled (by the last thread that ran).&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
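&lt;br /&gt;
The hand-off behaviour can be traced with a small single-threaded Python model. It is only a sketch under stated assumptions: the class name HandOffLock and the owner field are hypothetical additions for illustration, threads are represented by plain strings, and each method body stands for a region that would run with interrupts disabled.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import deque

FREE, BUSY = "FREE", "BUSY"


class HandOffLock:
    """Single-threaded model; each method body stands for an interrupts-off region."""

    def __init__(self):
        self.value = FREE
        self.owner = None        # hypothetical field, used only to show who holds the lock
        self.waiting = deque()   # threads blocked on this lock
        self.ready = deque()     # threads the scheduler may run

    def lock(self, thread):
        if self.value == FREE:
            self.value = BUSY
            self.owner = thread
            return True          # caller enters the critical section
        self.waiting.append(thread)
        return False             # caller would now switch to another thread

    def unlock(self):
        self.value = FREE
        self.owner = None
        if self.waiting:
            successor = self.waiting.popleft()
            self.ready.append(successor)   # wake the first waiter
            self.value = BUSY              # handed over before anyone can see it free
            self.owner = successor


lock = HandOffLock()
lock.lock("T1")
lock.lock("T2")
lock.unlock()
print(lock.owner, lock.value)   # T2 BUSY
```
&lt;br /&gt;
Because unlock marks the lock BUSY again before interrupts would be re-enabled, a newly arriving thread cannot take the lock ahead of the thread that was already waiting.&lt;br /&gt;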
&lt;br /&gt;
== Avoiding Locks ==&lt;br /&gt;
&lt;br /&gt;
Locks introduce a number of problems, so a programmer should try to write programs in a way that avoids locks where possible.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One of the most well-known issues is deadlock. This can occur when threads are waiting to acquire a lock that will never be released. For example, suppose two processes A and B each need two shared resources (1 and 2), each protected by a lock, to complete a task. A deadlock arises when A locks resource 1 and B locks resource 2: each then waits for the resource the other holds, and neither can make progress. One solution is to impose a lock order, so that any process may lock resource 2 only if it has already locked resource 1.&lt;br /&gt;
&lt;br /&gt;
Another problem with locks is that performance is not optimal, because a lock is often taken when there is only a chance of conflict. This pessimistic approach yields slower performance than might be possible with other methods. It also raises the question of granularity, that is, how much of the code should be protected by a single critical section. The programmer must decide between many small (fine-grained) locks and fewer, more encompassing (coarse-grained) locks. This decision can greatly affect the performance of the program.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid locks, lock-free programming techniques can be used. These are built on atomic operations such as compare-and-swap (CAS), together with a well-defined memory ordering such as sequential consistency. A typical lock-free pattern involves copying a shared variable to a local variable, performing some speculative work, and attempting to publish the changes using CAS&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 void LockFreeQueue::push(Node* newHead)&lt;br /&gt;
 {&lt;br /&gt;
     for (;;)&lt;br /&gt;
     {&lt;br /&gt;
         // Copy a shared variable (m_Head) to a local.&lt;br /&gt;
         Node* oldHead = m_Head;&lt;br /&gt;
 &lt;br /&gt;
         // Do some speculative work, not yet visible to other threads.&lt;br /&gt;
         newHead-&amp;gt;next = oldHead;&lt;br /&gt;
 &lt;br /&gt;
         // Next, attempt to publish our changes to the shared variable.&lt;br /&gt;
         // If the shared variable hasn't changed, the CAS succeeds and we return.&lt;br /&gt;
         // Otherwise, repeat.&lt;br /&gt;
         if (_InterlockedCompareExchange(&amp;amp;m_Head, newHead, oldHead) == oldHead)&lt;br /&gt;
            return;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
The code above follows a read-modify-write pattern: each thread works on its own local copy of the shared variable and publishes its change only if the shared variable still holds the value that was read. If another thread has changed the variable in the meantime, the CAS fails and the loop repeats with the new value.&lt;br /&gt;
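&lt;br /&gt;
The same copy, speculate and publish loop can be tried out in Python. This is a hedged sketch rather than a real lock-free structure: Python exposes no hardware CAS, so the hypothetical AtomicRef class emulates one with an internal lock, and the names AtomicRef, Node and push are chosen only for this illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import threading


class AtomicRef:
    """Stand-in for a hardware CAS word; the internal lock only emulates atomicity."""

    def __init__(self, value=None):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value is expected:
                self._value = new
                return True
            return False


class Node:
    def __init__(self, item):
        self.item = item
        self.next = None


def push(head, node):
    while True:
        old_head = head.load()      # copy the shared variable to a local
        node.next = old_head        # speculative work, private to this thread
        if head.compare_and_swap(old_head, node):
            return                  # published; on failure another thread won, so retry


head = AtomicRef()
workers = [threading.Thread(target=push, args=(head, Node(i))) for i in range(8)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```
&lt;br /&gt;
After the workers finish, following the next pointers from head.load() visits all eight nodes, in whatever order the successful CAS operations happened.&lt;br /&gt;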
&lt;br /&gt;
= Barrier Implementations =&lt;br /&gt;
&lt;br /&gt;
In many ways, barriers are simpler than locks. A barrier is simply a point in a program that all participating threads must reach before any of them is allowed to continue. When using barriers, a programmer does not have to be concerned with advanced topics such as fairness, which must be considered when programming with locks.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate barrier performance, we can use these two criteria:&lt;br /&gt;
&lt;br /&gt;
*Latency - The time required to enter and exit a barrier.&lt;br /&gt;
*Traffic - Communications overhead required by the barrier.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Latency ===&lt;br /&gt;
&lt;br /&gt;
Ideally the latency of a barrier should be small, but for barriers that don't scale well latency can increase as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
&lt;br /&gt;
Low traffic is also desirable, as excess bus traffic limits scalability.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+Performance Comparison of Barrier Implementations&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! Criteria&lt;br /&gt;
! Sense Reversal&lt;br /&gt;
Centralized Barrier&lt;br /&gt;
! Combining Tree&lt;br /&gt;
Barrier&lt;br /&gt;
! Barrier Network&lt;br /&gt;
(Hardware)&lt;br /&gt;
|-&lt;br /&gt;
| Latency&lt;br /&gt;
| O(1)&lt;br /&gt;
| O(log p)&lt;br /&gt;
| O(log p)&lt;br /&gt;
|-&lt;br /&gt;
| Traffic&lt;br /&gt;
| O(p&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;)&lt;br /&gt;
| O(p)&lt;br /&gt;
| moved to a separate network&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Sense-Reversal Barrier ==&lt;br /&gt;
&lt;br /&gt;
This barrier is a centralized barrier in which a single count variable, protected by a lock, is shared among all the threads. Each thread increments the count on reaching the barrier and then waits until the count equals the number of threads participating in the barrier. Since all the threads spin on a single variable, the miss rate scales quadratically with the number of processors.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki01.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A sense-reversing barrier is a more elegant and practical solution to the problem of reusing barriers. A phase’s sense is a Boolean value: true for even-numbered phases and false otherwise. Each SenseBarrier object has a Boolean sense field indicating the sense of the currently executing phase. Each thread keeps its current sense as a thread-local object. Initially the barrier’s sense is the complement of the local sense of all the threads. When a thread calls await(), it checks whether it is the last thread to decrement the counter. If so, it reverses the barrier’s sense and continues. Otherwise, it spins waiting for the barrier’s sense field to change to match its own local sense.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;13body&amp;quot;&amp;gt;[[#13foot|[13]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 struct barrier {&lt;br /&gt;
     shared int count;&lt;br /&gt;
     // a Fetch&amp;amp;Inc/Dec() object with initial value n&lt;br /&gt;
     // this object also supports read and write&lt;br /&gt;
     boolean sense;&lt;br /&gt;
     // initially FALSE&lt;br /&gt;
     boolean mysense[n];&lt;br /&gt;
     // initially, mysense[i] = TRUE,&lt;br /&gt;
     // for each 1 ≤ i ≤ n.&lt;br /&gt;
 };&lt;br /&gt;
 void await(struct barrier *B) {&lt;br /&gt;
     // code for process pi&lt;br /&gt;
     int position = Get&amp;amp;Dec(B-&amp;gt;count);&lt;br /&gt;
     if (position == 1) {&lt;br /&gt;
         // last process to arrive: reset the counter, then release the others&lt;br /&gt;
         B-&amp;gt;count = n;&lt;br /&gt;
         B-&amp;gt;sense = B-&amp;gt;mysense[i];&lt;br /&gt;
     }&lt;br /&gt;
     else {&lt;br /&gt;
         while (B-&amp;gt;sense != B-&amp;gt;mysense[i])&lt;br /&gt;
             noop;&lt;br /&gt;
     }&lt;br /&gt;
     // reverse the local sense for the next phase&lt;br /&gt;
     B-&amp;gt;mysense[i] = !B-&amp;gt;mysense[i];&lt;br /&gt;
 }&lt;br /&gt;
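&lt;br /&gt;
For readers who want to run the barrier, the pseudocode above can be approximated in Python. This is a sketch under assumptions: the class name SenseBarrier is borrowed from the description, the method is called await_all because await is a reserved word in Python, a threading.Lock emulates the atomic Get-and-Dec, and the spin loop yields the processor with time.sleep(0) instead of executing a no-op.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time


class SenseBarrier:
    def __init__(self, n):
        self.n = n
        self.count = n
        self.sense = False                 # barrier sense, initially FALSE
        self._guard = threading.Lock()     # emulates an atomic fetch-and-decrement
        self._local = threading.local()    # per-thread sense, initially TRUE

    def await_all(self):
        my_sense = getattr(self._local, "sense", True)
        with self._guard:
            position = self.count
            self.count -= 1
        if position == 1:
            # Last thread to arrive: reset the counter first, then release the others.
            self.count = self.n
            self.sense = my_sense
        else:
            while self.sense != my_sense:
                time.sleep(0)              # spin, letting other threads run
        # Reverse the local sense for the next phase.
        self._local.sense = not my_sense
```
&lt;br /&gt;
The counter is reset before the sense is flipped, so a thread that races ahead into the next phase always finds a fresh count, and it cannot be released early because the slowest thread has not yet arrived there.&lt;br /&gt;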
&lt;br /&gt;
== Combining Tree Barrier ==&lt;br /&gt;
&lt;br /&gt;
This barrier is a distributed barrier in which groups of processors form clusters, and each cluster updates its own local variable. When a local variable reaches a value equal to the number of threads that update it, the last thread to arrive goes on to increment another variable one level higher in the combining tree. When the variable at the top of the tree reaches its maximum value, all the threads have reached the barrier and synchronization is complete.&lt;br /&gt;
&lt;br /&gt;
Since the threads update local variables in smaller groups, the miss rate is not as high as in the sense-reversal barrier.&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the operation of a combining tree barrier with the threads divided into two groups. The variables C0 and C1 are local to each group, and C2 is the variable at the next level up in the tree.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki02.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
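&lt;br /&gt;
The counting scheme can be illustrated with a small sequential Python model of a two-level tree. It is only a sketch: the class name CombiningTree and its fields are hypothetical, arrivals are processed one at a time so no atomic operations are needed, and the waiting and release phases of a real barrier are left out.&lt;br /&gt;
&lt;br /&gt;
```python
class CombiningTree:
    """Two-level model: one counter per group plus a root counter."""

    def __init__(self, groups, group_size):
        self.groups = groups
        self.group_size = group_size
        self.local = [0] * groups    # C0, C1, ...: one counter per group
        self.root = 0                # the counter one level up (C2 in the figure)

    def arrive(self, thread_id):
        """Record an arrival; True once every thread has reached the barrier."""
        g = thread_id // self.group_size
        self.local[g] += 1
        if self.local[g] == self.group_size:
            self.root += 1           # only the last arrival of a group touches the root
        return self.root == self.groups


tree = CombiningTree(groups=2, group_size=2)
print([tree.arrive(t) for t in (0, 2, 3, 1)])   # [False, False, False, True]
```
&lt;br /&gt;
With p threads the root counter is updated once per group rather than p times, which is the reason contention on any single variable stays low.&lt;br /&gt;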
&lt;br /&gt;
== Tournament Barrier ==&lt;br /&gt;
&lt;br /&gt;
In a tournament barrier, the threads are considered to be the leaves of a binary tree, and each internal node represents a flag. Two threads compete with each other at a node; the loser sets the flag and moves to the next level, where it competes against the loser from the neighboring part of the tree. Thus the thread that completes last is the one that sets the highest flag in the binary tree. Setting this flag indicates to all the threads that the barrier has been completed, and synchronization is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki03.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Disseminating Barrier ==&lt;br /&gt;
&lt;br /&gt;
In this barrier each thread maintains a record of the activity of other threads. In every round i (starting from 0) with n threads, thread A notifies thread (A + 2&amp;lt;sup&amp;gt;i&amp;lt;/sup&amp;gt;) mod n. Thus after ⌈log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; n⌉ rounds every thread knows that every other thread has reached the barrier. Overall synchronization is achieved by implication from a carefully chosen sequence of pairwise synchronizations.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki04.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
An example of this type of barrier can be found below:&lt;br /&gt;
&lt;br /&gt;
 for [s = 0 to stages-1] { ## there will be ceiling(log_2 p) stages&lt;br /&gt;
 work out who my out-going partner is at stage s;&lt;br /&gt;
 &amp;lt;await (arrive[partner][s] == 0);&amp;gt;&lt;br /&gt;
 arrive[partner][s] = 1;&lt;br /&gt;
 &amp;lt;await (arrive[myid][s] == 1);&amp;gt;&lt;br /&gt;
 arrive[myid][s] = 0;&lt;br /&gt;
 }&lt;br /&gt;
Here, each thread executes the same code, choosing partners for the pairwise synchronizations as a function of its own identifier and the internal iteration. A thread's own arrive flag is now set by an “in-coming” partner, who is distinct from its “out-going” partner (except when p=2).&lt;br /&gt;
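&lt;br /&gt;
The partner rule and the number of rounds can be checked with a short sequential simulation in Python. This is a sketch only: the function names are hypothetical, each thread is modelled as the set of threads it knows have arrived, and the actual flag spinning is not simulated.&lt;br /&gt;
&lt;br /&gt;
```python
def rounds_needed(n):
    """ceil(log2(n)) computed with integer arithmetic."""
    return (n - 1).bit_length()


def outgoing_partner(thread_id, round_index, n):
    """The thread that thread_id notifies in the given round."""
    return (thread_id + 2 ** round_index) % n


def simulate(n):
    """Return, for each thread, the set of threads it knows have arrived."""
    known = [{t} for t in range(n)]
    for r in range(rounds_needed(n)):
        snapshot = [set(k) for k in known]   # notifications within a round are concurrent
        for t in range(n):
            known[outgoing_partner(t, r, n)] |= snapshot[t]
    return known


print(all(k == set(range(6)) for k in simulate(6)))   # True
```
&lt;br /&gt;
After round r each thread has heard, directly or indirectly, from the 2 to the power r+1 threads nearest below it (wrapping around), so the stated number of rounds is enough even when the thread count is not a power of two.&lt;br /&gt;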
&lt;br /&gt;
= References = &lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; David Culler, J.P. Singh and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann. August 1998.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Yan Solihin. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin Books. August 2009.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-memory Multiprocessors. http://www.cc.gatech.edu/classes/AY2010/cs4210_fall/papers/anderson-spinlock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://www.statemaster.com/encyclopedia/Lock-(computer-science)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://faydoc.tripod.com/cpu/cmpxchg.htm&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/slides/threads3.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; Scalability evaluation of barrier algorithms for OpenMP. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; Carwyn Ball and Mark Bull. Barrier Synchronization in Java. http://www.ukhec.ac.uk/publications/reports/synch_java.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.cs.brown.edu/courses/cs176/barrier.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; http://heather.cs.ucdavis.edu/~matloff/50/PLN/lock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/handouts/threads3.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;12foot&amp;quot;&amp;gt;[[#12body|12.]]&amp;lt;/span&amp;gt; http://preshing.com/20120612/an-introduction-to-lock-free-programming&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;13foot&amp;quot;&amp;gt;[[#13body|13.]]&amp;lt;/span&amp;gt; http://my.safaribooksonline.com/book/software-engineering-and-development/9780123705914/barriers/ch17lev1sec3#X2ludGVybmFsX0h0bWxWaWV3P3htbGlkPTk3ODAxMjM3MDU5MTQlMkZjaDE3bGV2MXNlYzMmcXVlcnk9&amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74680</id>
		<title>CSC/ECE 506 Spring 2013/9b sc</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74680"/>
		<updated>2013-04-03T23:56:07Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Instruction and Previous Work =&lt;br /&gt;
Changes are made as per the instructions given in https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM/edit&lt;br /&gt;
&lt;br /&gt;
Previous Version can be found here: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2011/ch9_ms&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
In addition to a proper cache coherency model, it is important for a multiprocessor system to provide support for synchronization. This must be provided at the hardware level, and can also be implemented to some degree in software. The most common types of hardware synchronization are locks and barriers, which are discussed in this chapter.&lt;br /&gt;
&lt;br /&gt;
== Hardware vs. Operating System Synchronization ==&lt;br /&gt;
Synchronizations are characterized by three parts: acquire, wait, and release. Both hardware and operating system synchronization rely on atomic instructions for acquire and release, but the implementations for the wait portion of a synchronization can vary. Hardware implementations usually use busy-waiting, where a process repeatedly tests a variable waiting for it to change. The operating system can wait using blocking, where the process suspends itself until woken up by another process.&lt;br /&gt;
&lt;br /&gt;
Blocking has the advantage of freeing up the processor for use by other processes, but requires operating system functions to work and thus has a lot of overhead. Busy-waiting has much less overhead, but uses processor and cache resources while waiting. Because of these tradeoffs, it is usually better to use busy-waiting for short wait periods, and blocking for long wait periods.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Lock Implementations =&lt;br /&gt;
&lt;br /&gt;
Locks are an important concept when programming for multiprocessor systems. The purpose of a lock is to protect the code inside the lock, known as a critical section.  It must be certain that while some thread X has entered the critical section, another thread Y is not also inside of the critical section and possibly modifying critical values. While thread X has entered the critical section, thread Y must wait until X has exited before entering. This can be accomplished in a variety of ways of varying complexity and performance.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate lock performance, we can use these four criteria:&lt;br /&gt;
&lt;br /&gt;
*Acquisition Latency - How much time does it take to acquire the lock?&lt;br /&gt;
*Traffic - How much bus traffic is generated by threads attempting to acquire the lock?&lt;br /&gt;
*Fairness - Are threads granted the lock in the order they requested it (FIFO), or is it a matter of luck?&lt;br /&gt;
*Storage - How much storage is needed compared to the number of threads?&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Acquisition Latency ===&lt;br /&gt;
We want a low acquisition latency, especially for applications that repeatedly acquire and release locks. Over the run time of a program, the acquisition latency compounded many times could have a large effect on performance. However, low acquisition latency has to be balanced against other factors when implementing a lock.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
Traffic is an important consideration when evaluating lock performance. If acquiring a lock causes a lot of bus traffic, the lock will not scale well as the number of threads increases. Eventually the bus will become choked.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Fairness ===&lt;br /&gt;
In general, threads should acquire a lock in the same order in which they requested it. If this does not happen, the chance that a thread will become starved grows as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
The amount of storage needed per lock should be small enough such that it scales to a large number of threads without any problems. If a large amount of storage is required, then multiple threads could cause locks to consume too much memory.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
&lt;br /&gt;
Two common lock implementations are test&amp;amp;set and test-and-test&amp;amp;set. In terms of performance, both have a low storage cost (scalable) and are not fair. Test&amp;amp;set has a low acquisition latency but is not scalable due to high bus traffic when there is a lot of contention for the lock. Test-and-test&amp;amp;set has a higher acquisition latency but scales better because it generates less bus traffic. The next figure shows performance comparisons for these two lock implementations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:lockperf.png|frame|none|Performance comparison of test&amp;amp;set and test-and-test&amp;amp;set (spin on read)&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
=== Improving performance ===&lt;br /&gt;
Optimizing software can improve the performance of locks. For instance, inserting a delay between a processor noticing a release and then trying to acquire a lock can reduce contention and bus traffic, and increase the performance of the locks. TCP-like back off algorithms can be used to adjust this delay depending on the number of threads contending for the lock.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
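&lt;br /&gt;
One possible way to choose such a delay is exponential backoff with a cap, sketched below in Python. The function names, the base delay of one microsecond and the cap of one millisecond are assumptions made for illustration and are not taken from the cited paper.&lt;br /&gt;
&lt;br /&gt;
```python
import random


def backoff_delay(attempt, base=1e-6, cap=1e-3):
    """Upper bound, in seconds, on the wait before retry number attempt (0-based)."""
    return min(cap, base * 2 ** attempt)


def jittered_delay(attempt, rng=random):
    """Pick a random wait below the bound so contenders do not retry in lockstep."""
    return rng.uniform(0, backoff_delay(attempt))
```
&lt;br /&gt;
Each failed acquisition attempt doubles the bound, so heavily contended locks are probed less often, while the cap keeps a thread from sleeping long after the lock has been released.&lt;br /&gt;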
&lt;br /&gt;
Another way to improve performance is to insert a delay between memory references so as to limit the bandwidth (traffic) each processor can use for spinning.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One way to guarantee fairness is to queue lock attempts in shared memory so that they happen in order.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Atomic Instructions ==&lt;br /&gt;
&lt;br /&gt;
Since a multiprocessor system cannot disable interrupts as an effective method to execute code atomically, there must be hardware support for atomic operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ability to execute an atomic instruction is a requirement for most lock implementations. It is important that when a processor attempts to set a lock, either the lock is fully set and the thread is able to enter into the critical section, or the lock is not set and it appears that none of the instructions required to set the lock have executed.  &lt;br /&gt;
&lt;br /&gt;
In the x86 instruction set, the CMPXCHG (compare and exchange) instruction can be used in a lock implementation to guarantee atomicity. It takes a destination and a source operand and uses the accumulator implicitly. The accumulator is compared with the destination; if they are equal, the destination is loaded with the source, and if they are not equal, the accumulator is loaded with the destination value. To ensure that this executes atomically on a multiprocessor, the instruction must be issued with the LOCK prefix. This is useful in implementing some locks, such as ticket locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; In the AT&amp;amp;T syntax used below, the instruction is written LOCK CMPXCHG SOURCE, DESTINATION. The source must be a register, while the destination can be either a register or memory (with the LOCK prefix it must be memory). An example of LOCK CMPXCHG appears in the disassembly of the Linux kernel's rtc_cmos_read(). To take the lock, a CPU must read the lock variable and, if it is zero, update it to a non-zero value. If these were separate steps, another CPU could change the variable after it was read but before it was written, which is a race condition. LOCK CMPXCHG closes this window by performing the comparison and the update as one atomic step at the end of the busy-wait loop. The code is given below&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
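 spin: # see if the lock-variable is clear&lt;br /&gt;
 mov cmos_lock, %eax&lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
 # ok, now we try to grab the lock&lt;br /&gt;
 # %eax is 0 here; %edx is assumed to hold the non-zero locked value&lt;br /&gt;
 # if cmos_lock == %eax, store %edx in cmos_lock; otherwise load cmos_lock into %eax&lt;br /&gt;
 lock cmpxchg %edx, cmos_lock&lt;br /&gt;
 # did another CPU grab it first? &lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;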
&lt;br /&gt;
== Hand-off Lock ==&lt;br /&gt;
&lt;br /&gt;
Another type of lock that is not discussed in the text is known as the &amp;quot;hand-off&amp;quot; lock. The first thread to request the lock acquires it immediately, since no other thread holds it. When another thread attempts to acquire the lock, it sees that the lock is in use and adds itself to a queue of waiting threads. It then sleeps until it is woken by the thread that holds the lock. When the holder is finished, it passes the lock directly to the next thread in the queue&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. A waiting thread gives up the processor so that other threads (e.g. the thread with the lock) can run more quickly, and it is woken up when the lock becomes free for it. Keeping the waiting queue separate from the ready queue means the scheduler does not spend time on threads that cannot yet proceed. In the example below, the unlock method gives the lock to a waiting thread. Interrupts are disabled while the lock state and the queues are being modified and are re-enabled afterwards.&lt;br /&gt;
&lt;br /&gt;
 lock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     if (value == FREE) {&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     } else {&lt;br /&gt;
         add thread to queue of threads waiting for this lock&lt;br /&gt;
         switch to next runnable thread&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
 unlock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     value = FREE&lt;br /&gt;
     if (any thread is waiting for this lock) {&lt;br /&gt;
         move waiting thread from waiting queue to ready queue&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
However, this scheme fails if lock() re-enables interrupts before it calls switch. Suppose lock() adds the thread to the wait queue and then enables interrupts. An interrupt at that point causes pre-emption, i.e. a switch to another thread, and pre-emption moves the current thread to the ready queue. The thread is now on two queues (wait and ready). In addition, switch itself is likely to be a critical section. Adding the thread to the wait queue and switching to the next thread must therefore be atomic. The solution is for the waiting thread to leave interrupts disabled when it calls switch. The next thread to run has the responsibility of re-enabling interrupts before returning to user code. When the waiting thread wakes up, it returns from switch with interrupts disabled (by the last thread that ran).&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Avoiding Locks ==&lt;br /&gt;
&lt;br /&gt;
Locks introduce a number of problems, so a programmer should try to write programs in a way that avoids locks where possible.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One of the most well-known issues is deadlock. This can occur when threads are waiting to acquire a lock that will never be released. For example, suppose two processes A and B each need two shared resources (1 and 2), each protected by a lock, to complete a task. A deadlock arises when A locks resource 1 and B locks resource 2: each then waits for the resource the other holds, and neither can make progress. One solution is to impose a lock order, so that any process may lock resource 2 only if it has already locked resource 1.&lt;br /&gt;
&lt;br /&gt;
Another problem with locks is that performance is not optimal, because a lock is often taken when there is only a chance of conflict. This pessimistic approach yields slower performance than might be possible with other methods. It also raises the question of granularity, that is, how much of the code should be protected by a single critical section. The programmer must decide between many small (fine-grained) locks and fewer, more encompassing (coarse-grained) locks. This decision can greatly affect the performance of the program.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid locks, lock-free programming techniques can be used. These are built on atomic operations such as compare-and-swap (CAS), together with a well-defined memory ordering such as sequential consistency. A typical lock-free pattern involves copying a shared variable to a local variable, performing some speculative work, and attempting to publish the changes using CAS&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 void LockFreeQueue::push(Node* newHead)&lt;br /&gt;
 {&lt;br /&gt;
     for (;;)&lt;br /&gt;
     {&lt;br /&gt;
         // Copy a shared variable (m_Head) to a local.&lt;br /&gt;
         Node* oldHead = m_Head;&lt;br /&gt;
 &lt;br /&gt;
         // Do some speculative work, not yet visible to other threads.&lt;br /&gt;
         newHead-&amp;gt;next = oldHead;&lt;br /&gt;
 &lt;br /&gt;
         // Next, attempt to publish our changes to the shared variable.&lt;br /&gt;
         // If the shared variable hasn't changed, the CAS succeeds and we return.&lt;br /&gt;
         // Otherwise, repeat.&lt;br /&gt;
         if (_InterlockedCompareExchange(&amp;amp;m_Head, newHead, oldHead) == oldHead)&lt;br /&gt;
            return;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
The code above follows a read-modify-write pattern: each thread works on its own local copy of the shared variable and publishes its change only if the shared variable still holds the value that was read. If another thread has changed the variable in the meantime, the CAS fails and the loop repeats with the new value.&lt;br /&gt;
&lt;br /&gt;
= Barrier Implementations =&lt;br /&gt;
&lt;br /&gt;
In many ways, barriers are simpler than locks.  A barrier is a point in a program that all participating threads must reach before any of them is allowed to continue.  When using barriers, a programmer does not have to be concerned with advanced topics such as fairness, which must be considered when programming with locks.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate barrier performance, we can use these two criteria:&lt;br /&gt;
&lt;br /&gt;
*Latency - The time required to enter and exit a barrier.&lt;br /&gt;
*Traffic - Communications overhead required by the barrier.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Latency ===&lt;br /&gt;
&lt;br /&gt;
Ideally the latency of a barrier should be small, but for barriers that don't scale well latency can increase as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
&lt;br /&gt;
Low traffic is also desirable, as excess bus traffic limits scalability.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+Performance Comparison of Barrier Implementations&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! Criteria&lt;br /&gt;
! Sense Reversal&lt;br /&gt;
Centralized Barrier&lt;br /&gt;
! Combining Tree&lt;br /&gt;
Barrier&lt;br /&gt;
! Barrier Network&lt;br /&gt;
(Hardware)&lt;br /&gt;
|-&lt;br /&gt;
| Latency&lt;br /&gt;
| O(1)&lt;br /&gt;
| O(log p)&lt;br /&gt;
| O(log p)&lt;br /&gt;
|-&lt;br /&gt;
| Traffic&lt;br /&gt;
| O(p&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;)&lt;br /&gt;
| O(p)&lt;br /&gt;
| moved to a separate network&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Sense-Reversal Barrier ==&lt;br /&gt;
&lt;br /&gt;
This barrier is a centralized barrier in which a single count variable, protected by a lock, is shared among all the threads. On reaching the barrier, each thread increments the count and then waits until it equals the number of threads participating in the barrier. Since all the threads spin on a single variable, the miss rate scales quadratically with the number of processors.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki01.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A sense-reversing barrier is a more elegant and practical solution to the problem of reusing barriers. A phase’s sense is a Boolean value: true for even-numbered phases and false otherwise. Each SenseBarrier object has a Boolean sense field indicating the sense of the currently executing phase. Each thread keeps its current sense as a thread-local object. Initially the barrier’s sense is the complement of the local sense of all the threads. When a thread calls await(), it checks whether it is the last thread to decrement the counter. If so, it reverses the barrier’s sense and continues. Otherwise, it spins waiting for the barrier’s sense field to change to match its own local sense.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;13body&amp;quot;&amp;gt;[[#13foot|[13]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 struct barrier {&lt;br /&gt;
     shared int count;    // a Fetch&amp;amp;Inc/Dec() object with initial value n;&lt;br /&gt;
                          // this object also supports read and write&lt;br /&gt;
     boolean sense;       // initially FALSE&lt;br /&gt;
     boolean mysense[n];  // initially mysense[i] = TRUE for each 1 ≤ i ≤ n&lt;br /&gt;
 };&lt;br /&gt;
 void await(struct barrier *B) {              // code for process i&lt;br /&gt;
     int position = Get&amp;amp;Dec(B-&amp;gt;count);     // read the old count and decrement it atomically&lt;br /&gt;
     if (position == 1) {                     // last thread to arrive&lt;br /&gt;
         B-&amp;gt;count = n;                     // reset the counter for the next phase&lt;br /&gt;
         B-&amp;gt;sense = B-&amp;gt;mysense[i];      // flip the shared sense to release the waiters&lt;br /&gt;
     }&lt;br /&gt;
     else {&lt;br /&gt;
         while (B-&amp;gt;sense != B-&amp;gt;mysense[i])&lt;br /&gt;
             noop;                            // spin until the shared sense matches&lt;br /&gt;
     }&lt;br /&gt;
     B-&amp;gt;mysense[i] = 1 - B-&amp;gt;mysense[i];  // toggle the local sense for the next phase&lt;br /&gt;
 }&lt;br /&gt;
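&lt;br /&gt;
For readers who want to run the idea, here is a minimal Python sketch of the same barrier. It makes some assumptions that are not in the source: the atomic Get-and-Dec is emulated with a threading.Lock, the per-thread sense is kept in threading.local, and the names SenseBarrier and run_phases are invented for the example.&lt;br /&gt;

```python
import threading
import time


class SenseBarrier:
    """Centralized sense-reversing barrier for a fixed number of threads."""

    def __init__(self, n):
        self.n = n
        self.count = n                   # threads still expected in this phase
        self.sense = False               # shared sense, initially FALSE
        self.local = threading.local()   # per-thread sense, initially TRUE
        self._guard = threading.Lock()   # emulates an atomic fetch-and-decrement

    def wait(self):
        my_sense = getattr(self.local, "sense", True)
        with self._guard:
            position = self.count
            self.count -= 1
        if position == 1:
            self.count = self.n          # last arrival resets the counter
            self.sense = my_sense        # and flips the sense to release the rest
        else:
            while self.sense != my_sense:
                time.sleep(0)            # busy-wait, yielding to other threads
        self.local.sense = not my_sense  # prepare for the next phase


def run_phases(thread_count, phases):
    """Each worker logs its phase number, then waits at the barrier."""
    barrier = SenseBarrier(thread_count)
    log = []

    def worker():
        for phase in range(phases):
            log.append(phase)            # list.append is atomic in CPython
            barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

If the barrier is correct, no thread can log phase k+1 before every thread has logged phase k, so the log returned by run_phases is always grouped by phase. The same object is reused for every phase without re-initialization, which is the point of reversing the sense.&lt;br /&gt;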
&lt;br /&gt;
== Combining Tree Barrier ==&lt;br /&gt;
&lt;br /&gt;
This barrier is a distributed barrier in which groups of processors form clusters, and each cluster updates a local variable. When a local variable reaches the number of threads updating it, the last thread to arrive proceeds to increment a variable one level higher in the combining tree. When the variable at the top of the tree reaches its maximum value, all the threads have reached the barrier and synchronization is complete.&lt;br /&gt;
&lt;br /&gt;
Since the threads update local variables in smaller groups, the miss rate is not as high as in the sense-reversal barrier.&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the operation of a combining tree barrier with the threads divided into two groups. The variables C0 and C1 are local to each group, and C2 is the variable at the next level up in the tree.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki02.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
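&lt;br /&gt;
The two-level arrangement of C0, C1, and C2 can be sketched in Python as follows. This is an illustrative sketch, not the implementation from the cited report: each counter also carries a sense flag so that the tree can be reused, the atomic decrement is emulated with a lock, and the names TreeNode and run_tree_barrier are assumptions made for the example.&lt;br /&gt;

```python
import threading
import time


class TreeNode:
    """One counter in the combining tree (C0, C1 or C2 in the text)."""

    def __init__(self, fan_in, parent=None):
        self.fan_in = fan_in
        self.count = fan_in
        self.parent = parent
        self.sense = False
        self._guard = threading.Lock()   # emulates an atomic fetch-and-decrement

    def arrive(self, my_sense):
        with self._guard:
            position = self.count
            self.count -= 1
        if position == 1:
            if self.parent is not None:
                self.parent.arrive(my_sense)  # last of the group moves up a level
            self.count = self.fan_in          # reset for the next phase
            self.sense = my_sense             # release the threads waiting here
        else:
            while self.sense != my_sense:
                time.sleep(0)                 # busy-wait, yielding to other threads


def run_tree_barrier(group_size, phases):
    """Run two groups of threads through several phases; return the log."""
    c2 = TreeNode(2)                                  # root, one update per group
    leaves = [TreeNode(group_size, c2), TreeNode(group_size, c2)]   # C0 and C1
    log = []

    def worker(leaf):
        my_sense = True
        for phase in range(phases):
            log.append(phase)                 # list.append is atomic in CPython
            leaf.arrive(my_sense)
            my_sense = not my_sense

    threads = [threading.Thread(target=worker, args=(leaf,))
               for leaf in leaves for _ in range(group_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Only one thread per group ever touches the root counter, so each thread spins on a variable shared with its own group rather than on one global location.&lt;br /&gt;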
&lt;br /&gt;
== Tournament Barrier ==&lt;br /&gt;
&lt;br /&gt;
In a tournament barrier, the threads are the leaves of a binary tree and each internal node represents a flag. Two threads compete with each other; the loser, that is, the later arrival, sets the flag and moves up to the next level, where it competes, again to lose, against the loser from the other subtree. Thus the thread that arrives last sets the flag at the root of the tree. Setting this flag signals to all the threads that the barrier is complete, and synchronization is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki03.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Disseminating Barrier ==&lt;br /&gt;
&lt;br /&gt;
In this barrier, each thread maintains a record of the activity of the other threads. In round i (counting from 0) with n threads, thread A notifies thread (A + 2&amp;lt;sup&amp;gt;i&amp;lt;/sup&amp;gt;) mod n. Thus after ⌈log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; n⌉ rounds every thread knows that every other thread has reached the barrier.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki04.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
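&lt;br /&gt;
The notification pattern can be checked with a small single-threaded simulation. The sketch below assumes rounds are numbered from 0 and that the partner of thread A in round i is (A + 2**i) mod n; the function names are chosen for this example only.&lt;br /&gt;

```python
def dissemination_rounds(n):
    """Smallest r such that 2**r is at least n, i.e. ceil(log2(n))."""
    return (n - 1).bit_length()


def partner(thread, round_index, n):
    """Thread that is notified by the given thread in the given round."""
    return (thread + 2 ** round_index) % n


def knowledge_after_all_rounds(n):
    """Simulate every round; known[t] is the set of threads t has heard about."""
    known = [{t} for t in range(n)]
    for r in range(dissemination_rounds(n)):
        snapshot = [set(s) for s in known]   # all messages of a round are sent together
        for t in range(n):
            known[partner(t, r, n)] |= snapshot[t]
    return known
```

For example, with n = 8 thread 3 notifies threads 4, 5, and 7 in rounds 0, 1, and 2. After the final round every entry of knowledge_after_all_rounds(n) contains all n thread numbers, including when n is not a power of two.&lt;br /&gt;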
&lt;br /&gt;
= References = &lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; David Culler, J.P. Singh and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann. August 1998.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Yan Solihin. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin Books. August 2009.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-memory Multiprocessors. http://www.cc.gatech.edu/classes/AY2010/cs4210_fall/papers/anderson-spinlock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://www.statemaster.com/encyclopedia/Lock-(computer-science)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://faydoc.tripod.com/cpu/cmpxchg.htm&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/slides/threads3.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; Scalability evaluation of barrier algorithms for OpenMP. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; Carwyn Ball and Mark Bull. Barrier Synchronization in Java. http://www.ukhec.ac.uk/publications/reports/synch_java.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.cs.brown.edu/courses/cs176/barrier.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; http://heather.cs.ucdavis.edu/~matloff/50/PLN/lock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/handouts/threads3.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;12foot&amp;quot;&amp;gt;[[#12body|12.]]&amp;lt;/span&amp;gt; http://preshing.com/20120612/an-introduction-to-lock-free-programming&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;13foot&amp;quot;&amp;gt;[[#13body|13.]]&amp;lt;/span&amp;gt;http://my.safaribooksonline.com/book/software-engineering-and-development/9780123705914/barriers/ch17lev1sec3#X2ludGVybmFsX0h0bWxWaWV3P3htbGlkPTk3ODAxMjM3MDU5MTQlMkZjaDE3bGV2MXNlYzMmcXVlcnk9&amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74679</id>
		<title>CSC/ECE 506 Spring 2013/9b sc</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74679"/>
		<updated>2013-04-03T23:55:26Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Sense-Reversal Barrier */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Instruction and Previous Work =&lt;br /&gt;
Changes are made as per the instructions given in https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM/edit&lt;br /&gt;
&lt;br /&gt;
Previous Version can be found here: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2011/ch9_ms&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
In addition to a proper cache coherency model, it is important for a multiprocessor system to provide support for synchronization. This must be provided at the hardware level, and can also be implemented to some degree in software. The most common types of hardware synchronization are locks and barriers, which are discussed in this chapter.&lt;br /&gt;
&lt;br /&gt;
== Hardware vs. Operating System Synchronization ==&lt;br /&gt;
Synchronizations are characterized by three parts: acquire, wait, and release. Both hardware and operating system synchronization rely on atomic instructions for acquire and release, but the implementations for the wait portion of a synchronization can vary. Hardware implementations usually use busy-waiting, where a process repeatedly tests a variable waiting for it to change. The operating system can wait using blocking, where the process suspends itself until woken up by another process.&lt;br /&gt;
&lt;br /&gt;
Blocking has the advantage of freeing up the processor for use by other processes, but requires operating system functions to work and thus has a lot of overhead. Busy-waiting has much less overhead, but uses processor and cache resources while waiting. Because of these tradeoffs, it is usually better to use busy-waiting for short wait periods, and blocking for long wait periods.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Lock Implementations =&lt;br /&gt;
&lt;br /&gt;
Locks are an important concept when programming for multiprocessor systems. The purpose of a lock is to protect the code inside the lock, known as a critical section.  It must be certain that while some thread X has entered the critical section, another thread Y is not also inside of the critical section and possibly modifying critical values. While thread X has entered the critical section, thread Y must wait until X has exited before entering. This can be accomplished in a variety of ways of varying complexity and performance.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate lock performance, we can use these four criteria:&lt;br /&gt;
&lt;br /&gt;
*Acquisition Latency - How much time does it take to acquire the lock?&lt;br /&gt;
*Traffic - How much bus traffic is generated by threads attempting to acquire the lock?&lt;br /&gt;
*Fairness - Are threads granted the lock in the order they requested it (FIFO), or does acquisition depend on luck?&lt;br /&gt;
*Storage - How much storage is needed compared to the number of threads?&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Acquisition Latency ===&lt;br /&gt;
We want a low acquisition latency, especially for applications that repeatedly acquire and release locks. Over the run time of a program, the acquisition latency compounded many times could have a large effect on performance. However, low acquisition latency has to be balanced against other factors when implementing a lock.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
Traffic is an important consideration when evaluating lock performance. If acquiring a lock causes a lot of bus traffic, the lock will not scale well as the number of threads increases. Eventually the bus will become choked.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Fairness ===&lt;br /&gt;
In general, threads should acquire a lock in the same order in which they requested it. If this does not happen, the chance that a thread will be starved grows as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
The amount of storage needed per lock should be small enough such that it scales to a large number of threads without any problems. If a large amount of storage is required, then multiple threads could cause locks to consume too much memory.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
&lt;br /&gt;
Two common lock implementations are test&amp;amp;set and test-and-test&amp;amp;set. In terms of performance, both have a low storage cost (scalable) and neither is fair. Test&amp;amp;set has a low acquisition latency but is not scalable due to high bus traffic when there is a lot of contention for the lock. Test-and-test&amp;amp;set has a higher acquisition latency but scales better because it generates less bus traffic. The next figure shows performance comparisons for these two lock implementations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:lockperf.png|frame|none|Performance comparison of test&amp;amp;set and test-and-test&amp;amp;set (spin on read)&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
=== Improving performance ===&lt;br /&gt;
Optimizing software can improve the performance of locks. For instance, inserting a delay between a processor noticing a release and then trying to acquire a lock can reduce contention and bus traffic, and increase the performance of the locks. TCP-like back off algorithms can be used to adjust this delay depending on the number of threads contending for the lock.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
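&lt;br /&gt;
The delay-after-release idea can be sketched in Python as follows. This is a hedged illustration rather than the algorithm from the cited paper: a non-blocking acquire on a threading.Lock stands in for the atomic test-and-set instruction, the delay is doubled after each failed round up to an assumed cap, and the names BackoffSpinLock and count_with_lock, along with the delay values, are invented for the example.&lt;br /&gt;

```python
import random
import threading
import time


class BackoffSpinLock:
    """Spin-on-read lock that delays after noticing a release."""

    def __init__(self, min_delay=1e-6, max_delay=1e-4):
        self._flag = threading.Lock()  # non-blocking acquire plays test-and-set
        self.min_delay = min_delay
        self.max_delay = max_delay

    def acquire(self):
        delay = self.min_delay
        # Try at once if the lock looks free; otherwise spin, delay, retry.
        while self._flag.locked() or not self._flag.acquire(blocking=False):
            while self._flag.locked():
                time.sleep(0)                     # spin on a read only
            time.sleep(random.uniform(0, delay))  # delay after noticing the release
            delay = min(2 * delay, self.max_delay)

    def release(self):
        self._flag.release()


def count_with_lock(thread_count, increments):
    """Increment a shared counter under the lock from several threads."""
    lock = BackoffSpinLock()
    state = {"value": 0}

    def worker():
        for _ in range(increments):
            lock.acquire()
            state["value"] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["value"]
```

The random component spreads out the retries of different threads, so fewer of them issue the expensive atomic operation at the same moment.&lt;br /&gt;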
&lt;br /&gt;
Another way to improve performance is to insert a delay between memory references so as to limit the bandwidth (traffic) each processor can use for spinning.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One way to guarantee fairness is to queue lock attempts in shared memory so that they happen in order.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Atomic Instructions ==&lt;br /&gt;
&lt;br /&gt;
Since a multiprocessor system cannot disable interrupts as an effective method to execute code atomically, there must be hardware support for atomic operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ability to execute an atomic instruction is a requirement for most lock implementations. It is important that when a processor attempts to set a lock, either the lock is fully set and the thread is able to enter into the critical section, or the lock is not set and it appears that none of the instructions required to set the lock have executed.  &lt;br /&gt;
&lt;br /&gt;
In the x86 instruction set, the CMPXCHG (compare and exchange) instruction can be used in a lock implementation to guarantee atomicity. The instruction takes a destination and a source operand. The accumulator is compared to the destination; if they are equal, the destination is loaded with the source, and if they are not equal, the accumulator is loaded with the destination value. To ensure that this executes atomically on a multiprocessor, the instruction must be issued with the LOCK prefix. This is useful in implementing some locks, such as ticket locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; In the AT&amp;amp;T syntax used below, the instruction is written LOCK CMPXCHG source, destination. The source must be a register, while the destination can be either a register or a memory location. An example of LOCK CMPXCHG appears in the disassembly of the Linux kernel's rtc_cmos_read(). To take the lock, a CPU must read the lock variable and, if it is clear, update it to a nonzero value. If the read and the update were separate steps, another CPU could change the variable between them, causing a race condition. Performing the comparison and the update as a single atomic instruction prevents this race and keeps the busy-wait loop short. The pseudocode is given below&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 spin: # see if the lock-variable is clear&lt;br /&gt;
 mov cmos_lock, %eax&lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
 # ok, now we try to grab the lock&lt;br /&gt;
 lock cmpxchg %edx, cmos_lock&lt;br /&gt;
 # did another CPU grab it first? &lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
&lt;br /&gt;
== Hand-off Lock ==&lt;br /&gt;
&lt;br /&gt;
Another type of lock that is not discussed in the text is the &amp;quot;hand-off&amp;quot; lock. The first thread acquires the lock because no other thread holds it. When another thread attempts to acquire the lock, it sees that the lock is in use and adds itself to a queue. That thread can then sleep until it is woken by the thread holding the lock. When the holder is finished, it passes the lock to the next thread in the queue&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. A waiting thread gives up the processor so that other threads (e.g. the thread holding the lock) can run more quickly, and it is woken up when the lock becomes free. Keeping the waiting queue separate from the ready queue speeds up execution. In the example below, the unlock method hands the lock directly to a waiting thread. Interrupts are disabled while the lock state and the queues are updated, and are re-enabled afterwards.&lt;br /&gt;
&lt;br /&gt;
 lock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     if (value == FREE) {&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     } else {&lt;br /&gt;
         add thread to queue of threads waiting for this lock&lt;br /&gt;
         switch to next runnable thread&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
 unlock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     value = FREE&lt;br /&gt;
     if (any thread is waiting for this lock) {&lt;br /&gt;
         move waiting thread from waiting queue to ready queue&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
However, this scheme fails if lock() enables interrupts before calling switch. Suppose lock() adds the thread to the wait queue and then enables interrupts, and an interrupt causes pre-emption, i.e. a switch to another thread. Pre-emption moves the thread to the ready queue, so the thread is now on two queues (wait and ready). In addition, switch itself is likely to be a critical section. Adding the thread to the wait queue and switching to the next thread must therefore be atomic. The solution is for the waiting thread to leave interrupts disabled when it calls switch. The next thread to run has the responsibility of re-enabling interrupts before returning to user code. When the waiting thread wakes up, it returns from switch with interrupts disabled (by the last thread).&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
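&lt;br /&gt;
A user-level analogue of the hand-off lock can be sketched in Python. The sketch rests on assumptions that are not in the source: an internal threading.Lock stands in for disabling interrupts, each waiting thread sleeps on its own threading.Event instead of being moved between scheduler queues, and the names HandOffLock and count_with_handoff are invented for the example.&lt;br /&gt;

```python
import collections
import threading
import time


class HandOffLock:
    """Lock that is passed directly to the longest-waiting thread."""

    def __init__(self):
        self._guard = threading.Lock()        # stands in for disabling interrupts
        self._busy = False                    # FREE or BUSY
        self._waiting = collections.deque()   # FIFO queue of sleeping threads

    def lock(self):
        with self._guard:
            if not self._busy:
                self._busy = True
                return
            wakeup = threading.Event()
            self._waiting.append(wakeup)
        wakeup.wait()     # sleep; the lock already belongs to us on wake-up

    def unlock(self):
        with self._guard:
            if self._waiting:
                self._waiting.popleft().set()   # hand off, value stays BUSY
            else:
                self._busy = False


def count_with_handoff(thread_count, increments):
    """Do a deliberately non-atomic increment under the lock."""
    lock = HandOffLock()
    state = {"value": 0}

    def worker():
        for _ in range(increments):
            lock.lock()
            current = state["value"]
            time.sleep(0)                 # invite a thread switch mid-update
            state["value"] = current + 1
            lock.unlock()

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["value"]
```

Because unlock never marks the lock free while a thread is queued, a newly arriving thread cannot overtake the waiters, which gives FIFO ordering.&lt;br /&gt;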
&lt;br /&gt;
== Avoiding Locks ==&lt;br /&gt;
&lt;br /&gt;
A programmer should try to write programs that avoid locks where possible, since several problems can arise from their use.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One of the most well-known issues is deadlock.  This occurs when threads are waiting to acquire locks that will never be released. For example, suppose two processes A and B both need two shared resources (1 and 2) to complete a task. A deadlock arises when A grabs 1 and B grabs 2: each then waits for the resource held by the other, and neither can change state. The solution is to impose an ordering, so that a process can lock resource 2 only if it has already locked resource 1.&lt;br /&gt;
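&lt;br /&gt;
The ordering rule can be sketched in Python. In this minimal example the two resources are protected by threading.Lock objects keyed 1 and 2, and the function names are assumptions made for illustration. Both processes always take the lower-numbered lock first, even when they name the resources in opposite orders.&lt;br /&gt;

```python
import threading


def use_both(first_id, second_id, locks, log, name):
    """Acquire both locks in ascending id order, whatever the argument order."""
    low, high = sorted((first_id, second_id))
    with locks[low]:
        with locks[high]:
            log.append(name)   # work that needs both resources


def run_two_processes(rounds):
    """A asks for (1, 2) and B asks for (2, 1); neither can deadlock."""
    locks = {1: threading.Lock(), 2: threading.Lock()}
    log = []

    def repeat(first_id, second_id, name):
        for _ in range(rounds):
            use_both(first_id, second_id, locks, log, name)

    a = threading.Thread(target=repeat, args=(1, 2, "A"))
    b = threading.Thread(target=repeat, args=(2, 1, "B"))
    a.start()
    b.start()
    a.join()
    b.join()
    return log
```

Without the sorted call, A could hold lock 1 while B holds lock 2, and each would wait forever for the other.&lt;br /&gt;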
&lt;br /&gt;
Another problem with locks is that performance is not optimal, because a lock is often taken when there is only a chance of conflict.  This pessimistic approach yields slower performance than other methods might achieve.  It also raises the question of granularity, that is, how much code should be protected by a single critical section.  The programmer must decide between many small (fine-grained) locks and fewer, more encompassing (coarse-grained) locks.  This decision can greatly affect the performance of the program.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid locks, lock-free programming techniques can be used. These rely on atomic operations such as compare-and-swap (CAS) and on a well-defined memory ordering such as sequential consistency. The typical pattern involves copying a shared variable to a local variable, performing some speculative work, and attempting to publish the changes using CAS&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 void LockFreeQueue::push(Node* newHead)&lt;br /&gt;
 {&lt;br /&gt;
     for (;;)&lt;br /&gt;
     {&lt;br /&gt;
         // Copy a shared variable (m_Head) to a local.&lt;br /&gt;
         Node* oldHead = m_Head;&lt;br /&gt;
 &lt;br /&gt;
         // Do some speculative work, not yet visible to other threads.&lt;br /&gt;
         newHead-&amp;gt;next = oldHead;&lt;br /&gt;
 &lt;br /&gt;
         // Next, attempt to publish our changes to the shared variable.&lt;br /&gt;
         // If the shared variable hasn't changed, the CAS succeeds and we return.&lt;br /&gt;
         // Otherwise, repeat.&lt;br /&gt;
         if (_InterlockedCompareExchange(&amp;amp;m_Head, newHead, oldHead) == oldHead)&lt;br /&gt;
            return;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
The code above follows a read-modify-write pattern: each thread works on its own local copy of the head pointer and publishes its change only if the shared variable still holds the value that was read. If another thread has changed m_Head in the meantime, the CAS fails and the loop retries with the new value.&lt;br /&gt;
&lt;br /&gt;
= Barrier Implementations =&lt;br /&gt;
&lt;br /&gt;
In many ways, barriers are simpler than locks.  A barrier is a point in a program that all participating threads must reach before any of them is allowed to continue.  When using barriers, a programmer does not have to be concerned with advanced topics such as fairness, which must be considered when programming with locks.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate barrier performance, we can use these two criteria:&lt;br /&gt;
&lt;br /&gt;
*Latency - The time required to enter and exit a barrier.&lt;br /&gt;
*Traffic - Communications overhead required by the barrier.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Latency ===&lt;br /&gt;
&lt;br /&gt;
Ideally the latency of a barrier should be small, but for barriers that don't scale well latency can increase as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
&lt;br /&gt;
Low traffic is also desirable, as excess bus traffic limits scalability.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+Performance Comparison of Barrier Implementations&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! Criteria&lt;br /&gt;
! Sense Reversal&lt;br /&gt;
Centralized Barrier&lt;br /&gt;
! Combining Tree&lt;br /&gt;
Barrier&lt;br /&gt;
! Barrier Network&lt;br /&gt;
(Hardware)&lt;br /&gt;
|-&lt;br /&gt;
| Latency&lt;br /&gt;
| O(1)&lt;br /&gt;
| O(log p)&lt;br /&gt;
| O(log p)&lt;br /&gt;
|-&lt;br /&gt;
| Traffic&lt;br /&gt;
| O(p&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;)&lt;br /&gt;
| O(p)&lt;br /&gt;
| moved to a separate network&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Sense-Reversal Barrier ==&lt;br /&gt;
&lt;br /&gt;
This barrier is a centralized barrier in which a single count variable, protected by a lock, is shared among all the threads. On reaching the barrier, each thread increments the count and then waits until it equals the number of threads participating in the barrier. Since all the threads spin on a single variable, the miss rate scales quadratically with the number of processors.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki01.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A sense-reversing barrier is a more elegant and practical solution to the problem of reusing barriers. A phase’s sense is a Boolean value: true for even-numbered phases and false otherwise. Each SenseBarrier object has a Boolean sense field indicating the sense of the currently executing phase. Each thread keeps its current sense as a thread-local object. Initially the barrier’s sense is the complement of the local sense of all the threads. When a thread calls await(), it checks whether it is the last thread to decrement the counter. If so, it reverses the barrier’s sense and continues. Otherwise, it spins waiting for the barrier’s sense field to change to match its own local sense.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;13body&amp;quot;&amp;gt;[[#13foot|[13]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 struct barrier {&lt;br /&gt;
     shared int count;    // a Fetch&amp;amp;Inc/Dec() object with initial value n;&lt;br /&gt;
                          // this object also supports read and write&lt;br /&gt;
     boolean sense;       // initially FALSE&lt;br /&gt;
     boolean mysense[n];  // initially mysense[i] = TRUE for each 1 ≤ i ≤ n&lt;br /&gt;
 };&lt;br /&gt;
 void await(struct barrier *B) {              // code for process i&lt;br /&gt;
     int position = Get&amp;amp;Dec(B-&amp;gt;count);     // read the old count and decrement it atomically&lt;br /&gt;
     if (position == 1) {                     // last thread to arrive&lt;br /&gt;
         B-&amp;gt;count = n;                     // reset the counter for the next phase&lt;br /&gt;
         B-&amp;gt;sense = B-&amp;gt;mysense[i];      // flip the shared sense to release the waiters&lt;br /&gt;
     }&lt;br /&gt;
     else {&lt;br /&gt;
         while (B-&amp;gt;sense != B-&amp;gt;mysense[i])&lt;br /&gt;
             noop;                            // spin until the shared sense matches&lt;br /&gt;
     }&lt;br /&gt;
     B-&amp;gt;mysense[i] = 1 - B-&amp;gt;mysense[i];  // toggle the local sense for the next phase&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
== Combining Tree Barrier ==&lt;br /&gt;
&lt;br /&gt;
This barrier is a distributed barrier in which groups of processors form clusters, and each cluster updates a local variable. When a local variable reaches the number of threads updating it, the last thread to arrive proceeds to increment a variable one level higher in the combining tree. When the variable at the top of the tree reaches its maximum value, all the threads have reached the barrier and synchronization is complete.&lt;br /&gt;
&lt;br /&gt;
Since the threads update local variables in smaller groups, the miss rate is not as high as in the sense-reversal barrier.&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the operation of a combining tree barrier with the threads divided into two groups. The variables C0 and C1 are local to each group, and C2 is the variable at the next level up in the tree.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki02.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tournament Barrier ==&lt;br /&gt;
&lt;br /&gt;
In a tournament barrier, the threads are the leaves of a binary tree and each internal node represents a flag. Two threads compete with each other; the loser, that is, the later arrival, sets the flag and moves up to the next level, where it competes, again to lose, against the loser from the other subtree. Thus the thread that arrives last sets the flag at the root of the tree. Setting this flag signals to all the threads that the barrier is complete, and synchronization is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki03.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Disseminating Barrier ==&lt;br /&gt;
&lt;br /&gt;
In this barrier, each thread maintains a record of the activity of the other threads. In round i (counting from 0) with n threads, thread A notifies thread (A + 2&amp;lt;sup&amp;gt;i&amp;lt;/sup&amp;gt;) mod n. Thus after ⌈log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; n⌉ rounds every thread knows that every other thread has reached the barrier.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki04.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= References = &lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; David Culler, J.P. Singh and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann. August 1998.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Yan Solihin. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin Books. August 2009.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-memory Multiprocessors. http://www.cc.gatech.edu/classes/AY2010/cs4210_fall/papers/anderson-spinlock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://www.statemaster.com/encyclopedia/Lock-(computer-science)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://faydoc.tripod.com/cpu/cmpxchg.htm&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/slides/threads3.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; Scalability evaluation of barrier algorithms for OpenMP. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; Carwyn Ball and Mark Bull. Barrier Synchronization in Java. http://www.ukhec.ac.uk/publications/reports/synch_java.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.cs.brown.edu/courses/cs176/barrier.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; http://heather.cs.ucdavis.edu/~matloff/50/PLN/lock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/handouts/threads3.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;12foot&amp;quot;&amp;gt;[[#12body|12.]]&amp;lt;/span&amp;gt; http://preshing.com/20120612/an-introduction-to-lock-free-programming&amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74674</id>
		<title>CSC/ECE 506 Spring 2013/9b sc</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74674"/>
		<updated>2013-04-03T23:35:16Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Performance Comparison */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Instruction and Previous Work =&lt;br /&gt;
Changes are made as per the instructions given in https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM/edit&lt;br /&gt;
&lt;br /&gt;
Previous Version can be found here: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2011/ch9_ms&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
In addition to a proper cache coherency model, it is important for a multiprocessor system to provide support for synchronization. This must be provided at the hardware level, and can also be implemented to some degree in software. The most common types of hardware synchronization are locks and barriers, which are discussed in this chapter.&lt;br /&gt;
&lt;br /&gt;
== Hardware vs. Operating System Synchronization ==&lt;br /&gt;
Synchronizations are characterized by three parts: acquire, wait, and release. Both hardware and operating system synchronization rely on atomic instructions for acquire and release, but the implementations for the wait portion of a synchronization can vary. Hardware implementations usually use busy-waiting, where a process repeatedly tests a variable waiting for it to change. The operating system can wait using blocking, where the process suspends itself until woken up by another process.&lt;br /&gt;
&lt;br /&gt;
Blocking has the advantage of freeing up the processor for use by other processes, but requires operating system functions to work and thus has a lot of overhead. Busy-waiting has much less overhead, but uses processor and cache resources while waiting. Because of these tradeoffs, it is usually better to use busy-waiting for short wait periods, and blocking for long wait periods.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Lock Implementations =&lt;br /&gt;
&lt;br /&gt;
Locks are an important concept when programming for multiprocessor systems. The purpose of a lock is to protect the code inside the lock, known as a critical section.  It must be certain that while some thread X has entered the critical section, another thread Y is not also inside of the critical section and possibly modifying critical values. While thread X has entered the critical section, thread Y must wait until X has exited before entering. This can be accomplished in a variety of ways of varying complexity and performance.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate lock performance, we can use these four criteria:&lt;br /&gt;
&lt;br /&gt;
*Acquisition Latency - How much time does it take to acquire the lock?&lt;br /&gt;
*Traffic - How much bus traffic is generated by threads attempting to acquire the lock?&lt;br /&gt;
*Fairness - Are threads granted the lock in the order they requested it (FIFO), or by luck?&lt;br /&gt;
*Storage - How much storage is needed compared to the number of threads?&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Acquisition Latency ===&lt;br /&gt;
We want a low acquisition latency, especially for applications that repeatedly acquire and release locks. Over the run time of a program, the acquisition latency compounded many times could have a large effect on performance. However, low acquisition latency has to be balanced against other factors when implementing a lock.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
Traffic is an important consideration when evaluating lock performance. If acquiring a lock causes a lot of bus traffic, the lock will not scale well as the number of threads increases. Eventually the bus will become choked.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Fairness ===&lt;br /&gt;
In general, threads should acquire a lock in the same order in which they requested it. If this does not happen, as the number of threads increases so does the chance that a thread will become starved.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
The amount of storage needed per lock should be small enough such that it scales to a large number of threads without any problems. If a large amount of storage is required, then multiple threads could cause locks to consume too much memory.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
&lt;br /&gt;
Two common lock implementations are test&amp;amp;set and test-and-test&amp;amp;set. In terms of performance, both have a low storage cost (scalable) and are not fair. Test&amp;amp;set has a low acquisition latency but is not scalable due to high bus traffic when there is a lot of contention for the lock. Test-and-test&amp;amp;set has a higher acquisition latency but scales better because it generates less bus traffic. The next figure shows performance comparisons for these two lock implementations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:lockperf.png|frame|none|Performance comparison of test&amp;amp;set and test-and-test&amp;amp;set (spin on read)&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
=== Improving performance ===&lt;br /&gt;
Optimizing software can improve the performance of locks. For instance, inserting a delay between a processor noticing a release and then trying to acquire a lock can reduce contention and bus traffic, and increase the performance of the locks. TCP-like back off algorithms can be used to adjust this delay depending on the number of threads contending for the lock.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
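&lt;br /&gt;
The backoff idea above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation from the cited paper: the helper name backoff_delays and its base, cap and seed parameters are invented for the example, and the delays are in arbitrary time units.&lt;br /&gt;

```python
import random

def backoff_delays(attempts, base=1, cap=64, seed=0):
    # Hypothetical helper: returns the delay to wait before each retry.
    # The delay window doubles after every failed attempt, up to cap.
    rng = random.Random(seed)
    delays = []
    window = base
    for _ in range(attempts):
        # Randomizing inside the window keeps waiters from retrying in lockstep.
        delays.append(rng.uniform(0, window))
        window = min(cap, window * 2)
    return delays

delays = backoff_delays(10)
```

A thread would sleep for the next delay each time it sees the lock released but fails to acquire it, so heavier contention leads to longer average waits and less bus traffic.&lt;br /&gt;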
&lt;br /&gt;
Another way to improve performance is to insert a delay between memory references so as to limit the bandwidth (traffic) each processor can use for spinning.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One way to guarantee fairness is to queue lock attempts in shared memory so that they happen in order.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Atomic Instructions ==&lt;br /&gt;
&lt;br /&gt;
Since a multiprocessor system cannot disable interrupts as an effective method to execute code atomically, there must be hardware support for atomic operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ability to execute an atomic instruction is a requirement for most lock implementations. It is important that when a processor attempts to set a lock, either the lock is fully set and the thread is able to enter into the critical section, or the lock is not set and it appears that none of the instructions required to set the lock have executed.  &lt;br /&gt;
&lt;br /&gt;
In the x86 instruction set the opcode CMPXCHG (compare and exchange) can be used in a lock implementation in order to guarantee atomicity. The instruction takes a destination and a source operand. The accumulator is compared to the destination and, if they are equal, the destination is loaded with the source. If they are not equal, the accumulator is loaded with the destination value. On a multiprocessor, the opcode must be issued with the LOCK prefix to ensure that it executes atomically. This is useful in implementing some locks, such as ticket locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; In the AT&amp;amp;T syntax used below, the instruction is written LOCK CMPXCHG SOURCE, DESTINATION. The source has to be a register, while the destination can be either a register or memory; the LOCK prefix is only valid with a memory destination. An example of LOCK CMPXCHG appears in the disassembly of the Linux kernel's rtc_cmos_read(). To take the lock, a CPU must read the lock variable and then update it to a nonzero value. However, another CPU may change the variable after it was read but before it is updated, which is a race condition. LOCK CMPXCHG prevents this race by performing the comparison and the update as a single atomic step, which keeps the busy-wait loop short. The pseudocode is given below&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 spin: # see if the lock-variable is clear&lt;br /&gt;
 mov cmos_lock, %eax&lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
 # ok, now we try to grab the lock&lt;br /&gt;
 lock cmpxchg %edx, cmos_lock&lt;br /&gt;
 # did another CPU grab it first? &lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
&lt;br /&gt;
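The semantics of compare-and-exchange can also be illustrated with a small Python model. This is only a sketch: the Cell class and its compare_exchange method are hypothetical names, and an ordinary threading.Lock stands in for the atomicity that the LOCK prefix provides in hardware.&lt;br /&gt;

```python
import threading

class Cell:
    """Hypothetical memory word with a CMPXCHG-like operation."""

    def __init__(self, value=0):
        self.value = value
        self._guard = threading.Lock()  # stands in for the hardware LOCK prefix

    def compare_exchange(self, expected, new):
        # Atomically: store new only if the current value equals expected.
        # The old value is always returned so the caller can tell what happened.
        with self._guard:
            old = self.value
            if old == expected:
                self.value = new
            return old

def try_acquire(cell):
    # The lock is taken only if it was free (0) at the moment of the exchange.
    return cell.compare_exchange(0, 1) == 0

def release(cell):
    cell.value = 0

cmos_lock = Cell()
first = try_acquire(cmos_lock)    # lock was free
second = try_acquire(cmos_lock)   # lock is already held
release(cmos_lock)
third = try_acquire(cmos_lock)    # free again after the release
```

As in the assembly above, a failed attempt leaves the lock variable untouched, so the caller simply retries.&lt;br /&gt;
&lt;br /&gt;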
== Hand-off Lock ==&lt;br /&gt;
&lt;br /&gt;
Another type of lock that is not discussed in the text is known as the &amp;quot;hand-off&amp;quot; lock. In this lock the first thread acquires the lock because no other thread currently holds it. When another thread attempts to acquire the lock, it sees that the lock is in use and adds itself to the queue. That thread can then sleep until it is woken by the thread holding the lock. Once the thread holding the lock is finished, it passes the lock to the next thread in the queue&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. A waiting thread gives up the processor so that other threads (e.g. the thread with the lock) can run more quickly, and it is woken up when the lock is free. Keeping the waiting queue separate from the ready queue speeds up execution. In the example below, the unlock method hands the lock to a waiting thread. Interrupts are disabled while the lock state and the queues are being manipulated and are enabled again afterwards.&lt;br /&gt;
&lt;br /&gt;
 lock () {&lt;br /&gt;
 disable interrupts&lt;br /&gt;
 if (value == FREE) {&lt;br /&gt;
 value = BUSY&lt;br /&gt;
 } else {&lt;br /&gt;
 add thread to queue of threads waiting for&lt;br /&gt;
 this lock&lt;br /&gt;
 switch to next runnable thread&lt;br /&gt;
 }&lt;br /&gt;
 enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
 unlock () {&lt;br /&gt;
 disable interrupts&lt;br /&gt;
 value = FREE&lt;br /&gt;
 if (any thread is waiting for this lock) {&lt;br /&gt;
 move waiting thread from waiting queue to&lt;br /&gt;
 ready queue&lt;br /&gt;
 value = BUSY&lt;br /&gt;
 }&lt;br /&gt;
 enable interrupts&lt;br /&gt;
 } &lt;br /&gt;
&lt;br /&gt;
However, this scheme fails if lock() enables interrupts before it switches to the next thread and an interrupt arrives in between. Consider the sequence: lock() adds the thread to the wait queue, lock() enables interrupts, and an interrupt then causes pre-emption, i.e. a switch to another thread. Pre-emption moves the thread to the ready queue, so the thread is now on two queues (wait and ready). In addition, the switch itself is likely to be a critical section. Adding the thread to the wait queue and switching to the next thread must therefore be atomic. The solution is for the waiting thread to leave interrupts disabled when it calls switch. The next thread to run has the responsibility of re-enabling interrupts before returning to user code. When the waiting thread wakes up, it returns from switch with interrupts disabled (by the last thread).&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Avoiding Locks ==&lt;br /&gt;
&lt;br /&gt;
A programmer should attempt to write programs in such a way as to avoid locks if possible, because many problems can arise with their use.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One of the most well-known issues is deadlock. This can occur when threads are waiting to acquire a lock that will never be released. For example, suppose two processes A and B both need two shared resources (1 and 2) to complete a task. A deadlock arises when A grabs 1 and B grabs 2: each process waits for the resource held by the other, and neither A nor B can change its state. The solution to this problem is to impose an order, so that a process can lock resource 2 only if it has already locked resource 1.&lt;br /&gt;
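&lt;br /&gt;
The ordering rule can be sketched with Python threads. This is an illustrative example only; the names lock_1, lock_2 and worker are assumptions made for the sketch. Because both threads take lock_1 before lock_2, neither can hold the second lock while waiting for the first.&lt;br /&gt;

```python
import threading

lock_1 = threading.Lock()
lock_2 = threading.Lock()
log = []

def worker(name):
    # Every thread takes lock_1 before lock_2, so no cycle of waiters can form.
    with lock_1:
        with lock_2:
            log.append(name)

threads = [threading.Thread(target=worker, args=(n,)) for n in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If one worker instead took lock_2 first, the two threads could each hold one lock and wait forever for the other.&lt;br /&gt;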
&lt;br /&gt;
Another problem with using locks is that the performance is not optimal, as often a lock is used when there is only a chance of conflict. This approach to programming yields slower performance than what might be possible with other methods. This also leads to questions of granularity, that is, how much of the code should be protected under the critical section. The programmer must decide between many small (fine-grain) locks or fewer, more encompassing locks. This decision can greatly affect the performance of the program.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid locks, lock-free programming techniques such as compare-and-swap (CAS) and sequential consistency can be used. The CAS pattern typically involves copying a shared variable to a local variable, performing some speculative work, and attempting to publish the changes using CAS&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 void LockFreeQueue::push(Node* newHead)&lt;br /&gt;
 {&lt;br /&gt;
     for (;;)&lt;br /&gt;
     {&lt;br /&gt;
         // Copy a shared variable (m_Head) to a local.&lt;br /&gt;
         Node* oldHead = m_Head;&lt;br /&gt;
 &lt;br /&gt;
         // Do some speculative work, not yet visible to other threads.&lt;br /&gt;
         newHead-&amp;gt;next = oldHead;&lt;br /&gt;
 &lt;br /&gt;
         // Next, attempt to publish our changes to the shared variable.&lt;br /&gt;
         // If the shared variable hasn't changed, the CAS succeeds and we return.&lt;br /&gt;
         // Otherwise, repeat.&lt;br /&gt;
         if (_InterlockedCompareExchange(&amp;amp;m_Head, newHead, oldHead) == oldHead)&lt;br /&gt;
            return;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
The code above follows a read-modify-write pattern: each thread works on its own local copy and publishes it only if the shared variable has not changed since it was read, so the next thread always sees a correct value. If another thread has changed the value in the meantime, the CAS fails and the loop retries.&lt;br /&gt;
&lt;br /&gt;
= Barrier Implementations =&lt;br /&gt;
&lt;br /&gt;
In many ways, barriers are simpler than locks. A barrier is simply a point in a program that all participating threads must reach before the parallel program is allowed to continue. When using barriers, a programmer does not have to be concerned with advanced topics, such as fairness, that arise when programming with locks.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate barrier performance, we can use these two criteria:&lt;br /&gt;
&lt;br /&gt;
*Latency - The time required to enter and exit a barrier.&lt;br /&gt;
*Traffic - Communications overhead required by the barrier.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Latency ===&lt;br /&gt;
&lt;br /&gt;
Ideally the latency of a barrier should be small, but for barriers that don't scale well latency can increase as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
&lt;br /&gt;
Low traffic is also good, as we do not want excess bus traffic to prevent scalability.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+Performance Comparison of Barrier Implementations&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! Criteria&lt;br /&gt;
! Sense Reversal&lt;br /&gt;
Centralized Barrier&lt;br /&gt;
! Combining Tree&lt;br /&gt;
Barrier&lt;br /&gt;
! Barrier Network&lt;br /&gt;
(Hardware)&lt;br /&gt;
|-&lt;br /&gt;
| Latency&lt;br /&gt;
| O(1)&lt;br /&gt;
| O(log p)&lt;br /&gt;
| O(log p)&lt;br /&gt;
|-&lt;br /&gt;
| Traffic&lt;br /&gt;
| O(p&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;)&lt;br /&gt;
| O(p)&lt;br /&gt;
| moved to a separate network&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Sense-Reversal Barrier ==&lt;br /&gt;
&lt;br /&gt;
This barrier is a centralized barrier in which a single count variable, protected by a lock, is shared among all the threads. Each thread increments the count on reaching the barrier and waits until the count reaches the number of threads for which the barrier was implemented.&lt;br /&gt;
&lt;br /&gt;
Since all the threads spin on a single variable, the miss rate scales quadratically with the number of processors.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki01.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
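&lt;br /&gt;
A centralized barrier of this kind can be sketched in Python as below. The sketch assumes the usual formulation in which waiters spin on a shared sense flag that the last arriving thread flips, which is what makes the barrier reusable; the class name SenseReversalBarrier and the worker function are illustrative, not taken from the cited sources.&lt;br /&gt;

```python
import threading
import time

class SenseReversalBarrier:
    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.count = 0
        self.sense = False               # shared flag that all waiters spin on
        self.lock = threading.Lock()     # protects count
        self.local = threading.local()   # holds each thread's private sense

    def wait(self):
        # Flip this thread's private sense for the new barrier episode.
        my_sense = not getattr(self.local, "sense", False)
        self.local.sense = my_sense
        with self.lock:
            self.count += 1
            if self.count == self.num_threads:
                self.count = 0           # reset so the barrier can be reused
                self.sense = my_sense    # last arrival releases everyone
        while self.sense != my_sense:    # busy-wait on the single shared flag
            time.sleep(0)                # yield so other threads can run

events = []

def worker(barrier, phases):
    for phase in range(phases):
        events.append(phase)             # work belonging to this phase
        barrier.wait()                   # nobody starts the next phase early

barrier = SenseReversalBarrier(4)
threads = [threading.Thread(target=worker, args=(barrier, 3)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every waiter reads the same flag, so each release invalidates the cached copy in every spinning processor, which is the source of the high miss rate described above.&lt;br /&gt;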
&lt;br /&gt;
== Combining Tree Barrier ==&lt;br /&gt;
&lt;br /&gt;
This barrier is a distributed barrier in which groups of processors form clusters, and each cluster updates its own local variable. When a local variable reaches the number of threads that update it, the variable one level higher in the combining tree is incremented. When the variable at the top of the tree reaches its maximum value, all the threads have reached the barrier and synchronization is complete.&lt;br /&gt;
&lt;br /&gt;
Because the threads update local variables in smaller groups, the miss rate is not as high as in the sense-reversal barrier.&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the operation of a combining tree barrier with the threads divided into two groups. The variables C0 and C1 are local to each group, and C2 is the variable at the next higher level of the tree.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki02.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tournament Barrier ==&lt;br /&gt;
&lt;br /&gt;
In a tournament barrier the threads are the leaves of a binary tree, and each internal node represents a flag. Two threads compete with each other; the loser sets the flag, moves up to the next level, and competes there against the loser from the other section of the binary tree. Thus the thread that arrives last sets the highest flag in the tree. Setting that flag indicates to all the threads that the barrier has been completed, and synchronization is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki03.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Disseminating Barrier ==&lt;br /&gt;
&lt;br /&gt;
In this barrier each thread maintains a record of the activity of the other threads. In every round i (starting from 0) with n threads, thread A notifies thread (A + 2&amp;lt;sup&amp;gt;i&amp;lt;/sup&amp;gt;) mod n. Thus after ceil(log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; n) rounds every thread knows whether every other thread has reached the barrier.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki04.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
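&lt;br /&gt;
The notification pattern can be checked with a short Python simulation. It assumes the usual dissemination scheme in which, in round i (counting from 0), thread A notifies thread (A + 2 ** i) mod n; the function name simulate and the known sets are illustrative only.&lt;br /&gt;

```python
def simulate(n):
    # known[t] is the set of threads that t knows have reached the barrier.
    known = [{t} for t in range(n)]
    rounds = (n - 1).bit_length()        # equals ceil(log2(n)) for n of 1 or more
    for i in range(rounds):
        # All notifications of a round are sent using start-of-round knowledge.
        snapshot = [set(k) for k in known]
        for a in range(n):
            partner = (a + 2 ** i) % n
            known[partner].update(snapshot[a])
    return rounds, known

rounds, known = simulate(6)
```

With six threads the simulation needs three rounds, after which every thread's set contains all six thread numbers.&lt;br /&gt;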
&lt;br /&gt;
= References = &lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; David Culler, J.P. Singh and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann. August 1998.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Yan Solihin. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin Books. August 2009.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-memory Multiprocessors. http://www.cc.gatech.edu/classes/AY2010/cs4210_fall/papers/anderson-spinlock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://www.statemaster.com/encyclopedia/Lock-(computer-science)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://faydoc.tripod.com/cpu/cmpxchg.htm&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/slides/threads3.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; Scalability evaluation of barrier algorithms for OpenMP. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; Carwyn Ball and Mark Bull. Barrier Synchronization in Java. http://www.ukhec.ac.uk/publications/reports/synch_java.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.cs.brown.edu/courses/cs176/barrier.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; http://heather.cs.ucdavis.edu/~matloff/50/PLN/lock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/handouts/threads3.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;12foot&amp;quot;&amp;gt;[[#12body|12.]]&amp;lt;/span&amp;gt; http://preshing.com/20120612/an-introduction-to-lock-free-programming&amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74673</id>
		<title>CSC/ECE 506 Spring 2013/9b sc</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74673"/>
		<updated>2013-04-03T23:32:09Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Instruction and Previous Work =&lt;br /&gt;
Changes are made as per the instructions given in https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM/edit&lt;br /&gt;
&lt;br /&gt;
Previous Version can be found here: http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2011/ch9_ms&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
In addition to a proper cache coherency model, it is important for a multiprocessor system to provide support for synchronization. This must be provided at the hardware level, and can also be implemented to some degree in software. The most common types of hardware synchronization are locks and barriers, which are discussed in this chapter.&lt;br /&gt;
&lt;br /&gt;
== Hardware vs. Operating System Synchronization ==&lt;br /&gt;
Synchronizations are characterized by three parts: acquire, wait, and release. Both hardware and operating system synchronization rely on atomic instructions for acquire and release, but the implementations for the wait portion of a synchronization can vary. Hardware implementations usually use busy-waiting, where a process repeatedly tests a variable waiting for it to change. The operating system can wait using blocking, where the process suspends itself until woken up by another process.&lt;br /&gt;
&lt;br /&gt;
Blocking has the advantage of freeing up the processor for use by other processes, but requires operating system functions to work and thus has a lot of overhead. Busy-waiting has much less overhead, but uses processor and cache resources while waiting. Because of these tradeoffs, it is usually better to use busy-waiting for short wait periods, and blocking for long wait periods.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Lock Implementations =&lt;br /&gt;
&lt;br /&gt;
Locks are an important concept when programming for multiprocessor systems. The purpose of a lock is to protect the code inside the lock, known as a critical section.  It must be certain that while some thread X has entered the critical section, another thread Y is not also inside of the critical section and possibly modifying critical values. While thread X has entered the critical section, thread Y must wait until X has exited before entering. This can be accomplished in a variety of ways of varying complexity and performance.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate lock performance, we can use these four criteria:&lt;br /&gt;
&lt;br /&gt;
*Acquisition Latency - How much time does it take to acquire the lock?&lt;br /&gt;
*Traffic - How much bus traffic is generated by threads attempting to acquire the lock?&lt;br /&gt;
*Fairness - Are threads granted the lock in the order they requested it (FIFO), or by luck?&lt;br /&gt;
*Storage - How much storage is needed compared to the number of threads?&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Acquisition Latency ===&lt;br /&gt;
We want a low acquisition latency, especially for applications that repeatedly acquire and release locks. Over the run time of a program, the acquisition latency compounded many times could have a large effect on performance. However, low acquisition latency has to be balanced against other factors when implementing a lock.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
Traffic is an important consideration when evaluating lock performance. If acquiring a lock causes a lot of bus traffic, the lock will not scale well as the number of threads increases. Eventually the bus will become choked.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Fairness ===&lt;br /&gt;
In general, threads should acquire a lock in the same order in which they requested it. If this does not happen, as the number of threads increases so does the chance that a thread will become starved.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
The amount of storage needed per lock should be small enough such that it scales to a large number of threads without any problems. If a large amount of storage is required, then multiple threads could cause locks to consume too much memory.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
&lt;br /&gt;
Two common lock implementations are test&amp;amp;set and test-and-test&amp;amp;set. In terms of performance, both have a low storage cost (scalable) and are not fair. Test&amp;amp;set has a low acquisition latency but is not scalable due to high bus traffic when there is a lot of contention for the lock. Test-and-test&amp;amp;set has a higher acquisition latency but scales better because it generates less bus traffic. The next figure shows performance comparisons for these two lock implementations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:lockperf.png|frame|none|Performance comparison of test&amp;amp;set and test-and-test&amp;amp;set (spin on read)&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
=== Improving performance ===&lt;br /&gt;
Optimizing software can improve the performance of locks. For instance, inserting a delay between the moment a processor notices a release and its attempt to acquire the lock can reduce contention and bus traffic, which improves lock performance. TCP-like backoff algorithms can be used to adjust this delay depending on the number of threads contending for the lock.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
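&lt;br /&gt;
The backoff idea can be sketched as follows. This is a minimal illustration in Python, not code from the cited paper: the function names, the delay unit (microseconds) and the default bounds are all assumptions chosen for the example. The bound on the delay doubles after each failed acquisition attempt, as a rough proxy for the level of contention, and a random wait below that bound keeps contenders from retrying in lockstep.&lt;br /&gt;
&lt;br /&gt;
```python
import random


def backoff_delay(attempt, base_delay=1, max_delay=1024):
    """Upper bound (in microseconds) on the wait before retry `attempt`.

    `attempt` counts failed acquisition attempts, starting at 0. The bound
    doubles after every failure and is capped at max_delay. The names and
    default values are illustrative assumptions.
    """
    return min(max_delay, base_delay * (2 ** attempt))


def jittered_delay(attempt, rng=random):
    """Pick a random wait in [0, bound] so contenders do not retry together."""
    return rng.uniform(0, backoff_delay(attempt))


# Delay bounds for the first twelve failed attempts.
bounds = [backoff_delay(a) for a in range(12)]
```
&lt;br /&gt;
A spinning thread would wait for jittered_delay(attempt) before its next acquisition attempt and reset attempt to 0 once it obtains the lock.&lt;br /&gt;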
&lt;br /&gt;
Another way to improve performance is to insert a delay between memory references so as to limit the bandwidth (traffic) each processor can use for spinning.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One way to guarantee fairness is to queue lock attempts in shared memory so that they happen in order.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
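&lt;br /&gt;
One way to picture such a queue is a ticket lock. The sketch below is an illustration under stated assumptions rather than the scheme from the cited paper: the class name TicketLock and its fields are hypothetical, and a threading.Lock stands in for the atomic fetch-and-increment instruction that real hardware would provide.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time


class TicketLock:
    """FIFO spin lock: threads enter in the order they took tickets."""

    def __init__(self):
        self._next_ticket = 0
        self._now_serving = 0
        # Stand-in for an atomic fetch-and-increment (assumption).
        self._ticket_guard = threading.Lock()

    def acquire(self):
        with self._ticket_guard:
            ticket = self._next_ticket
            self._next_ticket += 1
        # Spin until it is this ticket's turn.
        while self._now_serving != ticket:
            time.sleep(0)  # yield to other threads while spinning
        return ticket

    def release(self):
        # Only the current holder writes this field.
        self._now_serving += 1


lock = TicketLock()
service_order = []


def worker():
    ticket = lock.acquire()
    service_order.append(ticket)  # critical section
    lock.release()


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
&lt;br /&gt;
Because a thread enters only when the serving counter equals its ticket, service_order lists the tickets in the order they were issued.&lt;br /&gt;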
&lt;br /&gt;
== Atomic Instructions ==&lt;br /&gt;
&lt;br /&gt;
Disabling interrupts is not an effective way to execute code atomically on a multiprocessor system, so there must be hardware support for atomic operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ability to execute an atomic instruction is a requirement for most lock implementations. It is important that when a processor attempts to set a lock, either the lock is fully set and the thread is able to enter into the critical section, or the lock is not set and it appears that none of the instructions required to set the lock have executed.  &lt;br /&gt;
&lt;br /&gt;
In the x86 instruction set, the CMPXCHG (compare and exchange) instruction can be used in a lock implementation to guarantee atomicity. It takes a source and a destination operand. The accumulator is compared to the destination; if they are equal, the destination is loaded with the source. If they are not equal, the accumulator is loaded with the destination value. To ensure that this executes atomically on a multiprocessor, the instruction must be issued with the LOCK prefix. This is useful in implementing some locks, such as ticket locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; The syntax for the instruction is LOCK CMPXCHG SOURCE, DESTINATION. The source must be a register, while the destination can be either a register or a memory location (with the LOCK prefix it must be a memory location). An example of LOCK CMPXCHG appears in the disassembly of the Linux kernel's rtc_cmos_read(). To take the lock, a CPU must read the lock variable and then update it to a nonzero value. However, another CPU may change the value after it has been read and before it is updated, which is a race condition. LOCK CMPXCHG prevents this race by performing the comparison and the update as a single atomic step, which also keeps the busy-wait loop brief. The pseudocode is given below&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 spin: # see if the lock-variable is clear&lt;br /&gt;
 mov cmos_lock, %eax&lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
 # ok, now we try to grab the lock&lt;br /&gt;
 lock cmpxchg %edx, cmos_lock&lt;br /&gt;
 # did another CPU grab it first? &lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
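&lt;br /&gt;
The effect of the instruction on its operands can be modelled as a small pure function. The Python sketch below is only a model of the values involved: the function name cmpxchg and its return convention are assumptions made for illustration, and the atomicity that the LOCK prefix provides is not something a model like this can show.&lt;br /&gt;
&lt;br /&gt;
```python
def cmpxchg(accumulator, destination, source):
    """Model the operand values after one CMPXCHG.

    Returns (new_accumulator, new_destination, zero_flag). On real hardware
    with the LOCK prefix the compare and the store are one indivisible step;
    this model only shows the resulting values.
    """
    if accumulator == destination:
        # Match: the destination receives the source and ZF is set.
        return accumulator, source, True
    # Mismatch: the accumulator receives the destination and ZF is cleared.
    return destination, destination, False


# The lock is free (0): the CPU that read 0 installs its nonzero value.
eax, cmos_lock, took_it = cmpxchg(accumulator=0, destination=0, source=1)

# Another CPU already set the lock to 1: nothing is stored, and the
# accumulator comes back nonzero, so the spin loop starts over.
eax2, cmos_lock2, took_it2 = cmpxchg(accumulator=0, destination=1, source=1)
```
&lt;br /&gt;
In the second call the nonzero accumulator corresponds to the final test and jnz in the assembly listing, which send the CPU back to the spin label.&lt;br /&gt;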
&lt;br /&gt;
== Hand-off Lock ==&lt;br /&gt;
&lt;br /&gt;
Another type of lock that is not discussed in the text is known as the &amp;quot;hand-off&amp;quot; lock. The first thread acquires the lock because no other thread holds it. When another thread attempts to acquire the lock, it sees that the lock is in use and adds itself to a queue. It can then sleep until it is woken by the thread holding the lock. When the thread holding the lock is finished, it passes the lock to the next thread in the queue&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. A waiting thread gives up the processor so that other threads (e.g. the thread with the lock) can run more quickly, and it is woken up when the lock becomes free. Keeping the waiting queue separate from the ready queue speeds up execution. In the example below, the unlock method hands the lock to a waiting thread. Both methods disable interrupts while they update the lock state and the queues, and re-enable them before returning.&lt;br /&gt;
&lt;br /&gt;
 lock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     if (value == FREE) {&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     } else {&lt;br /&gt;
         add thread to queue of threads waiting for this lock&lt;br /&gt;
         switch to next runnable thread&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
 unlock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     value = FREE&lt;br /&gt;
     if (any thread is waiting for this lock) {&lt;br /&gt;
         move waiting thread from waiting queue to ready queue&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Care is needed about where interrupts are re-enabled. If lock() adds the thread to the wait queue and then enables interrupts before switching, an interrupt can pre-empt the thread at that point, i.e. cause a switch to another thread. Pre-emption moves the thread to the ready queue, so the thread is now on two queues (wait and ready). The switch itself is also likely to be a critical section. Adding the thread to the wait queue and switching to the next thread must therefore be atomic. The solution is for the waiting thread to leave interrupts disabled when it calls switch. The next thread to run is responsible for re-enabling interrupts before returning to user code. When the waiting thread wakes up, it returns from switch with interrupts disabled (by the last thread).&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Avoiding Locks ==&lt;br /&gt;
&lt;br /&gt;
Programmers should try to write programs that avoid locks where possible, because many problems can arise from their use.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One of the most well-known issues is deadlock. This can occur when threads are waiting to acquire a lock that will never be released. For example, suppose two processes A and B both need two shared resources (1 and 2) to complete a task. A deadlock arises when A grabs 1 and B grabs 2: each then waits for the resource the other holds, and neither A nor B can make progress. The solution is to impose an ordering, so that a process can lock 2 only if it has already locked 1.&lt;br /&gt;
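&lt;br /&gt;
The ordering rule can be sketched with two Python locks. The names lock_1, lock_2 and task are hypothetical; the point is only that both threads take the locks in the same order, so neither can hold the second lock while waiting for the first.&lt;br /&gt;
&lt;br /&gt;
```python
import threading

lock_1 = threading.Lock()
lock_2 = threading.Lock()
completed = []


def task(name):
    # Every thread takes lock_1 first and lock_2 second, so no thread can
    # hold lock_2 while it waits for lock_1.
    with lock_1:
        with lock_2:
            completed.append(name)


threads = [threading.Thread(target=task, args=(name,)) for name in ('A', 'B')]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
&lt;br /&gt;
If one of the threads instead took lock_2 before lock_1, the interleaving described above could leave both threads blocked forever.&lt;br /&gt;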
&lt;br /&gt;
Another problem with locks is that performance is not optimal, because a lock is often used when there is only a chance of conflict. This approach yields slower performance than might be possible with other methods. It also raises the question of granularity, that is, how much of the code should be protected by a critical section. The programmer must decide between many small (fine-grained) locks and fewer, more encompassing locks. This decision can greatly affect the performance of the program.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid locks, lock-free programming techniques based on compare-and-swap (CAS) and sequential consistency can be used. The pattern typically involves copying a shared variable to a local variable, performing some speculative work, and attempting to publish the changes using CAS&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 void LockFreeQueue::push(Node* newHead)&lt;br /&gt;
 {&lt;br /&gt;
     for (;;)&lt;br /&gt;
     {&lt;br /&gt;
         // Copy a shared variable (m_Head) to a local.&lt;br /&gt;
         Node* oldHead = m_Head;&lt;br /&gt;
 &lt;br /&gt;
         // Do some speculative work, not yet visible to other threads.&lt;br /&gt;
         newHead-&amp;gt;next = oldHead;&lt;br /&gt;
 &lt;br /&gt;
         // Next, attempt to publish our changes to the shared variable.&lt;br /&gt;
         // If the shared variable hasn't changed, the CAS succeeds and we return.&lt;br /&gt;
         // Otherwise, repeat.&lt;br /&gt;
         if (_InterlockedCompareExchange(&amp;amp;m_Head, newHead, oldHead) == oldHead)&lt;br /&gt;
            return;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
The code above follows a read-modify-write pattern: each thread works on its own local copy and publishes the result only if the shared variable still holds the value that was read, so the next thread always sees a correct value. If the shared variable has changed in the meantime, the CAS fails and the loop retries.&lt;br /&gt;
&lt;br /&gt;
= Barrier Implementations =&lt;br /&gt;
&lt;br /&gt;
In many ways, barriers are simpler than locks. A barrier is simply a point in a program that all participating threads must reach before the parallel program is allowed to continue. When using barriers, a programmer does not have to be concerned with advanced topics such as fairness, which must be considered when programming with locks.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate barrier performance, we can use these two criteria:&lt;br /&gt;
&lt;br /&gt;
*Latency - The time required to enter and exit a barrier.&lt;br /&gt;
*Traffic - Communications overhead required by the barrier.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Latency ===&lt;br /&gt;
&lt;br /&gt;
Ideally the latency of a barrier should be small, but for barriers that don't scale well latency can increase as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
&lt;br /&gt;
Low traffic is also desirable, as excess bus traffic prevents scalability.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+Performance Comparison of Barrier Implementations&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! Criteria&lt;br /&gt;
! Sense Reversal&lt;br /&gt;
Centralized Barrier&lt;br /&gt;
! Combining Tree&lt;br /&gt;
Barrier&lt;br /&gt;
! Barrier Network&lt;br /&gt;
(Hardware)&lt;br /&gt;
|-&lt;br /&gt;
| Latency&lt;br /&gt;
| O(1)&lt;br /&gt;
| O(log p)&lt;br /&gt;
| O(log p)&lt;br /&gt;
|-&lt;br /&gt;
| Traffic&lt;br /&gt;
| O(p&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;)&lt;br /&gt;
| O(p)&lt;br /&gt;
| moved to a separate network&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki05.png|frame|none|Additive Schwarz Preconditioned Conjugate Gradient (ASPCG) kernel on Altix System - This figure shows the timings of the ASPCG kernel using the different barrier implementations. The blocking barrier does not scale with the number of threads, because contention increases as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki06.png|frame|none|EPCC Microbenchmark - This figure shows the barrier timings from the EPCC Microbenchmark using the different barrier implementations. The blocking barrier/centralized blocking barrier does not scale with the number of threads, because contention increases as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
== Sense-Reversal Barrier ==&lt;br /&gt;
&lt;br /&gt;
This is a centralized barrier in which a single count variable, protected by a lock, is shared among all the threads. Each thread, on reaching the barrier, increments the count and waits until the count reaches the number of threads participating in the barrier. To make the barrier reusable, waiting threads spin on a flag whose sense is reversed in each barrier episode, rather than on the count itself.&lt;br /&gt;
&lt;br /&gt;
Since all the threads spin on a single variable, the miss rate scales quadratically with the number of processors.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki01.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
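&lt;br /&gt;
A sense-reversal barrier can be sketched in Python as follows. This is an illustration under assumptions, not the implementation from the cited sources: the class and field names are hypothetical, and time.sleep(0) is used only to yield the interpreter while spinning. Each thread keeps a private sense that it flips on every barrier episode; the last thread to arrive resets the count and publishes the new sense, which releases the waiters.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time


class SenseReversalBarrier:
    def __init__(self, num_threads):
        self._num_threads = num_threads
        self._count = 0
        self._sense = False              # shared flag that waiters spin on
        self._lock = threading.Lock()    # protects _count
        self._local = threading.local()  # holds each thread's private sense

    def wait(self):
        # Flip this thread's private sense for the new barrier episode.
        my_sense = not getattr(self._local, 'sense', False)
        self._local.sense = my_sense
        with self._lock:
            self._count += 1
            last = self._count == self._num_threads
            if last:
                self._count = 0          # reset for the next episode
        if last:
            self._sense = my_sense       # release all waiting threads
        else:
            while self._sense != my_sense:
                time.sleep(0)            # yield while spinning


barrier = SenseReversalBarrier(3)
log = []


def worker():
    for phase in range(3):
        log.append(phase)
        barrier.wait()


threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
&lt;br /&gt;
No thread can record a later phase until every thread has recorded the current one, so the entries in log appear in nondecreasing phase order.&lt;br /&gt;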
&lt;br /&gt;
== Combining Tree Barrier ==&lt;br /&gt;
&lt;br /&gt;
This is a distributed barrier in which groups of processors form clusters, and each cluster updates its own local variable. When a local variable reaches a value equal to the number of threads updating it, the last thread to arrive proceeds to increment another variable one level higher in the combining tree. When the variable at the top of the combining tree reaches its maximum value, all the threads have reached the barrier and synchronization is complete.&lt;br /&gt;
&lt;br /&gt;
Since the threads update local variables in smaller groups, the miss rate is not as high as in the sense-reversal barrier.&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the operation of a combining tree barrier with the threads divided into two groups. The variables C0 and C1 are local to each group, and C2 is the variable at the next level of the tree.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki02.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tournament Barrier ==&lt;br /&gt;
&lt;br /&gt;
In a tournament barrier, the threads are the leaves of a binary tree and each internal node represents a flag. Threads compete in pairs: the loser sets the flag and moves up a level, where it competes against the loser from the other section of the binary tree. The thread that completes last is therefore the one that sets the highest flag in the binary tree. Setting that flag indicates to all the threads that the barrier has been completed, and synchronization is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki03.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Disseminating Barrier ==&lt;br /&gt;
&lt;br /&gt;
In this barrier, each thread maintains a record of the activity of the other threads. In round i (counting from 0) with n threads, thread A notifies thread (A + 2&amp;lt;sup&amp;gt;i&amp;lt;/sup&amp;gt;) mod n. Thus after log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; n rounds (rounded up), every thread knows whether every other thread has reached the barrier.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki04.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
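&lt;br /&gt;
The notification schedule can be checked with a short simulation. The Python sketch below assumes that rounds are numbered from 0 and that in round i thread A notifies thread (A + 2&amp;lt;sup&amp;gt;i&amp;lt;/sup&amp;gt;) mod n; the function names are hypothetical. It tracks which arrivals each thread has learned about rather than performing real synchronization.&lt;br /&gt;
&lt;br /&gt;
```python
def dissemination_rounds(n):
    """Number of rounds needed for n threads: log2(n) rounded up."""
    return (n - 1).bit_length()


def simulate(n):
    """Return, for each thread, the set of threads it knows have arrived."""
    known = [{t} for t in range(n)]
    for i in range(dissemination_rounds(n)):
        # Messages in one round carry what the sender knew when it started.
        snapshot = [set(k) for k in known]
        for a in range(n):
            partner = (a + 2 ** i) % n
            known[partner] |= snapshot[a]
    return known


result = simulate(6)
```
&lt;br /&gt;
With six threads the schedule takes three rounds, after which every entry of result contains all six thread numbers.&lt;br /&gt;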
&lt;br /&gt;
= References = &lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; David Culler, J.P. Singh and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann. August 1998.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Yan Solihin. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin Books. August 2009.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-memory Multiprocessors. http://www.cc.gatech.edu/classes/AY2010/cs4210_fall/papers/anderson-spinlock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://www.statemaster.com/encyclopedia/Lock-(computer-science)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://faydoc.tripod.com/cpu/cmpxchg.htm&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/slides/threads3.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; Scalability evaluation of barrier algorithms for OpenMP. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; Carwyn Ball and Mark Bull. Barrier Synchronization in Java. http://www.ukhec.ac.uk/publications/reports/synch_java.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.cs.brown.edu/courses/cs176/barrier.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; http://heather.cs.ucdavis.edu/~matloff/50/PLN/lock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/handouts/threads3.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;12foot&amp;quot;&amp;gt;[[#12body|12.]]&amp;lt;/span&amp;gt; http://preshing.com/20120612/an-introduction-to-lock-free-programming&amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74672</id>
		<title>CSC/ECE 506 Spring 2013/9b sc</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74672"/>
		<updated>2013-04-03T23:26:31Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Avoiding Locks */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Synchronization =&lt;br /&gt;
In addition to a proper cache coherency model, it is important for a multiprocessor system to provide support for synchronization. This must be provided at the hardware level, and can also be implemented to some degree in software. The most common types of hardware synchronization are locks and barriers, which are discussed in this chapter.&lt;br /&gt;
&lt;br /&gt;
== Hardware vs. Operating System Synchronization ==&lt;br /&gt;
Synchronizations are characterized by three parts: acquire, wait, and release. Both hardware and operating system synchronization rely on atomic instructions for acquire and release, but the implementations for the wait portion of a synchronization can vary. Hardware implementations usually use busy-waiting, where a process repeatedly tests a variable waiting for it to change. The operating system can wait using blocking, where the process suspends itself until woken up by another process.&lt;br /&gt;
&lt;br /&gt;
Blocking has the advantage of freeing up the processor for use by other processes, but requires operating system functions to work and thus has a lot of overhead. Busy-waiting has much less overhead, but uses processor and cache resources while waiting. Because of these tradeoffs, it is usually better to use busy-waiting for short wait periods, and blocking for long wait periods.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Lock Implementations =&lt;br /&gt;
&lt;br /&gt;
Locks are an important concept when programming for multiprocessor systems. The purpose of a lock is to protect the code inside the lock, known as a critical section.  It must be certain that while some thread X has entered the critical section, another thread Y is not also inside of the critical section and possibly modifying critical values. While thread X has entered the critical section, thread Y must wait until X has exited before entering. This can be accomplished in a variety of ways of varying complexity and performance.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate lock performance, we can use these four criteria:&lt;br /&gt;
&lt;br /&gt;
*Acquisition Latency - How much time does it take to acquire the lock?&lt;br /&gt;
*Traffic - How much bus traffic is generated by threads attempting to acquire the lock?&lt;br /&gt;
*Fairness - Are threads granted the lock in FIFO order, or by luck?&lt;br /&gt;
*Storage - How much storage is needed compared to the number of threads?&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Acquisition Latency ===&lt;br /&gt;
We want a low acquisition latency, especially for applications that repeatedly acquire and release locks. Over the run time of a program, the acquisition latency compounded many times could have a large effect on performance. However, low acquisition latency has to be balanced against other factors when implementing a lock.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
Traffic is an important consideration when evaluating lock performance. If acquiring a lock generates a lot of bus traffic, the lock will not scale well as the number of threads increases, and the bus will eventually saturate.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Fairness ===&lt;br /&gt;
In general, threads should acquire a lock in the same order in which they requested it. If this does not happen, the chance that a thread will starve grows as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
The storage needed per lock should be small enough that it scales to a large number of threads without problems. If each lock requires a large amount of storage, then with many threads the locks could consume too much memory.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
&lt;br /&gt;
Two common lock implementations are test&amp;amp;set and test-and-test&amp;amp;set. Both have a low storage cost, which scales well, and neither is fair. Test&amp;amp;set has a low acquisition latency but does not scale, because it generates heavy bus traffic when there is a lot of contention for the lock. Test-and-test&amp;amp;set has a higher acquisition latency but scales better because it generates less bus traffic. The next figure compares the performance of these two lock implementations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:lockperf.png|frame|none|Performance comparison of test&amp;amp;set and test-and-test&amp;amp;set (spin on read)&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
=== Improving performance ===&lt;br /&gt;
Optimizing software can improve the performance of locks. For instance, inserting a delay between the moment a processor notices a release and its attempt to acquire the lock can reduce contention and bus traffic, which improves lock performance. TCP-like backoff algorithms can be used to adjust this delay depending on the number of threads contending for the lock.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another way to improve performance is to insert a delay between memory references so as to limit the bandwidth (traffic) each processor can use for spinning.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One way to guarantee fairness is to queue lock attempts in shared memory so that they happen in order.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Atomic Instructions ==&lt;br /&gt;
&lt;br /&gt;
Disabling interrupts is not an effective way to execute code atomically on a multiprocessor system, so there must be hardware support for atomic operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ability to execute an atomic instruction is a requirement for most lock implementations. It is important that when a processor attempts to set a lock, either the lock is fully set and the thread is able to enter into the critical section, or the lock is not set and it appears that none of the instructions required to set the lock have executed.  &lt;br /&gt;
&lt;br /&gt;
In the x86 instruction set, the CMPXCHG (compare and exchange) instruction can be used in a lock implementation to guarantee atomicity. It takes a source and a destination operand. The accumulator is compared to the destination; if they are equal, the destination is loaded with the source. If they are not equal, the accumulator is loaded with the destination value. To ensure that this executes atomically on a multiprocessor, the instruction must be issued with the LOCK prefix. This is useful in implementing some locks, such as ticket locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; The syntax for the instruction is LOCK CMPXCHG SOURCE, DESTINATION. The source must be a register, while the destination can be either a register or a memory location (with the LOCK prefix it must be a memory location). An example of LOCK CMPXCHG appears in the disassembly of the Linux kernel's rtc_cmos_read(). To take the lock, a CPU must read the lock variable and then update it to a nonzero value. However, another CPU may change the value after it has been read and before it is updated, which is a race condition. LOCK CMPXCHG prevents this race by performing the comparison and the update as a single atomic step, which also keeps the busy-wait loop brief. The pseudocode is given below&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 spin: # see if the lock-variable is clear&lt;br /&gt;
 mov cmos_lock, %eax&lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
 # ok, now we try to grab the lock&lt;br /&gt;
 lock cmpxchg %edx, cmos_lock&lt;br /&gt;
 # did another CPU grab it first? &lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
&lt;br /&gt;
== Hand-off Lock ==&lt;br /&gt;
&lt;br /&gt;
Another type of lock that is not discussed in the text is known as the &amp;quot;hand-off&amp;quot; lock. The first thread acquires the lock because no other thread holds it. When another thread attempts to acquire the lock, it sees that the lock is in use and adds itself to a queue. It can then sleep until it is woken by the thread holding the lock. When the thread holding the lock is finished, it passes the lock to the next thread in the queue&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. A waiting thread gives up the processor so that other threads (e.g. the thread with the lock) can run more quickly, and it is woken up when the lock becomes free. Keeping the waiting queue separate from the ready queue speeds up execution. In the example below, the unlock method hands the lock to a waiting thread. Both methods disable interrupts while they update the lock state and the queues, and re-enable them before returning.&lt;br /&gt;
&lt;br /&gt;
 lock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     if (value == FREE) {&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     } else {&lt;br /&gt;
         add thread to queue of threads waiting for this lock&lt;br /&gt;
         switch to next runnable thread&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
 unlock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     value = FREE&lt;br /&gt;
     if (any thread is waiting for this lock) {&lt;br /&gt;
         move waiting thread from waiting queue to ready queue&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Care is needed about where interrupts are re-enabled. If lock() adds the thread to the wait queue and then enables interrupts before switching, an interrupt can pre-empt the thread at that point, i.e. cause a switch to another thread. Pre-emption moves the thread to the ready queue, so the thread is now on two queues (wait and ready). The switch itself is also likely to be a critical section. Adding the thread to the wait queue and switching to the next thread must therefore be atomic. The solution is for the waiting thread to leave interrupts disabled when it calls switch. The next thread to run is responsible for re-enabling interrupts before returning to user code. When the waiting thread wakes up, it returns from switch with interrupts disabled (by the last thread).&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Avoiding Locks ==&lt;br /&gt;
&lt;br /&gt;
Programmers should try to write programs that avoid locks where possible, because many problems can arise from their use.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One of the most well-known issues is deadlock. This can occur when threads are waiting to acquire a lock that will never be released. For example, suppose two processes A and B both need two shared resources (1 and 2) to complete a task. A deadlock arises when A grabs 1 and B grabs 2: each then waits for the resource the other holds, and neither A nor B can make progress. The solution is to impose an ordering, so that a process can lock 2 only if it has already locked 1.&lt;br /&gt;
&lt;br /&gt;
Another problem with locks is that performance is not optimal, because a lock is often used when there is only a chance of conflict. This approach yields slower performance than might be possible with other methods. It also raises the question of granularity, that is, how much of the code should be protected by a critical section. The programmer must decide between many small (fine-grained) locks and fewer, more encompassing locks. This decision can greatly affect the performance of the program.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid locks, lock-free programming techniques based on compare-and-swap (CAS) and sequential consistency can be used. The pattern typically involves copying a shared variable to a local variable, performing some speculative work, and attempting to publish the changes using CAS&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 void LockFreeQueue::push(Node* newHead)&lt;br /&gt;
 {&lt;br /&gt;
     for (;;)&lt;br /&gt;
     {&lt;br /&gt;
         // Copy a shared variable (m_Head) to a local.&lt;br /&gt;
         Node* oldHead = m_Head;&lt;br /&gt;
 &lt;br /&gt;
         // Do some speculative work, not yet visible to other threads.&lt;br /&gt;
         newHead-&amp;gt;next = oldHead;&lt;br /&gt;
 &lt;br /&gt;
         // Next, attempt to publish our changes to the shared variable.&lt;br /&gt;
         // If the shared variable hasn't changed, the CAS succeeds and we return.&lt;br /&gt;
         // Otherwise, repeat.&lt;br /&gt;
         if (_InterlockedCompareExchange(&amp;amp;m_Head, newHead, oldHead) == oldHead)&lt;br /&gt;
            return;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
The code above follows a read-modify-write pattern: each thread works on its own local copy of the head pointer and publishes its update only if the shared variable m_Head has not changed in the meantime. If another thread has modified m_Head first, the CAS fails and the loop retries with the new value.&lt;br /&gt;
&lt;br /&gt;
= Barrier Implementations =&lt;br /&gt;
&lt;br /&gt;
In many ways, barriers are simpler than locks.  A barrier is a point in a program that every participating thread must reach before any of them is allowed to continue.  When using barriers, a programmer does not have to be concerned with topics such as fairness, which must be considered when programming with locks.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate barrier performance, we can use these two criteria:&lt;br /&gt;
&lt;br /&gt;
*Latency - The time required to enter and exit a barrier.&lt;br /&gt;
*Traffic - Communications overhead required by the barrier.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Latency ===&lt;br /&gt;
&lt;br /&gt;
Ideally the latency of a barrier should be small, but for barriers that don't scale well latency can increase as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
&lt;br /&gt;
Low traffic is also desirable, as excess bus traffic prevents the barrier from scaling.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+Performance Comparison of Barrier Implementations&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! Criteria&lt;br /&gt;
! Sense Reversal&lt;br /&gt;
Centralized Barrier&lt;br /&gt;
! Combining Tree&lt;br /&gt;
Barrier&lt;br /&gt;
! Barrier Network&lt;br /&gt;
(Hardware)&lt;br /&gt;
|-&lt;br /&gt;
| Latency&lt;br /&gt;
| O(1)&lt;br /&gt;
| O(log p)&lt;br /&gt;
| O(log p)&lt;br /&gt;
|-&lt;br /&gt;
| Traffic&lt;br /&gt;
| O(p&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;)&lt;br /&gt;
| O(p)&lt;br /&gt;
| moved to a separate network&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki05.png|frame|none|Additive Schwarz Preconditioned Conjugate Gradient (ASPCG) kernel on Altix System - This figure shows the timings of the ASPCG kernel using the different barrier implementations. The blocking barrier does not scale with the number of threads, because contention increases as the number of threads grows.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki06.png|frame|none|EPCC Microbenchmark - This figure shows the barrier timings of the EPCC Microbenchmark using the different barrier implementations. The blocking barrier/centralized blocking barrier does not scale with the number of threads, because contention increases as the number of threads grows.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
== Sense-Reversal Barrier ==&lt;br /&gt;
&lt;br /&gt;
This is a centralized barrier in which a single count variable, protected by a lock, is shared among all the threads. Each thread increments the count on reaching the barrier and then waits until the count equals the number of threads participating in the barrier. The last thread to arrive resets the count and flips a shared sense flag, and the waiting threads spin until the flag matches their own local sense, which allows the same barrier to be reused safely.&lt;br /&gt;
&lt;br /&gt;
Since all the threads spin on a single shared variable, the number of cache misses scales quadratically with the number of processors.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki01.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
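&lt;br /&gt;
A centralized sense-reversal barrier of this kind might look as follows in Python. This is a sketch under assumptions rather than the implementation from the cited sources: the class name SenseReversalBarrier and the helper run_phases are hypothetical, a threading.Lock protects the count, and each thread keeps a private sense value that it flips on every use of the barrier. The last thread to arrive resets the count and flips the shared sense flag, which releases the spinning threads.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time


class SenseReversalBarrier:
    # Centralized barrier: one shared counter plus one shared sense flag.
    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.count = 0
        self.sense = False
        self.lock = threading.Lock()     # protects count
        self.local = threading.local()   # holds each thread's private sense

    def wait(self):
        # Flip this thread's private sense for the current barrier episode.
        my_sense = not getattr(self.local, 'sense', False)
        self.local.sense = my_sense
        with self.lock:
            self.count += 1
            last = self.count == self.num_threads
            if last:
                self.count = 0           # reset the counter for reuse
                self.sense = my_sense    # release the waiting threads
        if not last:
            while self.sense != my_sense:
                time.sleep(0)            # spin, yielding to other threads


def run_phases(num_threads=4, phases=3):
    barrier = SenseReversalBarrier(num_threads)
    arrived = [0] * phases
    checks = []
    guard = threading.Lock()

    def worker():
        for phase in range(phases):
            with guard:
                arrived[phase] += 1
            barrier.wait()
            # Past the barrier, every thread must have arrived in this phase.
            checks.append(arrived[phase] == num_threads)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(checks) == num_threads * phases and all(checks)
```
The helper run_phases reuses one barrier object for several phases, which is the case the sense flag is meant to handle: no thread can mistake the release of one phase for the release of the next.&lt;br /&gt;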
&lt;br /&gt;
== Combining Tree Barrier ==&lt;br /&gt;
&lt;br /&gt;
This is a distributed barrier in which the processors are grouped into clusters, and each cluster updates its own local variable. When a local variable reaches a value equal to the number of threads updating it, the next variable up in the combining tree hierarchy is incremented. When the variable at the top of the hierarchy reaches its maximum value, all the threads have reached the barrier and synchronization is complete.&lt;br /&gt;
&lt;br /&gt;
Since the threads update local variables in smaller groups, the miss rate is not as high as in the sense-reversal barrier.&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the operation of a combining tree barrier with the threads divided into two groups. The variables C0 and C1 are local to each group, and C2 is the variable at the higher level of the tree hierarchy.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki02.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
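&lt;br /&gt;
The counting scheme in the diagram can be traced with a small sequential simulation. The function name simulate_combining_tree and its parameters are hypothetical, the tree is assumed to have exactly two levels, and num_threads is assumed to be a multiple of group_size. The simulation processes arrivals one at a time instead of running real threads, so it only illustrates which counter each arrival touches.&lt;br /&gt;
&lt;br /&gt;
```python
def simulate_combining_tree(arrival_order, num_threads, group_size):
    # Leaf counters (C0 and C1 in the diagram) and one root counter (C2).
    num_groups = num_threads // group_size
    leaf = [0] * num_groups
    root = 0
    log = []
    for thread_id in arrival_order:
        group = thread_id // group_size
        leaf[group] += 1
        log.append(('leaf', group, thread_id))
        if leaf[group] == group_size:
            # The last arriver of a group carries the update up the tree.
            root += 1
            log.append(('root', group, thread_id))
    barrier_complete = root == num_groups
    return barrier_complete, log
```
For example, simulate_combining_tree([3, 0, 2, 1], 4, 2) reports the barrier as complete, and only threads 2 and 1 (the last arrivals of their groups) ever update the root counter. Each counter is updated by at most group_size threads, which is why contention on any single variable stays low.&lt;br /&gt;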
&lt;br /&gt;
== Tournament Barrier ==&lt;br /&gt;
&lt;br /&gt;
In a tournament barrier the threads are treated as the leaves of a binary tree, and each node of the tree represents a flag. At each node two threads compete with each other; the loser sets the flag, moves up to the next level, and competes against the loser from the other subtree. Thus the thread that completes last is the one that sets the highest flag in the binary tree. Setting this flag indicates to all the threads that the barrier has been completed, and thus synchronization is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki03.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Disseminating Barrier ==&lt;br /&gt;
&lt;br /&gt;
In this barrier each thread maintains a record of the activity of the other threads. In round i (starting from i = 0) with n threads, thread A notifies thread (A + 2&amp;lt;sup&amp;gt;i&amp;lt;/sup&amp;gt;) mod n. Thus after ⌈log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; n⌉ rounds every thread knows that every other thread has reached the barrier.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki04.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
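&lt;br /&gt;
The notification pattern can be checked with a short sequential simulation. The function name dissemination_rounds is hypothetical, and the sets in known are an illustrative stand-in for the per-thread records; a real barrier would use flags and run the threads concurrently. In round i, thread a passes everything it knows to thread (a + 2**i) mod n.&lt;br /&gt;
&lt;br /&gt;
```python
def dissemination_rounds(n):
    # known[t] is the set of threads that thread t knows have arrived.
    known = [{t} for t in range(n)]
    # Integer form of ceil(log2(n)) for n of at least 1.
    rounds = (n - 1).bit_length()
    for i in range(rounds):
        step = 2 ** i
        # All notifications of a round are sent from the pre-round state.
        snapshot = [set(s) for s in known]
        for a in range(n):
            partner = (a + step) % n
            known[partner] |= snapshot[a]
    return rounds, known
```
With n = 5 the simulation needs three rounds, after which every entry of known contains all five thread ids. The number of threads does not have to be a power of two.&lt;br /&gt;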
&lt;br /&gt;
= References = &lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; David Culler, J.P. Singh and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann. August 1998.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Yan Solihin. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin Books. August 2009.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-memory Multiprocessors. http://www.cc.gatech.edu/classes/AY2010/cs4210_fall/papers/anderson-spinlock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://www.statemaster.com/encyclopedia/Lock-(computer-science)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://faydoc.tripod.com/cpu/cmpxchg.htm&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/slides/threads3.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; Scalability evaluation of barrier algorithms for OpenMP. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; Carwyn Ball and Mark Bull. Barrier Synchronization in Java. http://www.ukhec.ac.uk/publications/reports/synch_java.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.cs.brown.edu/courses/cs176/barrier.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; http://heather.cs.ucdavis.edu/~matloff/50/PLN/lock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/handouts/threads3.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;12foot&amp;quot;&amp;gt;[[#12body|12.]]&amp;lt;/span&amp;gt; http://preshing.com/20120612/an-introduction-to-lock-free-programming&amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74671</id>
		<title>CSC/ECE 506 Spring 2013/9b sc</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74671"/>
		<updated>2013-04-03T23:23:51Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Synchronization =&lt;br /&gt;
In addition to a proper cache coherency model, it is important for a multiprocessor system to provide support for synchronization. This must be provided at the hardware level, and can also be implemented to some degree in software. The most common types of hardware synchronization are locks and barriers, which are discussed in this chapter.&lt;br /&gt;
&lt;br /&gt;
== Hardware vs. Operating System Synchronization ==&lt;br /&gt;
Synchronizations are characterized by three parts: acquire, wait, and release. Both hardware and operating system synchronization rely on atomic instructions for acquire and release, but the implementations for the wait portion of a synchronization can vary. Hardware implementations usually use busy-waiting, where a process repeatedly tests a variable waiting for it to change. The operating system can wait using blocking, where the process suspends itself until woken up by another process.&lt;br /&gt;
&lt;br /&gt;
Blocking has the advantage of freeing up the processor for use by other processes, but requires operating system functions to work and thus has a lot of overhead. Busy-waiting has much less overhead, but uses processor and cache resources while waiting. Because of these tradeoffs, it is usually better to use busy-waiting for short wait periods, and blocking for long wait periods.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Lock Implementations =&lt;br /&gt;
&lt;br /&gt;
Locks are an important concept when programming for multiprocessor systems. The purpose of a lock is to protect the code inside the lock, known as a critical section.  It must be certain that while some thread X has entered the critical section, another thread Y is not also inside of the critical section and possibly modifying critical values. While thread X has entered the critical section, thread Y must wait until X has exited before entering. This can be accomplished in a variety of ways of varying complexity and performance.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate lock performance, we can use these four criteria:&lt;br /&gt;
&lt;br /&gt;
*Acquisition Latency - How much time does it take to acquire the lock?&lt;br /&gt;
*Traffic - How much bus traffic is generated by threads attempting to acquire the lock?&lt;br /&gt;
*Fairness - Do threads acquire the lock in the order they requested it (FIFO), or by luck?&lt;br /&gt;
*Storage - How much storage is needed compared to the number of threads?&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Acquisition Latency ===&lt;br /&gt;
We want a low acquisition latency, especially for applications that repeatedly acquire and release locks. Over the run time of a program, the acquisition latency compounded many times can have a large effect on performance. However, low acquisition latency has to be balanced against other factors when implementing a lock.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
Traffic is an important consideration when evaluating lock performance. If acquiring a lock causes a lot of bus traffic, the lock will not scale well as the number of threads increases. Eventually the bus will become choked.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Fairness ===&lt;br /&gt;
In general, threads should acquire a lock in the same order in which they requested it. If this does not happen, the chance that a thread will be starved grows as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
The amount of storage needed per lock should be small enough such that it scales to a large number of threads without any problems. If a large amount of storage is required, then multiple threads could cause locks to consume too much memory.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
&lt;br /&gt;
Two common lock implementations are test&amp;amp;set and test-and-test&amp;amp;set. In terms of performance, both have a low storage cost (scalable) and neither is fair. Test&amp;amp;set has a low acquisition latency but is not scalable due to high bus traffic when there is a lot of contention for the lock. Test-and-test&amp;amp;set has a higher acquisition latency but scales better because it generates less bus traffic. The next figure shows performance comparisons for these two lock implementations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:lockperf.png|frame|none|Performance comparison of test&amp;amp;set and test-and-test&amp;amp;set (spin on read)&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
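&lt;br /&gt;
The difference between the two schemes is easiest to see in the acquire loop. The sketch below is an assumption-laden illustration in Python, which has no test-and-set instruction: a threading.Lock stands in for the lock word, its non-blocking acquire plays the role of the atomic test-and-set, and the names TTASLock and count_with_lock are hypothetical. The inner loop only reads the lock state (on real hardware this spins in the local cache), and the atomic operation is attempted only once the lock looks free.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time


class TTASLock:
    def __init__(self):
        # threading.Lock stands in for the lock word; a non-blocking acquire
        # plays the role of the atomic test-and-set instruction.
        self._word = threading.Lock()

    def acquire(self):
        while True:
            # Test: spin on a plain read while the lock appears to be held.
            while self._word.locked():
                time.sleep(0)
            # Test-and-set: only now attempt the atomic operation.
            if self._word.acquire(blocking=False):
                return

    def release(self):
        self._word.release()


def count_with_lock(num_threads=4, increments=200):
    lock = TTASLock()
    total = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            total[0] += 1    # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```
Dropping the inner read-only loop turns this into a plain test-and-set lock, where every spin iteration issues the atomic operation and, on real hardware, generates bus traffic.&lt;br /&gt;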
&lt;br /&gt;
=== Improving performance ===&lt;br /&gt;
Optimizing software can improve the performance of locks. For instance, inserting a delay between the moment a processor notices a release and the moment it tries to acquire the lock can reduce contention and bus traffic, and so increase the performance of the lock. TCP-like backoff algorithms can be used to adjust this delay depending on the number of threads contending for the lock.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another way to improve performance is to insert a delay between memory references so as to limit the bandwidth (traffic) each processor can use for spinning.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One way to guarantee fairness is to queue lock attempts in shared memory so that they happen in order.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
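&lt;br /&gt;
One possible form of such a delay is exponential backoff, sketched below in Python. The choice of doubling, the function name acquire_with_backoff, and the default values of base_delay and max_delay are all assumptions made for illustration and are not taken from the cited paper. After each failed attempt the thread sleeps, and the sleep time doubles up to a cap, so heavier contention leads to longer waits between attempts.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time


def acquire_with_backoff(lock, base_delay=0.001, max_delay=0.032):
    # Returns the list of delays slept before the lock was obtained.
    delays = []
    delay = base_delay
    while not lock.acquire(blocking=False):
        time.sleep(delay)
        delays.append(delay)
        delay = min(delay * 2, max_delay)   # double the wait, up to a cap
    return delays
```
If the lock is free, the function returns an empty list straight away. If another thread holds the lock for a while, the returned list shows the growing delays that were used before the acquire finally succeeded.&lt;br /&gt;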
&lt;br /&gt;
== Atomic Instructions ==&lt;br /&gt;
&lt;br /&gt;
Since disabling interrupts is not an effective way to execute code atomically on a multiprocessor system, there must be hardware support for atomic operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ability to execute an atomic instruction is a requirement for most lock implementations. It is important that when a processor attempts to set a lock, either the lock is fully set and the thread is able to enter into the critical section, or the lock is not set and it appears that none of the instructions required to set the lock have executed.  &lt;br /&gt;
&lt;br /&gt;
In the x86 instruction set, the CMPXCHG (compare and exchange) instruction can be used in a lock implementation to guarantee atomicity. It takes a destination and a source operand. The accumulator is compared to the destination and, if they are equal, the destination is loaded with the source. If they are not equal, the accumulator is loaded with the destination value. To ensure that this executes atomically, the instruction must be issued with the LOCK prefix. This is useful in implementing some locks, such as ticket locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; In the AT&amp;amp;T syntax used below, the instruction is written LOCK CMPXCHG source, destination. The source must be a register, while the destination can be either a register or memory.&lt;br /&gt;
&lt;br /&gt;
An example of LOCK CMPXCHG appears in the disassembly of the Linux kernel's rtc_cmos_read(). To take the lock, a CPU must read the lock variable and, if it is clear, update it to a nonzero value. However, another CPU may change the variable after it has been read but before it is updated, which is a race condition. LOCK CMPXCHG prevents this race by performing the comparison and the update as a single atomic step, which also keeps the busy-wait loop brief. The pseudocode is given below&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 spin: # see if the lock-variable is clear&lt;br /&gt;
 mov cmos_lock, %eax&lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
 # ok, now we try to grab the lock&lt;br /&gt;
 lock cmpxchg %edx, cmos_lock&lt;br /&gt;
 # did another CPU grab it first? &lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
&lt;br /&gt;
== Hand-off Lock ==&lt;br /&gt;
&lt;br /&gt;
Another type of lock that is not discussed in the text is known as the &amp;quot;hand-off&amp;quot; lock. The first thread acquires the lock if no other thread currently holds it. When another thread attempts to acquire the lock, it sees that the lock is in use and adds itself to a queue. This thread can then sleep until it is woken by the thread holding the lock. Once the holder is finished, it passes the lock to the next thread in the queue&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. A waiting thread gives up the processor so that other threads (e.g. the thread holding the lock) can run more quickly, and it is woken when the lock becomes free. Keeping the waiting queue separate from the ready queue speeds up execution. In the example below, the unlock method hands the lock directly to a waiting thread. Interrupts are disabled while the lock state and the queues are being modified and are re-enabled afterwards.&lt;br /&gt;
&lt;br /&gt;
 lock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     if (value == FREE) {&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     } else {&lt;br /&gt;
         add thread to queue of threads waiting for this lock&lt;br /&gt;
         switch to next runnable thread&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
 unlock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     value = FREE&lt;br /&gt;
     if (any thread is waiting for this lock) {&lt;br /&gt;
         move waiting thread from waiting queue to ready queue&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
However, this fails if lock() enables interrupts before it switches threads and an interrupt then occurs:&lt;br /&gt;
*lock() adds the thread to the wait queue.&lt;br /&gt;
*lock() enables interrupts.&lt;br /&gt;
*An interrupt causes pre-emption, i.e. a switch to another thread.&lt;br /&gt;
*Pre-emption moves the thread to the ready queue.&lt;br /&gt;
*The thread is now on two queues (wait and ready).&lt;br /&gt;
Also, switch is itself likely to be a critical section, so adding the thread to the wait queue and switching to the next thread must be atomic. The solution is for the waiting thread to leave interrupts disabled when it calls switch. The next thread to run has the responsibility of re-enabling interrupts before returning to user code. When the waiting thread wakes up, it returns from switch with interrupts still disabled (by the last thread).&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
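&lt;br /&gt;
A user-level analogue of the hand-off lock can be sketched in Python. This is an illustration under assumptions, not the kernel code from the cited slides: user code cannot disable interrupts, so an internal threading.Lock named _guard takes that role, each waiter sleeps on its own threading.Event, and the names HandOffLock and count_with_hand_off are hypothetical. When there is a waiter, unlock leaves the lock marked busy and wakes the oldest waiter, so ownership passes directly from one thread to the next.&lt;br /&gt;
&lt;br /&gt;
```python
import collections
import threading


class HandOffLock:
    def __init__(self):
        self._guard = threading.Lock()   # stands in for disabling interrupts
        self._busy = False
        self._waiters = collections.deque()

    def lock(self):
        with self._guard:
            if not self._busy:
                self._busy = True
                return
            # Lock is held: join the wait queue before releasing the guard.
            wakeup = threading.Event()
            self._waiters.append(wakeup)
        wakeup.wait()                    # sleep until the lock is handed over

    def unlock(self):
        with self._guard:
            if self._waiters:
                # Hand the lock straight to the oldest waiter; it stays busy.
                self._waiters.popleft().set()
            else:
                self._busy = False


def count_with_hand_off(num_threads=4, increments=100):
    lock = HandOffLock()
    total = [0]

    def worker():
        for _ in range(increments):
            lock.lock()
            total[0] += 1    # critical section
            lock.unlock()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```
Because a waiter is queued while _guard is still held, and an Event remembers a set() that happens before wait(), a wake-up cannot be lost between queueing and sleeping. This mirrors the requirement that adding the thread to the wait queue and switching away must be atomic.&lt;br /&gt;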
&lt;br /&gt;
== Avoiding Locks ==&lt;br /&gt;
&lt;br /&gt;
Locks can cause a number of problems, so a programmer should try to write programs that avoid them where possible.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One of the most well-known issues is deadlock.  This can occur when threads are waiting to acquire a lock that will never be released. For example, suppose two processes A and B both need two shared resources (1 and 2) to complete a task. A deadlock arises when A grabs 1 and B grabs 2: each process then waits for the resource the other one holds, and neither can make progress. One solution is to impose an ordering, so that a process may lock resource 2 only if it has already locked resource 1.&lt;br /&gt;
&lt;br /&gt;
Another problem with locks is that performance is not optimal, because a lock is often taken when there is only a chance of conflict.  This approach to programming yields slower performance than what might be possible with other methods.  It also raises the question of granularity, that is, how much of the code should be protected by the critical section.  The programmer must decide between many small (fine-grained) locks or fewer, more encompassing locks.  This decision can greatly affect the performance of the program.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid locks, lock-free programming techniques such as compare-and-swap (CAS) loops and sequentially consistent atomic operations can be used. A CAS loop typically involves copying a shared variable to a local variable, performing some speculative work, and attempting to publish the changes using CAS&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 void LockFreeQueue::push(Node* newHead)&lt;br /&gt;
 {&lt;br /&gt;
     for (;;)&lt;br /&gt;
     {&lt;br /&gt;
         // Copy a shared variable (m_Head) to a local.&lt;br /&gt;
         Node* oldHead = m_Head;&lt;br /&gt;
 &lt;br /&gt;
         // Do some speculative work, not yet visible to other threads.&lt;br /&gt;
         newHead-&amp;gt;next = oldHead;&lt;br /&gt;
 &lt;br /&gt;
         // Next, attempt to publish our changes to the shared variable.&lt;br /&gt;
         // If the shared variable hasn't changed, the CAS succeeds and we return.&lt;br /&gt;
         // Otherwise, repeat.&lt;br /&gt;
         if (_InterlockedCompareExchange(&amp;amp;m_Head, newHead, oldHead) == oldHead)&lt;br /&gt;
            return;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
= Barrier Implementations =&lt;br /&gt;
&lt;br /&gt;
In many ways, barriers are simpler than locks.  A barrier is a point in a program that every participating thread must reach before any of them is allowed to continue.  When using barriers, a programmer does not have to be concerned with topics such as fairness, which must be considered when programming with locks.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate barrier performance, we can use these two criteria:&lt;br /&gt;
&lt;br /&gt;
*Latency - The time required to enter and exit a barrier.&lt;br /&gt;
*Traffic - Communications overhead required by the barrier.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Latency ===&lt;br /&gt;
&lt;br /&gt;
Ideally the latency of a barrier should be small, but for barriers that don't scale well latency can increase as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
&lt;br /&gt;
Low traffic is also desirable, as excess bus traffic prevents the barrier from scaling.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+Performance Comparison of Barrier Implementations&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! Criteria&lt;br /&gt;
! Sense Reversal&lt;br /&gt;
Centralized Barrier&lt;br /&gt;
! Combining Tree&lt;br /&gt;
Barrier&lt;br /&gt;
! Barrier Network&lt;br /&gt;
(Hardware)&lt;br /&gt;
|-&lt;br /&gt;
| Latency&lt;br /&gt;
| O(1)&lt;br /&gt;
| O(log p)&lt;br /&gt;
| O(log p)&lt;br /&gt;
|-&lt;br /&gt;
| Traffic&lt;br /&gt;
| O(p&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;)&lt;br /&gt;
| O(p)&lt;br /&gt;
| moved to a separate network&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki05.png|frame|none|Additive Schwarz Preconditioned Conjugate Gradient (ASPCG) kernel on Altix System - This figure shows the timings of the ASPCG kernel using the different barrier implementations. The blocking barrier does not scale with the number of threads, because contention increases as the number of threads grows.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki06.png|frame|none|EPCC Microbenchmark - This figure shows the barrier timings of the EPCC Microbenchmark using the different barrier implementations. The blocking barrier/centralized blocking barrier does not scale with the number of threads, because contention increases as the number of threads grows.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
== Sense-Reversal Barrier ==&lt;br /&gt;
&lt;br /&gt;
This is a centralized barrier in which a single count variable, protected by a lock, is shared among all the threads. Each thread increments the count on reaching the barrier and then waits until the count equals the number of threads participating in the barrier. The last thread to arrive resets the count and flips a shared sense flag, and the waiting threads spin until the flag matches their own local sense, which allows the same barrier to be reused safely.&lt;br /&gt;
&lt;br /&gt;
Since all the threads spin on a single shared variable, the number of cache misses scales quadratically with the number of processors.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki01.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Combining Tree Barrier ==&lt;br /&gt;
&lt;br /&gt;
This is a distributed barrier in which the processors are grouped into clusters, and each cluster updates its own local variable. When a local variable reaches a value equal to the number of threads updating it, the next variable up in the combining tree hierarchy is incremented. When the variable at the top of the hierarchy reaches its maximum value, all the threads have reached the barrier and synchronization is complete.&lt;br /&gt;
&lt;br /&gt;
Since the threads update local variables in smaller groups, the miss rate is not as high as in the sense-reversal barrier.&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the operation of a combining tree barrier with the threads divided into two groups. The variables C0 and C1 are local to each group, and C2 is the variable at the higher level of the tree hierarchy.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki02.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tournament Barrier ==&lt;br /&gt;
&lt;br /&gt;
In a tournament barrier the threads are treated as the leaves of a binary tree, and each node of the tree represents a flag. At each node two threads compete with each other; the loser sets the flag, moves up to the next level, and competes against the loser from the other subtree. Thus the thread that completes last is the one that sets the highest flag in the binary tree. Setting this flag indicates to all the threads that the barrier has been completed, and thus synchronization is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki03.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Disseminating Barrier ==&lt;br /&gt;
&lt;br /&gt;
In this barrier each thread maintains a record of the activity of the other threads. In round i (starting from i = 0) with n threads, thread A notifies thread (A + 2&amp;lt;sup&amp;gt;i&amp;lt;/sup&amp;gt;) mod n. Thus after ⌈log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; n⌉ rounds every thread knows that every other thread has reached the barrier.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki04.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= References = &lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; David Culler, J.P. Singh and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann. August 1998.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Yan Solihin. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin Books. August 2009.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-memory Multiprocessors. http://www.cc.gatech.edu/classes/AY2010/cs4210_fall/papers/anderson-spinlock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://www.statemaster.com/encyclopedia/Lock-(computer-science)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://faydoc.tripod.com/cpu/cmpxchg.htm&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/slides/threads3.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; Scalability evaluation of barrier algorithms for OpenMP. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; Carwyn Ball and Mark Bull. Barrier Synchronization in Java. http://www.ukhec.ac.uk/publications/reports/synch_java.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.cs.brown.edu/courses/cs176/barrier.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; http://heather.cs.ucdavis.edu/~matloff/50/PLN/lock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/handouts/threads3.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;12foot&amp;quot;&amp;gt;[[#12body|12.]]&amp;lt;/span&amp;gt; http://preshing.com/20120612/an-introduction-to-lock-free-programming&amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74670</id>
		<title>CSC/ECE 506 Spring 2013/9b sc</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74670"/>
		<updated>2013-04-03T23:23:01Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Avoiding Locks */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Synchronization =&lt;br /&gt;
In addition to a proper cache coherency model, it is important for a multiprocessor system to provide support for synchronization. This must be provided at the hardware level, and can also be implemented to some degree in software. The most common types of hardware synchronization are locks and barriers, which are discussed in this chapter.&lt;br /&gt;
&lt;br /&gt;
== Hardware vs. Operating System Synchronization ==&lt;br /&gt;
Synchronizations are characterized by three parts: acquire, wait, and release. Both hardware and operating system synchronization rely on atomic instructions for acquire and release, but the implementations for the wait portion of a synchronization can vary. Hardware implementations usually use busy-waiting, where a process repeatedly tests a variable waiting for it to change. The operating system can wait using blocking, where the process suspends itself until woken up by another process.&lt;br /&gt;
&lt;br /&gt;
Blocking has the advantage of freeing up the processor for use by other processes, but requires operating system functions to work and thus has a lot of overhead. Busy-waiting has much less overhead, but uses processor and cache resources while waiting. Because of these tradeoffs, it is usually better to use busy-waiting for short wait periods, and blocking for long wait periods.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Lock Implementations =&lt;br /&gt;
&lt;br /&gt;
Locks are an important concept when programming for multiprocessor systems. The purpose of a lock is to protect the code it guards, known as a critical section.  A lock must guarantee that while some thread X is inside the critical section, no other thread Y is also inside it and possibly modifying critical values: thread Y must wait until X has exited before entering. This can be accomplished in a variety of ways that differ in complexity and performance.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate lock performance, we can use these four criteria:&lt;br /&gt;
&lt;br /&gt;
*Acquisition Latency - How much time does it take to acquire the lock?&lt;br /&gt;
*Traffic - How much bus traffic is generated by threads attempting to acquire the lock?&lt;br /&gt;
*Fairness - Are threads granted the lock in the order they requested it (FIFO), or by luck?&lt;br /&gt;
*Storage - How much storage is needed compared to the number of threads?&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Acquisition Latency ===&lt;br /&gt;
We want a low acquisition latency, especially for applications that repeatedly acquire and release locks. Over the run time of a program, the acquisition latency compounded many times could have a large effect on performance. However, low acquisition latency has to be balanced against other factors when implementing a lock.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
Traffic is an important consideration when evaluating lock performance. If acquiring a lock causes a lot of bus traffic, the lock will not scale well as the number of threads increases, and eventually the bus will saturate.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Fairness ===&lt;br /&gt;
In general, threads should acquire a lock in the same order in which they requested it. If this does not happen, the chance that a thread will starve grows as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
The amount of storage needed per lock should be small enough such that it scales to a large number of threads without any problems. If a large amount of storage is required, then multiple threads could cause locks to consume too much memory.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
&lt;br /&gt;
Two common lock implementations are test&amp;amp;set and test-and-test&amp;amp;set. In terms of performance, both have a low storage cost (which scales well) and neither is fair. Test&amp;amp;set has a low acquisition latency but is not scalable due to high bus traffic when there is a lot of contention for the lock. Test-and-test&amp;amp;set has a higher acquisition latency but scales better because it generates less bus traffic. The next figure shows performance comparisons for these two lock implementations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:lockperf.png|frame|none|Performance comparison of test&amp;amp;set and test-and-test&amp;amp;set (spin on read)&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
=== Improving performance ===&lt;br /&gt;
Software optimizations can improve the performance of locks. For instance, inserting a delay between the moment a processor notices a release and its attempt to acquire the lock reduces contention and bus traffic, which improves lock performance. TCP-like backoff algorithms can be used to adjust this delay according to the number of threads contending for the lock.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another way to improve performance is to insert a delay between memory references so as to limit the bandwidth (traffic) each processor can use for spinning.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One way to guarantee fairness is to queue lock attempts in shared memory so that they happen in order.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
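&lt;br /&gt;
One simple way to serve lock requests in order is a ticket lock, sketched below in Python. This is not the array-based queue lock from the cited paper; it is a related FIFO scheme chosen for brevity. The class name TicketLock and the helper run_ticket_demo are hypothetical, and a small threading.Lock is assumed as a stand-in for a hardware fetch-and-increment instruction.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time


class TicketLock:
    """FIFO spin lock: threads are served in the order they took a ticket."""

    def __init__(self):
        self._next_ticket = 0
        self._now_serving = 0
        # Assumption: stands in for an atomic fetch-and-increment instruction.
        self._atomic = threading.Lock()

    def acquire(self):
        with self._atomic:
            my_ticket = self._next_ticket
            self._next_ticket += 1
        # Busy-wait until our ticket number comes up.
        while self._now_serving != my_ticket:
            time.sleep(0)  # yield so that the lock holder can run

    def release(self):
        # Only the current holder writes _now_serving.
        self._now_serving += 1


def run_ticket_demo(num_threads=4, iterations=200):
    lock = TicketLock()
    total = [0]

    def worker():
        for _ in range(iterations):
            lock.acquire()
            total[0] += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


print(run_ticket_demo())  # 800: no increment is lost
```
&lt;br /&gt;
Because every waiter spins on the same now-serving counter, this sketch is fair but still generates traffic on each release; queue locks address that by giving each waiter its own location to spin on.&lt;br /&gt;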
&lt;br /&gt;
== Atomic Instructions ==&lt;br /&gt;
&lt;br /&gt;
Since a multiprocessor system cannot disable interrupts as an effective method to execute code atomically, there must be hardware support for atomic operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ability to execute an atomic instruction is a requirement for most lock implementations. It is important that when a processor attempts to set a lock, either the lock is fully set and the thread is able to enter into the critical section, or the lock is not set and it appears that none of the instructions required to set the lock have executed.  &lt;br /&gt;
&lt;br /&gt;
In the x86 instruction set, the CMPXCHG (compare and exchange) instruction can be used in a lock implementation to guarantee atomicity. The instruction takes a destination operand and a source operand. The accumulator is compared with the destination; if they are equal, the destination is loaded with the source. If they are not equal, the accumulator is loaded with the destination value. To ensure that this executes atomically on a multiprocessor, the instruction must be issued with the LOCK prefix. This is useful in implementing some locks, such as ticket locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; In AT&amp;amp;T syntax the instruction is written LOCK CMPXCHG SOURCE, DESTINATION. The source must be a register, while the destination can be either a register or memory. An example of LOCK CMPXCHG appears in the disassembly of the Linux kernel's rtc_cmos_read(). To take the lock, a CPU must read the lock variable and, if it is zero, update it to a nonzero value. However, another CPU may change the lock variable after it has been read but before it is updated, which is a race condition. LOCK CMPXCHG closes this window by performing the comparison and the update as one atomic step, which keeps the busy-wait loop short. The code is given below&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 spin: # see if the lock-variable is clear&lt;br /&gt;
 mov cmos_lock, %eax&lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
 # ok, now we try to grab the lock&lt;br /&gt;
 lock cmpxchg %edx, cmos_lock&lt;br /&gt;
 # did another CPU grab it first? &lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
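&lt;br /&gt;
The compare-and-exchange rule used by the loop above can be modelled in a few lines of Python. This is a sketch of the semantics only, not of the hardware: the class name Cell and the method cmpxchg are hypothetical, and a threading.Lock is assumed as a stand-in for the atomicity that the LOCK prefix provides.&lt;br /&gt;
&lt;br /&gt;
```python
import threading


class Cell:
    """A memory word that supports an atomic compare-and-exchange."""

    def __init__(self, value=0):
        self.value = value
        # Assumption: models the atomicity provided by the LOCK prefix.
        self._guard = threading.Lock()

    def cmpxchg(self, accumulator, source):
        """Return (new_accumulator, succeeded), mirroring CMPXCHG."""
        with self._guard:
            if self.value == accumulator:
                self.value = source      # destination := source
                return accumulator, True
            return self.value, False     # accumulator := destination


cmos_lock = Cell(0)
# First attempt: the lock word is 0, so it is set to 1 and we own the lock.
print(cmos_lock.cmpxchg(0, 1))  # (0, True)
# Second attempt: the word is already 1, so the accumulator receives 1.
print(cmos_lock.cmpxchg(0, 1))  # (1, False)
```
&lt;br /&gt;
A nonzero accumulator after the instruction corresponds to the final jnz spin in the assembly: another CPU took the lock first, so the caller must retry.&lt;br /&gt;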
&lt;br /&gt;
== Hand-off Lock ==&lt;br /&gt;
&lt;br /&gt;
Another type of lock that is not discussed in the text is known as the &amp;quot;hand-off&amp;quot; lock. In this lock the first thread acquires the lock because no other thread currently holds it. When another thread attempts to acquire the lock, it sees that the lock is in use and adds itself to a queue. Once queued, this thread can sleep until it is woken by the thread holding the lock. When the holder is finished, it passes the lock to the next thread in the queue&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. A waiting thread gives up the processor so that other threads (e.g. the thread holding the lock) can run more quickly, and it is woken up when the lock is handed to it. Keeping the waiting queue separate from the ready queue speeds up execution, because the scheduler only considers threads that are able to run. In the example below, the unlock method gives the lock directly to a waiting thread. Interrupts are disabled while the lock value and the queues are being manipulated, and are re-enabled afterwards.&lt;br /&gt;
&lt;br /&gt;
 lock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     if (value == FREE) {&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     } else {&lt;br /&gt;
         add thread to queue of threads waiting for this lock&lt;br /&gt;
         switch to next runnable thread&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 unlock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     value = FREE&lt;br /&gt;
     if (any thread is waiting for this lock) {&lt;br /&gt;
         move waiting thread from waiting queue to ready queue&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
However, this scheme fails if lock() re-enables interrupts before it switches threads. Suppose lock() adds the thread to the wait queue and then enables interrupts. An interrupt can now cause pre-emption, i.e. a switch to another thread, and pre-emption moves the current thread to the ready queue. The thread is then on two queues (wait and ready) at once. In addition, the switch itself is likely to be a critical section. Adding the thread to the wait queue and switching to the next thread must therefore be atomic. The solution is for the waiting thread to leave interrupts disabled when it calls switch. The next thread to run is responsible for re-enabling interrupts before returning to user code. When the waiting thread wakes up, it returns from switch with interrupts disabled (by the previous thread).&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
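&lt;br /&gt;
The hand-off idea can also be sketched at user level in Python. This is an illustration under stated assumptions, not kernel code: the class name HandOffLock and the helper run_handoff_demo are hypothetical, a small internal threading.Lock plays the role that disabling interrupts plays above, and a threading.Event per waiter stands in for putting a thread to sleep and waking it.&lt;br /&gt;
&lt;br /&gt;
```python
import collections
import threading


class HandOffLock:
    """Lock whose release passes ownership directly to the oldest waiter."""

    def __init__(self):
        self._busy = False
        self._waiters = collections.deque()
        # Assumption: plays the role of disabling interrupts in the pseudocode.
        self._guard = threading.Lock()

    def lock(self):
        with self._guard:
            if not self._busy:
                self._busy = True
                return
            wakeup = threading.Event()
            self._waiters.append(wakeup)
        # Sleep until the holder hands the lock over. If set() has already
        # been called, wait() returns at once, so no wakeup is lost.
        wakeup.wait()

    def unlock(self):
        with self._guard:
            if self._waiters:
                # Hand off: the lock stays BUSY and the waiter now owns it.
                self._waiters.popleft().set()
            else:
                self._busy = False


def run_handoff_demo(num_threads=4, iterations=100):
    lock = HandOffLock()
    total = [0]

    def worker():
        for _ in range(iterations):
            lock.lock()
            total[0] += 1  # critical section
            lock.unlock()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


print(run_handoff_demo())  # 400
```
&lt;br /&gt;
Queuing the waiter and releasing the guard are separated from the sleep here, which is safe only because the event remembers a wakeup that arrives early; the kernel version has no such memory, which is why it must keep interrupts disabled across the switch.&lt;br /&gt;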
&lt;br /&gt;
== Avoiding Locks ==&lt;br /&gt;
&lt;br /&gt;
Programmers should try to write programs that avoid locks where possible, because locks can cause a number of problems.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One of the most well-known issues is deadlock.  This occurs when threads are waiting to acquire a lock that will never be released. For example, suppose two processes A and B each need two shared resources (1 and 2) to complete a task. A deadlock arises when A grabs 1 and B grabs 2: each then waits for the resource held by the other, and neither can make progress. One solution is to impose an ordering, so that a process may lock resource 2 only if it has already locked resource 1.&lt;br /&gt;
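&lt;br /&gt;
The ordering rule can be illustrated with a short Python sketch. The names lock_1, lock_2 and task are hypothetical; the point is only that both threads take the locks in the same order, so a cycle of waiting cannot form.&lt;br /&gt;
&lt;br /&gt;
```python
import threading

lock_1 = threading.Lock()
lock_2 = threading.Lock()
completed = []


def task(name):
    # Every thread takes lock_1 before lock_2. A thread that holds lock_2
    # therefore already holds lock_1, and nobody can be waiting in a cycle.
    with lock_1:
        with lock_2:
            completed.append(name)


threads = [threading.Thread(target=task, args=(n,)) for n in ('A', 'B')]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(completed))  # ['A', 'B']
```
&lt;br /&gt;
If one of the two threads took lock_2 first instead, each thread could end up holding one lock while waiting for the other, which is the deadlock described above.&lt;br /&gt;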
&lt;br /&gt;
Another problem with locks is that performance is not optimal, because a lock is often taken when there is only a chance of conflict.  This approach yields slower performance than might be possible with other methods.  It also raises the question of granularity, that is, how much of the code should be protected by a critical section.  The programmer must decide between many small (fine-grained) locks and fewer, more encompassing locks.  This decision can greatly affect the performance of the program.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid locks, lock-free programming techniques can be used; these rely on atomic operations such as compare-and-swap (CAS) and on sequentially consistent memory ordering. A typical pattern involves copying a shared variable to a local variable, performing some speculative work, and attempting to publish the changes using CAS&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 void LockFreeQueue::push(Node* newHead)&lt;br /&gt;
 {&lt;br /&gt;
     for (;;)&lt;br /&gt;
     {&lt;br /&gt;
         // Copy a shared variable (m_Head) to a local.&lt;br /&gt;
         Node* oldHead = m_Head;&lt;br /&gt;
 &lt;br /&gt;
         // Do some speculative work, not yet visible to other threads.&lt;br /&gt;
         newHead-&amp;gt;next = oldHead;&lt;br /&gt;
 &lt;br /&gt;
         // Next, attempt to publish our changes to the shared variable.&lt;br /&gt;
         // If the shared variable hasn't changed, the CAS succeeds and we return.&lt;br /&gt;
         // Otherwise, repeat.&lt;br /&gt;
         if (_InterlockedCompareExchange(&amp;amp;m_Head, newHead, oldHead) == oldHead)&lt;br /&gt;
            return;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
= Barrier Implementations =&lt;br /&gt;
&lt;br /&gt;
In many ways, barriers are simpler than locks.  A barrier is simply a point in a program that all participating threads must reach before any of them is allowed to continue.  When using barriers, a programmer does not have to be concerned with advanced topics such as fairness, which must be considered when programming with locks.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate barrier performance, we can use these two criteria:&lt;br /&gt;
&lt;br /&gt;
*Latency - The time required to enter and exit a barrier.&lt;br /&gt;
*Traffic - Communications overhead required by the barrier.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Latency ===&lt;br /&gt;
&lt;br /&gt;
Ideally the latency of a barrier should be small, but for barriers that don't scale well latency can increase as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
&lt;br /&gt;
Low traffic is also desirable, as excess bus traffic limits scalability.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+Performance Comparison of Barrier Implementations&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! Criteria&lt;br /&gt;
! Sense-Reversal Centralized Barrier&lt;br /&gt;
! Combining Tree Barrier&lt;br /&gt;
! Barrier Network (Hardware)&lt;br /&gt;
|-&lt;br /&gt;
| Latency&lt;br /&gt;
| O(1)&lt;br /&gt;
| O(log p)&lt;br /&gt;
| O(log p)&lt;br /&gt;
|-&lt;br /&gt;
| Traffic&lt;br /&gt;
| O(p&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;)&lt;br /&gt;
| O(p)&lt;br /&gt;
| moved to a separate network&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki05.png|frame|none|Additive Schwarz Preconditioned Conjugate Gradient (ASPCG) kernel on Altix System - This figure shows the timings of the ASPCG kernel using the different barrier implementations. The blocking barrier does not scale, because contention increases with the number of threads.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki06.png|frame|none|EPCC Microbenchmark - This figure shows the barrier timings of the EPCC microbenchmark using the different barrier implementations. The blocking barrier/centralized blocking barrier does not scale, because contention increases with the number of threads.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
== Sense-Reversal Barrier ==&lt;br /&gt;
&lt;br /&gt;
This is a centralized barrier in which a single count variable, protected by a lock, is shared among all the threads. On reaching the barrier, each thread increments the count and then waits until the count equals the number of threads participating in the barrier.&lt;br /&gt;
&lt;br /&gt;
Since all the threads spin on a single variable, the miss rate scales quadratically with the number of processors.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki01.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
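&lt;br /&gt;
A centralized barrier with sense reversal can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the class name SenseReversalBarrier and the helper run_barrier_demo are hypothetical, and the spin loop yields with time.sleep(0) as an assumption to keep the example friendly to the Python interpreter.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time


class SenseReversalBarrier:
    """Centralized barrier: one shared counter plus a shared sense flag."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.count = 0
        self.sense = False
        self._lock = threading.Lock()      # protects count
        self._local = threading.local()    # per-thread private sense

    def wait(self):
        # Each thread flips its private sense on every barrier episode.
        local_sense = not getattr(self._local, 'sense', False)
        self._local.sense = local_sense
        with self._lock:
            self.count += 1
            last = self.count == self.num_threads
            if last:
                self.count = 0            # reset for the next episode
                self.sense = local_sense  # release the waiting threads
        if not last:
            while self.sense != local_sense:
                time.sleep(0)             # spin until the flag flips


def run_barrier_demo(num_threads=4, phases=3):
    barrier = SenseReversalBarrier(num_threads)
    arrived = [0] * phases
    guard = threading.Lock()
    checks = []

    def worker():
        for p in range(phases):
            with guard:
                arrived[p] += 1
            barrier.wait()
            # Past the barrier, every thread must have arrived in phase p.
            checks.append(arrived[p] == num_threads)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(checks) == num_threads * phases and all(checks)


print(run_barrier_demo())  # True
```
&lt;br /&gt;
Reversing the sense on each episode lets the same counter and flag be reused immediately, without a second barrier to make sure all threads have left the previous one.&lt;br /&gt;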
&lt;br /&gt;
== Combining Tree Barrier ==&lt;br /&gt;
&lt;br /&gt;
This is a distributed barrier in which processors are grouped into clusters, and each cluster updates its own local variable. When a local variable reaches the number of threads that update it, the last thread to arrive goes on to increment a variable one level higher in the combining tree. When the variable at the top of the tree reaches its maximum value, all the threads have reached the barrier and synchronization is complete.&lt;br /&gt;
&lt;br /&gt;
Since the threads update local variables in smaller groups, the miss rate is lower than for the sense-reversal barrier.&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the operation of a combining tree barrier with the threads divided into two groups. The variables C0 and C1 are local to each group, and C2 is the variable at the next level up in the tree.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki02.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
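&lt;br /&gt;
The counting scheme in the diagram can be modelled with a small single-threaded Python sketch. The class name CombiningNode is hypothetical, a group size of four threads is assumed, and the arrivals are replayed in a fixed order, so the atomic updates a real barrier needs are deliberately left out.&lt;br /&gt;
&lt;br /&gt;
```python
class CombiningNode:
    """One counter in a combining tree."""

    def __init__(self, expected, parent=None):
        self.expected = expected  # arrivals this counter waits for
        self.count = 0
        self.parent = parent

    def arrive(self):
        """Record one arrival; return True if it completed the root."""
        self.count += 1
        if self.count != self.expected:
            return False
        if self.parent is None:
            return True
        # The last arrival in a group carries the update one level up.
        return self.parent.arrive()


# Two groups of four threads each, as in the diagram: C0 and C1 feed C2.
c2 = CombiningNode(expected=2)
c0 = CombiningNode(expected=4, parent=c2)
c1 = CombiningNode(expected=4, parent=c2)

results = [leaf.arrive() for leaf in [c0, c1, c0, c0, c1, c1, c1, c0]]
# Only the eighth arrival completes the barrier.
print(results.index(True), c2.count)  # 7 2
```
&lt;br /&gt;
Each of C0 and C1 is updated four times and C2 only twice, so no single variable is touched by all eight threads.&lt;br /&gt;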
&lt;br /&gt;
== Tournament Barrier ==&lt;br /&gt;
&lt;br /&gt;
In a tournament barrier, the threads are the leaves of a binary tree and each internal node represents a flag. At each level the threads are paired off; in each pair the thread that arrives later (the &amp;quot;loser&amp;quot;) sets the node's flag and moves up to the next level, where it is paired with the loser from the neighboring subtree. The thread that arrives last overall therefore sets the flag at the root of the tree. Setting the root flag signals to all threads that the barrier is complete, and synchronization is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki03.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Disseminating Barrier ==&lt;br /&gt;
&lt;br /&gt;
In this barrier each thread maintains a record of the activity of the other threads. In round i (counting from 0) with n threads, thread A notifies thread (A + 2&amp;lt;sup&amp;gt;i&amp;lt;/sup&amp;gt;) mod n. Thus after log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; n rounds (rounded up) every thread knows that every other thread has reached the barrier.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki04.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= References = &lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; David Culler, J.P. Singh and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann. August 1998.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Yan Solihin. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin Books. August 2009.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-memory Multiprocessors. http://www.cc.gatech.edu/classes/AY2010/cs4210_fall/papers/anderson-spinlock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://www.statemaster.com/encyclopedia/Lock-(computer-science)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://faydoc.tripod.com/cpu/cmpxchg.htm&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/slides/threads3.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; Scalability evaluation of barrier algorithms for OpenMP. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; Carwyn Ball and Mark Bull. Barrier Synchronization in Java. http://www.ukhec.ac.uk/publications/reports/synch_java.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.cs.brown.edu/courses/cs176/barrier.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; http://heather.cs.ucdavis.edu/~matloff/50/PLN/lock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/handouts/threads3.pdf&amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74658</id>
		<title>CSC/ECE 506 Spring 2013/9b sc</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74658"/>
		<updated>2013-04-03T20:46:46Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Avoiding Locks */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Synchronization =&lt;br /&gt;
In addition to a proper cache coherency model, it is important for a multiprocessor system to provide support for synchronization. This must be provided at the hardware level, and can also be implemented to some degree in software. The most common types of hardware synchronization are locks and barriers, which are discussed in this chapter.&lt;br /&gt;
&lt;br /&gt;
== Hardware vs. Operating System Synchronization ==&lt;br /&gt;
Synchronizations are characterized by three parts: acquire, wait, and release. Both hardware and operating system synchronization rely on atomic instructions for acquire and release, but the implementations for the wait portion of a synchronization can vary. Hardware implementations usually use busy-waiting, where a process repeatedly tests a variable waiting for it to change. The operating system can wait using blocking, where the process suspends itself until woken up by another process.&lt;br /&gt;
&lt;br /&gt;
Blocking has the advantage of freeing up the processor for use by other processes, but requires operating system functions to work and thus has a lot of overhead. Busy-waiting has much less overhead, but uses processor and cache resources while waiting. Because of these tradeoffs, it is usually better to use busy-waiting for short wait periods, and blocking for long wait periods.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Lock Implementations =&lt;br /&gt;
&lt;br /&gt;
Locks are an important concept when programming for multiprocessor systems. The purpose of a lock is to protect the code it guards, known as a critical section.  A lock must guarantee that while some thread X is inside the critical section, no other thread Y is also inside it and possibly modifying critical values: thread Y must wait until X has exited before entering. This can be accomplished in a variety of ways that differ in complexity and performance.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate lock performance, we can use these four criteria:&lt;br /&gt;
&lt;br /&gt;
*Acquisition Latency - How much time does it take to acquire the lock?&lt;br /&gt;
*Traffic - How much bus traffic is generated by threads attempting to acquire the lock?&lt;br /&gt;
*Fairness - Are threads granted the lock in the order they requested it (FIFO), or by luck?&lt;br /&gt;
*Storage - How much storage is needed compared to the number of threads?&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Acquisition Latency ===&lt;br /&gt;
We want a low acquisition latency, especially for applications that repeatedly acquire and release locks. Over the run time of a program, the acquisition latency compounded many times could have a large effect on performance. However, low acquisition latency has to be balanced against other factors when implementing a lock.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
Traffic is an important consideration when evaluating lock performance. If acquiring a lock causes a lot of bus traffic, the lock will not scale well as the number of threads increases, and eventually the bus will saturate.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Fairness ===&lt;br /&gt;
In general, threads should acquire a lock in the same order in which they requested it. If this does not happen, the chance that a thread will starve grows as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
The amount of storage needed per lock should be small enough such that it scales to a large number of threads without any problems. If a large amount of storage is required, then multiple threads could cause locks to consume too much memory.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
&lt;br /&gt;
Two common lock implementations are test&amp;amp;set and test-and-test&amp;amp;set. In terms of performance, both have a low storage cost (which scales well) and neither is fair. Test&amp;amp;set has a low acquisition latency but is not scalable due to high bus traffic when there is a lot of contention for the lock. Test-and-test&amp;amp;set has a higher acquisition latency but scales better because it generates less bus traffic. The next figure shows performance comparisons for these two lock implementations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:lockperf.png|frame|none|Performance comparison of test&amp;amp;set and test-and-test&amp;amp;set (spin on read)&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
=== Improving performance ===&lt;br /&gt;
Software optimizations can improve the performance of locks. For instance, inserting a delay between the moment a processor notices a release and its attempt to acquire the lock reduces contention and bus traffic, which improves lock performance. TCP-like backoff algorithms can be used to adjust this delay according to the number of threads contending for the lock.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
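&lt;br /&gt;
The delay-after-release idea can be sketched in Python as a test-and-test-and-set lock with randomized exponential backoff. This is an illustration only: the class name BackoffLock, the helper run_backoff_demo and the delay constants are hypothetical, and the non-blocking acquire of a threading.Lock is assumed as a stand-in for a hardware test-and-set instruction.&lt;br /&gt;
&lt;br /&gt;
```python
import random
import threading
import time


class BackoffLock:
    """Test-and-test-and-set spin lock with exponential backoff."""

    def __init__(self, min_delay=1e-6, max_delay=1e-3):
        # Assumption: a non-blocking acquire models test-and-set.
        self._flag = threading.Lock()
        self.min_delay = min_delay
        self.max_delay = max_delay

    def acquire(self):
        delay = self.min_delay
        while True:
            # Test first (a plain read), then test-and-set.
            if not self._flag.locked() and self._flag.acquire(blocking=False):
                return
            # Spin on reads until a release is noticed.
            while self._flag.locked():
                time.sleep(0)
            # Back off before trying, and widen the window after each failure.
            time.sleep(random.uniform(0, delay))
            delay = min(2 * delay, self.max_delay)

    def release(self):
        self._flag.release()


def run_backoff_demo(num_threads=4, iterations=200):
    lock = BackoffLock()
    total = [0]

    def worker():
        for _ in range(iterations):
            lock.acquire()
            total[0] += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


print(run_backoff_demo())  # 800
```
&lt;br /&gt;
Doubling the delay after each failed attempt means that heavier contention automatically spreads the retries further apart, at the cost of a slower hand-over when the lock is only lightly contended.&lt;br /&gt;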
&lt;br /&gt;
Another way to improve performance is to insert a delay between memory references so as to limit the bandwidth (traffic) each processor can use for spinning.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One way to guarantee fairness is to queue lock attempts in shared memory so that they happen in order.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Atomic Instructions ==&lt;br /&gt;
&lt;br /&gt;
Disabling interrupts on one processor does not stop the other processors from accessing memory, so a multiprocessor system cannot rely on it to execute code atomically. There must therefore be hardware support for atomic operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ability to execute an atomic instruction is a requirement for most lock implementations. It is important that when a processor attempts to set a lock, either the lock is fully set and the thread is able to enter into the critical section, or the lock is not set and it appears that none of the instructions required to set the lock have executed.  &lt;br /&gt;
&lt;br /&gt;
In the x86 instruction set, the CMPXCHG (compare and exchange) instruction can be used in a lock implementation to guarantee atomicity. It takes a source and a destination operand and uses the accumulator implicitly. The accumulator is compared to the destination and, if they are equal, the destination is loaded with the source. If they are not equal, the accumulator is loaded with the destination value. To ensure that this is executed atomically on a multiprocessor, the instruction must be issued with the LOCK prefix. This is useful in implementing some locks, such as ticket locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; In the AT&amp;amp;T syntax used below, the instruction is written LOCK CMPXCHG SOURCE, DESTINATION. The source has to be a register, while the destination can be either a register or memory. An example of LOCK CMPXCHG appears in the disassembly of the Linux kernel's rtc_cmos_read(). To take the lock, a CPU needs to read the lock variable and, if it is zero, update it to a nonzero value. If these were separate steps, another CPU could change the value after it was read but before it was updated, which is a race condition. LOCK CMPXCHG prevents this race by doing the comparison and the update as one atomic step, which keeps the busy-wait loop short. The code is given below, where %edx is assumed to hold the new nonzero lock value&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 spin: # see if the lock-variable is clear&lt;br /&gt;
 mov cmos_lock, %eax&lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
 # ok, now we try to grab the lock&lt;br /&gt;
 lock cmpxchg %edx, cmos_lock&lt;br /&gt;
 # did another CPU grab it first? &lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
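&lt;br /&gt;
The effect of a single CMPXCHG on the accumulator and the destination can be modelled in a few lines of Python. This is a sequential sketch for illustration only: the function names and the dictionary used as memory are assumptions, and the atomicity provided by the LOCK prefix is not modelled.&lt;br /&gt;
&lt;br /&gt;
```python
def cmpxchg(memory, addr, accumulator, source):
    '''Model one CMPXCHG; return (new_accumulator, zero_flag).'''
    if memory[addr] == accumulator:
        memory[addr] = source  # equal: the destination takes the source value
        return accumulator, True
    return memory[addr], False  # not equal: the accumulator takes the destination value


def try_lock(memory, addr):
    '''One attempt of the spin loop: succeeds only if the lock word was 0.'''
    new_accumulator, zero_flag = cmpxchg(memory, addr, accumulator=0, source=1)
    return zero_flag
```
&lt;br /&gt;
With memory = {'cmos_lock': 0}, the first try_lock call succeeds and stores 1 in the lock word, and a second call fails because the lock word no longer matches the accumulator value 0.&lt;br /&gt;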
&lt;br /&gt;
== Hand-off Lock ==&lt;br /&gt;
&lt;br /&gt;
Another type of lock that is not discussed in the text is known as the &amp;quot;hand-off&amp;quot; lock. In this lock, the first thread acquires the lock because no other thread currently holds it. When another thread attempts to gain the lock, it sees that the lock is in use and adds itself to a queue. This thread can then sleep until it is woken by the thread holding the lock. Once the thread holding the lock is finished, it passes the lock to the next thread in the queue&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. A waiting thread gives up the processor so that other threads (e.g. the thread with the lock) can run more quickly, and it is woken up when the lock becomes free. Keeping the waiting queue separate from the ready queue speeds up execution. In the example below, the unlock method hands the lock directly to a waiting thread. Interrupts are disabled while the lock state and the queues are being modified, and are enabled again afterwards.&lt;br /&gt;
&lt;br /&gt;
 lock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     if (value == FREE) {&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     } else {&lt;br /&gt;
         add thread to queue of threads waiting for this lock&lt;br /&gt;
         switch to next runnable thread&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
 unlock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     value = FREE&lt;br /&gt;
     if (any thread is waiting for this lock) {&lt;br /&gt;
         move waiting thread from waiting queue to ready queue&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
This scheme fails if lock() enables interrupts before calling switch and an interrupt then occurs: lock() adds the thread to the wait queue, lock() enables interrupts, and the interrupt causes pre-emption, i.e. a switch to another thread. Pre-emption moves the thread to the ready queue, so the thread is now on two queues (wait and ready). Also, switch is likely to be a critical section itself. Adding the thread to the wait queue and switching to the next thread must therefore be atomic. The solution is for the waiting thread to leave interrupts disabled when it calls switch. The next thread to run has the responsibility of re-enabling interrupts before returning to user code. When the waiting thread wakes up, it returns from switch with interrupts disabled (by the last thread).&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Avoiding Locks ==&lt;br /&gt;
&lt;br /&gt;
Many problems can arise with the use of locks, so a programmer should try to write programs in a way that avoids locks where possible.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One of the most well-known issues is deadlock. This can occur when threads are waiting to acquire a lock that will never be unlocked. For example, suppose two processes A and B each need two shared resources, 1 and 2, to complete a task. A deadlock arises when A grabs 1 and B grabs 2: each then waits for the resource held by the other, and neither A nor B can change its state. The solution to this problem is to impose an order, so that a process can lock resource 2 only if it has already locked resource 1.&lt;br /&gt;
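&lt;br /&gt;
The ordering rule in the example above can be sketched with Python's threading module. The names lock_1, lock_2, task and completed are illustrative assumptions. The point is only that every thread takes the first lock before the second, so no thread can hold the second while waiting for the first.&lt;br /&gt;
&lt;br /&gt;
```python
import threading

lock_1 = threading.Lock()
lock_2 = threading.Lock()
completed = []


def task(name):
    # Every thread acquires lock_1 before lock_2, so a cycle of waiting threads cannot form.
    with lock_1:
        with lock_2:
            completed.append(name)


threads = [threading.Thread(target=task, args=(n,)) for n in ('A', 'B')]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
&lt;br /&gt;
If one of the two threads took the locks in the opposite order instead, each could end up holding one lock while waiting for the other.&lt;br /&gt;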
&lt;br /&gt;
Another problem with using locks is that the performance is not optimal, as often a lock is used when there is only a chance of conflict. This approach to programming yields slower performance than what might be possible with other methods. It also leads to questions of granularity, that is, how much of the code should be protected under the critical section. The programmer must decide between many small (fine-grained) locks or fewer, more encompassing locks. This decision can greatly affect the performance of the program.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Barrier Implementations =&lt;br /&gt;
&lt;br /&gt;
In many ways, barriers are simpler than locks. A barrier is simply a point in a program that all participating threads must reach before any of them is allowed to continue. When using barriers, a programmer does not have to be concerned with advanced topics such as fairness, which must be considered when programming with locks.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate barrier performance, we can use these two criteria:&lt;br /&gt;
&lt;br /&gt;
*Latency - The time required to enter and exit a barrier.&lt;br /&gt;
*Traffic - Communications overhead required by the barrier.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Latency ===&lt;br /&gt;
&lt;br /&gt;
Ideally the latency of a barrier should be small, but for barriers that don't scale well latency can increase as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
&lt;br /&gt;
Low traffic is also desirable, as we do not want excess bus traffic to prevent scalability.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+Performance Comparison of Barrier Implementations&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! Criteria&lt;br /&gt;
! Sense Reversal&lt;br /&gt;
Centralized Barrier&lt;br /&gt;
! Combining Tree&lt;br /&gt;
Barrier&lt;br /&gt;
! Barrier Network&lt;br /&gt;
(Hardware)&lt;br /&gt;
|-&lt;br /&gt;
| Latency&lt;br /&gt;
| O(1)&lt;br /&gt;
| O(log p)&lt;br /&gt;
| O(log p)&lt;br /&gt;
|-&lt;br /&gt;
| Traffic&lt;br /&gt;
| O(p&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;)&lt;br /&gt;
| O(p)&lt;br /&gt;
| moved to a separate network&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki05.png|frame|none|Additive Schwarz Preconditioned Conjugate Gradient (ASPCG) kernel on Altix System - This figure shows the timings of the ASPCG kernel using the different barrier implementations. The blocking barrier does not scale with the number of threads, because contention increases as threads are added.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki06.png|frame|none|EPCC Microbenchmark - This figure shows the barrier timings of the EPCC Microbenchmark using the different barrier implementations. The blocking barrier/centralized blocking barrier does not scale with the number of threads, because contention increases as threads are added.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
== Sense-Reversal Barrier ==&lt;br /&gt;
&lt;br /&gt;
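This barrier is a centralized barrier where a single count variable, protected by a lock, is shared among all the threads. Each thread, on reaching the barrier, increments the count and waits until the count has reached the number of threads for which the barrier was set up. Waiting threads spin on a shared flag; the last thread to arrive resets the count and flips the flag, and the flag value that signals release (the sense) is reversed on each use of the barrier so that the barrier can be reused immediately.&lt;br /&gt;
&lt;br /&gt;
Since all the threads spin on a single variable, the number of cache misses scales quadratically with the number of processors.&lt;br /&gt;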
&lt;br /&gt;
[[Image:Ch9wiki01.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
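&lt;br /&gt;
A centralized sense-reversing barrier can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation from the cited report: the class and attribute names are hypothetical, threading.local holds each thread's private sense, and time.sleep(0) is used in the spin loop only to yield to other threads.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time


class SenseReversingBarrier:
    def __init__(self, parties):
        self.parties = parties
        self.count = 0
        self.sense = False  # shared flag that every waiter spins on
        self.lock = threading.Lock()  # protects count
        self.local = threading.local()  # per-thread sense

    def wait(self):
        my_sense = not getattr(self.local, 'sense', False)  # reverse the private sense
        self.local.sense = my_sense
        with self.lock:
            self.count += 1
            last = self.count == self.parties
            if last:
                self.count = 0  # reset for the next use of the barrier
                self.sense = my_sense  # flip the shared flag, releasing the waiters
        if not last:
            while self.sense != my_sense:
                time.sleep(0)  # busy-wait, yielding to other threads


def demo(parties=4, phases=5):
    '''Each thread logs its phase number, then waits at the barrier.'''
    barrier = SenseReversingBarrier(parties)
    log = []
    log_lock = threading.Lock()

    def worker():
        for phase in range(phases):
            with log_lock:
                log.append(phase)
            barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(parties)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```
&lt;br /&gt;
In the log returned by demo, all entries for one phase appear before any entry for the next phase, because no thread can pass the barrier until every thread has arrived.&lt;br /&gt;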
&lt;br /&gt;
== Combining Tree Barrier ==&lt;br /&gt;
&lt;br /&gt;
This barrier is a distributed barrier in which processors are grouped into clusters, and the threads in each cluster update a variable local to that cluster. When a local variable reaches a value equal to the number of threads updating it, the last thread to update it goes on to increment another variable higher in the hierarchy of the combining tree. When the variable at the highest level of the tree reaches its maximum value, all the threads have reached the barrier and synchronization is complete.&lt;br /&gt;
&lt;br /&gt;
Since the threads update local variables in smaller groups, the miss rate is not as high as for the sense-reversal barrier.&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the operation of a combining tree barrier with the threads divided into two groups. The variables C0 and C1 are local to each group, and C2 is the variable at the higher level of the hierarchy in the tree.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki02.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
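&lt;br /&gt;
The counting scheme of a two-level combining tree can be modelled as below. This is a single-threaded sketch for illustration only: CombiningTreeModel and its attributes are hypothetical names, the group counters play the role of C0 and C1 and root_count the role of C2, and the release of waiting threads and the reset of the counters are left out.&lt;br /&gt;
&lt;br /&gt;
```python
class CombiningTreeModel:
    '''Two-level tree: one counter per group and one root counter.'''

    def __init__(self, group_sizes):
        self.group_sizes = list(group_sizes)
        self.group_counts = [0] * len(self.group_sizes)
        self.root_count = 0

    def arrive(self, group):
        '''Record one arrival in a group; return True once the whole barrier is complete.'''
        self.group_counts[group] += 1
        if self.group_counts[group] == self.group_sizes[group]:
            self.root_count += 1  # only the last arrival in a group touches the root
        return self.root_count == len(self.group_sizes)
```
&lt;br /&gt;
With two groups of two threads, the root counter is updated only twice for four arrivals, which is why contention on any single variable stays lower than in a centralized barrier.&lt;br /&gt;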
&lt;br /&gt;
== Tournament Barrier ==&lt;br /&gt;
&lt;br /&gt;
In a tournament barrier the threads are treated as the leaves of a binary tree, and each node represents a flag. At each node two threads compete with each other; the loser, i.e. the thread that arrives later, sets the flag and moves up to the next level, where it competes to lose against the loser from the other section of the binary tree. Thus the thread that completes last is the one that sets the highest flag in the tree. Setting this flag indicates to all the threads that the barrier has been completed, and synchronization is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki03.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Disseminating Barrier ==&lt;br /&gt;
&lt;br /&gt;
In this barrier each thread maintains a record of the activity of the other threads. In every round i (starting from 0) with n threads, thread A notifies thread (A + 2&amp;lt;sup&amp;gt;i&amp;lt;/sup&amp;gt;) mod n. Thus after log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; n rounds (rounded up) every thread knows that every other thread has reached the barrier.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki04.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
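&lt;br /&gt;
The notification pattern can be checked with a short simulation. The sketch below reads the rule as thread A notifying thread (A + 2 to the power i) mod n in round i, with rounds numbered from 0. The function name and the use of sets to track what each thread knows are assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def dissemination_rounds(n):
    '''Return the number of rounds until every thread knows that all n threads have arrived.'''
    known = [{t} for t in range(n)]  # known[t]: threads that t knows have arrived
    rounds = 0
    while any(len(k) != n for k in known):
        step = 2 ** rounds
        snapshot = [set(k) for k in known]  # all notifications in a round are sent together
        for a in range(n):
            known[(a + step) % n] |= snapshot[a]  # a notifies (a + 2**rounds) mod n
        rounds += 1
    return rounds
```
&lt;br /&gt;
For example, dissemination_rounds(8) returns 3 and dissemination_rounds(9) returns 4, matching the base-2 logarithm of n rounded up.&lt;br /&gt;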
&lt;br /&gt;
= References = &lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; David Culler, J.P. Singh and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann. August 1998.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Yan Solihin. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin Books. August 2009.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-memory Multiprocessors. http://www.cc.gatech.edu/classes/AY2010/cs4210_fall/papers/anderson-spinlock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://www.statemaster.com/encyclopedia/Lock-(computer-science)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://faydoc.tripod.com/cpu/cmpxchg.htm&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/slides/threads3.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; Scalability evaluation of barrier algorithms for OpenMP. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; Carwyn Ball and Mark Bull. Barrier Synchronization in Java. http://www.ukhec.ac.uk/publications/reports/synch_java.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.cs.brown.edu/courses/cs176/barrier.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; http://heather.cs.ucdavis.edu/~matloff/50/PLN/lock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/handouts/threads3.pdf&amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74657</id>
		<title>CSC/ECE 506 Spring 2013/9b sc</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74657"/>
		<updated>2013-04-03T20:35:20Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Synchronization =&lt;br /&gt;
In addition to a proper cache coherency model, it is important for a multiprocessor system to provide support for synchronization. This must be provided at the hardware level, and can also be implemented to some degree in software. The most common types of hardware synchronization are locks and barriers, which are discussed in this chapter.&lt;br /&gt;
&lt;br /&gt;
== Hardware vs. Operating System Synchronization ==&lt;br /&gt;
Synchronizations are characterized by three parts: acquire, wait, and release. Both hardware and operating system synchronization rely on atomic instructions for acquire and release, but the implementations for the wait portion of a synchronization can vary. Hardware implementations usually use busy-waiting, where a process repeatedly tests a variable waiting for it to change. The operating system can wait using blocking, where the process suspends itself until woken up by another process.&lt;br /&gt;
&lt;br /&gt;
Blocking has the advantage of freeing up the processor for use by other processes, but requires operating system functions to work and thus has a lot of overhead. Busy-waiting has much less overhead, but uses processor and cache resources while waiting. Because of these tradeoffs, it is usually better to use busy-waiting for short wait periods, and blocking for long wait periods.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
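&lt;br /&gt;
The two waiting styles can be contrasted with a small Python sketch. It is illustrative only: the names ready_flag, ready_event and woken are assumptions, threading.Event stands in for an operating system blocking primitive, and time.sleep(0) is used inside the spin loop only to yield to other threads.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time

ready_flag = [False]  # polled by the busy-waiting thread
ready_event = threading.Event()  # the blocking thread sleeps on this
woken = []


def busy_waiter():
    while not ready_flag[0]:  # keeps testing the variable while it waits
        time.sleep(0)
    woken.append('busy')


def blocking_waiter():
    ready_event.wait()  # suspended until another thread calls set()
    woken.append('blocking')


threads = [threading.Thread(target=busy_waiter), threading.Thread(target=blocking_waiter)]
for t in threads:
    t.start()
ready_flag[0] = True  # release the spinning thread
ready_event.set()  # wake the blocked thread
for t in threads:
    t.join()
```
&lt;br /&gt;
The busy-waiting thread reacts as soon as it next reads the flag but occupies a processor in the meantime, while the blocked thread uses no processor time until it is woken.&lt;br /&gt;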
&lt;br /&gt;
= Lock Implementations =&lt;br /&gt;
&lt;br /&gt;
Locks are an important concept when programming for multiprocessor systems. The purpose of a lock is to protect the code inside the lock, known as a critical section. The lock must guarantee that while some thread X is inside the critical section, no other thread Y is also inside it and possibly modifying critical values. While thread X is in the critical section, thread Y must wait until X has exited before entering. This can be accomplished in a variety of ways of varying complexity and performance.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate lock performance, we can use these four criteria:&lt;br /&gt;
&lt;br /&gt;
*Acquisition Latency - How much time does it take to acquire the lock?&lt;br /&gt;
*Traffic - How much bus traffic is generated by threads attempting to acquire the lock?&lt;br /&gt;
*Fairness - Are threads granted the lock in the order they requested it (FIFO), or by luck?&lt;br /&gt;
*Storage - How much storage is needed compared to the number of threads?&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Acquisition Latency ===&lt;br /&gt;
We want a low acquisition latency, especially for applications that repeatedly acquire and release locks. Over the run time of a program, the acquisition latency compounded many times could have a large effect on performance. However, low acquisition latency has to be balanced against other factors when implementing a lock.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
Traffic is an important consideration when evaluating lock performance. If acquiring a lock causes a lot of bus traffic, the lock will not scale well as the number of threads increases. Eventually the bus will become choked.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Fairness ===&lt;br /&gt;
In general, threads should acquire a lock in the same order in which they requested it. If this does not happen, the chance that a thread will be starved grows as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
The amount of storage needed per lock should be small enough such that it scales to a large number of threads without any problems. If a large amount of storage is required, then multiple threads could cause locks to consume too much memory.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
&lt;br /&gt;
Two common lock implementations are test&amp;amp;set and test-and-test&amp;amp;set. In terms of performance, both have a low storage cost (so storage scales well) and neither is fair. Test&amp;amp;set has a low acquisition latency but does not scale, due to the high bus traffic it generates when there is a lot of contention for the lock. Test-and-test&amp;amp;set has a higher acquisition latency but scales better because it generates less bus traffic. The next figure shows performance comparisons for these two lock implementations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:lockperf.png|frame|none|Performance comparison of test&amp;amp;set and test-and-test&amp;amp;set (spin on read)&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
=== Improving performance ===&lt;br /&gt;
Optimizing the software can improve the performance of locks. For instance, inserting a delay between the moment a processor notices a release and its attempt to acquire the lock reduces contention and bus traffic, and so increases lock performance. TCP-like backoff algorithms can be used to adjust this delay depending on the number of threads contending for the lock.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another way to improve performance is to insert a delay between memory references so as to limit the bandwidth (traffic) each processor can use for spinning.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One way to guarantee fairness is to queue lock attempts in shared memory so that they happen in order.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Atomic Instructions ==&lt;br /&gt;
&lt;br /&gt;
Disabling interrupts on one processor does not stop the other processors from accessing memory, so a multiprocessor system cannot rely on it to execute code atomically. There must therefore be hardware support for atomic operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ability to execute an atomic instruction is a requirement for most lock implementations. It is important that when a processor attempts to set a lock, either the lock is fully set and the thread is able to enter into the critical section, or the lock is not set and it appears that none of the instructions required to set the lock have executed.  &lt;br /&gt;
&lt;br /&gt;
In the x86 instruction set, the CMPXCHG (compare and exchange) instruction can be used in a lock implementation to guarantee atomicity. It takes a source and a destination operand and uses the accumulator implicitly. The accumulator is compared to the destination and, if they are equal, the destination is loaded with the source. If they are not equal, the accumulator is loaded with the destination value. To ensure that this is executed atomically on a multiprocessor, the instruction must be issued with the LOCK prefix. This is useful in implementing some locks, such as ticket locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; In the AT&amp;amp;T syntax used below, the instruction is written LOCK CMPXCHG SOURCE, DESTINATION. The source has to be a register, while the destination can be either a register or memory. An example of LOCK CMPXCHG appears in the disassembly of the Linux kernel's rtc_cmos_read(). To take the lock, a CPU needs to read the lock variable and, if it is zero, update it to a nonzero value. If these were separate steps, another CPU could change the value after it was read but before it was updated, which is a race condition. LOCK CMPXCHG prevents this race by doing the comparison and the update as one atomic step, which keeps the busy-wait loop short. The code is given below, where %edx is assumed to hold the new nonzero lock value&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 spin: # see if the lock-variable is clear&lt;br /&gt;
 mov cmos_lock, %eax&lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
 # ok, now we try to grab the lock&lt;br /&gt;
 lock cmpxchg %edx, cmos_lock&lt;br /&gt;
 # did another CPU grab it first? &lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
&lt;br /&gt;
== Hand-off Lock ==&lt;br /&gt;
&lt;br /&gt;
Another type of lock that is not discussed in the text is known as the &amp;quot;hand-off&amp;quot; lock. In this lock, the first thread acquires the lock because no other thread currently holds it. When another thread attempts to gain the lock, it sees that the lock is in use and adds itself to a queue. This thread can then sleep until it is woken by the thread holding the lock. Once the thread holding the lock is finished, it passes the lock to the next thread in the queue&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. A waiting thread gives up the processor so that other threads (e.g. the thread with the lock) can run more quickly, and it is woken up when the lock becomes free. Keeping the waiting queue separate from the ready queue speeds up execution. In the example below, the unlock method hands the lock directly to a waiting thread. Interrupts are disabled while the lock state and the queues are being modified, and are enabled again afterwards.&lt;br /&gt;
&lt;br /&gt;
 lock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     if (value == FREE) {&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     } else {&lt;br /&gt;
         add thread to queue of threads waiting for this lock&lt;br /&gt;
         switch to next runnable thread&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
 unlock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     value = FREE&lt;br /&gt;
     if (any thread is waiting for this lock) {&lt;br /&gt;
         move waiting thread from waiting queue to ready queue&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
This scheme fails if lock() enables interrupts before calling switch and an interrupt then occurs: lock() adds the thread to the wait queue, lock() enables interrupts, and the interrupt causes pre-emption, i.e. a switch to another thread. Pre-emption moves the thread to the ready queue, so the thread is now on two queues (wait and ready). Also, switch is likely to be a critical section itself. Adding the thread to the wait queue and switching to the next thread must therefore be atomic. The solution is for the waiting thread to leave interrupts disabled when it calls switch. The next thread to run has the responsibility of re-enabling interrupts before returning to user code. When the waiting thread wakes up, it returns from switch with interrupts disabled (by the last thread).&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Avoiding Locks ==&lt;br /&gt;
&lt;br /&gt;
Many problems can arise with the use of locks, so a programmer should try to write programs in a way that avoids locks where possible.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One of the most well-known issues is deadlock. This can occur when threads are waiting to acquire a lock that will never be unlocked. For example, if two threads are both spinning on a lock that is stuck in the locked state, they will continue to spin forever, as each thread 'thinks' that the other is inside the critical section.&lt;br /&gt;
&lt;br /&gt;
Another problem with using locks is that the performance is not optimal, as often a lock is used when there is only a chance of conflict. This approach to programming yields slower performance than what might be possible with other methods. It also leads to questions of granularity, that is, how much of the code should be protected under the critical section. The programmer must decide between many small (fine-grained) locks or fewer, more encompassing locks. This decision can greatly affect the performance of the program.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Barrier Implementations =&lt;br /&gt;
&lt;br /&gt;
In many ways, barriers are simpler than locks. A barrier is simply a point in a program that all participating threads must reach before any of them is allowed to continue. When using barriers, a programmer does not have to be concerned with advanced topics such as fairness, which must be considered when programming with locks.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate barrier performance, we can use these two criteria:&lt;br /&gt;
&lt;br /&gt;
*Latency - The time required to enter and exit a barrier.&lt;br /&gt;
*Traffic - Communications overhead required by the barrier.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Latency ===&lt;br /&gt;
&lt;br /&gt;
Ideally the latency of a barrier should be small, but for barriers that don't scale well latency can increase as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
&lt;br /&gt;
Low traffic is also desirable, as we do not want excess bus traffic to prevent scalability.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+Performance Comparison of Barrier Implementations&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! Criteria&lt;br /&gt;
! Sense Reversal&lt;br /&gt;
Centralized Barrier&lt;br /&gt;
! Combining Tree&lt;br /&gt;
Barrier&lt;br /&gt;
! Barrier Network&lt;br /&gt;
(Hardware)&lt;br /&gt;
|-&lt;br /&gt;
| Latency&lt;br /&gt;
| O(1)&lt;br /&gt;
| O(log p)&lt;br /&gt;
| O(log p)&lt;br /&gt;
|-&lt;br /&gt;
| Traffic&lt;br /&gt;
| O(p&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;)&lt;br /&gt;
| O(p)&lt;br /&gt;
| moved to a separate network&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki05.png|frame|none|Additive Schwarz Preconditioned Conjugate Gradient (ASPCG) kernel on Altix System - This figure shows the timings of the ASPCG kernel using the different barrier implementations. The blocking barrier does not scale with the number of threads, because contention increases as threads are added.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki06.png|frame|none|EPCC Microbenchmark - This figure shows the barrier timings of the EPCC Microbenchmark using the different barrier implementations. The blocking barrier/centralized blocking barrier does not scale with the number of threads, because contention increases as threads are added.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
== Sense-Reversal Barrier ==&lt;br /&gt;
&lt;br /&gt;
This is a centralized barrier in which a single count variable, protected by a lock, is shared among all the threads. On reaching the barrier, each thread increments the count and then waits until it equals the number of threads participating in the barrier.&lt;br /&gt;
&lt;br /&gt;
Since all the threads spin on a single variable, the number of cache misses scales quadratically with the number of processors.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki01.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Combining Tree Barrier ==&lt;br /&gt;
&lt;br /&gt;
This is a distributed barrier in which the processors are grouped into clusters, and each cluster updates its own local variable. When a local variable reaches a value equal to the number of threads updating it, the variable at the next higher level of the combining tree is incremented. When the variable at the top of the hierarchy reaches its maximum value, all the threads have reached the barrier and synchronization is complete.&lt;br /&gt;
&lt;br /&gt;
Since the threads update local variables in smaller groups, the miss rate is not as high as in the sense-reversal barrier.&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the operation of a combining tree barrier with the threads divided into two groups. The variables C0 and C1 are local to each group, and C2 is the variable at the next higher level of the tree.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki02.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tournament Barrier ==&lt;br /&gt;
&lt;br /&gt;
In a tournament barrier, the threads are the leaves of a binary tree and each node represents a flag. Two threads compete with each other; the loser is allowed to set the flag and moves up to the next level, where it competes to lose against the loser from the other section of the binary tree. Thus the thread that arrives last sets the highest flag in the binary tree. Setting that flag indicates to all the threads that the barrier has been completed, and synchronization is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki03.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Disseminating Barrier ==&lt;br /&gt;
&lt;br /&gt;
In this barrier each thread maintains a record of the activity of the other threads. In every round i (counting from 0) with n threads, thread A notifies thread (A + 2&amp;lt;sup&amp;gt;i&amp;lt;/sup&amp;gt;) mod n. Thus after log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; n rounds (rounded up) every thread knows that every other thread has reached the barrier.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki04.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= References = &lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; David Culler, J.P. Singh and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann. August 1998.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Yan Solihin. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin Books. August 2009.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-memory Multiprocessors. http://www.cc.gatech.edu/classes/AY2010/cs4210_fall/papers/anderson-spinlock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://www.statemaster.com/encyclopedia/Lock-(computer-science)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://faydoc.tripod.com/cpu/cmpxchg.htm&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/slides/threads3.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; Scalability evaluation of barrier algorithms for OpenMP. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; Carwyn Ball and Mark Bull. Barrier Synchronization in Java. http://www.ukhec.ac.uk/publications/reports/synch_java.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.cs.brown.edu/courses/cs176/barrier.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; http://heather.cs.ucdavis.edu/~matloff/50/PLN/lock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/handouts/threads3.pdf&amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74656</id>
		<title>CSC/ECE 506 Spring 2013/9b sc</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74656"/>
		<updated>2013-04-03T20:34:39Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Hand-off Lock */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Synchronization =&lt;br /&gt;
In addition to a proper cache coherency model, it is important for a multiprocessor system to provide support for synchronization. This must be provided at the hardware level, and can also be implemented to some degree in software. The most common types of hardware synchronization are locks and barriers, which are discussed in this chapter.&lt;br /&gt;
&lt;br /&gt;
== Hardware vs. Operating System Synchronization ==&lt;br /&gt;
Synchronizations are characterized by three parts: acquire, wait, and release. Both hardware and operating system synchronization rely on atomic instructions for acquire and release, but the implementations for the wait portion of a synchronization can vary. Hardware implementations usually use busy-waiting, where a process repeatedly tests a variable waiting for it to change. The operating system can wait using blocking, where the process suspends itself until woken up by another process.&lt;br /&gt;
&lt;br /&gt;
Blocking has the advantage of freeing up the processor for use by other processes, but requires operating system functions to work and thus has a lot of overhead. Busy-waiting has much less overhead, but uses processor and cache resources while waiting. Because of these tradeoffs, it is usually better to use busy-waiting for short wait periods, and blocking for long wait periods.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Lock Implementations =&lt;br /&gt;
&lt;br /&gt;
Locks are an important concept when programming for multiprocessor systems. The purpose of a lock is to protect the code inside the lock, known as a critical section.  It must be certain that while some thread X has entered the critical section, another thread Y is not also inside of the critical section and possibly modifying critical values. While thread X has entered the critical section, thread Y must wait until X has exited before entering. This can be accomplished in a variety of ways of varying complexity and performance.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate lock performance, we can use these four criteria:&lt;br /&gt;
&lt;br /&gt;
*Acquisition Latency - How much time does it take to acquire the lock?&lt;br /&gt;
*Traffic - How much bus traffic is generated by threads attempting to acquire the lock?&lt;br /&gt;
*Fairness - Are threads granted the lock in the order they requested it (FIFO), or by luck?&lt;br /&gt;
*Storage - How much storage is needed compared to the number of threads?&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Acquisition Latency ===&lt;br /&gt;
We want a low acquisition latency, especially for applications that repeatedly acquire and release locks. Over the run time of a program, the acquisition latency compounded many times can have a large effect on performance. However, low acquisition latency has to be balanced against other factors when implementing a lock.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
Traffic is an important consideration when evaluating lock performance. If acquiring a lock generates a lot of bus traffic, the lock will not scale well as the number of threads increases, and eventually the bus will become saturated.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Fairness ===&lt;br /&gt;
In general, threads should acquire a lock in the same order in which they requested it. If this does not happen, the chance that a thread will be starved grows as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
The amount of storage needed per lock should be small enough such that it scales to a large number of threads without any problems. If a large amount of storage is required, then multiple threads could cause locks to consume too much memory.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
&lt;br /&gt;
Two common lock implementations are test&amp;amp;set and test-and-test&amp;amp;set. Both have a low storage cost (which scales well) and neither is fair. Test&amp;amp;set has a low acquisition latency but does not scale, due to high bus traffic when there is a lot of contention for the lock. Test-and-test&amp;amp;set has a higher acquisition latency but scales better because it generates less bus traffic. The next figure shows performance comparisons for these two lock implementations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:lockperf.png|frame|none|Performance comparison of test&amp;amp;set and test-and-test&amp;amp;set (spin on read)&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
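&lt;br /&gt;
The difference between the two locks can be illustrated with a minimal Python sketch of test-and-test&amp;amp;set. It is a simulation under stated assumptions, not a real hardware lock: the class TestAndSetWord and its inner threading.Lock are hypothetical stand-ins for an atomic test&amp;amp;set instruction, and time.sleep(0) merely yields the processor while spinning. The inner loop spins on an ordinary read, and the atomic operation is attempted only when the lock appears free.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time


class TestAndSetWord:
    """Simulated memory word (hypothetical helper).

    The inner lock stands in for the atomicity that hardware provides.
    """

    def __init__(self):
        self._atomic = threading.Lock()
        self.value = 0

    def test_and_set(self):
        # Atomically return the old value and write 1.
        with self._atomic:
            old = self.value
            self.value = 1
            return old


class TTASLock:
    def __init__(self):
        self.word = TestAndSetWord()

    def acquire(self):
        while True:
            # Test: spin on an ordinary read while the lock is held.
            while self.word.value == 1:
                time.sleep(0)  # yield so the holder can run
            # Test-and-set: attempt the atomic operation only when the lock looks free.
            if self.word.test_and_set() == 0:
                return

    def release(self):
        self.word.value = 0


lock = TTASLock()
counter = 0


def worker(iterations):
    global counter
    for _ in range(iterations):
        lock.acquire()
        counter += 1  # critical section
        lock.release()


threads = [threading.Thread(target=worker, args=(200,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)
```
&lt;br /&gt;
A plain test&amp;amp;set lock would call test_and_set() on every spin iteration instead, which is what generates the extra bus traffic under contention.&lt;br /&gt;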
&lt;br /&gt;
=== Improving performance ===&lt;br /&gt;
Optimizing software can improve the performance of locks. For instance, inserting a delay between the moment a processor notices a release and the moment it tries to acquire the lock can reduce contention and bus traffic, and so improve lock performance. TCP-like backoff algorithms can be used to adjust this delay depending on the number of threads contending for the lock.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
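&lt;br /&gt;
As a rough illustration of such a backoff policy, the Python sketch below computes the delay before each retry. The function name backoff_delays, the doubling rule, and the base and cap parameters are assumptions made for this example; practical schemes usually also add a random component so that contending threads do not retry in lockstep.&lt;br /&gt;
&lt;br /&gt;
```python
def backoff_delays(base, cap, attempts):
    """Delay before each retry: doubles after every failed attempt, up to cap."""
    return [min(cap, base * 2 ** k) for k in range(attempts)]


# Delays (in arbitrary time units) for six consecutive failed attempts.
schedule = backoff_delays(base=1, cap=16, attempts=6)
print(schedule)  # [1, 2, 4, 8, 16, 16]
```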
&lt;br /&gt;
Another way to improve performance is to insert a delay between memory references so as to limit the bandwidth (traffic) each processor can use for spinning.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One way to guarantee fairness is to queue lock attempts in shared memory so that they happen in order.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
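&lt;br /&gt;
One simple way to obtain this first-come, first-served ordering is a ticket lock, sketched below in Python. This is an illustrative substitute rather than the queue structure of the cited paper: the class TicketLock is hypothetical, its inner threading.Lock stands in for an atomic fetch-and-increment instruction, and time.sleep(0) only yields while spinning.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time


class TicketLock:
    def __init__(self):
        self._atomic = threading.Lock()  # stands in for an atomic fetch-and-increment
        self.next_ticket = 0
        self.now_serving = 0

    def acquire(self):
        # Take a ticket atomically.
        with self._atomic:
            my_ticket = self.next_ticket
            self.next_ticket += 1
        # Spin until this ticket is being served.
        while self.now_serving != my_ticket:
            time.sleep(0)
        return my_ticket

    def release(self):
        self.now_serving += 1  # only the current holder writes this


lock = TicketLock()
served = []


def worker():
    ticket = lock.acquire()
    served.append(ticket)  # critical section
    lock.release()


threads = [threading.Thread(target=worker) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(served)
```
&lt;br /&gt;
Threads enter the critical section strictly in ticket order, so no thread can be overtaken by one that requested the lock later.&lt;br /&gt;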
&lt;br /&gt;
== Atomic Instructions ==&lt;br /&gt;
&lt;br /&gt;
Since a multiprocessor system cannot disable interrupts as an effective method to execute code atomically, there must be hardware support for atomic operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ability to execute an atomic instruction is a requirement for most lock implementations. It is important that when a processor attempts to set a lock, either the lock is fully set and the thread is able to enter into the critical section, or the lock is not set and it appears that none of the instructions required to set the lock have executed.  &lt;br /&gt;
&lt;br /&gt;
In the x86 instruction set, the CMPXCHG (compare and exchange) instruction can be used in a lock implementation to guarantee atomicity. The instruction takes a destination operand and a source operand. The accumulator is compared with the destination and, if they are equal, the destination is loaded with the source. If they are not equal, the accumulator is loaded with the destination value. To ensure that this is executed atomically, the instruction must be issued with the LOCK prefix. This is useful in implementing some locks, such as ticket locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; In the AT&amp;amp;T syntax used below, the instruction is written LOCK CMPXCHG SOURCE, DESTINATION. The source has to be a register, while the destination can be either a register or a memory location. An example of LOCK CMPXCHG appears in the disassembly of the Linux kernel's rtc_cmos_read(). To take the lock, a CPU must read the lock variable and, if it is zero, update it to a non-zero value. If the read and the update were separate steps, another CPU could change the value after it was read but before it was updated, which is a race condition. LOCK CMPXCHG closes this window by performing the comparison and the update as one atomic operation at the end of the busy-wait loop. The assembly code is given below&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 spin: # see if the lock-variable is clear&lt;br /&gt;
 mov cmos_lock, %eax&lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
 # ok, now we try to grab the lock&lt;br /&gt;
 lock cmpxchg %edx, cmos_lock&lt;br /&gt;
 # did another CPU grab it first? &lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
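&lt;br /&gt;
The compare-and-exchange step in the loop above can be modelled in a few lines of Python. This is only a sketch of the instruction's semantics: the function cmpxchg, the dictionary used as memory, and the returned flag are hypothetical names for illustration, and the model runs sequentially, so it says nothing about how the hardware makes the operation atomic.&lt;br /&gt;
&lt;br /&gt;
```python
def cmpxchg(memory, dest, accumulator, source):
    """Model of CMPXCHG: returns (new accumulator value, zero flag)."""
    if memory[dest] == accumulator:
        memory[dest] = source       # equal: destination takes the source value
        return accumulator, True
    return memory[dest], False      # not equal: accumulator takes the destination value


memory = {'cmos_lock': 0}

# First CPU: expects 0 (free) in the accumulator and tries to store 1.
eax_a, took_a = cmpxchg(memory, 'cmos_lock', accumulator=0, source=1)

# Second CPU: also expects 0, but the lock variable is already 1.
eax_b, took_b = cmpxchg(memory, 'cmos_lock', accumulator=0, source=1)

print(eax_a, took_a, eax_b, took_b)
```
&lt;br /&gt;
The second caller ends up with a non-zero accumulator, which is the case where the assembly above jumps back to spin.&lt;br /&gt;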
&lt;br /&gt;
== Hand-off Lock ==&lt;br /&gt;
&lt;br /&gt;
Another type of lock that is not discussed in the text is known as the &amp;quot;hand-off&amp;quot; lock. With this lock, the first thread to arrive acquires the lock, since no other thread holds it. When another thread attempts to acquire the lock, it sees that the lock is in use and adds itself to a queue. It can then sleep until it is woken by the thread holding the lock. When the holder is finished, it passes the lock to the next thread in the queue&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. A waiting thread gives up the processor so that other threads (e.g. the thread holding the lock) can run sooner, and it is woken up when the lock becomes free. Keeping the waiting queue separate from the ready queue speeds up execution. In the example below, the unlock method gives the lock directly to a waiting thread. Interrupts are disabled at the start of lock() and unlock() and re-enabled at the end of each.&lt;br /&gt;
&lt;br /&gt;
 lock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     if (value == FREE) {&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     } else {&lt;br /&gt;
         add thread to queue of threads waiting for this lock&lt;br /&gt;
         switch to next runnable thread&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
 unlock () {&lt;br /&gt;
     disable interrupts&lt;br /&gt;
     value = FREE&lt;br /&gt;
     if (any thread is waiting for this lock) {&lt;br /&gt;
         move waiting thread from waiting queue to ready queue&lt;br /&gt;
         value = BUSY&lt;br /&gt;
     }&lt;br /&gt;
     enable interrupts&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
However, this fails if lock() re-enables interrupts before switching and an interrupt occurs at that point. Consider the following sequence: lock() adds the thread to the wait queue; lock() enables interrupts; an interrupt causes pre-emption, i.e. a switch to another thread; and pre-emption moves the thread to the ready queue. The thread is now on two queues (wait and ready). In addition, the switch itself is likely to be a critical section. Adding the thread to the wait queue and switching to the next thread must therefore be atomic. The solution is for the waiting thread to leave interrupts disabled when it calls switch. The next thread to run has the responsibility of re-enabling interrupts before returning to user code. When the waiting thread wakes up, it returns from switch with interrupts still disabled (by the last thread that ran).&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
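&lt;br /&gt;
The hand-off behaviour itself can be sketched with a small single-threaded Python model. The class HandOffLock and the string thread names are hypothetical, and the model deliberately leaves out interrupts and real context switches; it only shows that unlock passes ownership straight to the oldest waiter.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import deque


class HandOffLock:
    """Single-threaded model; threads are represented by plain strings."""

    def __init__(self):
        self.owner = None
        self.waiting = deque()

    def lock(self, thread):
        if self.owner is None:
            self.owner = thread        # lock was FREE: take it
            return True
        self.waiting.append(thread)    # lock is BUSY: join the wait queue
        return False                   # a real thread would sleep here

    def unlock(self):
        # Hand the lock straight to the oldest waiter, if any; the lock
        # never becomes FREE while another thread is waiting for it.
        self.owner = self.waiting.popleft() if self.waiting else None
        return self.owner


m = HandOffLock()
got_a = m.lock('A')   # A acquires immediately
got_b = m.lock('B')   # B queues behind A
got_c = m.lock('C')   # C queues behind B
order = [m.unlock(), m.unlock(), m.unlock()]
print(order)  # ['B', 'C', None]
```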
&lt;br /&gt;
== Avoiding Locks ==&lt;br /&gt;
&lt;br /&gt;
There are many reasons why a programmer should attempt to write programs in such a way as to avoid locks if possible.  There are many problems that can arise with the use of locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One of the most well-known issues is deadlock.  This can occur when threads are waiting to acquire a lock that will never be released.  For example, if two threads are both spinning on a lock that is never unlocked, they will continue to spin forever, as each thread 'thinks' that another thread is inside the critical section.&lt;br /&gt;
&lt;br /&gt;
Another problem with locks is that performance is not optimal, since a lock is often used when there is only a chance of conflict.  This approach yields slower performance than might be possible with other methods.  It also raises the question of granularity, that is, how much of the code should be protected by the critical section.  The programmer must choose between many small (fine-grained) locks and fewer, more encompassing locks.  This decision can greatly affect the performance of the program.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Barrier Implementations =&lt;br /&gt;
&lt;br /&gt;
In many ways, barriers are simpler than locks.  A barrier is simply a point in a program that all participating threads must reach before any of them is allowed to continue.  When using barriers, a programmer does not have to be concerned with topics such as fairness, which arise when programming with locks.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate barrier performance, we can use these two criteria:&lt;br /&gt;
&lt;br /&gt;
*Latency - The time required to enter and exit a barrier.&lt;br /&gt;
*Traffic - Communications overhead required by the barrier.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Latency ===&lt;br /&gt;
&lt;br /&gt;
Ideally the latency of a barrier should be small, but for barriers that don't scale well latency can increase as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
&lt;br /&gt;
Traffic should also be low, since excess bus traffic limits scalability.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+Performance Comparison of Barrier Implementations&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! Criteria&lt;br /&gt;
! Sense Reversal&lt;br /&gt;
Centralized Barrier&lt;br /&gt;
! Combining Tree&lt;br /&gt;
Barrier&lt;br /&gt;
! Barrier Network&lt;br /&gt;
(Hardware)&lt;br /&gt;
|-&lt;br /&gt;
| Latency&lt;br /&gt;
| O(1)&lt;br /&gt;
| O(log p)&lt;br /&gt;
| O(log p)&lt;br /&gt;
|-&lt;br /&gt;
| Traffic&lt;br /&gt;
| O(p&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;)&lt;br /&gt;
| O(p)&lt;br /&gt;
| moved to a separate network&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki05.png|frame|none|Additive Schwarz Preconditioned Conjugate Gradient (ASPCG) kernel on an Altix system. The figure shows the timings of the ASPCG kernel under the different barrier implementations. The blocking barrier does not scale with the number of threads, because contention increases as threads are added.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki06.png|frame|none|EPCC microbenchmark. The figure shows the time taken to execute a barrier in the EPCC microbenchmark under the different barrier implementations. The blocking barrier/centralized blocking barrier does not scale with the number of threads, because contention increases as threads are added.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
== Sense-Reversal Barrier ==&lt;br /&gt;
&lt;br /&gt;
This is a centralized barrier in which a single count variable, protected by a lock, is shared among all the threads. On reaching the barrier, each thread increments the count and then waits until it equals the number of threads participating in the barrier.&lt;br /&gt;
&lt;br /&gt;
Since all the threads spin on a single variable, the number of cache misses scales quadratically with the number of processors.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki01.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
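&lt;br /&gt;
A minimal Python sketch of a sense-reversal barrier follows. The text above describes only the shared count; the shared sense flag and the per-thread private sense shown here are the usual way of making the barrier reusable, and are stated as an assumption of this sketch. The class name SenseReversalBarrier is hypothetical, and time.sleep(0) merely yields while spinning.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time


class SenseReversalBarrier:
    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.count = 0
        self.sense = False                  # shared flag, flipped once per barrier episode
        self._count_lock = threading.Lock()
        self._local = threading.local()     # holds each thread's private sense

    def wait(self):
        # Each thread flips its private sense on entry.
        my_sense = not getattr(self._local, 'sense', False)
        self._local.sense = my_sense
        with self._count_lock:
            self.count += 1
            if self.count == self.num_threads:
                self.count = 0              # reset for the next episode
                self.sense = my_sense       # last arrival releases everyone
        # Spin until the shared sense matches the private sense.
        while self.sense != my_sense:
            time.sleep(0)


barrier = SenseReversalBarrier(4)
log = []


def worker(tid):
    for phase in range(3):
        log.append((phase, tid))
        barrier.wait()


threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
phases = [phase for phase, _ in log]
print(phases)
```
&lt;br /&gt;
Because no thread passes the barrier until all four have arrived, every entry for one phase is logged before any entry for the next.&lt;br /&gt;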
&lt;br /&gt;
== Combining Tree Barrier ==&lt;br /&gt;
&lt;br /&gt;
This is a distributed barrier in which the processors are grouped into clusters, and each cluster updates its own local variable. When a local variable reaches a value equal to the number of threads updating it, the variable at the next higher level of the combining tree is incremented. When the variable at the top of the hierarchy reaches its maximum value, all the threads have reached the barrier and synchronization is complete.&lt;br /&gt;
&lt;br /&gt;
Since the threads update local variables in smaller groups, the miss rate is not as high as in the sense-reversal barrier.&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the operation of a combining tree barrier with the threads divided into two groups. The variables C0 and C1 are local to each group, and C2 is the variable at the next higher level of the tree.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki02.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
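&lt;br /&gt;
The counting scheme can be modelled sequentially in Python. The sketch assumes four threads split into two groups of two, with counters named c0, c1 and c2 after the diagram; the Node class and arrive function are hypothetical, and the model covers a single barrier episode without real threads or spinning.&lt;br /&gt;
&lt;br /&gt;
```python
class Node:
    """One counter in the combining tree."""

    def __init__(self, expected, parent=None):
        self.expected = expected   # arrivals needed to fill this counter
        self.count = 0
        self.parent = parent


def arrive(node):
    """Record one arrival; return True if it completes the root counter."""
    node.count += 1
    if node.count != node.expected:
        return False               # group not complete yet
    if node.parent is None:
        return True                # root is full: barrier complete
    return arrive(node.parent)     # last arrival in the group moves up


c2 = Node(expected=2)              # root, fed by the two groups
c0 = Node(expected=2, parent=c2)   # group 0 counter
c1 = Node(expected=2, parent=c2)   # group 1 counter

# Four threads arrive, alternating between the two groups.
results = [arrive(c0), arrive(c1), arrive(c0), arrive(c1)]
print(results)  # [False, False, False, True]
```
&lt;br /&gt;
Each thread touches only its group counter, and only the last arrival in each group touches c2, which is why contention on any single variable stays low.&lt;br /&gt;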
&lt;br /&gt;
== Tournament Barrier ==&lt;br /&gt;
&lt;br /&gt;
In a tournament barrier, the threads are the leaves of a binary tree and each node represents a flag. Two threads compete with each other; the loser is allowed to set the flag and moves up to the next level, where it competes to lose against the loser from the other section of the binary tree. Thus the thread that arrives last sets the highest flag in the binary tree. Setting that flag indicates to all the threads that the barrier has been completed, and synchronization is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki03.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Disseminating Barrier ==&lt;br /&gt;
&lt;br /&gt;
In this barrier each thread maintains a record of the activity of the other threads. In every round i (counting from 0) with n threads, thread A notifies thread (A + 2&amp;lt;sup&amp;gt;i&amp;lt;/sup&amp;gt;) mod n. Thus after log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; n rounds (rounded up) every thread knows that every other thread has reached the barrier.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki04.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
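&lt;br /&gt;
The notification pattern can be checked with a short sequential simulation in Python. It assumes that in round i (counting from 0) thread A notifies thread (A + 2 to the power i) mod n and passes along everything it has heard so far; the function name dissemination_rounds is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def dissemination_rounds(n):
    """Return how many rounds pass before every thread has heard from all n threads."""
    known = [{t} for t in range(n)]   # each thread starts knowing only itself
    rounds = 0
    while any(len(k) != n for k in known):
        step = 2 ** rounds
        updates = [set(k) for k in known]
        for a in range(n):
            # Thread a notifies its partner, forwarding all arrivals it knows of.
            updates[(a + step) % n] |= known[a]
        known = updates
        rounds += 1
    return rounds


print(dissemination_rounds(8))  # 3
print(dissemination_rounds(5))  # 3
```
&lt;br /&gt;
With 5 threads three rounds are still needed, which is why the round count is the base-2 logarithm of n rounded up.&lt;br /&gt;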
&lt;br /&gt;
= References = &lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; David Culler, J.P. Singh and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann. August 1998.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Yan Solihin. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin Books. August 2009.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-memory Multiprocessors. http://www.cc.gatech.edu/classes/AY2010/cs4210_fall/papers/anderson-spinlock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://www.statemaster.com/encyclopedia/Lock-(computer-science)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://faydoc.tripod.com/cpu/cmpxchg.htm&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/slides/threads3.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; Scalability evaluation of barrier algorithms for OpenMP. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; Carwyn Ball and Mark Bull. Barrier Synchronization in Java. http://www.ukhec.ac.uk/publications/reports/synch_java.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.cs.brown.edu/courses/cs176/barrier.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; http://heather.cs.ucdavis.edu/~matloff/50/PLN/lock.pdf&amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74655</id>
		<title>CSC/ECE 506 Spring 2013/9b sc</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74655"/>
		<updated>2013-04-03T19:30:01Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Atomic Instructions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Synchronization =&lt;br /&gt;
In addition to a proper cache coherency model, it is important for a multiprocessor system to provide support for synchronization. This must be provided at the hardware level, and can also be implemented to some degree in software. The most common types of hardware synchronization are locks and barriers, which are discussed in this chapter.&lt;br /&gt;
&lt;br /&gt;
== Hardware vs. Operating System Synchronization ==&lt;br /&gt;
Synchronizations are characterized by three parts: acquire, wait, and release. Both hardware and operating system synchronization rely on atomic instructions for acquire and release, but the implementations for the wait portion of a synchronization can vary. Hardware implementations usually use busy-waiting, where a process repeatedly tests a variable waiting for it to change. The operating system can wait using blocking, where the process suspends itself until woken up by another process.&lt;br /&gt;
&lt;br /&gt;
Blocking has the advantage of freeing up the processor for use by other processes, but requires operating system functions to work and thus has a lot of overhead. Busy-waiting has much less overhead, but uses processor and cache resources while waiting. Because of these tradeoffs, it is usually better to use busy-waiting for short wait periods, and blocking for long wait periods.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Lock Implementations =&lt;br /&gt;
&lt;br /&gt;
Locks are an important concept when programming for multiprocessor systems. The purpose of a lock is to protect the code inside the lock, known as a critical section.  It must be certain that while some thread X has entered the critical section, another thread Y is not also inside of the critical section and possibly modifying critical values. While thread X has entered the critical section, thread Y must wait until X has exited before entering. This can be accomplished in a variety of ways of varying complexity and performance.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate lock performance, we can use these four criteria:&lt;br /&gt;
&lt;br /&gt;
*Acquisition Latency - How much time does it take to acquire the lock?&lt;br /&gt;
*Traffic - How much bus traffic is generated by threads attempting to acquire the lock?&lt;br /&gt;
*Fairness - Are threads granted the lock in the order they requested it (FIFO), or by luck?&lt;br /&gt;
*Storage - How much storage is needed compared to the number of threads?&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Acquisition Latency ===&lt;br /&gt;
We want a low acquisition latency, especially for applications that repeatedly acquire and release locks. Over the run time of a program, the acquisition latency compounded many times can have a large effect on performance. However, low acquisition latency has to be balanced against other factors when implementing a lock.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
Traffic is an important consideration when evaluating lock performance. If acquiring a lock generates a lot of bus traffic, the lock will not scale well as the number of threads increases, and eventually the bus will become saturated.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Fairness ===&lt;br /&gt;
In general, threads should acquire a lock in the same order in which they requested it. If this does not happen, the chance that a thread will be starved grows as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
The amount of storage needed per lock should be small enough such that it scales to a large number of threads without any problems. If a large amount of storage is required, then multiple threads could cause locks to consume too much memory.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
&lt;br /&gt;
Two common lock implementations are test&amp;amp;set and test-and-test&amp;amp;set. In terms of performance, both have a low storage cost (scalable) and neither is fair. Test&amp;amp;set has a low acquisition latency but is not scalable due to high bus traffic when there is a lot of contention for the lock. Test-and-test&amp;amp;set has a higher acquisition latency but scales better because it generates less bus traffic. The next figure shows performance comparisons for these two lock implementations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:lockperf.png|frame|none|Performance comparison of test&amp;amp;set and test-and-test&amp;amp;set (spin on read)&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
=== Improving performance ===&lt;br /&gt;
Optimizing software can improve the performance of locks. For instance, inserting a delay between a processor noticing a release and then trying to acquire a lock can reduce contention and bus traffic, and increase the performance of the locks. TCP-like back off algorithms can be used to adjust this delay depending on the number of threads contending for the lock.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another way to improve performance is to insert a delay between memory references so as to limit the bandwidth (traffic) each processor can use for spinning.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One way to guarantee fairness is to queue lock attempts in shared memory so that they happen in order.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
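&lt;br /&gt;
The backoff delay described above can be sketched in Python as follows. This is a minimal illustration that assumes a doubling delay with an upper bound; the function name backoff_delays and the parameters base and cap are hypothetical, and the cited paper may use a different policy.&lt;br /&gt;
&lt;br /&gt;
```python
def backoff_delays(attempts, base=1, cap=64):
    """Return the delay (in arbitrary time units) used before each retry.

    The delay doubles after every failed attempt and is capped, so that
    heavily contended locks are polled less and less often.
    """
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * 2, cap)   # double the wait, up to the cap
    return delays
```

For example, backoff_delays(5) yields the delays 1, 2, 4, 8 and 16: the more often a thread fails to acquire the lock, the longer it stays off the bus before trying again.&lt;br /&gt;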
&lt;br /&gt;
== Atomic Instructions ==&lt;br /&gt;
&lt;br /&gt;
Since disabling interrupts is not an effective way to execute code atomically on a multiprocessor system (the other processors keep running), there must be hardware support for atomic operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ability to execute an atomic instruction is a requirement for most lock implementations. It is important that when a processor attempts to set a lock, either the lock is fully set and the thread is able to enter into the critical section, or the lock is not set and it appears that none of the instructions required to set the lock have executed.  &lt;br /&gt;
&lt;br /&gt;
In the x86 instruction set, the CMPXCHG (compare and exchange) instruction can be used in a lock implementation to guarantee atomicity. The instruction takes a destination and a source operand. The accumulator is compared to the destination; if they are equal, the destination is loaded with the source, and if they are not equal, the accumulator is loaded with the destination value. To ensure that this is executed atomically on a multiprocessor, the instruction must be issued with the LOCK prefix. This is useful in implementing some locks, such as ticket locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; In the AT&amp;amp;T syntax used below, the instruction is written LOCK CMPXCHG SOURCE, DESTINATION. The source has to be a register, while the destination can be either a register or memory; the LOCK prefix is only valid with a memory destination. An example of LOCK CMPXCHG appears in the disassembly of the Linux kernel's rtc_cmos_read(). To take the lock, a CPU first reads the lock variable and then, if it is zero, updates it to a nonzero value. Another CPU may change the value between the read and the update, which is a race condition. LOCK CMPXCHG prevents this race by performing the comparison and the update as a single atomic step, which also keeps the busy-wait loop short. The code is given below&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 spin: # see if the lock-variable is clear&lt;br /&gt;
 mov cmos_lock, %eax&lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
 # ok, now we try to grab the lock&lt;br /&gt;
 lock cmpxchg %edx, cmos_lock&lt;br /&gt;
 # did another CPU grab it first? &lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
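&lt;br /&gt;
The behavior of the instruction in the loop above can be modeled in a few lines of Python. This is only a software model under stated assumptions, not kernel code: the class LockWord and the functions acquire and release are hypothetical names, and a threading.Lock stands in for the atomicity that the LOCK prefix provides in hardware.&lt;br /&gt;
&lt;br /&gt;
```python
import threading


class LockWord:
    """Software model of a memory word that supports LOCK CMPXCHG."""

    def __init__(self, value=0):
        self.value = value
        self._atomic = threading.Lock()   # stands in for the LOCK prefix

    def cmpxchg(self, accumulator, source):
        """Return (new accumulator, success flag), like %eax and ZF."""
        with self._atomic:
            if self.value == accumulator:
                self.value = source       # destination := source
                return accumulator, True
            return self.value, False      # accumulator := destination


def acquire(word):
    """Spin until the word is changed from 0 to 1 by this caller."""
    while True:
        if word.value != 0:               # see if the lock variable is clear
            continue
        _, ok = word.cmpxchg(0, 1)        # try to grab the lock atomically
        if ok:
            return                        # no other CPU grabbed it first


def release(word):
    """Clear the lock variable so that another caller can acquire it."""
    word.value = 0
```

If another thread changes the word between the read and the cmpxchg call, the comparison fails, the word is left untouched, and the caller goes back to spinning.&lt;br /&gt;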
&lt;br /&gt;
== Hand-off Lock ==&lt;br /&gt;
&lt;br /&gt;
Another type of lock that is not discussed in the text is known as the &amp;quot;hand-off&amp;quot; lock. With this lock, the first thread to arrive acquires the lock, since no other thread currently holds it. When another thread attempts to acquire the lock, it sees that the lock is in use and adds itself to a queue. That thread can then sleep until it is woken by the thread holding the lock. When the holder is finished, it passes the lock directly to the next thread in the queue.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
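&lt;br /&gt;
The hand-off behavior described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation from the cited slides: the class name HandOffLock and its fields are hypothetical, and threading.Event is used here as one way to put waiting threads to sleep.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
from collections import deque


class HandOffLock:
    """Lock that is passed directly to the longest-waiting thread."""

    def __init__(self):
        self._guard = threading.Lock()    # protects the two fields below
        self._held = False
        self._waiters = deque()           # FIFO queue of sleeping threads

    def acquire(self):
        with self._guard:
            if not self._held:
                self._held = True         # lock was free: take it
                return
            wakeup = threading.Event()
            self._waiters.append(wakeup)  # lock in use: join the queue
        wakeup.wait()                     # sleep until handed the lock

    def release(self):
        with self._guard:
            if self._waiters:
                # Hand the lock straight to the next thread in the queue.
                # _held stays True, so no other thread can barge in.
                self._waiters.popleft().set()
            else:
                self._held = False
```

Because the releasing thread never marks the lock as free while others are queued, threads acquire it in FIFO order, which addresses the fairness criterion discussed earlier.&lt;br /&gt;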
&lt;br /&gt;
== Avoiding Locks ==&lt;br /&gt;
&lt;br /&gt;
A programmer should try to write programs in a way that avoids locks where possible, because several problems can arise from their use.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One of the most well-known issues is deadlock. This can occur when threads are waiting to acquire a lock that will never be released. For example, if two threads are both spinning on a lock that stays locked, they will spin forever, because each thread 'thinks' that the other is inside the critical section.&lt;br /&gt;
&lt;br /&gt;
Another problem with using locks is that performance is not optimal, because a lock is often used when there is only a chance of conflict. This approach to programming yields slower performance than what might be possible with other methods. It also raises the question of granularity, that is, how much of the code should be protected by a critical section. The programmer must decide between many small (fine-grained) locks and fewer, more encompassing locks. This decision can greatly affect the performance of the program.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Barrier Implementations =&lt;br /&gt;
&lt;br /&gt;
In many ways, barriers are simpler than locks. A barrier is simply a point in a program that all participating threads must reach before any of them is allowed to continue. When using barriers, a programmer does not have to be concerned with advanced topics such as fairness, which must be considered when programming with locks.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate barrier performance, we can use these two criteria:&lt;br /&gt;
&lt;br /&gt;
*Latency - The time required to enter and exit a barrier.&lt;br /&gt;
*Traffic - Communications overhead required by the barrier.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Latency ===&lt;br /&gt;
&lt;br /&gt;
Ideally the latency of a barrier should be small, but for barriers that don't scale well latency can increase as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
&lt;br /&gt;
Low traffic is also desirable, as we do not want excess bus traffic to prevent scalability.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+Performance Comparison of Barrier Implementations&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! Criteria&lt;br /&gt;
! Sense Reversal&lt;br /&gt;
Centralized Barrier&lt;br /&gt;
! Combining Tree&lt;br /&gt;
Barrier&lt;br /&gt;
! Barrier Network&lt;br /&gt;
(Hardware)&lt;br /&gt;
|-&lt;br /&gt;
| Latency&lt;br /&gt;
| O(1)&lt;br /&gt;
| O(log p)&lt;br /&gt;
| O(log p)&lt;br /&gt;
|-&lt;br /&gt;
| Traffic&lt;br /&gt;
| O(p&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;)&lt;br /&gt;
| O(p)&lt;br /&gt;
| moved to a separate network&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki05.png|frame|none|Additive Schwarz Preconditioned Conjugate Gradient (ASPCG) kernel on Altix System - This figure shows the timings of the ASPCG kernel using the different barrier implementations. The blocking barrier does not scale with the number of threads, because contention increases as threads are added.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki06.png|frame|none|EPCC Microbenchmark - This figure shows the barrier timings of the EPCC microbenchmark using the different barrier implementations. The blocking barrier (centralized blocking barrier) does not scale with the number of threads, because contention increases as threads are added.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
== Sense-Reversal Barrier ==&lt;br /&gt;
&lt;br /&gt;
This is a centralized barrier in which a single count variable, protected by a lock, is shared among all the threads. On reaching the barrier, each thread increments the count and then waits until the count reaches the number of threads for which the barrier was set up.&lt;br /&gt;
&lt;br /&gt;
Since all the threads are spinning on a single variable, the miss rate scales quadratically with the number of processors.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki01.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
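&lt;br /&gt;
A minimal Python sketch of such a centralized barrier is given below. It is an illustration under stated assumptions rather than the implementation from the cited report: the class name SenseReversalBarrier and its fields are hypothetical, and the sense flag, which flips on every barrier episode so that the same barrier can be reused, follows the usual textbook formulation of sense reversal.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time


class SenseReversalBarrier:
    """Centralized barrier: one shared counter plus one shared sense flag."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.count = 0
        self.sense = False                  # shared release flag
        self._lock = threading.Lock()       # protects count
        self._local = threading.local()     # holds each thread's own sense

    def wait(self):
        # Each thread flips its private sense on every barrier episode.
        my_sense = not getattr(self._local, "sense", False)
        self._local.sense = my_sense
        with self._lock:
            self.count += 1
            if self.count == self.num_threads:
                self.count = 0              # reset for the next episode
                self.sense = my_sense       # last arrival releases the rest
        while self.sense != my_sense:       # busy-wait on the shared flag
            time.sleep(0)                   # yield so other threads can run
```

Every waiting thread spins on the same flag, which is the source of the traffic discussed above.&lt;br /&gt;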
&lt;br /&gt;
== Combining Tree Barrier ==&lt;br /&gt;
&lt;br /&gt;
This is a distributed barrier in which groups of processors form clusters, and each cluster updates its own local variable. When a local variable reaches a value equal to the number of threads updating it, the next variable up in the hierarchy of the combining tree is incremented. When the variable at the highest level of the hierarchy reaches its maximum value, all the threads have reached the barrier and synchronization is complete.&lt;br /&gt;
&lt;br /&gt;
Since in this case the threads update local variables in smaller groups, the miss rate is not as high as in the sense-reversal barrier.&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the operation of a combining tree barrier with the threads divided into two groups. The variables C0 and C1 are local to each group, and C2 is the variable at the higher level of the hierarchy in the tree.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki02.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
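&lt;br /&gt;
The way arrivals propagate up the tree can be sketched with a small single-threaded Python model. The names Node and arrive are hypothetical, and the group size of two threads per local counter is an assumption made for illustration; a real implementation would also need atomic increments and a release flag.&lt;br /&gt;
&lt;br /&gt;
```python
class Node:
    """One counter in the combining tree."""

    def __init__(self, expected, parent=None):
        self.expected = expected   # number of arrivals this counter waits for
        self.count = 0
        self.parent = parent       # None for the counter at the top


def arrive(node):
    """Record one arrival; return True if it completes the whole barrier."""
    node.count += 1
    if node.count != node.expected:
        return False               # others in this group are still missing
    node.count = 0                 # reset so the barrier can be reused
    if node.parent is None:
        return True                # top counter is full: barrier complete
    return arrive(node.parent)     # last arrival in a group moves up


# Two groups of two threads each, named after the counters in the text.
c2 = Node(expected=2)
c0 = Node(expected=2, parent=c2)
c1 = Node(expected=2, parent=c2)
results = [arrive(c0), arrive(c1), arrive(c0), arrive(c1)]
```

Only the last arrival in each group touches C2, so most updates stay within a group instead of contending for one shared variable.&lt;br /&gt;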
&lt;br /&gt;
== Tournament Barrier ==&lt;br /&gt;
&lt;br /&gt;
In a tournament barrier, the threads are the leaves of a binary tree, and each internal node represents a flag. Two threads compete at each node; the loser, that is, the thread that arrives later, sets the flag and moves up to the next level, where it competes to lose against the loser from the other section of the binary tree. Thus the thread that arrives last sets the highest flag in the binary tree. Setting this flag indicates to all the threads that the barrier is complete, and synchronization is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki03.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Disseminating Barrier ==&lt;br /&gt;
&lt;br /&gt;
In this barrier, each thread maintains a record of the activity of the other threads. In every round i, with n threads, thread A notifies thread (A + 2&amp;lt;sup&amp;gt;i&amp;lt;/sup&amp;gt;) mod n. Thus, after ⌈log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; n⌉ rounds, every thread knows, directly or indirectly, that every other thread has reached the barrier.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki04.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
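&lt;br /&gt;
The notification pattern can be checked with a short Python simulation. This is a sketch, not code from the cited slides: the function name dissemination_rounds is hypothetical, rounds are numbered from 0, and the threads are modeled as sets of identifiers rather than real threads.&lt;br /&gt;
&lt;br /&gt;
```python
import math


def dissemination_rounds(n):
    """Simulate the notification pattern for n threads (n is at least 1).

    Returns (rounds, known), where known[t] is the set of threads that
    thread t has heard about, directly or indirectly, after all rounds.
    """
    rounds = math.ceil(math.log2(n))
    known = [{t} for t in range(n)]
    for i in range(rounds):
        step = 2 ** i
        # All notifications of a round use the state from before the round.
        snapshot = [set(s) for s in known]
        for a in range(n):
            known[(a + step) % n] |= snapshot[a]
    return rounds, known
```

With n = 5 the simulation needs three rounds, after which every set in known contains all five thread identifiers.&lt;br /&gt;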
&lt;br /&gt;
= References = &lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; David Culler, J.P. Singh and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann. August 1998.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Yan Solihin. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin Books. August 2009.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-memory Multiprocessors. http://www.cc.gatech.edu/classes/AY2010/cs4210_fall/papers/anderson-spinlock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://www.statemaster.com/encyclopedia/Lock-(computer-science)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://faydoc.tripod.com/cpu/cmpxchg.htm&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/slides/threads3.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; Scalability evaluation of barrier algorithms for OpenMP. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; Carwyn Ball and Mark Bull. Barrier Synchronization in Java. http://www.ukhec.ac.uk/publications/reports/synch_java.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.cs.brown.edu/courses/cs176/barrier.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; http://heather.cs.ucdavis.edu/~matloff/50/PLN/lock.pdf&amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74654</id>
		<title>CSC/ECE 506 Spring 2013/9b sc</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74654"/>
		<updated>2013-04-03T19:28:19Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Synchronization =&lt;br /&gt;
In addition to a proper cache coherency model, it is important for a multiprocessor system to provide support for synchronization. This must be provided at the hardware level, and can also be implemented to some degree in software. The most common types of hardware synchronization are locks and barriers, which are discussed in this chapter.&lt;br /&gt;
&lt;br /&gt;
== Hardware vs. Operating System Synchronization ==&lt;br /&gt;
Synchronizations are characterized by three parts: acquire, wait, and release. Both hardware and operating system synchronization rely on atomic instructions for acquire and release, but the implementations for the wait portion of a synchronization can vary. Hardware implementations usually use busy-waiting, where a process repeatedly tests a variable waiting for it to change. The operating system can wait using blocking, where the process suspends itself until woken up by another process.&lt;br /&gt;
&lt;br /&gt;
Blocking has the advantage of freeing up the processor for use by other processes, but requires operating system functions to work and thus has a lot of overhead. Busy-waiting has much less overhead, but uses processor and cache resources while waiting. Because of these tradeoffs, it is usually better to use busy-waiting for short wait periods, and blocking for long wait periods.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Lock Implementations =&lt;br /&gt;
&lt;br /&gt;
Locks are an important concept when programming for multiprocessor systems. The purpose of a lock is to protect the code inside the lock, known as a critical section. The lock must guarantee that while some thread X is inside the critical section, no other thread Y is also inside it and possibly modifying critical values. While thread X is in the critical section, thread Y must wait until X has exited before entering. This can be accomplished in a variety of ways of varying complexity and performance.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate lock performance, we can use these four criteria:&lt;br /&gt;
&lt;br /&gt;
*Acquisition Latency - How much time does it take to acquire the lock?&lt;br /&gt;
*Traffic - How much bus traffic is generated by threads attempting to acquire the lock?&lt;br /&gt;
*Fairness - Are threads granted the lock in the order in which they requested it (FIFO), or by luck?&lt;br /&gt;
*Storage - How much storage is needed compared to the number of threads?&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Acquisition Latency ===&lt;br /&gt;
We want a low acquisition latency, especially for applications that repeatedly acquire and release locks. Over the run time of a program, the acquisition latency compounded many times could have a large effect on performance. However, low acquisition latency has to be balanced against other factors when implementing a lock.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
Traffic is an important consideration when evaluating lock performance. If acquiring a lock causes a lot of bus traffic, the lock will not scale well as the number of threads increases, and eventually the bus will become saturated.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Fairness ===&lt;br /&gt;
In general, threads should acquire a lock in the same order in which they requested it. If this does not happen, the chance that a thread will be starved grows as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
The amount of storage needed per lock should be small enough such that it scales to a large number of threads without any problems. If a large amount of storage is required, then multiple threads could cause locks to consume too much memory.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
&lt;br /&gt;
Two common lock implementations are test&amp;amp;set and test-and-test&amp;amp;set. In terms of performance, both have a low storage cost (scalable) and neither is fair. Test&amp;amp;set has a low acquisition latency but is not scalable due to high bus traffic when there is a lot of contention for the lock. Test-and-test&amp;amp;set has a higher acquisition latency but scales better because it generates less bus traffic. The next figure shows performance comparisons for these two lock implementations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:lockperf.png|frame|none|Performance comparison of test&amp;amp;set and test-and-test&amp;amp;set (spin on read)&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
=== Improving performance ===&lt;br /&gt;
Optimizing software can improve the performance of locks. For instance, inserting a delay between a processor noticing a release and then trying to acquire a lock can reduce contention and bus traffic, and increase the performance of the locks. TCP-like back off algorithms can be used to adjust this delay depending on the number of threads contending for the lock.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another way to improve performance is to insert a delay between memory references so as to limit the bandwidth (traffic) each processor can use for spinning.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One way to guarantee fairness is to queue lock attempts in shared memory so that they happen in order.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Atomic Instructions ==&lt;br /&gt;
&lt;br /&gt;
Since disabling interrupts is not an effective way to execute code atomically on a multiprocessor system (the other processors keep running), there must be hardware support for atomic operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ability to execute an atomic instruction is a requirement for most lock implementations. It is important that when a processor attempts to set a lock, either the lock is fully set and the thread is able to enter into the critical section, or the lock is not set and it appears that none of the instructions required to set the lock have executed.  &lt;br /&gt;
&lt;br /&gt;
In the x86 instruction set, the CMPXCHG (compare and exchange) instruction can be used in a lock implementation to guarantee atomicity. The instruction takes a destination and a source operand. The accumulator is compared to the destination; if they are equal, the destination is loaded with the source, and if they are not equal, the accumulator is loaded with the destination value. To ensure that this is executed atomically on a multiprocessor, the instruction must be issued with the LOCK prefix. This is useful in implementing some locks, such as ticket locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; In the AT&amp;amp;T syntax used below, the instruction is written LOCK CMPXCHG SOURCE, DESTINATION. The source has to be a register, while the destination can be either a register or memory; the LOCK prefix is only valid with a memory destination. An example of LOCK CMPXCHG appears in the disassembly of the Linux kernel's rtc_cmos_read(). To take the lock, a CPU first reads the lock variable and then, if it is zero, updates it to a nonzero value. Another CPU may change the value between the read and the update, which is a race condition. LOCK CMPXCHG prevents this race by performing the comparison and the update as a single atomic step, which also keeps the busy-wait loop short. The code is given below&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 spin: # see if the lock-variable is clear&lt;br /&gt;
 mov cmos_lock, %eax&lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
 # ok, now we try to grab the lock&lt;br /&gt;
 lock cmpxchg %edx, cmos_lock&lt;br /&gt;
 # did another CPU grab it first? &lt;br /&gt;
 test %eax, %eax&lt;br /&gt;
 jnz spin&lt;br /&gt;
&lt;br /&gt;
== Hand-off Lock ==&lt;br /&gt;
&lt;br /&gt;
Another type of lock that is not discussed in the text is known as the &amp;quot;hand-off&amp;quot; lock. With this lock, the first thread to arrive acquires the lock, since no other thread currently holds it. When another thread attempts to acquire the lock, it sees that the lock is in use and adds itself to a queue. That thread can then sleep until it is woken by the thread holding the lock. When the holder is finished, it passes the lock directly to the next thread in the queue.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Avoiding Locks ==&lt;br /&gt;
&lt;br /&gt;
A programmer should try to write programs in a way that avoids locks where possible, because several problems can arise from their use.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One of the most well-known issues is deadlock. This can occur when threads are waiting to acquire a lock that will never be released. For example, if two threads are both spinning on a lock that stays locked, they will spin forever, because each thread 'thinks' that the other is inside the critical section.&lt;br /&gt;
&lt;br /&gt;
Another problem with using locks is that performance is not optimal, because a lock is often used when there is only a chance of conflict. This approach to programming yields slower performance than what might be possible with other methods. It also raises the question of granularity, that is, how much of the code should be protected by a critical section. The programmer must decide between many small (fine-grained) locks and fewer, more encompassing locks. This decision can greatly affect the performance of the program.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Barrier Implementations =&lt;br /&gt;
&lt;br /&gt;
In many ways, barriers are simpler than locks. A barrier is simply a point in a program that all participating threads must reach before any of them is allowed to continue. When using barriers, a programmer does not have to be concerned with advanced topics such as fairness, which must be considered when programming with locks.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate barrier performance, we can use these two criteria:&lt;br /&gt;
&lt;br /&gt;
*Latency - The time required to enter and exit a barrier.&lt;br /&gt;
*Traffic - Communications overhead required by the barrier.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Latency ===&lt;br /&gt;
&lt;br /&gt;
Ideally the latency of a barrier should be small, but for barriers that don't scale well latency can increase as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
&lt;br /&gt;
Low traffic is also desirable, as we do not want excess bus traffic to prevent scalability.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+Performance Comparison of Barrier Implementations&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! Criteria&lt;br /&gt;
! Sense Reversal&lt;br /&gt;
Centralized Barrier&lt;br /&gt;
! Combining Tree&lt;br /&gt;
Barrier&lt;br /&gt;
! Barrier Network&lt;br /&gt;
(Hardware)&lt;br /&gt;
|-&lt;br /&gt;
| Latency&lt;br /&gt;
| O(1)&lt;br /&gt;
| O(log p)&lt;br /&gt;
| O(log p)&lt;br /&gt;
|-&lt;br /&gt;
| Traffic&lt;br /&gt;
| O(p&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;)&lt;br /&gt;
| O(p)&lt;br /&gt;
| moved to a separate network&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki05.png|frame|none|Additive Schwarz Preconditioned Conjugate Gradient (ASPCG) kernel on Altix System - This figure shows the timings of the ASPCG kernel using the different barrier implementations. The blocking barrier does not scale with the number of threads, because contention increases as threads are added.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki06.png|frame|none|EPCC Microbenchmark - This figure shows the barrier timings of the EPCC microbenchmark using the different barrier implementations. The blocking barrier (centralized blocking barrier) does not scale with the number of threads, because contention increases as threads are added.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
== Sense-Reversal Barrier ==&lt;br /&gt;
&lt;br /&gt;
This is a centralized barrier in which a single count variable, protected by a lock, is shared among all the threads. On reaching the barrier, each thread increments the count and then waits until the count reaches the number of threads for which the barrier was set up.&lt;br /&gt;
&lt;br /&gt;
Since all the threads are spinning on a single variable, the miss rate scales quadratically with the number of processors.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki01.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Combining Tree Barrier ==&lt;br /&gt;
&lt;br /&gt;
This is a distributed barrier in which groups of processors form clusters, and each cluster updates its own local variable. When a local variable reaches a value equal to the number of threads updating it, the next variable up in the hierarchy of the combining tree is incremented. When the variable at the highest level of the hierarchy reaches its maximum value, all the threads have reached the barrier and synchronization is complete.&lt;br /&gt;
&lt;br /&gt;
Since in this case the threads update local variables in smaller groups, the miss rate is not as high as in the sense-reversal barrier.&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the operation of a combining tree barrier with the threads divided into two groups. The variables C0 and C1 are local to each group, and C2 is the variable at the higher level of the hierarchy in the tree.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki02.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Tournament Barrier ==&lt;br /&gt;
&lt;br /&gt;
In a tournament barrier, the threads are the leaves of a binary tree, and each internal node represents a flag. Two threads compete at each node; the loser, that is, the thread that arrives later, sets the flag and moves up to the next level, where it competes to lose against the loser from the other section of the binary tree. Thus the thread that arrives last sets the highest flag in the binary tree. Setting this flag indicates to all the threads that the barrier is complete, and synchronization is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki03.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Disseminating Barrier ==&lt;br /&gt;
&lt;br /&gt;
In this barrier each thread maintains a record of the activity of the other threads. In every round i (starting at 0) with n threads, thread A notifies thread (A + 2&amp;lt;sup&amp;gt;i&amp;lt;/sup&amp;gt;) mod n. After log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; n rounds (rounded up), every thread knows that every other thread has reached the barrier.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki04.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= References = &lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; David Culler, J.P. Singh and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann. August 1998.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Yan Solihin. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin Books. August 2009.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-memory Multiprocessors. http://www.cc.gatech.edu/classes/AY2010/cs4210_fall/papers/anderson-spinlock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://www.statemaster.com/encyclopedia/Lock-(computer-science)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://faydoc.tripod.com/cpu/cmpxchg.htm&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/slides/threads3.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; Scalability evaluation of barrier algorithms for OpenMP. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; Carwyn Ball and Mark Bull. Barrier Synchronization in Java. http://www.ukhec.ac.uk/publications/reports/synch_java.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.cs.brown.edu/courses/cs176/barrier.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; http://heather.cs.ucdavis.edu/~matloff/50/PLN/lock.pdf&amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74652</id>
		<title>CSC/ECE 506 Spring 2013/9b sc</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/9b_sc&amp;diff=74652"/>
		<updated>2013-04-03T18:49:56Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: Created page with &amp;quot;= Synchronization = In addition to a proper cache coherency model, it is important for a multiprocessor system to provide support for synchronization. This must be provided at th...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Synchronization =&lt;br /&gt;
In addition to a proper cache coherency model, it is important for a multiprocessor system to provide support for synchronization. This must be provided at the hardware level, and can also be implemented to some degree in software. The most common types of hardware synchronization are locks and barriers, which are discussed in this chapter.&lt;br /&gt;
&lt;br /&gt;
== Hardware vs. Operating System Synchronization ==&lt;br /&gt;
Synchronizations are characterized by three parts: acquire, wait, and release. Both hardware and operating system synchronization rely on atomic instructions for acquire and release, but the implementations for the wait portion of a synchronization can vary. Hardware implementations usually use busy-waiting, where a process repeatedly tests a variable waiting for it to change. The operating system can wait using blocking, where the process suspends itself until woken up by another process.&lt;br /&gt;
&lt;br /&gt;
Blocking has the advantage of freeing up the processor for use by other processes, but requires operating system functions to work and thus has a lot of overhead. Busy-waiting has much less overhead, but uses processor and cache resources while waiting. Because of these tradeoffs, it is usually better to use busy-waiting for short wait periods, and blocking for long wait periods.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Lock Implementations =&lt;br /&gt;
&lt;br /&gt;
Locks are an important concept when programming for multiprocessor systems. The purpose of a lock is to protect the code inside the lock, known as a critical section.  It must be certain that while some thread X has entered the critical section, another thread Y is not also inside of the critical section and possibly modifying critical values. While thread X has entered the critical section, thread Y must wait until X has exited before entering. This can be accomplished in a variety of ways of varying complexity and performance.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate lock performance, we can use these four criteria:&lt;br /&gt;
&lt;br /&gt;
*Acquisition Latency - How much time does it take to acquire the lock?&lt;br /&gt;
*Traffic - How much bus traffic is generated by threads attempting to acquire the lock?&lt;br /&gt;
*Fairness - Are threads granted the lock in the order they requested it (FIFO), or is it a matter of luck?&lt;br /&gt;
*Storage - How much storage is needed compared to the number of threads?&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Acquisition Latency ===&lt;br /&gt;
We want a low acquisition latency, especially for applications that repeatedly acquire and release locks. Over the run time of a program, the acquisition latency compounded many times could have a large performance effect. However, low acquisition latency has to be balanced against other factors when implementing a lock.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
Traffic is an important consideration when evaluating lock performance. If acquiring a lock causes a lot of bus traffic, the lock will not scale well as the number of threads increases, and eventually the bus will become choked.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Fairness ===&lt;br /&gt;
In general, threads should acquire a lock in the same order in which they requested it. If this does not happen, the chance that a thread will be starved grows as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
The amount of storage needed per lock should be small enough such that it scales to a large number of threads without any problems. If a large amount of storage is required, then multiple threads could cause locks to consume too much memory.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
&lt;br /&gt;
Two common lock implementations are test&amp;amp;set and test-and-test&amp;amp;set. In terms of performance, both have a low storage cost (scalable) and neither is fair. Test&amp;amp;set has a low acquisition latency but is not scalable due to high bus traffic when there is a lot of contention for the lock. Test-and-test&amp;amp;set has a higher acquisition latency but scales better because it generates less bus traffic. The next figure shows performance comparisons for these two lock implementations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:lockperf.png|frame|none|Performance comparison of test&amp;amp;set and test-and-test&amp;amp;set (spin on read)&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
=== Improving performance ===&lt;br /&gt;
Optimizing software can improve the performance of locks. For instance, inserting a delay between a processor noticing a release and then trying to acquire a lock can reduce contention and bus traffic, and increase the performance of the locks. TCP-like back off algorithms can be used to adjust this delay depending on the number of threads contending for the lock.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
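The delay described above can be sketched as randomized exponential backoff. This is only an illustration under assumptions: the names backoff_delay and backoff_ceilings, the base of 1.0, and the cap of 64.0 are hypothetical and do not come from the cited paper.&lt;br /&gt;
&lt;br /&gt;
```python
import random


def backoff_ceilings(attempts, base=1.0, cap=64.0):
    """Upper bound on the delay after each failed attempt (0-based)."""
    # The bound doubles with every failed attempt until it hits the cap.
    return [min(cap, base * (2 ** a)) for a in range(attempts)]


def backoff_delay(attempt, base=1.0, cap=64.0, rng=random):
    """Pick a random delay below the bound for this attempt."""
    ceiling = min(cap, base * (2 ** attempt))
    # Randomizing keeps contending threads from retrying in lockstep.
    return rng.uniform(0.0, ceiling)
```
&lt;br /&gt;
A thread that keeps failing to acquire the lock waits longer on average before each retry, which is how the delay adapts to the level of contention.&lt;br /&gt;
&lt;br /&gt;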
Another way to improve performance is to insert a delay between memory references so as to limit the bandwidth (traffic) each processor can use for spinning.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One way to guarantee fairness is to queue lock attempts in shared memory so that they happen in order.&amp;lt;sup&amp;gt;&amp;lt;span&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Atomic Instructions ==&lt;br /&gt;
&lt;br /&gt;
Since a multiprocessor system cannot disable interrupts as an effective method to execute code atomically, there must be hardware support for atomic operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ability to execute an atomic instruction is a requirement for most lock implementations. It is important that when a processor attempts to set a lock, either the lock is fully set and the thread is able to enter into the critical section, or the lock is not set and it appears that none of the instructions required to set the lock have executed.  &lt;br /&gt;
&lt;br /&gt;
In the x86 instruction set the opcode CMPXCHG (compare and exchange) can be used in a lock implementation in order to guarantee atomicity. The instruction takes a destination operand and a source operand. The accumulator is compared to the destination and, if they are equal, the destination is loaded with the source. If they are not equal, the accumulator is loaded with the destination value. To ensure that this executes atomically across processors, the opcode must be issued with the LOCK prefix. This is useful in implementing some locks, such as ticket locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
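The compare-and-exchange behaviour can be modelled in a few lines. This is a software model of the semantics only, not of the hardware: the class Cell and the helpers try_acquire and release are hypothetical names, and a Python lock stands in for the atomicity that the LOCK prefix provides.&lt;br /&gt;
&lt;br /&gt;
```python
import threading


class Cell:
    """A memory word with an atomic compare-and-exchange operation."""

    def __init__(self, value):
        self.value = value
        self._guard = threading.Lock()  # stand-in for hardware atomicity

    def cmpxchg(self, accumulator, source):
        """Return (success, accumulator value after the operation)."""
        with self._guard:
            if self.value == accumulator:
                self.value = source       # equal: destination takes the source
                return True, accumulator
            return False, self.value      # not equal: accumulator takes destination


def try_acquire(lock_cell):
    """Take the lock only if it is currently free (0 means free, 1 means held)."""
    success, _ = lock_cell.cmpxchg(0, 1)
    return success


def release(lock_cell):
    """Mark the lock as free again."""
    lock_cell.value = 0
```
&lt;br /&gt;
A thread that gets False from try_acquire would retry, which is the busy-waiting described earlier.&lt;br /&gt;
&lt;br /&gt;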
== Hand-off Lock ==&lt;br /&gt;
&lt;br /&gt;
Another type of lock that is not discussed in the text is known as the &amp;quot;hand-off&amp;quot; lock. In this lock the first thread acquires the lock because no other thread currently holds it. When another thread attempts to acquire the lock, it sees that the lock is in use and adds itself to a queue. That thread can then sleep until woken by the thread holding the lock. Once the holder is finished, it passes the lock to the next thread in the queue.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Avoiding Locks ==&lt;br /&gt;
&lt;br /&gt;
There are many reasons why a programmer should attempt to write programs in such a way as to avoid locks if possible.  There are many problems that can arise with the use of locks.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
One of the most well-known issues is deadlock.  This can occur when threads are waiting to acquire a lock that will never be released.  For example, if two threads are both spinning on a lock that is locked and never unlocked, they will continue to spin forever, as each thread 'thinks' that the other is inside the critical section.&lt;br /&gt;
&lt;br /&gt;
Another problem with using locks is that the performance is not optimal, as often a lock is used when there is only a chance of conflict.  This approach to programming yields slower performance than what might be possible with other methods.  This also leads to questions of granularity, that is, how much of the code should be protected under the critical section.  The programmer must decide between many small (fine-grained) locks or fewer, more encompassing locks.  This decision can greatly affect the performance of the program.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Barrier Implementations =&lt;br /&gt;
&lt;br /&gt;
In many ways, barriers are simpler than locks.  A barrier is simply a point in a program that all participating threads must reach before any of them is allowed to continue.  When using barriers, a programmer does not have to be concerned with topics such as fairness that must be considered when programming with locks.&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
To evaluate barrier performance, we can use these two criteria:&lt;br /&gt;
&lt;br /&gt;
*Latency - The time required to enter and exit a barrier.&lt;br /&gt;
*Traffic - Communications overhead required by the barrier.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Latency ===&lt;br /&gt;
&lt;br /&gt;
Ideally the latency of a barrier should be small, but for barriers that don't scale well latency can increase as the number of threads increases.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Traffic ===&lt;br /&gt;
&lt;br /&gt;
Low traffic is also good, as we do not want excess bus traffic to prevent scalability.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Performance Comparison ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+Performance Comparison of Barrier Implementations&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
! Criteria&lt;br /&gt;
! Sense Reversal&lt;br /&gt;
Centralized Barrier&lt;br /&gt;
! Combining Tree&lt;br /&gt;
Barrier&lt;br /&gt;
! Barrier Network&lt;br /&gt;
(Hardware)&lt;br /&gt;
|-&lt;br /&gt;
| Latency&lt;br /&gt;
| O(1)&lt;br /&gt;
| O(log p)&lt;br /&gt;
| O(log p)&lt;br /&gt;
|-&lt;br /&gt;
| Traffic&lt;br /&gt;
| O(p&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;)&lt;br /&gt;
| O(p)&lt;br /&gt;
| moved to a separate network&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki05.png|frame|none|Additive Schwarz Preconditioned Conjugate Gradient (ASPCG) kernel on Altix System - This figure shows the timings of the ASPCG kernel using the different barrier implementations. It can be seen that the blocking barrier does not scale with the number of threads, because contention increases as the number of threads grows.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki06.png|frame|none|EPCC Microbenchmark - This figure shows the time taken by a barrier in the EPCC Microbenchmark using the different barrier implementations. It can be seen that the blocking barrier/centralized blocking barrier does not scale with the number of threads, because contention increases as the number of threads grows.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
== Sense-Reversal Barrier ==&lt;br /&gt;
&lt;br /&gt;
This is a centralized barrier in which a single count variable, protected by a lock, is shared among all the threads. Each thread increments the count on reaching the barrier and then waits until the count equals the number of threads for which the barrier was set up.&lt;br /&gt;
&lt;br /&gt;
Since all the threads spin on a single shared variable, the miss rate scales quadratically with the number of processors.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki01.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
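A centralized barrier of this kind can be sketched as follows. The shared count protected by a lock follows the description above; the shared sense flag and the per-thread private sense are assumptions based on the usual form of a sense-reversing barrier, and the names SenseReversalBarrier and run_phases are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time


class SenseReversalBarrier:
    """Centralized barrier: one shared count plus one shared sense flag."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.count = 0                    # arrivals in the current episode
        self.sense = False                # flag that every waiter spins on
        self.lock = threading.Lock()      # protects count
        self.private = threading.local()  # holds each thread's own sense

    def wait(self):
        # Each thread flips its private sense once per barrier episode.
        my_sense = not getattr(self.private, 'sense', False)
        self.private.sense = my_sense
        with self.lock:
            self.count += 1
            if self.count == self.num_threads:
                self.count = 0            # reset so the barrier can be reused
                self.sense = my_sense     # flipping the flag releases the waiters
        while self.sense != my_sense:
            time.sleep(0)                 # busy-wait, yielding to other threads


def run_phases(num_threads, phases):
    """Each thread logs its phase number, then waits at the barrier."""
    barrier = SenseReversalBarrier(num_threads)
    log = []

    def worker():
        for phase in range(phases):
            log.append(phase)
            barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```
&lt;br /&gt;
Because every waiter spins on the same flag, each flip is seen by all of them at once, which is the source of the traffic discussed above. In run_phases no thread logs the next phase until every thread has logged the current one.&lt;br /&gt;
&lt;br /&gt;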
== Combining Tree Barrier ==&lt;br /&gt;
&lt;br /&gt;
This is a distributed barrier in which groups of processors form clusters, and each cluster updates its own local variable. When a local variable reaches the number of threads that update it, the last thread to arrive increments a variable one level higher in the combining tree. When the variable at the top of the tree reaches its maximum value, all the threads have reached the barrier and synchronization is complete.&lt;br /&gt;
&lt;br /&gt;
Because the threads update local variables in smaller groups rather than one shared variable, the miss rate is lower than that of the sense-reversal barrier.&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the operation of a combining tree barrier with the threads split into two groups. The variables C0 and C1 are local to each group, and C2 is the variable one level higher in the tree.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki02.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
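The way arrivals propagate up a two-level tree like the one in the diagram can be modelled as below. This is a single-threaded model of the counters only; a real barrier would update them atomically and then release the waiting threads. The class name CombiningTree and its fields are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
class CombiningTree:
    """Two-level combining tree: one counter per group plus a root counter."""

    def __init__(self, group_sizes):
        self.group_sizes = list(group_sizes)
        self.leaf_counts = [0] * len(self.group_sizes)  # C0, C1 in the diagram
        self.root_count = 0                             # C2 in the diagram

    def arrive(self, group):
        """Record one arrival in a group; return True once the barrier is complete."""
        self.leaf_counts[group] += 1
        if self.leaf_counts[group] == self.group_sizes[group]:
            # The last arrival in a group is the only one that touches the root.
            self.root_count += 1
        return self.root_count == len(self.group_sizes)
```
&lt;br /&gt;
With two groups of two threads, the root counter is incremented only twice even though four threads arrive, which is why contention on any single variable stays low.&lt;br /&gt;
&lt;br /&gt;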
== Tournament Barrier ==&lt;br /&gt;
&lt;br /&gt;
In a tournament barrier, the threads are the leaves of a binary tree and each internal node represents a flag. Two threads compete at each node; the loser sets the flag, moves up a level, and competes to lose against the loser from the other section of the tree. The thread that completes last therefore sets the flag at the top of the tree. Setting that flag signals to all the threads that the barrier is complete, and synchronization is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki03.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Disseminating Barrier ==&lt;br /&gt;
&lt;br /&gt;
In this barrier each thread maintains a record of the activity of the other threads. In every round i (starting at 0) with n threads, thread A notifies thread (A + 2&amp;lt;sup&amp;gt;i&amp;lt;/sup&amp;gt;) mod n. After log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt; n rounds (rounded up), every thread knows that every other thread has reached the barrier.&lt;br /&gt;
&lt;br /&gt;
[[Image:Ch9wiki04.png]]&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
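The notification pattern can be checked with a small simulation. It assumes the usual form of this barrier, in which thread A notifies thread (A + 2 to the power i) mod n in round i, and it runs the rounds sequentially instead of using real threads. The function names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def dissemination_schedule(n):
    """For each round i, list the thread notified by thread 0, 1, ..., n - 1."""
    rounds = (n - 1).bit_length()  # log2(n) rounded up, for n of at least 1
    return [[(a + 2 ** i) % n for a in range(n)] for i in range(rounds)]


def knowledge_after_all_rounds(n):
    """Return, per thread, the set of threads it knows have arrived."""
    known = [{a} for a in range(n)]
    for targets in dissemination_schedule(n):
        # A notification carries what the sender knew when the round started.
        snapshot = [set(k) for k in known]
        for sender, target in enumerate(targets):
            known[target] |= snapshot[sender]
    return known
```
&lt;br /&gt;
For six threads the simulation needs three rounds, after which every thread knows about all six arrivals.&lt;br /&gt;
&lt;br /&gt;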
= References = &lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; David Culler, J.P. Singh and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann. August 1998.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Yan Solihin. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin Books. August 2009.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Thomas E. Anderson. The Performance of Spin Lock Alternatives for Shared-memory Multiprocessors. http://www.cc.gatech.edu/classes/AY2010/cs4210_fall/papers/anderson-spinlock.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://www.statemaster.com/encyclopedia/Lock-(computer-science)&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://faydoc.tripod.com/cpu/cmpxchg.htm&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; http://www.cs.duke.edu/courses/fall09/cps110/slides/threads3.ppt&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; Scalability evaluation of barrier algorithms for OpenMP. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; Carwyn Ball and Mark Bull. Barrier Synchronization in Java. http://www.ukhec.ac.uk/publications/reports/synch_java.pdf&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.cs.brown.edu/courses/cs176/barrier.ppt&amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=74651</id>
		<title>ECE506 Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=74651"/>
		<updated>2013-04-03T18:49:23Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Supplements to Solihin Text */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.&lt;br /&gt;
&lt;br /&gt;
=Supplements to Solihin Text=&lt;br /&gt;
&lt;br /&gt;
Post links to the textbook supplements in this section.&lt;br /&gt;
*Chapter 2 [[CSC/ECE 506 Spring 2011/ch2 dm | CSC/ECE 506 Spring 2011/ch2 dm]]&lt;br /&gt;
*Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]&lt;br /&gt;
*Chapter 2 (Still being revised) [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring 2011/ch2a mc | Current Data-Parallel Architectures ]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring_2012/2a va ]]&lt;br /&gt;
*Chapter 2b [[CSC/ECE 506 Spring 2012/ch2b cm | CSC/ECE 506 Spring 2012/ch2b cm]]&lt;br /&gt;
*Chapter 2b [[ECE506_CSC/ECE_506_Spring_2012/2b_az | CSC/ECE 506 Spring 2012/2b az - Data-Parallel Processing with the AMD HD 6900 Series Graphics Processing Unit]]&lt;br /&gt;
*Chapter 3 (Final Revision) [[ CSC/ECE 506 Spring 2011/ch3 ab | Parallel Architecture Mechanisms and Programming Models ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a ob | Parallelization of Nelder Mead Algorithm ]]&lt;br /&gt;
*Chapter 4a (Under Construction) [[ CSC/ECE_506_Spring_2011/ch4a_bm | Parallelization of Algorithms  ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a zz | CSC/ECE 506 Spring 2011/ch4a zz ]]&lt;br /&gt;
*Chapter 4b [[Chapter 4b CSC/ECE 506 Spring 2011 / ch4b]]&lt;br /&gt;
*Chapter 5a [[ CSC/ECE 506 Spring 2012/ch5a ja | CSC/ECE 506 Spring 2012/ch5a ja ]]&lt;br /&gt;
*Chapter 9a [[CSC/ECE 506 Spring 2012/ch9a cm | CSC/ECE 506 Spring 2012/ch9a cm]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a jp | CSC/ECE 506 Spring 2011/ch6a jp ]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a ep | CSC/ECE 506 Spring 2011/ch6a ep ]]&lt;br /&gt;
*Chapter 6b (Ready for First Review) [[CSC/ECE 506 Spring 2011/ch6b ab | CSC/ECE 506 Spring 2011/ch6b ab]]&lt;br /&gt;
*Chapter 7 (Under Construction) [[CSC/ECE 506 Spring 2011/ch7 jp | CSC/ECE 506 Spring 2011/ch7 jp]]&lt;br /&gt;
*Chapter 8 [[CSC/ECE 506 Spring 2011/ch8 mc | CSC/ECE 506 Spring 2011/ch8 mc]]&lt;br /&gt;
*Chapter 10 (Under Construction) [[CSC/ECE 506 Spring 2011/ch10 sb | CSC/ECE 506 Spring 2011/ch10 sb]]&lt;br /&gt;
*Chapter 10 [[CSC/ECE 506 Spring 2012/ch10 sj | CSC/ECE 506 Spring 2012/ch10 sj]]&lt;br /&gt;
*Chapter 10a [[CSC/ECE_506_Spring_2011/ch10a_dc | CSC/ECE_506_Spring_2011/ch10a_dc]]&lt;br /&gt;
*Chapter 11 [[CSC/ECE_506_Spring_2011/ch11_BB_EP | Chapter 11 Supplement]]&lt;br /&gt;
*Chapter 11 [[Scalable_Coherent_Interface | SCI (Scalable Coherent Interface) ]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 ob | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 (Ready for Final Review) [[ CSC/ECE 506 Spring 2011/ch12 aj | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 | Interconnection Network Topologies]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a ry]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c dm]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c cl]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a mw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3a yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/7b yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3b sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/4b rs]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/6b am]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/8a cj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a dr]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a jp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/9a ms]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10b sr]]&lt;br /&gt;
*Chapter 11a [[ECE506_CSC/ECE_506_Spring_2012/11a_az | CSC/ECE 506 Spring 2012/11a az - Performance of DSM system]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/12b jh]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a fu]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/11a ht]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1b dj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1a sp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1d ks]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/2b so]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1c ad]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/3b xz]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_aj]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_ss]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/1a_ag]]&lt;br /&gt;
* Chapter 3a [[CSC/ECE_506_Spring_2013/3a_bs]]&lt;br /&gt;
* Chapter 6a [[CSC/ECE_506_Spring_2013/6a_cs]]&lt;br /&gt;
* Chapter 5a [[CSC/ECE_506_Spring_2013/5a_ks]]&lt;br /&gt;
* Chapter 8a [[CSC/ECE_506_Spring_2013/8a_an]]&lt;br /&gt;
* Chapter 7a [[CSC/ECE_506_Spring_2013/7a_bs]]&lt;br /&gt;
* Chapter 8b [[CSC/ECE_506_Spring_2013/8b_ap]]&lt;br /&gt;
* Chapter 8c [[CSC/ECE_506_Spring_2013/8c_da]]&lt;br /&gt;
* Chapter 10a [[CSC/ECE_506_Spring_2013/10a_os]]&lt;br /&gt;
* Chapter 10c [[CSC/ECE_506_Spring_2013/10c_ks]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/9b_sc]]&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=72632</id>
		<title>ECE506 Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=72632"/>
		<updated>2013-02-12T20:59:29Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Supplements to Solihin Text */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.&lt;br /&gt;
&lt;br /&gt;
=Supplements to Solihin Text=&lt;br /&gt;
&lt;br /&gt;
Post links to the textbook supplements in this section.&lt;br /&gt;
*Chapter 2 [[CSC/ECE 506 Spring 2011/ch2 dm | CSC/ECE 506 Spring 2011/ch2 dm]]&lt;br /&gt;
*Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]&lt;br /&gt;
*Chapter 2 (Still being revised) [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring 2011/ch2a mc | Current Data-Parallel Architectures ]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring_2012/2a va ]]&lt;br /&gt;
*Chapter 2b [[CSC/ECE 506 Spring 2012/ch2b cm | CSC/ECE 506 Spring 2012/ch2b cm]]&lt;br /&gt;
*Chapter 2b [[ECE506_CSC/ECE_506_Spring_2012/2b_az | CSC/ECE 506 Spring 2012/2b az - Data-Parallel Processing with the AMD HD 6900 Series Graphics Processing Unit]]&lt;br /&gt;
*Chapter 3 (Final Revision) [[ CSC/ECE 506 Spring 2011/ch3 ab | Parallel Architecture Mechanisms and Programming Models ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a ob | Parallelization of Nelder Mead Algorithm ]]&lt;br /&gt;
*Chapter 4a (Under Construction) [[ CSC/ECE_506_Spring_2011/ch4a_bm | Parallelization of Algorithms  ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a zz | CSC/ECE 506 Spring 2011/ch4a zz ]]&lt;br /&gt;
*Chapter 4b [[Chapter 4b CSC/ECE 506 Spring 2011 / ch4b]]&lt;br /&gt;
*Chapter 5a [[ CSC/ECE 506 Spring 2012/ch5a ja | CSC/ECE 506 Spring 2012/ch5a ja ]]&lt;br /&gt;
*Chapter 9a [[CSC/ECE 506 Spring 2012/ch9a cm | CSC/ECE 506 Spring 2012/ch9a cm]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a jp | CSC/ECE 506 Spring 2011/ch6a jp ]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a ep | CSC/ECE 506 Spring 2011/ch6a ep ]]&lt;br /&gt;
*Chapter 6b (Ready for First Review) [[CSC/ECE 506 Spring 2011/ch6b ab | CSC/ECE 506 Spring 2011/ch6b ab]]&lt;br /&gt;
*Chapter 7 (Under Construction) [[CSC/ECE 506 Spring 2011/ch7 jp | CSC/ECE 506 Spring 2011/ch7 jp]]&lt;br /&gt;
*Chapter 8 [[CSC/ECE 506 Spring 2011/ch8 mc | CSC/ECE 506 Spring 2011/ch8 mc]]&lt;br /&gt;
*Chapter 10 (Under Construction) [[CSC/ECE 506 Spring 2011/ch10 sb | CSC/ECE 506 Spring 2011/ch10 sb]]&lt;br /&gt;
*Chapter 10 [[CSC/ECE 506 Spring 2012/ch10 sj | CSC/ECE 506 Spring 2012/ch10 sj]]&lt;br /&gt;
*Chapter 10a [[CSC/ECE_506_Spring_2011/ch10a_dc | CSC/ECE_506_Spring_2011/ch10a_dc]]&lt;br /&gt;
*Chapter 11 [[CSC/ECE_506_Spring_2011/ch11_BB_EP | Chapter 11 Supplement]]&lt;br /&gt;
*Chapter 11 [[Scalable_Coherent_Interface | SCI (Scalable Coherent Interface) ]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 ob | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 (Ready for Final Review) [[ CSC/ECE 506 Spring 2011/ch12 aj | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 | Interconnection Network Topologies]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a ry]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c dm]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c cl]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a mw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3a yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/7b yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3b sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/4b rs]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/6b am]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/8a cj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a dr]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a jp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/9a ms]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10b sr]]&lt;br /&gt;
*Chapter 11a [[ECE506_CSC/ECE_506_Spring_2012/11a_az | CSC/ECE 506 Spring 2012/11a az - Performance of DSM system]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/12b jh]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a fu]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/11a ht]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1b dj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1a sp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1d ks]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/2b so]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1c ad]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/3b xz]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_aj]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_ss]]&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=72631</id>
		<title>ECE506 Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=72631"/>
		<updated>2013-02-12T20:59:01Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Supplements to Solihin Text */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.&lt;br /&gt;
&lt;br /&gt;
=Supplements to Solihin Text=&lt;br /&gt;
&lt;br /&gt;
Post links to the textbook supplements in this section.&lt;br /&gt;
*Chapter 2 [[CSC/ECE 506 Spring 2011/ch2 dm | CSC/ECE 506 Spring 2011/ch2 dm]]&lt;br /&gt;
*Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]&lt;br /&gt;
*Chapter 2 (Still being revised) [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring 2011/ch2a mc | Current Data-Parallel Architectures ]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring_2012/2a va ]]&lt;br /&gt;
*Chapter 2b [[CSC/ECE 506 Spring 2012/ch2b cm | CSC/ECE 506 Spring 2012/ch2b cm]]&lt;br /&gt;
*Chapter 2b [[ECE506_CSC/ECE_506_Spring_2012/2b_az | CSC/ECE 506 Spring 2012/2b az - Data-Parallel Processing with the AMD HD 6900 Series Graphics Processing Unit]]&lt;br /&gt;
*Chapter 3 (Final Revision) [[ CSC/ECE 506 Spring 2011/ch3 ab | Parallel Architecture Mechanisms and Programming Models ]]&lt;br /&gt;
*Chapter 4a[[ CSC/ECE 506 Spring 2011/ch4a ob | Parallelization of Nelder Mead Algorithm ]]&lt;br /&gt;
*Chapter 4a (Under Construction) [[ CSC/ECE_506_Spring_2011/ch4a_bm | Parallelization of Algorithms  ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a zz | CSC/ECE 506 Spring 2011/ch4a zz ]]&lt;br /&gt;
*Chapter 4b [[Chapter 4b CSC/ECE 506 Spring 2011 / ch4b]]&lt;br /&gt;
*Chapter 5a [[ CSC/ECE 506 Spring 2012/ch5a ja | CSC/ECE 506 Spring 2012/ch5a ja ]]&lt;br /&gt;
*Chapter 9a [[CSC/ECE 506 Spring 2012/ch9a cm | CSC/ECE 506 Spring 2012/ch9a cm]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a jp | CSC/ECE 506 Spring 2011/ch6a jp ]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a ep | CSC/ECE 506 Spring 2011/ch6a ep ]]&lt;br /&gt;
*Chapter 6b (Ready for First Review) [[CSC/ECE 506 Spring 2011/ch6b ab | CSC/ECE 506 Spring 2011/ch6b ab]]&lt;br /&gt;
*Chapter 7 (Under Construction) [[CSC/ECE 506 Spring 2011/ch7 jp | CSC/ECE 506 Spring 2011/ch7 jp]]&lt;br /&gt;
*Chapter 8 [[CSC/ECE 506 Spring 2011/ch8 mc | CSC/ECE 506 Spring 2011/ch8 mc]]&lt;br /&gt;
*Chapter 10 (Under Construction) [[CSC/ECE 506 Spring 2011/ch10 sb | CSC/ECE 506 Spring 2011/ch10 sb]]&lt;br /&gt;
*Chapter 10 [[CSC/ECE 506 Spring 2012/ch10 sj | CSC/ECE 506 Spring 2012/ch10 sj]]&lt;br /&gt;
*Chapter 10a [[CSC/ECE_506_Spring_2011/ch10a_dc | CSC/ECE_506_Spring_2011/ch10a_dc]]&lt;br /&gt;
*Chapter 11 [[CSC/ECE_506_Spring_2011/ch11_BB_EP | Chapter 11 Supplement]]&lt;br /&gt;
*Chapter 11 [[Scalable_Coherent_Interface | SCI (Scalable Coherent Interface) ]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 ob | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 (Ready for Final Review) [[ CSC/ECE 506 Spring 2011/ch12 aj | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 | Interconnection Network Topologies]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a ry]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c dm]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c cl]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a mw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3a yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/7b yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3b sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/4b rs]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/6b am]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/8a cj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a dr]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a jp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/9a ms]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10b sr]]&lt;br /&gt;
*Chapter 11a [[ECE506_CSC/ECE_506_Spring_2012/11a_az | CSC/ECE 506 Spring 2012/11a az - Performance of DSM system]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/12b jh]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a fu]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/11a ht]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1b dj]]&lt;br /&gt;
*[[MISD Architecture(CSC/ECE 506 Spring 2013/1a sp)]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1d ks]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/2b so]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1c ad]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/3b xz]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_aj]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_ss]]&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=72630</id>
		<title>ECE506 Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=72630"/>
		<updated>2013-02-12T20:58:36Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Supplements to Solihin Text */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.&lt;br /&gt;
&lt;br /&gt;
=Supplements to Solihin Text=&lt;br /&gt;
&lt;br /&gt;
Post links to the textbook supplements in this section.&lt;br /&gt;
*Chapter 2 [[CSC/ECE 506 Spring 2011/ch2 dm | CSC/ECE 506 Spring 2011/ch2 dm]]&lt;br /&gt;
*Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]&lt;br /&gt;
*Chapter 2 (Still being revised) [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring 2011/ch2a mc | Current Data-Parallel Architectures ]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring_2012/2a va ]]&lt;br /&gt;
*Chapter 2b [[CSC/ECE 506 Spring 2012/ch2b cm | CSC/ECE 506 Spring 2012/ch2b cm]]&lt;br /&gt;
*Chapter 2b [[ECE506_CSC/ECE_506_Spring_2012/2b_az | CSC/ECE 506 Spring 2012/2b az - Data-Parallel Processing with the AMD HD 6900 Series Graphics Processing Unit]]&lt;br /&gt;
*Chapter 3 (Final Revision) [[ CSC/ECE 506 Spring 2011/ch3 ab | Parallel Architecture Mechanisms and Programming Models ]]&lt;br /&gt;
*Chapter 4a[[ CSC/ECE 506 Spring 2011/ch4a ob | Parallelization of Nelder Mead Algorithm ]]&lt;br /&gt;
*Chapter 4a (Under Construction) [[ CSC/ECE_506_Spring_2011/ch4a_bm | Parallelization of Algorithms  ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a zz | CSC/ECE 506 Spring 2011/ch4a zz ]]&lt;br /&gt;
*Chapter 4b [[Chapter 4b CSC/ECE 506 Spring 2011 / ch4b]]&lt;br /&gt;
*Chapter 5a [[ CSC/ECE 506 Spring 2012/ch5a ja | CSC/ECE 506 Spring 2012/ch5a ja ]]&lt;br /&gt;
*Chapter 9a [[CSC/ECE 506 Spring 2012/ch9a cm | CSC/ECE 506 Spring 2012/ch9a cm]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a jp | CSC/ECE 506 Spring 2011/ch6a jp ]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a ep | CSC/ECE 506 Spring 2011/ch6a ep ]]&lt;br /&gt;
*Chapter 6b (Ready for First Review) [[CSC/ECE 506 Spring 2011/ch6b ab | CSC/ECE 506 Spring 2011/ch6b ab]]&lt;br /&gt;
*Chapter 7 (Under Construction) [[CSC/ECE 506 Spring 2011/ch7 jp | CSC/ECE 506 Spring 2011/ch7 jp]]&lt;br /&gt;
*Chapter 8 [[CSC/ECE 506 Spring 2011/ch8 mc | CSC/ECE 506 Spring 2011/ch8 mc]]&lt;br /&gt;
*Chapter 10 (Under Construction) [[CSC/ECE 506 Spring 2011/ch10 sb | CSC/ECE 506 Spring 2011/ch10 sb]]&lt;br /&gt;
*Chapter 10 [[CSC/ECE 506 Spring 2012/ch10 sj | CSC/ECE 506 Spring 2012/ch10 sj]]&lt;br /&gt;
*Chapter 10a [[CSC/ECE_506_Spring_2011/ch10a_dc | CSC/ECE_506_Spring_2011/ch10a_dc]]&lt;br /&gt;
*Chapter 11 [[CSC/ECE_506_Spring_2011/ch11_BB_EP | Chapter 11 Supplement]]&lt;br /&gt;
*Chapter 11 [[Scalable_Coherent_Interface | SCI (Scalable Coherent Interface) ]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 ob | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 (Ready for Final Review) [[ CSC/ECE 506 Spring 2011/ch12 aj | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 | Interconnection Network Topologies]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a ry]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c dm]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c cl]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a mw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3a yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/7b yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3b sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/4b rs]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/6b am]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/8a cj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a dr]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a jp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/9a ms]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10b sr]]&lt;br /&gt;
*Chapter 11a [[ECE506_CSC/ECE_506_Spring_2012/11a_az | CSC/ECE 506 Spring 2012/11a az - Performance of DSM system]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/12b jh]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a fu]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/11a ht]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1b dj]]&lt;br /&gt;
*[[&amp;lt;b&amp;gt;MISD Architecture&amp;lt;/b&amp;gt;(CSC/ECE 506 Spring 2013/1a sp)]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1d ks]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/2b so]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1c ad]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/3b xz]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_aj]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_ss]]&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=72487</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=72487"/>
		<updated>2013-02-11T04:45:07Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* = */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
===MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of concurrent information streams that flow into the processor. According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the Instruction (or Control) Stream and the Data Stream. Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent. Based on how many of each type of stream flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction, Single Data (SISD); Single Instruction, Multiple Data (SIMD); Multiple Instruction, Single Data (MISD); and Multiple Instruction, Multiple Data (MIMD).&lt;br /&gt;
&lt;br /&gt;
The following table (Figure 1) shows the categories of different parallel architectures according to Flynn:&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
===Overview of the MISD Architecture===&lt;br /&gt;
&lt;br /&gt;
In order to be categorized as an MISD architecture, a design must have at least two Instruction streams and only one Data stream, as shown in Figure 2.[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]] According to Flynn’s Taxonomy, each processing element should receive the same data and execute a different instruction on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, and none of them are commercial, general-purpose parallel computers, since MIMD and SIMD architectures are often better suited to common data-parallel techniques. Under most circumstances MIMD and SIMD architectures tend to provide better scaling and more efficient use of computational resources. Instead of commercial use, MISD architectures are mainly used to create custom hardware that solves specific problems. The majority of implementations have been created in academia.&lt;br /&gt;
 &lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy. One commonly cited example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction. It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline. Similarly, other architectures or applications exist which may be called MISD by some while others disagree. The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions. At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Types of Systolic Arrays&amp;lt;ref&amp;gt;Johnson, K.T.; Hurson, A.R.; Shirazi, B.; , &amp;quot;General-purpose systolic arrays,&amp;quot; Computer , vol.26, no.11, pp.20-31, Nov. 1993 doi: 10.1109/2.241423&amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
[[Image:systolic_1.png|thumb|centre|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
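&lt;br /&gt;
The behaviour of the single element in Figure 3 can be sketched in a few lines of Python. This is only an illustrative model, not code from the cited paper: the function name systolic_element and the sample values are assumptions, and the synchronous shifting is modelled as one loop iteration per clock step.&lt;br /&gt;

```python
def systolic_element(a_stream, b_stream):
    """Model one systolic processing element computing a sum of products.

    On each step the element multiplies the incoming a and b values, adds
    the product to its accumulator, and forwards both inputs unmodified so
    that a neighbouring element could consume them.
    """
    accumulator = 0
    forwarded = []  # (a, b) pairs leaving the element unchanged
    for a, b in zip(a_stream, b_stream):
        accumulator += a * b
        forwarded.append((a, b))
    return accumulator, forwarded


# Hypothetical sample data: 1*4 + 2*5 + 3*6 = 32
total, passed_on = systolic_element((1, 2, 3), (4, 5, 6))
print(total)      # 32
print(passed_on)  # the inputs, unchanged, ready for the next element
```

In a full array, the forwarded pairs would feed the neighbouring elements on the next step, which is what lets several sums of products be accumulated at once.&lt;br /&gt;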
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These options are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Programmable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
&lt;br /&gt;
It is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, they can be categorized as SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the latter approach, the workstation downloads a program to each MIMD (Figure 4) systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: It contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
[[Image:systolic_2.png|thumb|centre|250px|Figure 4: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. The architecture has multiple instruction streams for the PEs and a single data stream passing through all the PEs. Thus, it can also be defined as Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2.]&lt;br /&gt;
&lt;br /&gt;
[[Image:systolic_4.png|thumb|centre|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Reconfigurable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The homogeneous characteristic of the Reconfigurable Systolic Array (RSA) architecture, where each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time-sharing computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, with the former acting as an arithmetic processor and the latter as a flexible router linking the neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on a separate data bus, which enables the circuit to continue its operation while the reconfiguration is in process.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7.Comparison between Architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
In the systolic array configurations described above, multiple processing elements generally execute different instructions, each from a dedicated instruction stream. A single data stream connects the adjacent PEs. Thus, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors argue, however, that because the data read as input by one processing element is the processed output of the adjacent PE, the data stream cannot be considered single: the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the Data Stream for Systolic Arrays and the MISD architecture. By this argument, the systolic array should be considered a “Multiple Data” architecture and not a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault-tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc.; software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|centre|250px|Figure 8 MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault-tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance for computations can be implemented by multiple processors (likely with different architectures) executing the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority voting method is used to determine the faulty processor. Thus the MISD architecture is used to provide fault tolerance for critical computations.&lt;br /&gt;
&lt;br /&gt;
There are various examples of the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault-tolerant architecture. Major examples include flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
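&lt;br /&gt;
M-out-of-N voting over redundant processor outputs can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic of any particular system: the function name majority_vote, the parameter m, and the sample outputs are all hypothetical.&lt;br /&gt;

```python
from collections import Counter


def majority_vote(outputs, m):
    """Return (agreed_value, faulty_indices) using M-out-of-N voting.

    outputs: one result per redundant processor (N values in total).
    m: minimum number of processors that must agree on a value.
    Raises ValueError when no value collects at least m votes.
    """
    value, votes = Counter(outputs).most_common(1)[0]
    # Accept only when the vote count lies between m and N inclusive.
    if votes not in range(m, len(outputs) + 1):
        raise ValueError("no value reached the required number of votes")
    # Any processor that disagrees with the agreed value is flagged.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


# Hypothetical example: three processors, the third one disagrees.
print(majority_vote((42, 42, 41), m=2))  # (42, [2])
```

With N = 3 and M = 2 this is ordinary two-out-of-three voting; a larger N tolerates more simultaneous faults at the cost of more hardware.&lt;br /&gt;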
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system is used to replace the manual flight control by an electronic control interface. The movements of the flight control in the cockpit are converted to electronic signals and are transmitted to the actuators by wires. The control computers use the feedback from the sensors to compute and control the movement of the actuators to provide the expected response. These computers also perform the task to stabilize the aircraft and perform other tasks without the knowledge of the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
The flight control system contains redundant flight control computers. If one of the flight control computers crashes, gets damaged or is affected by electromagnetic pulses, another computer can overrule the faulty one, and the flight of the aircraft is unharmed. The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is ruled to be faulty and is either ignored or rebooted.&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;Yeh, Y.C.;, &amp;quot;Triple-triple redundant 777 primary flight computer,&amp;quot; Aerospace Applications Conference, 1996. Proceedings., 1996 IEEE , vol.1, no., pp.293-307 vol.1, 3-10 Feb 1996 doi: 10.1109/AERO.1996.495891&amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern aircraft, the redundant flight control computations are carried out by multiprocessor systems. The triple-redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
[[Image:fig13.png|thumb|centre|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each with three lanes that use different processors. The flight control program is compiled for each of the processors, which read their input data from the same data bus but drive their output on individual control buses. Thus each processor executes different instructions but processes the same data, which makes the system a particularly well-suited example of Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data-synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. As the outputs of the lanes can differ, the median value of the outputs is used to select the lane output to be considered. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault-blocking mechanism that acts before fault detection and identification by the cross-lane monitoring system. Thus, the MISD-based multi-computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
&lt;br /&gt;
The system described above clearly has individual Instruction Streams: because the architecture of each processor is different, so are the instruction sets and the instruction streams. The processors have frame-synchronized input data, which means they work on the same set of data, fed from a single data stream. Thus the flight control system can be classified as an MISD architecture.&lt;br /&gt;
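&lt;br /&gt;
The median select step can be illustrated with a short Python sketch. It is not the actual 777 flight software: the function name median_select, the integer command values, and the idea of reporting each lane's deviation are assumptions made for illustration; only the three processor names come from the text.&lt;br /&gt;

```python
import statistics


def median_select(lane_outputs):
    """Select the median lane output and report each lane's deviation.

    lane_outputs: mapping of lane name to the command value it computed
    from the shared, frame-synchronized input data.
    """
    selected = statistics.median(lane_outputs.values())
    deviations = {
        lane: abs(value - selected) for lane, value in lane_outputs.items()
    }
    return selected, deviations


# Hypothetical command values; the third lane has produced a bad result.
outputs = {"Intel 80486": 1002, "Motorola 68040": 1000, "AMD 29050": 1750}
selected, deviations = median_select(outputs)
print(selected)    # 1002
print(deviations)  # the faulty lane shows by far the largest deviation
```

Because the median of three values is always one of the two values that agree most closely, a single wildly wrong lane cannot drive the selected output, which is the fault-blocking property described above.&lt;br /&gt;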
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System&amp;lt;ref&amp;gt;M.Gajer;,“ Parallel Image Processing on The TMS320C80 Multiprocessor System”, http://focus.ti.com/pdfs/univ/11-CommunicationAudioSpeech.pdf&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system of this multiprocessor, the pre-processing time needed to be shortened. To that end, different types of image processing architectures for the ‘C80 were presented: for example, an MIMD architecture for image histogram equalization, an SIMD architecture for median filtering, and an MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows: each processor performs a different operation on the same data set. It is often stated that the MISD architecture was defined only for the sake of completeness of the classification, because it is difficult in practice to find a computer based on it&amp;lt;ref&amp;gt;TMS320C80 System Level Synopsis, Texas Instruments, 1995&amp;lt;/ref&amp;gt;. In image processing, however, this assumption need not hold, because performing many different operations on the same image is not rare, and the MISD architecture fits such cases well. As an example of implementing image processing operations on the MISD architecture, consider edge detection. The edges are detected in four different directions (north, south, west and east) by means of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
&lt;br /&gt;
Image processing on a computer is very often a multistage process. One example is an algorithm that detects the contours of the objects in an image, which is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and eliminates noise. In the next step, thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image in which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels placed on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of this pipeline is the master processor, which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors, as listed in Table 1.&lt;br /&gt;
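&lt;br /&gt;
The idea of applying four different masks to one image can be sketched in plain Python. The masks below are hypothetical Prewitt-style compass masks chosen for illustration; the source does not list the masks used on the TMS320C80. The sketch also runs the four operations one after another, whereas on the 'C80 each mask would be handled by its own parallel processor.&lt;br /&gt;

```python
# Hypothetical 3x3 compass masks (Prewitt-style); the masks actually used
# on the TMS320C80 are not given in the source.
MASKS = {
    "north": ((1, 1, 1), (0, 0, 0), (-1, -1, -1)),
    "south": ((-1, -1, -1), (0, 0, 0), (1, 1, 1)),
    "west": ((1, 0, -1), (1, 0, -1), (1, 0, -1)),
    "east": ((-1, 0, 1), (-1, 0, 1), (-1, 0, 1)),
}


def apply_mask(image, mask):
    """Correlate a 3x3 mask with the interior pixels of a 2-D image."""
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(
                mask[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3)
                for j in range(3)
            )
            for c in range(1, cols - 1)
        ]
        for r in range(1, rows - 1)
    ]


def directional_edges(image):
    """Apply every mask to the same image: one edge image per direction."""
    return {name: apply_mask(image, mask) for name, mask in MASKS.items()}


# Sample image with a vertical step edge (dark on the left, bright on the right).
image = ((0, 0, 9, 9), (0, 0, 9, 9), (0, 0, 9, 9))
edges = directional_edges(image)
print(edges["east"])   # strong positive response to the step
print(edges["north"])  # no response: the image does not change vertically
```

Every mask reads the same input image and produces its own output, which is the single-data, multiple-instruction pattern the section describes.&lt;br /&gt;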
&lt;br /&gt;
&amp;lt;table border=1 align=&amp;quot;center&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;Stage&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Processor&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Operation&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;MP&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Image acquisition&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP0&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Low-pass Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Thresholding&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;4&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Laplace Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;5&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Logic Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;	&lt;br /&gt;
&amp;lt;/table&amp;gt;&lt;br /&gt;
&amp;lt;p align=&amp;quot;center&amp;quot;&amp;gt;Table 1: Pipeline structure for image edge detection[8]&amp;lt;/p&amp;gt;&lt;br /&gt;
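&lt;br /&gt;
The processing stages of Table 1 can be modelled in software as a simple pipeline. The sketch below is a simplified illustration under stated assumptions, not the ‘C80 code: the 3x3 mean filter, the threshold level of 128, the 4-neighbour Laplacian, the border clamping and all function names are assumptions chosen for brevity. Image acquisition (the MP stage) is represented by the list of frames passed in. On every tick, each stage works on a different frame, which is what lets the stages run in parallel on real hardware.&lt;br /&gt;
&lt;br /&gt;
```python
def neighborhood(img, r, c):
    """Return the 3x3 neighbourhood of (r, c) as a flat list, clamping at borders."""
    rows, cols = len(img), len(img[0])
    return [
        img[min(max(r + dr, 0), rows - 1)][min(max(c + dc, 0), cols - 1)]
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
    ]


def map_pixels(img, fn):
    """Build a new image by applying fn to the neighbourhood of every pixel."""
    return [
        [fn(neighborhood(img, r, c)) for c in range(len(img[0]))]
        for r in range(len(img))
    ]


def low_pass(img):  # PP0: 3x3 mean filter (assumed kernel)
    return map_pixels(img, lambda n: sum(n) // 9)


def threshold(img, level=128):  # PP1: white (255) at or above level, else black (0)
    return map_pixels(img, lambda n: 255 if n[4] >= level else 0)


def laplace(img):  # PP2: a nonzero 4-neighbour Laplacian marks an edge pixel black
    return map_pixels(
        img, lambda n: 0 if n[1] + n[3] + n[5] + n[7] - 4 * n[4] != 0 else 255
    )


def logic_filter(img):  # PP3: whiten black pixels whose 8 neighbours are all white
    return map_pixels(
        img, lambda n: 255 if n[4] == 0 and sum(n) == 8 * 255 else n[4]
    )


STAGES = [low_pass, threshold, laplace, logic_filter]


def run_pipeline(frames):
    """Advance all stages once per tick; latches[i] holds the output of stage i."""
    latches = [None] * len(STAGES)
    finished = []
    # Extra empty ticks at the end drain the frames still inside the pipeline.
    for incoming in list(frames) + [None] * len(STAGES):
        if latches[-1] is not None:
            finished.append(latches[-1])
        # Update from the last stage backwards so each stage reads last tick's data.
        for i in range(len(STAGES) - 1, 0, -1):
            prev = latches[i - 1]
            latches[i] = STAGES[i](prev) if prev is not None else None
        latches[0] = STAGES[0](incoming) if incoming is not None else None
    return finished
```
&lt;br /&gt;
Note that each stage receives the output of the previous stage rather than the original image, which is why pipelines of this kind are only loosely described as MISD.&lt;br /&gt;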
&lt;br /&gt;
===Recursive Pattern Matching using MISD Architecture &amp;lt;ref&amp;gt;Arne Halaas, Børge Svingen, Magnar Nedland, Pål Sætrom, Ola Snøve, Jr., and Olaf René Birkeland. 2004. A recursive MISD architecture for pattern matching. IEEE Trans. Very Large Scale Integr. Syst. 12, 7 (July 2004), 727-734. DOI=10.1109/TVLSI.2004.830918 http://dx.doi.org/10.1109/TVLSI.2004.830918&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Online multipattern approximate searching is required when several patterns must be matched concurrently in the data and no persistent index can be created for look-up. The algorithm evaluates the hit function H(Si,P), which operates on strings and a pattern, by partitioning P into smaller components and combining the individual results. This calls for a recursive approach and for evaluating the Si values in parallel. An architecture with two complete binary trees, one for data distribution and one for result processing, is therefore well suited. The architecture is implemented with as many PEs as is practical within the implementation constraints. Data from the stream is fed to the root node of the data distribution tree and simultaneously distributed to several collections of PEs belonging to separate queries, thus implementing the MISD architecture with a single data stream and multiple instruction streams.&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
The MISD architecture is one of the four types of parallel architecture described in Flynn’s Taxonomy.  By definition, the MISD architecture has multiple instruction streams and a single data stream.  The definition also describes this architecture as requiring that all processing elements receive the same data, but most commonly cited implementations of this class of architecture do not adhere to this detail.  As a result, there is debate among some experts over whether certain implementations are truly MISD.&lt;br /&gt;
&lt;br /&gt;
MISD architectures are not commonly found in industry as they tend to lack the scalability and efficiency of resource use of the MIMD and SIMD class of architectures.  However, MISD architectures are efficient for solving specific classes of problems often found in research.  Some examples of its use in research include image processing and systolic arrays.  While rare, some implementations do exist outside of academic settings.  One such example is fault tolerant systems such as the flight control system on the Boeing 777.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''ALU''': Arithmetic Logic Unit, the unit in a processor that performs arithmetic and logic operations&lt;br /&gt;
&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
&lt;br /&gt;
'''Q1: Using the classical definition of the MISD architecture from Flynn’s Taxonomy, what is the minimum number of input streams that must be present to be classified as MISD architecture?'''&lt;br /&gt;
&lt;br /&gt;
A. One information stream, the data stream; the number of instruction streams does not matter.&lt;br /&gt;
&lt;br /&gt;
B. Two information streams, one instruction stream, one data stream.&lt;br /&gt;
&lt;br /&gt;
C. Three information streams, two instruction streams, one data stream.&lt;br /&gt;
&lt;br /&gt;
D. The number of information streams is irrelevant as per Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A1: C, Three information streams, two instruction streams, one data stream.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q2: Why is the Pipeline architecture often cited as an example of MISD?'''&lt;br /&gt;
&lt;br /&gt;
A. The Pipeline architecture is exactly like the definition of MISD architecture from Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
B. The data to be processed travels linearly across all processing elements with each processing element executing a different instruction and passing it on to the next processing element.  This line of data is cited to be a single “stream.”&lt;br /&gt;
&lt;br /&gt;
C. The data stream sends the same data to all processing elements for every cycle.&lt;br /&gt;
&lt;br /&gt;
D. Flynn based his description of MISD architecture on the Pipeline architecture in his paper that introduced Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A2: B, The data to be processed travels linearly across all processing elements with each processing element executing a different instruction and passing it on to the next processing element.  This line of data is cited to be a single “stream.”''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q3: Why is the Pipeline architecture NOT considered a true MISD architecture by many experts?'''&lt;br /&gt;
&lt;br /&gt;
A. This is incorrect; there is consensus that the Pipeline architecture is synonymous with the MISD architecture described by Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
B. There is only a single, global information/control stream in the Pipeline architecture.&lt;br /&gt;
&lt;br /&gt;
C. The data passed to each processing element is different as it reflects the result of the processing of all the previous processing elements.  Flynn’s Taxonomy describes the same data going to every processing element.&lt;br /&gt;
&lt;br /&gt;
D. It has been proven that it is impossible to build a MISD architecture parallel computer as described by Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A3: C, The data passed to each processing element is different as it reflects the result of the processing of all the previous processing elements.  Flynn’s Taxonomy describes the same data going to every processing element.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q4: True or False? MISD architecture is the most common parallel computing architecture used for general computing tasks.'''&lt;br /&gt;
&lt;br /&gt;
A. True&lt;br /&gt;
&lt;br /&gt;
B. False&lt;br /&gt;
&lt;br /&gt;
''A4: B, False, it is often not the best suited architecture for general computing tasks and as such not common outside of academic research.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q5: When classifying a particular parallel architecture, it was noted that there was only one physical connection to provide all information streams, is it possible for this architecture to be classified as MISD architecture? (Note that there are three choices, you need to select the correct reason for the Yes/No/Maybe answer.)'''&lt;br /&gt;
&lt;br /&gt;
A. Yes, Flynn’s taxonomy does not require the number of physical connections in the hardware correspond to the number of logical streams.&lt;br /&gt;
&lt;br /&gt;
B. Maybe, we do not have enough information about the architecture to rule out the possibility of this parallel computer having a MISD architecture.  We will need to see how the connections are physically laid out on the chip.&lt;br /&gt;
&lt;br /&gt;
C. No, there must be multiple physical information streams.&lt;br /&gt;
&lt;br /&gt;
''A5. A, Yes, Flynn’s taxonomy does not require the number of physical connections in the hardware correspond to the number of logical streams.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q6. True or False? To be a MISD architecture, it requires that every processing element is identical.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A6. B, False, it is not required by definition of Flynn’s Taxonomy that all processing elements are identical.  See example in the wiki reading about the fault tolerant flight control system on the Boeing 777.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q7: Which of these is an application of the MISD architecture?'''&lt;br /&gt;
&lt;br /&gt;
A.	Keyword search&lt;br /&gt;
&lt;br /&gt;
B.	Controlling an assembly line.&lt;br /&gt;
&lt;br /&gt;
C.	Home computing&lt;br /&gt;
&lt;br /&gt;
D.	Image edge detection.&lt;br /&gt;
&lt;br /&gt;
''A7: D, Image edge detection.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q8. True or False? The systolic array is another name for the MISD architecture.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A8. B, False, systolic arrays may be implemented using a MISD architecture but they also may be other architectures such as SIMD and MIMD.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q9.  True or False? MISD is considered the easiest parallel architecture to scale.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A9. B, False, MIMD and SIMD are often used because of advantages in scaling.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q10. Flynn’s Taxonomy relies on the number of logical information streams to classify the types of parallel architectures.  What is the difference between a logical information stream and a physical information stream?'''&lt;br /&gt;
&lt;br /&gt;
A.	A logical information stream is a connection between the ALU and the location of the data that is responsible for passing information about boolean (logical) computations, whereas the physical information stream is the connection between the computer and other parts of the network.&lt;br /&gt;
&lt;br /&gt;
B.	Physical information streams are determined by how the information is physically transported around chip while logical information streams are one layer of abstraction above the physical information streams.  Logical streams describe conceptually how the information on the stream can be treated.&lt;br /&gt;
&lt;br /&gt;
C.	Logical and physical information streams transport different kinds of information but are no different in implementation.&lt;br /&gt;
&lt;br /&gt;
D.	There is no difference; the two terms can be used interchangeably.&lt;br /&gt;
&lt;br /&gt;
''A10. B, Physical information streams are determined by how the information is physically transported around chip while logical information streams are one layer of abstraction above the physical information streams.  Logical streams describe conceptually how the information on the stream can be treated.''&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=72486</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=72486"/>
		<updated>2013-02-11T04:44:15Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Edge Detection of Images on TMS320C80 Multiprocessor SystemM.Gajer;,“ Parallel Image Processing on The TMS320C80 Multiprocessor System”, http://focus.ti.com/pdfs/univ/11-CommunicationAudioSpeech.pdf */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
===MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of concurrent information streams that flow into the processor. According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the Instruction (or Control) Stream and the Data Stream. Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent. Based on how many streams of each type flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction, Single Data (SISD); Single Instruction, Multiple Data (SIMD); Multiple Instruction, Single Data (MISD); and Multiple Instruction, Multiple Data (MIMD).&lt;br /&gt;
&lt;br /&gt;
The following table (Figure 1) shows the categories of different parallel architectures according to Flynn:&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
===Overview of the MISD Architecture===&lt;br /&gt;
&lt;br /&gt;
In order to be categorized as an MISD architecture, there must be at least two Instruction streams and there can only be one Data stream, as shown in Figure 2.[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]] As per the description in Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, none of which are commercial, general purpose parallel computers, since MIMD and SIMD architectures are often better suited to common data parallel techniques.  Under most circumstances MIMD and SIMD architectures tend to provide better scaling and more efficient use of computational resources. Instead of commercial use, MISD architectures are mainly used to create custom hardware in order to solve specific problems. The majority of the implementations have been created in academia.&lt;br /&gt;
 &lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy. One commonly cited example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction. It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline. Similarly, other architectures or applications exist which may be called MISD by some while others disagree. The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions. At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Type of Systolic Arrays&amp;lt;ref&amp;gt;Johnson, K.T.; Hurson, A.R.; Shirazi, B.; , &amp;quot;General-purpose systolic arrays,&amp;quot; Computer , vol.26, no.11, pp.20-31, Nov. 1993 doi: 10.1109/2.241423&amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is in matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
[[Image:systolic_1.png|thumb|centre|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
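&lt;br /&gt;
The behaviour of the single systolic element in Figure 3 can be modelled with a short sketch. The class and function names below are hypothetical and chosen only for illustration; the model ignores clocking details and simply shows that the operands pass through unmodified while the products accumulate.&lt;br /&gt;
&lt;br /&gt;
```python
class SystolicCell:
    """One multiply-accumulate element (hypothetical software model)."""

    def __init__(self):
        self.acc = 0  # accumulator holding the running sum of products

    def step(self, a, b):
        """Consume one (a, b) pair and forward both operands unchanged."""
        self.acc += a * b
        return a, b


def scalar_product(a_values, b_values):
    """Shift paired operands through one cell, then read out the accumulator."""
    cell = SystolicCell()
    for a, b in zip(a_values, b_values):
        cell.step(a, b)
    return cell.acc
```
&lt;br /&gt;
For example, scalar_product([1, 2, 3], [4, 5, 6]) evaluates 1*4 + 2*5 + 3*6 = 32. In a full array, the pair returned by step would be the input of the neighbouring cell on the next step.&lt;br /&gt;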
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
It is an array of systolic processing elements, which gets adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  This is referred to as Systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Programmable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
&lt;br /&gt;
It is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the latter approach, the workstation downloads a program to each MIMD (Figure 4) systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: It contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
[[Image:systolic_2.png|thumb|centre|250px|Figure 4: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, it has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be classified as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2.]&lt;br /&gt;
&lt;br /&gt;
[[Image:systolic_4.png|thumb|centre|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Reconfigurable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on systolic array architecture consisting of PEs interconnected via SWs as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The homogeneous characteristic of the Reconfigurable Systolic Array (RSA) architecture, where each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time sharing computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically with the former as an arithmetic processor and the latter as a flexible router linking the neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on separate data bus which enables the circuit to continue its operation while the reconfiguration is in process.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7.Comparison between Architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
From the configurations of systolic arrays described above, it can be seen that they generally have multiple processing elements, each executing different instructions from its own dedicated instruction stream. A single data stream connects the adjacent PEs. Thus, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors point out, however, that the data read as input by one processing element is the processed output of the adjacent PE. The data stream therefore cannot be considered single, because the data paths do not all carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the Data Stream for Systolic Arrays and the MISD architecture. Under this view, the systolic array should be considered a “Multiple Data” architecture and not a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc.; software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|centre|250px|Figure 8 MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of the fault tolerance systems by multiple instruction streams or multiple data streams or both. Fault tolerance on computations can be implemented by multiple processors (likely with different architectures) executing the algorithms on the same set of data. The output of each processor is compared with that of the others and M out of N majority voting method is used to determine the faulty processor. Thus MISD architecture is utilized to get the fault tolerance on critical computations.&lt;br /&gt;
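&lt;br /&gt;
The M out of N voting step described above can be sketched as follows. This is a minimal illustration rather than the method of any particular system: the function name majority_vote, its return convention and the exact-equality comparison of outputs are all assumptions.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter


def majority_vote(outputs, m):
    """M-out-of-N vote over the outputs of N redundant processors.

    Returns (agreed_value, faulty_indices). If no value is produced by at
    least m processors, returns (None, all indices), since no processor can
    be trusted. Exact equality of outputs is an assumption of this sketch.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count >= m:
        faulty = [i for i, out in enumerate(outputs) if out != value]
        return value, faulty
    return None, list(range(len(outputs)))
```
&lt;br /&gt;
With three processors and m = 2, majority_vote([42, 42, 7], 2) accepts 42 and flags the processor at index 2 as faulty.&lt;br /&gt;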
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. The major examples are flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system is used to replace the manual flight control by an electronic control interface. The movements of the flight control in the cockpit are converted to electronic signals and are transmitted to the actuators by wires. The control computers use the feedback from the sensors to compute and control the movement of the actuators to provide the expected response. These computers also perform the task to stabilize the aircraft and perform other tasks without the knowledge of the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
The flight control system contains redundant flight control computers. If one of the flight control computers crashes, gets damaged or is affected by electromagnetic pulses, another computer can overrule the faulty one, so the flight of the aircraft is unharmed. The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;Yeh, Y.C.;, &amp;quot;Triple-triple redundant 777 primary flight computer,&amp;quot; Aerospace Applications Conference, 1996. Proceedings., 1996 IEEE , vol.1, no., pp.293-307 vol.1, 3-10 Feb 1996&lt;br /&gt;
doi: 10.1109/AERO.1996.495891&amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern computers, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
[[Image:fig13.png|thumb|centre|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each of them having three lanes with different processors. The flight control program is compiled for each of the processors which get the input data from the same data bus but drive the output on their individual control bus. Thus each processor executes different instructions but they process the same data. Thus, it is the best suited example of Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. As the outputs of the lanes can differ, the median value of the outputs is used to select the lane output to be used. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism before the fault detection and identification by the cross-lane monitoring system. Thus, the MISD based multi computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
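&lt;br /&gt;
The median select described above can be illustrated with a short sketch. It is not the 777 implementation: the function name median_select, the numeric lane outputs and the tolerance used to flag a suspect lane are assumptions made for illustration only.&lt;br /&gt;
&lt;br /&gt;
```python
import statistics


def median_select(lane_outputs, tolerance):
    """Pick the median of the lane outputs and flag lanes that deviate from it.

    lane_outputs holds one numeric command per lane; tolerance is an assumed
    monitoring threshold that does not come from the source.
    Returns (selected_value, suspect_lane_indices).
    """
    selected = statistics.median(lane_outputs)
    suspects = [
        lane
        for lane, value in enumerate(lane_outputs)
        if abs(value - selected) > tolerance
    ]
    return selected, suspects
```
&lt;br /&gt;
With three lanes, a single faulty output can never be the median, so median_select([10.0, 10.2, 55.0], 1.0) selects 10.2 and flags lane 2. This is the fault blocking effect: the erroneous value is kept off the output before it has even been diagnosed.&lt;br /&gt;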
&lt;br /&gt;
The above mentioned system clearly has individual Instruction Streams as the architecture of each processor is different, thus different instruction sets and different instruction streams. These processors have frame synchronized input data which means they have same set of data to work upon which is fed from a single data stream. Thus the flight control system can be classified under MISD architecture.&lt;br /&gt;
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System&amp;lt;ref&amp;gt;M. Gajer, “Parallel Image Processing on the TMS320C80 Multiprocessor System”, http://focus.ti.com/pdfs/univ/11-CommunicationAudioSpeech.pdf&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
To shorten the image pre-processing time on this multiprocessor, different types of image processing architectures were presented for the ‘C80. For example, the MIMD architecture is used for image histogram equalization, the SIMD architecture for median filtering, and the MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows. Here, each processor performs different operations on the same data set. It is often stated that the MISD architecture was defined only for the sake of completeness of the classification, because in practice it is difficult to find a computer based on it&amp;lt;ref&amp;gt;TMS320C80 System Level Synopsis, Texas Instruments, 1995&amp;lt;/ref&amp;gt;. In image processing, however, this assumption need not hold: performing many different operations on the same image is common, and hence the MISD architecture is a good fit for such cases. As an example of implementing image processing operations on the MISD architecture, consider edge detection. The edges are detected in four different directions (north, south, west and east) by means of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
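&lt;br /&gt;
The directional edge detection described above can be sketched as follows. The text does not give the filter masks, so the Prewitt-style compass masks below, the names correlate and directional_edges, and the tiny test image are all assumptions made for illustration. On the TMS320C80 each mask would run on its own parallel processor; here the four masks are simply applied one after another to the same image.&lt;br /&gt;
&lt;br /&gt;
```python
# Assumed 3x3 compass masks, one per direction (not taken from the source).
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def correlate(image, mask):
    """Slide a 3x3 mask over the image (valid region only)."""
    rows, cols = len(image), len(image[0])
    result = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            out_row.append(
                sum(
                    mask[i][j] * image[r + i][c + j]
                    for i in range(3)
                    for j in range(3)
                )
            )
        result.append(out_row)
    return result


def directional_edges(image):
    """Apply every mask to the SAME image: one data set, four operations."""
    return {name: correlate(image, mask) for name, mask in MASKS.items()}


# A 4x4 image with a vertical step from dark (0) to bright (9).
image = [[0, 0, 9, 9] for _ in range(4)]
edges = directional_edges(image)
print(edges["east"], edges["north"])
```
&lt;br /&gt;
The four result images share one input, which is what makes this workload a natural candidate for the MISD label.&lt;br /&gt;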
&lt;br /&gt;
Image processing on a computer is very often a multistage process. An example is an algorithm that detects the contours of objects in an image, which is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and eliminates noise. In the next step, thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels placed on the white background. To run these operations in parallel, a five-stage pipeline was used. The first stage of the pipeline is the master processor (MP), which controls the image acquisition process. The remaining pipeline stages were implemented on the 4 parallel processors (PP0 to PP3), as listed in Table 1.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table border=1 align=&amp;quot;center&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;Stage&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Processor&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Operation&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;MP&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Image acquisition&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP0&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Low-pass Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Thresholding&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;4&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Laplace Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;5&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Logic Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;	&lt;br /&gt;
&amp;lt;/table&amp;gt;&lt;br /&gt;
&amp;lt;p align=&amp;quot;center&amp;quot;&amp;gt;Table 1: Pipeline structure for image edge detection[8]&amp;lt;/p&amp;gt;&lt;br /&gt;
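&lt;br /&gt;
The way frames move through the five stages of Table 1 can be sketched with a small schedule simulation. This is only a sketch: the function name pipeline_schedule and the assumption that every stage takes exactly one clock tick per frame are illustrative and not taken from the source.&lt;br /&gt;
&lt;br /&gt;
```python
STAGES = [
    "MP: image acquisition",
    "PP0: low-pass filtering",
    "PP1: thresholding",
    "PP2: Laplace filtering",
    "PP3: logic filtering",
]


def pipeline_schedule(num_frames, stages=STAGES):
    """Return, for each clock tick, the frame number held by each stage.

    A stage that has no frame to work on at a given tick holds None.
    Assumes (for illustration) that each stage needs one tick per frame.
    """
    total_ticks = num_frames + len(stages) - 1
    schedule = []
    for tick in range(total_ticks):
        row = []
        for stage_index in range(len(stages)):
            # Frame f enters stage s at tick f + s.
            frame = tick - stage_index
            row.append(frame if frame in range(num_frames) else None)
        schedule.append(row)
    return schedule


for tick, row in enumerate(pipeline_schedule(3)):
    print(tick, row)
```
&lt;br /&gt;
At any tick where the pipeline is full, every stage holds a different frame, which is why pipelines are often disputed as examples of a single data stream.&lt;br /&gt;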
&lt;br /&gt;
===Recursive Pattern Matching using MISD Architecture &amp;lt;ref&amp;gt;Arne Halaas, Børge Svingen, Magnar Nedland, Pål Sætrom, Ola Snøve, Jr., and Olaf René Birkeland. 2004. A recursive MISD architecture for pattern matching. IEEE Trans. Very Large Scale Integr. Syst. 12, 7 (July 2004), 727-734. DOI=10.1109/TVLSI.2004.830918 http://dx.doi.org/10.1109/TVLSI.2004.830918&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Online multipattern approximate searching is required in situations where several patterns must be matched concurrently in data and no persistent index can be created for look-up. This algorithm requires evaluation of the hit function H(Si,P), which works on strings and a pattern, and requires partitioning P into smaller components and combining the individual results. A recursive approach is required, and the Si values need to be evaluated in parallel. This motivates an architecture with two complete binary trees, which take care of the data distribution and result processing. The architecture is implemented with as many PEs as are practically possible within the implementation constraints. The data from the stream is fed to the root node of the data distribution tree and simultaneously distributed to several collections of PEs belonging to separate queries, thus implementing the MISD architecture with a single data stream and multiple instruction streams.&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
The MISD architecture is one of the four types of parallel architecture described in Flynn’s Taxonomy.  By definition, the MISD architecture has multiple instruction streams and a single data stream.  The definition also describes this architecture as requiring that all processing elements receive the same data, but most commonly cited implementations of this class of architecture do not adhere to this detail.  As a result, there is debate among some experts on whether or not certain implementations are truly MISD.&lt;br /&gt;
&lt;br /&gt;
MISD architectures are not commonly found in industry as they tend to lack the scalability and efficiency of resource use of the MIMD and SIMD class of architectures.  However, MISD architectures are efficient for solving specific classes of problems often found in research.  Some examples of its use in research include image processing and systolic arrays.  While rare, some implementations do exist outside of academic settings.  One such example is fault tolerant systems such as the flight control system on the Boeing 777.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''ALU''': Arithmetic Logic Unit, the part of a processor that performs arithmetic and logical operations&lt;br /&gt;
&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
&lt;br /&gt;
'''Q1: Using the classical definition of the MISD architecture from Flynn’s Taxonomy, what is the minimum number of information streams that must be present for an architecture to be classified as MISD?'''&lt;br /&gt;
&lt;br /&gt;
A. One information stream, the data stream; the number of instruction streams does not matter.&lt;br /&gt;
&lt;br /&gt;
B. Two information streams, one instruction stream, one data stream.&lt;br /&gt;
&lt;br /&gt;
C. Three information streams, two instruction streams, one data stream.&lt;br /&gt;
&lt;br /&gt;
D. The number of information streams is irrelevant as per Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A1: C, Three information streams, two instruction streams, one data stream.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q2: Why is the Pipeline architecture often cited as an example of MISD?'''&lt;br /&gt;
&lt;br /&gt;
A. The Pipeline architecture is exactly like the definition of MISD architecture from Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
B. The data to be processed travels linearly across all processing elements with each processing element executing a different instruction and passing it on to the next processing element.  This line of data is cited to be a single “stream.”&lt;br /&gt;
&lt;br /&gt;
C. The data stream sends the same data to all processing elements for every cycle.&lt;br /&gt;
&lt;br /&gt;
D. Flynn based his description of MISD architecture on the Pipeline architecture in his paper that introduced Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A2: B, The data to be processed travels linearly across all processing elements with each processing element executing a different instruction and passing it on to the next processing element.  This line of data is cited to be a single “stream.”''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q3: Why is the Pipeline architecture NOT considered a true MISD architecture by many experts?'''&lt;br /&gt;
&lt;br /&gt;
A. This is incorrect; there is consensus that the Pipeline architecture is synonymous with the MISD architecture described by Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
B. There is only a single, global information/control stream in the Pipeline architecture.&lt;br /&gt;
&lt;br /&gt;
C. The data passed to each processing element is different as it reflects the result of the processing of all the previous processing elements.  Flynn’s Taxonomy describes the same data going to every processing element.&lt;br /&gt;
&lt;br /&gt;
D. It has been proven that it is impossible to build a MISD architecture parallel computer as described by Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A3: C, The data passed to each processing element is different as it reflects the result of the processing of all the previous processing elements.  Flynn’s Taxonomy describes the same data going to every processing element.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q4: True or False? MISD architecture is the most common parallel computing architecture used for general computing tasks.'''&lt;br /&gt;
&lt;br /&gt;
A. True&lt;br /&gt;
&lt;br /&gt;
B. False&lt;br /&gt;
&lt;br /&gt;
''A4: B, False, it is often not the best suited architecture for general computing tasks and as such not common outside of academic research.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q5: When classifying a particular parallel architecture, it was noted that there was only one physical connection to provide all information streams, is it possible for this architecture to be classified as MISD architecture? (Note that there are three choices, you need to select the correct reason for the Yes/No/Maybe answer.)'''&lt;br /&gt;
&lt;br /&gt;
A. Yes, Flynn’s taxonomy does not require the number of physical connections in the hardware correspond to the number of logical streams.&lt;br /&gt;
&lt;br /&gt;
B. Maybe, we do not have enough information about the architecture to rule out the possibility of this parallel computer having a MISD architecture.  We will need to see how the connections are physically laid out on the chip.&lt;br /&gt;
&lt;br /&gt;
C. No, there must be multiple physical information streams.&lt;br /&gt;
&lt;br /&gt;
''A5. A, Yes, Flynn’s taxonomy does not require the number of physical connections in the hardware correspond to the number of logical streams.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q6. True or False? To be a MISD architecture, it requires that every processing element is identical.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A6. B, False, it is not required by definition of Flynn’s Taxonomy that all processing elements are identical.  See example in the wiki reading about the fault tolerant flight control system on the Boeing 777.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q7: Which of these is an application of the MISD architecture?'''&lt;br /&gt;
&lt;br /&gt;
A.	Keyword search&lt;br /&gt;
&lt;br /&gt;
B.	Controlling an assembly line.&lt;br /&gt;
&lt;br /&gt;
C.	Home computing&lt;br /&gt;
&lt;br /&gt;
D.	Image edge detection.&lt;br /&gt;
&lt;br /&gt;
''A7: D, Image edge detection.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q8. True or False? The systolic array is another name for the MISD architecture.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A8. B, False, systolic arrays may be implemented using a MISD architecture but they also may be other architectures such as SIMD and MIMD.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q9.  True or False? MISD is considered the easiest parallel architecture to scale.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A9. B, False, MIMD and SIMD are often used because of advantages in scaling.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q10. Flynn’s Taxonomy relies on the number of logical information streams to classify the types of parallel architectures.  What is the difference between a logical information stream and a physical information stream?'''&lt;br /&gt;
&lt;br /&gt;
A.	A logical information stream is a connection between the ALU and the location of the data that is responsible for passing information about boolean (logical) computations, whereas the physical information stream is the connection between the computer and other parts of the network.&lt;br /&gt;
&lt;br /&gt;
B.	Physical information streams are determined by how the information is physically transported around chip while logical information streams are one layer of abstraction above the physical information streams.  Logical streams describe conceptually how the information on the stream can be treated.&lt;br /&gt;
&lt;br /&gt;
C.	Logical and physical information streams transport different kinds of information but are no different in implementation.&lt;br /&gt;
&lt;br /&gt;
D.	There is no difference; the two terms can be used interchangeably.&lt;br /&gt;
&lt;br /&gt;
''A10. B, Physical information streams are determined by how the information is physically transported around chip while logical information streams are one layer of abstraction above the physical information streams.  Logical streams describe conceptually how the information on the stream can be treated.''&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=72485</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=72485"/>
		<updated>2013-02-11T04:43:44Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Edge Detection of Images on TMS320C80 Multiprocessor SystemArne Halaas, Børge Svingen, Magnar Nedland, Pål Sætrom, Ola Snøve, Jr., and Olaf René Birkeland. 2004. A recursive MISD architecture for pattern matching. IEEE Trans. Very Large Scale&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
===MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of concurrent information streams that flow into the processor. According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the Instruction (or Control) Stream and the Data Stream. Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent. Based on how many of each type of stream flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction, Single Data (SISD); Single Instruction, Multiple Data (SIMD); Multiple Instruction, Single Data (MISD); and Multiple Instruction, Multiple Data (MIMD). &lt;br /&gt;
&lt;br /&gt;
The following table (Figure 1) shows the categories of different parallel architectures according to Flynn:&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
===Overview of the MISD Architecture===&lt;br /&gt;
&lt;br /&gt;
To be categorized as an MISD architecture, a system must have at least two instruction streams and exactly one data stream, as shown in Figure 2.[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]] According to the description in Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously. &lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, none of which are commercial, general purpose parallel computers, since MIMD and SIMD architectures are often better suited to common data parallel techniques.  Under most circumstances MIMD and SIMD architectures tend to provide better scaling and more efficient use of computational resources. Instead of commercial use, MISD architectures are mainly used to create custom hardware in order to solve specific problems. The majority of the implementations have been created in academia.&lt;br /&gt;
 &lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy. One commonly cited example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction. It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline. Similarly, other architectures or applications exist which may be called MISD by some while others disagree. The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions. Each Processor at each step takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction. &lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Type of Systolic Arrays&amp;lt;ref&amp;gt;Johnson, K.T.; Hurson, A.R.; Shirazi, B.; , &amp;quot;General-purpose systolic arrays,&amp;quot; Computer , vol.26, no.11, pp.20-31, Nov. 1993 doi: 10.1109/2.241423&amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
[[Image:systolic_1.png|thumb|centre|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
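&lt;br /&gt;
The behaviour of the single systolic element in Figure 3 can be sketched as a small class. This is a minimal sketch under assumptions: the class name SystolicCell and its method names are hypothetical, and the synchronous clocking is modelled as one method call per step.&lt;br /&gt;
&lt;br /&gt;
```python
class SystolicCell:
    """One systolic element computing the sum of a scalar product."""

    def __init__(self):
        self.accumulator = 0

    def step(self, a, b):
        """Accumulate a * b and pass both operands on unmodified."""
        self.accumulator += a * b
        # The operands leave the cell unchanged for the next element.
        return a, b

    def shift_out(self):
        """Shift the finished sum out of the accumulator and reset it."""
        total, self.accumulator = self.accumulator, 0
        return total


cell = SystolicCell()
for a, b in zip([1, 2, 3], [4, 5, 6]):
    passed_on = cell.step(a, b)
print(cell.shift_out())
```
&lt;br /&gt;
Chaining such cells so that the operands returned by one step feed the neighbouring cell gives the matrix operations mentioned above.&lt;br /&gt;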
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
It is an array of systolic processing elements, which gets adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  This is referred to as Systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Programmable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
&lt;br /&gt;
It is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable either at a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the latter approach, the workstation downloads a program to each MIMD (Figure 4) systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: It contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
[[Image:systolic_2.png|thumb|centre|250px|Figure 4: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, it has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be defined as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2.]&lt;br /&gt;
&lt;br /&gt;
[[Image:systolic_4.png|thumb|centre|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Reconfigurable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on systolic array architecture consisting of PEs interconnected via SWs as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The homogeneous characteristic of the Reconfigurable Systolic Array (RSA) architecture, where each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time sharing computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically with the former as an arithmetic processor and the latter as a flexible router linking the neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on separate data bus which enables the circuit to continue its operation while the reconfiguration is in process.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7.Comparison between Architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
In the systolic array configurations described above, multiple processing elements generally execute different instructions, each from its own dedicated instruction stream, while a single data stream connects the adjacent PEs. Thus, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors argue that, because the data read as input by one processing element is the processed output of the adjacent PE, the data stream cannot be considered single: the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the data stream for systolic arrays and for the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture and not a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc.; software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|centre|250px|Figure 8 MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of the fault tolerance systems by multiple instruction streams or multiple data streams or both. Fault tolerance on computations can be implemented by multiple processors (likely with different architectures) executing the algorithms on the same set of data. The output of each processor is compared with that of the others and M out of N majority voting method is used to determine the faulty processor. Thus MISD architecture is utilized to get the fault tolerance on critical computations.&lt;br /&gt;
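&lt;br /&gt;
The M out of N majority voting mentioned above can be sketched as follows. This is only an illustration: the function name majority_vote, its return convention and the sample outputs are assumptions, and real voters usually compare values within a tolerance rather than for exact equality.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter


def majority_vote(outputs, m):
    """Return (agreed value, indices of disagreeing processors).

    outputs: one result per processor, all computed from the same data.
    m: number of matching results required to accept a value.
    Returns (None, []) when no value reaches m matching results.
    """
    value, count = Counter(outputs).most_common(1)[0]
    votes_missing = max(0, m - count)
    if votes_missing:
        return None, []
    # Every processor that disagrees with the majority is presumed faulty.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


print(majority_vote([42, 42, 41], m=2))
print(majority_vote([1, 2, 3], m=2))
```
&lt;br /&gt;
With three processors and m set to 2, one faulty processor is outvoted and identified by its index.&lt;br /&gt;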
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. The major examples are flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system is used to replace the manual flight control by an electronic control interface. The movements of the flight control in the cockpit are converted to electronic signals and are transmitted to the actuators by wires. The control computers use the feedback from the sensors to compute and control the movement of the actuators to provide the expected response. These computers also perform the task to stabilize the aircraft and perform other tasks without the knowledge of the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
The flight control system contains redundant flight control computers. If one of the flight control computers crashes, gets damaged or is affected by electromagnetic pulses, another computer can overrule the faulty one, and hence the flight of the aircraft is unharmed. The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;Yeh, Y.C.;, &amp;quot;Triple-triple redundant 777 primary flight computer,&amp;quot; Aerospace Applications Conference, 1996. Proceedings., 1996 IEEE , vol.1, no., pp.293-307 vol.1, 3-10 Feb 1996&lt;br /&gt;
doi: 10.1109/AERO.1996.495891&amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern computers, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
[[Image:fig13.png|thumb|centre|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each of them having three lanes with different processors. The flight control program is compiled for each of the processors which get the input data from the same data bus but drive the output on their individual control bus. Thus each processor executes different instructions but they process the same data. Thus, it is the best suited example of Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], the [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and the [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. Because the outputs of the lanes can differ, the median value of the outputs is used to select which lane's output is used. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”. The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism before the fault detection and identification by the cross-lane monitoring system. Thus, the MISD based multi-computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
&lt;br /&gt;
The system described above clearly has separate instruction streams: because the architecture of each processor is different, so are the instruction sets and the instruction streams. The processors have frame-synchronized input data, meaning that they work on the same set of data, fed from a single data stream. Thus the flight control system can be classified as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System&amp;lt;ref&amp;gt;M. Gajer, “Parallel Image Processing on the TMS320C80 Multiprocessor System”, http://focus.ti.com/pdfs/univ/11-CommunicationAudioSpeech.pdf&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
The image pre-processing system built on this multiprocessor needed a shorter pre-processing time. To reduce it, different image processing architectures for the ‘C80 were presented: for example, a MIMD architecture for image histogram equalization, a SIMD architecture for median filtering, and a MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows: each processor performs a different operation on the same data set. It is often said that the MISD class was defined only for the sake of completeness of the classification, because it is difficult in practice to find a computer based on it&amp;lt;ref&amp;gt;TMS320C80 System Level Synopsis, Texas Instruments, 1995&amp;lt;/ref&amp;gt;. In image processing, however, this assumption need not hold: performing many different operations on the same image is common, and the MISD architecture is a good fit for such cases. Directional edge detection is considered here as an example of implementing an image processing operation on the MISD architecture. The edges are detected in four different directions (north, south, west and east) by means of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges, so the result is four images with detected edges, each in a different direction.&lt;br /&gt;
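&lt;br /&gt;
A minimal Python sketch of directional edge detection follows. The 3x3 masks are hypothetical Prewitt-style compass kernels chosen for illustration (the text does not give the masks used on the TMS320C80, and the direction names are an assumed convention), and the four masks are applied one after another here, whereas on the ‘C80 each parallel processor would apply its own mask to the same image concurrently.&lt;br /&gt;

```python
# Hypothetical directional masks; the masks of the original system are not given.
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}

def apply_mask(image, mask):
    """Correlate a 3x3 mask with the interior pixels of a 2-D list image."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(
                mask[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3)
                for j in range(3)
            )
    return out

def directional_edges(image):
    """Each mask plays the role of one processor: same input, different operation."""
    return {name: apply_mask(image, mask) for name, mask in MASKS.items()}

# A tiny test image with a vertical step: dark on the left, bright on the right.
image = [[0, 0, 9, 9] for _ in range(4)]
edges = directional_edges(image)
```

Every call to &lt;code&gt;apply_mask&lt;/code&gt; reads the same unmodified &lt;code&gt;image&lt;/code&gt;, which is the property that makes this workload fit the single-data-stream reading of MISD.&lt;br /&gt;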
&lt;br /&gt;
Image processing is very often a multistage process. An example is an algorithm that detects the contours of the objects in an image, which is composed of four basic image processing operations. First, the image acquired from the camera undergoes low-pass filtering, which smooths it and removes noise. Next, thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of this pipeline is the master processor (MP), which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors (PP0 to PP3), as listed in Table 1.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table border=1 align=&amp;quot;center&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;Stage&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Processor&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Operation&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;MP&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Image acquisition&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP0&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Low-pass Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Thresholding&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;4&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Laplace Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;5&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Logic Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;	&lt;br /&gt;
&amp;lt;/table&amp;gt;&lt;br /&gt;
&amp;lt;p align=&amp;quot;center&amp;quot;&amp;gt;Table 1: Pipeline structure for image edge detection[8]&amp;lt;/p&amp;gt;&lt;br /&gt;
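&lt;br /&gt;
The data flow through stages 2 to 5 of Table 1 can be sketched as a chain of functions. This is a deliberately simplified, hypothetical model: it works on a single row of grey levels instead of a 2-D image, the threshold level of 128 is an assumption, and the stages run one after another, whereas in the real pipeline each parallel processor works on a different frame at the same time.&lt;br /&gt;

```python
from functools import reduce

def low_pass(row):
    """Stage 2: smooth with a 3-sample moving average (end samples copied)."""
    out = list(row)
    for i in range(1, len(row) - 1):
        out[i] = (row[i - 1] + row[i] + row[i + 1]) / 3
    return out

def threshold(row, level=128):
    """Stage 3: binarize, 1 at or above the assumed level, else 0."""
    return [int(v >= level) for v in row]

def laplace(row):
    """Stage 4: mark samples where the discrete second difference is non-zero."""
    out = [0] * len(row)
    for i in range(1, len(row) - 1):
        out[i] = int(row[i - 1] - 2 * row[i] + row[i + 1] != 0)
    return out

def logic_filter(row):
    """Stage 5: clear isolated marks whose two neighbours are both 0."""
    out = list(row)
    for i in range(1, len(row) - 1):
        if row[i] == 1 and row[i - 1] == 0 and row[i + 1] == 0:
            out[i] = 0
    return out

STAGES = [low_pass, threshold, laplace, logic_filter]

def run_pipeline(frame):
    """Feed one frame through every stage in order."""
    return reduce(lambda data, stage: stage(data), STAGES, frame)

# One made-up row of grey levels with a dark-to-bright step.
frame = [0, 0, 0, 255, 255, 255, 255, 255]
contours = run_pipeline(frame)
```

Note that every stage receives the output of the previous one rather than the original frame, which is exactly why pipelines are a disputed example of MISD.&lt;br /&gt;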
&lt;br /&gt;
===Recursive Pattern Matching using MISD Architecture &amp;lt;ref&amp;gt;Arne Halaas, Børge Svingen, Magnar Nedland, Pål Sætrom, Ola Snøve, Jr., and Olaf René Birkeland. 2004. A recursive MISD architecture for pattern matching. IEEE Trans. Very Large Scale Integr. Syst. 12, 7 (July 2004), 727-734. DOI=10.1109/TVLSI.2004.830918 http://dx.doi.org/10.1109/TVLSI.2004.830918&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
When several patterns must be matched concurrently in data and no persistent index can be created for look-up, online multipattern approximate searching is required. The algorithm evaluates the hit function H(Si,P), which operates on strings and a pattern, by partitioning P into smaller components and combining the individual results. This calls for a recursive approach, and the Si values need to be evaluated in parallel. The architecture therefore uses two complete binary trees, one for data distribution and one for result processing, and it is implemented with as many PEs as are practically possible within the implementation constraints. The data from the stream is fed to the root node of the data distribution tree and simultaneously distributed to several collections of PEs belonging to separate queries, thus implementing the MISD architecture with a single data stream and multiple instruction streams.&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
The MISD architecture is one of the four types of parallel architecture described in Flynn’s Taxonomy.  By definition, the MISD architecture has multiple instruction streams and a single data stream.  The definition also describes this architecture as requiring that all processing elements receive the same data, but most commonly cited implementations of this class of architecture do not adhere to this detail.  The result is there is debate among some experts on whether or not certain implementations are truly MISD.&lt;br /&gt;
&lt;br /&gt;
MISD architectures are not commonly found in industry as they tend to lack the scalability and efficiency of resource use of the MIMD and SIMD class of architectures.  However, MISD architectures are efficient for solving specific classes of problems often found in research.  Some examples of its use in research include image processing and systolic arrays.  While rare, some implementations do exist outside of academic settings.  One such example is fault tolerant systems such as the flight control system on the Boeing 777.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''ALU''': Arithmetic Logic Unit, the part of a processor that performs arithmetic and logical operations&lt;br /&gt;
&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
&lt;br /&gt;
'''Q1: Using the classical definition of the MISD architecture from Flynn’s Taxonomy, what is the minimum number of input streams that must be present to be classified as MISD architecture?'''&lt;br /&gt;
&lt;br /&gt;
A. One information stream, the data stream; the number of instruction streams does not matter.&lt;br /&gt;
&lt;br /&gt;
B. Two information streams, one instruction stream, one data stream.&lt;br /&gt;
&lt;br /&gt;
C. Three information streams, two instruction streams, one data stream.&lt;br /&gt;
&lt;br /&gt;
D. The number of information streams is irrelevant as per Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A1: C, Three information streams, two instruction streams, one data stream.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q2: Why is the Pipeline architecture often cited as an example of MISD?'''&lt;br /&gt;
&lt;br /&gt;
A. The Pipeline architecture is exactly like the definition of MISD architecture from Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
B. The data to be processed travels linearly across all processing elements with each processing element executing a different instruction and passing it on to the next processing element.  This line of data is cited to be a single “stream.”&lt;br /&gt;
&lt;br /&gt;
C. The data stream sends the same data to all processing elements for every cycle.&lt;br /&gt;
&lt;br /&gt;
D. Flynn based his description of MISD architecture on the Pipeline architecture in his paper that introduced Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A2: B, The data to be processed travels linearly across all processing elements with each processing element executing a different instruction and passing it on to the next processing element.  This line of data is cited to be a single “stream.”''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q3: Why is the Pipeline architecture NOT considered a true MISD architecture by many experts?'''&lt;br /&gt;
&lt;br /&gt;
A. This is incorrect; there is consensus that the Pipeline architecture is synonymous with the MISD architecture described by Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
B. There is only a single, global information/control stream in the Pipeline architecture.&lt;br /&gt;
&lt;br /&gt;
C. The data passed to each processing element is different as it reflects the result of the processing of all the previous processing elements.  Flynn’s Taxonomy describes the same data going to every processing element.&lt;br /&gt;
&lt;br /&gt;
D. It has been proven that it is impossible to build a MISD architecture parallel computer as described by Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A3: C, The data passed to each processing element is different as it reflects the result of the processing of all the previous processing elements.  Flynn’s Taxonomy describes the same data going to every processing element.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q4: True or False? MISD architecture is the most common parallel computing architecture used for general computing tasks.'''&lt;br /&gt;
&lt;br /&gt;
A. True&lt;br /&gt;
&lt;br /&gt;
B. False&lt;br /&gt;
&lt;br /&gt;
''A4: B, False, it is often not the best suited architecture for general computing tasks and as such not common outside of academic research.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q5: When classifying a particular parallel architecture, it was noted that there was only one physical connection to provide all information streams, is it possible for this architecture to be classified as MISD architecture? (Note that there are three choices, you need to select the correct reason for the Yes/No/Maybe answer.)'''&lt;br /&gt;
&lt;br /&gt;
A. Yes, Flynn’s taxonomy does not require the number of physical connections in the hardware correspond to the number of logical streams.&lt;br /&gt;
&lt;br /&gt;
B. Maybe, we do not have enough information about the architecture to rule out the possibility of this parallel computer having a MISD architecture.  We will need to see how the connections are physically laid out on the chip.&lt;br /&gt;
&lt;br /&gt;
C. No, there must be multiple physical information streams.&lt;br /&gt;
&lt;br /&gt;
''A5. A, Yes, Flynn’s taxonomy does not require the number of physical connections in the hardware correspond to the number of logical streams.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q6. True or False? To be a MISD architecture, it requires that every processing element is identical.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A6. B, False, it is not required by definition of Flynn’s Taxonomy that all processing elements are identical.  See example in the wiki reading about the fault tolerant flight control system on the Boeing 777.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q7: Which of these is an application of the MISD architecture?'''&lt;br /&gt;
&lt;br /&gt;
A.	Keyword search&lt;br /&gt;
&lt;br /&gt;
B.	Controlling an assembly line.&lt;br /&gt;
&lt;br /&gt;
C.	Home computing&lt;br /&gt;
&lt;br /&gt;
D.	Image edge detection.&lt;br /&gt;
&lt;br /&gt;
''A7: D, Image edge detection.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q8. True or False? The systolic array is another name for the MISD architecture.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A8. B, False, systolic arrays may be implemented using a MISD architecture but they also may be other architectures such as SIMD and MIMD.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q9.  True or False? MISD is considered the easiest parallel architecture to scale.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A9. B, False, MIMD and SIMD are often used because of advantages in scaling.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q10. Flynn’s Taxonomy relies on the number of logical information streams to classify the types of parallel architectures.  What is the difference between a logical information stream and a physical information stream?'''&lt;br /&gt;
&lt;br /&gt;
A.	A logical information stream is a connection between the ALU and the location of the data that is responsible for passing information about boolean (logical) computations, whereas the physical information stream is the connection between the computer and other parts of the network.&lt;br /&gt;
&lt;br /&gt;
B.	Physical information streams are determined by how the information is physically transported around the chip, while logical information streams are one layer of abstraction above the physical information streams.  Logical streams describe conceptually how the information on the stream can be treated.&lt;br /&gt;
&lt;br /&gt;
C.	Logical and physical information streams transport different kinds of information but are no different in implementation.&lt;br /&gt;
&lt;br /&gt;
D.	There is no difference; the two terms can be used interchangeably.&lt;br /&gt;
&lt;br /&gt;
''A10. B, Physical information streams are determined by how the information is physically transported around the chip, while logical information streams are one layer of abstraction above the physical information streams.  Logical streams describe conceptually how the information on the stream can be treated.''&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=72484</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=72484"/>
		<updated>2013-02-11T04:37:38Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Edge Detection of Images on TMS320C80 Multiprocessor SystemM.Gajer;,“ Parallel Image Processing on The TMS320C80 Multiprocessor System”, http://focus.ti.com/pdfs/univ/11-CommunicationAudioSpeech.pdf */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
===MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of concurrent information streams that flow into the processor. According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the Instruction (or Control) Stream and the Data Stream. Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent. Based on how many of each type of stream flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction, Single Data (SISD); Single Instruction, Multiple Data (SIMD); Multiple Instruction, Single Data (MISD); and Multiple Instruction, Multiple Data (MIMD). &lt;br /&gt;
&lt;br /&gt;
The following table (Figure 1) shows the categories of different parallel architectures according to Flynn:&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
===Overview of the MISD Architecture===&lt;br /&gt;
&lt;br /&gt;
In order to be categorized as MISD architecture, there must be at least two Instruction streams and only one Data stream, as shown in Figure 2.[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]] As described by Flynn’s Taxonomy, each processing element should receive the same data and execute a different instruction on that data simultaneously. &lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, and none of them are commercial, general-purpose parallel computers, since MIMD and SIMD are often better suited to common data-parallel techniques.  Under most circumstances MIMD and SIMD architectures tend to provide better scaling and more efficient use of computational resources. Rather than being sold commercially, MISD architectures are mainly used to create custom hardware that solves specific problems. The majority of the implementations have been created in academia.&lt;br /&gt;
 &lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy. One commonly cited example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction. It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline. Similarly, other architectures or applications exist which may be called MISD by some while others disagree. The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions. At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction. &lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Type of Systolic Arrays&amp;lt;ref&amp;gt;Johnson, K.T.; Hurson, A.R.; Shirazi, B.; , &amp;quot;General-purpose systolic arrays,&amp;quot; Computer , vol.26, no.11, pp.20-31, Nov. 1993 doi: 10.1109/2.241423&amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
[[Image:systolic_1.png|thumb|centre|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
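&lt;br /&gt;
The behaviour of the single systolic element in Figure 3 can be sketched as follows. The class name &lt;code&gt;SystolicCell&lt;/code&gt; and the sample operands are hypothetical; the sketch only assumes what the text states, namely that each cell multiplies and accumulates while passing its inputs on unchanged.&lt;br /&gt;

```python
class SystolicCell:
    """One processing element: multiply-accumulate, then pass the inputs on."""

    def __init__(self):
        self.acc = 0  # running sum of products held in the accumulator

    def step(self, a, b):
        """Consume one (a, b) pair per clock step."""
        self.acc += a * b
        return a, b  # forwarded unmodified to the neighbouring cells

cell = SystolicCell()
for a, b in zip([1, 2, 3], [4, 5, 6]):
    passed = cell.step(a, b)

total = cell.acc  # shifted out of the accumulator once the operands are consumed
```

In a full array, the pair returned by &lt;code&gt;step&lt;/code&gt; would become the input of the next cell on the following clock step.&lt;br /&gt;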
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
It is an array of systolic processing elements, which gets adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  This is referred to as Systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Programmable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
&lt;br /&gt;
It is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the latter approach, the workstation downloads a program to each MIMD (Figure 4) systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: It contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
[[Image:systolic_2.png|thumb|centre|250px|Figure 4: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, it has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be described as Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2.]&lt;br /&gt;
&lt;br /&gt;
[[Image:systolic_4.png|thumb|centre|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Reconfigurable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on systolic array architecture consisting of PEs interconnected via SWs as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The homogeneous characteristic of the Reconfigurable Systolic Array (RSA) architecture, where each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time sharing computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically with the former as an arithmetic processor and the latter as a flexible router linking the neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on separate data bus which enables the circuit to continue its operation while the reconfiguration is in process.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7.Comparison between Architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
As seen from the systolic array configurations described above, they generally have multiple processing elements, each executing different instructions from its own dedicated instruction stream. A single data stream connects the adjacent PEs. Thus, a systolic array can be described as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors object that the data read as input by one processing element is the processed output of the adjacent PE, so the data stream cannot be considered single: the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the data stream for systolic arrays and for the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture and not a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc.; software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|centre|250px|Figure 8 MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of the fault tolerance systems by multiple instruction streams or multiple data streams or both. Fault tolerance on computations can be implemented by multiple processors (likely with different architectures) executing the algorithms on the same set of data. The output of each processor is compared with that of the others and M out of N majority voting method is used to determine the faulty processor. Thus MISD architecture is utilized to get the fault tolerance on critical computations.&lt;br /&gt;
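&lt;br /&gt;
M out of N majority voting can be sketched as below. The function name &lt;code&gt;majority_vote&lt;/code&gt; and the sample outputs are hypothetical, and the sketch assumes the processor outputs can be compared for exact equality; real systems typically compare within a tolerance.&lt;br /&gt;

```python
from collections import Counter

def majority_vote(outputs, m):
    """Return (agreed_value, faulty_indices) if at least m outputs agree.

    outputs holds one result per redundant processor, all computed from the
    same input data; processors that disagree with the majority are reported.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count >= m:
        faulty = [i for i, out in enumerate(outputs) if out != value]
        return value, faulty
    raise RuntimeError("no value reached the required majority")

# 2-out-of-3 voting on made-up results; the third processor disagrees.
value, faulty = majority_vote([42, 42, 41], m=2)
```

If no value reaches the required count, the sketch raises an error instead of guessing, which mirrors the need for a safe fallback when the redundant processors cannot agree.&lt;br /&gt;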
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as fault tolerant architecture. The major examples being flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system is used to replace the manual flight control by an electronic control interface. The movements of the flight control in the cockpit are converted to electronic signals and are transmitted to the actuators by wires. The control computers use the feedback from the sensors to compute and control the movement of the actuators to provide the expected response. These computers also perform the task to stabilize the aircraft and perform other tasks without the knowledge of the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
The flight control system contains redundant flight control computers. If one of the flight control computers crashes, gets damaged or is affected by electromagnetic pulses, another computer can overrule the faulty one, so the flight of the aircraft is unharmed. The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;Yeh, Y.C.;, &amp;quot;Triple-triple redundant 777 primary flight computer,&amp;quot; Aerospace Applications Conference, 1996. Proceedings., 1996 IEEE , vol.1, no., pp.293-307 vol.1, 3-10 Feb 1996&lt;br /&gt;
doi: 10.1109/AERO.1996.495891&amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern aircraft, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
[[Image:fig13.png|thumb|centre|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each having three lanes with different processors. The flight control program is compiled for each of the processors, which get their input data from the same data bus but drive their outputs on individual control buses. Thus each processor executes different instructions, but all of them process the same data, which makes the system a well-suited example of Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. As the outputs of the lanes can differ, the median value of the outputs is used to select the lane output to be considered. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”. The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism before fault detection and identification by the cross-lane monitoring system. Thus, the MISD-based multicomputer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
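&lt;br /&gt;
The median select described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the logic of the actual PFC: the helper names median_select and flag_disagreeing_lanes, the lane labels, and the tolerance value are all hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from statistics import median


def median_select(lane_outputs):
    """Return the median of the lane outputs (hypothetical helper)."""
    return median(lane_outputs.values())


def flag_disagreeing_lanes(lane_outputs, tolerance):
    """List lanes whose output differs from the median by more than tolerance.

    The tolerance is an assumed parameter, not a value taken from the source.
    """
    selected = median_select(lane_outputs)
    return [lane for lane, value in lane_outputs.items()
            if abs(value - selected) > tolerance]


# Three dissimilar lanes compute a command from the same frame of sensor data;
# the hypothetical lane_c produces a wild value.
outputs = {"lane_a": 2.0, "lane_b": 2.5, "lane_c": 40.0}
selected = median_select(outputs)
suspects = flag_disagreeing_lanes(outputs, tolerance=1.0)
```
&lt;br /&gt;
With these inputs the selected value is 2.5 and only lane_c is flagged, so the wild value never reaches the output. This mirrors the idea of blocking a fault before it is identified.&lt;br /&gt;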
&lt;br /&gt;
The system described above clearly has individual instruction streams: because the architecture of each processor is different, each has a different instruction set and therefore a different instruction stream. The processors have frame-synchronized input data, which means they all work on the same set of data fed from a single data stream. Thus the flight control system can be classified under MISD architecture.&lt;br /&gt;
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System&amp;lt;ref&amp;gt;M.Gajer;,“ Parallel Image Processing on The TMS320C80 Multiprocessor System”, http://focus.ti.com/pdfs/univ/11-CommunicationAudioSpeech.pdf&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor, the pre-processing time needed to be shortened. To reduce it, different types of image processing architectures for the ‘C80 were presented: for example, MIMD architecture for image histogram equalization, SIMD architecture for median filtering, and MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows. Here, each processor performs a different operation on the same data set. It is often said that the MISD architecture was defined only for the sake of completeness of the classification, because it is difficult in practice to find a computer based on it&amp;lt;ref&amp;gt;TMS320C80 System Level Synopsis, Texas Instruments, 1995&amp;lt;/ref&amp;gt;. In image processing, however, this assumption need not hold: performing many different operations on the same image is not rare, and the MISD architecture is a good fit for such cases. As an example of implementing image processing operations on the MISD architecture, consider edge detection. The edges are detected in four different directions (north, south, west and east) by means of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
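&lt;br /&gt;
The directional edge detection described above can be sketched as follows. The source does not list the filter masks that were used, so the four 3x3 masks below are assumptions chosen for illustration, as are the function names. The sketch runs the masks one after another, whereas on the TMS320C80 each mask would run on its own parallel processor.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical 3x3 directional masks; the source does not specify the real ones.
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def apply_mask(image, mask):
    """Correlate a 3x3 mask with the image interior (border pixels are skipped)."""
    rows, cols = len(image), len(image[0])
    result = []
    for r in range(1, rows - 1):
        row = []
        for c in range(1, cols - 1):
            row.append(sum(mask[i][j] * image[r - 1 + i][c - 1 + j]
                           for i in range(3) for j in range(3)))
        result.append(row)
    return result


def directional_edges(image):
    """Apply every mask (one instruction stream each) to the same image."""
    return {name: apply_mask(image, mask) for name, mask in MASKS.items()}


# A 4x4 image whose right half is bright, i.e. a single vertical edge.
image = [[0, 0, 9, 9] for _ in range(4)]
edges = directional_edges(image)
```
&lt;br /&gt;
For this test image the east and west masks respond to the vertical edge, while the north and south masks produce zeros. All four results are computed from the same input data, which is the MISD property being illustrated.&lt;br /&gt;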
&lt;br /&gt;
Image processing on a computer is very often a multistage process. Consider, for example, an algorithm that detects the contours of the objects in an image. The algorithm is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and eliminates noise. In the next step, thresholding is performed, and each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background. To run these operations in parallel, a five-stage pipeline was used. The first stage of the pipeline is the master processor, which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors, as shown in Table 1.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table border=1 align=&amp;quot;center&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;Stage&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Processor&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Operation&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;MP&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Image acquisition&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP0&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Low-pass Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Thresholding&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;4&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Laplace Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;5&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Logic Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;	&lt;br /&gt;
&amp;lt;/table&amp;gt;&lt;br /&gt;
&amp;lt;p align=&amp;quot;center&amp;quot;&amp;gt;Table 1: Pipeline structure for image edge detection[8]&amp;lt;/p&amp;gt;&lt;br /&gt;
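&lt;br /&gt;
The processing stages of Table 1 can be sketched as a chain of functions. Everything below is an assumption made for illustration: the 3x3 mean filter, the threshold level of 128, the 8-neighbour Laplacian, the convention that edges are black (0) on white (255), and all function names. The sketch pushes one frame through the stages in order; in the real pipeline successive frames would occupy different stages at the same time.&lt;br /&gt;
&lt;br /&gt;
```python
def window(image, r, c):
    """Return the 3x3 neighbourhood of (r, c), clamping indices at the borders."""
    rows, cols = len(image), len(image[0])
    return [image[min(max(r + dr, 0), rows - 1)][min(max(c + dc, 0), cols - 1)]
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def map_pixels(image, func):
    """Build a new image by evaluating func(image, r, c) at every pixel."""
    return [[func(image, r, c) for c in range(len(image[0]))]
            for r in range(len(image))]


def low_pass(image):
    """Stage 2 (PP0): smooth with an assumed 3x3 mean filter."""
    return map_pixels(image, lambda img, r, c: sum(window(img, r, c)) // 9)


def threshold(image, level=128):
    """Stage 3 (PP1): make each pixel white (255) or black (0)."""
    return [[255 if pixel >= level else 0 for pixel in row] for row in image]


def laplace(image):
    """Stage 4 (PP2): mark pixels with a nonzero Laplacian response as black."""
    def edge(img, r, c):
        w = window(img, r, c)
        return 0 if 9 * w[4] - sum(w) != 0 else 255
    return map_pixels(image, edge)


def logic_filter(image):
    """Stage 5 (PP3): remove single black pixels on a white background."""
    def clean(img, r, c):
        w = window(img, r, c)
        isolated = w[4] == 0 and sum(w) == 8 * 255
        return 255 if isolated else w[4]
    return map_pixels(image, clean)


STAGES = [low_pass, threshold, laplace, logic_filter]


def run_pipeline(image):
    """Pass one acquired frame through stages 2 to 5 in order."""
    for stage in STAGES:
        image = stage(image)
    return image


flat = run_pipeline([[200] * 4 for _ in range(4)])
```
&lt;br /&gt;
A uniformly bright frame contains no contours, so the result in flat is entirely white.&lt;br /&gt;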
&lt;br /&gt;
===A Recursive MISD Architecture for Pattern Matching&amp;lt;ref&amp;gt;Arne Halaas, Børge Svingen, Magnar Nedland, Pål Sætrom, Ola Snøve, Jr., and Olaf René Birkeland. 2004. A recursive MISD architecture for pattern matching. IEEE Trans. Very Large Scale Integr. Syst. 12, 7 (July 2004), 727-734. DOI=10.1109/TVLSI.2004.830918 http://dx.doi.org/10.1109/TVLSI.2004.830918&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Situations where several patterns must be matched concurrently in data, and where no persistent index can be created for look-up, require online multipattern approximate searching. This algorithm requires evaluating the hit function H(Si,P), a function of strings and a pattern, which involves partitioning P into smaller components and combining the individual results. A recursive approach is required, and the Si values need to be evaluated in parallel. It therefore helps to use an architecture with two complete binary trees that take care of data distribution and result processing. The architecture is implemented with as many PEs as is practical within the implementation constraints. The data from the stream is fed to the root node of the data distribution tree and simultaneously distributed to several collections of PEs belonging to separate queries, thus implementing the MISD architecture with a single data stream and multiple instruction streams.&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
The MISD architecture is one of the four types of parallel architecture described in Flynn’s Taxonomy.  By definition, the MISD architecture has multiple instruction streams and a single data stream.  The definition also describes this architecture as requiring that all processing elements receive the same data, but most commonly cited implementations of this class of architecture do not adhere to this detail.  As a result, some experts debate whether certain implementations are truly MISD.&lt;br /&gt;
&lt;br /&gt;
MISD architectures are not commonly found in industry as they tend to lack the scalability and efficiency of resource use of the MIMD and SIMD class of architectures.  However, MISD architectures are efficient for solving specific classes of problems often found in research.  Some examples of its use in research include image processing and systolic arrays.  While rare, some implementations do exist outside of academic settings.  One such example is fault tolerant systems such as the flight control system on the Boeing 777.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''ALU''': Arithmetic Logic Unit, the part of a processor that performs arithmetic and logical operations&lt;br /&gt;
&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
&lt;br /&gt;
'''Q1: Using the classical definition of the MISD architecture from Flynn’s Taxonomy, what is the minimum number of information streams that must be present for an architecture to be classified as MISD?'''&lt;br /&gt;
&lt;br /&gt;
A. One information stream, the data stream; the number of instruction streams does not matter.&lt;br /&gt;
&lt;br /&gt;
B. Two information streams, one instruction stream, one data stream.&lt;br /&gt;
&lt;br /&gt;
C. Three information streams, two instruction streams, one data stream.&lt;br /&gt;
&lt;br /&gt;
D. The number of information streams is irrelevant as per Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A1: C, Three information streams, two instruction streams, one data stream.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q2: Why is the Pipeline architecture often cited as an example of MISD?'''&lt;br /&gt;
&lt;br /&gt;
A. The Pipeline architecture is exactly like the definition of MISD architecture from Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
B. The data to be processed travels linearly across all processing elements with each processing element executing a different instruction and passing it on to the next processing element.  This line of data is cited to be a single “stream.”&lt;br /&gt;
&lt;br /&gt;
C. The data stream sends the same data to all processing elements for every cycle.&lt;br /&gt;
&lt;br /&gt;
D. Flynn based his description of MISD architecture on the Pipeline architecture in his paper that introduced Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A2: B, The data to be processed travels linearly across all processing elements with each processing element executing a different instruction and passing it on to the next processing element.  This line of data is cited to be a single “stream.”''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q3: Why is the Pipeline architecture NOT considered a true MISD architecture by many experts?'''&lt;br /&gt;
&lt;br /&gt;
A. This is incorrect; there is consensus that the Pipeline architecture is synonymous with the MISD architecture described by Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
B. There is only a single, global information/control stream in the Pipeline architecture.&lt;br /&gt;
&lt;br /&gt;
C. The data passed to each processing element is different as it reflects the result of the processing of all the previous processing elements.  Flynn’s Taxonomy describes the same data going to every processing element.&lt;br /&gt;
&lt;br /&gt;
D. It has been proven that it is impossible to build a MISD architecture parallel computer as described by Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A3: C, The data passed to each processing element is different as it reflects the result of the processing of all the previous processing elements.  Flynn’s Taxonomy describes the same data going to every processing element.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q4: True or False? MISD architecture is the most common parallel computing architecture used for general computing tasks.'''&lt;br /&gt;
&lt;br /&gt;
A. True&lt;br /&gt;
&lt;br /&gt;
B. False&lt;br /&gt;
&lt;br /&gt;
''A4: B, False, it is often not the best suited architecture for general computing tasks and as such not common outside of academic research.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q5: When classifying a particular parallel architecture, it was noted that there was only one physical connection to provide all information streams. Is it possible for this architecture to be classified as MISD architecture? (Note that there are three choices; you need to select the correct reason for the Yes/No/Maybe answer.)'''&lt;br /&gt;
&lt;br /&gt;
A. Yes, Flynn’s taxonomy does not require the number of physical connections in the hardware correspond to the number of logical streams.&lt;br /&gt;
&lt;br /&gt;
B. Maybe, we do not have enough information about the architecture to rule out the possibility of this parallel computer having a MISD architecture.  We will need to see how the connections are physically laid out on the chip.&lt;br /&gt;
&lt;br /&gt;
C. No, there must be multiple physical information streams.&lt;br /&gt;
&lt;br /&gt;
''A5. A, Yes, Flynn’s taxonomy does not require the number of physical connections in the hardware correspond to the number of logical streams.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q6. True or False? A MISD architecture requires that every processing element be identical.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A6. B, False, it is not required by definition of Flynn’s Taxonomy that all processing elements are identical.  See example in the wiki reading about the fault tolerant flight control system on the Boeing 777.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q7: Which of these is an application of the MISD architecture?'''&lt;br /&gt;
&lt;br /&gt;
A.	Keyword search&lt;br /&gt;
&lt;br /&gt;
B.	Controlling an assembly line.&lt;br /&gt;
&lt;br /&gt;
C.	Home computing&lt;br /&gt;
&lt;br /&gt;
D.	Image edge detection.&lt;br /&gt;
&lt;br /&gt;
''A7: D, Image edge detection.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q8. True or False? The systolic array is another name for the MISD architecture.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A8. B, False, systolic arrays may be implemented using a MISD architecture but they also may be other architectures such as SIMD and MIMD.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q9.  True or False? MISD is considered the easiest parallel architecture to scale.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A9. B, False, MIMD and SIMD are often used because of advantages in scaling.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q10. Flynn’s Taxonomy relies on the number of logical information streams to classify the types of parallel architectures.  What is the difference between a logical information stream and a physical information stream?'''&lt;br /&gt;
&lt;br /&gt;
A.	A logical information stream is a connection between the ALU and the location of the data that is responsible for passing information about boolean (logical) computations, whereas the physical information stream is the connection between the computer and other parts of the network.&lt;br /&gt;
&lt;br /&gt;
B.	Physical information streams are determined by how the information is physically transported around the chip, while logical information streams are one layer of abstraction above the physical information streams.  Logical streams describe conceptually how the information on the stream can be treated.&lt;br /&gt;
&lt;br /&gt;
C.	Logical and physical information streams transport different kinds of information but are no different in implementation.&lt;br /&gt;
&lt;br /&gt;
D.	There is no difference; the two terms can be used interchangeably.&lt;br /&gt;
&lt;br /&gt;
''A10. B, Physical information streams are determined by how the information is physically transported around the chip, while logical information streams are one layer of abstraction above the physical information streams.  Logical streams describe conceptually how the information on the stream can be treated.''&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=72483</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=72483"/>
		<updated>2013-02-11T04:36:19Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Fault Tolerant Systemshttp://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
===MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of concurrent information streams that flow into the processor. According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the Instruction (or Control) Stream and the Data Stream. Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent. Based on how many of each type of stream flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction, Single Data (SISD); Single Instruction, Multiple Data (SIMD); Multiple Instruction, Single Data (MISD); and Multiple Instruction, Multiple Data (MIMD).&lt;br /&gt;
&lt;br /&gt;
The following table (Figure 1) shows the categories of different parallel architectures according to Flynn:&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
===Overview of the MISD Architecture===&lt;br /&gt;
&lt;br /&gt;
To be categorized as MISD architecture, a system must have at least two instruction streams and exactly one data stream, as shown in Figure 2.[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]] According to Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, and none of them are commercial, general purpose parallel computers, since the MIMD and SIMD architectures are often better suited to common data parallel techniques.  Under most circumstances MIMD and SIMD architectures tend to provide better scaling and more efficient use of computational resources. Rather than being used commercially, MISD architectures are mainly used to create custom hardware that solves specific problems. The majority of the implementations have been created in academia.&lt;br /&gt;
 &lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy. One commonly cited example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction. It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline. Similarly, other architectures or applications exist which may be called MISD by some while others disagree. The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions. At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Type of Systolic Arrays&amp;lt;ref&amp;gt;Johnson, K.T.; Hurson, A.R.; Shirazi, B.; , &amp;quot;General-purpose systolic arrays,&amp;quot; Computer , vol.26, no.11, pp.20-31, Nov. 1993 doi: 10.1109/2.241423&amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is in matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, a’s and b’s are synchronously shifted through the processing element so that they are available to the next element. These data synchronously exit the processing element unmodified.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
[[Image:systolic_1.png|thumb|centre|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
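&lt;br /&gt;
The behaviour of the single systolic element in Figure 3 can be sketched as a small Python class. The class and method names are hypothetical and the model ignores timing; it only shows that each pair of operands is multiplied, accumulated, and passed on unmodified.&lt;br /&gt;
&lt;br /&gt;
```python
class SystolicElement:
    """One multiply-accumulate cell (a hypothetical model of the element in Figure 3)."""

    def __init__(self):
        self.accumulator = 0

    def step(self, a, b):
        """Accumulate a*b and pass a and b on unmodified to the next element."""
        self.accumulator += a * b
        return a, b

    def shift_out(self):
        """Shift the sum of products out of the accumulator and clear it."""
        total, self.accumulator = self.accumulator, 0
        return total


element = SystolicElement()
# One call to step() per clock tick; the returned pair would feed a neighbour.
passed_on = [element.step(a, b) for a, b in zip([1, 2, 3], [4, 5, 6])]
result = element.shift_out()
```
&lt;br /&gt;
Here result holds 1*4 + 2*5 + 3*6 = 32, and passed_on holds the operand pairs exactly as they entered the element.&lt;br /&gt;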
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Programmable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the latter approach, the workstation downloads a program to each MIMD (Figure 4) systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: It contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
[[Image:systolic_2.png|thumb|centre|250px|Figure 4: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, the architecture has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be defined as Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2.]&lt;br /&gt;
&lt;br /&gt;
[[Image:systolic_4.png|thumb|centre|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Reconfigurable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on systolic array architecture consisting of PEs interconnected via SWs as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The homogeneous characteristic of the Reconfigurable Systolic Array (RSA) architecture, where each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time sharing computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically with the former as an arithmetic processor and the latter as a flexible router linking the neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on separate data bus which enables the circuit to continue its operation while the reconfiguration is in process.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7.Comparison between Architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
The systolic array configurations described above generally have multiple processing elements, each executing different instructions from its own dedicated instruction stream. A single data stream connects the adjacent PEs. Thus, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors argue that, because the data read as input by one processing element is the processed output of the adjacent PE, the data stream cannot be considered single: the data paths do not all carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the data stream for systolic arrays and for the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture and not a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc., and software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|centre|250px|Figure 8 MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance in computations can be implemented by multiple processors (likely with different architectures) executing the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority voting method is used to identify the faulty processor. Thus MISD architecture is used to achieve fault tolerance in critical computations.&lt;br /&gt;
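&lt;br /&gt;
M-out-of-N majority voting can be sketched as follows. The function name majority_vote, the processor labels, and the use of exact equality to compare outputs are assumptions made for this example; a real system would typically compare results within a tolerance.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter


def majority_vote(outputs, m):
    """Return (agreed value, faulty processors) under M-out-of-N voting.

    outputs maps a processor name to its result, and m is the number of
    processors that must agree. Returns (None, []) when no value reaches m.
    """
    value, votes = Counter(outputs.values()).most_common(1)[0]
    if votes >= m:
        faulty = sorted(name for name, out in outputs.items() if out != value)
        return value, faulty
    return None, []


# Three hypothetical processors run their own programs on the same data.
outputs = {"cpu_a": 42, "cpu_b": 42, "cpu_c": 41}
value, faulty = majority_vote(outputs, m=2)
```
&lt;br /&gt;
With 2-out-of-3 voting the agreed value is 42 and cpu_c is reported as faulty. If every processor disagreed, no value would reach the required count and the function would return None with an empty list.&lt;br /&gt;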
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. Major examples include flight control systems, nuclear power plants, satellite systems, and super collider experiment systems. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system replaces manual flight control with an electronic control interface. The movements of the flight controls in the cockpit are converted to electronic signals and transmitted to the actuators by wires. The control computers use the feedback from the sensors to compute and control the movement of the actuators to provide the expected response. These computers also stabilize the aircraft and perform other tasks without the pilot's input. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
The flight control system contains redundant flight control computers. If one of the flight control computers crashes, gets damaged or is affected by electromagnetic pulses, the other computers can overrule the faulty one, and hence the flight of the aircraft is unharmed. The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;Yeh, Y.C.;, &amp;quot;Triple-triple redundant 777 primary flight computer,&amp;quot; Aerospace Applications Conference, 1996. Proceedings., 1996 IEEE , vol.1, no., pp.293-307 vol.1, 3-10 Feb 1996&lt;br /&gt;
doi: 10.1109/AERO.1996.495891&amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern aircraft, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
[[Image:fig13.png|thumb|centre|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each of which has three lanes with different processors. The flight control program is compiled for each of the processors, which get their input data from the same data bus but drive their outputs on their individual control buses. Thus the processors execute different instructions but process the same data. This makes the system a well-suited example of Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], the [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and the [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. As the outputs of the lanes can differ, the median value of the outputs is used to select the output to be considered. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism that acts before fault detection and identification by the cross-lane monitoring system. Thus, the MISD based multi-computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
&lt;br /&gt;
The above mentioned system clearly has individual instruction streams, since each processor has a different architecture and therefore a different instruction set and instruction stream. The processors have frame synchronized input data, which means they work on the same set of data, fed from a single data stream. Thus the flight control system can be classified as an MISD architecture.&lt;br /&gt;
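&lt;br /&gt;
The median select described above can be sketched in a few lines. This is a simplified Python illustration, not the actual PFC logic: the function name median_select, the tolerance parameter and the sample lane values are assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
import statistics
from operator import gt


def median_select(lane_outputs, tolerance):
    """Return (selected_command, suspect_lanes) for one frame of lane outputs.

    lane_outputs: one numeric command per lane (hypothetical values).
    tolerance: assumed maximum deviation from the median before a lane is
               reported by this simplified cross-lane monitor.
    """
    # With three lanes the median is one of the actual outputs, so a single
    # wild value is blocked before any fault has been identified.
    selected = statistics.median(lane_outputs)
    suspect = [
        lane for lane, out in enumerate(lane_outputs)
        if gt(abs(out - selected), tolerance)
    ]
    return selected, suspect
```
&lt;br /&gt;
For example, median_select([10.0, 10.2, 55.0], 0.5) selects 10.2 and reports lane 2 as suspect.&lt;br /&gt;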
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System&amp;lt;ref&amp;gt;M.Gajer;,“ Parallel Image Processing on The TMS320C80 Multiprocessor System”, http://focus.ti.com/pdfs/univ/11-CommunicationAudioSpeech.pdf&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor, the pre-processing time needed to be shortened. To that end, different types of image processing architectures for the ‘C80 were presented. For example, the MIMD architecture is used for image histogram equalization, the SIMD architecture for median filtering, and the MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows: each processor performs a different operation on the same data set. It is often stated that the MISD architecture was defined only for the sake of completeness of the classification, because it is difficult in practice to find a computer based on it&amp;lt;ref&amp;gt;TMS320C80 System Level Synopsis, Texas Instruments, 1995&amp;lt;/ref&amp;gt;. In image processing, however, this assumption need not hold, because performing many different operations on the same image is common, and the MISD architecture is a good fit for such cases. As an example of implementing image processing operations on the MISD architecture, consider edge detection. The edges are detected in four different directions (north, south, west and east) by means of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
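&lt;br /&gt;
The directional edge detection above can be sketched as four different masks applied to one and the same image. The Python below is an illustration only: the mask coefficients are simple gradient masks chosen for the example and are not taken from the cited work, the names MASKS, apply_mask and directional_edges are hypothetical, and the masks run one after another here whereas on the TMS320C80 each would run on its own parallel processor.&lt;br /&gt;
&lt;br /&gt;
```python
# Four 3x3 directional masks (illustrative coefficients, an assumption).
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def apply_mask(image, mask):
    """Correlate a 3x3 mask with the interior pixels of a 2D list image.

    Border pixels are left at 0 to keep the sketch short.
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(
                mask[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3)
                for j in range(3)
            )
    return out


def directional_edges(image):
    """Apply every mask to the same image: different operations, one data set."""
    return {name: apply_mask(image, mask) for name, mask in MASKS.items()}
```
&lt;br /&gt;
For a 3x3 image whose right column is brighter than the rest, only the east and west results respond at the centre pixel, with opposite signs.&lt;br /&gt;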
&lt;br /&gt;
Image processing on a computer is very often a multistage process. Consider, for example, an algorithm that detects the contours of the objects in an image. The algorithm is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and removes noise. In the next step, thresholding is performed, and each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of this pipeline is the master processor, which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors, as listed in Table 1.&lt;br /&gt;
&amp;lt;table border=1&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;Stage&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Processor&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Operation&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;MP&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Image acquisition&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP0&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Low-pass Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Thresholding&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;4&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Laplace Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;5&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Logic Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;	&lt;br /&gt;
&amp;lt;/table&amp;gt;&lt;br /&gt;
Table 1: Pipeline structure for image edge detection[8]&lt;br /&gt;
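&lt;br /&gt;
The way frames move through the five stages of Table 1 can be sketched as a schedule. This Python fragment is an idealized illustration: it assumes every stage needs exactly one clock tick per frame, and the names STAGES and pipeline_schedule are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# Stage order taken from Table 1.
STAGES = [
    "MP: image acquisition",
    "PP0: low-pass filtering",
    "PP1: thresholding",
    "PP2: Laplace filtering",
    "PP3: logic filtering",
]


def pipeline_schedule(num_frames):
    """Return one row per clock tick; row[s] is the frame index that stage s
    is working on at that tick, or None when the stage is idle."""
    ticks = num_frames + len(STAGES) - 1
    schedule = []
    for t in range(ticks):
        row = []
        for s in range(len(STAGES)):
            frame = t - s  # stage s sees a frame s ticks after stage 0 did
            row.append(frame if frame in range(num_frames) else None)
        schedule.append(row)
    return schedule
```
&lt;br /&gt;
With two frames, pipeline_schedule(2)[1] is [1, 0, None, None, None]: the master processor acquires frame 1 while PP0 filters frame 0, so at any tick each stage holds different data.&lt;br /&gt;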
&lt;br /&gt;
===Recursive MISD Architecture for Pattern Matching&amp;lt;ref&amp;gt;Arne Halaas, Børge Svingen, Magnar Nedland, Pål Sætrom, Ola Snøve, Jr., and Olaf René Birkeland. 2004. A recursive MISD architecture for pattern matching. IEEE Trans. Very Large Scale Integr. Syst. 12, 7 (July 2004), 727-734. DOI=10.1109/TVLSI.2004.830918 http://dx.doi.org/10.1109/TVLSI.2004.830918&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Online multipattern approximate searching is required in situations where several patterns must be matched concurrently in the data and no persistent index can be created for look-up. This algorithm requires evaluating the hit function H(Si,P), which operates on strings and a pattern, by partitioning P into smaller components and combining the individual results. This calls for a recursive approach, and the Si values need to be evaluated in parallel. An architecture with two complete binary trees, which take care of the data distribution and the result processing, is therefore used. The architecture is implemented with as many PEs as is practical within the implementation constraints. The data from the stream is fed to the root node of the data distribution tree and simultaneously distributed to several collections of PEs belonging to separate queries, thus implementing the MISD architecture with a single data stream and multiple instruction streams.&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
The MISD architecture is one of the four types of parallel architecture described in Flynn’s Taxonomy.  By definition, the MISD architecture has multiple instruction streams and a single data stream.  The definition also describes this architecture as requiring that all processing elements receive the same data, but most commonly cited implementations of this class of architecture do not adhere to this detail.  As a result, there is debate among some experts on whether or not certain implementations are truly MISD.&lt;br /&gt;
&lt;br /&gt;
MISD architectures are not commonly found in industry, as they tend to lack the scalability and efficiency of resource use of the MIMD and SIMD classes of architectures.  However, MISD architectures are efficient for solving specific classes of problems often found in research.  Some examples of their use in research include image processing and systolic arrays.  While rare, some implementations do exist outside of academic settings.  One such example is fault tolerant systems such as the flight control system on the Boeing 777.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''ALU''': Arithmetic Logic Unit, the part of a processor that performs arithmetic and logical operations&lt;br /&gt;
&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
&lt;br /&gt;
'''Q1: Using the classical definition of the MISD architecture from Flynn’s Taxonomy, what is the minimum number of information streams that must be present for an architecture to be classified as MISD?'''&lt;br /&gt;
&lt;br /&gt;
A. One information stream, the data stream; the number of instruction streams does not matter.&lt;br /&gt;
&lt;br /&gt;
B. Two information streams, one instruction stream, one data stream.&lt;br /&gt;
&lt;br /&gt;
C. Three information streams, two instruction streams, one data stream.&lt;br /&gt;
&lt;br /&gt;
D. The number of information streams is irrelevant as per Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A1: C, Three information streams, two instruction streams, one data stream.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q2: Why is the Pipeline architecture often cited as an example of MISD?'''&lt;br /&gt;
&lt;br /&gt;
A. The Pipeline architecture is exactly like the definition of MISD architecture from Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
B. The data to be processed travels linearly across all processing elements with each processing element executing a different instruction and passing it on to the next processing element.  This line of data is cited to be a single “stream.”&lt;br /&gt;
&lt;br /&gt;
C. The data stream sends the same data to all processing elements for every cycle.&lt;br /&gt;
&lt;br /&gt;
D. Flynn based his description of MISD architecture on the Pipeline architecture in his paper that introduced Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A2: B, The data to be processed travels linearly across all processing elements with each processing element executing a different instruction and passing it on to the next processing element.  This line of data is cited to be a single “stream.”''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q3: Why is the Pipeline architecture NOT considered a true MISD architecture by many experts?'''&lt;br /&gt;
&lt;br /&gt;
A. This is incorrect; there is consensus that the Pipeline architecture is synonymous with the MISD architecture described by Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
B. There is only a single, global information/control stream in the Pipeline architecture.&lt;br /&gt;
&lt;br /&gt;
C. The data passed to each processing element is different as it reflects the result of the processing of all the previous processing elements.  Flynn’s Taxonomy describes the same data going to every processing element.&lt;br /&gt;
&lt;br /&gt;
D. It has been proven that it is impossible to build a MISD architecture parallel computer as described by Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A3: C, The data passed to each processing element is different as it reflects the result of the processing of all the previous processing elements.  Flynn’s Taxonomy describes the same data going to every processing element.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q4: True or False? MISD architecture is the most common parallel computing architecture used for general computing tasks.'''&lt;br /&gt;
&lt;br /&gt;
A. True&lt;br /&gt;
&lt;br /&gt;
B. False&lt;br /&gt;
&lt;br /&gt;
''A4: B, False, it is often not the best suited architecture for general computing tasks and as such not common outside of academic research.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q5: When classifying a particular parallel architecture, it was noted that there was only one physical connection to provide all information streams, is it possible for this architecture to be classified as MISD architecture? (Note that there are three choices, you need to select the correct reason for the Yes/No/Maybe answer.)'''&lt;br /&gt;
&lt;br /&gt;
A. Yes, Flynn’s taxonomy does not require the number of physical connections in the hardware correspond to the number of logical streams.&lt;br /&gt;
&lt;br /&gt;
B. Maybe, we do not have enough information about the architecture to rule out the possibility of this parallel computer having a MISD architecture.  We will need to see how the connections are physically laid out on the chip.&lt;br /&gt;
&lt;br /&gt;
C. No, there must be multiple physical information streams.&lt;br /&gt;
&lt;br /&gt;
''A5. A, Yes, Flynn’s taxonomy does not require the number of physical connections in the hardware correspond to the number of logical streams.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q6. True or False? To be a MISD architecture, it requires that every processing element is identical.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A6. B, False, it is not required by definition of Flynn’s Taxonomy that all processing elements are identical.  See example in the wiki reading about the fault tolerant flight control system on the Boeing 777.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q7: Which of these is an application of the MISD architecture?'''&lt;br /&gt;
&lt;br /&gt;
A.	Keyword search&lt;br /&gt;
&lt;br /&gt;
B.	Controlling an assembly line.&lt;br /&gt;
&lt;br /&gt;
C.	Home computing&lt;br /&gt;
&lt;br /&gt;
D.	Image edge detection.&lt;br /&gt;
&lt;br /&gt;
''A7: D, Image edge detection.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q8. True or False? The systolic array is another name for the MISD architecture.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A8. B, False, systolic arrays may be implemented using a MISD architecture but they also may be other architectures such as SIMD and MIMD.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q9.  True or False? MISD is considered the easiest parallel architecture to scale.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A9. B, False, MIMD and SIMD are often used because of advantages in scaling.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q10. Flynn’s Taxonomy relies on the number of logical information streams to classify the types of parallel architectures.  What is the difference between a logical information stream and a physical information stream?'''&lt;br /&gt;
&lt;br /&gt;
A.	A logical information stream is a connection between the ALU and the location of the data that is responsible for passing information about boolean (logical) computations, whereas the physical information stream is the connection between the computer and other parts of the network.&lt;br /&gt;
&lt;br /&gt;
B.	Physical information streams are determined by how the information is physically transported around the chip, while logical information streams are one layer of abstraction above the physical information streams.  Logical streams describe conceptually how the information on the stream can be treated.&lt;br /&gt;
&lt;br /&gt;
C.	Logical and physical information streams transport different kinds of information but are no different in implementation.&lt;br /&gt;
&lt;br /&gt;
D.	There is no difference; the two terms can be used interchangeably.&lt;br /&gt;
&lt;br /&gt;
''A10. B, Physical information streams are determined by how the information is physically transported around the chip, while logical information streams are one layer of abstraction above the physical information streams.  Logical streams describe conceptually how the information on the stream can be treated.''&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=72482</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=72482"/>
		<updated>2013-02-11T04:33:43Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Programmable Systolic Array */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
===MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of concurrent information streams that flow into the processor. According to Flynn's Taxonomy, there are two kinds of logical streams of information that flow into a parallel computer: the Instruction (or Control) Stream and the Data Stream. Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent. Based on how many of each type of stream flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction, Single Data (SISD); Single Instruction, Multiple Data (SIMD); Multiple Instruction, Single Data (MISD); and Multiple Instruction, Multiple Data (MIMD). &lt;br /&gt;
&lt;br /&gt;
The following table (Figure 1) shows the categories of different parallel architectures according to Flynn:&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
===Overview of the MISD Architecture===&lt;br /&gt;
&lt;br /&gt;
In order to be categorized as MISD architecture, there must be at least two Instruction streams and there can only be one Data stream, as shown in Figure 2.[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]] According to Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously. &lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, none of which are commercial, general purpose parallel computers, since MIMD and SIMD are often better suited to common data parallel techniques.  Under most circumstances MIMD and SIMD architectures tend to provide better scaling and more efficient use of computational resources. Instead of commercial use, MISD architectures are mainly used to create custom hardware in order to solve specific problems. The majority of the implementations have been created in academia.&lt;br /&gt;
 &lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy. One commonly cited example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction. It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline. Similarly, other architectures or applications exist which may be called MISD by some while others disagree. The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions. At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction. &lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Types of Systolic Arrays&amp;lt;ref&amp;gt;Johnson, K.T.; Hurson, A.R.; Shirazi, B.; , &amp;quot;General-purpose systolic arrays,&amp;quot; Computer , vol.26, no.11, pp.20-31, Nov. 1993 doi: 10.1109/2.241423&amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is in matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
[[Image:systolic_1.png|thumb|centre|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
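&lt;br /&gt;
The behaviour of the single systolic element in Figure 3 can be modelled as a multiply-accumulate cell. The Python below is a simplified software model, not a hardware description; the names SystolicCell and scalar_product are assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
class SystolicCell:
    """One multiply-accumulate processing element (simplified model)."""

    def __init__(self):
        self.acc = 0  # accumulator holding the running sum of products

    def step(self, a, b):
        """Consume one pair of operands on a clock step."""
        self.acc += a * b
        # The operands leave the cell unmodified for the neighbouring cell.
        return a, b


def scalar_product(a_values, b_values):
    """Shift the a's and b's through one cell, then read out the accumulator."""
    cell = SystolicCell()
    for a, b in zip(a_values, b_values):
        cell.step(a, b)
    return cell.acc
```
&lt;br /&gt;
For example, scalar_product([1, 2, 3], [4, 5, 6]) accumulates 4 + 10 + 18 and returns 32.&lt;br /&gt;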
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  This is referred to as Systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Programmable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable either at a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the latter approach, the workstation downloads a program to each MIMD (Figure 4) systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: It contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
[[Image:systolic_2.png|thumb|centre|250px|Figure 4: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. The architecture has multiple instruction streams for the PEs and a single data stream passing through all the PEs. Thus, it can also be defined as Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2.]&lt;br /&gt;
&lt;br /&gt;
[[Image:systolic_4.png|thumb|centre|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Reconfigurable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The homogeneous characteristic of the Reconfigurable Systolic Array (RSA) architecture, where each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time sharing computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, with the former acting as arithmetic processors and the latter as flexible routers linking the neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on a separate data bus, which enables the circuit to continue its operation while the reconfiguration is in process.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7.Comparison between Architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
From the configurations of systolic arrays described above, it can be seen that they generally have multiple processing elements, each executing different instructions from its own dedicated instruction stream. There is a single data stream that connects the adjacent PEs. Thus, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors argue, however, that because the data read as input by one processing element is the processed output of the adjacent PE, the data stream cannot be considered single: the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the data stream for systolic arrays and for the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture and not a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc., while software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 8 MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance for computations can be implemented by having multiple processors (likely with different architectures) execute the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority voting method is used to determine the faulty processor. In this way the MISD architecture provides fault tolerance for critical computations.&lt;br /&gt;
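&lt;br /&gt;
The M-out-of-N voting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name majority_vote, its parameters, and the representation of processor outputs as plain comparable values are hypothetical and are not taken from any real fault tolerant system.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter


def majority_vote(outputs, m):
    """Return (agreed_value, faulty_indices) for a list of processor outputs.

    Hypothetical helper: a value is accepted only if at least m of the
    N processors produced it; every processor that disagrees is flagged.
    """
    value, count = Counter(outputs).most_common(1)[0]
    # Accept only when the most common value has between m and N votes.
    if count not in range(m, len(outputs) + 1):
        raise ValueError("no value was produced by at least m processors")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


# Three processors run different programs on the same data; one disagrees.
agreed, faulty = majority_vote([42, 42, 41], m=2)
print(agreed, faulty)
```
&lt;br /&gt;
With three processors and m = 2, a single disagreeing processor is identified by its index while the agreed value is still delivered.&lt;br /&gt;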
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. Major examples include flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system replaces manual flight controls with an electronic control interface. The movements of the flight controls in the cockpit are converted to electronic signals and transmitted to the actuators by wires. The control computers use feedback from the sensors to compute and control the movement of the actuators so as to provide the expected response. These computers also stabilize the aircraft and perform other tasks without the pilot's knowledge. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
The flight control system contains redundant flight control computers. If one of them crashes, is damaged or is affected by electromagnetic pulses, the others can overrule the faulty one, so the flight of the aircraft is not harmed. [[Image:fig13.png|thumb|right|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;Yeh, Y.C.;, &amp;quot;Triple-triple redundant 777 primary flight computer,&amp;quot; Aerospace Applications Conference, 1996. Proceedings., 1996 IEEE , vol.1, no., pp.293-307 vol.1, 3-10 Feb 1996&lt;br /&gt;
doi: 10.1109/AERO.1996.495891&amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern aircraft, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each with three lanes that use different processors. The flight control program is compiled for each of the processors, which read their input data from the same data bus but drive their outputs on individual control buses. Each processor therefore executes different instructions on the same data, which makes the system a well-suited example of a Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. Because the outputs of the lanes can differ, the median value of the outputs is used to select the output that is acted upon. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism that acts before fault detection and identification by the cross-lane monitoring system. Thus, the MISD based multi-computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
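&lt;br /&gt;
The median select described above is performed in hardware on the aircraft; the following Python sketch only illustrates the idea. The function name median_select and the use of plain floating point numbers for lane outputs are assumptions made for this example.&lt;br /&gt;
&lt;br /&gt;
```python
import statistics


def median_select(lane_outputs):
    """Return the median lane output and the index of the lane that produced it.

    Hypothetical illustration: with an odd number of lanes, median_low
    returns the middle value, so one wildly wrong lane is never chosen.
    """
    selected = statistics.median_low(lane_outputs)
    return selected, lane_outputs.index(selected)


# Two lanes roughly agree; the third lane has failed and outputs garbage.
value, lane = median_select([10.2, 10.1, 55.0])
print(value, lane)
```
&lt;br /&gt;
Because the erroneous value sits at one end of the sorted outputs, it is blocked without first having to be identified as faulty, which matches the fault blocking role described in the text.&lt;br /&gt;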
&lt;br /&gt;
The system described above clearly has separate instruction streams: the architecture of each processor is different, so the instruction sets and the instruction streams differ as well. The processors have frame-synchronized input data, which means that they work on the same set of data, fed from a single data stream. Thus the flight control system can be classified as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System&amp;lt;ref&amp;gt;M.Gajer;,“ Parallel Image Processing on The TMS320C80 Multiprocessor System”, http://focus.ti.com/pdfs/univ/11-CommunicationAudioSpeech.pdf&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor, the pre-processing time needed to be shortened. To reduce it, different types of image processing architectures for the ‘C80 were presented: for example, an MIMD architecture for image histogram equalization, an SIMD architecture for median filtering, and an MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows: each processor performs a different operation on the same data set. It is often stated that the MISD architecture was defined only for completeness of the classification, because it is difficult in practice to find a computer based on it&amp;lt;ref&amp;gt;TMS320C80 System Level Synopsis, Texas Instruments, 1995&amp;lt;/ref&amp;gt;. In image processing, however, this need not hold: performing many different operations on the same image is common, and the MISD architecture fits such cases well. Edge detection serves as an example of an image processing operation implemented on the MISD architecture. Edges are detected in four different directions (north, south, west and east) by means of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
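&lt;br /&gt;
The directional edge detection described above can be sketched in Python as follows. The source does not list the four filter masks, so the Prewitt-style compass masks below, their north/south/west/east labels, and the names MASKS, apply_mask and directional_edges are all assumptions made for illustration. The loop over masks runs sequentially here, whereas on the TMS320C80 each mask would be handled by its own parallel processor.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical 3x3 compass masks (Prewitt-style); not taken from the source.
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def apply_mask(image, mask):
    """Correlate a 3x3 mask with the interior pixels of a 2-D list image."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(
                mask[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3)
                for j in range(3)
            )
    return out


def directional_edges(image):
    """Apply every mask to the same image: one data set, four operations."""
    return {name: apply_mask(image, mask) for name, mask in MASKS.items()}


# A 4x4 image with a horizontal step between the second and third rows.
image = [[0, 0, 0, 0], [0, 0, 0, 0], [10, 10, 10, 10], [10, 10, 10, 10]]
edges = directional_edges(image)
print(edges["south"][1][1], edges["west"][1][1])
```
&lt;br /&gt;
The horizontal step produces a strong response for the vertical-gradient masks and none for the horizontal-gradient masks, so each of the four output images highlights edges in one direction only.&lt;br /&gt;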
&lt;br /&gt;
Image processing on a computer is very often a multistage process. One example is an algorithm that detects the contours of the objects in an image, which is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and eliminates noise. In the next step, thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering in order to eliminate single black pixels on the white background.  To execute these operations in parallel, a five-stage pipeline was used. The first stage of this pipeline is the master processor, which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors, as listed in Table 1.&lt;br /&gt;
&amp;lt;table border=1&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;Stage&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Processor&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Operation&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;MP&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Image acquisition&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP0&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Low-pass Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Thresholding&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;4&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Laplace Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;5&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Logic Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;	&lt;br /&gt;
&amp;lt;/table&amp;gt;&lt;br /&gt;
Table 1: Pipeline structure for image edge detection[8]&lt;br /&gt;
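&lt;br /&gt;
How the five stages of Table 1 overlap on successive frames can be sketched in Python. The sketch assumes, purely for illustration, that every stage takes exactly one time step per frame; the names STAGES and pipeline_schedule are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# Stage order taken from Table 1; one time step per stage is an assumption.
STAGES = [
    "MP: image acquisition",
    "PP0: low-pass filtering",
    "PP1: thresholding",
    "PP2: Laplace filtering",
    "PP3: logic filtering",
]


def pipeline_schedule(n_frames):
    """For each time step, map every busy stage to the frame it works on."""
    schedule = []
    for tick in range(n_frames + len(STAGES) - 1):
        busy = {}
        for position, stage in enumerate(STAGES):
            frame = tick - position  # stage k sees frame f at tick f + k
            if frame in range(n_frames):
                busy[stage] = frame
        schedule.append(busy)
    return schedule


for tick, busy in enumerate(pipeline_schedule(3)):
    print(tick, busy)
```
&lt;br /&gt;
Under this assumption, three frames finish after seven time steps instead of the fifteen that strictly sequential processing would need, because several processors work on different frames at the same time.&lt;br /&gt;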
&lt;br /&gt;
===Pattern Matching&amp;lt;ref&amp;gt;Arne Halaas, Børge Svingen, Magnar Nedland, Pål Sætrom, Ola Snøve, Jr., and Olaf René Birkeland. 2004. A recursive MISD architecture for pattern matching. IEEE Trans. Very Large Scale Integr. Syst. 12, 7 (July 2004), 727-734. DOI=10.1109/TVLSI.2004.830918 http://dx.doi.org/10.1109/TVLSI.2004.830918&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Situations in which several patterns must be matched concurrently in data, and in which no persistent index can be created for look-up, require online multipattern approximate searching. The algorithm requires evaluating the hit function H(Si,P), which operates on strings and a pattern, by partitioning P into smaller components and combining the individual results. This calls for a recursive approach, and the Si values need to be evaluated in parallel. The architecture therefore uses two complete binary trees, which take care of data distribution and result processing. It is implemented with as many PEs as is practical within the implementation constraints. Data from the stream is fed to the root node of the data distribution tree and simultaneously distributed to several collections of PEs belonging to separate queries, thus implementing the MISD architecture with a single data stream and multiple instruction streams.&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
The MISD architecture is one of the four types of parallel architecture described in Flynn’s Taxonomy.  By definition, the MISD architecture has multiple instruction streams and a single data stream.  The definition also requires that all processing elements receive the same data, but most commonly cited implementations of this class of architecture do not adhere to this detail.  As a result, experts debate whether certain implementations are truly MISD.&lt;br /&gt;
&lt;br /&gt;
MISD architectures are not commonly found in industry as they tend to lack the scalability and efficiency of resource use of the MIMD and SIMD class of architectures.  However, MISD architectures are efficient for solving specific classes of problems often found in research.  Some examples of its use in research include image processing and systolic arrays.  While rare, some implementations do exist outside of academic settings.  One such example is fault tolerant systems such as the flight control system on the Boeing 777.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''ALU''': Arithmetic Logic Unit, the unit in a processor that performs arithmetic and logic operations&lt;br /&gt;
&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
&lt;br /&gt;
'''Q1: Using the classical definition of the MISD architecture from Flynn’s Taxonomy, what is the minimum number of information streams that must be present to be classified as MISD architecture?'''&lt;br /&gt;
&lt;br /&gt;
A. One information stream, the data stream; the number of instruction streams does not matter.&lt;br /&gt;
&lt;br /&gt;
B. Two information streams, one instruction stream, one data stream.&lt;br /&gt;
&lt;br /&gt;
C. Three information streams, two instruction streams, one data stream.&lt;br /&gt;
&lt;br /&gt;
D. The number of information streams is irrelevant as per Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A1: C, Three information streams, two instruction streams, one data stream.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q2: Why is the Pipeline architecture often cited as an example of MISD?'''&lt;br /&gt;
&lt;br /&gt;
A. The Pipeline architecture is exactly like the definition of MISD architecture from Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
B. The data to be processed travels linearly across all processing elements with each processing element executing a different instruction and passing it on to the next processing element.  This line of data is cited to be a single “stream.”&lt;br /&gt;
&lt;br /&gt;
C. The data stream sends the same data to all processing elements for every cycle.&lt;br /&gt;
&lt;br /&gt;
D. Flynn based his description of MISD architecture on the Pipeline architecture in his paper that introduced Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A2: B, The data to be processed travels linearly across all processing elements with each processing element executing a different instruction and passing it on to the next processing element.  This line of data is cited to be a single “stream.”''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q3: Why is the Pipeline architecture NOT considered a true MISD architecture by many experts?'''&lt;br /&gt;
&lt;br /&gt;
A. This is incorrect; there is consensus that the Pipeline architecture is synonymous with the MISD architecture described by Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
B. There is only a single, global information/control stream in the Pipeline architecture.&lt;br /&gt;
&lt;br /&gt;
C. The data passed to each processing element is different as it reflects the result of the processing of all the previous processing elements.  Flynn’s Taxonomy describes the same data going to every processing element.&lt;br /&gt;
&lt;br /&gt;
D. It has been proven that it is impossible to build a MISD architecture parallel computer as described by Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A3: C, The data passed to each processing element is different as it reflects the result of the processing of all the previous processing elements.  Flynn’s Taxonomy describes the same data going to every processing element.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q4: True or False? MISD architecture is the most common parallel computing architecture used for general computing tasks.'''&lt;br /&gt;
&lt;br /&gt;
A. True&lt;br /&gt;
&lt;br /&gt;
B. False&lt;br /&gt;
&lt;br /&gt;
''A4: B, False, it is often not the best suited architecture for general computing tasks and as such not common outside of academic research.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q5: When classifying a particular parallel architecture, it was noted that there was only one physical connection to provide all information streams. Is it possible for this architecture to be classified as MISD architecture? (Note that there are three choices; you need to select the correct reason for the Yes/No/Maybe answer.)'''&lt;br /&gt;
&lt;br /&gt;
A. Yes, Flynn’s taxonomy does not require the number of physical connections in the hardware correspond to the number of logical streams.&lt;br /&gt;
&lt;br /&gt;
B. Maybe, we do not have enough information about the architecture to rule out the possibility of this parallel computer having a MISD architecture.  We will need to see how the connections are physically laid out on the chip.&lt;br /&gt;
&lt;br /&gt;
C. No, there must be multiple physical information streams.&lt;br /&gt;
&lt;br /&gt;
''A5. A, Yes, Flynn’s taxonomy does not require the number of physical connections in the hardware correspond to the number of logical streams.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q6. True or False? To be a MISD architecture, it requires that every processing element is identical.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A6. B, False, it is not required by definition of Flynn’s Taxonomy that all processing elements are identical.  See example in the wiki reading about the fault tolerant flight control system on the Boeing 777.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q7: Which of these is an application of the MISD architecture?'''&lt;br /&gt;
&lt;br /&gt;
A.	Keyword search&lt;br /&gt;
&lt;br /&gt;
B.	Controlling an assembly line.&lt;br /&gt;
&lt;br /&gt;
C.	Home computing&lt;br /&gt;
&lt;br /&gt;
D.	Image edge detection.&lt;br /&gt;
&lt;br /&gt;
''A7: D, Image edge detection.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q8. True or False? The systolic array is another name for the MISD architecture.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A8. B, False, systolic arrays may be implemented using a MISD architecture but they also may be other architectures such as SIMD and MIMD.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q9.  True or False? MISD is considered the easiest parallel architecture to scale.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A9. B, False, MIMD and SIMD are often used because of advantages in scaling.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q10. Flynn’s Taxonomy relies on the number of logical information streams to classify the types of parallel architectures.  What is the difference between a logical information stream and a physical information stream?'''&lt;br /&gt;
&lt;br /&gt;
A.	A logical information stream is a connection between the ALU and the location of the data that is responsible for passing information about boolean (logical) computations, whereas the physical information stream is the connection between the computer and other parts of the network.&lt;br /&gt;
&lt;br /&gt;
B.	Physical information streams are determined by how the information is physically transported around the chip, while logical information streams are one layer of abstraction above the physical information streams.  Logical streams describe conceptually how the information on the stream can be treated.&lt;br /&gt;
&lt;br /&gt;
C.	Logical and physical information streams transport different kinds of information but are no different in implementation.&lt;br /&gt;
&lt;br /&gt;
D.	There is no difference; the two terms can be used interchangeably.&lt;br /&gt;
&lt;br /&gt;
''A10. B, Physical information streams are determined by how the information is physically transported around the chip, while logical information streams are one layer of abstraction above the physical information streams.  Logical streams describe conceptually how the information on the stream can be treated.''&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=72481</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=72481"/>
		<updated>2013-02-11T04:32:11Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Special-purpose systolic array */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
===MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of concurrent information streams that flow into the processor. According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the Instruction (or Control) Stream and the Data Stream. Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent. Based on how many streams of each type flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction, Single Data (SISD); Single Instruction, Multiple Data (SIMD); Multiple Instruction, Single Data (MISD); and Multiple Instruction, Multiple Data (MIMD). &lt;br /&gt;
&lt;br /&gt;
The following table (Figure 1) shows the categories of different parallel architectures according to Flynn:&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
===Overview of the MISD Architecture===&lt;br /&gt;
&lt;br /&gt;
To be categorized as an MISD architecture, a system must have at least two instruction streams and exactly one data stream, as shown in Figure 2.[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]] According to Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously. &lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, and none of them are commercial, general purpose parallel computers, since MIMD and SIMD architectures are often better suited to common data parallel techniques.  Under most circumstances MIMD and SIMD architectures tend to provide better scaling and more efficient use of computational resources. Rather than being used commercially, MISD architectures are mainly used to create custom hardware that solves specific problems. The majority of the implementations have been created in academia.&lt;br /&gt;
 &lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy. One commonly cited example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction. It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline. Similarly, other architectures or applications exist which may be called MISD by some while others disagree. The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions. At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction. &lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Types of Systolic Arrays&amp;lt;ref&amp;gt;Johnson, K.T.; Hurson, A.R.; Shirazi, B.; , &amp;quot;General-purpose systolic arrays,&amp;quot; Computer , vol.26, no.11, pp.20-31, Nov. 1993 doi: 10.1109/2.241423&amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is in matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
[[Image:systolic_1.png|thumb|centre|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
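&lt;br /&gt;
The behaviour of the single systolic element in Figure 3 can be sketched in Python. The class name SystolicCell and the function scalar_product are hypothetical, and a software loop stands in for the synchronous clocking of real hardware.&lt;br /&gt;
&lt;br /&gt;
```python
class SystolicCell:
    """One multiply-accumulate cell: adds a*b to its accumulator and
    forwards a and b unchanged (hypothetical software model)."""

    def __init__(self):
        self.acc = 0

    def step(self, a, b):
        self.acc += a * b
        return a, b  # passed on unmodified to the neighbouring cells


def scalar_product(a_values, b_values):
    """Shift paired a and b values through one cell, then read the sum."""
    cell = SystolicCell()
    for a, b in zip(a_values, b_values):
        cell.step(a, b)
    return cell.acc  # the sum of products shifted out of the accumulator


print(scalar_product([1, 2, 3], [4, 5, 6]))
```
&lt;br /&gt;
In a full array, the values returned by step would feed the neighbouring cells on the next clock, so that each cell accumulates a different sum of products.&lt;br /&gt;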
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  This is referred to as Systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Programmable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
For the latter approach, the workstation downloads a program to each MIMD (Figure 4) systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: It contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. The architecture has multiple instruction streams for the PEs and a single data stream passing through all the PEs. Thus, it can also be defined as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2.]&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Reconfigurable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The homogeneous structure of the Reconfigurable Systolic Array (RSA) architecture, in which each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, the former as arithmetic processors and the latter as flexible routers linking neighboring PE cells. The RSA shifts reconfiguration signals and input signals into the PEs and SWs on separate data buses, which enables the circuit to continue operating while reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7.Comparison between Architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
As the configurations described above show, systolic arrays generally have multiple processing elements, each executing different instructions from its own dedicated instruction stream. A single data stream connects the adjacent PEs. Thus, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors argue, however, that because the data read as input by one processing element is the processed output of the adjacent PE, the data stream cannot be considered single: the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the data streams of systolic arrays and of the MISD architecture. Thus, the systolic array should be considered a “Multiple Data” architecture and not a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc.; software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 8 MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance in computations can be implemented by multiple processors (likely with different architectures) executing the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority voting method is used to determine the faulty processor. Thus the MISD architecture is utilized to obtain fault tolerance in critical computations.&lt;br /&gt;
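&lt;br /&gt;
The M-out-of-N voting step described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name majority_vote, the parameter m, and the sample outputs are assumptions made for this example and do not come from any particular fault tolerant system.&lt;br /&gt;

```python
from collections import Counter


def majority_vote(outputs, m):
    """Return (agreed_value, faulty_indices) for N redundant processor outputs.

    A value is accepted only if at least m of the N outputs agree on it;
    otherwise (None, []) is returned because no majority exists.
    """
    counts = Counter(outputs)
    value, votes = counts.most_common(1)[0]
    if m > votes:
        return None, []
    # Every processor that disagrees with the agreed value is flagged as faulty.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


# Three processors computed on the same data; processor 2 disagrees.
value, faulty = majority_vote([42, 42, 41], m=2)
print(value, faulty)
```

With three processors and m=2 this is the familiar two-out-of-three vote; a larger N tolerates more simultaneous faults at the cost of more hardware.&lt;br /&gt;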
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. The major examples are flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system replaces the manual flight controls with an electronic control interface. The movements of the flight controls in the cockpit are converted to electronic signals and transmitted to the actuators by wires. The control computers use the feedback from the sensors to compute and control the movement of the actuators to provide the expected response. These computers also stabilize the aircraft and perform other tasks without the knowledge of the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
There are redundant flight control computers present in the flight control system. If one of the flight control computers crashes, gets damaged or is affected by electromagnetic pulses, another computer can overrule the faulty one, and hence the flight of the aircraft is unharmed. [[Image:fig13.png|thumb|right|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;Yeh, Y.C.;, &amp;quot;Triple-triple redundant 777 primary flight computer,&amp;quot; Aerospace Applications Conference, 1996. Proceedings., 1996 IEEE , vol.1, no., pp.293-307 vol.1, 3-10 Feb 1996&lt;br /&gt;
doi: 10.1109/AERO.1996.495891&amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern aircraft, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each of them having three lanes with different processors. The flight control program is compiled for each of the processors which get the input data from the same data bus but drive the output on their individual control bus. Thus each processor executes different instructions but they process the same data. Thus, it is the best suited example of Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data-synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. As the outputs of the lanes can differ, the median value of the outputs is used to select the lane output to be used. The lane whose median-value-select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault-blocking mechanism ahead of fault detection and identification by the cross-lane monitoring system. Thus, the MISD-based multi-computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
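&lt;br /&gt;
The idea of a median select across dissimilar lanes can be illustrated with a short Python sketch. It is a simplification for illustration only, not the actual 777 logic: the function names, the tolerance value, and the sample lane outputs are all hypothetical.&lt;br /&gt;

```python
import statistics


def median_select(lane_outputs):
    """Pick the median lane output and report which lane produced it."""
    selected = statistics.median_low(lane_outputs)
    return selected, lane_outputs.index(selected)


def lanes_out_of_tolerance(lane_outputs, selected, tolerance):
    """List the lanes whose output deviates from the selected value by more than tolerance."""
    return [i for i, out in enumerate(lane_outputs)
            if abs(out - selected) > tolerance]


# Hypothetical surface commands computed by three dissimilar lanes
# from the same frame of sensor data; lane 2 has failed.
outputs = [10.02, 10.00, 57.3]
selected, lane = median_select(outputs)
suspects = lanes_out_of_tolerance(outputs, selected, tolerance=0.5)
print(selected, lane, suspects)
```

Because the median of three values is never the outlier, a single faulty lane is blocked from the output even before a separate monitoring step identifies it.&lt;br /&gt;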
&lt;br /&gt;
The system described above clearly has individual instruction streams: the architecture of each processor is different, and so are the instruction sets and the instruction streams. These processors have frame-synchronized input data, which means they work on the same set of data, fed from a single data stream. Thus the flight control system can be classified under MISD architecture.&lt;br /&gt;
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System&amp;lt;ref&amp;gt;M.Gajer;,“ Parallel Image Processing on The TMS320C80 Multiprocessor System”, http://focus.ti.com/pdfs/univ/11-CommunicationAudioSpeech.pdf&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor, the pre-processing time needed to be shortened. To achieve this, different types of image processing architectures for the ‘C80 were presented: for example, MIMD architecture for image histogram equalization, SIMD architecture for median filtering, and MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows. Each processor performs a different operation on the same data set. It is often said that the MISD architecture was defined only for the sake of completeness of the classification, because it is difficult in practice to find a computer based on it&amp;lt;ref&amp;gt;TMS320C80 System Level Synopsis, Texas Instruments, 1995&amp;lt;/ref&amp;gt;. In image processing, however, this assumption need not hold: performing many different operations on the same image is common, and hence the MISD architecture is a good fit for such cases. As an example of implementing image processing operations on the MISD architecture, consider edge detection. The edges should be detected in four different directions (north, south, west and east) using four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
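&lt;br /&gt;
The directional edge detection above can be sketched in Python as four different masks applied to one shared image. The mask coefficients below are simple Prewitt-style masks chosen as an assumption for this example; the source does not give the masks used on the TMS320C80, and the tiny test image is made up.&lt;br /&gt;

```python
# Hypothetical 3x3 directional masks (Prewitt-style), one per processor.
NORTH = [[1, 1, 1], [0, 0, 0], [-1, -1, -1]]
SOUTH = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]
WEST = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
EAST = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]


def apply_mask(image, mask):
    """Apply a 3x3 mask to every interior pixel; border pixels are left at 0."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(
                mask[i][j] * image[r - 1 + i][c - 1 + j]
                for i in range(3) for j in range(3)
            )
    return out


# One shared input image: dark top half, bright bottom half.
image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [9, 9, 9, 9],
    [9, 9, 9, 9],
]

# Each "processor" runs a different operation (mask) on the same data.
masks = {"north": NORTH, "south": SOUTH, "west": WEST, "east": EAST}
edges = {name: apply_mask(image, mask) for name, mask in masks.items()}
print(edges["south"][1][1], edges["west"][1][1])
```

The four calls are independent of one another, which is what allows each of them to be assigned to a separate parallel processor.&lt;br /&gt;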
&lt;br /&gt;
Image processing on a computer is very often a multistage process. Consider, for example, an algorithm that detects the contours of the objects in an image. It is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and eliminates noise. In the next step, thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels placed on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of this pipeline is the master processor, which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors, as listed in Table 1.&lt;br /&gt;
&amp;lt;table border=1&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;Stage&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Processor&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Operation&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;MP&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Image acquisition&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP0&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Low-pass Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Thresholding&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;4&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Laplace Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;5&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Logic Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;	&lt;br /&gt;
&amp;lt;/table&amp;gt;&lt;br /&gt;
Table 1: Pipeline structure for image edge detection[8]&lt;br /&gt;
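&lt;br /&gt;
To make the pipelined operation of Table 1 concrete, the following Python sketch prints which image frame each stage would be working on at each time step. It models only the scheduling, assuming every stage takes one time step per frame; the function name schedule and that timing assumption are not from the source.&lt;br /&gt;

```python
# Stage names follow Table 1; the one-step-per-stage timing is an assumption.
STAGES = [
    "MP: image acquisition",
    "PP0: low-pass filtering",
    "PP1: thresholding",
    "PP2: Laplace filtering",
    "PP3: logic filtering",
]


def schedule(num_frames):
    """For each time step, list the frame index each stage processes (None = idle)."""
    steps = []
    for t in range(num_frames + len(STAGES) - 1):
        row = []
        for s in range(len(STAGES)):
            frame = t - s  # frame t entered stage 0 at step t, so stage s sees frame t - s
            row.append(frame if frame in range(num_frames) else None)
        steps.append(row)
    return steps


for t, row in enumerate(schedule(3)):
    print(t, row)
```

Once the pipeline is full, all five processors are busy at the same time, each on a different frame. This is also why the pipeline is debated as an MISD example: the stages do not operate on identical data.&lt;br /&gt;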
&lt;br /&gt;
===Pattern Matching on a Recursive MISD Architecture&amp;lt;ref&amp;gt;Arne Halaas, Børge Svingen, Magnar Nedland, Pål Sætrom, Ola Snøve, Jr., and Olaf René Birkeland. 2004. A recursive MISD architecture for pattern matching. IEEE Trans. Very Large Scale Integr. Syst. 12, 7 (July 2004), 727-734. DOI=10.1109/TVLSI.2004.830918 http://dx.doi.org/10.1109/TVLSI.2004.830918&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Situations in which several patterns must be matched concurrently in data, and no persistent index can be created for look-up, require online multipattern approximate searching. This algorithm requires evaluation of the hit function H(Si,P), which operates on strings and a pattern, and requires partitioning P into smaller components and combining the individual results. A recursive approach is required, and the Si values need to be evaluated in parallel. This motivates an architecture with two complete binary trees, which take care of data distribution and result processing. The architecture is implemented with as many PEs as are practically possible within the implementation constraints. The data from the stream is fed to the root node of the data distribution tree and simultaneously distributed to several collections of PEs belonging to separate queries, thus implementing the MISD architecture with a single data stream and multiple instruction streams.&lt;br /&gt;
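&lt;br /&gt;
The broadcast of one data stream to several query-specific PEs can be sketched in Python. This is a deliberately simplified model and not the design of the cited paper: the hit test here is a plain Hamming-distance threshold, and the names make_pe and search, as well as the sample queries, are hypothetical.&lt;br /&gt;

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def make_pe(pattern, max_mismatches):
    """A hypothetical PE configured for one query: hit if the window is close to its pattern."""
    def pe(window):
        return (len(window) == len(pattern)
                and max_mismatches >= hamming(window, pattern))
    return pe


def search(stream, queries):
    """Feed every position of the single stream to all PEs; collect hit positions per query."""
    pes = {pattern: make_pe(pattern, k) for pattern, k in queries}
    hits = {pattern: [] for pattern, _ in queries}
    for pos in range(len(stream)):
        for pattern, pe in pes.items():
            if pe(stream[pos:pos + len(pattern)]):
                hits[pattern].append(pos)
    return hits


# Two queries with different tolerances evaluated over the same stream.
hits = search("abracadabra", [("abra", 0), ("cad", 1)])
print(hits)
```

The inner loop over PEs stands in for the data distribution tree, and the hits dictionary stands in for the result processing tree; in hardware all PEs would see each position at the same time.&lt;br /&gt;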
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
The MISD architecture is one of the four types of parallel architecture described in Flynn’s Taxonomy.  By definition, the MISD architecture has multiple instruction streams and a single data stream.  The definition also describes this architecture as requiring that all processing elements receive the same data, but most commonly cited implementations of this class of architecture do not adhere to this detail.  The result is there is debate among some experts on whether or not certain implementations are truly MISD.&lt;br /&gt;
&lt;br /&gt;
MISD architectures are not commonly found in industry as they tend to lack the scalability and efficiency of resource use of the MIMD and SIMD class of architectures.  However, MISD architectures are efficient for solving specific classes of problems often found in research.  Some examples of its use in research include image processing and systolic arrays.  While rare, some implementations do exist outside of academic settings.  One such example is fault tolerant systems such as the flight control system on the Boeing 777.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''ALU''': Arithmetic Logic Unit, the part of a processor that performs arithmetic and logic operations&lt;br /&gt;
&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
&lt;br /&gt;
'''Q1: Using the classical definition of the MISD architecture from Flynn’s Taxonomy, what is the minimum number of input streams that must be present to be classified as MISD architecture?'''&lt;br /&gt;
&lt;br /&gt;
A. One information stream, the data stream; the number of instruction streams does not matter.&lt;br /&gt;
&lt;br /&gt;
B. Two information streams, one instruction stream, one data stream.&lt;br /&gt;
&lt;br /&gt;
C. Three information streams, two instruction streams, one data stream.&lt;br /&gt;
&lt;br /&gt;
D. The number of information streams is irrelevant as per Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A1: C, Three information streams, two instruction streams, one data stream.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q2: Why is the Pipeline architecture often cited as an example of MISD?'''&lt;br /&gt;
&lt;br /&gt;
A. The Pipeline architecture is exactly like the definition of MISD architecture from Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
B. The data to be processed travels linearly across all processing elements with each processing element executing a different instruction and passing it on to the next processing element.  This line of data is cited to be a single “stream.”&lt;br /&gt;
&lt;br /&gt;
C. The data stream sends the same data to all processing elements for every cycle.&lt;br /&gt;
&lt;br /&gt;
D. Flynn based his description of MISD architecture on the Pipeline architecture in his paper that introduced Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A2: B, The data to be processed travels linearly across all processing elements with each processing element executing a different instruction and passing it on to the next processing element.  This line of data is cited to be a single “stream.”''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q3: Why is the Pipeline architecture NOT considered a true MISD architecture by many experts?'''&lt;br /&gt;
&lt;br /&gt;
A. This is incorrect; there is consensus that the Pipeline architecture is synonymous with the MISD architecture described by Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
B. There is only a single, global information/control stream in the Pipeline architecture.&lt;br /&gt;
&lt;br /&gt;
C. The data passed to each processing element is different as it reflects the result of the processing of all the previous processing elements.  Flynn’s Taxonomy describes the same data going to every processing element.&lt;br /&gt;
&lt;br /&gt;
D. It has been proven that it is impossible to build a MISD architecture parallel computer as described by Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A3: C, The data passed to each processing element is different as it reflects the result of the processing of all the previous processing elements.  Flynn’s Taxonomy describes the same data going to every processing element.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q4: True or False? MISD architecture is the most common parallel computing architecture used for general computing tasks.'''&lt;br /&gt;
&lt;br /&gt;
A. True&lt;br /&gt;
&lt;br /&gt;
B. False&lt;br /&gt;
&lt;br /&gt;
''A4: B, False, it is often not the best suited architecture for general computing tasks and as such not common outside of academic research.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q5: When classifying a particular parallel architecture, it was noted that there was only one physical connection to provide all information streams, is it possible for this architecture to be classified as MISD architecture? (Note that there are three choices, you need to select the correct reason for the Yes/No/Maybe answer.)'''&lt;br /&gt;
&lt;br /&gt;
A. Yes, Flynn’s taxonomy does not require the number of physical connections in the hardware correspond to the number of logical streams.&lt;br /&gt;
&lt;br /&gt;
B. Maybe, we do not have enough information about the architecture to rule out the possibility of this parallel computer having a MISD architecture.  We will need to see how the connections are physically laid out on the chip.&lt;br /&gt;
&lt;br /&gt;
C. No, there must be multiple physical information streams.&lt;br /&gt;
&lt;br /&gt;
''A5. A, Yes, Flynn’s taxonomy does not require the number of physical connections in the hardware correspond to the number of logical streams.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q6. True or False? To be a MISD architecture, it requires that every processing element is identical.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A6. B, False, it is not required by definition of Flynn’s Taxonomy that all processing elements are identical.  See example in the wiki reading about the fault tolerant flight control system on the Boeing 777.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q7: Which of these is an application of the MISD architecture?'''&lt;br /&gt;
&lt;br /&gt;
A.	Keyword search&lt;br /&gt;
&lt;br /&gt;
B.	Controlling an assembly line.&lt;br /&gt;
&lt;br /&gt;
C.	Home computing&lt;br /&gt;
&lt;br /&gt;
D.	Image edge detection.&lt;br /&gt;
&lt;br /&gt;
''A7: D, Image edge detection.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q8. True or False? The systolic array is another name for the MISD architecture.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A8. B, False, systolic arrays may be implemented using a MISD architecture but they also may be other architectures such as SIMD and MIMD.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q9.  True or False? MISD is considered the easiest parallel architecture to scale.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A9. B, False, MIMD and SIMD are often used because of advantages in scaling.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q10. Flynn’s Taxonomy relies on the number of logical information streams to classify the types of parallel architectures.  What is the difference between a logical information stream and a physical information stream?'''&lt;br /&gt;
&lt;br /&gt;
A.	A logical information stream is a connection between the ALU and the location of the data that is responsible for passing information about boolean (logical) computations, whereas the physical information stream is the connection between the computer and other parts of the network.&lt;br /&gt;
&lt;br /&gt;
B.	Physical information streams are determined by how the information is physically transported around the chip, while logical information streams are one layer of abstraction above the physical information streams.  Logical streams describe conceptually how the information on the stream can be treated.&lt;br /&gt;
&lt;br /&gt;
C.	Logical and physical information streams transport different kinds of information but are no different in implementation.&lt;br /&gt;
&lt;br /&gt;
D.	There is no difference; the two terms can be used interchangeably.&lt;br /&gt;
&lt;br /&gt;
''A10. B, Physical information streams are determined by how the information is physically transported around the chip, while logical information streams are one layer of abstraction above the physical information streams.  Logical streams describe conceptually how the information on the stream can be treated.''&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=72480</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=72480"/>
		<updated>2013-02-11T04:31:33Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Special-purpose systolic array */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
===MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of concurrent information streams that flow into the processor. According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the Instruction (or Control) Stream and the Data Stream. Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent. Based on how many streams of each type flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction, Single Data (SISD); Single Instruction, Multiple Data (SIMD); Multiple Instruction, Single Data (MISD); and Multiple Instruction, Multiple Data (MIMD). &lt;br /&gt;
&lt;br /&gt;
The following table (Figure 1) shows the categories of different parallel architectures according to Flynn:&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
===Overview of the MISD Architecture===&lt;br /&gt;
&lt;br /&gt;
In order to be categorized as MISD architecture, there must be at least two Instruction streams and there can only be one Data stream, as shown in Figure 2.[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]] According to Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously. &lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, none of which are commercial, general-purpose parallel computers, since MIMD and SIMD are often better suited to common data-parallel techniques.  Under most circumstances MIMD and SIMD architectures tend to provide better scaling and more efficient use of computational resources. Instead of commercial use, MISD architectures are mainly used to create custom hardware in order to solve specific problems. The majority of the implementations have been created in academia.&lt;br /&gt;
 &lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy. One commonly cited example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction. It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline. Similarly, other architectures or applications exist which may be called MISD by some while others disagree. The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions. At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction. &lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Type of Systolic Arrays&amp;lt;ref&amp;gt;Johnson, K.T.; Hurson, A.R.; Shirazi, B.; , &amp;quot;General-purpose systolic arrays,&amp;quot; Computer , vol.26, no.11, pp.20-31, Nov. 1993 doi: 10.1109/2.241423&amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
[[Image:systolic_1.png|thumb|left|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 4: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
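&lt;br /&gt;
The multiply-accumulate behaviour of the systolic elements can be simulated in software. The Python sketch below is an illustration under stated assumptions, not a description of any particular chip: it assumes a square array in which the a values move right and the b values move down one cell per step, with the inputs skewed by one step per row and column. The function name systolic_matmul and the register arrays are hypothetical.&lt;br /&gt;

```python
def systolic_matmul(A, B):
    """Simulate an n x n grid of multiply-accumulate cells computing A times B."""
    n = len(A)
    acc = [[0] * n for _ in range(n)]    # one accumulator per cell
    a_reg = [[0] * n for _ in range(n)]  # a value currently held by each cell
    b_reg = [[0] * n for _ in range(n)]  # b value currently held by each cell
    for t in range(3 * n - 2):
        # a values shift one cell to the right; row i is fed with a delay of i steps.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i
            a_reg[i][0] = A[i][k] if k in range(n) else 0
        # b values shift one cell down; column j is fed with a delay of j steps.
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j
            b_reg[0][j] = B[k][j] if k in range(n) else 0
        # Every cell multiplies the pair it currently holds and accumulates.
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The operands pass through each cell unmodified and only the accumulators change, which matches the single-element behaviour described above; after 3n - 2 steps each accumulator holds one entry of the product.&lt;br /&gt;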
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
It is an array of systolic processing elements, which gets adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  This is referred to as Systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Programmable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
&lt;br /&gt;
It is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
For the latter approach, the workstation downloads a program to each MIMD (Figure 5) systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: it contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. The architecture has multiple instruction streams for the PEs and a single data stream passing through all the PEs. Thus, it can also be defined as Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2].&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Reconfigurable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The Reconfigurable Systolic Array (RSA) architecture is homogeneous: each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements. This enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, with the former acting as arithmetic processors and the latter as flexible routers linking the neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on separate data buses, which enables the circuit to continue its operation while reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7: Comparison between architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
From the configurations of systolic arrays described above, it can be seen that they generally have multiple processing elements, each executing different instructions from its own dedicated instruction stream. A single data stream connects the adjacent PEs. Thus, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors argue, however, that because the data read as input by one processing element is the processed output of the adjacent PE, the data stream cannot be considered single: the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the data stream of systolic arrays and that of the MISD architecture. By this argument, the systolic array should be considered a “Multiple Data” architecture and not a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc.; software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 8: MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance on computations can be implemented by multiple processors (likely with different architectures) executing the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority voting method is used to determine the faulty processor. In this way the MISD architecture is used to achieve fault tolerance on critical computations.&lt;br /&gt;
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. Major examples include flight control systems, nuclear power plants, satellite systems, and super collider experiment systems. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
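&lt;br /&gt;
The voting step described above can be illustrated with a minimal sketch. The function name, its signature, and the use of exact equality to compare outputs are assumptions made for illustration; a real system would likely compare within a tolerance and vote in hardware.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter


def majority_vote(outputs, m):
    """Hypothetical M-out-of-N voter.

    outputs: one result per redundant processor (N values).
    m: minimum number of processors that must agree.
    Returns the agreed value and the indices of the disagreeing processors.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if m > count:
        raise ValueError("no value reached the required agreement")
    # Any processor whose output differs from the majority is flagged as faulty.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty
```

For example, with three processors and a 2-out-of-3 vote, the call majority_vote([42, 42, 17], 2) accepts 42 and flags the processor at index 2.&lt;br /&gt;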
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system replaces manual flight control with an electronic control interface. The movements of the flight controls in the cockpit are converted to electronic signals and transmitted to the actuators by wires. The control computers use feedback from the sensors to compute and control the movement of the actuators to provide the expected response. These computers also stabilize the aircraft and perform other tasks without the knowledge of the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
The flight control system contains redundant flight control computers. If one of the flight control computers crashes, gets damaged or is affected by electromagnetic pulses, another computer can overrule the faulty one, and hence the flight of the aircraft is unharmed. [[Image:fig13.png|thumb|right|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;Yeh, Y.C., &amp;quot;Triple-triple redundant 777 primary flight computer,&amp;quot; Proceedings of the 1996 IEEE Aerospace Applications Conference, vol. 1, pp. 293-307, 3-10 Feb. 1996. doi: 10.1109/AERO.1996.495891&amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern aircraft, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each of which has three lanes with different processors. The flight control program is compiled for each of the processors, which read their input data from the same data bus but drive their output on individual control buses. Each processor therefore executes different instructions on the same data, which makes the system a well-suited example of Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. Because the outputs of the lanes can differ, a median value select is used to choose the output that is acted upon. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism ahead of fault detection and identification by the cross-lane monitoring system. Thus, the MISD based multi computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
&lt;br /&gt;
The system described above clearly has individual instruction streams: the architecture of each processor is different, so the instruction sets and therefore the instruction streams differ. The processors have frame synchronized input data, which means they work on the same set of data fed from a single data stream. Thus the flight control system can be classified as an MISD architecture.&lt;br /&gt;
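&lt;br /&gt;
The median value select can be sketched in a few lines. This is only an illustration under assumed names and values: the function median_select and the sample lane outputs are hypothetical, and the real PFC performs this selection in dedicated hardware.&lt;br /&gt;
&lt;br /&gt;
```python
import statistics


def median_select(lane_outputs):
    """Return the median lane output and the index of the lane that produced it.

    With an odd number of lanes, median_low returns the middle value, so a
    single wildly wrong lane can never be the one that is selected.
    """
    chosen = statistics.median_low(lane_outputs)
    return chosen, lane_outputs.index(chosen)


# Hypothetical actuator commands from three dissimilar lanes; the second lane is faulty.
command, lane = median_select([10.2, 55.0, 10.0])
```

In this example the faulty value 55.0 is blocked: the selected command is 10.2, produced by the lane at index 0.&lt;br /&gt;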
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System&amp;lt;ref&amp;gt;M. Gajer, “Parallel Image Processing on The TMS320C80 Multiprocessor System”, http://focus.ti.com/pdfs/univ/11-CommunicationAudioSpeech.pdf&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
To shorten the image pre-processing time on this multiprocessor, different types of image processing architectures were presented for the 'C80. For example, a MIMD architecture is used for image histogram equalization, a SIMD architecture for median filtering, and a MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows: each processor performs a different operation on the same data set. It is often said that the MISD architecture was defined only for the sake of completeness of the classification, because it is difficult to find a computer based on it in practice&amp;lt;ref&amp;gt;TMS320C80 System Level Synopsis, Texas Instruments, 1995&amp;lt;/ref&amp;gt;. In image processing, however, this need not be true: performing many different operations on the same image is not unusual, and the MISD architecture fits such cases well. As an example of implementing image processing operations on the MISD architecture, consider edge detection. The edges are detected in four different directions (north, south, west and east) by means of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
&lt;br /&gt;
Image processing on a computer is very often a multistage process. Consider, for example, an algorithm that detects the contours of the objects in an image. The algorithm is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and suppresses noise. In the next step, thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The resulting image is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of this pipeline is the master processor, which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors, as listed in Table 1.&lt;br /&gt;
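&lt;br /&gt;
The directional edge detection described above can be sketched as follows. The source does not give the mask coefficients, so the four 3x3 masks below are assumed (Prewitt-style) values, and the function names are hypothetical. The sketch also runs the four masks one after another, whereas on the TMS320C80 each mask would run on its own parallel processor.&lt;br /&gt;
&lt;br /&gt;
```python
# Assumed 3x3 directional masks; the actual coefficients are not given in the source.
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def apply_mask(image, mask):
    """Correlate a 3x3 mask with the image; the one-pixel border is left at 0."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(
                mask[i][j] * image[r - 1 + i][c - 1 + j]
                for i in range(3)
                for j in range(3)
            )
    return out


def directional_edges(image):
    """Apply every mask to the same input image (one mask per processor in MISD terms)."""
    return {name: apply_mask(image, mask) for name, mask in MASKS.items()}
```

Every mask reads the same unmodified image, which is the property that makes this workload fit the MISD description: one data set, several different operations.&lt;br /&gt;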
&amp;lt;table border=1&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;Stage&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Processor&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Operation&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;MP&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Image acquisition&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP0&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Low-pass Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Thresholding&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;4&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Laplace Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;5&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Logic Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;	&lt;br /&gt;
&amp;lt;/table&amp;gt;&lt;br /&gt;
Table 1: Pipeline structure for image edge detection[8]&lt;br /&gt;
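&lt;br /&gt;
The four processing stages of Table 1 can be sketched as a chain of functions. All numeric choices here are assumptions made for illustration (a 3x3 mean as the low-pass filter, a threshold of 128, an 8-neighbour Laplacian, 0 for black and 1 for white), as are the function names. The sketch processes one frame stage by stage; in the pipeline described above, each stage would work on a different frame at the same time.&lt;br /&gt;
&lt;br /&gt;
```python
def window(img, r, c):
    """The nine values of the 3x3 window around (r, c), clamped at the border."""
    rows, cols = len(img), len(img[0])
    return [
        img[min(max(r + dr, 0), rows - 1)][min(max(c + dc, 0), cols - 1)]
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
    ]


def map_pixels(img, fn):
    """Build a new image by evaluating fn(img, r, c) at every pixel."""
    return [[fn(img, r, c) for c in range(len(img[0]))] for r in range(len(img))]


def low_pass(img):
    # Stage 2 (PP0): 3x3 mean filter smooths the image.
    return map_pixels(img, lambda im, r, c: sum(window(im, r, c)) // 9)


def threshold(img, level=128):
    # Stage 3 (PP1): 1 = white, 0 = black.
    return map_pixels(img, lambda im, r, c: 1 if im[r][c] >= level else 0)


def laplace(img):
    # Stage 4 (PP2): a nonzero Laplacian response marks a contour pixel as black.
    def pixel(im, r, c):
        response = 9 * im[r][c] - sum(window(im, r, c))
        return 0 if response != 0 else 1
    return map_pixels(img, pixel)


def logic_filter(img):
    # Stage 5 (PP3): an interior black pixel with eight white neighbours becomes white.
    def pixel(im, r, c):
        isolated = im[r][c] == 0 and sum(window(im, r, c)) == 8
        return 1 if isolated else im[r][c]
    return map_pixels(img, pixel)


STAGES = [low_pass, threshold, laplace, logic_filter]


def detect_contours(img):
    """Pass one acquired frame through the four processing stages in order."""
    for stage in STAGES:
        img = stage(img)
    return img
```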
&lt;br /&gt;
===Pattern Matching on a Recursive MISD Architecture&amp;lt;ref&amp;gt;Arne Halaas, Børge Svingen, Magnar Nedland, Pål Sætrom, Ola Snøve, Jr., and Olaf René Birkeland. 2004. A recursive MISD architecture for pattern matching. IEEE Trans. Very Large Scale Integr. Syst. 12, 7 (July 2004), 727-734. DOI=10.1109/TVLSI.2004.830918 http://dx.doi.org/10.1109/TVLSI.2004.830918&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Online multipattern approximate searching is required in situations where several patterns must be matched concurrently in data and no persistent index can be created for look-up. The algorithm requires evaluating the hit function H(Si,P), which operates on strings and a pattern; this involves partitioning P into smaller components and combining the individual results. A recursive approach is required, and the Si values need to be evaluated in parallel. This motivates an architecture with two complete binary trees, which take care of data distribution and result processing. The architecture is implemented with as many PEs as is practically possible within the implementation constraints. The data from the stream is fed to the root node of the data distribution tree and simultaneously distributed to several collections of PEs belonging to separate queries, thus implementing the MISD architecture with a single data stream and multiple instruction streams.&lt;br /&gt;
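&lt;br /&gt;
The idea of broadcasting one data stream to several PEs that each evaluate a different query can be sketched as below. This is a simplified illustration, not the cited design: the class PatternPE, the function search, and the use of a mismatch count (Hamming distance) as the hit criterion are all assumptions, and the distribution and result trees are replaced by plain loops.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import deque


class PatternPE:
    """Hypothetical processing element that holds one query pattern."""

    def __init__(self, pattern, max_mismatches):
        self.pattern = pattern
        self.max_mismatches = max_mismatches
        self.window = deque(maxlen=len(pattern))

    def step(self, symbol):
        """Consume one broadcast symbol; return True on an approximate hit."""
        self.window.append(symbol)
        if len(self.window) != len(self.pattern):
            return False
        mismatches = sum(a != b for a, b in zip(self.window, self.pattern))
        return self.max_mismatches >= mismatches


def search(stream, pes):
    """Broadcast each symbol to every PE and collect (end_position, pattern) hits."""
    hits = []
    for pos, symbol in enumerate(stream):
        for pe in pes:  # in hardware, all PEs would see the symbol in the same cycle
            if pe.step(symbol):
                hits.append((pos, pe.pattern))
    return hits
```

For example, searching the stream "abcabd" with an exact query for "abc" and a one-mismatch query for "abd" reports a hit for both queries at position 2 and a further hit for "abd" at position 5.&lt;br /&gt;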
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
The MISD architecture is one of the four types of parallel architecture described in Flynn’s Taxonomy.  By definition, the MISD architecture has multiple instruction streams and a single data stream.  The definition also requires that all processing elements receive the same data, but most commonly cited implementations of this class of architecture do not adhere to this detail.  As a result, there is debate among some experts on whether certain implementations are truly MISD.&lt;br /&gt;
&lt;br /&gt;
MISD architectures are not commonly found in industry as they tend to lack the scalability and efficiency of resource use of the MIMD and SIMD class of architectures.  However, MISD architectures are efficient for solving specific classes of problems often found in research.  Some examples of its use in research include image processing and systolic arrays.  While rare, some implementations do exist outside of academic settings.  One such example is fault tolerant systems such as the flight control system on the Boeing 777.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''ALU''': Arithmetic Logic Unit, the part of a processor that performs arithmetic and logical operations&lt;br /&gt;
&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
&lt;br /&gt;
'''Q1: Using the classical definition of the MISD architecture from Flynn’s Taxonomy, what is the minimum number of information streams that must be present to be classified as MISD architecture?'''&lt;br /&gt;
&lt;br /&gt;
A. One information stream, the data stream; the number of instruction streams does not matter.&lt;br /&gt;
&lt;br /&gt;
B. Two information streams, one instruction stream, one data stream.&lt;br /&gt;
&lt;br /&gt;
C. Three information streams, two instruction streams, one data stream.&lt;br /&gt;
&lt;br /&gt;
D. The number of information streams is irrelevant as per Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A1: C, Three information streams, two instruction streams, one data stream.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q2: Why is the Pipeline architecture often cited as an example of MISD?'''&lt;br /&gt;
&lt;br /&gt;
A. The Pipeline architecture is exactly like the definition of MISD architecture from Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
B. The data to be processed travels linearly across all processing elements with each processing element executing a different instruction and passing it on to the next processing element.  This line of data is cited to be a single “stream.”&lt;br /&gt;
&lt;br /&gt;
C. The data stream sends the same data to all processing elements for every cycle.&lt;br /&gt;
&lt;br /&gt;
D. Flynn based his description of MISD architecture on the Pipeline architecture in his paper that introduced Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A2: B, The data to be processed travels linearly across all processing elements with each processing element executing a different instruction and passing it on to the next processing element.  This line of data is cited to be a single “stream.”''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q3: Why is the Pipeline architecture NOT considered a true MISD architecture by many experts?'''&lt;br /&gt;
&lt;br /&gt;
A. This is incorrect; there is consensus that the Pipeline architecture is synonymous with the MISD architecture described by Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
B. There is only a single, global information/control stream in the Pipeline architecture.&lt;br /&gt;
&lt;br /&gt;
C. The data passed to each processing element is different as it reflects the result of the processing of all the previous processing elements.  Flynn’s Taxonomy describes the same data going to every processing element.&lt;br /&gt;
&lt;br /&gt;
D. It has been proven that it is impossible to build a MISD architecture parallel computer as described by Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A3: C, The data passed to each processing element is different as it reflects the result of the processing of all the previous processing elements.  Flynn’s Taxonomy describes the same data going to every processing element.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q4: True or False? MISD architecture is the most common parallel computing architecture used for general computing tasks.'''&lt;br /&gt;
&lt;br /&gt;
A. True&lt;br /&gt;
&lt;br /&gt;
B. False&lt;br /&gt;
&lt;br /&gt;
''A4: B, False, it is often not the best suited architecture for general computing tasks and as such not common outside of academic research.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q5: When classifying a particular parallel architecture, it was noted that there was only one physical connection to provide all information streams, is it possible for this architecture to be classified as MISD architecture? (Note that there are three choices, you need to select the correct reason for the Yes/No/Maybe answer.)'''&lt;br /&gt;
&lt;br /&gt;
A. Yes, Flynn’s taxonomy does not require the number of physical connections in the hardware correspond to the number of logical streams.&lt;br /&gt;
&lt;br /&gt;
B. Maybe, we do not have enough information about the architecture to rule out the possibility of this parallel computer having a MISD architecture.  We will need to see how the connections are physically laid out on the chip.&lt;br /&gt;
&lt;br /&gt;
C. No, there must be multiple physical information streams.&lt;br /&gt;
&lt;br /&gt;
''A5. A, Yes, Flynn’s taxonomy does not require the number of physical connections in the hardware correspond to the number of logical streams.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q6. True or False? To be a MISD architecture, it requires that every processing element is identical.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A6. B, False, it is not required by definition of Flynn’s Taxonomy that all processing elements are identical.  See example in the wiki reading about the fault tolerant flight control system on the Boeing 777.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q7: Which of these is an application of the MISD architecture?'''&lt;br /&gt;
&lt;br /&gt;
A.	Keyword search&lt;br /&gt;
&lt;br /&gt;
B.	Controlling an assembly line.&lt;br /&gt;
&lt;br /&gt;
C.	Home computing&lt;br /&gt;
&lt;br /&gt;
D.	Image edge detection.&lt;br /&gt;
&lt;br /&gt;
''A7: D, Image edge detection.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q8. True or False? The systolic array is another name for the MISD architecture.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A8. B, False, systolic arrays may be implemented using a MISD architecture but they also may be other architectures such as SIMD and MIMD.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q9.  True or False? MISD is considered the easiest parallel architecture to scale.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A9. B, False, MIMD and SIMD are often used because of advantages in scaling.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q10. Flynn’s Taxonomy relies on the number of logical information streams to classify the types of parallel architectures.  What is the difference between a logical information stream and a physical information stream?'''&lt;br /&gt;
&lt;br /&gt;
A.	A logical information stream is a connection between the ALU and the location of the data that is responsible for passing information about boolean (logical) computations, whereas the physical information stream is the connection between the computer and other parts of the network.&lt;br /&gt;
&lt;br /&gt;
B.	Physical information streams are determined by how the information is physically transported around the chip while logical information streams are one layer of abstraction above the physical information streams.  Logical streams describe conceptually how the information on the stream can be treated.&lt;br /&gt;
&lt;br /&gt;
C.	Logical and physical information streams transport different kinds of information but are no different in implementation.&lt;br /&gt;
&lt;br /&gt;
D.	There is no difference; the two terms can be used interchangeably.&lt;br /&gt;
&lt;br /&gt;
''A10. B, Physical information streams are determined by how the information is physically transported around the chip while logical information streams are one layer of abstraction above the physical information streams.  Logical streams describe conceptually how the information on the stream can be treated.''&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=72479</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=72479"/>
		<updated>2013-02-11T04:24:52Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Overview of the MISD Architecture */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
===MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of concurrent information streams that flow into the processor. According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the Instruction (or Control) Stream and the Data Stream. Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether the streams are physically independent of each other in the hardware, only that the streams can be logically treated as independent. Based on how many of each type of stream flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction, Single Data (SISD); Single Instruction, Multiple Data (SIMD); Multiple Instruction, Single Data (MISD); and Multiple Instruction, Multiple Data (MIMD).&lt;br /&gt;
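&lt;br /&gt;
The classification rule above depends only on whether each kind of logical stream is single or multiple, which can be written as a small function. The name flynn_class and its parameters are hypothetical and serve only to restate the rule.&lt;br /&gt;
&lt;br /&gt;
```python
def flynn_class(instruction_streams, data_streams):
    """Map counts of logical instruction and data streams to a Flynn category."""
    if 1 > instruction_streams or 1 > data_streams:
        raise ValueError("each kind of stream must occur at least once")
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return f"{instr}I{data}D"
```

For instance, two instruction streams and one data stream give flynn_class(2, 1) == "MISD", the smallest configuration that qualifies as MISD.&lt;br /&gt;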
&lt;br /&gt;
The following table (Figure 1) shows the categories of different parallel architectures according to Flynn:&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
===Overview of the MISD Architecture===&lt;br /&gt;
&lt;br /&gt;
To be categorized as an MISD architecture, a machine must have at least two instruction streams and exactly one data stream, as shown in Figure 2.[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]] According to Flynn’s Taxonomy, each processing element should receive the same data and execute a different instruction on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, and none of them are commercial, general purpose parallel computers, since MIMD and SIMD are often better suited to common data parallel techniques.  Under most circumstances MIMD and SIMD architectures tend to provide better scaling and more efficient use of computational resources. Instead of commercial use, MISD architectures are mainly used to create custom hardware in order to solve specific problems. The majority of the implementations have been created in academia.&lt;br /&gt;
 &lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy. One commonly cited example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction. It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline. Similarly, other architectures or applications exist which may be called MISD by some while others disagree. The following section looks at some of those implementations.&lt;br /&gt;
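&lt;br /&gt;
The distinction drawn above can be made concrete with a short sketch that contrasts the two data flows. The function names and the two sample operations are assumptions chosen for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def misd_broadcast(x, ops):
    """Classical MISD: every processing element receives the same input x."""
    return [op(x) for op in ops]


def pipeline(x, ops):
    """Pipeline: each element receives the previous element's output.

    Returns the final result and the list of inputs each element actually saw.
    """
    seen = []
    for op in ops:
        seen.append(x)
        x = op(x)
    return x, seen


# Two hypothetical processing elements: add one, then double.
ops = [lambda v: v + 1, lambda v: v * 2]
```

With an input of 3, misd_broadcast(3, ops) gives [4, 6] because both elements see 3, while pipeline(3, ops) gives (8, [3, 4]) because the second element sees the first element's result. The second case is the reason the pipeline is disputed as a true MISD architecture.&lt;br /&gt;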
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions. At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Type of Systolic Arrays&amp;lt;ref&amp;gt;Johnson, K.T.; Hurson, A.R.; Shirazi, B.; , &amp;quot;General-purpose systolic arrays,&amp;quot; Computer , vol.26, no.11, pp.20-31, Nov. 1993 doi: 10.1109/2.241423&amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 4: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is in matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element so that they are available to the next element; they exit the processing element unmodified.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
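&lt;br /&gt;
The behaviour of a single systolic element, and of a grid of such elements computing a matrix product, can be simulated as below. The class and function names are hypothetical. The sketch models only the timing of the skewed inputs (cell (i, j) receives the k-th operand pair at step i + j + k) and not the registers that pass data between neighbouring cells.&lt;br /&gt;
&lt;br /&gt;
```python
class SystolicCell:
    """One multiply-accumulate element: adds a*b to its sum and forwards a and b."""

    def __init__(self):
        self.acc = 0

    def step(self, a, b):
        self.acc += a * b
        return a, b  # operands leave the cell unmodified for the next cell


def systolic_matmul(A, B):
    """Simulate an n x n grid of cells computing the product of two n x n matrices."""
    n = len(A)
    cells = [[SystolicCell() for _ in range(n)] for _ in range(n)]
    for t in range(3 * n - 2):  # the last operand pair reaches cell (n-1, n-1) at step 3n-3
        for i in range(n):
            for j in range(n):
                k = t - i - j  # index of the operand pair arriving at cell (i, j) now
                if k in range(n):
                    cells[i][j].step(A[i][k], B[k][j])
    return [[cells[i][j].acc for j in range(n)] for i in range(n)]
```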
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
It is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These variants are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Programmable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable either at a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
For the latter approach, the workstation downloads a program to each MIMD systolic cell (Figure 5). Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: it contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. The architecture has multiple instruction streams for the PEs and a single data stream passing through all the PEs. Thus, it can also be defined as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2.]&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Reconfigurable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The homogeneous characteristic of the Reconfigurable Systolic Array (RSA) architecture, where each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, with the former acting as arithmetic processors and the latter as flexible routers linking the neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on a separate data bus, which enables the circuit to continue its operation while the reconfiguration is in process.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7. Comparison between the architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
From the configurations of systolic arrays described above, it can be seen that they generally have multiple processing elements executing different instructions, each from its own dedicated instruction stream. There is a single data stream that connects the adjacent PEs. Thus, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors object that, because the data read as input by one processing element is the processed output of the adjacent PE, the data stream cannot be considered single: the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the data stream for systolic arrays and for the MISD architecture. By this argument, the systolic array should be considered a “Multiple Data” architecture and not a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc.; software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 8 MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance on computations can be implemented by multiple processors (likely with different architectures) executing the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority voting method is used to determine the faulty processor. Thus the MISD architecture is utilized to obtain fault tolerance on critical computations.&lt;br /&gt;
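&lt;br /&gt;
The M-out-of-N voting step can be sketched as follows. This is a minimal illustration under simplifying assumptions (outputs are compared for exact equality, and the function name majority_vote and its return convention are hypothetical), not the voting logic of any particular system.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter


def majority_vote(outputs, m):
    """Return (agreed_value, faulty_indices) for N redundant outputs.

    A value is accepted when at least m of the N processors produced it.
    If no value reaches m votes, (None, []) is returned because no
    processor can be singled out as faulty.
    """
    value, count = Counter(outputs).most_common(1)[0]
    # "count in range(m, N + 1)" means m is at most count, at most N.
    if count not in range(m, len(outputs) + 1):
        return None, []
    # Every processor that disagrees with the majority is flagged.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty
```
&lt;br /&gt;
With three processors and m = 2, the call majority_vote([7, 7, 9], 2) accepts the value 7 and flags the processor at index 2 as faulty.&lt;br /&gt;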
&lt;br /&gt;
There are various examples of the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. The major examples are flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system is used to replace the manual flight control by an electronic control interface. The movements of the flight control in the cockpit are converted to electronic signals and are transmitted to the actuators by wires. The control computers use the feedback from the sensors to compute and control the movement of the actuators to provide the expected response. These computers also perform the task to stabilize the aircraft and perform other tasks without the knowledge of the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
There are redundant flight control computers present in the flight control system. If one of the flight-control computers crashes, gets damaged or is affected by electromagnetic pulses, the other computers can overrule the faulty one and hence the flight of the aircraft is unharmed. [[Image:fig13.png|thumb|right|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;Yeh, Y.C.;, &amp;quot;Triple-triple redundant 777 primary flight computer,&amp;quot; Aerospace Applications Conference, 1996. Proceedings., 1996 IEEE , vol.1, no., pp.293-307 vol.1, 3-10 Feb 1996&lt;br /&gt;
doi: 10.1109/AERO.1996.495891&amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern computers, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each of them having three lanes with different processors. The flight control program is compiled for each of the processors, which get their input data from the same data bus but drive their outputs on individual control buses. Thus the processors execute different instructions but process the same data, which makes this system a well-suited example of the Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. As the outputs of the lanes can differ, the median value of the outputs is used to select which output is used. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism that acts before fault detection and identification by the cross-lane monitoring system. Thus, the MISD-based multi-computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
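&lt;br /&gt;
The fault blocking effect of a median select can be illustrated with a short sketch. It is an assumption-laden simplification (three numeric lane outputs, a hypothetical function name median_select) and does not reproduce the actual 777 hardware.&lt;br /&gt;
&lt;br /&gt;
```python
from statistics import median_low


def median_select(lane_outputs):
    """Pick the middle value of the lane outputs.

    With three lanes, a single wildly wrong output can never be the
    median, so it is blocked before any fault identification happens.
    """
    return median_low(lane_outputs)
```
&lt;br /&gt;
For instance, median_select([10.2, 10.0, 55.0]) passes on 10.2 and ignores the outlying 55.0, whichever lane produced it.&lt;br /&gt;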
&lt;br /&gt;
The system described above clearly has individual instruction streams: the architecture of each processor is different, so the instruction sets and instruction streams differ. The processors have frame-synchronized input data, which means they work on the same set of data, fed from a single data stream. Thus the flight control system can be classified as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System&amp;lt;ref&amp;gt;M.Gajer;,“ Parallel Image Processing on The TMS320C80 Multiprocessor System”, http://focus.ti.com/pdfs/univ/11-CommunicationAudioSpeech.pdf&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor, the pre-processing time needed to be shortened. To that end, different types of image processing architectures for the ‘C80 were presented: a MIMD architecture for image histogram equalization, a SIMD architecture for median filtering, and a MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows. Here, each processor performs a different operation on the same data set. It is often stated that the MISD architecture was defined only for the sake of completeness of the classification, because in practice it is difficult to find a computer based on the MISD architecture&amp;lt;ref&amp;gt;TMS320C80 System Level Synopsis, Texas Instruments, 1995&amp;lt;/ref&amp;gt;. In image processing, however, this assumption need not hold: performing many different operations on the same image is not rare, and the MISD architecture is a good fit for such cases. As an example of implementing image processing operations on the MISD architecture, the edge detection operation is considered. The edges are detected in four different directions (north, south, west and east) by means of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
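&lt;br /&gt;
The idea of applying four different masks to the same image can be sketched in plain Python. The masks below are hypothetical Prewitt-style masks chosen for illustration; the source does not list the masks actually used, and this sketch runs the four filters one after another rather than on four parallel processors.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical 3x3 directional masks (Prewitt-style), one per direction.
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def apply_mask(image, mask):
    """Slide a 3x3 mask over the interior pixels of a 2D list image."""
    rows, cols = len(image), len(image[0])
    result = []
    for r in range(1, rows - 1):
        row = []
        for c in range(1, cols - 1):
            row.append(sum(mask[i][j] * image[r - 1 + i][c - 1 + j]
                           for i in range(3) for j in range(3)))
        result.append(row)
    return result


def directional_edges(image):
    # Same data for every mask; only the operation (the mask) differs.
    return {name: apply_mask(image, mask) for name, mask in MASKS.items()}
```
&lt;br /&gt;
Each entry of the returned dictionary corresponds to the output of one processor in the MISD arrangement: the input image is shared, and only the mask differs.&lt;br /&gt;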
&lt;br /&gt;
Image processing on a computer is very often a multistage process. An example is an algorithm that detects the contours of the objects in an image, which is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and eliminates noise. In the next step thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The resulting image is a binary image, from which edges are detected using the Laplace filter. In the last step the image with detected contours undergoes logic filtering to eliminate single black pixels placed on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of this pipeline is the master processor (MP), which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors (PP0 to PP3), as listed in Table 1.&lt;br /&gt;
&amp;lt;table border=1&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;Stage&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Processor&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Operation&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;MP&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Image acquisition&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP0&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Low-pass Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Thresholding&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;4&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Laplace Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;5&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Logic Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;	&lt;br /&gt;
&amp;lt;/table&amp;gt;&lt;br /&gt;
Table 1: Pipeline structure for image edge detection[8]&lt;br /&gt;
&lt;br /&gt;
===A Recursive MISD Architecture for Pattern Matching&amp;lt;ref&amp;gt;Arne Halaas, Børge Svingen, Magnar Nedland, Pål Sætrom, Ola Snøve, Jr., and Olaf René Birkeland. 2004. A recursive MISD architecture for pattern matching. IEEE Trans. Very Large Scale Integr. Syst. 12, 7 (July 2004), 727-734. DOI=10.1109/TVLSI.2004.830918 http://dx.doi.org/10.1109/TVLSI.2004.830918&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Online multipattern approximate searching is needed in situations where several patterns must be matched concurrently in data and no persistent index can be created for look-up. The algorithm requires evaluating the hit function H(Si,P), which operates on strings and a pattern; this involves partitioning P into smaller components and combining the individual results. A recursive approach is required, and the Si values need to be evaluated in parallel. This motivates an architecture with two complete binary trees, which take care of data distribution and result processing. The architecture is implemented with as many PEs as is practical within the implementation constraints. The data from the stream is fed to the root node of the data distribution tree and simultaneously distributed to several collections of PEs belonging to separate queries, thus implementing the MISD architecture with a single data stream and multiple instruction streams.&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
The MISD architecture is one of the four types of parallel architecture described in Flynn’s Taxonomy.  By definition, the MISD architecture has multiple instruction streams and a single data stream.  The definition also describes this architecture as requiring that all processing elements receive the same data, but most commonly cited implementations of this class of architecture do not adhere to this detail.  The result is there is debate among some experts on whether or not certain implementations are truly MISD.&lt;br /&gt;
&lt;br /&gt;
MISD architectures are not commonly found in industry as they tend to lack the scalability and efficiency of resource use of the MIMD and SIMD class of architectures.  However, MISD architectures are efficient for solving specific classes of problems often found in research.  Some examples of its use in research include image processing and systolic arrays.  While rare, some implementations do exist outside of academic settings.  One such example is fault tolerant systems such as the flight control system on the Boeing 777.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''ALU''': Arithmetic Logic Unit, the part of a processor that performs arithmetic and logical operations&lt;br /&gt;
&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
&lt;br /&gt;
'''Q1: Using the classical definition of the MISD architecture from Flynn’s Taxonomy, what is the minimum number of information streams that must be present to be classified as MISD architecture?'''&lt;br /&gt;
&lt;br /&gt;
A. One information stream, the data stream; the number of instruction streams does not matter.&lt;br /&gt;
&lt;br /&gt;
B. Two information streams, one instruction stream, one data stream.&lt;br /&gt;
&lt;br /&gt;
C. Three information streams, two instruction streams, one data stream.&lt;br /&gt;
&lt;br /&gt;
D. The number of information streams is irrelevant as per Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A1: C, Three information streams, two instruction streams, one data stream.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q2: Why is the Pipeline architecture often cited as an example of MISD?'''&lt;br /&gt;
&lt;br /&gt;
A. The Pipeline architecture is exactly like the definition of MISD architecture from Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
B. The data to be processed travels linearly across all processing elements with each processing element executing a different instruction and passing it on to the next processing element.  This line of data is cited to be a single “stream.”&lt;br /&gt;
&lt;br /&gt;
C. The data stream sends the same data to all processing elements for every cycle.&lt;br /&gt;
&lt;br /&gt;
D. Flynn based his description of MISD architecture on the Pipeline architecture in his paper that introduced Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A2: B, The data to be processed travels linearly across all processing elements with each processing element executing a different instruction and passing it on to the next processing element.  This line of data is cited to be a single “stream.”''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q3: Why is the Pipeline architecture NOT considered a true MISD architecture by many experts?'''&lt;br /&gt;
&lt;br /&gt;
A. This is incorrect; there is consensus that the Pipeline architecture is synonymous with the MISD architecture described by Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
B. There is only a single, global information/control stream in the Pipeline architecture.&lt;br /&gt;
&lt;br /&gt;
C. The data passed to each processing element is different as it reflects the result of the processing of all the previous processing elements.  Flynn’s Taxonomy describes the same data going to every processing element.&lt;br /&gt;
&lt;br /&gt;
D. It has been proven that it is impossible to build a MISD architecture parallel computer as described by Flynn’s Taxonomy.&lt;br /&gt;
&lt;br /&gt;
''A3: C, The data passed to each processing element is different as it reflects the result of the processing of all the previous processing elements.  Flynn’s Taxonomy describes the same data going to every processing element.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q4: True or False? MISD architecture is the most common parallel computing architecture used for general computing tasks.'''&lt;br /&gt;
&lt;br /&gt;
A. True&lt;br /&gt;
&lt;br /&gt;
B. False&lt;br /&gt;
&lt;br /&gt;
''A4: B, False, it is often not the best suited architecture for general computing tasks and as such not common outside of academic research.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q5: When classifying a particular parallel architecture, it was noted that there was only one physical connection to provide all information streams, is it possible for this architecture to be classified as MISD architecture? (Note that there are three choices, you need to select the correct reason for the Yes/No/Maybe answer.)'''&lt;br /&gt;
&lt;br /&gt;
A. Yes, Flynn’s taxonomy does not require the number of physical connections in the hardware correspond to the number of logical streams.&lt;br /&gt;
&lt;br /&gt;
B. Maybe, we do not have enough information about the architecture to rule out the possibility of this parallel computer having a MISD architecture.  We will need to see how the connections are physically laid out on the chip.&lt;br /&gt;
&lt;br /&gt;
C. No, there must be multiple physical information streams.&lt;br /&gt;
&lt;br /&gt;
''A5. A, Yes, Flynn’s taxonomy does not require the number of physical connections in the hardware correspond to the number of logical streams.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q6. True or False? To be a MISD architecture, it requires that every processing element is identical.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A6. B, False, it is not required by definition of Flynn’s Taxonomy that all processing elements are identical.  See example in the wiki reading about the fault tolerant flight control system on the Boeing 777.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q7: Which of these is an application of the MISD architecture?'''&lt;br /&gt;
&lt;br /&gt;
A.	Keyword search&lt;br /&gt;
&lt;br /&gt;
B.	Controlling an assembly line.&lt;br /&gt;
&lt;br /&gt;
C.	Home computing&lt;br /&gt;
&lt;br /&gt;
D.	Image edge detection.&lt;br /&gt;
&lt;br /&gt;
''A7: D, Image edge detection.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q8. True or False? The systolic array is another name for the MISD architecture.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A8. B, False, systolic arrays may be implemented using a MISD architecture but they also may be other architectures such as SIMD and MIMD.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q9.  True or False? MISD is considered the easiest parallel architecture to scale.'''&lt;br /&gt;
&lt;br /&gt;
A.	True&lt;br /&gt;
&lt;br /&gt;
B.	False&lt;br /&gt;
&lt;br /&gt;
''A9. B, False, MIMD and SIMD are often used because of advantages in scaling.''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Q10. Flynn’s Taxonomy relies on the number of logical information streams to classify the types of parallel architectures.  What is the difference between a logical information stream and a physical information stream?'''&lt;br /&gt;
&lt;br /&gt;
A.	A logical information stream is a connection between the ALU and the location of the data that is responsible for passing information about boolean (logical) computations, whereas the physical information stream is the connection between the computer and other parts of the network.&lt;br /&gt;
&lt;br /&gt;
B.	Physical information streams are determined by how the information is physically transported around the chip, while logical information streams are one layer of abstraction above the physical information streams.  Logical streams describe conceptually how the information on the stream can be treated.&lt;br /&gt;
&lt;br /&gt;
C.	Logical and physical information streams transport different kinds of information but are no different in implementation.&lt;br /&gt;
&lt;br /&gt;
D.	There is no difference; the two terms can be used interchangeably.&lt;br /&gt;
&lt;br /&gt;
''A10. B, Physical information streams are determined by how the information is physically transported around the chip, while logical information streams are one layer of abstraction above the physical information streams.  Logical streams describe conceptually how the information on the stream can be treated.''&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71798</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71798"/>
		<updated>2013-02-04T23:22:56Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* General-purpose systolic array */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
==MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of information streams that flow into the processor.&lt;br /&gt;
According to Flynn's Taxonomy, there are two kinds of logical streams of information that flow into a parallel computer: the Instruction stream and the Data stream.  Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent.&lt;br /&gt;
Based on how many streams of each type flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories.  They are Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD).&lt;br /&gt;
&lt;br /&gt;
The four classifications defined by Flynn, which are based upon the number of concurrent instruction (or control) and data streams available in the architecture, are shown in Figure 1.&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
==Overview of the MISD Architecture==&lt;br /&gt;
&lt;br /&gt;
In order to be categorized as an MISD architecture, there must be at least two Instruction streams and there can only be one Data stream, as shown in Figure 2.  As per the description in Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, none of which are commercial, general purpose parallel computers.  Instead, MISD architectures are mainly used to create custom hardware in order to solve specific problems.  The majority of the implementations have been created in academia.&lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy.  One common example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction.  It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline.  Similarly, other architectures or applications exist which may be called MISD by some while others disagree.  The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type. Not many instances of this architecture exist, as MIMD and SIMD are often more appropriate for common data parallel techniques. Specifically, they allow better scaling and use of computational resources than MISD does. &lt;br /&gt;
&lt;br /&gt;
However, one prominent example of MISD in computing is the Space Shuttle flight control computers.  Another example of this type of machine is the systolic array, such as the [http://www.cs.cmu.edu/~iwarp/ CMU iWarp] [BORKAR et al., 1990].  All the elements in this array are controlled by a global clock. On each cycle, an element will read a piece of data from one of its neighbors, perform a simple operation (e.g. add the incoming element to a stored value), and prepare a value to be written to a neighbor on the next step.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions.  At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Types of Systolic Arrays&amp;lt;ref&amp;gt;Johnson, K.T.; Hurson, A.R.; Shirazi, B., &amp;quot;General-purpose systolic arrays,&amp;quot; Computer, vol. 26, no. 11, pp. 20-31, Nov. 1993, doi: 10.1109/2.241423&amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 4: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
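&lt;br /&gt;
The behaviour of the single systolic element described above can be sketched in Python. This is a minimal model under stated assumptions, not the hardware design from the cited paper: the names SystolicCell, step and scalar_product are hypothetical, and one call to step stands in for one clock tick.&lt;br /&gt;

```python
class SystolicCell:
    """Hypothetical model of one hardwired multiply-accumulate cell."""

    def __init__(self):
        self.acc = 0  # running sum of products held in the accumulator

    def step(self, a, b):
        # Accumulate the product, then pass both operands on unmodified
        # so that a neighbouring cell could consume them on the next tick.
        self.acc += a * b
        return a, b


def scalar_product(a_values, b_values):
    """Shift paired operands through one cell and read out the sum."""
    cell = SystolicCell()
    for a, b in zip(a_values, b_values):
        cell.step(a, b)  # one clock tick per operand pair
    return cell.acc  # the sum is shifted out of the accumulator
```

For example, scalar_product([1, 2, 3], [4, 5, 6]) accumulates 4 + 10 + 18 and returns 32.&lt;br /&gt;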
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These choices are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Programmable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
For the MIMD approach, the workstation downloads a program to each MIMD systolic cell (Figure 5). Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: it contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, the architecture has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be described as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is contested, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2].&lt;br /&gt;
&lt;br /&gt;
======&amp;lt;font size=&amp;quot;2&amp;quot;&amp;gt;Reconfigurable Systolic Array&amp;lt;/font&amp;gt;======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The homogeneous structure of the Reconfigurable Systolic Array (RSA) architecture, where each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, the former as arithmetic processors and the latter as flexible routers linking the neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on separate data buses, which enables the circuit to continue operating while reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7. Comparison between the architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
In the systolic array configurations described above, multiple processing elements generally execute different instructions, each from its own dedicated instruction stream, while a single data stream connects the adjacent PEs. On this view, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors object that the data read as input by one processing element is the processed output of the adjacent PE. On this argument, the data stream cannot be considered single, because the data paths do not all carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the data stream for systolic arrays and for the MISD architecture. These authors conclude that the systolic array should be considered a “Multiple Data” architecture and not a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc.; software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing the program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 8: MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault tolerant systems by providing multiple instruction streams, multiple data streams, or both. Fault tolerance for computations can be implemented by having multiple processors (likely with different architectures) execute the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority voting method is used to determine the faulty processor. In this way the MISD architecture is used to provide fault tolerance for critical computations.&lt;br /&gt;
&lt;br /&gt;
There are various examples of the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. The major examples are flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
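&lt;br /&gt;
The M-out-of-N voting step can be illustrated with a short Python sketch. It is an illustration only: the function name majority_vote and its return convention are assumptions made here, and real voters typically compare values within a tolerance instead of testing exact equality.&lt;br /&gt;

```python
from collections import Counter


def majority_vote(outputs, m):
    """Return (agreed_value, faulty_indices) for N redundant outputs.

    If no value is reported by at least m processors, no majority exists
    and every processor is returned as suspect (an assumed convention).
    """
    value, votes = Counter(outputs).most_common(1)[0]
    if votes >= m:
        # Processors that disagree with the majority are flagged as faulty.
        faulty = [i for i, out in enumerate(outputs) if out != value]
        return value, faulty
    return None, list(range(len(outputs)))
```

For example, majority_vote([7, 7, 9], 2) returns (7, [2]): two of the three processors agree, so the third is flagged.&lt;br /&gt;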
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system replaces manual flight control with an electronic control interface. The movements of the flight controls in the cockpit are converted to electronic signals and transmitted to the actuators by wires. The control computers use feedback from the sensors to compute and control the movement of the actuators so as to provide the expected response. These computers also stabilize the aircraft and perform other tasks without the knowledge of the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
The flight control system contains redundant flight control computers. If one of the flight control computers crashes, is damaged or is affected by electromagnetic pulses, another computer can overrule the faulty one, so the flight of the aircraft is unharmed. [[Image:fig13.png|thumb|right|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;Yeh, Y.C., &amp;quot;Triple-triple redundant 777 primary flight computer,&amp;quot; Aerospace Applications Conference, 1996. Proceedings., 1996 IEEE, vol. 1, pp. 293-307, 3-10 Feb 1996, doi: 10.1109/AERO.1996.495891&amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern flight control systems, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each of them having three lanes with different processors. The flight control program is compiled for each of the processors, which get their input data from the same data bus but drive their output on individual control buses. Each processor therefore executes different instructions on the same data, which makes the system a well suited example of the Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. Because the outputs of the lanes can differ, the median value of the outputs is used to select the output to be considered. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism that acts before fault detection and identification by the cross-lane monitoring system. Thus, the MISD based multi-computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
&lt;br /&gt;
The system described above clearly has individual instruction streams: because the architecture of each processor is different, each has a different instruction set and hence a different instruction stream. The processors have frame-synchronized input data, which means they work on the same set of data fed from a single data stream. The flight control system can therefore be classified as an MISD architecture.&lt;br /&gt;
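&lt;br /&gt;
The median select described above can be sketched as follows. This is a simplified illustration and not the 777 implementation: the function name median_select, the tolerance parameter and the idea of returning a list of suspect lanes are assumptions added here to show how a single bad lane is blocked.&lt;br /&gt;

```python
import statistics


def median_select(lane_outputs, tolerance):
    """Pick the median of the lane outputs and list lanes far from it.

    lane_outputs: one numeric command per lane, all computed from the
    same frame of sensor data. tolerance is a hypothetical threshold.
    """
    selected = statistics.median(lane_outputs)
    # A lane whose output deviates from the median by more than the
    # tolerance is reported as suspect for later cross-lane monitoring.
    suspects = [
        i for i, value in enumerate(lane_outputs)
        if abs(value - selected) > tolerance
    ]
    return selected, suspects
```

With three lanes reporting 10.0, 10.2 and 99.0 and a tolerance of 0.5, the selected value is 10.2 and lane 2 is listed as suspect, so the outlier never reaches the actuators.&lt;br /&gt;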
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System&amp;lt;ref&amp;gt;M.Gajer;,“ Parallel Image Processing on The TMS320C80 Multiprocessor System”, http://focus.ti.com/pdfs/univ/11-CommunicationAudioSpeech.pdf&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor, the pre-processing time needed to be shortened. To reduce it, different types of image processing architectures for the ‘C80 were presented. For example, an MIMD architecture is used for image histogram equalization, an SIMD architecture for median filtering, and an MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows. Here, each processor performs a different operation on the same data set. It is often stated that the MISD architecture was defined only for completeness of the classification, because in practice it is difficult to find a computer based on it&amp;lt;ref&amp;gt;TMS320C80 System Level Synopsis, Texas Instruments, 1995&amp;lt;/ref&amp;gt;. In image processing, however, this assumption need not hold, because performing many different operations on the same image is common, and the MISD architecture fits such cases well. As an example of implementing image processing operations on the MISD architecture, consider the edge detection operation. The edges are detected in four different directions (north, south, west and east) using four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
&lt;br /&gt;
Image processing on a computer is very often a multistage process. One example is an algorithm that detects the contours of objects in an image, which is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and eliminates noise. In the next step, thresholding is performed and each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The resulting image is a binary image from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering in order to eliminate single black pixels placed on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of this pipeline is the master processor, which controls the image acquisition process. The further pipeline stages were implemented on the four parallel processors, as listed in Table 1.&lt;br /&gt;
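&lt;br /&gt;
The idea of applying four different operations to one image can be sketched in Python. The 3x3 masks below are illustrative compass-style masks chosen for this sketch, not the masks from the cited paper, and the names MASKS, apply_mask and directional_edges are hypothetical. The four results are computed one after another here, whereas on the TMS320C80 each mask would run on its own parallel processor.&lt;br /&gt;

```python
# Illustrative 3x3 directional masks (assumed, not taken from the source).
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def apply_mask(image, mask):
    """Correlate a 3x3 mask with the image; border pixels are left at 0."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(
                mask[i][j] * image[r - 1 + i][c - 1 + j]
                for i in range(3)
                for j in range(3)
            )
    return out


def directional_edges(image):
    """Single data (one image), multiple instructions (four masks)."""
    return {name: apply_mask(image, mask) for name, mask in MASKS.items()}
```

Every mask reads the same input image and none modifies it, which is what distinguishes this arrangement from a pipeline.&lt;br /&gt;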
&amp;lt;table border=1&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;Stage&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Processor&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Operation&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;MP&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Image acquisition&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP0/td&amp;gt;&amp;lt;td&amp;gt;Low-pass Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Thresholding&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;4&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Laplace Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;5&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Logic Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;	&lt;br /&gt;
&amp;lt;/table&amp;gt;&lt;br /&gt;
Table 1: Pipeline structure for image edge detection[8]&lt;br /&gt;
&lt;br /&gt;
===Recursive MISD Architecture for Pattern Matching&amp;lt;ref&amp;gt;Arne Halaas, Børge Svingen, Magnar Nedland, Pål Sætrom, Ola Snøve, Jr., and Olaf René Birkeland. 2004. A recursive MISD architecture for pattern matching. IEEE Trans. Very Large Scale Integr. Syst. 12, 7 (July 2004), 727-734. DOI=10.1109/TVLSI.2004.830918 http://dx.doi.org/10.1109/TVLSI.2004.830918&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Situations where several patterns must be matched concurrently in data, and where no persistent index can be created for look-up, require online multipattern approximate searching. This algorithm requires evaluating the hit function H(Si,P), which operates on strings and a pattern; the evaluation partitions P into smaller components and combines the individual results. A recursive approach is required, and the Si values need to be evaluated in parallel. This motivates an architecture with two complete binary trees, which take care of the data distribution and result processing. The architecture is implemented with as many PEs as are practical within the implementation constraints. The data from the stream is fed to the root node of the data distribution tree and simultaneously distributed to several collections of PEs belonging to separate queries, thus implementing the MISD architecture with a single data stream and multiple instruction streams.&lt;br /&gt;
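&lt;br /&gt;
The single-stream, many-queries idea can be illustrated with a deliberately simplified Python sketch. It substitutes exact matching for the approximate hit function and does not model the two binary trees; the function name broadcast_match is hypothetical. Each pattern plays the role of one query, and every query sees the same stream position at each step.&lt;br /&gt;

```python
def broadcast_match(stream, patterns):
    """Feed one stream to several exact-match queries.

    Returns a dict mapping each pattern to the start positions at which
    it occurs. Patterns are assumed to be non-empty strings.
    """
    hits = {p: [] for p in patterns}
    for end in range(1, len(stream) + 1):
        seen = stream[:end]  # the data delivered so far, same for all
        for p in patterns:
            # Each query applies its own test to the shared data.
            if seen.endswith(p):
                hits[p].append(end - len(p))
    return hits
```

For example, broadcast_match("abcab", ["ab", "bc"]) reports "ab" at positions 0 and 3 and "bc" at position 1.&lt;br /&gt;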
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''CU''': Control Unit&lt;br /&gt;
&lt;br /&gt;
*'''DS''': Data Stream&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''Heterogeneous System''': A multiprocessor system with different kinds of processors&lt;br /&gt;
&lt;br /&gt;
*'''Homogeneous System''': A multiprocessor system with the same kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''ILP''': Instruction Level Parallelism&lt;br /&gt;
&lt;br /&gt;
*'''IS''': Instruction Stream&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SAPO''': Short for Samočinný počítač&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''VLSI''': Very Large Scale Integration&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71796</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71796"/>
		<updated>2013-02-04T23:19:55Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
==MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of information streams that flow into the processor.&lt;br /&gt;
According to Flynn's Taxonomy, there are two kinds of logical streams of information that flow into a parallel computer: the instruction stream and the data stream.  Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent.&lt;br /&gt;
Based on how many streams of each type flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories.  They are Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD).&lt;br /&gt;
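&lt;br /&gt;
The classification rule can be written down in a few lines of Python. The function name flynn_class is hypothetical; the sketch only encodes the single vs. multiple distinction described above.&lt;br /&gt;

```python
def flynn_class(instruction_streams, data_streams):
    """Map stream counts to one of Flynn's four category names."""
    if not (instruction_streams >= 1 and data_streams >= 1):
        raise ValueError("each stream count must be at least 1")
    # "S" for a single stream, "M" for multiple streams.
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return instr + "I" + data + "D"
```

For example, flynn_class(3, 1) returns "MISD": several instruction streams operate on one data stream.&lt;br /&gt;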
&lt;br /&gt;
The four classifications defined by Flynn, which are based upon the number of concurrent instruction (or control) and data streams available in the architecture, are shown in Figure 1.&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
==Overview of the MISD Architecture==&lt;br /&gt;
&lt;br /&gt;
In order to be categorized as an MISD architecture, there must be at least two instruction streams and only one data stream, as shown in Figure 2.  As per the description in Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, none of which are commercial, general purpose parallel computers.  Instead, MISD architectures are mainly used to create custom hardware in order to solve specific problems.  The majority of the implementations have been created in academia.&lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy.  One common example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction.  It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline.  Similarly, other architectures or applications exist which may be called MISD by some while others disagree.  The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type. Not many instances of this architecture exist, as MIMD and SIMD are often more appropriate for common data parallel techniques. Specifically, they allow better scaling and use of computational resources than MISD does. &lt;br /&gt;
&lt;br /&gt;
However, one prominent example of MISD in computing is the Space Shuttle flight control computers.  Another example of this class of machine is the systolic array, such as the [http://www.cs.cmu.edu/~iwarp/ CMU iWarp] [BORKAR et al., 1990].  All the elements in this array are controlled by a global clock. On each cycle, an element reads a piece of data from one of its neighbors, performs a simple operation (e.g. adds the incoming element to a stored value), and prepares a value to be written to a neighbor on the next step.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions.  At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Types of Systolic Arrays&amp;lt;ref&amp;gt;Johnson, K.T.; Hurson, A.R.; Shirazi, B., &amp;quot;General-purpose systolic arrays,&amp;quot; Computer, vol. 26, no. 11, pp. 20-31, Nov. 1993, doi: 10.1109/2.241423&amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 4: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These choices are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======Programmable Systolic Array======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
For the MIMD approach, the workstation downloads a program to each MIMD systolic cell (Figure 5). Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: it contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, the architecture has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be described as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is contested, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2].&lt;br /&gt;
&lt;br /&gt;
======Reconfigurable Systolic Array======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
A reconfigurable systolic array is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array.&lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The homogeneous structure of the Reconfigurable Systolic Array (RSA) architecture, in which each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, the former as arithmetic processors and the latter as flexible routers linking neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on separate data buses, which enables the circuit to continue operating while reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7: Comparison between Architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
In the systolic array configurations described above, multiple processing elements generally execute different instructions, each from its own dedicated instruction stream. A single data stream connects the adjacent PEs. Thus, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors argue, however, that because the data read as input by one processing element is the processed output of the adjacent PE, the data stream cannot be considered single: the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the data stream of systolic arrays and that of the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture and not a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc.; software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 8: MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance in computations can be implemented by multiple processors (likely with different architectures) executing the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority voting method is used to identify the faulty processor. Thus MISD architecture is used to achieve fault tolerance in critical computations.&lt;br /&gt;
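&lt;br /&gt;
The voting step can be illustrated with a minimal Python sketch. It assumes each redundant processor returns a directly comparable value; the function name majority_vote and the sample values are hypothetical and are not taken from the cited sources.&lt;br /&gt;

```python
from collections import Counter


def majority_vote(outputs, m):
    """M-out-of-N vote over the outputs of N redundant processors.

    outputs: one result per processor (illustrative values only).
    m: minimum number of processors that must agree.
    Returns (agreed_value, faulty_indices); (None, []) when no quorum exists.
    """
    value, count = Counter(outputs).most_common(1)[0]
    # A quorum exists only when the agreeing count lies between m and N.
    if count not in range(m, len(outputs) + 1):
        return None, []
    # Every processor that disagrees with the majority is flagged as faulty.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


if __name__ == "__main__":
    print(majority_vote([42, 42, 17], 2))
```

With three processors and m set to 2, a single disagreeing processor is identified by its index, while three mutually different outputs yield no agreed value.&lt;br /&gt;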
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. Major examples include flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system replaces manual flight control with an electronic control interface. The movements of the flight controls in the cockpit are converted to electronic signals and transmitted to the actuators by wires. The control computers use feedback from the sensors to compute and control the movement of the actuators to provide the expected response. These computers also stabilize the aircraft and perform other tasks without the pilot's involvement. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
There are redundant flight control computers present in the flight control system. If one of the flight-control computers crashes, gets damaged or is affected by electromagnetic pulses, the other computers can overrule the faulty one, and hence the flight of the aircraft is unharmed. [[Image:fig13.png|thumb|right|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;Yeh, Y.C.;, &amp;quot;Triple-triple redundant 777 primary flight computer,&amp;quot; Aerospace Applications Conference, 1996. Proceedings., 1996 IEEE , vol.1, no., pp.293-307 vol.1, 3-10 Feb 1996&lt;br /&gt;
doi: 10.1109/AERO.1996.495891&amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern aircraft, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each of them having three lanes with different processors. The flight control program is compiled for each of the processors, which get their input data from the same data bus but drive their outputs on individual control buses. Thus each processor executes different instructions, but they all process the same data, which makes the system a well-suited example of Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. As the outputs of the lanes can differ, the median value of the outputs is used to select the output to be acted upon. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism ahead of fault detection and identification by the cross-lane monitoring system. Thus, the MISD based multi-computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
&lt;br /&gt;
The system described above clearly has individual instruction streams: the architecture of each processor is different, so the instruction sets and instruction streams differ. These processors have frame-synchronized input data, which means they work on the same set of data, fed from a single data stream. Thus the flight control system can be classified as an MISD architecture.&lt;br /&gt;
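&lt;br /&gt;
The median select described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each lane is reduced to a single numeric command, and the function name median_select and the sample values are hypothetical rather than taken from the 777 design.&lt;br /&gt;

```python
import statistics


def median_select(lane_outputs):
    """Pick the median lane output and report each lane's deviation from it.

    lane_outputs: one numeric command per lane, all computed from the
    same frame of sensor data (illustrative values only).
    """
    selected = statistics.median(lane_outputs)
    # Deviations are what a cross-lane monitor could inspect afterwards.
    deviations = [abs(out - selected) for out in lane_outputs]
    return selected, deviations


if __name__ == "__main__":
    print(median_select([100, 102, 550]))
```

Because the median ignores a single extreme value, one wildly wrong lane cannot reach the output, which is the fault blocking effect; the deviation list merely hints at how later monitoring might single out the disagreeing lane.&lt;br /&gt;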
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System&amp;lt;ref&amp;gt;M.Gajer;,“ Parallel Image Processing on The TMS320C80 Multiprocessor System”, http://focus.ti.com/pdfs/univ/11-CommunicationAudioSpeech.pdf&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor, the pre-processing time needed to be shortened. To that end, different types of image processing architectures for the ‘C80 were presented: for example, MIMD architecture is used for image histogram equalization, SIMD architecture for median filtering, and MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows. Here, each processor performs different operations on the same data set. It is often said that the MISD architecture was defined only for completeness of the classification, because it is difficult in practice to find a computer based on it&amp;lt;ref&amp;gt;TMS320C80 System Level Synopsis, Texas Instruments, 1995&amp;lt;/ref&amp;gt;. In image processing, however, this assumption need not hold: performing many different operations on the same image is common, and hence the MISD architecture is a good fit for such cases. As an example of implementing image processing operations on the MISD architecture, consider edge detection. The edges are to be detected in four different directions (north, south, west and east) using four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
&lt;br /&gt;
Image processing on a computer is very often a multistage process. Consider, for example, an algorithm that detects the contours of the objects in an image. The algorithm is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, during which it is smoothed and noise is eliminated. In the next step, thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels placed on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of this pipeline is the master processor, which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors, as listed in Table 1.&lt;br /&gt;
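&lt;br /&gt;
The idea of applying four different masks to one image can be sketched in Python as follows. The source does not give the actual masks, so the simple gradient masks below, their assignment to compass directions, and the names MASKS, apply_mask and directional_edges are all assumptions made for illustration. The loop over masks runs sequentially here, whereas on the TMS320C80 each mask would be handled by its own parallel processor.&lt;br /&gt;

```python
# Hypothetical 3x3 directional masks; the cited paper's masks may differ.
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def apply_mask(image, mask):
    """Correlate a 3x3 mask with a 2D list image (interior pixels only)."""
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(mask[i][j] * image[r + i][c + j]
                for i in range(3) for j in range(3))
            for c in range(cols - 2)
        ]
        for r in range(rows - 2)
    ]


def directional_edges(image):
    """Apply every mask to the same image: one data set, four operations."""
    return {name: apply_mask(image, mask) for name, mask in MASKS.items()}


if __name__ == "__main__":
    sample = [[0, 0, 9, 9], [0, 0, 9, 9], [0, 0, 9, 9]]
    print(directional_edges(sample))
```

For the sample image, which contains only a vertical brightness step, the two horizontal-gradient masks respond strongly with opposite signs and the two vertical-gradient masks return zeros.&lt;br /&gt;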
&amp;lt;table border=1&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;Stage&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Processor&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Operation&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;MP&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Image acquisition&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP0&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Low-pass Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Thresholding&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;4&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Laplace Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;5&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Logic Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;	&lt;br /&gt;
&amp;lt;/table&amp;gt;&lt;br /&gt;
Table 1: Pipeline structure for image edge detection[8]&lt;br /&gt;
&lt;br /&gt;
===Recursive MISD Architecture for Pattern Matching&amp;lt;ref&amp;gt;Arne Halaas, Børge Svingen, Magnar Nedland, Pål Sætrom, Ola Snøve, Jr., and Olaf René Birkeland. 2004. A recursive MISD architecture for pattern matching. IEEE Trans. Very Large Scale Integr. Syst. 12, 7 (July 2004), 727-734. DOI=10.1109/TVLSI.2004.830918 http://dx.doi.org/10.1109/TVLSI.2004.830918&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Online multipattern approximate searching is required in situations where several patterns must be matched concurrently in data and no persistent index can be created for look-up. The algorithm requires evaluating the hit function H(Si,P), which operates on strings and a pattern; this involves partitioning P into smaller components and combining the individual results. A recursive approach is required, and the Si values need to be evaluated in parallel. This motivates an architecture with two complete binary trees, which handle data distribution and result processing. The architecture is implemented with as many PEs as are practically possible within the implementation constraints. The data from the stream is fed to the root node of the data distribution tree and simultaneously distributed to several collections of PEs belonging to separate queries, thus implementing the MISD architecture with a single data stream and multiple instruction streams.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''CU''': Control Unit&lt;br /&gt;
&lt;br /&gt;
*'''DS''': Data Stream&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''Heterogeneous System''': A multiprocessor system with different kinds of processors&lt;br /&gt;
&lt;br /&gt;
*'''Homogeneous System''': A multiprocessor system with the same kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''ILP''': Instruction Level Parallelism&lt;br /&gt;
&lt;br /&gt;
*'''IS''': Instruction Stream&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SAPO''': Short for Samočinný počítač&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''VLSI''': Very Large Scale Integration&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71794</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71794"/>
		<updated>2013-02-04T23:14:44Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Edge Detection of Images on TMS320C80 Multiprocessor System */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
==MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of information streams that flow into the processor.&lt;br /&gt;
According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the instruction stream and the data stream.  Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent.&lt;br /&gt;
Based on how many of each type of stream flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD).&lt;br /&gt;
&lt;br /&gt;
The four classifications defined by Flynn, based upon the number of concurrent instruction (or control) and data streams available in the architecture, are shown in Figure 1.&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
==Overview of the MISD Architecture==&lt;br /&gt;
&lt;br /&gt;
To be categorized as an MISD architecture, a machine must have at least two instruction streams and exactly one data stream, as shown in Figure 2.  According to Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, none of which are commercial, general purpose parallel computers.  Instead, MISD architectures are mainly used to create custom hardware in order to solve specific problems.  The majority of the implementations have been created in academia.&lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy.  One common example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction.  It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline.  Similarly, other architectures or applications exist which may be called MISD by some while others disagree.  The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type. Not many instances of this architecture exist, as MIMD and SIMD are often more appropriate for common data parallel techniques. Specifically, they allow better scaling and use of computational resources than MISD does. &lt;br /&gt;
&lt;br /&gt;
However, one prominent example of MISD in computing is the Space Shuttle flight control computers.  Another example of this type of machine is the systolic array, such as the [http://www.cs.cmu.edu/~iwarp/ CMU iWarp] [BORKAR et al., 1990].  All the elements in this array are controlled by a global clock. On each cycle, an element will read a piece of data from one of its neighbors, perform a simple operation (e.g. add the incoming element to a stored value), and prepare a value to be written to a neighbor on the next step.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions.  At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Types of Systolic Arrays&amp;lt;ref&amp;gt;Johnson, K.T.; Hurson, A.R.; Shirazi, B.; , &amp;quot;General-purpose systolic arrays,&amp;quot; Computer , vol.26, no.11, pp.20-31, Nov. 1993 doi: 10.1109/2.241423&amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 4: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is in matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
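&lt;br /&gt;
The behaviour of a single systolic element computing a sum of products can be sketched in Python. This is a software analogy only, not a description of any particular hardware; the names SystolicElement, step and scalar_product are hypothetical.&lt;br /&gt;

```python
class SystolicElement:
    """One multiply-accumulate cell of a special-purpose systolic array."""

    def __init__(self):
        self.accumulator = 0

    def step(self, a, b):
        """Consume one (a, b) pair on a clock step.

        The product is added to the accumulator and the inputs are
        returned unmodified, as they would be passed to the next element.
        """
        self.accumulator += a * b
        return a, b


def scalar_product(a_values, b_values):
    """Shift both operand streams through one element and read the sum."""
    element = SystolicElement()
    for a, b in zip(a_values, b_values):
        element.step(a, b)
    # After the last step the sum of products is shifted out.
    return element.accumulator


if __name__ == "__main__":
    print(scalar_product([1, 2, 3], [4, 5, 6]))
```

Chaining several such elements, each forwarding the pair returned by step to its neighbour, is one way to picture how the matrix product in Figure 4 could be built up.&lt;br /&gt;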
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======Programmable Systolic Array======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
For the latter approach, the workstation downloads a program to each MIMD systolic cell (Figure 5). Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: it contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. The architecture has multiple instruction streams for the PEs and a single data stream passing through all the PEs. Thus, it can also be defined as Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is contested, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2].&lt;br /&gt;
&lt;br /&gt;
======Reconfigurable Systolic Array======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
A reconfigurable systolic array is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array.&lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The homogeneous structure of the Reconfigurable Systolic Array (RSA) architecture, in which each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, the former as arithmetic processors and the latter as flexible routers linking neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on separate data buses, which enables the circuit to continue operating while reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7: Comparison between Architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
In the systolic array configurations described above, multiple processing elements generally execute different instructions, each from its own dedicated instruction stream. A single data stream connects the adjacent PEs. Thus, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors argue, however, that because the data read as input by one processing element is the processed output of the adjacent PE, the data stream cannot be considered single: the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the data stream of systolic arrays and that of the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture and not a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc.; software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 8: MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance in computations can be implemented by multiple processors (likely with different architectures) executing the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority voting method is used to identify the faulty processor. Thus MISD architecture is used to achieve fault tolerance in critical computations.&lt;br /&gt;
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. Major examples include flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system is used to replace the manual flight control by an electronic control interface. The movements of the flight control in the cockpit are converted to electronic signals and are transmitted to the actuators by wires. The control computers use the feedback from the sensors to compute and control the movement of the actuators to provide the expected response. These computers also perform the task to stabilize the aircraft and perform other tasks without the knowledge of the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
There are redundant flight control computers present in the flight control system. If one of the flight control computers crashes, is damaged or is affected by electromagnetic pulses, the other computers can overrule the faulty one, so the flight of the aircraft is unharmed. [[Image:fig13.png|thumb|right|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;Yeh, Y.C.;, &amp;quot;Triple-triple redundant 777 primary flight computer,&amp;quot; Aerospace Applications Conference, 1996. Proceedings., 1996 IEEE , vol.1, no., pp.293-307 vol.1, 3-10 Feb 1996&lt;br /&gt;
doi: 10.1109/AERO.1996.495891&amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern aircraft, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each having three lanes with different processors. The flight control program is compiled for each of the processors, which get their input data from the same data bus but drive their output on individual control buses. Each processor therefore executes different instructions while processing the same data, which makes the system a well-suited example of Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. Because the outputs of the lanes can differ, the median value of the outputs is used to select which lane's output is used. The lane whose output is chosen by the median value select hardware is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism ahead of fault detection and identification by the cross-lane monitoring system. Thus, the MISD-based multi-computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
&lt;br /&gt;
The system described above clearly has individual instruction streams: the architecture of each processor is different, so each has a different instruction set and a different instruction stream. The processors have frame-synchronized input data, which means they work on the same set of data, fed from a single data stream. Thus the flight control system can be classified under MISD architecture.&lt;br /&gt;
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor&amp;lt;ref&amp;gt;M.Gajer;,“ Parallel Image Processing on The TMS320C80 Multiprocessor System”, http://focus.ti.com/pdfs/univ/11-CommunicationAudioSpeech.pdf&amp;lt;/ref&amp;gt;, the pre-processing time needed to be shortened. To reduce the pre-processing time, different types of image processing architectures were presented for the ‘C80: for example, MIMD architecture for image histogram equalization, SIMD architecture for median filtering, and MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows. Here, each processor performs different operations on the same data set. It is often said that MISD architecture was defined only for the sake of completeness of the classification, because in practice it is difficult to find a computer based on MISD architecture&amp;lt;ref&amp;gt;TMS320C80 System Level Synopsis, Texas Instruments, 1995&amp;lt;/ref&amp;gt;. In image processing, however, this assumption need not hold: performing many different operations on the same image is not rare, and MISD architecture is a good fit for such cases. The edge detection operation is considered here as an example of implementing image processing operations on the MISD architecture. The edges are detected in four different directions (north, south, west and east) by means of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
&lt;br /&gt;
Image processing on a computer is very often a multistage process. One example is an algorithm for detecting the contours of objects in an image, which is composed of four basic image processing operations. The image acquired from the camera first undergoes low-pass filtering, which smooths it and removes noise. In the next step thresholding is performed, and each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of the pipeline is the master processor, which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors, as listed in Table 1.&lt;br /&gt;
&amp;lt;table border=1&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;Stage&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Processor&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Operation&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;MP&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Image acquisition&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP0&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Low-pass Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Thresholding&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;4&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Laplace Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;5&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Logic Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;	&lt;br /&gt;
&amp;lt;/table&amp;gt;&lt;br /&gt;
Table 1: Pipeline structure for image edge detection[8]&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''CU''': Control Unit&lt;br /&gt;
&lt;br /&gt;
*'''DS''': Data Stream&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''Heterogeneous System''': A multiprocessor system with different kinds of processors&lt;br /&gt;
&lt;br /&gt;
*'''Homogeneous System''': A multiprocessor system with the same kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''ILP''': Instruction Level Parallelism&lt;br /&gt;
&lt;br /&gt;
*'''IS''': Instruction Stream&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SAPO''': Short for Samočinný počítač&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''VLSI''': Very Large Scale Integration&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71793</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71793"/>
		<updated>2013-02-04T23:13:03Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
==MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of information streams that flow into the processor.&lt;br /&gt;
According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the instruction stream and the data stream.  Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent.&lt;br /&gt;
Based on how many of each type of stream flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD).&lt;br /&gt;
&lt;br /&gt;
The four classifications defined by Flynn, based upon the number of concurrent instruction (or control) and data streams available in the architecture, are shown in Figure 1.&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
==Overview of the MISD Architecture==&lt;br /&gt;
&lt;br /&gt;
To be categorized as an MISD architecture, a system must have at least two instruction streams and exactly one data stream, as shown in Figure 2.  According to Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, none of which are commercial, general-purpose parallel computers.  Instead, MISD architectures are mainly used to create custom hardware to solve specific problems.  The majority of the implementations have been created in academia.&lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy.  One common example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction.  It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline.  Similarly, other architectures or applications exist which may be called MISD by some while others disagree.  The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type. Not many instances of this architecture exist, as MIMD and SIMD are often more appropriate for common data parallel techniques. Specifically, they allow better scaling and use of computational resources than MISD does. &lt;br /&gt;
&lt;br /&gt;
However, one prominent example of MISD in computing is the Space Shuttle flight control computers.  Another example of this kind of machine is the systolic array, such as the [http://www.cs.cmu.edu/~iwarp/ CMU iWarp] [BORKAR et al., 1990].  All the elements in this array are controlled by a global clock. On each cycle, an element reads a piece of data from one of its neighbors, performs a simple operation (e.g. adds the incoming element to a stored value), and prepares a value to be written to a neighbor on the next step.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions.  At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Types of Systolic Arrays&amp;lt;ref&amp;gt;Johnson, K.T.; Hurson, A.R.; Shirazi, B.; , &amp;quot;General-purpose systolic arrays,&amp;quot; Computer , vol.26, no.11, pp.20-31, Nov. 1993 doi: 10.1109/2.241423&amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 4: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
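&lt;br /&gt;
The multiply-accumulate behaviour of a systolic element can be sketched in a few lines of Python. This is only an illustrative model, not the hardware described in the cited paper: the names SystolicPE and systolic_matmul are hypothetical, and the clocked, skewed arrival of operands is flattened into ordinary loops.&lt;br /&gt;

```python
# Hypothetical software model of a systolic processing element (PE).
class SystolicPE:
    def __init__(self):
        self.acc = 0  # running sum of products held inside the element

    def step(self, a, b):
        # Multiply-accumulate, then pass both operands on unmodified.
        self.acc += a * b
        return a, b


def systolic_matmul(A, B):
    # One PE per output cell of the n x n product.
    n = len(A)
    grid = [[SystolicPE() for _ in range(n)] for _ in range(n)]
    # In hardware, operand pairs reach PE (i, j) one per clock tick;
    # here k simply indexes which pair arrives.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                grid[i][j].step(A[i][k], B[k][j])
    # Shift the accumulated sums out of the array.
    return [[grid[i][j].acc for j in range(n)] for i in range(n)]
```

Each element only ever sees one pair of operands at a time, which is why the accumulator, rather than a global memory, holds the partial result.&lt;br /&gt;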
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These combinations are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======Programmable Systolic Array======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing element is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, they can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
For the latter approach, the workstation downloads a program to each MIMD (Figure 5) systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: it contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, it has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be described as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2.]&lt;br /&gt;
&lt;br /&gt;
======Reconfigurable Systolic Array======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The homogeneous character of the Reconfigurable Systolic Array (RSA) architecture, where each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, the former as arithmetic processors and the latter as flexible routers linking the neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on separate data buses, which enables the circuit to continue operating while reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7.Comparison between Architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
The configurations of systolic arrays described above generally have multiple processing elements executing different instructions, with a dedicated instruction stream for each processing element. A single data stream connects the adjacent PEs. On this view, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors argue, however, that because the data read as input by one processing element is the processed output of the adjacent PE, the data stream cannot be considered single: the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the data stream for systolic arrays and for the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture and not a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault-tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc.; software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 8 MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault-tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance for a computation can be implemented by having multiple processors (likely with different architectures) execute the algorithm on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority voting method is used to identify the faulty processor. MISD architecture is thus used to achieve fault tolerance in critical computations.&lt;br /&gt;
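&lt;br /&gt;
As a rough illustration of M-out-of-N voting, the following Python sketch tallies the outputs of N processors and flags those that disagree with the agreed value. The function name majority_vote and its return convention are assumptions made for this example; real voters typically compare within a tolerance rather than for exact equality.&lt;br /&gt;

```python
from collections import Counter


def majority_vote(outputs, m):
    # outputs: one result per processor (N results in total).
    # Tally identical outputs and pick the most common one.
    value, votes = Counter(outputs).most_common(1)[0]
    # Accept the value only if at least m of the N processors agree on it.
    if votes not in range(m, len(outputs) + 1):
        return None, []
    # Processors whose output differs from the agreed value are suspect.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty
```

With three processors and m set to 2, a single faulty processor is outvoted and identified by its index.&lt;br /&gt;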
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault-tolerant architecture. The major examples are flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system replaces manual flight controls with an electronic control interface. The movements of the flight controls in the cockpit are converted to electronic signals and transmitted to the actuators by wires. The control computers use feedback from the sensors to compute and control the movement of the actuators to provide the expected response. These computers also stabilize the aircraft and perform other tasks without the pilot's input. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
There are redundant flight control computers present in the flight control system. If one of the flight control computers crashes, is damaged or is affected by electromagnetic pulses, the other computers can overrule the faulty one, so the flight of the aircraft is unharmed. [[Image:fig13.png|thumb|right|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;Yeh, Y.C.;, &amp;quot;Triple-triple redundant 777 primary flight computer,&amp;quot; Aerospace Applications Conference, 1996. Proceedings., 1996 IEEE , vol.1, no., pp.293-307 vol.1, 3-10 Feb 1996&lt;br /&gt;
doi: 10.1109/AERO.1996.495891&amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern aircraft, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each having three lanes with different processors. The flight control program is compiled for each of the processors, which get their input data from the same data bus but drive their output on individual control buses. Each processor therefore executes different instructions while processing the same data, which makes the system a well-suited example of Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. Because the outputs of the lanes can differ, the median value of the outputs is used to select which lane's output is used. The lane whose output is chosen by the median value select hardware is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism ahead of fault detection and identification by the cross-lane monitoring system. Thus, the MISD-based multi-computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
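&lt;br /&gt;
The fault blocking effect of a median select can be shown with a small Python sketch. It is a simplified illustration only: the function median_select, the lane labels and the numeric values are hypothetical, and the actual 777 hardware is not modelled.&lt;br /&gt;

```python
def median_select(lane_outputs):
    # lane_outputs: hypothetical mapping of lane label to the command
    # value that lane computed from the shared frame of sensor data.
    ranked = sorted(lane_outputs.items(), key=lambda item: item[1])
    # The middle entry is the median for an odd number of lanes.
    lane, value = ranked[len(ranked) // 2]
    return lane, value


# Example: one lane produces a wildly wrong value.
lanes = {"lane_a": 10.0, "lane_b": 10.2, "lane_c": 57.0}
selected_lane, command = median_select(lanes)
```

Because an outlier can never be the middle value of three, a single erroneous lane is blocked from reaching the actuators even before it has been identified as faulty.&lt;br /&gt;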
&lt;br /&gt;
The system described above clearly has individual instruction streams: the architecture of each processor is different, so each has a different instruction set and a different instruction stream. The processors have frame-synchronized input data, which means they work on the same set of data, fed from a single data stream. Thus the flight control system can be classified under MISD architecture.&lt;br /&gt;
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor&amp;lt;ref&amp;gt;M.Gajer;,“ Parallel Image Processing on The TMS320C80 Multiprocessor System”, http://focus.ti.com/pdfs/univ/11-CommunicationAudioSpeech.pdf&amp;lt;/ref&amp;gt;, the pre-processing time needed to be shortened. To reduce the pre-processing time, different types of image processing architectures were presented for the ‘C80: for example, MIMD architecture for image histogram equalization, SIMD architecture for median filtering, and MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows. Here, each processor performs different operations on the same data set. It is often said that MISD architecture was defined only for the sake of completeness of the classification, because in practice it is difficult to find a computer based on MISD architecture&amp;lt;ref&amp;gt;TMS320C80 System Level Synopsis, Texas Instruments, 1995&amp;lt;/ref&amp;gt;. In image processing, however, this assumption need not hold: performing many different operations on the same image is not rare, and MISD architecture is a good fit for such cases. The edge detection operation is considered here as an example of implementing image processing operations on the MISD architecture. The edges are detected in four different directions (north, south, west and east) by means of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
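&lt;br /&gt;
The idea of four different masks applied to one shared image can be sketched in Python as follows. The 3x3 masks are illustrative Prewitt-style compass masks chosen for this example, since the text does not give the masks used on the TMS320C80, and the four filters run one after another here rather than on four parallel processors.&lt;br /&gt;

```python
# Illustrative directional masks; the original masks are not given in the text.
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def apply_mask(image, mask):
    # Correlate a 3x3 mask with the image; border pixels are left at 0.
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(
                mask[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3)
                for j in range(3)
            )
    return out


def directional_edges(image):
    # One mask (instruction stream) per processor, all reading the same image.
    return {name: apply_mask(image, mask) for name, mask in MASKS.items()}
```

The input image is never modified, so every mask sees identical data, which is the property that makes this workload a natural fit for the MISD description.&lt;br /&gt;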
&lt;br /&gt;
Image processing on a computer is very often a multistage process. One example is an algorithm for detecting the contours of objects in an image, which is composed of four basic image processing operations. The image acquired from the camera first undergoes low-pass filtering, which smooths it and removes noise. In the next step thresholding is performed, and each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of the pipeline is the master processor, which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors, as listed in Table 1.&lt;br /&gt;
&amp;lt;table border=1&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;Stage&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Processor&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Operation&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;MP&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Image acquisition&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP0&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Low-pass Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Thresholding&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;4&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Laplace Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;5&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Logic Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;	&lt;br /&gt;
&amp;lt;/table&amp;gt;&lt;br /&gt;
Table 1: Pipeline structure for image edge detection &amp;lt;ref&amp;gt;M.Gajer;,“ Parallel Image Processing on The TMS320C80 Multiprocessor System”, http://focus.ti.com/pdfs/univ/11-CommunicationAudioSpeech.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''CU''': Control Unit&lt;br /&gt;
&lt;br /&gt;
*'''DS''': Data Stream&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''Heterogeneous System''': A multiprocessor system with different kinds of processors&lt;br /&gt;
&lt;br /&gt;
*'''Homogeneous System''': A multiprocessor system with the same kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''ILP''': Instruction Level Parallelism&lt;br /&gt;
&lt;br /&gt;
*'''IS''': Instruction Stream&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SAPO''': Short for Samočinný počítač&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''VLSI''': Very Large Scale Integration&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71792</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71792"/>
		<updated>2013-02-04T23:10:39Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
==MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of information streams that flow into the processor.&lt;br /&gt;
According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the Instruction Stream and the Data Stream.  Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent.&lt;br /&gt;
Based on how many streams of each type flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories.  They are Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD).&lt;br /&gt;
&lt;br /&gt;
The four classifications defined by Flynn, based upon the number of concurrent instruction (or control) and data streams available in the architecture, are shown in Figure 1.&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
==Overview of the MISD Architecture==&lt;br /&gt;
&lt;br /&gt;
In order to be categorized as an MISD architecture, there must be at least two Instruction streams and only one Data stream, as shown in Figure 2.  According to Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, none of which are commercial, general purpose parallel computers.  Instead, MISD architectures are mainly used to create custom hardware in order to solve specific problems.  The majority of the implementations have been created in academia.&lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy.  One common example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction.  It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline.  Similarly, other architectures or applications exist which may be called MISD by some while others disagree.  The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type. Not many instances of this architecture exist, as MIMD and SIMD are often more appropriate for common data parallel techniques. Specifically, they allow better scaling and use of computational resources than MISD does. &lt;br /&gt;
&lt;br /&gt;
However, one prominent example of MISD in computing is the Space Shuttle flight control computers.  Another example is the systolic array, such as the [http://www.cs.cmu.edu/~iwarp/ CMU iWarp] [BORKAR et al., 1990].  All the elements in this array are controlled by a global clock. On each cycle, an element will read a piece of data from one of its neighbors, perform a simple operation (e.g. add the incoming element to a stored value), and prepare a value to be written to a neighbor on the next step.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions.  At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Type of Systolic Arrays&amp;lt;ref&amp;gt;Johnson, K.T.; Hurson, A.R.; Shirazi, B.; , &amp;quot;General-purpose systolic arrays,&amp;quot; Computer , vol.26, no.11, pp.20-31, Nov. 1993 doi: 10.1109/2.241423&amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 4: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
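&lt;br /&gt;
The behaviour of a single systolic element described above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware design from the cited paper; the names SystolicCell, step and scalar_product are hypothetical.&lt;br /&gt;

```python
class SystolicCell:
    """Software model of one processing element (hypothetical name)."""

    def __init__(self):
        self.acc = 0  # accumulator holding the running sum of products

    def step(self, a, b):
        # Multiply-accumulate the operands seen on this clock step.
        self.acc += a * b
        # The operands leave the cell unmodified for the neighbouring cells.
        return a, b


def scalar_product(a_values, b_values):
    """Shift paired a and b values through one cell, then read the accumulator."""
    cell = SystolicCell()
    for a, b in zip(a_values, b_values):
        cell.step(a, b)
    return cell.acc


print(scalar_product([1, 2, 3], [4, 5, 6]))
```

In a full array, the values returned by step would feed the next cell to the right and the next cell below, which is how a matrix product such as the one in Figure 4 is built from many of these elements.&lt;br /&gt;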
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These variants are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======Programmable Systolic Array======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
For the latter approach, the workstation downloads a program to each MIMD (Figure 5) systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: It contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, it has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be described as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2.]&lt;br /&gt;
&lt;br /&gt;
======Reconfigurable Systolic Array======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The Reconfigurable Systolic Array (RSA) circuit design is based on a systolic array architecture consisting of processing elements (PEs) interconnected via switches (SWs), as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The homogeneous structure of the RSA architecture, where each reconfigurable PE cell is connected to its nearest neighbors via configurable SW elements, enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, the former as arithmetic processors and the latter as flexible routers linking neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on separate data buses, which enables the circuit to continue operating while reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7.Comparison between Architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
In the systolic array configurations described above, multiple processing elements generally execute different instructions, each from its own dedicated instruction stream. A single data stream connects the adjacent PEs. On this view, a systolic array can be classified as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors object that the data read as input by one processing element is the processed output of the adjacent PE. The data stream therefore cannot be considered single, because the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the data stream of systolic arrays and that of the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture rather than a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc. Software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 8 MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of the fault tolerance systems by multiple instruction streams or multiple data streams or both. Fault tolerance on computations can be implemented by multiple processors (likely with different architectures) executing the algorithms on the same set of data. The output of each processor is compared with that of the others and M out of N majority voting method is used to determine the faulty processor. Thus MISD architecture is utilized to get the fault tolerance on critical computations.&lt;br /&gt;
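&lt;br /&gt;
The M out of N majority voting step can be illustrated with a short Python sketch. It assumes that processor outputs can be compared for exact equality, which is a simplification, and the function name majority_vote is hypothetical.&lt;br /&gt;

```python
from collections import Counter


def majority_vote(outputs, m):
    """Return (agreed_value, faulty_indices) for N redundant outputs.

    agreed_value is None when no value is reported by at least m
    processors; in that case no processor can be singled out as faulty.
    """
    value, votes = Counter(outputs).most_common(1)[0]
    if m > votes:
        return None, []
    # Every processor that disagrees with the majority is flagged as faulty.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


# Three processors ran the computation on the same data; the third disagrees.
print(majority_vote([7, 7, 9], m=2))
```

With three processors and m set to 2, a single faulty processor is outvoted and identified by its index.&lt;br /&gt;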
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. Major examples include flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system is used to replace the manual flight control by an electronic control interface. The movements of the flight control in the cockpit are converted to electronic signals and are transmitted to the actuators by wires. The control computers use the feedback from the sensors to compute and control the movement of the actuators to provide the expected response. These computers also perform the task to stabilize the aircraft and perform other tasks without the knowledge of the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
There are redundant flight control computers present in the flight control system. If one of the flight-control computers crashes, gets damaged or is affected by electromagnetic pulses, the other computers can overrule the faulty one, so the flight of the aircraft is unaffected. [[Image:fig13.png|thumb|right|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;Yeh, Y.C.;, &amp;quot;Triple-triple redundant 777 primary flight computer,&amp;quot; Aerospace Applications Conference, 1996. Proceedings., 1996 IEEE , vol.1, no., pp.293-307 vol.1, 3-10 Feb 1996&lt;br /&gt;
doi: 10.1109/AERO.1996.495891&amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern aircraft, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each of them having three lanes with different processors. The flight control program is compiled for each of the processors which get the input data from the same data bus but drive the output on their individual control bus. Thus each processor executes different instructions but they process the same data. Thus, it is the best suited example of Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. As the outputs of each lane can be different, the median value of the outputs is used to select the output of the lane to be considered. The lane which has the median value select hardware selected is said to be in “command mode” whereas the other lanes are said to be in “monitoring mode”.  It receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism before the fault detection and identification by the cross-lane monitoring system. Thus, the MISD based multi computer architecture is capable of detecting generic errors in compilers or in complex hardware devices providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
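&lt;br /&gt;
The median select over lane outputs can be sketched as follows. This is a simplified software illustration with made-up output values, not the actual 777 hardware logic; the function name median_select is hypothetical.&lt;br /&gt;

```python
import statistics


def median_select(lane_outputs):
    """Return (selected_value, lane_index) using a median select."""
    # median_low always returns one of the supplied values, never an average.
    selected = statistics.median_low(lane_outputs)
    return selected, lane_outputs.index(selected)


# Three lanes read the same sensor frame; the second lane produces a wild value.
command, lane = median_select([10.0, 55.0, 10.2])
print(command, lane)
```

Because the median of three values is never the most extreme one, a single lane with a grossly wrong output cannot drive the selected command, which is the fault blocking effect described above.&lt;br /&gt;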
&lt;br /&gt;
The above mentioned system clearly has individual Instruction Streams as the architecture of each processor is different, thus different instruction sets and different instruction streams. These processors have frame synchronized input data which means they have same set of data to work upon which is fed from a single data stream. Thus the flight control system can be classified under MISD architecture.&lt;br /&gt;
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System===&lt;br /&gt;
&lt;br /&gt;
For the image pre-processing system built on this multiprocessor &amp;lt;ref&amp;gt;M.Gajer;,“ Parallel Image Processing on The TMS320C80 Multiprocessor System”, http://focus.ti.com/pdfs/univ/11-CommunicationAudioSpeech.pdf&amp;lt;/ref&amp;gt;, the pre-processing time needed to be made shorter. To achieve this, different types of image processing architectures for the ‘C80 were presented: for example, the MIMD architecture is used for image histogram equalization, the SIMD architecture for median filtering, and the MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows. Here, each processor performs a different operation on the same data set. It is often stated that the MISD category was defined only for completeness of the classification, because it is difficult in practice to find a computer based on the MISD architecture&amp;lt;ref&amp;gt;TMS320C80 System Level Synopsis, Texas Instruments, 1995&amp;lt;/ref&amp;gt;. In image processing, however, this need not hold: applying several different operations to the same image is common, and the MISD architecture fits such cases well. Directional edge detection serves as an example of an image processing operation implemented on the MISD architecture. The edges are detected in four different directions (north, south, west and east) by means of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
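&lt;br /&gt;
The idea of four processors applying four different masks to the same image can be sketched in Python. The 3x3 mask coefficients below are an assumption (Prewitt-style masks); the source does not give the masks actually used on the TMS320C80, and the names MASKS, apply_mask and directional_edges are hypothetical.&lt;br /&gt;

```python
# Assumed Prewitt-style directional masks, one per "processor".
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def apply_mask(image, mask):
    """Apply a 3x3 mask to every interior pixel; border pixels stay 0."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(
                mask[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3)
                for j in range(3)
            )
    return out


def directional_edges(image):
    """Same input image, four different operations: one result per direction."""
    return {name: apply_mask(image, mask) for name, mask in MASKS.items()}


# A 4x4 image with a vertical step from dark (0) to bright (9).
image = [[0, 0, 9, 9] for _ in range(4)]
edges = directional_edges(image)
print(edges["east"][1][1], edges["west"][1][1], edges["north"][1][1])
```

The four calls to apply_mask are independent of each other and read the same input, which is why each one can be assigned to its own parallel processor.&lt;br /&gt;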
&lt;br /&gt;
Image processing is very often a multistage process. An example is an algorithm that detects the contours of objects in an image, which is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and removes noise. In the next step thresholding is performed, and each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of the pipeline is the master processor (MP), which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors (PP0 to PP3), as listed in Table 1.&lt;br /&gt;
&amp;lt;table&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;Stage&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Processor&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Operation&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
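&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;MP&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Image acquisition&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP0&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Low-pass Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP1&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Thresholding&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;4&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP2&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Laplace Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;5&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;PP3&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;Logic Filtering&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;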
&amp;lt;/table&amp;gt;&lt;br /&gt;
Table 1: Pipeline structure for image edge detection [9]&lt;br /&gt;
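&lt;br /&gt;
How the five stages of Table 1 overlap in time can be sketched with a small Python schedule. It assumes every stage takes exactly one clock tick per frame, which is a simplification, and the name pipeline_schedule is hypothetical.&lt;br /&gt;

```python
STAGES = [
    "MP: image acquisition",
    "PP0: low-pass filtering",
    "PP1: thresholding",
    "PP2: Laplace filtering",
    "PP3: logic filtering",
]


def pipeline_schedule(num_frames):
    """Return, per clock tick, the frame index held by each stage (None = idle)."""
    ticks = num_frames + len(STAGES) - 1
    schedule = []
    for t in range(ticks):
        row = []
        for s in range(len(STAGES)):
            frame = t - s  # stage s sees frame t - s, if that frame exists
            row.append(frame if frame in range(num_frames) else None)
        schedule.append(row)
    return schedule


for tick, row in enumerate(pipeline_schedule(3)):
    print(tick, row)
```

Once the pipeline is full, every processor works on a different frame during the same tick, so a new result leaves the last stage on each tick even though each frame still passes through all five stages.&lt;br /&gt;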
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''CU''': Control Unit&lt;br /&gt;
&lt;br /&gt;
*'''DS''': Data Stream&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''Heterogeneous System''': A multiprocessor system with different kinds of processors&lt;br /&gt;
&lt;br /&gt;
*'''Homogeneous System''': A multiprocessor system with the same kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''ILP''': Instruction Level Parallelism&lt;br /&gt;
&lt;br /&gt;
*'''IS''': Instruction Stream&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SAPO''': Short for Samočinný počítač&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''VLSI''': Very Large Scale Integration&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71787</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71787"/>
		<updated>2013-02-04T23:05:44Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
==MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of information streams that flow into the processor.&lt;br /&gt;
According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the Instruction Stream and the Data Stream.  Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent.&lt;br /&gt;
Based on how many streams of each type flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories.  They are Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD).&lt;br /&gt;
&lt;br /&gt;
The four classifications defined by Flynn, based upon the number of concurrent instruction (or control) and data streams available in the architecture, are shown in Figure 1.&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
==Overview of the MISD Architecture==&lt;br /&gt;
&lt;br /&gt;
In order to be categorized as an MISD architecture, there must be at least two Instruction streams and only one Data stream, as shown in Figure 2.  According to Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, none of which are commercial, general purpose parallel computers.  Instead, MISD architectures are mainly used to create custom hardware in order to solve specific problems.  The majority of the implementations have been created in academia.&lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy.  One common example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction.  It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline.  Similarly, other architectures or applications exist which may be called MISD by some while others disagree.  The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type. Not many instances of this architecture exist, as MIMD and SIMD are often more appropriate for common data parallel techniques. Specifically, they allow better scaling and use of computational resources than MISD does. &lt;br /&gt;
&lt;br /&gt;
However, one prominent example of MISD in computing is the Space Shuttle flight control computers.  Another example is the systolic array, such as the [http://www.cs.cmu.edu/~iwarp/ CMU iWarp] [BORKAR et al., 1990].  All the elements in this array are controlled by a global clock. On each cycle, an element will read a piece of data from one of its neighbors, perform a simple operation (e.g. add the incoming element to a stored value), and prepare a value to be written to a neighbor on the next step.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions.  At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Types of Systolic Arrays&amp;lt;ref&amp;gt;Johnson, K.T.; Hurson, A.R.; Shirazi, B., &amp;quot;General-purpose systolic arrays,&amp;quot; Computer, vol. 26, no. 11, pp. 20-31, Nov. 1993. doi: 10.1109/2.241423&amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 4: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
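&lt;br /&gt;
The idea behind Figures 3 and 4 can be sketched as a small cycle-by-cycle simulation in Python. This is only an illustration under stated assumptions: the function name systolic_matmul, the output-stationary layout (each processing element keeps one accumulator, passes a values to the right and b values downward) and the input skewing scheme are hypothetical choices, not taken from the cited paper.&lt;br /&gt;

```python
def systolic_matmul(a, b):
    """Simulate an n x n systolic array that computes the matrix product a * b.

    Assumed layout (illustrative): PE (i, j) accumulates c[i][j]; a values
    move left to right, b values move top to bottom, one hop per cycle.
    """
    n = len(a)
    acc = [[0] * n for _ in range(n)]     # one accumulator per PE
    a_reg = [[0] * n for _ in range(n)]   # value each PE forwards to the right
    b_reg = [[0] * n for _ in range(n)]   # value each PE forwards downward
    for t in range(3 * n - 2):            # cycles until the skewed inputs drain
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:
                    k = t - i             # row i enters i cycles late
                    a_in = a[i][k] if k in range(n) else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    k = t - j             # column j enters j cycles late
                    b_in = b[k][j] if k in range(n) else 0
                else:
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in  # multiply-accumulate in place
                new_a[i][j] = a_in        # operands leave the PE unmodified
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b
    return acc
```

Calling systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) returns the ordinary matrix product; the operands themselves pass through every element unchanged, as the paragraph above describes.&lt;br /&gt;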
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These variants are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======Programmable Systolic Array======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, they can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In the MIMD approach, the workstation downloads a program to each systolic cell (Figure 5). Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: it contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, it has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be defined as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2].&lt;br /&gt;
&lt;br /&gt;
======Reconfigurable Systolic Array======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
A reconfigurable systolic array is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array.&lt;br /&gt;
&lt;br /&gt;
The Reconfigurable Systolic Array (RSA) circuit design is based on a systolic array architecture consisting of reconfigurable processing elements (PEs) interconnected via configurable switch (SW) elements, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The homogeneous structure of the RSA, in which each PE cell is connected to its nearest neighbors via SW elements, enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, the former as arithmetic processors and the latter as flexible routers linking neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on separate data buses, which enables the circuit to continue operating while reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7: Comparison between the architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
In the systolic array configurations described above, multiple processing elements generally execute different instructions, each from its own dedicated instruction stream. A single data stream connects adjacent PEs. Thus, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
However, many authors argue that because the data read as input by one processing element is the processed output of the adjacent PE, the data stream cannot be considered single: the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the Data Stream for Systolic Arrays and the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture rather than a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc.; software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 8: MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance on computations can be implemented by multiple processors (likely with different architectures) executing the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority voting method is used to determine the faulty processor. Thus, MISD architecture is used to obtain fault tolerance on critical computations.&lt;br /&gt;
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. Major examples include flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
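&lt;br /&gt;
M-out-of-N voting can be sketched in a few lines of Python. The function name majority_vote and its return convention are hypothetical, chosen only for illustration; real voters are typically implemented in hardware and may compare values within a tolerance rather than for exact equality.&lt;br /&gt;

```python
from collections import Counter


def majority_vote(outputs, m):
    """Return (agreed_value, faulty_indices) under M-out-of-N voting.

    outputs: one result per redundant processor (N = len(outputs)).
    m: minimum number of processors that must agree on a value.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count >= m:
        # Every processor that disagrees with the majority is flagged.
        faulty = [i for i, out in enumerate(outputs) if out != value]
        return value, faulty
    # No value reached the quorum, so no output can be trusted.
    return None, list(range(len(outputs)))
```

With three processors and m set to 2, a single disagreeing processor is identified and its output is masked; if all three disagree, the sketch reports that no quorum was reached.&lt;br /&gt;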
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system replaces manual flight controls with an electronic control interface. The movements of the flight controls in the cockpit are converted to electronic signals and transmitted to the actuators by wires. The control computers use feedback from the sensors to compute and control the movement of the actuators to provide the expected response. These computers also stabilize the aircraft and perform other tasks without pilot input. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
There are redundant flight control computers present in the flight control system. If one of the flight-control computers crashes, gets damaged or is affected by electromagnetic pulses, the other computers can overrule the faulty one, so the flight of the aircraft is unaffected. [[Image:fig13.png|thumb|right|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is deemed faulty and is either ignored or rebooted.&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;Yeh, Y.C., &amp;quot;Triple-triple redundant 777 primary flight computer,&amp;quot; Proceedings of the 1996 IEEE Aerospace Applications Conference, vol. 1, pp. 293-307, 3-10 Feb. 1996. doi: 10.1109/AERO.1996.495891&amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern flight control systems, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each with three lanes that use different processors. The flight control program is compiled for each of the processors, which read their input data from the same data bus but drive their outputs onto individual control buses. Thus, each processor executes different instructions on the same data, which makes the system a well-suited example of Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. Because the outputs of the lanes can differ, the median value of the outputs is used to select the output to be used. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism ahead of fault detection and identification by the cross-lane monitoring system. Thus, the MISD-based multi-computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
&lt;br /&gt;
The system described above clearly has individual instruction streams: because the architecture of each processor is different, the processors have different instruction sets and therefore different instruction streams. The processors receive frame-synchronized input data, which means they work on the same set of data, fed from a single data stream. Thus, the flight control system can be classified as an MISD architecture.&lt;br /&gt;
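&lt;br /&gt;
The median select step can be illustrated with a short Python sketch. The function name median_select, the tolerance parameter and the list of suspect lanes are assumptions made for illustration only; they do not reproduce the actual 777 hardware or its cross-lane monitoring logic.&lt;br /&gt;

```python
def median_select(commands, tolerance):
    """Pick the median of an odd number of lane commands.

    Returns (selected_command, suspect_lanes), where suspect_lanes lists
    the indices whose command differs from the median by more than the
    (hypothetical) tolerance.
    """
    if len(commands) % 2 == 0:
        raise ValueError("median select here assumes an odd number of lanes")
    selected = sorted(commands)[len(commands) // 2]
    suspects = [i for i, c in enumerate(commands)
                if abs(c - selected) > tolerance]
    return selected, suspects
```

Because the median of three values always lies between the other two, a single wildly wrong lane can never become the selected output, which is the fault blocking property the paragraph above refers to.&lt;br /&gt;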
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor [9], the pre-processing time needed to be shortened. To reduce it, different types of image processing architectures for the ‘C80 were presented: for example, a MIMD architecture is used for image histogram equalization, a SIMD architecture for median filtering, and an MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows: each processor performs a different operation on the same data set. It is often stated that the MISD architecture was defined only for the sake of completeness of the classification, because it is difficult in practice to find a computer based on it [10]. In image processing, however, this assumption need not hold: performing many different operations on the same image is common, so the MISD architecture is a good fit for such cases. As an example of implementing image processing operations on the MISD architecture, consider edge detection. The edges should be detected in four different directions (north, south, west and east) using four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
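&lt;br /&gt;
A minimal Python sketch of directional edge detection follows. The source does not give the four filter masks, so the Prewitt-style masks below are an assumption, as are the names MASKS, apply_mask and directional_edges; the loop over masks runs sequentially here, whereas on the TMS320C80 each mask would be handled by its own parallel processor.&lt;br /&gt;

```python
# Hypothetical 3x3 directional masks (Prewitt-style), one per processor.
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def apply_mask(image, mask):
    """Correlate a 3x3 mask with the image interior (borders stay 0)."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(
                mask[dy][dx] * image[y + dy - 1][x + dx - 1]
                for dy in range(3) for dx in range(3)
            )
    return out


def directional_edges(image):
    """Apply every mask to the same image: many operations, one data set."""
    return {name: apply_mask(image, mask) for name, mask in MASKS.items()}
```

For an image with a dark left half and a bright right half, the east and west results respond strongly while the north and south results stay at zero, giving four edge images from a single input.&lt;br /&gt;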
&lt;br /&gt;
Image processing on a computer is very often a multistage process. Consider, for example, an algorithm that detects the contours of objects in an image. This algorithm is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and eliminates noise. In the next step, thresholding is performed, and each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of this pipeline is the master processor (MP), which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors (PP0 to PP3), as listed in Table 1.&lt;br /&gt;
&lt;br /&gt;
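{| class="wikitable"&lt;br /&gt;
|+ Table 1: Pipeline structure for image edge detection [9]&lt;br /&gt;
! Stage !! Processor !! Operation&lt;br /&gt;
|-&lt;br /&gt;
| 1 || MP || Image acquisition&lt;br /&gt;
|-&lt;br /&gt;
| 2 || PP0 || Low-pass filtering&lt;br /&gt;
|-&lt;br /&gt;
| 3 || PP1 || Thresholding&lt;br /&gt;
|-&lt;br /&gt;
| 4 || PP2 || Laplace filtering&lt;br /&gt;
|-&lt;br /&gt;
| 5 || PP3 || Logic filtering&lt;br /&gt;
|}&lt;br /&gt;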
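&lt;br /&gt;
How the five stages of Table 1 overlap in time can be sketched in Python. This is a simplified model that assumes every stage takes exactly one equal-length cycle per frame, which the source does not state; the names STAGES and pipeline_schedule are hypothetical.&lt;br /&gt;

```python
# Stage order taken from Table 1: (processor, operation).
STAGES = [
    ("MP", "Image acquisition"),
    ("PP0", "Low-pass filtering"),
    ("PP1", "Thresholding"),
    ("PP2", "Laplace filtering"),
    ("PP3", "Logic filtering"),
]


def pipeline_schedule(num_frames):
    """Return, per cycle, the frame index each processor works on.

    None means the processor is idle in that cycle (pipeline fill or drain).
    """
    cycles = num_frames + len(STAGES) - 1
    schedule = []
    for t in range(cycles):
        row = {}
        for s, (proc, _operation) in enumerate(STAGES):
            frame = t - s  # stage s sees frame f in cycle f + s
            row[proc] = frame if frame in range(num_frames) else None
        schedule.append(row)
    return schedule
```

Under this assumption, once the pipeline is full all five processors are busy at once, each on a different frame, so one processed frame is completed per cycle.&lt;br /&gt;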
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''CU''': Control Unit&lt;br /&gt;
&lt;br /&gt;
*'''DS''': Data Stream&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''Heterogeneous System''': A multiprocessor system with different kinds of processors&lt;br /&gt;
&lt;br /&gt;
*'''Homogeneous System''': A multiprocessor system with the same kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''ILP''': Instruction Level Parallelism&lt;br /&gt;
&lt;br /&gt;
*'''IS''': Instruction Stream&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SAPO''': Short for Samočinný počítač&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''VLSI''': Very Large Scale Integration&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71786</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71786"/>
		<updated>2013-02-04T23:04:57Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
==MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of information streams that flow into the processor.&lt;br /&gt;
According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the Instruction stream and the Data stream.  Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent.&lt;br /&gt;
Based on how many of each type of stream flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD).&lt;br /&gt;
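&lt;br /&gt;
The classification rule above is simple enough to write down directly. The following Python sketch uses a hypothetical function name, flynn_category, and takes the number of logical instruction and data streams as its inputs.&lt;br /&gt;

```python
def flynn_category(instruction_streams, data_streams):
    """Map logical stream counts to one of Flynn's four categories."""
    if not (instruction_streams >= 1 and data_streams >= 1):
        raise ValueError("each stream count must be at least 1")
    i = "S" if instruction_streams == 1 else "M"  # single vs. multiple
    d = "S" if data_streams == 1 else "M"
    return i + "I" + d + "D"
```

For example, three instruction streams operating on one data stream give flynn_category(3, 1), which is "MISD".&lt;br /&gt;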
&lt;br /&gt;
The four classifications defined by Flynn, based upon the number of concurrent instruction (or control) and data streams available in the architecture, are shown in Figure 1.&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
==Overview of the MISD Architecture==&lt;br /&gt;
&lt;br /&gt;
To be categorized as an MISD architecture, a system must have at least two Instruction streams and exactly one Data stream, as shown in Figure 2.  Per the description in Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on it simultaneously.&lt;br /&gt;
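&lt;br /&gt;
As a purely conceptual illustration, the MISD model can be mimicked in Python by handing every item of one data stream to several different operations. The name run_misd is hypothetical, and the operations run one after another here, whereas real MISD processing elements would execute them simultaneously.&lt;br /&gt;

```python
def run_misd(data_stream, operations):
    """Yield, for each datum, the result of every operation applied to it.

    operations: one callable per processing element (instruction stream).
    data_stream: the single stream of data shared by all of them.
    """
    for datum in data_stream:
        yield tuple(op(datum) for op in operations)
```

Calling list(run_misd([1, 2, 3], [abs, lambda x: x * x])) produces one tuple per datum, with one entry per processing element.&lt;br /&gt;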
&lt;br /&gt;
[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, and none of them are commercial, general-purpose parallel computers.  Instead, MISD architectures are mainly used to build custom hardware that solves specific problems.  Most of these implementations were created in academia.&lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy.  One common example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction.  It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline.  Similarly, other architectures or applications exist which may be called MISD by some while others disagree.  The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type. Not many instances of this architecture exist, as MIMD and SIMD are often more appropriate for common data parallel techniques. Specifically, they allow better scaling and use of computational resources than MISD does. &lt;br /&gt;
&lt;br /&gt;
However, one prominent example of MISD in computing is the Space Shuttle flight control computers.  Another example of this class of machine is the systolic array, such as the [http://www.cs.cmu.edu/~iwarp/ CMU iWarp] [BORKAR et al., 1990].  All the elements in this array are controlled by a global clock. On each cycle, an element reads a piece of data from one of its neighbors, performs a simple operation (e.g. adding the incoming element to a stored value), and prepares a value to be written to a neighbor on the next step.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions.  At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Types of Systolic Arrays&amp;lt;ref&amp;gt;http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf General Purpose Systolic Arrays &amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 4: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These variants are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======Programmable Systolic Array======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, they can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In the MIMD approach, the workstation downloads a program to each systolic cell (Figure 5). Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: it contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, it has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be defined as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2].&lt;br /&gt;
&lt;br /&gt;
======Reconfigurable Systolic Array======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
A reconfigurable systolic array is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array.&lt;br /&gt;
&lt;br /&gt;
The Reconfigurable Systolic Array (RSA) circuit design is based on a systolic array architecture consisting of reconfigurable processing elements (PEs) interconnected via configurable switch (SW) elements, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The homogeneous structure of the RSA, in which each PE cell is connected to its nearest neighbors via SW elements, enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, the former as arithmetic processors and the latter as flexible routers linking neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on separate data buses, which enables the circuit to continue operating while reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7: Comparison between the architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
In the systolic array configurations described above, multiple processing elements generally execute different instructions, each from its own dedicated instruction stream. A single data stream connects adjacent PEs. Thus, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
However, many authors argue that because the data read as input by one processing element is the processed output of the adjacent PE, the data stream cannot be considered single: the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the Data Stream for Systolic Arrays and the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture rather than a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault-tolerant systems are designed to handle possible failures in software, hardware, or interfaces. Hardware faults include hard disk failures and input or output device failures; software and interface faults include driver failures, operator errors, and installation of unexpected software. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 8 MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault-tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance for computations can be implemented by having multiple processors (likely with different architectures) execute the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority vote is used to identify the faulty processor. In this way the MISD architecture provides fault tolerance for critical computations.&lt;br /&gt;
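&lt;br /&gt;
The M-out-of-N vote described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, its interface, and the sample values are hypothetical and are not taken from any of the cited systems.&lt;br /&gt;

```python
from collections import Counter


def majority_vote(outputs, m):
    """Return (agreed_value, faulty_indices) using M-out-of-N voting.

    outputs: list of N results, one per redundant processor (hypothetical interface).
    m: minimum number of matching results required to accept a value.
    """
    # Find the most common result and how many processors produced it.
    value, count = Counter(outputs).most_common(1)[0]
    if m > count:
        raise ValueError("no value reached the required agreement")
    # Any processor that disagrees with the accepted value is flagged as faulty.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


value, faulty = majority_vote([42, 42, 41], m=2)
print(value, faulty)  # prints: 42 [2]
```

With three processors and m=2, a single disagreeing processor is outvoted and identified by its index; if no value reaches m matching results, the vote fails rather than guessing.&lt;br /&gt;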
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault-tolerant architecture. Major examples include flight control systems, nuclear power plants, satellite systems, and super collider experiment systems. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system replaces manual flight controls with an electronic control interface. The movements of the flight controls in the cockpit are converted to electronic signals and transmitted to the actuators by wires. The control computers use feedback from the sensors to compute and control the movement of the actuators so as to provide the expected response. These computers also stabilize the aircraft and perform other tasks without the pilot's knowledge. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
The flight control system contains redundant flight control computers. If one of the flight control computers crashes, is damaged, or is affected by electromagnetic pulses, the others can overrule the faulty one, so the flight of the aircraft is unharmed. [[Image:fig13.png|thumb|right|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;Yeh, Y.C.;, &amp;quot;Triple-triple redundant 777 primary flight computer,&amp;quot; Aerospace Applications Conference, 1996. Proceedings., 1996 IEEE , vol.1, no., pp.293-307 vol.1, 3-10 Feb 1996&lt;br /&gt;
doi: 10.1109/AERO.1996.495891&amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern aircraft, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each having three lanes with different processors. The flight control program is compiled for each of the processors, which receive their input data from the same data bus but drive their outputs on individual control buses. Each processor therefore executes different instructions while processing the same data, which makes the system a well-suited example of Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. Because the outputs of the lanes can differ, the median value of the outputs is used to select the output to be used. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism ahead of fault detection and identification by the cross-lane monitoring system. Thus, the MISD-based multi-computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
&lt;br /&gt;
The system described above clearly has individual instruction streams: because the architecture of each processor is different, each has a different instruction set and therefore a different instruction stream. The processors receive frame-synchronized input data, which means they work on the same set of data, fed from a single data stream. Thus the flight control system can be classified as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor [9], the pre-processing time needed to be shortened. To reduce it, different types of image processing architectures for the ‘C80 were presented: for example, an MIMD architecture for image histogram equalization, an SIMD architecture for median filtering, and an MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows. Each processor performs a different operation on the same data set. It is often said that the MISD architecture was defined only for the sake of completeness of the classification, because it is difficult in practice to find a computer based on it [10]. In image processing, however, this assumption need not hold: performing many different operations on the same image is common, so the MISD architecture is a good fit for such cases. As an example of implementing image processing operations on the MISD architecture, consider edge detection. The edges are to be detected in four different directions (north, south, west and east) by means of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
&lt;br /&gt;
Image processing on a computer is very often a multistage process. Consider, for example, an algorithm that detects the contours of objects in an image. The algorithm is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and eliminates noise. In the next step, thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of the pipeline is the master processor, which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors, as listed in Table 1.&lt;br /&gt;
&lt;br /&gt;
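{| class="wikitable"&lt;br /&gt;
|+ Table 1: Pipeline structure for image edge detection [9]&lt;br /&gt;
! Stage !! Processor !! Operation&lt;br /&gt;
|-&lt;br /&gt;
| 1 || MP || Image acquisition&lt;br /&gt;
|-&lt;br /&gt;
| 2 || PP0 || Low-pass filtering&lt;br /&gt;
|-&lt;br /&gt;
| 3 || PP1 || Thresholding&lt;br /&gt;
|-&lt;br /&gt;
| 4 || PP2 || Laplace filtering&lt;br /&gt;
|-&lt;br /&gt;
| 5 || PP3 || Logic filtering&lt;br /&gt;
|}&lt;br /&gt;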
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''CU''': Control Unit&lt;br /&gt;
&lt;br /&gt;
*'''DS''': Data Stream&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''Heterogeneous Systems''': A multiprocessor system with different kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''Homogeneous System''': A multiprocessor system with same kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''ILP''': Instruction Level Parallelism&lt;br /&gt;
&lt;br /&gt;
*'''IS''': Instruction Stream&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SAPO''': Short for Samočinný počítač&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''VLSI''': Very Large Scale Integration&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71782</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71782"/>
		<updated>2013-02-04T23:01:25Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
==MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of information streams that flow into the processor.&lt;br /&gt;
According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the instruction stream and the data stream.  Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent.&lt;br /&gt;
Based on how many of each type of stream flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD).&lt;br /&gt;
&lt;br /&gt;
The four classifications defined by Flynn, based on the number of concurrent instruction (or control) and data streams available in the architecture, are shown in Figure 1.&lt;ref&gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&lt;/ref&gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
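&lt;br /&gt;
As a small illustration of the classification rule, the four categories can be derived from the two stream counts alone. The function name and its arguments are hypothetical and serve only to restate the taxonomy in code.&lt;br /&gt;

```python
def flynn_class(instruction_streams, data_streams):
    """Map logical stream counts to one of Flynn's four categories."""
    if not min(instruction_streams, data_streams) >= 1:
        raise ValueError("at least one stream of each kind is required")
    # "S" for a single stream, "M" for more than one.
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return i + "I" + d + "D"


print(flynn_class(3, 1))  # prints: MISD
```

Only the logical number of streams matters here, which mirrors the note above that physical independence of the streams is irrelevant to the classification.&lt;br /&gt;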
&lt;br /&gt;
==Overview of the MISD Architecture==&lt;br /&gt;
&lt;br /&gt;
To be categorized as an MISD architecture, a system must have at least two instruction streams and only one data stream, as shown in Figure 2.  According to Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, none of which are commercial, general purpose parallel computers.  Instead, MISD architectures are mainly used to create custom hardware in order to solve specific problems.  The majority of the implementations have been created in academia.&lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy.  One common example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction.  It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline.  Similarly, other architectures or applications exist which may be called MISD by some while others disagree.  The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type. Not many instances of this architecture exist, as MIMD and SIMD are often more appropriate for common data parallel techniques. Specifically, they allow better scaling and use of computational resources than MISD does. &lt;br /&gt;
&lt;br /&gt;
However, one prominent example of MISD in computing is the Space Shuttle flight control computers.  Another example of this kind of machine is the systolic array, such as the [http://www.cs.cmu.edu/~iwarp/ CMU iWarp] [BORKAR et al., 1990].  All the elements in this array are controlled by a global clock. On each cycle, an element will read a piece of data from one of its neighbors, perform a simple operation (e.g. add the incoming element to a stored value), and prepare a value to be written to a neighbor on the next step.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions.  Each processor at each step takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Type of Systolic Arrays&amp;lt;ref&amp;gt;http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf General Purpose Systolic Arrays &amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 4: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
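&lt;br /&gt;
The behavior of a single systolic element computing a scalar product can be sketched in software as follows. The class and function names are hypothetical; the sketch models one clock step per call and is not a description of any particular hardware design.&lt;br /&gt;

```python
class SystolicElement:
    """One processing element with a local accumulator (hypothetical model)."""

    def __init__(self):
        self.acc = 0

    def step(self, a, b):
        # Multiply the incoming operands and add the product to the accumulator.
        self.acc += a * b
        # The operands leave unmodified so the next element could use them.
        return a, b


def scalar_product(a_values, b_values):
    """Shift paired operands through one element, then read out the accumulator."""
    pe = SystolicElement()
    for a, b in zip(a_values, b_values):
        pe.step(a, b)
    return pe.acc


print(scalar_product([1, 2, 3], [4, 5, 6]))  # prints: 32
```

In a full array, the values returned by step would be fed to the neighboring elements on the next clock step, which is how a matrix product is built from many such scalar products.&lt;br /&gt;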
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======Programmable Systolic Array======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
For the latter approach, the workstation downloads a program to each MIMD systolic cell (Figure 5). Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: it contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, it has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be defined as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2.]&lt;br /&gt;
&lt;br /&gt;
======Reconfigurable Systolic Array======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The homogeneous structure of the Reconfigurable Systolic Array (RSA) architecture, in which each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, the former as arithmetic processors and the latter as flexible routers linking neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on separate data buses, which enables the circuit to continue operating while reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7.Comparison between Architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
In the configurations described above, a systolic array generally has multiple processing elements, each executing different instructions from its own dedicated instruction stream. A single data stream connects the adjacent PEs. On this view, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors argue, however, that because the data read as input by one processing element is the processed output of the adjacent PE, the data stream cannot be considered single: the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the data stream of a systolic array and that of the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture rather than a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault-tolerant systems are designed to handle possible failures in software, hardware, or interfaces. Hardware faults include hard disk failures and input or output device failures; software and interface faults include driver failures, operator errors, and installation of unexpected software. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 8 MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault-tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance for computations can be implemented by having multiple processors (likely with different architectures) execute the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority vote is used to identify the faulty processor. In this way the MISD architecture provides fault tolerance for critical computations.&lt;br /&gt;
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault-tolerant architecture. Major examples include flight control systems, nuclear power plants, satellite systems, and super collider experiment systems. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system replaces manual flight controls with an electronic control interface. The movements of the flight controls in the cockpit are converted to electronic signals and transmitted to the actuators by wires. The control computers use feedback from the sensors to compute and control the movement of the actuators so as to provide the expected response. These computers also stabilize the aircraft and perform other tasks without the pilot's knowledge. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
The flight control system contains redundant flight control computers. If one of the flight control computers crashes, is damaged, or is affected by electromagnetic pulses, the others can overrule the faulty one, so the flight of the aircraft is unharmed. [[Image:fig13.png|thumb|right|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;http://www.citemaster.net/getdoc/8767/R8.pdf Y.C. (Bob) Yeh, Boeing Commercial Airplane Group, &amp;quot;Triple-Triple Redundant 777 Primary Flight Computer&amp;quot; &amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern aircraft, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each having three lanes with different processors. The flight control program is compiled for each of the processors, which receive their input data from the same data bus but drive their outputs on individual control buses. Each processor therefore executes different instructions while processing the same data, which makes the system a well-suited example of Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. Because the outputs of the lanes can differ, the median value of the outputs is used to select the output to be used. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism ahead of fault detection and identification by the cross-lane monitoring system. Thus, the MISD-based multi-computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
&lt;br /&gt;
The system described above clearly has individual instruction streams: because the architecture of each processor is different, each has a different instruction set and therefore a different instruction stream. The processors receive frame-synchronized input data, which means they work on the same set of data, fed from a single data stream. Thus the flight control system can be classified as an MISD architecture.&lt;br /&gt;
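&lt;br /&gt;
The median select of lane outputs can be illustrated with a minimal sketch. It assumes an odd number of lanes (three here) and uses made-up command values; the function name is hypothetical and the real selection is done in dedicated hardware, not in software like this.&lt;br /&gt;

```python
def median_select(lane_outputs):
    """Return the median of the proposed commands from an odd number of lanes."""
    if len(lane_outputs) % 2 == 0:
        raise ValueError("median select in this sketch needs an odd number of lanes")
    ordered = sorted(lane_outputs)
    # The middle element of the sorted values is the median.
    return ordered[len(ordered) // 2]


# Two lanes roughly agree; the third (57.0) is a hypothetical faulty output.
print(median_select([10.2, 10.1, 57.0]))  # prints: 10.2
```

Because the median of three values always lies between the other two, a single wildly wrong lane output can never be the one selected, which is why the median select acts as a fault blocking mechanism.&lt;br /&gt;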
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System===&lt;br /&gt;
&lt;br /&gt;
The image pre-processing system of this Multiprocessor [9] , the pre-image processing time needed to be made shorter. To achieve the aim of reducing the pre-processing time, different types of image processing architectures for ‘C80 were presented. For example MIMD architecture is used for image histogram equalization; SIMD architecture for median filtering; MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows. Here, each processor performs different operations on the same data set. It is mentioned at many places that MISD architecture was defined only for the sake of completeness of classification due to difficulty in practice to find a computer based on MISD architecture [10]. But in case of Image Processing, this assumption need not be true as here making many different operations on same image is rare and hence MISD architecture is a best fit for such cases. As an example of implementation of image processing operations for the MISD architecture the edge detection operation will be considered. The edges should be detected in four different directions (north, south, west and east) by the medium of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. Finally one obtains four images with detected edges, each in different direction.&lt;br /&gt;
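&lt;br /&gt;
The idea of applying four different masks to the same image can be sketched as below. The source does not give the masks used on the TMS320C80, so the Prewitt-style compass masks here are an assumption, as are all function names; the loop runs the four operations one after another, whereas on the ‘C80 each would run on its own parallel processor.&lt;br /&gt;

```python
# Assumed 3x3 compass masks (Prewitt-style); not taken from the cited work.
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def apply_mask(image, mask):
    """Apply a 3x3 mask to every interior pixel of a list-of-lists image."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(1, rows - 1):
        row = []
        for c in range(1, cols - 1):
            total = sum(mask[i][j] * image[r - 1 + i][c - 1 + j]
                        for i in range(3) for j in range(3))
            row.append(total)
        out.append(row)
    return out


def directional_edges(image):
    """Different operations (masks) on the same data: one result per direction."""
    return {name: apply_mask(image, mask) for name, mask in MASKS.items()}


image = [[9, 9, 9], [0, 0, 0], [0, 0, 0]]  # bright top row, dark below
print(directional_edges(image)["north"])  # prints: [[27]]
```

Every mask reads the same input image and none of them modifies it, which is the property that lets the four computations be assigned to four processors without coordination.&lt;br /&gt;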
&lt;br /&gt;
Image processing on a computer is very often a multistage process. An example is an algorithm that detects the contours of the objects in an image, which is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and eliminates noise. In the next step, thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of the pipeline is the master processor (MP), which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors (PP0 to PP3), as listed in Table 1.&lt;br /&gt;
&lt;br /&gt;
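{| class="wikitable"&lt;br /&gt;
|+ Table 1: Pipeline structure for image edge detection [9]&lt;br /&gt;
! Stage !! Processor !! Operation&lt;br /&gt;
|-&lt;br /&gt;
| 1 || MP || Image acquisition&lt;br /&gt;
|-&lt;br /&gt;
| 2 || PP0 || Low-pass Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 3 || PP1 || Thresholding&lt;br /&gt;
|-&lt;br /&gt;
| 4 || PP2 || Laplace Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 5 || PP3 || Logic Filtering&lt;br /&gt;
|}&lt;br /&gt;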
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''CU''': Control Unit&lt;br /&gt;
&lt;br /&gt;
*'''DS''': Data Stream&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''Heterogeneous System''': A multiprocessor system with different kinds of processors&lt;br /&gt;
&lt;br /&gt;
*'''Homogeneous System''': A multiprocessor system with the same kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''ILP''': Instruction Level Parallelism&lt;br /&gt;
&lt;br /&gt;
*'''IS''': Instruction Stream&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SAPO''': Short for Samočinný počítač&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''VLSI''': Very Large Scale Integration&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71781</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71781"/>
		<updated>2013-02-04T23:00:48Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
==MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of information streams that flow into the processor.&lt;br /&gt;
According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the Instruction Stream and the Data Stream.  Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent.&lt;br /&gt;
Based on how many streams of each type flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD).&lt;br /&gt;
&lt;br /&gt;
The four classifications defined by Flynn, based upon the number of concurrent instruction (or control) and data streams available in the architecture, are shown in Figure 1.&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
==Overview of the MISD Architecture==&lt;br /&gt;
&lt;br /&gt;
To be categorized as an MISD architecture, a system must have at least two instruction streams and exactly one data stream, as shown in Figure 2.  According to Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, none of which are commercial, general purpose parallel computers.  Instead, MISD architectures are mainly used to create custom hardware in order to solve specific problems.  The majority of the implementations have been created in academia.&lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy.  One common example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction.  It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline.  Similarly, other architectures or applications exist which may be called MISD by some while others disagree.  The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type. Not many instances of this architecture exist, as MIMD and SIMD are often more appropriate for common data parallel techniques. Specifically, they allow better scaling and use of computational resources than MISD does. &lt;br /&gt;
&lt;br /&gt;
However, one prominent example of MISD in computing is the Space Shuttle flight control computers.  Another example of such a machine is the systolic array, such as the [http://www.cs.cmu.edu/~iwarp/ CMU iWarp] [BORKAR et al., 1990].  All the elements in this array are controlled by a global clock. On each cycle, an element will read a piece of data from one of its neighbors, perform a simple operation (e.g. add the incoming element to a stored value), and prepare a value to be written to a neighbor on the next step.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions.  At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Type of Systolic Arrays&amp;lt;ref&amp;gt;http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf General Purpose Systolic Arrays &amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 4: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
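&lt;br /&gt;
The behaviour of a single systolic element computing a sum of products can be sketched as follows. This is a minimal illustration, not the circuit from the cited paper: the class name SystolicCell, its step method and the sample vectors are hypothetical.&lt;br /&gt;

```python
class SystolicCell:
    """One multiply-accumulate cell: passes a and b through, keeps the sum."""

    def __init__(self):
        self.acc = 0

    def step(self, a, b):
        # Accumulate the product, then forward the inputs unmodified
        # so that a neighboring cell could consume them on the next step.
        self.acc += a * b
        return a, b


cell = SystolicCell()
for a, b in zip([1, 2, 3], [4, 5, 6]):
    cell.step(a, b)
print(cell.acc)  # 1*4 + 2*5 + 3*6 = 32
```

In a full array, the values returned by step would be fed to neighboring cells, so that each cell accumulates one element of a matrix product.&lt;br /&gt;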
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  This is referred to as Systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======Programmable Systolic Array======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable either at a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
For the latter approach, the workstation downloads a program to each MIMD (Figure 5) systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: It contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, the architecture has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be described as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2.]&lt;br /&gt;
&lt;br /&gt;
======Reconfigurable Systolic Array======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The homogeneous characteristic of the Reconfigurable Systolic Array (RSA) architecture, where each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time sharing computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically with the former as an arithmetic processor and the latter as a flexible router linking the neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on a separate data bus, which enables the circuit to continue operating while reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7.Comparison between Architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
The configurations of systolic arrays described above generally have multiple processing elements, each executing different instructions from its own dedicated instruction stream. A single data stream connects the adjacent PEs. On this view, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors argue that, because the data read as input by one processing element is the processed output of the adjacent PE, the data stream cannot be considered single: the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the data stream for systolic arrays and that of the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture and not a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc.; software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 8 MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of the fault tolerance systems by multiple instruction streams or multiple data streams or both. Fault tolerance on computations can be implemented by multiple processors (likely with different architectures) executing the algorithms on the same set of data. The output of each processor is compared with that of the others and M out of N majority voting method is used to determine the faulty processor. Thus MISD architecture is utilized to get the fault tolerance on critical computations.&lt;br /&gt;
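&lt;br /&gt;
The M out of N voting described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name majority_vote, the processor labels and the sample outputs are hypothetical, and exact equality is used for comparison, whereas a real system would likely compare within a tolerance.&lt;br /&gt;

```python
from collections import Counter


def majority_vote(outputs, m):
    """Return (agreed value, disagreeing processors) if m processors agree."""
    # Find the most common output value and how many processors produced it.
    value, count = Counter(outputs.values()).most_common(1)[0]
    if count >= m:
        faulty = sorted(name for name, out in outputs.items() if out != value)
        return value, faulty
    raise RuntimeError("no value reached the required agreement")


# Three processors run different code on the same data; one disagrees.
outputs = {"p0": 42, "p1": 42, "p2": 41}
print(majority_vote(outputs, 2))
```

With N = 3 and M = 2, the sketch accepts the value shared by two processors and reports the third as the suspected faulty one.&lt;br /&gt;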
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. The major examples are flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system is used to replace the manual flight control by an electronic control interface. The movements of the flight control in the cockpit are converted to electronic signals and are transmitted to the actuators by wires. The control computers use the feedback from the sensors to compute and control the movement of the actuators to provide the expected response. These computers also perform the task to stabilize the aircraft and perform other tasks without the knowledge of the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
There are redundant flight control computers present in the flight control system. If one of the flight-control computers crashes, gets damaged or is affected by electromagnetic pulses, the other computer can overrule the faulty one and hence the flight of the aircraft is unharmed. [[Image:fig13.png|thumb|right|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;http://www.citemaster.net/getdoc/8767/R8.pdf Y.C. (Bob) Yeh, Boeing Commercial Airplane Group, &amp;quot;Triple-Triple Redundant 777 Primary Flight Computer&amp;quot; &amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern computers, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each of them having three lanes with different processors. The flight control program is compiled for each of the processors which get the input data from the same data bus but drive the output on their individual control bus. Thus each processor executes different instructions but they process the same data. Thus, it is the best suited example of Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. Because the outputs of the lanes can differ, the median value of the outputs is used to select the output that is acted upon. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism before fault detection and identification by the cross-lane monitoring system. Thus, the MISD-based multi-computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
&lt;br /&gt;
The system described above clearly has individual instruction streams: because each processor has a different architecture, each has a different instruction set and therefore a different instruction stream. The processors receive frame-synchronized input data, which means they all work on the same set of data fed from a single data stream. Thus the flight control system can be classified as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor [9], the image pre-processing time needed to be made shorter. To reduce the pre-processing time, different types of image processing architectures for the ‘C80 were presented. For example, MIMD architecture is used for image histogram equalization, SIMD architecture for median filtering, and MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows. Here, each processor performs a different operation on the same data set. It is often stated that the MISD class was defined only for completeness of the classification, because it is difficult in practice to find a computer based on MISD architecture [10]. In image processing, however, this assumption need not hold: performing many different operations on the same image is common, and MISD architecture is a good fit for such cases. Directional edge detection is considered here as an example of an image processing operation implemented on the MISD architecture. The edges are detected in four different directions (north, south, west and east) by means of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
&lt;br /&gt;
Image processing on a computer is very often a multistage process. An example is an algorithm that detects the contours of the objects in an image, which is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and eliminates noise. In the next step, thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of the pipeline is the master processor (MP), which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors (PP0 to PP3), as listed in Table 1.&lt;br /&gt;
&lt;br /&gt;
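{| class="wikitable"&lt;br /&gt;
|+ Table 1: Pipeline structure for image edge detection [9]&lt;br /&gt;
! Stage !! Processor !! Operation&lt;br /&gt;
|-&lt;br /&gt;
| 1 || MP || Image acquisition&lt;br /&gt;
|-&lt;br /&gt;
| 2 || PP0 || Low-pass Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 3 || PP1 || Thresholding&lt;br /&gt;
|-&lt;br /&gt;
| 4 || PP2 || Laplace Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 5 || PP3 || Logic Filtering&lt;br /&gt;
|}&lt;br /&gt;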
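&lt;br /&gt;
How successive camera frames move through the five stages of Table 1 can be sketched as a simple schedule. This is an illustration only: the function name schedule is hypothetical, and the sketch assumes that every stage takes exactly one clock tick per frame, which the source does not state.&lt;br /&gt;

```python
STAGES = ["MP", "PP0", "PP1", "PP2", "PP3"]  # processors from Table 1


def schedule(num_frames):
    """Return, per clock tick, which frame index each stage is working on."""
    ticks = []
    for t in range(num_frames + len(STAGES) - 1):
        row = {}
        for s, name in enumerate(STAGES):
            frame = t - s
            # A stage is busy only if its frame index is valid; else idle.
            row[name] = frame if frame in range(num_frames) else None
        ticks.append(row)
    return ticks


for t, row in enumerate(schedule(3)):
    print(t, row)
```

Once the pipeline is full, every processor works on a different frame in the same tick, which is why some authors object to classifying a pipeline as strictly single data.&lt;br /&gt;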
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''CU''': Control Unit&lt;br /&gt;
&lt;br /&gt;
*'''DS''': Data Stream&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''Heterogeneous System''': A multiprocessor system with different kinds of processors&lt;br /&gt;
&lt;br /&gt;
*'''Homogeneous System''': A multiprocessor system with the same kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''ILP''': Instruction Level Parallelism&lt;br /&gt;
&lt;br /&gt;
*'''IS''': Instruction Stream&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SAPO''': Short for Samočinný počítač&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''VLSI''': Very Large Scale Integration&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71780</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71780"/>
		<updated>2013-02-04T22:58:41Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* The Flight Control System – MISD Example for fault tolerance */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
==MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of information streams that flow into the processor.&lt;br /&gt;
According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the Instruction Stream and the Data Stream.  Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent.&lt;br /&gt;
Based on how many streams of each type flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD).&lt;br /&gt;
&lt;br /&gt;
The four classifications defined by Flynn, based upon the number of concurrent instruction (or control) and data streams available in the architecture, are shown in Figure 1.&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
==Overview of the MISD Architecture==&lt;br /&gt;
&lt;br /&gt;
To be categorized as an MISD architecture, a system must have at least two instruction streams and exactly one data stream, as shown in Figure 2.  According to Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, none of which are commercial, general purpose parallel computers.  Instead, MISD architectures are mainly used to create custom hardware in order to solve specific problems.  The majority of the implementations have been created in academia.&lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy.  One common example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction.  It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline.  Similarly, other architectures or applications exist which may be called MISD by some while others disagree.  The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type. Not many instances of this architecture exist, as MIMD and SIMD are often more appropriate for common data parallel techniques. Specifically, they allow better scaling and use of computational resources than MISD does. &lt;br /&gt;
&lt;br /&gt;
However, one prominent example of MISD in computing is the Space Shuttle flight control computers.  Another example of such a machine is the systolic array, such as the [http://www.cs.cmu.edu/~iwarp/ CMU iWarp] [BORKAR et al., 1990].  All the elements in this array are controlled by a global clock. On each cycle, an element will read a piece of data from one of its neighbors, perform a simple operation (e.g. add the incoming element to a stored value), and prepare a value to be written to a neighbor on the next step.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions.  At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Types of Systolic Arrays&amp;lt;ref&amp;gt;http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf General Purpose Systolic Arrays &amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 7: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element, which multiplies each pair and adds the product to an accumulator. The a’s and b’s exit the processing element unmodified so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
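&lt;br /&gt;
The behavior of the single systolic element in Figure 3 can be sketched in a few lines of Python. This is only an illustrative software model, not a description of any real chip: the class name SystolicCell, the function scalar_product and the convention of one operand pair per clock tick are assumptions made for the example.&lt;br /&gt;

```python
class SystolicCell:
    """One multiply-accumulate element (hypothetical name, for illustration)."""

    def __init__(self):
        self.acc = 0  # running sum of products

    def step(self, a, b):
        # Accumulate the product, then pass a and b on unmodified
        # so that a neighboring element could consume them.
        self.acc += a * b
        return a, b


def scalar_product(a_values, b_values):
    cell = SystolicCell()
    for a, b in zip(a_values, b_values):
        cell.step(a, b)  # assumed: one operand pair per clock tick
    return cell.acc  # the sum is shifted out of the accumulator at the end


result = scalar_product([1, 2, 3], [4, 5, 6])
```

In this model the operands leave the cell unchanged, which is what lets a chain of such cells share the same a and b streams.&lt;br /&gt;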
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These combinations are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======Programmable Systolic Array======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing element is programmable, and a program controls the dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
For the latter (MIMD) approach, the workstation downloads a program to each MIMD systolic cell (Figure 5). Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: it contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, it has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be described as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2].&lt;br /&gt;
&lt;br /&gt;
======Reconfigurable Systolic Array======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
A reconfigurable systolic array is an array of systolic elements that can be programmed at the lowest level.  Recent gate-density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array.&lt;br /&gt;
&lt;br /&gt;
The Reconfigurable Systolic Array (RSA) circuit design is based on a systolic array architecture consisting of processing elements (PEs) interconnected via switches (SWs), as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The architecture is homogeneous: each reconfigurable PE cell is connected to its nearest neighbors via configurable SW elements, which enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, the former as arithmetic processors and the latter as flexible routers linking neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on a separate data bus, which enables the circuit to continue operating while reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7. Comparison between Architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
From the configurations described above, it can be seen that systolic arrays generally have multiple processing elements, each executing different instructions from its own dedicated instruction stream. A single data stream connects the adjacent PEs. On this view, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors disagree: because the data read as input by one processing element is the processed output of the adjacent PE, the data stream cannot be considered single, since the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the data streams of systolic arrays and of the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture rather than a “Single Data” architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault-tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc., while software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 12 MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault-tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance for computations can be implemented by multiple processors (likely with different architectures) executing the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority vote is used to identify the faulty processor. In this way the MISD architecture is used to provide fault tolerance for critical computations.&lt;br /&gt;
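&lt;br /&gt;
The M-out-of-N vote described above can be sketched as follows. This is a minimal illustration that assumes each processor's output can be compared for exact equality; the function name majority_vote and its return convention are hypothetical and are not taken from any particular system.&lt;br /&gt;

```python
from collections import Counter


def majority_vote(outputs, m):
    """Return (agreed_value, faulty_indices).

    outputs holds one result per redundant processor. If no value is
    produced by at least m processors, (None, []) is returned.
    """
    value, votes = Counter(outputs).most_common(1)[0]
    if votes >= m:
        # Every processor that disagrees with the majority is flagged.
        faulty = [i for i, out in enumerate(outputs) if out != value]
        return value, faulty
    return None, []


agreed, faulty = majority_vote([42, 42, 41], m=2)
```

With three processors and m set to 2, a single disagreeing processor is outvoted and identified, while three mutually different outputs yield no agreed value.&lt;br /&gt;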
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault-tolerant architecture. The major examples are flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system replaces manual flight controls with an electronic control interface. Movements of the flight controls in the cockpit are converted to electronic signals and transmitted to the actuators by wires. The control computers use feedback from the sensors to compute and control the movement of the actuators to provide the expected response. These computers also stabilize the aircraft and perform other tasks without the knowledge of the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
The flight control system contains redundant flight control computers. If one of the flight control computers crashes, is damaged or is affected by electromagnetic pulses, another computer can overrule the faulty one, so the flight of the aircraft is unharmed. [[Image:fig13.png|thumb|right|250px|Figure 9: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.[[Image:fig14.png|thumb|right|250px|Figure 14: PFC with instruction and data streams]]&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;http://www.citemaster.net/getdoc/8767/R8.pdf Y.C. (Bob) Yeh, Boeing Commercial Airplane Group, &amp;quot;Triple-Triple Redundant 777 Primary Flight Computer&amp;quot; &amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern systems, the redundant flight control computations are carried out by multiprocessor systems. The triple-redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 9].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each with three lanes that use different processors. The flight control program is compiled separately for each processor; the processors read their input data from the same data bus but drive their outputs on individual control buses. Each processor therefore executes different instructions while processing the same data, which makes this system a well-suited example of the Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data-synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. Because the outputs of the lanes can differ, the median value of the outputs is used to select the output to be acted upon. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault-blocking mechanism that acts before fault detection and identification by the cross-lane monitoring system. Thus, the MISD-based multi-computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
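&lt;br /&gt;
The median select step can be illustrated with a short Python sketch. It only models the idea of choosing the middle of three lane outputs; the function names and the fixed tolerance used to flag a disagreeing lane are assumptions made for this example and do not describe the actual 777 cross-lane monitoring logic.&lt;br /&gt;

```python
import statistics


def median_select(lane_outputs):
    """Pick the middle value so that one wild lane output cannot win."""
    return statistics.median(lane_outputs)


def lanes_out_of_tolerance(lane_outputs, tolerance):
    # Hypothetical monitor: flag lanes that stray too far from the
    # selected value. The tolerance is an assumed parameter.
    selected = median_select(lane_outputs)
    return [
        i for i, out in enumerate(lane_outputs)
        if abs(out - selected) > tolerance
    ]


command = median_select([10.0, 10.2, 55.0])
suspect = lanes_out_of_tolerance([10.0, 10.2, 55.0], tolerance=1.0)
```

Because the median of three values ignores a single outlier, the faulty output is blocked immediately, before any separate monitoring step decides which lane is at fault.&lt;br /&gt;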
&lt;br /&gt;
The system described above clearly has individual instruction streams: because the architecture of each processor is different, each has a different instruction set and therefore a different instruction stream. The processors have frame-synchronized input data, which means they work on the same set of data, fed from a single data stream. Thus the flight control system can be classified as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor [9], the pre-processing time needed to be made shorter. To reduce the pre-processing time, different types of image processing architectures for the ‘C80 were presented: for example, a MIMD architecture for image histogram equalization, a SIMD architecture for median filtering, and an MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows: each processor performs a different operation on the same data set. It is often stated that the MISD architecture was defined only for completeness of the classification, because in practice it is difficult to find a computer based on it [10]. In image processing, however, this assumption need not hold: performing many different operations on the same image is not rare, so the MISD architecture is a good fit for such cases. As an example of implementing image processing operations on the MISD architecture, consider edge detection. The edges are detected in four different directions (north, south, west and east) using four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
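&lt;br /&gt;
A rough software analogue of the four-direction edge detection is sketched below. The source does not list the filter masks that were used, so the Prewitt-style compass masks, their direction labels and the function names here are all assumptions; the point is only that four different operations are applied to one and the same image.&lt;br /&gt;

```python
# Assumed 3x3 compass masks (Prewitt-style); not taken from the source.
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def apply_mask(image, mask):
    """Correlate a 3x3 mask with the image interior (borders are skipped)."""
    rows, cols = len(image), len(image[0])
    out = [[0] * (cols - 2) for _ in range(rows - 2)]
    for r in range(rows - 2):
        for c in range(cols - 2):
            out[r][c] = sum(
                mask[i][j] * image[r + i][c + j]
                for i in range(3)
                for j in range(3)
            )
    return out


def directional_edges(image):
    # One input image (single data stream), one mask per processor
    # (multiple instruction streams). Here the four "processors" run
    # one after another; on the C80 they would run in parallel.
    return {name: apply_mask(image, mask) for name, mask in MASKS.items()}


edges = directional_edges([[0, 0, 9], [0, 0, 9], [0, 0, 9]])
```

For the small test image above, which is dark on the left and bright on the right, only the east and west masks respond; the north and south results are zero.&lt;br /&gt;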
&lt;br /&gt;
Image processing on a computer is very often a multistage process. One example is an algorithm that detects the contours of the objects in an image, which is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and eliminates noise. In the next step, thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of the pipeline is the master processor, which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors, as listed in Table 1.&lt;br /&gt;
&lt;br /&gt;
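{| class="wikitable"&lt;br /&gt;
|+ Table 1: Pipeline structure for image edge detection [9]&lt;br /&gt;
! Stage !! Processor !! Operation&lt;br /&gt;
|-&lt;br /&gt;
| 1 || MP || Image acquisition&lt;br /&gt;
|-&lt;br /&gt;
| 2 || PP0 || Low-pass Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 3 || PP1 || Thresholding&lt;br /&gt;
|-&lt;br /&gt;
| 4 || PP2 || Laplace Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 5 || PP3 || Logic Filtering&lt;br /&gt;
|}&lt;br /&gt;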
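&lt;br /&gt;
How the five stages of Table 1 overlap in time can be sketched with a small scheduling model. It assumes, purely for illustration, that every stage takes exactly one clock tick per frame; the function name pipeline_schedule is hypothetical and real stage timings on the C80 would differ.&lt;br /&gt;

```python
# Processor and operation names follow Table 1.
STAGES = [
    ("MP", "Image acquisition"),
    ("PP0", "Low-pass filtering"),
    ("PP1", "Thresholding"),
    ("PP2", "Laplace filtering"),
    ("PP3", "Logic filtering"),
]


def pipeline_schedule(num_frames):
    """List, per clock tick, which frame each processor is working on."""
    ticks = []
    for tick in range(num_frames + len(STAGES) - 1):
        busy = {}
        for stage_index, (processor, _operation) in enumerate(STAGES):
            frame = tick - stage_index
            if frame in range(num_frames):  # this stage holds a valid frame
                busy[processor] = frame
        ticks.append(busy)
    return ticks


schedule = pipeline_schedule(3)
```

For three frames the model needs seven ticks, and at tick 2 the processors MP, PP0 and PP1 are busy with frames 2, 1 and 0 respectively, which shows different operations running at the same time on different frames.&lt;br /&gt;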
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''CU''': Control Unit&lt;br /&gt;
&lt;br /&gt;
*'''DS''': Data Stream&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''Heterogeneous System''': A multiprocessor system with different kinds of processors&lt;br /&gt;
&lt;br /&gt;
*'''Homogeneous System''': A multiprocessor system with the same kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''ILP''': Instruction Level Parallelism&lt;br /&gt;
&lt;br /&gt;
*'''IS''': Instruction Stream&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SAPO''': Short for Samočinný počítač&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''VLSI''': Very Large Scale Integration&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71778</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71778"/>
		<updated>2013-02-04T22:54:26Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Architecture of systolic arrays as against MISD architecture */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
==MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of information streams that flow into the processor.&lt;br /&gt;
According to Flynn's Taxonomy, two kinds of logical streams of information flow into a parallel computer: the instruction stream and the data stream.  Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent.&lt;br /&gt;
Based on how many of each type of stream flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD).&lt;br /&gt;
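&lt;br /&gt;
The classification rule is simple enough to express as a small function. The sketch below assumes the stream counts are already known as integers; the function name flynn_class is hypothetical.&lt;br /&gt;

```python
def flynn_class(instruction_streams, data_streams):
    """Map logical stream counts to one of Flynn's four categories."""
    if not (instruction_streams >= 1 and data_streams >= 1):
        raise ValueError("each stream count must be at least 1")
    instruction_part = "S" if instruction_streams == 1 else "M"
    data_part = "S" if data_streams == 1 else "M"
    return instruction_part + "I" + data_part + "D"


label = flynn_class(instruction_streams=3, data_streams=1)
```

Only the distinction between one stream and more than one stream matters, so three instruction streams with one data stream and thirty instruction streams with one data stream both fall into the MISD category.&lt;br /&gt;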
&lt;br /&gt;
The four classifications defined by Flynn, which are based upon the number of concurrent instruction (or control) and data streams available in the architecture, are shown in Figure 1.&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
==Overview of the MISD Architecture==&lt;br /&gt;
&lt;br /&gt;
To be categorized as an MISD architecture, a system must have at least two instruction streams and exactly one data stream, as shown in Figure 2.  According to Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, and none of them are commercial, general-purpose parallel computers.  Instead, MISD architectures are mainly used in custom hardware built to solve specific problems.  The majority of the implementations have been created in academia.&lt;br /&gt;
Over the years, several architectures have been described as MISD even though they do not completely follow the description in Flynn’s Taxonomy.  One common example is the pipeline architecture, where the data is passed from processing element to processing element and each processing element executes a different instruction.  It is often argued that this is not a true MISD architecture because the data changes as it moves through the pipeline.  Similarly, other architectures and applications exist that some call MISD while others disagree.  The following sections look at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type. Not many instances of this architecture exist, as MIMD and SIMD are often more appropriate for common data parallel techniques. Specifically, they allow better scaling and use of computational resources than MISD does. &lt;br /&gt;
&lt;br /&gt;
However, one prominent example of MISD in computing is the set of Space Shuttle flight control computers.  Another example is the systolic array, such as the [http://www.cs.cmu.edu/~iwarp/ CMU iWarp] [BORKAR et al., 1990].  All the elements in this array are controlled by a global clock. On each cycle, an element reads a piece of data from one of its neighbors, performs a simple operation (e.g. adds the incoming element to a stored value), and prepares a value to be written to a neighbor on the next step.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions.  At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Types of Systolic Arrays&amp;lt;ref&amp;gt;http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf General Purpose Systolic Arrays &amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 7: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element, which multiplies each pair and adds the product to an accumulator. The a’s and b’s exit the processing element unmodified so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These combinations are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======Programmable Systolic Array======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing element is programmable, and a program controls the dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
For the latter (MIMD) approach, the workstation downloads a program to each MIMD systolic cell (Figure 5). Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: it contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, it has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be described as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2].&lt;br /&gt;
&lt;br /&gt;
======Reconfigurable Systolic Array======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
A reconfigurable systolic array is an array of systolic elements that can be programmed at the lowest level.  Recent gate-density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array.&lt;br /&gt;
&lt;br /&gt;
The Reconfigurable Systolic Array (RSA) circuit design is based on a systolic array architecture consisting of processing elements (PEs) interconnected via switches (SWs), as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The architecture is homogeneous: each reconfigurable PE cell is connected to its nearest neighbors via configurable SW elements, which enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, the former as arithmetic processors and the latter as flexible routers linking neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on a separate data bus, which enables the circuit to continue operating while reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 7. Comparison between Architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
From the configurations described above, it can be seen that systolic arrays generally have multiple processing elements, each executing different instructions from its own dedicated instruction stream. A single data stream connects the adjacent PEs. On this view, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors disagree: because the data read as input by one processing element is the processed output of the adjacent PE, the data stream cannot be considered single, since the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 7] shows the difference between the data streams of systolic arrays and of the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture rather than a “Single Data” architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc.; software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 12 MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance for computations can be implemented by having multiple processors (likely with different architectures) execute the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority voting method is used to determine the faulty processor. In this way the MISD architecture is used to provide fault tolerance for critical computations.&lt;br /&gt;
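&lt;br /&gt;
The M-out-of-N voting step described above can be sketched as follows. This is a minimal illustration only: the function name majority_vote and the parameter m are hypothetical, and a real voter would compare outputs within a tolerance rather than for exact equality.&lt;br /&gt;

```python
from collections import Counter


def majority_vote(outputs, m):
    """Vote over the outputs of N redundant processors.

    Returns (agreed_value, faulty_indices). If fewer than m processors
    agree on any value, no decision is possible and every processor is
    reported as suspect. (Hypothetical helper, not from the source.)
    """
    value, count = Counter(outputs).most_common(1)[0]
    if m > count:
        return None, list(range(len(outputs)))
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


# Three processors ran different instruction streams on the same data.
agreed, faulty = majority_vote([42, 42, 41], m=2)
print(agreed, faulty)
```

With three processors and m set to 2, a single disagreeing processor is identified and its output can be ignored.&lt;br /&gt;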
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. Major examples include flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system replaces manual flight control with an electronic control interface. The movements of the flight controls in the cockpit are converted to electronic signals and transmitted to the actuators by wires. The control computers use feedback from the sensors to compute and control the movement of the actuators so as to provide the expected response. These computers also stabilize the aircraft and perform other tasks without the knowledge of the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
The flight control system contains redundant flight control computers. If one of them crashes, is damaged, or is affected by electromagnetic pulses, the others can overrule the faulty one, so the flight of the aircraft is unharmed. [[Image:fig13.png|thumb|right|250px|Figure 13: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others can be judged faulty and either ignored or rebooted.[[Image:fig14.png|thumb|right|250px|Figure 14: PFC with instruction and data streams]]&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;http://www.citemaster.net/getdoc/8767/R8.pdf Y.C. (Bob) Yeh, Boeing Commercial Airplane Group, &amp;quot;Triple-Triple Redundant 777 Primary Flight Computer&amp;quot; &amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern flight control systems, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 13].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each with three lanes that use different processors. The flight control program is compiled for each of the processors, which read their input data from the same data bus but drive their output on individual control buses. Each processor thus executes different instructions on the same data, which makes the system a well-suited example of the Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], the [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and the [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes, so that all of the lanes read the same frame of data from the flight sensors. Because the outputs of the lanes can differ, a median value select is applied to the outputs. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism that acts before fault detection and identification by the cross-lane monitoring system. Thus, the MISD based multi computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
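&lt;br /&gt;
The median select described above can be sketched as follows. This is an illustration under stated assumptions, not the actual PFC logic: the function name median_select, the tolerance parameter and the sample values are all hypothetical.&lt;br /&gt;

```python
import statistics


def median_select(lane_outputs, tolerance):
    """Pick the median of the lane outputs and flag outlying lanes.

    lane_outputs: one proposed command per lane, computed from the
    same frame of sensor data. tolerance is a hypothetical threshold
    for how far a lane may deviate from the selected value.
    """
    selected = statistics.median(lane_outputs)
    suspects = [
        i for i, value in enumerate(lane_outputs)
        if abs(value - selected) > tolerance
    ]
    return selected, suspects


# Lane 2 has produced a wildly different value; the median blocks it.
selected, suspects = median_select([10.0, 10.2, 57.3], tolerance=1.0)
print(selected, suspects)
```

Because the median of three values is always bracketed by the other two, a single faulty lane cannot drive the selected output, which is the fault blocking property the text refers to.&lt;br /&gt;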
&lt;br /&gt;
The system described above clearly has individual instruction streams: the architecture of each processor is different, so the instruction sets and instruction streams differ as well. The processors have frame-synchronized input data, which means they work on the same set of data, fed from a single data stream. The flight control system can therefore be classified as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor [9], the image pre-processing time needed to be shortened. To reduce the pre-processing time, different types of image processing architectures for the ‘C80 were presented. For example, the MIMD architecture is used for image histogram equalization, the SIMD architecture for median filtering, and the MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows. Here, each processor performs a different operation on the same data set. It is often said that the MISD architecture was defined only for completeness of the classification, because it is difficult in practice to find a computer based on it [10]. In image processing, however, this need not hold: performing many different operations on the same image is common, so the MISD architecture is a good fit for such cases. As an example of implementing image processing operations on the MISD architecture, consider edge detection. The edges should be detected in four different directions (north, south, west and east) by means of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
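&lt;br /&gt;
The directional edge detection described above can be sketched as four different masks applied to the same image. The source does not list the mask coefficients, so the 3x3 compass masks below are an assumption, and the thread pool merely stands in for the four parallel processors of the TMS320C80.&lt;br /&gt;

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical 3x3 compass masks; the actual coefficients are not given
# in the source text.
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def apply_mask(image, mask):
    """Correlate a 3x3 mask with the interior pixels of a 2-D list image."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(
                mask[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3)
                for j in range(3)
            )
    return out


def directional_edges(image):
    """Run four different operations on the same image, one worker each."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {
            name: pool.submit(apply_mask, image, mask)
            for name, mask in MASKS.items()
        }
        return {name: future.result() for name, future in futures.items()}


# A 4x4 test image whose right half is bright, i.e. one vertical edge.
image = [[0, 0, 9, 9] for _ in range(4)]
edges = directional_edges(image)
print(edges["east"][1][1], edges["north"][1][1])
```

Every worker reads the same input image and produces its own output image, which mirrors the single data stream and multiple instruction streams of the MISD description.&lt;br /&gt;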
&lt;br /&gt;
Image processing on a computer is very often a multistage process. An example is an algorithm that detects the contours of the objects in an image, which is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and removes noise. In the next step, thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected with the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of the pipeline is the master processor, which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors, as listed in Table 1.&lt;br /&gt;
&lt;br /&gt;
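{| class="wikitable"&lt;br /&gt;
|+ Table 1: Pipeline structure for image edge detection [9]&lt;br /&gt;
! Stage !! Processor !! Operation&lt;br /&gt;
|-&lt;br /&gt;
| 1 || MP || Image acquisition&lt;br /&gt;
|-&lt;br /&gt;
| 2 || PP0 || Low-pass Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 3 || PP1 || Thresholding&lt;br /&gt;
|-&lt;br /&gt;
| 4 || PP2 || Laplace Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 5 || PP3 || Logic Filtering&lt;br /&gt;
|}&lt;br /&gt;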
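&lt;br /&gt;
The overlap that the five-stage pipeline achieves can be sketched with a simple schedule. The stage names follow Table 1; the function pipeline_schedule and the assumption that every stage takes one time step per frame are illustrative only.&lt;br /&gt;

```python
# Stage order taken from Table 1.
STAGES = [
    "MP: image acquisition",
    "PP0: low-pass filtering",
    "PP1: thresholding",
    "PP2: Laplace filtering",
    "PP3: logic filtering",
]


def pipeline_schedule(num_frames):
    """Return, per time step, the frame index each stage is working on.

    Assumes (hypothetically) that every stage needs exactly one time
    step per frame. None marks an idle stage.
    """
    steps = num_frames + len(STAGES) - 1
    schedule = []
    for t in range(steps):
        row = []
        for s in range(len(STAGES)):
            frame = t - s  # frame that entered the pipeline s steps ago
            busy = frame >= 0 and num_frames > frame
            row.append(frame if busy else None)
        schedule.append(row)
    return schedule


for t, row in enumerate(pipeline_schedule(3)):
    print(t, row)
```

Once the pipeline is full, all five processors work at the same time, each on a different frame, so a finished frame leaves the pipeline at every step.&lt;br /&gt;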
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''CU''': Control Unit&lt;br /&gt;
&lt;br /&gt;
*'''DS''': Data Stream&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''Heterogeneous System''': A multiprocessor system with different kinds of processors&lt;br /&gt;
&lt;br /&gt;
*'''Homogeneous System''': A multiprocessor system with the same kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''ILP''': Instruction Level Parallelism&lt;br /&gt;
&lt;br /&gt;
*'''IS''': Instruction Stream&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SAPO''': Short for Samočinný počítač&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''VLSI''': Very Large Scale Integration&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71777</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71777"/>
		<updated>2013-02-04T22:53:49Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* General-purpose systolic array */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
==MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of information streams that flow into the processor.&lt;br /&gt;
According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the instruction stream and the data stream.  Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether the streams are physically independent of each other in the hardware, only that they can be logically treated as independent.&lt;br /&gt;
Based on how many streams of each type flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD).&lt;br /&gt;
&lt;br /&gt;
The four classifications defined by Flynn, based upon the number of concurrent instruction (or control) and data streams available in the architecture, are shown in Figure 1.&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
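&lt;br /&gt;
The classification rule can be written down directly as a small function. The name flynn_category is hypothetical; the function only encodes the single vs. multiple distinction described above.&lt;br /&gt;

```python
def flynn_category(instruction_streams, data_streams):
    """Map stream counts to one of Flynn's four categories."""
    if 1 > instruction_streams or 1 > data_streams:
        raise ValueError("each stream count must be at least 1")
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return instr + "I" + data + "D"


# Three processors executing different programs on one shared data stream.
print(flynn_category(3, 1))
```

Only whether a count is one or more than one matters; the logical, not physical, independence of the streams decides how they are counted.&lt;br /&gt;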
&lt;br /&gt;
==Overview of the MISD Architecture==&lt;br /&gt;
&lt;br /&gt;
To be categorized as an MISD architecture, a system must have at least two instruction streams and only one data stream, as shown in Figure 2.  According to Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, none of which are commercial, general purpose parallel computers.  Instead, MISD architectures are mainly used to create custom hardware that solves specific problems.  The majority of the implementations have been created in academia.&lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy.  One common example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction.  It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline.  Similarly, other architectures or applications exist which may be called MISD by some while others disagree.  The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type. Not many instances of this architecture exist, as MIMD and SIMD are often more appropriate for common data parallel techniques. Specifically, they allow better scaling and use of computational resources than MISD does. &lt;br /&gt;
&lt;br /&gt;
However, one prominent example of MISD in computing is the Space Shuttle flight control computers.  Another example of such a machine is the systolic array, such as the [http://www.cs.cmu.edu/~iwarp/ CMU iWarp] [BORKAR et al., 1990].  All the elements in this array are controlled by a global clock. On each cycle, an element reads a piece of data from one of its neighbors, performs a simple operation (e.g. adds the incoming element to a stored value), and prepares a value to be written to a neighbor on the next step.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions.  Each processor, at each step, takes in data from one or more neighbors, processes it and, in the next step, outputs the results in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Type of Systolic Arrays&amp;lt;ref&amp;gt;http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf General Purpose Systolic Arrays &amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 7: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is in matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are shifted synchronously through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
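&lt;br /&gt;
The multiply-accumulate behaviour described above can be simulated in software. The sketch below assumes an n-by-n grid in which each cell keeps one element of the product in its accumulator, with a values moving left to right and b values moving top to bottom; this layout and the function name systolic_matmul are assumptions, not taken from the cited figures.&lt;br /&gt;

```python
def systolic_matmul(A, B):
    """Simulate an n x n systolic array computing the product A * B.

    Cell (i, j) accumulates one output element. Inputs enter at the
    left and top edges, skewed by one step per row or column, and are
    passed on unmodified to the next cell on the following step.
    """
    n = len(A)
    acc = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]  # a value held by each cell
    b_reg = [[0] * n for _ in range(n)]  # b value held by each cell
    for t in range(3 * n - 2):
        # Visit cells from the far corner so that each cell still sees
        # the value its neighbor held on the previous step.
        for i in reversed(range(n)):
            for j in reversed(range(n)):
                if j > 0:
                    a_in = a_reg[i][j - 1]
                else:
                    k = t - i
                    a_in = A[i][k] if k >= 0 and n > k else 0
                if i > 0:
                    b_in = b_reg[i - 1][j]
                else:
                    k = t - j
                    b_in = B[k][j] if k >= 0 and n > k else 0
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
                acc[i][j] += a_in * b_in  # multiply-accumulate
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each cell performs only a multiply-accumulate and a hand-off to its neighbors per step; the skewed input timing is what makes matching a and b values meet in the right cell.&lt;br /&gt;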
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These combinations are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======Programmable Systolic Array======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
For the latter approach, the workstation downloads a program to each MIMD (Figure 4) systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: It contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. The architecture has multiple instruction streams for the PEs and a single data stream passing through all the PEs. Thus, it can also be defined as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2.]&lt;br /&gt;
&lt;br /&gt;
======Reconfigurable Systolic Array======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 6: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 6]. The Reconfigurable Systolic Array (RSA) architecture is homogeneous: each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements. This enables array expansion for parallel processing and lets individual PEs time-share the computation of high-throughput data.  Both the PEs and the SWs can be reconfigured dynamically, the former as arithmetic processors and the latter as flexible routers linking neighboring PE cells. The RSA shifts reconfiguration signals and input signals into the PEs and SWs on separate data buses, which allows the circuit to continue operating while reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 11.Comparison between Architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
In the systolic array configurations described above, multiple processing elements generally execute different instructions, each from its own dedicated instruction stream, while a single data stream connects adjacent PEs. On this view, a systolic array can be classified as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors object that the data read as input by one processing element is the processed output of the adjacent PE. The data stream therefore cannot be considered single, because the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 11] shows the difference between the data streams of systolic arrays and of the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture rather than a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc.; software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 12 MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance for computations can be implemented by having multiple processors (likely with different architectures) execute the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority voting method is used to determine the faulty processor. In this way the MISD architecture is used to provide fault tolerance for critical computations.&lt;br /&gt;
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. Major examples include flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system replaces manual flight control with an electronic control interface. The movements of the flight controls in the cockpit are converted to electronic signals and transmitted to the actuators by wires. The control computers use feedback from the sensors to compute and control the movement of the actuators so as to provide the expected response. These computers also stabilize the aircraft and perform other tasks without the knowledge of the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
The flight control system contains redundant flight control computers. If one of them crashes, is damaged, or is affected by electromagnetic pulses, the others can overrule the faulty one, so the flight of the aircraft is unharmed. [[Image:fig13.png|thumb|right|250px|Figure 13: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others can be judged faulty and either ignored or rebooted.[[Image:fig14.png|thumb|right|250px|Figure 14: PFC with instruction and data streams]]&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;http://www.citemaster.net/getdoc/8767/R8.pdf Y.C. (Bob) Yeh, Boeing Commercial Airplane Group, &amp;quot;Triple-Triple Redundant 777 Primary Flight Computer&amp;quot; &amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern flight control systems, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 13].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each with three lanes that use different processors. The flight control program is compiled for each of the processors, which read their input data from the same data bus but drive their output on individual control buses. Each processor thus executes different instructions on the same data, which makes the system a well-suited example of the Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], the [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and the [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes, so that all of the lanes read the same frame of data from the flight sensors. Because the outputs of the lanes can differ, a median value select is applied to the outputs. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism that acts before fault detection and identification by the cross-lane monitoring system. Thus, the MISD based multi computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
&lt;br /&gt;
The system described above clearly has individual instruction streams: the architecture of each processor is different, so the instruction sets and instruction streams differ as well. The processors have frame-synchronized input data, which means they work on the same set of data, fed from a single data stream. The flight control system can therefore be classified as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor [9], the image pre-processing time needed to be shortened. To reduce the pre-processing time, different types of image processing architectures for the ‘C80 were presented. For example, the MIMD architecture is used for image histogram equalization, the SIMD architecture for median filtering, and the MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows. Here, each processor performs a different operation on the same data set. It is often said that the MISD architecture was defined only for completeness of the classification, because it is difficult in practice to find a computer based on it [10]. In image processing, however, this need not hold: performing many different operations on the same image is common, so the MISD architecture is a good fit for such cases. As an example of implementing image processing operations on the MISD architecture, consider edge detection. The edges should be detected in four different directions (north, south, west and east) by means of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
&lt;br /&gt;
Image processing on a computer is very often a multistage process. One example is an algorithm that detects the contours of objects in an image, which is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and removes noise. In the next step, thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of the pipeline is the master processor, which controls the image acquisition process. The remaining pipeline stages were implemented on the 4 parallel processors as listed in Table 1.&lt;br /&gt;
&lt;br /&gt;
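{| class="wikitable"&lt;br /&gt;
|+ Table 1: Pipeline structure for image edge detection [9]&lt;br /&gt;
! Stage !! Processor !! Operation&lt;br /&gt;
|-&lt;br /&gt;
| 1 || MP || Image acquisition&lt;br /&gt;
|-&lt;br /&gt;
| 2 || PP0 || Low-pass Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 3 || PP1 || Thresholding&lt;br /&gt;
|-&lt;br /&gt;
| 4 || PP2 || Laplace Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 5 || PP3 || Logic Filtering&lt;br /&gt;
|}&lt;br /&gt;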
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''CU''': Control Unit&lt;br /&gt;
&lt;br /&gt;
*'''DS''': Data Stream&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''Heterogeneous System''': A multiprocessor system with different kinds of processors&lt;br /&gt;
&lt;br /&gt;
*'''Homogeneous System''': A multiprocessor system with the same kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''ILP''': Instruction Level Parallelism&lt;br /&gt;
&lt;br /&gt;
*'''IS''': Instruction Stream&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SAPO''': Short for Samočinný počítač&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''VLSI''': Very Large Scale Integration&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71776</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71776"/>
		<updated>2013-02-04T22:53:02Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* General-purpose systolic array */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
==MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of information streams that flow into the processor.&lt;br /&gt;
According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the Instruction stream and the Data stream.  Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent.&lt;br /&gt;
Based on how many streams of each type flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD).&lt;br /&gt;
&lt;br /&gt;
The four classifications defined by Flynn, which are based upon the number of concurrent instruction (or control) and data streams available in the architecture, are shown in Figure 1.&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
==Overview of the MISD Architecture==&lt;br /&gt;
&lt;br /&gt;
To be categorized as an MISD architecture, a system must have at least two Instruction streams and only one Data stream, as shown in Figure 2.  According to Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, none of which are commercial, general-purpose parallel computers.  Instead, MISD architectures are mainly used to create custom hardware in order to solve specific problems.  The majority of the implementations have been created in academia.&lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy.  One common example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction.  It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline.  Similarly, other architectures or applications exist which may be called MISD by some while others disagree.  The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type. Not many instances of this architecture exist, as MIMD and SIMD are often more appropriate for common data parallel techniques. Specifically, they allow better scaling and use of computational resources than MISD does. &lt;br /&gt;
&lt;br /&gt;
However, one prominent example of MISD in computing is the Space Shuttle flight control computers.  Another example of this type of machine is the systolic array, such as the [http://www.cs.cmu.edu/~iwarp/ CMU iWarp] [BORKAR et al., 1990].  All the elements in this array are controlled by a global clock. On each cycle, an element will read a piece of data from one of its neighbors, perform a simple operation (e.g. add the incoming element to a stored value), and prepare a value to be written to a neighbor on the next step.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions.  At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Types of Systolic Arrays&amp;lt;ref&amp;gt;http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf General Purpose Systolic Arrays &amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 7: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
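&lt;br /&gt;
The accumulate-and-forward behaviour of a single systolic element can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware design from the cited paper; the names ScalarProductPE, step and scalar_product are hypothetical.&lt;br /&gt;

```python
# Illustrative model of one systolic processing element (PE).
# Class and function names are hypothetical, chosen for this sketch only.
class ScalarProductPE:
    def __init__(self):
        self.acc = 0  # accumulator holding the running sum of products

    def step(self, a, b):
        """Consume one (a, b) pair, accumulate a*b, and forward both unmodified."""
        self.acc += a * b
        return a, b  # passed on to the neighbouring element


def scalar_product(a_values, b_values):
    """Shift the a's and b's through a single PE and read out the accumulator."""
    pe = ScalarProductPE()
    for a, b in zip(a_values, b_values):
        pe.step(a, b)
    return pe.acc  # the sum is shifted out after the last pair


print(scalar_product([1, 2, 3], [4, 5, 6]))  # prints 32
```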
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These arrangements are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======Programmable Systolic Array======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 5: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In the MIMD approach, the workstation downloads a program to each MIMD (Figure 5) systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: it contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, it has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be regarded as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2.]&lt;br /&gt;
&lt;br /&gt;
======Reconfigurable Systolic Array======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 10: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 10]. The homogeneous characteristic of the Reconfigurable Systolic Array (RSA) architecture, where each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, the former as arithmetic processors and the latter as flexible routers linking the neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on separate data buses, which enables the circuit to continue operating while reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 11.Comparison between Architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
From the configurations of systolic arrays described above, it can be seen that they generally have multiple processing elements, each executing different instructions from its own dedicated instruction stream. A single data stream connects the adjacent PEs. Thus, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors argue, however, that because the data read as input by one processing element is the processed output of the adjacent PE, the data stream cannot be considered single: the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 11] shows the difference between the data stream for systolic arrays and for the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture and not a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc.; software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 12 MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of the fault tolerance systems by multiple instruction streams or multiple data streams or both. Fault tolerance on computations can be implemented by multiple processors (likely with different architectures) executing the algorithms on the same set of data. The output of each processor is compared with that of the others and M out of N majority voting method is used to determine the faulty processor. Thus MISD architecture is utilized to get the fault tolerance on critical computations.&lt;br /&gt;
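&lt;br /&gt;
The M out of N majority vote described above can be illustrated with a short Python sketch. It assumes the processor outputs are directly comparable values; the function name majority_vote and the parameter m are hypothetical, and a real voter would typically compare within a tolerance rather than for exact equality.&lt;br /&gt;

```python
from collections import Counter


def majority_vote(outputs, m):
    """Return (agreed_value, faulty_indices) for an M-out-of-N vote.

    outputs: one result per redundant processor (N values).
    m: minimum number of matching outputs needed to accept a value.
    If no value reaches m votes, (None, all indices) is returned.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count >= m:
        faulty = [i for i, out in enumerate(outputs) if out != value]
        return value, faulty
    return None, list(range(len(outputs)))


# Three processors, 2-out-of-3 voting: processor 2 disagrees.
print(majority_vote([7, 7, 9], 2))  # prints (7, [2])
```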
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. Major examples include flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system is used to replace the manual flight control by an electronic control interface. The movements of the flight control in the cockpit are converted to electronic signals and are transmitted to the actuators by wires. The control computers use the feedback from the sensors to compute and control the movement of the actuators to provide the expected response. These computers also perform the task to stabilize the aircraft and perform other tasks without the knowledge of the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
The flight control system contains redundant flight control computers. If one of the flight control computers crashes, is damaged or is affected by electromagnetic pulses, the others can overrule the faulty one, so the flight of the aircraft is not endangered. [[Image:fig13.png|thumb|right|250px|Figure 13: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.[[Image:fig14.png|thumb|right|250px|Figure 14: PFC with instruction and data streams]]&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;http://www.citemaster.net/getdoc/8767/R8.pdf Y.C. (Bob) Yeh, Boeing Commercial Airplane Group, &amp;quot;Triple-Triple Redundant 777 Primary Flight Computer&amp;quot; &amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern computers, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 13].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each of them having three lanes with different processors. The flight control program is compiled for each of the processors which get the input data from the same data bus but drive the output on their individual control bus. Thus each processor executes different instructions but they process the same data. Thus, it is the best suited example of Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors require dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data-synchronized with the other lanes so that all lanes read the same frame of data from the flight sensors. Because the outputs of the lanes can differ, the median value of the outputs determines which lane's output is used. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault-blocking mechanism that acts before fault detection and identification by the cross-lane monitoring system. Thus, the MISD-based multi-computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
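&lt;br /&gt;
The median select step can be sketched as follows. This is a simplified illustration that assumes each lane produces a single numeric command; the lane names and the function median_select are hypothetical, and the actual PFC hardware and its monitoring logic are considerably more involved.&lt;br /&gt;

```python
def median_select(lane_outputs):
    """Pick the lane whose output is the median of all lane outputs.

    lane_outputs: dict mapping a lane name to its numeric output.
    Returns (lane_name, value). With an odd number of lanes this is
    the true median, so one wildly wrong lane can never be chosen.
    """
    ranked = sorted(lane_outputs.items(), key=lambda item: item[1])
    return ranked[len(ranked) // 2]


# Hypothetical outputs: lane_c has failed and reports a bad value.
outputs = {"lane_a": 10.2, "lane_b": 10.0, "lane_c": 55.0}
print(median_select(outputs))  # prints ('lane_a', 10.2)
```

Selecting the median rather than the mean is what blocks the fault: the faulty value from the hypothetical lane_c shifts the average but has no influence on which output is chosen.&lt;br /&gt;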
&lt;br /&gt;
This system clearly has separate instruction streams: each processor has a different architecture and therefore a different instruction set and a different instruction stream. The processors receive frame-synchronized input data, meaning that they all work on the same set of data, fed from a single data stream. Thus the flight control system can be classified as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor [9], the pre-processing time needed to be shortened. To this end, different types of image processing architectures for the ‘C80 were presented: for example, MIMD architecture for image histogram equalization, SIMD architecture for median filtering, and MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows: each processor performs a different operation on the same data set. It is often stated that the MISD architecture was defined only for completeness of the classification, because it is difficult in practice to find a computer based on it [10]. In image processing, however, this need not hold, because performing many different operations on the same image is common, and MISD architecture fits such cases well. As an example of image processing operations implemented on an MISD architecture, consider edge detection. The edges are detected in four different directions (north, south, west and east) by means of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
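&lt;br /&gt;
The idea of applying four different masks to one shared image can be sketched in Python. The source does not give the masks used on the TMS320C80, so the Prewitt-style masks below are an assumption, as are the names MASKS, apply_mask and directional_edges. The sketch runs the four masks one after another, whereas on the 'C80 each mask would run on its own parallel processor.&lt;br /&gt;

```python
# Assumed Prewitt-style 3x3 directional masks (not taken from the source).
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def apply_mask(image, mask):
    """Apply one 3x3 mask to every interior pixel; border pixels stay 0."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(
                mask[i][j] * image[r - 1 + i][c - 1 + j]
                for i in range(3)
                for j in range(3)
            )
    return out


def directional_edges(image):
    """Same input image, four different operations: one result per direction."""
    return {name: apply_mask(image, mask) for name, mask in MASKS.items()}


# A 4x4 image with a vertical step: dark on the left, bright on the right.
image = [[0, 0, 9, 9] for _ in range(4)]
edges = directional_edges(image)
print(edges["east"][1][1], edges["west"][1][1], edges["north"][1][1])  # prints 27 -27 0
```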
&lt;br /&gt;
Image processing on a computer is very often a multistage process. One example is an algorithm that detects the contours of objects in an image, which is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and removes noise. In the next step, thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of the pipeline is the master processor, which controls the image acquisition process. The remaining pipeline stages were implemented on the 4 parallel processors as listed in Table 1.&lt;br /&gt;
&lt;br /&gt;
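{| class="wikitable"&lt;br /&gt;
|+ Table 1: Pipeline structure for image edge detection [9]&lt;br /&gt;
! Stage !! Processor !! Operation&lt;br /&gt;
|-&lt;br /&gt;
| 1 || MP || Image acquisition&lt;br /&gt;
|-&lt;br /&gt;
| 2 || PP0 || Low-pass Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 3 || PP1 || Thresholding&lt;br /&gt;
|-&lt;br /&gt;
| 4 || PP2 || Laplace Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 5 || PP3 || Logic Filtering&lt;br /&gt;
|}&lt;br /&gt;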
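&lt;br /&gt;
The four processing stages of Table 1 can be chained in a short Python sketch. The filter definitions are assumptions made for illustration (a 3x3 mean for the low-pass stage, an 8-neighbour Laplace operator, a threshold level of 128, and 0 for black and 255 for white); the source does not specify them, and all function names are hypothetical. The stages run sequentially here, whereas on the 'C80 each stage works on a different frame at the same time.&lt;br /&gt;

```python
def filter3x3(image, fn):
    """Apply fn to every interior 3x3 window; border pixels are copied unchanged."""
    out = [row[:] for row in image]
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            window = [image[r + i][c + j] for i in (-1, 0, 1) for j in (-1, 0, 1)]
            out[r][c] = fn(window)
    return out


def low_pass(image):
    """Stage 2 (PP0): 3x3 mean filter smooths the image and reduces noise."""
    return filter3x3(image, lambda w: sum(w) // 9)


def threshold(image, level=128):
    """Stage 3 (PP1): binary image, 0 = black, 255 = white (assumed level)."""
    return [[255 if p >= level else 0 for p in row] for row in image]


def laplace(image):
    """Stage 4 (PP2): pixels with a non-zero Laplace response become black edges."""
    return filter3x3(image, lambda w: 0 if 8 * w[4] != sum(w) - w[4] else 255)


def logic_filter(image):
    """Stage 5 (PP3): turn a single black pixel surrounded by white into white."""
    return filter3x3(image, lambda w: 255 if w[4] == 0 and sum(w) == 8 * 255 else w[4])


STAGES = [low_pass, threshold, laplace, logic_filter]


def run_pipeline(image):
    """Pass one acquired frame (stage 1) through the four processing stages."""
    for stage in STAGES:
        image = stage(image)
    return image


speck = [[255, 255, 255], [255, 0, 255], [255, 255, 255]]
print(logic_filter(speck)[1][1])  # prints 255
```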
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''CU''': Control Unit&lt;br /&gt;
&lt;br /&gt;
*'''DS''': Data Stream&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''Heterogeneous System''': A multiprocessor system with different kinds of processors&lt;br /&gt;
&lt;br /&gt;
*'''Homogeneous System''': A multiprocessor system with the same kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''ILP''': Instruction Level Parallelism&lt;br /&gt;
&lt;br /&gt;
*'''IS''': Instruction Stream&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SAPO''': Short for Samočinný počítač&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''VLSI''': Very Large Scale Integration&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71775</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71775"/>
		<updated>2013-02-04T22:52:28Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Programmable Systolic Array */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
==MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of information streams that flow into the processor.&lt;br /&gt;
According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the Instruction stream and the Data stream.  Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent.&lt;br /&gt;
Based on how many streams of each type flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD).&lt;br /&gt;
&lt;br /&gt;
The four classifications defined by Flynn, which are based upon the number of concurrent instruction (or control) and data streams available in the architecture, are shown in Figure 1.&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
==Overview of the MISD Architecture==&lt;br /&gt;
&lt;br /&gt;
To be categorized as an MISD architecture, a system must have at least two Instruction streams and only one Data stream, as shown in Figure 2.  According to Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, none of which are commercial, general-purpose parallel computers.  Instead, MISD architectures are mainly used to create custom hardware in order to solve specific problems.  The majority of the implementations have been created in academia.&lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy.  One common example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction.  It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline.  Similarly, other architectures or applications exist which may be called MISD by some while others disagree.  The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type. Not many instances of this architecture exist, as MIMD and SIMD are often more appropriate for common data parallel techniques. Specifically, they allow better scaling and use of computational resources than MISD does. &lt;br /&gt;
&lt;br /&gt;
However, one prominent example of MISD in computing is the Space Shuttle flight control computers.  Another example of this type of machine is the systolic array, such as the [http://www.cs.cmu.edu/~iwarp/ CMU iWarp] [BORKAR et al., 1990].  All the elements in this array are controlled by a global clock. On each cycle, an element will read a piece of data from one of its neighbors, perform a simple operation (e.g. add the incoming element to a stored value), and prepare a value to be written to a neighbor on the next step.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions.  At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Types of Systolic Arrays&amp;lt;ref&amp;gt;http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf General Purpose Systolic Arrays &amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 7: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is in matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
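&lt;br /&gt;
The behaviour of the single systolic element described above can be sketched in software. This is a minimal illustration rather than the hardware design from the cited paper; the names SystolicCell, step and scalar_product are hypothetical and chosen only for this example.&lt;br /&gt;
&lt;br /&gt;
```python
class SystolicCell:
    """One hypothetical multiply-accumulate processing element."""

    def __init__(self):
        self.acc = 0  # accumulator holding the running sum of products

    def step(self, a, b):
        # Add the product to the accumulator, then pass both operands
        # on unmodified so a neighbouring element could reuse them.
        self.acc += a * b
        return a, b


def scalar_product(a_values, b_values):
    """Shift the a and b operands through one cell, one pair per step."""
    cell = SystolicCell()
    for a, b in zip(a_values, b_values):
        cell.step(a, b)
    return cell.acc  # the sum is shifted out of the accumulator


result = scalar_product([1, 2, 3], [4, 5, 6])
print(result)
```
&lt;br /&gt;
In a full array, the pair returned by step would feed the neighbouring cell on the next clock cycle; here only one cell is modelled.&lt;br /&gt;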
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These variants are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======Programmable Systolic Array======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable either at a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 4: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
For the latter approach, the workstation downloads a program to each MIMD (Figure 4) systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: It contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, the architecture has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be classified as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2].&lt;br /&gt;
&lt;br /&gt;
======Reconfigurable Systolic Array======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 10: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 10]. The homogeneous characteristic of the Reconfigurable Systolic Array (RSA) architecture, where each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, with the former acting as arithmetic processors and the latter as flexible routers linking the neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on separate data buses, which enables the circuit to continue its operation while the reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 11: Comparison between the architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
As the configurations of systolic arrays described above show, a systolic array generally has multiple processing elements, each executing different instructions from its own dedicated instruction stream. A single data stream connects the adjacent PEs. Thus, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors argue, however, that the data read as input by one processing element is the processed output of the adjacent PE, so the data stream cannot be considered single: the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 11] shows the difference between the data stream for systolic arrays and for the MISD architecture. By this argument, the systolic array should be considered a “Multiple Data” architecture and not a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc., and software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 12: MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance in computations can be implemented by multiple processors (likely with different architectures) executing the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority voting method is used to determine the faulty processor. Thus the MISD architecture is used to obtain fault tolerance in critical computations.&lt;br /&gt;
&lt;br /&gt;
There are various examples of the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. The major examples are flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
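&lt;br /&gt;
The M-out-of-N voting step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name majority_vote, its parameters and the sample outputs are hypothetical, and outputs are compared for exact equality, which a real system with analogue data would likely replace by a tolerance check.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter


def majority_vote(outputs, m):
    """Return (agreed value, indices of disagreeing processors).

    Raises ValueError when no value is produced by at least m of the
    N processors, where N is len(outputs).
    """
    value, count = Counter(outputs).most_common(1)[0]
    # The most common value must be reported by between m and N processors.
    if count not in range(m, len(outputs) + 1):
        raise ValueError("no value reached the required agreement")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


# Hypothetical results from N = 3 processors; processor 2 disagrees.
agreed, faulty = majority_vote([7, 7, 9], m=2)
print(agreed, faulty)
```
&lt;br /&gt;
With three processors and M = 2, a single faulty processor is outvoted and identified by its index.&lt;br /&gt;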
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system replaces manual flight control with an electronic control interface. The movements of the flight controls in the cockpit are converted to electronic signals and transmitted to the actuators by wires. The control computers use the feedback from the sensors to compute and control the movement of the actuators to provide the expected response. These computers also stabilize the aircraft and perform other tasks without the knowledge of the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
There are redundant flight control computers present in the flight control system. If one of the flight-control computers crashes, gets damaged or is affected by electromagnetic pulses, the other computer can overrule the faulty one and hence the flight of the aircraft is unharmed. [[Image:fig13.png|thumb|right|250px|Figure 13: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is ruled out to be faulty and is either ignored or rebooted.[[Image:fig14.png|thumb|right|250px|Figure 14: PFC with instruction and data streams]]&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;http://www.citemaster.net/getdoc/8767/R8.pdf Y.C. (Bob) Yeh, Boeing Commercial Airplane Group, &amp;quot;Triple-Triple Redundant 777 Primary Flight Computer&amp;quot; &amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern computers, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 13].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each of them having three lanes with different processors. The flight control program is compiled for each of the processors which get the input data from the same data bus but drive the output on their individual control bus. Thus each processor executes different instructions but they process the same data. Thus, it is the best suited example of Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. As the outputs of each lane can be different, the median value of the outputs is used to select the output of the lane to be considered. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism before the fault detection and identification by the cross-lane monitoring system. Thus, the MISD based multi computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
&lt;br /&gt;
The system described above clearly has individual instruction streams: the architecture of each processor is different, so the processors have different instruction sets and different instruction streams. These processors have frame synchronized input data, which means they work on the same set of data, fed from a single data stream. Thus the flight control system can be classified as an MISD architecture.&lt;br /&gt;
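&lt;br /&gt;
The median select and cross-lane monitoring ideas described above can be sketched as follows. This is an illustrative sketch only, not the 777 implementation: the function names, the sample lane outputs and the tolerance value are all assumptions made for this example.&lt;br /&gt;
&lt;br /&gt;
```python
import math
import statistics


def median_select(lane_outputs):
    """Return the median of the lane outputs as the command value."""
    return statistics.median(lane_outputs)


def disagreeing_lanes(lane_outputs, tolerance):
    """List the lanes whose output is not within tolerance of the median."""
    chosen = median_select(lane_outputs)
    return [
        lane
        for lane, value in enumerate(lane_outputs)
        if not math.isclose(value, chosen, rel_tol=0.0, abs_tol=tolerance)
    ]


# Hypothetical actuator commands from three lanes; lane 2 is faulty.
outputs = [10.0, 10.2, 55.0]
command = median_select(outputs)
suspects = disagreeing_lanes(outputs, tolerance=0.5)
print(command, suspects)
```
&lt;br /&gt;
Because the median of three values is never the outlier, the faulty output is blocked before any monitoring logic has identified which lane produced it.&lt;br /&gt;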
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor [9], the image pre-processing time needed to be made shorter. To reduce the pre-processing time, different types of image processing architectures for the ‘C80 were presented. For example, the MIMD architecture is used for image histogram equalization, the SIMD architecture for median filtering, and the MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows: each processor performs a different operation on the same data set. It is often said that the MISD architecture was defined only for the sake of completeness of the classification, because it is difficult in practice to find a computer based on it [10]. In image processing, however, this need not hold: performing many different operations on the same image is not rare, so the MISD architecture is a good fit for such cases. As an example of implementing image processing operations on the MISD architecture, consider edge detection. The edges are detected in four different directions (north, south, west and east) by means of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
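&lt;br /&gt;
The MISD pattern described above, four different masks applied to the same image, can be sketched as follows. The source does not give the actual filter masks, so the 3x3 difference masks, the direction naming and the function names below are assumptions for illustration; each dictionary entry stands in for the work of one parallel processor.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical 3x3 directional difference masks (not from the source).
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def apply_mask(image, mask):
    """Slide a 3x3 mask over the image interior (no border padding)."""
    rows, cols = len(image), len(image[0])
    out = [[0] * (cols - 2) for _ in range(rows - 2)]
    for r in range(rows - 2):
        for c in range(cols - 2):
            out[r][c] = sum(
                mask[i][j] * image[r + i][c + j]
                for i in range(3)
                for j in range(3)
            )
    return out


def directional_edges(image):
    """Apply every mask to the same image: one data set, many operations."""
    return {name: apply_mask(image, mask) for name, mask in MASKS.items()}


# A tiny image with a bright column on its right-hand side.
image = [[0, 0, 9], [0, 0, 9], [0, 0, 9]]
edges = directional_edges(image)
print(edges)
```
&lt;br /&gt;
The four results are independent of each other, which is why the four masks could be evaluated by four processors at the same time.&lt;br /&gt;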
&lt;br /&gt;
Image processing in a computer is very often a multistage process. An example is an algorithm that detects the contours of the objects in an image, which is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, during which it is smoothed and noise is eliminated. In the next step, thresholding is performed, and each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering in order to eliminate single black pixels placed on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of this pipeline is the master processor, which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors, as listed in Table 1.&lt;br /&gt;
&lt;br /&gt;
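{| class=&quot;wikitable&quot;&lt;br /&gt;
|+ Table 1: Pipeline structure for image edge detection [9]&lt;br /&gt;
! Stage !! Processor !! Operation&lt;br /&gt;
|-&lt;br /&gt;
| 1 || MP || Image acquisition&lt;br /&gt;
|-&lt;br /&gt;
| 2 || PP0 || Low-pass Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 3 || PP1 || Thresholding&lt;br /&gt;
|-&lt;br /&gt;
| 4 || PP2 || Laplace Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 5 || PP3 || Logic Filtering&lt;br /&gt;
|}&lt;br /&gt;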
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''CU''': Control Unit&lt;br /&gt;
&lt;br /&gt;
*'''DS''': Data Stream&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''Heterogeneous System''': A multiprocessor system with different kinds of processors&lt;br /&gt;
&lt;br /&gt;
*'''Homogeneous System''': A multiprocessor system with the same kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''ILP''': Instruction Level Parallelism&lt;br /&gt;
&lt;br /&gt;
*'''IS''': Instruction Stream&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SAPO''': Short for Samočinný počítač&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''VLSI''': Very Large Scale Integration&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71774</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71774"/>
		<updated>2013-02-04T22:48:55Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* MISD Architectures within the Context of Flynn’s Taxonomyhttp://en.wikipedia.org/wiki/Flynn's_taxonomyhttp://www.phy.ornl.gov/csep/ca/node11.html */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
==MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of information streams that flow into the processor.&lt;br /&gt;
According to Flynn's Taxonomy, there are two kinds of logical streams of information that flow into a parallel computer: the Instruction stream and the Data stream.  Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent.&lt;br /&gt;
Based on how many of each type of stream flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD).&lt;br /&gt;
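&lt;br /&gt;
The classification rule above can be written as a small function. This is only an illustrative sketch; the function name flynn_class and its parameters are hypothetical, and both stream counts are assumed to be positive integers.&lt;br /&gt;
&lt;br /&gt;
```python
def flynn_class(instruction_streams, data_streams):
    """Name the Flynn category for the given numbers of logical streams.

    Both counts are assumed to be at least one.
    """
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"


# Three instruction streams operating on one data stream.
category = flynn_class(instruction_streams=3, data_streams=1)
print(category)
```
&lt;br /&gt;
Only the distinction between one stream and more than one matters to the taxonomy, so the exact counts are not part of the result.&lt;br /&gt;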
&lt;br /&gt;
The four classifications defined by Flynn, which are based upon the number of concurrent instruction (or control) and data streams available in the architecture, are shown in Figure 1.&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
==Overview of the MISD Architecture==&lt;br /&gt;
&lt;br /&gt;
To be categorized as an MISD architecture, a machine must have at least two Instruction streams and only one Data stream, as shown in Figure 2.  As per the description in Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, none of which are commercial, general-purpose parallel computers.  Instead, MISD architectures are mainly used to create custom hardware in order to solve specific problems.  The majority of the implementations have been created in academia.&lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy.  One common example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction.  It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline.  Similarly, other architectures or applications exist which may be called MISD by some while others disagree.  The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type. Not many instances of this architecture exist, as MIMD and SIMD are often more appropriate for common data parallel techniques. Specifically, they allow better scaling and use of computational resources than MISD does. &lt;br /&gt;
&lt;br /&gt;
However, one prominent example of MISD in computing is the Space Shuttle flight control computers.  Another example of this class of machine is the systolic array, such as the [http://www.cs.cmu.edu/~iwarp/ CMU iWarp] [BORKAR et al., 1990].  All the elements in this array are controlled by a global clock. On each cycle, an element will read a piece of data from one of its neighbors, perform a simple operation (e.g. add the incoming element to a stored value), and prepare a value to be written to a neighbor on the next step.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions.  At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Types of Systolic Arrays&amp;lt;ref&amp;gt;http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf General Purpose Systolic Arrays &amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 7: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is in matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These variants are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======Programmable Systolic Array======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable either at a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 9: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
For the latter approach, the workstation downloads a program to each MIMD (Figure 9) systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: it contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, the architecture has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be classified as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2].&lt;br /&gt;
&lt;br /&gt;
======Reconfigurable Systolic Array======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 10: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 10]. The homogeneous characteristic of the Reconfigurable Systolic Array (RSA) architecture, where each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, with the former acting as arithmetic processors and the latter as flexible routers linking the neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on separate data buses, which enables the circuit to continue its operation while the reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 11: Comparison between the architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
As the configurations of systolic arrays described above show, a systolic array generally has multiple processing elements, each executing different instructions from its own dedicated instruction stream. A single data stream connects the adjacent PEs. Thus, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors argue, however, that the data read as input by one processing element is the processed output of the adjacent PE, so the data stream cannot be considered single: the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 11] shows the difference between the data stream for systolic arrays and for the MISD architecture. By this argument, the systolic array should be considered a “Multiple Data” architecture and not a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault-tolerant systems are designed to handle possible failures in software, hardware, or interfaces. Hardware faults include hard disk failures and input or output device failures; software and interface faults include driver failures, operator errors, and the installation of unexpected software. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 12. MISD as a fault-tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault-tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance for computations can be implemented by having multiple processors (likely with different architectures) execute the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority voting method is used to identify the faulty processor. The MISD architecture is thus used to obtain fault tolerance for critical computations.&lt;br /&gt;
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault-tolerant architecture. Major examples include flight control systems, nuclear power plants, satellite systems, and super collider experiment systems. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system replaces manual flight controls with an electronic control interface. The movements of the flight controls in the cockpit are converted to electronic signals and transmitted to the actuators by wires. The control computers use feedback from the sensors to compute and control the movement of the actuators so that they provide the expected response. These computers also stabilize the aircraft and perform other tasks without input from the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
The flight control system contains redundant flight control computers. If one of the flight control computers crashes, is damaged, or is affected by electromagnetic pulses, the others can overrule the faulty one, so the flight of the aircraft is not harmed. [[Image:fig13.png|thumb|right|250px|Figure 13: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.[[Image:fig14.png|thumb|right|250px|Figure 14: PFC with instruction and data streams]]&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;http://www.citemaster.net/getdoc/8767/R8.pdf Y.C. (Bob) Yeh, Boeing Commercial Airplane Group, &amp;quot;Triple-Triple Redundant 777 Primary Flight Computer&amp;quot; &amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern flight control systems, the redundant flight control computations are carried out by multiprocessor systems. The triple-redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 13].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each with three lanes that use different processors. The flight control program is compiled separately for each processor. All processors receive their input data from the same data bus, but each drives its output onto its own control bus. Each processor therefore executes different instructions while processing the same data, which makes the system a well-suited example of the Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data-synchronized with the other lanes so that all lanes read the same frame of data from the flight sensors. Because the outputs of the lanes can differ, the median value of the outputs is used to select which lane's output is used. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault-blocking mechanism that acts before fault detection and identification by the cross-lane monitoring system. The MISD-based multi-computer architecture is thus capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
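&lt;br /&gt;
The median select described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual PFC implementation: the function name &lt;code&gt;median_select&lt;/code&gt; and the sample values are hypothetical, and the real system performs the selection in dedicated hardware.&lt;br /&gt;

```python
import statistics


def median_select(lane_outputs):
    """Return (index, value) of the lane whose output is the median.

    lane_outputs: proposed commands from an odd number of lanes
    (hypothetical values; one entry per lane).
    """
    if len(lane_outputs) % 2 == 0:
        raise ValueError("expected an odd number of lanes")
    median = statistics.median(lane_outputs)
    return lane_outputs.index(median), median


# Lane 1 has drifted high; the median select blocks its value.
print(median_select([10.2, 97.0, 10.4]))  # (2, 10.4)
```

Because the median of three values is never the outlier, a single faulty lane cannot drive the selected output, which is the fault-blocking property the paragraph refers to.&lt;br /&gt;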
&lt;br /&gt;
The system described above clearly has separate instruction streams: because each processor has a different architecture, each has a different instruction set and therefore a different instruction stream. The processors have frame-synchronized input data, which means they work on the same set of data, fed from a single data stream. The flight control system can therefore be classified as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor [9], the pre-processing time needed to be shortened. To that end, different types of image processing architectures for the ‘C80 were presented: for example, an MIMD architecture for image histogram equalization, an SIMD architecture for median filtering, and an MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows: each processor performs a different operation on the same data set. It is often said that the MISD architecture was defined only for completeness of the classification, because in practice it is difficult to find a computer based on it [10]. In image processing, however, this assumption need not hold: performing many different operations on the same image is not rare, so the MISD architecture is a good fit for such cases. Directional edge detection serves as an example of implementing image processing operations on the MISD architecture. The edges are detected in four different directions (north, south, west and east) using four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
&lt;br /&gt;
Image processing on a computer is very often a multistage process. One example is an algorithm that detects the contours of objects in an image, which is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and eliminates noise. In the next step, thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background.  To run these operations in parallel, a five-stage pipeline was used. The master processor forms the first stage of the pipeline and controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors, as listed in Table 1.&lt;br /&gt;
&lt;br /&gt;
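{| class="wikitable"&lt;br /&gt;
|+ Table 1: Pipeline structure for image edge detection [9]&lt;br /&gt;
! Stage !! Processor !! Operation&lt;br /&gt;
|-&lt;br /&gt;
| 1 || MP || Image acquisition&lt;br /&gt;
|-&lt;br /&gt;
| 2 || PP0 || Low-pass Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 3 || PP1 || Thresholding&lt;br /&gt;
|-&lt;br /&gt;
| 4 || PP2 || Laplace Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 5 || PP3 || Logic Filtering&lt;br /&gt;
|}&lt;br /&gt;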
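&lt;br /&gt;
To make the pipeline structure in Table 1 concrete, the following Python sketch prints which frame each processor would be working on at each time step. It assumes, purely for illustration, that every stage takes exactly one time step per frame; the function name &lt;code&gt;pipeline_schedule&lt;/code&gt; is hypothetical and the source does not give stage timings.&lt;br /&gt;

```python
STAGES = ["MP", "PP0", "PP1", "PP2", "PP3"]  # processors from Table 1


def pipeline_schedule(num_frames):
    """Return, per time step, which frame each processor is working on.

    Assumption for this sketch: every stage needs one time step per frame.
    A value of None means the processor is idle in that step.
    """
    ticks = num_frames + len(STAGES) - 1
    schedule = []
    for t in range(ticks):
        row = {}
        for s, name in enumerate(STAGES):
            frame = t - s  # stage s sees frame f at time f + s
            row[name] = frame if frame in range(num_frames) else None
        schedule.append(row)
    return schedule


for tick, row in enumerate(pipeline_schedule(3)):
    print(tick, row)
```

Once the pipeline is full, all five processors are busy at the same time, each applying a different operation to a different frame.&lt;br /&gt;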
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''CU''': Control Unit&lt;br /&gt;
&lt;br /&gt;
*'''DS''': Data Stream&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''Heterogeneous System''': A multiprocessor system with different kinds of processors&lt;br /&gt;
&lt;br /&gt;
*'''Homogeneous System''': A multiprocessor system with the same kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''ILP''': Instruction Level Parallelism&lt;br /&gt;
&lt;br /&gt;
*'''IS''': Instruction Stream&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SAPO''': Short for Samočinný počítač&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''VLSI''': Very Large Scale Integration&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71771</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71771"/>
		<updated>2013-02-04T22:47:31Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Implementations of MISD architecture */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
==MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of information streams that flow into the processor.&lt;br /&gt;
According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the Instruction stream and the Data stream.  Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether or not the streams are physically independent of each other on the hardware, only that the streams can be logically treated as independent.&lt;br /&gt;
Based on how many of each type of stream flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD).&lt;br /&gt;
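&lt;br /&gt;
As a minimal illustration of this classification rule, the following Python sketch maps the number of logical instruction and data streams to one of the four categories. The function name &lt;code&gt;flynn_category&lt;/code&gt; is hypothetical and chosen only for this example.&lt;br /&gt;

```python
def flynn_category(instruction_streams, data_streams):
    """Return the Flynn category for the given logical stream counts.

    Hypothetical helper for illustration: only the single-vs-multiple
    distinction matters, not the exact counts.
    """
    if not (instruction_streams >= 1 and data_streams >= 1):
        raise ValueError("each kind of stream must occur at least once")
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return instr + "I" + data + "D"


# Three instruction streams operating on one data stream.
print(flynn_category(3, 1))  # MISD
```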
&lt;br /&gt;
The four classifications defined by Flynn, based upon the number of concurrent instruction (or control) and data streams available in the architecture, are shown in Figure 1.&lt;ref&gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&lt;/ref&gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]] &lt;br /&gt;
&lt;br /&gt;
==Overview of the MISD Architecture==&lt;br /&gt;
&lt;br /&gt;
To be categorized as an MISD architecture, a machine must have at least two Instruction streams and exactly one Data stream, as shown in Figure 2.  According to Flynn’s Taxonomy, each processing element should receive the same data and execute a different instruction on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, and none of them are commercial, general-purpose parallel computers.  Instead, MISD architectures are mainly used to create custom hardware that solves specific problems.  The majority of the implementations have been created in academia.&lt;br /&gt;
Over the years, several architectures have been described as MISD even though the architectures do not completely follow the description in Flynn’s Taxonomy.  One common example is the Pipeline architecture where the data is passed from processing element to processing element and each processing element executes a different instruction.  It is often argued that this is not a true MISD architecture as the data changes as it moves through the pipeline.  Similarly, other architectures or applications exist which may be called MISD by some while others disagree.  The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type. Not many instances of this architecture exist, as MIMD and SIMD are often more appropriate for common data parallel techniques. Specifically, they allow better scaling and use of computational resources than MISD does. &lt;br /&gt;
&lt;br /&gt;
However, one prominent example of MISD in computing is the Space Shuttle flight control computers.  Another example of this class of machine is the systolic array, such as the [http://www.cs.cmu.edu/~iwarp/ CMU iWarp] [BORKAR et al., 1990].  All the elements in this array are controlled by a global clock. On each cycle, an element reads a piece of data from one of its neighbors, performs a simple operation (e.g. adds the incoming element to a stored value), and prepares a value to be written to a neighbor on the next step.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions.  At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Type of Systolic Arrays&amp;lt;ref&amp;gt;http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf General Purpose Systolic Arrays &amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 7: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
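&lt;br /&gt;
The behavior of a single systolic element computing the sum of a scalar product can be modeled roughly as follows. This is a software sketch under simplifying assumptions, not a description of any particular hardware; the class name &lt;code&gt;SystolicCell&lt;/code&gt; and the function &lt;code&gt;scalar_product&lt;/code&gt; are hypothetical.&lt;br /&gt;

```python
class SystolicCell:
    """One hardwired multiply-accumulate cell (illustrative model only)."""

    def __init__(self):
        self.accumulator = 0

    def step(self, a, b):
        # Accumulate the product, then pass a and b on unmodified
        # so the neighboring cell can use them on the next step.
        self.accumulator += a * b
        return a, b


def scalar_product(a_values, b_values):
    cell = SystolicCell()
    for a, b in zip(a_values, b_values):
        cell.step(a, b)
    # After the last pair, the sum of products is shifted out.
    return cell.accumulator


print(scalar_product([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```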
&lt;br /&gt;
=====General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These combinations are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
======Programmable Systolic Array======&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 9: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In the MIMD approach, the workstation downloads a program to each MIMD systolic cell (Figure 9). Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: it contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, it has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be defined as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2.]&lt;br /&gt;
&lt;br /&gt;
======Reconfigurable Systolic Array======&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 10: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 10]. The Reconfigurable Systolic Array (RSA) architecture is homogeneous: each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements. This enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, the former as arithmetic processors and the latter as flexible routers linking neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on separate data buses, which enables the circuit to continue operating while reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 11. Comparison between the architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
In the systolic array configurations described above, multiple processing elements generally execute different instructions, each from its own dedicated instruction stream, while a single data stream connects adjacent PEs. On this view, a systolic array can be classified as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors argue, however, that the data read as input by one processing element is the processed output of the adjacent PE. The data stream therefore cannot be considered single, because the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 11] shows the difference between the data stream of a systolic array and that of the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture rather than a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault-tolerant systems are designed to handle possible failures in software, hardware, or interfaces. Hardware faults include hard disk failures and input or output device failures; software and interface faults include driver failures, operator errors, and the installation of unexpected software. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 12. MISD as a fault-tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault-tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance for computations can be implemented by having multiple processors (likely with different architectures) execute the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority voting method is used to identify the faulty processor. The MISD architecture is thus used to obtain fault tolerance for critical computations.&lt;br /&gt;
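&lt;br /&gt;
The M-out-of-N voting step can be sketched as follows. This is a simplified illustration, assuming that processor outputs can be compared for exact equality; the function name &lt;code&gt;majority_vote&lt;/code&gt; and the sample values are hypothetical.&lt;br /&gt;

```python
from collections import Counter


def majority_vote(outputs, m):
    """Return (agreed_value, faulty_indices) for an M-out-of-N vote.

    outputs: one result per redundant processor (N = len(outputs)).
    m: minimum number of processors that must agree.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if not count >= m:
        raise RuntimeError("no value reached the required agreement")
    # Every processor that disagrees with the agreed value is flagged.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


# Three processors run different programs on the same input; one disagrees.
print(majority_vote([42, 42, 41], m=2))  # (42, [2])
```

With N = 3 and M = 2, a single faulty processor is both outvoted and identified by its index.&lt;br /&gt;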
&lt;br /&gt;
There are various examples of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault-tolerant architecture. Major examples include flight control systems, nuclear power plants, satellite systems, and super collider experiment systems. Here, the flight control system is explained as an example of [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system replaces manual flight controls with an electronic control interface. The movements of the flight controls in the cockpit are converted to electronic signals and transmitted to the actuators by wires. The control computers use feedback from the sensors to compute and control the movement of the actuators so that they provide the expected response. These computers also stabilize the aircraft and perform other tasks without input from the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
The flight control system contains redundant flight control computers. If one of the flight control computers crashes, is damaged, or is affected by electromagnetic pulses, the others can overrule the faulty one, so the flight of the aircraft is not harmed. [[Image:fig13.png|thumb|right|250px|Figure 13: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.[[Image:fig14.png|thumb|right|250px|Figure 14: PFC with instruction and data streams]]&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;http://www.citemaster.net/getdoc/8767/R8.pdf Y.C. (Bob) Yeh, Boeing Commercial Airplane Group, &amp;quot;Triple-Triple Redundant 777 Primary Flight Computer&amp;quot; &amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern flight control systems, the redundant flight control computations are carried out by multiprocessor systems. The triple-redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 13].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each with three lanes that use different processors. The flight control program is compiled separately for each processor. All processors receive their input data from the same data bus, but each drives its output onto its own control bus. Each processor therefore executes different instructions while processing the same data, which makes the system a well-suited example of the Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data-synchronized with the other lanes so that all lanes read the same frame of data from the flight sensors. Because the outputs of the lanes can differ, the median value of the outputs is used to select which lane's output is used. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault-blocking mechanism that acts before fault detection and identification by the cross-lane monitoring system. The MISD-based multi-computer architecture is thus capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
&lt;br /&gt;
The system described above clearly has separate instruction streams: because each processor has a different architecture, each has a different instruction set and therefore a different instruction stream. The processors have frame-synchronized input data, which means they work on the same set of data, fed from a single data stream. The flight control system can therefore be classified as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
===Edge Detection of Images on TMS320C80 Multiprocessor System===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor [9], the pre-processing time needed to be shortened. To that end, different types of image processing architectures for the ‘C80 were presented: for example, an MIMD architecture for image histogram equalization, an SIMD architecture for median filtering, and an MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows: each processor performs a different operation on the same data set. It is often said that the MISD architecture was defined only for completeness of the classification, because in practice it is difficult to find a computer based on it [10]. In image processing, however, this assumption need not hold: performing many different operations on the same image is not rare, so the MISD architecture is a good fit for such cases. Directional edge detection serves as an example of implementing image processing operations on the MISD architecture. The edges are detected in four different directions (north, south, west and east) using four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
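&lt;br /&gt;
The idea of applying four different masks to the same image can be sketched in plain Python. The source does not list the filter masks, so the 3x3 Prewitt-style compass masks below are an assumption, as are the names &lt;code&gt;MASKS&lt;/code&gt;, &lt;code&gt;apply_mask&lt;/code&gt; and &lt;code&gt;directional_edges&lt;/code&gt;. The sketch runs the masks one after another, whereas on the TMS320C80 each mask would be handled by its own parallel processor.&lt;br /&gt;

```python
# Assumed 3x3 Prewitt-style compass masks; the source does not list the masks.
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def apply_mask(image, mask):
    """Correlate a 3x3 mask with the image interior (borders are skipped)."""
    rows, cols = len(image), len(image[0])
    out = [[0] * (cols - 2) for _ in range(rows - 2)]
    for r in range(rows - 2):
        for c in range(cols - 2):
            out[r][c] = sum(
                mask[i][j] * image[r + i][c + j]
                for i in range(3)
                for j in range(3)
            )
    return out


def directional_edges(image):
    # Same input image for every mask: one data stream, four different
    # operations, producing one edge image per direction.
    return {name: apply_mask(image, mask) for name, mask in MASKS.items()}


# A 4x4 image with a vertical step: dark left half, bright right half.
image = [[0, 0, 9, 9]] * 4
edges = directional_edges(image)
print(edges["east"])   # [[27, 27], [27, 27]]
print(edges["north"])  # [[0, 0], [0, 0]]
```

The vertical step produces a strong response in the east and west masks and none in the north and south masks, which is why each direction is kept as a separate output image.&lt;br /&gt;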
&lt;br /&gt;
Image processing on a computer is very often a multistage process. One example is an algorithm that detects the contours of the objects in an image, which is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and eliminates noise. In the next step, thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of the pipeline is the master processor, which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors, as listed in Table 1.&lt;br /&gt;
&lt;br /&gt;
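{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+ Table 1: Pipeline structure for image edge detection [9]&lt;br /&gt;
! Stage !! Processor !! Operation&lt;br /&gt;
|-&lt;br /&gt;
| 1 || MP || Image acquisition&lt;br /&gt;
|-&lt;br /&gt;
| 2 || PP0 || Low-pass Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 3 || PP1 || Thresholding&lt;br /&gt;
|-&lt;br /&gt;
| 4 || PP2 || Laplace Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 5 || PP3 || Logic Filtering&lt;br /&gt;
|}&lt;br /&gt;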
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''CU''': Control Unit&lt;br /&gt;
&lt;br /&gt;
*'''DS''': Data Stream&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''Heterogeneous System''': A multiprocessor system with different kinds of processors&lt;br /&gt;
&lt;br /&gt;
*'''Homogeneous System''': A multiprocessor system with the same kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''ILP''': Instruction Level Parallelism&lt;br /&gt;
&lt;br /&gt;
*'''IS''': Instruction Stream&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SAPO''': Short for Samočinný počítač&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''VLSI''': Very Large Scale Integration&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71769</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71769"/>
		<updated>2013-02-04T22:46:00Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Overview of the MISD Architecture */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
==MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of information streams that flow into the processor.&lt;br /&gt;
According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the Instruction Stream and the Data Stream.  Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether the streams are physically independent of each other in the hardware, only that they can be logically treated as independent.&lt;br /&gt;
Based on how many streams of each type flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD).&lt;br /&gt;
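&lt;br /&gt;
The classification rule above can be sketched in a few lines of Python. This is only an illustration: the function name flynn_category and its parameters are hypothetical and do not come from Flynn's work or from any library.&lt;br /&gt;
&lt;br /&gt;
```python
def flynn_category(instruction_streams: int, data_streams: int) -> str:
    """Map logical stream counts to a Flynn's Taxonomy category name."""
    if not (instruction_streams >= 1 and data_streams >= 1):
        raise ValueError("each stream count must be at least 1")
    # "S" for a single stream, "M" for multiple streams.
    instruction_part = "S" if instruction_streams == 1 else "M"
    data_part = "S" if data_streams == 1 else "M"
    return f"{instruction_part}I{data_part}D"


if __name__ == "__main__":
    # Three instruction streams operating on one data stream.
    print(flynn_category(3, 1))
```
Only the logical number of streams is passed in, which mirrors the note above that physical independence of the streams does not matter for the classification.&lt;br /&gt;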
&lt;br /&gt;
The four classifications defined by Flynn, which are based upon the number of concurrent instruction (or control) and data streams available in the architecture, are shown in Figure 1.&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]] &lt;br /&gt;
&lt;br /&gt;
==Overview of the MISD Architecture==&lt;br /&gt;
&lt;br /&gt;
To be categorized as an MISD architecture, a system must have at least two Instruction streams and exactly one Data stream, as shown in Figure 2.  According to Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, and none of them are commercial, general-purpose parallel computers.  Instead, MISD architectures are mainly used to create custom hardware that solves specific problems.  The majority of the implementations have been created in academia.&lt;br /&gt;
Over the years, several architectures have been described as MISD even though they do not completely follow the description in Flynn’s Taxonomy.  One common example is the pipeline architecture, where the data is passed from processing element to processing element and each processing element executes a different instruction.  It is often argued that this is not a true MISD architecture because the data changes as it moves through the pipeline.  Similarly, other architectures or applications exist that some call MISD while others disagree.  The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type. Not many instances of this architecture exist, as MIMD and SIMD are often more appropriate for common data parallel techniques. Specifically, they allow better scaling and use of computational resources than MISD does. &lt;br /&gt;
&lt;br /&gt;
However, one prominent example of MISD in computing is the Space Shuttle flight control computers.  Another example of such a machine is the systolic array, such as the [http://www.cs.cmu.edu/~iwarp/ CMU iWarp] [BORKAR et al., 1990].  All the elements in this array are controlled by a global clock. On each cycle, an element reads a piece of data from one of its neighbors, performs a simple operation (e.g. adds the incoming element to a stored value), and prepares a value to be written to a neighbor on the next step.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===1. Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions.  At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====1.1 Types of Systolic Arrays&amp;lt;ref&amp;gt;http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf General Purpose Systolic Arrays &amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====1.1.1 Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 7: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is in matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
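&lt;br /&gt;
The behaviour of the single systolic element described above can be sketched in software. This is a minimal simulation under stated assumptions, not a hardware description: the class SystolicElement and the helper scalar_product are hypothetical names introduced only for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
class SystolicElement:
    """One processing element that accumulates the products a * b."""

    def __init__(self):
        self.accumulator = 0

    def step(self, a, b):
        # Multiply-accumulate, then pass both operands on unmodified
        # so that a neighbouring element could consume them.
        self.accumulator += a * b
        return a, b


def scalar_product(a_values, b_values):
    """Shift paired operands through one element, one pair per clock step."""
    element = SystolicElement()
    for a, b in zip(a_values, b_values):
        element.step(a, b)
    # After the last step the sum of products is shifted out.
    return element.accumulator


if __name__ == "__main__":
    print(scalar_product([1, 2, 3], [4, 5, 6]))
```
Each call to step stands for one tick of the global clock; in a full array the returned operands would be fed to the neighbouring elements on the next tick.&lt;br /&gt;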
&lt;br /&gt;
=====1.1.2 General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These combinations are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
=====1.1.2.1 Programmable Systolic Array=====&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 9: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
For the latter approach, the workstation downloads a program to each MIMD (Figure 9) systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: it contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, it has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be described as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2.]&lt;br /&gt;
&lt;br /&gt;
=====1.1.2.2 Reconfigurable Systolic Array=====&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 10: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 10]. The homogeneous characteristic of the Reconfigurable Systolic Array (RSA) architecture, where each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and SWs can be reconfigured dynamically, the former as arithmetic processors and the latter as flexible routers linking the neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on a separate data bus, which enables the circuit to continue its operation while the reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====1.2 Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 11.Comparison between Architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
From the configurations of systolic arrays described above, it can be seen that they generally have multiple processing elements, each executing different instructions from its own dedicated instruction stream. A single data stream connects the adjacent PEs. Thus, a systolic array can be defined as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors argue, however, that because the data read as input by one processing element is the processed output of the adjacent PE, the data stream cannot be considered single: the data paths do not all carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 11] shows the difference between the data stream of systolic arrays and that of the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture and not a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===2. Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc.; software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 12 MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance in computations can be implemented by multiple processors (likely with different architectures) executing the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority voting method is used to determine the faulty processor. Thus the MISD architecture is used to achieve fault tolerance in critical computations.&lt;br /&gt;
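&lt;br /&gt;
The M-out-of-N voting step can be illustrated with a short Python sketch. It assumes that the processor outputs can be compared for exact equality, which real systems with analog or floating-point outputs generally cannot rely on; the function name majority_vote and its parameters are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter


def majority_vote(outputs, m):
    """Return (agreed_value, faulty_indices) if at least m outputs agree.

    outputs: one result per processor, all computed from the same data.
    m: minimum number of matching outputs required (the M in M-out-of-N).
    """
    value, count = Counter(outputs).most_common(1)[0]
    if m > count:
        raise RuntimeError("no value reached the required agreement")
    # Processors whose output differs from the agreed value are suspect.
    faulty = [index for index, out in enumerate(outputs) if out != value]
    return value, faulty


if __name__ == "__main__":
    # Three processors, 2-out-of-3 voting; processor 2 disagrees.
    print(majority_vote([42, 42, 41], 2))
```
With three processors and m set to 2, a single faulty processor is outvoted and identified, while a three-way disagreement is reported as an error rather than resolved arbitrarily.&lt;br /&gt;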
&lt;br /&gt;
There are various examples of the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. Major examples include flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====2.1 The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system is used to replace the manual flight control by an electronic control interface. The movements of the flight control in the cockpit are converted to electronic signals and are transmitted to the actuators by wires. The control computers use the feedback from the sensors to compute and control the movement of the actuators to provide the expected response. These computers also perform the task to stabilize the aircraft and perform other tasks without the knowledge of the pilot. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
The flight control system contains redundant flight control computers. If one of the flight control computers crashes, gets damaged or is affected by electromagnetic pulses, another computer can overrule the faulty one, and hence the flight of the aircraft is unharmed. [[Image:fig13.png|thumb|right|250px|Figure 13: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.[[Image:fig14.png|thumb|right|250px|Figure 14: PFC with instruction and data streams]]&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;http://www.citemaster.net/getdoc/8767/R8.pdf Y.C. (Bob) Yeh, Boeing Commercial Airplane Group, &amp;quot;Triple-Triple Redundant 777 Primary Flight Computer&amp;quot; &amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern computers, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 13].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each of them having three lanes with different processors. The flight control program is compiled for each of the processors which get the input data from the same data bus but drive the output on their individual control bus. Thus each processor executes different instructions but they process the same data. Thus, it is the best suited example of Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data-synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. Because the outputs of the lanes can differ, the median value of the outputs is used to select which lane's output is used. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault-blocking mechanism ahead of fault detection and identification by the cross-lane monitoring system. Thus, the MISD-based multi-computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
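&lt;br /&gt;
The fault-blocking effect of a median select can be shown with a small Python sketch. It is a software illustration only, not the 777's median value select hardware; the function name median_select and the sample values are assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
def median_select(lane_outputs):
    """Return the median of the outputs proposed by exactly three lanes."""
    if len(lane_outputs) != 3:
        raise ValueError("expected outputs from exactly three lanes")
    # The middle element of the sorted outputs is the median.
    return sorted(lane_outputs)[1]


if __name__ == "__main__":
    # The second lane proposes a wildly wrong command; it is never selected.
    print(median_select([10.0, 99.0, 10.5]))
```
Because the median of three values always lies between the other two, a single lane with an erroneous output cannot drive the selected command, which is why the mechanism blocks a fault before it has been identified.&lt;br /&gt;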
&lt;br /&gt;
The system described above clearly has separate instruction streams: because the architecture of each processor is different, each has a different instruction set and therefore a different instruction stream. The processors have frame-synchronized input data, which means they work on the same set of data, fed from a single data stream. Thus the flight control system can be classified as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
===3. Edge Detection of Images on TMS320C80 Multiprocessor System===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor [9], the pre-processing time needed to be shortened. To that end, different types of image processing architectures for the ‘C80 were presented: for example, a MIMD architecture for image histogram equalization, a SIMD architecture for median filtering, and a MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows: each processor performs a different operation on the same data set. It is often stated that the MISD architecture was defined only for completeness of the classification, because in practice it is difficult to find a computer based on it [10]. In image processing, however, this assumption need not hold: performing many different operations on the same image is not unusual, and the MISD architecture is a good fit for such cases. Directional edge detection serves as an example of an image processing operation implemented on the MISD architecture. The edges are detected in four different directions (north, south, west and east) by means of four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
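&lt;br /&gt;
The idea of applying four different masks to the same image can be sketched in Python. The 3x3 masks below are common compass-style gradient masks chosen as an assumption; the source does not give the coefficients used on the TMS320C80, and the names MASKS, apply_mask and directional_edges are hypothetical. The four masks are applied one after another here, whereas on the ‘C80 each would run on its own parallel processor.&lt;br /&gt;
&lt;br /&gt;
```python
# Assumed 3x3 directional masks (illustrative coefficients only).
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def apply_mask(image, mask):
    """Correlate a 3x3 mask with the image, keeping only fully covered pixels."""
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(mask[i][j] * image[r + i][c + j]
                for i in range(3) for j in range(3))
            for c in range(cols - 2)
        ]
        for r in range(rows - 2)
    ]


def directional_edges(image):
    """Run every mask (one 'instruction stream' each) on the same image."""
    return {name: apply_mask(image, mask) for name, mask in MASKS.items()}


if __name__ == "__main__":
    # A vertical step edge: dark on the left, bright on the right.
    step_image = [[0, 0, 9, 9], [0, 0, 9, 9], [0, 0, 9, 9]]
    for name, result in directional_edges(step_image).items():
        print(name, result)
```
All four computations read the same input image and differ only in the mask they apply, which is the sense in which this workload is described as MISD.&lt;br /&gt;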
&lt;br /&gt;
Image processing on a computer is very often a multistage process. One example is an algorithm that detects the contours of the objects in an image, which is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and eliminates noise. In the next step, thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of the pipeline is the master processor, which controls the image acquisition process. The remaining pipeline stages were implemented on the four parallel processors, as listed in Table 1.&lt;br /&gt;
&lt;br /&gt;
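{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+ Table 1: Pipeline structure for image edge detection [9]&lt;br /&gt;
! Stage !! Processor !! Operation&lt;br /&gt;
|-&lt;br /&gt;
| 1 || MP || Image acquisition&lt;br /&gt;
|-&lt;br /&gt;
| 2 || PP0 || Low-pass Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 3 || PP1 || Thresholding&lt;br /&gt;
|-&lt;br /&gt;
| 4 || PP2 || Laplace Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 5 || PP3 || Logic Filtering&lt;br /&gt;
|}&lt;br /&gt;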
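&lt;br /&gt;
How the five stages of Table 1 overlap in time can be sketched with a small scheduling model in Python. It assumes that every stage takes exactly one time step per frame, which is a simplification not stated in the source; the names STAGES and schedule are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# Processor and operation for each pipeline stage, in the order of Table 1.
STAGES = [
    ("MP", "Image acquisition"),
    ("PP0", "Low-pass filtering"),
    ("PP1", "Thresholding"),
    ("PP2", "Laplace filtering"),
    ("PP3", "Logic filtering"),
]


def schedule(num_frames):
    """For each time step, map every busy processor to its frame index."""
    steps = []
    for t in range(num_frames + len(STAGES) - 1):
        busy = {}
        for stage_index, (processor, _operation) in enumerate(STAGES):
            # Stage s works on the frame that entered the pipeline s steps ago.
            frame = t - stage_index
            if frame >= 0 and num_frames > frame:
                busy[processor] = frame
        steps.append(busy)
    return steps


if __name__ == "__main__":
    for t, busy in enumerate(schedule(3)):
        print(t, busy)
```
Once the pipeline is full, all five processors work at the same time, each on a different frame, so one finished frame leaves the pipeline per time step under the one-step-per-stage assumption.&lt;br /&gt;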
&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''CU''': Control Unit&lt;br /&gt;
&lt;br /&gt;
*'''DS''': Data Stream&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''Heterogeneous System''': A multiprocessor system with different kinds of processors&lt;br /&gt;
&lt;br /&gt;
*'''Homogeneous System''': A multiprocessor system with the same kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''ILP''': Instruction Level Parallelism&lt;br /&gt;
&lt;br /&gt;
*'''IS''': Instruction Stream&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SAPO''': Short for Samočinný počítač&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''VLSI''': Very Large Scale Integration&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71768</id>
		<title>CSC/ECE 506 Spring 2013/1a sp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/1a_sp&amp;diff=71768"/>
		<updated>2013-02-04T22:43:44Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: Created page with &amp;quot;==Multiple Instruction Single Data (MISD) Architecture==  Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or mult...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Multiple Instruction Single Data (MISD) Architecture==&lt;br /&gt;
&lt;br /&gt;
Multiple Instruction Single Data, or MISD, architecture is one of the four general categories of multi-processor or multi-core architectures described by Flynn's Taxonomy.&lt;br /&gt;
&lt;br /&gt;
==MISD Architectures within the Context of Flynn’s Taxonomy&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Flynn's_taxonomy&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://www.phy.ornl.gov/csep/ca/node11.html&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Flynn's Taxonomy categorizes parallel computing architectures based on the number of information streams that flow into the processor.&lt;br /&gt;
According to Flynn's Taxonomy, two kinds of logical information streams flow into a parallel computer: the Instruction Stream and the Data Stream.  Note that for the purposes of classification using Flynn’s Taxonomy, it does not matter whether the streams are physically independent of each other in the hardware, only that they can be logically treated as independent.&lt;br /&gt;
Based on how many streams of each type flow into the parallel computer (single vs. multiple), Flynn's Taxonomy divides parallel architectures into four categories: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD).&lt;br /&gt;
&lt;br /&gt;
The four classifications defined by Flynn, which are based upon the number of concurrent instruction (or control) and data streams available in the architecture, are shown in Figure 1.&amp;lt;ref&amp;gt;https://computing.llnl.gov/tutorials/parallel_comp/#Flynn&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Flynn's Taxonomy.PNG|thumb|center|400px|Figure 1. [http://en.wikipedia.org/wiki/Michael_J._Flynn Flynn]'s Taxonomy [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]] &lt;br /&gt;
&lt;br /&gt;
===Overview of the MISD Architecture===&lt;br /&gt;
&lt;br /&gt;
To be categorized as an MISD architecture, a system must have at least two Instruction streams and exactly one Data stream, as shown in Figure 2.  According to Flynn’s Taxonomy, each processing element should receive the same data and execute different instructions on that data simultaneously.&lt;br /&gt;
&lt;br /&gt;
[[Image:MISD.PNG|thumb|right|100px|Figure 2. MISD Architecture [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-0 1]]]]&lt;br /&gt;
&lt;br /&gt;
There are very few implementations of the classical MISD architecture as described by Flynn’s Taxonomy, and none of them are commercial, general-purpose parallel computers.  Instead, MISD architectures are mainly used to create custom hardware that solves specific problems.  The majority of the implementations have been created in academia.&lt;br /&gt;
Over the years, several architectures have been described as MISD even though they do not completely follow the description in Flynn’s Taxonomy.  One common example is the pipeline architecture, where the data is passed from processing element to processing element and each processing element executes a different instruction.  It is often argued that this is not a true MISD architecture because the data changes as it moves through the pipeline.  Similarly, other architectures or applications exist that some call MISD while others disagree.  The following section looks at some of those implementations.&lt;br /&gt;
&lt;br /&gt;
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type. Not many instances of this architecture exist, as MIMD and SIMD are often more appropriate for common data parallel techniques. Specifically, they allow better scaling and use of computational resources than MISD does. &lt;br /&gt;
&lt;br /&gt;
However, one prominent example of MISD in computing is the Space Shuttle flight control computers.  Another example of such a machine is the systolic array, such as the [http://www.cs.cmu.edu/~iwarp/ CMU iWarp] [BORKAR et al., 1990].  All the elements in this array are controlled by a global clock. On each cycle, an element reads a piece of data from one of its neighbors, performs a simple operation (e.g. adds the incoming element to a stored value), and prepares a value to be written to a neighbor on the next step.&lt;br /&gt;
&lt;br /&gt;
==Implementations of MISD architecture==&lt;br /&gt;
&lt;br /&gt;
Some major implementations of MISD are in the fields of systolic arrays, pattern matching, flight control systems and fault tolerance systems. These MISD implementations are discussed below.&lt;br /&gt;
&lt;br /&gt;
===1. Systolic Array===&lt;br /&gt;
&lt;br /&gt;
A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions.  At each step, each processor takes in data from one or more neighbors, processes it and, in the next step, outputs results in the opposite direction.&lt;br /&gt;
&lt;br /&gt;
The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the [http://en.wikipedia.org/wiki/Von_Neumann_model von Neumann paradigm], instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Systolic_array&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====1.1 Type of Systolic Arrays&amp;lt;ref&amp;gt;http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf General Purpose Systolic Arrays &amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
=====1.1.1 Special-purpose systolic array=====&lt;br /&gt;
[[Image:systolic_1.png|thumb|right|250px|Figure 3: The algorithm for the sum of a scalar product, computed in systolic element [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
[[Image:systolic_2.png|thumb|right|250px|Figure 7: The systolic product of two 3x3 matrices [[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]]&lt;br /&gt;
&lt;br /&gt;
A special-purpose systolic array is an array of hardwired systolic processing elements tailored for a specific application.  Typically, many tens or hundreds of cells fit on a single chip. One of the major applications of special-purpose systolic arrays is matrix operations.  [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Systolic_1.png Figure 3] illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. Here, the a’s and b’s are synchronously shifted through the processing element and exit it unmodified, so that they are available to the next element.  The sum of the products is then shifted out of the accumulator.&lt;br /&gt;
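&lt;br /&gt;
The behavior of a single systolic element described above can be sketched in software. This is only an illustrative model, not the hardware design from the cited paper; the class name SystolicCell and the function scalar_product are hypothetical names chosen for this sketch.&lt;br /&gt;
&lt;br /&gt;
```python
class SystolicCell:
    """One hypothetical processing element: multiply, accumulate, pass inputs on."""

    def __init__(self):
        self.acc = 0  # running sum of products held in the accumulator

    def step(self, a, b):
        # Accumulate the product, then forward a and b unmodified
        # so that a neighbouring cell could use them on the next cycle.
        self.acc += a * b
        return a, b


def scalar_product(a_stream, b_stream):
    """Shift two equally long streams through a single cell."""
    cell = SystolicCell()
    for a, b in zip(a_stream, b_stream):
        cell.step(a, b)
    return cell.acc  # shifted out once both streams are exhausted


result = scalar_product([1, 2, 3], [4, 5, 6])
```
&lt;br /&gt;
In a full array, the values returned by step would feed the next cell, which is what allows a matrix product to be built from many such elements.&lt;br /&gt;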
&lt;br /&gt;
=====1.1.2 General-purpose systolic array=====&lt;br /&gt;
&lt;br /&gt;
A general-purpose systolic array is an array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.  Array topologies can be either programmable or reconfigurable.  Likewise, array cells are either programmable or reconfigurable.  These variants are referred to as systolic topologies.&lt;br /&gt;
&lt;br /&gt;
=====1.1.2.1 Programmable Systolic Array=====&lt;br /&gt;
&lt;br /&gt;
A programmable systolic array is an array of programmable systolic elements that operates in either SIMD or MIMD fashion.  Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements. Programmable systolic arrays are programmable at either a high level or a low level.  At either level, programmable arrays can be categorized as either SIMD or MIMD machines.&lt;br /&gt;
 &lt;br /&gt;
[[Image:systolic_4.png|thumb|right|250px|Figure 9: General organization of MIMD programmable linear systolic arrays [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In the MIMD approach, the workstation downloads a program to each systolic cell (Figure 9). Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: it contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization.&lt;br /&gt;
&lt;br /&gt;
This architecture is defined as a Multiple Instruction Multiple Data (MIMD) architecture in [http://home.engineering.iastate.edu/~zambreno/classes/cpre583/documents/JohHur93A.pdf 5]. However, it has multiple instruction streams for the PEs and a single data stream passing through all the PEs, so it can also be regarded as a Multiple Instruction Single Data (MISD) architecture. The classification of systolic array configurations is controversial, as explained in [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Architecture_of_systolic_arrays_as_against_MISD_architecture section 4.1.2].&lt;br /&gt;
&lt;br /&gt;
=====1.1.2.2 Reconfigurable Systolic Array=====&lt;br /&gt;
[[Image:reconfig.jpg|thumb|right|250px|Figure 10: Block Diagram of the RSA Architecture [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-4 5]]]&lt;br /&gt;
It is an array of systolic elements that can be programmed at the lowest level.  Recent gate density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable general-purpose arrays.  The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. &lt;br /&gt;
&lt;br /&gt;
The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs, as depicted in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Reconfig.jpg Figure 10]. The homogeneous structure of the Reconfigurable Systolic Array (RSA) architecture, in which each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time-shared computation of high-throughput data by individual PEs.  Both the PEs and the SWs can be reconfigured dynamically, the former as arithmetic processors and the latter as flexible routers linking the neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on a separate data bus, which enables the circuit to continue its operation while the reconfiguration is in progress.&lt;br /&gt;
&lt;br /&gt;
====1.2 Architecture of systolic arrays as against MISD architecture====&lt;br /&gt;
[[Image:comp.png|thumb|right|250px|Figure 11: Comparison between architecture of systolic arrays and MISD]]&lt;br /&gt;
&lt;br /&gt;
In the systolic array configurations described above, multiple processing elements generally execute different instructions, each from its own dedicated instruction stream, while a single data stream connects adjacent PEs. On this view, a systolic array can be classified as an MISD architecture.&lt;br /&gt;
&lt;br /&gt;
Many authors argue, however, that the data read as input by one processing element is the processed output of the adjacent PE, so the data stream cannot be considered single: the data paths do not carry the same data to all the PEs. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Comp.png Figure 11] shows the difference between the data stream of a systolic array and that of the MISD architecture. On this view, the systolic array should be considered a “Multiple Data” architecture rather than a Single Data architecture.&lt;br /&gt;
&lt;br /&gt;
===2. Fault Tolerant Systems&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Fault-tolerant_computer_system#Types_of_fault_tolerance&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Fault tolerant systems are designed to handle possible failures in software, hardware or interfaces. Hardware faults include hard disk failures, input or output device failures, etc.; software and interface faults include driver failures, operator errors, installation of unexpected software, etc. Hardware faults can be detected and identified by implementing redundant hardware and multiple backups. Software faults can be tolerated by removing program errors, by executing the software redundantly, or by implementing small programs that take over the tasks that crash or generate errors.&lt;br /&gt;
&lt;br /&gt;
[[Image:fault.png|thumb|right|250px|Figure 12: MISD as fault tolerant architecture]]&lt;br /&gt;
&lt;br /&gt;
In computer systems, the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Single_Instruction.2C_Multiple_Data_streams_.28SIMD.29 SIMD], [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD] and [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instruction.2C_Multiple_Data_streams_.28MIMD.29 MIMD] architectures facilitate the implementation of fault tolerant systems through multiple instruction streams, multiple data streams, or both. Fault tolerance in computations can be implemented by having multiple processors (likely with different architectures) execute the algorithms on the same set of data. The output of each processor is compared with that of the others, and an M-out-of-N majority voting method is used to identify the faulty processor. In this way, the MISD architecture is used to provide fault tolerance for critical computations.&lt;br /&gt;
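&lt;br /&gt;
The M-out-of-N voting step can be illustrated with a short sketch. It assumes that every redundant processor returns a directly comparable result; the function name majority_vote and its parameters are hypothetical and are not taken from any cited system.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter


def majority_vote(outputs, m):
    """Return (agreed_value, faulty_indices) for an M-out-of-N vote.

    outputs: one result per redundant processor (hypothetical interface).
    m: minimum number of processors that must agree on a value.
    """
    value, count = Counter(outputs).most_common(1)[0]
    # Reject the vote when fewer than m processors produced the winning value.
    if m not in range(1, count + 1):
        raise ValueError("no value reached the required agreement")
    # Every processor that disagrees with the agreed value is flagged as faulty.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


agreed, faulty = majority_vote([7, 7, 9], m=2)
```
&lt;br /&gt;
Here two of the three processors agree on 7, so the third is flagged; with three different outputs and m set to 2, the sketch raises an error instead of guessing.&lt;br /&gt;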
&lt;br /&gt;
There are various examples of the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture] being used as a fault tolerant architecture. The major examples are flight control systems, nuclear power plants, satellite systems, super collider experiment systems, etc. Here, the flight control system is explained as an example of the [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#Multiple_Instructions.2C_Single_Data_stream_.28MISD.29 MISD architecture].&lt;br /&gt;
&lt;br /&gt;
====2.1 The Flight Control System – MISD Example for fault tolerance====&lt;br /&gt;
&lt;br /&gt;
A [http://en.wikipedia.org/wiki/Fly-by-wire Fly-By-Wire] system replaces manual flight control with an electronic control interface. The movements of the flight controls in the cockpit are converted to electronic signals and transmitted to the actuators by wires. The control computers use feedback from the sensors to compute and control the movement of the actuators so as to provide the expected response. These computers also stabilize the aircraft and perform other tasks without the pilot's involvement. Flight control systems must meet extremely high levels of accuracy and functional integrity.&lt;br /&gt;
&lt;br /&gt;
There are redundant flight control computers present in the flight control system. If one of the flight control computers crashes, is damaged or is affected by electromagnetic pulses, another computer can overrule the faulty one, so the flight of the aircraft is not endangered. [[Image:fig13.png|thumb|right|250px|Figure 13: Architecture of triple redundant 777 primary flight computer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/1c_dm#cite_note-7 6]]]The number of redundant flight control computers is generally more than two, so that any computer whose results disagree with the others is judged to be faulty and is either ignored or rebooted.[[Image:fig14.png|thumb|right|250px|Figure 14: PFC with instruction and data streams]]&lt;br /&gt;
&lt;br /&gt;
=====Multiple Processors Implementation in Boeing 777&amp;lt;ref&amp;gt;http://www.citemaster.net/getdoc/8767/R8.pdf Y.C. (Bob) Yeh, Boeing Commercial Airplane Group, &amp;quot;Triple-Triple Redundant 777 Primary Flight Computer&amp;quot; &amp;lt;/ref&amp;gt;=====&lt;br /&gt;
&lt;br /&gt;
In modern aircraft, the redundant flight control computations are carried out by multiprocessor systems. The triple redundant 777 primary flight computer has the architecture shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Fig13.png Figure 13].&lt;br /&gt;
&lt;br /&gt;
The system has three primary flight control computers, each of which has three lanes with different processors. The flight control program is compiled for each of the processors, which get their input data from the same data bus but drive their output on individual control buses. Each processor therefore executes different instructions while processing the same data, which makes the system a well-suited example of the Multiple Instruction Single Data (MISD) architecture.&lt;br /&gt;
&lt;br /&gt;
The three processors selected for the flight control system of the [http://en.wikipedia.org/wiki/Boeing_777 Boeing 777] were the [http://en.wikipedia.org/wiki/Intel_80486 Intel 80486], [http://en.wikipedia.org/wiki/Motorola_68040 Motorola 68040] and [http://en.wikipedia.org/wiki/AMD_Am29000 AMD 29050]. The dissimilar processors lead to dissimilar interface hardware circuits and compilers. Each lane of the flight control computer is data synchronized with the other lanes so that all of the lanes read the same frame of data from the flight sensors. As the outputs of the lanes can differ, the median value of the outputs is used to select the lane output to be considered. The lane whose median value select hardware is selected is said to be in “command mode”, whereas the other lanes are said to be in “monitoring mode”.  The command lane receives the data from the other Primary Flight Computer (PFC) lanes and performs a median select of the outputs. This provides a fault blocking mechanism that acts before fault detection and identification by the cross-lane monitoring system. Thus, the MISD based multi-computer architecture is capable of detecting generic errors in compilers or in complex hardware devices, providing assurance beyond reasonable doubt of the dependability of the Fly-By-Wire system.&lt;br /&gt;
&lt;br /&gt;
The system described above clearly has individual instruction streams: the architecture of each processor is different, so each has a different instruction set and a different instruction stream. The processors have frame-synchronized input data, which means they work on the same set of data, fed from a single data stream. Thus the flight control system can be classified as an MISD architecture.&lt;br /&gt;
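&lt;br /&gt;
The median select step can be sketched as follows. This is a simplified illustration and not the actual PFC logic; the function name median_select, the tolerance parameter and the sample values are assumptions made for this example.&lt;br /&gt;
&lt;br /&gt;
```python
import math
import statistics


def median_select(lane_outputs, tolerance):
    """Pick the median lane output and list the lanes that stray from it.

    lane_outputs and tolerance are hypothetical names for this sketch.
    """
    # With an odd number of lanes the median is one of the actual lane values.
    selected = statistics.median(lane_outputs)
    # Lanes that are not within the tolerance of the median become suspects.
    suspects = [
        lane
        for lane, value in enumerate(lane_outputs)
        if not math.isclose(value, selected, rel_tol=0.0, abs_tol=tolerance)
    ]
    return selected, suspects


command, suspects = median_select([10.0, 10.2, 55.0], tolerance=0.5)
```
&lt;br /&gt;
Because the median ignores a single extreme value, the wildly wrong third lane cannot drive the selected command, which is the fault blocking effect mentioned above; the list of suspects stands in for the later cross-lane monitoring.&lt;br /&gt;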
&lt;br /&gt;
===3. Edge Detection of Images on TMS320C80 Multiprocessor System===&lt;br /&gt;
&lt;br /&gt;
In the image pre-processing system built on this multiprocessor [9], the pre-processing time needed to be shortened. To achieve this, different types of image processing architectures for the ‘C80 were presented: for example, an MIMD architecture for image histogram equalization, a SIMD architecture for median filtering, and an MISD architecture for directional edge detection.&lt;br /&gt;
&lt;br /&gt;
The MISD architecture can be explained as follows. Each processor performs a different operation on the same data set. It is often stated that the MISD architecture was defined only for completeness of the classification, because it is difficult in practice to find a computer based on it [10]. In image processing, however, this need not hold: performing many different operations on the same image is common, so the MISD architecture fits such cases well. As an example of implementing image processing operations on the MISD architecture, consider edge detection. The edges should be detected in four different directions (north, south, west and east) using four different filter masks. Each of the parallel processors of the TMS320C80 calculates one of the directional edges. The result is four images with detected edges, each in a different direction.&lt;br /&gt;
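&lt;br /&gt;
The directional edge detection described above can be sketched in software, with worker threads standing in for the four parallel processors. The 3x3 masks below are illustrative compass masks chosen for this example, since the source does not list the masks actually used; apply_mask and directional_edges are hypothetical names.&lt;br /&gt;
&lt;br /&gt;
```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative 3x3 compass masks; the source does not list the masks it used.
MASKS = {
    "north": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "south": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "west": [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    "east": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}


def apply_mask(image, mask):
    """Correlate a 3x3 mask with the image, leaving the border at zero."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(
                mask[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3)
                for j in range(3)
            )
    return out


def directional_edges(image):
    """Give every worker the same image but a different mask."""
    with ThreadPoolExecutor(max_workers=len(MASKS)) as pool:
        jobs = {name: pool.submit(apply_mask, image, mask) for name, mask in MASKS.items()}
        return {name: job.result() for name, job in jobs.items()}


image = [[9, 9, 9, 9], [9, 9, 9, 9], [0, 0, 0, 0], [0, 0, 0, 0]]
edges = directional_edges(image)
```
&lt;br /&gt;
The test image has a bright upper half and a dark lower half, so the north and south masks respond strongly while the west and east masks give zero at interior pixels. Each worker runs different operations on the same input, which mirrors the MISD arrangement.&lt;br /&gt;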
&lt;br /&gt;
Image processing on a computer is very often a multistage process. Consider, for example, an algorithm that detects the contours of the objects in an image; it is composed of four basic image processing operations. In the first step, the image acquired from the camera undergoes low-pass filtering, which smooths it and eliminates noise. In the next step, thresholding is performed: each pixel becomes black or white depending on whether its grey level is below or above the threshold value. The result is a binary image, from which edges are detected using the Laplace filter. In the last step, the image with detected contours undergoes logic filtering to eliminate single black pixels on the white background.  To run these operations in parallel, a five-stage pipeline was used. The first stage of the pipeline is the master processor, which controls the image acquisition process. The remaining pipeline stages were implemented on the 4 parallel processors, as shown in Table 1.&lt;br /&gt;
&lt;br /&gt;
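{| class=&quot;wikitable&quot;&lt;br /&gt;
|+ Table 1: Pipeline structure for image edge detection [9]&lt;br /&gt;
! Stage !! Processor !! Operation&lt;br /&gt;
|-&lt;br /&gt;
| 1 || MP || Image acquisition&lt;br /&gt;
|-&lt;br /&gt;
| 2 || PP0 || Low-pass Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 3 || PP1 || Thresholding&lt;br /&gt;
|-&lt;br /&gt;
| 4 || PP2 || Laplace Filtering&lt;br /&gt;
|-&lt;br /&gt;
| 5 || PP3 || Logic Filtering&lt;br /&gt;
|}&lt;br /&gt;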
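&lt;br /&gt;
How the five stages overlap in time can be illustrated with a small scheduling sketch. It models only which frame each processor handles in each cycle, under the assumption that every stage takes one cycle per frame; the filters themselves are not implemented, and pipeline_schedule is a hypothetical name.&lt;br /&gt;
&lt;br /&gt;
```python
# Stage order as listed in Table 1: (processor, operation).
STAGES = [
    ("MP", "image acquisition"),
    ("PP0", "low-pass filtering"),
    ("PP1", "thresholding"),
    ("PP2", "Laplace filtering"),
    ("PP3", "logic filtering"),
]


def pipeline_schedule(num_frames):
    """List, for every cycle, which frame each processor is working on."""
    schedule = []
    total_cycles = num_frames + len(STAGES) - 1
    for cycle in range(total_cycles):
        busy = {}
        for depth, (processor, _operation) in enumerate(STAGES):
            frame = cycle - depth  # a frame reaches stage `depth` that many cycles late
            if frame in range(num_frames):
                busy[processor] = frame
        schedule.append(busy)
    return schedule


schedule = pipeline_schedule(3)
```
&lt;br /&gt;
For three frames the schedule spans seven cycles: the master processor starts frame 0 alone, by the third cycle three processors work on three different frames, and the last cycle only finishes logic filtering of the final frame.&lt;br /&gt;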
&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=='''Glossary'''==&lt;br /&gt;
*'''CMU''': Carnegie Mellon University&lt;br /&gt;
&lt;br /&gt;
*'''CU''': Control Unit&lt;br /&gt;
&lt;br /&gt;
*'''DS''': Data Stream&lt;br /&gt;
&lt;br /&gt;
*'''Fly-By-Wire''': System that replaces the conventional manual flight controls of an aircraft with an electronic interface&lt;br /&gt;
&lt;br /&gt;
*'''FPGA''': Field Programmable Gate Array&lt;br /&gt;
&lt;br /&gt;
*'''Heterogeneous Systems''': A multiprocessor system with different kinds of processors&lt;br /&gt;
&lt;br /&gt;
*'''Homogeneous System''': A multiprocessor system with the same kind of processors&lt;br /&gt;
&lt;br /&gt;
*'''ILP''': Instruction Level Parallelism&lt;br /&gt;
&lt;br /&gt;
*'''IS''': Instruction Stream&lt;br /&gt;
&lt;br /&gt;
*'''MIMD''': Multiple Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''MISD''': Multiple Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''PE''': Processing Element&lt;br /&gt;
&lt;br /&gt;
*'''PFC''': Primary Flight Computer&lt;br /&gt;
&lt;br /&gt;
*'''SAPO''': Short for Samočinný počítač&lt;br /&gt;
&lt;br /&gt;
*'''SIMD''': Single Instruction Multiple Data&lt;br /&gt;
&lt;br /&gt;
*'''SISD''': Single Instruction Single Data&lt;br /&gt;
&lt;br /&gt;
*'''VLSI''': Very Large Scale Integration&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=71734</id>
		<title>ECE506 Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=71734"/>
		<updated>2013-02-04T20:14:41Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Supplements to Solihin Text */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.&lt;br /&gt;
&lt;br /&gt;
=Supplements to Solihin Text=&lt;br /&gt;
&lt;br /&gt;
Post links to the textbook supplements in this section.&lt;br /&gt;
*Chapter 2 [[CSC/ECE 506 Spring 2011/ch2 dm | CSC/ECE 506 Spring 2011/ch2 dm]]&lt;br /&gt;
*Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]&lt;br /&gt;
*Chapter 2 (Still being revised) [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring 2011/ch2a mc | Current Data-Parallel Architectures ]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring_2012/2a va ]]&lt;br /&gt;
*Chapter 2b [[CSC/ECE 506 Spring 2012/ch2b cm | CSC/ECE 506 Spring 2012/ch2b cm]]&lt;br /&gt;
*Chapter 2b [[ECE506_CSC/ECE_506_Spring_2012/2b_az | CSC/ECE 506 Spring 2012/2b az - Data-Parallel Processing with the AMD HD 6900 Series Graphics Processing Unit]]&lt;br /&gt;
*Chapter 3 (Final Revision) [[ CSC/ECE 506 Spring 2011/ch3 ab | Parallel Architecture Mechanisms and Programming Models ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a ob | Parallelization of Nelder Mead Algorithm ]]&lt;br /&gt;
*Chapter 4a (Under Construction) [[ CSC/ECE_506_Spring_2011/ch4a_bm | Parallelization of Algorithms  ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a zz | CSC/ECE 506 Spring 2011/ch4a zz ]]&lt;br /&gt;
*Chapter 4b [[Chapter 4b CSC/ECE 506 Spring 2011 / ch4b]]&lt;br /&gt;
*Chapter 5a [[ CSC/ECE 506 Spring 2012/ch5a ja | CSC/ECE 506 Spring 2012/ch5a ja ]]&lt;br /&gt;
*Chapter 9a [[CSC/ECE 506 Spring 2012/ch9a cm | CSC/ECE 506 Spring 2012/ch9a cm]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a jp | CSC/ECE 506 Spring 2011/ch6a jp ]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a ep | CSC/ECE 506 Spring 2011/ch6a ep ]]&lt;br /&gt;
*Chapter 6b (Ready for First Review) [[CSC/ECE 506 Spring 2011/ch6b ab | CSC/ECE 506 Spring 2011/ch6b ab]]&lt;br /&gt;
*Chapter 7 (Under Construction) [[CSC/ECE 506 Spring 2011/ch7 jp | CSC/ECE 506 Spring 2011/ch7 jp]]&lt;br /&gt;
*Chapter 8 [[CSC/ECE 506 Spring 2011/ch8 mc | CSC/ECE 506 Spring 2011/ch8 mc]]&lt;br /&gt;
*Chapter 10 (Under Construction) [[CSC/ECE 506 Spring 2011/ch10 sb | CSC/ECE 506 Spring 2011/ch10 sb]]&lt;br /&gt;
*Chapter 10 [[CSC/ECE 506 Spring 2012/ch10 sj | CSC/ECE 506 Spring 2012/ch10 sj]]&lt;br /&gt;
*Chapter 10a [[CSC/ECE_506_Spring_2011/ch10a_dc | CSC/ECE_506_Spring_2011/ch10a_dc]]&lt;br /&gt;
*Chapter 11 [[CSC/ECE_506_Spring_2011/ch11_BB_EP | Chapter 11 Supplement]]&lt;br /&gt;
*Chapter 11 [[Scalable_Coherent_Interface | SCI (Scalable Coherent Interface) ]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 ob | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 (Ready for Final Review) [[ CSC/ECE 506 Spring 2011/ch12 aj | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 | Interconnection Network Topologies]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a ry]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c dm]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c cl]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a mw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3a yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/7b yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3b sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/4b rs]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/6b am]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/8a cj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a dr]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a jp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/9a ms]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10b sr]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/12b jh]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a fu]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/11a ht]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1b dj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1a sp]]&lt;br /&gt;
*Chapter 11a [[ECE506_CSC/ECE_506_Spring_2012/11a_az | CSC/ECE 506 Spring 2012/11a az - Performance of DSM system]]&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Fall_2012/ch2a_2w23_sr&amp;diff=69003</id>
		<title>CSC/ECE 517 Fall 2012/ch2a 2w23 sr</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Fall_2012/ch2a_2w23_sr&amp;diff=69003"/>
		<updated>2012-10-27T03:59:07Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Advantages */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;''Responsibility-Driven Design''&lt;br /&gt;
== Introduction ==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Object-oriented_programming Object-oriented programming (OOP) languages] enable software to be reusable, refinable, testable, maintainable, and extensible by supporting [http://en.wikipedia.org/wiki/Encapsulation_(object-oriented_programming) encapsulation]. In order to realize these advantages, encapsulation should be maximized during the design process. Other techniques attempt to enforce encapsulation during the implementation phase, which is too late in the software life-cycle to achieve maximum benefits.&lt;br /&gt;
&lt;br /&gt;
Responsibility-driven design (RDD) is an alternative design method that focuses on increasing encapsulation by viewing a program in terms of the client/server model.&amp;lt;ref&amp;gt;http://dl.acm.org/citation.cfm?doid=74877.74885 Object-oriented design: a responsibility-driven approach&amp;lt;/ref&amp;gt; This method was proposed by [http://en.wikipedia.org/wiki/Rebecca_Wirfs-Brock Rebecca Wirfs-Brock] and Brian Wilkerson.&lt;br /&gt;
&lt;br /&gt;
== Definition ==&lt;br /&gt;
Responsibility-Driven Design is defined as follows:&lt;br /&gt;
&lt;br /&gt;
It is inspired by the [http://en.wikipedia.org/wiki/Client%E2%80%93server_model client/server model]. It focuses on the contract by asking:&lt;br /&gt;
&lt;br /&gt;
* What actions is this object responsible for?&lt;br /&gt;
* What information does this object share?&amp;lt;ref&amp;gt;http://dl.acm.org/citation.cfm?doid=74877.74885 Object-oriented design: a responsibility-driven approach&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Responsibility-Driven Design principles and analysis==&lt;br /&gt;
&lt;br /&gt;
'''''Principles'''''&lt;br /&gt;
&lt;br /&gt;
The basic principles of Responsibility-Driven Design are:&lt;br /&gt;
&lt;br /&gt;
*&amp;lt;b&amp;gt;Maximize Abstraction&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Initially hide the distinction between data and behavior.&lt;br /&gt;
&lt;br /&gt;
Think of objects’ responsibilities for “knowing”, “doing”, and “deciding”.&lt;br /&gt;
&lt;br /&gt;
*&amp;lt;b&amp;gt;Distribute Behavior&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Promote a delegated control architecture.&lt;br /&gt;
&lt;br /&gt;
Make objects smart, i.e., have them behave intelligently, not just hold bundles of data.&lt;br /&gt;
&lt;br /&gt;
*&amp;lt;b&amp;gt;Preserve Flexibility&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Design objects such that interior details can be readily changed.&lt;br /&gt;
&lt;br /&gt;
'''''Analysis'''''&lt;br /&gt;
&lt;br /&gt;
The analysis stage of Responsibility-Driven Design consists of three phases:&lt;br /&gt;
*&amp;lt;b&amp;gt;System Definition:&amp;lt;/b&amp;gt; High-level view of system&lt;br /&gt;
*&amp;lt;b&amp;gt;Detailed Description:&amp;lt;/b&amp;gt; Detailed view of development process, functional requirements, and non-functional requirements.&lt;br /&gt;
*&amp;lt;b&amp;gt;Object Analysis:&amp;lt;/b&amp;gt; Construction of domain objects.&amp;lt;ref&amp;gt;http://www.wirfs-brock.com/ RDD principles and constructs&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Example ==&lt;br /&gt;
&lt;br /&gt;
A basic example of Responsibility-Driven Design is a Point Of Sale Terminal (POST). The design is driven by use cases and requirements, which include domain models and system operations derived from the use cases. Responsibilities are then assigned to objects/classes to realize the use cases. Suppose the use case for POST is the one represented in Figure 1; the RDD methodological context for POST is then given by Figure 2.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
                        [[File:RDD2.jpg]]&lt;br /&gt;
&lt;br /&gt;
                                                Figure 1. [http://www.site.uottawa.ca/~ssome/Cours/SEG3202/rdd.pdf Usecase diagram of POST]&lt;br /&gt;
&lt;br /&gt;
                        [[File:rdd1.jpg]]&lt;br /&gt;
&lt;br /&gt;
                                                Figure 2. [http://www.site.uottawa.ca/~ssome/Cours/SEG3202/rdd.pdf RDD Methodological context of POST Usecase]&lt;br /&gt;
&lt;br /&gt;
Figure 2 shows the considerations made while designing a responsibility-driven model for POST. They include business modelling, requirements and design of functionality.&lt;br /&gt;
&lt;br /&gt;
== Advantages ==&lt;br /&gt;
*'''''Increases encapsulation'''''&lt;br /&gt;
The responsibility-driven approach emphasizes the encapsulation of both the structure and behavior of objects by focusing on the contractual responsibilities of a class. It improves encapsulation with respect to subclass clients, ensuring that the inherited behavior is part of the contract of the subclass. The benefit is increased encapsulation, since the specification of the exact way in which a request is carried out is private to the server. Encapsulation is compromised when the structural details of an object become part of the interface. Responsibility-driven designs maximize encapsulation by ignoring the structural details.&lt;br /&gt;
*'''''Facilitates polymorphism'''''&lt;br /&gt;
By encouraging the designer to focus on responsibilities independently of implementation, the responsibility driven approach can help a designer identify standard protocols (message names), which facilitates polymorphism. It makes abstract classes easier to identify. One difficulty in identifying abstract classes lies in determining what parts of the protocol of existing classes are parts of the type of those classes, and what parts are implementation details. Because the protocol of a class includes only those messages which form the type of the class, this problem is eliminated.&lt;br /&gt;
&lt;br /&gt;
== Comparison of Responsibility-Driven Design with Data-Driven Design ==&lt;br /&gt;
&lt;br /&gt;
Adapting abstract data type design methods to object-oriented programming results in a data-driven design. Responsibility-driven design is in direct contrast with data-driven design, which promotes defining the behavior of a class along with the data that it holds.&lt;br /&gt;
&lt;br /&gt;
'''''Why responsibility-driven design over data-driven design?'''''&lt;br /&gt;
&lt;br /&gt;
While data-driven design does prevent coupling of data and functionality, in some cases data-driven programming has been argued to lead to bad object-oriented design, especially when dealing with more abstract data. This is because a purely data-driven object or entity is defined by the way it is represented. Any attempt to change the structure of the object would immediately break the functions that rely on it. As a perhaps trivial example, one might represent driving directions as a series of intersections (two intersecting streets) where the driver must turn right or left. If an intersection (in the United States) is represented in data by the zip code (a 5-digit number) and two street names (strings of text), bugs may appear when you discover a city where streets intersect multiple times. While this example may be oversimplified, restructuring of data is a fairly common problem in software engineering, whether to eliminate bugs, increase efficiency, or support new features. In these cases responsibility-driven design may be promoted as a better approach, where functionality and data can be coupled together, so functions don't have to deal with the data representation itself.&lt;br /&gt;
&lt;br /&gt;
'''''Example of Responsibility-Driven v/s Data-Driven Design'''''&lt;br /&gt;
&lt;br /&gt;
Let us consider a class of raster image objects which stores visual images as a two-dimensional array of bits. The operations on a raster image include basic image manipulations like rotation, orientation, scaling etc. Now, we can define a RasterImage class which will include methods for each possible operation it can perform:&lt;br /&gt;
   class RasterImage is&lt;br /&gt;
     bitArray : ByteArray;&lt;br /&gt;
     width : Integer;&lt;br /&gt;
     height : Integer;&lt;br /&gt;
     rotateBy(degrees : Integer) : void;&lt;br /&gt;
     scaleBy(x : Integer, y : Integer) : void;&lt;br /&gt;
     bitAt(index : Point) : Integer;&lt;br /&gt;
     bitAtPut(index : Point, bitValue : Integer) : void;&lt;br /&gt;
   RasterImage structure:&lt;br /&gt;
     bitArray() : ByteArray;&lt;br /&gt;
     bitArray(bits : ByteArray) : void;&lt;br /&gt;
     width() : Integer;&lt;br /&gt;
     width(aNumber : Integer) : void;&lt;br /&gt;
     height() : Integer;&lt;br /&gt;
     height(aNumber : Integer) : void;&lt;br /&gt;
   end RasterImage;&lt;br /&gt;
&lt;br /&gt;
The same requirement can be met through responsibility-driven design. The responsibilities of the RasterImage class are: knowing and maintaining the smallest rectangle that completely contains its image; knowing and maintaining all individual bit values within its image; and rotating and scaling the image. Therefore, the RasterImage class based on responsibilities is as follows:&lt;br /&gt;
&lt;br /&gt;
   class RasterImage is&lt;br /&gt;
     boundingRectangle() : Rectangle;&lt;br /&gt;
     boundingRectangle(bounds : Rectangle) : void;&lt;br /&gt;
     bitAt(index : Point) : Integer;&lt;br /&gt;
     bitAtPut(index : Point, bitValue : Integer) : void;&lt;br /&gt;
     scaleBy(x : Integer, y : Integer) : void;&lt;br /&gt;
     rotateBy(degrees : Integer) : void;&lt;br /&gt;
   end RasterImage;&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Responsibility-Driven Design is a way to design software that emphasizes modeling of objects’ roles, responsibilities, and collaborations. Its principles are to maximize abstraction, delegate control, and preserve flexibility. It increases encapsulation by focusing on the contractual responsibilities of a class, and it facilitates polymorphism by encouraging the designer to focus on responsibilities independently of implementation.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
1. [http://www.ece.uprm.edu/~borges/CRC.pdf Responsibility Driven Design]&lt;br /&gt;
&lt;br /&gt;
2. [http://nccastaff.bournemouth.ac.uk/jmacey/CA1/Papers/Responsibility-Driven%20Design.pdf Object-Oriented  Design: A  Responsibility-Driven  Approach ]&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Fall_2012/ch2a_2w23_sr&amp;diff=69002</id>
		<title>CSC/ECE 517 Fall 2012/ch2a 2w23 sr</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_517_Fall_2012/ch2a_2w23_sr&amp;diff=69002"/>
		<updated>2012-10-27T03:58:42Z</updated>

		<summary type="html">&lt;p&gt;Schakra8: /* Advantages */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;''Responsibility-Driven Design''&lt;br /&gt;
== Introduction ==&lt;br /&gt;
An [http://en.wikipedia.org/wiki/Object-oriented_programming object-oriented programming (OOP) language] enables software to be reusable, refinable, testable, maintainable, and extensible by supporting [http://en.wikipedia.org/wiki/Encapsulation_(object-oriented_programming) encapsulation]. To realize these advantages, encapsulation should be maximized during the design process. Other techniques attempt to enforce encapsulation during the implementation phase, which is too late in the software life-cycle to achieve maximum benefits.&lt;br /&gt;
&lt;br /&gt;
Responsibility-driven design (RDD) is an alternative design method that focuses on increasing encapsulation by viewing a program in terms of the client/server model.&amp;lt;ref&amp;gt;http://dl.acm.org/citation.cfm?doid=74877.74885 Object-oriented design: a responsibility-driven approach&amp;lt;/ref&amp;gt; This method was proposed by [http://en.wikipedia.org/wiki/Rebecca_Wirfs-Brock Rebecca Wirfs-Brock] and Brian Wilkerson.&lt;br /&gt;
&lt;br /&gt;
== Definition ==&lt;br /&gt;
Responsibility-Driven Design is defined as follows:&lt;br /&gt;
&lt;br /&gt;
It is inspired by the [http://en.wikipedia.org/wiki/Client%E2%80%93server_model client/server model]. It focuses on the contract by asking:&lt;br /&gt;
&lt;br /&gt;
* What actions is this object responsible for?&lt;br /&gt;
* What information does this object share?&amp;lt;ref&amp;gt;http://dl.acm.org/citation.cfm?doid=74877.74885 Object-oriented design: a responsibility-driven approach&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Responsibility-Driven Design principles and analysis==&lt;br /&gt;
&lt;br /&gt;
'''''Principles'''''&lt;br /&gt;
&lt;br /&gt;
The basic principles of Responsibility-Driven Design are:&lt;br /&gt;
&lt;br /&gt;
*&amp;lt;b&amp;gt;Maximize Abstraction&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Initially hide the distinction between data and behavior.&lt;br /&gt;
&lt;br /&gt;
Think of objects' responsibilities for “knowing”, “doing”, and “deciding”.&lt;br /&gt;
&lt;br /&gt;
*&amp;lt;b&amp;gt;Distribute Behavior&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Promote a delegated control architecture.&lt;br /&gt;
&lt;br /&gt;
Make objects smart, i.e., have them behave intelligently rather than just hold bundles of data.&lt;br /&gt;
&lt;br /&gt;
*&amp;lt;b&amp;gt;Preserve Flexibility&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Design objects such that interior details can be readily changed.&lt;br /&gt;
&lt;br /&gt;
'''''Analysis'''''&lt;br /&gt;
&lt;br /&gt;
The analysis stage of Responsibility-Driven Design consists of three phases:&lt;br /&gt;
*&amp;lt;b&amp;gt;System Definition:&amp;lt;/b&amp;gt; High-level view of the system.&lt;br /&gt;
*&amp;lt;b&amp;gt;Detailed Description:&amp;lt;/b&amp;gt; Detailed view of the development process, functional requirements, and non-functional requirements.&lt;br /&gt;
*&amp;lt;b&amp;gt;Object Analysis:&amp;lt;/b&amp;gt; Construction of domain objects.&amp;lt;ref&amp;gt;http://www.wirfs-brock.com/ RDD principles and constructs&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Example ==&lt;br /&gt;
&lt;br /&gt;
A basic example of responsibility-driven design is a Point of Sale Terminal (POST). The design is driven by use cases and requirements, which include domain models and system operations derived from the use cases. Responsibilities are then assigned to objects and classes to realize the use cases. Suppose the use case for POST is represented in Figure 1. The RDD methodological context for POST is then given by Figure 2. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
                        [[File:RDD2.jpg]]&lt;br /&gt;
&lt;br /&gt;
                                                Figure 1. [http://www.site.uottawa.ca/~ssome/Cours/SEG3202/rdd.pdf Use case diagram of POST]&lt;br /&gt;
&lt;br /&gt;
                        [[File:rdd1.jpg]]&lt;br /&gt;
&lt;br /&gt;
                                                Figure 2. [http://www.site.uottawa.ca/~ssome/Cours/SEG3202/rdd.pdf RDD methodological context of the POST use case]&lt;br /&gt;
&lt;br /&gt;
Figure 2 shows the considerations made while designing a responsibility-driven model for POST. It includes business modelling, requirements, and design of functionality.&lt;br /&gt;
&lt;br /&gt;
== Advantages ==&lt;br /&gt;
*'''''Increased encapsulation'''''&lt;br /&gt;
The responsibility-driven approach emphasizes the encapsulation of both the structure and behavior of objects by focusing on the contractual responsibilities of a class. It improves encapsulation with respect to subclass clients, ensuring that the inherited behavior is part of the contract of the subclass. Because the exact way in which a request is carried out is private to the server, encapsulation increases. Encapsulation is compromised when the structural details of an object become part of the interface, so responsibility-driven designs maximize encapsulation by ignoring the structural details.&lt;br /&gt;
*'''''Facilitates polymorphism'''''&lt;br /&gt;
By encouraging the designer to focus on responsibilities independently of implementation, the responsibility-driven approach can help a designer identify standard protocols (message names), which facilitates polymorphism. It also makes abstract classes easier to identify. One difficulty in identifying abstract classes lies in determining which parts of the protocol of existing classes belong to the type of those classes and which parts are implementation details. Because the protocol of a class includes only those messages which form the type of the class, this problem is eliminated.&lt;br /&gt;
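&lt;br /&gt;
The idea of a standard protocol can be illustrated with a minimal Python sketch. The class names Square and Circle, the message name bounding_rectangle, and the helper total_width are hypothetical and chosen only for illustration; they are not taken from the cited paper. Both classes answer the same message, so client code can treat them uniformly without knowing how either one stores its data.&lt;br /&gt;
&lt;br /&gt;
```python
class Square:
    """Hypothetical shape that stores a corner and a side length."""

    def __init__(self, x, y, side):
        self._x, self._y, self._side = x, y, side

    def bounding_rectangle(self):
        # Returned as (x, y, width, height).
        return (self._x, self._y, self._side, self._side)


class Circle:
    """Hypothetical shape that stores a centre and a radius."""

    def __init__(self, cx, cy, radius):
        self._cx, self._cy, self._radius = cx, cy, radius

    def bounding_rectangle(self):
        # Same message name as Square, different internal structure.
        r = self._radius
        return (self._cx - r, self._cy - r, 2 * r, 2 * r)


def total_width(shapes):
    # The client sends one message to every object, whatever its class.
    return sum(shape.bounding_rectangle()[2] for shape in shapes)


print(total_width([Square(0, 0, 640), Circle(10, 10, 5)]))
```
&lt;br /&gt;
In this sketch the shared message name is the whole contract: adding another class that answers bounding_rectangle requires no change to total_width.&lt;br /&gt;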
&lt;br /&gt;
== Comparison of Responsibility-Driven Design with Data-Driven Design ==&lt;br /&gt;
&lt;br /&gt;
Adapting abstract data type design methods to object-oriented programming results in a data-driven design, which defines the behavior of a class around the data that it holds. Responsibility-driven design stands in direct contrast to this approach. &lt;br /&gt;
&lt;br /&gt;
'''''Why responsibility-driven design over data-driven design?'''''&lt;br /&gt;
&lt;br /&gt;
While data-driven design does prevent coupling of data and functionality, it has been argued to lead to poor object-oriented design in some cases, especially when dealing with more abstract data. This is because a purely data-driven object or entity is defined by the way it is represented, so any attempt to change the structure of the object would immediately break the functions that rely on it. As a perhaps trivial example, one might represent driving directions as a series of intersections (two intersecting streets) where the driver must turn right or left. If an intersection (in the United States) is represented in data by the zip code (a 5-digit number) and two street names (strings of text), bugs may appear when you discover a city where the same two streets intersect more than once. While this example may be oversimplified, restructuring data is a fairly common problem in software engineering, whether to eliminate bugs, increase efficiency, or support new features. In these cases responsibility-driven design may be promoted as a better approach: functionality and data are coupled together, so client functions don't have to deal with the data representation itself.&lt;br /&gt;
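&lt;br /&gt;
The intersection example can be sketched in Python under some loudly stated assumptions: the class name Intersection, its methods, the occurrence counter, and the sample zip code and street names are all hypothetical and do not come from the original text. Callers rely only on what an intersection can do, so the stored representation is free to gain a field (here, which crossing of the two streets is meant) without breaking them.&lt;br /&gt;
&lt;br /&gt;
```python
class Intersection:
    """Hypothetical class: callers ask what it does, not how it is stored."""

    def __init__(self, zip_code, street_a, street_b, occurrence=0):
        # Private representation. The occurrence field was added to handle
        # streets that cross more than once; callers never read it directly.
        self._key = (zip_code, frozenset((street_a, street_b)), occurrence)

    def is_same_place_as(self, other):
        # Responsibility: decide whether two intersections are the same place.
        return self._key == other._key

    def describe(self):
        # Responsibility: produce a human-readable label.
        zip_code, streets, occurrence = self._key
        names = " and ".join(sorted(streets))
        return f"{names} ({zip_code}), crossing #{occurrence + 1}"


# Made-up sample data for illustration only.
first = Intersection("12345", "Oak St", "Elm St")
second = Intersection("12345", "Elm St", "Oak St", occurrence=1)
print(first.describe())
print(first.is_same_place_as(second))
```
&lt;br /&gt;
A data-driven version would pass the zip code and two strings around directly, and every function comparing intersections would need editing once the second crossing was discovered.&lt;br /&gt;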
&lt;br /&gt;
'''''Example of Responsibility-Driven vs. Data-Driven Design'''''&lt;br /&gt;
&lt;br /&gt;
Let us consider a class of raster image objects that stores visual images as a two-dimensional array of bits. The operations on a raster image include basic image manipulations such as rotation, orientation, and scaling. In a data-driven design, we can define a RasterImage class that includes a method for each possible operation it can perform, together with accessors for its structure:&lt;br /&gt;
   class RasterImage is&lt;br /&gt;
     bitArray : ByteArray;&lt;br /&gt;
     width : Integer;&lt;br /&gt;
     height : Integer;&lt;br /&gt;
     rotateBy(degrees : Integer) : void;&lt;br /&gt;
     scaleBy(x : Integer, y : Integer) : void;&lt;br /&gt;
     bitAt(index : Point) : Integer;&lt;br /&gt;
     bitAtPut(index : Point, bitValue : Integer) : void;&lt;br /&gt;
   RasterImage structure:&lt;br /&gt;
     bitArray() : ByteArray;&lt;br /&gt;
     bitArray(bits : ByteArray) : void;&lt;br /&gt;
     width() : Integer;&lt;br /&gt;
     width(aNumber : Integer) : void;&lt;br /&gt;
     height() : Integer;&lt;br /&gt;
     height(aNumber : Integer) : void;&lt;br /&gt;
   end RasterImage;&lt;br /&gt;
&lt;br /&gt;
The same requirement can be met through responsibility-driven design. The responsibilities of the RasterImage class are: knowing and maintaining the smallest rectangle that completely contains its image; knowing and maintaining all individual bit values within its image; and rotating and scaling the image. Therefore, the RasterImage class based on responsibilities is as follows:&lt;br /&gt;
&lt;br /&gt;
   class RasterImage is&lt;br /&gt;
     boundingRectangle() : Rectangle;&lt;br /&gt;
     boundingRectangle(bounds : Rectangle) : void;&lt;br /&gt;
     bitAt(index : Point) : Integer;&lt;br /&gt;
     bitAtPut(index : Point, bitValue : Integer) : void;&lt;br /&gt;
     scaleBy(x : Integer, y : Integer) : void;&lt;br /&gt;
     rotateBy(degrees : Integer) : void;&lt;br /&gt;
   end RasterImage;&lt;br /&gt;
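&lt;br /&gt;
As a rough illustration, the responsibility-based interface might look like this in Python. This is only a sketch under assumptions: the snake_case method names, the tuple used for the bounding rectangle, the one-byte-per-bit storage, and the whole-number nearest-neighbour scaling are choices made here for brevity, not details from the source. Rotation and the bounding-rectangle setter are omitted to keep the example short.&lt;br /&gt;
&lt;br /&gt;
```python
class RasterImage:
    """Sketch of a raster image whose clients never see the bit storage."""

    def __init__(self, width, height):
        self._width = width
        self._height = height
        # Assumed representation: one byte per bit, row by row.
        self._bits = bytearray(width * height)

    def bounding_rectangle(self):
        # Smallest rectangle containing the image, as (x, y, width, height).
        return (0, 0, self._width, self._height)

    def bit_at(self, x, y):
        return self._bits[y * self._width + x]

    def bit_at_put(self, x, y, bit_value):
        self._bits[y * self._width + x] = bit_value

    def scale_by(self, fx, fy):
        # Nearest-neighbour scaling by positive whole-number factors.
        new_w, new_h = self._width * fx, self._height * fy
        scaled = bytearray(new_w * new_h)
        for y in range(new_h):
            for x in range(new_w):
                scaled[y * new_w + x] = self.bit_at(x // fx, y // fy)
        self._width, self._height, self._bits = new_w, new_h, scaled


image = RasterImage(2, 2)
image.bit_at_put(1, 0, 1)
image.scale_by(2, 2)
print(image.bounding_rectangle())
print(image.bit_at(2, 0), image.bit_at(3, 1), image.bit_at(0, 0))
```
&lt;br /&gt;
Because no caller reads the width, height, or byte array directly, the storage could later be switched to packed bits without changing any client code.&lt;br /&gt;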
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Responsibility-Driven Design is a way to design software that emphasizes modeling of objects’ roles, responsibilities, and collaborations. Its principles are to maximize abstraction, delegate control, and preserve flexibility. It increases encapsulation by focusing on the contractual responsibilities of a class, and it facilitates polymorphism by encouraging the designer to focus on responsibilities independently of implementation.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
1. [http://www.ece.uprm.edu/~borges/CRC.pdf Responsibility Driven Design]&lt;br /&gt;
&lt;br /&gt;
2. [http://nccastaff.bournemouth.ac.uk/jmacey/CA1/Papers/Responsibility-Driven%20Design.pdf Object-Oriented  Design: A  Responsibility-Driven  Approach ]&lt;/div&gt;</summary>
		<author><name>Schakra8</name></author>
	</entry>
</feed>