<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Akrepask</id>
	<title>Expertiza_Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Akrepask"/>
	<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Special:Contributions/Akrepask"/>
	<updated>2026-05-09T02:44:21Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45248</id>
		<title>CSC/ECE 506 Spring 2011/ch12 ar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45248"/>
		<updated>2011-04-22T06:45:25Z</updated>

		<summary type="html">&lt;p&gt;Akrepask: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, messages passed between processors are frequent and short&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time by having '''high bandwidth'''. The minimum amount of data that can be transmitted in one cycle is called a '''phit'''. A phit is typically determined by the width of the link. However, data is transferred at the granularity of a link-level flow control unit called a '''flit'''. A flit worth of data can be accepted or rejected at the receiver, depending on the flow control protocol and the amount of buffering available at the receiver.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called a '''link'''. A link can be unidirectional, in which data can only be sent in one direction, or bidirectional, in which data can be sent in both directions. A link, together with its sender and receiver, makes up a channel. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Several metrics should be considered when choosing an interconnection network:&lt;br /&gt;
&amp;lt;p&amp;gt;1.&amp;lt;b&amp;gt; Diameter &amp;lt;/b&amp;gt;: The maximum distance between any pair of nodes in the network. The average distance is computed by averaging the distances over all pairs of nodes.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;2. &amp;lt;b&amp;gt;Bisection Bandwidth &amp;lt;/b&amp;gt;: A network can be partitioned in any number of ways. The minimum number of links that must be cut to divide the network into two equal partitions is the bisection bandwidth. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;3. &amp;lt;b&amp;gt; No. of Links&amp;lt;/b&amp;gt; : The total number of links in the network, where each link is a set of wires connecting two nodes. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;4.&amp;lt;b&amp;gt; Degree&amp;lt;/b&amp;gt; : The number of input/output links connecting to each router.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;'''Simple Network Topologies''' &amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;The following network topologies are not widely used in processors; however, it is important to discuss them because variations of them are used in real machines.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_linear.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The nodes are connected linearly as in an array. This type of topology is simple and low cost. However, it has several shortcomings. First of all, it does not scale well: the longest distance between two nodes, or the '''diameter''', is one less than the number of nodes. In addition to not scaling well, this topology can also result in high congestion.  [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a linear array interconnection network with '''p''' nodes, the total number of &amp;lt;b&amp;gt;links&amp;lt;/b&amp;gt; would be &amp;lt;b&amp;gt; p-1 &amp;lt;/b&amp;gt; and the &amp;lt;b&amp;gt;degree&amp;lt;/b&amp;gt; of the network would be &amp;lt;b&amp;gt;2 &amp;lt;/b&amp;gt;. The total link bandwidth is &amp;lt;b&amp;gt; p-1 &amp;lt;/b&amp;gt; times the link bandwidth, but bisection bandwidth is equal to one link bandwidth. Since global communication must always travel through one link, bisection bandwidth summarizes the bandwidth characteristic of the network better than the aggregate bandwidth.&lt;br /&gt;
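As an illustrative sketch (ours, not part of the original text), the linear array figures above can be computed directly; the function below simply encodes the formulas just given:&lt;br /&gt;

```python
def linear_array_metrics(p):
    """Metrics of a linear array of p nodes, encoding the formulas above."""
    return {
        "links": p - 1,      # one link per adjacent pair of nodes
        "degree": 2,         # each interior node connects to two neighbors
        "diameter": p - 1,   # the longest path runs end to end
        "bisection": 1,      # cutting one middle link halves the array
    }
```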
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The ring has the same structure as the linear array, except that the end nodes connect to each other, establishing a circular structure. This topology scales better since the longest distance between two nodes is cut in half, but it still eventually scales poorly as more nodes are added.  The congestion is also cut in half since there are now two paths a packet can take.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a Ring interconnection network with &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes, the total number of links in the network would be &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; and the degree of the interconnection network is &amp;lt;b&amp;gt;2&amp;lt;/b&amp;gt;. The maximum distance between nodes in the network is the distance between nodes halfway around the ring, hence the diameter would be &amp;lt;b&amp;gt; p/2 &amp;lt;/b&amp;gt;. The bisection bandwidth of the interconnection network is thus 2.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cube can be thought of as a three-dimensional mesh. This topology improves upon the throughput of the 2-D mesh; however, the power dissipation is slightly higher.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree has a hierarchical structure with processing nodes on the bottom and switching nodes at the upper levels. The tree experiences high traffic and high latency if a packet must travel through the upper levels.  Also, because of the high connectivity, this topology has high average energy dissipation.[[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a k-ary Tree interconnection network with &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes, the degree of the nodes is k+1 and the total number of links is k*(p-1). The bisection bandwidth of the network is 1 and the diameter of the network is 2*(log&amp;lt;sub&amp;gt;k&amp;lt;/sub&amp;gt;p).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Another attempt to improve upon the “skinny” tree structure is the butterfly structure.  The butterfly is similar to the tree, but it replicates the switching node structure of the tree topology and connects the copies together so that there are equal numbers of links and routers at all levels.  There are two problems with this topology.  First, there is no path diversity: there is only one path from the root to a downstream node.  This is not ideal in case the network is congested in one area but has capacity in another; there is no way for the network to rebalance the load.  Second, there are some very long routes in this topology.  These require repeaters between the nodes, which dramatically increase the physical area needed to implement the network. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]] &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a butterfly interconnection network topology with &amp;lt;b&amp;gt; p &amp;lt;/b&amp;gt;nodes, the degree of the network would be 4 and the total number of links in the network would be 2*p(log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p). The diameter of the network is log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p. The bisection bandwidth of the network is p/2.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;'''Evolution of Network Topologies''' &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dmesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input or output links, or a '''degree''' of 4.  This topology has reasonably low energy dissipation without compromising throughput.  The area overhead for this topology is also rather low: since a node is never farther than a clock cycle away, there is no need for repeater insertion between the nodes.  (Repeaters have a high area overhead, so using them increases the area of the topology.) [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]] The 2-D mesh topology is one of the first interconnection structures in which processors were connected to each other through neighbors. The Solomon machine, which implemented one of the first 2-D mesh topologies, was developed in 1962. The early designs of these networks could not perform routing, so the processors had to explicitly relay messages to processors that were not their neighbors.  Because of this, these networks had very poor performance.[[#References|&amp;lt;sup&amp;gt;[13]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a 2-D Mesh interconnection network with &amp;lt;b&amp;gt; p&amp;lt;/b&amp;gt; nodes, the degree of the network would be 4 and the total number of links would be &amp;lt;b&amp;gt; 2*sqrt(p)(sqrt(p) -1) &amp;lt;/b&amp;gt;. The diameter of the network would be 2*(sqrt(p) - 1). The bisection bandwidth of the network is sqrt(p), the number of links that must be cut to split the mesh into two equal halves.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The hypercube is essentially multiple cubes put together.  This topology became popular in the late 1970s.  The main advantage of this topology is its low diameter. [[#References|&amp;lt;sup&amp;gt;[13]&amp;lt;/sup&amp;gt;]] Originally there were no routers; the next hops were simply programmed into each node.  Today the hypercube topology is used by many companies, including Intel.  It is attractive because of its small diameter.  The nodes are numbered in such a way that the addresses of neighboring nodes differ in only one bit.  This greatly simplifies routing messages through the network. The biggest drawback of the topology is its lack of scalability: for example, increasing the dimension by one requires adding a link to every node in the network. [[#References|&amp;lt;sup&amp;gt;[10]&amp;lt;/sup&amp;gt;]]  Once the industry reached a limit where it just wasn’t realistic to fit this topology into a package, most designs returned to low-dimensional arrays such as the 2-D torus. [[#References|&amp;lt;sup&amp;gt;[13]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In an interconnection network of &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes with Hypercube topology, the degree of the network would be log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p and the diameter of the network would also be log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p. The total number of links in the network would be (p/2)*log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p and the bisection bandwidth of the network would be p/2.&lt;br /&gt;
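Because neighboring hypercube nodes differ in exactly one address bit, a route can be found by flipping the differing bits one at a time. The sketch below is our illustration of this property (the function name is ours), not code from the original study:&lt;br /&gt;

```python
def hypercube_route(src, dst):
    """Route between hypercube nodes by fixing one differing address bit
    per hop, lowest dimension first; the hop count equals the Hamming
    distance between the two addresses."""
    path = [src]
    node, bit = src, 1
    while node != dst:
        if (node ^ dst) // bit % 2:   # this address bit still differs
            node ^= bit               # flip it: cross that dimension
            path.append(node)
        bit *= 2                      # move on to the next dimension
    return path
```

For example, routing from node 000 to node 101 flips bits 0 and 2, visiting nodes 0, 1, 5.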
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dtorus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher.  Thus it slightly improves upon the throughput of the 2-D mesh, but it also slightly increases the power dissipation relative to the 2-D mesh since the number of links is higher.  The delays on the routes connecting the end nodes can be excessively high if the topology is not implemented correctly.  This topology was developed in 1985 because of the design constraints, such as pins and bisection, that the hypercube required. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_fat_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In 1985, an MIT professor invented the fat tree to improve upon the normal “skinny” tree by “fattening” up the links at the upper levels.  This helps to alleviate the traffic at the upper levels and to decrease message latency.  However, fattening means that additional links are added at these levels, which increases the average energy dissipated by this topology. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In an interconnection network with a k-ary Fat Tree implementation of &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes, the degree of the network would be &amp;lt;b&amp;gt;k+1&amp;lt;/b&amp;gt; and the total number of links would be k*(p-1). The diameter of the network would be 2*log&amp;lt;sub&amp;gt;k&amp;lt;/sub&amp;gt;p, and the bisection bandwidth of the network would be p/2.&lt;br /&gt;
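For comparison, the closed-form metrics quoted in the preceding sections can be collected in one place. The following sketch (our illustration, simply tabulating the formulas given above for p nodes, with k the tree arity) makes the trade-offs easy to inspect:&lt;br /&gt;

```python
import math

def topology_metrics(p, k=2):
    """Tabulate (degree, links, diameter, bisection bandwidth) for p
    nodes, using the closed-form expressions quoted in the sections
    above. Assumes p is a perfect square for the mesh and a power of
    two for the hypercube and butterfly."""
    s = math.isqrt(p)                    # side length of a square 2-D mesh
    lg = int(math.log2(p))               # hypercube dimension
    tree_diam = 2 * round(math.log(p, k))
    return {
        "linear array": (2, p - 1, p - 1, 1),
        "ring":         (2, p, p // 2, 2),
        "2-D mesh":     (4, 2 * s * (s - 1), 2 * (s - 1), s),
        "k-ary tree":   (k + 1, k * (p - 1), tree_diam, 1),
        "fat tree":     (k + 1, k * (p - 1), tree_diam, p // 2),
        "butterfly":    (4, 2 * p * lg, lg, p // 2),
        "hypercube":    (lg, p * lg // 2, lg, p // 2),
    }
```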
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following table shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been &amp;quot;fattened&amp;quot; up with redundant links. The butterfly network requires more than twice the number of ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The average path length (average number of hops) and the average link load (GB/s) are shown below.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when the average path length is high, the average link load is also high. In other words, average path length and average link load are proportionally related. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is just too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance compared to the other types is still relatively poor. The butterfly and fat tree have the lowest average path length and average link load. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest number of ports, the fat tree topology has by far the highest cost. Although it uses the fewest ports, the ports are high-bandwidth ports of 10 GB/s. Over 2400 ports of 10 GB/s are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically, and from a cost standpoint the fat tree is impractical. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between the 2-D mesh/torus and the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the cost and average link load is factored the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, the high-dimensional tori are also a good choice. The butterfly topology is another alternative, but has lower fault tolerance. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;, including the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages, including cost, performance, and reliability, were discussed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks attached to large servers, with routers and switches in between&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_network.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large-scale, high-performance applications, the fat tree can be a good choice. However, in order to &amp;quot;fatten&amp;quot; up the links, redundant connections must be used: instead of one link between switching nodes, several must be used. The problem is that with more input and output links, one needs routers with more input and output ports. Routers with more than 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even so, many expensive routers would be required&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor: only a single path exists between any pair of nodes, so if a link breaks, data cannot be re-routed and communication is broken&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Butterfly structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failure&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
One current example of the torus structure is the QPACE SFB TR cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_mesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Mesh structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_torus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Torus structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Similar to the torus structures, the hypercube requires a large number of links. However, its bandwidth scales better than that of the mesh and torus structures. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cubed topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hypercube structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data takes from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple path choices but not every packet is allowed to use the shortest path&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the network is deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Deadlock occurs from a cyclic pattern of routing; avoiding circular routing patterns avoids deadlock.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid circular routing patterns, some turns are disallowed. These are called '''turn restrictions''': certain turns are forbidden in order to prevent a circular routing pattern from forming. Some of these turn restrictions are described below.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
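The cyclic-dependency condition above can be checked mechanically. The sketch below is our illustration (the graph encoding is an assumption): it detects a cycle in a channel-dependency graph, where an edge means one channel waits on buffer space in another, as in the four-node example above.&lt;br /&gt;

```python
def has_deadlock_cycle(deps):
    """Return True if the channel-dependency graph contains a cycle.

    deps maps each channel to the set of channels it may wait on; a
    cycle (as in the four-node deadlock example above) means deadlock
    is possible.
    """
    WHITE, GRAY, BLACK = 0, 1, 2            # unvisited / on stack / done
    color = {c: WHITE for c in deps}

    def dfs(c):
        color[c] = GRAY
        for nxt in deps.get(c, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:               # back edge: a cycle exists
                return True
            if state == WHITE and dfs(nxt):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and dfs(c) for c in deps)
```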
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
*'''Dimensional ordered (X-Y) routing -''' Turns from the y-dimension to the x-dimension are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
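As a concrete sketch of dimension-ordered routing (our illustration, with (x, y) node coordinates assumed), a packet resolves its x offset completely before its y offset, so a y-to-x turn never occurs and no cycle can form:&lt;br /&gt;

```python
def xy_route(src, dst):
    """Dimension-ordered (X-Y) route on a 2-D mesh: move along x first,
    then along y. Forbidding y-to-x turns rules out cyclic dependencies."""
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    while x != dx:                      # resolve the x dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then the y dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path
```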
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
*'''West First -''' Turns to the west are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
*'''North Last -''' Turns after a north direction are not allowed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
*'''Negative First -''' Turns in the negative direction (-x or -y) are not allowed, except on the first turn.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
*'''Odd-Even Turn Model -''' Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. They force some packets onto different, not necessarily minimal, routes. This may cause unfairness and reduces the system's ability to relieve congestion, so overall performance could suffer&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduces the Odd-Even turn model, an adaptive, deadlock-free turn-restriction model that performs better than the previously mentioned models&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
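The four odd-even rules can be stated as a predicate on turns. The sketch below is our illustration (compass letters naming the incoming and outgoing directions are an assumed encoding); it returns whether a turn is legal at a node in column x:&lt;br /&gt;

```python
# Turns banned by the odd-even model, as (incoming, outgoing) pairs.
FORBIDDEN_IN_EVEN_COLUMNS = {("E", "N"), ("E", "S")}
FORBIDDEN_IN_ODD_COLUMNS = {("N", "W"), ("S", "W")}

def turn_allowed(incoming, outgoing, x):
    """True if turning from direction `incoming` to `outgoing` is legal
    for a packet at a node in column x under the odd-even turn model."""
    turn = (incoming, outgoing)
    if x % 2 == 0:
        return turn not in FORBIDDEN_IN_EVEN_COLUMNS
    return turn not in FORBIDDEN_IN_ODD_COLUMNS
```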
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To simulate the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Traffic patterns including uniform, transpose, and hot spot were simulated. Uniform traffic has each node sending messages to every other node with equal probability. Transpose has two opposite nodes sending messages to their respective halves of the mesh. Hot spot has a few &amp;quot;hot spot&amp;quot; nodes that receive high traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the &amp;quot;slowest&amp;quot; increase in average communication latency. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hotspots is increased to five, the strength of the odd-even model becomes apparent: its latency is the lowest at both 6 and 8 percent hotspot traffic, while the x-y model performs worst by a wide margin. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well under uniform traffic, it lacks adaptiveness. When traffic concentrates into hot spots, the x-y model suffers from its inability to adapt and re-route traffic around the congestion that the hot spots cause. The odd-even model has superior adaptiveness under high congestion. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports; data arriving on an input port is routed to one of the output ports. Which output port is chosen depends on the destination of the data and the routing algorithm. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input and output ports connected by a '''crossbar switch'''. The crossbar switch selects which input port is connected to which output port, acting essentially as a set of multiplexers. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Router technology has improved significantly over the years, making networks with high dimensionality feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The availability of high-performance, high-radix routers at reasonable cost has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
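The crossbar described above behaves like a set of per-output multiplexers. A minimal sketch, not modeled on any particular router:&lt;br /&gt;

```python
def crossbar(inputs, select):
    # inputs: the flit waiting at each input port
    # select: select[o] names the input port connected to output port o,
    #         or None when that output is idle
    outputs = []
    for o in range(len(select)):
        if select[o] is None:
            outputs.append(None)              # no connection for this output
        else:
            outputs.append(inputs[select[o]])
    return outputs

# connect input 2 to output 0 and input 0 to output 1
flits_out = crossbar(["A", "B", "C"], select=[2, 0, None])
# flits_out == ["C", "A", None]
```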
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or number of ports, of routers has also increased. Current routers not only have high radix but also lower latency than the previous generation, and as radix increases, the latency remains steady. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
With high-performance routers, complex topologies become possible; as router technology improves, topologies of ever higher dimensionality become practical. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;. With the increasing number of processors in multiprocessor systems and ever higher data rates, reliable transmission of data in the event of a network fault is a major concern, so fault-tolerant routing algorithms are important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized in two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Transient Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A transient fault is a temporary fault that occurs for a very short duration of time. It can be caused, for example, by a flip-flop output changing and generating an invalid header. These faults can be minimized using error control coding, and are generally evaluated in terms of Bit Error Rate.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Permanent Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A permanent fault is a fault that does not go away and causes permanent damage to the network, for example through damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Permanent faults can be handled using one of two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are re-calculated to provide fault-free paths.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Dynamic Mechanisms''': In the dynamic fault tolerance model, the operation of the processes in the network is not completely stalled; only the affected regions are repaired. Some of the methods to do this are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a. '''Block Faults''': In this method, many healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The blocked region may be convex or non-convex, and care is taken that none of the new routes introduce a cycle into the channel dependency graph (CDG).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, reducing system capacity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links adjacent to the faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
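The fault-ring idea can be sketched as computing the set of healthy nodes bordering a fault on an n x n mesh. This is an illustrative helper, not Chalasani and Boppana's actual construction:&lt;br /&gt;

```python
def fault_ring(faulty, n):
    # nodes of an n x n mesh that border at least one faulty node
    # (8-neighborhood, so the corners of the ring are included)
    faulty = set(faulty)
    ring = set()
    for (fx, fy) in faulty:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                node = (fx + dx, fy + dy)
                if node[0] in range(n) and node[1] in range(n) and node not in faulty:
                    ring.add(node)
    return ring

# a single faulty node in the interior yields an 8-node ring around it
ring = fault_ring([(2, 2)], 5)
```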
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;References&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
1 Solihin text&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
7 [http://www.top500.org TOP500 Supercomputing Sites]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
8 [http://cial.csie.ncku.edu.tw/presentation/group_pdf/%5BY2005%5DPerformance%20Evaluation%20and%20Design%20Trade-Offs%20For%20Network-on-Chip%20Interconnect%20Architectures.pdf Interconnection Network Topology Tradeoffs]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
9 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Evolution of Interconnection Networks]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
10 [http://www2.engr.arizona.edu/~hpcat/papers/jlt94.pdf Hypercube pros and cons]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
11 [http://courses.csail.mit.edu/6.896/spring04/handouts/papers/fat_trees.pdf History of the Fat Tree Topology]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
12 [http://books.google.com/books?id=uyAg3zu_DYMC&amp;amp;pg=PA75&amp;amp;lpg=PA75&amp;amp;dq=butterfly+interconnection+network+topology&amp;amp;source=bl&amp;amp;ots=5tQm-TVTu0&amp;amp;sig=oKMwSlpqDJqopnYrKv0BbA8k5uU&amp;amp;hl=en&amp;amp;ei=5cmrTYezINDQiAL9uKjvDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=8&amp;amp;ved=0CFMQ6AEwBw  Pros and Cons of the Butterfly Interconnection Network]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
13 [http://books.google.com/books?id=uyAg3zu_DYMC&amp;amp;printsec=frontcover&amp;amp;dq=principles+and+practices+of+interconnection+networks+dally&amp;amp;source=bl&amp;amp;ots=5tQm2YSKp4&amp;amp;sig=3ulQr5DwwSIQQad5VOeAAHs8lLw&amp;amp;hl=en&amp;amp;ei=qhGxTeuKCbPViALx04WwBg&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;sqi=2&amp;amp;ved=0CCQQ6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Principles and Practices of Interconnection Networks by Dally]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
NOTE: This wiki is based on a [http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC previous wiki]&lt;/div&gt;</summary>
		<author><name>Akrepask</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45043</id>
		<title>CSC/ECE 506 Spring 2011/ch12 ar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45043"/>
		<updated>2011-04-18T05:53:51Z</updated>

		<summary type="html">&lt;p&gt;Akrepask: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, messages passed between processors are frequent and short&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time by having '''high bandwidth'''. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called '''links'''. The device that routes messages between nodes is called a '''router'''. The shape of the network, that is, the arrangement of its links and routers, is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Types of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_linear.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The nodes are connected linearly, as in an array. This type of topology is simple and low cost. However, it has several shortcomings. First of all, it does not scale well: the longest distance between two nodes, or the '''diameter''', grows linearly with the number of nodes. In addition to scaling poorly, this topology can also suffer high congestion, since all traffic between the two halves of the array must cross a single link. [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The ring has the same structure as the linear array, except that the end nodes connect to each other, establishing a circular structure. This topology scales better, since the longest distance between two nodes is cut in half, but it still ends up scaling poorly if enough nodes are added. Congestion is also cut in half, since packets now have two directions in which to traverse the ring.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
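The scaling claims above can be checked by computing diameters directly; a small illustrative sketch:&lt;br /&gt;

```python
def linear_diameter(n):
    # an n-node linear array: the two end nodes are n - 1 hops apart
    return n - 1

def ring_distance(a, b, n):
    # shortest hop count between nodes a and b on an n-node ring,
    # taking whichever of the two directions is shorter
    d = abs(a - b)
    return min(d, n - d)

def ring_diameter(n):
    return max(ring_distance(0, b, n) for b in range(n))

# for 16 nodes: linear_diameter(16) == 15, ring_diameter(16) == 8
```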
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dmesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a two-dimensional structure. Nodes that are not on the edge have 4 input or output links, that is, a '''degree''' of 4. This topology has reasonably low energy dissipation without compromising throughput. The area overhead is also rather low: since adjacent nodes are never more than a clock cycle apart, there is no need for repeater insertion between the nodes (repeaters have a high area overhead, so using them increases the area of the topology). [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cube can be thought of as a three-dimensional mesh. This topology continues to improve upon the throughput of the 2-D mesh, although its power dissipation is slightly higher.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The hypercube is essentially multiple cubes put together. This topology was developed in 1983. Originally there were no routers; the next hops were simply programmed into each node. Today the hypercube topology is used by many companies, including Intel. It is attractive because of its small diameter. The nodes are numbered in such a way that every pair of neighboring nodes differs in only one bit, which greatly simplifies routing messages through the network. The biggest drawback of the topology is its lack of scalability: for example, if the dimension is increased by one, a link must be added to every node in the network. [[#References|&amp;lt;sup&amp;gt;[10]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
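Because neighboring nodes differ in exactly one address bit, a message can be routed by correcting the differing bits one at a time (the classic e-cube scheme). A minimal sketch, assuming nodes are numbered by their binary addresses:&lt;br /&gt;

```python
def hypercube_route(src, dst, dims):
    # route from src to dst in a dims-dimensional hypercube by flipping
    # differing address bits from the lowest dimension to the highest
    path = [src]
    cur = src
    for bit in range(dims):
        mask = 2 ** bit
        if (cur ^ dst) // mask % 2 == 1:   # this address bit differs
            cur = cur ^ mask               # hop to the neighbor in this dimension
            path.append(cur)
    return path

# in a 3-D hypercube, node 0 (000) reaches node 6 (110) in two hops:
# hypercube_route(0, 6, 3) == [0, 2, 6]
```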
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dtorus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher. It thus slightly improves upon the throughput of the 2-D mesh, but also slightly increases the power dissipation, since the number of links is higher. The wrap-around routes connecting the end nodes can have excessively high delays if the topology is not implemented carefully. This topology was developed in 1985 because of the design constraints, such as pins and bisection, that the hypercube required. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree has a hierarchical structure, with processing nodes at the bottom and switching nodes at the upper levels. The tree experiences high traffic and high latency when packets must travel through the upper levels. Also, because of its high connectivity, this topology has high average energy dissipation. [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_fat_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In 1985, an MIT professor invented the fat tree to improve upon the normal “skinny” tree by “fattening” the links at the upper levels. This helps alleviate the traffic at the upper levels and decreases message latency. Fattening, however, means that additional links are added in this area, which increases the average energy dissipated by the topology. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The butterfly structure is another attempt to improve upon the “skinny” tree. It is similar to the tree structure, but it replicates the switching-node structure of the tree and connects the copies together so that there are equal numbers of links and routers at all levels. There are two problems with this topology. First, there is no path diversity: only one path exists from the root to a given downstream node, so if the network is congested in one area but free in another, there is no way to rebalance the load. Second, some routes in this topology are very long, requiring repeaters between the nodes, which causes the physical area needed to implement the network to increase dramatically. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]] &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following table shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been &amp;quot;fattened&amp;quot; with redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the fat tree's switching layer. The number of ports for the mesh and torus structures increases with dimensionality.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The average path length, or average number of hops, and the average link load (GB/s) are shown below.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when the average path length is high, the average link load is also high; in other words, average path length and average link load are proportionally related. The 2-D mesh clearly has, by far, the worst performance: in a network this large, its average path length is simply too high, and its average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together, but its performance is still relatively poor compared to the other topologies. The butterfly and fat tree have the lowest average path length and average link load. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest ports, the fat tree topology has by far the highest cost. Although it uses the fewest ports, they are high-bandwidth 10 GB/s ports, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. The cost of the mesh and torus structures increases with dimensionality, and the cost of the butterfly network falls between the 2-D mesh/torus and the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the cost and average link load are factored together, the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube; for systems that do not need as much bandwidth, they are also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. Several topologies were investigated including the fat tree, butterfly, mesh, torii, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks with large servers with routers and switches in between&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_network.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large-scale, high-performance applications, the fat tree can be a good choice. However, in order to &amp;quot;fatten&amp;quot; the links, redundant connections must be used: instead of one link between switching nodes, several are needed. The problem is that with more input and output links, routers with more input and output ports are required. Routers with over 100 ports are difficult to build and expensive, so multiple routers must be stacked together, and even then many expensive routers are required&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology, connecting 1280 NEC processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In high-performance applications, the butterfly structure can be a good choice in terms of link count: it uses fewer links than other topologies, although each link carries the traffic of an entire layer. Its fault tolerance, however, is poor. Only a single path exists between each pair of nodes, so if a link breaks, data cannot be re-routed and communication is broken&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Butterfly structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The mesh and torus structures used in this application would require a large number of links and an aggregate total of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
One example of current use of the torus structure is the QPACE SFB TR Cluster in Germany, which connects 4608 PowerXCell 8i processors in a 3-D torus topology&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_mesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Mesh structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_torus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Torus structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Similar to the torus structures, the hypercube requires a large number of links. However, its bandwidth scales better than that of the mesh and torus structures. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hypercube structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data takes from source to destination. Routing can be '''deterministic''', where the path is fixed for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple path choices but not all packets are allowed to use the shortest path&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are in '''deadlock''' when they can no longer move through the nodes. The illustration below demonstrates this event. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the network is deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The deadlock arises from a cyclic routing pattern. To avoid deadlock, circular routing patterns must be avoided.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To prevent circular routing patterns, some turns are disallowed. These prohibitions are called '''turn restrictions''': certain turns are forbidden so that no circular routing pattern can form. Some of these turn restrictions are described below.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;h2&amp;gt;Dimensional ordered (X-Y) routing&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns from the y-dimension to the x-dimension are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
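As a concrete sketch of dimension-ordered routing, the following Python fragment (a hypothetical illustration, not taken from any cited system) computes the deterministic X-Y path on a 2-D mesh: a packet first corrects its x-coordinate, then its y-coordinate, so a turn from the y-dimension back to the x-dimension never occurs.

```python
def xy_next_hop(current, destination):
    """Return the next node on the X-Y route from current to destination."""
    (cx, cy), (dx, dy) = current, destination
    if cx != dx:                      # correct the x-dimension first
        return (cx + (1 if dx > cx else -1), cy)
    if cy != dy:                      # only then correct the y-dimension
        return (cx, cy + (1 if dy > cy else -1))
    return current                    # already at the destination

def xy_route(src, dst):
    """Full deterministic path, including source and destination."""
    path, node = [src], src
    while node != dst:
        node = xy_next_hop(node, dst)
        path.append(node)
    return path
```

Because every packet follows the same x-then-y discipline, no cycle of waiting packets can form, which is why this model is deadlock-free.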
&lt;br /&gt;
&amp;lt;h2&amp;gt;West First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns to the west are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;North Last&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns after a north direction are not allowed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Negative First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns in the negative direction (-x or -y) are not allowed, except on the first turn.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Odd-Even Turn Model&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. They force some packets onto longer, non-minimal routes, which may cause unfairness and reduces the system's ability to relieve congestion. Overall performance could suffer&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduced the Odd-Even turn model, an adaptive, deadlock-free turn-restriction model that performs better than the previously mentioned models&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
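The four odd-even restrictions above can be captured in a small predicate. This Python sketch is a hypothetical illustration of the rules exactly as stated, with columns numbered from zero and directions written as 'N', 'S', 'E', 'W'.

```python
def turn_allowed(incoming, outgoing, column):
    """Check whether a turn is permitted under the odd-even model.

    incoming/outgoing are the directions of travel before and after the
    turn; column is the x-coordinate of the node where the turn is made
    (columns 0, 2, 4, ... are the "even" columns)."""
    even = (column % 2 == 0)
    # East-to-north and east-to-south turns are banned on even columns.
    if even and incoming == 'E' and outgoing in ('N', 'S'):
        return False
    # North-to-west and south-to-west turns are banned on odd columns.
    if not even and incoming in ('N', 'S') and outgoing == 'W':
        return False
    return True
```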
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To evaluate the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Uniform, transpose, and hot spot traffic patterns were simulated. Under uniform traffic, each node sends messages to every other node with equal probability. Under transpose traffic, two opposite nodes send messages to their respective halves of the mesh. Hot spot traffic designates a few &amp;quot;hot spot&amp;quot; nodes that receive high traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under uniform traffic is shown above. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models: as the number of messages increases, the x-y model has the &amp;quot;slowest&amp;quot; increase in average communication latency. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under hotspot traffic is shown above. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hotspots is increased to five, the odd-even model begins to shine: its latency is the lowest at both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs worst by a wide margin. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic becomes hotspot, the x-y model suffers from the inability to adapt and re-route traffic to avoid the congestion caused by hotspots. The odd-even model has superior adaptiveness under high congestion. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports; data arriving at one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and the routing algorithm. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects input ports to output ports, selecting which output each input drives and acting essentially as a set of multiplexers. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Router technology has improved significantly over the years, making networks with high dimensionality feasible. As the real-world example above shows, high-dimensional tori and hypercubes are excellent topology choices for high-performance networks. The falling cost of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
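To make the multiplexer analogy concrete, here is a toy Python model of a crossbar switch (purely illustrative; real crossbars are combinational hardware, and the Crossbar name is invented here). Each output port is driven by at most one input port.

```python
class Crossbar:
    """Toy n x n crossbar: each output port is fed by at most one input
    port, so the switch behaves like a set of per-output multiplexers."""

    def __init__(self, n):
        self.n = n
        self.select = {}          # output port -> input port driving it

    def connect(self, inp, out):
        """Set up a connection; an output can have only one driver."""
        if out in self.select:
            raise ValueError("output %d already in use" % out)
        self.select[out] = inp

    def forward(self, inputs):
        """Map flits at the input ports to the output ports."""
        return [inputs[self.select[o]] if o in self.select else None
                for o in range(self.n)]
```

For example, after `connect(0, 2)`, a flit presented at input port 0 appears at output port 2.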
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically more dense and complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or number of ports, of routers has also increased. Current routers not only have a high radix but also lower latency than the previous generation, and latency remains steady as radix increases. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
With high-performance routers, complex topologies become possible; as router technology improves, ever more complex, high-dimensionality topologies become feasible. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;. With the increasing number of processors in multiprocessor systems and ever higher data rates, reliable transmission of data in the event of a network fault is a great concern; hence, fault-tolerant routing algorithms are important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized in two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Transient Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A transient fault is a temporary fault that occurs for a very short duration of time. It can be caused by a change in the output of a flip-flop, leading to the generation of an invalid header. Such faults can be minimized using error-control coding, and are generally evaluated in terms of the bit error rate (BER).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Permanent Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A permanent fault does not go away and causes permanent damage to the network, for example through damaged wires and associated circuitry. These faults are generally evaluated in terms of the mean time between failures (MTBF).&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
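As a minimal illustration of the error-control coding mentioned for transient faults, the following Python sketch (the helper names are invented for this example) appends an even-parity bit to a header's bits, so any single bit flip in transit is detected at the receiver.

```python
def add_parity(bits):
    """Append an even-parity bit: the coded word has an even count of 1s."""
    return bits + [sum(bits) % 2]

def check_parity(word):
    """A single flipped bit makes the 1-count odd, exposing the fault."""
    return sum(word) % 2 == 0
```

Real interconnects use stronger codes (such as CRCs) that catch multi-bit errors, but the principle is the same.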
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Permanent faults can be handled using one of two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Static Mechanism''': In the static fault-tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are then recalculated to provide fault-free paths.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Dynamic Mechanisms''': In the dynamic fault-tolerance model, the processes in the network are not completely stalled; only the affected regions are repaired. Some methods for doing this are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are marked as faulty so that no routes are created close to the actual faulty nodes. The blocked region may be convex or non-convex, and it must be ensured that none of the new routes introduces a cycle into the channel dependency graph (CDG).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method declares many healthy nodes faulty, reducing system capacity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
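A sketch of how a fault ring might be derived on a 2-D mesh (an illustrative Python fragment, not Chalasani and Boppana's actual algorithm): collect every healthy, in-bounds node adjacent, including diagonally, to some faulty node.

```python
def fault_ring(faulty, width, height):
    """Healthy nodes bordering the faulty region on a width x height mesh.

    faulty is a set of (x, y) coordinates of faulty nodes; the result is
    the set of healthy neighbors (including diagonal neighbors) that
    would form the fault ring around the region."""
    ring = set()
    for (fx, fy) in faulty:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                n = (fx + dx, fy + dy)
                if (n not in faulty
                        and 0 <= n[0] < width and 0 <= n[1] < height):
                    ring.add(n)
    return ring
```

Only these ring nodes take part in re-routing around the fault region, so far fewer healthy nodes are blocked than in the block-fault approach.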
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;References&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
1 Solihin text&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
7 [http://www.top500.org TOP500 Supercomputing Sites]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
8 [http://cial.csie.ncku.edu.tw/presentation/group_pdf/%5BY2005%5DPerformance%20Evaluation%20and%20Design%20Trade-Offs%20For%20Network-on-Chip%20Interconnect%20Architectures.pdf Interconnection Network Topology Tradeoffs]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
9 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Evolution of Interconnection Networks]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
10 [http://www2.engr.arizona.edu/~hpcat/papers/jlt94.pdf Hypercube pros and cons]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
11 [http://courses.csail.mit.edu/6.896/spring04/handouts/papers/fat_trees.pdf History of the Fat Tree Topology]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
12 [http://books.google.com/books?id=uyAg3zu_DYMC&amp;amp;pg=PA75&amp;amp;lpg=PA75&amp;amp;dq=butterfly+interconnection+network+topology&amp;amp;source=bl&amp;amp;ots=5tQm-TVTu0&amp;amp;sig=oKMwSlpqDJqopnYrKv0BbA8k5uU&amp;amp;hl=en&amp;amp;ei=5cmrTYezINDQiAL9uKjvDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=8&amp;amp;ved=0CFMQ6AEwBw  Pros and Cons of the Butterfly Interconnection Network]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
NOTE:  This wiki is based on a [http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC previous wiki]&lt;/div&gt;</summary>
		<author><name>Akrepask</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=44935</id>
		<title>CSC/ECE 506 Spring 2011/ch12 ar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=44935"/>
		<updated>2011-04-15T01:21:23Z</updated>

		<summary type="html">&lt;p&gt;Akrepask: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, messages passed between processors are frequent and short&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly, with '''low latency''', and must handle several messages at a time, with '''high bandwidth'''. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor, along with its cache and memory, is considered a '''node'''. The physical wires connecting two nodes are called a '''link'''. The device that routes messages between nodes is called a '''router'''. The shape of the network, such as the arrangement of its links and routers, is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Types of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_linear.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The nodes are connected linearly, as in an array. This type of topology is simple; however, it does not scale well. The longest distance between two nodes, called the '''diameter''', is one less than the number of nodes. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The ring has the same structure as the linear array, except that the end nodes connect to each other, establishing a circular structure. The longest distance between two nodes is cut in half.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
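The diameter figures quoted for the linear array and ring can be expressed directly; this trivial Python sketch assumes the usual shortest-path definition of diameter.

```python
def linear_diameter(n):
    """Longest shortest path in an n-node linear array: end to end."""
    return n - 1

def ring_diameter(n):
    """Connecting the two ends cuts the worst-case distance in half."""
    return n // 2
```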
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dmesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input or output links, or a '''degree''' of 4.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dtorus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cube can be thought of as a three-dimensional mesh.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The hypercube is essentially multiple cubes put together.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree is a hierarchical structure with processing nodes at the bottom and switching nodes at the upper levels. The tree experiences high traffic at the upper levels. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_fat_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The fat tree alleviates the traffic at upper levels by &amp;quot;fattening&amp;quot; up the links at the upper levels. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated for a high-performance, high-traffic network&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;, including the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages in cost, performance, and reliability were discussed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_network.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large-scale, high-performance applications, the fat tree can be a good choice. However, in order to &amp;quot;fatten&amp;quot; up the links, redundant connections must be used: instead of one link between switching nodes, several must be used. The problem is that with more input and output links, routers with more input and output ports are needed. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even so, several expensive routers would be required&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology, connecting 1280 NEC processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries the traffic of an entire layer. Fault tolerance is poor: only a single path exists between any pair of nodes, so if a link breaks, data cannot be re-routed and communication is broken&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Butterfly structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, because there are so many links, the mesh and torus structures provide alternate paths in case of failures&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
One current use of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors in a 3-D torus topology with 4608 processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_mesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Mesh structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_torus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Torus structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Like the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hypercube structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following table shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been &amp;quot;fattened&amp;quot; with redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the fat tree's switching layer. The number of ports for the mesh and torus structures increases with dimensionality.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Below, the average path length, or average number of hops, and the average link load (GB/s) are shown.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when the average path length is high, the average link load is also high; in other words, the two are proportionally related. The 2-D mesh clearly has, by far, the worst performance: in a network this large, its average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together, but its performance is still relatively poor compared to the other topologies. The butterfly and fat tree have the lowest average path length and average link load. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest ports, the fat tree topology has by far the highest cost. Its ports are high-bandwidth 10 GB/s ports, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically, making the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, so does the cost. The butterfly network's cost falls between the 2-D mesh/torus and the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When cost and average link load are factored together, the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube; for systems that do not need as much bandwidth, they are also a good choice. The butterfly topology is another alternative, but has lower fault tolerance. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is the same given a source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''' where packets have multiple choices, but does not allow all packets to use the shortest path&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are in '''deadlock''' when they can no longer move through the nodes. The illustration below demonstrates this situation. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the network is deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The deadlock arises from a cyclic routing pattern; to avoid deadlock, circular routing patterns must be avoided.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To prevent circular routing patterns, certain turns are disallowed. These are called '''turn restrictions''': some turns are forbidden so that a circular routing pattern can never form. Several turn-restriction schemes are described below.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;h2&amp;gt;Dimension-ordered (X-Y) routing&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns from the y-dimension to the x-dimension are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
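The X-Y rule above can be sketched as a simple next-hop function; this is only an illustrative sketch, and the direction encoding, coordinate representation, and function name are assumptions, not from the text.&lt;br /&gt;
&lt;br /&gt;
```c
/* Dimension-ordered (X-Y) routing sketch: route fully in the x-dimension
 * before touching y, so a turn from the y-dimension back to the
 * x-dimension can never occur. */
typedef enum { EAST, WEST, NORTH, SOUTH, ARRIVED } Dir;

Dir xy_next_hop(int cur_x, int cur_y, int dst_x, int dst_y) {
    if (cur_x < dst_x) return EAST;   /* exhaust the x offset first */
    if (cur_x > dst_x) return WEST;
    if (cur_y < dst_y) return NORTH;  /* only then move in y */
    if (cur_y > dst_y) return SOUTH;
    return ARRIVED;
}
```
&lt;br /&gt;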
&lt;br /&gt;
&amp;lt;h2&amp;gt;West First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns to the west are not allowed; any travel to the west must be completed first.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;North Last&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns after traveling in the north direction are not allowed; north must be the last direction taken. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Negative First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns in the negative direction (-x or -y) are not allowed, except on the first turn.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Odd-Even Turn Model&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models force some packets to take different, not necessarily minimal, paths. This may cause unfairness and reduces the system's ability to relieve congestion; overall performance could suffer&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduces the Odd-Even turn model, an adaptive, deadlock-free turn-restriction model with better performance than the previously mentioned models&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
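The four rules above can be captured in a small legality check. This is a minimal sketch; the heading encoding, zero-based column numbering, and function name are assumptions for illustration, not from Chiu's paper.&lt;br /&gt;
&lt;br /&gt;
```c
/* Odd-even turn model sketch: returns 1 if a packet currently heading
 * 'from' may turn to heading 'to' at a node in the given column, or 0 if
 * the turn is restricted. */
typedef enum { GO_E, GO_W, GO_N, GO_S } Heading;

int oddeven_turn_allowed(Heading from, Heading to, int column) {
    int even = (column % 2 == 0);
    /* East-to-north and east-to-south turns are disallowed in even columns. */
    if (even && from == GO_E && (to == GO_N || to == GO_S)) return 0;
    /* North-to-west and south-to-west turns are disallowed in odd columns. */
    if (!even && (from == GO_N || from == GO_S) && to == GO_W) return 0;
    return 1;
}
```
&lt;br /&gt;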
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To evaluate the performance of various turn-restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Traffic patterns including uniform, transpose, and hot spot were simulated. Uniform traffic has each node sending messages to every other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few &amp;quot;hot spot&amp;quot; nodes that receive high traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under uniform traffic is shown above. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models: as the number of messages increases, it has the &amp;quot;slowest&amp;quot; increase in average communication latency. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the hotspot traffic; only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hotspots is increased to five, the odd-even model begins to shine: its latency is the lowest for both 6 and 8 percent hotspot traffic. Meanwhile, the performance of the x-y model is by far the worst. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well under uniform traffic, it lacks adaptiveness. When traffic concentrates into hotspots, the x-y model suffers from its inability to adapt and re-route traffic around the congestion the hotspots cause. The odd-even model has superior adaptiveness under high congestion. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports; data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input and output ports connected by a '''crossbar switch'''. The crossbar switch selects which output port each input should be connected to, acting essentially as a multiplexer. &lt;br /&gt;
&lt;br /&gt;
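The crossbar's multiplexer behavior can be sketched as follows; the port count and the per-output selection array are assumptions for illustration, not from the text.&lt;br /&gt;
&lt;br /&gt;
```c
/* Crossbar switch sketch: sel[o] names the input port selected for output
 * port o (-1 = no connection), so each output behaves like a PORTS-to-1
 * multiplexer over the input ports. */
#define PORTS 4

void crossbar_cycle(const int in[PORTS], const int sel[PORTS], int out[PORTS]) {
    for (int o = 0; o < PORTS; o++)
        out[o] = (sel[o] >= 0) ? in[sel[o]] : 0;  /* forward selected input */
}
```
&lt;br /&gt;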
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Router technology has improved significantly over the years, which has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent topology choices for high-performance networks. The availability of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or number of ports, of routers has also increased. Current routers not only have high radix, but also lower latency than the previous generation; as radix increases, latency remains steady. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
With high-performance routers, complex topologies become feasible; as router technology continues to improve, even more complex, higher-dimensionality topologies become practical. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;. With the increasing number of processors in multiprocessor systems and ever-higher data rates, reliable transmission of data in the event of a network fault is a great concern; hence, fault-tolerant routing algorithms are important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized into two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Transient Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A transient fault is a temporary fault that occurs for a very short duration of time. It can be caused, for example, by a change in the output of a flip-flop that generates an invalid header. These faults can be minimized using error-control coding, and are generally evaluated in terms of the bit error rate.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Permanent Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A permanent fault is a fault that does not go away and causes permanent damage to the network, for example through damaged wires and associated circuitry. These faults are generally evaluated in terms of the mean time between failures.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Permanent faults can be handled using one of two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Static Mechanism''': In the static fault-tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are re-calculated to provide fault-free paths.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Dynamic Mechanisms''': In the dynamic fault-tolerance model, the operation of processes in the network is not completely stalled; only the affected regions are repaired. Some of the methods to do this are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty, so that no routes are created close to the actual faulty nodes. The faulty region may be convex or non-convex, and it is ensured that none of the new routes introduce a cycle into the channel dependency graph (CDG).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, reducing system capacity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links adjacent to a region of faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;References&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
1 Solihin text&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
7 [http://www.top500.org TOP500 Supercomputing Sites]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;/div&gt;</summary>
		<author><name>Akrepask</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=43812</id>
		<title>ECE506 Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=43812"/>
		<updated>2011-02-21T21:06:48Z</updated>

		<summary type="html">&lt;p&gt;Akrepask: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.&lt;br /&gt;
&lt;br /&gt;
=Supplements to Solihin Text=&lt;br /&gt;
&lt;br /&gt;
Post links to the textbook supplements in this section.&lt;br /&gt;
&lt;br /&gt;
*Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]&lt;br /&gt;
*Chapter 2 (Still being revised) [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]&lt;br /&gt;
*Chapter 2a (Revision 1) [[ CSC/ECE 506 Spring 2011/ch2a mc | Current Data-Parallel Architectures ]]&lt;br /&gt;
*Chapter 3 (Final Revision) [[ CSC/ECE 506 Spring 2011/ch3 ab | Parallel Architecture Mechanisms and Programming Models ]]&lt;/div&gt;</summary>
		<author><name>Akrepask</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch3_ab&amp;diff=43811</id>
		<title>CSC/ECE 506 Spring 2011/ch3 ab</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch3_ab&amp;diff=43811"/>
		<updated>2011-02-21T21:05:09Z</updated>

		<summary type="html">&lt;p&gt;Akrepask: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Overview==&lt;br /&gt;
	The main goal of this wiki is to explore synchronization mechanisms in various architectures.  These mechanisms are used to maintain program “correctness” and prevent data corruption when parallelism is employed by a system.  This wiki will first give a brief description of various parallel programming models that could use the synchronization mechanisms, which will be covered later in the wiki.  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Types of Parallelism==&lt;br /&gt;
&lt;br /&gt;
===Section Overview===&lt;br /&gt;
&lt;br /&gt;
This section will give a brief overview of common types of parallel programming models.  For more detailed information on this topic please see [http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/ch_3_yl THIS WIKI].  The following types of parallelism will be covered here: DOALL, DOACROSS, DOPIPE, reduction, and functional parallelism.&lt;br /&gt;
&lt;br /&gt;
===DOALL Parallelism===&lt;br /&gt;
&lt;br /&gt;
DOALL parallelism allows all iterations of a loop to be executed in parallel; there are no loop-carried dependences.[[#References|&amp;lt;sup&amp;gt;[2]&amp;lt;/sup&amp;gt;]] The following code is an example of a loop that could use DOALL parallelism to parallelize the i loop [[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]:&lt;br /&gt;
&lt;br /&gt;
  for (i=0; i&amp;lt;n; i++)&lt;br /&gt;
    for (j=0; j&amp;lt; n; j++)&lt;br /&gt;
      S3: a[i][j] = a[i][j-1] + 1;&lt;br /&gt;
&lt;br /&gt;
Note the lack of dependencies across the different iterations of the i loop.&lt;br /&gt;
&lt;br /&gt;
[[Image:DOALL.jpg]] [[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
===DOACROSS Parallelism===&lt;br /&gt;
Consider the following loop[[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]: &lt;br /&gt;
&lt;br /&gt;
  for (i=1; i&amp;lt;=N; i++) {&lt;br /&gt;
    S: a[i] = a[i-1] + b[i] * c[i];&lt;br /&gt;
  }&lt;br /&gt;
It is not possible to use DOALL parallelism on this loop because of the loop-carried dependence of the “a” variable.  But notice that the “b[i] * c[i]” portion of the code does not have any loop-carried dependencies.  This is the situation needed to use DOACROSS parallelism.  The following loop can be developed to use DOACROSS parallelism.[[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  post(0);&lt;br /&gt;
  for (i=1; i&amp;lt;=N; i++) {&lt;br /&gt;
    S1: temp = b[i] * c[i];&lt;br /&gt;
    wait(i-1);&lt;br /&gt;
    S2: a[i] = a[i-1] + temp;&lt;br /&gt;
    post(i);&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
Each iteration of “b[i] * c[i]” can be performed in parallel, then as soon as the loop-carried dependence on a is satisfied S2 can execute.&lt;br /&gt;
&lt;br /&gt;
[[Image:DOACROSS.jpg]] [[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
===DOPIPE parallelism===&lt;br /&gt;
DOPIPE parallelism is another method of parallelism for loops that have loop-carried dependences that uses pipelining.  Consider the following loop [[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]:&lt;br /&gt;
&lt;br /&gt;
  for (i=2; i&amp;lt;=N; i++) {&lt;br /&gt;
    S1: a[i] = a[i-1] + b[i];&lt;br /&gt;
    S2: c[i] = c[i] + a[i];&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
In this example there is both a loop-carried dependence on S1 and a loop-independent dependence between S1 and S2.  These dependencies require that S1[i] executes before S1[i+1] and S2[i].  This leads to the following parallelized code [[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]:&lt;br /&gt;
&lt;br /&gt;
  for (i=2; i&amp;lt;=N; i++) {&lt;br /&gt;
    a[i] = a[i-1] + b[i];&lt;br /&gt;
    post(i);&lt;br /&gt;
  }&lt;br /&gt;
  for (i=2; i&amp;lt;=N; i++) {&lt;br /&gt;
    wait(i);&lt;br /&gt;
    c[i] = c[i] + a[i];&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
This code satisfies all of the above requirements. &lt;br /&gt;
&lt;br /&gt;
[[Image:DOPIPE.jpg]] [[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
===Functional parallelism===&lt;br /&gt;
Functional parallelism is used when a loop contains statements that are independent of one another.  It provides a modest amount of parallelism and it does not grow with input size.  However, it can be used in conjunction with data parallelism (i.e. DOALL, DOACROSS, etc).  Consider the following loop [[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]:&lt;br /&gt;
&lt;br /&gt;
  for (i=0; i&amp;lt;n; i++) {&lt;br /&gt;
    S1: a[i] = b[i+1] * a[i-1];&lt;br /&gt;
    S2: b[i] = b[i] * coef;&lt;br /&gt;
    S3: c[i] = 0.5 * (c[i] + a[i]);&lt;br /&gt;
    S4: d[i] = d[i-1] * d[i];&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
Statement S4 has no dependence on any of the other statements in the loop; therefore it can be executed in parallel with statements S1, S2, and S3 [[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]:&lt;br /&gt;
&lt;br /&gt;
  for (i=0; i&amp;lt;n; i++) {&lt;br /&gt;
    S1: a[i] = b[i+1] * a[i-1];&lt;br /&gt;
    S2: b[i] = b[i] * coef;&lt;br /&gt;
    S3: c[i] = 0.5 * (c[i] + a[i]);&lt;br /&gt;
  }&lt;br /&gt;
  for (i=0; i&amp;lt;n; i++) {&lt;br /&gt;
    S4: d[i] = d[i-1] * d[i];&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
===Reduction===&lt;br /&gt;
Reduction can be used on operations that are both commutative and associative, such as addition, multiplication, and logical operations. For example, if a sum of products needs to be performed on a matrix, the matrix can be divided into smaller portions with one processor assigned to each portion.  After all of the processors have completed their tasks, the individual sums can be combined into a global sum. [[#References|&amp;lt;sup&amp;gt;[4]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
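The sum reduction described above can be sketched in C. The chunk partitioning and names below are illustrative assumptions, and the per-processor loops run sequentially here for clarity; a real parallel version would protect the combine step with a lock or atomic add.&lt;br /&gt;
&lt;br /&gt;
```c
/* Reduction sketch: each "processor" p privately sums its own chunk of
 * the array; the partial sums are then combined into a global sum. */
double reduce_sum(const double a[], int n, int nproc) {
    double global = 0.0;
    int chunk = (n + nproc - 1) / nproc;          /* ceiling division */
    for (int p = 0; p < nproc; p++) {             /* one pass per "processor" */
        double partial = 0.0;                     /* private to processor p */
        int lo = p * chunk;
        int hi = (lo + chunk < n) ? lo + chunk : n;
        for (int i = lo; i < hi; i++)
            partial += a[i];
        global += partial;                        /* combine step */
    }
    return global;
}
```
&lt;br /&gt;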
==Why Synchronization is Needed==&lt;br /&gt;
&lt;br /&gt;
When using any of the above parallel programming models, synchronization is needed to guarantee accuracy of the overall program.  The following are a few example situations where synchronization will be needed.&lt;br /&gt;
*The code following the parallelized loop requires that all of the parallel processes be completed before advancing.  It cannot be triggered simply by one of the processes completing.&lt;br /&gt;
*A portion of code in the middle of a parallelized section MUST be executed in a very particular order so that global variables used across processes get read and written in the proper order.  This is known as the critical section.&lt;br /&gt;
*Multiple processes must update a global variable in such a way that one process does not overwrite the updates of a different process. (i.e. SUM = SUM + &amp;lt;process update&amp;gt;)&lt;br /&gt;
These are just a few examples.  Every architecture implements synchronization in a unique way using different types of mechanisms.  The following section will highlight various architectures’ synchronization mechanisms.  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Synchronization Mechanisms==&lt;br /&gt;
&lt;br /&gt;
===Section Overview===&lt;br /&gt;
&lt;br /&gt;
In order to accomplish the above parallelizations in a real system, the memory must be carefully orchestrated such that no information gets corrupted.  Every architecture handles synchronizing data from parallel processors slightly differently.  This section is going to look at different architectures and highlight a few of the mechanisms that are used to achieve this memory synchronization.&lt;br /&gt;
&lt;br /&gt;
===IA-64===&lt;br /&gt;
IA-64 is an Intel architecture that is mainly used in Itanium processors.&lt;br /&gt;
====Spinlock====&lt;br /&gt;
The spinlock is used to guard against multiple simultaneous accesses to the critical section.  The critical section is a section of code that must be executed sequentially; it cannot be parallelized.  Therefore, when a parallel process comes across an occupied critical section, the process will “spin” until the lock is released. [[#References|&amp;lt;sup&amp;gt;[5]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // The lock variable is zero (0) if the lock is available. If it is 1, another process is in the critical section.&lt;br /&gt;
  //&lt;br /&gt;
  &lt;br /&gt;
  spin_lock:&lt;br /&gt;
    mov	ar.ccv = 0			// cmpxchg looks for avail (0)&lt;br /&gt;
    mov	r2 = 1				// cmpxchg sets to held (1)&lt;br /&gt;
  &lt;br /&gt;
  spin: &lt;br /&gt;
    ld8	r1 = [lock] ;;			// get lock in shared state&lt;br /&gt;
    cmp.eq	p1, p0 = r1, r2		// is lock held (ie, lock == 1)?&lt;br /&gt;
  &lt;br /&gt;
  (p1)	br.cond.spnt	spin ;;		// yes, continue spinning&lt;br /&gt;
    cmpxchg8.acq	r1 = [lock], r2 ;;	// attempt to grab lock&lt;br /&gt;
    cmp.eq p1, p0 = r1, r2		// was lock already held?&lt;br /&gt;
  &lt;br /&gt;
  (p1)	br.cond.spnt	spin ;;		// bummer, continue spinning&lt;br /&gt;
  &lt;br /&gt;
  cs_begin:&lt;br /&gt;
    // critical section code goes here...&lt;br /&gt;
  cs_end:&lt;br /&gt;
  &lt;br /&gt;
    st8.rel	[lock] = r0 ;;		// release the lock&lt;br /&gt;
&lt;br /&gt;
The above code demonstrates how a spin lock is used.  Once the process gets to a spin lock, it will check to see if the lock is available. If it is not, then the process will proceed into the spin loop where it will continuously check to see if the lock is available.  Once it finds out the lock is available, it will attempt to obtain the lock.  If another process obtains the lock first, then the process will branch back into the spin loop and continue to wait.&lt;br /&gt;
&lt;br /&gt;
====Barrier====&lt;br /&gt;
&lt;br /&gt;
A barrier is a common mechanism used to hold up processes until all processes reach the same point.  It is useful in various kinds of parallelism (DOALL, DOACROSS, DOPIPE, reduction, and functional parallelism).  This architecture uses a unique form of the barrier mechanism called the sense-reversing barrier, whose purpose is to prevent race conditions: if a fast process races ahead to the “next” instance of the barrier while slow processes are still leaving the current barrier, the fast process could trap the slow processes at the “next” barrier and thus corrupt the memory synchronization. [[#References|&amp;lt;sup&amp;gt;[5]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
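A minimal sketch of the sense-reversing idea, written with portable C11 atomics rather than IA-64 instructions (the type and function names are assumptions for illustration; a real IA-64 version would use fetchadd and acquire/release semantics directly):&lt;br /&gt;
&lt;br /&gt;
```c
#include <stdatomic.h>

typedef struct {
    atomic_int count;    /* threads that have not yet arrived */
    atomic_int sense;    /* global sense, flipped once per barrier episode */
    int nthreads;
} Barrier;

/* Each thread flips a private local_sense on entry, then waits until the
 * global sense matches it.  Because the "next" episode uses the opposite
 * sense, a fast thread re-entering the barrier cannot trap slow threads
 * that are still leaving the current one. */
void barrier_wait(Barrier *b, int *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_sub(&b->count, 1) == 1) {   /* last thread to arrive */
        atomic_store(&b->count, b->nthreads);    /* reset for reuse */
        atomic_store(&b->sense, *local_sense);   /* release the waiters */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                    /* spin */
    }
}
```
&lt;br /&gt;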
====Dekker’s Algorithm====&lt;br /&gt;
&lt;br /&gt;
Dekker’s Algorithm uses variables to indicate which processors are using which resources; it essentially arbitrates for a resource using these variables.  Every processor has a flag that indicates when it is in the critical section: when a processor is getting ready to enter the critical section, it sets its flag to one, checks that all of the other processors’ flags are zero, and then proceeds into the section.  This behavior is demonstrated in the code below for a two-way multiprocessor system, so there are two processor flags, flag_me and flag_you. [[#References|&amp;lt;sup&amp;gt;[5]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // The flag_me variable is zero if we are not in the synchronization and &lt;br /&gt;
  // critical section code and non-zero otherwise; flag_you is similarly set&lt;br /&gt;
  // for the other processor.  This algorithm does not retry access to the &lt;br /&gt;
  // resource if there is contention.&lt;br /&gt;
  &lt;br /&gt;
  dekker:&lt;br /&gt;
    mov		r1 = 1 ;;		// my_flag = 1 (i want access)&lt;br /&gt;
    st8  	[flag_me] = r1&lt;br /&gt;
    mf ;;				// make st visible first&lt;br /&gt;
    ld8 	r2 = [flag_you] ;;		// is other's flag 0?&lt;br /&gt;
    cmp.eq p1, p0 = 0, r2&lt;br /&gt;
  &lt;br /&gt;
  (p1)	br.cond.spnt cs_skip ;;		// if not, resource in use &lt;br /&gt;
  &lt;br /&gt;
  cs_begin:&lt;br /&gt;
    // critical section code goes here...&lt;br /&gt;
  cs_end:&lt;br /&gt;
  &lt;br /&gt;
  cs_skip:&lt;br /&gt;
    st8.rel[flag_me] = r0 ;;		// release lock&lt;br /&gt;
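&lt;br /&gt;
The same flag protocol can be expressed in C.  The sketch below is illustrative (the names are our own), simulates the two processors with sequential calls, and marks in comments where the mf fences of the assembly version above would be required.&lt;br /&gt;
&lt;br /&gt;
```c
#include <assert.h>

/* Sequential sketch of the two-flag protocol shown in the IA-64 code
 * above.  A real version needs atomic stores and a memory fence between
 * setting flag_me and reading flag_you (the mf in the assembly). */
static int flag[2];   /* flag[i] != 0 while processor i wants the resource */

/* Try to enter the critical section for processor "me" (0 or 1).
 * Returns 1 on success; 0 means the resource is in use.  Like the
 * assembly version, this does not retry on contention. */
int dekker_try_enter(int me) {
    flag[me] = 1;            /* st8 [flag_me] = r1            */
    /* mf: fence here so the store above is visible first      */
    if (flag[1 - me] != 0) { /* ld8 r2 = [flag_you]           */
        flag[me] = 0;        /* resource in use: back off     */
        return 0;
    }
    return 1;                /* cs_begin                      */
}

void dekker_exit(int me) {
    flag[me] = 0;            /* st8.rel [flag_me] = r0: release */
}
```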
&lt;br /&gt;
====Lamport’s Algorithm====&lt;br /&gt;
&lt;br /&gt;
Lamport’s Algorithm is similar to a spinlock, with the addition of a fairness mechanism that keeps track of the order in which processes request the shared resource and grants access in that same order.  It makes use of two variables, x and y, and a shared array, b.  The code below illustrates this algorithm.  [[#References|&amp;lt;sup&amp;gt;[5]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // The proc_id variable holds a unique, non-zero id for the process that &lt;br /&gt;
  // attempts access to the critical section.  x and y are the synchronization&lt;br /&gt;
  // variables that indicate who is in the critical section and who is attempting&lt;br /&gt;
  // entry. ptr_b_1 and ptr_b_id point at the 1'st and id'th element of b[].&lt;br /&gt;
  //&lt;br /&gt;
  &lt;br /&gt;
  lamport:&lt;br /&gt;
    	ld8		r1 = [proc_id] ;;	// r1 = unique process id&lt;br /&gt;
  start:&lt;br /&gt;
    	st8	[ptr_b_id] = r1		// b[id] = &amp;quot;true&amp;quot;&lt;br /&gt;
    	st8	[x] = r1			// x = process id&lt;br /&gt;
   	mf					// MUST fence here!&lt;br /&gt;
    	ld8	r2 = [y] ;;&lt;br /&gt;
    	cmp.ne p1, p0 = 0, r2;;		// if (y !=0) then...&lt;br /&gt;
  (p1)	st8	[ptr_b_id] = r0		// ... b[id] = &amp;quot;false&amp;quot;&lt;br /&gt;
  (p1)	br.cond.sptk	wait_y		// ... wait until y == 0&lt;br /&gt;
  &lt;br /&gt;
  	st8	[y] = r1		// y = process id&lt;br /&gt;
  	mf&lt;br /&gt;
  	ld8 	r3 = [x] ;;		&lt;br /&gt;
  	cmp.eq p1, p0 = r1, r3 ;;	// if (x == id) then..&lt;br /&gt;
  (p1)	br.cond.sptk cs_begin		// ... enter critical section&lt;br /&gt;
  &lt;br /&gt;
  	st8 	[ptr_b_id] = r0		// b[id] = &amp;quot;false&amp;quot;&lt;br /&gt;
  	ld8	r3 = [ptr_b_1]		// r3 = &amp;amp;b[1]&lt;br /&gt;
  	mov	ar.lc = N-1 ;;		// lc = number of processors - 1&lt;br /&gt;
  wait_b:&lt;br /&gt;
  	ld8	r2 = [r3] ;;		&lt;br /&gt;
  	cmp.ne p1, p0 = r1, r2		// if (b[j] != 0) then...&lt;br /&gt;
  (p1)	br.cond.spnt	wait_b ;;	// ... wait until b[j] == 0&lt;br /&gt;
  	add	r3 = 8, r3		// r3 = &amp;amp;b[j+1]&lt;br /&gt;
  	br.cloop.sptk	wait_b ;;	// loop over b[j] for each j&lt;br /&gt;
  &lt;br /&gt;
  	ld8	r2 = [y] ;;		// if (y != id) then...&lt;br /&gt;
  	cmp.ne p1, p2 = 0, r2&lt;br /&gt;
  (p1)  br.cond.spnt 	wait_y&lt;br /&gt;
  	br	start			// back to start to try again&lt;br /&gt;
  &lt;br /&gt;
  cs_begin:&lt;br /&gt;
    // critical section code goes here...&lt;br /&gt;
  cs_end:&lt;br /&gt;
  &lt;br /&gt;
  	st8	[y] = r0		// release the lock&lt;br /&gt;
  	st8.rel[ptr_b_id] = r0 ;;	// b[id] = &amp;quot;false&amp;quot;&lt;br /&gt;
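&lt;br /&gt;
The fast (uncontended) path of this algorithm can be sketched in C.  This is an illustrative simulation with our own names, run sequentially; the comments mark where the mf fences of the assembly version are required, and the slow path (scanning b[] and waiting on y) is abbreviated to a failure return.&lt;br /&gt;
&lt;br /&gt;
```c
#include <assert.h>

#define NPROC 4

/* Sequential sketch of Lamport's fast mutual-exclusion algorithm shown
 * above.  Only the uncontended fast path is modeled; where the assembly
 * would spin or scan b[], this sketch simply returns 0. */
static int x, y;            /* synchronization variables (y == 0: free) */
static int b[NPROC + 1];    /* b[id] != 0 while process id is trying    */

/* Attempt the fast path for process id (ids are 1..NPROC, non-zero).
 * Returns 1 if the critical section was entered without contention. */
int lamport_fast_enter(int id) {
    b[id] = 1;
    x = id;
    /* mf: fence here in real code */
    if (y != 0) {           /* lock held or being taken          */
        b[id] = 0;
        return 0;           /* would wait until y == 0, retry    */
    }
    y = id;
    /* mf: fence here in real code */
    if (x != id) {          /* another process overwrote x       */
        b[id] = 0;
        return 0;           /* would scan b[] and re-check y     */
    }
    return 1;               /* cs_begin */
}

void lamport_exit(int id) {
    y = 0;                  /* release the lock */
    b[id] = 0;
}
```
&lt;br /&gt;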
===IA-32=== &lt;br /&gt;
&lt;br /&gt;
IA-32 is an Intel architecture that is also known as x86.  This is a very widely used architecture.&lt;br /&gt;
&lt;br /&gt;
====Locked Atomic Operation====&lt;br /&gt;
This is the main mechanism this architecture uses to manage shared data structures such as semaphores and system segments.  The processor uses the following three interdependent mechanisms to implement locked atomic operations: [[#References|&amp;lt;sup&amp;gt;[6]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*  Guaranteed atomic operations.&lt;br /&gt;
*  Bus locking, using the LOCK# signal and the LOCK instruction prefix.&lt;br /&gt;
*  Cache coherency protocols that ensure that atomic operations can be carried out on cached data structures (cache lock). This mechanism is present in the P6 family processors.&lt;br /&gt;
&lt;br /&gt;
=====Guaranteed Atomic Operation=====&lt;br /&gt;
The following are guaranteed to be carried out atomically: [[#References|&amp;lt;sup&amp;gt;[6]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*  Reading or writing a byte.&lt;br /&gt;
*  Reading or writing a word aligned on a 16-bit boundary.&lt;br /&gt;
*  Reading or writing a doubleword aligned on a 32-bit boundary.&lt;br /&gt;
The P6 family processors guarantee that the following additional memory operations will always be carried out atomically:&lt;br /&gt;
*  Reading or writing a quadword aligned on a 64-bit boundary. (This operation is also guaranteed on the Pentium® processor.)&lt;br /&gt;
*  16-bit accesses to uncached memory locations that fit within a 32-bit data bus.&lt;br /&gt;
*  16-, 32-, and 64-bit accesses to cached memory that fit within a 32-Byte cache line.&lt;br /&gt;
&lt;br /&gt;
=====Bus Locking=====&lt;br /&gt;
A LOCK# signal is asserted automatically during certain critical memory operations in order to lock the system bus and grant control of it to the process executing the critical section.  While LOCK# is asserted, no other process can take control of the bus.&lt;br /&gt;
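&lt;br /&gt;
In C, such a locked read-modify-write can be sketched with C11 atomics; on IA-32 the compiler emits a LOCK-prefixed instruction (for example, lock xadd) for it.  This is an illustrative sketch of our own, not Intel-provided code.&lt;br /&gt;
&lt;br /&gt;
```c
#include <assert.h>
#include <stdatomic.h>

/* On IA-32, an atomic read-modify-write like the one below compiles to
 * a LOCK-prefixed instruction, which asserts LOCK# or, on P6-family
 * parts, takes a cache lock instead of locking the whole bus. */
atomic_int shared_counter;

/* Atomically add one and return the previous value (as lock xadd does). */
int locked_increment(void) {
    return atomic_fetch_add(&shared_counter, 1);
}
```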
&lt;br /&gt;
===Linux Kernel===&lt;br /&gt;
&lt;br /&gt;
The Linux kernel is referred to here as an “architecture”, although it is fairly unconventional in that it is an open-source operating system kernel with full access to the hardware. It uses many common synchronization mechanisms, so it will be considered here. [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====Busy-waiting lock====&lt;br /&gt;
&lt;br /&gt;
=====Spinlocks=====&lt;br /&gt;
&lt;br /&gt;
This mechanism is very similar to the mechanism described in the IA-64 architecture.  It is a mechanism used to manage access to a critical section of code.  If a process tries to access the critical section and is rejected it will sit and “spin” while it waits for the lock to be released.&lt;br /&gt;
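&lt;br /&gt;
A minimal spinlock can be sketched in C with a test-and-set flag.  This is an illustrative sketch, not the kernel’s actual spinlock_t implementation.&lt;br /&gt;
&lt;br /&gt;
```c
#include <assert.h>
#include <stdatomic.h>

/* Minimal spinlock sketch.  A rejected process loops ("spins") on
 * test-and-set until the holder releases the lock. */
static atomic_flag lock_word = ATOMIC_FLAG_INIT;

void spin_lock(void) {
    while (atomic_flag_test_and_set(&lock_word))
        ;   /* busy-wait: keep retrying until the flag was clear */
}

int spin_trylock(void) {   /* 1 if acquired, 0 if already held */
    return !atomic_flag_test_and_set(&lock_word);
}

void spin_unlock(void) {
    atomic_flag_clear(&lock_word);
}
```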
&lt;br /&gt;
=====Rwlocks=====&lt;br /&gt;
&lt;br /&gt;
This is a special kind of spinlock used to protect structures that are frequently read but rarely written.  It allows multiple reads in parallel, which can increase efficiency because processes do not have to sit and wait merely to carry out a read.  As before, however, only one write is allowed at a time, with no reads done in parallel.&lt;br /&gt;
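&lt;br /&gt;
The reader/writer state machine can be sketched in C as follows.  This is illustrative only, not the kernel’s rwlock_t; the “try” functions return 0 where a real lock would spin.&lt;br /&gt;
&lt;br /&gt;
```c
#include <assert.h>

/* Reader/writer lock state sketch: many concurrent readers, or
 * exactly one writer.  The "try" functions return 1 on success and
 * 0 where a real rwlock would make the caller spin. */
static int readers;   /* number of readers currently inside */
static int writing;   /* 1 while a writer holds the lock    */

int read_trylock(void) {
    if (writing) return 0;   /* a writer excludes all readers */
    readers++;
    return 1;
}
void read_unlock(void) { readers--; }

int write_trylock(void) {
    if (writing || readers) return 0;  /* writer needs exclusivity */
    writing = 1;
    return 1;
}
void write_unlock(void) { writing = 0; }
```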
&lt;br /&gt;
=====Brlocks=====&lt;br /&gt;
&lt;br /&gt;
This is a very fast read/write lock that carries a write-side penalty.  Its main advantage is that it prevents cache “ping-pong” in the multiple-reader case.&lt;br /&gt;
&lt;br /&gt;
====Sleeper locks====&lt;br /&gt;
&lt;br /&gt;
=====Semaphores=====&lt;br /&gt;
&lt;br /&gt;
A semaphore is a special variable that acts similarly to a lock.  If the semaphore can be acquired, the process proceeds into the critical section.  If the semaphore cannot be acquired, the process is “put to sleep” and the processor is used for another process.  This means the process’s state is saved off in a place from which it can be restored when the process is “woken up”.  Once the semaphore becomes available, the “sleeping” process is woken up, obtains the semaphore, and proceeds into the critical section. &lt;br /&gt;
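&lt;br /&gt;
The counting behavior can be sketched in C.  This is an illustrative sketch of our own: the kernel’s down()/up() additionally manage a wait queue and really put processes to sleep, whereas this sketch just records would-be sleepers.&lt;br /&gt;
&lt;br /&gt;
```c
#include <assert.h>

/* Counting-semaphore sketch.  In a real kernel, a failed down() puts
 * the caller on a wait queue and sleeps; here "sleeping" is modeled by
 * a counter and a 0 return value. */
typedef struct {
    int count;      /* how many acquisitions are still available */
    int sleepers;   /* processes "asleep" waiting for an up()    */
} ksem_t;

void ksem_init(ksem_t *s, int count) { s->count = count; s->sleepers = 0; }

/* down(): returns 1 if the semaphore was acquired; otherwise the
 * caller is recorded as a sleeper and 0 is returned. */
int ksem_down(ksem_t *s) {
    if (s->count > 0) { s->count--; return 1; }
    s->sleepers++;
    return 0;
}

/* up(): wake one sleeper if any (it then holds the semaphore),
 * otherwise make the semaphore available again. */
void ksem_up(ksem_t *s) {
    if (s->sleepers > 0) s->sleepers--;
    else s->count++;
}
```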
&lt;br /&gt;
&lt;br /&gt;
===CUDA=== &lt;br /&gt;
&lt;br /&gt;
CUDA, or Compute Unified Device Architecture, is an Nvidia architecture which is the computing engine for their graphics processors.&lt;br /&gt;
&lt;br /&gt;
====__syncthreads====&lt;br /&gt;
&lt;br /&gt;
The __syncthreads operation can be used at the end of a parallel section as a sort of “barrier” mechanism.  It is necessary to ensure that reads and writes to shared memory are correctly ordered.  In the following example, there are two calls to __syncthreads, and both are necessary to obtain the expected results.  Without them, the value read from myArray[tid] could end up being either 2 or the original value of myArray[tid], depending on when the read and the write take place.[[#References|&amp;lt;sup&amp;gt;[7]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // myArray is an array of integers located in global or shared&lt;br /&gt;
  // memory&lt;br /&gt;
  __global__ void MyKernel(int* result) {&lt;br /&gt;
    int tid = threadIdx.x;&lt;br /&gt;
    ...&lt;br /&gt;
    int ref1 = myArray[tid];&lt;br /&gt;
    __syncthreads();&lt;br /&gt;
    myArray[tid + 1] = 2;&lt;br /&gt;
    __syncthreads();&lt;br /&gt;
    int ref2 = myArray[tid];&lt;br /&gt;
    result[tid] = ref1 * ref2;&lt;br /&gt;
    ...&lt;br /&gt;
  }&lt;br /&gt;
===PowerPC===&lt;br /&gt;
&lt;br /&gt;
PowerPC is an IBM architecture whose name stands for Performance Optimization With Enhanced RISC – Performance Computing.  It is a RISC architecture that was originally designed for PCs; however, it has since grown into the embedded and high-performance spaces. [[#References|&amp;lt;sup&amp;gt;[10]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====Isync==== &lt;br /&gt;
&lt;br /&gt;
isync is an instruction that guarantees that all of the code preceding it has completed before any code following it can execute.  It also ensures that any cache block invalidations executed before the isync have been performed with respect to the processor executing the isync instruction.  It then causes any prefetched instructions to be discarded. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====Memory Barrier Instructions====&lt;br /&gt;
&lt;br /&gt;
Memory barrier instructions can be used to control the order in which storage accesses are performed. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
=====HeavyWeight sync=====&lt;br /&gt;
This memory barrier creates an ordering function for the storage accesses that are associated with all of the instructions that are executed by the processor executing the sync instruction.&lt;br /&gt;
&lt;br /&gt;
=====LightWeight sync=====&lt;br /&gt;
This memory barrier creates an ordering function for the storage accesses caused by LOAD and STORE instructions that are executed by the processor executing the sync instruction.  It applies only to accesses to storage that is neither Write Through Required nor Caching Inhibited.&lt;br /&gt;
&lt;br /&gt;
=====Enforce In-order Execution of I/O=====&lt;br /&gt;
The Enforce In-order Execution of I/O, or eieio, instruction is a memory barrier that creates an ordering function for the storage accesses caused by LOADs and STOREs.  These instructions are split into two groups: [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
1. Loads and stores to storage that is both Caching Inhibited and Guarded, and stores to main storage caused by stores to storage that is Write Through Required&lt;br /&gt;
&lt;br /&gt;
2. Stores to storage that is Memory Coherence Required and is neither Write Through Required nor Caching Inhibited&lt;br /&gt;
&lt;br /&gt;
For the first group, the ordering done by the memory barrier for accesses in the set is not cumulative; for the second group, it is cumulative.&lt;br /&gt;
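&lt;br /&gt;
As a rough analogy (a sketch of our own, not IBM code), C11 fences can play the role of sync/lwsync in the classic publication pattern: a release fence orders a data store before the flag store, much as lwsync is used on PowerPC, while a seq_cst fence corresponds to the stronger sync.&lt;br /&gt;
&lt;br /&gt;
```c
#include <assert.h>
#include <stdatomic.h>

/* Rough C11 analogy for the PowerPC barriers (illustrative only):
 *   sync   ~ atomic_thread_fence(memory_order_seq_cst)   (full ordering)
 *   lwsync ~ a release fence before a store, or an acquire
 *            fence after a load.
 * Shown here on a producer publishing data behind a flag. */
static atomic_int data_word;
static atomic_int ready_flag;

void producer(int value) {
    atomic_store_explicit(&data_word, value, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);  /* like lwsync: the data
                                                   store is ordered before
                                                   the flag store */
    atomic_store_explicit(&ready_flag, 1, memory_order_relaxed);
}

/* Returns the published value, or -1 if the flag is not yet set. */
int consumer(void) {
    if (atomic_load_explicit(&ready_flag, memory_order_relaxed) == 0)
        return -1;
    atomic_thread_fence(memory_order_acquire);  /* order the flag load
                                                   before the data load */
    return atomic_load_explicit(&data_word, memory_order_relaxed);
}
```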
&lt;br /&gt;
===Cell Broadband Engine===&lt;br /&gt;
Cell Broadband Engine, also referred to as Cell or Cell BE, is an IBM architecture whose first major application was in Sony’s PlayStation 3.  Cell has streamlined coprocessing elements, which makes it well suited to fast multimedia and vector-processing applications. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
This architecture is interesting because it uses a shared memory model in which LOADs and STOREs follow a “weakly consistent” storage model.  This means that the following orders may each differ from one another: [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
* The order of any processor element (PPE or SPE) performing storage access&lt;br /&gt;
* The order in which the accesses are performed with respect to another processor element&lt;br /&gt;
* The order in which the accesses are performed in main storage&lt;br /&gt;
&lt;br /&gt;
It is important that accesses to shared memory happen in the correct program order, or information could be lost or corrupted.  To ensure that this does not happen, the following memory barrier commands are used:&lt;br /&gt;
&lt;br /&gt;
====Fence====&lt;br /&gt;
A command issued with a fence is not executed until all previously issued commands within the same “tag group” have been performed.  However, a command issued after the fenced command may still be executed before it. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====Barrier====&lt;br /&gt;
A command issued with a barrier, and all commands issued after it, are not executed until all previously issued commands have been performed. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;ol&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/ch_3_yl WIKI reference for parallel programming models]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/ch_3_jb/Parallel_Programming_Model_Support WIKI reference for DOALL parallelism]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://courses.ncsu.edu/csc506/lec/001/lectures/notes/lec5.doc Lecture 5 from NC State's ECE/CSC506]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://courses.ncsu.edu/csc506/lec/001/lectures/notes/lec6.doc Lecture 6 from NC State's ECE/CSC506]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://refspecs.freestandards.org/IA64-softdevman-vol2.pdf IA-64 Software Development Manual]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://refspecs.freestandards.org/IA32-softdevman-vol3.pdf IA-32 Software Development Manual]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf CUDA Programming Guide]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.google.com/url?sa=t&amp;amp;source=web&amp;amp;cd=6&amp;amp;ved=0CEQQFjAF&amp;amp;url=http%3A%2F%2Flinuxindore.com%2Fdownloads%2Fdownload%2Fdata-structures%2Flinux-kernel-arch&amp;amp;ei=jxZWTaGTNI34sAPWm-ScDA&amp;amp;usg=AFQjCNG9UOAz7rHfwUDfayhr50M87uNOYA&amp;amp;sig2=azvo4h85RkoNHcZUtNIkJw Linux Kernel Architecture Overview]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://download.boulder.ibm.com/ibmdl/pub/software/dw/library/es-ppcbook2.zip PowerPC Architecture Book]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.google.com/url?sa=t&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCEQFjAA&amp;amp;url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FPowerPC&amp;amp;ei=77RYTejKFZSisQOm6-GiDA&amp;amp;usg=AFQjCNFt0LpxmNviHKFxCur-amK9HAG08Q&amp;amp;sig2=Kmm9RzJY-4AlG66AwWxlRA Wikipedia information on PowerPC]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.redbooks.ibm.com/redbooks/pdfs/sg247575.pdf IBM Cell Architecture Book]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.google.com/url?sa=t&amp;amp;source=web&amp;amp;cd=5&amp;amp;ved=0CDgQFjAE&amp;amp;url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FCell_(microprocessor)&amp;amp;ei=3MJYTeK5Aov6sAPC5-yiDA&amp;amp;usg=AFQjCNENg6PvayZebvtWf7KQstpJDk6URw&amp;amp;sig2=xs87jzBsFgneYOxP0k-_aQ Wikipedia information on Cell]&amp;lt;/li&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/ol&amp;gt;&lt;/div&gt;</summary>
		<author><name>Akrepask</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:ALIsDOALL.JPG&amp;diff=43810</id>
		<title>File:ALIsDOALL.JPG</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:ALIsDOALL.JPG&amp;diff=43810"/>
		<updated>2011-02-21T21:00:06Z</updated>

		<summary type="html">&lt;p&gt;Akrepask: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Akrepask</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:DOALL.JPG&amp;diff=43809</id>
		<title>File:DOALL.JPG</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:DOALL.JPG&amp;diff=43809"/>
		<updated>2011-02-21T20:58:33Z</updated>

		<summary type="html">&lt;p&gt;Akrepask: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Akrepask</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch3_ab&amp;diff=43736</id>
		<title>CSC/ECE 506 Spring 2011/ch3 ab</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch3_ab&amp;diff=43736"/>
		<updated>2011-02-14T06:24:49Z</updated>

		<summary type="html">&lt;p&gt;Akrepask: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Supplement to Chapter 3: Support for parallel-programming models. Discuss how DOACROSS, DOPIPE, DOALL, etc. are implemented in packages such as Posix threads, Intel Thread Building Blocks, OpenMP 2.0 and 3.0.&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this wiki supplement, we will discuss how the three kinds of parallelism, i.e. DOALL, DOACROSS and DOPIPE, are implemented in the threads packages OpenMP, Intel Threading Building Blocks, and POSIX Threads. We discuss each package from the perspective of variable scopes &amp;amp; its Reduction/DOALL/DOACROSS/DOPIPE implementations.&lt;br /&gt;
&lt;br /&gt;
==Implementation==&lt;br /&gt;
&lt;br /&gt;
===OpenMP===&lt;br /&gt;
The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix platforms and Windows NT platforms. Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.&lt;br /&gt;
&lt;br /&gt;
====Variable Clauses ====&lt;br /&gt;
There are many different types of clauses in OpenMP, each with its own characteristics. Here we introduce data sharing attribute clauses, synchronization clauses, scheduling clauses, initialization, and reduction. &lt;br /&gt;
=====Data sharing attribute clauses=====&lt;br /&gt;
* ''shared'': the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.&lt;br /&gt;
  Format: shared ''(list)''&lt;br /&gt;
&lt;br /&gt;
  SHARED variables behave as follows:&lt;br /&gt;
  1. Existing in only one memory location and all threads can read or write to that address &lt;br /&gt;
&lt;br /&gt;
* ''private'': the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.&lt;br /&gt;
  Format: private ''(list)''&lt;br /&gt;
&lt;br /&gt;
  PRIVATE variables behave as follows: &lt;br /&gt;
    1. A new object of the same type is declared once for each thread in the team&lt;br /&gt;
    2. All references to the original object are replaced with references to the new object&lt;br /&gt;
    3. Variables declared PRIVATE should be assumed to be uninitialized for each thread &lt;br /&gt;
&lt;br /&gt;
* ''default'': allows the programmer to state that the default data scoping within a parallel region will be either ''shared'', or ''none'' for C/C++, or ''shared'', ''firstprivate'', ''private'', or ''none'' for Fortran.  The ''none'' option forces the programmer to declare each variable in the parallel region using the data sharing attribute clauses.&lt;br /&gt;
  Format: default (shared | none)&lt;br /&gt;
&lt;br /&gt;
  DEFAULT variables behave as follows: &lt;br /&gt;
    1. Specific variables can be exempted from the default using the PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE, and REDUCTION clauses. &lt;br /&gt;
    2. Using NONE as a default requires that the programmer explicitly scope all variables.&lt;br /&gt;
&lt;br /&gt;
=====Synchronization clauses=====&lt;br /&gt;
* ''critical section'': the enclosed code block will be executed by only one thread at a time, and not simultaneously executed by multiple threads. It is often used to protect shared data from race conditions.&lt;br /&gt;
  Format: #pragma omp critical ''[ name ]  newline''&lt;br /&gt;
           ''structured_block''&lt;br /&gt;
&lt;br /&gt;
  CRITICAL SECTION behaves as follows:&lt;br /&gt;
    1. If a thread is currently executing inside a CRITICAL region and another thread reaches that CRITICAL region and attempts to execute it, it will block until the first thread exits that CRITICAL region.&lt;br /&gt;
    2. It is illegal to branch into or out of a CRITICAL block. &lt;br /&gt;
&lt;br /&gt;
* ''atomic'': similar to ''critical section'', but advises the compiler to use special hardware instructions for better performance. Compilers may choose to ignore this suggestion and use ''critical section'' instead.&lt;br /&gt;
  Format: #pragma omp atomic  ''newline''&lt;br /&gt;
           ''statement_expression''&lt;br /&gt;
&lt;br /&gt;
  ATOMIC behaves as follows:&lt;br /&gt;
    1. Applies only to a single, immediately following statement.&lt;br /&gt;
    2. An atomic statement must follow a specific syntax. &lt;br /&gt;
&lt;br /&gt;
* ''ordered'': the structured block is executed in the order in which iterations would be executed in a sequential loop&lt;br /&gt;
  Format: #pragma omp for ordered ''[clauses...]''&lt;br /&gt;
          ''(loop region)''&lt;br /&gt;
          #pragma omp ordered  ''newline''&lt;br /&gt;
          ''structured_block&lt;br /&gt;
          (end of loop region)''&lt;br /&gt;
&lt;br /&gt;
  ORDERED behaves as follows:&lt;br /&gt;
    1. May only appear in the dynamic extent of ''for'' or ''parallel for (C/C++)''.&lt;br /&gt;
    2. Only one thread is allowed in an ordered section at any time.&lt;br /&gt;
    3. It is illegal to branch into or out of an ORDERED block. &lt;br /&gt;
    4. A loop which contains an ORDERED directive, must be a loop with an ORDERED clause. &lt;br /&gt;
&lt;br /&gt;
* ''barrier'': each thread waits until all of the other threads of a team have reached this point. A work-sharing construct has an implicit barrier synchronization at the end.&lt;br /&gt;
   Format: #pragma omp barrier  ''newline''&lt;br /&gt;
&lt;br /&gt;
   BARRIER behaves as follows:&lt;br /&gt;
    1. All threads in a team (or none) must execute the BARRIER region.&lt;br /&gt;
    2. The sequence of work-sharing regions and barrier regions encountered must be the same for every thread in a team.&lt;br /&gt;
&lt;br /&gt;
*''taskwait'': specifies a wait on the completion of the child tasks generated since the beginning of the current task.&lt;br /&gt;
   Format: #pragma omp taskwait  ''newline''&lt;br /&gt;
&lt;br /&gt;
   TASKWAIT behaves as follows:&lt;br /&gt;
    1. Placed only at a point where a base language statement is allowed.&lt;br /&gt;
    2. Not be used in place of the statement following an if, while, do, switch, or label.&lt;br /&gt;
&lt;br /&gt;
*''flush'': The FLUSH directive identifies a synchronization point at which the implementation must provide a consistent view of memory. Thread-visible variables are written back to memory at this point. &lt;br /&gt;
   Format: #pragma omp flush ''(list)  newline''&lt;br /&gt;
&lt;br /&gt;
   FLUSH behaves as follows:&lt;br /&gt;
    1. The optional list contains a list of named variables that will be flushed in order to avoid flushing all variables.&lt;br /&gt;
    2. Implementations must ensure any prior modifications to thread-visible variables are visible to all threads after this point.&lt;br /&gt;
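&lt;br /&gt;
The atomic and critical clauses above can be sketched as follows.  This is an illustrative example of our own; if OpenMP is not enabled, the pragmas are simply ignored and the functions run serially with the same results.&lt;br /&gt;
&lt;br /&gt;
```c
#include <assert.h>

/* Sketch of the atomic and critical clauses.  The pragmas are ignored
 * by a compiler without OpenMP support, so the code also runs (and
 * gives the same answers) serially. */

/* Sum 1..n; atomic protects the single update statement. */
int sum_with_atomic(int n) {
    int sum = 0;
    #pragma omp parallel for
    for (int i = 1; i < n + 1; i++) {
        #pragma omp atomic          /* one hardware-assisted update */
        sum += i;
    }
    return sum;
}

/* Maximum of a[0..n-1]; critical protects a whole block. */
int max_with_critical(const int *a, int n) {
    int best = a[0];
    #pragma omp parallel for
    for (int i = 1; i < n; i++) {
        #pragma omp critical        /* one thread at a time in here */
        {
            if (a[i] > best) best = a[i];
        }
    }
    return best;
}
```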
&lt;br /&gt;
=====Scheduling clauses=====&lt;br /&gt;
*''schedule(type, chunk)'': This is useful if the work sharing construct is a do-loop or for-loop. The iteration(s) in the work sharing construct are assigned to threads according to the scheduling method defined by this clause. The three types of scheduling are:&lt;br /&gt;
#''static'': Here, all the threads are allocated iterations before they execute the loop iterations. The iterations are divided among threads equally by default. However, specifying an integer for the parameter &amp;quot;chunk&amp;quot; will allocate &amp;quot;chunk&amp;quot; number of contiguous iterations to a particular thread.&lt;br /&gt;
#''dynamic'': Here, some of the iterations are allocated to a smaller number of threads. Once a particular thread finishes its allocated iteration, it returns to get another one from the iterations that are left. The parameter &amp;quot;chunk&amp;quot; defines the number of contiguous iterations that are allocated to a thread at a time.&lt;br /&gt;
#''guided'': A large chunk of contiguous iterations is allocated to each thread dynamically (as above). The chunk size decreases exponentially with each successive allocation, down to a minimum size specified in the parameter &amp;quot;chunk&amp;quot;.&lt;br /&gt;
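&lt;br /&gt;
A small illustrative example of the schedule clause (our own sketch): since each iteration writes only its own slot, any of the three schedules gives the same output, and without OpenMP the pragma is ignored and the loop runs serially.&lt;br /&gt;
&lt;br /&gt;
```c
#include <assert.h>

/* Sketch of schedule(static, 2): iterations are dealt out to threads
 * in contiguous chunks of 2.  Each iteration writes only its own slot,
 * so no synchronization is needed and any schedule gives the same
 * result. */
void squares_scheduled(int n, int *out) {
    #pragma omp parallel for schedule(static, 2)
    for (int i = 0; i < n; i++)
        out[i] = i * i;
}
```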
=====Initialization=====&lt;br /&gt;
* ''firstprivate'': the data is private to each thread, but initialized using the value of the variable using the same name from the master thread.&lt;br /&gt;
  Format: firstprivate ''(list)''&lt;br /&gt;
&lt;br /&gt;
  FIRSTPRIVATE variables behave as follows: &lt;br /&gt;
    1. Listed variables are initialized according to the value of their original objects prior to entry into the parallel or work-sharing construct. &lt;br /&gt;
&lt;br /&gt;
* ''lastprivate'': the data is private to each thread. The value of this private data will be copied to a global variable using the same name outside the parallel region if current iteration is the last iteration in the parallelized loop.  A variable can be both ''firstprivate'' and ''lastprivate''. &lt;br /&gt;
  Format: lastprivate ''(list)''&lt;br /&gt;
&lt;br /&gt;
* ''threadprivate'': The data is global, but it is private in each parallel region at runtime. The difference between ''threadprivate'' and ''private'' is the global scope associated with ''threadprivate'' and the value preserved across parallel regions.&lt;br /&gt;
  Format: #pragma omp threadprivate ''(list)''&lt;br /&gt;
&lt;br /&gt;
  THREADPRIVATE variables behave as follows: &lt;br /&gt;
    1. On first entry to a parallel region, data in THREADPRIVATE variables and common blocks should be assumed undefined. &lt;br /&gt;
    2. The THREADPRIVATE directive must appear after every declaration of a thread private variable/common block.&lt;br /&gt;
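&lt;br /&gt;
The firstprivate and lastprivate clauses can be illustrated with the following sketch (our own example).  With OpenMP enabled, each thread gets a private base initialized from the original, and x receives the value from the sequentially last iteration; without OpenMP the pragma is ignored and the serial semantics coincide.&lt;br /&gt;
&lt;br /&gt;
```c
#include <assert.h>

/* Sketch of firstprivate/lastprivate: base is copied into each thread
 * (firstprivate), and x keeps the value assigned by the sequentially
 * last iteration, i == n-1 (lastprivate). */
int last_offset(int n, int base) {
    int x = base;
    #pragma omp parallel for firstprivate(base) lastprivate(x)
    for (int i = 0; i < n; i++)
        x = base + i;   /* each iteration derives x from base */
    return x;           /* value from iteration i == n-1 */
}
```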
&lt;br /&gt;
=====Reduction=====&lt;br /&gt;
* ''reduction'': the variable has a local copy in each thread, and the values of the local copies are combined (reduced) into a global shared variable. This is very useful when a particular operation (specified by the &amp;quot;operator&amp;quot; argument of this clause) is applied to a variable iteratively, so that its value at one iteration depends on its value at a previous iteration. Essentially, the steps that lead up to each update are parallelized, but the threads combine their partial results in order, so as to avoid a race condition. &lt;br /&gt;
  Format: reduction ''(operator: list)''&lt;br /&gt;
&lt;br /&gt;
  REDUCTION variables behave as follows: &lt;br /&gt;
    1. Variables in the list must be named scalar variables. They can not be array or structure type variables. They must also be declared SHARED in the enclosing context.&lt;br /&gt;
    2. Reduction operations may not be associative for real numbers.&lt;br /&gt;
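&lt;br /&gt;
A typical reduction example is a dot product (an illustrative sketch of our own).  Each thread accumulates a private copy of sum, and the copies are combined with + at the end of the region; serial execution with the pragma ignored gives the same answer.&lt;br /&gt;
&lt;br /&gt;
```c
#include <assert.h>

/* Sketch of reduction(+ : sum): each thread accumulates into its own
 * private sum, and the partial sums are added together at the end of
 * the parallel region. */
double dot_product(const double *a, const double *b, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```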
&lt;br /&gt;
====DOALL====&lt;br /&gt;
&lt;br /&gt;
In Code 3.20, we must first include the header file ''omp.h'', which contains the OpenMP function declarations. A parallel region is started by #pragma omp parallel and enclosed in curly brackets. We can use (setenv OMP_NUM_THREADS n) to specify the number of threads; another way to set the number of threads is to call the function omp_set_num_threads(n) directly. &lt;br /&gt;
Code 3.20 has only one loop to execute in parallel, so we combine the start of the parallel loop and the start of the parallel region into one directive, ''#pragma omp parallel for''. &lt;br /&gt;
 &lt;br /&gt;
 '''Code 3.20 A DOALL parallelism example in OpenMP'''&lt;br /&gt;
 #include &amp;lt;omp.h&amp;gt;&lt;br /&gt;
 ...&lt;br /&gt;
 #pragma omp parallel for default(shared) //the loop is the parallel region&lt;br /&gt;
 for (i = 0; i &amp;lt; n; i++)&lt;br /&gt;
   A[i] = A[i] + A[i] - 3.0;&lt;br /&gt;
&lt;br /&gt;
Clearly, there is no loop-carried dependence in the ''i'' loop. With OpenMP, we only need to insert the ''pragma'' directive ''parallel for''. The ''default(shared)'' clause states that all variables within the scope of the loop are shared unless otherwise specified.&lt;br /&gt;
&lt;br /&gt;
====DOACROSS====&lt;br /&gt;
We will now introduce how to implement DOACROSS in OpenMP. Here is an example code which has not yet been parallelized.&lt;br /&gt;
 &lt;br /&gt;
 '''Sample Code'''&lt;br /&gt;
 for (i = 1; i &amp;lt; N; i++) {&lt;br /&gt;
   for (j = 1; j &amp;lt; N; j++) {&lt;br /&gt;
     a[i][j] = a[i-1][j] + a[i][j-1];&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
From this sample code, there are obviously loop-carried dependences: &lt;br /&gt;
 a[i,j] -&amp;gt;T a[i+1,j] and a[i,j] -&amp;gt;T a[i,j+1]&lt;br /&gt;
&lt;br /&gt;
In OpenMP, DOALL parallelism can be implemented by inserting a “#pragma omp for” before the “for” loop in the source code, but there is no pragma corresponding to DOACROSS parallelism.&lt;br /&gt;
&lt;br /&gt;
When we implement DOACROSS, we use a shared array &amp;quot;_mylock&amp;quot;, with one entry per thread, to store the events of each thread. Besides, a private variable _counter0 is defined to indicate the event which the current thread is waiting for. &lt;br /&gt;
The number of threads is obtained with the function &amp;quot;omp_get_num_threads()&amp;quot; and the current thread's id with the function &amp;quot;omp_get_thread_num()&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
*omp_get_num_threads(): Returns the number of threads that are currently in the team executing the parallel region from which it is called.&lt;br /&gt;
 Format: #include &amp;lt;omp.h&amp;gt;&lt;br /&gt;
         int omp_get_num_threads(void)&lt;br /&gt;
&lt;br /&gt;
 OMP_GET_NUM_THREADS behaves as follows:&lt;br /&gt;
  1. If this call is made from a serial portion of the program, or a nested parallel region that is serialized, it will return 1. &lt;br /&gt;
  2. The default number of threads is implementation dependent. &lt;br /&gt;
&lt;br /&gt;
*omp_get_thread_num(): Returns the thread number, within the team, of the thread making this call. This number will be between 0 and OMP_GET_NUM_THREADS-1. The master thread of the team is thread 0. &lt;br /&gt;
 Format: #include &amp;lt;omp.h&amp;gt;&lt;br /&gt;
         int omp_get_thread_num(void)&lt;br /&gt;
&lt;br /&gt;
 OMP_GET_THREAD_NUM behaves as follows:&lt;br /&gt;
  1. If called from a nested parallel region, or a serial region, this function will return 0. &lt;br /&gt;
&lt;br /&gt;
Now, let's look at the parallelized code and its explanation. &lt;br /&gt;
&lt;br /&gt;
 01: int _mylocks[256]; 		//thread’s synchronized array&lt;br /&gt;
 02: #pragma omp parallel&lt;br /&gt;
 03: {&lt;br /&gt;
 04:   int _counter0 = 1;&lt;br /&gt;
 05:   int _my_id = omp_get_thread_num();&lt;br /&gt;
 06:   int _my_nprocs= omp_get_num_threads();&lt;br /&gt;
 07:   _mylocks[_my_id] = 0;&lt;br /&gt;
 08:   for(j_tile = 0; j_tile&amp;lt;N-1; j_tile+=M){&lt;br /&gt;
 09:     if(_my_id&amp;gt;0) {&lt;br /&gt;
 10:       do{&lt;br /&gt;
 11:         #pragma omp flush(_mylocks)&lt;br /&gt;
 12:       } while(_mylocks[_my_id-1]&amp;lt;_counter0);&lt;br /&gt;
 13:       #pragma omp flush(a, _mylocks)&lt;br /&gt;
 14:       _counter0 += 1;&lt;br /&gt;
 15:     }&lt;br /&gt;
 16:     #pragma omp for nowait&lt;br /&gt;
 17:     for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 18:       for(j=j_tile;j&amp;lt;j_tile+M;j++){&lt;br /&gt;
 19:         a[i][j]=a[i-1][j]+a[i][j-1];&lt;br /&gt;
 20:       }&lt;br /&gt;
 21:     }&lt;br /&gt;
 22:     _mylocks[_my_id] += 1;&lt;br /&gt;
 23:     #pragma omp flush(a, _mylocks)&lt;br /&gt;
 24:   }&lt;br /&gt;
 25: }&lt;br /&gt;
&lt;br /&gt;
We parallelized the original program in two steps.&lt;br /&gt;
&lt;br /&gt;
*First step: We divide the i loop among the processors by inserting the OpenMP construct “#pragma omp for nowait” (line 16). Each processor then takes a share of the iterations of loop i. The j loop is tiled in the same way: assuming the tile size M is 4, each processor executes four iterations of loop j per tile. To keep the total iteration count equal to that of the original program, the tiling loop over j_tile encloses loop i (line 08), e.g. ''for (j_tile = 0; j_tile &amp;lt; N-1; j_tile += 4)''.&lt;br /&gt;
The lower bound of loop j is set to j_tile and the upper bound to j_tile+M (line 18). The other statements are kept unchanged.&lt;br /&gt;
&lt;br /&gt;
*Second step: We synchronize neighboring threads. After the first step, each of the four processors computes a 4x4 block. If all four processors ran completely in parallel, the dependence would be violated, so neighboring threads must be synchronized.&lt;br /&gt;
We use four variables: &lt;br /&gt;
1. A private variable _my_nprocs = omp_get_num_threads(), the total number of threads running the parallel region.&lt;br /&gt;
2. A private variable _my_id = omp_get_thread_num(), the unique ID of the current thread.&lt;br /&gt;
3. A shared array _mylocks[proc], each element initialized to 0, which indicates whether thread proc-1 has finished computing the current block.&lt;br /&gt;
4. A private variable _counter0, initialized to 1, which indicates the block the current thread is waiting for.&lt;br /&gt;
&lt;br /&gt;
With these four variables, the threads are synchronized as follows:&lt;br /&gt;
The first thread runs without waiting (line 09), because its thread ID is 0. Every other thread spins at line 12 as long as the value in ''_mylocks[_my_id-1]'' is smaller than ''_counter0''.&lt;br /&gt;
&lt;br /&gt;
Once the block the current thread is waiting for has been completed, the thread proceeds past line 12 and marks the next block it will wait for by incrementing ''_counter0'' (line 14).&lt;br /&gt;
&lt;br /&gt;
When the current thread finishes its own block, it signals completion by incrementing ''_mylocks[_my_id]'' (line 22). As soon as the neighboring thread observes the changed value, it continues running, and so on. The figure below illustrates this.&lt;br /&gt;
[[Image:Synchorization.jpg]]&lt;br /&gt;
&lt;br /&gt;
====DOPIPE====&lt;br /&gt;
Here is another example, which we will parallelize using DOPIPE parallelism. The sample code contains the dependence S2 -&amp;gt; T S1.&lt;br /&gt;
 '''Sample Code'''&lt;br /&gt;
 01: for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 02:   S1: a[i]=b[i];&lt;br /&gt;
 03:   S2: c[i]=c[i-1]+a[i];&lt;br /&gt;
 04: }&lt;br /&gt;
Now, let's see how to parallelize the sample code with DOPIPE parallelism.&lt;br /&gt;
We again use a shared array &amp;quot;_mylocks[threadid]&amp;quot; to store the events of each thread, and a private variable _counter0 to indicate the event the current thread is waiting for. &lt;br /&gt;
The number of threads is obtained with &amp;quot;omp_get_num_threads()&amp;quot; and the current thread's ID with &amp;quot;omp_get_thread_num()&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
 01: int _mylocks[256]; 			//thread’s synchronized array&lt;br /&gt;
 02: #pragma omp parallel&lt;br /&gt;
 03: {&lt;br /&gt;
 04:   int _counter0 = 1;&lt;br /&gt;
 05:   int _my_id = omp_get_thread_num();&lt;br /&gt;
 06:   int _my_nprocs= omp_get_num_threads();&lt;br /&gt;
 07:   _mylocks[_my_id] = 0;&lt;br /&gt;
 08:   for(i_tile = 0; i_tile&amp;lt;N-1; i_tile+=M){&lt;br /&gt;
 09:     if(_my_id&amp;gt;0) {&lt;br /&gt;
 10:       do{&lt;br /&gt;
 11:         #pragma omp flush(_mylocks)&lt;br /&gt;
 12:       } while(_mylocks[_my_id-1]&amp;lt;_counter0);&lt;br /&gt;
 13:       #pragma omp flush(a, _mylocks)&lt;br /&gt;
 14:       _counter0 += 1;&lt;br /&gt;
 15:     }&lt;br /&gt;
 16:     #pragma omp for nowait&lt;br /&gt;
 17:     for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 18:       a[i]=b[i];&lt;br /&gt;
 19:     }&lt;br /&gt;
 20:     for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 21:       c[i]=c[i-1]+a[i];&lt;br /&gt;
 22:     }&lt;br /&gt;
 23:     _mylocks[_my_id] += 1;&lt;br /&gt;
 24:     #pragma omp flush(a, _mylocks)&lt;br /&gt;
 25:   }&lt;br /&gt;
 26: }&lt;br /&gt;
&lt;br /&gt;
We parallelized the original program in two steps.&lt;br /&gt;
&lt;br /&gt;
*First step: We divide the i loop among the processors by inserting the OpenMP construct “#pragma omp for nowait” (line 16). Each processor then takes a share of the iterations of loop i. There are now two i loops, each containing different statements. The other statements are kept unchanged.&lt;br /&gt;
&lt;br /&gt;
*Second step: We synchronize the threads. After the first step, the processors finish computing &lt;br /&gt;
a[i]=b[i]. If all processors executed the second i loop fully in parallel, the dependence would be violated, so neighboring threads must be synchronized.&lt;br /&gt;
Again, we use four variables: &lt;br /&gt;
1. A private variable _my_nprocs = omp_get_num_threads(), the total number of threads running the parallel region.&lt;br /&gt;
2. A private variable _my_id = omp_get_thread_num(), the unique ID of the current thread.&lt;br /&gt;
3. A shared array _mylocks[proc], each element initialized to 0, which indicates whether thread proc-1 has finished computing the current block.&lt;br /&gt;
4. A private variable _counter0, initialized to 1, which indicates the block the current thread is waiting for.&lt;br /&gt;
&lt;br /&gt;
When the current thread finishes its block, it signals completion by incrementing ''_mylocks[_my_id]''. Once a processor has finished its block, the other processors can read the values it produced and use them in their own statements.&lt;br /&gt;
&lt;br /&gt;
====Functional Parallelism====&lt;br /&gt;
&lt;br /&gt;
To introduce functional parallelism, we want to execute one code section in parallel with another. Code 3.21 shows two loops that execute in parallel with respect to one another, although each loop is executed sequentially.&lt;br /&gt;
&lt;br /&gt;
 '''Code''' 3.21 A function parallelism example in OpenMP&lt;br /&gt;
 '''#pragma''' omp parallel shared(A, B) private(i)&lt;br /&gt;
 '''{'''&lt;br /&gt;
  '''#pragma''' omp sections nowait&lt;br /&gt;
  '''{'''&lt;br /&gt;
      '''#pragma''' omp section&lt;br /&gt;
      '''for'''( i = 0; i &amp;lt; n ; i++)&lt;br /&gt;
         '''A[i]''' = A[i]*A[i] - 4.0;&lt;br /&gt;
      '''#pragma''' omp section&lt;br /&gt;
      '''for'''( i = 0; i &amp;lt; n ; i++)&lt;br /&gt;
         '''B[i]''' = B[i]*B[i] - 9.0;&lt;br /&gt;
  '''}'''//end omp sections&lt;br /&gt;
 '''}'''//end omp parallel&lt;br /&gt;
&lt;br /&gt;
In code 3.21, the two loops need to execute in parallel with each other, so we insert a ''pragma omp section'' before each loop. With these two statements in place, each loop still executes sequentially within its section, but the two sections execute in parallel with respect to one another.&lt;br /&gt;
&lt;br /&gt;
===Intel Thread Building Blocks===&lt;br /&gt;
&lt;br /&gt;
Intel Threading Building Blocks (Intel TBB) is a library that supports scalable &lt;br /&gt;
parallel programming using standard ISO C++ code. It does not require special &lt;br /&gt;
languages or compilers, and it is designed to promote scalable data-parallel programming. &lt;br /&gt;
The library consists of data structures and algorithms that let a programmer avoid some of the complications of native threading packages such as POSIX threads, Windows threads, or the portable Boost Threads, in which individual threads of execution are created, synchronized, and terminated manually. Instead, the library abstracts access to the processors by treating operations as &amp;quot;tasks,&amp;quot; which its run-time engine allocates to individual cores dynamically while automating efficient use of the cache. This places TBB in a family of solutions for parallel programming that aim to decouple the program from the particulars of the underlying machine. The net result is that Intel TBB lets you specify &lt;br /&gt;
parallelism more conveniently than raw threads, and at the same time can &lt;br /&gt;
improve performance.&lt;br /&gt;
&lt;br /&gt;
====Library Contents====&lt;br /&gt;
&lt;br /&gt;
Intel TBB is a collection of components for parallel programming. Here is an overview of the library contents:&lt;br /&gt;
&lt;br /&gt;
* Basic algorithms: parallel_for, parallel_reduce, parallel_scan&lt;br /&gt;
* Advanced algorithms: parallel_while, parallel_do, pipeline, parallel_sort&lt;br /&gt;
* Containers: concurrent_queue, concurrent_vector, concurrent_hash_map&lt;br /&gt;
* Scalable memory allocation: scalable_malloc, scalable_free, scalable_realloc, scalable_calloc, scalable_allocator, cache_aligned_allocator&lt;br /&gt;
* Mutual exclusion: mutex, spin_mutex, queuing_mutex, spin_rw_mutex, queuing_rw_mutex, recursive mutex&lt;br /&gt;
* Atomic operations: fetch_and_add, fetch_and_increment, fetch_and_decrement, compare_and_swap, fetch_and_store&lt;br /&gt;
* Timing: portable fine grained global time stamp&lt;br /&gt;
* Task Scheduler: direct access to control the creation and activation of tasks&lt;br /&gt;
&lt;br /&gt;
Next, we will focus on some specific TBB components.&lt;br /&gt;
&lt;br /&gt;
=====parallel_for=====&lt;br /&gt;
&lt;br /&gt;
Parallel_for is the template function that performs parallel iteration over a range of values. In Intel TBB, a lot of DOALL cases could be implemented by using this function. The syntax is as follows: &lt;br /&gt;
 template&amp;lt;typename Index, typename Function&amp;gt;&lt;br /&gt;
 Function parallel_for(Index first, Index last, Index step, Function f);&lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_for( const Range&amp;amp; range, const Body&amp;amp; body [, partitioner] );&lt;br /&gt;
&lt;br /&gt;
A parallel_for(first, last, step, f) represents parallel execution of the loop: &amp;quot;for( auto i=first; i&amp;lt;last; i+=step ) f(i);&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=====parallel_reduce=====&lt;br /&gt;
&lt;br /&gt;
Function parallel_reduce computes reduction over a range. Syntax is as follows:&lt;br /&gt;
 template&amp;lt;typename Range, typename Value, typename Func, typename Reduction&amp;gt;&lt;br /&gt;
 Value parallel_reduce( const Range&amp;amp; range, const Value&amp;amp; identity, const Func&amp;amp; func, const Reduction&amp;amp; reduction );&lt;br /&gt;
&lt;br /&gt;
The functional form parallel_reduce(range,identity,func,reduction) performs a&lt;br /&gt;
parallel reduction by applying func to subranges in range and reducing the results&lt;br /&gt;
using binary operator reduction. It returns the result of the reduction. Parameter func&lt;br /&gt;
and reduction can be lambda expressions.&lt;br /&gt;
&lt;br /&gt;
=====parallel_scan=====&lt;br /&gt;
&lt;br /&gt;
This template function computes parallel prefix. Syntax is as follows:&lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_scan( const Range&amp;amp; range, Body&amp;amp; body );&lt;br /&gt;
 &lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_scan( const Range&amp;amp; range, Body&amp;amp; body, const auto_partitioner&amp;amp; );&lt;br /&gt;
 &lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_scan( const Range&amp;amp; range, Body&amp;amp; body, const simple_partitioner&amp;amp; );&lt;br /&gt;
&lt;br /&gt;
A parallel_scan(range,body) computes a parallel prefix, also known as parallel&lt;br /&gt;
scan. This computation is an advanced concept in parallel computing that is&lt;br /&gt;
sometimes useful in scenarios that appear to have inherently serial dependences. A&lt;br /&gt;
further explanation will be given in the DOACROSS example.&lt;br /&gt;
&lt;br /&gt;
=====pipeline=====&lt;br /&gt;
&lt;br /&gt;
This class performs pipelined execution. Its members are as follows:&lt;br /&gt;
 namespace tbb {&lt;br /&gt;
     class pipeline {&lt;br /&gt;
     public:&lt;br /&gt;
        pipeline();&lt;br /&gt;
        ~pipeline(); &lt;br /&gt;
        void add_filter( filter&amp;amp; f );&lt;br /&gt;
        void run( size_t max_number_of_live_tokens );&lt;br /&gt;
        void clear();&lt;br /&gt;
   };&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
A pipeline represents pipelined application of a series of filters to a stream of items.&lt;br /&gt;
Each filter operates in a particular mode: parallel, serial in order, or serial out of order. With a parallel filter, &lt;br /&gt;
we could implement DOPIPE parallelism.&lt;br /&gt;
&lt;br /&gt;
====Reduction====&lt;br /&gt;
&lt;br /&gt;
The reduction in Intel TBB is implemented using parallel_reduce function. A parallel_reduce recursively splits the range into subranges and uses the splitting constructor to make one or more copies of the body for each thread. We use an example to illustrate this: &lt;br /&gt;
&lt;br /&gt;
 #include &amp;quot;tbb/parallel_reduce.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/blocked_range.h&amp;quot;&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
 struct Sum {&lt;br /&gt;
     float value;&lt;br /&gt;
     Sum() : value(0) {}&lt;br /&gt;
     Sum( Sum&amp;amp; s, split ) {value = 0;}&lt;br /&gt;
     void operator()( const blocked_range&amp;lt;float*&amp;gt;&amp;amp; r ) {&lt;br /&gt;
         float temp = value;&lt;br /&gt;
         for( float* a=r.begin(); a!=r.end(); ++a ) {&lt;br /&gt;
             temp += *a;&lt;br /&gt;
         }&lt;br /&gt;
         value = temp;&lt;br /&gt;
     }&lt;br /&gt;
     void join( Sum&amp;amp; rhs ) {value += rhs.value;}&lt;br /&gt;
 };&lt;br /&gt;
 float ParallelSum( float array[], size_t n ) {&lt;br /&gt;
     Sum total;&lt;br /&gt;
     parallel_reduce( blocked_range&amp;lt;float*&amp;gt;( array, array+n ), total );&lt;br /&gt;
     return total.value;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The above example sums the values in the array. parallel_reduce performs the reduction over the range (array, array+n), splitting the working body into copies and then joining the partial result of each split back together.&lt;br /&gt;
&lt;br /&gt;
====DOALL====&lt;br /&gt;
The implementation of DOALL parallelism in Intel TBB will involve Parallel_for function. &lt;br /&gt;
To better illustrate the usage, here we discuss a simple example. The following is the original code:&lt;br /&gt;
 &lt;br /&gt;
 void SerialApplyFoo( float a[], size_t n ) {&lt;br /&gt;
     for( size_t i=0; i&amp;lt;n; ++i )&lt;br /&gt;
         Foo(a[i]);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
After using Intel TBB, it could be switched to the following:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;quot;tbb/blocked_range.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/parallel_for.h&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
 class ApplyFoo {&lt;br /&gt;
     float *const my_a;&lt;br /&gt;
 public:&lt;br /&gt;
     void operator( )( const blocked_range&amp;lt;size_t&amp;gt;&amp;amp; r ) const {&lt;br /&gt;
         float *a = my_a;&lt;br /&gt;
         for( size_t i=r.begin(); i!=r.end( ); ++i )&lt;br /&gt;
             Foo(a[i]);&lt;br /&gt;
     }&lt;br /&gt;
     ApplyFoo( float a[] ) :&lt;br /&gt;
         my_a(a)&lt;br /&gt;
     {}&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 void ParallelApplyFoo( float a[], size_t n ) {&lt;br /&gt;
     parallel_for(blocked_range&amp;lt;size_t&amp;gt;(0,n,The_grain_size_You_Pick), ApplyFoo(a) );&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
This is the simplest form of DOALL parallelism, similar to the one in the textbook, and its execution graph is very similar to the one in the DOALL section above. This simple illustration gives you a flavor of how DOALL is implemented in Intel Threading Building Blocks.&lt;br /&gt;
&lt;br /&gt;
One more note: parallel_for takes an optional third argument to specify a partitioner, represented above by &amp;quot;The_grain_size_You_Pick&amp;quot;. If you want to divide the work manually and assign it to processors, you can specify a grain size there. Alternatively, you can use the automatic partitioning provided by TBB: the auto_partitioner heuristically chooses the grain size so that you do not have to specify one, attempting to limit overhead while still providing ample opportunities for load balancing. The last three lines of the TBB code above then become:&lt;br /&gt;
&lt;br /&gt;
 void ParallelApplyFoo( float a[], size_t n ) {&lt;br /&gt;
     parallel_for(blocked_range&amp;lt;size_t&amp;gt;(0,n), ApplyFoo(a), auto_partitioner( ) );&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
====DOACROSS====&lt;br /&gt;
&lt;br /&gt;
Intel TBB provides a good way to implement DOACROSS with the help of parallel_scan. As stated in the parallel_scan section, this function computes a parallel prefix, also known as a parallel&lt;br /&gt;
scan. This computation is an advanced concept in parallel computing that&lt;br /&gt;
can help in scenarios that appear to have inherently serial dependences, such as loop-carried dependences. &lt;br /&gt;
&lt;br /&gt;
Let's consider this scenario (which is actually the mathematical definition of parallel prefix):  &lt;br /&gt;
 T temp = id⊕;&lt;br /&gt;
 for( int i=1; i&amp;lt;=n; ++i ) {&lt;br /&gt;
     temp = temp ⊕ x[i];&lt;br /&gt;
     y[i] = temp;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we implement this in TBB using parallel_scan, it becomes:&lt;br /&gt;
&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
 class Body {&lt;br /&gt;
     T sum;&lt;br /&gt;
     T* const y;&lt;br /&gt;
     const T* const x;&lt;br /&gt;
 public:&lt;br /&gt;
     Body( T y_[], const T x_[] ) : sum(id⊕), x(x_), y(y_) {}&lt;br /&gt;
     T get_sum() const {return sum;}&lt;br /&gt;
     template&amp;lt;typename Tag&amp;gt;&lt;br /&gt;
     void operator()( const blocked_range&amp;lt;int&amp;gt;&amp;amp; r, Tag ) {&lt;br /&gt;
         T temp = sum;&lt;br /&gt;
         for( int i=r.begin(); i&amp;lt;r.end(); ++i ) {&lt;br /&gt;
             temp = temp ⊕ x[i];&lt;br /&gt;
             if( Tag::is_final_scan() )&lt;br /&gt;
                 y[i] = temp;&lt;br /&gt;
         } &lt;br /&gt;
         sum = temp;&lt;br /&gt;
     }&lt;br /&gt;
     Body( Body&amp;amp; b, split ) : x(b.x), y(b.y), sum(id⊕) {}&lt;br /&gt;
     void reverse_join( Body&amp;amp; a ) { sum = a.sum ⊕ sum;}&lt;br /&gt;
     void assign( Body&amp;amp; b ) {sum = b.sum;}&lt;br /&gt;
 };&lt;br /&gt;
&lt;br /&gt;
 T DoParallelScan( T y[], const T x[], int n ) {&lt;br /&gt;
     Body body(y,x);&lt;br /&gt;
     parallel_scan( blocked_range&amp;lt;int&amp;gt;(0,n), body );&lt;br /&gt;
     return body.get_sum();&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
It is the second part (function DoParallelScan) that we have to focus on. &lt;br /&gt;
&lt;br /&gt;
This example is exactly the scenario mentioned above that can take advantage of parallel_scan. The &amp;quot;inherently serial dependence&amp;quot; is taken care of by parallel_scan itself: by computing the prefix, the serial code can be parallelized with a single function call.&lt;br /&gt;
&lt;br /&gt;
====DOPIPE====&lt;br /&gt;
&lt;br /&gt;
The pipeline class is the Intel TBB component that performs pipelined execution. A pipeline represents pipelined application of a series of filters to a stream of items. Each filter operates in a particular mode: parallel, serial in order, or serial out of order. This class can therefore be used to implement DOPIPE parallelism.&lt;br /&gt;
&lt;br /&gt;
Here is a comparatively complex example about pipeline implementation. Also, if we look carefully, this is an example with both DOPIPE and DOACROSS:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;iostream&amp;gt;&lt;br /&gt;
 #include &amp;quot;tbb/pipeline.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/tbb_thread.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/task_scheduler_init.h&amp;quot;&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
 char InputString[] = &amp;quot;abcdefg\n&amp;quot;;&lt;br /&gt;
 class InputFilter: public filter {&lt;br /&gt;
     char* my_ptr;&lt;br /&gt;
 public:&lt;br /&gt;
     void* operator()(void*) {&lt;br /&gt;
         if (*my_ptr)&lt;br /&gt;
             return my_ptr++;&lt;br /&gt;
         else&lt;br /&gt;
             return NULL;&lt;br /&gt;
     }&lt;br /&gt;
     InputFilter() :&lt;br /&gt;
         filter( serial_in_order ), my_ptr(InputString) {}&lt;br /&gt;
 };&lt;br /&gt;
 class OutputFilter: public thread_bound_filter {&lt;br /&gt;
 public:&lt;br /&gt;
     void* operator()(void* item) {&lt;br /&gt;
         std::cout &amp;lt;&amp;lt; *(char*)item;&lt;br /&gt;
         return NULL;&lt;br /&gt;
     }&lt;br /&gt;
     OutputFilter() : thread_bound_filter(serial_in_order) {}&lt;br /&gt;
 };&lt;br /&gt;
 void RunPipeline(pipeline* p) {&lt;br /&gt;
     p-&amp;gt;run(8);&lt;br /&gt;
 }&lt;br /&gt;
 int main() {&lt;br /&gt;
     // Construct the pipeline&lt;br /&gt;
     InputFilter f;&lt;br /&gt;
     OutputFilter g;&lt;br /&gt;
     pipeline p;&lt;br /&gt;
     p.add_filter(f);&lt;br /&gt;
     p.add_filter(g);&lt;br /&gt;
     // Another thread initiates execution of the pipeline&lt;br /&gt;
     tbb_thread t(RunPipeline,&amp;amp;p);&lt;br /&gt;
     // Process the thread_bound_filter with the current thread.&lt;br /&gt;
     while (g.process_item()!=thread_bound_filter::end_of_stream)&lt;br /&gt;
         continue;&lt;br /&gt;
     // Wait for pipeline to finish on the other thread.&lt;br /&gt;
     t.join();&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The example above shows a pipeline with two filters where the second filter is a thread_bound_filter serviced by the main thread. The main thread does the following after constructing the pipeline:&lt;br /&gt;
1. Start the pipeline on another thread.&lt;br /&gt;
2. Service the thread_bound_filter until it reaches end_of_stream.&lt;br /&gt;
3. Wait for the other thread to finish.&lt;br /&gt;
&lt;br /&gt;
===POSIX Threads===&lt;br /&gt;
&lt;br /&gt;
POSIX Threads, or Pthreads, is a POSIX standard for threads. The standard, POSIX.1c, Threads extensions (IEEE Std 1003.1c-1995), defines an API for creating and manipulating threads.&lt;br /&gt;
&lt;br /&gt;
====API Overview====&lt;br /&gt;
Pthreads defines a set of C programming language types, functions and constants. It is implemented with a pthread.h header and a thread library.&lt;br /&gt;
&lt;br /&gt;
There are around 100 Pthreads procedures, all prefixed &amp;quot;pthread_&amp;quot;. The subroutines which comprise the Pthreads API can be informally grouped into four major groups:&lt;br /&gt;
&lt;br /&gt;
* '''Thread management:''' Routines that work directly on threads: creating, detaching, joining, etc. They also include functions to set/query thread attributes (joinable, scheduling, etc.). E.g. pthread_create(), pthread_join().&lt;br /&gt;
* '''Mutexes:''' Routines that deal with a synchronization primitive called a &amp;quot;mutex&amp;quot;, an abbreviation for &amp;quot;mutual exclusion&amp;quot;. Mutex functions provide for creating, destroying, locking, and unlocking mutexes. These are supplemented by mutex attribute functions that set or modify attributes associated with mutexes. E.g. pthread_mutex_lock(), pthread_mutex_trylock(), pthread_mutex_unlock().&lt;br /&gt;
* '''Condition variables:''' Routines that address communication between threads that share a mutex, based upon programmer-specified conditions. This group includes functions to create, destroy, wait, and signal based upon specified variable values, as well as functions to set/query condition-variable attributes. E.g. pthread_cond_signal(), pthread_cond_broadcast(), pthread_cond_wait(), pthread_cond_timedwait(), pthread_cond_reltimedwait_np().&lt;br /&gt;
* '''Synchronization:''' Routines that manage read/write locks and barriers. E.g. pthread_rwlock_rdlock(), pthread_rwlock_tryrdlock(), pthread_rwlock_wrlock(), pthread_rwlock_trywrlock(), pthread_rwlock_unlock(), pthread_barrier_init(), pthread_barrier_wait().&lt;br /&gt;
&lt;br /&gt;
====DOALL====&lt;br /&gt;
&lt;br /&gt;
The following is a simple C code example, in DOALL style, that prints each thread's ID.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;
 #define NUM_THREADS     5&lt;br /&gt;
  &lt;br /&gt;
 void *PrintHello(void *threadid)&lt;br /&gt;
 {&lt;br /&gt;
    long tid;&lt;br /&gt;
  &lt;br /&gt;
    tid = (long)threadid;&lt;br /&gt;
    printf(&amp;quot;Hello World! It's me, thread #%ld!\n&amp;quot;, tid);&lt;br /&gt;
    pthread_exit(NULL);&lt;br /&gt;
 }&lt;br /&gt;
  &lt;br /&gt;
 int main (int argc, char *argv[])&lt;br /&gt;
 {&lt;br /&gt;
    pthread_t threads[NUM_THREADS];&lt;br /&gt;
  &lt;br /&gt;
    int rc;&lt;br /&gt;
    long t;&lt;br /&gt;
    for(t=0; t&amp;lt;NUM_THREADS; t++){&lt;br /&gt;
       printf(&amp;quot;In main: creating thread %ld\n&amp;quot;, t);&lt;br /&gt;
       rc = pthread_create(&amp;amp;threads[t], NULL, PrintHello, (void *)t);&lt;br /&gt;
  &lt;br /&gt;
       if (rc){&lt;br /&gt;
          printf(&amp;quot;ERROR; return code from pthread_create() is %d\n&amp;quot;, rc);&lt;br /&gt;
          exit(-1);&lt;br /&gt;
       }&lt;br /&gt;
    }&lt;br /&gt;
    pthread_exit(NULL);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The loop body contains only a single statement with no cross-iteration dependences, so each iteration can be treated as an independent parallel task.&lt;br /&gt;
====DOACROSS====&lt;br /&gt;
&lt;br /&gt;
When it comes to implementing DOACROSS with Pthreads, the library can express the parallelism easily enough, but it makes the code unnecessarily complicated. See the example below, from '''POSIX Threads Programming''' by Blaise Barney:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;math.h&amp;gt;     /* for sin() and tan() */&lt;br /&gt;
 #define NUM_THREADS 4&lt;br /&gt;
 &lt;br /&gt;
 void *BusyWork(void *t)&lt;br /&gt;
 {&lt;br /&gt;
   int i;&lt;br /&gt;
   long tid;&lt;br /&gt;
   double result=0.0;&lt;br /&gt;
   tid = (long)t;&lt;br /&gt;
   printf(&amp;quot;Thread %ld starting...\n&amp;quot;,tid);&lt;br /&gt;
   for (i=0; i&amp;lt;1000000; i++)&lt;br /&gt;
   {&lt;br /&gt;
      result = result + sin(i) * tan(i);&lt;br /&gt;
   }&lt;br /&gt;
   printf(&amp;quot;Thread %ld done. Result = %e\n&amp;quot;,tid, result);&lt;br /&gt;
   pthread_exit((void*) t);&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main (int argc, char *argv[])&lt;br /&gt;
 {&lt;br /&gt;
   pthread_t thread[NUM_THREADS];&lt;br /&gt;
   pthread_attr_t attr;&lt;br /&gt;
   int rc;&lt;br /&gt;
   long t;&lt;br /&gt;
   void *status;&lt;br /&gt;
 &lt;br /&gt;
   /* Initialize and set thread detached attribute */&lt;br /&gt;
   pthread_attr_init(&amp;amp;attr);&lt;br /&gt;
   pthread_attr_setdetachstate(&amp;amp;attr, PTHREAD_CREATE_JOINABLE);&lt;br /&gt;
 &lt;br /&gt;
   for(t=0; t&amp;lt;NUM_THREADS; t++) {&lt;br /&gt;
      printf(&amp;quot;Main: creating thread %ld\n&amp;quot;, t);&lt;br /&gt;
      rc = pthread_create(&amp;amp;thread[t], &amp;amp;attr, BusyWork, (void *)t); &lt;br /&gt;
      if (rc) {&lt;br /&gt;
          printf(&amp;quot;ERROR; return code from pthread_create() is %d\n&amp;quot;, rc);&lt;br /&gt;
         exit(-1);&lt;br /&gt;
         }&lt;br /&gt;
      }&lt;br /&gt;
 &lt;br /&gt;
   /* Free attribute and wait for the other threads */&lt;br /&gt;
   pthread_attr_destroy(&amp;amp;attr);&lt;br /&gt;
   for(t=0; t&amp;lt;NUM_THREADS; t++) {&lt;br /&gt;
      rc = pthread_join(thread[t], &amp;amp;status);&lt;br /&gt;
      if (rc) {&lt;br /&gt;
          printf(&amp;quot;ERROR; return code from pthread_join() is %d\n&amp;quot;, rc);&lt;br /&gt;
         exit(-1);&lt;br /&gt;
         }&lt;br /&gt;
       printf(&amp;quot;Main: completed join with thread %ld having a status of %ld\n&amp;quot;,t,(long)status);&lt;br /&gt;
      }&lt;br /&gt;
 &lt;br /&gt;
 printf(&amp;quot;Main: program completed. Exiting.\n&amp;quot;);&lt;br /&gt;
 pthread_exit(NULL);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
This example demonstrates how to &amp;quot;wait&amp;quot; for thread completions by using the Pthread join routine. Since some implementations of Pthreads may not create threads in a joinable state, the threads in this example are explicitly created in a joinable state so that they can be joined later.&lt;br /&gt;
&lt;br /&gt;
====DOPIPE====&lt;br /&gt;
There are examples of using POSIX Threads to implement DOPIPE parallelism, but they are unnecessarily complex, so given their length we do not reproduce one here. Interested readers can find one at [http://homepage.mac.com/dbutenhof/Threads/code/pipe.c Pthreads DOPIPE example].&lt;br /&gt;
&lt;br /&gt;
===Comparison among the three===&lt;br /&gt;
&lt;br /&gt;
====A unified example====&lt;br /&gt;
&lt;br /&gt;
We use a simple parallel example from [http://sourceforge.net Sourceforge.net] to show how it is implemented in the three packages (POSIX Threads, Intel TBB, and OpenMP) and to highlight some commonalities and differences among them.&lt;br /&gt;
&lt;br /&gt;
Following is the original code:&lt;br /&gt;
&lt;br /&gt;
 Grid1 *g = new Grid1(0, n+1);&lt;br /&gt;
 Grid1IteratorSub it(1, n, g);&lt;br /&gt;
 DistArray x(g), y(g);&lt;br /&gt;
 ...&lt;br /&gt;
 float e = 0;&lt;br /&gt;
 ForEach(int i, it,&lt;br /&gt;
    x(i) += ( y(i+1) + y(i-1) )*.5;&lt;br /&gt;
    e += sqr( y(i) ); )&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
Then we are going to show the implementations in different packages, and also make a brief summary of the three packages.&lt;br /&gt;
&lt;br /&gt;
=====In POSIX Thread=====&lt;br /&gt;
&lt;br /&gt;
POSIX Threads: symmetric multiprocessing, e.g. SMP multi-processor computers, multi-core processors, virtual shared memory computers.&lt;br /&gt;
&lt;br /&gt;
Data layout: A single global memory. Each thread reads global shared data and writes to a private fraction of global data.&lt;br /&gt;
&lt;br /&gt;
A simplified translation of the example parallel-for loop is given below.&lt;br /&gt;
&lt;br /&gt;
Global declaration:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 float *x, *y;&lt;br /&gt;
 float vec[8];&lt;br /&gt;
 int nn, pp;&lt;br /&gt;
&lt;br /&gt;
thread code:&lt;br /&gt;
&lt;br /&gt;
 void *sub1(void *arg) {&lt;br /&gt;
    int p = (int)(long)arg;&lt;br /&gt;
    float e_local = 0;&lt;br /&gt;
    for (int i=1+(nn*p)/pp; i&amp;lt;1+(nn*(p+1))/pp; ++i) {&lt;br /&gt;
      x[i] += ( y[i+1] + y[i-1] )*.5;&lt;br /&gt;
      e_local += y[i] * y[i];&lt;br /&gt;
    }&lt;br /&gt;
    vec[p] = e_local;&lt;br /&gt;
    return (void*) 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
main code:&lt;br /&gt;
&lt;br /&gt;
 x = new float[n+1];&lt;br /&gt;
 y = new float[n+1];&lt;br /&gt;
 ...&lt;br /&gt;
 float e = 0;&lt;br /&gt;
 int p_threads = 8;&lt;br /&gt;
 nn = n-1;&lt;br /&gt;
 pp = p_threads;&lt;br /&gt;
 pthread_t threads[8];&lt;br /&gt;
 pthread_attr_t attr;&lt;br /&gt;
 pthread_attr_init(&amp;amp;attr);&lt;br /&gt;
 for (int p=0; p&amp;lt;p_threads; ++p)&lt;br /&gt;
    pthread_create(&amp;amp;threads[p], &amp;amp;attr,&lt;br /&gt;
      sub1, (void *)(long)p);&lt;br /&gt;
 for (int p=0; p&amp;lt;p_threads; ++p) {&lt;br /&gt;
    pthread_join(threads[p], NULL);&lt;br /&gt;
    e += vec[p];&lt;br /&gt;
 }&lt;br /&gt;
 ...&lt;br /&gt;
 delete[] x;&lt;br /&gt;
 delete[] y;&lt;br /&gt;
&lt;br /&gt;
=====In Intel Threading Building Blocks=====&lt;br /&gt;
&lt;br /&gt;
Intel TBB: a C++ library for thread programming, targeting e.g. SMP multiprocessor computers, multi-core processors, and virtual shared-memory computers.&lt;br /&gt;
&lt;br /&gt;
Data layout: A single global memory. Each thread reads global shared data and writes to a private fraction of global data.&lt;br /&gt;
&lt;br /&gt;
Translation of the example parallel-for loop is given below.&lt;br /&gt;
&lt;br /&gt;
Global:&lt;br /&gt;
 #include &amp;quot;tbb/task_scheduler_init.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/blocked_range.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/parallel_reduce.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/cache_aligned_allocator.h&amp;quot;&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
&lt;br /&gt;
thread code:&lt;br /&gt;
 struct sub1 {&lt;br /&gt;
    float ee;&lt;br /&gt;
    float *x, *y;&lt;br /&gt;
    sub1(float *xx, float *yy) : ee(0), x(xx), y(yy) {}&lt;br /&gt;
    sub1(sub1&amp;amp; s, split) { ee = 0; x = s.x; y = s.y; }&lt;br /&gt;
    void operator() (const blocked_range&amp;lt;int&amp;gt; &amp;amp; r){&lt;br /&gt;
      float e = ee;&lt;br /&gt;
      for (int i = r.begin(); i!= r.end(); ++i) {&lt;br /&gt;
        x[i] += ( y[i+1] + y[i-1] )*.5;&lt;br /&gt;
        e += y[i] * y[i];&lt;br /&gt;
      }&lt;br /&gt;
      ee = e;&lt;br /&gt;
    }&lt;br /&gt;
    void join(sub1&amp;amp; s) { ee += s.ee; }&lt;br /&gt;
 };&lt;br /&gt;
&lt;br /&gt;
main code:&lt;br /&gt;
 task_scheduler_init init;&lt;br /&gt;
 ...&lt;br /&gt;
 float e;&lt;br /&gt;
 float *x = cache_aligned_allocator&amp;lt;float&amp;gt;().allocate(n+1);&lt;br /&gt;
 float *y = cache_aligned_allocator&amp;lt;float&amp;gt;().allocate(n+1);&lt;br /&gt;
 ...&lt;br /&gt;
 sub1 s(x, y);&lt;br /&gt;
 parallel_reduce(blocked_range&amp;lt;int&amp;gt;(1, n, 1000), s);&lt;br /&gt;
 e = s.ee;&lt;br /&gt;
 ...&lt;br /&gt;
 cache_aligned_allocator&amp;lt;float&amp;gt;().deallocate(x, n+1);&lt;br /&gt;
 cache_aligned_allocator&amp;lt;float&amp;gt;().deallocate(y, n+1);&lt;br /&gt;
&lt;br /&gt;
=====In OpenMP shared memory parallel code annotations=====&lt;br /&gt;
&lt;br /&gt;
OpenMP: usually automatic parallelization with a run-time system based on a thread library.&lt;br /&gt;
&lt;br /&gt;
A simplified translation of the example parallel-for loop is given below.&lt;br /&gt;
&lt;br /&gt;
Global:&lt;br /&gt;
 float e;&lt;br /&gt;
&lt;br /&gt;
main code:&lt;br /&gt;
 float *x = new float[n+1];&lt;br /&gt;
 float *y = new float[n+1];&lt;br /&gt;
 ...&lt;br /&gt;
 e = 0;&lt;br /&gt;
 #pragma omp parallel for reduction(+:e)&lt;br /&gt;
 for (int i=1; i&amp;lt;n; ++i) {&lt;br /&gt;
    x[i] += ( y[i+1] + y[i-1] )*.5;&lt;br /&gt;
    e += y[i] * y[i];&lt;br /&gt;
 }&lt;br /&gt;
 ...&lt;br /&gt;
 delete[] x;&lt;br /&gt;
 delete[] y;&lt;br /&gt;
&lt;br /&gt;
====Summary: Difference among them====&lt;br /&gt;
&lt;br /&gt;
*Pthreads supports all of these forms of parallelism and can express functional parallelism easily, but the programmer must build specialized synchronization primitives and explicitly privatize variables, so more effort is needed to convert a serial program into a parallel one. &lt;br /&gt;
&lt;br /&gt;
*OpenMP provides many performance-enhancing features, such as the atomic, barrier, and flush synchronization primitives. It is very simple to use OpenMP to exploit DOALL parallelism, but the syntax for expressing functional parallelism is awkward. &lt;br /&gt;
&lt;br /&gt;
*Intel TBB relies on generic programming and performs better with custom iteration spaces or complex reduction operations. It also provides generic parallel patterns for parallel while-loops, data-flow pipeline models, parallel sorts, and prefixes, so it is a better fit for cases that go beyond loop-based parallelism.&lt;br /&gt;
&lt;br /&gt;
Below is a table that illustrates the differences [[#References|&amp;lt;sup&amp;gt;[16]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
{| align=&amp;quot;center&amp;quot; cellpadding=&amp;quot;4&amp;quot;&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
!Type of Parallelism&lt;br /&gt;
!Posix Threads&lt;br /&gt;
!Intel&amp;amp;reg; TBB&lt;br /&gt;
!OpenMP 2.0&lt;br /&gt;
!OpenMP 3.0&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
!DOALL&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
!DOACROSS&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
!DOPIPE&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
! Reduction&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
! Functional Parallelism&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==Synchronization Mechanisms==&lt;br /&gt;
&lt;br /&gt;
===Overview===&lt;br /&gt;
&lt;br /&gt;
In order to accomplish the above parallelizations in a real system, memory accesses must be carefully orchestrated so that no information gets corrupted.  Every architecture handles synchronizing data among parallel processors slightly differently.  This section looks at several architectures and highlights a few of the mechanisms each uses to achieve this memory synchronization.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===IA-64===&lt;br /&gt;
IA-64 is an Intel architecture that is mainly used in Itanium processors.&lt;br /&gt;
====Spinlock====&lt;br /&gt;
The spinlock is used to guard against multiple simultaneous accesses to the critical section.  The critical section is a section of code that must be executed sequentially; it cannot be parallelized.  Therefore, when a process comes across an occupied critical section, it will “spin” until the lock is released. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // The lock variable contains 0 if the lock is&lt;br /&gt;
  // available. If it is 1, another process is in the critical section.&lt;br /&gt;
  //&lt;br /&gt;
  &lt;br /&gt;
  spin_lock:&lt;br /&gt;
    mov	ar.ccv = 0			// cmpxchg looks for avail (0)&lt;br /&gt;
    mov	r2 = 1				// cmpxchg sets to held (1)&lt;br /&gt;
  &lt;br /&gt;
  spin: &lt;br /&gt;
    ld8	r1 = [lock] ;;			// get lock in shared state&lt;br /&gt;
    cmp.ne	p1, p0 = r1, r0		// is lock held (ie, lock != 0)?&lt;br /&gt;
  &lt;br /&gt;
  (p1)	br.cond.spnt	spin ;;		// yes, continue spinning&lt;br /&gt;
    cmpxchg8.acq	r1 = [lock], r2 ;;	// attempt to grab lock&lt;br /&gt;
    cmp.ne p1, p0 = r1, r0		// did another process grab it first?&lt;br /&gt;
  &lt;br /&gt;
  (p1)	br.cond.spnt	spin ;;		// bummer, continue spinning&lt;br /&gt;
  &lt;br /&gt;
  cs_begin:&lt;br /&gt;
    // critical section code goes here...&lt;br /&gt;
  cs_end:&lt;br /&gt;
  &lt;br /&gt;
    st8.rel(lock) = r0 ;;		//release the lock&lt;br /&gt;
&lt;br /&gt;
The above code demonstrates how a spinlock is used.  When a process reaches the spinlock, it checks whether the lock is available; if it is not, the process enters the spin loop, where it continuously re-checks the lock.  Once it sees that the lock is available, it attempts to obtain it.  If another process obtains the lock first, the process branches back into the spin loop and continues to wait.&lt;br /&gt;
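&lt;br /&gt;
The same acquire protocol can be sketched portably with C11 atomics and POSIX threads.  This is our own illustrative analogue of the IA-64 listing, not code from the manual; names such as ''spin_acquire'' are hypothetical.&lt;br /&gt;

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int lock_word;   /* 0 = available, 1 = held */
static long spin_counter;      /* protected by the spinlock */

/* Spin until the lock is observed free, then try to grab it atomically
 * (the analogue of the ld8 test followed by the cmpxchg8.acq above). */
static void spin_acquire(void) {
    for (;;) {
        while (atomic_load(&lock_word) != 0)
            ;                                /* spin while held */
        int expected = 0;
        if (atomic_compare_exchange_strong(&lock_word, &expected, 1))
            return;                          /* grabbed the lock */
        /* another thread grabbed it first: keep spinning */
    }
}

static void spin_release(void) {
    atomic_store(&lock_word, 0);             /* analogue of st8.rel */
}

static void *spin_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; ++i) {
        spin_acquire();
        spin_counter++;                      /* critical section */
        spin_release();
    }
    return NULL;
}

/* Runs 4 threads of 1000 locked increments; returns the final count. */
long run_spin_demo(void) {
    pthread_t t[4];
    spin_counter = 0;
    atomic_store(&lock_word, 0);
    for (int i = 0; i < 4; ++i)
        pthread_create(&t[i], NULL, spin_worker, NULL);
    for (int i = 0; i < 4; ++i)
        pthread_join(t[i], NULL);
    return spin_counter;
}
```

With the lock in place, every increment survives, so the final count equals the total number of acquisitions.&lt;br /&gt;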
&lt;br /&gt;
====Barrier====&lt;br /&gt;
&lt;br /&gt;
A barrier is a common mechanism used to hold up processes until all processes reach the same point.  It is useful in all kinds of parallelism (DOALL, DOACROSS, DOPIPE, reduction, and functional parallelism).  This architecture uses a unique form of the barrier mechanism called the sense-reversing barrier, whose purpose is to prevent race conditions.  If a fast process raced ahead to the “next” instance of the barrier while slow processes were still leaving the current one, the fast process could trap the slow processes at the “next” barrier and thus corrupt the memory synchronization. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
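&lt;br /&gt;
The sense-reversing idea can be sketched in C11 atomics: each thread flips a private sense flag on arrival, and the last arriver resets the count and publishes the new global sense.  The code below is our own illustration (names such as ''barrier_wait'' are hypothetical), not the IA-64 implementation.&lt;br /&gt;

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4

static atomic_int count;
static atomic_int global_sense;
static int phase_done[2][NTHREADS];   /* per-phase "work results" */

/* Sense-reversing barrier: each thread flips its private sense on
 * arrival; the last arriver resets the count and publishes the new
 * global sense, releasing everyone.  Threads arriving at the next
 * barrier instance wait on the opposite sense, so a fast thread can
 * never trap slow threads that are still leaving this instance. */
static void barrier_wait(int *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_sub(&count, 1) == 1) {     /* last to arrive */
        atomic_store(&count, NTHREADS);
        atomic_store(&global_sense, *local_sense);
    } else {
        while (atomic_load(&global_sense) != *local_sense)
            ;                                   /* spin on current sense */
    }
}

static void *barrier_worker(void *arg) {
    int id = (int)(long)arg, sense = 0;
    for (int phase = 0; phase < 2; ++phase) {
        phase_done[phase][id] = 1;              /* "do" this phase's work */
        barrier_wait(&sense);
        for (int j = 0; j < NTHREADS; ++j)      /* everyone must be done */
            if (!phase_done[phase][j])
                return (void *)1;               /* barrier failed */
    }
    return NULL;
}

/* Returns 1 if no thread ever observed another thread's phase missing. */
int run_barrier_demo(void) {
    pthread_t t[NTHREADS];
    void *r, *bad = NULL;
    atomic_store(&count, NTHREADS);
    atomic_store(&global_sense, 0);
    for (long i = 0; i < NTHREADS; ++i)
        pthread_create(&t[i], NULL, barrier_worker, (void *)i);
    for (int i = 0; i < NTHREADS; ++i) {
        pthread_join(t[i], &r);
        if (r) bad = r;
    }
    return bad == NULL;
}
```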
&lt;br /&gt;
====Dekker’s Algorithm====&lt;br /&gt;
&lt;br /&gt;
Dekker’s Algorithm uses variables to indicate which processors are using which resources; it essentially arbitrates for a resource using these variables.  Every processor has a flag that indicates when it is in the critical section.  When a processor is getting ready to enter the critical section, it sets its flag to one, then checks that all of the other processors’ flags are zero before proceeding into the section.  This behavior is demonstrated in the code below for a two-way multiprocessor system, so there are two processor flags, flag_me and flag_you. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // The flag_me variable is zero if we are not in the synchronization and &lt;br /&gt;
  // critical section code and non-zero otherwise; flag_you is similarly set&lt;br /&gt;
  // for the other processor.  This algorithm does not retry access to the &lt;br /&gt;
  // resource if there is contention.&lt;br /&gt;
  &lt;br /&gt;
  dekker:&lt;br /&gt;
    mov		r1 = 1 ;;		// my_flag = 1 (i want access)&lt;br /&gt;
    st8  	[flag_me] = r1&lt;br /&gt;
    mf ;;				// make st visible first&lt;br /&gt;
    ld8 	r2 = [flag_you] ;;		// is other's flag non-zero?&lt;br /&gt;
    cmp.ne p1, p0 = 0, r2&lt;br /&gt;
  &lt;br /&gt;
  (p1) &lt;br /&gt;
    br.cond.spnt cs_skip ;;		// if so, resource in use &lt;br /&gt;
  &lt;br /&gt;
  cs_begin:&lt;br /&gt;
    // critical section code goes here...&lt;br /&gt;
  cs_end:&lt;br /&gt;
  &lt;br /&gt;
  cs_skip:&lt;br /&gt;
    st8.rel[flag_me] = r0 ;;		// release lock&lt;br /&gt;
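&lt;br /&gt;
The flag handshake above can be sketched portably with C11 sequentially consistent atomics, which provide the store-then-load ordering that the mf fence supplies on IA-64.  This is our own analogue (names such as ''try_enter'' are hypothetical); like the assembly, it simply skips a contended attempt rather than retrying.&lt;br /&gt;

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int flag[2];     /* flag_me / flag_you from the listing */
static long shared_count;      /* only touched inside the critical section */
static long entered[2];        /* per-thread count of successful entries */

/* Raise our flag, then enter only if the other flag is still down.
 * As in the assembly, a contended attempt is skipped, not retried. */
static long try_enter(int me, int attempts) {
    long mine = 0;
    for (int i = 0; i < attempts; ++i) {
        atomic_store(&flag[me], 1);              /* flag_me = 1 */
        if (atomic_load(&flag[1 - me]) == 0) {   /* is other's flag 0? */
            shared_count++;                      /* critical section */
            mine++;
        }
        atomic_store(&flag[me], 0);              /* release */
    }
    return mine;
}

static void *dekker_worker(void *arg) {
    int me = (int)(long)arg;
    entered[me] = try_enter(me, 20000);
    return NULL;
}

/* Returns 0 when every successful entry incremented shared_count
 * exactly once, i.e. mutual exclusion held. */
long run_dekker_demo(void) {
    pthread_t t[2];
    shared_count = 0;
    for (long i = 0; i < 2; ++i)
        pthread_create(&t[i], NULL, dekker_worker, (void *)i);
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], NULL);
    return shared_count - (entered[0] + entered[1]);
}
```

Like the IA-64 listing, this sketch does not guarantee progress for any particular thread; it only guarantees that two threads are never in the critical section at once.&lt;br /&gt;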
&lt;br /&gt;
====Lamport’s Algorithm====&lt;br /&gt;
&lt;br /&gt;
Lamport’s Algorithm is similar to a spinlock with the addition of a fairness mechanism that keeps track of the order in which processes request the shared resource and grants access in that same order.  It makes use of two variables, x and y, and a shared array, b.  The code below illustrates this algorithm.  [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // The proc_id variable holds a unique, non-zero id for the process that &lt;br /&gt;
  // attempts access to the critical section.  x and y are the synchronization&lt;br /&gt;
  // variables that indicate who is in the critical section and who is attempting&lt;br /&gt;
  // entry. ptr_b_1 and ptr_b_id point at the 1'st and id'th element of b[].&lt;br /&gt;
  //&lt;br /&gt;
  &lt;br /&gt;
  lamport:&lt;br /&gt;
    	ld8		r1 = [proc_id] ;;	// r1 = unique process id&lt;br /&gt;
  start:&lt;br /&gt;
    	st8	[ptr_b_id] = r1		// b[id] = &amp;quot;true&amp;quot;&lt;br /&gt;
    	st8	[x] = r1			// x = process id&lt;br /&gt;
   	mf					// MUST fence here!&lt;br /&gt;
    	ld8	r2 = [y] ;;&lt;br /&gt;
    	cmp.ne p1, p0 = 0, r2;;		// if (y !=0) then...&lt;br /&gt;
  (p1)	st8	[ptr_b_id] = r0		// ... b[id] = &amp;quot;false&amp;quot;&lt;br /&gt;
  (p1)	br.cond.sptk	wait_y		// ... wait until y == 0&lt;br /&gt;
  &lt;br /&gt;
  	st8	[y] = r1		// y = process id&lt;br /&gt;
  	mf&lt;br /&gt;
  	ld8 	r3 = [x] ;;		&lt;br /&gt;
  	cmp.eq p1, p0 = r1, r3 ;;	// if (x == id) then..&lt;br /&gt;
  (p1)	br.cond.sptk cs_begin		// ... enter critical section&lt;br /&gt;
  &lt;br /&gt;
  	st8 	[ptr_b_id] = r0		// b[id] = &amp;quot;false&amp;quot;&lt;br /&gt;
  	ld8	r3 = [ptr_b_1]		// r3 = &amp;amp;b[1]&lt;br /&gt;
  	mov	ar.lc = N-1 ;;		// lc = number of processors - 1&lt;br /&gt;
  wait_b:&lt;br /&gt;
  	ld8	r2 = [r3] ;;		&lt;br /&gt;
  	cmp.ne p1, p0 = r1, r2		// if (b[j] != 0) then...&lt;br /&gt;
  (p1)	br.cond.spnt	wait_b ;;	// ... wait until b[j] == 0&lt;br /&gt;
  	add	r3 = 8, r3		// r3 = &amp;amp;b[j+1]&lt;br /&gt;
  	br.cloop.sptk	wait_b ;;	// loop over b[j] for each j&lt;br /&gt;
  &lt;br /&gt;
  	ld8	r2 = [y] ;;		// if (y != id) then...&lt;br /&gt;
  	cmp.ne p1, p2 = 0, r2&lt;br /&gt;
  (p1)  br.cond.spnt 	wait_y&lt;br /&gt;
  	br	start			// back to start to try again&lt;br /&gt;
  &lt;br /&gt;
  cs_begin:&lt;br /&gt;
    // critical section code goes here...&lt;br /&gt;
  cs_end:&lt;br /&gt;
  &lt;br /&gt;
  	st8	[y] = r0		// release the lock&lt;br /&gt;
  	st8.rel[ptr_b_id] = r0 ;;	// b[id] = &amp;quot;false&amp;quot;&lt;br /&gt;
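&lt;br /&gt;
For reference, the same x/y/b[] protocol (Lamport's fast mutual-exclusion algorithm) can be sketched with C11 atomics; sequentially consistent operations stand in for the explicit mf fences.  The code is our own illustration with hypothetical names such as ''lamport_lock''.&lt;br /&gt;

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define NPROCS 4

/* Process ids are 1..NPROCS (0 means nobody), matching the requirement
 * above that proc_id be non-zero. */
static atomic_int x, y;
static atomic_int b[NPROCS + 1];
static long lamport_count;                       /* protected by the lock */

static void lamport_lock(int id) {
    for (;;) {
        atomic_store(&b[id], 1);                 /* b[id] = "true" */
        atomic_store(&x, id);                    /* x = process id */
        if (atomic_load(&y) != 0) {              /* doorway occupied */
            atomic_store(&b[id], 0);             /* b[id] = "false" */
            while (atomic_load(&y) != 0)         /* wait until y == 0 */
                sched_yield();
            continue;                            /* back to start */
        }
        atomic_store(&y, id);                    /* y = process id */
        if (atomic_load(&x) == id)               /* fast path: enter */
            return;
        atomic_store(&b[id], 0);                 /* slow path */
        for (int j = 1; j <= NPROCS; ++j)        /* wait for all b[j] == 0 */
            while (atomic_load(&b[j]) != 0)
                sched_yield();
        if (atomic_load(&y) == id)               /* we own the doorway */
            return;
        while (atomic_load(&y) != 0)             /* wait until y == 0 */
            sched_yield();
        /* back to start to try again */
    }
}

static void lamport_unlock(int id) {
    atomic_store(&y, 0);                         /* release the lock */
    atomic_store(&b[id], 0);                     /* b[id] = "false" */
}

static void *lamport_worker(void *arg) {
    int id = (int)(long)arg;
    for (int i = 0; i < 2000; ++i) {
        lamport_lock(id);
        lamport_count++;                         /* critical section */
        lamport_unlock(id);
    }
    return NULL;
}

/* Runs NPROCS threads of 2000 locked increments each. */
long run_lamport_demo(void) {
    pthread_t t[NPROCS];
    lamport_count = 0;
    for (long i = 0; i < NPROCS; ++i)
        pthread_create(&t[i], NULL, lamport_worker, (void *)(i + 1));
    for (int i = 0; i < NPROCS; ++i)
        pthread_join(t[i], NULL);
    return lamport_count;
}
```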
===IA-32=== &lt;br /&gt;
&lt;br /&gt;
IA-32 is an Intel architecture that is also known as x86.  This is a very widely used architecture.&lt;br /&gt;
&lt;br /&gt;
====Locked Atomic Operation====&lt;br /&gt;
This is the main mechanism this architecture uses to manage shared data structures such as semaphores and system segments.  The processor uses the following three interdependent mechanisms to implement locked atomic operations: [[#References|&amp;lt;sup&amp;gt;[13]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*  Guaranteed atomic operations.&lt;br /&gt;
*  Bus locking, using the LOCK# signal and the LOCK instruction prefix.&lt;br /&gt;
*  Cache coherency protocols that ensure that atomic operations can be carried out on cached data structures (cache lock). This mechanism is present in the P6 family processors.&lt;br /&gt;
&lt;br /&gt;
=====Guaranteed Atomic Operation=====&lt;br /&gt;
The following operations are guaranteed to be carried out atomically: [[#References|&amp;lt;sup&amp;gt;[13]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*  Reading or writing a byte.&lt;br /&gt;
*  Reading or writing a word aligned on a 16-bit boundary.&lt;br /&gt;
*  Reading or writing a doubleword aligned on a 32-bit boundary.&lt;br /&gt;
The P6 family processors guarantee that the following additional memory operations will always be carried out atomically:&lt;br /&gt;
*  Reading or writing a quadword aligned on a 64-bit boundary. (This operation is also guaranteed on the Pentium® processor.)&lt;br /&gt;
*  16-bit accesses to uncached memory locations that fit within a 32-bit data bus.&lt;br /&gt;
*  16-, 32-, and 64-bit accesses to cached memory that fit within a 32-Byte cache line.&lt;br /&gt;
&lt;br /&gt;
=====Bus Locking=====&lt;br /&gt;
The LOCK# signal is asserted automatically during certain critical memory operations in order to lock the system bus and grant control to the processor executing them.  While the signal is asserted, requests from other processors for control of the bus are blocked.&lt;br /&gt;
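&lt;br /&gt;
The effect of a locked atomic operation can be seen from software with a C11 atomic read-modify-write: on IA-32 the compiler emits a LOCK-prefixed instruction (or the processor uses a cache lock), so concurrent increments are never lost.  The demo below is our own sketch with hypothetical function names.&lt;br /&gt;

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* With an atomic read-modify-write, the increment is one indivisible
 * bus (or cache-lock) transaction, so no updates are lost even with
 * several writers hammering the same location. */
static atomic_long ticket;

static void *locked_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; ++i)
        atomic_fetch_add(&ticket, 1);   /* a LOCK-prefixed add on IA-32 */
    return NULL;
}

/* Runs 4 threads of 100000 atomic increments; returns the total. */
long run_locked_demo(void) {
    pthread_t t[4];
    atomic_store(&ticket, 0);
    for (int i = 0; i < 4; ++i)
        pthread_create(&t[i], NULL, locked_worker, NULL);
    for (int i = 0; i < 4; ++i)
        pthread_join(t[i], NULL);
    return atomic_load(&ticket);
}
```

A plain non-atomic increment in the same loop would lose updates, because the read, add, and write could interleave between processors.&lt;br /&gt;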
&lt;br /&gt;
===Linux Kernel===&lt;br /&gt;
&lt;br /&gt;
The Linux kernel is referred to here as an “architecture”, although it is fairly unconventional in that it is an open-source operating system with full access to the hardware. It uses many common synchronization mechanisms, so it is considered here. [[#References|&amp;lt;sup&amp;gt;[15]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====Busy-waiting lock====&lt;br /&gt;
&lt;br /&gt;
=====Spinlocks=====&lt;br /&gt;
&lt;br /&gt;
This mechanism is very similar to the mechanism described in the IA-64 architecture.  It is a mechanism used to manage access to a critical section of code.  If a process tries to access the critical section and is rejected it will sit and “spin” while it waits for the lock to be released.&lt;br /&gt;
&lt;br /&gt;
=====Rwlocks=====&lt;br /&gt;
&lt;br /&gt;
This is a special kind of spinlock intended for protecting structures that are frequently read but rarely written.  It allows multiple reads in parallel, which can increase efficiency because processes do not have to sit and wait merely to carry out a read.  As before, however, only one write is allowed at a time, with no reads done in parallel.&lt;br /&gt;
&lt;br /&gt;
=====Brlocks=====&lt;br /&gt;
&lt;br /&gt;
This is a super fast read/write lock, but it has a write-side penalty.  The main advantage of this lock is to prevent cache “ping-pong” in a multiple read case.&lt;br /&gt;
&lt;br /&gt;
====Sleeper locks====&lt;br /&gt;
&lt;br /&gt;
=====Semaphores=====&lt;br /&gt;
&lt;br /&gt;
A semaphore is a special variable that acts much like a lock.  If the semaphore can be acquired, the process proceeds into the critical section.  If the semaphore cannot be acquired, the process is “put to sleep” and the processor is used for another process; the sleeping process’s state is saved off where it can be retrieved later.  Once the semaphore becomes available, the “sleeping” process is woken up, obtains the semaphore, and proceeds into the critical section. &lt;br /&gt;
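&lt;br /&gt;
A POSIX semaphore can be used this way as a sleeping lock: ''sem_wait'' blocks the caller until the semaphore is available, and ''sem_post'' wakes a sleeper.  The demo below is our own sketch with hypothetical names.&lt;br /&gt;

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t sem;
static long sem_count;     /* protected by the semaphore */

static void *sem_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; ++i) {
        sem_wait(&sem);    /* sleep until the semaphore is free */
        sem_count++;       /* critical section */
        sem_post(&sem);    /* wake one waiter, if any */
    }
    return NULL;
}

/* Runs 4 threads of 10000 locked increments; returns the final count. */
long run_sem_demo(void) {
    pthread_t t[4];
    sem_init(&sem, 0, 1);  /* binary semaphore: one holder at a time */
    sem_count = 0;
    for (int i = 0; i < 4; ++i)
        pthread_create(&t[i], NULL, sem_worker, NULL);
    for (int i = 0; i < 4; ++i)
        pthread_join(t[i], NULL);
    sem_destroy(&sem);
    return sem_count;
}
```

Unlike a spinlock, a blocked thread here consumes no CPU cycles while it waits; the trade-off is the cost of putting it to sleep and waking it up.&lt;br /&gt;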
&lt;br /&gt;
&lt;br /&gt;
===CUDA=== &lt;br /&gt;
&lt;br /&gt;
CUDA, or Compute Unified Device Architecture, is an Nvidia architecture which is the computing engine for their graphics processors.&lt;br /&gt;
&lt;br /&gt;
====__syncthreads====&lt;br /&gt;
&lt;br /&gt;
The __syncthreads operation can be used at the end of a parallel section as a sort of “barrier” mechanism.  It is necessary to ensure the correctness of the memory contents.  In the following example, there are two calls to __syncthreads; both are necessary to ensure the expected results are obtained.  Without them, myArray[tid] could end up being either 2 or the original value of myArray[], depending on when the read and the write take place.[[#References|&amp;lt;sup&amp;gt;[14]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // myArray is an array of integers located in global or shared&lt;br /&gt;
  // memory&lt;br /&gt;
  __global__ void MyKernel(int* result) {&lt;br /&gt;
     int tid = threadIdx.x;&lt;br /&gt;
    ...    &lt;br /&gt;
     int ref1 = myArray[tid];&lt;br /&gt;
      __syncthreads();&lt;br /&gt;
    myArray[tid + 1] = 2;&lt;br /&gt;
      __syncthreads();&lt;br /&gt;
    int ref2 = myArray[tid];&lt;br /&gt;
    result[tid] = ref1 * ref2;&lt;br /&gt;
    ...    &lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
===PowerPC===&lt;br /&gt;
&lt;br /&gt;
PowerPC is an IBM architecture that stands for Performance Optimization With Enhanced RISC-Performance Computing.  It is a RISC architecture that was originally designed for PCs, however it has grown into the embedded and high-performance space. [[#References|&amp;lt;sup&amp;gt;[18]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====Isync==== &lt;br /&gt;
&lt;br /&gt;
isync is an instruction that guarantees that no code following it executes until all of the code preceding it has completed.  It also ensures that any cache block invalidation instructions executed before the isync have been carried out with respect to the processor executing the isync instruction.  It then causes any prefetched instructions to be discarded. [[#References|&amp;lt;sup&amp;gt;[17]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====Memory Barrier Instructions====&lt;br /&gt;
&lt;br /&gt;
Memory Barrier Instructions can be used to control the order in which storage accesses are performed. [[#References|&amp;lt;sup&amp;gt;[17]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
=====HeavyWeight sync=====&lt;br /&gt;
This memory barrier creates an ordering function for the storage accesses that are associated with all of the instructions that are executed by the processor executing the sync instruction.&lt;br /&gt;
&lt;br /&gt;
=====LightWeight sync=====&lt;br /&gt;
This memory barrier creates an ordering function for the storage accesses caused by LOAD and STORE instructions executed by the processor executing the sync instruction.  The accesses it orders must be to storage that is neither Write Through Required nor Caching Inhibited.&lt;br /&gt;
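&lt;br /&gt;
The load/store ordering that a lightweight sync provides maps onto C11 release and acquire fences.  The sketch below is our own (names hypothetical) and publishes a data word through a flag: the release fence keeps the data store before the flag store, and the acquire fence keeps the data load after the flag load.&lt;br /&gt;

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;              /* ordinary data */
static atomic_int ready;         /* flag guarding the data */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;                                   /* store the data */
    atomic_thread_fence(memory_order_release);      /* barrier before flag */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
    return NULL;
}

static void *consumer(void *arg) {
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;                                           /* wait for the flag */
    atomic_thread_fence(memory_order_acquire);      /* barrier after flag */
    *(int *)arg = payload;                          /* data is now visible */
    return NULL;
}

/* Returns the value the consumer observed after the handshake. */
int run_fence_demo(void) {
    pthread_t p, c;
    int seen = 0;
    payload = 0;
    atomic_store(&ready, 0);
    pthread_create(&c, NULL, consumer, &seen);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

Without the two fences, a weakly ordered machine such as PowerPC could let the flag become visible before the data, and the consumer could read a stale payload.&lt;br /&gt;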
&lt;br /&gt;
=====Enforce In-order Execution of I/O=====&lt;br /&gt;
The Enforce In-order Execution of I/O, or eieio, instruction is a memory barrier that creates an ordering function for the storage accesses caused by LOADs and STOREs.  These instructions are split into two groups: [[#References|&amp;lt;sup&amp;gt;[17]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
1. Loads and stores to storage that is both Caching Inhibited and Guarded, and stores to main storage caused by stores to storage that is Write Through Required&lt;br /&gt;
&lt;br /&gt;
2. Stores to storage that is Memory Coherence Required and is neither Write Through Required nor Caching Inhibited&lt;br /&gt;
&lt;br /&gt;
For the first group the ordering done by the memory barrier for accesses in this set is not cumulative.  For the second group the ordering done by the memory barrier for accesses in this set is cumulative.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Cell Broadband Engine===&lt;br /&gt;
Cell Broadband Engine Architecture, also referred to as Cell or Cell BE, is an IBM architecture whose first major application was in Sony’s PlayStation 3.  Cell has streamlined coprocessing elements which is great for fast multimedia and vector processing applications. [[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This architecture is interesting because it uses a shared memory model in which the LOADs and STOREs follow a “weakly consistent” storage model.  This means that the following orders may differ from one another: [[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
* The order of any processor element (PPE or SPE) performing storage access&lt;br /&gt;
* The order in which the accesses are performed with respect to another processor element&lt;br /&gt;
* The order in which the accesses are performed in main storage&lt;br /&gt;
&lt;br /&gt;
It is important that the accesses to the shared memory happen in the correct program order or information could be lost or corrupted.  In order to ensure that this doesn’t happen the following memory barrier instructions are used:&lt;br /&gt;
&lt;br /&gt;
====Fence====&lt;br /&gt;
A command issued with a fence is not executed until all previously issued commands within the same “tag group” have been performed.  A command issued after the fence command, however, may be executed before it. [[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====Barrier====&lt;br /&gt;
A command issued with a barrier, and all commands issued after it, are not executed until all previously issued commands have been performed. [[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;ol&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://openmp.org/wp/about-openmp/ OpenMP.org]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[https://docs.google.com/viewer?a=v&amp;amp;pid=gmail&amp;amp;attid=0.1&amp;amp;thid=126f8a391c11262c&amp;amp;mt=application%2Fpdf&amp;amp;url=https%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3D2%26ik%3Dd38b56c94f%26view%3Datt%26th%3D126f8a391c11262c%26attid%3D0.1%26disp%3Dattd%26realattid%3Df_g602ojwk0%26zw&amp;amp;sig=AHIEtbTeQDhK98IswmnVSfrPBMfmPLH5Nw An Optimal Abstraction Model for Hardware Multithreading in Modern Processor Architectures]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.threadingbuildingblocks.org/uploads/81/91/Latest%20Open%20Source%20Documentation/Reference.pdf Intel Threading Building Blocks 2.2 for Open Source Reference Manual]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.csc.ncsu.edu/faculty/efg/506/s10/ NCSU CSC 506 Parallel Computing Systems]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://parallel-for.sourceforge.net/tbb.html Sourceforge.net]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[https://computing.llnl.gov/tutorials/openMP/ OpenMP]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.computer.org/portal/web/csdl/doi/10.1109/SNPD.2009.16 Barrier Optimization for OpenMP Program]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://cs.anu.edu.au/~Alistair.Rendell/sc02/module3.pdf Performance Programming: Theory, Practice and Case Studies]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://software.intel.com/en-us/articles/intel-threading-building-blocks-openmp-or-native-threads/ Intel® Threading Building Blocks, OpenMP, or native threads?]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[https://computing.llnl.gov/tutorials/pthreads/#Joining POSIX Threads Programming by Blaise Barney, Lawrence Livermore National Laboratory]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://homepage.mac.com/dbutenhof/Threads/source.html Programing with POSIX Threads source code]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://refspecs.freestandards.org/IA64-softdevman-vol2.pdf IA-64 Software Development Manual]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://refspecs.freestandards.org/IA32-softdevman-vol3.pdf IA-32 Software Development Manual]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf CUDA Programming Guide]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.google.com/url?sa=t&amp;amp;source=web&amp;amp;cd=6&amp;amp;ved=0CEQQFjAF&amp;amp;url=http%3A%2F%2Flinuxindore.com%2Fdownloads%2Fdownload%2Fdata-structures%2Flinux-kernel-arch&amp;amp;ei=jxZWTaGTNI34sAPWm-ScDA&amp;amp;usg=AFQjCNG9UOAz7rHfwUDfayhr50M87uNOYA&amp;amp;sig2=azvo4h85RkoNHcZUtNIkJw Linux Kernel Architecture Overview]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/ch_3_jb/Parallel_Programming_Model_Support Spring 2010 NC State ECE/CSC506 Chapter 3 wiki]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://download.boulder.ibm.com/ibmdl/pub/software/dw/library/es-ppcbook2.zip PowerPC Architecture Book]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.google.com/url?sa=t&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCEQFjAA&amp;amp;url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FPowerPC&amp;amp;ei=77RYTejKFZSisQOm6-GiDA&amp;amp;usg=AFQjCNFt0LpxmNviHKFxCur-amK9HAG08Q&amp;amp;sig2=Kmm9RzJY-4AlG66AwWxlRA Wikipedia information on PowerPC]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.redbooks.ibm.com/redbooks/pdfs/sg247575.pdf IBM Cell Architecture Book]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.google.com/url?sa=t&amp;amp;source=web&amp;amp;cd=5&amp;amp;ved=0CDgQFjAE&amp;amp;url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FCell_(microprocessor)&amp;amp;ei=3MJYTeK5Aov6sAPC5-yiDA&amp;amp;usg=AFQjCNENg6PvayZebvtWf7KQstpJDk6URw&amp;amp;sig2=xs87jzBsFgneYOxP0k-_aQ Wikipedia information on Cell]&amp;lt;/li&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/ol&amp;gt;&lt;/div&gt;</summary>
		<author><name>Akrepask</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch3_ab&amp;diff=43735</id>
		<title>CSC/ECE 506 Spring 2011/ch3 ab</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch3_ab&amp;diff=43735"/>
		<updated>2011-02-14T06:15:38Z</updated>

		<summary type="html">&lt;p&gt;Akrepask: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Supplement to Chapter 3: Support for parallel-programming models. Discuss how DOACROSS, DOPIPE, DOALL, etc. are implemented in packages such as Posix threads, Intel Thread Building Blocks, OpenMP 2.0 and 3.0.&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this wiki supplement, we discuss how the three kinds of parallelism, i.e. DOALL, DOACROSS, and DOPIPE, are implemented in the thread packages OpenMP, Intel Threading Building Blocks, and POSIX Threads. We discuss each package from the perspective of variable scopes and of Reduction/DOALL/DOACROSS/DOPIPE implementations.&lt;br /&gt;
&lt;br /&gt;
==Implementation==&lt;br /&gt;
&lt;br /&gt;
===OpenMP===&lt;br /&gt;
The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix platforms and Windows NT platforms. Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.&lt;br /&gt;
&lt;br /&gt;
====Variable Clauses ====&lt;br /&gt;
There are many different types of clauses in OpenMP, each with its own characteristics. Here we introduce the data-sharing attribute clauses, synchronization clauses, scheduling clauses, initialization, and reduction. &lt;br /&gt;
=====Data sharing attribute clauses=====&lt;br /&gt;
* ''shared'': the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.&lt;br /&gt;
  Format: shared ''(list)''&lt;br /&gt;
&lt;br /&gt;
  SHARED variables behave as follows:&lt;br /&gt;
  1. Existing in only one memory location and all threads can read or write to that address &lt;br /&gt;
&lt;br /&gt;
* ''private'': the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.&lt;br /&gt;
  Format: private ''(list)''&lt;br /&gt;
&lt;br /&gt;
  PRIVATE variables behave as follows: &lt;br /&gt;
    1. A new object of the same type is declared once for each thread in the team&lt;br /&gt;
    2. All references to the original object are replaced with references to the new object&lt;br /&gt;
    3. Variables declared PRIVATE should be assumed to be uninitialized for each thread &lt;br /&gt;
&lt;br /&gt;
* ''default'': allows the programmer to state that the default data scoping within a parallel region will be either ''shared'', or ''none'' for C/C++, or ''shared'', ''firstprivate'', ''private'', or ''none'' for Fortran.  The ''none'' option forces the programmer to declare each variable in the parallel region using the data sharing attribute clauses.&lt;br /&gt;
  Format: default (shared | none)&lt;br /&gt;
&lt;br /&gt;
  DEFAULT variables behave as follows: &lt;br /&gt;
    1. Specific variables can be exempted from the default using the PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE, and REDUCTION clauses. &lt;br /&gt;
    2. Using NONE as a default requires that the programmer explicitly scope all variables.&lt;br /&gt;
&lt;br /&gt;
=====Synchronization clauses=====&lt;br /&gt;
* ''critical section'': the enclosed code block will be executed by only one thread at a time, and not simultaneously executed by multiple threads. It is often used to protect shared data from race conditions.&lt;br /&gt;
  Format: #pragma omp critical ''[ name ]  newline''&lt;br /&gt;
           ''structured_block''&lt;br /&gt;
&lt;br /&gt;
  CRITICAL SECTION behaves as follows:&lt;br /&gt;
    1. If a thread is currently executing inside a CRITICAL region and another thread reaches that CRITICAL region and attempts to execute it, it will block until the first thread exits that CRITICAL region.&lt;br /&gt;
    2. It is illegal to branch into or out of a CRITICAL block. &lt;br /&gt;
&lt;br /&gt;
* ''atomic'': similar to ''critical section'', but advises the compiler to use special hardware instructions for better performance. The compiler may ignore this suggestion and use a ''critical section'' instead.&lt;br /&gt;
  Format: #pragma omp atomic  ''newline''&lt;br /&gt;
           ''statement_expression''&lt;br /&gt;
&lt;br /&gt;
  ATOMIC behaves as follows:&lt;br /&gt;
    1. Applies only to a single, immediately following statement.&lt;br /&gt;
    2. An atomic statement must follow a specific syntax. &lt;br /&gt;
&lt;br /&gt;
* ''ordered'': the structured block is executed in the order in which iterations would be executed in a sequential loop&lt;br /&gt;
  Format: #pragma omp for ordered ''[clauses...]''&lt;br /&gt;
          ''(loop region)''&lt;br /&gt;
          #pragma omp ordered  ''newline''&lt;br /&gt;
          ''structured_block&lt;br /&gt;
          (end of loop region)''&lt;br /&gt;
&lt;br /&gt;
  ORDERED behaves as follows:&lt;br /&gt;
    1. May only appear in the dynamic extent of ''for'' or ''parallel for (C/C++)'' directives.&lt;br /&gt;
    2. Only one thread is allowed in an ordered section at any time.&lt;br /&gt;
    3. It is illegal to branch into or out of an ORDERED block. &lt;br /&gt;
    4. A loop which contains an ORDERED directive must be a loop with an ORDERED clause. &lt;br /&gt;
&lt;br /&gt;
* ''barrier'': each thread waits until all of the other threads of a team have reached this point. A work-sharing construct has an implicit barrier synchronization at the end.&lt;br /&gt;
   Format: #pragma omp barrier  ''newline''&lt;br /&gt;
&lt;br /&gt;
   BARRIER behaves as follows:&lt;br /&gt;
    1. All threads in a team (or none) must execute the BARRIER region.&lt;br /&gt;
    2. The sequence of work-sharing regions and barrier regions encountered must be the same for every thread in a team.&lt;br /&gt;
&lt;br /&gt;
*''taskwait'': specifies a wait on the completion of child tasks generated since the beginning of the current task.&lt;br /&gt;
   Format: #pragma omp taskwait  ''newline''&lt;br /&gt;
&lt;br /&gt;
   TASKWAIT behaves as follows:&lt;br /&gt;
    1. May be placed only at a point where a base language statement is allowed.&lt;br /&gt;
    2. May not be used in place of the statement following an if, while, do, switch, or label.&lt;br /&gt;
&lt;br /&gt;
*''flush'': The FLUSH directive identifies a synchronization point at which the implementation must provide a consistent view of memory. Thread-visible variables are written back to memory at this point. &lt;br /&gt;
   Format: #pragma omp flush ''(list)  newline''&lt;br /&gt;
&lt;br /&gt;
   FLUSH behaves as follows:&lt;br /&gt;
    1. The optional list contains a list of named variables that will be flushed in order to avoid flushing all variables.&lt;br /&gt;
    2. Implementations must ensure any prior modifications to thread-visible variables are visible to all threads after this point.&lt;br /&gt;
&lt;br /&gt;
=====Scheduling clauses=====&lt;br /&gt;
*''schedule(type, chunk)'': This is useful if the work sharing construct is a do-loop or for-loop. The iteration(s) in the work sharing construct are assigned to threads according to the scheduling method defined by this clause. The three types of scheduling are:&lt;br /&gt;
#''static'': Here, all the threads are allocated iterations before they execute the loop iterations. The iterations are divided among threads equally by default. However, specifying an integer for the parameter &amp;quot;chunk&amp;quot; will allocate &amp;quot;chunk&amp;quot; number of contiguous iterations to a particular thread.&lt;br /&gt;
#''dynamic'': Here, some of the iterations are allocated to a smaller number of threads. Once a particular thread finishes its allocated iteration, it returns to get another one from the iterations that are left. The parameter &amp;quot;chunk&amp;quot; defines the number of contiguous iterations that are allocated to a thread at a time.&lt;br /&gt;
#''guided'': A large chunk of contiguous iterations are allocated to each thread dynamically (as above). The chunk size decreases exponentially with each successive allocation to a minimum size specified in the parameter &amp;quot;chunk&amp;quot;&lt;br /&gt;
=====Initialization=====&lt;br /&gt;
* ''firstprivate'': the data is private to each thread, but initialized using the value of the variable of the same name from the master thread.&lt;br /&gt;
  Format: firstprivate ''(list)''&lt;br /&gt;
&lt;br /&gt;
  FIRSTPRIVATE variables behave as follows: &lt;br /&gt;
    1. Listed variables are initialized according to the value of their original objects prior to entry into the parallel or work-sharing construct. &lt;br /&gt;
&lt;br /&gt;
* ''lastprivate'': the data is private to each thread. The value of this private data is copied to a global variable of the same name outside the parallel region when the current iteration is the last iteration of the parallelized loop.  A variable can be both ''firstprivate'' and ''lastprivate''. &lt;br /&gt;
  Format: lastprivate ''(list)''&lt;br /&gt;
&lt;br /&gt;
* ''threadprivate'': the data is global, but it is private to each thread within parallel regions at runtime. The difference between ''threadprivate'' and ''private'' is the global scope associated with threadprivate and the value preserved across parallel regions.&lt;br /&gt;
  Format: #pragma omp threadprivate ''(list)''&lt;br /&gt;
&lt;br /&gt;
  THREADPRIVATE variables behave as follows: &lt;br /&gt;
    1. On first entry to a parallel region, data in THREADPRIVATE variables and common blocks should be assumed undefined. &lt;br /&gt;
    2. The THREADPRIVATE directive must appear after every declaration of a thread private variable/common block.&lt;br /&gt;
&lt;br /&gt;
=====Reduction=====&lt;br /&gt;
* ''reduction'': the variable has a local copy in each thread, but the values of the local copies are combined (reduced) into a global shared variable. This is very useful when an operation (specified by the &amp;quot;operator&amp;quot; of this clause) is applied to a variable iteratively, so that its value at one iteration depends on its value at a previous iteration. The work leading up to each update is parallelized, while the updates themselves are combined in a controlled way, avoiding a race condition. &lt;br /&gt;
  Format: reduction ''(operator: list)''&lt;br /&gt;
&lt;br /&gt;
  REDUCTION variables behave as follows: &lt;br /&gt;
    1. Variables in the list must be named scalar variables. They can not be array or structure type variables. They must also be declared SHARED in the enclosing context.&lt;br /&gt;
    2. Reduction operations may not be associative for real numbers.&lt;br /&gt;
&lt;br /&gt;
====DOALL====&lt;br /&gt;
&lt;br /&gt;
In Code 3.20, we first include the header file ''omp.h'', which contains the OpenMP function declarations. A parallel region is started by #pragma omp parallel and enclosed in curly brackets. We can use (setenv OMP_NUM_THREADS n) to specify the number of threads, or set it directly by calling the function (omp_set_num_threads(n)). &lt;br /&gt;
Code 3.20 only has one loop to execute and we want it to execute in parallel, so we combine the start of the parallel loop and the start of the parallel region with one directive ''#pragma omp parallel for''. &lt;br /&gt;
 &lt;br /&gt;
 '''Code 3.20 A DOALL parallelism example in OpenMP&lt;br /&gt;
 '''#include''' &amp;lt;omp.h&amp;gt;&lt;br /&gt;
 '''...'''&lt;br /&gt;
 '''#pragma''' omp parallel //start of parallel region&lt;br /&gt;
 '''{'''&lt;br /&gt;
  '''...'''&lt;br /&gt;
  '''#pragma''' omp parallel for default (shared)&lt;br /&gt;
  '''for''' ( i = 0; i &amp;lt; n ; i++)&lt;br /&gt;
    '''A[i]''' = A[i] + A[i] - 3.0;&lt;br /&gt;
 '''}'''//end for parallel region&lt;br /&gt;
&lt;br /&gt;
Clearly, there is no loop-carried dependence in the ''i'' loop. With OpenMP, we only need to insert the ''pragma'' directive ''parallel for''. The ''default(shared)'' clause states that all variables within the scope of the loop are shared unless otherwise specified.&lt;br /&gt;
&lt;br /&gt;
====DOACROSS====&lt;br /&gt;
We will now introduce how to implement DOACROSS in OpenMP. Here is an example code which has not been parallelized yet.&lt;br /&gt;
 &lt;br /&gt;
 '''Sample Code'''&lt;br /&gt;
 01: for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 02: for(j=1; j&amp;lt;N; j++){&lt;br /&gt;
 03: a[i][j]=a[i-1][j]+a[i][j-1];&lt;br /&gt;
 04: }&lt;br /&gt;
 05: }&lt;br /&gt;
&lt;br /&gt;
From this sample code, there is obviously a loop-carried dependence: &lt;br /&gt;
 a[i,j] -&amp;gt; T a[i+1,j],  a[i,j] -&amp;gt; T a[i,j+1]&lt;br /&gt;
&lt;br /&gt;
In OpenMP, DOALL parallelism can be implemented by inserting a “#pragma omp for” before the “for” structure in the source code. But there is no pragma corresponding to DOACROSS parallelism.&lt;br /&gt;
&lt;br /&gt;
To implement DOACROSS, we use a shared array &amp;quot;_mylocks[threadid]&amp;quot;, which stores the events of each thread, and a private variable _counter0, which indicates the event the current thread is waiting for. &lt;br /&gt;
The number of threads is obtained by the function &amp;quot;omp_get_num_threads()&amp;quot; and the current thread's id is obtained by the function &amp;quot;omp_get_thread_num()&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
*omp_get_num_threads(): Returns the number of threads that are currently in the team executing the parallel region from which it is called.&lt;br /&gt;
 Format: #include &amp;lt;omp.h&amp;gt;&lt;br /&gt;
         int omp_get_num_threads(void)&lt;br /&gt;
&lt;br /&gt;
 OMP_GET_NUM_THREADS behaves as follows:&lt;br /&gt;
  1. If this call is made from a serial portion of the program, or a nested parallel region that is serialized, it will return 1. &lt;br /&gt;
  2. The default number of threads is implementation dependent. &lt;br /&gt;
&lt;br /&gt;
*omp_get_thread_num(): Returns the thread number of the thread, within the team, making this call. This number will be between 0 and OMP_GET_NUM_THREADS-1. The master thread of the team is thread 0. &lt;br /&gt;
 Format: #include &amp;lt;omp.h&amp;gt;&lt;br /&gt;
         int omp_get_thread_num(void)&lt;br /&gt;
&lt;br /&gt;
 OMP_GET_THREAD_NUM behaves as follows:&lt;br /&gt;
  1. If called from a serial region, or from a nested parallel region that is serialized, this function will return 0. &lt;br /&gt;
&lt;br /&gt;
Now, let's look at the parallelized code and its explanation. &lt;br /&gt;
&lt;br /&gt;
 01: int _mylocks[256]; 		//thread’s synchronized array&lt;br /&gt;
 02: #pragma omp parallel&lt;br /&gt;
 03: {&lt;br /&gt;
 04:   int _counter0 = 1;&lt;br /&gt;
 05:   int _my_id = omp_get_thread_num();&lt;br /&gt;
 06:   int _my_nprocs= omp_get_num_threads();&lt;br /&gt;
 07:   _mylocks[_my_id] = 0;&lt;br /&gt;
 08:   for(j_tile = 0; j_tile&amp;lt;N-1; j_tile+=M){&lt;br /&gt;
 09:     if(_my_id&amp;gt;0) {&lt;br /&gt;
 10:       do{&lt;br /&gt;
 11:         #pragma omp flush(_mylocks)&lt;br /&gt;
 12:       } while(_mylocks[_my_id-1]&amp;lt;_counter0);&lt;br /&gt;
 13:       #pragma omp flush(a, _mylocks)&lt;br /&gt;
 14:       _counter0 += 1;&lt;br /&gt;
 15:     }&lt;br /&gt;
 16:     #pragma omp for nowait&lt;br /&gt;
 17:     for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 18:       for(j=j_tile;j&amp;lt;j_tile+M;j++){&lt;br /&gt;
 19:         a[i][j]=a[i-1][j]+a[i][j-1];&lt;br /&gt;
 20:       }&lt;br /&gt;
 21:     }&lt;br /&gt;
 22:     _mylocks[_my_id] += 1;&lt;br /&gt;
 23:     #pragma omp flush(a, _mylocks)&lt;br /&gt;
 24:   }&lt;br /&gt;
 25: }&lt;br /&gt;
&lt;br /&gt;
We parallelized the original program in two steps.&lt;br /&gt;
&lt;br /&gt;
*First step: We divide the i loop among the processors by inserting the OpenMP construct “#pragma omp for nowait” (line 16); each processor then takes a share of the iterations of loop i. The same applies to loop j. Assume the size of each block is M = 4, so each processor executes four iterations of loop j at a time. In order to keep the total number of iterations equal to the original program, loop j has to be enclosed in a tiling loop, ''for (j_tile = 0; j_tile &amp;lt; N-1; j_tile += M)'' (line 8).&lt;br /&gt;
The lower bound of loop j is set to j_tile and the upper bound to j_tile+M-1. The other statements are kept unchanged.&lt;br /&gt;
&lt;br /&gt;
*Second step: We synchronize the neighboring threads. After the first step, each processor computes a 4x4 block. If all processors ran fully in parallel, the dependence would be violated, so neighboring threads must be synchronized.&lt;br /&gt;
We set up four variables as follows: &lt;br /&gt;
1. A private variable _my_nprocs = omp_get_num_threads(), which indicates the total number of threads running the corresponding parallel region.&lt;br /&gt;
2. A private variable _my_id = omp_get_thread_num(), which indicates the unique ID of the current thread.&lt;br /&gt;
3. A shared array _mylocks[proc], initialized to 0 for each element; element proc counts the blocks that thread proc has finished computing.&lt;br /&gt;
4. A private variable _counter0, initialized to 1, which indicates the block that the current thread is waiting for.&lt;br /&gt;
&lt;br /&gt;
With these four variables, the threads are synchronized as follows:&lt;br /&gt;
The first thread continues to run without waiting (line 9), because its thread ID is 0. Every other thread cannot proceed past line 12 while the value in ''_mylocks[_my_id-1]'' is smaller than ''_counter0''.&lt;br /&gt;
&lt;br /&gt;
Otherwise, the block that the current thread is waiting for has been completed; the current thread can proceed past line 12, and it marks the next block it will wait for by adding 1 to ''_counter0'' (line 14).&lt;br /&gt;
&lt;br /&gt;
When the current thread finishes its own block, it signals completion by incrementing ''_mylocks[_my_id]'' (line 22). Once the neighboring thread finds that the value has changed, it continues running, and so on. The figure below illustrates this.&lt;br /&gt;
[[Image:Synchorization.jpg]]&lt;br /&gt;
&lt;br /&gt;
====DOPIPE====&lt;br /&gt;
Here is another example code, which we are going to parallelize with DOPIPE parallelism. There is a dependence, S1 -&amp;gt; T S2, in the sample code.&lt;br /&gt;
 '''Sample Code'''&lt;br /&gt;
 01: for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 02:   S1: a[i]=b[i];&lt;br /&gt;
 03:   S2: c[i]=c[i-1]+a[i];&lt;br /&gt;
 04: &lt;br /&gt;
 05: }&lt;br /&gt;
Now, let's see how to parallelize the sample code with DOPIPE parallelism.&lt;br /&gt;
We still use a shared array &amp;quot;_mylocks[threadid]&amp;quot;, which stores the events of each thread, and a private variable _counter0, which indicates the event the current thread is waiting for. &lt;br /&gt;
The number of threads is obtained by the function &amp;quot;omp_get_num_threads()&amp;quot; and the current thread's id is obtained by the function &amp;quot;omp_get_thread_num()&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
 01: int _mylocks[256]; 			//thread’s synchronized array&lt;br /&gt;
 02: #pragma omp parallel&lt;br /&gt;
 03: {&lt;br /&gt;
 04:   int _counter0 = 1;&lt;br /&gt;
 05:   int _my_id = omp_get_thread_num();&lt;br /&gt;
 06:   int _my_nprocs= omp_get_num_threads();&lt;br /&gt;
 07:   _mylocks[_my_id] = 0;&lt;br /&gt;
 08:   for(i_tile = 0; i_tile&amp;lt;N-1; i_tile+=M){&lt;br /&gt;
 09:     if(_my_id&amp;gt;0) {&lt;br /&gt;
 10:       do{&lt;br /&gt;
 11:         #pragma omp flush(_mylocks)&lt;br /&gt;
 12:       } while(_mylocks[_my_id-1]&amp;lt;_counter0);&lt;br /&gt;
 13:       #pragma omp flush(a, _mylocks)&lt;br /&gt;
 14:       _counter0 += 1;&lt;br /&gt;
 15:     }&lt;br /&gt;
 16:     #pragma omp for nowait&lt;br /&gt;
 17:     for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 18:       a[i]=b[i];&lt;br /&gt;
 19:     }&lt;br /&gt;
 20:     for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 21:       c[i]=c[i-1]+a[i];&lt;br /&gt;
 22:     }&lt;br /&gt;
 23:     _mylocks[_my_id] += 1;&lt;br /&gt;
 24:     #pragma omp flush(a, _mylocks)&lt;br /&gt;
 25:   }&lt;br /&gt;
 26: }&lt;br /&gt;
&lt;br /&gt;
We parallelized the original program in two steps.&lt;br /&gt;
&lt;br /&gt;
*First step: We divide the i loop among the processors by inserting the OpenMP construct “#pragma omp for nowait” (line 16); each processor then takes a share of the iterations of loop i. There are now two i loops, and each contains different statements. The other statements are kept unchanged.&lt;br /&gt;
&lt;br /&gt;
*Second step: We synchronize the threads. After the first step, the processors finish computing a[i]=b[i]. If all processors ran the second i loop fully in parallel, the dependence would be violated, so neighboring threads must be synchronized.&lt;br /&gt;
Again, we set up four variables as follows: &lt;br /&gt;
1. A private variable _my_nprocs = omp_get_num_threads(), which indicates the total number of threads running the corresponding parallel region.&lt;br /&gt;
2. A private variable _my_id = omp_get_thread_num(), which indicates the unique ID of the current thread.&lt;br /&gt;
3. A shared array _mylocks[proc], initialized to 0 for each element; element proc counts the blocks that thread proc has finished computing.&lt;br /&gt;
4. A private variable _counter0, initialized to 1, which indicates the block that the current thread is waiting for.&lt;br /&gt;
&lt;br /&gt;
When the current thread finishes its block, it signals completion by incrementing ''_mylocks[_my_id]''. Once a processor finishes its own block, the next processor can read the values it produced and use them in its own statements.&lt;br /&gt;
&lt;br /&gt;
====Functional Parallelism====&lt;br /&gt;
&lt;br /&gt;
In order to introduce functional parallelism, we want to execute one code section in parallel with another code section. Code 3.21 shows two loops that execute in parallel with respect to one another, although each loop is executed sequentially.&lt;br /&gt;
&lt;br /&gt;
 '''Code''' 3.21 A function parallelism example in OpenMP&lt;br /&gt;
 '''#pragma''' omp parallel shared(A, B) private(i)&lt;br /&gt;
 '''{'''&lt;br /&gt;
  '''#pragma''' omp sections nowait&lt;br /&gt;
  '''{'''&lt;br /&gt;
      '''#pragma''' omp section&lt;br /&gt;
      '''for'''( i = 0; i &amp;lt; n ; i++)&lt;br /&gt;
         '''A[i]''' = A[i]*A[i] - 4.0;&lt;br /&gt;
      '''#pragma''' omp section&lt;br /&gt;
      '''for'''( i = 0; i &amp;lt; n ; i++)&lt;br /&gt;
         '''B[i]''' = B[i]*B[i] - 9.0;&lt;br /&gt;
  '''}'''//end omp sections&lt;br /&gt;
 '''}'''//end omp parallel&lt;br /&gt;
&lt;br /&gt;
In Code 3.21, there are two loops that need to execute in parallel with each other. We just need to insert two ''#pragma omp section'' statements. Once we insert them, the two sections execute in parallel with respect to one another, while the loop inside each section still runs sequentially.&lt;br /&gt;
&lt;br /&gt;
===Intel Thread Building Blocks===&lt;br /&gt;
&lt;br /&gt;
Intel Threading Building Blocks (Intel TBB) is a library that supports scalable &lt;br /&gt;
parallel programming using standard ISO C++ code. It does not require special &lt;br /&gt;
languages or compilers. It is designed to promote scalable data parallel programming. &lt;br /&gt;
The library consists of data structures and algorithms that allow a programmer to avoid some complications arising from the use of native threading packages such as POSIX threads, Windows threads, or the portable Boost Threads, in which individual threads of execution are created, synchronized, and terminated manually. Instead, the library abstracts access to the multiple processors by allowing the operations to be treated as &amp;quot;tasks,&amp;quot; which are allocated to individual cores dynamically by the library's run-time engine, and by automating efficient use of the cache. This approach groups TBB in a family of solutions for parallel programming that aim to decouple the programming from the particulars of the underlying machine. The net result is that Intel Threading Building Blocks enables you to specify parallelism more conveniently than using raw threads, and at the same time can improve performance.&lt;br /&gt;
&lt;br /&gt;
====Variables Scope====&lt;br /&gt;
&lt;br /&gt;
Intel TBB is a collection of components for parallel programming. Here is an overview of the library contents:&lt;br /&gt;
&lt;br /&gt;
* Basic algorithms: parallel_for, parallel_reduce, parallel_scan&lt;br /&gt;
* Advanced algorithms: parallel_while, parallel_do, pipeline, parallel_sort&lt;br /&gt;
* Containers: concurrent_queue, concurrent_vector, concurrent_hash_map&lt;br /&gt;
* Scalable memory allocation: scalable_malloc, scalable_free, scalable_realloc, scalable_calloc, scalable_allocator, cache_aligned_allocator&lt;br /&gt;
* Mutual exclusion: mutex, spin_mutex, queuing_mutex, spin_rw_mutex, queuing_rw_mutex, recursive mutex&lt;br /&gt;
* Atomic operations: fetch_and_add, fetch_and_increment, fetch_and_decrement, compare_and_swap, fetch_and_store&lt;br /&gt;
* Timing: portable fine grained global time stamp&lt;br /&gt;
* Task Scheduler: direct access to control the creation and activation of tasks&lt;br /&gt;
&lt;br /&gt;
Next, we will focus on some specific TBB components.&lt;br /&gt;
&lt;br /&gt;
=====parallel_for=====&lt;br /&gt;
&lt;br /&gt;
parallel_for is the template function that performs parallel iteration over a range of values. In Intel TBB, many DOALL cases can be implemented using this function. The syntax is as follows: &lt;br /&gt;
 template&amp;lt;typename Index, typename Function&amp;gt;&lt;br /&gt;
 Function parallel_for(Index first, Index last, Index step, Function f);&lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_for( const Range&amp;amp; range, const Body&amp;amp; body [, partitioner] );&lt;br /&gt;
&lt;br /&gt;
A parallel_for(first, last, step, f) represents parallel execution of the loop: &amp;quot;for( auto i=first; i&amp;lt;last; i+=step ) f(i);&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=====parallel_reduce=====&lt;br /&gt;
&lt;br /&gt;
The function parallel_reduce computes a reduction over a range. The syntax is as follows:&lt;br /&gt;
 template&amp;lt;typename Range, typename Value, typename Func, typename Reduction&amp;gt;&lt;br /&gt;
 Value parallel_reduce( const Range&amp;amp; range, const Value&amp;amp; identity, const Func&amp;amp; func, const Reduction&amp;amp; reduction );&lt;br /&gt;
&lt;br /&gt;
The functional form parallel_reduce(range,identity,func,reduction) performs a&lt;br /&gt;
parallel reduction by applying func to subranges in range and reducing the results&lt;br /&gt;
using binary operator reduction. It returns the result of the reduction. Parameters func&lt;br /&gt;
and reduction can be lambda expressions.&lt;br /&gt;
&lt;br /&gt;
=====parallel_scan=====&lt;br /&gt;
&lt;br /&gt;
This template function computes a parallel prefix. The syntax is as follows:&lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_scan( const Range&amp;amp; range, Body&amp;amp; body );&lt;br /&gt;
 &lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_scan( const Range&amp;amp; range, Body&amp;amp; body, const auto_partitioner&amp;amp; );&lt;br /&gt;
 &lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_scan( const Range&amp;amp; range, Body&amp;amp; body, const simple_partitioner&amp;amp; );&lt;br /&gt;
&lt;br /&gt;
A parallel_scan(range,body) computes a parallel prefix, also known as parallel&lt;br /&gt;
scan. This computation is an advanced concept in parallel computing that is&lt;br /&gt;
sometimes useful in scenarios that appear to have inherently serial dependences. A&lt;br /&gt;
further explanation will be given in the DOACROSS example.&lt;br /&gt;
&lt;br /&gt;
=====pipeline=====&lt;br /&gt;
&lt;br /&gt;
This class performs pipelined execution. Members as follows:&lt;br /&gt;
 namespace tbb {&lt;br /&gt;
     class pipeline {&lt;br /&gt;
     public:&lt;br /&gt;
        pipeline();&lt;br /&gt;
        ~pipeline(); &lt;br /&gt;
        void add_filter( filter&amp;amp; f );&lt;br /&gt;
        void run( size_t max_number_of_live_tokens );&lt;br /&gt;
        void clear();&lt;br /&gt;
   };&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
A pipeline represents pipelined application of a series of filters to a stream of items.&lt;br /&gt;
Each filter operates in a particular mode: parallel, serial in order, or serial out of order. With a parallel filter, &lt;br /&gt;
we could implement DOPIPE parallelism.&lt;br /&gt;
&lt;br /&gt;
====Reduction====&lt;br /&gt;
&lt;br /&gt;
The reduction in Intel TBB is implemented using parallel_reduce function. A parallel_reduce recursively splits the range into subranges and uses the splitting constructor to make one or more copies of the body for each thread. We use an example to illustrate this: &lt;br /&gt;
&lt;br /&gt;
 #include &amp;quot;tbb/parallel_reduce.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/blocked_range.h&amp;quot;&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
 struct Sum {&lt;br /&gt;
     float value;&lt;br /&gt;
     Sum() : value(0) {}&lt;br /&gt;
     Sum( Sum&amp;amp; s, split ) {value = 0;}&lt;br /&gt;
     void operator()( const blocked_range&amp;lt;float*&amp;gt;&amp;amp; r ) {&lt;br /&gt;
         float temp = value;&lt;br /&gt;
         for( float* a=r.begin(); a!=r.end(); ++a ) {&lt;br /&gt;
             temp += *a;&lt;br /&gt;
         }&lt;br /&gt;
         value = temp;&lt;br /&gt;
     }&lt;br /&gt;
     void join( Sum&amp;amp; rhs ) {value += rhs.value;}&lt;br /&gt;
 };&lt;br /&gt;
 float ParallelSum( float array[], size_t n ) {&lt;br /&gt;
     Sum total;&lt;br /&gt;
     parallel_reduce( blocked_range&amp;lt;float*&amp;gt;( array, array+n ), total );&lt;br /&gt;
     return total.value;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The above example sums the values in the array. parallel_reduce performs the reduction over the range (array, array+n), splitting the working body into copies and then joining their partial results with the join method.&lt;br /&gt;
&lt;br /&gt;
====DOALL====&lt;br /&gt;
The implementation of DOALL parallelism in Intel TBB involves the parallel_for function. &lt;br /&gt;
To better illustrate the usage, here we discuss a simple example. The following is the original code:&lt;br /&gt;
 &lt;br /&gt;
 void SerialApplyFoo( float a[], size_t n ) {&lt;br /&gt;
     for( size_t i=0; i&amp;lt;n; ++i )&lt;br /&gt;
         Foo(a[i]);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
After using Intel TBB, it could be switched to the following:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;quot;tbb/blocked_range.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/parallel_for.h&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
 class ApplyFoo {&lt;br /&gt;
     float *const my_a;&lt;br /&gt;
 public:&lt;br /&gt;
     void operator( )( const blocked_range&amp;lt;size_t&amp;gt;&amp;amp; r ) const {&lt;br /&gt;
         float *a = my_a;&lt;br /&gt;
         for( size_t i=r.begin(); i!=r.end( ); ++i )&lt;br /&gt;
             Foo(a[i]);&lt;br /&gt;
     }&lt;br /&gt;
     ApplyFoo( float a[] ) :&lt;br /&gt;
         my_a(a)&lt;br /&gt;
     {}&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 void ParallelApplyFoo( float a[], size_t n ) {&lt;br /&gt;
     parallel_for(blocked_range&amp;lt;size_t&amp;gt;(0,n,The_grain_size_You_Pick), ApplyFoo(a) );&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
This example is the simplest DOALL parallelism, similar to the one in the textbook, and its execution graph is very similar to the one in the DOALL section above. Though simple, the TBB code gives you a flavor of how DOALL would be implemented in Intel Threading Building Blocks.&lt;br /&gt;
&lt;br /&gt;
To say a little more, parallel_for takes an optional third argument to specify a partitioner, represented here by &amp;quot;The_grain_size_You_Pick&amp;quot;. If you want to divide the grain manually and assign the work to processors, you can specify that in the function. Alternatively, you can use the automatic grain size provided by TBB: the auto_partitioner heuristically chooses the grain size so that you do not have to specify one. The heuristic attempts to limit overhead while still providing ample opportunities for load balancing. Then, the last three lines of the TBB code above become:&lt;br /&gt;
&lt;br /&gt;
 void ParallelApplyFoo( float a[], size_t n ) {&lt;br /&gt;
     parallel_for(blocked_range&amp;lt;size_t&amp;gt;(0,n), ApplyFoo(a), auto_partitioner( ) );&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
====DOACROSS====&lt;br /&gt;
&lt;br /&gt;
We can find a good example in Intel TBB of implementing DOACROSS with the help of parallel_scan. As stated in the parallel_scan section, this function computes a parallel prefix, also known as a parallel scan. This computation is an advanced concept in parallel computing that can be helpful in scenarios that appear to have inherently serial dependences, such as loop-carried dependences. &lt;br /&gt;
&lt;br /&gt;
Let's consider this scenario (which is actually the mathematical definition of parallel prefix):  &lt;br /&gt;
 T temp = id⊕;&lt;br /&gt;
 for( int i=1; i&amp;lt;=n; ++i ) {&lt;br /&gt;
     temp = temp ⊕ x[i];&lt;br /&gt;
     y[i] = temp;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we implement this in TBB using parallel_scan, it becomes:&lt;br /&gt;
&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
 class Body {&lt;br /&gt;
     T sum;&lt;br /&gt;
     T* const y;&lt;br /&gt;
     const T* const x;&lt;br /&gt;
 public:&lt;br /&gt;
     Body( T y_[], const T x_[] ) : sum(id⊕), x(x_), y(y_) {}&lt;br /&gt;
     T get_sum() const {return sum;}&lt;br /&gt;
     template&amp;lt;typename Tag&amp;gt;&lt;br /&gt;
     void operator()( const blocked_range&amp;lt;int&amp;gt;&amp;amp; r, Tag ) {&lt;br /&gt;
         T temp = sum;&lt;br /&gt;
         for( int i=r.begin(); i&amp;lt;r.end(); ++i ) {&lt;br /&gt;
             temp = temp ⊕ x[i];&lt;br /&gt;
             if( Tag::is_final_scan() )&lt;br /&gt;
                 y[i] = temp;&lt;br /&gt;
         } &lt;br /&gt;
         sum = temp;&lt;br /&gt;
     }&lt;br /&gt;
     Body( Body&amp;amp; b, split ) : x(b.x), y(b.y), sum(id⊕) {}&lt;br /&gt;
     void reverse_join( Body&amp;amp; a ) { sum = a.sum ⊕ sum;}&lt;br /&gt;
     void assign( Body&amp;amp; b ) {sum = b.sum;}&lt;br /&gt;
 };&lt;br /&gt;
&lt;br /&gt;
 float DoParallelScan( T y[], const T x[], int n ) {&lt;br /&gt;
     Body body(y,x);&lt;br /&gt;
     parallel_scan( blocked_range&amp;lt;int&amp;gt;(0,n), body );&lt;br /&gt;
     return body.get_sum();&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The second part (the function DoParallelScan) is the one to focus on. &lt;br /&gt;
&lt;br /&gt;
This example is exactly the scenario mentioned above that can take advantage of parallel_scan. The &amp;quot;inherently serial dependences&amp;quot; are taken care of by parallel_scan itself: by computing the prefix, the serial code can be parallelized with just one function call.&lt;br /&gt;
&lt;br /&gt;
====DOPIPE====&lt;br /&gt;
&lt;br /&gt;
The pipeline class is the Intel TBB construct that performs pipelined execution. A pipeline represents the pipelined application of a series of filters to a stream of items. Each filter operates in a particular mode: parallel, serial in order, or serial out of order. This class can therefore be used to implement DOPIPE parallelism.&lt;br /&gt;
&lt;br /&gt;
Here is a comparatively complex pipeline example. If we look carefully, it exhibits both DOPIPE and DOACROSS parallelism:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;iostream&amp;gt;&lt;br /&gt;
 #include &amp;quot;tbb/pipeline.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/tbb_thread.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/task_scheduler_init.h&amp;quot;&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
 char InputString[] = &amp;quot;abcdefg\n&amp;quot;;&lt;br /&gt;
 class InputFilter: public filter {&lt;br /&gt;
     char* my_ptr;&lt;br /&gt;
 public:&lt;br /&gt;
     void* operator()(void*) {&lt;br /&gt;
         if (*my_ptr)&lt;br /&gt;
             return my_ptr++;&lt;br /&gt;
         else&lt;br /&gt;
             return NULL;&lt;br /&gt;
     }&lt;br /&gt;
     InputFilter() :&lt;br /&gt;
         filter( serial_in_order ), my_ptr(InputString) {}&lt;br /&gt;
 };&lt;br /&gt;
 class OutputFilter: public thread_bound_filter {&lt;br /&gt;
 public:&lt;br /&gt;
     void* operator()(void* item) {&lt;br /&gt;
         std::cout &amp;lt;&amp;lt; *(char*)item;&lt;br /&gt;
         return NULL;&lt;br /&gt;
     }&lt;br /&gt;
     OutputFilter() : thread_bound_filter(serial_in_order) {}&lt;br /&gt;
 };&lt;br /&gt;
 void RunPipeline(pipeline* p) {&lt;br /&gt;
     p-&amp;gt;run(8);&lt;br /&gt;
 }&lt;br /&gt;
 int main() {&lt;br /&gt;
     // Construct the pipeline&lt;br /&gt;
     InputFilter f;&lt;br /&gt;
     OutputFilter g;&lt;br /&gt;
     pipeline p;&lt;br /&gt;
     p.add_filter(f);&lt;br /&gt;
     p.add_filter(g);&lt;br /&gt;
     // Another thread initiates execution of the pipeline&lt;br /&gt;
     tbb_thread t(RunPipeline,&amp;amp;p);&lt;br /&gt;
     // Process the thread_bound_filter with the current thread.&lt;br /&gt;
     while (g.process_item()!=thread_bound_filter::end_of_stream)&lt;br /&gt;
         continue;&lt;br /&gt;
     // Wait for pipeline to finish on the other thread.&lt;br /&gt;
     t.join();&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The example above shows a pipeline with two filters where the second filter is a thread_bound_filter serviced by the main thread. The main thread does the following after constructing the pipeline:&lt;br /&gt;
# Start the pipeline on another thread.&lt;br /&gt;
# Service the thread_bound_filter until it reaches end_of_stream.&lt;br /&gt;
# Wait for the other thread to finish.&lt;br /&gt;
&lt;br /&gt;
===POSIX Threads===&lt;br /&gt;
&lt;br /&gt;
POSIX Threads, or Pthreads, is a POSIX standard for threads. The standard, POSIX.1c, Threads extensions (IEEE Std 1003.1c-1995), defines an API for creating and manipulating threads.&lt;br /&gt;
&lt;br /&gt;
====API Overview====&lt;br /&gt;
Pthreads defines a set of C programming language types, functions and constants. It is implemented with a pthread.h header and a thread library.&lt;br /&gt;
&lt;br /&gt;
There are around 100 Pthreads procedures, all prefixed &amp;quot;pthread_&amp;quot;. The subroutines which comprise the Pthreads API can be informally grouped into four major groups:&lt;br /&gt;
&lt;br /&gt;
* '''Thread management:''' Routines that work directly on threads - creating, detaching, joining, etc. They also include functions to set/query thread attributes (joinable, scheduling, etc.), e.g. pthread_create(), pthread_join().&lt;br /&gt;
* '''Mutexes:''' Routines that deal with a synchronization primitive called a &amp;quot;mutex&amp;quot;, an abbreviation for &amp;quot;mutual exclusion&amp;quot;. Mutex functions provide for creating, destroying, locking and unlocking mutexes. These are supplemented by mutex attribute functions that set or modify attributes associated with mutexes, e.g. pthread_mutex_lock(), pthread_mutex_trylock(), pthread_mutex_unlock().&lt;br /&gt;
* '''Condition variables:''' Routines that address communication between threads that share a mutex, based upon programmer-specified conditions. This group includes functions to create, destroy, wait and signal based upon specified variable values, as well as functions to set/query condition variable attributes, e.g. pthread_cond_signal(), pthread_cond_broadcast(), pthread_cond_wait(), pthread_cond_timedwait(), pthread_cond_reltimedwait_np().&lt;br /&gt;
* '''Synchronization:''' Routines that manage read/write locks and barriers, e.g. pthread_rwlock_rdlock(), pthread_rwlock_tryrdlock(), pthread_rwlock_wrlock(), pthread_rwlock_trywrlock(), pthread_rwlock_unlock(), pthread_barrier_init(), pthread_barrier_wait().&lt;br /&gt;
&lt;br /&gt;
====DOALL====&lt;br /&gt;
&lt;br /&gt;
The following is a simple C code example of DOALL parallelism that prints out each thread's ID.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;
 #define NUM_THREADS     5&lt;br /&gt;
  &lt;br /&gt;
 void *PrintHello(void *threadid)&lt;br /&gt;
 {&lt;br /&gt;
    long tid;&lt;br /&gt;
  &lt;br /&gt;
    tid = (long)threadid;&lt;br /&gt;
    printf(&amp;quot;Hello World! It's me, thread #%ld!\n&amp;quot;, tid);&lt;br /&gt;
    pthread_exit(NULL);&lt;br /&gt;
 }&lt;br /&gt;
  &lt;br /&gt;
 int main (int argc, char *argv[])&lt;br /&gt;
 {&lt;br /&gt;
    pthread_t threads[NUM_THREADS];&lt;br /&gt;
  &lt;br /&gt;
    int rc;&lt;br /&gt;
    long t;&lt;br /&gt;
    for(t=0; t&amp;lt;NUM_THREADS; t++){&lt;br /&gt;
       printf(&amp;quot;In main: creating thread %ld\n&amp;quot;, t);&lt;br /&gt;
       rc = pthread_create(&amp;amp;threads[t], NULL, PrintHello, (void *)t);&lt;br /&gt;
  &lt;br /&gt;
       if (rc){&lt;br /&gt;
          printf(&amp;quot;ERROR; return code from pthread_create() is %d\n&amp;quot;, rc);&lt;br /&gt;
          exit(-1);&lt;br /&gt;
       }&lt;br /&gt;
    }&lt;br /&gt;
    pthread_exit(NULL);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The loop body contains only a single statement with no cross-iteration dependence, so each iteration can be treated as an independent parallel task.&lt;br /&gt;
====DOACROSS====&lt;br /&gt;
&lt;br /&gt;
When it comes to using Pthreads to implement DOACROSS, functional parallelism can be expressed easily, but the parallelism becomes unnecessarily complicated. See the example below, from '''POSIX Threads Programming''' by Blaise Barney:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;math.h&amp;gt;   /* for sin() and tan() */&lt;br /&gt;
 #define NUM_THREADS	4&lt;br /&gt;
 &lt;br /&gt;
 void *BusyWork(void *t)&lt;br /&gt;
 {&lt;br /&gt;
   int i;&lt;br /&gt;
   long tid;&lt;br /&gt;
   double result=0.0;&lt;br /&gt;
   tid = (long)t;&lt;br /&gt;
   printf(&amp;quot;Thread %ld starting...\n&amp;quot;,tid);&lt;br /&gt;
   for (i=0; i&amp;lt;1000000; i++)&lt;br /&gt;
   {&lt;br /&gt;
      result = result + sin(i) * tan(i);&lt;br /&gt;
   }&lt;br /&gt;
   printf(&amp;quot;Thread %ld done. Result = %e\n&amp;quot;,tid, result);&lt;br /&gt;
   pthread_exit((void*) t);&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main (int argc, char *argv[])&lt;br /&gt;
 {&lt;br /&gt;
   pthread_t thread[NUM_THREADS];&lt;br /&gt;
   pthread_attr_t attr;&lt;br /&gt;
   int rc;&lt;br /&gt;
   long t;&lt;br /&gt;
   void *status;&lt;br /&gt;
 &lt;br /&gt;
   /* Initialize and set thread detached attribute */&lt;br /&gt;
   pthread_attr_init(&amp;amp;attr);&lt;br /&gt;
   pthread_attr_setdetachstate(&amp;amp;attr, PTHREAD_CREATE_JOINABLE);&lt;br /&gt;
 &lt;br /&gt;
   for(t=0; t&amp;lt;NUM_THREADS; t++) {&lt;br /&gt;
      printf(&amp;quot;Main: creating thread %ld\n&amp;quot;, t);&lt;br /&gt;
      rc = pthread_create(&amp;amp;thread[t], &amp;amp;attr, BusyWork, (void *)t); &lt;br /&gt;
      if (rc) {&lt;br /&gt;
       printf(&amp;quot;ERROR; return code from pthread_create() is %d\n&amp;quot;, rc);&lt;br /&gt;
         exit(-1);&lt;br /&gt;
         }&lt;br /&gt;
      }&lt;br /&gt;
 &lt;br /&gt;
   /* Free attribute and wait for the other threads */&lt;br /&gt;
   pthread_attr_destroy(&amp;amp;attr);&lt;br /&gt;
   for(t=0; t&amp;lt;NUM_THREADS; t++) {&lt;br /&gt;
      rc = pthread_join(thread[t], &amp;amp;status);&lt;br /&gt;
      if (rc) {&lt;br /&gt;
       printf(&amp;quot;ERROR; return code from pthread_join() is %d\n&amp;quot;, rc);&lt;br /&gt;
         exit(-1);&lt;br /&gt;
         }&lt;br /&gt;
     printf(&amp;quot;Main: completed join with thread %ld having a status of %ld\n&amp;quot;, t, (long)status);&lt;br /&gt;
      }&lt;br /&gt;
 &lt;br /&gt;
 printf(&amp;quot;Main: program completed. Exiting.\n&amp;quot;);&lt;br /&gt;
 pthread_exit(NULL);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
This example demonstrates how to &amp;quot;wait&amp;quot; for thread completions by using the Pthread join routine. Since some implementations of Pthreads may not create threads in a joinable state, the threads in this example are explicitly created in a joinable state so that they can be joined later.&lt;br /&gt;
&lt;br /&gt;
====DOPIPE====&lt;br /&gt;
There are examples of using POSIX Threads to implement DOPIPE parallelism, but they are unnecessarily complex. Due to their length, we won't reproduce one here; interested readers can find one at [http://homepage.mac.com/dbutenhof/Threads/code/pipe.c Pthreads DOPIPE example].&lt;br /&gt;
&lt;br /&gt;
===Comparison among the three===&lt;br /&gt;
&lt;br /&gt;
====A unified example====&lt;br /&gt;
&lt;br /&gt;
We use a simple parallel example from [http://sourceforge.net Sourceforge.net] to show how it will be implemented in the three packages, namely, POSIX Threads, Intel TBB, OpenMP, to show some commonalities and differences among them.&lt;br /&gt;
&lt;br /&gt;
Following is the original code:&lt;br /&gt;
&lt;br /&gt;
 Grid1 *g = new Grid1(0, n+1);&lt;br /&gt;
 Grid1IteratorSub it(1, n, g);&lt;br /&gt;
 DistArray x(g), y(g);&lt;br /&gt;
 ...&lt;br /&gt;
 float e = 0;&lt;br /&gt;
 ForEach(int i, it,&lt;br /&gt;
    x(i) += ( y(i+1) + y(i-1) )*.5;&lt;br /&gt;
    e += sqr( y(i) ); )&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
Then we are going to show the implementations in different packages, and also make a brief summary of the three packages.&lt;br /&gt;
&lt;br /&gt;
=====In POSIX Thread=====&lt;br /&gt;
&lt;br /&gt;
POSIX Threads target symmetric multiprocessing platforms, e.g. SMP multiprocessor computers, multi-core processors, and virtual shared-memory computers.&lt;br /&gt;
&lt;br /&gt;
Data layout: A single global memory. Each thread reads global shared data and writes to a private fraction of global data.&lt;br /&gt;
&lt;br /&gt;
A simplified translation of the example parallel-for loop is given below.&lt;br /&gt;
&lt;br /&gt;
Global declaration:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 float *x, *y;&lt;br /&gt;
 float vec[8];&lt;br /&gt;
 int nn, pp;&lt;br /&gt;
&lt;br /&gt;
thread code:&lt;br /&gt;
&lt;br /&gt;
 void *sub1(void *arg) {&lt;br /&gt;
    int p = (int)arg;&lt;br /&gt;
    float e_local = 0;&lt;br /&gt;
    for (int i=1+(nn*p)/pp; i&amp;lt;1+(nn*(p+1))/pp; ++i) {&lt;br /&gt;
      x[i] += ( y[i+1] + y[i-1] )*.5;&lt;br /&gt;
      e_local += y[i] * y[i];&lt;br /&gt;
    }&lt;br /&gt;
    vec[p] = e_local;&lt;br /&gt;
    return (void*) 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
main code:&lt;br /&gt;
&lt;br /&gt;
 x = new float[n+1];&lt;br /&gt;
 y = new float[n+1];&lt;br /&gt;
 ...&lt;br /&gt;
 float e = 0;&lt;br /&gt;
 int p_threads = 8;&lt;br /&gt;
 nn = n-1;&lt;br /&gt;
 pp = p_threads;&lt;br /&gt;
 pthread_t threads[8];&lt;br /&gt;
 pthread_attr_t attr;&lt;br /&gt;
 pthread_attr_init(&amp;amp;attr);&lt;br /&gt;
 for (int p=0; p&amp;lt;p_threads; ++p)&lt;br /&gt;
    pthread_create(&amp;amp;threads[p], &amp;amp;attr,&lt;br /&gt;
      sub1, (void *)p);&lt;br /&gt;
 for (int p=0; p&amp;lt;p_threads; ++p) {&lt;br /&gt;
    pthread_join(threads[p], NULL);&lt;br /&gt;
    e += vec[p];&lt;br /&gt;
 }&lt;br /&gt;
 ...&lt;br /&gt;
 delete[] x, y;&lt;br /&gt;
&lt;br /&gt;
=====In Intel Threading Building Blocks=====&lt;br /&gt;
&lt;br /&gt;
Intel TBB: A C++ library for thread programming, e.g. SMP multi-processor computers, multi-core processors, virtual shared memory computer.&lt;br /&gt;
&lt;br /&gt;
Data layout: A single global memory. Each thread reads global shared data and writes to a private fraction of global data.&lt;br /&gt;
&lt;br /&gt;
Translation of the example parallel-for loop is given below.&lt;br /&gt;
&lt;br /&gt;
Global:&lt;br /&gt;
 #include &amp;quot;tbb/task_scheduler_init.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/blocked_range.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/parallel_reduce.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/cache_aligned_allocator.h&amp;quot;&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
&lt;br /&gt;
thread code:&lt;br /&gt;
 struct sub1 {&lt;br /&gt;
    float ee;&lt;br /&gt;
    float *x, *y;&lt;br /&gt;
    sub1(float *xx, float *yy) : ee(0), x(xx), y(yy) {}&lt;br /&gt;
    sub1(sub1&amp;amp; s, split) { ee = 0; x = s.x; y = s.y; }&lt;br /&gt;
    void operator() (const blocked_range&amp;lt;int&amp;gt; &amp;amp; r){&lt;br /&gt;
      float e = ee;&lt;br /&gt;
      for (int i = r.begin(); i!= r.end(); ++i) {&lt;br /&gt;
        x[i] += ( y[i+1] + y[i-1] )*.5;&lt;br /&gt;
        e += y[i] * y[i];&lt;br /&gt;
      }&lt;br /&gt;
      ee = e;&lt;br /&gt;
    }&lt;br /&gt;
    void join(sub1&amp;amp; s) { ee += s.ee; }&lt;br /&gt;
 };&lt;br /&gt;
&lt;br /&gt;
main code:&lt;br /&gt;
 task_scheduler_init init;&lt;br /&gt;
 ...&lt;br /&gt;
 float e;&lt;br /&gt;
 float *x = cache_aligned_allocator&amp;lt;float&amp;gt;().allocate(n+1);&lt;br /&gt;
 float *y = cache_aligned_allocator&amp;lt;float&amp;gt;().allocate(n+1);&lt;br /&gt;
 ...&lt;br /&gt;
 sub1 s(x, y);&lt;br /&gt;
 parallel_reduce(blocked_range&amp;lt;int&amp;gt;(1, n, 1000), s);&lt;br /&gt;
 e = s.ee;&lt;br /&gt;
 ...&lt;br /&gt;
 cache_aligned_allocator&amp;lt;float&amp;gt;().deallocate(x, n+1);&lt;br /&gt;
 cache_aligned_allocator&amp;lt;float&amp;gt;().deallocate(y, n+1);&lt;br /&gt;
&lt;br /&gt;
=====In OpenMP shared memory parallel code annotations=====&lt;br /&gt;
&lt;br /&gt;
OpenMP: Compiler-directed parallelization via annotations, with a run-time system based on a thread library.&lt;br /&gt;
&lt;br /&gt;
A simplified translation of the example parallel-for loop is given below.&lt;br /&gt;
&lt;br /&gt;
Global:&lt;br /&gt;
 float e;&lt;br /&gt;
&lt;br /&gt;
main code:&lt;br /&gt;
 float *x = new float[n+1];&lt;br /&gt;
 float *y = new float[n+1];&lt;br /&gt;
 ...&lt;br /&gt;
 e = 0;&lt;br /&gt;
 #pragma omp for reduction(+:e)&lt;br /&gt;
 for (int i=1; i&amp;lt;n; ++i) {&lt;br /&gt;
    x[i] += ( y[i+1] + y[i-1] )*.5;&lt;br /&gt;
    e += y[i] * y[i];&lt;br /&gt;
 }&lt;br /&gt;
 ...&lt;br /&gt;
 delete[] x, y;&lt;br /&gt;
&lt;br /&gt;
====Summary: Difference among them====&lt;br /&gt;
&lt;br /&gt;
*Pthreads works for all the kinds of parallelism and can express functional parallelism easily, but it requires building specialized synchronization primitives and explicitly privatizing variables, which means more effort is needed to convert a serial program into a parallel one. &lt;br /&gt;
&lt;br /&gt;
*OpenMP provides many performance-enhancing features, such as the atomic, barrier and flush synchronization primitives. It is very simple to use OpenMP to exploit DOALL parallelism, but the syntax for expressing functional parallelism is awkward. &lt;br /&gt;
&lt;br /&gt;
*Intel TBB relies on generic programming, and it performs better with custom iteration spaces or complex reduction operations. It also provides generic parallel patterns for parallel while-loops, data-flow pipeline models, parallel sorts and prefixes, so it is better in cases that go beyond loop-based parallelism.&lt;br /&gt;
&lt;br /&gt;
Below is a table that illustrates the differences [[#References|&amp;lt;sup&amp;gt;[16]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
{| align=&amp;quot;center&amp;quot; cellpadding=&amp;quot;4&amp;quot;&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
!Type of Parallelism&lt;br /&gt;
!Posix Threads&lt;br /&gt;
!Intel&amp;amp;reg; TBB&lt;br /&gt;
!OpenMP 2.0&lt;br /&gt;
!OpenMP 3.0&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
!DOALL&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
!DOACROSS&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
!DOPIPE&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
! Reduction&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
! Functional Parallelism&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==Synchronization Mechanisms==&lt;br /&gt;
&lt;br /&gt;
===Overview===&lt;br /&gt;
&lt;br /&gt;
In order to accomplish the above parallelizations in a real system, memory accesses must be carefully orchestrated so that no information gets corrupted.  Every architecture handles synchronizing data among parallel processors slightly differently.  This section looks at several architectures and highlights a few of the mechanisms used to achieve this memory synchronization.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===IA-64===&lt;br /&gt;
IA-64 is an Intel architecture that is mainly used in Itanium processors.&lt;br /&gt;
====Spinlock====&lt;br /&gt;
The spinlock is used to guard against multiple simultaneous accesses to the critical section.  The critical section is a section of code that must be executed sequentially; it cannot be parallelized.  Therefore, when a parallel process comes across an occupied critical section, the process will “spin” until the lock is released. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // The lock variable contains zero (0) if the lock is&lt;br /&gt;
  // available. If it is 1, another process is in the critical section.&lt;br /&gt;
  //&lt;br /&gt;
  &lt;br /&gt;
  spin_lock:&lt;br /&gt;
    mov	ar.ccv = 0			// cmpxchg looks for avail (0)&lt;br /&gt;
    mov	r2 = 1				// cmpxchg sets to held (1)&lt;br /&gt;
  &lt;br /&gt;
  spin: &lt;br /&gt;
    ld8	r1 = [lock] ;;			// get lock in shared state&lt;br /&gt;
    cmp.ne	p1, p0 = r1, r2		// is lock held (ie, lock == 1)?&lt;br /&gt;
  &lt;br /&gt;
  (p1)	br.cond.spnt	spin ;;		// yes, continue spinning&lt;br /&gt;
    cmpxchg8.acq	r1 = [lock], r2		// attempt to grab lock&lt;br /&gt;
    cmp.ne p1, p0 = r1, r0		// was lock empty?&lt;br /&gt;
  &lt;br /&gt;
  (p1)	br.cond.spnt	spin ;;		// bummer, continue spinning&lt;br /&gt;
  &lt;br /&gt;
  cs_begin:&lt;br /&gt;
    // critical section code goes here...&lt;br /&gt;
  cs_end:&lt;br /&gt;
  &lt;br /&gt;
    st8.rel	[lock] = r0 ;;		//release the lock&lt;br /&gt;
&lt;br /&gt;
The above code demonstrates how a spin lock is used.  When a process reaches a spin lock, it checks whether the lock is available; if it is not, the process enters the spin loop, where it continuously checks for the lock to become free.  Once it finds the lock available, it attempts to obtain it.  If another process obtains the lock first, the process branches back into the spin loop and continues to wait.&lt;br /&gt;
&lt;br /&gt;
====Barrier====&lt;br /&gt;
&lt;br /&gt;
A barrier is a common mechanism used to hold up processes until all processes reach the same point.  The mechanism is useful in many kinds of parallelism (DOALL, DOACROSS, DOPIPE, reduction, and functional parallelism).  This architecture uses a unique form of the barrier mechanism called the sense-reversing barrier.  The idea behind this barrier is to prevent race conditions: if a fast process races ahead to the “next” instance of the barrier while slow processes are still leaving the current one, the fast process could trap the slow processes at the “next” barrier and thereby corrupt the memory synchronization. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====Dekker’s Algorithm====&lt;br /&gt;
&lt;br /&gt;
Dekker’s Algorithm uses variables to indicate which processors are using which resources; it essentially arbitrates for a resource using these variables.  Every processor has a flag that indicates when it is in the critical section.  When a processor is getting ready to enter the critical section, it sets its flag to one, then checks that all of the other processors' flags are zero before proceeding into the section.  This behavior is demonstrated in the code below.  It is a two-way multiprocessor system, so there are two processor flags, flag_me and flag_you. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // The flag_me variable is zero if we are not in the synchronization and &lt;br /&gt;
  // critical section code and non-zero otherwise; flag_you is similarly set&lt;br /&gt;
  // for the other processor.  This algorithm does not retry access to the &lt;br /&gt;
  // resource if there is contention.&lt;br /&gt;
  &lt;br /&gt;
  dekker:&lt;br /&gt;
    mov		r1 = 1 ;;		// my_flag = 1 (i want access)&lt;br /&gt;
    st8  	[flag_me] = r1&lt;br /&gt;
    mf ;;				// make st visible first&lt;br /&gt;
    ld8 	r2 = [flag_you] ;;		// is other's flag 0?&lt;br /&gt;
    cmp.eq p1, p0 = 0, r2&lt;br /&gt;
  &lt;br /&gt;
  (p0)	br.cond.spnt cs_skip ;;		// if not, resource in use &lt;br /&gt;
  &lt;br /&gt;
  cs_begin:&lt;br /&gt;
    // critical section code goes here...&lt;br /&gt;
  cs_end:&lt;br /&gt;
  &lt;br /&gt;
  cs_skip:&lt;br /&gt;
    st8.rel	[flag_me] = r0 ;;		// release lock&lt;br /&gt;
&lt;br /&gt;
====Lamport’s Algorithm====&lt;br /&gt;
&lt;br /&gt;
Lamport’s Algorithm is similar to a spinlock, with the addition of a fairness mechanism that keeps track of the order in which processes request the shared resource and grants access in that same order.  It makes use of two variables, x and y, and a shared array, b.  The code below illustrates this algorithm.  [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // The proc_id variable holds a unique, non-zero id for the process that &lt;br /&gt;
  // attempts access to the critical section.  x and y are the synchronization&lt;br /&gt;
  // variables that indicate who is in the critical section and who is attempting&lt;br /&gt;
  // entry. ptr_b_1 and ptr_b_id point at the 1'st and id'th element of b[].&lt;br /&gt;
  //&lt;br /&gt;
  &lt;br /&gt;
  lamport:&lt;br /&gt;
    	ld8		r1 = [proc_id] ;;	// r1 = unique process id&lt;br /&gt;
  start:&lt;br /&gt;
    	st8	[ptr_b_id] = r1		// b[id] = &amp;quot;true&amp;quot;&lt;br /&gt;
    	st8	[x] = r1			// x = process id&lt;br /&gt;
   	mf					// MUST fence here!&lt;br /&gt;
    	ld8	r2 = [y] ;;&lt;br /&gt;
    	cmp.ne p1, p0 = 0, r2;;		// if (y !=0) then...&lt;br /&gt;
  (p1)	st8	[ptr_b_id] = r0		// ... b[id] = &amp;quot;false&amp;quot;&lt;br /&gt;
  (p1)	br.cond.sptk	wait_y		// ... wait until y == 0&lt;br /&gt;
  &lt;br /&gt;
  	st8	[y] = r1		// y = process id&lt;br /&gt;
  	mf&lt;br /&gt;
  	ld8 	r3 = [x] ;;		&lt;br /&gt;
  	cmp.eq p1, p0 = r1, r3 ;;	// if (x == id) then..&lt;br /&gt;
  (p1)	br.cond.sptk cs_begin		// ... enter critical section&lt;br /&gt;
  &lt;br /&gt;
  	st8 	[ptr_b_id] = r0		// b[id] = &amp;quot;false&amp;quot;&lt;br /&gt;
  	ld8	r3 = [ptr_b_1]		// r3 = &amp;amp;b[1]&lt;br /&gt;
  	mov	ar.lc = N-1 ;;		// lc = number of processors - 1&lt;br /&gt;
  wait_b:&lt;br /&gt;
  	ld8	r2 = [r3] ;;		&lt;br /&gt;
  	cmp.ne p1, p0 = 0, r2		// if (b[j] != 0) then...&lt;br /&gt;
  (p1)	br.cond.spnt	wait_b ;;	// ... wait until b[j] == 0&lt;br /&gt;
  	add	r3 = 8, r3		// r3 = &amp;amp;b[j+1]&lt;br /&gt;
  	br.cloop.sptk	wait_b ;;	// loop over b[j] for each j&lt;br /&gt;
  &lt;br /&gt;
  wait_y:&lt;br /&gt;
  	ld8	r2 = [y] ;;		// if (y != 0) then...&lt;br /&gt;
  	cmp.ne p1, p2 = 0, r2&lt;br /&gt;
  (p1)  br.cond.spnt 	wait_y&lt;br /&gt;
  	br	start			// back to start to try again&lt;br /&gt;
  &lt;br /&gt;
  cs_begin:&lt;br /&gt;
    // critical section code goes here...&lt;br /&gt;
  cs_end:&lt;br /&gt;
  &lt;br /&gt;
  	st8	[y] = r0		// release the lock&lt;br /&gt;
  	st8.rel[ptr_b_id] = r0 ;;	// b[id] = &amp;quot;false&amp;quot;&lt;br /&gt;
===IA-32=== &lt;br /&gt;
&lt;br /&gt;
IA-32 is an Intel architecture that is also known as x86.  This is a very widely used architecture.&lt;br /&gt;
&lt;br /&gt;
====Locked Atomic Operation====&lt;br /&gt;
This is the main mechanism this architecture uses to manage shared data structures such as semaphores and system segments.  The processor uses the following three interdependent mechanisms to implement locked atomic operations: [[#References|&amp;lt;sup&amp;gt;[13]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*  Guaranteed atomic operations.&lt;br /&gt;
*  Bus locking, using the LOCK# signal and the LOCK instruction prefix.&lt;br /&gt;
*  Cache coherency protocols that ensure that atomic operations can be carried out on cached data structures (cache lock). This mechanism is present in the P6 family processors.&lt;br /&gt;
&lt;br /&gt;
=====Guaranteed Atomic Operation=====&lt;br /&gt;
The following operations are guaranteed to be carried out atomically: [[#References|&amp;lt;sup&amp;gt;[13]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*  Reading or writing a byte.&lt;br /&gt;
*  Reading or writing a word aligned on a 16-bit boundary.&lt;br /&gt;
*  Reading or writing a doubleword aligned on a 32-bit boundary.&lt;br /&gt;
The P6 family processors guarantee that the following additional memory operations will always be carried out atomically:&lt;br /&gt;
*  Reading or writing a quadword aligned on a 64-bit boundary. (This operation is also guaranteed on the Pentium® processor.)&lt;br /&gt;
*  16-bit accesses to uncached memory locations that fit within a 32-bit data bus.&lt;br /&gt;
*  16-, 32-, and 64-bit accesses to cached memory that fit within a 32-Byte cache line.&lt;br /&gt;
&lt;br /&gt;
=====Bus Locking=====&lt;br /&gt;
A LOCK# signal is asserted automatically during certain critical memory operations in order to lock the system bus and grant control to the processor executing the locked operation.  While the signal is asserted, no other processor can take control of the bus.&lt;br /&gt;
&lt;br /&gt;
===Linux Kernel===&lt;br /&gt;
&lt;br /&gt;
The Linux kernel is sometimes referred to as an “architecture”; however, it is fairly unconventional in that it is an open-source operating system with full access to the hardware. It uses many common synchronization mechanisms, so it will be considered here. [[#References|&amp;lt;sup&amp;gt;[15]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====Busy-waiting lock====&lt;br /&gt;
&lt;br /&gt;
=====Spinlocks=====&lt;br /&gt;
&lt;br /&gt;
This mechanism is very similar to the mechanism described in the IA-64 architecture.  It is a mechanism used to manage access to a critical section of code.  If a process tries to access the critical section and is rejected it will sit and “spin” while it waits for the lock to be released.&lt;br /&gt;
&lt;br /&gt;
=====Rwlocks=====&lt;br /&gt;
&lt;br /&gt;
This is a special kind of spinlock intended for structures that are frequently read but rarely written.  It allows multiple reads in parallel, which can increase efficiency because processes do not have to sit and wait merely to carry out a read.  As before, however, only one write is allowed at a time, with no reads done in parallel.&lt;br /&gt;
&lt;br /&gt;
=====Brlocks=====&lt;br /&gt;
&lt;br /&gt;
This is a very fast read/write lock, but it has a write-side penalty.  Its main advantage is preventing cache “ping-pong” in the multiple-reader case.&lt;br /&gt;
&lt;br /&gt;
====Sleeper locks====&lt;br /&gt;
&lt;br /&gt;
=====Semaphores=====&lt;br /&gt;
&lt;br /&gt;
A semaphore is a special variable that acts similarly to a lock.  If the semaphore can be acquired, the process proceeds into the critical section.  If the semaphore cannot be acquired, the process is “put to sleep” and the processor is used for another process; the sleeping process's context is saved off where it can be retrieved later.  Once the semaphore becomes available, the “sleeping” process is woken up, obtains the semaphore, and proceeds into the critical section. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===CUDA=== &lt;br /&gt;
&lt;br /&gt;
CUDA, or Compute Unified Device Architecture, is an Nvidia architecture which is the computing engine for their graphics processors.&lt;br /&gt;
&lt;br /&gt;
====__syncthreads====&lt;br /&gt;
&lt;br /&gt;
The __syncthreads operation can be used at the end of a parallel section as a sort of “barrier” mechanism.  It is necessary to ensure the accuracy of the memory.  In the following example, there are two calls to __syncthreads; both are necessary to ensure the expected results are obtained.  Without them, myArray[tid] could end up being either 2 or the original value of myArray[], depending on when the read and write take place.[[#References|&amp;lt;sup&amp;gt;[14]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // myArray is an array of integers located in global or shared&lt;br /&gt;
  // memory&lt;br /&gt;
  __global__ void MyKernel(int* result) {&lt;br /&gt;
     int tid = threadIdx.x;&lt;br /&gt;
    ...    &lt;br /&gt;
     int ref1 = myArray[tid];&lt;br /&gt;
      __syncthreads();&lt;br /&gt;
    myArray[tid + 1] = 2;&lt;br /&gt;
      __syncthreads();&lt;br /&gt;
    int ref2 = myArray[tid];&lt;br /&gt;
    result[tid] = ref1 * ref2;&lt;br /&gt;
    ...    &lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
===PowerPC===&lt;br /&gt;
&lt;br /&gt;
PowerPC is an IBM architecture whose name stands for Performance Optimization With Enhanced RISC-Performance Computing.  It is a RISC architecture that was originally designed for PCs; however, it has grown into the embedded and high-performance space. [[#References|&amp;lt;sup&amp;gt;[18]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====Isync==== &lt;br /&gt;
&lt;br /&gt;
isync is an instruction that guarantees that all of the code preceding the isync instruction has completed before any code following it can execute.  It also ensures that any cache-block-invalidation instructions executed before the isync have taken effect with respect to the processor executing the isync, and it causes any prefetched instructions to be discarded. [[#References|&amp;lt;sup&amp;gt;[17]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====Memory Barrier Instructions====&lt;br /&gt;
&lt;br /&gt;
Memory barrier instructions can be used to control the order in which storage accesses are performed. [[#References|&amp;lt;sup&amp;gt;[17]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
=====HeavyWeight sync=====&lt;br /&gt;
This memory barrier creates an ordering function for the storage accesses that are associated with all of the instructions that are executed by the processor executing the sync instruction.&lt;br /&gt;
&lt;br /&gt;
=====LightWeight sync=====&lt;br /&gt;
This memory barrier creates an ordering function for the storage accesses caused by LOAD and STORE instructions that are executed by the processor executing the sync instruction.  It applies only to accesses to storage locations that are neither Write Through Required nor Caching Inhibited.&lt;br /&gt;
&lt;br /&gt;
=====Enforce In-order Execution of I/O=====&lt;br /&gt;
The Enforce In-order Execution of I/O, or eieio, instruction is a memory barrier that creates an ordering function for the storage accesses caused by LOADs and STOREs.  These instructions are split into two groups: [[#References|&amp;lt;sup&amp;gt;[17]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
1. Loads and stores to storage that is both Caching Inhibited and Guarded, and stores to main storage caused by stores to storage that is Write Through Required&lt;br /&gt;
&lt;br /&gt;
2. Stores to storage that is Memory Coherence Required and is neither Write Through Required nor Caching Inhibited&lt;br /&gt;
&lt;br /&gt;
For the first group the ordering done by the memory barrier for accesses in this set is not cumulative.  For the second group the ordering done by the memory barrier for accesses in this set is cumulative.&lt;br /&gt;
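As a hedged illustration only (this spinlock sketch is the classic PowerPC lock idiom, not an example from the referenced manual; the register choices and lock layout are assumptions), the following shows where isync and a lightweight sync typically sit:&lt;br /&gt;

```asm
loop:  lwarx   r5, 0, r3     # load-and-reserve the lock word
       cmpwi   r5, 0         # is the lock already held?
       bne-    loop          # yes: keep spinning
       stwcx.  r4, 0, r3     # try to store "locked" atomically
       bne-    loop          # reservation lost: retry
       isync                 # discard prefetched instructions so no
                             # access starts before the lock is held
       # ... critical section loads and stores ...
       lwsync                # lightweight sync: order the section's
                             # accesses before the releasing store
       stw     r0, 0(r3)     # release the lock (r0 holds 0)
```

The isync after the successful store-conditional keeps the critical section from starting early, and the lwsync before the release keeps it from finishing late.&lt;br /&gt;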
&lt;br /&gt;
&lt;br /&gt;
===Cell Broadband Engine===&lt;br /&gt;
Cell Broadband Engine, also referred to as Cell or Cell BE, is an IBM architecture whose first major application was in Sony’s PlayStation 3.  Cell has streamlined coprocessing elements, which make it well suited to fast multimedia and vector processing applications. [[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
This architecture is interesting because it uses a shared-memory model in which LOADs and STOREs follow a “weakly consistent” storage model.  This means that the following orders may each differ from one another: [[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
*The order in which any processor element (PPE or SPE) performs storage accesses&lt;br /&gt;
*The order in which the accesses are performed with respect to another processor element&lt;br /&gt;
*The order in which the accesses are performed in main storage&lt;br /&gt;
&lt;br /&gt;
It is important that accesses to the shared memory happen in the correct program order, or information could be lost or corrupted.  To ensure that this doesn’t happen, the following memory barrier commands are used:&lt;br /&gt;
&lt;br /&gt;
====Fence====&lt;br /&gt;
A command issued with a fence is not executed until all previously issued commands within the same “tag group” have been performed.  However, a command issued after the fence command might still be executed before it.&lt;br /&gt;
&lt;br /&gt;
====Barrier====&lt;br /&gt;
A command issued with a barrier, and all commands issued after it, are not executed until all previously issued commands have been performed. &lt;br /&gt;
&lt;br /&gt;
The diagram below demonstrates this behavior. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;ol&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://openmp.org/wp/about-openmp/ OpenMP.org]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[https://docs.google.com/viewer?a=v&amp;amp;pid=gmail&amp;amp;attid=0.1&amp;amp;thid=126f8a391c11262c&amp;amp;mt=application%2Fpdf&amp;amp;url=https%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3D2%26ik%3Dd38b56c94f%26view%3Datt%26th%3D126f8a391c11262c%26attid%3D0.1%26disp%3Dattd%26realattid%3Df_g602ojwk0%26zw&amp;amp;sig=AHIEtbTeQDhK98IswmnVSfrPBMfmPLH5Nw An Optimal Abstraction Model for Hardware Multithreading in Modern Processor Architectures]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.threadingbuildingblocks.org/uploads/81/91/Latest%20Open%20Source%20Documentation/Reference.pdf Intel Threading Building Blocks 2.2 for Open Source Reference Manual]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.csc.ncsu.edu/faculty/efg/506/s10/ NCSU CSC 506 Parallel Computing Systems]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://parallel-for.sourceforge.net/tbb.html Sourceforge.net]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[https://computing.llnl.gov/tutorials/openMP/ OpenMP]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.computer.org/portal/web/csdl/doi/10.1109/SNPD.2009.16 Barrier Optimization for OpenMP Program]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://cs.anu.edu.au/~Alistair.Rendell/sc02/module3.pdf Performance Programming: Theory, Practice and Case Studies]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://software.intel.com/en-us/articles/intel-threading-building-blocks-openmp-or-native-threads/ Intel® Threading Building Blocks, OpenMP, or native threads?]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[https://computing.llnl.gov/tutorials/pthreads/#Joining POSIX Threads Programming by Blaise Barney, Lawrence Livermore National Laboratory]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://homepage.mac.com/dbutenhof/Threads/source.html Programing with POSIX Threads source code]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://refspecs.freestandards.org/IA64-softdevman-vol2.pdf IA-64 Software Development Manual]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://refspecs.freestandards.org/IA32-softdevman-vol3.pdf IA-32 Software Development Manual]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf CUDA Programming Guide]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.google.com/url?sa=t&amp;amp;source=web&amp;amp;cd=6&amp;amp;ved=0CEQQFjAF&amp;amp;url=http%3A%2F%2Flinuxindore.com%2Fdownloads%2Fdownload%2Fdata-structures%2Flinux-kernel-arch&amp;amp;ei=jxZWTaGTNI34sAPWm-ScDA&amp;amp;usg=AFQjCNG9UOAz7rHfwUDfayhr50M87uNOYA&amp;amp;sig2=azvo4h85RkoNHcZUtNIkJw Linux Kernel Architecture Overview]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/ch_3_jb/Parallel_Programming_Model_Support Spring 2010 NC State ECE/CSC506 Chapter 3 wiki]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://download.boulder.ibm.com/ibmdl/pub/software/dw/library/es-ppcbook2.zip PowerPC Architecture Book]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.google.com/url?sa=t&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCEQFjAA&amp;amp;url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FPowerPC&amp;amp;ei=77RYTejKFZSisQOm6-GiDA&amp;amp;usg=AFQjCNFt0LpxmNviHKFxCur-amK9HAG08Q&amp;amp;sig2=Kmm9RzJY-4AlG66AwWxlRA Wikipedia information on PowerPC]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.redbooks.ibm.com/redbooks/pdfs/sg247575.pdf IBM Cell Architecture Book]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.google.com/url?sa=t&amp;amp;source=web&amp;amp;cd=5&amp;amp;ved=0CDgQFjAE&amp;amp;url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FCell_(microprocessor)&amp;amp;ei=3MJYTeK5Aov6sAPC5-yiDA&amp;amp;usg=AFQjCNENg6PvayZebvtWf7KQstpJDk6URw&amp;amp;sig2=xs87jzBsFgneYOxP0k-_aQ Wikipedia information on Cell]&amp;lt;/li&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/ol&amp;gt;&lt;/div&gt;</summary>
		<author><name>Akrepask</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=43734</id>
		<title>ECE506 Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=43734"/>
		<updated>2011-02-14T04:35:23Z</updated>

		<summary type="html">&lt;p&gt;Akrepask: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.&lt;br /&gt;
&lt;br /&gt;
=Supplements to Solihin Text=&lt;br /&gt;
&lt;br /&gt;
Post links to the textbook supplements in this section.&lt;br /&gt;
&lt;br /&gt;
*Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]&lt;br /&gt;
*Chapter 2 (Still being revised) [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]&lt;br /&gt;
*Chapter 2a (Submitted for 1st review) [[ CSC/ECE 506 Spring 2011/ch2a mc | Current Data-Parallel Architectures ]]&lt;br /&gt;
*Chapter 3 (Revision 1) [[ CSC/ECE 506 Spring 2011/ch3 ab | Parallel Architecture Mechanisms and Programming Models ]]&lt;/div&gt;</summary>
		<author><name>Akrepask</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch3_ab&amp;diff=43733</id>
		<title>CSC/ECE 506 Spring 2011/ch3 ab</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch3_ab&amp;diff=43733"/>
		<updated>2011-02-14T04:32:33Z</updated>

		<summary type="html">&lt;p&gt;Akrepask: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Supplement to Chapter 3: Support for parallel-programming models. Discuss how DOACROSS, DOPIPE, DOALL, etc. are implemented in packages such as Posix threads, Intel Thread Building Blocks, OpenMP 2.0 and 3.0.&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this wiki supplement, we discuss how the three kinds of parallelism, i.e., DOALL, DOACROSS, and DOPIPE, are implemented in the thread packages OpenMP, Intel Threading Building Blocks, and POSIX Threads. We discuss each package from the perspective of variable scopes and Reduction/DOALL/DOACROSS/DOPIPE implementations.&lt;br /&gt;
&lt;br /&gt;
==Implementation==&lt;br /&gt;
&lt;br /&gt;
===OpenMP===&lt;br /&gt;
The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix platforms and Windows NT platforms. Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.&lt;br /&gt;
&lt;br /&gt;
====Variable Clauses ====&lt;br /&gt;
There are many different types of clauses in OpenMP, each with its own characteristics. Here we introduce data-sharing attribute clauses, synchronization clauses, scheduling clauses, initialization clauses, and reduction. &lt;br /&gt;
=====Data sharing attribute clauses=====&lt;br /&gt;
* ''shared'': the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.&lt;br /&gt;
  Format: shared ''(list)''&lt;br /&gt;
&lt;br /&gt;
  SHARED variables behave as follows:&lt;br /&gt;
  1. Existing in only one memory location and all threads can read or write to that address &lt;br /&gt;
&lt;br /&gt;
* ''private'': the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.&lt;br /&gt;
  Format: private ''(list)''&lt;br /&gt;
&lt;br /&gt;
  PRIVATE variables behave as follows: &lt;br /&gt;
    1. A new object of the same type is declared once for each thread in the team&lt;br /&gt;
    2. All references to the original object are replaced with references to the new object&lt;br /&gt;
    3. Variables declared PRIVATE should be assumed to be uninitialized for each thread &lt;br /&gt;
&lt;br /&gt;
* ''default'': allows the programmer to state that the default data scoping within a parallel region will be either ''shared'', or ''none'' for C/C++, or ''shared'', ''firstprivate'', ''private'', or ''none'' for Fortran.  The ''none'' option forces the programmer to declare each variable in the parallel region using the data sharing attribute clauses.&lt;br /&gt;
  Format: default (shared | none)&lt;br /&gt;
&lt;br /&gt;
  DEFAULT variables behave as follows: &lt;br /&gt;
    1. Specific variables can be exempted from the default using the PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE, and REDUCTION clauses. &lt;br /&gt;
    2. Using NONE as a default requires that the programmer explicitly scope all variables.&lt;br /&gt;
&lt;br /&gt;
=====Synchronization clauses=====&lt;br /&gt;
* ''critical section'': the enclosed code block will be executed by only one thread at a time, and not simultaneously executed by multiple threads. It is often used to protect shared data from race conditions.&lt;br /&gt;
  Format: #pragma omp critical ''[ name ]  newline''&lt;br /&gt;
           ''structured_block''&lt;br /&gt;
&lt;br /&gt;
  CRITICAL SECTION behaves as follows:&lt;br /&gt;
    1. If a thread is currently executing inside a CRITICAL region and another thread reaches that CRITICAL region and attempts to execute it, it will block until the first thread exits that CRITICAL region.&lt;br /&gt;
    2. It is illegal to branch into or out of a CRITICAL block. &lt;br /&gt;
&lt;br /&gt;
* ''atomic'': similar to ''critical section'', but advises the compiler to use special hardware instructions for better performance. Compilers may choose to ignore this suggestion and use a ''critical section'' instead.&lt;br /&gt;
  Format: #pragma omp atomic  ''newline''&lt;br /&gt;
           ''statement_expression''&lt;br /&gt;
&lt;br /&gt;
  ATOMIC behaves as follows:&lt;br /&gt;
    1. Applies only to the single, immediately following statement.&lt;br /&gt;
    2. An atomic statement must follow a specific syntax. &lt;br /&gt;
&lt;br /&gt;
* ''ordered'': the structured block is executed in the order in which iterations would be executed in a sequential loop&lt;br /&gt;
  Format: #pragma omp for ordered ''[clauses...]''&lt;br /&gt;
          ''(loop region)''&lt;br /&gt;
          #pragma omp ordered  ''newline''&lt;br /&gt;
          ''structured_block&lt;br /&gt;
          (end of loop region)''&lt;br /&gt;
&lt;br /&gt;
  ORDERED behaves as follows:&lt;br /&gt;
    1. Only appears in the dynamic extent of ''for'' or ''parallel for (C/C++)'' regions.&lt;br /&gt;
    2. Only one thread is allowed in an ordered section at any time.&lt;br /&gt;
    3. It is illegal to branch into or out of an ORDERED block. &lt;br /&gt;
    4. A loop which contains an ORDERED directive, must be a loop with an ORDERED clause. &lt;br /&gt;
&lt;br /&gt;
* ''barrier'': each thread waits until all of the other threads of a team have reached this point. A work-sharing construct has an implicit barrier synchronization at the end.&lt;br /&gt;
   Format: #pragma omp barrier  ''newline''&lt;br /&gt;
&lt;br /&gt;
   BARRIER behaves as follows:&lt;br /&gt;
    1. All threads in a team (or none) must execute the BARRIER region.&lt;br /&gt;
    2. The sequence of work-sharing regions and barrier regions encountered must be the same for every thread in a team.&lt;br /&gt;
&lt;br /&gt;
*''taskwait'': specifies a wait on the completion of child tasks generated since the beginning of the current task.&lt;br /&gt;
   Format: #pragma omp taskwait  ''newline''&lt;br /&gt;
&lt;br /&gt;
   TASKWAIT behaves as follows:&lt;br /&gt;
    1. Placed only at a point where a base language statement is allowed.&lt;br /&gt;
    2. Not be used in place of the statement following an if, while, do, switch, or label.&lt;br /&gt;
&lt;br /&gt;
*''flush'': The FLUSH directive identifies a synchronization point at which the implementation must provide a consistent view of memory. Thread-visible variables are written back to memory at this point. &lt;br /&gt;
   Format: #pragma omp flush ''(list)  newline''&lt;br /&gt;
&lt;br /&gt;
   FLUSH behaves as follows:&lt;br /&gt;
    1. The optional list contains a list of named variables that will be flushed in order to avoid flushing all variables.&lt;br /&gt;
    2. Implementations must ensure any prior modifications to thread-visible variables are visible to all threads after this point.&lt;br /&gt;
&lt;br /&gt;
=====Scheduling clauses=====&lt;br /&gt;
*''schedule(type, chunk)'': This is useful if the work sharing construct is a do-loop or for-loop. The iteration(s) in the work sharing construct are assigned to threads according to the scheduling method defined by this clause. The three types of scheduling are:&lt;br /&gt;
#''static'': Here, all the threads are allocated iterations before they execute the loop iterations. The iterations are divided among threads equally by default. However, specifying an integer for the parameter &amp;quot;chunk&amp;quot; will allocate &amp;quot;chunk&amp;quot; number of contiguous iterations to a particular thread.&lt;br /&gt;
#''dynamic'': Here, some of the iterations are allocated to a smaller number of threads. Once a particular thread finishes its allocated iteration, it returns to get another one from the iterations that are left. The parameter &amp;quot;chunk&amp;quot; defines the number of contiguous iterations that are allocated to a thread at a time.&lt;br /&gt;
#''guided'': A large chunk of contiguous iterations are allocated to each thread dynamically (as above). The chunk size decreases exponentially with each successive allocation to a minimum size specified in the parameter &amp;quot;chunk&amp;quot;&lt;br /&gt;
=====Initialization=====&lt;br /&gt;
* ''firstprivate'': the data is private to each thread, but initialized using the value of the variable using the same name from the master thread.&lt;br /&gt;
  Format: firstprivate ''(list)''&lt;br /&gt;
&lt;br /&gt;
  FIRSTPRIVATE variables behave as follows: &lt;br /&gt;
    1. Listed variables are initialized according to the value of their original objects prior to entry into the parallel or work-sharing construct. &lt;br /&gt;
&lt;br /&gt;
* ''lastprivate'': the data is private to each thread. The value of this private data will be copied to a global variable using the same name outside the parallel region if current iteration is the last iteration in the parallelized loop.  A variable can be both ''firstprivate'' and ''lastprivate''. &lt;br /&gt;
  Format: lastprivate ''(list)''&lt;br /&gt;
&lt;br /&gt;
* ''threadprivate'': The data is a global data, but it is private in each parallel region during the runtime. The difference between ''threadprivate'' and ''private'' is the global scope associated with threadprivate and the preserved value across parallel regions.&lt;br /&gt;
  Format: #pragma omp threadprivate ''(list)''&lt;br /&gt;
&lt;br /&gt;
  THREADPRIVATE variables behave as follows: &lt;br /&gt;
    1. On first entry to a parallel region, data in THREADPRIVATE variables and common blocks should be assumed undefined. &lt;br /&gt;
    2. The THREADPRIVATE directive must appear after every declaration of a thread private variable/common block.&lt;br /&gt;
&lt;br /&gt;
=====Reduction=====&lt;br /&gt;
* ''reduction'': the variable has a local copy in each thread, but the values of the local copies will be combined (reduced) into a global shared variable. This is very useful when a particular operation (specified in &amp;quot;operator&amp;quot; for this clause) on a datatype runs iteratively, so that its value at a particular iteration depends on its value at a previous iteration. Essentially, the steps leading up to the operational increment are parallelized, but the threads gather and wait before updating the variable, then update it in order so as to avoid a race condition. &lt;br /&gt;
  Format: reduction ''(operator: list)''&lt;br /&gt;
&lt;br /&gt;
  REDUCTION variables behave as follows: &lt;br /&gt;
    1. Variables in the list must be named scalar variables. They can not be array or structure type variables. They must also be declared SHARED in the enclosing context.&lt;br /&gt;
    2. Reduction operations may not be associative for real numbers.&lt;br /&gt;
&lt;br /&gt;
====DOALL====&lt;br /&gt;
&lt;br /&gt;
In code 3.20, we first include the header file ''omp.h'', which contains the OpenMP function declarations. Next, a parallel region is started by #pragma omp parallel, and we enclose it in curly brackets. We can use (setenv OMP_NUM_THREADS n) to specify the number of threads. Another way to set the number of threads is to call the function omp_set_num_threads(n) directly. &lt;br /&gt;
Code 3.20 has only one loop to execute, and we want it to execute in parallel, so we combine the start of the parallel loop and the start of the parallel region with one directive, ''#pragma omp parallel for''. &lt;br /&gt;
 &lt;br /&gt;
 '''Code 3.20 A DOALL parallelism example in OpenMP'''&lt;br /&gt;
 '''#include''' &amp;lt;omp.h&amp;gt;&lt;br /&gt;
 '''...'''&lt;br /&gt;
 '''#pragma''' omp parallel //start of parallel region&lt;br /&gt;
 '''{'''&lt;br /&gt;
  '''...'''&lt;br /&gt;
  '''#pragma''' omp parallel for default (shared)&lt;br /&gt;
  '''for''' ( i = 0; i &amp;lt; n ; i++)&lt;br /&gt;
    '''A[i]''' = A[i] + A[i] - 3.0;&lt;br /&gt;
 '''}'''//end for parallel region&lt;br /&gt;
&lt;br /&gt;
Clearly, there is no loop-carried dependence in the ''i'' loop. With OpenMP, we only need to insert the ''pragma'' directive ''parallel for''. The ''default(shared)'' clause states that all variables within the scope of the loop are shared unless otherwise specified.&lt;br /&gt;
&lt;br /&gt;
====DOACROSS====&lt;br /&gt;
We will now introduce how to implement DOACROSS in OpenMP. Here is an example code which has not been parallelized yet.&lt;br /&gt;
 &lt;br /&gt;
 '''Sample Code'''&lt;br /&gt;
 01: for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 02: for(j=1; j&amp;lt;N; j++){&lt;br /&gt;
 03: a[i][j]=a[i-1][j]+a[i][j-1];&lt;br /&gt;
 04: }&lt;br /&gt;
 05: }&lt;br /&gt;
&lt;br /&gt;
From this sample code, there are clearly loop-carried dependences: &lt;br /&gt;
 a[i,j] -&amp;gt; T a[i+1,j] and a[i,j] -&amp;gt; T a[i,j+1]&lt;br /&gt;
&lt;br /&gt;
In OpenMP, DOALL parallelism can be implemented by inserting a “#pragma omp for” before the “for” loop in the source code, but there is no pragma corresponding to DOACROSS parallelism.&lt;br /&gt;
&lt;br /&gt;
When we implement DOACROSS, we use a shared array &amp;quot;_mylocks[threadid]&amp;quot;, which stores the synchronization events of each thread. In addition, a private variable _counter0 is defined to indicate the event that the current thread is waiting for. &lt;br /&gt;
The number of threads is obtained with the function &amp;quot;omp_get_num_threads()&amp;quot; and the current thread's ID with the function &amp;quot;omp_get_thread_num()&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
*omp_get_num_threads(): Returns the number of threads that are currently in the team executing the parallel region from which it is called.&lt;br /&gt;
 Format: #include &amp;lt;omp.h&amp;gt;&lt;br /&gt;
         int omp_get_num_threads(void)&lt;br /&gt;
&lt;br /&gt;
 OMP_GET_NUM_THREADS behaves as follows:&lt;br /&gt;
  1. If this call is made from a serial portion of the program, or a nested parallel region that is serialized, it will return 1. &lt;br /&gt;
  2. The default number of threads is implementation dependent. &lt;br /&gt;
&lt;br /&gt;
*omp_get_thread_num(): Returns the thread number of the thread, within the team, making this call. This number will be between 0 and OMP_GET_NUM_THREADS-1. The master thread of the team is thread 0 &lt;br /&gt;
 Format: #include &amp;lt;omp.h&amp;gt;&lt;br /&gt;
         int omp_get_thread_num(void)&lt;br /&gt;
&lt;br /&gt;
 OMP_GET_THREAD_NUM behaves as follows:&lt;br /&gt;
  1. If called from a nested parallel region, or a serial region, this function will return 0. &lt;br /&gt;
&lt;br /&gt;
Now, let's look at the parallelized code and its explanation. &lt;br /&gt;
&lt;br /&gt;
 01: int _mylocks[256]; 		//thread’s synchronized array&lt;br /&gt;
 02: #pragma omp parallel&lt;br /&gt;
 03: {&lt;br /&gt;
 04:   int _counter0 = 1;&lt;br /&gt;
 05:   int _my_id = omp_get_thread_num();&lt;br /&gt;
 06:   int _my_nprocs= omp_get_num_threads();&lt;br /&gt;
 07:   _mylocks[_my_id] = 0;&lt;br /&gt;
 08:   for(j_tile = 0; j_tile&amp;lt;N-1; j_tile+=M){&lt;br /&gt;
 09:     if(_my_id&amp;gt;0) {&lt;br /&gt;
 10:       do{&lt;br /&gt;
 11:         #pragma omp flush(_mylocks)&lt;br /&gt;
 12:       } while(_mylocks[_my_id-1]&amp;lt;_counter0);&lt;br /&gt;
 13:       #pragma omp flush(a, _mylocks)&lt;br /&gt;
 14:       _counter0 += 1;&lt;br /&gt;
 15:     }&lt;br /&gt;
 16:     #pragma omp for nowait&lt;br /&gt;
 17:     for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 18:       for(j=j_tile;j&amp;lt;j_tile+M;j++){&lt;br /&gt;
 19:         a[i][j]=a[i-1][j]+a[i][j-1];&lt;br /&gt;
 20:       }&lt;br /&gt;
 21:     }&lt;br /&gt;
 22:     _mylocks[_my_id] += 1;&lt;br /&gt;
 23:     #pragma omp flush(a, _mylocks)&lt;br /&gt;
 24:   }&lt;br /&gt;
 25: }&lt;br /&gt;
&lt;br /&gt;
We parallelized the original program in two steps.&lt;br /&gt;
&lt;br /&gt;
*First step: We divide the i loop among the processors by inserting the OpenMP construct “#pragma omp for nowait” (line 16), so that each processor takes some of the iterations of loop i. We do the same for loop j: assuming each tile has size M, each processor executes M iterations of loop j at a time. In order to keep the total number of iterations equal to the original program, loop j has to be enclosed in a tiling loop, ''for (j_tile = 0; j_tile &amp;lt; N-1; j_tile += M)'' (line 8).&lt;br /&gt;
The lower bound of loop j is set to j_tile and the upper bound to j_tile+M-1. The other statements are kept unchanged.&lt;br /&gt;
&lt;br /&gt;
*Second step: We synchronize neighboring threads. After the first step, the processors each compute one tile at a time. If all processors ran fully in parallel, the dependence would be violated, so we have to synchronize each thread with its neighbor.&lt;br /&gt;
We use 4 variables, as follows: &lt;br /&gt;
1. A private variable _my_nprocs = omp_get_num_threads(), which indicates the total number of threads running the corresponding parallel region.&lt;br /&gt;
2. A private variable _my_id = omp_get_thread_num(), which indicates the unique ID of the current thread.&lt;br /&gt;
3. A shared array _mylocks[proc], with every element initialized to 0, which indicates how many blocks thread proc has finished computing.&lt;br /&gt;
4. A private variable _counter0, initialized to 1, which indicates the block that the current thread is waiting for.&lt;br /&gt;
&lt;br /&gt;
With these four variables, the threads are synchronized as follows:&lt;br /&gt;
The first thread (ID 0) continues to run without waiting (line 9). Every other thread spins at line 12 for as long as the value in ''_mylocks[_my_id-1]'' is smaller than ''_counter0''.&lt;br /&gt;
&lt;br /&gt;
Once the block that the current thread is waiting for has been completed, the current thread can proceed past line 12, and it marks the next block it will wait for by adding 1 to ''_counter0'' (line 14).&lt;br /&gt;
&lt;br /&gt;
When the current thread finishes its own block, it records this by incrementing ''_mylocks[_my_id]''. Once the neighboring thread finds that the value has changed, it continues running, and so on. The figure below presents this.&lt;br /&gt;
[[Image:Synchorization.jpg]]&lt;br /&gt;
&lt;br /&gt;
====DOPIPE====&lt;br /&gt;
Here is another example, which we are going to parallelize using DOPIPE parallelism. There are two dependences in the sample code: a loop-independent dependence S1 -&amp;gt; S2 (through a[i]) and a loop-carried dependence S2 -&amp;gt;T S2 (through c[i-1]).&lt;br /&gt;
 '''Sample Code'''&lt;br /&gt;
 01: for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 02:   S1: a[i]=b[i];&lt;br /&gt;
 03:   S2: c[i]=c[i-1]+a[i];&lt;br /&gt;
 04: &lt;br /&gt;
 05: }&lt;br /&gt;
Now, let's see how to parallelize the sample code with DOPIPE parallelism.&lt;br /&gt;
We still use a shared array &amp;quot;_mylocks[threadid]&amp;quot;, which stores the synchronization events of each thread, and a private variable _counter0, which indicates the event that the current thread is waiting for. &lt;br /&gt;
The number of threads is obtained with the function &amp;quot;omp_get_num_threads()&amp;quot; and the current thread's ID with the function &amp;quot;omp_get_thread_num()&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
 01: int _mylocks[256]; 			//thread’s synchronized array&lt;br /&gt;
 02: #pragma omp parallel&lt;br /&gt;
 03: {&lt;br /&gt;
 04:   int _counter0 = 1;&lt;br /&gt;
 05:   int _my_id = omp_get_thread_num();&lt;br /&gt;
 06:   int _my_nprocs= omp_get_num_threads();&lt;br /&gt;
 07:   _mylocks[_my_id] = 0;&lt;br /&gt;
 08:   for(i_tile = 0; i_tile&amp;lt;N-1; i_tile+=M){&lt;br /&gt;
 09:     if(_my_id&amp;gt;0) {&lt;br /&gt;
 10:       do{&lt;br /&gt;
 11:         #pragma omp flush(_mylocks)&lt;br /&gt;
 12:       } while(_mylocks[_my_id-1]&amp;lt;_counter0);&lt;br /&gt;
 13:       #pragma omp flush(a, _mylocks)&lt;br /&gt;
 14:       _counter0 += 1;&lt;br /&gt;
 15:     }&lt;br /&gt;
 16:     #pragma omp for nowait&lt;br /&gt;
 17:     for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 18:       a[i]=b[i];&lt;br /&gt;
 19:     }&lt;br /&gt;
 20:     for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 21:       c[i]=c[i-1]+a[i];&lt;br /&gt;
 22:     }&lt;br /&gt;
 23:     _mylocks[_my_id] += 1;&lt;br /&gt;
 24:     #pragma omp flush(a, _mylocks)&lt;br /&gt;
 25:   }&lt;br /&gt;
 26: }&lt;br /&gt;
&lt;br /&gt;
We parallelize the original program in two steps.&lt;br /&gt;
&lt;br /&gt;
*First step: We divide the i loop among the processors by inserting the OpenMP construct “#pragma omp for nowait” (line 16), so that each processor takes some of the iterations of loop i. There are now two i loops, each containing different statements. The other statements are kept unchanged.&lt;br /&gt;
&lt;br /&gt;
*Second step: We synchronize the threads. After the first step, the processors finish computing a[i]=b[i]. If all processors executed the second i loop fully in parallel, the dependence would be violated, so we have to synchronize each thread with its neighbor.&lt;br /&gt;
Again, we use 4 variables, as follows: &lt;br /&gt;
1. A private variable _my_nprocs = omp_get_num_threads(), which indicates the total number of threads running the corresponding parallel region.&lt;br /&gt;
2. A private variable _my_id = omp_get_thread_num(), which indicates the unique ID of the current thread.&lt;br /&gt;
3. A shared array _mylocks[proc], with every element initialized to 0, which indicates how many blocks thread proc has finished computing.&lt;br /&gt;
4. A private variable _counter0, initialized to 1, which indicates the block that the current thread is waiting for.&lt;br /&gt;
&lt;br /&gt;
When the current thread finishes its block, it records this by incrementing ''_mylocks[_my_id]''. Once a processor finishes its own block, the other processors are able to read the values it produced and use them in their own statements.&lt;br /&gt;
&lt;br /&gt;
====Functional Parallelism====&lt;br /&gt;
&lt;br /&gt;
In order to introduce functional parallelism, we want to execute one code section in parallel with another. Code 3.21 shows two loops that execute in parallel with respect to one another, although each loop is itself executed sequentially.&lt;br /&gt;
&lt;br /&gt;
 '''Code''' 3.21 A function parallelism example in OpenMP&lt;br /&gt;
 '''#pragma''' omp parallel shared(A, B) private(i)&lt;br /&gt;
 '''{'''&lt;br /&gt;
  '''#pragma''' omp sections nowait&lt;br /&gt;
  '''{'''&lt;br /&gt;
      '''#pragma''' omp section&lt;br /&gt;
      '''for'''( i = 0; i &amp;lt; n ; i++)&lt;br /&gt;
         '''A[i]''' = A[i]*A[i] - 4.0;&lt;br /&gt;
      '''#pragma''' omp section&lt;br /&gt;
      '''for'''( i = 0; i &amp;lt; n ; i++)&lt;br /&gt;
         '''B[i]''' = B[i]*B[i] - 9.0;&lt;br /&gt;
  '''}'''//end omp sections&lt;br /&gt;
 '''}'''//end omp parallel&lt;br /&gt;
&lt;br /&gt;
In Code 3.21, two loops need to be executed in parallel with each other. We just insert a ''pragma omp section'' directive before each loop. With these two directives in place, the two sections execute in parallel with respect to one another, while each loop still runs sequentially within its section.&lt;br /&gt;
&lt;br /&gt;
===Intel Thread Building Blocks===&lt;br /&gt;
&lt;br /&gt;
Intel Threading Building Blocks (Intel TBB) is a library that supports scalable &lt;br /&gt;
parallel programming using standard ISO C++ code. It does not require special &lt;br /&gt;
languages or compilers. It is designed to promote scalable data parallel programming. &lt;br /&gt;
The library consists of data structures and algorithms that allow a programmer to avoid some complications arising from the use of native threading packages such as POSIX threads, Windows threads, or the portable Boost Threads, in which individual threads of execution are created, synchronized, and terminated manually. Instead, the library abstracts access to the multiple processors by allowing the operations to be treated as &amp;quot;tasks,&amp;quot; which are allocated to individual cores dynamically by the library's run-time engine, and by automating efficient use of the cache. This approach groups TBB in a family of solutions for parallel programming that aim to decouple the programming from the particulars of the underlying machine. The net result is that Intel Threading Building Blocks lets you specify &lt;br /&gt;
parallelism more conveniently than using raw threads, and at the same time can &lt;br /&gt;
improve performance.&lt;br /&gt;
&lt;br /&gt;
====Variables Scope====&lt;br /&gt;
&lt;br /&gt;
Intel TBB is a collection of components for parallel programming. Here is an overview of the library contents:&lt;br /&gt;
&lt;br /&gt;
* Basic algorithms: parallel_for, parallel_reduce, parallel_scan&lt;br /&gt;
* Advanced algorithms: parallel_while, parallel_do, pipeline, parallel_sort&lt;br /&gt;
* Containers: concurrent_queue, concurrent_vector, concurrent_hash_map&lt;br /&gt;
* Scalable memory allocation: scalable_malloc, scalable_free, scalable_realloc, scalable_calloc, scalable_allocator, cache_aligned_allocator&lt;br /&gt;
* Mutual exclusion: mutex, spin_mutex, queuing_mutex, spin_rw_mutex, queuing_rw_mutex, recursive mutex&lt;br /&gt;
* Atomic operations: fetch_and_add, fetch_and_increment, fetch_and_decrement, compare_and_swap, fetch_and_store&lt;br /&gt;
* Timing: portable fine grained global time stamp&lt;br /&gt;
* Task Scheduler: direct access to control the creation and activation of tasks&lt;br /&gt;
&lt;br /&gt;
Next we focus on some specific TBB components.&lt;br /&gt;
&lt;br /&gt;
=====parallel_for=====&lt;br /&gt;
&lt;br /&gt;
parallel_for is a template function that performs parallel iteration over a range of values. In Intel TBB, many DOALL cases can be implemented using this function. The syntax is as follows: &lt;br /&gt;
 template&amp;lt;typename Index, typename Function&amp;gt;&lt;br /&gt;
 Function parallel_for(Index first, Index last, Index step, Function f);&lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_for( const Range&amp;amp; range, const Body&amp;amp; body [, partitioner] );&lt;br /&gt;
&lt;br /&gt;
A parallel_for(first, last, step, f) represents parallel execution of the loop: &amp;quot;for( auto i=first; i&amp;lt;last; i+=step ) f(i);&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=====parallel_reduce=====&lt;br /&gt;
&lt;br /&gt;
The parallel_reduce function computes a reduction over a range. The syntax is as follows:&lt;br /&gt;
 template&amp;lt;typename Range, typename Value, typename Func, typename Reduction&amp;gt;&lt;br /&gt;
 Value parallel_reduce( const Range&amp;amp; range, const Value&amp;amp; identity, const Func&amp;amp; func, const Reduction&amp;amp; reduction );&lt;br /&gt;
&lt;br /&gt;
The functional form parallel_reduce(range, identity, func, reduction) performs a&lt;br /&gt;
parallel reduction by applying func to subranges in range and reducing the results&lt;br /&gt;
using the binary operator reduction. It returns the result of the reduction. The func&lt;br /&gt;
and reduction parameters can be lambda expressions.&lt;br /&gt;
&lt;br /&gt;
=====parallel_scan=====&lt;br /&gt;
&lt;br /&gt;
This template function computes a parallel prefix. The syntax is as follows:&lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_scan( const Range&amp;amp; range, Body&amp;amp; body );&lt;br /&gt;
 &lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_scan( const Range&amp;amp; range, Body&amp;amp; body, const auto_partitioner&amp;amp; );&lt;br /&gt;
 &lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_scan( const Range&amp;amp; range, Body&amp;amp; body, const simple_partitioner&amp;amp; );&lt;br /&gt;
&lt;br /&gt;
A parallel_scan(range,body) computes a parallel prefix, also known as parallel&lt;br /&gt;
scan. This computation is an advanced concept in parallel computing that is&lt;br /&gt;
sometimes useful in scenarios that appear to have inherently serial dependences. A&lt;br /&gt;
further explanation will be given in the DOACROSS example.&lt;br /&gt;
&lt;br /&gt;
=====pipeline=====&lt;br /&gt;
&lt;br /&gt;
This class performs pipelined execution. Its members are as follows:&lt;br /&gt;
 namespace tbb {&lt;br /&gt;
     class pipeline {&lt;br /&gt;
     public:&lt;br /&gt;
        pipeline();&lt;br /&gt;
        ~pipeline(); &lt;br /&gt;
        void add_filter( filter&amp;amp; f );&lt;br /&gt;
        void run( size_t max_number_of_live_tokens );&lt;br /&gt;
        void clear();&lt;br /&gt;
     };&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
A pipeline represents pipelined application of a series of filters to a stream of items.&lt;br /&gt;
Each filter operates in a particular mode: parallel, serial in order, or serial out of order. With a parallel filter, &lt;br /&gt;
we could implement DOPIPE parallelism.&lt;br /&gt;
&lt;br /&gt;
====Reduction====&lt;br /&gt;
&lt;br /&gt;
The reduction in Intel TBB is implemented using parallel_reduce function. A parallel_reduce recursively splits the range into subranges and uses the splitting constructor to make one or more copies of the body for each thread. We use an example to illustrate this: &lt;br /&gt;
&lt;br /&gt;
 #include &amp;quot;tbb/parallel_reduce.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/blocked_range.h&amp;quot;&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
 struct Sum {&lt;br /&gt;
     float value;&lt;br /&gt;
     Sum() : value(0) {}&lt;br /&gt;
     Sum( Sum&amp;amp; s, split ) {value = 0;}&lt;br /&gt;
     void operator()( const blocked_range&amp;lt;float*&amp;gt;&amp;amp; r ) {&lt;br /&gt;
         float temp = value;&lt;br /&gt;
         for( float* a=r.begin(); a!=r.end(); ++a ) {&lt;br /&gt;
             temp += *a;&lt;br /&gt;
         }&lt;br /&gt;
         value = temp;&lt;br /&gt;
     }&lt;br /&gt;
     void join( Sum&amp;amp; rhs ) {value += rhs.value;}&lt;br /&gt;
 };&lt;br /&gt;
 float ParallelSum( float array[], size_t n ) {&lt;br /&gt;
     Sum total;&lt;br /&gt;
     parallel_reduce( blocked_range&amp;lt;float*&amp;gt;( array, array+n ), total );&lt;br /&gt;
     return total.value;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The above example sums the values in the array. parallel_reduce performs the reduction over the range (array, array+n), splitting the working body into copies for the subranges and then joining the partial results with the join method.&lt;br /&gt;
&lt;br /&gt;
====DOALL====&lt;br /&gt;
The implementation of DOALL parallelism in Intel TBB involves the parallel_for function. &lt;br /&gt;
To better illustrate the usage, here we discuss a simple example. The following is the original code:&lt;br /&gt;
 &lt;br /&gt;
 void SerialApplyFoo( float a[], size_t n ) {&lt;br /&gt;
     for( size_t i=0; i&amp;lt;n; ++i )&lt;br /&gt;
         Foo(a[i]);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
After using Intel TBB, it could be switched to the following:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;quot;tbb/blocked_range.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/parallel_for.h&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
 class ApplyFoo {&lt;br /&gt;
     float *const my_a;&lt;br /&gt;
 public:&lt;br /&gt;
     void operator( )( const blocked_range&amp;lt;size_t&amp;gt;&amp;amp; r ) const {&lt;br /&gt;
         float *a = my_a;&lt;br /&gt;
         for( size_t i=r.begin(); i!=r.end( ); ++i )&lt;br /&gt;
             Foo(a[i]);&lt;br /&gt;
     }&lt;br /&gt;
     ApplyFoo( float a[] ) :&lt;br /&gt;
         my_a(a)&lt;br /&gt;
     {}&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 void ParallelApplyFoo( float a[], size_t n ) {&lt;br /&gt;
     parallel_for(blocked_range&amp;lt;size_t&amp;gt;(0,n,The_grain_size_You_Pick), ApplyFoo(a) );&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
This example is the simplest DOALL parallelism, similar to the one in the textbook, and its execution graph is very similar to the one in the DOALL section above. Though simple, it gives you a flavor of how DOALL would be implemented in Intel Threading Building Blocks.&lt;br /&gt;
&lt;br /&gt;
One more note: parallel_for takes an optional third argument that specifies a partitioner, represented above by &amp;quot;The_grain_size_You_Pick&amp;quot;. If you want to divide the work manually and assign it to processors, you can specify a grain size in the function. Alternatively, you can use the automatic partitioning provided by TBB: the auto_partitioner heuristically chooses a grain size so that you do not have to specify one. The heuristic attempts to limit overhead while still providing ample opportunities for load balancing. With it, the last three lines of the TBB code above become:&lt;br /&gt;
&lt;br /&gt;
 void ParallelApplyFoo( float a[], size_t n ) {&lt;br /&gt;
     parallel_for(blocked_range&amp;lt;size_t&amp;gt;(0,n), ApplyFoo(a), auto_partitioner( ) );&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
====DOACROSS====&lt;br /&gt;
&lt;br /&gt;
We can find a good example in Intel TBB of implementing DOACROSS with the help of parallel_scan. As stated in the parallel_scan section, this function computes a parallel prefix, also known as a parallel&lt;br /&gt;
scan. This computation is an advanced concept in parallel computing that&lt;br /&gt;
can be helpful in scenarios that appear to have inherently serial dependences, such as loop-carried dependences. &lt;br /&gt;
&lt;br /&gt;
Let's consider this scenario (which is actually the mathematical definition of parallel prefix):  &lt;br /&gt;
 T temp = id⊕;&lt;br /&gt;
 for( int i=1; i&amp;lt;=n; ++i ) {&lt;br /&gt;
     temp = temp ⊕ x[i];&lt;br /&gt;
     y[i] = temp;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we implement this in TBB using parallel_scan, it becomes:&lt;br /&gt;
&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
 class Body {&lt;br /&gt;
     T sum;&lt;br /&gt;
     T* const y;&lt;br /&gt;
     const T* const x;&lt;br /&gt;
 public:&lt;br /&gt;
     Body( T y_[], const T x_[] ) : sum(id⊕), x(x_), y(y_) {}&lt;br /&gt;
     T get_sum() const {return sum;}&lt;br /&gt;
     template&amp;lt;typename Tag&amp;gt;&lt;br /&gt;
     void operator()( const blocked_range&amp;lt;int&amp;gt;&amp;amp; r, Tag ) {&lt;br /&gt;
         T temp = sum;&lt;br /&gt;
         for( int i=r.begin(); i&amp;lt;r.end(); ++i ) {&lt;br /&gt;
             temp = temp ⊕ x[i];&lt;br /&gt;
             if( Tag::is_final_scan() )&lt;br /&gt;
                 y[i] = temp;&lt;br /&gt;
         } &lt;br /&gt;
         sum = temp;&lt;br /&gt;
     }&lt;br /&gt;
     Body( Body&amp;amp; b, split ) : x(b.x), y(b.y), sum(id⊕) {}&lt;br /&gt;
     void reverse_join( Body&amp;amp; a ) { sum = a.sum ⊕ sum;}&lt;br /&gt;
     void assign( Body&amp;amp; b ) {sum = b.sum;}&lt;br /&gt;
 };&lt;br /&gt;
&lt;br /&gt;
 float DoParallelScan( T y[], const T x[], int n ) {&lt;br /&gt;
     Body body(y,x);&lt;br /&gt;
     parallel_scan( blocked_range&amp;lt;int&amp;gt;(0,n), body );&lt;br /&gt;
     return body.get_sum();&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
It is the second part (the function DoParallelScan) that we should focus on. &lt;br /&gt;
&lt;br /&gt;
This example is exactly the scenario mentioned above that can take advantage of parallel_scan. The &amp;quot;inherently serial dependences&amp;quot; are taken care of by parallel_scan itself: by computing the prefix, the serial code can be run in parallel with just one function.&lt;br /&gt;
&lt;br /&gt;
====DOPIPE====&lt;br /&gt;
&lt;br /&gt;
The pipeline class is the Intel TBB class that performs pipelined execution. A pipeline represents pipelined application of a series of filters to a stream of items. Each filter operates in a particular mode: parallel, serial in order, or serial out of order. This class can therefore be used to implement DOPIPE parallelism.&lt;br /&gt;
&lt;br /&gt;
Here is a somewhat more complex pipeline example. Looked at carefully, it exhibits both DOPIPE and DOACROSS:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;iostream&amp;gt;&lt;br /&gt;
 #include &amp;quot;tbb/pipeline.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/tbb_thread.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/task_scheduler_init.h&amp;quot;&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
 char InputString[] = &amp;quot;abcdefg\n&amp;quot;;&lt;br /&gt;
 class InputFilter: public filter {&lt;br /&gt;
     char* my_ptr;&lt;br /&gt;
 public:&lt;br /&gt;
     void* operator()(void*) {&lt;br /&gt;
         if (*my_ptr)&lt;br /&gt;
             return my_ptr++;&lt;br /&gt;
         else&lt;br /&gt;
             return NULL;&lt;br /&gt;
     }&lt;br /&gt;
     InputFilter() :&lt;br /&gt;
         filter( serial_in_order ), my_ptr(InputString) {}&lt;br /&gt;
 };&lt;br /&gt;
 class OutputFilter: public thread_bound_filter {&lt;br /&gt;
 public:&lt;br /&gt;
     void* operator()(void* item) {&lt;br /&gt;
         std::cout &amp;lt;&amp;lt; *(char*)item;&lt;br /&gt;
         return NULL;&lt;br /&gt;
     }&lt;br /&gt;
     OutputFilter() : thread_bound_filter(serial_in_order) {}&lt;br /&gt;
 };&lt;br /&gt;
 void RunPipeline(pipeline* p) {&lt;br /&gt;
     p-&amp;gt;run(8);&lt;br /&gt;
 }&lt;br /&gt;
 int main() {&lt;br /&gt;
     // Construct the pipeline&lt;br /&gt;
     InputFilter f;&lt;br /&gt;
     OutputFilter g;&lt;br /&gt;
     pipeline p;&lt;br /&gt;
     p.add_filter(f);&lt;br /&gt;
     p.add_filter(g);&lt;br /&gt;
     // Another thread initiates execution of the pipeline&lt;br /&gt;
     tbb_thread t(RunPipeline,&amp;amp;p);&lt;br /&gt;
     // Process the thread_bound_filter with the current thread.&lt;br /&gt;
     while (g.process_item()!=thread_bound_filter::end_of_stream)&lt;br /&gt;
         continue;&lt;br /&gt;
     // Wait for pipeline to finish on the other thread.&lt;br /&gt;
     t.join();&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The example above shows a pipeline with two filters where the second filter is a thread_bound_filter serviced by the main thread. The main thread does the following after constructing the pipeline:&lt;br /&gt;
1. Start the pipeline on another thread.&lt;br /&gt;
2. Service the thread_bound_filter until it reaches end_of_stream.&lt;br /&gt;
3. Wait for the other thread to finish.&lt;br /&gt;
&lt;br /&gt;
===POSIX Threads===&lt;br /&gt;
&lt;br /&gt;
POSIX Threads, or Pthreads, is a POSIX standard for threads. The standard, POSIX.1c, Threads extensions (IEEE Std 1003.1c-1995), defines an API for creating and manipulating threads.&lt;br /&gt;
&lt;br /&gt;
====Variable Scopes====&lt;br /&gt;
Pthreads defines a set of C programming language types, functions and constants. It is implemented with a pthread.h header and a thread library.&lt;br /&gt;
&lt;br /&gt;
There are around 100 Pthreads procedures, all prefixed &amp;quot;pthread_&amp;quot;. The subroutines which comprise the Pthreads API can be informally grouped into four major groups:&lt;br /&gt;
&lt;br /&gt;
* '''Thread management:''' Routines that work directly on threads - creating, detaching, joining, etc. They also include functions to set/query thread attributes (joinable, scheduling, etc.). E.g. pthread_create(), pthread_join().&lt;br /&gt;
* '''Mutexes:''' Routines that deal with synchronization, called a &amp;quot;mutex&amp;quot;, which is an abbreviation for &amp;quot;mutual exclusion&amp;quot;. Mutex functions provide for creating, destroying, locking and unlocking mutexes. These are supplemented by mutex attribute functions that set or modify attributes associated with mutexes. E.g. pthread_mutex_lock(); pthread_mutex_trylock(); pthread_mutex_unlock().&lt;br /&gt;
* '''Condition variables:''' Routines that address communications between threads that share a mutex, based upon programmer-specified conditions. This group includes functions to create, destroy, wait, and signal based upon specified variable values. Functions to set/query condition variable attributes are also included. E.g. pthread_cond_signal(); pthread_cond_broadcast(); pthread_cond_wait(); pthread_cond_timedwait(); pthread_cond_reltimedwait_np().&lt;br /&gt;
* '''Synchronization:''' Routines that manage read/write locks and barriers. E.g. pthread_rwlock_rdlock(); pthread_rwlock_tryrdlock(); pthread_rwlock_wrlock();pthread_rwlock_trywrlock(); pthread_rwlock_unlock();pthread_barrier_init(); pthread_barrier_wait()&lt;br /&gt;
&lt;br /&gt;
====DOALL====&lt;br /&gt;
&lt;br /&gt;
The following is a simple C code example of DOALL parallelism that prints out each thread's ID.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;
 #define NUM_THREADS     5&lt;br /&gt;
  &lt;br /&gt;
 void *PrintHello(void *threadid)&lt;br /&gt;
 {&lt;br /&gt;
    long tid;&lt;br /&gt;
  &lt;br /&gt;
    tid = (long)threadid;&lt;br /&gt;
    printf(&amp;quot;Hello World! It's me, thread #%ld!\n&amp;quot;, tid);&lt;br /&gt;
    pthread_exit(NULL);&lt;br /&gt;
 }&lt;br /&gt;
  &lt;br /&gt;
 int main (int argc, char *argv[])&lt;br /&gt;
 {&lt;br /&gt;
    pthread_t threads[NUM_THREADS];&lt;br /&gt;
  &lt;br /&gt;
    int rc;&lt;br /&gt;
    long t;&lt;br /&gt;
    for(t=0; t&amp;lt;NUM_THREADS; t++){&lt;br /&gt;
       printf(&amp;quot;In main: creating thread %ld\n&amp;quot;, t);&lt;br /&gt;
       rc = pthread_create(&amp;amp;threads[t], NULL, PrintHello, (void *)t);&lt;br /&gt;
  &lt;br /&gt;
       if (rc){&lt;br /&gt;
          printf(&amp;quot;ERROR; return code from pthread_create() is %d\n&amp;quot;, rc);&lt;br /&gt;
          exit(-1);&lt;br /&gt;
       }&lt;br /&gt;
    }&lt;br /&gt;
    pthread_exit(NULL);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The loop body contains only a single statement with no cross-iteration dependences, so each iteration can be treated as an independent parallel task.&lt;br /&gt;
====DOACROSS====&lt;br /&gt;
&lt;br /&gt;
When it comes to using Pthreads to implement DOACROSS, Pthreads can express the parallelism, but it makes the code unnecessarily complicated. See the example below, from '''POSIX Threads Programming''' by Blaise Barney:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;math.h&amp;gt;&lt;br /&gt;
 #define NUM_THREADS 4&lt;br /&gt;
 &lt;br /&gt;
 void *BusyWork(void *t)&lt;br /&gt;
 {&lt;br /&gt;
   int i;&lt;br /&gt;
   long tid;&lt;br /&gt;
   double result=0.0;&lt;br /&gt;
   tid = (long)t;&lt;br /&gt;
   printf(&amp;quot;Thread %ld starting...\n&amp;quot;,tid);&lt;br /&gt;
   for (i=0; i&amp;lt;1000000; i++)&lt;br /&gt;
   {&lt;br /&gt;
      result = result + sin(i) * tan(i);&lt;br /&gt;
   }&lt;br /&gt;
   printf(&amp;quot;Thread %ld done. Result = %e\n&amp;quot;,tid, result);&lt;br /&gt;
   pthread_exit((void*) t);&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main (int argc, char *argv[])&lt;br /&gt;
 {&lt;br /&gt;
   pthread_t thread[NUM_THREADS];&lt;br /&gt;
   pthread_attr_t attr;&lt;br /&gt;
   int rc;&lt;br /&gt;
   long t;&lt;br /&gt;
   void *status;&lt;br /&gt;
 &lt;br /&gt;
   /* Initialize and set thread detached attribute */&lt;br /&gt;
   pthread_attr_init(&amp;amp;attr);&lt;br /&gt;
   pthread_attr_setdetachstate(&amp;amp;attr, PTHREAD_CREATE_JOINABLE);&lt;br /&gt;
 &lt;br /&gt;
   for(t=0; t&amp;lt;NUM_THREADS; t++) {&lt;br /&gt;
      printf(&amp;quot;Main: creating thread %ld\n&amp;quot;, t);&lt;br /&gt;
      rc = pthread_create(&amp;amp;thread[t], &amp;amp;attr, BusyWork, (void *)t); &lt;br /&gt;
      if (rc) {&lt;br /&gt;
          printf(&amp;quot;ERROR; return code from pthread_create() is %d\n&amp;quot;, rc);&lt;br /&gt;
         exit(-1);&lt;br /&gt;
         }&lt;br /&gt;
      }&lt;br /&gt;
 &lt;br /&gt;
   /* Free attribute and wait for the other threads */&lt;br /&gt;
   pthread_attr_destroy(&amp;amp;attr);&lt;br /&gt;
   for(t=0; t&amp;lt;NUM_THREADS; t++) {&lt;br /&gt;
      rc = pthread_join(thread[t], &amp;amp;status);&lt;br /&gt;
      if (rc) {&lt;br /&gt;
          printf(&amp;quot;ERROR; return code from pthread_join() is %d\n&amp;quot;, rc);&lt;br /&gt;
         exit(-1);&lt;br /&gt;
         }&lt;br /&gt;
       printf(&amp;quot;Main: completed join with thread %ld having a status of %ld\n&amp;quot;,t,(long)status);&lt;br /&gt;
      }&lt;br /&gt;
 &lt;br /&gt;
 printf(&amp;quot;Main: program completed. Exiting.\n&amp;quot;);&lt;br /&gt;
 pthread_exit(NULL);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
This example demonstrates how to &amp;quot;wait&amp;quot; for thread completions by using the Pthread join routine. Since some implementations of Pthreads may not create threads in a joinable state, the threads in this example are explicitly created in a joinable state so that they can be joined later.&lt;br /&gt;
&lt;br /&gt;
====DOPIPE====&lt;br /&gt;
There are examples of using POSIX Threads to implement DOPIPE parallelism, but they are unnecessarily complex. Due to their length, we won't provide one here; interested readers can find one at [http://homepage.mac.com/dbutenhof/Threads/code/pipe.c Pthreads DOPIPE example]&lt;br /&gt;
&lt;br /&gt;
===Comparison among the three===&lt;br /&gt;
&lt;br /&gt;
====A unified example====&lt;br /&gt;
&lt;br /&gt;
We use a simple parallel example from [http://sourceforge.net Sourceforge.net] and implement it in each of the three packages (POSIX Threads, Intel TBB, and OpenMP) to highlight some commonalities and differences among them.&lt;br /&gt;
&lt;br /&gt;
Following is the original code:&lt;br /&gt;
&lt;br /&gt;
 Grid1 *g = new Grid1(0, n+1);&lt;br /&gt;
 Grid1IteratorSub it(1, n, g);&lt;br /&gt;
 DistArray x(g), y(g);&lt;br /&gt;
 ...&lt;br /&gt;
 float e = 0;&lt;br /&gt;
 ForEach(int i, it,&lt;br /&gt;
    x(i) += ( y(i+1) + y(i-1) )*.5;&lt;br /&gt;
    e += sqr( y(i) ); )&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
Then we are going to show the implementations in different packages, and also make a brief summary of the three packages.&lt;br /&gt;
&lt;br /&gt;
=====In POSIX Thread=====&lt;br /&gt;
&lt;br /&gt;
POSIX Threads target symmetric multiprocessing, e.g. SMP multiprocessor computers, multi-core processors, and virtual shared-memory computers.&lt;br /&gt;
&lt;br /&gt;
Data layout: A single global memory. Each thread reads global shared data and writes to a private fraction of global data.&lt;br /&gt;
&lt;br /&gt;
A simplified translation of the example parallel-for loop is given below.&lt;br /&gt;
&lt;br /&gt;
Global declaration:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 float *x, *y;&lt;br /&gt;
 float vec[8];&lt;br /&gt;
 int nn, pp;&lt;br /&gt;
&lt;br /&gt;
thread code:&lt;br /&gt;
&lt;br /&gt;
 void *sub1(void *arg) {&lt;br /&gt;
    int p = (int)arg;&lt;br /&gt;
    float e_local = 0;&lt;br /&gt;
    for (int i=1+(nn*p)/pp; i&amp;lt;1+(nn*(p+1))/pp; ++i) {&lt;br /&gt;
      x[i] += ( y[i+1] + y[i-1] )*.5;&lt;br /&gt;
      e_local += y[i] * y[i];&lt;br /&gt;
    }&lt;br /&gt;
    vec[p] = e_local;&lt;br /&gt;
    return (void*) 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
main code:&lt;br /&gt;
&lt;br /&gt;
 x = new float[n+1];&lt;br /&gt;
 y = new float[n+1];&lt;br /&gt;
 ...&lt;br /&gt;
 float e = 0;&lt;br /&gt;
 int p_threads = 8;&lt;br /&gt;
 nn = n-1;&lt;br /&gt;
 pp = p_threads;&lt;br /&gt;
 pthread_t threads[8];&lt;br /&gt;
 pthread_attr_t attr;&lt;br /&gt;
 pthread_attr_init(&amp;amp;attr);&lt;br /&gt;
 for (int p=0; p&amp;lt;p_threads; ++p)&lt;br /&gt;
    pthread_create(&amp;amp;threads[p], &amp;amp;attr,&lt;br /&gt;
      sub1, (void *)p);&lt;br /&gt;
 for (int p=0; p&amp;lt;p_threads; ++p) {&lt;br /&gt;
    pthread_join(threads[p], NULL);&lt;br /&gt;
    e += vec[p];&lt;br /&gt;
 }&lt;br /&gt;
 ...&lt;br /&gt;
 delete[] x, y;&lt;br /&gt;
&lt;br /&gt;
=====In Intel Threading Building Blocks=====&lt;br /&gt;
&lt;br /&gt;
Intel TBB: a C++ library for thread programming, targeting e.g. SMP multiprocessor computers, multi-core processors, and virtual shared-memory computers.&lt;br /&gt;
&lt;br /&gt;
Data layout: A single global memory. Each thread reads global shared data and writes to a private fraction of global data.&lt;br /&gt;
&lt;br /&gt;
Translation of the example parallel-for loop is given below.&lt;br /&gt;
&lt;br /&gt;
Global:&lt;br /&gt;
 #include &amp;quot;tbb/task_scheduler_init.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/blocked_range.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/parallel_reduce.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/cache_aligned_allocator.h&amp;quot;&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
&lt;br /&gt;
thread code:&lt;br /&gt;
 struct sub1 {&lt;br /&gt;
    float ee;&lt;br /&gt;
    float *x, *y;&lt;br /&gt;
    sub1(float *xx, float *yy) : ee(0), x(xx), y(yy) {}&lt;br /&gt;
    sub1(sub1&amp;amp; s, split) { ee = 0; x = s.x; y = s.y; }&lt;br /&gt;
    void operator() (const blocked_range&amp;lt;int&amp;gt; &amp;amp; r){&lt;br /&gt;
      float e = ee;&lt;br /&gt;
      for (int i = r.begin(); i!= r.end(); ++i) {&lt;br /&gt;
        x[i] += ( y[i+1] + y[i-1] )*.5;&lt;br /&gt;
        e += y[i] * y[i];&lt;br /&gt;
      }&lt;br /&gt;
      ee = e;&lt;br /&gt;
    }&lt;br /&gt;
    void join(sub1&amp;amp; s) { ee += s.ee; }&lt;br /&gt;
 };&lt;br /&gt;
&lt;br /&gt;
main code:&lt;br /&gt;
 task_scheduler_init init;&lt;br /&gt;
 ...&lt;br /&gt;
 float e;&lt;br /&gt;
 float *x = cache_aligned_allocator&amp;lt;float&amp;gt;().allocate(n+1);&lt;br /&gt;
 float *y = cache_aligned_allocator&amp;lt;float&amp;gt;().allocate(n+1);&lt;br /&gt;
 ...&lt;br /&gt;
 sub1 s(x, y);&lt;br /&gt;
 parallel_reduce(blocked_range&amp;lt;int&amp;gt;(1, n, 1000), s);&lt;br /&gt;
 e = s.ee;&lt;br /&gt;
 ...&lt;br /&gt;
 cache_aligned_allocator&amp;lt;float&amp;gt;().deallocate(x, n+1);&lt;br /&gt;
 cache_aligned_allocator&amp;lt;float&amp;gt;().deallocate(y, n+1);&lt;br /&gt;
&lt;br /&gt;
=====In OpenMP shared memory parallel code annotations=====&lt;br /&gt;
&lt;br /&gt;
OpenMP: usually automatic parallelization with a run-time system based on a thread library.&lt;br /&gt;
&lt;br /&gt;
A simplified translation of the example parallel-for loop is given below.&lt;br /&gt;
&lt;br /&gt;
Global:&lt;br /&gt;
 float e;&lt;br /&gt;
&lt;br /&gt;
main code:&lt;br /&gt;
 float *x = new float[n+1];&lt;br /&gt;
 float *y = new float[n+1];&lt;br /&gt;
 ...&lt;br /&gt;
 e = 0;&lt;br /&gt;
 #pragma omp parallel for reduction(+:e)&lt;br /&gt;
 for (int i=1; i&amp;lt;n; ++i) {&lt;br /&gt;
    x[i] += ( y[i+1] + y[i-1] )*.5;&lt;br /&gt;
    e += y[i] * y[i];&lt;br /&gt;
 }&lt;br /&gt;
 ...&lt;br /&gt;
 delete[] x, y;&lt;br /&gt;
&lt;br /&gt;
====Summary: Difference among them====&lt;br /&gt;
&lt;br /&gt;
*Pthreads works for all of these forms of parallelism and can express functional parallelism easily, but it requires building specialized synchronization primitives and explicitly privatizing variables, which means more effort is needed to convert a serial program into a parallel one. &lt;br /&gt;
&lt;br /&gt;
*OpenMP can provide many performance enhancing features, such as atomic, barrier and flush synchronization primitives. It is very simple to use OpenMP to exploit DOALL parallelism, but the syntax for expressing functional parallelism is awkward. &lt;br /&gt;
&lt;br /&gt;
*Intel TBB relies on generic programming, so it performs better with custom iteration spaces or complex reduction operations. It also provides generic parallel patterns for parallel while-loops, data-flow pipeline models, parallel sorts, and prefixes, so it is better in cases that go beyond loop-based parallelism.&lt;br /&gt;
&lt;br /&gt;
Below is a table that illustrates the differences [[#References|&amp;lt;sup&amp;gt;[16]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
{| align=&amp;quot;center&amp;quot; cellpadding=&amp;quot;4&amp;quot;&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
!Type of Parallelism&lt;br /&gt;
!Posix Threads&lt;br /&gt;
!Intel&amp;amp;reg; TBB&lt;br /&gt;
!OpenMP 2.0&lt;br /&gt;
!OpenMP 3.0&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
!DOALL&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
!DOACROSS&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
!DOPIPE&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
! Reduction&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
! Functional Parallelism&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==Synchronization Mechanisms==&lt;br /&gt;
&lt;br /&gt;
===Overview===&lt;br /&gt;
&lt;br /&gt;
In order to accomplish the above parallelizations in a real system, memory accesses must be carefully orchestrated so that no information gets corrupted.  Every architecture handles synchronizing data among parallel processors slightly differently.  This section looks at several architectures and highlights a few of the mechanisms used to achieve this memory synchronization.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===IA-64===&lt;br /&gt;
IA-64 is an Intel architecture that is mainly used in Itanium processors.&lt;br /&gt;
====Spinlock====&lt;br /&gt;
The spinlock is used to guard against multiple simultaneous accesses to a critical section.  The critical section is a section of code that must be executed sequentially; it cannot be parallelized.  Therefore, when a parallel process encounters an occupied critical section, the process will “spin” until the lock is released. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // The lock variable is stored at [lock]. If it is 0, the lock is&lt;br /&gt;
  // available. If it is 1, another process is in the critical section.&lt;br /&gt;
  //&lt;br /&gt;
  &lt;br /&gt;
  spin_lock:&lt;br /&gt;
    mov	ar.ccv = 0			// cmpxchg looks for avail (0)&lt;br /&gt;
    mov	r2 = 1				// cmpxchg sets to held (1)&lt;br /&gt;
  &lt;br /&gt;
  spin: &lt;br /&gt;
    ld8	r1 = [lock] ;;			// get lock in shared state&lt;br /&gt;
    cmp.ne	p1, p0 = r1, r0		// is lock held (i.e., lock != 0)?&lt;br /&gt;
  &lt;br /&gt;
  (p1)	br.cond.spnt	spin ;;		// yes, continue spinning&lt;br /&gt;
    cmpxchg8.acq	r1 = [lock], r2	// attempt to grab lock&lt;br /&gt;
    cmp.ne p1, p0 = r1, r0		// was lock already held?&lt;br /&gt;
  &lt;br /&gt;
  (p1)	br.cond.spnt	spin ;;		// bummer, continue spinning&lt;br /&gt;
  &lt;br /&gt;
  cs_begin:&lt;br /&gt;
    // critical section code goes here...&lt;br /&gt;
  cs_end:&lt;br /&gt;
  &lt;br /&gt;
    st8.rel	[lock] = r0 ;;		//release the lock&lt;br /&gt;
&lt;br /&gt;
The above code demonstrates how a spinlock is used.  When a process reaches the spinlock, it checks whether the lock is available; if it is not, the process enters the spin loop, where it continuously checks for the lock to become available.  Once the lock is available, the process attempts to obtain it.  If another process obtains the lock first, the process branches back into the spin loop and continues to wait.&lt;br /&gt;
&lt;br /&gt;
====Barrier====&lt;br /&gt;
&lt;br /&gt;
A barrier is a common mechanism used to hold up processes until all processes reach the same point.  The mechanism is useful in all of the kinds of parallelism discussed above (DOALL, DOACROSS, DOPIPE, reduction, and functional parallelism).  This architecture uses a variant of the barrier mechanism called the sense-reversing barrier.  The idea behind this barrier is to prevent race conditions: if a fast process races ahead to the “next” instance of the barrier while slow processes are still leaving the current one, the fast process could trap the slow processes at the “next” barrier, corrupting the memory synchronization. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====Dekker’s Algorithm====&lt;br /&gt;
&lt;br /&gt;
Dekker’s Algorithm uses variables to indicate which processors are using which resources; it essentially arbitrates for a resource using these variables.  Every processor has a flag that indicates when it is in the critical section.  When a processor is getting ready to enter the critical section, it sets its flag to one, then checks that all of the other processors' flags are zero before proceeding into the section.  This behavior is demonstrated in the code below for a two-way multiprocessor system, so there are two processor flags, flag_me and flag_you. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // The flag_me variable is zero if we are not in the synchronization and &lt;br /&gt;
  // critical section code and non-zero otherwise; flag_you is similarly set&lt;br /&gt;
  // for the other processor.  This algorithm does not retry access to the &lt;br /&gt;
  // resource if there is contention.&lt;br /&gt;
  &lt;br /&gt;
  dekker:&lt;br /&gt;
    mov		r1 = 1 ;;		// my_flag = 1 (i want access)&lt;br /&gt;
    st8  	[flag_me] = r1&lt;br /&gt;
    mf ;;				// make st visible first&lt;br /&gt;
    ld8 	r2 = [flag_you] ;;		// is other's flag 0?&lt;br /&gt;
    cmp.ne p1, p0 = 0, r2&lt;br /&gt;
  &lt;br /&gt;
  (p1)	br.cond.spnt cs_skip ;;		// if not, resource in use &lt;br /&gt;
  &lt;br /&gt;
  cs_begin:&lt;br /&gt;
    // critical section code goes here...&lt;br /&gt;
  cs_end:&lt;br /&gt;
  &lt;br /&gt;
  cs_skip:&lt;br /&gt;
    st8.rel	[flag_me] = r0 ;;		// release lock&lt;br /&gt;
&lt;br /&gt;
====Lamport’s Algorithm====&lt;br /&gt;
&lt;br /&gt;
Lamport’s Algorithm is similar to a spinlock, with the addition of a fairness mechanism that keeps track of the order in which processes request the shared resource and grants access in that same order.  It makes use of two variables, x and y, and a shared array, b.  The code below illustrates this algorithm.  [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // The proc_id variable holds a unique, non-zero id for the process that &lt;br /&gt;
  // attempts access to the critical section.  x and y are the synchronization&lt;br /&gt;
  // variables that indicate who is in the critical section and who is attempting&lt;br /&gt;
  // entry. ptr_b_1 and ptr_b_id point at the 1'st and id'th element of b[].&lt;br /&gt;
  //&lt;br /&gt;
  &lt;br /&gt;
  lamport:&lt;br /&gt;
    	ld8		r1 = [proc_id] ;;	// r1 = unique process id&lt;br /&gt;
  start:&lt;br /&gt;
    	st8	[ptr_b_id] = r1		// b[id] = &amp;quot;true&amp;quot;&lt;br /&gt;
    	st8	[x] = r1			// x = process id&lt;br /&gt;
   	mf					// MUST fence here!&lt;br /&gt;
    	ld8	r2 = [y] ;;&lt;br /&gt;
    	cmp.ne p1, p0 = 0, r2;;		// if (y !=0) then...&lt;br /&gt;
  (p1)	st8	[ptr_b_id] = r0		// ... b[id] = &amp;quot;false&amp;quot;&lt;br /&gt;
  (p1)	br.cond.sptk	wait_y		// ... wait until y == 0&lt;br /&gt;
  &lt;br /&gt;
  	st8	[y] = r1		// y = process id&lt;br /&gt;
  	mf&lt;br /&gt;
  	ld8 	r3 = [x] ;;		&lt;br /&gt;
  	cmp.eq p1, p0 = r1, r3 ;;	// if (x == id) then..&lt;br /&gt;
  (p1)	br.cond.sptk cs_begin		// ... enter critical section&lt;br /&gt;
  &lt;br /&gt;
  	st8 	[ptr_b_id] = r0		// b[id] = &amp;quot;false&amp;quot;&lt;br /&gt;
  	ld8	r3 = [ptr_b_1]		// r3 = &amp;amp;b[1]&lt;br /&gt;
  	mov	ar.lc = N-1 ;;		// lc = number of processors - 1&lt;br /&gt;
  wait_b:&lt;br /&gt;
  	ld8	r2 = [r3] ;;		&lt;br /&gt;
  	cmp.ne p1, p0 = 0, r2		// if (b[j] != 0) then...&lt;br /&gt;
  (p1)	br.cond.spnt	wait_b ;;	// ... wait until b[j] == 0&lt;br /&gt;
  	add	r3 = 8, r3		// r3 = &amp;amp;b[j+1]&lt;br /&gt;
  	br.cloop.sptk	wait_b ;;	// loop over b[j] for each j&lt;br /&gt;
  &lt;br /&gt;
  wait_y:&lt;br /&gt;
  	ld8	r2 = [y] ;;		&lt;br /&gt;
  	cmp.ne p1, p2 = 0, r2		// if (y != 0) then...&lt;br /&gt;
  (p1)  br.cond.spnt 	wait_y&lt;br /&gt;
  	br	start			// back to start to try again&lt;br /&gt;
  &lt;br /&gt;
  cs_begin:&lt;br /&gt;
    // critical section code goes here...&lt;br /&gt;
  cs_end:&lt;br /&gt;
  &lt;br /&gt;
  	st8	[y] = r0		// release the lock&lt;br /&gt;
  	st8.rel	[ptr_b_id] = r0 ;;	// b[id] = &amp;quot;false&amp;quot;&lt;br /&gt;
===IA-32=== &lt;br /&gt;
&lt;br /&gt;
IA-32 is an Intel architecture that is also known as x86.  This is a very widely used architecture.&lt;br /&gt;
&lt;br /&gt;
====Locked Atomic Operation====&lt;br /&gt;
This is the main mechanism this architecture uses to manage shared data structures such as semaphores and system segments.  The processor uses the following three interdependent mechanisms to implement locked atomic operations: [[#References|&amp;lt;sup&amp;gt;[13]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*  Guaranteed atomic operations.&lt;br /&gt;
*  Bus locking, using the LOCK# signal and the LOCK instruction prefix.&lt;br /&gt;
*  Cache coherency protocols that ensure that atomic operations can be carried out on cached data structures (cache lock). This mechanism is present in the P6 family processors.&lt;br /&gt;
&lt;br /&gt;
=====Guaranteed Atomic Operation=====&lt;br /&gt;
The following operations are guaranteed to be carried out atomically: [[#References|&amp;lt;sup&amp;gt;[13]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*  Reading or writing a byte.&lt;br /&gt;
*  Reading or writing a word aligned on a 16-bit boundary.&lt;br /&gt;
*  Reading or writing a doubleword aligned on a 32-bit boundary.&lt;br /&gt;
The P6 family processors guarantee that the following additional memory operations will always be carried out atomically:&lt;br /&gt;
*  Reading or writing a quadword aligned on a 64-bit boundary. (This operation is also guaranteed on the Pentium® processor.)&lt;br /&gt;
*  16-bit accesses to uncached memory locations that fit within a 32-bit data bus.&lt;br /&gt;
*  16-, 32-, and 64-bit accesses to cached memory that fit within a 32-Byte cache line.&lt;br /&gt;
&lt;br /&gt;
=====Bus Locking=====&lt;br /&gt;
A LOCK# signal is asserted automatically during certain critical memory operations in order to lock the system bus and grant exclusive control to the processor executing them.  While the LOCK# signal is asserted, no other processor can take control of the bus.&lt;br /&gt;
&lt;br /&gt;
===Linux Kernel===&lt;br /&gt;
&lt;br /&gt;
The Linux kernel is referred to here as an “architecture”; however, it is fairly unconventional in that it is an open-source operating system that has full access to the hardware. It uses many common synchronization mechanisms, so it will be considered here. [[#References|&amp;lt;sup&amp;gt;[15]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====Busy-waiting lock====&lt;br /&gt;
&lt;br /&gt;
=====Spinlocks=====&lt;br /&gt;
&lt;br /&gt;
This mechanism is very similar to the one described for the IA-64 architecture.  It is used to manage access to a critical section of code.  If a process tries to access the critical section and is rejected, it will sit and “spin” while it waits for the lock to be released.&lt;br /&gt;
&lt;br /&gt;
=====Rwlocks=====&lt;br /&gt;
&lt;br /&gt;
This is a special kind of spinlock intended for protecting structures that are frequently read but rarely written.  This lock allows multiple reads in parallel, which can increase efficiency because processes do not have to sit and wait merely to carry out a read.  As before, however, only one write is allowed at a time, with no reads done in parallel.&lt;br /&gt;
&lt;br /&gt;
=====Brlocks=====&lt;br /&gt;
&lt;br /&gt;
This is a very fast read/write lock, but it carries a write-side penalty.  The main advantage of this lock is that it prevents cache “ping-pong” in the multiple-read case.&lt;br /&gt;
&lt;br /&gt;
====Sleeper locks====&lt;br /&gt;
&lt;br /&gt;
=====Semaphores=====&lt;br /&gt;
&lt;br /&gt;
A semaphore is a special variable that acts similarly to a lock.  If the semaphore can be acquired, the process can proceed into the critical section.  If the semaphore cannot be acquired, the process is “put to sleep” and the processor is used for another process.  This means the process's state is saved off in a place where it can be retrieved when the process is “woken up”.  Once the semaphore is available, the “sleeping” process is woken up, obtains the semaphore, and proceeds into the critical section. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===CUDA=== &lt;br /&gt;
&lt;br /&gt;
CUDA, or Compute Unified Device Architecture, is an Nvidia architecture which is the computing engine for their graphics processors.&lt;br /&gt;
&lt;br /&gt;
====__syncthreads====&lt;br /&gt;
&lt;br /&gt;
The __syncthreads operation can be used at the end of a parallel section as a sort of “barrier” mechanism.  It is necessary to ensure the correctness of the memory accesses.  In the following example, there are two calls to __syncthreads, and both are necessary to ensure the expected results are obtained.  Without them, ref2 could end up being either 2 or the original value of myArray[tid], depending on when the read and write take place. [[#References|&amp;lt;sup&amp;gt;[14]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // myArray is an array of integers located in global or shared&lt;br /&gt;
  // memory&lt;br /&gt;
  __global__ void MyKernel(int* result) {&lt;br /&gt;
    int tid = threadIdx.x;&lt;br /&gt;
    ...&lt;br /&gt;
    int ref1 = myArray[tid];&lt;br /&gt;
    __syncthreads();&lt;br /&gt;
    myArray[tid + 1] = 2;&lt;br /&gt;
    __syncthreads();&lt;br /&gt;
    int ref2 = myArray[tid];&lt;br /&gt;
    result[tid] = ref1 * ref2;&lt;br /&gt;
    ...&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;ol&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://openmp.org/wp/about-openmp/ OpenMP.org]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[https://docs.google.com/viewer?a=v&amp;amp;pid=gmail&amp;amp;attid=0.1&amp;amp;thid=126f8a391c11262c&amp;amp;mt=application%2Fpdf&amp;amp;url=https%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3D2%26ik%3Dd38b56c94f%26view%3Datt%26th%3D126f8a391c11262c%26attid%3D0.1%26disp%3Dattd%26realattid%3Df_g602ojwk0%26zw&amp;amp;sig=AHIEtbTeQDhK98IswmnVSfrPBMfmPLH5Nw An Optimal Abstraction Model for Hardware Multithreading in Modern Processor Architectures]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.threadingbuildingblocks.org/uploads/81/91/Latest%20Open%20Source%20Documentation/Reference.pdf Intel Threading Building Blocks 2.2 for Open Source Reference Manual]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.csc.ncsu.edu/faculty/efg/506/s10/ NCSU CSC 506 Parallel Computing Systems]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://parallel-for.sourceforge.net/tbb.html Sourceforge.net]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[https://computing.llnl.gov/tutorials/openMP/ OpenMP]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.computer.org/portal/web/csdl/doi/10.1109/SNPD.2009.16 Barrier Optimization for OpenMP Program]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://cs.anu.edu.au/~Alistair.Rendell/sc02/module3.pdf Performance Programming: Theory, Practice and Case Studies]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://software.intel.com/en-us/articles/intel-threading-building-blocks-openmp-or-native-threads/ Intel® Threading Building Blocks, OpenMP, or native threads?]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[https://computing.llnl.gov/tutorials/pthreads/#Joining POSIX Threads Programming by Blaise Barney, Lawrence Livermore National Laboratory]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://homepage.mac.com/dbutenhof/Threads/source.html Programing with POSIX Threads source code]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://refspecs.freestandards.org/IA64-softdevman-vol2.pdf IA-64 Software Development Manual]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://refspecs.freestandards.org/IA32-softdevman-vol3.pdf IA-32 Software Development Manual]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf CUDA Programming Guide]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.google.com/url?sa=t&amp;amp;source=web&amp;amp;cd=6&amp;amp;ved=0CEQQFjAF&amp;amp;url=http%3A%2F%2Flinuxindore.com%2Fdownloads%2Fdownload%2Fdata-structures%2Flinux-kernel-arch&amp;amp;ei=jxZWTaGTNI34sAPWm-ScDA&amp;amp;usg=AFQjCNG9UOAz7rHfwUDfayhr50M87uNOYA&amp;amp;sig2=azvo4h85RkoNHcZUtNIkJw Linux Kernel Architecture Overview]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/ch_3_jb/Parallel_Programming_Model_Support Spring 2010 NC State ECE/CSC506 Chapter 3 wiki]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ol&amp;gt;&lt;/div&gt;</summary>
		<author><name>Akrepask</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch3_ab&amp;diff=43732</id>
		<title>CSC/ECE 506 Spring 2011/ch3 ab</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch3_ab&amp;diff=43732"/>
		<updated>2011-02-14T04:21:54Z</updated>

		<summary type="html">&lt;p&gt;Akrepask: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Supplement to Chapter 3: Support for parallel-programming models. Discuss how DOACROSS, DOPIPE, DOALL, etc. are implemented in packages such as Posix threads, Intel Thread Building Blocks, OpenMP 2.0 and 3.0.&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this wiki supplement, we discuss how the three kinds of parallelism, i.e. DOALL, DOACROSS, and DOPIPE, are implemented in the threading packages OpenMP, Intel Threading Building Blocks, and POSIX Threads. We discuss each package from the perspective of variable scopes and its Reduction/DOALL/DOACROSS/DOPIPE implementations.&lt;br /&gt;
&lt;br /&gt;
==Implementation==&lt;br /&gt;
&lt;br /&gt;
===OpenMP===&lt;br /&gt;
The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix platforms and Windows NT platforms. Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.&lt;br /&gt;
&lt;br /&gt;
====Variable Clauses ====&lt;br /&gt;
There are many different types of clauses in OpenMP, and each has its own characteristics. Here we introduce data sharing attribute clauses, synchronization clauses, scheduling clauses, initialization, and reduction. &lt;br /&gt;
=====Data sharing attribute clauses=====&lt;br /&gt;
* ''shared'': the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.&lt;br /&gt;
  Format: shared ''(list)''&lt;br /&gt;
&lt;br /&gt;
  SHARED variables behave as follows:&lt;br /&gt;
  1. Existing in only one memory location and all threads can read or write to that address &lt;br /&gt;
&lt;br /&gt;
* ''private'': the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.&lt;br /&gt;
  Format: private ''(list)''&lt;br /&gt;
&lt;br /&gt;
  PRIVATE variables behave as follows: &lt;br /&gt;
    1. A new object of the same type is declared once for each thread in the team&lt;br /&gt;
    2. All references to the original object are replaced with references to the new object&lt;br /&gt;
    3. Variables declared PRIVATE should be assumed to be uninitialized for each thread &lt;br /&gt;
&lt;br /&gt;
* ''default'': allows the programmer to state that the default data scoping within a parallel region will be either ''shared'', or ''none'' for C/C++, or ''shared'', ''firstprivate'', ''private'', or ''none'' for Fortran.  The ''none'' option forces the programmer to declare each variable in the parallel region using the data sharing attribute clauses.&lt;br /&gt;
  Format: default (shared | none)&lt;br /&gt;
&lt;br /&gt;
  DEFAULT variables behave as follows: &lt;br /&gt;
    1. Specific variables can be exempted from the default using the PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE, and REDUCTION clauses. &lt;br /&gt;
    2. Using NONE as a default requires that the programmer explicitly scope all variables.&lt;br /&gt;
&lt;br /&gt;
=====Synchronization clauses=====&lt;br /&gt;
* ''critical section'': the enclosed code block will be executed by only one thread at a time, and not simultaneously executed by multiple threads. It is often used to protect shared data from race conditions.&lt;br /&gt;
  Format: #pragma omp critical ''[ name ]  newline''&lt;br /&gt;
           ''structured_block''&lt;br /&gt;
&lt;br /&gt;
  CRITICAL SECTION behaves as follows:&lt;br /&gt;
    1. If a thread is currently executing inside a CRITICAL region and another thread reaches that CRITICAL region and attempts to execute it, it will block until the first thread exits that CRITICAL region.&lt;br /&gt;
    2. It is illegal to branch into or out of a CRITICAL block. &lt;br /&gt;
&lt;br /&gt;
* ''atomic'': similar to ''critical section'', but advise the compiler to use special hardware instructions for better performance. Compilers may choose to ignore this suggestion from users and use ''critical section'' instead.&lt;br /&gt;
  Format: #pragma omp atomic  ''newline''&lt;br /&gt;
           ''statement_expression''&lt;br /&gt;
&lt;br /&gt;
  ATOMIC behaves as follows:&lt;br /&gt;
    1. Applies only to a single, immediately following statement.&lt;br /&gt;
    2. An atomic statement must follow a specific syntax. &lt;br /&gt;
&lt;br /&gt;
* ''ordered'': the structured block is executed in the order in which iterations would be executed in a sequential loop&lt;br /&gt;
  Format: #pragma omp for ordered ''[clauses...]''&lt;br /&gt;
          ''(loop region)''&lt;br /&gt;
          #pragma omp ordered  ''newline''&lt;br /&gt;
          ''structured_block''&lt;br /&gt;
          ''(end of loop region)''&lt;br /&gt;
&lt;br /&gt;
  ORDERED behaves as follows:&lt;br /&gt;
    1. May only appear in the dynamic extent of ''for'' or ''parallel for'' (C/C++).&lt;br /&gt;
    2. Only one thread is allowed in an ordered section at any time.&lt;br /&gt;
    3. It is illegal to branch into or out of an ORDERED block. &lt;br /&gt;
    4. A loop which contains an ORDERED directive must be a loop with an ORDERED clause. &lt;br /&gt;
&lt;br /&gt;
* ''barrier'': each thread waits until all of the other threads of a team have reached this point. A work-sharing construct has an implicit barrier synchronization at the end.&lt;br /&gt;
   Format: #pragma omp barrier  ''newline''&lt;br /&gt;
&lt;br /&gt;
   BARRIER behaves as follows:&lt;br /&gt;
    1. All threads in a team (or none) must execute the BARRIER region.&lt;br /&gt;
    2. The sequence of work-sharing regions and barrier regions encountered must be the same for every thread in a team.&lt;br /&gt;
&lt;br /&gt;
*''taskwait'': specifies a wait on the completion of child tasks generated since the beginning of the current task.&lt;br /&gt;
   Format: #pragma omp taskwait  ''newline''&lt;br /&gt;
&lt;br /&gt;
   TASKWAIT behaves as follows:&lt;br /&gt;
    1. Placed only at a point where a base language statement is allowed.&lt;br /&gt;
    2. It may not be used in place of the statement following an if, while, do, switch, or label.&lt;br /&gt;
&lt;br /&gt;
*''flush'': The FLUSH directive identifies a synchronization point at which the implementation must provide a consistent view of memory. Thread-visible variables are written back to memory at this point. &lt;br /&gt;
   Format: #pragma omp flush ''(list)  newline''&lt;br /&gt;
&lt;br /&gt;
   FLUSH behaves as follows:&lt;br /&gt;
    1. The optional list contains a list of named variables that will be flushed in order to avoid flushing all variables.&lt;br /&gt;
    2. Implementations must ensure any prior modifications to thread-visible variables are visible to all threads after this point.&lt;br /&gt;
&lt;br /&gt;
=====Scheduling clauses=====&lt;br /&gt;
*''schedule(type, chunk)'': This is useful if the work sharing construct is a do-loop or for-loop. The iteration(s) in the work sharing construct are assigned to threads according to the scheduling method defined by this clause. The three types of scheduling are:&lt;br /&gt;
#''static'': Here, all the threads are allocated iterations before they execute the loop iterations. The iterations are divided among threads equally by default. However, specifying an integer for the parameter &amp;quot;chunk&amp;quot; will allocate &amp;quot;chunk&amp;quot; number of contiguous iterations to a particular thread.&lt;br /&gt;
#''dynamic'': Here, some of the iterations are allocated to a smaller number of threads. Once a particular thread finishes its allocated iteration, it returns to get another one from the iterations that are left. The parameter &amp;quot;chunk&amp;quot; defines the number of contiguous iterations that are allocated to a thread at a time.&lt;br /&gt;
#''guided'': A large chunk of contiguous iterations are allocated to each thread dynamically (as above). The chunk size decreases exponentially with each successive allocation to a minimum size specified in the parameter &amp;quot;chunk&amp;quot;&lt;br /&gt;
=====Initialization=====&lt;br /&gt;
* ''firstprivate'': the data is private to each thread, but initialized using the value of the variable using the same name from the master thread.&lt;br /&gt;
  Format: firstprivate ''(list)''&lt;br /&gt;
&lt;br /&gt;
  FIRSTPRIVATE variables behave as follows: &lt;br /&gt;
    1. Listed variables are initialized according to the value of their original objects prior to entry into the parallel or work-sharing construct. &lt;br /&gt;
&lt;br /&gt;
* ''lastprivate'': the data is private to each thread. The value of this private data will be copied to a global variable using the same name outside the parallel region if current iteration is the last iteration in the parallelized loop.  A variable can be both ''firstprivate'' and ''lastprivate''. &lt;br /&gt;
  Format: lastprivate ''(list)''&lt;br /&gt;
&lt;br /&gt;
* ''threadprivate'': the data is global, but it is private in each parallel region during runtime. The difference between ''threadprivate'' and ''private'' is the global scope associated with ''threadprivate'' and its value being preserved across parallel regions.&lt;br /&gt;
  Format: #pragma omp threadprivate ''(list)''&lt;br /&gt;
&lt;br /&gt;
  THREADPRIVATE variables behave as follows: &lt;br /&gt;
    1. On first entry to a parallel region, data in THREADPRIVATE variables and common blocks should be assumed undefined. &lt;br /&gt;
    2. The THREADPRIVATE directive must appear after every declaration of a thread private variable/common block.&lt;br /&gt;
&lt;br /&gt;
=====Reduction=====&lt;br /&gt;
* ''reduction'': the variable has a local copy in each thread, but the values of the local copies will be combined (reduced) into a global shared variable. This is very useful when a particular operation (specified by the &amp;quot;operator&amp;quot; of this clause) on a datatype runs iteratively, so that its value at a particular iteration depends on its value at a previous iteration. Basically, the steps that lead up to the operational increment are parallelized, and the threads then combine their copies into the shared variable in an ordered fashion so as to avoid a race condition. &lt;br /&gt;
  Format: reduction ''(operator: list)''&lt;br /&gt;
&lt;br /&gt;
  REDUCTION variables behave as follows: &lt;br /&gt;
    1. Variables in the list must be named scalar variables. They can not be array or structure type variables. They must also be declared SHARED in the enclosing context.&lt;br /&gt;
    2. Reduction operations may not be associative for real numbers.&lt;br /&gt;
&lt;br /&gt;
====DOALL====&lt;br /&gt;
&lt;br /&gt;
In Code 3.20, the header file ''omp.h'', which contains the OpenMP function declarations, must be included first. A parallel region is started with ''#pragma omp parallel'' and enclosed in curly brackets. The number of threads can be specified with the environment variable OMP_NUM_THREADS (e.g. setenv OMP_NUM_THREADS n) or by calling the function omp_set_num_threads(n) directly. &lt;br /&gt;
Code 3.20 has only one loop, which we want to execute in parallel, so we combine the start of the parallel loop and the start of the parallel region into the single directive ''#pragma omp parallel for''. &lt;br /&gt;
 &lt;br /&gt;
 '''Code 3.20 A DOALL parallelism example in OpenMP'''&lt;br /&gt;
 '''#include''' &amp;lt;omp.h&amp;gt;&lt;br /&gt;
 '''...'''&lt;br /&gt;
 '''#pragma''' omp parallel for default(shared)&lt;br /&gt;
 '''for''' ( i = 0; i &amp;lt; n ; i++)&lt;br /&gt;
   '''A[i]''' = A[i] + A[i] - 3.0;&lt;br /&gt;
&lt;br /&gt;
There is no loop-carried dependence in the ''i'' loop, so with OpenMP we only need to insert the ''parallel for'' pragma directive. The ''default(shared)'' clause states that all variables within the scope of the loop are shared unless otherwise specified.&lt;br /&gt;
&lt;br /&gt;
====DOACROSS====&lt;br /&gt;
We will now introduce how to implement DOACROSS in OpenMP. Here is example code that has not been parallelized yet.&lt;br /&gt;
 &lt;br /&gt;
 '''Sample Code'''&lt;br /&gt;
 01: for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 02: for(j=1; j&amp;lt;N; j++){&lt;br /&gt;
 03: a[i][j]=a[i-1][j]+a[i][j-1];&lt;br /&gt;
 04: }&lt;br /&gt;
 05: }&lt;br /&gt;
&lt;br /&gt;
From this sample code, there is obviously a loop-carried dependence: &lt;br /&gt;
 a[i,j] -&amp;gt;T a[i+1,j] and a[i,j] -&amp;gt;T a[i,j+1]&lt;br /&gt;
&lt;br /&gt;
In OpenMP, DOALL parallelism can be implemented by inserting a “#pragma omp for” before the “for” structure in the source code, but there is no pragma corresponding to DOACROSS parallelism.&lt;br /&gt;
&lt;br /&gt;
When we implement DOACROSS, we use a shared array &amp;quot;_mylocks[threadid]&amp;quot;, which is defined to store the synchronization events of each thread. Besides, a private variable _counter0 is defined to indicate the event that the current thread is waiting for. &lt;br /&gt;
The number of threads is obtained from the function &amp;quot;omp_get_num_threads()&amp;quot; and the current thread's id from the function &amp;quot;omp_get_thread_num()&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
*omp_get_num_threads(): Returns the number of threads that are currently in the team executing the parallel region from which it is called.&lt;br /&gt;
 Format: #include &amp;lt;omp.h&amp;gt;&lt;br /&gt;
         int omp_get_num_threads(void)&lt;br /&gt;
&lt;br /&gt;
 OMP_GET_NUM_THREADS behaves as follows:&lt;br /&gt;
  1. If this call is made from a serial portion of the program, or a nested parallel region that is serialized, it will return 1. &lt;br /&gt;
  2. The default number of threads is implementation dependent. &lt;br /&gt;
&lt;br /&gt;
*omp_get_thread_num(): Returns the thread number, within the team, of the thread making this call. This number will be between 0 and OMP_GET_NUM_THREADS-1. The master thread of the team is thread 0. &lt;br /&gt;
 Format: #include &amp;lt;omp.h&amp;gt;&lt;br /&gt;
         int omp_get_thread_num(void)&lt;br /&gt;
&lt;br /&gt;
 OMP_GET_THREAD_NUM behaves as follows:&lt;br /&gt;
  1. If called from a nested parallel region, or a serial region, this function will return 0. &lt;br /&gt;
&lt;br /&gt;
Now, let's look at the parallelized code and its explanation. &lt;br /&gt;
&lt;br /&gt;
 01: int _mylocks[256]; 		//per-thread synchronization array&lt;br /&gt;
 02: #pragma omp parallel&lt;br /&gt;
 03: {&lt;br /&gt;
 04:   int _counter0 = 1;&lt;br /&gt;
 05:   int _my_id = omp_get_thread_num();&lt;br /&gt;
 06:   int _my_nprocs= omp_get_num_threads();&lt;br /&gt;
 07:   _mylocks[_my_id] = 0;&lt;br /&gt;
 08:   for(j_tile = 0; j_tile&amp;lt;N-1; j_tile+=M){&lt;br /&gt;
 09:     if(_my_id&amp;gt;0) {&lt;br /&gt;
 10:       do{&lt;br /&gt;
 11:         #pragma omp flush(_mylocks)&lt;br /&gt;
 12:       } while(_mylocks[_my_id-1]&amp;lt;_counter0);&lt;br /&gt;
 13:       #pragma omp flush(a, _mylocks)&lt;br /&gt;
 14:       _counter0 += 1;&lt;br /&gt;
 15:     }&lt;br /&gt;
 16:     #pragma omp for nowait&lt;br /&gt;
 17:     for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 18:       for(j=j_tile;j&amp;lt;j_tile+M;j++){&lt;br /&gt;
 19:         a[i][j]=a[i-1][j]+a[i][j-1];&lt;br /&gt;
 20:       }&lt;br /&gt;
 21:     }&lt;br /&gt;
 22:     _mylocks[_my_id] += 1;&lt;br /&gt;
 23:     #pragma omp flush(a, _mylocks)&lt;br /&gt;
 24:   }&lt;br /&gt;
 25: }&lt;br /&gt;
&lt;br /&gt;
We parallelized the original program in two steps.&lt;br /&gt;
&lt;br /&gt;
*First step: We divide the i loop among the four processors by inserting the OpenMP construct “#pragma omp for nowait” (line 16), so each processor takes four iterations of loop i. The j loop is handled the same way: assuming each block has size 4, each processor executes four iterations of loop j per block. To keep the total number of iterations equal to the original program, the tiling loop over j_tile must enclose loop i, so the new loop looks like ''for (j_tile = 2; j_tile &amp;lt;= 15; j_tile += 4)'' (line 08).&lt;br /&gt;
The lower bound of loop j (line 18) is set to j_tile and the upper bound to j_tile+3. The other statements are unchanged.&lt;br /&gt;
&lt;br /&gt;
*Second step: We synchronize neighboring threads. After the first step, the four processors each finish computing a 4x4 block. If all four processors ran without coordination, the dependence would be violated, so we synchronize neighboring threads.&lt;br /&gt;
We use four variables: &lt;br /&gt;
1. A private variable _my_nprocs = omp_get_num_threads(), which holds the total number of threads running the parallel region.&lt;br /&gt;
2. A private variable _my_id = omp_get_thread_num(), which holds the unique ID of the current thread.&lt;br /&gt;
3. A shared array _mylocks[proc], with every element initialized to 0, which indicates whether thread proc-1 has finished computing the current block.&lt;br /&gt;
4. A private variable _counter0, initialized to 1, which indicates the block the current thread is waiting for.&lt;br /&gt;
&lt;br /&gt;
With the four variables, threads are synchronized:&lt;br /&gt;
The first thread continues to run without waiting (line 9) because its thread ID is 0. Every other thread cannot proceed past line 12 while the value in ''_mylocks[_my_id-1]'' is smaller than ''_counter0''.&lt;br /&gt;
&lt;br /&gt;
Once the block the current thread is waiting for has been completed, the thread falls through the loop at line 12 and marks the next block it will wait for by adding 1 to ''_counter0'' (line 14).&lt;br /&gt;
&lt;br /&gt;
When the current thread finishes its block, it records that it has finished by incrementing ''_mylocks[_my_id]'' (line 22). Once the neighboring thread sees the changed value, it continues running, and so on. The figure below illustrates this.&lt;br /&gt;
[[Image:Synchorization.jpg]]&lt;br /&gt;
&lt;br /&gt;
====DOPIPE====&lt;br /&gt;
Here is another example, which we are going to parallelize with DOPIPE parallelism. The sample code contains a true dependence from S1 to S2 (S1 -&amp;gt;T S2), since S2 reads a[i], which is written by S1, as well as a loop-carried dependence of S2 on itself through c[i-1].&lt;br /&gt;
 '''Sample Code'''&lt;br /&gt;
 01: for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 02:   S1: a[i]=b[i];&lt;br /&gt;
 03:   S2: c[i]=c[i-1]+a[i];&lt;br /&gt;
 04: }&lt;br /&gt;
Now let's see how to parallelize the sample code with DOPIPE parallelism.&lt;br /&gt;
We again use a shared array &amp;quot;_mylocks[threadid]&amp;quot; to store each thread's completion events, and a private variable _counter0 to indicate the event the current thread is waiting for. &lt;br /&gt;
The number of threads is obtained with &amp;quot;omp_get_num_threads()&amp;quot; and the current thread's ID with &amp;quot;omp_get_thread_num()&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
 01: int _mylocks[256]; 			//per-thread synchronization flags&lt;br /&gt;
 02: #pragma omp parallel&lt;br /&gt;
 03: {&lt;br /&gt;
 04:   int _counter0 = 1;&lt;br /&gt;
 05:   int _my_id = omp_get_thread_num();&lt;br /&gt;
 06:   int _my_nprocs= omp_get_num_threads();&lt;br /&gt;
 07:   _mylocks[_my_id] = 0;&lt;br /&gt;
 08:   for(i_tile = 0; i_tile&amp;lt;N-1; i_tile+=M){&lt;br /&gt;
 09:     if(_my_id&amp;gt;0) {&lt;br /&gt;
 10:       do{&lt;br /&gt;
 11:         #pragma omp flush(_mylocks)&lt;br /&gt;
 12:       } while(_mylocks[_my_id-1]&amp;lt;_counter0);&lt;br /&gt;
 13:       #pragma omp flush(a, _mylocks)&lt;br /&gt;
 14:       _counter0 += 1;&lt;br /&gt;
 15:     }&lt;br /&gt;
 16:     #pragma omp for nowait&lt;br /&gt;
 17:     for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 18:       a[i]=b[i];&lt;br /&gt;
 19:     }&lt;br /&gt;
 20:     for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 21:       c[i]=c[i-1]+a[i];&lt;br /&gt;
 22:     }&lt;br /&gt;
 23:     _mylocks[_my_id] += 1;&lt;br /&gt;
 24:     #pragma omp flush(a, _mylocks)&lt;br /&gt;
 25:   }&lt;br /&gt;
 26: }&lt;br /&gt;
&lt;br /&gt;
We parallelized the original program in two steps.&lt;br /&gt;
&lt;br /&gt;
*First step: We divide the i loop among the processors by inserting the OpenMP construct “#pragma omp for nowait” (line 16), so each processor takes a share of the iterations of loop i. There are now two i loops, each containing different statements; the other statements remain unchanged.&lt;br /&gt;
&lt;br /&gt;
*Second step: We synchronize the threads. After the first step, the processors finish computing a[i]=b[i]. If all processors executed the second i loop without coordination, the dependence would be violated, so neighboring threads must be synchronized.&lt;br /&gt;
Again, we use four variables: &lt;br /&gt;
1. A private variable _my_nprocs = omp_get_num_threads(), which holds the total number of threads running the parallel region.&lt;br /&gt;
2. A private variable _my_id = omp_get_thread_num(), which holds the unique ID of the current thread.&lt;br /&gt;
3. A shared array _mylocks[proc], with every element initialized to 0, which indicates whether thread proc-1 has finished computing the current block.&lt;br /&gt;
4. A private variable _counter0, initialized to 1, which indicates the block the current thread is waiting for.&lt;br /&gt;
&lt;br /&gt;
When the current thread finishes its block, it records that it has finished by incrementing ''_mylocks[_my_id]''. Once a processor finishes its block, the processors that depend on it can read the produced values and use them in their own statements.&lt;br /&gt;
&lt;br /&gt;
====Functional Parallelism====&lt;br /&gt;
&lt;br /&gt;
In order to introduce functional parallelism, we want to execute one code section in parallel with another. Code 3.21 shows two loops that execute in parallel with respect to one another, although each loop itself executes sequentially.&lt;br /&gt;
&lt;br /&gt;
 '''Code''' 3.21 A function parallelism example in OpenMP&lt;br /&gt;
 '''#pragma''' omp parallel shared(A, B) private(i)&lt;br /&gt;
 '''{'''&lt;br /&gt;
  '''#pragma''' omp sections nowait&lt;br /&gt;
  '''{'''&lt;br /&gt;
      '''#pragma''' omp section&lt;br /&gt;
      '''for'''( i = 0; i &amp;lt; n ; i++)&lt;br /&gt;
         '''A[i]''' = A[i]*A[i] - 4.0;&lt;br /&gt;
      '''#pragma''' omp section&lt;br /&gt;
      '''for'''( i = 0; i &amp;lt; n ; i++)&lt;br /&gt;
         '''B[i]''' = B[i]*B[i] - 9.0;&lt;br /&gt;
  '''}'''//end omp sections&lt;br /&gt;
 '''}'''//end omp parallel&lt;br /&gt;
&lt;br /&gt;
In code 3.21, the two loops need to be executed in parallel. We just need to insert the two ''#pragma omp section'' directives; once they are in place, the two loops can execute concurrently, each in its own section.&lt;br /&gt;
&lt;br /&gt;
===Intel Thread Building Blocks===&lt;br /&gt;
&lt;br /&gt;
Intel Threading Building Blocks (Intel TBB) is a library that supports scalable &lt;br /&gt;
parallel programming using standard ISO C++ code. It does not require special &lt;br /&gt;
languages or compilers. It is designed to promote scalable data parallel programming. &lt;br /&gt;
The library consists of data structures and algorithms that allow a programmer to avoid some complications arising from the use of native threading packages such as POSIX threads, Windows threads, or the portable Boost Threads, in which individual threads of execution are created, synchronized, and terminated manually. Instead, the library abstracts access to the multiple processors by allowing the operations to be treated as &amp;quot;tasks,&amp;quot; which are allocated to individual cores dynamically by the library's run-time engine, and by automating efficient use of the cache. This approach places TBB in a family of solutions for parallel programming that aim to decouple the programming from the particulars of the underlying machine. The net result is that Intel Threading Building Blocks lets you specify parallelism more conveniently than using raw threads, and at the same time can improve performance.&lt;br /&gt;
&lt;br /&gt;
====Variables Scope====&lt;br /&gt;
&lt;br /&gt;
Intel TBB is a collection of components for parallel programming; here is an overview of the library contents:&lt;br /&gt;
&lt;br /&gt;
* Basic algorithms: parallel_for, parallel_reduce, parallel_scan&lt;br /&gt;
* Advanced algorithms: parallel_while, parallel_do, pipeline, parallel_sort&lt;br /&gt;
* Containers: concurrent_queue, concurrent_vector, concurrent_hash_map&lt;br /&gt;
* Scalable memory allocation: scalable_malloc, scalable_free, scalable_realloc, scalable_calloc, scalable_allocator, cache_aligned_allocator&lt;br /&gt;
* Mutual exclusion: mutex, spin_mutex, queuing_mutex, spin_rw_mutex, queuing_rw_mutex, recursive_mutex&lt;br /&gt;
* Atomic operations: fetch_and_add, fetch_and_increment, fetch_and_decrement, compare_and_swap, fetch_and_store&lt;br /&gt;
* Timing: portable fine grained global time stamp&lt;br /&gt;
* Task Scheduler: direct access to control the creation and activation of tasks&lt;br /&gt;
&lt;br /&gt;
Next, we will focus on some specific TBB components.&lt;br /&gt;
&lt;br /&gt;
=====parallel_for=====&lt;br /&gt;
&lt;br /&gt;
parallel_for is the template function that performs parallel iteration over a range of values. In Intel TBB, many DOALL cases can be implemented using this function. The syntax is as follows: &lt;br /&gt;
 template&amp;lt;typename Index, typename Function&amp;gt;&lt;br /&gt;
 Function parallel_for(Index first, Index last, Index step, Function f);&lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_for( const Range&amp;amp; range, const Body&amp;amp; body [, partitioner] );&lt;br /&gt;
&lt;br /&gt;
A parallel_for(first, last, step, f) represents parallel execution of the loop: &amp;quot;for( auto i=first; i&amp;lt;last; i+=step ) f(i);&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=====parallel_reduce=====&lt;br /&gt;
&lt;br /&gt;
The parallel_reduce function computes a reduction over a range. The syntax is as follows:&lt;br /&gt;
 template&amp;lt;typename Range, typename Value, typename Func, typename Reduction&amp;gt;&lt;br /&gt;
 Value parallel_reduce( const Range&amp;amp; range, const Value&amp;amp; identity, const Func&amp;amp; func, const Reduction&amp;amp; reduction );&lt;br /&gt;
&lt;br /&gt;
The functional form parallel_reduce(range,identity,func,reduction) performs a&lt;br /&gt;
parallel reduction by applying func to subranges in range and reducing the results&lt;br /&gt;
using binary operator reduction. It returns the result of the reduction. Parameter func&lt;br /&gt;
and reduction can be lambda expressions.&lt;br /&gt;
&lt;br /&gt;
=====parallel_scan=====&lt;br /&gt;
&lt;br /&gt;
This template function computes a parallel prefix. The syntax is as follows:&lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_scan( const Range&amp;amp; range, Body&amp;amp; body );&lt;br /&gt;
 &lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_scan( const Range&amp;amp; range, Body&amp;amp; body, const auto_partitioner&amp;amp; );&lt;br /&gt;
 &lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_scan( const Range&amp;amp; range, Body&amp;amp; body, const simple_partitioner&amp;amp; );&lt;br /&gt;
&lt;br /&gt;
A parallel_scan(range,body) computes a parallel prefix, also known as parallel&lt;br /&gt;
scan. This computation is an advanced concept in parallel computing that is&lt;br /&gt;
sometimes useful in scenarios that appear to have inherently serial dependences. A&lt;br /&gt;
further explanation will be given in the DOACROSS example.&lt;br /&gt;
&lt;br /&gt;
=====pipeline=====&lt;br /&gt;
&lt;br /&gt;
This class performs pipelined execution. Its members are as follows:&lt;br /&gt;
 namespace tbb {&lt;br /&gt;
     class pipeline {&lt;br /&gt;
     public:&lt;br /&gt;
        pipeline();&lt;br /&gt;
        ~pipeline(); &lt;br /&gt;
        void add_filter( filter&amp;amp; f );&lt;br /&gt;
        void run( size_t max_number_of_live_tokens );&lt;br /&gt;
        void clear();&lt;br /&gt;
   };&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
A pipeline represents pipelined application of a series of filters to a stream of items.&lt;br /&gt;
Each filter operates in a particular mode: parallel, serial in order, or serial out of order. With a parallel filter, &lt;br /&gt;
we could implement DOPIPE parallelism.&lt;br /&gt;
&lt;br /&gt;
====Reduction====&lt;br /&gt;
&lt;br /&gt;
The reduction in Intel TBB is implemented using parallel_reduce function. A parallel_reduce recursively splits the range into subranges and uses the splitting constructor to make one or more copies of the body for each thread. We use an example to illustrate this: &lt;br /&gt;
&lt;br /&gt;
 #include &amp;quot;tbb/parallel_reduce.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/blocked_range.h&amp;quot;&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
 struct Sum {&lt;br /&gt;
     float value;&lt;br /&gt;
     Sum() : value(0) {}&lt;br /&gt;
     Sum( Sum&amp;amp; s, split ) {value = 0;}&lt;br /&gt;
     void operator()( const blocked_range&amp;lt;float*&amp;gt;&amp;amp; r ) {&lt;br /&gt;
         float temp = value;&lt;br /&gt;
         for( float* a=r.begin(); a!=r.end(); ++a ) {&lt;br /&gt;
             temp += *a;&lt;br /&gt;
         }&lt;br /&gt;
         value = temp;&lt;br /&gt;
     }&lt;br /&gt;
     void join( Sum&amp;amp; rhs ) {value += rhs.value;}&lt;br /&gt;
 };&lt;br /&gt;
 float ParallelSum( float array[], size_t n ) {&lt;br /&gt;
     Sum total;&lt;br /&gt;
     parallel_reduce( blocked_range&amp;lt;float*&amp;gt;( array, array+n ), total );&lt;br /&gt;
     return total.value;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The above example sums the values in the array. parallel_reduce performs the reduction over the range [array, array+n): it splits the body into copies that each sum a subrange, then joins the partial results of the splits.&lt;br /&gt;
&lt;br /&gt;
====DOALL====&lt;br /&gt;
The implementation of DOALL parallelism in Intel TBB uses the parallel_for function. &lt;br /&gt;
To better illustrate the usage, here we discuss a simple example. The following is the original code:&lt;br /&gt;
 &lt;br /&gt;
 void SerialApplyFoo( float a[], size_t n ) {&lt;br /&gt;
     for( size_t i=0; i&amp;lt;n; ++i )&lt;br /&gt;
         Foo(a[i]);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
After using Intel TBB, it could be switched to the following:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;quot;tbb/blocked_range.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/parallel_for.h&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
 class ApplyFoo {&lt;br /&gt;
     float *const my_a;&lt;br /&gt;
 public:&lt;br /&gt;
     void operator( )( const blocked_range&amp;lt;size_t&amp;gt;&amp;amp; r ) const {&lt;br /&gt;
         float *a = my_a;&lt;br /&gt;
         for( size_t i=r.begin(); i!=r.end( ); ++i )&lt;br /&gt;
             Foo(a[i]);&lt;br /&gt;
     }&lt;br /&gt;
     ApplyFoo( float a[] ) :&lt;br /&gt;
         my_a(a)&lt;br /&gt;
     {}&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 void ParallelApplyFoo( float a[], size_t n ) {&lt;br /&gt;
     parallel_for(blocked_range&amp;lt;size_t&amp;gt;(0,n,The_grain_size_You_Pick), ApplyFoo(a) );&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
This example is the simplest form of DOALL parallelism, similar to the one in the textbook, and its execution graph is very similar to the one in the DOALL section above. Even so, this simple illustration gives you a flavor of how DOALL parallelism is implemented in Intel Threading Building Blocks.&lt;br /&gt;
&lt;br /&gt;
One more note: parallel_for takes an optional third argument to specify a partitioner, which we represented above as &amp;quot;The_grain_size_You_Pick&amp;quot;. If you want to divide the work manually and assign it to processors, you can specify the grain size in the range. Alternatively, you can use the automatic partitioning provided by TBB: the auto_partitioner heuristically chooses the grain size so that you do not have to specify one, attempting to limit overhead while still providing ample opportunities for load balancing. The last three lines of the TBB code above then become:&lt;br /&gt;
&lt;br /&gt;
 void ParallelApplyFoo( float a[], size_t n ) {&lt;br /&gt;
     parallel_for(blocked_range&amp;lt;size_t&amp;gt;(0,n), ApplyFoo(a), auto_partitioner( ) );&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
====DOACROSS====&lt;br /&gt;
&lt;br /&gt;
We can find a good example of implementing DOACROSS in Intel TBB with the help of parallel_scan. As stated in the parallel_scan section, this function computes a parallel prefix, also known as a parallel scan.&lt;br /&gt;
This computation is an advanced concept in parallel computing that can help in scenarios that appear to have inherently serial, loop-carried dependences. &lt;br /&gt;
&lt;br /&gt;
Let's consider this scenario (which is actually the mathematical definition of parallel prefix):  &lt;br /&gt;
 T temp = id⊕;&lt;br /&gt;
 for( int i=1; i&amp;lt;=n; ++i ) {&lt;br /&gt;
     temp = temp ⊕ x[i];&lt;br /&gt;
     y[i] = temp;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we implement this in TBB using parallel_scan, it becomes:&lt;br /&gt;
&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
 class Body {&lt;br /&gt;
     T sum;&lt;br /&gt;
     T* const y;&lt;br /&gt;
     const T* const x;&lt;br /&gt;
 public:&lt;br /&gt;
     Body( T y_[], const T x_[] ) : sum(id⊕), x(x_), y(y_) {}&lt;br /&gt;
     T get_sum() const {return sum;}&lt;br /&gt;
     template&amp;lt;typename Tag&amp;gt;&lt;br /&gt;
     void operator()( const blocked_range&amp;lt;int&amp;gt;&amp;amp; r, Tag ) {&lt;br /&gt;
         T temp = sum;&lt;br /&gt;
         for( int i=r.begin(); i&amp;lt;r.end(); ++i ) {&lt;br /&gt;
             temp = temp ⊕ x[i];&lt;br /&gt;
             if( Tag::is_final_scan() )&lt;br /&gt;
                 y[i] = temp;&lt;br /&gt;
         } &lt;br /&gt;
         sum = temp;&lt;br /&gt;
     }&lt;br /&gt;
     Body( Body&amp;amp; b, split ) : x(b.x), y(b.y), sum(id⊕) {}&lt;br /&gt;
     void reverse_join( Body&amp;amp; a ) { sum = a.sum ⊕ sum;}&lt;br /&gt;
     void assign( Body&amp;amp; b ) {sum = b.sum;}&lt;br /&gt;
 };&lt;br /&gt;
&lt;br /&gt;
 float DoParallelScan( T y[], const T x[], int n ) {&lt;br /&gt;
     Body body(y,x);&lt;br /&gt;
     parallel_scan( blocked_range&amp;lt;int&amp;gt;(0,n), body );&lt;br /&gt;
     return body.get_sum();&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
It is the second part (function DoParallelScan) that we have to focus on. &lt;br /&gt;
&lt;br /&gt;
Actually, this example is exactly the scenario mentioned above that can take advantage of parallel_scan. The &amp;quot;inherently serial dependences&amp;quot; are taken care of by the functionality of parallel_scan: by computing the prefix, the serial code can be implemented in parallel with just one function call.&lt;br /&gt;
&lt;br /&gt;
====DOPIPE====&lt;br /&gt;
&lt;br /&gt;
The pipeline class is the Intel TBB class that performs pipelined execution. A pipeline represents the pipelined application of a series of filters to a stream of items. Each filter operates in a particular mode: parallel, serial in order, or serial out of order. This class can therefore be used to implement DOPIPE parallelism.&lt;br /&gt;
&lt;br /&gt;
Here is a comparatively complex example of a pipeline implementation. If we look carefully, it is an example of both DOPIPE and DOACROSS:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;iostream&amp;gt;&lt;br /&gt;
 #include &amp;quot;tbb/pipeline.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/tbb_thread.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/task_scheduler_init.h&amp;quot;&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
 char InputString[] = &amp;quot;abcdefg\n&amp;quot;;&lt;br /&gt;
 class InputFilter: public filter {&lt;br /&gt;
     char* my_ptr;&lt;br /&gt;
 public:&lt;br /&gt;
     void* operator()(void*) {&lt;br /&gt;
         if (*my_ptr)&lt;br /&gt;
             return my_ptr++;&lt;br /&gt;
         else&lt;br /&gt;
             return NULL;&lt;br /&gt;
     }&lt;br /&gt;
     InputFilter() :&lt;br /&gt;
         filter( serial_in_order ), my_ptr(InputString) {}&lt;br /&gt;
 };&lt;br /&gt;
 class OutputFilter: public thread_bound_filter {&lt;br /&gt;
 public:&lt;br /&gt;
     void* operator()(void* item) {&lt;br /&gt;
         std::cout &amp;lt;&amp;lt; *(char*)item;&lt;br /&gt;
         return NULL;&lt;br /&gt;
     }&lt;br /&gt;
     OutputFilter() : thread_bound_filter(serial_in_order) {}&lt;br /&gt;
 };&lt;br /&gt;
 void RunPipeline(pipeline* p) {&lt;br /&gt;
     p-&amp;gt;run(8);&lt;br /&gt;
 }&lt;br /&gt;
 int main() {&lt;br /&gt;
     // Construct the pipeline&lt;br /&gt;
     InputFilter f;&lt;br /&gt;
     OutputFilter g;&lt;br /&gt;
     pipeline p;&lt;br /&gt;
     p.add_filter(f);&lt;br /&gt;
     p.add_filter(g);&lt;br /&gt;
     // Another thread initiates execution of the pipeline&lt;br /&gt;
     tbb_thread t(RunPipeline,&amp;amp;p);&lt;br /&gt;
     // Process the thread_bound_filter with the current thread.&lt;br /&gt;
     while (g.process_item()!=thread_bound_filter::end_of_stream)&lt;br /&gt;
         continue;&lt;br /&gt;
     // Wait for pipeline to finish on the other thread.&lt;br /&gt;
     t.join();&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The example above shows a pipeline with two filters where the second filter is a thread_bound_filter serviced by the main thread. The main thread does the following after constructing the pipeline:&lt;br /&gt;
1. Start the pipeline on another thread.&lt;br /&gt;
2. Service the thread_bound_filter until it reaches end_of_stream.&lt;br /&gt;
3. Wait for the other thread to finish.&lt;br /&gt;
&lt;br /&gt;
===POSIX Threads===&lt;br /&gt;
&lt;br /&gt;
POSIX Threads, or Pthreads, is a POSIX standard for threads. The standard, POSIX.1c, Threads extensions (IEEE Std 1003.1c-1995), defines an API for creating and manipulating threads.&lt;br /&gt;
&lt;br /&gt;
====Variable Scopes====&lt;br /&gt;
Pthreads defines a set of C programming language types, functions and constants. It is implemented with a pthread.h header and a thread library.&lt;br /&gt;
&lt;br /&gt;
There are around 100 Pthreads procedures, all prefixed &amp;quot;pthread_&amp;quot;. The subroutines which comprise the Pthreads API can be informally grouped into four major groups:&lt;br /&gt;
&lt;br /&gt;
* '''Thread management:''' Routines that work directly on threads: creating, detaching, joining, etc. They also include functions to set/query thread attributes (joinable, scheduling, etc.). E.g. pthread_create(), pthread_join().&lt;br /&gt;
* '''Mutexes:''' Routines that deal with mutexes (&amp;quot;mutex&amp;quot; is an abbreviation for &amp;quot;mutual exclusion&amp;quot;). Mutex functions provide for creating, destroying, locking and unlocking mutexes. These are supplemented by mutex attribute functions that set or modify attributes associated with mutexes. E.g. pthread_mutex_lock(), pthread_mutex_trylock(), pthread_mutex_unlock().&lt;br /&gt;
* '''Condition variables:''' Routines that address communications between threads that share a mutex, based upon programmer-specified conditions. This group includes functions to create, destroy, wait and signal based upon specified variable values, as well as functions to set/query condition variable attributes. E.g. pthread_cond_signal(), pthread_cond_broadcast(), pthread_cond_wait(), pthread_cond_timedwait(), pthread_cond_reltimedwait_np().&lt;br /&gt;
* '''Synchronization:''' Routines that manage read/write locks and barriers. E.g. pthread_rwlock_rdlock(), pthread_rwlock_tryrdlock(), pthread_rwlock_wrlock(), pthread_rwlock_trywrlock(), pthread_rwlock_unlock(), pthread_barrier_init(), pthread_barrier_wait().&lt;br /&gt;
&lt;br /&gt;
====DOALL====&lt;br /&gt;
&lt;br /&gt;
The following is a simple C code example of DOALL parallelism that prints each thread's ID.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;
 #define NUM_THREADS     5&lt;br /&gt;
  &lt;br /&gt;
 void *PrintHello(void *threadid)&lt;br /&gt;
 {&lt;br /&gt;
    long tid;&lt;br /&gt;
  &lt;br /&gt;
    tid = (long)threadid;&lt;br /&gt;
    printf(&amp;quot;Hello World! It's me, thread #%ld!\n&amp;quot;, tid);&lt;br /&gt;
    pthread_exit(NULL);&lt;br /&gt;
 }&lt;br /&gt;
  &lt;br /&gt;
 int main (int argc, char *argv[])&lt;br /&gt;
 {&lt;br /&gt;
    pthread_t threads[NUM_THREADS];&lt;br /&gt;
  &lt;br /&gt;
    int rc;&lt;br /&gt;
    long t;&lt;br /&gt;
    for(t=0; t&amp;lt;NUM_THREADS; t++){&lt;br /&gt;
       printf(&amp;quot;In main: creating thread %ld\n&amp;quot;, t);&lt;br /&gt;
       rc = pthread_create(&amp;amp;threads[t], NULL, PrintHello, (void *)t);&lt;br /&gt;
  &lt;br /&gt;
       if (rc){&lt;br /&gt;
          printf(&amp;quot;ERROR; return code from pthread_create() is %d\n&amp;quot;, rc);&lt;br /&gt;
          exit(-1);&lt;br /&gt;
       }&lt;br /&gt;
    }&lt;br /&gt;
    pthread_exit(NULL);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
This loop body contains only a single statement with no cross-iteration dependence, so each iteration can be treated as an independent parallel task.&lt;br /&gt;
&lt;br /&gt;
====DOACROSS====&lt;br /&gt;
&lt;br /&gt;
When it comes to using Pthreads to implement DOACROSS, Pthreads can express the parallelism, but it makes the code unnecessarily complicated. See the example below, from '''POSIX Threads Programming''' by Blaise Barney:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;math.h&amp;gt;&lt;br /&gt;
 #define NUM_THREADS	4&lt;br /&gt;
 &lt;br /&gt;
 void *BusyWork(void *t)&lt;br /&gt;
 {&lt;br /&gt;
   int i;&lt;br /&gt;
   long tid;&lt;br /&gt;
   double result=0.0;&lt;br /&gt;
   tid = (long)t;&lt;br /&gt;
   printf(&amp;quot;Thread %ld starting...\n&amp;quot;,tid);&lt;br /&gt;
   for (i=0; i&amp;lt;1000000; i++)&lt;br /&gt;
   {&lt;br /&gt;
      result = result + sin(i) * tan(i);&lt;br /&gt;
   }&lt;br /&gt;
   printf(&amp;quot;Thread %ld done. Result = %e\n&amp;quot;,tid, result);&lt;br /&gt;
   pthread_exit((void*) t);&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main (int argc, char *argv[])&lt;br /&gt;
 {&lt;br /&gt;
   pthread_t thread[NUM_THREADS];&lt;br /&gt;
   pthread_attr_t attr;&lt;br /&gt;
   int rc;&lt;br /&gt;
   long t;&lt;br /&gt;
   void *status;&lt;br /&gt;
 &lt;br /&gt;
   /* Initialize and set thread detached attribute */&lt;br /&gt;
   pthread_attr_init(&amp;amp;attr);&lt;br /&gt;
   pthread_attr_setdetachstate(&amp;amp;attr, PTHREAD_CREATE_JOINABLE);&lt;br /&gt;
 &lt;br /&gt;
   for(t=0; t&amp;lt;NUM_THREADS; t++) {&lt;br /&gt;
      printf(&amp;quot;Main: creating thread %ld\n&amp;quot;, t);&lt;br /&gt;
      rc = pthread_create(&amp;amp;thread[t], &amp;amp;attr, BusyWork, (void *)t); &lt;br /&gt;
      if (rc) {&lt;br /&gt;
          printf(&amp;quot;ERROR; return code from pthread_create() is %d\n&amp;quot;, rc);&lt;br /&gt;
         exit(-1);&lt;br /&gt;
         }&lt;br /&gt;
      }&lt;br /&gt;
 &lt;br /&gt;
   /* Free attribute and wait for the other threads */&lt;br /&gt;
   pthread_attr_destroy(&amp;amp;attr);&lt;br /&gt;
   for(t=0; t&amp;lt;NUM_THREADS; t++) {&lt;br /&gt;
      rc = pthread_join(thread[t], &amp;amp;status);&lt;br /&gt;
      if (rc) {&lt;br /&gt;
          printf(&amp;quot;ERROR; return code from pthread_join() is %d\n&amp;quot;, rc);&lt;br /&gt;
         exit(-1);&lt;br /&gt;
         }&lt;br /&gt;
       printf(&amp;quot;Main: completed join with thread %ld having a status of %ld\n&amp;quot;,t,(long)status);&lt;br /&gt;
      }&lt;br /&gt;
 &lt;br /&gt;
 printf(&amp;quot;Main: program completed. Exiting.\n&amp;quot;);&lt;br /&gt;
 pthread_exit(NULL);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
This example demonstrates how to &amp;quot;wait&amp;quot; for thread completions by using the Pthread join routine. Since some implementations of Pthreads may not create threads in a joinable state, the threads in this example are explicitly created in a joinable state so that they can be joined later.&lt;br /&gt;
&lt;br /&gt;
====DOPIPE====&lt;br /&gt;
There are examples of using POSIX Threads to implement DOPIPE parallelism, but they are unnecessarily complex. Due to their length, we do not reproduce one here; the interested reader can find one at [http://homepage.mac.com/dbutenhof/Threads/code/pipe.c Pthreads DOPIPE example].&lt;br /&gt;
&lt;br /&gt;
===Comparison among the three===&lt;br /&gt;
&lt;br /&gt;
====A unified example====&lt;br /&gt;
&lt;br /&gt;
We use a simple parallel example from [http://sourceforge.net Sourceforge.net] to show how it would be implemented in each of the three packages (POSIX Threads, Intel TBB, and OpenMP), and to highlight some commonalities and differences among them.&lt;br /&gt;
&lt;br /&gt;
Following is the original code:&lt;br /&gt;
&lt;br /&gt;
 Grid1 *g = new Grid1(0, n+1);&lt;br /&gt;
 Grid1IteratorSub it(1, n, g);&lt;br /&gt;
 DistArray x(g), y(g);&lt;br /&gt;
 ...&lt;br /&gt;
 float e = 0;&lt;br /&gt;
 ForEach(int i, it,&lt;br /&gt;
    x(i) += ( y(i+1) + y(i-1) )*.5;&lt;br /&gt;
    e += sqr( y(i) ); )&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
Then we are going to show the implementations in different packages, and also make a brief summary of the three packages.&lt;br /&gt;
&lt;br /&gt;
=====In POSIX Thread=====&lt;br /&gt;
&lt;br /&gt;
POSIX Threads targets symmetric multiprocessing, e.g. SMP multi-processor computers, multi-core processors, and virtual shared memory computers.&lt;br /&gt;
&lt;br /&gt;
Data layout: A single global memory. Each thread reads global shared data and writes to a private fraction of global data.&lt;br /&gt;
&lt;br /&gt;
A simplified translation of the example parallel-for loop is given below.&lt;br /&gt;
&lt;br /&gt;
Global declaration:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 float *x, *y;&lt;br /&gt;
 float vec[8];&lt;br /&gt;
 int nn, pp;&lt;br /&gt;
&lt;br /&gt;
thread code:&lt;br /&gt;
&lt;br /&gt;
 void *sub1(void *arg) {&lt;br /&gt;
    int p = (int)arg;&lt;br /&gt;
    float e_local = 0;&lt;br /&gt;
    for (int i=1+(nn*p)/pp; i&amp;lt;1+(nn*(p+1))/pp; ++i) {&lt;br /&gt;
      x[i] += ( y[i+1] + y[i-1] )*.5;&lt;br /&gt;
      e_local += y[i] * y[i];&lt;br /&gt;
    }&lt;br /&gt;
    vec[p] = e_local;&lt;br /&gt;
    return (void*) 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
main code:&lt;br /&gt;
&lt;br /&gt;
 x = new float[n+1];&lt;br /&gt;
 y = new float[n+1];&lt;br /&gt;
 ...&lt;br /&gt;
 float e = 0;&lt;br /&gt;
 int p_threads = 8;&lt;br /&gt;
 nn = n-1;&lt;br /&gt;
 pp = p_threads;&lt;br /&gt;
 pthread_t threads[8];&lt;br /&gt;
 pthread_attr_t attr;&lt;br /&gt;
 pthread_attr_init(&amp;amp;attr);&lt;br /&gt;
 for (int p=0; p&amp;lt;p_threads; ++p)&lt;br /&gt;
    pthread_create(&amp;amp;threads[p], &amp;amp;attr,&lt;br /&gt;
      sub1, (void *)(long)p);&lt;br /&gt;
 for (int p=0; p&amp;lt;p_threads; ++p) {&lt;br /&gt;
    pthread_join(threads[p], NULL);&lt;br /&gt;
    e += vec[p];&lt;br /&gt;
 }&lt;br /&gt;
 ...&lt;br /&gt;
 delete[] x; delete[] y;&lt;br /&gt;
&lt;br /&gt;
=====In Intel Threading Building Blocks=====&lt;br /&gt;
&lt;br /&gt;
Intel TBB: A C++ library for thread programming, e.g. SMP multi-processor computers, multi-core processors, virtual shared memory computer.&lt;br /&gt;
&lt;br /&gt;
Data layout: A single global memory. Each thread reads global shared data and writes to a private fraction of global data.&lt;br /&gt;
&lt;br /&gt;
Translation of the example parallel-for loop is given below.&lt;br /&gt;
&lt;br /&gt;
Global:&lt;br /&gt;
 #include &amp;quot;tbb/task_scheduler_init.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/blocked_range.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/parallel_reduce.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/cache_aligned_allocator.h&amp;quot;&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
&lt;br /&gt;
thread code:&lt;br /&gt;
 struct sub1 {&lt;br /&gt;
    float ee;&lt;br /&gt;
    float *x, *y;&lt;br /&gt;
    sub1(float *xx, float *yy) : ee(0), x(xx), y(yy) {}&lt;br /&gt;
    sub1(sub1&amp;amp; s, split) { ee = 0; x = s.x; y = s.y; }&lt;br /&gt;
    void operator() (const blocked_range&amp;lt;int&amp;gt; &amp;amp; r){&lt;br /&gt;
      float e = ee;&lt;br /&gt;
      for (int i = r.begin(); i!= r.end(); ++i) {&lt;br /&gt;
        x[i] += ( y[i+1] + y[i-1] )*.5;&lt;br /&gt;
        e += y[i] * y[i];&lt;br /&gt;
      }&lt;br /&gt;
      ee = e;&lt;br /&gt;
    }&lt;br /&gt;
    void join(sub1&amp;amp; s) { ee += s.ee; }&lt;br /&gt;
 };&lt;br /&gt;
&lt;br /&gt;
main code:&lt;br /&gt;
 task_scheduler_init init;&lt;br /&gt;
 ...&lt;br /&gt;
 float e;&lt;br /&gt;
 float *x = cache_aligned_allocator&amp;lt;float&amp;gt;().allocate(n+1);&lt;br /&gt;
 float *y = cache_aligned_allocator&amp;lt;float&amp;gt;().allocate(n+1);&lt;br /&gt;
 ...&lt;br /&gt;
 sub1 s(x, y);&lt;br /&gt;
 parallel_reduce(blocked_range&amp;lt;int&amp;gt;(1, n, 1000), s);&lt;br /&gt;
 e = s.ee;&lt;br /&gt;
 ...&lt;br /&gt;
 cache_aligned_allocator&amp;lt;float&amp;gt;().deallocate(x, n+1);&lt;br /&gt;
 cache_aligned_allocator&amp;lt;float&amp;gt;().deallocate(y, n+1);&lt;br /&gt;
&lt;br /&gt;
=====In OpenMP shared memory parallel code annotations=====&lt;br /&gt;
&lt;br /&gt;
OpenMP: Usually automatic parallelization with a run-time system based on a thread library.&lt;br /&gt;
&lt;br /&gt;
A simplified translation of the example parallel-for loop is given below.&lt;br /&gt;
&lt;br /&gt;
Global:&lt;br /&gt;
 float e;&lt;br /&gt;
&lt;br /&gt;
main code:&lt;br /&gt;
 float *x = new float[n+1];&lt;br /&gt;
 float *y = new float[n+1];&lt;br /&gt;
 ...&lt;br /&gt;
 e = 0;&lt;br /&gt;
 #pragma omp parallel for reduction(+:e)&lt;br /&gt;
 for (int i=1; i&amp;lt;n; ++i) {&lt;br /&gt;
    x[i] += ( y[i+1] + y[i-1] )*.5;&lt;br /&gt;
    e += y[i] * y[i];&lt;br /&gt;
 }&lt;br /&gt;
 ...&lt;br /&gt;
 delete[] x; delete[] y;&lt;br /&gt;
&lt;br /&gt;
====Summary: Differences among them====&lt;br /&gt;
&lt;br /&gt;
*Pthreads supports all of these kinds of parallelism and can express functional parallelism easily, but the programmer must build specialized synchronization primitives and explicitly privatize variables, which means more effort is needed to convert a serial program into a parallel one. &lt;br /&gt;
&lt;br /&gt;
*OpenMP can provide many performance enhancing features, such as atomic, barrier and flush synchronization primitives. It is very simple to use OpenMP to exploit DOALL parallelism, but the syntax for expressing functional parallelism is awkward. &lt;br /&gt;
&lt;br /&gt;
*Intel TBB relies on generic programming and performs better with custom iteration spaces or complex reduction operations. It also provides generic parallel patterns for parallel while-loops, data-flow pipeline models, and parallel sorts and prefixes, so it is a better fit for cases that go beyond loop-based parallelism.&lt;br /&gt;
&lt;br /&gt;
Below is a table that illustrates the differences [[#References|&amp;lt;sup&amp;gt;[16]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
{| align=&amp;quot;center&amp;quot; cellpadding=&amp;quot;4&amp;quot;&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
!Type of Parallelism&lt;br /&gt;
!Posix Threads&lt;br /&gt;
!Intel&amp;amp;reg; TBB&lt;br /&gt;
!OpenMP 2.0&lt;br /&gt;
!OpenMp 3.0&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
!DOALL&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
!DOACROSS&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
!DOPIPE&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
! Reduction&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
! Functional Parallelism&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|No&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
|align=&amp;quot;center&amp;quot;|Yes&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==Synchronization Mechanisms==&lt;br /&gt;
&lt;br /&gt;
===Overview===&lt;br /&gt;
&lt;br /&gt;
In order to accomplish the above parallelizations in a real system, memory accesses must be carefully orchestrated so that no information gets corrupted.  Every architecture handles synchronizing data between parallel processors slightly differently.  This section looks at several architectures and highlights a few of the mechanisms used to achieve this memory synchronization.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===IA-64===&lt;br /&gt;
IA-64 is an Intel architecture that is mainly used in Itanium processors.&lt;br /&gt;
====Spinlock====&lt;br /&gt;
The spinlock is used to guard against multiple accesses to the critical section at the same time.  The critical section is a section of code that must be executed in sequential order; it cannot be parallelized.  Therefore, when a parallel process comes across an occupied critical section, the process will “spin” until the lock is released. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // The lock variable is zero if the critical section is&lt;br /&gt;
  // available. If it is 1, another process is in the critical section.&lt;br /&gt;
  //&lt;br /&gt;
  &lt;br /&gt;
  spin_lock:&lt;br /&gt;
    mov	ar.ccv = 0			// cmpxchg looks for avail (0)&lt;br /&gt;
    mov	r2 = 1				// cmpxchg sets to held (1)&lt;br /&gt;
  &lt;br /&gt;
  spin: &lt;br /&gt;
    ld8	r1 = [lock] ;;			// get lock in shared state&lt;br /&gt;
    cmp.ne	p1, p0 = r1, r0		// is lock held (ie, lock != 0)?&lt;br /&gt;
  &lt;br /&gt;
  (p1)	br.cond.spnt	spin ;;		// yes, continue spinning&lt;br /&gt;
    cmpxchg8.acq	r1 = [lock], r2, ar.ccv ;;	// attempt to grab lock&lt;br /&gt;
    cmp.ne p1, p0 = r1, r0		// was lock already held?&lt;br /&gt;
  &lt;br /&gt;
  (p1)	br.cond.spnt	spin ;;		// bummer, continue spinning&lt;br /&gt;
  &lt;br /&gt;
  cs_begin:&lt;br /&gt;
    // critical section code goes here...&lt;br /&gt;
  cs_end:&lt;br /&gt;
  &lt;br /&gt;
    st8.rel	[lock] = r0 ;;		// release the lock&lt;br /&gt;
&lt;br /&gt;
The above code demonstrates how a spin lock is used.  When a process reaches a spin lock, it checks whether the lock is available; if it is not, the process proceeds into the spin loop, where it continuously checks whether the lock has become available.  Once it finds the lock available, it attempts to obtain it.  If another process obtains the lock first, the process branches back into the spin loop and continues to wait.&lt;br /&gt;
&lt;br /&gt;
====Barrier====&lt;br /&gt;
&lt;br /&gt;
A barrier is a common mechanism used to hold up processes until all processes have reached the same point.  The mechanism is useful in all kinds of parallelism (DOALL, DOACROSS, DOPIPE, reduction, and functional parallelism).  This architecture uses a unique form of the barrier mechanism called the sense-reversing barrier.  The idea behind this barrier is to prevent race conditions.  If a process from the “next” instance of the barrier races ahead while slow processes from the current barrier are leaving, the fast processes could trap the slow processes at the “next” barrier, thus corrupting the memory synchronization. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====Dekker’s Algorithm====&lt;br /&gt;
&lt;br /&gt;
Dekker’s Algorithm uses variables to indicate which processors are using which resources.  It essentially arbitrates for a resource using these variables.  Every processor has a flag that indicates when it is in the critical section.  When a processor is getting ready to enter the critical section, it sets its flag to one, then checks that all of the other processors’ flags are zero before proceeding into the section.  This behavior is demonstrated in the code below.  It is a two-way multiprocessor system, so there are two processor flags, flag_me and flag_you. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // The flag_me variable is zero if we are not in the synchronization and &lt;br /&gt;
  // critical section code and non-zero otherwise; flag_you is similarly set&lt;br /&gt;
  // for the other processor.  This algorithm does not retry access to the &lt;br /&gt;
  // resource if there is contention.&lt;br /&gt;
  &lt;br /&gt;
  dekker:&lt;br /&gt;
    mov		r1 = 1 ;;		// my_flag = 1 (i want access)&lt;br /&gt;
    st8  	[flag_me] = r1&lt;br /&gt;
    mf ;;				// make st visible first&lt;br /&gt;
    ld8 	r2 = [flag_you] ;;		// read the other's flag&lt;br /&gt;
    cmp.ne p1, p0 = 0, r2		// is other's flag non-zero?&lt;br /&gt;
  &lt;br /&gt;
  (p1)	br.cond.spnt cs_skip ;;		// if so, resource in use &lt;br /&gt;
  &lt;br /&gt;
  cs_begin:&lt;br /&gt;
    // critical section code goes here...&lt;br /&gt;
  cs_end:&lt;br /&gt;
  &lt;br /&gt;
  cs_skip:&lt;br /&gt;
    st8.rel[flag_me] = r0 ;;		// release lock&lt;br /&gt;
&lt;br /&gt;
====Lamport’s Algorithm====&lt;br /&gt;
&lt;br /&gt;
Lamport’s Algorithm is similar to a spinlock with the addition of a fairness mechanism that keeps track of the order in which processes request the shared resource and grants access in that same order.  It makes use of two variables, x and y, and a shared array, b.  The code below illustrates the algorithm.  [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // The proc_id variable holds a unique, non-zero id for the process that &lt;br /&gt;
  // attempts access to the critical section.  x and y are the synchronization&lt;br /&gt;
  // variables that indicate who is in the critical section and who is attempting&lt;br /&gt;
  // entry. ptr_b_1 and ptr_b_id point at the 1'st and id'th element of b[].&lt;br /&gt;
  //&lt;br /&gt;
  &lt;br /&gt;
  lamport:&lt;br /&gt;
    	ld8		r1 = [proc_id] ;;	// r1 = unique process id&lt;br /&gt;
  start:&lt;br /&gt;
    	st8	[ptr_b_id] = r1		// b[id] = &amp;quot;true&amp;quot;&lt;br /&gt;
    	st8	[x] = r1			// x = process id&lt;br /&gt;
   	mf					// MUST fence here!&lt;br /&gt;
    	ld8	r2 = [y] ;;&lt;br /&gt;
    	cmp.ne p1, p0 = 0, r2;;		// if (y !=0) then...&lt;br /&gt;
  (p1)	st8	[ptr_b_id] = r0		// ... b[id] = &amp;quot;false&amp;quot;&lt;br /&gt;
  (p1)	br.cond.sptk	wait_y		// ... wait until y == 0&lt;br /&gt;
  &lt;br /&gt;
  	st8	[y] = r1		// y = process id&lt;br /&gt;
  	mf&lt;br /&gt;
  	ld8 	r3 = [x] ;;		&lt;br /&gt;
  	cmp.eq p1, p0 = r1, r3 ;;	// if (x == id) then..&lt;br /&gt;
  (p1)	br.cond.sptk cs_begin		// ... enter critical section&lt;br /&gt;
  &lt;br /&gt;
  	st8 	[ptr_b_id] = r0		// b[id] = &amp;quot;false&amp;quot;&lt;br /&gt;
  	ld8	r3 = [ptr_b_1]		// r3 = &amp;amp;b[1]&lt;br /&gt;
  	mov	ar.lc = N-1 ;;		// lc = number of processors - 1&lt;br /&gt;
  wait_b:&lt;br /&gt;
  	ld8	r2 = [r3] ;;		&lt;br /&gt;
  	cmp.ne p1, p0 = r1, r2		// if (b[j] != 0) then...&lt;br /&gt;
  (p1)	br.cond.spnt	wait_b ;;	// ... wait until b[j] == 0&lt;br /&gt;
  	add	r3 = 8, r3		// r3 = &amp;amp;b[j+1]&lt;br /&gt;
  	br.cloop.sptk	wait_b ;;	// loop over b[j] for each j&lt;br /&gt;
  &lt;br /&gt;
  	ld8	r2 = [y] ;;		// if (y != id) then...&lt;br /&gt;
  	cmp.ne p1, p2 = 0, r2&lt;br /&gt;
  (p1)  br.cond.spnt 	wait_y&lt;br /&gt;
  	br	start			// back to start to try again&lt;br /&gt;
  &lt;br /&gt;
  cs_begin:&lt;br /&gt;
    // critical section code goes here...&lt;br /&gt;
  cs_end:&lt;br /&gt;
  &lt;br /&gt;
  	st8	[y] = r0		// release the lock&lt;br /&gt;
  	st8.rel[ptr_b_id] = r0 ;;	// b[id] = &amp;quot;false&amp;quot;&lt;br /&gt;
===IA-32=== &lt;br /&gt;
&lt;br /&gt;
IA-32 is an Intel architecture that is also known as x86.  This is a very widely used architecture.&lt;br /&gt;
&lt;br /&gt;
====Locked Atomic Operation====&lt;br /&gt;
This is the main mechanism this architecture uses to manage shared data structures such as semaphores and system segments.  The processor uses the following three interdependent mechanisms to implement locked atomic operations: [[#References|&amp;lt;sup&amp;gt;[13]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*  Guaranteed atomic operations.&lt;br /&gt;
*  Bus locking, using the LOCK# signal and the LOCK instruction prefix.&lt;br /&gt;
*  Cache coherency protocols that ensure that atomic operations can be carried out on cached data structures (cache lock). This mechanism is present in the P6 family processors.&lt;br /&gt;
&lt;br /&gt;
=====Guaranteed Atomic Operation=====&lt;br /&gt;
The following operations are guaranteed to be carried out atomically: [[#References|&amp;lt;sup&amp;gt;[13]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*  Reading or writing a byte.&lt;br /&gt;
*  Reading or writing a word aligned on a 16-bit boundary.&lt;br /&gt;
*  Reading or writing a doubleword aligned on a 32-bit boundary.&lt;br /&gt;
The P6 family processors guarantee that the following additional memory operations will always be carried out atomically:&lt;br /&gt;
*  Reading or writing a quadword aligned on a 64-bit boundary. (This operation is also guaranteed on the Pentium® processor.)&lt;br /&gt;
*  16-bit accesses to uncached memory locations that fit within a 32-bit data bus.&lt;br /&gt;
*  16-, 32-, and 64-bit accesses to cached memory that fit within a 32-byte cache line.&lt;br /&gt;
&lt;br /&gt;
=====Bus Locking=====&lt;br /&gt;
A LOCK signal is asserted automatically during certain critical sections in order to lock the system bus and grant control to the process executing the critical section.  This signal will disallow control of this bus by any other process while the LOCK is engaged.&lt;br /&gt;
&lt;br /&gt;
===Linux Kernel===&lt;br /&gt;
&lt;br /&gt;
The Linux kernel is referred to here as an “architecture”; however, it is fairly unconventional in that it is an open-source operating system kernel that has full access to the hardware. It uses many common synchronization mechanisms, so it will be considered here. [[#References|&amp;lt;sup&amp;gt;[15]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====Busy-waiting lock====&lt;br /&gt;
&lt;br /&gt;
=====Spinlocks=====&lt;br /&gt;
&lt;br /&gt;
This mechanism is very similar to the mechanism described in the IA-64 architecture.  It is a mechanism used to manage access to a critical section of code.  If a process tries to access the critical section and is rejected it will sit and “spin” while it waits for the lock to be released.&lt;br /&gt;
&lt;br /&gt;
=====Rwlocks=====&lt;br /&gt;
&lt;br /&gt;
This is a special kind of spinlock intended for structures that are frequently read but rarely written. It allows multiple reads in parallel, which can increase efficiency because processes do not have to sit and wait merely to carry out a read. As before, however, only one write is allowed at a time, with no reads done in parallel.&lt;br /&gt;
&lt;br /&gt;
=====Brlocks=====&lt;br /&gt;
&lt;br /&gt;
This is a very fast read/write lock (a “big reader” lock), but it carries a write-side penalty.  The main advantage of this lock is that it prevents cache “ping-pong” when there are many concurrent readers.&lt;br /&gt;
&lt;br /&gt;
====Sleeper locks====&lt;br /&gt;
&lt;br /&gt;
=====Semaphores=====&lt;br /&gt;
&lt;br /&gt;
A semaphore is a special variable that acts similarly to a lock.  If the semaphore can be acquired, the process proceeds into the critical section.  If the semaphore cannot be acquired, the process is “put to sleep” and the processor is used for another process.  The sleeping process’s state is saved off in a place where it can be retrieved when the process is “woken up”.  Once the semaphore becomes available, the “sleeping” process is woken up, obtains the semaphore, and proceeds into the critical section. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===CUDA=== &lt;br /&gt;
&lt;br /&gt;
CUDA, or Compute Unified Device Architecture, is an Nvidia architecture which is the computing engine for their graphics processors.&lt;br /&gt;
&lt;br /&gt;
====__syncthreads====&lt;br /&gt;
&lt;br /&gt;
The __syncthreads operation can be used at the end of a parallel section as a sort of “barrier” mechanism.  It is necessary to ensure the accuracy of the memory.  In the following example, there are two calls to __syncthreads.  They are both necessary to ensure the expected results are obtained.  Without them, myArray[tid] could end up being either 2 or the original value of myArray[] depending on when the read and write take place.[[#References|&amp;lt;sup&amp;gt;[14]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // myArray is an array of integers located in global or shared&lt;br /&gt;
  // memory&lt;br /&gt;
  __global__ void MyKernel(int* result) {&lt;br /&gt;
     int tid = threadIdx.x;&lt;br /&gt;
    ...    &lt;br /&gt;
     int ref1 = myArray[tid];&lt;br /&gt;
      __syncthreads();&lt;br /&gt;
    myArray[tid + 1] = 2;&lt;br /&gt;
      __syncthreads();&lt;br /&gt;
    int ref2 = myArray[tid];&lt;br /&gt;
    result[tid] = ref1 * ref2;&lt;br /&gt;
    ...    &lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;ol&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://openmp.org/wp/about-openmp/ OpenMP.org]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[https://docs.google.com/viewer?a=v&amp;amp;pid=gmail&amp;amp;attid=0.1&amp;amp;thid=126f8a391c11262c&amp;amp;mt=application%2Fpdf&amp;amp;url=https%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3D2%26ik%3Dd38b56c94f%26view%3Datt%26th%3D126f8a391c11262c%26attid%3D0.1%26disp%3Dattd%26realattid%3Df_g602ojwk0%26zw&amp;amp;sig=AHIEtbTeQDhK98IswmnVSfrPBMfmPLH5Nw An Optimal Abstraction Model for Hardware Multithreading in Modern Processor Architectures]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.threadingbuildingblocks.org/uploads/81/91/Latest%20Open%20Source%20Documentation/Reference.pdf Intel Threading Building Blocks 2.2 for Open Source Reference Manual]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.csc.ncsu.edu/faculty/efg/506/s10/ NCSU CSC 506 Parallel Computing Systems]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://parallel-for.sourceforge.net/tbb.html Sourceforge.net]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[https://computing.llnl.gov/tutorials/openMP/ OpenMP]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.computer.org/portal/web/csdl/doi/10.1109/SNPD.2009.16 Barrier Optimization for OpenMP Program]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://cs.anu.edu.au/~Alistair.Rendell/sc02/module3.pdf Performance Programming: Theory, Practice and Case Studies]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://software.intel.com/en-us/articles/intel-threading-building-blocks-openmp-or-native-threads/ Intel® Threading Building Blocks, OpenMP, or native threads?]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[https://computing.llnl.gov/tutorials/pthreads/#Joining POSIX Threads Programming by Blaise Barney, Lawrence Livermore National Laboratory]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://homepage.mac.com/dbutenhof/Threads/source.html Programing with POSIX Threads source code]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://refspecs.freestandards.org/IA64-softdevman-vol2.pdf IA-64 Software Development Manual]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://refspecs.freestandards.org/IA32-softdevman-vol3.pdf IA-32 Software Development Manual]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf CUDA Programming Guide]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.google.com/url?sa=t&amp;amp;source=web&amp;amp;cd=6&amp;amp;ved=0CEQQFjAF&amp;amp;url=http%3A%2F%2Flinuxindore.com%2Fdownloads%2Fdownload%2Fdata-structures%2Flinux-kernel-arch&amp;amp;ei=jxZWTaGTNI34sAPWm-ScDA&amp;amp;usg=AFQjCNG9UOAz7rHfwUDfayhr50M87uNOYA&amp;amp;sig2=azvo4h85RkoNHcZUtNIkJw Linux Kernel Architecture Overview]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/ch_3_jb/Parallel_Programming_Model_Support Spring 2010 NC State ECE/CSC506 Chapter 3 wiki]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ol&amp;gt;&lt;/div&gt;</summary>
		<author><name>Akrepask</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch3_ab&amp;diff=43727</id>
		<title>CSC/ECE 506 Spring 2011/ch3 ab</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch3_ab&amp;diff=43727"/>
		<updated>2011-02-14T04:10:53Z</updated>

		<summary type="html">&lt;p&gt;Akrepask: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Supplement to Chapter 3: Support for parallel-programming models. Discuss how DOACROSS, DOPIPE, DOALL, etc. are implemented in packages such as Posix threads, Intel Thread Building Blocks, OpenMP 2.0 and 3.0.&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this wiki supplement, we discuss how the three kinds of parallelism, i.e. DOALL, DOACROSS, and DOPIPE, are implemented in three thread packages: OpenMP, Intel Threading Building Blocks, and POSIX Threads. We discuss each package from the perspective of variable scopes and its Reduction/DOALL/DOACROSS/DOPIPE implementations.&lt;br /&gt;
&lt;br /&gt;
==Implementation==&lt;br /&gt;
&lt;br /&gt;
===OpenMP===&lt;br /&gt;
The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix platforms and Windows NT platforms. Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.&lt;br /&gt;
&lt;br /&gt;
====Variable Clauses ====&lt;br /&gt;
There are many different types of clauses in OpenMP and each of them has various characteristics. Here we introduce data sharing attribute clauses, Synchronization clauses, Scheduling clauses, Initialization and Reduction. &lt;br /&gt;
=====Data sharing attribute clauses=====&lt;br /&gt;
* ''shared'': the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.&lt;br /&gt;
  Format: shared ''(list)''&lt;br /&gt;
&lt;br /&gt;
  SHARED variables behave as follows:&lt;br /&gt;
  1. Existing in only one memory location and all threads can read or write to that address &lt;br /&gt;
&lt;br /&gt;
* ''private'': the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.&lt;br /&gt;
  Format: private ''(list)''&lt;br /&gt;
&lt;br /&gt;
  PRIVATE variables behave as follows: &lt;br /&gt;
    1. A new object of the same type is declared once for each thread in the team&lt;br /&gt;
    2. All references to the original object are replaced with references to the new object&lt;br /&gt;
    3. Variables declared PRIVATE should be assumed to be uninitialized for each thread &lt;br /&gt;
&lt;br /&gt;
* ''default'': allows the programmer to state that the default data scoping within a parallel region will be either ''shared'', or ''none'' for C/C++, or ''shared'', ''firstprivate'', ''private'', or ''none'' for Fortran.  The ''none'' option forces the programmer to declare each variable in the parallel region using the data sharing attribute clauses.&lt;br /&gt;
  Format: default (shared | none)&lt;br /&gt;
&lt;br /&gt;
  DEFAULT variables behave as follows: &lt;br /&gt;
    1. Specific variables can be exempted from the default using the PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE, and REDUCTION clauses. &lt;br /&gt;
    2. Using NONE as a default requires that the programmer explicitly scope all variables.&lt;br /&gt;
&lt;br /&gt;
=====Synchronization clauses=====&lt;br /&gt;
* ''critical section'': the enclosed code block will be executed by only one thread at a time, and not simultaneously executed by multiple threads. It is often used to protect shared data from race conditions.&lt;br /&gt;
  Format: #pragma omp critical ''[ name ]  newline''&lt;br /&gt;
           ''structured_block''&lt;br /&gt;
&lt;br /&gt;
  CRITICAL SECTION behaves as follows:&lt;br /&gt;
    1. If a thread is currently executing inside a CRITICAL region and another thread reaches that CRITICAL region and attempts to execute it, it will block until the first thread exits that CRITICAL region.&lt;br /&gt;
    2. It is illegal to branch into or out of a CRITICAL block. &lt;br /&gt;
&lt;br /&gt;
* ''atomic'': similar to ''critical section'', but advise the compiler to use special hardware instructions for better performance. Compilers may choose to ignore this suggestion from users and use ''critical section'' instead.&lt;br /&gt;
  Format: #pragma omp atomic  ''newline''&lt;br /&gt;
           ''statement_expression''&lt;br /&gt;
&lt;br /&gt;
  ATOMIC behaves as follows:&lt;br /&gt;
    1. Applies only to a single, immediately following statement.&lt;br /&gt;
    2. An atomic statement must follow a specific syntax. &lt;br /&gt;
&lt;br /&gt;
* ''ordered'': the structured block is executed in the order in which iterations would be executed in a sequential loop&lt;br /&gt;
  Format: #pragma omp for ordered ''[clauses...]''&lt;br /&gt;
          ''(loop region)''&lt;br /&gt;
          #pragma omp ordered  ''newline''&lt;br /&gt;
          ''structured_block&lt;br /&gt;
          (end of loop region)''&lt;br /&gt;
&lt;br /&gt;
  ORDERED behaves as follows:&lt;br /&gt;
    1. It may only appear in the dynamic extent of ''for'' or ''parallel for (C/C++)''.&lt;br /&gt;
    2. Only one thread is allowed in an ordered section at any time.&lt;br /&gt;
    3. It is illegal to branch into or out of an ORDERED block. &lt;br /&gt;
    4. A loop which contains an ORDERED directive, must be a loop with an ORDERED clause. &lt;br /&gt;
&lt;br /&gt;
* ''barrier'': each thread waits until all of the other threads of a team have reached this point. A work-sharing construct has an implicit barrier synchronization at the end.&lt;br /&gt;
   Format: #pragma omp barrier  ''newline''&lt;br /&gt;
&lt;br /&gt;
   BARRIER behaves as follows:&lt;br /&gt;
    1. All threads in a team (or none) must execute the BARRIER region.&lt;br /&gt;
    2. The sequence of work-sharing regions and barrier regions encountered must be the same for every thread in a team.&lt;br /&gt;
&lt;br /&gt;
*''taskwait'': specifies a wait on the completion of child tasks generated since the beginning of the current task.&lt;br /&gt;
   Format: #pragma omp taskwait  ''newline''&lt;br /&gt;
&lt;br /&gt;
   TASKWAIT behaves as follows:&lt;br /&gt;
    1. May be placed only at a point where a base language statement is allowed.&lt;br /&gt;
    2. May not be used in place of the statement following an if, while, do, switch, or label.&lt;br /&gt;
&lt;br /&gt;
*''flush'': The FLUSH directive identifies a synchronization point at which the implementation must provide a consistent view of memory. Thread-visible variables are written back to memory at this point. &lt;br /&gt;
   Format: #pragma omp flush ''(list)  newline''&lt;br /&gt;
&lt;br /&gt;
   FLUSH behaves as follows:&lt;br /&gt;
    1. The optional list contains a list of named variables that will be flushed in order to avoid flushing all variables.&lt;br /&gt;
    2. Implementations must ensure any prior modifications to thread-visible variables are visible to all threads after this point.&lt;br /&gt;
&lt;br /&gt;
=====Scheduling clauses=====&lt;br /&gt;
*''schedule(type, chunk)'': This is useful if the work sharing construct is a do-loop or for-loop. The iteration(s) in the work sharing construct are assigned to threads according to the scheduling method defined by this clause. The three types of scheduling are:&lt;br /&gt;
#''static'': Here, all the threads are allocated iterations before they execute the loop iterations. The iterations are divided among threads equally by default. However, specifying an integer for the parameter &amp;quot;chunk&amp;quot; will allocate &amp;quot;chunk&amp;quot; number of contiguous iterations to a particular thread.&lt;br /&gt;
#''dynamic'': Here, some of the iterations are allocated to a smaller number of threads. Once a particular thread finishes its allocated iteration, it returns to get another one from the iterations that are left. The parameter &amp;quot;chunk&amp;quot; defines the number of contiguous iterations that are allocated to a thread at a time.&lt;br /&gt;
#''guided'': A large chunk of contiguous iterations is allocated to each thread dynamically (as above). The chunk size decreases exponentially with each successive allocation, down to a minimum size specified in the parameter &amp;quot;chunk&amp;quot;.&lt;br /&gt;
=====Initialization=====&lt;br /&gt;
* ''firstprivate'': the data is private to each thread, but initialized using the value of the variable using the same name from the master thread.&lt;br /&gt;
  Format: firstprivate ''(list)''&lt;br /&gt;
&lt;br /&gt;
  FIRSTPRIVATE variables behave as follows: &lt;br /&gt;
    1. Listed variables are initialized according to the value of their original objects prior to entry into the parallel or work-sharing construct. &lt;br /&gt;
&lt;br /&gt;
* ''lastprivate'': the data is private to each thread. The value of this private data will be copied to a global variable using the same name outside the parallel region if current iteration is the last iteration in the parallelized loop.  A variable can be both ''firstprivate'' and ''lastprivate''. &lt;br /&gt;
  Format: lastprivate ''(list)''&lt;br /&gt;
&lt;br /&gt;
* ''threadprivate'': The data is a global data, but it is private in each parallel region during the runtime. The difference between ''threadprivate'' and ''private'' is the global scope associated with threadprivate and the preserved value across parallel regions.&lt;br /&gt;
  Format: #pragma omp threadprivate ''(list)''&lt;br /&gt;
&lt;br /&gt;
  THREADPRIVATE variables behave as follows: &lt;br /&gt;
    1. On first entry to a parallel region, data in THREADPRIVATE variables and common blocks should be assumed undefined. &lt;br /&gt;
    2. The THREADPRIVATE directive must appear after every declaration of a thread private variable/common block.&lt;br /&gt;
&lt;br /&gt;
=====Reduction=====&lt;br /&gt;
* ''reduction'': the variable has a local copy in each thread, and the values of the local copies are combined (reduced) into a single global shared variable using the operation specified in &amp;quot;operator&amp;quot;. This is very useful for computations, such as running sums, whose value at one iteration depends on the value at a previous iteration: the work within iterations is parallelized, each thread accumulates into its private copy, and the private copies are combined at the end of the region, avoiding a race condition on the shared variable. &lt;br /&gt;
  Format: reduction ''(operator: list)''&lt;br /&gt;
&lt;br /&gt;
  REDUCTION variables behave as follows: &lt;br /&gt;
    1. Variables in the list must be named scalar variables. They can not be array or structure type variables. They must also be declared SHARED in the enclosing context.&lt;br /&gt;
    2. Reduction operations on floating-point numbers are not exactly associative, so results may vary slightly with the number of threads.&lt;br /&gt;
&lt;br /&gt;
====DOALL====&lt;br /&gt;
&lt;br /&gt;
In Code 3.20, we first include the header file ''omp.h'', which contains the OpenMP function declarations. A parallel region is started by #pragma omp parallel, and we enclose its body in curly brackets. We can use (setenv OMP_NUM_THREADS n) to specify the number of threads, or set it directly by calling the function omp_set_num_threads(n). &lt;br /&gt;
Code 3.20 has only one loop that we want to execute in parallel, so inside the region the work-sharing directive ''#pragma omp for'' distributes its iterations among the threads; when the loop is the only content of the region, the two directives can also be combined into the single directive ''#pragma omp parallel for''. &lt;br /&gt;
 &lt;br /&gt;
 '''Code 3.20 A DOALL parallelism example in OpenMP'''&lt;br /&gt;
 '''#include''' &amp;lt;omp.h&amp;gt;&lt;br /&gt;
 '''...'''&lt;br /&gt;
 '''#pragma''' omp parallel //start of parallel region&lt;br /&gt;
 '''{'''&lt;br /&gt;
  '''...'''&lt;br /&gt;
  '''#pragma''' omp for default (shared)&lt;br /&gt;
  '''for''' ( i = 0; i &amp;lt; n ; i++)&lt;br /&gt;
    '''A[i]''' = A[i] + A[i] - 3.0;&lt;br /&gt;
 '''}'''//end of parallel region&lt;br /&gt;
&lt;br /&gt;
There is no loop-carried dependence in the ''i'' loop, so with OpenMP we only need to insert the ''pragma'' directive on the loop. The ''default(shared)'' clause states that all variables within the scope of the loop are shared unless otherwise specified.&lt;br /&gt;
&lt;br /&gt;
====DOACROSS====&lt;br /&gt;
We will now show how to implement DOACROSS in OpenMP. Here is an example which has not been parallelized yet.&lt;br /&gt;
 &lt;br /&gt;
 '''Sample Code'''&lt;br /&gt;
 01: for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 02: for(j=1; j&amp;lt;N; j++){&lt;br /&gt;
 03: a[i][j]=a[i-1][j]+a[i][j-1];&lt;br /&gt;
 04: }&lt;br /&gt;
 05: }&lt;br /&gt;
&lt;br /&gt;
This sample code clearly contains loop-carried dependences: &lt;br /&gt;
 a[i,j] -&amp;gt;T a[i+1,j] and a[i,j] -&amp;gt;T a[i,j+1]&lt;br /&gt;
&lt;br /&gt;
In OpenMP, DOALL parallelism can be implemented by inserting a &amp;quot;#pragma omp for&amp;quot; before the &amp;quot;for&amp;quot; loop in the source code, but there is no pragma that directly expresses DOACROSS parallelism.&lt;br /&gt;
&lt;br /&gt;
To implement DOACROSS, we use a shared array &amp;quot;_mylock[threadid]&amp;quot;, which records how many events (completed blocks) each thread has signaled. In addition, a private variable _counter0 indicates the event the current thread is waiting for. &lt;br /&gt;
The number of threads is obtained from the function &amp;quot;omp_get_num_threads()&amp;quot; and the current thread's id from the function &amp;quot;omp_get_thread_num()&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
*omp_get_num_threads(): Returns the number of threads that are currently in the team executing the parallel region from which it is called.&lt;br /&gt;
 Format: #include &amp;lt;omp.h&amp;gt;&lt;br /&gt;
         int omp_get_num_threads(void)&lt;br /&gt;
&lt;br /&gt;
 OMP_GET_NUM_THREADS behaves as follows:&lt;br /&gt;
  1. If this call is made from a serial portion of the program, or a nested parallel region that is serialized, it will return 1. &lt;br /&gt;
  2. The default number of threads is implementation dependent. &lt;br /&gt;
&lt;br /&gt;
*omp_get_thread_num(): Returns the thread number, within the team, of the thread making this call. This number will be between 0 and OMP_GET_NUM_THREADS-1. The master thread of the team is thread 0. &lt;br /&gt;
 Format: #include &amp;lt;omp.h&amp;gt;&lt;br /&gt;
         int omp_get_thread_num(void)&lt;br /&gt;
&lt;br /&gt;
 OMP_GET_THREAD_NUM behaves as follows:&lt;br /&gt;
  1. If called from a nested parallel region, or a serial region, this function will return 0. &lt;br /&gt;
&lt;br /&gt;
Now let's look at the parallelized code and its explanation. &lt;br /&gt;
&lt;br /&gt;
 01: int _mylock[256]; 		//thread's synchronization array&lt;br /&gt;
 02: #pragma omp parallel&lt;br /&gt;
 03: {&lt;br /&gt;
 04:   int _counter0 = 1;&lt;br /&gt;
 05:   int _my_id = omp_get_thread_num();&lt;br /&gt;
 06:   int _my_nprocs= omp_get_num_threads();&lt;br /&gt;
 07:   _mylock[_my_id] = 0;&lt;br /&gt;
 08:   for(j_tile = 0; j_tile&amp;lt;N-1; j_tile+=M){&lt;br /&gt;
 09:     if(_my_id&amp;gt;0) {&lt;br /&gt;
 10:       do{&lt;br /&gt;
 11:         #pragma omp flush(_mylock)&lt;br /&gt;
 12:       } while(_mylock[_my_id-1]&amp;lt;_counter0);&lt;br /&gt;
 13:       #pragma omp flush(a, _mylock)&lt;br /&gt;
 14:       _counter0 += 1;&lt;br /&gt;
 15:     }&lt;br /&gt;
 16:     #pragma omp for nowait&lt;br /&gt;
 17:     for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 18:       for(j=j_tile;j&amp;lt;j_tile+M;j++){&lt;br /&gt;
 19:         a[i][j]=a[i-1][j]+a[i][j-1];&lt;br /&gt;
 20:       }&lt;br /&gt;
 21:     }&lt;br /&gt;
 22:     _mylock[_my_id] += 1;&lt;br /&gt;
 23:     #pragma omp flush(a, _mylock)&lt;br /&gt;
 24:   }&lt;br /&gt;
 25: }&lt;br /&gt;
&lt;br /&gt;
We parallelized the original program in two steps.&lt;br /&gt;
&lt;br /&gt;
*First step: We divide the i loop among the processors by inserting the OpenMP work-sharing construct &amp;quot;#pragma omp for nowait&amp;quot; (line 16), so each processor takes a share of the iterations of loop i. The j loop is tiled: an outer loop over j_tile (line 08) steps through blocks of M columns, so the inner j loop (line 18) runs from j_tile to j_tile+M-1. The total number of iterations is therefore equal to that of the original program.&lt;br /&gt;
The other statements are kept unchanged.&lt;br /&gt;
&lt;br /&gt;
*Second step: We synchronize neighboring threads. After the first step, each processor computes one block of the matrix at a time. If all processors simply ran in parallel, the dependences would be violated, so neighboring threads have to be synchronized.&lt;br /&gt;
We set up 4 variables as follows: &lt;br /&gt;
1. A private variable _my_nprocs = omp_get_num_threads(), which holds the total number of threads running the parallel region.&lt;br /&gt;
2. A private variable _my_id = omp_get_thread_num(), which holds the unique ID of the current thread.&lt;br /&gt;
3. A shared array _mylock[proc], each element initialized to 0, which indicates whether thread proc-1 has finished computing its current block.&lt;br /&gt;
4. A private variable _counter0, initialized to 1, which indicates the block the current thread is waiting for.&lt;br /&gt;
&lt;br /&gt;
With these four variables, the threads are synchronized as follows: &lt;br /&gt;
The first thread, whose ID is 0, runs without waiting (line 9). Every other thread spins in the loop at lines 10-12 as long as the value of ''_mylock[_my_id-1]'' is smaller than ''_counter0''.&lt;br /&gt;
&lt;br /&gt;
Once the block the current thread is waiting for has been completed, the thread falls through the wait loop, flushes the shared data (line 13), and marks the next block it will wait for by adding 1 to ''_counter0'' (line 14).&lt;br /&gt;
&lt;br /&gt;
When the current thread finishes its own block, it signals this by incrementing ''_mylock[_my_id]'' (line 22). Once the neighboring thread sees that the value has changed, it continues running, and so on. The figure below illustrates this.&lt;br /&gt;
[[Image:Synchorization.jpg]]&lt;br /&gt;
&lt;br /&gt;
====DOPIPE====&lt;br /&gt;
Here is another example, which we are going to parallelize with DOPIPE parallelism. The sample code contains a dependence S1 -&amp;gt; T S2 through a[i], as well as a loop-carried dependence S2 -&amp;gt; T S2 through c[i-1].&lt;br /&gt;
 '''Sample Code'''&lt;br /&gt;
 01: for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 02:   S1: a[i]=b[i];&lt;br /&gt;
 03:   S2: c[i]=c[i-1]+a[i];&lt;br /&gt;
 04: }&lt;br /&gt;
Now let's see how to parallelize the sample code with DOPIPE parallelism.&lt;br /&gt;
We again use a shared array &amp;quot;_mylock[threadid]&amp;quot;, which records the events signaled by each thread, and a private variable _counter0, which indicates the event the current thread is waiting for. &lt;br /&gt;
The number of threads is obtained from the function &amp;quot;omp_get_num_threads()&amp;quot; and the current thread's id from the function &amp;quot;omp_get_thread_num()&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
 01: int _mylock[256]; 			//thread's synchronization array&lt;br /&gt;
 02: #pragma omp parallel&lt;br /&gt;
 03: {&lt;br /&gt;
 04:   int _counter0 = 1;&lt;br /&gt;
 05:   int _my_id = omp_get_thread_num();&lt;br /&gt;
 06:   int _my_nprocs= omp_get_num_threads();&lt;br /&gt;
 07:   _mylock[_my_id] = 0;&lt;br /&gt;
 08:   for(i_tile = 0; i_tile&amp;lt;N-1; i_tile+=M){&lt;br /&gt;
 09:     if(_my_id&amp;gt;0) {&lt;br /&gt;
 10:       do{&lt;br /&gt;
 11:         #pragma omp flush(_mylock)&lt;br /&gt;
 12:       } while(_mylock[_my_id-1]&amp;lt;_counter0);&lt;br /&gt;
 13:       #pragma omp flush(a, _mylock)&lt;br /&gt;
 14:       _counter0 += 1;&lt;br /&gt;
 15:     }&lt;br /&gt;
 16:     #pragma omp for nowait&lt;br /&gt;
 17:     for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 18:       a[i]=b[i];&lt;br /&gt;
 19:     }&lt;br /&gt;
 20:     for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 21:       c[i]=c[i-1]+a[i];&lt;br /&gt;
 22:     }&lt;br /&gt;
 23:     _mylock[_my_id] += 1;&lt;br /&gt;
 24:     #pragma omp flush(a, _mylock)&lt;br /&gt;
 25:   }&lt;br /&gt;
 26: }&lt;br /&gt;
&lt;br /&gt;
We parallelize the original program in two steps.&lt;br /&gt;
&lt;br /&gt;
*First step: We divide the i loop among the processors by inserting the OpenMP work-sharing construct &amp;quot;#pragma omp for nowait&amp;quot; (line 16), so each processor takes a share of the iterations of loop i. There are now two i loops, each containing a different statement; the other statements are kept unchanged.&lt;br /&gt;
&lt;br /&gt;
*Second step: We synchronize the threads. After the first step, the processors finish computing &lt;br /&gt;
a[i]=b[i]. If all processors simply ran the second i loop in parallel, the dependence would be violated, so they have to be synchronized with their neighbors.&lt;br /&gt;
Again, we set up 4 variables as follows: &lt;br /&gt;
1. A private variable _my_nprocs = omp_get_num_threads(), which holds the total number of threads running the parallel region.&lt;br /&gt;
2. A private variable _my_id = omp_get_thread_num(), which holds the unique ID of the current thread.&lt;br /&gt;
3. A shared array _mylock[proc], each element initialized to 0, which indicates whether thread proc-1 has finished computing its current block.&lt;br /&gt;
4. A private variable _counter0, initialized to 1, which indicates the block the current thread is waiting for.&lt;br /&gt;
&lt;br /&gt;
When the current thread finishes its block, it signals completion by incrementing ''_mylock[_my_id]''. Once a processor has finished its own block, the next processor can read the values it produced and use them to execute its own statement.&lt;br /&gt;
&lt;br /&gt;
====Functional Parallelism====&lt;br /&gt;
&lt;br /&gt;
To introduce functional parallelism, we want to execute one code section in parallel with another code section. Code 3.21 shows two loops that execute in parallel with respect to one another, although each loop itself executes sequentially.&lt;br /&gt;
&lt;br /&gt;
 '''Code''' 3.21 A function parallelism example in OpenMP&lt;br /&gt;
 '''#pragma''' omp parallel shared(A, B) private(i)&lt;br /&gt;
 '''{'''&lt;br /&gt;
  '''#pragma''' omp sections nowait&lt;br /&gt;
  '''{'''&lt;br /&gt;
      '''#pragma''' omp section&lt;br /&gt;
      '''for'''( i = 0; i &amp;lt; n ; i++)&lt;br /&gt;
         '''A[i]''' = A[i]*A[i] - 4.0;&lt;br /&gt;
      '''#pragma''' omp section&lt;br /&gt;
      '''for'''( i = 0; i &amp;lt; n ; i++)&lt;br /&gt;
         '''B[i]''' = B[i]*B[i] - 9.0;&lt;br /&gt;
  '''}'''//end omp sections&lt;br /&gt;
 '''}'''//end omp parallel&lt;br /&gt;
&lt;br /&gt;
In Code 3.21, the two loops need to execute in parallel with each other. We just need to mark each loop with a ''pragma omp section'' directive; the two sections then execute in parallel with respect to each other, while each loop runs sequentially within its section.&lt;br /&gt;
&lt;br /&gt;
===Intel Thread Building Blocks===&lt;br /&gt;
&lt;br /&gt;
Intel Threading Building Blocks (Intel TBB) is a library that supports scalable &lt;br /&gt;
parallel programming using standard ISO C++ code. It does not require special &lt;br /&gt;
languages or compilers. It is designed to promote scalable data parallel programming. &lt;br /&gt;
The library consists of data structures and algorithms that allow a programmer to avoid some complications arising from the use of native threading packages such as POSIX threads, Windows threads, or the portable Boost Threads, in which individual threads of execution are created, synchronized, and terminated manually. Instead, the library abstracts access to the multiple processors by allowing operations to be treated as &amp;quot;tasks&amp;quot;, which are allocated to individual cores dynamically by the library's run-time engine, and by automating efficient use of the cache. This approach places TBB in a family of solutions for parallel programming that aim to decouple the programming from the particulars of the underlying machine. The net result is that Intel TBB lets you specify &lt;br /&gt;
parallelism more conveniently than raw threads, and at the same time can &lt;br /&gt;
improve performance.&lt;br /&gt;
&lt;br /&gt;
====Variables Scope====&lt;br /&gt;
&lt;br /&gt;
Intel TBB is a collection of components for parallel programming. Here is an overview of the library contents:&lt;br /&gt;
&lt;br /&gt;
* Basic algorithms: parallel_for, parallel_reduce, parallel_scan&lt;br /&gt;
* Advanced algorithms: parallel_while, parallel_do, pipeline, parallel_sort&lt;br /&gt;
* Containers: concurrent_queue, concurrent_vector, concurrent_hash_map&lt;br /&gt;
* Scalable memory allocation: scalable_malloc, scalable_free, scalable_realloc, scalable_calloc, scalable_allocator, cache_aligned_allocator&lt;br /&gt;
* Mutual exclusion: mutex, spin_mutex, queuing_mutex, spin_rw_mutex, queuing_rw_mutex, recursive mutex&lt;br /&gt;
* Atomic operations: fetch_and_add, fetch_and_increment, fetch_and_decrement, compare_and_swap, fetch_and_store&lt;br /&gt;
* Timing: portable fine grained global time stamp&lt;br /&gt;
* Task Scheduler: direct access to control the creation and activation of tasks&lt;br /&gt;
&lt;br /&gt;
Below we focus on a few specific TBB templates.&lt;br /&gt;
&lt;br /&gt;
=====parallel_for=====&lt;br /&gt;
&lt;br /&gt;
parallel_for is the template function that performs parallel iteration over a range of values. In Intel TBB, many DOALL cases can be implemented using this function. The syntax is as follows: &lt;br /&gt;
 template&amp;lt;typename Index, typename Function&amp;gt;&lt;br /&gt;
 Function parallel_for(Index first, Index last, Index step, Function f);&lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_for( const Range&amp;amp; range, const Body&amp;amp; body [, partitioner] );&lt;br /&gt;
&lt;br /&gt;
A parallel_for(first, last, step, f) represents parallel execution of the loop: &amp;quot;for( auto i=first; i&amp;lt;last; i+=step ) f(i);&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=====parallel_reduce=====&lt;br /&gt;
&lt;br /&gt;
The parallel_reduce template function computes a reduction over a range. The syntax is as follows:&lt;br /&gt;
 template&amp;lt;typename Range, typename Value, typename Func, typename Reduction&amp;gt;&lt;br /&gt;
 Value parallel_reduce( const Range&amp;amp; range, const Value&amp;amp; identity, const Func&amp;amp; func, const Reduction&amp;amp; reduction );&lt;br /&gt;
&lt;br /&gt;
The functional form parallel_reduce(range,identity,func,reduction) performs a&lt;br /&gt;
parallel reduction by applying func to subranges in range and reducing the results&lt;br /&gt;
using binary operator reduction. It returns the result of the reduction. Parameter func&lt;br /&gt;
and reduction can be lambda expressions.&lt;br /&gt;
&lt;br /&gt;
=====parallel_scan=====&lt;br /&gt;
&lt;br /&gt;
This template function computes a parallel prefix. The syntax is as follows:&lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_scan( const Range&amp;amp; range, Body&amp;amp; body );&lt;br /&gt;
 &lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_scan( const Range&amp;amp; range, Body&amp;amp; body, const auto_partitioner&amp;amp; );&lt;br /&gt;
 &lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_scan( const Range&amp;amp; range, Body&amp;amp; body, const simple_partitioner&amp;amp; );&lt;br /&gt;
&lt;br /&gt;
A parallel_scan(range,body) computes a parallel prefix, also known as parallel&lt;br /&gt;
scan. This computation is an advanced concept in parallel computing that is&lt;br /&gt;
sometimes useful in scenarios that appear to have inherently serial dependences. A&lt;br /&gt;
further explanation will be given in the DOACROSS example.&lt;br /&gt;
&lt;br /&gt;
=====pipeline=====&lt;br /&gt;
&lt;br /&gt;
This class performs pipelined execution. Its members are as follows:&lt;br /&gt;
 namespace tbb {&lt;br /&gt;
     class pipeline {&lt;br /&gt;
     public:&lt;br /&gt;
        pipeline();&lt;br /&gt;
        ~pipeline(); &lt;br /&gt;
        void add_filter( filter&amp;amp; f );&lt;br /&gt;
        void run( size_t max_number_of_live_tokens );&lt;br /&gt;
        void clear();&lt;br /&gt;
   };&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
A pipeline represents pipelined application of a series of filters to a stream of items.&lt;br /&gt;
Each filter operates in a particular mode: parallel, serial in order, or serial out of order. With a parallel filter, &lt;br /&gt;
we could implement DOPIPE parallelism.&lt;br /&gt;
&lt;br /&gt;
====Reduction====&lt;br /&gt;
&lt;br /&gt;
The reduction in Intel TBB is implemented using parallel_reduce function. A parallel_reduce recursively splits the range into subranges and uses the splitting constructor to make one or more copies of the body for each thread. We use an example to illustrate this: &lt;br /&gt;
&lt;br /&gt;
 #include &amp;quot;tbb/parallel_reduce.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/blocked_range.h&amp;quot;&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
 struct Sum {&lt;br /&gt;
     float value;&lt;br /&gt;
     Sum() : value(0) {}&lt;br /&gt;
     Sum( Sum&amp;amp; s, split ) {value = 0;}&lt;br /&gt;
     void operator()( const blocked_range&amp;lt;float*&amp;gt;&amp;amp; r ) {&lt;br /&gt;
         float temp = value;&lt;br /&gt;
         for( float* a=r.begin(); a!=r.end(); ++a ) {&lt;br /&gt;
             temp += *a;&lt;br /&gt;
         }&lt;br /&gt;
         value = temp;&lt;br /&gt;
     }&lt;br /&gt;
     void join( Sum&amp;amp; rhs ) {value += rhs.value;}&lt;br /&gt;
 };&lt;br /&gt;
 float ParallelSum( float array[], size_t n ) {&lt;br /&gt;
     Sum total;&lt;br /&gt;
     parallel_reduce( blocked_range&amp;lt;float*&amp;gt;( array, array+n ), total );&lt;br /&gt;
     return total.value;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The above example sums the values in the array. parallel_reduce performs the reduction over the range (array, array+n), splitting the working body as needed and joining the partial results of the splits.&lt;br /&gt;
&lt;br /&gt;
====DOALL====&lt;br /&gt;
The implementation of DOALL parallelism in Intel TBB will involve Parallel_for function. &lt;br /&gt;
To better illustrate the usage, here we discuss a simple example. The following is the original code:&lt;br /&gt;
 &lt;br /&gt;
 void SerialApplyFoo( float a[], size_t n ) {&lt;br /&gt;
     for( size_t i=0; i&amp;lt;n; ++i )&lt;br /&gt;
         Foo(a[i]);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
After using Intel TBB, it could be switched to the following:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;quot;tbb/blocked_range.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/parallel_for.h&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
 class ApplyFoo {&lt;br /&gt;
     float *const my_a;&lt;br /&gt;
 public:&lt;br /&gt;
     void operator( )( const blocked_range&amp;lt;size_t&amp;gt;&amp;amp; r ) const {&lt;br /&gt;
         float *a = my_a;&lt;br /&gt;
         for( size_t i=r.begin(); i!=r.end( ); ++i )&lt;br /&gt;
             Foo(a[i]);&lt;br /&gt;
     }&lt;br /&gt;
     ApplyFoo( float a[] ) :&lt;br /&gt;
         my_a(a)&lt;br /&gt;
     {}&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 void ParallelApplyFoo( float a[], size_t n ) {&lt;br /&gt;
     parallel_for(blocked_range&amp;lt;size_t&amp;gt;(0,n,The_grain_size_You_Pick), ApplyFoo(a) );&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
This is the simplest form of DOALL parallelism, similar to the one in the textbook, and its execution graph is very similar to the one in the DOALL section above. The TBB code here just gives you a flavor of how DOALL would be implemented in Intel Threading Building Blocks.&lt;br /&gt;
&lt;br /&gt;
One more note: the third argument of blocked_range, represented above by &amp;quot;The_grain_size_You_Pick&amp;quot;, lets you manually choose how finely the range is divided, while parallel_for takes an optional partitioner argument. If you do not want to pick a grain size yourself, the auto_partitioner heuristically chooses one for you; the heuristic attempts to limit overhead while still providing ample opportunities for load balancing. The last three lines of the TBB code above then become:&lt;br /&gt;
&lt;br /&gt;
 void ParallelApplyFoo( float a[], size_t n ) {&lt;br /&gt;
     parallel_for(blocked_range&amp;lt;size_t&amp;gt;(0,n), ApplyFoo(a), auto_partitioner( ) );&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
====DOACROSS====&lt;br /&gt;
&lt;br /&gt;
Intel TBB offers a good way to implement DOACROSS with the help of parallel_scan. As stated in the parallel_scan section, this function computes a parallel prefix, also known as a parallel&lt;br /&gt;
scan. This computation is an advanced concept in parallel computing that&lt;br /&gt;
can help in scenarios that appear to have inherently serial, loop-carried dependences. &lt;br /&gt;
&lt;br /&gt;
Let's consider this scenario (which is actually the mathematical definition of parallel prefix):  &lt;br /&gt;
 T temp = id⊕;&lt;br /&gt;
 for( int i=1; i&amp;lt;=n; ++i ) {&lt;br /&gt;
     temp = temp ⊕ x[i];&lt;br /&gt;
     y[i] = temp;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we implement this in TBB using parallel_scan, it becomes:&lt;br /&gt;
&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
 class Body {&lt;br /&gt;
     T sum;&lt;br /&gt;
     T* const y;&lt;br /&gt;
     const T* const x;&lt;br /&gt;
 public:&lt;br /&gt;
     Body( T y_[], const T x_[] ) : sum(id⊕), x(x_), y(y_) {}&lt;br /&gt;
     T get_sum() const {return sum;}&lt;br /&gt;
     template&amp;lt;typename Tag&amp;gt;&lt;br /&gt;
     void operator()( const blocked_range&amp;lt;int&amp;gt;&amp;amp; r, Tag ) {&lt;br /&gt;
         T temp = sum;&lt;br /&gt;
         for( int i=r.begin(); i&amp;lt;r.end(); ++i ) {&lt;br /&gt;
             temp = temp ⊕ x[i];&lt;br /&gt;
             if( Tag::is_final_scan() )&lt;br /&gt;
                 y[i] = temp;&lt;br /&gt;
         } &lt;br /&gt;
         sum = temp;&lt;br /&gt;
     }&lt;br /&gt;
     Body( Body&amp;amp; b, split ) : x(b.x), y(b.y), sum(id⊕) {}&lt;br /&gt;
     void reverse_join( Body&amp;amp; a ) { sum = a.sum ⊕ sum;}&lt;br /&gt;
     void assign( Body&amp;amp; b ) {sum = b.sum;}&lt;br /&gt;
 };&lt;br /&gt;
&lt;br /&gt;
 float DoParallelScan( T y[], const T x[], int n ) {&lt;br /&gt;
     Body body(y,x);&lt;br /&gt;
     parallel_scan( blocked_range&amp;lt;int&amp;gt;(0,n), body );&lt;br /&gt;
     return body.get_sum();&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
It is the second part (the function DoParallelScan) that we have to focus on: it simply hands the range and the body to parallel_scan. &lt;br /&gt;
&lt;br /&gt;
This example is exactly the scenario mentioned above that can take advantage of parallel_scan. The &amp;quot;inherently serial dependences&amp;quot; are taken care of by the functionality of parallel_scan: by computing the prefix, the serial code can be run in parallel with just one function.&lt;br /&gt;
&lt;br /&gt;
====DOPIPE====&lt;br /&gt;
&lt;br /&gt;
The pipeline class is the Intel TBB component that performs pipelined execution. A pipeline represents the pipelined application of a series of filters to a stream of items. Each filter operates in a particular mode: parallel, serial in order, or serial out of order. This class can therefore be used to implement DOPIPE parallelism.&lt;br /&gt;
&lt;br /&gt;
Here is a somewhat more complex pipeline example. If we look carefully, it combines both DOPIPE and DOACROSS:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;iostream&amp;gt;&lt;br /&gt;
 #include &amp;quot;tbb/pipeline.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/tbb_thread.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/task_scheduler_init.h&amp;quot;&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
 char InputString[] = &amp;quot;abcdefg\n&amp;quot;;&lt;br /&gt;
 class InputFilter: public filter {&lt;br /&gt;
     char* my_ptr;&lt;br /&gt;
 public:&lt;br /&gt;
     void* operator()(void*) {&lt;br /&gt;
         if (*my_ptr)&lt;br /&gt;
             return my_ptr++;&lt;br /&gt;
         else&lt;br /&gt;
             return NULL;&lt;br /&gt;
     }&lt;br /&gt;
     InputFilter() :&lt;br /&gt;
         filter( serial_in_order ), my_ptr(InputString) {}&lt;br /&gt;
 };&lt;br /&gt;
 class OutputFilter: public thread_bound_filter {&lt;br /&gt;
 public:&lt;br /&gt;
     void* operator()(void* item) {&lt;br /&gt;
         std::cout &amp;lt;&amp;lt; *(char*)item;&lt;br /&gt;
         return NULL;&lt;br /&gt;
     }&lt;br /&gt;
     OutputFilter() : thread_bound_filter(serial_in_order) {}&lt;br /&gt;
 };&lt;br /&gt;
 void RunPipeline(pipeline* p) {&lt;br /&gt;
     p-&amp;gt;run(8);&lt;br /&gt;
 }&lt;br /&gt;
 int main() {&lt;br /&gt;
     // Construct the pipeline&lt;br /&gt;
     InputFilter f;&lt;br /&gt;
     OutputFilter g;&lt;br /&gt;
     pipeline p;&lt;br /&gt;
     p.add_filter(f);&lt;br /&gt;
     p.add_filter(g);&lt;br /&gt;
     // Another thread initiates execution of the pipeline&lt;br /&gt;
     tbb_thread t(RunPipeline,&amp;amp;p);&lt;br /&gt;
     // Process the thread_bound_filter with the current thread.&lt;br /&gt;
     while (g.process_item()!=thread_bound_filter::end_of_stream)&lt;br /&gt;
         continue;&lt;br /&gt;
     // Wait for pipeline to finish on the other thread.&lt;br /&gt;
     t.join();&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The example above shows a pipeline with two filters where the second filter is a thread_bound_filter serviced by the main thread. The main thread does the following after constructing the pipeline:&lt;br /&gt;
1. Start the pipeline on another thread.&lt;br /&gt;
2. Service the thread_bound_filter until it reaches end_of_stream.&lt;br /&gt;
3. Wait for the other thread to finish.&lt;br /&gt;
&lt;br /&gt;
===POSIX Threads===&lt;br /&gt;
&lt;br /&gt;
POSIX Threads, or Pthreads, is a POSIX standard for threads. The standard, POSIX.1c, Threads extensions (IEEE Std 1003.1c-1995), defines an API for creating and manipulating threads.&lt;br /&gt;
&lt;br /&gt;
====Variable Scopes====&lt;br /&gt;
Pthreads defines a set of C programming language types, functions and constants. It is implemented with a pthread.h header and a thread library.&lt;br /&gt;
&lt;br /&gt;
There are around 100 Pthreads procedures, all prefixed &amp;quot;pthread_&amp;quot;. The subroutines which comprise the Pthreads API can be informally grouped into four major groups:&lt;br /&gt;
&lt;br /&gt;
* '''Thread management:''' Routines that work directly on threads - creating, detaching, joining, etc. They also include functions to set/query thread attributes (joinable, scheduling, etc.). E.g. pthread_create(), pthread_join().&lt;br /&gt;
* '''Mutexes:''' Routines that deal with a synchronization primitive called a &amp;quot;mutex&amp;quot;, an abbreviation for &amp;quot;mutual exclusion&amp;quot;. Mutex functions provide for creating, destroying, locking and unlocking mutexes. These are supplemented by mutex attribute functions that set or modify attributes associated with mutexes. E.g. pthread_mutex_lock(); pthread_mutex_trylock(); pthread_mutex_unlock().&lt;br /&gt;
* '''Condition variables:''' Routines that address communications between threads that share a mutex, based upon programmer-specified conditions. This group includes functions to create, destroy, wait and signal based upon specified variable values. Functions to set/query condition variable attributes are also included. E.g. pthread_cond_signal(); pthread_cond_broadcast(); pthread_cond_wait(); pthread_cond_timedwait(); pthread_cond_reltimedwait_np().&lt;br /&gt;
* '''Synchronization:''' Routines that manage read/write locks and barriers. E.g. pthread_rwlock_rdlock(); pthread_rwlock_tryrdlock(); pthread_rwlock_wrlock();pthread_rwlock_trywrlock(); pthread_rwlock_unlock();pthread_barrier_init(); pthread_barrier_wait()&lt;br /&gt;
&lt;br /&gt;
====DOALL====&lt;br /&gt;
&lt;br /&gt;
The following is a simple DOALL example in C that prints each thread's ID.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;
 #define NUM_THREADS     5&lt;br /&gt;
  &lt;br /&gt;
 void *PrintHello(void *threadid)&lt;br /&gt;
 {&lt;br /&gt;
    long tid;&lt;br /&gt;
  &lt;br /&gt;
    tid = (long)threadid;&lt;br /&gt;
    printf(&amp;quot;Hello World! It's me, thread #%ld!\n&amp;quot;, tid);&lt;br /&gt;
    pthread_exit(NULL);&lt;br /&gt;
 }&lt;br /&gt;
  &lt;br /&gt;
 int main (int argc, char *argv[])&lt;br /&gt;
 {&lt;br /&gt;
    pthread_t threads[NUM_THREADS];&lt;br /&gt;
  &lt;br /&gt;
    int rc;&lt;br /&gt;
    long t;&lt;br /&gt;
    for(t=0; t&amp;lt;NUM_THREADS; t++){&lt;br /&gt;
       printf(&amp;quot;In main: creating thread %ld\n&amp;quot;, t);&lt;br /&gt;
       rc = pthread_create(&amp;amp;threads[t], NULL, PrintHello, (void *)t);&lt;br /&gt;
  &lt;br /&gt;
       if (rc){&lt;br /&gt;
          printf(&amp;quot;ERROR; return code from pthread_create() is %d\n&amp;quot;, rc);&lt;br /&gt;
          exit(-1);&lt;br /&gt;
       }&lt;br /&gt;
    }&lt;br /&gt;
    pthread_exit(NULL);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
This loop body contains only a single statement, and no data is carried across iterations, so each iteration can be treated as an independent parallel task.&lt;br /&gt;
====DOACROSS====&lt;br /&gt;
&lt;br /&gt;
Pthreads can express the functional parallelism of DOACROSS easily, but it makes the code unnecessarily complicated. See the example below, from '''POSIX Threads Programming''' by Blaise Barney:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;math.h&amp;gt;&lt;br /&gt;
 #define NUM_THREADS	4&lt;br /&gt;
 &lt;br /&gt;
 void *BusyWork(void *t)&lt;br /&gt;
 {&lt;br /&gt;
   int i;&lt;br /&gt;
   long tid;&lt;br /&gt;
   double result=0.0;&lt;br /&gt;
   tid = (long)t;&lt;br /&gt;
   printf(&amp;quot;Thread %ld starting...\n&amp;quot;,tid);&lt;br /&gt;
   for (i=0; i&amp;lt;1000000; i++)&lt;br /&gt;
   {&lt;br /&gt;
      result = result + sin(i) * tan(i);&lt;br /&gt;
   }&lt;br /&gt;
   printf(&amp;quot;Thread %ld done. Result = %e\n&amp;quot;,tid, result);&lt;br /&gt;
   pthread_exit((void*) t);&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main (int argc, char *argv[])&lt;br /&gt;
 {&lt;br /&gt;
   pthread_t thread[NUM_THREADS];&lt;br /&gt;
   pthread_attr_t attr;&lt;br /&gt;
   int rc;&lt;br /&gt;
   long t;&lt;br /&gt;
   void *status;&lt;br /&gt;
 &lt;br /&gt;
   /* Initialize and set thread detached attribute */&lt;br /&gt;
   pthread_attr_init(&amp;amp;attr);&lt;br /&gt;
   pthread_attr_setdetachstate(&amp;amp;attr, PTHREAD_CREATE_JOINABLE);&lt;br /&gt;
 &lt;br /&gt;
   for(t=0; t&amp;lt;NUM_THREADS; t++) {&lt;br /&gt;
      printf(&amp;quot;Main: creating thread %ld\n&amp;quot;, t);&lt;br /&gt;
      rc = pthread_create(&amp;amp;thread[t], &amp;amp;attr, BusyWork, (void *)t); &lt;br /&gt;
      if (rc) {&lt;br /&gt;
         printf(&amp;quot;ERROR; return code from pthread_create() &lt;br /&gt;
                is %d\n&amp;quot;, rc);&lt;br /&gt;
         exit(-1);&lt;br /&gt;
         }&lt;br /&gt;
      }&lt;br /&gt;
 &lt;br /&gt;
   /* Free attribute and wait for the other threads */&lt;br /&gt;
   pthread_attr_destroy(&amp;amp;attr);&lt;br /&gt;
   for(t=0; t&amp;lt;NUM_THREADS; t++) {&lt;br /&gt;
      rc = pthread_join(thread[t], &amp;amp;status);&lt;br /&gt;
      if (rc) {&lt;br /&gt;
         printf(&amp;quot;ERROR; return code from pthread_join() &lt;br /&gt;
                is %d\n&amp;quot;, rc);&lt;br /&gt;
         exit(-1);&lt;br /&gt;
         }&lt;br /&gt;
      printf(&amp;quot;Main: completed join with thread %ld having a status   &lt;br /&gt;
            of %ld\n&amp;quot;,t,(long)status);&lt;br /&gt;
      }&lt;br /&gt;
 &lt;br /&gt;
 printf(&amp;quot;Main: program completed. Exiting.\n&amp;quot;);&lt;br /&gt;
 pthread_exit(NULL);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
This example demonstrates how to &amp;quot;wait&amp;quot; for thread completions by using the Pthread join routine. Since some implementations of Pthreads may not create threads in a joinable state, the threads in this example are explicitly created in a joinable state so that they can be joined later.&lt;br /&gt;
&lt;br /&gt;
====DOPIPE====&lt;br /&gt;
There are examples of using POSIX Threads to implement DOPIPE parallelism, but they are unnecessarily complex. Because of its length, we do not reproduce one here; interested readers can find one at [http://homepage.mac.com/dbutenhof/Threads/code/pipe.c Pthreads DOPIPE example].&lt;br /&gt;
&lt;br /&gt;
===Comparison among the three===&lt;br /&gt;
&lt;br /&gt;
====A unified example====&lt;br /&gt;
&lt;br /&gt;
We use a simple parallel example from [http://sourceforge.net Sourceforge.net] to show how it is implemented in the three packages, namely POSIX Threads, Intel TBB, and OpenMP, and to highlight some commonalities and differences among them.&lt;br /&gt;
&lt;br /&gt;
Following is the original code:&lt;br /&gt;
&lt;br /&gt;
 Grid1 *g = new Grid1(0, n+1);&lt;br /&gt;
 Grid1IteratorSub it(1, n, g);&lt;br /&gt;
 DistArray x(g), y(g);&lt;br /&gt;
 ...&lt;br /&gt;
 float e = 0;&lt;br /&gt;
 ForEach(int i, it,&lt;br /&gt;
    x(i) += ( y(i+1) + y(i-1) )*.5;&lt;br /&gt;
    e += sqr( y(i) ); )&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
Then we are going to show the implementations in different packages, and also make a brief summary of the three packages.&lt;br /&gt;
&lt;br /&gt;
=====In POSIX Thread=====&lt;br /&gt;
&lt;br /&gt;
POSIX Threads: symmetric multiprocessing, e.g. SMP multiprocessor computers, multi-core processors, and virtual shared-memory computers.&lt;br /&gt;
&lt;br /&gt;
Data layout: A single global memory. Each thread reads global shared data and writes to a private fraction of global data.&lt;br /&gt;
&lt;br /&gt;
A simplified translation of the example parallel-for loop is given below.&lt;br /&gt;
&lt;br /&gt;
Global declaration:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 float *x, *y;&lt;br /&gt;
 float vec[8];&lt;br /&gt;
 int nn, pp;&lt;br /&gt;
&lt;br /&gt;
thread code:&lt;br /&gt;
&lt;br /&gt;
 void *sub1(void *arg) {&lt;br /&gt;
    int p = (int)(long)arg;&lt;br /&gt;
    float e_local = 0;&lt;br /&gt;
    for (int i=1+(nn*p)/pp; i&amp;lt;1+(nn*(p+1))/pp; ++i) {&lt;br /&gt;
      x[i] += ( y[i+1] + y[i-1] )*.5;&lt;br /&gt;
      e_local += y[i] * y[i];&lt;br /&gt;
    }&lt;br /&gt;
    vec[p] = e_local;&lt;br /&gt;
    return (void*) 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
main code:&lt;br /&gt;
&lt;br /&gt;
 x = new float[n+1];&lt;br /&gt;
 y = new float[n+1];&lt;br /&gt;
 ...&lt;br /&gt;
 float e = 0;&lt;br /&gt;
 int p_threads = 8;&lt;br /&gt;
 nn = n-1;&lt;br /&gt;
 pp = p_threads;&lt;br /&gt;
 pthread_t threads[8];&lt;br /&gt;
 pthread_attr_t attr;&lt;br /&gt;
 pthread_attr_init(&amp;amp;attr);&lt;br /&gt;
 for (int p=0; p&amp;lt;p_threads; ++p)&lt;br /&gt;
    pthread_create(&amp;amp;threads[p], &amp;amp;attr,&lt;br /&gt;
      sub1, (void *)(long)p);&lt;br /&gt;
 for (int p=0; p&amp;lt;p_threads; ++p) {&lt;br /&gt;
    pthread_join(threads[p], NULL);&lt;br /&gt;
    e += vec[p];&lt;br /&gt;
 }&lt;br /&gt;
 ...&lt;br /&gt;
 delete[] x; delete[] y;&lt;br /&gt;
&lt;br /&gt;
=====In Intel Threading Building Blocks=====&lt;br /&gt;
&lt;br /&gt;
Intel TBB: a C++ library for thread programming, targeting e.g. SMP multiprocessor computers, multi-core processors, and virtual shared-memory computers.&lt;br /&gt;
&lt;br /&gt;
Data layout: A single global memory. Each thread reads global shared data and writes to a private fraction of global data.&lt;br /&gt;
&lt;br /&gt;
Translation of the example parallel-for loop is given below.&lt;br /&gt;
&lt;br /&gt;
Global:&lt;br /&gt;
 #include &amp;quot;tbb/task_scheduler_init.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/blocked_range.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/parallel_reduce.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/cache_aligned_allocator.h&amp;quot;&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
&lt;br /&gt;
thread code:&lt;br /&gt;
 struct sub1 {&lt;br /&gt;
    float ee;&lt;br /&gt;
    float *x, *y;&lt;br /&gt;
    sub1(float *xx, float *yy) : ee(0), x(xx), y(yy) {}&lt;br /&gt;
    sub1(sub1&amp;amp; s, split) { ee = 0; x = s.x; y = s.y; }&lt;br /&gt;
    void operator() (const blocked_range&amp;lt;int&amp;gt; &amp;amp; r){&lt;br /&gt;
      float e = ee;&lt;br /&gt;
      for (int i = r.begin(); i!= r.end(); ++i) {&lt;br /&gt;
        x[i] += ( y[i+1] + y[i-1] )*.5;&lt;br /&gt;
        e += y[i] * y[i];&lt;br /&gt;
      }&lt;br /&gt;
      ee = e;&lt;br /&gt;
    }&lt;br /&gt;
    void join(sub1&amp;amp; s) { ee += s.ee; }&lt;br /&gt;
 };&lt;br /&gt;
&lt;br /&gt;
main code:&lt;br /&gt;
 task_scheduler_init init;&lt;br /&gt;
 ...&lt;br /&gt;
 float e;&lt;br /&gt;
 float *x = cache_aligned_allocator&amp;lt;float&amp;gt;().allocate(n+1);&lt;br /&gt;
 float *y = cache_aligned_allocator&amp;lt;float&amp;gt;().allocate(n+1);&lt;br /&gt;
 ...&lt;br /&gt;
 sub1 s(x, y);&lt;br /&gt;
 parallel_reduce(blocked_range&amp;lt;int&amp;gt;(1, n, 1000), s);&lt;br /&gt;
 e = s.ee;&lt;br /&gt;
 ...&lt;br /&gt;
 cache_aligned_allocator&amp;lt;float&amp;gt;().deallocate(x, n+1);&lt;br /&gt;
 cache_aligned_allocator&amp;lt;float&amp;gt;().deallocate(y, n+1);&lt;br /&gt;
&lt;br /&gt;
=====In OpenMP shared memory parallel code annotations=====&lt;br /&gt;
&lt;br /&gt;
OpenMP: largely automatic parallelization with a run-time system based on a thread library.&lt;br /&gt;
&lt;br /&gt;
A simplified translation of the example parallel-for loop is given below.&lt;br /&gt;
&lt;br /&gt;
Global:&lt;br /&gt;
 float e;&lt;br /&gt;
&lt;br /&gt;
main code:&lt;br /&gt;
 float *x = new float[n+1];&lt;br /&gt;
 float *y = new float[n+1];&lt;br /&gt;
 ...&lt;br /&gt;
 e = 0;&lt;br /&gt;
 #pragma omp for reduction(+:e)&lt;br /&gt;
 for (int i=1; i&amp;lt;n; ++i) {&lt;br /&gt;
    x[i] += ( y[i+1] + y[i-1] )*.5;&lt;br /&gt;
    e += y[i] * y[i];&lt;br /&gt;
 }&lt;br /&gt;
 ...&lt;br /&gt;
 delete[] x; delete[] y;&lt;br /&gt;
&lt;br /&gt;
====Summary: Difference among them====&lt;br /&gt;
&lt;br /&gt;
*Pthreads works for all three kinds of parallelism and can express functional parallelism easily, but it requires building specialized synchronization primitives and explicitly privatizing variables, which means more effort is needed to convert a serial program into a parallel one. &lt;br /&gt;
&lt;br /&gt;
*OpenMP provides many performance-enhancing features, such as the atomic, barrier, and flush synchronization primitives. It is very simple to use OpenMP to exploit DOALL parallelism, but the syntax for expressing functional parallelism is awkward. &lt;br /&gt;
&lt;br /&gt;
*Intel TBB relies on generic programming; it performs better with custom iteration spaces or complex reduction operations. It also provides generic parallel patterns for parallel while-loops, data-flow pipeline models, parallel sorts and prefixes, so it is the better choice in cases that go beyond loop-based parallelism.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Synchronization Mechanisms==&lt;br /&gt;
&lt;br /&gt;
===Overview===&lt;br /&gt;
&lt;br /&gt;
In order to accomplish the above parallelizations in a real system, memory accesses must be carefully orchestrated so that no information gets corrupted.  Every architecture handles synchronizing data among parallel processors slightly differently.  This section looks at several architectures and highlights a few of the mechanisms used to achieve this memory synchronization.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===IA-64===&lt;br /&gt;
IA-64 is an Intel architecture that is mainly used in Itanium processors.&lt;br /&gt;
====Spinlock====&lt;br /&gt;
The spinlock is used to guard against multiple simultaneous accesses to the critical section.  The critical section is a section of code that must be executed in sequential order; it cannot be parallelized.  Therefore, when a parallel process comes across an occupied critical section, the process will “spin” until the lock is released. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // The lock variable is 0 if the critical section is&lt;br /&gt;
  // available. If it is 1, another process is in the critical section.&lt;br /&gt;
  //&lt;br /&gt;
  &lt;br /&gt;
  spin_lock:&lt;br /&gt;
    mov	ar.ccv = 0			// cmpxchg looks for avail (0)&lt;br /&gt;
    mov	r2 = 1				// cmpxchg sets to held (1)&lt;br /&gt;
  &lt;br /&gt;
  spin: &lt;br /&gt;
    ld8	r1 = [lock] ;;			// get lock in shared state&lt;br /&gt;
    cmp.ne	p1, p0 = r1, r0		// is lock held (ie, lock != 0)?&lt;br /&gt;
  &lt;br /&gt;
  (p1)	br.cond.spnt	spin ;;		// yes, continue spinning&lt;br /&gt;
    cmpxchg8.acq	r1 = [lock], r2		// attempt to grab lock&lt;br /&gt;
    cmp.ne p1, p0 = r1, r0		// was lock free (ie, old value == 0)?&lt;br /&gt;
  &lt;br /&gt;
  (p1)	br.cond.spnt	spin ;;		// bummer, continue spinning&lt;br /&gt;
  &lt;br /&gt;
  cs_begin:&lt;br /&gt;
    // critical section code goes here...&lt;br /&gt;
  cs_end:&lt;br /&gt;
  &lt;br /&gt;
    st8.rel	[lock] = r0 ;;		// release the lock&lt;br /&gt;
&lt;br /&gt;
The above code demonstrates how a spin lock is used.  When a process reaches the spin lock, it checks whether the lock is available; if not, it proceeds into the spin loop, where it continuously checks whether the lock has become free.  Once it sees that the lock is free, it attempts to obtain it.  If another process obtains the lock first, the process branches back into the spin loop and continues to wait.&lt;br /&gt;
&lt;br /&gt;
====Barrier====&lt;br /&gt;
&lt;br /&gt;
A barrier is a common mechanism used to hold up processes until all processes have reached the same point.  The mechanism is useful in many kinds of parallelism (DOALL, DOACROSS, DOPIPE, reduction, and functional parallelism).  This architecture uses a variant of the barrier mechanism called the sense-reversing barrier.  The idea behind this barrier is to prevent race conditions: if a process from the “next” instance of the barrier races ahead while slow processes from the current barrier are leaving, the fast processes could trap the slow processes at the “next” barrier and thus corrupt the memory synchronization. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====Dekker’s Algorithm====&lt;br /&gt;
&lt;br /&gt;
Dekker’s Algorithm uses variables to indicate which processors are using which resources; it essentially arbitrates for a resource using these variables.  Every processor has a flag that it sets when it is in the critical section.  When a processor is getting ready to enter the critical section, it sets its flag to one, checks that all of the other processors' flags are zero, and then proceeds into the section.  This behavior is demonstrated in the code below for a two-way multiprocessor system, so there are two processor flags, flag_me and flag_you. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // The flag_me variable is zero if we are not in the synchronization and &lt;br /&gt;
  // critical section code and non-zero otherwise; flag_you is similarly set&lt;br /&gt;
  // for the other processor.  This algorithm does not retry access to the &lt;br /&gt;
  // resource if there is contention.&lt;br /&gt;
  &lt;br /&gt;
  dekker:&lt;br /&gt;
    mov		r1 = 1 ;;		// my_flag = 1 (i want access)&lt;br /&gt;
    st8  	[flag_me] = r1&lt;br /&gt;
    mf ;;				// make st visible first&lt;br /&gt;
    ld8 	r2 = [flag_you] ;;		// is other's flag 0?&lt;br /&gt;
    cmp.eq p1, p0 = 0, r2&lt;br /&gt;
  &lt;br /&gt;
  (p0)	br.cond.spnt cs_skip ;;		// if not zero, resource in use &lt;br /&gt;
  &lt;br /&gt;
  cs_begin:&lt;br /&gt;
    // critical section code goes here...&lt;br /&gt;
  cs_end:&lt;br /&gt;
  &lt;br /&gt;
  cs_skip:&lt;br /&gt;
    st8.rel	[flag_me] = r0 ;;		// release lock&lt;br /&gt;
&lt;br /&gt;
====Lamport’s Algorithm====&lt;br /&gt;
&lt;br /&gt;
Lamport’s Algorithm is similar to a spinlock, with the addition of a fairness mechanism that keeps track of the order in which processes request the shared resource and grants access in that same order.  It makes use of two variables, x and y, and a shared array, b.  The code below illustrates this algorithm.  [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // The proc_id variable holds a unique, non-zero id for the process that &lt;br /&gt;
  // attempts access to the critical section.  x and y are the synchronization&lt;br /&gt;
  // variables that indicate who is in the critical section and who is attempting&lt;br /&gt;
  // entry. ptr_b_1 and ptr_b_id point at the 1'st and id'th element of b[].&lt;br /&gt;
  //&lt;br /&gt;
  &lt;br /&gt;
  lamport:&lt;br /&gt;
    	ld8		r1 = [proc_id] ;;	// r1 = unique process id&lt;br /&gt;
  start:&lt;br /&gt;
    	st8	[ptr_b_id] = r1		// b[id] = &amp;quot;true&amp;quot;&lt;br /&gt;
    	st8	[x] = r1			// x = process id&lt;br /&gt;
   	mf					// MUST fence here!&lt;br /&gt;
    	ld8	r2 = [y] ;;&lt;br /&gt;
    	cmp.ne p1, p0 = 0, r2;;		// if (y !=0) then...&lt;br /&gt;
  (p1)	st8	[ptr_b_id] = r0		// ... b[id] = &amp;quot;false&amp;quot;&lt;br /&gt;
  (p1)	br.cond.sptk	wait_y		// ... wait until y == 0&lt;br /&gt;
  &lt;br /&gt;
  	st8	[y] = r1		// y = process id&lt;br /&gt;
  	mf&lt;br /&gt;
  	ld8 	r3 = [x] ;;		&lt;br /&gt;
  	cmp.eq p1, p0 = r1, r3 ;;	// if (x == id) then..&lt;br /&gt;
  (p1)	br.cond.sptk cs_begin		// ... enter critical section&lt;br /&gt;
  &lt;br /&gt;
  	st8 	[ptr_b_id] = r0		// b[id] = &amp;quot;false&amp;quot;&lt;br /&gt;
  	ld8	r3 = [ptr_b_1]		// r3 = &amp;amp;b[1]&lt;br /&gt;
  	mov	ar.lc = N-1 ;;		// lc = number of processors - 1&lt;br /&gt;
  wait_b:&lt;br /&gt;
  	ld8	r2 = [r3] ;;		&lt;br /&gt;
  	cmp.ne p1, p0 = 0, r2		// if (b[j] != 0) then...&lt;br /&gt;
  (p1)	br.cond.spnt	wait_b ;;	// ... wait until b[j] == 0&lt;br /&gt;
  	add	r3 = 8, r3		// r3 = &amp;amp;b[j+1]&lt;br /&gt;
  	br.cloop.sptk	wait_b ;;	// loop over b[j] for each j&lt;br /&gt;
  &lt;br /&gt;
  wait_y:&lt;br /&gt;
  	ld8	r2 = [y] ;;		// if (y != 0) then...&lt;br /&gt;
  	cmp.ne p1, p0 = 0, r2&lt;br /&gt;
  (p1)  br.cond.spnt 	wait_y		// ... wait until y == 0&lt;br /&gt;
  	br	start			// back to start to try again&lt;br /&gt;
  &lt;br /&gt;
  cs_begin:&lt;br /&gt;
    // critical section code goes here...&lt;br /&gt;
  cs_end:&lt;br /&gt;
  &lt;br /&gt;
  	st8	[y] = r0		// release the lock&lt;br /&gt;
  	st8.rel[ptr_b_id] = r0 ;;	// b[id] = &amp;quot;false&amp;quot;&lt;br /&gt;
===IA-32=== &lt;br /&gt;
&lt;br /&gt;
IA-32 is an Intel architecture that is also known as x86.  This is a very widely used architecture.&lt;br /&gt;
&lt;br /&gt;
====Locked Atomic Operation====&lt;br /&gt;
This is the main mechanism for this architecture to manage shared data structures such as semaphores and system segments.  The process uses the following three interdependent mechanisms to implement the locked atomic operation: [[#References|&amp;lt;sup&amp;gt;[13]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*  Guaranteed atomic operations.&lt;br /&gt;
*  Bus locking, using the LOCK# signal and the LOCK instruction prefix.&lt;br /&gt;
*  Cache coherency protocols that ensure that atomic operations can be carried out on cached data structures (cache lock). This mechanism is present in the P6 family processors.&lt;br /&gt;
&lt;br /&gt;
=====Guaranteed Atomic Operation=====&lt;br /&gt;
The following operations are guaranteed to be carried out atomically: [[#References|&amp;lt;sup&amp;gt;[13]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*  Reading or writing a byte.&lt;br /&gt;
*  Reading or writing a word aligned on a 16-bit boundary.&lt;br /&gt;
*  Reading or writing a doubleword aligned on a 32-bit boundary.&lt;br /&gt;
The P6 family processors guarantee that the following additional memory operations will always be carried out atomically:&lt;br /&gt;
*  Reading or writing a quadword aligned on a 64-bit boundary. (This operation is also guaranteed on the Pentium® processor.)&lt;br /&gt;
*  16-bit accesses to uncached memory locations that fit within a 32-bit data bus.&lt;br /&gt;
*  16-, 32-, and 64-bit accesses to cached memory that fit within a 32-Byte cache line.&lt;br /&gt;
&lt;br /&gt;
=====Bus Locking=====&lt;br /&gt;
A LOCK# signal is asserted automatically during certain critical memory operations in order to lock the system bus and grant exclusive control of it to the processor executing the locked operation.  While the signal is asserted, requests from other processors to control the bus are blocked.&lt;br /&gt;
&lt;br /&gt;
===Linux Kernel===&lt;br /&gt;
&lt;br /&gt;
The Linux kernel is referred to here as an “architecture”, though it is fairly unconventional in that it is an open-source operating system kernel with full access to the hardware. Because it uses many common synchronization mechanisms, it is considered here. [[#References|&amp;lt;sup&amp;gt;[15]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====Busy-waiting lock====&lt;br /&gt;
&lt;br /&gt;
=====Spinlocks=====&lt;br /&gt;
&lt;br /&gt;
This mechanism is very similar to the one described for the IA-64 architecture: it manages access to a critical section of code.  If a process tries to acquire the lock and fails, it will sit and “spin” while it waits for the lock to be released.&lt;br /&gt;
&lt;br /&gt;
=====Rwlocks=====&lt;br /&gt;
&lt;br /&gt;
This is a special kind of spinlock intended for structures that are frequently read but rarely written.  The lock allows multiple reads in parallel, which can increase efficiency because processes do not have to sit and wait merely to carry out a read.  As before, however, only one write is allowed at a time, with no reads done in parallel.&lt;br /&gt;
&lt;br /&gt;
=====Brlocks=====&lt;br /&gt;
&lt;br /&gt;
This is a very fast read/write lock, but it carries a write-side penalty.  The main advantage of this lock is that it prevents cache “ping-pong” in the multiple-reader case.&lt;br /&gt;
&lt;br /&gt;
====Sleeper locks====&lt;br /&gt;
&lt;br /&gt;
=====Semaphores=====&lt;br /&gt;
&lt;br /&gt;
A semaphore is a special variable that acts similarly to a lock.  If the semaphore can be acquired, the process proceeds into the critical section.  If the semaphore cannot be acquired, the process is “put to sleep” and the processor is used for another process; the sleeping process's state is saved off in a place from which it can be retrieved when the process is “woken up”.  Once the semaphore becomes available, the “sleeping” process is woken up, obtains the semaphore, and proceeds into the critical section. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===CUDA=== &lt;br /&gt;
&lt;br /&gt;
CUDA, or Compute Unified Device Architecture, is an Nvidia architecture which is the computing engine for their graphics processors.&lt;br /&gt;
&lt;br /&gt;
====__syncthreads====&lt;br /&gt;
&lt;br /&gt;
The __syncthreads() operation can be used at the end of a parallel section as a sort of “barrier” mechanism.  It is necessary to ensure the correctness of memory accesses.  In the following example, there are two calls to __syncthreads(), and both are necessary to ensure the expected results are obtained.  Without them, myArray[tid] could be read as either 2 or its original value, depending on when the reads and the write take place.[[#References|&amp;lt;sup&amp;gt;[14]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
  // myArray is an array of integers located in global or shared&lt;br /&gt;
  // memory&lt;br /&gt;
  __global__ void MyKernel(int* result) {&lt;br /&gt;
     int tid = threadIdx.x;&lt;br /&gt;
     ...&lt;br /&gt;
     int ref1 = myArray[tid];&lt;br /&gt;
     __syncthreads();&lt;br /&gt;
     myArray[tid + 1] = 2;&lt;br /&gt;
     __syncthreads();&lt;br /&gt;
     int ref2 = myArray[tid];&lt;br /&gt;
     result[tid] = ref1 * ref2;&lt;br /&gt;
     ...&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;ol&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://openmp.org/wp/about-openmp/ OpenMP.org]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[https://docs.google.com/viewer?a=v&amp;amp;pid=gmail&amp;amp;attid=0.1&amp;amp;thid=126f8a391c11262c&amp;amp;mt=application%2Fpdf&amp;amp;url=https%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3D2%26ik%3Dd38b56c94f%26view%3Datt%26th%3D126f8a391c11262c%26attid%3D0.1%26disp%3Dattd%26realattid%3Df_g602ojwk0%26zw&amp;amp;sig=AHIEtbTeQDhK98IswmnVSfrPBMfmPLH5Nw An Optimal Abtraction Model for Hardware Multithreading in Modern Processor Architectures]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.threadingbuildingblocks.org/uploads/81/91/Latest%20Open%20Source%20Documentation/Reference.pdf Intel Threading Building Blocks 2.2 for Open Source Reference Manual]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.csc.ncsu.edu/faculty/efg/506/s10/ NCSU CSC 506 Parallel Computing Systems]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://parallel-for.sourceforge.net/tbb.html Sourceforge.net]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[https://computing.llnl.gov/tutorials/openMP/ OpenMP]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.computer.org/portal/web/csdl/doi/10.1109/SNPD.2009.16 Barrier Optimization for OpenMP Program]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://cs.anu.edu.au/~Alistair.Rendell/sc02/module3.pdf Performance Programming: Theory, Practice and Case Studies]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://software.intel.com/en-us/articles/intel-threading-building-blocks-openmp-or-native-threads/ Intel® Threading Building Blocks, OpenMP, or native threads?]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[https://computing.llnl.gov/tutorials/pthreads/#Joining POSIX Threads Programming by Blaise Barney, Lawrence Livermore National Laboratory]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://homepage.mac.com/dbutenhof/Threads/source.html Programing with POSIX Threads source code]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://refspecs.freestandards.org/IA64-softdevman-vol2.pdf IA-64 Software Development Manual]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://refspecs.freestandards.org/IA32-softdevman-vol3.pdf IA-32 Software Development Manual]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf CUDA Programming Guide]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.google.com/url?sa=t&amp;amp;source=web&amp;amp;cd=6&amp;amp;ved=0CEQQFjAF&amp;amp;url=http%3A%2F%2Flinuxindore.com%2Fdownloads%2Fdownload%2Fdata-structures%2Flinux-kernel-arch&amp;amp;ei=jxZWTaGTNI34sAPWm-ScDA&amp;amp;usg=AFQjCNG9UOAz7rHfwUDfayhr50M87uNOYA&amp;amp;sig2=azvo4h85RkoNHcZUtNIkJw Linux Kernel Architecture Overveiw]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ol&amp;gt;&lt;/div&gt;</summary>
		<author><name>Akrepask</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch3_ab&amp;diff=43699</id>
		<title>CSC/ECE 506 Spring 2011/ch3 ab</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch3_ab&amp;diff=43699"/>
		<updated>2011-02-12T03:53:57Z</updated>

		<summary type="html">&lt;p&gt;Akrepask: testing&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Supplement to Chapter 3: Support for parallel-programming models. Discuss how DOACROSS, DOPIPE, DOALL, etc. are implemented in packages such as Posix threads, Intel Thread Building Blocks, OpenMP 2.0 and 3.0.&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this wiki supplement, we discuss how the three kinds of parallelism, i.e. DOALL, DOACROSS, and DOPIPE, are implemented in the thread packages OpenMP, Intel Threading Building Blocks, and POSIX Threads. We examine each package with respect to variable scoping and its Reduction/DOALL/DOACROSS/DOPIPE implementations.&lt;br /&gt;
&lt;br /&gt;
==Implementation==&lt;br /&gt;
&lt;br /&gt;
===OpenMP===&lt;br /&gt;
The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix platforms and Windows NT platforms. Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.&lt;br /&gt;
&lt;br /&gt;
====Variable Clauses ====&lt;br /&gt;
There are many different types of clauses in OpenMP, each with its own characteristics. Here we introduce the data-sharing attribute clauses, synchronization clauses, scheduling clauses, initialization, and reduction. &lt;br /&gt;
=====Data sharing attribute clauses=====&lt;br /&gt;
* ''shared'': the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.&lt;br /&gt;
  Format: shared ''(list)''&lt;br /&gt;
&lt;br /&gt;
  SHARED variables behave as follows:&lt;br /&gt;
  1. Existing in only one memory location and all threads can read or write to that address &lt;br /&gt;
&lt;br /&gt;
* ''private'': the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.&lt;br /&gt;
  Format: private ''(list)''&lt;br /&gt;
&lt;br /&gt;
  PRIVATE variables behave as follows: &lt;br /&gt;
    1. A new object of the same type is declared once for each thread in the team&lt;br /&gt;
    2. All references to the original object are replaced with references to the new object&lt;br /&gt;
    3. Variables declared PRIVATE should be assumed to be uninitialized for each thread &lt;br /&gt;
&lt;br /&gt;
* ''default'': allows the programmer to state that the default data scoping within a parallel region will be either ''shared'', or ''none'' for C/C++, or ''shared'', ''firstprivate'', ''private'', or ''none'' for Fortran.  The ''none'' option forces the programmer to declare each variable in the parallel region using the data sharing attribute clauses.&lt;br /&gt;
  Format: default (shared | none)&lt;br /&gt;
&lt;br /&gt;
  DEFAULT variables behave as follows: &lt;br /&gt;
    1. Specific variables can be exempted from the default using the PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE, and REDUCTION clauses. &lt;br /&gt;
    2. Using NONE as a default requires that the programmer explicitly scope all variables.&lt;br /&gt;
&lt;br /&gt;
=====Synchronization clauses=====&lt;br /&gt;
* ''critical section'': the enclosed code block will be executed by only one thread at a time, and not simultaneously executed by multiple threads. It is often used to protect shared data from race conditions.&lt;br /&gt;
  Format: #pragma omp critical ''[ name ]  newline''&lt;br /&gt;
           ''structured_block''&lt;br /&gt;
&lt;br /&gt;
  CRITICAL SECTION behaves as follows:&lt;br /&gt;
    1. If a thread is currently executing inside a CRITICAL region and another thread reaches that CRITICAL region and attempts to execute it, it will block until the first thread exits that CRITICAL region.&lt;br /&gt;
    2. It is illegal to branch into or out of a CRITICAL block. &lt;br /&gt;
&lt;br /&gt;
* ''atomic'': similar to ''critical section'', but advise the compiler to use special hardware instructions for better performance. Compilers may choose to ignore this suggestion from users and use ''critical section'' instead.&lt;br /&gt;
  Format: #pragma omp atomic  ''newline''&lt;br /&gt;
           ''statement_expression''&lt;br /&gt;
&lt;br /&gt;
  ATOMIC behaves as follows:&lt;br /&gt;
    1. Applies only to a single, immediately following statement.&lt;br /&gt;
    2. An atomic statement must follow a specific syntax. &lt;br /&gt;
&lt;br /&gt;
* ''ordered'': the structured block is executed in the order in which iterations would be executed in a sequential loop&lt;br /&gt;
  Format: #pragma omp for ordered ''[clauses...]''&lt;br /&gt;
          ''(loop region)''&lt;br /&gt;
          #pragma omp ordered  ''newline''&lt;br /&gt;
          ''structured_block''&lt;br /&gt;
          ''(end of loop region)''&lt;br /&gt;
&lt;br /&gt;
  ORDERED behaves as follows:&lt;br /&gt;
    1. It can only appear in the dynamic extent of ''for'' or ''parallel for (C/C++)'' regions.&lt;br /&gt;
    2. Only one thread is allowed in an ordered section at any time.&lt;br /&gt;
    3. It is illegal to branch into or out of an ORDERED block. &lt;br /&gt;
    4. A loop which contains an ORDERED directive must be a loop with an ORDERED clause. &lt;br /&gt;
&lt;br /&gt;
* ''barrier'': each thread waits until all of the other threads of a team have reached this point. A work-sharing construct has an implicit barrier synchronization at the end.&lt;br /&gt;
   Format: #pragma omp barrier  ''newline''&lt;br /&gt;
&lt;br /&gt;
   BARRIER behaves as follows:&lt;br /&gt;
    1. All threads in a team (or none) must execute the BARRIER region.&lt;br /&gt;
    2. The sequence of work-sharing regions and barrier regions encountered must be the same for every thread in a team.&lt;br /&gt;
&lt;br /&gt;
*''taskwait'': specifies a wait on the completion of child tasks generated since the beginning of the current task. The encountering task suspends until all of its child tasks have completed.&lt;br /&gt;
   Format: #pragma omp taskwait  ''newline''&lt;br /&gt;
&lt;br /&gt;
   TASKWAIT behaves as follows:&lt;br /&gt;
    1. It may be placed only at a point where a base language statement is allowed.&lt;br /&gt;
    2. It may not be used in place of the statement following an if, while, do, switch, or label.&lt;br /&gt;
&lt;br /&gt;
*''flush'': The FLUSH directive identifies a synchronization point at which the implementation must provide a consistent view of memory. Thread-visible variables are written back to memory at this point. &lt;br /&gt;
   Format: #pragma omp flush ''(list)  newline''&lt;br /&gt;
&lt;br /&gt;
   FLUSH behaves as follows:&lt;br /&gt;
    1. The optional list contains a list of named variables that will be flushed in order to avoid flushing all variables.&lt;br /&gt;
    2. Implementations must ensure any prior modifications to thread-visible variables are visible to all threads after this point.&lt;br /&gt;
&lt;br /&gt;
=====Scheduling clauses=====&lt;br /&gt;
*''schedule(type, chunk)'': This is useful if the work sharing construct is a do-loop or for-loop. The iteration(s) in the work sharing construct are assigned to threads according to the scheduling method defined by this clause. The three types of scheduling are:&lt;br /&gt;
#''static'': Here, all the threads are allocated iterations before they execute the loop iterations. The iterations are divided among threads equally by default. However, specifying an integer for the parameter &amp;quot;chunk&amp;quot; will allocate &amp;quot;chunk&amp;quot; number of contiguous iterations to a particular thread.&lt;br /&gt;
#''dynamic'': Here, some of the iterations are allocated to a smaller number of threads. Once a particular thread finishes its allocated iteration, it returns to get another one from the iterations that are left. The parameter &amp;quot;chunk&amp;quot; defines the number of contiguous iterations that are allocated to a thread at a time.&lt;br /&gt;
#''guided'': A large chunk of contiguous iterations are allocated to each thread dynamically (as above). The chunk size decreases exponentially with each successive allocation to a minimum size specified in the parameter &amp;quot;chunk&amp;quot;&lt;br /&gt;
=====Initialization=====&lt;br /&gt;
* ''firstprivate'': the data is private to each thread, but initialized using the value of the variable using the same name from the master thread.&lt;br /&gt;
  Format: firstprivate ''(list)''&lt;br /&gt;
&lt;br /&gt;
  FIRSTPRIVATE variables behave as follows: &lt;br /&gt;
    1. Listed variables are initialized according to the value of their original objects prior to entry into the parallel or work-sharing construct. &lt;br /&gt;
&lt;br /&gt;
* ''lastprivate'': the data is private to each thread. The value of this private data will be copied to a global variable using the same name outside the parallel region if current iteration is the last iteration in the parallelized loop.  A variable can be both ''firstprivate'' and ''lastprivate''. &lt;br /&gt;
  Format: lastprivate ''(list)''&lt;br /&gt;
&lt;br /&gt;
* ''threadprivate'': the data is global, but it is private in each parallel region at runtime. The difference between ''threadprivate'' and ''private'' is the global scope associated with ''threadprivate'' and its value being preserved across parallel regions.&lt;br /&gt;
  Format: #pragma omp threadprivate ''(list)''&lt;br /&gt;
&lt;br /&gt;
  THREADPRIVATE variables behave as follows: &lt;br /&gt;
    1. On first entry to a parallel region, data in THREADPRIVATE variables and common blocks should be assumed undefined. &lt;br /&gt;
    2. The THREADPRIVATE directive must appear after every declaration of a thread private variable/common block.&lt;br /&gt;
&lt;br /&gt;
=====Reduction=====&lt;br /&gt;
* ''reduction'': the variable has a local copy in each thread, but the values of the local copies are summarized (reduced) into a global shared variable. This is very useful when a particular operation (specified by the &amp;quot;operator&amp;quot; of this clause) runs iteratively over a variable, so that its value at a particular iteration depends on its value at a previous iteration. Essentially, the steps that lead up to each operational increment are parallelized, but the threads gather and wait before updating the variable, then update it in order so as to avoid a race condition. &lt;br /&gt;
  Format: reduction ''(operator: list)''&lt;br /&gt;
&lt;br /&gt;
  REDUCTION variables behave as follows: &lt;br /&gt;
    1. Variables in the list must be named scalar variables. They can not be array or structure type variables. They must also be declared SHARED in the enclosing context.&lt;br /&gt;
    2. Reduction operations may not be associative for real numbers.&lt;br /&gt;
&lt;br /&gt;
====DOALL====&lt;br /&gt;
&lt;br /&gt;
In code 3.20, first we must include the header file ''omp.h'', which contains the OpenMP function declarations. Next, a parallel region is started by  #pragma omp parallel, and we enclose the region in curly brackets. We can use (setenv OMP_NUM_THREADS n) to specify the number of threads. Another way to determine the number of threads is to call the function (omp_set_num_threads (n)) directly. &lt;br /&gt;
Code 3.20 has only one loop that we want to execute in parallel, so inside the parallel region we use the work-sharing directive ''#pragma omp for''. &lt;br /&gt;
 &lt;br /&gt;
 '''Code 3.20 A DOALL parallelism example in OpenMP'''&lt;br /&gt;
 '''#include''' &amp;lt;omp.h&amp;gt;&lt;br /&gt;
 '''...'''&lt;br /&gt;
 '''#pragma''' omp parallel //start of parallel region&lt;br /&gt;
 '''{'''&lt;br /&gt;
  '''...'''&lt;br /&gt;
  '''#pragma''' omp for default (shared)&lt;br /&gt;
  '''for''' ( i = 0; i &amp;lt; n ; i++)&lt;br /&gt;
    '''A[i]''' = A[i] + A[i] - 3.0;&lt;br /&gt;
 '''}'''//end for parallel region&lt;br /&gt;
&lt;br /&gt;
Clearly, there is no loop-carried dependence in the ''i'' loop. With OpenMP, we only need to insert the ''pragma'' directives shown. The ''default(shared)'' clause states that all variables within the scope of the loop are shared unless otherwise specified.&lt;br /&gt;
&lt;br /&gt;
====DOACROSS====&lt;br /&gt;
We will now introduce how to implement DOACROSS in OpenMP. Here is an example code which has not been parallelized yet.&lt;br /&gt;
 &lt;br /&gt;
 '''Sample Code'''&lt;br /&gt;
 01: for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 02: for(j=1; j&amp;lt;N; j++){&lt;br /&gt;
 03: a[i][j]=a[i-1][j]+a[i][j-1];&lt;br /&gt;
 04: }&lt;br /&gt;
 05: }&lt;br /&gt;
&lt;br /&gt;
This sample code clearly contains loop-carried dependences: &lt;br /&gt;
 a[i,j] -&amp;gt;T a[i+1,j] and a[i,j] -&amp;gt;T a[i,j+1]&lt;br /&gt;
&lt;br /&gt;
In OpenMP, DOALL parallelism can be implemented by inserting a “#pragma omp for” before the “for” structure in the source code. However, there is no pragma corresponding to DOACROSS parallelism.&lt;br /&gt;
&lt;br /&gt;
When we implement DOACROSS, we use a shared array &amp;quot;_mylocks[threadid]&amp;quot;, which is defined to store the events of each thread. Besides, a private variable _counter0 is defined to indicate the event which the current thread is waiting for. &lt;br /&gt;
The number of threads is obtained by calling &amp;quot;omp_get_num_threads()&amp;quot; and the current thread's id is obtained by calling &amp;quot;omp_get_thread_num()&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
*omp_get_num_threads(): Returns the number of threads that are currently in the team executing the parallel region from which it is called.&lt;br /&gt;
 Format: #include &amp;lt;omp.h&amp;gt;&lt;br /&gt;
         int omp_get_num_threads(void)&lt;br /&gt;
&lt;br /&gt;
 OMP_GET_NUM_THREADS behaves as follows:&lt;br /&gt;
  1. If this call is made from a serial portion of the program, or a nested parallel region that is serialized, it will return 1. &lt;br /&gt;
  2. The default number of threads is implementation dependent. &lt;br /&gt;
&lt;br /&gt;
*omp_get_thread_num(): Returns the thread number of the thread, within the team, making this call. This number will be between 0 and OMP_GET_NUM_THREADS-1. The master thread of the team is thread 0. &lt;br /&gt;
 Format: #include &amp;lt;omp.h&amp;gt;&lt;br /&gt;
         int omp_get_thread_num(void)&lt;br /&gt;
&lt;br /&gt;
 OMP_GET_THREAD_NUM behaves as follows:&lt;br /&gt;
  1. If called from a nested parallel region, or a serial region, this function will return 0. &lt;br /&gt;
&lt;br /&gt;
Now, let's look at the parallelized code and its explanation. &lt;br /&gt;
&lt;br /&gt;
 01: int _mylocks[256]; //threads' synchronization array&lt;br /&gt;
 02: #pragma omp parallel&lt;br /&gt;
 03: {&lt;br /&gt;
 04:   int _counter0 = 1;&lt;br /&gt;
 05:   int _my_id = omp_get_thread_num();&lt;br /&gt;
 06:   int _my_nprocs = omp_get_num_threads();&lt;br /&gt;
 07:   _mylocks[_my_id] = 0;&lt;br /&gt;
 08:   for(j_tile = 0; j_tile&amp;lt;N-1; j_tile+=M){&lt;br /&gt;
 09:     if(_my_id&amp;gt;0) {&lt;br /&gt;
 10:       do{&lt;br /&gt;
 11:         #pragma omp flush(_mylocks)&lt;br /&gt;
 12:       } while(_mylocks[_my_id-1]&amp;lt;_counter0);&lt;br /&gt;
 13:       #pragma omp flush(a, _mylocks)&lt;br /&gt;
 14:       _counter0 += 1;&lt;br /&gt;
 15:     }&lt;br /&gt;
 16:     #pragma omp for nowait&lt;br /&gt;
 17:     for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 18:       for(j=j_tile; j&amp;lt;j_tile+M; j++){&lt;br /&gt;
 19:         a[i][j]=a[i-1][j]+a[i][j-1];&lt;br /&gt;
 20:       }&lt;br /&gt;
 21:     }&lt;br /&gt;
 22:     _mylocks[_my_id] += 1;&lt;br /&gt;
 23:     #pragma omp flush(a, _mylocks)&lt;br /&gt;
 24:   }&lt;br /&gt;
 25: }&lt;br /&gt;
&lt;br /&gt;
We paralleled the original program in two steps.&lt;br /&gt;
&lt;br /&gt;
*First step: We divide the i loop among the processors by inserting the OpenMP construct “#pragma omp for nowait” (line 16). Afterwards, each processor takes a share of the iterations of loop i. The same applies to the j loop. Assume the size of each block is M; each processor then executes M iterations of loop j at a time. In order to keep the total number of iterations equal to the original program, the j loop has to be enclosed in a tiling loop, so the new outer loop looks like ''for (j_tile = 0; j_tile &amp;lt; N-1; j_tile += M)'' (line 08).&lt;br /&gt;
The lower bound of loop j is set to j_tile and the upper bound is j_tile+M-1. The other statements remain unchanged.&lt;br /&gt;
&lt;br /&gt;
*Second step: We synchronize neighboring threads. After the first step, each processor computes one block at a time. If all processors simply run in parallel, the dependence will be violated, so we have to synchronize neighboring threads.&lt;br /&gt;
We set 4 variables as follows: &lt;br /&gt;
1. A private variable _my_nprocs = omp_get_num_threads(), which indicates the total number of threads that run the corresponding parallel region.&lt;br /&gt;
2. A private variable _my_id = omp_get_thread_num(), which indicates the unique ID of the current thread.&lt;br /&gt;
3. A shared array _mylocks[proc], initialized to 0 for each element, which is used to indicate whether thread proc-1 has finished computing its current block.&lt;br /&gt;
4. A private variable _counter0, initialized to 1, which indicates the block that the current thread is waiting for.&lt;br /&gt;
&lt;br /&gt;
With these four variables, the threads are synchronized as follows:&lt;br /&gt;
The first thread continues to run without waiting (line 9), because its thread ID is 0. All other threads cannot proceed past line 12 while the value in ''_mylocks[_my_id-1]'' is smaller than ''_counter0''.&lt;br /&gt;
&lt;br /&gt;
Otherwise, the block that the current thread is waiting for must have been completed, so the current thread can proceed past line 12 and mark the next block it will wait for by adding 1 to ''_counter0'' (line 14).&lt;br /&gt;
&lt;br /&gt;
When the current thread finishes its block, it signals that it has finished by ''_mylocks[_my_id]++''. Once the neighboring thread finds that the value has changed, it continues running, and so on. The figure below illustrates this.&lt;br /&gt;
[[Image:Synchorization.jpg]]&lt;br /&gt;
&lt;br /&gt;
====DOPIPE====&lt;br /&gt;
Here is another example code, and we are going to parallelize it with DOPIPE parallelism. There is a dependence S1 -&amp;gt;T S2 in the sample code, as well as a loop-carried dependence S2 -&amp;gt;T S2.&lt;br /&gt;
 '''Sample Code'''&lt;br /&gt;
 01: for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 02: S1: a[i]=b[i];&lt;br /&gt;
 03: S2: c[i]=c[i-1]+a[i];&lt;br /&gt;
 04: &lt;br /&gt;
 05: }&lt;br /&gt;
Now, let's see how to parallelize the sample code with DOPIPE parallelism.&lt;br /&gt;
We still use a shared array &amp;quot;_mylocks[threadid]&amp;quot;, which is defined to store the events of each thread. Besides, a private variable _counter0 is defined to indicate the event which the current thread is waiting for. &lt;br /&gt;
The number of threads is obtained by calling &amp;quot;omp_get_num_threads()&amp;quot; and the current thread's id is obtained by calling &amp;quot;omp_get_thread_num()&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
 01: int _mylocks[256]; //threads' synchronization array&lt;br /&gt;
 02: #pragma omp parallel&lt;br /&gt;
 03: {&lt;br /&gt;
 04:   int _counter0 = 1;&lt;br /&gt;
 05:   int _my_id = omp_get_thread_num();&lt;br /&gt;
 06:   int _my_nprocs = omp_get_num_threads();&lt;br /&gt;
 07:   _mylocks[_my_id] = 0;&lt;br /&gt;
 08:   for(i_tile = 0; i_tile&amp;lt;N-1; i_tile+=M){&lt;br /&gt;
 09:     if(_my_id&amp;gt;0) {&lt;br /&gt;
 10:       do{&lt;br /&gt;
 11:         #pragma omp flush(_mylocks)&lt;br /&gt;
 12:       } while(_mylocks[_my_id-1]&amp;lt;_counter0);&lt;br /&gt;
 13:       #pragma omp flush(a, _mylocks)&lt;br /&gt;
 14:       _counter0 += 1;&lt;br /&gt;
 15:     }&lt;br /&gt;
 16:     #pragma omp for nowait&lt;br /&gt;
 17:     for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 18:       a[i]=b[i];&lt;br /&gt;
 19:     }&lt;br /&gt;
 20:     for(i=1; i&amp;lt; N; i++) {&lt;br /&gt;
 21:       c[i]=c[i-1]+a[i];&lt;br /&gt;
 22:     }&lt;br /&gt;
 23:     _mylocks[_my_id] += 1;&lt;br /&gt;
 24:     #pragma omp flush(a, _mylocks)&lt;br /&gt;
 25:   }&lt;br /&gt;
 26: }&lt;br /&gt;
&lt;br /&gt;
We parallelized the original program in two steps.&lt;br /&gt;
&lt;br /&gt;
*First step: We divide the i loop among the processors by inserting the OpenMP construct “#pragma omp for nowait” (line 16). Afterwards, each processor takes a share of the iterations of loop i. There are now two i loops, and each contains different statements. The other statements remain unchanged.&lt;br /&gt;
&lt;br /&gt;
*Second step: We synchronize the threads. After the first step, the processors finish computing &lt;br /&gt;
a[i]=b[i]. If all processors execute the second i loop in parallel, the dependence will be violated, so we have to synchronize neighboring threads.&lt;br /&gt;
Again, we set 4 variables as follows: &lt;br /&gt;
1. A private variable _my_nprocs = omp_get_num_threads(), which indicates the total number of threads that run the corresponding parallel region.&lt;br /&gt;
2. A private variable _my_id = omp_get_thread_num(), which indicates the unique ID of the current thread.&lt;br /&gt;
3. A shared array _mylocks[proc], initialized to 0 for each element, which is used to indicate whether thread proc-1 has finished computing its current block.&lt;br /&gt;
4. A private variable _counter0, initialized to 1, which indicates the block that the current thread is waiting for.&lt;br /&gt;
&lt;br /&gt;
When the current thread finishes its block, it signals that it has finished by ''_mylocks[_my_id]++''. Once a processor finishes its own block, the neighboring processor can read the produced values and use them in its own statement, and so on.&lt;br /&gt;
&lt;br /&gt;
====Functional Parallelism====&lt;br /&gt;
&lt;br /&gt;
In order to introduce functional parallelism, we want to execute one code section in parallel with another code section. We use code 3.21 to show two loops that execute in parallel with respect to one another, although each loop is executed sequentially.&lt;br /&gt;
&lt;br /&gt;
 '''Code''' 3.21 A function parallelism example in OpenMP&lt;br /&gt;
 '''#pragma''' omp parallel shared(A, B) private(i)&lt;br /&gt;
 '''{'''&lt;br /&gt;
  '''#pragma''' omp sections nowait&lt;br /&gt;
  '''{'''&lt;br /&gt;
      '''#pragma''' omp section&lt;br /&gt;
      '''for'''( i = 0; i &amp;lt; n ; i++)&lt;br /&gt;
         '''A[i]''' = A[i]*A[i] - 4.0;&lt;br /&gt;
      '''#pragma''' omp section&lt;br /&gt;
      '''for'''( i = 0; i &amp;lt; n ; i++)&lt;br /&gt;
         '''B[i]''' = B[i]*B[i] - 9.0;&lt;br /&gt;
  '''}'''//end omp sections&lt;br /&gt;
 '''}'''//end omp parallel&lt;br /&gt;
&lt;br /&gt;
In code 3.21, there are two loops that need to be executed in parallel. We just need to insert two ''pragma omp section'' directives. With these directives, the two loops execute in parallel with respect to each other, while each individual loop is executed sequentially by a single thread.&lt;br /&gt;
&lt;br /&gt;
===Intel Threading Building Blocks===&lt;br /&gt;
&lt;br /&gt;
Intel Threading Building Blocks (Intel TBB) is a library that supports scalable &lt;br /&gt;
parallel programming using standard ISO C++ code. It does not require special &lt;br /&gt;
languages or compilers. It is designed to promote scalable data parallel programming. &lt;br /&gt;
The library consists of data structures and algorithms that allow a programmer to avoid some complications arising from the use of native threading packages such as POSIX threads, Windows threads, or the portable Boost Threads, in which individual threads of execution are created, synchronized, and terminated manually. Instead, the library abstracts access to the multiple processors by allowing the operations to be treated as &amp;quot;tasks,&amp;quot; which are allocated to individual cores dynamically by the library's run-time engine, and by automating efficient use of the cache. This approach places TBB in a family of solutions for parallel programming that aim to decouple the programming from the particulars of the underlying machine. Intel Threading Building Blocks thus enables you to specify parallelism more conveniently than using raw threads, and at the same time can improve performance.&lt;br /&gt;
&lt;br /&gt;
====Variables Scope====&lt;br /&gt;
&lt;br /&gt;
Intel TBB is a collection of components for parallel programming. Here is an overview of the library contents:&lt;br /&gt;
&lt;br /&gt;
* Basic algorithms: parallel_for, parallel_reduce, parallel_scan&lt;br /&gt;
* Advanced algorithms: parallel_while, parallel_do, pipeline, parallel_sort&lt;br /&gt;
* Containers: concurrent_queue, concurrent_vector, concurrent_hash_map&lt;br /&gt;
* Scalable memory allocation: scalable_malloc, scalable_free, scalable_realloc, scalable_calloc, scalable_allocator, cache_aligned_allocator&lt;br /&gt;
* Mutual exclusion: mutex, spin_mutex, queuing_mutex, spin_rw_mutex, queuing_rw_mutex, recursive mutex&lt;br /&gt;
* Atomic operations: fetch_and_add, fetch_and_increment, fetch_and_decrement, compare_and_swap, fetch_and_store&lt;br /&gt;
* Timing: portable fine grained global time stamp&lt;br /&gt;
* Task Scheduler: direct access to control the creation and activation of tasks&lt;br /&gt;
&lt;br /&gt;
Next, we will focus on some specific TBB components.&lt;br /&gt;
&lt;br /&gt;
=====parallel_for=====&lt;br /&gt;
&lt;br /&gt;
parallel_for is the template function that performs parallel iteration over a range of values. In Intel TBB, many DOALL cases can be implemented using this function. The syntax is as follows: &lt;br /&gt;
 template&amp;lt;typename Index, typename Function&amp;gt;&lt;br /&gt;
 Function parallel_for(Index first, Index last, Index step, Function f);&lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_for( const Range&amp;amp; range, const Body&amp;amp; body [, partitioner] );&lt;br /&gt;
&lt;br /&gt;
A parallel_for(first, last, step, f) represents parallel execution of the loop: &amp;quot;for( auto i=first; i&amp;lt;last; i+=step ) f(i);&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
=====parallel_reduce=====&lt;br /&gt;
&lt;br /&gt;
Function parallel_reduce computes reduction over a range. Syntax is as follows:&lt;br /&gt;
 template&amp;lt;typename Range, typename Value, typename Func, typename Reduction&amp;gt;&lt;br /&gt;
 Value parallel_reduce( const Range&amp;amp; range, const Value&amp;amp; identity, const Func&amp;amp; func, const Reduction&amp;amp; reduction );&lt;br /&gt;
&lt;br /&gt;
The functional form parallel_reduce(range,identity,func,reduction) performs a&lt;br /&gt;
parallel reduction by applying func to subranges in range and reducing the results&lt;br /&gt;
using binary operator reduction. It returns the result of the reduction. Parameter func&lt;br /&gt;
and reduction can be lambda expressions.&lt;br /&gt;
&lt;br /&gt;
=====parallel_scan=====&lt;br /&gt;
&lt;br /&gt;
This template function computes parallel prefix. Syntax is as follows:&lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_scan( const Range&amp;amp; range, Body&amp;amp; body );&lt;br /&gt;
 &lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_scan( const Range&amp;amp; range, Body&amp;amp; body, const auto_partitioner&amp;amp; );&lt;br /&gt;
 &lt;br /&gt;
 template&amp;lt;typename Range, typename Body&amp;gt;&lt;br /&gt;
 void parallel_scan( const Range&amp;amp; range, Body&amp;amp; body, const simple_partitioner&amp;amp; );&lt;br /&gt;
&lt;br /&gt;
A parallel_scan(range,body) computes a parallel prefix, also known as parallel&lt;br /&gt;
scan. This computation is an advanced concept in parallel computing that is&lt;br /&gt;
sometimes useful in scenarios that appear to have inherently serial dependences. A&lt;br /&gt;
further explanation will be given in the DOACROSS example.&lt;br /&gt;
&lt;br /&gt;
=====pipeline=====&lt;br /&gt;
&lt;br /&gt;
This class performs pipelined execution. Its members are as follows:&lt;br /&gt;
 namespace tbb {&lt;br /&gt;
     class pipeline {&lt;br /&gt;
     public:&lt;br /&gt;
        pipeline();&lt;br /&gt;
        ~pipeline(); &lt;br /&gt;
        void add_filter( filter&amp;amp; f );&lt;br /&gt;
        void run( size_t max_number_of_live_tokens );&lt;br /&gt;
        void clear();&lt;br /&gt;
   };&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
A pipeline represents pipelined application of a series of filters to a stream of items.&lt;br /&gt;
Each filter operates in a particular mode: parallel, serial in order, or serial out of order. With a parallel filter, &lt;br /&gt;
we could implement DOPIPE parallelism.&lt;br /&gt;
&lt;br /&gt;
====Reduction====&lt;br /&gt;
&lt;br /&gt;
The reduction in Intel TBB is implemented using parallel_reduce function. A parallel_reduce recursively splits the range into subranges and uses the splitting constructor to make one or more copies of the body for each thread. We use an example to illustrate this: &lt;br /&gt;
&lt;br /&gt;
 #include &amp;quot;tbb/parallel_reduce.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/blocked_range.h&amp;quot;&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
 struct Sum {&lt;br /&gt;
     float value;&lt;br /&gt;
     Sum() : value(0) {}&lt;br /&gt;
     Sum( Sum&amp;amp; s, split ) {value = 0;}&lt;br /&gt;
     void operator()( const blocked_range&amp;lt;float*&amp;gt;&amp;amp; r ) {&lt;br /&gt;
         float temp = value;&lt;br /&gt;
         for( float* a=r.begin(); a!=r.end(); ++a ) {&lt;br /&gt;
             temp += *a;&lt;br /&gt;
         }&lt;br /&gt;
         value = temp;&lt;br /&gt;
     }&lt;br /&gt;
     void join( Sum&amp;amp; rhs ) {value += rhs.value;}&lt;br /&gt;
 };&lt;br /&gt;
 float ParallelSum( float array[], size_t n ) {&lt;br /&gt;
     Sum total;&lt;br /&gt;
     parallel_reduce( blocked_range&amp;lt;float*&amp;gt;( array, array+n ), total );&lt;br /&gt;
     return total.value;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The above example sums the values in the array. parallel_reduce performs the reduction over the range (array, array+n), splitting the working body as needed and then joining the partial results with the join method.&lt;br /&gt;
&lt;br /&gt;
====DOALL====&lt;br /&gt;
The implementation of DOALL parallelism in Intel TBB will involve Parallel_for function. &lt;br /&gt;
To better illustrate the usage, here we discuss a simple example. The following is the original code:&lt;br /&gt;
 &lt;br /&gt;
 void SerialApplyFoo( float a[], size_t n ) {&lt;br /&gt;
     for( size_t i=0; i&amp;lt;n; ++i )&lt;br /&gt;
         Foo(a[i]);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
After using Intel TBB, it could be switched to the following:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;quot;tbb/blocked_range.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/parallel_for.h&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
 class ApplyFoo {&lt;br /&gt;
     float *const my_a;&lt;br /&gt;
 public:&lt;br /&gt;
     void operator( )( const blocked_range&amp;lt;size_t&amp;gt;&amp;amp; r ) const {&lt;br /&gt;
         float *a = my_a;&lt;br /&gt;
         for( size_t i=r.begin(); i!=r.end( ); ++i )&lt;br /&gt;
             Foo(a[i]);&lt;br /&gt;
     }&lt;br /&gt;
     ApplyFoo( float a[] ) :&lt;br /&gt;
         my_a(a)&lt;br /&gt;
     {}&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 void ParallelApplyFoo( float a[], size_t n ) {&lt;br /&gt;
     parallel_for(blocked_range&amp;lt;size_t&amp;gt;(0,n,The_grain_size_You_Pick), ApplyFoo(a) );&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
This example is the simplest DOALL parallelism, similar to the one in the textbook, and its execution graph is just like the one in the DOALL section above. Though simple, this illustration gives you a flavor of how DOALL would be implemented in Intel Threading Building Blocks.&lt;br /&gt;
&lt;br /&gt;
One more note: parallel_for takes an optional third argument to specify a partitioner, represented above by &amp;quot;The_grain_size_You_Pick&amp;quot;. If you want to manually divide the grain and assign the work to processors, you can specify that in the function. Alternatively, you can use the automatic grain size provided by TBB: the auto_partitioner heuristically chooses the grain size so that you do not have to specify one. The heuristic attempts to limit overhead while still providing ample opportunities for load balancing. With it, the last three lines of the TBB code above become:&lt;br /&gt;
&lt;br /&gt;
 void ParallelApplyFoo( float a[], size_t n ) {&lt;br /&gt;
     parallel_for(blocked_range&amp;lt;size_t&amp;gt;(0,n), ApplyFoo(a), auto_partitioner( ) );&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
====DOACROSS====&lt;br /&gt;
&lt;br /&gt;
We can find a good example in Intel TBB of implementing DOACROSS with the help of parallel_scan. As stated in the parallel_scan section, this function computes a parallel prefix, also known as a parallel scan. This computation is an advanced concept in parallel computing that&lt;br /&gt;
can be helpful in scenarios that appear to have inherently serial dependences, such as loop-carried dependences. &lt;br /&gt;
&lt;br /&gt;
Let's consider this scenario (which is actually the mathematical definition of parallel prefix):  &lt;br /&gt;
 T temp = id⊕;&lt;br /&gt;
 for( int i=1; i&amp;lt;=n; ++i ) {&lt;br /&gt;
     temp = temp ⊕ x[i];&lt;br /&gt;
     y[i] = temp;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we implement this in TBB using parallel_scan, it becomes:&lt;br /&gt;
&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
 class Body {&lt;br /&gt;
     T sum;&lt;br /&gt;
     T* const y;&lt;br /&gt;
     const T* const x;&lt;br /&gt;
 public:&lt;br /&gt;
     Body( T y_[], const T x_[] ) : sum(id⊕), x(x_), y(y_) {}&lt;br /&gt;
     T get_sum() const {return sum;}&lt;br /&gt;
     template&amp;lt;typename Tag&amp;gt;&lt;br /&gt;
     void operator()( const blocked_range&amp;lt;int&amp;gt;&amp;amp; r, Tag ) {&lt;br /&gt;
         T temp = sum;&lt;br /&gt;
         for( int i=r.begin(); i&amp;lt;r.end(); ++i ) {&lt;br /&gt;
             temp = temp ⊕ x[i];&lt;br /&gt;
             if( Tag::is_final_scan() )&lt;br /&gt;
                 y[i] = temp;&lt;br /&gt;
         } &lt;br /&gt;
         sum = temp;&lt;br /&gt;
     }&lt;br /&gt;
     Body( Body&amp;amp; b, split ) : x(b.x), y(b.y), sum(id⊕) {}&lt;br /&gt;
     void reverse_join( Body&amp;amp; a ) { sum = a.sum ⊕ sum;}&lt;br /&gt;
     void assign( Body&amp;amp; b ) {sum = b.sum;}&lt;br /&gt;
 };&lt;br /&gt;
&lt;br /&gt;
 float DoParallelScan( T y[], const T x[], int n ) {&lt;br /&gt;
     Body body(y,x);&lt;br /&gt;
     parallel_scan( blocked_range&amp;lt;int&amp;gt;(0,n), body );&lt;br /&gt;
     return body.get_sum();&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
It is the second part (function DoParallelScan) that we have to focus on. &lt;br /&gt;
&lt;br /&gt;
Actually, this example is just the scenario mentioned above that can take advantage of parallel_scan. The &amp;quot;inherently serial dependences&amp;quot; are taken care of by the functionality of parallel_scan. By computing the prefix, the serial code can be implemented in parallel with just one function.&lt;br /&gt;
&lt;br /&gt;
====DOPIPE====&lt;br /&gt;
&lt;br /&gt;
The pipeline class in Intel TBB performs pipelined execution. A pipeline represents the pipelined application of a series of filters to a stream of items. Each filter operates in a particular mode: parallel, serial in order, or serial out of order. So this class can be used to implement DOPIPE parallelism.&lt;br /&gt;
&lt;br /&gt;
Here is a comparatively complex example about pipeline implementation. Also, if we look carefully, this is an example with both DOPIPE and DOACROSS:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;iostream&amp;gt;&lt;br /&gt;
 #include &amp;quot;tbb/pipeline.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/tbb_thread.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/task_scheduler_init.h&amp;quot;&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
 char InputString[] = &amp;quot;abcdefg\n&amp;quot;;&lt;br /&gt;
 class InputFilter: public filter {&lt;br /&gt;
     char* my_ptr;&lt;br /&gt;
 public:&lt;br /&gt;
     void* operator()(void*) {&lt;br /&gt;
         if (*my_ptr)&lt;br /&gt;
             return my_ptr++;&lt;br /&gt;
         else&lt;br /&gt;
             return NULL;&lt;br /&gt;
     }&lt;br /&gt;
     InputFilter() :&lt;br /&gt;
         filter( serial_in_order ), my_ptr(InputString) {}&lt;br /&gt;
 };&lt;br /&gt;
 class OutputFilter: public thread_bound_filter {&lt;br /&gt;
 public:&lt;br /&gt;
     void* operator()(void* item) {&lt;br /&gt;
         std::cout &amp;lt;&amp;lt; *(char*)item;&lt;br /&gt;
         return NULL;&lt;br /&gt;
     }&lt;br /&gt;
     OutputFilter() : thread_bound_filter(serial_in_order) {}&lt;br /&gt;
 };&lt;br /&gt;
 void RunPipeline(pipeline* p) {&lt;br /&gt;
     p-&amp;gt;run(8);&lt;br /&gt;
 }&lt;br /&gt;
 int main() {&lt;br /&gt;
     // Construct the pipeline&lt;br /&gt;
     InputFilter f;&lt;br /&gt;
     OutputFilter g;&lt;br /&gt;
     pipeline p;&lt;br /&gt;
     p.add_filter(f);&lt;br /&gt;
     p.add_filter(g);&lt;br /&gt;
     // Another thread initiates execution of the pipeline&lt;br /&gt;
     tbb_thread t(RunPipeline,&amp;amp;p);&lt;br /&gt;
     // Process the thread_bound_filter with the current thread.&lt;br /&gt;
     while (g.process_item()!=thread_bound_filter::end_of_stream)&lt;br /&gt;
         continue;&lt;br /&gt;
     // Wait for pipeline to finish on the other thread.&lt;br /&gt;
     t.join();&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The example above shows a pipeline with two filters, where the second filter is a thread_bound_filter serviced by the main thread. After constructing the pipeline, the main thread:&lt;br /&gt;
# Starts the pipeline on another thread.&lt;br /&gt;
# Services the thread_bound_filter until it reaches end_of_stream.&lt;br /&gt;
# Waits for the other thread to finish.&lt;br /&gt;
&lt;br /&gt;
===POSIX Threads===&lt;br /&gt;
&lt;br /&gt;
POSIX Threads, or Pthreads, is a POSIX standard for threads. The standard, POSIX.1c, Threads extensions (IEEE Std 1003.1c-1995), defines an API for creating and manipulating threads.&lt;br /&gt;
&lt;br /&gt;
====The Pthreads API====&lt;br /&gt;
Pthreads defines a set of C programming language types, functions and constants. It is implemented with a pthread.h header and a thread library.&lt;br /&gt;
&lt;br /&gt;
There are around 100 Pthreads procedures, all prefixed &amp;quot;pthread_&amp;quot;. The subroutines comprising the Pthreads API can be informally divided into four major groups:&lt;br /&gt;
&lt;br /&gt;
* '''Thread management:''' Routines that work directly on threads: creating, detaching, joining, etc. They also include functions to set/query thread attributes (joinable, scheduling, etc.). E.g. pthread_create(), pthread_join().&lt;br /&gt;
* '''Mutexes:''' Routines that deal with a synchronization object called a &amp;quot;mutex&amp;quot;, an abbreviation of &amp;quot;mutual exclusion&amp;quot;. Mutex functions provide for creating, destroying, locking and unlocking mutexes, and are supplemented by mutex attribute functions that set or modify attributes associated with mutexes. E.g. pthread_mutex_lock(); pthread_mutex_trylock(); pthread_mutex_unlock().&lt;br /&gt;
* '''Condition variables:''' Routines for communication between threads that share a mutex, based upon programmer-specified conditions. This group includes functions to create, destroy, wait and signal based upon specified variable values, as well as functions to set/query condition variable attributes. E.g. pthread_cond_signal(); pthread_cond_broadcast(); pthread_cond_wait(); pthread_cond_timedwait(); pthread_cond_reltimedwait_np().&lt;br /&gt;
* '''Synchronization:''' Routines that manage read/write locks and barriers. E.g. pthread_rwlock_rdlock(); pthread_rwlock_tryrdlock(); pthread_rwlock_wrlock(); pthread_rwlock_trywrlock(); pthread_rwlock_unlock(); pthread_barrier_init(); pthread_barrier_wait().&lt;br /&gt;
&lt;br /&gt;
====DOALL====&lt;br /&gt;
&lt;br /&gt;
The following is a simple C code example of DOALL parallelism, which prints each thread's ID.&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;
 #define NUM_THREADS     5&lt;br /&gt;
  &lt;br /&gt;
 void *PrintHello(void *threadid)&lt;br /&gt;
 {&lt;br /&gt;
    long tid;&lt;br /&gt;
  &lt;br /&gt;
    tid = (long)threadid;&lt;br /&gt;
    printf(&amp;quot;Hello World! It's me, thread #%ld!\n&amp;quot;, tid);&lt;br /&gt;
    pthread_exit(NULL);&lt;br /&gt;
 }&lt;br /&gt;
  &lt;br /&gt;
 int main (int argc, char *argv[])&lt;br /&gt;
 {&lt;br /&gt;
    pthread_t threads[NUM_THREADS];&lt;br /&gt;
  &lt;br /&gt;
    int rc;&lt;br /&gt;
    long t;&lt;br /&gt;
    for(t=0; t&amp;lt;NUM_THREADS; t++){&lt;br /&gt;
       printf(&amp;quot;In main: creating thread %ld\n&amp;quot;, t);&lt;br /&gt;
       rc = pthread_create(&amp;amp;threads[t], NULL, PrintHello, (void *)t);&lt;br /&gt;
  &lt;br /&gt;
       if (rc){&lt;br /&gt;
          printf(&amp;quot;ERROR; return code from pthread_create() is %d\n&amp;quot;, rc);&lt;br /&gt;
          exit(-1);&lt;br /&gt;
       }&lt;br /&gt;
    }&lt;br /&gt;
    pthread_exit(NULL);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The loop body contains only a single statement with no cross-iteration dependences, so each iteration can be treated as an independent parallel task.&lt;br /&gt;
====DOACROSS====&lt;br /&gt;
&lt;br /&gt;
When it comes to using Pthreads to implement DOACROSS, functional parallelism can be expressed easily, but the resulting code is unnecessarily complicated. See the example below, from '''POSIX Threads Programming''' by Blaise Barney:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;math.h&amp;gt;&lt;br /&gt;
 #define NUM_THREADS 4&lt;br /&gt;
 &lt;br /&gt;
 void *BusyWork(void *t)&lt;br /&gt;
 {&lt;br /&gt;
   int i;&lt;br /&gt;
   long tid;&lt;br /&gt;
   double result=0.0;&lt;br /&gt;
   tid = (long)t;&lt;br /&gt;
   printf(&amp;quot;Thread %ld starting...\n&amp;quot;,tid);&lt;br /&gt;
   for (i=0; i&amp;lt;1000000; i++)&lt;br /&gt;
   {&lt;br /&gt;
      result = result + sin(i) * tan(i);&lt;br /&gt;
   }&lt;br /&gt;
   printf(&amp;quot;Thread %ld done. Result = %e\n&amp;quot;,tid, result);&lt;br /&gt;
   pthread_exit((void*) t);&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 int main (int argc, char *argv[])&lt;br /&gt;
 {&lt;br /&gt;
   pthread_t thread[NUM_THREADS];&lt;br /&gt;
   pthread_attr_t attr;&lt;br /&gt;
   int rc;&lt;br /&gt;
   long t;&lt;br /&gt;
   void *status;&lt;br /&gt;
 &lt;br /&gt;
   /* Initialize and set thread detached attribute */&lt;br /&gt;
   pthread_attr_init(&amp;amp;attr);&lt;br /&gt;
   pthread_attr_setdetachstate(&amp;amp;attr, PTHREAD_CREATE_JOINABLE);&lt;br /&gt;
 &lt;br /&gt;
   for(t=0; t&amp;lt;NUM_THREADS; t++) {&lt;br /&gt;
      printf(&amp;quot;Main: creating thread %ld\n&amp;quot;, t);&lt;br /&gt;
      rc = pthread_create(&amp;amp;thread[t], &amp;amp;attr, BusyWork, (void *)t); &lt;br /&gt;
       if (rc) {&lt;br /&gt;
          printf(&amp;quot;ERROR; return code from pthread_create() is %d\n&amp;quot;, rc);&lt;br /&gt;
          exit(-1);&lt;br /&gt;
       }&lt;br /&gt;
      }&lt;br /&gt;
 &lt;br /&gt;
   /* Free attribute and wait for the other threads */&lt;br /&gt;
   pthread_attr_destroy(&amp;amp;attr);&lt;br /&gt;
   for(t=0; t&amp;lt;NUM_THREADS; t++) {&lt;br /&gt;
      rc = pthread_join(thread[t], &amp;amp;status);&lt;br /&gt;
       if (rc) {&lt;br /&gt;
          printf(&amp;quot;ERROR; return code from pthread_join() is %d\n&amp;quot;, rc);&lt;br /&gt;
          exit(-1);&lt;br /&gt;
       }&lt;br /&gt;
       printf(&amp;quot;Main: completed join with thread %ld having a status of %ld\n&amp;quot;, t, (long)status);&lt;br /&gt;
      }&lt;br /&gt;
 &lt;br /&gt;
 printf(&amp;quot;Main: program completed. Exiting.\n&amp;quot;);&lt;br /&gt;
 pthread_exit(NULL);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
This example demonstrates how to &amp;quot;wait&amp;quot; for thread completions by using the Pthread join routine. Since some implementations of Pthreads may not create threads in a joinable state, the threads in this example are explicitly created in a joinable state so that they can be joined later.&lt;br /&gt;
&lt;br /&gt;
====DOPIPE====&lt;br /&gt;
There are examples of using POSIX Threads to implement DOPIPE parallelism, but they are unnecessarily complex. Because of its length, we do not reproduce one here; the interested reader can find one at [http://homepage.mac.com/dbutenhof/Threads/code/pipe.c Pthreads DOPIPE example].&lt;br /&gt;
&lt;br /&gt;
===Comparison among the three===&lt;br /&gt;
&lt;br /&gt;
====A unified example====&lt;br /&gt;
&lt;br /&gt;
We use a simple parallel example from [http://sourceforge.net Sourceforge.net] to show how it would be implemented in the three packages, namely POSIX Threads, Intel TBB, and OpenMP, and to highlight some commonalities and differences among them.&lt;br /&gt;
&lt;br /&gt;
Following is the original code:&lt;br /&gt;
&lt;br /&gt;
 Grid1 *g = new Grid1(0, n+1);&lt;br /&gt;
 Grid1IteratorSub it(1, n, g);&lt;br /&gt;
 DistArray x(g), y(g);&lt;br /&gt;
 ...&lt;br /&gt;
 float e = 0;&lt;br /&gt;
 ForEach(int i, it,&lt;br /&gt;
    x(i) += ( y(i+1) + y(i-1) )*.5;&lt;br /&gt;
    e += sqr( y(i) ); )&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
Then we are going to show the implementations in different packages, and also make a brief summary of the three packages.&lt;br /&gt;
&lt;br /&gt;
=====In POSIX Thread=====&lt;br /&gt;
&lt;br /&gt;
POSIX Threads: Targets symmetric multiprocessing platforms, e.g. SMP multiprocessor computers, multi-core processors, and virtual shared memory computers.&lt;br /&gt;
&lt;br /&gt;
Data layout: A single global memory. Each thread reads global shared data and writes to a private fraction of global data.&lt;br /&gt;
&lt;br /&gt;
A simplified translation of the example parallel-for loop is given below.&lt;br /&gt;
&lt;br /&gt;
Global declaration:&lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 float *x, *y;&lt;br /&gt;
 float vec[8];&lt;br /&gt;
 int nn, pp;&lt;br /&gt;
&lt;br /&gt;
thread code:&lt;br /&gt;
&lt;br /&gt;
 void *sub1(void *arg) {&lt;br /&gt;
    int p = (int)arg;&lt;br /&gt;
    float e_local = 0;&lt;br /&gt;
    for (int i=1+(nn*p)/pp; i&amp;lt;1+(nn*(p+1))/pp; ++i) {&lt;br /&gt;
      x[i] += ( y[i+1] + y[i-1] )*.5;&lt;br /&gt;
      e_local += y[i] * y[i];&lt;br /&gt;
    }&lt;br /&gt;
    vec[p] = e_local;&lt;br /&gt;
    return (void*) 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
main code:&lt;br /&gt;
&lt;br /&gt;
 x = new float[n+1];&lt;br /&gt;
 y = new float[n+1];&lt;br /&gt;
 ...&lt;br /&gt;
 float e = 0;&lt;br /&gt;
 int p_threads = 8;&lt;br /&gt;
 nn = n-1;&lt;br /&gt;
 pp = p_threads;&lt;br /&gt;
 pthread_t threads[8];&lt;br /&gt;
 pthread_attr_t attr;&lt;br /&gt;
 pthread_attr_init(&amp;amp;attr);&lt;br /&gt;
 for (int p=0; p&amp;lt;p_threads; ++p)&lt;br /&gt;
    pthread_create(&amp;amp;threads[p], &amp;amp;attr,&lt;br /&gt;
      sub1, (void *)p);&lt;br /&gt;
 for (int p=0; p&amp;lt;p_threads; ++p) {&lt;br /&gt;
    pthread_join(threads[p], NULL);&lt;br /&gt;
    e += vec[p];&lt;br /&gt;
 }&lt;br /&gt;
 ...&lt;br /&gt;
 delete[] x;&lt;br /&gt;
 delete[] y;&lt;br /&gt;
&lt;br /&gt;
=====In Intel Threading Building Blocks=====&lt;br /&gt;
&lt;br /&gt;
Intel TBB: A C++ library for thread programming, targeting the same class of platforms: SMP multiprocessor computers, multi-core processors, and virtual shared memory computers.&lt;br /&gt;
&lt;br /&gt;
Data layout: A single global memory. Each thread reads global shared data and writes to a private fraction of global data.&lt;br /&gt;
&lt;br /&gt;
Translation of the example parallel-for loop is given below.&lt;br /&gt;
&lt;br /&gt;
Global:&lt;br /&gt;
 #include &amp;quot;tbb/task_scheduler_init.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/blocked_range.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/parallel_reduce.h&amp;quot;&lt;br /&gt;
 #include &amp;quot;tbb/cache_aligned_allocator.h&amp;quot;&lt;br /&gt;
 using namespace tbb;&lt;br /&gt;
&lt;br /&gt;
thread code:&lt;br /&gt;
 struct sub1 {&lt;br /&gt;
    float ee;&lt;br /&gt;
    float *x, *y;&lt;br /&gt;
    sub1(float *xx, float *yy) : ee(0), x(xx), y(yy) {}&lt;br /&gt;
    sub1(sub1&amp;amp; s, split) { ee = 0; x = s.x; y = s.y; }&lt;br /&gt;
    void operator() (const blocked_range&amp;lt;int&amp;gt; &amp;amp; r){&lt;br /&gt;
      float e = ee;&lt;br /&gt;
      for (int i = r.begin(); i!= r.end(); ++i) {&lt;br /&gt;
        x[i] += ( y[i+1] + y[i-1] )*.5;&lt;br /&gt;
        e += y[i] * y[i];&lt;br /&gt;
      }&lt;br /&gt;
      ee = e;&lt;br /&gt;
    }&lt;br /&gt;
    void join(sub1&amp;amp; s) { ee += s.ee; }&lt;br /&gt;
 };&lt;br /&gt;
&lt;br /&gt;
main code:&lt;br /&gt;
 task_scheduler_init init;&lt;br /&gt;
 ...&lt;br /&gt;
 float e;&lt;br /&gt;
 float *x = cache_aligned_allocator&amp;lt;float&amp;gt;().allocate(n+1);&lt;br /&gt;
 float *y = cache_aligned_allocator&amp;lt;float&amp;gt;().allocate(n+1);&lt;br /&gt;
 ...&lt;br /&gt;
 sub1 s(x, y);&lt;br /&gt;
 parallel_reduce(blocked_range&amp;lt;int&amp;gt;(1, n, 1000), s);&lt;br /&gt;
 e = s.ee;&lt;br /&gt;
 ...&lt;br /&gt;
 cache_aligned_allocator&amp;lt;float&amp;gt;().deallocate(x, n+1);&lt;br /&gt;
 cache_aligned_allocator&amp;lt;float&amp;gt;().deallocate(y, n+1);&lt;br /&gt;
&lt;br /&gt;
=====In OpenMP shared memory parallel code annotations=====&lt;br /&gt;
&lt;br /&gt;
OpenMP: Compiler-directive-based parallelization, with a run-time system usually built on a thread library.&lt;br /&gt;
&lt;br /&gt;
A simplified translation of the example parallel-for loop is given below.&lt;br /&gt;
&lt;br /&gt;
Global:&lt;br /&gt;
 float e;&lt;br /&gt;
&lt;br /&gt;
main code:&lt;br /&gt;
 float *x = new float[n+1];&lt;br /&gt;
 float *y = new float[n+1];&lt;br /&gt;
 ...&lt;br /&gt;
 e = 0;&lt;br /&gt;
 #pragma omp parallel for reduction(+:e)&lt;br /&gt;
 for (int i=1; i&amp;lt;n; ++i) {&lt;br /&gt;
    x[i] += ( y[i+1] + y[i-1] )*.5;&lt;br /&gt;
    e += y[i] * y[i];&lt;br /&gt;
 }&lt;br /&gt;
 ...&lt;br /&gt;
 delete[] x;&lt;br /&gt;
 delete[] y;&lt;br /&gt;
&lt;br /&gt;
====Summary: Difference among them====&lt;br /&gt;
&lt;br /&gt;
*Pthreads supports all the forms of parallelism and can express functional parallelism easily, but it requires building specialized synchronization primitives and explicitly privatizing variables, so more effort is needed to convert a serial program into a parallel one. &lt;br /&gt;
&lt;br /&gt;
*OpenMP provides many performance-enhancing features, such as the atomic, barrier and flush synchronization primitives. It is very simple to use OpenMP to exploit DOALL parallelism, but the syntax for expressing functional parallelism is awkward. &lt;br /&gt;
&lt;br /&gt;
*Intel TBB relies on generic programming and performs better with custom iteration spaces or complex reduction operations. It also provides generic parallel patterns for parallel while-loops, data-flow pipeline models, parallel sorts and prefixes, so it is the better choice in cases that go beyond loop-based parallelism.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;ol&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://openmp.org/wp/about-openmp/ OpenMP.org]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[https://docs.google.com/viewer?a=v&amp;amp;pid=gmail&amp;amp;attid=0.1&amp;amp;thid=126f8a391c11262c&amp;amp;mt=application%2Fpdf&amp;amp;url=https%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3D2%26ik%3Dd38b56c94f%26view%3Datt%26th%3D126f8a391c11262c%26attid%3D0.1%26disp%3Dattd%26realattid%3Df_g602ojwk0%26zw&amp;amp;sig=AHIEtbTeQDhK98IswmnVSfrPBMfmPLH5Nw An Optimal Abtraction Model for Hardware Multithreading in Modern Processor Architectures]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.threadingbuildingblocks.org/uploads/81/91/Latest%20Open%20Source%20Documentation/Reference.pdf Intel Threading Building Blocks 2.2 for Open Source Reference Manual]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.csc.ncsu.edu/faculty/efg/506/s10/ NCSU CSC 506 Parallel Computing Systems]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://parallel-for.sourceforge.net/tbb.html Sourceforge.net]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[https://computing.llnl.gov/tutorials/openMP/ OpenMP]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://www.computer.org/portal/web/csdl/doi/10.1109/SNPD.2009.16 Barrier Optimization for OpenMP Program]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://cs.anu.edu.au/~Alistair.Rendell/sc02/module3.pdf Performance Programming: Theory, Practice and Case Studies]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://software.intel.com/en-us/articles/intel-threading-building-blocks-openmp-or-native-threads/ Intel® Threading Building Blocks, OpenMP, or native threads?]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[https://computing.llnl.gov/tutorials/pthreads/#Joining POSIX Threads Programming by Blaise Barney, Lawrence Livermore National Laboratory]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;li&amp;gt;[http://homepage.mac.com/dbutenhof/Threads/source.html Programing with POSIX Threads source code]&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ol&amp;gt;&lt;/div&gt;</summary>
		<author><name>Akrepask</name></author>
	</entry>
</feed>