CSC/ECE 506 Spring 2012/preface
This preface intends to browse through the wiki pages done over the Spring 11 and Spring 12 CSC/ECE 506 courses and, taking the Solihin's FUNDAMENTALS OF PARALLEL COMPUTER ARCHITECTURE<ref>FUNDAMENTALS OF PARALLEL COMPUTER ARCHITECTURE by Solihin</ref> as a base and perusing the book Table of Content <ref>FUNDAMENTALS OF PARALLEL COMPUTER ARCHITECTURE TOC</ref>, travel through the study topics and summing them up, picking up the best approaches and comments from the pages.
Chapter 1: Perspectives <ref>Spring_2012/1a_hv</ref><ref>Spring_2012/1a_mw</ref> <ref>Spring_2012/1a_ry</ref> <ref>Spring_2012/1b_ap</ref> <ref>Spring_2012/1b_as </ref> <ref> Spring_2012/1b_ps</ref> <ref>Spring_2012/1c_12</ref> <ref> Spring_2012/1c_cl </ref> <ref>Spring_2012/1c_dm </ref>
Supercomputers Evolution
This chapter includes pages on three different subjects: supercomputer comparisons, Moore's Law and MISD architectures.On the first topic, Supercomputers Evolution, out of the 3 pages both [5] and [6] are very similar, probably because they come from the same source ([6] actually specify what their additions were to the Spring 11 original), and are both remarkable. They start analyzing the evolution of the supercomputer (Eniac, Cray, Japanese computers, IBM...) to pass to their different designs and architectures ([5] adds here a section of the reasons for the decay of the vector-based machines). Chapter 7 on [6] is an extensive excursion over the operative systems used in these platforms, ways of interconnecting them ([5] also introduces Infiniband as an alternative to Ethernet) and includes some comparisons between brands and different architectures. Next more current fast supercomputers implementations are presented, together with programming models and tools, like Fortran, C, PVM or OpenMP. Finally Consumption and cooling techniques are presented, followed by a discussion on trends and the future of these machines.
Does Moore's Law still hold?
The second set of wikis try to briefly answer the Does Moore's Law still hold? question. [7] and [8] are also very similar in their approach and both pages introduce the Law and its proposer, a historical of its evolution through the years; and also a list of computers with their number of transistors as a proof of the Law and its particularities, including that it refers to the number of transistors, not performance. Then they approach the future of the Law, and [7] includes some future alternatives to the transistor, like memristors, quantum computing and ballistic deflection transistors.
MISD architectures
The third set is about MISD architectures and the wiki pages go into describing the Flynn's Taxonomy in general and the MISD machines in particular. [9], after introducing the MISD architecture and giving examples (or explaining the lack of them, as theoretically they don't exist), makes a good description of systolic arrays and its applications, describing, for instance, discrete and fast Fourier transforms implementations. It also discusses whether the systolic arrays are really MISD machines or not, as the concept is a matter of polemic.
Chapter 2: Parallel Programming Models <ref>Spring_2012/2a_bm </ref> <ref> Spring_2012/2a_va </ref><ref> Spring_2012/ch2b_cm </ref><ref>Spring_2011/ch2_cl </ref><ref> Spring_2011/ch2_dm</ref><ref>Spring_2011/ch2_JR </ref><ref> Spring_2011/ch2a_mc </ref>
Data Parallel models
This chapter is also divided into three different topics. The first one describes Data Parallel models in general and the wiki pages have been done in the Spring of 2011. Out of the three pages on the subject, [16] does a good job describing the model and its parts, decomposition and orchestration; but [15] treats the Task Parallel model better, comparing them even graphically; and offers also a history of the evolution of these techniques.
Coming into the Spring 2012 wiki pages, we have two topics. The first one is Shared Address Space (SAS) programming on distributed-memory machines, with [12] going first through the origins and background and introducing the two main issues on shared memory: coherence and synchronization. It passes then to talk about Distributed Shared Memory (DSM) and cache coherence, page management and memory mapping in MOME, and also node communications and programming environments, with descriptions of how the locks and barriers are introduced to ensure synchronization. It follows with some DSM implementation example of implementations and algorithms, introducing consistency models. Finally, it describes some evaluation models to measure performance (SPLASH) and some study cases, to then close the page commenting of the perspective for the evolution of the topic.
Data Parallelism in GPUs
The last topic in this chapter is Data Parallelism in GPUs. The pages [14] and [18] treat the subject in a very similar way, getting into the diatribes of the massive parallel computer that take places in graphic processors and how are they more focused on data processing, versus control, as other kind of processor. [18] opens openCL and programming models, talking about the platforms, execution and memory models and programming approaches. Then it discusses two implementation from AMD, the Radeon HD5000 and HD6000 series and NVidia GeForce 4000s. Compute Unified Device Architecture (CUDA) is then introduced and described, focusing in threads and programming. [14] is more focused in the different NVidea processor implementations and dedicates more space to stream processors and CUDA architecture, programming and examples.
Types of Parallelism
In this chapter we have also three topics. In the first one, Types of Parallelism, the authors of [23] define the concepts of DOALL, DOACCROSS, DOPIPE, Functional Parallelism and Reduction. Following they introduce synchronization and its mechanisms and describe how they are applied in several architectural examples, that is, IA-64, IA-32, Linux Kernel, CUDA, Power-PC and Cell Broadband Engines (CEB); talking about what techniques every of these examples uses to maintain that synchronization (spin-locks, barriers...).
Patterns of Parallel Programming
Next topic in this chapter is Patterns of Parallel Programming where we have a couple of different approaches: [20] tries first to make classifications and presents three different ones, Massingeli's, Keutzer and Mattson's and Siu et al's. Then it divides the list of patterns at four different levels, that is, Decomposition, Algorithm, Implementation and Parallel Executions; and describes common routines and algorithms that are used at every level, with some examples. It finishes with a glossary of all of the patterns that could be useful for a fast search. [21] on the other hand, focuses on the Algorithm level and, after introducing the need to have these patterns as building blocks to get the best performance and repeatability out of our systems, it divides the patterns into two groups, Functional Parallelism Patterns, like the Embarrassingly Parallel Pattern or Divide and conquer; Inseparable Patterns and Data Parallelism Patterns, which include Replicable Patterns, Repository Patterns, Recursive Data Pattern, Geometric and Irregular Mesh Pattern and Pipeline Pattern. To finish, it talks about the limitations of the Parallel Patterns, and does a comparison amongst them.
Map Reduce
The last topic in this chapter is Map Reduce, of which we only have one wiki page. This page introduces the software framework as a Google development and proceeds to describe the programming model, after which it gives some examples and example code. Then it shows three different implementations of the technique: the very own Google's MapReduce, with its Master Data Structures and Fault Tolerance, as the libraries can have enormous amount of data and have to fall gracefully. Following the Phoenix implementation for Shared Memory Systems and a description of its API and buffer management. Finally it describes an implementation in graphics processors with the Mars API and implementation and optimization details.
Data-Parallel Algorithm Implementations
The first three wiki pages in these chapter, [25], [26] and [27], are from Spring 11 and have three different topics, all related to data-parallel algorithm implementations, hence being a weak fit for the subjects discussed in this section. The first one, [25], takes on the three models of the Gaussian Elimination implementation, developed in High Performance Fortran, OpenMP and MPI. The second, [26], digs into the Serial Simplex Method (Neader-Mead function minimization), proposing pseudo code and then ways of performing task and a data parallel implementations of the algorithm. It also provides a shared memory and a messaging-pass implementation. Finally, [27]is the most ethereal of all three, describing an N-body problem as a result of observations in situations found in Astrophysics and Mathematics. For solving the problem in parallel, the authors introduce several approaches, like the Barnes-Hut Tree and the Orthogonal Recursive Bisection. Then they nicely describe data-parallel, shared-memory and message-passing implementations.
Automatic Parallelism And Its Limitations
This who writes could not find any wiki pages on the topic of Automatic parallelism and its limitations. As interesting as it could be, nobody seems to have been attracted by the subject of finding ways of automatizing the act of parallelization of a sequential process and the description of the compilers that could do that task to make the programmer's one easier.
The Limits To Speedup
The final subdivision of this chapters talks about The Limits To Speedup and there is just one wiki from Spring 2012 taking the topic [24]. The page starts mentioning the Amdah's Law and how it leads to the Gustafson's Law, which is explained, and its derivation; and, after a brief mention to Scaled Speedup, it goes into its evolutions and some examples. It then talks about the Gordon Bell Prize on parallel computing research, of which Gustafson is a recipient; and that leads to the definition of Superlinear Speed, its controversy and the reasons behind it and the reasons why we may not be able to reach it.
Chapter 5:Parallel Programming for Linked Data Structures <ref>Spring_2012/ch5a_ja</ref><ref>Spring_2011/ch5_LL</ref>
Parallel Programming For Link Data Structures
In this chapter we have two wiki pages. The first one, [29]from Spring 11, talks about Parallel Programming For Link Data Structures. The page moves to describe parallel algorithms for working with them, focusing on thread-safe accesses based on locks with better parallel performance, like fine grained or read-write locks. It starts categorizing non-blocking algorithms, their motivations and their drawbacks, going into the building blocks that make them and the atomic instructions, like CAS, that can be used. Then it describes a lock-free singly linked list implementation, with examples of insert and remove functions, hazard pointers and other solutions. In a twist of topic, the page goes to describe memory models and barrier instructions for shared-memory systems; talking about the X86 memory model, barrier instructions and skip lists.
Other Linked Data Structures
This year 2012 wiki page concentrates in Other Linked Data Structures. It first introduces Linked-list Parallel Programming, talking about general techniques for parallelization, like copy-scan or pointer doubling that could be used. Next it describes some of the other data structures, like trees, hash tables and graphs; looking for opportunities for parallelization, like trees traversals, shared hash-maps and graphs traversals; showing code examples to compare the serial and parallel solutions.
Chapter 6: Introduction to Memory Hierarchy Organization <ref>Spring_2011/6a_ms</ref><ref>Spring_2012/6b_am</ref><ref>Spring_2012/6b_pa</ref><ref>Spring_2011/ch6a_ep</ref><ref>Spring_2011/ch6a_jp</ref><ref>Spring_2011/ch6b_ab</ref><ref>Spring_2011/ch6b_df</ref>
This section has also pages with three main topics. The first one, Write-Miss Policies and Prefetching is from Spring 11 and has three very similar in quality pages {30], [33] and [34]. [33] does a throughout description of recent processor architectures and their cache characteristics, going after that to describe cache write miss policies and their combinations, like write-validate, write-invalidate or write-around. [34] also includes the write-hit policies and a good section on prefetching, describing the advantages and disadvantages and how effective the solution is. It goes then to describe an example, Stream Buffer Prefetching, and then how prefetching is used in parallel computing, specifically in Shared Memory Systems.
The second topic covers Cache Addressing and Translation Lookaside Buffer (TLB)Coherence. We also have two pages, [35] does a good description on cache addressing, going through the combinations of virtual and physical index and tags caches can use for management, and its advantages and disadvantages. [36] focuses particularly on the TLB architecture, defining it and going through the models of cache addressing and how they affect its performance. Finally there is a short reference to cache coherence, a topic of another chapter.
The last topic in this chapter is Multiprocessor issues with write buffers, getting into the Spring 2012 wiki pages. Both [31] and {32] do a very extensive discussion on the topic. The two of them start by talking about the concerns that write buffers bring up in both uniprocessors and multiprocessors systems. [32] goes then to discuss the topic of software coherence through write policies, describing approaches and implemtations of the buffer and making a comparison on parameters like hit radios, memory latency or execution time. It passes afterwards to the topic of sequential coherence and atomic sequences to end with describing write buffers implementations, like the Universal Write Buffer, the Scalable Store Buffer and the Partial Block Invalidation coherence mechanisms. {31] on the other side presents sections on maintaining sequential consistency and overcoming buffer stalls, mentioning again Atomic Sequence Ordering and Scalable Store Buffers.
In the first section of this chapter, [37] and [38], from 2011, bring the topic of Cache Coherence and Memory Consistency in Share Memory Systems. [37] starts by talking about the advantages and disadvantages of the share memory architectures and the hardware support is needed to achieve them. Then, it describes in depth both the cache coherence and memory consistency problems, passing to talk about the Peterson's Algorithm and Page Tables.
In the second section TLB Coherence in Multiprocessing, we get into the Spring 2012 pages, [39] and [40]. [39] starts giving us a background on the subject, introducing the concepts of virtual memory, paging and defining TLBs. Following it describes the coherence problems that we find when we work in with TLBs in a multiprocessor environment, like swap-outs, protection changes or mapping changes. Next, it discusses approaches to the TLB coherence and its elements and presents some strategies to face the problem, that is, using Virtually Indexes Cache, which would eliminate TLBs altogether; TLB Shootdowns, a mixture of synchronization and messaging, with descriptions of methods and implementations; Hierarchical TLBs, where they are organized in levels, like cache; Instruction-based Validation; an approach based on an Address Space Identifier (ASID) and a Validation Based solution, where invalidations are postponed until memory is accessed. Finally, as an example, the page describes the Unified Cache and TLB coherence solution, the Unified Instruction/Translation/Data (UNITD) Coherence protocol by Romanescu et al; giving some background of the reasons that took to its development, and then thoroughly discussing its implementation and performance.
Chapter 8: Bus-Based Coherent Multiprocessors <ref>Spring_2011/ch8_cl</ref><ref>Spring_2011/ch8_mc</ref> <ref>Spring_2012/8a_cj</ref><ref>Spring_2012/8a_fu</ref> <ref>Spring_2012/8b_va</ref>
The first section in this chapter is from Spring 2011 and is titled Introduction to bus-based cache coherence in real machines. While [41] describes well all the coherence protocols, [42] makes a point to associate them to the architectures and vendors they were developed for. It first presents the common bus architecture in SMP systems and summarizing some of the more commons Cache Coherence Protocols, like MSI, MESI, MESIF or MOESI, their states and transitions; and discussing their performance. Following, it goes to describe several architectures, starting with MSI & SGI IRIS 4D Processors, then Synapse protocol and Synapse multiprocessors, MESI and Intel Processors, with a brief explanation of the Intel processor architecture and evolution; and the MOESI & AMD Processors, with the Opteron and Phenom example architectures, and other particularities, performance enhancements and optimizations of the AMD64 family of processors. It ends with Dragon Protocol & Xerox Dragon Processors to give pass to discussions on Cache Coherence Power Utilization and discussing possible power savings that can be obtained using Snoop Filtering, eliminating unnecessary snoops, with techniques like Serial Snooping and Jetty.
Now in Spring 2012, [43] and [44] develop a similar topic as [42], MSI, MESI, MESIF, and MOESI protocols on real architectures, and the wiki pages are very close to those of the previous year. [44] adds a chapter on MESI and ARM11 and Cortex-A9 MPCore, which are in fashion these days. It also presents a chapter on Instruction Prefetching and another in Word Invalidation.
Last section takes on 8b. Update and adaptive coherence protocols on real architectures, and power considerations. [45] introduces the coherence protocols, which take care of making sure that changes in a cache are propagated to the others. Main issue, they state, is that all this transient of information brings extra traffic to the bus and, hence, congestion. They define then the concept of coherence and the two possible approaches, write-update and write-invalidate protocols. The paper focused then in write-update only protocols, starting with the Xerox-Dragon protocol, which is explained; and then with the DEC Firefly protocol. After this it passes to describe the Adaptive Coherence Protocols, which, given the disadvantages of both write-update and write-invalidate, try to mix them in a hybrid scheme, which is here described, together with the hardware architecture needed ti support it. The Subblock Protocol, a snoopy-based adaptation of the Illinois MESI protocol, is also presented with a description of its mechanisms and advantages of this protocol versus more coarse ones. Finally an enhancement of this technique, the Read-snarfing protocol, is presented as an example, with coding proposals and studies over some particular situations; and some simulations results are also shown.
Chapter 9: Hardware Support for Synchronization <ref>Spring_2011/ch9_ms</ref><ref>Spring_2012/9a_mm</ref><ref>Spring_2012/9a_ms</ref><ref>Spring_2012/9a_cm</ref>
In this chapter we have two sections. The first one Synchronization, from Spring 2011, has just one wiki page, [49]. The page starts by introducing the concept and its parts, acquire, wait and release, and which of them are better implemented in hardware or in software. It then passes to talk about Lock Implementation, how to evaluate their performance -latency, traffic, fairness and storage- and compares several implementations -T&S, TT&S-, discussing ways of improving performance and discussing their drawbacks, like deadlocks or excessive use of resources. After locks, the page covers barrier implementations and describes several implementations and improvements, as Sense-Reversal barriers, Combining Tree barriers, Tournament Barriers or Disseminating Barriers.
The second section is very well cover, with three different wiki pages. The topic is Reducing locking overhead and it is about ways of reducing the load locks impose in the multi-process systems. {48] starts by talking about Synchronization in Java, discussing its memory models, sharing and locks and Monitors, also discussed thoroughly in [47]. After that, also in [47], problems with locks are discussed, one of them being the excessive use of system resources. The three pages, [47], [48] and [49] bring up Thin locks in a very similar way, describing the method and how they are implemented in Java objects, peeling lock unnecessary layers to make them more efficient. [48] nicely shows sections for the loop characteristics, the algorithm that they follow and examples, while [47] and [49] show code examples. Then sections about Biased Locks are shown, a evolution of the thin locks that favors a particular thread over others. It is treated in a similar way as Thin locks by the three pages. Finally [47] dedicates a section to the Rogers' lock, an implementation of the biased lock in Java.
Chapter 10: Memory Consistency Models<ref>Spring_2012/10a_dr</ref><ref>Spring_2012/10a_jp</ref><ref>Spring_2012/10a_vm</ref><ref>Spring_2012/10b_sr</ref><ref>Spring_2012/ch10_sj</ref><ref>Spring_2011/ch10_MC</ref><ref>Spring_2011/ch10_sb</ref><ref>Spring_2011/ch10_sj</ref><ref>Spring_2011/ch10a_dc</ref>
Chapter 10 is one of the most popular ones. The first of the Spring 2011 subjects is taken care of by [55] and [58], that get to talk about Java Memory Model (JMM) main aspects, motivation, and the use of concurrency-related building blocks. [55] describes then the JMM specifications and rules and concepts like the 'volatile' variable, 'piggybacking' on synchronization and the importance of doing a correct initialization; to then pass to talk about Double-Checked Locking and ending with a mention to the JDK and how to write safe Java code.
The second Spring 2011 topic Memory Consistency Models. [56] starts by introducing the Sequential Consistency Models and discusses their performance in multiprocessors and some hardware and software optimization techniques, like prefetching, speculative reads or the Shasha and Snir's algorithm. Then it passes to talk about Relaxed Consistency Models, and how we can make these models nimbler, stripping them of some of their parts and relaxing sequential consistency. After the concept is introduced, several models are listed, and real implementations, like like the IBM-370, IBM power PC or the DEC-Alpha ones are discussed. Then it moves into the Release Consistency Models, like the eager release or lazy release and compares and analyze their performance. Finally, a discussion about the comparison of all models, experiment results and some shortcomings of their use; end the page. Then it brings sequential consistence into the equation,
The Spring 2012 pages approach this former chapter a bit differently and the section 10a, Prefetching and Consistency Models, focuses more in Prefetching. [52] starts by introducing the concept and listing the types of prefetching techniques and how they work, guiding us with examples. Then it introduces sequential consistency into the equation and compares the results on working with combinations of both methods, explaining the reasons why these combinations have not been widely adopted. Following it repeats the exercise with using Release Consistency first and then Speculative Execution, lingering into the explanation of the former, ending with an practical example. It is worthy to note that [50] and [51] have both small sections about prefetching in multiprocessors.
The last section in this chapter is Use of consistency models in current multiprocessors and has just one page, [53]. The page opens introducing the topic and giving a background of current multiprocessor architectures, their trends and the challenges that they face, like scalability or latency. Then it goes into describing the an extensive list of consistency models that are being used by current machines, some of them already mentioned before in other pages. The list includes Address translation aware memory consistency, Causal consistency, Delta consistency, Entry consistency, Eventual consistency, Linearizability, PRAM consistency, Weak consistency and Strong consistency; with some of the methods coming with example code. Following, the page focuses in the performance impact, comparing several of those methods among them, reaching to the conclusion that weaker consistencies have a definite improvement in performance at the cost of complexity. Release consistency, Sequential consistency
The Spring 2011 wiki pages on this chapter treat the subject of Scalable Coherent Interface (SCI). [61] and [63] are very similar and go through the topic almost parallely. They start by setting up a background of what SCI means in multiprocessors and the reasons behind it. Then they pass to describe the states that the protocol uses and state diagrams with the protocol transitions, with [63] giving some examples and explaining the way the protocol work. Then both pages move into the topic of Race Conditions that this protocol shows due to early invalidations and how they can be prevented with various mechanisms; like atomic transactions, using the head node to carry on most of the coherence, or changing the directory structure. There are other possible race conditions, like concurrent deletion or simultaneous deletion and invalidation, that are presented with possible work-arounds.
The second topic in the chapter is 11a, Performance of DSM systems and both [59] and [60] are good pages. [59] introduces the concept of distributed shared memory (DSM), how it works and how it is used. Then it discusses why a bus cannot be used, due to the desired scalability and message-passing is preferred to maintain coherence. Then it goes into describing the hardware architecture, that is, the systems where this can be deployed and used, being those CC-NUMA, the most interesting for us; COMA and Reflective Memory systems. Then they move into the software architecture, with the Fine-Grained approach description, the Shasta protocol and their optimizations and the Cashmere protocol and the challenged the last two face. Following the page analyzes the performance of these protocols, the system overhead, load balancing, scheduling and the communication overhead. [60] also has a section on Parallel File Input/Output in Cohesion systems where the advantages of having the file management controlled by more than one node are discussed. A design, scheme and implementation are presented. A real example, the Cray X1, closes the page. .
Last part of this chapter, Improvements to directory-based cache-coherence animations are not wiki pages, but a compendium of Power Point animations on Full-Vector and SCCI.
Chapter 12: Interconnection Network Architecture <ref>Spring_2011/ch12</ref><ref>Spring_2011/ch12_aj</ref><ref>Spring_2011/ch12_ar</ref><ref>Spring_2011/ch12_ob</ref><ref>Spring_2012/12b_jh</ref><ref>Spring_2012/12b_sb</ref>
There are two groups of wiki pages in this chapter. The first one is New Interconnection Topologies and all but one of the pages can be included in this group.The pages are really all very similar, using even the same pictures, and it results very difficult to lean towards a particular one. [66] looks like the most complete, starting by describing the concept and the parameters that we would be using to evaluate, like latency or bandwidth; and characterize, like diameter, bisection bandwidth , number of links and degree. Then introduces known topologies, like the Linear Array, the Ring, the Cube, Trees and Butterfly; passing afterwards to more complex structures that have evolved from them, like the D-Mesh, the Hypercube, the D-Torus and the Fat Tree and, then, makes a comparison among them. Next it moves into real implementations and its abilities and limitation, including again the Fat Tree, the Butterfly, Messes and Tori and the Hypercube. The subject is changed to Routing algorithms, deterministic, adaptive or a combination of them; passing into the problems that routing faces, like Deadlock, and ways or getting over them, as circular routing patterns or turn restriction models. Afterwards, it defines the router architecture and its evolution over time. The page ends with a reference to Fault Tolerant routing, models and mechanisms.
The next and last section is On-chip Interconnects. Though [68] makes a mention to it, [69] is the one that really develops the topic. It starts setting a background by saying how the grow in performance in the most recent architectures have shifted from faster single core processors to the many cores per die, the chip multiprocessor (CMP), that are popular nowadays. It gets into the topologies that are used in those architectures and the parameters that characterize them; passing to explain in more depth some of the most common, like 2-D Mesh, Concentration Mesh, Flattened Butterfly and Multidrop Express Channels (MECS), and comparing each other performance. Next it points out some commercial implementations by Intel, Tilera, ST Microelectronics and IBM. Last, it moves into Routing in a slightly different way that [66], defining first the General Routing Schemes, like store and forward, cut-through and deterministic or adaptive routing and then the dangers, like deadlocks and livelocks. Specific routing protocols implementations for SoC and then listed, that is, Source or Distributed Routing, Logic Based Distribute Routing (LBDR), Bufferless Deflection Routing (BLESS protocol), CHIPPER (Cheap-Interconnect Partially Permuting Router) and Dimension-order Routing. The page ends with current lines of research, like optical on-chip interconnects and Reconfigurable or Bio Network on Chip (NoC)
References
<references/>
Notes To This Preface
The commented wiki pages have been taken from the Expertiza repository for the course CSC506, classes of Spring 2011 and Spring 2012. Some of the links have been obviated because they were empty, repeated or invalid.