CSC/ECE 506 Spring 2012/preface
This preface browses through the wiki pages written during the Spring 2011 and Spring 2012 offerings of CSC/ECE 506. Taking Solihin's FUNDAMENTALS OF PARALLEL COMPUTER ARCHITECTURE<ref>FUNDAMENTALS OF PARALLEL COMPUTER ARCHITECTURE by Solihin</ref> as a base and following the book's table of contents<ref>FUNDAMENTALS OF PARALLEL COMPUTER ARCHITECTURE TOC</ref>, it travels through the study topics, summarizes them, and picks out the best approaches.
Chapter 1: Perspectives <ref>Spring_2012/1a_hv</ref><ref>Spring_2012/1a_mw</ref> <ref>Spring_2012/1a_ry</ref> <ref>Spring_2012/1b_ap</ref> <ref>Spring_2012/1b_as </ref> <ref> Spring_2012/1b_ps</ref> <ref>Spring_2012/1c_12</ref> <ref> Spring_2012/1c_cl </ref> <ref>Spring_2012/1c_dm </ref>
This chapter includes pages on three different subjects: supercomputer comparisons, Moore's Law, and MISD architectures. On the first topic, the evolution of supercomputers, two of the three pages, [5] and [6], are very similar, probably because they come from the same source ([6] actually specifies what it added to the Spring 2011 original), and both are remarkable. They start by analyzing the evolution of the supercomputer (ENIAC, Cray, the Japanese machines, IBM...) and then move to the different designs and architectures ([5] adds a section on the reasons for the decline of vector-based machines). Chapter 7 of [6] is an extensive excursion through the operating systems used on these platforms and the ways of interconnecting them ([5] also introduces InfiniBand as an alternative to Ethernet), and it includes comparisons between brands and architectures. Next, the faster, more current supercomputer implementations are presented, together with programming models and tools such as Fortran, C, PVM, and OpenMP. Finally, power consumption and cooling techniques are presented, followed by a discussion of trends and the future of these machines.
The second set of wikis tries to briefly answer the question "Does Moore's Law still hold?". [7] and [8] are also very similar in their approach: both pages introduce the Law and its proposer, give a history of its evolution through the years, and include a list of computers with their transistor counts as evidence for the Law and its particularities, including the fact that it refers to the number of transistors, not performance. They then consider the future of the Law, and [7] includes some future alternatives to the transistor, such as memristors, quantum computing, and ballistic deflection transistors.
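As a quick aside (an illustration of my own, not taken from [7] or [8]), the doubling behavior both pages describe can be written as a simple formula, where <math>N_0</math> is the transistor count in a reference year <math>t_0</math> and <math>T</math> is the doubling period, commonly quoted as about two years:

<math>N(t) = N_0 \cdot 2^{(t - t_0)/T}</math>

For example, starting from the roughly 2,300 transistors of the Intel 4004 in 1971 and taking <math>T = 2</math> years, the formula predicts on the order of <math>2{,}300 \cdot 2^{20} \approx 2.4</math> billion transistors by 2011, which matches the order of magnitude of the largest processors of that era.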
The third set is about MISD architectures, and the wiki pages describe Flynn's taxonomy in general and MISD machines in particular. [9], after introducing the MISD architecture and giving examples (or explaining the lack of them, as they arguably do not exist), gives a good description of systolic arrays and their applications, describing, for instance, implementations of the discrete and fast Fourier transforms. It also discusses whether systolic arrays are really MISD machines or not, as the question is a matter of debate.
Chapter 2: Parallel Programming Models <ref>Spring_2012/2a_bm </ref> <ref> Spring_2012/2a_va </ref><ref> Spring_2012/ch2b_cm </ref><ref>Spring_2011/ch2_cl </ref><ref> Spring_2011/ch2_dm</ref><ref>Spring_2011/ch2_JR </ref><ref> Spring_2011/ch2a_mc </ref>
This chapter is also divided into three different topics. The first one describes Data Parallel models in general, and its wiki pages were written in Spring 2011. Of the three pages on the subject, [16] does a good job describing the model and its parts, decomposition and orchestration; but [15] treats the Task Parallel model better, even comparing the two graphically, and also offers a history of the evolution of these techniques.
Coming to the Spring 2012 wiki pages, we have two topics. The first one is Shared Address Space (SAS) programming on distributed-memory machines. [12] first goes through the origins and background and introduces the two main issues in shared memory: coherence and synchronization. It then discusses Distributed Shared Memory (DSM) and cache coherence, page management and memory mapping in MOME, and also node communication and programming environments, with descriptions of how locks and barriers are introduced to ensure synchronization. It follows with some example DSM implementations and algorithms, introducing consistency models. Finally, it describes some evaluation methods for measuring performance (SPLASH) and some case studies, and closes by commenting on the outlook for the evolution of the topic.
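As a small illustrative sketch (my own, not taken from [12]), the two synchronization primitives the page discusses can be shown with POSIX threads: a lock protects a shared counter, and a barrier keeps the threads in step between phases.

<pre>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t barrier;

static void *worker(void *arg) {
    (void)arg;
    /* Phase 1: every thread updates shared state under the lock. */
    pthread_mutex_lock(&lock);
    counter++;
    pthread_mutex_unlock(&lock);

    /* Barrier: nobody starts phase 2 until all updates are done. */
    pthread_barrier_wait(&barrier);

    /* Phase 2: shared state can now be read consistently. */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* always NTHREADS */
    pthread_barrier_destroy(&barrier);
    return 0;
}
</pre>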
The last topic in this chapter is data parallelism in GPUs. Pages [14] and [18] treat the subject in a very similar way, getting into the details of the massive parallelism that takes place in graphics processors and how they focus more on data processing than on control, compared with other kinds of processors. [18] opens with OpenCL and its programming models, talking about the platform, execution, and memory models and the programming approaches. Then it discusses two implementations from AMD, the Radeon HD 5000 and HD 6000 series, and the NVIDIA GeForce 4000 series. The Compute Unified Device Architecture (CUDA) is then introduced and described, with a focus on threads and programming. [14] is more focused on the different NVIDIA processor implementations and dedicates more space to stream processors and the CUDA architecture, its programming, and examples.
In this chapter we also have three topics. In the first one, Types of Parallelism, the authors of [23] define the concepts of DOALL, DOACROSS, DOPIPE, functional parallelism, and reduction. Next, they introduce synchronization and its mechanisms and describe how they are applied in several architectural examples, namely IA-64, IA-32, the Linux kernel, CUDA, PowerPC, and the Cell Broadband Engine (CBE), discussing which techniques each of these examples uses to maintain synchronization (spin-locks, barriers...).
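As a brief sketch of my own (not from [23]), the three loop-level forms can be told apart by their dependence patterns; the comments note where parallelism is available.

<pre>
#include <stdio.h>
#define N 16

int main(void) {
    double a[N + 1], b[N + 1], c[N + 1];
    for (int i = 0; i <= N; i++) { a[i] = i; b[i] = 1.0; c[i] = 0.0; }

    /* DOALL: no loop-carried dependence, every iteration is independent,
     * so all iterations could run in parallel. */
    for (int i = 1; i <= N; i++)
        c[i] = a[i] + b[i];

    /* DOACROSS: a[i] depends on a[i-1], so each iteration must hand a value
     * to the next one; only the independent part (the b[i]*c[i] work)
     * can overlap across iterations. */
    for (int i = 1; i <= N; i++)
        a[i] = a[i - 1] + b[i] * c[i];

    /* DOPIPE: the two statements form a pipeline; one thread could run
     * stage 1 for all i while another trails behind running stage 2,
     * synchronizing between the stages. */
    for (int i = 1; i <= N; i++) {
        a[i] = a[i - 1] + b[i];   /* stage 1 */
        c[i] = c[i] + a[i];       /* stage 2 consumes stage 1's a[i] */
    }

    printf("a[N]=%g c[N]=%g\n", a[N], c[N]);
    return 0;
}
</pre>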
The next topic in this chapter is Patterns of Parallel Programming, where we have a couple of different approaches. [20] first presents three different classifications: Massingill's, Keutzer and Mattson's, and Siu et al.'s. Then it divides the list of patterns into four different levels, namely Decomposition, Algorithm, Implementation, and Parallel Execution, and describes common routines and algorithms used at each level, with some examples. It finishes with a glossary of all the patterns, which is useful for a quick search. [21], on the other hand, focuses on the Algorithm level. After introducing the need for these patterns as building blocks to get the best performance and repeatability out of our systems, it divides the patterns into groups: Functional Parallelism Patterns, like the Embarrassingly Parallel and Divide and Conquer patterns; Inseparable Patterns; and Data Parallelism Patterns, which include the Replicable, Repository, Recursive Data, Geometric, Irregular Mesh, and Pipeline patterns. To finish, it talks about the limitations of parallel patterns and compares them.
The last topic in this chapter is MapReduce, for which we have only one wiki page. This page introduces the software framework as a Google development and proceeds to describe the programming model, after which it gives some examples and example code. Then it shows three different implementations of the technique: first Google's own MapReduce, with its master data structures and fault tolerance, since the library may process enormous amounts of data and has to fail gracefully; next the Phoenix implementation for shared-memory systems, with a description of its API and buffer management; and finally an implementation on graphics processors with the Mars API, including implementation and optimization details.
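A toy sketch of my own (not the page's example code, and written sequentially) shows the user-visible side of the programming model: the programmer supplies a map function that emits (key, value) pairs and a reduce function that aggregates the values for one key, while the framework handles partitioning, parallel execution, and failures.

<pre>
#include <stdio.h>
#include <string.h>

#define MAX_PAIRS 256
#define MAX_WORD  32

struct pair { char key[MAX_WORD]; int value; };
static struct pair pairs[MAX_PAIRS];
static int npairs = 0;

/* emit() is the hook the framework would give to user map functions. */
static void emit(const char *key, int value) {
    if (npairs >= MAX_PAIRS) return;
    strncpy(pairs[npairs].key, key, MAX_WORD - 1);
    pairs[npairs].key[MAX_WORD - 1] = '\0';
    pairs[npairs].value = value;
    npairs++;
}

/* User-supplied map: split a line into words and emit (word, 1). */
static void map(char *line) {
    for (char *w = strtok(line, " \n"); w; w = strtok(NULL, " \n"))
        emit(w, 1);
}

/* User-supplied reduce: sum all values emitted for one key. */
static int reduce(const char *key) {
    int sum = 0;
    for (int i = 0; i < npairs; i++)
        if (strcmp(pairs[i].key, key) == 0)
            sum += pairs[i].value;
    return sum;
}

int main(void) {
    char doc[] = "the quick fox and the lazy dog and the cat";
    map(doc);

    /* "Shuffle" step, done naively: reduce each distinct key once. */
    for (int i = 0; i < npairs; i++) {
        int seen = 0;
        for (int j = 0; j < i; j++)
            if (strcmp(pairs[j].key, pairs[i].key) == 0) seen = 1;
        if (!seen)
            printf("%s: %d\n", pairs[i].key, reduce(pairs[i].key));
    }
    return 0;
}
</pre>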
The first three wiki pages in this chapter, [25], [26], and [27], are from Spring 2011 and cover three different topics, all related to data-parallel algorithm implementations, hence a weak fit for the subjects discussed in this section. The first one, [25], takes on three models of a Gaussian elimination implementation, developed in High Performance Fortran, OpenMP, and MPI. The second, [26], digs into the serial simplex method (Nelder-Mead function minimization), proposing pseudocode and then ways of building task-parallel and data-parallel implementations of the algorithm. It also provides a shared-memory and a message-passing implementation. Finally, [27] is the most ethereal of the three, describing the N-body problem as it arises from situations found in astrophysics and mathematics. For solving the problem in parallel, the authors introduce several approaches, like the Barnes-Hut tree and orthogonal recursive bisection. They then nicely describe data-parallel, shared-memory, and message-passing implementations.
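As an aside, a minimal sequential sketch (assumed, not taken from [25]) shows where the parallelism those pages exploit lives: for a fixed pivot, the updates of the remaining rows are mutually independent and can be divided among threads or MPI ranks.

<pre>
#include <stdio.h>
#define N 4

/* Plain Gaussian elimination (no pivoting, for brevity).
 * For a fixed pivot k, the loop over rows i is a DOALL loop:
 * each row update touches only row i, so rows can be assigned
 * to different threads (shared memory) or ranks (message passing). */
static void eliminate(double a[N][N + 1]) {
    for (int k = 0; k < N; k++)
        for (int i = k + 1; i < N; i++) {        /* parallelizable */
            double factor = a[i][k] / a[k][k];
            for (int j = k; j <= N; j++)
                a[i][j] -= factor * a[k][j];
        }
}

int main(void) {
    double a[N][N + 1] = {
        {2, 1, 0, 0, 3},
        {1, 3, 1, 0, 5},
        {0, 1, 4, 1, 6},
        {0, 0, 1, 5, 7},
    };
    eliminate(a);
    printf("a[N-1][N-1] after elimination: %g\n", a[N - 1][N - 1]);
    return 0;
}
</pre>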
The present writer could not find any wiki pages on the topic of automatic parallelism and its limitations. As interesting as it could be, nobody seems to have been attracted to the subject of automating the parallelization of sequential code and describing the compilers that could take on that task to make the programmer's job easier.
The final subdivision of this chapter covers the limits to speedup, and there is just one wiki page, from Spring 2012, on the topic [24]. The page starts with Amdahl's Law and how it leads to Gustafson's Law, which is explained and derived; after a brief mention of scaled speedup, it goes into its evolution and some examples. It then talks about the Gordon Bell Prize for parallel computing research, of which Gustafson is a recipient, and that leads to the definition of superlinear speedup, its controversy, the reasons behind it, and the reasons why we may not be able to reach it.
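For reference (the notation here is mine, not necessarily the page's), the two laws can be written compactly, with <math>f</math> the serial fraction of the work and <math>N</math> the number of processors:

<math>S_{\mathrm{Amdahl}}(N) = \frac{1}{f + (1-f)/N}, \qquad S_{\mathrm{Gustafson}}(N) = f + (1-f)\,N = N - f\,(N-1)</math>

Amdahl's Law fixes the problem size, so the speedup is bounded by <math>1/f</math>; Gustafson's Law lets the parallel part grow with <math>N</math>, so the scaled speedup keeps growing. Superlinear speedup means exceeding <math>N</math>, which, when it does occur, is often attributed to effects such as the larger aggregate cache capacity of the parallel machine.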
Chapter 5:Parallel Programming for Linked Data Structures <ref>Spring_2012/ch5a_ja</ref><ref>Spring_2011/ch5_LL</ref>
In this chapter we have two wiki pages. The first one, [29], from Spring 2011, covers parallel programming for linked data structures. The page describes parallel algorithms for working with them, focusing on thread-safe accesses based on locks with better parallel performance, such as fine-grained or read-write locks. It then categorizes non-blocking algorithms, their motivations, and their drawbacks, going into the building blocks they are made of and the atomic instructions, like compare-and-swap (CAS), that can be used. Then it describes a lock-free singly linked list implementation, with examples of insert and remove functions, hazard pointers, and other solutions. In a twist of topic, the page goes on to describe memory models and barrier instructions for shared-memory systems, talking about the x86 memory model, barrier instructions, and skip lists.
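A minimal sketch of the lock-free idea (my own, assuming a C11 compiler with <stdatomic.h>, and simplified to insertion at the head so that the ABA and reclamation issues the page discusses do not arise):

<pre>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

/* Head of a lock-free singly linked list. */
static _Atomic(struct node *) head = NULL;

/* Lock-free push: build the node privately, then publish it with CAS.
 * If another thread changed head in the meantime, the CAS fails,
 * 'old' is reloaded with the current head, and we simply retry. */
static void push(int value) {
    struct node *n = malloc(sizeof *n);
    n->value = value;
    struct node *old = atomic_load(&head);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(&head, &old, n));
}

int main(void) {
    for (int i = 0; i < 5; i++)
        push(i);
    for (struct node *p = atomic_load(&head); p; p = p->next)
        printf("%d ", p->value);   /* prints 4 3 2 1 0 */
    printf("\n");
    return 0;
}
</pre>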
The Spring 2012 wiki page concentrates on other linked data structures. It first introduces linked-list parallel programming, discussing general parallelization techniques, such as copy-scan or pointer doubling, that could be used. Next it describes some of the other data structures, like trees, hash tables, and graphs, looking for opportunities for parallelization, such as tree traversals, shared hash maps, and graph traversals, and showing code examples that compare the serial and parallel solutions.
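A small sketch of the pointer-doubling (pointer-jumping) idea mentioned above, written sequentially for clarity (my own, not taken from the page): in each round, every node's successor pointer is replaced by its successor's successor, so distances to the tail accumulate in O(log n) rounds; in a real data-parallel setting, each round's updates would be performed for all nodes simultaneously.

<pre>
#include <stdio.h>
#define N 8

int main(void) {
    /* List represented with arrays: next[i] is i's successor, the last
     * node points to itself; rank[i] becomes the distance to the tail. */
    int next[N], rank[N], next2[N], rank2[N];
    for (int i = 0; i < N; i++) {
        next[i] = (i == N - 1) ? i : i + 1;
        rank[i] = (i == N - 1) ? 0 : 1;
    }

    /* Pointer jumping: O(log N) rounds. In each round, every node adds its
     * successor's rank and then jumps over it. Each round is a DOALL loop. */
    for (int round = 0; (1 << round) < N; round++) {
        for (int i = 0; i < N; i++) {            /* parallelizable */
            rank2[i] = rank[i] + rank[next[i]];
            next2[i] = next[next[i]];
        }
        for (int i = 0; i < N; i++) { rank[i] = rank2[i]; next[i] = next2[i]; }
    }

    for (int i = 0; i < N; i++)
        printf("node %d is %d hops from the tail\n", i, rank[i]);
    return 0;
}
</pre>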
Chapter 6: Introduction to Memory Hierarchy Organization <ref>Spring_2011/6a_ms</ref><ref>Spring_2012/6b_am</ref><ref>Spring_2012/6b_pa</ref><ref>Spring_2011/ch6a_ep</ref><ref>Spring_2011/ch6a_jp</ref><ref>Spring_2011/ch6b_ab</ref><ref>Spring_2011/ch6b_df</ref>
This section also has pages on three main topics. The first one, Write-Miss Policies and Prefetching, is from Spring 2011 and has three pages of very similar quality, [30], [33], and [34]. [33] gives a thorough description of recent processor architectures and their cache characteristics, going on to describe cache write-miss policies and their combinations, like write-validate, write-invalidate, or write-around. [34] also includes the write-hit policies and a good section on prefetching, describing its advantages and disadvantages and how effective the solution is. It then describes an example, stream buffer prefetching, and how prefetching is used in parallel computing, specifically in shared-memory systems.
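As a tiny illustration of software prefetching (the example is mine, not from [34]; __builtin_prefetch is a GCC/Clang compiler builtin), a loop can request the line it will need a few iterations ahead, so the memory latency overlaps with useful work:

<pre>
#include <stdio.h>
#define N 1024
#define DIST 16   /* prefetch distance in elements; tuned empirically */

int main(void) {
    static double a[N];
    double sum = 0.0;
    for (int i = 0; i < N; i++) a[i] = i;

    for (int i = 0; i < N; i++) {
        if (i + DIST < N)
            /* Hint the hardware to bring a[i+DIST] into the cache:
             * second arg 0 = read, third arg 3 = high temporal locality. */
            __builtin_prefetch(&a[i + DIST], 0, 3);
        sum += a[i];
    }
    printf("sum = %g\n", sum);
    return 0;
}
</pre>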
The second topic covers Cache Addressing and Translation Lookaside Buffer (TLB) Coherence. We also have two pages here: [35] gives a good description of cache addressing, going through the combinations of virtual and physical indexes and tags that caches can use for management, and their advantages and disadvantages. [36] focuses particularly on the TLB architecture, defining it and going through the cache addressing models and how they affect its performance. Finally there is a short reference to cache coherence, a topic of another chapter.
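A short worked sketch of the address split those pages describe (the parameters are hypothetical): with 64-byte lines and 128 sets, the low 6 bits of an address are the block offset, the next 7 bits the set index used for the lookup, and the rest the tag that is compared; whether the index and tag come from the virtual or the physical address is exactly the design choice [35] and [36] discuss.

<pre>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical cache geometry: 64 B lines, 128 sets
 * (e.g. a 32 KB, 4-way set-associative L1). */
#define LINE_BITS  6                   /* 2^6 = 64-byte line */
#define SET_BITS   7                   /* 2^7 = 128 sets     */

int main(void) {
    uint64_t addr = 0x7ffd1234abcdULL;  /* any virtual or physical address */

    uint64_t offset = addr & ((1u << LINE_BITS) - 1);
    uint64_t index  = (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);
    uint64_t tag    = addr >> (LINE_BITS + SET_BITS);

    printf("offset=%llu set=%llu tag=0x%llx\n",
           (unsigned long long)offset,
           (unsigned long long)index,
           (unsigned long long)tag);
    return 0;
}
</pre>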
The last topic in this chapter is Multiprocessor Issues with Write Buffers, which takes us into the Spring 2012 wiki pages. Both [31] and [32] provide a very extensive discussion of the topic. The two of them start by talking about the concerns that write buffers raise in both uniprocessor and multiprocessor systems. [32] then discusses software coherence through write policies, describing approaches and implementations of the buffer and comparing them on parameters like hit ratio, memory latency, or execution time. It passes afterwards to the topic of sequential consistency and atomic sequences, and ends by describing write buffer implementations, like the Universal Write Buffer, the Scalable Store Buffer, and the Partial Block Invalidation coherence mechanisms. [31], on the other hand, presents sections on maintaining sequential consistency and overcoming buffer stalls, again covering Atomic Sequence Ordering and Scalable Store Buffers.
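The classic store-buffering example behind those concerns can be sketched in a few lines (a hypothetical litmus test of my own, not code from [31] or [32]): on hardware where each core's writes sit in a private write buffer, both loads can read the old value 0, an outcome sequential consistency forbids; inserting a full memory fence between each thread's store and load restores the expected behavior.

<pre>
#include <pthread.h>
#include <stdio.h>

/* Store-buffering litmus test. Under sequential consistency at least one
 * thread must see the other's store, so (r1, r2) = (0, 0) is impossible.
 * With write buffers (e.g. x86-TSO) the stores can still be buffered when
 * the loads execute, and (0, 0) can occasionally be observed. */
static volatile int x, y, r1, r2;

static void *t1(void *arg) {
    (void)arg;
    x = 1;
    /* __sync_synchronize();   uncomment to add a full fence */
    r1 = y;
    return NULL;
}

static void *t2(void *arg) {
    (void)arg;
    y = 1;
    /* __sync_synchronize(); */
    r2 = x;
    return NULL;
}

int main(void) {
    for (int trial = 0; trial < 10000; trial++) {
        x = y = 0;
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)
            printf("trial %d: r1 = r2 = 0 (store buffer reordering)\n", trial);
    }
    return 0;
}
</pre>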
In the first section of the next chapter, [37] and [38], from Spring 2011, address the topic of cache coherence and memory consistency in shared-memory systems. [37] starts by discussing the advantages and disadvantages of shared-memory architectures and the hardware support needed to achieve them. Then it describes in depth both the cache coherence and memory consistency problems, and moves on to Peterson's algorithm and page tables.
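For reference, Peterson's two-thread mutual exclusion algorithm, which [37] discusses, fits in a few lines; the sketch below is my own and is only correct under sequential consistency or with explicit fences, which is exactly what ties it to the memory consistency discussion.

<pre>
#include <stdio.h>

/* Peterson's two-thread mutual exclusion. Correct only under sequential
 * consistency (or with fences/atomics); write buffers that delay the
 * 'flag' store past the subsequent loads break it. */
static volatile int flag[2];
static volatile int turn;

static void lock(int self) {
    int other = 1 - self;
    flag[self] = 1;          /* I want to enter            */
    turn = other;            /* but let the other go first */
    while (flag[other] && turn == other)
        ;                    /* spin until it is my turn   */
}

static void unlock(int self) {
    flag[self] = 0;
}

int main(void) {
    /* Single-threaded demonstration of the interface only. */
    lock(0);
    printf("thread 0 in critical section\n");
    unlock(0);
    return 0;
}
</pre>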
The second section covers TLB coherence, with Spring 2012 pages [39] and [40].
Chapter 8: Bus-Based Coherent Multiprocessors <ref>Spring_2011/ch8_cl</ref><ref>Spring_2011/ch8_mc</ref> <ref>Spring_2012/8a_cj</ref> <ref>Spring_2012/8a_cm</ref> <ref>Spring_2012/8a_fu</ref> <ref>Spring_2012/8b_va</ref>
The first section in this chapter is from Spring 2011 and is titled Introduction to bus-based cache coherence in real machines. While [41] describes all the coherence protocols well, [42] makes a point of associating them with the architectures and vendors they were developed for.
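To make the protocols named below concrete, here is a minimal, simplified sketch of my own (not taken from [41] or [42]) of the processor-side state transitions of the basic MSI snooping protocol; MESI, MESIF, and MOESI refine it by adding Exclusive, Forward, and Owned states.

<pre>
#include <stdio.h>

/* Cache line states of the basic MSI protocol. */
enum msi_state { INVALID, SHARED, MODIFIED };

/* Requests made by the local processor on a line. */
enum request { PR_RD, PR_WR };

/* Simplified processor-side transition function: a read miss fetches the
 * line Shared, a write obtains exclusive ownership (BusRdX or upgrade) and
 * moves it to Modified. Bus-snooping transitions (downgrades on remote
 * reads/writes) are omitted for brevity. */
static enum msi_state next_state(enum msi_state s, enum request r) {
    switch (s) {
    case INVALID:  return (r == PR_RD) ? SHARED : MODIFIED;
    case SHARED:   return (r == PR_RD) ? SHARED : MODIFIED;
    case MODIFIED: return MODIFIED;      /* already has exclusive ownership */
    }
    return INVALID;
}

int main(void) {
    enum msi_state s = INVALID;
    s = next_state(s, PR_RD);   /* I -> S : read miss, BusRd      */
    s = next_state(s, PR_WR);   /* S -> M : write, BusRdX/upgrade */
    printf("final state = %d (2 = MODIFIED)\n", s);
    return 0;
}
</pre>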
Chapter 8a: MSI, MESI, MESIF, and MOESI protocols on real architectures
Chapter 8b: Update and adaptive coherence protocols on real architectures, and power considerations
Chapter 9: Hardware Support for Synchronization <ref>Spring_2011/ch9_ms</ref><ref>Spring_2012/9a_mm</ref><ref>Spring_2012/9a_mm</ref><ref>Spring_2012/9a_mm</ref>
Chapter 9a: Reducing locking overhead
Chapter 10: Memory Consistency Models<ref>Spring_2012/10a_dr</ref><ref>Spring_2012/10a_jp</ref><ref>Spring_2012/10a_vm</ref><ref>Spring_2012/10b_sr</ref><ref>Spring_2012/ch10_sj</ref><ref>Spring_2011/ch10_MC</ref><ref>Spring_2011/ch10_sb</ref><ref>Spring_2011/ch10_sj</ref><ref>Spring_2011/ch10a_dc</ref>
Chapter 10a: Prefetching and consistency models
Chapter 10b: Use of consistency models in current multiprocessors
Chapter 11a: Performance of DSM systems
Chapter 11b: Improvements to directory-based cache-coherence animations
Chapter 12: Interconnection Network Architecture <ref>Spring_2011/ch12 </ref><ref>Spring_2011/ch12_aj</ref><ref>Spring_2011/ch12_ar</ref><ref>Spring_2011/ch12_ob</ref><ref>Spring_2012/12b_jh</ref><ref>Spring_2012/12b_sb</ref>
Chapter 12a: New interconnection topologies
Chapter 12b: On-chip interconnects
References
<references/>