CSC/ECE 506 Spring 2012/preface

From Expertiza_Wiki
Jump to navigation Jump to search

This preface intends to browse through the wiki pages done over the Spring 11 and Spring 12 CSC/ECE 506 courses and, taking the Solihin's FUNDAMENTALS OF PARALLEL COMPUTER ARCHITECTURE<ref>FUNDAMENTALS OF PARALLEL COMPUTER ARCHITECTURE by Solihin</ref> as a base and perusing the book Table of Content <ref>FUNDAMENTALS OF PARALLEL COMPUTER ARCHITECTURE TOC</ref> travel though the study topics and summing them up, picking up the best approaches.

Chapter 1: Perspectives <ref>Spring_2012/1a_hv</ref><ref>Spring_2012/1a_mw</ref> <ref>Spring_2012/1a_ry</ref> <ref>Spring_2012/1b_ap</ref> <ref>Spring_2012/1b_as </ref> <ref> Spring_2012/1b_ps</ref> <ref>Spring_2012/1c_12</ref> <ref> Spring_2012/1c_cl </ref> <ref>Spring_2012/1c_dm </ref>

This chapter includes pages on three different subjects: supercomputer comparisons, Moore's Law and MISD architectures.On the first topic, supercomputers evolution, out of the 3 pages both [5] and [6] are very similar, probably because they come from the same source ([6] actually specify what their additions were to the Spring 11 original), and are both remarkable. They start analyzing the evolution of the supercomputer (Eniac, Cray, Japanese computers, IBM...) to pass to their different designs and architectures ([5] adds here a section of the reasons for the decay of the vector-based machines). Chapter 7 on [6] is an extensive excursion over the operative systems used in these platforms, ways of interconnecting them ([5] also introduces Infiniband as an alternative to Ethernet)and includes some comparisons between brands and different architectures. Next more current fast supercomputers implementations are presented, together with programming models and tools, like Fortran, C, PVM or OpenMP. Finally Consumption and cooling techniques are presented, followed by a discussion on trends and the future of these machines.

The second set of wikis try to briefly answer the Does Moore's Law still hold? question. [7] and [8] are also very similar in their approach and both pages introduce the Law and its proposer, a historical of its evolution through the years; and also a list of computers with their number of transistors as a proof of the Law and its particularities, including that it refers to the number of transistors, not performance. Then they approach the future of the Law, and [7] includes some future alternatives to the transistor, like memristors, quantum computing and ballistic deflection transistors.

The third set is about MISD architectures and the wiki pages go into describing the Flynn's Taxonomy in general and the MISD machines in particular. [9], after introducing the MISD architecture and giving examples (or explaining the lack of them, as theoretically they don't exist), makes a good description of systolic arrays and its applications, describing, for instance, discrete and fast Fourier transforms implementations. It also discusses whether the systolic arrays are really MISD machines or not, as the concept is a matter of polemic.

Chapter 2: Parallel Programming Models <ref>Spring_2012/2a_bm </ref> <ref> Spring_2012/2a_va </ref><ref> Spring_2012/ch2b_cm </ref><ref>Spring_2011/ch2_cl </ref><ref> Spring_2011/ch2_dm</ref><ref>Spring_2011/ch2_JR </ref><ref> Spring_2011/ch2a_mc </ref>

This chapter is also divided into three different topics. The first one describes Data Parallel models in general and the wiki pages have been done in the Spring of 2011. Out of the three pages on the subject, [16] does a good job describing the model and its parts, decomposition and orchestration; but [15] treats the Task Parallel model better, comparing them even graphically; and offers also a history of the evolution of these techniques.

Coming into the Spring 2012 wiki pages, we have two topics. The first one is Shared Address Space (SAS) programming on distributed-memory machines, with [12] going first through the origins and background and introducing the two main issues on shared memory: coherence and synchronization. It passes then to talk about Distributed Shared Memory (DSM) and cache coherence, page management and memory mapping in MOME, and also node communications and programming environments, with descriptions of how the locks and barriers are introduced to ensure synchronization. It follows with some DSM implementation example of implementations and algorithms, introducing consistency models. Finally, it describes some evaluation models to measure performance (SPLASH) and some study cases, to then close the page commenting of the perspective for the evolution of the topic.

The last topic in this chapter is Data parallelism in GPUs. The pages [14] and [18] treat the subject in a very similar way, getting into the diatribes of the massive parallel computer that take places in graphic processors and how are they more focused on data processing, versus control, as other kind of processor. [18] opens openCL and programming models, talking about the platforms, execution and memory models and programming approaches. Then it discusses two implementation from AMD, the Radeon HD5000 and HD6000 series and NVidia GeForce 4000s. Compute Unified Device Architecture (CUDA) is then introduced and described, focusing in threads and programming. [14] is more focused in the different NVidea processor implementations and dedicates more space to stream processors and CUDA architecture, programming and examples.

Chapter 3: Shared Memory Parallel Programming <ref>Spring_2012/3a_df</ref><ref>Spring_2012/3a_kp</ref><ref>Spring_2012/3a_yw]</ref><ref>Spring_2012/3b_sk</ref><ref>Spring_2011/ch3_ab</ref>

In this chapter we have also three topics. In the first one, Types of Parallelism, the authors of [23] define the concepts of DOALL, DOACCROSS, DOPIPE, Functional Parallelism and Reduction. Following they introduce synchronization and its mechanisms and describe how they are applied in several architectural examples, that is, IA-64, IA-32, Linux Kernel, CUDA, Power-PC and Cell Broadband Engines (CEB); talking about what techniques every of these examples uses to maintain that synchronization (spin-locks, barriers...).

Next topic in this chapter is Patterns of parallel programming where we have a couple of different approaches: [20] tries first to make classifications and presents three different ones, Massingeli's, Keutzer and Mattson's and Siu et al's. Then it divides the list of patterns at four different levels, that is, Decomposition, Algorithm, Implementation and Parallel Executions; and describes common routines and algorithms that are used at every level, with some examples. It finishes with a glossary of all of the patterns that could be useful for a fast search. [21] on the other hand, focuses on the Algorithm level and, after introducing the need to have these patterns as building blocks to get the best performance and repeatability out of our systems, it divides the patterns into two groups, Functional Parallelism Patterns, like the Embarrassingly Parallel Pattern or Divide and conquer; Inseparable Patterns and Data Parallelism Patterns, which include Replicable Patterns, Repository Patterns, Recursive Data Pattern, Geometric and Irregular Mesh Pattern and Pipeline Pattern. To finish, it talks about the limitations of the Parallel Patterns, and does a comparison amongst them.

The last topic in this chapter is Map Reduce, of which we only have one wiki page. This page introduces the software framework as a Google development and proceeds to describe the programming model, after which it gives some examples and example code. Then it shows three different implementations of the technique: the very own Google's MapReduce, with its Master Data Structures and Fault Tolerance, as the libraries can have enormous amount of data and have to fall gracefully. Following the Phoenix implementation for Shared Memory Systems and a description of its API and buffer management. Finally it describes an implementation in graphics processors with the Mars API and implementation and optimization details.

Chapter 4: Issues in Shared Memory Programming <ref>Spring_2012/4b_rs</ref><ref>Spring_2011/ch4a_bm</ref><ref>Spring_2011/ch4a_ob</ref><ref>Spring_2011/ch4a_zz</ref>

The first three wiki pages in these chapter, [25], [26] and [27], are from Spring 11 and have three different topics, all related to data-parallel algorithm implementations, hence being a weak fit for the subjects discussed in this section. The first one, [25], takes on the three models of the Gaussian Elimination implementation, developed in High Performance Fortran, OpenMP and MPI. The second, [26], digs into the Serial Simplex Method (Neader-Mead function minimization), proposing pseudo code and then ways of performing task and a data parallel implementations of the algorithm. It also provides a shared memory and a messaging-pass implementation. Finally, [27]is the most ethereal of all three, describing an N-body problem as a result of observations in situations found in Astrophysics and Mathematics. For solving the problem in parallel, the authors introduce several approaches, like the Barnes-Hut Tree and the Orthogonal Recursive Bisection. Then they nicely describe data-parallel, shared-memory and message-passing implementations.

This who writes could not find any wiki pages on the topic of Automatic parallelism and its limitations. As interesting as it could be, nobody seems to have been attracted by the subject of finding ways of automatizing the act of parallelization of a sequential process and the description of the compilers that could do that task to make the programmer's one easier.

The final subdivision of this chapters talks about The limits to speedup and there is just one wiki from Spring 2012 taking the topic [24]. The page starts mentioning the Amdah's Law and how it leads to the Gustafson's Law, which is explained, and its derivation; and, after a brief mention to Scaled Speedup, it goes into its evolutions and some examples. It then talks about the Gordon Bell Prize on parallel computing research, of which Gustafson is a recipient; and that leads to the definition of Superlinear Speed, its controversy and the reasons behind it and the reasons why we may not be able to reach it.

Chapter 5:Parallel Programming for Linked Data Structures <ref>Spring_2012/ch5a_ja</ref><ref>Spring_2011/ch5_LL</ref>

In this chapter we have two wiki pages. The first one, [29]from Spring 11, talks about Parallel Programming For Link Data Structures. The page moves to describe parallel algorithms for working with them, focusing on thread-safe accesses based on locks with better parallel performance, like fine grained or read-write locks. It starts categorizing non-blocking algorithms, their motivations and their drawbacks, going into the building blocks that make them and the atomic instructions, like CAS, that can be used. Then it describes a lock-free singly linked list implementation, with examples of insert and remove functions, hazard pointers and other solutions. In a twist of topic, the page goes to describe memory models and barrier instructions for shared-memory systems; talking about the X86 memory model, barrier instructions and skip lists.

This year 2012 wiki page concentrates in Other Linked Data Structures

Chapter 6: Introduction to Memory Hierarchy Organization <ref>Spring_2011/6a_ms</ref><ref>Spring_2012/6b_am</ref><ref>Spring_2012/6b_pa</ref><ref>Spring_2011/ch6a_ep</ref><ref>Spring_2011/ch6a_jp</ref><ref>Spring_2011/ch6b_ab</ref><ref>Spring_2011/ch6b_df</ref>

Chapter 6b: Multiprocessor issues with write buffers

Chapter 7:Introduction to Shared Memory Multiprocessors <ref>Spring_2011/ch7_jp</ref><ref>Spring_2011/ch7_ss</ref><ref>Spring_2012/7b_pk</ref><ref>Spring_2012/7b_yw</ref>

Chapter 7b: TLB coherence

Chapter 8: Bus-Based Coherent Multiprocessors <ref>Spring_2011/ch8_cl</ref><ref>Spring_2011/ch8_mc</ref> <ref>Spring_2012/8a_cj</ref> <ref>Spring_2012/8a_cm</ref> <ref>Spring_2012/8a_fu</ref> <ref>Spring_2012/8b_va</ref>

Chapter 8a: MSI, MESI, MESIF, and MOESI protocols on real architectures

Chapter 8b: 8b. Update and adaptive coherence protocols on real architectures, and power considerations

Chapter 9: Hardware Support for Synchronization <ref>Spring_2011/ch9_ms</ref><ref>Spring_2012/9a_mm</ref><ref>Spring_2012/9a_mm</ref><ref>Spring_2012/9a_mm</ref>

Chapter 9a: Reducing locking overhead

Chapter 10: Memory Consistency Models<ref>Spring_2012/10a_dr</ref><ref>Spring_2012/10a_jp</ref><ref>Spring_2012/10a_vm</ref><ref>Spring_2012/10b_sr</ref><ref>Spring_2012/ch10_sj</ref><ref>Spring_2011/ch10_MC</ref><ref>Spring_2011/ch10_sb</ref><ref>Spring_2011/ch10_sj</ref><ref>Spring_2011/ch10a_dc</ref>

Chapter 10a: Prefetching and consistency models

Chapter 10b: Use of consistency models in current multiprocessors

Chapter 11: Distributed Shared Memory Multiprocessors <ref>Spring_2012/11a_hn</ref><ref>Spring_2012/11a_ht</ref><ref>Spring_2012/11a_sc</ref><ref>Spring_2011/ch11_BB_EP</ref><ref>Spring_2011/ch11_DJ</ref><ref>Spring_2011/ch11_sw</ref>

Chapter 11a: Performance of DSM systems

Chapter 11b: Improvements to directory-based cache-coherence animations

Chapter 12: Interconnection Network Architecture <ref>Spring_2011/ch12 </ref><ref>Spring_2011/ch12_aj</ref><ref>Spring_2011/ch12_ar</ref><ref>Spring_2011/ch12_ob</ref><ref>Spring_2012/12b_jh</ref><ref>Spring_2012/12b_sb</ref>

Chapter 12a: New interconnection topologies

Chapter 12b: On-chip interconnects

References

<references/>