CSC/ECE 506 Spring 2012/preface: Difference between revisions

From Expertiza_Wiki
Jump to navigation Jump to search
Line 120: Line 120:
=Chapter 11: Distributed Shared Memory Multiprocessors=
=Chapter 11: Distributed Shared Memory Multiprocessors=
Pages in this chapter: <ref name=A59>[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/11a_hn Spring_2012/11a_hn]</ref><ref name=A60>[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/11a_ht Spring_2012/11a_ht]</ref><ref name=A61>[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch11_BB_EP Spring_2011/ch11_BB_EP]</ref><ref name=A62>[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch11_DJ Spring_2011/ch11_DJ]</ref><ref name=A63>[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch11_sw Spring_2011/ch11_sw]</ref>
Pages in this chapter: <ref name=A59>[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/11a_hn Spring_2012/11a_hn]</ref><ref name=A60>[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/11a_ht Spring_2012/11a_ht]</ref><ref name=A61>[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch11_BB_EP Spring_2011/ch11_BB_EP]</ref><ref name=A62>[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch11_DJ Spring_2011/ch11_DJ]</ref><ref name=A63>[http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch11_sw Spring_2011/ch11_sw]</ref>
Chapter 11 in the Solihin<ref name=A1 /> text explains what are the best ways to scale a shared memory system. A directory approach is one of the most scalable and hence DSM systems are developed in this section. Full-bit vector and SCCI protocols are fully described. The wiki pages in this section talk about the SCI and the performance of the DSM protocols, complementing the text. Also, the power point slides where the functioning of this last protocol is described, needed some enhancement and some teams have taken care of that.
==Scalable Coherent Interface (SCI)==
==Scalable Coherent Interface (SCI)==
The Spring 2011 wiki pages on this chapter treat the subject of '''Scalable Coherent Interface (SCI)'''. <ref name=A61 /> and <ref name=A63 /> are very similar and go through the topic almost parallely. They start by setting up a background of what SCI means in multiprocessors and the reasons behind it. Then they pass to describe the states that the protocol uses and state diagrams with the protocol transitions, with <ref name=A63 /> giving some examples and explaining the way the protocol work. Then both pages move into the topic of Race Conditions that this protocol shows due to early invalidations and how they can be prevented with various mechanisms; like atomic transactions, using the head node to carry on most of the coherence, or changing the directory structure. There are other possible race conditions, like concurrent deletion or simultaneous deletion and invalidation, that are presented with possible work-arounds.  
The Spring 2011 wiki pages on this chapter treat the subject of '''Scalable Coherent Interface (SCI)'''. <ref name=A61 /> and <ref name=A63 /> are very similar and go through the topic almost parallely. They start by setting up a background of what SCI means in multiprocessors and the reasons behind it. Then they pass to describe the states that the protocol uses and state diagrams with the protocol transitions, with <ref name=A63 /> giving some examples and explaining the way the protocol work. Then both pages move into the topic of Race Conditions that this protocol shows due to early invalidations and how they can be prevented with various mechanisms; like atomic transactions, using the head node to carry on most of the coherence, or changing the directory structure. There are other possible race conditions, like concurrent deletion or simultaneous deletion and invalidation, that are presented with possible work-arounds.  

Revision as of 02:29, 4 May 2012

This preface intends to browse through the wiki pages done over the Spring 11 and Spring 12 CSC/ECE 506 courses and, taking the Solihin's FUNDAMENTALS OF PARALLEL COMPUTER ARCHITECTURE<ref name=A1>FUNDAMENTALS OF PARALLEL COMPUTER ARCHITECTURE by Solihin</ref> as a base and perusing the book Table of Content <ref name= A2>FUNDAMENTALS OF PARALLEL COMPUTER ARCHITECTURE TOC</ref>, travel through the study topics and summing them up, picking up the best approaches and comments from the pages.

Chapter 1: Perspectives

Pages in this chapter:<ref name= A3>Spring_2012/1a_hv</ref><ref name=A4>Spring_2012/1a_mw</ref> <ref name=A5>Spring_2012/1a_ry</ref> <ref name=A6>Spring_2012/1b_ap</ref> <ref name=A7>Spring_2012/1b_as </ref> <ref name=A8> Spring_2012/1b_ps</ref> <ref name=A9>Spring_2012/1c_12</ref> <ref name=A10> Spring_2012/1c_cl </ref> <ref name=A11>Spring_2012/1c_dm </ref>

The Solihin<ref name=A1 /> text's chapter one explores the evolution of the supercomputer and their performance, with references to the Moore's Law as the ruler of the estimation of the speed of integration. Also in this chapter we have techniques used to improve performance, like ILP or caches. The Flynn taxonomy is mentioned and the architectures discussed. For this chapter we have wiki pages on three different subjects: supercomputer comparisons, where we would like to see how this evolution is keeping since the book was written. Also we wonder if the Moore's Law is still valid and where it stands. The last topic we have is about MISD architectures and its particularities: is it really possible?

Supercomputers Evolution

On the first topic, Supercomputers Evolution, out of the 3 pages both <ref name=A5 /> and <ref name=A6 /> are very similar, probably because they come from the same source (<ref name=A6 /> actually specify what their additions were to the Spring 11 original), and are both remarkable. They start analyzing the evolution of the supercomputer (Eniac, Cray, Japanese computers, IBM...) to pass to their different designs and architectures (<ref name=A5 /> adds here a section of the reasons for the decay of the vector-based machines). Chapter 7 on <ref name=A6 /> is an extensive excursion over the operative systems used in these platforms, ways of interconnecting them (<ref name=A5 /> also introduces Infiniband as an alternative to Ethernet) and includes some comparisons between brands and different architectures. Next more current fast supercomputers implementations are presented, together with programming models and tools, like Fortran, C, PVM or OpenMP. Finally Consumption and cooling techniques are presented, followed by a discussion on trends and the future of these machines.

Does Moore's Law still hold?

The second set of wikis try to briefly answer the Does Moore's Law still hold? question. <ref name=A7 /> and <ref name=A8 /> are also very similar in their approach and both pages introduce the Law and its proposer, a historical of its evolution through the years; and also a list of computers with their number of transistors as a proof of the Law and its particularities, including that it refers to the number of transistors, not performance. Then they approach the future of the Law, and <ref name=A7 /> includes some future alternatives to the transistor, like memristors, quantum computing and ballistic deflection transistors.

MISD architectures

The third set is about MISD architectures and the wiki pages go into describing the Flynn's Taxonomy in general and the MISD machines in particular. <ref name=A9 />, is probably the best one and, after introducing the MISD architecture and giving examples (or explaining the lack of them, as theoretically they don't exist), makes a good description of systolic arrays and its applications, describing, for instance, discrete and fast Fourier transforms implementations. It also discusses whether the systolic arrays are really MISD machines or not, as the concept is a matter of polemic.

Chapter 2: Parallel Programming Models

Pages in this chapter: <ref name= A12>Spring_2012/2a_bm </ref> <ref name= A13> Spring_2012/2a_va </ref> <ref name= A14> Spring_2012/ch2b_cm </ref> <ref name= A15>Spring_2011/ch2_cl </ref> <ref name= A16> Spring_2011/ch2_dm</ref> <ref name= A17>Spring_2011/ch2_JR </ref> <ref name= A18> Spring_2011/ch2a_mc </ref>

Solihin's <ref name= A1 /> chapter two treats the subject of Parallel Programming Models. How uniprocessor and multiprocessor programing is different and the ways of organizing memory communications, like Shared Memory or Message Passing. In this section the wiki pages are discussing evolution in three different topics.

Data Parallel models

The first one describes Data Parallel models in general. All the wiki pages have been done in the Spring of 2011. Out of the three pages on the subject, <ref name= A16 /> does a good job describing the model and its parts, decomposition and orchestration; but <ref name= A15 /> treats the Task Parallel model better, comparing them even graphically; and offers also a history of the evolution of these techniques. In any case, none of the pages seem to be particularly updated.

Shared Address Space (SAS) programming on distributed-memory machines

Coming into the Spring 2012 wiki pages, we have two topics. The first one is Shared Address Space (SAS) programming on distributed-memory machines, with <ref name= A12 /> going first through the origins and background and introducing the two main issues on shared memory: coherence and synchronization. It passes then to talk about Distributed Shared Memory (DSM) and cache coherence, page management and memory mapping in MOME, and also node communications and programming environments, with descriptions of how the locks and barriers are introduced to ensure synchronization. It follows with some DSM implementation example of implementations and algorithms, introducing consistency models. Finally, it describes some evaluation models to measure performance (SPLASH) and some study cases, to then close the page commenting of the perspective for the evolution of the topic.

Data Parallelism in GPUs

The last topic in this chapter is Data Parallelism in GPUs, which has become very popular in recent years. The pages <ref name= A14 /> and <ref name= A18 /> treat the subject in a very similar way, getting into the diatribes of the massive parallel computer that take places in graphic processors and how are they more focused on data processing, versus control, as other kind of processor. <ref name= A18 /> opens openCL and programming models, talking about the platforms, execution and memory models and programming approaches. Then it discusses two implementation from AMD, the Radeon HD5000 and HD6000 series and NVidia GeForce 4000s. Compute Unified Device Architecture (CUDA) is then introduced and described, focusing in threads and programming. [14] is more focused in the different NVidia processor implementations and dedicates more space to stream processors and CUDA architecture, programming and examples.

Chapter 3: Shared Memory Parallel Programming

Pages in this chapter: <ref name=A19>Spring_2012/3a_df</ref><ref name=A20>Spring_2012/3a_kp</ref><ref name=A21>Spring_2012/3a_yw]</ref><ref name=A22>Spring_2012/3b_sk</ref><ref name=A23>Spring_2011/ch3_ab</ref>

Chapter 3 in the Solihin <ref name=A1 /> book refers to programming in shared memories environments and how to analyze our code to find out the best ways of taking advantage of opportunities for finding and exploiting parallelization. The wiki pages have the chance to study three topics, like how these parallelisms are used in current computer architectures; exploring ways of studying and classifying patterns in these parallelisms and getting deeper in one of the algorithms of fashion, MapReduce and the way it works.

Types of Parallelism

In the first topic, Types of Parallelism, the authors of <ref name=A23 /> define the concepts of DOALL, DOACCROSS, DOPIPE, Functional Parallelism and Reduction, though they have been already brought up in the book. Following they introduce synchronization and its mechanisms and describe how they are applied in several architectural examples, that is, IA-64, IA-32, Linux Kernel, CUDA, Power-PC and Cell Broadband Engines (CEB); talking about what techniques every of these examples uses to maintain that synchronization (spin-locks, barriers...).

Patterns of Parallel Programming

Next topic in this chapter is Patterns of Parallel Programming where we have a couple of different approaches: <ref name=A20 /> tries first to make classifications and presents three different ones, Massingeli's, Keutzer and Mattson's and Siu et al's. Then it divides the list of patterns at four different levels, that is, Decomposition, Algorithm, Implementation and Parallel Executions; and describes common routines and algorithms that are used at every level, with some examples. It finishes with a glossary of all of the patterns that could be useful for a fast search. <ref name=A21 />, on the other hand, focuses on the Algorithm level and, after introducing the need to have these patterns as building blocks to get the best performance and repeatability out of our systems, it divides the patterns into two groups, Functional Parallelism Patterns, like the Embarrassingly Parallel Pattern or Divide and conquer; Inseparable Patterns and Data Parallelism Patterns, which include Replicable Patterns, Repository Patterns, Recursive Data Pattern, Geometric and Irregular Mesh Pattern and Pipeline Pattern. To finish, it talks about the limitations of the Parallel Patterns, and does a comparison amongst them.

Map Reduce

The last topic in this chapter is Map Reduce, of which we only have one wiki page <ref name=A22 /> . This page introduces the software framework as a Google development and proceeds to describe the programming model, after which it gives some examples and example code. Then it shows three different implementations of the technique: the very own Google's MapReduce, with its Master Data Structures and Fault Tolerance, as the libraries can have enormous amount of data and have to fall gracefully. Following the Phoenix implementation for Shared Memory Systems and a description of its API and buffer management. Finally it describes an implementation in graphics processors with the Mars API and implementation and optimization details.

Chapter 4: Issues in Shared Memory Programming

Pages in this chapter: <ref name= A24>Spring_2012/4b_rs</ref><ref name= A25>Spring_2011/ch4a_bm</ref><ref name= A26>Spring_2011/ch4a_ob</ref><ref name= A27>Spring_2011/ch4a_zz</ref>

The Solihin <ref name= A1 /> book dedicates this chapter to the problems we can find in a shared memory programming environment, either in correctness, like preservation of the result, synchronization and variable partitioning; or in compiler limitations or performance issues; and it talks about those different aspects and concepts can affect the results and efficiency of our systems. The wiki pages in this chapter try to complement these subjects talking about current implementations of data parallel algorithms, limitations of automatic parallelism and how much we can stress the speed up equations.

Data-Parallel Algorithm Implementations

The first three wiki pages in these chapter, <ref name= A25 />, <ref name= A26 /> and <ref name= A27 />, are from Spring 11 and have three different topics, all related to data-parallel algorithm implementations, being a somewhat the weaker fit for the subjects discussed in this section. The first one, <ref name= A25 />, takes on the three models of the Gaussian Elimination implementation, developed in High Performance Fortran, OpenMP and MPI. The second, <ref name= A26 />, digs into the Serial Simplex Method (Neader-Mead function minimization), proposing pseudo code and then ways of performing task and a data parallel implementations of the algorithm. It also provides a shared memory and a messaging-pass implementation. Finally, <ref name= A27 />is the most ethereal of all three, describing an N-body problem as a result of observations in situations found in Astrophysics and Mathematics. For solving the problem in parallel, the authors introduce several approaches, like the Barnes-Hut Tree and the Orthogonal Recursive Bisection. Then they nicely describe data-parallel, shared-memory and message-passing implementations, which are topics that belong to other chapters.

Automatic Parallelism And Its Limitations

This who writes could not find any wiki pages on the topic of Automatic parallelism and its limitations. As interesting as it could be, nobody seems to have been attracted by the subject of finding ways of automatizing the act of parallelization of a sequential process and the description of the compilers that could do that task to make the programmer's one easier.

The Limits To Speedup

The final subdivision of this chapters talks about The Limits To Speedup and there is just one wiki from Spring 2012 taking the topic <ref name= A24 />. The page starts mentioning the Amdahl's Law and how it leads to the Gustafson's Law, which is explained, and its derivation; and, after a brief mention to Scaled Speedup, it goes into its evolutions and some examples. It then talks about the Gordon Bell Prize on parallel computing research, of which Gustafson is a recipient; and that leads to the definition of Superlinear Speed, its controversy and the reasons behind it and the reasons why we may not be able to reach it.

Chapter 5:Parallel Programming for Linked Data Structures

Pages in this chapter: <ref name= A28>Spring_2012/ch5a_ja</ref><ref name=A29>Spring_2011/ch5_LL</ref>

This chapter of the Solihin<ref name=A1 /> text brings the topic of how to use parallel programming with linked data structures, its serializability and parallelization strategies. The chapter seems to focus especially in linked lists and it would be interesting to be able to extend it to other structures. One wiki page in this section do so talking about other linked data structures.

Parallel Programming For Link Data Structures

In this chapter we have just two wiki pages. The first one, <ref name=A29 />, from Spring 11, is named like the Solihin's chapter Parallel Programming For Link Data Structures and it is intended to supplement it. The page describes how to work with parallel algorithms, focusing on thread-safe accesses based on locks with better parallel performance, like fine grained or read-write locks. It starts categorizing non-blocking algorithms, their motivations and their drawbacks, going into the building blocks that make them and the atomic instructions, like CAS, that can be used. Then it describes a lock-free singly linked list implementation, with examples of insert and remove functions, hazard pointers and other solutions. In a twist of topic, the page goes to describe memory models for shared-memory systems; talking about the X86 memory model, barrier instructions and skip lists.

Other Linked Data Structures

This year 2012's wiki page, <ref name=A28 />, concentrates in Other Linked Data Structures. It first introduces Linked-list Parallel Programming, talking about general techniques for parallelization, like copy-scan or pointer doubling that could be used. Next it describes some of the other data structures, like trees, hash tables and graphs; looking for opportunities for parallelization, like trees traversals, shared hash-maps and graphs traversals; showing code examples to compare the serial and parallel solutions.

Chapter 6: Introduction to Memory Hierarchy Organization

Pages in this chapter: <ref name=A30>Spring_2011/6a_ms</ref><ref name=A31>Spring_2012/6b_am</ref><ref name=A32>Spring_2012/6b_pa</ref><ref name=A33>Spring_2011/ch6a_ep</ref><ref name=A34>Spring_2011/ch6a_jp</ref><ref name=A35>Spring_2011/ch6b_ab</ref><ref name=A36>Spring_2011/ch6b_df</ref>

Chapter 6 in the Solihin <ref name=A1 /> talks about ways the memory can be organized in a chip of a set of them. How the memory can be subdivided and how those divisions, like the cache, work and are structured. Then it describes the memory hierarchy and the ways of managing, that is, writing and reading policies or replacing mechanisms. The wikis in this chapter try to go beyond in some of those policies and prefetching, which does not have a big attention in the book. Also other wikis complement the chapter talking about TLB coherence and problems with write buffers.

Write-Miss Policies and Prefetching

The first topic, Write-Miss Policies and Prefetching, is from Spring 11 and has three very similar in quality pages <ref name=A30 />, <ref name=A33 /> and <ref name=A34 />. <ref name=A33 /> does a thorough description of recent processor architectures and their cache characteristics, going after that to describe cache write miss policies and their combinations, like write-validate, write-invalidate or write-around. <ref name=A34 /> also includes the write-hit policies and a good section on prefetching, describing the advantages and disadvantages and how effective the solution is. It goes then to describe an example, Stream Buffer Prefetching, and then how prefetching is used in parallel computing, specifically in Shared Memory Systems.

Cache Addressing and Translation Lookaside Buffer (TLB) Coherence

The second topic covers Cache Addressing and Translation Lookaside Buffer (TLB)Coherence. We also have two pages, <ref name=A35 /> does a good description on cache addressing, going through the combinations of virtual and physical index and tags caches can use for management, and its advantages and disadvantages. <ref name=A36 /> focuses particularly on the TLB architecture, defining it and going through the models of cache addressing and how they affect its performance. Finally there is a short reference to cache coherence, a topic also brought up in another chapter.

Multiprocessor issues with write buffers

The last topic in this chapter is Multiprocessor Issues With Write Buffers, getting into the Spring 2012 wiki pages. Both <ref name=A31 /> and <ref name=A32 /> do a very extensive discussion on the topic. The two of them start by talking about the concerns that write buffers bring up in both uni-processors and multiprocessors systems. <ref name=A32 /> goes then to discuss the topic of software coherence through write policies, describing approaches and implementations of the buffer and making a comparison on parameters like hit radios, memory latency or execution time. It passes afterwards to the topic of sequential coherence and atomic sequences to end with describing write buffers implementations, like the Universal Write Buffer, the Scalable Store Buffer and the Partial Block Invalidation coherence mechanisms. <ref name=A31 /> on the other side presents sections on maintaining sequential consistency and overcoming buffer stalls, mentioning again Atomic Sequence Ordering and Scalable Store Buffers.

Chapter 7:Introduction to Shared Memory Multiprocessors

Pages in this chapter: <ref name=A37>Spring_2011/ch7_jp</ref><ref name=A38>Spring_2011/ch7_ss</ref><ref name=A39>Spring_2012/7b_pk</ref><ref name=A40>Spring_2012/7b_yw</ref>

Chapter 7 of the Solihin <ref name=A1 /> text treats the topic of cache coherence. It discuses the necessity of having hardware support in shared memory systems to ensure ordering and describes it. It also browses through the concepts of cache coherence in systems with caches, memory consistency models and synchronization support. The first wiki pages in this section complement the chapter up to some state, following the structure of the book. The second set introduces the concept of TLB coherence, which is not treated deeply in the text.

Cache Coherence and Memory Consistency in Share Memory Systems

In the first section of this chapter, <ref name=A37 /> and <ref name=A38 />, from 2011, bring the topic of Cache Coherence and Memory Consistency in Share Memory Systems. <ref name=A37 /> starts by talking about the advantages and disadvantages of the share memory architectures and the hardware support is needed to achieve them. Then, it describes in depth both the cache coherence and memory consistency problems, passing to talk about the Peterson's Algorithm and Page Tables.

TLB Coherence in Multiprocessing

In the second section TLB Coherence in Multiprocessing, we get into the Spring 2012 pages, <ref name=A39 /> and <ref name=A40 />. <ref name=A39 /> starts giving us a background on the subject, introducing the concepts of virtual memory, paging and defining TLBs. Following it describes the coherence problems that we find when we work in with TLBs in a multiprocessor environment, like swap-outs, protection changes or mapping changes. Next, it discusses approaches to the TLB coherence and its elements and presents some strategies to face the problem, that is, using Virtually Indexes Cache, which would eliminate TLBs altogether; TLB Shootdowns, a mixture of synchronization and messaging, with descriptions of methods and implementations; Hierarchical TLBs, where they are organized in levels, like cache; Instruction-based Validation; an approach based on an Address Space Identifier (ASID) and a Validation Based solution, where invalidations are postponed until memory is accessed. Finally, as an example, the page describes the Unified Cache and TLB coherence solution, the Unified Instruction/Translation/Data (UNITD) Coherence protocol by Romanescu et al; giving some background of the reasons that took to its development, and then thoroughly discussing its implementation and performance.

Chapter 8: Bus-Based Coherent Multiprocessors

Pages in this chapter: <ref name=A41>Spring_2011/ch8_cl</ref><ref name=A42>Spring_2011/ch8_mc</ref> <ref name=A43>Spring_2012/8a_cj</ref><ref name=A44>Spring_2012/8a_fu</ref> <ref name=A45>Spring_2012/8b_va</ref> This dense chapter of the Solihin<ref name=A1 /> talks about bus-based architecture and how to assure coherence in a multiprocessor system that uses it. It defines the bus, its characteristics and transactions and the design choices that we can have. Then it goes into bus arbitration and snooping-based coherence and describes the protocols that use it, like MSI, MESI, MOESI, Dragon or Firefly. While the first set of wikis connect those protocols with the real machine implementations they were develop for, the lasts go deeper into implementation issues and, being an topic of great relevance lately, power schemes.

Introduction To Bus-Based Cache Coherence In Real Machines

The first section in this chapter is from Spring 2011 and is titled Introduction To Bus-Based Cache Coherence In Real Machines. While <ref name=A41 /> describes well all the coherence protocols, <ref name=A42 /> makes a point to associate them to the architectures and vendors they were developed for. It first presents the common bus architecture in SMP systems and summarizing some of the more commons Cache Coherence Protocols, like MSI, MESI, MESIF or MOESI, their states and transitions; and discussing their performance. Following, it goes to describe several architectures, starting with MSI & SGI IRIS 4D Processors, then Synapse protocol and Synapse multiprocessors, MESI and Intel Processors, with a brief explanation of the Intel processor architecture and evolution; and the MOESI & AMD Processors, with the Opteron and Phenom example architectures, and other particularities, performance enhancements and optimizations of the AMD64 family of processors. It ends with Dragon Protocol & Xerox Dragon Processors to give pass to discussions on Cache Coherence Power Utilization and discussing possible power savings that can be obtained using Snoop Filtering, eliminating unnecessary snoops, with techniques like Serial Snooping and Jetty.

MSI, MESI, MESIF, And MOESI Protocols on Real Architectures

Now in Spring 2012, <ref name=A43 /> and <ref name=A44 /> develop a similar topic as <ref name=A42 />, MSI, MESI, MESIF, And MOESI Protocols on Real Architectures, and the wiki pages are very close to those of the previous year. <ref name=A44 /> adds a chapter on MESI and ARM11 and Cortex-A9 MPCore, which are in fashion these days. It also presents a chapter on Instruction Prefetching and another in Word Invalidation.

Update And Adaptive Coherence Protocols On Real Architectures, And Power Considerations

Last section takes on Update And Adaptive Coherence Protocols On Real Architectures, And Power Considerations. <ref name=A45 /> introduces the coherence protocols, which take care of making sure that changes in a cache are propagated to the others. Main issue, they state, is that all this transient of information brings extra traffic to the bus and, hence, congestion. They define then the concept of coherence and the two possible approaches, write-update and write-invalidate protocols. The paper focused then in write-update only protocols, starting with the Xerox-Dragon protocol, which is explained; and then with the DEC Firefly protocol. After this it passes to describe the Adaptive Coherence Protocols, which, given the disadvantages of both write-update and write-invalidate, try to mix them in a hybrid scheme, which is here described, together with the hardware architecture needed ti support it. The Sublock Protocol, a snoopy-based adaptation of the Illinois MESI protocol, is also presented with a description of its mechanisms and advantages of this protocol versus more coarse ones. Finally an enhancement of this technique, the Read-snarfing protocol, is presented as an example, with coding proposals and studies over some particular situations; and some simulations results are also shown.

Chapter 9: Hardware Support for Synchronization

Pages in this chapter: <ref name=A46>Spring_2011/ch9_ms</ref><ref name=A47>Spring_2012/9a_mm</ref><ref name=A48>Spring_2012/9a_ms</ref><ref name=A49>Spring_2012/9a_cm</ref>

This chapter takes care of explaining the means needed to exercise synchronization in a multi-process system. It introduces hardware (LL/SC, atomic instructions) and software (locks, barriers, semaphores or monitors) tools to obtain that synchronization and describe them in depth, talking about their architecture, implementation, advantages and drawbacks. The wiki pages associated at this chapter have two subjects. The first one loosely follows the book chapter and it's based on it, almost like a summary. The second section brings up an interesting topic that the chapter does not cover in depth, which is the overhead related to the lock process, introducing a couple of techniques, thin and bias locks, that try to minimize those overheads basing themselves in particularities of the programming applications, mostly in the Java environment.

Synchronization

The first section, Synchronization, from Spring 2011, has just one wiki page, <ref name=A49 />. The page starts by introducing the concept and its parts, acquire, wait and release, and which of them are better implemented in hardware or in software. It then passes to talk about Lock Implementation, how to evaluate their performance -latency, traffic, fairness and storage- and compares several implementations -T&S, TT&S-, discussing ways of improving performance and discussing their drawbacks, like deadlocks or excessive use of resources. After locks, the page covers barrier implementations and describes several implementations and improvements, as Sense-Reversal barriers, Combining Tree barriers, Tournament Barriers or Disseminating Barriers.

Reducing Locking Overhead

The second section is very well cover with three different wiki pages. The topic is Reducing Locking Overhead and it is about ways of reducing the load locks impose in the multi-process systems. <ref name=A48 /> starts by talking about Synchronization in Java, discussing its memory models, sharing and locks and Monitors, also discussed thoroughly in <ref name=A47 />. After that, also in <ref name=A47 />, problems with locks are discussed, one of them being the excessive use of system resources. The three pages, <ref name=A47 />, <ref name=A48 /> and <ref name=A49 /> bring up Thin Locks in a very similar way, describing the method and how they are implemented in Java objects, peeling the lock unnecessary layers to make them more efficient. <ref name=A48 /> nicely shows sections for the loop characteristics, the algorithm that they follow and examples, while <ref name=A47 /> and <ref name=A49 /> show code examples. Then, sections about Biased Locks are shown, a evolution of the thin locks that favors a particular thread over others. It is treated in a similar way as Thin Locks by the three pages. Finally <ref name=A47 /> dedicates a section to the Rogers' lock, an implementation of the biased lock in Java.

Chapter 10: Memory Consistency Models

Pages in this chapter: <ref name=A50>Spring_2012/10a_dr</ref><ref name=A51>Spring_2012/10a_jp</ref><ref name=A52>Spring_2012/10a_vm</ref><ref name=A53>Spring_2012/10b_sr</ref><ref name=A54>Spring_2012/ch10_sj</ref><ref name=A55>Spring_2011/ch10_MC</ref><ref name=A56>Spring_2011/ch10_sb</ref><ref name=A57>Spring_2011/ch10_sj</ref><ref name=A58>Spring_2011/ch10a_dc</ref>

Chapter 10 is one of the most popular ones. The Solihin<ref name=A1 /> chapter 10 explains the different models of memory consistency (all memory) in opposition to cache coherence (sinle memory) and explores the expectation of atomicity that the programmers expect from the system, explaining the consistency and relaxed consistency models. The wiki pages in this chapter complement the text with topics like the use of concurrency in JMM, or the use of prefetching to accelerate the performance of consistency models.

Java Memory Model (JMM)

The first of the Spring 2011 subjects is taken care of by <ref name=A55 /> and <ref name=A58 />, that get to talk about Java Memory Model (JMM) main aspects, motivation, and the use of concurrency-related building blocks. <ref name=A55 /> describes then the JMM specifications and rules and concepts like the 'volatile' variable, 'piggybacking' on synchronization and the importance of doing a correct initialization; to then pass to talk about Double-Checked Locking and ending with a mention to the JDK and how to write safe Java code.

Memory Consistency Models

The second Spring 2011 topic Memory Consistency Models. <ref name=A56 /> starts by introducing the Sequential Consistency Models and discusses their performance in multiprocessors and some hardware and software optimization techniques, like prefetching, speculative reads or the Shasha and Snir's algorithm. Then it passes to talk about Relaxed Consistency Models, and how we can make these models nimbler, stripping them of some of their parts and relaxing sequential consistency. After the concept is introduced, several models are listed, and real implementations, like like the IBM-370, IBM power PC or the DEC-Alpha ones are discussed. Then it moves into the Release Consistency Models, like the eager release or lazy release and compares and analyze their performance. Finally, a discussion about the comparison of all models, experiment results and some shortcomings of their use; end the page. Then it brings sequential consistence into the equation,

Prefetching and Consistency Models

The Spring 2012 pages approach this former chapter a bit differently and the section 10a, Prefetching and Consistency Models, focuses more in Prefetching. <ref name=A52 /> starts by introducing the concept and listing the types of prefetching techniques and how they work, guiding us with examples. Then it introduces sequential consistency into the equation and compares the results on working with combinations of both methods, explaining the reasons why these combinations have not been widely adopted. Following it repeats the exercise with using Release Consistency first and then Speculative Execution, lingering into the explanation of the former, ending with an practical example. It is worthy to note that <ref name=A50 /> and <ref name=A51 /> have both small sections about prefetching in multiprocessors.

Use of Consistency Models In Current Multiprocessors

The last section in this chapter is Use of Consistency Models In Current Multiprocessors and has just one page, <ref name=A53 />. The page opens introducing the topic and giving a background of current multiprocessor architectures, their trends and the challenges that they face, like scalability or latency. Then it goes into describing the an extensive list of consistency models that are being used by current machines, some of them already mentioned before in other pages. The list includes Address translation aware memory consistency, Causal consistency, Delta consistency, Entry consistency, Eventual consistency, Linearizability, PRAM consistency, Weak consistency and Strong consistency; with some of the methods coming with example code. Following, the page focuses in the performance impact, comparing several of those methods among them, reaching to the conclusion that weaker consistencies have a definite improvement in performance at the cost of complexity. Release consistency, Sequential consistency

Chapter 11: Distributed Shared Memory Multiprocessors

Pages in this chapter: <ref name=A59>Spring_2012/11a_hn</ref><ref name=A60>Spring_2012/11a_ht</ref><ref name=A61>Spring_2011/ch11_BB_EP</ref><ref name=A62>Spring_2011/ch11_DJ</ref><ref name=A63>Spring_2011/ch11_sw</ref>

Chapter 11 in the Solihin<ref name=A1 /> text explains what are the best ways to scale a shared memory system. A directory approach is one of the most scalable and hence DSM systems are developed in this section. Full-bit vector and SCCI protocols are fully described. The wiki pages in this section talk about the SCI and the performance of the DSM protocols, complementing the text. Also, the power point slides where the functioning of this last protocol is described, needed some enhancement and some teams have taken care of that.

Scalable Coherent Interface (SCI)

The Spring 2011 wiki pages on this chapter treat the subject of Scalable Coherent Interface (SCI). <ref name=A61 /> and <ref name=A63 /> are very similar and go through the topic almost parallely. They start by setting up a background of what SCI means in multiprocessors and the reasons behind it. Then they pass to describe the states that the protocol uses and state diagrams with the protocol transitions, with <ref name=A63 /> giving some examples and explaining the way the protocol work. Then both pages move into the topic of Race Conditions that this protocol shows due to early invalidations and how they can be prevented with various mechanisms; like atomic transactions, using the head node to carry on most of the coherence, or changing the directory structure. There are other possible race conditions, like concurrent deletion or simultaneous deletion and invalidation, that are presented with possible work-arounds.

Performance of DSM systems

The second topic in the chapter is 11a, Performance of DSM systems and both <ref name=A59 /> and <ref name=A60 /> are good pages. <ref name=A59 /> introduces the concept of distributed shared memory (DSM), how it works and how it is used. Then it discusses why a bus cannot be used, due to the desired scalability and message-passing is preferred to maintain coherence. Then it goes into describing the hardware architecture, that is, the systems where this can be deployed and used, being those CC-NUMA, the most interesting for us; COMA and Reflective Memory systems. Then they move into the software architecture, with the Fine-Grained approach description, the Shasta protocol and their optimizations and the Cashmere protocol and the challenged the last two face. Following the page analyzes the performance of these protocols, the system overhead, load balancing, scheduling and the communication overhead. <ref name=A60 /> also has a section on Parallel File Input/Output in Cohesion systems where the advantages of having the file management controlled by more than one node are discussed. A design, scheme and implementation are presented. A real example, the Cray X1, closes the page.

Improvements to directory-based cache-coherence animations

Last part of this chapter, Improvements to directory-based cache-coherence animations have no wiki pages, but a compendium of Power Point animations on Full-Vector and SCCI.

Chapter 12: Interconnection Network Architecture

Pages in this chapter: <ref name=A64>Spring_2011/ch12</ref><ref name=A65>Spring_2011/ch12_aj</ref><ref name=A66>Spring_2011/ch12_ar</ref><ref name=A67>Spring_2011/ch12_ob</ref><ref name=A68>Spring_2012/12b_jh</ref><ref name=A69>Spring_2012/12b_sb</ref>

New Interconnection Topologies

There are two groups of wiki pages in this chapter. The first one is New Interconnection Topologies and all but one of the pages can be included in this group. The pages are really all very similar, using even the same pictures, and it results very difficult to lean towards a particular one. <ref name=A66 /> looks like the most complete, starting by describing the concept and the parameters that we would be using to evaluate, like latency or bandwidth; and characterize, like diameter, bisection bandwidth , number of links and degree. Then introduces known topologies, like the Linear Array, the Ring, the Cube, Trees and Butterfly; passing afterwards to more complex structures that have evolved from them, like the D-Mesh, the Hypercube, the D-Torus and the Fat Tree and, then, makes a comparison among them. Next it moves into real implementations and its abilities and limitation, including again the Fat Tree, the Butterfly, Messes and Tori and the Hypercube. The subject is changed to Routing algorithms, deterministic, adaptive or a combination of them; passing into the problems that routing faces, like Deadlock, and ways or getting over them, as circular routing patterns or turn restriction models. Afterwards, it defines the router architecture and its evolution over time. The page ends with a reference to Fault Tolerant routing, models and mechanisms.

On-chip Interconnects

The next and last section is On-chip Interconnects. Though <ref name=A68 /> makes a mention to it, <ref name=A69 /> is the one that really develops the topic. It starts setting a background by saying how the grow in performance in the most recent architectures have shifted from faster single core processors to the many cores per die, the chip multiprocessor (CMP), that are popular nowadays. It gets into the topologies that are used in those architectures and the parameters that characterize them; passing to explain in more depth some of the most common, like 2-D Mesh, Concentration Mesh, Flattened Butterfly and Multidrop Express Channels (MECS), and comparing each other performance. Next it points out some commercial implementations by Intel, Tilera, ST Microelectronics and IBM. Last, it moves into Routing in a slightly different way that [66], defining first the General Routing Schemes, like store and forward, cut-through and deterministic or adaptive routing and then the dangers, like deadlocks and livelocks. Specific routing protocols implementations for SoC and then listed, that is, Source or Distributed Routing, Logic Based Distribute Routing (LBDR), Bufferless Deflection Routing (BLESS protocol), CHIPPER (Cheap-Interconnect Partially Permuting Router) and Dimension-order Routing. The page ends with current lines of research, like optical on-chip interconnects and Reconfigurable or Bio Network on Chip (NoC)

References

<references/>

Notes To This Preface

The commented wiki pages have been taken from the Expertiza repository for the course CSC506, classes of Spring 2011 and Spring 2012. Some of the links have been obviated because they were empty, repeated or invalid.