CSC/ECE 506 Spring 2010/summary

ECE633 Independent Study: Architecture of Parallel Computers
Karishma Navalakha

Abstract:

     There has been tremendous research and development in the field of multi-core architecture over the last decade. In such a dynamic environment it is very difficult for textbooks to cover the latest developments in the field. Wiki-written textbooks are an extremely handy tool for students to become acquainted and interested in ongoing research. In this independent study we explored an academic learning technique in which students learn the fundamental concepts of the subject through the assigned textbook and the lectures delivered by Prof. Gehringer in class. They can then build on this foundation, gather the latest information from varied online resources and technical papers, and summarize their findings in the form of wiki pages. Software to assist the students is also under development and was adopted in this course. We tried to enhance the quality of student-submitted wiki pages through peer review. Professor Gehringer and I constantly provided input to students to improve both the quality of their wiki pages and the quality of their reviewing. The software, developed under the guidance of Professor Gehringer, has been vital in overcoming the administrative hurdles involved in assigning topics to students, maintaining updates and tracking the progress of their writing, gathering feedback through peer review, and handling resubmitted work. All of this has been managed via the software in an organized fashion.

Experience with the wiki-written textbook:

     The software was first deployed in CSC/ECE 506, Architecture of Parallel Computers. This is a beginning masters-level course that is taken by all Computer Engineering masters students. It is optional for Computer Science students, but as it is one way to fulfill a core requirement, it is popular with them too. The recently adopted textbook for this course is the locally written Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems [Solihin 2009]. It did not make sense to have the students rewrite this excellent text, but the book concentrates on theory and design fundamentals, without detailed application to current parallel machines. We felt that students would benefit from learning how the principles were applied in current architectures. Furthermore, they would learn about the newest machines in this fast-changing field.

     After every chapter covered in class, two individuals or pairs of students were required to sign up to write the wiki supplement for that particular chapter. (That is, we solicited two supplements for each chapter, each of which could be authored by one or two students.) They were asked to add specific types of information that were not included in the chapter.

     Initially, students were not clear about the purpose of their wiki pages. The first pages they wrote had substantial duplication of topics covered in the textbook. Students were attempting to give a complete coverage of issues discussed in the chapter. We wanted them to concentrate instead on recent developments. Upon seeing this, we established the practice of having the first two authors of this paper (Gehringer and Navalakha) review the student work, along with three peer reviews from fellow students. A lot of review time was spent providing guidance on how to revise.

     At the beginning we gave the students complete freedom to explore resources for the topic they had chosen to write on. This was not very successful, as the students seemingly chose to read the first few search hits, which tended to provide an overview of the topic rather than in-depth information on particular implementations. Sometimes students were not aware that the information they found was already covered in the next chapter, which they had not yet read. The first round of reviews we gave students consisted mainly of making them aware of topics covered in later chapters; a lot of the effort that went into the initial drafts was thus wasted. After the first two sets of topics, we began to provide links to material that we wanted the students to pay attention to. Gehringer and Navalakha met weekly to discuss what to provide to students. We regularly consulted other textbooks, technology news, and the Web sites of major processor manufacturers, such as Intel and AMD. As the semester progressed, the quality of the initial submissions improved, and the students realized better returns for their effort.

     The quality of work seemed to improve as the semester progressed. A comparison of the grades for the wiki pages revealed that the average score for the first chapter written by each student was 82.8%, while the average for the second submission was 82.7%. The quality of the wiki pages had improved, but at the same time the peer reviewers became more demanding. Students were given more input for improving their work via peer review, so the improvement showed up in the final wiki pages produced rather than in the grades the students received. The initial wiki pages presented randomly collected data and were cluttered with diagrams and graphs; this information merely restated facts given in the textbook. The later wiki pages focused on a comparative study of present-day supercomputers produced by Intel, AMD, and IBM.

     For example, while writing the wiki for cache-coherence protocols, the students examined which protocol was favored by which company and why. They also discussed protocols introduced in the last two years, e.g., Intel's MESIF protocol. Such in-depth analysis made the wiki more appealing to readers. Gehringer and Navalakha provided additional reviews, which helped to steadily improve the quality of the wiki pages. These reviews gave the students insight into what was expected of them, which led to an increasing focus on current developments in peer reviewing. It was observed that later rounds of reviews included guidance similar to that received from Gehringer and Navalakha. The organization of the wiki pages and the volume of relevant data collected by students improved as the semester progressed.

     Electronic peer-review systems have been widely used to review student work, but never before, to our knowledge, have they been applied to assignments consisting of multiple interrelated parts with precedence constraints. The growing interest in large collaborative projects, such as wiki textbooks, has led to a need for electronic support for the process, lest the administrative burden on instructor and TA grow too large.

Chapter-wise learning from this independent study:

Chapter 1: Perspectives

     The supplement covers the interesting topic of supercomputer evolution. The wiki pages written for this topic include a good deal of data from the literature. They cover interesting topics that are not in the textbook, such as a timeline of supercomputers, the first supercomputer (ENIAC), Cray history, supercomputer hierarchical architecture, supercomputer operating systems, supercomputer cooling, and processor families. From the supplement we can see the increasing dominance of Intel's processors in the consumer market. It also concludes that Unix has been the platform for most of these supercomputers. Massively Parallel Processing (MPP) and Symmetric Multiprocessing (SMP) are the earliest widely used styles of multiprocessor machine architecture; they were displaced by constellation computing in the 2000s, and the field is currently dominated by cluster computing.

References:

http://www.top500.org/

The Future of Supercomputing: An Interim Report, by the National Research Council (U.S.) Committee on the Future of Supercomputing

Chapter 2: Parallel Programming Model

     The wiki supplement provides comparisons between data parallelism and task parallelism. Haveraaen (2000) notes that data-parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. The data parallel model may be combined with either the shared memory or the message passing model without conflict. The comparison concludes that combining the data parallel and message passing models reduces the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. SIMD (single-instruction, multiple-data) processors are specifically designed to run data parallel algorithms. Modern examples include CUDA processors developed by NVIDIA and the Cell processor developed by STI (Sony, Toshiba, and IBM).
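
     To make the contrast concrete, here is a minimal sketch of ours (not drawn from the supplement) in C++ with OpenMP: the first loop is data parallel, applying the same operation across an array, while the sections construct expresses task parallelism, running two different operations concurrently. The array names and sizes are purely illustrative.

    #include <cmath>
    #include <cstdio>

    int main() {
        const int N = 1000000;
        static double a[N], b[N];

        // Data parallelism: every thread applies the same operation
        // to a different portion of the same array.
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            a[i] = std::sqrt(static_cast<double>(i));

        // Task (functional) parallelism: different threads execute
        // different operations concurrently.
        #pragma omp parallel sections
        {
            #pragma omp section
            { for (int i = 0; i < N; ++i) b[i] = a[i] * 2.0; }  // task 1: scale
            #pragma omp section
            { std::printf("a[10] = %f\n", a[10]); }             // task 2: report
        }
        return 0;
    }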

References:

1. W. Daniel Hillis and Guy L. Steele, Jr., "Data parallel algorithms," Communications of the ACM, 29(12):1170-1183, December 1986.

2. Alexander C. Klaiber and Henry M. Levy, "A comparison of message passing and shared memory architectures for data parallel programs," in Proceedings of the 21st Annual International Symposium on Computer Architecture, April 1994, pp. 94-105.

Chapter 3: Shared Memory Parallel Programming

     In this wiki supplement, three kinds of parallelism, i.e., DOALL, DOACROSS, and DOPIPE, are discussed, with examples in the form of OpenMP code as in the textbook. The supplement adds depth by discussing parallel_for, parallel_reduce, parallel_scan, pipeline, and reduction, along with DOALL, DOACROSS, and DOPIPE, with respect to Intel Threading Building Blocks. It also compares DOPIPE, DOACROSS, and DOALL in POSIX Threads. Finally it concludes: Pthreads can express all of these kinds of parallelism, including functional parallelism, but the programmer must build specialized synchronization primitives and explicitly privatize variables, so more effort is needed to convert a serial program into a parallel one.

     OpenMP provides many performance-enhancing features, such as the atomic, barrier, and flush synchronization primitives. It is very simple to use OpenMP to exploit DOALL parallelism, but the syntax for expressing functional parallelism is awkward.
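
     As a rough illustration of these primitives, the sketch below (ours, not the supplement's) exploits DOALL parallelism with an OpenMP worksharing loop and uses atomic, barrier, and single. The explicit barrier is technically redundant, since the loop ends with an implicit one; it appears only to show the syntax.

    #include <cstdio>

    int main() {
        const int N = 1000;
        long sum = 0;

        #pragma omp parallel
        {
            // DOALL-style loop: iterations are independent and are
            // divided among the threads of the enclosing region.
            #pragma omp for
            for (int i = 0; i < N; ++i) {
                // 'atomic' serializes only this one update, which is
                // cheaper than a full critical section.
                #pragma omp atomic
                sum += i;
            }
            // All threads wait here before any proceeds (illustrative;
            // the worksharing loop above already ends with a barrier).
            #pragma omp barrier
            // Exactly one thread prints the result.
            #pragma omp single
            std::printf("sum = %ld\n", sum);
        }
        return 0;
    }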

     Intel TBB relies on generic programming, and it performs better with custom iteration spaces or complex reduction operations. It also provides generic parallel patterns for parallel while-loops, data-flow pipeline models, parallel sorts, and prefix computations, so it is better suited to cases that go beyond loop-based parallelism.
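
     For instance, a reduction over a custom iteration space can be expressed with tbb::parallel_reduce. The sketch below is a minimal example of ours, not code from the supplement, and uses the lambda-based form, which assumes C++11 support.

    #include <tbb/parallel_reduce.h>
    #include <tbb/blocked_range.h>
    #include <vector>
    #include <cstdio>

    int main() {
        std::vector<double> v(1000000, 1.0);

        // parallel_reduce splits the range recursively; each subrange
        // computes a partial sum, and partial results are then combined.
        double total = tbb::parallel_reduce(
            tbb::blocked_range<std::size_t>(0, v.size()),
            0.0,                                           // identity value
            [&](const tbb::blocked_range<std::size_t>& r, double acc) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    acc += v[i];                           // partial sum
                return acc;
            },
            [](double x, double y) { return x + y; });     // combine partials

        std::printf("total = %f\n", total);
        return 0;
    }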

References:

1. An Optimal Abstraction Model for Hardware Multithreading in Modern Processor Architectures

2. Intel Threading Building Blocks 2.2 for Open Source Reference Manual

3. POSIX Threads Programming by Blaise Barney, Lawrence Livermore National Laboratory


Chapter 6: Introduction to Memory Hierarchy Organization

     The wiki supplement adds insight on this topic by discussing shared memory multiprocessors, write policies, and replacement policies. The Greedy Dual Size (GDS) and Priority Cache (PC) replacement policies are additional subtopics the students shed light on. It also gives definitions of Intel's Trace Cache and Smart Cache techniques. The most important takeaway from this topic is the write policies used in recent multi-core architectures. For example, Intel's IA-32 and IA-64 architectures implement write combining, write collapsing, weakly ordered writes, uncacheable and write-no-allocate stores, and non-temporal techniques in their caches. AMD uses cache exclusion, unlike Intel's cache inclusion. Sun's Niagara and SPARC processors use write-through (WT) L1 caches, with allocate-on-load and no-allocate-on-store policies.
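
     To illustrate the basic distinction these policies build on, here is a hypothetical single-line "cache" sketch of ours: under write-through every store updates memory immediately, while under write-back memory is updated only when a dirty line is evicted. All class and member names are invented for illustration.

    #include <cstdint>
    #include <vector>

    // A one-line "cache" where the tag is simply the address.
    struct Line { std::uint64_t tag; std::uint32_t data; bool valid; bool dirty; };

    class TinyCache {
    public:
        enum Policy { WriteThrough, WriteBack };
        TinyCache(Policy p, std::vector<std::uint32_t>& mem) : policy_(p), mem_(mem) {}

        void write(std::uint64_t addr, std::uint32_t value) {
            line_ = Line{addr, value, true, policy_ == WriteBack};
            if (policy_ == WriteThrough)
                mem_[addr] = value;            // memory updated on every store
            // Write-back: memory is touched only when the line is evicted.
        }

        void evict() {
            if (line_.valid && line_.dirty)
                mem_[line_.tag] = line_.data;  // deferred write-back
            line_.valid = false;
        }

    private:
        Policy policy_;
        std::vector<std::uint32_t>& mem_;
        Line line_{0, 0, false, false};
    };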

References:

1. http://download.intel.com/technology/architecture/sma.pdf

2. http://www.intel.com/Assets/PDF/manual/248966.pdf

3. http://www.intel.com/design/intarch/papers/cache6.pdf

Chapter 7: Introduction to Shared Memory Multiprocessors

     Shared-memory multiprocessors run into several problems that are more pronounced than in their uniprocessor counterparts. The Solihin text used in this course goes into detail on three of these issues: cache coherence, memory consistency, and synchronization. The goal of this wiki supplement is to discuss these three issues and what can be done to ensure that instructions are handled in both a timely and efficient manner, and in a manner consistent with what the programmer intends. Memory consistency is discussed by comparing ordering on a uniprocessor with ordering on a multiprocessor. It concludes that on a multiprocessor much more care must be taken to ensure that all of the loads and stores are committed to memory in a valid order. Synchronization is discussed as it applies to OpenMP and fence insertion. Other methods, such as test-and-set and direct interrupts to another core, are also briefly discussed. The programmer (or compiler) is responsible for knowing which synchronization directives are available on a given architecture and implementing them in an efficient manner. The supplement also surveys the synchronization instructions commonly used in popular processor architectures: for example, SPARC V8 uses a store barrier, Alpha uses a memory barrier and a write memory barrier, whereas Intel x86 uses lfence (load fence) and sfence (store fence).
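
     As a concrete illustration of fence insertion, the following sketch of ours uses portable C++11 fences rather than raw lfence/sfence instructions; the release fence plays the role of a store fence and the acquire fence that of a load fence in this producer-consumer idiom.

    #include <atomic>
    #include <thread>
    #include <cstdio>

    int data = 0;
    std::atomic<bool> ready(false);

    void producer() {
        data = 42;                                            // ordinary store
        std::atomic_thread_fence(std::memory_order_release);  // "store fence"
        ready.store(true, std::memory_order_relaxed);         // publish flag
    }

    void consumer() {
        while (!ready.load(std::memory_order_relaxed)) {}     // spin on flag
        std::atomic_thread_fence(std::memory_order_acquire);  // "load fence"
        std::printf("data = %d\n", data);                     // must print 42
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
        return 0;
    }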

References:

1. https://wiki.ittc.ku.edu/ittc/images/0/0f/Loghi.pdf

2. http://portal.acm.org/citation.cfm?id=782854


Chapter 8: Bus-Based Coherent Multiprocessors

     The wiki page discusses bus-based cache coherence in real machines. It classifies the cache coherence protocols by the year they were introduced and the processors that use them. The MSI protocol was first used in the SGI IRIS 4D series. In the Synapse protocol the M state is called D (Dirty), but it works the same way as in MSI. MSI has a major drawback in that each read-then-write sequence incurs two bus transactions, regardless of whether the cache line is stored in only one cache. The Pentium Pro microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support SMP and MESI. The MESIF protocol, used in the latest Intel multi-core processors, was introduced to accommodate the point-to-point links used in the QuickPath Interconnect. MESI came with the drawback of consuming considerable time and bandwidth; MOESI was AMD's answer to this problem. MOESI has become one of the most popular snoop-based protocols and is supported in the AMD64 architecture. The AMD dual-core Opteron can maintain cache coherence in systems of up to 8 processors using this protocol. The Dragon protocol is an update-based coherence protocol that does not invalidate other cached copies. It was developed by the Xerox Palo Alto Research Center (Xerox PARC), a subsidiary of Xerox Corporation, and was used in the Xerox PARC Dragon multiprocessor workstation.
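
     To make the MSI discussion concrete, here is a minimal state-transition sketch of ours (not taken from the wiki page). Note how a read miss followed by a write incurs both a BusRd and a BusRdX: exactly the two-transaction drawback described above.

    // MSI transitions for a single cache line, driven by local
    // processor accesses and by transactions snooped on the bus.
    enum class State { Modified, Shared, Invalid };

    State onProcessorRead(State s, bool& issueBusRd) {
        issueBusRd = (s == State::Invalid);        // read miss: issue BusRd
        return s == State::Invalid ? State::Shared : s;
    }

    State onProcessorWrite(State s, bool& issueBusRdX) {
        // Both S->M and I->M need an exclusive transaction (BusRdX),
        // so a read-then-write sequence costs two bus transactions.
        issueBusRdX = (s != State::Modified);
        return State::Modified;
    }

    State onSnoopBusRd(State s) {
        // Another cache reads: a Modified line is flushed and demoted.
        return s == State::Modified ? State::Shared : s;
    }

    State onSnoopBusRdX(State s) {
        return State::Invalid;  // another cache writes: invalidate our copy
    }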

References:

1. Cache consistency with MESI on Intel processor

2. AMD dual core Architecture

3. Silicon Graphics Computer Systems

4. Synapse tightly coupled multiprocessors: a new approach to solve old problems

5. Dragon Protocol


Chapter 9: Hardware Support for Synchronization

     The supplement classifies synchronization techniques by implementation. Hardware synchronization uses locks, barriers, and mutual exclusion. Software synchronization examples include ticket locks and queue-based MCS locks. Mutual exclusion is implemented through the execution of atomic instructions.

     Some common examples include test-and-set, fetch-and-increment, exchange, and compare-and-swap. Another type of lock not covered in the text, the "hand-off" lock, was discussed in detail by the students. The supplement also discusses reasons why a programmer should attempt to write programs in a way that avoids locks. APIs exist for parallel architectures that provide specific types of synchronization; if these APIs are used the way they were designed, performance can be maximized while overhead is minimized. Load-Locked (LL) and Store-Conditional (SC) are a pair of improved hardware primitives used for lock-free read-modify-write operations.
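
     As a concrete example of building a lock from such a primitive, here is a minimal test-and-set spinlock of ours using C++11's std::atomic_flag; the class name is illustrative.

    #include <atomic>

    // Test-and-set spinlock: atomic_flag::test_and_set is the atomic
    // read-modify-write primitive described above.
    class SpinLock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            // Spin until the flag was observed clear and we set it.
            while (flag.test_and_set(std::memory_order_acquire)) { /* busy-wait */ }
        }
        void unlock() {
            flag.clear(std::memory_order_release);  // hand the lock back
        }
    };

Each thread brackets its critical section with lock() and unlock(); under contention every waiter spins on the same flag, which illustrates the kind of overhead that motivates writing programs to avoid locks where possible.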

     A detailed description of the combining tree barrier, tournament barrier, and dissemination barrier is included. One of the interesting topics discussed in this wiki supplement is the performance evaluation of different barrier implementations. It shows that the centralized blocking barrier does not scale with the number of threads, and that contention increases as threads are added.
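
     To show where the contention comes from, here is a sketch of ours of a centralized sense-reversing barrier in C++11: every arriving thread updates the same shared counter, which is the bottleneck that the tree-based and dissemination designs avoid.

    #include <atomic>

    // Centralized sense-reversing barrier. All threads decrement one
    // shared counter; the last arrival resets it and flips the sense.
    class CentralBarrier {
        std::atomic<int> count;
        std::atomic<bool> sense{false};
        const int nthreads;
    public:
        explicit CentralBarrier(int n) : count(n), nthreads(n) {}

        // Each thread keeps its own localSense, initially false.
        void wait(bool& localSense) {
            localSense = !localSense;
            if (count.fetch_sub(1) == 1) {   // last thread to arrive
                count.store(nthreads);       // reset for the next episode
                sense.store(localSense);     // release all waiters
            } else {
                while (sense.load() != localSense) { /* spin */ }
            }
        }
    };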

References:

1. http://www2.cs.uh.edu/~hpctools/pub/iwomp-barrier.pdf

2. http://www.statemaster.com/encyclopedia/Deadlock

3. http://www.ukhec.ac.uk/publications/reports/synch_java.pdf


Chapter 10: Memory Consistency Models

References:

1. Shared Memory Consistency Models

2. Designing Memory Consistency Models For Shared-Memory Multiprocessors

3. Consistency Models


Chapter 11: Distributed Shared Memory Multiprocessors

     The cache coherence protocol presented in Chapter 11 of Solihin (2009) is simpler than most real directory-based protocols. This textbook supplement presents the directory-based protocols used by the DASH and Alewife multiprocessors. It concludes with an argument for why complexity might be undesirable in cache coherence protocols. The DASH multiprocessor uses a two-level coherence protocol, relying on a snoopy bus to ensure cache coherence within a cluster and a directory-based protocol to ensure coherence across clusters. The protocol uses a Remote Access Cache (RAC) at each cluster, which essentially consolidates memory blocks from remote clusters into a single cache on the local snoopy bus. When a request is issued for a block from a remote cluster that is not in the RAC, the request is denied, but it is also forwarded to the owner, which supplies the block to the RAC. Eventually, when the requester retries, the block will be waiting in the RAC. Read and read-exclusive operations on a DASH processor are discussed in detail. The supplement also discusses two race conditions that arise mainly on a DASH processor. The first occurs when a Read from requester R is forwarded from home H to owner O, but O sends a Writeback to H before the forwarded Read arrives. Another possible race occurs when the home node H replies with data (ReplyD) to a Read from requester R but an invalidation (Inv) arrives first. LimitLESS is the cache coherence protocol used by the Alewife multiprocessor. Unlike the DASH multiprocessor, the Alewife multiprocessor is not organized into clusters of nodes with local buses, and therefore cache coherence throughout the system is maintained through the directory.
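
     The deny-forward-retry pattern can be sketched in a few lines. The following is a purely illustrative model of ours, not DASH code, and every name in it is hypothetical: a RAC miss triggers a forward to the owner, and the requester simply retries until the block has arrived.

    #include <unordered_map>
    #include <cstdio>

    struct RAC {
        std::unordered_map<int, int> blocks;   // block id -> data

        // A read that misses is denied (returns false) but also
        // forwarded to the owner, which deposits the block here.
        bool read(int block, int& data) {
            auto it = blocks.find(block);
            if (it == blocks.end()) {
                forwardToOwner(block);
                return false;                  // requester must retry
            }
            data = it->second;
            return true;
        }

        void forwardToOwner(int block) {
            blocks[block] = fetchFromOwner(block);   // owner supplies block
        }

        int fetchFromOwner(int block) { return block * 10; }  // stand-in data
    };

    int main() {
        RAC rac;
        int data;
        while (!rac.read(7, data)) { /* retry until the RAC holds block 7 */ }
        std::printf("block 7 = %d\n", data);
        return 0;
    }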

References:

1. Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Anoop Gupta, and John Hennessy (1990). "The directory-based cache coherence protocol for the DASH multiprocessor." In Proceedings of the 17th Annual International Symposium on Computer Architecture.

2. David Chaiken, John Kubiatowicz, and Anant Agarwal (1991). "LimitLESS directories: A scalable cache coherence scheme." ACM SIGPLAN Notices.

Chapter 12: Interconnection Networks

     Advances in multiprocessors, parallel computing, networking, and parallel computer architectures demand very high performance from interconnection networks. Because of this, interconnection network structure has changed over time in an effort to meet higher bandwidth and performance requirements. The students discussed the criteria to be considered when choosing the best network, including performance requirements, scalability, incremental expandability, partitionability, simplicity, distance span, physical constraints, reliability and repairability, expected workloads, and cost constraints. The supplement provides an in-depth discussion of the classification of interconnection networks. Shared-medium networks include the token ring, token bus, and backplane bus. Direct networks include the mesh, torus, hypercube, tree, cube-connected cycles, and de Bruijn and star graph networks. Indirect networks include regular topologies such as the crossbar network and multistage interconnection networks, and hybrid networks such as multiple backplane buses, hierarchical networks, cluster-based networks, and hypergraph topologies. It also discusses routing algorithms and the deadlock, starvation, and livelock problems associated with them. These topics are covered in an extremely detailed way, with a diagrammatic representation of every topology.
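
     As a small worked example of the kind of deterministic routing algorithm the supplement covers, here is a sketch of ours of dimension-order (XY) routing on a 2D mesh. Routing every packet in a fixed dimension order is one classic way to break the cyclic channel dependences that lead to deadlock.

    #include <cstdio>

    // Dimension-order (XY) routing: correct the X coordinate first,
    // then the Y coordinate, printing each hop along the way.
    void xyRoute(int srcX, int srcY, int dstX, int dstY) {
        int x = srcX, y = srcY;
        while (x != dstX) {
            x += (dstX > x) ? 1 : -1;
            std::printf("hop to (%d,%d)\n", x, y);
        }
        while (y != dstY) {
            y += (dstY > y) ? 1 : -1;
            std::printf("hop to (%d,%d)\n", x, y);
        }
    }

    int main() {
        xyRoute(0, 0, 2, 3);  // route a packet from node (0,0) to node (2,3)
        return 0;
    }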

References:

1. http://www.top500.org/2007_overview_recent_supercomputers/sci

2. http://www.cs.nmsu.edu/~pfeiffer/classes/573/notes/topology.html