CSC/ECE 506 Spring 2010/6 DD

MULTICORE Architecture

The term multicore means two or more independent cores. The term architecture means the design/layout of a particular idea. Hence the combined term multicore architecture simply means the integration of cores and its layout. For e.g. dual core means each processor contains 2 independent cores.

NUMA architecture

The term NUMA is an acronym for Non Uniform Memory Access. It is one way of organizing the MIMD (Multiple Instruction Multiple Data: independent processors are connected together to form a multiprocessor system.) architecture. NUMA is also known as distributed shared memory. Each processor has an independent cache, memory and they are connected to a common interconnection network like a crossbar, mesh, hypercube, etc.

A crossbar is a switch which can connect processors to memory in any order. However, it becomes expensive to use, hence as a compromise between crossbar and shared bus sytems, multistage interconnection networks are used (e.g. network on chip interconnects)

CACHE

'Cache is a storage device having small memory. Why were caches introduced? ^[1]
The CPU works at clock frequencies of 3 GHZ and most common RAM speeds are about 400 MHz. Hence, there needs to be a transfer of data to and fro from RAM and CPU. This will increase the latency and CPU will remain idle most times. The main reason of introducing caches was to reduce this latency to access the data from main memory. These small buffers (caches) operate at higher speeds than normal RAM and hence data can be accessed quickly from the cache.

Memory Hierarchy^[2]

The cache which is closest to CPU is called Level 1 Cache or L1 cache. It is also known as primary cache. The size is between 8 KB – 128 KB. The next level is L2 cache (secondary cache) whose size is between 256 KB –1024 KB. If the data that CPU is looking for, is not found in L1 cache, it'll look in L2 cache,then in the main memory, then finally in an external storage device.

This leads us to the introduction of two new terms: cache hit and cache miss. The data searched for by processor, if found in cache, results in cache hit, and if not found in cache, results in cache miss. With the introduction of multiple cache levels, it can be individually stated as L1 cache hit/miss, L2 cache hit/miss and so on.

Most modern computers use the following memory hierarchy. The figure below shows an uni processor memory hierarchy & a recent dual core memory hierarchy

Uniprocessor memory hierarchy (L) ^[3] & Dual core memory hierarchy (R) ^[4]

Previously the processors had an unified L1 cache i.e. data and instruction cache were unified in one. If the Instruction fetch unit wants to fetch data from the cache and the memory unit wants to load an address simultaneously, then the memory operation is given priority and the fetch unit is stalled. Hence, L1 cache was split into L1 data cache (L1 – D) and L1 instruction cache (L1 -I) so that it helps in concurrency. One more advantage is fewer number of ports will be required. L2 cache is still unified. As the latency difference between main memory and the fastest cache has become larger, some processors have begun to utilize L3 cache, some don’t. Hence the above figure shows 2 boxes, one which includes L1 and L2, and the other which combines L3 with them. The L3 cache is roughly about 4 MB – 32 MB in size.

In addition to these 3 levels of cache, some multicore processors use: Inclusion property, Victim cache, Translation LookAside Buffer, Trace Cache, Smart Cache (Intel)

Inclusion Property

Here, the larger lower level cache (L2) includes all the data of upper level (L1) cache. The advantage of this property is that if a value is not found in the lower level cache, it is guaranteed it is not present in the upper level cache, which reduces the amount of checking. The disadvantage is if L2 cache is smaller in size, most of it contains only L1 cache data and hardly any new data, defeating the purpose of another level of cache altogether.

Victim Cache

A victim cache is a cache used to hold blocks evicted from a CPU cache upon replacement. When main cache evicts a block, the victim cache will take the evicted block. This block is called the victim block. When the main cache misses, it searches the victim cache for recently evicted blocks. So, a hit in victim cache means the main cache doesn’t have to go to the next level of memory.

Translation LookAside Buffer

The TLB is a small cache holding recently used virtual-to-physical address translations. TLB is a hardware cache that sits alongside the L1 instruction and data caches.

Trace Cache

The trace cache is located between the decoder unit and the execution unit. Thus the fetch unit grabs data directly from L2 memory cache.

Smart Cache

Intel Advanced Smart Cache^[5] is a multi-core optimized cache that improves performance by increasing the probability that each execution core of a dual-core processor can access data from a higher-performance, more-efficient cache subsystem. To accomplish this, Intel Core micro architecture shares the Level 2 (L2) cache between the cores. If one of the cores is inactive, the other core will have access to the full cache. It provides a peak transfer rate of 96 GB/sec (at 3 GHz frequency).

Parameters of a cache:

1. Size: cache data storage.
2. Associativity: Number of blocks in a set
3. Block Size: Total number of bytes in a single block.
A set is a fancy term used for the row of a cache.

PLACEMENT POLICY

The placement policy is the first step in managing a cache. This decides where a block in memory can be placed in the cache. The placement policies are:

Direct mapped

If each block has only one place where it can appear in a cache, it is called direct- mapping. It is also known as one way set associative.^[6]
The main memory address (32 bits) is broken down as: (Assumed 8 blocks and 8 lines in cache.)

Number of offset bits = log 2 (Block Size)
Number of index bits = log 2 (Number of sets)
Number of Tag bits = 32- offset bits- index bits

Each line has its own tag associated with it. The index tells which row of cache to look at. To search for a word in the cache, compare the tag bits of the address with the tag of the line. If it matches, the block is in the cache. It is a cache hit. Then select the corresponding word from that line.

Fully Associative

If any line can store the contents of any memory location, it is known as fully associative cache. Hence, no index bits are required.
The main memory address (32 bits) is broken down as:

It is also known as Content Addressable Memory. To search for a word in the cache, compare the tag bits of the address with the tag of all lines in the cache simultaneously. If any one matches, the word is in the cache. Then select the corresponding word from that line.

Set Associative

Hence, a compromise between the fully associative cache and the direct mapping technique is made, known as set associative mapping. Almost 90% of the current processors use this mapping. Here the cache is divided into ‘s’ sets, where s is a power of 2. The associativity can be found by using the following relation: Associativity = Cache size/ (Block Size * Number of sets). For an associativity of ‘n’, the cache is ‘n- way’ set associative.

2 way set associative mapping ^[3]

Here the cache size is the same, but the index field is 1 bit shorter. Based on the index bits, select a particular set. To search for a word in the cache, compare the tag bits of the address with the tag bits of the line in each set (here 2). If it matches with any one of the tags, it is a cache hit. Else, it is a cache miss and the word has to be fetched from main memory.

Conceptually, the direct mapped and fully associative caches are just "special cases"^[6] of the N-way set associative cache. If you make n=1, it becomes a "1-way" set associative cache. i.e. there is only one line per set, which is the same as a direct mapped cache. On the other hand, if you make ‘n’ to be equal to the number of lines in the cache, then you only have one set, containing all of the cache lines, and each and every memory location points to that set. This gives us a fully associative cache.

REPLACEMENT POLICY

This decides which block needs to be evicted to make space available for a new block entry. The replacement policies used are: LRU, Psuedo LRU, MRU, LFU,FIFO, OPT, random, etc.
The most commonly used is LRU (Least recently used replacement policy). It works as follows: ^[3]

1. Make small counters per block in a set and the number of bits in each counter = log2 (Associativity)
2. If there is a cache hit, make the current block’s counter value as 0, which indicates that it is the most recently used one. Increment the counters of the other blocks whose counter values are less than the referenced block’s old counter value
3.If there is a cache miss, make the newly allocated block’s counter value as 0 and then increment the counters of all other blocks.

One eg: trace ABCEF. (0-2 indicate counter numbers)

0	1	2
2	0 B	1 A
0 C	1 B	2 A
1 C	2 B	0 E
2 C	0 F	1 E

To start with , the first three steps only load A, B and C. When E came in, the block with the highest count here 2, block A was evicted since it was least recently used and the value of other counters were incremented accordingly. Similarly, when F came in, block B had the highest counter value and it was evicted to be replaced by block F.

For caches with greater associativity, the implementation cost of LRU increases. So, a variation called Pseudo-LRU is used.

Exactly in contrast to LRU, a MRU (most recently used) replacement policy also can be used. This is used when the recent data is not likely to be accessed in the near future. The number of hits are more in MRU than LRU.

LFU (Least Frequently Used) policy counts the number of times an item is used and replaces the one with the lowest count.

Adaptive replacement^[15] can be used as a tradeoff between LRU and LFU.

FIFO policy simply removes the cache block which was brought in earliest in the cache, irrespective of it being least, most recently used.

WRITE POLICY

Write Through

Whenever a value is written in the cache, it is immediately written in the next level of memory hierarchy. The main advantage is that both levels have the recent updated value at all times. No dirty bit is required. The disadvantage is that since it writes every time, it consumes a lot of bandwidth. A bit stored in the cache can flip its value when struck by alpha particles resulting in soft errors^[4]. The effect of this is that the value in the cache is lost. Since, the lower level of cache contains the same updated value, it is safe to discard the block, as it can be easily re fetched. So, only error detection is enough.

Write Back

The next level of memory hierarchy is not updated as soon as a value is written in the cache, but is updated only when a block is evicted from the cache. To cope up with the issue of having the most recent value, a dirty bit is associated with the cache block. When there is a write to the cache, the dirty bit is set 1. So, if this block were to be evicted, indication of dirty bit being 1, will make the block to be written back to the main memory. When a new block is written in cache, the dirty bit is made 0. Since, lower level of cache does not contain the same updated value, error correction is necessary.

Thus in most processors, L1 cache uses a WT policy and L2 cache uses a WB policy.

Write Allocate policy

On a write miss, the block is brought back into the cache before it is written. A write back policy generally uses write allocate.

Write No Allocate Policy:

On a write miss, the write is propagated to the lower level of memory hierarchy but it is not brought back into the cache. A write through policy generally uses write no allocate.

RECENT MULTI-CORE ARCHITECTURES

Cores in multi-core systems may implement architectures such as superscalar, VLIW, vector processing, SIMD, or multithreading.

Super scalar

Superscalar architecture implements Instruction level parallelism within a single processor. A superscalar processor executes more than one instruction during a clock cycle.

Very long instruction word (VLIW)

VLIW^[7] refers to a CPU architecture designed to take advantage of instruction level parallelism (ILP). It includes out of order execution. It is a type of MIMD. VLIW CPUs offer significant computational power with less hardware complexity (but greater compiler complexity) than is associated with most superscalar CPUs. One VLIW instruction encodes multiple operations; specifically, one instruction encodes at least one operation for each execution unit of the device.

Simultaneous Multithreading

SMT is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits many independent threads of execution to make better utilization of the resources provided by modern processor architectures.

The following table shows the type of cache structure used by recent processors.^[8]

The above tables are not directly cited from anywhere, information has been found and framed.

RECENT NUMA ARCHITECTURES

1. 4 Opteron 875 dual core processors: 1 L1- D cache per core (2 per socket), size: 65 KB ,1 L1- I cache per core, size: 65 KB ,1 L2 cache per core (2 per socket), size: 1MB

2.AMD Hammer: L1: 128 KB, 2 way set assoc & L2: 1 MB. It also uses a large TLB and a victim cache.

3. Altix: The Altix UV supercomputer architecture combines a development of the NUMAlink interconnect used in the Altix 4000 (NUMAlink 5) with 4/6/8-core "Nehalem-EX" Intel Xeon processors. Altix UV systems run either SuSE Linux Enterprise Server or Red Hat Enterprise Linux, and scale from 32 to 2,048 cores with support for up to 16 TB of shared memory in a single system. It came in the end of 2009.

Other examples are: IBM p690, SGI Origin. Here are two different structures.

Tilera TILE 64 consists of a mesh network of 64 tiles. It is based on VLIW instruction set. Each of the cores (tile) has its own L1 and L2 cache plus an overall virtual L3 cache, having mutual inclusion property. It was developed in 2007.
Tilera, TILE- Gx is a future multi core processor consisting of a mesh network of 100 cores. It is a 64 bit core (3 issue) L1- I: 32 KB/ core, L1- D: 32 KB/ core, L2: 256 KB/ core, L3: 26 MB/ chip.

WRITE POLICIES used in recent multi core architectures

1. Intel Pentium Processors: ^[10] Mainly 2 bits control the cache. CD (cache disable) and NW (not write through)

CD= 1: cache disabled, CD=0: enables the cache NW= 0: Write Through, NW=1: Write Back

A WB ^[11] line is always fetched into the cache if a cache miss occurs.
A WT ^[11]line is not fetched into the cache on a write miss. For the Pentium Pro processor, a WT hit to the L1 cache updates the L1 cache. A WT hit to L2 cache invalidates the L2 cache.
A WP ^[11] (Write Protected) line is cacheable, but a write to it cannot modify the cache line. A WP line is not fetched into the cache on a write miss. For the Pentium Pro processor, a WP hit to the L2 cache invalidates the line in the L2 cache.
An UC ^[11] (Uncacheable) line is not put into the cache. A UC hit to the L1 or L2 cache invalidates the entry. Cache consistency: MESI (Modified, Exclusive, Shared, Invalid) policy is used.

2.Intel IA 32 IA64 architecture: ^[12] In IA 32, the number of processor sharing the cache is one plus value represented by CPUID.4.EAX [25:14]

Write Combining: Successive writes to the same cache line are combined. Write Collapsing: Successive writes to the same bytes result in only the last write being visible. Weakly Ordered: No ordering is preserved between WC stores or between other loads or stores. Uncacheable & Write No Allocate: Stored data is written around the cache and it uses Write No Allocate. Non-temporal: Data which is referenced once and not reused in the immediate future.

If the programmer specifies a non-temporal store to UC or WP memory types, then the store behaves like an uncacheable store. There is no conflict if a non temporal store to WC is specified. If the programmer specifies a non-temporal store to WB or WT memory types, and if the data is in the cache, will ensure consistency. However, if it is a miss, the transaction is subject to all WC memory semantics, which will not write allocate.

3.Intel's Celeron and Xeon based on Netburst microarchitecture (i.e. based on Pentium 4) will have the same specs as Pentium 4 but with different L2 cache size, while Celeron and Xeon based on Core microarchitecture (i.e. based on Core 2 Duo) will have the same specs as Core 2 Duo but with different L2 cache size.

Intel Xeon Clowertown: L1: 8 way associative & L2: 16 way associative.
Intel Celeron using Intel Core 2 architecture: L2: 16 way associative & L1- D: 8 way associative & L1- I: 8 way associative. It also has 16 entry DTLB.

4. Intel RAID: These use Write Back policy.

5. Intel Core microarchitecture uses FIFO replacement policy.

6. AMD uses cache exclusion unlike Intel’s cache inclusion

7. Sun's Niagara and SPARC use L1 caches as WT, with allocate on load and noallocate on stores.

8. In general, WT invalidate protocols gives better performance than WB- MESI protocols. But most modern processors use WB policy.

REFERENCES

[1]: http://www.karbosguide.com/books/pcarchitecture/chapter10.htm
[2]: http://www.real-knowledge.com/memory.htm
[3]: Computer Design & Technology- Lectures slides by Prof.Eric Rotenberg
[4]: Fundamentals of Parallel Computer Architecture by Prof.Yan Solihin
[5] : http://download.intel.com/technology/architecture/sma.pdf
[6]: http://www.pcguide.com/ref/mbsys/cache/funcMapping-c.html
[7]:http://en.wikipedia.org/wiki/Very_Long_Instruction_Word
[8]: http://en.wikipedia.org/wiki/Multi-core_processor
[9]: http://en.wikipedia.org/wiki/Power_Architecture
[10]: http://www.intel.com/design/intarch/papers/cache6.pdf
[11]: http://download.intel.com/design/archives/processors/pro/docs/24269001.pdf
[12]: http://www.intel.com/Assets/PDF/manual/248966.pdf
[13]: http://www.intel.com/pressroom/kits/quickreffam.htm
[14]: http://www.hardwaresecrets.com
[15]: http://en.wikipedia.org/wiki/Cache_algorithms#Pseudo-LRU

CSC/ECE 506 Spring 2010/6 DD

Contents

MULTICORE Architecture

NUMA architecture