2012-03-04T15:20:49Z

Sbasu3: /* Local Bus Controller */

==Introduction ==

Processor speeds have been increasing much faster than memory speeds, so data transfers between the processor and the memory system must be managed in a way that keeps slow memory from limiting processor performance. A read generally forces the processor to wait until the data returns before it can resume execution, but a write has no such requirement. This is where a write buffer (WB) comes in: the processor hands the write to the buffer and continues executing, while the buffer takes responsibility for completing the write.

===Write Buffers in Uni-processors===

A pending write is placed in a buffer implemented as a FIFO queue, so that writes are performed in the order in which they were issued. In a uni-processor that exploits Instruction Level Parallelism (ILP), writes may also be issued out of order, provided that hardware or software mechanisms check the writes for any dependences in the instruction stream. Figure 1 shows a cache-based single-processor system with a write buffer.

[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1. Uni-processor cache-based system with write buffer]]

===Write Buffer Issues in Multiprocessors===

The same design can be extended to a multiprocessor system, as shown in Figure 2. Each processor has its own private cache and a write buffer associated with that cache. The caches are connected to each other and to main memory by an interconnect.

[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache-based multiprocessor system with write buffer]]

As the figure above shows, each processor performs a write by pushing it into its write buffer, and the write buffer then completes the write to the cache or main memory. If the processor issues another write to the same address while the earlier one is still buffered, the new value overwrites the old one in the write buffer. Similarly, if the processor issues a read to that address, it reads the value from the write buffer rather than from the cache or main memory.
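The buffering behaviour described above can be sketched in a few lines of Python. This is only an illustrative model, not a description of any real hardware: the class name <code>WriteBuffer</code>, its methods and the dictionary standing in for memory are all assumptions made for the example. It shows a later write overwriting a buffered write to the same address, and a read being served from the buffer.

```python
from collections import OrderedDict


class WriteBuffer:
    """Hypothetical FIFO write buffer that merges same-address writes and forwards reads."""

    def __init__(self, memory):
        self.memory = memory            # backing store: dict of address -> value
        self.pending = OrderedDict()    # address -> value, oldest entry first

    def write(self, addr, value):
        # A later write to the same address overwrites the buffered value
        # in place; the entry keeps its original position in the queue.
        self.pending[addr] = value

    def read(self, addr):
        # Serve the read from the buffer if a write to this address is
        # still pending, otherwise fall through to memory.
        if addr in self.pending:
            return self.pending[addr]
        return self.memory.get(addr, 0)

    def drain_one(self):
        # Retire the oldest pending write to memory.
        if self.pending:
            addr, value = self.pending.popitem(last=False)
            self.memory[addr] = value


memory = {}
wb = WriteBuffer(memory)
wb.write(0x10, 1)
wb.write(0x20, 7)
wb.write(0x10, 2)          # overwrites the buffered write to 0x10
forwarded = wb.read(0x10)  # served from the buffer, not from memory
wb.drain_one()             # retires the oldest entry, the write to 0x10
```

In this sketch the processor never waits for memory: <code>write</code> returns immediately, and <code>drain_one</code> stands in for whatever mechanism retires buffered writes in the background.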

===The Coherence Problem===

Consider a write (ST_A) that processor I has placed in its write buffer WB_I and that is still waiting to be performed. Because the write has not yet reached the cache or main memory, the other processors know nothing about the change to address A. A read of address A by another processor therefore returns the value from that processor's own write buffer, its own cache, or main memory (depending on whether it hits or misses in the cache), and none of these holds the new value. The designer must employ protocols that preserve the required ordering of memory operations and that propagate a write made in one cache to all the other processor caches.

==Sequential Consistency==

In a multiprocessor, it is not enough to check dependences between writes at the local level only. Because data is shared and events in one processor may affect the outcome in another, the order of writes and the data dependences must be preserved across all processors.

“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”

There are different approaches to meeting this requirement, depending on the outcome the program expects. Sequential consistency and logical ordering may be relaxed to suit the program being parallelized.

===Strong Ordering===
The requirements for strong ordering are as follows:

# All memory operations appear to execute one at a time.
# All memory operations from a single CPU appear to execute in order.
# All memory operations from different processors are “cleanly” interleaved with each other (serialization).

===Total Store Ordering===
The requirements are as follows:

# Consistency is relaxed: stores must complete in order, but a pending store need not complete before a subsequent read takes place.
# Reads may bypass pending writes, but writes MUST leave the store buffer in FIFO order.

===Partial Store Ordering===

The requirement is as follows:

Consistency is relaxed further: stores to any given memory location complete in order, but stores to different locations may complete out of order, and a pending store need not complete before a subsequent read takes place.

===Weak Ordering===
The requirement is as follows:

Consistency is relaxed the most: reads and writes may be ordered freely between synchronization points. At each barrier synchronization point, however, global memory state must be completely settled, and that state may correspond to any ordering of the reads and writes issued since the previous synchronization point.

===Example===
The following program illustrates sequential consistency. The variables “a”, “b”, “flag1” and “flag2” are initialized to 0, and both processors (CPU 1 and CPU 2) share all of them.

 a = b = flag1 = flag2 = 0;    // initial value
 CPU 1            CPU 2
 flag1 = 1;       flag2 = 1;
 a = 1;           a = 2;
 r1 = a;          r3 = a;
 r2 = flag2;      r4 = flag1;

The SPARC V8 architecture follows the Total Store Ordering model, which allows a write followed by a read to complete out of program order. One possible result is therefore r1 = 1, r3 = 2, r2 = r4 = 0.

Strong ordering, in contrast, enforces strict atomicity, so the result depends only on how the two processors' instructions are interleaved. Under any such interleaving at least one of r2 and r4 is 1: if r2 = 0, then CPU 1 read flag2 before CPU 2 wrote it, so CPU 1 had already written flag1 by the time CPU 2 read it.
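The difference between the two models in this example can be checked with a small Python sketch. It is a toy model under stated assumptions, not a simulator of SPARC V8: the operation encoding and the helper names <code>run_sc</code>, <code>sc_outcomes</code> and <code>run_tso_buffered</code> are invented for illustration. The first part enumerates every interleaving that keeps each CPU in program order; the second runs one execution in which all stores wait in per-CPU FIFO buffers while the loads execute.

```python
from itertools import combinations

# Each operation is (kind, variable, value-to-store or destination register).
CPU1 = [("st", "flag1", 1), ("st", "a", 1), ("ld", "a", "r1"), ("ld", "flag2", "r2")]
CPU2 = [("st", "flag2", 1), ("st", "a", 2), ("ld", "a", "r3"), ("ld", "flag1", "r4")]


def run_sc(order):
    """Execute one interleaving directly against a single shared memory."""
    mem = {"a": 0, "flag1": 0, "flag2": 0}
    regs = {}
    progs = {1: iter(CPU1), 2: iter(CPU2)}
    for cpu in order:
        kind, var, x = next(progs[cpu])
        if kind == "st":
            mem[var] = x
        else:
            regs[x] = mem[var]
    return regs


def sc_outcomes():
    """Collect (r1, r2, r3, r4) over every program-order-preserving interleaving."""
    results = set()
    for slots in combinations(range(8), 4):      # positions taken by CPU 1
        order = [1 if i in slots else 2 for i in range(8)]
        r = run_sc(order)
        results.add((r["r1"], r["r2"], r["r3"], r["r4"]))
    return results


def run_tso_buffered():
    """One buffered execution: stores stay in per-CPU FIFO buffers while loads run."""
    mem = {"a": 0, "flag1": 0, "flag2": 0}
    regs = {}
    buffers = {1: [], 2: []}
    for cpu, prog in ((1, CPU1), (2, CPU2)):
        for kind, var, x in prog:
            if kind == "st":
                buffers[cpu].append((var, x))
            else:
                # A load sees the newest matching entry in its own buffer first.
                own = [v for (name, v) in buffers[cpu] if name == var]
                regs[x] = own[-1] if own else mem[var]
    for cpu in (1, 2):                            # buffers drain afterwards, in FIFO order
        for var, x in buffers[cpu]:
            mem[var] = x
    return regs


sc = sc_outcomes()
tso = run_tso_buffered()
```

Here <code>tso</code> holds the result quoted above for Total Store Ordering, while no tuple in <code>sc</code> has both r2 and r4 equal to 0.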

===Effects on Write Buffer Operation===

// Explain here what ordering has to do with the write buffer coherency

==Cache Coherence Models==

The two prominent approaches to maintaining cache coherence are the snoopy-bus protocol and the directory-based protocol. The directory-based protocol is used in distributed memory systems, where a directory entry keeps track of which processors hold each block of data. In the snoopy-bus protocol, every processor monitors the reads and writes being serviced on the memory bus. Read and write requests are broadcast on the bus, and each processor responds according to whether it holds that data in its cache. This article looks at shared memory systems.

=== Write-Update ===

In this approach, a write request is broadcast to all processors, and each processor updates its local cache with the new value. Even if a read misses in one processor's local cache, it can be served from any other processor's cache, because every copy has been updated. In terms of writes, this saves bus bandwidth, since a single broadcast is enough to bring all the caches up to date with the newest value.

===Write-Invalidate===
This approach is common in systems where a data block is written by one processor but read by several. Whenever a write is made to shared data, an invalidate signal is sent to all caches, asking them to mark that data block invalid because their copies no longer hold the latest value. The invalidations may add traffic on the bus, so a design may include a separate bus for invalidation requests.
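The contrast between the two policies can be sketched as follows. The model is deliberately minimal and rests on assumptions that are not part of the protocols themselves: each cache is a plain dictionary of valid lines, only caches that already hold the block react to the broadcast, and the names <code>Cache</code>, <code>bus_write</code> and <code>make_system</code> are hypothetical.

```python
class Cache:
    """Hypothetical private cache: only valid lines are stored."""

    def __init__(self):
        self.lines = {}   # address -> value


def bus_write(caches, writer, addr, value, policy):
    """Apply one write by `writer` and propagate it under the given policy."""
    caches[writer].lines[addr] = value
    for i, cache in enumerate(caches):
        if i == writer or addr not in cache.lines:
            continue
        if policy == "update":
            cache.lines[addr] = value      # remote copy refreshed in place
        elif policy == "invalidate":
            del cache.lines[addr]          # remote copy dropped; next read misses
        else:
            raise ValueError(policy)


def make_system():
    caches = [Cache() for _ in range(3)]
    for c in caches:
        c.lines[0x40] = 0                  # every cache starts with a shared copy
    return caches


updated = make_system()
bus_write(updated, writer=0, addr=0x40, value=5, policy="update")

invalidated = make_system()
bus_write(invalidated, writer=0, addr=0x40, value=5, policy="invalidate")
```

After the write, every cache in <code>updated</code> holds the new value, whereas in <code>invalidated</code> only the writer still has a valid copy and the other caches must miss on their next read.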

== Coherence in Write Buffers ==

===Software-Based Coherence===
The software technique relies on the compiler to ensure that no dependences exist between the STORE and LOAD accesses that different processors make to shared memory. Guided by indications from the programmer, the compiler makes sure that there are no incoherent accesses to shared memory. The processors may also share a single READ/WRITE buffer for access to main memory.

===Hardware-Based Coherence===
In hardware-based protocols, accesses to shared memory are communicated through hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required to broadcast a LOAD to all the caches, so that a read can be performed directly from a remote memory. The following approaches maintain coherence in the write buffers and caches of a multiprocessor system:

====Unique Buffer per Processor====
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache-based system with unique buffer per processor [http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]

In this configuration, every processor has its own buffer, which holds the loads and stores of the local processor as well as the invalidates and loads of remote processors that want access to the shared data. A store to the local cache causes that data block to be invalidated in all other caches and in main memory. A STORE is said to have been performed when

# the request has been serviced in the buffer and, on a local hit, the cache has been updated, and
# the request has been issued to the buffer and an invalidate request has been sent to all other processors that share the data.

This can be achieved in two ways, depending on the topology that connects the caches:

# In a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is considered performed with respect to all processors once the invalidation signals have been sent out to the private buffers of all the caches.
# In a non-bus MP system, the caches are connected in a point-to-point topology such as a mesh or a ring. When a processor encounters a store, the shared data is first locked in shared memory to ensure atomic access, and the invalidate signal is then propagated along the point-to-point interconnect in a predetermined order. The STORE to shared memory is performed only after the invalidation has reached all the processor caches and buffers. If any processor issues a LOAD to the address being STOREd, the LOAD request must be rejected or buffered and serviced later, to maintain atomic access to the shared data block.

====Separate Buffers for Local and Remote Accesses====
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache-based system with separate write-invalidate buffers [http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]

Another approach is to give every processor separate buffers for local and remote data requests. As the figure shows, every processor has a Local-buffer, which queues the local LOADs and STOREs, and a Remote-buffer (also termed Invalidate buffer), which holds the invalidation requests and LOADs arriving from other processors.
In a bus-based system, this approach makes write atomicity difficult to maintain: different invalidate buffers may hold different numbers of invalidation requests, so the time at which the data block is actually invalidated in each cache is uncertain. In non-bus systems, however, this approach does maintain strong coherence, provided the invalidate signal ensures that the data has been invalidated at one cache before it moves on to the next processor (rather than just pushing the invalidate request into the buffer and moving on).

====Universal read/write Buffer====

[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]

In this approach, a shared bus controller sits between the local bus and the memory bus, and the local bus connects to the private caches of all the processors (see Figure 5). The technique supports multiple processors that use the same or different write-invalidate coherence protocols, such as MSI, MESI and MOESI. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two tri-state signals to the local bus: '''INTERVENE''' and '''SHARE'''.

# INTERVENE is asserted when a cache wants to provide valid data to other caches.
# SHARE is asserted by a cache controller when the address matches one of its own tag addresses.

=====''' Data Buffer '''=====

The data buffer consists of three FIFOs:

# The ''non_block write-FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles.
# The ''write-back FIFO'' is used for eviction cycles. On a cache miss that requires an eviction, the evicted data is first moved to this FIFO, and the missed data is then read from memory into the cache (line-fill cycle). The FIFO contents are written back to memory later, so the CPU gets the new data while the eviction time is hidden from it.
# The ''read-FIFO'' stores data from the non_block write FIFO, the write-back FIFO or memory. Data is held in this FIFO temporarily until the local bus is clear and ready for new data.

=====''' Address Buffer '''=====

The address buffer also consists of three FIFOs:

# ''Non_block write FIFO'' - stores the addresses of non-blocking accesses. Its depth should be the same as that of the corresponding Data Buffer FIFO.
# ''Write-back FIFO'' - stores the starting address of an eviction cycle and holds it until the memory bus is free.
# ''Line-fill FIFO'' - stores the starting address of a line-fill cycle.

Snoop logic starts whenever there is a read cycle to main memory. It compares the read request address with the entries in both the write-back FIFO and the line-fill FIFO. On a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. As a result, stale data is never read and the CPU does not have to wait out the memory latency.
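A rough Python sketch of this read-snoop check is given below. It simplifies the hardware considerably and makes assumptions of its own: each FIFO entry is modelled as an (address, data) pair even though addresses and data live in separate buffers, matching is on whole addresses, and the function name <code>snoop_read</code> is hypothetical.

```python
from collections import deque


def snoop_read(addr, write_back_fifo, line_fill_fifo, memory):
    """Return (hit, data) for a read cycle headed to main memory."""
    # Search newest-first so the most recent matching entry wins.
    for fifo in (write_back_fifo, line_fill_fifo):
        for entry_addr, data in reversed(fifo):
            if entry_addr == addr:
                return True, data          # HIT: bypass path, no memory access
    return False, memory[addr]             # no match: pay the memory latency


memory = {0x100: "stale", 0x200: "mem"}
write_back_fifo = deque([(0x100, "evicted-but-newest")])
line_fill_fifo = deque()

hit_result = snoop_read(0x100, write_back_fifo, line_fill_fifo, memory)
miss_result = snoop_read(0x200, write_back_fifo, line_fill_fifo, memory)
```

The read of 0x100 returns the value still waiting in the write-back FIFO instead of the stale copy in memory, which is the case the bypass path exists for.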

For a non-block write cycle, the snoop logic compares the newly requested address and Byte Enable bits with those already stored in the non_block write FIFO. If the address matches but the Byte Enable bits do not overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not advance, which keeps the FIFO from filling up quickly.
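One way to read this merging rule is sketched below. Several details are assumptions made only for the example: a four-lane byte enable, comparison against the most recently stored entry only, merging the new bytes into that entry when BGO is asserted, and the function name <code>push_non_block_write</code>.

```python
def push_non_block_write(fifo, addr, byte_enable, data):
    """Queue a non-cacheable write; return True when BGO is asserted (merged)."""
    if fifo:
        last = fifo[-1]
        same_addr = last["addr"] == addr
        overlap = last["be"] & byte_enable
        if same_addr and not overlap:
            # Merge into the existing entry: the FIFO pointer does not advance.
            for lane in range(4):
                if byte_enable & (1 << lane):
                    last["data"][lane] = data[lane]
            last["be"] |= byte_enable
            return True
    fifo.append({"addr": addr, "be": byte_enable, "data": list(data)})
    return False


fifo = []
first = push_non_block_write(fifo, 0x80, 0b0011, [0xAA, 0xBB, 0, 0])
second = push_non_block_write(fifo, 0x80, 0b1100, [0, 0, 0xCC, 0xDD])   # disjoint lanes
third = push_non_block_write(fifo, 0x80, 0b0001, [0xEE, 0, 0, 0])       # overlaps lane 0
```

The second write shares the address but touches different byte lanes, so it is merged and the FIFO does not grow; the third overlaps an already written lane and takes a new entry.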

In this approach, read cycles have priority over write cycles, so whenever the memory bus is available, a read address gets direct access to it.

=====''' Local Bus Controller '''=====

The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether a cycle needs to access main memory or a private cache. For a main memory access, the controller informs the buffer through the WB (Write Back), LF (Line Fill) or O3 (owned) status signal.

The controller signals are generated as follows.

For the MSI and MESI protocols:

If there is a read miss on the local bus
: INTERVENE is asserted.
If there is a write miss with M status on the local bus
: A write-back cycle is performed.
: The WB signal is generated.
If there is a read miss or write miss from any of the protocols
: An LF cycle is performed.
: The LF signal is generated.

For the MOESI protocol:

If there is a read miss on the local bus
: INTERVENE is asserted.

The above outputs can be written in Boolean equations as shown in the example<ref>http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629</ref> below:

 WB = RD * MOESI * INTV + WR * [STATUS=M]
 LF = INTV
 O3 = RD * MOESI * INTV

Here MOESI means that the cycle is initiated by the MOESI protocol, and INTV is the INTERVENE signal, which is asserted when any cache provides valid data to another cache.

=====''' Algorithms '''=====

The algorithms followed by each processor in the multiprocessor system are given below. The processors may follow different protocols (MSI, MESI or MOESI), but these algorithms ensure that memory accesses are carried out coherently.

''MOESI'': the algorithm for the MOESI protocol is as follows:

# A cache holding the line in the M state provides the data to another cache on a read hit initiated by that cache, and changes its state from M to O.
# If the same data is hit again, the cache in the O state is responsible for providing the data to the requesting cache.
# A write cycle needs a line fill if there is a cache miss and no INTERVENE.

''MSI'' and ''MESI'': the algorithms for the MSI and MESI protocols are as follows (neither protocol has an O state):

# A cache in the M state under MSI or MESI provides the data to the other cache and to main memory, and changes its state from M to S.
# The local bus controller is responsible for changing the state from S to O for the MOESI protocol.
# The M state of MSI or MESI still changes to S.
# If a write cycle is initiated by another CPU (MSI or MESI):
#: If the MOESI cache has a valid line with M or O status, it sends the line out onto the local bus and changes its state to I, because the line is about to be written by the other cache and would become outdated.
# If the status is E or S, main memory provides the data and no cache is involved in the data transfer. However, the cache with the MOESI protocol becomes the owner of that line.
# The status bit is forced to O instead of S.
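The snoop responses listed above can be condensed into a small lookup sketch. It covers only the transitions that the lists state explicitly and returns the state unchanged for everything else; the function names <code>on_remote_read</code> and <code>on_remote_write</code> and the tuple layout are assumptions made for illustration, not part of the cited design.

```python
def on_remote_read(protocol, state):
    """Return (new_state, supplies_data, writes_memory) for a snooped read hit."""
    if protocol == "MOESI":
        if state == "M":
            return "O", True, False     # M -> O: the cache keeps ownership of the line
        if state == "O":
            return "O", True, False     # the owner keeps supplying the line
    elif protocol in ("MSI", "MESI"):
        if state == "M":
            return "S", True, True      # data goes to the requester and to main memory
    return state, False, False          # other cases are outside this sketch


def on_remote_write(protocol, state):
    """Return (new_state, supplies_data) for a snooped write by an MSI/MESI CPU."""
    if protocol == "MOESI" and state in ("M", "O"):
        return "I", True                # line is sent out, then invalidated
    return state, False                 # other cases are outside this sketch


moesi_read = on_remote_read("MOESI", "M")
mesi_read = on_remote_read("MESI", "M")
moesi_write = on_remote_write("MOESI", "O")
```

The sketch makes the asymmetry visible: a MOESI cache in M keeps responsibility for the line by moving to O, while an MSI or MESI cache in M hands the data to memory as well and drops to S.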

=='''References'''==
<references/>
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs

[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman

[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja

CSC/ECE 506 Spring 2012/6b am

2012-03-04T15:16:26Z

Sbasu3: /* Address Buffer */
