Write Combining Memory Implementation Guidelines
processor are most often pixel writes and as such tend to be 8-bit, 16-bit or 32-bit quantities
rather than full cache lines, a processor would normally be unable to run burst cycles for
graphics operations. In previous Intel Architecture processors, like the Pentium processor,
graphics-like data writes have been sent to the system bus and have been grouped or combined
into burst writes in the chip set. The uncached transaction performance characteristics of the P6
family processor implementation have made that architectural model less effective. Because of
this, the P6 family processor was designed with a new caching method, or memory type, that
allows internal buffers of the processor to be used to combine smaller (or partial) writes
automatically into larger burstable cache line writes. The table above shows typical bandwidth
results for a Pentium Pro processor system in differing configurations, ranging from 8 MB/s
with all tuning options disabled up to sequential WC memory writes, which can saturate a
33 MHz PCI subsystem. These numbers can be contrasted with the uncached PCI performance of
about 70 MB/s for a Pentium processor with a 430 PCIset. Cache-line-like bursts to
graphics space then have the potential to be combined into even longer bursts by the chip
set. To further enhance graphics performance, the 82450/82440 PCIsets were designed for the
Pentium Pro processor with features such as Outbound Posting (OBP), Burst Write Assembly
and the ability to run Memory Write Invalidate PCI bus commands.
For applications to harness the maximum performance of the P6 family processor, it is essential
that operating system and driver software allow the system bus to be utilized as the initial
architecture design intended.
WRITE COMBINING
Once a memory region has been defined as having the WC memory type, accesses into the
memory region will be subject to the architectural definition of WC:
WC is a weakly ordered memory type. System memory locations are not
cached and coherency is not enforced by the processor’s bus coherency
protocol. Speculative reads are allowed. Writes may be delayed and
combined in the write combining buffer to reduce memory accesses.
What does this really mean? Writes to WC memory are not cached in the typical sense of the
word cached. They are delayed in an internal buffer that is separate from the internal L1 and
L2 caches. The buffer is not snooped and thus does not provide data coherency. The write
buffering is done to allow software a small window of time to supply more modified data to the
buffer while remaining as non-intrusive to software as possible. The size of the buffer is not
defined in the architectural statement above. The Pentium Pro processor and Pentium II
processor implement a 32-byte buffer. The size of this buffer was chosen for implementation
convenience rather than for performance optimization. The buffer size may be optimized
in a future generation of the P6 family processor, so software should not rely
upon the current 32-byte WC buffer size or the existence of just a single concurrent buffer. The
WC buffering of writes has another facet: data is also collapsed. For example, multiple writes to the same
location will leave the last data written in the location, and the earlier writes may be lost.
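
The combining and collapsing behaviour described above can be illustrated with a small software model. What follows is a minimal sketch in C, not a description of the actual hardware: the names wc_buffer, wc_write, wc_line_full and WC_LINE_SIZE are hypothetical, and the 32-byte size simply mirrors the current implementation, which software should not depend on.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WC_LINE_SIZE 32u  /* assumption: current P6 buffer size, not architectural */

/* Hypothetical model of one write combining buffer. */
struct wc_buffer {
    uint8_t  data[WC_LINE_SIZE];  /* pending bytes not yet sent to the bus */
    uint32_t valid;               /* bit i set: data[i] holds a pending byte */
    unsigned writes_absorbed;     /* partial writes merged into this line */
};

/* Merge a partial write of len bytes at byte offset off within the line. */
static void wc_write(struct wc_buffer *b, unsigned off,
                     const void *src, unsigned len)
{
    assert(len >= 1 && off + len <= WC_LINE_SIZE);
    /* Collapsing: a later write to the same bytes replaces the earlier one. */
    memcpy(&b->data[off], src, len);
    for (unsigned i = 0; i < len; i++)
        b->valid |= (uint32_t)1 << (off + i);
    b->writes_absorbed++;
}

/* Nonzero when every byte is pending, so one full burst could carry the line. */
static int wc_line_full(const struct wc_buffer *b)
{
    return b->valid == 0xFFFFFFFFu;
}
```

In this model, eight sequential 32-bit pixel writes to a zero-initialized wc_buffer fill the line, so they could leave the processor as one burst rather than eight partial writes. Two byte writes to the same offset leave only the second value in the buffer, which is why WC memory is unsuitable for locations where every individual write must be observed.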
So if the data is delayed inside the processor, how do pixels get to the screen where I want
them? On the current P6 family processors, once software writes to a region of memory that is
addressed outside of the range of the current 32-byte buffer, the data in the existing buffers will
automatically be forwarded to the system bus and written to memory. Therefore software that
writes more than one 32-byte buffer's worth of data will ensure that the data from the first
buffer's address range is forwarded to memory. The last buffer written in the sequence may be
delayed by the processor a little longer unless deliberately propagated to memory by software.
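
The eviction behaviour just described can be sketched with a toy model that only tracks which 32-byte line is open. This is an illustration under stated assumptions, not the processor's actual logic: the names wc_model, wc_store, wc_flush and wc_fill are hypothetical, a single buffer is assumed purely to keep the example small, and wc_flush stands in for whatever mechanism software uses to empty the buffers deliberately.

```c
#include <assert.h>
#include <stdint.h>

#define WC_LINE_SIZE 32u  /* assumption: current P6 buffer size, not architectural */

/* Hypothetical single-buffer model: remembers only the open line. */
struct wc_model {
    int      open;       /* nonzero while a line holds pending data */
    uint32_t line_addr;  /* base address of the open line */
    unsigned bursts;     /* lines forwarded to the system bus so far */
};

/* Forward the open line, if any. Models a deliberate flush by software. */
static void wc_flush(struct wc_model *m)
{
    if (m->open) {
        m->bursts++;
        m->open = 0;
    }
}

/* Record a write of len bytes at addr; it must not straddle a line. */
static void wc_store(struct wc_model *m, uint32_t addr, unsigned len)
{
    uint32_t line = addr & ~(uint32_t)(WC_LINE_SIZE - 1);
    assert(len >= 1 && (addr - line) + len <= WC_LINE_SIZE);
    if (m->open && m->line_addr != line)
        wc_flush(m);  /* a write outside the open range evicts that line */
    m->open = 1;
    m->line_addr = line;
}

/* Write sequential 32-bit pixels covering bytes bytes from a 4-byte-aligned
 * base. Returns nonzero if a line is still pending afterwards. */
static int wc_fill(struct wc_model *m, uint32_t base, unsigned bytes)
{
    for (unsigned off = 0; off + 4 <= bytes; off += 4)
        wc_store(m, base + off, 4);
    return m->open;
}
```

Filling 96 bytes from a line-aligned base touches three lines in this model: the first two are forwarded as a side effect of moving on to the next line, while the third stays pending until wc_flush is called. That pending last line is the case software has to handle explicitly if it cares when the final pixels become visible.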
The main message here is that despite the fact that data is delayed in the processor the natural
software operations of moving data will cause the data in the WC buffers to be written to
memory. A caution at this stage: software developers should not rely on there being
only one active WC buffer at a time. If software cares about data being delayed, developers
must deliberately empty the WC buffers and not assume the hardware will. The WC buffer is