3.4 Data Cache
3.4.2 Cacheability
The data associated with a store that hits the cache may also be written to external memory if write-through caching is specified for that area of memory. If the cache does not contain the requested data, the access ‘misses’ the cache, and the sequence of events that follows depends on the configuration of the cache, the configuration of the MMU, and the page attributes, which are described in “Cacheability” on page 63.
The data/mini-data cache is still accessed even when it is disabled. If a load hits the cache, it returns the requested data to the destination register. If a store hits the cache, the data is written into the cache. Any access that misses the cache will not allocate a line in the cache while the cache is disabled, even if the MMU is enabled and the memory region’s cacheability attribute is set.
A store operation that misses the cache requests a 32-byte cache line from external memory if the access is cacheable and write allocation is specified for that memory region. In that case the following sequence of events occurs:
1. The fill buffer is checked for an outstanding fill request for that line. If one already exists, the current store request is placed in the pending buffer and waits until the previously requested fill completes, after which it writes its data into the newly allocated line. If there is no outstanding fill request for that line, the current store request is placed in the fill buffer and a 32-byte external memory read request is made. If the pending buffer or fill buffer is full, the Intel XScale processor will stall until an entry is available.
2. The 32 bytes of data can be returned to the Intel XScale processor in any word order, i.e., the eight words in the line can be returned in any order. The return order does not matter for performance, since the store operation has to wait until the entire line is written into the cache before it can complete.
3. When the entire 32-byte line has been returned from external memory, a line is allocated in the cache, selected by the round-robin pointer. The line being written into the cache may replace a valid line previously allocated in the cache. In this case both dirty bits are examined; for each dirty bit that is set, the four words associated with that dirty bit are written back to external memory as a 4-word burst operation. This write operation is placed in the write buffer.
4. The line is written into the cache along with the data associated with the store operation.
If the above condition for requesting a 32-byte cache line is not met, a write miss will cause a write request to external memory for the exact data size specified by the store operation, assuming the write request doesn’t coalesce with another write operation in the write buffer.
The Intel XScale processor supports write-back caching or write-through caching, controlled through the MMU page attributes. When write-through caching is specified, all store operations are written to external memory even if the access hits the cache.
This feature keeps the external memory coherent with the cache, i.e., no dirty bits are set for this region of memory in the data/mini-data cache. However, write-through caching does not guarantee that the data/mini-data cache is coherent with external memory; that depends on the system-level configuration, specifically whether the external memory is shared by another master.
When write-back caching is specified, a store operation that hits the cache will not generate a write to external memory, thus reducing external memory traffic.
The line replacement algorithm for the data cache is round-robin. Each set in the data cache has a round-robin pointer that keeps track of the next line (in that set) to replace. The next line to replace in a set is the next sequential line after the last one that was just filled. For example, if the line for the last fill was written into way 5-set 2, the next line to replace for that set would be way 6. None of the other round-robin pointers for the other sets are affected in this case.
After reset, the round-robin pointer for every set points to way 31. Once a line is written into way 31, the round-robin pointer wraps to the first available way of the set, beginning with way 0 if no lines in that particular set have been re-configured as data RAM. Re-configuring lines as data RAM effectively reduces the number of lines available for cache updating. For example, if the first three lines of a set were re-configured, the round-robin pointer would point to the line at way 3 after it rolled over from way 31. Refer to “Reconfiguring the Data Cache as Data RAM” on page 68 for more details on data RAM.
The mini-data cache follows the same round-robin replacement algorithm as the data cache, except that there are only two lines the round-robin pointer can point to, so the pointer always points to the least recently filled line. A least-recently-used replacement algorithm is not supported because the purpose of the mini-data cache is to cache data that exhibits low temporal locality, i.e., data that is placed into the mini-data cache is typically modified once and then written back out to external memory.
The data cache and mini-data cache are protected by parity to ensure data integrity;
there is one parity bit per byte of data. (The tags are NOT parity protected.) When a parity error is detected on a data/mini-data cache access, a data abort exception occurs. Before servicing the exception, hardware will set bit 10 of the Fault Status Register.
A data/mini-data cache parity error is an imprecise data abort, meaning R14_ABORT (+8) may not point to the instruction that caused the parity error. If the parity error occurred during a load, the targeted register may be updated with incorrect data.
A data abort due to a data/mini-data cache parity error may not be recoverable if the data address that caused the abort falls on a line in the cache that has a write-back caching policy. Prior updates to this line may be lost; in this case the software exception handler should perform a “clean and clear” operation on the data cache, ignoring subsequent parity errors, and restart the offending process.
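As a rough sketch (not taken from this manual) of how a data abort handler could recognize this case, the code below reads the Fault Status Register (coprocessor 15, register 5) and tests bit 10. The label names are hypothetical and register save/restore is omitted.

; Sketch: detect a data/mini-data cache parity error in the abort handler
abort_handler:
    MRC p15, 0, r0, c5, c0, 0  ; Read the Fault Status Register
    TST r0, #0x400             ; Bit 10 set => cache parity error
    BNE parity_recovery        ; Clean and clear the data cache, then
                               ; restart the offending process
    B   other_data_abort       ; Otherwise handle as an ordinary data abort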
The SWP and SWPB instructions generate an atomic load and store operation, allowing a memory semaphore to be loaded and altered without interruption. These accesses may hit or miss the data/mini-data cache depending on the configuration of the cache, the configuration of the MMU, and the page attributes.
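For illustration, a minimal spin-lock acquire built on SWP could look like the following sketch; the register assignments, the label, and the convention that 0 means “free” are assumptions, not taken from this manual.

; Sketch: acquire a semaphore at the address in r1 using SWP
    MOV r0, #1         ; Value that marks the semaphore as taken
spin:
    SWP r2, r0, [r1]   ; Atomically load the old value and store r0
    CMP r2, #0         ; Was the semaphore free?
    BNE spin           ; No - another owner holds it, so retry
    ; ... critical section ...
    MOV r2, #0
    STR r2, [r1]       ; Release the semaphore when done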
After processor reset, both the data cache and mini-data cache are disabled, all valid bits are set to zero (invalid), and the round-robin bit points to way 31. Any lines in the data cache that were configured as data RAM before reset are changed back to cacheable lines after reset, i.e., there are 32 Kbytes of data cache and zero bytes of data RAM.
The data cache and mini-data cache are enabled by setting bit 2 in coprocessor 15, register 1 (Control Register). See “Configuration” on page 73, for a description of this register and others.
Example 8 shows code that enables the data and mini-data caches. Note that the MMU must be enabled to use the data cache.
Individual entries can be invalidated and cleaned in the data cache and mini-data cache via coprocessor 15, register 7. Note that a line locked into the data cache remains locked even after it has been subjected to an invalidate-entry operation. This will leave an unusable line in the cache until a global unlock has occurred. For this reason, do not use these commands on locked lines.
This same register also provides the command to invalidate the entire data cache and mini-data cache. Refer to Table 18, “Cache Functions” on page 81 for a listing of the commands. These global invalidate commands have no effect on lines locked in the data cache. Locked lines must be unlocked before they can be invalidated. This is accomplished by the Unlock Data Cache command found in Table 20, “Cache Lock-Down Functions” on page 83.
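As a minimal sketch of the per-line operations, cleaning and then invalidating a single unlocked line whose virtual address is in r0 might look like the code below. The exact coprocessor 15, register 7 encodings shown here are assumptions; confirm them against Table 18, “Cache Functions”. As noted above, do not apply these commands to locked lines.

    MCR p15, 0, r0, c7, c10, 1 ; Clean the data cache line (write back if dirty)
    MCR p15, 0, r0, c7, c6, 1  ; Invalidate the data cache line
    MCR p15, 0, r0, c7, c10, 4 ; Drain pending writes to external memory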
Example 8. Enabling the Data Cache
enableDCache:
    MCR p15, 0, r0, c7, c10, 4 ; Drain pending data operations...
                               ; (see Chapter 7.2.8, Register 7: Cache functions)
    MRC p15, 0, r0, c1, c0, 0  ; Get current control register
    ORR r0, r0, #4             ; Enable DCache by setting ‘C’ (bit 2)
    MCR p15, 0, r0, c1, c0, 0  ; And update the Control register
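Example 8 sets only the ‘C’ bit; as noted above, the MMU must also be enabled. The following sketch is an assumption, not part of Example 8, and it presumes the translation tables, translation table base, and domain access registers have already been programmed. It sets the MMU enable bit (‘M’, bit 0) together with the data cache bit:

    MRC p15, 0, r0, c1, c0, 0  ; Get current control register
    ORR r0, r0, #0x5           ; Set ‘M’ (bit 0) and ‘C’ (bit 2)
    MCR p15, 0, r0, c1, c0, 0  ; Update the Control register
                               ; (a CPWAIT-style sequence is typically inserted
                               ; here so the change takes effect before
                               ; dependent code runs)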
A simple software routine is used to globally clean the data cache. It takes advantage of the line-allocate data cache operation, which allocates a line into the data cache. If the line being replaced is dirty, its data is evicted (written back) to external memory. Example 9 shows how the data cache can be cleaned.
Example 9. Global Clean Operation
; Global Clean/Invalidate THE DATA CACHE
; R1 contains the virtual address of a region of cacheable memory reserved for
; this clean operation
; R0 is the loop count; Iterate 1024 times which is the number of lines in the
; data cache
;; Macro ALLOCATE performs the line-allocation cache operation on the
;; address specified in register Rx.
;;
MACRO ALLOCATE Rx
MCR P15, 0, Rx, C7, C2, 5
ENDM
MOV R0, #1024
LOOP1:
ALLOCATE R1 ; Allocate a line at the virtual address
; specified by R1.
ADD R1, R1, #32 ; Increment the address in R1 to the next cache line
SUBS R0, R0, #1 ; Decrement loop count
BNE LOOP1
;
;Clean the Mini-data Cache
; Can’t use line-allocate command, so cycle 2KB of unused data through.
; R2 contains the virtual address of a region of cacheable memory reserved for
; cleaning the Mini-data Cache
; R0 is the loop count; Iterate 64 times which is the number of lines in the
; Mini-data Cache.
MOV R0, #64
LOOP2:
LDR R3, [R2], #32 ; Load and increment to next cache line
SUBS R0, R0, #1 ; Decrement loop count
BNE LOOP2
;
; Invalidate the data cache and mini-data cache
MCR P15, 0, R0, C7, C6, 0
;
The line-allocate operation does not require physical memory to exist at the virtual address specified by the instruction, since it does not generate a load/fill request to external memory. Also, the line-allocate operation does not set the 32 bytes of data associated with the line to any known value. Reading this data will produce
unpredictable results.
The line-allocate command does not operate on the mini-data cache, so system software must clean this cache by reading 2 Kbytes of contiguous unused data into it. This data must be unused and reserved for this purpose so that it is not already in the cache, and it must reside in a page that is marked as mini-data cache cacheable (see “New Page Attributes” on page 152).
The time it takes to execute a global clean operation depends on the number of dirty lines in cache.