Network Processors and IXC1100 Control Plane Processor

Intel Corporation may have patents or pending patent applications, trademarks, copyrights or other intellectual property rights related to the presented subject matter. The provision of documents and other materials and information does not grant any license, express or implied, by estoppel or otherwise, to any such patents, trademarks, copyrights or other intellectual property rights.

Revision History

Introduction
About This Document

How to Read This Document

Other Relevant Documents
Terminology and Conventions

Number Representation
Acronyms and Terminology

Overview of Product Line
Intel XScale® Microarchitecture Processor

Intel XScale® Processor Overview

Each chapter in this document focuses on a specific architectural feature of the Intel® IXP42X product line and IXC1100 control plane processors. Unless otherwise specified, the functional descriptions apply to all IXP42X product line and IXC1100 control plane processors.

Write Buffer

The Intel XScale processor implements the ARM V5 integer instruction set architecture, but does not provide hardware support for floating-point instructions. Operating systems may require modifications to fit specific hardware features of the IXP42X product line and IXC1100 control plane processors.

Fill Buffer

The audio coding enhancements to the MAC focus on multiply and accumulate operations, which speed up many audio filter operations. Note: The power management control feature has not been implemented in the IXP42X product line and the IXC1100 control plane processors.

Instruction Cache

The IXP42X product line and IXC1100 control plane processors are equipped to efficiently handle audio processing by supporting 16-bit data types and 16-bit operations. Several architectural improvements were made to the MAC to support audio coding algorithms, including a 40-bit accumulator and support for 16-bit packed data.

IMMU

The MAC unit supports early termination of multiply/accumulate operations in two cycles and can sustain a throughput of one MAC operation per cycle.

DMMU

JTAG Debug

Branch Target Buffer

Data Cache

Data RAM

Power Management

Mini-Data

Performance Monitoring

Memory Management

The Intel XScale processor implements the Memory Management Unit (MMU) architecture specified in the ARM Architecture Reference Manual. The MMU architecture also specifies the caching policy for the instruction cache and data cache.

Instruction Cache

The MMU architecture can also enable the write buffer to coalesce stores to external memory. See Section 3.1, “Memory Management Unit” on page 44 for more details.

Branch Target Buffer

Data Cache

Intel XScale® Processor Performance Monitoring

Network Processor Engines (NPE)

Internal Bus

MII Interfaces

AHB Queue Manager

The two interrupts, one for queues 0-31 and one for queues 32-63, provide status interrupts to the Intel XScale processor. For more information about the AHB Queue Manager, see Section 21.0, "AHB Queue Manager (AQM)" on page 556.

UTOPIA 2

USB v1.1

Memory Controller

If x16 memory chips are used, at least two chips are required to form the 32-bit interface required by the IXP42X product line and IXC1100 control plane processors. The memory controller connects internally to the North AHB and the South AHB through independent interfaces.

Expansion Bus

This burst size allows the best efficiency/fairness performance between accesses from the North and South AHB.

High-Speed Serial Interfaces

Universal Asynchronous Receiver Transceiver

GPIO

Interrupt Controller

Timers

JTAG

Intel XScale® Processor

Memory Management Unit

Memory Attributes
Page (P) Attribute Bit

Cacheable (C), Bufferable (B), and eXtension (X) Bits
Instruction Cache

Interaction of the MMU, Instruction Cache, and Data Cache
MMU Control

Invalidate (Flush) Operation
Enabling/Disabling
Locking Entries
Round-Robin Replacement Algorithm

When the MMU is disabled, all data accesses are non-cacheable and non-bufferable. Therefore, only three of the four combinations of the MMU and data/mini-data cache enablement are valid.
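The rule above can be sketched as a small validity check. This is only an illustrative model, not code from the manual: the function names are hypothetical, and it assumes the one invalid pairing is "MMU disabled, data/mini-data cache enabled", which follows from the statement that all data accesses are non-cacheable while the MMU is off.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when the (MMU, data/mini-data cache) enable pair is usable.
 * Assumption: the single invalid pairing is MMU off with the cache on,
 * because with the MMU off every data access is non-cacheable. */
bool dcache_config_is_valid(bool mmu_enabled, bool dcache_enabled)
{
    return mmu_enabled || !dcache_enabled;
}

/* Counts how many of the four enable combinations pass the check. */
int count_valid_dcache_configs(void)
{
    int valid = 0;
    for (int mmu = 0; mmu <= 1; mmu++) {
        for (int dc = 0; dc <= 1; dc++) {
            if (dcache_config_is_valid(mmu != 0, dc != 0)) {
                valid++;
            }
        }
    }
    return valid;
}
```

Under this assumption, count_valid_dcache_configs() reports three valid combinations, matching the text.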

Table 3. Data Cache and Buffer Behavior When X = 0

Instruction Cache

Operation When Instruction Cache is Enabled

Instruction-Cache ‘Miss’
Instruction-Cache Line-Replacement Algorithm
Instruction-Cache Coherence

Each set in the instruction cache has a round-robin pointer that keeps track of the next line (in that set) to be replaced. Writing to coprocessor 15, register 9 unlocks all the locked lines in the instruction cache and leaves them valid.
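A per-set round-robin pointer can be modeled in a few lines. The sketch below is a software illustration only: the set and way counts (RR_SETS, RR_WAYS) and all names are hypothetical placeholders, and locked lines, which a real replacement policy would have to skip, are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define RR_SETS 32u /* hypothetical number of sets */
#define RR_WAYS 32u /* hypothetical number of lines per set */

typedef struct {
    uint8_t next_way[RR_SETS]; /* one round-robin pointer per set */
} rr_cache_t;

/* Points every set's round-robin pointer at way 0. */
void rr_init(rr_cache_t *c)
{
    for (unsigned s = 0; s < RR_SETS; s++) {
        c->next_way[s] = 0;
    }
}

/* Returns the way to replace in `set`, then advances that set's pointer,
 * wrapping back to way 0 after the last way. */
unsigned rr_pick_victim(rr_cache_t *c, unsigned set)
{
    assert(set < RR_SETS);
    unsigned victim = c->next_way[set];
    c->next_way[set] = (uint8_t)((victim + 1u) % RR_WAYS);
    return victim;
}
```

Each set advances independently, so a burst of misses in one set does not disturb the replacement order in another.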

Figure 8. Instruction Cache Organization

Branch Target Buffer

Branch Target Buffer (BTB) Operation

Reset

The BTB uses bits [8:2] of the current address to read the tag and then compares this tag to bits [31:9,1] of the current instruction address. Before enabling or disabling the BTB, the software must invalidate the BTB (described in the next section).
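The index and tag fields named above can be extracted with shifts and masks. This is a minimal sketch with hypothetical function names; it assumes the tag is formed by concatenating address bits [31:9] with bit [1], in that order, which is one natural reading of "[31:9,1]".

```c
#include <assert.h>
#include <stdint.h>

/* Bits [8:2] of the instruction address select the BTB entry (7 bits). */
uint32_t btb_index(uint32_t addr)
{
    return (addr >> 2) & 0x7Fu;
}

/* Tag: bits [31:9] in the upper positions, bit [1] in the lowest position.
 * The concatenation order is an assumption made for illustration. */
uint32_t btb_tag(uint32_t addr)
{
    return ((addr >> 9) << 1) | ((addr >> 1) & 1u);
}
```

For example, address 0x1234 maps to index 13 with tag 18 under this layout.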

Data Cache

Data Cache Overview
Cacheability
Reconfiguring the Data Cache as Data RAM

Each set in the data cache has a round-robin pointer that keeps track of the next line (in that set) to replace. Individual entries in the data cache and mini-data cache can be invalidated and cleaned through coprocessor 15, register 7.

Configuration

CP15 Registers

Register 0: ID and Cache Type Registers
Register 1: Control and Auxiliary Control Registers
Register 3: Domain Access Control Register
Register 4: Reserved
Register 5: Fault Status Register
Register 6: Fault Address Register
Register 7: Cache Functions
Register 8: TLB Operations
Register 9: Cache Lock Down
Register 10: TLB Lock Down
Registers 11-12: Reserved
Register 13: Process ID
The PID Register Effect on Addresses
Register 14: Breakpoint Registers
Register 15: Coprocessor Access Register

CP14 Registers

Performance Monitoring Registers
Clock and Power Management Registers
Software Debug Registers

The cache type register is selected when opcode_2=1 and describes the cache configuration of the Intel XScale processor. The mini-data cache attributes must be configured before accessing any data that may be cached in the mini-data cache. Bits [31:5] of Rd specify the virtual address of the line to be allocated in the data cache.
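Because only bits [31:5] of Rd carry the line address, the low five bits of a virtual address are irrelevant to the allocate operation. The helper below is a hypothetical illustration of how software might form the operand; it implies a 32-byte line, which is inferred from the bit range rather than stated here.

```c
#include <assert.h>
#include <stdint.h>

/* Keeps bits [31:5] of a virtual address and clears bits [4:0],
 * giving the address of the line that contains `va`. */
uint32_t dcache_line_alloc_operand(uint32_t va)
{
    return va & ~0x1Fu;
}
```

Any address inside the same line, such as 0x80001020 and 0x8000103F, yields the same operand.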

An application can request the use of a shared resource (e.g., the accumulator in CP0) by issuing an access to the resource, which will result in an undefined exception.

Software Debug

Definitions
Debug Registers
Debug Modes

Halt Mode
Monitor Mode

Debug Control and Status Register (DCSR)

Global Enable Bit (GE)
Halt Mode Bit (H)
Vector Trap Bits (TF,TI,TD,TA,TS,TU,TR)
Sticky Abort Bit (SA)
Method of Entry Bits (MOE)
Trace Buffer Mode Bit (M)
Trace Buffer Enable Bit (E)

Debug Exceptions

Halt Mode
Monitor Mode

HW Breakpoint Resources

Instruction Breakpoints
Data Breakpoints

Software Breakpoints
Transmit/Receive Control Register (TXRXCTRL)

Fill-once mode: The trace buffer automatically generates a debug exception (trace buffer full break) when it becomes full. The processor automatically clears the Trace Buffer Enable bit (E) to disable the trace buffer when a debug exception occurs. When halt mode is active, the processor uses the reset vector as the debug vector.

At a data breakpoint, the processor generates a debug exception and redirects execution to the debug handler before executing the next instruction.

Table 33. Debug Control and Status Register (DCSR) (Sheet 1 of 2)

RX Register Ready Bit (RR)
Overflow Flag (OV)
Download Flag (D)
TX Register Ready Bit (TR)
Conditional Execution Using TXRXCTRL
Transmit Register
Receive Register
Debug JTAG Access

SELDCSR JTAG Command
SELDCSR JTAG Register

When the RR bit is clear, indicating that the debug handler is ready, the debugger starts downloading. After completing the download, the debugger clears the D bit so that the debug handler can exit the download loop. The debugger and debug handler use the TR bit to synchronize accesses to the TX register.

The debugger and debug handler must poll the TR bit before accessing the TX register.
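The TR-bit synchronization can be pictured with a small software model. This is a sketch under stated assumptions, not the hardware interface: the structure and function names are hypothetical, and it assumes TR is set when the debug handler writes TX and cleared when the debugger reads it, so each side checks TR before touching the register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t tx; /* models the TX register */
    bool tr;     /* models the TR bit: true means TX holds unread data */
} tx_channel_t;

/* Debug-handler side: write only when the previous word was consumed. */
bool handler_try_write(tx_channel_t *ch, uint32_t value)
{
    if (ch->tr) {
        return false; /* TX still full; the caller polls again */
    }
    ch->tx = value;
    ch->tr = true;
    return true;
}

/* Debugger side: read only when the handler has posted a word. */
bool debugger_try_read(tx_channel_t *ch, uint32_t *out)
{
    if (!ch->tr) {
        return false; /* nothing to read yet; the caller polls again */
    }
    *out = ch->tx;
    ch->tr = false;
    return true;
}
```

Each function returns false instead of blocking, so the caller's retry loop plays the role of polling.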


DBGTX JTAG Command
DBGTX JTAG Register
DBGRX JTAG Command
DBGRX JTAG Register
Debug JTAG Data Register Reset Values
Trace Buffer

Trace Buffer CP Registers

Trace Buffer Entries

Message Byte
Trace Buffer Usage

Downloading Code in ICache

LDIC JTAG Command
LDIC JTAG Data Register
LDIC Cache Functions
Loading IC During Reset
Dynamically Loading IC After Reset

The debugger does not know the starting address of the oldest entry read from the trace buffer. When any exception occurs, the exception message is placed in the trace buffer. The address placed in the trace buffer will be the address of the target application.

The LDIC JTAG instruction selects the JTAG data register for loading code into the instruction cache.

Debugger Actions

“Dynamic Code Download Synchronization” on page 124 describes the details for implementing the handshake in the debug handler. Execution of the debug handler starts when the application running on the IXP42X product line and IXC1100 control plane processors generates a debug exception or when the host generates an external debug interrupt. While the DBGTX JTAG instruction is in the JTAG IR (see “DBGTX JTAG Command” on page 105), the host polls DBG_SR[0] and waits for the debug handler to set it.

When the debug handler reaches the point where it is safe to begin the code download, it writes to TX, which automatically sets DBG_SR[0].

Debug Handler Actions

Mini-Instruction Cache Overview
Halt Mode Software Protocol

Starting a Debug Session
Implementing a Debug Handler
Ending a Debug Session

Software Debug Notes and Errata
Performance Monitoring

Overview
Register Description
Clock Counter (CCNT)

The host waits for the debug handler to signal that it is ready. Execution is redirected to the debug handler so that the debugger can perform any necessary initialization. The debug handling code does not need to be specially mapped to avoid this problem.

For all three methods, the downloaded code is executed in the context of the debug handler.

Table 51. Debug-Handler Code to Implement Synchronization During Dynamic Code Download

Performance Count Registers (PMN0 - PMN3)
Performance Monitor Control Register (PMNC)
Interrupt Enable Register (INTEN)
Overflow Flag Status Register (FLAG)
Event Select Register (EVTSEL)

Managing the Performance Monitor

An interrupt request is generated when a counter's overflow flag is set and its associated interrupt enable bit is set in INTEN. The interrupt request remains asserted until software clears the overflow flag by writing a one to the flag that is set. (Note that the product-specific interrupt controller and the CPSR must have enabled the interrupt for software to receive it.)

This can be done in the interrupt service routine (ISR): incrementing a memory location each time the interrupt occurs extends the counter and allows longer performance monitoring runs.
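The overflow-counting idea can be sketched in software as follows. Everything here is a model for illustration: the structure and function names are hypothetical, the counter is assumed to be 32 bits wide, and clearing overflow_flag in the ISR stands in for the write-one-to-clear access described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t hw_count;  /* models a 32-bit hardware counter (assumed width) */
    uint32_t overflows; /* memory location incremented by the ISR */
    bool overflow_flag; /* models the counter's overflow flag */
    bool int_enable;    /* models the counter's interrupt enable bit */
} perf_counter_t;

/* Advances the modeled counter and sets the overflow flag on wrap-around. */
void perf_tick(perf_counter_t *pc, uint32_t events)
{
    uint32_t before = pc->hw_count;
    pc->hw_count += events;
    if (pc->hw_count < before) {
        pc->overflow_flag = true;
    }
}

/* An interrupt is requested only when the flag and its enable are both set. */
bool perf_irq_pending(const perf_counter_t *pc)
{
    return pc->overflow_flag && pc->int_enable;
}

/* ISR body: count the overflow, then clear the flag to drop the request. */
void perf_isr(perf_counter_t *pc)
{
    if (perf_irq_pending(pc)) {
        pc->overflows++;
        pc->overflow_flag = false;
    }
}

/* Combines the software overflow count with the live counter value. */
uint64_t perf_total(const perf_counter_t *pc)
{
    return ((uint64_t)pc->overflows << 32) | pc->hw_count;
}
```

With this arrangement the effective counter width grows from 32 to 64 bits, at the cost of one short interrupt per wrap-around.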

Performance Monitoring Events

Instruction Cache Efficiency Mode
Data Cache Efficiency Mode
Instruction Fetch Latency Mode
Data/Bus Request Buffer Full Mode
Stall/Write-Back Statistics
Instruction TLB Efficiency Mode
Data TLB Efficiency Mode

PMN0 collects the number of cycles when the instruction cache is unable to deliver an instruction to the IXP42X product line and IXC1100 control plane processors due to an instruction cache miss or instruction TLB miss. One derived statistic is the average number of cycles the processor stalled on a data cache access that may overflow the data cache buffers. PMN1 counts the number of instruction TLB table walks, which occur when there is a TLB miss.

PMN1 counts the number of data TLB table-walks that occur when there is a TLB miss.

Multiple Performance Monitoring Run Statistics

The total number of requests to return data to external memory can only be retrieved with PMN1. PMN0 is the total number of instructions that were executed, which does not include instructions that were translated from the instruction TLB and were never executed. PMN0 is the total number of data memory accesses, which includes cache and non-cache accesses, data mini-cache accesses, and accesses made to locations configured as data RAM.

Note that STM and LDM each count as several accesses to the data TLB, depending on the number of registers specified in the register list.

Examples

This can happen if a branch instruction changes the program flow; the instruction TLB may translate the instructions that sequentially follow the branch before it receives the branch's target address. The average number of cycles taken to execute an instruction is commonly referred to as cycles per instruction (CPI). Assume that performance count interrupts are the only IRQ in the system.

MRC P14,0,R1,C0,c1,0 ; read the PMNC register

In the example above, the instruction cache had a miss rate of 5% and CPI was 2.4.
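The two statistics quoted above are simple ratios of counter values. The sketch below shows the arithmetic only; the function names are hypothetical, and it assumes that one counter holds the number of executed instructions, another holds instruction cache misses, and CCNT holds elapsed cycles. The raw counts used in the usage note are made-up values chosen to reproduce the quoted results.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles per instruction: elapsed cycles divided by executed instructions. */
double cycles_per_instruction(uint32_t cycles, uint32_t instructions)
{
    return instructions ? (double)cycles / (double)instructions : 0.0;
}

/* Instruction cache miss rate: misses divided by executed instructions. */
double icache_miss_rate(uint32_t misses, uint32_t instructions)
{
    return instructions ? (double)misses / (double)instructions : 0.0;
}
```

For instance, 2400 cycles, 1000 executed instructions, and 50 misses give a CPI of 2.4 and a miss rate of 5%.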

Programming Model

ARM* Architecture Compatibility
ARM* Architecture Implementation Options

Big Endian versus Little Endian
Thumb
ARM* DSP-Enhanced Instruction Set
Base Register Update

Extensions to ARM* Architecture

DSP Coprocessor 0 (CP0)
New Page Attributes
Additions to CP15 Functionality
Event Architecture
Exception Summary

The Intel XScale processor maintains the ARM definitions for C and B encoding when X = 0, which differs from ARM products. The value set in R14_ABORT (Abort Mode Link Register) is the address of the aborted instruction + 4. A lock abort is a precise data abort; the extended status field of the Fault Status Register is set to 0xb10100.

The Fault Address Register is undefined, and R14_ABORT holds the address of the aborted instruction + 8.

Table 62. Multiply with Internal Accumulate Format

Performance Considerations

Interrupt Latency
Branch Prediction
Addressing Modes
Instruction Latencies

Performance Terms
Branch Instruction Timings
Multiply Instruction Timings
Saturated Arithmetic Instructions
Status Register Access Instructions
Load/Store Instructions
Semaphore Instructions
Coprocessor Instructions
Miscellaneous Instruction Timing
Thumb Instructions

Issue latency: the cycle distance from the first issue clock of the current instruction to the issue clock of the next instruction. Result latency: the cycle distance from the first issue clock of the current instruction to the issue clock of the first instruction that can use the result without incurring a stall due to a resource dependency.

Note: If the next instruction must use the result of the data processing for a shift by immediate or as Rn in a QDADD or QDSUB, one extra cycle of result latency is added to the listed number.
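The adjustment in the note can be written as a one-line rule. This is an illustrative helper only: the enum and function names are hypothetical, and the listed latency is whatever value the timing tables give for the producing instruction.

```c
#include <assert.h>

typedef enum {
    USE_NORMAL,              /* ordinary operand use */
    USE_SHIFT_BY_IMMEDIATE,  /* result is shifted by an immediate */
    USE_RN_OF_QDADD_QDSUB    /* result is Rn of a QDADD or QDSUB */
} result_use_t;

/* Returns the listed result latency plus the extra cycle from the note
 * when the consumer uses the result in one of the two special ways. */
unsigned effective_result_latency(unsigned listed, result_use_t use)
{
    return listed + (use == USE_NORMAL ? 0u : 1u);
}
```

A scheduler could use such a rule to decide how many independent instructions to place between a producer and its consumer.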

Optimization Guide

Introduction

About This Section

Processors’ Pipeline

General Pipeline Characteristics
Instruction Flow Through the Pipeline
Main Execution Pipeline
Memory Pipeline
Multiply/Multiply Accumulate (MAC) Pipeline

Basic Optimizations

Conditional Instructions
Bit Field Manipulation
Optimizing the Use of Immediate Values
Optimizing Integer Multiply and Divide
Effective Use of Addressing Modes

Cache and Prefetch Optimizations

Instruction Cache
Data and Mini Cache
Cache Considerations
Prefetch Considerations

Instruction Scheduling

Scheduling Loads
Scheduling Data Processing Instructions
Scheduling Multiply Instructions
Scheduling SWP and SWPB Instructions
Scheduling the MRA and MAR Instructions (MRRC/MCRR)
Scheduling the MIA and MIAPH Instructions
Scheduling MRS and MSR Instructions
Scheduling CP15 Coprocessor Instructions

Optimizing C Libraries
Optimizations for Size

Space/Performance Trade Off

One of the biggest differences between the IXP42X product line and IXC1100 control plane processors and ARM processors is the pipeline. This section briefly describes the structure and behavior of the pipeline of the IXP42X product line and IXC1100 control plane processors. Instructions on the IXP42X product line and IXC1100 control plane processors can selectively modify the state of the condition codes.

In the case of the IXP42X product line and the IXC1100 control plane processors, a branch misprediction incurs a penalty of four cycles.
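The four-cycle penalty makes the cost of mispredictions easy to estimate. The sketch below is simple arithmetic under that single figure; the function names and the example rates are hypothetical, and real code would take the misprediction counts from profiling.

```c
#include <assert.h>
#include <stdint.h>

#define MISPREDICT_PENALTY_CYCLES 4u /* penalty stated in the text */

/* Total cycles lost to a given number of mispredicted branches. */
uint64_t mispredict_penalty_cycles(uint64_t mispredicted_branches)
{
    return mispredicted_branches * MISPREDICT_PENALTY_CYCLES;
}

/* Average cycles lost per branch for a misprediction rate in [0, 1]. */
double average_cycles_lost_per_branch(double mispredict_rate)
{
    return mispredict_rate * (double)MISPREDICT_PENALTY_CYCLES;
}
```

For example, 250 mispredicted branches cost 1000 cycles, and a 10% misprediction rate costs about 0.4 cycles per branch on average.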