MCF548x Reference Manual, Rev. 3

3.2.1.1.1 Branch Acceleration

To maximize the performance of conditional branch instructions, the IFP implements a sophisticated two-level acceleration mechanism. The first level is an 8-entry, direct-mapped branch cache with 2 bits per entry indicating one of four prediction states (strongly or weakly, taken or not-taken). The branch cache also provides the association between instruction addresses and the corresponding target addresses. In the event of a branch cache hit, if the branch is predicted as taken, the branch cache sources the target address from the IC1 stage back into the IAG to redirect the prefetch stream to the new location.
The branch cache implements instruction folding, so conditional branch instructions correctly predicted as taken can execute in zero cycles. For conditional branches with no information in the branch cache, a second-level, direct-mapped prediction table is accessed. Each of its 128 entries uses the same 2-bit prediction mechanism as the branch cache.
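
The four prediction states and the direct-mapped lookup can be illustrated with a short C model. This is a conceptual sketch only, not Freescale's hardware: the entry layout, tag width, indexing, and update policy below are assumptions made for illustration.

    #include <stdint.h>

    #define BC_ENTRIES 8                      /* first-level branch cache entries */

    enum pred_state {                         /* the four 2-bit prediction states */
        STRONG_NOT_TAKEN, WEAK_NOT_TAKEN, WEAK_TAKEN, STRONG_TAKEN
    };

    struct bc_entry {
        uint32_t tag;                         /* upper instruction address bits   */
        uint32_t target;                      /* associated branch target address */
        enum pred_state state;
        int valid;
    };

    static struct bc_entry branch_cache[BC_ENTRIES];

    /* Direct-mapped lookup: low-order instruction address bits select the entry;
     * a tag match in a "taken" state yields a predicted target address.         */
    static int bc_predict(uint32_t pc, uint32_t *target)
    {
        struct bc_entry *e = &branch_cache[(pc >> 1) & (BC_ENTRIES - 1)];

        if (e->valid && e->tag == (pc >> 4) && e->state >= WEAK_TAKEN) {
            *target = e->target;              /* redirect the prefetch stream     */
            return 1;
        }
        return 0;                             /* miss or predicted not-taken      */
    }

    /* Conventional saturating update toward the observed outcome (assumed here;
     * the manual only names the four states).                                    */
    static enum pred_state pred_update(enum pred_state s, int taken)
    {
        if (taken)
            return (s == STRONG_TAKEN) ? STRONG_TAKEN : s + 1;
        return (s == STRONG_NOT_TAKEN) ? STRONG_NOT_TAKEN : s - 1;
    }

The second-level prediction table can be modeled the same way, with 128 entries and no target field, since it supplies only the taken/not-taken prediction.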
If a branch is predicted as taken, branch acceleration logic in the IED stage generates the target address. Other change-of-flow instructions, including unconditional branches, jumps, and subroutine calls, use a similar mechanism in which the IFP calculates the target address. The performance of the subroutine return instruction (RTS) is improved through the use of a four-entry, LIFO hardware return stack. In all cases, these mechanisms allow the IFP to redirect the fetch stream down the predicted path well ahead of instruction execution.
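
As a rough illustration of how such a return stack behaves, the following C sketch models a four-entry LIFO that records return addresses on subroutine calls and supplies the predicted target for RTS. The push/pop policy, including what happens when the stack is full or empty, is an assumption and is not taken from the manual.

    #include <stdint.h>

    #define RAS_DEPTH 4                       /* four-entry hardware return stack */

    static uint32_t ras[RAS_DEPTH];
    static int ras_count;                     /* number of valid entries          */

    /* A subroutine call (e.g. BSR/JSR) pushes its return address. */
    static void ras_push(uint32_t return_addr)
    {
        if (ras_count < RAS_DEPTH)
            ras[ras_count++] = return_addr;   /* this sketch simply drops pushes when full */
    }

    /* An RTS pops the most recent entry as its predicted target. */
    static int ras_pop(uint32_t *predicted_target)
    {
        if (ras_count == 0)
            return 0;                         /* no prediction available          */
        *predicted_target = ras[--ras_count];
        return 1;
    }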

3.2.1.2 Operand Execution Pipeline (OEP)

The two instruction registers in the decode stage (DS) of the OEP are loaded from the FIFO instruction buffer or are bypassed directly from the instruction early decode (IED). The OEP consists of two traditional, two-stage RISC compute engines with a dual-ported register file access feeding an arithmetic logic unit (ALU).
The compute engine at the top of the OEP (the address ALU) is typically used for operand address calculations; the execution ALU at the bottom is used for instruction execution. The resulting structure provides 4 Gbytes/s of operand bandwidth (at 162 MHz) to the two compute engines and supports single-cycle execution speeds for most instructions, including all load and store operations and most embedded-load operations. The V4 OEP supports the ColdFire Revision B instruction set, which adds a few new instructions to improve performance and code density.
The OEP also implements the following advanced performance features:

• Stalls are minimized by dynamically basing the choice between the address ALU or the execution ALU for instruction execution on the pipeline state.

• The address ALU and register renaming resources together can execute heavily used opcodes and forward results to subsequent instructions with no pipeline stalls.

• Instruction folding involving MOVE instructions allows two instructions to be issued in one cycle. The resulting microarchitecture approaches full superscalar performance at a much lower silicon cost.

3.2.1.2.1 Illegal Opcode Handling

To aid in conversion from M68000 code, every 16-bit operation word is decoded to ensure that each instruction is valid. If the processor attempts execution of an illegal or unsupported instruction, an illegal instruction exception (vector 4) is taken.
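
Software that needs to catch this condition can install a handler at vector 4 of the exception vector table. The C sketch below is illustrative only: the function and parameter names are hypothetical, and only the vector number comes from the text above.

    #define ILLEGAL_INSTRUCTION_VECTOR 4      /* vector number from the text above */

    /* Hypothetical handler; a real one would typically log or skip the
     * faulting instruction, or terminate the offending task.              */
    extern void illegal_instruction_handler(void);

    /* vector_table points at the base of the exception vector table
     * (wherever the vector base register has been set to point).          */
    static void install_illegal_handler(void (**vector_table)(void))
    {
        vector_table[ILLEGAL_INSTRUCTION_VECTOR] = illegal_instruction_handler;
    }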

3.2.1.2.2 Enhanced Multiply/Accumulate (EMAC) Unit

The EMAC unit in the Version 4e provides hardware support for a limited set of digital signal processing (DSP) operations used in embedded code, while supporting the integer multiply instructions in the
