VLC architecture, Dedicated Cache Coherency Interface (CCI), Crossbar-less configuration – NEC INTEL 5800/1000 User Manual



VLC Architecture

High-speed / low latency Intra-Cell cache-to-cache data transfer

The Express5800/1000 series server implements the VLC architecture, which allows low-latency cache-to-cache data transfer between multiple CPUs within a Cell.

In a split BUS architecture, a cache-to-cache data transfer must pass through a chipset. In the VLC architecture, by contrast, the CPUs can access each other's cache memory directly, bypassing the chipset. This lowers the latency between the cache memories, which results in faster data transfers.

Dedicated Cache Coherency Interface (CCI)

High-speed / low latency Inter-Cell cache-to-cache data transfer

Another technology implemented in the Express5800/1000 series
server to improve cache-to-cache data transfer is the Cache
Coherency Interface (CCI). CCI, the inter-Cell counterpart of the
VLC architecture, allows for a lower latency cache-to-cache data
transfer between Cells.

Information containing the location and state of cached data is
required for the CPU to access the specific data stored in cache
memory. By accessing the cache memory according to this
information, the CPU is able to retrieve the desired data.

Two main mechanisms exist for cache-to-cache data transfer between Cells: directory-based and TAG-based cache coherency. The cache information described above is stored in external memory (DIR memory) in the directory-based mechanism, and within the chipset in the TAG-based mechanism.

In a directory-based system, the requesting CPU first accesses the external memory to confirm the location of the cached data, and then accesses the appropriate cache memory. In a TAG-based system, on the other hand, the requesting CPU broadcasts a request to all other caches simultaneously via the TAG.
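
The two lookup flows described above can be sketched in a few lines of Python. This is a minimal illustration only, not NEC's implementation: the `Cache` class, the function names, the CPU names, and the address `0x40` are all hypothetical, and real coherency protocols also track line states, which this sketch omits.

```python
from dataclasses import dataclass, field


@dataclass
class Cache:
    """Hypothetical per-CPU cache: maps an address to its cached data."""
    name: str
    lines: dict = field(default_factory=dict)


def directory_lookup(directory, caches, address):
    """Directory-based flow: read DIR memory first, then access one cache."""
    steps = ["read DIR memory"]
    owner = directory.get(address)
    if owner is None:
        return None, steps
    steps.append(f"access {owner}")
    return caches[owner].lines[address], steps


def tag_broadcast_lookup(caches, address, requester):
    """TAG-based flow: one broadcast reaches every other cache at once."""
    steps = ["broadcast request"]
    data = None
    for name, cache in caches.items():
        if name != requester and address in cache.lines:
            data = cache.lines[address]
    return data, steps


# Illustrative setup (assumed values): cpu1 holds the newest copy of 0x40.
caches = {
    "cpu0": Cache("cpu0"),
    "cpu1": Cache("cpu1", {0x40: "payload"}),
    "cpu2": Cache("cpu2"),
}
directory = {0x40: "cpu1"}

print(directory_lookup(directory, caches, 0x40))
print(tag_broadcast_lookup(caches, 0x40, requester="cpu0"))
```

In this sketch the directory flow takes two sequential steps (directory read, then cache access), while the broadcast flow takes one step that touches every other cache.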

Crossbar-less configuration

Improved data transfer latency through direct attached Cell configuration

Within the Express5800/1000 series server lineup, the 1080Rf lowers data transfer latency by removing the crossbar and directly connecting Cell to Cell, and Cell to PCI box.

Even with the crossbar-less configuration, virtualization of the Cell card and I/O box has been retained so as not to diminish computing and I/O resources.
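
The effect of removing the crossbar can be illustrated by counting link hops in two small topologies. The sketch below is an assumption-laden toy: the two-Cell layout, the node names, and the link lists are hypothetical, and hop count is only a rough stand-in for latency.

```python
from collections import deque


def hops(links, src, dst):
    """Breadth-first search: fewest links between src and dst, or None."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Assumed topologies for illustration (not taken from NEC documentation).
WITH_CROSSBAR = [("cell0", "crossbar"), ("cell1", "crossbar"), ("pci_box", "crossbar")]
DIRECT_ATTACHED = [("cell0", "cell1"), ("cell0", "pci_box"), ("cell1", "pci_box")]

print(hops(WITH_CROSSBAR, "cell0", "cell1"))
print(hops(DIRECT_ATTACHED, "cell0", "cell1"))
```

Under these assumptions, every Cell-to-Cell or Cell-to-PCI-box path crosses two links with the crossbar and one link without it.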

[Figure: Very Large Cache (VLC) Architecture compared with Split BUS Architecture. The panels plot latency against data size for the Intel® Itanium® 2 processor (Madison: L3 9MB) and the Dual-Core Intel® Itanium® processor (Montvale: L3 24MB). Plot regions are labeled L3, L3 of other CPU, L3 of other CPU on same FSB, L3 of other CPU on different FSB, and Memory.

VLC Architecture: direct CPU-to-CPU transfers over the FSB. High-speed cache-to-cache transfers. Increased enterprise application performance through reduced cache memory access latency.

Split BUS Architecture: cache-to-cache data passes between FSBs through the chipset (data transfer controller). Overhead from transferring data through the chipset. Higher cache memory access latency (approx. 3x). Non-uniform cache-to-cache data transfer. Inconsistent performance. Annotation: "This area increases due to the increase in cache size and higher latency."

This image does not depict actual numbers.]
The benefit of the TAG-based mechanism, which the Express5800/1000 series server implements, is that consulting the TAG filters out unnecessary inquiries to the cache memory, giving a smoother transfer of data. Furthermore, the Express5800/1000 series server includes a dedicated high-speed cache coherency interface (CCI), which connects the Cells directly to one another without using a crossbar. This interface carries broadcasts and other cache coherency transactions, allowing even faster cache-to-cache data transfer.
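
The filtering role of the TAG can be sketched as a small lookup table that records which CPUs hold each cache line, so that only those caches are queried. This is a simplified illustration under stated assumptions: the `TagMemory` class, its method names, and the CPU names are hypothetical, and the real TAG memory also manages cache line state, which is not modeled here.

```python
class TagMemory:
    """Hypothetical tag store: address -> set of CPUs caching that line."""

    def __init__(self):
        self._holders = {}

    def record_fill(self, cpu, address):
        """Note that a CPU has loaded the line into its cache."""
        self._holders.setdefault(address, set()).add(cpu)

    def record_evict(self, cpu, address):
        """Note that a CPU no longer caches the line."""
        self._holders.get(address, set()).discard(cpu)

    def holders(self, address):
        """Return a copy of the set of CPUs currently caching the line."""
        return set(self._holders.get(address, set()))


def caches_to_snoop(tag, all_cpus, requester, address):
    """Return only the CPUs whose caches actually need to be queried."""
    return sorted((tag.holders(address) & set(all_cpus)) - {requester})


cpus = ["cpu0", "cpu1", "cpu2", "cpu3"]
tag = TagMemory()
tag.record_fill("cpu2", 0x80)

# Only cpu2 is queried; cpu1 and cpu3 are filtered out by the tag lookup.
print(caches_to_snoop(tag, cpus, "cpu0", 0x80))
```

Without the tag lookup, a request from `cpu0` would have to query all three other caches; with it, a line that no other CPU holds triggers no cache inquiries at all.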

[Figure: Tag Based Cache Coherency compared with Directory Based Cache Coherency.

Legend: CPU requesting the information; CPU storing the newest information; memory storing location information regarding the memory; TAG memory (manages cache line information for all of the CPUs loaded on a Cell card); DIR memory (manages cache line information for all of the memory loaded on a Cell card).

Tag Based Cache Coherency: the request is broadcast to all CPUs simultaneously. The Express5800/1000 series server implements a dedicated connection (CCI) for snooping. Performance increase with the A³ chipset.

Directory Based Cache Coherency: the directory is accessed first to confirm the location of the data, and then the appropriate cache memory is accessed.]