CPU L1 cache bus

zchieply

Newcomer
I have noticed that almost all CPU cache designs have a narrower L1 bus (128-bit) than the L2 and L3 buses (256-bit). This seems like a bottleneck to me.
 
It's the access latency that's more important in this case. And not all designs are similar -- any particular CPU architecture in mind?
 
Further clarification will be needed on what is meant by the term "L1 bus".
There isn't a bus per se. Current x86 L1 caches frequently have two 128-bit ports, working at full clock speed.
Often, the bus between the L1 and L2 does not transfer data on every clock, but this also depends on the implementation.
 
I have noticed that almost all CPU cache designs have a narrower L1 bus (128-bit) than the L2 and L3 buses (256-bit). This seems like a bottleneck to me.

What exactly do you mean by L1 cache bus? It's worth noting that no modern desktop CPU actually has "a 128-bit L1 bus". The load-store units operate directly on the L1, which is banked and, assuming no bank conflicts, can usually support two 128-bit reads and one 128-bit write per clock. So depending on how you count, that's 384 bits per clock.

A very important point is that the L1<->L2 bus, and everything after it, almost exclusively moves full cache lines, 64 bytes at a time, whereas the L1 is often accessed by words or even single bytes. So several independent, narrower paths are better than a single wide one that can only be fully used when you go full tilt on vector data.
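
To put rough numbers on this, here is a minimal sketch in Python. The port counts and widths are the ones quoted above (two 128-bit reads plus one 128-bit write per clock, 64-byte lines). The 256-bit L1<->L2 width is taken from the original question and is an assumption, not a figure for any specific CPU, and the function names are made up for illustration.

```python
# Illustrative arithmetic only: the figures are the ones quoted in this
# thread, not measurements of any particular CPU.

L1_READ_PORTS = 2       # 128-bit read ports (assumes no bank conflicts)
L1_WRITE_PORTS = 1      # 128-bit write port
L1_PORT_BITS = 128
L2_BUS_BITS = 256       # assumption: width from the original question
CACHE_LINE_BYTES = 64


def l1_peak_bits_per_clock():
    """Peak L1 traffic per clock if every port is busy."""
    return (L1_READ_PORTS + L1_WRITE_PORTS) * L1_PORT_BITS


def clocks_per_line(bus_bits=L2_BUS_BITS):
    """Number of bus transfers needed to move one full cache line."""
    line_bits = CACHE_LINE_BYTES * 8
    return -(-line_bits // bus_bits)  # ceiling division


def port_utilization(access_bytes, port_bits=L1_PORT_BITS):
    """Fraction of one port's width actually used by a single access."""
    return min(access_bytes * 8, port_bits) / port_bits


print(l1_peak_bits_per_clock())  # 384
print(clocks_per_line())         # 2
print(port_utilization(4))       # 0.25
```

A 4-byte load uses only a quarter of one 128-bit port, so a single very wide path would sit mostly idle on scalar code, while three independent 128-bit ports can serve three such accesses in the same clock.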
 
smaller, faster bus internally = finer granularity = finer control over what should and shouldn't be in the cache

registers: fine granularity, small transfers, faster access
main memory: coarse granularity, larger transfers, slower access

(by fast/slow I mean latency more than throughput)

various levels of caching represent a sliding scale between those extremes
granularity may be some combination of actual bus width and burst reads
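
As a rough illustration of that last point, the smallest unit a level moves can be modelled as bus width times burst length. This is a sketch only: the function name is hypothetical, and the example figures (a 64-bit memory channel with a burst of 8, a 256-bit bus taking 2 transfers) are assumptions chosen because both work out to one 64-byte cache line.

```python
def transfer_granularity_bytes(bus_bits, burst_length):
    """Bytes moved per request: bus width (in bytes) times burst length."""
    return (bus_bits // 8) * burst_length


# Assumed examples, not tied to a specific product:
print(transfer_granularity_bytes(64, 8))    # 64  narrow bus, long burst
print(transfer_granularity_bytes(256, 2))   # 64  wide bus, short burst
print(transfer_granularity_bytes(128, 1))   # 16  single 128-bit access
```

Two quite different bus widths end up with the same 64-byte granularity, which is why width alone says little about where a level sits on the scale.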

Cell was interesting with its 128-bit registers acting like a level zero: e.g. read a data structure of 32-bit scalars into regs, then further decompose/calculate/recompose with reg-reg instructions instead of bus operations. Again, it's all part of the sliding scale between granularity and throughput as you move further away from the execution units.
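
The load-wide-then-work-in-registers pattern can be modelled in a few lines of Python. This is only a toy model, not Cell code: the "register" is a plain integer, `lane` and `pack_lanes` are hypothetical helpers standing in for reg-reg shuffle instructions, and the data values are made up.

```python
import struct

# Hypothetical "memory": one structure of four 32-bit unsigned scalars (16 bytes).
memory = struct.pack("<4I", 10, 20, 30, 40)

# One 128-bit load: the whole structure enters a "register" in a single transfer.
reg = int.from_bytes(memory[0:16], "little")


def lane(reg128, i):
    """Extract 32-bit lane i by shift and mask (stands in for a reg-reg op)."""
    return (reg128 >> (32 * i)) & 0xFFFFFFFF


def pack_lanes(lanes):
    """Recombine four 32-bit lanes into one 128-bit value."""
    out = 0
    for i, value in enumerate(lanes):
        out |= (value & 0xFFFFFFFF) << (32 * i)
    return out


lanes = [lane(reg, i) for i in range(4)]   # decompose
doubled = [v * 2 for v in lanes]           # calculate
result = pack_lanes(doubled)               # recompose
stored = result.to_bytes(16, "little")     # one 128-bit store

print(struct.unpack("<4I", stored))        # (20, 40, 60, 80)
```

The memory side sees exactly one 16-byte read and one 16-byte write; all the fine-grained 32-bit work happens on the register value.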
 