smaller, faster bus internally, finer granularity = greater accuracy on what should and shouldn't be in the cache
registers: fine granularity, small transfers, faster access
main memory: coarse granularity, larger transfers, slower access
(by fast/slow i mean latency more than throughput)
various levels of caching represent a sliding scale between those extremes
granularity may be some combination of actual bus width and burst reads
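to put a number on "bus width combined with burst reads", here's a minimal sketch. the function name and the 64-bit / 8-beat figures are my own illustrative assumptions, not tied to any particular chip:

```python
def transfer_granularity_bytes(bus_width_bits: int, burst_length: int) -> int:
    """Bytes moved by one burst: bus width in bytes times beats per burst."""
    if bus_width_bits % 8 != 0:
        raise ValueError("bus width must be a whole number of bytes")
    return (bus_width_bits // 8) * burst_length


def useful_fraction(bytes_wanted: int, granularity_bytes: int) -> float:
    """Share of a single transfer that the requester actually asked for."""
    return min(bytes_wanted, granularity_bytes) / granularity_bytes


# assumed example: a 64-bit bus with an 8-beat burst
granularity = transfer_granularity_bytes(64, 8)
print(granularity)
# wanting a single 32-bit scalar out of that transfer
print(useful_fraction(4, granularity))
```

the second function is the "accuracy" side of it: the coarser the transfer, the more of it is stuff you didn't specifically ask for.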
cell was interesting with its 128-bit registers acting like a level zero cache... e.g. read a data structure of 32-bit scalars into regs, then further decompose/calculate/recompose with reg-reg instructions instead of bus operations. again, all part of the sliding scale between granularity and throughput as you move further away from the execution units
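a rough model of that load-once, work-in-registers pattern. this is plain python standing in for the idea, not SPU code; `load_128`, `store_128` and the four-float layout are hypothetical names and assumptions of mine:

```python
import struct

# 16 bytes of "memory" holding an assumed struct of four 32-bit floats
memory = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)


def load_128(mem: bytes, offset: int = 0) -> tuple:
    """One wide transfer: bring 16 bytes in as four 32-bit lanes."""
    return struct.unpack_from("<4f", mem, offset)


def store_128(lanes) -> bytes:
    """One wide transfer back out: four 32-bit lanes to 16 bytes."""
    return struct.pack("<4f", *lanes)


reg = load_128(memory)               # the only read that touches "memory"
x, y, z, w = reg                     # decompose, register to register
scaled = (x * w, y * w, z * w, w)    # calculate, still no memory traffic
out = store_128(scaled)              # recompose and write back in one go
```

the point is just the shape: one coarse transfer in, several fine-grained steps on the lanes, one coarse transfer out.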