Caching is one of the things that can boost performance by exploiting spatial and temporal locality of an unpredictable workload. So one has to wonder why Sony, again, has chosen an architecture without caches. Of course the scratchpad is going to function as an explicit cache, but it's still no substitute.
First... I do not understand this... when people say that Sony is helping IBM with CELL together with Toshiba, everybody answers that this is all an IBM project which Sony begged for...
Then suddenly it's all Sony's fault for every questionable decision...
sigh
This is not directed at you, Gubbi...
however let's see again what you said...
Caching is one of the things that can boost performance by exploiting spatial and temporal locality of an unpredictable workload.
First, we have to remember, as far as PS3 is concerned, that its main purpose will be related to 3D graphics and vector calculations... this involves constantly streaming massive quantities of data in and out, with this data not spending a huge amount of time in the processor being processed over and over ( of course more advanced vertex programs will do more and more work per vertex, reducing the speed at which the data needs to stream in and out )...
We might have good spatial locality, as we can organize triangle data in memory to get it, but temporal locality would not be the major factor we want to take advantage of...
Large buffers/memory pools might be even MORE useful than caches: they can take advantage of spatial locality by prefetching a bit aggressively, and software caching can be done to take advantage of whatever temporal locality the code we are running offers ( the efficiency of software caching will maybe not be astounding compared to having an extra dedicated cache, but it can help ).
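To make the "prefetching a bit aggressively" idea concrete, here is a minimal sketch of double-buffered streaming through a small local buffer. Everything in it is an assumption for illustration: the names `stream_process` and `fetch`, the tiny block size, and the plain Python list standing in for main RAM. It only shows the pattern of requesting block n+1 before working on block n, not how any real DMA engine behaves.

```python
# Hypothetical sketch: double-buffered streaming through a small local buffer.
# Names and sizes are illustrative only, not taken from any real hardware API.

LOCAL_BUFFER_ITEMS = 4  # pretend the local buffer holds two blocks of this size


def stream_process(main_memory, transform):
    """Process main_memory block by block, requesting block n+1 before working on block n."""
    results = []
    fetches = 0

    def fetch(start):
        # Stands in for a transfer from main RAM into the local buffer.
        nonlocal fetches
        fetches += 1
        return main_memory[start:start + LOCAL_BUFFER_ITEMS]

    current = fetch(0)
    position = 0
    while current:
        position += LOCAL_BUFFER_ITEMS
        # Issue the next fetch before touching the current block (the prefetch).
        upcoming = fetch(position) if position < len(main_memory) else []
        results.extend(transform(x) for x in current)
        current = upcoming
    return results, fetches


vertices = list(range(10))
out, fetch_count = stream_process(vertices, lambda v: v * 2)
print(out, fetch_count)
```

Because the data is touched once and in order, spatial locality is all that gets exploited here, which is the streaming case the post is talking about.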
See Flipper's big Texture Cache vs the Graphics Synthesizer's bigger VRAM ( e-DRAM )... think about heavy render-to-texture operations ( or wanting to save temporary results in a local buffer )... Flipper will have to access the external main RAM while the Graphics Synthesizer will have the VRAM to write to...
I do believe that if the EE's RISC core SPRAM had been 32-64 KB ( preferably 64 KB ), VU0's micro-memories 16 KB each and VU1's micro-memories 32 KB each, the need for a fat L2 would have been much smaller, as a lot of problems could have been avoided by better use of these local buffers.
Managing the buffers as caches is not easy and can lead to poor caching performance, but it can be done ( PS2 developers did so with the VUs and the RISC core's SPRAM )...
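As a rough illustration of what "managing the buffers as caches" means, here is a sketch of a direct-mapped software cache sitting in front of a slow backing store. The `SoftwareCache` class, the line count and the line size are all made up for the example; this is not how PS2 developers actually laid out SPRAM, just the general shape of the technique.

```python
# Hypothetical sketch of a direct-mapped software cache kept in a scratchpad.
# Class name, sizes and layout are assumptions made for illustration.

class SoftwareCache:
    def __init__(self, backing, num_lines=4, line_size=4):
        self.backing = backing            # stands in for slow main RAM
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines    # which memory line each slot holds
        self.lines = [None] * num_lines   # the cached data itself
        self.hits = 0
        self.misses = 0

    def read(self, address):
        line_number = address // self.line_size
        slot = line_number % self.num_lines
        if self.tags[slot] != line_number:
            # Miss: the tag check happens in software, then the line is copied in.
            self.misses += 1
            start = line_number * self.line_size
            self.lines[slot] = self.backing[start:start + self.line_size]
            self.tags[slot] = line_number
        else:
            self.hits += 1
        return self.lines[slot][address % self.line_size]


ram = list(range(100, 164))
cache = SoftwareCache(ram)
values = [cache.read(a) for a in (0, 1, 2, 3, 0, 16, 0)]
print(values, cache.hits, cache.misses)
```

Addresses 0 and 16 map to the same slot, so the last read of address 0 misses again. That kind of conflict is one way a hand-managed buffer ends up with poor caching performance, and every tag check costs instructions that a hardware cache would do for free.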
Let's look at the memory hierarchy of a single PE ( and let's start from an APU )...
We have 128 KB of Local Storage ( SRAM memory ) per APU and each APU has 128 x 128-bit registers. If what we want is not in the registers and not in the local storage ( 128 KB is 4x the total Instruction+Data micro-memory VU1 had and 8x the SPRAM the RISC core of the EE had ), we could always look in the LS of other APUs in the same PE ( I am not 100% sure of this because I have not yet found a decisive enough statement in the patent regarding this, I will look more for it though ), then we might look at the PU's cache ( the PU should have small but existing L1 caches )...
Then we have the 64 MB of DRAM to look in for the data we need, and that is quite a big space ( Xbox's main RAM is 64 MB )... still, the way I see it, part of the DRAM would be used as a prefetch buffer for data so that we have a much higher chance of not having to wait for the external memory to slowly provide us with the wanted piece of data... ( the external memory won't be slow, but it will surely not be as fast as the embedded DRAM )...
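The lookup order described above can be written down as a small table. This is only a sketch of the order as given in this post: the step through other APUs' Local Storage is the unconfirmed guess mentioned above, the "7 other APUs" figure assumes 8 APUs per PE, and sizes the post does not state are left as `None`. The `find_level` helper and the `resident` mapping are hypothetical names invented for the example.

```python
# Sketch of the lookup order for a single PE, as described in the post.
# Capacities are in bytes; None means the post does not give a size.
HIERARCHY = [
    ("APU registers", 128 * 16),                       # 128 registers x 16 bytes
    ("own Local Storage", 128 * 1024),
    ("other APUs' Local Storage (unconfirmed)", 7 * 128 * 1024),
    ("PU L1 cache", None),
    ("embedded DRAM", 64 * 1024 * 1024),
    ("external RAM", None),
]


def find_level(resident, name):
    """Return the first level, in lookup order, that currently holds `name`."""
    for level, _capacity in HIERARCHY:
        if name in resident.get(level, set()):
            return level
    return None


# Hypothetical placement of two pieces of data.
resident = {
    "own Local Storage": {"matrix"},
    "embedded DRAM": {"matrix", "mesh"},
}
print(find_level(resident, "matrix"), "|", find_level(resident, "mesh"))
```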
Making some quick calculations...
8 APUs/PE * 128 KB/APU * 4 PEs/BE = 4 MB of SRAM used only as local storage ( LS )
8 APUs/PE * 128 registers/APU * 16 bytes/register * 4 PEs/BE = 64 KB of space with registers alone
And then we have 64 MB of fast DRAM whose bus is 1,024 bits wide too...
And then we have external RAM with like 12-20 GB/s of bandwidth we keep and keep on streaming from ( we might have a bit more by 2005 )...
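For anyone who wants to check the quick calculations, here they are spelled out. The constant names are mine; the figures are the ones used in this post ( 8 APUs per PE, 4 PEs per BE, 128 KB of LS and 128 registers of 16 bytes per APU, a 1,024-bit eDRAM bus ). No clock speed is assumed, so the bus line only gives bytes per transfer, not bandwidth.

```python
# Re-doing the post's arithmetic; constant names are illustrative.
APUS_PER_PE = 8
PES_PER_BE = 4
LS_BYTES_PER_APU = 128 * 1024
REGISTERS_PER_APU = 128
BYTES_PER_REGISTER = 16          # 128 bits
EDRAM_BUS_BITS = 1024

total_ls_bytes = APUS_PER_PE * LS_BYTES_PER_APU * PES_PER_BE
total_register_bytes = (
    APUS_PER_PE * REGISTERS_PER_APU * BYTES_PER_REGISTER * PES_PER_BE
)
edram_bytes_per_transfer = EDRAM_BUS_BITS // 8

print(total_ls_bytes // (1024 * 1024), "MB of Local Storage")
print(total_register_bytes // 1024, "KB of registers")
print(edram_bytes_per_transfer, "bytes per eDRAM bus transfer")
```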
The resources are there... and let's not forget the benefit that local RAM has over caches: you can read and write it directly... the bus to main RAM is busy? Not to worry, you have the RAM right there...