Nonamer, the VUs on the EE only have Micro-memories, which are not HW caches, and in total they amount to only 40 KB of SRAM (for instructions and data).
My point still stands, and in fact at the end of your latest post you basically seem to understand what I was trying to highlight.
Perhaps I was being too vague and oversimplified the situation. I was really referring to the cache/eDRAM. Main RAM could never be enough; only the on-die data storage could be made fast enough. Registers are mostly irrelevant (what's the point of having so many FPUs if registers could feed them?)
What is the point of having so many FPUs if registers could feed them? I am sorry, but I do not quite see what point you are trying to make with a statement like that.
The point of having so many FPUs is, practically speaking, to be able to "crunch a lot of math" each cycle.
You say "duh" to me and then say "well in some cases ( Note: namely multimedia applications which is one of the major areas CELL is targeted to ) this power can be used, but not with just any kind of code".
A "duh" would be a good prize for that too
Now don't get me wrong. I'm sure if there's anything like a 1TFLOP of power in the PS3 it can all be used. Just not for general applications.
So you are telling me that for non-multimedia or non-vector-friendly applications we would not reach 1 TFLOPS?
That is some horrible news: I do not know if Word 2005 and Excel 2005 will be able to run then... oh no, if I start Mozilla too, performance will surely slow down to a crawl...
Meanwhile, for graphics, vector processing in general, and multimedia applications, CELL will be able to flex its muscles much better.
It does look like that is what CELL was meant for, and not a "whoopsie" on the CELL designers' part.
Registers are not irrelevant: just because you got a little too accustomed to an 8 GPRs + FP stack architecture (cough... IA-32... cough) does not mean that bigger register files cannot help.
You know it already: a good memory hierarchy is one that takes good care of the design of each step of a relatively long chain. No step of the hierarchy is designed with the naive idea that it will negate the need for the level below it, as the cost and density trade-offs of each level are well known.
The more realistic purpose is to take pressure off the successive steps of the hierarchy in the best way possible, considering the applications the processor is targeted at.
To make a long story short (this has been debated before, since the point was already brought up by Deadmeat): the idea is to provide enough registers that LOAD/STORE operations from/to the LS (Local Storage, 128 KB of SRAM in each APU) are kept to the lowest number possible.
You do not see as many main memory LOAD/STORE instructions in register-heavy architectures such as IPF as you see in common x86 code; guess why...
The LS might not be exactly as fast as the registers, but thanks to the good amount of registers (basically 32x128-bit registers for each of the 4 FP/FX unit groups in each APU) the pressure on it will be reduced.
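Just to illustrate the register pressure point in plain C (nothing CELL-specific here; the function names are mine, and the volatile array is only a crude way to model spill traffic): the same four dot products are computed twice, once with the accumulators forced through memory on every iteration, once with the accumulators kept in plain locals a compiler can hold in registers. The second version only touches memory to stream the input in, which is exactly what a big register file buys you.

```c
#include <stdio.h>

#define N 1024

float a[4][N], b[N];

/* Accumulators forced through memory: one extra load + store per
 * element per accumulator, i.e. the "pressure" on the next memory
 * level when there are not enough registers to go around. */
void dot4_spilled(float out[4]) {
    volatile float acc[4] = {0, 0, 0, 0};   /* volatile models a spill slot */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < 4; k++)
            acc[k] = acc[k] + a[k][i] * b[i];
    for (int k = 0; k < 4; k++) out[k] = acc[k];
}

/* Accumulators kept in registers: memory is touched only to stream
 * a[][] and b[] in. */
void dot4_in_registers(float out[4]) {
    float acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    for (int i = 0; i < N; i++) {
        float bi = b[i];
        acc0 += a[0][i] * bi;
        acc1 += a[1][i] * bi;
        acc2 += a[2][i] * bi;
        acc3 += a[3][i] * bi;
    }
    out[0] = acc0; out[1] = acc1; out[2] = acc2; out[3] = acc3;
}

int main(void) {
    for (int i = 0; i < N; i++) {
        b[i] = 1.0f;
        for (int k = 0; k < 4; k++) a[k][i] = (float)(k + 1);
    }
    float r[4];
    dot4_in_registers(r);
    printf("%f %f %f %f\n", r[0], r[1], r[2], r[3]);
    return 0;
}
```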
Oh, btw, the APU's functional units cannot directly process operands from the e-DRAM: everything they process has to be contained in the LS (programs basically have to be subdivided into 128 KB chunks to achieve optimal efficiency).
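In other words, the working model is "stream a chunk into LS, compute on it, stream it back out". Here is a minimal sketch of that pattern in C, assuming nothing about the real CELL programming interface: dma_get/dma_put, LS_CHUNK and process_in_chunks are made-up names, and the "DMA" is just memcpy standing in for whatever mechanism actually moves data between the e-DRAM and the Local Storage.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LS_CHUNK (128 * 1024)  /* assumed 128 KB of Local Storage used as a data buffer */

/* Stand-ins for the real e-DRAM <-> LS transfer mechanism (hypothetical
 * names; here they are plain memcpy for illustration only). */
static void dma_get(void *ls, const void *edram, size_t n) { memcpy(ls, edram, n); }
static void dma_put(void *edram, const void *ls, size_t n) { memcpy(edram, ls, n); }

/* Process a large buffer living in e-DRAM/main memory by streaming it
 * through the LS one chunk at a time: the functional units only ever
 * see data that has first been pulled into local_buf. */
void process_in_chunks(unsigned char *edram_buf, size_t total) {
    static unsigned char local_buf[LS_CHUNK];   /* models the APU Local Storage */
    for (size_t off = 0; off < total; off += LS_CHUNK) {
        size_t n = (total - off < LS_CHUNK) ? (total - off) : LS_CHUNK;
        dma_get(local_buf, edram_buf + off, n);    /* pull chunk into LS  */
        for (size_t i = 0; i < n; i++)             /* compute only on LS  */
            local_buf[i] = (unsigned char)(local_buf[i] + 1);
        dma_put(edram_buf + off, local_buf, n);    /* push results back   */
    }
}

int main(void) {
    size_t total = 300 * 1024;                 /* a bit more than two chunks */
    unsigned char *buf = calloc(total, 1);     /* stands in for e-DRAM data  */
    if (!buf) return 1;
    process_in_chunks(buf, total);
    printf("buf[0] = %u, buf[%zu] = %u\n", buf[0], total - 1, buf[total - 1]);
    free(buf);
    return 0;
}
```

In practice you would presumably double-buffer (pull chunk N+1 while still computing on chunk N) so the transfers hide behind the computation, which is what all that e-DRAM bandwidth is there for.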
Pressure is taken away from the e-DRAM because we have that many GPRs per APU and a good amount of LS per APU, so the TB/s of bandwidth the e-DRAM provides is not what cripples CELL.
Nobody ever said that the VUs were extremely inefficient because they were so unbalanced...
Well, compared to a VU1, each APU has 4x the amount of Micro-memory and 4x the amount of registers, it is trying to feed fewer execution units (VU1 can, for example, feed up to 5 parallel FMACs), and it has MUCH higher bandwidth connecting it to the next step of the memory hierarchy.