See, the on-die stuff isn't fast enough.
nonamer said: Registers are mostly irrelevant. Even if the CPU only reaches 500 GFLOPs peak and about 100 GFLOPs sustained, that would still be tremendous.
Fafalada said: Well, that explains how you reached your bandwidth requirement conclusion...
nonamer said: I don't know why you guys are focusing on registers. It should be obvious that they can provide the necessary bandwidth. I see that by having a lot of them, you can relieve pressure on the Local storage SRAM. But eventually, you'll have to access local storage, which is where the major slowdown can occur. The bandwidth of that is not enough, so aren't you still bandwidth limited?
Panajev2001a said:
nonamer said: I don't know why you guys are focusing on registers. It should be obvious that they can provide the necessary bandwidth. I see that by having a lot of them, you can relieve pressure on the Local storage SRAM. But eventually, you'll have to access local storage, which is where the major slowdown can occur. The bandwidth of that is not enough, so aren't you still bandwidth limited?
Consider these two scenarios:
1) the APU's execution units ( each unit is a mixed FX/FP Unit and we have 4 of them in the APU ) have to always access the LS.
2) the APU's execution units have 128 registers to store temp data and access LS a fraction ( even 50% is a fraction of 100%, I am not giving numbers here ) of the time.
Would you say that case 1) or case 2) would put LS's "bandwidth limitation" more to the test ?
Would you say that IPF or x86 ( take a hypothetical Pentium MMX at the same speed of your low end Itanium 2 ) puts more stress on the memory ( higher percentage of memory-to-CPU and CPU-to-memory instructions [percentage of LOAD/STORE instructions that touch memory] in the instruction stream ), hence on the cache ?
One of those two processors has many more registers than the other
To make a long story short: registers do matter.
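To put rough numbers on the two scenarios, here is a minimal Python sketch; the 50% register-reuse fraction is an arbitrary assumption ( the post itself stresses it is just "a fraction" ), and the 4-bytes-per-FLOP accounting is the one used later in this post, so treat the output as illustration rather than a CELL specification:

```python
# Back-of-the-envelope LS traffic for the two scenarios above (illustrative only).
PEAK_GFLOPS = 32       # one APU's peak with Vector FP/FX MADDs (from this post)
BYTES_PER_FLOP = 4     # the "32 GFLOPS * 4 bytes" accounting used below

def ls_traffic_gb_per_s(ls_fraction: float) -> float:
    """GB/s that actually hit the LS when only `ls_fraction` of the operand
    traffic misses the 128-entry register file (1.0 = scenario 1)."""
    return PEAK_GFLOPS * BYTES_PER_FLOP * ls_fraction

print(ls_traffic_gb_per_s(1.0))   # 128.0 GB/s -> scenario 1, every access goes to LS
print(ls_traffic_gb_per_s(0.5))   # 64.0 GB/s  -> scenario 2 with an assumed 50% reuse
```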
Unless you build a scenario in which, every cycle, all of the APU's execution units are writing a result back to the LS, the LS does not need to have the same bandwidth that the registers provide.
Let's look at the single APU's case ( valid as 1 TFLOPS would be the aggregate performance of 32 APUs ): it can do a peak of 32 GFLOPS using FP or Integer Vector MADD instructions.
32 GFLOPS * 4 bytes = 128 GB/s
LS bandwidth does not need to be that high ( and still I suspect it to be able to reach >100 GB/s ) as it does not have to provide 12 operands to the execution units every cycle ( a Vector FP MADD requires 3x128 bits source Vectors and each can hold four 32 bits operands: 3 * ( 128 / 32 ) = 12 ) as those operands are loaded from the Registers.
A Vector FP MADD uses up to 4 registers: Destination = Source_1 * Source_2 + Source_3.
We have 128 Registers in each APU ( 128x128 bits Register file ).
A Vector MADD normally has a latency of 4 and a throughput of 1 ( each cycle we can execute 1 scalar FP MADD or the 4 required for the Vector MADD instruction ), which means that in pipelined terms we do execute 1 Vector FP MADD per cycle of course ( nothing new there ).
Your point would be that each cycle the LS has to load 256 bits of data into the Register file ( for 8 FP ops ), which does not happen: compared to a more Register-starved architecture, a lot more operations use registers to hold the operands they load and to store their results ( temporary or not ), so memory bandwidth is less stressed.
If each time we had to load all operands from LS ( and store them there as well ) then you would be right doing "32 GFLOPS * 4 bytes = 128 GB/s".
Still, it would not be far off from the bandwidth LS should achieve.
Pentium 4's L2 cache is now well beyond 90 GB/s: 256 bits * 3 GHz ( Pentium 4 clocked at 3 GHz ) = 96 GB/s.
Why would 2005's CELL Local Storage ( which uses SRAM ) have that much trouble providing 100+ GB/s ?
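As a sanity check, the arithmetic from the post above can be reproduced in a few lines of Python; every figure here is taken from the post itself, this is just the same math written out:

```python
# Reproducing the figures quoted in the post above.

# Worst case assumed there: one 4-byte LS access per FLOP.
worst_case_ls_gb_s = 32 * 4            # 32 GFLOPS * 4 bytes = 128 GB/s

# Operands a Vector FP MADD would need each cycle if nothing stayed in registers:
operands_per_vmadd = 3 * (128 // 32)   # 3 source vectors of four 32-bit words = 12

# The Pentium 4 L2 comparison: a 256-bit bus at 3 GHz.
p4_l2_gb_s = (256 // 8) * 3            # 32 bytes/cycle * 3 GHz = 96 GB/s

print(worst_case_ls_gb_s, operands_per_vmadd, p4_l2_gb_s)   # 128 12 96
```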
nonamer said:
Panajev2001a said:
nonamer said: I don't know why you guys are focusing on registers. It should be obvious that they can provide the necessary bandwidth. I see that by having a lot of them, you can relieve pressure on the Local storage SRAM. But eventually, you'll have to access local storage, which is where the major slowdown can occur. The bandwidth of that is not enough, so aren't you still bandwidth limited?
Consider these two scenarios:
1) the APU's execution units ( each unit is a mixed FX/FP Unit and we have 4 of them in the APU ) have to always access the LS.
2) the APU's execution units have 128 registers to store temp data and access LS a fraction ( even 50% is a fraction of 100%, I am not giving numbers here ) of the time.
Would you say that case 1) or case 2) would put LS's "bandwidth limitation" more to the test ?
Obviously #2, but that's beside the point.
Would you say that IPF or x86 ( take a hypothetical Pentium MMX at the same speed of your low end Itanium 2 ) puts more stress on the memory ( higher percentage of memory-to-CPU and CPU-to-memory instructions [percentage of LOAD/STORE instructions that touch memory] in the instruction stream ), hence on the cache ?
One of those two processors has many more registers than the other
To make a long story short: registers do matter.
Given the enormous code bloat of the IPF, I'd say that's a bad analogy. Perhaps there's some confusion here. You're saying that registers matter. I agreed with that from the start, but said that it's not that important (and perhaps I confused the number of registers with their bandwidth). I'm saying that while they do matter, they still don't relieve all the pressure from the LS, and performance will still suffer.
Unless you build a scenario in which, every cycle, all of the APU's execution units are writing a result back to the LS, the LS does not need to have the same bandwidth that the registers provide.
Let's look at the single APU's case ( valid as 1 TFLOPS would be the aggregate performance of 32 APUs ): it can do a peak of 32 GFLOPS using FP or Integer Vector MADD instructions.
32 GFLOPS * 4 bytes = 128 GB/s
LS bandwidth does not need to be that high ( and still I suspect it to be able to reach >100 GB/s ) as it does not have to provide 12 operands to the execution units every cycle ( a Vector FP MADD requires 3x128 bits source Vectors and each can hold four 32 bits operands: 3 * ( 128 / 32 ) = 12 ) as those operands are loaded from the Registers.
A Vector FP MADD uses up to 4 registers: Destination = Source_1 * Source_2 + Source_3.
We have 128 Registers in each APU ( 128x128 bits Register file ).
A Vector MADD normally has a latency of 4 and a throughput of 1 ( each cycle we can execute 1 scalar FP MADD or the 4 required for the Vector MADD instruction ), which means that in pipelined terms we do execute 1 Vector FP MADD per cycle of course ( nothing new there ).
Your point would be that each cycle the LS has to load 256 bits of data into the Register file ( for 8 FP ops ), which does not happen: compared to a more Register-starved architecture, a lot more operations use registers to hold the operands they load and to store their results ( temporary or not ), so memory bandwidth is less stressed.
If each time we had to load all operands from LS ( and store them there as well ) then you would be right doing "32 GFLOPS * 4 bytes = 128 GB/s".
While you do not have to access the LS every time, you will have to eventually. That's when you have your slowdown. So my question is this: How much do you slow down?
Still, it would not be far off from the bandwidth LS should achieve.
Pentium 4's L2 cache is now well beyond 90 GB/s: 256 bits * 3 GHz ( Pentium 4 clocked at 3 GHz ) = 96 GB/s.
Why would 2005's CELL Local Storage ( which uses SRAM ) have that much trouble providing 100+ GB/s ?
Ok, let's backtrack. According to your previous posts, you've basically said that there's 128/256/512 GB/s per APU. I seem to have confused this with total SRAM bandwidth, which isn't hard to imagine given the concept of that much bandwidth and that wide of a bus. You are looking at a 512 * 32 = 16384 bit wide bus for the LS. Forgive me for not believing in such a thing.
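For what it's worth, here is one hypothetical way to read those per-APU numbers as bus widths. The 4 GHz clock is an assumption inferred from the 32 GFLOPS peak ( 8 FLOPs per Vector MADD issue per cycle ), and the 32-APU aggregation mirrors the 1 TFLOPS figure earlier in the thread, so treat the widths as illustrations rather than anything official:

```python
# Hypothetical LS port widths implied by a per-APU bandwidth figure.
# Assumes a 4 GHz APU clock (32 GFLOPS peak / 8 FLOPs per Vector MADD per cycle).
CLOCK_GHZ = 4
APUS = 32

def port_width_bits(per_apu_gb_s: int) -> int:
    """Bits per cycle one APU's LS port must move to sustain `per_apu_gb_s` GB/s."""
    return per_apu_gb_s // CLOCK_GHZ * 8

for gb_s in (128, 256, 512):
    w = port_width_bits(gb_s)
    print(f"{gb_s} GB/s per APU -> {w}-bit port, {w * APUS} bits across {APUS} APUs")
# 128 GB/s -> 256-bit port  (8192 bits aggregate)
# 256 GB/s -> 512-bit port  (16384 bits aggregate, the width objected to above)
# 512 GB/s -> 1024-bit port (32768 bits aggregate)
```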
"We intend to launch the successor to PlayStation 2 ahead of our initial schedule, which was drawn up some years ago. This, we believe, will bring us ahead of Microsoft, which is planning a new console towards the end of 2005."
Fafalada said:
nonamer said: I don't know why you guys are focusing on registers.
Your bandwidth calculation claimed every FP operand access to also be a memory access.
The only situation where this is in fact true is when looking at accesses to the registers/local stack/whatever the said CPU architecture uses as local storage.
On the other hand, you only counted one memory access per operation; I think there should be three (two reads, one write), still from the register point of view of course.
Also, multiply-adds perform 2 FP ops in one operation, so the total number would be divided by two.
As for your question about embedded memory accesses: each APU is supposed to have its own eDRAM pool, so yes, the eDRAM bandwidth aggregates the same way FPU power does.
nonamer said: While you do not have to access the LS every time, you will have to eventually. That's when you have your slowdown. So my question is this: How much do you slow down?
If APUs are anything like VUs in respect to their memory pools, those accesses will be basically free.
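Fafalada's correction can also be written out as arithmetic; assuming the same 32 GFLOPS peak and 4-byte operands from earlier in the thread, and counting three register-level accesses per operation with each MADD doing 2 FLOPs, the adjusted figure looks like this ( register-file traffic, not LS traffic ):

```python
# Operand-traffic accounting per Fafalada's correction, for one APU.
PEAK_FLOPS = 32e9        # 32 GFLOPS peak
BYTES_PER_OPERAND = 4    # 32-bit operands, as in the earlier calculation

naive_gb_s = PEAK_FLOPS * BYTES_PER_OPERAND / 1e9           # one access per FLOP -> 128.0
ops_per_s = PEAK_FLOPS / 2                                   # a MADD counts as 2 FLOPs
corrected_gb_s = ops_per_s * 3 * BYTES_PER_OPERAND / 1e9     # two reads + one write per op

print(naive_gb_s, corrected_gb_s)   # 128.0 192.0 (both at the register level, not the LS)
```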
V3 said: That's interesting indeed. But they were planning to launch PS2 towards the end of 1999 too, and that got delayed to 2000. Well, I'll be looking forward to the end of 2005 then, with 3 new consoles to look forward to.
I'm not talking to you per se, but I find it ironic that after just arguing with people here who "educated" myself and Marco about the fact that a 2005 launch is impossible, all it takes is one quote from some dude to change the tune.
So until now, I don't know who that dude is.
So, could someone quote the whole article ?