Cell mass production plan for 2nd half of 2005

"Registers are mostly irrelevant (what's the point of having so many FPUs if registers could feed them?)"

Registers couldn't feed them. My mistake. I don't know what you're talking about in the rest of your post, though.
 
Registers can feed them quite nicely, or let me rephrase: nicely enough not to have all the pressure fall back on the SRAM and then onto the e-DRAM.

I still do not understand the "1 TFLOPS * 32 bits = whoa, the CELL CPU cannot possibly have enough bandwidth" argument...

I still think the way you put it was flawed and if you re-read my last post you should be able to see what I think about the priorities in CELL's design.
 
I imagine if it were a single pipeline (the mutha of all pipelines, evidently) that was doing 1 TFLOP, there would be some real logistical issues to address (similar to what has been implied here). That said, the PS3 architecture may benefit from some amount of delocalization of resources to spread out the bandwidth demands. Just my little theory. Take it FWIW. :)
 
See, the on-die stuff isn't fast enough.

You answered this to my observation about the LS having a bandwidth of 256-512 GB/s... per APU.

Even if the SRAM ran at 2 GHz and had a 128-bit bus to the Register file, this would still be 32 GB/s per APU: I expect a bit more than that though ;) ( the locally aggregated SRAM bus should be a bit wider ).

I expect something more like 256 GB/s of total bandwidth for each APU's SRAM.

Well, that is trying to feed, together with the nice 128x128 bits Registers, only 32 GFLOPS :p

32 GFLOPS * 8 bytes = 256 GB/s

Multiply it by 32, roughly, and you get the aggregate bandwidth for all APUs ;)
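A minimal sketch of the register-feed arithmetic above. All figures are the thread's assumptions (32 GFLOPS per APU, 8 bytes of register traffic per FP op, 32 APUs), not confirmed CELL specs:

```python
# Thread's assumed figures, not confirmed hardware specs.
GFLOPS_PER_APU = 32     # peak FP throughput of one APU
BYTES_PER_FP_OP = 8     # register traffic modeled per FP op
APU_COUNT = 32          # APUs needed for the 1 TFLOPS aggregate

per_apu_gb_s = GFLOPS_PER_APU * BYTES_PER_FP_OP   # 32 * 8 = 256 GB/s per APU
aggregate_gb_s = per_apu_gb_s * APU_COUNT         # 256 * 32 = 8192 GB/s total

print(per_apu_gb_s, aggregate_gb_s)               # 256 8192
```

This is why the register files, not the LS, have to carry the headline bandwidth figure: the 8 TB/s-class aggregate only exists if it is spread across 32 private register files.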
 
Even if the CPU only reaches 500 GFLOPS peak and about 100 GFLOPS sustained, that would still be tremendous.

Exactly.

You then have to count in the GPU's GFLOPS too. Maybe Sony will have PS3 at 1 TFLOPS TOTAL instead of just the CPU? Seems respectable.
 
I don't know why you guys are focusing on registers. It should be obvious that they can provide the necessary bandwidth. I see that by having a lot of them, you can relieve pressure on the Local storage SRAM. But eventually, you'll have to access local storage, which is where the major slowdown can occur. The bandwidth of that is not enough, so aren't you still bandwidth limited?
 
nonamer said:
I don't know why you guys are focusing on registers. It should be obvious that they can provide the necessary bandwidth. I see that by having a lot of them, you can relieve pressure on the Local storage SRAM. But eventually, you'll have to access local storage, which is where the major slowdown can occur. The bandwidth of that is not enough, so aren't you still bandwidth limited?

Consider these two scenarios:

1) the APU's execution units ( each unit is a mixed FX/FP Unit and we have 4 of them in the APU ) have to always access the LS.

2) the APU's execution units have 128 registers to store temp data and access LS a fraction ( even 50% is a fraction of 100%, I am not giving numbers here ) of the time.

Would you say that case 1) or case 2) would put LS's "bandwidth limitation" more to the test ?

Would you say that IPF or x86 ( take a hypothetical Pentium MMX at the same speed as your low-end Itanium 2 ) puts more stress on the memory ( a higher percentage of memory-to-CPU and CPU-to-memory instructions [percentage of LOAD/STORE instructions that touch memory] in the instruction stream ), hence on the cache ?

One of those two processors has many more registers than the other ;)

To make a long story short: registers do matter.

Unless you build a case scenario in which, every cycle, all APUs' execution units are writing back a result to the LS, the LS does not need to have the same bandwidth as the registers do provide.

Let's look at the single APU's case ( valid as 1 TFLOPS would be the aggregate performance of 32 APUs ): it can do a peak of 32 GFLOPS using FP or Integer Vector MADD instructions.

32 GFLOPS * 4 bytes = 128 GB/s

LS bandwidth does not need to be that high ( and still I suspect it to be able to reach >100 GB/s ) as it does not have to provide 12 operands to the execution units every cycle ( a Vector FP MADD requires 3x128-bit vectors and each could hold 32-bit operands: 3 * ( 128 / 32 ) = 12 ) as those are loaded from the Registers.

A Vector FP MADD uses up to 4 registers: Destination = Source_1 * Source_2 + Source_3.

We have 128 Registers in each APU ( a 128x128-bit Register file ).

A Vector MADD normally has a latency of 4 and a throughput of 1 ( each cycle we can execute 1 scalar FP MADD or the 4 required for the Vector MADD instruction ), which means that in pipelined terms we execute 1 Vector FP MADD per cycle, of course ( nothing new there ).

Your point would be that each cycle the LS has to load 256 bits of data into the Register file ( for 8 FP ops ), which does not happen: compared to a more Register-starved architecture, a lot more operations use registers to hold and load the operands, and to store the results ( temporary or not ), so memory bandwidth is less stressed.

If each time we had to load all operands from LS ( and store them there as well ) then you would be right doing "32 GFLOPS * 4 bytes = 128 GB/s".

Still, it would not be far off from the bandwidth LS should achieve.

Pentium 4's L2 cache is now well beyond 90 GB/s: 256 bits * 3 GHz ( Pentium 4 clocked at 3 GHz ) = 96 GB/s.

Why would 2005's CELL Local Storage ( which uses SRAM ) have that much trouble providing 100+ GB/s ?
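The post's worst-case numbers can be spelled out in a few lines. All figures are the thread's assumptions (a 4 GHz APU, 8 FP ops/cycle, 32-bit operands, a Pentium 4 with a 256-bit L2 bus at 3 GHz), not confirmed specs:

```python
# Thread's assumed figures, not confirmed hardware specs.
clock_hz = 4e9           # assumed APU clock
fp_ops_per_cycle = 8     # two 4-wide vector units' worth of FP ops
operand_bytes = 4        # 32-bit operands

gflops = clock_hz * fp_ops_per_cycle / 1e9    # 32 GFLOPS peak
worst_case_ls_gb_s = gflops * operand_bytes   # 128 GB/s if every op hit LS

# Operands a Vector FP MADD would need per cycle with no register reuse:
operands_per_madd = 3 * (128 // 32)           # 3 source vectors of 4 floats = 12

# Pentium 4 L2 comparison from the post: 256-bit bus at 3 GHz.
p4_l2_gb_s = 256 / 8 * 3                      # 96 GB/s

print(gflops, worst_case_ls_gb_s, operands_per_madd, p4_l2_gb_s)
```

The point of the comparison: even the all-operands-from-LS worst case (128 GB/s) is only about a third more than what a contemporary L2 cache already delivered.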
 
Panajev2001a said:
nonamer said:
I don't know why you guys are focusing on registers. It should be obvious that they can provide the necessary bandwidth. I see that by having a lot of them, you can relieve pressure on the Local storage SRAM. But eventually, you'll have to access local storage, which is where the major slowdown can occur. The bandwidth of that is not enough, so aren't you still bandwidth limited?

Consider these two scenarios:

1) the APU's execution units ( each unit is a mixed FX/FP Unit and we have 4 of them in the APU ) have to always access the LS.

2) the APU's execution units have 128 registers to store temp data and access LS a fraction ( even 50% is a fraction of 100%, I am not giving numbers here ) of the time.

Would you say that case 1) or case 2) would put LS's "bandwidth limitation" more to the test ?

Obviously #2, but that's beside the point.

Would you say that IPF or x86 ( take a hypothetical Pentium MMX at the same speed as your low-end Itanium 2 ) puts more stress on the memory ( a higher percentage of memory-to-CPU and CPU-to-memory instructions [percentage of LOAD/STORE instructions that touch memory] in the instruction stream ), hence on the cache ?

One of those two processors has many more registers than the other ;)

To make a long story short: registers do matter.

Given the enormous code bloat of the IPF, I'd say that's a bad analogy. ;) Perhaps there's some confusion here. You're saying that registers matter. I agreed with that from the start, but said that it's not that important (and perhaps confused the number of registers with their bandwidth). I'm saying that while they do matter, they still do not relieve all pressure from the LS, and performance will still suffer.

Unless you build a case scenario in which every cycle, all APUs' execution units are writing back a result to the LS then the LS does not need to have the same bandwidth as the registers do provide.

Let's look at the single APU's case ( valid as 1 TFLOPS would be the aggregate performance of 32 APUs ): it can do a peak of 32 GFLOPS using FP or Integer Vector MADD instructions.

32 GFLOPS * 4 bytes = 128 GB/s

LS bandwidth does not need to be that high ( and still I suspect it to be able to reach >100 GB/s ) as it does not have to provide 12 operands to the execution units every cycle ( a Vector FP MADD requires 3x128-bit vectors and each could hold 32-bit operands: 3 * ( 128 / 32 ) = 12 ) as those are loaded from the Registers.

A Vector FP MADD uses up to 4 registers: Destination = Source_1 * Source_2 + Source_3.

We have 128 Registers in each APU ( 128x128 bits Register file ).

A Vector MADD normally has a latency of 4 and a throughput of 1 ( each cycle we can execute 1 scalar FP MADD or the 4 required for the Vector MADD instruction ), which means that in pipelined terms we execute 1 Vector FP MADD per cycle, of course ( nothing new there ).

Your point would be that each cycle the LS has to load into the Register file 256 bits of data ( for 8 FP ops ) which does not happen: compared to a more Register starved architecture a lot more operations use registers to hold and load the operands from and to store the results ( temporary or not ) and memory bandwidth is less stressed.

If each time we had to load all operands from LS ( and store them there as well ) then you would be right doing "32 GFLOPS * 4 bytes = 128 GB/s".

While you do not have to access the LS every time, you will have to eventually. That's when you have your slowdown. So my question is this: How much do you slow down?

Still, it would not be far off from the bandwidth LS should achieve.

Pentium 4's L2 cache is now well beyond 90 GB/s: 256 bits * 3 GHz ( Pentium 4 clocked at 3 GHz ) = 96 GB/s.

Why would 2005's CELL Local Storage ( which uses SRAM ) have that much trouble providing 100+ GB/s ?

Ok, let's backtrack. According to your previous posts, you've basically said that there's 128/256/512 GB/s per APU. I seem to have confused this with total SRAM bandwidth, which isn't hard to imagine given the concept of that much bandwidth and that wide a bus. You are looking at a 512 * 32 = 16,384-bit wide bus for the LS. Forgive me for not believing in such a thing.
 
nonamer said:
I don't know why you guys are focusing on registers.
Your bandwidth calculation claimed every FP operand access to also be a memory access.
The only situation where this is in fact true is when looking at accesses to registers/the local stack/whatever the said CPU architecture uses as local storage.

On the other hand, you only counted one memory access per operation - I think there should be three - two reads, one write - of course still from the register point of view.
Also, multiply-adds perform 2 FP ops in one operation, so the total number would be divided by two.


As for your question about embedded memory accesses - each APU is supposed to have its own eDRAM pool, so yes, the eDRAM bandwidth aggregates the same way FPU power does.
While you do not have to access the LS every time, you will have to eventually. That's when you have your slowdown. So my question is this: How much do you slow down?
If APUs are anything like VUs in respect to their memory pools, those accesses will be basically free.
 
Here's something INTERESTING from January.

"We intend to launch the successor to PlayStation 2 ahead of our initial schedule, which was drawn up some years ago. This, we believe, will bring us ahead of Microsoft, which is planning a new console towards the end of 2005."

http://www.computerandvideogames.com/r/?page=http://www.computerandvideogames.com/news/news_story.php(que)id=86442

This ties in with the fact that at some conference Nintendo said they would beat everyone to the gate with N5 and some Sony exec went "Not if we have anything to do about it" or something along those lines.
 
nonamer said:
Panajev2001a said:
nonamer said:
I don't know why you guys are focusing on registers. It should be obvious that they can provide the necessary bandwidth. I see that by having a lot of them, you can relieve pressure on the Local storage SRAM. But eventually, you'll have to access local storage, which is where the major slowdown can occur. The bandwidth of that is not enough, so aren't you still bandwidth limited?

Consider these two scenarios:

1) the APU's execution units ( each unit is a mixed FX/FP Unit and we have 4 of them in the APU ) have to always access the LS.

2) the APU's execution units have 128 registers to store temp data and access LS a fraction ( even 50% is a fraction of 100%, I am not giving numbers here ) of the time.

Would you say that case 1) or case 2) would put LS's "bandwidth limitation" more to the test ?

Obviously #2, but that's beside the point.

Would you say that IPF or x86 ( take a hypothetical Pentium MMX at the same speed as your low-end Itanium 2 ) puts more stress on the memory ( a higher percentage of memory-to-CPU and CPU-to-memory instructions [percentage of LOAD/STORE instructions that touch memory] in the instruction stream ), hence on the cache ?

One of those two processors has many more registers than the other ;)

To make a long story short: registers do matter.

Given the enormous code bloat of the IPF, I'd say that's a bad analogy. ;) Perhaps there's some confusion here. You're saying that registers matter. I agreed with that from the start, but said that it's not that important (and perhaps confused the number of registers with their bandwidth). I'm saying that while they do matter, they still do not relieve all pressure from the LS, and performance will still suffer.

Unless you build a case scenario in which, every cycle, all APUs' execution units are writing back a result to the LS, the LS does not need to have the same bandwidth as the registers do provide.

Let's look at the single APU's case ( valid as 1 TFLOPS would be the aggregate performance of 32 APUs ): it can do a peak of 32 GFLOPS using FP or Integer Vector MADD instructions.

32 GFLOPS * 4 bytes = 128 GB/s

LS bandwidth does not need to be that high ( and still I suspect it to be able to reach >100 GB/s ) as it does not have to provide 12 operands to the execution units every cycle ( a Vector FP MADD requires 3x128-bit vectors and each could hold 32-bit operands: 3 * ( 128 / 32 ) = 12 ) as those are loaded from the Registers.

A Vector FP MADD uses up to 4 registers: Destination = Source_1 * Source_2 + Source_3.

We have 128 Registers in each APU ( a 128x128-bit Register file ).

A Vector MADD normally has a latency of 4 and a throughput of 1 ( each cycle we can execute 1 scalar FP MADD or the 4 required for the Vector MADD instruction ), which means that in pipelined terms we execute 1 Vector FP MADD per cycle, of course ( nothing new there ).

Your point would be that each cycle the LS has to load 256 bits of data into the Register file ( for 8 FP ops ), which does not happen: compared to a more Register-starved architecture, a lot more operations use registers to hold and load the operands, and to store the results ( temporary or not ), so memory bandwidth is less stressed.

If each time we had to load all operands from LS ( and store them there as well ) then you would be right doing "32 GFLOPS * 4 bytes = 128 GB/s".

While you do not have to access the LS every time, you will have to eventually. That's when you have your slowdown. So my question is this: how much do you slow down?

Still, it would not be far off from the bandwidth LS should achieve.

Pentium 4's L2 cache is now well beyond 90 GB/s: 256 bits * 3 GHz ( Pentium 4 clocked at 3 GHz ) = 96 GB/s.

Why would 2005's CELL Local Storage ( which uses SRAM ) have that much trouble providing 100+ GB/s ?

Ok, let's backtrack. According to your previous posts, you've basically said that there's 128/256/512 GB/s per APU. I seem to have confused this with total SRAM bandwidth, which isn't hard to imagine given the concept of that much bandwidth and that wide a bus. You are looking at a 512 * 32 = 16,384-bit wide bus for the LS. Forgive me for not believing in such a thing.

First, each LS is separate: each APU has its own LS, so yes, as the FP/FX Units' power aggregates, so does the LS's bandwidth.

Let's focus on a single APU: if 1 TFLOPS was a "problem" for the entire chip ( made of 32 separate APUs ), then 32 GFLOPS will be a "problem" for each APU.

The PE's internal bus, which connects to the LS of each APU, is 1,024 bits wide ( this is how the LS is fed ).

The bus that connects the LS to the Register file is 256 bits wide according to the patent ( bus 408 ).

32 GFLOPS means 128 GB/s, and if we divide that by 4 GHz we obtain 32 bytes/cycle of needed bandwidth.

32 bytes/cycle * 8 bits/byte = 256 bits/cycle.

32 GFLOPS also means 8 FP ops/cycle at 4 GHz, and 8 FP ops/cycle * 32 bits/op = 256 bits/cycle.

How wide did we say bus 408 is ? I thought we said it is 256 bits wide.

Fuzzy math is indeed funny.
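The "fuzzy math" check above can be written out explicitly. The figures are the thread's patent-based assumptions (32 GFLOPS per APU at an assumed 4 GHz, 32-bit ops, a 256-bit bus 408), not confirmed silicon:

```python
# Thread's assumed figures from the patent discussion.
gflops = 32          # assumed per-APU peak
clock_ghz = 4        # assumed clock
bus_408_bits = 256   # LS -> Register file bus width per the patent

fp_ops_per_cycle = gflops / clock_ghz       # 32 / 4 = 8 FP ops/cycle
bits_per_cycle = fp_ops_per_cycle * 32      # 8 * 32 = 256 bits/cycle demanded
bytes_per_cycle = bits_per_cycle / 8        # 32 bytes/cycle

# Worst-case operand demand exactly matches the patent's bus width:
assert bits_per_cycle == bus_408_bits

print(fp_ops_per_cycle, bits_per_cycle, bytes_per_cycle)
```

So under these assumptions, even the one-operand-per-op worst case fits the 256-bit bus exactly, which is the point of the post.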
 
you've basically said that there's 128/256/512 GB/s per APU.

I don't think it needs to be that high, but it seems each APU would have its own local storage, with a 256-bit bus to its own registers. The bus from the APU to eDRAM is supposed to be 1,024 bits wide.

"We intend to launch the successor to PlayStation 2 ahead of our initial schedule, which was drawn up some years ago. This, we believe, will bring us ahead of Microsoft, which is planning a new console towards the end of 2005."

That's interesting indeed. But they were planning to launch PS2 towards the end of 1999 too, and that got delayed to 2000. Well, I'll be looking forward to the end of 2005 then, with 3 new consoles to look forward to :)
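The bus widths mentioned in this thread convert to bandwidth figures straightforwardly. The 4 GHz clock here is the thread's speculation, not a confirmed spec:

```python
# Bus widths from the patent discussion; clock speed is speculative.
clock_ghz = 4
reg_bus_bits = 256       # LS -> register file ( bus 408 )
edram_bus_bits = 1024    # APU -> eDRAM internal bus

reg_bus_gb_s = reg_bus_bits / 8 * clock_ghz       # 256/8 * 4 = 128 GB/s
edram_bus_gb_s = edram_bus_bits / 8 * clock_ghz   # 1024/8 * 4 = 512 GB/s

print(reg_bus_gb_s, edram_bus_gb_s)
```

In other words, no 16,384-bit monster bus is needed: per-APU buses in the 256-1,024-bit range already give the 128-512 GB/s figures being thrown around.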
 
Fafalada said:
nonamer said:
I don't know why you guys are focusing on registers.

Your bandwidth calculation claimed every FP operand access to also be a memory access.

The only situation where this is in fact true is when looking at accesses to registers/the local stack/whatever the said CPU architecture uses as local storage.

On the other hand, you only counted one memory access per operation - I think there should be three - two reads, one write - of course still from the register point of view.

Also, multiply-adds perform 2 FP ops in one operation, so the total number would be divided by two.

As for your question about embedded memory accesses - each APU is supposed to have its own eDRAM pool, so yes, the eDRAM bandwidth aggregates the same way FPU power does.

While you do not have to access the LS every time, you will have to eventually. That's when you have your slowdown. So my question is this: how much do you slow down?

If APUs are anything like VUs in respect to their memory pools, those accesses will be basically free.

For a Vector MADD we would need to fetch 3x128-bit vectors and then store a 128-bit result, so would that not make three reads and one write ?

Nonamer,

What does IPF's code size have to do with the percentage of memory LOAD/STORE instructions in the executable code ?
 
V3 said:
That's interesting indeed. But they were planning to launch PS2 towards the end of 1999 too, and that got delayed to 2000. Well, I'll be looking forward to the end of 2005 then, with 3 new consoles to look forward too :)

I'm not talking to you per se, but I find it ironic that after just arguing with people here who "educated" myself and Marco to the fact that a 2005 launch is impossible - all it takes is one quote by some dude to change the tune.

Hey, I thought it was going to launch in 2007 after starting mass production in 2005, with the chips produced during those two years occupying the slot in Sony's secret warehouse next to the 40M PS2s that they "shipped" but not "sold". WTF? :LOL: So many people are going to eat shit on this console.
 
I'm not talking to you per se, but I find it ironic that after just arguing with people here who "educated" myself and Marco to the fact that a 2005 launch is impossible - all it takes is one quote by some dude to change the tune.

The only problem I have with the quote is that I can't access the article it is from. I'd need to register with that site, which I can't be bothered to do.

So until now, I don't know who that dude is.

So, can someone quote the whole article ?
 
Can't we just all wait and see the final specs...

I mean, it's pretty obvious that, in case Sony figures out that it won't be able to provide enough bandwidth to feed a Cell on steroids, it will just CUT some steroids to make the system more balanced, avoiding useless costs for power that would go unused.

If the bandwidth requirements for a 1 TFLOPS chip (and I'm pretty sure the 1 TFLOPS, IF it ever gets reached, will be a TOTAL figure for the whole system, still pretty impressive) are so high that they cannot provide it at a reasonable price, then they will just downgrade the chip(s) to, and I'm just giving an example, 700 GFLOPS, so they will save on some costs (having to run Cell at high clock speeds will require A LOT of money to get right, so one can assume a downgrade in clock speed would save money).

Whatever comes out, the bandwidth will be just right for the power requirements of the Cell architecture. They would NOT spend gazillions of dollars on R&D for Cell if the final system could be crippled by low bandwidth. Just like they projected figures for TFLOPS and polygons and all that crap, they also have projected figures for "how much bandwidth we can provide at a decent cost". If they see they will only get so much bandwidth, then they will just balance the whole system out. And to be honest, a bit less steroids makes for a more stable and potentially more reliable system.

We'll see...
 
So until now, I don't know who is that dude.

So someone quote the whole article ?

The website has some type of protection on their pages; you can type "ps3 gauntlet" into Google and it's the first result, though.

When he said this, Kenshi Manabe was vice president of SCEI's semiconductor division.

He is now SCE's chief technology officer, taking over from Okamoto.

This guy isn't a nobody, and he has actually said a lot about PS3, more than most.

Quite odd, this new quote I found actually ties in with what me and Vince are saying ;)
 