ISSCC 2005

Megadrive1988 said:
http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318

One unconfirmed report claims that at the extreme end of the frequency/voltage/power spectrum, one sample CELL processor was observed to operate at 5.6 GHz with 1.4 V Vdd and consumed 180 W of power.

Put two of them inside the PS3 please so I don't freeze when it's cold outside. :D

Fredi
 
nAo:
Could be that the GPU has its own interface to a chunk of XDR DRAM. The Flex I/O could be used to link the GPU and the CELL in a similar manner as two Opterons are linked with HyperTransport. The local RAM would have better latency, but bandwidth should be about the same.

That kind of structure would also allow for more than 256MB RAM (4 devices on both CELL and the GPU)

Cheers
Gubbi
 
5.6GHz is crazy! I can only imagine what speeds this will run at in future iterations, beyond the next generation (that is, if they keep the Cell architecture and don't go with a new one next time around).
 
Can I ask a couple of Qs about the estimations of peak polygon transform rate that have been discussed above? What are you considering to be the smallest transform? And what instructions do you need to use to do said transform? Are you just taking their cycles, adding them up, dividing into the clockspeed and multiplying by 8? Are you considering loading/storing of vertices in your calculations? Pipelining? Sorry, I'm a little bit of a noob as far as this is concerned (I tend to work at a much higher level with my code, unfortunately ;)) - if you could walk through the math of how these figures are being derived, I'd greatly appreciate it :) Thanks..
 
Fafalada said:
Well - having to spend the same number of instructions & time on a (rotational) matrix*vector transform as on a 2-vector dot product is what greatly upsets software ppl.

Absolutely, no dot-product instruction sucks. It's even more relevant for non-graphics ops, where it's often impossible to do more than one dot product at a time (for example AI angle or distance calcs etc). So effectively you divide your theoretical flops by 3 or 4....

Grrr.... Just after I'd got used to being able to issue a dot product every cycle.
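
To make the cost concrete (just a sketch in plain C, not actual SPU code): with 4-wide madds but no horizontal dot-product instruction, a lone dot product only gets a fraction of useful work out of each vector op, while batching four of them structure-of-arrays keeps every lane busy - which is exactly why the isolated AI-style cases hurt.

[code]
#include <stdio.h>

/* One isolated dot product: the 4 muls fill every SIMD lane, but summing
 * across the lanes then costs extra whole-vector ops that only produce a
 * single useful scalar - hence the "divide your flops by 3 or 4". */
float dot_aos(const float a[4], const float b[4])
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}

/* Four dot products at once, structure-of-arrays: every line below stands
 * in for one 4-wide madd where all lanes produce wanted results, so the
 * full 8-flops-per-cycle peak is actually reachable. */
void dot4_soa(const float ax[4], const float ay[4], const float az[4],
              const float aw[4], const float bx[4], const float by[4],
              const float bz[4], const float bw[4], float out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = ax[i]*bx[i] + ay[i]*by[i] + az[i]*bz[i] + aw[i]*bw[i];
}

int main(void)
{
    float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, out[4];
    printf("single dot: %.1f\n", dot_aos(a, b));   /* 70.0 */
    dot4_soa(a, a, a, a, b, b, b, b, out);         /* four dots in one pass */
    printf("batched:    %.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
[/code]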
 
nAo said:
About the GPU:
If Nvidia is not going to adapt their next-generation design to embrace eDRAM, what are they going to do?
It's a given they are modifying the GPU to interface it nicely with the CELL CPU.
At this time we know the CELL CPU can have a max of 256 MBytes / 25.6 GB/s. The amount of RAM can be changed..but what about the external bandwidth?
25.6 GB/s can be doubled with more modules and more channels, or by doubling the channel frequency. One thing is for sure, 25.6 GB/s is too little to sustain a CELL CPU and a modern GPU (without eDRAM).
Is it feasible to have another 256 MB of XDRAM for the GPU (that amount of RAM is needed to give the GPU decent bandwidth)? It seems quite odd to me. Are we going to see a PS3 GPU coupled with GDDR3/4 RAM?
What I mean is that maybe the eDRAM thing wasn't that bad ;)
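
To put numbers on it (a rough sketch - I'm assuming x16 XDR devices at 3.2 Gbit/s per pin, 512Mbit each, which is how I read the figures floating around, not confirmed specs):

[code]
#include <stdio.h>

int main(void)
{
    /* Assumptions, not confirmed specs: 512Mbit x16 XDR devices, 3.2 Gbit/s per pin. */
    const double pin_rate_gbps   = 3.2;
    const int    pins_per_device = 16;
    const int    mbit_per_device = 512;
    const double dev_bw_gbs = pin_rate_gbps * pins_per_device / 8.0;  /* 6.4 GB/s */
    const int    dev_mb     = mbit_per_device / 8;                    /* 64 MB    */

    for (int devices = 2; devices <= 8; devices *= 2)
        printf("%d devices: %3d MB, %4.1f GB/s\n",
               devices, devices * dev_mb, devices * dev_bw_gbs);
    /* 2 -> 128 MB / 12.8 GB/s, 4 -> 256 MB / 25.6 GB/s, 8 -> 512 MB / 51.2 GB/s */
    return 0;
}
[/code]

So doubling either the device count or the pin rate doubles the 25.6 GB/s, but only by also doubling the amount (or the cost) of the RAM hanging off each chip.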

ciao,
Marco


The Cell CPU can distribute its own resources over the network, so the GPU must be similar; hence the GPU won't be a continuation of the PC GPU, but rather a number of smaller render processors, each with its own eDRAM.
 
Gubbi said:
nAo:
Could be that the GPU has its own interface to a chunk of XDR DRAM. The Flex I/O could be used to link the GPU and the CELL in a similar manner as two Opterons are linked with HyperTransport. The local RAM would have better latency, but bandwidth should be about the same.

That kind of structure would also allow for more than 256MB RAM (4 devices on both CELL and the GPU)
I thought about that, but the problem is that you have to mount 256MB of XDRAM to give the GPU a decent amount of bandwidth. Not that I wouldn't be
happy with such a big pool of graphics DRAM..but isn't it overkill, cost- and performance-wise? That way we are going to have 8 memory modules on the PS3.. :oops:
 
london-boy said:
How about 256MB for the BE and either 128 or 256 for the GPU?
128 MB gives just 12.8 GB/s for the GPU. Is it enough for an eDRAM-less GPU? I don't think so
 
Fafalada said:
Or for that matter, calculating the vector square length being as expensive as a full matrix*vector transform is equally annoying.
It's rather funny to see you programmer types talk in terms of "expensive", and 'ooh, we need one of this, and one of that, and those ones would be nice too, and...' Geez, you guys! You have a chip with EIGHT fricken vector supercomputers on it to play with. So what if it lacks a unit X in the odd pipe or an instruction Y in the ISA, I think you'll manage anyway with well over a quarter teraflops at 4.6GHz... :D:p;)

I mean, if you're not HAPPY, we can always take away all your toys and stick a ZX-81 in your hands instead. See how you'll like THAT! :devilish:
 
nAo said:
happy with such a big pool of graphics DRAM..but isn't it overkill, cost- and performance-wise? That way we are going to have 8 memory modules on the PS3.. :oops:

Well, they could use 256Mbit devices instead of 512 ones, that would leave 128MB hanging off the GPU. Having a 1280x720 HDR back buffer and frame buffer with 4x or 8x MSAA would require 30-44MB, leaving the rest for textures and geometry.
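
Roughly like this, for anyone who wants the arithmetic (my assumptions: FP16 RGBA colour at 8 bytes/pixel, 32-bit Z, only the back buffer multisampled, no compression - exact totals obviously shift with the formats chosen):

[code]
#include <stdio.h>

int main(void)
{
    const double MB = 1024.0 * 1024.0;
    const int w = 1280, h = 720;
    const int color_bytes = 8;   /* assumed FP16 RGBA            */
    const int z_bytes     = 4;   /* assumed 32-bit depth/stencil */

    for (int msaa = 1; msaa <= 8; msaa *= 2) {
        double back  = (double)w * h * msaa * (color_bytes + z_bytes); /* multisampled */
        double front = (double)w * h * color_bytes;                    /* resolved     */
        printf("%dx MSAA: %5.1f MB\n", msaa, (back + front) / MB);
    }
    /* ~17.6 / 28.1 / 49.2 / 91.4 MB for 1x/2x/4x/8x under these assumptions;
       whatever is left of the 128MB goes to textures and geometry. */
    return 0;
}
[/code]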

Cheers
Gubbi
 
Gubbi said:
Well, they could use 256Mbit devices instead of 512 ones, that would leave 128MB hanging off the GPU.
I thought 256Mbit device production was canned :?:
 
nAo said:
Gubbi said:
Well, they could use 256Mbit devices instead of 512 ones, that would leave 128MB hanging off the GPU.
I thought 256Mbit device production was canned :?:

I don't know, wasn't that just inferred when 512Mbit devices were announced?

Anyway, Sony is probably a big enough customer to have any DRAM manufacturer at their beck and call; they will need in excess of 400 million devices over 4 years after all.

Or Sony could use 512Mbit devices and have 256MB hanging off the GPU; it's not like it would be exclusive to the GPU, the CELL MPU would still be able to address it directly through the Flex I/O (albeit with worse latency).

All speculation BTW.

Cheers
Gubbi
 
DeanoC said:
Grrr.... Just after I'd got used to being able to issue a dot product every cycle.
You and me both... :p

Titanio said:
Are you considering loading/storing of vertices in your calculations?
It's trivial math - the housekeeping stuff is on the odd pipeline and should generally just fit within the length of the minimum transform loop. Whether you can input/output transform data through the SPUs fast enough is another matter altogether - but no one said anything about drawing or using them, it was just speculation on how fast one can transform stuff.
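
To spell the rest out for Titanio (purely illustrative numbers, not measured SPU timings): assume a 4x4 matrix*vertex costs 4 madds on the even pipe per vertex, with the splats, loads and stores hidden on the odd pipe, and the whole estimate is just:

[code]
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions only. */
    const double clock_hz                = 4.6e9; /* the ISSCC sample clock        */
    const int    spus                    = 8;
    const int    even_cycles_per_vertex  = 4;     /* 4 madds for matrix*vector,
                                                     odd-pipe work assumed hidden */

    double verts_per_sec = clock_hz / even_cycles_per_vertex * spus;
    printf("%.1f billion vertices/s peak\n", verts_per_sec / 1e9);  /* 9.2 */
    return 0;
}
[/code]

Getting vertices in and out of the local stores fast enough is the part that doesn't show up in that arithmetic, which is the "another matter altogether" bit.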

Guden said:
So what if it lacks a unit X in the odd pipe or an instruction Y in the ISA, I think you'll manage anyway with well over a quarter teraflops at 4.6GHz...
There's trivial and then there's non-trivial omissions. One could design an ISA with nothing but floating point ADDers and still have it run at theoretical 256GFlops.
It would be retarded - but hey, only the whiney programmers would complain about it right? :?
 
Fafalada said:
One could design an ISA with nothing but floating point ADDers and still have it run at theoretical 256GFlops.
It would be retarded - but hey, only the whiney programmers would complain about it right? :?

Yes, STOP WHINING :)

David Wang over at Real World Tech posted an article yesterday in which he stated that the SPU ISA is a subset of VMX (AltiVec). AltiVec has a dot-sum (dot product) instruction, so don't despair, it might be in CELL.

But it will definitely have worse latency than FMADD since it requires a 4-input add after the 4 muls.

BTW, the RWT article will be up again at 9AM EST.

Cheers
Gubbi
 
Gubbi said:
David Wang over at Real World Tech posted an article yesterday in which he stated that the SPU ISA is a subset of VMX (AltiVec). AltiVec has a dot-sum (dot product) instruction, so don't despair, it might be in CELL.
VMX doesn't have a single-instruction dot product, it just has a horizontal add - but then VMX itself is a bit old and crusty. It has been significantly improved in some CPUs...

Right about the latency though... but then who cares about latency - pipelining and multi-threading hide latencies...
 
DeanoC said:
Absolutely, no dot-product instruction sucks. It's even more relevant for non-graphics ops, where it's often impossible to do more than one dot product at a time (for example AI angle or distance calcs etc). So effectively you divide your theoretical flops by 3 or 4....

So compared to a high-end GPU today, which IMHO can do a dot product every cycle, the 256GFlops of CELL are only worth ~64-85GFlops. That would be really disappointing.
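
(The arithmetic behind those numbers, assuming 8 SPEs each doing a 4-wide madd per cycle at ~4 GHz and a dot product taking 3-4 dependent vector ops:)

[code]
#include <stdio.h>

int main(void)
{
    /* Assumed peak: 8 SPEs x 4 lanes x 2 flops (madd) x 4 GHz = 256 GFlops */
    const double peak_gflops = 8 * 4 * 2 * 4.0;
    printf("3 ops per dot product: ~%.0f GFlops\n", peak_gflops / 3.0);  /* ~85 */
    printf("4 ops per dot product:  %.0f GFlops\n", peak_gflops / 4.0);  /*  64 */
    return 0;
}
[/code]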
 
mboeller said:
DeanoC said:
Absolutely, no dot-product instruction sucks. It's even more relevant for non-graphics ops, where it's often impossible to do more than one dot product at a time (for example AI angle or distance calcs etc). So effectively you divide your theoretical flops by 3 or 4....

So compared to a high-end GPU today, which IMHO can do a dot product every cycle, the 256GFlops of CELL are only worth ~64-85GFlops. That would be really disappointing.

... And a TOTAL waste.
 
Fafalada said:
It would be retarded - but hey, only the whiney programmers would complain about it right? :?
I'm sure the SPU ISA isn't composed entirely of FADD instructions, so don't you worry, Faf... :)
 
london-boy said:
mboeller said:
DeanoC said:
Absolutely, no dot-product instruction sucks. It's even more relevant for non-graphics ops, where it's often impossible to do more than one dot product at a time (for example AI angle or distance calcs etc). So effectively you divide your theoretical flops by 3 or 4....

So compared to a high-end GPU today, which IMHO can do a dot product every cycle, the 256GFlops of CELL are only worth ~64-85GFlops. That would be really disappointing.

... And a TOTAL waste.

If you compute 4 vectors at the same time, then a dot product effectively takes 1 cycle - so what's the problem?
 