ISSCC 2005

Fafalada said:
Gubbi said:
Or maybe it just went on a die diet. Who knows?
With 288 GFLOPS peak performance, CELL is fairly certain to be memory-starved.
Not sure eDRAM would be a great solution for that, though - having to manually manage another layer of memory on top of the local storages would complicate things a fair bit more.

Which is another reason why I prefer demand loaded caches in the first place.

Fafalada said:
At any rate - I would argue that eDRAM bandwidth would come in more handy on the GPU side, and at least that's an area that, even if it were managed by hand, is familiar to lots of people already.

The primary function of a big chunk of eDRAM would be to lower average memory latency; increased bandwidth is secondary.

Main XDR memory will be 200-400 cycles away; that's 400-800 instructions. Anything that doesn't prefetch like hell will stall all the time. Even if you vertically multithread your code, you'll be limited by the maximum of 16 outstanding memory (DMA) transactions (25-50 instructions per transaction per thread) - and that is without contention; you have 8 other guys competing for the same memory channel.

Cheers
Gubbi
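
To make Gubbi's prefetching point concrete, here is a minimal sketch of double-buffered SPU streaming, assuming the Cell SDK's spu_mfcio.h intrinsics (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all); the names stream, wait_tag, buf and BLOCK are made up for illustration, and process_block() is a hypothetical stand-in for the per-block work. The idea is just to keep the next DMA in flight while the current block is processed, so the 200-400 cycle trip to XDR overlaps with compute.

Code:
/* Sketch only - double-buffered streaming on an SPU, assuming the
   Cell SDK's spu_mfcio.h MFC intrinsics. process_block() is a
   hypothetical stand-in for whatever work is done per block. */
#include <spu_mfcio.h>

#define BLOCK 16384                     /* bytes per DMA transfer */

static char buf[2][BLOCK] __attribute__((aligned(128)));

extern void process_block(void *p, unsigned int bytes);   /* hypothetical */

static void wait_tag(unsigned int tag)
{
    mfc_write_tag_mask(1 << tag);       /* select this tag group */
    mfc_read_tag_status_all();          /* stall until its DMAs complete */
}

void stream(unsigned long long ea, unsigned int nblocks)
{
    unsigned int cur = 0;

    /* Kick off the first transfer before entering the loop. */
    mfc_get(buf[cur], ea, BLOCK, cur, 0, 0);

    for (unsigned int i = 0; i < nblocks; i++) {
        unsigned int next = cur ^ 1;

        /* Start fetching block i+1 while block i is still in flight
           or being processed - the "prefetch like hell" part. */
        if (i + 1 < nblocks)
            mfc_get(buf[next],
                    ea + (unsigned long long)(i + 1) * BLOCK,
                    BLOCK, next, 0, 0);

        wait_tag(cur);                  /* block i is now in local store */
        process_block(buf[cur], BLOCK); /* do the real work */

        cur = next;
    }
}

Deeper buffering is possible, but the 16-entry MFC command queue Gubbi mentions puts a hard ceiling on how many of these transfers a single SPE can have outstanding at once.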
 
Thanks to M.Isobe at Ars, sorry if old...

[attached: three ISSCC slide photos, including the clock speed / temperature / power table]
 
Gubbi said:
Fafalada said:
Gubbi said:
Or maybe it just went on a die diet. Who knows?
With 288 GFLOPS peak performance, CELL is fairly certain to be memory-starved.
Not sure eDRAM would be a great solution for that, though - having to manually manage another layer of memory on top of the local storages would complicate things a fair bit more.

Which is another reason why I prefer demand loaded caches in the first place.

Fafalada said:
At any rate - I would argue that eDRAM bandwidth would come in more handy on the GPU side, and at least that's an area that, even if it were managed by hand, is familiar to lots of people already.

The primary function of a big chunk of eDRAM would be to lower average memory latency; increased bandwidth is secondary.

Main XDR memory will be 200-400 cycles away; that's 400-800 instructions. Anything that doesn't prefetch like hell will stall all the time. Even if you vertically multithread your code, you'll be limited by the maximum of 16 outstanding memory (DMA) transactions (25-50 instructions per transaction per thread) - and that is without contention; you have 8 other guys competing for the same memory channel.

Cheers
Gubbi

Gubbi, I have the feeling that the SPUs can read the PPE's L2 cache using DMA (it might be automatic: look in the cache, and if you miss then DMA from main RAM into the Local Storage). This does not mean the program will have to replace data lines in the cache... they mention the possibility of locking portions of it, though.
 
The figure above the wattage in that table - is that the temperature it was running at for that speed? They're not really in line with what was said before, if so (e.g. 85C at 4.6GHz). Presumably I'm not taking something into account...(?)
 
Titanio said:
The figure above the wattage in that table - is that the temperature it was running at for that speed? They're not really in line with what was said before, if so (e.g. 85C at 4.6GHz). Presumably I'm not taking something into account...(?)

I assume the C figures are the temperature. Seems a bit low, considering the speeds...
 
Titanio said:
The figure above the wattage in that table - is that the temperature it was running at for that speed? They're not really in line with what was said before, if so (e.g. 85C at 4.6GHz). Presumably I'm not taking something into account...(?)

The PPE seems hotter. Interestingly, the SPEs far from the PPE are cooler.
[attached: thermal image of the Cell die]
 
So a reciprocal ESTIMATE takes just 4 cycles. I bet one needs a couple of them to have a good approximation ;)
So my 8-cycle guess for a reciprocal wasn't that far off!
4 GPoly/s... LOL :)
 
Panajev2001a said:
Gubbi, I have the feeling that the SPUs can read the PPE's L2 cache using DMA (it might be automatic: look in the cache, and if you miss then DMA from main RAM into the Local Storage). This does not mean the program will have to replace data lines in the cache... they mention the possibility of locking portions of it, though.

It would have to query the L2 cache (or rather, the L2 snoops main memory accesses), otherwise memory coherency is out the door, and the designers probably weren't that insane. However, the L2 is tiny compared to the computational resources of the chip.

I'm guessing that code running on the SPUs will use non-temporal load/store semantics to a very large extent; the last thing you need is nine cores thrashing the same 512KB array.

Cheers
Gubbi
 
I'm amazed at how much information we can already get on Sony's product. I mean, that temperature thing is quite neat. Will we ever get our hands on the same kind of info for other platforms?
 
Titanio said:
The figure above the wattage in that table - is that the temperature it was running at for that speed? They're not really in line with what was said before, if so (e.g. 85C at 4.6GHz). Presumably I'm not taking something into account...(?)
Could be the delta relative to ambient temperature.
 
nAo said:
So a reciprocal ESTIMATE takes just 4 cycles. I bet one needs a couple of them to have a good approximation ;)
One iteration of Newton-Raphson is almost certainly enough; it is enough for a single-precision result in 3DNow!.

If they don't have a specific instruction to speed it up, it will take 2 multiply-accumulates; if they do, it should only take 4 cycles more (for instance, by storing the square of the approximation in the lookup table too, you can cut the computation down to a single multiply-accumulate... although there are probably smarter ways of achieving the same thing).
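
For reference, here is the refinement step MfA is talking about, sketched in plain C; the formula is the standard Newton-Raphson iteration for 1/a, while the function names (recip_refine, recip_refine_sq) and the exact mapping onto fused multiply-adds are my own illustration, not anything from the slides.

Code:
/* One Newton-Raphson step for y = 1/a, starting from an estimate x0
   (e.g. the 4-cycle reciprocal-estimate result):

       x1 = x0 * (2 - a * x0)

   As written this is two dependent multiply-adds:
       t  = 2 - a * x0      (negated multiply-add)
       x1 = x0 * t          (multiply)                               */
static inline float recip_refine(float a, float x0)
{
    return x0 * (2.0f - a * x0);
}

/* My reading of MfA's table trick: if the lookup also hands back
   x0*x0 (and the doubled estimate 2*x0, or an instruction that folds
   the doubling in), the same step rearranges to

       x1 = 2*x0 - a * (x0*x0)

   i.e. a single negated multiply-add.                               */
static inline float recip_refine_sq(float a, float x0_times_2, float x0_sq)
{
    return x0_times_2 - a * x0_sq;
}

Each NR step roughly doubles the number of accurate bits, so a 12-14 bit estimate lands near full single precision after one pass, which fits the 3DNow! comparison.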
 
MfA said:
nAo said:
So a reciprocal ESTIMATE takes just 4 cycles. I bet one needs a couple of them to have a good approximation ;)
One iteration of Newton-Raphson is almost certainly enough; it is enough for a single-precision result in 3DNow!.
That's even better! I assumed 2 iterations because I once read a patent by NVIDIA where they showed their rcp instruction used 2 NR iterations... IIRC.
So can we say the shortest vertex transformation loop is just 5 cycles? ;)
That's 6.4 GigaPoly/s! Double LOL :)

ciao,
Marco
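
(For anyone checking the back-of-envelope numbers: assuming 8 SPEs at 4GHz and one vertex retired per loop iteration, an 8-cycle loop gives 8 x 4GHz / 8 = 4G vertices/s and a 5-cycle loop gives 8 x 4GHz / 5 = 6.4G vertices/s, which is where the 4 GPoly/s and 6.4 GigaPoly/s figures come from.)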
 