ISSCC 2005

Fafalada said:
Gubbi said:
Or maybe it just went on a die diet. Who knows?
With 288 GFLOPS peak performance, CELL is fairly certain to be memory-starved.
Not sure eDRAM would be a great solution for that, though - having to manually manage another layer of memory on top of the local storages would complicate things a fair bit more.

Which is another reason why I prefer demand loaded caches in the first place.

Fafalada said:
At any rate - I would argue that eDRAM bandwidth would come in more handy on the GPU side, and at least that's an area that, even if it were managed by hand, is familiar to lots of people already.

The primary function of a big chunk of eDRAM would be to lower average memory latency; increased bandwidth is secondary.

Main XDR memory will be 200-400 cycles away; that's 400-800 instructions. Anything that doesn't prefetch like hell will stall all the time. Even if you vertically multithread your code, you'll be limited by the maximum of 16 outstanding memory (DMA) transactions (25-50 instructions per transaction per thread) - and that is without contention; you have 8 other guys competing for the same memory channel.

Cheers
Gubbi
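
To make Gubbi's prefetching point concrete, here is a minimal sketch of double-buffered SPU streaming, assuming the Cell SDK's spu_mfcio.h intrinsics (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all); the names stream, wait_tag, buf and BLOCK are made up for illustration, and process_block() is a hypothetical stand-in for the per-block work. The idea is just to keep the next DMA in flight while the current block is processed, so the 200-400 cycle trip to XDR overlaps with compute.

Code:
/* Sketch only - double-buffered streaming on an SPU, assuming the
   Cell SDK's spu_mfcio.h MFC intrinsics. process_block() is a
   hypothetical stand-in for whatever work is done per block. */
#include <spu_mfcio.h>

#define BLOCK 16384                     /* bytes per DMA transfer */

static char buf[2][BLOCK] __attribute__((aligned(128)));

extern void process_block(void *p, unsigned int bytes);   /* hypothetical */

static void wait_tag(unsigned int tag)
{
    mfc_write_tag_mask(1 << tag);       /* select this tag group */
    mfc_read_tag_status_all();          /* stall until its DMAs complete */
}

void stream(unsigned long long ea, unsigned int nblocks)
{
    unsigned int cur = 0;

    /* Kick off the first transfer before entering the loop. */
    mfc_get(buf[cur], ea, BLOCK, cur, 0, 0);

    for (unsigned int i = 0; i < nblocks; i++) {
        unsigned int next = cur ^ 1;

        /* Start fetching block i+1 while block i is still in flight
           or being processed - the "prefetch like hell" part. */
        if (i + 1 < nblocks)
            mfc_get(buf[next],
                    ea + (unsigned long long)(i + 1) * BLOCK,
                    BLOCK, next, 0, 0);

        wait_tag(cur);                  /* block i is now in local store */
        process_block(buf[cur], BLOCK); /* do the real work */

        cur = next;
    }
}

Deeper buffering is possible, but the 16-entry MFC command queue Gubbi mentions puts a hard ceiling on how many of these transfers a single SPE can have outstanding at once.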
 
Thanks to M.Isobe at Ars, sorry if old...

[attached: three ISSCC slide photos, including the clock speed / temperature / power table]
 
Gubbi said:
Fafalada said:
Gubbi said:
Or maybe it just went on a die diet. Who knows?
With 288 GFLOPS peak performance, CELL is fairly certain to be memory-starved.
Not sure eDRAM would be a great solution for that, though - having to manually manage another layer of memory on top of the local storages would complicate things a fair bit more.

Which is another reason why I prefer demand loaded caches in the first place.

Fafalada said:
At any rate - I would argue that eDRAM bandwidth would come in more handy on the GPU side, and at least that's an area that, even if it were managed by hand, is familiar to lots of people already.

The primary function of a big chunk of eDRAM would be to lower average memory latency; increased bandwidth is secondary.

Main XDR memory will be 200-400 cycles away; that's 400-800 instructions. Anything that doesn't prefetch like hell will stall all the time. Even if you vertically multithread your code, you'll be limited by the maximum of 16 outstanding memory (DMA) transactions (25-50 instructions per transaction per thread) - and that is without contention; you have 8 other guys competing for the same memory channel.

Cheers
Gubbi

Gubbi, I have the feeling that the SPUs can read the PPE's L2 cache using DMA (it might be automatic: look in the cache, and if you miss then DMA from main RAM into the Local Storage). This does not mean the program will have to replace data lines in the cache... they mention the possibility of locking portions of it, though.
 
The figure above the wattage in that table - is that the temperature it was running at for that speed? They're not really in line with what was said before, if so (e.g. 85C at 4.6GHz). Presumably I'm not taking something into account...(?)
 
Titanio said:
The figure above the wattage in that table - is that the temperature it was running at for that speed? They're not really in line with what was said before, if so (e.g. 85C at 4.6GHz). Presumably I'm not taking something into account...(?)

I assume the C figures are the temperature. Seems a bit low, considering the speeds...
 
Titanio said:
The figure above the wattage in that table - is that the temperature it was running at for that speed? They're not really in line with what was said before, if so (e.g. 85C at 4.6GHz). Presumably I'm not taking something into account...(?)

The PPE seems hotter. Interestingly, the SPEs far from the PPE are cooler.
[attached: thermal image of the Cell die]
 
So a reciprocal ESTIMATE takes just 4 cycles. I bet one needs a couple of them to have a good approximation ;)
So my 8-cycle guess for a reciprocal wasn't that far off!
4 GPoly/s... LOL :)
 
Panajev2001a said:
Gubbi, I have the feeling that the SPUs can read the PPE's L2 cache using DMA (it might be automatic: look in the cache, and if you miss then DMA from main RAM into the Local Storage). This does not mean the program will have to replace data lines in the cache... they mention the possibility of locking portions of it, though.

It would have to query the L2 cache (or rather, the L2 snoops main memory accesses), otherwise memory coherency is out the door, and the designers probably weren't that insane. However, the L2 is tiny compared to the computational resources of the chip.

I'm guessing that code running on the SPUs will use non-temporal load/store semantics to a very large extent; the last thing you need is nine cores thrashing the same 512KB array.

Cheers
Gubbi
 
I'm amazed at how much information we can already get on Sony's product. I mean, that temperature thing is quite neat. Will we ever get our hands on the same kind of info for other platforms?
 
Titanio said:
The figure above the wattage in that table - is that the temperature it was running at for that speed? They're not really in line with what was said before, if so (e.g. 85C at 4.6GHz). Presumably I'm not taking something into account...(?)
Could be the delta relative to ambient temperature.
 
nAo said:
So a reciprocal ESTIMATE takes just 4 cycles. I bet one needs a couple of them to have a good approximation ;)
One iteration of Newton-Raphson is almost certainly enough; it is enough for a single-precision result in 3DNow!.

If they don't have a specific instruction to speed it up, it will take 2 multiply-accumulates; if they do, it should only take 4 cycles more (for instance, by storing the square of the approximation in the lookup table too, you can cut the computation down to a single multiply-accumulate... although there are probably smarter ways of achieving the same thing).
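
For reference, here is the refinement step MfA is talking about, sketched in plain C; the formula is the standard Newton-Raphson iteration for 1/a, while the function names (recip_refine, recip_refine_sq) and the exact mapping onto fused multiply-adds are my own illustration, not anything from the slides.

Code:
/* One Newton-Raphson step for y = 1/a, starting from an estimate x0
   (e.g. the 4-cycle reciprocal-estimate result):

       x1 = x0 * (2 - a * x0)

   As written this is two dependent multiply-adds:
       t  = 2 - a * x0      (negated multiply-add)
       x1 = x0 * t          (multiply)                               */
static inline float recip_refine(float a, float x0)
{
    return x0 * (2.0f - a * x0);
}

/* My reading of MfA's table trick: if the lookup also hands back
   x0*x0 (and the doubled estimate 2*x0, or an instruction that folds
   the doubling in), the same step rearranges to

       x1 = 2*x0 - a * (x0*x0)

   i.e. a single negated multiply-add.                               */
static inline float recip_refine_sq(float a, float x0_times_2, float x0_sq)
{
    return x0_times_2 - a * x0_sq;
}

Each NR step roughly doubles the number of accurate bits, so a 12-14 bit estimate lands near full single precision after one pass, which fits the 3DNow! comparison.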
 
MfA said:
nAo said:
So a reciprocal ESTIMATE takes just 4 cycles. I bet one needs a couple of them to have a good approximation ;)
One iteration of Newton-Raphson is almost certainly enough; it is enough for a single-precision result in 3DNow!.
That's even better! I assumed 2 iterations because I once read a patent by NVIDIA where they showed their rcp instruction used 2 NR iterations... IIRC.
So can we say the shortest vertex transformation loop is just 5 cycles? ;)
That's 6.4 GigaPoly/s! Double LOL :)

ciao,
Marco
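
(For anyone checking the back-of-envelope numbers: assuming 8 SPEs at 4GHz and one vertex retired per loop iteration, an 8-cycle loop gives 8 x 4GHz / 8 = 4G vertices/s and a 5-cycle loop gives 8 x 4GHz / 5 = 6.4G vertices/s, which is where the 4 GPoly/s and 6.4 GigaPoly/s figures come from.)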
 