Watch Impress PS3 Technical article, from GDC (some new info)

blakjedi said:
What makes you say that the ALUs are weaker in Xenos than RSX? According to a certain ATI rep, Xenos' ALUs are more powerful than R520's; no comparison that I have seen has been done with R580/G70, however, normalized for clock rate.

Weaker individually or weaker overall? I would like to understand the context of the statements, do you have a link please?
 
According to this slide, it seems the PPE received both an L1 cache and, indeed, the 128-bit VMX (Altivec) upgrade Watch Impress reported.

[Attached image: 12.jpg]
 
ROG27 said:
According to this slide, it seems the PPE received both an L1 cache and, indeed, the 128-bit VMX (Altivec) upgrade Watch Impress reported.

This slide is old. The one upgrade from that mentioned at GDC, as pointed out by Panajev_2001, is that L1 data cache now appears to be 64KB, versus the 32KB shown here.

The VMX128 thing was a complete misunderstanding (on my part, I should add). The slide shown here refers to 128-bit VMX - the upgrade that was suggested before was a bump in the register count from 32 to 128.
 
Titanio said:
This slide is old. The one upgrade from that mentioned at GDC, as pointed out by Panajev_2001, is that L1 data cache now appears to be 64KB, versus the 32KB shown here.

The VMX128 thing was a complete misunderstanding (on my part, I should add). The slide shown here refers to 128-bit VMX - the upgrade that was suggested before was a bump in the register count from 32 to 128.

Why would they upgrade the register count anyway? It would seem rather useless from the perspective of how CELL is to be utilized on the whole. Is that a correct assertion?
 
Shifty Geezer said:
It would probably help with multithreading, which PPE is expected to do.

I'm just saying, in the scheme of CELL's architectural layout as a whole, it was probably a conscious design tradeoff to allocate more of the transistor budget somewhere where performance boosting got the most bang for the buck, like the SPEs.
 
Shifty Geezer said:
For the most common filters, blurs, a fast pseudo-Gaussian works on horizontal then vertical components, only needing a one pixel lookahead, same as a fast box blur. This could be streamed ideally well for SPEs.
How does that work for something like the 25-tap separable Gaussian in the ATI presentation I linked? It seems like you could do the horizontal pass fairly efficiently because you could read in one complete scanline at a time and blur it efficiently. The vertical pass would be trickier because as far as I can see (and I may be missing something here) you'd still need 25 pixels vertically in local memory at once. For a 1080p 16-bit float frame that's going to be tricky to fit in memory. You'd have to break the frame up into vertical strips to get efficient streaming, I'd have thought. I'm not very familiar with how these filters are usually implemented on the CPU though, so maybe there are some clever tricks I'm missing.

I'm sure you could come up with reasonably efficient implementations of many common filters for the SPEs, but they'll be more effort to implement and slower than a GPU implementation in almost all cases. That doesn't mean it might not be worth doing if your bottleneck is the GPU, but really the SPEs aren't as suited for this kind of work as a GPU.
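To make the row/column asymmetry concrete, here's a minimal separable-blur sketch (Python, with a hypothetical `separable_blur` helper and clamp-to-edge boundaries; illustrative only, not any shipping implementation):

```python
# Minimal separable blur over a flat, row-major frame of floats.
# kernel: odd number of tap weights. Names and edge handling are
# illustrative assumptions, not from the thread.

def separable_blur(frame, width, height, kernel):
    r = len(kernel) // 2
    tmp = [0.0] * (width * height)
    out = [0.0] * (width * height)

    # Horizontal pass: each output row only needs its own input row
    # resident, so it streams one scanline at a time.
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xi = min(max(x + k - r, 0), width - 1)  # clamp at edges
                acc += w * frame[y * width + xi]
            tmp[y * width + x] = acc

    # Vertical pass: each output pixel touches 2r+1 rows, which is why
    # a wide kernel on a full 1080p buffer strains a small local store
    # unless the frame is split into vertical strips or tiles.
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                yi = min(max(y + k - r, 0), height - 1)
                acc += w * tmp[yi * width + x]
            out[y * width + x] = acc
    return out
```

The horizontal pass streams scanlines, while the vertical pass needs a window of rows resident at once, which is the residency problem being discussed.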
 
"Physical GPU Tiles"

....hmmm, seen a couple of NV patents on those!

Anyway, can't you NDA'ed devs with nothing better to do than tease, say how many VS units are in RSX? ...Ta!
 
Alright, it was a tongue-in-cheek question, but seriously: RSX can do 136 instructions/cycle and has 24 PS units, so can't you at least confirm whether it has 8 VS units?
 
heliosphere said:
The vertical pass would be trickier because as far as I can see (and I may be missing something here) you'd still need 25 pixels vertically in local memory at once.
Nope. An implementation of this paper...
http://www.ph.tn.tudelft.nl/~lucas/publications/1995/SP95TYLV/SP95TYLV.pdf

Only needs 4 pixels of storage, as it were. Think of it as a FIFO queue: as each new pixel is processed, the oldest stored pixel is evicted.

Given processing of a quad from (100,100) to (200,200), the order of data would be something akin to:

For pixel (100,100)
r1 = Tap(98,100)
r2 = Tap(99,100)
r3 = Tap(100,100)
r4 = Tap(101,100)

Pixel (101,100)
r1 = Tap(99,100)
r2 = Tap(100,100)
r3 = Tap(101,100)
r4 = Tap(102,100)

Pixel (102,100)
r1 = Tap(100,100)
r2 = Tap(101,100)
r3 = Tap(102,100)
r4 = Tap(103,100)

and so forth across the columns, and then exactly the same for the rows after all pixels have been processed this way...

For pixel (100,100)
r1 = Tap(100,98)
r2 = Tap(100,99)
r3 = Tap(100,100)
r4 = Tap(100,101)

Pixel (100,101)
r1 = Tap(100,99)
r2 = Tap(100,100)
r3 = Tap(100,101)
r4 = Tap(100,102)

Pixel (100,102)
r1 = Tap(100,100)
r2 = Tap(100,101)
r3 = Tap(100,102)
r4 = Tap(100,103)

The actual working space needed is thus very small. You would probably do something like have two quads in memory at any time as a double buffer on the data to accommodate the DMA time, processing one quad to completion as the other is output and the next quad loaded. Each quad would only need a few pixels' boundary on the right and bottom edges for tiling.

And no, I don't understand how it works! It's incredible how maths boffins can work this sort of thing out. As far as I'm concerned it's magic, but it makes for the fastest Gaussian-like blur you'll get on a CPU!
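For reference, the recursion in the linked Young & van Vliet paper can be sketched roughly like this (Python; the coefficient formulas are the commonly quoted ones from the 1995 paper, while the function name and edge handling are illustrative assumptions):

```python
# Rough sketch of a Young & van Vliet style recursive (IIR) Gaussian
# along one row. Cost per pixel is constant regardless of sigma, and
# only a tiny FIFO of state is needed, as described above.

def recursive_gaussian_1d(row, sigma):
    # Map sigma to the recursion parameter q (the paper's empirical fit).
    if sigma >= 2.5:
        q = 0.98711 * sigma - 0.96330
    else:
        q = 3.97156 - 4.14554 * (1.0 - 0.26891 * sigma) ** 0.5

    # Third-order recursion coefficients.
    b0 = 1.57825 + 2.44413 * q + 1.42810 * q**2 + 0.422205 * q**3
    b1 = 2.44413 * q + 2.85619 * q**2 + 1.26661 * q**3
    b2 = -(1.42810 * q**2 + 1.26661 * q**3)
    b3 = 0.422205 * q**3
    B = 1.0 - (b1 + b2 + b3) / b0  # normalises DC gain to 1

    out = list(row)
    # Forward pass: only the previous three outputs are needed,
    # i.e. a 4-value window of state, no matter how large sigma is.
    for i in range(3, len(out)):
        out[i] = B * out[i] + (b1 * out[i - 1] + b2 * out[i - 2] + b3 * out[i - 3]) / b0
    # Backward pass makes the overall response symmetric (Gaussian-like).
    for i in range(len(out) - 4, -1, -1):
        out[i] = B * out[i] + (b1 * out[i + 1] + b2 * out[i + 2] + b3 * out[i + 3]) / b0
    return out
```

Run it once per row and once per column (or on a transposed tile) for a 2D blur; the fixed per-pixel cost is what makes it attractive on a CPU or SPE compared to a wide FIR kernel.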

I'm sure you could come up with reasonably efficient implementations of many common filters for the SPEs, but they'll be more effort to implement and slower than a GPU implementation in almost all cases. That doesn't mean it might not be worth doing if your bottleneck is the GPU, but really the SPEs aren't as suited for this kind of work as a GPU.
I agree, but they're not too badly suited. You're never going to compete with 16+ pixel pipes all fetching and processing data simultaneously. But that's also comparing a whole GPU to a single SPE. Get all the SPEs going on this together and the gap isn't going to be enormous, at least in a Gaussian (or approximation thereof) example. I think the key advantage for PS3 is the option to use the resources as needed. If RSX is using all its effort to render the geometry, you can offload post-processing to Cell with some efficiency. And if Cell is all used up running physics, you can post-process on RSX with maximum efficiency. The other advantage to SPEs in post-processing is more efficient conditional processing, but I don't know if those sorts of processes will be used in game post-processing. That normally keeps to blurs, blends and outlining.
 
heliosphere said:
The vertical pass would be trickier because as far as I can see (and I may be missing something here) you'd still need 25 pixels vertically in local memory at once.
SPEs have a local store, GPUs have DDR memory pages, and neither likes to cross those boundaries; that's why GPUs love to render into tiled frame buffers.
You can still apply your separable filter on a tile, and you don't even need to tile anything yourself, as GPUs already do that..;)
So you just load one or more frame buffer tiles (how big are they? just check the GDDR3 memory specs..) with a single sequential DMA transfer into an SPE's local store...
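The tile flow being described (load the next tile while processing the current one, as also suggested earlier with the double-buffered quads) can be sketched abstractly like this; all the callback names are hypothetical, and real SPE code would issue asynchronous DMA rather than plain function calls:

```python
# Illustrative double-buffered tile loop. fetch_tile / process_tile /
# write_tile are hypothetical callbacks standing in for DMA-in,
# compute, and DMA-out on an SPE.

def process_tiled(fetch_tile, process_tile, write_tile, n_tiles):
    buffers = [None, None]
    buffers[0] = fetch_tile(0)  # prime the pipeline with the first tile
    for i in range(n_tiles):
        current = buffers[i % 2]
        if i + 1 < n_tiles:
            # On real hardware this transfer would be issued
            # asynchronously and waited on before the next iteration,
            # so it overlaps with the compute below.
            buffers[(i + 1) % 2] = fetch_tile(i + 1)
        write_tile(i, process_tile(current))
```

The point of the two buffers is that the DMA latency for tile i+1 hides behind the compute on tile i, which is the standard way to keep an SPE's local store fed.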
 
Shifty Geezer said:
And no, I don't understand how it works! It's incredible how maths boffins can work this sort of thing out. As far as I'm concerned it's magic, but it makes for the fastest Gaussian-like blur you'll get on a CPU!
Hmm, I looked it over and I can't really understand how it works either :) Still, it does seem to be an efficient way of implementing a blur on the CPU and doesn't require too many surrounding pixels, I'll give you that.
The other advantage to SPEs in post-processing is more efficient conditional processing, but I don't know if those sorts of processes will be used in game post-processing. That normally keeps to blurs, blends and outlining.
Conditionals aren't that efficient on the SPEs either (unless you're talking about conditionals at a fairly high level, like for entire tiles), but yeah, they may well be better than a GPU for something like that. I can't think of any examples off the top of my head, but that doesn't mean there aren't some good uses.
 
nAo said:
SPEs have a local store, GPUs have DDR memory pages, and neither likes to cross those boundaries; that's why GPUs love to render into tiled frame buffers.
You can still apply your separable filter on a tile, and you don't even need to tile anything yourself, as GPUs already do that..;)
So you just load one or more frame buffer tiles (how big are they? just check the GDDR3 memory specs..) with a single sequential DMA transfer into an SPE's local store...
Yeah, we did discuss using frame buffer tiles earlier in the thread. That would probably be the way to go if you were actually implementing this on the PS3. Of course GPUs do have similar memory issues to deal with as SPEs but they are designed to be as efficient as possible for these kinds of memory access patterns.

I'm not trying to argue that you can't do post-processing filters efficiently on the SPEs. I'm just saying it's more work to implement and slower than a GPU version. There could still be good reasons to do it. My original point was that the unified shader architecture on the 360 (together with the EDRAM) makes it very efficient at post-processing, more so than RSX. Both the 360 GPU and RSX are more efficient at post-processing effects than the SPEs. The SPEs will do a better job than the 360 CPU cores. There are going to be times when it would make sense to do post-processing on the GPU side or the CPU side on either platform, but it'll be more common to use the SPEs for post-processing than to use a 360 core, given the relative strengths of the platforms.
 
heliosphere said:
My original point was that the unified shader architecture on the 360 (together with the EDRAM) makes it very efficient at post-processing, more so than RSX
How is the EDRAM helping a post-processing pass?
Though the 360's unified architecture should be more efficient in a full-screen pass, it's still likely to lag behind RSX in this very same department.
 
nAo said:
How the EDRAM is helping a post processing pass?
Post processing shaders are often very fill-rate hungry, especially if they are fairly light on math or texture fetches. The EDRAM means you rarely have a situation where frame buffer bandwidth is the bottleneck.
 
heliosphere said:
Post processing shaders are often very fill-rate hungry, especially if they are fairly light on math or texture fetches. The EDRAM means you rarely have a situation where frame buffer bandwidth is the bottleneck.
I completely disagree; post-processing shaders are often pixel shader/texture bandwidth hungry, not fill-rate hungry.
If your full-screen pass is so simple as to be fill-rate hungry, you should combine/collapse your passes.
 
nAo said:
If your full-screen pass is so simple as to be fill-rate hungry, you should combine/collapse your passes.
That's probably true but it's not always practical if you have a variety of effects that you want to mix and match at different times.

Anyway, I haven't done any benchmarks or comparisons to establish which is actually faster in practice so I'm not going to speculate any further. I'll probably be finding out over the next few months.
 
heliosphere said:
The EDRAM means you rarely have a situation where frame buffer bandwidth is the bottleneck.
Screen-aligned ops have completely fixed requirements though - unlike normal rendering, you know exactly whether you have enough bandwidth or not, and whether you will run at full fillrate or not - and you also have the freedom to tailor your post-processing algorithm until it runs without memory bottlenecks, like people have done for these effects in the PS2/XBX generation.

That's probably true but it's not always practical if you have a variety of effects that you want to mix and match at different times.
Problem is that needing many accumulative passes with very little pixel/texture work means you need to resolve most of them into main memory as well, which wastes most of the eDRAM benefits. I'd side with nAo here - you want to collapse into as few passes as possible; leave the fillrate-focused post-processing to PS2/PSP :p
 
BTW, has anyone (one?) translated the article? ( Don't want to scan the whole thread...)

If not, I can work one up - just don't want to be redundant.
 