PS3 GPU not fast enough.. yet?

SPM · Jun 8, 2006

Shifty Geezer said:
Which makes sense. Basically there is no official Cell<<GDDR memory BW, no Cell<<GDDR lines, and that 16 MB/s is jumping through hoops to access the GDDR which accounts for it being so slow, presumably making requests of the RSX to fetch and deliver. For clarity I'd like to see a table of dependences or how these BWs interact, showing how one data path affects another data path reducing available BW. eg. Using 22.4 GB/s RSX<<GDDR means 0 GB/s RSX>>GDDR. Does that 16 MB/s Cell<<GDDR consume some or all of that BW, or the 4 GB/s read BW, or 16 MB/s of the RSX>>Cell BW, or what?

The 16MB/s figure has got to be wrong. It is impossible to make a modern inter-processor bus interface go that slow without artificially limiting it. To put it into context, the 16MB/s data transfer rate is half the speed of a UDMA 33 IDE hard drive. Either the 16 MB/s is a typo and should read 16 GB/s, or it has to be some sort of built-in automatic DMA transfer designed to keep the RSX fed with commands or something without GDDR bus contention.

The link http://portable.joystiq.com/2006/06/05/rumor-ps3-hardware-slow-and-broken/ does seem to show that RSX can read and write to XDR at 20GB/s and 15GB/s respectively (if there is anything credible about the link). However Mercury's datasheet for a 2 CPU Cell blade communicating over the Flex i/o lists it as an SMP system with full cache and memory coherence rather than an NUMA system http://www.mc.com/literature/literature_files/Cell_blade-ds.pdf . Therefore the secomd CPU must be able to access the first's XDR memory over the Flex i/o interface as fast as it can access it's own, which seems to lay to rest the idea that the XDR cannot be accessed via Cell without excessive latency.

Acert93 · Jun 8, 2006

SPM said:
The 16MB/s figure has got to be wrong. It is impossible to make a modern inter-processor bus interface go that slow without artificially limiting it. The 16MB/s data transfer rate is half the speed of a UDMA 33 IDE hard drive. Either the 16 MB/s is a typo and should read 16 GB/s, or it has to be some sort of built-in automatic DMA transfer designed to keep the RSX fed with commands or something without GDDR bus contention.

Disbelief like yours is why it says, "No, not a type"

Mintmaster · Jun 8, 2006

nAo said:
I was just wondering if on Xenos bfc is part of the setup stage or not.
If it's part of it we might say that in most cases the actual setup cost is 2 clocks per visible triangle, not just one, cause on average every 2 triangles one is back facing the camera.

Yeah, but that's true with every GPU. If RSX/G71 can only do 275M polys including culled polys (not sure if this is the case, which is why I'm asking you), then by your logic one could say it does one visible triangle every 4 clocks.

I don't think it makes sense to talk about clocks per visible triangle, because it's not like you'll get triangles in pairs where one is visible and one isn't. Most of the time either all will be visible or all will be culled, because you want to maximize vertex re-use.

Does it make sense to have a very fast triangle setup engine if most of it is idle half the time?

The setup engine is idle a lot more than half the time in actual games, since you're pixel-shader limited well over 50% of the time. Nearly every part of a GPU is idle at some point or the other.

And yes, it does make sense if it's cheap, because in high-poly low pixel clumps (e.g. distant models, slanted polys, Hi-Z rejected), you don't waste time. If 60M transistor GPUs like NV2A/NV25 can setup one tri every 2 clocks, per clock, then doubling that rate for a 300M GPU is no big deal.

There are a lot of aspects in the 3D pipeline that are fast simply to handle bursts. The 35GB/s FlexIO will never make 35GB of data transfer in any given second. But in a burst over one millisecond, it will send 10MB to Cell. With only 16ms available per frame, that's very valuable.

But my question remains: Does RSX backface cull and/or clip triangles at a rate of one per clock? That would certainly reduce the impact of setup rate.

Npl · Jun 8, 2006

SPM said:
The 16MB/s figure has got to be wrong. It is impossible to make a modern inter-processor bus interface go that slow without artificially limiting it. To put it into context, the 16MB/s data transfer rate is half the speed of a UDMA 33 IDE hard drive. Either the 16 MB/s is a typo and should read 16 GB/s, or it has to be some sort of built-in automatic DMA transfer designed to keep the RSX fed with commands or something without GDDR bus contention.

Might be that RSX has alot of different internal caches, digged deep in its rendering pipe. If cache-coherency is required, then RSX could have a hard time checking all caches for each read. Just a random shot in the dark, Im surprised its that low myself.

Its surely not the fault of the bus tough, whatever is the case its purely within RSX.

Dave Baumann · Jun 8, 2006

nAo said:
I was just wondering if on Xenos bfc is part of the setup stage or not.

Its not on other ATI designs, and I don't see that Xenos does away with these fixed elements, just unifies the shaders.

http://www.beyond3d.com/reviews/ati/r520/

Mintmaster · Jun 8, 2006

SPM said:
The 16MB/s figure has got to be wrong. It is impossible to make a modern inter-processor bus interface go that slow without artificially limiting it.

It's not the bus, it's the interface logic on the RSX end (that's why the server data is irrelevant). It wasn't built to take low-level random access memory requests from Cell and send the result back. Instead, it was built to take a command from Cell that instructs RSX to stream data into XDR (hence the 10.6GB/s actual write speed into XDR), likely using the ROPs. Because of this, the 16MB/s figure is really no limitation at all, except you don't want game code or data residing in GDDR3 (well duh, it runs on Cell).

Fafalada · Jun 8, 2006

Mintmaster said:
If 60M transistor GPUs like NV2A/NV25 can setup one tri every 2 clocks, per clock, then doubling that rate for a 300M GPU is no big deal.

It might be a big deal when the setup pipeline grows more complex and clock speeds increase at the same time.

Because of this, the 16MB/s figure is really no limitation at all, except you don't want game code or data residing in GDDR3 (well duh, it runs on Cell).

Actually arguably you could have the game code there - hey, we paged overlays from floppy discs at one point in history, 16MB/s is not so bad

The 35GB/s FlexIO will never make 35GB of data transfer in any given second.

No but then again that's true of every bus outside trivial applications.

But my question remains: Does RSX backface cull and/or clip triangles at a rate of one per clock? That would certainly reduce the impact of setup rate.

Bah, BF culling on GPU is for winnies anyway. Real MAN's GPUs make you do it by hand.

Urian · Jun 8, 2006

One correction for Mintmaster.

The 116.5 milions number of Xbox are vertexs, not triangles.

275 milions from RSX are triangles.

London Geezer · Jun 8, 2006

Even after 7 pages, I think some people here are still a bit confused. To put things straight and to allow some people to follow the conversation a bit better:

The 16MB "limit" is from CELL to the GDDR3. It's not an issue because Cell will never have to access the GDDR3 directly. In fact what will happen most often is for Cell to send something to RSX, and RSX would then write it back to XDR using the 35GB bus, and Cell will access that data sitting in XDR at the amazing bandwidth it has. Cell will never have to access GDDR3.
The RSX-to-CELL direction is still the quoted 35GB or so (divided into different paths of course.

The panic really isn't needed and it was started by The Inq purely to generate publicity after they just saw "16MB bandwidth!" mentioned and just completely jumped the gun.

PS3 still has amazing bandwidth, and it's ridiculous to even think Sony would release something that's so broken as it was alleged.

nAo · Jun 8, 2006

Mintmaster said:
Yeah, but that's true with every GPU.

of course!

If RSX/G71 can only do 275M polys including culled polys (not sure if this is the case, which is why I'm asking you), then by your logic one could say it does one visible triangle every 4 clocks.

exactly

I don't think it makes sense to talk about clocks per visible triangle, because it's not like you'll get triangles in pairs where one is visible and one isn't.

statistically speaking it makes sense a lot of sense. as long as you have a deep primitves fifo between your post transformed cache and your setup engine..

Most of the time either all will be visible or all will be culled, because you want to maximize vertex re-use.
The setup engine is idle a lot more than half the time in actual games, since you're pixel-shader limited well over 50% of the time. Nearly every part of a GPU is idle at some point or the other.

looking at nvidia and ati design it seems they care about it since they put bfc at some earlier stage and probably with different throughputs compared to the setup stage.

And yes, it does make sense if it's cheap, because in high-poly low pixel clumps (e.g. distant models, slanted polys, Hi-Z rejected), you don't waste time. If 60M transistor GPUs like NV2A/NV25 can setup one tri every 2 clocks, per clock, then doubling that rate for a 300M GPU is no big deal.

I don't think it's that cheap..with all those interpolators to set up...

But my question remains: Does RSX backface cull and/or clip triangles at a rate of one per clock? That would certainly reduce the impact of setup rate.

best kept secret ever

predicate · Jun 8, 2006

It's not entirely related, but I can't think of a single PS3 game that has texture detail as good as, say, Kameo, a first-gen 360 game. Even in MGS4 the texture detail for the environment is only about as good as Half-Life 2. Is the low texture resolution in PS3 games likely related to/caused by the fact that the VRAM bandwidth is only 22.4GB/sec, and it has to put framebuffer and all textures through that, making large textures impossible to use fast enough?

crystalcube · Jun 8, 2006

Mintmaster said:
But my question remains: Does RSX backface cull and/or clip triangles at a rate of one per clock? That would certainly reduce the impact of setup rate.

Fafalada said:
Bah, BF culling on GPU is for winnies anyway. Real MAN's GPUs make you do it by hand.

nAo said:
best kept secret ever

hmmm the plot thickens

Titanio · Jun 8, 2006

predicate said:
Is the low texture resolution in PS3 games likely related to/caused by the fact that the VRAM bandwidth is only 22.4GB/sec, and it has to put framebuffer and all textures through that, making large textures impossible to use fast enough?

No, because that is not a fact.

It's quite reasonable for RSX to use XDR for texturing/vertex access and even rendering outside of the primary framebuffer.

I don't see a problem with 'texture resolution' in PS3 games either, for that matter.

Mintmaster · Jun 8, 2006

nAo said:
statistically speaking it makes sense a lot of sense. as long as you have a deep primitves fifo between your post transformed cache and your setup engine..

How big can this fifo be? A 20K model would easily run hundreds of polys consecutively that are culled, and this holds even more so for terrain and whole objects that get clipped. The output of a vertex shader can easily be 50-100 bytes or more. AFAIK, the post transform cache is 63 verts max on RSX.

I don't think it's that cheap..with all those interpolators to set up...

What is there to set up exactly? The interpolation itself is done at the pixel shader, and is a per-pixel cost. After clipping/culling comes the scan conversion, where you figure out pixel locations and per-pixel or per-quad perspective-correct factors used for z and interpolation. It shouldn't really matter how many interpolators you have for this setup. When doing pixel shading, you just lerp between the raw per-vertex values using these factors. Xenos generates 16 iterator values per clock, and RSX probably does 24 per clock (in order to feed the texture units).

best kept secret ever

Pffft. You suck.

predicate · Jun 8, 2006

Titanio said:
No, because that is not a fact.

It's quite reasonable for RSX to use XDR for texturing/vertex access and even rendering outside of the primary framebuffer.

I don't see a problem with 'texture resolution' in PS3 games either, for that matter.

Compare the environment texture detail of MGS4 to Gears of War. They aren't even in the same league. No PS3 games even have textures as detailed as Kameo, a 360 launch game.

Look at, for example, the texture detail of the environment (or the NPC models, even) in this shot of MGS4 compared with the texture detail of the background of this shot of Gears of War. There's a huge discrepancy in the texture resolution. Resistance shows similar low texture resolution; even Half-Life 2: Episode 1 has that kind of texture detail.

There is definitely a discrepancy in average texture resolution between the two platforms, if you ask me, and it makes sense considering PS3's diminished VRAM bandwidth. Does anyone think this might be a problem for PS3? I mean, considering the GPU in there (assuming it's closer to a 7800 than a 7600) was designed to operate with twice the VRAM bandwidth.

Titanio · Jun 8, 2006

I don't know if what you're comparing is texture resolution. I could compare MGS4 to any number of other 360 games to its benefit - certainly complaints about 360 texture filtering have not been few or far between. But I don't see anything wrong with MGS4's texturing.

Theoretically PS3 should be in a better position wrt texturing than 360. 16 texture units on 360 compared to up to 24 on PS3 at a 10% higher clock. Reportedly larger texture cache. In terms of bandwidth, 360 is splitting 22.4GB/s between the CPU and GPU for texture/vertex fetch, PS3 is using up to 22.4GB/s for the primary fb and then using the rest for texture/vertex, and splitting 25.6GB/s with the CPU for the same.

You seem to continue to completely ignore the XDR bandwidth available to RSX.

Jawed · Jun 8, 2006

Titanio said:
No, because that is not a fact.

It's quite reasonable for RSX to use XDR for texturing/vertex access and even rendering outside of the primary framebuffer.

I don't see a problem with 'texture resolution' in PS3 games either, for that matter.

Texturing from XDR is going to suffer from higher latency than texturing from GDDR3 - the data simply has to travel further. How easy is it for RSX to hide that latency? Perhaps an examination of the texturing performance of TurboCache would help?

Also, if you've got textures in XDR then that's less memory for the rest of the game.

So is the relief of GDDR3 bandwidth consumption worth the expense of XDR space, bandwidth and latency?

It's far from a zero-sum problem.

Jawed

predicate · Jun 8, 2006

Titanio said:
I don't know if what you're comparing is texture resolution. I could compare MGS4 to any number of other 360 games to its benefit - certainly complaints about 360 texture filtering have not been few or far between. But I don't see anything wrong with MGS4's texturing.

Theoretically PS3 should be in a better position wrt texturing than 360. 16 texture units on 360 compared to up to 24 on PS3 at a 10% higher clock. Reportedly larger texture cache. In terms of bandwidth, 360 is splitting 22.4GB/s between the CPU and GPU for texture/vertex fetch, PS3 is using up to 22.4GB/s for the primary fb and then using the rest for texture/vertex, and splitting 25.6GB/s with the CPU for the same.

You seem to continue to completely ignore the XDR bandwidth available to RSX.

No, I'm definitely comparing texture resolution. I'm seeing big discrepancy. In the majority of PS3 games I've seen, where the lighting and polycounts are 'next-gen' the textures look like the kind of stuff my X800 kicks around without fussing. Especially MGS4, which I think look beautiful, but that's much more to do with the quality of the character models, lighting and general 'artistry' than it is to do with technical aspects (which it seems to be lagging behind other games in terms of).

The reason I'm 'ignoring' the XDR access from RSX is that I thought it had been agreed that the latency is way too high for you to use it for a sustained bandwidth-intensive purpose like texturing?

Titanio · Jun 8, 2006

Jawed said:
Texturing from XDR is going to suffer from higher latency than texturing from GDDR3 - the data simply has to travel further. How easy is it for RSX to hide that latency?

Does it matter, if you're bound somewhere else?

A number of theories have been floated about RSX latency hiding, but no one has totally confirmed what is actually allowed.

Jawed said:
Also, if you've got textures in XDR then that's less memory for the rest of the game.

True, but if there were a unified pool, any space the GPU occupies is less space for the rest of the game

That's a tradeoff for the developer to make given their own requirements.

predicate said:
The reason I'm 'ignoring' the XDR access from RSX is that I thought it had been agreed that the latency is way too high for you to use it for a sustained bandwidth-intensive purpose like texturing?

I don't think it has..

Mintmaster · Jun 8, 2006

Fafalada said:
It might be a big deal when the setup pipeline grows more complex and clock speeds increase at the same time.

Sure it might be, but its role has pretty much stayed the same for a while now. I don't think its complexity has grown anywhere near as fast as everything else in a GPU.

No but then again that's true of every bus outside trivial applications.

Well, not to that extent. I expect the GDDR3 bus to average maybe 70% of its peak rate because you'll have the biggest and most constant BW load (colour and z) passing through there all the time.

Fafalada said:
Bah, BF culling on GPU is for winnies anyway. Real MAN's GPUs make you do it by hand.

You mean by tranforming the camera postion to object space and using Cell? If there's no matrix blending, sure I guess. Does it always work out faster on SPE's than RSX? How about frustum culling?

Funny thing is that when I was at ATI they optimized the drivers to do this very thing to assist the HW T&L back in the day. Had to worry about load balancing with the CPU, or it would harm performance. Helped very nicely with 3DMark2001.

PS3 GPU not fast enough.. yet?

SPM

Acert93

Artist formerly known as Acert93

Mintmaster

Npl

Dave Baumann

Gamerscore Wh...

Mintmaster

Fafalada

Urian

London Geezer

nAo

Nutella Nutellae

predicate

crystalcube

Titanio

Mintmaster

predicate

Titanio

Jawed

predicate

Titanio

Mintmaster

Similar threads