PS3 GPU not fast enough.. yet?

ShootMyMonkey said:
Also, just how big is the post-transform cache? I don't think I've heard of one that's all that huge.
The last one I heard a specific size for was also the first: the 2-vertex cache the GF4/XGPU has to accelerate triangle-strip rendering. I think they're bigger now, but not terribly so. 10-20, maybe? They don't have to be very large to capture most of the recurrences, given mesh-optimization precomputation.
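To make that concrete, here's a toy FIFO cache model (a sketch under assumed conditions - a regular grid mesh and plain FIFO eviction - not a model of any real GPU). A naive row-major index order needs a cache as wide as the mesh, while a band-reordered version of the same mesh, mimicking what mesh-optimization precomputation does, gets near-optimal reuse out of a small cache:

```python
# Toy FIFO post-transform cache model; all numbers illustrative.

def shaded_verts_per_tri(indices, cache_size):
    """Vertex-shader runs per triangle with a FIFO post-transform cache."""
    cache, misses = [], 0
    for i in indices:
        if i in cache:
            continue                  # cache hit: no shader run
        misses += 1
        cache.append(i)
        if len(cache) > cache_size:
            cache.pop(0)              # FIFO eviction
    return misses / (len(indices) // 3)

def grid_tris(w, h, band=None):
    """Indices for a (w+1)x(h+1)-vertex grid. band=k splits the traversal
    into vertical bands k quads wide, mimicking a cache-aware reorder."""
    band = band or w
    out = []
    for x0 in range(0, w, band):
        for y in range(h):
            for x in range(x0, min(x0 + band, w)):
                a = y * (w + 1) + x
                b = (y + 1) * (w + 1) + x
                out += [a, b, a + 1, a + 1, b, b + 1]
    return out

for size in (2, 14, 45):
    naive = shaded_verts_per_tri(grid_tris(30, 30), size)
    opt = shaded_verts_per_tri(grid_tris(30, 30, band=5), size)
    print(f"cache {size:2d}: naive {naive:.2f}, reordered {opt:.2f} verts/tri")
```

On this mesh a 14-entry cache shades about 1 vertex per triangle with the naive order but about 0.6 with the reordered one, close to the ~0.53 ideal - which is the point: small caches suffice once the mesh is preprocessed.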
 
DarkRage said:
Maybe I am wrong, but I seem to remember nVidia talking about 1.1 billion.
Yeah, I believe the official line is 1.1 billion "vertex transformations" per second. Only 275M actual vertices, though. Sony's number only means anything if you happen to need four or more vertex transformations (i.e. matrix-vector multiplies) per vertex.
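Spelling the ratio out (a back-of-envelope check; the 550 MHz figure is the announced RSX clock, the rest comes from this thread):

```python
# Sony's headline figure vs. actual vertex throughput.
transforms = 1.1e9   # "1.1 billion vertices per second" (really transforms)
vertices   = 275e6   # peak vertices actually emitted per second
clock      = 550e6   # announced RSX clock

print(transforms / vertices)  # -> 4.0 matrix-vector multiplies per vertex
print(clock / vertices)       # -> 2.0 clocks per vertex, i.e. the same
                              #    pace as 1 triangle set up every 2 clocks
```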

Anyway, that's not so interesting. I am more interested in how Cell and RSX cooperate.
It depends on what you mean by "cooperate". If you're referring to what Sony says when they claim RSX can control the SPEs, then don't expect much. It'll pretty much be for vertex generation, and the biggest benefit is saving memory for vertices. If you're talking about how Cell and RSX are used together to make a game, then yes, it will be interesting.
 
ShootMyMonkey said:
D'oh... for some reason, it went into my head that he meant per shader in the array.


Yeah, I can see it as one of the caveats of doing loads of work in software. But in general, there's enough work to be done inside a vertex shader (at least when you have loads to do in a corresponding pixel shader) that you can theoretically fill in a good percentage of the stalls that might occur due to attribute read rates.
You need 8 vertex shader instructions per attribute in order not to be read-limited, so I don't think "in general" qualifies here. Also, it's the simpler pixel shaders that expose slow vertex rates, all else being equal. Shadow maps, z/stencil passes, particles, etc. Such shaders won't need a lot of vertex work to fill in iterators for the pixel shader to use.
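A plausible reconstruction of where a figure like that comes from - purely illustrative, assuming a single shared fetch path delivering one attribute per clock while 8 vertex units each retire one instruction per clock:

```python
# Hypothetical rates; real hardware details differ.
VERTEX_UNITS = 8
FETCH_PER_CLOCK = 1.0   # attributes/clock, shared by all units
ISSUE_PER_CLOCK = 1.0   # instructions/clock per unit

def bottleneck(attributes, instructions):
    fetch_clocks  = attributes / FETCH_PER_CLOCK
    shader_clocks = instructions / (VERTEX_UNITS * ISSUE_PER_CLOCK)
    return "fetch-bound" if fetch_clocks > shader_clocks else "shader-bound"

print(bottleneck(attributes=8, instructions=32))  # fetch-bound (4 instr/attr)
print(bottleneck(attributes=8, instructions=64))  # break-even at 8 instr/attr
print(bottleneck(attributes=8, instructions=96))  # shader-bound (12 instr/attr)
```

Under those assumptions the break-even point is exactly VERTEX_UNITS instructions per attribute, i.e. 8.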

EDIT: Oh yeah, the post-transform cache is 14 vertices for Xenos, and I can guess that it's 45 vertices for RSX because Digit-Life did a test suggesting this is the case with G70. Of course, it could easily be different for RSX, and we don't know whether their test was accurate.
 
ManuVlad3.0 said:
Which makes sense. Basically there is no official Cell<<GDDR memory BW and no Cell<<GDDR lines, and that 16 MB/s figure comes from jumping through hoops to access the GDDR, which accounts for it being so slow - presumably making requests of the RSX to fetch and deliver. For clarity I'd like to see a table of dependencies showing how these BWs interact - how one data path affects another and reduces available BW. E.g. using 22.4 GB/s of RSX<<GDDR means 0 GB/s of RSX>>GDDR. Does that 16 MB/s Cell<<GDDR consume some or all of that BW, or the 4 GB/s read BW, or 16 MB/s of the RSX>>Cell BW, or what?
 
Jawed said:
What's puzzling me is: did we or didn't we know that NV4x sets up 1 triangle every 2 clocks? Was this a surprise?
It wasn't a surprise - but NVidia did help to muddy the waters (they never released any official number publicly, and they never bothered to correct hardware review sites when they posted transform numbers as the limit).
Incidentally, I don't recall triangle-setup numbers being publicly announced for any GPUs since, well, the GeForce 1 days. Most people just assumed a 1:1 mapping to geometry engines.

Acert,
there's a whole bunch of considerations (many of which have been mentioned already). One of the bigger ones is that rendering small triangles (IIRC anything <=20 pixels in size) can severely decrease your PS performance.
Last gen, things were kind of inverted (at least on GC/PS2). Especially on PS2 - the sweet spot was somewhere around 10-30 pixel triangles, and efficiency could actually decrease on large polys, so you wanted to avoid them as much as possible.
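A toy model of that small-triangle penalty: shading happens in 2x2 quads, so a small triangle's edge pixels drag helper pixels along. The coverage formula below is a crude approximation for a roughly square triangle, not any specific rasterizer:

```python
import math

def quad_efficiency(pixels_covered):
    """Fraction of shaded pixels that are actually covered."""
    side = math.sqrt(pixels_covered)     # pretend the triangle is square-ish
    quads = (side / 2 + 1) ** 2          # quads touched, incl. partial ones
    return min(1.0, pixels_covered / (4 * quads))

for area in (1, 4, 10, 20, 100, 1000):
    print(f"{area:5d} px triangle -> ~{quad_efficiency(area):.0%} of PS work useful")
```

By this estimate a 20-pixel triangle wastes roughly half the pixel shader's work, while a 1000-pixel one wastes little - consistent with the <=20 pixel rule of thumb above.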

Mintmaster said:
Many (maybe over half) of your polygons will be either offscreen or backfaces. When processing those, the pixel shaders' only hope of staying active is to feed off the visible polygons sitting in the post-transform cache, of which there won't be many.
That's where geometry shading capabilities come into play - there are far more interesting uses than creating/deleting stuff, but optimization of the vertex stream can be one of them.
 
Faf, so do you feel that having a lower triangle-setup rate will not be a really big limit for you, and that pushing for almost 2x the triangle-setup rate (reaching Xenos's 500 MTriangles/s peak) would be a waste for modern graphics engines?
 
You are talking about 500M vs. 275M, but what if RSX has higher peak triangle efficiency than Xenos?

If I remember correctly, it has been mentioned that one or even more SPUs on Cell can be dedicated to triangle generation or processing (or even to triangle setup). Wouldn't that mean that the Cell + RSX architecture is extremely flexible regarding the triangle limit? Then it would be up to the developer what they want from the 3D GFX (higher geometry, or better post-processing effects?).

Just my 0.02c.
 
You need 8 vertex shader instructions per attribute in order not to be read-limited, so I don't think "in general" qualifies here.
8 per attribute doesn't sound that extreme to me, but then, I'm getting that largely out of disappointment with the generated SM3 code from a shader compiler. In all fairness, I haven't seen the results of PS3's Cg compiler, but MS' HLSL compiler... eeeehhh...

Even otherwise, there are certainly common cases where you'll hit way more than that (e.g. skinned geometry).
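For instance, here's a minimal sketch of 4-bone skinning (the op counts in the comments are rough and illustrative; a real shader compiler schedules this differently). The position attribute alone already costs well over 8 instructions:

```python
import numpy as np

def skin_position(pos, bone_mats, indices, weights):
    """Blend one vec4 position across 4 bones (matrix-palette skinning)."""
    out = np.zeros(4)
    for i, w in zip(indices, weights):
        out += w * (bone_mats[i] @ pos)  # ~4 dot products + 1 madd per bone
    return out                           # ~20 vec4 ops for a single attribute

bones = np.stack([np.eye(4)] * 8)        # identity palette, just for the demo
print(skin_position(np.array([1.0, 2.0, 3.0, 1.0]),
                    bones, indices=[0, 1, 2, 3], weights=[0.4, 0.3, 0.2, 0.1]))
```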

Also, it's the simpler pixel shaders that expose slow vertex rates, all else being equal. Shadow maps, z/stencil passes, particles, etc. Such shaders won't need a lot of vertex work to fill in iterators for the pixel shader to use.
Yeah, but those are cases where the amount of data and the work you need to do are light enough that I doubt they'll be major problem cases overall (unless they proliferate in number, of course). The gap between the "typical" shaderized polygons and the not-so-shaderized polygons is pretty big, and it's much more likely that someone will create something too complex than load up on things that are too simple (the power of money).
 
Mintmaster said:
Yeah, I believe the official line is 1.1 billion "vertex transformations" per second. Only 275M actual vertices, though. Sony's number only means anything if you happen to need four or more vertex transformations (i.e. matrix-vector multiplies) per vertex.

It depends on what you mean by "cooperate". If you're referring to what Sony says when they claim RSX can control the SPEs, then don't expect much. It'll pretty much be for vertex generation, and the biggest benefit is saving memory for vertices. If you're talking about how Cell and RSX are used together to make a game, then yes, it will be interesting.
Don't know about the "1.1 billion vertex transformations"; all I've ever seen expressed is "1.1 billion vertices per second".

ps3source - According to a press release by Sony at the May 16 2005 E3 Conference, the specifications of the PlayStation 3 are as follows:
1.1 billion vertices per second
Also Wikipedia.

-aldo
 
That's the theoretical maximum rate of the vertex shaders; it will not achieve that because it will be setup-limited.
 
Dave Baumann said:
That's the theoretical maximum rate of the vertex shaders; it will not achieve that because it will be setup-limited.
That's what I've been hearing. ;)

I was just wanting to point out the official Sony specifications.

-aldo
 
Dave Baumann said:
That's the theoretical maximum rate of the vertex shaders; it will not achieve that because it will be setup-limited.


:oops: What determines the limit of the setup rate :?:
 
ERP said:
The setup engine.

While we're on the subject, does anyone have a nice chart of G70's workflow for triangle setup, VS, etc.?

Now that everyone is on the same page with regard to RSX being a modified G70, it may be worthwhile to go over the rendering pipeline.
 
Kryton said:
No it shouldn't, unless you want suckful performance. I'm saying the graphics hierarchy is the same as what we had (AGP), and no one had a problem with it until they tried to push it too hard (the PS3 interface is entirely different but follows a similar design).
Coding on PS3 like you do on a PC (well, at least in the aspects relevant to this discussion) won't be "suckful" at all. The only big exception is that you should use XDR texturing as much as possible (provided you have the RAM space), whereas AGP texturing is undesirable on the PC.

It's true that there are some new effects that might be possible by doing post-processing on Cell, and GPU->CPU transfers are waaaaaay faster on PS3 than on a PC. However, if you don't have any reason to do this, high-level PS3 game structure likely won't be any different than on a PC. Certainly, practicing PC coding habits on the PS3 won't hurt it.

People keep expecting the FlexIO line to be critical in PS3 performance, but the only way you get 35 GB/s of transfer is with constant reading and writing throughout a frame. Graphics workloads just don't happen that way. If you get an average of over 100MB of data moved between RSX and Cell per frame I'd be very surprised.
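Putting those numbers on a per-frame basis (the 60 fps assumption is mine):

```python
flexio = 35e9        # bytes/s, quoted aggregate Cell<->RSX bandwidth
fps = 60

print(f"{flexio / fps / 2**20:.0f} MiB/frame at full saturation")       # ~556
print(f"{100 * 2**20 * fps / 1e9:.1f} GB/s if you move 100 MiB/frame")  # ~6.3
# Even "only" 100 MB per frame is ~6 GB/s sustained, which is why
# saturating the link with a real graphics workload is implausible.
```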

And the PC has the opposite - which is better, I don't know.
I don't know which is better either, though the reason for this on the PC is modularity. Different people have different needs from a graphics card (just look at the order-of-magnitude gap between low end and high end), so it doesn't make sense to have a unified memory pool. I can say with certainty that 50 GB/s to a 512MB pool is better than 25 GB/s to each of two 256MB pools. This, of course, is not the sort of situation we're seeing in XB360 vs. PS3, so the point is moot. In a way, XB360 sort of has a split pool too.

Making a good memory controller that can handle requests coming at it from 2 locations is tricky because you have to give priority to someone, but who? Full duplex just means you can read and write at the same time, so I'm not sure what you mean here.
Actually, you only need to consider priority if you're transferring at the peak rate (i.e. can't accommodate both), and in this case deciding who gets priority is mostly irrelevant.

When you're bandwidth-limited, you have some total amount of data transfer that's necessary to complete one frame of game code and rendering. There is usually a frame of latency between the CPU and GPU because draw calls are buffered up, and there isn't any interdependency between them. What I'm saying is that the GPU usually has one frame of commands to execute while the CPU is preparing the next one. So if you're BW-limited, it doesn't matter what order you do this in, since you'll saturate BW either way, and there's no way to go faster than that.

The only time priority matters is if you have drastic changes in BW consumption throughout a frame for sustained periods. Then ordering could matter, because low BW code on the CPU would run best with a high BW load on the GPU, and vice versa. This is not something that the memory controller can predict, though, and it's up to the coder to manually assign priorities or reorder their code. If one is on average much higher than the other, it makes sense to give the low load the priority because the high load can always fill in the gaps to keep the bus saturated.

But if both loads are high BW, then it doesn't matter who goes first.
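The saturated case reduces to trivial arithmetic (numbers arbitrary):

```python
bus_rate = 25.0              # GB/s, the shared-bus peak
cpu_gb, gpu_gb = 10.0, 20.0  # traffic each client needs this frame

print((cpu_gb + gpu_gb) / bus_rate)  # 1.2 units of bus time, in any order
# Priority only changes *when* each client's requests finish, not the
# frame total - unless one client sits idle, which is exactly the
# "drastic changes in BW consumption" scheduling problem above.
```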

Also, handling multiple clients in the memory controller isn't anything new or "tricky". GPUs already have to handle requests from the command processor, vertex engine, texture units, z test units, render back-end, etc. It's not a big deal for ATI or NVidia to add in CPU requests as they've done with XBox 360, original XBox, and integrated chipsets.
 
Dave Baumann said:
That's the theoretical maximum rate of the vertex shaders; it will not achieve that because it will be setup-limited.
It's not the max rate of the vertex shaders; it's the max rate at which you can pull data out of the post-transform vertex cache, and yes, it's achievable.

BTW, does anyone know whether on Xenos back-face culling is performed at the primitive-setup level or at an earlier stage?
 
So nAo, what are you trying to say: that RSX can cull and clip one polygon per clock? One vertex output from the vertex shader, and two from the cache? I always considered culling and clipping to be part of setup myself.
 
Mintmaster said:
So nAo, what are you trying to say, that RSX can cull and clip one polygon per clock? One vertex output from the vertex shader, and two from the cache? I always considered culling and clipping to be part of setup myself.
I was just wondering whether on Xenos BFC is part of the setup stage or not.
If it is, we might say that in most cases the actual setup cost is 2 clocks per visible triangle, not just one, because on average one out of every 2 triangles faces away from the camera.
Does it make sense to have a very fast triangle-setup engine if it's idle half the time?
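Spelling that arithmetic out (the rates are the ones assumed in this thread):

```python
setup_rate = 1.0   # triangles entering setup per clock
backfacing = 0.5   # ~half of a closed mesh faces away from the camera

visible_rate = setup_rate * (1 - backfacing)
print(f"{1 / visible_rate:.0f} clocks per visible triangle")  # -> 2
# If culling lives inside setup, a "1 tri/clock" engine only delivers a
# visible triangle every other clock; culling earlier would recover it.
```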
 
Pana said:
Faf, so do you feel that having a lower triangle-setup rate will not be a really big limit for you
I believe that there are other problems with that much geometry before you even try to render it. And IF they can be adequately solved, we should not need to render counts near the peak hw rates.
That said, small tris being inefficient is definitely a change of direction in console space - and it doesn't help that most people still grossly overestimate the polycounts of the last generation (then again, the PS1 generation was overestimated in that regard too).

nAo said:
Does it make sense to have a very fast triangle-setup engine
If your hw doesn't cull at all it does :p
 