Vertex shading on CPUs & accommodating vertex-biased work

Jawed said:
Jaws, don't forget that ATI architectures have a "free texture address calculation" per clock because of the dedicated ALU for this.

I don't know what it amounts to in terms of FLOPs. I don't see any point in calculating that though - FLOPs are such a meaningless measure in these discussions.

Jawed

I've already included it as the 16 filtered texture instructions per cycle for Xenos.

And I'm not comparing flops at the moment here but the instruction distributions in the fragment pipelines...
 
Jaws said:
Each G70/RSX fragment pipeline can issue 5 inst./cycle, can anyone give a breakdown? Here's my guess,

5 maths
4 maths + 1 texture
4 maths + 1 norm
3 maths + 1 texture + 1 norm

Sadly, this breaks down extremely quickly. One thing I learnt from the efficiency thread is that NV40/G70 has "combinations" of instructions that it can execute in one cycle.

There might be 50 or 100 of these different combinations (just a guess). Some combinations will amount to only 2 instructions per clock. Others might amount to 6 or 8 or more instructions.

It's hideously complicated.

Simple texture operations appear not to block ALU1 - I don't know what defines this. But I don't think it's possible to say that a texture operation on ALU1 always blocks that ALU from other instructions.

Jawed
 
Jawed said:
The Post-VS cache (buffer, frankly, not cache) has grown from 24 vertices in 6800Ultra to 45 in 7800GTX - presumably RSX will be the same.
The post-transformed cache and the transformed-vertices FIFO that sits between the VS and the setup engine are two different things.
 
Jawed said:
Still, some people around here like to carry on under the assumption that Xenos's architecture is impossible to run with near 100% efficiency :oops:
Jawed
Yeah... it's so simple I can't understand why no one has done it before ;) :LOL:
 
nAo said:
Jawed said:
The Post-VS cache (buffer, frankly, not cache) has grown from 24 vertices in 6800Ultra to 45 in 7800GTX - presumably RSX will be the same.
The post-transformed cache and the transformed-vertices FIFO that sits between the VS and the setup engine are two different things.

So what do these two different things do then?

Jawed
 
Jawed said:
xbdestroya said:
Where I'm coming from is the angle that at any given moment in time all the ALU's in a given array must be working on either vertex or pixel work - as in though the work itself might be dynamic, within a given array you can't have 5 ALU's working on vertex and 11 ALU's working on pixel at the same moment in time.

Or is this wrong?

Nope. A shader array is defined simply by the fact that it is an array of 16 execution units, each of which shares a single program counter. It's a single-instruction multiple-data array.

In a conventional GPU the array would consist of 4 units in a "quad". In Xenos, it's 16. It's simply larger grained.

Every unit is performing precisely the same instruction. Effectively there are 16 threads all being processed, one instruction at a time.

On the next tick of the clock, a completely different set of 16 threads are processed. Maybe with the same instruction (e.g. another 16 fragments). Equally, it could be a different instruction in different threads, e.g. for 16 vertices.

Jawed

Word.

Muchas gracias Jawed. 8)
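Jawed's description of the array — one shared program counter, all 16 units running the same instruction, a different batch of threads on the next tick — can be sketched in a few lines. This is a toy illustration of the idea only, not Xenos's actual scheduler; the instruction names and batch mix are invented:

```python
# Toy model of a 16-wide SIMD shader array that switches batches of
# threads every clock. Each Batch shares a single program counter
# across all 16 lanes, so every lane executes the SAME instruction.
# Instruction names and workloads are made up for illustration.

WIDTH = 16  # lanes per array

class Batch:
    def __init__(self, kind, program):
        self.kind = kind        # "pixel" or "vertex"
        self.program = program  # list of instruction names
        self.pc = 0             # ONE program counter for all 16 lanes

    def done(self):
        return self.pc >= len(self.program)

def run(batches):
    clock = 0
    while any(not b.done() for b in batches):
        # Round-robin: a completely different batch on the next tick,
        # possibly a different instruction and a different object type.
        b = batches[clock % len(batches)]
        if not b.done():
            instr = b.program[b.pc]
            print(f"clock {clock}: {b.kind:6s} batch -> "
                  f"'{instr}' on all {WIDTH} lanes")
            b.pc += 1
        clock += 1

run([
    Batch("pixel",  ["tex", "mad", "mad"]),
    Batch("vertex", ["dp4", "dp4"]),
])
```

The point of the sketch is the one Jawed makes: the array doesn't care whether a batch holds 16 fragments or 16 vertices, since the per-clock mechanics are identical either way.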
 
Inane_Dork said:
I would think that would take up more than half your SPE pool and require you to stick with very simple pixel shaders. Still, it would be possible.

I actually figured, just purely looking at Gflops which may be simplistic, that it would take nearly all your SPEs.

But while cycle-to-cycle you may see a 100:0 ratio of vertex to pixel work on Xenos, frame-to-frame you won't. The tasks being discussed where 100:0 is useful would apparently be ultra-fast. If that's the case, I'm sure you could do the same for one or two tasks per frame with your SPEs and VS before letting the work on the CPU balance out between vertex work and other CPU jobs, i.e. do a really fast Z-pass with all SPEs once before returning to a more balanced physics-etc./vertex mix for the rest of the frame.

Frame-to-frame, rather than cycle-to-cycle, what might the ratio look like when we move away from a heavy pixel bias? How far would it go in favour of vertices? That would be a better indicator of how many SPEs would generally have to be invested in vertices. If over the frame the proportions of work are more like 50:50, which in itself would be a significant shift away from a (typically) heavy pixel bias, then looking purely at Gflops you might need maybe 2-3 SPEs for most of the time to match the same amount of vertex work, in absolute terms, as is being done by Xenos. That'd be pretty comfortable to accommodate, I'd think. 60:40 would be more extreme and you'd need about 3 SPEs; 70:30 would require 4.

Also, I'm perhaps missing something, but why would your pixel shaders have to be more simple?

Re. load balancing, how much overhead would a software solution introduce? Although NVidia may "provide" here, if we're lucky. The problem could be approached from a number of angles, methinks, if done in software.
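Titanio's SPE counts depend entirely on the per-unit flop figures assumed, so here is one possible back-of-envelope version of the estimate. All figures are placeholder assumptions (era-typical ballpark numbers, not from this thread): ~25.6 GFLOPS per SPE, ~240 GFLOPS for Xenos's unified ALUs, ~44 GFLOPS for RSX's own eight vertex pipes, which the SPEs would not need to duplicate. Small changes to any of these move the answer by an SPE or two either way, which is exactly why estimates in the thread range from 2-3 up to 4:

```python
import math

# Placeholder assumptions, not figures from the thread:
SPE_GFLOPS    = 4 * 2 * 3.2              # 25.6: 4-wide FMA at 3.2 GHz
XENOS_GFLOPS  = 48 * (4 + 1) * 2 * 0.5   # 240.0: 48 x (vec4+scalar) MADD x 500 MHz
RSX_VS_GFLOPS = 8 * (4 * 2 + 2) * 0.55   # 44.0: 8 pipes x (vec4 MADD + scalar) x 550 MHz

def spes_to_match(vertex_share):
    """SPEs needed so that SPEs + RSX's own vertex shaders together
    match the vertex share of Xenos's unified ALU pool."""
    deficit = XENOS_GFLOPS * vertex_share - RSX_VS_GFLOPS
    return max(0, math.ceil(deficit / SPE_GFLOPS))

for share in (0.5, 0.6, 0.7):
    pct = int(share * 100)
    print(f"{pct}:{100 - pct} split -> {spes_to_match(share)} SPEs")
# -> 3, 5 and 5-ish SPEs depending on rounding; with these exact
#    placeholder numbers: 3 at 50:50, 4 at 60:40, 5 at 70:30.
```

This is deliberately naive "gflop counting", the same simplification Titanio flags; it says nothing about whether SPE code could actually sustain peak rates on vertex work.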
 
I think the post-transformed vertex cache is that small, but I can't think why a post-transformed vertex FIFO couldn't be bigger than that; IMHO a 20-vertex buffer is absolutely not enough to absorb even a small workload imbalance.

FWIW I didn't make the number up; the FIFOs are, or at least circa NV20 were, extremely small.
 
Jawed said:
Jaws said:
Each G70/RSX fragment pipeline can issue 5 inst./cycle, can anyone give a breakdown? Here's my guess,

5 maths
4 maths + 1 texture
4 maths + 1 norm
3 maths + 1 texture + 1 norm

Sadly, this breaks down extremely quickly. One thing I learnt from the efficiency thread is that NV40/G70 has "combinations" of instructions that it can execute in one cycle.

There might be 50 or 100 of these different combinations (just a guess). Some combinations will amount to only 2 instructions per clock. Others might amount to 6 or 8 or more instructions.

It's hideously complicated.

Simple texture operations appear not to block ALU1 - I don't know what defines this. But I don't think it's possible to say that a texture operation on ALU1 always blocks that ALU from other instructions.

Jawed

Something tells me that we won't get to the bottom of this but anyway, I opened another thread to discuss specifically the G70 fragment pipeline,

http://www.beyond3d.com/forum/viewtopic.php?t=24933

But from what you've mentioned, the 5 instructions/cycle for each G70 fragment pipeline may also exclude 'norm'... :?
 
Jawed said:
xbdestroya said:
Where I'm coming from is the angle that at any given moment in time all the ALU's in a given array must be working on either vertex or pixel work - as in though the work itself might be dynamic, within a given array you can't have 5 ALU's working on vertex and 11 ALU's working on pixel at the same moment in time.

Or is this wrong?

Nope. A shader array is defined simply by the fact that it is an array of 16 execution units, each of which shares a single program counter. It's a single-instruction multiple-data array.

In a conventional GPU the array would consist of 4 units in a "quad". In Xenos, it's 16. It's simply larger grained.

Every unit is performing precisely the same instruction. Effectively there are 16 threads all being processed, one instruction at a time.

On the next tick of the clock, a completely different set of 16 threads are processed. Maybe with the same instruction (e.g. another 16 fragments). Equally, it could be a different instruction in different threads, e.g. for 16 vertices.

Jawed

thanks Jawed.


I'm still not fully understanding how Xenos maintains the 95-100% efficiency in that case, but that's my fault. ;)
 
Xenos is designed to switch threads every clock. It doesn't care if there's a change from pixel fragments to vertices because the shader instructions and data formats are essentially common to both types of object.

The loss in efficiency would appear to derive from units that are operating (because of the 16-wide array) but have nothing to do. A good example is where a triangle edge leaves one or more pixels in a batch of 16 "off the edge" of the triangle.

Jawed
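The edge-of-triangle loss Jawed describes can be counted directly: any 16-wide batch a triangle touches costs 16 lane-clocks even if only a few of its pixels are inside. The 4x4 tile shape below is an assumption for illustration, not Xenos's documented batch layout:

```python
# Toy count of SIMD-lane waste at triangle edges. Pixels are grouped
# into 4x4 tiles of 16 lanes (an assumed shape). Every touched tile
# issues on all 16 lanes, so partially covered tiles waste the rest.

def inside(x, y):
    # A right triangle with legs on the axes: x >= 0, y >= 0, x + y < 10.
    return x + y < 10

W = H = 12   # small framebuffer region
TILE = 4     # 4x4 = 16 lanes per batch

busy = total = 0
for ty in range(0, H, TILE):
    for tx in range(0, W, TILE):
        covered = sum(inside(tx + i, ty + j)
                      for j in range(TILE) for i in range(TILE))
        if covered:                  # tile touched: all 16 lanes issue
            busy += covered          # lanes doing useful work
            total += TILE * TILE     # lane-slots actually spent

print(f"covered pixels: {busy}, lane-slots spent: {total}, "
      f"efficiency: {busy / total:.0%}")
# -> covered pixels: 55, lane-slots spent: 96, efficiency: 57%
```

Even this clean, axis-aligned triangle runs its edge tiles well below full occupancy; long thin triangles would fare worse, which is the cost of the coarser 16-wide granularity versus a conventional quad.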
 
Jawed said:
So what do these two different things do then?
Their respective names are self-explanatory: the first is a cache and holds the most recently transformed vertices; the second is a buffer, and its purpose is to reduce pipeline bubbles/stalls between the VS and the setup engine.
For example, we could afford a post-transform cache of only a few vertices while a much larger FIFO (filled with transformed vertices coming from the post-transform cache) sits between the VS and the setup engine to damp workload imbalance.
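nAo's two structures can be sketched in a few lines: a tiny cache that lets repeated indices in an indexed mesh skip re-shading, feeding a FIFO that decouples the VS from the setup engine. The sizes and the simple oldest-first eviction here are arbitrary choices for illustration, not any real GPU's policy:

```python
from collections import OrderedDict, deque

# Toy model: small post-transform CACHE (avoids re-running the VS on
# recently seen vertex indices) feeding a FIFO that buffers transformed
# vertices for the setup engine. Sizes/eviction are arbitrary.

CACHE_SIZE = 4
FIFO_SIZE = 20

cache = OrderedDict()        # vertex index -> transformed vertex
fifo = deque(maxlen=FIFO_SIZE)
shaded = 0                   # actual vertex-shader invocations

def transform(idx):
    return ("xformed", idx)  # stand-in for running the vertex shader

# Indexed triangle list with shared edges: 9 indices, 5 unique vertices.
indices = [0, 1, 2,  1, 2, 3,  2, 3, 4]
for idx in indices:
    if idx in cache:
        v = cache[idx]                 # hit: no VS work at all
    else:
        v = transform(idx)             # miss: shade it
        shaded += 1
        cache[idx] = v
        if len(cache) > CACHE_SIZE:
            cache.popitem(last=False)  # evict the oldest entry
    fifo.append(v)                     # setup engine drains this later

print(f"{len(indices)} indices, {shaded} VS invocations, "
      f"{len(indices) - shaded} cache hits")
# -> 9 indices, 5 VS invocations, 4 cache hits
```

The cache earns its keep through index reuse even at a few entries, while the FIFO's only job is slack: letting the VS run ahead when setup stalls and vice versa, which is why, as nAo says, the FIFO can sensibly be much larger than the cache.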
 
Jaws said:
(I presume you've excluded 'norm' instruction in the 96 figure...)
Yep.

Each G70/RSX fragment pipeline can issue 5 inst./cycle...
That's the source of the difference. I thought one of the 2 ALUs was used as a texture coordinate setup unit when a texture op was made. Thus either you get 4 math ops + 1 free op or you get 2 math ops + 1 texture op + 1 free op, peak. That was my thinking, anyway.
 
Titanio said:
i.e. do a really fast z-pass with all SPEs once before returning to a more balanced physics etc/vertex mix for the rest of the frame.
Certainly possible, I would think. You might find yourself limited by fillrate, bandwidth or one of the fixed function parts of the pipe, but it would be interesting to see where you could throw the bottleneck.

Also, I'm perhaps missing something, but why would your pixel shaders have to be more simple?
It depends on how you're thinking of adding polygons. If you want to draw the same objects just with better tessellation, then pixel shader complexity need not change. If you want to draw far more objects with roughly similar tessellation, you will be drawing more pixels.
 
Inane_Dork said:
Also, I'm perhaps missing something, but why would your pixel shaders have to be more simple?
It depends on how you're thinking of adding polygons. If you want to draw the same objects just with better tessellation, then pixel shader complexity need not change. If you want to draw far more objects with roughly similar tessellation, you will be drawing more pixels.

Thanks again for your reply.

I'm not sure I really follow on this part, though. How are you painting more pixels? Does this not remain constant, fixed to resolution? Or are you talking about work per pixel? Or..?

If this were the case, would this - high vertex bias - not then be quite unattractive (depending on developer goals), on Xenos? It may be a more sustainable situation on PS3, where for example with a 50/50 vertex:pixel ratio, on paper RSX would be left with roughly 2x the pixel power to keep pixel shader complexity higher.
 
Titanio said:
I'm not sure I really follow on this part, though. How are you painting more pixels? Does this not remain constant, fixed to resolution? Or are you talking about work per pixel? Or..?
I guess if you stuck with a Z prepass, it wouldn't be too different. Though, it depends on what you're adding. If you add more guys, no big deal. If you add more birds, small deal as you replace skybox filling with more expensive ops. If you add more trees, the alpha testing may add work. And the main culprit is if you added more particles which all have to be pixel shaded presuming they're not occluded by an opaque object.

If this were the case, would this - high vertex bias - not then be quite unattractive (depending on developer goals), on Xenos? It may be a more sustainable situation on PS3, where for example with a 50/50 vertex:pixel ratio, on paper RSX would be left with roughly 2x the pixel power to keep pixel shader complexity higher.
Well, if you're comparing a few SPEs + RSX to Xenos, yeah, the first option has more power. :p But vertex work can go on XeCPU pretty well, and that would increase the pixel work you could do.

IIRC, ERP commented that with X1 level pixel work, you might get around the Xenos' polygon limit. With aid from the XeCPU, you could push a bit more complexity than that. Anyway, at that point you really have to re-examine what has to take place in the pixel shader. Is normal mapping worth it anymore? I don't know.
 
Inane_Dork said:
Titanio said:
I'm not sure I really follow on this part, though. How are you painting more pixels? Does this not remain constant, fixed to resolution? Or are you talking about work per pixel? Or..?
I guess if you stuck with a Z prepass, it wouldn't be too different. Though, it depends on what you're adding. If you add more guys, no big deal. If you add more birds, small deal as you replace skybox filling with more expensive ops. If you add more trees, the alpha testing may add work. And the main culprit is if you added more particles which all have to be pixel shaded presuming they're not occluded by an opaque object.

Ah, I see what you're getting at now. I can see how the average may rise depending on what you're doing.

Inane_Dork said:
Well, if you're comparing a few SPEs + RSX to Xenos, yeah, the first option has more power. :p But vertex work can go on XeCPU pretty well, and that would increase the pixel work you could do.

True. In our 50:50 instance, thinking again in quite simple "gflop" terms, using two Xenon cores for vertex processing could allow Xenos to split 15:85 in favour of pixels (maintaining the same amount of vertex performance as before). But to be fair to PS3, keeping just one PPE for non-graphics work would still allow it, roughly speaking again, nearly 2x the power for vertices and 1.15x the power for pixels vs X360.

As the bias toward vertices goes up, the gap in terms of vertices goes down and the gap in terms of pixels goes up. The only scenario I can figure out where X360 would "win" is where the 2 Xenon cores and a really high proportion of Xenos were busy with vertex work - say 90%. This is a crazy, academic extreme of course, but in that scenario X360 would have 1.2x the power for vertices, though the gain would be on only one front (on the pixel side, PS3 would then have a whopping 12x gap).

There are ways to try to reduce the pixel load with some vertex tricks, as mentioned before, but the same options are open to both systems on that front.
 
Titanio said:
As the bias toward vertices goes up, the gap in terms of vertices goes down and the gap in terms of pixels goes up. The only scenario I can figure out where X360 would "win" is where the 2 Xenon cores and a really high proportion of Xenos were busy with vertex work - say 90%. This is a crazy, academic extreme of course, but in that scenario X360 would have 1.2x the power for vertices, though the gain would be on only one front (on the pixel side, PS3 would then have a whopping 12x gap).
I think you're more likely to gain useful insight in thinking of reasonable scenarios and what usage is like in them. For instance, I wouldn't expect more than one thread on the XeCPU to be generating or shading vertices simply due to graphics state synchronization. Sure, you could work it otherwise, but I would expect no more in the general case.

There are ways to try to reduce the pixel load with some vertex tricks, as mentioned before, but the same options are open to both systems on that front.
True, but you always gain performance on Xenos from that, whereas you only gain a speedup on RSX when you're pixel shader bound. It's the mixed blessing/curse of dedicated hardware.

EDIT: You don't *always* get a speed up on Xenos, but you're much more likely to get more work done in its case.
 