Vertex shading on CPUs & accommodating vertex-biased work

BenQ · Jul 15, 2005

xbdestroya said:
BenQ said:

I don't see how you can claim that shading power has to be divvied up into multiples of 3 simply because the Xenos has 3 shader arrays, as all the ALUs on each of those arrays function independently of the others.

Isn't it the case that all ALU's inside a given array must be working on either vertex or pixel shading at any given time?

No, not according to Dave's Xenos article.

Carl B · Jul 15, 2005

BenQ said:
No, not according to Dave's Xenos article.

Well, then I've missed the boat - no doubt. But where in the article does it state to the contrary?

BenQ · Jul 15, 2005

xbdestroya said:
BenQ said:

No, not according to Dave's Xenos article.

Well, then I've missed the boat - no doubt. But where in the article does it state to the contrary?

I already quoted the article. Check back on the 1st page = )

Carl B · Jul 15, 2005

BenQ said:
I already quoted the article. Check back on the 1st page = )

But that separation is between the ALU arrays; within the arrays all ALUs must be performing either pixel or vertex work.

Anyway I'm going to bed now - I'm sure by the time I wake up it'll be definitively answered one way or another.

Tap In · Jul 15, 2005

xbdestroya said:
BenQ said:

I don't see how you can claim that shading power has to be divvied up into multiples of 3 simply because the Xenos has 3 shader arrays, as all the ALUs on each of those arrays function independently of the others.

Isn't it the case that all ALU's inside a given array must be working on either vertex or pixel shading at any given time? Just read over my original post, replacing 'array' for 'unit' and 'ALU' for 'pipe.'

If we just consider a single one of the arrays for the time being - with 16 ALU's available this means that on every cycle it is processing a maximum of either 16 vertices or four 2x2 pixel quads. However, as there is no pipelining from one set of ALU's to the next, the ALU array will need to first process the first shader instruction, then go back and process the second shader instruction. For cases where there is a direct data dependency (i.e. the first instruction says A + B = C and the resultant value for C is used in the next instruction), there must be some way of making sure that C is available in time for the second instruction to execute.

When the ALU's move from one instruction to the next, there is an inherent latency (this is the amount of pipeline clocks it takes to execute the first instruction). The Xenos shader contains a large number of independent groups of pixels and vertices (threads) which are 16 wide. In order to hide the latency of an instruction for a given thread, a number of other threads are used to "fill in the gaps". By doing this, the ALU's are fully utilized all the time, and the shader can have direct data dependency on every instruction and still run full rate. Xenos has a very large number of these independent threads ready to process, so there are always enough independent instructions to execute such that the ALU's are fully utilized. Each of these different threads can be executing a different shader, can be at different places within the same shader, can be pixels or vertices, etc.

http://www.beyond3d.com/articles/xenos/index.php?p=08
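A rough sketch of the latency-hiding idea in the quoted passage. It assumes every instruction depends on the one before it in the same thread, and that the array issues at most one instruction per cycle. The 4-cycle latency, the thread counts and the function name `utilisation` are made up for illustration; none of them are Xenos figures.

```python
def utilisation(num_threads, instrs_per_thread, latency):
    """Fraction of cycles in which the array issues an instruction.

    Every instruction depends on the previous one in the same thread, so a
    thread must wait `latency` cycles between issues (assumed model).
    """
    ready_at = [0] * num_threads              # cycle at which each thread may issue again
    remaining = [instrs_per_thread] * num_threads
    total = num_threads * instrs_per_thread
    issued = 0
    cycle = 0
    while issued < total:
        for t in range(num_threads):
            if remaining[t] > 0 and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + latency
                issued += 1
                break                          # at most one issue per cycle
        cycle += 1
    return issued / cycle


for n in (1, 2, 4, 8):
    print(n, "threads:", round(utilisation(n, instrs_per_thread=10, latency=4), 2))
```

With a single thread the array sits idle most of the time. Once there are at least as many ready threads as cycles of latency, it issues on every cycle even though each thread is fully dependent on its own previous instruction.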

nAo · Jul 15, 2005

aaaaa00 said:
Not just that, but the workload can shift dramatically just in a single render batch within a single frame.

Agreed

I think it would be a monumentally difficult task to load balance something like that in software alone.

Internal fifos should be enough to absorb most of these small, unpredictable, workload bubbles.

Erp said:
If the FIFO in the chip between the Vertex work and the pixel work is large enough it doesn't matter, but they are generally very small (10-20)

I think the post-transform vertex cache is that small, but I can't see why a post-transform vertex FIFO couldn't be bigger than that. IMHO a 20-vertex buffer is absolutely not enough to absorb even a small workload imbalance.
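A toy model of the FIFO argument, to show why depth matters when the vertex side is bursty. Everything here is an assumption made for illustration: the producer emits bursts from a repeating pattern and stalls while the FIFO is full, the consumer drains a fixed number of entries per cycle, and the burst pattern and depths are invented rather than taken from any real GPU.

```python
def consumer_busy_fraction(depth, bursts, drain_rate, cycles):
    """Share of the consumer's capacity that is actually used."""
    fifo = 0       # entries currently buffered
    pending = 0    # entries of the current burst not yet pushed
    step = 0       # index of the next burst in the pattern
    busy = 0
    for _ in range(cycles):
        if pending == 0:
            # The producer only moves on once its previous burst is fully pushed.
            pending = bursts[step % len(bursts)]
            step += 1
        pushed = min(pending, depth - fifo)   # the producer stalls when the FIFO is full
        fifo += pushed
        pending -= pushed
        popped = min(drain_rate, fifo)        # the consumer starves when the FIFO is empty
        fifo -= popped
        busy += popped
    return busy / (drain_rate * cycles)


bursts = [4, 0, 0, 0]   # average of one entry per cycle, but delivered in lumps
for depth in (1, 2, 4):
    print(depth, round(consumer_busy_fraction(depth, bursts, drain_rate=1, cycles=420), 3))
```

In this model the consumer is only kept fully busy once the FIFO can hold a whole burst. A shallower buffer makes the producer stall mid-burst, and that lost time later shows up as idle cycles downstream.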

Titanio · Jul 15, 2005

Thanks for all the replies!

The issue of load balancing was one I was trying to consider. I think effectively it relates mostly to how you'd balance between the CPU and VS for vertex shading?

I guess my second point is that when talking comparatively with Xenos, yes in this situation if mapping an arbitrary workload from Xenos to RSX+Cell in some situations there'll be some underutilisation. But I'm not really asking how it can address the utilisation issue, just how it can prop itself up in absolute terms on the vertex processing side in cases where Xenos can be dedicated more to vertices.

If this all seems reasonable, I guess perhaps the more significant question is actually how things look when you move away from "simply" accommodating the same absolute workload as Xenos in vertex-biased situations, and consider how PS3 games built for the system can take full advantage. Now this might seem controversial or presumptuous or even stupid (put down your pitchforks!) but I'll put it out there anyway: following this same line of logic, if this seems reasonable, then your game could aim at a setup where you can simultaneously do pretty much as much work on both vertices and pixels at the same time as all of Xenos can do on either of these workloads if dedicated fully to that task. Consider games, then, that aim for full utilisation of that kind of balance...it could have interesting implications.

BenQ raises a pertinent issue re. RSX pixel shading performance, since the theory relies on it being able to, there or thereabouts, keep up with Xenos if it was fully dedicated to pixel shading. But given the proportion of RSX's power that's invested in pixel shading, given the clock differences, and given the expected efficiency gain of a dedicated unit over a more general unit, it seems reasonable enough to think it could (more than?) keep up? On a side note, counting the number of units and comparing thusly doesn't seem so sound - an ALU in Xenos != a dedicated shader on paper.

Shifty Geezer · Jul 15, 2005

Regards Xenos' ALUs, I'm pretty darned certain each array is tied to either Vertex or Pixel work - they work as a group. But each array can have a number of threads to work on, so when one stalls, they work on another. And they can context switch between V and P work. In 5 cycles the workload could look something like

Code:

Array is currently working on

Array1      Array2     Array3

 V           P          P
 V           P          V
 P           P          P
 V           V          P
 P           P          P

In this example the ratio of P:V is 2:1 (10 pixel slots to 5 vertex slots). As long as there is no context switch overhead switching from P to V and vice versa (I think I asked this and no overhead was confirmed) then the ratio of P:V is totally arbitrary.
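For anyone who wants to check the tally, here is a quick count of the table above, treating the three columns as the three shader arrays and each row as one cycle. The variable names are just for illustration.

```python
from collections import Counter
from fractions import Fraction

# One string per cycle, one letter per shader array, copied from the table.
schedule = [
    "VPP",
    "VPV",
    "PPP",
    "VVP",
    "PPP",
]

counts = Counter("".join(schedule))
ratio = Fraction(counts["P"], counts["V"])
print("P slots:", counts["P"], "V slots:", counts["V"], "P:V =", ratio)
```

That gives 10 pixel slots and 5 vertex slots over the five cycles. Adding or removing rows shifts the average freely, even though each individual row can only split in thirds.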

nAo · Jul 15, 2005

Shifty Geezer said:
As long as there is no context switch overhead switching from P to V and vice versa (I think I asked this and no overhead was confirmed) then ratio of P:V is totally arbitrary.

I believe there is no overhead if the GPU can schedule the context switch far ahead..

Inane_Dork · Jul 15, 2005

Titanio said:
But I'm not really asking how it can address the utilisation issue, just how it can prop itself up in absolute terms on the vertex processing side in cases where Xenos can be dedicated more to vertices.

I swear I meant my post to be only a footnote.

As regards this, getting SPE aid for vertex shading should allow more variance in general workloads. I would imagine it would take educated guessing or performance testing to see what batches benefit from that.

Following this same line of logic, if this seems reasonable, then your game could aim at a setup where you can simultaneously do pretty much as much work on both vertices and pixels at the same time as all of Xenos can do on either of these workloads if dedicated fully to that task. Consider games, then, that aim for full utilisation of that kind of balance...it could have interesting implications.

I would think that would take up more than half your SPE pool and require you to stick with very simple pixel shaders. Still, it would be possible.

BenQ raises a pertinent issue re. RSX pixel shading performance, since the theory relies on it being able to, there or thereabouts, keep up with Xenos if it was fully dedicated to pixel shading. But given the proportion of RSX's power that's invested in pixel shading, given the clock differences, and given the expected efficiency gain of a dedicated unit over a more general unit, it seems reasonable enough to think it could (more than?) keep up?

They're definitely in the same ballpark, theoretically speaking. Hope I get this all right: Xenos lacks the free special instruction, but it can process 96 pixel math ops and 16 pixel texture ops per cycle. If the RSX is a 24 pipe variant of the G70, it can do 96 pixel math ops per cycle or 48 math ops and 24 pixel texture ops per cycle. Throw in some dependency issues, overall efficiency and clockspeed differences and it certainly seems close enough that I'd have to test various pixel shaders to see how fast they are on both systems.

With that in mind, I wouldn't be surprised to see a fair amount of vertex shading take place on the XeCPU. I also would be really interested in seeing a few titles that really try to push pixel-sized polygons all over the frame. Even if the pixel shaders there are really simple, they wouldn't have to be as complex as they are now if polygons are that small. Just one game where everything is smoothly curved, just one...

Shifty Geezer · Jul 15, 2005

Hmmm. Let's say Cell can step in to aid RSX when needed. How would it know? How would it be able to monitor the GPU shaders and say 'a-ha! the vertex pipes are snowed under. I'll load up a vertex shader APUlet and help out.'

nAo · Jul 15, 2005

Shifty Geezer said:
Hmmm. Let's say Cell can step in to aid RSX when needed. How would it know? How would it be able to monitor the GPU shaders and say 'a-ha! the vertex pipes are snowed under. I'll load up a vertex shader APUlet and help out.'

ehehe, obviously it can't work that way

What if your geometry/pixels ratio is quasi constant (at least on the background geometries) and you offload some pixel shading work to SPEs and VS?

Jawed · Jul 15, 2005

blakjedi said:
Couple of thoughts

1) Can there ever be a time when pixel work is evenly divided on Xenos? It seems that pixel versus vertex work is based on thirds at any given moment, so the P:V ratios can only be:

00:100
33:66
66:33
100:0

The shader arrays are time-sliced.

For example the entire workload for a frame might amount to 10% vertex versus 90% fragment shading. This is achieved simply by time-slicing. For every nine fragment operations there is one vertex operation, on average.

Obviously as frame rendering proceeds, the time-slicing will actually vary in bias, one way or the other.

I suspect the organisation of Xenos into 3 shader arrays is a compromise of granularity (e.g. number of fragments per triangle) versus the overhead of scheduling and arbitration hardware. If it was 6 arrays of 8 ALUs, the scheduling and arbitration hardware would increase in complexity, but Xenos would be able to handle smaller triangles with less chance of ALUs having no pixels to work on (not enough fragments running the current shader instruction on this triangle, or another triangle).

Jawed
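A small sketch of the time-slicing point: on any single cycle only whole arrays can be on vertex work (0, 1, 2 or 3 of them), yet the average over many cycles can land on any ratio. The scheduling rule below is a simple "pay off what is owed" scheme invented for illustration. It is not a description of how Xenos' arbiter actually decides.

```python
from fractions import Fraction


def vertex_arrays_per_cycle(vertex_share, cycles, arrays=3):
    """For each cycle, how many whole arrays run vertex work (hypothetical policy)."""
    plan = []
    owed = Fraction(0)                  # vertex array-slots owed but not yet scheduled
    for _ in range(cycles):
        owed += vertex_share * arrays
        n = min(arrays, int(owed))      # whole arrays only
        owed -= n
        plan.append(n)
    return plan


plan = vertex_arrays_per_cycle(Fraction(1, 10), cycles=100)
average_share = Fraction(sum(plan), 3 * len(plan))
print("per-cycle splits used:", sorted(set(plan)))
print("average vertex share:", average_share)
```

With a 10% vertex target, most cycles are all-pixel and roughly every third or fourth cycle hands one array to vertex work, which averages out to the target over the window.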

Carl B · Jul 15, 2005

Tap In said:
If we just consider a single one of the arrays for the time being - with 16 ALU's available this means that on every cycle it is processing a maximum of either 16 vertices or four 2x2 pixel quads. However, as there is no pipelining from one set of ALU's to the next, the ALU array will need to first process the first shader instruction, then go back and process the second shader instruction. For cases where there is a direct data dependency (i.e. the first instruction says A + B = C and the resultant value for C is used in the next instruction), there must be some way of making sure that C is available in time for the second instruction to execute.

When the ALU's move from one instruction to the next, there is an inherent latency (this is the amount of pipeline clocks it takes to execute the first instruction). The Xenos shader contains a large number of independent groups of pixels and vertices (threads) which are 16 wide. In order to hide the latency of an instruction for a given thread, a number of other threads are used to "fill in the gaps". By doing this, the ALU's are fully utilized all the time, and the shader can have direct data dependency on every instruction and still run full rate. Xenos has a very large number of these independent threads ready to process, so there are always enough independent instructions to execute such that the ALU's are fully utilized. Each of these different threads can be executing a different shader, can be at different places within the same shader, can be pixels or vertices, etc.

http://www.beyond3d.com/articles/xenos/index.php?p=08

Believe me I read that too.

Where I'm coming from is the angle that at any given moment in time all the ALU's in a given array must be working on either vertex or pixel work - as in though the work itself might be dynamic, within a given array you can't have 5 ALU's working on vertex and 11 ALU's working on pixel at the same moment in time.

Or is this wrong?

Jawed · Jul 15, 2005

BenQ said:
It would be really nice to have solid numbers, such as "Standard shaders are 70% efficient, while Unified shaders are 95% efficient"...... but I don't think it's quite that simple.

ATI asserts something like 50-70% efficiency for current GPUs (prolly referring to ATI GPUs) and 95% for Xenos - as far as I can remember.

Sure, the proof of the pudding is in the eating.

Jawed

Jawed · Jul 15, 2005

nAo said:
I think the post-transform vertex cache is that small, but I can't see why a post-transform vertex FIFO couldn't be bigger than that. IMHO a 20-vertex buffer is absolutely not enough to absorb even a small workload imbalance.

http://www.digit-life.com/articles2/video/g70-part2.html

The Post-VS cache (buffer, frankly, not cache) has grown from 24 vertices in 6800Ultra to 45 in 7800GTX - presumably RSX will be the same.

Jawed

j^aws · Jul 15, 2005

Inane_Dork said:
...

BenQ raises a pertinent issue re. RSX pixel shading performance, since the theory relies on it being able to, there or thereabouts, keep up with Xenos if it was fully dedicated to pixel shading. But given the proportion of RSX's power that's invested in pixel shading, given the clock differences, and given the expected efficiency gain of a dedicated unit over a more general unit, it seems reasonable enough to think it could (more than?) keep up?

They're definitely in the same ballpark, theoretically speaking. Hope I get this all right: Xenos lacks the free special instruction, but it can process 96 pixel math ops and 16 pixel texture ops per cycle. If the RSX is a 24 pipe variant of the G70, it can do 96 pixel math ops per cycle or 48 math ops and 24 pixel texture ops per cycle. Throw in some dependency issues, overall efficiency and clockspeed differences and it certainly seems close enough that I'd have to test various pixel shaders to see how fast they are on both systems.
...

I'd agree with the same ballpark but your numbers 'seem' off for G70/RSX (I presume you've excluded 'norm' instruction in the 96 figure...)

If we only look at fragment pipes for G70/RSX and the equivalent 'peak' for Xenos, i.e. 3 SIMD engines (48 ALUs) and 16 filtered texture units,

Xenos (peak fragments only),

- 96 instructions/cycle + 16 texture inst./cycle ~ 112 inst./cycle
~ 112 * 0.5 GHz ~ 56 Billion inst./sec

Or

48 Billion shader inst./sec + 8 Billion filtered texture inst./sec

RSX (fragment pipes only),

- 120 instructions/cycle (includes 24 filtered texture inst./cycle)
~120 * 0.55 GHz ~ 66 Billion inst./sec

Or

53 Billion shader inst./sec + 13 Billion filtered texture inst./sec

Key difference with Xenos ~ 56 Ginst./sec is that 8 Ginst./sec are always filtered texture instructions but for RSX ~ 66 Ginst./sec, the instructions are scheduled differently...
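The arithmetic above, spelled out so the inputs are easy to change. The per-cycle counts and clocks are the ones used in this post (a 24-pipe G70-style RSX at 550 MHz is the post's assumption, not a confirmed spec), and these are peak paper figures only.

```python
def peak_ginst_per_sec(math_per_cycle, tex_per_cycle, clock_ghz):
    """Return (math, texture, total) peak rates in billions of instructions per second."""
    math = math_per_cycle * clock_ghz
    tex = tex_per_cycle * clock_ghz
    return math, tex, math + tex


# Xenos: 96 shader inst./cycle + 16 filtered texture inst./cycle at 0.5 GHz
xenos = peak_ginst_per_sec(96, 16, 0.50)
# RSX: 120 inst./cycle, of which 24 are filtered texture inst./cycle, at 0.55 GHz
rsx = peak_ginst_per_sec(120 - 24, 24, 0.55)

for name, (math, tex, total) in (("Xenos", xenos), ("RSX", rsx)):
    print(f"{name}: {math:.1f} math + {tex:.1f} texture = {total:.1f} Ginst./sec")
```

Swapping in a different clock or pipe count shows how sensitive the comparison is to those guesses.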

Each G70/RSX fragment pipeline can issue 5 inst./cycle, can anyone give a breakdown? Here's my guess,

5 maths
4 maths + 1 texture
4 maths + 1 norm
3 maths + 1 texture + 1 norm

Would that be correct? IIRC, the vec4 ALUs also work on texture data with the texture ALUs?

EDIT: added an extra option...

Jawed · Jul 15, 2005

nAo said:
Shifty Geezer said:

As long as there is no context switch overhead switching from P to V and vice versa (I think I asked this and no overhead was confirmed) then ratio of P:V is totally arbitrary.

I believe there is no overhead if the GPU can schedule the context switch far ahead..

Which is trivial when you know exactly how long it takes to execute every instruction.

As Aaron Spink said a while back, this is not rocket science:

http://www.beyond3d.com/forum/viewtopic.php?p=549040#549040

Still, some people around here like to carry on under the assumption that Xenos's architecture is impossible to run with near 100% efficiency.

Jawed

Jawed · Jul 15, 2005

Shifty Geezer said:
Hmmm. Let's say Cell can step in to aid RSX when needed. How would it know? How would it be able to monitor the GPU shaders and say 'a-ha! the vertex pipes are snowed under. I'll load up a vertex shader APUlet and help out.'

NVidia's drivers for their PC GPUs share vertex workload between the CPU and GPU.

Frankly I don't know how they do this and if the CPU undertakes anything particularly meaningful in terms of actual vertex shader execution.

But the principle is there.

It seems to me that a game dev would put the burden of the vertex workload onto Cell (if they designed an ultra-high poly game) and use RSX to help smooth things out - rather than vice versa.

Cell seems to be significantly more capable than RSX at actual vertex shading. RSX could end-up being "the icing on the cake".

Jawed

Jawed · Jul 15, 2005

xbdestroya said:
Where I'm coming from is the angle that at any given moment in time all the ALU's in a given array must be working on either vertex or pixel work - as in though the work itself might be dynamic, within a given array you can't have 5 ALU's working on vertex and 11 ALU's working on pixel at the same moment in time.

Or is this wrong?

Nope. A shader array is defined simply by the fact that it is an array of 16 execution units, each of which shares a single program counter. It's a single-instruction multiple-data array.

In a conventional GPU the array would consist of 4 units in a "quad". In Xenos, it's 16. It's simply larger grained.

Every unit is performing precisely the same instruction. Effectively there are 16 threads all being processed, one instruction at a time.

On the next tick of the clock, a completely different set of 16 threads are processed. Maybe with the same instruction (e.g. another 16 fragments). Equally, it could be a different instruction in different threads, e.g. for 16 vertices.

Jawed
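A minimal sketch of that description: one array of 16 lanes, one instruction applied to all 16 lanes per tick, and a different batch (vertex or pixel) free to take the next tick. The `Batch` class, the lambda "instructions" and the data values are invented for illustration and are not Xenos' real ISA.

```python
from dataclasses import dataclass

LANES = 16  # execution units in one shader array


@dataclass
class Batch:
    kind: str        # "V" for vertices, "P" for pixels
    program: list    # list of one-argument functions standing in for instructions
    data: list       # one value per lane
    pc: int = 0      # a single program counter shared by all 16 lanes


def tick(batch):
    """Run the batch's current instruction on every lane, then advance the shared pc."""
    op = batch.program[batch.pc]
    batch.data = [op(x) for x in batch.data]
    batch.pc += 1


vertices = Batch("V", [lambda v: v * 2.0, lambda v: v + 1.0], [float(i) for i in range(LANES)])
pixels = Batch("P", [lambda p: p * 0.5], [1.0] * LANES)

trace = []
for batch in (vertices, pixels, vertices):   # three consecutive ticks
    tick(batch)
    trace.append(batch.kind)

print(trace)
```

Within a tick all 16 lanes run the same instruction on the same kind of work. The mix of vertex and pixel work only appears from one tick to the next.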
