Next gen graphics and Vista.

Jawed said:
And requires its own special pass as far as I can tell, after the rest of the frame has been rendered. Presumably the driver could handle that, instructing the GPU to render strictly in order.
No, you can blend any time with any objects.

Edit:
And besides, Bob showed an even better situation, where either EQUAL, LEQUAL or GEQUAL functions are used for the z test.
 
Bob said:
Off the top of my head:
- How do you maintain DRAM coherent access streams?

I see no relation. You are still accessing the same streams and more or less in the same order (or lack of order).

Bob said:
- If you allow for batches to execute out of order, you need to buffer up the results: Although you can run shaders out of order, you can't run ROP out of order. How big is the buffering? How efficient is it?

A reorder buffer may not require that many resources if you store the result in the shader until commit. It's just a FIFO with a couple of pointers. And in any case, if you are talking about fragments there is a single output attribute in the general case. If you are using tiles (as ATI seems to be doing), the order between the fragments in a tile or between fragments from different tiles doesn't matter. Only the order of fragments from different triangles for the same framebuffer tile matters (and that is only critical for alpha blending).
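The FIFO-with-a-couple-of-pointers idea can be sketched in a few lines (toy Python, nothing like real hardware; all names are made up):

```python
from collections import deque

class ReorderFIFO:
    """Toy reorder buffer: slots are allocated in issue order, shader
    results arrive in any order, and commits to the ROP happen strictly
    from the head -- i.e. in issue order."""

    def __init__(self):
        self.fifo = deque()              # entries: [seq, result-or-None]

    def allocate(self, seq):
        self.fifo.append([seq, None])    # reserve a slot at issue time

    def complete(self, seq, result):
        for entry in self.fifo:          # shader finished, maybe out of order
            if entry[0] == seq:
                entry[1] = result
                break

    def commit(self):
        retired = []                     # retire only ready entries at the head
        while self.fifo and self.fifo[0][1] is not None:
            retired.append(self.fifo.popleft())
        return retired
```

A fragment that finishes early just sits in its slot until everything older than it has committed.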

Bob said:
- How do you resolve resource starvation?
- How do you guarantee forward progress?

As you always do. Try to avoid deadlocks, assign all the resources that a shader input requires early so it can always complete execution, and give priority to older inputs and to inputs that generate new inputs.

Bob said:
- If you have a shader that can generate triangles, how do you guarantee that rasterization/early-Z happens in order of generation, across the whole chip?
A reorder FIFO before the next stage? Or, if execution is never out of order, assign an order to each shader unit that is generating triangles. In any case it looks more like a specification problem (what order DX10 or OpenGL will require for generated primitives) than a hardware problem.

Bob said:
- How do you build efficient register files, when you need fine-grained allocation?
- Is resource allocation done in hardware or in software? Either case, how is the allocation done? Is there a cost for reallocation, and if so, what is it?

CPUs don't seem to have any problem with this. They run at way higher frequencies and allocate on a single-register basis ... Resource allocation could be performed statically or dynamically, and I don't know if one is better than the other. The code generator can calculate how many live registers a shader program requires at any point of its execution and pass that information to the shader controller.
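That compiler-side count is a standard liveness scan; here is a toy backwards pass over a straight-line shader (the instruction encoding is hypothetical):

```python
def max_live_registers(instructions):
    """Toy liveness scan for a straight-line shader. Each instruction
    is (dest, srcs); walking backwards, a register is live from its
    last use back to its definition. The peak live-set size is the
    register budget to report to the shader controller."""
    live, peak = set(), 0
    for dest, srcs in reversed(instructions):
        live.discard(dest)       # defined here: dead above this point
        live.update(srcs)        # used here: live above this point
        peak = max(peak, len(live))
    return peak

# r0 and r1 are both live at entry, so the peak is 2 registers.
shader = [("r2", ("r0", "r1")),
          ("r3", ("r2", "r0")),
          ("out", ("r3",))]
```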

Bob said:
- Do you keep triangles generated by one shader pipe totally inside that shader pipe (ie: no transmission of attributes), or do you add large busses to funnel through attributes? Either way has its own set of issues.
- How do you share vertex work between shader pipes? Do you need vertex batches to be complete primitives?

But there isn't a single pipe here. There are pipes that fetch streams from memory, pipes that read and write from and to the framebuffer, pipes that read texture data, pipes that generate or clip triangles, and pipes that shade different inputs (vertices, fragments or whatever). Of course this requires some kind of interconnection network; how complex it would be depends on what you want, how much it costs and how much you can pay. All this already exists in current GPUs, implemented with more or less flexibility.
 
Bob said:
Texture isn't the only client to memory, especially not in DX10.
Well you got me. What other clients are you thinking of, that are specific to a unified architecture, as opposed merely to DX10?

Consider this example: The application clears Z to 1.0f and sets the depth compare mode to LESS. It then draws a green triangle at z = 0.5f followed by a perfectly overlapping red triangle at the same Z value.

The final output *must* be green. If you process either pixels or triangles in a different order than they were issued by the application, then you'll start seeing red.

You can't always predict latency from texture, so you can't predict how long the green triangle may be delayed with respect to the red triangle. So although you can run the PS for the two triangles in arbitrary order, you do need to resolve the Z test in ROP in the correct order.
That's a good argument.

I guess we'll have to ask ATI how they solved it :!:

Does DX actually require that the triangle should be green? It looks to me like a classic example of a trap that programmers fall into, where they assume that submission order is rendering order. But I'm out of my depth here...
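For what it's worth, Bob's green/red example is easy to play with as a toy in-order depth test (one pixel, made-up helper names):

```python
def render(draws, depth_func):
    """Toy one-pixel framebuffer: draws are processed strictly in
    submission order, as the API requires."""
    color, depth = None, 1.0             # cleared to Z = 1.0
    for draw_color, z in draws:
        if depth_func(z, depth):         # depth test against stored Z
            color, depth = draw_color, z
    return color

less = lambda z, stored: z < stored      # D3D 'LESS' compare

# Green at z = 0.5, then red at the same z: red fails LESS, so the
# pixel stays green -- but only because submission order was preserved.
result = render([("green", 0.5), ("red", 0.5)], less)
```

Swap the two draws and the result flips, which is exactly why ROP order matters.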

Oldest-first is not always the optimal scheduling policy. You can severely thrash your texture cache this way.

You can also deadlock this way: If the PS runs really slow, your oldest thread is now a vertex thread. But vertices are blocked by the raster unit because there are no free resources to run PS on. So you can't just pick the oldest thread. You now need to walk the thread list and find a thread that can run. This can be a rather involved and expensive process.
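The "walk the thread list" fallback might be sketched like this (toy code; `can_run` stands in for whatever resource check the hardware actually does):

```python
from collections import namedtuple

Thread = namedtuple("Thread", "age kind")

def pick_thread(threads, can_run):
    """Oldest-first with a fallback walk: prefer the oldest runnable
    thread, but skip blocked ones (e.g. a vertex thread stalled because
    no PS resources are free downstream) to avoid deadlock."""
    for t in sorted(threads, key=lambda t: t.age, reverse=True):
        if can_run(t):
            return t
    return None  # everything is blocked: stall this cycle

# The oldest thread is a blocked vertex thread; the walk skips it.
threads = [Thread(age=7, kind="vertex"), Thread(age=3, kind="pixel")]
runnable = lambda t: t.kind == "pixel"   # raster unit is full of vertices
```

The expensive part in hardware is that the walk is a priority search over many threads every cycle, not a simple head-of-queue pop.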
Which ATI is keeping secret as far as I can tell. It's simply alluded to in the patent.

But, anyway, since the post vertex cache is of a fixed size, the scheduler knows there's no point finishing yet more vertex batches. So PS automatically gets priority. And with deterministic execution times in the ALU pipes, the scheduler can see the stall coming.

If all PS is extremely texture-heavy, then of course the GPU will reach a stage where it becomes texture-bandwidth bound. Obviously you don't want to increase the problem by texture cache thrashing, so the scheduler needs to take account of shader state too (i.e. which batches are on the same triangle) and schedule them contiguously as each batch gets its texture results.

You don't only need to single-thread the setup engine. You need to ensure that the setup engine's queue is filled up in order. This means a lot more buffering at the output of the shader, if you want to be able to still do useful work while the rasterizer is busy (like, for example, pixel shader work to unblock the rasterizer).
Well I don't think buffer space is a particularly harsh constraint if you're talking about 20-50 (or so :?: ) vertices at a few hundred bytes per vertex.

Presumably, also, vertices and triangles are sequentially ID'd, in order to make predication work under DX10, so ordering isn't a particularly difficult nut to crack.

It's not a question of latency (although that does play a big role), but of bandwidth. If you want to just do MADs, you need 128-bit * 3 reads / clock from arbitrary registers. RF banking can help, if you limit the number of threads you run.
But in a multi-pipe conventional GPU you have the same problem. Every clock you're loading/saving new registers, because on every clock you're shading different pixels.

So it's just a bigger version of an existing problem.

Consider a GPU with 2 unified shader pipes that can run either VS or PS threads. If you transform a triangle on one shader pipe, you don't want to pay the cost of transmitting all that data to the other shader pipe if you can help it. Instead, you want to keep everything on the same shader pipe, and run PS threads for that triangle. That way, you don't need to shuffle vertex attributes throughout the whole chip.

The other shader unit can, in the mean time, work on some other triangle.

You may want to transmit big triangles between shader pipes though, so you don't starve the other pipes, but that can be rather expensive.
Xenos, by contrast, is always running different batches on consecutive clocks. It is continually swapping batches into and out-of context. So in Xenos it's immaterial "which pipe did the triangle".

And, anyway, even current GPUs only run one instruction per fragment, before switching that fragment out of context and switching another fragment (from the same batch) into context. So the register-swap-frenzy is nothing new.

500 M tris/sec * 3 vertices * 32-bit float * 64 scalar attributes == 384 GB/sec of internal bandwidth. If you say you only want to run with 8 scalar attributes at full speed (ie: position and some texcoords or colors), then your internal bandwidth is "only" 48 GB/sec. If you can somehow optimize this down to one vertex / triangle on average (which, btw, has a whole other set of issues), you're down to a more manageable 16 GB/sec. And you haven't even done any texturing yet!
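Bob's arithmetic, reproduced as a quick script so the steps are explicit:

```python
# Bob's attribute-traffic arithmetic, step by step (his stated constants).
TRIS_PER_SEC = 500e6     # 500 M triangles/sec
FLOAT_BYTES  = 4         # one 32-bit scalar attribute

def attr_bandwidth_gb(verts_per_tri, scalar_attrs):
    """GB/sec of vertex-attribute traffic shuffled between shader pipes."""
    return TRIS_PER_SEC * verts_per_tri * scalar_attrs * FLOAT_BYTES / 1e9

full    = attr_bandwidth_gb(3, 64)   # 384.0 GB/s: all attributes, 3 verts/tri
reduced = attr_bandwidth_gb(3, 8)    # 48.0 GB/s: position + a few texcoords
shared  = attr_bandwidth_gb(1, 8)    # 16.0 GB/s: one vertex per triangle
```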
Maybe it's time for that nicely detailed Xenos diagram:

[Attached image: Xenos block diagram (012l.jpg)]


The way I see it you're talking about a dedicated point-to-point link here. No other data travels down this link. A 256-bit link running at twice the clock is going to be in the right ball-park.

Don't current GPUs already have a link to do this kind of work? If not, why is this unique to a unified architecture?

Jawed
 
Chalnoth said:
No, you can blend any time with any objects.

Edit:
And besides, Bob showed an even better situation, where either EQUAL, LEQUAL or GEQUAL functions are used for the z test.

I'm not sure exactly what you both mean, but you can't change the z test in the middle of a batch of primitives (off topic: a different kind of batch than the one Jawed and others use to name the unit of work assigned to the shaders; my PhD advisor complains that it's too confusing to use the same name for both, and I think I'm starting to agree).

I don't see GPUs mixing different primitive batches and state changes. So that case is impossible. All the fragments generated from the triangles of the batch with the z function set to 'EQUAL' will be processed before any of the fragments generated from the triangles of the next batch with the z function set to 'LESS'. Primitive batches are expected to be large in terms of triangles, generated fragments and rendering cycles (tens of thousands of cycles).
 
Being a D3D programmer and an MVP this has been a very interesting thread :)

Might give some of these points some thought on the flight over to say hello to the DX team in a couple of weeks.

Jawed said:
Bob said:
Texture isn't the only client to memory, especially not in DX10.
Well you got me. What other clients are you thinking of, that are specific to a unified architecture, as opposed merely to DX10?
There's always been more than textures as clients to VRAM, it just so happens that (most of the time) textures use a lot more resources than others (e.g. shaders and IB/VB's).

Jawed said:
Does DX actually require that the triangle should be green? It looks to me like a classic example of a trap that programmers fall into, where they assume that submission order is rendering order. But I'm out of my depth here...
Yes, it requires that it is green. The REFRAST should confirm this.

The execution order is sequential - if it were concurrent and hence non-deterministic you'd be using a VERY strange API that'd make my job a whole lot more difficult ;)

The green geometry/triangles should be completed before the GPU moves on to the next entry in the queue, thus they should be fully written to the buffer before the red triangle(s) come along.

Although there is a small quirk to be very wary of - one that causes a lot of problems for some people ;)

If two pieces of geometry are submitted with identical Z depths you can easily get into the realm of floating-point consistency (and D3D tends to force the FPU into performance mode rather than accuracy mode). The net result, if you're familiar with it, is "flimmering" or "Z-fighting" - so you might get some very odd results from coplanar polygons :)
 
Thanks - it's great to get solid facts.

So, I wonder how Xenos solves this then, as it clearly executes batches out of order. I can't see it delaying the completion of one batch just because an earlier batch is still running...

Sadly I can't bump your reputation, because I've already bumped it up recently... Now, who else can I bump up...

Jawed
 
I see no relation. You are still accessing the same streams and more or less in the same order (or lack of order).
NV4x solves this by design: all shader pipes run in lock-step (for a certain definition of lock-stepping). So you get nice long predictable coherent data streams. If you can schedule arbitrary threads at arbitrary times, you lose that. It gets worse though, because the machine can trip on itself: One shader pipe gets a cache miss and is delayed from the others. In turn, its delay causes even less memory coherency because it's now out of step with the others, and may cause other shader pipes to be delayed, which exacerbates the problem.

You can imagine any number of solutions to this problem, but none of them are pretty.

Only the order of fragments from different triangles for the same framebuffer tile matters (and that is only critical for alpha blending).
Totally agreed (except the alpha blend part - it affects pretty much all rendering). You still need to reorder your triangles though, across multiple (arbitrary) shader pipes, with arbitrary delay between vertices of the same primitive. Buffering helps, but it's extra buffering that isn't needed on traditional GPUs...

But in a multi-pipe conventional GPU you have the same problem. Every clock you're loading/saving new registers, because on every clock you're shading different pixels.
You can predict what the next couple of threads are going to access, on traditional systems. So it's not really an issue. If you have an instruction like MAD R0, R1, R2, R3; a whole bunch of threads are going to access their R1's one after the other. You can then build your register file such that you have all the R0 registers for one batch in one bank, then all the R1 registers for that batch in another bank, and so on. So you never have bank conflicts and your access pattern is very predictable.

However, if you can schedule arbitrary threads at arbitrary times, you can't do that anymore. Especially not if different threads need different bank configuration.
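The per-batch banking trick Bob describes can be illustrated with a trivial mapping (toy model; the 4-bank count is an assumption):

```python
NUM_BANKS = 4

def bank_of(reg_index):
    """Toy per-batch banking: register Rn of every thread in a batch
    lives in bank n mod NUM_BANKS. For 'MAD R0, R1, R2, R3' the
    lock-step threads read banks 1, 2 and 3 and write bank 0, so no
    two operands of the instruction ever collide in one bank."""
    return reg_index % NUM_BANKS

mad_operands = (0, 1, 2, 3)                # R0 (dst), R1, R2, R3 (srcs)
banks = [bank_of(r) for r in mad_operands]
conflict_free = len(set(banks)) == len(banks)
```

The scheme only works because all threads in a batch execute the same instruction at once; let arbitrary threads issue arbitrary instructions and the static mapping no longer guarantees conflict-free access.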

CPUs don't seem to have any problem with this. They run at way higher frequencies and allocate on a single-register basis ...
CPUs don't run thousands of active threads concurrently, with some expectation of performance.


Anyway, all this to say, unified shaders (with both features and performance) isn't as trivial as some of you make it out to be.
 
RoOoBo said:
I don't see GPUs mixing different primitive batches and state changes. So that case is impossible. All the fragments generated from the triangles of the batch with the z function set to 'EQUAL' will be processed before any of the fragments generated from the triangles of the next batch with the z function set to 'LESS'. Primitive batches are expected to be large in terms of triangles, generated fragments and rendering cycles (tens of thousands of cycles).
Er, that's not what I was trying to say. I was making reference to Bob's situation where, whenever you have a Z test equation that includes EQUAL, the order in which primitives are rendered will make a difference.
 
Bob said:
Anyway, all this to say, unified shaders (with both features and performance) isn't as trivial as some of you make it out to be.
Well, I think that a design that only worries about being unified, and doesn't worry about dynamic branching could be quite simple indeed. Pretty much all of the input/output concerns that you discussed would be handled automatically. All that you'd need is a design that runs on one batch (draw call) at a time, and does it in this way:

1. Start processing vertices until the pixel shader input buffer is full.
2. Freeze the vertex cache and switch to processing pixels until buffer is empty.
3. When the pixel shader input buffer is empty, freeze the texture caches and start processing vertices.

This system has the very nice feature that all of the things that IHV's have used in the past to ensure good performance would still work just fine. There may, however, be some performance hit in switching the pipelines between vertex/pixel work (though this could seemingly be avoided quite nicely just by allowing the vertex/pixel work to both be executed at different stages of the pipeline at the same time).
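Chalnoth's three steps, as a toy loop (all the helper functions - `shade_vertex`, `shade_pixel`, `rasterize` - are hypothetical stand-ins, and the buffer is just a list):

```python
def run_draw_call(vertices, buffer_capacity, shade_vertex, shade_pixel, rasterize):
    """Ping-pong between vertex and pixel work for one draw call:
    fill the PS input buffer with rasterized fragments, then drain it,
    alternating until all vertices are consumed."""
    ps_inputs, out = [], []
    verts = iter(vertices)
    done = False
    while not done:
        # Phase 1: vertex work until the PS input buffer is full.
        while len(ps_inputs) < buffer_capacity:
            v = next(verts, None)
            if v is None:
                done = True
                break
            ps_inputs.extend(rasterize(shade_vertex(v)))
        # Phase 2: pixel work until the buffer is empty.
        while ps_inputs:
            out.append(shade_pixel(ps_inputs.pop(0)))
    return out
```

The last sentence of the paragraph above corresponds to overlapping the two phases instead of strictly alternating them.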
 
Er, that's not what I was trying to say. I was making reference to Bob's situation where, whenever you have a Z test equation that includes EQUAL, the order in which primitives are rendered will make a difference.
I was careful to pick a test that does not contain EQUAL: I picked LESS. The ordering issue is true regardless of the test you choose (except maybe NEVER, but that's not really an interesting case ;)

Well, I think that a design that only worries about being unified, and doesn't worry about dynamic branching could be quite simple indeed. Pretty much all of the input/output concerns that you discussed would be handled automatically. All that you'd need is a design that runs on one batch (draw call) at a time, and does it in this way:
Sounds like big pieces from a TBDR, if you ask me, and why it's probably a little easier for the SGX to do it.
 
Bob said:
NV4x solves this by design: all shader pipes run in lock-step (for a certain definition of lock-stepping). So you get nice long predictable coherent data streams. If you can schedule arbitrary threads at arbitrary times, you lose that. It gets worse though, because the machine can trip on itself: One shader pipe gets a cache miss and is delayed from the others. In turn, its delay causes even less memory coherency because it's now out of step with the others, and may cause other shader pipes to be delayed, which exacerbates the problem.

You can imagine any number of solutions to this problem, but none of them are pretty.
Xenos's solution is 16 pipes per array all running in lock-step. Three arrays in total. And one array of 16 TMU pipes.

So at any one time, one batch is dependent on memory latency (i.e. the batch that's in the TMU).

You can predict what the next couple of threads are going to access, on traditional systems. So it's not really an issue. If you have an instruction like MAD R0, R1, R2, R3; a whole bunch of threads are going to access their R1's one after the other. You can then build your register file such that you have all the R0 registers for one batch in one bank, then all the R1 registers for that batch in another bank, and so on. So you never have bank conflicts and your access pattern is very predictable.

However, if you can schedule arbitrary threads at arbitrary times, you can't do that anymore. Especially not if different threads need different bank configuration.
Good info. Does this mean that the register file looks more like an n-way cache (obviously it's not a cache per se), with very short, say 16-byte, lines?

I still think this is a matter of hiding latency in a longer pipeline. It's interesting that Cell and Xenon CPUs can fetch 3x 128-bit FP registers in 2 or 3 cycles, at 3.2GHz, each fetch being from a 128-register file, and in Xenon's case, the register file supports two hardware contexts.

Anyway, all this to say, unified shaders (with both features and performance) isn't as trivial as some of you make it out to be.
Sure we don't actually know how well Xenos works, but clearly it's built upon a unified architecture and an ambitious scheduler.

We've even got die photos:

[Attached image: Xenos die photo (MSGPU700.jpg)]


To help us speculate wildly on how much of the die is taken up by the scheduler :D

Jawed
 
sorry to sound dumb but when seeing things like:

R520 is "16-1-1-1," R580 "16-1-3-1," RV515 "4-1-1-1," RV530 "4-1-3-2."

What exactly do these numbers mean? I know the first one is pipelines, but the rest?
 
That's been a subject of speculation for quite some time now. Some seem to think that the "3" in R580 is the number of shader ops/clock/pipe it can do, others, the number of texture reads. Me, I think either would take up too many transistors and would be counter to where ATI seems to be going in technology.
 
Jawed said:
Well you got me. What other clients are you thinking of, that are specific to a unified architecture, as opposed merely to DX10?
Look at the Xenos diagram you posted and look at the lines going to and from memory. Although these aren't necessarily specific to a unified architecture.
 
Chalnoth said:
That's been a subject of speculation for quite some time now. Some seem to think that the "3" in R580 is the number of shader ops/clock/pipe it can do, others, the number of texture reads. Me, I think either would take up too many transistors and would be counter to where ATI seems to be going in technology.

I can see 48 fragment shaders on G70 with 302M transistors.
 
I know this is quite off topic, but:

I was wondering if anyone knows the approximate fill rate of Xenos, both when drawing normal textured polygons and when doing z-only or stencil-only writes?

I'm very interested in its ability to render things like shadow maps and particularly stencil shadows... Combined with the (what I'd expect) phenomenal stencil-only performance and the ability to write to memory, it would be an absolute beast at generating and rendering stencil shadows.

I was surprised, for example, to see Saints Row using full-scene stencil shadows on an entire city (from the sun), at 720p with 4xAA; that is an awful lot of fillrate for 60fps+.

I know it is a damn fast piece of kit after seeing it at SIGGRAPH (ironically, it was too fast), but does anyone know numbers?
 
3dcgi said:
Look at the Xenos diagram you posted and look at the lines going to and from memory. Although these aren't necessarily specific to a unified architecture.
Well that's my problem - Bob's original list was supposedly about "gotchas of a unified architecture" - not "gotchas of a DX10 GPU".

The bandwidth/latency problems need solving in both, not just a unified architecture.

Jawed
 