Next gen graphics and Vista.

Jawed said:
Chalnoth, you seem to be wilfully ignoring the scheduler and the fact that it increases utilisation of ALL units, TMUs and ALUs, and minimises expensive pipeline stalls.
Actually, I believe I specifically mentioned the increased utilization of execution units.

You also seem to be wilfully ignoring the fact that out of order scheduling allows the GPU to work on smaller batches, which brings greater algorithmic freedom - it means that dynamic flow control is a viable programming technique, unconstrained by multi-thousand pixel batches.
I'm not sure that's true. I think it's more the ability of the architecture to store more than one batch at a time that allows this. While that ability is a prerequisite of a unified design, it doesn't necessarily lead to cheap dynamic flow control, nor do you need a unified design to build a core that keeps multiple batches in flight and so gets cheap dynamic flow control.
 
I'm sure there is extra control logic in there, but if it were all that significant, how would that reconcile with putting it in devices bound for mobile platforms?
I'm assuming you're referring to the PowerVR SGX. Considering that:
- No one has produced the PowerVR SGX yet.
- We don't know how big it really is (claim is 2 to 8 mm^2 for the core)
- We don't know how power-hungry it really is.
- We don't know how fast it really is.
- We don't know if it can be fabbed.
- SGX is not DX10.
- SGX is a TBDR

I'd say that the conclusion that "unified (for DX10, on IMRs) is good because you can make a mobile (DX9, TBDR) part out of it" seems rather premature.

Like TBDR, unified shaders are nice in theory, but you end up with very nasty implementation issues that need to be resolved.
 
Chalnoth said:
Actually, I believe I specifically mentioned the increased utilization of execution units.
And you said that it isn't certain. Well, I disagree. The entire point of the scheduler is to increase utilisation.

Otherwise you might just as well do a blind time-slice on vertex and pixel batches (e.g. in a 1:3 ratio). That isn't scheduling.

You also seem to be wilfully ignoring the fact that out of order scheduling allows the GPU to work on smaller batches, which brings greater algorithmic freedom - it means that dynamic flow control is a viable programming technique, unconstrained by multi-thousand pixel batches.
I'm not sure that's true. I think it's more the ability of the architecture to store more than one batch at a time that allows this. While that ability is a prerequisite of a unified design, it doesn't necessarily lead to cheap dynamic flow control, nor do you need a unified design to build a core that keeps multiple batches in flight and so gets cheap dynamic flow control.
No, but small batches are the key to cheap dynamic flow control. And out of order scheduling of those batches is also important.

If your batches are small, then the batch count goes up. As the batch count increases, you get a more fine-grained selection of shader states amongst which to schedule your ALU and TMU tasks.

At the same time, with smaller batches there's a lower penalty per vertex or fragment when only one object in the batch runs the worst-case path through the program.

All of which maximises overall utilisation. As I said earlier, you can't have a unified architecture without an advanced (multi-batch per SIMD unit) scheduler (unless you consider blind time-slicing viable, which I don't) - and an out of order scheduler seems the best fit.
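To put rough numbers on the divergence point, here's a toy model: every fragment in a SIMD batch pays for the longest path any fragment in that batch takes. The cycle counts, fragment count and worst-case fraction are all made up for illustration.

Code:
# Toy model of branch divergence in SIMD batches: every fragment in a batch
# pays for the longest path taken by any fragment in that batch.
# Cycle counts and fragment counts are invented, purely illustrative.

CHEAP, WORST = 4, 40            # cycles per fragment on each path
FRAGMENTS = 65536
WORST_FRACTION = 0.01           # 1% of fragments take the expensive path

def total_work(batch_size):
    batches = FRAGMENTS // batch_size
    worst_fragments = int(FRAGMENTS * WORST_FRACTION)
    # Assume the expensive fragments are scattered one per batch where possible.
    dirty = min(batches, worst_fragments)
    clean = batches - dirty
    return dirty * batch_size * WORST + clean * batch_size * CHEAP

for size in (4096, 64, 16):
    print(f"batch size {size:5d}: {total_work(size):>10,} fragment-cycles")

With 4096-fragment batches every batch ends up containing a worst-case fragment, so the whole frame pays the expensive path; at 16 fragments per batch most batches stay cheap.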

Jawed
 
I'd say that the conclusion that "unified (for DX10, on IMRs) is good because you can make a mobile (DX9, TBDR) part out of it" seems rather premature.
Actually, the conclusion is along the lines that the control logic unification necessitates probably isn't any more significant than adding a few ALUs for a dedicated VS instead, as PowerVR would have looked at that trade-off. I'd also fully expect ATI's first shader-enabled mobile part to utilise a unified architecture as well.
 
Bob said:
Like TBDR, unified shaders are nice in theory, but you end up with very nasty implementation issues that need to be resolved.
Can you enumerate the nasty implementation issues? Would be good to have your analysis.

Jawed
 
Bob said:
- SGX is not DX10.
Given that "DX10" is not out yet (and may not be set in stone) that may be true but the press release does say:
enabling a feature set that exceeds the requirements of OGL 2.0 and Microsoft Shader Model 3,
 
Bob said:
I'm assuming you're referring to the PowerVR SGX. Considering that:
- No one has produced the PowerVR SGX yet.
- We don't know how big it really is (claim is 2 to 8 mm^2 for the core)
- We don't know how power-hungry it really is.
- We don't know how fast it really is.
- We don't know if it can be fabbed.
- SGX is not DX10.
- SGX is a TBDR

I'd say that the conclusion that "unified (for DX10, on IMRs) is good because you can make a mobile (DX9, TBDR) part out of it" seems rather premature.

Like TBDR, unified shaders are nice in theory, but you end up with very nasty implementation issues that need to be resolved.


Since SGX is aimed at PDAs/mobiles it doesn't obviously need to be able to render all the Aero whizz-bang, but in terms of shader capabilities, where does it actually fall short compared to DX10?

And just for the record the fact that it's a TBDR doesn't change a thing in what Wavey said; either an architecture is unified or it isn't.
 
Jawed said:
Can you enumerate the nasty implementation issues? Would be good to have your analysis.

Jawed

Poke Simon here; he's the experienced engineer and doesn't work with hearsay :p
 
Jawed said:
And you said that it isn't certain. Well, I disagree. The entire point of the scheduler is to increase utilisation.
Well, I think I misspoke. I meant that it isn't certain that the increase in execution unit utilization will lead to a performance increase when compared to a traditional renderer on the same budget.

Otherwise you might just as well do a blind time-slice on vertex and pixel batches (e.g. in a 1:3 ratio). That isn't scheduling.
No, you would never want to do that. But just having the ability to store one pixel batch and one vertex batch at a time would give you unified pipelines without improving dynamic branching. And it would need no complex scheduling; you just have a simple rule: if my buffer for fragment data is more than x% full, process pixels, else process vertices.
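Something like this minimal sketch of that "blind" arbitration rule; the threshold and buffer capacity are made-up illustrative values, not anything from a real design.

Code:
# Minimal sketch of the simple arbitration rule described above.
# Threshold and buffer capacity are hypothetical.

FRAGMENT_BUFFER_CAPACITY = 1024
THRESHOLD = 0.75                 # "more than x% full"

def pick_work(fragment_buffer_occupancy, vertex_queue_empty):
    """Decide whether the unified pipes run a pixel batch or a vertex batch next."""
    if fragment_buffer_occupancy / FRAGMENT_BUFFER_CAPACITY > THRESHOLD:
        return "pixel"
    if not vertex_queue_empty:
        return "vertex"
    return "pixel"               # nothing else to do: drain fragments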

All of which maximises overall utility. As I said earlier, you can't have a unified architecture without an advanced (multi-batch per SIMD unit) scheduler (unless you consider blind time-slicing viable, which I don't) - and an out of order scheduler seems the best fit.
I'm really not sure how out of order scheduling actually comes into play here. From what I understand about CPUs, out of order processing has to do with the reordering of instructions to make use of parallel execution units in code that is designed for serial execution. This concept doesn't seem to apply at all to graphics processing.
 
Can you enumerate the nasty implementation issues?
Off the top of my head:
- How do you maintain DRAM coherent access streams?
- If you allow batches to execute out of order, you need to buffer up the results: although you can run shaders out of order, you can't run ROP out of order. How big is the buffering? How efficient is it?
- How do you resolve resource starvation?
- How do you guarantee forward progress?
- If you have a shader that can generate triangles, how do you guarantee that rasterization/early-Z happens in order of generation, across the whole chip?
- How do you build efficient register files, when you need fine-grained allocation?
- Is resource allocation done in hardware or in software? Either case, how is the allocation done? Is there a cost for reallocation, and if so, what is it?
- Do you keep triangles generated by one shader pipe totally inside that shader pipe (ie: no transmission of attributes), or do you add large busses to funnel through attributes? Either way has its own set of issues.
- How do you share vertex work between shader pipes? Do vertex batches need to be complete primitives?
 
Chalnoth said:
I'm really not sure how out of order scheduling actually comes into play here. From what I understand about CPUs, out of order processing has to do with the reordering of instructions to make use of parallel execution units in code that is designed for serial execution. This concept doesn't seem to apply at all to graphics processing.
It's important to distinguish between the instruction-level out of order execution you get in CPUs and the batch/thread level that is part of Xenos and future ATI unified architectures.

A simple example is two batches, each of a separate shader state. The first batch of 16 fragments runs a shader that consists of 1 texture op and 3 ALU ops all dependent on the texture op.

While the texture op is executing in the TMU pipeline (for all 16 fragments) those fragments cannot execute ALU ops. So that's where you swap another batch into the ALU pipes. But there's no point in swapping in another batch that has the same shader state, batch "1.1" as it were. That's because the first instruction requires the TMU pipeline, and that's currently busy.

So the GPU finds a different batch that can run in the ALU pipes. That might be a shader with a texture op and 3 ALU ops with the first 2 ALU ops independent of the texture.

So batch 2 can run 2 instructions in the ALU pipeline before needing the TMU pipeline while batch 1 is still running its texture op - hence the GPU executes the two batches "out of order".

Obviously it's a matter of load-balancing as to how far "ahead" batches might get. The ideal is to minimise the number of batches in flight while also ensuring that all execution pipelines have work to do.

More batches in flight increases the number of entries required in the batch queue and seriously increases the register file size and the complexity of retrieving the right registers when a batch returns to being in context. So the cost of out of order scheduling is much larger amounts of on-GPU memory.
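A rough sketch of what that batch-level out-of-order issue looks like, using the two shaders above. The texture latency, the one-issue-per-cycle model and the scan-for-ready policy are all invented for illustration - this is not Xenos' actual scheduler.

Code:
# Rough sketch of batch-level (not instruction-level) out-of-order issue.
# "TEX" is a texture op, "ALU" depends on the pending texture result,
# "ALU*" does not. Latencies and the issue model are hypothetical.

TEX_LATENCY = 8                  # assumed cycles for a texture fetch

class Batch:
    def __init__(self, name, program):
        self.name = name
        self.program = program
        self.pc = 0
        self.tex_ready_at = 0    # cycle when the outstanding fetch completes

    def next_op(self):
        return self.program[self.pc] if self.pc < len(self.program) else None

batch1 = Batch("batch 1", ["TEX", "ALU", "ALU", "ALU"])     # all ALU ops depend on the TEX
batch2 = Batch("batch 2", ["TEX", "ALU*", "ALU*", "ALU"])   # first two ALU ops independent
batches = [batch1, batch2]

for cycle in range(24):
    for b in batches:            # scan for any batch whose next op can issue
        op = b.next_op()
        if op is None:
            continue
        if op == "TEX":
            b.tex_ready_at = cycle + TEX_LATENCY
        elif op == "ALU" and cycle < b.tex_ready_at:
            continue             # still waiting on the texture result
        print(f"cycle {cycle:2d}: {b.name} issues {op}")
        b.pc += 1
        break                    # one issue per cycle in this toy model

Batch 2's independent ALU work fills the cycles where batch 1 is stalled on its texture fetch; the more such batches the scheduler can hold, the fewer bubbles - paid for in register file and queue space, as above.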

Jawed
 
In the ideal case the unified shader model should always be better than the non unified model because you have more processing units for whatever is (or may be) your bottleneck at any point.

Dave mentions that there are different bottlenecks from frame to frame, but the fact is that there are large differences from batch to batch inside a frame. The trace I'm using from the UT2004 Primeval map has at least three different zones: one that becomes dominated by fillrate when the shader to ROP ratio is at least 3:1 (a very small fragment program), one that is always fragment shader limited (and becomes texture bandwidth limited if you increase the number of shader units, but that may be because of inefficiencies in the current implementation of the simulator) and another that is vertex limited.

Our tests show that just because of those vertex limited zones (which are relatively small) the unified shader architecture gets slightly better performance than the non unified one. Of course our current model for a unified architecture is quite ideal and there is little cost associated with unification. The unit that distributes the work gives priority to vertex inputs, but their number is limited by the size of a vertex reorder queue that is usually smaller than the total amount of shader inputs that can be stored in the shader units. Vertex and fragment inputs must compete for registers in a unified register array. The thread model we are currently testing is also ideal and supports unordered execution of the shader threads. And we don't add a cost for the (likely) larger constant bank and instruction memory (which in any case, for UT2004 and similar games, are already relatively empty with the minimum resources required for ARB).

For embedded architectures there is no question to be asked. You don't even need the optional geometry processor or unit that current implementations seem to require any more. A single ROP pipe, some logic to fetch data from memory and a single shader pipe, and you have a GPU that can rival a PC GPU in features.
 
RoOoBo said:
In the ideal case the unified shader model should always be better than the non unified model because you have more processing units for whatever is (or may be) your bottleneck at any point.
So, the question is, does the extra logic required to keep that unified shader architecture purring along smoothly cost little enough so that it makes sense for current games? Future games?
 
Chalnoth said:
So, the question is, does the extra logic required to keep that unified shader architecture purring along smoothly cost little enough so that it makes sense for current games? Future games?

Nothing is cheap at first, but later down the road it will be taken for granted. So the question is when do you start investing?
 
Bob said:
Off the top of my head:
I could prolly spend a few hours on this :LOL: and always run the risk of speaking way out of turn being a normal person, not a GPU engineer. But anyway...

- How do you maintain DRAM coherent access streams?
How do you maintain them now? Texture reads are based on tiled organisation of textures in memory, as far as I can tell. Textures are pre-fetched.

As I see it, a unified architecture could make multiple, non-contiguous (in time) accesses to the same textures, so a degree of set-associativity would need to be introduced to caches, where I presume GPUs currently have no need. So that's definitely an added cost in a unified GPU.

- If you allow for batches to execute out of order, you need to buffer up the results: Although you can run shaders out of the order, you can't run ROP out of order. How big is the buffering? How efficient is it?
Since every fragment is accompanied by one or more z values, I don't accept your argument that fragments have to be submitted to the ROPs "in order". A fragment is only rendered if its z allows it. There is always a final z test on pixel render, as I understand it, because fragments that pass early-z rejection are not 100%-guaranteed visible.

On the other hand, ROPs already have to batch-up fragments to make the most efficient DRAM accesses. So there's already a degree of buffering in ROPs. Sure, if a ROP happens to receive, say, 16 fragments for contiguous pixels simultaneously, then buffering is trivial.

- How do you resolve resource starvation?
By making that a parameter of the scheduler. The scheduler knows exactly how long non-TMU operations take, so it can (trivially) predict that the ALU pipeline is going to need new batches ahead of the pipeline actually emptying.

TMU pipelines are a bit more tricky due to the degrees of requested filtering and the latency of DRAM accesses. But it should be reasonably easy to put a short buffer on the front of the TMU pipelines. Perhaps the TMU can warn the scheduler when it's about to start its last iteration.

- How do you guarantee forward progress?
By designing the scheduler to prioritise oldest batches in an "available batches" tie. Also by sizing the inter-stage buffers in a reasonably typical fashion and keeping an eye on their "percentage full" indicators.

- If you have a shader that can generate triangles, how do you guarantee that rasterization/early-Z happens in order of generation, accross the whole chip?
By single-threading the triangle setup engine (erm, the interpolation/rasterisation engine is prolly a better name for it), which is how GPUs currently work (apparently).

If the GPU is designed to perform tiled/predicated triangle rendering, then the GPU will need to have either an internal or external vertex/triangle buffer. Additionally if you can queue triangles per tile, then you won't get competing rasterisation/early-z-test threads corrupting z or rasterising otherwise useless fragments.

- How do you build efficient register files, when you need fine-grained allocation?
I presume that the execution pipeline gets longer in order to allow slower register fetching. So you don't necessarily need to tackle fetch latency at source. Since there's no branching in a pipeline (that only executes one instruction per batch, as Xenos seemingly does), there's no risk in making the pipeline longer. (All branching will happen when the batch is out of context.)

- Is resource allocation done in hardware or in software? Either case, how is the allocation done? Is there a cost for reallocation, and if so, what is it?
I'm unclear what kinds of resources you're referring to that haven't already been tackled. Do you mean cache lines? ALU or TMU pipeline prioritisation?

In Xenos resource allocation is a GPU-state-driven hardware function, as far as I can tell.

- Do you keep triangles generated by one shader pipe totally inside that shader pipe (ie: no transmission of attributes), or do you add large busses to funnel through attributes? Either way has its own set of issues.
I guess you're referring to a geometry shader and attributes generated by the shader, as "output values to be treated as constants by the next stage".

My understanding is that these attributes are part of the batch shader state, for the next stage. It's irrelevant what pipes work on these triangles, as the batch shader state is universally accessible by all pipelines.

- How do you share vertex work between shader pipes? Do vertex batches need to be complete primitives?
I can't say anything about this as my understanding of vertex shading is pretty minimal. I'm not sure why vertex work would be shared between shader pipes (concurrently), whilst performing vertex shading.

---

I'm basing my answers on what I've read about Xenos. Some of it is pure presumption, no doubt about it :smile:

Jawed
 
Jawed said:
Since every fragment is accompanied by one or more z values, I don't accept your argument that fragments have to be submitted to the ROPs "in order".
Blending is order-dependent.

As for the rest of your ideas on the unified design, it sounds vastly too resource-hungry to me. To make unified work, you really need to have a design that keeps things simple.
 
CMAN said:
Nothing is cheap at first, but later down the road it will be taken for granted. So the question is when do you start investing?
That's a fine argument when it comes to things like pixel shading. But not unified pipelines: unified pipelines are a performance optimization. They really don't offer much in the way of programming flexibility.
 
Chalnoth said:
Blending is order-dependent.
And requires its own special pass as far as I can tell, after the rest of the frame has been rendered. Presumably the driver could handle that, instructing the GPU to render strictly in order.

As for the rest of your ideas on the unified design, it sounds vastly too resource-hungry to me. To make unified work, you really need to have a design that keeps things simple.
Why don't you read the Xenos article and the multi-threaded scheduling patent? You'll find it's pretty much all there.

Jawed
 
How do you maintain them now? Texture reads are based on tiled organisation of textures in memory, as far as I can tell. Textures are pre-fetched.
Texture isn't the only client to memory, especially not in DX10.

Since every fragment is accompanied by one or more z values, I don't accept your argument that fragments have to be submitted to the ROPs "in order".

Consider this example: The application clears Z to 1.0f and sets the depth compare mode to LESS. It then draws a green triangle at z = 0.5f followed by a perfectly overlapping red triangle at the same Z value.

The final output *must* be green. If you process either pixels or triangles in a different order than they were issued by the application, then you'll start seeing red.

You can't always predict latency from texture, so you can't predict how long the green triangle may be delayed with respect to the red triangle. So although you can run the PS for the two triangles in arbitrary order, you do need to resolve the Z test in ROP in the correct order.
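A trivial sketch of that order dependence - depth cleared to 1.0, compare mode LESS, a green then a red triangle both at z = 0.5 covering the same pixel:

Code:
# Order dependence of the Z test in ROP: same depth value, LESS compare,
# so the result depends entirely on the order fragments reach the ROP.

def rop(fragments):
    """fragments arrive as (z, colour) in the order they reach the ROP."""
    depth, colour = 1.0, "clear"
    for z, c in fragments:
        if z < depth:            # LESS
            depth, colour = z, c
    return colour

print(rop([(0.5, "green"), (0.5, "red")]))   # "green" -- API order preserved
print(rop([(0.5, "red"), (0.5, "green")]))   # "red"   -- reordered, wrong result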

By designing the scheduler to prioritise oldest batches in an "available batches" tie. Also by sizing the inter-stage buffers in a reasonably typical fashion and keeping an eye on their "percentage full" indicators.
Oldest first is not always the optimal scheduling policy. You can severely thrash your texture cache this way.

You can also deadlock this way: If the PS runs really slow, your oldest thread is now a vertex thread. But vertices are blocked by the raster unit because there are no free resources to run PS on. So you can't just pick the oldest thread. You now need to walk the thread list and find a thread that can run. This can be a rather involved and expensive process.
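In a quick sketch, the difference between the naive policy and the one you actually need; the thread records and the can_run() predicate are hypothetical.

Code:
# "Oldest first" versus "oldest runnable first". Thread records and the
# can_run() test are hypothetical placeholders.

def pick_oldest(threads):
    # Naive policy: can hand back a vertex thread that is blocked behind the
    # rasteriser, while the PS work that would unblock it never gets picked.
    return threads[0] if threads else None

def pick_oldest_runnable(threads, can_run):
    # Walk the age-ordered list until something can actually make progress.
    # In hardware this scan is the involved, expensive part.
    for t in threads:
        if can_run(t):
            return t
    return None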

By single-threading the triangle setup engine (erm, the interpolation/rasterisation engine is prolly a better name for it), which is how GPUs currently work (apparently).
You don't only need to single-thread the setup engine. You need to ensure that the setup engine's queue is filled up in order. This means a lot more buffering at the output of the shader, if you want to be able to still do useful work while the rasterizer is busy (like, for example, pixel shader work to unblock the rasterizer).

I presume that the execution pipeline gets longer in order to allow slower register fetching.
It's not a question of latency (although that does play a big role), but of bandwidth. If you want to just do MADs, you need 128-bit * 3 reads / clock from arbitrary registers. RF banking can help, if you limit the number of threads you run.
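Back-of-envelope, per pipe - the shader clock here is an assumed figure, just to put a number on the register read traffic:

Code:
# Register-file read bandwidth for one vec4 MAD per clock, per pipe.
# The 650 MHz shader clock is an assumed, illustrative value.

reads_per_mad = 3                # a MAD reads three source operands
operand_bytes = 16               # 128-bit vec4
clock_hz = 650e6                 # assumed clock

per_pipe = reads_per_mad * operand_bytes * clock_hz
print(per_pipe / 1e9, "GB/s of register reads per pipe")   # ~31 GB/s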

My understanding is that these attributes are part of the batch shader state, for the next stage. It's irrelevant what pipes work on these triangles, as the batch shader state is universally accessible by all pipelines.
Consider a GPU with 2 unified shader pipes that can run either VS or PS threads. If you transform a triangle on one shader pipe, you don't want to pay the cost of transmitting all that data to the other shader pipe if you can help it. Instead, you want to keep everything on the same shader pipe, and run PS threads for that triangle. That way, you don't need to shuffle vertex attributes throughout the whole chip.

The other shader unit can, in the mean time, work on some other triangle.

You may want to transmit big triangles between shader pipes though, so you don't starve the other pipes, but that can be rather expensive.

500 M tris/sec * 3 vertices * 32-bit float * 64 scalar attributes == 384 GB/sec of internal bandwidth. If you say you only want to run with 8 scalar attributes at full speed (ie: position and some texcoords or colors), then your internal bandwidth is "only" 48 GB/sec. If you can somehow optimize this down to one vertex / triangle on average (which, btw, has a whole other set of issues), you're down to a more manageable 16 GB/sec. And you haven't even done any texturing yet!
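The same arithmetic, spelled out (figures taken directly from the numbers above):

Code:
# Internal attribute bandwidth, using the figures from the post.

tris_per_sec = 500e6
bytes_per_scalar = 4                                    # 32-bit float

full   = tris_per_sec * 3 * 64 * bytes_per_scalar       # 3 vertices, 64 scalar attributes
lean   = tris_per_sec * 3 * 8  * bytes_per_scalar       # 8 scalar attributes at full speed
shared = tris_per_sec * 1 * 8  * bytes_per_scalar       # ~1 new vertex per triangle

print(full / 1e9, lean / 1e9, shared / 1e9)             # 384.0, 48.0, 16.0 GB/s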
 