Unified and diversified shading

nAo

From this article we learned that Xenos can execute at least two different types of instructions at the same time, such as an arithmetic instruction and a texture fetch instruction.
After all, Xenos has a 'global' pool of computational resources that can be shared and dynamically assigned across many threads.
What if these computational resources are diversified even more?

In this NVIDIA patent, resource diversification is addressed to some extent:
In addition, selection may take into account the state of the execution module. In one such embodiment, execution module 142 contains specialized execution units (or execution pipes), with different operations being directed to different execution units; e.g., there may be an execution unit that performs floating-point arithmetic and another that performs integer arithmetic. If the execution unit needed by a ready instruction for one thread is busy, an instruction from a different thread may be selected. For instance, suppose that at a given time, the floating-point pipeline is busy and the integer pipeline is free. A thread with an integer-arithmetic instruction ready can be given priority over a thread with a floating-point instruction
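In rough C, the policy described there might look something like this (a minimal sketch; all names and structures are invented for illustration, not taken from the patent):

```c
/* Among ready threads, prefer one whose required execution pipe is
 * currently free - e.g. an integer-instruction thread gets priority
 * while the floating-point pipe is busy. */

enum pipe_kind { PIPE_FP, PIPE_INT, PIPE_COUNT };

struct thread_state {
    int ready;               /* next instruction has its operands  */
    enum pipe_kind needs;    /* pipe that instruction requires     */
};

int pipe_busy[PIPE_COUNT];   /* nonzero while a pipe is occupied   */

/* Return the index of a thread to issue this cycle, or -1 to stall. */
int select_thread(const struct thread_state *t, int n)
{
    for (int i = 0; i < n; i++)
        if (t[i].ready && !pipe_busy[t[i].needs])
            return i;        /* ready AND its pipe is free: issue  */
    return -1;               /* nothing can go this cycle          */
}
```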

In the beginning we had only one kind of computational resource; later we got (semi-)independent ALUs and TMUs, and at some point in the future there will probably be even more diversification.
If the DOT3 operation is used ten times more frequently than a reciprocal operation, why should we include a RCP unit in every ALU?
Modern CPUs already do this, but to a much lesser extent, because they usually don't run more than one or two threads at the same time.
To be fair, unified shading is not needed to diversify computational resources, but it seems (at least logically) that further diversification makes more sense in a unified shading architecture (as that NVIDIA patent shows...)
What do you think?

NB:
f@nb0y1 -> my ATI GPU is better than yours, I've got a 2:8:2:5:12:6:4 GPU!
f@nb0y2 -> WTF!? stupid moron my NVIDIA GPU is faster! I've got a 2:6:4:3:16:2:1 GPU!
 
Interesting idea. So do a large statistical analysis of instruction distribution, and try to match up execution unit distribution. Hell, I'd say that since NRM is fairly common, "macro" ops should be considered in the mix as well.
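As a hedged sketch of what that analysis could feed into (the op set, trace format, and proportional-allocation rule are all invented for illustration):

```c
#include <stdio.h>

/* Count opcode frequencies over a shader trace, then hand each op
 * type a share of the execution units proportional to its frequency. */

enum op { OP_MAD, OP_DP3, OP_RCP, OP_NRM, OP_TEX, OP_COUNT };

void suggest_unit_mix(const enum op *trace, int n, int total_units)
{
    int hist[OP_COUNT] = {0};
    if (n <= 0)
        return;
    for (int i = 0; i < n; i++)
        hist[trace[i]]++;
    for (int k = 0; k < OP_COUNT; k++)
        printf("op %d -> %d units\n", k,
               (hist[k] * total_units + n / 2) / n);  /* rounded share */
}
```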
 
yes, that makes very strong sense in the context of unified shader architectures, because the statistical workload of op types on vertices vs fragments can easily be the determining factor of your efficiency - so you definitely want such a diversification, i.e. both horizontal (e.g. N units go to vertices, M units go to fragments) and vertical (e.g. X of the N vertex-processing units are of type A, the rest of type B, and Y of the M fragment-processing units are of type A, the rest of type B).

actually it would be _very_ interesting to see which control scheme, horizontal or vertical, needs what granularity in order for a given design to achieve higher efficiency. for example we know that the horizontal granularity of Xenos is 1/3, i.e. you can assign shader ALUs to work on tasks (i.e. vertex vs fragment) in groups of 16 out of 48 ALUs in total. now, what ALU diversification (ie. vertical control) would be needed for the 'average workload' so that a part like Xenos would achieve even higher efficiency? oh, all those questions.. ; )
 
I was thinking about the same thing, actually, especially with all the R580 talk. They're adding more arithmetic units, but what if they weren't all identical?

I'm guessing NRM may not be the best target because it maps so well to DP3+RSQ+MUL, and probably doesn't account for a significant percentage of instructions except in short lighting shaders. However, a lot of the other instructions like sin, cos, log, exp, pow, and the derivative functions could work.
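For reference, the expansion in question is just three instructions (a plain C rendering, not any particular GPU's microcode):

```c
#include <math.h>

/* NRM written out as the DP3+RSQ+MUL macro discussed above. */
void nrm(float v[3])
{
    float d = v[0]*v[0] + v[1]*v[1] + v[2]*v[2];  /* DP3 */
    float r = 1.0f / sqrtf(d);                    /* RSQ */
    v[0] *= r; v[1] *= r; v[2] *= r;              /* MUL */
}
```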

Of course, the extra routing and scheduling logic may negate the savings, especially since the most obvious targets are the scalar ops. My guess is the DP3, MUL, and ADD units are already intertwined to avoid unnecessary redundancy, so changing ratios among these instructions isn't worth it.
 
darkblu said:
for example we know that the horizontal granularity of Xenos is 1/3, i.e. you can assign shader ALUs to work on tasks (i.e. vertex vs fragment) in groups of 16 out of 48 ALUs in total.
I'm hoping when you say "you", you're not implying the programmer. Xenos dynamically changes the assignment depending on where the limiting factor is - that's the whole point of a USA. A single draw call dramatically shifts the ratio of pixels to vertices, especially when you consider clipped or backfacing triangles.

In terms of the optimal number, you have to consider time as well. There is going to be a FIFO between the pixel and vertex shaders, and it can be quite small if Xenos switches fast enough. If it starts getting full, assign execution units to pixel shading, and vice versa. In fact, for vertex/pixel shading, I doubt 16 shader granularity is required. That decision probably had more to do with making it easy to share the texture units.
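A minimal sketch of that watermark idea, assuming invented FIFO depths and thresholds:

```c
/* Nudge execution units toward pixel work when the vertex->pixel FIFO
 * fills, and back toward vertex work when it drains. */

#define HIGH_WATER 48   /* FIFO filling: pixel side falling behind   */
#define LOW_WATER  16   /* FIFO draining: vertex side falling behind */

void rebalance(int fifo_fill, int *units_on_pixels, int total_units)
{
    if (fifo_fill > HIGH_WATER && *units_on_pixels < total_units)
        (*units_on_pixels)++;     /* shift a unit to pixel shading   */
    else if (fifo_fill < LOW_WATER && *units_on_pixels > 0)
        (*units_on_pixels)--;     /* shift a unit back to vertices   */
    /* between the watermarks, leave the split alone (hysteresis)    */
}
```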
 
Actually, there are three different processing routes on Xenos: Arithmetic, Texture Fetch (filtered textures) and Vertex Fetch (unfiltered float textures); each of these processing types has its own sequencers and arbiters controlling it, the primary difference being that both of the texture types are effectively servers to the Arithmetic units. Each of the three Arithmetic pipelines also has multiple sequencers and arbiters.

Ever since I wrote the Xenos article I've been wondering about having different pipes for different types of functionality - in one way it makes sense, in others it'll likely only be done with careful consideration. One of the main problems is the cost of the controlling mechanisms in the sequencers and arbiters; these are fairly fixed costs, and the more different types of processing pipeline you have, the more these costs go up relative to the quantity of processing power they are controlling. The other issue is that it kind of ends up in the place you were beforehand - guessing about the types of workloads that are going to be required and judging how much is required of each element such that other things aren't stalled; the more general things are, the more utilisation all units will have, but the lower their efficiency will be as well.
 
darkblu said:
i.e. you can assign shader ALUs to work on tasks (i.e. vertex vs fragment) in groups of 16 out of 48 ALUs in total.
As far as it was described to me, this is not the case - all the workload is assigned across the ALUs by Xenos itself; the programmer doesn't know where it is going to get processed. Xenos does have controls that enable developers to tweak the load-balancing algorithm such that it will bias one way or the other (if it's exposed through the API), but that doesn't mean the developer assigns ALU pipes to particular commands.
 
Mintmaster said:
I'm hoping when you say "you", you're not implying the programmer. Xenos dynamically changes the assignment depending on where the limiting factor is - that's the whole point of a USA. A single draw call dramatically shifts the ratio of pixels to vertices, especially when you consider clipped or backfacing triangles.

well, that honestly was wishful thinking on my part. i knew there were mechanisms to tweak the balance, so i assumed that went as far as full manual control. eh, so much for that : ) but i have to disagree re the totally automatic balancing - and i think you don't believe in such either, as you say yourself.

In terms of the optimal number, you have to consider time as well. There is going to be a FIFO between the pixel and vertex shaders, and it can be quite small if Xenos switches fast enough. If it starts getting full, assign execution units to pixel shading, and vice versa.

that mechanism implies a default 'equilibrium' ratio, as otherwise you basically get 'all in, all out' behaviour all of the time - vertices come in, everything goes to verts, the fifos get full, stop all vertex work and throw everything at fragments, rinse, repeat.

In fact, for vertex/pixel shading, I doubt 16 shader granularity is required. That decision probably had more to do with making it easy to share the texture units.

pretty good point. maybe Dave can add some specularity on top : )
 
Dave Baumann said:
Actually, there are three different processing routes on Xenos: Arithmetic, texture fetch (filtered textures) & Vertex Fetch (Unfiltered Float Textures);
Isn't the latter simply a vertex fetch - i.e. fetch one vertex from the vertex buffer. Being a main memory operation, it will benefit from latency-hiding, but I'm puzzled why it would be referred to as a float texture.

More on-topic: isn't the idea of diversified pipelines really just VLIW?

Jawed
 
Jawed said:
Isn't the latter simply a vertex fetch - i.e. fetch one vertex from the vertex buffer. Being a main memory operation, it will benefit from latency-hiding, but I'm puzzled why it would be referred to as a float texture.
I believe the "vertex fetch" means point-filtered texturing. It's what's used for vertices a lot (IIRC), hence the semi-misleading name.
 
NRM can probably be implemented more efficiently than DP3/RSQ/MUL (why else would NVidia have free NRM16?).

Seems to me that this is an integer knapsack problem. You have a knapsack corresponding to your transistor budget. You have items to put into this knapsack in the form of functional units. Each functional unit you put into the knapsack has an associated size (how much of your transistor budget it consumes) as well as an associated value (which can be computed by figuring out the expected probability that it will be used over all workloads).

Challenge: figure out the set of functional units, and how many of each, to put into the knapsack to maximize overall value, modulo the overhead cost of linking up an additional unit and scheduling work to it.
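For illustration, the classic unbounded-knapsack recurrence over unit types might be sketched like this (costs and values would come from real measurements, not invented numbers; the linking/scheduling overhead mentioned above is ignored here):

```c
/* cost[k]  = transistor cost of unit type k
 * value[k] = expected utility of unit type k from workload statistics
 * Multiple copies of a unit type are allowed, hence "unbounded". */

#define BUDGET 100   /* transistor budget, arbitrary units */

int best_value(const int *cost, const int *value, int kinds)
{
    static int best[BUDGET + 1]; /* best[b]: max value within budget b */
    for (int b = 0; b <= BUDGET; b++)
        best[b] = 0;
    for (int b = 1; b <= BUDGET; b++)
        for (int k = 0; k < kinds; k++)
            if (cost[k] <= b && best[b - cost[k]] + value[k] > best[b])
                best[b] = best[b - cost[k]] + value[k];
    return best[BUDGET];
}
```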

It's not self-evident to me that including an extra DP unit and RSQ unit is better than a fixed-function NRM unit. If the NRM unit were the same size as a DP+RSQ, if it ran in similar time, and if NRMs weren't that common relative to DP/RSQ, I would agree. But if NRMs make up, say, 1 in 10 operations, and a NRM runs 3x faster than a DP/RSQ and consumes, say, half the gates, it seems to me that a NRM unit could be a net win.

Whether it is a win or not depends on the relevant statistics and transistor costs. It doesn't lose under all scenarios. Just because adding another DP/RSQ/MUL unit might seem "generally more useful" than a fixed-function unit doesn't mean fixed-function units are a loss.

I would bet that a big fraction of RSQ instructions are used in NRM macro expansion.
 
Isn't that what has become of the "unified" shader requirement for DXNext/D3D10 - that it is to be viewed more or less only from a logical standpoint?
After all, IHVs only have to make their shaders, vertex and pixel alike, addressable by the same instructions and shader programs.

And seeing how long certain fixed-function units have survived even in the age of floating-point ALUs, my money's on physically separate shader units from at least one of the IHVs, at least for the first WGF2.0 GPU.
 
DC, I'm talking more from the point of view of R580-type architectures. If you have that much number-crunching power, it's probably not worth it, especially since it'll take time for games to take advantage of it. Might as well do NRM with a macro expansion.

NVidia probably did it because it was the perfect time for it. Many of these early shaders use simple lighting equations, so a NRM macro expansion is a significant portion of the code. Furthermore, making it FP16 is probably the only way it was feasible. However, FP16 normalization is really not adequate for many situations, like specular lighting on low-curvature surfaces. NVidia probably figured they could hand-tweak it into the more popular games where appropriate.

Make the shader longer, and the dedicated NRM unit probably won't be used much. If you're doing falloff then you need R^2 anyway, and again NRM doesn't save you much. If you need the precision, then it'll go unused (hehe, ironic flashback: "FP24 is too much precision for pure color calculations and its not enough precision for geometry, normal vectors or directions or any kind of real arithmetic work" - David Kirk).

I could be wrong, but I don't think there's any big shortcut for normalization. You need to sum the squares, find the inverse square root, and multiply. The first and last only differ from DP3 and MUL in terms of input count to the logic. You might as well generalize them so they can be used for other operations.


In terms of the other scalar instructions like rcp, rsq, log, exp, etc., aren't these implemented with lookup tables and interpolation/Taylor expansion? There's so much hardware sharing, and they're single-input scalar functions, so I don't think there's a whole lot to save, especially in light of the extra scheduler/arbiter logic needed.
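For what it's worth, a sketch of the table-plus-interpolation approach for RCP over the mantissa range [1,2) (the table size and linear interpolation are illustrative choices, not any vendor's actual design):

```c
#define TBL_BITS 6
#define TBL_SIZE (1 << TBL_BITS)

static float tbl[TBL_SIZE + 1];

/* Fill the table with exact reciprocals at evenly spaced points. */
void rcp_init(void)
{
    for (int i = 0; i <= TBL_SIZE; i++)
        tbl[i] = 1.0f / (1.0f + (float)i / TBL_SIZE);
}

float rcp_approx(float x)             /* assumes 1.0f <= x < 2.0f */
{
    float t = (x - 1.0f) * TBL_SIZE;  /* position within the table */
    int   i = (int)t;
    float f = t - (float)i;           /* fraction between entries  */
    return tbl[i] + f * (tbl[i + 1] - tbl[i]);  /* linear interp.  */
}
```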
 
Jawed said:
Isn't the latter simply a vertex fetch - i.e. fetch one vertex from the vertex buffer. Being a main memory operation, it will benefit from latency-hiding, but I'm puzzled why it would be referred to as a float texture.
Whether you are fetching vertices from a vertex buffer or point-sampled texels from a texture, it's fundamentally the same thing.


Regarding the topic, I wonder whether it pays off to have separate integer processing units alongside the FP ALUs, or if it's less expensive to have a combined ALU that reuses parts of the mantissa processing for integers, possibly over multiple cycles to get 32 bits.
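To make the multi-cycle idea concrete, here's a purely illustrative sketch of a 32x32-bit integer multiply assembled from four passes through a narrower 16x16 multiplier, the way a mantissa datapath might be reused:

```c
#include <stdint.h>

uint64_t mul32_via_16(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;

    uint64_t p0 = (uint64_t)al * bl;   /* pass 1 */
    uint64_t p1 = (uint64_t)al * bh;   /* pass 2 */
    uint64_t p2 = (uint64_t)ah * bl;   /* pass 3 */
    uint64_t p3 = (uint64_t)ah * bh;   /* pass 4 */

    /* a*b = p3*2^32 + (p1+p2)*2^16 + p0 */
    return p0 + ((p1 + p2) << 16) + (p3 << 32);
}
```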
 
or if it's less expensive to have a combined ALU that reuses parts of the mantissa processing for integers, possibly over multiple cycles to get 32 bits.
Well, this is more or less what Cell does, so obviously it's less expensive (SPE design is all about cost efficiency, after all).
But I'm not convinced it's clear whether you really need full vector integer execution units, or whether simpler scalar ones would suffice (in which case the question of cost efficiency comes up again).
 
DemoCoder said:
Interesting idea. So do a large statistical analysis of instruction distribution, and try to match up execution unit distribution. Hell, I'd say that since NRM is fairly common, "macro" ops should be considered in the mix as well.
Some kind of RISG (Reduced Instruction Set GPU) ;)

Simplify as much as you can and have a better/simpler/faster/generic design.
In fact you could use this design for other non-3D applications too :)
 
Mint, like I said, it all depends on statistics, numbers which we don't have. One could make the same argument about DP, since it is essentially a macro that expands into two instructions in most SIMD instruction sets. But the XB360 added a dot product instruction to the VMX unit, instead of just using a vmsum/vsum, because dot products occur frequently enough to justify it.

As for normalization, it can be computed more efficiently than DP/RSQ/MUL via Newton-Raphson.
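A sketch of that route (the well-known bit-trick seed and the iteration count are illustrative choices; each iteration roughly doubles the accurate bits):

```c
#include <stdint.h>
#include <string.h>

float rsqrt_nr(float x)
{
    uint32_t i;
    float y;
    memcpy(&i, &x, sizeof i);
    i = 0x5F3759DFu - (i >> 1);         /* coarse 1/sqrt(x) seed  */
    memcpy(&y, &i, sizeof y);
    y = y * (1.5f - 0.5f * x * y * y);  /* Newton-Raphson step 1  */
    y = y * (1.5f - 0.5f * x * y * y);  /* Newton-Raphson step 2  */
    return y;
}
```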
 
From my POV, the problem here is that you're thinking of shaders as mere serial instructions. The key to a proper unified architecture, IMO, is thinking of it more along the lines of parallel instruction streams with dependencies. I know I'm being rather vague here, so let me describe a potential scheme for such a technology.

The shader compiler's goal would 'simply' be to divide the instructions into separate streams that can each run on one kind of processing engine as efficiently as possible, while maximizing parallelism. Texturing could thus also be seen as a general case, and not an exception, with the particularity of having much higher latency.

The so-called scheduler then would in fact be massively simplified, and you could even partially get rid of having to allocate "blocks" to VS or PS, although it might remain slightly more efficient if you did it that way.

Basically, each instruction stream would increase a counter on up to x other instruction streams, and each instruction stream would be "switched on" when the counter reaches y. For obvious facility and performance reasons, it would be a good idea to keep x and y - especially x - relatively small.

And since each instruction stream would be sent to a specific kind of processing engine (MAD, TEX, Special-Op, NRM, etc. - but it could also be combined, like 2MAD+1MUL, and it could get more sophisticated with Vec2/Vec3/Vec4/Scalars), all you need is a FIFO at the entry of each of these kinds of processing engines, with everything sent to those FIFOs as soon as possible. And that's, besides some potential hacks for PS-VS balance, all you need for your so-called "scheduler". At least, once again, from my POV.
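A toy rendering of that counter scheme might look like this (every structure and name here is hypothetical):

```c
#define MAX_DEPS 4                    /* the small 'x' above */

struct stream {
    int counter;                      /* prerequisites completed so far */
    int threshold;                    /* the 'y' needed to switch on    */
    int engine;                       /* which engine's FIFO to enter   */
    int ndeps;
    struct stream *deps[MAX_DEPS];    /* streams this one unblocks      */
};

void enqueue(int engine, struct stream *s);  /* engine FIFO push (stub) */

/* When a stream finishes, bump the counters of its dependents; any
 * dependent that reaches its threshold is handed to its engine's FIFO. */
void on_stream_complete(struct stream *s)
{
    for (int i = 0; i < s->ndeps; i++) {
        struct stream *d = s->deps[i];
        if (++d->counter == d->threshold)
            enqueue(d->engine, d);
    }
}
```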

The nice thing with such a scheme is that all the "hard work" is done on the CPU, once per shader, and it really wouldn't be THAT hard if optimized properly imo. Respecting the x/y constraints (see above) might be the only potentially quite problematic thing for a compiler, and the solution to that is obviously to make parallel things serial when there's no good reason for them to be parallel (remember, we've got quad/vertex-level parallelism; extra instruction-level parallelism is nice, but it's not our primary goal either).

In fact, all of this makes me wonder how Eurasia works... Hmm :)


Uttar
 
Of course it is :) I was just describing a scheme under which I feel the cost of additional kinds of processing engines wouldn't add too much complexity to the scheduler, since that seemed to be the primary criticism of such a structure here.

Uttar
 