Unified and diversified shading

DemoCoder said:
As for normalization, it can be computed more efficiently than with DP/RSQ/MUL via Newton-Raphson.
Interesting. Can you explain this a bit more to me? I know you can do approximate normalization with a Taylor series (for GF2 etc.), and I tried doing a Newton-Raphson expansion, but I just can't find the savings. Are you just saying that the RSQ part gets faster? If so, I understand, but you still need to sum the squares and scale, and I don't see the big savings over DP3 and MUL. Furthermore, you need a separate RSQ unit anyway, and you said yourself that it won't be needed much beyond normalization.

Anyway, like I said before, I see why NVidia did FP16 normalization, but beyond that it seems pointless.

I see what you're saying regarding Xenon's VMX unit, but I'm positive that the DP hardware shares the per-component multiplication hardware and just puts a few adders at the output (which can be enabled or disabled depending on the instruction) to sum the components. This is why I think it would be pointless to separate DP3 and MUL in pixel shaders. NVidia says MAD is the most commonly used instruction, so ADD also fits well in that grouping.
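To make the sharing concrete, here's a toy sketch of what I mean (purely illustrative, not any vendor's actual datapath): MUL, MAD and DP3 all run through the same per-component multipliers, and DP3 simply enables an adder tree on the outputs.

```python
# Toy model of a vec4 ALU: MUL, MAD and DP3 all reuse the same four
# per-component multipliers; DP3 just enables a small adder tree on the
# outputs. (Purely illustrative, not any vendor's actual datapath.)

def vec4_alu(op, a, b, c=(0.0, 0.0, 0.0, 0.0)):
    # Stage 1: the shared per-component multiplier array (always runs).
    products = [a[i] * b[i] for i in range(4)]

    # Stage 2: the output stage, configured per instruction.
    if op == "MUL":
        return tuple(products)
    if op == "MAD":                        # per-component adders
        return tuple(products[i] + c[i] for i in range(4))
    if op == "DP3":                        # adder tree over the first 3 products
        s = products[0] + products[1] + products[2]
        return (s, s, s, s)                # result broadcast to all components
    raise ValueError(op)

print(vec4_alu("DP3", (1, 2, 3, 0), (4, 5, 6, 0)))   # -> (32, 32, 32, 32)
```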

The only things that seem to be worth optimizing in this way are the "complex" scalar functions, and they may be small anyway.
 
Although this is a bit of a tangent, it's interesting to compare R580 and G70 in their approaches to maximizing hardware usage, specifically given the high expense of texture units.

NVidia thinks it's best to use the texture address hardware for general math when a texture isn't being accessed. ATI thinks it's best to reduce the number of texture units to maximize their load, and use the savings to add more math units.

Both methods are trying to balance instruction usage with die occupation.
 
darkblu said:
Yes, that makes very strong sense in the context of unified shader architectures, because the statistical workload of op types on vertices vs fragments can easily be the determining factor in your efficiency. So you definitely want such diversification, both horizontal (e.g. N units go to vertices, M units go to fragments) and vertical (e.g. X of the N vertex-processing units are of type A and the rest of type B, and Y of the M fragment-processing units are of type A and the rest of type B).

If you go further, you end up describing a "non-specialised stream processor":
A stream X[] feeds a small program which generates other streams Yi[]. Each of those feeds another small program, and so on. You can visualize that as a graph: each node is a program, each edge is a stream. Additionally, each leaf is a "pit" (any rendering buffer) and each root is a "source" (a vertex stream).
A GPU is just such a graph, and a pretty simple one:
vertex stream (source) -> vertex -> pixel -> screen (pit)
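Something like this, in toy form (the names and structure are made up for illustration):

```python
# A tiny "non-specialised stream processor" graph: nodes are programs,
# edges are streams, roots are sources and leaves are pits.
# (Names and structure are made up for illustration.)

class Node:
    def __init__(self, name, kind):        # kind: "source", "program" or "pit"
        self.name, self.kind, self.outputs = name, kind, []

    def feeds(self, other):                # add a stream (edge) from self to other
        self.outputs.append(other)
        return other                       # returned so calls can be chained

# The simple GPU graph from above:
source = Node("vertex stream", "source")
vs     = Node("vertex program", "program")
ps     = Node("pixel program", "program")
screen = Node("screen", "pit")

source.feeds(vs).feeds(ps).feeds(screen)
```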

And the flow has to be optimized. Of course it's much more fun if you *share* the logic power among all the program nodes (like Xenos does). In the general case of a complex stream graph, that incidentally means that the "logic operation queue" is big and very diversified. Indeed, there are different types of threads running at the same time (pixel or vertex), with each instance of a thread at a different level of completion (a vertex can be almost done, or its vertex program could be just beginning).

So, facing this big diversity of logic-operation requests, I completely agree that more diversification would lead to better flow optimisation: the logic set will always find the proper request to keep all of its different processing elements busy. It's "just" a question of scheduling it right...

But scheduling this fictional stream processor must be a nightmare. Indeed, it's obviously impossible to know statically what's going to happen, so the flow must change dynamically... The only way I can see that working is to have a small processor on chip constantly analysing the incoming stream and trying to find a good balance... A very expensive scheduler...

Am I dreaming?? :)
 
Efficient scheduling is a very hard problem, but when you have a lot of independent threads that may be executed out of order/out of sync, your life is going to be much easier.
I don't think we are going to see those huge uber-schedulers we have on CPUs showing up on a GPU anytime soon.
 
nAo said:
Efficient scheduling is a very hard problem, but when you have a lot of independent threads that may be executed out of order/out of sync, your life is going to be much easier.
I don't think we are going to see those huge uber-schedulers we have on CPUs showing up on a GPU anytime soon.
It doesn't have to be a hard problem, though. Imagine, for example, that I have N different execution units within each pipeline. Some units can perform multiple tasks. And each unit has its own execution queue.

I could schedule them all pretty efficiently if, once an instruction is completed, the next instruction to be executed is analyzed and then placed in the appropriate execution queue. If multiple queues are available, it is placed in the shortest queue.
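In toy form, that dispatch rule might look something like this (the unit names and op classes here are made up):

```python
# Toy version of the scheme above: each execution unit has its own queue,
# and the next instruction of a finished thread goes to the shortest queue
# among the units that can execute it. (Unit names and op classes are made up.)

from collections import deque

units = {
    "mad0": {"ops": {"MUL", "ADD", "MAD"}, "queue": deque()},
    "mad1": {"ops": {"MUL", "ADD", "MAD"}, "queue": deque()},
    "sfu":  {"ops": {"RSQ", "RCP", "SIN"}, "queue": deque()},
    "tex":  {"ops": {"TEX"},               "queue": deque()},
}

def dispatch(instruction):
    op = instruction["op"]
    candidates = [u for u in units.values() if op in u["ops"]]
    target = min(candidates, key=lambda u: len(u["queue"]))  # shortest queue wins
    target["queue"].append(instruction)

for ins in [{"op": "MAD"}, {"op": "MUL"}, {"op": "RSQ"}, {"op": "ADD"}]:
    dispatch(ins)

print({name: len(u["queue"]) for name, u in units.items()})
# -> {'mad0': 2, 'mad1': 1, 'sfu': 1, 'tex': 0}
```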

Scheduling efficiency of this architecture would then just be a function of the length of the instruction queues. That said, such an architecture could well start to get rather out of sync, as it were, for its texture and framebuffer accesses, which could in turn result in decreased memory efficiency. One might then wish to add some more cache to the chip, and possibly a more efficient memory controller, to bring that efficiency back up.

Sound familiar to anybody?
 
Chalnoth said:
I could schedule them all pretty efficiently if, once an instruction is completed, the next instruction to be executed is analyzed and then placed in the appropriate execution queue. If multiple queues are available, it is placed in the shortest queue.
You just rediscovered Out of Order Execution ;)

The devil is, of course, in the details.
 
Bob said:
You just rediscovered Out of Order Execution ;)

The devil is, of course, in the details.
Well, this is very far from the out-of-order execution found in CPUs, where you're only interested in serial operation, and you have to find ways within a single thread to execute instructions out of order. Far from an easy task.

But with GPUs we have potentially thousands of threads that can all be completely independent of one another, so it becomes vastly easier to execute these threads out of order. This is somewhat more akin to the idea of Hyperthreading that was implemented in later Pentium 4 models.
 
Chalnoth said:
Well, this is very far from the out-of-order execution found in CPUs, where you're only interested in serial operation, and you have to find ways within a single thread to execute instructions out of order. Far from an easy task.
Oh sure, I totally agree. It's much easier to schedule work when you have more threads than execution units.

Btw, the scheduler you've described in your previous post is strikingly similar to Tomasulo's Algorithm, variants of which are implemented in real CPUs, typically to perform OoOE.

It's very much generalizable to multiple instruction streams.

The problem isn't so much the algorithm itself (which is fine, if you have no exceptions or interrupts), but rather the physical implementation thereof.
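For reference, here's a heavily simplified sketch of the reservation-station idea behind Tomasulo's algorithm (no register renaming, no functional-unit modelling, no result-bus arbitration, purely illustrative):

```python
# A heavily simplified sketch of the reservation-station idea: each waiting
# instruction holds the tags of the results it needs; when a result is
# "broadcast", any station whose sources are all available may issue, in
# whatever order that happens. (No register renaming, no bus arbitration.)

pending = [
    ("t2", {"t1"}),           # needs t1, so it stalls at first
    ("t1", set()),            # independent: issues first, out of order
    ("t3", {"t1", "t2"}),     # needs both earlier results
]

completed = set()
while pending:
    # Pick any station whose source tags have all been broadcast.
    ready = next(rs for rs in pending if rs[1] <= completed)
    pending.remove(ready)
    completed.add(ready[0])   # broadcast this result tag to the other stations
    print("issued", ready[0])
# -> issued t1, issued t2, issued t3
```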
 
Well, it's a pretty obvious algorithm. I mean, you have the problem of finding instructions to fill execution units. So why search for them? Why not just assign instructions to valid execution units as they become available? That was basically my reasoning, so I'm not surprised at all that it's been thought of before (and I expect it's very similar to what ATI does with the Xenos and R520, albeit likely with some significant alterations to optimize die space).
 
Crikey, what a good thread! Good stuff, nAo.

I don't know, or understand, enough about the low-level hardware tradeoffs this involves (locality of resources vs generality vs complexity, etc.), but I think it's similar in spirit to the CPU question of the tradeoffs between multicore and multiple threads per core (hyperthreading).

From a software POV, it's very useful that shading has a uniform feature set everywhere, so you don't have to worry about "does feature X work in THIS situation here?". But whether that's implemented with unified hardware or just uniform hardware is a tradeoff for the IHVs to decide.

I'd love to hear the thoughts of the IHV participants here.
 
Bob said:
You just rediscovered Out of Order Execution ;)

The devil is, of course, in the details.
One of them being the trade-off between a large batch/thread size for efficient OoOE and a fine granularity for efficient dynamic branching, maybe?
I was under the impression that this was exactly the reason NV GPUs operate on such large batches: it makes it easier to pick a fitting quad to process without having to shuffle too many internal resources around, and keeps the pipeline "flowing".
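As a toy illustration of the trend (a completely made-up model, not measured data): if each pixel independently takes an "expensive" branch with probability p, the whole batch has to run that path whenever at least one of its pixels takes it, so larger batches take it far more often.

```python
# Probability that a batch of B pixels has to execute the expensive branch,
# if each pixel takes it independently with probability p.
# (A completely made-up model, just to show the trend.)

p = 0.05
for B in (4, 16, 64, 256, 1024):
    prob_batch_diverges = 1 - (1 - p) ** B
    print(f"batch of {B:4d}: {prob_batch_diverges:.3f}")
```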
 
Bob said:
The problem isn't so much the algorithm itself (which is fine, if you have no exceptions or interrupts), but rather the physical implementation thereof.
I'm not a hardware designer, so correct me if I'm completely off target, but I think that even if you have tons of threads potentially ready to execute at any time and a number of functional units scattered all over the die, you don't want to make thread data (registers, instructions, program counters, etc.) travel back and forth everywhere, because it would be slow and costly.
I think one would like the flexibility to schedule any instruction while keeping thread data local at the same time.
Maybe that's one of the reasons why Xenos's ALUs are split over 3 processors: once a thread is assigned to a processor, it's not going anywhere.
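In toy form, something like this (the three arrays come from the Xenos descriptions; everything else is made up):

```python
# Toy thread-to-processor affinity: a thread is bound to one of 3 ALU arrays
# at launch; its registers live in that array's register file and never
# migrate. (Everything here is made up for illustration.)

import itertools

NUM_ARRAYS = 3
next_array = itertools.cycle(range(NUM_ARRAYS))
register_files = {i: {} for i in range(NUM_ARRAYS)}    # per-array local storage

def launch_thread(thread_id):
    array = next(next_array)                           # assigned once, stays put
    register_files[array][thread_id] = [0.0] * 32      # registers live locally
    return array

for t in range(6):
    print("thread", t, "-> array", launch_thread(t))
```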

ciao,
Marco
 
In R5xx each quad owns its threads (due to tiling locality). So each quad has a set of 128 threads (each of 16 fragments) to "play" with.

I don't know of any tiling locality in Xenos.

Jawed
 
Jawed said:
In R5xx each quad owns its threads (due to tiling locality). So each quad has a set of 128 threads (each of 16 fragments) to "play" with.

I don't know of any tiling locality in Xenos.

Jawed
Yeah.. but that is another kind of locality; I was not talking about screen-space locality (and thus memory locality with tiled framebuffers).
I was specifically talking about on-chip data locality.
 
The point is screen-space locality is translated, by way of thread-locality, into cache, register and ALU locality. Screen-space tiling effectively lies at the root of all locality, as far as I can tell. ATI people have talked vaguely about the trade-offs in altering the size of screen-space tiles. It sounded to me very much as though this tiling is the driver for locality of resource usage.

It seems to have allowed ATI to get away with only an L1 cache (no L2) - though there was that recent patent indicating ATI is going for L1/L2 at some point. Twas your thread:

http://www.beyond3d.com/forum/showthread.php?t=25332

Jawed
 
I'm also wondering how many ALU slots go wasted when you're doing 1-, 2- or 3-component ops and at the same time another op can't be co-issued.
Obviously it's not going to happen, but it would be nice to push 'diversification' even further and have only scalar ALUs.. ;)
 
Bob said:
The problem isn't so much the algorithm itself (which is fine, if you have no exceptions or interrupts), but rather the physical implementation thereof.

I have no clue how one can implement such an algorithm in a GPU (or a CPU?). Is there a "soft" part, with instructions, branching, etc., or is it directly hardwired? Perhaps this question can be rephrased as: does the scheduler use its own little embedded processor?

This may be science fiction again, but I like the idea of a specialised processor whose only job is to make sure the other processors are doing a good job.
 
Jawed said:
The point is screen-space locality is translated, by way of thread-locality, into cache, register and ALU locality.
Of course Jawed, but I was thinking about a unified shading GPU where resources should be mostly shared.
 
nAo said:
I'm also wondering how many ALU slots go wasted when you're doing 1-, 2- or 3-component ops and at the same time another op can't be co-issued.
NVidia addressed this fairly comprehensively in NV40 with the 4, 3+1, 2+2, 2+1 or 1+1 co-issue capabilities of each ALU.

ATI may not be able to do 2+2 - not sure.

I think vertex shaders are more flexible. Supposedly you can co-issue 5 scalars on ATI hardware (presumably the same instruction for all of them).

Anyway, as far as I can tell, the IHV optimisation guys like to reinforce the message of masking your instructions (e.g. .rg) at every possible opportunity, so that must be to maximise co-issuability in the pipeline.
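As a rough illustration of why the masking advice helps, here's a toy packer that greedily pairs masked ops into those co-issue splits (the splits are the NV40 ones listed above; everything else is made up, and it ignores dependencies, register ports and so on):

```python
# Greedily pair masked ops into one vec4 ALU issue slot using the
# 4, 3+1, 2+2, 2+1 and 1+1 splits. Each op is (name, component_count).
# (Illustrative only - a real compiler/scheduler also has to respect
# dependencies, register ports, etc.)

ALLOWED_PAIRS = {(3, 1), (2, 2), (2, 1), (1, 1)}

def pack(ops):
    slots, pending = [], list(ops)
    while pending:
        first = pending.pop(0)
        if first[1] == 4:                       # full-width op occupies the slot
            slots.append([first])
            continue
        # Look for a partner op that fits one of the allowed splits.
        partner = next((op for op in pending
                        if (first[1], op[1]) in ALLOWED_PAIRS
                        or (op[1], first[1]) in ALLOWED_PAIRS), None)
        if partner:
            pending.remove(partner)
            slots.append([first, partner])
        else:
            slots.append([first])               # issues alone, lanes go idle
    return slots

ops = [("mul.rgb", 3), ("add.a", 1), ("mad.rg", 2), ("rcp.r", 1), ("dp4", 4)]
print(pack(ops))
# -> [[('mul.rgb', 3), ('add.a', 1)], [('mad.rg', 2), ('rcp.r', 1)], [('dp4', 4)]]
```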

Jawed
 