HLSL 'Compiler Hints' - Fragmenting DX9?

Humus said:
Ah, I see. "The memory is good but short."
If you happen to be in the uni area on Nov 7, feel free to come to my presentation of the thesis work I did at ATI. :)

Hmm, maybe I will. When and where?
 
demalion said:
Why would you want to design your hardware in a non-traditional way that had register performance issues? That particular thing is what I'm saying IHVs have reason to avoid.

The NV3X is non-traditional in the sense that register usage isn't free, the way it tends to be on other architectures. Does that in itself mean it's a bad architecture? No.

"Bad" in what sense? It is indeed inefficient by a whole host of metrics.

If nVidia could implement their own compiler to reduce the register usage and bring this thing up to performance levels of the 9800, then it would be a good architecture. However, with DX9 HLSL they can't do this.

It's not because of the DX9 HLSL that they couldn't do this; it's because of the hardware. That's my point.

This way the GPU industry will just become more uniform and less innovative, simply because they have to design their hardware to meet certain expectations the compiler has.

Well, the expectation under discussion is not to discard computation opportunities due to a temporary register capability that is severely limited for your designated workload. I don't think that is much of an innovation, and I think there is quite a bit of room for quite a few other innovations that remain....like a solution for your architecture that permits you a better return on your transistor budget.

I'm not saying the register usage limitation is a feature, or in itself desirable. It's certainly a drawback. But why was it used in the NV3X? It probably added something useful somewhere else, or maybe just saved enough transistors to implement another feature or improve performance somewhere else in the pipeline. Now what if it is possible to achieve the same level of performance on such hardware as on more traditional hardware by using a compiler written for it? Would you still think the hardware is flawed?
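To make the compiler question concrete, here is a rough sketch in ps_2_0-style assembly (the instruction mix, register counts and scheduling are purely illustrative assumptions, not real shader or driver output) of the kind of rescheduling such a compiler would have to do: the same arithmetic arranged so that fewer temporaries are live at once.
Code:
; naive ordering: fetch everything first, then combine
; -> r0..r3 are all live at once (4 temporaries)
texld r0, t0, s0
texld r1, t1, s1
texld r2, t2, s2
texld r3, t3, s3
mul   r0, r0, r1
mul   r2, r2, r3
add   r0, r0, r2

; rescheduled: interleave fetches with arithmetic
; -> never more than 3 temporaries live at once
texld r0, t0, s0
texld r1, t1, s1
mul   r0, r0, r1
texld r1, t2, s2
texld r2, t3, s3
mul   r1, r1, r2
add   r0, r0, r1
Whether that sort of rescheduling (plus whatever the driver can do below the LLSL level) is enough to close the gap is exactly the open question.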
 
Bjorn said:
Humus said:
Ah, I see. "The memory is good but short."
If you happen to be in the uni area on Nov 7, feel free to come to my presentation of the thesis work I did at ATI. :)

Hmm, maybe I will. When and where?

Nov 7 at 13.00 in A1514.
 
Humus said:
I'm not saying the register usage limitation is a feature, or in itself desirable. It's certainly a drawback. But why was it used in the NV3X? It probably added something useful somewhere else, or maybe just saved enough transistors to implement another feature or improve performance somewhere else in the pipeline. Now what if it is possible to achieve the same level of performance on such hardware as on more traditional hardware by using a compiler written for it? Would you still think the hardware is flawed?

And register usage limitations aren't unique to NVidia. There's this little company in California that some people might have heard of, called Intel, whose architecture (IA-32) has a whole host of limitations that cause difficulty for compilers. Only with herculean effort on the part of compilers AND herculean hacks to the underlying x86 implementation (to get around x86 inadequacies) have they resurrected their awful CISC instruction set design from performance purgatory.

Besides the limited number of registers (after which you must spill to the stack or memory), you have conflicts with MMX, and an FPU design that has no registers whatsoever (it is stack based).
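To illustrate that last point, a minimal sketch of a = b + c on the x87 FPU (b, c and a are assumed memory operands); every operand is named only relative to the top of an eight-deep register stack:
Code:
; x87: a = b + c, everything addressed relative to the stack top
fld   dword ptr [b]    ; push b             -> st(0) = b
fld   dword ptr [c]    ; push c             -> st(0) = c, st(1) = b
faddp st(1), st(0)     ; st(1) += st(0), pop -> st(0) = b + c
fstp  dword ptr [a]    ; store and pop into a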

Is there anything inherently wrong with a stack-oriented pipeline that has no registers whatsoever? Should we accept an intermediate instruction set that forces future HW vendors to "target" their architecture towards it (in terms of datatypes, registers, instruction types)?

There are alternative architectures to the obvious ones you would implement given the DX9 spec, and I think future-proofing demands that the immediate mode API should support the highest levels of abstraction.
 
Humus said:
Bjorn said:
Hmm, maybe I will. When and where?
Nov 7 at 13.00 in A1514.
Now Lads! Remember what they say about meeting someone who you've only met in a chat room on the internet. Take an adult with you :)
DemoCoder said:
Is there anything inherently wrong with a stack-oriented pipeline that has no registers what so ever? Should we accept an intermediate instruction set that forces future HW vendors to "target" their architecture towards it (in terms of datatypes, registers, instruction types)?
Well it does have registers... it's just that you can't really access the damned things! It was the same with the transputer. AFAICS stack-based architectures are great for teaching simple compilers and CPUs but in practice they are pretty awful from a performance point of view.
 
Humus said:
...

I'm not saying the register usage limitation is a feature, or in itself desirable. It's certainly a drawback. But why was it used in the NV3X? It probably added something useful somewhere else, or maybe just saved enough transistors to implement another feature or improve performance somewhere else in the pipeline.

But we're talking about this particular register usage limitation, down to 2 or 4 registers, and its transistor savings and what it offers.

Here is the issue: this limitation wastes computational units, and the associated transistor budget, in the normal course of its operation. This is why the discussion of this is an extreme case. If the analogy still serves: it is like pointing out that it snowed on one spring day (in a situation where that is an anomaly), and using that to argue we should expect snow on another.

This doesn't change the issues with the HLSL/LLSL; it relates to why using the NV3x as a universal illustration is flawed.

Now what if it is possible to achieve the same level of performance on such hardware as on more traditional hardware by using a compiler written for it? Would you still think the hardware is flawed?

If you mean "compared to what the 9800 can offer", as I presume:

If it wasted computational units due to register limitations in a similar fashion and to a similar degree, I'd only view it as flawed in comparison to an innovation that allowed it to overcome that and outpace the competition. But that latter innovation is something they'd have reason to pursue anyway.

But to this specific discussion:

For the purposes of it being a demonstration that the HLSL/LLSL issue is something IHVs are more likely not to have reason to actually avoid, and of your argument that the case shows successful innovation being stifled, "No".

For the purposes of it showing that it is something other IHVs would actively want to emulate (the opposite of what it shows now, IMO), that depends on the process technology and clock speed challenges involved in accomplishing that. If we're talking about clock-for-clock parity, though, I'd also say "No". How that compares to the challenge of writing their own compiler depends on the IHV's ability to successfully recognize and deal with this issue. A very effective common glslang compiler starting point would go a long way towards answering that, though.

I also think that MS might have made different decisions for the HLSL/LLSL specifications based on this hypothetical NV3x, rather than the decisions they did make. But what they decided doesn't seem to be as much of a mistake as it could theoretically be, and the NV3x's problems seem to support that.

However, if the PS/VS 3.0 specifications don't change, I think the long-term applicability of what they expose offers much more opportunity for theoretical advantages to manifest (through a significant amount of work by IHVs). It is just that with glslang's delay in appearing and in demonstrating such work by IHVs, MS has time to establish whether they will adapt or not...see my prior conclusions relating to MS's competing. They seem to be displaying that they are willing to adapt in reaction to issues at the moment...the issue is whether they will adapt enough in time, and whether they are basing the decision on thorough information from what IHVs are telling them. That "in time" issue is directly determined by what glslang delivers and when, not just what it theoretically can deliver.

Regardless, whether glslang delivers an advantage or not is secondary to it putting pressure on MS to adapt or fail to compete with it, and to it maintaining a higher standard for the industry as a whole.
 
Simon F said:
AFAICS stack-based architectures are great for teaching simple compilers and CPUs but in practice they are pretty awful from a performance point of view.
Do not disparage the name of Forth in such a way!

(there has to be one in every crowd, so I'm just playing the part)
 
Simon F said:
AFAICS stack-based architectures are great for teaching simple compilers and CPUs but in practice they are pretty awful from a performance point of view.

I think the key is, "in practice". I'm not aware of any theoretical proofs that stack architectures are inherently less efficient. I tried looking for papers, but couldn't find any that claimed stacks were less efficient than registers. I did find this from comp.compilers:

Philip Koopman said:
Stacks work pretty well if someone takes the trouble to write a stack-scheduling compiler. The problem has always been that register-based machines with whizzy compilers were compared with idiot-simple stack code -- not a fair comparison. Similarly, most of the old "stacks are worse than registers/memory" arguments were based on small code snippets that didn't exploit reuse of on-stack variables, or didn't use realistic instruction sets that permitted nondestructive accesses to the top couple stack elements. The stack machines I worked on a decade ago were optimized for cost/performance, not just raw performance. (But they did pretty well at raw performance too.)

I published a paper on a first cut at such an optimizing stack compiler back in 1994; I've since moved on to other pursuits. See: http://www.ices.cmu.edu/koopman/stack_compiler/index.html It's suitable for stack architectures that keep the top 2 or 3 values in registers, which is commonly the case.

Of course, you could argue he is biased, since he is the author of Stack Computers. On the other hand, he seems to also be the one most likely to know. :) He explains here some of the optimizations he was able to achieve.
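To give a toy illustration of the "reuse of on-stack variables" point (the mnemonics are generic stack-machine pseudo-code, not any particular ISA): computing x*x + x naively fetches x three times, while nondestructive access to the top of the stack fetches it once.
Code:
; naive: every use of x is a separate fetch
PUSH x
PUSH x
MUL        ; x*x
PUSH x
ADD        ; x*x + x

; with nondestructive access to the top elements
PUSH x
DUP        ; copy the top of stack; x is fetched only once
DUP
MUL        ; x*x
ADD        ; x*x + x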

Of course, I only brought up stack machines as an example of how, historically, before the Wintel monopoly, there was a lot of diversity in the early CPU market. Today, GPU vendors still have a lot of freedom and flexibility in their designs. I don't want to see MS's DX assembly code become a sort of IA-32 that locks vendors into a potentially limiting design. Because we can do dynamic compilation, there is an opportunity to avoid the "binary compatibility" trap that has locked in past architectures and vendors.
 
DemoCoder said:
Simon F said:
AFAICS stack-based architectures are great for teaching simple compilers and CPUs but in practice they are pretty awful from a performance point of view.

I think the key is, "in practice". I'm not aware of any theoretical proofs that stack architectures are inherently less efficient. I tried looking for papers, but couldn't find any that claimed stacks were less efficient than registers. I did find this from comp.compilers:
I haven't got a theoretical argument but how about the following hand-waving exercise... :)

In the earlier days of computing, each instruction was executed in full by its own piece of microcode. That moved on to the ~RISC era, with pipelined instructions that overlap in execution. We thus have the situation that if we wanted to execute...
Code:
A := B+C
D := E+F
With a register machine (with a latency of 2 instructions) we'd do something like
Code:
LD R0, B
LD R1, C
LD R2, E
LD R3, F   ; Gives time for B and C to load
ADD R0, R0, R1   ;add B & C
ADD R2, R2, R3   ; add E & F
STO A, R0           ;hopefully B+C has finished executing by now...  
STO D, R2
If you like, you have a FIFO of results -- You feed in LD R0, LD R1, etc etc, and then get the results, in order, out the other end.

Now with a pure stack machine it's a First In, Last Out model...
Code:
PUSH B;
PUSH C;
ADD     ; Add B and C... note B and C might not have loaded... let's assume that the hardware is really clever and won't actually stall unless we tried to read the result
PUSH E;
PUSH F;
ADD       ;Same as above
POP D    ;STALL? 
POP A
Note that result of A is ready before that of D but that the FILO order means that we can't get at it. Of course we could add instructions to swap the order of entries on the stack... but then it isn't a true stack anymore. Why not just have direct mapped registers and save all the hassle in the first place?
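For what it's worth, a sketch of that "swap" escape hatch (again hypothetical pseudo-code): the stall moves out of the way, but the machine is no longer a pure stack.
Code:
PUSH B
PUSH C
ADD        ; B+C issued first
PUSH E
PUSH F
ADD        ; E+F issued second
SWAP       ; bring the older result (B+C) back to the top
POP A      ; B+C has had the extra instructions to complete
POP D      ; and E+F should be done by now too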

Also, in this example, I assumed that the data were all in memory - if they were in registers/stack to start with, the argument towards allowing register access is probably stronger.

Finally, with a stack machine you must specify pushes/pops to access the data, which can mean more, albeit simpler, instructions. The T800 transputer was very much of this ilk. To get more parallelism, you could try what Inmos did with their later T9000 transputer, which was to recognise patterns, group several of these simple instructions together, and run them at the same time... but then you are getting more like a VLIW/register-based machine, so why not start there in the first place?
 
Yeah, I thought about how it hampers pipelining and ILP, but then I thought: what about the idea that the stack-based units are simpler, so you can have a lot more of them, with hyperthreading to keep the units busy? Sort of a Connection Machine-on-a-chip approach (lots and lots of extremely simple processors). Isn't some ex-ID guy trying to do a startup or something on this?

Another approach is to use 2 stacks.

I'm not advocating stack-based architectures. I think the random access that registers provide allows more opportunities for balancing dataflow, but I can't help but wonder if restricting the dataflow order (LIFO, FIFO, or whatever) provides some clever optimization opportunities (besides a smaller transistor count) that I'm missing.
 