HLSL 'Compiler Hints' - Fragmenting DX9?

Humus said:
Bjorn said:
The closest thing to ICD programming offered would be the course in reactive programming, where we wrote a driver for a simple AD/DA converter to control some odd pendulum device.

Been there, done that :)

Not sure if I have asked this before, but I have a faint memory that I might have done so but don't remember anything ..
Are you studying at Luth now, or are you done? What did you study / are studying?

I finished my studies in '99 actually (system science). But I took some "free courses" this spring because I had nothing better to do :)

And yes, we "talked" about this before :)
 
Sorry to go OT a bit but... ( not like Humus and Bjorn aren't quite OT already ;) )
Talking of compiler technology...

I did some tests with the Det50s. I noticed a few nice register usage optimizations, among which an AUTOMATIC use of swizzle to reduce register usage.
That means if you always use r0.x->r12.x and never .y, .z, ..., the compiler will be smart enough to use four times fewer registers than the Det45s did.
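To make that concrete, here's a rough HLSL-level sketch (all names made up) of the kind of shader this helps: each scalar below would naively occupy the .x of its own temporary register, but a compiler that exploits swizzles can pack all four into the components of a single register.

sampler glossMap;

float4 main(float3 normal   : TEXCOORD0,
            float3 lightDir : TEXCOORD1,
            float3 halfVec  : TEXCOORD2,
            float2 uv       : TEXCOORD3) : COLOR
{
    float diff  = saturate(dot(normalize(normal), normalize(lightDir)));  // scalar temp 1
    float spec  = saturate(dot(normalize(normal), normalize(halfVec)));   // scalar temp 2
    float gloss = tex2D(glossMap, uv).r;                                  // scalar temp 3
    float power = pow(spec, 16.0f) * gloss;                               // scalar temp 4
    return float4(diff + power, diff + power, diff + power, 1.0f);
}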

Registers which are no longer needed are also automatically reused, although it seems this was already done with the Det45s, possibly not in as sophisticated a way; I didn't test it extensively.

So all register-usage optimizations, except the ones which require adding instructions to reduce usage, now seem to be applied automatically. Swizzling and register reuse were two of the three or four main points about optimizing for CineFX in the developer documents at developer.nvidia.com.
Now, of course, any developer still doing that stuff manually is just wasting his time ;)

I'd like to test some other stuff, but my testing is based on DX9 PS2.0 (actually, MDolenc, I based my testing on your FillrateTester interface; I was too lazy to redo the job), and I don't have sufficient OpenGL knowledge to test most of that stuff with the NV30 fragment program extensions, which expose the hardware's multiple precisions more natively.

The DX9 way of exposing multiple precision formats is simply awful IMO. It most likely wasn't designed with multiple-precision hardware in mind. But that's just me.

---

A few interesting things still have to be tested on the Det50s:
- Why are ALL numbers nearer to the theoretical maximums than before, even when practically no registers are used? Could it be a certain person I will simply call Th. was right all along, and overhead was an important factor?

- COS/SIN parallelism with other instructions has to be tested and compared to the Det45s. It seems to me it wasn't done in parallel with the Det45s, even though the CineFX patent documents said it could be. Could it be that part was never translated to real silicon?

- Making sure the swizzle optimization works in more complex scenarios is pretty important IMO. I only tested it in a very naive shader; it could be that the compiler isn't very smart about it...

- Register usage penalties: comparing NV30 (and NV35, but NV30 is preferable because older drivers are available) register usage penalties with old and new drivers, to see whether it has improved in cases where you CANNOT reduce the number of registers used.


Once again, sorry for going OT, but I didn't feel this was worth another thread.


Uttar
 
Humus said:
demalion said:
The fact that new profiles have to be consistently created...

No, the fact is that a new profile had to be created for the NV3x.

You have two scenarios. Either new profiles constantly need to be provided for new hardware, or IHVs just suck it up and design hardware around assembly shader specs instead of around their visions. Neither way is desirable.

Your characterization seems flawed to me, because designing "around assembly shader specs" such that new profiles "don't constantly need to be provided for new hardware" just means avoiding significant issues like the NV3x register issues for beyond 2 or 4 registers...LLSL just states the minimum goal for that. It doesn't mean implementing a hardware mapping of an identical bytecode interpretation point for point.

Trying for that is not "sucking" anything "up" besides doing what IHVs will have reason to do anyway, to create an efficient architecture. nVidia failed at achieving that, and didn't succeed in getting a return on the decisions they made (in the NV3x). ATI did not fail at it. Will other IHVs? Perhaps. Will other IHVs fail at implementing better in glslang? Perhaps.

I'm not arguing against the idea that glslang has a clear methodology for a solution to such a hardware issue happening, I'm arguing against the idea that DX HLSL has none besides "monthly application patches" and that there are no additional issues to be overcome for an IHV to implement the glslang solution over the DX HLSL/LLSL one.

demalion said:
Have you added anything to the discussion besides "well, maybe glslang won't do as advertised, I think, perhaps, but have no information to offer"
Really.

Why are so many of your questions rhetorical and insulting? What does this "add to the discussion", then?

Frankly, I share DC's view on your posts. No offense intended, but it tends to be more quantity than quality. Lots of words, little actual information, and a lot of speculation.

Sorry, but accusations like this are not additions to the discussion, and your usage here seems to me only suited to either:
  • Have me allow you to dismiss everything I said under some ambiguous and unspecified label of worthlessness, when your just sticking to the other parts of the discussion (like above or below this comment) would have served (at least this is better than doing this instead of the other parts)
  • Provide a self-fulfilling description of a portion of my posts because I actually responded to something so unspecified and ambiguous.
Please note that when I characterize something as something I'm "dismissing", I'm not being ambiguous or unspecified as to the justification and reasoning behind it, and: give the exact reasons why, ask if you understand what I'm saying, give you an opportunity to specify why you maintain disagreement, etc. That's why my posts are "greater quantity" than the posts with the ambiguous dismissals that they respond to. Witness this paragraph, and the next. :-?

I do find it annoying to have to explain this, especially as you take no exception to the insulting nature of DC's "question", because I've explained it plenty of times before. I find it offensive if you just continue to repeat the practice with no regard to applying the standard to yourself.

Aside from that part of the discussion, if you think I've offered no specific information on why I think glslang "might not do as advertised", and that this is the entirety of what I said, you would seem to be in error. If you're just going to state that and accept no accountability on your part for having done the above, what exactly are you adding with this? :-? This doesn't mean that you have to agree with what I said, or that it is automatically right; it just means that this label adds nothing worthwhile when discussing what I said would suffice. Please look through the parts of the text you don't like, and consider whether my commentary has some merit in describing the situation that resulted in your not liking it. Hopefully, my just stating something you disagreed with isn't sufficient for that.

demalion said:
Writing a compiler from scratch is easier than writing an ICD. Adding a compiler into an ICD isn't that complex, since, as I stated, the compiler doesn't need to "interact" with anything.

So there is no concern with any other state updates influencing shader output at all on the GPU's end? Perhaps in relation to things like fog, AA, and texture handling?

Nope.

Could you explain a bit further? You make it sound like it is impossible, for example, for AA to work as expected, a shader to work as expected, and a shader to malfunction under OpenGL when using AA? Is my characterization flawed? Perhaps DC's reply will cover this.
 
demalion said:
Could you explain a bit further? You make it sound like it is impossible, for example, for AA to work as expected, a shader to work as expected, and a shader to malfunction under OpenGL when using AA? Is my characterization flawed? Perhaps DC's reply will cover this.

I don't quite get the question. Fog is applied after the shader and has no impact on the actual shader; it's a post-processing operation after the fragment has left the shading pipeline. I also don't see how whether AA is switched on or off will affect compilation of a shader. It just affects fillrate and bandwidth (in terms of runtime state). You could equally ask "should the compiler interact with the current screen resolution?" (e.g. do something different for 1280x1024 vs 640x480?)

Sorry, I'm just not understanding what you're getting at. In theory, you could try to optimize based on what bandwidth or fillrate utilization "might be" based on AA on/off, but it's low on the list of runtime state that I think is important to the compiler.
 
Ilfirin said:
That would put most my concerns to rest. It's not that I'm dead set against JIT shader compilation (you can find posts by me in the past supporting it), it's that I'm against being forced to do JIT.
I think that being forced to do JIT is the only way for it to be effective.

Remember that the #1 reason to go for JIT compilation is to free hardware developers to provide more innovation in the instruction set. The only way this can be done is if most shaders are runtime compiled.

Of course, if compile times become an issue, it would be a good idea to support an extension that saves the shader object files (esp. if one can store more than one shader per file), so that a program would only have to use runtime compiling once, perhaps with a check against the video card's drivers to see when it needs to recompile the shaders.
 
demalion said:
Please note that when I characterize something as something I'm "dismissing", I'm not being ambiguous or unspecified as to the justification and reasoning behind it, and: give the exact reasons why, ask if you understand what I'm saying, give you an opportunity to specify why you maintain disagreement, etc. That's why my posts are "greater quantity" than the posts with the ambiguous dismissals that they respond to. Witness this paragraph, and the next. :-?
You would do well to find a way to be more concise, Demalion. I won't read most of your posts in this thread, because I just don't want to spend the time to do so. If you want to make a good point, make it short. Otherwise, people just won't read what you have to say.
 
DemoCoder said:
And overriding principles (shortest register count, shortest instruction count) are too simplistic to capture everything that needs to be done, because they are competing goals.
No, they're not competing goals if you don't have a significant performance penalty simply from increasing temporary register utilization beyond a very low number. Perhaps you mean from a hardware design standpoint? I agree, but then I'm not saying glslang doesn't have a theoretical advantage, I'm saying it is obviously in an IHV's interest to avoid certain mistakes as high priorities with a given goal, if possible.

They are competing goals on some architectures. Eliminating registers bloats code by forcing recalculations to occur. Let's say you are using 4 registers, and using a 5th register drops your performance by 25%. To avoid this, you eliminate the register by redoing some calculations, but in doing so, you added 25% more code. For example, maybe you normalized a register (3 instructions) and saved the result for later reuse in two other expressions. But you now have to eliminate this extra register, so you just do the normalize twice, instead of eliminating subexpressions.
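A minimal HLSL sketch of that tradeoff (function and parameter names are hypothetical): version A keeps the normalized vector in a temporary, version B recomputes it to free the register; which one wins depends on how steep the architecture's register penalty is.

// Version A: hold the normalized normal in a temporary.
// Fewer instructions, one more live register.
float2 lighting_cached(float3 rawNormal, float3 lightDir, float3 halfVec)
{
    float3 n = normalize(rawNormal);                  // ~3 instructions, result kept around
    return float2(dot(n, lightDir), dot(n, halfVec));
}

// Version B: recompute the normalize at each use.
// No extra register, but the ~3-instruction normalize is issued twice.
float2 lighting_recomputed(float3 rawNormal, float3 lightDir, float3 halfVec)
{
    return float2(dot(normalize(rawNormal), lightDir),
                  dot(normalize(rawNormal), halfVec));
}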

That seems a waste of a calculation unit for that clock cycle, purely due to a design limitation, unless this limitation saved you a lot of transistors or allowed you to do something significantly unique and useful. If it didn't, this design would seem to be a mistake, and a mistake that IHVs would be trying to avoid.

Now, does DX HLSL allow you to work around this for hardware that faces this problem? Yes, by a few methods currently: a new profile (as done for NV3x, the architecture that has required this so far), or the "LLSL"->GPU opcode compiler for an existing profile (if it allows it to be addressed effectively). Now, have I said there aren't issues with these, or have I said that the significance of issues depends on the IHVs failing to avoid the problem in the first place, and then comes down to how cumbersome the solutions are for all involved?
As far as disagreeing with you, it seems to me we depart fundamentally on a few key points: where you say users will have to download patches for every application every month; my questions about the apparent errors in your N*M discussion, at least AFAICS; your insisting on replacing this discussion with commentaries on MCD/ICD and Pentium 4/SSE that don't speak to what we are disagreeing on.

Now, does glslang allow you to work around this? Yes, by an IHV writing a compiler, and taking on the additional challenges that come with that. These challenges are new work an IHV has to do.
Insurmountable challenges? I certainly hope not, nor see why they should be...for the purposes of my disagreement with what you stated, I'm proposing the challenges are more than "10%" to the "90%" already done for DX HLSL. For the purposes of proposing my initial commentary, I'm also proposing that when these challenges are overcome, when they are necessary to be overcome, and whether an IHV is getting more out of them for their hardware, are all relevant to which delivers on their advantages.

...

Please feel free to discuss what you believe is my error in any and all of these, if you think I am in error (like this post seems to do). Please don't feel free to insist on discussing analogies that obscure these issues instead, while throwing insults at me, no matter what you think. I don't think such expectations are unreasonable at all.

The most optimal code is not necessarily at the extremes (shortest actual shader, or shader with fewest registers used), but is somewhere in between, and finding the global minimum is extremely hard.
Outside of the NV3x, what type of performance yield are you proposing from this compared to what can be done with the LLSL?

I'm just telling you that the issue is a lot more complex than just "shortest shader or shortest register count". Not all instructions have single-cycle throughput, and they certainly have differing latencies, so instruction selection is different for each piece of hardware. For example, LERP is way more expensive than MIN/MAX on NVidia hardware. RSQ is expensive on NVidia, so sometimes using a Newton-Raphson approximation is better.
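As a hedged sketch of that Newton-Raphson substitution (assuming the hardware offers a cheap low-precision estimate, modeled here as a half-precision rsqrt):

// Refine a cheap low-precision 1/sqrt(a) estimate with one Newton-Raphson
// step instead of issuing a full-precision rsq.
float rsqrt_refined(float a)
{
    float x = (float)rsqrt((half)a);         // low-precision initial guess (assumption)
    return x * (1.5f - 0.5f * a * x * x);    // one N-R iteration: x' = x*(3 - a*x*x)/2
}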

My premise was simply that: 1) expressing the functionality in the most compact form is an approach that seems widely applicable...the LLSL->opcode compiler does have the ability to address the type of issue you specify, 2) specific issues like the NV3x register problem are required to prevent that solution from being applicable (i.e., to require a new profile). Your characterizations (that I was addressing then) seem to depend on there being an abundance of such issues, and that they will be addressed one at a time each month, only by patching applications/the HLSL compiler, and I was pointing out that there seems to be a significant set of issues of that type that can be addressed otherwise, if they appear.

Also, symbolic high level manipulation can yield improvements, here's an example

Well, the LLSL has a "nrm" macro, but the HLSL compiler does have issues with properly implementing some macros. I presume this is one of them, then?
This is an issue with MS's DX HLSL implementation, not the overall approach (except as they are failing to meet their challenges...strengths and weaknesses, as I mentioned). If they don't address this in a certain window of application patching for affected applications, it will manifest as a direct advantage opportunity for glslang. This is not because I'm making excuses for MS, it is because of how that actually compares to how glslang is addressing it at the moment (not at all).

...example...
you've shaved off one rsq and one instruction, but traded them for a dot product. Depending on the HW, this may or may not be a win, since the RSQs might execute in a different unit and might be able to run in parallel, whereas the extra dot product has to run serially. Who knows.

Microsoft's profiles have to know about more than just general goals like "shortest X or Y"; they must also know about the individual timings and latencies of the VLIW instructions that will be used to implement the LLSL operations.

If they use the macros, they won't. If they don't use the macros, they will, and glslang will have to overcome less to deliver on more of its advantage. Our disagreement here consists of me pointing out that the LLSL has that macro...I understand the premise you've mentioned.

Ditto for predicate vs branch vs LERP vs CMP vs MIN/MAX

Well, I've covered what the LLSL specification allows...it is a matter of what MS and IHVs achieve in adapting each approach, and when.

In fact, FXC isn't able to eliminate the extra RSQ as I showed above; therefore, NVidia takes a hit, because their RSQ is expensive. In fact, FXC doesn't even generate DX9 normalize macros in the shader, which makes it even harder for the driver to rewrite the expression.
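In HLSL terms, the expansion looks roughly like this (illustrative only): the driver only ever sees the component operations, never a single 'nrm' it could pattern-match and replace with a native instruction.

// What normalize(v) is lowered to, rather than a single nrm macro:
float3 normalize_expanded(float3 v)
{
    return v * rsqrt(dot(v, v));   // roughly dp3 + rsq + mul in the LLSL output
}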

These are the same issue and problem, introduced in a confusing way here. Please clarify: if "FXC" generated the normalize macro (and others), then it would solve all of these problems you list, because the IHV would then be in direct control of these decisions, right? You seem to say that not expressing it using the normalize macro is a new problem in addition to the discussion before it...?

There are loads of other DX9 HLSL library functions which might be directly accelerated on future hardware like faceforward, smoothstep, and transpose.

No relation between the face register and faceforward function, or just some limitation that cannot be resolved by the drivers?
In any case, implementing these efficiently on hardware that could benefit would require an update and patch. Was your N*M discussion meant for hardware upgrades over time?

With LLSL, the semantics of these operations are lost because they are replaced with a code expansion.
It becomes very difficult for the driver to recognize what is happening and substitute alternatives using algebraic identities after that.

With the current LLSL, yes. If MS isn't looking forward at least a year or a bit more, this would be a pretty significant issue, because the LLSL needs changes beyond just the compiler and application needing (a minimum of) one patch to better target the LLSL specifications already made. The current hardware issues you brought up seem to be addressed in the LLSL, though.

Moving on to something new in the discussion, though:

Finally, with regards to JIT compilation and dynamic optimization, this does not incur significant overhead, and has been used for years on some systems (Smalltalk, Java).

The way it would work is this: the driver keeps a small table of statistics for the "most active" shaders used; it can do this in "debug mode" or in retail mode, it doesn't matter. For the most active shaders, the driver further records which of the runtime constants passed to the constant registers don't change very much.

After the driver has collected this profile information, the compiler can then use it to generate "speculative" compiles of hot shaders. A speculative compile is one where you ASSUME you know the values of those constants which you found not to change very often.

This can lead to constant propagation, algebraic and strength reduction opportunities, along with removing branches, min/max/lerps, etc.

Yes, I see how that is significant, I just don't see how it is something that reduces the burden of implementing glslang. However, in this case, I can see how a shared compiler that provided such tools could achieve that for all IHVs in a "solve once" fashion...much less of a hurdle than the bandwidth balancing proposition. Where is some information on what the glslang compiler baseline will do? This seems like a believable opportunity for capitalizing significantly on its advantages, maybe even at launch.

A few things, relating to the current discussion: How much CPU overhead will this add? How does that scale to repetition and constant usage for larger numbers of short shader execution dispatches? It seems this will be tested for every instruction using constants, and cascade (to some limited level) more work on top of what else is going on...how will that percentage of performance loss equate to performance gained from this given the CPU workload of a game? It seems likely that this won't be an issue for the "high end" at the time, but it will be for whatever manner of CPU usage occurs around the "baseline" the developers targeted for their game.
Also, as far as I understand, the HLSL->LLSL compiler simply can't do this, but the LLSL->GPU opcode compiler can (a clear miss of a higher-level optimization opportunity)...confirm that if you would.

Also, will there be any type of impact on "shader caching" on the GPU for having the host have to manage this? With higher bandwidth interfacing, this might be less of a problem though, and this seems a problem that might be solved for other reasons. Or is there never any type of "shader caching" (crude description perhaps) for GPUs?

You also compile a version of the shader that is based on not knowing the value of runtime constants.

Now the driver, armed with these two shader versions uploaded to the card, can choose which one really gets bound (when asked for) by looking at what constants were fed via the API. If the constants fed match up with the profile statistics, it chooses the "known constant" shader, if not, fallback to the "unknown" one.

Hmm...elegant, though I'm not clear of the amount of benefit and frequency of it within the limitations of execution for real time shaders in the next few years. However, the longer the shader, the more worthwhile this can be, and the less CPU overhead matters...though the more CPU overhead there is because of more constants used and more analysis to manage the tracking. Any thoughts on how this will pan out? Some factor I missed?

This technique is used in C and C++ compilers to overcome performance problems related to dynamic dispatch and polymorphism. Over the years, as programmers use languages that offer more dynamic method invocation (pointers to functions, et al), compilers have had a tougher time figuring out how to do global analysis and method inlining.

With speculative compilation, the compiler can use profiling data collected from real application runs to generate code that looks like this:

function foo(Object * b)
{
    if (b == X)                              // speculated common case
        { inline version of function B.BAR() }
    else
        b->BAR();                            // fallback: normal dynamic dispatch
}

It does this, because perhaps B.BAR() is called 90% of the time, but there is a rare chance that the 'b' pointer points to a different object.

With shaders, the compiler could speculatively propagate constants, and determine if the result yields an improvement based on some heuristic (e.g. don't do it if it only shaves off n cycles and those particular constant values only appear 70% of the time, since some cycles are lost because of state changes).
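A hedged HLSL illustration of that shader-side specialization (all names hypothetical): a generic shader driven by a runtime constant, next to the variant a driver could bind when profiling shows the constant almost never changes.

float4 fogColor;    // runtime constant
bool   fogEnable;   // runtime constant that profiling shows is almost always false

// Generic version: behavior decided by the runtime constant.
float4 ps_generic(float4 color : COLOR0, float fogFactor : TEXCOORD0) : COLOR
{
    if (fogEnable)
        return lerp(fogColor, color, saturate(fogFactor));
    return color;
}

// Speculative version, compiled assuming fogEnable == false: the branch, the
// lerp and the fogColor read all fold away. The driver binds this one when the
// constants fed through the API match the profiled values.
float4 ps_specialized(float4 color : COLOR0, float fogFactor : TEXCOORD0) : COLOR
{
    return color;
}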

There are, in fact, a boatload of compilation techniques available to GPU compiler authors, and OpenGL gives us a platform to explore this; DirectX9 does not.

Hmm...so you mean the LLSL->GPU opcode scheduler can't do this at all? I'm not sure why, though this discussion seems familiar...was this specified somewhere else before? AFAICS, it isn't "something or nothing", it is "higher level of abstraction or something not as abstracted". However, I also think that if this is done in the baseline glslang compiler, it would actually be a further advantage to glslang from that standpoint, as even if my understanding is correct, it seems this is something DX would absolutely have to require each IHV to do individually.
 
DemoCoder said:
demalion said:
Could you explain a bit further? You make it sound like it is impossible, for example, for AA to work as expected, a shader to work as expected, and a shader to malfunction under OpenGL when using AA? Is my characterization flawed? Perhaps DC's reply will cover this.

I don't quite get the question. Fog is applied after the shader and has no impact on the actual shader; it's a post-processing operation after the fragment has left the shading pipeline. I also don't see how whether AA is switched on or off will affect compilation of a shader.

I'm not saying they affect compilation of the shader, I'm saying they affect what the driver has to do to implement the rest of its functionality on the GPU, and what will result from executing the shader there. If implementing fog and AA and all non-shader features that might be used with shaders is discrete on the GPU, and drivers have no function in optimizing such implementation as part of their function, then this answers my question (no, the "interaction" that occurred to me is not a factor), but my current impression is the opposite of that.

It just affects fillrate and bandwidth (in terms of runtime state). You could equally ask "should the compiler interact with the current screen resolution?" (e.g. do something different for 1280x1024 vs 640x480?)

Well, actually, wouldn't your bandwidth balancing proposition end up doing exactly that?

Sorry, I'm just not understanding what you're getting at. In theory, you could try to optimize based on what bandwidth or fillrate utilization "might be" based on AA on/off, but it's low on the list of runtime state that I think is important to the compiler.

That would just introduce a more extreme case of what I'm talking about, along the lines of your characterization above. That's not what I'm talking about here, though...let me know if I'm still unclear.
 
Chalnoth said:
demalion said:
Please note that when I characterize something as something I'm "dismissing", I'm not being ambiguous or unspecified as to the justification and reasoning behind it, and: give the exact reasons why, ask if you understand what I'm saying, give you an opportunity to specify why you maintain disagreement, etc. That's why my posts are "greater quantity" than the posts with the ambiguous dismissals that they respond to. Witness this paragraph, and the next. :-?
You would do well to find a way to be more concise, Demalion.

It is hard to do that when people don't "listen", Chalnoth. That's my point.

OTOH, if you want to complain about the length of my long response to DemoCoder just above, though, that's a different matter. The only problem was that I had to repeat some things I'd already said more than once. But I think there is a lot more directly pertinent discussion in the quantity as well...feel free to point out to me where that didn't happen.

I won't read most of your posts in this thread, because I just don't want to spend the time to do so.
That phenomenon doesn't just seem to occur due to length, for some people.
If you want to make a good point, make it short. Otherwise, people just won't read what you have to say.

Well, I have a problem when that happens to the shorter posts too, Chalnoth, which is how the longer posts come about. I covered this, if you go back and find time to read the rest of my text around this...which wasn't all that long.

I get people literally asking me to say the same thing again, seeming to try and shift focus from issues with their statements to "annoying length". I think the better solution is to take the time initially to understand what you're arguing against, and admit when and how you failed to, if you do, instead of asking for people to simply restate things that are quite plainly stated already. There are other solutions (look at how many questions I ask that give people opportunities to directly answer what specifically I'm addressing instead of just having to repeat something).
But I can't force people to try that solution, and I don't view shorter and insulting posts as an acceptable alternative (though often more appealing). Maybe I should?
 
This seems to be the most direct opportunity to answer my bug/issue discussion.

DemoCoder said:
It is less likely that optimizer bugs will show up on the GPU. On the CPU, you can get obscure C compiler optimization bugs due to multithreading, over-eager dead code elimination (timing loops), stack unwinding, and memory alignment. Most of these won't be an issue for the GPU.

What falls outside of this "most"? Is it "almost never" or "sometimes, but not most of the time"? Either seems to relate to my concern, though the first might limit new occurrences from glslang.

The GPU's statelessness (no ability to modify shared data between pixels or vertices) makes a lot of bugs that have to do with stale register values go away, not to mention the lack of memory and stack as well.

Lack of memory and stack? You mean for individual programs in a multi-tasking environment, and the issues for interaction that introduced for standard programs, right?

On future "more CPU-like" architectures, it could be an issue. Something like the PS2's "Scratchpad ram" could be dangerous to use correctly in the context of shaders and optimizers due to concurrency and ordering issues.

It seems to me that as more opportunity for glslang to show significant advantages arise, more opportunities for these concerns arise for it as well. As in the July thread, I view PS/VS 3.0 implementation as when both of these issues will start expanding in significance most. Does this clarify what my concern is more effectively?
 
demalion said:
That seems a waste of a calculation unit for that clock cycle, purely due to a design limitation, unless this limitation saved you a lot of transistors or allowed you to do something significantly unique and useful.

Yes, but we are not arguing about the merits of Nvidia's HW, we are arguing about compilers being able to utilize bizarre architectures to their fullest potential.

Every design has tradeoffs. NVidia's is an extreme example, but I can imagine other scenarios, such as whether a comparison gets realized as a MIN/CMP/LRP/IF. Some native mechanisms take less slots, but more cycles. Sometimes the compiler will have to choose whether to use up more slots and run faster, because the shortest code isn't always the quickest.

Imagine if there is only one special "CMP" unit, but you have two CMP instructions to do (independently). The compiler might target one towards the special comparator unit, and another gets expanded as a LERP or perhaps handled by an IF/BRANCH unit.


Well, the LLSL has a "nrm" macro, but the HLSL compiler does have issues with properly implementing some macros. I presume this is one of them, then?

The compiler doesn't issue macros, and nrm is just one of the library instructions needed. How does the compiler communicate TRANSPOSE() to the driver today? Answer: it generates TWENTY instructions. What if HW had native support for matrix transpose operations just like it has HW support for swizzles today?

If they use the macros, they won't. If they don't use the macros, they will, and glslang will have to overcome less to deliver on more of its advantage. Our disagreement here consists of me pointing out that the LLSL has that macro...I understant the premise you've mentioned.

Yes, it has normalize, but doesn't use it. But they'd have to add macros for every "intrinsic" function in HLSL.


No relation between the face register and faceforward function, or just some limitation that cannot be resolved by the drivers?

Faceforward is a function used by shaders to ensure a vector points in the same general direction (e.g. angle between them is acute). If vector A points away from vector B, faceforward returns -A to "flip" it. It's a very commonly used function in shaders, along with reflect().

The limitation here is that there is no LLSL opcode to represent calls to these "intrinsic" functions, which may or may not have hardware-accelerated support on some platforms, or at least a more optimal microcode representation.

The FXC compiler instead "implements" faceforward() by generating the code to implement the function. After code motions and other optimizations, it is difficult for the driver to "recognize" the faceforward function by examining a stream of LLSL instructions and replace it with some native substitution.
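For reference, the DX9 HLSL intrinsic boils down to roughly this in HLSL terms, and it is this expansion (further shuffled by the optimizer) that the driver sees in the LLSL stream:

// Roughly what faceforward(n, i, ng) expands to: flip n when it faces the
// same way as the incident vector i, judged against ng.
float3 faceforward_expanded(float3 n, float3 i, float3 ng)
{
    return -n * sign(dot(i, ng));
}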



Yes, I see how that is significant, just don't see how it is something that reduces the burden of implementing glslang.
How much CPU overhead will this add?

Didn't say it made the implementation easier. I used it as an example of some of the types of optimizations you can do if you have driver support.

CPU would probably be a few % hit, but it can be done once during playtesting (to capture a bunch of profile statistics) and doesn't need to be done actually during the game, although obviously, full dynamism would catch more cases.
 
demalion said:
Your characterization seems flawed to me, because designing "around assembly shader specs" such that new profiles "don't constantly need to be provided for new hardware" just means avoiding significant issues like the NV3x register issues for beyond 2 or 4 registers...

And that's how it discourages innovation. What if you want to design your hardware in a non-traditional way? The NV3X is non-traditional in the sense that register usage isn't free, like it tends to be on other architectures. Does that in itself mean it's a bad architecture? No. If nVidia could implement their own compiler to reduce the register usage and bring this thing up to performance levels of the 9800, then it would be a good architecture. However, with DX9 HLSL they can't do this. This way the GPU industry will just become more uniform and less innovative, simply because they have to design their hardware to meet certain expectations the compiler has.

demalion said:
Frankly, I share DC's view on your posts. No offense intended, but it tends to be more quantity than quality. Lots of words, little actual information, and a lot of speculation.

Sorry, but accusations like this are not additions to the discussion

I'm not saying this to be insulting. All I'm saying is that your posts are long and fuzzy. Often I don't reply to much of your comments simply because I have no clue what you're arguing. Are you trying to provide information, opinion, speculation? I tend just to ignore posts I don't understand; others might get into a discussion about the semantics of words such as perhaps/maybe/may/might/could/should/will/won't/can't/must and the level of certainty they represent.

Not to mention the time it takes to parse through all those posts. I can only imagine how long it takes to type them. It seems you never really get to a point. Whenever you feel people don't listen or that you have already addressed something, the answer is not even longer posts. The answer is a one-liner that points that out, possibly with a quote of yourself.

I'm not saying this to offend you. It's just a piece of advice. People just skip through long posts, especially when they feel that the poster in question never takes clear stands. We don't have an infinite amount of time.

demalion said:
demalion said:
Writing a compiler from scratch is easier than writing an ICD. Adding a compiler into an ICD isn't that complex, since, as I stated, the compiler doesn't need to "interact" with anything.

So there is no concern with any other state updates influencing shader output at all on the GPU's end? Perhaps in relation to things like fog, AA, and texture handling?

Nope.

Could you explain a bit further? You make it sound like it is impossible, for example, for AA to work as expected, a shader to work as expected, and a shader to malfunction under OpenGL when using AA? Is my characterization flawed? Perhaps DC's reply will cover this.

If you've got a fully functional low-level interface in your driver, then you've got everything you need already. The compiler doesn't need any kind of interaction. Just because the shader was originally written in a high-level language doesn't make any difference to how the driver handles it once it has its hands on the hardware-specific binary version of it.
 
Yes, you're wrong on that. An error in the compiler would typically just fail to compile and return an error, or possibly just generate incorrect code, which would simply result in incorrect rendering. But I doubt the compiler would ever cause instant reboot or anything of that sort.

Ah, I should've been more clear: more far-reaching problems for the developer, as far as getting the same code to work on multiple platforms. If your shader code compiles differently across different platforms, how much of a pain would it be to get the same output, especially if some of them are broken? If you have a broken feature you can disable it for that hardware, right? How much more difficult, if at all, would it be to have to deal with different code output instead of broken functions? Would there be any cases in which you'd actually need to write more than one shader for different platforms just to get the same output?

I don't know enough about this to avoid making some assumptions, so I'd be grateful if you correct me where I'm wrong.
 
Ilfirin said:
Humus said:
The same iron shader should be reusable for each iron object.

Yes, but really, how many different iron objects do you expect from a game not set on the Titanic? A few dozen throughout the game, maybe. That's the amount of "uniqueness" I meant. That and the procedural thing.

Well, when I look around in my room, I see maybe ten or so different materials; materials different enough to require another shader. Individual differences are handled through parameters, and possibly other textures. I would expect that games would only require tens or maybe a hundred shaders, but not thousands.
 
Bjorn said:
Humus said:
Not sure if I have asked this before, but I have a faint memory that I might have done so but don't remember anything ..
Are you studying at Luth now, or are you done? What did you study / are studying?

I finished my studies in '99 actually (system science). But I took some "free courses" this spring because I had nothing better to do :)

And yes, we "talked" about this before :)

Ah, I see. "The memory is good but short."
If you happen to be in the uni area on Nov 7, feel free to come to my presentation of the examination work I did at ATI. :)
 
Eolirin said:
Ah, I should've been more clear: more far-reaching problems for the developer, as far as getting the same code to work on multiple platforms. If your shader code compiles differently across different platforms, how much of a pain would it be to get the same output, especially if some of them are broken? If you have a broken feature you can disable it for that hardware, right? How much more difficult, if at all, would it be to have to deal with different code output instead of broken functions? Would there be any cases in which you'd actually need to write more than one shader for different platforms just to get the same output?

I don't know enough about this to avoid making some assumptions, so I'd be grateful if you correct me where I'm wrong.

A compiler problem is generally easier to work around than other problems; the times I have hit a DX9 HLSL bug I have been able to work around it, though not always optimally. In cases where a particular vendor has a problem, you can always use #define to work around it for that vendor and leave everyone else unaffected.
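A hedged sketch of that #define approach (the macro name is made up; the application would define it, e.g. through the compiler's preprocessor defines, only when it detects the affected driver):

#ifdef WORKAROUND_VENDOR_NRM_BUG
    #define SAFE_NORMALIZE(v) ((v) * rsqrt(dot((v), (v))))   // manual expansion for the broken path
#else
    #define SAFE_NORMALIZE(v) normalize(v)                   // everyone else takes the normal path
#endif

float4 main(float3 n : TEXCOORD0, float3 l : TEXCOORD1) : COLOR
{
    float d = saturate(dot(SAFE_NORMALIZE(n), SAFE_NORMALIZE(l)));
    return float4(d, d, d, 1.0f);
}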
 
DemoCoder said:
demalion said:
That seems a waste of a calculation unit for that clock cycle, purely due to a design limitation, unless this limitation saved you a lot of transistors or allowed you to do something significantly unique and useful.

Yes, but we are not arguing about the merits of Nvidia's HW, we are arguing about compilers being able to utilize bizarre architectures to their fullest potential.

Actually, we're arguing about how that ability will manifest, and the NV3x is an example case you've already been using. It doesn't make sense to try and say the NV3x isn't relevant when we've been discussing the NV3x, just because I have established a point in our NV3x discussion about why IHVs have very good reason to avoid such situations.

Every design has tradeoffs. NVidia's is an extreme example, but I can imagine other scenarios, such as whether a comparison gets realized as a MIN/CMP/LRP/IF.

Yes, and nVidia's is the one you point to when maintaining that all of these need a new profile, because its situation seemed to need that. It being extreme compared to other situations is exactly my point as to why it fails to establish that.

Imagine if there is only one special "CMP" unit, but you have two CMP instructions to do (independently). The compiler might target one towards the special comparator unit, and another gets expanded as a LERP or perhaps handled by an IF/BRANCH unit.

The question for this issue seems to be whether there is a LLSL expression that would allow a LLSL->opcode compiler to make that choice? For PS 2.0 extended and higher, there seems to be one: if_comp. This seems to restrict the problem for LLSL to architectures that can't use PS 2.0 extended or higher, and whether the usage of a cmp instruction precludes it for a LLSL->opcode compiler.

Well, the LLSL has a "nrm" macro, but the HLSL compiler does have issues with properly implementing some macros. I presume this is one of them, then?

The compiler doesn't issue macros, and nrm is just one of the library instructions needed.

None at all? I recall the discussion around the SINCOS issue (the very significant issue that comes to mind) indicating that some macros have this issue and some do not. I guess I'll have to search for it.

How does the compiler communicate TRANSPOSE() to the driver today? Answer: it generates TWENTY instructions. What if HW had native support for matrix transpose operations just like it has HW support for swizzles today?

Again:

"In any case, implementing these efficiently on hardware that has that could benefit would require an update and patch. Was your N*M discussion meant for hardware upgrades over time?"

Normalization is the only one I proposed had an answer in the LLSL specification; I then commented about the rest separately where you brought them up.

If they use the macros, they won't. If they don't use the macros, they will, and glslang will have to overcome less to deliver on more of its advantage. Our disagreement here consists of me pointing out that the LLSL has that macro...I understand the premise you've mentioned.

Yes, it has normalize, but doesn't use it. But they'd have to add macros for every "intrinsic" function in HLSL.

Well, they'd have to add the ones that aren't expressed in a way suitable for released hardware using the targets, or compare unfavorably with glslang for that hardware. If this happens, I'd expect DX HLSL would have the choice of competing unfavorably or adapting in a timely fashion.

No relation between the face register and faceforward function, or just some limitation that cannot be resolved by the drivers?

Faceforward is a function used by shaders to ensure a vector points in the same general direction (e.g. angle between them is acute). If vector A points away from vector B, faceforward returns -A to "flip" it. It's a very commonly used function in shaders, along with reflect().

The limitation here is that there is no LLSL opcode to represent calls to these "intrinsic" functions, which may or may not have hardware-accelerated support on some platforms, or at least a more optimal microcode representation.

OK, we're on the same wavelength with "may or may not".

The FXC compiler instead "implements" faceforward() by generating the code to implement the function. After code motions and other optimizations, it is difficult for the driver to "recognize" the faceforward function by examining a stream of LLSL instructions and replace it with some native substitution.

Yes, this is the suitability issue of the LLSL. If MS chose badly given their communication with IHVs, this will become evident. I do worry about IHVs other than ATI and nVidia in this regard, but I also worry about them having resources for a glslang compiler. :-? That's why I'm so curious about the actual status of any "shared" glslang compiler baseline.

Yes, I see how that is significant, just don't see how it is something that reduces the burden of implementing glslang.
How much CPU overhead will this add?

Didn't say it made the implementation easier. I used it as an example of some of the types of optimizations you can do if you have driver support.

Then I'll just mention a reminder that my point is that capitalizing on these advantages is more work, and that when an advantage is offered is important, not that the possibility for them isn't there.

CPU would probably be a few % hit, but it can be done once during playtesting (to capture a bunch of profile statistics) and doesn't need to be done actually during the game, although obviously, full dynamism would catch more cases.

Hmm...I don't see how that would work well in an interactive environment, because it seems it would take application/scene awareness to manage applying a collection of profile statistics for a scene, and that the application would have to be managed by the IHV driver side. :?:
 
Ilfirin said:
You have to compile them eventually, if they're even going to be used (you'd obviously filter out every redundant or unused shader), even if you spread them out and only compile the shaders for one level when that level loads. If your compile times are 30 minutes and you have 10 levels, that's an extra 3-minute loading time for each level. Any loading times of more than a couple of seconds and I won't play a game.. and even those load times piss me off.
What I meant with carefully designing the API is the ability to save object code for reuse. That means recompiling only if you install a new compiler.

A given indoor FPS game scene generally only has a dozen or so objects in it, then there are maybe 50 or so scenes for a level, and 10-20 levels. 5*50*10=2500. *note: I'm not advocating the creation of 2500 shaders here.. just saying that you can easily have thousands of different shaders throughout a game without bogging down performance from all the shader changes (5 changes per frame is nothing).
It's not about objects, but materials. While your game can certainly contain thousands of different objects, having thousands of different materials that require all different shaders is a massive waste of work IMO. I'd say the room I'm just sitting in could be rendered in full detail with <30 configurable shaders. And it contains many objects :D

I don't know about you, but my CPU apps never turn into one long switch statement..
Not one long switch statement, but certainly many possible paths. That's what flow control is about.

Very bad analogy. We're not talking about different configurations of the same routine, we're talking about a bunch of different routines (as different as a bloom shader is from a phong lighting shader). Doing what you're suggesting is analogous to taking all your RayTriangleIntersect(), SphereSphereIntersect(), etc routines and dropping them all in one Intersect( <function id> ) routine, that's just one big switch statement.
Then you got me wrong. You don't need a big switch statement, but you don't need two different shaders to enable/disable detail texturing or bump mapping either. If you have several shaders that are 95% identical, you can put them all into one shader, which should be more efficient than switching them all the time. This will leave you with dozens, but never thousands of shaders.
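A minimal HLSL sketch of that idea (names hypothetical): two near-identical shaders folded into one configurable shader, toggled by a boolean constant instead of a shader switch.

sampler baseMap;
sampler detailMap;
bool    useDetail;   // set per material instead of binding a second shader

float4 main(float2 uv : TEXCOORD0) : COLOR
{
    float4 color = tex2D(baseMap, uv);
    if (useDetail)                                        // static flow control, or flattened
        color.rgb *= tex2D(detailMap, uv * 8.0f).rgb * 2.0f;
    return color;
}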
 
Humus said:
demalion said:
Your characterization seems flawed to me, because designing "around assembly shader specs" such that new profiles "don't constantly need to be provided for new hardware" just means avoiding significant issues like the NV3x register issues for beyond 2 or 4 registers...

And that's how it discourages innovation. What if you want to design your hardware in a non-traditional way?

Why would you want to design your hardware in a non-traditional way that has register performance issues? That particular thing is what I'm saying IHVs have reason to avoid.

The NV3X is non-traditional in the sense that register usage isn't free, like it tends to be on other architectures. Does that in itself mean it's a bad architecture? No.

"Bad" in what sense? It is indeed inefficient by a whole host of metrics.

If nVidia could implement their own compiler to reduce the register usage and bring this thing up to performance levels of the 9800, then it would be a good architecture. However, with DX9 HLSL they can't do this.

It's not because of the DX 9 HLSL that they couldn't do this, it is because of the hardware. That's my point.

This way the GPU industry will just become more uniform and less innovative, simply because they have to design their hardware to meet certain expectations the compiler has.

Well, the expectation under discussion is not to discard computation opportunities due to a temporary register capability that is severely limited for your designated workload. I don't think that is much of an innovation, and I think there is quite a bit of room for quite a few other innovations that remain....like a solution for your architecture that permits you a better return on your transistor budget.

demalion said:
Frankly, I share DC's view on your posts. No offense intended, but it tends to be more quantity than quality. Lots of words, little actual information, and a lot of speculation.

Sorry, but accusations like this are not additions to the discussion

I'm not saying this to be insulting. All I'm saying is that your posts are long and fuzzy.

"Fuzzy"? Not absolute or coming to a conclusion of certainty, you mean?

Here is the thing: whether glslang or DX HLSL will deliver faster code is a complicated issue. Discussing a conclusion as a given based on some of the information and ignoring other parts of the information, or simplifying things to either "DX HLSL is the best possible in all ways" or "glslang is the best possible in all ways", is a distortion of the issue.

A "might/perhaps/maybe...":

Glslang comes out tomorrow, and delivers performance on par, or better, than with DX HLSL. The issue just became a lot less complicated, and a lot of what I've discussed becomes resolved. After some shake down and exploration time to answer the questions of conflict and bugs (might take a while, unless there has been a lot of developer testing going on and we get access to their observations), and almost certainly after another IHV enters the arena with a good glslang implementation, it should be pretty black and white how they compare.

But we're not holding this discussion then, we're holding it now.

condensed per SPCMW* directive said:
Often I don't reply to much of your comments simply because I have no clue what you're arguing. Are you trying to provide information, opinion, speculation?

Probably all of them...why does classification matter? I typically make it clear what my conclusions are, and what went into them.

Consider my prior example: I state the "fact" that it didn't rain yesterday. I say this fact supports my saying that it will not rain today, and suggest that you'll just burden yourself taking an umbrella. Someone else "speculates", based on considering the temperature, barometric pressure, and season, that it might be smart to carry an umbrella, because that "fact" distorts the issue.

Is the person suggesting taking an umbrella being less definite in their conclusion? Yes. Is that reason to dismiss it? No. Why not check a forecast that looks at weather satellite photos and collects barometric pressures, wind speeds, etc, from many locations? Well, for this thread, we don't have that detail level of information in a convenient place, and I was simply trying to take a step in gathering such information. My subsequent conversation is less concerned with defending that I was right in my conclusions, than it is with continuing that exploration by discussing which additional factors not of a "it didn't rain yesterday" nature do or do not clarify things, however they impact my conclusions. That does seem to necessitate rejecting commentary that seems flawed, and stating why.

I tend just to ignore posts I don't understand; others might get into a discussion about the semantics of words such as perhaps/maybe/may/might/could/should/will/won't/can't/must and the level of certainty they represent.

Actually, two words "might" and "will"...pretty simple. That long list sure is scary though. :p My point: You distorted the issue.

Not to mention the time it takes to parse through all those posts. I can only imagine how long it takes to type them. It seems you never really get to a point.
Well, you see, I said you were in error before about my not getting to a point, and gave you a clear path for discussing how you were not in error. Do you recall this? My point: My entire discussion about "ambiguous and unspecified accusations" was a point you seem to have missed, as it tries to answer this accusation.

A restatement of that point, because I assume you read it and I failed to communicate this: Look again at prior posts...is it that I never get to a point, or that the other side of the conversation never acknowledges that I made one, and I have to make the same point again to something new they say? I think if you pick an example and go through it, you'll see examples of the latter pretty clearly.

Whenever you feel people don't listen or that you have already addressed something, the answer is not even longer posts. The answer is a one-liner that points that out, possibly with a quote of yourself.
That's in there too, Humus. It doesn't work any better, and can be ignored just as easily...that's why I expand on my explanation of the problem with a statement. What you're actually asking for is a host of "See earlier" commentary, and that doesn't seem to have been any clearer when I tried that.
I'm not saying this to offend you. It's just a piece of advice. People just skip through long posts, especially when they feel that the poster in question never takes clear stands. We don't have an infinite amount of time.
Query: what's the excuse for skipping through my initial post? Were the "My opinion" parts and the summary not "points"? Was it really too long for the issue at hand? That skipping being done is what resulted in the longer posts that followed...and that is my point.

As for the "interaction" topic, I think that other post is a clearer place to address the issue...my reply to what you said would be very similar to what I said there.

*Society for the Prevention of Cruelty to Mousewheels
 