Nvidia's unified compiler technology

Ostsol said:
Why? It's still the same source that runs on every other board out there.

And the "non optimized" source also runs on every other hardware out there. So why create a new code path in the first place?

Looking at the parallel OpenGL situation: Is the NV3X code path (edit: for Doom3) "nVidia-specific"? Why use that code path when the ARB2 path runs on every other (DX9-class) board out there?
 
DemoCoder said:
Of course, this only matters if you have a non-trivial architecture (e.g. not a direct map to dx9 assembly), where you might have a significant disconnect between what people are targeting, and what your hardware actually does.
That's an additional very good point. It seems likely that hardware will evolve to become less like the API specification over the long term (cf. x86, where the execution engines bear little resemblance to the instruction set), and therefore compile times will get longer.

But I wouldn't say that's the only reason compile times might be long. A 2000-instruction program is never going to be quick to optimise - optimisers can be worse-than-O(n^2) algorithms (although something between O(n) and O(n^2) is likely to be the common case)... and the constant factors are probably not small, either.

I really don't want people to get into the bad habit of thinking 'JIT might work for this 20-instruction shader' and then three years down the line the engine has come to rely on JIT but needs 300 instruction shaders... compile shaders only at load time please!
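A rough worked example of why the instruction count matters so much, assuming a straightforwardly quadratic optimisation pass (the numbers are purely illustrative):

\[
\frac{T(2000)}{T(20)} = \left(\frac{2000}{20}\right)^2 = 10{,}000
\]

so a pass that costs 1 ms on a 20-instruction shader would cost on the order of 10 s on a 2000-instruction one.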
 
I agree, Dio. I'm just thinking that eventually the API will evolve beyond just CompileShader(source) and will need an API for setting compiler flags, just like command-line compilers have, e.g. "disable optimization" (if you need to debug and figure out whether the problem is your code or the optimizer), "no predicates", "Newton-Raphson replacement", etc.

Perhaps we should use the term JAL, "Just At Load", instead of JIT. I guess what I was getting at is that the installer could run a "benchmark scene" to gather some profile data and then shove it back to the JAL compiler to enable profile-directed optimization.
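Purely as an illustrative sketch of that install-time flow (nothing here is a real API; runBenchmarkScene, ShaderProfile and recompileWithProfile are names invented for the example):

```cpp
// Hypothetical install-time profile-directed optimisation, as a sketch only.
#include <map>
#include <string>

struct ShaderProfile {
    double gpuTimeMs   = 0.0;  // measured cost of this shader in the benchmark scene
    long   invocations = 0;    // how often it was bound
};

// Step 1 (installer): run a canned benchmark scene and record per-shader costs.
std::map<std::string, ShaderProfile> runBenchmarkScene() {
    return {};  // stub: a real implementation would render the scene and time each shader
}

// Step 2: hand the profile back to the "JAL" compiler so it can spend its optimisation
// budget on the shaders that actually dominate the frame.
void recompileWithProfile(const std::string& shaderKey, const ShaderProfile& profile) {
    (void)shaderKey; (void)profile;  // stub: would drive the driver's compiler here
}

void installTimeOptimise() {
    for (const auto& entry : runBenchmarkScene()) {
        recompileWithProfile(entry.first, entry.second);
    }
}
```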
 
I'm not convinced.

My personal view is that if a benchmark came out that allowed you to test 'with' and 'without' optimisations - and it would! - everyone would probably just ignore the 'disable optimisations' flag to avoid having the 'raw hardware' labelled slow. (This is even before getting into the argument of 'what is an optimisation, and what is a code generation convention?')

If the optimised code gets the same result, how could you ever tell it was on or off?
 
Well, that's just one example of the flags you pass to compilers. Pick your favorite compiler and just count the number of compile-time flags available.

BTW, I wasn't suggesting that the flags would be standardized - obviously everyone's would be different - it would be more like an extension mechanism, e.g. setCompilerFlag("ATI_foo", "true").
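A minimal sketch of what such an extension-style flag mechanism could look like (entirely hypothetical; neither the class nor the flag names exist in any real runtime):

```cpp
// Hypothetical flag-extension layer on top of a plain CompileShader(source) call.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

class ShaderCompiler {
public:
    // Non-standardised, vendor-specific flags, e.g. setCompilerFlag("ATI_foo", "true")
    // or setCompilerFlag("disable_optimization", "true") when hunting an optimiser bug.
    void setCompilerFlag(const std::string& name, const std::string& value) {
        flags_[name] = value;
    }

    // A real driver would honour the flags while translating 'source' to hardware code;
    // an empty blob is returned here just to keep the sketch self-contained.
    std::vector<uint8_t> compileShader(const std::string& source) {
        (void)source;
        return {};
    }

private:
    std::map<std::string, std::string> flags_;
};
```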


A few might be standardized for usefulness and commonality, however (generate debug code or profile data, for example). Debug drivers are already available today, yet no one is running benchmarks on them and showing IHVs' hardware in a bad light. I'm not convinced this will really be the negative issue you make it out to be.

The point of being able to turn the optimizer off is that optimizers sometimes have bugs, and if a developer couldn't test whether the driver was at fault, he could go insane trying to find the problem in his own code.
 
Dio said:
I really don't want people to get into the bad habit of thinking 'JIT might work for this 20-instruction shader' and then three years down the line the engine has come to rely on JIT but needs 300 instruction shaders... compile shaders only at load time please!

Load time is any time. You cannot slow down loading without degrading the game's run-time performance significantly. You had better not hog the processor and/or take a long time to get the results back, otherwise you're going to annoy a lot of gamers.

Many of us are using streaming worlds today; loading screens are becoming things of the past. That means very little time to 'process' data at load, whether it's textures, vertex buffers, game data or shaders. We can (perhaps) compile the HLSL at author time, but we will find it difficult to submit all the low-level shaders beforehand. There are thousands, usually auto-generated by stitching fragments together; when I load a segment of the world, I will discover what shaders are needed and compile/assemble them there.

An initial-submission strategy would imply the driver getting (in future) 10,000s of shaders... I doubt the driver is going to handle that well.

That's why the OGL2 driver-level compiler has serious issues, AND why IHVs have to give us an option to assemble low-level shaders fast. I only have a few seconds to get several megabytes of data into the computer and ready to be used; the process of creating the shaders can't take much of that (I'm probably counting in milliseconds per shader compilation).

I suspect going forward we're going to need API-level optimisation settings. If I can hint to the optimiser that I want the result back quickly but less well optimised, or that it can take its time and do it well, it would help both sides...
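To make the 'milliseconds per shader' constraint concrete, here is an illustrative sketch (not a real engine interface; compileLowLevelShader stands in for whatever the API/driver provides) of budgeted compilation while a world segment streams in:

```cpp
// Budgeted, streamed shader compilation: spend at most 'budgetMs' per frame compiling
// the shaders a newly loaded world segment needs, then give the CPU back to the game.
#include <chrono>
#include <deque>
#include <string>

struct PendingShader {
    std::string stitchedSource;  // auto-generated by stitching fragments for this segment
};

std::deque<PendingShader> g_compileQueue;

void compileLowLevelShader(const std::string& source) {
    (void)source;  // stand-in for the real driver/API compile or assemble call
}

void pumpShaderCompiles(double budgetMs) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    while (!g_compileQueue.empty()) {
        compileLowLevelShader(g_compileQueue.front().stitchedSource);
        g_compileQueue.pop_front();
        const double elapsedMs =
            std::chrono::duration<double, std::milli>(clock::now() - start).count();
        if (elapsedMs >= budgetMs)
            break;  // out of budget this frame; finish the rest next frame
    }
}
```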
 
Joe DeFuria said:
RussSchultz said:
Why? It's still the same source that runs on every other board out there.

And the "non optimized" source also runs on every other hardware out there. So why create a new code path in the first place?
They didn't. They optimized the old one.

Looking at the parallel OpenGL situation:
Well, it certainly doesn't intersect...
Is the NV3X code path (edit: for Doom3) "nVidia-specific"?
That is apparent, isn't it?
Why use that code path when the ARB2 path runs on every other (DX9-class) board out there?
Because it works better than the other one on that set of hardware.
 
RussSchultz said:
They didn't. They optimized the old one.

Not sure what you mean... Doom3 has an ARB2 path that any "GL-compliant" card (that supports a certain feature set) should run, including nVidia's DX9 hardware.

So why is there an NV3x path in Doom3, considering the fact that FX cards can run the ARB2 path just fine?

That is apparent, isn't it?

Exactly. Just as it's apparent to me that any "special DX9 paths" are there specifically for nVidia hardware. Because they don't run "full featured" DX9 paths particularly well. The same reason why Doom3 has an NV3x path.

Are other, non-nVidia cards going to be able to take advantage of the NV3x path in Doom3?

I know what you're saying, Russ. But to pretend that these "special" DX9 (or hybrid DX8/DX9) paths are not being put in place specifically for nVidia's benefit is just not looking at the reality of the situation.
 
Joe DeFuria said:
Exactly. Just as it's apparent to me that any "special DX9 paths" are there specifically for nVidia hardware. Because they don't run "full featured" DX9 paths particularly well. The same reason why Doom3 has an NV3x path.

Are other, non-nVidia cards going to be able to take advantage of the NV3x path in Doom3?

I know what you're saying, Russ. But to pretend that these "special" DX9 (or hybrid DX8/DX9) paths are not being put in place specifically for nVidia's benefit is just not looking at the reality of the situation.

I don't think anyone has said that they're not there for Nvidia. But the question was whether they needed to make two code paths, and they don't.
 
Joe DeFuria said:
But to pretend that these "special" DX9 (or hybrid DX8/DX9) paths are not being put in place specifically for nVidia's benefit is just not looking at the reality of the situation.
I never said partial precision hints weren't put in there for NVIDIA's benefit. It's clear that they do benefit them, and any other part that comes out that finds a benefit in partial precision hints. It's clear that Microsoft deemed them important enough to include in the spec.

But why you insist on calling them "special" paths is beyond me. They're completely legitimate DX9 syntax, and they do no harm to anybody else.

Unless you're confused about what we're talking about: Uttar and I were discussing the unified shader compiler and partial precision hints. You seem to be focusing on Doom3's vendor-specific support for the NV3x series.
 
Deano, that's a very interesting view which I'll make sure is better known inside ATI (it might well be, except by me ;) ). I agree that this kind of model isn't the best for streaming worlds.

DeanoC said:
(I'm probably counting in milliseconds per shader compilation).
That's not too bad. I was worried that you were talking 10-20us per compile!

If '<1 second to compile 100 shaders' is acceptable, then things should look OK until we get into the thousands-of-instructions range.
 
Ultimately, I think you could benefit from shader LODing...

So, for streaming, you could have maybe 25 basic shaders whose sole purpose is to provide a low-quality representation of the complex shaders requiring compilation.
That way you don't lose time to compilation - while the full shader compiles in a secondary thread, you just use the low-quality representation.

Heck, if you wanted to be really insane, you could use occlusion queries to see how much that low-quality representation is hurting IQ, and set compilation priority and required speed based on that. But that seems needlessly complex to me! :)


Uttar
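A minimal sketch of that fallback-while-compiling idea (illustrative only: the compile call and handle type are placeholders, and in practice the background compile would go through a proper job system and an API/driver that tolerates it):

```cpp
// Draw with a cheap precompiled stand-in shader until the full shader has finished
// compiling on a secondary thread. Names and types here are placeholders, not an API.
#include <atomic>
#include <string>
#include <thread>

using ShaderHandle = unsigned int;

ShaderHandle compileFullShader(const std::string& source) {
    (void)source;
    return 1;  // placeholder for the real (slow) compile
}

struct Material {
    std::string       source;
    ShaderHandle      lowQualityShader = 0;   // one of the ~25 precompiled basic shaders
    ShaderHandle      fullShader       = 0;
    std::atomic<bool> fullReady{false};

    void startBackgroundCompile() {
        std::thread([this] {
            fullShader = compileFullShader(source);  // slow work off the render thread
            fullReady  = true;                       // published after fullShader is written
        }).detach();
    }

    // Each frame: draw with whatever is ready right now.
    ShaderHandle shaderForThisFrame() const {
        return fullReady ? fullShader : lowQualityShader;
    }
};
```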
 
I was just going to write something along the lines of what DeanoC said :)
If you start whacking together a serious material system, your count of possible permutations easily goes into the thousands. As a wild shot, I'd say I can bear 20~30ms compile time in a JIT scenario if it's only a single frame hitch.
Somewhat OTish, is it possible that this is what happened to Halo's performance? Are they 'uploading shader code' all the time instead of reusing established shader objects?

On the topic of permutations, I seem to remember GLslang provided a linker mechanism that can eliminate the parsing (and syntax check) from the JIT equation. That was one of the things I really liked about the design though I must admit I haven't followed where things currently stand.
 
DeanoC said:
An initial-submission strategy would imply the driver getting (in future) 10,000s of shaders... I doubt the driver is going to handle that well.

zeckensack said:
If you start whacking together a serious material system, your count of possible permutations easily goes into the thousands.

We had this discussion recently in another thread.

We shouldn't demand that a driver handle outrageous cases well. I have no sympathy for the developer who feeds the API 10,000 shaders. The application is flawed. It's poor usage of the API, and it's not the driver's responsibility to accommodate misuse of the API. It's the application's responsibility to use the API properly if you expect any level of performance, and that includes load-time performance.

I don't have any sympathy for developers using 10,000 vertex buffers either. They made a poor design decision, and they have to suffer the consequences, namely poor performance. Just because you have 10,000 objects doesn't mean you should have 10,000 vertex buffers. Vertex buffers are flexible enough for you to pack many objects into each buffer and yet draw them independently if necessary. Just because you have 10,000 objects doesn't mean you should have 10,000 shaders. Shaders are very flexible, and if you want any level of performance, at run time and at load time, you had better learn to use them properly.

I'm opposed to writing drivers according to the standards of badly written applications.
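For what it's worth, a minimal illustration of the vertex-buffer packing being described here - many objects appended into one shared buffer and drawn independently by offset (glDrawArrays is real; the bookkeeping around it is just illustrative):

```cpp
// Many meshes packed into one shared vertex buffer, each drawn by its own offset/count.
#include <GL/gl.h>
#include <vector>

struct PackedMesh {
    GLint   firstVertex;  // where this object starts inside the shared buffer
    GLsizei vertexCount;
};

std::vector<PackedMesh> g_meshes;  // filled in as meshes are appended to the big buffer

void drawAllMeshes() {
    // Assumes the shared vertex buffer is already bound and its vertex format is set up.
    for (const PackedMesh& mesh : g_meshes) {
        glDrawArrays(GL_TRIANGLES, mesh.firstVertex, mesh.vertexCount);
    }
}
```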
 
GLslang does provide a linker, so in theory you would just compile each shader fragment once and then link the fragments together. If the driver does non-trivial optimisation at the linking stage this obviously won't give a speedup, though. And it might be tricky to implement.

In addition, the ARB thinks giving compiled fragments back to the app is a very interesting way to make sure you only pay the compilation cost once, so you might see an API for getting binary "machine code" back.

Another thing worth thinking about is that as real-time shaders get longer they will also become more general and allow things like branching, which will reduce the number of shaders needed.
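A sketch of that compile-once/link-per-permutation model, written against the later GL 2.0 entry points for brevity (at the time it would have been the ARB_shader_objects equivalents such as glCompileShaderARB/glLinkProgramARB):

```cpp
// Compile each reusable fragment once, cache the shader object, and pay only the link
// cost per permutation. (If the driver does heavy optimisation at link time, the saving
// is smaller, as noted above.)
#include <GL/gl.h>
#include <GL/glext.h>  // or an extension loader that provides the GL 2.0 shader entry points
#include <vector>

GLuint compileFragmentOnce(const char* source) {
    GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(shader, 1, &source, nullptr);
    glCompileShader(shader);   // the expensive parse/compile happens once per fragment
    return shader;             // keep this object around and reuse it across permutations
}

GLuint linkPermutation(const std::vector<GLuint>& compiledFragments) {
    GLuint program = glCreateProgram();
    for (GLuint fragment : compiledFragments)
        glAttachShader(program, fragment);
    glLinkProgram(program);    // only the link stage is paid for each new permutation
    return program;
}
```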
 
Joe DeFuria said:
But to pretend that these "special" DX9 (or hybrid DX8/DX9) paths are not being put in place specifically for nVidia's benefit is just not looking at the reality of the situation.
I'd argue that the optimizations aren't being put there for Nvidia's benefit, but for the benefit of Nvidia card owners and the game developers. Most developers aren't going to put in optimizations because they care about an IHV's success. They just want as many people as possible to be able to play their game.

Enough philosophy. Now back to the good stuff. I read this forum because others often think of things I don't, so it's enlightening to read a technical thread that presents many sides of a discussion.
 
You could compute an MD5 checksum for each shader and store the compiled result alongside the checksum. Then you would only have to compile each shader once; the next time, you look up the pre-compiled result using the MD5 checksum.

Oracle databases do something like this with SQL: they hash the SQL text, and the next time the same SQL is run they look up the hash value to see if there is a pre-compiled execution plan.
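A rough sketch of that lookup-before-compile idea (std::hash stands in for MD5 purely to keep the example short; a real cache would hash the source with MD5 or better, persist the blobs to disk, and handle collisions):

```cpp
// Compile cache keyed by a checksum of the shader source: compile once, look up after.
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using CompiledBlob = std::vector<uint8_t>;

CompiledBlob compileShader(const std::string& source) {
    (void)source;
    return {};  // placeholder for the real (slow) driver/API compile
}

std::unordered_map<std::size_t, CompiledBlob> g_shaderCache;  // persisted between runs in practice

const CompiledBlob& getCompiledShader(const std::string& source) {
    const std::size_t key = std::hash<std::string>{}(source);  // MD5 of the source in practice
    auto it = g_shaderCache.find(key);
    if (it == g_shaderCache.end()) {
        // First time this source has been seen: pay the compile cost once and remember it.
        it = g_shaderCache.emplace(key, compileShader(source)).first;
    }
    return it->second;
}
```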
 
Humus said:
DeanoC said:
An initial-submission strategy would imply the driver getting (in future) 10,000s of shaders... I doubt the driver is going to handle that well.

zeckensack said:
If you start whacking together a serious material system, your count of possible permutations easily goes into the thousands.

We had this discussion recently in another thread.

We shouldn't demand that a driver handle outrageous cases well. I have no sympathy for the developer who feeds the API 10,000 shaders. The application is flawed. It's poor usage of the API, and it's not the driver's responsibility to accommodate misuse of the API. It's the application's responsibility to use the API properly if you expect any level of performance, and that includes load-time performance.

I don't have any sympathy for developers using 10,000 vertex buffers either. They made a poor design decision, and they have to suffer the consequences, namely poor performance. Just because you have 10,000 objects doesn't mean you should have 10,000 vertex buffers. Vertex buffers are flexible enough for you to pack many objects into each buffer and yet draw them independently if necessary. Just because you have 10,000 objects doesn't mean you should have 10,000 shaders. Shaders are very flexible, and if you want any level of performance, at run time and at load time, you had better learn to use them properly.

I'm opposed to writing drivers according to the standards of badly written applications.

Humus, I can only assume you haven't worked on any large-scale game projects. That kind of thinking is only valid for small-scale projects.

Shaders are related to art, and that means the more flexibility you allow, the better art you get. We give options to our artists and create shaders from their choices.

You are trying to convince us to lower our art quality in the name of speed. We are trying to make the most immersive world we can; if my artists say they want to use all those thousands of permutations of shaders, I'll do my best to give them that flexibility. I'm not going to lower quality over something that is purely a driver's choice. Compilation time is not a variable based on the hardware; it's a variable based on software design. Some of us (for a variety of reasons) want/need the ability to create 1000s of shaders over the lifetime of our game - nobody ever said we'd use them all at the same time, merely that we may create them on the fly.

We know how to drive cards fast, and it's only a 'badly written' renderer (to quote your description of mine and zeckensack's renderers) that would use this as an excuse to lower the art quality used in-game. We use every trick in the book to reduce the working set of shaders each frame to a minimum (it's really quite easy), but I'm not going to reduce quality simply because you haven't figured out a thing called shader management and shader LOD yet.

Total resource virtualisation is the name of the game, and resource virtualisation requires creation and destruction to be 'reasonably' fast. If I could ship 'native' shaders I would, but I can't (at least on PC), so all I ask is that it's not too slow from shader source to the point where I can use it.
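One illustrative way to keep the per-frame working set small while still allowing thousands of permutations over the game's lifetime is a least-recently-used cache of compiled shaders - a sketch only, not any particular engine's scheme (createShader/destroyShader are placeholders for the real create/assemble and release calls):

```cpp
// LRU working set of compiled shaders: permutations are created on demand and destroyed
// when they fall out of use, so only a bounded set is live at any one time.
#include <list>
#include <string>
#include <unordered_map>

using ShaderHandle = unsigned int;

ShaderHandle createShader(const std::string& source) { (void)source; return 1; }  // placeholder
void destroyShader(ShaderHandle handle) { (void)handle; }                          // placeholder

class ShaderWorkingSet {
public:
    explicit ShaderWorkingSet(std::size_t maxLive) : maxLive_(maxLive) {}

    ShaderHandle acquire(const std::string& source) {
        auto it = live_.find(source);
        if (it != live_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second.lruPos);  // mark as recently used
            return it->second.handle;
        }
        if (live_.size() >= maxLive_) {            // evict the least recently used shader
            const std::string victim = lru_.back();
            destroyShader(live_[victim].handle);
            live_.erase(victim);
            lru_.pop_back();
        }
        lru_.push_front(source);
        Entry entry{createShader(source), lru_.begin()};
        live_[source] = entry;
        return entry.handle;
    }

private:
    struct Entry {
        ShaderHandle handle;
        std::list<std::string>::iterator lruPos;
    };
    std::size_t maxLive_;
    std::list<std::string> lru_;                        // most recently used at the front
    std::unordered_map<std::string, Entry> live_;
};
```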
 
Dio said:
If '<1 second to compile 100 shaders' is acceptable, then things should look OK until we get into the thousands-of-instructions range.

100 shaders per second will be fine. We know it takes time to do the compile; I'm just making sure it's known that some of us can't rely on a trusty loading screen to hide things :)

So that brute-force optimiser that takes several seconds might not be a good idea :D
 
DeanoC said:
So that brute-force optimiser that takes several seconds might not be a good idea
No sign of it being that bad.

Yet :D One day we will have a 10,000-instruction program to optimise, and then it gets hairy.

If it does get to be a problem, we'll solve it.
 