What was that about Cg *Not* favoring Nvidia Hardware?

790 said:
E.g., 2.0 hardware can only just about pull off many of the HDR image-space techniques that have recently become a reality. This is the stuff that's going to be the "next big thing" after per-pixel lighting (heh, seems so dated now to me, despite no games using it yet!). Tone mapping, and post-processing effects such as a bloom filter to simulate glare/scattering, can really let us go wild and simulate the optics of a camera or even our eyes. For example, when you emerge from a dark room into broad daylight, everything's going to look significantly overexposed while your pupils dilate, which can be simulated with tone-mapping techniques. And you automatically get flares on any illuminating objects that have high contrast with the surrounding environment, no hacking in billboards for cheesy effects any more. I'd love to be doing this stuff right now, but we'd need to delay the game a year or two before anyone could play it!

I'm not really fond of these kinds of bloom effects as they seem artificial and contrived to me. In real life when you leave a dark room and enter a brightly lighted area there's very little perceived bloom--mainly you squint, close your eyes and tend to look away from a bright light source until your eyes adjust--it's almost reflex--autonomic protection of the retina. Some of these effects speak more to the limits of camera and film technology than to how the eye actually behaves, at least I think so.

What I'd like to see is a vastly improved dynamic range of color which makes full use of the FP pipeline--in essence, let's throw away the 32-bit integer and move into FP rendering. With that and decent FSAA we should be able to get very close to photorealistic--not so much in the 3D models at the moment, but in the general shading of a scene. This kind of color will do more to convince one of realism than all of the neat effects we can use. To me the objective should be real life--not quirky effects which are just limitations of the media employed (e.g., exaggerated motion blur). I'm really looking forward to getting away from that "canned, 3D" look that is so prevalent in current software and moving toward lighting and tone and depth and contrast which fool the eye enough that the temptation to reach inside the screen becomes powerful...;)

You might not agree, but the ATI "bear" demo for some reason strikes a chord with me. It's not the details of the bear--which could obviously use some work--but it's the feeling I have every time I watch the bear's fur ripple and move that it's *real*...;) I really have a strong urge to reach out and pet the damn thing! We can do much better than the bear, of course, but that's the kind of suspension of disbelief we should be working towards. In that sense we can do more than cameras and film. Ultimately.
 
WaltC said:
I'm not really fond of these kinds of bloom effects as they seem artificial and contrived to me. In real life when you leave a dark room and enter a brightly lighted area there's very little perceived bloom--mainly you squint, close your eyes and tend to look away from a bright light source until your eyes adjust--it's almost reflex--autonomic protection of the retina.

That's all I'm talking about. Of course, some developers are going to go overboard for cinematic effect, but in real life your eyes take a while to adjust to all conditions. If you go into a dark room it can take up to 20 minutes for your pupils to dilate fully.

This is closely linked to high dynamic range (FP) rendering. Once you've mapped a dynamic range down to a range the monitor can display, you must factor in human visual limitations to mirror what the scene looks like in real life. These include luminance bleeding around bright areas to simulate scattering in our eyes (Halo 2 is doing a cheesy and overblown hack of this), loss of color sensitivity in dark scenes, and, in very dark scenes, loss of image resolution. The closest thing to the last two right now would be Splinter Cell's night-vision goggles, but that's still a completely different thing from a dynamically adjusted scene.
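To make that concrete, here's a rough sketch of the tone-mapping pass I mean (just an illustration--the sampler and exposure names are mine, and a real implementation would derive the exposure from the scene's average luminance each frame):

// Rough sketch only: names are illustrative, not from any shipping code.
sampler2D sceneHDR;     // floating-point render target holding the HDR scene
float     fExposure;    // current adaptation level, set by the app per frame

float4 ToneMapPS(float2 uv : TEXCOORD0) : COLOR
{
    float3 hdr     = tex2D(sceneHDR, uv).rgb;   // linear HDR color
    float3 exposed = hdr * fExposure;           // apply eye/camera exposure
    float3 ldr     = exposed / (1.0 + exposed); // squash [0,inf) into [0,1) for the monitor
    return float4(ldr, 1.0);
}

Animate fExposure over a few seconds when the average luminance jumps and you get the "blinded when stepping outside" behaviour for free.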
 
790 said:
That is *not* like the situation with compilers where they use registers, because that's transparent. You can't make a resource as limited as samplers transparent, though--not yet. Maybe when we have 512+ we can abstract them too, but not yet.

The only reason the situation is different is because the compiler has a stack to work with. The CPU still has only a paltry few functional units, just like the GPU has limited texture units. External memory (the stack) prevents running out of limited registers, and instruction scheduling and pipelining allow the CPU to avoid many of the hazards of having a limited number of functional units. There's nothing bad about the compiler giving you an error when you exceed a resource limit. Hell, you can run out of stack space today.


Perhaps future GPUs will add a "scratchpad RAM" (16k? 32k?) which can function as a limited stack and memory area for pixel shaders. And perhaps an arbitrarily high number of textures could be bound--enough to not worry about resource limits. (IMHO, 16 is already more than most people need. Show me a RenderMan shader with more than 16 textures.)


This whole compiler argument really goes to what people expect out of a language. When C first came out, you had the assembly vs. C jockeys. Then the C vs. C++ wars. (ohmigod! The compiler added a vtable! I don't want that!) More fighting followed with Java, OCaml, et al. (I protest garbage collection and array bounds checks!)

At each level of abstraction, you have some people fighting against letting the compiler take over yet one more aspect of programmer control.

At each level of abstraction, you have some people fighting against paying a small overhead or cost in performance for enormous productivity gains. (e.g., register management, memory management, etc.)

I'm looking forward to the environment taking over more of the resource management issues, and even more of the database management, freeing me to spend most of my time on artistic issues.

I did over a decade of assembly and C and got sick of it. I lost productivity writing stuff and worrying about stuff that tens of thousands of other people did--low-level headaches that you encounter over and over and over, something that should have become commodified and provided by the environment. Other people might get their kicks optimizing an assembly routine to squeeze one more cycle out (like solving a puzzle), debugging memory leaks, or doing manual cache alignment.

So for me, 99.9% of the time I would like the language/runtime to manage how textures get bound to which register, and 99.9% of the time I would probably call normalize() instead of dp3/rsq. I would expect the compiler to know when to make the tradeoff between a lookup and a procedural calculation, and I would expect the compiler to know how to trade off precision in areas where it is not needed, and I would expect it to do this on a per-card basis, knowing more about the underlying architecture of the various GPUs than I know, or than is public knowledge. I mean, why should I sit there with pen and paper trying to figure out how to optimally parallelize computation for R300 VLIW and NV30 VLIW (and who knows how many others) and balance bandwidth?
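To be concrete about the normalize() example (a rough sketch--actual output depends on the target and the compiler):

// The one-liner I want to write ('v' is just some vector input):
float3 NormalizeExample(float3 v)
{
    return normalize(v);
    // ...versus roughly what a ps_2_0-class target ends up running
    // (illustrative, not actual compiler output):
    //   dp3  r0.w,   v, v      // r0.w = |v|^2
    //   rsq  r0.w,   r0.w      // r0.w = 1 / |v|
    //   mul  r0.xyz, v, r0.w   // result = v * (1 / |v|)
    // On other hardware the same call might be cheaper as a cubemap
    // lookup instead -- exactly the decision I want delegated.
}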

Can DX9 HLSL/Cg/GLSLANG do this? No. Are there other people like me? Yes. There are even people who use third-party scene graphs to avoid rewriting them.

That's why there isn't going to be just one or two standard high-level shading languages. We have different preferences about how low-level or high-level to go, and how much control we want to delegate to the runtime. Some people want more, some people want less. I'm wishing for some combination of a very high-level HLSL and a flexible scene-graph runtime in the future.

Let a thousand flowers bloom.
 
I'm not really fond of these kinds of bloom effects as they seem artificial and contrived to me. In real life when you leave a dark room and enter a brightly lighted area there's very little perceived bloom--mainly you squint, close your eyes and tend to look away from a bright light source until your eyes adjust--it's almost reflex--autonomic protection of the retina.

Then you haven't seen this effect done properly. hehe. If pulled off correctly, the player will squint and close their eyes a little as a natural reflex from the experience. :)

There are really no games that pull this off in such a fashion that it feels realistic, although it has been pulled off synthetically in visual tech demos to some degree.

I think the closest effect I've seen in an actual game was done in ICO, a PlayStation 2 title. When going from dark, dank castle interiors to the bright outdoors with the sun, the light is blinding and colors are very washed out for a few seconds before the game's "eyes" adjust to the difference. It's a pretty neat effect and probably the best example to date for experiencing such an effect in a videogame.

Small details such as this can only help to improve the experience--but only if they are done in such a way as to add realism and not as some sort of gimmick or box check-mark. I look at lens flares as a big gimmick these days as they are usually overdone and far from immersive (don't really see lens flares walking down the street...).
 
DemoCoder said:
So for me, 99.9% of the time I would like the language/runtime to manage how textures get bound to which register, and 99.9% of the time I would probably call normalize() instead of dp3/rsq. I would expect the compiler to know when to make the tradeoff between a lookup and a procedural calculation, and I would expect the compiler to know how to trade off precision in areas where it is not needed, and I would expect it to do this on a per-card basis, knowing more about the underlying architecture of the various GPUs than I know, or than is public knowledge. I mean, why should I sit there with pen and paper trying to figure out how to optimally parallelize computation for R300 VLIW and NV30 VLIW (and who knows how many others) and balance bandwidth?
How can a COMPILER know when a texture lookup is a good enough tradeoff between precision and speed? A texture lookup (well, depending on resolution, which in this case must be chosen by the COMPILER anyway) might be good enough when you use a model with soft edges, but might not be enough when you use a model with sharp edges. So will the compiler have to run through your models to find out whether they are sharp or not?
It's not just normalization that ends up like this. You could debate whether a texture lookup is a good enough tradeoff for sin, cos, noise, pow, ... How many other COMPILERS do that (replace your sin, cos, random, pow, ... calls with table lookups)?
You can't expect something like this from a LANGUAGE; you could perhaps expect it from a 3D API, but not from Direct3D or OpenGL as they currently stand. Perhaps Cg could eventually do this, but it would probably complicate it so much that even NVidia would give up on it.
Even DX9 HLSL does not actually compile in a "hardware aware" style. It compiles into shader assembly just as we have till now. And as Jason Mitchell said: DRIVERS will be made to spot specific code output from DX9 HLSL to make further LOW LEVEL optimisations.
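Just to make the tradeoff concrete, here's roughly the kind of substitution we're talking about (the sampler name is made up); whether the lookup version is acceptable depends entirely on the precision the scene needs, which is exactly what the compiler can't judge:

// 'normCube' would be a pre-baked cubemap whose texels store
// normalized direction vectors (sketch only).
samplerCUBE normCube;

float3 NormalizeMath(float3 v)
{
    return normalize(v);                      // exact, costs ALU instructions
}

float3 NormalizeLookup(float3 v)
{
    // one fetch, but precision is capped by the cubemap's resolution and
    // format -- fine for soft diffuse shading, visibly wrong for tight
    // specular highlights on "sharp" models
    return texCUBE(normCube, v).rgb * 2.0 - 1.0;
}

// The same question applies to replacing sin/cos/pow/noise with 1D lookup textures.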

DemoCoder said:
Can DX9 HLSL/Cg/GLSLANG do this? No. Are there other people like me? Yes. There are even people who use third-party scene graphs to avoid rewriting them.
The more appropriate question here would be: can Direct3D or OpenGL do this? No. You'll have to wait for some extremely high-level 3D API which would know everything about the underlying hardware. And even if such an API were as easy as Logo for application developers, it would be a real pain in the ass for driver writers. And I somehow think it would end up like Fahrenheit did. If you want that high a level of abstraction then perhaps there are some 3D engines that could, with some modifications, expose what you need.
 
DemoCoder said:
Perhaps future GPUs will add a "scratchpad RAM" (16k? 32k?) which can function as a limited stack and memory area for pixel shaders.
The problem is that there isn't just one pixel in flight at a time - it will probably be into the thousands now or in the not-too-distant-future. Every pixel in flight needs its own scratchpad...

I can envisage a time when you get some 'global' variables (not per-pixel) but then there are the Read-Modify-Write and order-of-execution problems to be worked through...
 
Presumably, the company producing the compiler would build that knowledge into the compiler, either through reverse engineering or NDAed specs. I'm sure Cg will have a specific NV30 backend optimizer. If ATI produced a compiler for a hypothetical future HLSL, they could also build their own optimization backend. There is no need to place it into DX9/OGL. There are C compilers today which produce code for various x86 architectures (cache size, internal pathways, # of execution units). DirectX/OGL does not expose the underlying implementation, and it should not. Today, developers are talking about "architecture specific shaders", which means developers will be hand-coding hardware-implementation-specific shaders based upon some knowledge (probably empirical testing) of how such shaders work on each piece of hardware. I am merely proposing to have this hardware-specific work done once and given to the rest of us via a third party or ATI/Nvidia. Today, Intel produces C compilers specific to Pentiums that beat most other C compilers on Intel hardware, especially at vectorization.


As for knowing when to trade off speed for precision, this can be done in any number of ways. First, the language would have to be extended so that you could semantically declare error tolerances on variables. Second, the compiler would have to have an underlying model of hazards in the pixel shader pipeline. Third, you'd have to be able to declare/turn it off.

Finally, this is all for very well defined, special cases: library calls.
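Purely to illustrate the first point--this syntax doesn't exist in any shipping HLSL or Cg, it's just the kind of declaration I have in mind:

// HYPOTHETICAL syntax only -- nothing like this exists in DX9 HLSL or Cg.
// A declared tolerance would tell the compiler how much error a library
// call may introduce, so it can pick a table lookup or full math per card.
float3 n : tolerance(0.01) = normalize(bumpedNormal);  // a lookup is fine here
float  s : tolerance(1e-6) = sin(phase);               // must be computed exactly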

Sure, the compiler can't have perfect information and will sometimes make mistakes. Today's C compilers don't have perfect information either, and fail to optimize many situations which are data-dependent and can only be determined at runtime (hence profile-feedback optimizers). But if 90% of the time it achieves better output than me across a multiplatform set of 3D cards, and I only have to hand-code 10% and fix a few problems, then it has done its job.
 
I mean, why should I sit there with pen and paper trying to figure out how to optimally parallelize computation for R300 VLIW and NV30 VLIW (and who knows how many others) and balance bandwidth?

Christ man, nobody worries about this already. Just write the shader, and let it run. The driver has sufficient information to do low-level optimizations, intermediate-level optimizations are NOT WORTH THE TROUBLE. STOP WORRYING ABOUT IT.

Because ps1.x hardware is so primitive, it's rather futile to try and abstract all the details from the developers, because 99% of the time they _MUST_ write specific shaders because the hardware is so fundamentally different in both performance and capabilities. ps1.x is dying, barely-programmable hardware; please forget about convoluted abstractions to try and hide this fact.

In ps2.0+ and forwards you can do almost everything in the shader, and this notion of needing to create a whole API to insert lookup tables will die, and we can sit back and let the driver and hardware worry about the low-level stuff while you solely focus on the high-level implementation. Anything in between isn't worth the bother.
 
So for me, 99.9% of the time I would like the language/runtime to manage how textures get bound to which register, and 99.9% of the time I would probably call normalize() instead of dp3/rsq.

HLSL does both. There are no visible registers, and normalize() is available for any hardware that supports it. Furthermore, the effects files allow you to completely forget about low-level stuff such as binding texture levels, or even loading textures, without writing a single line of code! Try EffectEdit in the DX9 SDK, or RenderMonkey DX9 (ATI secure), it's all there, baby! ... minus this auto-texture-lookup nonsense, which I'll refrain from arguing any further o_O
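For example, a bare-bones .fx file along these lines is roughly what such a tool loads (names are illustrative, and the exact texture annotation depends on the tool)--the effect framework creates and binds the texture for you:

// Bare-bones .fx sketch (illustrative names).
texture BaseTex < string name = "base.jpg"; >;  // annotation lets the tool auto-load the file

sampler BaseSampler = sampler_state
{
    Texture   = <BaseTex>;
    MinFilter = LINEAR;
    MagFilter = LINEAR;
};

float4 MainPS(float2 uv : TEXCOORD0) : COLOR
{
    return tex2D(BaseSampler, uv);
}

technique Simple
{
    pass P0
    {
        PixelShader = compile ps_2_0 MainPS();
    }
}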
 
790 said:
Christ man, nobody worries about this already. Just write the shader, and let it run. The driver has sufficient information to do low-level optimizations, intermediate-level optimizations are NOT WORTH THE TROUBLE. STOP WORRYING ABOUT IT.

I disagree wholeheartedly. The intermediate level is the best place to do optimizations. Unless you think that device drivers are going to implement full-fledged just-in-time optimizers that do everything a normal optimizer would, you are going to lose out on a lot of optimizations.

In typical CPU programs, 90% of execution is spent in just 10% of the program, so if you miss out on a few optimizations, it's not much of a problem. With pixel shaders, 100% of the execution is spent in 100% of the shader. The pixel shader IS the HOTSPOT. Shaving off just a few instructions, folding a few constants, or simplifying one expression can make a HUGE DIFFERENCE to performance. So you should worry, because the cost of a lost optimization in a complex pixel shader is a huge loss in framerate.

Current HLSL compilers (Cg, FXC) do intermediate-level optimizations. If they didn't, then PS2.0+ drivers would have to parse the shader, recreate an intermediate-level representation, and perform these optimizations (such as common subexpression elimination, copy propagation, etc.) themselves. Currently these are done in Cg/FXC, and they have gotten better since the original betas, yielding shaders that get closer to hand-coded assembly. If you don't think intermediate-level optimizers are worth it, then you'll have no problem using older versions of FXC which produce more bloated shaders.
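For instance, the kind of cleanup I mean, sketched on a toy specular term (illustrative code, not actual Cg/FXC output):

// As written by the author, normalize(L + V) appears twice and the
// exponent is left as an unevaluated expression:
//     float  s   = pow(saturate(dot(N, normalize(L + V))), 32.0 * 2.0);
//     float3 rim = normalize(L + V) * 0.2;
// After common subexpression elimination and constant folding:
float4 SpecPS(float3 N : TEXCOORD0,
              float3 L : TEXCOORD1,
              float3 V : TEXCOORD2) : COLOR
{
    float3 h   = normalize(L + V);                 // computed once, reused
    float  s   = pow(saturate(dot(N, h)), 64.0);   // 32.0 * 2.0 folded
    float3 rim = h * 0.2;                          // copy propagation of the shared value
    return float4(rim + s.xxx, 1.0);
}

A driver working from the emitted assembly would have to rediscover all of that on its own.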



The intermediate representation of a shader (PS2.0 assembly) isn't semantically rich enough to preserve all the original source information, so an optimizing device driver has to work even harder. This will get worse in later revisions of the pixel shader spec as data-dependent loops and branches are added. Data-dependent scheduling is NP-complete, and unlike graph coloring for register allocation, the approximation algorithms aren't cheap and aren't that good.


Peephole optimizations won't cut it. I bet that by next year several people will be selling PS2.0 performance optimizers. Maybe you have no problem just writing your shaders and letting the driver do all your optimizations, but in the real world drivers have bugs, vendors are incompetent, and developers worry about squeezing every last cycle.

As pixel shaders become more and more widely used, optimal execution of a given shader will become very important--too important to just "let the driver do peephole optimizations".
 
Dio said:
DemoCoder said:
Perhaps future GPUs will add a "scratchpad RAM" (16k? 32k?) which can function as a limited stack and memory area for pixel shaders.
The problem is that there isn't just one pixel in flight at a time - it will probably be into the thousands now or in the not-too-distant-future. Every pixel in flight needs its own scratchpad...

I can envisage a time when you get some 'global' variables (not per-pixel) but then there are the Read-Modify-Write and order-of-execution problems to be worked through...

Why would every pixel in flight need its own scratchpad? Only one of these needs to exist in each pipeline while the pixel shader is running. A pixel that is in flight at the beginning of the pipeline (say, still at the z-check stage) doesn't need a scratchpad. This situation is no different from the temporary register file.

If you needed 1,000 scratchpads (one for each in-flight pixel), wouldn't you need 1,000 temporary register files too? I see the scratchpad as nothing more than a conceptual extension of the temporary register file from dozens to thousands.

Scratchpads come to life at the beginning of a shader, and die at the end. It's still stateless.

Now, once you introduce the idea of a global scratchpad that lives between invocations, you have all sorts of problems with ordering and concurrency. Ideally, you'd be restricted to writing algorithms that are order independent.
 
DemoCoder said:
I disagree wholeheartedly. The intermediate level is the best place to do optimizations.

Let me clarify: I was talking about your middle-level table-lookup operations when I said intermediate, not general HLSL-to-hardware-assembly optimization. I do agree that hardware-aware optimizations may be important, but I have full faith that if Microsoft, ATI, and NVIDIA are happy with the current setup, then it's the best thing right now. There's nothing stopping them from making further revisions in the future (for example, D3D could pass the HLSL to the driver if they so wanted).
 
Ok, I misunderstood. Most of what I'm talking about won't really become relevant until the 3.0 shader model, since until then the lack of data-dependent branching and the stateless memory model simplify scheduling algorithms enormously.
 
DemoCoder,

I suppose all memory accessed by a process can be viewed as an extension of register file space. It's just that 16 kB scratchpads for 1,000 in-flight pixels will take up 16 MB. That's not going to fit on-chip, whereas 1,000 32x128-bit register files (~512 kB) will (even on current processes).

Anyway, why is allowing such a large number of pixels in flight beneficial?

Serge

EDIT: to clarify, I'm assuming 1,000s of in-flight pixels at the shader stage (perhaps to fully cover texture cache miss / dependent texture read latencies).
 
Just an aside on the topic of level of abstraction.
DemoCoder said:
I'm looking forward to the environment taking over more of the resource management issues, and even more of the database management, freeing me to spend most of my time on artistic issues.

This is indeed the Utopian situation, and one that can be accommodated provided you aren't pushing the envelope.

In all the arguments you have brought forth (the C vs. C++ battle brings up quite a chuckle), the target performance and target resource limitations are what dictate what level of abstraction can be allowed, with the highest level possible always being the desired approach--but in practice it rarely yields the features/technology you're reaching for on your target.

I've always felt abstraction levels are a full generation behind the technology needed to use them uninhibited, or with even a reasonable level of complexity. This seems to be a repeating trend. As the next higher level of abstraction is reached, it will be a full generation later before it can see widespread use in most circumstances. Until then, the coder has to get involved in the "nuts and bolts" and often bypass the higher level of abstraction to use a lower level (or a mix of both) for at least a single generation of hardware in order to hit their performance goal or target resource limitation for the end user.

As more power/technology is developed, more of the higher level of abstraction can be used... but then some new, higher level of abstraction, libraries, or toolkits come along and the process repeats itself. :)
 
Personally, I don't see any difference between nVidia supporting a PS 1.4 target and an ATI_fragment_shader target (Actually, they couldn't support this target, but it would be possible to support an ATI_text_fragment_shader target...).

But, since I've heard that ATI provides no developer support for PS below 1.4, and their cards are sometimes buggy on the lower pixel shader versions, perhaps nVidia will support PS 1.4 in the interests of Cg working properly on ATI cards under DX.
 
Chalnoth said:
But, since I've heard that ATI provides no developer support for PS below 1.4, ...

Chalnoth, I would like to point out that we at ATI offer developer support on all Pixel Shader versions as PS 1.4 and below work on ALL of our shader-based hardware. And now with the Radeon 9500 / 9700 based hardware, we support PS 2.0 and below.
 
Well, I suppose I wasn't speaking from personal experience, but I still believe it. I do know that ATI's Radeon 8500 and up expose support for all pixel shader versions up to 1.4. I also know that ATI's drivers are notorious for not always adhering properly to the spec. I'm still willing to bet that I'd be correct if I said a number of developers who were having problems on ATI hardware with PS 1.0-1.3 were referred to PS 1.4. There are tons of software companies that pull this kind of crap. Some companies today, for example, will only offer support if the system is an Intel system.
 