G70 compiler

trinibwoy

Apologies if this is a dumb question but I was wondering - are the ALU changes on G70 just drop-ins that would work fine with NV40's compiler or would there be significant changes to take full advantage of the new instructions?
 
I'd imagine the reason people touted "no shader replacements!" for G70 is because nV needs some time to retarget and recompile. :)

Seriously.
 
I think the opposite. G70 would appear to be a more symmetrical architecture and so a whole bunch of bottlenecks that used to exist no longer apply.

Presuming that NVidia's driver has a run-time shader compiler, I'm also presuming that in 6800 the shader replacements were provided as "hints" to the run-time compiler so that selected portions of original shader code were implemented in a "non-obvious" way (to the run-time compiler). This would be to get round the really quite tricky combinations of operations that can be co- or dual-issued.

I posted some extremely "efficient" code for 6800 a while back:

http://www.beyond3d.com/forum/viewtopic.php?p=550703#550703

That really makes your head spin!

The G70's two primary ALUs per fragment pipe can both do a MUL and/or ADD:

- r0=a*b or
- r0=a+b or
- r0=a*b+c

- r1=d*e or
- r1=d+e or
- r1=d*e+f

Here one of three different calculations for r0 can be calculated by ALU 1. Similarly, one of three calculations can be calculated by ALU 2, and the result assigned to r1.

NV40, by contrast, has more limited options. In total it can only do a MUL on one ALU and a MUL and/or ADD on the other:

- r0=a*b

- r1=d*e or
- r1=d+e or
- r1=d*e+f

(These examples only work if the source data is in FP16 format.)

It seems to me that G70's much easier for the run-time compiler to work with, and that's the reason why shader replacements aren't needed.

But, well, I expect we'll never find out for sure.

Jawed
 
I haven't had a chance to really review the G70 that much, but from everything I've seen it's less constrained than the NV40. The current compilers might not produce optimal code for the new architecture, but they won't break either.

The drivers translate the assembly into internal code, so I assume that would be handled at that level, but some of the higher-level optimizations might benefit from having the full source, since the assembly target invariably has some information loss.
 
Jawed said:
The G70's two primary ALUs per fragment pipe can both do a MUL and/or ADD [...]

If that is true, then G70 pipeline == 1.5 x NV40 pipeline! Extreme pipeline! :p ;)
 
switch (CHIP) {
case NV40: {<assemble this way>}; break;
case G70: {<assemble that way>}; break;
case ...
default: {<do compatibility mode, works with every chip>};
}

Just off the top of my head, I don't see that being especially hard to implement in the drivers.
 
Reading that, one does have to wonder at what point the UDA gets just too unwieldy for words. They really ought to pinch that thing off at GFX and start a new one for NV40 forward.
 
_xxx_ said:
switch (CHIP) {
case NV40: {<assemble this way>}; break;
case G70: {<assemble that way>}; break;
case ...
default: {<do compatibility mode, works with every chip>};
}

Just off the top of my head, I don't see that being especially hard to implement in the drivers.

You're not considering the difficulty of the <assemble this way> part. (assemble would be a misnomer, the driver performs compilation, not simple assembly)

Fewer non-orthogonal restrictions make optimization a little easier, but these driver compilers still have a ways to go.
 
trinibwoy said:
Apologies if this is a dumb question but I was wondering - are the ALU changes on G70 just drop-ins that would work fine with NV40's compiler or would there be significant changes to take full advantage of the new instructions?

I'm actually less than impressed with NVidia's compiler technology. I'd think that after developing Cg they'd be ahead of ATI, not behind.

Recently I did some work with the penumbra wedge soft shadow algorithm, optimizing the lengthy shader. I actually found an application where FP24 was inadequate precision and I needed FP32. In essence, I found a way to further alter my optimized shader using the algebraic expansion of (A1 - A2) dot (B1 - B2). In the end, I had a few more instructions, but many more were scalar or two component, and were easily parallelizable. On my 9700Pro, I found this code ran faster, but had FP24 artifacts. On a colleague's 6800, there were no artifacts, but it ran slower than the original code, showing that co-issue was virtually non-existent.

More evidence of NVidia's compiler woes can be seen at Digit-Life:
http://www.digit-life.com/articles2/video/g70-part2.html
There is very little difference between the scores of the 6800 and the 7800-16 (a 7800 with 16 pipes clocked at 6800 speeds). Maybe 10% on average. Maybe I'm being overly harsh on NVidia and this is all we should expect, but it seems to be only around 5-25% faster than an R300 pipe at pixel shaders.

Either adding more ALUs to the shader pipes can only do so much, or there's a lot more NVidia's compilers should be doing.
 
Part of your problems might be attributable to NV40's inability to read more than four FP32 operands per clock. But hey, what do I know?...

I get the impression that G70 is the same.

Jawed
 
I guess nvidia's compiler can reissue the instructions in a slightly different order on G70 than on NV40, but that shouldn't be very difficult to do. If they have a clear division into steps (tree -> optimisations -> reordering -> assembly), then they only have to alter the last two parts. Assembly should be pretty simple (it always is). As for reissuing, well, G70 and NV40 aren't SO different. I guess the reissuer became even simpler on G70 due to better pipe features.
 
Mintmaster said:
I'm actually less than impressed with NVidia's compiler technology. I'd think that after developing Cg they'd be ahead of ATI, not behind.

There's kind of an inside joke at ATI that we should start promoting the use of Cg. The reason is that Microsoft's HLSL compiler does quite a lot of optimizations, which unfortunately makes our driver's compiler's work a lot harder. Cg on the other hand returns more or less unoptimized code. This usually works better with our compiler, so it often comes out ahead in the end. In fact, in many cases using the D3DXSHADER_SKIPOPTIMIZATION flag to the MS compiler gives you better performance.
 
Hehe, neat. Do you work on compiler tech at all or are you doing more software/demos?

I actually tried that flag, but unfortunately it bumped me over the instruction count several times. I'm still on the 9700Pro, so no 512 instruction limit for me unfortunately.

I can't wait for unified shader tech to come to the PC. If it's a generation after R520, can you imagine something like 64 vec4+scalar ops per cycle? Vertex texturing with completely hidden latency? Mmmm...
 
Mintmaster said:
I actually tried that flag, but unfortunately it bumped me over the instruction count several times.

So an unoptimised shader needs to fit under the instruction count even if you know that the optimised version would come in under the limit?
Interesting.
(though I guess it'd be more interesting if it were actually possible)
 
Humus said:
Mintmaster said:
I'm actually less than impressed with NVidia's compiler technology. I'd think that after developing Cg they'd be ahead of ATI, not behind.

There's kind of an inside joke at ATI that we should start promoting the use of Cg. The reason is that Microsoft's HLSL compiler does quite a lot of optimizations, which unfortunately makes our driver's compiler's work a lot harder. Cg on the other hand returns more or less unoptimized code. This usually works better with our compiler, so it often comes out ahead in the end. In fact, in many cases using the D3DXSHADER_SKIPOPTIMIZATION flag to the MS compiler gives you better performance.

This is very interesting. I've noticed this too but didn't know the cause of it till now. Unfortunately not many developers use Cg enough, which really sucks since it has so many advantages.
 
Perhaps, once developers start working with Cg for the PS3, you may see more developers using it as opposed to HLSL.
 
DemoCoder said:
_xxx_ said:
switch (CHIP) {
case NV40: {<assemble this way>}; break;
case G70: {<assemble that way>}; break;
case ...
default: {<do compatibility mode, works with every chip>};
}

Just off the top of my head, I don't see that being especially hard to implement in the drivers.

You're not considering the difficulty of the <assemble this way> part. (assemble would be a misnomer, the driver performs compilation, not simple assembly)

No, but that is the same for the respective path one way or the other. You merely pass the switch for the right chip.
 
Razor1 said:
This is very interesting. I've noticed this too but didn't know the cause of it till now. Unfortunately not many developers use Cg enough, which really sucks since it has so many advantages.

What are the many advantages of Cg (besides OpenGL portability)? MS put a lot of effort into shader debugging, fragment linking, compiling, etc.

I'm not overly familiar with Cg, so I'd really like to know if you can spare the time.
 
Well, it's very portable, and it's great for cross-platform use. It's very similar to HLSL, and it will output all shaders to ARB if written properly. Some of the functions still don't carry over to ARB though, so the construction of the shader has to be handled carefully if cross-platform support is needed.

It's pretty much writing 1 shader for 3 platforms, which is much less work. And you only need to know 1 shader language instead of HLSL, GLSL and ARB.

There is also a partly compiled version of shaders with Cg which greatly improves load times, though I'm not really sure of this as I haven't used it yet.

Plus the speed difference is very interesting.
 