Interesting RTHDRIBL results @ Hexus

nelg

Veteran

I would imagine that clocks play some role in the discrepancy but maybe this shows that the 6800 drivers are really immature.
With 3X the performance of the 9800 XT and over 2X the speed of the 6800 Ultra, the X800 XT runs my favourite DX9 demo at speed. The image quality is up to the R360's previous high standards, of course.
 
Yes, nVIDIA still has a lot of work to do before fully utilizing NV40's 2 shader units.
And, maybe the demo uses too many registers?
 
In theory it's only been improved, but in practice it's good enough that I consider it solved. I have an insanely long pixel shader (I can't run it all at once, it's just too long) which used to be a real register hog, and it now uses only a few regs - thanks to serious improvements in the compiler.

Therefore I don't think it's going to be a real issue in games...

Edit: I'm speaking of high-level language shaders; I dunno how good the drivers are with hand-written assembly ones... They should quickly become a thing of the past, particularly if they're so complex that they need numerous regs...
 
The NV40 decreased register pressure and decreased the performance penalty when you exceed the threshold, but it did not eliminate the penalty, and I do not expect that the R300/R420 is entirely free either. No one has really done a test to see whether 12/16/32 registers, with or without textures in use, and/or long shaders carry any penalty. The issue is how much memory you're going to dedicate to storing registers times the number of pixels in flight.
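A synthetic shader along these lines could probe it: keep several temporaries live across the whole program and compare timings as the count grows. This is just a rough sketch I'm making up (ps_2_x so the higher register counts are even expressible; the inputs and constants are arbitrary):

Code:
ps_2_x
dcl t0
def c0, 1.0, 0.5, 0.25, 0.125
def c1, 2.0, 3.0, 4.0, 5.0

// force several temporaries live at once (4 shown; extend the pattern to 12/16/32)
mov r0, t0
mul r1, t0, c0
mul r2, t0, c1
mul r3, t0, c0.x

// touch every live register at the end so none of them can be eliminated
mad r0, r0, c0.x, r1
mad r0, r0, c0.y, r2
mad r0, r0, c0.z, r3

mov oC0, r0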

The NV3x compiler got better over time at eliminating excess registers, but it needs to be tweaked for the NV40, and they've only had a few months to work on the NV40 driver, which I hear needed substantial changes. The issue is that with the new dual-issue and co-issue modes, plus register forwarding and predicates, the heuristics used to pack registers got a lot more complicated.

For example, you could choose to eliminate a temporary result register, but in doing so, you might create a data dependency that prevents dual issue. If you remove the dependency (compute two separate results in parallel, then combine in another register), you gain dual issue, but end up using an extra register.
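Roughly what that tradeoff looks like in assembly terms. This is a made-up fragment (not from any real shader) where the goal is v0*c0 + v0*c1:

Code:
// Variant A: minimal registers, but the mad must wait for the mul's result
mul r0.rgb, v0, c0
mad r0.rgb, v0, c1, r0

// Variant B: one extra temp; the two muls are independent and can be dual-issued,
// at the cost of an extra live register
mul r0.rgb, v0, c0
mul r1.rgb, v0, c1
add r0.rgb, r0, r1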

The complexity of trying to schedule up to 7 parallel ops in the NV40 pipes, under register pressure constraints, texture instruction penalties, and other conditions, makes the compiler very tricky to get right. This is one reason why you're still seeing "shader replacement": human beings can still hand-write short assembly routines more easily than a compiler can be taught to generate them.

The NV40 has greatly reduced the penalties of the NV3x architecture, so that in the average case it now performs very nicely in games. But there is a lot of headroom left on the compiler optimization side that is not being utilized yet; if NV does a good job, it will start to show in another couple of months as they learn lessons from optimizing for the new architecture.

Even ATI's compiler still has some leeway left in it, due to some tweaks, it appears, to the ALU units. Expect both the NV40 and R420 to get faster in the compiler department in the next few months, although my gut tells me the NV40 will probably have a greater relative speedup because it's a bigger change over the NV3x, requiring more driver work vs the R420 which is a minor change to the R300 shaders, and because the NV40 shader pipes are more complex.
 
DemoCoder said:
The NV3x compiler got better over time at eliminating excess registers, but it needs to be tweaked for the NV40, and they've only had a few months to work on the NV40 driver, which I hear needed substantial changes. The issue is that with the new dual-issue and co-issue modes, plus register forwarding and predicates, the heuristics used to pack registers got a lot more complicated.

Interesting, as I assumed the drivers for the much more straightforward architecture of the NV40 would already work pretty well. The benchmarks behave pretty much as expected too (disregarding Far Cry), and I would also point out that the R300 drivers worked pretty damn well from the get-go back then, AFAIR.

Did you hear that they expect much more performance, or just a 10% increase here and there?
 
DemoCoder said:
Even ATI's compiler still has some leeway left in it, due to some tweaks, it appears, to the ALU units. Expect both the NV40 and R420 to get faster in the compiler department in the next few months, although my gut tells me the NV40 will probably have a greater relative speedup because it's a bigger change over the NV3x, requiring more driver work vs the R420 which is a minor change to the R300 shaders, and because the NV40 shader pipes are more complex.
It will be interesting to see what the DEC compiler guys do for ATI's compiler.

B3D x800 XT review said:
So, during the initial phases of the R400 design a team of compiler developers was hired, formerly from DEC, and they have been busy coding up shader compilers for ATI. Presumably they have been focusing on the projects the Marlborough team have been working on; however, they have also been working on a version for R300/R420, and the first fruits of their labours are scheduled to be dropped into the CATALYSTs in a couple of releases' time.
 
I don't want to raise expectations, and without saying anything specific, I think on some of the nastier shaders it will be much more than 10%.

Yes, the NV40 is "more straightforward" to program, in that a naive programmer won't trip over any horrendous penalties like he could on the NV3x (e.g. by using more than 2 FP32 registers). Even running with penalties, the NV40 is still faster than previous-generation cards, so it's straightforward in that regard. Games will run adequately with no big surprises.

But that's leaving a lot of performance on the table. Shaders which properly take advantage of things like free normalization, dual issue, 2+2 co-issue, and some other conditions can get a nice boost. It obviously doesn't apply to all cases, since some shaders already fit well on the NV40. (See the ShaderMark figures: notice how some run insanely well, while others run surprisingly poorly, worse than expected.)
 
Some RightMark 3D numbers:

Test                               6800 Ultra   X800 XT PE
Procedural Wood - PS1.4               554.4        383.1
Procedural Marble - PS2.0             414.4        594.5
Procedural Marble - PS2.0 FP16        413.1        595.1
Lighting (Blinn) - PS2.0              333.6        495.2
Lighting (Blinn) - PS2.0 FP16         421.6        493.7
Lighting (Phong) - PS2.0              167.7        277.9
Lighting (Phong) - PS2.0 FP16         235.1        277.9

Okay, on second thought, maybe DemoCoder has a point here regarding the reuse of NV3x code.
 
With respect to FP16 vs FP32, one thing to consider is that FP16 decreases register pressure even on the NV40, so it can store more shader state for in-flight pixels. This means that some dual-issue scenarios might become enabled where they previously weren't, and also that the free FP16 normalization can be used.

Thus the Phong code, which uses multiple normalizations, gets to use the free normalize.

One way to test this is to write a Phong shader with all of the instructions left at full FP32 precision, but with only the NRM instruction using half precision (NRM_PP).
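Something along these lines, i.e. a stripped-down Phong where only the normalizes carry the _PP modifier. The inputs and constants here are just illustrative, not taken from RTHDRIBL:

Code:
ps_2_0
dcl t0                        // interpolated surface normal
dcl t1                        // interpolated light vector
dcl t2                        // interpolated half vector
def c0, 32.0, 1.0, 0.0, 0.0   // x = specular exponent

nrm_pp r0.xyz, t0             // only the normalizes are half precision ->
nrm_pp r1.xyz, t1             // candidates for NV40's free FP16 normalize
nrm_pp r2.xyz, t2

dp3_sat r3.x, r0, r1          // N.L at full FP32
dp3_sat r3.y, r0, r2          // N.H at full FP32
pow r4.x, r3.y, c0.x          // specular term, full FP32

mul r5.rgb, r3.x, c1          // c1 = diffuse colour (set by the app)
mad r5.rgb, r4.x, c2, r5      // c2 = specular colour (set by the app)
mov r5.a, c0.y
mov oC0, r5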
 
Could the reason ATi didn't put as many computational resources into each pipeline as the NV40 be that ATi felt getting such a complex compiler running at decent speed was beyond its driver team's ability?
 
No, I think ATI would hire the right software developer resources if they needed them (e.g. DEC compiler guys). If hardware designers really felt they could get more performance, but it required more driver support, I think they'd go with the enhanced hardware.

Look what's happening with XBox2 and PS3. These things aren't getting easier to program, they're getting much harder. The compilers will have a much more difficult job on those architectures. However, it enables the possibility of huge performance leaps if the right software tools are there for developers.

It's a tradeoff, but I think it boils down to schedules. The next ATI part may have unified shaders, tessellators, and all kinds of stuff, plus load balancing between vertex and pixel shading, so it may be even more difficult than the NV3x to write drivers for.
 
I'm still attempting to get an early release of this driver build from ATI to see what differences there might be in shader performance.
 
How does ATi predict which way the next DX spec update will go? Isn't it better for the IHVs and MS to sit around the table and finalize the spec before the IHVs invest billions of dollars in the R&D process? I think it's too risky to put your people on something you don't really know for sure.

And DemoCoder, what are you most eager to see in next-gen hardware, both performance- and feature-wise?
 
Programmable tessellation.
Virtual memory.
Maybe some sort of "shared" global memory for shaders but which is concurrently safe to operate on.
And much faster ALU throughput for longer shaders.
 
Full virtualisation of memory and shaders, so that you can define different shading units that do different tasks, and bind shaders to them. The hardware then profiles how much workload each defined unit gets and schedules a different number of the internal processing units (the 16 we have now) to the different user-defined units.

That way you could, for now, define the vertex and pixel shading units; or, if the hardware is designed right, you could instead define the three units of RenderMan to map its shaders directly to hardware, or define an intersection unit and a shading unit to do hardware raytracing.

And each of those setups (and many more) could run at 100% GPU speed, unlike now, where you have to leave the vertex shader units idle, for example, if you try to do some raytracing.

Another gain of this is the ability to have multiple thread units on the GPU, allowing (Longhorn-style) different threads to run on the GPU at the same time. That means two Longhorn windows, both with some shading going on (one rendering a modern video processed in video shaders, one doing some complex rendering task, say 3ds Max in the background, while you watch a movie), and each gets a part of the GPU for itself.

And the most important: pluggability. Remove the actual video output from the generic GPU and make it a render card, not a graphics card. Instead, introduce a display card, which is merely there to talk to lots of HD screens in different settings (so you buy a small PCIe 1x card for the ordinary 1024x768 TFT you've had for a long time, and a PCIe 4x card for the new HDTV you just bought, and simply plug them in). The PCIe 16x GPGPUs, on the other hand, just render and don't care which display card they have to send the images to. Full hardware acceleration all the time, independent of (multi-)display modes.

Another thing that interests me is temporal antialiasing. If performance keeps moving on, this could even turn into motion blur. I mean, having several hundred fps in Q3, capped to 60 fps with vsync on, gives you 5 or more rendered frames (and at lower res, many more :D) per "screen frame". These could be accumulated directly into an accumulation buffer to get motion blur. Possibly something directly selectable in a future driver revision? Or on next-gen hardware, at least. (There is hardware accumulation on Radeons, and they have fancy post-effects like those SmartShaders we had a contest on, and triple-buffering support too; in combination, we could have it even now.)

Those were some random things that moved through my brain.
 
DemoCoder said:
The complexity of trying to schedule up to 7 parallel ops in the NV40 pipes, under register pressure constraints, texture instruction penalties, and other conditions, makes the compiler very tricky to get right. This is one reason why you're still seeing "shader replacement": human beings can still hand-write short assembly routines more easily than a compiler can be taught to generate them.
Apparently the NV40 does most of this in hardware now, so the main issue is probably going to be proper ordering of instructions so that the hardware has everything it needs to execute just the next instruction.

To this end, there doesn't appear to be any register performance issue (yet) for simple, straightforward cases. When the shaders get more complex there are problems.

There are probably some very specific instruction orderings that are causing headaches for the NV40 right now that should be resolved in a few driver releases. Perhaps the most obvious example of this is MDolenc's fillrate tester that shows higher performance in FP32 than in FP16 under one of the simple tests (this, for example, would probably be solved by simply storing the FP16 registers in different places: they probably share the same 128-bit register in current drivers).

Still, given that there is hardware that does a good amount of the scheduling, the biggest gains will likely not be seen until the NV45 comes out.
 
DaveBaumann said:
I'm still attempting to get an early release of this driver build from ATI to see what differences there might be in shader performance.
Nice.. Can you get the Alpha OpenGL driver while you are at it? :p

I think ATi needs to double the number of people they have working on both these projects... oh, and double their salaries.... oh yeah... and get smarter, faster guys too....

and hurry it up would ya??? :LOL:
 
What do you mean, the HW does the scheduling? The HW is not going to replace a dp3/rsq/mul sequence with a free normalize. The HW is not going to replace an ADD/MUL pair with a dual issue automatically; that is done by the compiler packing both into a VLIW instruction. And the HW certainly isn't going to do register allocation and expression inlining.
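For reference, the kind of substitution I mean (a hand-sketched fragment, not from a real dump):

Code:
// manual normalization: three dependent instructions and an extra temp
dp3 r1.w, r0, r0
rsq r1.w, r1.w
mul r2.xyz, r0, r1.w

// what the compiler can emit instead; with _pp it can map onto the free FP16 normalize
nrm_pp r2.xyz, r0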

I took a look at some of the RTHDRIBL shaders via 3DAnalyze; many look inefficient, and they almost look like they were written by hand, since they don't even do constant folding.

Let me give you an example. This fragment is from a 3DAnalyze dump of RTHDRIBL:

Code:
def c3 , 256.000000, 0.111031, 0.000000, -128.000000

mad_pp r4.w , r0.wwww , c3.xxxx , c3.wwww 
mul r11.w , r4.wwww , c3.yyyy

Here we have
r11.w = r4.w * c3.y

Substituting for r4.w gives
r11.w = (r0.w * c3.x + c3.w) * c3.y
      = r0.w * (c3.x * c3.y) + (c3.w * c3.y)

With constant folding that becomes
r11.w = r0.w * c_fold.x + c_fold.y

which lets us rewrite it as
MAD r11.w, r0.w, c_fold.x, c_fold.y

That saves 1 register and 1 instruction. I saw one shader that used 11 registers when only 2 were really needed, and most of those 11 held scalars (r1.x, r2.w, r3.y); they hadn't hit any port limits, so there was no justifiable reason not to reuse dead registers or pack the scalars.
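Just to illustrate what I mean by packing (register names made up for the example): instead of leaving each scalar result in its own temp,

Code:
// before: three temps, each holding one live scalar
mul r1.x, r0.w, c0.x
mul r2.w, r0.w, c0.y
mul r3.y, r0.w, c0.z

// after: the same three scalars packed into one register's components
mul r1.x, r0.w, c0.x
mul r1.y, r0.w, c0.y
mul r1.z, r0.w, c0.z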
 