Futuremark Announces Patch for 3DMark03

I think most of the confusion stems from their use of the term "unified compiler technology". It's a marketing term and as such means nothing. It specifically does not mean "compiler" as an average person would understand it. The sinister twist here is that the term contains "compiler". Note the difference from the "high resolution anti-aliasing" (drivel) vs multisampling (meaningful description) situation.

You can't tell for sure what's what. Whenever NVIDIA speak of a "compiler" they may either refer to the usual meaning, or they may use the word as an abbreviation of UCT.

The all-important catch really is "technology". It's a combination of some stuff (a true compiler) and some other stuff (replacements). If one component is circumvented, "the UCT" is compromised, which lays the basis for their claims.

This is the wiggle room they're currently using, and it can be interpreted as making sense, but it really takes some brain surgery.

PS: the above doesn't mean I sympathize with those twits. Not at all.
 
zeckensack said:
PS: the above doesn't mean I sympathize with those twits. Not at all.
:LOL:

The one nice thing about these repeat debates is the humor they elicit. I loved surrounding Fuad's name with warning lights, too. Corwin's subversive PRs are always good for a laugh, and this last one was possibly the best, not least because of that magical little dwarf. And using "gentle caress" to refer to engineering in an official PR? Classic! :D
 
Hey, maybe the compiler is just... a little virtual dwarf rewriting shaders by hand in semi-real time :D That would explain a lot!

Well, it could be Dobby too :D

That would fit well with nV's image ;)
 
3dilettante said:
Does competing hardware have the same problem, albeit with greater numbers of registers?
Any degradation on R300 would be a. unlikely, b. gradual and c. slight. It's pretty much a non-problem.
 
3dilettante said:
Okay, I've tried to search out an answer, but I can't seem to find one with regards to the register usage problem.

Why? Why is there such a significant penalty? It can't be something silly like that there's a problem sharing The One Register :rolleyes: .
Why is it still extant in the next generation GPU? What, are they literally missing dozens of registers that the competition has and can't remember where they are?

Well, register usage is not as important in the NV35; and it'll be even less important in the NV40. But that doesn't mean it'll be gone.

According to my testing, the NV35 indeed has twice the register file size, and the penalties are roughly halved. (My results on an NV35: -40%, i.e. 1.7x as slow, with 12 FP32 registers; thepkrl's results on an NV30: 3.4x as slow. I tested with 12 registers as that's the theoretical maximum for PS2.0, although both R3xx and NV3x support 32 full-precision registers.)

Now, considering a program with only Vec4 MADs, the R360 would get: 415*8 = 3320
NV38 = 475*8*0.6 = 2280
That's the WORST case scenario for the NV38 for pure arithmetic. I'm not taking into account the TEX disadvantage for the NV30 or the scalar advantage for the R300, so in practice the PS2.0 picture is even better for ATI.

Now, the best case scenario for NVIDIA is an 8-FP16-register program, thus all done in Vec4 FP16. Operations can be either MAD, MUL or ADD.
R360 = 415*8 = 3320
NV38 = 475*12 = 5700

If we average 2280 and 5700, we get 3990.
According to that, NVIDIA has the lead with the NV38. But obviously, such an average is not accurate, because of the TEX and Vec3+scalar factors.
I'd go as far as saying +15% performance for the R360 and -20% for the NV38
->
R360 = 3818
NV38 = 3192

That would mean the R360 is 19.6% faster than the NV38 for shading, shading here meaning PS1.1 through PS2.0+.
And I'd say the calculations I used would most likely favor NV.
One thing is obvious though: without the register usage problem, the NV38 would almost never be more than 10% to 15% slower.
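
For anyone who wants to check the arithmetic, here's a minimal sketch of the estimate above. Every input comes straight from this post (the 415/475 MHz clocks, 8 pipelines, the 0.6 register-pressure penalty, the 12-ops/clock FP16 best case, and the +15%/-20% fudge factors); none of them are independently verified figures.

[code]
# Back-of-the-envelope shading throughput, in millions of Vec4 ops/s.
R360_CLOCK, NV38_CLOCK, PIPES = 415, 475, 8

r360 = R360_CLOCK * PIPES                  # 3320, one Vec4 MAD per pipe per clock
nv38_worst = NV38_CLOCK * PIPES * 0.6      # 2280, register-limited FP32 case
nv38_best = NV38_CLOCK * 12                # 5700, all-FP16, <= 8 registers
nv38_avg = (nv38_worst + nv38_best) / 2    # 3990

# Fudge factors: +15% for R360 (scalar co-issue), -20% for NV38 (TEX cost).
r360_adj = r360 * 1.15                     # 3818
nv38_adj = nv38_avg * 0.8                  # 3192

print(f"R360 ~{r360_adj:.0f}, NV38 ~{nv38_adj:.0f}, "
      f"R360 lead: {100 * (r360_adj / nv38_adj - 1):.1f}%")  # ~19.6%
[/code]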


Now, you also asked: WHY is it that way?
Well, there's a limited register file. If there aren't enough registers left, the maximum number of quads (8, i.e. 32 pixels) cannot be sent through the pipeline.
And why was such a design choice made? I don't really know, I admit.
The NV30 register file (which is in FP16; pairs of FP16 registers combine into one FP32 register, not the other way around) is 4*32 = 128 FP16 registers. The NV35 has 256 FP16 registers. I doubt that would take SO much cache... although the NV30 already has over 1MB of cache, while the R300 has only about 500KB :!:

Now, I'd be interested in knowing how many registers the R300 has, since I don't have a clue about that.


Uttar
 
According to NVIDIA, NV35 has 256 quads in flight and 8 FP32 registers per quad.

-> 2048 FP32 registers -> 32 KB of cache for those registers
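
A quick sanity check on those figures, assuming each register is a full 4-component FP32 vector (16 bytes), plus a toy model of the mechanism Uttar describes, where higher per-pixel register usage leaves room for fewer quads in flight (the loop is only a model, not measured NV35 behaviour):

[code]
# Tridam's numbers: 256 quads in flight, 8 FP32 registers per quad.
quads_in_flight = 256
regs_per_quad = 8
bytes_per_reg = 4 * 4                            # vec4 of 32-bit floats

total_regs = quads_in_flight * regs_per_quad     # 2048 FP32 registers
print(total_regs * bytes_per_reg // 1024, "KB")  # 32 KB, as stated

# If the register file is the bottleneck, heavier shaders fit fewer quads,
# so there is less latency hiding:
for regs_per_pixel in (2, 4, 8, 12):
    quads = total_regs // (regs_per_pixel * 4)   # 4 pixels per quad
    print(regs_per_pixel, "regs/pixel ->", quads, "quads in flight")
[/code]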
 
Uttar said:
Well, register usage is not as important in the NV35; and it'll be even less important in the NV40. But that doesn't mean it'll be gone.

I don't think so. I think the problem is different. Part of the register 'issue' applies within a single pass through the pipeline. Because of this, it is difficult to optimise so that all the units in the pipeline are used on every pass.

In NV30 the problem was slightly different.

Of course NV40 is a lot better about registers, and even if this problem is still present, we won't have to focus on it.
 
Are you sure about the amount of internal cache you're talking about?

If your numbers are true, it means that NV35 has less logic than R3x0 (more than 50 million transistors for 1 MB of plain cache, plus many more transistors since part of the cache is more complex).
 
Tridam said:
According to NVIDIA, NV35 has 256 quads in flight and 8 FP32 registers per quad.

-> 2048 FP32 registers -> 32 KB of cache for those registers

Yes, well, we both speak French, but that's no reason not to speak English on English forums ;)

Anyway...
Those numbers do surprise me quite a bit.
8 FP32 registers/quad -> 2 FP32 registers per pixel.
That would mean 1 FP32 register/pixel for the NV30...
Yet thepkrl's numbers clearly showed that 2 FP32 registers/pixel were FREE.

Unless they doubled the number of quads in flight in the NV35. I doubt that though, because then my 1.7x vs 3.4x numbers make no sense. Unless they did it purely through drivers; the drivers I tested with are much more recent than the ones thepkrl used. But that'd be a helluva driver improvement, eh!

Anyway, 32KB of cache seems awfully small. Maybe that's all they need for the registers, but I'd assume there's some per-register transistor overhead elsewhere in the architecture. If all it cost was 32KB, which is not even a 25th of the cache on the NV35... NVIDIA's engineers would seriously need a reality check.

Also, I'm not 100% sure of the 1MB cache number, but the source is rather reliable generally. Notice how NVIDIA said so proudly they had 60%+ logic on the GF4 and we never had an official number for the NV3x? ;)
Remember 1MB is for the NV30 though. Maybe they have less on the NV35, and they just added all that cache in a desperate move to get some units at least working a bit... Who knows.
Certainly, reducing that huge amount of cache would be a good way to find the transistors they needed to replace the FX12 units with FP16/FP32 ones.


Uttar
 
Uttar said:
Tridam said:
According to NVIDIA, NV35 has 256 quads in flight and 8 FP32 registers per quad.

-> 2048 FP32 registers -> 32 KB of cache for those registers

Yes, well, we both speak French, but that's no reason not to speak English on English forums ;)

Oops :D

I need to sleep more :p
 
Back on topic...has anyone heard anything official out of nVidia or FM since the retraction statement?

I haven't heard/read/seen any spin this morning and I'm having a serious sarcasm build-up. :(
 
digitalwanderer said:
Back on topic...has anyone heard anything official out of nVidia or FM since the retraction statement?

I haven't heard/read/seen any spin this morning and I'm having a serious sarcasm build-up. :(
Here you go ;)
 
Uttar said:
Also, I'm not 100% sure of the 1MB cache number, but the source is rather reliable generally. Notice how NVIDIA said so proudly they had 60%+ logic on the GF4 and we never had an official number for the NV3x? ;)

40% of the 63 million transistors used in the NV25 is actually almost exactly 512KB.
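
For reference, the arithmetic behind that claim, assuming standard 6-transistor SRAM cells and ignoring tag/overhead bits:

[code]
bits = 512 * 1024 * 8      # 512 KB of cache
transistors = bits * 6     # 6T SRAM cells -> ~25.2M transistors
print(transistors / 63e6)  # ~0.40, i.e. the "40% non-logic" share of NV25
[/code]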
 
zeckensack said:
digitalwanderer said:
Back on topic...has anyone heard anything official out of nVidia or FM since the retraction statement?

I haven't heard/read/seen any spin this morning and I'm having a serious sarcasm build up. :(
Here you go ;)
Thanks zeckensack! :D

Dig's daily dose of graphics melodrama, brought to us by the Inquirer, which said Futuremark said:
Well, I do not think that it adds any value to go back and forth on this forever. Us and them simply disagree on this one thing: Should application specific optimizations be allowed in 3DMark03. We continue to opinion no, while they argue yes.

Thus, 3DMark03 will not show the optimised performance for any hardware, not Nvidia's, not ATI's, not Matrox', etc. That is why we say that the performance scores are comparable. If someone wants to know what is the performance on a specific game with specific drivers, 3DMark03 is not the tool to use. In that case, the user should use that specific application with those drivers.

However, I'd like to say that it will be interesting to see how well Nvidia is able to develop the Unified Compiler Technology. I think it has a great potential. Let's give them time to work on that and hopefully we'll see great generic performance improvements in all applications as their technology matures.

Best regards,

Tero
Anyone heard Chuckle-boy's response to this from the dark & angry heart-o-darkness? :|
 
That post pre-dates ATi's response and nVidia's retraction, actually. (Well, maybe "dates" isn't the word to use in this case since it was all coming in very quickly, but I saw that post before seeing mention of the others on the sites. I sent links to ATi's response and nVidia's retraction to the Inq., but despite following the affair so closely, they have for whatever reason been slow to post those later releases.)
 
I'd guess the whole "This is not true" thing is just a way to give them more time, then.
They could either:

1) Find another insane theory later.
2a) Screw the replacements, never put them back again, and never comment on how come the score dropped.
2b) Screw the replacements, never put them back again, and be honest about it.
2c) Screw the replacements, never put them back again, and say they had 'certain problems in their driver which made the compiler overly aggressive, and these have now been fixed.'
3) Never comment on it again, but continue to put the optimizations back with every new driver release.

I'd love 2b, but that's not gonna happen. 2c, though, would still be very nice... 1 and 3 would simply be stupid on their part IMO.


BTW, Dig, read my PM at nV News?


Uttar
 
Uttar said:
I doubt that would take SO much cache... Although the NV30 already has over 1MB cache, while the R300 has like 500KB cache, only :!:

Uttar

No idea about the NV30, but your number makes no sense for the R300. What cache? There are so many. And none of them are even close to 500KB (not by a long shot).
 
sireric said:
No idea about the NV30, but your number makes no sense for the R300. What cache? There are so many. And none of them are even close to 500KB (not by a long shot).

I meant the total of all caches on the R300: texture cache, FIFO cache, compression technology (LMA on GeForces) cache, and so on.
If that's still wrong by a long shot, I'd be surprised if the GFFX figure was correct; although that person's NVIDIA sources are better than his ATI ones.


Uttar
 
BTW, regarding a way to beat NVIDIA at their own game if they want to continue cheating:
When reading the shader files, change them a bit randomly. A few easy and "annoying-for-nv" changes would be:

1) Randomly rename registers (search for R0 in the string, replace all instances with R3, all instances of R3 with R2, and so on)
2) Insert NOP operations at random points in the code
3) Randomly add "easy-to-optimize-by-driver" operations, such as MUL R0,R0,1 or ADD R0,R0,0. If ATI's compiler can also handle this (and I know NVIDIA's can), add operations whose results are never read again (e.g., if in the end only R0, R1 and R2 are still used, do MUL R3, R1, R2 and never use R3 again).

Just make sure you don't add so many that the shader goes beyond PS2.0 limits and fails to compile, though!
All of these could be bypassed by NVIDIA, but it'd take them an awful lot more time to work around than it'd take FutureMark to add, IMO.
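
To make the idea concrete, here's a rough sketch of what such a shuffler could look like, operating on PS2.0-style assembly text. It's purely illustrative: the register names, the c31.x/c31.y constants (assumed to hold 1.0 and 0.0), and the 20% insertion rate are made up, and a real implementation would need an actual parser plus checks against PS2.0 instruction and register limits.

[code]
import random
import re

def shuffle_shader(src: str, seed: int) -> str:
    """Randomly perturb PS2.0-style shader assembly text."""
    rng = random.Random(seed)

    # 1) Consistently rename temp registers r0..r11 throughout the shader.
    regs = [f"r{i}" for i in range(12)]
    renamed = regs[:]
    rng.shuffle(renamed)
    mapping = dict(zip(regs, renamed))
    src = re.sub(r"\br\d+\b", lambda m: mapping.get(m.group(0), m.group(0)), src)

    # 2) and 3) After random instructions, insert a NOP or an operation
    # the driver's compiler can trivially optimize away (x*1, x+0).
    out = []
    for line in src.splitlines():
        out.append(line)
        if line.strip() and rng.random() < 0.2:
            reg = rng.choice(renamed)
            out.append(rng.choice([
                "nop",
                f"mul {reg}, {reg}, c31.x",  # assumes c31.x holds 1.0
                f"add {reg}, {reg}, c31.y",  # assumes c31.y holds 0.0
            ]))
    return "\n".join(out)
[/code]

Every run with a different seed yields a functionally identical shader whose text no longer matches any driver-side lookup table.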


Uttar
 