NVIDIA Fermi: Architecture discussion

You didn't. However, you did point to the fact that it was not a 100% jump in performance so I simply pointed out that the theoretical increase was far from 100% to begin with.

You said "how could you paint this as a worse scaling". I did not paint anything, nor did I make any comparison between the vendors or their scaling.

And how do we know that Fermi's flops aren't also more usable compared to GT200? To Ail's point, random speculation isn't really going to get you anywhere.

They will indeed be more usable if we compare the figures "with MUL". But it's funny that anyone can speculate about how fast GF100 could be by looking only at FLOPs, while speculation based on the behaviour of past chip generations is supposedly worthless.
 
Yeah but there are a whole bunch of mitigating factors there. Even if there is a 20% shortfall from the nominal increase there are potential efficiency gains that could regain that loss.
The loss can't be regained, merely mitigated. I was surprised to see the shortfall was as large as 20%. The loss in games can be much higher (I've seen 60%).

Curiously, Vantage GT1:

http://www.xbitlabs.com/articles/video/display/radeon-hd5770-hd5750_13.html

shows 97% performance advantage for HD5870 over HD5770. GT2 shows 86%. There could be a clue there, I suppose...

Anyway, NVidia's working from a lower base. GT240 is great evidence of that.

I think it's fair to say the potential for those gains is far higher with Fermi than they were with Cypress given the architectural overhaul.
A lower base does tend to do that - hence the shock and awe of RV770.

Jawed
 
shows 97% performance advantage for HD5870 over HD5770. GT2 shows 86%. There could be a clue there, I suppose...
Unless the platform difference was exactly half as well (i.e. half the CPU performance, half the memory performance, half the bus performance, etc.) I'm not sure you can conclude too much that is graphics related alone.
 
We weren't discussing die sizes
Which is why that was just a side note.

I'm not sure that question is even relevant. In any case good luck with that considering ATI's shaders and texture units have been scaling in lock step for some time now. :)
I won't be able to distinguish between ALU and TEX, but that's not very relevant anyway. I'll show how much is dependent on the shader core, which is the only part that your 1% ALU and -22% TEX numbers make any difference with. Bandwidth, setup, and CPU are all part of the equation.

ATI's ALUs scale perfectly, as do NVidia's. There are plenty of tests out there that are not limited by BW or setup or CPU that show this. Using games as proof of ALU scaling is just stupid.
 
Unless the platform difference was exactly half as well (i.e. half the CPU performance, half the memory performance, half the bus performance, etc.) I'm not sure you can conclude too much that is graphics related alone.
3DMark makes their graphics tests have a very light CPU load. I suppose PCIe could be a bottleneck at some points, but it's doubtful.
 
Yes, that tells you that GT200 had poor margins. Low sales and bumpgate charges had absolutely nothing to do with that. I really expected you to put more effort into it. How about you account for the charges and then see how Nvidia did relative to the rest of the industry... you know, a real analysis? :)



Yes, and they make that very clear in nearly every conf call. Which is why the die size arguments are silly. However, last quarter's favorable results were not boosted by a resurgence in the professional markets, which obviously means that die size hasn't been disastrous for the consumer segment either.

Could not the argument be made, then, that if (as some propose) the profitability of graphics (GeForce) did indeed suffer, Tesla/Quadro (where bumpgate, die yields etc. have very little bearing) would mitigate any GeForce losses in pure profit-percentage terms? If 75% of profits come from GPGPU/HPC/workstation, then even if GeForce (consumer) graphics suffers huge losses (say a catastrophic 50%), a drop of, I don't know, 10% (from 20%) would in fact be huge, but nV's overall profitability wouldn't really show it as such.

So I would assume that both sides could be right: GeForce profitability could indeed have tanked, but this would not be represented in nV's numbers overall. I haven't looked at nV's quarterly, so I'm not sure whether nV breaks down the graphics division to include or exclude GPGPU, HPC and workstation separately.
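To put rough numbers on that scenario, here's a toy calculation (a sketch only - it takes the 75% professional share mentioned above and just assumes the remaining 25% is GeForce; none of these are NVIDIA's real figures):

Code:
#include <iostream>

// Toy illustration of the profit-mix argument above; every number is hypothetical.
int main() {
    // Assume 100 units of total profit, split 75/25 between the
    // professional segment (Tesla/Quadro) and consumer GeForce.
    double pro     = 75.0;
    double geforce = 25.0;

    // Suppose GeForce profit drops by a "catastrophic" 50% while the
    // professional segment holds steady.
    double geforce_after = geforce * 0.5;
    double total_before  = pro + geforce;        // 100.0
    double total_after   = pro + geforce_after;  // 87.5

    double overall_drop = (total_before - total_after) / total_before;
    std::cout << "GeForce profit drop: 50%\n";
    std::cout << "Overall profit drop: " << overall_drop * 100 << "%\n"; // 12.5%
}

So a 50% collapse on the consumer side would only show up as a ~12.5% dip in the company-wide figure - exactly the kind of masking being asked about.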
 
BTW, maybe I missed it over the last 4 pages, but did xman86 ever post any links to support his claim that nV "sold more GTX295s than ATI 4870X2s"? (The fact that X2s have been EOL for some time would negate supply claims.) I've looked around and couldn't find anything; Google sends me to Fruad and, ironically, back to this thread.
 
You said "how could you paint this as a worse scaling". I did not paint anything, nor did I make any comparison between the vendors or their scaling.

Yeah, sorry that wasn't directed at you specifically. Was just a comment on the overall contempt for GT200 :)

ATI's ALUs scale perfectly, as do NVidia's. There are plenty of tests out there that are not limited by BW or setup or CPU that show this. Using games as proof of ALU scaling is just stupid.

Yeah, it's almost perfect on stuff like 3DMark's Perlin noise, but those tests aren't particularly relevant, are they? It's hard to make a case for games being useless when evaluating the scaling of ALU+MEM+TEX altogether.

Could not the argument be made, then, that if (as some propose) the profitability of graphics (GeForce) did indeed suffer, Tesla/Quadro (where bumpgate, die yields etc. have very little bearing) would mitigate any GeForce losses in pure profit-percentage terms?

Why would there be any mitigating effect if the professional segment is still performing well below long-run averages (per their last quarterly CC)? Also, bumpgate is a charge directly against income - it doesn't matter where that income came from.
 
3DMark makes their graphics tests have a very light CPU load. I suppose PCIe could be a bottleneck at some points, but it's doubtful.
Second test is the space one with all the rocks, so I guess it has a huge triangle count - so we could be seeing a bit of a setup limit I suppose...

Jawed
 
Is there any data for how ALU-limited GTX285 is? Current games where only the ALU-clock is changed?

This is anecdotal at best, but here goes - DoW II @ 2560x1600 Max

Core: 648 Shader: 1296 Mem: 1350
Min: 9.9 Max: 77 Avg: 35.45

Core: 648 Shader: 1566 Mem: 1350
Min: 10.84 Max: 84.32 Avg: 37.7

Core: 702 Shader: 1566 Mem: 1250
Min: 10.28 Max: 86.39 Avg: 37.77

Core: 702 Shader: 1566 Mem: 1350
Min: 11.04 Max: 92.88 Avg: 40.66

I had to overclock my E8400 from 3.0 to 3.6GHz to break 35fps, so I don't know how much these numbers are still CPU limited. I'll be installing a Q9550 this weekend, so I might try again.

A lower base does tend to do that - hence the shock and awe of RV770.

Why does the base matter? 100% is 100%.
 
You(the driver) can change the code however you(it) want(s) at compilation time.

When applications use shaders, they don't send down the whole shader, just a handle. The driver then binds the HW ISA version of that shader to the HW.

Because the compilation doesn't happen on the fly. It's done long before the shader is invoked (usually while the game or level is loading).

Sorry if I go back to talking about the MADD-to-FMA substitution...
Actually, I don't think that the thread processor, which should be the one in charge of the substitution, can apply the change to a big bunch of shaders...
That's because it can only read what is in its internal buffers (which can contain only small portions of code). This means the substitution would have to be performed separately from compilation, so it wouldn't be done at compile time but rather at runtime.
At least this is what a guy much more knowledgeable than me on the matter explained to me... ;)

I would be very interested in listening to Rys' view on this...
 
This is anecdotal at best, but here goes - DoW II @ 2560x1600 Max

Core: 648 Shader: 1296 Mem: 1350
Min: 9.9 Max: 77 Avg: 35.45

Core: 648 Shader: 1566 Mem: 1350
Min: 10.84 Max: 84.32 Avg: 37.7

Core: 702 Shader: 1566 Mem: 1250
Min: 10.28 Max: 86.39 Avg: 37.77

Core: 702 Shader: 1566 Mem: 1350
Min: 11.04 Max: 92.88 Avg: 40.66

I had to overclock my E8400 from 3.0 to 3.6GHz to break 35fps, so I don't know how much these numbers are still CPU limited. I'll be installing a Q9550 this weekend, so I might try again.
Well, with core or memory clock increases of ~8% each producing ~8% more performance, it seems "fine" - though how much difference did the 20% overclock on your CPU make? The ALUs are making very little difference: 6% for 21% higher clocks.
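For reference, here's the arithmetic behind those percentages, taken straight from the clock/FPS figures quoted above (averages only; a back-of-the-envelope sketch that ignores the min/max numbers):

Code:
#include <cstdio>

// Back-of-the-envelope scaling check using the average FPS figures above.
int main()
{
    // Shader clock 1296 -> 1566MHz at fixed 648/1350 core/mem:
    double shaderClockGain = (1566.0 - 1296.0) / 1296.0;  // ~20.8%
    double shaderFpsGain   = (37.70 - 35.45) / 35.45;     // ~6.3%

    // Core clock 648 -> 702MHz at fixed 1566/1350 shader/mem:
    double coreClockGain = (702.0 - 648.0) / 648.0;       // ~8.3%
    double coreFpsGain   = (40.66 - 37.70) / 37.70;       // ~7.9%

    // Memory clock 1250 -> 1350MHz at fixed 702/1566 core/shader:
    double memClockGain = (1350.0 - 1250.0) / 1250.0;     // ~8.0%
    double memFpsGain   = (40.66 - 37.77) / 37.77;        // ~7.7%

    std::printf("shader: +%.1f%% clock -> +%.1f%% fps\n", shaderClockGain * 100, shaderFpsGain * 100);
    std::printf("core:   +%.1f%% clock -> +%.1f%% fps\n", coreClockGain * 100, coreFpsGain * 100);
    std::printf("memory: +%.1f%% clock -> +%.1f%% fps\n", memClockGain * 100, memFpsGain * 100);
}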

A complication is that texture coordinate interpolations by the ALUs mean that the shader clock could make some difference to texturing, so shading may be even less dependent upon ALU rate, if you exclude the interpolations. HD5870 has that complicating factor, of course.

HD5870 is up to 94% faster than HD5770 in this game:

http://www.computerbase.de/artikel/...ati_radeon_hd_5970/7/#abschnitt_dawn_of_war_2

That indicates HD5870 is ~44% faster than GTX285, which is slightly higher than the texture and fillrate advantage, despite having less bandwidth.

If you could wangle a ~45% core overclock on your GTX285, I wonder how close you'd get to HD5870? :p

I have to admit the way that's scaling on your system is a bit of a puzzler overall.

Why does the base matter? 100% is 100%.
You effectively said it yourself, an architectural overhaul tends to increase per-unit per-clock efficiency (highlighting the inefficiencies of the older design). R800 isn't such an overhaul (except, perhaps, in attribute interpolation) so...

Jawed
 
Sorry if I go back to talking about the MADD-to-FMA substitution...
Actually, I don't think that the thread processor, which should be the one in charge of the substitution, can apply the change to a big bunch of shaders...
That's because it can only read what is in its internal buffers (which can contain only small portions of code). This means the substitution would have to be performed separately from compilation, so it wouldn't be done at compile time but rather at runtime.
At least this is what a guy much more knowledgeable than me on the matter explained to me... ;)

I would be very interested in listening to Rys' view on this...
The thread processor doesn't work on RAW uncompiled code. The driver is responsible for converting the input shader/kernel/whatever into machine code. During that process it can quite easily convert MAD into FMA or even NOPs if it so chooses.

Now if an end-user were able to pass in machine code directly, then the chip would have to do the conversion itself, but this is not possible as the driver is always in between.
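For what it's worth, this is roughly what that boundary looks like from the application side with plain OpenGL (a sketch only - the GL calls are the standard ones, the rest is illustrative; it assumes a GL 2.0+ context and a loader such as GLEW):

Code:
#include <GL/glew.h>   // assumes a GL 2.0+ context and a function loader

// At load time the application hands the driver nothing but GLSL source
// text; the driver's compiler turns it into the GPU's machine code here.
GLuint BuildProgram(const char* vsSource, const char* fsSource)
{
    GLuint vs = glCreateShader(GL_VERTEX_SHADER);
    glShaderSource(vs, 1, &vsSource, nullptr);
    glCompileShader(vs);                      // source -> HW ISA, driver side

    GLuint fs = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(fs, 1, &fsSource, nullptr);
    glCompileShader(fs);

    GLuint prog = glCreateProgram();
    glAttachShader(prog, vs);
    glAttachShader(prog, fs);
    glLinkProgram(prog);
    return prog;                              // the app only keeps this handle
}

// Per draw, the application never touches machine code; it just binds the
// handle and the driver points the hardware at whatever it compiled earlier.
void Draw(GLuint prog)
{
    glUseProgram(prog);
    // glDrawArrays(...), etc.
}

Since the only things crossing the API are source text and a handle, the driver is free to emit MAD, FMA, or anything else in the machine code it hands to the chip.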
 
The thread processor doesn't work on RAW uncompiled code. The driver is responsible for converting the input shader/kernel/whatever into machine code. During that process it can quite easily convert MAD into FMA or even NOPs if it so chooses.

Now if an end-user were able to pass in machine code directly, then the chip would have to do the conversion itself, but this is not possible as the driver is always in between.

Would it be possible that Fermi's opcode for FMA is a reuse of the MAD one?
That would make things even more straightforward.
 
The thread processor doesn't work on RAW uncompiled code. The driver is responsible for converting the input shader/kernel/whatever into machine code. During that process it can quite easily convert MAD into FMA or even NOPs if it so chooses.

Now if an end-user were able to pass in machine code directly, then the chip would have to do the conversion itself, but this is not possible as the driver is always in between.

Obviously not.
So the CPU is responsible for compiling the instructions, in order to transform the code into something the GPU can actually read and process (machine language). That should be the only thing done at compile time.
Then the thread processor works on the compiled code to apply the optimizations contained in the drivers.
The problem is that it simply cannot optimize the whole bunch of code that has been compiled, because it cannot read outside of its registers, so it would optimize (in this case changing MAD to FMA) only small portions of the code at a time.

Correct me if I'm wrong. :smile:
 
Obviously not.
So the CPU is responsible for compiling the instructions, in order to transform the code into something the GPU can actually read and process (machine language). That should be the only thing done at compile time.
Then the thread processor works on the compiled code to apply the optimizations contained in the drivers.
The problem is that it simply cannot optimize the whole bunch of code that has been compiled, because it cannot read outside of its registers, so it would optimize (in this case changing MAD to FMA) only small portions of the code at a time.

Correct me if I'm wrong. :smile:
The driver's compiler does more than just convert the kernel into machine code. For example, there's certainly an optimizer in there. When the driver is converting the kernel into machine code it is free to do what it wants. It doesn't need to know the contents of different registers before doing a general optimization. If FMA is valid, then it is free to use it wherever it's suitable.

The GPU doesn't need to be aware of this MAD -> FMA process at all.
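If it helps, the substitution itself is nothing more exotic than a pass over the compiler's intermediate code while the machine code is being generated. A minimal sketch, with a completely made-up IR (the real compiler and instruction set are obviously far more involved):

Code:
#include <vector>

// Made-up intermediate representation, for illustration only.
enum class Op { Add, Mul, Mad, Fma, Other };

struct Instr {
    Op  op;
    int dst, srcA, srcB, srcC;   // register operands
};

// While generating machine code the driver can walk the instruction list
// and rewrite every MAD into an FMA (or anything else it likes). The GPU
// only ever sees the finished machine code, so it is none the wiser.
void LowerMadToFma(std::vector<Instr>& code)
{
    for (Instr& i : code) {
        if (i.op == Op::Mad)
            i.op = Op::Fma;      // same operands, single (fused) rounding
    }
}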
 
The driver's compiler does more than just convert the kernel into machine code. For example, there's certainly an optimizer in there. When the driver is converting the kernel into machine code it is free to do what it wants. It doesn't need to know the contents of different registers before doing a general optimization. If FMA is valid, then it is free to use it wherever it's suitable.

The GPU doesn't need to be aware of this MAD -> FMA process at all.

Well, I think that it should be aware, given that the only piece of hardware that can read what the video driver writes is the GPU... :LOL:

I think the key issue is this:

- can a GPU optimize its code by substituting all the MADDs with FMAs at once?

Chances are that it can't, because the optimization is done by the GPU - in particular by the thread processor, which reads only what's in its registers, and those can't physically contain all the recompiled code needed for an entire game level, let's say.
The driver can tell the TP "if you find a MADD, change it into an FMA", but that change is going to be performed separately for each bunch of code that fits into the TP's registers.
How this process is going to impact real-world performance, though, goes well beyond my knowledge...

Am I missing something? No flames, I just want to understand this kind of process better... :smile:
 