NVIDIA Fermi: Architecture discussion

The GPU is just a dumb state machine that does what the driver tells it to. The driver doesn't say to the GPU, "find a MADD and substitute an FMA for it", since the execution of that instruction is defined by the driver well before the GPU ever tries to run it.
 
I thought you would have your code written in a high-level language, and at some point the compiler has to figure out how to perform the x = a*b + c calculation. This is the part of the discussion I don't quite follow. At that point there's no MADD or FMA yet, only the statement that you want that result?

So there would never be a MADD at all, in the reasonable scenario that all such calculations default to an FMA, unless you make it clear you want a MADD.
Correct me, too :)
 
The GPU is just a dumb state machine that does what the driver tells it to. The driver doesn't say to the GPU, "find a MADD and substitute an FMA for it", since the execution of that instruction is defined by the driver well before the GPU ever tries to run it.

So what would the driver tell the GPU in this case?
Not the same thing a GT200 driver would tell it, I guess, given this "missing" MADD capability.
I'm just curious, because I'm wondering why, if it's so trivial, AMD went entirely the other way, with cores able to calculate both MADD and FMA...
I mean, if it's free and so simple, why waste time keeping the MADD capability? :?:
 
The compiler generates code that targets the chip ISA, though. So it's low-level enough to emit a MADD or an FMA, and the raw instruction is right there. Check out PTX for more; it's pretty much the raw opcodes.
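To make that concrete, here's a rough CUDA sketch (not from anything in this thread, purely illustrative): the same high-level a*b + c can be left to the compiler, or forced either way with the __fmaf_rn / __fmul_rn / __fadd_rn intrinsics.

```cuda
// Rough sketch: the same high-level "x = a*b + c" can end up as either a
// fused or an unfused multiply-add, depending on what the compiler is
// told (or allowed) to emit for the target.
__global__ void mul_add_variants(const float* a, const float* b,
                                 const float* c, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Left to the compiler: nvcc may contract this into a single FMA
    // (one rounding step) on hardware that supports it.
    float auto_lowered = a[i] * b[i] + c[i];

    // Explicitly fused multiply-add: a*b + c with a single rounding.
    float fused = __fmaf_rn(a[i], b[i], c[i]);

    // Explicitly separate multiply and add: two roundings, i.e. the
    // MADD-style result being discussed above.
    float unfused = __fadd_rn(__fmul_rn(a[i], b[i]), c[i]);

    out[i] = auto_lowered + fused + unfused;
}
```

Dumping the PTX (nvcc -ptx) shows which form was actually emitted, and the --fmad compiler flag controls whether the contraction is allowed at all.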
 
So what would the driver tell the GPU in this case?
Not the same thing a GT200 driver would tell it, I guess, given this "missing" MADD capability.
I'm just curious, because I'm wondering why, if it's so trivial, AMD went entirely the other way, with cores able to calculate both MADD and FMA...
I mean, if it's free and so simple, why waste time keeping the MADD capability? :?:
The driver would give it the raw instruction (see my last post). And it's not a question of wasting time, just area: it's quite feasible to design two ALUs, both capable of FMA, but only one of them also capable of MADD at the same rate.
 
I thought you would have your code written in a high-level language, and at some point the compiler has to figure out how to perform the x = a*b + c calculation. This is the part of the discussion I don't quite follow. At that point there's no MADD or FMA yet, only the statement that you want that result?

When the compiler's done with its high-level optimizations, it breaks the intermediate representation of the code into basic operations. "Basic" here is defined by the ISA of the chip: it would be a MADD for RV770 and an FMA for Cypress.

Each chip has its own ISA description baked into the driver. At shader compile time, the relevant ISA description is pulled from disk and machine code is generated accordingly.
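As a loose sketch of that flow (every name below is invented, this is not real driver code), the per-chip lowering idea looks something like this:

```cuda
// Invented illustration only: a per-chip lowering of "x = a*b + c".
// The point is that the "basic operation" depends on the target ISA the
// driver knows about, not on anything the GPU decides at run time.
#include <cstdint>
#include <vector>

enum class ChipIsa { RV770, GT200, Cypress, Fermi };

struct MachineShader { std::vector<uint32_t> words; };

MachineShader lower_mul_add(ChipIsa isa)
{
    MachineShader ms;
    switch (isa) {
        case ChipIsa::RV770:
        case ChipIsa::GT200:
            ms.words.push_back(0x01);   // pretend MADD encoding
            break;
        case ChipIsa::Cypress:
        case ChipIsa::Fermi:
            ms.words.push_back(0x02);   // pretend FMA encoding
            break;
    }
    return ms;
}
```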
 
Well, I think it should be aware of it, given that the only piece of hardware that can read what's written in a video driver is the GPU... :LOL:
That's not how it works at all. The driver builds buffers of commands for the GPU to process. The GPU doesn't "process" the driver.
A.L.M. said:
- can a GPU optimize its code by substituting all the MADDs with FMAs at once?
The GPU doesn't optimize code. The driver may enable performance features of the GPU ("interpret MADDs as FMA") but the GPU doesn't (yet) compile its own shader code.
A.L.M. said:
Chances are that it can't, because optimization is done by the GPU, in particular by the thread processor, which reads only what's written in its registers, and those can't physically hold all the recompiled code needed for, say, an entire game level.
Why would the GPU need to hold all the code at once anyway? As stated earlier, the driver runs on the CPU, not the GPU.
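A rough mental model of that split (nothing here corresponds to a real driver API, it's just to show where the work happens):

```cuda
// Purely illustrative: the driver, running on the CPU, records packets
// into a command buffer; the GPU's command processor then consumes them.
// The shader machine code referenced below was already compiled on the
// CPU; the GPU never compiles or "optimizes" anything itself.
#include <cstdint>
#include <vector>

struct CommandBuffer { std::vector<uint32_t> packets; };

void record_draw(CommandBuffer& cb, uint32_t shader_gpu_addr, uint32_t vertex_count)
{
    cb.packets.push_back(0xC0DE0001);      // pretend "bind shader" packet
    cb.packets.push_back(shader_gpu_addr); // points at precompiled machine code
    cb.packets.push_back(0xC0DE0002);      // pretend "draw" packet
    cb.packets.push_back(vertex_count);
}
```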
 
Sure, but why is there a need to always assume the worst for NVIDIA? That's usually what Charlie does, and it isn't really logical.
Not assume, but take into consideration as well.

Well, GT200 parts are in short supply right now, if not EOL'd.
Which, I bet, they are regretting now, considering the shortage of Cypress GPUs.

Curiously, Vantage GT1:

http://www.xbitlabs.com/articles/video/display/radeon-hd5770-hd5750_13.html

shows a 97% performance advantage for the HD 5870 over the HD 5770. GT2 shows 86%. There could be a clue there, I suppose...
As far as I know, Vantage runs many more simulations on the GPU at the Extreme settings than in Performance mode, instead of just upping the resolution and AA/AF levels:
http://www.pcgameshardware.de/aid,641615/3DMark-Vantage-im-PCGH-Benchmark-Test/Benchmark/Test/
This is from the whitepaper regarding the second game test, "New Calico":
• Almost entirely consists of moving objects
• No skinned objects
• Variance shadow mapping shadows
• Lots of instanced objects
• Local and global ray-tracing effects (Parallax Occlusion Mapping, True Impostors and volumetric fog)

All in all, it looks like Vantage puts more emphasis on the shaders than most games seem to do.


WRT setup limitation: the values for the HD 4890 vs. the OC'd 5770 don't make it look like that's the case, the two being neck and neck with each other. Bandwidth seems sufficient on the HD 5770, considering the HD 5870's scaling.



No, it's up to 94% faster in the presumably scripted introduction sequence of the first mission. While it may be a useful test WRT raw performance, it doesn't tell us much about behaviour while actually playing the game.

--
Oh, and while we're at it:
Here's a nice (not because we've done it, but because it's nice data in itself) comparison showing the improvements for GTX 280 and HD 4870 made possible by driver progress alone over the course of one year after their respective introduction:
http://www.pcgameshardware.de/aid,6...us-aktuelle-Treiber-im-Test/Grafikkarte/Test/
 
Well, I read Rys' very good piece over at TechReport, and yet I still can't get overly enthused about it. Why should I pay for transistors that do HPC when I'm not doing any of it myself? Same for the next Intel chip: they bung a graphics chip on there as well... why do I want to pay for something I will not use? I'd rather have something dedicated to what I am paying for.
 
Compute Shader is part of D3D11, so for good graphics performance in games, e.g. those that do post-processing using CS, you'll be using most of the chip. The ECC shouldn't make much difference in die size (a few percent?) and double-precision is almost entirely based on re-use of existing units (with a huge adder and some additional routing being the overhead).
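Just to illustrate the kind of pass meant by "post-processing using CS" (a real game would write it as an HLSL compute shader; the CUDA-flavoured sketch below is only meant to show the shape of the work, and everything in it is made up):

```cuda
// Illustrative only: a trivial full-screen post-process (desaturation),
// the sort of per-pixel pass a D3D11 title might run as a Compute Shader
// after the frame has been rendered.
__global__ void desaturate(const uchar4* src, uchar4* dst, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    uchar4 p = src[y * width + x];
    // Rec. 601 luma approximation.
    unsigned char l = (unsigned char)(0.299f * p.x + 0.587f * p.y + 0.114f * p.z);
    dst[y * width + x] = make_uchar4(l, l, l, p.w);
}
```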

I don't think it's a big deal at all.

Jawed
 
My 5850 has been delayed again. Right now it's a race between me actually getting my hands on that GPU and NV releasing some concrete information about Fermi. If I can get a date, performance numbers, or both that look favourable, then I might just cancel the 5850. Hard launch my ass.
 
My 5850 has been delayed again. Right now it's a race between me actually getting my hands on that GPU and NV releasing some concrete information about Fermi. If I can get a date, performance numbers, or both that look favourable, then I might just cancel the 5850. Hard launch my ass.

I can't help but notice that you already placed the HD 5850 in your signature, even though you don't have it. I wonder how many people do the same :)
 
As far as I know, Vantage runs many more simulations on the GPU at the Extreme settings than in Performance mode, instead of just upping the resolution and AA/AF levels:
http://www.pcgameshardware.de/aid,641615/3DMark-Vantage-im-PCGH-Benchmark-Test/Benchmark/Test/
This is from the whitepaper regarding the second game test, "New Calico":
• Almost entirely consists of moving objects
• No skinned objects
• Variance shadow mapping shadows
• Lots of instanced objects
• Local and global ray-tracing effects (Parallax Occlusion Mapping, True Impostors and volumetric fog)

All in all, it looks like Vantage puts more emphasis on the shaders than most games seem to do.
Yes, since it's scaling with GPU performance fairly well. Though ALUs don't seem to be relevant:

[chart: b3da029.png]

WRT setup limitation: the values for the HD 4890 vs. the OC'd 5770 don't make it look like that's the case, the two being neck and neck with each other. Bandwidth seems sufficient on the HD 5770, considering the HD 5870's scaling.
The fillrate graphs seem to indicate ATI is not pixel limited. GT2 tends to be pixel limited on NVidia.

[chart: b3da027.png]

[chart: b3da028.png]

No, it's up to 94% faster in the presumably scripted introduction sequence of the first mission. While it may be a useful test WRT raw performance, it doesn't tell us much about behaviour while actually playing the game.
And most websites don't use gameplay for testing...

--
Oh, and while we're at it:
Here's a nice (not because we've done it, but because it's nice data in itself) comparison showing the improvements for GTX 280 and HD 4870 made possible by driver progress alone over the course of one year after their respective introduction:
http://www.pcgameshardware.de/aid,6...us-aktuelle-Treiber-im-Test/Grafikkarte/Test/
I dare say I'm surprised to see a few substantial improvements for NVidia.

Jawed
 
I can't help but notice that you already placed the HD 5850 in your signature, even though you don't have it. I wonder how many people do the same :)
Well, if you see anyone with Fermi in their sig, I'd kind of doubt that one too.

I've got a long card in my system, with the HDD cage ripped out and the HDDs sitting on the bottom of my case to prove it, too. :p
 
Pixel limited? These cards are capable of billions of pixels per second and you're showing 40 million pixels per second in your graph.
11.6 fps for a 4,096,000-pixel frame is ~37 MP/s. Same goes for 1920x1200 and 1680x1050, with ~35 MP/s at 1280x1024.

Some stuff, like shadow maps I presume, is fixed in size regardless of resolution. So in addition to the vertex workload being pretty much static regardless of resolution, some of the pixel rendering passes are, too.

How would you characterise the bottlenecks of these two tests at various resolutions?

Jawed
 