NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. Rys

    Rys Graphics @ AMD
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,182
    Likes Received:
    1,579
    Location:
    Beyond3D HQ
The GPU is just a dumb state machine that does what the driver tells it to. The driver doesn't say to the GPU, "find a MADD and substitute an FMA for it", since the execution of that instruction is defined by the driver well before the GPU ever tries to run it.
     
  2. Blazkowicz

    Legend

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
I thought you would have your code written in a high-level language, and at some point the compiler has to figure out how to perform x = a*b + c. This is where I don't follow much of the discussion: at that point, there's no MADD or FMA yet? Only the assertion that you want that result.

So there would never be a MADD at all, in the reasonable scenario that all such calculations default to an FMA, unless you make it clear you want a MADD.
Correct me, too :)
     
  3. A.L.M.

    Newcomer

    Joined:
    Jun 2, 2008
    Messages:
    144
    Likes Received:
    0
    Location:
    Looking for a place to call home
So what would the driver tell the GPU in this case?
Not the same thing that a GT200 driver would, I guess, due to this "missing" MADD capability.
I am just curious, because I am wondering why, if it's so trivial, AMD went all the other way, with cores able to calculate both MADD and FMA...
I mean, if it's free and so simple, why waste time keeping the MADD capability? :?:
     
  4. Rys

    Rys Graphics @ AMD
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,182
    Likes Received:
    1,579
    Location:
    Beyond3D HQ
The compiler generates code that targets the chip ISA, though. So it's low-level enough to generate MADD or FMA, and the raw instruction is right there. Check out PTX for more, since it's pretty much the raw opcodes.
     
  5. Rys

    Rys Graphics @ AMD
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,182
    Likes Received:
    1,579
    Location:
    Beyond3D HQ
    The driver would give it the raw instruction (see my last post). And it's not wasting time, just area. It's quite feasible to design two ALUs, both capable of FMA, but only one also capable of MADD at the same rate.
     
  6. FrameBuffer

    Banned

    Joined:
    Aug 7, 2005
    Messages:
    499
    Likes Received:
    3
    so does that claim get sent to the FUD Bin for filing ?
     
  7. digitalwanderer

    digitalwanderer Dangerously Mirthful
    Legend

    Joined:
    Feb 19, 2002
    Messages:
    18,992
    Likes Received:
    3,533
    Location:
    Winfield, IN USA
    If you haven't already done so, it is safe to do so at this time. :yep2:
     
  8. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
When the compiler's done with its high-level optimizations, it breaks the intermediate representation of the code into basic operations. Basic here is defined by the ISA of the chip: a MADD for RV770, an FMA for Cypress.

Each chip has its own ISA baked into the driver. At shader-compile time, the relevant ISA description is pulled from disk and machine code is generated accordingly.
     
  9. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    That's not how it works at all. The driver builds buffers of commands for the GPU to process. The GPU doesn't "process" the driver.
    The GPU doesn't optimize code. The driver may enable performance features of the GPU ("interpret MADDs as FMA") but the GPU doesn't (yet) compile its own shader code.
    Why would the GPU need to hold all the code at once anyway? As stated earlier, the driver runs on the CPU, not GPU.
     
  10. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Not assume, but take into consideration as well.

Which, I bet, they are regretting now, considering the shortage of Cypress GPUs.

As far as I know, Vantage runs many more simulations on the GPU in the extreme preset than in performance mode, instead of just upping resolution and AA/AF levels:
http://www.pcgameshardware.de/aid,641615/3DMark-Vantage-im-PCGH-Benchmark-Test/Benchmark/Test/
This is from the whitepaper regarding the second game test, "New Calico":
    • Almost entirely consists of moving objects
    • No skinned objects
    • Variance shadow mapping shadows
    • Lots of instanced objects
    • Local and global ray-tracing effects (Parallax Occlusion Mapping, True Impostors and volumetric fog)

All in all, it looks like Vantage puts more emphasis on the shaders than most games seem to.


WRT the setup limitation - the values from HD 4890 vs. OC 5770 don't look like this is the case, the two being neck and neck with each other. Bandwidth seems sufficient on the HD 5770, considering HD 5870 scaling.


    No, it's up to 94% faster in the presumably scripted introduction sequence of the first mission. While it's maybe a useful test wrt raw performance, it doesn't tell us much about behaviour while really playing the game.

    --
Oh, and while we're at it:
    Here's a nice (not because we've done it, but because it's nice data in itself) comparison showing the improvements for GTX 280 and HD 4870 made possible by driver progress alone over the course of one year after their respective introduction:
    http://www.pcgameshardware.de/aid,6...us-aktuelle-Treiber-im-Test/Grafikkarte/Test/
     
    #1670 CarstenS, Nov 26, 2009
    Last edited by a moderator: Nov 26, 2009
  11. dizietsma

    Banned

    Joined:
    Mar 1, 2004
    Messages:
    1,172
    Likes Received:
    13
Well, I read Rys' very good piece over at TechReport and I still can't get overly enthused about it. Why should I pay for transistors that do HPC when I'm not doing HPC myself? Same for the next Intel chip: they bung a graphics chip on there as well... why do I want to pay for something I won't use? I'd rather have something dedicated to what I'm paying for.
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Compute Shader is part of D3D11, so for good graphics performance on games, e.g. those that do post-processing using CS, you'll be using most of the chip. The ECC shouldn't make much difference in die size (a few percent?) and double-precision is almost entirely based on re-use of existing units (with a huge adder and some additional routing being the overhead).

    I don't think it's a big deal at all.

    Jawed
     
  13. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,237
    Likes Received:
    4,260
    Location:
    Guess...
My 5850 has been delayed again. Right now it's a race between me actually getting my hands on that GPU and NV releasing some concrete information about Fermi. If I can get a date, performance numbers or both that look favourable, then I might just cancel the 5850. Hard launch my ass.
     
  14. Silus

    Banned

    Joined:
    Nov 17, 2009
    Messages:
    375
    Likes Received:
    0
    Location:
    Portugal
    I can't help but notice that you already placed the HD 5850 in your signature, even though you don't have it. I wonder how many people do the same :)
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Yes, since it's scaling with GPU performance fairly well. Though ALUs don't seem to be relevant:

[image]

    The fillrate graphs seem to indicate ATI is not pixel limited. GT2 tends to be pixel limited on NVidia.

[image]

[image]

    And most websites don't use gameplay for testing...

    I dare say I'm surprised to see a few substantial improvements for NVidia.

    Jawed
     
  16. digitalwanderer

    digitalwanderer Dangerously Mirthful
    Legend

    Joined:
    Feb 19, 2002
    Messages:
    18,992
    Likes Received:
    3,533
    Location:
    Winfield, IN USA
    Well if you see anyone with Fermi in their sig I'd kind of doubt that one too.

I got a long card in my system, with the HDD cage ripped out and the HDDs sitting on the bottom of my case to prove it, too. :razz:
     
  17. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    Pixel limited? These cards are capable of billions of pixels per second and you're showing 40 million pixels per second in your graph.
     
  18. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Where's that data in your image compiled from?
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    11.6 fps for a 4,096,000 pixel frame is ~37MP/s. Same goes for 1920x1200 and 1680x1050, with ~35MP/s at 1280x1024.

    Some stuff, like shadow maps I presume, is fixed in size regardless of resolution. So in addition to the vertex workload being pretty much static regardless of resolution, some of the pixel rendering passes are, too.

    How would you characterise the bottlenecks of these two tests at various resolutions?

    Jawed
     


  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.