NVIDIA Fermi: Architecture discussion

This has so much more impact when they say it than when you say it :) So care to give an attributed quote?
Yes

"Ships '09" -it was in a meeting with 2 officials .. i believe it was Drew (i have to review my video to see for sure)
- they also would not talk about performance but said "it will be the fastest" single-GPU (period!)


Is there a reason other than aliteracy for "Jensen" instead of Jen-Hsun (Huang)? 'Cos it's pissing me off...
Well, he will tell you to call him Jensen
.. it appears that *everybody* does
:p
 
Sure. But if your GPU can "emulate" a new feature then it's technically compatible.
Wasn't Intel planning on doing something like this for LRB's DX compatibility?

Possibly. I think that, for example, DX11 tessellation could be easily "emulated" on RV770 by using shaders to emulate the DX11 programmable path.
The question is that with emulation, performance is affected in some way, so a "native" HW solution would usually be preferable in a market that looks only at FPS numbers.
 
i forgot to mention .. i asked about the videocard that Jensen was holding up for us to see
(yes, i am that rude)

- "it is a production mock-up"
:p

. . . but they do have working Fermi silicon
(the closest anyone could get from any Nvidia official - as far as i can ascertain - for a firm Fermi GTX "number" comparison, was 1.6-1.8 times faster than the current GTX 285; that is not confirmed, however)
 
Possibly. I think that, for example, DX11 tessellation could be easily "emulated" on RV770 by using shaders to emulate the DX11 programmable path.
Why isn't that done then? And I'm pretty sure that you can't "emulate" DX11 tessellation on a GT200, for example.

The question is that with emulation, performance is affected in some way, so a "native" HW solution would usually be preferable in a market that looks only at FPS numbers.
That's a tough guess. Sometimes it may be smarter to use a software solution. For example, if GF100 really won't have a h/w tessellator, we'll see soon enough whether that's true for DX11-type tessellation.
 
I agree. In the past DirectX has set targets for what PC hardware should be capable of. Now that the IHVs are pushing the envelope, DirectX will become more of a hindrance than a help.

The G80 (+ successors) situation is unprecedented. An entire HW generation has emerged, lived on the market, and is about to be superseded, all while its important capabilities went unexposed by DirectX (applies to OGL as well, but it lagged even before that). The last 2 years are a testament to what a bottleneck a 3D API has become.
 
3. Vector lanes cannot simultaneously use the ALU and FPU; there simply isn't the register file bandwidth.

There are a couple of ways to get the necessary read bandwidth. If you run the FPU mantissa bits as a 24-bit int adder, and the normal ALU as 24-bit, note that 4 operands x 24 bits = 3 operands (a la FMA) x 32 bits. Of course, you'd have to build the cache to be able to extract data like that, which is likely a pita. But from a bit perspective, there's enough read bandwidth.
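
(Quick sanity check on that bit count - just my own arithmetic, nothing NV has said:

  FMA path:   3 operands x 32 bits = 96 bits per lane per issue
  Split path: 4 operands x 24 bits = 96 bits per lane per issue

so the read port width the register file has to supply comes out the same either way.)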

Or you could run one of the ALUs on every other cycle and pull one extra operand for it on each cycle, assuming that you aren't running FMAs. [and then we could have two years of articles regarding the 'missing alu' -- wouldn't that be fun? :) ]

However, I'm not sure what you do about write bandwidth :(

Separate notes:
I am surprised that the extra FMA precision will really cause problems in games. I am surprised that the FPU and ALU are really separate units. I wish Rys would explain the reasons behind 8 vs. 16 as well. Or if apoppin would just spill the beans :)

-Dave
 
Why isn't that done then? And I'm pretty sure that you can't "emulate" DX11 tessellation on a GT200, for example.

Performance issues? It's probably doable to convert a DX11 call into the tessellator + hull shader + domain shader path by emulating the hull and domain shader parts, but maybe it is not fast enough.
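
To make the "emulate the hull/domain part in software" idea concrete, here is a toy CUDA sketch - purely my own illustration, the kernel and names are made up, and it ignores how you'd feed the results back into the raster pipeline. One thread plays the role of one domain shader invocation for a flat triangle patch at integer tessellation factor N; the fixed-function tessellator's job (generating the domain points and connectivity) would also have to be done in software, which is probably where the real cost is.

#include <cstdio>
#include <cuda_runtime.h>

// One thread = one emulated domain shader invocation for a flat
// triangle patch tessellated at integer factor N.
__global__ void emulate_domain_shader(float3 p0, float3 p1, float3 p2,
                                      int N, float3 *out_verts)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int verts_per_patch = (N + 1) * (N + 2) / 2;      // triangular grid
    if (idx >= verts_per_patch) return;

    // Recover grid coordinates (i, j) from the linear index;
    // row i holds N + 1 - i points.
    int i = 0, row = N + 1, rem = idx;
    while (rem >= row) { rem -= row; --row; ++i; }
    int j = rem;

    // Barycentric domain location - the same thing the HW tessellator
    // would hand to a real domain shader.
    float u = (float)i / (float)N;
    float v = (float)j / (float)N;
    float w = 1.0f - u - v;

    // Flat interpolation of the control points; a real domain shader
    // would evaluate a curved patch (PN triangles, Bezier, ...) here.
    out_verts[idx] = make_float3(w * p0.x + u * p1.x + v * p2.x,
                                 w * p0.y + u * p1.y + v * p2.y,
                                 w * p0.z + u * p1.z + v * p2.z);
}

int main()
{
    const int N = 16;                                 // tessellation factor
    const int nverts = (N + 1) * (N + 2) / 2;
    float3 *d_out;
    cudaMalloc(&d_out, nverts * sizeof(float3));

    float3 p0 = make_float3(0.f, 0.f, 0.f);
    float3 p1 = make_float3(1.f, 0.f, 0.f);
    float3 p2 = make_float3(0.f, 1.f, 0.f);

    emulate_domain_shader<<<(nverts + 127) / 128, 128>>>(p0, p1, p2, N, d_out);
    cudaDeviceSynchronize();
    printf("emulated %d domain shader invocations\n", nverts);
    cudaFree(d_out);
    return 0;
}

The expensive part is not a kernel like this; it is everything around it - generating connectivity and pushing the created triangles back through setup - which is likely where the "maybe it is not fast enough" comes from.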

As for the second, IIRC you could use geometry shaders to do tessellation, but it is probably dead slow as an implementation.

That's a tough guess. Sometimes it may be smarter to use a software solution. For example, if GF100 really won't have a h/w tessellator, we'll see soon enough whether that's true for DX11-type tessellation.

Yes, it is the same situation as Larrabee, which throws almost everything fixed-function away but keeps the TMUs
 
There are a couple of ways to get the necessary read bandwidth. If you run the FPU mantissa bits as a 24-bit int adder, and the normal ALU as 24-bit, note that 4 operands x 24 bits = 3 operands (a la FMA) x 32 bits. Of course, you'd have to build the cache to be able to extract data like that, which is likely a pita. But from a bit perspective, there's enough read bandwidth.

Or you could run one of the ALUs on every other cycle and pull one extra operand for it on each cycle, assuming that you aren't running FMAs. [and then we could have two years of articles regarding the 'missing alu' -- wouldn't that be fun? :) ]

However, I'm not sure what you do about write bandwidth :(

Separate notes:
I am surprised that the extra FMA precision will really cause problems in games. I am surprised that the FPU and ALU are really separate units. I wish Rys would explain the reasons behind 8 vs. 16 as well. Or if apoppin would just spill the beans :)

-Dave

i really don't know; i am trying to set up another session for my site .. i will try to get clarification on more architectural details

Anything else you reasonably want me to get details on? No promises, but i will get what info i can
 
There seems to be a bit of confusion here about a number of things in terms of Fermi's execution units. I recommend this page of my article:

http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932&p=7

1. Fermi cores have 32 vector lanes (CUDA cores in NV's bastardized and retarded terminology). Each pipeline is 16 vector lanes.
2. Each vector lane has a 32-bit ALU and a 32-bit FPU
3. Vector lanes cannot simultaneously use the ALU and FPU; there simply isn't the register file bandwidth.
4. 64-bit FPU ops use both pipelines; 64-bit integer operations simply take longer, sometimes a lot longer (e.g. 64b IMUL = 4x slower than 32b IADD/ISUB)
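
A crude way to see point 4 for yourself once hardware is in hand - my own sketch, nothing official, and it times dependent multiply-add chains (latency-bound) rather than the exact IMUL-vs-IADD throughput figure above, so treat the ratio as a rough indicator only:

#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// Dependent 32-bit integer multiply-add chain (loop-carried, so the
// compiler cannot fold the loop away).
__global__ void chain32(uint32_t *out, int iters)
{
    uint32_t x = threadIdx.x + 1;
    for (int i = 0; i < iters; ++i)
        x = x * 0x01000193u + 1u;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

// Same thing in 64-bit integers.
__global__ void chain64(uint64_t *out, int iters)
{
    uint64_t x = threadIdx.x + 1;
    for (int i = 0; i < iters; ++i)
        x = x * 0x100000001B3ull + 1ull;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main()
{
    const int iters = 1 << 20, blocks = 60, threads = 256;
    uint32_t *d32; uint64_t *d64;
    cudaMalloc(&d32, blocks * threads * sizeof(uint32_t));
    cudaMalloc(&d64, blocks * threads * sizeof(uint64_t));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms32 = 0.f, ms64 = 0.f;

    chain32<<<blocks, threads>>>(d32, iters);          // warm-up launch
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    chain32<<<blocks, threads>>>(d32, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms32, start, stop);

    cudaEventRecord(start);
    chain64<<<blocks, threads>>>(d64, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms64, start, stop);

    printf("32-bit chain: %.2f ms, 64-bit chain: %.2f ms, ratio %.2fx\n",
           ms32, ms64, ms64 / ms32);

    cudaFree(d32); cudaFree(d64);
    return 0;
}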

Fermi definitely has TMUs and ROPs. They didn't talk about them, but they are certainly there.

David


Hi,

I checked the Fermi diagram again and i saw that the load/store units are 16 per SM.
The load/store units in GT200 are 8 per TPC.
Loads and stores are issued to the SM Controller and handled in the texture pipeline, correct?
Is there a correlation between the number of load/store units and the number of texture units?
I mean, if there is, we have:

The load/store units in Fermi are 16 per SM: 16 SMs x 16 = 256.
The load/store units in GT200 are 8 per TPC: 10 TPCs x 8 = 80
 
That's a tough guess. Sometimes it may be smarter to use a software solution. For example, if GF100 really won't have a h/w tessellator, we'll see soon enough whether that's true for DX11-type tessellation.

I think it would be much better for your arguments if you were more balanced (i.e. less biased towards nVidia). In all your posts I see the black and white theme of nVidia = always right.

So it turns out nVidia will not have a HW tessellator, and this is your cue to start pushing it as possibly better than ATI's HW tessellator? So predictable...
 
That might not be a good thing given the stability and progress we've had in PC gaming due to adherence to a common API. Unless by "graphics" you mean something other than graphics for games/the PC space, e.g. simulation/animation.

I don't think Nvidia is intending to break away from DX compatibility. (...)
I don't think we'll see any breaking of compatibility coming, but rather a decline of DX as the driving force in the graphics world (gaming included).

We've been getting clues during recent years:
- lukewarm reception of DX10-11 by Carmack and Sweeney
- Larrabee PR
- DX lagging 2 years behind HW (G80)

nVidia's event seems like the final straw.

I believe the only part of DX11 that will make an impact is DC (DirectCompute). Even now, in a modern game graphics engine there are tons of tasks that don't require geometry data or rasterisation (deferred lighting & shadow mapping, SSAO, bloom, HDR tone mapping, motion blur, particles, etc). With DC there are possibilities for new techniques, or optimisations of the old ones. Perhaps with new rasterisation hybrid engines the percentage of engine work covered by DC will grow. DC will probably see some updates. But the API as a whole, with its old pipeline supersized by 4 new geometry shader stages, is going to look like the tail trying to wag the dog.
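
As a concrete illustration of that kind of pass (written as CUDA for brevity rather than DirectCompute HLSL; the kernel and the simple Reinhard operator are just an example, not from any particular engine): HDR tone mapping is purely per-pixel data-parallel work, with no geometry or rasterisation anywhere.

#include <cuda_runtime.h>

// Reinhard-style tone mapping of a packed RGB float buffer.
// Purely per-pixel data-parallel work - no vertices, no triangles,
// no rasterisation - i.e. a natural fit for a compute API.
__global__ void tonemap_reinhard(const float3 *hdr, float3 *ldr,
                                 int width, int height, float exposure)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int i = y * width + x;
    float3 c = hdr[i];
    c.x *= exposure; c.y *= exposure; c.z *= exposure;

    // Simplest Reinhard variant: L / (1 + L), per channel.
    ldr[i] = make_float3(c.x / (1.0f + c.x),
                         c.y / (1.0f + c.y),
                         c.z / (1.0f + c.z));
}

// Typical launch for a 1920x1080 buffer:
//   dim3 block(16, 16), grid((1920 + 15) / 16, (1080 + 15) / 16);
//   tonemap_reinhard<<<grid, block>>>(d_hdr, d_ldr, 1920, 1080, 1.0f);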
 
i really don't know; i am trying to set up another session for my site .. i will try to get clarification on more architectural details

Anything else you reasonably want me to get details on? No promises, but i will get what info i can

The 8 vs. 16 was in regards to TMU count, which you indicated you had last night.... :)

I think the real fun bit would be to try to get them to spill the beans on architectural differences between chief scientists. Letting them tell you the differences in the leanings of their top scientists could give you hints for the next arch (Gauss? -- that's what we called our post-Fermi release :shrug:)

Real details are going to have to wait for actual cards and tests. 1.6x is pretty low, but would certainly bolster the case for 8 bilerps vs. 16 (for example).

I think the other question floating around in my head was to what extent the improvements in efficiency improve overall performance. How much does the second dispatch unit help, how much does the faster context switch help, how much does letting more than one kernel run at the same time help, etc.

Thanks and good luck, wish I was there, but we have had an awful release cycle these past couple of days, and I get to do perf work for the next few days, which is where all the fun in a release is anyway :)

-Dave
 
The 8 vs. 16 was in regards to TMU count, which you indicated you had last night.... :)

I think the real fun bit would be to try to get them to spill the beans on architectural differences between chief scientists. Letting them tell you the differences in the leanings of their top scientists could give you hints for the next arch (Gauss? -- that's what we called our post-Fermi release :shrug:)

Real details are going to have to wait for actual cards and tests. 1.6x is pretty low, but would certainly bolster the case for 8 bilerps vs. 16 (for example).

I think the other question floating around in my head was to what extent the improvements in efficiency improve overall performance. How much does the second dispatch unit help, how much does the faster context switch help, how much does letting more than one kernel run at the same time help, etc.

Thanks and good luck, wish I was there, but we have had an awful release cycle these past couple of days, and I get to do perf work for the next few days, which is where all the fun in a release is anyway :)

-Dave

There is a MASS of details that i am still working thru - i have a BOOK of notes and many GBs of HD video and still images; the conference is all day - and i *finally* got some sleep last night :p
the worst thing i could do is put out wrong information

i do know that they recruited Bill Dally as their chief scientist because of his background in parallel processing ... that IS the direction they are heading in - without ignoring gaming [they like DX11; Fermi architecture has improved quite a bit for gaming (especially PhysX with their "gigathread" feature)
i.e. each solver is a kernel that executes on GT200 only when the kernel before it is done; with Fermi, they can execute without waiting for the one before it to be completed]

.. So they effectively shortened the idle time so that all 512 cores can execute more efficiently (and i have to look at my video for more details as this is from memory and scribbled notes)
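
In programming-model terms that maps onto independent kernels launched into separate CUDA streams: on GT200-class hardware they still run back to back, while on Fermi the scheduler is allowed to overlap them. A minimal sketch (the "solver" kernels here are placeholders, not real PhysX code):

#include <cuda_runtime.h>

// Placeholder "solver" kernels: each touches only its own buffer, so
// there is no dependency between them.
__global__ void solverA(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}
__global__ void solverB(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 0.5f;
}

int main()
{
    const int n = 1 << 16;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t sA, sB;
    cudaStreamCreate(&sA);
    cudaStreamCreate(&sB);

    // Launched into different streams: on GT200-class parts the second
    // kernel still waits for the first to drain; on Fermi the hardware
    // is allowed to run them concurrently, which is the shortened idle
    // time described above.
    solverA<<<n / 256, 256, 0, sA>>>(a, n);
    solverB<<<n / 256, 256, 0, sB>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(sA);
    cudaStreamDestroy(sB);
    cudaFree(a);
    cudaFree(b);
    return 0;
}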

and i gotta run .. i have to be back at the conference at 8:30 am
 
I think it would be much better for your arguments if you were more balanced (i.e. less biased towards nVidia). In all your posts I see the black and white theme of nVidia = always right.
I think you need to take a look in the mirror.

So it turns out nVidia will not have a HW tessellator, and this is your cue to start pushing it as possibly better than ATI's HW tessellator? So predictable...
Is it impossible for a s/w-based tessellator to end up being better than AMD's h/w one? How's that mirror thing going?
 
I don't think we'll see any breaking of compatibility coming, but rather a decline of DX as the driving force in the graphics world (gaming included).

We've been getting clues during recent years:
- lukewarm reception of DX10-11 by Carmack and Sweeney
- Larrabee PR
- DX lagging 2 years behind HW (G80)

nVidia's event seems like the final straw.

I believe the only part of DX11 that will make an impact is DC (DirectCompute). Even now, in a modern game graphics engine there are tons of tasks that don't require geometry data or rasterisation (deferred lighting & shadow mapping, SSAO, bloom, HDR tone mapping, motion blur, particles, etc). With DC there are possibilities for new techniques, or optimisations of the old ones. Perhaps with new rasterisation hybrid engines the percentage of engine work covered by DC will grow. DC will probably see some updates. But the API as a whole, with its old pipeline supersized by 4 new geometry shader stages, is going to look like the tail trying to wag the dog.

Err, how exactly is DX lagging 2 years behind HW? DX10 was out (even if not for public retail sale, it was available to MSDN/Connect etc. subscribers) in the same month as G80 was released
 
Err, how exactly is DX lagging 2 years behind HW? DX10 was out (even if not for public retail sale, it was available to MSDN/Connect etc. subscribers) in the same month as G80 was released

He's referring to hardware features that are beyond the DX spec.

Is it impossible for a s/w-based tessellator to end up being better than AMD's h/w one?

It's not impossible but is it likely? To be honest, I'm hoping that we are missing something big and that the good ole rasterization pipeline gets a shake up on Fermi. They've already exceeded expectations architecturally so now it's time to deliver the goods. That report of 1.6x GTX285 is far from exciting though. 2x units + caching + higher efficiency => 1.6x performance => boring.


That label doesn't tell me anything at all.
 
I think you need to take a look in the mirror.


Is it impossible for a s/w-based tessellator to end up being better than AMD's h/w one? How's that mirror thing going?

I am just suggesting you tone down your bias a little. I would be happy if they had the card out now and it performed great...
 
It's not impossible but is it likely?
Why not? 16 SMs with 512 SPs is bigger than 1 tessellator. The question is whether Fermi was designed with tessellation performance in mind or whether it was an afterthought. In the second case it's certainly not likely. But that would mean that NV forgot to implement one of the key DX11 features. How likely is that?
We'll see. I think that both outcomes are possible.

To be honest, I'm hoping that we are missing something big and that the good ole rasterization pipeline gets a shake up on Fermi. They've already exceeded expectations architecturally so now it's time to deliver the goods. That report of 1.6x GTX285 is far from exciting though. 2x units + caching + higher efficiency => 1.6x performance => boring.
It's the same as with Cypress but with a much more potent underlying architecture comparable to LRB. If that's boring then I don't know what's not.

I am just suggesting you tone down your bias a little.
That's the problem. I don't have any bias. You're seeing what doesn't exist.
 
It still kind of makes me pause when people turn up their noses at 1.6x scaling in a concurrent processing environment.

There are workloads and applications where people would murder their mothers to get that kind of scaling.

Here are some things we don't know scaled:
The crossbar/whatever that gets read data to the L1s.
In Cypress, this read-only crossbar scaled only by clock speed.
I didn't see mention of Fermi's bandwidth in this area, nor what has happened to the implementation since there is now data flowing in the write direction.

The writes may not have as much bearing in games because of how heavily current game workloads seem to lean on reads.

Memory bandwidth has not scaled by over 2x.
Triangle setup and rasterization rates are not known, and possibly have not doubled.
AMD doubled (maybe, sorta, from a certain point of view, who knows) its rasterizers. The triangle setup rate so far doesn't appear to have changed per clock. Can rasterizers double but still only expect to have one triangle sent to them by the setup pipeline? Does tessellation bypass this?

ROPs?
TMUs?
CPU limitations?
 