NVIDIA Fermi: Architecture discussion

Does anybody here know for certain how Fermi executes warp instructions? I'm not sure if I understand the white paper exactly, but is a warp of 32 threads really finished in two clock cycles by a group of 16 cores?

There is no "sorta" about it. There is 2x the raster rate there.
There is a difference between "raster rate" and "dual rasterizers" and you know that ;)
 
Nvidia GF100: 128 TMUs and 48 ROPs

http://www.hardware-infos.com/news.php?news=3228

If I summarize, we have the following facts:

- 40 nm
- 3.0 billion transistors
- 512 SPs
- 128 TMUs
- 48 ROPs
- 384 Bit GDDR5

And, as you were able to confirm many times in the past, all SP's are MIMD, right?

BTW, according to your article, the clock speeds have been changed compared to the previous stepping (as 'reported' here on May 15th). Now the chip that was shown this week was also an A1 stepping, yet produced in week 35? Isn't that quite a bit later than May 15th?

What about the reported 2547 GigaFlops that were confirmed in the same earlier article? And the gigantic bandwidth due to the 512bit bus reported also in May? You reported then about 2.4B transistors. Did they add a whole lot of logic in between the first stepping and the next first stepping?

Do you have any real source at all, other than the writings of the brilliant Theo Valich who penned down pretty much identical nonsense? Did you get anything right at all?
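
For reference, here's a quick back-of-the-envelope sketch of how headline GFLOPS and bandwidth figures are normally derived from a spec sheet like the one above. The shader clock, FLOPs-per-clock factor and memory data rate in it are placeholder assumptions for illustration only, not confirmed numbers:

```python
# Rough peak-throughput arithmetic for a rumoured GPU spec.
# The clock and data-rate values are illustrative assumptions, not confirmed figures.

def peak_gflops(num_sps, shader_clock_ghz, flops_per_sp_per_clock):
    """Peak shader throughput in GFLOPS."""
    return num_sps * shader_clock_ghz * flops_per_sp_per_clock

def peak_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s for a given bus width and per-pin data rate."""
    return (bus_width_bits / 8) * data_rate_gbps

# 512 SPs from the article; a 1.5 GHz hot clock and 2 FLOPs/clock (FMA) are assumptions.
print(peak_gflops(512, 1.5, 2))        # ~1536 GFLOPS
# 384-bit GDDR5 from the article; 4.0 Gbps per pin is an assumption.
print(peak_bandwidth_gbs(384, 4.0))    # ~192 GB/s
```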
 
Does anybody here know for certain how Fermi executes warp instructions? I'm not sure if I understand the white paper exactly, but is a warp of 32 threads really finished in two clock cycles by a group of 16 cores?


There is a difference between "raster rate" and "dual rasterizers" and you know that ;)

Why bother reading white papers when you could read something more informative?

http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932&p=7

Many warps finish in two cycles, but not all.

David
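
As a minimal sketch of the arithmetic behind "two cycles per warp" (the warp size and the 16-wide core group come from the whitepaper; the 4-wide SFU pipe is the usual explanation for why not every warp fits the two-cycle case):

```python
import math

def issue_cycles(warp_size, pipe_width):
    """Cycles needed to push one warp through an execution pipe of the given width."""
    return math.ceil(warp_size / pipe_width)

WARP_SIZE = 32
print(issue_cycles(WARP_SIZE, 16))  # 16 cores per group -> 2 cycles for the common case
print(issue_cycles(WARP_SIZE, 4))   # 4 SFUs per SM -> 8 cycles, hence "not all"
```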
 
One serves one half of the tiles, the other serves the other half of the tiles.
Hm okay. Somebody told me that only the raster units that generate quads were doubled and that it was still only one rasterizer. But I believe you on this point.

Why bother reading white papers when you could read something more informative?
Yeah, but this point really is strange to me. Are you sure they didn't misunderstand something?

It's very hard for me to believe that every instruction can be retired in only two clocks at such a high speed, but then again the doubled number of SFUs points in that direction too.
 
Can rasterizers double but still only expect to have one triangle sent to them by the setup pipeline?
Dave answered this question, but there are probably multiple reasons for dual rasterizers without increasing the setup rate. One is that by working on two triangles or tiles in parallel, you speed up rendering of triangles that don't cover multiple tiles.

I don't recall the number of ROPs ever influencing the number of (marketed) rasterizers. If so, why is it that 8 and 4 ROP derivatives weren't marketed as having fewer rasterizers or vice versa? I find it hard to believe that AMD just up and decided to change that trend.
Previous designs rasterized one triangle at a time so all pixels had to come from one triangle to achieve full rate. As Dave said each now works on a different group of tiles.
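
To make the "each rasterizer owns a different group of tiles" idea concrete, here's a minimal sketch of one plausible assignment scheme (a simple checkerboard over screen tiles). The tile size and the exact mapping are my assumptions, not necessarily what the hardware actually does:

```python
TILE_SIZE = 16  # assumed screen-tile size in pixels

def rasterizer_for_pixel(x, y):
    """Map a pixel to one of two rasterizers via a checkerboard of screen tiles.
    Pixels in tiles owned by rasterizer 0 are scanned by it, the rest by
    rasterizer 1 -- so both can chew on the same large triangle, or on two
    different small triangles, at the same time."""
    tile_x, tile_y = x // TILE_SIZE, y // TILE_SIZE
    return (tile_x + tile_y) & 1

# A large triangle's pixels end up split across both rasterizers:
print({rasterizer_for_pixel(x, y) for x in range(0, 64, 8) for y in range(0, 64, 8)})  # {0, 1}
```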
 
Hm okay. Somebody told me that only the raster units that generate quads were doubled and that it was still only one rasterizer. But I believe you on this point.


Yeah, but this point really is strange to me. Are you sure they didn't misunderstand something?

It's very hard for me to believe that every instruction can be retired in only two clocks at such a high speed, but then again the doubled number of SFUs points in that direction too.

The pipeline for a warp is much longer than 2 cycles. A warp can be executed in two cycles though...

Remember, like all real pipelines:

Fetch-->Decode-->Issue-->Execute-->Write Back

Execute may only take two cycles, the rest takes a VERY long time, especially since the first three stages occur at the speed of the scheduler (slow clock).

I think someone measured GT200's pipeline as 28 stages...if that's true, then going from 4 execute cycles to 2 execute cycles doesn't make a big difference.

David
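
A rough sketch of the latency-versus-throughput point David is making. Only the "execute takes two hot clocks" part comes from the discussion above; the front-end stage count, write-back count and the 2:1 hot-clock ratio are assumptions for illustration:

```python
# Latency of one warp through the whole pipeline vs. the rate at which warps retire.

HOT_CLOCKS_PER_SCHED_CLOCK = 2   # assumed ratio of ALU (hot) clock to scheduler (slow) clock

front_end_sched_clocks = 12      # fetch/decode/issue at the slow clock (assumption)
execute_hot_clocks = 2           # 32-thread warp over 16 lanes
writeback_hot_clocks = 4         # assumption

latency_hot_clocks = (front_end_sched_clocks * HOT_CLOCKS_PER_SCHED_CLOCK
                      + execute_hot_clocks + writeback_hot_clocks)
print("per-warp latency (hot clocks):", latency_hot_clocks)  # ~30: the pipeline is long
print("issue rate (hot clocks/warp): ", execute_hot_clocks)  # 2: yet a warp can retire every 2
```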
 
Yes, I understand that, but still, the execution stage for all instructions, including DP with full denorm support (!), would need to be done in two clock cycles at >1.5GHz. Even ATI, with much lower clocked ALUs, uses 4 cycles. And what exactly is the benefit for Fermi?

I find this astonishing, but perhaps I'm missing something.
 
Yes, I understand that, but still, the execution stage for all instructions, including DP with full denorm support (!), would need to be done in two clock cycles at >1.5GHz. Even ATI, with much lower clocked ALUs, uses 4 cycles. And what exactly is the benefit for Fermi?

I find this astonishing, but perhaps I'm missing something.

ATI's wavefronts are 64 threads, so 2x larger than NV's 32-thread warps.

Also, note that while DP is done in two cycles, they have to use ALL the vector lanes to achieve that.

Anyway, the speed of your FMA isn't really relevant. If you want to be impressed, look at POWER6 or POWER7. POWER6 can do an FMA in 7 cycles @ 5GHz, and it's fully pipelined.

Ultimately FPU latency/throughput is all a trade off in terms of power, area, etc.

DK
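
A tiny sketch of the lane-ganging point: if DP borrows all the vector lanes, peak DP throughput ends up at half of SP even though both finish a warp in two cycles. The 512 cores and 1.5 GHz hot clock below are assumptions for illustration:

```python
# Peak FMA throughput when double precision gangs pairs of SP lanes together.
CORES = 512            # assumed total SP lanes
HOT_CLOCK_GHZ = 1.5    # assumed ALU clock
FLOPS_PER_FMA = 2      # one fused multiply-add counts as two FLOPs

sp_gflops = CORES * HOT_CLOCK_GHZ * FLOPS_PER_FMA         # every lane does one SP FMA per clock
dp_gflops = (CORES // 2) * HOT_CLOCK_GHZ * FLOPS_PER_FMA  # two lanes team up for each DP FMA

print(sp_gflops, dp_gflops)  # 1536.0 768.0 -> DP peaks at exactly half of SP
```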
 
And, as you were able to confirm many times in the past, all SP's are MIMD, right?

BTW, according to your article, the clock speeds have been changed compared to the previous stepping (as 'reported' here on May 15th). Now the chip that was shown this week was also an A1 stepping, yet produced in week 35? Isn't that quite a bit later than May 15th?

What about the reported 2547 GigaFlops that were confirmed in the same earlier article? And the gigantic bandwidth due to the 512bit bus reported also in May? You reported then about 2.4B transistors. Did they add a whole lot of logic in between the first stepping and the next first stepping?

Do you have any real source at all, other than the writings of the brilliant Theo Valich who penned down pretty much identical nonsense? Did you get anything right at all?
You made my sig. Thanks silent_guy.

It would make much more sense for them to do a mid-range part first to compete against Cypress directly, yes. But it looks like they're not planning on competing with AMD again; they're planning to hit the high-margin market, which in turn should allow them to sell consumer graphics cards at a subsidised price until they have mainstream GF10x chips. Maybe they're right, who knows.
But until when?
Is it late? A1 made in August kinda destroys that theory, no? As for the launch date, you have to show products for that and no products were shown.
Taping out just when your competitor is almost ready with their launch is late. Come on, you have to concede this point. Even Nvidia has agreed GF100 is late. "It's fucking hard"

I think it's pretty obvious that they'll need a dual chip AFR card to beat Hemlock. Whether or not it'll be GF100-based is another question.
Well, from what Fuad/Apoppin were told, it seems like they will have a dual-chip card based on GF100.


I must have missed Arty's post...

Arty,

It's an ever-repeating rumor in the rumor mill so far that the die is supposed to be roughly on GT200b's level.

That's not a small die, or anything even remotely close to what I'd call a performance chip. The message so far in the pipeline was rather "not as huge as GT200@65nm".

http://www.techreport.com/articles.x/17670

I think it's fairly obvious that the author is speculating too.
Yeah, that is what has me disappointed. It's still larger than 500mm². I would have been more than happy if Nvidia had managed to get 600 or more SPs on a chip that size. I think it was Jawed who put it aptly: this could have been Nvidia's RV770.

And no, I was in no way, shape, or form insinuating that you were affiliated with Nvidia. We (you, Deg & me) had this discussion last week, so I included you.
 
It could have been their RV770, but they are being more ambitious and simultaneously conservative, putting off a shrinkage/optimization pass until the next revision, opting instead to rearchitect pieces.

It looks like they want to absorb the pain upfront now of switching the architecture yet again, rather than do a generation that is merely a refinement/optimization, followed later by bigger changes.

If they had simultaneously made all of the architectural changes, and also concentrated on tweaking all of the units for size, the card might have been delayed further.

Of course, there are limits to how dense they can go, but I wouldn't say they can't do better. Really, this is not much different in software development where whenever you introduce major architectural changes, you end up going for correctness first, and then later, you go back and optimize everything.

Their hand may have been forced. Looking at Larrabee and other product roadmaps, they probably felt they needed to get a much more general-purpose GPU out this generation or else be caught with their pants down by Intel next year.
 
Yeah, that is what has me disappointed. It's still larger than 500mm². I would have been more than happy if Nvidia had managed to get 600 or more SPs on a chip that size. I think it was Jawed who put it aptly: this could have been Nvidia's RV770.

Since G80 their cores have consumed more die area because of the large frequency differences, probably in order to keep power consumption under control with a relatively low transistor density. They could theoretically have gone for, say, 10+10 clusters (i.e. 640 SPs), but then they would have also needed 64 ROPs and a 512-bit bus. To go that route, though, I'm not so sure their hardware budget for 40nm would have also fit all the added computing functionality, and I honestly doubt that the whole added package is only worth a couple of hundred million transistors.

Despite the recent crisis they still hold the biggest share of the professional markets, and there's always Intel lurking around the corner, which will at some point release LRB; and as Democoder said, no one should underestimate Intel as a huge resource bucket and marketing machine. NV's bet here, IMO, is to win ground before the actual battle begins, and a 2-in-1 solution isn't such a bad idea from their perspective.

We mainstream consumers couldn't care less about all that. When we see from independent 3rd-party testing in the future what the final boards consume, how much they will cost, and how they perform in 3D, we can then judge what it's worth compared to Cypress.

Take a look at Cypress: it doubled just about every aspect compared to RV770 and so far looks to have an outstanding price/performance ratio, though not really to the degree some would have expected from the raw specs. This could of course change as drivers mature and/or more demanding games arrive in the future. Under your reasoning I could have wished for more than 20 clusters here too (not really), but the immediate next question would have been where the additionally needed bandwidth would have come from. And yes, the obvious answer would be a wider bus, but then you end up with higher chip complexity, more die area and higher costs, and you don't have a performance part anymore either, which would make their entire strategy (including Hemlock) completely redundant.

What NV could have done, in my humble layman's opinion, is focus from the get-go on developing/releasing a performance part first and the high-end part later on. I don't think it would change their large monolithic single-chip strategy, but it could speed up time to market a bit. One has to be blind not to see that their execution has been lacking since GT200, and no, G80 isn't a counterexample, since it had a high-end chip in the form of R600 to counter.
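
To put a number on the bandwidth argument above: a quick sketch of bytes of bandwidth available per peak FLOP for RV770 and Cypress, using their commonly quoted reference-board figures (quoted from memory, so treat them as approximate):

```python
# Bytes of memory bandwidth per peak FLOP; reference-board figures from memory, approximate.
boards = {
    "RV770 (HD 4870)":   {"gflops": 1200.0, "gb_per_s": 115.2},
    "Cypress (HD 5870)": {"gflops": 2720.0, "gb_per_s": 153.6},
}

for name, b in boards.items():
    print(f"{name}: {b['gb_per_s'] / b['gflops']:.3f} bytes per FLOP")

# Cypress already sits at roughly half of RV770's bytes/FLOP on the same 256-bit bus,
# which is why scaling the ALUs up again would pretty much force a wider bus.
```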
 
But until when?
Until they have a less costly chip for the mainstream segment. Or forever. Who knows.

Taping out just when your competitor is almost ready with their launch is late. Come on, you have to concede this point. Even Nvidia has agreed GF100 is late. "It's fucking hard"
It depends on what you've planned more than on what your competition did. AFAIK they never planned to launch earlier than 4Q. Thus it will be "late" only if it misses that target. And last time I checked it's still 2009 here.

Well, from what Fuad/Apoppin were told, it seems like they will have a dual-chip card based on GF100.
Was it GF100 though or just Fermi in general?
 
It could have been their RV770, but they are being more ambitious and simultaneously conservative, putting off a shrinkage/optimization pass until the next revision, opting instead to rearchitect pieces.

It looks like they want to absorb the pain upfront now of switching the architecture yet again, rather than do a generation that is merely a refinement/optimization, followed later by bigger changes.

If they had simultaneously made all of the architectural changes, and also concentrated on tweaking all of the units for size, the card might have been delayed further.

Of course, there are limits to how dense they can go, but I wouldn't say they can't do better. Really, this is not much different in software development where whenever you introduce major architectural changes, you end up going for correctness first, and then later, you go back and optimize everything.

Their hand may have been forced. Looking at Larrabee and other product roadmaps, they probably felt they needed to get a much more general-purpose GPU out this generation or else be caught with their pants down by Intel next year.

NVIDIA needed to prepare to take on AMD in the graphics space, and Intel in the GPU computing space. That's a hell of a challenge to take on. Given what NVIDIA has shown us so far regarding Fermi, and given their statements that graphics performance will be dramatically improved over GT200, I'd say that so far it appears they did a damn good job in balancing tradeoffs with this architecture. We'll find out when the graphics performance reviews come in a couple months!

Note that, even though NVIDIA did not strike first in the DX11 graphics space vs AMD (with NVIDIA two or three months behind at a minimum), they will strike first in the GPU computing space vs Intel (with NVIDIA many many months ahead at a minimum), which is absolutely huge for them. I'd say this was a good tradeoff, don't you think?

Regarding gaming performance on GF100, one thing that many people are overlooking is the improvement to real-world gameplay. Games utilizing PhysX should be far superior on GF100 vs GT200 derivatives, with a much greater chance of playable framerates at high resolutions. So now we will have a technology that will become much more useful in enhancing the overall gameplay experience, rather than just eye candy at super slow framerates. And who knows what other performance and/or image quality enhancements this new chip will have vs prior generations. There has to be something new in there, we just don't know the details yet.
 
Dave answered this question, but there are probably multiple reasons for dual rasterizers without increasing the setup rate. One is that by working on two triangles or tiles in parallel, you speed up rendering of triangles that don't cover multiple tiles.

Previous designs rasterized one triangle at a time so all pixels had to come from one triangle to achieve full rate. As Dave said each now works on a different group of tiles.

Ah, that makes complete sense. Mystery solved, thanks! :smile:
 
With regard to it being a 'software' tessellator, what I mean by that is that I think there's some hardware support, mainly in terms of memory management, but they didn't manifest the entirety of what they could do in silicon.

Regardless, the support will be fully DX11-compliant.
 