NVIDIA Fermi: Architecture discussion

Does anybody here know for certain how Fermi executes warp instructions? I'm not sure if I understand the white paper exactly, but is a warp of 32 threads really finished in two clock cycles by a group of 16 cores?

There is no "sorta" about it. There is 2x the raster rate there.
There is a difference between "raster rate" and "dual rasterizers" and you know that ;)
 
Nvidia GF100: 128 TMUs and 48 ROPs

http://www.hardware-infos.com/news.php?news=3228

If I summarize, we have the following facts:

- 40 nm
- 3.0 billion transistors
- 512 SPs
- 128 TMUs
- 48 ROPs
- 384 Bit GDDR5

And, as you were able to confirm many times in the past, all SP's are MIMD, right?

BTW, according to your article, the clock speeds have been changed compared to the previous stepping (as 'reported' here on May 15th). Now the chip that was shown this week was also an A1 stepping, yet produced in week 35? Isn't that quite a bit later than May 15th?

What about the reported 2547 GigaFlops that were confirmed in the same earlier article? And the gigantic bandwidth due to the 512bit bus reported also in May? You reported then about 2.4B transistors. Did they add a whole lot of logic in between the first stepping and the next first stepping?

Do you have any real source at all, other than the writings of the brilliant Theo Valich who penned down pretty much identical nonsense? Did you get anything right at all?
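
For reference, here's a quick back-of-the-envelope sketch of how headline GFLOPS and bandwidth figures are normally derived from a spec sheet like the one above. The shader clock, FLOPs-per-clock factor and memory data rate in it are placeholder assumptions for illustration only, not confirmed numbers:

```python
# Rough peak-throughput arithmetic for a rumoured GPU spec.
# The clock and data-rate values are illustrative assumptions, not confirmed figures.

def peak_gflops(num_sps, shader_clock_ghz, flops_per_sp_per_clock):
    """Peak shader throughput in GFLOPS."""
    return num_sps * shader_clock_ghz * flops_per_sp_per_clock

def peak_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s for a given bus width and per-pin data rate."""
    return (bus_width_bits / 8) * data_rate_gbps

# 512 SPs from the article; a 1.5 GHz hot clock and 2 FLOPs/clock (FMA) are assumptions.
print(peak_gflops(512, 1.5, 2))        # ~1536 GFLOPS
# 384-bit GDDR5 from the article; 4.0 Gbps per pin is an assumption.
print(peak_bandwidth_gbs(384, 4.0))    # ~192 GB/s
```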
 
Does anybody here know for certain how Fermi executes warp instructions? I'm not sure if I understand the white paper exactly, but is a warp of 32 threads really finished in two clock cycles by a group of 16 cores?


There is a difference between "raster rate" and "dual rasterizers" and you know that ;)

Why bother reading white papers when you could read something more informative?

http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932&p=7

Many warps finish in two cycles, but not all.

David
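
As a minimal sketch of the arithmetic behind "two cycles per warp" (the warp size and the 16-wide core group come from the whitepaper; the 4-wide SFU pipe is the usual explanation for why not every warp fits the two-cycle case):

```python
import math

def issue_cycles(warp_size, pipe_width):
    """Cycles needed to push one warp through an execution pipe of the given width."""
    return math.ceil(warp_size / pipe_width)

WARP_SIZE = 32
print(issue_cycles(WARP_SIZE, 16))  # 16 cores per group -> 2 cycles for the common case
print(issue_cycles(WARP_SIZE, 4))   # 4 SFUs per SM -> 8 cycles, hence "not all"
```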
 
One serves one half of the tiles, the other serves the other half of the tiles.
Hm okay. Somebody told me that only the raster units that generate quads were doubled and that it was still only one rasterizer. But I believe you on this point.

Why bother reading white papers when you could read something more informative?
Yeah, but this point really is strange to me. Are you sure they didn't misunderstand something?

It's very hard for me to believe that every instruction can be retired in only two clocks at such a high speed, but then again the doubled number of SFUs points in that direction too.
 
Can rasterizers double but still only expect to have one triangle sent to them by the setup pipeline?
Dave answered this question, but there are probably multiple reasons for dual rasterizers without increasing the setup rate. One is that by working on two triangles or tiles in parallel, you speed up rendering of triangles that don't cover multiple tiles.

I don't recall the number of ROPs ever influencing the number of (marketed) rasterizers. If so, why is it that 8 and 4 ROP derivatives weren't marketed as having fewer rasterizers or vice versa? I find it hard to believe that AMD just up and decided to change that trend.
Previous designs rasterized one triangle at a time so all pixels had to come from one triangle to achieve full rate. As Dave said each now works on a different group of tiles.
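
To make the "each rasterizer owns a different group of tiles" idea concrete, here's a minimal sketch of one plausible assignment scheme (a simple checkerboard over screen tiles). The tile size and the exact mapping are my assumptions, not necessarily what the hardware actually does:

```python
TILE_SIZE = 16  # assumed screen-tile size in pixels

def rasterizer_for_pixel(x, y):
    """Map a pixel to one of two rasterizers via a checkerboard of screen tiles.
    Pixels in tiles owned by rasterizer 0 are scanned by it, the rest by
    rasterizer 1 -- so both can chew on the same large triangle, or on two
    different small triangles, at the same time."""
    tile_x, tile_y = x // TILE_SIZE, y // TILE_SIZE
    return (tile_x + tile_y) & 1

# A large triangle's pixels end up split across both rasterizers:
print({rasterizer_for_pixel(x, y) for x in range(0, 64, 8) for y in range(0, 64, 8)})  # {0, 1}
```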
 
Hm okay. Somebody told me that only the raster units that generate quads were doubled and that it was still only one rasterizer. But I believe you on this point.


Yeah, but this point really is strange to me. Are you sure they didn't misunderstand something?

It's very hard for me to believe that every instruction can be retired in only two clocks at such a high speed, but then again the doubled number of SFUs points in that direction too.

The pipeline for a warp is much longer than 2 cycles. A warp can be executed in two cycles though...

Remember, like all real pipelines:

Fetch-->Decode-->Issue-->Execute-->Write Back

Execute may only take two cycles, the rest takes a VERY long time, especially since the first three stages occur at the speed of the scheduler (slow clock).

I think someone measured GT200's pipeline as 28 stages...if that's true, then going from 4 execute cycles to 2 execute cycles doesn't make a big difference.

David
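
A rough sketch of the latency-versus-throughput point David is making. Only the "execute takes two hot clocks" part comes from the discussion above; the front-end stage count, write-back count and the 2:1 hot-clock ratio are assumptions for illustration:

```python
# Latency of one warp through the whole pipeline vs. the rate at which warps retire.

HOT_CLOCKS_PER_SCHED_CLOCK = 2   # assumed ratio of ALU (hot) clock to scheduler (slow) clock

front_end_sched_clocks = 12      # fetch/decode/issue at the slow clock (assumption)
execute_hot_clocks = 2           # 32-thread warp over 16 lanes
writeback_hot_clocks = 4         # assumption

latency_hot_clocks = (front_end_sched_clocks * HOT_CLOCKS_PER_SCHED_CLOCK
                      + execute_hot_clocks + writeback_hot_clocks)
print("per-warp latency (hot clocks):", latency_hot_clocks)  # ~30: the pipeline is long
print("issue rate (hot clocks/warp): ", execute_hot_clocks)  # 2: yet a warp can retire every 2
```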
 
Yes, I understand that, but still, the execution stage for all instructions, including DP with full denorm support (!), would need to be done in two clock cycles at >1.5GHz. Even ATI, with much lower clocked ALUs, uses 4 cycles. And what exactly is the benefit for Fermi?

I find this astonishing, but perhaps I'm missing something.
 
Yes, I understand that, but still, the execution stage for all instructions, including DP with full denorm support (!), would need to be done in two clock cycles at >1.5GHz. Even ATI, with much lower clocked ALUs, uses 4 cycles. And what exactly is the benefit for Fermi?

I find this astonishing, but perhaps I'm missing something.

ATI's wavefronts are 64 threads, so 2x larger than NV's 32-thread warps.

Also, note that while DP is done in two cycles, they have to use ALL the vector lanes to achieve that.

Anyway, the speed of your FMA isn't really relevant. If you want to be impressed, look at POWER6 or POWER7. POWER6 can do an FMA in 7 cycles @ 5GHz, and it's fully pipelined.

Ultimately FPU latency/throughput is all a trade off in terms of power, area, etc.

DK
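
A tiny sketch of the lane-ganging point: if DP borrows all the vector lanes, peak DP throughput ends up at half of SP even though both finish a warp in two cycles. The 512 cores and 1.5 GHz hot clock below are assumptions for illustration:

```python
# Peak FMA throughput when double precision gangs pairs of SP lanes together.
CORES = 512            # assumed total SP lanes
HOT_CLOCK_GHZ = 1.5    # assumed ALU clock
FLOPS_PER_FMA = 2      # one fused multiply-add counts as two FLOPs

sp_gflops = CORES * HOT_CLOCK_GHZ * FLOPS_PER_FMA         # every lane does one SP FMA per clock
dp_gflops = (CORES // 2) * HOT_CLOCK_GHZ * FLOPS_PER_FMA  # two lanes team up for each DP FMA

print(sp_gflops, dp_gflops)  # 1536.0 768.0 -> DP peaks at exactly half of SP
```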
 
And, as you were able to confirm many times in the past, all SP's are MIMD, right?

BTW, according to your article, the clock speeds have been changed compared to the previous stepping (as 'reported' here on May 15th). Now the chip that was shown this week was also an A1 stepping, yet produced in week 35? Isn't that quite a bit later than May 15th?

What about the reported 2547 GigaFlops that were confirmed in the same earlier article? And the gigantic bandwidth due to the 512bit bus reported also in May? You reported then about 2.4B transistors. Did they add a whole lot of logic in between the first stepping and the next first stepping?

Do you have any real source at all, other than the writings of the brilliant Theo Valich who penned down pretty much identical nonsense? Did you get anything right at all?
You made my sig. Thanks silent_guy.

It would make much more sense for them to do a mid-range part first to compete against Cypress directly, yes. But it looks like they're not planning on competing with AMD again; they're planning to hit the high-margin market, which in turn should allow them to sell consumer graphics cards at a subsidised price until they have mainstream GF10x chips. Maybe they're right, who knows.
But until when?
Is it late? A1 made in August kinda destroys that theory, no? As for the launch date, you have to show products for that and no products were shown.
Taping out just when your competitor is almost ready with their launch is late. Come on, you have to concede this point. Even Nvidia has agreed GF100 is late. "It's fucking hard"

I think it's pretty obvious that they'll need a dual chip AFR card to beat Hemlock. Whether or not it'll be GF100-based is another question.
Well, from what Fuad/Apoppin were told, it seems like they will have a dual-chip card based on GF100.


I must have missed Arty's post...

Arty,

It's an ever-repeating rumor in the rumor mill so far that the die is supposed to be roughly on GT200b's level.

That's not a small die, or anything even remotely close to what I'd call a performance chip. The message so far in the pipeline was rather "not as huge as GT200@65nm".

http://www.techreport.com/articles.x/17670

I think it's fairly obvious that the author is speculating too.
Yeah, that is what has me disappointed. It's still larger than 500mm². I would have been more than happy if Nvidia had managed to get 600 or more SPs on a chip that size. I think it was Jawed who put it aptly: this could have been Nvidia's RV770.

And no, I was in no way, shape, or form insinuating that you were affiliated with Nvidia. We (you, Deg & me) had this discussion last week, so I included you.
 
It could have been their RV770, but they are being more ambitious and simultaneously conservative, putting off a shrinkage/optimization pass until the next revision, opting instead to rearchitect pieces.

It looks like they want to absorb the pain upfront now of switching the architecture yet again, rather than do a generation that is merely a refinement/optimization, followed later by bigger changes.

If they had simultaneously made all of the architectural changes, and also concentrated on tweaking all of the units for size, the card might have been delayed further.

Of course, there are limits to how dense they can go, but I wouldn't say they can't do better. Really, this is not much different in software development where whenever you introduce major architectural changes, you end up going for correctness first, and then later, you go back and optimize everything.

Their hand may have been forced. Looking at Larrabee and other product roadmaps, they probably felt they needed to get a much more general-purpose GPU out this generation or else be caught with their pants down by Intel next year.
 
Yeah, that is what has me disappointed. It's still larger than 500mm². I would have been more than happy if Nvidia had managed to get 600 or more SPs on a chip that size. I think it was Jawed who put it aptly: this could have been Nvidia's RV770.

Since G80 their cores have consumed more die area because of the large frequency differences, probably in order to keep power consumption under control with a relatively low transistor density. They could theoretically have gone for, say, 10+10 clusters (i.e. 640 SPs), but then they would have also needed 64 ROPs and a 512-bit bus. To go that route, though, I'm not so sure their hardware budget for 40nm would have also fit all the added computing functionality, and I honestly doubt that the whole added package is only worth a couple of hundred million transistors.

Despite the recent crisis they still hold the biggest share of the professional markets, and there's always Intel lurking around the corner, which will at some point release LRB; and as Democoder said, no one should underestimate Intel as a huge resource bucket and marketing machine. NV's bet here, IMO, is to win ground before the actual battle begins, and a 2-in-1 solution isn't such a bad idea from their perspective.

We mainstream consumers couldn't care less about all that. When we see from independent 3rd-party testing in the future what the final boards consume, how much they will cost, and how they perform in 3D, we can then judge what it's worth compared to Cypress.

Take a look at Cypress: it doubled just about every aspect compared to RV770 and so far looks to have an outstanding price/performance ratio, though not really to the degree some would have expected from the raw specs. This could of course change as drivers mature and/or more demanding games arrive in the future. Under your reasoning I could have wished for more than 20 clusters here too (not really), but the immediate next question would have been where the additionally needed bandwidth would have come from. And yes, the obvious answer would be a wider bus, but then you end up with higher chip complexity, more die area and higher costs, and you don't have a performance part anymore either, which would make their entire strategy (including Hemlock) completely redundant.

What NV could have done, in my humble layman's opinion, is focus from the get-go on developing/releasing a performance part first and the high-end part later on. I don't think it would change their large monolithic single-chip strategy, but it could speed up time to market a bit. One has to be blind not to see that their execution has been lacking since GT200, and no, G80 isn't a counterexample, since it had a high-end chip in the form of R600 to counter.
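
To put a number on the bandwidth argument above: a quick sketch of bytes of bandwidth available per peak FLOP for RV770 and Cypress, using their commonly quoted reference-board figures (quoted from memory, so treat them as approximate):

```python
# Bytes of memory bandwidth per peak FLOP; reference-board figures from memory, approximate.
boards = {
    "RV770 (HD 4870)":   {"gflops": 1200.0, "gb_per_s": 115.2},
    "Cypress (HD 5870)": {"gflops": 2720.0, "gb_per_s": 153.6},
}

for name, b in boards.items():
    print(f"{name}: {b['gb_per_s'] / b['gflops']:.3f} bytes per FLOP")

# Cypress already sits at roughly half of RV770's bytes/FLOP on the same 256-bit bus,
# which is why scaling the ALUs up again would pretty much force a wider bus.
```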
 
But until when?
Until they have a less costly chip for the mainstream segment. Or forever. Who knows.

Taping out just when your competitor is almost ready with their launch is late. Come on, you have to concede this point. Even Nvidia has agreed GF100 is late. "It's fucking hard"
It depends on what you've planned more than on what your competition did. AFAIK they never planned to launch earlier than 4Q. Thus it will be "late" only if it misses that target. And last time I checked it's still 2009 here.

Well, from what Fuad/Apoppin were told, it seems like they will have a dual-chip card based on GF100.
Was it GF100 though or just Fermi in general?
 
It could have been their RV770, but they are being more ambitious and simultaneously conservative, putting off a shrinkage/optimization pass until the next revision, opting instead to rearchitect pieces.

It looks like they want to absorb the pain upfront now of switching the architecture yet again, rather than do a generation that is merely a refinement/optimization, followed later by bigger changes.

If they had simultaneously made all of the architectural changes, and also concentrated on tweaking all of the units for size, the card might have been delayed further.

Of course, there are limits to how dense they can go, but I wouldn't say they can't do better. Really, this is not much different in software development where whenever you introduce major architectural changes, you end up going for correctness first, and then later, you go back and optimize everything.

Their hand may have been forced. Looking at Larrabee and other product roadmaps, they probably felt they needed to get a much more general-purpose GPU out this generation or else be caught with their pants down by Intel next year.

NVIDIA needed to prepare to take on AMD in the graphics space, and Intel in the GPU computing space. That's a hell of a challenge to take on. Given what NVIDIA has shown us so far regarding Fermi, and given their statements that graphics performance will be dramatically improved over GT200, I'd say that so far it appears they did a damn good job in balancing tradeoffs with this architecture. We'll find out when the graphics performance reviews come in a couple months!

Note that, even though NVIDIA did not strike first in the DX11 graphics space vs AMD (with NVIDIA two or three months behind at a minimum), they will strike first in the GPU computing space vs Intel (with NVIDIA many many months ahead at a minimum), which is absolutely huge for them. I'd say this was a good tradeoff, don't you think?

Regarding gaming performance on GF100, one thing that many people are overlooking is the improvement to real-world gameplay. Games utilizing PhysX should be far superior on GF100 vs GT200 derivatives, with a much greater chance of playable framerates at high resolutions. So now we will have a technology that will become much more useful in enhancing the overall gameplay experience, rather than just eye candy at super slow framerates. And who knows what other performance and/or image quality enhancements this new chip will have vs prior generations. There has to be something new in there, we just don't know the details yet.
 
Dave answered this question, but there are probably multiple reasons for dual rasterizers without increasing the setup rate. One is that by working on two triangles or tiles in parallel, you speed up rendering of triangles that don't cover multiple tiles.

Previous designs rasterized one triangle at a time so all pixels had to come from one triangle to achieve full rate. As Dave said each now works on a different group of tiles.

Ah, that makes complete sense. Mystery solved, thanks! :smile:
 
With regard to it being a 'software' tessellator, what I mean by that is that I think there's some hardware support, mainly in terms of memory management, but they didn't manifest the entirety of what they could do in silicon.

Regardless, the support will be fully DX11-compliant.
 