NVIDIA Fermi: Architecture discussion

I know I'm quoting myself, but is anyone else surprised that the GF100 is still huge? Talking to Degustator & Ailuros for the past few weeks, I was leaning towards a leaner (G92-ish) GPU from Nvidia, but again we got a huge chip.
It would make much more sense for them to do a mid-range part first to compete against Cypress directly, yes. But it looks like they're not planning on competing with AMD again; they're planning to hit the high-margin markets, which in turn should allow them to sell consumer graphics cards at a subsidised price until they have mid-range GF10x chips. Maybe they're right, who knows.

Again, the disappointment also comes from the lack of a launch date, i.e. it is still vaporware, with almost all sites claiming 2010. Huge and late is what I have a problem with.
Is it late? A1 silicon made in August kinda destroys that theory, no? As for the launch date, you have to show products for that, and no products were shown.

Also, Fuad is certain that there's a dual-Fermi card in the product stack. I don't see anyone else mentioning this, so I guess he's getting it straight from NV PR again.
I think it's pretty obvious that they'll need a dual chip AFR card to beat Hemlock. Whether or not it'll be GF100-based is another question.

My interpretation is that Fermi is TMU- and ROP-less.
AFAIK, TMUs (though there's still some discussion about their number -- 256, 128, 64?) and ROPs are both there.

Someone needs to get a life.
LOL
 
I must have missed Arty's post...

Arty,

It's an ever-repeating rumour in the rumour mill that the die is supposed to be roughly on GT200b's level.

That's not a small die, or anything even remotely close to what I'd call a performance chip. The message in the pipe so far was rather "not as huge as GT200@65nm".

That's considerably larger than Cypress—nearly 50%—but is in keeping with its advantages in DP compute performance and memory bandwidth. That also seems like a sensible estimate in light of Fermi's two additional memory interfaces, which will help dictate the size of the chip. Somewhat surprisingly, that also means Fermi may turn out to be a little bit smaller than the 55-nm GT200b, since the best estimates place the GT200b at just under 500 mm². Nvidia would appear to have continued down the path of building relatively large high-end chips compared to the competition's slimmed-down approach, but Fermi seems unlikely to push the envelope on size quite like the original 65-nm GT200 did.
http://www.techreport.com/articles.x/17670

I think it's fairly obvious that the author is speculating too. And just to clarify one more thing: I don't have any insider connection to NVIDIA or AMD. However, some stuff floats around and reaches me one way or the other from all three IHVs (PowerVR included, for embedded stuff). Going one step further, I never was the NV shill some wanted me to be, and those who pointed fingers back then should hold a mirror up to their faces and go for an integrity check nowadays. I don't have a shred of guilt for anything; and no, you didn't mention or imply any of it. It's just that some folks seem to forget what a stinking pile of shit any antagonism between large firms actually is. The more you get involved, the worse the smell.


Degustator,

I'm pretty confident that it has 128 TMUs (16 SMs × 8 TMUs/SM), for no other reason than that fewer than 8 per SM would be suicide and more would be overkill.
 
Are all the tasks of the former MUL in the G80-GT21x SPs moved into the "interpolation 4-way SFU pipe", or are some still done by the FMAs?
 
64 is clearly not enough, but I'm not so sure that 256 would be overkill with all the memory system changes...


I find the different TMU numbers floating around rather strange.

So here is my take on it (but I'm very often wrong, so this time should be no different) ;)

The GF100 has 256 TMUs, but these TMUs can only do point sampling and are freely programmable for different quality settings (bilinear, trilinear, AF, etc.). So marketing can say 256 TMUs, but well...

so only 64 "normal" TMUs. Why only 64? Because Fermi is clearly intended for the cGPU / GPGPU market.

Nvidia will have to rely on high frequency and high efficiency here.
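
To make the point-sampling idea above concrete, here's a rough sketch (entirely mine, assuming a plain CUDA C texture in global memory rather than anything NV has described) of how bilinear filtering can be built from four point samples plus ALU math -- which is why four point-sampling TMUs would be worth one classic bilinear TMU:

```
// Point sample with clamp-to-edge addressing.
__device__ float fetch(const float* tex, int w, int h, int x, int y)
{
    x = min(max(x, 0), w - 1);
    y = min(max(y, 0), h - 1);
    return tex[y * w + x];
}

// Bilinear filtering from four point samples and three lerps in shader code.
__device__ float bilinear(const float* tex, int w, int h, float u, float v)
{
    // Map normalized coordinates to texel space, centred on texel middles.
    float x = u * w - 0.5f, y = v * h - 0.5f;
    int x0 = (int)floorf(x), y0 = (int)floorf(y);
    float fx = x - x0, fy = y - y0;
    // Four point samples (what the rumoured 256 TMUs would provide)...
    float t00 = fetch(tex, w, h, x0,     y0);
    float t10 = fetch(tex, w, h, x0 + 1, y0);
    float t01 = fetch(tex, w, h, x0,     y0 + 1);
    float t11 = fetch(tex, w, h, x0 + 1, y0 + 1);
    // ...and the filtering itself done in ALU code.
    float top = t00 + fx * (t10 - t00);
    float bot = t01 + fx * (t11 - t01);
    return top + fy * (bot - top);
}
```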


Well, that's all IMHO, and with my track record of being wrong quite often, I hope I'm wrong here too.
 
Yes, but many are making the mistake of thinking that equates to a de-emphasizing of gaming. (...) How that can be interpreted as abandoning 3D is beyond me.

I see it more as de-emphasizing of DirectX n+1.

DirectX n -> DirectX n+1 -> DirectX n+2 -> ... saga is officially over.

In past years, product launches were orchestrated mainly by DX "level-ups". Today, the opportunities for real progress in graphics lie elsewhere.
 
Really? Now that's cool - it would have saved Dave Baumann some nanoseconds to just state "yes, it is" rather than pull up a link to TR and also would have avoided unnecessary discussion, wouldn't it?
Actually, I was trying to point to the architecture slide that showed it, however I couldn't find a review that posted that particular one (it was in the Stream arch slide as opposed to the shader arch slides).
 
I see it more as de-emphasizing of DirectX n+1.

DirectX n -> DirectX n+1 -> DirectX n+2 -> ... saga is officially over.

In past years, product launches were orchestrated mainly by DX "level-ups". Today, the opportunities for real progress in graphics lie elsewhere.


That might not be a good thing, given the stability and progress we've had in PC gaming thanks to adherence to a common API. Unless by "graphics" you mean something other than graphics for games/the PC space, e.g. simulation/animation.

I don't think Nvidia intends to break away from DX compatibility, and I can't see devs doing it either and cutting ATI PCs or MS console products out of the potential target audience for their games.
 
I don't think you can make that argument in the wake of DX10. It's more like stagnation right now. The recent improvements in fidelity have been driven by hardware, not DirectX (the best-looking games are still DX9). That trend looks set to continue for some time, until the console refresh. That's where DirectX will shine - setting targets for the next console generation.
 
Hmmm, I see Rys has modified his diagram to 8 pixels/SM. I wonder what made him change his mind.
 
That's where DirectX will shine - setting targets for the next console generation.
What targets would those be, considering that Xenos already has a tessellator?
I think NV is right with their computational push, because it's now more about what you can do with the hardware and less about what the hardware allows you to do.
Here's a good example of why CUDA+OpenCL>DX: Mirics and NVIDIA bring software-based TV to more PC platforms.
I'm thinking that the next console generation will be more about GPU computing and less about h/w features too.
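
For a flavour of that kind of GPU-compute workload, here's a toy CUDA kernel (purely illustrative -- the names and the padding convention are mine, not from the Mirics article) doing the sort of FIR filtering a software TV/radio demodulator spends most of its time on:

```
// One output sample per thread: a straightforward FIR convolution.
__global__ void fir(const float* in, const float* taps, float* out,
                    int n, int ntaps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;
    float acc = 0.0f;
    for (int t = 0; t < ntaps; ++t)
        acc += taps[t] * in[i + t];  // assumes 'in' is padded by ntaps samples
    out[i] = acc;
}
```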
 
I agree. In the past, DirectX set targets for what PC hardware should be capable of. Now that the IHVs are pushing the envelope, DirectX will become more of a hindrance than a help. But it will still be very important as the lowest common denominator across all hardware. That's why Nvidia has to go it alone: they can't sit by and wait for Microsoft. It's no different from ATI and tessellation; the only difference is that Nvidia has the will and the capability to drive things beyond DirectX.
 
It's not only a matter of what emulation could or couldn't do; it's a matter of how fast a certain feature runs in SW vs. HW, and of what transistor budget is at your disposal.
 
Sure. But if your GPU can "emulate" a new feature then it's technically compatible.
Wasn't Intel planning on doing something like this for LRB's DX compatibility?
 
"One 64-bit FP MULL and ADD or two 64-bit FP ADDs per clock" was probably too wide for the picture :)
That makes no sense. The next line in that diagram says "1 64-bit FP MAD per clock", so if the MUL ran at the same rate they'd have put it on that line instead.
I've seen no source or evidence indicating that those marketing slides are wrong about it, so I have to assume the chip can indeed do two 64-bit FP MULs per clock.
Damien's numbers at hardware.fr are a lot better than when I first visited, though there are still some bugs. Aside from the FP64 MUL on RV870: Core i7 in fact has twice the FP32 MAD rate (add and mul use different execution units), and it's arguable whether the RV870 int32 MUL is really 272 GFlops -- if you need both the low and the high 32 bits of the result, you need two instructions, hence only 136 GFlops (at the least, it seems highly unlikely this changed from RV770).
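
A minimal device-side illustration of that two-instruction point (plain CUDA C; the helper name is mine): the full 64-bit product of two 32-bit integers needs both a low-half and a high-half multiply, so the effective int32-MUL rate halves when all 64 result bits matter.

```
__device__ unsigned long long full_mul32(unsigned int a, unsigned int b)
{
    unsigned int lo = a * b;           // low 32 bits of the product (mul.lo)
    unsigned int hi = __umulhi(a, b);  // high 32 bits via intrinsic (mul.hi)
    return ((unsigned long long)hi << 32) | lo;
}
```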
Those peak rates are pretty misleading anyway, as they don't really reflect the differences between the hardware. For instance, Core i7 would have twice the mul and add throughput on a series of independent muls and adds, which isn't the case for RV870 / Fermi.
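
For reference, here's the back-of-the-envelope arithmetic behind such peak-rate tables; the unit counts are from public specs, but the clocks (especially Fermi's, which is unannounced) are placeholder assumptions:

```
#include <cstdio>

// peak = units x flops-per-unit-per-clock x clock (GHz) -> GFLOPS
static double peak_gflops(int units, double flops_per_clk, double ghz)
{
    return units * flops_per_clk * ghz;
}

int main()
{
    // RV870 (Cypress): 320 VLIW5 units, 5 FP32 MADs each = 10 flops/clk, 850 MHz
    printf("RV870 FP32: %.0f GFLOPS\n", peak_gflops(320, 10.0, 0.85));
    // Fermi (GF100): 512 scalar cores, 1 FP32 FMA = 2 flops/clk; 1.5 GHz hot clock assumed
    printf("Fermi FP32: %.0f GFLOPS\n", peak_gflops(512, 2.0, 1.5));
    // Fermi FP64, claimed at half the FP32 FMA rate
    printf("Fermi FP64: %.0f GFLOPS\n", peak_gflops(512, 1.0, 1.5));
    return 0;
}
```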

edit: actually, it looks like Damien is right about the 64-bit FP MUL rate and the marketing material is wrong. Ah well, so no change from RV770 there.
 