NVIDIA Fermi: Architecture discussion

MfA · Oct 2, 2009

ninelven said:
or more likely just a mistake.

In the white paper there is a single mention of 8x ... and 2 mentions of 16x (once explicitly and once implicitly in the g80/gt200/fermi comparison). On the one hand you might say the odds are in favour of 16x, on the other I would not. Personally I think it's far more likely that Double Precision instructions get dispatched at "half rate" and "do not support dual dispatch with any other operation". It just makes more sense: running it at half peak rate overall would mean that half the multiplier hardware would lie fallow most of the time.

If I'm right, I think this won't be cleared up until the launch.

mczak · Oct 2, 2009

Jawed said:
We have no details on the ATI 24-bit INT-MUL - is that just the 24 lowest bits? I suspect it's for addressing type calculations.

Hmm yes that's actually a good question. Maybe it could be 24bit source, 32bit destination? That would be nice. Where's the R800 instruction guide?

What did the 24bit int mul on GT200 actually do? Seems a bit odd nvidia opted for (half-rate) 32bit int mul only with GF100.
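For what it's worth, CUDA documents its `__umul24` intrinsic as multiplying the low 24 bits of each operand and returning the low 32 bits of the 48-bit product, which is the "24bit source, 32bit destination" case. Below is a minimal Python sketch of that behaviour. The helper names `umul24` and `umul32` are hypothetical, only the unsigned variant is emulated, and nothing here says whether ATI's 24-bit INT-MUL works the same way.

```python
MASK24 = (1 << 24) - 1  # keeps the 24 least significant bits
MASK32 = (1 << 32) - 1  # keeps the 32 least significant bits


def umul24(x: int, y: int) -> int:
    """Unsigned 24-bit multiply: 24-bit sources, low 32 bits of the product."""
    return ((x & MASK24) * (y & MASK24)) & MASK32


def umul32(x: int, y: int) -> int:
    """Full 32-bit multiply keeping the low 32 bits, for comparison."""
    return ((x & MASK32) * (y & MASK32)) & MASK32


if __name__ == "__main__":
    x = 0x01000001  # bit 24 is set, so the two multiplies disagree
    print(hex(umul24(x, 3)), hex(umul32(x, 3)))
```

As long as both operands fit in 24 bits (typical for thread and block index arithmetic), the two functions agree; they diverge only when upper bits are set.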

mczak · Oct 2, 2009

ninelven said:
Cypress is 1/5 for FP64 MUL (272GFlops), 2/5 for ADD (544GFlops).

Marketing material clearly indicates 2 FP64 MUL or ADD per clock: http://techreport.com/articles.x/17618/5 - so either it is bogus or you're dead wrong.

babcat · Oct 2, 2009

Kaotik said:
So, what's the verdict on the gaming part, is there enough magic tricks and cookie monsters to really beat a chip with over 1 TFLOPS higher theoretical peak performance, and with notable margin?
For reference, last gen the peak difference between 4890 and GTX285 was ~0.2TFLOPS

Very good question.

trinibwoy · Oct 2, 2009

apoppin said:
Nvidia is emphasizing *other* than gaming with Fermi.

Yes, but many are making the mistake of thinking that equates to a de-emphasizing of gaming. I agree with Arun, all these editorials are off base. For starters, even at moderate clocks Fermi is 2x GTX285. How that can be interpreted as abandoning 3D is beyond me. Secondly, during the keynote JHH went to great pains to point out that the only reason they could afford all this R&D is due to the volumes that they ship. So are people really trying to say that a man who knows that there is no Tesla and no Quadro without Geforce, is going to blow up the foundation of his house? Come on....

MfA said:
Why do they say that a SM has "8x the peak double precision floating point performance over GT200"? (Shouldn't that be 16.)

Probably a typo, it's 8x for the chip and 16x per SM.

ShaidarHaran · Oct 2, 2009

Just for the sake of technical accuracy:
HD 4890 (RV790) had a maximum FLOP rate of 1.36TFLOPs and GTX 285 was 1.12TFLOPs.

trinibwoy · Oct 2, 2009

Kaotik said:
So, what's the verdict on the gaming part, is there enough magic tricks and cookie monsters to really beat a chip with over 1 TFLOPS higher theoretical peak performance, and with notable margin?
For reference, last gen the peak difference between 4890 and GTX285 was ~0.2TFLOPS

Fermi has >2x the MADDs of GTX285. Does that answer your question? What's the point of comparing it to HD5870 when that only manages to be 40% faster with 4x the MADDs?

Kaotik · Oct 2, 2009

ninelven said:
Cypress is 1/5 for FP64 MUL (272GFlops), 2/5 for ADD (544GFlops).

mczak said:
Marketing material clearly indicates 2 FP64 MUL or ADD per clock: http://techreport.com/articles.x/17618/5 - so either it is bogus or you're dead wrong.

ninelven, 1/5 of Cypress is 544, not 272GFLOPS.

If my memory serves from the way it used to be in RV670, which first introduced DP on Radeons, basic add & subtraction functions can be carried out at 2/5 rate, while multiply and divide functions would go at 1/5 rate. (I doubt it has gotten any worse than it was back then.)

babcat · Oct 2, 2009

seahawk said:
I do find a new GPU architecture always very interesting and I care only a little about its performance. So it was a good techday, not more not less.

ATI has released a splendid gaming card. NV has shown interesting technology and ideas, but it might suck for gaming and it won't be on the shelves for 4-6 months. So who cares about its gaming performance right now?

I care about its gaming performance.

MfA · Oct 2, 2009

trinibwoy said:
Probably a typo, it's 8x for the chip and 16x per SM.

We will see ... in the face of conflicting official statements and the fact that half peak rate double precision execution runs against my bias, I'll have to go with my bias.

Insider information won't really convince me otherwise, you have your scepticism about PR-written technological information ... I have my scepticism about the ability of people to really see the difference between professional ego-stroking "insider information" and true friendly acquaintances who talk too much.

trinibwoy · Oct 2, 2009

MfA said:
We will see ... in the face of conflicting official statements and the fact that half peak rate double precision execution runs against my bias, I'll have to go with my bias.

Insider information won't really convince me otherwise, you have your scepticism about PR-written technological information ... I have my scepticism about the ability of people to really see the difference between professional ego-stroking "insider information" and true friendly acquaintances who talk too much.

You lost me there. I thought we were talking about the whitepaper here? That qualifies as "PR-written technological information" too, doesn't it?

MfA · Oct 2, 2009

You spoke with absolute certainty, so I assumed you had some reason to do so.

If like me you are only going on what the white paper said then I find it strange you can be so certain.

trinibwoy · Oct 2, 2009

MfA said:
You spoke with absolute certainty, so I assumed you had some reason to do so. If not, and you assume it's possible to make the mistake one way, it would be strange not to consider the other direction.

Oh, lol no man, I'm as far from an "insider" as you could probably get.

My assertion was based on the two occurrences that you mentioned. Besides, I said it was probably a typo. That's not absolute.

1: 256 FMA ops/clock
2: up to 16 double precision fused multiply-add operations can be performed per SM, per clock

So the odds that both of those are typos are far lower than the 8x being a typo, that's all. The measured performance they mentioned was at 4.2x so maybe you're right and it's really 4x higher peak with a little extra due to efficiency. /shrug.
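The arithmetic behind the 8x versus 16x disagreement can be sketched as below. The Fermi figures (256 FMA ops/clock, 16 per SM, hence 16 SMs) come from the whitepaper quotes above. The GT200 figures (30 SMs, one double precision FMA per SM per clock) are assumptions not stated in this thread, and clock speeds are left out entirely, so these are per-clock ratios only.

```python
from fractions import Fraction

FERMI_SMS = 16           # 256 FMA ops/clock divided by 16 per SM (whitepaper)
GT200_SMS = 30           # assumption: not stated in this thread
GT200_DP_FMA_PER_SM = 1  # assumption: one DP FMA per SM per clock


def dp_ratios(fermi_dp_fma_per_sm: int):
    """Return (per-SM ratio, per-chip ratio) of Fermi over GT200, per clock."""
    per_sm = Fraction(fermi_dp_fma_per_sm, GT200_DP_FMA_PER_SM)
    per_chip = Fraction(FERMI_SMS * fermi_dp_fma_per_sm,
                        GT200_SMS * GT200_DP_FMA_PER_SM)
    return per_sm, per_chip


if __name__ == "__main__":
    # 16 is the whitepaper's stated rate; 8 is the hypothetical half-rate case.
    for rate in (16, 8):
        per_sm, per_chip = dp_ratios(rate)
        print(f"{rate} DP FMA/SM/clock: {float(per_sm):.1f}x per SM, "
              f"{float(per_chip):.2f}x per chip")
```

Under these assumptions the stated rate gives 16x per SM and about 8.5x per chip, while the half-rate case gives 8x per SM and about 4.3x per chip, which is close to the 4.2x measured figure mentioned above.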

ninelven · Oct 2, 2009

Kaotik said:
ninelven, 1/5 of Cypress is 544, not 272GFLOPS

1/5 is 544GFlops when counting FMAs (2 flops each); when counting MULs (1 flop each) it is 272.

Anyway, I was just going off the chart on this page:

http://www.hardware.fr/articles/772-7/nvidia-fermi-revolution-gpu-computing.html

I'm guessing the author assumed RV870 was like RV770 in this regard, but what mczak posted seems to suggest otherwise. I have no idea.
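The 272 versus 544 figures differ only in how many flops each operation is counted as, which a quick calculation makes explicit. This is a rough sketch: the lane count (1600) and clock (850 MHz) are assumed HD 5870 figures not given in this thread, and the function name `peak_gflops` is hypothetical.

```python
from fractions import Fraction

CYPRESS_LANES = 1600     # assumption: HD 5870 single precision lane count
CYPRESS_CLOCK_MHZ = 850  # assumption: HD 5870 engine clock


def peak_gflops(rate: Fraction, flops_per_op: int) -> Fraction:
    """Peak GFLOPS for an op issued at `rate` of the single precision lane count."""
    ops_per_clock = CYPRESS_LANES * rate
    return ops_per_clock * CYPRESS_CLOCK_MHZ * flops_per_op / 1000


if __name__ == "__main__":
    fifth, two_fifths = Fraction(1, 5), Fraction(2, 5)
    print("FP64 MUL at 1/5, 1 flop :", peak_gflops(fifth, 1))
    print("FP64 FMA at 1/5, 2 flops:", peak_gflops(fifth, 2))
    print("FP64 ADD at 2/5, 1 flop :", peak_gflops(two_fifths, 1))
```

With those assumptions, a 1/5-rate MUL counted as one flop gives 272, while a 1/5-rate FMA counted as two flops and a 2/5-rate ADD both give 544.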

MfA · Oct 2, 2009

"One 64-bit FP MUL and ADD or two 64-bit FP ADDs per clock" was probably too wide for the picture.

trinibwoy · Oct 2, 2009

ninelven said:
http://www.hardware.fr/articles/772-7/nvidia-fermi-revolution-gpu-computing.html

Why does Damien think MADs are half speed? According to the whitepaper all MADs should simply be promoted to FMAs. Though Rys raised concerns about that he never went into detail as to why that's a problem.

dkanter · Oct 2, 2009

There seems to be a bit of confusion here about a number of things in terms of Fermi's execution units. I recommend this page of my article:

http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932&p=7

1. Fermi cores have 32 vector lanes (CUDA cores in NV's bastardized and retarded terminology). Each pipeline is 16 vector lanes.
2. Each vector lane has a 32-bit ALU and a 32-bit FPU
3. Vector lanes cannot simultaneously use the ALU and FPU; there simply isn't enough register file bandwidth.
4. 64-bit FPU ops use both pipelines; 64-bit integer operations simply take longer, sometimes a lot longer (e.g. 64b IMUL = 4x slower than 32b IADD/ISUB)

Fermi definitely has TMUs and ROPs. They didn't talk about them, but they are certainly there.

David

dkanter · Oct 2, 2009

trinibwoy said:
Why does Damien think MADs are half speed? According to the whitepaper all MADs should simply be promoted to FMAs. Though Rys raised concerns about that he never went into detail as to why that's a problem.

Some people may want to retain the same results for running their code on different GPUs...it's called 'compatibility'.

If you wrote your code using MADs, it had better produce the same results on all GPUs. Extra precision from FMA can be nice, but it can also be an unpleasant surprise.

David
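To make the compatibility point concrete, here is a small sketch of how a fused multiply-add can give a different answer from a separate multiply followed by an add. It uses double precision in Python and emulates the fused case with exact rational arithmetic. It does not model the actual single precision MAD behaviour of any particular GPU, and the function names `mad` and `fma` are illustrative only.

```python
from fractions import Fraction


def mad(a: float, b: float, c: float) -> float:
    """Unfused multiply-add: the product is rounded, then the sum is rounded."""
    product = a * b     # first rounding
    return product + c  # second rounding


def fma(a: float, b: float, c: float) -> float:
    """Fused multiply-add emulation: exact a*b+c, rounded once at the end."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # single rounding


# a*b is exactly 1 - 2**-54, which rounds to 1.0 as a double.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

if __name__ == "__main__":
    print("mad:", mad(a, b, c))  # the rounded product cancels completely
    print("fma:", fma(a, b, c))  # the low-order bits of the product survive
```

Here the unfused version returns exactly zero while the fused version returns -2**-54, so code that tests a result against zero, or compares bit patterns across GPUs, can behave differently depending on which instruction the compiler picks.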

trinibwoy · Oct 2, 2009

dkanter said:
Some people may want to retain the same results for running their code on different GPUs...it's called 'compatibility'.

If you wrote your code using MADs, it had better produce the same results on all GPUs. Extra precision from FMA can be nice, but it can also be an unpleasant surprise.

David

Makes sense, so presumably CUDA 3.0 / PTX2.0 will offer a mechanism for the developer to ensure that MADs remain as such? Also, how does that work on RV870 which provides both MAD and FMA?

MfA · Oct 2, 2009

dkanter said:
There seems to be a bit of confusion

If you're talking about me, I'm not confused ... I'm just not convinced. NVIDIA's whitepaper simply contradicts itself once. I like your article and it shows you had a decent bit of correspondence with them, but still a lot of information comes from the same source and misunderstandings do come about so easily. Unless the contradiction is clarified officially I'm not going to be convinced about the GF100 double precision instruction throughput until we see benchmarks.

BTW, you make a point to note the fact that command/address lines are not error checked for DRAM ... is that statistically really an issue? (It obviously is for data, but that moves at 4 times the rate.)
