NVIDIA Fermi: Architecture discussion

or more likely just a mistake.
In the white paper there is a single mention of 8x ... and 2 mentions of 16x (once explicitly and once implicitly in the g80/gt200/fermi comparison). On the one hand you might say the odds are in favour of 16x; on the other, I would not. Personally I think it's far more likely that Double Precision instructions get dispatched at "half rate" and "do not support dual dispatch with any other operation". It just makes more sense: running it at half peak rate overall would mean that half the multiplier hardware would lie fallow most of the time.

If I'm right this won't be cleared up until the launch I think :p
 
We have no details on the ATI 24-bit INT-MUL - is that just the 24 lowest bits? I suspect it's for addressing type calculations.
Hmm yes that's actually a good question. Maybe it could be 24bit source, 32bit destination? That would be nice. Where's the R800 instruction guide :)
What did the 24bit int mul on GT200 actually do? Seems a bit odd nvidia opted for (half-rate) 32bit int mul only with GF100.
 
So, what's the verdict on the gaming part, is there enough magic tricks and cookie monsters to really beat a chip with over 1 TFLOPS higher theoretical peak performance, and with notable margin?
For reference, last gen the peak difference between 4890 and GTX285 was ~0.2TFLOPS

Very good question.
 
Nvidia is emphasizing *other* than gaming with Fermi.

Yes, but many are making the mistake of thinking that equates to a de-emphasizing of gaming. I agree with Arun, all these editorials are off base. For starters, even at moderate clocks Fermi is 2x GTX285. How that can be interpreted as abandoning 3D is beyond me. Secondly, during the keynote JHH went to great pains to point out that the only reason they could afford all this R&D is due to the volumes that they ship. So are people really trying to say that a man who knows that there is no Tesla and no Quadro without Geforce, is going to blow up the foundation of his house? Come on....

Why do they say that a SM has "8x the peak double precision floating point performance over GT200"? (Shouldn't that be 16x?)

Probably a typo, it's 8x for the chip and 16x per SM.
 
Just for the sake of technical accuracy:
HD 4890 (RV790) had a maximum FLOP rate of 1.36TFLOPs and GTX 285 was 1.12TFLOPs.
 
So, what's the verdict on the gaming part, is there enough magic tricks and cookie monsters to really beat a chip with over 1 TFLOPS higher theoretical peak performance, and with notable margin?
For reference, last gen the peak difference between 4890 and GTX285 was ~0.2TFLOPS

Fermi has >2x the MADDs of GTX285. Does that answer your question? What's the point of comparing it to HD5870 when that only manages to be 40% faster with 4x the MADDs?
 
Cypress is 1/5 for FP64 MUL (272GFlops), 2/5 for ADD (544GFlops).

Marketing material clearly indicates 2 FP64 MUL or ADD per clock: http://techreport.com/articles.x/17618/5 - so either it is bogus or you're dead wrong :).

ninelven, 1/5 of Cypress is 544, not 272GFLOPS

If my memory serves from the way it used to be in RV670, which first introduced DP on Radeons, basic add & subtraction functions can be carried out at 2/5 rate, while multiply and divide functions would go at 1/5 rate. (I doubt it has gotten any worse than it was back then.)
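The 272 vs 544 disagreement above looks like a counting issue, and a quick calculation makes that visible. The sketch below assumes the HD 5870 reference configuration (1600 stream processors grouped into 320 five-wide units, 850 MHz) and assumes each unit issues either one FP64 MUL or FMA, or two FP64 ADDs, per clock, which matches the 1/5 and 2/5 rates described here. The function and constant names are made up for illustration.

```python
# Hedged sketch: Cypress (HD 5870) FP64 peak rates under stated assumptions.
VLIW_UNITS = 320      # assumption: 1600 stream processors / 5 lanes per unit
CLOCK_MHZ = 850       # assumption: reference engine clock

def giga_ops(ops_per_unit_per_clock):
    """Peak giga-operations per second for a given per-unit issue rate."""
    return VLIW_UNITS * CLOCK_MHZ * ops_per_unit_per_clock / 1000

sp_mad_gflops = giga_ops(5) * 2   # 5 lanes, MAD counted as 2 flops
dp_mul_gops = giga_ops(1)         # one FP64 MUL per unit per clock (1/5 rate)
dp_add_gops = giga_ops(2)         # two FP64 ADDs per unit per clock (2/5 rate)
dp_fma_gflops = giga_ops(1) * 2   # one FP64 FMA per unit, counted as 2 flops

print("SP MAD  :", sp_mad_gflops, "GFLOPS")
print("DP MUL  :", dp_mul_gops, "G mul/s")
print("DP ADD  :", dp_add_gops, "G add/s")
print("DP FMA  :", dp_fma_gflops, "GFLOPS")
```

Under these assumptions both numbers show up: 272 is the FP64 multiply rate in operations, while 544 is both the FP64 add rate and the FMA rate when a fused multiply-add is counted as two flops.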
 
I do find a new GPU architecture always very interesting and I care only a little about its performance. So it was a good techday, not more not less.


ATI has released a splendid gaming card. NV has shown interesting technology and ideas, but it might suck for gaming and it won't be on the shelves for 4-6 months. So who cares about its gaming performance right now?

I care about its gaming performance.
 
Probably a typo, it's 8x for the chip and 16x per SM.
We will see ... in the face of conflicting official statements and the fact that half peak rate double precision execution runs against my bias, I'll have to go with my bias.

Insider information won't really convince me otherwise, you have your scepticism about PR written technological information ... I have my scepticism about the ability of people to really see the difference between professional ego stroking "insider information" and true friendly acquaintances who talk too much :)
 
You lost me there. I thought we were talking about the whitepaper here? That qualifies as "PR written technological information" too doesn't it? :LOL:
 
You spoke with absolute certainty, so I assumed you had some reason to do so :) If like me you are only going on what the white paper said then I find it strange you can be so certain.
 
You spoke with absolute certainty, so I assumed you had some reason to do so :) If not, then having assumed it's possible to make the mistake one way, it would be strange not to consider the other direction.

Oh, lol no man I'm as far from an "insider" as you could probably get :LOL: My assertion was based on the two occurrences that you mentioned. Besides, I said it was probably a typo. That's not absolute :p

1: 256 FMA ops/clock
2: up to 16 double precision fused multiply-add operations can be performed per SM, per clock

So the odds that both of those are typos are far lower than the 8x being a typo, that's all. The measured performance they mentioned was at 4.2x so maybe you're right and it's really 4x higher peak with a little extra due to efficiency. /shrug.
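To put numbers on the 8x / 16x / 4x readings, here is a rough per-clock comparison. It assumes GT200 has 30 SMs with one double precision FMA unit each and Fermi has 16 SMs, and it ignores clock speed differences entirely. The two Fermi per-SM rates are the whitepaper's explicit "16 per SM, per clock" figure and the alternative reading of 8 per SM; `dp_ratios` is a name invented for this sketch.

```python
# Hedged sketch: per-clock FP64 FMA throughput, Fermi vs GT200.
GT200_SMS = 30            # assumption: GT200 SM count
GT200_DP_FMA_PER_SM = 1   # assumption: one FP64 FMA unit per GT200 SM
FERMI_SMS = 16            # assumption: 512 CUDA cores / 32 per SM

def dp_ratios(fermi_dp_fma_per_sm):
    """Return (per-SM ratio, Fermi FMA/clock for the chip, per-chip ratio)."""
    per_sm_ratio = fermi_dp_fma_per_sm / GT200_DP_FMA_PER_SM
    fermi_chip = FERMI_SMS * fermi_dp_fma_per_sm
    gt200_chip = GT200_SMS * GT200_DP_FMA_PER_SM
    return per_sm_ratio, fermi_chip, fermi_chip / gt200_chip

for label, rate in (("16 DP FMA per SM", 16), ("8 DP FMA per SM", 8)):
    per_sm, chip_ops, per_chip = dp_ratios(rate)
    print(f"{label}: {per_sm:.0f}x per SM, {chip_ops} FMA/clock, "
          f"{per_chip:.2f}x per chip")
```

With 16 per SM the chip total is the whitepaper's 256 FMA ops/clock, about 8.5x GT200 per clock. With 8 per SM it is 128 per clock, about 4.3x, which is in the neighbourhood of the 4.2x measured figure mentioned above; actual clocks would shift both ratios.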
 
"One 64-bit FP MULL and ADD or two 64-bit FP ADDs per clock" was probably too wide for the picture :)
 
There seems to be a bit of confusion here about a number of things in terms of Fermi's execution units. I recommend this page of my article:

http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932&p=7

1. Fermi cores have 32 vector lanes (CUDA cores in NV's bastardized and retarded terminology). Each pipeline is 16 vector lanes.
2. Each vector lane has a 32-bit ALU and a 32-bit FPU
3. Vector lanes cannot simultaneously use the ALU and FPU, there simply isn't register file bandwidth.
4. 64-bit FPU ops use both pipelines, 64 bit integer operations simply take longer, sometimes a lot longer (e.g. 64b IMUL = 4x slower than 32b IADD/ISUB)

Fermi definitely has TMUs and ROPs. They didn't talk about them, but they are certainly there.

David
 
Why does Damien think MADs are half speed? According to the whitepaper all MADs should simply be promoted to FMAs. Though Rys raised concerns about that he never went into detail as to why that's a problem.

Some people may want to retain the same results for running their code on different GPUs...it's called 'compatibility'.

If you wrote your code using MADs, it had better produce the same results on all GPUs. Extra precision from FMA can be nice, but it can also be an unpleasant surprise.

David
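A small illustration of why promoting MADs to FMAs can change results. This is a CPU-side sketch in double precision, not a model of any particular GPU's MAD unit: the unfused version rounds the product and then rounds the sum, while the fused version is emulated with exact rational arithmetic and rounded once at the end. The function names and input values are chosen purely for the demonstration.

```python
# Hedged sketch: unfused multiply-add vs an emulated fused multiply-add.
from fractions import Fraction

def mad(a, b, c):
    """Unfused: a*b is rounded to double, then the addition is rounded again."""
    return (a * b) + c

def fma(a, b, c):
    """Emulated fused multiply-add: exact a*b + c, rounded once at the end."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

# (1 + 2^-30) * (1 - 2^-30) = 1 - 2^-60, which is not representable as a
# double and rounds to 1.0, so the unfused path loses the low-order term.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print("mad:", mad(a, b, c))
print("fma:", fma(a, b, c))
```

The unfused path returns exactly zero while the fused path keeps the tiny residual, so code that compares against zero or accumulates such terms can behave differently depending on which instruction the compiler emits.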
 
Makes sense, so presumably CUDA 3.0 / PTX2.0 will offer a mechanism for the developer to ensure that MADs remain as such? Also, how does that work on RV870 which provides both MAD and FMA?
 
There seems to be a bit of confusion
If you're talking about me, I'm not confused ... I'm just not convinced. NVIDIA's whitepaper simply contradicts itself once. I like your article and it shows you had a decent bit of correspondence with them, but still a lot of information comes from the same source and misunderstandings do come about so easily. Unless the contradiction is clarified officially I'm not going to be convinced about the GF100 double precision instruction throughput until we see benchmarks.

BTW, you make a point to note the fact that command/address lines are not error checked for DRAM ... is that statistically really an issue? (It obviously is for data, but that moves at 4 times the rate.)
 