NVIDIA Fermi: Architecture discussion

But that's not a good enough excuse for neglecting known bottlenecks. Given the dramatic changes on the compute side I don't think it's unreasonable to ask for a little love for graphics.

Maybe they only improved the bottlenecks by 1.6x. Apparently they didn't want to double everything. Given the size of the chip, something had to give.
 
IEEE precision and DP

A few notes here:

1. There are different versions of the IEEE floating point standard. There's 754-2008, 754-1985, etc. Different chips comply with different versions, e.g. something designed in 1999 is probably not IEEE 754-2008 compliant : )

NV claimed that they had IEEE-compliant SP in GT200; now they say they had to make improvements to make SP IEEE compliant in Fermi. So perhaps the spec they were following changed, they are supporting more features, or they were just being deceptive about GT200.

2. There are many facets of floating point arithmetic, more than just the data types. There's rounding, denorms, under/overflow, exceptions, etc.

3. Some of the above can be supported with software or microcode traps, which provides compliance, but impacts performance.

4. Fermi can issue 256 DP FMAs/cycle across the chip, but this blocks issuing any other instructions. GT200 can issue 30 DP FMAs/cycle across the whole chip, and it also blocks issuing other instructions. That's actually an 8.5x increase, although frequency obviously matters - I appreciate that NV marketing took the high road and didn't round up.

5. RV770 can do one DP FMA per VLIW unit; there are 16 VLIW units per SIMD and 10 SIMDs per chip. For those who can multiply... that's 160 DP FMAs/clock across the chip (see the sketch after this list). Cypress doubles that to 320 DP FMAs/clock.

6. Fermi's latency on integer instructions varies by instruction type and operand width.

7. I will write about RV770 and Cypress later, once I have more information and a complete grasp of what they do. The manuals will probably tell you exactly what features are supported, but I expect they are IEEE 754-2008 compliant.
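
For those who'd rather not do the multiplication in their heads, here's a minimal sketch of the per-clock arithmetic behind points 4 and 5. The SM/SIMD counts (16 SMs for Fermi, 30 for GT200, 10/20 SIMDs for RV770/Cypress) are my assumptions about the full-chip configurations, and clock differences are ignored:

```c
#include <stdio.h>

int main(void)
{
    /* Peak DP FMA issue per clock, per chip, from assumed unit counts. */
    int fermi   = 16 * 16;  /* 16 SMs x 16 DP FMAs/SM/clk (half-rate DP on 32 cores) = 256 */
    int gt200   = 30 *  1;  /* 30 SMs x  1 DP FMA/SM/clk                             =  30 */
    int rv770   = 10 * 16;  /* 10 SIMDs x 16 VLIW units, 1 DP FMA each               = 160 */
    int cypress = 20 * 16;  /* 20 SIMDs x 16 VLIW units, 1 DP FMA each               = 320 */

    printf("Fermi  : %d DP FMAs/clk\n", fermi);
    printf("GT200  : %d DP FMAs/clk\n", gt200);
    printf("RV770  : %d DP FMAs/clk\n", rv770);
    printf("Cypress: %d DP FMAs/clk\n", cypress);
    printf("Fermi vs GT200 at the same clock: %.1fx\n", (double)fermi / gt200); /* ~8.5x */
    return 0;
}
```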
 
The G80 (+successors) situation is unprecedented. An entire HW generation has emerged, lived on the market, and is about to be superseded, all while its important capabilities were left unexposed by DirectX (applies to OGL as well, but it lagged even before that). The last 2 years are a testimony to what a bottleneck a 3D API has become.

Interesting statement considering G92/G200 have both lagged behind DX, themselves becoming the bottleneck to 3D rendering. But at least Fermi will catch up once again, and quite impressively go far beyond in non-gaming related areas.

Regards,
SB
 
Maybe they only improved the bottlenecks by 1.6x. Apparently they didn't want to double everything. Given the size of the chip, something had to give.

Well, TUs grow from 80 to 128, that gives what number already? :) It *looks* like NV did "real" DP (half, not just quarter, speed) and "skimped" on the TUs (if they had only put 10 instead of 8 in there....). I can see why there might be some pouting from gamers if that's true. It isn't a surprise for NV to do that, though. Recall FP24 vs. FP32.... [and, no, I don't think it's that bad this time around] NV seems to like precision challenges. I look forward to seeing >DP support at some point (there are definitely physics problems that require it).

I am also fascinated by what design philosophies might bleed over into nVidia from Imagine. The latter had a stream processor with lots of asymmetric ALUs behind a VLIW dispatch. Sound familiar? Any change won't be visible until the next major release, which, y'know, might happen after the world ends and all (2012!) but if the world somehow survives, I will be interested to see what direction NV takes.

-Dave
 
Are you referring to DX10.1? That's hardly a revolution.

Perhaps not, but increasing efficiency such that you get a free 20% increase in performance is rather substantial.

Leading to either 20% extra speed or 20% extra resources available for 3D effects.

Now, considering that devs generally develop for the lowest common denominator, if Nvidia had also provided mainstream cards with DX10.1, there's a chance there'd be more titles with more 3D improvements thanks to the extra resources available.

A bottleneck is a bottleneck whether it's revolutionary or not. In this case Nvidia pretty much held back the rest of the industry, in a similar way to how you could claim R420 held back the industry by not supporting SM 3.0 (everything could still be done in SM 2.x, just less efficiently). But ATI at least only held things back for one refresh.

Regards,
SB
 
Why does everyone assume there is no tessellator? Based on the information available about the board, I am curious why everyone has come to this conclusion.
 
In broad terms, in order for GTX285 to be just about faster than HD4890 (10-20%), it required 2x HD4890's TUs (80 v 40) and 2x HD4890's RBEs (32 v 16).

Now that HD5870 has 80 TUs and 32 RBEs ...

Of course that takes no account of the per-unit efficiency of these things. There's no reason to think NVidia hasn't re-vamped that - if there are fixed-function TMUs and ROPs.
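
As a rough sketch of where those unit counts put the theoretical fillrates (the clock values here are approximate reference clocks I'm assuming, not something from the post, and per-unit efficiency is ignored entirely):

```c
#include <stdio.h>

int main(void)
{
    /* Approximate reference core clocks in MHz -- assumed, not from the post. */
    double gtx285 = 648, hd4890 = 850, hd5870 = 850;

    /* Texels/clk = TUs, pixels/clk = RBEs/ROPs, straight from the unit counts above. */
    printf("GTX285: %.1f Gtex/s, %.1f Gpix/s\n", 80 * gtx285 / 1000, 32 * gtx285 / 1000);
    printf("HD4890: %.1f Gtex/s, %.1f Gpix/s\n", 40 * hd4890 / 1000, 16 * hd4890 / 1000);
    printf("HD5870: %.1f Gtex/s, %.1f Gpix/s\n", 80 * hd5870 / 1000, 32 * hd5870 / 1000);
    return 0;
}
```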

Jawed

Come on - normally you don't think in those simple terms.
 
7. I will write about RV770 and Cypress later, once I have more information and a complete grasp of what they do. The manuals will probably tell you exactly what features are supported, but I expect they are IEEE 754-2008 compliant.
I'm kind of curious about the blurb that tech-report wrote:
The lone potential snag for full IEEE compliance, Demers told us, is the case of "a few numerical exceptions." The chip will report that such exceptions have occurred, but won't execute user code to handle them.
If they're reported, then why can't you write code to handle them?
 
I'm kind of curious about the blurb that tech-report wrote: If they're reported, then why can't you write code to handle them?

Maybe they're reported in a global way, not accessible by shader code, or only after a kernel has finished running? That is, perhaps you can inquire and find out that "an exception occurred at some point", or even get a count of the number of exceptions?

Otherwise, it seems obvious that if a readable flag exists after an FP operation, you could simply write your own code to check for exceptional status after each operation where you care about traps. I mean, it's not the same as SIGFPE, or real exceptions, but it would be invaluable in a 'debug build' to have generated traps able to tell you which line of code generated the exception, and/or to set breakpoints/debug statements on the exception clause.
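
On a CPU, that's exactly what the C fenv interface gives you: sticky flags you can clear and test around each operation you care about. A minimal host-side sketch of the idea (plain CPU C, not CUDA device code, which has no fenv equivalent; it's just an analogy for what a debug build could do if the GPU exposed the flags to the kernel):

```c
#include <fenv.h>
#include <stdio.h>

/* Tell the compiler we read/modify the FP environment (some compilers ignore this pragma). */
#pragma STDC FENV_ACCESS ON

int main(void)
{
    feclearexcept(FE_ALL_EXCEPT);       /* clear the sticky exception flags */

    volatile double x = 1.0e308;        /* volatile so the multiply happens at run time */
    volatile double y = x * 10.0;       /* overflows to +inf, sets FE_OVERFLOW */

    if (fetestexcept(FE_OVERFLOW))
        printf("overflow flag set, y = %g\n", (double)y);
    if (fetestexcept(FE_INEXACT))
        printf("inexact flag set\n");

    feclearexcept(FE_ALL_EXCEPT);       /* a debug build could do this around every op of interest */
    return 0;
}
```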

Seems like it should be a cut-and-dried case for them to explain clearly what they support, unless they're having internal meetings on the proper 'messaging' to go forward with, to blunt any criticism that whatever they have doesn't match the standard, or Fermi. Of course, they could also really be busy and not really care what's going on here, but Dave should be able to go get the required information. :)
 
4. Fermi can issue 256 DP FMAs/cycle across the chip, but this blocks issuing any other instructions. GT200 can issue 30 DP FMAs/cycle across the whole chip, and it also blocks issuing other instructions. That's actually an 8.5x increase, although frequency obviously matters - I appreciate that NV marketing took the high road and didn't round up.
Pity they made it a bullet point in a list of things which changed about the SMs and not the chip ... 8 from 16 is rounded way, way down. Don't tell me it's unrealistic to expect them to be talking about the SM changing because of the TPC->dual-warp transition ... right above that line they talked about a 4x change which did concern a direct comparison of old to new SMs. With that kind of context, assuming they are talking about the chip instead of the SMs is beyond ridiculous.

Once again, this is what they said:
• Third Generation Streaming Multiprocessor (SM)
o 32 CUDA cores per SM, 4x over GT200
o 8x the peak double precision floating point performance over GT200
o Dual Warp Scheduler simultaneously schedules and dispatches instructions
from two independent warps
o 64 KB of RAM with a configurable partitioning of shared memory and L1 cache

If an SM has 16x the peak double precision performance of a GT200 SM, then that is what it should say.
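
To make the per-SM versus per-chip distinction concrete, a quick sketch, assuming 16 SMs in Fermi, 30 in GT200, and ignoring clocks:

```c
#include <stdio.h>

int main(void)
{
    /* DP FMAs issued per clock, per SM (assumed unit counts). */
    double fermi_per_sm = 16.0;   /* half-rate DP across 32 CUDA cores per SM */
    double gt200_per_sm =  1.0;   /* one DP unit per SM */

    double per_sm   = fermi_per_sm / gt200_per_sm;                  /* 16x   */
    double per_chip = (16 * fermi_per_sm) / (30 * gt200_per_sm);    /* ~8.5x */

    printf("per SM  : %.1fx\n", per_sm);
    printf("per chip: %.2fx\n", per_chip);
    /* "8x" only drops out of the per-chip comparison; per SM it would be 16x. */
    return 0;
}
```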
 
I'm still trying to understand why we should be impressed by 2x the raster rate. Hasn't that always been increasing?

Not really. From X800 to HD 4890 there has been a 70% increase in ROP rate. And that's only because it's clocked 70% higher. The number of ROPs has been stuck at 16 all this time.
 
Not really. From X800 to HD 4890 there has been a 70% increase in ROP rate. And that's only because it's clocked 70% higher. The number of ROPs has been stuck at 16 all this time.
It's not as simple as that. X800 was limited to 32 Zs per clock with MSAA, whereas 4890 can do 64.
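
A minimal sketch of that Z-rate point, with approximate core clocks I'm assuming (X800 XT around 500 MHz, HD4890 at 850 MHz) and the Z-per-clock figures from the post:

```c
#include <stdio.h>

int main(void)
{
    /* Approximate core clocks in MHz (assumed) and Z samples per clock with MSAA (from the post). */
    double x800_clk = 500,  hd4890_clk = 850;
    int    x800_z   = 32,   hd4890_z   = 64;

    printf("X800  : %.1f GZ/s with MSAA\n", x800_z   * x800_clk   / 1000);    /* 16.0 */
    printf("HD4890: %.1f GZ/s with MSAA\n", hd4890_z * hd4890_clk / 1000);    /* 54.4 */
    printf("ratio : %.1fx\n", (hd4890_z * hd4890_clk) / (x800_z * x800_clk)); /* 3.4x, not just 1.7x */
    return 0;
}
```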
 
Nvidia GF100: 128 TMUs and 48 ROPs

http://www.hardware-infos.com/news.php?news=3228

To summarize, we have the following facts:

- 40 nm
- 3.0 billion transistors
- 512 SPs
- 128 TMUs
- 48 ROPs
- 384 Bit GDDR5


Can the mods please regulate the self promotion by the usual suspects like these?

I can understand this guy spamming his links on vr-zone, that's how it works over there, but here? :no:


I know, WTF? The next person to do self-promotion should be spanked with the wooden Fermi board as punishment. :devilish:
 
Perhaps not, but increasing efficiency such that you get a free 20% increase in performance is rather substantial.

20%? Is that based on anecdotal evidence?

Not really. From X800 to HD 4890 there has been a 70% increase in ROP rate. And that's only because it's clocked 70% higher. The number of ROPs has been stuck at 16 all this time.

I don't recall the number of ROPs ever influencing the number of (marketed) rasterizers. If so, why is it that 8- and 4-ROP derivatives weren't marketed as having fewer rasterizers, or vice versa? I find it hard to believe that AMD just up and decided to change that trend.
 