NVIDIA Fermi: Architecture discussion

That'd be an obvious application. Another one is that it can work as a reorder buffer, to make memory access more bandwidth efficient. For example, in G8X/G92, memory access has to be strictly sequential and aligned to be most efficient. On GT200 the rules are less strict, but there are still situations that can lead to poor memory bandwidth utilization. Now, with a large L2 cache, the restrictions should be much looser, because the cache can act as a large buffer that collects irregular reads/writes into a more regular access pattern. For example, I suspect that with Fermi's L2 cache it's now possible to write a fast histogram program with the histograms stored in global memory (and hopefully cached). Similar algorithms such as radix sort can also benefit.
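To make the histogram idea concrete, here is a minimal CUDA sketch of the kind of kernel meant above, assuming a 256-bin histogram kept entirely in global memory and updated with atomics (which would presumably be serviced out of the L2 rather than DRAM):

```cuda
// Minimal sketch: 256-bin histogram kept in global memory.
// The scattered atomicAdd traffic is exactly the kind of irregular access
// pattern a large L2 could buffer into something more regular, as speculated.
__global__ void histogram256(const unsigned char *data, size_t n,
                             unsigned int *bins /* 256 entries, pre-zeroed */)
{
    size_t idx    = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)blockDim.x * gridDim.x;

    for (size_t i = idx; i < n; i += stride)
        atomicAdd(&bins[data[i]], 1u);
}
```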

That will certainly help with reducing the overhead of having to coalesce access patterns.

My concern is that, since L1 is not guaranteed to be coherent, there is a risk of cache pollution if a cacheline is evicted from L1 to L2, gets changed there, and is then brought back into L1 again.
Do you mean by putting the trees/linked lists in shared memory? That'd be possible but could lead to a lot of bank conflicts. Also 16KB shared memory is not really very large, especially when you want to have different trees for each thread...
Neither is 48K per SM. (Now that an SM has 32 ALUs, Fermi actually has less L1 per ALU than GT200 had shared memory per ALU.)
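Spelled out, using GT200's 16 KB of shared memory per 8-ALU SM against the 48 KB / 32 ALUs quoted above:

$$\frac{16\ \mathrm{KB}}{8\ \mathrm{ALUs}} = 2\ \mathrm{KB/ALU} \quad \text{vs.} \quad \frac{48\ \mathrm{KB}}{32\ \mathrm{ALUs}} = 1.5\ \mathrm{KB/ALU}$$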

There is the L2, but that is shared by all the SMs, so it works out to only 32K/SM. Not to mention that it will be slower.
 
Yeah it's 1/5 (really 1/4 as the 5th alu simply isn't used), but anyway ~544GFlops worth.
This is true for the peak GFLOPS rate using MAD/FMA (a 64-bit FMA needs all 4 simple ALUs on RV870; RV770 couldn't do 64-bit FMA at all). It is noteworthy, though, that for other instructions, e.g. mul or add, the rate is 2/5 of the single-precision rate, so still 544 GFLOPS when using only adds or muls, as long as the compiler can extract pairwise independent muls or adds (GF100 will drop to half its GFLOPS rate with muls or adds).
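For reference, with RV870 at 850 MHz that works out as follows (my arithmetic, based on the rates quoted above):

$$0.85\ \mathrm{GHz} \times 1600\ \mathrm{ALUs} \times 2 = 2720\ \mathrm{GFLOPS\ (SP\ FMA)}$$
$$2720 \times \tfrac{1}{5} = 544\ \mathrm{GFLOPS\ (DP\ FMA)}, \qquad (0.85 \times 1600) \times \tfrac{2}{5} = 544\ \mathrm{Gops/s\ (DP\ add\ or\ mul)}$$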
 
That will certainly help with reducing the overhead of having to coalesce access patterns.

My concern is that, since L1 is not guaranteed to be coherent, there is a risk of cache pollution if a cacheline is evicted from L1 to L2, gets changed there, and is then brought back into L1 again.

Technically, an SM shouldn't read or write an address that another SM is reading or writing, that is, between "sync events." So, if the L1 cache has something evicted, that line must be either read-only data (in which case it shouldn't be written by another SM) or data this SM is writing (in which case it shouldn't be read by another SM), so there shouldn't be a risk of cache pollution. Atomic operations on global memory are likely not cached in L1.

The write-back of the L1 cache most likely only happens when an SM's kernel has completed, or when it issues a memory barrier request. These events can be broadcast, and all SMs can be forced to write back their L1 caches, into the unified L2 cache or directly to memory. That way, coherence at the sync points is maintained.
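In CUDA terms, those two "sync events" roughly correspond to kernel boundaries and __threadfence(); here's a minimal sketch (my illustration of the programming model, not a description of the actual hardware protocol):

```cuda
// Producer/consumer across SMs: writes become visible device-wide either
// after an explicit memory fence or at a kernel boundary.
__global__ void producer(int *payload, int *flag)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *payload = 42;
        __threadfence();      // memory barrier: order/flush the write so other SMs can see it
        atomicExch(flag, 1);  // atomic on global memory (likely handled past L1, per the post)
    }
}

__global__ void consumer(const int *payload, int *out)
{
    // Launched after producer<<<...>>>() has completed: the kernel boundary
    // guarantees *payload is visible here, no matter which SM runs this block.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *out = *payload;
}
```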
 
So, what's the verdict on the gaming part: are there enough magic tricks and cookie monsters to really beat a chip with over 1 TFLOPS higher theoretical peak performance, and by a notable margin?
For reference, last gen the peak difference between the 4890 and GTX285 was ~0.2 TFLOPS.
 
Has somebody already looked at the packing (transistor density) of Fermi?
According to Anandtech, Fermi should be 467 sqmm. If you compare this value with AMD's Cypress (339 sqmm / 2.15 billion transistors * 3 billion transistors = 473 sqmm), Nvidia's packing is as good as AMD's.

If Fermi is >= 40% faster than Cypress, you can say that it has better performance per die size than Cypress. How times change... ;)
 
But the key thing is that GF100 has way too much INT capability for graphics. In my view the only way to defray this is to make it serve as the INT unit for compute, as part of DP, and also for texturing. It's pretty smart, I'd say. A lot of compute is address computation, which is yet another use for the INT ALU.
It doesn't look to me like Nvidia invested a lot of transistors into INT capability. 32-bit muls are half rate according to RWT (64-bit is 1/4 rate; I guess that's mostly needed because the chip can address more than 4GB of RAM?), and you can't issue INT and FP at the same time (on the same pipeline), so in terms of implementation it sounds pretty cheap to me. Still, the INT capability is certainly more than decent. Note that RV870 beefed up INT capability too: it can do 4 muls/adds per clock, though they are 24-bit only (which should of course be very cheap to implement), plus one 32-bit mul/add on the T unit (though the 32-bit mul needs 2 instructions on the T unit if you want both the high-order and low-order result bits, at least if RV870 is like RV770 in that area).
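As an aside, CUDA exposes both of the multiply flavours being discussed; a trivial sketch with hypothetical addressing helpers (the names are mine, not from any SDK) to make the distinction concrete:

```cuda
// Hypothetical helpers, only to illustrate the two integer multiply widths.
__device__ unsigned int addr_mul32(unsigned int row, unsigned int pitch,
                                   unsigned int col)
{
    // Full 32-bit integer multiply (half rate on GF100 according to RWT).
    return row * pitch + col;
}

__device__ unsigned int addr_mul24(unsigned int row, unsigned int pitch,
                                   unsigned int col)
{
    // 24-bit multiply intrinsic: the kind of narrow, cheap-to-build multiplier
    // mentioned above for RV870's four simple ALUs.
    return __umul24(row, pitch) + col;
}
```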
 
If true, that's a disaster in terms of bandwidth IMHLO (the L doesn't stand for lame but for layman's, heh). Unless of course we'll see some >16xAF sample mode "for free" (due to the bandwidth restriction). That last one is more of a joke, of course; I personally could make better use of something like better filter kernels than of more than 16x samples.

Else, if it truly has something like 256 TMUs or equivalents, I doubt real-time fillrate could even peak at such heights.

One aspect no one has asked about so far (and obviously there's probably no data available on it) is 8xMSAA performance on GF100. I'd dare to say it has 48 ROPs (wtf it would need 48 pixels/clock for is beyond my uneducated imagination), but the 8Z/8C note in Rys' diagram isn't particularly telling in that department yet either, IMO.

Finally, if it ends up with something like 1200MHz GDDR5, the 384-bit bus "grants" it 230.4GB/s, which might be sufficient and marks a ~50% increase over the GTX285; that's closer to what Dally allowed me to speculate from his PCGH interview and way less than anything the 512-bit scenarios so far granted.

Let's suppose the design is 700MHz with 48 ROPs/256 TMUs/512 SPs, and the memory is the same as the 5870's (1.2GHz GDDR5), so 230.4GB/s on a 384-bit memory controller.

With G92b, Nvidia had a 738MHz design with 16 ROPs/64 TMUs/128 SPs and 1.1GHz GDDR3, so 70.4GB/s on a 256-bit memory controller.

GT300 would then have 2.85X the pixel fillrate / 3.8X the texel fillrate / 3.3X the memory bandwidth.
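For reference, those ratios fall straight out of the numbers above:

$$\frac{48 \times 700}{16 \times 738} \approx 2.85, \qquad \frac{256 \times 700}{64 \times 738} \approx 3.8, \qquad \frac{230.4}{70.4} \approx 3.3$$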

This logic is kinda simplistic, but things don't look so bad if the GDDR5 memory controller has good efficiency.

Nvidia must do something to reduce the hit with 8xMSAA.

Anyway, I guess (a wild guess) GT300 will be around 1.5X faster per MHz relative to a 5870 (at 4X AA).

The problem with the 5870 is that its performance improvement over the 4890 is not consistent. (It shows much larger variations in performance than 2X the specs would normally give; yes, I know about the bandwidth...)

I don't know why that is, but I guess it is either the geometry setup engine (the geometry/vertex assembler has the same performance as the 4890's) or something about geometry shader performance?

I could only find 3DMark Vantage tests, if you check:

http://www.pcper.com/article.php?aid=783&type=expert&pid=12

GPU Cloth: 5870 is only 1.2X faster than 4890. (vertex/geometry shading test)
GPU Particles: 5870 is only 1.2X faster than 4890. (vertex/geometry shading test)

Perlin Noise: 5870 is 2.5X faster than 4890. (Math-heavy Pixel Shader test)
Parallax Occlusion Mapping: 5870 is 2.1X faster than 4890. (Complex Pixel Shader test)


It shouldn't be a problem of dual-rasterizer/dual-SIMD-engine efficiency, since the synthetic pixel shader tests are fine (more than 2X) while the synthetic geometry shading tests are only 1.2X.

Are these synthetic vertex/geometry shading tests so bandwidth limited that they deliver 1.2X instead of 2X?
Or are the pixel shader tests, like the Parallax Occlusion Mapping test, not bandwidth limited at all? (Why do they deliver more than 2X?)

And anyway, it is not logical for the 5870 to be extremely bandwidth limited.
Why would ATI waste transistor resources like that if the design is not going to deliver (to such a degree) because of the bandwidth?
Surely they would have used the transistor space in a more efficient way.
 
So, what's the verdict on the gaming part: are there enough magic tricks and cookie monsters to really beat a chip with over 1 TFLOPS higher theoretical peak performance, and by a notable margin?
For reference, last gen the peak difference between the 4890 and GTX285 was ~0.2 TFLOPS.

Yea, why not?
The GTX295 beats the 5870 as well, in most games... and doesn't the GTX295 have a considerably lower theoretical TFLOPS rating than the 5870 (~1.8 TFLOPS vs ~2.7 TFLOPS)? That's not even counting the inefficiencies introduced by SLI, which Fermi won't have to suffer from.
 
Has somebody already looked at the packing (transistor density) of Fermi?
According to Anandtech, Fermi should be 467 sqmm. If you compare this value with AMD's Cypress (339 sqmm / 2.15 billion transistors * 3 billion transistors = 473 sqmm), Nvidia's packing is as good as AMD's.
Well, Anand got his 467mm² by multiplying the RV870 die size by a 1.4 factor for the transistor-count difference between these chips, so it's not surprising you end up with the same density :).
I think if GF100 is indeed only this big I'd be pleasantly surprised. Both G92b and GT200b required about 25% more area per transistor than RV770.
 
The industry is littered with failures trying to get into the CPU space. There was once a thriving industry in the workstation/server business prior to the rise of commodity linux. You had SPARC, you had PA-RISC, you had MIPS, you had PPC, you had Alpha, you had 68k. These were once protected islands until free BSD variants and Linux arrived. Now most are in the dead pool, including Sun. Only IBM has survived, and just barely.

There are only two areas where recent success has occurred: consoles and the mobile/low-power markets, but even there you see Intel making an assault.

Competitors could have commodified x86 if not for the simple fact that the fabs are incredibly expensive, and therefore the barrier to entry is now so high, you're unlikely to see new challengers. We're lucky AMD is still hanging on. Intel's biggest threat probably comes from a mainland Chinese competitor in the next decade.

If Nvidia plans to go up against Intel, not having x86 compatibility is a non-starter. And even if they did plan to go that route, they'd still face the fact that ultimately, they can't bet their entire future on TSMC while Intel and AMD control their own process.

So IMHO, a long term strategy of trying to beat Intel at their own game is a failure strategy. IMHO, the real strategy is to de-emphasize traditional processing, which is already hitting limits. Your web browser or Microsoft Office won't run much faster on superduper x86 chips. Rather, the kinds of workloads that will stress desktop systems are inherently media/parallel tasks anyway.

We're looking at hitting the limits of process shrinks in the coming decade anyway, and the only way to keep scaling is parallelism. So long term, NVidia's strategy should be to transition developers to a new model, rather than adapt their existing hardware to the old one. And I think you're seeing them do that, especially with the new developer tools they have coming out.

They just can't do it too quickly, but graphics still has to fund this transition.

Honestly, I have a hard time seeing anyone challenging Intel in the future, except perhaps state-subsidized players in Asia or maybe large Japanese oligarchies. As we get closer to fundamental limits, costs are going up so high, that very few entities can raise the kind of capital, and wait a long time for ROI, to meet future needs.

As great as the R8xx/GF1xx are, the reality is that Intel is a very large company with lots of smart people, lots of money, and other market-position advantages that make it very hard to unseat them; if need be, they can put enough resources behind killing off a threatening GPGPU. Even with a large clusterfuck like Prescott, AMD was only able to bite off a small niche.

Great post and honest comment. ;)
 
on rv870, the rv770 couldn't do 64bit fma)

[ignore]Neither can RV870 AFAIK, only MAD for DP - FMA is SP only (not 100% certain though).[/ignore]

Seems I made something of a booboo, they actually do FMA for DP as well, so disregard the above.
 
elsence,

Yes, the speculative math is more than simplistic, and besides that you are comparing a performance chip (G92) with a high-end chip (GF100). If anything, G92, with its bandwidth and memory size constraints, wasn't in my mind designed for very high resolutions.

Considering NV's 8xMSAA performance from G80 through GT200 today, I've never really understood what the real problem is that causes such differences. 4xMSAA is single cycle and 8xMSAA takes only two cycles. Granted, through various driver updates 8xMSAA has improved quite a bit on all those GPUs, but they're always a notch behind ATI's equivalents.

Some might say 8xMSAA performance is bandwidth related, but I don't see anything that proves that. I don't see any excess of bandwidth on Cypress, and yet its performance drop from 4x to 8xMSAA is rather small. Something else must be vastly different between the two architectures so far.

Finally, when it comes to any hypothetical bandwidth limitation for Cypress, GF100 doesn't sound that much different so far. If there is such a limit, then it applies to all performance/high-end DX11 GPUs rather than only one. Besides, how each architecture handles its bandwidth is far more important than the raw maximum bandwidth on paper. Without any extensive test results from a GF100 there's nothing we can say about that yet, nor can we compare the two families.
 
The industry is littered with failures trying to get into the CPU space. There was once a thriving industry in the workstation/server business prior to the rise of commodity linux.
All of these were (x86) cpu substitutes. Nvidia (for the time being at least) is driving complementarity in the x86/cpu space. Sure, they're looking to supplant large traditional clusters while leaving a few breadcrumbs for the cpu crowd. Heh.

Has somebody already looked at the packing (transistor density) of Fermi?
According to Anandtech, Fermi should be 467 sqmm. If you compare this value with AMD's Cypress (339 sqmm / 2.15 billion transistors * 3 billion transistors = 473 sqmm), Nvidia's packing is as good as AMD's.
I'd say they just used the same assumptions you did & extrapolated the number... It may be bigger or smaller. ;)
 
Well, Anand got his 467mm² by multiplying the RV870 die size by a 1.4 factor for the transistor-count difference between these chips, so it's not surprising you end up with the same density :).
You are right, but the rumours expected a similar die size. I think we can say that Fermi will be around 500 sqmm, and that still looks like much better performance per die size than GT200.

PS: I am now sure that Radeon HD 5870 X2 will be faster than Geforce 380 in gaming performance.
 
I just rechecked hardware.fr.
It reports 16 interpolations.
So probably we have 256TMUs.
That is 16 _scalar_ interpolations. So 256 in total. But you need to interpolate 2 scalar attributes for 2d texturing, so that's only really good for 128 TMUs.

edit: oh hmm, forgot about the higher clock of the ALUs vs. the TMUs. Still, IMHO 256 TMUs would make no sense at all; it would increase the TEX:ALU ratio considerably over GT200 (back to the G92 level).
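Spelling that arithmetic out (assuming the 16 scalar interpolations are per SM, and before accounting for the ALU/TMU clock difference mentioned in the edit):

$$16\ \mathrm{scalar\ interp/SM/clk} \times 16\ \mathrm{SMs} = 256\ \mathrm{scalar\ interp/clk} \;\Rightarrow\; \frac{256}{2\ \mathrm{attributes}\ (u,v)} = 128\ \mathrm{texture\ coordinates/clk}$$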
 
Wow. I had just written a fairly long post, then BAM! Blue Screen of Death. That doesn't happen too often nowadays to say the least - just my luck... A couple of thoughts based on some of the strange assumptions I've seen in this thread:

1) Remember Fermi is the architecture, GF100 is only a chip. The derivatives should be less HPC-centric and therefore have less GPGPU 'overhead'. How much is anyone's guess; ECC support should be gone in some/all of them at the very least and it's not clear to me what level of FP64 support they will have. Also, low-end chips might have a lower ALU:TEX ratio (or even ALU:ROP) and therefore fewer transistors should be HPC or GPGPU-centric.

2) The FP64 implementation is likely based on the two FPUs being very different; one with 24x24 multipliers and incapable of either INT32 MULs or FP64, and the other capable of either at full speed and benefiting from the first one's buses & RF when doing 64-bit. This seems quite efficient to me, but at the same time it might make it difficult to support slower FP64 in derivatives as you'd then be left without INT32 support. Unless they do it in four cycles in either ALU on those chips and forget about INT64, in which case I'm not sure why they apparently gave up on INT24 support completely... (and this would also imply the FPU-ALU differentiation is once again mostly marketing either way)

3) TMUs will probably be fairly traditional. Just look at the die size: if we very naively assume that the TMU is the differently colored block in each shader area, then we get to about ~13% of the total die size. Given the greater performance and functionality of the SMs, and the fact that more of the 'cluster logic' likely moved in there, that's not an impossible evolution from the ~25% of GT200, and it's also a conservative estimate. It's interesting to ponder how NVIDIA could change the ALU:TEX ratio over time in this architecture; unlike in G8x, where the multiprocessor count per cluster was the easiest approach, it seems to me that changing the performance of the TMUs is an easier one this time around.

4) The MC-linked group (which includes the ROPs) is fairly large, so there is little reason to assume they've tried to offload as much as possible to the shader core there (although they could have for flexibility reasons; I still yearn for programmable blending). More importantly, it is therefore also possible that these blocks handle most of the triangle setup/rasterization functionality, whose performance would then scale gracefully between derivatives based on the number of MCs. Since every tile should be dedicated to an MC, it would be a sensible location from which to handle conflicts and guarantee correct rendering order. Obviously, that's not a magic fix and I still can't imagine it being easy to implement.
 
You as an end-user should notice it, as its die size is an indication of the cost of the processor.

Not really.
The sales price of videocards or processors in general is not directly related to the manufacturing cost.
A huge factor in product pricing is also performance.
There are plenty of examples of 'small' chips which commanded high prices from end-users due to their better performance (e.g. Athlon 64/X2/FX vs Pentium 4/D), or 'large' chips which were cheaper than smaller chips because their performance didn't allow higher prices (e.g. Athlon X2/X3/X4/Phenom vs Core2).
 
You as an end-user should notice it, as its die size is an indication of the cost of the processor.

The newegg price is a lot better indication of whether I should give a crap about die size.

What I actually want to know, of all the weird things, is how tessellation performance will compare between the two, since Nvidia's is supposedly more software oriented. That doesn't necessarily mean it will be worse, but it certainly hints that it may suffer at that particular task. I was hoping that, since tessellation is finally in DX, it would become more commonly used.
 