NVIDIA Fermi: Architecture discussion

Just architecture info and maybe a few Nvidia-provided benchmarks. Probably won't get any independent numbers until launch in March/April/May/whenever.
 
  • Sampler runs at scheduler clock (half the hot clock)
  • 4 samplers per cluster (64 total)
  • Sampler will do jittered-offset for Gather4 (no idea how, the texture-space offset is constant per call)
  • 4 tris/clock setup and raster
  • Raster area per unit is now 2x4 rather than 2x16
  • PolyMorph Engine (heh), effectively pre-PS FF, one per cluster
  • ROPs now each take 24 coverage samples (up from 8)
  • Compression is improved; the performance drop going from 4x to 8x MSAA is smaller than on GT200, clock-for-clock
  • Display engine improvements

That's the list of the stuff I either got wrong or missed in my article at TR, concerning the graphics. Biggest thing is probably the > 1tri/clk for small triangles, and the change in the per-clock rasterisation area for each of the four units. Aggregate setup and rasterisation performance is no faster per clock than G80+ for triangles that are > 32 pixels.
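
To illustrate the small-triangle point, here's a crude throughput sketch (my own toy model, not anything NVIDIA published), assuming 8 px/clock per raster unit across four units versus a single G80-style 32 px/clock unit:

```python
# Toy model: triangle throughput limited either by one tri per raster unit
# per clock, or by aggregate rasterised pixels per clock (my assumptions).
def tris_per_clock(tri_pixels, units, px_per_unit):
    return min(units, units * px_per_unit / max(tri_pixels, 1))

for area in (4, 8, 16, 32, 64, 128):
    gf100 = tris_per_clock(area, units=4, px_per_unit=8)    # 4 x (2x4)
    g80   = tris_per_clock(area, units=1, px_per_unit=32)   # 1 x (2x16)
    print(f"{area:>4} px: GF100 ~{gf100:.2f} tri/clk, G80-style ~{g80:.2f} tri/clk")
```

Below 32 pixels the four small units pull ahead; at 32 pixels and above both models converge on the same rate.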

Sampler count was out by 2x, so NV will need a > 1.6 GHz hot clock to beat a GTX 285 in peak possible texture performance, and there's a distinct lack of information about the sampler hardware in the latest whitepaper. Doing more digging there, but it looks like no change to texturing IQ other than the ability to jitter the texcoords per sample during an unfiltered fetch.
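
The 1.6 GHz figure is simple arithmetic, assuming the usual GTX 285 specs (80 bilinear samplers at a 648 MHz core clock) and 64 GF100 samplers running at half the hot clock:

```python
# Peak bilinear texel rate comparison (GTX 285 figures are the published
# specs; GF100 figures follow the assumptions above).
gtx285_rate = 80 * 648e6          # ~51.8 Gtexels/s
samplers    = 64
# GF100 matches when samplers * (hot_clock / 2) >= gtx285_rate
hot_clock_needed = 2 * gtx285_rate / samplers
print(f"Hot clock to match GTX 285: {hot_clock_needed / 1e6:.0f} MHz")  # ~1620 MHz
```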

NV claim that everything they list in the PolyMorph block exists as a physical block in the silicon. The obviously interesting thing there that didn't exist before is the tessellator; it seems the fixed-function block is responsible for generating the new primitives (and for killing geometry), the units run in parallel where possible, and most other stuff runs on the SM.

As for my clock estimates, I doubt a 1700 MHz hot clock at launch (:sad:), but the base clock should be usefully higher, up past 700 MHz. They still haven't talked about GeForce productisation or clocks, but at this point it looks unlikely the fastest launch GeForce will texture faster than a GTX 285.

That's about it, will have an article up ASAP.
 
As always Rys, thanks for your immediate thoughts; looking forward to the article.
 
Rys,

I assume that by base clock you mean the lowest one (the ROP frequency). How likely is it that that one is lower than 700 MHz while half the hot clock (which will account for the majority of what used to run at the core frequency) is above 700 MHz? Something like 500/725/1450, for example?
 
I can see the base/ROP clock being higher than half the hot clock now, because there's not much going on in that domain any more. ROP clock shouldn't be ~500 MHz if the hot clock is 1450. Unless you know something I don't :smile:
 
I can see the base/ROP clock being higher than half the hot clock now, because there's not much going on in that domain any more. ROP clock shouldn't be ~500 MHz if the hot clock is 1450. Unless you know something I don't :smile:

No, I don't know something you don't.

Wait a sec: if there's not much going on in the ROP domain, then why "torture" it with such high frequencies in the first place? I'm already a bit confused by Anand's article, which states that each rasterizer handles up to 8 pixels/clock. Times four gives a total of 32 pixels/clock. However, given 48 ROPs, what are the remaining 16 pixels/clock for exactly? I would have thought that with 4 rasterizers each would work with 12 pixels/clock; what am I missing?
 
Those 48 ROPs can do a bit more than just finish a single 32-bit RGBA pixel in a clock. They could (e.g.) finish a lower number of MSAA'd pixels in a clock without compression. The Z-only raster rate could be different too. Need a board!
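
Rough numbers to show the headroom (my own arithmetic, ignoring the different clock domains for simplicity): four rasterizers at 8 px/clock feed 32 px/clock into 48 ROPs, so the extra ROPs only start to matter once pixels need more than one ROP cycle each:

```python
# Sketch only: raster output vs. ROP throughput when each pixel costs
# more than one ROP cycle (e.g. uncompressed MSAA or blending).
raster_px_per_clock = 4 * 8   # four rasterizers, 8 px/clock each
rops = 48

for rop_cycles_per_pixel in (1, 2, 4):
    rop_px_per_clock = rops / rop_cycles_per_pixel
    limiter = "ROPs" if rop_px_per_clock < raster_px_per_clock else "raster"
    print(f"{rop_cycles_per_pixel} ROP cycle(s)/pixel: "
          f"ROPs sustain {rop_px_per_clock:.0f} px/clk, limiter: {limiter}")
```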
 
So, does this GF100 power-hog monster feature RGSS anti-aliasing, or will it still rely on the unofficial OGSS modes?

How hard is it really to turn the available TSAA (which is already sparse-grid SSAA) into full-screen SSAA in the driver? It's my understanding that that is the way AMD implemented it. I'd consider it mighty stupid if NV didn't go for that rather simple "hack", even more so considering their delay.
 
Well, since GF100 is a DX11 part (and DX10.1 for that matter), I guess it's probably a trivial task to "hack" the driver and make the shader pipe process the fragments at sample frequency... oh, well. ;)
 
I can see the base/ROP clock being higher than half the hot clock now, because there's not much going on in that domain any more. ROP clock shouldn't be ~500 MHz if the hot clock is 1450. Unless you know something I don't :smile:

According to the GF100 whitepaper, this is not (or at least should not be) the case. The whitepaper claims that the texture unit on previous architectures operated at the core clock of the GPU. On the GF100, the texture units run at a higher clock, leading to improved texturing performance for the same number of units.

The whitepaper also claims that while GT200 has more texture units than GF100, GF100 delivers higher real-world performance thanks to improved efficiency, and NVIDIA shows some graphs where GF100 has 40-70% higher texturing performance than GT200 in a few benchmarks (L4D, Crysis, Vantage, Crysis WH).

All very impressive stuff.
 
What is not or shouldn't be the case? I don't think I follow (I've got the whitepaper).

Didn't you speculate that the core clock will actually be higher than half the hot clock, which would mean the texture units operate below the core clock? GF100's whitepaper seems to indicate that previous architectures had the texture units operating at the core clock, while GF100's texture units run at a higher clock, which suggests they operate at something above the core clock, correct?
 
Ah yeah, I follow now, and yeah, that's what the whitepaper alludes to (core clock < scheduler/sampler clock) :smile:
 
Reading Anand's bit about GF100's distributed setup, I suspect there is still much more to the matter in terms of keeping the workload coherent among the GPC partitions. The mysterious square structure in the very center of the die could actually serve as a kind of "directory cache" (similar to Tukwila and Nehalem-EX), where all the geometry attributes are tagged and kept in sync across the setup units!?
 
The setup blocks can definitely communicate with each other (they have to for overlapping triangles and for rendering to be correct). There's no detail about what happens there though.
 
Ah yeah, I follow now, and yeah, that's what the whitepaper alludes to (core clock < scheduler/sampler clock) :smile:

Grrrr... :LOL:

I just read Damien's article at hardware.fr through an online translator (in perfect frengrisch). Frequencies aside, I wonder how he concluded that GF100 has only 64 texture filtering units. Granted, Anand's claim about 256 TFs doesn't make that much sense either (texture fetch rates don't necessarily mean texture filtering abilities), and since there's no shred of reliable information anywhere about the texture filtering units (which is anything but encouraging), I'm slowly starting to worry about filtering quality.
 
Fermi has 16 L/S units per SM, four of which are needed for a bilinear-filtered texel. That makes 4 tex/SM, or 64 per chip. Nothing else is said in the available information, nor has anything else been hinted at.
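
Spelled out, assuming GF100's 16 SMs:

```python
# The filtering-unit count as derived above (assumption: 16 SMs per chip).
ls_per_sm       = 16   # load/store units per SM
ls_per_bilinear = 4    # L/S units consumed per bilinear-filtered texel
sms             = 16

tex_per_sm = ls_per_sm // ls_per_bilinear   # 4
print(tex_per_sm, tex_per_sm * sms)         # 4 64
```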
 