NVIDIA Fermi: Architecture discussion

Sampler will do jittered-offset for Gather4 (no idea how, the texture-space offset is constant per call)
That obviously can't be done with the DX gather instruction ... at a guess they will simply advocate using Sample multiple times with varying offsets and point-sample filtering, something like the sketch below; they have 16 LSUs after all.
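
To make that guess concrete, here is a minimal HLSL-style sketch of such a workaround: four point-filtered samples at explicitly jittered coordinates standing in for a jittered gather. The texture, sampler, function name and jitter inputs are placeholders of mine, not anything from NVIDIA's material.

Texture2D<float> shadowMap;          // placeholder shadow map
SamplerState pointClamp;             // assumed point (nearest-neighbour) filtering

// Emulate a jittered 4-tap gather with four independent point-filtered samples.
float4 JitteredFetch4(float2 uv, float2 texelSize, float2 jitter[4])
{
    float4 taps;
    [unroll]
    for (int i = 0; i < 4; ++i)
        taps[i] = shadowMap.SampleLevel(pointClamp, uv + jitter[i] * texelSize, 0);
    return taps;
}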
 
They specifically call it "DX11 four-offset Gather4" - haven't heard that before:
"The texture units also support jittered sampling through DirectX 11’s four-offset Gather4 feature, allowing four texels to be fetched from a 128×128 pixel grid with a single texture instruction. GF100 implements DirectX 11 four-offset Gather4 in hardware, greatly accelerating shadow mapping, ambient occlusion, and post processing algorithms."
 
Would have to be SM5.1 then ... unless Microsoft introduces a new shader instruction without bumping the version.

PS. of course the DirectX documentation is a fucking mess to begin with, maybe it just fell through the cracks (a rather convenient one though in that case).
 
Would have to be SM5.1 then ... unless Microsoft introduces a new shader instruction without bumping the version.
Maybe it's just a common case that they can detect in the shader code and apply the hardware solution (possibly rewriting the shader code on the fly)?
 
You really want it to be deterministic though and not have the driver guess you wanted jittered samples when you really didn't.
 
You could go from four Sample instructions to a "4-offset gather", but when developers use the currently documented gather instruction they will expect the 2×2 quad of neighbouring texels.

Given Microsoft's glacial process in DirectX development and the complete POS documentation, I wouldn't be surprised if they say "whoops, we said this was DirectX 11 ... but really THIS is DirectX 11". Wish they would just release the reference rasterizer source code.
 
So I just read the summaries and what-not of Fermi. I'm not the most technical person on the GPU side, more a CPU person, but from looking at the slides Nvidia have built a quad-core GPU, no? Each 'core' has 128 CUDA cores and a rasterizing engine.

This diagram looks eerily similar to early dual/quad-core CPU designs from AMD and later Intel. So this chip is going to be massively parallel, right? Is this the best path for Nvidia to take? From experience with parallelism in the CPU space, it has taken 6-7 years for some of the more basic software to take advantage of having more than one core. Then again, I suppose the same questions were raised when AMD went multi-core with the Athlon.

Anyway, that is my limited understanding of what Nvidia are trying to achieve; it's probably misguided and wrong, so please correct me if that's the case. If I'm right, the design paves the way for very easy scalability: they can basically cut the chip in half and get 50% of the performance with 50% of the die size, or increase the number of 'cores' to six in the next iteration (32nm) along with general architectural improvements and efficiency gains. They could finally stop increasing the die size while still getting big performance increases, similar to what Intel/AMD do in the CPU space.

What is the real thinking, am I completely wrong?

Damn, posted into the wrong thread. I put this in the pre-release speculation instead. :(

Anyway what do you guys think?
 
You really want it to be deterministic though and not have the driver guess you wanted jittered samples when you really didn't.
Sure, but maybe there is some sort of "magic code" (i.e. a code sequence that the driver recognizes) and Nvidia is instructing developers how to exploit it, or maybe it's more of a game-by-game optimization thing where Nvidia enables support for "important" games by changing shaders (think 3DMark optimization).
 
Nathan, all GPU architectures have been relatively modular for a while, this isn't really anything new.

It's just that I compared the block diagram I linked to the latest ATi one, and the ATi design looks a lot less scalable and less like a CPU-style multicore, where adding and removing 'cores' is relatively much easier than adding SP blocks to a single 'core' and redesigning the architecture to accommodate them.

Basically what I am trying to find out is whether Nvidia could ship this with just a single GPC in a tiny package for notebooks or embedded devices, or increase the GPC count to 6/8 in the next gen. I mean, Intel came back from the dead in 2006 in part because Conroe could easily be scaled up to 4 cores for HPC and down to 1 core for notebooks, and we are still seeing that type of easy scalability in the CPU space, while it hasn't been available on the GPU side, well, until now, if Nvidia have done it.
 
Basically what I am trying to find out is whether Nvidia could ship this with just a single GPC in a tiny package for notebooks or embedded devices, or increase the GPC count to 6/8 in the next gen.
I guess that was the plan, though they can also scale by the number of SMs within each GPC.
 
Cypress chops in half quite nicely (Juniper). After that you can remove SIMD engines and memory controllers without making major changes (Redwood). Only after that does scaling become a little harder ... and by that point they've already chopped it in four too. So, as I said, it isn't really anything new.

GF100 might have an easier time scaling up though ... but that's a little less useful at the moment.
 
Well, if that was the plan and Nvidia delayed in order to execute it properly, then surely it has been worth it, as it sets up the next 2-3 generations for them with easy, linear increases in power just by increasing the number of GPCs and raising clock speeds as the process node matures. Add in general architectural improvements gen-to-gen and we will see >50% increases in performance whilst keeping the same die size and TDP, something which hasn't been achieved in the GPU space yet but has been happening for 3-4 generations in the CPU space.

The potential payoff from this single move is huge; I mean, it transformed Intel's CPU business from flagging to market leader.

Not only that, it also paves the way for a tick/tock timetable, which is important for profitability and lower power consumption.
 
If I were NVIDIA I would not be satisfied with the computational density at the moment ... other parts of their architectures have been able to compensate for this pretty well (they have always had better access to memory pools, for instance, and Fermi is again superior there). It would be dangerous to rely on that for 2-3 generations though.
 
Well, if that was the plan and Nvidia delayed in order to execute it properly, then surely it has been worth it, as it sets up the next 2-3 generations for them with easy, linear increases in power just by increasing the number of GPCs and raising clock speeds as the process node matures.

Same here, I expect at least the next two years of Nvidia hardware to be slight tweaks to Fermi. They've made a lot of changes, so they're probably not going to do too much in the near future. If anything, they may bump up the texture power per SM if that proves to be an issue. We'll probably only know for sure if they tried to do too much in one shot when we can compare it to AMD's next chip.

In terms of scalability I don't really get why Fermi should be easier than GT200. Having all the fixed-function stuff outside of the SIMDs certainly didn't hamper AMD's ability to quickly scale Cypress downward. I always figured that GT200's biggest problem was that its derivatives would be uncompetitive with G92 and its derivatives, and that has now been shown to be true (on 40nm; 55nm would've been a disaster). One potential bright spot is that a half-Fermi should be smaller than Cypress and, based on early numbers, looks like it would be competitive with the GTX 285 / HD 5850, especially in newer titles.
 
In terms of scalability I don't really get why Fermi should be easier than GT200. Having all the fixed-function stuff outside of the SIMDs certainly didn't hamper AMD's ability to quickly scale Cypress downward. I always figured that GT200's biggest problem was that its derivatives would be uncompetitive with G92 and its derivatives, and that has now been shown to be true (on 40nm; 55nm would've been a disaster). One potential bright spot is that a half-Fermi should be smaller than Cypress and, based on early numbers, looks like it would be competitive with the GTX 285 / HD 5850, especially in newer titles.
And how is that? If we take the lower of the die-size estimates, 550mm², and estimate that a half GF100 would be 0.6 times that (scaling down is not linear), then we end up with exactly the size of Cypress: 330mm². Pairing that up with 128-bit GDDR5, I don't see how it would be competitive with the 285 or 5850. It should be faster than the 5770, though.

Note: I'm assuming that a GF100 is on average faster than the 5870 by at least 40-50% for the above.
 
A 128-bit bus on a half-Fermi when G94 has a 256-bit one? Not likely. It should look something like:

256 shaders
32 TMUs
192/256-bit GDDR5
24/32 ROPs

The texturing performance may be an issue, but otherwise why wouldn't that match up with the GTX 285? And I agree, die size may be very close to Cypress or even a little higher, depending on how everything scales.
 
According to NV's presentations, even a half-Fermi could not achieve GT200 speeds, since they say "up to 2x over GT200 @ 8xAA, high resolution" for the full GF100.

Unless, of course, a GTS350 has higher clocks that would give it an edge in a direct comparison. I think the 256-bit bus would be maintained, giving it a healthy 140GB/s of bandwidth.

edit: I also wonder how they benchmarked the hair demo in the whitepaper; according to ChrisRay there's PhysX involved. So how does one run that on an RV8x0?
 
Based on the transistor density already achieved with GT215 at 40nm, which was 5.22M/mm², 575mm² should be the upper end for a 3000M-transistor GPU, assuming Nvidia doesn't cut it down and doesn't profit one bit from the larger die, which has always given better densities in the past.

AMD already achieved a 5% increase in density going from Juniper to Cypress, and based on that data I could imagine GF100 being in the range of 540mm².

Plus, half a Fermi would be 192 bits. The final performance would then depend quite a bit on the clock rates, wouldn't it?
 