NVIDIA Fermi: Architecture discussion

It's just that when I compared the block diagram I linked to the latest ATi one, this looks a lot less scalable and less like a CPU-style multicore, where adding and removing 'cores' is relatively much easier than adding SP blocks to a single 'core' and redesigning the architecture to accommodate them.

Basically, what I am trying to find out is whether Nvidia could ship this with just a single GPC in a tiny package for notebooks or embedded devices, or increase the GPC count to 6 or 8 in the next generation. I mean, Intel came back from the dead in 2005 in part because Conroe could easily be scaled up to 4 cores for HPC and down to 1 core for notebooks, and we are still seeing that type of easy scalability in the CPU space, while it hasn't been available on the GPU side, well, until now, if Nvidia have done it.

That's not entirely accurate.

The term "core" is a bit fuzzy at the moment, and PR people (from all sides) are doing their utmost to twist it out of shape. If you take the historically accepted definition of core, then 1 SM in Fermi is like one module of Bulldozer (upcoming AMD cpu's). IOW, 1 SM in fermi is almost like 2 tightly-coupled "classic" cores. And one module of bulldozer is sorta like 2 cores. So, I'd say GF100 has 16 modules or 32 cores.
 
Based on the transistor density already achieved with GT215 at 40nm, which was 5.22M/sqmm, 575 sqmm should be the upper end for a 3000M-transistor GPU, assuming Nvidia doesn't cut it down and doesn't profit one bit from a larger die, which has always given better densities in the past.

AMD already achieved a 5% increase in density going from Juniper to Cypress, and based on that data I could imagine GF100 being in the range of 540 sqmm.
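
(Back-of-envelope on those numbers: 3000M transistors at 5.22M/sqmm works out to 3000 / 5.22 ≈ 575 sqmm, and with a density gain of 5-6% on top of that you land at roughly 3000 / 5.5 ≈ 545 sqmm, so the two estimates are consistent.)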

Plus, half a Fermi would be 192 bits. The final performance would then depend quite a bit on the clock rates, wouldn't it?

Half a Fermi will have a lot of power headroom and should have much less intra-die variation, allowing for higher clocks. At GF100 clocks, though, it'll get totally smoked by Cypress.
 
A 128-bit bus on a half-Fermi when G94 has a 256-bit one? Not likely. It should look something like:

256 shaders
32 TMUs
192/256-bit GDDR5
24/32 ROPs

The texturing performance may be an issue, but otherwise why wouldn't that match up with the GTX 285? And I agree, die size may be very close to Cypress or even a little higher, depending on how everything scales.
One of the preview pieces mentioned a half Fermi with a 128-bit memory bus; I borrowed that. However, CarstenS's 192-bit figure looks more appropriate.

Even then the comparison would not be something I'd call competitive; remember, the 5850 is the salvage part. Sure, it's better than the GTX 260 and 4870, but still not ideal.
 
They specifically call it "DX11 four-offset Gather4"; I haven't heard that before:
"The texture units also support jittered sampling through DirectX 11’s four-offset Gather4 feature, allowing four texels to be fetched from a 128×128 pixel grid with a single texture instruction. GF100 implements DirectX 11 four-offset Gather4 in hardware, greatly accelerating shadow mapping, ambient occlusion, and post processing algorithms."

I haven't either. The DX11 SDK doesn't go into much detail there. Nvidia told me there should be an instruction for the DX11 four-offset Gather4 at the user level, but it's passed to the driver as 4 simple instructions with a single offset each, or I guess written directly as these 4 simple instructions. Their compiler then reassembles them into a native four-offset Gather4 running at half speed. They considered implementing it at full speed at first, but the 4 independent x/y calculators required were too expensive, so they cut them to 2 later in the design. That still means a nice 2x boost.
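
Roughly, I'd picture it looking like this at the HLSL level (a minimal sketch only; the texture and sampler names are made up, and the four-offset GatherRed overload is the one probed later in the thread):
Code:
// Hypothetical shadow-map fetch using the DX11 four-offset Gather4.
Texture2D    g_txShadow : register( t0 );   // illustrative name
SamplerState g_samPoint : register( s0 );   // illustrative name

float4 PS_ShadowTaps( float2 uv : TEXCOORD0 ) : SV_Target
{
	// One four-offset GatherRed at the user level...
	float4 taps = g_txShadow.GatherRed( g_samPoint, uv,
	                                    int2(-1,-1), int2( 1,-1),
	                                    int2(-1, 1), int2( 1, 1) );
	// ...which, per Nvidia, reaches the driver as four single-offset
	// gathers (one per distinct offset) that their compiler fuses back
	// into a native four-offset gather running at half rate.
	return taps;
}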
 
If I were NVIDIA I would not be satisfied with the computational density at the moment ... other parts of their architectures have been able to compensate for this pretty well (they have always had better access to memory pools, for instance, and Fermi is again superior there). It would be dangerous to rely on that for 2-3 generations, though.

No, I wouldn't recommend that to Nvidia either; resting on one's laurels never works out, see Intel NetBurst, AMD Thunderbird, etc. What I said is that they now have the general structure of future chips sorted: increasing and decreasing the number of GPCs is how they should vary their performance levels in each generation.

What I was trying to get across was that a tick/tock approach allows Nvidia to remain competitive: they can make small but significant architectural updates in the tick and design the tock as a complete overhaul. If the performance increase from GF100 -> GF150(?) is ~20-30% (given how poor yields are on GF100, it will be clocked lower and have a number of disabled SPs), Nvidia could also fold architectural improvements into that smaller jump in the areas where they are weak, and get the fabrication issues worked out in GF150 on the 32nm node. That would make it much easier to release GF200 with a 100% performance increase over GF100, with high yields and high clocks.

Intel have shown how well tick/tock works in the CPU space; AMD/Nvidia would be crazy not to follow it in the GPU market. Obviously there are a number of other issues at hand, as Intel have their own foundries whereas AMD/Nvidia rely on third parties like TSMC and GlobalFoundries, but the idea makes sense.
 
In my opinion, the tick/tock scheme Intel is applying to their roadmaps now works better for CPUs than for GPUs, the reason being that the former's general architecture changes much less frequently.
 
I haven't either. The DX11 SDK doesn't go into much detail there. Nvidia told me there should be an instruction for the DX11 four-offset Gather4 at the user level, but it's passed to the driver as 4 simple instructions with a single offset each, or I guess written directly as these 4 simple instructions. Their compiler then reassembles them into a native four-offset Gather4 running at half speed. They considered implementing it at full speed at first, but the 4 independent x/y calculators required were too expensive, so they cut them to 2 later in the design. That still means a nice 2x boost.

Thanks Damien. That sounds pretty interesting and, more importantly, in line with what they claim the expected performance to be. :)
 
Half a Fermi will have a lot of power headroom and should have much less intra-die variation, allowing for higher clocks.
If it ends up working like that, it would be truly ironic if NVIDIA has succeeded in using a high-bandwidth interconnect for non-AFR parallel rendering ... they would end up doing what ATI never quite could (due to needing AFR at the highest end), disproving the validity of their own strategy.
 
I haven't either. The DX11 SDK doesn't go into much detail there.
Hmm?

Code:
TemplateType Gather(
  sampler s,
  float2 location,
  int2 offset
);

float4 GatherCmp(
  sampler s,
  float2 location,
  float compare_value,
  int2 offset
);

These are the two new gather HLSL instructions in the DirectX 11 documentation (ignoring the variations which work on color components). Plenty of detail.

The instruction NVIDIA alludes to is simply not there, which is not quite the same as not going into detail.
 
  • Sampler runs at scheduler clock (half the hot clock)
  • 4 samplers per cluster (64 total)
  • Sampler will do jittered-offset for Gather4 (no idea how, the texture-space offset is constant per call)
  • 4 tris/clock setup and raster
  • Raster area per unit is now 2x4 rather than 2x16
  • PolyMorph Engine (heh), effectively pre-PS FF, one per cluster
  • ROPs now each take 24 coverage samples (up from 8)
  • Compression is improved, 4x->8x delta drop is less than GT200 clock-for-clock
  • Display engine improvements

That's the list of the stuff I either got wrong or missed in my article at TR, concerning the graphics. Biggest thing is probably the > 1tri/clk for small triangles, and the change in the per-clock rasterisation area for each of the four units. Aggregate setup and rasterisation performance is no faster per clock than G80+ for triangles that are > 32 pixels.
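
(In other words, four rasterisers at 2x4 = 8 pixels per clock each is 32 pixels per clock in aggregate, the same as a single 2x16 = 32-pixel rasteriser before, so the gain only shows up when triangles are small enough that a single rasteriser couldn't cover them efficiently anyway.)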

Sampler count was out by 2x, so NV will need a > 1.6 GHz hot clock to beat a GTX 285 in peak possible texture performance, and there's a distinct lack of information about the sampler hardware in the latest whitepaper. Doing more digging there, but it looks like no change to texturing IQ other than the ability to jitter the texcoords per sample during an unfiltered fetch.
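
(Quick sanity check on that figure, assuming the samplers really do run at half the hot clock: a GTX 285 filters 80 texels per clock at 648 MHz, about 51.8 Gtexels/s, while 64 samplers at half the hot clock give 32 x hot clock, so the hot clock needs to be above roughly 51.8 / 32 ≈ 1.62 GHz just to match it.)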

NV claim that everything they list in the PolyMorph block exists as a physical block in the silicon. The obviously interesting thing there that didn't exist before is the tessellator; it seems the fixed block is responsible for generating the new primitives (or killing geometry, too), with the units running in parallel where possible and most other work running on the SM.

As for my clock estimates, I doubt a 1700 MHz hot clock at launch (sadly), but the base clock should be usefully higher, up past 700 MHz. They still haven't talked about GeForce productisation or clocks, but at this point it looks unlikely that the fastest launch GeForce will texture faster than a GTX 285.

That's about it, will have an article up ASAP.
I think the half-clock thing applies to the TAs (AF, border mode, wrap-around, address calculation), while the load/store units and the four bilinear filters both run at the full hot clock.

That would give an effective TA:TF ratio of 1:2, like in G80: four address units at half the hot clock can feed only two addresses per hot clock to the four filters.
 
Hmm?

Code:
TemplateType Gather(
  sampler s,
  float2 location,
  int2 offset
);

float4 GatherCmp(
  sampler s,
  float2 location,
  float compare_value,
  int2 offset
);

These are the two new gather HLSL instructions in the DirectX 11 documentation (ignoring the variations which work on color components). Plenty of detail.

The instruction NVIDIA alludes to is simply not there, which is not quite the same as not going into detail.

I agree; I've read that part of the doc, of course. Nvidia told me they will check with MS why the four-offset Gather4 wasn't described.
 
If it's not described, then how are developers supposed to know it exists? This sounds more like an optimization for point sampling than for any multiple-offset Gather instruction.
 
I agree; I've read that part of the doc, of course. Nvidia told me they will check with MS why the four-offset Gather4 wasn't described.
If Microsoft comes back with the answer "oops, this should be in there too", it would be due diligence to ask AMD whether they were ever made aware of it ahead of time. If not, then I personally will stake my bet on NVIDIA having already received the XBOX720 contract :)
 
Well, the compiler does indeed swallow 4-offset versions of the gather instructions ...

Ugh, it does reek awfully of dirty pool ... it would be nice to get a statement from AMD on this.
 
Can you post the HLSL code? I'd like to see the instruction stream that gets generated. Thanks!
 
I didn't really run anything practical; I just ran the BasicHLSL11 sample with these changes:
Code:
//SamplerState g_samLinear : register( s0 );
SamplerState g_samPoint : register( s0 ) {filter = MIN_MAG_MIP_POINT;};

...

//	float4 vDiffuse = g_txDiffuse.Sample( g_samLinear, Input.vTexcoord );
	int2 offset = int2(1, 1);
	float4 vDiffuse = g_txDiffuse.GatherRed( g_samPoint, Input.vTexcoord, offset, offset, offset, offset );

It won't run because the framebuffer formats are all wrong for gather, but the compiler doesn't know that ... if you use a number of offsets different from 1 or 4, you get an error like this from the compiler:

none(46,32): error X3013: 'GatherRed': intrinsic method does not take 7 parameters
none(46,32): error X3013: Possible intrinsic methods are:
none(46,32): error X3013: Texture2D.GatherRed(SamplerState, float2)
none(46,32): error X3013: Texture2D.GatherRed(SamplerState, float2, int2)
none(46,32): error X3013: Texture2D.GatherRed(SamplerState, float2, int2, int2, int2, int2)

PS: Less lazy people than myself, with more DirectX coding experience, might want to run it with an appropriate texture format (and pixel shader profile, which is set to 4.1 now, although the compiler seems content with SM 5.0 gather instructions anyway) to see how the reference rasterizer handles the 4-offset gather.
 
I didn't really run anything practical; I just ran the BasicHLSL11 sample with these changes:
Thanks! Just so you know, your example compiles to:
Code:
gather4_indexable(1,1,0)(texture2d)(float,float,float,float) r1.xyzw, v1.xyxx, t0.xyzw, s1.x
That's because you used the same offset for all samples. However, if you do something like:
Code:
	int2 offset = int2(1, 1);
	int2 offset2 = int2(-1,0);
	float4 vDiffuse = g_txDiffuse.GatherRed( g_samPoint, Input.vTexcoord, offset, offset2, offset, offset );
Then you get something different:
Code:
gather4_indexable(1,1,0)(texture2d)(float,float,float,float) r1.xzw, v1.xyxx, t0.xyzw, s1.x
gather4_indexable(-1,0,0)(texture2d)(float,float,float,float) r0.y, v1.xyxx, t0.xyzw, s1.x
 
Interesting. So it decomposes into multiple instructions, one per distinct offset. Makes sense I guess.

I don't see why the presence of this function would have influenced AMD's design either way, though. Having 4 different offsets seems like a practical use case, so if they wanted to optimize for it they would have done so regardless.
 
Half a Fermi will have a lot of power headroom and should have much less intra-die variation, allowing for higher clocks. At GF100 clocks, though, it'll get totally smoked by Cypress.

I don't know about that. The GTX 285 is less than half of what a full Fermi will be, and it's 30-35% behind Cypress (5870) and within 5-10% of the 5850. Half a Fermi would have 256 SPs, GDDR5 on probably a 256-bit bus, and 24 ROPs.
 
One silicon-level architecture question for you guys: given that the 'PolyMorph engines' are all interconnected across large expanses of the die and need to be kept in sync, does anyone want to hazard a guess at how that affects clock scaling? :)

-Charlie
 