NVIDIA Fermi: Architecture discussion

MfA · Jan 19, 2010

Jawed said:
Traditional Gather4 style "clumped" jittered sampling isn't as nice as a truly sparse sampling. They know that. The architectural cost to optimise for multiple, distinctly sampled, 32-bit fetches per clock is high.

32 bit is high, two independent 64 bit fetches? Probably not so much.

You say "Gather4 is just ATI's optimisation for when the data aligns within 128-bit buckets.". That is not how I would implement a texture cache ... if you store texels quad ordered in cache then you are going to be able to get 4 bilinear samples in a single go 20% of the time, the same amount of time your needed texels will be completely non contiguous. I don't like them odds.
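
As a rough check on those odds, here is a minimal sketch that enumerates where a 2x2 bilinear footprint can land relative to aligned 2x2 quads. It assumes the footprint's top-left texel is uniformly distributed; the function names are made up for illustration. Under that assumption the footprint sits in a single quad 1/4 of the time and is scattered over four quads 1/4 of the time, in the same ballpark as the rough 20% quoted above.

```python
from collections import Counter
from fractions import Fraction


def quads_touched(x, y, quad=2):
    """Count the aligned quad x quad blocks covered by the 2x2 footprint
    whose top-left texel is (x, y)."""
    cols = {x // quad, (x + 1) // quad}
    rows = {y // quad, (y + 1) // quad}
    return len(cols) * len(rows)


def footprint_odds(quad=2):
    """Map 'number of quads touched' to its probability, assuming the
    footprint origin is uniformly distributed over one quad period."""
    counts = Counter(
        quads_touched(x, y, quad) for x in range(quad) for y in range(quad)
    )
    total = quad * quad
    return {n: Fraction(c, total) for n, c in sorted(counts.items())}


if __name__ == "__main__":
    for n, p in footprint_odds().items():
        print(f"{n} quad(s): {p}")
```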

Jawed · Jan 19, 2010

MfA said:
You say "Gather4 is just ATI's optimisation for when the data aligns within 128-bit buckets.". That is not how I would implement a texture cache ... if you store texels quad ordered in cache then you are going to be able to get 4 bilinear samples in a single go 20% of the time, the same amount of time your needed texels will be completely non contiguous. I don't like them odds.

These are 32-bit shadow buffer texels, typically. Not yer regular "8bpc" albedo format.

Now there is multi-channel support in gather4, too.

Jeremy Shopf was (still is?) at AMD and seemed aware of the concept in March last year:

http://www.jshopf.com/blog/?tag=dx11

Gather4 improvements: Specify which channel of a multi-channel texture to fetch from. Can also use programmable offsets

EDIT: Whoops, forgot that his Mixed Resolution Rendering technique:

http://www.jshopf.com/files/GDC09_Shopf_Mixed_Resolution_Rendering.ppt

does some fun stuff to avoid inefficiencies in sparse sampling

Jawed

Mintmaster · Jan 19, 2010

I don't think this has to do with data alignment as Jawed is suggesting. It has to do with the way texels are fetched. The hardware is already programmed to figure out the four nearest texels to a (u,v) FP texture address, and those four texels are communicated as a single location i (the other three locations are i+1, i+p, i+p+1, p being pitch) throughout the pipeline. Fetch4 is a way to keep most of this pipeline intact but add a little flexibility and throughput to the way data is fed to the shader processors.
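
The i, i+1, i+p, i+p+1 addressing can be sketched as follows. This is only an illustration: it assumes a linear, row-major texture with the pitch counted in texels and texel centres at half-integer coordinates, whereas real hardware tiles and swizzles its memory. The function name and parameters are hypothetical.

```python
import math


def bilinear_footprint(u, v, width, height, pitch):
    """Return the linear indices (i, i+1, i+p, i+p+1) of the four texels
    nearest to normalized (u, v), plus the fractional weights (fx, fy).

    Assumes a linear row-major layout with `pitch` in texels and no
    wrap/clamp handling: the footprint must lie fully inside the texture.
    """
    # Shift by half a texel so that integer coordinates are texel centres.
    x = u * width - 0.5
    y = v * height - 0.5
    x0, y0 = math.floor(x), math.floor(y)
    if not (0 <= x0 < width - 1 and 0 <= y0 < height - 1):
        raise ValueError("footprint leaves the texture; no addressing mode modelled")
    i = y0 * pitch + x0
    return (i, i + 1, i + pitch, i + pitch + 1), (x - x0, y - y0)


if __name__ == "__main__":
    print(bilinear_footprint(0.5, 0.5, width=4, height=4, pitch=4))
```

Only the single index i has to travel down the pipeline; the other three follow from the pitch.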

Fermi goes the extra mile with point sampling because it's useful for GPGPU. There's no doubt that it adds gobs of complexity to addressing, caches, and the memory controller due to the extra granularity. It's only marginally useful for gaming, but for GPGPU it can quadruple (or does Fermi only double it?) fetch speed which can be a huge boon for navigating data structures.

Gipsel · Jan 19, 2010

Mintmaster said:
Fermi goes the extra mile with point sampling because it's useful for GPGPU.

Does that make much sense considering that for GPGPU loads you are probably using a different L1 cache? The L/S units can directly access the L1/shared memory without going through the texture units when you need unfiltered data for that purpose.

MfA · Jan 19, 2010

Jawed said:
These are 32-bit shadow buffer texels, typically. Not yer regular "8bpc" albedo format.

32 bit per sample texels are not yer regular 32 bit per sample texels? (From the texture cache point of view.)

Jawed said:
Now there is multi-channel support in gather4, too.

Which are in the documents.

Jawed said:
Jeremy Shopf was (still is?) at AMD and seemed aware of the concept in March last year

What you mean the plural? When he was talking about multiple new functions in the first place? Man, I thought my arguments were thin

Jawed · Jan 19, 2010

Mintmaster said:
I don't think this has to do with data alignment as Jawed is suggesting. It has to do with the way texels are fetched. The hardware is already programmed to figure out the four nearest texels to a (u,v) FP texture address, and those four texels are communicated as a single location i (the other three locations are i+1, i+p, i+p+1, p being pitch) throughout the pipeline. Fetch4 is a way to keep most of this pipeline intact but add a little flexibility and throughput to the way data is fed to the shader processors.

We have to distinguish between typical albedo textures in some compressed form and "linear resources" such as 32-bit single channel shadow buffers that were generated on the GPU.

Also need to distinguish between logical texel/texel-quad address and video memory address (varies with bits per texel) and L1 cache physical address (L1 consisting of uncompressed texels in all cases).

The cache rate is optimised for 128-bit bucket fetches.

Mintmaster said:
Fermi goes the extra mile with point sampling because it's useful for GPGPU. There's no doubt that it adds gobs of complexity to addressing, caches, and the memory controller due to the extra granularity. It's only marginally useful for gaming, but for GPGPU it can quadruple (or does Fermi only double it?) fetch speed which can be a huge boon for navigating data structures.

This is point sampling in Fermi's texturing pipeline, though, I believe.

Jawed

Jawed · Jan 19, 2010

MfA said:
32 bit per sample texels are not yer regular 32 bit per sample texels? (From the texture cache point of view.)

You're talking about quad-alignments/packing of 32-bit texels and "how to design a texture cache", and I'm saying the optimisations you're thinking of are centred on DXT formats, not 32-bit shadow buffer texels.

And it's quite clear that fetch4, as originally implemented, was making use specifically of the hardware's native 128-bit bucketing.

MfA said:
What you mean the plural? When he was talking about multiple new functions in the first place? Man, I thought my arguments were thin

Your quoting has apparently gone all jittery, I presume you're referring to "Can also use programmable offsets ". Well, you could always email him.

The presentation I linked is further proof that sparse fetching is something to avoid on ATI (though again hardly earth-shattering). I guess you'd still use his technique on Fermi, because it should be faster, even though GF100 is twice as fast as ATI at disparate gather4.

Jawed

KimB · Jan 19, 2010

Mintmaster said:
However, they dumped this strategy from G80 onwards. If R600 wasn't so late and poorly optimized (just think about how much more powerful ATI could have made a 700M transistor GPU with 100 GB/s using today's IP), NVidia may not have felt like they had the freedom to keep going down this road of innovation.

Well, part of this is just because with DX11, they have no way to expose alternative features. So they had two choices before them: they could just barely support DX11 and produce a part that was as high-performing as possible in current games. Or they could produce a card with differences other than simply raw performance to differentiate from their competitor.

When put in those terms, it doesn't sound to me like too much of a difference in strategy. nVidia has always been very active in producing GPUs that have features their competitors don't have, instead of just going for raw performance. But now, with DX11 allowing nVidia hardly any checkbox features that their competitors won't have, they had to find another way to add value besides going for raw performance. I mean, they obviously had PhysX and CUDA, but those are old news, and it wouldn't make sense to add yet another proprietary API.

fellix · Jan 19, 2010

Some alternative GF100 architectural diagrams:

http://www.overclockers.ru/images/news/2010/01/19/jp1.jpg
http://www.overclockers.ru/images/news/2010/01/19/jp2.jpg
http://www.overclockers.ru/images/news/2010/01/19/jp3.jpg

MfA · Jan 20, 2010

Jawed said:
You're talking about quad-alignments/packing of 32-bit texels and "how to design a texture cache", and I'm saying the optimisations you're thinking of are centred on DXT formats, not 32-bit shadow buffer texels.

My point was that you don't store quads of texels in those 128 bit chunks when you aren't using compressed formats because the odds of getting hits which exactly fit the necessary alignment are so poor. You are better off simply storing it in good old simple lines ... in which case two 64 bits accesses should be no more expensive than a 128 bit quad access.

Jawed · Jan 20, 2010

MfA said:
My point was that you don't store quads of texels in those 128 bit chunks when you aren't using compressed formats because the odds of getting hits which exactly fit the necessary alignment are so poor. You are better off simply storing it in good old simple lines ... in which case two 64 bits accesses should be no more expensive than a 128 bit quad access.

Well, that's how the hardware currently works :shrug:

Fetch4 originally was a way to get quads from a single-channel texture into the ALUs:

http://www.beyond3d.com/content/reviews/2/4
http://forum.beyond3d.com/showthread.php?t=27581
http://developer.amd.com/gpu_assets/Advanced DX9 Capabilities for ATI Radeon Cards.pdf
http://developer.amd.com/media/gpu_assets/Isidoro-ShadowMapping.pdf

Fundamentally I think this is a cache-line mapping problem. Quads are aligned to cache lines - even if the hardware was addressing at twice the rate it currently does, it'd be fetching simultaneously from distinct cache lines. I presume DXT formats are all neatly aligned with cache lines, hence the highest texel rates. EDIT: hmm, that isn't a complete answer...

Jawed

MfA · Jan 20, 2010

Regardless of how you store things you are always going to be throwing away heaps of data for sparse accesses even when accessing quads, at least if your ports are wider than 32 bits ... hell, almost half (7/16ths) of quad accesses in 16 pel DXTC formats straddle borders, so even there a single access only gets you so far.
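
The 7/16 figure can be reproduced with a short sketch, assuming 4x4-texel blocks (16 pels), a 2x2 footprint, and a footprint origin uniformly distributed over the block. The function name is hypothetical.

```python
from fractions import Fraction


def straddle_fraction(block=4, footprint=2):
    """Fraction of footprint x footprint accesses that cross the border of
    an aligned block x block tile, assuming a uniformly distributed origin."""
    if footprint > block:
        raise ValueError("footprint must not be larger than the block")
    # Start offsets along one axis that keep the footprint inside the block.
    fits_1d = block - footprint + 1
    fits_2d = Fraction(fits_1d * fits_1d, block * block)
    return 1 - fits_2d


if __name__ == "__main__":
    print(straddle_fraction())         # 4x4 blocks
    print(straddle_fraction(block=2))  # 2x2 quads, for comparison
```

Along each axis 3 of the 4 start offsets keep the footprint inside the block, so 9/16 of accesses fit and the remaining 7/16 straddle a border. With 2x2 quads the straddling share rises to 3/4.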

How the hardware currently works is not at issue. It was my contention that 64 bit granularity accesses could be supported at little extra cost ... and given that NVIDIA is pushing 4 offset gathers it would make sense for ATI to do that in future hardware (given the knowledge that they are valuable).

PS. I think just how the texture caches work exactly in modern architectures is one of the most oversimplified things in the diagrams etc. ... almost as bad as the extent to which parameter passing between shaders is glossed over. Is the cache banked? Are there coalescing stages?

DemoCoder · Jan 20, 2010

Jaaanosik said:
Wouldn't it make sense to parametrize the level of tessellation based on the card? If yes then where is the problem?

Yep, you could do that, but it's yet another knock against Charlie's spin. Supporting different levels of tessellation performance is the least of the developer's problems these days.

Razor1 · Jan 20, 2010

Jaaanosik said:
Wouldn't it make sense to parametrize the level of tessellation based on the card? If yes then where is the problem?

Yes, that is what I've been saying, but assuming that lower-end Fermi derivatives won't be able to handle the same level of tessellation as equivalent AMD products, just because they will have fewer tessellation units, is a conclusion that can't be drawn yet. As DemoCoder and I have been saying, shader performance also affects how much tessellation can take place.

DemoCoder · Jan 20, 2010

The real difference will come down to whether or not the fixed function units on Cypress/Fermi can output more than one (u,v) per clock.

Jaaanosik · Jan 20, 2010

Razor1 said:
Yes, that is what I've been saying, but assuming that lower-end Fermi derivatives won't be able to handle the same level of tessellation as equivalent AMD products, just because they will have fewer tessellation units, is a conclusion that can't be drawn yet. As DemoCoder and I have been saying, shader performance also affects how much tessellation can take place.

That's understandable. If lower Fermi parts provide 70%-80% of equivalent AMD parts in tessellation then it's still much better than none. Only time will tell how they compare.

Razor1 · Jan 20, 2010

DemoCoder said:
The real difference will come down to whether or not the fixed function units on Cypress/Fermi can output more than one (u,v) per clock.

True

3dcgi · Jan 20, 2010

Razor1 said:
I think it's too early to say there is a scaling problem. In my experience with tessellation, mid-range and lower-end cards really won't have the horsepower overall to do major amounts of tessellation anyway. The problem will probably be solved with adaptive tessellation with some type of LOD, so that tessellation amounts are dropped on less powerful graphics cards, and Fermi derivatives might be well balanced. Again, that's just speculation, but it's too early to give a definitive answer.

I agree with the basic premise of your post as developers can always have a knob to limit the highest tessellation level, but I question what experience can be telling you anything about how mid range and low end cards perform. There are probably only a handful of developers that have played with ATI's entire range of cards and none that have Nvidia's range so I'd say no one can claim to be experienced.
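
Such a knob could look roughly like the sketch below. It is an illustration only: the profile type, its fields and the example numbers are hypothetical, and the only hard fact used is the Direct3D 11 upper limit of 64 on tessellation factors.

```python
from dataclasses import dataclass

# Direct3D 11 caps tessellation factors at 64.
D3D11_MAX_TESS_FACTOR = 64.0


@dataclass
class TessProfile:
    """Hypothetical per-card settings, e.g. chosen during QA."""
    max_factor: float       # highest tessellation factor allowed on this card
    pixels_per_edge: float  # target length of a tessellated edge in pixels


def edge_tess_factor(edge_len_px, profile):
    """Adaptive factor for one patch edge, limited by the card's profile."""
    wanted = edge_len_px / profile.pixels_per_edge
    return max(1.0, min(wanted, profile.max_factor, D3D11_MAX_TESS_FACTOR))


if __name__ == "__main__":
    high_end = TessProfile(max_factor=64.0, pixels_per_edge=8.0)
    low_end = TessProfile(max_factor=16.0, pixels_per_edge=8.0)
    for name, profile in (("high end", high_end), ("low end", low_end)):
        print(name, edge_tess_factor(320.0, profile))
```

The same patch edge then gets a factor of 40 on the first profile and is clamped to 16 on the second, so the content stays identical and only the cap differs per card.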

3dcgi · Jan 20, 2010

DemoCoder said:
The real difference will come down to whether or not the fixed function units on Cypress/Fermi can output more than one (u,v) per clock.

Why do you think that will matter much? With reuse the ratio of (u,v) to prims should be pretty close to one.

Razor1 · Jan 20, 2010

3dcgi said:
I agree with the basic premise of your post as developers can always have a knob to limit the highest tessellation level, but I question what experience can be telling you anything about how mid range and low end cards perform. There are probably only a handful of developers that have played with ATI's entire range of cards and none that have Nvidia's range so I'd say no one can claim to be experienced.

Hmm, that is true, but before a game comes out with advanced tessellation features, it's going to go through QA; this is where most of the profiles are set. At least, this is the way I used to set up projects when I was in the game industry.
