NVIDIA Fermi: Architecture discussion

If Microsoft comes back with the answer "oops, this should be in there too", it would be due diligence to ask AMD whether they were ever made aware of that ahead of time. If not, then I'll personally stake my bet on NVIDIA having already received the XBOX720 contract :)

What a bad tease :(

But they have always been naughty like that: when NVIDIA didn't play ball, they grabbed them by the balls and partially made R300 the runaway success it was.

What are the chances that the actual feature doesn't exist in the DX11 specification, and that the feature they implemented here is an intercept which nets the same result anyway?
 
Just read the whitepaper, and it sure looks innovative, with great improvements in areas that are IMO key to pushing graphics further both in games and professionally. Glad I waited, as a GTX380 will be the deal for me. Pretty much sure about this.
 
What are the chances that the actual feature doesn't exist in the DX11 specification, and that the feature they implemented here is an intercept which nets the same result anyway?
Read the bit about the code: it exists in HLSL (but not in the docs or in the assembler/refrast, which is relevant, since the assembler/refrast are the more strictly documented parts because drivers are written against them).
 
It's a good instruction, with an underlying texture cache better suited to point sampling than the one in Evergreen.
Since it's apparently only using L/S, I'd assume it wouldn't go through the texture cache at all, but rather use the (larger) L1/shared-memory pool.
 
Well, you've completely lost me now :) Isn't the single-offset gather4 instruction going to return four point samples from a texel quad, which defeats the whole reason for jittering the samples in the first place?
I'm not sure, but I think MfA is saying that on ATI hardware you would fetch, say, all 16 samples from a 4x4 footprint using four gather4 instructions, but without gather4 you would use four jittered point sample instructions.

I think NVidia is saying that they can do the latter in one or two instructions, or equivalently they can gather 8-16 jittered point samples with the same four instructions. That wouldn't make a difference in a 4x4 sampling footprint, but it would if your samples were farther apart.
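
A rough HLSL sketch of the two forms being discussed (the texture, sampler and offset values are just placeholders): plain Gather always returns the 2x2 texel quad around the sample position, while the Shader Model 5 GatherRed overload with four independent offsets lets each of the four point samples be jittered on its own, in a single call.

[code]
Texture2D<float> shadowMap : register(t0);
SamplerState     pointSamp : register(s0);

float4 GatherVariants(float2 uv)
{
    // Classic gather4: four point samples, but always the 2x2 texel quad around uv.
    float4 quad = shadowMap.Gather(pointSamp, uv);

    // SM5 four-offset gather: four point samples at independently chosen integer
    // texel offsets, expressed as a single HLSL call.
    float4 jittered = shadowMap.GatherRed(pointSamp, uv,
                                          int2(-3, -1), int2(2, -3),
                                          int2(-1,  2), int2(3,  3));

    return 0.5f * (quad + jittered);   // just so both results are used
}
[/code]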
 
More samples are better even if they aren't ideally distributed, and on ATI you get the extra samples from a quad virtually for free with gather4, so you should simply try to make use of them ... the optimal algorithms for both architectures are neither here nor there, though.
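
For concreteness, here's a minimal sketch of the gather4-friendly approach described above (names are illustrative): the full 4x4 footprint is pulled in with four offset Gather calls, so the "extra" samples in each 2x2 quad come along essentially for free on hardware that fetches 128 bits per address anyway.

[code]
Texture2D<float> shadowMap : register(t0);
SamplerState     pointSamp : register(s0);

// Fetch all 16 texels of the 4x4 footprint around uv with four gather4-style calls.
// Assuming uv sits near the centre of the footprint, the four offset 2x2 quads tile
// the 4x4 block without gaps or overlap.
void Fetch4x4(float2 uv,
              out float4 q0, out float4 q1, out float4 q2, out float4 q3)
{
    q0 = shadowMap.Gather(pointSamp, uv, int2(-1, -1)); // upper-left quad
    q1 = shadowMap.Gather(pointSamp, uv, int2( 1, -1)); // upper-right quad
    q2 = shadowMap.Gather(pointSamp, uv, int2(-1,  1)); // lower-left quad
    q3 = shadowMap.Gather(pointSamp, uv, int2( 1,  1)); // lower-right quad
}
[/code]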

What NVIDIA is saying is that there is an instruction in HLSL which up to this point has remained hidden; if you know it exists, you can see how it compiles down at the assembler level and design hardware to make it run efficiently. Knowing it exists is a rather important step, though; without that knowledge you simply wouldn't expect those kinds of assembly instructions. They make absolutely no sense on the original hardware that gather4 came from (HD3/4/5, where you will just take all the samples).
 
I still don't get it: why is there a dedicated tessellator for each SM?

Good question, I've been wondering the same myself. It could be that each tessellator is assigned a different screen tile to minimize inter-SM communication. So some of them would see less work than others in a given frame, but the peak throughput for tessellation-heavy scenes would be high.

I still don't know why that matters if you can only cull/setup/rasterize 4 triangles per clock, though. Maybe there's something they're doing in the GS to discard primitives before they even get to the setup/rasterization stages.
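
In case it helps picture what "discarding primitives in the GS" could look like, here's a hedged sketch (struct and function names are made up) of a geometry shader that drops back-facing triangles so they never reach the setup/rasterization stages, under an assumed winding convention.

[code]
struct GSIn
{
    float4 pos : SV_Position;
};

[maxvertexcount(3)]
void CullGS(triangle GSIn tri[3], inout TriangleStream<GSIn> outStream)
{
    // Signed area of the triangle after the perspective divide
    // (ignoring the w <= 0 corner cases a real implementation would handle).
    float2 a = tri[0].pos.xy / tri[0].pos.w;
    float2 b = tri[1].pos.xy / tri[1].pos.w;
    float2 c = tri[2].pos.xy / tri[2].pos.w;
    float area = (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);

    // Only emit triangles considered front-facing; the rest are dropped here.
    if (area > 0.0f)
    {
        outStream.Append(tri[0]);
        outStream.Append(tri[1]);
        outStream.Append(tri[2]);
    }
}
[/code]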

I think NVidia is saying that they can do the latter in one or two instructions, or equivalently they can gather 8-16 jittered point samples with the same four instructions. That wouldn't make a difference in a 4x4 sampling footprint, but it would if your samples were farther apart.

Right, but my question is whether or not developers are actually using point sampling. If they are, then AMD could certainly have implemented this optimization in hardware and simply grouped individual point-sampling requests together. MfA is arguing that the reason they didn't consider it is that they may not have known about the four-offset gather HLSL instruction, but I'm not convinced.
 
What NVIDIA is saying is that there is an instruction in HLSL which up to this point has remained hidden; if you know it exists, you can see how it compiles down at the assembler level and design hardware to make it run efficiently. Knowing it exists is a rather important step, though; without that knowledge you simply wouldn't expect those kinds of assembly instructions. They make absolutely no sense on the original hardware that gather4 came from (HD3/4/5, where you will just take all the samples).
As you've been shown, ATI handles the individually-offset gather fine. Gather4 is just ATI's optimisation for when the data aligns within 128-bit buckets.

ATI always samples 128 bits at a time. If you choose to throw away 96 bits, then so be it. The hardware will not bother loading those 96 bits into registers if you choose not to use them, but the memory transaction is still 128 bits.

The compiler should be able to coalesce distinct fetches when it sees that they use a common sample address, resulting in a single 128-bit fetch rather than several 32-bit fetches. That's very much dependent on how the code is written, though, and it wouldn't apply when the developer chooses to use nicely jittered samples, which average no more than 32 bits of data per 128-bit sampling address.
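
A small HLSL illustration of the two cases (texture and function names are placeholders, and the format is assumed to be a 128-bit one such as RGBA32F): in the first function all four values share one sample address, so a compiler could in principle fold them into a single 128-bit fetch; in the second, each jittered tap has its own address and only ever wants 32 bits of each 128-bit transaction.

[code]
Texture2D<float4> dataTex   : register(t0);
SamplerState      pointSamp : register(s0);

// Four fetches at a common address: coalescable into one 128-bit fetch.
float4 CommonAddress(float2 uv)
{
    float r = dataTex.SampleLevel(pointSamp, uv, 0).r;
    float g = dataTex.SampleLevel(pointSamp, uv, 0).g;
    float b = dataTex.SampleLevel(pointSamp, uv, 0).b;
    float a = dataTex.SampleLevel(pointSamp, uv, 0).a;
    return float4(r, g, b, a);
}

// Four jittered taps at distinct addresses: each needs only 32 bits,
// so 96 bits of every 128-bit transaction go unused.
float4 JitteredTaps(float2 uv, float2 texelSize)
{
    float s0 = dataTex.SampleLevel(pointSamp, uv + float2(-2.5, -1.5) * texelSize, 0).r;
    float s1 = dataTex.SampleLevel(pointSamp, uv + float2( 1.5, -2.5) * texelSize, 0).r;
    float s2 = dataTex.SampleLevel(pointSamp, uv + float2(-1.5,  2.5) * texelSize, 0).r;
    float s3 = dataTex.SampleLevel(pointSamp, uv + float2( 2.5,  1.5) * texelSize, 0).r;
    return float4(s0, s1, s2, s3);
}
[/code]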

Of course, looking across all the pixels, the average fetch per 128-bit address is likely to be higher, so global memory traffic won't show such a severe disparity of effort versus results. But there's no doubt that slowing the samplers down to one 32-bit result per pixel per clock is going to make ATI slow here (actually, to 1/4 of that, once ALU:fetch is taken into account).

Jawed
 
Hypothetical ...

DirectX 11 is being hammered out, 4-offset gather gets put in and properly documented. Does AMD think:

A) NVIDIA has a texture cache optimized for individual 32-bit accesses and will advocate that developers make accesses in a pattern that makes optimal use of that flexibility; we should take this into account for our future hardware architectures.

B) Meh, doesn't concern us.

A write-in answer will be acceptable as well.
 
D3D11 supports byte-addressable resources. 128-bit addressing hardware is clearly uncomfortable when faced with that prospect.

Turn that question around: what's the effective bandwidth of Evergreen with 128-bit aligned fetches versus GF100 (assuming there'll eventually be a full 16-SM part)?

ATI gains on the bursts and loses on the fragments.
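
For reference, the byte-addressable side of this looks roughly like the following in D3D11 HLSL (the buffer and function names are placeholders): a raw buffer can be read 32 bits at a time from any 4-byte-aligned offset, or in full 128-bit chunks, which is exactly the fragments-versus-bursts trade-off being described.

[code]
ByteAddressBuffer rawBuf : register(t0);

// A single 32-bit "fragment" from an arbitrary 4-byte-aligned byte offset.
uint LoadWord(uint byteOffset)
{
    return rawBuf.Load(byteOffset);
}

// A full 128-bit "burst" (four consecutive 32-bit words) from one address.
uint4 LoadVec(uint byteOffset)
{
    return rawBuf.Load4(byteOffset);
}
[/code]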

Jawed
 
There's an interesting thing happening in forums with these revelations. Months ago there was much optimism and props given to AMD for their focus on tessellation in DX11, and from that came the assumption that NVidia had put no work into it, and that if they supported it at all, it would be some late, half-assed, bolted-on or emulated addition that would not perform as well as AMD's. I'll note for the record that much the same story was repeated with Geometry Shaders (speculation that NVidia would suck at them, and that R600 would be the only 'true' DX10 chip). AMD has had some form of tessellation for several generations, all the way back to N-patches, so there's some logic to these beliefs.
Well, don't forget that NVidia basically tacked pixel shaders onto GF3 (there's a huge difference between PS1.0-1.3 and PS1.4), made a GF4MX which held back game development due to its lack of features, did god knows what in their DX9 implementation on NV3x, and also tacked useless vertex texturing and dynamic branching onto NV4x/G7x. And with the sole exception of the high-end NV3x, these decisions gave them great margins and all the bullet points they needed for marketing.

However, they dumped this strategy from G80 onwards. If R600 hadn't been so late and poorly optimized (just think about how much more powerful ATI could have made a 700M-transistor GPU with 100 GB/s using today's IP), NVidia may not have felt they had the freedom to keep going down this road of innovation.

Fermi is probably not going to have the perf/$ of RV8xx, but going forward they've now forced ATI to get off their butts and revamp their culling/setup/rasterization. Competition at its finest.
 
Seems a bit trivial to form the basis of a conspiracy theory, doesn't it? It wouldn't be the first time that an IHV decided a certain feature wasn't worth bothering with until some time down the road.
 
Which is no reason for Microsoft to hide it from the docs and only expose it at the HLSL level (which effectively hides it from the other IHVs unless they bother to reverse-engineer the shader compiler). It could all be an unhappy accident of circumstances, of course ... which is why I would like to hear from AMD (or any of the other IHVs, for that matter) whether the HLSL instruction was made known to them (if not, I'd personally call anyone who still thinks an accident is the most likely explanation slightly naive).
 
Jawed, that's not a write-in answer ... that's avoiding the question :(
I think it's obvious: B.

Traditional gather4-style "clumped" jittered sampling isn't as nice as truly sparse sampling. They know that. The architectural cost of optimising for multiple, distinctly sampled 32-bit fetches per clock is high.

Particularly if your starting point is an architecture that's deeply 128-bit.

Quite apart from that, AMD has chosen to use 2 GPUs to compete against NVidia's top single GPU card. So, in AMD's eyes, "meh, we've got double rate sampling in this scenario anyway".

Jawed
 
Which is no reason for Microsoft to hide it from the docs and only expose it at the HLSL level (which effectively hides it from the other IHVs unless they bother to reverse-engineer the shader compiler). It could all be an unhappy accident of circumstances, of course ... which is why I would like to hear from AMD (or any of the other IHVs, for that matter) whether the HLSL instruction was made known to them (if not, I'd personally call anyone who still thinks an accident is the most likely explanation slightly naive).

Well I think it would be wise to get that confirmation before causing a ruckus. I for one would be very surprised if it's even possible for one IHV to know about a DirectX feature while the other one remains oblivious. I mean, it's not like they do this stuff for a living or anything :)

I still stand by the notion that if AMD had wanted to implement support for multiple point samples per clock at different offsets, they would have done so.
 
Which is no reason for Microsoft to hide it from the docs and only expose it at the HLSL level (which effectively hides it from the other IHVs unless they bother to reverse-engineer the shader compiler). It could all be an unhappy accident of circumstances, of course ... which is why I would like to hear from AMD (or any of the other IHVs, for that matter) whether the HLSL instruction was made known to them (if not, I'd personally call anyone who still thinks an accident is the most likely explanation slightly naive).

Isn't it also possible that ATI asked Microsoft to "accidentally" omit it from the documentation so no developer would use a feature that benefited NVIDIA? :devilish:
 
Any thoughts on what this new architecture could mean for SLI? If each GPC is almost like a fully contained GPU itself, then extending the scheduling to work over multiple GPUs seems like a natural progression.
Could we finally say goodbye to AFR?
 