NVIDIA Fermi: Architecture discussion

The pixel shader doesn't really care what triangle a quad is from; you can pool them and run them in a single thread group.
That used to be true in the land of fixed function interpolators. Not so much anymore..
 
Nvidia does have a head start, but the momentum of the industry is behind OpenCL.

I do not think it makes much of a difference... OpenCL and CUDA are not like two completely different beasts and all the R&D poured into CUDA by nVIDIA will pay off for them when they move to OpenCL and DirectCompute IMHO.
It is not as nice for them as controlling the industry with their own proprietary standard, but I think they will be able to leverage their CUDA expertise (drivers and tools) when developing their OpenCL support (and they already have).

Fermi might have CUDA cores, but you could just as easily call them OpenCL cores :).
 
That used to be true in the land of fixed function interpolators. Not so much anymore..
I don't see why not ... for small triangles iterative interpolation does not provide a gain over purely parallel interpolation, and if you do it purely parallel it suits SIMD just fine.

Let's say the vertex shader precomputes 1/z and X/z per vertex (with X being the parameters) and the rasterizer supplies vertex blending factors per pixel ... why would a pixel shader + parameter interpolator care what triangle a pixel belongs to?
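
To make that concrete, here's roughly what the per-pixel work looks like under that scheme (a sketch with made-up names, not any actual hardware interface): the only triangle-specific inputs are the three packed vertices and the three blending factors, so pixels pooled from any number of triangles can run through the same code.

Code:
// Sketch of per-pixel perspective-correct interpolation, assuming the vertex
// shader already stored X/z and 1/z per vertex and the rasterizer supplies
// screen-space blending factors per pixel. Names are illustrative only.
struct PackedVertex {
    float x_over_z;   // parameter X already divided by z in the vertex shader
    float inv_z;      // 1/z, also from the vertex shader
};

__host__ __device__
float interpolate_param(const PackedVertex v[3], const float blend[3])
{
    float num = 0.0f, den = 0.0f;
    for (int i = 0; i < 3; ++i) {
        num += blend[i] * v[i].x_over_z;   // linear blend in screen space
        den += blend[i] * v[i].inv_z;
    }
    return num / den;                      // one divide per pixel recovers X
}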
 
Sorry, Charlie, but that roadmap, at least the second table, is fake at best.
Nvidia's top part is the GTX 285, not the 280, and anyway both versions can come with 1 or 2 gigabytes of VRAM. And this is only one of the inconsistencies.

1792 is a weird memory amount even for a 384-bit bus. I'd expect either 1536 or 3072, to be honest. That aside, if something with 3B transistors and 1792 MB of RAM theoretically consumes only 150W, bring it on by all means. No wait, he'll tell you now that it only works at a 300 MHz core frequency :rolleyes:
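
For what it's worth, the arithmetic (assuming the usual one 32-bit chip per channel and the 1 Gbit GDDR5 parts of the day, my assumption rather than anything from the roadmap): a 384-bit bus means 12 chips, so 12 x 128 MB = 1536 MB, or 3072 MB with 2 Gbit chips, whereas 1792 MB = 14 x 128 MB, which maps naturally onto a 448-bit bus instead.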
 
Why would they want to do this? Any gain in efficiency would be traded against how compact and numerous they can make those 5-way scalar units.

You actually might have a point. I remember it was Mike Houston from AMD who said that the VLIW efficiency is at least 80% in Folding@Home code, and even better in complex game shaders. And that's apparently without any parallelizing effort from the developer's side.

Now if you compare the SP FLOP capability between the AMD Evergreen and the Nvidia Fermi, we have Nvidia at around 50-60% of AMD (~1.5 TFLOPS vs 2.72 TFLOPS), not even taking chip sizes into account. The only problem for AMD is DP, but that's only used in HPC/scientific GPGPU applications currently.
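
For reference, the peaks behind those figures (using the commonly quoted clocks, so treat the Fermi number as an estimate): Cypress is 1600 ALUs x 2 FLOPs (MAD) x 0.85 GHz = 2.72 TFLOPS, while 512 CUDA cores x 2 FLOPs x ~1.5 GHz works out to roughly 1.5 TFLOPS.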

Of course there are many other factors to consider, but clearly if compute density is what you're after, the current AMD design is way more efficient. What will be interesting is how efficient the Fermi architecture will be in SP-only applications like games and F@H. My bet is it will do surprisingly well.
 
I remember it was Mike Houston from AMD who said that the VLIW efficiency is at least 80% in Folding@Home code, and even better in complex game shaders.

I'm suspicious of this claim. If it were true, wouldn't a 4850, with its 1 TFLOPS theoretical peak performance, be running rings around a GTS 250, with its paltry 0.5 TFLOPS theoretical peak? Instead, they're pretty similar in performance. Comparing actual game benchmarks, 1 nVidia FLOP seems to be roughly equal to 2 ATI FLOPs, suggesting that ATI's utilization efficiency in game shader code is closer to 50% than 80+ percent. I think Fermi will perform quite well, relative to Cypress, in games.
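
To put numbers on it, using the peaks above: at 80% utilization the 4850's ~1.0 TFLOPS would yield ~0.8 TFLOPS of useful work, against at most ~0.5 TFLOPS from the GTS 250 even at perfect utilization, so the 4850 should be comfortably ahead rather than roughly tied. That gap not showing up is what points to an effective ratio closer to 2 ATI FLOPs per nVidia FLOP.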
 
I'm suspicious of this claim. If it were true, wouldn't a 4850, with its 1 TFLOPS theoretical peak performance, be running rings around a GTS 250, with its paltry 0.5 TFLOPS theoretical peak? Instead, they're pretty similar in performance. Comparing actual game benchmarks, 1 nVidia FLOP seems to be roughly equal to 2 ATI FLOPs, suggesting that ATI's utilization efficiency in game shader code is closer to 50% than 80+ percent. I think Fermi will perform quite well, relative to Cypress, in games.

There's more to graphics than FLOPs though. A lot more. The 250 has more pixel fillrate, higher texture addressing/filtering throughput and more bandwidth, and using game benchmarks to divine anything about ALU usage ratios is...inefficient.
 
So what happens when you try and run some code that uses DP on such a design?
My assumption is that this would be for consumer-level gaming chips, which leads me to expect a few scenarios:
1) driver compiler code-replacement antics
2) The chip demotes DP to SP, to the horror and consternation of application devs who really care about the math their code does (Nvidia could say they should have bought a Fermi). I'm not aware of any operation that is only DP.
3) Driver/device error for invalid operations (shoulda bought that Fermi, Nvidia could say)
4) Device is tagged as not supporting that compute level or optional standard and you should have bought a Fermi if you wanted the extras
5) The chip blows up (should've bought a Fermi, Nvidia could say)

This may be more of a reflection of my low expectations than what Nvidia would decide is worthwhile. It wouldn't be out of character for a GPU maker to play games with support and compliance, or unusual to have yet another version x.1,x.2,xa,xb for different levels of support.

Register files care as well, since they have to allocate and access 64b regs differently.
Whether the physical register files themselves do, I don't know.
I suspect the compute shader setup engine and the operand collectors might need some alteration, or they could just operate only in SP mode.

So do you think they will do DP in a library as part of the driver?
That would be awfully nice of Nvidia.
It's also not unheard of for a vendor to promise such support in a later version of their platform, at some indeterminate future date.
For the gaming markets I'd figure the SP design would target, I wouldn't expect that to be a problem, and Nvidia would really want people to take a hint about its desired market segmentation and buy a Fermi.

Nvidia could just refuse to allow it for those chips. I don't know what Intel will do about folks who want to run their MMX and SSE code on Larrabee. The options strike me as being similar.
 
I'm suspicious of this claim. If it were true, wouldn't a 4850, with its 1 TFLOPS theoretical peak performance, be running rings around a GTS 250, with its paltry 0.5 TFLOPS theoretical peak? Instead, they're pretty similar in performance. Comparing actual game benchmarks, 1 nVidia FLOP seems to be roughly equal to 2 ATI FLOPs, suggesting that ATI's utilization efficiency in game shader code is closer to 50% than 80+ percent. I think Fermi will perform quite well, relative to Cypress, in games.

The claim here was only about VLIW efficiency vs scalar, not about the total ALU utilization, which in itself is (as AlexV mentioned) only part of the performance picture. I agree on the predicted Fermi game performance.
 
Let's say the vertex shader precomputes 1/z and X/z per vertex (with X being the parameters) and the rasterizer supplies vertex blending factors per pixel ... why would a pixel shader + parameter interpolator care what triangle a pixel belongs to?
A lot of IFs and way more data injected in the pixel shader. No rocket science required, but not as straightforward as it used to be.
 
I do not think it makes much of a difference... OpenCL and CUDA are not like two completely different beasts and all the R&D poured into CUDA by nVIDIA will pay off for them when they move to OpenCL and DirectCompute IMHO.
It is not as nice for them as controlling the industry with their own proprietary standard, but I think they will be able to leverage their CUDA expertise (drivers and tools) when developing their OpenCL support (and they already have).

Fermi might have CUDA cores, but you could just as easily call them OpenCL cores :).

I think the question was whether CUDA would die/diminish, not whether Fermi would be good for OpenCL, which I'm sure it will be.
 
What's your idea, Mintmaster? Running 4 vertices/pixels per 5-vector?
Although working on 256-pixel batches doesn't sound terribly efficient when it comes down to dynamic branching, not to mention increased register pressure and less-than-optimal instruction bandwidth/instruction cache usage.
No, just having a SIMD where 64 ALUs run the same instruction on the whole batch, as opposed to 16 vec4 ALUs (I'll ignore the transcendental unit for a moment).

Currently ATI has 8-stage ALUs, pipelining the four 16-pixel groups of one active batch and then the four groups (with a different instruction pointer) from a second active batch. They can still use 8-stage ALUs with my proposal, but now they'll round-robin eight active batches (again, 8 IPs) instead of flipping back and forth between two. Branching can be done at the same rate, i.e. once every four scalar instructions instead of once every vec4 instruction group. Net instruction throughput is the same. Register access has the same BW and granularity, but it's organized a bit differently. Texturing is done at the same rate (4 fetches per cycle per SIMD). Batch loading from the overall pool to the active pool (due to texture fetches or completion) still occurs at the same rate, despite there being 8 active instead of 2.

The only cost is a few transistors to store the state for six more active batches, and possibly a little more complexity to make the ALUs change instruction every cycle instead of every four (but remember that the pipeline is still 8 stages). I don't think the latter is a big deal because that's how virtually all FPUs operate other than ATI/NVidia, and we're not really changing the instruction since MAD/MUL/ADD/SUB/DP are effectively the same. Also, in addition to the efficiency gain of being able to execute serially dependent code without loss of utilization, you can also compile shaders to use fewer registers. As for the transcendental unit, it goes relatively unchanged, taking four cycles per batch.
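
If it helps, here's a toy model of the issue pattern I'm describing (plain host-side code, numbers from this post, nothing vendor-accurate): with an 8-stage pipeline and eight active batches in round-robin, a batch's next instruction issues exactly when its previous one leaves the pipeline, so even a fully serial dependency chain never produces a bubble.

Code:
#include <cstdio>

int main()
{
    const int PIPELINE_DEPTH = 8;   // stages in the ALU pipeline
    const int ACTIVE_BATCHES = 8;   // batches round-robined per SIMD
    const int CYCLES         = 64;

    int last_issue[ACTIVE_BATCHES];
    for (int b = 0; b < ACTIVE_BATCHES; ++b)
        last_issue[b] = -PIPELINE_DEPTH;   // pretend each batch is ready at cycle 0

    int bubbles = 0;
    for (int cycle = 0; cycle < CYCLES; ++cycle) {
        int b = cycle % ACTIVE_BATCHES;    // round-robin batch selection
        // A dependent instruction can only issue once the previous one from the
        // same batch has cleared the pipeline.
        if (cycle - last_issue[b] >= PIPELINE_DEPTH)
            last_issue[b] = cycle;         // issues with no stall
        else
            ++bubbles;                     // would have been a pipeline bubble
    }
    printf("bubbles over %d cycles: %d\n", CYCLES, bubbles);   // prints 0
    return 0;
}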
 
Not in my opinion. See my post above to nAo. Cost is almost negligible.

It would need 4x as many branch units.
Breaking up the groups of 4 would bring up register file concerns.
Either this would quadruple the register file to maintain the same number of registers, or it would split up the weird quad registers and potentially simplify the register access restrictions. (edit: maybe at the cost of increasing the complexity of the hardware?)

ATI's full rationale for the way the register file is divided currently would be fun to know.

The trans units would have a problem, though, since their register accesses piggyback on the datapaths of the slim ALUs, and breaking the clusters will leave them orphaned or picking between a lot of lanes.
 
That used to be true in the land of fixed function interpolators. Not so much anymore..
I think it's still true. The only difference is that the vertex parameters are stored nearby now (in the local memory?), and instead of each quad storing interpolated values for each pixel, it will store indices plus vertex weights.

I don't see why not ... for small triangles iterative interpolation does not provide a gain over purely parallel interpolation, and if you do it purely parallel it suits SIMD just fine.
Agreed.

Let's say the vertex shader precomputes 1/z and X/z per vertex (with X being the parameters) and the rasterizer supplies vertex blending factors per pixel ... why would a pixel shader + parameter interpolator care what triangle a pixel belongs to?
FYI, this method needs at least two sets of blending factors per pixel for DX11, as you can now choose centroid or regular interpolation positions on the fly, and maybe more.

Just to throw this out there, an alternative is to store perspective-correct vertex blending factors, so that the vertex shader doesn't need to multiply each parameter by 1/z, and the pixel shader doesn't need to multiply each interpolated X/z by z.
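
A sketch of what that buys you (hypothetical names again): the divide happens once per pixel when the weights are built, and every parameter after that is just a weighted sum of the raw per-vertex values.

Code:
// Build perspective-correct blending factors once per pixel from the screen-space
// factors and per-vertex 1/z; afterwards any parameter X is simply
// blend[0]*X0 + blend[1]*X1 + blend[2]*X2 with the untouched vertex values.
__host__ __device__
void perspective_correct_weights(const float screen_blend[3],  // from the rasterizer
                                 const float inv_z[3],         // 1/z per vertex
                                 float blend[3])               // output weights
{
    float sum = 0.0f;
    for (int i = 0; i < 3; ++i) {
        blend[i] = screen_blend[i] * inv_z[i];
        sum     += blend[i];
    }
    for (int i = 0; i < 3; ++i)
        blend[i] /= sum;   // normalized: perspective correct and summing to 1
}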
 
Just to throw this out there, an alternative is to store perspective-correct vertex blending factors, so that the vertex shader doesn't need to multiply each parameter by 1/z, and the pixel shader doesn't need to multiply each interpolated X/z by z.
The vertex shader can't multiply by 1/z anyway in case there's clipping involved.
 
My assumption is that this would be for consumer-level gaming chips, which leads me to expect a few scenarios:
1) driver compiler code-replacement antics

It is surprising that 3 years after G80, people still wonder about what-happens-if-you-do-dp-without-hw-support. :rolleyes:

ptxas demotes it to float, but not silently.
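
For the record, this is roughly what it looks like in practice (from memory, so the exact warning wording may vary by toolkit version):

Code:
// Compile a kernel that uses double for a pre-GT200 target, e.g.
//   nvcc -arch=sm_11 axpy.cu
// and ptxas emits a warning along the lines of
//   "Double is not supported. Demoting to float."
// The kernel still builds and runs, just in single precision.
__global__ void axpy(double a, const double* x, double* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // written as double, executed as float on sm_11
}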
 