NVIDIA Fermi: Architecture discussion

...and I don't remember them acknowledging that it's late. They would want to have it right now, of course, but that doesn't mean that it's late.

If it's not late, then whoever planned it to miss the Windows 7 launch and miss the holiday shopping season should be fired.

From Anandtech: "I asked two people at NVIDIA why Fermi is late; NVIDIA's VP of Product Marketing, Ujesh Desai and NVIDIA's VP of GPU Engineering, Jonah Alben. Ujesh responded: because designing GPUs this big is "fucking hard"."
 
All these co-issue schemes imply significantly higher register file read bandwidth and 2x register write bandwidth, which from what I understand is quite expensive.

It probably is somewhat expensive. I have no idea how expensive. To use the FMA hardware as a separate add and mul looks like it requires an extra read and an extra write port. I don't recall the size of the register file off-hand (dkanter's article has it somewhere, IIRC), but the cost is probably something like 0.1% [edit: looks like 2% -- 16 SMs * 128k * 8 b/B * 4 trannies -- ouch!] of the total GPU tranny budget. Your effective flop utilization almost certainly goes up by a lot more than that.
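For what it's worth, here's that edit spelled out as a quick back-of-the-envelope check (my assumptions, not anything official: 16 SMs with 128 KB of registers each, roughly 4 extra transistors per bit cell for an added port, and ~3.0B transistors for the whole chip):

    /* Rough check of the "~2% of the tranny budget" edit above.
       Assumptions: 16 SMs x 128 KB register file, ~4 extra transistors per
       bit cell for one more port, ~3.0e9 transistors total for GF100. */
    #include <stdio.h>

    int main(void)
    {
        double bits  = 16.0 * 128 * 1024 * 8;   /* register-file bits, all SMs  */
        double extra = bits * 4;                /* extra transistors for a port */
        double total = 3.0e9;                   /* whole-chip transistor budget */
        printf("%.0fM extra transistors, %.1f%% of the chip\n",
               extra / 1e6, 100.0 * extra / total);   /* ~67M, ~2.2% */
        return 0;
    }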

No, the real expense, I would think, is the extra operand buffers, the extra register space you might need because you're eating through data faster, and the logic necessary to issue and handle exception cases. But if your tranny cost goes up 5%, and if utilization goes up 25%, that's worth it, no?

Lots of "ifs" there :)

Similar arguments for separate int and fp issue, except you need to double your read bandwidth, unless you're clever and don't support FMA co-issue between int and fp units. The advantage of doing this, though, is likely more marketing than real. I'm not sure how many mixed int and fp math kernels are out there, but marketing would be able to quote much higher op/s numbers (with FMA co-issue).

I ... would be pretty surprised if the fp mul HW weren't reused in some way for integer multiplies.

If you read back, I said the same thing earlier. But, the more technical articles I read, the more I'm starting to be convinced that what we have is single-cycle DP hardware, split across two sets of int and fp units. You're right, though, if I had to bet right now, I'd say int mul and fp mul shared resources of some kind [ed/clarify: and not just mul]. Int32 is two cycles.... It's just that no article has said that yet, and lots of articles have said the opposite. :shrug:

With regards to implementing stream producer/consumer communication via L2 - why is that a bad thing?

L2 is shared. If you have eight consumers and eight producers running on 16 SMs, and the kernels are relatively short, L2 is going to get a soaking. This is really what the shared "L1" should be used for, no? Let the spill-over L2 handle the 20% case, not the 100% case....
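To make that concrete, here's a minimal sketch (hypothetical names, nothing from NVIDIA) of what producer/consumer communication "via L2" looks like today: two separate kernel launches whose handoff buffer lives in global memory, so every value goes out through L2 and gets pulled back in, instead of staying in one SM's shared memory:

    /* Hypothetical producer/consumer pair as two kernel launches. The queue
       buffer sits in global memory, so the handoff is serviced by L2/DRAM
       rather than by an SM-local shared memory. */
    __global__ void producer(float* queue, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            queue[i] = i * 2.0f;            /* stand-in for real work */
    }

    __global__ void consumer(const float* queue, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = queue[i] + 1.0f;       /* stand-in for real work */
    }

    /* Host side: producer<<<grid, block>>>(d_queue, n);
                  consumer<<<grid, block>>>(d_queue, d_out, n); */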

FWIW, in my opinion, fine-grained threading (the traditional type) is a really good idea in a message-based system, and messaging is a fine way of organizing the environmental API of a system that not only scales internally, but is going to be scaled to thousands of processors externally. But then, that's a bias that's not informed by a lot of experience with those systems.

So we are talking about up to 4 FLOPs per CUDA core? :???:

My current assumption is that we have 2 flops and 2 unused ints, or 2 flops and 2 used ints (for DP), or 2 ints and 2 unused flops. Never 4. I am mostly musing here....

What's so inefficient about putting circular append buffers between producer/consumer kernels and branching to consumers when they have full warps?

Hmm, am I missing something? AFAIK, you can't have two kernels active on the same SM in GF100, so you can't switch from one to the other? I think Jawed was saying that branching was an inefficient workaround for lack of per-SM parallelism. That said, I suppose you could:

    // MegaKernel: both "kernels" fused into one launch, with an atomic
    // ticket standing in for the <lock>/increment/<unlock> sequence.
    __device__ unsigned int ticket;

    __global__ void MegaKernel(/* arguments for both halves */)
    {
        unsigned int t = atomicAdd(&ticket, 1u);   // increment shared variable and load
        if ((t % 64) >= 32) {
            // kernel2:
            // --blah--
        } else {
            // kernel1:
            // --blah--
        }
    }

At a minimum, I'd prefer that kind of ugliness hidden from me :)
 
Well, then you'd have to clarify what exactly you mean. In the case of having, say, >600 SPs but still the same 128 TMUs / 48 ROPs they have today, by what percentage would final gaming performance theoretically increase? Certainly not by the percentage the peak arithmetic values would increase. I am thinking in clusters because it would bring a healthier performance increase, and by the time you increase even just the texel fillrate, bandwidth won't be enough, etc.

You'd have to be a bit more specific, don't you think?
While we're throwing around random numbers, I'd say 640 ALUs, 160 TMUs & 64 ROPs. Of course they would then have to get rid of the other silicon dedicated to GPGPU and highly optimize their ALUs/TMUs. However, like DemoCoder mentioned, this was probably not the best time for them to do it, so I'll still be pinning my hopes on such optimization in the refresh.

Remember that Fudo article about architectures/roadmaps not being able to get changed on short notice? How's the boomerang effect for a change? ;)
Not sure what you are implying there.

Arty, I'd say that beyond doubt NVIDIA would have wished to be able to launch before Win7, and preferably before AMD too. And before someone says "yeah, but it ended up only a couple of months later", the next best question is what kind of launch.
The usual kind? Hard launch.

I'm sorry, but what's with all the italics and underlines? I don't remember NV ever saying anything along the lines of "sooner than you'd expect", and I don't remember them acknowledging that it's late. They would want to have it right now of course, but that doesn't mean that it's late. A1, August. If that chip were A3 and had been done in spring, then you would have a reason to say that it's late. Right now it's on track. Whether that track is late in itself is another issue.
Nvidia acknowledged it's late, end of story. Now unless you know more about Nvidia's internal deadlines than Nvidia themselves, you can't say otherwise.
So you're now trusting Fuad again? ;)
I think it's quite clear who feeds Fuad, so yes, I'll trust that bit of news.

I don't understand that stance that Fermi is good for compute and bad for graphics. Most of what's needed for compute is needed for graphics too. If anything, Fermi should be better for graphics than previous generation architecture.
Fermi needs to be better for graphics than the competition, not just Nvidia's previous generation.

Now tell me if that statement makes sense given these numbers below? Here I used relatively conservative clocks of 650/1300 for Fermi and the currently rumoured 128 TMUs. I counted only MAD flops, adjust as required if you consider the "missing MUL" useful.

[attached image: comp.png -- spec comparison table]


Does that look like they've abandoned graphics keeping in mind the lowball clock estimates?
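(In case the attachment doesn't come through: the rough math behind the GTX285 vs. Fermi part of that table, using my assumed 650/1300 clocks, the rumoured 128 TMUs, and GTX285's stock 648/1476, looks something like this; these are my numbers, not a quote of the image.)

    /* MAD-only flops and texel rates, GTX285 (stock) vs. a 650/1300 Fermi.
       Assumed figures, not taken from the attached table. */
    #include <stdio.h>

    int main(void)
    {
        double gtx285_flops = 240 * 2 * 1.476;   /* ~708 GFLOPS (MAD only)  */
        double fermi_flops  = 512 * 2 * 1.300;   /* ~1331 GFLOPS (MAD only) */
        double gtx285_tex   = 80  * 0.648;       /* ~51.8 GTexels/s         */
        double fermi_tex    = 128 * 0.650;       /* ~83.2 GTexels/s         */
        printf("flops: %.2fx  texels: %.2fx\n",
               fermi_flops / gtx285_flops,       /* ~1.88x */
               fermi_tex   / gtx285_tex);        /* ~1.60x */
        return 0;
    }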
Trini, you would have been better off using the 9800GTX as the base for G92 rather than the GTS250, which I believe has quite a bit higher clocks and which was a refresh of the refresh (9800GTX -> 9800GTX+ -> GTS250).
 
Now tell me if that statement makes sense given these numbers below? Here I used relatively conservative clocks of 650/1300 for Fermi and the currently rumoured 128 TMUs. I counted only MAD flops, adjust as required if you consider the "missing MUL" useful.

[attached image: comp.png -- spec comparison table]


Does that look like they've abandoned graphics keeping in mind the lowball clock estimates?
Actually, I was asking you to justify: "For starters, even at moderate clocks Fermi is 2x GTX285." It clearly isn't. And 650 core clock is not particularly conservative - even if 40nm looks reasonably happy with clocks on ATI.

But there's room to improve the per unit efficiencies from where they are in GTX285, and there'll be lots of bonuses accruing from all the memory-system and threading tweaks.

Just having seen 2x RV870 "disappoint" with its 1.45x performance most of the time (though I've seen 1.8x+ scaling) it's premature to go declaring GF100 2x GTX285. You need to know about some serious clocks and/or efficiency gains to make that statement.

Jawed
 
Trini, you would have been better off using the 9800GTX as the base for G92 rather than the GTS250, which I believe has quite a bit higher clocks and which was a refresh of the refresh (9800GTX -> 9800GTX+ -> GTS250).
GTS250 = 9800GTX+, not a refresh of it (clock-wise).
 
That's far riskier and more expensive than their current approach. People hype the bigger die sizes but only Nvidia knows how much that hurts the bottom line in the end. Also much of this is financed by "cheap" dies like G92 and lower where they are very competitive. In the end the big dies on the high end may not be as big a deal as commonly thought and it's a far easier proposition for them to leverage that investment in multiple markets.
Can NVidia delete double-precision in this design? What else might be stripped out to make competitive cards for the plebs?

Jawed
 
Can NVidia delete double-precision in this design? What else might be stripped out to make competitive cards for the plebs?

Jawed

Yeah that's a biggie. It will be interesting to watch.

Arty, GTX285 is also a refresh/overclock a la G92+.
 
In that case you really need to compare it with the original 9800GTX. Honestly, I don't think the 9800GTX is the best base case either, but rather the G80: G92, especially the 512MB variants, has some problems, and it also offers higher shader performance than the G80.
 
Don't see why it wouldn't be. 240 cores -> 512 cores is a slightly bigger jump than 800 SPs -> 1600 SPs, after all.

But we'll probably see a repeat of last gen, where ATI is a little slower but a lot smaller, too.

I'm wondering if the consumer part will have all 512 'cores' enabled, especially at launch. I think in order to keep their volumes up, they may have to save the fully enabled chips for Fermi (dedicated GPGPU) parts and give even the top-end consumer part a smaller core count. If the performance lead is really as large as they were hinting, that is.

If they do lead with a 'GTX 380'-type part with fewer than 512 cores, they may eventually release a consumer part with all cores enabled, say a 385, once yields are up or when they have more competition, such as a 5890.
 
Actually, I was asking you to justify: "For starters, even at moderate clocks Fermi is 2x GTX285." It clearly isn't. And 650 core clock is not particularly conservative - even if 40nm looks reasonably happy with clocks on ATI.

You're right. When I said that, I was thinking solely of flops. I think you'll agree it should be easy to double up on GT200 in that area. But yeah, lots of other stuff isn't doubled.

Yes, hence I asked you to compare Fermi with GT200 and not GT200b.

G92 & GT200
GT200 & Fermi

Doing so makes Fermi look like an even better upgrade :) Using the same 650/1300/4800 clocks for Fermi:

[attached image: comp.png -- spec comparison table]
 
Jawed said:
Just having seen 2x RV870 "disappoint" with its 1.45x performance most of the time (though I've seen 1.8x+ scaling) it's premature to go declaring GF100 2x GTX285. You need to know about some serious clocks and/or efficiency gains to make that statement.

Something odd I've noticed about RV870's performance scaling is that the CrossFire results seem to scale as well as, or even better than (percentage-wise), the jump from a single-card RV770 to RV870. This seems particularly odd, as not only is the generational jump larger percentage-wise, but one would also expect the CrossFire results to be more "system" limited, not to mention the inefficiencies of CrossFire in general.

Driver issues perhaps? Or maybe bandwidth: on theoretical specs alone, the HD5870 has less bandwidth per flop, per texel of fill, and per pixel of fill than the HD4850.
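Spelling that last point out with reference specs (my arithmetic; HD4850: 625 MHz core, 800 SPs, 40 TMUs, 16 ROPs, ~63.6 GB/s; HD5870: 850 MHz core, 1600 SPs, 80 TMUs, 32 ROPs, 153.6 GB/s):

    /* Bandwidth per unit of theoretical throughput, HD4850 vs. HD5870,
       using the reference specs listed above. */
    #include <stdio.h>

    int main(void)
    {
        double bw4850 = 63.6,  gflops4850 = 1000.0, tex4850 = 40 * 0.625, pix4850 = 16 * 0.625;
        double bw5870 = 153.6, gflops5870 = 2720.0, tex5870 = 80 * 0.850, pix5870 = 32 * 0.850;
        printf("GB/s per TFLOP:  %.1f vs %.1f\n", bw4850 / (gflops4850 / 1000), bw5870 / (gflops5870 / 1000));
        printf("bytes per texel: %.2f vs %.2f\n", bw4850 / tex4850, bw5870 / tex5870);
        printf("bytes per pixel: %.2f vs %.2f\n", bw4850 / pix4850, bw5870 / pix5870);
        return 0;
    }

The HD5870 comes out lower on all three ratios.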
 
I'm wondering if the consumer part will have all 512 'cores' enabled, especially at launch. I think in order to keep their volumes up, they may have to save the fully enabled chips for Fermi (dedicated GPGPU) parts and give even the top-end consumer part a smaller core count. If the performance lead is really as large as they were hinting, that is.

If they do lead with a 'GTX 380'-type part with fewer than 512 cores, they may eventually release a consumer part with all cores enabled, say a 385, once yields are up or when they have more competition, such as a 5890.

To be honest I'd be surprised if that's the case. In every example of yields I can think of regarding Nvidia, it's traditionally been the ROPs that came up defective. As the GT200 architecture evolved, we still had parts with disabled ROPs/memory but a full-fledged GT200 from a TMU/shader PoV.

The fact that Nvidia has fully functioning shader cores but hasn't shown the graphics portion of its tech makes me believe we'll see a similar trend.
 
You're right. When I said that, I was thinking solely of flops. I think you'll agree it should be easy to double up on GT200 in that area. But yeah, lots of other stuff isn't doubled.



Doing so makes Fermi look like an even better upgrade :) Using the same 650/1300/4800 clocks for Fermi:

[attached image: comp.png -- spec comparison table]

How high exactly do you think nVidia will get the clocks?
I mean, according to nVidia it's supposed to be 8x GT200 in double-precision GFLOPS (rounded down, so I'll use 8x and 9x numbers to be safe). I'm assuming this is clock-for-clock, so Fermi would be sitting somewhere around 622-699 GFLOPS DP with shaders running at 1296MHz. I haven't double-checked this, but didn't nVidia also mention that this time DP performance is as high as half of SP performance? That would put SP throughput at 1244-1398 GFLOPS with shaders at 1296MHz.

That's quite far from being 213% higher than GT200
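For reference, the arithmetic behind those figures, under my assumptions that GT200 has 30 DP FMA units running at the 1296 MHz hot clock and that Fermi's DP rate is half its SP rate:

    /* 8-9x GT200 DP, and the SP rate implied if DP = SP/2. */
    #include <stdio.h>

    int main(void)
    {
        double gt200_dp = 30 * 2 * 1.296;                                          /* ~77.8 GFLOPS DP */
        printf("8x-9x DP:   %.0f - %.0f GFLOPS\n", 8 * gt200_dp, 9 * gt200_dp);    /* ~622 - ~700     */
        printf("implied SP: %.0f - %.0f GFLOPS\n", 16 * gt200_dp, 18 * gt200_dp);  /* ~1244 - ~1400   */
        return 0;
    }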
 
That's quite far from being 213% higher than GT200

Sure, if you count GT200's extra MUL, which has been widely criticized as being useless in 3D. Without that MUL, GT200 is rated at only 622 GFLOPS, which is less than half of a theoretical 1300MHz Fermi.
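Spelled out, assuming a 1300 MHz hot clock for Fermi:

    /* GT200 MAD-only vs. a hypothetical 1300 MHz Fermi, MAD counted as 2 flops. */
    #include <stdio.h>

    int main(void)
    {
        double gt200_mad = 240 * 2 * 1.296;   /* ~622 GFLOPS, MUL excluded */
        double fermi_mad = 512 * 2 * 1.300;   /* ~1331 GFLOPS              */
        printf("ratio: %.2fx\n", fermi_mad / gt200_mad);   /* ~2.14x */
        return 0;
    }

Which is presumably where the ~213% figure came from in the first place.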
 
Sure, if you count GT200's extra MUL, which has been widely criticized as being useless in 3D. Without that MUL, GT200 is rated at only 622 GFLOPS, which is less than half of a theoretical 1300MHz Fermi.

Oh indeed, didn't think about that.
 