NVIDIA GF100 & Friends speculation

A war on two fronts, a bit like Germany in WWII, it was bound to end in tears.

Well, it's not as though NVIDIA is focused on handheld markets with Tegra and on Fermi's compute-specific features because they have no competition at all and can live off profits from the graphics market alone...
 
Hmm, not really two fronts; it's more one front with two focal points, since both major changes in Fermi ultimately have a similar goal: more performance for all applications that run on the GPU.
 
Quarter-rate DP seems fine to me. Overall, the VLIW approach makes a *lot* of sense. It makes DP FP in Evergreen about as cheap as it can be, while still having higher perf/mm² than NV. Rounding modes/denormals might change this, though.

[QUOTE="Evergreen ISA" ]SET_MODE
Overrides the rounding and denorm modes until the end of clause.

Round modes are:
• Round to nearest
• Round toward 0 (truncate)
• Round toward +infinity
• Round toward -infinity

Denormal handling:
• single_denorm_flush_input (on/off)
• single_denorm_force_underflow_to_zero (on/off)
• double_denorm_flush_input (on/off)
• double_denorm_force_underflow_to_zero (on/off)[/QUOTE]
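For reference, here is a minimal host-side C sketch of the same four IEEE-754 rounding modes listed above, using <fenv.h> on a CPU rather than Evergreen itself, just to show how the choice of mode changes the last bit of an inexact result:

[CODE]
/* Host-side illustration of the four rounding modes named in the Evergreen
 * SET_MODE description.  This is plain C99 <fenv.h>, not GPU code. */
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    const int modes[4]   = { FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD, FE_DOWNWARD };
    const char *names[4] = { "round to nearest", "round toward 0",
                             "round toward +inf", "round toward -inf" };

    /* 1/3 is inexact in binary, so the last bit of the quotient depends
     * on the active rounding mode.  volatile keeps the division at runtime. */
    volatile float num = 1.0f, den = 3.0f;

    for (int i = 0; i < 4; ++i) {
        fesetround(modes[i]);
        float q = num / den;
        printf("%-20s -> %.10f\n", names[i], q);
    }
    return 0;
}
[/CODE]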
If NV had no INT pipeline, their DP throughput would be 1/8 of SP from just the FP pipes. :oops: No wonder they had to add the INT path just to reach respectable DP performance. Maybe Fermi3 will make all four pipes identical (both SP FP and INT32) and quad-issue warps every clock. Otherwise, the INT32 pipe seems rather unnecessary.
NVidia could have 1/4 rate DP using the same trick ATI has. And put MUL32 in the special function ALU.

But it seems NVidia wants 1/2 rate DP + full speed MUL32 (low or hi, half speed full result).

Jawed
 
I said may:

Since you are bent on buying the fastest single GPU, it may be that a 2GB 5870 Eyefinity/OC is faster overall, and lower power to boot.

So about your statement:

There is a question mark at the end, since it certainly looks like you may provide us with an answer. So I'm waiting.

So what exactly are you waiting for? Perhaps you should work a little on your reading comprehension.
 
Isn't it 8192x8192x4 bytes? I think the current 256MB-per-buffer limitation in ATI's OpenCL comes from there.
I suspect 16384² is the source of this limitation, because of byte addressing: the hardware can't work with larger textures. But everything's so vague...
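A quick back-of-envelope in C (my own arithmetic, just to check the figures quoted above) shows the two expressions land on the same 256 MiB number:

[CODE]
/* 8192x8192 texels of 4 bytes each versus a 16384^2-byte linear range:
 * both come out to the same 256 MiB per-buffer figure. */
#include <stdio.h>

int main(void)
{
    unsigned long long texels   = 8192ULL * 8192ULL * 4ULL;   /* 8192^2 * 4 B */
    unsigned long long byteaddr = 16384ULL * 16384ULL;        /* 16384^2 B    */

    printf("8192*8192*4 bytes = %llu (%llu MiB)\n", texels,   texels   >> 20);
    printf("16384^2 bytes     = %llu (%llu MiB)\n", byteaddr, byteaddr >> 20);
    return 0;
}
[/CODE]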

Jawed
 
NVidia could have 1/4 rate DP using the same trick ATI has.
Oh yeah, brainfart :oops:
And put MUL32 in the special function ALU.
Seems like a good idea for mainstream parts.

But it seems NVidia wants 1/2 rate DP + full speed MUL32 (low or hi, half speed full result).
Which *is* overkill for graphics. AMD's approach seems more area-efficient in the sense of $/mm², i.e. the balance between frequency-of-use and area. Is there a use case for INT32 that I have missed?

With this in mind, I have to say, GF100 looks dp-heavy (by an appreciable amount) for a gpu.
 
Wow, that is such a clever use of the VLIW architecture!
Is this the case in Milkyway@Home too? I know it has amazing performance on ATI hardware as well...
MW@Home is double-precision, which is 1/4 performance for MUL and 1/2 performance for ADD. Also the programmer had to hand-code around the stupid compiler's inability to properly co-issue a pair of DP ADDs per clock. Not sure if the compiler has improved in this respect recently.
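To put rough numbers on that, here is a back-of-envelope sketch in C, assuming the commonly quoted HD 5870 configuration (320 VLIW-5 units at 850 MHz) and reading "1/2 ADD, 1/4 MUL" as 2 DP ADDs / 1 DP MUL per VLIW unit per clock; the figures are my own arithmetic, not measured numbers:

[CODE]
/* Rough peak-rate arithmetic under the DP rates described above.
 * All hardware figures are assumed HD 5870 values used for illustration. */
#include <stdio.h>

int main(void)
{
    const double units = 320.0;    /* VLIW-5 units (assumed)   */
    const double clock = 0.85e9;   /* engine clock in Hz       */

    double sp_mad = units * 5 * 2 * clock;   /* SP MAD, 2 flops per slot */
    double dp_add = units * 2 * clock;       /* 2 DP ADDs per unit/clock */
    double dp_mul = units * 1 * clock;       /* 1 DP MUL per unit/clock  */

    printf("SP MAD peak : %.0f GFLOPS\n", sp_mad / 1e9);  /* ~2720 */
    printf("DP ADD peak : %.0f Gops/s\n", dp_add / 1e9);  /* ~544  */
    printf("DP MUL peak : %.0f Gops/s\n", dp_mul / 1e9);  /* ~272  */
    return 0;
}
[/CODE]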

Jawed
 
Which *is* overkill for graphics.
Ever since G80 NVidia's been doing overkill: monstrous counts and/or sizes of units - it just took a while to realise it.

I'm intrigued by the thought that if GF100 Ax has half its TMUs turned off, then B1 might have them deleted, leading to a smaller, less power-hungry die.

It also raises the question of what GTX480 performance would have been like with 128 TMUs... Sure, it's too early to really tell what GTX480's performance is, but eventually we can have some fun with that idea.

Jawed
 
Can I put the TDP for the 470 at 220W and take it for granted? Now I'd love to read a reasonable argument for what exactly the 480 has that justifies a 75W difference in TDP.

Voltage and slightly higher clocks? Couple that with an additional 64 CCs and there you go :). Still, I'm not saying it's true, just that it's not that hard to explain.
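Purely as an illustration of that reasoning, a back-of-envelope in C with hypothetical clock and voltage deltas (placeholders, not known GTX 470/480 figures) lands in the right ballpark:

[CODE]
/* Illustrative only: dynamic power scales roughly with
 * active units * frequency * voltage^2.  The clock/voltage ratios below
 * are hypothetical, chosen just to show that ~75W is plausible. */
#include <stdio.h>

int main(void)
{
    const double p470       = 220.0;          /* assumed GTX 470 TDP, W       */
    const double unit_ratio = 480.0 / 448.0;  /* 64 extra CUDA cores          */
    const double clk_ratio  = 1.08;           /* hypothetical +8% clock       */
    const double v_ratio    = 1.08;           /* hypothetical +8% core voltage*/

    double p480 = p470 * unit_ratio * clk_ratio * v_ratio * v_ratio;
    printf("Estimated GTX 480 TDP: %.0f W (delta %.0f W)\n", p480, p480 - p470);
    return 0;
}
[/CODE]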
 
Oh yeah, brainfart :oops:
Seems like a good idea for mainstream parts.

Which *is* overkill for graphics. AMD's approach seems more area-efficient in the sense of $/mm², i.e. the balance between frequency-of-use and area. Is there a use case for INT32 that I have missed?

With this in mind, I have to say, GF100 looks dp-heavy (by an appreciable amount) for a gpu.

If DP is very important for the markets NVIDIA is trying to reach (i.e. even more than with GT200), how is Fermi DP-heavy?

Also, look at this:

http://techreport.com/articles.x/18332/5

Scott said:
I should pause to explain the asterisk next to the unexpectedly low estimate for the GF100's double-precision performance. By all rights, in this architecture, double-precision math should happen at half the speed of single-precision, clean and simple. However, Nvidia has made the decision to limit DP performance in the GeForce versions of the GF100 to 64 FMA ops per clock—one fourth of what the chip can do. This is presumably a product positioning decision intended to encourage serious compute customers to purchase a Tesla version of the GPU instead. Double-precision support doesn't appear to be of any use for real-time graphics, and I doubt many serious GPU-computing customers will want the peak DP rates without the ECC memory that the Tesla cards will provide. But a few poor hackers in Eastern Europe are going to be seriously bummed, and this does mean the Radeon HD 5870 will be substantially faster than any GeForce card at double-precision math, at least in terms of peak rates.
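Putting the numbers from that paragraph into a quick C sketch (the 1.4 GHz shader clock is an assumption for illustration, not a confirmed GF100 spec):

[CODE]
/* Arithmetic on the figures in the quote: 64 DP FMA/clock on GeForce
 * versus 256 DP FMA/clock uncapped.  Shader clock is assumed. */
#include <stdio.h>

int main(void)
{
    const double shader_clk = 1.4e9;   /* hypothetical hot clock, Hz */

    double geforce = 64.0  * 2 * shader_clk;  /* FMA counted as 2 flops */
    double full    = 256.0 * 2 * shader_clk;

    /* Compare with the ~544 GFLOPS DP peak commonly quoted for the HD 5870. */
    printf("GeForce-capped DP : %.0f GFLOPS\n", geforce / 1e9);  /* ~179 */
    printf("Uncapped DP       : %.0f GFLOPS\n", full    / 1e9);  /* ~717 */
    return 0;
}
[/CODE]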
 
Well, I would only consider what the link says important if the DP hardware were physically removed from the chip. If the limitation is only in software/BIOS, the hardware is still there occupying die space.

What the link says is that DP in Fermi-based GeForces will be crippled so that the Fermi-based Tesla products are more appealing. I'm not sure, at this point, that it's just something that can be done through software. We will see.

Also, my question was related to the DP-heavy remark, which doesn't add up for me, since it's in NVIDIA's interest to have as much DP as possible, given the markets they want to push into even more.
 
What the link says is that DP in Fermi-based GeForces will be crippled so that the Fermi-based Tesla products are more appealing. I'm not sure, at this point, that it's just something that can be done through software. We will see.

Also, my question was related to the DP-heavy remark, which doesn't add up for me, since it's in NVIDIA's interest to have as much DP as possible, given the markets they want to push into even more.

Well, they used to cripple Quadro capabilities on GeForces through BIOS and software. I remember soft-modding my GeForce 4 Ti 4200 :p
 
If DP is very important for the markets NVIDIA is trying to reach (i.e. even more than with GT200), how is Fermi DP-heavy?

Based upon the scenario so far,

It makes zero financial sense to focus on Tesla to the detriment of graphics.

Fermi needs to win in graphics. Period.

Based upon my guesses regarding future evolution.

I think Fermi is NV's bet that GF100 is priced too high to be of much use to gamers. It'll be able to recoup money from the HPC market even if graphics performance is relatively less than it could be. The mainstream GPU market has all the profits, so as long as the "pure" graphics side is efficient enough, high-end bloat doesn't matter. So Fermi's possible loss of the halo isn't such a big deal. Now that feature-parity with CPUs has been (well, almost..) achieved, they can just keep pumping up the core count in future generations.

It is risky if the top dog is delayed, since that is where all the graphics-side innovation happens.
 
Ever since G80 NVidia's been doing overkill: monstrous counts and/or sizes of units - it just took a while to realise it.
Yes, but they haven't spent transistors that are useless to graphics on this scale before. G80/GT200 were monsters, but had very little area devoted to stuff graphics doesn't need.
I'm intrigued at the thought that if GF100Ax has half its TMUs turned off, then B1 might have them deleted, leading to a smaller, less power-hungry, die.
I can't see why you'd disable half your TMUs for the top dog. This rumour doesn't make any kind of sense from any angle.
 