NVIDIA GF100 & Friends speculation

A war on two fronts, a bit like Germany in WWII, it was bound to end in tears.

Well, it's not as though NVIDIA is focused on handheld markets with Tegra and on Fermi's compute-specific features because they have no competition at all and can live off profits from the graphics market alone...
 
Hmm, not really two fronts; it's more one front with two focal points, since both major changes in Fermi ultimately have a similar goal: more performance for all applications that run on the GPU.
 
Quarter-rate DP seems fine to me. Overall, the VLIW approach makes a *lot* of sense. It makes DP FP in Evergreen about as cheap as it can be, while still having higher perf/mm² than NV. Rounding modes/denormals might change this, though.

[QUOTE="Evergreen ISA" ]SET_MODE
Overrides the rounding and denorm modes until the end of clause.

Round modes are:
• Round to nearest
• Round toward 0 (truncate)
• Round toward +infinity
• Round toward -infinity

Denormal handling:
• single_denorm_flush_input (on/off)
• single_denorm_force_underflow_to_zero (on/off)
• double_denorm_flush_input (on/off)
• double_denorm_force_underflow_to_zero (on/off)[/QUOTE]
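For reference, here is a minimal host-side C sketch of the same four IEEE-754 rounding modes listed above, using <fenv.h> on a CPU rather than Evergreen itself, just to show how the choice of mode changes the last bit of an inexact result:

[CODE]
/* Host-side illustration of the four rounding modes named in the Evergreen
 * SET_MODE description.  This is plain C99 <fenv.h>, not GPU code. */
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    const int modes[4]   = { FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD, FE_DOWNWARD };
    const char *names[4] = { "round to nearest", "round toward 0",
                             "round toward +inf", "round toward -inf" };

    /* 1/3 is inexact in binary, so the last bit of the quotient depends
     * on the active rounding mode.  volatile keeps the division at runtime. */
    volatile float num = 1.0f, den = 3.0f;

    for (int i = 0; i < 4; ++i) {
        fesetround(modes[i]);
        float q = num / den;
        printf("%-20s -> %.10f\n", names[i], q);
    }
    return 0;
}
[/CODE]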
If NV had no INT pipeline, their DP throughput would be 1/8 of SP from just the FP pipes. :oops: No wonder they had to add the INT path just to reach respectable DP performance. Maybe Fermi3 will make all four pipes identical (both SP FP and INT32) and quad-issue warps every clock. Otherwise, the INT32 pipe seems rather unnecessary.
NVidia could have 1/4 rate DP using the same trick ATI has. And put MUL32 in the special function ALU.

But it seems NVidia wants 1/2 rate DP + full speed MUL32 (low or hi, half speed full result).

Jawed
 
I said may:

Since you are bent on buying the fastest single GPU, it may be that a 2GB 5870 Eyefinity/OC is faster overall, and lower power to boot.

So about your statement:

There is a question mark at the end, since it certainly looks like you may provide us with an answer. So I'm waiting.

So what exactly are you waiting for? Perhaps you should work a little on your reading comprehension.
 
Isn't it 8192x8192x4 bytes? I think the current 256MB-per-buffer limitation in ATI's OpenCL comes from there.
I suspect 16384² is the source of this limitation, because of byte addressing: the hardware can't work with larger textures. But everything's so vague...
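A quick back-of-envelope in C (my own arithmetic, just to check the figures quoted above) shows the two expressions land on the same 256 MiB number:

[CODE]
/* 8192x8192 texels of 4 bytes each versus a 16384^2-byte linear range:
 * both come out to the same 256 MiB per-buffer figure. */
#include <stdio.h>

int main(void)
{
    unsigned long long texels   = 8192ULL * 8192ULL * 4ULL;   /* 8192^2 * 4 B */
    unsigned long long byteaddr = 16384ULL * 16384ULL;        /* 16384^2 B    */

    printf("8192*8192*4 bytes = %llu (%llu MiB)\n", texels,   texels   >> 20);
    printf("16384^2 bytes     = %llu (%llu MiB)\n", byteaddr, byteaddr >> 20);
    return 0;
}
[/CODE]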

Jawed
 
NVidia could have 1/4 rate DP using the same trick ATI has.
Oh yeah, brainfart :oops:
And put MUL32 in the special function ALU.
Seems like a good idea for mainstream parts.

But it seems NVidia wants 1/2 rate DP + full speed MUL32 (low or hi, half speed full result).
Which *is* overkill for graphics. AMD's approach seems more area-efficient in the sense of $/mm², i.e. the balance between frequency-of-use and area. Is there a use case for INT32 that I have missed?

With this in mind, I have to say, GF100 looks dp-heavy (by an appreciable amount) for a gpu.
 
Wow, that is such a clever use of the VLIW architecture!
Is this the case in Milkyway@Home too? I know it has amazing performance on ATI hardware as well...
MW@Home is double-precision, which is 1/4 performance for MUL and 1/2 performance for ADD. Also the programmer had to hand-code around the stupid compiler's inability to properly co-issue a pair of DP ADDs per clock. Not sure if the compiler has improved in this respect recently.
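To put rough numbers on that, here is a back-of-envelope sketch in C, assuming the commonly quoted HD 5870 configuration (320 VLIW-5 units at 850 MHz) and reading "1/2 ADD, 1/4 MUL" as 2 DP ADDs / 1 DP MUL per VLIW unit per clock; the figures are my own arithmetic, not measured numbers:

[CODE]
/* Rough peak-rate arithmetic under the DP rates described above.
 * All hardware figures are assumed HD 5870 values used for illustration. */
#include <stdio.h>

int main(void)
{
    const double units = 320.0;    /* VLIW-5 units (assumed)   */
    const double clock = 0.85e9;   /* engine clock in Hz       */

    double sp_mad = units * 5 * 2 * clock;   /* SP MAD, 2 flops per slot */
    double dp_add = units * 2 * clock;       /* 2 DP ADDs per unit/clock */
    double dp_mul = units * 1 * clock;       /* 1 DP MUL per unit/clock  */

    printf("SP MAD peak : %.0f GFLOPS\n", sp_mad / 1e9);  /* ~2720 */
    printf("DP ADD peak : %.0f Gops/s\n", dp_add / 1e9);  /* ~544  */
    printf("DP MUL peak : %.0f Gops/s\n", dp_mul / 1e9);  /* ~272  */
    return 0;
}
[/CODE]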

Jawed
 
Which *is* overkill for graphics.
Ever since G80 NVidia's been doing overkill: monstrous counts and/or sizes of units - it just took a while to realise it.

I'm intrigued by the thought that if GF100 Ax has half its TMUs turned off, then B1 might have them deleted, leading to a smaller, less power-hungry die.

It also raises the question of what GTX480 performance would have been like with 128 TMUs... Sure, it's too early to really tell what GTX480's performance is, but eventually we can have some fun with that idea.

Jawed
 
Can I put the TDP for the 470 at 220W and take it for granted? Now I'd love to read a reasonable argument for what exactly the 480 has that justifies a 75W difference in TDP.

Voltage and slightly higher clocks? Couple that with an additional 64 CCs and there you go :). Still, I'm not saying it's true, just that it's not that hard to explain.
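Purely as an illustration of that reasoning, a back-of-envelope in C with hypothetical clock and voltage deltas (placeholders, not known GTX 470/480 figures) lands in the right ballpark:

[CODE]
/* Illustrative only: dynamic power scales roughly with
 * active units * frequency * voltage^2.  The clock/voltage ratios below
 * are hypothetical, chosen just to show that ~75W is plausible. */
#include <stdio.h>

int main(void)
{
    const double p470       = 220.0;          /* assumed GTX 470 TDP, W       */
    const double unit_ratio = 480.0 / 448.0;  /* 64 extra CUDA cores          */
    const double clk_ratio  = 1.08;           /* hypothetical +8% clock       */
    const double v_ratio    = 1.08;           /* hypothetical +8% core voltage*/

    double p480 = p470 * unit_ratio * clk_ratio * v_ratio * v_ratio;
    printf("Estimated GTX 480 TDP: %.0f W (delta %.0f W)\n", p480, p480 - p470);
    return 0;
}
[/CODE]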
 
Oh yeah, brainfart :oops:
Seems like a good idea for mainstream parts.

Which *is* overkill for graphics. AMD's approach seems more area-efficient in the sense of $/mm², i.e. the balance between frequency-of-use and area. Is there a use case for INT32 that I have missed?

With this in mind, I have to say, GF100 looks dp-heavy (by an appreciable amount) for a gpu.

If DP is very important for the markets NVIDIA is trying to reach (i.e. even more than with GT200), how is Fermi DP-heavy?

Also, look at this:

http://techreport.com/articles.x/18332/5

Scott said:
I should pause to explain the asterisk next to the unexpectedly low estimate for the GF100's double-precision performance. By all rights, in this architecture, double-precision math should happen at half the speed of single-precision, clean and simple. However, Nvidia has made the decision to limit DP performance in the GeForce versions of the GF100 to 64 FMA ops per clock—one fourth of what the chip can do. This is presumably a product positioning decision intended to encourage serious compute customers to purchase a Tesla version of the GPU instead. Double-precision support doesn't appear to be of any use for real-time graphics, and I doubt many serious GPU-computing customers will want the peak DP rates without the ECC memory that the Tesla cards will provide. But a few poor hackers in Eastern Europe are going to be seriously bummed, and this does mean the Radeon HD 5870 will be substantially faster than any GeForce card at double-precision math, at least in terms of peak rates.
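Putting the numbers from that paragraph into a quick C sketch (the 1.4 GHz shader clock is an assumption for illustration, not a confirmed GF100 spec):

[CODE]
/* Arithmetic on the figures in the quote: 64 DP FMA/clock on GeForce
 * versus 256 DP FMA/clock uncapped.  Shader clock is assumed. */
#include <stdio.h>

int main(void)
{
    const double shader_clk = 1.4e9;   /* hypothetical hot clock, Hz */

    double geforce = 64.0  * 2 * shader_clk;  /* FMA counted as 2 flops */
    double full    = 256.0 * 2 * shader_clk;

    /* Compare with the ~544 GFLOPS DP peak commonly quoted for the HD 5870. */
    printf("GeForce-capped DP : %.0f GFLOPS\n", geforce / 1e9);  /* ~179 */
    printf("Uncapped DP       : %.0f GFLOPS\n", full    / 1e9);  /* ~717 */
    return 0;
}
[/CODE]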
 
Well, I would only consider what the link says important if the DP hardware were physically removed from the chip. If the limitation is only in software/BIOS, the hardware is still there occupying die space.

What the link says is that DP in Fermi-based GeForces will be crippled so that the Fermi-based Tesla products are more appealing. I'm not sure, at this point, that it's just something that can be done through software. We will see.

Also, my question was related to the DP-heavy remark, which doesn't add up for me, since it's in NVIDIA's interest to have as much DP as possible, given the markets they want to push into even more.
 
What the link says is that DP in Fermi-based GeForces will be crippled so that the Fermi-based Tesla products are more appealing. I'm not sure, at this point, that it's just something that can be done through software. We will see.

Also, my question was related to the DP-heavy remark, which doesn't add up for me, since it's in NVIDIA's interest to have as much DP as possible, given the markets they want to push into even more.

Well, they used to cripple Quadro capabilities on GeForces through BIOS and software. I remember soft-modding my GeForce 4 Ti 4200 :p
 
If DP is very important for the markets NVIDIA is trying to reach (i.e. even more than with GT200), how is Fermi DP-heavy?

Based upon the scenario so far,

It makes zero financial sense to focus on Tesla to the detriment of graphics.

Fermi needs to win in graphics. Period.

Based upon my guesses regarding future evolution.

I think Fermi is NV's bet that GF100 is priced too high to be of much use to gamers. It'll be able to recoup money from the HPC market even if graphics performance is relatively less than it could be. The mainstream GPU market has all the profits, so as long as the "pure" graphics side is efficient enough, high-end bloat doesn't matter. So Fermi's possible loss of the halo isn't such a big deal. Now that feature-parity with CPUs has been (well, almost..) achieved, they can just keep pumping up the core count in future generations.

It is risky if the top dog is delayed, since that is where all the graphics-side innovation happens.
 
Ever since G80 NVidia's been doing overkill: monstrous counts and/or sizes of units - it just took a while to realise it.
Yes, but they haven't spent transistors that are useless to graphics on this scale before. G80/GT200 were monsters, but had very little area devoted to stuff graphics doesn't need.
I'm intrigued at the thought that if GF100Ax has half its TMUs turned off, then B1 might have them deleted, leading to a smaller, less power-hungry, die.
I can't see why you'd disable half your TMUs for the top dog. This rumour doesn't make any kind of sense from any angle.
 