NVIDIA GF100 & Friends speculation

Sxotty · Feb 22, 2010

digitalwanderer said:
Silus said:

On another note, they'll be having a custom built PC raffle in honor of ChrisRay. All winnings go to his mother. Nicely done!

Way cool! Anyone going? I gotta mail ya some cash to buy me a ticket...

Likewise. Is it pax east?

Sontin · Feb 22, 2010

Sxotty said:
Likewise. Is it pax east?

yes: http://www.geforcelan.com/

And Rollo said it will be a hard launch of the cards.

Groo The Wanderer · Feb 22, 2010

seahawk said:
Thx for the reply. Helps me a lot to understand how such things are actually done.

Thx for the info. You are always a first rate source for such info. If NV has no mainstream and entry level DX11 chip till the end of 2010, they are broke.

As an update, I am hearing, totally unconfirmed so far, that there may be one other tapeout either done or pending. I am far from 100% on this one though.

-Charlie

Groo The Wanderer · Feb 22, 2010

neliz said:
And yet they never seem to be, just because of the "Geforce" brand.

I think that may be needing a name change soon.

http://www.facebook.com/posted.php?id=8409118252&share_id=319744854245&comments=1#s319744854245

-Charlie

Ethatron · Feb 22, 2010

Jawed said:
I'm not sure that that will work as the multiplier will always try to create a normalised result - there's "magic" for the 24th bit which is implicit in the final result. To do uint24 arithmetic it needs to be tweaked a bit, I think, which is why I say "free".

I always thought FP-logic has suffix-bits to accommodate for rounding errors, so I thought in this case 24 bit is not 23+1 implicit but a real 24bit mantissa + 1 implicit bit.
Otherwise none of the rounding modes in IEEE would make any sense as they would all be identical if underflow bits can't surface.

I think a reference is x87 FP32 treatment in single precision mode. It uses more bits while calculating and chopping products. AFAIK

Groo The Wanderer · Feb 22, 2010

rpg.314 said:
The TSMC reticle size is about ~580 mm2, so you can't make bigger chips even if you wanted to. BTW, is there a relation between reticle size and device sizes?

Where did you get that number from? Got a link?

-Charlie

Groo The Wanderer · Feb 22, 2010

hatter said:
Funny piece from Hilbert.

On occasions, Hilbert too can't stop himself from repeating what Charlie is saying, only in different words. Yet, he sings the paeans of Fermi.

I never knew Guru3D is competing to outsmart Semi-Accurate. It's true that Charlie has more misses than hits, but what is the point of coming out with this piece on a website like Guru3D, hitting at Mr C, taking an argument that could very well be (poorly) drafted by Nvidia's PR team.

You kept your mouth shut for so long and you are not going to share any concrete info, so why not keep mum for a few more weeks. Unless, you were nudged by Nvidia.

and this one is gem...

It probably was done by NV's PR team. They tend to shop stories like that around to various sites, starting out at the high end, and moving down. They use it to point to as an 'independent source' to 'collaborate' their view.

Normally, the sites run by people with a brain know better than to touch those pieces, and so things get shopped to progressively more sketchy sites until someone bites. ATI used to do this back in the day, but I haven't seen it in a while. AMD and Intel never did that I am aware of, but that isn't definitive.

-Charlie

Groo The Wanderer · Feb 22, 2010

MfA said:
That bit about "tier 1 website" reminds me of the tier X publication SNAFU with AMD a couple of years back (i.e. where AMD said they never required vetting of articles before release from tier 1 publications, suggesting they did of others). I always wonder how explicit these deals are in the wonderful world of the web: is it just like with Charlie (simple threats not to bite the hand that feeds you or you don't get to be at the trough) or do they actually require editorial rights by contract for the smaller sites if they participate in media events?

I have never seen anyone ask for editing rights to an article. Some have asked nicely, and a few I have offered it to, but those were deep architecture articles where some of the bits were nuanced and complex. It was more of a fact check than editorializing, and the articles were like this:

http://www.semiaccurate.com/2009/10/29/look-100-core-tilera-gx/

For simple pieces or reviews, never seen it, never been asked either. I am pretty sure that any PR person knows better than to ask that. If a site does let PR run roughshod over their articles, it is open season on that site, and they become co-opted very fast, and die off quick.

I have heard rumors of some Taiwanese vendors asking about such, but nothing concrete.

Also, this is very different from getting a letter after something goes up saying, "You got that wrong, and here is why. Can you correct it?".

-Charlie

Jawed · Feb 23, 2010

Ethatron said:
I always thought FP-logic has suffix-bits to accommodate for rounding errors, so I thought in this case 24 bit is not 23+1 implicit but a real 24bit mantissa + 1 implicit bit.

GPU fp32 has historically not been particularly accurate, but it has been gradually improving. This has been tightened up with FFMA in CUDA 2.0 devices (Fermi onwards), which holds on to the full result from the MUL so that there is only one rounding, after the addition. It's all IEEE-754 compliant precision in 2.0.
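To make the single-rounding point concrete, here is a minimal Python sketch that emulates fp32 on the CPU. It is not CUDA code and says nothing about how the hardware is built; the helper names (`to_f32`, `mul_then_add`, `fused_mul_add`) are made up for illustration, and exact arithmetic is done with `fractions.Fraction` before rounding.

```python
import struct
from fractions import Fraction

def to_f32(x):
    # Round to the nearest fp32 value by packing as a 4-byte IEEE float.
    # Note: x passes through a Python double first; that is harmless for
    # the values used below, which are exactly representable as doubles.
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]

def mul_then_add(a, b, c):
    # Two roundings: the product is rounded to fp32, then the sum is rounded.
    product = to_f32(Fraction(a) * Fraction(b))
    return to_f32(Fraction(product) + Fraction(c))

def fused_mul_add(a, b, c):
    # One rounding: the exact product feeds the add, only the sum is rounded.
    return to_f32(Fraction(a) * Fraction(b) + Fraction(c))

a = to_f32(1 + 2**-12)        # a*a = 1 + 2^-11 + 2^-24, one bit too long for fp32
c = to_f32(-(1 + 2**-11))

separate = mul_then_add(a, a, c)   # the 2^-24 term is rounded away before the add
fused = fused_mul_add(a, a, c)     # the 2^-24 term survives

print(separate, fused)
```

With these inputs the separate multiply-then-add loses the low bit of the product and returns zero, while the fused version keeps it.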

The point I was raising was that an fp32 MUL will normalise its output. e.g. if you multiply two integers that are encoded as subnormals, the ALU will return the most significant digits and normalise the result (if possible).

I've just checked it (sigh, should have done that earlier), and CUDA 1.x's mul24 returns the low 32 bits of the result. Emulation in 2.0 devices should be nothing more than a bit of masking before doing the multiplication.

So, regardless of my normalisation point, this technique can't work on Fermi to perform the old function.
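A rough Python sketch of the two behaviours being compared, under stated assumptions: `umul24` below is a hypothetical stand-in for the unsigned 24-bit multiply (mask each operand to 24 bits, keep the low 32 bits of the product), and `fp32_product_as_int` is an illustrative helper that rounds the same product to fp32 the way a floating point multiplier would.

```python
import struct

MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1

def umul24(x, y):
    # Ignore the high 8 bits of each 32-bit operand, multiply the 24-bit
    # values as integers, and keep the low 32 bits of the 48-bit product.
    return ((x & MASK24) * (y & MASK24)) & MASK32

def fp32_product_as_int(x, y):
    # Multiply the same 24-bit values, but round the product to fp32,
    # which keeps only the 24 most significant bits of the result.
    exact = (x & MASK24) * (y & MASK24)      # below 2^53, exact as a double
    rounded = struct.unpack("<f", struct.pack("<f", float(exact)))[0]
    return int(rounded)

x = y = MASK24
low_bits = umul24(x, y)                          # low 32 bits of the exact product
via_fp32 = fp32_product_as_int(x, y) & MASK32    # low bits after fp32 rounding

print(hex(low_bits), hex(via_fp32))
```

The integer path returns the low bits exactly, whereas the fp32 path has already discarded them by the time the result is normalised, which is the mismatch described above.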

---

Oh and I've just noticed that floating point exceptions are quiet in CUDA 2.0.

Jawed

Jawed · Feb 23, 2010

aaronspink said:
If it is indeed half rate, then they are likely bridging two SP mantissa multipliers to get the required functionality.

But that's not enough bits for DP. Whereas fp32 and int32 bridged does the job. The latency is the same as an fp32 MUL, but the ALUs effectively become 8-lane instead of 16 (G.4.1 in CUDA Programming Guide 3.0).

Jawed

Mintmaster · Feb 23, 2010

Wouldn't Fermi just use its DP arithmetic units for int32 mul?

John021 · Feb 23, 2010

Thou shall not worry, because CUDA will fix it!

ap_ · Feb 23, 2010

Mintmaster said:
Wouldn't Fermi just use its DP arithmetic units for int32 mul?

I think so too. Also explains why DP and int32 rates are similar in Jawed's reference.

Mize · Feb 23, 2010

John021 said:
Thou shall not worry, because CUDA will fix it!

Genius.

CouldntResist · Feb 23, 2010

John021 said:
Thou shall not worry, because CUDA will fix it!

Meh. Hitler rants are only good when he sounds like a disappointed supporter. The message of the meme is supposed to be "I was a loyal fan, and they shafted me", not "LOL you failed, losers".

eastmen · Feb 23, 2010

John021 said:
Thou shall not worry, because CUDA will fix it!

lol farmville is all i need !

That was good.

aaronspink · Feb 23, 2010

Mintmaster said:
Wouldn't Fermi just use its DP arithmetic units for int32 mul?

Nothing I've seen suggests that Fermi has DP units per se. Everything available so far points to a bridge-and-extend functionality integrated into the SP units themselves. Basically, each SP unit contains a ~24x53b multiplier and two of these are bridged together to generate the DP mantissa. In contrast, it appears that RV870 bridges 4 ~24x24 multipliers from 4 SP units for its DP math.

At 1/4 or 1/2 rate FP there is little point in having separate units from an area perspective.

aaronspink · Feb 23, 2010

Jawed said:
But that's not enough bits for DP. Whereas fp32 and int32 bridged does the job. The latency is the same as an fp32 MUL, but the ALUs effectively become 8-lane instead of 16 (G.4.1 in CUDA Programming Guide 3.0).

Jawed

My contention was they've already extended the SP multiplier such that 2 SP multipliers can be bridged to handle a 54b mantissa. The other option is to have a separate multiplier but that doesn't explain the int32 multiplier performance.

rpg.314 · Feb 23, 2010

aaronspink said:
In contrast, it appears that RV870 bridges 4 ~24x24 multipliers from 4 SP units for its DP math.

/nitpick on

Shouldn't it be 27bx27b multipliers?
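The arithmetic behind the nitpick can be sketched in Python. This is only an illustration of building a wide product from four narrow partial products; it is not a claim about how RV870 or Fermi actually wires its multipliers, and `split` and `bridged_multiply` are hypothetical names.

```python
def split(m, width):
    # Split an integer into (high, low), with `width` bits in the low part.
    return m >> width, m & ((1 << width) - 1)

def bridged_multiply(a, b, width=27):
    # Build a wide product from four width x width partial products.
    a_hi, a_lo = split(a, width)
    b_hi, b_lo = split(b, width)
    if max(a_hi, b_hi) >= (1 << width):
        raise ValueError("operand does not fit in two %d-bit halves" % width)
    return (((a_hi * b_hi) << (2 * width))
            + ((a_hi * b_lo + a_lo * b_hi) << width)
            + a_lo * b_lo)

# Two 53-bit DP mantissas (implicit leading 1 included).
a = (1 << 53) - 1
b = (1 << 52) + 12345

product = bridged_multiply(a, b, width=27)   # 2 x 27 = 54 bits, enough for 53

try:
    bridged_multiply(a, b, width=24)         # 2 x 24 = 48 bits, too narrow
    fits_in_24 = True
except ValueError:
    fits_in_24 = False

print(product == a * b, fits_in_24)
```

Two 24-bit halves cover only 48 bits, so a 53-bit mantissa needs halves of at least 27 bits for the four-way split to work.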

Mindfury · Feb 23, 2010

Rumor from Chinese sites:
GTX470 3Dmark Vantage Performance=167xx, Extreme=73xx

http://www.enet.com.cn/article/2010/0221/A20100221612409.shtml
