NVIDIA GF100 & Friends speculation

Thx for the reply. Helps me a lot to understand how such things are actually done.



Thx for the info. You are always a first rate source for such info. If NV has no mainstream and entry level DX11 chip till the end of 2010, they are broke.

As an update, I am hearing, totally unconfirmed so far, that there may be one other tapeout either done or pending. I am far from 100% on this one though.

-Charlie
 
I'm not sure that that will work as the multiplier will always try to create a normalised result - there's "magic" for the 24th bit which is implicit in the final result. To do uint24 arithmetic it needs to be tweaked a bit, I think, which is why I say "free".

I always thought FP logic has extra suffix bits to accommodate rounding errors, so I thought in this case 24 bit is not 23+1 implicit but a real 24-bit mantissa + 1 implicit bit.
Otherwise none of the rounding modes in IEEE would make any sense, as they would all be identical if underflow bits can't surface.

I think a reference is x87 FP32 treatment in single precision mode. It uses more bits while calculating and chopping products. AFAIK
 
Funny piece from Hilbert.

On occasions, Hilbert too can't stop himself from repeating what Charlie is saying, only in different words. Yet, he sings the paeans of Fermi.

I never knew Guru3D was competing to outsmart Semi-Accurate. It's true that Charlie has more misses than hits, but what is the point of coming out with this piece on a website like Guru3D, hitting at Mr C, with an argument that could very well have been (poorly) drafted by Nvidia's PR team?

You kept your mouth shut for so long and you are not going to share any concrete info, so why not keep mum for a few more weeks. Unless, you were nudged by Nvidia.

and this one is a gem...

:oops:

It probably was done by NV's PR team. They tend to shop stories like that around to various sites, starting out at the high end, and moving down. They use it to point to as an 'independent source' to 'corroborate' their view.

Normally, the sites run by people with a brain know better than to touch those pieces, and so things get shopped to progressively more sketchy sites until someone bites. ATI used to do this back in the day, I haven't seen it in a while. AMD and Intel never did that I am aware of, but that isn't definitive.

-Charlie
 
That bit about "tier 1 website" reminds me of the tier X publication SNAFU with AMD a couple of years back (i.e. where AMD said they never required vetting of articles before release from tier 1 publications, suggesting they did of others). I always wonder how explicit these deals are in the wonderful world of the web. Is it just like with Charlie (simple threats not to bite the hand that feeds you or you don't get to be at the trough), or do they actually require editorial rights by contract for the smaller sites if they participate in media events?

I have never seen anyone ask for editing rights to an article. Some have asked nicely, and a few I have offered it to, but those were deep architecture articles where some of the bits were nuanced and complex. It was more of a fact check than editorializing, and the articles were like this:

http://www.semiaccurate.com/2009/10/29/look-100-core-tilera-gx/

For simple pieces or reviews, never seen it, never been asked either. I am pretty sure that any PR person knows better than to ask that. If a site does let PR run roughshod over their articles, it is open season on that site, and they become coopted very fast, and die off quick.

I have heard rumors of some Taiwanese vendors asking about such, but nothing concrete.

Also, this is very different from getting a letter after something goes up saying, "You got that wrong, and here is why. Can you correct it?".

-Charlie
 
I always thought FP logic has extra suffix bits to accommodate rounding errors, so I thought in this case 24 bit is not 23+1 implicit but a real 24-bit mantissa + 1 implicit bit.
GPU fp32 has historically not been particularly accurate, but it has been gradually improving. This has been tightened up with FFMA in CUDA 2.0 devices (Fermi onwards), which holds on to the full result from the MUL, so there is only one rounding, after the addition. Precision is fully IEEE-754 compliant in 2.0.
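To make the "only one rounding" point concrete, here is a rough C sketch, not Fermi's actual datapath. The helper names `mul_then_add` and `fused_like` are made up for illustration. The fused path is only emulated by doing the multiply in double, where the product of two fp32 values is exact, so it is not a bit-exact FMA for every possible input.

```c
/* Multiply, round to fp32, then add and round again: two roundings. */
static float mul_then_add(float a, float b, float c) {
    volatile float p = a * b;  /* volatile forces the product to be rounded to fp32 */
    return p + c;
}

/* Stand-in for a fused multiply-add (assumption: emulation via double).
   The product of two fp32 values has at most 48 significant bits, so it is
   exact in a double; rounding to fp32 happens only once, at the end. */
static float fused_like(float a, float b, float c) {
    double exact_product = (double)a * (double)b;
    return (float)(exact_product + (double)c);
}
```

With a = 1 + 2^-12 and c = -(1 + 2^-11), the exact product a*a is 1 + 2^-11 + 2^-24. Rounding it to fp32 first drops the 2^-24 term, so the two-step version returns 0, while the single-rounding version keeps it and returns 2^-24.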

The point I was raising was that an fp32 MUL will normalise its output. e.g. if you multiply two integers that are encoded as subnormals, the ALU will return the most significant digits and normalise the result (if possible).

I've just checked it (sigh, should have done that earlier), and CUDA 1.x's mul24 returns the low 32 bits of the result. Emulation in 2.0 devices should be nothing more than a bit of masking before doing the multiplication.

So, regardless of my normalisation point, this technique can't work on Fermi to perform the old function.
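A small C sketch of why the two don't line up, assuming the unsigned case only. `umul24_emulated` and `fp32_product_as_integer` are hypothetical names for illustration, not CUDA intrinsics. The first masks both operands to 24 bits and lets an ordinary 32-bit multiply keep the low 32 bits; the second pushes the same operands through an fp32 multiply, which keeps the top 24 significant bits of the product instead.

```c
#include <stdint.h>

/* Hypothetical helper: unsigned 24-bit multiply that keeps the low 32 bits
   of the product, as described above for CUDA 1.x's mul24. */
static uint32_t umul24_emulated(uint32_t a, uint32_t b) {
    uint32_t a24 = a & 0x00FFFFFFu;  /* mask each operand to 24 bits */
    uint32_t b24 = b & 0x00FFFFFFu;
    return a24 * b24;                /* uint32_t arithmetic wraps modulo 2^32 */
}

/* Same operands through an fp32 multiply: the result is rounded to 24
   significant bits, so the top of the 48-bit product survives and the
   low bits are lost. */
static uint64_t fp32_product_as_integer(uint32_t a, uint32_t b) {
    volatile float p = (float)(a & 0x00FFFFFFu) * (float)(b & 0x00FFFFFFu);
    return (uint64_t)p;
}
```

For 0xFFFFFF * 0xFFFFFF the exact product is 2^48 - 2^25 + 1. The masked integer multiply returns its low 32 bits, 0xFE000001, while the fp32 path rounds to 2^48 - 2^25 and loses the trailing 1.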

---

Oh and I've just noticed that floating point exceptions are quiet in CUDA 2.0.

Jawed
 
If it is indeed half rate, then they are likely bridging two SP mantissa multipliers to get the required functionality.
But that's not enough bits for DP. Whereas fp32 and int32 bridged does the job. The latency is the same as an fp32 MUL, but the ALUs effectively become 8-lane instead of 16 (G.4.1 in CUDA Programming Guide 3.0).

Jawed
 
Wouldn't Fermi just use its DP arithmetic units for int32 mul?

Nothing I've seen suggests that Fermi has DP units per se. Everything available so far points to a bridge-and-extend functionality integrated into the SP units themselves. Basically, each SP unit contains a ~24x53b multiplier and two of these are bridged together to generate the DP mantissa. In contrast, it appears that RV870 bridges 4 ~24x24 multipliers from 4 SP units for its DP math.

At 1/4 or 1/2 rate FP there is little point in having separate units from an area perspective.
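The general idea of bridging can be sketched in C with scaled-down widths. This is purely illustrative: the 16x32 slice size and the names `narrow_mul` and `bridged_mul` are assumptions for the example, and nothing here is known about how Fermi or RV870 actually wire their multipliers. Each narrow multiplier takes one slice of the first operand against the full second operand, and the partial products are shifted and added.

```c
#include <stdint.h>

/* One "narrow" multiplier: a 16-bit slice of one operand times the full
   32-bit other operand (widths chosen only for illustration). */
static uint64_t narrow_mul(uint16_t slice, uint32_t full) {
    return (uint64_t)slice * (uint64_t)full;
}

/* Two narrow multipliers bridged into one 32x32 -> 64-bit multiply. */
static uint64_t bridged_mul(uint32_t a, uint32_t b) {
    uint64_t low_part  = narrow_mul((uint16_t)(a & 0xFFFFu), b);  /* low slice of a  */
    uint64_t high_part = narrow_mul((uint16_t)(a >> 16), b);      /* high slice of a */
    return (high_part << 16) + low_part;  /* align the high partial product and sum */
}
```

Two units are tied up per wide multiply, which is the same trade as the half-rate figure discussed above.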
 
But that's not enough bits for DP. Whereas fp32 and int32 bridged does the job. The latency is the same as an fp32 MUL, but the ALUs effectively become 8-lane instead of 16 (G.4.1 in CUDA Programming Guide 3.0).

Jawed

My contention was they've already extended the SP multiplier such that 2 SP multipliers can be bridged to handle a 54b mantissa. The other option is to have a separate multiplier but that doesn't explain the int32 multiplier performance.
 