NVIDIA GF100 & Friends speculation

My understanding is that without denormal support, two different ways of getting to the number zero don't guarantee they will have the exact same representation, which means they will sometimes turn out "false" on a comparison.

yes, but it's effectively a distinction without a difference unless you are a theoretical mathematician.
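
For what it's worth, here's a minimal sketch of how this bites in practice (a hypothetical ftz_demo.cu of my own, not anything Fermi-specific beyond needing a part with denormal support, e.g. -arch=sm_20). Compile it once with nvcc --ftz=true and once with --ftz=false and the comparison flips, because the product only exists as a denormal:

[code]
// Whether a tiny product "is zero" depends on denormal handling.
// Build twice: nvcc -arch=sm_20 --ftz=true ftz_demo.cu  vs  --ftz=false
#include <cstdio>

__global__ void tiny_product(int *flushed)
{
    // 1e-20f * 1e-20f = 1e-40f, which is only representable as a denormal float.
    volatile float a = 1e-20f;
    volatile float b = 1e-20f;
    float p = a * b;
    // With flush-to-zero the product compares equal to 0.0f; with denormal
    // support it doesn't, so the same code can take different branches.
    *flushed = (p == 0.0f);
}

int main()
{
    int *d_flag, h_flag = -1;
    cudaMalloc(&d_flag, sizeof(int));
    tiny_product<<<1, 1>>>(d_flag);
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    printf("product flushed to zero: %s\n", h_flag ? "yes" : "no");
    cudaFree(d_flag);
    return 0;
}
[/code]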
 
Let SuperHuang end this discussion:


http://seekingalpha.com/article/189...all-transcript?source=trans_sb_popular&page=9



Quadro and Tesla - it's all software. You pay for the software and the support.

IIRC, didn't JHH say NV wasn't a hardware company but a software one? So I guess we look forward to nV promoting its software through free hardware /wink.. maybe instead we should be looking at Quadros as simply hardware dongles that allow one to use said software.
 
And I'm betting that on Monday, they'll announce the launch date for Fermi. So this might well be an announcement of an announcement of an announcement :p

Be more meta: I think it will be an announcement of an announcement of a reveal which contains another announcement.
 
Not to nitpick but ;) .. no, Quadro cards are not the only ones with Genlock. Maybe if you are talking about Nvidia offerings only, then yes, but the way it's worded, "features like Genlock (...) that aren't available outside of quadro cards?", is factually incorrect. Even "lowly" S3 has Genlock, and if memory serves me well, their (VIA's) Apollo 133 chipset had Genlock capability (as to whether or not it was enabled .. well ..), as does the FirePro line from ATI, and Matrox (since at least Parhelia.. DigiSuite before that) has had Genlock/Framelock for years, extending throughout their product lines, often through the use of a "daughter" card or ASM (Advanced Synchronization Module). Their QID Pro can drive up to 8 displays with Genlock using ASM.

So if you are going to make sweeping generalizations, I might suggest that you be a bit more specific, so someone doesn't read something the way it was written rather than the way you intended.

With regard to the 2nd bold, well, that "a niche where they excel in" (Quadro) is ironically also what has kept nV's numbers up so well over the last couple of quarters too.. so don't go knocking it unless you want to apply it to all parties involved.


I should have stated outside of professional cards ;) but since we were talking about quadros......
 
If you look at Fermi, there are a number of improvements that are more focused on programmability than graphics performance - these do increase die area and power.

The coherency and weakly consistent caches with a unified memory address space are one example. That definitely adds quite a bit of overhead in terms of TLBs, etc.

Double precision support.

The fast denorm support is another - and FWIW, the reason to keep it around is not PR, but users. You don't want code to suddenly get 2X slower.

ECC for all on-die SRAM is also an overhead which doesn't help graphics.

Some of the synchronization tweaks may not really be all that helpful for graphics either.


Given where NV makes most of its profit (professional), it's quite reasonable for them to add all these things. But we should be clear: they will pay a price at the high end. That may not be a big issue, since the truly high end is hardly high volume. A more interesting question is how much extra they are paying for low-end and mid-range cards...

David
 
Now if we are at Quadros again, then the GF100 Quadro cards could be really good with the 4 tri/clock performance when working with large multi-million polygon assemblies.
 
Is this the same Igor that said we should expect Fermi to be sampling in November with December product availability (launch?) shortly followed by a "dual fermi" (X2) product ?? That one ?

Yep! The one that told Fudo that "AMD is afraid of Fermi", "Don't trust FUD about Fermi being delayed, it will be here very soon", etc... the Igor even Fuad doesn't believe anymore.

Edit: I like this tweet
Man, don't trust stupid FUD. Fermi will conquer market very fast. There will be no competition at all. 12:38 PM Oct 11th, 2009
 
Yep! The one that told Fudo that "AMD is afraid of Fermi", "Don't trust FUD about Fermi being delayed, it will be here very soon", etc... the Igor even Fuad doesn't believe anymore.

Edit: I like this tweet

Ahh ok, wanted to make sure I wasn't going senile or confusing him with that other Igor of the hunchback variety...

"J.H.H: IEEEGORrrrrrr !!!!"

"Igor: yes Master ??!"

"J.H.H: Stop taunting the monster .."

"Igor: yes Master."
 
Yep! The one that told Fudo that "AMD is afraid of Fermi", "Don't trust FUD about Fermi being delayed, it will be here very soon", etc... the Igor even Fuad doesn't believe anymore.

Interesting. A GPU soap opera....whodathunkit. Were those proclamations also on twitter?
 
Interesting. A GPU soap opera....whodathunkit. Were those proclamations also on twitter?

I've edited one in; some are in Czech, so I have no way (even Google doesn't help) to check those..

There are multiple tweets about how AMD is afraid of Fermi.
 
Well as I described before, it's logically possible to reserve registers/L1 memory specifically for the handler. Assuming that only one instance of the handler can be live at any time, whatever is reserved won't break the bank.
Hardware-wise it is possible to make allowances. My point is that having registers allocated to the handler while also directly accessing the context's registers would need to be made explicit in code. That is, there will be some kind of handlerRegisterAddress versus the context's regular registers.
Otherwise, a thread that has allocated the maximum number of addressable registers (128 or 256, I can't remember which for Nvidia) could not be handled by a handler coded to use general-purpose registers, or at least couldn't without first messing with the state before it has the chance to look at it.

Technically, it couldn't even use the L1, as the ISA is now load/store and there's no register left to use for an address.
On the other hand, it is programmatically straightforward to dump the lane context on an exception to L1 and have the handler check the data in memory with a fresh set of registers and whatever else data is saved by convention.

e.g. the exception flagger uses a queue of elements, queue length limited to 32 (max count of hardware threads per SIMD). 3 bits for the flag per lane seems reasonable, so 32 hardware threads x 32 lanes x 3 bits = 3072 bits, i.e. a 384 byte block of memory to track exceptions. Presuming exception handlers aren't allowed to generate exceptions, otherwise...
My interpretation from Realworldtech was that Fermi has 48 warp queue entries so any selection of those could try something, not that it's particularly significant at that range of capacity.
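
Just to spell out the packing arithmetic for that flag block, a back-of-the-envelope sketch (layout and names entirely made up, not any real hardware interface): 32 hardware threads x 32 lanes x 3 bits per lane = 3072 bits = 384 bytes, i.e. 96 32-bit words.

[code]
#include <cstdint>

// Hypothetical packed exception-flag block: 32 hw threads x 32 lanes x 3-bit
// codes = 3072 bits = 384 bytes (an array of 96 uint32_t words).
// Write the 3-bit exception code for (thread, lane). A code can straddle a
// 32-bit word boundary, so it is written one bit at a time for simplicity.
__host__ __device__ void set_flag(uint32_t *block, unsigned thread, unsigned lane, unsigned code)
{
    unsigned base = (thread * 32u + lane) * 3u;   // first bit of this lane's code
    for (unsigned b = 0; b < 3u; ++b) {
        uint32_t mask = 1u << ((base + b) % 32u);
        if (code & (1u << b)) block[(base + b) / 32u] |= mask;
        else                  block[(base + b) / 32u] &= ~mask;
    }
}
[/code]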
 
If you look at Fermi, there are a number of improvements that are more focused on programmability than graphics performance - these do increase die area and power.
Hmm, maybe I need a sig saying "compute is now part of graphics, it's not an overhead".

The coherency and weakly consistent caches with a unified memory address space are one example. That definitely adds quite a bit of overhead in terms of TLBs, etc.
They should make compute go faster, and compute is part of graphics.

Double precision support.
I estimate (very broadly) it's a few percent. What cost do you think it has?

The fast denorm support is another - and FWIW, the reason to keep it around is not PR, but users. You don't want code to suddenly get 2X slower.
Is this even 1% cost?

ECC for all on-die SRAM is also an overhead which doesn't help graphics.
Very broad estimate: <5% overhead. What do you think? The interesting thing about Fermi is there are theoretically many fewer distinct pools of memory, so the ECC stuff should be concentrated.

Some of the synchronization tweaks may not really be all that helpful for graphics either.
Can't tell what you're referring to here.

Anyone want to guesstimate these overheads?

Jawed
 
We'll have to wait and see. Since exception handling is likely just an in-context subroutine call, it's near-instantaneous - so it's easy to have a switch statement in the handler.

I was thinking more in terms of CPUs. When you, for example, configure the FPU to raise an exception on division by zero, it produces an interrupt event that calls the handler, which may do something and return a solution, or abort, or pass handling on to higher-level handlers.
So what you get is the ability to go with zero penalty and bad results, or quite some penalty and a crash requester, or a big penalty and custom handling in your application. But it's up to you whether the interrupt is produced or not.

Another example is the trigonometric ops on the 68882, which were removed in the 68040; trigonometric functions were then called via an interrupt handler instead. This was slow, much slower than linking with an appropriate libm that implemented the trigonometric functions.

I'm just wondering how much of the flexibility of traditional CPU exception handling can really be implemented in GPU hardware.

GPUs have monster register files so that they explicitly have a context that usually lives entirely in registers for the lifetime of the kernel. Shared memory/L1/L2/global memory can be used as adjuncts, but the programming model heretofore has been based on registers for performance reasons.

I can't decode your terminology.

Oh, sorry. I want to know whether the kernel scheduler freezes the occupied shaders when an exception occurs, until the problem passes, or whether the scheduler can create a, let's say, "sufficient" context, save that context and reuse the shaders which caused the exception for the next scheduled work. The "sufficient" context would be passed to the exception handling mechanism. In that case I doubt source-level debugging can work.
It's simply very hard for me to imagine that you do not need to freeze the kernel in place for it to be useful. On CPUs it's easy to restore a specific program state, on GPUs not so, right? Saving hundreds of KBs of state information?

SIMDs can only support a limited count of hardware threads at any time. The type of kernel (compute versus pixel shader, say) may determine this limit.

Yes, a 1-core CPU can run only 1 hardware thread. Still, the time-slicing or round-robin or whatever OS scheduler maps hundreds of threads onto the core. I wonder how the GPU scheduler (software) handles all that. I suppose it's significant for the future to think about how an M:N (thread:core) scheduler with high N can work optimally. The nice thread-hopping problem with Phenom I and Vista is an indicator that some people who were supposed to think about it didn't think about it.

Maybe it's just to address the "credibility gap" in the scientific community? x86 is flexible and also supports extended precision (i.e. is significantly more precise than double-precision), so these measures are a way to claw back some apparent credibility. ECC, in my view, is in the same credibility gap category.

Hm. Doesn't the sheer ALU power of GPUs easily allow them to surpass x86 EP performance by emulating, say, 128-bit floats via DP products? Making it fast as well as more precise?

I have the impression the slow migration from graphics chip to compute chip is really painful. Did neither Intel, nor nVidia, nor AMD ever learn from successful DSP chip histories?
 
Hardware-wise it is possible to make allowances. My point is that having registers allocated to the handler while also directly accessing the context's registers would need to be made explicit in code. That is, there will be some kind of handlerRegisterAddress versus the context's regular registers.
That's fine, nothing I've said so far negates that.

Otherwise, a thread that has allocated the maximum number of addressable registers (128 or 256, I can't remember which for Nvidia)
That's a hardware limit that's irrelevant since registers are virtualised. They have to be because D3D10 requires 4096 vec4s.

could not be handled by a handler coded to use general-purpose registers, or at least couldn't without first messing with the state before it has the chance to look at it.
By definition, the state that is relevant to the exception is on-die in general-purpose registers. Sure, the exception handler could decide to access a register that was computed 30,000 cycles ago in a register that's been swapped to memory for aeons. I'm not negating that. Aside from extreme corner cases the exception handling should be fast, because "context switching" is something GPUs are happy to do every few physical cycles anyway (it appears to be 2 in Fermi). They're built for precisely this.

Technically, it couldn't even use the L1, as the ISA is now load/store and there's no register left to use for an address.
If you can get that past the compiler!

On the other hand, it is programmatically straightforward to dump the lane context on an exception to L1 and have the handler check the data in memory with a fresh set of registers and whatever else data is saved by convention.
Unnecessary - as I said earlier, in the worst case, any working can be done in L1 without mucking about copying dozens of registers around for no reason. Ultimately it should submit to register virtualisation mechanics if the exception handler is more complicated than the kernel being handled.

My interpretation from Realworldtech was that Fermi has 48 warp queue entries so any selection of those could try something, not that it's particularly significant at that range of capacity.
Sorry, that's right, that's what the whitepaper says too.

Jawed
 
I'm just wondering how much of the flexibility of traditional CPU exception handling can really be implemented in GPU hardware.
I can't think of any reason to bar the options you listed. Interrupts are not rocket science.

Oh, sorry. I want to know whether the kernel scheduler freezes the occupied shaders when an exception occurs, until the problem passes, or whether the scheduler can create a, let's say, "sufficient" context, save that context and reuse the shaders which caused the exception for the next scheduled work. The "sufficient" context would be passed to the exception handling mechanism. In that case I doubt source-level debugging can work.
It's simply very hard for me to imagine that you do not need to freeze the kernel in place for it to be useful. On CPUs it's easy to restore a specific program state, on GPUs not so, right? Saving hundreds of KBs of state information?
The other hardware threads in the GPU cannot (should not) do anything that will affect the excepted hardware thread's context. The registers are obviously private. Writes to shared memory and global memory that could affect the thread are meant to be synchronised so that all work-items have a consistent view of their shared context.

So, the excepted hardware thread's context is safe, without having to "save" it anywhere. The instant the exception arises the context can be locked, so that it can only be modified by the handler. It's basic stuff, no different from a hardware thread being unable to execute ALU instructions because it's waiting for a texture result.

Hm. Doesn't the sheer ALU power of GPUs easily allow them to surpass x86 EP performance by emulating, say, 128-bit floats via DP products? Making it fast as well as more precise?
Yeah if you really want. There are people doing (or trying to do) kilobit precision arithmetic on GPUs.

You can do some fancy stuff with FFTs to do monster multiplies with stupid precision and GPUs are pretty good at FFTs.

http://numbers.computation.free.fr/Constants/Algorithms/fft.html

Over my head...
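
For the simpler "DP products" route, the usual trick is double-double arithmetic: keep each value as an unevaluated sum of two doubles for roughly 106 bits of mantissa. A minimal sketch of my own (dd_add and demo are made-up names, nothing Fermi-specific beyond needing DP support, so compile with -arch=sm_13 or later):

[code]
#include <cstdio>

// Double-double addition: a value is the unevaluated sum (hi, lo) of two
// doubles, giving ~106 bits of effective mantissa.
__device__ double2 dd_add(double2 x, double2 y)
{
    // Knuth's two-sum: s + e is exactly x.x + y.x
    double s = x.x + y.x;
    double v = s - x.x;
    double e = (x.x - (s - v)) + (y.x - v);

    // fold in the low-order words, then renormalise into (hi, lo)
    e += x.y + y.y;
    double hi = s + e;
    double lo = e - (hi - s);
    return make_double2(hi, lo);
}

__global__ void demo(double *out)
{
    // 1e-18 is below double's ~16 significant digits, but survives in the low word
    double2 one  = make_double2(1.0, 0.0);
    double2 tiny = make_double2(1e-18, 0.0);
    double2 sum  = dd_add(one, tiny);
    out[0] = sum.x;   // 1.0
    out[1] = sum.y;   // ~1e-18, the part a plain double would have lost
}

int main()
{
    double h[2], *d;
    cudaMalloc(&d, sizeof(h));
    demo<<<1, 1>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("hi = %.17g  lo = %.17g\n", h[0], h[1]);
    cudaFree(d);
    return 0;
}
[/code]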

I have the impression the slow migration from graphics chip to compute chip is really painful. Did neither Intel, nor nVidia, nor AMD ever learn from successful DSP chip histories?
Actually GPUs are following the path that x86 took: consumers, by buying billions of x86 chips, subsidised x86 in killing off most of that gear. That's only a medium-term thing though, as generality will kill off the GPU, per se. AMD's strategy is to put the GPU in the CPU - that way it lasts a bit longer... NVidia's strategy is two fingers to x86 with ARM and SoCs. Intel's solution is just to chomp everything with varieties of x86.

Jawed
 
So, the excepted hardware thread's context is safe, without having to "save" it anywhere. The instant the exception arises the context can be locked, so that it can only be modified by the handler. It's basic stuff, no different from a hardware thread being unable to execute ALU instructions because it's waiting for a texture result.

Thanks for the answers. Just to get a yes or no: will an exception lock all involved shaders?
If yes, exception handling is "in-order" (don't know what to call it better) on GPUs, while it's "out-of-order" on CPUs? The implication is that you can tie up the entire GPU's resources with exceptions, while CPUs continue to steam ahead?
 
Thanks for the answers. Just to get a yes or no: will an exception lock all involved shaders?
The ALUs can continue to run instructions for other hardware threads. I'm not aware of any reason for the entire GPU to defer to a single instance of an exception handler.

Of course we still need to wait and see what happens for real. I couldn't find any NVidia patent documents on this subject.

Jawed
 