NVIDIA GF100 & Friends speculation

Registers are allocated at invocation for all the potential operands used by whatever exception handler is coded?
GPUs only have registers for "fast" context. I don't understand how D3D11 works, but part of the intention is for "optimal" register allocation for a suite of virtual functions.

In HD5870 it seems that the number of clause temporary registers available is 8, compared with 4 in earlier GPUs. This increase might have been targeted at use by an exception handler.

Additionally, ATI has the concept of shared registers - registers that are shared across all currently resident hardware threads, with sharing between work items that have the same work-item ID. So if 3 shared registers are allocated, then 64 work items * 3 shared registers = 192 registers need to be allocated.

NVidia could implement the same thing - it's simply a matter of removing the hardware thread ID from the addressing (assuming that the register file is designed to handle more than one kernel at a time, e.g. VS+PS). Since the hardware will only run a single exception handler at any one time, there's no need to allocate a monster wodge of these shared registers.
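
Something like this toy addressing sketch is what I'm picturing - the names, sizes and layout are purely illustrative (nothing here comes from either vendor's documentation); it just shows the hardware thread ID dropping out of the address for shared registers:

Code:
// Compiles as plain host code with nvcc or any C++ compiler.
#include <cstdio>

const int LANES         = 64; // work items per hardware thread
const int REGS_PER_ITEM = 12; // private registers per work item (example kernel)
const int SHARED_REGS   = 3;  // shared registers per lane (the 64 * 3 = 192 case above)

// Private register: the hardware thread ID is part of the address.
int private_reg_slot(int hw_thread, int lane, int reg)
{
    return (hw_thread * LANES + lane) * REGS_PER_ITEM + reg;
}

// Shared register: same idea with the hardware thread ID simply removed,
// so every hardware thread of the kernel hits the same storage for a given lane.
int shared_reg_slot(int lane, int reg)
{
    return lane * SHARED_REGS + reg;
}

int main()
{
    printf("private r5, lane 10, hw thread 3 -> slot %d\n", private_reg_slot(3, 10, 5));
    printf("shared  r1, lane 10, any hw thread -> slot %d\n", shared_reg_slot(10, 1));
    printf("shared pool size: %d registers\n", LANES * SHARED_REGS); // 192
    return 0;
}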

So the context at the time the exception occurs is "locked" simply by control flow passing to the handler. The handler then has dedicated working registers (the shared registers allocated earlier + clause temporaries if this is ATI) and it gets to work with all registers in the context.

An alternative is to use L1 for working memory during the exception handler. The context (registers) remains untouched as the handler starts, work is done in L1 and then registers updated according to the handler. L1 is "full speed" in GF100 (i.e. not really, due to limited bandwidth and increased latency compared with registers) but the latencies encountered by LDSTs in the handler will be covered by the other hardware threads which will be running as normal.

All that's speculation, of course. I haven't spent any time on ATI's implementation.

Is this 12 cycles from the point of view of the affected thread, or from the hardware? With the warp schedulers, we have multiple warps in progress, and we would need to track them separately.
I'm referring to flushing the pipeline of succeeding instructions from the same hardware thread that might be in flight. Exceptions should only arise in limited places in the pipeline.

Worst-case, the hardware would have to track exceptions (possibly different ones?) from every lane in every warp that is currently in progress at the point of the first exception, wherever that first appears in the pipeline.
Per lane exception handling is hardly a big deal. A queue of exceptions (i.e. a per-work-item FIFO) would allow the hardware to track all hardware threads if they all hit a shitstorm of exceptions one after the other, before the first invocation of the handler has completed.

Nvidia claims to have changed the internal ISA to a load/store one, whereas the earlier variants had memory operands that would have been nightmarish to track as part of an ALU instruction.
Since GT200 has exception handling perhaps this means it's cheaper in GF100?

My interpretation is that within each core, there are rectangular bands of straight silicon on the upper and lower edges, with one end marked by regions that look like the SRAMs for the register file.
Sandwiched between them would be stuff I attribute to special function and scheduling.
I zoomed in and I think I can see what you're referring to. There are hints of structure along those two edges, with a "paired" structure on each side.

Also, looking at the die it seems you can tell how fast the logic's clocked by whether it's light or dark!

Shame the whole picture is so blurry.

Jawed
 
If you multiply a spfp number by zero, and then want to compare it with 0, why would you need denormal support? Multiplication by 0 is an exact operation even on ALUs which flush input denormals to zero, ain't it?
My understanding is that without denormal support, two different ways of getting to the number zero don't guarantee they will have the exact same representation, which means they will sometimes turn out "false" on a comparison.
 
I wasn't sure which thread would be appropriate, but seeing as this one has gone a bit off course every now and then, I suppose it's as good as any - I don't think this is worth its own thread.

Finally, just when nVidia's naming started to make sense again after all the renames - leaving 2xx as DX10 adapters, 3xx as DX10.1 adapters and 4xx as DX11 adapters - nVidia pulls another great one and re-re-re-(re-re?)releases G92 as the GT330:
http://www.nvidia.com/object/product_geforce_gt_330_us.html
 
My understanding is that without denormal support, two different ways of getting to the number zero don't guarantee they will have the exact same representation, which means they will sometimes turn out "false" on a comparison.

Zero (modulo the +0,-0 quirks) has the same representation, whether denormals are supported or not.
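
A trivial check of that (plain host code, compiles with nvcc or any C++ compiler - purely illustrative):

Code:
#include <cstdio>
#include <cstring>
#include <cstdint>

int main()
{
    float pz = 0.0f, nz = -0.0f;
    uint32_t pbits, nbits;
    memcpy(&pbits, &pz, sizeof pbits);
    memcpy(&nbits, &nz, sizeof nbits);
    // Prints 0x00000000, 0x80000000, true: only the sign bit differs and
    // the comparison still says they are equal, denormal support or not.
    printf("+0 = 0x%08x, -0 = 0x%08x, +0 == -0: %s\n",
           pbits, nbits, (pz == nz) ? "true" : "false");
    return 0;
}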

It's not the denormal support per se, but its implementation in hw that befuddles me. Really, why bother? :rolleyes: The rest of the world is doing fine without it. In GT200, it made some sense as denormal handling in sw is expensive and it would cause warp divergence. In Fermi, where you have per-lane exception handling, why not just reuse it to provide an area-efficient IEEE-compliant implementation?
 
At what number of rebrands/relabeling does it cross the line and become deceptive marketing? :rolleyes:

To be fair, this thing has 3 display outs so maybe it's not a total scam.
 
At what number of rebrands/relabeling does it cross the line and become deceptive marketing? :rolleyes:

To be fair, this thing has 3 display outs so maybe it's not a total scam.

It's not the first nV card with DVI+VGA+HDMI, and only 2 at once are usable.
They haven't had those by default before, I think, but various companies have released slightly customized models with those outputs before - IIRC on G92-based cards too.
 
What's this?
Seems NVIDIA is experimenting in cutting out the enthusiast websites altogether by pushing hype entirely themselves, dressing it up in the amateurism of social networking sites (with the added benefit of being able to datamine their fans).
 
GPUs only have registers for "fast" context. I don't understand how D3D11 works, but part of the intention is for "optimal" register allocation for a suite of virtual functions.

In HD5870 it seems that the number of clause temporary registers available is 8, compared with 4 in earlier GPUs. This increase might have been targeted at use by an exception handler.
Not that I have documentation on what Fermi does exactly, but couldn't the handler be some arbitrary program? A fixed number of registers wouldn't be appropriate.

NVidia could implement the same thing - it's simply a matter of removing the hardware thread ID from the addressing (assuming that the register file is designed to handle more than one kernel at a time, e.g. VS+PS). Since the hardware will only run a single exception handler at any one time, there's no need to allocate a monster wodge of these shared registers.

So the context at the time the exception occurs is "locked" simply by control flow passing to the handler. The handler then has dedicated working registers (the shared registers allocated earlier + clause temporaries if this is ATI) and it gets to work with all registers in the context.
A programming guide to Fermi would confirm or deny this. This would be something that could not be hidden from the program coded as the exception handler. Otherwise, nothing outside of the original thread's addressable register range would be accessible, and that would include the handler's registers.

(edit: removed some text from quoted post)

Per lane exception handling is hardly a big deal. A queue of exceptions (i.e. a per-work-item FIFO) would allow the hardware to track all hardware threads if they all hit a shitstorm of exceptions one after the other, before the first invocation of the handler has completed.
How many bits of storage would we estimate it would need to have?
A 32-bit mask would be needed per warp-instruction. How many bits to track the exception type per lane? Then how many instruction pointers would need to be kept around?

Since GT200 has exception handling perhaps this means it's cheaper in GF100?
Is this certain? I'm pretty sure that exceptions were one thing not in the presentations for GT200.
 
I'm referring to flushing the pipeline of succeeding instructions from the same hardware thread that might be in flight. Exceptions should only arise in limited places in the pipeline.

The nice thing with CPUs is that you can define, case by case, which conditions produce exceptions. I assume there is nothing configurable on GPUs? (e.g. exception on denormal, division, address, etc.)

Per lane exception handling is hardly a big deal. A queue of exceptions (i.e. a per-work-item FIFO) would allow the hardware to track all hardware threads if they all hit a shitstorm of exceptions one after the other, before the first invocation of the handler has completed.

How does this play together with multitasking?

A "normal" program carries around the near context-data, some of the data can be trashed by concurrent programs/threads, memory is most likely trashed if shared. Which means on CPUs at least the internal CPU context is fully available even though we could have concurrent use of MMX and XMMX resources for example. It's seen as a (non-breakable) unit. Anyway the system continues to run and reoccupy the CPUs resources.

On GPUs we have X programs acquiring Y "shaders"/blocks. Task switching (I presume) will assign our Y shaders to Z other programs. If one of the running programs raises an exception, does it mean that, because it's not even possible to save the near context-data, all shaders running for that program will be locked until the exception is treated, locking out all successive programs? I assume the situation of the chip, with all the big and shared register blocks, requires a different form of exception handling than CPUs?

Any hint how good load-balancing and resource-switching with multiple concurrent OpenCL programs is on these architectures anyway? Do we end up with one shader-block per program, which does not migrate (as a poor way to prevent cache invalidation), having an effective limit of 32 concurrent programs on 5k?

I think this is a really interesting topic, and just from the look of it, I think traditional exception handling is not very well suited. I wonder why nVidia is trying it, if they are. Too much feature-suck from Larrabee?
 
Not that I have documentation on what Fermi does exactly, but couldn't the handler be some arbitrary program? A fixed number of registers wouldn't be appropriate.
The handler would be in GPU state for the current kernel as a sub-routine waiting to be called. Remember, traditionally, all kernels that are run on GPUs have a fixed number of registers for their lifetime.

Further, worst case, the L1 solution I provided is a way to access unlimited state for the handler. In truth the state for a kernel can be "unlimited" because the registers can be virtualised. D3D requires this (4096 vec4s in D3D10, not sure what D3D11 says). Registers are spilled to memory in the general case (going via L1 with Fermi - L1 is described as an explicit destination for spilling).

Either way, I don't see any particular issue. As I say I don't understand the full intentions/scope behind the virtual function support in D3D11 nor how the GPUs implement it. I was kinda hoping some documentation/presentations on this subject would appear.

A programming guide to Fermi would confirm or deny this. This would be something that could not be hidden from the program coded as the exception handler. Otherwise, nothing outside of the original thread's addressable register range would be accessible, and that would include the handler's registers.
Well as I described before, it's logically possible to reserve registers/L1 memory specifically for the handler. Assuming that only one instance of the handler can be live at any time, whatever is reserved won't break the bank.

How many bits of storage would we estimate it would need to have?
A 32-bit mask would be needed per warp-instruction. How many bits to track the exception type per lane? Then how many instruction pointers would need to be kept around?
e.g. the exception flagger uses a queue of entries, with queue length limited to 32 (the max count of hardware threads per SIMD). 3 bits per lane for the flag seems reasonable, so a 384-byte block of memory to track exceptions. Presuming exception handlers aren't allowed to generate exceptions, otherwise...
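
Roughly, with those assumptions (this is just my arithmetic, nothing from NVIDIA documentation):

Code:
// Host-only sizing arithmetic; compiles with nvcc or any C++ compiler.
#include <cstdio>

int main()
{
    const int queue_depth    = 32; // assumed max hardware threads per SIMD
    const int lanes_per_warp = 32;
    const int bits_per_code  = 3;  // assumed exception-type code per lane

    int code_bits = queue_depth * lanes_per_warp * bits_per_code;
    printf("per-lane exception codes: %d bits = %d bytes\n", code_bits, code_bits / 8); // 384 bytes

    // Plus a 32-bit "this lane raised" mask per queue entry, if wanted:
    printf("per-entry lane masks: %d bytes\n", queue_depth * 4); // 128 bytes
    return 0;
}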

There should be a stack for instruction pointers anyway, so that's no additional cost.

Is this certain? I'm pretty sure that exceptions were one thing not in the presentations for GT200.
Sod it, I've been borged. Yeah, it's new in Fermi.

Jawed
 
The nice thing with CPUs is that you can define, case by case, which conditions produce exceptions. I assume there is nothing configurable on GPUs? (e.g. exception on denormal, division, address, etc.)
We'll have to wait and see. Since exception handling is likely just an in-context subroutine call, it's near-instantaneous - so it's easy to have a switch statement in the handler.
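
Purely hypothetically - there's no documented way to install such a handler, and the codes, names and calling convention below are invented just to show the "switch in the handler" shape:

Code:
// Compiles with nvcc for sm_20+ (device-side printf). Everything here is a
// made-up illustration, not a real exception mechanism.
#include <cstdio>

enum ExcCode { EXC_NONE = 0, EXC_DIV_ZERO = 1, EXC_INVALID_OP = 2, EXC_DENORMAL = 3 };

// Hypothetical in-context handler: just a subroutine that switches on the code.
__device__ void trap_handler(int code)
{
    switch (code) {
    case EXC_DIV_ZERO: printf("lane %d: divide by zero\n", threadIdx.x); break;
    case EXC_DENORMAL: printf("lane %d: denormal produced\n", threadIdx.x); break;
    default:           break; // record and continue
    }
}

// Nothing raises exceptions for us here, so the kernel calls the handler directly.
__global__ void demo()
{
    trap_handler(threadIdx.x == 0 ? EXC_DIV_ZERO : EXC_NONE);
}

int main()
{
    demo<<<1, 4>>>();
    cudaDeviceSynchronize();
    return 0;
}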

How does this play together with multitasking?
GPUs have monster register files so that they explicitly have a context that usually lives entirely in registers for the lifetime of the kernel. Shared memory/L1/L2/global memory can be used as adjuncts, but the programming model heretofore has been based on registers for performance reasons.

On GPUs we have X programs acquiring Y "shaders"/blocks. Task switching (I presume) will assign our Y shaders to Z other programs.
I can't decode your terminology.

A kernel has a fixed-size state defined, which is instantiated for every work-item in the execution domain - e.g. 12 vec4 registers. Hardware threads in the GPU are used to group work-items for parallel execution in the SIMDs (however many there are). The 10s or 100s of KB of register file are divided amongst hardware threads simply according to the amount of state per instance.

If the GPU is doing graphics then a SIMD may see more than one kernel running concurrently (e.g. a VS and a PS) and the GPU's higher-level scheduler needs to decide how to balance the allocation of the kernels, i.e. how many hardware threads for each and consequently how much register file space. Hardware threads share execution time in the SIMD according to a variety of parameters (some may be defined by the compilation, others by the scheduling hardware). That load balancing may be parameterised in the driver and could be adaptive over the lifetime of the kernels' hardware threads.

SIMDs can only support a limited count of hardware threads at any time. The type of kernel (compute versus pixel shader, say) may determine this limit.
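
As a back-of-the-envelope example, with made-up-but-plausible numbers (a 256KB register file per SIMD, 64-wide hardware threads, the 12-vec4 kernel above, and an assumed hardware-thread cap):

Code:
// Host-only arithmetic; compiles with nvcc or any C++ compiler.
#include <cstdio>

int main()
{
    const int regfile_bytes  = 256 * 1024; // register file per SIMD (assumed)
    const int wavefront_size = 64;         // work-items per hardware thread (assumed)
    const int vec4_bytes     = 16;         // 4 x 32-bit
    const int regs_per_item  = 12;         // the "12 vec4 registers" example kernel
    const int hw_thread_cap  = 32;         // assumed per-SIMD hardware-thread limit

    int bytes_per_thread = wavefront_size * regs_per_item * vec4_bytes; // 12 KB
    int by_registers     = regfile_bytes / bytes_per_thread;            // 21
    int resident         = by_registers < hw_thread_cap ? by_registers : hw_thread_cap;

    printf("%d bytes of registers per hardware thread -> %d resident hardware threads\n",
           bytes_per_thread, resident);
    return 0;
}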

Any hint how good load-balancing and resource-switching with multiple concurrent OpenCL programs is on these architectures anyway? Do we end up with one shader-block per program, which does not migrate (as a poor way to prevent cache invalidation), having an effective limit of 32 concurrent programs on 5k?
As far as I can tell ATI currently executes kernels strictly in order - but as one kernel's work-groups start disappearing off SIMDs as the kernel comes to an end, those SIMDs can start execution of the succeeding kernel from the queue of kernels submitted by the host. There's no real detail on the simultaneity of OpenCL kernels at the SIMD level.

I think this is a really interesting topic, and just from the look of it, I think traditional exception handling is not very well suited. I wonder why nVidia is trying it, if they are. Too much feature-suck from Larrabee?
Maybe it's just to address the "credibility gap" in the scientific community? x86 is flexible and also supports extended precision (i.e. is significantly more precise than double-precision), so these measures are a way to claw back some apparent credibility. ECC, in my view, is in the same credibility gap category.

Anyway, in all these cases I think there's not enough overhead in what NVidia's done with Fermi to say that it's been damaged by CUDA-specific features.

Jawed
 
Maybe it's just to address the "credibility gap" in the scientific community? x86 is flexible and also supports extended precision (i.e. is significantly more precise than double-precision), so these measures are a way to claw back some apparent credibility. ECC, in my view, is in the same credibility gap category.
Native hardware support for exceptions makes debugging and programming significantly easier. It certainly beats the dark ages of GPU programming where something would go wrong and you wouldn't know until the end (maybe) and even then you wouldn't know where, or you had to run your program on a simulator that wasn't accurate anyway.

It does solve some of the credibility gap, in that nobody who didn't need to find make-work for free grad students was going to develop serious software on a crap platform like that.
 
Are you guys not aware there are features like Genlock (which is a must in real-time broadcasting) that aren't available outside of Quadro cards? I don't see why you guys are talking about this at all; it's not about performance in gaming, it's about features and support for professional software and needs - it's pretty simple to understand. The market for such cards is real: when working with budgets in the millions of dollars, you might have the same core, but software that is customized and supported for the needs of the business is much more important. The same goes for Matrox - why are they still alive? Because they made a niche where they excel; a lot of imaging software and companies rely on Matrox's stability in imaging.

Doesn't take a page and a half of posts to see that.


not to nitpick but ;) .. no, Quadro cards are not the only ones with Genlock. Maybe if you are talking about Nvidia offerings only, then yes, but the way it's worded - "features like Genlock (...) that aren't available outside of quadro cards?" - is factually incorrect. Even "lowly" S3 has Genlock, and if memory serves me well, their (VIA's) Apollo 133 chipset had Genlock capability (as to whether or not it was enabled .. well ..), as does the FirePro line from ATI. Matrox (since at least Parhelia.. DigiSuite before that) has had Genlock/Framelock for years, extending throughout their product lines, often through the use of a "daughter" card or ASM (Advanced Synchronization Module). Their QID Pro can drive up to 8 displays with Genlock using ASM.

So if you are going to make sweeping generalizations, I might suggest being a bit more specific, so someone doesn't read it the way it was written rather than the way you intended.

With regard to the 2nd bolded bit, well, that "niche where they excel" (Quadro) is ironically also what has kept nV's numbers up so well over the last couple of quarters too.. so don't go knocking it unless you want to apply it to all parties involved.
 
Well, string theorists don't typically handle numbers at all, so...

Anyway, the nice thing about denormals has to do with comparisons to zero. For example, if I have some sort of mask I want to apply to an image, which multiplies parts of it by zero, and I want to go back later and ask which pixels were masked and which weren't, it's a lot easier to directly compare against zero. If you don't have denormals, you can't do that and expect it to work.

Chal, anything * 0 = 0. The whole point of exceptions and flush to zero is that denormal math isn't something you really want to get into. Outside of a small handful of mathematicians denormals are considered fairly dangerous.
 
It's a matter of convenience, mostly. Basically, if you don't do it in hardware, you can't do a simple comparison to check for zeroes. And if you don't do a simple comparison to check for zeroes, then you need to examine your system for a reasonable minimum cutoff, a cutoff that may potentially change with different data sources, and would be difficult to automate. In other words, it's quite a bit of extra work just to avoid doing "if (x == 0)"
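
For what it's worth, here's a minimal CUDA sketch of where the two modes actually diverge - a product of two small normal numbers that lands in the denormal range compares equal to zero under flush-to-zero, but not with denormals enabled. The constants and the -ftz compile switch usage are just for illustration.

Code:
// Build twice and compare:  nvcc -ftz=true demo.cu   vs   nvcc -ftz=false demo.cu
#include <cstdio>

__global__ void mask_check(float a, float b, float *out)
{
    float x = a * b; // runtime multiply on the device
    // With -ftz=true the underflowing product is flushed to 0.0f and the test is true;
    // with -ftz=false it stays a nonzero denormal and the test is false.
    out[0] = (x == 0.0f) ? 1.0f : 0.0f;
    out[1] = x;
}

int main()
{
    float h[2];
    float *d;
    cudaMalloc(&d, sizeof(h));
    // 1e-20f is a normal fp32 value; the product 1e-40 lies in the denormal range.
    mask_check<<<1, 1>>>(1.0e-20f, 1.0e-20f, d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("x == 0.0f : %s (x = %g)\n", h[0] != 0.0f ? "true" : "false", h[1]);
    cudaFree(d);
    return 0;
}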

Comparisons to zero work with flush to zero, which is how it's done in the real world.
 
Let SuperHuang end this discussion:

You know, our business is increasingly moving from a great shift business to much more of a software rich business. If you look at our Quadro business, it is nearly all software. You know, the enormous R&D that we invest in Quadro and in the technology we create for Quadro is all software, because for anyone else, it is still built on NVIDIA GPUs. So you see the same thing with GeForce now. The work that we did in 3-D Vision, tons of software. The work that we did with 3-D Blu-Ray, tons of software. You know, so the work that we do with CUDA, the work we do with Physx, tons of software. So I think increasingly, that is going to become the nature of our business.

Tesla is just all software, right, software tools and software compilers, libraries and I mean these are the profilers and debuggers. I mean, it has become increasingly a software-oriented type business and that is where our differentiation really is and that is where NVIDIA has historically been really, really excellent. And so in order for us to differentiate the basic commodity platform and turn it into an extraordinary experience for gamers or scientists or digital creators or for clouds or for netbooks or for tablets, it is increasingly a software business. And so that is where a lot of our differentiation becomes and I think if you think about our business from that perspective, our gross margin of the 44 points or the 44.7% that we had this quarter should be far from our expectations.
http://seekingalpha.com/article/189...all-transcript?source=trans_sb_popular&page=9



Quadro and Tesla - it's all software. You pay for the software and the support.
 