We'll have to wait and see. Since exception handling is likely just an in-context subroutine call, it's near-instantaneous - so it's easy to have a switch statement in the handler.
I was thinking more in terms of the CPU. When you configure the FPU, for example, to raise on a division by zero, it produces an interrupt event that calls a handler, which may do something and return a result, abort, or pass handling up to higher-level handlers.
So what you get is the choice between zero penalty and bad results, some penalty and a crash requester, or a big penalty and custom handling in your application. But whether the interrupt is produced at all is up to you.
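Roughly like this on the CPU side, as a minimal sketch assuming Linux/glibc (feenableexcept is a glibc extension and needs -lm to link; unmasking the exception but keeping the default SIGFPE disposition would give you the middle, crash-requester case):

/* Sketch: unmasking an FPU exception so it traps instead of silently
 * producing the default result. Assumes Linux/glibc; feenableexcept
 * is a glibc extension, not ISO C. Build: gcc -o fpe fpe.c -lm */
#define _GNU_SOURCE
#include <fenv.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static void fpe_handler(int sig)
{
    /* Custom handling: just report and bail out. A real handler could
     * inspect the faulting context and substitute a result instead. */
    fprintf(stderr, "caught SIGFPE (%d): divide-by-zero trapped\n", sig);
    exit(EXIT_FAILURE);
}

int main(void)
{
    volatile double zero = 0.0;

    /* Masked (the default): zero penalty, "bad" result (infinity). */
    printf("masked:   1.0/0.0 = %g\n", 1.0 / zero);

    /* Unmasked: the divide now raises SIGFPE and the handler decides
     * what happens, i.e. the big-penalty, custom-handling case. */
    signal(SIGFPE, fpe_handler);
    feenableexcept(FE_DIVBYZERO);
    printf("unmasked: 1.0/0.0 = %g\n", 1.0 / zero);  /* never reached */

    return 0;
}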
Another example is the trigonometric ops on the 68882, which were removed in the 68040; the trigonometric functions were then emulated via an interrupt handler instead. This was slow, much slower than linking with an appropriate libm that implements the trigonometric functions directly.
I'm just wondering how much of the flexibility of traditional CPU exception handling can really be implemented in GPU hardware.
GPUs have monster register files so that they explicitly have a context that usually lives entirely in registers for the lifetime of the kernel. Shared memory/L1/L2/global memory can be used as adjuncts, but the programming model heretofore has been based on registers for performance reasons.
I can't decode your terminology.
Oh, sorry. I want to know whether the kernel scheduler freezes the occupied shaders when an exception occurs, until the problem is resolved, or whether it can create a, let's say, "sufficient" context, save that context, and reuse the shaders that raised the exception for the next scheduled work, with the "sufficient" context passed on to the exception-handling mechanism. In the latter case I doubt source-level debugging can work.
It's simply very hard for me to imagine that you don't need to freeze the kernel in place for this to be useful. On CPUs it's easy to restore a specific program state; on GPUs not so much, right? Saving hundreds of KB of state information?
SIMDs can only support a limited number of hardware threads at any given time. The type of kernel (compute versus pixel shader, say) may determine this limit.
Yes, a 1-core CPU can run only 1 hardware thread. Still, the time-slicing or round-robin or whatever OS scheduler maps hundreds of threads onto that core. I wonder how the GPU scheduler (software) handles all that. I suppose it's significant for the future to think about how an M:N (thread:core) scheduler with a high N can work optimally. The nice thread-hopping problem with the Phenom I and Vista is an indicator that some people who were supposed to think about it didn't.
Maybe it's just to address the "credibility gap" in the scientific community? x86 is flexible and also supports extended precision (i.e. it is significantly more precise than double precision), so these measures are a way to claw back some apparent credibility. ECC, in my view, falls into the same credibility-gap category.
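A toy illustration of that extra precision, assuming an x86 target where long double is the 80-bit x87 format (64-bit mantissa versus a double's 53 bits):

/* Sketch: x87 extended precision vs. plain double. Assumes an x86
 * target where long double is the 80-bit format; on other targets
 * long double may equal double or be a software binary128. */
#include <stdio.h>

int main(void)
{
    double      d  = 1.0  + 1e-18;   /* 1e-18 is below double's epsilon: lost   */
    long double ld = 1.0L + 1e-18L;  /* still above the 80-bit format's epsilon */

    printf("double:      1 + 1e-18 == 1 ? %s\n", d  == 1.0  ? "yes" : "no");
    printf("long double: 1 + 1e-18 == 1 ? %s\n", ld == 1.0L ? "yes" : "no");
    return 0;
}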
Hm. Doesn't the sheer ALU power of GPUs easily allow them to surpass x86 extended-precision performance by emulating, say, 128-bit floats via DP products? Making it fast as well as more precise?
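I mean something like the usual double-double trick, sketched below in plain C for illustration (it gives roughly 106 mantissa bits rather than true IEEE binary128, but it is the standard "DP products" route, and fma() maps onto the hardware FMA a GPU provides):

/* Sketch: double-double arithmetic, i.e. emulating ~128-bit-wide,
 * ~106-bit-mantissa values as an unevaluated sum hi + lo of two
 * doubles. fma() is the key: it exposes the exact rounding error of
 * a product. Illustrative only, not any particular GPU library.
 * Build: cc -o dd dd.c -lm */
#include <math.h>
#include <stdio.h>

typedef struct { double hi, lo; } dd;

/* Error-free sum of two doubles (Knuth's two-sum). */
static dd two_sum(double a, double b)
{
    dd r;
    r.hi = a + b;
    double v = r.hi - a;
    r.lo = (a - (r.hi - v)) + (b - v);
    return r;
}

/* Error-free product of two doubles via FMA (two-prod). */
static dd two_prod(double a, double b)
{
    dd r;
    r.hi = a * b;
    r.lo = fma(a, b, -r.hi);   /* exact rounding error of a*b */
    return r;
}

/* Double-double multiplication built from the pieces above. */
static dd dd_mul(dd a, dd b)
{
    dd p = two_prod(a.hi, b.hi);
    p.lo += a.hi * b.lo + a.lo * b.hi;
    return two_sum(p.hi, p.lo);        /* renormalize */
}

int main(void)
{
    /* (1 + 2^-40)^2 = 1 + 2^-39 + 2^-80 needs 81 mantissa bits, far
     * beyond a double's 53, but the tail survives in double-double. */
    dd x = { 1.0 + ldexp(1.0, -40), 0.0 };
    dd y = dd_mul(x, x);
    printf("hi = %.17g\nlo = %.17g\n", y.hi, y.lo);
    return 0;
}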
I have the impression the slow migration from graphics chip to compute chip is really painful. Did neither Intel nor nVidia nor AMD ever learn from the successful DSP chip histories?