Nvidia Pascal Announcement

That was not in conjunction with concurrent execution. That was talking about having different kernels execute in the same queue, when the program called for stopping one kernel to execute another.

PS: fine-grained preemption is already enabled in Pascal; they have talked about it anyway.

Maxwell 2 can do CUDA kernel execution in its graphics queue with no problem at all, but it does have issues with DirectCompute in its graphics queue (this is the problem we saw here: after one operation, the second operation just falls back to serial execution; we also saw other issues where graphics instructions started to go into the compute queue with DX12 and DirectCompute, which should not be happening at all, so something was or is messed up in the drivers). OpenCL behaves similarly to CUDA in this respect too.

They should not be using preemption for this kind of kernel execution; preemption is only for when you need to force something to be done at a certain time, where latency matters for a particular operation.
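For context, the configuration being argued about looks roughly like this in D3D12: one direct (graphics) queue plus a separate compute queue, with the open question being whether the hardware actually overlaps work submitted to the two. A minimal sketch, assuming an already-created ID3D12Device (the function name is just for illustration):

#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Create one direct queue (graphics + compute + copy) and one compute-only queue.
// Whether work on the two queues actually runs concurrently on the GPU is up to
// the hardware and driver -- which is exactly the point of contention above.
void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& gfxQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;      // graphics queue
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

    D3D12_COMMAND_QUEUE_DESC compDesc = {};
    compDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;    // compute queue
    device->CreateCommandQueue(&compDesc, IID_PPV_ARGS(&computeQueue));
}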
 
This is from GDC'15
[GDC'15 slide image]


I can dig up the newer slide where they specify Pascal as "finer grained preemption" (as pointed out by The Register article linked above).

Stating that fine grained pre-emption is coming doesn't necessarily mean lack of support for concurrent graphics and compute kernels.

A brief glance at HyperQ makes it very clear that nVidia is well aware of the benefits of "async" and the pitfalls of poor utilization. The real question is whether they choose to support it for graphics contexts (and if not, why not?).
 
Stating that fine grained pre-emption is coming doesn't necessarily mean lack of support for concurrent graphics and compute kernels.

A brief glance at HyperQ makes it very clear that nVidia is well aware of the benefits of "async" and the pitfalls of poor utilization. The real question is whether they choose to support it for graphics contexts (and if not, why not?).
And that was 2015, so "coming" could mean Pascal, albeit as an evolution, with it being more of a revolution in the next architecture, Volta.
More information is needed to know just how much better it is with Pascal than with Maxwell 2.
TBH I think that slide was talking about Pascal.
Notice they mention multiple contexts by time-slicing with draw-call preemption for generations before Pascal.
So to me this suggests a bit more of a fundamental change.
Cheers
 
Stating that fine grained pre-emption is coming doesn't necessarily mean lack of support for concurrent graphics and compute kernels.

A brief glance at HyperQ makes it very clear that nVidia is well aware of the benefits of "async" and the pitfalls of poor utilization. The real question is whether they choose to support it for graphics contexts (and if not, why not?).
I think this Draw-Preemption is something other than asynchronous compute, because:
AMD Radeon (TM) R9 Fury Series
Description AMD Radeon (TM) R9 Fury Series
VendorId 0x00001002
DeviceId 0x00007300
SubSysId 0x0b361002
Graphics Preemption Granularity DMA Buffer
Compute Preemption Granularity DMA Buffer

(same for GeForce)

Whereas:
Intel(R) HD Graphics 4600
Description Intel(R) HD Graphics 4600
VendorId 0x00008086
DeviceId 0x00000412
SubSysId 0x85341043
Revision 6
Graphics Preemption Granularity Primitive
Compute Preemption Granularity Triangle


and
Intel(R) HD Graphics 530
Description Intel(R) HD Graphics 530
VendorId 0x00008086
DeviceId 0x00001912
SubSysId 0x86941043
Revision 6
Graphics Preemption Granularity Triangle
Compute Preemption Granularity Pixel

(Pixel is as fine grained as it gets in current DX)

Or is this the draw preemption for desktop rendering and thus "responsive user experience"? Might be true as well.
 
I don't know the answer. But ... what would stop them from doing so when the FF hardware isn't involved (compute shaders)? The only thing I can think of would be if each warp (wave? I forget the NV specific terminology) is required to allocate the same amount of space in the register file, so that a simple warp ID is enough to locate all registers for a warp. But even in the graphics case, it seems odd not to be able to run both a pixel and vertex shader in parallel.
 
I don't know the answer. But ... what would stop them from doing so when the FF hardware isn't involved (compute shaders)? The only thing I can think of would be if each warp (wave? I forget the NV specific terminology) is required to allocate the same amount of space in the register file, so that a simple warp ID is enough to locate all registers for a warp. But even in the graphics case, it seems odd not to be able to run both a pixel and vertex shader in parallel.
Hmm, I hadn't thought of the register file as being the restriction.

Does NVidia still do funky register file allocations, where registers can be assigned in a wide pattern (across banks) or a deep pattern (within a bank), with reliance upon the operand collector to gather operands for the SIMDs?

If a pixel shader requires a certain kind of register allocation pattern then that would presumably make it much harder to also come up with a vertex shader register allocation pattern that meshed efficiently.

One might argue that vertex and pixel shaders would never normally have any need for funky register allocation patterns, so they can co-exist within a single SIMD. The register allocation shape only becomes applicable for compute kernels. Which would then imply that compute kernels can't co-exist with non-compute kernels. Nor can mixed compute kernels co-exist on a single SIMD.
 
Is it really worth speculating on assumptions (given the lack of available information) at this time about the differences between Maxwell and Pascal, when even the information about changes to the scheduler has not been released yet?
Part of the speculation I remember Mahigan mentioning in the past pertained to resource barriers on Kepler/Maxwell, and that this was part of the limitations due to the different scope between CUDA and consumer gaming; even this has probably evolved in some way with Pascal.
Good enough to compensate for game designs built around AMD's "async compute"? Who knows yet how successful it will be.
And isn't part of the issue resolving occupancy/stall-related issues (which we know NVIDIA has focused on with Pascal this time, even if it is not a revolution)?

Cheers
 
Pretty simple really: if NVidia is going to do threading properly at the SIMD level, it's going to cost area and power.
 
I think this Draw-Preemption is something other than asynchronous compute, because:
AMD Radeon (TM) R9 Fury Series
Description AMD Radeon (TM) R9 Fury Series
VendorId 0x00001002
DeviceId 0x00007300
SubSysId 0x0b361002
Graphics Preemption Granularity DMA Buffer
Compute Preemption Granularity DMA Buffer

(same for GeForce)

Whereas:
Intel(R) HD Graphics 4600
Description Intel(R) HD Graphics 4600
VendorId 0x00008086
DeviceId 0x00000412
SubSysId 0x85341043
Revision 6
Graphics Preemption Granularity Primitive
Compute Preemption Granularity Triangle


and
Intel(R) HD Graphics 530
Description Intel(R) HD Graphics 530
VendorId 0x00008086
DeviceId 0x00001912
SubSysId 0x86941043
Revision 6
Graphics Preemption Granularity Triangle
Compute Preemption Granularity Pixel

(Pixel is as fine grained as it gets in current DX)

Or is this the draw preemption for desktop rendering and thus "responsive user experience"? Might be true as well.
That's just a DXGI feature that, it seems, only Intel drivers currently report properly. Which tool are you using to read this? Because these are the defined graphics preemption values and these are the defined compute preemption values.
And yes, preemption is something different than "async compute". I think "asynchronous compute" is best described as marketing speak for a bunch of things going on here. De facto, "async compute" has now become a synonym for running graphics and compute tasks concurrently. The term has even been used in the VR case, where in fact you run two graphics tasks (one normal graphics task rendering the frame and one high-priority graphics task for asynchronous timewarp), which you have to preempt (that is, stop the current graphics task and start the higher-priority one).
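For what it's worth, those two fields can be read straight from DXGI via IDXGIAdapter2::GetDesc2; a minimal sketch (error handling omitted, link against dxgi.lib):

#include <dxgi1_2.h>
#include <wrl/client.h>
#include <cstdio>
using Microsoft::WRL::ComPtr;

int main()
{
    ComPtr<IDXGIFactory1> factory;
    CreateDXGIFactory1(IID_PPV_ARGS(&factory));

    ComPtr<IDXGIAdapter1> adapter1;
    for (UINT i = 0; factory->EnumAdapters1(i, &adapter1) != DXGI_ERROR_NOT_FOUND; ++i)
    {
        ComPtr<IDXGIAdapter2> adapter2;
        adapter1.As(&adapter2);

        DXGI_ADAPTER_DESC2 desc = {};
        adapter2->GetDesc2(&desc);

        // The two granularity fields carry the DXGI_GRAPHICS_PREEMPTION_GRANULARITY
        // and DXGI_COMPUTE_PREEMPTION_GRANULARITY enum values reported by the driver.
        printf("%ls  graphics preemption=%d  compute preemption=%d\n",
               desc.Description,
               (int)desc.GraphicsPreemptionGranularity,
               (int)desc.ComputePreemptionGranularity);
    }
    return 0;
}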
 
From there, since it was discussed here earlier:
Compute Preemption
The new Pascal GP100 Compute Preemption feature allows compute tasks running on the GPU to be interrupted at instruction-level granularity, and their context swapped to GPU DRAM. This permits other applications to be swapped in and run, followed by the original task's context being swapped back in to continue execution where it left off. Compute Preemption solves the important problem of long-running or ill-behaved applications that can monopolize a system, causing the system to become unresponsive while it waits for the task to complete, possibly resulting in the task timing out and/or being killed by the OS or CUDA driver.

Before Pascal, on systems where compute and display tasks were run on the same GPU, long-running compute kernels could cause the OS and other visual applications to become unresponsive and non-interactive until the kernel timed out. Because of this, programmers had to either install a dedicated compute-only GPU or carefully code their applications around the limitations of prior GPUs, breaking up their workloads into smaller execution timeslices so they would not time out or be killed by the OS. Indeed, many applications do require long-running processes, and with Compute Preemption in GP100, those applications can now run as long as they need when processing large datasets or waiting for specific conditions to occur, while visual applications remain smooth and interactive, but not at the expense of the programmer struggling to get code to run in small timeslices.

Compute Preemption also permits interactive debugging of compute kernels on single-GPU systems. This is an important capability for developer productivity. In contrast, the Kepler GPU architecture only provided coarser-grained preemption at the level of a block of threads in a compute kernel. This block-level preemption required that all threads of a thread block complete before the hardware could context switch to a different context. However, when using a debugger and a GPU breakpoint was hit on an instruction within the thread block, the thread block was not complete, preventing block-level preemption. While Kepler and Maxwell were still able to provide the core functionality of a debugger by adding instrumentation during the compilation process, P100 is able to support a more robust and lightweight debugger implementation.
 
Any idea how long it takes to flush/restore state for a GPU as wide as GP100?
Back-of-envelope guess. Worst case, you have to dump (and later restore) every register and every shared-memory word to main memory. Fully enabled P100 has 30 SMs, each with 256KB of registers and 64KB of shared memory, so the whole GPU state is 9.6MB. (Remaining state like warp instruction pointers, carry flags, predicate flags, etc. is negligible in size.) HBM2 memory bandwidth is ~700GB/sec on ECC P100. Conclusion: flushing or restoring the whole GPU execution state would take about 14 microseconds.

Putting that in context, let's figure out how much performance the overhead of full worst-case realtime preemption would cost. Say you're interrupting the GPU at 120Hz. The state-switch overhead would be 120*(2*14us) per second, about 1/3 of one percent of the GPU's time.
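For anyone who wants to redo the arithmetic, a quick sanity check of those numbers (the SM count, per-SM sizes, bandwidth and preemption rate are the assumptions from the post above, not measured values):

#include <cstdio>

int main()
{
    const double sms          = 30;                      // assumed enabled SMs on P100
    const double bytes_per_sm = (256 + 64) * 1024.0;     // 256 KB registers + 64 KB shared memory
    const double state_bytes  = sms * bytes_per_sm;      // ~9.6 MB of execution state

    const double bandwidth    = 700e9;                   // ~700 GB/s HBM2 with ECC
    const double swap_seconds = state_bytes / bandwidth; // one flush or restore, ~14 us

    const double preempt_hz   = 120;                     // assumed preemption rate
    const double overhead     = preempt_hz * 2 * swap_seconds; // flush + restore each time

    printf("state %.0f KB, one swap %.1f us, overhead %.2f%% of GPU time\n",
           state_bytes / 1024.0, swap_seconds * 1e6, overhead * 100);
    return 0;
}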
 