In a cycle...

> Excuse me if this is a silly question, but how long is a typical cycle for a high end chip that is outlined in the original post?

At 600 MHz, which is typical for high-end GPUs, it's 1 / 600 MHz == ~1.67 ns (which includes circuit setup/hold times, so your real time to do anything is less than that).

3 GHz CPUs have cycle times of ~0.33 ns.

So the CPU can do 10-20 instructions (assuming dual-core with 2x the execution units) by the time the GPU can do ~60 in the programmable shader alone.
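
If it helps, here's that arithmetic as a quick back-of-the-envelope in C (the per-cycle instruction counts are just the assumptions above, not measured figures):

Code:
#include <stdio.h>

int main(void)
{
    const double gpu_hz = 600e6;  /* assumed high-end GPU clock */
    const double cpu_hz = 3e9;    /* assumed desktop CPU clock  */

    printf("GPU cycle: %.2f ns\n", 1e9 / gpu_hz);  /* ~1.67 ns */
    printf("CPU cycle: %.2f ns\n", 1e9 / cpu_hz);  /* ~0.33 ns */

    /* Per GPU cycle the 3 GHz CPU gets 5 cycles; two cores issuing a
     * couple of instructions each gives roughly 10-20 instructions,
     * versus ~60 per cycle in the GPU's programmable shaders. */
    printf("CPU cycles per GPU cycle: %.1f\n", cpu_hz / gpu_hz);
    return 0;
}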
 
OpenGL guy said:
That would violate the DX10 spec. Say you're running a complex pixel shader that takes a million cycles per pixel... Do you really want your context switch to wait until all pixels are shaded?

I am sure it's only because my spec is old and you have a newer one. But as far as I know, Direct3D 10 requires a WDDM 1.0 driver, and WDDM 1.0 only requires context switches at DMA-buffer boundaries.

Finer switch granularity is part of WDDM 2.0 (per command/triangle) and WDDM 2.1 (immediate, even mid-shader).
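
Loosely, the granularity ladder looks something like this (an illustrative sketch only; the names are made up, not the real WDDM DDI):

Code:
/* Illustrative only -- hypothetical names, not the real WDDM DDI enum. */
enum preempt_granularity {
    PREEMPT_DMA_BUFFER,  /* WDDM 1.0: switch only between DMA buffers     */
    PREEMPT_PRIMITIVE,   /* WDDM 2.0: switch between commands/triangles   */
    PREEMPT_INSTRUCTION  /* WDDM 2.1: switch immediately, even mid-shader */
};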
 
Bob said:
> Excuse me if this is a silly question, but how long is a typical cycle for a high end chip that is outlined in the original post?

At 600 MHz, which is typical for high-end GPUs, it's 1 / 600 MHz == ~1.67 ns (which includes circuit setup/hold times, so your real time to do anything is less than that).

3 GHz CPUs have cycle times of ~0.33 ns.

So the CPU can do 10-20 instructions (assuming dual-core with 2x the execution units) by the time the GPU can do ~60 in the programmable shader alone.
Duh, of course. My brain is worthless. Thank you. :)
 
Demirug said:
I am sure it's only because my spec is old and you have a newer one. But as far as I know, Direct3D 10 requires a WDDM 1.0 driver, and WDDM 1.0 only requires context switches at DMA-buffer boundaries.

Finer switch granularity is part of WDDM 2.0 (per command/triangle) and WDDM 2.1 (immediate, even mid-shader).
They're probably thinking about the future and thus 2.0 or 2.1. I believe 2.1 requires switching within a certain number of milliseconds so you might not be able to finish the current pixel if the shader is really long.
 
Surely there's a difference between "switching-in a context" within x milliseconds and "completing the batch of work that's been switched-in" within x milliseconds.

If a GPU is able to execute out of order and supports multiple concurrent render states (I'm thinking along the lines of how Xenos works) then when a new context is submitted to the GPU, it can start it up practically straight away. Surely this new context only needs to be given high-priority to be scheduled in preference to the currently executing "1 million cycle shader". The long shader (or rather the vertices/primitives/pixels that are running it) can hang around to help hide any latency that arises within the GPU from texture/memory fetches.
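
Conceptually I'm imagining something along these lines (a toy sketch, not how any real GPU scheduler is built):

Code:
/* Toy sketch: issue from the highest-priority context that has ready
 * (non-stalled) work, while older contexts stay resident to hide
 * memory latency. */
struct gpu_context {
    int priority;        /* higher = more urgent, e.g. the compositor  */
    int has_ready_work;  /* 1 if it has batches not stalled on a fetch */
};

static struct gpu_context *pick_next(struct gpu_context *ctx, int n)
{
    struct gpu_context *best = 0;
    for (int i = 0; i < n; ++i) {
        if (!ctx[i].has_ready_work)
            continue;
        if (!best || ctx[i].priority > best->priority)
            best = &ctx[i];
    }
    return best;  /* 0 means everything is stalled on memory */
}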

Are there any "guarantees" for the time to completion of work in WDDM 1.0 (2.0, 2.1, whatever)? Anything in the sub-second range (or sub-1/60th of a second)?

I'm thinking the issue here is responsiveness of the Aero Glass interface. What fractions of time are required to make the interface reasonably responsive?

I don't see how the OS (Vista) could require anything incredibly tight from the GPU - there has to be a range of performance across suitable graphics cards; they're not all going to be 1 TFLOP programmable monsters.

Jawed
 
Jawed said:
I'm thinking the issue here is responsiveness of the Aero Glass interface. What fractions of time are required to make the interface reasonably responsive?
Yep, this is the reason for hardware context switching in WDDM. I'm thinking in the range of 100ms but don't know for sure. It should say in the spec if someone is ambitious enough to look.
 
3dcgi said:
They're probably thinking about the future and thus 2.0 or 2.1. I believe 2.1 requires switching within a certain number of milliseconds so you might not be able to finish the current pixel if the shader is really long.

With WDDM 2.1, context switches should be possible at any time, even in the middle of a pixel shader program.
 
3dcgi said:
Yep, this is the reason for hardware context switching in WDDM. I'm thinking in the range of 100ms but don't know for sure. It should say in the spec if someone is ambitious enough to look.

A millisecond is a long period of time. An execution quantum should be more on the order of milliseconds, so switch times need to be more like a hundred or two hundred microseconds.

Think about the time budgets for video decode at x frames/second combined with desktop composition.
 
db said:
A millisecond is a long period of time. An execution quantum should be more on the order of milliseconds, so switch times need to be more like a hundred or two hundred microseconds.

Think about the time budgets for video decode at x frames/second combined with desktop composition.
I'd argue that you should think about the length of a graphics pipeline. 100 milliseconds is a reasonable response time for a person not to consider something slow.
 
3dcgi, but a hundred milliseconds is 50 million cycles on a 500MHz GPU... honestly you'd have to try really hard to make a context switch implementation that was that slow. Consider some kind of monster high-end GPU that actually needed to write/read 10 MB of state on every context switch: you'd have to manage to only read and write a single byte every 5 cycles...
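
Putting the same assumptions into numbers:

Code:
#include <stdio.h>

int main(void)
{
    const double clock_hz = 500e6;  /* assumed GPU clock               */
    const double budget_s = 0.1;    /* the 100 ms figure being debated */
    const double state_b  = 10e6;   /* hypothetical 10 MB of context   */

    const double cycles = clock_hz * budget_s;             /* 50 million  */
    printf("cycles in budget: %.0f\n", cycles);
    printf("cycles per byte : %.1f\n", cycles / state_b);  /* ~5 cycles/B */
    return 0;
}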
 
Npl said:
Nope, you are speaking about the GPU context switches, so what requirement am I missing?
This is what I objected to:
Not necessary, you could allow the GPU to get into a "switch state", which then needs a lot less information to back up; for example, let it finish all the triangles/vertices it started.
I should imagine that you can't wait for even a single triangle to complete. :cry:
 
Jawed said:
Surely there's a difference between "switching-in a context" within x milliseconds and "completing the batch of work that's been switched-in" within x milliseconds.

If a GPU is able to execute out of order and supports multiple concurrent render states (I'm thinking along the lines of how Xenos works) then when a new context is submitted to the GPU, it can start it up practically straight away. Surely this new context only needs to be given high-priority to be scheduled in preference to the currently executing "1 million cycle shader".
That's probably not possible if the new process needs more pages of (GPU) memory which aren't available, and has to force some other process to swap out some of its pages....
 
Ultimately, a new context and the existing contexts all have to share the GPU, don't they? Then it depends on the workload and available local RAM whether the GPU is forced to swap memory (out, to system RAM) or whether the contexts can co-exist.

That seems to be what's happening within Xenos (where I guess memory management is simpler). And what's demanded of WDDM 2.0 onwards.

Also, presumably there's always the option to "throw away" some state and put the relevant items back in the queue. With a fenced buffer, the queue isn't cleared between fences until the entire section of work has been completed. If a context is junked, then the GPU can move back to the prior fence for that context. Memory consumed by textures and render targets specifically allocated since the prior fence can be junked too. Partial results, e.g. in render targets that existed before the prior fence, need to be swapped-out if the memory is needed.
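
As a very rough sketch of the fence bookkeeping I have in mind (the names are mine, nothing to do with the actual WDDM structures):

Code:
#include <stddef.h>

struct fence {
    unsigned id;          /* monotonically increasing fence value */
    size_t   cmd_offset;  /* command-queue position at this fence */
};

struct context_queue {
    struct fence last_completed;  /* all work up to here is finished */
    size_t       write_offset;    /* where new commands are appended */
};

/* "Junk" a context's partially executed work: rewind to the last
 * completed fence and replay from there later. Memory allocated after
 * that fence (scratch render targets, etc.) can simply be freed;
 * older partial results would have to be swapped out instead. */
static size_t junk_and_rewind(const struct context_queue *q)
{
    return q->last_completed.cmd_offset;
}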

I suppose this is the difference between the WDDM 1.0 "presentation layer" being a single-threaded GPU app (everything runs through one context and the OS has to perform the equivalent of "AGP texturing" and cooperative multi-tasking of GPU work) and later versions where each OS app gets to request one or more contexts for itself.

Jawed
 
Simon F said:
That's probably not possible if the new process needs more pages of (GPU) memory which aren't available, and has to force some other process to swap out some of its pages....

While that scenario can happen, it can be separated from the mechanics of switching contexts in the same way it is on a CPU, i.e., the processor can fault in pages, causing additional context switches, so there isn't necessarily a need to ensure the previous working set is present. A context can be narrowly defined to be just the hardware pipeline state and not include the working set of the process.
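
Purely hypothetically, to make the distinction concrete:

Code:
/* Hypothetical: the "context" saved/restored on a switch is only the
 * hardware pipeline state; the process's working set in video memory
 * is not part of it, and missing pages are simply faulted in later. */
struct gpu_hw_context {
    unsigned pipeline_regs[256];  /* raster/shader/blend state, etc.    */
    unsigned page_table_base;     /* where this process's mappings live */
    /* note: no texture or render-target contents stored here */
};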
 
3dcgi said:
I'd argue that you should think about the length of a graphics pipeline. 100 milliseconds is a reasonable response time for a person not to consider something slow.

In a desktop composition scenario (e.g., Windows Vista), would it be fine to compose 30Hz video into a window at 10Hz?

If you want to time-slice between multiple interactive applications, you need to decide how many applications and at what rate. I think what you are proposing is not enough. A 2ms execution quantum with a 100us switch time is an interestingly aggressive target. This allows ~7 apps to run at 60Hz without introducing extra latency. For comparison, OSes on modern CPUs are more like a 10ms quantum with a couple of microseconds of switch time, but audio and other applications put pressure on the quantum to be much lower. A typical 1980s OS had a 16ms quantum. Will timeslicing the GPU be successful with larger numbers?
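
The budget arithmetic behind the ~7 apps figure (quantum and switch time assumed as above):

Code:
#include <stdio.h>

int main(void)
{
    const double frame_s   = 1.0 / 60.0;  /* one 60 Hz frame                */
    const double quantum_s = 2e-3;        /* assumed 2 ms execution quantum */
    const double switch_s  = 100e-6;      /* assumed 100 us switch cost     */

    /* apps that fit in one frame without adding a whole frame of latency */
    printf("apps per 60 Hz frame: %.1f\n", frame_s / (quantum_s + switch_s));
    return 0;
}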
 
db said:
This allows ~7 apps to run at 60Hz without introducing extra latency.
Well, there's not a whole lot of reason to worry about more than ~4 apps using more than a minuscule amount of 3D graphics. Typically you'll just have one, perhaps two in rare cases.

But yeah, ~100us switch time is more than enough for the switch penalty to be inconsequential. The switch time does, however, need to be much longer than it is for CPUs, because the pipeline depth of a GPU is on the order of a few hundred clock cycles. That places the amount of processing time wasted on a pipeline flush at somewhere around a few microseconds (this is just considering the pipeline flush time: there may be other penalties that lengthen it), so you need the time between switches to be much longer than that to avoid significant penalties due to switching.
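
Rough numbers for the flush cost (pipeline depth and clock are assumptions):

Code:
#include <stdio.h>

int main(void)
{
    const double pipe_depth = 500;    /* assumed: a few hundred stages/cycles */
    const double clock_hz   = 500e6;  /* assumed GPU clock                    */

    /* time lost just draining the pipeline; cache/state flushes and other
     * penalties would push this toward a few microseconds */
    printf("pipeline flush: %.2f us\n", 1e6 * pipe_depth / clock_hz);  /* ~1 us */
    return 0;
}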
 
psurge said:
3dcgi, but a hundred milliseconds is 50 million cycles on a 500MHz GPU... honestly you'd have to try really hard to make a context switch implementation that was that slow. Consider some kind of monster high-end GPU that actually needed to write/read 10 MB of state on every context switch: you'd have to manage to only read and write a single byte every 5 cycles...
Even GPUs with slower clocks will need to support context switching, but you're correct that 100ms is more than enough time to dump state after halting execution. I wasn't thinking correctly before and db is correct about the switch time being closer to 100us. Although after a quick search I couldn't find the exact requirement.
 