In a cycle...

OpenGL guy

Veteran
...a graphics chip fetches a number of vertices while n vertex shaders execute m instructions while the setup engine sets up l triangles while the k scan converters prepare j quads while the rasterizer interpolates i interpolants while h pixel shaders execute g instructions while the f texture units texture e pixels while the depth buffer tests d pixels while the blend unit blends c pixels. And while all this is happening, the memory controller is fetching and writing data for multiple clients.

Typical values for a high end chip:
n = 8
m = 1
l = 1
k = 4
j = 1
i = 8
h = 16
g = 3
f = 4
e = 4
d = 16
c = 16
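Tallying these figures gives a rough sense of the per-cycle parallelism; a minimal sketch (the groupings follow the pairs in the original sentence, and the total is just a count of concurrent operations, not FLOPS):

```python
# Per-cycle activity for the hypothetical high-end chip above.
units = {
    "vertex_shader_instrs": 8 * 1,   # n vertex shaders * m instructions each
    "triangles_setup": 1,            # l
    "quads_scan_converted": 4 * 1,   # k scan converters * j quads each
    "interpolants": 8,               # i
    "pixel_shader_instrs": 16 * 3,   # h pixel shaders * g instructions each
    "pixels_textured": 4 * 4,        # f texture units * e pixels each
    "depth_tests": 16,               # d
    "blends": 16,                    # c
}

# Rough count of things happening in a single clock, across all stages.
total = sum(units.values())
print(total)  # → 117
```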

Of course, this is not an exhaustive list as it doesn't account for early Z check, any sort of compression, alpha test, or anti-aliasing. Also, I could go into a lot more detail on certain steps, notably texturing, which includes filtering, anisotropy, and more. Setup also has to handle things like clipping and culling, so more steps are involved there as well. Fog may also be handled by the HW unless it's being done in the pixel shader.

Now consider how many of these steps happen at full FP32 precision and you can see why graphics chips are floating point monsters.

Hope this helps someone. :D
 
sorry but someone had to ask:
How about some values for an unannounced upcoming high end chip :?:
 
4 texture units?

There are only 4 texture units? I seem to remember R300 had 1 per pipe for 8 total. Has the number actually gone down since then?
 
Me said:
There are only 4 texture units? I seem to remember R300 had 1 per pipe for 8 total. Has the number actually gone down since then?
To avoid confusion, add an "each" for all the figures that come in pairs (nm, kj, hg, fe). It's four quad texture units.

Is i meant to be "quad interpolants", i.e. two per quad per clock?
 
Xmas said:
To avoid confusion, add an "each" for all the figures that come in pairs (nm, kj, hg, fe). It's four quad texture units.

Is i meant to be "quad interpolants", i.e. two per quad per clock?
It might be related to rough average rates based on anisotropic filtering, if it's not a typo :)
 
I wonder if the implication here is "and because they are such floating-point monsters, they'd be able to do physics just fine." But I'm goofy.
 
No, I think the implication is related to the "bilinear filtering on CPU" thread, that is, GPUs blow away CPUs at this workload and always will. I think Mint's "number of instrs and bytes fetched per pixel" calculations were probably a little clearer to the average person. But Mint was only accounting for PS/TEX; OGL Guy is trying to show how much FP power is being gobbled up even by fixed-function HW in the pipeline.
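A "per pixel" accounting in that style can be sketched like this (every figure below is an illustrative assumption for a simple shader, not a measured value):

```python
# Back-of-envelope: instructions and bytes touched per shaded pixel.
ps_instrs_per_pixel = 10           # pixel shader ALU instructions (assumed)
tex_fetches_per_pixel = 4          # bilinear fetches (assumed)
bytes_per_bilinear_fetch = 4 * 4   # 4 texels * 4 bytes (RGBA8)
zbuffer_bytes = 4 + 4              # depth read + write
framebuffer_bytes = 4 + 4          # color read + write (with blending)

bytes_per_pixel = (tex_fetches_per_pixel * bytes_per_bilinear_fetch
                   + zbuffer_bytes + framebuffer_bytes)
print(ps_instrs_per_pixel, bytes_per_pixel)  # → 10 80
```

The point stands either way: even before counting shader math, every pixel drags a lot of bandwidth and fixed-function work behind it.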
 
Me said:
There are only 4 texture units? I seem to remember R300 had 1 per pipe for 8 total. Has the number actually gone down since then?
I think it'd be better (and simpler) to think of things in terms of "quads". R300 had 2 quad pipes, each with its own texture unit. Obviously, it's a beefy texture unit as it has to service 4 pixels at a time.
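In other words, counting in quads rather than pixels gives the same totals (the names below are just illustrative):

```python
# R300-style organization: 2 quad pipes, each with one "quad" texture
# unit servicing 4 pixels per clock.
QUAD_PIPES = 2
PIXELS_PER_QUAD = 4

pixels_per_clock = QUAD_PIPES * PIXELS_PER_QUAD           # 8 pixels/clock
texture_samples_per_clock = QUAD_PIPES * PIXELS_PER_QUAD  # 1 bilinear sample per pixel
print(pixels_per_clock, texture_samples_per_clock)  # → 8 8
```

So "4 texture units for 8 pipes" and "1 texture unit per quad pipe" describe the same hardware, just at different granularities.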
 
Xmas said:
To avoid confusion, add an "each" for all the figures that come in pairs (nm, kj, hg, fe). It's four quad texture units.
Yeah, that's how I meant it. I could have been a bit more clear, but it was late :)
Is i meant to be "quad interpolants", i.e. two per quad per clock?
I meant 8 interpolants per quad.
 
DemoCoder said:
No, I think the implication is related to the "bilinear filtering on CPU" thread, that is, GPUs blow away CPUs at this workload and always will. I think Mint's "number of instrs and bytes fetched per pixel" calculations were probably a little more clearer to the average person. But Mint was just accounting PS/TEX, OGL Guy is trying to show how much FP power is being gobbled up even by fixed function HW in the pipeline.
Precisely.
 
OpenGL guy said:
...a graphics chip fetches a number of vertices while n vertex shaders execute m instructions ....

And in the near future, all this state has to be dumped out to memory... frequently. It is scary :oops:
 
Simon F said:
And in the near future, all this state has to be dumped out to memory... frequently. It is scary :oops:
Not necessarily: you could allow the GPU to get into a "switch state" which then needs a lot less information to back up, for example by letting it finish all triangles/vertices it started - in that case only render states and the display list would need to be saved, and no internals.

What I wanted to know though - do GPUs (or any other device on the bus, for that matter) actually know about the CPU's page table, or do they just see physical memory?
In the latter case it should be easier to switch tasks, as the GPU could finish workloads of the old task even while the CPU is already running the new one...
 
Npl said:
Not necessarily: you could allow the GPU to get into a "switch state" which then needs a lot less information to back up, for example by letting it finish all triangles/vertices it started - in that case only render states and the display list would need to be saved, and no internals.
Have you read the DX10 spec?
 
Simon F said:
Have you read the DX10 spec?
Nope. You're speaking about GPU context switches, so what requirement am I missing?
I don't think DX10 requires zero-cycle switches; I was just pointing out an example that would allow switching tasks at a coarser granularity. It's similar to task switches on CPUs, where you DON'T store the full caches and pipeline state; instead you simply write back dirty cache lines and wait till the pipeline is empty. Similarly, you could finish tasks in the GPU (be it the current pixel, the current vertex, all outstanding fetches... whatever) to avoid having to dump your whole state.

If I'm totally wrong, then I'm sorry, but I don't know where you're going with this, and I'm not going to read up on the DX10 specs, at least as long as you don't point me to the part in question ;)
 
Npl said:
Not necessarily: you could allow the GPU to get into a "switch state" which then needs a lot less information to back up, for example by letting it finish all triangles/vertices it started - in that case only render states and the display list would need to be saved, and no internals.
That would violate the DX10 spec. Say you're running a complex pixel shader that takes a million cycles per pixel... Do you really want your context switch to wait until all pixels are shaded?
 
OpenGL guy said:
That would violate the DX10 spec. Say you're running a complex pixel shader that takes a million cycles per pixel... Do you really want your context switch to wait until all pixels are shaded?
Maybe I would want the pixel to be finished, considering it could in turn save time and memory for reading/writing the context :smile:. But I see this would mean nondeterministic delays.
Just out of curiosity, what do the DX10 specs require?
 
Excuse me if this is a silly question, but how long is a typical cycle for a high-end chip like the one outlined in the original post? :)

Edit: Oh, and how does such a typical cycle time compare with a cycle in a modern x86(7?) CPU?
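The arithmetic is simple once you pick clock speeds (the figures below are illustrative assumptions for chips of that era, not quotes from the original post):

```python
# Cycle time = 1 / clock frequency, expressed in nanoseconds.
gpu_clock_hz = 500e6  # assume a ~500 MHz high-end GPU
cpu_clock_hz = 3e9    # assume a ~3 GHz x86 CPU

gpu_cycle_ns = 1e9 / gpu_clock_hz  # 2.0 ns per GPU cycle
cpu_cycle_ns = 1e9 / cpu_clock_hz  # ~0.33 ns per CPU cycle
print(gpu_cycle_ns, cpu_cycle_ns)
```

So a CPU cycle is several times shorter, but as the original post shows, the GPU does vastly more work in each of its cycles.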
 