In a cycle...

OpenGL guy · Jul 19, 2006

...a graphics chip fetches a number of vertices while n vertex shaders execute m instructions while the setup engine sets up l triangles while the k scan converters prepare j quads while the rasterizer interpolates i interpolants while h pixel shaders execute g instructions while the f texture units texture e pixels while the depth buffer tests d pixels while the blend unit blends c pixels. And while all this is happening, the memory controller is fetching and writing data for multiple clients.

Typical values for a high end chip:
n = 8
m = 1
l = 1
k = 4
j = 1
i = 8
h = 16
g = 3
f = 4
e = 4
d = 16
c = 16

Of course, this is not an exhaustive list, as it doesn't account for early Z check, any sort of compression, alpha test, or anti-aliasing. Also, I could go into a lot more detail on certain steps, notably texturing, which includes filtering, anisotropy, and more. Setup also has to handle things like clipping and culling, so more steps are involved there as well. Fog may also be handled by the HW unless it's being done in the pixel shader.

Now consider how many of these steps happen at full FP32 precision and you can see why graphics chips are floating point monsters.
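As a rough illustration of the figures above, here is a minimal sketch that multiplies out the paired values to get per-clock totals. It assumes the "each" reading of the pairs (n vertex shaders executing m instructions each, and so on) and a quad of 4 pixels; the dictionary and function names are made up for this example.

```python
# Per-clock figures copied from the list above.
TYPICAL = {"n": 8, "m": 1, "l": 1, "k": 4, "j": 1, "i": 8,
           "h": 16, "g": 3, "f": 4, "e": 4, "d": 16, "c": 16}

PIXELS_PER_QUAD = 4  # assumption: a quad is a block of 4 pixels


def per_clock_totals(v):
    """Multiply out the paired figures, reading each pair as 'units x work each'."""
    return {
        "vertex_shader_instructions": v["n"] * v["m"],
        "triangles_set_up": v["l"],
        "quads_scan_converted": v["k"] * v["j"],
        "pixels_scan_converted": v["k"] * v["j"] * PIXELS_PER_QUAD,
        "pixel_shader_instructions": v["h"] * v["g"],
        "pixels_textured": v["f"] * v["e"],
        "pixels_depth_tested": v["d"],
        "pixels_blended": v["c"],
    }


totals = per_clock_totals(TYPICAL)
for name, value in totals.items():
    print(f"{name}: {value}")
```

Under these assumptions the scan converters, texture units, depth buffer, and blend unit all come out at 16 pixels per clock, and the pixel shaders at 48 instructions per clock.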

Hope this helps someone.

rwolf · Jul 20, 2006

How about AA and AF?

hoom · Jul 20, 2006

sorry but someone had to ask:
How about some values for an unannounced upcoming high end chip?

Me · Jul 20, 2006

4 texture units?

There are only 4 texture units? I seem to remember R300 had 1 per pipe for 8 total. Has the number actually gone down since then?

Xmas · Jul 20, 2006

Me said:
There are only 4 texture units? I seem to remember R300 had 1 per pipe for 8 total. Has the number actually gone down since then?

To avoid confusion, add an "each" for all the figures that come in pairs (nm, kj, hg, fe). It's four quad texture units.

Is i meant to be "quad interpolants", i.e. two per quad per clock?

KimB · Jul 20, 2006

Xmas said:
To avoid confusion, add an "each" for all the figures that come in pairs (nm, kj, hg, fe). It's four quad texture units.

Is i meant to be "quad interpolants", i.e. two per quad per clock?

It might be related to rough average rates based on anisotropic filtering, if it's not a typo.

Tim Murray · Jul 20, 2006

I wonder if the implication here is "and because they are such floating-point monsters, they'd be able to do physics just fine." But I'm goofy.

DemoCoder · Jul 20, 2006

No, I think the implication is related to the "bilinear filtering on CPU" thread, that is, GPUs blow away CPUs at this workload and always will. I think Mint's "number of instrs and bytes fetched per pixel" calculations were probably a little clearer to the average person. But Mint was just accounting for PS/TEX; OGL Guy is trying to show how much FP power is being gobbled up even by fixed function HW in the pipeline.

OpenGL guy · Jul 20, 2006

Me said:
There are only 4 texture units? I seem to remember R300 had 1 per pipe for 8 total. Has the number actually gone down since then?

I think it'd be better (and simpler) to think of things in terms of "quads". R300 had 2 quad pipes, each with its own texture unit. Obviously, it's a beefy texture unit, as it has to service 4 pixels at a time.

OpenGL guy · Jul 20, 2006

Xmas said:
To avoid confusion, add an "each" for all the figures that come in pairs (nm, kj, hg, fe). It's four quad texture units.

Yeah, that's how I meant it. I could have been a bit more clear, but it was late.

Is i meant to be "quad interpolants", i.e. two per quad per clock?

I meant 8 interpolants per quad.

OpenGL guy · Jul 20, 2006

DemoCoder said:
No, I think the implication is related to the "bilinear filtering on CPU" thread, that is, GPUs blow away CPUs at this workload and always will. I think Mint's "number of instrs and bytes fetched per pixel" calculations were probably a little clearer to the average person. But Mint was just accounting for PS/TEX; OGL Guy is trying to show how much FP power is being gobbled up even by fixed function HW in the pipeline.

Precisely.

Simon F · Jul 21, 2006

OpenGL guy said:
...a graphics chip fetches a number of vertices while n vertex shaders execute m instructions ....

And in the near future, all this state has to be dumped out to memory... frequently. It is scary

Npl · Jul 21, 2006

Simon F said:
And in the near future, all this state has to be dumped out to memory... frequently. It is scary

Not necessarily. You could let the GPU enter a "switch state" that needs a lot less information backed up. For example, let it finish all the triangles/vertices it has started; in that case only the render states and display list would need to be saved, and no internals.

What I wanted to know, though: do GPUs (or any other device on the bus, for that matter) actually know about the CPU's page table, or do they just see physical memory?
In the latter case it should be easier to switch tasks, since even when the CPU is already running the new task the GPU could finish workloads of the old one...

Simon F · Jul 21, 2006

Npl said:
Not necessarily. You could let the GPU enter a "switch state" that needs a lot less information backed up. For example, let it finish all the triangles/vertices it has started; in that case only the render states and display list would need to be saved, and no internals.

Have you read the DX10 spec?

Demirug · Jul 21, 2006

Simon F said:
Have you read the DX10 spec?

Maybe I am misunderstanding you. Are you talking about render context switching?

Npl · Jul 21, 2006

Simon F said:
Have you read the DX10 spec?

Nope. You are talking about GPU context switches, so what requirement am I missing?
I don't think DX10 requires 0-cycle switches; I was just pointing out an example that would allow switching tasks at a coarser granularity. It's similar to task switches on CPUs, where you DON'T store full caches and the pipeline state, but simply write back dirty cache lines and wait until the pipeline is empty. Similarly, you could finish tasks in the GPU (be it the current pixel, the current vertex, finishing all fetches... whatever) to avoid having to dump your whole state.

If I'm totally wrong, then I'm sorry, but I don't know where you're going with this, and I'm not going to read up on the DX10 specs unless you point me to the part in question.

OpenGL guy · Jul 21, 2006

Npl said:
Not necessarily. You could let the GPU enter a "switch state" that needs a lot less information backed up. For example, let it finish all the triangles/vertices it has started; in that case only the render states and display list would need to be saved, and no internals.

That would violate the DX10 spec. Say you're running a complex pixel shader that takes a million cycles per pixel... Do you really want your context switch to wait until all pixels are shaded?
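To put a rough number on that, here is a back-of-the-envelope sketch of how long a "drain before switching" scheme would wait. The clock rate is a made-up placeholder (no clock speed is given in this thread), the function name is hypothetical, and the model assumes each of the 16 pixel shaders works on one pixel at a time with the work dividing evenly, which is a simplification.

```python
import math

CLOCK_HZ = 500e6  # assumed placeholder clock, not a figure from this thread


def drain_cycles(pixels_in_flight, cycles_per_pixel, pixel_shaders):
    """Cycles needed to finish every in-flight pixel before switching context."""
    batches = math.ceil(pixels_in_flight / pixel_shaders)
    return batches * cycles_per_pixel


# One batch of 16 pixels already in the shaders.
one_batch = drain_cycles(16, 1_000_000, 16)
print(f"one batch: {one_batch / CLOCK_HZ * 1e3:.1f} ms")

# Waiting for a whole 1024x768 screen of such pixels.
full_screen = drain_cycles(1024 * 768, 1_000_000, 16)
print(f"full screen: {full_screen / CLOCK_HZ:.1f} s")
```

Even under these generous assumptions, finishing just the pixels currently in the shaders costs a million cycles, and waiting for everything already submitted is far worse.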

Npl · Jul 21, 2006

OpenGL guy said:
That would violate the DX10 spec. Say you're running a complex pixel shader that takes a million cycles per pixel... Do you really want your context switch to wait until all pixels are shaded?

Maybe I would want the pixel to be finished, considering it could in turn save time and memory for reading/writing the context. But I see this would mean nondeterministic delays.
Just out of curiosity, what do the DX10 specs require?

Bludd · Jul 21, 2006

Excuse me if this is a silly question, but how long is a typical cycle for a high end chip like the one outlined in the original post?

Edit: Oh, and how does such a typical cycle time compare with a cycle in a modern x86(7?) CPU?

KimB · Jul 21, 2006

There's, er, one clock cycle.
