Balancing DX10 shader load between CPU and low-end GPU

Per B

Newcomer
Looking at DX9 IGPs from SiS (and, if I'm right, Intel as well), they only implement PS in the graphics hardware and run VS on the CPU instead. Will this work in DX10 as well (in a unified shader architecture)? If it would work, then in a setup with a GPU with low shader performance and a dual- or quad-core CPU, one of the CPU cores could handle the vertex shading. Doable?
 
Looking at DX9 IGPs from SiS (and, if I'm right, Intel as well), they only implement PS in the graphics hardware and run VS on the CPU instead. Will this work in DX10 as well (in a unified shader architecture)? If it would work, then in a setup with a GPU with low shader performance and a dual- or quad-core CPU, one of the CPU cores could handle the vertex shading. Doable?

A unified shader architecture is just one that runs pixels and vertices through the same processing unit in order to try to do more with less silicon. A non-unified approach should still work.
 
I don't see why they couldn't pull off a similar hardware implementation and meet the D3D10 requirements.

Be careful not to confuse the Universal Shader Cores with Unified Hardware - whilst the latter might make sense, it's not a requirement of the API. You could obtain D3D10 compatibility by having physically separate VS/GS/PS hardware.

hth
Jack
 
So, would it be possible to answer the question of how many R6xx stream processors a single CPU core would equal? Do you see what I'm after: can some extra CPU cores balance a weaker GPU by taking over some of its tasks?
 
So, would it be possible to answer the question of how many R6xx stream processors a single CPU core would equal? Do you see what I'm after: can some extra CPU cores balance a weaker GPU by taking over some of its tasks?

Stream processors are specialized for gfx, while CPUs are general purpose.

I believe that when doing vertex shading, a current single-core CPU is as fast as 3-4 stream processors, but for pixel shading (which is very unlikely since all IGPs have PS hardware), it would be more like 0.35 of a stream processor.
 
I believe that when doing vertex shading, a current single-core CPU is as fast as 3-4 stream processors, but for pixel shading (which is very unlikely since all IGPs have PS hardware), it would be more like 0.35 of a stream processor.

So theoretically, even a future Nvidia or AMD IGP with a unified shader architecture could benefit from moving the vertex shader load to a dual- or quad-core CPU in cases where it is PS-limited?! Do you think this will be supported in the drivers?

Is there anything in SSE4 that will help improve VS performance? If not, why haven't Intel (or AMD) done that?
 
I believe that when doing vertex shading, a current single-core CPU is as fast as 3-4 stream processors...
Actually more like 12 stream processors if you do things right. Say we have a 2.0 GHz Core 2 Duo and we wish to use only one core to assist an Intel X3000 IGP with 8 stream processors at 667 MHz. That's 16 GFLOPS versus 10.7 GFLOPS.
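For reference, here's the rough peak-rate arithmetic behind those figures (my assumption being one 4-wide SSE multiply plus one 4-wide SSE add issued per clock on the Core 2 core, and one MAD per clock per stream processor on the IGP): 2.0 GHz x 8 FLOPs = 16 GFLOPS for the CPU core, versus 0.667 GHz x 8 SPs x 2 FLOPs ≈ 10.7 GFLOPS for the X3000.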

This isn't a rare combination for laptops...
 
Is there anything in SSE4 that will help improve VS performance? If not, why haven't Intel (or AMD) done that?
They've added a dot product instruction. However, if you're using SSE for scalar processing then it's of no use. They've also added instructions to compute the absolute value and the fractional part a little faster, but that's not all that significant (though still welcome).
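As a small illustration (a sketch, assuming SSE4.1 intrinsic support; dot4 is just my name for it), the new dot product instruction maps to the _mm_dp_ps intrinsic, which only pays off when a whole vec4 sits in one register:

#include <smmintrin.h>  // SSE4.1

// Dot product of two float4's with the DPPS instruction.
// Mask 0xF1: multiply all four lanes, put the sum in lane 0.
static inline float dot4(__m128 a, __m128 b)
{
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}

If the data is instead swizzled so that each register holds one component of four different vertices, the dot product already decomposes into plain multiplies and adds, and DPPS buys you nothing - which I take to be the "scalar processing" case mentioned above.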

What they should have added to make shaders on the CPU fly is a special function unit (approximations for log, exp, sin, cos, rsq, rcp). They've already got instructions for the latter two, but their precision isn't high enough.

Ideally, and now I'm dreaming away, they should make that SFU completely flexible by providing access to the lookup tables it uses. Another addition that would be worth its weight in gold is true scatter/gather: reading and writing the elements of an SSE vector in parallel, using the integer values in another register as offsets.
 
What they should have added to make shaders on the CPU fly is a special function unit (approximations for log, exp, sin, cos, rsq, rcp). They've already got instructions for the latter two, but their precision isn't high enough.
The last time I analysed VS code, those instructions actually weren't terribly common. Besides, a Newton-Raphson step will double the number of precise bits, so it doesn't take too many additional instructions to make a "rough guess" accurate.
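For the curious, the refinement step being referred to is tiny with SSE intrinsics (a sketch; rcp_nr_ps is just an illustrative helper name):

#include <xmmintrin.h>

// Reciprocal via the ~12-bit rcpps estimate plus one Newton-Raphson step,
// r1 = r0 * (2 - x * r0), which roughly doubles the accurate mantissa bits.
static inline __m128 rcp_nr_ps(__m128 x)
{
    __m128 r0 = _mm_rcp_ps(x);
    return _mm_mul_ps(r0, _mm_sub_ps(_mm_set1_ps(2.0f), _mm_mul_ps(x, r0)));
}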
 
Actually more like 12 stream processors if you do things right. Say we have a 2.0 GHz Core 2 Duo and we wish to use only one core to assist an Intel X3000 IGP with 8 stream processors at 667 MHz. That's 16 GFLOPS versus 10.7 GFLOPS.

But these 16 GFLOPS are achieved through emulation, since the vertex instructions aren't hardwired into the execution units like they are in the stream processors.

What they should have added to make shaders on the CPU fly is a special function unit (approximations for log, exp, sin, cos, rsq, rcp). They've already got instructions for the latter two, but their precision isn't high enough.

Something similar to this is coming in the second level of CPU-GPU Fusion. Check the fusion thread at the front page of this section. Anyway, just a small question: how much would the performance increase by using an SFU? Just an estimate, thank you.
 
The last time I analysed VS code, those instructions actually weren't terribly common.
Pow is pretty common for implementing HDR; it requires both a log and an exp. These take around 20 instructions each to approximate with SSE. Same for sine and cosine.
Besides, a Newton-Raphson step will double the number of precise bits, so it doesn't take too many additional instructions to make a "rough guess" accurate.
Yes, but it's still one instruction on the GPU versus several on the CPU. The GFLOP performance of CPUs is quite high really, but for anything other than multiplies and additions it only reaches a fraction of that. So some kind of SFU on the CPU could help a lot.
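To make the cost concrete, here's a sketch of how pow decomposes on the CPU (the two *_approx_ps helpers are hypothetical; in a real shader back-end each would be roughly 20 SSE instructions of exponent manipulation plus a small polynomial, but here they fall back to scalar libm so the snippet actually runs):

#include <xmmintrin.h>
#include <cmath>

// Placeholders: in a real shader JIT these would be ~20-instruction SSE
// approximations (split into exponent and mantissa, evaluate a polynomial).
static __m128 log2_approx_ps(__m128 x) {
    float v[4]; _mm_storeu_ps(v, x);
    for (int i = 0; i < 4; ++i) v[i] = std::log2(v[i]);
    return _mm_loadu_ps(v);
}
static __m128 exp2_approx_ps(__m128 x) {
    float v[4]; _mm_storeu_ps(v, x);
    for (int i = 0; i < 4; ++i) v[i] = std::exp2(v[i]);
    return _mm_loadu_ps(v);
}

// pow(x, y) = 2^(y * log2(x)): two transcendentals plus a multiply, which is
// why a single pow in a shader costs on the order of 40 SSE instructions.
static inline __m128 pow_approx_ps(__m128 x, __m128 y) {
    return exp2_approx_ps(_mm_mul_ps(y, log2_approx_ps(x)));
}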
 
But these 16 GFLOPS are achieved through emulation, since the vertex instructions aren't hardwired into the execution units like they are in the stream processors.
Without special function instructions you can get very close to that 16 GFLOPS in practice.
Something similar to this is coming in the second level of CPU-GPU Fusion. Check the fusion thread at the front page of this section. Anyway, just a small question: how much would the performance increase by using an SFU? Just an estimate, thank you.
Like I said, log and exp take about 20 instructions to approximate with SSE. So if an SFU can do it in one (fully pipelined)... It all depends on the ratio of add/mul operations versus special functions. But if you want a very rough estimate, I'd say it could be anywhere from a 50% to 150% speedup.
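As a rough illustration of where such a range could come from: assume a shader issues N add/mul instructions per transcendental, and each transcendental costs ~20 SSE instructions today versus 1 on a pipelined SFU. The speedup would then be roughly (N + 20) / (N + 1): about 2.5x (a 150% gain) at N = 12 and about 1.5x (50%) at N = 37.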
 
Like I said, log and exp take about 20 instructions to approximate with SSE. So if an SFU can do it in one (fully pipelined)... It all depends on the ratio of add/mul operations versus special functions. But if you want a very rough estimate, I'd say it could be anywhere from a 50% to 150% speedup.

Now the most important question is: is there any use of this SFU outside the gfx arena? It would only make sense to integrate it as part of a CPU core if most people are going to use it.

Secondly, does this also help in PS? (suppose some PS work is offloaded)

Thirdly, is your estimate of the performance gain based on only one core, or more than one?
 
Now the most important question is: is there any use of this SFU outside the gfx arena? It would only make sense to integrate it as part of a CPU core if most people are going to use it.

I would imagine that there would be a whole lot of very happy programmers out there if we got fast trig and other transcendental functions in the instruction set.


Secondly, does this also help in PS? (suppose some PS work is offloaded)

Sure, especially if there was a weighted averaging function added as well (or is there already?). Though dealing with texture fetch latency is still going to be a major problem.


Thirdly, is your estimate of the performance gain based on only one core, or more than one?

I would guess that's for one core. Gains from additional cores would still be a function of how well the code scales across multiple cores to begin with.
 
Now the most important question is: is there any use of this SFU outside the gfx arena?
Sure. But then the most important thing of all is to standardize it properly. A while ago some SSE code I had written was working perfectly on AMD processors, but not on Intel processors. The problem was that AMD offers 14 bits of mantissa precision for the rcpps instruction, while Intel offers only 12. The reason AMD offers 14 is that the 3DNow! equivalent instruction was specified that way, and obviously they reuse that logic.

Based on NVIDIA's slides for the G80 SFU, it looks like a high number of accurate mantissa bits is really possible. And given that a CPU can probably spend a few more transistors on those lookup tables, they might be able to go for full accuracy for single precision. High latency is OK, as long as it's pipelined (not like the current division and square root units).

If they can't reach that precision, they should specify that all non-accurate bits are forced to zero. Total invariance between implementations is very important for CPUs.
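For what it's worth, the vendor difference is easy to measure yourself; a small test along these lines (just a sketch) reports the accurate bit count of rcpps on whatever CPU it runs on:

#include <xmmintrin.h>
#include <cmath>
#include <cstdio>

// Measure the worst-case relative error of the rcpps estimate over [1, 2).
// -log2(worst error) gives the number of accurate mantissa bits on this CPU.
int main()
{
    float worst = 0.0f;
    for (float x = 1.0f; x < 2.0f; x += 0.00001f) {
        float approx;
        _mm_store_ss(&approx, _mm_rcp_ss(_mm_set_ss(x)));
        float rel = std::fabs(approx * x - 1.0f);   // |rcp(x) - 1/x| / (1/x)
        if (rel > worst) worst = rel;
    }
    std::printf("worst relative error: %g (~%.1f accurate bits)\n",
                worst, -std::log2(worst));
    return 0;
}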
Secondly, does this also help in PS?
Pixel shaders also contain transcendentals, so yes. Furthermore, perspective correction requires a fast and accurate reciprocal. The 12-bit estimate does not suffice, and a Newton-Raphson iteration costs clock cycles during which you can't issue other instructions.
Thirdly, is your estimate of the performance gain based on only one core, or more than one?
One core, one SFU. Why?
 
Sure, especially if there was a weighted averaging function added as well (or is there already?).
All x86 instructions take only two operands, so there's no possibility of having something like a lerp, unless they overhaul the whole architecture. Anyway, I don't see that as the most lacking instruction for pixel shaders. A gather instruction to read four texels at once would be much more useful: the rdi or rsi register could hold the base address, and an SSE register could store the offsets.
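To show what the two-operand limitation means in practice, here's lerp spelled out with SSE intrinsics (a sketch; lerp_ps is just an illustrative name):

#include <xmmintrin.h>

// lerp(a, b, t) = a + t * (b - a): three two-operand SSE instructions
// (sub, mul, add) where a three-operand instruction could do it in one.
static inline __m128 lerp_ps(__m128 a, __m128 b, __m128 t)
{
    return _mm_add_ps(a, _mm_mul_ps(t, _mm_sub_ps(b, a)));
}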
Though dealing with texture fetch latency is still going to be a major problem.
CPUs have been doing great with caches and prefetch so far. Just compute the addresses (this is where gather could be of great help), prefetch, do something else, then filter.
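A sketch of that pattern with today's instruction set (the helper names and the flat float-texel layout are my own assumptions); the four scalar loads in gather_ps are exactly what a real gather instruction would collapse into a single op:

#include <xmmintrin.h>

// Emulated gather: fetch four texels given a base pointer and four integer
// offsets. A true gather instruction would do this in one instruction.
static inline __m128 gather_ps(const float* base, const int offs[4])
{
    return _mm_set_ps(base[offs[3]], base[offs[2]],
                      base[offs[1]], base[offs[0]]);
}

// Hide latency as described above: compute addresses, prefetch, do something
// else, then come back and filter.
static inline void prefetch_texels(const float* base, const int offs[4])
{
    for (int i = 0; i < 4; ++i)
        _mm_prefetch(reinterpret_cast<const char*>(base + offs[i]), _MM_HINT_T0);
}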
 
Will the G80's SFU work the same way when a double-precision result is required? I dare say there's a distinct possibility that the lookup tables would need to explode in size to cater for DP, and so the SFUs might revert to an iterative method for DP results.

Interestingly I was just reading this evening how Barcelona will only iterate as long as it needs to, effectively checking that it's achieved the required accuracy and then exiting early.

Jawed
 
In the vertex shader or the pixel shader? I would have thought the latter.
Well, recently, yes. Vertex shaders are getting shorter, while pixel shaders are getting longer. But now and then there's still some kind of gamma correction or exposure control done in the vertex shader. Exp/log are also used for various fog effects.

Vertex shaders also often normalize vectors or compute their length. This requires another Newton-Raphson iteration to achieve the required precision. Sine and cosine are also not that rare in vertex shaders.
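For illustration, this is roughly what that normalization looks like with SSE intrinsics (a sketch, assuming a structure-of-arrays layout with four vectors processed at once; the rsqrtps estimate alone is only ~12-bit, hence the extra Newton-Raphson step):

#include <xmmintrin.h>

// Normalize four 3-component vectors at once (SoA layout: all x's in one
// register, etc.). One Newton-Raphson step refines the ~12-bit rsqrtps
// estimate: r1 = 0.5 * r0 * (3 - d * r0 * r0).
static inline void normalize4_ps(__m128& x, __m128& y, __m128& z)
{
    __m128 d = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)),
                          _mm_mul_ps(z, z));                  // squared length
    __m128 r = _mm_rsqrt_ps(d);                               // rough 1/sqrt(d)
    r = _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), r),
                   _mm_sub_ps(_mm_set1_ps(3.0f),
                              _mm_mul_ps(d, _mm_mul_ps(r, r)))); // refined
    x = _mm_mul_ps(x, r);
    y = _mm_mul_ps(y, r);
    z = _mm_mul_ps(z, r);
}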

It only takes one transcendental instruction to significantly decrease vertex processing performance on the CPU. But that's just my experience...
 