Mapping shaders to threads - good or bad idea?

Shader programs are going to get longer and ever more complicated. They will require loops, branches and lots of FLOPS to compute. However, each pixel can be computed independently of any other, and that independence provides the parallelism that makes hardware acceleration possible.

Currently, GPUs have multiple pipelines to exploit this parallelism. CPUs have been poor at it because of their single-threaded nature, which results in lower throughput. However, new designs may be changing this.

Sun's Niagara and Sony's Cell are probably the best examples. Both are designed for maximum throughput.

Niagara is a chip with eight cores, each running four independent threads, which makes it a 32-threaded CPU. Cell is also tailored for massive concurrent execution.

Intel and AMD are both migrating to multi-core, multi-threaded designs.

So the main question -- what differentiates a 32-pipeline GPU from a 32-threaded CPU?

Suppose the CPU did not have to take care of the OS and other apps, but spent all its time doing shader math; it could then be viewed essentially as 32 fully programmable pipelines. Perhaps a few times slower, but nothing like the difference between a Pentium and a Voodoo for texture-mapped graphics.
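
Just to make that mapping concrete, here is a minimal sketch of what I'm imagining -- the shader, resolution, and thread count below are placeholders I picked, not anything real:

```cpp
// Sketch: evaluate a pixel shader over an image, one interleaved set of
// rows per thread. Each pixel is computed independently, so no locking
// is needed and each thread acts like an independent pipeline.
#include <thread>
#include <vector>

struct Pixel { float r, g, b; };

// Placeholder "shader": any pure function of the pixel coordinates.
Pixel shade(int x, int y)
{
    float u = float(x) / 1024.0f;
    float v = float(y) / 768.0f;
    return Pixel{ u, v, u * v };   // stand-in for real shader math
}

int main()
{
    const int width = 1024, height = 768;
    const int numThreads = 32;     // e.g. a Niagara-style 8 cores x 4 threads
    std::vector<Pixel> framebuffer(width * height);

    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            // Threads never touch the same pixel, so no synchronization.
            for (int y = t; y < height; y += numThreads)
                for (int x = 0; x < width; ++x)
                    framebuffer[y * width + x] = shade(x, y);
        });
    }
    for (auto& w : workers) w.join();
}
```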

Does it make sense to map shaders to multi-cored CPU threads?

Does it make sense to map general programs to GPU pipelines?

I'd like to hear your thoughts.
 
You'll find much more insight at www.gpgpu.org :D

I think the main disadvantage of mapping shaders to a general-purpose processor is that the graphics pipeline is not FULLY programmable yet. Take rasterization, texture lookup, blending, z-buffering, etc., for example. None of these tasks fits the characteristics of a general-purpose processor well. If you leave these tasks to the GPU and let the CPU do the math, you'll need to transfer TONS of data between the two processing units, and I assume that will kill performance. And if you choose to implement these fixed-function tasks on the CPU, you'll waste a lot of the CPU's "intelligent" processing power.
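
As a rough illustration of how much data that would be, here's a back-of-envelope estimate; the resolution, overdraw, and per-fragment attribute size are assumptions I'm making up just for the sake of the numbers:

```cpp
// Back-of-envelope: bytes per second crossing the bus if the CPU did the
// shading math but rasterization, texturing and blending stayed on the GPU.
#include <cstdio>

int main()
{
    const double width              = 1024;  // assumed resolution
    const double height             = 768;
    const double overdraw           = 3;     // assumed average overdraw
    const double bytes_per_fragment = 16;    // assumed interpolants, e.g. 4 floats
    const double fps                = 60;

    double one_way = width * height * overdraw * bytes_per_fragment * fps;
    std::printf("~%.1f GB/s one way\n", one_way / 1e9);
    // Roughly 2.3 GB/s in one direction alone, and the shaded results still
    // have to come back -- already more than AGP 8x (~2.1 GB/s) can move.
    return 0;
}
```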
 
991060 said:
You'll find much more insight at www.gpgpu.org :D

I think the main disadvantage of mapping shaders to a general-purpose processor is that the graphics pipeline is not FULLY programmable yet. Take rasterization, texture lookup, blending, z-buffering, etc., for example. None of these tasks fits the characteristics of a general-purpose processor well. If you leave these tasks to the GPU and let the CPU do the math, you'll need to transfer TONS of data between the two processing units, and I assume that will kill performance. And if you choose to implement these fixed-function tasks on the CPU, you'll waste a lot of the CPU's "intelligent" processing power.

Yes, but the trend is that the amount of fixed-function graphics computation is shrinking toward nil. Even operations we take for granted as fixed function (e.g. texture sampling) are now programmable.

David Kirk noted that around 5% of a GPU's die is dedicated to fixed-function hardware. In the long run, graphics is bound by programmable FP units, not fixed-function blocks.

However, graphics is very bandwidth dependent. The CPU's memory system is certainly not as strong as the 128 MB of DRAM found on the graphics board.
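
To put very rough numbers on that (the bus widths and data rates below are just typical ballpark figures, not from any particular spec sheet):

```cpp
// Rough peak-bandwidth comparison: a typical CPU memory system versus a
// current high-end graphics board. Ballpark figures only.
#include <cstdio>

double peak_gb_per_s(double bus_width_bits, double transfers_per_s)
{
    return bus_width_bits / 8.0 * transfers_per_s / 1e9;
}

int main()
{
    // Dual-channel DDR400: 128-bit bus at 400 MT/s -> ~6.4 GB/s
    std::printf("CPU memory: ~%.1f GB/s\n", peak_gb_per_s(128, 400e6));
    // High-end board: 256-bit GDDR3 at ~1100 MT/s -> ~35 GB/s
    std::printf("GPU memory: ~%.1f GB/s\n", peak_gb_per_s(256, 1100e6));
    return 0;
}
```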
 
JF_Aidan_Pryde said:
Yes, but the trend is that the amount of fixed-function graphics computation is shrinking toward nil. Even operations we take for granted as fixed function (e.g. texture sampling) are now programmable.
Texture sampling is not programmable currently. And the amount of fixed-function computation is not shrinking; it is increasing in absolute terms, though shrinking relative to the programmable units. Put simply, there are many very common operations in GPUs that are just done faster in dedicated hardware. Just because the die area for these operations is now small doesn't mean they will move to any sort of general processing scheme anytime soon.

Now, what we may see is these dedicated units exposing some programmability in the future, but it won't be anything close to general. And before hardware designers actually do that, they'll need a compelling reason, and I'm not sure I know of one that would be efficient to implement in hardware.
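
For a sense of why dedicated hardware wins here, consider what a single bilinear texture fetch costs when done in software -- a plain sketch, assuming clamped addressing and float RGBA texels, just for illustration:

```cpp
// One bilinear texture sample in software: four texel reads plus three
// lerps per fetch -- work a dedicated texture unit does (along with
// addressing, caching and format conversion) essentially for free.
#include <algorithm>
#include <cmath>

struct Texel { float r, g, b, a; };

Texel lerp(const Texel& p, const Texel& q, float t)
{
    return { p.r + (q.r - p.r) * t, p.g + (q.g - p.g) * t,
             p.b + (q.b - p.b) * t, p.a + (q.a - p.a) * t };
}

Texel sampleBilinear(const Texel* tex, int w, int h, float u, float v)
{
    // Map normalized coordinates to texel space and clamp at the edges.
    float x = u * w - 0.5f, y = v * h - 0.5f;
    int x0 = std::clamp(int(std::floor(x)), 0, w - 1);
    int y0 = std::clamp(int(std::floor(y)), 0, h - 1);
    int x1 = std::min(x0 + 1, w - 1);
    int y1 = std::min(y0 + 1, h - 1);
    float fx = x - std::floor(x), fy = y - std::floor(y);

    Texel top    = lerp(tex[y0 * w + x0], tex[y0 * w + x1], fx);
    Texel bottom = lerp(tex[y1 * w + x0], tex[y1 * w + x1], fx);
    return lerp(top, bottom, fy);
}
```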
 