Optimal Automatic Multi-pass Shader Partitioning

nAo · Jun 24, 2005

It seems Sony guys are working on improving the Cg compiler, this is Alan Heirich's home page at Caltech
http://alumnus.caltech.edu/~heirich/activities.htm

See the first paper (to appear at Graphics Hardware 2005 this summer)
Optimal Automatic Multi-pass Shader Partitioning by Dynamic Programming
From the abstract:

Complex shaders must be partitioned into multiple passes to execute on GPUs with limited hardware resources.Automatic partitioning gives rise to an NP-hard scheduling problem that can be solved by any number of established
techniques.

Experimental results on a set of test cases with a commercial prerelease compiler for a popular high level shading language showed a DP algorithm had an average runtime cost of O(n1:14966) which is less than O(nlogn) on the region of interest in n. This demonstrates that efcient and optimal automatic shader partitioning can be an emergent byproduct of a DP-based code generator for a very high performance GPU

IMHO the popular high level shading language is Cg (Nvidia just released Cg 1.4) and the very high performance GPU is RSX.
RSX should support very long shaders but he wrote:

The DP solution is motivated by a study of a very high performance GPU that supports large and complex shaders. The size of these shaders implies multi-pass execution and motivates the search for a scalable partitioning algorithm.

Operations are scheduled into passes in order to respect resource limitations of the GPU. At pass boundaries intermediate results are spilled into a form of intermediate storage from which they can be retrieved in a subsequent pass.

This study is concerned with partitioning shaders for a very high performance GPU that has a very efcient intermediate storage mechanism. GPU resources that must be scheduled include live operands (register allocation), outstanding texture requests, instruction storage and rasterized interpolants. Every shader pass must observe the physical resource limits of the GPU with any excess storage requirements satised from intermediate storage between passes. Each of these resource types has architecture-specic considerations that influence the cost function and the location of pass boundaries.

Maybe he's using the 'multipass' term in a new way (new to me at least

), he sees (also) intermediate internal results storage as the main problem and he's trying to allievate it.

Jawed · Jun 24, 2005

Could the term "long shader" refer to cases where the number of registers exceeds the capability of the GPU?:

This situation may be resolved by virtualizing the GPU resources in time by partitioning the shader execution into passes that use the same resources for different purposes at different times.

Don't GPU compilers currently dynamically re-allocate register resources as execution proceeds?

Also, what results storage is required, other than the shader state itself: registers and program counter?

I don't understand ATI's F-buffer, but I'm curious if there's any connection. The F-buffer arose to solve the "long shaders" problem, didn't it?

I find this topic hard to reconcile with the notional capability of 512 instruction slots in SM3 hardware. But maybe that capability is just another marketing number.

Wouldn't a 512-instruction shader bring the GPU to a halt, anyway?

Jawed

bloodbob · Jun 24, 2005

Jawed said:
Could the term "long shader" refer to cases where the number of registers exceeds the capability of the GPU?

I was thinking more dependant texture reads atleast we need this on damn ATI chips.

I don't understand ATI's F-buffer, but I'm curious if there's any connection. The F-buffer arose to solve the "long shaders" problem, didn't it?

I could be wrong but I think F-buffer reduces overheads when doing multipassing on only parts of the screen or something.

Wouldn't a 512-instruction shader bring the GPU to a halt, anyway?

Real time full screen shaders probably but you could use it for either non-realtime stuff or at small resolutions.

Jawed · Jun 24, 2005

Ah, long shaders:

It was difficult to locate publicly available large shaders and this was the reason for generating the fragment shader test cases. In general purpose GPGPU applications shaders have been reported with over ten thousand operations. Unfortunately no such shaders in the language under study were available on the relevant public web sites.

It's also interesting that the paper refers to storing data in textures to get around resource limitations:

These algorithms also explore the option of recomputing multiply referenced nodes of the DAG rather than saving them to intermediate storage. This consideration is motivated by the high cost on desktop GPUs of intermediate storage in the form of textures. By recomputing rather than saving intermediate results the shader may execute in fewer passes at the cost of executing more instructions. On current desktop GPUs this usually leads to a faster runtime.

but saying that re-computation is often preferable. Makes me think of MEMEXPORT.

Jawed

Simon F · Jun 24, 2005

Jawed said:
but saying that re-computation is often preferable. Makes me think of MEMEXPORT.

If another part of the system is bottlenecked on memory bandwidth, then extra writes and read might actually degrade performance. <shrug>

Jawed · Jun 24, 2005

How would MEMEXPORT ever be useful, though? I suppose the implication from:

http://www.beyond3d.com/articles/xenos/index.php?p=10

is that MEMEXPORT would only be used in vertex shading, where competition with textures for memory bandwidth would be low. Except that vertex shaders would be running concurrently with pixel shaders which should be hammering memory bandwidth with texturing.

So, we're back to the basic question, why is MEMEXPORT a feature of Xenos if it could be implemented with texture writes/reads?

Jawed

Xmas · Jun 24, 2005

I guess the main "resource limits" to get around on RSX is the number of texture samplers and interpolators. Not the number of instructions or registers.

Farid · Jun 24, 2005

Jawed said:
is that MEMEXPORT would only be used in vertex shading

It seems that the answer, as far as MS is concerned is no.
Link

Todd Holmdahl said:
On Xbox 360, we do not plan for the CPU to do any vertex processing at all

RejZoR · Jun 24, 2005

Relief bumpmapping uses over 400 instructions. Not enough for long shader test?

nAo · Jun 24, 2005

It would help to schedule instructions even according to the amount of data register files can provide to the shading core (distribute registers file workload over more clock cycles avoiding 'spikes')

Jawed · Jun 24, 2005

Vysez said:
Jawed said:

is that MEMEXPORT would only be used in vertex shading

Click to expand...

It seems that the answer, as far as MS is concerned is no.
Link

Todd Holmdahl said:

On Xbox 360, we do not plan for the CPU to do any vertex processing at all

Click to expand...

MEMEXPORT can be used, for example, to create a new vertex list which is processed in a second pass by the GPU, not the CPU. There's no reason why the CPU would be interested, as far as I can tell.

Jawed

nAo · Jun 24, 2005

What if one can output data from a shader directly into CPU's L2 cache?

(on Xenos of course..)

ERP · Jun 24, 2005

Jawed said:
How would MEMEXPORT ever be useful, though? I suppose the implication from:

http://www.beyond3d.com/articles/xenos/index.php?p=10

is that MEMEXPORT would only be used in vertex shading, where competition with textures for memory bandwidth would be low. Except that vertex shaders would be running concurrently with pixel shaders which should be hammering memory bandwidth with texturing.

So, we're back to the basic question, why is MEMEXPORT a feature of Xenos if it could be implemented with texture writes/reads?

Jawed

Because you can do memexport at the same time as other writes.
For example you could write sideband data out for a second pass while doing the first.

It also allows writes to arbitrary addresses, allowing a "scatter" operation, which you can't do with render to texture.

If bus contention really is an issue You could memexport to the shared cache and use none of the "texture bandwidth", but I doubt texture bandwidth is going to be an issue if you have a lot of ALU ops in the shader anyway.

Jawed · Jun 25, 2005

ERP said:
It also allows writes to arbitrary addresses, allowing a "scatter" operation, which you can't do with render to texture.

Actually existing GPGPU techniques such as Brook GPU already implement "scatter" with texture writes.

http://graphics.stanford.edu/projects/brookgpu/lang.html

Jawed

ERP · Jun 25, 2005

Jawed said:
ERP said:

It also allows writes to arbitrary addresses, allowing a "scatter" operation, which you can't do with render to texture.

Click to expand...

Actually existing GPGPU techniques such as Brook GPU already implement "scatter" with texture writes.

http://graphics.stanford.edu/projects/brookgpu/lang.html

Jawed

Does that actually work, I can't see how you can scatter on existing GPU's. There is no concept of an indirect write, you write pixels in rasterization order.

I could see how you could permute or collapse them, afterwards, using indirect reads to reorder the stream, but that isn't scatter, although you could do it with N passes over the data where N is the number of destination locations.

EDIT: - I guess you could also render a single pixel quad for every output value. In fact that's probably what they are doing. Not exactly efficient.

They might implement the scatter on the CPU I guess.

arjan de lumens · Jun 25, 2005

ERP said:
EDIT: - I guess you could also render a single pixel quad for every output value. In fact that's probably what they are doing. Not exactly efficient.

They might implement the scatter on the CPU I guess.

You can use the GPU's vertex shader to compute the screen-space XY coordinate for the address you wish to write/scatter to, then fire off one 1x1-pixel point primitive for each write. No need to involve the CPU; you might even use render-to-vertex-array to compute the scatter addresses from within a pixel shader.

ERP · Jun 25, 2005

arjan de lumens said:
ERP said:

EDIT: - I guess you could also render a single pixel quad for every output value. In fact that's probably what they are doing. Not exactly efficient.

They might implement the scatter on the CPU I guess.

Click to expand...

You can use the GPU's vertex shader to compute the screen-space XY coordinate for the address you wish to write/scatter to, then fire off one 1x1-pixel point primitive for each write. No need to involve the CPU; you might even use render-to-vertex-array to compute the scatter addresses from within a pixel shader.

Not exactly efficient though.

And memexport will also allow you to write multiple output values for every source element.

arjan de lumens · Jun 25, 2005

ERP said:
arjan de lumens said:

ERP said:

EDIT: - I guess you could also render a single pixel quad for every output value. In fact that's probably what they are doing. Not exactly efficient.

They might implement the scatter on the CPU I guess.

Click to expand...

You can use the GPU's vertex shader to compute the screen-space XY coordinate for the address you wish to write/scatter to, then fire off one 1x1-pixel point primitive for each write. No need to involve the CPU; you might even use render-to-vertex-array to compute the scatter addresses from within a pixel shader.

Click to expand...

Not exactly efficient though.

And memexport will also allow you to write multiple output values for every source element.

I would suspect that both the vertex shader method and the MEMXPORT method of scattering would be severely address bandwidth limited - with GDDR3 memory and 2x crossbar as in Xenos, you run into a theoretical maximum of about 1 write per clock cycle, assuming that the memory is not used for anything else. For the vertex shader method, you are additionally triangle setup limited, with the tri-setup limited to ~1 element per clock cycle.

DemoCoder · Jun 26, 2005

MEMEXPORT-like mechanisms can be used to update datastructures shared by CPU in a UMA architecture IF you have integer datatypes. For example, you could bind a tree to a texture, use shaders to recurse subtrees of the tree in the texture, turn off framebuffer writes, and then use MEMEXPORT to write out the location of a node matching a criteria. (if you want to write more than one value, you have to pass the location to pixel or vertex shaders separately, so they don't overwrite each others values)

What you cannot do is write out a value, and read it back immediately in another neighboring pixel pipeline. You still can't share concurrent execution state except by multi-pass. There's no global register or scratchpad memory or IPC, except gradient.

Optimal Automatic Multi-pass Shader Partitioning

nAo

Nutella Nutellae

Jawed

bloodbob

Trollipop

Jawed

Simon F

Tea maker

Jawed

Xmas

Porous

Farid

Artist formely known as Vysez

RejZoR

nAo

Nutella Nutellae

Jawed

nAo

Nutella Nutellae

ERP

Jawed

ERP

arjan de lumens

ERP

arjan de lumens

DemoCoder

Similar threads