How does the NV30 really store PS programs?

Hello everyone,

If there's one thing many have been deceived by, it's the NV30's way-too-low speed for long PS programs.
But what are the differences compared to the NV20 & R300 which could cause such problems?
Compared to the R300, the NV30 has an architecture which can execute multiple instructions at the same time. But at the same time, the GFFX seems to be able to work on more pixels at once (according to David Kirk at ExtremeTech, the NV30 actually works on 32 pixels at the same time!) - so it might balance out.

What other difference is there? Well, according to some documents, the NV30 stores its 1024 PS Instructions in local memory.

But what does "storing the PS instructions in local memory" *really* mean?
If you read the instructions from memory for each pixel, you'd need more bandwidth for that operation than for any other operation in the GPU. Probably even more than all of them combined. So that's obviously ridiculous!

My question to you, thus, is: how do you think the NV30 really handles this?

I've got an explanation, but it's just an idea and it's likely to be wrong. I'd welcome any feedback.

Here's my explanation: in past architectures, the PS programs were sent via AGP each time they changed. In the NV30, all PS programs the hardware will use are stored in local memory, in order to reduce stall time when you switch Pixel Shaders frequently. This can also be very useful in cases where the game is rendered front-to-back, without caring about PS switching.

Now, that's a very conservative assumption. Such a system would take little memory bandwidth, and very little memory.

BTW, I've been wondering two things lately:
1. In current architectures, is AGP used to transmit the Pixel Shader each time it changes?
2. Does each pixel processing unit have its own copy of all the instructions used in the program? Or is there one global instruction pool for all pixel processors?


Thanks for reading,


Uttar
 
1. Sure, Pixel Shaders are TINY compared to things like textures and geometry.

2. There are probably multiple copies of the program. As I said, they're tiny, so it'd be no big deal.

"Storing in local memory" means just that. The programs are stored on the card somewhere, just like textures and geometry would be. It's probably in some very fast bit of local memory.
 
Most likely it's some kind of cache design. They can store a limited number of instructions on chip so that "general" programs run very fast. Very large programs will be slow to execute anyway, so it doesn't matter that much that you need to go and read instruction blocks from local memory.

The issue is that because of latency (e.g. texture fetches) you already need to work on quite a few pixels in parallel (a texture fetch will take 10s or 100s of clocks to return from the cache and filter logic), and since you apply the same instruction to all these pixels, you have quite a bit of time to grab that instruction from external memory. The problem is if a pixel shader is applied to a low number of pixels (small triangles): in that case both your latency hiding and your program loading can run into trouble.
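
To put rough numbers on that reasoning (the figures below are made up for illustration, not vendor data), the number of pixels a pipe must keep in flight is roughly the fetch latency divided by the issue rate:

Code:
#include <cstdio>

// Back-of-envelope latency-hiding model. All numbers are assumptions
// for illustration; real fetch latencies and issue rates are not public.
int main() {
    const int fetch_latency_clocks = 100; // assumed: 10s-100s of clocks
    const int issue_rate = 1;             // instructions/clock per pipe
    const int pixels_needed = fetch_latency_clocks / issue_rate;

    // A small triangle covering fewer pixels than this leaves the pipe
    // idle -- the same reason program loading becomes harder to hide.
    const int triangle_pixels = 16;
    std::printf("pixels in flight needed: %d\n", pixels_needed);
    std::printf("a %d-pixel triangle leaves %d slots idle\n",
                triangle_pixels, pixels_needed - triangle_pixels);
    return 0;
}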

Now, pixel shader instructions are not small; the instruction size is a lot bigger than you might think due to all the flexibility and the huge number of constants and registers.

Also, IIRC the NV30 stores constants directly in the instruction, so 4-component float numbers are stored as part of the instruction. This means very large instructions (lots of bytes) and also causes problems if a program changes a constant between sets of polygons (you need to change the whole program using the CPU, instead of just sending a single updated constant value).
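
To illustrate why baked-in constants would hurt (a sketch with hypothetical encodings; the real NV30 instruction format is not public):

Code:
#include <cstdint>
#include <cstring>
#include <vector>

struct InstrWithConstantRef {   // constant referenced via register index
    uint32_t opcode;
    uint32_t constRegister;     // index into a separate constant file
};

struct InstrWithBakedConstant { // constant stored inside the instruction
    uint32_t opcode;
    float    constant[4];       // 16 extra bytes in every such instruction
};

// With a constant file, an update is a single small write:
void UpdateConstant(float constFile[][4], int reg, const float v[4]) {
    std::memcpy(constFile[reg], v, sizeof(float) * 4);
}

// With baked constants, the CPU must patch every instruction that used
// the old value and then re-send the whole program to the card:
void UpdateBakedConstant(std::vector<InstrWithBakedConstant>& program,
                         const float oldV[4], const float newV[4]) {
    for (auto& instr : program)
        if (std::memcmp(instr.constant, oldV, sizeof(float) * 4) == 0)
            std::memcpy(instr.constant, newV, sizeof(float) * 4);
}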

All IMHO...
 
GeForce FX stores pixel shaders in video memory; vertex shaders, on the other hand, are still stored on chip as a "bunch of states". This means that the FX reads some instructions from video memory, stores them in a cache, and then runs them. You need to know, however, that all pixel pipes always execute the same instruction (since there is no branching yet).
I think the 1024-instruction limit on the GeForce FX is more an artificial limit than anything else (the Quadro FX can do 2048 instructions)...
 
Hmm, interesting.
So you people think that the GFFX probably has a significantly smaller cache (how much? 10 or 25 instructions maybe?) and that it would load the Pixel Shader program in multiple blocks?

Interesting theory. But could that mean that the GFFX might be *bandwidth* limited with a high number of instructions, particularly when running integer/FP16?


Uttar
 
Here is some info from the ATI Radeon SDK:

Code:
6. Optimizing Shaders
Modern graphics chips offer enormous vertex and pixel processing power; nevertheless
there are times when even that power is not enough. When running long and complex
shaders it is possible to exhaust all that power and make shader processing a bottleneck.
Another "opportunity" to limit performance hides in inefficient shader management. This
section will deal with both of these obstacles.
Behind the scenes shader processing
The vertex and pixel shaders in DirectX® 9 are defined as streams of tokens, each token
representing an op-code of an assembly instruction or macro. This is how they are passed by
the application to the shader creation functions of the API. This is also how the driver
receives them. None of the macros are expanded by the runtime, and it is up to the driver
how to deal with them. If hardware natively supports a macro, it will be executed as is,
otherwise it will be expanded into a series of simpler instructions.
A common misconception is that hardware shader implementation exactly matches the
shader assembly or op-codes as defined by DirectX®. The direct mapping of the shader
code to the hardware might not result in the most efficient shader implementation, and
hardware uses many tricks to provide the best performance possible. You should think of
the DirectX® shaders as p-code (pseudo code) programs that are passed to the back-end
compiler implemented in the driver. The driver compiles the shaders to the hardware
native instructions and runs the compiled shader through the optimizer. The optimizer knows
many intricate details about hardware implementation and is able to allocate registers and
schedule instructions in the most efficient way. The following sections will explain some
of the hardware implementation details and what can be done to help the driver optimize
your shaders.
6.1. Shader Management
Shader switching is one of the most expensive state changes. Batching rendering by
vertex shader is always a good idea. When switches between shaders are inevitable, try
limiting frequent switches only to recently used smaller shaders as driver and hardware
can more effectively cache them. Switching between fixed function and programmable
pipeline is in most cases more expensive than switching between equivalent shaders
because of the extra driver overhead.
The shader compilation and optimization in the driver is quite a complex and expensive
process and it will become even more expensive as shaders grow in size and shader
models become more complex. Because of that it is a bad idea to compile too many
shaders on the fly. Try pre-creating as many shaders upfront as possible.
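
As an illustration of the pre-creation advice, a minimal Direct3D 9 sketch (the cache layout and names are hypothetical; only the D3DX/D3D9 calls are real):

Code:
#include <d3d9.h>
#include <d3dx9.h>
#include <cstring>
#include <map>
#include <string>

// Compile and create shaders in a load phase so the driver's expensive
// compile-and-optimize step never happens in the render loop.
std::map<std::string, IDirect3DPixelShader9*> g_shaderCache;

bool PrecreateShader(IDirect3DDevice9* dev, const std::string& name,
                     const char* source) {
    ID3DXBuffer* code = NULL;
    ID3DXBuffer* errors = NULL;
    if (FAILED(D3DXCompileShader(source, (UINT)std::strlen(source),
                                 NULL, NULL, "main", "ps_2_0",
                                 0, &code, &errors, NULL))) {
        if (errors) errors->Release();
        return false;
    }
    if (errors) errors->Release();

    IDirect3DPixelShader9* shader = NULL;
    HRESULT hr = dev->CreatePixelShader(
        (const DWORD*)code->GetBufferPointer(), &shader);
    code->Release();
    if (FAILED(hr)) return false;

    g_shaderCache[name] = shader;  // render loop only calls SetPixelShader
    return true;
}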
6.2. Shader Constant Management
Updating high volumes of shader constants can add a considerable amount of overhead to
the drivers. The following strategies can help reduce the driver overhead associated with
constant updates. When there are a lot of scalar constant updates, pack these scalar values
into vectors. This should reduce the number of scalar constant updates by a factor of four.
When picking locations for the frequently updated constants, do not scatter them across
the whole constant store. This will allow constant updates to happen in continuous
ranges, which should reduce runtime and driver overhead. Consider fragmenting the
constant store into 4 or 8 constant chunks and updating these chunks atomically. That is,
if you have to update every other constant in some constant range, it is better to update
the whole range at once than to update each changed constant individually.
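
For illustration, both strategies in one minimal Direct3D 9 sketch (the register indices and parameter names are arbitrary):

Code:
#include <d3d9.h>

void UpdateFrameConstants(IDirect3DDevice9* dev,
                          float time, float scale, float bias, float phase,
                          const float frequentVectors[4][4]) {
    // Pack four scalars into one float4 register instead of issuing
    // four separate scalar updates:
    const float packed[4] = { time, scale, bias, phase };
    dev->SetPixelShaderConstantF(0, packed, 1);            // c0

    // Keep frequently updated constants contiguous (c1..c4 here) and
    // update the whole chunk atomically in a single call:
    dev->SetPixelShaderConstantF(1, &frequentVectors[0][0], 4);
}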
6.3. Optimizing Vertex Shaders
When it comes to optimizing vertex shaders, only a few optimizations apply. The reason
for that is the driver shader optimizer, which does a pretty good job of optimizing shaders.
One subtle vertex shader optimization is to output from the shader only what you need.
For instance the shader can export duplicated texture coordinates only to fetch two
different textures with the same coordinates. Many developers still do that; however 1.4
and especially 2.0 pixel shaders allow decoupling of texture coordinate sets from texture
samplers. Just export unique texture coordinate values from the vertex shaders and use
pixel shaders to do proper texture coordinate mapping. Also, when outputting texture
coordinates use write masks to indicate how many texture coordinate components have to
be interpolated and passed down to pixel shaders.
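
A minimal sketch of that advice in shader assembly (held in C++ strings; the register assignments are arbitrary): the vertex shader exports one texture coordinate set with a .xy write mask, and the 2.0 pixel shader reuses it for two samplers:

Code:
const char* g_vs =
    "vs_2_0\n"
    "dcl_position v0\n"
    "dcl_texcoord v7\n"
    "m4x4 oPos, v0, c0\n"
    "mov oT0.xy, v7\n";    // write mask: only two components interpolated

const char* g_ps =
    "ps_2_0\n"
    "dcl t0.xy\n"
    "dcl_2d s0\n"
    "dcl_2d s1\n"
    "texld r0, t0, s0\n"   // both samplers reuse the same coordinates,
    "texld r1, t0, s1\n"   // so the VS does not export a duplicate set
    "mul r0, r0, r1\n"
    "mov oC0, r0\n";
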
Fixed function vs. programmable pipelines
RADEON 8500/9000 chips have implemented both fixed function
and programmable vertex processing in the silicon. Using fixed
function with these chips can be slightly more efficient than using
vertex shaders because of the optimized hardware implementation of
the TnL pipeline. Using fixed function TnL also simplifies shader
management and reduces the associated application and driver
overhead. Having said that, shaders can be a better solution if used to
pack vertex data or take some "shortcuts" in the vertex computations. There is no golden
rule as the ultimate solution depends on shader usage and can only be found through
extensive experimentation.
RADEON 9500/9700 on the other hand has only a programmable
pipeline implemented in the hardware, and fixed function TnL is
emulated with the vertex shaders. This means that for DirectX® 9
class hardware there is no advantage in using fixed function
functionality. Using flow control available in 2.0 vertex shaders
solves the problem with shader management and allows the application to toggle lights,
texture transforms and other parameters as easily as with the fixed function pipeline.
Use of flow control in VS 2.0
As flexible and as powerful as the 1.0-1.1 vertex shaders are, they can
also be a great nuisance. Rarely only a single shader is used – some
objects require per-vertex lighting with one spot and one directional
light, while others need tangent space setup and texture coordinate
generation, and so on. By the time you consider all possible
permutations that might be required, the number of shaders becomes
astronomical. This is where 2.0 vertex shaders come in handy. With the
addition of static flow control, the shader model has gained a robust mechanism for shader
management. Instead of swapping a huge number of very specific shaders it is much
better to write just a couple of universal shaders with flow control and replace expensive
shader switches with lightweight boolean constant updates. On RADEON 9500/9700
flow control instructions are essentially free; however, some performance degradation
might still occur due to the somewhat limited scope of performance optimizations.
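
A minimal sketch of such a "universal" shader (register choices and the lighting math are simplified for illustration):

Code:
#include <d3d9.h>

// vs_2_0 static flow control: the branch condition is a boolean
// constant set from the CPU, so one shader covers both cases.
const char* g_universalVS =
    "vs_2_0\n"
    "dcl_position v0\n"
    "dcl_normal v3\n"
    "m4x4 oPos, v0, c0\n"
    "mov r0, c8\n"               // start with the ambient term
    "if b0\n"                    // directional light enabled?
    "  dp3 r1.x, v3, c9\n"       // N.L
    "  max r1.x, r1.x, c10.x\n"  // clamp against zero held in c10.x
    "  mad r0, r1.x, c11, r0\n"  // accumulate the light color
    "endif\n"
    "mov oD0, r0\n";

// Instead of an expensive shader switch, flip a lightweight boolean:
void EnableDirectionalLight(IDirect3DDevice9* dev, BOOL on) {
    dev->SetVertexShaderConstantB(0, &on, 1);
}
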
Co-issue in vertex shaders
Radeon 9500/9700 has a very interesting vertex processor unit design.
Each of the vertex processors has two math engines, one vector and
one scalar, that can process a vector and a scalar instruction on the same
clock. The idea is somewhat similar to pixel shader co-issue, however
there are implementation differences. The vector vertex processing engine
operates on full 4D vectors, as opposed to 3D vectors in pixel
shaders, and the scalar vertex processing engine is more independent
from the vector engine.
When the vertex shader optimizer schedules instructions it will try to pair vector and
scalar operations for optimal execution. There are a few limitations that might prevent
the optimizer from co-issuing instructions. To increase the chances of instruction pairing, do
not output to the destination registers from scalar instructions and always use write masks
to write out only a single channel from scalar instructions such as POW, EXP, LOG,
RCP and RSQ. Also be aware that the read port limits apply to a vector/scalar instruction
pair the same way as described in the DirectX® 9 vertex shader specification for a single
instruction.
6.4. Optimizing Pixel Shaders
As pixel shaders progressively become more and more complex, they become a more and
more important target for optimizations. In the older 1.0-1.4 pixel shader models there
is not that much room for optimization because of low shader complexity. The 2.0 shader
model however is a different story. 2.0 pixel shaders are complex enough to implement
different optimization strategies, so the following sections will mostly focus on
the RADEON 9500/9700 pixel shader engine architecture and various pixel shader
optimization tricks.
Texture instructions
Texture instructions are the pixel shader instructions that fetch textures (TEXLD), kill
pixel processing (TEXKILL), or, in 1.4 pixel shaders, output depth values (TEXDEPTH).
When it comes to texture instructions there are a few things to be aware of. First, the
TEXKILL instruction does not interrupt pixel shader processing and provides pixel
culling only after the shader has completely executed. Thus the positioning of the TEXKILL
instruction in the shader does not make any difference, and it is wrong to rely on early
abortion of pixel shader execution.
In general TEXKILL and TEXDEPTH (or the equivalent depth output in 2.0 pixel shaders
with oDepth) should be used very carefully because they interfere with the HYPER Z
operation, and if possible should be avoided.
TEXKILL and clip planes
The TEXKILL instruction cancels the rendering of a pixel based on the texture coordinate
values provided. This functionality can be used to implement user clip planes at the
rasterizer level. While this is an interesting hack, it does not provide the most efficient
way of implementing clip planes. All RADEON family chips have support for 6
geometry-based clip planes in the TnL engine. Considering that the TEXKILL instruction
has some detrimental impact on performance, as previously described, it is much better to
use real clip planes. Use TEXKILL only when clipping cannot be properly handled with
conventional user clip planes.
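
For reference, enabling a real user clip plane in Direct3D 9 looks like this (a sketch; note that with a programmable vertex pipeline the plane coefficients are expected in clip space):

Code:
#include <d3d9.h>

void EnableGroundClipPlane(IDirect3DDevice9* dev) {
    // Plane given as ax + by + cz + dw = 0 coefficients; this one
    // clips everything below y = 0.
    const float plane[4] = { 0.0f, 1.0f, 0.0f, 0.0f };
    dev->SetClipPlane(0, plane);
    dev->SetRenderState(D3DRS_CLIPPLANEENABLE, D3DCLIPPLANE0);
}

void DisableClipPlanes(IDirect3DDevice9* dev) {
    dev->SetRenderState(D3DRS_CLIPPLANEENABLE, 0);
}
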
Legacy pixel shaders on DirectX® 9 hardware
When designing the RADEON 9500/9700 family of chips, one
important objective was to create an architecture backwards compatible
with legacy shader models that would provide the highest
performance possible. This resulted in a pixel shader engine
architecture that natively supports shader instruction co-issue, and
most of the source argument and instruction modifiers. Since the 2.0
pixel shader model has very limited support for modifiers, they have
to be emulated with extra instructions. This means that some of the legacy pixel shaders
featuring many modifiers will execute faster than their 2.0 pixel shader equivalents.
Co-issue in pixel shaders
Earlier pixel shader models, namely 1.0-1.4, had a feature called instruction co-issue. It
allowed pairing two instructions operating on color and alpha values into one, and
executing them on the same cycle. While instruction co-issue provided a great
opportunity for optimization and an increase of the maximum number of instructions, it did
complicate shader development and broke instruction and operand orthogonality in the
shader model. Co-issue was removed from the 2.0 pixel shader model.
RADEON 9500/9700 chips have dual-pipe pixel shader units,
which operate as two relatively independent engines performing
calculations on different entities. One engine operates on 3D
vectors or RGB-colors and the other on scalar or alpha values. This
means that in most cases two instructions, one operating on the color
and another operating on alpha can be performed at the same time.
Such architecture provides a perfect opportunity for optimizing
shaders by splitting the computational workload between the pipes, resulting in up to a
twofold speedup. Careful examination of a shader for splitting the workload between the
pipes should focus on a couple of things – identifying computations that can be executed
only in one pipe (vector or scalar) and balancing number of instructions in each pipe.
Sometimes scalar or alpha computations can be executed in the color pipe and the other
way around, the color computations can be executed in the alpha pipe.
Explicit instruction co-issue in pixel shaders is available only in the older shader models.
However, this does not mean that the benefits of instruction pairing can be enjoyed only
in the older pixel shader models. On the contrary, the full benefit of instruction co-issue
can be achieved in 2.0 pixel shaders with some clever shader programming. In the 2.0 pixel
shader model, write masks can be used to implicitly indicate an opportunity for instruction
pairing. The shader optimizer in the RADEON 9500/9700 drivers will look for write
masks to determine which pipe should execute an instruction and will try reordering and
co-issuing instructions.
There are some nuances the shader developers have to be aware of when optimizing
shaders for instruction co-issue. The color and alpha parts of the instruction pair can
reference different registers; however, attempting to access alpha values in a color
instruction or to access color values in an alpha instruction might break co-issue. This also
applies to .ABGR or .WZYX swizzles available in 2.0 shaders as they force data to cross
vector and scalar pipes.
Another important fact is that RCP, RSQ, EXP and LOG instructions are always
executed in the scalar pipe. For that reason it is better to always use scalar arguments and
destinations (.W or .A) when using these instructions. This will ensure the vector pipe is
available for co-issue with these instructions.
Following are fragments of pixel shaders that compute diffuse and specular lighting. This
demonstrates how splitting instructions between pipes for co-issue can be used to
optimize shaders.
Original shader:

ps.2.0
…
dp3 r0.r, r1, r0        // N.H
dp3 r2, r1, r2          // N.L
mul r2, r2, r3          // * color
mul r2, r2, r4          // * texture
mul r0.r, r0.r, r0.r    // spec^2
mul r0.r, r0.r, r0.r    // spec^4
mul r0.r, r0.r, r0.r    // spec^8
mad r0.rgb, r0.r, r5, r2
…
Total – 8 instructions

Shader optimized for co-issue:

ps.2.0
…
dp3 r0.a, r1, r0        // N.H
dp3 r2, r1, r2          // N.L
mul r0.a, r0.a, r0.a    // spec^2
mul r2.rgb, r2, r3      // * color
mul r0.a, r0.a, r0.a    // spec^4
mul r2.rgb, r2, r4      // * texture
mul r0.a, r0.a, r0.a    // spec^8
mad r0.rgb, r0.a, r5, r2
…
Total – 5 instructions
The scalar specular instructions in the first shader (the mul r0.r chain) are the ones that
could be co-issued if they were executed in the scalar pipe. The second shader illustrates
the result of such co-issue, with the alpha and color instructions arranged in pairs. It is
not required to place instructions that can be paired next to each other, since the shader
optimizer can intelligently reorder instructions. In this example the instructions were
reordered only to illustrate the concept.
Instruction balancing
On RADEON 9500/9700 the highest possible performance of pixel
shaders can be achieved by carefully balancing the number of texture and
arithmetic instructions. Each of the pixel shader engines of
RADEON 9500/9700 is capable of executing a texture fetch and
color/alpha ALU instruction pair on each clock cycle. Because of this
high degree of parallelism between texture units and math engines, it
is a good idea to keep the ratio of texture to ALU instructions close to
1:1. This of course makes sense only if the application is not texture-fetch bound. When using
more expensive texture filtering modes the ratio of instructions will be skewed more
towards a higher number of ALU instructions. For each particular pixel shader the cost of
arithmetic vs. texture instructions should be carefully evaluated to find areas that can be
implemented more optimally. For instance, if a shader is too long because of some
complex calculations and there is some memory bandwidth to spare, function lookup
tables can be used to reduce the number of arithmetic instructions. The perfect example
is the SINCOS macro, which can be much more efficiently implemented as a fetch from a
one- or two-channel texture. This instruction balancing should be performed separately at
each dependency level in the shader.
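
A sketch of the SINCOS look-up idea: bake sin/cos into a two-channel 1xN texture at load time and replace the macro with a single texture fetch (range and precision handling are simplified here):

Code:
#include <d3d9.h>
#include <cmath>

IDirect3DTexture9* CreateSinCosTexture(IDirect3DDevice9* dev, UINT n) {
    IDirect3DTexture9* tex = NULL;
    if (FAILED(dev->CreateTexture(n, 1, 1, 0, D3DFMT_G16R16,
                                  D3DPOOL_MANAGED, &tex, NULL)))
        return NULL;

    D3DLOCKED_RECT lr;
    tex->LockRect(0, &lr, NULL, 0);
    unsigned short* texel = (unsigned short*)lr.pBits;
    for (UINT i = 0; i < n; ++i) {
        float a = 6.2831853f * i / n;   // angle in [0, 2*pi)
        // Map [-1,1] into [0,65535]; the shader rescales after the fetch.
        texel[i * 2 + 0] = (unsigned short)((std::sin(a) * 0.5f + 0.5f) * 65535.0f);
        texel[i * 2 + 1] = (unsigned short)((std::cos(a) * 0.5f + 0.5f) * 65535.0f);
    }
    tex->UnlockRect(0);
    return tex;
}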
Dependent texture reads
Dependent texture reads are quite expensive. On RADEON 8500/9000 a two-phase
shader is much more expensive than a single-phase shader; however,
it should be less expensive than running
multiple render passes with single-pass shaders. If you are developing an
application that targets DirectX® 8.1 hardware and uses multiple
render passes with 1.0-1.3 pixel shaders, consider implementing a
single-pass solution with 1.4 pixel shaders.
RADEON 9500/9700 has significantly optimized dependent
texture read implementation for performance and efficiency. In addition,
the number of levels of dependency has been increased to four in the 2.0
pixel shader model. The best performance on RADEON 9500/9700
can be achieved when not exceeding two dependent texture reads.
While three or four levels of dependency will provide sufficient
performance, it will not be as good as with only one or two levels.
Keep in mind that if arithmetic instructions are used to compute texture coordinates
before the first texture fetch, they will also be counted as a level of dependency. Also,
bear in mind that the TEXKILL instruction forces a dependency level change in the pixel
shaders. To optimize shaders with dependent texture reads try to keep the number of both
texture and arithmetic instructions roughly the same at each level of dependency.
6.5. When Multiple Render Passes Are Better than One
One obvious optimization technique is to reduce the number of rendering passes. This allows
cutting down on the amount of transformed and rendered geometry and decreasing fillrate
requirements. However, it turns out that this is not always true. There are a growing
number of cases where multi-pass rendering can result in better performance. Consider a
situation when overdraw is very high, and complex pixel shaders are used. When using
long and complex pixel shaders, the chances are the performance might be hampered by
shader execution. If overdraw is high, these complex shaders are run many times even for
the pixels that are occluded by other geometry. This is a huge waste of VPU shader
processing power. Ordering all geometry by distance and rendering it front to back might
not be such a good idea since it might affect sorting by effect, shader or render state. The
solution is to use multi-pass rendering since vertex processing is rarely a bottleneck. On
the first render pass just initialize the depth buffer with proper depth values for your
scene by rendering all geometry without any pixel shaders and outputting only depth and
no color information. Since no shaders are used, it is possible to render everything in the
front to back order without causing any major render state changing overhead. Then
render everything once again with the proper shaders. Because the depth buffer is already
initialized with proper depth values, early pixel rejection can happen thanks to HYPER
Z optimizations, creating an effective overdraw of one on the shader pass. Of course
if overdraw is already low, there is no sense in using this technique. As scene and shader
complexity increases this rendering method becomes more and more important.
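
A minimal Direct3D 9 sketch of the technique (the two Draw* helpers are hypothetical stand-ins for the application's own scene traversal):

Code:
#include <d3d9.h>

void DrawOpaqueFrontToBack(IDirect3DDevice9* dev);     // hypothetical
void DrawOpaqueSortedByShader(IDirect3DDevice9* dev);  // hypothetical

void RenderWithDepthPrepass(IDirect3DDevice9* dev) {
    dev->Clear(0, NULL, D3DCLEAR_ZBUFFER, 0, 1.0f, 0);

    // Pass 1: depth only. No pixel shader, no color writes; front-to-
    // back order is cheap because no shader/state sorting is needed.
    dev->SetPixelShader(NULL);
    dev->SetRenderState(D3DRS_COLORWRITEENABLE, 0);
    DrawOpaqueFrontToBack(dev);

    // Pass 2: full shaders. Depth is already initialized, so HYPER Z
    // can reject occluded pixels before the expensive shader runs.
    dev->SetRenderState(D3DRS_COLORWRITEENABLE, 0x0000000F);
    dev->SetRenderState(D3DRS_ZFUNC, D3DCMP_EQUAL);
    DrawOpaqueSortedByShader(dev);
    dev->SetRenderState(D3DRS_ZFUNC, D3DCMP_LESSEQUAL);
}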
6.6. Using High Level Shader Languages
It might be a lot of fun to develop vertex and pixel shaders in assembly, while chasing
every single opportunity to squeeze out an extra execution cycle here and there with
highly optimized handcrafted assembly code. In the real world of big and demanding
projects and tight schedules this just might not be the most practical way of developing
shaders. It is a well-known fact that higher-level languages provide much better
productivity.
With DirectX® 9, the High Level Shader Language (HLSL) was
introduced. This C-like language with extra constructs to deal with vectors, matrices and
other graphics related features could be a great help in shader development. Besides
greater readability of the code and overall increased productivity, using high-level
language instead of assembly allows us to focus on code reuse and high-level algorithmic
optimizations, which quite often can be more valuable than low-level optimizations.
It does not mean that low-level optimizations are unimportant. The HLSL compiler is
aware of many low-level optimization tricks and can produce code that rivals some of the
best handcrafted assembly. To help the compiler recognize optimization opportunities, use
appropriate types for variables. If only a scalar needs to be computed, do not use a vector
to store it. Likewise, use float3 type to store 3-component vectors and so on. When
computing shader output values for everything other than texture coordinates, use a float4
variable to hold the final result and avoid using type casts.
Another good way to help the compiler recognize areas for possible optimizations is to use
built-in intrinsic functions. For instance, use the dot() or lerp() functions instead of
implementing your own functional equivalents.
In HLSL pixel shaders, make sure to use the tex1D() intrinsic function whenever it makes
sense. In DirectX® 9 there are no 1D textures, so they are emulated through 2D textures.
Using tex1D() instead of tex2D() can sometimes save an extra instruction when sampling
a texture with 1xN dimensions, since there is no need to worry about the second texture
coordinate component.
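
Putting those hints together in one small HLSL fragment (held in a C++ string; the uniform and sampler names are hypothetical):

Code:
const char* g_hlslPS =
    "sampler rampTex : register(s0);                            \n"
    "float3  lightDir;                                          \n"
    "float4 main(float3 n    : TEXCOORD0,                       \n"
    "            float4 colA : COLOR0,                          \n"
    "            float4 colB : COLOR1) : COLOR                  \n"
    "{                                                          \n"
    "    // scalar result stays in a float, not a wasted vector \n"
    "    float ndl = dot(normalize(n), lightDir);               \n"
    "    // 1xN ramp texture sampled via tex1D(), not tex2D()   \n"
    "    float4 ramp = tex1D(rampTex, ndl * 0.5f + 0.5f);       \n"
    "    // built-in lerp() instead of a hand-rolled equivalent \n"
    "    return lerp(colA, colB, ramp.x);                       \n"
    "}                                                          \n";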
If for some reason shader performance is still below your expectations, have HLSL
compile the high-level code into assembly and then go over it with a fine-tooth comb.

Thomas
 
Feedback

This is the description of a shader implementation.

The Shader Language is extendable through parameterized shader templates. Templates define a sequence of rendering commands inside a specific shader definition, with certain parameters in the sequence (such as texture map name, color or bump map height) replaced by context specific arguments.

The Shader Language is flexible by using simple orthogonal rendering commands to define complex rendering algorithms. The basic shader starts with a set of default settings that controls the material, vertex color, blend function and lighting. The artist has full control of the rendering pipeline and can override any default setting. A simple single pass, lit, single texture map shader is one line, while a rotating, lit and alpha blended texture map pass is only three lines. Complex multi-pass algorithms are defined by concatenating multiple single pass effects together. By combining shader templates, DirectX vertex/pixel shaders, rendered textures and multi-pass definitions, a game developer can define a dynamically modified procedural texture with a couple of passes.
Take this scene:
A scene with a lake, trees by the water's edge, a player character with a sword that glows and loosely fitting clothes that sway, and vegetation around the trees.

Would the use of so many shaders run into performance problems? Say a shader template was used to display the tree bark, other shaders for the rippling water, reflections and stencil shadows for shadowing the player, while the sword glowed using at least 2 shaders, plus the clothing animation.

Would DirectX 9 increase performance here, and would a longer instruction set help?

What would a developer do if the number of shaders needs to be limited somehow?

Thanks for any feedback, hopefully I worded my question right.

Speng.
 
speng,

Each vertex shader change is a state change. Other state changes include texture changes, vertex buffer changes, pixel shader changes, and other things that change the way graphics cards render pixels on screen.
Developers have to make a trade-off here. Generally it seems that state changes can be ranked like this (from worst down): vertex shader change, texture change, pixel shader change, vertex buffer change, ...
Using lots of different vertex shaders is not really a good idea. You can use constant-based branching in vs_2_0 to enable/disable specific vertex shader effects on the fly, which reduces the total number of vertex shaders.
If you have, for example, a scene with 5 textures and 20 vertex shaders, it would probably be better to arrange by vertex shader first and have 20 vertex shader changes and say 20 texture changes; if you arranged by texture instead, you could have 5 texture changes but say 60 vertex shader changes.
However, this is one incredibly complex aspect of 3D engines. You actually have to weigh the advantages of everything: state changes and draw order (preferring front to back).
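
One common way to encode that ordering is a packed sort key, with the most expensive state in the most significant bits (a sketch; the field widths are arbitrary):

Code:
#include <algorithm>
#include <cstdint>
#include <vector>

struct DrawCall {
    uint16_t vertexShaderId;   // most expensive to change
    uint16_t textureId;
    uint16_t pixelShaderId;
    uint16_t vertexBufferId;   // cheapest to change

    uint64_t key() const {
        return (uint64_t)vertexShaderId << 48 |
               (uint64_t)textureId      << 32 |
               (uint64_t)pixelShaderId  << 16 |
               (uint64_t)vertexBufferId;
    }
};

// A single sort then minimizes the worst state changes first.
void SortForSubmission(std::vector<DrawCall>& calls) {
    std::sort(calls.begin(), calls.end(),
              [](const DrawCall& a, const DrawCall& b) {
                  return a.key() < b.key();
              });
}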
 
A good engine should be designed such that some degree of sorting is easy.

A sort by 'material' (shader, texture) combined with sort by depth (front-to-back) usually provides the best combination.

The usual recommendation is to find the closest 'object', render it, then render all other objects using the same material. Repeat until done. If the material changes are 'small' then a more exacting depth sort may extract marginal improvements. But certainly the first thing you render should always be any close-up big polygons - walls, etc.

And ALWAYS render the sky last. Not first. Please.

Generally you expect the number of fine-structure changes (textures etc.) to outnumber gross-structure changes (shader programs).
 
Thanks TB. ATI's presentation was quite interesting. One thing I did find amusing was:
The TEXKILL instruction cancels the rendering of a pixel based on the texture coordinate values provided. This functionality can be used to implement user clip planes at the rasterizer level. While this is an interesting hack, it does not provide the most efficient way of implementing clip planes.
I'm guessing it wasn't ATI who introduced TEXKILL.
 
Dio said:
And ALWAYS render the sky last. Not first. Please.
Surely that depends on how expensive ZClears are. Doing the ground and sky first eliminates the need for ZClears.
 
Dio said:
And ALWAYS render the sky last. Not first. Please.

unless you have any semi-transparent objects in the scene likely to be seen against the background of the sky, in which case you have no choice but to draw the skybox first.
 
I guess on ATI clearing the Z is free, as you just invalidate the on-chip Z macroblocks. Then, if you draw a sky box with a large texture you'll save, as huge swathes will be occluded by scenery.
 
darkblu said:
unless you have any semi-transparent objects in the scene likely to be seen against the background of the sky, in which case you have no choice but to draw the skybox first.
Sorry, I should have been clearer. I was only considering the opaque pass.

Of course, all alpha blended objects will have to be rendered after all opaque objects unless you have very clever occlusion tracking in your engine. But you may well have to strictly depth-sort these anyway unless all your blend operations are commutative.
 
Simon F said:
Surely that depends on how expensive ZClears are. Doing the ground and sky first eliminates the need for ZClears.
The optimisation guides nowadays say that you should always issue an explicit clear for Z, and use PRESENT-DISCARD for colour to get best performance. The reasoning is that understanding the semantics of what the code really wants to do - rather than having to infer that a clear is required - means the card can much better meet the requirements of the code.

I've seen several apps that issue a Z clear and then either don't draw the sky unless they have to, or skip the bits they don't have to - and draw it last.
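
In Direct3D 9 terms that guideline looks roughly like this (a sketch):

Code:
#include <d3d9.h>

// Declare DISCARD at device creation so the driver knows the colour
// buffer's old contents never need to be preserved across Present().
void SetupPresentParams(D3DPRESENT_PARAMETERS& pp) {
    pp.SwapEffect = D3DSWAPEFFECT_DISCARD;
}

// Issue an explicit Z clear each frame instead of relying on the sky
// (or ground) pass to overwrite stale depth.
void BeginFrame(IDirect3DDevice9* dev) {
    dev->Clear(0, NULL, D3DCLEAR_ZBUFFER, 0, 1.0f, 0);
}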
 
Simon F said:
I'm guessing it wasn't ATI who introduced TEXKILL.
I'm guessing that someone couldn't be bothered implementing clip planes in their transform engine... :)
 
Kristof said:
Also IIRC NV30 stores constants directly in the instruction, so 4 component float numbers are stored as part of the instruction.
I don't think that's true. Constants and instructions certainly share the same memory space, but I believe that if you define a constant, you get one less instruction. I don't believe it would be possible to encode a constant within an instruction in such a setup (besides, it would take a ton of memory).

As for how the NV30 handles changing constants, well, that's a good question. It all depends on how nVidia made the chip. No matter how the chip stores the instructions, there is no fundamental need to retrieve an entire program to update one constant.

In any case, I doubt that changing constants will ever be a performance bottleneck.
 
Chalnoth said:
As for how the NV30 handles changing constants, well, that's a good question. It all depends on how nVidia made the chip. No matter how the chip stores the instructions, there is no fundamental need to retrieve an entire program to update one constant.

Perhaps not retrieve the entire program, but re-write the entire program.
 