Shaders?

I was just wondering about a couple of things with shaders...

Does a long shader program that runs in one pass on a GFFX, but requires more passes on a 9700, execute much slower or faster?

Regarding vertex shaders, I was just wondering how fast VS2.0 on the GPU is compared to a fast CPU.

For both questions: where do the bottlenecks lie - in the graphics card, the CPU, or somewhere else?

Is it possible for a CPU to do VS2.0 at a comparable speed?
 
This is still speculation to some degree, but here goes.

There is an extra bandwidth cost on the R300 in having to read and write the framebuffer when multi-passing, and also a cost due to changing shader programs between passes.

There are also some shaders that are much (much, much) harder to do without full read/write access, which is exactly the situation multi-passing puts you in: the DirectX and OpenGL specifications (though not necessarily the actual 9700 hardware) only allow read access OR write access to a single texture at any one time, so if your program wants to do both it can get very complicated.

But the GFFX has less bandwidth to start with, and doesn't appear to be optimised for long shaders (this is all hearsay for me at the moment; I have no real knowledge).

Personally I suspect the main advantage of long shaders in a single pass will actually be for those complex shaders that need read/write access.
The obvious one that everybody has already hit is HDRI lighting. By the time you've done some cool shading model, it's easy to end up only being able to do one light per R300 pass, and if you want to store your accumulating lighting values at decent precision (above 8-bit integer) you have to use a high-precision intermediate target, BUT you can't blend to these, so simply adding in the next light proves to be a pain (ping-ponging surfaces). Whereas a long shader would simply do 4-5 lights in a single pass without needing a temporary high-precision surface.
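To make the ping-ponging concrete, here's a minimal sketch of that multipass accumulation. The names here (FPSurface, SetRenderTarget, BindAsTexture, DrawLitGeometry) are hypothetical stand-ins for whatever render-target plumbing the engine or API actually provides, not a real API:

```cpp
#include <utility>  // std::swap

// Hypothetical stand-ins (stubbed so the sketch compiles) - D3D9 or
// OpenGL equivalents would slot in here.
struct FPSurface { /* handle to a high-precision (FP16/FP32) target */ };
static void SetRenderTarget(FPSurface*) {}
static void BindAsTexture(int /*stage*/, FPSurface*) {}
static void DrawLitGeometry(int /*lightIndex*/) {} // shader: out = tex2D(prev) + Light(i)

// Accumulating N lights on hardware that can't blend into a
// high-precision target: each pass re-reads the previous sum as a
// texture and writes sum + next light into the *other* surface.
void AccumulateLightsMultipass(FPSurface* a, FPSurface* b, int numLights)
{
    FPSurface* src = a;  // running sum so far
    FPSurface* dst = b;  // receives sum + next light

    for (int i = 0; i < numLights; ++i)
    {
        SetRenderTarget(dst);
        BindAsTexture(0, src);   // re-read everything written last pass
        DrawLitGeometry(i);
        std::swap(src, dst);     // ping-pong for the next light
    }
    // 'src' now holds the accumulated lighting - one full surface
    // write plus one full re-read per light, purely as multipass overhead.
}
```

A single long shader would instead keep the running sum in a register and evaluate all the lights in one pass, never touching an intermediate surface.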


VS2.0 can be very fast on a CPU; an optimiser would eliminate the static branches and unroll the loops. As such it should be about as fast as running VS1.1 on the CPU, which has proved to be very fast in the past. Given that a lot of PCs are sold with good CPUs and crap video cards, I suspect this will be a win for the low/middle-end video card market.
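As a rough illustration of what that optimiser does - the function names and the fixed 4-bone skinning case are made up for the example - a VS2.0 loop driven by a shader constant can be turned into straight-line CPU code once the constant is known:

```cpp
struct Vec4 { float x, y, z, w; };

// Minimal helpers so the sketch is self-contained (plain scalar code;
// a real CPU vertex path would use SSE/3DNow!).
static Vec4 Transform(const float m[16], const Vec4& v)
{
    return { m[0]*v.x + m[4]*v.y + m[8]*v.z  + m[12]*v.w,
             m[1]*v.x + m[5]*v.y + m[9]*v.z  + m[13]*v.w,
             m[2]*v.x + m[6]*v.y + m[10]*v.z + m[14]*v.w,
             m[3]*v.x + m[7]*v.y + m[11]*v.z + m[15]*v.w };
}
static Vec4 Madd(const Vec4& a, float s, const Vec4& b)
{
    return { a.x*s + b.x, a.y*s + b.y, a.z*s + b.z, a.w*s + b.w };
}

// VS2.0 expresses skinning as a loop whose trip count comes from a
// shader constant (loop aL, i0 ... endloop). When the shader runs on
// the CPU that constant is known before code generation, so the
// optimiser can unroll the loop and drop every branch. With a known
// count of 4 bones it becomes pure straight-line code:
Vec4 SkinVertexUnrolled(const Vec4& pos, const float bones[4][16],
                        const float weights[4])
{
    Vec4 r = { 0.0f, 0.0f, 0.0f, 0.0f };
    r = Madd(Transform(bones[0], pos), weights[0], r);
    r = Madd(Transform(bones[1], pos), weights[1], r);
    r = Madd(Transform(bones[2], pos), weights[2], r);
    r = Madd(Transform(bones[3], pos), weights[3], r);
    return r;
}
```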
 
DeanoC said:
There is an extra bandwidth cost on the R300 in having to read and write the framebuffer when multi-passing, and also a cost due to changing shader programs between passes.
When multipassing, bandwidth is generally irrelevant. The assumption is that if you're executing a 200-plus-instruction program then the cost for a few extra reads and writes, even floating point ones, isn't significant, given the R9700's bandwidth.
 
How long are the shaders we're talking about?
And since the CPU can do VS2.0 pretty fast, where is the "edge" where GPU vs. CPU vertex shading tips in one's favour?

The Mother Nature test in 3DMark03, as I understand it, uses VS2.0 to do the leaves, grass etc. That seems like a lot of work for the vertex shaders in there.
If we could disable the GPU vertex shaders and use the CPU to do all this, would there be a BIG difference?
Let's say the CPU is an Athlon XP 2000+.
 
Dio said:
DeanoC said:
There is an extra bandwidth cost on the R300 in having to read and write the framebuffer when multi-passing, and also a cost due to changing shader programs between passes.
When multipassing, bandwidth is generally irrelevant. The assumption is that if you're executing a 200-plus-instruction program then the cost for a few extra reads and writes, even floating point ones, isn't significant, given the R9700's bandwidth.

That seems to be dismissing it too quickly. It's hard to guesstimate this stuff, but we could be talking about ~8 clocks (writing and then reading a 4-channel FP32 texture); if we go up to the extreme and require 10 passes (which would still be a single pass on NV30), that could be ~80 extra clocks. (You may also have to reload the pixel shader and constants between passes, which could be expensive too.)

Of course the greater bandwidth the R300 has may compensate when compared to the NV30, but conceptually the NV30's long shaders do have a natural advantage.

It won't be a problem this generation, but it will be in the long term. Shaders will get longer, mainly to save bandwidth.
 
Remember that texture fetches occur in parallel with shader processing. If each pass is taking 64 clocks, then that's a lot of data you can read and write in the same time. If you have 10 passes, that's 600+ clocks to hide the bandwidth in...
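Putting DeanoC's and Dio's numbers side by side - all of these are rough estimates from the thread, not measurements - a quick back-of-the-envelope check:

```cpp
#include <cstdio>

int main()
{
    // Rough per-pixel numbers from the discussion above.
    const int passes           = 10; // extreme case: one NV30 pass
    const int rwClocksPerPass  = 8;  // write then re-read a FP32x4 value
    const int aluClocksPerPass = 64; // shader work per pass

    const int bandwidthClocks = passes * rwClocksPerPass;  // ~80
    const int shaderClocks    = passes * aluClocksPerPass; // ~640, i.e. "600+"

    printf("extra multipass traffic:   ~%d clocks\n", bandwidthClocks);
    printf("shader work to hide it in: ~%d clocks\n", shaderClocks);
    // As long as texture fetches overlap ALU work, ~80 clocks of extra
    // traffic fits comfortably inside ~640 clocks of shading.
    return 0;
}
```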
 
True, but what if the texture fetches take longer than the shader processing, and those texture fetches are purely due to multi-pass?

Extreme case I know but you never know what devs want to do :)
 
Well, it becomes quite complex.

Note that until the 9th pass you can't possibly have more than 8 previous passes' results to read in. That worst case also relies on every later pass reusing every previous pass's results - pretty unlikely; more typically you will probably see 3-4 previous-pass reads at most.

So you're talking about a particular pathological case that will be rare in practice, and will still be OK even if it's hit :).
 
Well, a few things to think about:

1. No matter which way you look at the NV30's performance, its scores in current synthetic PS 2.0 benchmarks are lower than they should be compared to the Radeon 9700 Pro. I think this can easily be put down to lack of driver optimization, and I'm hoping it will disappear by the time the cards are available. Said another way, I wouldn't draw any conclusions on the NV30's shader performance just yet. The results are definitely anomalous right now.

2. Multipass performance really does depend on the algorithm. One thing you have to keep in mind is that sometimes data will need to be recalculated each pass (see the sketch after this list). So, given the right algorithm, the simple fact that the NV30 won't need to go multipass could lead to tremendous speed improvements. Other algorithms may perform about the same.

3. Is there additional overhead in running a long shader on the NV30? With optimized drivers, this overhead probably won't amount to anything, but if, for example, the entire program doesn't fit in on-chip cache, then unoptimized scheduling may end up stalling the pipelines frequently in a long, complex program. While logic would dictate that running a long shader instead of going multipass would, at worst, perform the same, if such an overhead exists, then a long shader may be slower than some multipass algorithms.
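Here's the sketch promised in point 2. The instruction counts are invented purely to show the shape of the tradeoff: work that multipass must redo every pass is done only once in a single long shader.

```cpp
#include <cstdio>

// Hypothetical per-pixel instruction counts (made up for illustration).
const int kSetupCost = 20; // e.g. fetch + transform a per-pixel normal
const int kLightCost = 30; // evaluate one light using that normal

// Multipass: registers don't survive between passes, so the shared
// setup work is re-executed in every pass.
int MultipassCost(int lights)  { return lights * (kSetupCost + kLightCost); }

// Single long shader: the setup runs once and its result stays in a
// register for all the lights.
int SinglePassCost(int lights) { return kSetupCost + lights * kLightCost; }

int main()
{
    for (int n = 1; n <= 5; ++n)
        printf("%d light(s): multipass ~%d vs single pass ~%d instructions\n",
               n, MultipassCost(n), SinglePassCost(n));
    return 0;
}
```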

I guess what it all comes down to is that a long program on the NV30 should be as fast as, or faster than, a comparable multipass algorithm on the R300. Whether this pans out in reality remains to be seen. I think most of it depends on how much more shader performance nVidia can get out of driver improvements (which is, obviously, heavily dependent upon how much computational power there is in the NV30 core...).
 
A little OT, but are we talking about the GeForce FX NON-Ultra or..?
To me the Ultra version doesn't exist, and I would really like to see numbers compared with the GeForce FX that will actually be in my store.
 
overclocked said:
A little OT, but are we talking about the GeForce FX NON-Ultra or..?
To me the Ultra version doesn't exist, and I would really like to see numbers compared with the GeForce FX that will actually be in my store.

We are mainly talking about extreme pathological cases; I don't think any of us expect it to make any major difference in real-world apps REGARDLESS of Ultra/Non-Ultra.

The only real difference is that it makes a few algorithms easier to do (with a single pass).

Actually, there is one non-technical factor that could help the NV3x: it's easier to develop single-pass shaders. But this shouldn't show up in shipping games (ATI dev rel would help port long shaders to multi-pass if the developer doesn't have time).
 