8800-series, granularity and tex:alu ratios

freka586

Newcomer
When (briefly) scanning through rumours about the upcoming 8800-series from Nvidia (as well as R600), I have not yet read anything on granularity or tex:alu ratios.

For some applications (varieties of volume rendering, in my case) we have noticed some interesting things lately:
* Granularity and internal cache strategies are important. So important that in my case a 16-pipe X1800 beats a 24-pipe 7900 GTX by a *large* margin.
* Multiple ALUs do not seem to be of help in my case. So X1800 is more or less the same as the new X1950 XTX, clock differences aside.

So, any news about the coming HW on these aspects?
I seem to remember ATI hinting that in the long run we should expect even more asymmetrical tex:alu ratios.
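To make the granularity point concrete: a SIMD batch has to execute both sides of a branch whenever its pixels disagree, so larger batches waste more work on incoherent branches. A rough Python sketch of the effect (the batch sizes and the 10% taken-rate are made-up illustration values, not measurements from my application):

```python
import random

def wasted_fraction(batch_size, pixels, p_taken, seed=0):
    """Estimate the fraction of extra work caused by divergent branches.

    Each pixel independently takes the branch with probability p_taken.
    A batch containing both taken and not-taken pixels must execute
    both paths, so every pixel in it pays for two paths instead of one.
    """
    rng = random.Random(seed)
    extra = 0
    n_batches = pixels // batch_size
    for _ in range(n_batches):
        taken = sum(rng.random() < p_taken for _ in range(batch_size))
        if 0 < taken < batch_size:   # divergent batch: both paths run
            extra += batch_size
    return extra / pixels

# Smaller batches leave far fewer pixels stuck in divergent batches.
for size in (4, 16, 48, 1024):
    print(size, round(wasted_fraction(size, 48 * 1024, 0.1), 3))
```

With incoherent branching the wasted fraction climbs steeply with batch size, which is one plausible reason a 16-pipe part with fine granularity can beat a wider one.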
 
Out of curiosity, what kind of workload is that exactly? What are your ALU ops like (MUL/ADD, or something else?), and your TEX ops? (single channel? FP16? Trilinear? Anything?) - furthermore, is the problem the actual granularity, or is it the systematic costs of 2+ cycles (iirc) for any branching operation on G7x? Also, is it strictly impossible to offload the branching instructions to the VS on G7x-like architectures, since they're basically free there? (I'd assume not, but heh)

Anyway, this specific aspect of G80 hasn't been discussed much yet, as far as I can see - so it seems like a good idea to do so here ;)


Uttar
P.S.: It's worth noting that unless you're using "exotic" branching methods, in a unified architecture the granularity will be the same for vertex and pixel shading. At least, that's the case on Xenos (granularity of 32) - G965 proves that you can do things a bit differently by using scalar for the PS and Vec4 for the VS, thus dividing the VS's granularity by 4!
 
The application in question is a form of volume rendering based on single-pass raycasting with on-the-fly gradient computations.
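Rather than posting the actual shader, here is a rough CPU sketch in Python of the kind of loop involved - single-pass raycasting with on-the-fly central-difference gradients. All names and parameters are made up for illustration; the real shader obviously differs:

```python
import numpy as np

def raycast(volume, origin, direction, steps=64, step_len=1.0, iso=0.5):
    """Toy single-pass raycaster with on-the-fly central-difference
    gradients - a CPU analogue of the shader loop (illustrative only)."""
    pos = np.asarray(origin, dtype=float)
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)

    def sample(p):
        # Nearest-neighbour fetch, clamped to the volume bounds.
        i = np.clip(np.round(p).astype(int), 0, np.array(volume.shape) - 1)
        return volume[tuple(i)]

    for _ in range(steps):
        if sample(pos) >= iso:                      # early ray termination
            e = np.eye(3)
            grad = np.array([sample(pos + e[k]) - sample(pos - e[k])
                             for k in range(3)]) * 0.5
            return pos, grad                        # hit point + gradient
        pos = pos + d * step_len                    # data-dependent loop
    return None, None
```

The data-dependent step count and the early-out are exactly the kind of branching whose cost depends on batch granularity, since neighbouring rays terminate at different depths.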
 
Uttar, apparently I was way too fuzzy when explaining my situation :)
What should I do to find the answers for your questions, short of posting the actual shader code?
Some of the questions I can answer on my own, but I would be most grateful for any hints on how to properly "diagnose" my case.
 
The application in question is a form of volume rendering based on single-pass raycasting with on-the-fly gradient computations.
Doesn't surprise me - we've seen similar results with raytracing algorithms. When traversing hierarchical data structures, thread granularity matters a lot!
 
What application causes that to happen?!? :oops:

Most shaders that use dynamic branching will perform significantly better on X1x00 cards, unless the branching is trivial and extremely coherent. I've seen the same thing in some of my demos - for instance the Selective Supersampling demo, where the branching provided a significant performance boost on ATI and very little on Nvidia (if performance didn't outright drop there - I can't remember).
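The "extremely coherent" caveat is easy to illustrate: the same fraction of taken pixels costs far less when they are clustered (e.g. an edge band) than when they are scattered. A hedged Python toy model, with made-up sizes:

```python
import random

def divergent_batches(mask, batch_size):
    """Count batches whose pixels disagree on a branch condition."""
    n = 0
    for i in range(0, len(mask), batch_size):
        batch = mask[i:i + batch_size]
        if 0 < sum(batch) < len(batch):   # mixed batch -> both paths run
            n += 1
    return n

pixels = 48 * 1024
rng = random.Random(1)

# Coherent: the taken pixels form one contiguous region (an "edge band").
coherent = [1] * (pixels // 10) + [0] * (pixels - pixels // 10)
# Incoherent: the same number of taken pixels, scattered at random.
scattered = coherent[:]
rng.shuffle(scattered)

for size in (16, 48):
    print(size, divergent_batches(coherent, size),
          divergent_batches(scattered, size))
```

With the coherent mask only the batch straddling the region boundary diverges, regardless of batch size - which is why trivial, coherent branching can be fast even on coarse-granularity hardware.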
 
So, interesting observations on current HW aside...
Any hints on where things are going for mainly Nvidia but also ATI?

It seems natural that efficiency for branched shader code should only get better, especially in Nvidia's case. But perhaps changes in architecture could have big costs in this area?

What about the ratio between tex and alu units?
I think I remember reading somewhere about 128 shader units and 32 tex units for the high-end 8800.
Since I am not 100% up to date on the inner workings, I might have over-simplified things greatly...
But if that is the case, the number of texture units would only increase from 24 to 32?
 