Radeon 8500 vertex ALU and GeForce4 Ti

I read very recently that the GeForce 3 showed increased latency when executing rsq and rcp instructions because of the two passes required by its "special purpose unit" (a unit alongside the vec4 processor?). Furthermore, in an ExtremeTech session with an Nvidia employee, the NV25 pipeline was discussed and was admittedly lengthened to cover the latency of these instructions (maybe for the purpose of single-pass execution), but I'm assuming the execution unit remained the same. Now, in vertex program tests run on the Radeon 8500 and the GeForce 3 (http://216.239.51.100/search?q=cach...40/final.pdf+radeon+vertex+alu&hl=en&ie=UTF-8, http://www.reactorcritical.com/review-battletitans2/review-battletitans2_2.shtml), the latency of complex programs (primarily rsq and rcp) on the Radeon 8500 remained approximately the same as that of the simple programs, while the GeForce 3's performance decreased significantly. How does the Radeon VS implementation differ from that of the GeForce architecture? Does it have a dedicated scalar unit running alongside the vec4 unit?

Also, the Radeon 8500 VS is capable of storing 192 constants as opposed to 96 and contains more registers than needed. Shouldn't the vertex shader thus be more capable (standalone) than that of the GeForce4 Ti? And can anyone confirm that it also contains a hardwired T&L unit, unlike the GeForce 3 and 4, which emulate T&L through a vertex program?

I know it is a little late to discuss this architecture, and I hear so much about newer competing architectures being superior in implementation, but theoretically this should be a strong competitor. Finally, why does a 60 million transistor processor with support for an advanced VS, PS 1.4, a hardwired T&L unit, and TruForm have a lower transistor count than the 63 million transistor GeForce4 Ti? Is it the Lightspeed Memory Architecture and the amount of cache? Maybe the 8500 has more logic and less cache, which would explain the single-pass inefficiency issues in Doom 3.

Thank you
 
I can't actually answer the question, since I know nothing about the 8500 pipeline, but I did want to add something.

It's incredibly difficult to establish what the VS is doing from tests, simply because you're as much testing the driver's ability to compile the code as you are the hardware's ability to run it.
There is no guarantee that the hardware will execute the instructions in the order you wrote them, and in extreme cases the VS compiler will actually remove redundant code.
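To illustrate the point about redundant code, here is a minimal sketch of the kind of dead-code pass a driver compiler might run. The instruction format, the register naming ("o*" for outputs, "r*" for temporaries) and the function name are all made up for this example; it is not taken from any real driver, and it ignores write masks.

```python
# Hypothetical mini-IR: (opcode, destination register, source registers).
# Assumption: registers named "o*" are shader outputs, everything else is
# a temporary, an input or a constant. Write masks are ignored.
def eliminate_dead_code(program):
    live = set()   # registers whose values are still needed later on
    kept = []
    # Walk backwards so we know what is needed before we see its producer.
    for op, dst, srcs in reversed(program):
        if dst.startswith("o") or dst in live:
            live.discard(dst)    # this instruction (re)defines dst
            live.update(srcs)    # ...and needs its sources
            kept.append((op, dst, srcs))
        # otherwise nothing reads dst afterwards, so the instruction is dropped
    kept.reverse()
    return kept


program = [
    ("DP4", "r0", ("v0", "c0")),
    ("RSQ", "r1", ("r0",)),        # result never reaches an output
    ("MUL", "r2", ("r1", "r1")),   # depends only on the dead RSQ
    ("MOV", "oPos", ("r0",)),
]
optimized = eliminate_dead_code(program)
for instruction in optimized:
    print(instruction)
```

A test program that issues rsq or rcp without feeding the result into an output could be stripped like this, in which case the benchmark would be timing a much simpler program than the one that was written.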

The NV2X pipeline is somewhat better understood, simply because the Xbox SDK supplies extensive documentation on how it works. There is also a SIGGRAPH paper with a pretty complete description.

And FWIW the GF3/4 vertex shader also allows 192 constants (not exposed).
 
In reading Carmack's comments in the light of the Geforce 4 and Radeon 8500, I ran into this:

"It is interesting to contrast
the Nvidia and ATI functionality:

The vertex program extensions provide almost the same functionality. The ATI
hardware is a little bit more capable, but not in any way that I care about.
The ATI extension interface is massively more painful to use than the text
parsing interface from nvidia. On the plus side, the ATI vertex programs are
invariant with the normal OpenGL vertex processing, which allowed me to reuse
a bunch of code. The Nvidia vertex programs can't be used in multipass
algorithms with standard OpenGL passes, because they generate tiny differences
in depth values, forcing you to implement EVERYTHING with vertex programs.
Nvidia is planning on making this optional in the future, at a slight speed
cost."

Could it be that the 8500 has greater functionality in its vertex shader than even the GeForce 4, which has been greatly praised for its nfinite solutions? Maybe OpenGL guy could shed some light on this.
 
The Nvidia vertex programs can't be used in multipass
algorithms with standard OpenGL passes, because they generate tiny differences in depth values

The reason for this is that user vertex programs aren't constructed with the same math that is used in the fixed function pipeline.
Unfortunately, I don't think it's possible to duplicate the fixed function pipeline math with what's exposed of the vertex shaders on the PC.
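As a rough illustration of how "different math" can change a result, the sketch below evaluates the same four-component dot product in single precision with two different summation orders. The two orderings are an assumption chosen for the example; I don't know what instruction sequences the fixed function path actually uses. The input values are deliberately exaggerated so the difference is obvious; on real hardware the mismatch would be in the last bits of the depth value.

```python
import struct


def f32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot4_sequential(a, b):
    """((a0*b0 + a1*b1) + a2*b2) + a3*b3, rounding to float32 at each step."""
    acc = f32(0.0)
    for x, y in zip(a, b):
        acc = f32(acc + f32(x * y))
    return acc


def dot4_pairwise(a, b):
    """(a0*b0 + a1*b1) + (a2*b2 + a3*b3), rounding to float32 at each step."""
    p = [f32(x * y) for x, y in zip(a, b)]
    return f32(f32(p[0] + p[1]) + f32(p[2] + p[3]))


# Exaggerated, made-up inputs: near 1e8 a float32 can only step in units of 8,
# so adding 1.0 to 1e8 is lost to rounding.
row = (1e8, 1.0, -1e8, 1.0)
vertex = (1.0, 1.0, 1.0, 1.0)

print(dot4_sequential(row, vertex))  # 1.0
print(dot4_pairwise(row, vertex))    # 0.0
```

Both functions compute the same expression on paper, yet the rounded results disagree. A multipass algorithm that draws one pass through each path would then fail an exact depth comparison, which is the problem described in the quote.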
 
Carmack's comments are very old.

In a more recent .plan, he mentioned that making the vertex programs invariant with the fixed function pipeline on NV2x hardware was faster than what he was doing previously.
 
Luminescent said:
Could it be that the 8500 has greater functionality in its vertex shader than even the GeForce 4, which has been greatly praised for its nfinite solutions? Maybe OpenGL guy could shed some light on this.
Sorry, I don't know much more about the 8500 HW capabilities other than what I know from playing games on it :)
 