"Not conventional pipes"

PeterAce said:
What's the difference between an 'ALU' and a 'register combiner'?
ALU (arithmetic and logic unit) is a generic term, coming out of CPU land and quickly gaining significance in GPU land.

When people talk about "register combiners", most of the time, they refer to some very specific kind of ALU. It's the basic building block in NV10 (and later generations up to at least NV30) pixel processing. It's a three-way VLIW thingy, and NVIDIA proprietary, obviously.
 
Pete said:
The XBOX NV2A has one vertex shader, then 4 pipelines with an FP32 TMU + FX9 + FX9 (note two shaders) per pipeline.
I'm still under the impression that NV2A, like NV25, had two vertex shaders, but I wouldn't put it beyond the realm of possibility that NV2A had additional improvements over NV20.

NV2A has 2 vertex shaders, with 6 simultaneous threads. This is usually calculated as a dot product taking 0.5 cycles, while a thread-stalling operation (i.e. writing to constant memory) causes a dot product to take 3 cycles (6 times slower).
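
One way to read those numbers, as a back-of-the-envelope sketch only: assume a dot product has a fixed latency of 3 cycles and that up to 6 independent threads can overlap to hide it, so the amortized cost is the latency divided by the threads actually in flight. The function name, its parameters, and the model itself are assumptions made for illustration; none of it comes from the Xbox SDK.

```python
def cycles_per_dot_product(threads_in_flight, latency_cycles=3.0, max_threads=6):
    """Amortized cycles per dot product in a simple latency-hiding model.

    Assumption: one dot product takes `latency_cycles` to complete, and up to
    `max_threads` independent threads can overlap to hide that latency.
    """
    if threads_in_flight < 1:
        raise ValueError("need at least one thread in flight")
    active = min(threads_in_flight, max_threads)
    return latency_cycles / active


full = cycles_per_dot_product(6)     # all six threads overlapping
stalled = cycles_per_dot_product(1)  # assumed: a stalling op leaves one thread running
print(full, stalled, stalled / full)
```

Under these assumptions the fully threaded case and the stalled case differ by exactly the thread count, which is where the factor of 6 would come from.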

At the register combiner level it's both simpler and more complicated at the same time. It operates on 4 pixels at the same time (a quad), but how often it can output an ALU operation is quite variable. This is because it has both shortcuts and stalls.
 
maven said:
Maybe a quad shares more logic than one would think (e.g. maybe the 4 pixels in a quad share texture-lookup logic, so that if the texture lookups for all 4 pixels lie in the same 2x2 texel footprint (bilinear filtering) [or maybe texel-cache block] everything's fine, but if some texels lie in different regions additional latencies might be introduced).

That's my guess. Maybe. :?
While you don't have to render in a quad-based fashion, it makes sense when you consider textures, for just the reason you mentioned: bilinear filtering.
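
To make the footprint argument concrete, here is a small sketch that counts how many distinct texels a 2x2 pixel quad touches under bilinear filtering. It assumes the usual convention that a sample at texel-space position t blends texels floor(t - 0.5) and floor(t - 0.5) + 1 on each axis; the function names and the `texels_per_pixel` parameter are made up for illustration and say nothing about how NV2A actually shares lookups.

```python
import math


def bilinear_footprint(tx, ty):
    """Return the 2x2 set of texel coordinates blended for one sample."""
    x0 = math.floor(tx - 0.5)
    y0 = math.floor(ty - 0.5)
    return {(x0 + dx, y0 + dy) for dx in (0, 1) for dy in (0, 1)}


def quad_unique_texels(tx, ty, texels_per_pixel):
    """Count distinct texels touched by the four pixels of a quad.

    (tx, ty) is the texel-space position of the quad's top-left pixel and
    texels_per_pixel is the texel-space step between neighbouring pixels.
    """
    texels = set()
    for py in (0, 1):
        for px in (0, 1):
            texels |= bilinear_footprint(tx + px * texels_per_pixel,
                                         ty + py * texels_per_pixel)
    return len(texels)


for step in (0.25, 1.0, 4.0):
    print(step, quad_unique_texels(10.6, 10.6, step))
```

Four independent lookups would fetch 16 texels, so any count below 16 is overlap that shared lookup logic could exploit; with a large step between pixels there is nothing left to share.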
 
Thank you all for responding. Let me preface my post with this disclaimer: I'm probably out of my depth WRT the specifics of a GPU's architecture, so Deano's reply was somewhat over my head. Forgive me for the boneheaded questions that follow (e.g., Is the NV2A's "6 simultaneous threads" a feature of the GPU or of the Xbox's software?). If they're too time-consuming to answer, please say so. Can I find the answers in RTR 2/E?

My interest stems from this thread at Ars discussing Doom 3 on the Xbox. BassVolvo, whom I quoted in my previous post, also quoted you, Deano, when speculating that NV2A was more powerful than NV25.

I'm still not sure if NV2A is more powerful per clock than NV25 from a hardware or software POV. Hardware-wise, I thought NV15, NV2A, and NV25 all had the same vertex shaders, but the latter two had two of those shaders. So I was surprised when BV claimed NV2A had only a single VS, at least based on my thinking that all three GPUs' VS units are the same. Software-wise, I thought Deano's post praising the Xbox's VS performance (quoted in the Ars thread) was more an issue of the Xbox allowing him to code closer to the metal ("thinner" OS & API) than of NV2A allowing for more vertex shader ops per clock than NV25.

Deano, your post above still leaves me slightly confused. :) Does NV25 also have 2 VSs with 6 simult. threads? Does the NV2A's ability to hold 6 simultaneous threads reduce or mask the thread stalling op penalty?

Apologies for my ignorance. I'll try to conceal it better next time. ;)
 
Pete said:
My interest stems from this thread at Ars discussing Doom 3 on the Xbox. BassVolvo, whom I quoted in my previous post, also quoted you, Deano, when speculating that NV2A was more powerful than NV25.
I wouldn't say it's more powerful, but getting closer to the hardware lets you do considerably more with what you're given. The other thing to remember is that it's not a simple linear progression from NV20 to NV2A to NV25. In some cases the NV2A has a few things NV25 doesn't have, and vice versa (at least I think so; it could be that the NV25 does have everything NV2A does but it's simply not exposed via any API).

Pete said:
Deano, your post above still leaves me slightly confused. :) Does NV25 also have 2 VSs with 6 simult. threads? Does the NV2A's ability to hold 6 simultaneous threads reduce or mask the thread stalling op penalty?

I suspect that NV2A and NV25 have identical vertex shaders; it's just that the XBOX SDK gives us more details and lower-level control over them. The threading is simply the way to remove latency; it's what allows 0.5 cycles per dot product. Normally you wouldn't be doing any thread-stalling operations (on XBOX it's a special vertex shader mode that isn't supported on PC) and you have more than 6 vertices 'in flight'. It's why writing constants is banned on PC and you only get good performance with large batches (if you send fewer than 6 vertices at a time, the hardware stalls).
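
The batch-size point can be sketched with a toy model. It assumes vertices are processed in groups of up to 6 thread slots, that each dot product has a 3-cycle latency, and that a partly filled group takes as long as a full one. `THREAD_SLOTS`, `DP_LATENCY`, and both function names are hypothetical, and the real hardware scheduling is certainly more subtle than this.

```python
import math

THREAD_SLOTS = 6  # assumed number of vertices the hardware keeps in flight
DP_LATENCY = 3    # assumed cycles for one dot product to complete


def batch_cycles(num_vertices, dps_per_vertex):
    """Cycles to process a batch, assuming idle thread slots are simply wasted."""
    if num_vertices <= 0:
        return 0
    groups = math.ceil(num_vertices / THREAD_SLOTS)
    return groups * dps_per_vertex * DP_LATENCY


def effective_cycles_per_dp(num_vertices, dps_per_vertex):
    """Average cycles spent per dot product for one batch."""
    total_dps = num_vertices * dps_per_vertex
    return batch_cycles(num_vertices, dps_per_vertex) / total_dps


for n in (1, 3, 6, 600):
    print(n, effective_cycles_per_dp(n, dps_per_vertex=10))
```

In this model a batch that fills all six slots reaches the 0.5 cycles per dot product quoted above, while a single vertex pays the full 3-cycle latency on every instruction.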
 