Given the bandwidth-saving features, do you know why "Villagemark" is still comparatively slow on the GF series compared to both KYRO and Radeon 8500?
Tony: Villagemark happens to have been designed to be the "best case" poster child for Kyro's style of rendering. It is a high depth complexity scene, with very little geometric complexity, sorted back to front. Back-to-front sorting is the worst-case ordering for all traditional graphics processors, as it requires every pixel to have a z-read, a z-write, a color-read and a color-write for every layer of depth complexity. Even randomly sorted (i.e. unsorted) applications will be faster, since the first layer of depth complexity will always cost the same, but subsequent layers will require a z-write and color-write only 50% of the time (i.e. half as much bandwidth for each layer of depth complexity beyond the first, compared to the back-to-front model). Real applications don't sort the entire scene back to front for this reason; it's just slow on everything. Back-to-front also means that occlusion culling in GeForce4 provides no benefit, since each subsequent pixel received is always going to be visible, so no rendering can be saved.
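To illustrate the point about draw order, here is a minimal sketch of a single pixel's z-buffer with a plain less-than depth test. It is not a model of any real GPU: the function name `count_visible_fragments`, the eight-layer scene and the fixed shuffle seed are all assumptions made for the example. It only counts how many incoming fragments pass the depth test, since each pass implies a z-write and a color-write that cannot be skipped.

```python
import random


def count_visible_fragments(depths):
    """Count fragments that pass a less-than depth test at one pixel.

    `depths` lists fragment depths in draw order; smaller means nearer.
    Every pass stands for a z-write and a color-write.
    """
    nearest = float("inf")  # value of a freshly cleared z-buffer
    passes = 0
    for z in depths:
        if z < nearest:  # fragment is in front of everything drawn so far
            nearest = z
            passes += 1
    return passes


# Hypothetical scene: 8 layers of depth complexity over one pixel.
layers = [float(i) for i in range(1, 9)]

back_to_front = sorted(layers, reverse=True)  # farthest layer drawn first
front_to_back = sorted(layers)                # nearest layer drawn first
shuffled = layers[:]
random.Random(0).shuffle(shuffled)            # unsorted draw order

for name, order in [("back to front", back_to_front),
                    ("front to back", front_to_back),
                    ("unsorted", shuffled)]:
    print(name, count_visible_fragments(order))
```

In back-to-front order every fragment passes, so there is nothing for occlusion culling to reject. In front-to-back order only the first fragment passes, and an unsorted order lands somewhere in between.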
NV engineer: Villagemark is a near worst-case for non-tile-renderers. It lies somewhere between random sorting and back-to-front sorting of geometry (and yes, I've stepped through it polygon by polygon). It's possible that ATI is doing something like reordering geometry in the driver to boost their scores here. Also, ATI's hardware can reject 64 pixels per clock. Their hardware for doing this is not very sophisticated from what I understand, and it may break down in certain complex situations, but it's sufficient for Villagemark.