Supposed futuremark score on 6600

vb said:
I don't understand. 1/4 the ROPs of NV40 would mean 8 ROPs; with 8 pixel pipes it should be able to output 8 pixels/clock (in theory). It should be limited to 4 pixels/clock only in 2xAA mode.

http://www.beyond3d.com/previews/nvidia/nv40/index.php?p=11
http://www.beyond3d.com/previews/nvidia/6800s/index.php?p=6

A ROP is the pixel engine - a single ROP in NV40 is capable of half a pixel blend, one colour write or 2 Z/Stencil writes. There are 16 ROPs in NV40, so 1/4 of the ROPs in 6600 would equate to 2 blends, 4 colour writes and 8 Z/Stencil writes per cycle.

(Presumably this will also knock the FP16 blending down to a single blend per cycle).
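
For anyone who wants to check the arithmetic, here is a minimal sketch that just scales the per-ROP figures quoted above. The names `PER_ROP` and `rop_throughput` are made up for illustration, and the model assumes throughput scales linearly with ROP count, which is a simplification.

```python
from fractions import Fraction

# Per-ROP, per-clock rates as stated in the post above.
# A ROP does one of these in a given clock, not all three at once.
PER_ROP = {
    "blends": Fraction(1, 2),
    "colour_writes": Fraction(1),
    "z_stencil_writes": Fraction(2),
}


def rop_throughput(rop_count):
    """Return per-clock totals for a chip with `rop_count` ROPs (linear scaling assumed)."""
    return {name: rate * rop_count for name, rate in PER_ROP.items()}


nv40 = rop_throughput(16)          # the 16 ROPs counted above
quarter = rop_throughput(16 // 4)  # 1/4 of the ROPs

for name in PER_ROP:
    print(f"{name}: NV40 {nv40[name]}, quarter {quarter[name]}")
```

With a quarter of the ROPs the totals come out as 2 blends, 4 colour writes or 8 Z/Stencil writes per clock, matching the numbers in the post.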
 
Well, I have just (oops!) noticed here a certain "Fragment Crossbar" that R420 (and also, I assume, R3xx) seems to lack.

So to bring back old (but highly enjoyed) topics:

Can an architecture that outputs 4 colour writes and 8 Z/Stencil writes per cycle be called an 8-pipe card?

sorry, had to... :LOL:
 
Well, the difference would be in the shader power, wouldn't it? That will be the primary limitation for most games anyway. There is a difference, for example, between an architecture that can do eight shader operations in parallel per clock, and one that can do four operations in parallel, with two in serial in each pipeline (since there are often limitations on which operations can be done in serial).
 
More tricky nomenclature, but as long as the performance is there, it's not that big a deal to me. I am more than a little confused that sites still rely mainly on 3DM to compare video cards, though, given recent history.
 
In any case, would that configuration make it act like a 4x2 design? Or is there something I don't see? Of course it could be the memory bandwidth, but other than that... It seems really weird... Why would they go the 4 color/8 Z route again after all the criticism? Was it really that good a design?
 
Well, not really. Shader throughput is what will be more important in coming games. This architecture should still be capable of having eight pixels in flight at one time. It just has to spend more than one clock per pixel in order to keep them outputting. Most cases in games today will require more than one clock per pixel (the reduced output rate will be a drawback only in single-clock situations such as rendering shadow hulls for stencil shadow volumes, or rendering shadow maps).

Additionally, a 4x2 architecture gains performance from pairing texture instructions. This sort of architecture would not. Basically, having fewer ROPs means that it will be less efficient for cases such as shadow rendering above, but the efficiency should be identical to an 8x1 architecture for every pixel that takes more than one clock to execute (assuming all other things the same). A 4x2 architecture would be less efficient most of the time.
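
The argument above can be put as a toy model. This is only a sketch under stated assumptions: each pipe finishes one pixel every `shader_clocks` clocks, the ROPs retire at most `colour_rops` colour pixels per clock, and nothing else (bandwidth, texturing, Z) gets in the way. The function name and parameters are hypothetical.

```python
def pixels_per_clock(pipes, colour_rops, shader_clocks):
    """Sustained colour pixels per clock in a simple two-stage model.

    Assumption: the slower of the shader stage and the ROP stage sets the rate.
    """
    shader_rate = pipes / shader_clocks  # pixels leaving the pipes per clock
    return min(shader_rate, colour_rops)


for clocks in (1, 2, 4):
    full = pixels_per_clock(pipes=8, colour_rops=8, shader_clocks=clocks)
    fewer_rops = pixels_per_clock(pipes=8, colour_rops=4, shader_clocks=clocks)
    print(f"{clocks} clock(s)/pixel: 8 ROPs -> {full}, 4 ROPs -> {fewer_rops}")
```

In this model the two configurations differ only when a pixel takes a single clock; from two clocks per pixel upward the pipes, not the ROPs, are the limit.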
 
Chalnoth said:
Well, not really. Shader throughput is what will be more important in coming games. This architecture should still be capable of having eight pixels in flight at one time. It just has to spend more than one clock per pixel in order to keep them outputting. Most cases in games today will require more than one clock per pixel (the reduced output rate will be a drawback only in single-clock situations such as rendering shadow hulls for stencil shadow volumes, or rendering shadow maps).

Additionally, a 4x2 architecture gains performance from pairing texture instructions. This sort of architecture would not. Basically, having fewer ROPs means that it will be less efficient for cases such as shadow rendering above, but the efficiency should be identical to an 8x1 architecture for every pixel that takes more than one clock to execute (assuming all other things the same). A 4x2 architecture would be less efficient most of the time.

It does shed new light on Nvidia's marketing slogan "the Doom3 GPU". Based on this, if RV410 is a 2 quad R420, it should be equally fast during Z pass and faster during fragment processing. Not to mention that, if future patches off-load CPU work to the VS, RV410 has 6 VS units.
 
vb said:
not to mention that, if future patches off-load CPU work to the VS, RV410 has 6 VS units.

That won't really matter unless there are places in the game that are (going to be) VS limited, which I highly doubt.
 
vb said:
if future patches off-load CPU work to the VS

Previous ID games had quite a number of patches that didn't just fix bugs, but also changed the renderer a lot. ID seems to test whatever changes they make for licensing the engine on the actual game, and I can see a request to lower CPU load.
 
DaveBaumann said:
http://www.beyond3d.com/previews/nvidia/nv40/index.php?p=11
A ROP is the pixel engine - a single ROP in NV40 is capable of half a pixel blend, one colour write or 2 Z/Stencil writes. There are 16 ROPs in NV40

Your counting method doesn't seem to match up with the diagram shown. NVidia shows "C ROP" and "Z ROP" as separate. It does not diagram them as one "uber ROP" that is both C+Z or 2Z. There are therefore 32 ROPs in the NV40, but 16 of them can only write Z values. The C ROP is special. It can be "borrowed" to write an extra Z. As for blending, you can choose to look at it as having half the number of blend units (8) or simply that a blend has a latency of two cycles.

Ultimately, with only a 128-bit bus, and being shader limited, including more ROPs doesn't make much sense anyway.
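
A rough back-of-the-envelope sketch of the bus argument, with loudly stated assumptions that are not from the thread: 32-bit (4-byte) colour writes, double-data-rate memory (two transfers per memory clock), core and memory running at the same clock (the `mem_to_core_ratio` of 1.0 is purely hypothetical), no compression, and no Z or texture traffic counted.

```python
def bus_bytes_per_core_clock(bus_bits, transfers_per_mem_clock=2, mem_to_core_ratio=1.0):
    """Peak bytes the memory bus can move per core clock (all defaults are assumptions)."""
    return bus_bits / 8 * transfers_per_mem_clock * mem_to_core_ratio


def colour_write_bytes(writes_per_clock, bytes_per_pixel=4):
    """Bytes per clock needed for uncompressed colour writes (32-bit colour assumed)."""
    return writes_per_clock * bytes_per_pixel


budget = bus_bytes_per_core_clock(128)
for writes in (4, 8):
    share = colour_write_bytes(writes) / budget
    print(f"{writes} colour writes/clock would use {share:.0%} of the peak bus budget")
```

Under these assumptions 8 colour writes per clock would already consume the whole peak budget of a 128-bit bus before any Z or texture traffic, which is the gist of the point above. Real clocks and compression would change the exact numbers.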
 
DemoCoder said:
Your counting method doesn't seem to match up with the diagram shown. NVidia shows "C ROP" and "Z ROP" as separate. It does not diagram them as one "uber ROP" that is both C+Z or 2Z. There are therefore 32 ROPs in the NV40, but 16 of them can only write Z values.

Actually, NVIDIA's documentation classes the whole "Pixel Engine" as a single ROP, which contains a specific Z ROP and a Combination ROP. You can see NVIDIA's slide "Detail of a Single ROP Pipeline" at Tom's; ergo, according to NVIDIA's documentation there are 16 ROPs in total, with a total of 16 Z ROPs and 16 Combination ROPs.

The C ROP is special. It can be "borrowed" to write an extra Z.

Yes, that's explained in the article.

As for blending, you can choose to look at it as having half the number of blend units (8) or simply that a blend has a latency of two cycles.

I was choosing an aggregate on a per clock basis.
 
Either way, it yields the same result (32/4 = 8 ROPs, but 4 are color, and 4 are Z). The lower ROP count really only limits the max single/no texturing fillrate, given that the bus is 128 bits wide.
 
DemoCoder said:
Either way, it yields the same result (32/4 = 8 ROPs, but 4 are color, and 4 are Z).

:?:

And my counting detailed this in the first place :!:

The lower ROP count really only limits the max single/no texturing fillrate, given that the bus is 128 bits wide.

Well, the same can be said of NV40 to a certain extent - the core and memory clock bias doesn't offset the relative bandwidth to pixel output.
 
DemoCoder said:
Either way, it yields the same result (32/4 = 8 ROPs, but 4 are color, and 4 are Z).

And is this still speculation, or a confirmed result?

I thought... I mean... I think people were led to believe this was a 32/2 solution... as in 8 color... not a 32/4. If they have "only one quad out of four" so to speak remaining in this product, how are they going to rape the low end 6200s? :oops:
 
I find this all relevant too... since I play a lot of games, and the most common cases where I was having bad performance in a game or benchmark with my last GPU were cases where I ended up fillrate bound.

Now with a 16 color/clock GPU all those situations perform just dandy... 8 color/clock wasn't enough... now I hate to imagine what might happen, for example, with a 6600 series card (4 color pixels per clock) in the Aquamark 3 end explosion... or Painkiller in the monastery level with all the torches... especially with AA...
 
vb said:
It does shed new light on Nvidia's marketing slogan "the Doom3 GPU". Based on this, if RV410 is a 2 quad R420, it should be equally fast during Z pass and faster during fragment processing. Not to mention that, if future patches off-load CPU work to the VS, RV410 has 6 VS units.

Oops, I forgot that the NV43 should have twice the Z ROPs when making my above post. That eliminates pretty much any scenario where this architecture would be less efficient than a "normal" 8x1 architecture. No, the RV410, if it is simply a two quad R420, will be no faster during fragment processing, since the NV43 can still do just as much fragment processing; it just can't output as much. This only limits color outputs with single texturing, which is a situation that typically just doesn't happen anymore.
 
Chalnoth said:
Oops, I forgot that the NV43 should have twice the Z ROPs when making my above post. That eliminates pretty much any scenario where this architecture would be less efficient than a "normal" 8x1 architecture. No, the RV410, if it is simply a two quad R420, will be no faster during fragment processing, since the NV43 can still do just as much fragment processing; it just can't output as much. This only limits color outputs with single texturing, which is a situation that typically just doesn't happen anymore.

So

1. NV43 can work on 8 pixels/clock just like RV410
2. NV43 can output 8 "zixels"/clock just like RV410
3. NV43 can write 4 color pixels/clock, half of RV410
Of course you can't output 8 pixels/clock anyway due to the length of fragment processing, but 8 pipes sharing 4 writes shouldn't be as fast.

On the plus side they might have similar clockspeed so...

My main gripe is calling NV43 "the doom3 GPU" while it is the first from Nvidia in 2 years that doesn't feature 2x Z-fill rate.
 
vb said:
My main gripe is calling NV43 "the doom3 GPU" while it is the first from Nvidia in 2 years that doesn't feature 2x Z-fill rate.

I thought that performance is the only thing that counts, not the number of pipelines or Z fillrate.
 
So what exactly makes this card approx. 2x faster in Doom 3 than the PCX 5900 (which has a 256-bit memory bus)?
And that even though the 5900 can do 16 Z operations per clock and this card only 8 (as I understood from vb's post)? The 5900's clock speed isn't so much lower as to cause such a big performance difference. And there is 3DMark03 too...

(Until now I thought NV43 would be the same as the 6800 GT/U, except with only 2 quads and only 3 ps, and therefore able to do 16 Z ops / 8 normal ops per clock and output only 4 (as its memory bus has half the bandwidth of the GT/U's))

Can anyone make clear to me how it really works and why it scores so high in Doom 3 and 3DMark 2003?
 