Hey! It's the 13th. I want my juicy info!!

I'd be very surprised if the co-issue has to be 3+1 in SU1 and 2+2 in SU2. That would be a rather strange design. As you say, 3+1 is more useful than 2+2, so why even bother with a 2+2 unit, especially on the unit that is available most often?

I don't see it as a big problem to have a selective split between 3+1 and 2+2. The step to a 3 way co-issue is larger (2+1+1).

The big question mark here, though, is the rules for the input registers.
Swizzles over the split border?
Different registers over the split border?
 
demalion said:
The NV40's apparent stencil element output of 32 per clock is 4 times both the R3xx and NV30/35.
Does anyone else think this is a bit of a wastage?

Consider the stencil-only passes in Doom3. You need a Z-read, a stencil read, and a stencil write for each pixel. Assuming 4:1 average Z compression, that's 3 bytes per pixel. With a 400 MHz core clock and 600 MHz memory clock, you get 48 bytes total per clock, allowing 16 pixels to be drawn.
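
To spell out that estimate, here is a rough Python sketch. It takes the 48 bytes per core clock and 3 bytes per pixel figures exactly as stated in the post; the per-access breakdown (24-bit Z at 4:1, one stencil byte read, one written) is only my reading of where the 3 bytes come from, and the function name is made up for illustration.

```python
def pixels_per_core_clock(bytes_per_core_clock, bytes_per_pixel):
    """Upper bound on pixels per core clock when purely bandwidth limited."""
    return bytes_per_core_clock / bytes_per_pixel


# Assumed breakdown of the per-pixel traffic (not spelled out in the post):
z_read = 3 / 4        # 24-bit Z value read at 4:1 compression
stencil_read = 1      # 8-bit stencil, uncompressed
stencil_write = 1     # 8-bit stencil, uncompressed
approx_bytes_per_pixel = z_read + stencil_read + stencil_write  # 2.75, roughly 3

if __name__ == "__main__":
    print(approx_bytes_per_pixel)
    print(pixels_per_core_clock(48, 3))
```

With the post's rounded figures this gives 16 pixels per clock; everything hinges on the bytes-per-clock number fed in.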

I doubt you can get that much better average Z compression, but even if you could it won't help much. Do you think they have stencil compression too? Maybe an algorithm allowing 2-bit variation about an average value?

If NV40 can actually output 32 pixels per clock doing the aforementioned stencil pass, I'll be amazed. They sure pinched every penny with their transistor budget, which was already extremely large to begin with.

If those 3DMark2003 numbers are real, I have a feeling NV40's throughput is nowhere near these numbers, especially since GT2, GT3, and GT4 should each have up to 4 times the performance of NV35.
 
Typical configuration is 24 bit Z and 8 bit stencil in one 32 bit word.
So make that 2 byte/pix with 4:1 compression. With some creative swizzling, you could make the stencil write fully efficient, and end up with 1.25 byte/pix with 4:1 compression. So it's at least theoretically possible to run at 32 z-pix per clock.
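
A quick sketch of how I read those two numbers, in Python. The split into "read the whole word, write the whole word" versus "read the whole word, write only the stencil byte" is an assumption on my part, and it assumes the 4:1 ratio applies to everything that is transferred.

```python
WORD_BYTES = 4      # 24-bit Z + 8-bit stencil packed in one 32-bit word
STENCIL_BYTES = 1   # the 8-bit stencil part on its own
RATIO = 4           # assumed 4:1 compression on all transfers


def full_word_traffic():
    """Read and write the whole Z/stencil word, both compressed."""
    read = WORD_BYTES / RATIO
    write = WORD_BYTES / RATIO
    return read + write


def stencil_only_write_traffic():
    """Read the whole word, but write back only the stencil byte."""
    read = WORD_BYTES / RATIO
    write = STENCIL_BYTES / RATIO
    return read + write


if __name__ == "__main__":
    print(full_word_traffic())           # bytes per pixel, simple case
    print(stencil_only_write_traffic())  # bytes per pixel, swizzled case
    print(32 * stencil_only_write_traffic())  # bytes per clock at 32 pix/clock
```

At 1.25 byte/pix, 32 pixels per clock needs 40 bytes per clock.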

Since the stencil value often is constant over rather large areas, it should be possible to compress well.
 
I don't think current hardware compresses the stencil alongside the Z, and the 4:1 ratio only applies to the Z part. I know that's how the buffer is traditionally stored, but when doing Z compression the stencil bytes for a block are likely stored together.

I do agree with what you said about stencils being easily compressible, though. Do you think the scheme I suggested is likely? Say one 8 bit average value and 32 3-bit difference values, resulting in 4x8 block taking only 4 dwords if successful. It seems easy enough to implement, so I guess it's quite likely that NV40 has stencil compression.
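
To make the idea concrete, here is a minimal Python sketch of that kind of scheme. It is purely hypothetical: the signed delta range (-4 to 3), the use of an integer mean as the base value, and the function names are all my own assumptions, not anything known about NV40.

```python
BLOCK_SIZE = 32   # 4x8 block of stencil values
BASE_BITS = 8     # one 8-bit base (average) value
DELTA_BITS = 3    # one 3-bit signed difference per pixel


def compress_stencil_block(values):
    """Return (base, deltas) if the block fits the scheme, else None."""
    if len(values) != BLOCK_SIZE:
        raise ValueError("expected a 4x8 block of 32 stencil values")
    base = sum(values) // BLOCK_SIZE          # integer mean, assumed rounding
    deltas = [v - base for v in values]
    lo, hi = -(1 << (DELTA_BITS - 1)), (1 << (DELTA_BITS - 1)) - 1
    if any(d < lo or d > hi for d in deltas):
        return None                           # fall back to uncompressed storage
    return base, deltas


def compressed_dwords():
    """Storage for a successfully compressed block, rounded up to dwords."""
    bits = BASE_BITS + BLOCK_SIZE * DELTA_BITS
    return -(-bits // 32)                     # ceiling division


if __name__ == "__main__":
    print(compress_stencil_block([5] * 32))
    print(compress_stencil_block([5] * 31 + [200]))
    print(compressed_dwords())
```

The payload is 8 + 32*3 = 104 bits, which rounds up to 4 dwords, versus 8 dwords for 32 raw stencil bytes. A block with one outlier simply fails to compress.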

If NVidia can keep the costs reasonable, especially for the value versions, they have a winner on their hands. Even if you take one quarter of the pipes, it should be significantly faster and much better featured than RV360 in all situations.
 
Mint,
48 bytes per clock * 400 MHz = 19.2 GB/s, but the memory is capable of 38.4 GB/s. Can't DDR write 512 bits of info peak per clock (@ 600 MHz)?
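
Here is the arithmetic as a small Python sketch. The 256-bit bus width is an assumption, inferred from the 512 bits per memory clock with DDR; the helper names are just for illustration.

```python
def peak_bandwidth_bytes_per_sec(bus_bits, mem_clock_hz, transfers_per_clock=2):
    """Peak memory bandwidth; DDR moves data twice per memory clock."""
    return (bus_bits // 8) * transfers_per_clock * mem_clock_hz


def bytes_per_core_clock(bandwidth_bytes_per_sec, core_clock_hz):
    """How many bytes the memory can deliver during one core clock."""
    return bandwidth_bytes_per_sec / core_clock_hz


MEM_CLOCK_HZ = 600_000_000
CORE_CLOCK_HZ = 400_000_000
BUS_BITS = 256  # assumed bus width

peak = peak_bandwidth_bytes_per_sec(BUS_BITS, MEM_CLOCK_HZ)
per_clock = bytes_per_core_clock(peak, CORE_CLOCK_HZ)

if __name__ == "__main__":
    print(peak / 1e9)      # GB/s
    print(per_clock)       # bytes per core clock
    print(per_clock / 3)   # pixels per clock at 3 bytes per pixel
```

That works out to 96 bytes per core clock rather than 48, so at 3 bytes per pixel the bandwidth ceiling is 32 pixels per clock.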

I think one factor is that it is easier to increase memory quickly than to increase core. Even if the 32-zixels is bandwidth bound, it may be able to achieve closer to its peak on future SKUs with faster memory, scaling more easily. Both NV40 and R420 have cores so overpowered it's sick. X800XT looks to have 8 gigapixels of fillrate. No way it will get close to that unless you have texturing and z disabled.

But isn't it good to have spare reserve power left over in the GPU to use just in case you need it?
 
Bouncing Zabaglione Bros. said:
dan2097 said:
True, most of the article is already common knowledge, but there is some interesting information.

Yes, confirmation of the two-slot cooling and higher power requirement, which pretty much confirms the high heat output. Especially interesting given the recent story of the NV40 review machine that gets 10 degrees hotter and shuts down even though it has a lot of fans. It will be interesting to know if Nvidia have a quiet cooling solution or another dustbuster.

16xAA could be good, although I hope it's better than Nvidia's previously weak AA.

I notice the chips are made by IBM too.


(I know this is a way old post, but) Nvidia already supports 16x AA in OpenGL: 4x OGMS and 12x OGSS. It's not much better than 8xS though (2x RGMS, 4x OGSS).
 
DemoCoder said:
Mint,
48 bytes per clock * 400 MHz = 19.2 GB/s, but the memory is capable of 38.4 GB/s. Can't DDR write 512 bits of info peak per clock (@ 600 MHz)?

Oops. :oops:


Well, I guess you can just completely ignore my post above. I'm usually careful when doing these things.

Thanks for pointing that out! Your other points are good as well.
 
Mintmaster said:
If those 3DMark2003 numbers are real, I have a feeling NV40's throughput is nowhere near these numbers, especially since GT2, GT3, and GT4 should each have up to 4 times the performance of NV35.

According to some nvidia guy I talked to way back, lots of GT2 and GT3 are *triangle setup* limited due to the degenerate quads used for vertex shader shadow volume silhouette extraction. Of course, this might not be the case with more modern hardware, and they're probably bandwidth limited at higher resolutions.
 
The problem with that theory is that 3DMark2003 scales quite well with resolution. That's why NVidia's criticism of the rendering method was so full of crap.

At 4 times the stencil fillrate, however, there will probably be more of a bottleneck. I guess we'll soon see how much faster NV40 is in 3DM2K3 at higher resolutions.
 