ATi claims 174.6 GB/s for Radeon 9700 Pro

PrzemKo said:
Taking into account the 8 pipes @ 500 MHz of the GFFX, you have 4 Gpix/s = 16 GB/s at 32 bits/pixel. Add the Z-buffer to that and you'll see that memory bandwidth could be entirely consumed by NV30's fillrate alone (sans compression and optimizing techniques). Aniso WILL eat bandwidth.

No, it won't, particularly not from that argument.

First of all, there are compression and optimizing techniques at work to significantly lower the bandwidth requirements per pixel.

Secondly, not all pixels take a single clock to calculate. If there is a second texture in use, it will take two clocks. Many of today's games will use up to 5-6 textures at once. If anisotropic filtering is in use, it can take up to eight clocks to produce a single pixel.

Whichever way you slice it, anisotropic filtering will never take more of a memory bandwidth hit than it takes for a fillrate hit. Given that the current GeForce4s are generally not very limited in memory bandwidth at 2x FSAA, you can expect that the GeForce FX will not be limited with no FSAA.
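As a rough back-of-the-envelope sketch of that trade-off (the 8 pipes, 500 MHz core and 16 GB/s of bandwidth are assumed figures for illustration, as is the 8 bytes of framebuffer traffic per pixel; texture traffic is ignored):

# Rough sketch: clocks per pixel vs. framebuffer bandwidth per second.
# Assumed, illustrative figures: 8 pipes, 500 MHz core, 16 GB/s of memory
# bandwidth, and 8 bytes of framebuffer traffic per written pixel
# (32-bit colour write plus compressed Z read/write).  Texture traffic ignored.

PIPES, CORE_MHZ, MEM_GBPS = 8, 500, 16.0
BYTES_PER_PIXEL = 8

for clocks_per_pixel, case in [(1, "1 texture, bilinear"),
                               (2, "2 textures"),
                               (8, "8x aniso, single texture")]:
    pix_per_sec = PIPES * CORE_MHZ * 1e6 / clocks_per_pixel
    fb_gbps = pix_per_sec * BYTES_PER_PIXEL / 1e9
    print(f"{case:26s} {pix_per_sec / 1e9:.1f} Gpix/s, "
          f"~{fb_gbps:.0f} GB/s of framebuffer traffic (have {MEM_GBPS:.0f})")

The more clocks a pixel takes to shade, the less framebuffer traffic it generates per clock, which is why a fillrate hit tends to come with a smaller relative bandwidth hit.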
 
Chalnoth said:
Whichever way you slice it, anisotropic filtering will never take more of a memory bandwidth hit than it takes for a fillrate hit.
You've got it backwards. What's the difference between point-sampling and bilinear filtering? More texture samples. How about bilinear and trilinear? More texture samples. How about trilinear and anisotropic? Potentially more texture samples. If you had an architecture that could compute anisotropic texels in a single cycle, what would the difference between point-sampling and anisotropic be? More memory bandwidth.
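To put rough numbers on "more texture samples" (these counts are a common textbook approximation, and the anisotropic figures assume each anisotropy probe is a full trilinear footprint, which is an assumption rather than a statement about any particular chip):

# Approximate texel fetches per pixel, per texture, for common filter modes.
# The anisotropic counts assume each anisotropy "probe" takes a full trilinear
# footprint; real hardware varies, so treat these as illustrative only.
filter_cost = {
    "point sampling":        1,
    "bilinear":              4,
    "trilinear":             8,
    "2x aniso (trilinear)":  2 * 8,
    "8x aniso (trilinear)":  8 * 8,
}
for mode, texels in filter_cost.items():
    print(f"{mode:22s} ~{texels} texels")

If all of those fetches had to be serviced in one clock, the extra cost would show up as bandwidth; spread over several clocks, it shows up as fillrate instead.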
 
oh yeah!? well the NV1....... oh hell, not even the most devout nVidiot could do anything to make THAT chip sound good!
 
Chalnoth said:
PrzemKo said:
Taking into account the 8 pipes @ 500 MHz of the GFFX, you have 4 Gpix/s = 16 GB/s at 32 bits/pixel. Add the Z-buffer to that and you'll see that memory bandwidth could be entirely consumed by NV30's fillrate alone (sans compression and optimizing techniques). Aniso WILL eat bandwidth.

No, it won't, particularly not from that argument.

First of all, there are compression and optimizing techniques at work to significantly lower the bandwidth requirements per pixel.

Yeah, but even NVidia said that compression is not very significant without AA. With AA, compression only reduces the increase. When you add everything together, even with fairly ideal 4:1 Z compression you need 48 bits written per normal 3D pixel. Alpha adds another 32-bits, and textures can also be significant although compression helps a lot. Finally, getting 100% out of your memory controller is nearly impossible. There will be very isolated circumstances where NVidia will be able to achieve 8 pix per clock.
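Putting rough numbers on that accounting (the 8 pixels/clock, 500 MHz and 16 GB/s figures are assumptions about NV30 used purely for illustration, and texture fetches are left out):

# Sketch of the per-pixel framebuffer budget described above.
# Assumed, illustrative figures: 8 pixels/clock, 500 MHz core, 16 GB/s
# of physical bandwidth.  Texture traffic is not included.

PIXELS_PER_CLOCK, CORE_HZ, MEM_BPS = 8, 500e6, 16e9

colour_write = 32                    # bits per pixel
z_traffic    = (32 + 32) / 4         # Z read + write with an ideal 4:1 compression
alpha_read   = 32                    # extra framebuffer read when alpha blending

for label, bits in [("opaque pixel", colour_write + z_traffic),
                    ("alpha-blended pixel", colour_write + z_traffic + alpha_read)]:
    needed_bps = PIXELS_PER_CLOCK * CORE_HZ * bits / 8
    print(f"{label:20s} {bits:.0f} bits -> ~{needed_bps / 1e9:.0f} GB/s needed "
          f"(vs. {MEM_BPS / 1e9:.0f} GB/s available)")

Even with ideal Z compression, writing 8 opaque pixels every clock would already want roughly 24 GB/s of framebuffer traffic alone, which is the point about 8 pix per clock being rare in practice.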

Secondly, not all pixels take a single clock to calculate. If there is a second texture in use, it will take two clocks. Many of today's games will use up to 5-6 textures at once. If anisotropic filtering is in use, it can take up to eight clocks to produce a single pixel.

True, but in this case 4x2 would have saved die space. Also, how many of today's games use 5-6 textures at once? Today's games rarely use more than two, and a lot even use just one for the majority of the pixels.

Your point for aniso is valid, though. NV30 should be quite good with aniso since it is likely going to be waiting for memory most of the time.

Whichever way you slice it, anisotropic filtering will never take more of a memory bandwidth hit than it takes for a fillrate hit. Given that the current GeForce4s are generally not very limited in memory bandwidth at 2x FSAA, you can expect that the GeForce FX will not be limited with no FSAA.

I think you are completely wrong about the Geforce4 not being bandwidth limited with 2xFSAA. If you were right, 2xFSAA would hardly have any performance hit. Also, the Geforce4's FSAA scores nearly halve going from 2xFSAA to 4xFSAA, unless you are CPU/T&L bound. 4xFSAA doubles the Z and colour buffer bandwidth compared to 2xFSAA, so the bandwidth requirements are nearly doubled. Put those two facts together and both 2xFSAA and 4xFSAA must be saturating the memory bandwidth: if there were excess bandwidth at 2xFSAA, performance wouldn't drop almost in proportion to the bandwidth requirements.
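As a small sketch of that scaling argument (the per-sample byte counts are illustrative, not measured Geforce4 figures, and no framebuffer compression is assumed):

# Why near-halving from 2xFSAA to 4xFSAA points to bandwidth saturation.
# Illustrative per-sample framebuffer traffic: 4 bytes colour + 4 bytes Z,
# with no framebuffer compression assumed.

def fb_bytes_per_pixel(samples):
    return samples * (4 + 4)   # colour + Z per sample

for aa in (2, 4):
    print(f"{aa}xFSAA: ~{fb_bytes_per_pixel(aa)} bytes of framebuffer traffic per pixel")

# 4x needs roughly double the framebuffer traffic of 2x.  If the frame rate
# also drops by roughly half going from 2x to 4x, performance is tracking
# bandwidth almost 1:1 -- which you'd only expect if 2x already saturates it.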
 
alexsok said:
Wasn't a similar comment made by ATI to Tomshardware on Comdex?

Yes, but the comment (I don't read THG, but saw it elsewhere) was made by ATI in response to a question put to them by various sources, which was in turn based on the ridiculous conclusion the Inquirer reported the night before the launch--that nv30 would have up to 48-gigs/sec "effective bandwidth."

The only thing I've been able to find from nVidia is a claim of lossless compression--I've yet to see anything from nVidia claiming "48 gigs/sec effective bandwidth."... I think that entire line of BS was engineered by the guy at the Inquirer who so obviously loathes ATI and loves nVidia. (I think this was the same guy who ran the "ATI admits it is having AGP x8 problems with the 9700 Pro" story, in which he linked to an ATI web page as his "source." Nowhere on the page the article links to, however, does ATI ever admit that its technology in the 9700 is at fault--a direct contradiction of the body of the article.)

Apparently, he couldn't stand it when he found out that the physical bandwidth of nv30 would be less than that of 9700 Pro, so he cooked up the "effective bandwidth" numbers based on nVidia's descriptions of its compression ratios. It must be embarrassing for companies when the zealots who love them use public platforms to make all kinds of ridiculous statements the hardware companies can never back up...;) Whew!

I thought those "effective bandwidth" numbers were *so funny*... I mean, if this was what nVidia actually claims (which certainly doesn't seem to be the case) then why go to the trouble of slapping on 1GHz DDR II? *heh-heh*--If you have an "effective bandwidth" of 48-gigs/sec all you'd need would be the cheapest 20ns SDR SDRAM you could buy because the physical bandwidth wouldn't even matter--*chuckle*...They don't call it the "Inquirer" for nothing, I guess...
 
I think that entire line of BS was engineered by the guy at the Inquirer
Actually, the 48gb/s number came from an NVIDIA paper that few people were able to get their hands on.

Though the 174.6gb/s number is way too weird, to say the least...
 
Yeah, but even NVidia said that compression is not very significant without AA
Yes, if you take a closer look at that statement, it means that the NV30 will only be substantially bandwidth limited when doing FSAA. If it were significantly bandwidth limited under non-FSAA modes, then the colour compression would make a large impact on overall performance. Now, it's certainly true that under FSAA modes you have to transfer a lot more colour information, but there should still be a large impact on non-FSAA performance. Then you can also add Z-compression, a more efficient culling engine, and I assume that if they can compress colour, alpha values can be compressed using the same method. Then, with further memory efficiency algorithms that we aren't yet aware of, it should actually be able to achieve a fairly high "effective" bandwidth. Would nVidia really be so dumb as to spend $400,000,000 developing a chip that is supposed to be the "big one" and leave it so severely bandwidth limited when it would be simple enough to slap on a 256-bit bus? Come on, at least the old 3Dfx engineers would be smart enough not to do that :p
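For what it's worth, an "effective bandwidth" figure of that sort is just the physical bandwidth scaled by an assumed average compression/efficiency ratio; a toy sketch (the 16 GB/s and 3:1 numbers are assumptions picked only because they reproduce the rumoured 48 GB/s figure, nothing official):

# Toy "effective bandwidth" arithmetic: physical bandwidth times an assumed
# average compression ratio.  Both numbers below are illustrative assumptions.
physical_gbps = 16.0            # assumed physical memory bandwidth
assumed_avg_compression = 3.0   # assumed average lossless compression ratio
print(f"effective ~= {physical_gbps * assumed_avg_compression:.0f} GB/s")

The catch, of course, is that the average ratio you actually get depends entirely on the workload, which is exactly what the FSAA/non-FSAA argument above is about.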
 
Sage said:
Yeah, but even NVidia said that compression is not very significant without AA
Yes, if you take a closer look at that statement, it means that the NV30 will only be substantially bandwidth limited when doing FSAA. If it were significantly bandwidth limited under non-FSAA modes, then the colour compression would make a large impact on overall performance.

Ummm, NO.
You don't understand how color compression works (lossless). The deal is, without FSAA, compression gets very, very poor ratios. I.e., it doesn't help much. Not because of some pseudo-logical thought process you came up with, but because there are no duplicate color samples "near" each other to be compressed.
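A toy illustration of the duplicate-sample point (this is only a conceptual model; the actual compression scheme used by any of these chips isn't public):

# Toy model of lossless colour compression: store each distinct value once.
# Away from triangle edges, all FSAA samples within a pixel share one colour,
# so they collapse well; neighbouring pixels without FSAA usually don't.

def compressed_size(values):
    # Trivial "keep one copy of each distinct value" model.
    return len(set(values))

no_fsaa_tile   = ["c0", "c1", "c2", "c3"]   # 4 neighbouring pixels, all different
fsaa4_interior = ["c0"] * 4                 # 4 samples of one interior pixel

print("no FSAA, 4 pixels:", len(no_fsaa_tile), "->", compressed_size(no_fsaa_tile))
print("4xFSAA, 1 pixel:  ", len(fsaa4_interior), "->", compressed_size(fsaa4_interior))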
 
Sage said:
Would nVidia really be so dumb as to spend $400,000,000 developing a chip that is supposed to be the "big one" and leave it so severely bandwidth limited when it would be simple enough to slap on a 256-bit bus?

NVidia is claiming that longer shaders will reduce the bandwidth requirements, but longer shaders output pixels at a rate of one every few clocks rather than 8 per clock. Thus a 4x2 pipeline (with double the shading speed in each pipe) would be much smarter.

There's really only one reason I can see for NVidia providing so little bandwidth to their pipelines: DOOM 3.

Doom 3 does Z/stencil-only passes, and with compression (although I don't know how well stencil compression works) you can easily output 8 pixels per clock, even with only 32 bits of memory access per pipe per cycle.
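A quick check of that arithmetic (assuming 8 pipes, a 500 MHz core and 16 GB/s of bandwidth, all of which are illustrative assumptions; a Z-only pass has no texture traffic to account for):

# Z/stencil-only pass: how much memory access is available per pipe per clock?
# Assumed, illustrative figures: 8 pipes, 500 MHz core, 16 GB/s of bandwidth.
PIPES, CORE_HZ, MEM_BPS = 8, 500e6, 16e9

bits_per_pipe_per_clock = MEM_BPS * 8 / (PIPES * CORE_HZ)
print(f"~{bits_per_pipe_per_clock:.0f} bits of memory access per pipe per clock")

# A Z write (plus the occasional stencil update) fits in roughly 32 bits even
# before compression, so sustaining 8 Z-only pixels per clock looks plausible.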

If Doom3 and its engine are a HUGE success, then NVidia made an okay decision, although I would have just expanded their already existing 4 Z-check units per pipe to include stencil ops and the ability to write Z-values, and left it at 4x2. If they aren't a success, then I don't like the idea of having a 125M transistor chip waiting so much for the memory, and yes, I would say NVidia is dumb.
 
Mintmaster said:
NVidia is claiming that longer shaders will reduce the bandwidth requirements, but longer shaders output pixels at a rate of one every few clocks rather than 8 per clock. Thus a 4x2 pipeline (with double the shading speed in each pipe) would be much smarter.

Yes, but I would like to add that longer shaders might or might not reduce bandwidth requirements. It really depends on the specific shader: does it need to fetch data from cache/memory on [almost] every op/clock, or doesn't it?

Some advanced shaders might very well require few data sets (e.g. texels) and will mainly be a long series of math ops done on the initial data. In that case the GeForce FX's high clock rate is great, but otherwise I guess NV30 and R300 will perform close to each other. IMHO.
 
Why do you think 4x2 would be so much easier to implement than 8x1?

The biggest reason why GPUs have increased in performance so much faster than CPUs is that a GPU can do it by increasing parallelism, while CPUs are stuck with increasing clock and IPC for their sequential problems.

4x2 was likely better earlier, when the pipes were more fixed (old fixed function or register combiners) and the arithmetic was short fixed-point. As precision rises (and the FPUs become a larger percentage of the gates), and the PS becomes more like a primitive CPU, the advantages of doing Nx1 grow, and the disadvantages are reduced.


And do you think DOOM3 engine games are the only ones that are going to do stencil/z-only passes? The z-only passes will become more important as pixel shaders get more advanced. They're also good for geometry culling operations.

Stencil buffers should, btw, compress really well, even without FSAA. At least there's great compression potential in them, since they usually contain large areas with identical values.
 
To me it would seem that the 8x1 setup would be substantially faster than 4x2 for long shaders, especially if the shaders consist of long sequences of dependent instructions. Also, for complexity, 8x1 may actually be simpler than 4x2 because there is less pressure to extract instruction level parallelism within a given pipe.
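A toy throughput model of that argument (purely illustrative; it ignores texture latency, co-issue restrictions and everything else real hardware has to deal with):

# Toy throughput model: pixels per clock for 8x1 vs. 4x2 on a shader made of
# `length` arithmetic instructions.  With a fully dependent chain, the second
# unit in each 4x2 pipe has nothing to co-issue, so only one unit does useful work.

def pixels_per_clock(pipes, units_per_pipe, length, dependent=True):
    usable_units = 1 if dependent else units_per_pipe
    clocks_per_pixel = -(-length // usable_units)   # ceiling division
    return pipes / clocks_per_pixel

for length in (1, 4, 16):
    print(f"shader length {length:2d}: "
          f"8x1 = {pixels_per_clock(8, 1, length):.2f}, "
          f"4x2 dependent = {pixels_per_clock(4, 2, length):.2f}, "
          f"4x2 parallel = {pixels_per_clock(4, 2, length, dependent=False):.2f} pix/clk")

In this simple model, 8x1 never does worse than a 4x2 with the same total number of units, and it pulls ahead exactly when the instructions form a dependent chain.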
 
OpenGL guy said:
You've got it backwards. What's the difference between point-sampling and bilinear filtering? More texture samples. How about bilinear and trilinear? More texture samples. How about trilinear and anisotropic? Potentially more texture samples. If you had an architecture that could compute anisotropic texels in a single cycle, what would the difference between point-sampling and anisotropic be? More memory bandwidth.

There's no architecture that exists today that can produce its maximum supported degree of anisotropy in a single pixel pipeline and clock. If one arrives, then you can say that anisotropic filtering has primarily a memory bandwidth hit. Now, it has primarily a fillrate hit.
 
Mintmaster said:
Yeah, but even NVidia said that compression is not very significant without AA. With AA, compression only reduces the increase. When you add everything together, even with fairly ideal 4:1 Z compression you need 48 bits written per normal 3D pixel. Alpha adds another 32-bits, and textures can also be significant although compression helps a lot. Finally, getting 100% out of your memory controller is nearly impossible. There will be very isolated circumstances where NVidia will be able to achieve 8 pix per clock.

You forgot to include that it isn't going to be overly common for a pixel to actually be written each and every clock on the GeForce FX. For DOOM3, for example, one pixel per pipe will be written each clock (assuming perfect efficiency in memory bandwidth, etc...) only when doing the initial z-only pass, which will take very little memory bandwidth. For this game, essentially every other pixel will have many textures applied, meaning it will take many clocks to calculate.

In other words, what you're describing is only a problem if what is being written is single-textured trilinear-filtered polygons (no anisotropic). This is just not the case today, and I see no reason for it to be the case often in the future.

True, but in this case 4x2 would have saved die space. Also, how many of today's games use 5-6 textures at once? Today's games rarely use more than two, and a lot even use just one for the majority of the pixels.

But 4x2 would have been less efficient, primarily for DOOM3 (or any game in the future that will do an initial z pass).

And most games today use at least two textures per pass, with many of the more recent ones using far more (Serious Sam, UT2K3, for example).

I think you are completely wrong about the Geforce4 not being bandwidth limited with 2xFSAA. If you were right, 2xFSAA would hardly have any performance hit.

It doesn't.

Also, the Geforce4's FSAA scores nearly halve going from 2xFSAA to 4xFSAA, unless you are CPU/T&L bound. 4xFSAA doubles the Z and colour buffer bandwidth compared to 2xFSAA, so the bandwidth requirements are nearly doubled. Put those two facts together and both 2xFSAA and 4xFSAA must be saturating the memory bandwidth: if there were excess bandwidth at 2xFSAA, performance wouldn't drop almost in proportion to the bandwidth requirements.

Okay, so the GeForce4 begins to be bandwidth limited at 2x FSAA. The point still stands. For any game that uses more than a single texture per pixel, the GeForce FX will be no less efficient in using its memory bandwidth than the GeForce4. And with the improved compression with FSAA, it will be quite a bit more efficient.
 
Chalnoth said:
There's no architecture that exists today that can produce its maximum supported degree of anisotropy in a single pixel pipeline and clock. If one arrives, then you can say that anisotropic filtering has primarily a memory bandwidth hit. Now, it has primarily a fillrate hit.
What you said before was:
Whichever way you slice it, anisotropic filtering will never take more of a memory bandwidth hit than it takes for a fillrate hit.
And I showed that you were incorrect.
 
Not entirely true. It's incredibly unfeasible to ever produce an architecture that will be capable of producing its maximum level of anisotropy in a single clock in each pixel pipeline. There will just be too much wasted processing power on those clocks that don't need the power.

So, I still think it will never happen. But yes, it is fundamentally possible, just not feasible.
 
Mintmaster said:
If Doom3 and its engine are a HUGE success, then NVidia made an okay decision, although I would have just expanded their already existing 4 Z-check units per pipe to include stencil ops and the ability to write Z-values, and left it at 4x2. If they aren't a success, then I don't like the idea of having a 125M transistor chip waiting so much for the memory, and yes, I would say NVidia is dumb.

This seems like a really dumb move to me, because at best id engines account for 50% of the FPS market, with the rest going to other engine developers such as Epic and Monolith, and of course in-house engines.

And that's just the FPS games market, never mind flight sims, racing games, etc. So if Nvidia is optimizing their cards for a single engine made by a single developer, even one with some influence, they are being really stupid.
 
Chalnoth said:
Not entirely true. It's incredibly unfeasible to ever produce an architecture that will be capable of producing its maximum level of anisotropy in a single clock in each pixel pipeline.
I'm sure they said the same thing about trilinear filtering a few years back.
 