Radeon 9700 NDA Lifted

Just for the record:

In the case of 6x FSAA, the Z-compression (as well as color compression) can be up to 1:24, since FSAA makes use of the frame- and Z-buffer blocks as well, and the six corresponding sample blocks are compressed as one.

ATi is storing the multiple frame samples in the same way as I described for the Z-buffer (Hyper-Z III). Dividing the frame samples into 8x8 pixel blocks allows lossless compression of the frame-buffer as well as Z-buffer blocks across all the different samples (up to 1:24 compression in case all pixel samples carry the same color and Z-value). This technique saves a significant amount of memory bandwidth, which is the bottleneck in FSAA. It ensures that the performance impact of FSAA is significantly lower than what we see in other implementations.

Source: http://www17.tomshardware.com/graphic/02q3/020718/radeon9700-08.html
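
Back-of-envelope, that 1:24 figure works out if you assume the per-block lossless compression tops out around 4:1 and all six samples of a pixel are identical - both assumptions mine, not published specs. A quick Python sketch:

SAMPLES = 6       # 6x FSAA: six sample blocks per 8x8 frame-buffer block
BASE_RATIO = 4    # assumed best-case per-block lossless compression

# If all six sample blocks are identical, they can be stored as one,
# and that single block still compresses at the base ratio:
print(f"best case: 1:{SAMPLES * BASE_RATIO}")   # 1:24, matching the quote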
 
There is simply no point in scaling to 256 VPUs on a single board. If I wanted to do a renderfarm, I would simply put a bunch of them on separate cards and use software to coordinate the rendering. There are already pre-built RISC boxes that work like PostScript printers: you slap them onto your network and push Renderman RIB files to them for rendering. Your renderfarm management software slices and dices the scenes and sends the jobs off to the various boxes.

Trying to put 256 chips on a single card behind a single AGP bus is insanity. It won't work because of heat, size, and power density requirements (25W * 256 = 6400 watts per "board"!), and it still won't get you A Bug's Life in real time. But if you want to accelerate offline rendering, then 3 or 4 on different PCI cards in cheap Linux/Intel boxes would make sense.
 
Trying to put 256 chips on a single card behind a single AGP bus is insanity. It won't work because of heat, size, and power density requirements (25W * 256 = 6400 watts per "board"!), and it still won't get you A Bug's Life in real time.

I'm not sure they meant that as a serious statement, merely an example of what could be done. If such a unit were ever needed, it would likely be similar in arrangement to the Quantum3D AAlchemy units.
 
Mintmaster said:
Now, forgetting about Tom's mostly incorrect statement, ATI's decision to use one texture unit per pipe is definitely worthy of discussion, especially considering Matrox's move to use four in Parhelia. There are several aspects to consider:

Pixel Shaders:
If you are doing more than one math operation per texture (unlike the simple blending that prevailed before shaders came around), then only one texture unit per pipe is needed. Dependent texture reads need another cycle anyway, again supporting the idea of 1 texture unit per pipe (see RV250 vs. R200 in the 3DMark2001 pixel shader tests).

Stencil Shadows:
I believe Carmack's Doom3 is going to be a very widely used game engine in the future, and single-texturing fill rate is very important for the stenciled shadows. RV250 shouldn't suffer much in comparison to R200, so R300 will also be okay.

Current games not based on Q3 engine:
Games often use single texturing for many effects. Look at Serious Sam: again there is not a very big difference between RV250 and R200, so there is probably a lot of single texturing going on.

Good points there. Two years back people were clamouring for more texture units per pipe (like: "Yeah, 4 textures in a single cycle - wow man!"), but multi-texturing is being replaced by the ability to apply more textures per pass (not so much per cycle).

The demand has changed since Quake III. Doom III is a perfect example: the chip has to juggle on the fly between single-textured stenciled shadows and multi-pass, texture-heavy rendering. And as I said before: with 8 pipes and loads of textures in future games, we do not have an unlimited amount of texture fetch access anyway.

(Of course, if you have extra silicon to burn, then sure; go ahead and add more texture units. But I'm not sure they would pay off big time right now.)
 
I have been watching the ATi press launch, and the ATi Radeon 9700 was developed by the 'ATi Nintendo team.'
 
NVIDIA must be downsampling the frame buffer before mapping alpha textures if they are having trouble with AAing them. I don't see what other problem they could have.

Frame buffer compression looks like a good match for MSAA, sort of a brute-force approach to FAA that works for polygon intersections. Too bad it doesn't appear to support AA higher than 6X.

I'm a little puzzled by the TMU/pipe configuration. They must be able to get 1 trilinear pixel/pipe/clock. Otherwise it strikes me as a useless decision since no one is running bilinear anymore. In fact, AFAIC 4 TMUs per pipe is fine, just use them all for AF.
 
Jerry Cornelius said:
NVIDIA must be downsampling the frame buffer before mapping alpha textures if they are having trouble with AAing them. I don't see what other problem they could have.
Incorrect.

Generally with multisample AA, you get the following problem with alpha textures: you take one texture sample, but you take N depth samples. If the texture sample passes the alpha test, all of the N samples will. Similarly, if the texture sample fails the alpha test, all will fail. Because of this behavior, you won't get antialiasing on the internal "edges": there is no interpolation between passing and failing texels.
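
Roughly, in Python-flavoured pseudocode (the names and structure here are mine, not any actual GPU pipeline):

N = 4  # samples per pixel

def shade_pixel(coverage, depths, frag_depth, tex_alpha, alpha_ref):
    """coverage/depths are per-sample; tex_alpha was sampled once."""
    if tex_alpha < alpha_ref:       # alpha test on the single texture
        return [False] * N          # sample: the whole pixel fails at once
    # Only coverage and depth vary per sample, which is why triangle
    # edges and intersections still get antialiased:
    return [coverage[i] and frag_depth < depths[i] for i in range(N)]

# A failing alpha-tested texel is rejected for all samples at once, so
# the "edge" inside the texture never gets partial coverage:
print(shade_pixel([True] * N, [1.0] * N, 0.5, tex_alpha=0.3, alpha_ref=0.5))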
 
Are you saying that if the alpha test succeeds, all sub-samples are given the green flag regardless of their individual depth tests? Normal filtering should work fine for the "alpha edges" within the texture.
 
Jerry Cornelius said:
Are you saying that if the alpha test succeeds, all sub-samples are given the green flag regardless of their individual depth tests? Normal filtering should work fine for the "alpha edges" within the texture.

I don't think that's the case. However, the pixels would get rejected if the alpha test fails, regardless of the z test. Which is the opposite of what you said. Maybe normal filtering just isn't good enough to match the quality of the AA in the rest of the scene.
 
Jerry Cornelius said:
Are you saying that if the alpha test succeeds, all sub-samples are given the green flag regardless of their individual depth tests? Normal filtering should work fine for the "alpha edges" within the texture.
No, depth test still matters. But that only helps with intersecting triangles.

Filtering does work fine if you use alpha blending. But then you need to depth-sort those triangles to avoid artifacts, and it's slower.

If you only use alpha test, it's either "all samples pass" or "all samples fail" since they use the same color/alpha value.
Only the triangle edges (coverage mask) and intersection edges (depth test) get AAed with multisampling.
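
For contrast, here is why alpha blending does antialias those interior edges (the standard "over" blend, sorting cost aside):

def blend(dst, src, src_alpha):
    return src_alpha * src + (1.0 - src_alpha) * dst

print(blend(0.0, 1.0, 0.3))  # a 30%-alpha texel contributes 30% - smooth falloff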
 

fanATIVdiot wrote:

Not that I would typically use THG as a definitive source for information, however they do seem to have a very well-written rationale for ATI's possible decision for 1 TU / pipeline:

It might look as if one texture unit per pipeline is very little, but if you calculate the memory bandwidth requirement of eight parallel pipes with one texture unit doing a trilinear 32-bit color texture lookup, you will understand why two texture units wouldn't have made an awful lot of sense: 32 bit * 8 (trilinear filtering requires 8 texels to be read) * 8 (eight pipelines) = 2048 bit. 2048 bit would have to be read per clock, but 'only' 512 bit per clock are provided by the 256-bit-wide DDR memory interface of Radeon 9700. Bilinear filtering mode would still require 1024 bit per clock. Two texture units per pipe could never be fed by the memory interface.


http://www.tomshardware.com/graphic/02q3/020718/radeon9700-07.html


First, Tom's calculation doesn't take into account texture caching, which reduces texture bandwidth IMMENSELY. Today's good GPUs hardly ever have to read a texture sample from memory twice when drawing a polygon, except when tiling a texture. Still, locally speaking, that holds about true.

Consider single-texturing. When minification is happening, bilinear filtering has texture bandwidth requirements of 32 bits per pixel maximum. Trilinear requires about 40 bits per pixel max, because one mip map is always 1/4 the resolution - however, since it requires 2 clocks to do the trilinear filtering (assuming 2 mipmaps are used instead of 1), that's only 20 bits per pixel per clock.
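
Putting those numbers next to Tom's no-cache worst case (a sketch of the reasoning above, not measured data):

bpp = 32
naive_trilinear = bpp * 8        # Tom's figure: 8 fresh texels per pixel
bilinear_max = bpp               # ~1 new texel/pixel once the cache absorbs
                                 # the overlapping bilinear footprints
trilinear_max = bpp + bpp // 4   # the second mip level is 1/4 the area
per_clock = trilinear_max / 2    # trilinear takes 2 clocks on one TMU

print(naive_trilinear, bilinear_max, trilinear_max, per_clock)
# 256 vs 32 / 40 / 20.0 bits - caching changes the picture entirely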

Remember, these are max figures, too. Increasing LOD bias lowers this, as does looking at oblique angles. When textures are closer to the camera, magnification spreads the textures over more pixels, reducing this much more (3DMark2K1 has almost negligible texture bandwidth requirements for this reason - I'm talking only a few bits per pixel).

Most GPUs, including GF2, GF3, GF4, Radeon 8500, and R300, have about 64 bits of bandwidth per pixel per clock (give or take). You need 32 bits for the colour buffer write, and both Z reads and writes are necessary. With Z-compression, this is 16-64 bits per pixel, depending on compression (avg of 32 maybe?). This leaves only a little for texture bandwidth, but again, texture bandwidth is not near as bad as Tom says it is. From here, the greater the texture bandwidth, the lower the efficiency. Alpha textures are a bit different, needing a Z-read and both a colour read and write (~80 bits/pix + texture bandwidth).
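
The same budget as a quick calculation (the 64-bit total and the averages are the estimates above, not spec-sheet numbers):

budget = 64            # bits of bandwidth per pixel per clock
color_write = 32
z_traffic = 32         # Z read + write; 16-64 bits after compression

print(budget - color_write - z_traffic)   # ~0 bits left over for textures

# Alpha textures: Z read + colour read + colour write
print(16 + 32 + 32)    # ~80 bits/pixel before any texture traffic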

Generally, a second texture unit will help out a lot in multitexturing, because texture bandwidth is usually quite low. Some parts of the screen are bandwidth limited, so the performance gain isn't 100%, but it's still significant. Just look at RV250 vs. R200 in Quake 3 or Jedi Knight. The difference is quite noticeable.

- You are making a bunch of assumptions here.
- Texture caching and bandwidth requirements are two separate things. Sure, texture caching helps if you have a lot of reuse... but don't forget you need to fetch the data at least once into the cache.
- Your comments apply to games that are not texture bandwidth limited; i.e., they probably use a lot of low-res textures or compressed textures... most last-generation games (including Quake 3). Something like UT2003 or Doom3 is a completely different story.
- Again, the issue is not whether you read a texture sample from memory more than once... just compute the bandwidth for reading a 32-bit texture sample. Assume a multitextured background in a game where you have a 1024x1024 texture mapped onto, say, a 16x12 rectangle. Do the math...
 
- Your comments apply to games that are not texture bandwidth limited; i.e., they probably use a lot of low-res textures or compressed textures... most last-generation games (including Quake 3). Something like UT2003 or Doom3 is a completely different story.

Don't knock compressed textures. DXTC is a stunning technology. The best thing about it is that you can go from 256x256 to 512x512, take up half the space in VRAM of a 32-bit original, and STILL get double the performance (because of improved fetch, tiling and cache efficiency). And in these cases it looks better 99% of the time (i.e. in all those that aren't things like the sky in Q3).
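
The space claim in numbers (DXT1 stores 4 bits per texel, versus 32 for an uncompressed texture):

orig = 256 * 256 * 32 // 8   # 256x256 at 32 bpp -> 262,144 bytes
dxt1 = 512 * 512 * 4 // 8    # 512x512 in DXT1   -> 131,072 bytes

print(dxt1 / orig)           # 0.5: double the resolution in half the VRAM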

I've never seen anyone complain about texture compression if, when it is enabled, the resolution of all textures is doubled in this manner, although NVIDIA's DXT1 problem is a bit of a downer sometimes.

Note that Epic were one of the first big users of compressed textures in Unreal Tournament's add-on pack, and the textures they improved in this way look stunning. Nowadays I find playing UT without the enhanced textures quite strange.
 
Trying to put 256 chips on a single card behind a single AGP bus is insanity. It won't work because of heat, size, and power density requirements (25W * 256 = 6400 watts per "board"!)

Sounds like an "upper management" type of solution, marketed by "upper management" thinking PR people. They needed to appeal to these people as well. Sshhhh. :)
 
croc_mak said:
- Again, the issue is not whether you read a texture sample from memory more than once... just compute the bandwidth for reading a 32-bit texture sample. Assume a multitextured background in a game where you have a 1024x1024 texture mapped onto, say, a 16x12 rectangle. Do the math...
Huh? Ever heard of mip mapping?
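
A simplified, isotropic LOD calculation shows why (illustrative only - real hardware LOD selection is per-pixel and anisotropy-aware):

import math

tex_size, screen_w = 1024, 16
lod = math.log2(tex_size / screen_w)   # = 6.0
level_size = tex_size >> int(lod)      # 1024 / 2**6 = 16

print(lod, level_size)  # mip level 6 is 16x16: roughly 1 texel fetch per pixel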
 