Radeon 9700 NDA Lifted

CMKRNL said:
The one advantage that NV30 will have over R300 is the programmable triangle tessellation unit. It's a fantastic feature but unfortunately one that will not be taken advantage of by game developers in its product lifecycle. (Heck, games are just now starting to take advantage of shaders which have been around since NV20!) Still, it's nice to see this type of innovation because it paves the way for making this a standard feature in future designs.

But the R300 has conditional "JMP" in vertex shader programming, and I presume it can perform tessellation as part of VS 2.0... is that what you meant by "programmable triangle tessellation unit"? If it isn't, please enlighten... I thought NV30's advantage was perhaps something like conditional branching in pixel shaders.
 
DaveBaumann said:
Thus, the R300 is already very efficient Hyper-Z-, anisotropic-, and MSAA-wise, and it has a 256-bit crossbar bus. This means NVIDIA's IMR efficiency won't help them, because ATI now has everything the GF4 has in terms of efficiency tricks.

Well, HyperZ III sounds remarkably unchanged, so if NVIDIA has some more efficient occlusion-culling routines, as has been suggested, there is also room to gain here.

I thought "Early Z" was a significant change...one preview implies extremely high efficiency of hidden pixel rejection due to this new feature. I'd be interested to hear your take on it.
 
To be able to do all of the tessellation in the VS (not just coordinate calculation after subdivision along the u/v axis) you would need to be able to generate multiple vertices inside a vertex shader program. Otherwise you are just stuck with the fixed-function tessellation units (which, AFAIK, could very well be all that the NV30 will have too, BTW; I have no idea).
 
WOW! :eek:

This card is truly more than I expected; it is the undisputed champion in all classes, and you don't even have to make up scenarios to make it look good (unlike Parhelia)! Unfortunately I won't buy a new card for my home PCs until next spring... that is... unless one of my current ones breaks! <takes a peek at the hammer lying next to his mouse>
Ah hell, we'll probably have a number of juicy choices by then, hopefully even more than an R300 (maybe a .13 refresh, however unlikely considering ATI's product-cycle record; maybe they'll switch to 6 months just once to screw Nvidia, hehe?) or NV30! :D

Can't wait to see the more detailed reviews coming up over the next few weeks; hopefully some competent websites will receive review units (please let B3D be among them)! I want FSAA and aniso comparisons, and hopefully someone will manage to point out the added IQ coming from the gamma correction and 128-bit floating-point color precision, which should do wonders for AA and for some of the banding apparent even in 32-bit.

I have a hunch ATI will offer some *serious* workstation models of the R300 somewhere down the road! Multichip bad boys, hopefully with excellent driver support for CAD and CGI applications (please, beyond the usual MAX support; Maya and LightWave would be nice). That might just be the one thing needed to finally make me a fanATIc! I'm hopefully going to be building a new workstation rig here in the office sometime this winter; maybe if I do some begging... ;)

The only question left is: what about Linux support? With Linux gaining more and more popularity in the CGI industry, are they finally working on a fully functional Linux driver set?

To the Board admins: Add more emoticons please, I need more choice to express my excitement! How about an Airguitar!? :D
 
LeStoffer said:
DaveBaumann said:
Here's a key part:

It is not yet clear if the omission of a second texture unit per pixel rendering pipeline was indeed a smart choice. Under multi texturing conditions, GeForce4Ti4600 is theoretically able to supply just the same amount of pixels per clock as Radeon 9700. The fill rate test of 3DMark2001 SE hinted in the same direction. The score of Radeon 9700 is very close to the result of GeForce4Ti4600.

That'll explain the 8 pipelines. If they are working on a .13um part, I wonder if they will add that back in.

Good catch.

It's not clear to me, however, whether a single texture unit per pipeline might not turn out to be a reasonable standard on all vendors' PS 2.0 pipelines. I'm not sure that you'll really need more texture units with 8 pipelines and the advanced rendering of PS 2.0 (the number of texture inputs per pass goes up to 16, there are 32 address instructions, and new features like 4 render targets might complicate things like dependencies, etc.).

I guess a Voodoo2-style dual-texture ability just isn't that fancy anymore. ;)

Not that I would typically use THG as a definitive source of information; however, they do seem to have a well-written rationale for ATI's possible decision to go with 1 TU per pipeline:

It might look as if one texture unit per pipeline is very little, but if you calculate the memory bandwidth requirement of eight parallel pipes with one texture unit doing a trilinear 32-bit color texture lookup, you will understand why two texture units wouldn't have made an awful lot of sense: 32 bit * 8 (trilinear filtering requires 8 texels to be read) * 8 (eight pipelines) = 2048 bit. 2048 bit would have to be read per clock, but 'only' 512 bit per clock are provided by the 256-bit-wide DDR memory interface of Radeon 9700. Bilinear filtering mode would still require 1024 bit per clock. Two texture units per pipe could never be fed by the memory interface.


http://www.tomshardware.com/graphic/02q3/020718/radeon9700-07.html
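Tom's arithmetic is easy to reproduce. Here it is as a quick C calculation of the worst case, deliberately ignoring the texture cache just as the quote does:

#include <stdio.h>

int main(void)
{
    const int texel_bits  = 32;        /* 32-bit colour texels            */
    const int tri_texels  = 8;         /* trilinear reads 8 texels/pixel  */
    const int bi_texels   = 4;         /* bilinear reads 4 texels/pixel   */
    const int pipes       = 8;
    const int bus_per_clk = 256 * 2;   /* 256-bit DDR bus: 512 bits/clock */

    printf("trilinear demand: %d bits/clock\n", texel_bits * tri_texels * pipes); /* 2048 */
    printf("bilinear demand:  %d bits/clock\n", texel_bits * bi_texels * pipes);  /* 1024 */
    printf("bus supply:       %d bits/clock\n", bus_per_clk);                     /*  512 */
    return 0;
}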
 
demalion said:
But the R300 has conditional "JMP" in vertex shader programming, and I presume it can perform tessellation as part of VS 2.0... is that what you meant by "programmable triangle tessellation unit"? If it isn't, please enlighten...
No, VS2.0 is still "1 vertex in, 1 vertex out". Tessellation takes place before that. NV30 is likely to feature a "primitive processor", a programmable tessellation unit that should enable support for almost any kind of procedural geometry.
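To make the "1 vertex in, 1 vertex out" limitation concrete, here is a toy C sketch of one level of uniform triangle subdivision. The point is that a single input primitive produces new vertices and four output triangles, which a VS 2.0 program simply cannot do; a programmable unit would let you replace the fixed midpoint rule with an arbitrary program. This is my own illustration, not anything from NVIDIA's or ATI's documentation:

typedef struct { float x, y, z; } Vec3;
typedef struct { Vec3 v[3]; } Tri;

static Vec3 midpoint(Vec3 a, Vec3 b)
{
    Vec3 m = { (a.x + b.x) * 0.5f, (a.y + b.y) * 0.5f, (a.z + b.z) * 0.5f };
    return m;
}

/* One input triangle, four output triangles: 1-in, many-out. */
static void subdivide(const Tri *in, Tri out[4])
{
    Vec3 m01 = midpoint(in->v[0], in->v[1]);
    Vec3 m12 = midpoint(in->v[1], in->v[2]);
    Vec3 m20 = midpoint(in->v[2], in->v[0]);
    out[0] = (Tri){ { in->v[0], m01, m20 } };
    out[1] = (Tri){ { m01, in->v[1], m12 } };
    out[2] = (Tri){ { m20, m12, in->v[2] } };
    out[3] = (Tri){ { m01, m12, m20 } };
}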
 
I haven't yet read an explanation of why all the VS diagrams of R300 had a vec4 processor with a scalar processor in parallel.
 
MfA said:
Well they must have given up on MAXX then, because a 256 frame delay is not acceptable by any stretch of the imagination :) (Or they were just being facetious.)
There is no 256 frame delay with 256 chips...
 
DaveBaumann:
The scalar unit probably takes care of:
rcp r, s0.w    ; reciprocal
rsq r, s0.w    ; reciprocal square root
expp r, s0.w   ; exponential, partial precision
logp r, s0.w   ; logarithm, partial precision

These are the scalar instructions. They are more complex than the others, but work on only one element, so it could be worth having special hardware for them. With some luck it might be possible to run the units superscalar.

I think I've seen a similar split in a more detailed description of GF3/4, but I might be wrong.
 
fanATIVdiot said:
Not that I would typically use THG as a definitive source of information; however, they do seem to have a well-written rationale for ATI's possible decision to go with 1 TU per pipeline:

It might look as if one texture unit per pipeline is very little, but if you calculate the memory bandwidth requirement of eight parallel pipes with one texture unit doing a trilinear 32-bit color texture lookup, you will understand why two texture units wouldn't have made an awful lot of sense: 32 bit * 8 (trilinear filtering requires 8 texels to be read) * 8 (eight pipelines) = 2048 bit. 2048 bit would have to be read per clock, but 'only' 512 bit per clock are provided by the 256-bit-wide DDR memory interface of Radeon 9700. Bilinear filtering mode would still require 1024 bit per clock. Two texture units per pipe could never be fed by the memory interface.


http://www.tomshardware.com/graphic/02q3/020718/radeon9700-07.html

First, Tom's calculation doesn't take texture caching into account, which reduces texture bandwidth immensely. Today's good GPUs hardly ever have to read a texture sample from memory twice when drawing a polygon, except when tiling a texture. Still, locally speaking, the calculation holds roughly true.

Consider single-texturing. When minification is happening, bilinear filtering has texture bandwidth requirements of 32 bits per pixel maximum. Trilinear requires about 40 bits per pixel max, because one mip map is always 1/4 the resolution; however, since it takes 2 clocks to do the trilinear filtering (assuming 2 mipmaps are used instead of 1), that's only 20 bits per pixel per clock.

Remember, these are max figures, too. Increasing the LOD bias lowers this, as does viewing surfaces at oblique angles. When textures are closer to the camera, magnification spreads the texture over more pixels, reducing this much more (3DMark2001 has almost negligible texture bandwidth requirements for this reason; I'm talking only a few bits per pixel).

Most GPUs, including GF2, GF3, GF4, Radeon 8500, and R300, have about 64 bits of bandwidth per pixel per clock (give or take). You need 32 bits for the colour buffer write, and both Z reads and writes are necessary; with Z compression, that's 16-64 bits per pixel, depending on how well it compresses (an average of 32, maybe?). This leaves only a little for texture bandwidth, but again, texture bandwidth is not nearly as bad as Tom says it is. From here on, the greater the texture bandwidth demand, the lower the efficiency. Alpha textures are a bit different, needing a Z read and both a colour read and write (~80 bits/pixel + texture bandwidth).

Generally, a second texture unit will help out a lot in multitexturing, because texture bandwidth is usually quite low. Some parts of the screen are bandwidth limited, so the performance gain isn't 100%, but it's still significant. Just look at RV250 vs. R200 in Quake 3 or Jedi Knight; the difference is quite noticeable.
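To put rough numbers on that budget, here's the same back-of-the-envelope maths in a few lines of C. The figures are the assumptions from the paragraphs above (256-bit DDR bus, 8 pipes, 32-bit colour write, ~32 bits of compressed Z traffic), not measurements:

#include <stdio.h>

int main(void)
{
    const int bus_bits  = 256;                     /* memory bus width         */
    const int ddr       = 2;                       /* DDR: 2 transfers/clock   */
    const int pipes     = 8;
    const int budget    = bus_bits * ddr / pipes;  /* bits per pixel per clock */

    const int colour    = 32;                      /* colour buffer write      */
    const int z_traffic = 32;                      /* Z read+write, compressed (assumed average) */

    printf("budget:     %d bits/pixel/clock\n", budget);             /* 64 */
    printf("colour + Z: %d bits/pixel/clock\n", colour + z_traffic); /* 64 */
    printf("texture headroom: %d bits\n", budget - (colour + z_traffic));
    /* ~0 headroom: which is why extra texture traffic costs efficiency. */
    return 0;
}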
 
cybamerc said:
The R300 was clearly what Carmack was talking about when he mentioned upcoming multi-chip solutions. 256 units in parallel is pretty damn impressive.

Well, I would be pretty surprised if the NV30 didn't support something like this also.
 
OpenGL guy said:
MfA said:
Well they must have given up on MAXX then, because a 256 frame delay is not acceptable by any stretch of the imagination :) (Or they were just being facetious.)
There is no 256 frame delay with 256 chips...
I must admit that, strictly speaking, with frame interleaving you would have a 255-frame delay... or is that not what you meant? :)
 
DaveBaumann said:
I haven't yet read an explanation of why all the VS diagrams of R300 had a vec4 processor with a scalar processor in parallel.

According to one of the previews I read today, the R300 processing units are designed to perform a scalar and a vector operation at the same time in each pipeline.

OK, looked it up... I read it as part of HardOCP's Commented Whitepaper article.

HardOCP said:
Each vertex shader pipeline in the RADEON 9700 is designed to handle vector and scalar operations simultaneously. Vector operations work on values composed of multiple components, such as 3D co-ordinates (x,y & z components) and color (red, green, and blue components). Scalar operations work on values with just a single component. Since vertex shaders typically include a mixture of vector and scalar operations, this optimization can improve processing speed by up to 100%.
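HardOCP's "up to 100%" figure follows directly from the co-issue idea. A toy C model of it (my own simplification, not ATI's actual scheduler): if the vec4 unit and the scalar unit can each retire one op per clock in parallel, a shader with V vector and S scalar instructions needs max(V, S) clocks instead of V + S:

#include <stdio.h>

static int clocks_serial(int v, int s)   { return v + s; }          /* one unit at a time */
static int clocks_coissued(int v, int s) { return v > s ? v : s; }  /* paired issue       */

int main(void)
{
    int v = 10, s = 10;  /* an even vector/scalar mix: the best case */
    printf("serial: %d clocks, co-issued: %d clocks\n",
           clocks_serial(v, s), clocks_coissued(v, s));  /* 20 vs 10: the quoted 100% */
    return 0;
}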
 
And it's the first part with a $400 MSRP that is, imo, actually worthy of that price tag.

This places a reasonable constraint on Nvidia's pricing policy, depending on what NV30 brings to the table. ATI can be congratulated on that alone! NV30 is rumoured to have various AA, tessellation, and functional-unit tricks up its sleeve, however...
 
LeStoffer said:
DaveBaumann said:
Here's a key part:

It is not yet clear if the omission of a second texture unit per pixel rendering pipeline was indeed a smart choice. Under multi texturing conditions, GeForce4Ti4600 is theoretically able to supply just the same amount of pixels per clock as Radeon 9700. The fill rate test of 3DMark2001 SE hinted in the same direction. The score of Radeon 9700 is very close to the result of GeForce4Ti4600.

That'll explain the 8 pipelines. If they are working on a .13um part, I wonder if they will add that back in.

Good catch.

It's not clear to me, however, whether a single texture unit per pipeline might not turn out to be a reasonable standard on all vendors' PS 2.0 pipelines. I'm not sure that you'll really need more texture units with 8 pipelines and the advanced rendering of PS 2.0 (the number of texture inputs per pass goes up to 16, there are 32 address instructions, and new features like 4 render targets might complicate things like dependencies, etc.).

I guess a Voodoo2-style dual-texture ability just isn't that fancy anymore. ;)

Now, setting aside Tom's mostly incorrect statement, ATI's decision to use one texture unit per pipe is definitely worthy of discussion, especially considering Matrox's move to use four in Parhelia. There are several aspects to consider:

Pixel Shaders:
If you are doing more math operations than just one per texture (as was the case with simple blending before shaders came around), then only one texture unit per pipe is needed. Dependent texture reads need another cycle anyway, again supporting the idea of 1 texture unit per pipe (see RV250 vs. R200 in the 3DMark2001 pixel shader tests, and the little model sketched after this post).

Stencil Shadows:
Carmack's Doom 3 is going to be a very widely used game engine in the future, I believe, and single-texturing fill rate is very important for the stencil shadows. RV250 shouldn't suffer much in comparison to R200 there, so R300 will also be okay.

Current games not based on Q3 engine:
Games often use single texturing for many effects. Look at Serious Sam: again there is not a very big difference between RV250 and R200, so there is probably a lot of single-texturing going on.

Overall, it seems like ATI made a good decision, especially considering how over-the-top R300 already is for its manufacturing process. This was probably the smartest way to keep the die size in check with minimal performance cost.
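As promised above, here's a crude utilization model for the 1-TU question, entirely my own simplification. The idea: a pipe needs roughly max(texture clocks, ALU clocks) per pixel, so once a shader does more math than texturing, a second texture unit mostly sits idle:

#include <stdio.h>

/* Clocks per pixel = whichever unit is the bottleneck. */
static int clocks_per_pixel(int tex_ops, int alu_ops, int tex_units)
{
    int tex_clocks = (tex_ops + tex_units - 1) / tex_units;  /* ceiling division */
    return tex_clocks > alu_ops ? tex_clocks : alu_ops;
}

int main(void)
{
    /* A hypothetical PS 2.0-style shader: 4 texture reads, 12 ALU ops. */
    printf("1 TU/pipe: %d clocks/pixel\n", clocks_per_pixel(4, 12, 1));  /* 12        */
    printf("2 TU/pipe: %d clocks/pixel\n", clocks_per_pixel(4, 12, 2));  /* still 12  */

    /* Old-style multitexturing: 2 texture reads, 1 blend op. */
    printf("1 TU/pipe: %d clocks/pixel\n", clocks_per_pixel(2, 1, 1));   /* 2             */
    printf("2 TU/pipe: %d clocks/pixel\n", clocks_per_pixel(2, 1, 2));   /* 1: it pays    */
    return 0;
}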
 
MfA said:
I must admit that, strictly speaking, with frame interleaving you would have a 255-frame delay... or is that not what you meant? :)
Ok, I'll be more clear: There is no frame delay with 256 chips. Better? :)
 
Ailuros said:
I'm hoping for a comment on the SV II algorithm from the data so far. Anything?

From what Anand wrote:
ATI insists that it will be higher quality than NVIDIA’s implementation in situations where transparent textures are used. The best example of this is in the DM-Antalus level in UT2003; the gorgeous grass in the level is simply a high resolution texture that is alpha blended, so you can see what’s behind it. According to ATI, NVIDIA’s multisampling algorithm will merely ignore aliasing within these polygon edges due to their transparency whereas the R300 will not.
Part of this is wrong. Alpha-blended masked textures already have smooth edges (as long as you don't use point sampling).
What he really means is plain alpha-tested textures, which don't occur in UT2003 AFAIK.
But how does R300 take care of this? The simplest explanation, IMHO, would be that it uses supersampling when alpha test is enabled, because antialiasing alpha-tested edges requires an alpha value to be calculated per sample, and I don't think you can get that alpha value without going through the whole pipeline once per sample.
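A small C sketch of why, just my reading of the argument above rather than anything about R300's actual hardware. With multisampling one alpha value decides the whole pixel's coverage, so the cut-out edge stays hard; with supersampling each sample gets its own lookup and its own test:

#define SAMPLES 4

/* Stand-in for a per-sample texture fetch; a real GPU would run the whole
 * pipeline to produce this value, which is exactly the cost in question. */
static float sample_alpha(float u, float v) { return (u + v) * 0.5f; }

/* Multisampling: one alpha test per pixel; all samples pass or fail together. */
static unsigned coverage_msaa(float u, float v, float ref)
{
    return sample_alpha(u, v) > ref ? (1u << SAMPLES) - 1u : 0u;
}

/* Supersampling: one alpha test per sample; partial coverage smooths the edge. */
static unsigned coverage_ssaa(const float u[SAMPLES], const float v[SAMPLES], float ref)
{
    unsigned mask = 0;
    for (int i = 0; i < SAMPLES; i++)
        if (sample_alpha(u[i], v[i]) > ref)
            mask |= 1u << i;
    return mask;
}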
 
Anyone wonder if the decision to implement one texture unit per pixel pipe is based on technology acquired from ArtX, as this is also used in the Gamecube's Flipper? Or is it just a matter of die space? Could there be a possible benefit to this setup?
 