G80 vs R600 Part X: The Blunt & The Rich Feature

Would a test count in your book where the pure pixel shaders have been grabbed from a real game, run through a fillrate meter and then normalized relative to a given card?
Yeah, thanks, it'd be nice to have an example of a significantly ALU-limited game.

Got the shader code?

Jawed
 
How is that supposed to improve anything?
It depends on the amount of hardware that would need to be dedicated to decompression - decompression might just be a macro, a bit like fog is right now, I guess.

Jawed
 
You can't Load() from the current render target until the Colour+Z compression has been decoded by the RBEs. Only the RBEs have access to the tile tag tables in order to make sense of the data in memory.
So we know for sure that an MSAA rendertarget can be compressed? I seem to remember someone saying that render-to-texture is uncompressed, but I could be mistaken and/or thinking of a console.

If you're going to run a decompression pass (within the RBEs) and you're going to do bog-standard MSAA resolve, why ask the RBEs to write the decompressed data to a new target for the TUs to consume when it's possible to transmit the data directly to the register file?
Well, I wasn't sure that the RBEs were the things doing the MSAA resolve. If they were, it makes no sense to me that a "hardware" resolve is any faster than a software resolve. NVidia complained about this in their CoJ rant, and everyone said this was the reason R6xx was slow in AA. I figured that the only way this can really be true is if there was some parallel hardware to do the resolve, maybe near the display engine.

I guess the datapath between the RBEs and shader units is a reason.

Apparently RV770 (well 2 of them I guess) is doing real time wavelet decompression on the ALUs for the new Ruby demo.

Jawed
Well, there's a difference between realtime on a 1-2MPix screen and 60 GTex/s decompression (for two of them) without tying up any resources. ;)
 
Yeah, thanks, it'd be nice to have an example of a significantly ALU-limited game.

Got the shader code?

Jawed

At least that's what was grabbed from some sample scenes - maybe some games do generate new shaders on the fly or in different levels.

I'm not sure though, whether I'm legally allowed to make this stuff public. Maybe just a diagram with the achieved fillrates normalized to a 7900 GT would suffice?
 
So we know for sure that an MSAA rendertarget can be compressed?
Yep, even with fp16 samples:

Method and apparatus for anti-aliasing using floating point subpixel color values and compression of same

This appears to have been for R5xx.

In one embodiment, graphics processor comprises a Tile Format Table (TFT) to keep track of the compression format of the incoming files [sic]. In another embodiment, two TFTs are used, with one keeping track of the tiles stored in main memory and another keeping track of the tiles stored in cache memory. With the help of TFT, the graphics processor is able to process incoming tiles in their compressed formats, without the need of decompression. This speeds up overall system performance. In one embodiment, the TFT is also used to enable a "Fast Clear" operation where the graphics processor can quickly clear the cache memory. In another embodiment, the process performs resolve operations on sample colors. In these operations, the sample colors are combined to get the final color of the pixel. The compression schemes of the present invention also enable the processor to optimize the procedure of the resolve operation.
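To paraphrase the mechanism in code - a toy C sketch only, with made-up names and formats for illustration, not the actual hardware layout: each tile carries a small tag describing how its data is stored, so the RBEs can work on compressed tiles directly, and a "fast clear" only has to rewrite the tags.

#include <stdint.h>

enum tile_format {
    TILE_CLEARED,            /* "fast clear": no data in memory, use the clear colour */
    TILE_FULLY_COMPRESSED,   /* all samples in the tile share one colour */
    TILE_PARTLY_COMPRESSED,  /* some per-sample data, still compressed */
    TILE_UNCOMPRESSED        /* full per-sample data in memory */
};

struct tile_format_table {
    enum tile_format *tag;   /* one entry per tile of the render target */
    uint32_t num_tiles;
};

/* "Fast clear" amounts to rewriting the tags; tile memory is left untouched. */
static void fast_clear(struct tile_format_table *tft)
{
    for (uint32_t i = 0; i < tft->num_tiles; ++i)
        tft->tag[i] = TILE_CLEARED;
}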

I seem to remember someone saying that render-to-texture is uncompressed, but I could be mistaken and/or thinking of a console.
"Copy to texture", which would decompress the compression, is how I think it works. It doesn't make sense to have the MSAA compression hardware idling during rendering to a target that will later be used as a texture.

Well, I wasn't sure that the RBEs were the things doing the MSAA resolve.
In an ATI GPU, what other hardware would do this (bearing in mind that the actual averaging has been performed by the ALUs since R600)?

If they were, it makes no sense to me that a "hardware" resolve is any faster than a software resolve.
We know this is true because R600 MSAA performance shows no performance loss against the pure-RBE resolve of R5xx - i.e. with enough ALUs software resolve is as fast as hardware resolve.
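For what it's worth, a "software" resolve is conceptually nothing more than a per-pixel average of the stored samples - roughly this, as a minimal C sketch (the contiguous per-pixel sample layout is just an assumption for illustration, not how the RBEs actually store compressed tiles):

#include <stddef.h>

/* Average the N colour samples of every pixel into a single resolved colour.
 * On R6xx this averaging runs on the shader ALUs; on R5xx the RBEs did it. */
void resolve_msaa(const float *samples,  /* npixels * nsamples RGBA values */
                  float *resolved,       /* npixels RGBA values */
                  size_t npixels, int nsamples)
{
    for (size_t p = 0; p < npixels; ++p) {
        float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (int s = 0; s < nsamples; ++s)
            for (int c = 0; c < 4; ++c)
                acc[c] += samples[(p * (size_t)nsamples + s) * 4 + c];
        for (int c = 0; c < 4; ++c)
            resolved[p * 4 + c] = acc[c] / (float)nsamples;
    }
}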

NVidia complained about this in their CoJ rant, and everyone said this was the reason R6xx was slow in AA. I figured that the only way this can really be true is if there was some parallel hardware to do the resolve, maybe near the display engine.
NVidia appears to have display-engine MSAA resolve, which is why MSAA uses excessive memory (and excessive bandwidth?) - whereas ATI has long resolved down to a smaller front buffer (just colour) for the display engine to use.

Well, there's a difference between realtime on a 1-2MPix screen and 60 GTex/s decompression (for two of them) without tying up any resources. ;)
Which is Carmack's reason for doing MegaTexture, too, isn't it? But then he suggested doing voxel (or was it octree?) rendering to get "MegaGeometry", along the same lines. Well, there are plenty of variations of this...

Anyway, RTRT of this quality makes Intel's videos look kinda silly.

Jawed
 
I'm not sure though, whether I'm legally allowed to make this stuff public. Maybe just a diagram with the achieved fillrates normalized to a 7900 GT would suffice?
Maybe you should wait until Monday to check what you intend to do...

Jawed
 
NVidia appears to have display-engine MSAA resolve, which is why MSAA uses excessive memory (and excessive bandwidth?) - whereas ATI has long resolved down to a smaller front buffer (just colour) for the display engine to use.

Using excessive memory/bandwidth with MSAA on NVIDIA cards (relative to ATI cards) certainly sounds plausible, which is why CSAA makes a lot of sense on NV cards as a bandwidth-saving technique when moving beyond 4xAA. Unfortunately, there isn't much testing done by reviewers nowadays with 16x CSAA, probably because there is nothing directly comparable on ATI cards.

Anyhow, the whole "clever" vs "brute-force" concept is a semantic argument, just a play on words. Depending on how one spins something, it is easy to make one approach or the other appear "clever" or "brutish". The most important thing is how the product performs for the intended market at a given price point, be it gaming, GPGPU or developer applications.
 
NVidia appears to have display-engine MSAA resolve, which is why MSAA uses excessive memory (and excessive bandwidth?) - whereas ATI has long resolved down to a smaller front buffer (just colour) for the display engine to use.
Are you sure NVIDIA is still doing display-engine MSAA resolve? IIRC NV30 could do this, but only with 2xAA, whereas it could do 2xAA or 4xAA with a copy resolve.
I'd have thought they'd abandoned display-engine resolve, as the drawbacks aren't worth the effort (higher memory usage, fullscreen only). It did in fact use less bandwidth, though, since in fullscreen you can page-flip (so no copy is necessary), and AFAIR that was pretty much the reason it even existed.
 
Yep, even with fp16 samples:
I know that any format can be compressed. I'm talking solely about the case where the rendertarget is a texture. I guess these textures can only be accessed in the shader using the Load() function instead of Sample() or tex2d(), now that I look into it a bit more closely, so there aren't any orthogonality problems. It does make sense to use the compression hardware, but like I said, I heard otherwise for texture rendering.

We know this is true because R600 MSAA performance shows no performance loss against the pure-RBE resolve of R5xx - i.e. with enough ALUs software resolve is as fast as hardware resolve.
Well that's just it: R6xx MSAA performance has never been as fast as their RV5xx counterparts, at least in terms of performance drop.

NVidia appears to have display-engine MSAA resolve, which is why MSAA uses excessive memory (and excessive bandwidth?) - whereas ATI has long resolved down to a smaller front buffer (just colour) for the display engine to use.
That's basically what I was talking about. I figured that ATI would do something similar. Maybe not directly to the display device like NVidia, but a separate unit nonetheless.

Which is Carmack's reason for doing MegaTexture, too, isn't it? But then he suggested doing voxel (or was it octree?) rendering to get "MegaGeometry", along the same lines. Well, there are plenty of variations of this...
The situations are rather different. Megatexture is about only having the data needed consuming the RAM. Your engine is still going to require the same billions of texture accesses per second.

To have full speed in today's games, you need to be able to decompress 160 point samples per clock in RV770. There's no way to do that economically except with fixed function hardware away from the shader engines.
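(Presumably the 160 comes from RV770's 40 texture units each issuing one bilinear fetch per clock: 40 TUs x 4 texels per bilinear fetch = 160 texels per clock.)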
 
I know that any format can be compressed. I'm talking solely about the case where the rendertarget is a texture.
But a render target is never a texture while the RT is being written. When the GPU has finished writing the RT, the API tells the GPU to "re-cast" the RT as a texture. All I'm saying is that at the time of re-casting, the compressed RT needs to be decompressed, because the TU hardware doesn't know how to decompress an RT. Once recast, the RT simply becomes a "flat" wodge in memory, now labelled as a texture for the TUs to read.

I guess these textures can only be accessed in the shader using the Load() function instead of Sample() or tex2d(), now that I look into it a bit more closely, so there aren't any orthogonality problems.
I don't understand why that might be the case, or if that's the case. Most of the time RTT is consumed by a screen-sized quad isn't it?

Well that's just it: R6xx MSAA performance has never been as fast as their RV5xx counterparts, at least in terms of performance drop.
There was a point when you understood this:

http://forum.beyond3d.com/showpost.php?p=1086411&postcount=7

I wish you'd remember...

That's basically what I was talking about. I figured that ATI would do something similar. Maybe not directly to the display device like NVidia, but a separate unit nonetheless.
I've never seen a detailed description of the physical realities of screen display in modern GPUs - I'm out of my depth here.

The situations are rather different. Megatexture is about only having the data needed consuming the RAM. Your engine is still going to require the same billions of texture accesses per second.
No, Carmack is quite explicit in saying that with MT the GPU only accesses texels at around twice the frequency of pixels:

http://forum.beyond3d.com/showpost.php?p=674633&postcount=15

At 1024x768 resolution, well under two million texels will be referenced, no matter what the finest level of detail is.
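(Rough arithmetic: 1024x768 = 786,432 pixels, so even at around two unique texels per pixel that's only about 1.6 million texels - comfortably under two million.)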

To have full speed in today's games, you need to be able to decompress 160 point samples per clock in RV770. There's no way to do that economically except with fixed function hardware away from the shader engines.
:p I agree, fixed function is here for quite a while yet for conventional OGL/D3D-style graphics.

Jawed
 
It depends on the amount of hardware that would need to be dedicated to decompression - decompression might just be a macro, a bit like fog is right now, I guess.
There is a huge difference. Fixed function fog maps to a few general ALU instructions and is used at most once per fragment, while S3TC decompression can be required for several texture fetches and is a very specialized sequence of operations which would probably result in several dozen ALU instructions. If you want to do it efficiently there is hardly anything that could be reused for other operations, so there is no point in integrating it into the ALU.
 
But a render target is never a texture while the RT is being written. When the GPU has finished writing the RT, the API tells the GPU to "re-cast" the RT as a texture. All I'm saying is that at the time of re-casting, the compressed RT needs to be decompressed, because the TU hardware doesn't know how to decompress an RT.
I'm not disagreeing with you here. This process, however, can be more costly than just rendering uncompressed in the first place.

I don't understand why that might be the case, or if that's the case. Most of the time RTT is consumed by a screen-sized quad isn't it?
Not sure what you're referring to, but no, I don't think you can make that generalization. Shadow maps are different, environment maps are different, etc.

There was a point when you understood this:

http://forum.beyond3d.com/showpost.php?p=1086411&postcount=7

I wish you'd remember...
Oh I did. But just because I felt there's a "wrench in the comparison" doesn't mean I think the AA drop was as small as it should have been. RV770, for example, has the same 2x Z rate without AA, but it has a smaller drop than even R580, IIRC.

No, Carmack is quite explicit in saying that with MT the GPU only accesses texels at around twice the frequency of pixels:
He's talking about texels in memory, not texture fetches, and only per texture at that. "Texel" is a term that has been given two meanings. It really comes from "texture element", i.e. the data elements in the texture, but GTexel/s refers to texture fetches executed by the shader engine, each needing 4 (or more) texels from the cache.

Megatexture reduces the amount of texture data needed for a scene (given certain constraints) by creating a tighter bound - relative to normal engines - on what really has a chance to be accessed, but has no effect on the texture operations needed to render it.
 
There is a huge difference. Fixed function fog maps to a few general ALU instructions and is used at most once per fragment, while S3TC decompression can be required for several texture fetches and is a very specialized sequence of operations which would probably result in several dozen ALU instructions. If you want to do it efficiently there is hardly anything that could be reused for other operations, so there is no point in integrating it into the ALU.
I've had a rummage for some code to decode S3TC but haven't found anything so far. I've got the patent and this:

http://graphics.stanford.edu/courses/cs448a-01-fall/nvOpenGLspecs.pdf

pages 155-158 being key.

Yeah, it looks like quite a lot of instructions - 20-30 for DXT5? So, yeah, it'll be quite a long time before that's running on the ALUs.
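For a feel of what that entails, here's a straight C reference decoder for a single 4x4 DXT1 colour block, written from the public format description rather than taken from the patent or the NV spec above (DXT5 adds a separate interpolated-alpha block on top of this):

#include <stdint.h>

/* Decode one 8-byte DXT1 colour block into 16 RGB888 texels. */
static void decode_dxt1_block(const uint8_t blk[8], uint8_t out_rgb[16][3])
{
    /* Two RGB565 endpoints, stored little-endian. */
    uint16_t c0 = (uint16_t)(blk[0] | (blk[1] << 8));
    uint16_t c1 = (uint16_t)(blk[2] | (blk[3] << 8));

    /* Expand 5:6:5 endpoints to 8:8:8 and build the 4-entry palette. */
    uint8_t pal[4][3];
    pal[0][0] = (uint8_t)(((c0 >> 11) & 31) * 255 / 31);
    pal[0][1] = (uint8_t)(((c0 >>  5) & 63) * 255 / 63);
    pal[0][2] = (uint8_t)(( c0        & 31) * 255 / 31);
    pal[1][0] = (uint8_t)(((c1 >> 11) & 31) * 255 / 31);
    pal[1][1] = (uint8_t)(((c1 >>  5) & 63) * 255 / 63);
    pal[1][2] = (uint8_t)(( c1        & 31) * 255 / 31);

    if (c0 > c1) {           /* four-colour mode: two interpolated colours */
        for (int i = 0; i < 3; i++) {
            pal[2][i] = (uint8_t)((2 * pal[0][i] + pal[1][i]) / 3);
            pal[3][i] = (uint8_t)((pal[0][i] + 2 * pal[1][i]) / 3);
        }
    } else {                 /* three-colour mode: midpoint plus black */
        for (int i = 0; i < 3; i++) {
            pal[2][i] = (uint8_t)((pal[0][i] + pal[1][i]) / 2);
            pal[3][i] = 0;
        }
    }

    /* 32 bits of 2-bit palette indices, one per texel, row-major. */
    uint32_t idx = (uint32_t)blk[4] | ((uint32_t)blk[5] << 8) |
                   ((uint32_t)blk[6] << 16) | ((uint32_t)blk[7] << 24);
    for (int t = 0; t < 16; t++) {
        int sel = (idx >> (2 * t)) & 3;
        out_rgb[t][0] = pal[sel][0];
        out_rgb[t][1] = pal[sel][1];
        out_rgb[t][2] = pal[sel][2];
    }
}

Per texel it's only a table lookup, but building the 4-entry palette costs the endpoint unpacking and interpolation above - the part that's cheap in fixed function and clumsy on general ALUs.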

I suppose it's worth querying the trend for the ratio of DXT:non-DXT texels...

Jawed
 
I'm not disagreeing with you here. This process, however, can be more costly than just rendering uncompressed in the first place.
I presume you mean cases like a render target generated with practically zero overdraw or low-resolution render targets.

Not sure what you're referring to, but no, I don't think you can make that generalization. Shadow maps are different, environment maps are different, etc.
:oops: I was thinking in terms of 2D colour and got stuck there.

Oh I did. But just because I felt there's a "wrench in the comparison" doesn't mean I think the AA drop was as small as it should have been. RV770, for example, has the same 2x Z rate without AA, but it has a smaller drop than even R580, IIRC.
Eh? RV770 should have 4xZ without AA. Also you're forgetting that R5xx is comparatively wasteful of bandwidth...

He's talking about texels in memory, not texture fetches, and only per texture at that. "Texel" is a term that has been given two meanings. It really comes from "texture element", i.e. the data elements in the texture, but GTexel/s refers to texture fetches executed by the shader engine, each needing 4 (or more) texels from the cache.
Well the point about MT is there's no need to multi-texture for albedo (since the artist flattens down to a single texel) so then it's albedo + whatever maps are required.

But the point is, if you've only got, say 4 million texels of albedo per frame to sample, then there's not much texel rate required. Specular, normal, ambient will all require lower sampling rates, too, I presume.

Megatexture reduces the amount of texture data needed for a scene (given certain constraints) by creating a tighter bound - relative to normal engines - on what really has a chance to be accessed, but has no effect on the texture operations needed to render it.
As far as I can tell MT implicitly minimises overdraw (in order to minimise the texture footprint loaded into video RAM and stream textures at the "required resolution"), so overall I'm puzzled how such an engine is going to be sampling billions of texels per second for albedo.

I'm just comparing the albedo of polygons with the pseudo-albedo of voxels (the presenter referred to wavelet compression of a huge dataset - I'm presuming he's referring to the voxels that make up the scene).

Jawed
 
I presume you mean cases like a render target generated with practically zero overdraw or low-resolution render targets.
Near zero net overdraw. With good sorting this can include lots of gross overdraw.

Eh? RV770 should have 4xZ without AA. Also you're forgetting that R5xx is comparatively wasteful of bandwidth...
In theory, yes, but tests show it pretty close to 2x:
http://www.computerbase.de/artikel/...50_rv770/3/#abschnitt_theoretische_benchmarks
I wouldn't say R5xx is wasteful of bandwidth, but I will acknowledge that it has more bandwidth per shader pipe per clock, thus being less saturated and more able to absorb the hit of AA. R600, though, is even more "wasteful" of BW than R580, so if that's a factor in R580 having a low % drop w/ AA, it should help R600, too.

Well the point about MT is there's no need to multi-texture for albedo (since the artist flattens down to a single texel) so then it's albedo + whatever maps are required.

But the point is, if you've only got, say 4 million texels of albedo per frame to sample, then there's not much texel rate required. Specular, normal, ambient will all require lower sampling rates, too, I presume.


As far as I can tell MT implicitly minimises overdraw (in order to minimise the texture footprint loaded into video RAM and stream textures at the "required resolution"), so overall I'm puzzled how such an engine is going to be sampling billions of texels per second for albedo.

I'm just comparing the albedo of polygons with the pseudo-albedo of voxels (the presenter referred to wavelet compression of a huge dataset - I'm presuming he's referring to the voxels that make up the scene).
Jawed, you're way off base here. Your estimates for the number of fetches needed for a scene are off by over an order of magnitude. Megatexturing does not reduce the bilinear fetch count much at all. It definitely does not reduce the number of fetches to the point where in-shader decompression is feasible, and Xmas agrees with me.
 

Given that it in fact still achieves about 8,500 MZixels/s with 4xAA enabled (3.4 Z/ROP/clk), odds are the Z-only rate without AA is utterly bandwidth limited.

I've got a similar test (thanks to Marco Dolenc) and it shows the HD2900 XT and HD4850 on par without AA, with the HD4850 pulling ahead with 4xAA enabled. Normalized to clock and number of ROPs, the HD4850 gets about 3.39 Z while the HD2900 is pushing out a mere 1.83 Z.

edit:
Even more interesting are the achieved throughputs with 8xAA enabled.
HD4850: 3.9056 z/clk/ROP
HD3850: 1.9041 z/clk/ROP
HD2900: 1.8567 z/clk/ROP
GTX280: 3.1279 z/clk/ROP
9600GT: 5.1307 z/clk/ROP
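(For scale, taking the HD4850's commonly quoted 625 MHz core clock and 16 RBEs: 3.9056 z/clk/ROP x 16 x 625 MHz ≈ 39 GZsamples/s with 8xAA.)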
 
Near zero net overdraw. With good sorting this can include lots of gross overdraw.

In theory, yes, but tests show it pretty close to 2x:
http://www.computerbase.de/artikel/...50_rv770/3/#abschnitt_theoretische_benchmarks

Actually, I can get it to do 4x Z... it depends on the app used to test it. Fillrate Benchmark (which I guess is loosely based on MDolenc's fillrate tester) seems to go oddball about testing Z - it gets me about 5x Z on an 8800 GT and about 2.5x Z on the 4850, which you'll agree are both wrong. Using Archmark you get the correct 4x Z on the 4850.
 