Jawed
> Low ALU:TEX = brute force (G8X)
> Low TEX:ALU = elegant (R6xx)

Is R5xx a good architecture? R520 was never a good advertisement for it. An X1950Pro is quite a bit faster than a 7900GTX in newer games. etc.
Jawed
Would a test count in your book, where the pure pixel shaders have been grabbed from a real game, run through a fillrate meter and then normalized relative to a given card?
> How is that supposed to improve anything?

It depends on the amount of hardware that would need to be dedicated to decompression - decompression might just be a macro, a bit like fog is right now, I guess.
> You can't Load() from the current render target until the Colour+Z compression has been decoded by the RBEs. Only the RBEs have access to the tile tag tables in order to make sense of the data in memory.

So we know for sure that an MSAA rendertarget can be compressed? I seem to remember someone saying that render-to-texture is uncompressed, but I could be mistaken and/or thinking of a console.
> If you're going to run a decompression pass (within the RBEs) and you're going to do bog-standard MSAA resolve, why ask the RBEs to write the decompressed data to a new target for the TUs to consume when it's possible to transmit the data directly to the register file?

Well, I wasn't sure that the RBEs were the things doing the MSAA resolve. If they were, it makes no sense to me that a "hardware" resolve is any faster than a software resolve. NVidia complained about this in their CoJ rant, and everyone said this was the reason R6xx was slow in AA. I figured that the only way this can really be true is if there was some parallel hardware to do the resolve, maybe near the display engine.
> Apparently RV770 (well 2 of them I guess) is doing real time wavelet decompression on the ALUs for the new Ruby demo.

Well, there's a difference between realtime on a 1-2MPix screen and 60 GTex/s decompression (for two of them) without tying up any resources.
Jawed
Yeah, thanks, it'd be nice to have an example of a significantly ALU-limited game.
Got the shader code?
Jawed
> So we know for sure that an MSAA rendertarget can be compressed?

Yep, even with fp16 samples:
In one embodiment, graphics processor comprises a Tile Format Table (TFT) to keep track of the compression format of the incoming files [sic]. In another embodiment, two TFTs are used, with one keeping track of the tiles stored in main memory and another keeping track of the tiles stored in cache memory. With the help of TFT, the graphics processor is able to process incoming tiles in their compressed formats, without the need of decompression. This speeds up overall system performance. In one embodiment, the TFT is also used to enable a "Fast Clear" operation where the graphics processor can quickly clear the cache memory. In another embodiment, the process performs resolve operations on sample colors. In these operations, the sample colors are combined to get the final color of the pixel. The compression schemes of the present invention also enable the processor to optimize the procedure of the resolve operation.
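Read loosely, the tile-tag mechanism the patent describes can be sketched in a few lines of C++; the three-state tag and every name here are illustrative, not the actual hardware encoding:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of a per-tile format table: one tag per tile of the surface says
// how that tile's data in memory is to be interpreted, so a reader that
// understands the tags can consume compressed tiles directly.
enum class TileFormat : std::uint8_t {
    Cleared,        // "fast clear": nothing written yet, colour is the clear colour
    Compressed,     // e.g. one colour plus per-sample selects for an MSAA tile
    Uncompressed    // plain sample data
};

struct TileFormatTable {
    std::vector<TileFormat> tags;   // one entry per tile

    explicit TileFormatTable(std::size_t tileCount)
        : tags(tileCount, TileFormat::Cleared) {}

    // A "fast clear" only rewrites the tags; no tile data is touched.
    void fastClear() { std::fill(tags.begin(), tags.end(), TileFormat::Cleared); }

    TileFormat lookup(std::size_t tileIndex) const { return tags[tileIndex]; }
};
```

The point the patent is making is that anything which understands the tags can process tiles in whatever format they happen to be in, and a "fast clear" never touches the tile data itself.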
"Copy to texture", which would decompress the compression, is how I think it works. It doesn't make sense to have the MSAA compression hardware idling during rendering to a target that will later be used as a texture.I seem to remember someone saying that render-to-texture is uncompressed, but I could be mistaken and/or thinking of a console.
> Well, I wasn't sure that the RBEs were the things doing the MSAA resolve.

In an ATI GPU, what other hardware would do this (bearing in mind that the actual averaging is performed by the ALUs since R600)?
> If they were, it makes no sense to me that a "hardware" resolve is any faster than a software resolve.

We know this is true because R600 MSAA performance shows no performance loss against the pure-RBE resolve of R5xx - i.e. with enough ALUs software resolve is as fast as hardware resolve.
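For illustration of what a "software" resolve amounts to - just a box filter over the samples - here is CPU-side C++ showing the arithmetic the ALUs would run (the layout and names are assumptions, not the driver's actual code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// 'samples' is laid out as [pixel][sample] with RGBA8 packed into a uint32_t.
// A box-filter resolve averages the N samples of each pixel, per channel.
std::vector<std::uint32_t> resolveBox(const std::vector<std::uint32_t>& samples,
                                      std::size_t pixelCount,
                                      std::size_t samplesPerPixel)
{
    std::vector<std::uint32_t> resolved(pixelCount);
    for (std::size_t p = 0; p < pixelCount; ++p) {
        std::uint32_t sum[4] = {0, 0, 0, 0};
        for (std::size_t s = 0; s < samplesPerPixel; ++s) {
            const std::uint32_t c = samples[p * samplesPerPixel + s];
            for (int ch = 0; ch < 4; ++ch)
                sum[ch] += (c >> (ch * 8)) & 0xFFu;
        }
        std::uint32_t out = 0;
        for (int ch = 0; ch < 4; ++ch)
            out |= static_cast<std::uint32_t>((sum[ch] / samplesPerPixel) & 0xFFu) << (ch * 8);
        resolved[p] = out;
    }
    return resolved;
}
```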
> NVidia complained about this in their CoJ rant, and everyone said this was the reason R6xx was slow in AA. I figured that the only way this can really be true is if there was some parallel hardware to do the resolve, maybe near the display engine.

NVidia appears to have display-engine MSAA resolve, which is why MSAA uses excessive memory (and excessive bandwidth?) - whereas ATI has long resolved down to a smaller front buffer (just colour) for the display engine to use.
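If the display engine really did have to scan out from the full multisampled surface, the memory difference is easy to put numbers on (figures assumed purely for illustration):

```cpp
#include <cstdio>

int main() {
    // Illustrative only: 1920x1200, RGBA8, 4x multisampling.
    const double w = 1920, h = 1200, bytesPerPixel = 4, samples = 4;
    const double resolvedMB     = w * h * bytesPerPixel / (1024 * 1024);
    const double multisampledMB = resolvedMB * samples;
    std::printf("resolved front buffer: %.1f MB, full 4xAA surface: %.1f MB\n",
                resolvedMB, multisampledMB);   // roughly 8.8 MB vs 35.2 MB
    return 0;
}
```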
> Well, there's a difference between realtime on a 1-2MPix screen and 60 GTex/s decompression (for two of them) without tying up any resources.

Which is Carmack's reason for doing MegaTexture, too, isn't it? But then he suggested doing voxel (or was it octree?) rendering to get "MegaGeometry", along the same lines. Well, there are plenty of variations of this...
> I'm not sure, though, whether I'm legally allowed to make this stuff public. Maybe just a diagram with the achieved fillrates normalized to a 7900 GT would suffice?

Maybe you should wait until Monday to check what you intend to do...
> NVidia appears to have display-engine MSAA resolve, which is why MSAA uses excessive memory (and excessive bandwidth?) - whereas ATI has long resolved down to a smaller front buffer (just colour) for the display engine to use.

Are you sure NVidia is still doing display-engine MSAA resolve? IIRC NV30 could do this, but only with 2xAA, whereas it could do 2xAA or 4xAA with a copy resolve.
> Yep, even with fp16 samples:

I know that any format can be compressed. I'm talking solely about the case where the rendertarget is a texture. I guess these textures can only be accessed in the shader using the Load() function instead of Sample() or tex2d(), now that I look into it a bit more closely, so there aren't any orthogonality problems. It does make sense to use the compression hardware, but like I said, I heard otherwise for texture rendering.
> We know this is true because R600 MSAA performance shows no performance loss against the pure-RBE resolve of R5xx - i.e. with enough ALUs software resolve is as fast as hardware resolve.

Well that's just it: R6xx MSAA performance has never been as fast as that of their RV5xx counterparts, at least in terms of performance drop.
> NVidia appears to have display-engine MSAA resolve, which is why MSAA uses excessive memory (and excessive bandwidth?) - whereas ATI has long resolved down to a smaller front buffer (just colour) for the display engine to use.

That's basically what I was talking about. I figured that ATI would do something similar. Maybe not directly to the display device like NVidia, but a separate unit nonetheless.
> Which is Carmack's reason for doing MegaTexture, too, isn't it? But then he suggested doing voxel (or was it octree?) rendering to get "MegaGeometry", along the same lines. Well, there are plenty of variations of this...

The situations are rather different. Megatexture is about having only the data that's actually needed consuming RAM. Your engine is still going to require the same billions of texture accesses per second.
> I know that any format can be compressed. I'm talking solely about the case where the rendertarget is a texture.

But a render target is never a texture while the RT is being written. When the GPU has finished writing the RT then the API tells the GPU to "re-cast" the RT as a texture. All I'm saying is that at the time of re-casting, the compressed RT needs to be decompressed because the TU hardware doesn't know how to decompress a RT. Once recast, the RT simply becomes just a "flat" wodge in memory, now labelled as a texture for the TUs to read.
> I guess these textures can only be accessed in the shader using the Load() function instead of Sample() or tex2d(), now that I look into it a bit more closely, so there aren't any orthogonality problems.

I don't understand why that might be the case, or if that's the case. Most of the time RTT is consumed by a screen-sized quad, isn't it?
> Well that's just it: R6xx MSAA performance has never been as fast as that of their RV5xx counterparts, at least in terms of performance drop.

There was a point when you understood this:

http://forum.beyond3d.com/showpost.php?p=1086411&postcount=7

I wish you'd remember...
> That's basically what I was talking about. I figured that ATI would do something similar. Maybe not directly to the display device like NVidia, but a separate unit nonetheless.

I've never seen a detailed description of the physical realities of screen display in modern GPUs - I'm out of my depth here.
> The situations are rather different. Megatexture is about having only the data that's actually needed consuming RAM. Your engine is still going to require the same billions of texture accesses per second.

No, Carmack is quite explicit in saying that with MT the GPU only accesses texels at around twice the frequency of pixels:
At 1024x768 resolution, well under two million texels will be referenced, no matter what the finest level of detail is.
> To have full speed in today's games, you need to be able to decompress 160 point samples per clock in RV770. There's no way to do that economically except with fixed function hardware away from the shader engines.

I agree, fixed function is here for quite a while yet for conventional OGL/D3D-style graphics.
> It depends on the amount of hardware that would need to be dedicated to decompression - decompression might just be a macro, a bit like fog is right now, I guess.

There is a huge difference. Fixed function fog maps to a few general ALU instructions and is used at most once per fragment, while S3TC decompression can be required for several texture fetches and is a very specialized sequence of operations which would probably result in several dozen ALU instructions. If you want to do it efficiently there is hardly anything that could be reused for other operations, so there is no point in integrating it into the ALU.
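To put a rough size on that "very specialized sequence", here is a plain C++ decode of one DXT1/S3TC block (two 5:6:5 endpoints, a derived four-entry palette, sixteen 2-bit indices) - already a few dozen operations per block, before any filtering happens:

```cpp
#include <cstdint>

struct RGBA { uint8_t r, g, b, a; };

static RGBA expand565(uint16_t c) {
    RGBA o;
    o.r = static_cast<uint8_t>(((c >> 11) & 31) * 255 / 31);
    o.g = static_cast<uint8_t>(((c >> 5)  & 63) * 255 / 63);
    o.b = static_cast<uint8_t>(( c        & 31) * 255 / 31);
    o.a = 255;
    return o;
}

// Decode one 8-byte DXT1 block into 16 RGBA texels (out[4*y + x]).
void decodeDXT1Block(const uint8_t block[8], RGBA out[16]) {
    const uint16_t c0 = static_cast<uint16_t>(block[0] | (block[1] << 8));
    const uint16_t c1 = static_cast<uint16_t>(block[2] | (block[3] << 8));
    const uint32_t indices = static_cast<uint32_t>(block[4]) |
                             (static_cast<uint32_t>(block[5]) << 8) |
                             (static_cast<uint32_t>(block[6]) << 16) |
                             (static_cast<uint32_t>(block[7]) << 24);
    RGBA p[4];
    p[0] = expand565(c0);
    p[1] = expand565(c1);
    if (c0 > c1) {   // 4-colour mode: two interpolated palette entries
        p[2] = { static_cast<uint8_t>((2 * p[0].r + p[1].r) / 3),
                 static_cast<uint8_t>((2 * p[0].g + p[1].g) / 3),
                 static_cast<uint8_t>((2 * p[0].b + p[1].b) / 3), 255 };
        p[3] = { static_cast<uint8_t>((p[0].r + 2 * p[1].r) / 3),
                 static_cast<uint8_t>((p[0].g + 2 * p[1].g) / 3),
                 static_cast<uint8_t>((p[0].b + 2 * p[1].b) / 3), 255 };
    } else {         // 3-colour mode plus transparent black
        p[2] = { static_cast<uint8_t>((p[0].r + p[1].r) / 2),
                 static_cast<uint8_t>((p[0].g + p[1].g) / 2),
                 static_cast<uint8_t>((p[0].b + p[1].b) / 2), 255 };
        p[3] = { 0, 0, 0, 0 };
    }
    for (int i = 0; i < 16; ++i)
        out[i] = p[(indices >> (2 * i)) & 3];
}
```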
> But a render target is never a texture while the RT is being written. When the GPU has finished writing the RT then the API tells the GPU to "re-cast" the RT as a texture. All I'm saying is that at the time of re-casting, the compressed RT needs to be decompressed because the TU hardware doesn't know how to decompress a RT.

I'm not disagreeing with you here. This process, however, can be more costly than just rendering uncompressed in the first place.
> I don't understand why that might be the case, or if that's the case. Most of the time RTT is consumed by a screen-sized quad, isn't it?

Not sure what you're referring to, but no, I don't think you can make that generalization. Shadow maps are different, environment maps are different, etc.
> There was a point when you understood this:

Oh I did. But just because I felt there's a "wrench in the comparison" doesn't mean I think the AA drop was as small as it should have been. RV770, for example, has the same 2x Z rate without AA, but it has a smaller drop than even R580, IIRC.
> No, Carmack is quite explicit in saying that with MT the GPU only accesses texels at around twice the frequency of pixels:

He's talking about texels in memory, not texture fetches, and only per texture at that. "Texel" is a term that has been given two meanings. It really comes from "texture element", i.e. the data elements in the texture, but GTexel/s refers to texture fetches executed by the shader engine, each needing 4 (or more) texels from the cache.
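Back-of-the-envelope arithmetic for the two meanings (every figure below is assumed purely for illustration, not taken from any test):

```cpp
#include <cstdio>

int main() {
    const double pixels          = 1024.0 * 768.0; // frame size from the Carmack quote
    const double fps             = 60.0;
    const double overdraw        = 2.0;            // assumed average overdraw
    const double fetchesPerPixel = 4.0;            // assumed: albedo + normal + specular + lightmap
    const double texelsPerFetch  = 4.0;            // a bilinear fetch reads a 2x2 footprint

    // Meaning 1: unique texture data referenced per frame ("well under two million").
    const double uniqueTexelsPerFrame = pixels * 2.0;

    // Meaning 2: texel reads behind the fetches the shader engine actually issues.
    const double fetchesPerSecond    = pixels * overdraw * fetchesPerPixel * fps;
    const double texelReadsPerSecond = fetchesPerSecond * texelsPerFetch;

    std::printf("unique texels/frame: %.2f M\n", uniqueTexelsPerFrame / 1e6);
    std::printf("fetches/s: %.2f G, texel reads/s: %.2f G\n",
                fetchesPerSecond / 1e9, texelReadsPerSecond / 1e9);
    return 0;
}
```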
> There is a huge difference. Fixed function fog maps to a few general ALU instructions and is used at most once per fragment, while S3TC decompression can be required for several texture fetches and is a very specialized sequence of operations which would probably result in several dozen ALU instructions. If you want to do it efficiently there is hardly anything that could be reused for other operations, so there is no point in integrating it into the ALU.

I've had a rummage for some code to decode S3TC but haven't found anything so far. I've got the patent and this:
> I'm not disagreeing with you here. This process, however, can be more costly than just rendering uncompressed in the first place.

I presume you mean cases like a render target generated with practically zero overdraw or low-resolution render targets.
> Not sure what you're referring to, but no, I don't think you can make that generalization. Shadow maps are different, environment maps are different, etc.

I was thinking in terms of 2D colour and got stuck there.
> Oh I did. But just because I felt there's a "wrench in the comparison" doesn't mean I think the AA drop was as small as it should have been. RV770, for example, has the same 2x Z rate without AA, but it has a smaller drop than even R580, IIRC.

Eh? RV770 should have 4xZ without AA. Also you're forgetting that R5xx is comparatively wasteful of bandwidth...
> He's talking about texels in memory, not texture fetches, and only per texture at that. "Texel" is a term that has been given two meanings. It really comes from "texture element", i.e. the data elements in the texture, but GTexel/s refers to texture fetches executed by the shader engine, each needing 4 (or more) texels from the cache.

Well the point about MT is there's no need to multi-texture for albedo (since the artist flattens down to a single texel) so then it's albedo + whatever maps are required.
> Megatexture reduces the amount of texture data needed for a scene (given certain constraints) by creating a tighter bound - relative to normal engines - on what really has a chance to be accessed, but has no effect on the texture operations needed to render it.

As far as I can tell MT implicitly minimises overdraw (in order to minimise the texture footprint loaded into video RAM and stream textures at the "required resolution"), so overall I'm puzzled how such an engine is going to be sampling billions of texels per second for albedo.
> I presume you mean cases like a render target generated with practically zero overdraw or low-resolution render targets.

Near zero net overdraw. With good sorting this can include lots of gross overdraw.
> Eh? RV770 should have 4xZ without AA. Also you're forgetting that R5xx is comparatively wasteful of bandwidth...

In theory, yes, but tests show it pretty close to 2x:

http://www.computerbase.de/artikel/...50_rv770/3/#abschnitt_theoretische_benchmarks
> Well the point about MT is there's no need to multi-texture for albedo (since the artist flattens down to a single texel) so then it's albedo + whatever maps are required.

Jawed, you're way off base here. Your estimates for the number of fetches needed for a scene are off by over an order of magnitude. Megatexturing does not reduce bilinear fetch count much at all. It definitely does not reduce the number of fetches to the point where in-shader decompression is feasible, and Xmas agrees with me.
But the point is, if you've only got, say 4 million texels of albedo per frame to sample, then there's not much texel rate required. Specular, normal, ambient will all require lower sampling rates, too, I presume.
I'm just comparing the albedo of polygons with the pseudo-albedo of voxels (the presenter referred to wavelet compression of a huge dataset - I'm presuming he's referring to the voxels that make up the scene).