G80 vs R600 Part X: The Blunt & The Rich Feature

Discussion in 'Architecture and Products' started by Jawed, Aug 11, 2007.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Is R5xx a good architecture? R520 was never a good advertisement for it. An X1950 Pro is quite a bit faster than a 7900 GTX in newer games, etc.

    Jawed
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Yeah, thanks, it'd be nice to have an example of a significantly ALU-limited game.

    Got the shader code?

    Jawed
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    It depends on the amount of hardware that would need to be dedicated to decompression - decompression might just be a macro, a bit like fog is right now, I guess.

    Jawed
     
  4. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    So we know for sure that an MSAA rendertarget can be compressed? I seem to remember someone saying that render-to-texture is uncompressed, but I could be mistaken and/or thinking of a console.

    Well, I wasn't sure that the RBE's were the things doing the MSAA resolve. If they were, it makes no sense to me that a "hardware" resolve is any faster than a software resolve. NVidia complained about this in their CoJ rant, and everyone said this was the reason R6xx was slow in AA. I figured that the only way this can really be true is if there was some parallel hardware to do the resolve, maybe near the display engine.

    I guess the datapath between the RBEs and shader units is a reason.

    Well, there's a difference between realtime on a 1-2MPix screen and 60 GTex/s decompression (for two of them) without tying up any resources. :wink:
     
  5. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    At least, that's what was grabbed from some sample scenes - maybe some games generate new shaders on the fly or in different levels.

    I'm not sure, though, whether I'm legally allowed to make this stuff public. Maybe just a diagram with the achieved fillrates normalized to a 7900 GT would suffice?
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Yep, even with fp16 samples:

    Method and apparatus for anti-aliasing using floating point subpixel color values and compression of same

    this appears to have been for R5xx.

    "Copy to texture", which would decompress the compression, is how I think it works. It doesn't make sense to have the MSAA compression hardware idling during rendering to a target that will later be used as a texture.

    In an ATI GPU, what other hardware would do this (bearing in mind that the actual averaging is performed by the ALUs since R600)?

    We know this is true because R600 MSAA performance shows no performance loss against the pure-RBE resolve of R5xx - i.e. with enough ALUs software resolve is as fast as hardware resolve.

    NVidia appears to have display-engine MSAA resolve, which is why MSAA uses excessive memory (and excessive bandwidth?) - whereas ATI has long resolved down to a smaller front buffer (just colour) for the display engine to use.

    Which is Carmack's reason for doing MegaTexture, too, isn't it? But then he suggested doing voxel (or was it octree?) rendering to get "MegaGeometry", along the same lines. Well, there are plenty of variations of this...

    Anyway, RTRT of this quality makes Intel's videos look kinda silly.

    Jawed
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Maybe you should wait until Monday to check what you intend to do...

    Jawed
     
  8. jimmyjames123

    Regular

    Joined:
    Apr 14, 2004
    Messages:
    810
    Likes Received:
    3
    Using excessive memory/bandwidth with MSAA on NVIDIA cards (relative to ATI cards) surely sounds plausible, which is why CSAA used on NV cards makes a lot of sense when moving beyond 4xAA as a bandwidth-saving technique. Unfortunately there isn't much testing done by reviewers nowadays with 16x CSAA, probably because there is nothing directly comparable on the ATI cards.

    Anyhow, the whole "clever" vs "brute-force" concept is a semantic argument, just a play on words. Depending on how one spins it, it is easy to make either approach appear "clever" or "brutish". The most important thing is how the product performs for the intended market at a given price point, be it gaming, GPGPU or developer applications.
     
  9. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    17,879
    Likes Received:
    5,330
    As I said before, he who dies with the most frames wins ;)
     
  10. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    Are you sure NVIDIA is still doing display-engine MSAA resolve? IIRC NV30 could do this, but only with 2xAA, whereas it could do 2xAA or 4xAA with a copy resolve.
    I'd have thought they've abandoned display-engine resolve, as the drawbacks aren't worth the effort (higher memory usage, fullscreen only). It did in fact use less bandwidth, though, since in fullscreen you can do a pageflip (so no copy is necessary), and AFAIR that was pretty much the reason it even existed.
     
  11. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I know that any format can be compressed. I'm talking solely about the case where the rendertarget is a texture. I guess these textures can only be accessed in the shader using the Load() function instead of Sample() or tex2d(), now that I look into it a bit more closely, so there aren't any orthogonality problems. It does make sense to use the compression hardware, but like I said, I heard otherwise for texture rendering.

    Well that's just it: R6xx MSAA performance has never been as fast as their RV5xx counterparts, at least in terms of performance drop.

    That's basically what I was talking about. I figured that ATI would do something similar. Maybe not directly to the display device like NVidia, but a separate unit nonetheless.

    The situations are rather different. Megatexture is about only having the data needed consuming the RAM. Your engine is still going to require the same billions of texture accesses per second.

    To have full speed in today's games, you need to be able to decompress 160 point samples per clock in RV770. There's no way to do that economically except with fixed function hardware away from the shader engines.
     
    #231 Mintmaster, Jun 22, 2008
    Last edited by a moderator: Jun 22, 2008
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    But a render target is never a texture while the RT is being written. When the GPU has finished writing the RT, the API tells the GPU to "re-cast" the RT as a texture. All I'm saying is that at the time of re-casting, the compressed RT needs to be decompressed, because the TU hardware doesn't know how to decompress an RT. Once re-cast, the RT becomes just a "flat" wodge in memory, now labelled as a texture for the TUs to read.

    I don't understand why that might be the case, or whether it is the case. Most of the time, RTT is consumed by a screen-sized quad, isn't it?

    There was a point when you understood this:

    http://forum.beyond3d.com/showpost.php?p=1086411&postcount=7

    I wish you'd remember...

    I've never seen a detailed description of the physical realities of screen display in modern GPUs - I'm out of my depth here.

    No, Carmack is quite explicit in saying that with MT the GPU only accesses texels at around twice the frequency of pixels:

    http://forum.beyond3d.com/showpost.php?p=674633&postcount=15

    :razz: I agree, fixed function is here for quite a while yet for conventional OGL/D3D-style graphics.

    Jawed
     
  13. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    There is a huge difference. Fixed function fog maps to a few general ALU instructions and is used at most once per fragment, while S3TC decompression can be required for several texture fetches and is a very specialized sequence of operations which would probably result in several dozen ALU instructions. If you want to do it efficiently there is hardly anything that could be reused for other operations, so there is no point in integrating it into the ALU.
     
  14. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I'm not disagreeing with you here. This process, however, can be more costly than just rendering uncompressed in the first place.

    Not sure what you're referring to, but no, I don't think you can make that generalization. Shadow maps are different, environment maps are different, etc.

    Oh I did. But just because I felt there's a "wrench in the comparison" doesn't mean I think the AA drop was as small as it should have been. RV770, for example, has the same 2x Z rate without AA, but it has a smaller drop than even R580, IIRC.

    He's talking about texels in memory, not texture fetches, and only per texture at that. "Texel" is a term that has been given two meanings. It really comes from "texture element", i.e. the data elements in the texture, but GTexel/s refers to texture fetches executed by the shader engine, each needing 4 (or more) texels from the cache.

    Megatexture reduces the amount of texture data needed for a scene (given certain constraints) by creating a tighter bound - relative to normal engines - on what really has a chance to be accessed, but has no effect on the texture operations needed to render it.
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I've had a rummage for some code to decode S3TC but haven't found anything so far. I've got the patent and this:

    http://graphics.stanford.edu/courses/cs448a-01-fall/nvOpenGLspecs.pdf

    pages 155-158 being key.

    Yeah, it looks like quite a lot of instructions - 20-30 for DXT5? So, yeah, it'll be quite a long time before it's running on ALUs.

    I suppose it's worth querying the trend for the ratio of DXT:non-DXT texels...

    Jawed
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I presume you mean cases like a render target generated with practically zero overdraw or low-resolution render targets.

    :oops: I was thinking in terms of 2D colour and got stuck there.

    Eh? RV770 should have 4xZ without AA. Also you're forgetting that R5xx is comparatively wasteful of bandwidth...

    Well the point about MT is there's no need to multi-texture for albedo (since the artist flattens down to a single texel) so then it's albedo + whatever maps are required.

    But the point is, if you've only got, say 4 million texels of albedo per frame to sample, then there's not much texel rate required. Specular, normal, ambient will all require lower sampling rates, too, I presume.

    As far as I can tell MT implicitly minimises overdraw (in order to minimise the texture footprint loaded into video RAM and stream textures at the "required resolution"), so overall I'm puzzled how such an engine is going to be sampling billions of texels per second for albedo.

    I'm just comparing the albedo of polygons with the pseudo-albedo of voxels (the presenter referred to wavelet compression of a huge dataset - I'm presuming he's referring to the voxels that make up the scene).

    Jawed
     
  17. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Near zero net overdraw. With good sorting this can include lots of gross overdraw.

    In theory, yes, but tests show it pretty close to 2x:
    http://www.computerbase.de/artikel/...50_rv770/3/#abschnitt_theoretische_benchmarks
    I wouldn't say R5xx is wasteful of bandwidth, but I will acknowledge that it has more bandwidth per shader pipe per clock, thus being less saturated and more able to absorb the hit of AA. R600, though, is even more "wasteful" of BW than R580, so if that's a factor in R580 having a low % drop w/ AA, it should help R600, too.

    Jawed, you're way off base here. Your estimates for the number of fetches needed for a scene are off by over an order of magnitude. Megatexturing does not reduce the bilinear fetch count much at all. It definitely does not reduce the number of fetches to the point where in-shader decompression is feasible, and Xmas agrees with me.
     
  18. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    When it in fact still achieves about 8,500 MZixels/s with 4x AA enabled (3.4 Z/ROP/clk), odds are the Z-only rate without AA is utterly bandwidth limited.

    I've got a similar test (thx to Marco Dolenc) and it shows the HD 2900 XT and HD 4850 on par without AA, with the HD 4850 pulling ahead once 4xAA is enabled. Normalized to clock and number of ROPs, the HD 4850 gets about 3.39 Z while the HD 2900 pushes out a mere 1.83 Z.

    edit:
    Even more interesting are the achieved throughputs with 8xAA enabled.
    HD4850: 3.9056 Z/clk/ROP
    HD3850: 1.9041 Z/clk/ROP
    HD2900: 1.8567 Z/clk/ROP
    GTX280: 3.1279 Z/clk/ROP
    9600GT: 5.1307 Z/clk/ROP
     
    #238 CarstenS, Jun 23, 2008
    Last edited by a moderator: Jun 23, 2008
  19. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,535
    Likes Received:
    144
    Actually, I can get it to do 4X Z... it depends on the app used to test it. Fillrate Benchmark (which I guess is loosely based on MDolenc's fillrate tester) seems to go oddball when testing Z - it gets me about 5X Z on an 8800 GT and about 2.5X Z on the 4850, which you'll agree are both wrong. Using Archmark you get the correct 4X Z on the 4850.
     
  20. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Interesting. Care to post the ArchMark numbers?
     