AMD: R7xx Speculation

Discussion in 'Architecture and Products' started by Unknown Soldier, May 18, 2007.

Thread Status:
Not open for further replies.
  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    To make this work you need a ROP architecture that can work with non-uniformly organised memory for render targets.

    To my knowledge all GPUs use a uniform memory layout for their render targets - the compression comes solely from the way tiling of pixel-channel/sample-channel data is performed, meaning that the ROPs access memory in units of an entire tile. A fully compressed AA sample merely means that the ROPs access just one tile of memory. Compression for a render target varies over the lifetime of the frame, since it's a technique aimed at saving bandwidth, not memory.
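
    As a minimal sketch of that "bandwidth, not memory" distinction (purely illustrative, with a made-up tile size, not any shipping GPU's actual layout): space for all four sample planes of a 4xAA tile is always allocated at fixed addresses, and the per-tile compression state only decides how many of those planes the ROPs actually transfer.

    Code:
    # Hypothetical numbers for illustration only.
    TILE_BYTES = 256   # assumed size of one tile's worth of one sample plane
    SAMPLES = 4        # 4x MSAA

    def planes_to_fetch(tile_index, fully_compressed):
        """Return (address, length) pairs a ROP would read for one screen tile."""
        base = tile_index * SAMPLES * TILE_BYTES   # uniform layout: all 4 planes
                                                   # always have space reserved
        planes = 1 if fully_compressed else SAMPLES
        return [(base + p * TILE_BYTES, TILE_BYTES) for p in range(planes)]

    print(planes_to_fetch(10, fully_compressed=True))    # one burst
    print(planes_to_fetch(10, fully_compressed=False))   # four bursts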

    As it happens, the recent ATI patent for AA sample compression vaguely hints at a non-uniform layout of pixel/sample data in memory - but I think that's just patentese covering all bases or me being creatively interpretive.

    Method and apparatus for anti-aliasing using floating point subpixel color values and compression of same

    Jawed
     
  2. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I don't think that's necessary. For each block in EDRAM you have three in RAM, and you only access the latter if the tile is uncompressed.
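
    A rough sketch of what I mean, with made-up burst sizes (illustrative only, not a claim about any real part): one of the four blocks of a 4xAA tile lives in EDRAM, the other three in external RAM, and the external ones are only touched when the tile is uncompressed.

    Code:
    BURST_BYTES = 256   # assumed size of one block/burst, for illustration

    def bytes_read(tile_compressed):
        """Split one tile's read traffic between EDRAM and external RAM."""
        edram = BURST_BYTES                                   # always serviced on-die
        dram = 0 if tile_compressed else 3 * BURST_BYTES      # only hit when uncompressed
        return edram, dram

    print(bytes_read(True))    # (256, 0): compressed tile, no off-chip traffic
    print(bytes_read(False))   # (256, 768): uncompressed tile, 3 of 4 bursts off-chip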

    The problem with this idea is that it improves the best case while keeping the worst case the same. When a tile is compressed there isn't much benefit from storing it in EDRAM because it's low BW. It's the uncompressed tiles that chew up BW.

    Arun, the whole point of EDRAM is to avoid compression logic and the worst case perf associated with compression.
     
  3. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,442
    Likes Received:
    181
    Location:
    Chania
    Don't misunderstand me, but this debate is about a hypothetical scalable PC architecture (low to high end). We've chewed over that topic often enough, even in private conversations, and when you say "next round" I honestly hope you don't mean the OpenGL-ES2.x generation, since there NV especially is more than just late.

    Why do I have the feeling that it would make more sense for low-end and possibly even lower-mainstream parts? If so, then I don't see any IHV easily bothering unless they can effectively scale an implementation from top to bottom.

    As Mintmaster points out, I too figure that compression logic would add to the final budget.

    IMHLO (highlight the L for layman if you please) the implementation of eDRAM in a desktop design should have clear advantages and only minuscule, manageable disadvantages. For the time being I haven't seen anything to suggest that a healthy amount of eDRAM, enough to support today's and tomorrow's display resolutions without any awkward workarounds (be it added compression or macro-tiling), wouldn't add to the final cost of a GPU.
     
  4. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,442
    Likes Received:
    181
    Location:
    Chania
    Not directly correlated at all. If the target is roughly today's available resolutions in the desktop space, then 64 ROPs sounds like huge overkill to me with merely 20MB of eDRAM (unless, of course, the hypothetical eDRAM is meant for something totally different, since they also suggest 1 or 2GB framebuffers).

    Have a look at the hypothetical specs:

    ultra low end = no eDRAM, 8 ROPs
    low end = 10MB eDRAM, 16 ROPs
    ....
    ultra high end = 20MB eDRAM, 64 ROPs
     
  5. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    I can see why you'd think that (I used to also), but no... :) Mintmaster's comment explains why this is the case, so I won't bother unless you want me to elaborate on potential implementation details.

    You are working on two assumptions which are, as far as I can tell, far from perfectly accurate:
    1) eDRAM used this way wouldn't improve the worst-case: No, it would improve it, just by a lesser percentage. You could read three memory bursts instead of four, since one of them (even when it's for uncompressed data!) is in eDRAM. This saves 25% bandwidth in the *worst-case*.
    2) The vast majority of bandwidth comes from non-compressed tiles. This is probably a gross simplification, see below.

    Any smart modern architecture wouldn't just have 'compressed' and 'uncompressed' tiles. You'll have different levels of compression most likely, ideally reusing the same techniques (thus sharing silicon) but less aggressively.

    Nobody outside NVIDIA and ATI has any idea how fine or coarse these compression levels really are, but at the strict minimum I would expect you to have, say, 4:1, 2:1 and 1:1 for 4x MSAA's color buffer. I would also be surprised if there was no '3.5:1' mode (or perhaps that really is what 4:1 covers!) to handle the common case of 'nearly-perfect-but-really-not' compressibility.

    So, what I suspect is that a majority of the bandwidth is taken by mildly compressed tiles, not fully uncompressed ones, which are more the exception than the rule and which are probably limited by, say, triangle setup anyway in current architectures.

    Furthermore, there is something else that might not be completely obvious. Assuming there is only exactly enough eDRAM to fit everything (i.e. save 100% framebuffer bandwidth) under maximum compression for all tiles, then the amount of bandwidth you save is always exactly this, where both the final result and the average compression are between 0 and 1:
    Code:
    Saved Bandwidth = eDRAM Amount / (Framebuffer Size * Average Compression)
    It can be shown that 50% of the framebuffer compressing by 50% (and the other half being uncompressed) results in equal savings to any other way of achieving 25% overall framebuffer compression under the above rules. Thus, every tile being 25% compressed, or 75% of tiles being 33.3% compressed, results in the same bandwidth savings for a given amount of eDRAM.

    Of course, that breaks down when you have more eDRAM than your framebuffer size multiplied by your best-case compression rate (unless you want to go non-uniform; ugh!) but the final results remain very impressive IMO.
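
    A small sketch of that arithmetic with made-up numbers (20MB of eDRAM against a hypothetical 80MB framebuffer), treating "average compression" as compressed size over uncompressed size and clamping once the eDRAM covers all of the compressed traffic:

    Code:
    def saved_bandwidth_fraction(edram_bytes, framebuffer_bytes, avg_compression):
        # avg_compression: compressed size / uncompressed size (1.0 = uncompressed)
        compressed_traffic = framebuffer_bytes * avg_compression
        return min(1.0, edram_bytes / compressed_traffic)

    mix_a = 1.00 * 0.75                  # every tile 25% compressed
    mix_b = 0.75 * (2 / 3) + 0.25 * 1.0  # 75% of tiles 33.3% compressed, rest raw
    print(saved_bandwidth_fraction(20e6, 80e6, mix_a))   # ~0.333
    print(saved_bandwidth_fraction(20e6, 80e6, mix_b))   # ~0.333, i.e. the same saving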

    So, the real question becomes: what do you think the average compression rates are for a 1920x1200 4x MSAA HDR framebuffer in, say, Oblivion? I'd expect them to be pretty damn good, otherwise the final performance doesn't make much sense in my mind. And as a logical consequence of this and the above, I would expect eDRAM bandwidth savings under my proposed approach to be pretty damn good too.
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    And you've still got the problem of supporting 8 MRTs, so now the EDRAM is a drop in the ocean.

    Jawed
     
  7. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    1) No need to split hairs :wink: I did realize this when posting above, but the savings aren't enough, IMO. I don't think 25% of just framebuffer BW is enough to justify having so much EDRAM. I think you'd need 30MB of EDRAM to occupy 10% of the die before this decision becomes a win, even considering your second point.
    2) This is not quite the assumption I was making. I was really suggesting that during the spans of time when you are BW-limited, you often have a large percentage of pixels from non-compressed tiles.

    In the end I guess I'm just saying EDRAM makes a lot more sense when all framebuffer traffic goes there. If not, the benefits are greatly reduced, and the cost/benefit analysis makes the decision a lot closer to a wash than blindingly obvious. When radical architecture changes entail such iffy benefits, companies don't generally go for them.
    What exactly doesn't make sense in your mind? I can't find many figures of 0xAA vs. 4xAA with HDR enabled on the web, but even if I could, there's no way that you can deduce compression rates from them.
     
  8. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Of course, the number of games using 8 MRTs is a drop in the ocean also.

    Arun is basically saying use the EDRAM for as many pixels as you can. If a tile is uncompressed, store the rest in memory. If you have MRTs or a big FB, use the EDRAM for a fraction of the screen.

    The counterargument, of course, is that this system kills or dampens several advantages of EDRAM. If educated people aren't sold on Xenos' implementation, then they'd think this system is an outright waste of time.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    I'm thinking of D3D10 deferred renderers...

    I think the GPU should adapt to the most efficient use of a gob of memory, not have foisted upon it a fixed function that loses any advantage it might have had as soon as the simple use-case is exceeded. It doesn't appear to degrade gracefully.

    Also, I think that with ROPs destined to disappear (becoming shader programs), and with the apparently urgent need for GPUs to get better at scatter, I'd argue that a gob of on-die memory needs to be more adaptable to multiple types of concurrent workloads - so a fixed-function EDRAM render target buffer is short-sighted.

    Jawed
     
  10. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    If a DX10-only deferred renderer needs 8 MRTs, then the entire programming team should probably be fired on the spot! ;)

    It's easy to create a D3D10 deferred renderer for terrain/sky/water that is, in fact, Z-only! Well, maybe stencil too, but you get the point. If you're smart, it should also be possible not to use *that* much memory for doing deferred rendering on objects too. Although it's still much more than an immediate renderer, obviously.

    The easy solution to your worry would be to have it directly reservable and accessible via shaders for the parts you have reserved. I fail to see how that would work in DirectX, but it should be possible via OpenGL extensions, via CUDA, or in a console.
     
  11. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    So, are you saying it makes sense on 40nm? ;) eDRAM should be roughly 1MByte/mm2 on that node (and yes it does exist), and a 300mm2 chip is hardly out of the question.

    Also, I would be very surprised if 15MB of eDRAM would only result in 25% bandwidth savings. I would bet on 50%+ personally - of course, that isn't taking textures into consideration for example.

    Theoretically, you could determine how much extra bandwidth 4x MSAA takes over 0x MSAA by reducing the memory clock to an insanely low value and adding a LOD bias that guarantees only the 1x1 mipmap is used. Performance should then roughly scale linearly with framebuffer bandwidth requirements, unless fetching vertices is more expensive than you'd expect it to be (unlikely).

    Of course, I'm not aware of anyone having ever done that, but perhaps I should bother testing it a bit soon to further consolidate my theory! :) (and it might make for some nice data in an article on (framebuffer) compression too).
     
  12. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I think that's when you'll just begin to pass break-even, but it still doesn't make sense, especially when considering other SKUs. A quarter perf. chip with a quarter the EDRAM would be fine if the gamer played the same settings at a quarter the resolution, but not lower settings at the same or half the resolution. Then you'd have crappy RV630-like scaling.

    I will admit, though, that in the long term we could very well see a scheme similar to what you're talking about. It just doesn't seem sensible for the R700 timeline.
    If we're talking about 1920x1200 and FP16, that's 27MB. That's the minimum res you'd want to be competitive in at the high end. I didn't say it would average 25% savings, just that 25% of FB BW in a pretty common worst case isn't much to write home about. Remember also that you have to write out everything in the EDRAM to main memory unless you can texture from it, and then you'd need even more space for the new backbuffer you're writing to.
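
    For reference, a back-of-the-envelope for that figure, assuming FP16 colour (8 bytes/pixel) plus a 32-bit depth/stencil buffer and no MSAA:

    Code:
    width, height = 1920, 1200
    colour = width * height * 8      # FP16 RGBA
    depth = width * height * 4       # e.g. D24S8; the exact format is an assumption
    print((colour + depth) / 1e6)    # ~27.6 MB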

    Theoretically we can do a lot of things, but you said the performance in Oblivion wouldn't make sense if the compression rate isn't good. What is the basis of this statement?

    Even with the test you described, there's no way of determining the compression ratio. Perf would only scale linearly if you're always BW limited, which you're not, even with a low clock (think about setup-limited clumps). AA also has a bit more work from fewer empty triangles and more quads touched per triangle, along with loopback through ROPs when applicable. Maybe you could look at the change in slope of scaling with mem speed, but it's still sketchy to draw any conclusions about compression.
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    I can't think of anything else that's going to use lots of MRTs, and the common thread in DR discussions is "fitting the G-buffer into only 4 MRTs is a pain in the arse".

    So, I've got no idea how many MRTs will be used in D3D10 versions of DR, but there's a shed load of extra space there going begging...

    D3D is moving towards fully-programmable render target operations. And full scatter support in the shaders.

    I dunno how GPU designers will attack making this stuff work. Perhaps with just a wodge of cache. Maybe they'll create some kind of DMA-list-processing architecture (like Cell SPE's). Maybe it's already in there waiting to be set free...

    Whatever, an EDRAM fixed-function partial colour/Z buffer seems rather short-sighted to me.

    And more general-purpose memory handling is a high priority for GPGPU stuff (though I think R600 is already quite advanced in this regard)...

    Jawed
     
  14. santyhammer

    Newcomer

    Joined:
    Apr 22, 2006
    Messages:
    85
    Likes Received:
    2
    Location:
    Behind you
    It seems AMD is going to introduce 45nm multicore GPUs with the R700:

    http://www.fudzilla.com/index.php?option=com_content&task=view&id=4327&Itemid=1
    http://www.fudzilla.com/index.php?option=com_content&task=view&id=4346&Itemid=1
    http://www.fudzilla.com/index.php?option=com_content&task=view&id=4348&Itemid=1
    (sorry, fudzilla source)

    It's not clear to me; perhaps they are moving to a tile rendering system, so they need to duplicate the 2D/triangle setup/clipping/etc. transistors? If not... why not just add more shading units to the same silicon die?
     
  15. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    Would R600 work in any system if it had 640 shaders?
     
  16. santyhammer

    Newcomer

    Joined:
    Apr 22, 2006
    Messages:
    85
    Likes Received:
    2
    Location:
    Behind you
    Well, it depends on the silicon die size... If it's too big, reducing the integration level could help... but by making it multicore you are really increasing the die size (well, or package size if you do 2+2 cores), heat dissipation, power consumption and production costs, so...
     
  17. Sound_Card

    Regular

    Joined:
    Nov 24, 2006
    Messages:
    936
    Likes Received:
    4
    Location:
    San Antonio, TX
    If fudo is to be believed....

    R700

    • 45nm
    • 4 chips sharing the same package
    • Each chip is 72mm2 and packs 300 million transistors
     
  18. Acert93

    Acert93 Artist formerly known as Acert93
    Legend

    Joined:
    Dec 9, 2004
    Messages:
    7,782
    Likes Received:
    162
    Location:
    Seattle
    I like the concept, but the big concerns are performance scalability (2 chips being 2x as fast as 1 chip; 4 chips, 4x as fast) and software compatibility. If these issues can be addressed, the potential is pretty high for AMD/ATI.

    Yields on four small cores should be higher than those of 1 large core. R&D would be more focused, as your high-end and low-end GPUs would share the exact same DNA -- but this could be a problem for some redundant features like video acceleration (so my guess is you would see 2 core types, a "master" core and "bare" cores). The potential for more hand-tuned design at 300M transistors (x4) is greater than for a 1.2B transistor core. The featureset across the board would be level (no more 9200 GPUs). The economy of scale of producing basically 1 core should keep prices low on both the low end and high end, potentially allowing ATI/AMD to pack more product into a board at the same price.

    Over time, the migration of such cores right onto CPU packages, and eventually onto the CPU die itself for initially low-end systems (especially the quickly growing notebook/laptop market), would mean AMD/ATI keep more of the sale, could get more OEM contracts, and would even have a marketing advantage, in that an R700 GPU would probably perform much better, relative to the market, than products like the X1150, GF6150M, etc. do.

    Yet if those first two issues, performance scaling and across-the-board compatibility with multicore, aren't solved, then this could kill AMD/ATI. So I guess we wait and see.
     
  19. chavvdarrr

    Veteran

    Joined:
    Feb 25, 2003
    Messages:
    1,165
    Likes Received:
    34
    Location:
    Sofia, BG
    The question is: will AMD/ATi be able to use these 4 cores as "ring-bus" stops?
    If not, using dumb AFR will mean NV will kill 'em.
     
  20. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    I think that is basically the design decision behind R600.
    Although the scale and utilization of the controller in R600 is out of proportion, that same controller would fit snugly in a situation where it would be controlling multiple bus stops, both internal and external.

    This, for me, seems to be the only reason ATI pushed through with its design of R600/R700: not because of the problems it faced this year, but because of the future potential of the ring-bus controller.

    The fact that ATI went this route (including shader-based AA) suggests to me that they want as few dependencies as possible on on-die logic that would make one single "processor" bulky... or dependent on the other processors.

    I have no idea how well it would scale to the low end, but seeing 670's recent 8xAA numbers, I'd suggest its potential under high res/high AA is enormous.
     
    #140 neliz, Nov 23, 2007
    Last edited by a moderator: Nov 23, 2007