AMD: Southern Islands (7*** series) Speculation/Rumour Thread

Discussion in 'Architecture and Products' started by UniversalTruth, Dec 17, 2010.

  1. Malo

    Malo YakTribe Gaming
    Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    4,344
    Location:
    Pennsylvania
    Are you using Chrome?
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,628
    Location:
    London
    No, FF and IE. Anyway I can only guess that my ISP was being creative with something or other, since the problem has disappeared. I had wondered whether Win8CP was the cause, but I ruled that out.
     
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,721
    Location:
    Well within 3d
    The Radeon 7900 review thread discussed a possible weakness in AMD's design when it came to MRTs and MSAA. Building the G-buffer in various deferred schemes is an area where Nvidia handled things significantly better.
     
  4. Love_In_Rio

    Veteran

    Joined:
    Apr 21, 2004
    Messages:
    1,345
    So, apart from the obvious economic reasons that probably made them tease it on a Kepler card, it is not crazy to think that a heavily deferred engine like the future Unreal Engine 4 could perform much better on an Nvidia architecture?
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,628
    Location:
    London
    There's no "why" in that discussion, as far as I can tell.
     
  6. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,506
    Location:
    Hamburg, Germany
    To add to this point, if I'm not mistaken, AMD GPUs read a render target (configured as input to a pixel shader) through the TMU data path. Therefore it could actually be a TMU weakness, not one of the ROPs. That would also explain why the performance relation between Pitcairn and Tahiti stays virtually the same with MSAA.
     
  7. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    2,902
    I would think everybody reading a render target in a pixel shader would do so through the TMU data path? A render target should look pretty much like any ordinary texture when accessed in the pixel shader. Maybe it's more likely to have non-full-speed throughput due to an "odd" format, but otherwise what's the difference?
     
  8. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    12,607
    That's the feeling I get when hearing some people talk about the results in BF3 comparing Tahiti to Fermi. Without MSAA, Tahiti's performance is about where you'd expect it, but once you enable MSAA, performance tanks compared to Fermi cards.

    Regards,
    SB
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,628
    Location:
    London
    Going back into the mists of time, CUDA was often (though not always) higher performance reading linearly organised buffers through non-TMU paths rather than through the TMUs.

    What we could simply be seeing in this scenario is that NVidia's "CUDA-specific" linear access hardware is better than AMD's. It wasn't that long ago that doing linear buffers in OpenCL on AMD was a disaster zone (because it was based upon the vertex fetch hardware), and AMD might still be climbing that curve.

    AMD's initial support for UAVs was something of a kludge as far as I can tell: the multiple UAVs required by D3D were emulated by configuring a single physical UAV in hardware and splitting it up. Additionally, AMD hardware has severe constraints on the size of a UAV. A common complaint amongst OpenCL programmers is (was?) that it is impossible to allocate a single monster UAV (that is, a linear buffer) using the majority of graphics memory (e.g. 900MB out of 1GB). There's some kind of hardware/driver restriction that only allows for a 50% allocation. Allocating texture memory in OpenCL is less constrained.

    Textures are normally non-linearly mapped across memory channels.

    On Xenos there's a non-linear organisation of MSAA'd render targets (and regular render targets? I can't remember). If you rummage you'll find swizzle "hacks" for deferred MSAA rendering on XB360. Something like that; I forget the details.

    It seems to me that anyone with access to both AMD and NVidia cards should be able to enumerate the MSAA levels and MRT configurations with some clever code, to observe the performance profiles of the raw access techniques, without being obscured by all the other stuff that happens in a game frame.
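    A rough sketch of what that sweep could look like, as pure harness logic in Python (measure_fill and every name here are hypothetical stand-ins; on real hardware you'd wrap GPU timer queries around an actual MRT fill pass):

```python
# Sketch of the enumeration suggested above: sweep MSAA levels x MRT counts
# and normalise measured fill times so the raw-access profiles of different
# cards can be compared. measure_fill() is a stand-in for a real GPU timing
# hook; here it's left abstract.

from itertools import product

MSAA_LEVELS = [1, 2, 4, 8]
MRT_COUNTS = [1, 2, 4, 8]

def profile(measure_fill):
    """Return {(msaa, mrt): slowdown relative to the 1x / single-target case}."""
    times = {(msaa, mrt): measure_fill(msaa, mrt)
             for msaa, mrt in product(MSAA_LEVELS, MRT_COUNTS)}
    base = times[(1, 1)]
    return {cfg: t / base for cfg, t in times.items()}

# Example with a fake timing function that scales linearly with samples
# written -- a real GPU would deviate from this, and the deviation is
# exactly the signal you're looking for.
fake = lambda msaa, mrt: msaa * mrt * 0.1
print(profile(fake)[(4, 2)])  # 8x the base cost under the linear model
```

    The interesting output isn't the absolute numbers but where each card's profile deviates from the linear model.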
     
  10. EduardoS

    Newcomer

    Joined:
    Nov 8, 2008
    Messages:
    131
    What if I choose a random factor for each chip: a multiple of 3 for Cape Verde, a multiple of 6 for Pitcairn and a multiple of 8 for Tahiti?
    [attached chart]

    I know this isn't very scientific but... There should be something more than just ROPs, TMUs, SPs and memory bandwidth there...
     
  11. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    2,902
    I'd think that GCN's better cache architecture should potentially fix such issues? Though I'm largely missing how any necessary synchronization etc. really works for UAVs...
     
  12. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,352
    We've had caching for buffer reads for EG/NI chips for quite a while now. It doesn't help when a buffer is read/write, but there are plenty of buffers that are read only so it's still quite beneficial. SI has caching all the time of course.

    There are some reasons for this. First, the GPU's memory pool is split into two regions: CPU visible and invisible. The CPU visible region we expose is 256MB, normally. This means that you have at most 768MB of contiguous memory on a 1GB card. The way the OpenCL conformance tests are written, you have to be able to allocate a buffer of the maximal size you report, which is sort of impossible to guarantee unless you're conservative. I believe Nvidia only exposes 128MB of CPU visible memory, so they have a larger contiguous pool to work with. They also may handle memory allocations differently, but we use VidMM and expose two memory pools. Note that I believe we've improved this (memory allocation) behavior recently, but you're still going to have some limits caused by having two memory pools.

    My understanding is that if everyone were using 64-bit OSes (and apps) we could expose all the video memory to the CPU and not worry about having separate memory pools, not to mention facilitating faster data uploads in some cases.
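    The two-pool arithmetic in the post can be sketched like this (a toy calculation; the function name is made up):

```python
# Carving a CPU-visible region out of video memory caps the largest
# contiguous block the remaining (invisible) pool can offer.

def max_contiguous(total_mb, cpu_visible_mb):
    """Largest contiguous allocation left after the CPU-visible carve-out."""
    return total_mb - cpu_visible_mb

# 1GB card with a 256MB CPU-visible region (AMD, per the post) -> 768MB.
print(max_contiguous(1024, 256))  # 768
# The same card with a 128MB region (Nvidia, per the post) -> 896MB.
print(max_contiguous(1024, 128))  # 896
```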
     
  13. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    At least on the HD5xxx family, you cannot allocate a single OpenCL buffer larger than 128MB (though you can allocate multiple 128MB buffers). I haven't recently verified whether this limit is still present, but I assume so. It was one of the most annoying limitations of AMD's older hardware and a severe one for most OpenCL applications.

    In my opinion, this limit was more annoying than not having access to the full GPU memory pool.

    Another note: in the past, storing linear data in an OpenCL image buffer was an effective way to improve performance over an OpenCL linear buffer. This optimization was quite annoying to code, too.
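    A common workaround for a per-buffer cap like that is to chunk one logical allocation across several buffers; a minimal sketch in Python (helper names are made up, and the cap value is the 128MB figure from the post):

```python
# Working around a per-buffer cap (128MB on HD5xxx, per the post) by
# splitting one logical allocation into several buffers and translating
# a flat byte offset into (buffer index, offset within buffer).

CAP_BYTES = 128 * 1024 * 1024

def split_allocation(total_bytes, cap=CAP_BYTES):
    """Sizes of the sub-buffers needed to cover total_bytes."""
    sizes = []
    while total_bytes > 0:
        sizes.append(min(cap, total_bytes))
        total_bytes -= sizes[-1]
    return sizes

def locate(flat_offset, cap=CAP_BYTES):
    """Map a flat byte offset to (buffer index, offset within buffer)."""
    return divmod(flat_offset, cap)

# A 300MB logical buffer needs three pieces: 128 + 128 + 44 MB.
print([s // (1024 * 1024) for s in split_allocation(300 * 1024 * 1024)])
```

    The annoying part in real OpenCL code is that every kernel touching the data has to carry the extra buffer arguments and do this index translation itself.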
     
  14. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,506
    Location:
    Hamburg, Germany
    To get back to the question of MSAA performance:
    Exactly that was my idea. ;)
    Just imagine nV can do it at full speed and AMD only at half speed (or some other difference). Factor in that AMD's TMUs are already slower for the FP16 data format quite often used (afaik) for such render targets, and you may arrive at a significant difference.
     
  15. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,352
    If you're using Linux, then the issue is lack of VM support. In Windows we support VM for all EG/NI/SI chips and don't have these issues. Currently, only SI has VM support in Linux.
    This is probably because read-only images are always cached. Buffers used read-only would be cached as well, as long as you don't alias pointers. E.g. with "kernel void foo(global float* in, global float* out)", if the same memory object were bound to both "in" and "out", then "in" would not be cached.

    Sorry for the OT, but I thought it was worth explaining.
     
  16. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    6,492
  17. Mianca

    Regular

    Joined:
    Aug 7, 2010
    Messages:
    330
    Well, I chuckled a bit at the notion of 6GB of RAM on one card. :grin:

    Should come with some kind of "4k Eyefinity ready" sticker ... :wink:
     
  18. Dooby

    Regular

    Joined:
    Jul 21, 2003
    Messages:
    478
    Well, that's only 3GB per chip, much like all their other CrossFire-on-a-stick cards.

    Given that there's already a 6GB card for ONE GPU coming out, 6GB for two GPUs is hardly amazing.
     
  19. Psycho

    Regular Subscriber

    Joined:
    Jun 7, 2008
    Messages:
    707
    Location:
    Copenhagen
    So 2 full Tahitis @ 850 MHz with only 300W TDP?
    (and probably a bios switch making it a full 7970x2 around 375W)
    That sounds pretty efficient :)
     
