Recent Radeon X1K Memory Controller Improvements in OpenGL with AA

Discussion in 'Architecture and Products' started by Geo, Oct 13, 2005.

  1. _xxx_

    Banned

    Joined:
    Aug 3, 2004
    Messages:
    5,008
    Likes Received:
    86
    Location:
    Stuttgart, Germany
    Though it has only so much headroom, so I wouldn't expect wonders. I assume it'll be some 10% more at most in the end.
     
  2. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    Anything more is obviously for the better. I'd urge anybody who's considering testing any of those improvements to compare the standard available timedemos with custom ones, to exclude any weird possibilities.
     
  3. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    Our "Turkey Baster" timedemo is very custom, and because it was designed to show up 512MB board performance, it specifically runs through nearly an entire level to pick up areas where textures are swapped.
     
  4. _xxx_

    Banned

    Joined:
    Aug 3, 2004
    Messages:
    5,008
    Likes Received:
    86
    Location:
    Stuttgart, Germany
    Sure. Just wanted to say, people should not expect another 30% jump or two :)
     
  5. Bouncing Zabaglione Bros.

    Legend

    Joined:
    Jun 24, 2003
    Messages:
    6,363
    Likes Received:
    83
    It's already shown more than that on some games at the first attempt from ATI. Given the memory bottleneck we've been seeing for a while, it seems like it could be quite an advantage if the memory controller can monitor itself and change operation on the fly in order to maximise bandwidth use as Sireric discussed. A memory controller that can reconfigure itself depending what kind of game or what kind of scene it's rendering could be a significant improvement if it means your chip is more highly utilised.

    I wouldn't turn down an extra 10-20 percent performance just from a smart memory controller.
     
    #165 Bouncing Zabaglione Bros., Oct 15, 2005
    Last edited by a moderator: Oct 15, 2005
  6. _xxx_

    Banned

    Joined:
    Aug 3, 2004
    Messages:
    5,008
    Likes Received:
    86
    Location:
    Stuttgart, Germany
    BZB, I doubt it's a first attempt. But surely not the last ;)
     
  7. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    I still want to see tests from as many different levels as possible, because I've fallen on my nose more than once lately by not checking more areas in a game. It was actually TAA related, but it doesn't do me any good if performance is fine in the majority of maps yet proves lackluster in a minority of instances. NV claimed a 10-15% performance drop for TAA; if the application/map is highly CPU bound, yes of course, otherwise....
     
  8. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    TAA is naturally going to be very dependent on the number of alphas used - scenes that are primarily built of opaque textures are going to behave just like MSAA, while in scenes that have lots of alphas (such as a forest area in Far Cry, for instance) you're basically performing the same as SSAA. A feature like this is going to be very variable in performance according to the composition of the scene; MSAA much less so.
     
  9. BobbleHead

    Newcomer

    Joined:
    Sep 24, 2002
    Messages:
    58
    Likes Received:
    2
    Halving the minimum access is one possibility. But if you've built up the rest of the system to handle higher latencies, why not keep the same sized accesses? Those would now keep a smaller channel occupied twice as long. You don't mind a little longer latency (because of upstream design changes), you keep the data bus moving data a higher percentage of the time (reducing command overhead), and you can be doing something for twice as many blocks on the chip at the same time.

    DRAM cores increase in speed a lot more slowly than the buses that feed them. If you look at various spec sheets you'll see that absolute access times have not really changed (or have even gotten worse), while clock speeds (and thus bandwidth) have increased quite a bit. A 250 MHz DDR part with a CL of 3 starts giving you data back in 12 ns. A top-end 800 MHz GDDR3 part requires a CL of 11, giving you data back in 13.75 ns. If you make just one small access, you're wasting a lot of cycles on your data bus. Instead you've got to work harder to find larger blocks of consecutive requests to keep those high-bandwidth data pins humming.
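    A quick back-of-the-envelope check of those two latency figures, assuming the clock periods implied by the post (250 MHz → 4 ns per cycle, 800 MHz → 1.25 ns per cycle); the function name is just for illustration:

```python
# Absolute CAS latency in nanoseconds: CL cycles times the command-clock period.
def cas_latency_ns(clock_mhz: float, cl_cycles: int) -> float:
    period_ns = 1000.0 / clock_mhz  # one command-clock period in ns
    return cl_cycles * period_ns

# 250 MHz DDR part with CL=3: first data back after 12 ns.
print(cas_latency_ns(250, 3))   # 12.0
# 800 MHz GDDR3 part with CL=11: first data back after 13.75 ns.
print(cas_latency_ns(800, 11))  # 13.75
```

    So the absolute access time barely moved even though the clock more than tripled, which is the post's point.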
     
  10. CK

    CK
    Newcomer

    Joined:
    Jun 27, 2004
    Messages:
    15
    Likes Received:
    0
    I have been benchmarking the X1800XL and X1600XT in Doom III (v1.3) and Riddick (v1.1) with and without the OGL fix. The X1600XT does not really seem to react to the fix. I don't know if there are some registry bugs when changing from X1800XL to X1600XT without "re-ghosting" the rig or something like that, but the X1600 card did not show any difference. Can somebody confirm that?

    The X1800XL reacted as reported: 2xAA shows no difference or a slight decrease in performance (1-2 FPS), and the same goes for 6xAA in 1024x768, 1280x1024 and 1600x1200. With 4xAA I experienced boosts ranging from 9% to 13% in Riddick and 9% to 15% in Doom III.
     
  11. sireric

    Regular

    Joined:
    Jul 26, 2002
    Messages:
    348
    Likes Received:
    22
    Location:
    Santa Clara, CA
    Quick note:
    The update is mainly for the X1800* in OGL at 4xAA. We will follow up with other AA modes and with X1600/X1300 in the future, but we can't do everything at once. All of them require different tuning to optimize performance.
     
    Geo likes this.
  12. tEd

    tEd Casual Member
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,105
    Likes Received:
    70
    Location:
    switzerland
    Will these optimisations be controlled via Cat. AI on a per-game basis, or will they also work if Cat. AI is disabled?
     
  13. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,455
    Likes Received:
    471
    Yes.
     
  14. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Yup, and that's one reason it's difficult to compare per-transistor efficiency between architectures. G70 wasn't designed for sky-high core/memory speeds, so it won't need as many FIFOs as R520 will. I remember doing some work on R300 at ATI, and lowering both the core and memory clocks a bit resulted in significantly higher efficiency.
     
  15. Headstone

    Newcomer

    Joined:
    Sep 29, 2003
    Messages:
    123
    Likes Received:
    0
    It is very strange that the 1600 sees no benefit where the 1800 and 1300 see large gains. Does anyone have a good reason for this?
     
  16. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    The reason is that DDR is already designed to work at full bandwidth with a given burst rate (if you ignore the overhead of loading a new row, changing pages, etc.), and GDDR3 supports only one burst mode: 4. Burst 4 means that four 32-bit elements per memory chip are read/written every 2 cycles. GDDR2 supported burst modes 4 and 8, but I doubt any GPU used 8. For 64-bit independent buses (2 memory chips per bus) and burst 4, the minimum access is 32 bytes; for burst 8 the minimum access is 64 bytes (which is what I'm currently using in the simulator, just to get even worse bandwidth usage :wink: ).

    R520 implements 32-bit independent buses (1 memory chip per bus), so the minimum access is 16 bytes, and given the GDDR3 burst restriction it can't go any lower. Of course the GPU stages can request or send more than 16 bytes to the memory controllers, but those accesses are split into multiple 16-byte accesses to the memory chips.
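    The minimum-access arithmetic above is just bus width (in bytes) times burst length; a sketch, with the function name my own:

```python
# Minimum DRAM access = (channel width in bytes) * (burst length).
def min_access_bytes(bus_width_bits: int, burst_length: int) -> int:
    return (bus_width_bits // 8) * burst_length

print(min_access_bytes(64, 4))  # 32 - 64-bit channel, GDDR3 burst of 4
print(min_access_bytes(64, 8))  # 64 - 64-bit channel, burst of 8
print(min_access_bytes(32, 4))  # 16 - R520-style 32-bit channel, burst of 4
```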
     
    #176 RoOoBo, Oct 17, 2005
    Last edited by a moderator: Oct 17, 2005
  17. BobbleHead

    Newcomer

    Joined:
    Sep 24, 2002
    Messages:
    58
    Likes Received:
    2
    Initial GDDR3 parts only required support for a burst of 4 (8 was optional); however, the newer revisions all generally support both 4 and 8. That really is not important though, since a burst-of-8 read from address 0 looks exactly the same as back-to-back burst-of-4 reads from 0 and 4. As you said, it just appears as multiple accesses. Upstream blocks on the chip do not need to know anything about that. If you design something to always work on blocks of 64 bytes (because you know that is more efficient), the lowest-level hardware that talks to the DRAM chip can decide whether to do that as 2x burst of 8 or 4x burst of 4.

    More troublesome is your first sentence. You cannot ignore the substantial time required to open and close pages. As an example, the 800 MHz 512 Mbit Samsung part has a row cycle time of 35 clocks. More importantly, it has a four-activate time of 40 clocks. If you were to read only a single burst of 4 each time you opened a page, you would be wasting 80% of your bandwidth: 2 cycles per read * 4 banks read from / 40 clocks = 0.20. Make that 2 bursts of 4 (or 1 burst of 8), and you are still tossing away 60%. Even doubling up to 2 bursts of 8 only gets you to 80% utilization, and that's just for that short window. Add in loss due to read/write switching, refresh, and not always having something available for a different bank to do, and utilization over a longer time scale drops much lower.

    You need a lot of reads or writes that can be sent out on consecutive cycles to approach any kind of useful utilization. Make much longer bursts to get stretches where you are using 100% to offset the inevitable waste that happens at other times.
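    The utilization numbers in that post can be reproduced with a small sketch, assuming a DDR bus where a burst of N occupies N/2 command clocks and at most 4 activates fit in the 40-clock four-activate window; names are illustrative:

```python
# Peak data-bus utilization when the four-activate window (tFAW) is the limit.
# On a DDR bus a burst of N occupies N/2 command clocks, and at most 4 banks
# can be activated per tFAW window.
def tfaw_limited_utilization(burst_length: int, bursts_per_activate: int,
                             tfaw_clocks: int = 40) -> float:
    data_clocks = (burst_length // 2) * bursts_per_activate * 4  # 4 activates
    return data_clocks / tfaw_clocks

print(tfaw_limited_utilization(4, 1))  # 0.2 - one burst of 4 per page: 80% wasted
print(tfaw_limited_utilization(8, 1))  # 0.4 - one burst of 8 (or 2x burst of 4)
print(tfaw_limited_utilization(8, 2))  # 0.8 - two bursts of 8 per page
```

    This matches the 20% / 40% / 80% figures quoted above, before the further losses from read/write turnaround and refresh.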
     
    Geo likes this.
  18. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    WRT this patch, I'd suggest looking at the pre-patched 4x vs 6x scores. 6x is actually faster than 4x on the XT, suggesting that something may not quite have been right with the 4x memory mappings in the first place.
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    What 6x scores?

    Jawed
     
  20. AlphaWolf

    AlphaWolf Specious Misanthrope
    Legend

    Joined:
    May 28, 2003
    Messages:
    9,470
    Likes Received:
    1,686
    Location:
    Treading Water
    I think it's most likely because the X1300 is more like the X1800 than it is like the X1600, and the improvements were done mostly with the X1800 in mind.
     