AMD: Volcanic Islands R1100/1200 (8***/9*** series) Speculation/ Rumour Thread

Discussion in 'Architecture and Products' started by Nemo, May 7, 2013.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The compression is for frame buffer data, not the whole memory budget of the graphics context.
    The memory budget for that is comparatively small, but it is heavily read and written by the ROPs--which are currently the dominant bandwidth consumers of the GPU.

    While AMD has said little on the implementation of this compression, the argument could be made that even if the compression data + color buffer in memory were bigger than the color buffer alone, it might still be worthwhile if the compression data is accessed less frequently than the color data it compresses.
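The tradeoff in that last paragraph can be put in rough numbers. A back-of-envelope sketch (all figures are illustrative assumptions, not anything AMD has disclosed):

```python
# Hypothetical model: even if metadata enlarges the total footprint, bus
# traffic can still drop as long as the metadata is touched far less often
# than the color data it describes. All numbers below are made up.

def effective_traffic(tile_bytes, accesses, compress_ratio, meta_bytes, meta_accesses):
    """Bytes moved over the bus for one tile's lifetime, plain vs compressed."""
    uncompressed = tile_bytes * accesses
    compressed = tile_bytes * compress_ratio * accesses + meta_bytes * meta_accesses
    return uncompressed, compressed

# A 256-byte tile read/written 10 times, compressing 2:1, with 8 bytes of
# metadata fetched once and cached thereafter:
plain, packed = effective_traffic(256, 10, 0.5, 8, 1)
print(plain, packed)  # 2560 vs 1288: fewer bytes moved despite the extra metadata
```

The footprint in memory (256 + 8 bytes) is larger than the plain buffer, yet the bus sees roughly half the traffic, which is the point being argued.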
     
  2. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    If the data has to be read and written at random, then it can only be BW compression, not data compression, I would think?

    What I mean is: if you're compressing sectors of data, the start address of each sector will still be the same; there'd just be gaps between sectors, plus some attribute space indicating whether each sector is compressed or not.

    It would be cool if someone could write a micro-benchmark to figure this out.
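Short of a real micro-benchmark, the layout described above (fixed start addresses with gaps, plus an attribute flag per sector) can at least be modeled in a few lines. A toy sketch, with made-up parameters:

```python
# Toy model of the proposed layout, not a known AMD implementation: each
# sector keeps its uncompressed start address, a compressed payload leaves a
# gap before the next sector, and a side-band attribute array records whether
# each sector is compressed.

SECTOR = 32  # bytes per memory atom/sector (assumption)

def store(memory, attrs, index, payload):
    """Write a (possibly compressed) payload at the sector's fixed address."""
    base = index * SECTOR                  # start address never moves
    memory[base:base + len(payload)] = payload
    attrs[index] = len(payload) < SECTOR   # True = compressed, gap follows

memory = bytearray(4 * SECTOR)
attrs = [False] * 4
store(memory, attrs, 2, b"\x01" * 12)      # a 12-byte compressed sector
print(attrs)        # [False, False, True, False]
print(2 * SECTOR)   # sector 2 still starts at byte 64 either way
```

Random access stays trivial because addressing never depends on the compressed sizes; only the bytes actually transferred change.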
     
  3. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,176
    Location:
    La-la land
    No, because you have no way of predicting how much the buffer will compress from one frame to the next; it could be a lot (however much that is), or maybe nothing at all, and anything in between. You must always reserve enough space to store the entire buffer as-is...unless they're doing lossy compression and forcibly squish down the buffer to some arbitrary size, which would be insanely dumb, and generally just infeasible. :)
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Since the GPU's rasterizer, wavefronts, and ROPs work in tiles, import and export from the ROP caches should be more structured than that.
    I'm not clear as to the exact granularity that the ROP caches and the ROPs themselves work at, although the rasterizer and shader wavefronts like the number 64.
    Tiling at this or even coarser levels allows ROPs to burn bandwidth to hide read-modify-write latencies, which makes very high DRAM bus utilization both possible and obligatory for them.

    Breaking the data to be compressed to a ROP-friendly granularity could then allow for such transactions to occur at a cadence roughly similar to what the ROPs would be pipelined to use.
    As such, I'm curious whether a more variable bandwidth load would actually cause a departure from this lockstep relationship and actually hurt bus utilization in some cases in exchange for getting past a lack of bus growth (at least until the next memory standards come into play).

    A scheme could allow for a variable amount of total memory consumption, although with the amount of variability between exports it might not be a good idea to make the value of SizeWorstcase-SizeCurrent visible.
     
  5. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    Let's just take that wavefront of 64 pixels, organized in some kind of rectangular fashion. If you use RGB & Alpha, that's 4 bytes per pixel or 256 bytes. A memory atom is 32 bytes. With delta coding followed by something like Huffman or arithmetic coding, you can probably get those 256 bytes down to 128 bytes or less in many cases? You can probably do this kind of compression on the fly, streaming, without multiple passes over the data. So this thing would simply sit in between the cache and the MC.
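As a rough sanity check on those sizes, here is a software sketch using zlib as a stand-in for a hardware Huffman/arithmetic coder (the actual pipeline is undisclosed, and the gradient tile below is a contrived, favorable case):

```python
import zlib

# Sketch of the scheme above: delta-code a 64-pixel RGBA tile channel-wise,
# then entropy-code the deltas. zlib substitutes for a streaming hardware
# coder purely for illustration.

def delta_encode(data, stride=4):
    """Per-channel byte deltas: smooth gradients become runs of small values."""
    out = bytearray(data[:stride])
    for i in range(stride, len(data)):
        out.append((data[i] - data[i - stride]) & 0xFF)
    return bytes(out)

# A 64-pixel tile with a gentle grayscale gradient: 256 bytes uncompressed.
tile = bytes(b for x in range(64) for b in (x * 2, x * 2, x * 2, 255))
packed = zlib.compress(delta_encode(tile), 9)
print(len(tile), len(packed))  # 256 -> well under 128 bytes for this tile
```

Real render targets are noisier than a clean gradient, so 2:1 "in many cases" rather than always is a plausible expectation.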

    Still, going back to the original question, the start address of each tile would be fixed, and at the same location as the uncompressed tile. Otherwise, how could you do a random access to any kind of tile in memory?

    If there's cache between the ROP and the MC, that lockstep relationship will probably be gone for the most part?

    I still don't see how any of that would work for random accesses.
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    There might be a special case for a zeroed buffer, so maybe a handful of bytes for a whole tile.

    This goes to the unknown as to the granularity of the ROPs. There may be a level of tiling beyond the raster granularity, and the compression scheme might be quad-based.
    8x8 and 16x16 of some base unit of representation are dimensions that come up fairly often.
    At the larger granularity, and if a pixel quad is the base unit, that could reduce a 1080p buffer to a few thousand entries in some kind of indirection structure.
    That does sound like a fair amount of work, though, so I find your theory of a buffer full of severely fragmented free space quite plausible.
    It is primarily about bandwidth, which is why I noted earlier that it could be a win even if the buffers themselves are bigger than they would be without compression as long as the number of bus transactions is reduced.

    The idea was that the small (16KB per RBE) caches have a thrash rate that the render export path is tuned to exploit. They have enough space for some number of active tiles, some number waiting for writeback on the bus, and some number in the process of being read in.
    The caches aren't coherent and the ROPs themselves are statically allocated screen space on the input side and buffer space on the output, so whatever they read in or write back is heavily scheduled.

    Short answer: I'm waiting for a disclosure on the manner in which it can do so, if it does.

    The frame buffer was the designated target for this technology, and its greatest measured benefits are ROP throughput synthetics, which are fine at the granularity the ROPs have.
    Perhaps more clarity will be provided at some point, but that's the extent of the disclosure at this point.

    Going by the assumption that the compression method is built with the more rigidly defined ROP tiling in mind, those accesses would not be random at the level of the compression logic.
    Going beyond that, nothing has been said about random access, although one could imagine that the GPU memory pipeline could detect an access to a buffer it knows to be compressed in this way and decompress on the fly. An access to a compressed buffer could have its address converted to an initial load of a pointer structure or an upper tier of a hierarchical structure and then the logic could work its way down.
    Another possibility is a performance hit where, if the driver/GPU detects this, it decompresses the target buffer, which is what happens in the case of depth buffers bound as textures.

    A less likely possibility is that such arbitrary access requires a conversion back to a non-compressed format. There wasn't any discussion of special measures needed for software to use this, so I'm not banking on this possibility outside of a "it's happened before" sort of thing in some now-ancient GPUs.
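The indirection idea floated above can be expressed as a short sketch (purely speculative; the tile table, its fields, and the decode stub are all invented for illustration):

```python
# Speculative model of how a read from a compressed buffer might resolve:
# the linear address first indexes a tile table, whose entry says whether the
# tile is stored raw or must be decompressed on the fly. Nothing here is a
# disclosed AMD mechanism.

TILE_BYTES = 256  # assumed tile size

def read_pixel(addr, tile_table, decompress):
    tile_id, offset = divmod(addr, TILE_BYTES)
    entry = tile_table[tile_id]              # first access: the metadata
    if entry["compressed"]:
        data = decompress(entry["payload"])  # on-the-fly expansion
    else:
        data = entry["payload"]              # direct, uncompressed read
    return data[offset]

table = [
    {"compressed": False, "payload": bytes(range(256))},
    {"compressed": True, "payload": b"zeros"},  # e.g. a fast-cleared tile
]
value = read_pixel(10, table, lambda p: bytes(TILE_BYTES))  # decode stub
print(value)  # 10: raw tile, plain offset read
```

The cost of the scheme is visible in the model: every access to a compressed surface implies at least one extra metadata fetch before the payload itself.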
     
  7. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    948
    Likes Received:
    417
    There is also the possibility that only the ROP cache contains compressed blocks. With the fairly predictable rasterizing pattern and a favorable benchmark, LRU + compression could reduce memory bandwidth to the final write only, with all intermediate I/O running in cache.

    It wouldn't be the first such case. Texture caches used to contain uncompressed texture blocks, and that was later changed to compressed blocks; it was good for cache utilization.
     
  8. AnarchX

    Veteran

    Joined:
    Apr 19, 2007
    Messages:
    1,559
    Likes Received:
    34
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Something to do with ECC?
    Detection but not correction of double bit errors is fairly normal. Correcting double bit errors, though, should be pretty costly. Maybe that's a typo, though.

    Is there any evidence that GPGPU benefits from ECC?
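Detecting but not correcting double-bit errors is the standard SECDED property. A minimal sketch using a textbook extended Hamming (8,4) code (a generic illustration, not a description of AMD's actual ECC implementation):

```python
# SECDED demo: an extended Hamming (8,4) code corrects any single-bit error
# and detects, but cannot correct, any double-bit error -- the "fairly
# normal" behavior discussed above.

def encode(d):
    """4 data bits -> 8-bit codeword [p0, c1..c7] with overall parity p0."""
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    cw = [p1, p2, d[0], p3, d[1], d[2], d[3]]   # Hamming positions 1..7
    p0 = p1 ^ p2 ^ p3 ^ d[0] ^ d[1] ^ d[2] ^ d[3]
    return [p0] + cw

def decode(word):
    """Return ('ok' | 'corrected' | 'double', data bits or None)."""
    p0, cw = word[0], list(word[1:])
    s = 0                                 # syndrome = XOR of set-bit positions
    for i, b in enumerate(cw, start=1):
        if b:
            s ^= i
    overall = p0
    for b in cw:
        overall ^= b                      # 0 if total parity is even
    if s == 0:
        return ("ok", [cw[2], cw[4], cw[5], cw[6]])
    if overall == 1:                      # parity trips too: single-bit error
        cw[s - 1] ^= 1                    # flip it back
        return ("corrected", [cw[2], cw[4], cw[5], cw[6]])
    return ("double", None)               # syndrome set, parity even: 2 flips

cw = encode([1, 0, 1, 1])
one = list(cw); one[5] ^= 1                 # one flipped bit -> corrected
two = list(cw); two[5] ^= 1; two[7] ^= 1    # two flipped bits -> detected only
print(decode(one)[0], decode(two)[0])       # corrected double
```

Correcting double-bit errors would need a stronger (and costlier) code, which is why "detect-only" for two flips is the common design point.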
     
  10. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    The 12GB figure is mentioned everywhere, including quotes like "twice the competition". If it weren't 512-bit then the memory bandwidth would be wrong too. And I see no reason why it would have something to do with ECC, since the S9150 shows no signs of this. But you don't want asymmetric configurations either, especially not for a compute card.
    Thus my conclusion is that this is really a 16GB card, that is, it is identical to the S9150 from a hardware point of view, with a different BIOS (so a lower clock and memory artificially limited to 12GB somehow).
     
  11. cal_guy

    Newcomer

    Joined:
    Jun 27, 2008
    Messages:
    217
    Likes Received:
    3
    Could it be 1 4Gb and 1 2Gb GDDR5 module in clamshell mode for every 32-bit controller?
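The arithmetic behind that suggestion does check out (my calculation, assuming a 512-bit bus split into sixteen 32-bit controllers):

```python
# One 4Gb plus one 2Gb GDDR5 device per 32-bit controller, clamshell:
# sixteen controllers x 6Gb = 96Gb = 12GB total. Gb are gigabits.
controllers = 512 // 32
total_bits = controllers * (4 + 2) * 2**30
print(total_bits // (8 * 2**30))  # 12 (GB)
```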
     
  12. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,176
    Location:
    La-la land
    That doesn't sound like a great idea: AFAIK, clamshell halves the data bus for each chip and sends read/write requests to both chips in parallel. That means you'd get half bandwidth for the odd-man-out 4GB, assuming it works at all (i.e., not returning invalid data when accessing the "halved" memory space...)
     
  13. TKK

    TKK
    Newcomer

    Joined:
    Jan 12, 2010
    Messages:
    148
    Likes Received:
    0
    What about 8 controllers equipped with 2x 4Gb chips and the other half with 2x 2Gb chips?

    Looks a little weird either way, though.
     
  14. Ryan Smith

    Regular

    Joined:
    Mar 26, 2010
    Messages:
    629
    Likes Received:
    1,131
    Location:
    PCIe x16_1
    AMD uses soft ECC and always advertises their capacity with ECC off. So assuming for a moment that it's not a typo, then 12GB would indeed be the amount of addressable memory.
     
  15. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    That's the config of Hawaii... It's easy to see how the 8 and 16GB are obtained... as for the 12GB, I don't know where they have cut (soft BIOS disable?), or whether they just changed the density of the memory chips (as suggested).
     
    #2715 lanek, Oct 14, 2014
    Last edited by a moderator: Oct 14, 2014
  16. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,247
    Likes Received:
    4,465
    Location:
    Finland
    Mobile Tonga shows up for the first time, not by AMD, but by Apple.
    The new iMac with Retina display has an option for the Radeon R9 M295X, which by all accounts is Tonga, but hasn't been released yet.
     
  17. Pressure

    Veteran

    Joined:
    Mar 30, 2004
    Messages:
    1,655
    Likes Received:
    593
    Yeah, should be the full Tonga chip with all 2048 stream processors enabled.
     
  18. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    I suppose that may explain why the R9 285X is taking so long to show up. Well, that and excess inventory.
     
  19. Psycho

    Regular

    Joined:
    Jun 7, 2008
    Messages:
    746
    Likes Received:
    41
    Location:
    Copenhagen
    http://www.apple.com/imac-with-retina/
    3.5 TF compute power - 2048 @ 854MHz? Sounds reasonable for a "mobile" downclock.
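That arithmetic checks out (assuming the usual 2 FLOPs per stream processor per clock, i.e. FMA):

```python
# 2048 SPs x 2 FLOPs/clock (FMA) x 854 MHz, in TFLOPS:
tflops = 2048 * 2 * 854e6 / 1e12
print(round(tflops, 2))  # 3.5
```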

    This one says 4GB memory, which points more to 256-bit than 384-bit, so no extra evidence for the 384-bit/48-ROP Tonga.
     
  20. homerdog

    homerdog donator of the year
    Legend Subscriber

    Joined:
    Jul 25, 2008
    Messages:
    6,294
    Likes Received:
    1,075
    Location:
    still camping with a mauler
    I was under the impression that the 384-bit bus is confirmed. I recall reading it from a very reliable source, but now it escapes me.
     