AMD: Volcanic Islands R1100/1200 (8***/9*** series) Speculation/ Rumour Thread

Discussion in 'Architecture and Products' started by Nemo, May 7, 2013.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The compression is for frame buffer data, not the whole memory budget of the graphics context.
    The memory budget for that is comparatively small, but it is heavily read and written by the ROPs--which are currently the dominant bandwidth consumers of the GPU.

    While AMD has said little on the implementation of this compression, the argument could be made that even if the compression data + color buffer in memory were bigger than the color buffer alone, it might still be worthwhile if the compression data is accessed less frequently than the color data it compresses.
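The tradeoff in that last paragraph can be put in rough numbers. A back-of-envelope sketch (all figures are illustrative assumptions, not anything AMD has disclosed):

```python
# Hypothetical model: even if metadata enlarges the total footprint, bus
# traffic can still drop as long as the metadata is touched far less often
# than the color data it describes. All numbers below are made up.

def effective_traffic(tile_bytes, accesses, compress_ratio, meta_bytes, meta_accesses):
    """Bytes moved over the bus for one tile's lifetime, plain vs compressed."""
    uncompressed = tile_bytes * accesses
    compressed = tile_bytes * compress_ratio * accesses + meta_bytes * meta_accesses
    return uncompressed, compressed

# A 256-byte tile read/written 10 times, compressing 2:1, with 8 bytes of
# metadata fetched once and cached thereafter:
plain, packed = effective_traffic(256, 10, 0.5, 8, 1)
print(plain, packed)  # 2560 vs 1288: fewer bytes moved despite the extra metadata
```

The footprint in memory (256 + 8 bytes) is larger than the plain buffer, yet the bus sees roughly half the traffic, which is the point being argued.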
     
  2. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    If the data has to be read and written at random, then it can only be BW compression, not data compression, I would think?

    What I mean is: if you're compressing sectors of data, the start address of each sector will still be the same; there'd just be gaps between sectors, plus some attribute space indicating whether each sector is compressed or not.

    It would be cool if someone could write a micro-benchmark to figure this out.
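Short of a real micro-benchmark, the layout described above (fixed start addresses with gaps, plus an attribute flag per sector) can at least be modeled in a few lines. A toy sketch, with made-up parameters:

```python
# Toy model of the proposed layout, not a known AMD implementation: each
# sector keeps its uncompressed start address, a compressed payload leaves a
# gap before the next sector, and a side-band attribute array records whether
# each sector is compressed.

SECTOR = 32  # bytes per memory atom/sector (assumption)

def store(memory, attrs, index, payload):
    """Write a (possibly compressed) payload at the sector's fixed address."""
    base = index * SECTOR                  # start address never moves
    memory[base:base + len(payload)] = payload
    attrs[index] = len(payload) < SECTOR   # True = compressed, gap follows

memory = bytearray(4 * SECTOR)
attrs = [False] * 4
store(memory, attrs, 2, b"\x01" * 12)      # a 12-byte compressed sector
print(attrs)        # [False, False, True, False]
print(2 * SECTOR)   # sector 2 still starts at byte 64 either way
```

Random access stays trivial because addressing never depends on the compressed sizes; only the bytes actually transferred change.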
     
  3. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,176
    Location:
    La-la land
    No, because you have no way of predicting how much the buffer will compress from one frame to the next; it could be a lot (however much that is), or maybe nothing at all, and anything in between. You must always reserve enough space to store the entire buffer as-is...unless they're doing lossy compression and forcibly squish down the buffer to some arbitrary size, which would be insanely dumb, and generally just infeasible. :)
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Since the GPU's rasterizer, wavefronts, and ROPs work in tiles, import and export from the ROP caches should be more structured than that.
    I'm not clear as to the exact granularity that the ROP caches and the ROPs themselves work at, although the rasterizer and shader wavefronts like the number 64.
    Tiling at this or even coarser levels allows ROPs to burn bandwidth to hide read-modify-write latencies, which makes very high DRAM bus utilization both possible and obligatory for them.

    Breaking the data to be compressed to a ROP-friendly granularity could then allow for such transactions to occur at a cadence roughly similar to what the ROPs would be pipelined to use.
    As such, I'm curious whether a more variable bandwidth load would actually cause a departure from this lockstep relationship and actually hurt bus utilization in some cases in exchange for getting past a lack of bus growth (at least until the next memory standards come into play).

    A scheme could allow for a variable amount of total memory consumption, although with the amount of variability between exports it might not be a good idea to make the value of SizeWorstcase-SizeCurrent visible.
     
  5. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    Let's just take that wavefront of 64 pixels, organized in some kind of rectangular fashion. If you use RGB & Alpha, that's 4 bytes per pixel or 256 bytes. A memory atom is 32 bytes. With delta coding followed by something like Huffman or arithmetic coding, you can probably get those 256 bytes down to 128 bytes or less in many cases? You can probably do this kind of compression on the fly, streaming, without multiple passes over the data. So this thing would simply sit in between the cache and the MC.
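As a rough sanity check on those sizes, here is a software sketch using zlib as a stand-in for a hardware Huffman/arithmetic coder (the actual pipeline is undisclosed, and the gradient tile below is a contrived, favorable case):

```python
import zlib

# Sketch of the scheme above: delta-code a 64-pixel RGBA tile channel-wise,
# then entropy-code the deltas. zlib substitutes for a streaming hardware
# coder purely for illustration.

def delta_encode(data, stride=4):
    """Per-channel byte deltas: smooth gradients become runs of small values."""
    out = bytearray(data[:stride])
    for i in range(stride, len(data)):
        out.append((data[i] - data[i - stride]) & 0xFF)
    return bytes(out)

# A 64-pixel tile with a gentle grayscale gradient: 256 bytes uncompressed.
tile = bytes(b for x in range(64) for b in (x * 2, x * 2, x * 2, 255))
packed = zlib.compress(delta_encode(tile), 9)
print(len(tile), len(packed))  # 256 -> well under 128 bytes for this tile
```

Real render targets are noisier than a clean gradient, so 2:1 "in many cases" rather than always is a plausible expectation.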

    Still, going back to the original question, the start address of each tile would be fixed, and at the same location as the uncompressed tile. Otherwise, how could you do a random access to any kind of tile in memory?

    If there's cache between the ROP and the MC, that lockstep relationship will probably be gone for the most part?

    I still don't see how any of that would work for random accesses.
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    There might be a special case for a zeroed buffer, so maybe a handful of bytes for a whole tile.

    This goes to the unknown as to the granularity of the ROPs. There may be a level of tiling beyond the raster granularity, and the compression scheme might be quad-based.
    8x8 and 16x16 of some base unit of representation are dimensions that come up fairly often.
    At the larger granularity, and if a pixel quad is the base unit, that could reduce a 1080p buffer to a few thousand entries in some kind of indirection structure.
    That does sound like a fair amount of work, though, so I find your theory of a buffer full of severely fragmented free space quite plausible.
    It is primarily about bandwidth, which is why I noted earlier that it could be a win even if the buffers themselves are bigger than they would be without compression as long as the number of bus transactions is reduced.

    The idea was that the small (16KB per RBE) caches have a thrash rate that the render export path is tuned to exploit. They have enough space for some number of active tiles, some number waiting for writeback on the bus, and some number in the process of being read in.
    The caches aren't coherent and the ROPs themselves are statically allocated screen space on the input side and buffer space on the output, so whatever they read in or write back is heavily scheduled.

    Short answer: I'm waiting for a disclosure on the manner in which it can do so, if it does.

    The frame buffer was the designated target for this technology, and its greatest measured benefits are ROP throughput synthetics, which are fine at the granularity the ROPs have.
    Perhaps more clarity will be provided at some point, but that's the extent of the disclosure at this point.

    Going by the assumption that the compression method is built with the more rigidly defined ROP tiling in mind, those accesses would not be random at the level of the compression logic.
    Going beyond that, nothing has been said about random access, although one could imagine that the GPU memory pipeline could detect an access to a buffer it knows to be compressed in this way and decompress on the fly. An access to a compressed buffer could have its address converted to an initial load of a pointer structure or an upper tier of a hierarchical structure and then the logic could work its way down.
    Another possibility is a performance hit where, if the driver/GPU detects this, it decompresses the target buffer, which is what happens in the case of depth buffers bound as textures.

    A less likely possibility is that such arbitrary access requires a conversion back to a non-compressed format. There wasn't any discussion of special measures needed for software to use this, so I'm not banking on this possibility outside of a "it's happened before" sort of thing in some now-ancient GPUs.
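The indirection idea floated above can be expressed as a short sketch (purely speculative; the tile table, its fields, and the decode stub are all invented for illustration):

```python
# Speculative model of how a read from a compressed buffer might resolve:
# the linear address first indexes a tile table, whose entry says whether the
# tile is stored raw or must be decompressed on the fly. Nothing here is a
# disclosed AMD mechanism.

TILE_BYTES = 256  # assumed tile size

def read_pixel(addr, tile_table, decompress):
    tile_id, offset = divmod(addr, TILE_BYTES)
    entry = tile_table[tile_id]              # first access: the metadata
    if entry["compressed"]:
        data = decompress(entry["payload"])  # on-the-fly expansion
    else:
        data = entry["payload"]              # direct, uncompressed read
    return data[offset]

table = [
    {"compressed": False, "payload": bytes(range(256))},
    {"compressed": True, "payload": b"zeros"},  # e.g. a fast-cleared tile
]
value = read_pixel(10, table, lambda p: bytes(TILE_BYTES))  # decode stub
print(value)  # 10: raw tile, plain offset read
```

The cost of the scheme is visible in the model: every access to a compressed surface implies at least one extra metadata fetch before the payload itself.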
     
  7. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    948
    Likes Received:
    417
    There is also the possibility that only the ROP cache contains compressed blocks. With the fairly predictable rasterizing pattern and a favorable benchmark, LRU + compression could reduce memory bandwidth to the final write only, with all intermediate I/O running in cache.

    It wouldn't be the first such case. Texture caches used to contain uncompressed texture blocks, and that was later changed to compressed blocks; it was good for cache utilization.
     
  8. AnarchX

    Veteran

    Joined:
    Apr 19, 2007
    Messages:
    1,559
    Likes Received:
    34
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Something to do with ECC?
    Detection but not correction of double bit errors is fairly normal. Correcting double bit errors, though, should be pretty costly. Maybe that's a typo, though.

    Is there any evidence that GPGPU benefits from ECC?
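Detecting but not correcting double-bit errors is the standard SECDED property. A minimal sketch using a textbook extended Hamming (8,4) code (a generic illustration, not a description of AMD's actual ECC implementation):

```python
# SECDED demo: an extended Hamming (8,4) code corrects any single-bit error
# and detects, but cannot correct, any double-bit error -- the "fairly
# normal" behavior discussed above.

def encode(d):
    """4 data bits -> 8-bit codeword [p0, c1..c7] with overall parity p0."""
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    cw = [p1, p2, d[0], p3, d[1], d[2], d[3]]   # Hamming positions 1..7
    p0 = p1 ^ p2 ^ p3 ^ d[0] ^ d[1] ^ d[2] ^ d[3]
    return [p0] + cw

def decode(word):
    """Return ('ok' | 'corrected' | 'double', data bits or None)."""
    p0, cw = word[0], list(word[1:])
    s = 0                                 # syndrome = XOR of set-bit positions
    for i, b in enumerate(cw, start=1):
        if b:
            s ^= i
    overall = p0
    for b in cw:
        overall ^= b                      # 0 if total parity is even
    if s == 0:
        return ("ok", [cw[2], cw[4], cw[5], cw[6]])
    if overall == 1:                      # parity trips too: single-bit error
        cw[s - 1] ^= 1                    # flip it back
        return ("corrected", [cw[2], cw[4], cw[5], cw[6]])
    return ("double", None)               # syndrome set, parity even: 2 flips

cw = encode([1, 0, 1, 1])
one = list(cw); one[5] ^= 1                 # one flipped bit -> corrected
two = list(cw); two[5] ^= 1; two[7] ^= 1    # two flipped bits -> detected only
print(decode(one)[0], decode(two)[0])       # corrected double
```

Correcting double-bit errors would need a stronger (and costlier) code, which is why "detect-only" for two flips is the common design point.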
     
  10. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    The 12GB figure is mentioned everywhere, including quotes like "twice the competition". If it weren't 512-bit then the memory bandwidth would be wrong too. And I see no reason why it would have something to do with ECC, since the S9150 shows no signs of this. But you don't want asymmetric configurations either, especially not for a compute card.
    Thus my conclusion is that this is really a 16GB card, that is, it is identical to the S9150 from a hardware point of view, with a different BIOS (so a lower clock and memory artificially limited to 12GB somehow).
     
  11. cal_guy

    Newcomer

    Joined:
    Jun 27, 2008
    Messages:
    217
    Likes Received:
    3
    Could it be 1 4Gb and 1 2Gb GDDR5 module in clamshell mode for every 32-bit controller?
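The arithmetic behind that suggestion does check out (my calculation, assuming a 512-bit bus split into sixteen 32-bit controllers):

```python
# One 4Gb plus one 2Gb GDDR5 device per 32-bit controller, clamshell:
# sixteen controllers x 6Gb = 96Gb = 12GB total. Gb are gigabits.
controllers = 512 // 32
total_bits = controllers * (4 + 2) * 2**30
print(total_bits // (8 * 2**30))  # 12 (GB)
```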
     
  12. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,176
    Location:
    La-la land
    That doesn't sound like a great idea: AFAIK, clamshell halves the data bus for each chip and sends read/write requests to both chips in parallel. That means you'd get half bandwidth for the odd-man-out 4GB, assuming it works at all (i.e., not returning invalid data when accessing the "halved" memory space...)
     
  13. TKK

    TKK
    Newcomer

    Joined:
    Jan 12, 2010
    Messages:
    148
    Likes Received:
    0
    What about 8 controllers equipped with 2x 4Gb chips and the other half with 2x 2Gb chips?

    Looks a little weird either way, though.
     
  14. Ryan Smith

    Regular

    Joined:
    Mar 26, 2010
    Messages:
    629
    Likes Received:
    1,131
    Location:
    PCIe x16_1
    AMD uses soft ECC and always advertises their capacity with ECC off. So assuming for a moment that it's not a typo, then 12GB would indeed be the amount of addressable memory.
     
  15. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    That's the config of Hawaii... It's easy to see how the 8 and 16GB are obtained... as for the 12GB, I don't know where they have cut (soft BIOS disable?), or whether they just changed the density of the memory chips (as suggested).
     
    #2715 lanek, Oct 14, 2014
    Last edited by a moderator: Oct 14, 2014
  16. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,247
    Likes Received:
    4,465
    Location:
    Finland
    Mobile Tonga shows up for the first time, not by AMD, but by Apple.
    The new iMac with Retina display has an option for the Radeon R9 M295X, which by all accounts is Tonga, but hasn't been released yet.
     
  17. Pressure

    Veteran

    Joined:
    Mar 30, 2004
    Messages:
    1,655
    Likes Received:
    593
    Yeah, should be the full Tonga chip with all 2048 stream processors enabled.
     
  18. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    I suppose that may explain why the R9 285X is taking so long to show up. Well, that and excess inventory.
     
  19. Psycho

    Regular

    Joined:
    Jun 7, 2008
    Messages:
    746
    Likes Received:
    41
    Location:
    Copenhagen
    http://www.apple.com/imac-with-retina/
    3.5 TF compute power - 2048 @ 854MHz? Sounds reasonable for a "mobile" downclock.
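That arithmetic checks out (assuming the usual 2 FLOPs per stream processor per clock, i.e. FMA):

```python
# 2048 SPs x 2 FLOPs/clock (FMA) x 854 MHz, in TFLOPS:
tflops = 2048 * 2 * 854e6 / 1e12
print(round(tflops, 2))  # 3.5
```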

    This one says 4GB memory, which points more to 256-bit than 384-bit, so no extra evidence for the 384-bit/48-ROP Tonga.
     
  20. homerdog

    homerdog donator of the year
    Legend Subscriber

    Joined:
    Jul 25, 2008
    Messages:
    6,294
    Likes Received:
    1,075
    Location:
    still camping with a mauler
    I was under the impression that the 384-bit bus is confirmed. I recall reading it from a very reliable source, but now it escapes me.
     