AMD: Volcanic Islands R1100/1200 (8***/9*** series) Speculation/Rumour Thread

http://www.hardware.fr/articles/926-24/tonga-vs-tahiti.html

I've just added some extra numbers: Tonga - 28 CU @ 918 MHz - 163.9 GiB/s (256-bit @ 1375 MHz) vs. Tahiti - 28 CU @ 918 MHz - 163.9 GiB/s (384-bit @ 917 MHz).

Of course nothing is perfect, and 384-bit @ 917 MHz is not exactly equivalent to 256-bit @ 1375 MHz, but it still makes for a much more direct comparison between Tonga and Tahiti. Tonga seems to be at its best when a lot of tessellation is involved.
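For reference, a quick sanity check of those matched-bandwidth numbers (assuming GDDR5, which transfers 4 bits per pin per command-clock cycle):

#include <stdio.h>

/* Sanity check of the matched-bandwidth setups above.  Assumes GDDR5,
 * which transfers 4 bits per pin per command-clock cycle. */
static double gddr5_gib_per_s(int bus_width_bits, double mclk_mhz)
{
    double bytes_per_s = (bus_width_bits / 8.0) * mclk_mhz * 1e6 * 4.0;
    return bytes_per_s / (1024.0 * 1024.0 * 1024.0); /* GiB/s */
}

int main(void)
{
    printf("Tonga  256-bit @ 1375 MHz: %.1f GiB/s\n", gddr5_gib_per_s(256, 1375.0));
    printf("Tahiti 384-bit @  917 MHz: %.1f GiB/s\n", gddr5_gib_per_s(384, 917.0));
    return 0; /* both come out at ~163.9-164.0 GiB/s */
}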

Interesting. This certainly paints AMD's claims of "+40% bandwidth efficiency" in a new light. By the way, Scott mentioned that he got confirmation from AMD that Tonga does indeed feature 32 CUs:

AMD has confirmed to us that Tonga is indeed hiding four more compute units than are active in the R9 285
http://techreport.com/review/26997/amd-radeon-r9-285-graphics-card-reviewed/2
 
Indeed, R600 (and RV670) were exceptions in this regard, being "native" 16-bit architecture designs, but that wasn't scalable, and with RV770 they later backtracked on full-rate FP16 blending and filtering to shift resources toward more parallelism.
Yeah, I guess you're right. Actually, some Tahiti configurations should indeed show a _slightly_ higher FP16 blend rate if they were limited by memory bandwidth rather than by the half-rate blend (the original 7970 being one). They must have changed it with Sea Islands, then (but not the FP16 filtering rate, which I find slightly strange nowadays).
 
Can someone explain the 290x blend rates in the Hardware.fr review? The effective bandwidth numbers are off the charts. Is the Tonga magic also present in Hawaii?
 
That's not quite an accurate description. There are several compression ratios available

That's what I hinted at in the last paragraph. Block compression schemes naturally use coding selectors to choose between alternative codings per block.
The coding is traditionally not written down outside of the block; you have a large bitfield indicating whether something is compressed or not, and this selects the decoder to be used. The decoder for the compressed scheme is an isolated piece of hardware which doesn't take "parameters" like a function call (the compression mode) but just the fixed-size memory chunk.
The selector can be a bit, or a few bits, but it can also be a violation of a convention (see the start > stop criterion in BC1-5), a variable-length prefix code (see BC6-7), or chained variable-length prefix codes (see ASTC). It is rather inconvenient to have the selector outside of the block.

(since r3xx I think, for depth, but I doubt it's only one per color either) - so blocks can be compressed by 1:2, 1:4 and so on (not sure exactly which ratios are available, probably more than these two), hence you need more bits per block to identify the compression scheme; 2 bits would be enough for just two ratios, as you need fast-cleared, uncompressed, ratio 1, ratio 2, ...
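As a minimal illustration of the two points above - a small per-block selector (here the four states just listed: fast-cleared, uncompressed, and two ratios) picking an isolated decoder that takes nothing but the raw chunk. The layout, names, and stub decoders are purely hypothetical, not any actual GPU's format:

#include <stdint.h>
#include <string.h>

#define BLOCK_PIXELS 64 /* assuming an 8x8 block of 32-bit pixels */

/* Stubs marking where the isolated fixed-function decoders would sit;
 * each consumes nothing but the raw memory chunk. */
static void fill_with_clear_color(uint32_t *out)
{ memset(out, 0, BLOCK_PIXELS * 4); }
static void copy_raw(const uint8_t *chunk, uint32_t *out)
{ memcpy(out, chunk, BLOCK_PIXELS * 4); }
static void decode_half_size(const uint8_t *chunk, uint32_t *out)
{ (void)chunk; (void)out; /* 1:2 decoder would go here */ }
static void decode_quarter_size(const uint8_t *chunk, uint32_t *out)
{ (void)chunk; (void)out; /* 1:4 decoder would go here */ }

/* 2-bit per-block state, matching the four cases listed above. */
enum block_state {
    BLOCK_FAST_CLEARED = 0,
    BLOCK_UNCOMPRESSED = 1,
    BLOCK_RATIO_1_2    = 2,
    BLOCK_RATIO_1_4    = 3,
};

void decode_block(enum block_state sel, const uint8_t *chunk, uint32_t *out)
{
    switch (sel) {
    case BLOCK_FAST_CLEARED: fill_with_clear_color(out);      break;
    case BLOCK_UNCOMPRESSED: copy_raw(chunk, out);            break;
    case BLOCK_RATIO_1_2:    decode_half_size(chunk, out);    break;
    case BLOCK_RATIO_1_4:    decode_quarter_size(chunk, out); break;
    }
}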

My belief is that fixed-rate coding is employed, as this reduces complexity by a large amount; it also prevents the encoder from having to deal with unnecessary decision problems. Encoding and decoding times have to be symmetric for the given problem.
You can select, for example, an encoding with more planes but less precise deltas, or more precise deltas but fewer planes. Different code-block sizes allow a better best case, as the data just might compress well, but fixed code-block sizes with a lot of different codings allow a better worst case, as many more blocks are compressible. It's a tradeoff.
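To make the tradeoff concrete, here is some illustrative budget arithmetic for a hypothetical fixed-size code block; the plane and delta counts are invented for the example, not taken from real hardware:

#include <stdio.h>

/* Illustrative arithmetic for the tradeoff above: a 4x4 block of 32-bit
 * pixels (512 bits raw) compressed 1:4 into a fixed 128-bit code block.
 * The plane/delta layouts are invented for the example. */
static int budget_bits(int planes, int base_bits, int deltas, int delta_bits)
{
    return planes * base_bits + deltas * delta_bits;
}

int main(void)
{
    /* Coding A: one base plane, finer 6-bit deltas for the other 15 pixels. */
    printf("A: %d of 128 bits\n", budget_bits(1, 32, 15, 6)); /* 122 */
    /* Coding B: two base planes (a gradient), coarser 4-bit deltas. */
    printf("B: %d of 128 bits\n", budget_bits(2, 32, 14, 4)); /* 120 */
    return 0;
}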

Also, I don't think this buffer is really loaded as a whole nowadays. For color this would be very problematic, as you'd waste _a lot_ of transistors (you'd essentially have to be able to hold that information for 8 16kx16k (which is the max size with d3d11, not 8kx8k)

I never tried allocating and binding 16kx16k rendertargets; my gut feeling was just that it might be stopped by some soft constraint - it's no less than 1 GiB for RGBA8.

color buffers) - that is 8MB (with the assumption of 2 bits per block, and your 8x8 block assumption, which I don't think is quite accurate either, since IIRC nowadays this is really done per "memory block", hence the number of pixels covered differs depending on the buffer format).
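For reference, both figures check out under those assumptions (a 16kx16k RGBA8 target is 1 GiB, and 2 bits per 8x8 block across 8 such targets is 8 MiB):

#include <stdio.h>

/* Checking both figures: one 16k x 16k RGBA8 target, and 2 bits of
 * metadata per 8x8 block across 8 such targets (the post's assumptions). */
int main(void)
{
    long long px     = 16384LL * 16384LL;  /* pixels per rendertarget */
    long long blocks = px / (8 * 8);       /* 8x8 pixel blocks        */
    long long meta   = 8 * blocks * 2 / 8; /* bytes, 8 MRTs @ 2 bits  */
    printf("one RGBA8 target: %lld MiB\n", (px * 4) >> 20); /* 1024 */
    printf("metadata, 8 targets: %lld MiB\n", meta >> 20);  /* 8    */
    return 0;
}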

I would be interested in reading a description of such a scheme - other than ASTC, which I consider not quite suitable for rendertarget compression, especially because of the encoder complexity and problems.

Sure, you could say you only support it when there's just one color buffer or some such - meaning you'd miss it when you need that feature the most... It should be more efficient to just hold that information like other data - though this would increase latency in the (hopefully rare) case where the block information itself isn't yet in the cache.

I don't think we're in disagreement on this one. That said, I do think the net effect of actually touching the rendertarget(s) and piping the compressed data through the ROP caches is that the compressed data is "on the chip" afterwards.
 
I don't see how it is a waste of space and effort. Sounds like the kind of thing that will get used.
Audio is a solved problem, and it was solved by throwing a moderate amount of CPU cycles at it. There may be a handful of people who want better, but given the financial state of sound companies, it's obvious that few actually care: we reached the point of "good enough" over a decade ago.

I don't doubt that it will get used as long as it's a transparent layer in a SW stack that doesn't require any extra effort, but it's the kind of thing where you wonder if the effort warrants the benefits. AMD isn't known for having an abundance of engineers...
 
That's what I hinted at in the last paragraph. Block compression schemes naturally use coding selectors to choose between alternative codings per block.
The coding is traditionally not written down outside of the block; you have a large bitfield indicating whether something is compressed or not, and this selects the decoder to be used. The decoder for the compressed scheme is an isolated piece of hardware which doesn't take "parameters" like a function call (the compression mode) but just the fixed-size memory chunk.
The selector can be a bit, or a few bits, but it can also be a violation of a convention (see the start > stop criterion in BC1-5), a variable-length prefix code (see BC6-7), or chained variable-length prefix codes (see ASTC). It is rather inconvenient to have the selector outside of the block.
Allocation for that block information appears to be completely separate (both for color and depth buffers) and programmed with a separate offset. I think it would be impractical if it were embedded in this case, because otherwise you can't really do in-place compression/decompression, and you couldn't do deferred allocation of that information either (both things the open source driver does). Edit: And I forgot - it really needs to be outside in any case, otherwise fast color clear cannot work.
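A minimal sketch of that last point - with the block state held in a separate surface, a fast clear only has to rewrite the tiny metadata array, never the pixels. The field names and one-byte-per-block layout are illustrative only, not AMD's actual format:

#include <stdint.h>
#include <string.h>

#define META_FAST_CLEARED 0x00 /* "whole block equals the clear color" */

struct surface {
    uint8_t *pixels;     /* actual color data, left untouched     */
    uint8_t *meta;       /* per-block state, separately allocated */
    size_t   num_blocks;
};

void fast_clear(struct surface *s)
{
    /* O(metadata) instead of O(pixels): one byte here vs. 256 bytes of
     * pixel data per 8x8 RGBA8 block. */
    memset(s->meta, META_FAST_CLEARED, s->num_blocks);
}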

My belief is that fixed-rate coding is employed, as this reduces complexity by a large amount; it also prevents the encoder from having to deal with unnecessary decision problems. Encoding and decoding times have to be symmetric for the given problem.
You can select, for example, an encoding with more planes but less precise deltas, or more precise deltas but fewer planes. Different code-block sizes allow a better best case, as the data just might compress well, but fixed code-block sizes with a lot of different codings allow a better worst case, as many more blocks are compressible. It's a tradeoff.
OK, I dug that out - the depth buffer being able to use both 1:2 and 1:4 compression is well documented for r300:
http://www.beyond3d.com/content/reviews/37/4 ("ATI claim a minimum of a 2:1 compression ratio and a best case of 4:1 during normal rendering"). Not sure about SI though, and that was just for depth.
I'm not sure how the encoder deals with this - I think for this chip generation it was mostly just RLE, so the encoder basically wouldn't have to make any decisions: it would just blindly try to encode, and when finished it would have hit at most quarter size, half size, or stopped trying...
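Something like this minimal sketch, assuming a toy byte-wise RLE as a stand-in for the hardware coder (none of this is AMD's actual format):

#include <stddef.h>
#include <stdint.h>

/* Toy byte-wise RLE ((count, value) pairs) standing in for the hardware
 * coder; bails out once the output would exceed out_max. */
static size_t rle_encode(const uint8_t *in, size_t n, uint8_t *out, size_t out_max)
{
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        size_t run = 1;
        while (i + run < n && in[i + run] == in[i] && run < 255)
            run++;
        if (o + 2 > out_max)
            return out_max + 1;      /* signal: doesn't fit */
        out[o++] = (uint8_t)run;
        out[o++] = in[i];
        i += run;
    }
    return o;
}

enum ratio { RATIO_1_4, RATIO_1_2, UNCOMPRESSED };

/* The "no decisions" encoder: blindly encode and derive the ratio from
 * where the output lands. */
enum ratio encode_block(const uint8_t *in, size_t block_bytes, uint8_t *out)
{
    /* Stop trying as soon as the output exceeds half the block size. */
    size_t n = rle_encode(in, block_bytes, out, block_bytes / 2);
    if (n <= block_bytes / 4) return RATIO_1_4;
    if (n <= block_bytes / 2) return RATIO_1_2;
    return UNCOMPRESSED;             /* store the block verbatim instead */
}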

I would be interested in reading a description of such a scheme - other than ASTC, which I consider not quite suitable for rendertarget compression, especially because of the encoder complexity and problems.

Yeah, you're actually right. I read the code in the driver a bit more closely, and the block size is always 8x8 (both for color and, it seems, for depth too). It is subject to quite heavy alignment restrictions (which depend on the tile config of the chip), though in the end 4 bits are actually allocated per block (that's the cmask only; the fmask, which is needed in case of MSAA, is separate). I can't tell you, though, how many of those bits are actually used (it's always 4 bits for the cmask from r600 to CIK).
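Under those figures (4 bits per 8x8 block, and ignoring the chip-specific tile alignment, so real allocations will be somewhat larger), the cmask footprint comes out quite small:

#include <stdio.h>

/* Rough cmask footprint at 4 bits per 8x8 block, ignoring the
 * chip-specific tile alignment mentioned above. */
static long long cmask_bytes(long long w, long long h)
{
    long long blocks = ((w + 7) / 8) * ((h + 7) / 8);
    return (blocks * 4 + 7) / 8; /* 4 bits per block, rounded up */
}

int main(void)
{
    printf("1920x1080:   %lld bytes\n", cmask_bytes(1920, 1080));       /* 16200 */
    printf("16384x16384: %lld KiB\n", cmask_bytes(16384, 16384) >> 10); /* 2048  */
    return 0;
}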
 
Audio is a solved problem, and it was solved by throwing a moderate amount of CPU cycles at it.

As a consumer, I prefer the idea of better audio to PhysX... OK, maybe it's just me, but still.

Also, I play on my notebook, so my CPU cycles count, as do those of my GPU - and I am not the only one. That's why Mantle would be great, if only notebooks with GCN had been fking available (NONE - wtf is AMD's marketing doing...) when I bought mine.
 
As a consumer, I prefer the idea of better audio to PhysX... OK, maybe it's just me, but still.

Also, I play on my notebook, so my CPU cycles count, as do those of my GPU - and I am not the only one. That's why Mantle would be great, if only notebooks with GCN had been fking available (NONE - wtf is AMD's marketing doing...) when I bought mine.

For notebooks, I think the better bandwidth utilization is the big benefit of the Tonga improvements - whether you look at it from the angle of future APUs with an integrated GPU still attached to slow external RAM, or of next-lithographic-generation discrete GPUs with twice the areal density (and thus ALU power) but a memory interface that cannot scale to match.
 
Audio is a solved problem, and it was solved by throwing a moderate amount of CPU cycles at it. There may be a handful of people who want better, but given the financial state of sound companies, it's obvious that few actually care: we reached the point of "good enough" over a decade ago.

I don't doubt that it will get used as long as it's a transparent layer in a SW stack that doesn't require any extra effort, but it's the kind of thing where you wonder if the effort warrants the benefits. AMD isn't known for having an abundance of engineers...

Audio is pretty limited in games: spatialization isn't particularly accurate, there's almost no attention paid to acoustics, etc.

TrueAudio enables all these things and more without paying for them in CPU cycles and, perhaps even more importantly, with a very low cost in power consumption. The area occupied is unknown but TrueAudio is featured in Kaveri so it can't be much. As for the amount of effort, I don't know, but it's based on third-party IP and some portion (all?) of the integration effort may have effectively been paid for by Sony/Microsoft.

In the end it will all depend on developers, but the concept seems sound to me. Generally speaking, modern designs tend to have more transistors to spare than watts, so using dedicated, efficient hardware wherever possible makes sense.
 
With the specs given (on-chip RAM etc.), the area cost should indeed be minimal, probably less than 2 mm². But there's always a support cost for these things. And AMD clearly thinks they're worth it. I still don't think it's going to convince many people to buy a 285.
 
As a consumer, I prefer the idea of better audio to PhysX... OK, maybe it's just me, but still.

Also, I play on my notebook, so my CPU cycles count, as do those of my GPU - and I am not the only one. That's why Mantle would be great, if only notebooks with GCN had been fking available (NONE - wtf is AMD's marketing doing...) when I bought mine.


I'm all for improved audio but I don't think notebooks are the best platform for appreciating better sound effects :)
 
Proper spatialization is going to be critical for a good VR experience, and TrueAudio has a big chance to shine there. But without the software support it's nothing.
 
I'm all for improved audio but I don't think notebooks are the best platform for appreciating better sound effects :)

Hehe, indeed - but when travelling, or when you're at home with sleeping kids in a relatively small house, you use a headset.

And the Lichdom fire demo is quite convincing with a headset :)

To be honest, I want my cool 5.1 setup back, and games like Thief or Doom 3, but I must keep the sound to a minimum now... and making the ground floor tremble is not an option any more :)
 
Wondering if there is any info about fp64 support/speed on Tonga? None of the reviews so far mention it.
Some do - for the R9 285 it's apparently 1/16. AMD apparently hasn't said anything about the native rate, so make your guess :). Some believe it should be 1/2, though I see no reason for that; IMHO 1/4 is far more likely, though it could be just 1/16, who knows.
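For scale, here is what each guessed ratio would mean for the R9 285, assuming the usual 64 lanes per GCN CU and the 918 MHz clock quoted earlier:

#include <stdio.h>

/* FP64 throughput for each guessed ratio: 28 CUs x 64 lanes x 2 flops
 * per FMA at 918 MHz. */
int main(void)
{
    double fp32 = 28 * 64 * 2 * 918e6 / 1e9; /* ~3290 GFLOPS */
    printf("FP32:        %4.0f GFLOPS\n", fp32);
    printf("FP64 @ 1/2:  %4.0f GFLOPS\n", fp32 / 2);
    printf("FP64 @ 1/4:  %4.0f GFLOPS\n", fp32 / 4);
    printf("FP64 @ 1/16: %4.0f GFLOPS\n", fp32 / 16);
    return 0;
}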
 
Audio is a solved problem, and it was solved by throwing a moderate amount of CPU cycles at it. There may be a handful of people who want better, but given the financial state of sound companies, it's obvious that few actually care: we reached the point of "good enough" over a decade ago.
Not really. There are a few posters (devs themselves) in console fora who often point out that game audio on CPU does much too little. There are a lot more effects that they would want to apply but can't.

Music playback doesn't need much CPU. But game audio is a different thing.

I don't doubt that it will get used as long as it's a transparent layer in a SW stack that doesn't require any extra effort, but it's the kind of thing where you wonder if the effort warrants the benefits. AMD isn't known for having an abundance of engineers...

Which is why they got Sony to pay for it. :)
 