AMD: R7xx Speculation

In the best case, 4:1 compression is achieved and everything fits into eDRAM, saving 100% framebuffer bandwidth.
To make this work you need a ROP architecture that can work with non-uniformly organised memory for render targets.

To my knowledge all GPUs use a uniform memory layout for their render targets - the compression comes solely from the way tiling of pixel-channel/sample-channel data is performed, meaning that the ROPs access memory in units of an entire tile. A fully compressed AA tile merely means that the ROPs access just one tile's worth of memory. Compression for a render target varies over the lifetime of the frame, since it's a technique aimed at saving bandwidth, not memory.
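
To make that concrete, here's a toy sketch (Python, with completely made-up tile sizes and state encodings, not anything a real GPU necessarily uses) of a uniform layout where compression only changes how much of each tile the ROPs actually touch:

Code:
# Toy model of a uniformly laid-out 4x MSAA colour buffer. Every tile always
# has memory allocated for all four sample planes; a separate per-tile state
# says how many of them the ROPs actually need to read or write.
TILE_BYTES = 256                      # hypothetical burst/tile granularity
SAMPLES = 4                           # 4x MSAA

class Tile:
    def __init__(self, planes_stored=SAMPLES):
        # 1 = fully compressed (all samples identical), SAMPLES = uncompressed
        self.planes_stored = planes_stored

    def bytes_touched(self):
        # A fully compressed tile costs one tile's worth of traffic,
        # an uncompressed one costs SAMPLES tiles' worth.
        return self.planes_stored * TILE_BYTES

tiles = [Tile() for _ in range(1024)]          # worst case: everything uncompressed
tiles[0] = Tile(planes_stored=1)               # e.g. a tile with no triangle edges
print(sum(t.bytes_touched() for t in tiles), "bytes of framebuffer traffic")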

As it happens, the recent ATI patent for AA sample compression vaguely hints at a non-uniform layout of pixel/sample data in memory - but I think that's just patentese covering all bases or me being creatively interpretive.

Method and apparatus for anti-aliasing using floating point subpixel color values and compression of same

Jawed
 
To make this work you need a ROP architecture that can work with non-uniformly organised memory for render targets.
I don't think that's necessary. For each block in EDRAM you have three in RAM, and you only access the latter if the tile is uncompressed.

The problem with this idea is that it improves the best case while keeping the worst case the same. When a tile is compressed there isn't much benefit from storing it in EDRAM because it's low BW. It's the uncompressed tiles that chew up BW.

Arun, the whole point of EDRAM is to avoid compression logic and the worst case perf associated with compression.
 
If you've got an IMR (heck, even a TBDR as I argued in the past, though probably to a lesser extent) then using eDRAM means you won't have to go off-chip as much, saving power. This makes it a viable architecture for IMR handhelds... And it's exactly what NV is doing there next round (and for, let us say, something else), but heh, I'm digressing.

Don't misunderstand me, but this debate is about a hypothetical scalable PC architecture (low to high end). We've chewed over that topic often enough, even in private conversations, and when you say "next round" I honestly hope you don't mean the OpenGL-ES2.x generation, since there NV in particular is more than just late.

As for a desktop part where the power requirements from going off-chip don't matter as much - well, one advantage is that it might allow you to have cheaper RAM (or a less wide memory bus) for a given amount of performance. So it might give better performance/dollar for the consumer, and certainly would increase the ASPs of the IHV: instead of selling a $65 GPU with $35 of VRAM, they could sell a $90 GPU with $10 of VRAM. Clearly, that makes them more money.

Why do I have the feeling that it would make more sense for low-end and possibly even lower-mainstream parts? If so, then I don't see any IHV easily bothering unless they can effectively scale an implementation from top to bottom.

Also, it is important to realize you don't need a truckload of eDRAM or to use tiling. If you're smart, you can use compression for what you're writing to eDRAM too. So, say that your framebuffer is 40MiB and you have ~4:1 color compression with MSAA. You could just use 10MiB of eDRAM, and in the worst case you're caching 1/4th of the full color information, saving 25% bandwidth. In the best case, 4:1 compression is achieved and everything fits into eDRAM, saving 100% framebuffer bandwidth.
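
A quick back-of-envelope check of those numbers (purely illustrative, colour traffic only, and assuming the eDRAM always holds the first chunk of each tile's data):

Code:
# Hypothetical 40 MiB framebuffer cached by 10 MiB of eDRAM, as above.
FB_MIB    = 40.0
EDRAM_MIB = 10.0

def bw_saved(avg_compression):
    """avg_compression = compressed size / uncompressed size (0..1)."""
    touched  = FB_MIB * avg_compression          # MiB of traffic actually needed
    off_chip = max(touched - EDRAM_MIB, 0.0)     # whatever doesn't fit on die
    return 1.0 - off_chip / touched

for c in (0.25, 0.5, 1.0):                       # best case .. worst case
    print(f"avg compression {1/c:.0f}:1 -> {bw_saved(c)*100:.0f}% saved")
# -> 100%, 50%, 25%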

As Mintmaster points out, I too would figure that the compression logic would add to the final budget.

eDRAM is a lot more attractive once you realize all past and current designs are awfully naive and you could be way smarter about it.

IMHLO (highlight the L for layman if you please), implementing eDRAM in a desktop design would need to bring clear advantages with only minuscule, manageable disadvantages. For the time being I haven't seen anything to suggest that a healthy amount of eDRAM, enough to support today's and tomorrow's display resolutions without awkward workarounds (be it added compression or macro-tiling), wouldn't add to the final cost of a GPU.
 
Why is the EDRAM size correlated to the ROP throughput? That decision should be based on resolution and how finely the scene can be tiled (if at all).

Not directly correlated at all. If the target is roughly today's available resolutions in the desktop space, then 64 ROPs sound like huge overkill to me with merely 20MB of eDRAM (unless of course the hypothetical eDRAM is meant for something totally different, since they also suggest 1 or 2GB framebuffers).

Have a look at the hypothetical specs:

ultra low end = no eDRAM, 8 ROPs
low end= 10MB eDRAM, 16 ROPs
....
ultra high end = 20MB eDRAM, 64 ROPs
 
To make this work you need a ROP architecture that can work with non-uniformly organised memory for render targets.
I can see why you'd think that (I used to also), but no... :) Mintmaster's comment explains why this is the case, so I won't bother unless you want me to elaborate on potential implementation details.

Mintmaster said:
The problem with this idea is that it improves the best case while keeping the worst case the same. When a tile is compressed there isn't much benefit from storing it in EDRAM because it's low BW. It's the uncompressed tiles that chew up BW.
You are working on two assumptions which are, as far as I can tell, far from perfectly accurate:
1) eDRAM used this way wouldn't improve the worst-case: No, it would improve it, just by a lesser percentage. You could read three memory bursts instead of four, since one of them (even when it's for uncompressed data!) is in eDRAM. This saves 25% bandwidth in the *worst-case*.
2) The vast majority of bandwidth comes from non-compressed tiles. This is probably a gross simplification, see below.

Any smart modern architecture wouldn't just have 'compressed' and 'uncompressed' tiles. You'd most likely have several levels of compression, ideally reusing the same techniques (thus sharing silicon) applied less aggressively.

Nobody outside NVIDIA and ATI has any idea how fine or coarse these compression levels really are, but at the strict minimum I would expect you to have, say, 4:1, 2:1 and 1:1 for 4x MSAA's color buffer. I would also be surprised if there was no '3.5:1' mode (or perhaps that really is the 4:1 mode!) to handle the common case of 'nearly-perfect-but-really-not' compressibility.

So, what I suspect is that a majority of the bandwidth is taken by mildly compressed tiles, not fully uncompressed ones, which are more the exception than the rule and which are probably limited by, say, triangle setup anyway in current architectures.
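
To illustrate, with a completely made-up split of tiles across those levels (the shares below are guesses for the sake of the argument, not measured data):

Code:
# Hypothetical distribution of tiles over compression levels for a 4x MSAA
# colour buffer, and the share of framebuffer traffic each level generates.
levels = {          # compressed fraction : share of tiles (made-up numbers)
    0.25: 0.50,     # 4:1, fully compressible interior tiles
    0.50: 0.42,     # 2:1, mildly compressed edge tiles
    1.00: 0.08,     # 1:1, uncompressed
}
traffic = {frac: frac * share for frac, share in levels.items()}
total = sum(traffic.values())
for frac, t in traffic.items():
    print(f"{1/frac:.0f}:1 tiles generate {t/total*100:.0f}% of the traffic")
# With this split the mildly compressed tiles dominate, not the uncompressed ones.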

Furthermore, there is something else that might not be completely obvious. Assuming there is only exactly enough eDRAM to fit everything (i.e. save 100% framebuffer bandwidth) under maximum compression for all tiles, then the amount of bandwidth you save is always exactly this, where both the final result and the average compression are between 0 and 1:
Code:
Saved Bandwidth = eDRAM Amount / (Framebuffer Size * Average Compression)
It can be shown that 50% of the framebuffer compressing to 50% of its size (with the other half uncompressed) results in the same savings as any other way of achieving 25% overall framebuffer compression under the above rules. Thus, every tile being 25% compressed, or 75% of tiles being 33.3% compressed, results in the same bandwidth savings for a given amount of eDRAM.

Of course, that breaks down when you have more eDRAM than your framebuffer size multiplied by your best-case compression rate (unless you want to go non-uniform; ugh!) but the final results remain very impressive IMO.
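
Here's a tiny sketch that checks the equal-savings claim (under the same rules: eDRAM sized for the best-case ratio, and no tile compressing better than that):

Code:
# Two different tile mixes with the same 25% overall compression, cached by
# eDRAM sized for a 4:1 best case. Both should save the same bandwidth.
BEST = 0.25                                   # best-case compressed fraction

def fraction_saved(tile_fracs):
    on_die = sum(min(f, BEST) for f in tile_fracs)   # served from eDRAM
    total  = sum(tile_fracs)                         # total compressed traffic
    return on_die / total

mix_a = [0.5] * 50 + [1.0] * 50               # half the tiles at 2:1, half uncompressed
mix_b = [0.75] * 100                          # every tile 25% compressed
print(fraction_saved(mix_a), fraction_saved(mix_b))  # both 1/3, i.e. BEST / 0.75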

So, the real question becomes: what do you think the average compression rates are for a 1920x1200 4x MSAA HDR framebuffer in, say, Oblivion? I'd expect them to be pretty damn good, otherwise the final performance doesn't make much sense in my mind. And as a logical consequence of this and the above, I would expect eDRAM bandwidth savings under my proposed approach to be pretty damn good too.
 
I can see why you'd think that (I used to also), but no... :) Mintmaster's comment explains why this is the case, so I won't bother unless you want me to elaborate on potential implementation details.
And you've still got the problem of supporting 8 MRTs, so now the EDRAM is a drop in the ocean.

Jawed
 
You are working on two assumptions which are, as far as I can tell, far from perfectly accurate:
1) eDRAM used this way wouldn't improve the worst-case: No, it would improve it, just by a lesser percentage. You could read three memory bursts instead of four, since one of them (even when it's for uncompressed data!) is in eDRAM. This saves 25% bandwidth in the *worst-case*.
2) The vast majority of bandwidth comes from non-compressed tiles. This is probably a gross simplification, see below.
1) No need to split hairs ;) I did realize this when posting above, but the savings aren't enough, IMO. I don't think 25% of just framebuffer BW is enough to justify having so much EDRAM. I think you'd need 30MB of EDRAM to occupy only 10% of the die before this decision becomes a win, even considering your second point.
2) This is not quite the assumption I was making. I was really suggesting that, during the spans of time when you are BW limited, you often have a large percentage of pixels coming from non-compressed tiles.

In the end I guess I'm just saying EDRAM makes a lot more sense when all framebuffer traffic goes there. If not, the benefits are greatly reduced, and the cost/benefit analysis makes the decision a lot closer to a wash than blindingly obvious. When radical architecture changes entail such iffy benefits, companies don't generally go for them.
So, the real question becomes: what do you think the average compression rates are for a 1920x1200 4x MSAA HDR framebuffer in, say, Oblivion? I'd expect them to be pretty damn good, otherwise the final performance doesn't make much sense in my mind. And as a logical consequence of this and the above, I would expect eDRAM bandwidth savings under my proposed approach to be pretty damn good too.
What exactly doesn't make sense in your mind? I can't find many figures of 0xAA vs. 4xAA with HDR enabled on the web, but even if I could, there's no way that you can deduce compression rates from them.
 
And you've still got the problem of supporting 8 MRTs, so now the EDRAM is a drop in the ocean.
Of course, the number of games using 8 MRTs is a drop in the ocean too.

Arun is basically saying use the EDRAM for as many pixels as you can. If a tile is uncompressed, store the rest in memory. If you have MRTs or a big FB, use the EDRAM for a fraction of the screen.

The counterargument, of course, is that this system kills or dampens several advantages of EDRAM. If educated people aren't sold on Xenos' implementation, then they'd think this system is an outright waste of time.
 
Of course, the games using 8 MRTs is a drop in the ocean also.
I'm thinking of D3D10 deferred renderers...

Arun is basically saying use the EDRAM for as many pixels as you can.
I think the GPU should adapt to make the most efficient use of a gob of memory, not have a fixed function foisted upon it that loses any advantage it might have had as soon as the simple use case is exceeded. It doesn't appear to degrade gracefully.

Also, with ROPs destined to disappear (becoming shader programs) and the apparently urgent need for GPUs to get better at scatter, I'd argue that a gob of on-die memory needs to be adaptable to multiple types of concurrent workloads - so a fixed-function EDRAM render target buffer is short-sighted.

Jawed
 
I'm thinking of D3D10 deferred renderers...
If a DX10-only deferred renderer needs 8 MRTs, then the entire programming team should probably be fired on the spot! ;)

It's easy to create a D3D10 deferred renderer for terrain/sky/water that is, in fact, Z-only! Well, maybe stencil too, but you get the point. If you're smart, it should also be possible not to use *that* much memory for doing deferred rendering on objects too. Although it's still much more than an immediate renderer, obviously.

I think the GPU should adapt to the most efficient use of a gob of memory, not have foisted upon it a fixed function that loses any advantage it might have had as soon as the simple use-case is exceeded. It doesn't appear to degrade gracefully.
The easy solution to your worry would be to have it directly reservable and accessible via shaders for the parts you have reserved. I fail to see how that would work in DirectX, but it should be possible via OpenGL extensions, via CUDA, or in a console.
 
1) No need to split hairs ;) I did realize this when posting above, but the savings aren't enough, IMO. I don't think 25% of just framebuffer BW is enough to justify having so much EDRAM. I think you'd need 30MB of EDRAM to occupy 10% of the die before this decision becomes a win, even considering your second point.
So, are you saying it makes sense on 40nm? ;) eDRAM should be roughly 1MByte/mm2 on that node (and yes it does exist), and a 300mm2 chip is hardly out of the question.
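
Spelling out the area arithmetic (using that ~1MByte/mm2 figure, which is an estimate rather than a foundry number):

Code:
# Rough eDRAM die-area cost at the assumed 40nm density.
MB_PER_MM2 = 1.0       # assumed eDRAM density
DIE_MM2    = 300.0     # plausible high-end die size

for mb in (15, 30):
    mm2 = mb / MB_PER_MM2
    print(f"{mb} MB eDRAM ~ {mm2:.0f} mm^2, {mm2/DIE_MM2*100:.0f}% of a {DIE_MM2:.0f} mm^2 die")
# -> 15 MB ~ 5%, 30 MB ~ 10% of the die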

Also, I would be very surprised if 15MB of eDRAM would only result in 25% bandwidth savings. I would bet on 50%+ personally - of course, that isn't taking textures into consideration for example.

What exactly doesn't make sense in your mind? I can't find many figures of 0xAA vs. 4xAA with HDR enabled on the web, but even if I could, there's no way that you can deduce compression rates from them.
Theoretically, you could determine how much extra bandwidth 4x MSAA takes over 0x MSAA by reducing the memory clock to an insanely low value and adding a LOD bias that guarantees only the 1x1 mipmap is used. Performance should then roughly scale linearly with framebuffer bandwidth requirements, unless fetching vertices is more expensive than you'd expect it to be (unlikely).

Of course, I'm not aware of anyone having ever done that, but perhaps I should bother testing it soon to further consolidate my theory! :) (and it might make for some nice data for an article on (framebuffer) compression too).
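
For reference, this is the sort of theoretical colour-traffic number such a test would be trying to back out (the overdraw figure and compression rates below are pure guesses, and Z and texture traffic are ignored entirely):

Code:
# Ballpark per-frame colour traffic at 1920x1200 RGBA16F, 1x vs 4x MSAA.
W, H     = 1920, 1200
BPP      = 8             # bytes per colour sample (FP16 RGBA)
OVERDRAW = 3.0           # assumed average overdraw (guess)

def colour_mib(samples, avg_compression):
    return W * H * samples * BPP * OVERDRAW * avg_compression / 2**20

print(f"1x MSAA, no compression: {colour_mib(1, 1.0):.0f} MiB/frame")
for c in (1.0, 0.5, 0.3):
    print(f"4x MSAA, avg compression {c:.1f}: {colour_mib(4, c):.0f} MiB/frame")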
 
So, are you saying it makes sense on 40nm? ;) eDRAM should be roughly 1MByte/mm2 on that node (and yes it does exist), and a 300mm2 chip is hardly out of the question.
I think that's when you'll just begin to pass break-even, but it still doesn't make sense, especially when considering other SKUs. A quarter perf. chip with a quarter the EDRAM would be fine if the gamer played the same settings at a quarter the resolution, but not lower settings at the same or half the resolution. Then you'd have crappy RV630-like scaling.

I will admit, though, that in the long term we could very well see a scheme similar to what you're talking about. It just doesn't seem sensible for the R700 timeline.
Also, I would be very surprised if 15MB of eDRAM would only result in 25% bandwidth savings. I would bet on 50%+ personally - of course, that isn't taking textures into consideration for example.
If we're talking about 1920x1200 and FP16, that's 27MB. That's the minimum res you'd want to be competitive in at the high end. I didn't say it would average 25% savings, just that 25% of FB BW in a pretty common worst case isn't much to write home about. Remember also that you have to write out everything in the EDRAM to main memory unless you can texture from it, and then you'd need even more space for the new backbuffer you're writing to.
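
(Assuming that figure means FP16 colour plus a 32-bit Z/stencil buffer at 1x, the arithmetic works out like this:)

Code:
# 1920x1200 with RGBA16F colour and a 32-bit Z/stencil buffer, no MSAA.
W, H   = 1920, 1200
colour = W * H * 8        # 8 bytes per pixel (FP16 RGBA)
depth  = W * H * 4        # 4 bytes per pixel (e.g. D24S8)
print((colour + depth) / 1e6, "MB")   # ~27.6 MB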

Theoretically...
Theoretically we can do a lot of things, but you said the performance in Oblivion wouldn't make sense if the compression rate isn't good. What is the basis of this statement?

Even with the test you described, there's no way of determining the compression ratio. Perf would only scale linearly if you're always BW limited, which you're not, even with a low clock (think about setup-limited clumps). AA also has a bit more work from fewer empty triangles and more quads touched per triangle, along with loopback through ROPs when applicable. Maybe you could look at the change in slope of scaling with mem speed, but it's still sketchy to draw any conclusions about compression.
 
If DX10-only deferred renderer needs 8 MRTs, then the entire programming team should probably be fired on the spot! ;)
I can't think of anything else that's going to use lots of MRTs, and the common thread in DR discussions is "fitting the G-buffer into only 4 MRTs is a pain in the arse".

So, I've got no idea how many MRTs will be used in D3D10 versions of DR, but there's a shed load of extra space there going begging...

The easy solution to your worry would be to have it directly reservable and accessible via shaders for the parts you have reserved. I fail to see how that would work in DirectX, but it should be possible via OpenGL extensions, via CUDA, or in a console.
D3D is moving towards fully-programmable render target operations. And full scatter support in the shaders.

I dunno how GPU designers will attack making this stuff work. Perhaps with just a wodge of cache. Maybe they'll create some kind of DMA-list-processing architecture (like Cell SPE's). Maybe it's already in there waiting to be set free...

Whatever, an EDRAM fixed-function partial colour/Z buffer seems rather short-sighted to me.

And more general-purpose memory handling is a high priority for GPGPU stuff (though I think R600 is already quite advanced in this regard)...

Jawed
 
It seems AMD is going to introduce 45nm multicore GPUs with the R700:

http://www.fudzilla.com/index.php?option=com_content&task=view&id=4327&Itemid=1
http://www.fudzilla.com/index.php?option=com_content&task=view&id=4346&Itemid=1
http://www.fudzilla.com/index.php?option=com_content&task=view&id=4348&Itemid=1
(sorry, fudzilla source)

It's not clear to me - perhaps they are moving to a tiled rendering system, so they need to duplicate the 2D/triangle setup/clipping/etc. transistors? If not... why not just add more shading units to the same silicon die?
 
Would R600 work in any system if it had 640 shaders?
Well, it depends on the silicon die size... If it is too big, reducing the integration level could help... but by making it multicore you are really increasing the die size (well, or package size if you do 2+2 cores), heat dissipation, power consumption and production costs, so...
 
I like the concept, but the big concerns are performance scalability (2 chips being 2x as fast as 1 chip; 4 chips being 4x as fast) and software compatibility. If these issues can be addressed, the potential is pretty high for AMD/ATI. Yields on four small cores should be higher than those of one large core. R&D would be more focused, as your high-end and low-end GPUs would share the exact same DNA -- but this could be a problem for some redundant features like video acceleration (so my guess is you would see two core types, a "master" core and "bare" cores). The potential for hand-tuned design at 300M transistors (x4) is more likely than at a 1.2B transistor core. The featureset across the board would be level (no more 9200 GPUs). The economy of scale of producing basically one core should keep prices low on both the low end and the high end, potentially allowing ATI/AMD to pack more product into a board at the same price.

Over time, the migration of such cores right onto CPU packages, and eventually onto the CPU die itself, for initially low-end systems (especially the quickly growing notebook/laptop market) would mean AMD/ATI keep more of the sale, could get more OEM contracts, and would even have a marketing advantage, in that an R700 GPU would probably perform much better, relative to the market, than products like the X1150, GF6150M, etc. do.

Yet if the first two issues, performance scaling and across-the-board compatibility with multicore, aren't solved, then this could kill AMD/ATI. So I guess we wait and see.
 
The question is: will AMD/ATi be able to use these 4 cores as "ring-bus" stops?
If not, using dumb AFR will mean NV will kill 'em.

I think that is basically the design decision behind R600.
Although the scale and utilization of the controller in R600 are out of proportion, that same controller would fit snugly in a situation where it is controlling multiple bus stops, both internal and external.

This, for me, seems to be the only reason ATI pushed through with its design of R600/R700: not because of the problems it faced this year, but because of the future potential of the ring-bus controller.

The fact that ATI went this route (including shader-based AA) suggests to me that they want as few dependencies as possible on on-die logic that would make one single "processor" bulky, or dependent on the other processors.

I have no idea how well it would scale to the low end, but seeing the 670's recent 8xAA numbers, its potential under high res/high AA would seem to be enormous.
 