Sir Eric Demers on AMD R600

Very fabulous interview and responses. Not too sure how to word some questions; I will do my best considering all the professionals here.

The R600 design looks to go beyond DX10 or pixel crunching, at least to me, and I just can't help but ask.

- Were other computational or compute-type problems looked at and incorporated into the R600 design? For example, physics calculations in conjunction with graphics, to make use of the vastly multi-threaded design and the high bandwidth available?

- Maybe other types of uses, such as raytracing/GI calculations for use in render farms -- would or could R600 even be efficient at doing this?

- Maybe it's just better to ask: what other areas, besides graphics and video, influenced the R600 design, and what else can it do?

Any elaboration on any of these questions would be most helpful if answered, thank you.

Certainly the field of "GPGPU" has been around for some time and keeps on expanding. While it was not really there yet when R600 was being conceived, it did influence it in some aspects, such as the read/write cache, precision requirements, different kinds of shader types, etc...

A lot of the burden is still more on the SW side than on the HW side, but I expect R6xx to be good at all these things as well.
 
Maybe I am being a little incorrect in my usage of terminology here, not to mention probably not understanding something. I had thought AA was primarily memory bandwidth bound, with the extra texture data fetches, and not particularly decided by fillrate. In R600's case this then has the extra overhead of shader execution time, which is where I had thought the performance penalty comes in when doing 'simple' MSAA compared to the G80 series.

Is that right?

It's a mix -- With good compression or with enough BW, it pushes back to the engine. With poor compression or low BW, it's a bandwidth hog. Consequently, both BW and "Fillrate" contribute to the overall performance.
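To put rough numbers on that mix, here is a back-of-envelope sketch (mine, not sireric's) of raw 8xMSAA sample traffic at a high resolution under different effective compression ratios. The overdraw and read-modify-write factors are illustrative assumptions, and the ~106 GB/s reference point is simply a 512-bit bus at the 825 MHz GDDR3 mentioned in the next answer.

```python
# Back-of-envelope: raw 8xMSAA colour + Z sample traffic at 2560x1600.
# Overdraw and read-modify-write factors are illustrative guesses, not
# measured data; the point is only how compression moves the bottleneck.

width, height     = 2560, 1600
samples           = 8            # 8x MSAA
bytes_per_sample  = 4 + 4        # RGBA8 colour + 32-bit Z
overdraw          = 3            # assumed average overdraw
rw_factor         = 2            # read-modify-write of the samples
fps               = 60

per_frame = width * height * samples * bytes_per_sample * overdraw * rw_factor

for ratio in (1, 2, 4, 8):       # effective lossless compression ratio
    gbs = per_frame * fps / ratio / 1e9
    print(f"{ratio}:1 compression -> ~{gbs:5.1f} GB/s of sample traffic")

# Against roughly 106 GB/s total (512-bit bus at 825 MHz GDDR3), 1:1 is a
# "bandwidth hog", while 4:1 or better pushes the limit back to the engine.
```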
 
Eric, will there be a performance difference between GDDR4 and GDDR3, given the huge bandwidth the R600 already has?

Assuming the same engine speed, a 925~950 MHz GDDR3 gives about the same real BW as a 1 GHz GDDR4 (taking latency and various other factors into account). So, going from 825 MHz memory to 1 GHz or higher will make some difference, but it's on the order of 10~15% at most, and that would be only at high resolution with AA and lots of texturing.

The other thing that could potentially make more of a difference is a larger frame buffer. For most of today's apps, 512MB is fine. But there are a few where 1GB improves performance. I don't have any deltas in mind, but I've seen cases from 256MB -> 512MB where performance doubles. I would assume in some "bad" apps, 512MB -> 1GB might make a big difference.

However, overall, I don't expect it to make that much difference with today's games.
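For what it's worth, here is the raw arithmetic behind that 10~15% estimate. The 512-bit bus width is an assumption on my part taken from R600's published specs, and the latency effects sireric mentions are ignored, so these are upper bounds on raw bandwidth only.

```python
# Raw bandwidth on a 512-bit (64-byte) bus; DDR means 2 transfers per clock.
# Bus width assumed from R600's published specs; latency effects ignored.

def bandwidth_gbps(mem_clock_mhz: float, bus_bytes: int = 64) -> float:
    return mem_clock_mhz * 1e6 * 2 * bus_bytes / 1e9

for label, clock in [("825 MHz GDDR3", 825), ("950 MHz GDDR3", 950), ("1.0 GHz GDDR4", 1000)]:
    print(f"{label:>14}: ~{bandwidth_gbps(clock):6.1f} GB/s")

# 825 MHz -> ~105.6 GB/s, 1 GHz -> ~128.0 GB/s: roughly 21% more raw
# bandwidth, which shrinks to the quoted 10~15% (at most) in real use once
# latency and other factors are taken into account.
```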
 
Coming late to this thread, I'll just add my thanks to the other members, who have already asked most of the questions I had left after reading the article, and to sireric for answering (most of) them.

One thing sticks out to me, though, that has not been addressed. Somehow, the answers given make R600 (and its derivative designs) sound like they are supposed to be more than what is readily apparent: what, if any, design considerations have gone into applications for purposes other than consumer 3D graphics? And how much importance, if any, do such (hypothetical) areas of consideration have going forward compared to the 'traditional focus' (rendering paradigms) of a GPU?

There are two aspects to consider. One is DX10 -- today there are but a few DX10 apps, and most of those don't really make use of that much DX10. But we built a chip to work on that, and that comes at a significant cost. So, a lot of the chip, today, stands idle when running games.

As for non-graphics application silicon, there is some, as I've answered in the original Q&A, but we tend to put in things that have multiple purposes and don't cost too much. I can think of only a few places where silicon was added and that feature was not usable by graphics applications.
 
I'm probably going to take a break from answering all (or most) questions at this point and from checking in every few hours. I'll still try to kick in when something obvious and easy (yes, I'm lazy) comes up (I'll try to check daily), but I think that I've answered a lot of questions, and I'm starting to see repeating patterns.

Thanks for your time and attention. Hope it helped explain things.
 
I dunno about the ALU clock cost of shader AA. I'm thinking that decoding the compression tags and un-packing the compressed samples is fairly costly. Hmm, comparing box, narrow and wide-tent AA resolve should give some indication of the ALU clock cost...
I don't think any unpacking is performed in the shader, and neither is sRGB<->lRGB conversion. All that leaves to the shader is calculating a weighted sum of samples, i.e. up to 16 MADs (vec3 for the visible framebuffer). That may be a dozen cycles for 8x + wide tent. Edge detection is not a lot more complicated: compare all the samples in a pixel; use that value if they're identical, otherwise apply a tent filter.

However, going from box to tent filter should require significantly more bandwidth unless you can cache a lot of samples.
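As a minimal sketch of the kind of weighted-sum resolve being described (assuming, as above, that gamma conversion and sample fetching happen outside the shader ALUs), something like the following; the sample values and weights here are made up purely for illustration.

```python
# Toy sketch of a shader-style MSAA resolve as a weighted sum of samples.
# Weights and sample data are invented; in a real tent filter the weights
# come from the filter footprint and neighbouring pixels contribute too.

def resolve_pixel(samples, weights):
    """samples: list of (r, g, b) in linear space; weights sum to 1.0."""
    out = [0.0, 0.0, 0.0]
    for (r, g, b), w in zip(samples, weights):
        out[0] += w * r          # one MAD per channel per sample,
        out[1] += w * g          # i.e. up to 16 vec3 MADs for 8x + wide tent
        out[2] += w * b
    return tuple(out)

# Example: plain 4x box filter (equal weights).
samples = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(resolve_pixel(samples, [0.25] * 4))   # -> (0.5, 0.0, 0.0)
```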
 
It was already confirmed in another thread that the register file was split amongst the clusters.

I was curious about whether and how the L2 cache is also distributed or shared by all the clusters.

Is it divided four ways with 1/4 going to each cluster, or is it connected like the sampling units and is divided four ways with 1/4 of each SIMD group being able to access 1/4 of the L2?

Is there a direct relationship between the number of sampler units and the amount of cache, or can the amount of L2 be tuned separately?
 
I don't think any unpacking is performed in the shader,
I should have said "decompression". Did you see the patent document I posted:

http://forum.beyond3d.com/showthread.php?p=1021771#post1021771

The compression system is three-level, though I do wonder if it's always three-level or if it depends on the number of samples or the colour depth.

and neither is sRGB<->lRGB conversion. All that leaves to the shader is calculating a weighted sum of samples, i.e. up to 16 MADs (vec3 for the visible framebuffer). That may be a dozen cycles for 8x + wide tent. Edge detection is not a lot more complicated: compare all the samples in a pixel; use that value if they're identical, otherwise apply a tent filter.
:LOL: I was over-zealous, huh?

However, going from box to tent filter should require significantly more bandwidth unless you can cache a lot of samples.
Can't wait until the driver murk clears.

Jawed
 
It was already confirmed in another thread that the register file was split amongst the clusters.

I was curious about whether and how the L2 cache is also distributed or shared by all the clusters.

Is it divided four ways with 1/4 going to each cluster, or is it connected like the sampling units and is divided four ways with 1/4 of each SIMD group being able to access 1/4 of the L2?

Is there a direct relationship between the number of sampler units and the amount of cache, or can the amount of L2 be tuned separately?

L1 is distributed but identical between SIMDs (each gets the same copy, so only one makes external requests). L2 is not distributed among SIMDs -- It's unified and available to all equally and fully.
 
L1 is distributed but identical between SIMDs (each gets the same copy, so only one makes external requests). L2 is not distributed among SIMDs -- It's unified and available to all equally and fully.
So does that mean that L2 is "local" to one ring stop? If so, is that a dedicated stop or is it one of the 4 memory channel stops, or the PCI Express stop?

Jawed
 
L1 is distributed but identical between SIMDs (each gets the same copy, so only one makes external requests). L2 is not distributed among SIMDs -- It's unified and available to all equally and fully.

Just for clarity, are you saying that the L1 is distributed between SIMDs, so that each one has its own local L1?

By SIMD, do you mean a 16-processor array? Various places have used different terminology, so I'm unclear on the correct version.

The way I read what you said, each of the 4 16-processor arrays has its own copy of a single L1 data set.

Have I interpreted what you said correctly?

That would mean all 4 SIMDs have the same view of the L1 contents?
Does that affect how R600 can partition work?
Instead of having L1 contents specific to one SIMD, it has to share L1 capacity with threads that wind up working on separate SIMDs.
On the flip side, does that mean that any SIMD can service any thread transparently?

The single L1 would simplify the L2, since only one eviction to the L2 would ever occur in a cycle.

Otherwise, the L2 would have to be at least quad-ported just for the SIMD caches, correct?

Is the L2 set-associative?
 
Just for clarity, are you saying that the L1 is distributed between SIMDs, so that each one has its own local L1?
I'll do my best Eric impersonation here, given what I know. Yep, to this one.

By SIMD, do you mean a 16-processor array? Various places have used different terminology, so I'm unclear on the correct version.
Yep, a SIMD in R600 is the 16 ALU processor block (each 5-way scalar).

The way I read what you said, each of the 4 16-processor arrays has its own copy of a single L1 data set.
Yep, L1 is mirrored.

Is the L2 set-associative?
It's fully associative.
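As a conceptual illustration only (my reading of the answers above, not a description of the actual hardware), the mirrored-L1 / unified-L2 arrangement behaves roughly like the toy model below: every SIMD sees the same L1 contents, and a miss results in a single external request that fills the shared L2 and the mirrored L1.

```python
# Conceptual toy model of the cache arrangement described above. This is an
# interpretation for illustration, not a description of the real design.

class ToyCacheHierarchy:
    def __init__(self, num_simds=4):
        self.l1_mirror = {}          # one set of contents, visible to all SIMDs
        self.l2 = {}                 # unified, fully shared by all SIMDs
        self.num_simds = num_simds

    def read(self, simd_id, address, fetch_from_memory):
        assert 0 <= simd_id < self.num_simds
        if address in self.l1_mirror:            # every SIMD hits the same L1 contents
            return self.l1_mirror[address]
        if address not in self.l2:               # only one external request is made,
            self.l2[address] = fetch_from_memory(address)  # whichever SIMD missed
        self.l1_mirror[address] = self.l2[address]
        return self.l1_mirror[address]

caches = ToyCacheHierarchy()
print(caches.read(0, 0x100, lambda a: f"texel@{a:#x}"))   # miss -> memory -> L2 -> L1
print(caches.read(3, 0x100, lambda a: f"texel@{a:#x}"))   # hit in the mirrored L1
```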
 
L1 is distributed by identical between SIMDs (each gets the same copy, so only one makes external requests).
Hmm, avoiding the coherency burden?
So it's like an inclusive design, but in "horizontal" order... :rolleyes:

Is this the same case for R500 family?

Are the L1 and L2 arrays scaled down for the mid- and low-end parts, and if so, in what proportions?
 
Thanks for the very detailed answer Jawed, I think I get it now.

AA resolve costs 25.9% of the ALU capacity of R600. That's a lot! But that still leaves ~350 GFLOPs for all other shading -- approximately the same as the 8800 GTX's total available GFLOPs...

My only comment would be that out of the remaining ALU capacity we still have to pull the anisotropic filtering workload before we get to the capacity available for actual texture rendering. Do we have a handle on how hefty an ALU load it demands?
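For reference, here is the arithmetic that appears to sit behind the 25.9% and ~350 GFLOPs figures quoted above, reconstructed from the worst-case assumptions discussed further down (50 ALU clocks per pixel, 2560x1600 at 60 fps) and R600's published 742 MHz clock and 64 five-wide ALU blocks.

```python
# Reconstruction of the worst-case estimate quoted above. The 742 MHz and
# 64-ALU-block figures are R600's published specs; the 50 clocks per pixel
# is the deliberately pessimistic guess discussed later in the thread.

pixels         = 2560 * 1600       # render target
fps            = 60
clocks_per_pix = 50                # worst-case guess for 8xMSAA resolve

resolve_clocks = pixels * fps * clocks_per_pix       # ~12.3e9 per second
alu_capacity   = 64 * 742e6                          # 64 blocks * 742 MHz ~ 47.5e9

print(f"resolve cost: {resolve_clocks / alu_capacity:.1%} of ALU capacity")   # ~25.9%

total_gflops   = 320 * 2 * 742e6 / 1e9               # 320 SPs * MAD ~ 475 GFLOPs
print(f"left over: ~{total_gflops * (1 - resolve_clocks / alu_capacity):.0f} GFLOPs")  # ~352
```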
 
Thanks for the very detailed answer Jawed, I think I get it now.
Well I hope you've seen that other people's answers give a much fuller picture.

My only comment would be that out of the remaining ALU capacity we still have to pull the anisotropic filtering workload before we get to the capacity available for actual texture rendering. Do we have a handle on how hefty an ALU load it demands?
My average of 50 clocks per pixel for 8xMSAA resolve is way too much, it seems. It could be an average of 5 or 10, somewhere in that vicinity.

Separately, though, anisotropic filtering doesn't directly consume ALU clocks. This is because the TUs can do all this work independently: they can work out the texel addresses and they can fetch and progressively filter the source texels, depending on the degree of AF specified, without having to rely upon the ALU pipeline.

So I'm not sure what you're asking... If you're referring to how I worked out the bilinear filtering loss of 5.2%, well, it seems that loss is also going to scale down by a factor of 5 or 10, e.g. 0.5 to 1%, because the numbers I was using were too high.

That filtering loss occurs because while the ALUs and TUs are busy doing AA resolve they can't be used for anything else within the game application.

Jawed
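Rescaling the same worst-case arithmetic with the revised 5-10 clocks per pixel gives roughly the following (same 2560x1600 at 60 fps target and the same assumed 64 ALU blocks at 742 MHz as before).

```python
# Rescaling the earlier worst-case estimate with the revised clock counts.

alu_capacity = 64 * 742e6          # assumed R600 ALU capacity, as above

for clocks_per_pix in (5, 10, 50):
    cost = 2560 * 1600 * 60 * clocks_per_pix / alu_capacity
    print(f"{clocks_per_pix:2d} clocks/pixel -> {cost:.1%} of ALU capacity")

# 5 -> ~2.6%, 10 -> ~5.2%, versus ~25.9% for the original 50-clock guess.
```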
 
So the "overhead" is mostly on the ALU units. Let's do a worst-case guesstimate: say an average of 50 ALU cycles per pixel to perform an 8xMSAA resolve for a 2560x1600 render target at 60fps:
50 ALU clocks is ridiculous. Even the tent filter is simply a weighted average (after gamma conversion, which is provided by the texture units AFAIK), i.e. 8 vec4 MAD ops. I think fetching the values into the shader is the bigger cost, and even then it shouldn't take much out of the frame time.

EDIT: okay, it seems you've realized that.

I'll agree, however, that the sample rate during rendering is where G80's big MSAA strength is. 70 Gsamples/s (for the GTX) is just nuts. Enough for 60fps on a 10MP backbuffer with 16xAA and 7x overdraw. :oops: Seems like overkill, but such fillrate can still give a few percent advantage at high res, I guess.
 
50 ALU clocks is ridiculous. Even the tent filter is simply a weighted average (after gamma conversion, which is provided by the texture units AFAIK), i.e. 8 vec4 MAD ops. I think fetching the values into the shader is the bigger cost, and even then it shouldn't take much out of the frame time.

EDIT: okay, it seems you've realized that.
Yep, it really was an over-egged worst case, lol, I was being super-pessimistic. Still it had a happy ending, don't you think?!

I'll agree, however, that the sample rate during rendering is where G80's big MSAA strength is. 70 Gsamples/s (for the GTX) is just nuts.
55.2 G samples/s. Double that for z-only.

Jawed
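For reference, the corrected 55.2 Gsamples/s follows directly from the 8800 GTX's published specs, assuming 4 colour samples per ROP per clock (and twice that for z-only).

```python
# Where the 55.2 Gsamples/s figure comes from: 8800 GTX published specs,
# 24 ROPs at 575 MHz, assumed 4 colour samples per ROP per clock.

rops, core_mhz, samples_per_clock = 24, 575, 4

colour_rate = rops * core_mhz * 1e6 * samples_per_clock / 1e9
print(f"colour: ~{colour_rate:.1f} Gsamples/s")       # 55.2
print(f"z-only: ~{colour_rate * 2:.1f} Gsamples/s")   # 110.4, "double that for z-only"
```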
 