Sir Eric Demers on AMD R600

Eric, what's the trend for thread size? We have trends for bandwidth and ALU:TEX ratio, etc., and I'm curious what's going to happen with thread size.

R520 set a nice standard with 16 pixels, but since then it's gotten "worse". Are we looking down the barrel of an ever-increasing thread size?

Does it matter much?

Jawed

I don't think ours will get any bigger than 64. It's a fight between granularity loss and, really, cost. The finer grain things are, the more you need to sequence and so your "control" aspect grows in complexity; as well, you will need more threads, since each thread will now hide less latency, and there's a cost (proportional to the number of threads) to hold all that state. The overall datapath structure grows also, to allow for so much independent SIMD work. The larger granularity does tend to improve cache coherency also, up to a point. The downside is that the larger the granularity, the worse the branching cost of going separate ways.

As of now, 16 vs 48 vs 64 doesn't really lead to any significant performance difference in real apps. You can make some synthetic cases that show a difference, but those have to be constructed.
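To put a rough number on the branching side of that trade-off, here is a minimal sketch (mine, not AMD's; the branch probability and per-side cycle counts are invented illustration values) of how the expected per-element cost of an if/else grows with thread size, when a whole thread must execute both sides as soon as any of its elements diverge:

```python
import random

def expected_branch_cost(thread_size, p_taken, cost_if, cost_else, trials=100_000):
    """Average per-element cost of an if/else when a thread of `thread_size`
    elements runs BOTH sides as soon as its elements diverge (branch-and-mask).
    All numbers are made-up illustration values, not R600 figures."""
    total = 0
    for _ in range(trials):
        taken = [random.random() < p_taken for _ in range(thread_size)]
        if all(taken):
            total += cost_if               # coherent: only the 'if' side is run
        elif not any(taken):
            total += cost_else             # coherent: only the 'else' side is run
        else:
            total += cost_if + cost_else   # divergent: both sides run, results masked
    return total / trials

for size in (1, 4, 16, 48, 64):
    print(size, expected_branch_cost(size, p_taken=0.5, cost_if=20, cost_else=20))
```

With incoherent data the per-element cost already saturates near cost_if + cost_else by a thread size of 16 or so, which is one way to read the point that 16 vs 48 vs 64 rarely shows up outside constructed synthetic cases.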
 
It can do either. We can predicate out instructions, or take full separate branches and mask out. The predication must be specified in the shader, otherwise you'll get the full branches.
Presumably predication can only be specified in the shader by the driver compiler - there isn't anything in D3D that allows the programmer to specify this, is there?

Presumably as far as the driver/hardware is concerned, R600 is much like Xenos in this respect, where the sequencer can identify that an entire thread has coherent branching, and so the sequencer can skip instructions (or exit a loop).

Xenos gives the programmer a chance to specify branching behaviour by writing SEQ instructions. Do you think there's much chance that a future version of D3D will provide this capability too?

Jawed
 
As of now, 16 vs 48 vs 64 doesn't really lead to any significant performance difference in real apps. You can make some synthetic cases that show a difference, but those have to be constructed.
So even in GS and VS code there's no notable difference?

Is this primarily because once you've got multiple objects in a thread (as opposed to the nominal ideal of a thread size of 1), the performance cliff is so great that these various thread sizes only constitute where you land on the slope of rubble below?

Jawed
 
Presumably predication can only be specified in the shader by the driver compiler - there isn't anything in D3D that allows the programmer to specify this, is there?

I have not looked at HLSL -- From a HW perspective, either can be selected. But perhaps a heuristic picks, at the compiler level. I haven't checked.

Presumably as far as the driver/hardware is concerned, R600 is much like Xenos in this respect, where the sequencer can identify that an entire thread has coherent branching, and so the sequencer can skip instructions (or exit a loop).

Yep.

Xenos gives the programmer a chance to specify branching behaviour by writing SEQ instructions. Do you think there's much chance that a future version of D3D will provide this capability too?

Jawed

Good question. I think it's likely. Should be available through CTM.
 
So even in GS and VS code there's no notable difference?

There are a lot more reasons to do a smaller vector size for VS/GS. But historically and with current apps, there's been either little VS branching or even draw sizes; or the apps have been dominated by pixel processing (with large vertex:pixel ratios), and so vertex performance wasn't a big deal. But that is something likely to push down granularity in the future, more so than the pixel side.

Is this primarily because once you've got multiple objects in a thread (as opposed to the nominal ideal of a thread size of 1), the performance cliff is so great that these various thread sizes only constitute where you land on the slope of rubble below?

Jawed

I'm not sure I fully understand the question. If I infer the question, no, it's more the opposite -- The performance loss due to granularity isn't that big, and the thread sizes are so small to start with that the differences between these sizes aren't significant.
 
I'm not sure I fully understand the question. If I infer the question, no, it's more the opposite -- The performance loss due to granularity isn't that big, and the thread sizes are so small to start with that the differences between these sizes aren't significant.
Thanks, you untangled my question :D

Jawed
 
In CTM, R580 is quite happy to allocate 128 vec4 GPRs per "pixel", far higher than the SM3 requirement of 32. As far as I can tell R600 uses a virtualised register file. Presumably this means that the D3D10 requirement for 4096 vec4 GPRs is "trivial" for R600. R600 can assign GPRs up to the extent of VRAM + available system RAM, presumably?

Can GPRs and memory addresses overlap in some way; can GPRs be aliased by logical address? i.e. can a block of memory be accessed "locally" by a shader in terms of GPR ID or "globally" in terms of a memory address? e.g. could a shader produce a huge block of results and another shader treat that block of memory as GPRs?

Presumably, also, there could be situations when contexts are switched, in which case GPRs could be paged out of VRAM into system RAM.

Really, all I'm doing here is thinking about the fluidity of CTM. One of the things that concerns me about the CUDA threading model is that programmers seem to be forced to work around the memory system, performing their own memory management in effect. I'm curious if one of the design aims for R600 was to relieve the programmers, giving them "latency-hidden" memory management across and within threads.

Jawed
 
Presumably predication can only be specified in the shader by the driver compiler - there isn't anything in D3D that allows the programmer to specify this, is there?
Predication has been around for a while. It's in SM 3.0 and was also available in PS 2.x if you exposed the PREDICATION cap bit. To use it you set the predicate register with "setp_comp dst, src0, src1" where comp is the comparison you want.
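For what it's worth, here is a conceptual sketch (not hardware or driver code; the lane values and operations are arbitrary) of the difference between per-lane predication, where both sides are evaluated and the predicate just selects the result, and branch-and-mask, where a fully coherent thread can skip one side entirely:

```python
# Per-lane predication: every lane evaluates both expressions; the predicate
# (the software analogue of the register written by setp_*) selects which
# result is kept. There is no control flow at all.
def predicated(xs, ys):
    pred = [x > y for x, y in zip(xs, ys)]            # conceptually "setp_gt p0, x, y"
    then_val = [x * 2 for x in xs]                    # executed for every lane
    else_val = [y + 1 for y in ys]                    # executed for every lane
    return [t if p else e for p, t, e in zip(pred, then_val, else_val)]

# Branch-and-mask: if the whole thread agrees, the sequencer can skip one side;
# only when lanes diverge does it run both sides with a write mask.
def branch_and_mask(xs, ys):
    pred = [x > y for x, y in zip(xs, ys)]
    if all(pred):
        return [x * 2 for x in xs]                    # 'else' side skipped entirely
    if not any(pred):
        return [y + 1 for y in ys]                    # 'if' side skipped entirely
    return [x * 2 if p else y + 1 for p, x, y in zip(pred, xs, ys)]

print(predicated([3, 1, 5], [2, 4, 0]))       # [6, 5, 10]
print(branch_and_mask([3, 1, 5], [2, 4, 0]))  # [6, 5, 10]
```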
 
Very fabulous interview and responses. Not too sure how to word some questions; I'll do my best considering all the professionals here.

R600's design looks to go beyond DX10 or pixel crunching, at least to me, and I just can't help but ask.

- Were other calculational or computing-type problems looked at and incorporated into the R600 design? For example, physics calculations in conjunction with graphics, to use the vastly multi-threaded design and high bandwidth available?

- Maybe other types of uses, such as raytracing/GI calculations for use in render farms; would or could R600 even be efficient doing this?

- Maybe it's just better to ask: what other areas, besides graphics and video, influenced the R600 design, and what else can it do?

Any elaboration on any of these questions would be most helpful; thank you.
 
Well, case temperature has a huge effect on the fan speed and, consequently, fan noise. We fixed up the drivers to drop the sound level in 2D to pretty quiet, assuming a reasonably cool fan. But I agree that the fan speed is a little higher than our X1950XTX's. But it's nowhere near that bad either. New boards will have different cooling solutions, but HD 2900XT boards are the way they are.

We kicked it around internally and came to the conclusion that it would be a huge improvement in such situations for the fan ramping to be much more graduated. Like, say, 1% rpm change per sec, or something like that. It's sudden large changes in fan speed as much as anything else that really make themselves noticeable. It seems like y'all have the infrastructure in place for that; it'd be a matter of tweaking the driver code. Presumably it'd need some kind of emergency path where if chip temp hits a certain spot where actual danger is dead ahead then forget all that and go high speed.
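Purely as an illustration of the suggestion (invented thresholds and rates; as sireric notes later in the thread, the actual controller has only a limited set of settings), the gradual ramp plus emergency path could look something like this:

```python
RAMP_STEP = 1.0         # max % fan-duty change per second (the "1% per sec" idea)
EMERGENCY_TEMP = 100.0  # invented threshold: above this, forget the ramp entirely

def next_fan_duty(current_duty, target_duty, chip_temp_c):
    """Move the fan toward its target by at most RAMP_STEP per call (per second),
    unless the chip is hot enough that we jump straight to full speed."""
    if chip_temp_c >= EMERGENCY_TEMP:
        return 100.0                                   # emergency path
    delta = target_duty - current_duty
    delta = max(-RAMP_STEP, min(RAMP_STEP, delta))     # clamp the per-step change
    return current_duty + delta

# e.g. going from 30% to 60% duty takes ~30 seconds instead of one sudden jump
duty = 30.0
for _ in range(5):
    duty = next_fan_duty(duty, 60.0, chip_temp_c=80.0)
print(duty)   # 35.0
```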
 
Dual-card configs still have high fan speeds on slave cards, as the slave card never goes to 2D speeds ATM, just 3D speeds. My system is quiet, except for the slave, which drives me nuts.

I also wonder how this may affect the lifespan of the card...:oops:
 
In CTM, R580 is quite happy to allocate 128 vec4 GPRs per "pixel", far higher than the SM3 requirement of 32. As far as I can tell R600 uses a virtualised register file. Presumably this means that the D3D10 requirement for 4096 vec4 GPRs is "trivial" for R600. R600 can assign GPRs up to the extent of VRAM + available system RAM, presumably?

Conceptually yes. However, DX10 has some restrictive items associated with fetching outside the 4K range (clear clamp rules on <0 and > 4093). As well, direct access to those registers only gives the need for 12b addressing. I'd have to check if we allow indirect addressing outside of the 4K, but it would not be possible outside of CTM to access that.

Can GPRs and memory addresses overlap in some way; can GPRs be aliased by logical address? i.e. can a block of memory be accessed "locally" by a shader in terms of GPR ID or "globally" in terms of a memory address? e.g. could a shader produce a huge block of results and another shader treat that block of memory as GPRs?
When using "virtual" GPRs, that block is private for each data element, but it is in the GPU addressable memory, and could be made to overlap (But not in DX). You need to be careful about accessing other's memory, since execution is out of order. So you would need to synchronize the shaders -- Possibly by using different prims and forcing a flush (A rather heavy synchronization method, but an easy one to understand).

Presumably, also, there could be situations when contexts are switched, in which case GPRs could be paged out of VRAM into system RAM.
Yep. Hopefully they would simply be kept in VRAM, but it's an OS thing to decide.
Really, all I'm doing here is thinking about the fluidity of CTM. One of the things that concerns me about the CUDA threading model is that programmers seem to be forced to work around the memory system, performing their own memory management in effect. I'm curious if one of the design aims for R600 was to relieve the programmers, giving them "latency-hidden" memory management across and within threads.
Jawed

The aim of the shader model is to offer latency-free operation, and offer reasonable resources with that. I believe that CTM will continue to offer at least 128 GPRs per shader for R600, and we continue to offer cliff-free performance with GPR usage.
 
Dual-card configs still have high fan speeds on slave cards, as the slave card never goes to 2D speeds ATM, just 3D speeds. My system is quiet, except for the slave, which drives me nuts.

I also wonder how this may affect the lifespan of the card...:oops:

I think that there's a driver update coming that drops the 2D fan speed. Also, make sure that the cards aren't right on top of each other -- There is required spacing between the cards (I think we even ship spacers with cards). Otherwise, one is too hot. Great PC form factors :(
 
Oh, believe me, I hear ya on that one. XBX2 resulted in 90C+ load temps on the upper card, 85C+ on the bottom. Now running in a P5WDH/P5K, load temps hit 76 or so on each card.:D Even with the Phys-X between them.:rolleyes:


I'm not sure if all systems are affected by the problem, but Vista32 definitely is. :LOL: This is just one of many niggles left in these cards...like no image in full-screen 3D if an app is started after the screensaver kicks in.:rolleyes:


Um, about these spacers...I have had Asus, Sapphire, HIS, and PowerColor cards, and none of them came with this "spacer"...:oops: Are we missing something?

It was quite obvious that these cards needed decent airflow...hence the 12-inch card w/cooler...it almost seems I'd rather have those coolers on my cards! :LOL:


There are still quite a few issues with these cards...I'm well versed enough in this to realize that almost every single one is driver-related. Given that, it seems that R600 is plagued by fresh platform blues...
 
Which is why I said that it's not the resolve step holding the R600 back? Or am I misunderstanding you?

Maybe I am being a little incorrect in my usage of terminology, not to mention probably not understanding something here. I had thought AA was primarily memory bandwidth bound, with the extra texture data fetches, and not particularly decided by fillrate. In the R600's case this then has the extra overhead of shader execution time, which is where I had thought the performance penalty comes in when doing 'simple' MSAA compared to the G80 series.

Is that right?
 
Eric, will there be a performance difference between GDDR4 and GDDR3, given the huge bandwidth the R600 already has?
 
Coming late to this thread, I'll just add my thanks to the other members, who have already asked most of the questions I had left after reading the article, and to sireric for answering (most of) them.

One thing sticks out to me, though, that has not been addressed. Somehow, the answers given make R600 (and its derivative designs) sound like they are supposed to be more than what is readily apparent: what design considerations, if any, have gone into applications for purposes other than consumer 3D graphics? And how much importance, if any, do such (hypothetical) areas of consideration have going forward compared to the 'traditional focus' (rendering paradigms) of a GPU?
 
Maybe I am being a little incorrect in my usage of terminology, not to mention probably not understanding something here. I had thought AA was primarily memory bandwidth bound, with the extra texture data fetches, and not particularly decided by fillrate. In the R600's case this then has the extra overhead of shader execution time, which is where I had thought the performance penalty comes in when doing 'simple' MSAA compared to the G80 series.

Is that right?
The texture fetches required to do shader AA aren't using "extra" bandwidth as such. The data fetched to perform a hardware AA resolve is the same as that fetched to perform a shader AA resolve.

This may not be 100% true, because "something" may be happening with the compression tags, in order to support shader AA resolve. Whatever that something is may well be a bandwidth overhead. But then again, it might not.

e.g. the compression tags may be data that gets dumped into the 8KB R/W memory cache (per SIMD) to be fetched by the AA-resolve shader as it progresses across the render target. Judging by that patent document it's possible that the compression data is located in two places: in an on-die tag table and as per-tile status stored in VRAM. Anyway, if the R/W cache is used, then no extra bandwidth is consumed.
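Conceptually, the resolve shader is doing something like this for each output pixel (a naive box-filter sketch of my own; R600's real resolve shaders also have to detect and unpack compressed tiles, which is where the extra ALU work comes in):

```python
def box_resolve(samples):
    """Naive box-filter MSAA resolve for one pixel: average its N sub-samples.
    `samples` is a list of (r, g, b) tuples, e.g. 8 of them for 8xMSAA.
    Hardware additionally deals with compression tags and tile formats."""
    n = len(samples)
    return tuple(sum(channel) / n for channel in zip(*samples))

# e.g. a 2-sample pixel straddling an edge between red and blue:
print(box_resolve([(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]))   # (0.5, 0.0, 0.5)
```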

So the "overhead" is mostly on the ALU units. Let's do a worst-case guesstimate: say an average of 50 ALU cycles per pixel to perform an 8xMSAA resolve for a 2560x1600 render target at 60fps:
  • 4 ALU clocks per pixel drawn on screen (64 ALU pipes, 16 RBEs)
  • theoretical fillrate of 742MHz * 16 RBEs = 11.872 G pixels/s
  • equals 47.488 G ALU clocks per second capacity
  • 2560*1600 = 4096000 pixels
  • at 60 fps that's 245.76 M pixels/s
  • AA resolve at 50 ALU clocks per pixel, equals 12.288 G ALU clocks
  • AA resolve costs 25.9% of ALU capacity of R600
That's a lot :!: But that still leaves ~350 GFLOPs for all other shading. Approximately the same as 8800GTX's total available GFLOPs...
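Redoing that guesstimate step by step (same assumed 50 ALU clocks per resolved pixel; the ~475 GFLOPs peak figure used for the leftover is my own back-of-envelope number for R600's MADD rate, not something from the interview):

```python
alu_pipes  = 64           # R600's shader core counted as 64 vec5 ALU "pipes"
core_clock = 0.742e9      # 742 MHz
width, height, fps = 2560, 1600, 60
resolve_clocks_per_pixel = 50      # assumed worst case for an 8xMSAA shader resolve

alu_clocks_per_sec = alu_pipes * core_clock                       # 47.488e9
pixels_per_sec     = width * height * fps                         # 245.76e6
resolve_clocks     = pixels_per_sec * resolve_clocks_per_pixel    # 12.288e9

cost_fraction = resolve_clocks / alu_clocks_per_sec
print(cost_fraction)                       # ~0.259, i.e. ~25.9% of ALU capacity

# What that leaves over, taking R600's MADD peak as 320 SPs * 2 flops * 742 MHz:
peak_gflops = 320 * 2 * 0.742              # ~475 GFLOPs
print(peak_gflops * (1 - cost_fraction))   # ~352 GFLOPs left for everything else
```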

http://www.bit-tech.net/hardware/2007/06/15/xfx_geforce_8800_ultra_650m_xt/9

Shows 2560x1600 4xMSAA at >60fps on R600. But that probably needs fewer clocks for the AA resolve, say 35 for the sake of argument... Prey is not known for being particularly heavy in terms of ALU instructions per pixel. Guessing ~20 per pixel (before, say, 5x overdraw). So maybe 100 ALU clocks per screen pixel of actual shader code? Should add a bit more in there for vertex shading...

I dunno about the ALU clock cost of shader AA. I'm thinking that decoding the compression tags and un-packing the compressed samples is fairly costly. Hmm, comparing box, narrow and wide-tent AA resolve should give some indication of the ALU clock cost...

Having said all that, the cost in terms of texturing rate is notable (since it's a rare resource in R600). Assuming that all 80 samplers can be used for AA resolve but being generous and saying that texturing rate is defined merely by the 16 filtering units, then the 25.9% of ALU cost translates into 5.2% bilinear texturing cost.

In the end, R600's AA sample rate of 32 samples per clock is the real hindrance. e.g. 23.744 G samples/s versus 8800GTS-640's ability to do 40 G samples/s.
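And the same sort of back-of-envelope arithmetic for the texturing and sample-rate figures just quoted (the ~40 Gsamples/s for 8800GTS-640 is simply taken from the comparison above):

```python
core_clock_ghz = 0.742

# Mapping the 25.9% ALU-time cost onto texturing: if filtered texture rate is
# defined by only 16 of the 80 samplers, the hit scales by 16/80.
alu_cost_fraction = 0.259
filtering_units, samplers = 16, 80
print(alu_cost_fraction * filtering_units / samplers)   # ~0.052 -> ~5.2% of bilinear rate

# AA sample rate: 32 samples per clock on R600...
print(32 * core_clock_ghz)                              # ~23.744 Gsamples/s
# ...versus the ~40 Gsamples/s quoted for 8800GTS-640 above.
```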

Jawed
 
We kicked it around internally and came to the conclusion that it would be a huge improvement in such situations for the fan ramping to be much more graduated. Like, say, 1% rpm change per sec, or something like that. It's sudden large changes in fan speed as much as anything else that really make themselves noticeable. It seems like y'all have the infrastructure in place for that; it'd be a matter of tweaking the driver code. Presumably it'd need some kind of emergency path where if chip temp hits a certain spot where actual danger is dead ahead then forget all that and go high speed.

I actually checked on that a few weeks ago. There aren't that many settings, due to the controller being used (so no driver changes possible on number of different settings). I think we might want to improve that in the future.
 