Sir Eric Demers on AMD R600

Discussion in 'Architecture and Products' started by B3D News, Jun 15, 2007.

Thread Status:
Not open for further replies.
  1. sireric

    Regular

    Joined:
    Jul 26, 2002
    Messages:
    348
    Likes Received:
    22
    Location:
    Santa Clara, CA
    I don't think ours will get any bigger than 64. It's a fight between granularity loss and, really, cost. The finer grain things are, the more you need to sequence, and so your "control" aspect grows in complexity; as well, you will need more threads, since each thread will now hide less latency, and there's a cost (proportional to the number of threads) to hold all that state. The overall datapath structure grows also, to allow for so much independent SIMD work. The larger granularity does tend to improve cache coherency also, up to a point. The downside is the larger the granularity, the worse the branching cost of going separate ways.

    As of now, 16 vs 48 vs 64 doesn't really lead to any significant performance difference in real apps. You can make some synthetic cases that show a difference, but those have to be constructed.
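    The granularity/branching trade-off described above can be illustrated with a toy lock-step SIMD model (purely an editor's sketch — the element count and coherent-run length are arbitrary, and this does not reflect R600's actual scheduling):

```python
def avg_paths(taken, wavefront_size):
    """Average number of distinct branch paths a wavefront must
    execute when all of its elements run in lock-step: 1 if the
    whole wavefront branched the same way, 2 if it diverged."""
    groups = [taken[i:i + wavefront_size]
              for i in range(0, len(taken), wavefront_size)]
    return sum(len(set(g)) for g in groups) / len(groups)

# 4800 elements whose branch direction flips every 100 elements
# (spatially coherent branching, as is typical for pixel shaders)
taken = [(i // 100) % 2 == 0 for i in range(4800)]

for w in (16, 48, 64):
    print(f"wavefront {w:2d}: avg paths executed {avg_paths(taken, w):.2f}")
```

    With coherent branching, the wider wavefronts straddle more branch boundaries and so execute both paths more often, which is the cost of larger granularity; with fully random branching all sizes diverge almost equally, which matches the observation that real apps show little difference between 16, 48 and 64.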
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Presumably predication can only be specified in the shader by the driver compiler - there isn't anything in D3D that allows the programmer to specify this, is there?

    Presumably, as far as the driver/hardware is concerned, R600 is much like Xenos in this respect, where the sequencer can identify that an entire thread has coherent branching, and so the sequencer can skip instructions (or exit a loop).

    Xenos gives the programmer a chance to specify branching behaviour by writing SEQ instructions. Do you think there's much chance that a future version of D3D will provide this capability too?

    Jawed
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    So even in GS and VS code there's no notable difference?

    Is this primarily because once you've got multiple objects in a thread (as opposed to the nominal ideal of a thread size of 1), the performance cliff is so great that these various thread sizes only constitute where you land on the slope of rubble below?

    Jawed
     
  4. sireric

    Regular

    Joined:
    Jul 26, 2002
    Messages:
    348
    Likes Received:
    22
    Location:
    Santa Clara, CA
    I have not looked at HLSL -- From a HW perspective, either can be selected. But perhaps a heuristic picks, at the compiler level. I haven't checked.

    Yep.

    Good question. I think it's likely. Should be available through CTM.
     
  5. sireric

    Regular

    Joined:
    Jul 26, 2002
    Messages:
    348
    Likes Received:
    22
    Location:
    Santa Clara, CA
    There are a lot more reasons to do a smaller vector size for VS/GS. But historically, and with current apps, there's been either little VS branching, or even draw sizes; or the apps have been dominated by pixel processing (with large vertex:pixel ratios), and so vertex performance wasn't a big deal. But that is something likely to push granularity down in the future, more so than the pixel side.

    I'm not sure I fully understand the question. If I infer the question: no, it's more the opposite -- the performance loss due to granularity isn't that big, and the thread sizes are so small to start with that the differences between these sizes aren't significant.
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Thanks, you untangled my question :grin:

    Jawed
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    In CTM, R580 is quite happy to allocate 128 vec4 GPRs per "pixel", far higher than the SM3 requirement of 32. As far as I can tell R600 uses a virtualised register file. Presumably this means that the D3D10 requirement for 4096 vec4 GPRs is "trivial" for R600. R600 can assign GPRs up to the extent of VRAM+available system RAM, presumably?

    Can GPRs and memory addresses overlap in some way; can GPRs be aliased by logical address? i.e. can a block of memory be accessed "locally" by a shader in terms of GPR ID or "globally" in terms of a memory address? e.g. could a shader produce a huge block of results and another shader treat that block of memory as GPRs?

    Presumably, also, there could be situations when contexts are switched, in which case GPRs could be paged out of VRAM into system RAM.

    Really, all I'm doing here is thinking about the fluidity of CTM. One of the things that concerns me about the CUDA threading model is that programmers seem to be forced to work around the memory system, performing their own memory management in effect. I'm curious if one of the design aims for R600 was to relieve the programmers, giving them "latency-hidden" memory management across and within threads.

    Jawed
     
    #167 Jawed, Jun 19, 2007
    Last edited by a moderator: Jun 19, 2007
  8. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    Predication has been around for a while. It's in SM 3.0 and was also available in PS 2.x if you exposed the PREDICATION cap bit. To use it you set the predicate register with "setp_comp dst, src0, src1" where comp is the comparison you want.
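    A rough model of what that predication does — per-lane compare into a predicate register, then a conditional commit of the write, with no actual branch taken (an editor's sketch in Python, not real shader assembly semantics):

```python
def setp_lt(src0, src1):
    """Model of 'setp_lt p0, src0, src1': per-component compare
    written into a predicate register."""
    return [a < b for a, b in zip(src0, src1)]

def pred_mov(pred, dst, src):
    """Model of '(p0) mov dst, src': the write commits only in
    lanes where the predicate is set."""
    return [s if p else d for p, d, s in zip(pred, dst, src)]

r0 = [0.25, 0.75, 0.5, 1.0]
c0 = [0.5, 0.5, 0.5, 0.5]
p0 = setp_lt(r0, c0)                    # [True, False, False, False]
r1 = pred_mov(p0, [0.0] * 4, [1.0] * 4)
print(r1)                               # 1.0 only where r0 < c0
```

    Both sides of the "branch" are effectively evaluated; predication just masks the writes, which is why it wins over real branching when the branch body is short.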
     
  9. noko

    Regular

    Joined:
    Feb 10, 2002
    Messages:
    502
    Likes Received:
    0
    Location:
    Eustis Florida
    Very fabulous interview and responses. Not too sure how to word some questions; I will do my best considering all the professionals here.

    The R600 design looks to go beyond DX10 or pixel crunching, at least to me, and I just can't help but ask.

    - Were other computational problems looked at and incorporated into the R600 design? For example, physics calculations in conjunction with graphics, to use the vastly multi-threaded design and high bandwidth available?

    - What about other types of uses, such as raytracing/GI calculations in render farms; would or could R600 even be efficient doing this?

    - Maybe it's better to just ask: what other areas, besides graphics and video, influenced the R600 design, and what else can it do?

    Any elaboration on any of these questions would be most helpful, thank you.
     
    #169 noko, Jun 19, 2007
    Last edited by a moderator: Jun 19, 2007
  10. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    215
    Location:
    Uffda-land
    We kicked it around internally and came to the conclusion that it would be a huge improvement in such situations for the fan ramping to be much more graduated. Like, say, 1% rpm change per sec, or something like that. It's sudden large changes in fan speed as much as anything else that really make themselves noticeable. It seems like y'all have the infrastructure in place for that; it'd be a matter of tweaking the driver code. Presumably it'd need some kind of emergency path where if chip temp hits a certain spot where actual danger is dead ahead then forget all that and go high speed.
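    A graduated ramp like the one suggested could look something like this as a once-per-second control tick (purely a sketch; the step size, emergency threshold, and function names are made up for illustration):

```python
def next_fan_speed(current, target, temp,
                   max_step=1.0, emergency_temp=95.0):
    """One control tick (~1 s): move at most max_step percentage
    points toward the target speed, unless chip temp demands
    full speed immediately (the emergency path)."""
    if temp >= emergency_temp:
        return 100.0                     # danger dead ahead: go loud
    delta = max(-max_step, min(max_step, target - current))
    return current + delta

# Example: fan creeps up 1%/s, then jumps when temp spikes
speed = 30.0
for temp, target in [(70, 60), (72, 60), (96, 60), (80, 60)]:
    speed = next_fan_speed(speed, target, temp)
    print(f"temp {temp}C -> fan {speed:.0f}%")
```

    The point is that the audible annoyance comes from the step size, not the absolute speed, so clamping the per-tick delta hides most ramping below the noise floor while the override still protects the chip.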
     
  11. cadaveca

    Newcomer

    Joined:
    Aug 3, 2006
    Messages:
    98
    Likes Received:
    3
    Dual-card configs still have high fan speeds on slave cards, as the slave card never goes to 2D speeds ATM, just 3D speeds. My system is quiet, except for the slave, and that drives me nuts.

    I also wonder how this may affect the lifespan of the card...:shock:
     
  12. sireric

    Regular

    Joined:
    Jul 26, 2002
    Messages:
    348
    Likes Received:
    22
    Location:
    Santa Clara, CA
    Conceptually yes. However, DX10 has some restrictive items associated with fetching outside the 4K range (clear clamp rules on <0 and > 4093). As well, direct access to those registers only requires 12b addressing. I'd have to check if we allow indirect addressing outside of the 4K, but it would not be possible outside of CTM to access that.

    When using "virtual" GPRs, that block is private for each data element, but it is in the GPU addressable memory, and could be made to overlap (But not in DX). You need to be careful about accessing other's memory, since execution is out of order. So you would need to synchronize the shaders -- Possibly by using different prims and forcing a flush (A rather heavy synchronization method, but an easy one to understand).

    Yep. Hopefully they would simply be kept in VRAM, but it's an OS thing to decide.
    The aim of the shader model is to offer latency-free operation, and to offer reasonable resources with that. I believe that CTM will continue to offer at least 128 GPRs per shader for R600, and we continue to offer cliff-free performance with GPR usage.
     
  13. sireric

    Regular

    Joined:
    Jul 26, 2002
    Messages:
    348
    Likes Received:
    22
    Location:
    Santa Clara, CA
    I think that there's a driver update coming down to drop the 2D fan speed. Also, make sure that the cards aren't right on top of each other -- there is required spacing between the cards (I think we even ship spacers with cards). Otherwise, one gets too hot. Great PC form factors :-(
     
  14. cadaveca

    Newcomer

    Joined:
    Aug 3, 2006
    Messages:
    98
    Likes Received:
    3
    Oh, believe me, I hear ya on that one. XBX2 resulted in 90C++ load temps on the upper card, 85C++ on the bottom. Now running in P5WDH/P5K, load temps hit 76 or so on each card. :grin: Even with the Phys-X between them. :roll:

    I'm not sure if all systems are affected by the problem, but Vista32 definitely is. :lol: This is just one of many niggles left in these cards... like no image in full-screen 3D if the app is started after the screensaver kicks in. :roll:

    Um, about these spacers... I have had Asus, Sapphire, HIS, and PowerColor cards, and none of them has this "spacer"... :shock: Are we missing something?

    It was quite obvious that these cards needed decent airflow... hence the 12-inch card w/cooler... almost seems I'd rather have those coolers on my cards! :lol:

    There are still quite a few issues with these cards... I'm well versed enough in this to realize that almost every single one is driver-related. Given that, it seems that R600 is plagued by fresh platform blues...
     
  15. banksie

    Newcomer

    Joined:
    Jun 9, 2003
    Messages:
    213
    Likes Received:
    4
    Location:
    Wellington, New Zealand
    Maybe I am being a little incorrect in my usage of terminology, not to mention probably not understanding something here. I had thought AA was primarily memory-bandwidth bound, with the extra texture data fetches, and not particularly decided by fillrate. In R600's case this then has the extra overhead of shader execution time, which is where I had thought the performance penalty comes in when doing 'simple' MSAA compared to the G80 series.

    Is that right?
     
  16. rwolf

    rwolf Rock Star
    Regular

    Joined:
    Oct 25, 2002
    Messages:
    968
    Likes Received:
    54
    Location:
    Canada
    Eric, will there be a performance difference between GDDR4 and GDDR3, given the huge bandwidth the R600 already has?
     
  17. Zaphod

    Zaphod Remember
    Veteran

    Joined:
    Aug 26, 2003
    Messages:
    2,267
    Likes Received:
    160
    Coming late to this thread, I'll just add my thanks to the other members for having already asked most of the questions I had left after reading the article, and to sireric for answering (most of) them.

    One thing sticks out to me, though, that has not been addressed. Somehow, the answers given make R600 (and its derivative designs) sound like they are supposed to be more than what is readily apparent: what, if any, design considerations have gone into applications for purposes other than consumer 3D graphics? And how much importance, if any, do such (hypothetical) areas of consideration have going forward, compared to the 'traditional focus' (rendering paradigms) of a GPU?
     
    #177 Zaphod, Jun 20, 2007
    Last edited by a moderator: Jun 20, 2007
  18. bdotobdot2

    Newcomer

    Joined:
    Jun 7, 2004
    Messages:
    56
    Likes Received:
    1
    Location:
    Tonawanda, NY
    Ditto!! Good reading all around!!
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    The texture fetches required to do shader AA aren't using "extra" bandwidth as such. The data fetched to perform a hardware AA resolve is the same as that fetched to perform a shader AA resolve.

    This may not be 100% true, because "something" may be happening with the compression tags, in order to support shader AA resolve. Whatever that something is may well be a bandwidth overhead. But then again, it might not.

    e.g. the compression tags may be data that gets dumped into the 8KB R/W memory cache (per SIMD) to be fetched by the AA-resolve shader as it progresses across the render target. Judging by that patent document it's possible that the compression data is located in two places: in an on-die tag table and as per-tile status stored in VRAM. Anyway, if the R/W cache is used, then no extra bandwidth is consumed.

    So the "overhead" is mostly on the ALU units. Let's do a worst-case guesstimate: say an average of 50 ALU cycles per pixel to perform an 8xMSAA resolve for a 2560x1600 render target at 60fps:
    • 4 ALU clocks per pixel drawn on screen (64 ALU pipes, 16 RBEs)
    • theoretical fillrate of 742MHz * 16 RBEs = 11.872 G pixels/s
    • equals 47.488 G ALU clocks per second capacity
    • 2560*1600 = 4096000 pixels
    • at 60 fps that's 245.76 M pixels/s
    • AA resolve at 50 ALU clocks per pixel, equals 12.288 G ALU clocks
    • AA resolve costs 25.9% of ALU capacity of R600
    That's a lot :!: But that still leaves ~350 GFLOPs for all other shading. Approximately the same as 8800GTX's total available GFLOPs...
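    The guesstimate above can be sanity-checked with a short script (editor's sketch; every figure comes from the post itself):

```python
# Rechecking the worst-case AA-resolve arithmetic from the post.
alu_pipes, rbes, core_mhz = 64, 16, 742
alu_clocks_per_pixel = alu_pipes // rbes          # 4 ALU clocks per pixel
fillrate = core_mhz * 1e6 * rbes                  # 11.872 G pixels/s
alu_capacity = fillrate * alu_clocks_per_pixel    # 47.488 G ALU clocks/s
pixels_per_frame = 2560 * 1600                    # 4,096,000 pixels
pixel_rate = pixels_per_frame * 60                # 245.76 M pixels/s @ 60fps
resolve_clocks = pixel_rate * 50                  # 12.288 G ALU clocks/s
share = resolve_clocks / alu_capacity
print(f"AA resolve uses {share:.1%} of ALU capacity")
```

    The numbers in the bullet list all fall out as stated, with the resolve eating just under 26% of ALU capacity under the 50-clock worst-case assumption.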

    http://www.bit-tech.net/hardware/2007/06/15/xfx_geforce_8800_ultra_650m_xt/9

    Shows 2560x1600 4xMSAA at >60fps on R600. But that prolly needs less clocks for the AA resolve, say 35 for the sake of argument... Prey is not known for being particularly heavy in terms of ALU instructions per pixel. Guessing ~20 per pixel (before, say, 5x overdraw). So maybe 100 ALU clocks per screen pixel of actual shader code? Should add a bit more in there for vertex shading...

    I dunno about the ALU clock cost of shader AA. I'm thinking that decoding the compression tags and un-packing the compressed samples is fairly costly. Hmm, comparing box, narrow and wide-tent AA resolve should give some indication of the ALU clock cost...

    Having said all that, the cost in terms of texturing rate is notable (since it's a rare resource in R600). Assuming that all 80 samplers can be used for AA resolve but being generous and saying that texturing rate is defined merely by the 16 filtering units, then the 25.9% of ALU cost translates into 5.2% bilinear texturing cost.

    In the end, R600's AA sample rate of 32 samples per clock is the real hindrance. e.g. 23.744 G samples/s versus 8800GTS-640's ability to do 40 G samples/s.
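    The 5.2% texturing figure and the sample rate check out the same way (again, just rechecking the post's own numbers under its stated assumptions):

```python
alu_share = 0.259                        # ALU cost from the guesstimate above
filtering_units, samplers = 16, 80
tex_cost = alu_share * filtering_units / samplers   # bilinear texturing cost
r600_samples = 742e6 * 32 / 1e9          # 32 AA samples/clock, in G samples/s
print(f"texturing cost {tex_cost:.1%}, {r600_samples} G samples/s")
```
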

    Jawed
     
  20. sireric

    Regular

    Joined:
    Jul 26, 2002
    Messages:
    348
    Likes Received:
    22
    Location:
    Santa Clara, CA
    I actually checked on that a few weeks ago. There aren't that many settings, due to the controller being used (so no driver changes possible on number of different settings). I think we might want to improve that in the future.
     