Sir Eric Demers on AMD R600

Discussion in 'Architecture and Products' started by B3D News, Jun 15, 2007.

Thread Status:
Not open for further replies.
  1. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    It doesn't matter how many base clocks are needed to go through ALU instructions. If you want to hide texture latency, you need to keep the texture units fed. If you can do math in parallel or issue more texture requests, fine, but otherwise those threads are stalled until you get the texture value back.

    If 768 threads in one multiprocessor need a texture fetch, then you either ran into a very rare coincidence or you have a very texture heavy test. Thus the other multiprocessor also has 768 threads needing a texture fetch. It will take 384 base clocks to get through them at the very fastest.

    If you are not limited by texture fetch throughput, you can pass through the other threads (those threads that don't currently need a texture fetch) multiple times, thus reducing the number of threads necessary to hide the latency.

    G80 needs 192 threads to hide register access latency in all cases. Double the warp size and SIMD size, and that becomes 384. Still half of what the current G80 can handle. There's no need to double the register size unless you want pure math workloads to have 171 bytes of register space per thread.
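    A quick sanity check on the 171-byte figure. The register file size is my assumption, not from this post: CUDA documentation gives G80 8192 32-bit registers per multiprocessor.

```python
# Assumption: 8192 x 32-bit registers per G80 multiprocessor (per CUDA docs).
register_file_bytes = 8192 * 4  # 32 KiB

for threads in (192, 384, 768):
    print(threads, register_file_bytes / threads)
# 192 threads -> ~170.7 bytes/thread (the "171 bytes" above)
# 384 threads -> ~85.3 bytes/thread
# 768 threads -> ~42.7 bytes/thread
```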

    Threads per TMU is the limiting factor by far, not threads per SIMD array.
    AF takes extra clock cycles (i.e. far less throughput), so you don't need more latency hiding than with bilinear. You filter the first 4 samples and when it loops again, you have the 220 clocks for the next 4.

    Threads necessary = throughput per clock * latency

    For 4xAF, throughput becomes 1/4, latency quadruples (actually it's probably less due to the cache friendly nature of AF). You don't need more threads.
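    The rule of thumb above, sketched in Python. The 220-clock latency is the figure from this exchange; the per-clock throughputs are illustrative assumptions, not measurements.

```python
def threads_needed(tex_throughput_per_clock, latency_clocks):
    # threads necessary = throughput per clock * latency
    return tex_throughput_per_clock * latency_clocks

latency = 220  # clocks, the figure quoted above
bilinear = threads_needed(4, latency)          # a TMU quad: 4 results/clock
af_4x    = threads_needed(4 / 4, latency * 4)  # 1/4 throughput, ~4x the window
print(bilinear, af_4x)  # identical: 4xAF needs no extra threads
```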
     
  2. leoneazzurro

    Regular

    Joined:
    Nov 3, 2005
    Messages:
    518
    Likes Received:
    25
    Location:
    Rome, Italy
    What I'm saying is that G84 does not have 90% of the theoretical bilinear rate of the 8800 GTS, because it has 16 TFUs, not 32, while G80 has 64 TFUs and the GTS 48. It has 16 TAUs against G80's 32 and the 8800 GTS's 24. So a 675 MHz 8600 GTS has only 45% of the bilinear filtering capability of a 500 MHz 8800 GTS. If it's faster than that, it's probably due to a faster addressing rate than the 8800's. So, clock for clock, an 8600 cluster seems faster at texturing than an 8800 cluster.
    (Theoretical bilinear rate of G84 can be 10800 MT/s only in some instances, in the same conditions 8800 GTS should be not 12000 MT/s but 24000MT/s, but this is limited in g80 by other factors IMHO)
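    The ratios in this post can be checked directly. Unit counts and clocks are the ones stated above; one texel per unit per clock is an assumption of the sketch.

```python
clock_8600, clock_8800 = 675, 500  # MHz
tau_8600, tfu_8600 = 16, 16
tau_8800_gts, tfu_8800_gts = 24, 48

# Addressing-limited bilinear rate (MT/s) -> the disputed "90%"
addr_ratio = (clock_8600 * tau_8600) / (clock_8800 * tau_8800_gts)
# Filtering-unit capability -> the "45%" above
filt_ratio = (clock_8600 * tfu_8600) / (clock_8800 * tfu_8800_gts)
print(clock_8600 * tau_8600, addr_ratio, filt_ratio)  # 10800 0.9 0.45
```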
     
    #362 leoneazzurro, Jul 11, 2007
    Last edited by a moderator: Jul 11, 2007
  3. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    The bilinear texture fetch rate is limited by the texture address units, not the filter units. The additional filter units only help with trilinear/anisotropic filtering as well as wide texture formats.
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Those drivers look broken on both ATI and NVidia.

    The RightMark test at the bottom of this page:

    http://www.techreport.com/reviews/2007q2/radeon-hd-2900xt/index.x?pg=5

    shows R600 about 45% faster for bilinear than R580 (14% faster core). At 16xHQAF, R600 is 14% faster. G80's not looking too healthy there generally, despite ignoring the "HQ" option.

    On lower GPUs, such as G84, gamers are expected to turn down the eye candy, so I agree, bilinear rate is potentially more important there. Trouble is, reviews of these cards rarely turn down the eye candy, they just drive them at maximum, so it's still going to be hard to find evidence of bilinear efficacy in G84/G86... Though my default position is that G84 has even more of an excess of TF than G80 :roll:

    It'd be interesting to see what G84 would have been like if it was one cluster with 4x SIMDs and 8x TAs and 8 TFs.

    So the two TFs in G80 are, say, 1.9x the complexity of "one Int8 TF". 10 adders and 14 multipliers per TF in NV40 is the only baseline I've got, and well, it's a struggle to get from there to G80. e.g. part of NV40's single TF was capable of dealing with FP16 texture formats.

    You might also argue that G80's second TF has no decompression hardware associated with it. But presumably G84's second TF does (er, hopefully).

    You can't have it both ways. You were just arguing earlier that G80 has 3x the theoretical AF rate of G71 for IQ. Now you're arguing that G71 didn't gain any die area advantage from poor AF quality?!!!

    I suppose games released after the GPUs hit the market don't count :roll: There's a reason NVidia released 7950GX2, you know.

    Jawed
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    G71 does math in parallel, it uses other fragments' math to hide the bilinear latency of the just-issued fragment (even though in G71 all the other fragments' math is actually a texturing instruction running on the top ALU). Latency-hiding is about keeping the ALUs busy, not about keeping the TMUs busy.

    Or you have 768 objects that are all waiting for a texture result because the math is dependent on that result. For the sake of texture-cache efficiency (reduced thrashing), it's generally better to perform "round-robin" texturing, rather than letting a subset of objects get one or two texture fetches ahead of others.

    768 fragments is an onscreen quad of 32*24 pixels. I don't think that number of fragments needing texturing "simultaneously" is some kind of rare coincidence.

    When ATI expanded the ALU pipe count in R580 from R520, do you think they kept the register file capacity the same? No, it grew 3x too. The batch count per cluster remained the same (128 batches) and the size of each batch tripled (16 fragments to 48 fragments).
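    The R520-to-R580 scaling just described, multiplied out (batch counts and sizes are the ones quoted above):

```python
batches_per_cluster = 128
r520 = batches_per_cluster * 16  # fragments in flight per cluster
r580 = batches_per_cluster * 48
print(r520, r580, r580 // r520)  # 2048 6144 3 -> register file grew 3x too
```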

    Jawed
     
  6. leoneazzurro

    Regular

    Joined:
    Nov 3, 2005
    Messages:
    518
    Likes Received:
    25
    Location:
    Rome, Italy
    I am a little lost here. The texture bilinear filtering rate does not depend on the texture filtering units? :?:

    This is strange, especially since the synthetic tests seem to mimic the TFU ratio between G80 and G84 rather than the TA ratio.
     
  7. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    You can't filter anything if you have no data to filter; if you have more filtering units than texture address/fetch units, some of them will sit idle, as they won't have anything to filter.
    Turn on trilinear, aniso or higher precision formats and they will have some extra work to do, keeping everyone busy and happy.
     
  8. leoneazzurro

    Regular

    Joined:
    Nov 3, 2005
    Messages:
    518
    Likes Received:
    25
    Location:
    Rome, Italy
    Yes, but I'm thinking here that the TA rate is a limit value that isn't reached every time in practice. I thought there was some "extra work" to be done in the TFU anyway, preventing it from reaching the maximum rate (half-speed filtering?). What I'm saying is that if there is something preventing G84 from reaching its maximum theoretical rate, it's in the TFU, because the results seem to mimic the G84:G80 TFU ratio instead of the TAU ratio.
    So, is it possible that G80's TF units have some limitation that prevents them from working at full speed in some cases (i.e. int8 filtering per clock cycle), and that this is not apparent in G80 because it has half the TAUs, but becomes apparent in G84 because the TAU:TFU ratio is 1:1?

    PS: of course it could also be a problem with the TAUs, but then why do G80's units seem to work at "full speed"?
     
  9. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Sure, but I was comparing to ATI's TF unit, and there the factor is not 1.9 by any stretch of the imagination. I'm not sure why you need 16 bit arithmetic either, as shown in that slide (where did you get that from?).

    But let's assume the worst: 16x16 multipliers have around 300 full adders' worth of die space (including carry lookahead and pipeline FFs). A full adder is well under 30 transistors. 14 of these for 32 TF units works out to 4M transistors. A little math to figure out where the next quad of texels is for AF/trilinear/volume lookups may bump that up a tad.
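    That worst-case estimate, multiplied out (all four factors are the ones stated in the paragraph above):

```python
adders_per_mult = 300  # die space of a 16x16 multiplier, incl. CLA + pipeline FFs
trans_per_adder = 30   # "well under 30", so an upper bound
mults_per_tf    = 14
tf_units        = 32

total = adders_per_mult * trans_per_adder * mults_per_tf * tf_units
print(total)  # 4032000 -> the "4M transistors"
```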

    The filtering math is really not expensive at all. Sure, on a Voodoo you really feel each TF, but not on a 700M-transistor GPU. Having a texture cache that can feed you 8 32-bit texels per clock per TMU, instead of 4 64-bit texels, is a little harder to quantify.

    The funny thing is that ATI has plenty of cache bandwidth, as you showed earlier in the thread. Each TMU can fetch 20 32-bit values per clock, right? Why the heck didn't they add more filtering math?

    Yes, I can have it both ways, as they're entirely unrelated.

    In one case you were arguing about the design decision. G80 clearly has a much higher focus on image quality (both AA and AF), likely because reviewers started paying more attention to it. Hence the decision to aim for 2x the speed of G71 (standard quality) while delivering higher image quality. NV4x/G7x was never designed to run HQ, but G80 is, even in its default setting.

    In the other post, you were discussing die size efficiency. Like I said, NVidia never designed G7x/NV4x for HQ mode. If they had, it could be done for minimal cost. They saw ATI getting away with a poor AF pattern (which also helps with performance) in the R300 generation, and said "screw it".
    Do you seriously think R580 is twice the speed of G71 in all newer games? Get real. "As much as 50% behind" is more exaggerated than any PR statement I've read from ATI themselves. With equal memory speeds, R580 is rarely even 20% faster despite using 80% more die space.
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    http://www-csl.csres.utexas.edu/use...ics_Arch_Tutorial_Micro2004_BillMarkParts.pdf

    http://forum.beyond3d.com/showthread.php?t=37616

    Might be useful.

    One could attempt something along these lines for G8x, but I don't think there's enough data points yet...

    I dunno. We'd be having a very different discussion if R600 was 32 TUs and 16 RBEs with 4xAA per loop.

    (I just discovered from the Tech Report review of RV630/610 that the RBEs in RV610 max-out at 4xAA, i.e. ignoring CFAA which can obviously bump that up.)

    Charlie over at L'Inq:

    http://www.theinquirer.net/default.aspx?article=40913

    thinks that NVidia couldn't be bothered to implement virtualisation stuff in G80. The reasons he gives seem rather off-track, but I dunno. Clearly R600 has loads of virtualisation stuff going on inside, which appears (amongst other things, eh?) to have capped the TUs and RBEs far lower than we'd like.

    I know it is considerably faster in lots of newer games. I said G71 is giving up 30-50% performance in the worst case. Outdoors in Oblivion on G7x is a disaster zone. But as I keep saying, I think that's mostly an overdraw problem... There are some exceptions.

    Jawed
     
  11. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    You just told me that G80 only needs 192 threads to keep the ALUs running at maximum efficiency! Do I have to point it out to you? 768 >> 192.

    The degree of latency hiding available to GPUs is designed around the maximum number of threads that could be waiting for texture results. That depends on texturing throughput, not ALU throughput. Read below.
    You don't have to have all threads marching in lockstep to keep the texture accesses from happening in the same order. The math in each thread is independent from each other. A few one-time stalls and the texture fetches are spaced out as necessary.

    Consider an example with 512 threads (16 warps) per multiprocessor and 192 clock fetch latency. Imagine a shader has a 10 scalar ALU instructions between each TEX, every instruction is dependent on the previous one, and we start off in the state you mentioned with every pixel needing a fetch.

    We start fetching: Warp 1a, w2a, ... , w16a, w1b, w2b, ... , w16b (a and b refer to the multiprocessor). It takes 8 base clocks to finish a warp, so after w8b is issued, w1a's results start coming in and we can start doing math (slowly at first due to insufficient warps). After w14b is issued, w1a-w6a's results are here, so we have 6 warps to rotate between to keep efficiency at its peak. w1a eventually gets done (and now needs a new texture fetch), followed by w2a, etc., and during this time w7a, w8a, etc. start receiving their results, ready to be fed into the ALUs. The process keeps going on and eventually multiprocessor b can start doing math, etc.

    The point is that everything just spaces out rather quickly. At that point, 15 warps (480 threads) each needing 10 instructions gives you 600 ALU clocks, or 255 base clocks, to wait for texture results from the first warp. Note that if you had any fewer ALU instructions, then you'd be texture throughput limited (remember multiprocessor b is also using it), so the ALU will have to idle. This is the critical point you missed.

    Now, it's true that you need a little consideration for the fact that 6 warps need to be in the shader for full efficiency, and it takes a few clocks for a warp to get through an ALU pipeline. Thus you do need a little more latency hiding than just the 192 clocks of texture latency. However, this is the major factor that determines how many threads you need, so "# of threads ~= tex. latency * TMU throughput" is a good approximation. That's the total # of threads being fed by the TMU, so you have to count both multiprocessors.
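    The arithmetic behind "600 ALU clocks, or 255 base clocks". Two assumptions of mine that the post doesn't state: a 32-wide warp on 8 ALUs takes 4 ALU clocks per instruction, and the 8800 GTX runs its shader clock at 1350 MHz against a 575 MHz base clock.

```python
warps = 15                 # the 15 warps (480 threads) not waiting on the first fetch
instr_between_tex = 10
alu_clocks_per_instr = 4   # 32-wide warp / 8 ALUs per multiprocessor

alu_clocks = warps * instr_between_tex * alu_clocks_per_instr
base_clocks = alu_clocks * 575 / 1350
print(alu_clocks, int(base_clocks))  # 600 and ~255
```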

    With less regular patterns between ALU and texture instruction spacing, statistics takes over. It's a huge fluke for you to get lots of texture requests unless you have a texture heavy shader.

    Yeah, and IMO that was a stupid design decision. Why on earth do you need 6144 fragments for 12 Vec3+scalar ALUs and one TMU quad? G71 has around that many fragments to hide the latency of six TMU quads. R580 can wait a whopping 1500+ clocks for a texture result when the TMUs are churning out results at their peak rate, as I mentioned before. The ALUs can wait ~500 clocks before needing to return to the same batch. It's ridiculous.
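    The "1500+" and "~500" figures above follow from the per-cluster numbers. Assumptions of this sketch: the TMU quad returns 4 filtered results per clock, and the 12 Vec3+scalar pipes issue 12 fragments per clock.

```python
fragments = 6144           # 128 batches x 48 fragments per R580 cluster
tex_wait = fragments / 4   # clocks for the TMU quad to cycle every fragment
alu_wait = fragments / 12  # clocks before the ALUs return to the same batch
print(tex_wait, alu_wait)  # 1536.0 512.0
```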

    Anyway, do we have solid proof that the R580 does indeed have triple the register file? Say, by running a pure math shader and increasing the registers? If it's really true, that would really suggest that R520 is a whole lot of nothing, i.e. >150M transistors is scheduling logic.
     
  12. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    The bandwidth galore in R600 is by far not only for plain "rasterization" workloads; it looks forward to intensive streaming of data in and out of the GPU, but that is quite a way off in the DX10+ realm of titles.

    As for the texturing deficit on ATi's part -- the current architectural base of R600 gives little opportunity to improve texture address speed. As I see it, the relatively easy way is to break beyond the current SIMD width and just chain more texture quads along more shader units, but that will explode the batch size. The other way is a little more radical: adding more texture threads, thus lowering the ALU:TEX ratio (if that's much of a concern these days) and complicating the data sampling portion of the dispatch pre-processor... but then each ALU quad would boast dual address/sampling rate. ;)
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Huh? What has hiding RF read after write latency got to do with hiding texture latency? When your texture fetch takes too long for the available (dependency-free) instructions' latency in the ALU pipeline, then the ALU pipeline stalls.

    (Except that RF latency adds a little bit of extra latency on texture results when they return. And, it adds a bit of extra latency at the start, because the TAs read from the RF to get texture coordinates.)

    Actually, in an out of order threading GPU, it won't. Because the latency for succeeding texture fetches is mostly defined by texture cache architecture, not video memory latency. Memory latency is just the start-up cost for some tiles of texture data - prefetching plays its part here - don't forget there's a fair amount of L2 cache...

    This appears to be why R600's bilinear rate is about 27% faster than R580's (per TU clock).

    This is all lovely and fluffy, but you do realise you're now describing a high ALU:TEX ratio scenario, where G80's TF rate advantage disappears? Oh look, that's where we came in :razz:

    You do realise that's for only 2x vec4 registers per fragment, don't you? G71's nominal register file allocation is 4x vec4 registers per fragment.

    As you increase the register payload per fragment, the count of available batches drops off in direct proportion. The difference between G71 and R580 is that the latter can stand far more register allocation per fragment before texturing latency-hiding starts falling off, whereas G71 performance starts falling off as soon as you allocate more than 4x vec4s (I'm talking about fp32s - this is why there's such a heavy emphasis on fp16 on G7x, even in PS3's RSX).

    With misunderstandings like this about R5xx, no wonder you think it's a shit product.

    Well, there's always the chance that R520 had even more "too many" registers per fragment allocatable in its RF. I don't know of any evidence, to be honest - it would be nice to see.

    Eh? In that thread I indicated ~168M transistors of other stuff, things like VS, setup, memory controller (10% of R520's die area is MC buffering in the centre of the die!), PCI Express, buses (including ring bus), AVIVO as well as "high-level" out of order scheduling (i.e. not at the ALU level).

    Jawed
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Would be nice to get a handle on this...

    RV630 appears to indicate that a single cluster can have more than one quad-TU. It has two. It also has 3 SIMDs in that cluster. Well, that's my interpretation.

    It's like a half-Xenos (with the vertex and texture fetching integrated).

    It appears ATI can scale clusters, SIMDs, TUs and RBEs independently.

    Jawed
     
  15. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    What's your point?

    You are saying double SIMD width needs 1536 threads to saturate the ALUs (hence your "double the rate, double the RF needed" post). I'm saying double SIMD width only needs 384 threads to saturate the ALUs, so the RF is fine unless you need more than 85 bytes of register space per thread at full performance in read-after-write scenarios.

    My previous post shows that hiding texture latency is almost independent of SIMD width (well, until you go crazy). You could put 16 ALUs per multiprocessor, double the warp size, and halve the number of warps (i.e. still 512 threads). You'd still hide the texture latency completely, i.e. the ALUs would be saturated with 20 or more scalar instructions between texture lookups (any fewer and you're tex throughput limited so ALU saturation is impossible regardless of thread count).
    What does that have to do with spacing? The spacing happens from TMU throughput being much slower than ALU throughput. If every thread needs a texture fetch, slowly they'll come in. As they come in, the ALUs can proceed. Now they're no longer aligned like your pathological initial condition, and they won't get back to that situation unless there's an incredible fluke.

    I feel like I need to make some sort of animation for you.
    You know you can put at least an iota of thought into your reply. What does TF have to do with anything? I'm assuming we're fetching at maximum rate - four filtered texels per clock. Go ahead and make them trilinear if it makes you happy.

    10 scalar instructions between texture lookups. G80 executes 172 GInst/s and 18.4 GTex/s maximum. One less instruction and it's texture-fetch-throughput limited. This is NOT a high ALU:TEX scenario by any means. I chose this scenario to show that if you have any fewer ALU instructions, then you're TMU limited. If you have any more, then you have even more time (i.e. more than 255 cycles) between the time a fetch is issued and the time it's needed for a warp. This is the toughest scenario possible, where we're trying to maximize usage of both the TMUs and ALUs for a completely serial instruction stream.
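    Where the "one less instruction" threshold comes from. The two throughput figures are the ones quoted above; the derivations in the comments (128 scalar ALUs at 1.35 GHz, 32 TAs at 575 MHz) are my assumptions for the sketch.

```python
ginst = 172.8  # GInst/s: 128 scalar ALUs x 1.35 GHz
gtex  = 18.4   # GTex/s:  32 TAs x 575 MHz
ratio = ginst / gtex
print(round(ratio, 2))  # 9.39: 10 ALU instructions per fetch keeps the ALUs
                        # the bottleneck; 9 would make it TMU limited
```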

    Statistics take care of all other scenarios, like a variable number of instructions between lookups. All that matters is the average.
    So? It's still overkill. First, there's no need to be able to handle so many fragments in flight regardless of register count. The scheduler/sequencer/dispatcher/whatever is way overbuilt. Secondly, the register space is still 3x as big as G71's per TMU. As I am trying to explain to you, that's all that matters.

    When did I say it was a shit product? It still blows its competition away by a factor of 10 for anything needing DB. My problem is that it was built to perform under scenarios too far ahead of its time, i.e. shaders with heavy register use, lots of math, and/or dynamic branching. If I bought a card from that gen, it would have been from the X1k line (actually, I did, but the etailer didn't ship and by the time it got cancelled I just decided to wait 'til DX10), but I am very far from the typical consumer. From a business standpoint, G7x is a vastly superior design, especially in scalability to the low end. Nearly 2x the game performance in a smaller chip is no small feat.
    I noticed that post afterwards, but you're relying heavily on those numbers being exact with nothing else changing. If the RV570 figure was 2% larger, for example, you'd get 192M for the fixed part. The TMU + ROPs counts you got are way too high to be reasonable when looking at R420, IMO.

    Anyway, I wasn't really serious when I said the scheduler is that big. It was just a subjective remark that's not really worth discussing further.

    Since when? Some game benchmark that you're assuming is 90% texture limited again? Every texturing test (like here or here) says they're both near 100% utilized when bandwidth isn't an issue.
     
  16. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Or having double the amount of filtering units will give you lower net latency for ALUs waiting to be texture-fed. After all, a TA does not take nearly as long as it takes to fetch and filter the values - especially on a cache miss.

    Or am I completely mistaken here?
     
  17. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    Err... no, unless your uber-double TEX sampling trashes the cache. :???:
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    If 384 threads are enough in some 16-SIMD refresh of G80, then why does G80 have 768?

    85 bytes per fragment is 5 vec4 registers.

    I disagree, because shaders can have "hotspots", e.g. where a combination of register allocation per fragment and dependent texturing, say, causes you to chew through available threads resulting in an ALU pipeline stall. As Mike Shebanow says, as you consider smaller and smaller windows of instructions, the shader average throughput is irrelevant - the small window has its own effective ALU:TEX ratio, which you have to balance against other resource constraints (e.g. registers per fragment). The program counters for threads will "bunch-up" behind this hotspot.

    My main point was to indicate that you're seemingly an order of magnitude out. You're saying, as far as I can tell, that excluding cache, 32 TAs and 64 TFs in G80 cost about 12M transistors. Or that the TMUs in R5xx should cost about 4M transistors in total. Assuming that TA (LOD, bias and address) costs as much as TF. Obviously decompression is a factor. Cache operation + memory is also a factor (though L2 in G80 is per MC).

    R5xx texture cache architecture is relatively simple, because it only has L1 cache to address, as far as I can tell.

    I guess you didn't look at:

    http://www.techreport.com/reviews/2007q2/radeon-hd-2900xt/index.x?pg=5

    I'm just making an observation on that RightMark test, I don't know how it's implemented and why it's different from the multi-texturing test.

    Jawed
     
  19. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    If G80 reflects the patents I found, it would be able to allocate more registers per thread than are normally available, as part of them (the ones that don't contribute to determining thread state..) would be shared between threads.
     
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I've been basing my discussions of G80 upon CUDA documentation and there seems to be some divergence when comparing CUDA and graphics modes. Shebanow's presentation here:

    http://courses.ece.uiuc.edu/ece498/al/Syllabus.html

    (Lecture 12) is curious because he talks from the graphics point of view, not the CUDA point of view (confusing the audience somewhat). In that context he refers to 512 threads per SM (multiprocessor). He also refers to a warp as being 16 wide (which is what I've been referring to as a batch). A CUDA warp he calls a "convoy" (though that's a kind of off-hand comment), which is effectively two batches joined together (for instruction issue purposes, something he states quite explicitly).

    So, all in all, it's hard to be sure what G80 is actually doing when it's in graphics mode, not CUDA mode. The numbers are all a bit wobbly. Some registers might be reserved for attribute interpolation results? etc.

    I take your point though, it's hard to be sure what G80's doing when in graphics mode, so latency hiding may be different.

    Jawed
     