G71 does math in parallel: it uses other fragments' math to hide the bilinear latency of the just-issued fragment (even if, on G71, the "math" those other fragments are running is actually a texturing instruction occupying the top ALU). Latency hiding is about keeping the ALUs busy, not about keeping the TMUs busy.
You just told me that G80 only needs 192 threads to keep the ALUs running at maximum efficiency! Do I have to point it out to you? 768 >> 192.
The degree of latency hiding a GPU is designed for is set by the maximum number of threads that can be waiting on texture results at once, and that depends on texturing throughput, not ALU throughput. Read below.
Or you have 768 objects that are all waiting for a texture result because the math is dependent on that result. For the sake of texture-cache efficiency (reduced thrashing), it's generally better to perform "round-robin" texturing, rather than letting a subset of objects get one or two texture fetches ahead of others.
You don't need all the threads marching in lockstep to keep the texture accesses happening in the same order. The math in each thread is independent of the others. A few one-time stalls and the texture fetches get spaced out as necessary.
Consider an example with 512 threads (16 warps) per multiprocessor and a 192-clock fetch latency. Imagine the shader has 10 scalar ALU instructions between each TEX, every instruction is dependent on the previous one, and we start off in the state you mentioned, with every pixel needing a fetch.
We start fetching: warp 1a, w2a, ..., w16a, w1b, w2b, ..., w16b (a and b refer to the two multiprocessors). It takes 8 base clocks to texture a warp, so after w8b is issued, w1a's results start coming in and we can start doing math (slowly at first, due to insufficient warps). After w14b is issued, w1a-w6a's results are here, so we have 6 warps to rotate between to keep efficiency at its peak. w1a eventually gets done (and now needs a new texture fetch), followed by w2a, etc., and during this time w7a, w8a, etc. start receiving their results, ready to be fed into the ALUs. The process keeps going, and eventually multiprocessor b can start doing math, and so on.
The point is that everything spaces out rather quickly. At that point, 15 warps (480 threads) each needing 10 instructions gives you 600 ALU clocks (4 per warp instruction), or about 255 base clocks at G80's shader-to-base clock ratio, in which to wait for texture results from the first warp.
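Just to make the bookkeeping easier to follow, here's a quick back-of-the-envelope check of those numbers (a toy calculation, not a simulator; the shader-to-base clock ratio is assumed from the 8800 GTX's 1350/575 MHz clocks):

```python
# Quick arithmetic check of the example: 2 multiprocessors sharing one TMU quad,
# 192-clock fetch latency, 10 dependent scalar ALU ops between TEX instructions.
WARP_SIZE       = 32
TMU_RATE        = 4             # fragments textured per base clock (one quad)
FETCH_LATENCY   = 192           # base clocks from TEX issue to result
ALU_OPS_PER_TEX = 10            # dependent scalar instructions between fetches
SPS_PER_MP      = 8             # scalar ALUs per multiprocessor
SHADER_PER_BASE = 1350 / 575    # assumed 8800 GTX shader:base clock ratio (~2.35)

warp_fetch = WARP_SIZE / TMU_RATE      # 8 base clocks to texture one warp
# w1a's results arrive one latency after its fetch, i.e. after 192 / 8 = 24
# warps (w1a..w8b) have gone through the TMU.
warps_before_return = FETCH_LATENCY / warp_fetch

# Once things space out, one multiprocessor has 15 warps of math to chew on
# while the 16th waits for its next fetch.
alu_clocks  = 15 * ALU_OPS_PER_TEX * (WARP_SIZE / SPS_PER_MP)   # 600 shader clocks
base_clocks = alu_clocks / SHADER_PER_BASE                      # ~255 base clocks

print(warp_fetch, warps_before_return)      # 8.0 24.0
print(alu_clocks, int(base_clocks))         # 600.0 255
```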
Note that if you had any fewer ALU instructions, you'd be texture-throughput limited (remember, multiprocessor b is also using the TMU), so the ALUs would have to idle. This is the critical point you missed.
Now, it's true that you need a little consideration for the fact that 6 warps need to be in the shader for full efficiency, and it takes a few clocks for a warp to get through the ALU pipeline, so you do need a little more latency hiding than just the 192 clocks of texture latency. Still, texture latency is the major factor in determining how many threads you need, so "# of threads ~= tex. latency * TMU throughput" is a good approximation. That's the total # of threads being fed by the TMU, so you have to count both multiprocessors.
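Plugging in the figures from this example (illustrative numbers, not spec-sheet exact), the rule of thumb looks like this:

```python
# threads ~= texture latency * TMU throughput, plus a small allowance for the
# warps that have to be in each multiprocessor's ALU pipeline anyway.
TEX_LATENCY   = 192   # base clocks
TMU_RATE      = 4     # fragments per base clock for the shared quad
WARP_SIZE     = 32
WARPS_FOR_ALU = 6     # warps needed in flight to keep one MP's ALUs fed

threads_for_latency = TEX_LATENCY * TMU_RATE      # 768, across both multiprocessors
threads_for_alu     = WARPS_FOR_ALU * WARP_SIZE   # 192, per multiprocessor

print(threads_for_latency, threads_for_alu)       # 768 192
```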
768 fragments is an onscreen quad of 32*24 pixels. I don't think that number of fragments needing texturing "simultaneously" is some kind of rare coincidence.
With less regular spacing between ALU and texture instructions, statistics takes over. It would be a huge fluke to get a big pile of simultaneous texture requests unless you have a texture-heavy shader.
When ATI expanded the ALU pipe count going from R520 to R580, do you think they kept the register file capacity the same? No, it grew 3x too. The batch count per cluster stayed the same (128 batches) and the size of each batch tripled (16 fragments to 48).
Yeah, and IMO that was a stupid design decision. Why on earth do you need 6144 fragments for 12 Vec3+scalar ALUs and one TMU quad? G71 has around that many fragments to hide the latency of six TMU quads. R580 can wait a whopping 1500+ clocks for a texture result when the TMUs are churning out results at their peak rate, as I mentioned before. The ALUs can wait ~500 clocks before needing to return to the same batch. It's ridiculous.
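For what it's worth, those R580 figures fall straight out of the batch numbers above (128 batches of 48 fragments per cluster, with 12 ALU pipes and one TMU quad per cluster assumed from this discussion):

```python
# Per-cluster R580 figures derived from the batch counts quoted above.
BATCHES    = 128
BATCH_SIZE = 48     # fragments per batch (R520: 16)
ALU_PIPES  = 12     # Vec3+scalar pipes per cluster
TMU_RATE   = 4      # fragments textured per clock (one quad)

fragments_in_flight = BATCHES * BATCH_SIZE            # 6144
tex_wait = fragments_in_flight / TMU_RATE             # 1536 clocks at peak TMU rate
alu_wait = (BATCHES - 1) * (BATCH_SIZE / ALU_PIPES)   # ~508 clocks before a batch
                                                      # gets the ALUs again
print(fragments_in_flight, tex_wait, alu_wait)        # 6144 1536.0 508.0
```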
Anyway, do we have solid proof that R580 really does have triple the register file, say, by running a pure math shader and increasing the register count? If it's true, it would really suggest that R520 is a whole lot of nothing, i.e. >150M transistors of scheduling logic.