Sir Eric Demers on AMD R600

Here is another snapshot of the rather poor OGL performance of R600 in ArchMark, quite aside from the most disgraceful results:

[Attached image: archmarkfd9.png - ArchMark results]
 
Dealing with the SF is more of an issue with throughput rather than latency, isn't it?
Yeah, precisely - you want to have instructions available to co-issue on the MAD if at all possible. I was distinguishing between 16-vertex batches and 32-fragment warps, and suggesting that vertex instructions tend to be wider (hence more likely to keep the MAD occupied) while fragment instructions tend to be narrower, hence a lower chance of keeping the MAD busy - which is corrected by doubling the number of fragments, i.e. joining two batches together.

Per 8 ALUs, you only get two SF results per clock.
Either 1 or 2 results per clock depending on instruction. So that impacts dependency - you can't use the result of an SF until at least 16 of them have been produced.

The ALUs have 8 clock latency too, don't they?
I'm unclear whether you're referring to the duration of an instruction from fetch to retire, or to read-after-write latency.

Anyway, this seems like a very minor consideration to me, focussed too narrowly on worst case scenarios.
Well if you can come up with a better reason for warps being 32 fragments, when the batch size of G80 is 16, then you're welcome... It seems to me it's the same reason CUDA uses warps of 32. It can't be that minor if it squanders dynamic branching performance.

Yup, I know that. ALU clocks of latency hiding isn't particularly relevant, though. You're trying to saturate the texture unit, which fetches values for 4 pixels each clock. With a ~200 clock latency, you need a minimum of ~800 threads per cluster to avoid stalls. That's a lot more than the 192x2 needed to avoid those register limitations, which shouldn't change with a double-width SIMD array. The only thing that will change is the number of threads needed in flight to saturate the ALUs, and that's still a lot less than ~800.
We started discussing this originally because of the idea of doubling the width of the ALU pipeline, to 16 objects. If you do that, then you consume fragments twice as fast. The 768 fragments which in G80 provide a minimum of 96 clocks of latency hiding will only provide 48 clocks of latency hiding in the new GPU. So the register file has to double in its fragment capacity. And that's before you clock the new GPU's ALU core to 2GHz+...
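A rough back-of-the-envelope version of that, using the figures above (these are the thread's numbers, not anything official):

```python
# Clocks of texture-latency hiding a pool of in-flight fragments provides,
# as a function of SIMD width (ALU clocks, one instruction per fragment).
# Numbers are the ones quoted above, not vendor figures.

def latency_hiding_clocks(fragments_in_flight, simd_width):
    return fragments_in_flight / simd_width

print(latency_hiding_clocks(768, 8))    # G80-style 8-wide SIMD   -> 96.0 clocks
print(latency_hiding_clocks(768, 16))   # doubled-width SIMD      -> 48.0 clocks
print(latency_hiding_clocks(1536, 16))  # doubled register file   -> back to 96.0 clocks
```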

Having fewer texture units should allow ATI hardware to get away with a much smaller register file than NVidia's, but they seem to go the other way. Why on earth does R580 handle 512 batches of 48 pixels?
Per ALU pipeline, it's 128 batches.

That's enough for 1536 cycles between texture lookups! For each TMU-quad, G80 has a max of 48 warps while R580 has 128 batches, with the latter bigger than the former to boot.
You can't spread the latency around like you're doing. e.g. G80's two SIMDs per cluster each have to hide their own texturing latency, independently. Threads can't be shared across the two SIMDs.

Jawed
 
But I have some doubt, because looking across the cards/resolutions/settings I see bandwidth playing a factor which makes me wonder if fillrate is dominating.
For the "no HDR" scores I doubt it. The 8800GTS wouldn't perform similarly to the HD2900XT in that case. Remember that anything above 130 fps is getting a lot of CPU limitation. If you plotted framerate vs. CPU speed for any video card (or framerate vs. video card speed for any CPU), it would gradually level off, not be a sudden transition from linear to plateau. You start having parts of the test that are CPU limited and parts that are GPU limited instead of being 90% one or the other.

7950GT should be ~44% faster if bilinear texturing rate were dominant here. Trouble is, G7x texturing isn't a bastion of efficiency, and R580's out of order texturing may be making a huge difference (even though this is supposedly "DX8" type code) so this datapoint may be moot :cry:
Yeah, I'll agree with that. Remember how texturing takes away ALU ability and vice versa for G7x, and stalls can't be hidden. I don't even know if I'd attribute it to "out of order" processing - even R300 could do math while the texture units were doing their thing.

So you're saying that G84's lacklustre performance advantage over RV630 is because it has too much texture filtering/addressing/fetching? Maybe the ALU pipeline clocks in G84 are way lower than NVidia was hoping to get. Or maybe the core clock just goes much higher than they were expecting.

I was under the impression that G84's co-issued MUL works much better than G80's. Maybe that's a driver problem yet to be fixed; it is still pretty new, after all.
There's definitely something weird about G84's texturing. On pretty much every NVidia GPU for the last 5 years you'd get >95% efficiency in multitexturing tests. G84 only manages 70% efficiency (or, according to RightMark, only 50%!).

One thing that bugs me is that you need to iterate 2 texture coordinate values in order to fetch a texture. If G84 can only do 4 per cluster per ALU clock (like G80), then it should be limited to 5800 Mtex/s. Maybe 3DMark does multiple fetches per texcoord, but RightMark uses a different texcoord pair per texture fetch, which is how you'd do it with fixed-function processing.
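For what it's worth, here's the arithmetic I'm assuming behind that 5800 figure (the 2 clusters and ~1450 MHz ALU clock for an 8600 GTS-class G84 are my assumptions):

```python
# Speculative reconstruction of the 5800 Mtex/s interpolation limit above.
# Assumes an 8600 GTS-style G84: 2 clusters, ~1450 MHz ALU clock, and 4 scalar
# attribute-interpolation results per cluster per ALU clock (as on G80),
# with 2 texcoord components needed per texture fetch.

clusters         = 2
alu_clock_mhz    = 1450
interp_per_clock = 4   # scalar interpolations per cluster per ALU clock
coords_per_fetch = 2   # u and v per texture fetch

limit_mtex_per_s = clusters * alu_clock_mhz * interp_per_clock / coords_per_fetch
print(limit_mtex_per_s)  # -> 5800.0 Mtex/s
```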

Does G84 have more SF units per cluster than G80? That might explain better MUL co-issue, I guess, and why they'd put so much texture addressing there.

I see there's some debate on the single-textured test, oh well... I don't understand why the RightMark test doesn't show any sign of the TA:TF ratio as number of textures increases - unless each texture has independent coordinates.
All texturing tests out there measure bilinear filter speed. Even if they enable trilinear, they'd have to make sure that the pixel:texel ratio was right so that even optimized drivers would sample from two mipmaps. Otherwise you wouldn't see the TA:TF exposed.

G80 doesn't let you sample two different textures per clock even if the coordinates are the same. The address will be different, and in games even when textures use the same coordinates, they often have different resolutions. The TF ability is there only for faster AF, trilinear, and volume texture sampling with 8-bit per channel textures.

I presume you meant INT16 filtering.

G80 has 136% of the theoretical trilinear/AF/fp16 capability of G71. Out of order threading should provide 30-50% more performance. So G80 is in the region of 200% of G71's theoretical capability. What was the matter with 2x faster? Why go through the pain of shoving NVIO off die? 3x the TF when you're aiming to construct a 2x faster GPU is just bizarre.
Nope, I meant INT8, though for some reason ATI does INT16 filtering at half the speed of FP16, whereas G80 does both at the same speed.

Anyway, I was saying an ATI TMU quad and G80 TMU quad both have the same 4-channel filtering rate for FP16, FP32, sRGB, and other new >8bpc HDR formats. The only difference is with INT8 filtering, and then only for aniso, trilinear, and volume textures.

It's not bizarre because this particular addition is fairly cheap. The G71 had very shimmer-prone AF and that star-shaped pattern. G80 gives you 2x the performance and better image quality.

It's a very sensible design decision for high end cards where people care about AF. I personally think it's important for low end cards too, but my guess is that sales aren't particularly dependent on it.
 
I wouldn't be surprised if ALU latency is MUCH more than 8 cycles
Well, I'm told it's 8 in Xenos, and in Cell it's 7 cycles for MADD. 8 stages is probably enough to get 1.5 GHz, and they certainly want to keep the latency as short as possible if they want small batches, or else they need to flip back and forth between more batches.
 
Well if you can come up with a better reason for warps being 32 fragments, when the batch size of G80 is 16, then you're welcome... It seems to me it's the same reason CUDA uses warps of 32. It can't be that minor if it squanders dynamic branching performance.
Maybe a warp count limitation? Maybe NVidia would rather have 16-object granularity for the VS instead of extra texture latency hiding. High register usage in the VS (lots of inputs and outputs) probably precludes them from having lots of threads anyway.

We started discussing this originally because of the idea of doubling the width of the ALU pipeline, to 16 objects. If you do that, then you consume fragments twice as fast. The 768 fragments which in G80 provide a minimum of 96 clocks of latency hiding will only provide 48 clocks of latency hiding in the new GPU. So the register file has to double in its fragment capacity. And that's before you clock the new GPU's ALU core to 2GHz+...
You just told me that 192 fragments (6 warps) is enough to hide the register file limitations and keep the ALUs fed. Double the warp size and SIMD width, and now 384 fragments is enough. That's still less than what G80 can currently handle.

The reason G80 has 768 threads is to hide texture latency.
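Quick check of that warp arithmetic (the warp counts are the ones used in this thread, not measured):

```python
# 6 warps cover the register read-after-write limits; doubling the warp size
# (and SIMD width) doubles the fragment count needed, but it stays well below
# the 768 fragments G80 already keeps in flight per multiprocessor.

warps_needed = 6
warp_size    = 32

print(warps_needed * warp_size)       # -> 192 fragments today
print(warps_needed * warp_size * 2)   # -> 384 fragments with a 64-wide warp, still < 768
```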
Per ALU pipeline, it's 128 batches.
Hence my next paragraph.
You can't spread the latency around like you're doing. e.g. G80's two SIMDs per cluster each have to hide their own texturing latency, independently. Threads can't be shared across the two SIMDs.
I know the threads can't be shared, but it doesn't matter. TMU throughput is 4 pixels per base clock. If you have 500 threads in each SIMD array needing a texture request, regardless of what order you choose it will take 250 clocks to get through them. Those 500 threads are more than enough per SIMD to do some math in the mean time, and you can double the SIMD width and still have enough threads. By the time the result for the first texture request comes back (~200 cycles), the TMU is almost ready to do more fetching.

Now, it's true that you can get away with fewer threads if you can do multiple fetches per pixel before needing a result. We don't really know how long it takes for a fetch to come back, but judging from G7x it appears 220 cycles is the minimum latency hiding that NVidia feels is necessary. Dependent texturing tests with a controlled number of registers is probably the best way to find out how long latency is.

(A case where one SIMD array needs 400 requests and the other needs none is a statistical anomaly. I described a workload where the need to hide latency is consistently high. Other situations, e.g. a non-saturated TMU, can use fewer threads to hide latency.)
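A minimal sketch of the saturated-TMU case, with the same numbers as above (they're assumptions, not measurements):

```python
# Two SIMDs per cluster, each holding 500 threads that all want a fetch, share
# a TMU that retires 4 results per base clock. The queue takes longer to drain
# than the assumed ~200-clock fetch latency, so the TMU never starves even
# though threads can't migrate between the SIMDs.

threads_per_simd  = 500
simds_per_cluster = 2
tmu_per_clock     = 4     # fetch results per base clock, shared by the cluster
fetch_latency     = 200   # assumed round-trip latency in base clocks

clocks_to_drain = threads_per_simd * simds_per_cluster / tmu_per_clock
print(clocks_to_drain)                   # -> 250 base clocks of queued fetches
print(clocks_to_drain > fetch_latency)   # -> True: first result returns before the queue empties
```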
 
I might be reading that OGL chart wrong, but is that in-chip bandwidth (cache) or memory bandwidth? Those are some really strange results for the R600; it's like it's choking on something.
 
Yeah, I'll agree with that. Remember how texturing takes away ALU ability and vice versa for G7x, and stalls can't be hidden. I don't even know if I'd attribute it to "out of order" processing - even R300 could do math while the texture units were doing their thing.
But this is a pure bilinear texturing case. The texturing latency should be hidden 100% of the time and losing the top MAD ALU to TEX still leaves 1 instruction co-issuing.

R300's semi-decoupled texturing is actually a disadvantage here, because there's never enough math to avoid stalling the ALU pipe - it actually requires a 3:1 ALU:TEX ratio in the code. On the other hand, if the shader is predominantly TEX, does it matter?

Now it's true that R580 is ~20% faster than R520 per clock in texturing heavy portions of D3, simply due to its ALU:TEX ratio (R520 is ALU-bound) - but that's in high-AF scenarios, ultra-quality texture mode. So R580 is theoretically better than G7x due to its out of order threading, but in a bilinear texturing game test?...

There's definitely something weird about G84's texturing. On pretty much every NVidia GPU for the last 5 years you'd get >95% efficiency in multitexturing tests. G84 only manages 70% efficiency (or, according to RightMark, only 50%!).
Yeah, I decided "come back in a few months".

One thing that bugs me is that you need to iterate 2 texture coordinate values in order to fetch a texture. If G84 can only do 4 per cluster per ALU clock (like G80), then it should be limited to 5800 Mtex/s.
Ah, you think that it's attribute interpolation that's slowing down G84? That's a good thought. Though in the past I've speculated that NVidia has the option to perform interpolation "early" and dump the results in CB (or, worst case, RF). I think a consensus arose that NVidia is storing 1/w, but I'm not sure now.

These shaders may be "too intensive" for "early attribute interpolation" to make a difference. If the shader is literally too short, then G84 is doomed. The synthetic itself becomes the problem, not G84.

Does G84 have more SF units per cluster than G80? That might explain better MUL co-issue, I guess, and why they'd put so much texture addressing there.
I don't know why G84 is supposed to be better at MUL co-issue. I'm waiting for the journalists at this site to dig, they've got the hardware and the connections :LOL: ...

Certainly, with these G84 texturing results (which I wasn't aware of) the mystery deepens.

Nope, I meant INT8, though for some reason ATI does INT16 filtering at half the speed of FP16, whereas G80 does both at the same speed.
Hmm, don't get what you meant by "adds a little more math for the INT8 filtering". Whereas we know that NVidia must have added math for INT16 filtering, because ATI is running at half rate, and INT16 requires more precision than FP16.

Anyway, I was saying an ATI TMU quad and G80 TMU quad both have the same 4-channel filtering rate for FP16, FP32, sRGB, and other new >8bpc HDR formats. The only difference is with INT8 filtering, and then only for aniso, trilinear, and volume textures.

It's not bizarre because this particular addition is fairly cheap. The G71 had very shimmer-prone AF and that star-shaped pattern. G80 gives you 2x the performance and better image quality.

It's a very sensible design decision for high end cards where people care about AF. I personally think it's important for low end cards too, but my guess is that sales aren't particularly dependent on it.
Maybe that gives the lie to the "G7x is wonderfully efficient per mm²" bullshit then, because HQ texturing was, until the bitter end, not something enabled in benchtests on G71. And G71 still fell as much as 50% behind R580 - though I wouldn't put the blame on texturing per se.

I'm prolly in the minority - I think R600's AF looks better than G80's. Those 3DMk06 tunnel tests show R600 producing cleaner results and detail further into the tunnel. But I haven't seen that test in motion (does it move?).

Jawed
 
I'm prolly in the minority - I think R600's AF looks better than G80's. Those 3DMk06 tunnel tests show R600 producing cleaner results and detail further into the tunnel. But I haven't seen that test in motion (does it move?).
Yeah, it can move. I'll see if I can knock up some movies for you, actually (and some from a game or two as well).
 
I know the threads can't be shared, but it doesn't matter. TMU throughput is 4 pixels per base clock. If you have 500 threads in each SIMD array needing a texture request, regardless of what order you choose it will take 250 clocks to get through them.
G80 has 768 threads (fragments) per multiprocessor. It takes 226 base clocks to get through them, worst case (8800GTX, 2.35:1 ALU:core clock).

If you double the ALU width, then it will take 113 base clocks to get through them.

Now, it's true that you can get away with fewer threads if you can do multiple fetches per pixel before needing a result. We don't really know how long it takes for a fetch to come back, but judging from G7x it appears 220 cycles is the minimum latency hiding that NVidia feels is necessary.
I don't think that's necessarily a good guide. The reason I say that is that G7x architecture is scaled up and down for various SKUs, but is a common pipeline architecture. Different SKUs have different clocks, there's a variety of DDR types and therefore the ratio of core:memory clocks or, ultimately, the ratio of instruction-throughput:memory-latency varies. It'd be nice to see an analysis of this SKU-scaling effect on architecture...

Put another way, I wonder if 220 is biased to making AF work, and is significantly over the top for bilinear.

Dependent texturing tests with a controlled number of registers is probably the best way to find out how long latency is.
Yes, dependent texturing is another area where GPUs seem infeasibly good...

Jawed
 
Yeah, it can move. I'll see if I can knock up some movies for you, actually (and some from a game or two as well).
One of the things that puzzles me about those 3DMk06 images is why they're non-circular. What the hell is going on there?

That must be having some kind of effect on texture filtering, so it makes me wonder if it's even a valid test. Or is it deliberate, to catch out certain kinds of optimisations in texture filtering?

Jawed
 
Certainly, with these G84 texturing results (which I wasn't aware of) the mystery deepens.

But, if I read the B3D article correctly, G84 has more texture address units per cluster, but not more texture filtering units per cluster with respect to G80.
So, total filtering power per clock should be 1/4 of G80's but "texture addressing power" should be 1/2.
 
But, if I read the B3D article correctly, G84 has more texture address units per cluster, but not more texture filtering units per cluster with respect to G80.
So, total filtering power per clock should be 1/4 of G80's but "texture addressing power" should be 1/2.
In the second comparison posted:

http://www.digit-life.com/articles2/video/g84-part2.html#p5

8600GTS is being compared against 8800GTS. It has 90% of the 8800GTS's theoretical bilinear rate (10800 MT/s versus 12000 MT/s). The 8 textures case shows the 8600GTS achieving 47% of the 8800GTS; 1 texture is 52%, 2 textures is 54% and 3 textures is 56% (the best).
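For reference, here's how I reconstruct those theoretical figures (the unit counts and clocks are my assumptions, consistent with the numbers quoted):

```python
# Theoretical bilinear INT8 rate = texture address units * core clock.
# Unit counts/clocks are my assumptions for these SKUs, not from the article.

cards = {
    "8600GTS": (16, 675),  # G84: 2 clusters x 8 TA, 675 MHz core
    "8800GTS": (24, 500),  # G80: 6 clusters x 4 TA, 500 MHz core
}

rates = {name: ta * clock for name, (ta, clock) in cards.items()}
print(rates)  # -> {'8600GTS': 10800, '8800GTS': 12000} MT/s
print(round(rates["8600GTS"] / rates["8800GTS"], 2))  # -> 0.9, versus the measured 0.47-0.56
```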

Jawed
 
8600GTS is being compared against 8800GTS. It has 90% of the 8800GTS's theoretical bilinear rate (10800 MT/s versus 12000 MT/s). The 8 textures case shows the 8600GTS achieving 47% of the 8800GTS; 1 texture is 52%, 2 textures is 54% and 3 textures is 56% (the best).

Incidentally, G84-GTS has 50% of the bandwidth of G80-GTS. If you run a bandwidth test, you shouldn't be surprised that things scale by the bandwidth ratio...
 
Incidentally, G84-GTS has 50% of the bandwidth of G80-GTS. If you run a bandwidth test, you shouldn't be surprised that things scale by the bandwidth ratio...
Multitexturing tests with more than 1 or 2 textures are never bandwidth limited unless you use an enormous texture (netting you a 1:1 pixel:texel ratio) and no compression. I'm 99% sure that RightMark and 3DMark do not fall into that category. That's why nearly every video card reaches its peak when running them.

Except G84...
 
And you can't say that with a completely straight face either. There's nothing easy in writing drivers for new GPU architectures, and the opportunities to leverage expertise in writing a driver for an old arch on the same API aren't as wide ranging as you might think. It'd be nice to see AMD or NVIDIA publish a diagram or two of the driver stack that highlights the reality of the complexity they contain.

Come on Rys, how long have they had the R600 out for? Looking at pictures of the R600 in January leads me to believe, for quite a while. So to release it almost 6 months after the first pics hit the internet (if not longer) and still have shoddy drivers for DX9, they need to be smacked.

I would've understood if it was 2 months, hell, maybe even three... but 6?

US
 
But this is a pure bilinear texturing case. The texturing latency should be hidden 100% of the time and losing the top MAD ALU to TEX still leaves 1 instruction co-issuing.
We're still talking about SS2, right? Games are never purely a bilinear texturing test. Moreover, drivers and benchmark choice are always a potential factor. Everything scales as expected here, for example.

Ah, you think that it's attribute interpolation that's slowing down G84? That's a good thought.
Sort of, but if you work out the numbers that doesn't seem like the right theory either, unless 3DMark is re-using interpolators in some non-integer ratio.

Hmm, don't get what you meant by "adds a little more math for the INT8 filtering". Whereas we know that NVidia must have added math for INT16 filtering, because ATI is running at half rate, and INT16 requires more precision than FP16.
INT16 filtering can be done by chaining together two INT8 units at minimal cost (I worked out the logic myself), because the interpolation factors are constant width (8-bit?) for all texture formats. Xenos does it the same way, with bilinear filtering speed always dependent on total width of the format (except for FP16).

For this reason, forget about INT16. For all intents and purposes, one extra INT8 filtering unit per TMU and some very simple relative addressing is all the math NVidia has added.
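To make the "chaining two INT8 units" point concrete, here's a toy version of the idea (my own illustration, not anyone's actual datapath): the weights stay 8-bit for every format, so a 16-bit-per-channel blend only needs two 8x8 multiplies per term plus a shift and add.

```python
# Toy illustration of INT16 filtering built from INT8 multipliers. The filter
# weight is 8-bit regardless of texel format, so a 16x8 multiply splits into
# two 8x8 partial products. Not NVidia's or ATI's actual hardware, just the idea.

def mul16x8(x16, w):
    """16-bit unsigned texel times an 8-bit-ish weight, via two 8x8 multiplies."""
    lo, hi = x16 & 0xFF, x16 >> 8
    return (hi * w << 8) + lo * w

def lerp16(a16, b16, w):
    """Blend two 16-bit texels; w in 0..256 (endpoint handled loosely here)."""
    return (mul16x8(a16, 256 - w) + mul16x8(b16, w)) >> 8

print(hex(lerp16(0x0000, 0xFFFF, 128)))  # halfway blend -> 0x7fff
```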
Maybe that gives the lie to the "G7x is wonderfully efficient per mm²" bullshit then, because HQ texturing was, until the bitter end, not something enabled in benchtests on G71. And G71 still fell as much as 50% behind R580 - though I wouldn't put the blame on texturing per se.
HQ on both cards is not apples to apples. R580 looked better due to better AF detection math, and NVidia probably disabled a bunch of optimizations (which ATI still mostly had on) that don't affect quality too much but tank performance.

"As much as 50%" does not invalidate the statement that G71 is far more efficient per mm^2 for real games. You'd have to have your head way down in the sand to think that statement is "bullshit". Nobody runs G7x in HQ texturing mode if performance matters at all, and very few people can tell the difference when it is. HQ mode is pretty cheap on R580, so it is usable there.

Now ask yourself if R580's HQ texturing at slightly better overall performance than G71's standard texturing is worth 80% more die space. I say hell no, and so would every executive at ATI and NVidia. The only reason NVidia didn't have HQ AF was because nobody cared during the NV3x/R300 era (mostly because NV3x was so deficient in other areas, and NVidia fubared the AF anyway with their crazy mipmap optimizations), and it cost them a few percent in performance too. The die cost is really cheap.

(BTW, we're getting really off topic. Mods, maybe you can move the discussion between me and Jawed to another thread?)
 
G7x has bad texturing visual issues in Oblivion in "Quality" mode. Especially if you add any of Qarl's mods. You can see distinct lines of mip map boundaries, most notably up near Bruma in the snow. Icky. KOTOR had probs with it too. These things get solved by going into HQ mode. That necessity really soured me on NV4x and G7x. I don't think there is an ATI card out there that has texture filtering as bad as what I've seen come out of 6800/7800 at times.

But yeah, usually Quality mode is adequate.

On another note, would anyone like to explain to me why G80's 16-bit color depth rendering looks worse than a Rage 128 in 16-bit? It's horrifyingly bad. Switch Quake 3 into 16-bit color depth and it'll look like stained glass almost, lol. Kinda a bummer if you ever load up some older games.
 