AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to the upcoming ATI RV870 lineup of GPUs?

  • Within 1 or 2 weeks - Votes: 1 (0.6%)
  • Within a month - Votes: 5 (3.2%)
  • Within couple months - Votes: 28 (18.1%)
  • Very late this year - Votes: 52 (33.5%)
  • Not until next year - Votes: 69 (44.5%)

  Total voters: 155. Poll closed.
This appears consistent with a simple VLIW design.
It's better than how some of the original VLIWs would simply read and write, heedless of hazards, but it's still just a simple check: the scheduler picks up a potential conflict and injects a NOP into the wavefront's instruction stream, rather than trying to piece through an instruction packet's read and write operands.
Yeah, the hardware knows the size of the domain of shared registers so it's a simple check.
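
To make that concrete, here's a toy sketch (Python, purely illustrative; the packet format is invented and the 4-cycle latency is just the example figure mentioned below) of the kind of check being described: compare a packet's source registers against destinations still in flight and issue NOPs, rather than piecing through the operands in any more detail.

```python
# Toy model of the "simple check" described above: if any source register of the
# next instruction packet overlaps a destination that is still in flight within
# the read-after-write latency window, issue NOPs instead of the packet.
# The packet layout and latency value are invented for illustration only.

RAW_LATENCY = 4  # assumed register read-after-write latency, in cycles

def schedule(packets):
    """packets: list of (dest_regs, src_regs) tuples; returns the issue stream."""
    in_flight = []          # (dest_reg, cycles_remaining)
    issued = []
    for dests, srcs in packets:
        # stall with NOPs until no source overlaps an outstanding write
        while any(src in {d for d, _ in in_flight} for src in srcs):
            issued.append("NOP")
            in_flight = [(d, c - 1) for d, c in in_flight if c - 1 > 0]
        issued.append(f"ISSUE dst={dests} src={srcs}")
        in_flight = [(d, c - 1) for d, c in in_flight if c - 1 > 0]
        in_flight += [(d, RAW_LATENCY) for d in dests]
    return issued

# Example: the second packet reads r1, which the first packet writes,
# so four NOPs get injected between them.
for line in schedule([(["r1"], ["r2", "r3"]), (["r4"], ["r1"])]):
    print(line)
```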

If there were forwarding within the cluster, this latency could be avoided, but that's 16 5-way bypass networks per SIMD and a tag check per ALU.
The in-pipeline registers, used to avoid read-after-write latency, have the data, but there's still no time to get that data in place if the underlying latency of those registers is, say, 4 cycles.

Maybe they haven't settled on a final scheme for data sharing?
Global registers are an incremental addition to what is already there.
The LDS is an addition, but with minimal disruption to the already existing design.

Maybe AMD doesn't want to commit too much for a low-level detail that they might be revamping.
Yeah, still too early to tell whether LDS is viable. Funny, coming up to 3 years after G80 and one year after RV770 and still no idea if AMD has caught up.

Jawed
 
The in-pipeline registers, used to avoid read-after-write latency, have the data, but there's still no time to get that data in place if the underlying latency of those registers is, say, 4 cycles.

If the complexity of a forwarding network were tolerable, this would avoid that problem. It would also probably make pipeline registers pointless.

Given the complexity required to shave off those last few cycles, pipeline registers were at least a good compromise for the current design.
 
Well there's always the possibility that it's not the theoretical bandwidth available driving this thing but the architecture's use of that bandwidth.
True, R6xx is appalling on this score.

There could be other changes that impact the effective bandwidth more than theoretical numbers would suggest. Or maybe it's not bandwidth at all and these things are so complicated that this sort of simplified analysis won't ever uncover what's really going on...
Apart from anything else we need to have some basis for reasoning about the progress of these things. Shame the data's so thin though.

Maybe there's something in the RBEs no-one's spotted, for example...

---

Changing tack slightly.

Ideally HD5870 would be 2x the performance of RV790 in everything: bandwidth, texturing, GFLOPs and RBEs. How much of that could be squeezed in, though?

A naive ~0.53x area scaling for 40nm in comparison with 55nm implies that 2xRV740 would be 259mm². That would result in 16 clusters, 1280 ALUs, 64 TUs, 32 RBEs and a 256-bit bus. Theoretically memory controller physical-I/O (analogue) doesn't scale down with process particularly well, so 2xRV740 should be smaller than 259mm². Another 4 clusters could be about 22mm² (he says, boldly).
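
Laying that arithmetic out explicitly (the 0.53 factor is just the naive 55nm-to-40nm area ratio; the 259mm² and 22mm² figures are the guesses above, not measured die sizes):

```python
# Back-of-envelope check of the numbers above. The 0.53 scaling factor is the
# naive area ratio of the process nodes; the 259mm^2 and 22mm^2 figures are the
# assumptions from the post, not measured die sizes.
naive_scaling = (40 / 55) ** 2            # ~0.53, ideal 55nm -> 40nm area shrink
print(f"naive 55nm->40nm area factor: {naive_scaling:.2f}")

doubled_rv740_mm2 = 259                   # post's estimate for a "2x RV740" config
extra_clusters_mm2 = 22                   # post's guess for 4 more clusters
total = doubled_rv740_mm2 + extra_clusters_mm2
print(f"16+4 clusters, 256-bit: ~{total} mm^2")   # ~281 mm^2, i.e. under 300 mm^2
```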

Seems like a "doubled RV790" is do-able in :love:00mm², doesn't it?

Jawed
 
Jawed,

In our original review of the HD 4670, a different (older) set of tests and also a different test system were used.

The index relative to the HD 4850 was 60.6% for the HD 4670 and 69.6% for the HD 3870. This time 'round it changed to 66.2% for the HD 4670 and 58% for the HD 3870. (I can send you the OpenOffice file if you want.)

You could argue that the test course changed towards more texturing- or ALU-heavy usage, or that drivers have been optimized more for the HD 4000 series than for the older generation, but IMO not that bandwidth plays more of a role than it did in September 2008.

Seems like a "doubled RV790" is do-able in :love:00mm², doesn't it?

No DX11-love from AMD?
 
Jawed,

In our original review of the HD 4670, a different (older) set of tests and also a different test system were used.

The index relative to the HD 4850 was 60.6% for the HD 4670 and 69.6% for the HD 3870. This time 'round it changed to 66.2% for the HD 4670 and 58% for the HD 3870. (I can send you the OpenOffice file if you want.)
So HD4850 got ~20% faster relative to HD3870 over that time, and HD4670 gained ~9% relative to HD4850, making it ~31% faster relative to HD3870 than before.
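
Spelling out the arithmetic, with HD4850 as the 100% reference in both runs:

```python
# Working through the index shifts quoted above (HD4850 = 100% in both runs).
old = {"HD4670": 60.6, "HD3870": 69.6}   # September 2008 review
new = {"HD4670": 66.2, "HD3870": 58.0}   # current test set

# HD4850 relative to HD3870: the HD3870 index dropping means HD4850 gained.
hd4850_gain = (100 / new["HD3870"]) / (100 / old["HD3870"]) - 1
# HD4670 relative to HD4850:
hd4670_gain = new["HD4670"] / old["HD4670"] - 1
# HD4670 relative to HD3870 combines the two:
combined = (1 + hd4850_gain) * (1 + hd4670_gain) - 1

print(f"HD4850 vs HD3870: +{hd4850_gain:.0%}")   # ~+20%
print(f"HD4670 vs HD4850: +{hd4670_gain:.0%}")   # ~+9%
print(f"HD4670 vs HD3870: +{combined:.0%}")      # ~+31%
```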

You could argue that the test course changed towards more texturing- or ALU-heavy usage, or that drivers have been optimized more for the HD 4000 series than for the older generation, but IMO not that bandwidth plays more of a role than it did in September 2008.
Or the AI tweaks for the newest games take a while to work out :p

No DX11-love from AMD?
It's really hard to find the size impact of double-precision on ATI GPUs, so I think someone needs to come up with an idea for some major new functionality that's going to take tens of mm² to be noticeable.

It's like comparing the "transistor count" of RV630 and RV635. RV635 gained D3D10.1 functionality and supposedly lost 12M transistors. Transistor count isn't much use though.

Unordered access views are like render targets that are just "chaos" as far as addressing goes, i.e. data coming out of pixel shaders or compute shaders just gets scattered into video memory. That's my interpretation anyway. I suspect atomic operations are part of UAVs, not sure. Anyone got any ideas how costly UAVs are going to be?

CS4.1 supports a single UAV on RV770 (whereas CS5.0 requires support for up to 8; render targets + UAVs together can only sum to 8 or less). The UAV is also typeless, but can have formats in CS5.0.

So, what extra logic is required to support up to 7 more UAVs or support types or make them perform "fast"?

It's like with LDS, as far as I can tell RV770 supports writes anywhere in LDS - yet CS4.1 restricts writes to a thread's own private area of LDS. What's going on there? Maybe this is a concession for RV670 (which doesn't have LDS, but maybe memory via the memory-cache can be goaded into making this work)? I'm dubious anything before RV770 supports thread group shared memory...

Jawed
 
Unordered access views are like render targets that are just "chaos" as far as addressing goes, i.e. data coming out of pixel shaders or compute shaders just gets scattered into video memory. That's my interpretation anyway. I suspect atomic operations are part of UAVs, not sure. Anyone got any ideas how costly UAVs are going to be?

I assume atomic operations are used under the hood by HLSL to implement AppendStructuredBuffer and ConsumeStructuredBuffer. A shared register holding a counter and an atomic increment to get the address for the thread.
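
As a CPU-side analogy of that counter-plus-atomic-increment idea (just a sketch of the concept, with invented names; not how the HLSL compiler or driver actually implements it):

```python
# Sketch of the append idea: each "thread" atomically bumps a shared counter to
# claim a slot, then writes its element there. A lock stands in for the GPU's
# atomic increment; the buffer and names are invented for illustration.
import threading

buffer = [None] * 1024        # backing store for the append buffer
counter = 0                   # the shared counter the atomic increment bumps
lock = threading.Lock()

def append(value):
    global counter
    with lock:                # GPU hardware would do this as a single atomic op
        slot = counter
        counter += 1
    buffer[slot] = value      # scatter the element to the claimed address

threads = [threading.Thread(target=append, args=(i * i,)) for i in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(counter, buffer[:8])    # 8 elements appended, order not guaranteed
```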

As for performance, I expect it to boil down to coherency. For CS4.x, where you only have RWBuffer and not RWTexture2D etc., it'll leave a lot of responsibility in the hands of the developer, whereas I expect RWTexture2D to be swizzled in general and perform similarly to writes to a render target. Doing any form of computation in two dimensions in CS4.x will probably require the developer to do some kind of manual tiling rather than using the more intuitive linear mapping, at least if you want reasonable performance.
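
For what it's worth, the sort of "manual tiling" meant here could be something like a Morton/Z-order mapping of 2D coordinates onto the linear buffer, so that 2D neighbours stay close together in memory - a sketch, not anything mandated by CS4.x:

```python
# Sketch of manual tiling: map 2D coordinates into a Morton (Z-order) index so
# that 2D-local accesses land close together in the linear RWBuffer, instead of
# a plain y*width+x mapping where vertical neighbours are a whole row apart.
def morton_index(x, y, bits=16):
    index = 0
    for i in range(bits):
        index |= ((x >> i) & 1) << (2 * i)       # interleave x bits into even positions
        index |= ((y >> i) & 1) << (2 * i + 1)   # interleave y bits into odd positions
    return index

# A 2x2 quad of pixels ends up in 4 consecutive buffer slots:
for (x, y) in [(4, 4), (5, 4), (4, 5), (5, 5)]:
    print((x, y), "->", morton_index(x, y))
```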
 
Thinking a bit about what Jawed went on about yesterday, I'm starting to wonder if AMD is still in a good position to keep a 256-bit memory interface.

After all, it was their early use of GDDR5, and Nvidia's reluctance to do likewise, that made a good part of RV770's excellent impression.
 
So are you basically saying that the relatively small size of the AMD chips may end up biting them in the arse if GDDR5 speeds don't ramp up significantly?

Regards,
SB
 
This is a common problem, not just AMD's. The HD 4850 can produce better performance than G92 equipped with faster (>2 GHz) GDDR3, although G92's problem is mainly capacity, not bandwidth.
Generally, if there's a lack of fast GDDR5, both AMD and nVidia will encounter problems. And either the performance of their solutions will depend more on bandwidth than the GPU itself, or it will boil down to who can do better compression.
 
For instance, is the fact that HD4670 is faster than HD3870 down to driver maturity in the ~6+ months since the launch of HD4670, or testing technique, or game selection, or is it just highlighting performance that's always been like this?

My tests show quite a different picture of the two cards, the 3870 being 10-40% faster in most cases. Driver was Catalyst 9.3.
 
HD4890 vs HD4770:
+125% BW
+60% performance

ATi could squeeze an additional 40% performance out of the R7xx architecture even when using today's 3600MHz GDDR5.

If we speculate about a ~300mm² GPU, then it's realistic to expect about a 50% performance increase - even 4000MHz GDDR5 would be sufficient for that.

For a more significant performance increase, the die has to be bigger, too. A ~400mm² die is sufficient for a 512-bit bus.

I don't think that ATi will get into a situation where RV870 could be twice as fast as RV790 but slow memory modules would spoil the performance...

I'd like to see a small 256-bit RV870 in July and a 400mm² 512-bit refresh in winter. The next refresh (+6 months) could shrink the die again to under 300mm² (32nm)...
 
I assume atomic operations are used under the hood by HLSL to implement AppendStructuredBuffer and ConsumeStructuredBuffer. A shared register holding a counter and an atomic increment to get the address for the thread.
Sounds a bit messy if there are tens of clusters all generating writes/reads simultaneously. But yeah, I suppose we can't expect this to be fast any time soon.

As for performance, I expect it to boil down to coherency. For CS4.x, where you only have RWBuffer and not RWTexture2D etc., it'll leave a lot of responsibility in the hands of the developer, whereas I expect RWTexture2D to be swizzled in general and perform similarly to writes to a render target. Doing any form of computation in two dimensions in CS4.x will probably require the developer to do some kind of manual tiling rather than using the more intuitive linear mapping, at least if you want reasonable performance.
Yeah, that becomes a real mess if you want to mix access techniques, e.g. scattered writes to a resource which you then want to fetch through texturing.

Jawed
 
Thinking a bit about what Jawed went on about yesterday, I'm starting to wonder if AMD is still in a good position to keep a 256-bit memory interface.

After all, it was their early use of GDDR5, and Nvidia's reluctance to do likewise, that made a good part of RV770's excellent impression.
I agree to a degree. The other aspect of AMD's strategy is that, in theory, they have multiple SKUs on the market covering $100-300 (and maybe higher) while NVidia only has D3D10 GPUs + GT300.

Obviously AMD's lucky that D3D11, in its own right, should provide marketable differentiation. That's not going to keep happening with later versions of D3D, as the changes seem destined to be less and less "marketable" - famous last words? Then there's Larrabee, which should always be D3D-feature complete as each new revision arrives - e.g. pre-emptive context switching should just work on Larrabee when D3D requires it, presumably in D3D12.

Jawed
 
Changes don't need to be marketable, they can be marketed anyway. People would care about SATA 2 vs SATA 1, or AGP 8x vs AGP 4x, despite a completely non-existent performance gain.
 
My tests show quite a different picture of the two cards, the 3870 being 10-40% faster in most cases. Driver was Catalyst 9.3.
How much grief are your testing scenarios?

e.g. when I'm fiddling with settings for performance in a game I'm quite happy with a few seconds of the most insane rendering (explosions, lots of foliage, lots of enemies etc. concurrently) to determine what settings I want to use. I don't care what the "average" is, I want the worst case experience to be bearable.

What's the most meaningful test of an architecture? The laziest tests that some reviewers do, or torture? The Crysis built-in benchmark appears to be one of those lies perpetrated on gamers. I got bored with the game though, so I didn't get to the later stages where performance is meant to be seriously bad in comparison with the start of the game.

I'm not saying you haven't tested properly, but I'm interested in how one values an architecture/configuration. I don't think burying occasionally terrible performance in minutes of "adequate average performance" is useful. It's like a fillrate test these days: on its own it doesn't tell us much, except whether the IHV is being deceptive about the internal configuration of the GPU.

Jawed
 
HD4890 vs HD4770:
+125% BW
+60% performance
I wonder if PCGH agrees :p

I'd like to see a small 256-bit RV870 in July and a 400mm² 512-bit refresh in winter. The next refresh (+6 months) could shrink the die again to under 300mm² (32nm)...
I dunno, maybe AMD will always do the cheapest possible refresh - that seems to be what we saw with HD4890. The 512-bit GPU comes courtesy of X2.

TSMC's 28nm node or GF's 32nm are certainly enticing prospects. This post:

http://www.semiconductorblog.com/2009/04/28/weak-node/

seems to imply that TSMC is de-emphasising 40nm in favour of getting 28nm working. I suppose this might be partly defensive against GF. It seems kinda bizarre to contemplate that 40nm might have only a ~1 year lifetime for the most advanced GPUs.

Jawed
 
I'd like to see a small 256-bit RV870 in July and a 400mm² 512-bit refresh in winter. The next refresh (+6 months) could shrink the die again to under 300mm² (32nm)...
Heck, I would prefer an all-256-bit design again, with 1.25GHz GDDR5 parts at launch and, why not, a 1.75GHz follow-up (full differential signalling). That way you take the best of both worlds -- bandwidth and lower read latency on a cheaper base -- whereas with a wider data path the benefit is almost only BW-related and you have an expensive SKU to deal with.
 
So are you basically saying that the relatively small size of the AMD chips may end up biting them in the arse if GDDR5 speeds don't ramp up significantly?

Regards,
SB
HD4870 performed quite a bit above what you would expect from an HD4850 with a 750 MHz core clock, and that was mainly due to the advent of GDDR5.

If you don't have the advantage of a brand-new memory technology compensating, at least to some degree, for your smaller memory interface, then you've got the problem of where to get your bandwidth from.

If - and only if - the trend continues of Nvidia, as part of their "real men have big dice" strategy, using a wider memory interface than AMD with their sweet-spot strategy, then yes, there's going to be a problem for AMD if Nvidia chooses to utilize GDDR5 this time as well.

edit: This is not set in stone yet, mind you! I am starting to wonder if Nvidia would dare to step down some bits from their SI and in turn also reduce the number of ROPs, lowering z-fill and pixel-fill at the same time, in order to go for a smaller die size and/or an increased number of computational units. So far, their behaviour would not make that very likely, would it?

My tests show quite a different picture of the two cards, the 3870 being 10-40% faster in most cases. Driver was Catalyst 9.3.
May I ask whether or not you've also tested without any AA/AF? Because that would greatly favour the HD3870's architectural traits (despite its nominally greater bandwidth), as you know.

And of course, the choice of tests, resolutions etc. also influences the results.
 
I wonder if PCGH agrees :p
I understand what you mean, but I used CB.de numbers. A few weeks ago I calculated average performance numbers from all the HD4890/GTX275 reviews and found that ComputerBase's numbers are about in the middle (actually, their results are slightly more positive for the GTX275 than the average, but only by a few %). I think this website and other sites with similar results are (and will be) taken as a representative illustration of RV870 performance.
 
edit: This is not set in stone yet, mind you! I am starting to wonder if Nvidia would dare to step down some bits from their SI and in turn also reduce the number of ROPs, lowering z-fill and pixel-fill at the same time, in order to go for a smaller die size and/or an increased number of computational units. So far, their behaviour would not make that very likely, would it?
If GT300 is radical then maybe they've fixed all the wastefulness in GT200 and don't need such huge amounts of TMUs/ROPs.

If rumours of ALUs being more sophisticated are true then the implication is that they'll take yet more area per FLOP. NVidia may only be able to afford that kind of extravagance if the TMUs/ROPs go on an extreme diet. Though the wastefulness is only in the region of 30-50% I reckon, so there isn't a monster saving to be made there.

AMD supposedly increased performance by 70% per unit area for the TUs in RV770 compared with RV670. A good chunk of that was presumably related to reverting to 8-bit per unit capability instead of the 16-bit capability of RV670 - so that performance increment only applies to 8-bit textures. NVidia doesn't seem to have that kind of option.

If NVidia was ultra-radical and deleted the blending/Z-testing part of the ROPs in favour of using the ALUs, would that save enough area? Can some of the texturing calculations be done in the ALUs, too, e.g. LOD/bias and filtering?
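
For a sense of what filtering in the ALUs would amount to, here's the per-pixel arithmetic of a single bilinear fetch written out as plain shader-style math (an illustration of the workload only, not a claim about how any GPU implements it):

```python
# The arithmetic a single bilinear texture fetch boils down to, if it were done
# with plain shader ALU ops: address calculation, 4 loads, 3 lerps per channel.
# The tiny texture here is invented purely for illustration.
def lerp(a, b, t):
    return a + (b - a) * t

def bilinear(texture, u, v):
    """texture: 2D list of scalars; u, v: normalised coords in [0, 1]."""
    h, w = len(texture), len(texture[0])
    x = u * (w - 1)                     # texel-space coordinates
    y = v * (h - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0             # fractional weights
    top = lerp(texture[y0][x0], texture[y0][x1], fx)
    bottom = lerp(texture[y1][x0], texture[y1][x1], fx)
    return lerp(top, bottom, fy)        # 3 lerps, roughly 6 ALU ops per channel

tex = [[0.0, 1.0],
       [2.0, 3.0]]
print(bilinear(tex, 0.5, 0.5))          # 1.5, the average of the four texels
```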

So is it possible for NVidia to be radical enough to leave the ROPs to handle the messy parts of interacting with memory (coalescing/caching, (de)compression) and the TMUs the messy part of memory (caching/decompression)?

Anyone want to hazard a guess for the real, shader-only ALU-borne GFLOPs in games at their current frame rates? 500GFLOPs worst case? If NVidia put 3TFLOPs into a new GPU, would that be enough to double current frame rates and take care of texturing and render back end workloads?

Jawed
 