AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to the upcoming ATI RV870 lineup of GPUs?

  • Within 1 or 2 weeks: 1 vote (0.6%)
  • Within a month: 5 votes (3.2%)
  • Within a couple of months: 28 votes (18.1%)
  • Very late this year: 52 votes (33.5%)
  • Not until next year: 69 votes (44.5%)

Total voters: 155. Poll closed.
This page shows performance in comparison with HD4670, though I don't know if it was tested with the latest driver:

http://www.computerbase.de/artikel/...adeon_hd_4770/19/#abschnitt_performancerating

Excluding 8xAA, that averages out to 180% performance. Worst case is 161% in Crysis Warhead. Best case is 200%+ in some games, which makes me suspect other things are at play.
Yeah, that's actually quite surprising, isn't it? Looks like those chips aren't actually that texture limited after all. Even more surprising considering the 4670 should be more efficient in some situations (due to the shorter shader array).
So increasing the shader array from 16 to 20 probably wouldn't have that much of a performance impact. Still, I don't like the idea and would propose 16 clusters with 16 5D shaders and 64 texture units instead :). Are 32 ROPs necessary, though? I agree 16 wouldn't cut it, but what about, for instance, 24? (Everyone assumed they'd need to attach directly to memory channels, hence 16 or 32, but it seems that's not even necessary for the current generation...)
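A quick back-of-the-envelope check on those percentages (Python; the specs are the commonly quoted reference numbers for both cards, so treat the ratios as approximate):

Code:
# Rough theoretical throughput ratios, HD4770 (RV740) vs HD4670 (RV730).
# Reference specs from memory -- approximate, not authoritative.
specs = {
    #          (SPs, TMUs, ROPs, core MHz, bus bits, data rate Gbps)
    "HD4670": (320, 32, 8, 750, 128, 2.0),   # GDDR3 @ 1 GHz
    "HD4770": (640, 32, 16, 750, 128, 3.2),  # GDDR5 @ 800 MHz
}

def rates(sps, tmus, rops, mhz, bus, gbps):
    return (
        sps * 2 * mhz / 1000,  # GFLOPS (MAD = 2 flops/SP/clock)
        tmus * mhz / 1000,     # Gtexels/s
        rops * mhz / 1000,     # Gpixels/s
        bus / 8 * gbps,        # GB/s
    )

for name, lo, hi in zip(("ALU", "TEX", "ROP", "BW"),
                        rates(*specs["HD4670"]), rates(*specs["HD4770"])):
    print(f"{name}: {hi / lo:.0%}")
# -> ALU: 200%, TEX: 100%, ROP: 200%, BW: 160%
# The 161% worst case (Crysis Warhead) sits right on the bandwidth ratio,
# while the 200%+ best cases line up with ALU/ROP scaling.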
 
Grid may well be the "most ALU bound" of those games tested (according to:

http://forum.beyond3d.com/showpost.php?p=1220350&postcount=27

which is sadly missing Crysis), yet HD4770 is 68% faster - apparently ending up as mostly bandwidth bound. So GFLOPs are either not very relevant, or that is a sign of a divergence penalty in the ALUs. I don't have the shader code for Grid, so that's the end of that.

Also, in theory, GDDR5 results in a "bandwidth/latency penalty" in comparison with GDDR3 for the same bandwidth. Too early to tell if this has transpired in RV740 or whether any "evidence" for this theory is compelling...

Apparently (relying upon German->English translation) they're using the enthusiast settings for Crysis Warhead, getting 61% more performance worst case. So not ALU-bound there.

Apart from GFLOPs, only fillrate could produce the 180%+ results.

It seems to me that bandwidth for "HD5870" could be quite a problem, e.g. 150% more bandwidth than HD4890 seems likely to be the limit. Yet RV730->RV740 appears to show that, across all these games, bandwidth is the bottleneck only 50% of the time.

So while 24 RBEs would be faster, it seems that more would work. Countering this, any increase in TUs would offset an increase in RBEs, gobbling up some of that bandwidth margin. While I like to think in terms of a 2:1 RBE:TU bandwidth allocation, it's still hard to figure out what's going on in today's ATI GPUs.
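To make that 2:1 intuition concrete, here's a toy bandwidth-demand sketch (Python; the per-op byte costs and the clock are illustrative assumptions, nothing measured):

Code:
# Toy peak-demand model: the byte costs are illustrative assumptions,
# picked so an RBE (~8 B/clock for colour + Z) costs about twice a TU
# (~4 B/clock for one 32-bit texel, ignoring cache hits) -- the 2:1 weighting.
CLOCK_GHZ = 0.85           # hypothetical core clock
BYTES_PER_RBE = 8
BYTES_PER_TU = 4

for rbes, tus in ((16, 40), (24, 64), (32, 64)):
    rop_bw = rbes * BYTES_PER_RBE * CLOCK_GHZ
    tex_bw = tus * BYTES_PER_TU * CLOCK_GHZ
    print(f"{rbes} RBEs + {tus} TUs -> {rop_bw:.0f} + {tex_bw:.0f} "
          f"= {rop_bw + tex_bw:.0f} GB/s peak demand")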

Jawed
 
If those specs are real, very boring. 50% more shaders and 20% more texture units, plus 100% more ROPs. The real performance increase would seem to be less than 50% as well, since they seem to be trending towards being more texture limited again (although hopefully this time they actually know what they're doing).

Doesn't seem as ambitious as the supposed GT300 specs. Way too early to speculate, though.

Two things:

1. The "specs" floating around are insufficient to tell us the whole story. Just knowing the number of ALUs, TMUs, ROPs, and clock speeds leaves out a lot of things that affect performance greatly. Like, how deep/big the SIMD units are (is it 4 TMUs per SIMD unit, or 2 TMUs per SIMD and twice as many of them that are half as deep?). How big the local memory is. How big other caches are, what the scheduler is like, if the ROPs have any other improvements to them, etc. I wouldn't expect the RV870 to simply be a "bigger RV770 that does the minimum required to be DX11 compliant." I'd be surprised if there weren't other important tweaks, even if the general architecture isn't radically new.

2. If GT300 is indeed going to continue to be a big ol' monolithic ~500mm2 chip like GT200 was, and RV870 is going to follow ATI's new plan of making sub-300mm2 their target, then the specs for GT300 are simply going to be more "ambitious." Without knowing chip sizes, and even things like power consumption, comparing raw specs against each other is pretty meaningless.
 
It seems to me that bandwidth for "HD5870" could be quite a problem, e.g. 150% more bandwidth than HD4890 seems likely to be the limit. Yet RV730->RV740 appears to show that, across all these games, bandwidth is the bottleneck only 50% of the time.
Jawed

I wonder what the realistic GDDR5 clock speed is for the timeframe of the 5870. It would be interesting to see a 384-bit, 24 ROP GPU...

So while 24 RBEs would be faster, it seems that more would work.

... however, I agree here. The 4870 is quite ROP limited at times, so 24 ROPs would be the bare minimum if we're to have 240 VLIW units.
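Plugging some numbers into that 384-bit idea (Python; the data rates are only guesses at where GDDR5 might be in that timeframe):

Code:
# Bandwidth for candidate bus widths and GDDR5 data rates. The rates are
# guesses for the HD5870 timeframe; HD4890 shown for reference.
def gbs(bus_bits, gbps):
    return bus_bits / 8 * gbps

print(f"HD4890 reference (256-bit @ 3.9 Gbps): {gbs(256, 3.9):.1f} GB/s")
for bus in (256, 384):
    for rate in (4.0, 5.0, 6.0):
        print(f"{bus}-bit @ {rate} Gbps: {gbs(bus, rate):.0f} GB/s")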
 
You can take the perimeter of RV770, delete the CrossFireX Sideport and see how much extra is required to get to 384 bits...

Jawed
 
Would ATI even use a 384 bit bus? It seems to go against their 'sweetspot' strategy of reusing the same chip for multiple SKUs.
 
Any thoughts on what AMD needs to do beyond the minimum?

Jawed

It's not that they need to do more. I think an RV770 or 740 or whatever, with a new tessellation unit and maybe a bigger local cache in the SIMDs, might even be enough to be DX11 compliant.

I just don't expect all this time to have passed and for there not to be other improvements. Not "major features" but some architectural bits and bobs and tuning and re-thinking ratios to squeeze greater perf per watt, mm2, and transistor than before.
 
This page shows performance in comparison with HD4670, though I don't know if it was tested with the latest driver:

http://www.computerbase.de/artikel/...adeon_hd_4770/19/#abschnitt_performancerating

Excluding 8xAA, that averages out to 180% performance. Worst case is 161% in Crysis Warhead. Best case is 200%+ in some games, which makes me suspect other things are at play.

The ComputerBase test was done like this (no kidding, just look at the "Testsystem" page):

"Treiberversionen

* Nvidia GeForce 185.68 (GTX 275, 9800 GT)
* Nvidia GeForce 182.46 (GTX 260²)
* Nvidia GeForce 181.22 (GTX 295, GTX 285)
* Nvidia GeForce 180.48
* ATi Catalyst 8.60-090316a1 (HD 4770, HD 4830, HD 4850)
* ATi Catalyst 8.592.1 (HD 4890)
* ATi Catalyst 9.3 (HD 4870 1GB)
* ATi Catalyst 8.11"

Yeah, that's actually quite surprising, isn't it? Looks like those chips aren't actually that texture limited after all. Even more surprising considering the 4670 should be more efficient in some situations (due to the shorter shader array).
Apart from the driver mess at ComputerBase, how much filtering do recent Catalyst drivers allow the chip to do? Clearly, it isn't the full amount, since the Radeons are more prone to shimmering on high-frequency textures than GeForce 8+ parts are (the reverse of the abysmal GF7 vs. X1K situation, where Nvidia did their utmost to save on filtering cycles).
 
http://www.pcgameshardware.com/aid,...-4770-vs-HD-4850-und-Geforce-9800-GT/Reviews/

Fewer data points, but with an honest driver configuration. The real shocker here is that HD3870 is usually slower than HD4670, despite having 225% of the bandwidth :oops:
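That ratio checks out against the reference memory clocks (a quick Python sanity check; board variants ship with different clocks, mind):

Code:
# HD3870: 256-bit GDDR4 @ 2.25 Gbps; HD4670: 128-bit GDDR3 @ 2.0 Gbps
# (reference clocks; board variants differ).
hd3870 = 256 / 8 * 2.25   # 72 GB/s
hd4670 = 128 / 8 * 2.0    # 32 GB/s
print(f"{hd3870 / hd4670:.0%}")   # -> 225%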

Overall this paints a very different picture, with HD4770 showing a much smaller benefit over HD4670 than the broken ComputerBase numbers suggest.

These tests appear to be fundamentally bandwidth-/texture-limited, as none of the "averages" ever exceeds 160% when comparing HD4770 against HD4670. I'm ignoring Stalker:CS at 1680, as the framerates are in the 1-3 region.

CoD:WaW's minimum framerates are perhaps also too low to be meaningful, but this is the only place where HD4770 exceeds 160% of HD4670: 179%, 182% and 150% at 1280, 1680 and 1920 respectively.

So, ahem, congratulations to PCGH for coming up with such bandwidth-/texturing-limited tests! And woe betide HD5870 if it has <50% more bandwidth than HD4890...

Jawed
 
I just don't expect all this time to have passed and for there not to be other improvements. Not "major features" but some architectural bits and bobs and tuning and re-thinking ratios to squeeze greater perf per watt, mm2, and transistor than before.
For what it's worth I have suspicions that LDS needs an overhaul, as I think the bandwidth simply isn't there. The R700 ISA document seems to say that only 4 threads (work-items) can read from LDS per clock.

Though I've now gained a better understanding of the way that the register file can be configured to allow sharing of data between all threads whose wavefront-relative address is the same - i.e. wavefront A thread 12 (I hate calling them threads, they're really strands or work-items) can share multiple registers with B12, C12, D12 etc. So that's a hell of a lot more bandwidth than LDS can muster (indeed it seems to be full register bandwidth).

Also I can't find anything concrete on the atomic operations that D3D11 requires. There are atomic concepts in R700, and on-die memory is persistent across kernel invocations, but overall it seems skeletal :???:

Jawed
 
Jawed,

What's your point? That PCGH tests with the settings this kind of card is made for, i.e. 4xAA/16:1 AF? Yes, of course that puts load on texturing and bandwidth. But isn't that exactly what sells graphics cards to gamers - eye candy?
 
For what it's worth I have suspicions that LDS needs an overhaul, as I think the bandwidth simply isn't there. The R700 ISA document seems to say that only 4 threads (work-items) can read from LDS per clock.
This seems to make sense.
The LDS write scheme permits 16, using a scheme that purposefully limits writes to fixed indices so that conflicts are avoided.
Reads roam over more than a fixed subset of the space.
16 writes to statically allocated, strided locations isn't as bad as 16 reads from any index in the space would have been.

The latter would require a heftier crossbar to get done in a single pass.


edit: scratch this
Four threads can write per cycle, but up to 16 values can be written.

Though I've now gained a better understanding of the way that the register file can be configured to allow sharing of data between all threads whose wavefront-relative address is the same - i.e. wavefront A thread 12 (I hate calling them threads, they're really strands or work-items) can share multiple registers with B12, C12, D12 etc. So that's a hell of a lot more bandwidth than LDS can muster (indeed it seems to be full register bandwidth).
My impression is that this is so because this form of sharing is a standard register access, with the per-thread offset removed. The register file doesn't see the difference.
 
:???: I wasn't complaining at all! I thought it was an excellent test!

Jawed

Hm, okay, sorry. Then I was just interpreting this wrongly: "So, ahem, congratulations to PCGH for coming up with such bandwidth-/texturing-limited tests! And woe betide HD5870 if it has <50% more bandwidth than HD4890...".
 
Four threads can write per cycle, but up to 16 values can be written.
In absolute bandwidth terms, per thread, this isn't bad - 16 scalars per clock effectively. The trouble seems to be that the minimum latency is effectively 16 cycles.

With 1024 threads in a thread group there are 16 wavefronts of 64, so 16 cycles per wavefront = 256 cycles to sweep across all wavefronts, meaning that the code requires >=4:1 ALU:LDS to hide that latency. Double that if a paired write-then-read is performed, which would seem to be the norm.
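Putting that arithmetic in one place (a sketch of the model I'm working from; the constants are my reading of the ISA doc, not measurements):

Code:
# Back-of-the-envelope latency-hiding model for the R7xx LDS, based on my
# reading of the ISA doc -- the constants are assumptions, not measurements.
WAVEFRONT = 64
SIMD_WIDTH = 16
LDS_READS_PER_CLOCK = 4    # work-items serviced per clock

alu_cycles = WAVEFRONT // SIMD_WIDTH           # 4 cycles per ALU instruction
lds_cycles = WAVEFRONT // LDS_READS_PER_CLOCK  # 16 cycles per LDS read

waves = 1024 // WAVEFRONT                      # 16 wavefronts per thread group
sweep = waves * lds_cycles                     # 256 cycles for a full sweep
ratio = lds_cycles // alu_cycles               # ALU:LDS needed to hide it

print(f"{waves} wavefronts, {sweep}-cycle sweep, need >= {ratio}:1 ALU:LDS")
# Double the ratio for the usual write-then-read pairing.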

I suppose the payback is that after the LDS operation has completed there are no further latencies incurred on fetches, unlike Nvidia's shared memory, where any repeated fetches go to shared memory - but is that common? Dunno.

I think the other thing that helps with LDS is that ALU cycles are not consumed by initialisation, e.g. loading a tile of data from video memory, which consists of two phases: an import from memory into registers, which is effectively a TEX operation, then a copy from registers to LDS, which is effectively another TEX operation. Same goes for tile export, if one is required.

My impression is that this is so because this form of sharing is a standard register access, with the per-thread offset removed. The register file doesn't see the difference.
Yeah, that's it exactly - private register addressing has an implicit wavefront-number*stride which isn't being used in this scenario.
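In pseudo-code terms, something like this (my guess at the mechanism, not anything from an AMD document):

Code:
# My mental model of the addressing -- a guess, not from any AMD doc.
# A private access adds a per-wavefront stride; the shared form drops it,
# so the register file services both identically. Constants are illustrative.
REGS_PER_WAVE = 16   # hypothetical per-wavefront register allocation

def reg_address(wave_id, lane, reg, shared=False):
    row = reg if shared else wave_id * REGS_PER_WAVE + reg
    return (row, lane)   # location within the register file

# Thread 12 of wavefronts A(0), B(1), C(2) hits the same row for a
# shared register, i.e. A12, B12 and C12 alias:
for wave_id in range(3):
    print(reg_address(wave_id, lane=12, reg=3, shared=True))
# -> (3, 12) three times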

It's notable that extra, un-hidable latency can occur here when odd and even wavefronts are doing different things (i.e. one's reading while the other's writing).

I have to admit I'm intrigued by the combination of wavefront-shared-lane registers and LDS; it sounds pretty potent. It's kind of incredible that there seem to be no detailed discussions of this kind of stuff, with working examples, coming out of AMD.

Jawed
 
Hm, okay, sorry. Then I was just interpreting this wrongly: "So, ahem, congratulations to PCGH for coming up with such bandwidth-/texturing-limited tests! And woe betide HD5870 if it has <50% more bandwidth than HD4890...".
Yeah, well it just makes me even more worried about the usefulness of most reviews on the web.

For instance, is the fact that HD4670 is faster than HD3870 down to driver maturity in the ~6+ months since the launch of HD4670, or testing technique, or game selection? Or is it just highlighting performance that's always been like this?

Clearly this kind of question really spoils the party when trying to work out what would be a desirable specification for RV870.

Your review paints a picture of RV870 being doomed, in my view. AMD needs a huge uplift in bandwidth to make a splash in comparison with HD4890, and it seems pretty unlikely that GDDR5 is maturing fast enough.

I wonder if AMD's own internal game tests are as "harsh/realistic" as yours?

Jawed
 
It's notable that extra, un-hidable latency can occur here when odd and even wavefronts are doing different things (i.e. one's reading while the other's writing).
This appears consistent with a simple VLIW design.
It's better than how some of the original VLIWs would simply read and write, heedless of hazards, and it's a simple check for the scheduler to pick up a potential conflict and inject a NOP into a wavefront's instruction stream, rather than trying to pick through an instruction packet's read and write operands.

If there were forwarding within the cluster, this latency could be avoided, but that's 16 5-way bypass networks per SIMD and a tag check per ALU.
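Schematically, the check could be as simple as this (my own illustration of the idea, not AMD's actual logic):

Code:
# Scheduler-level hazard check: if the partner wavefront's shared write
# hasn't retired, issue a NOP instead of decoding the packet's operands.
def issue(reads_shared, partner_write_pending):
    if reads_shared and partner_write_pending:
        return "NOP"      # the un-hidable bubble noted above
    return "ISSUE"

print(issue(True, True))    # -> NOP
print(issue(True, False))   # -> ISSUE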

I have to admit I'm intrigued by the combination of wavefront-shared-lane registers and LDS; it sounds pretty potent. It's kind of incredible that there seem to be no detailed discussions of this kind of stuff, with working examples, coming out of AMD.
Maybe they haven't settled on a final scheme for data sharing?
Global registers are an incremental addition to what is already there.
The LDS is an addition, but with minimal disruption to the already existing design.

Maybe AMD doesn't want to commit too much to a low-level detail that they might be revamping.
 
Your review paints a picture of RV870 being doomed, in my view. AMD needs a huge uplift in bandwidth to make a splash in comparison with HD4890, and it seems pretty unlikely that GDDR5 is maturing fast enough.

Well, there's always the possibility that it's not the theoretical bandwidth available that's driving this thing, but the architecture's use of that bandwidth. There could be other changes that impact effective bandwidth more than the theoretical numbers would suggest. Or maybe it's not bandwidth at all, and these things are so complicated that this sort of simplified analysis won't ever uncover what's really going on...
 