This page shows performance in comparison with HD4670, though I don't know if it was tested with the latest driver:
http://www.computerbase.de/artikel/...adeon_hd_4770/19/#abschnitt_performancerating
Excluding 8xAA, that averages to 180% performance. Worst case is 161% in Crysis Warhead. Best case is 200%+ in some games, making me suspect other things.

Yeah, that's actually quite surprising, isn't it? Looks like those chips aren't actually that texture limited after all. Even more surprising considering the 4670 should be more efficient in some situations (due to the smaller shader array length).
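A quick sketch of the averaging, with made-up per-game ratios (only the 161% Crysis Warhead worst case is from the actual results; the rest are placeholders, and the real numbers are on the computerbase page):

```python
# Rough sketch of averaging HD4770 performance relative to HD4670,
# excluding the 8xAA results. Ratios are placeholders, not real data,
# except the 1.61 Crysis Warhead worst case mentioned above.
results = {
    ("Crysis Warhead", "4xAA"): 1.61,
    ("Game A",         "4xAA"): 1.80,   # placeholder
    ("Game B",         "noAA"): 2.05,   # placeholder "200%+" case
    ("Game C",         "8xAA"): 2.40,   # placeholder, excluded below
}

filtered = [ratio for (game, aa), ratio in results.items() if aa != "8xAA"]
avg = sum(filtered) / len(filtered)
print(f"Average relative performance (excl. 8xAA): {avg:.0%}")
```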
If those specs are real, very boring. 50% more shaders and 20% more texture units, plus 100% more ROPs. The real performance increase would seem to be less than 50% as well, as they seem to be trending towards being more texture limited again (although hopefully this time they actually know what they're doing).
Doesn't seem as ambitious as the supposed GT300 specs. Way too early to speculate, though.
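Putting those rumoured ratios side by side (taking the counts at face value and assuming unchanged clocks, which is itself an assumption):

```python
# Per-unit throughput scaling implied by the rumoured specs, assuming
# unchanged clocks. Any workload sitting on the least-scaled unit
# (texturing here) can't gain more than that unit's factor.
scaling = {
    "shaders": 1.5,   # 50% more shaders
    "texture": 1.2,   # 20% more texture units
    "ROPs":    2.0,   # 100% more ROPs
}

for unit, factor in scaling.items():
    print(f"{unit:8s}: x{factor:.1f}")

print(f"upper bound if purely texture-limited: x{min(scaling.values()):.1f}")
```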
I wouldn't expect the RV870 to simply be a "bigger RV770 that does the minimum required to be DX11 compliant."
It seems to me that bandwidth for "HD5870" could be quite a problem, e.g. 50% more bandwidth than HD4890 seems likely to be the limit. Yet, RV730->RV740 appears to show that bandwidth, across all these games, is the bottleneck only 50% of the time.
Jawed
So while 24 RBEs would be faster, it seems that more would work.
Any thoughts on what AMD needs to do beyond the minimum?
Jawed
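To put a rough number on the bandwidth point above: if bandwidth really is the limiter for about half the time (the RV730->RV740 observation), then extra bandwidth on its own buys a lot less than its headline figure. A toy estimate; both scaling factors below are just assumptions:

```python
# Toy Amdahl-style estimate: assume ~50% of frame time is bandwidth-bound
# (per the RV730->RV740 comparison above) and the rest scales with the
# shader/texture/ROP side. Both scaling factors are assumptions.
def speedup(bw_fraction, bw_scale, core_scale):
    # Bandwidth-bound time shrinks by bw_scale, the rest by core_scale.
    new_time = bw_fraction / bw_scale + (1 - bw_fraction) / core_scale
    return 1.0 / new_time

# e.g. +50% bandwidth, +100% "core" throughput
print(f"overall speedup: x{speedup(0.5, 1.5, 2.0):.2f}")   # ~x1.71
```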
Apart from the driver mess at computerbase, how much filtering do recent Catalysts allow the chip to do? Clearly, it isn't the full amount, since the Radeons are more prone to shimmering on high-frequency textures than GeForce 8+ parts are (contrary to the abysmal situation of GF7 vs X1k, where Nvidia did their utmost to save on filtering cycles).
I just don't expect all this time to have passed and for there not to be other improvements. Not "major features" but some architectural bits and bobs and tuning and re-thinking ratios to squeeze greater perf per watt, mm2, and transistor than before.

For what it's worth I have suspicions that LDS needs an overhaul, as I think the bandwidth simply isn't there. The R700 ISA document seems to say that only 4 threads (work-items) can read from LDS per clock.
This seems to make sense.
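Assuming Jawed's reading of the ISA document is right, the back-of-envelope looks something like this (64-wide wavefronts on a 16-lane SIMD; the 4-readers-per-clock figure is the assumption under test):

```python
# Back-of-envelope LDS read rate, assuming the R700 ISA doc really does
# mean only 4 work-items can read from LDS per clock.
wavefront_size  = 64   # work-items per wavefront
simd_lanes      = 16   # lanes the ALUs service per clock
lds_readers_clk = 4    # work-items reading LDS per clock (the assumption)

lds_clocks_per_wave = wavefront_size / lds_readers_clk    # 16 clocks
alu_clocks_per_wave = wavefront_size / simd_lanes         # 4 clocks
print(f"LDS read pass: {lds_clocks_per_wave:.0f} clocks per wavefront")
print(f"ALU issue:     {alu_clocks_per_wave:.0f} clocks per wavefront")
print(f"LDS reads lag ALU issue by {lds_clocks_per_wave / alu_clocks_per_wave:.0f}x")
```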
Though I've now gained a better understanding of the way that the register file can be configured to allow sharing of data between all threads whose wavefront-relative address is the same - i.e. wavefront A thread 12 (I hate calling them threads, they're really strands or work-items) can share multiple registers with B12, C12, D12 etc. So that's a hell of a lot more bandwidth than LDS can muster (indeed it seems to be full register bandwidth).

My impression is that this is so because this form of sharing is a standard register access, with the per-thread offset removed. The register file doesn't see the difference.
I wasn't complaining at all! I thought it was an excellent test!
Jawed
Four threads can write per cycle, but up to 16 values can be written.

In absolute bandwidth terms, per thread, this isn't bad - 16 scalars per clock effectively. The trouble seems to be that the minimum latency is effectively 16 cycles.
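Spelling out that write-side arithmetic (the even 4-values-per-writer split is my assumption; only the totals come from the post above):

```python
# Write-side arithmetic as described above: 4 threads write per cycle,
# up to 16 values total per cycle, yet a full 64-wide wavefront still
# needs 16 cycles to drain. The per-writer split is an assumption.
wavefront_size    = 64
writers_per_clock = 4
values_per_writer = 4    # 16 values / 4 writers (assumed even split)

peak_scalars_per_clock = writers_per_clock * values_per_writer   # 16
min_cycles_per_wave    = wavefront_size / writers_per_clock      # 16
print(f"peak write rate: {peak_scalars_per_clock} scalars/clock")
print(f"minimum latency: {min_cycles_per_wave:.0f} cycles per wavefront")
```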
My impression is that this is so because this form of sharing is a standard register access, with the per-thread offset removed. The register file doesn't see the difference.

Yeah, that's it exactly - private register addressing has an implicit wavefront-number*stride which isn't being used in this scenario.
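A purely conceptual toy of what that implicit stride means; the names, the stride value and the layout are invented for illustration, nothing here comes from the ISA document:

```python
# Conceptual model only, not actual R700 hardware or ISA: private
# addressing folds in a per-wavefront stride; the shared-lane form
# simply drops that term, so lane N of every wavefront hits the same
# physical registers.
def private_reg_addr(wavefront, lane, reg, stride_per_wavefront=8):
    # Each wavefront gets its own slice of the register file.
    return (wavefront * stride_per_wavefront + reg, lane)

def shared_lane_reg_addr(wavefront, lane, reg):
    # No wavefront term: A12, B12, C12, D12 all resolve to the same slot.
    return (reg, lane)

# Wavefronts 0..3 ("A".."D"), lane 12, register 5:
for wf in range(4):
    print(wf, private_reg_addr(wf, 12, 5), shared_lane_reg_addr(wf, 12, 5))
```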
Hm, okay, sorry. Then I was just interpreting this wrong: "So, ahem, congratulations to PCGH for coming up with such bandwidth-/texturing-limited tests! And woe-betide HD5870 if it has <50% more bandwidth than HD4890..."

Yeah, well, it just makes me even more worried about the usefulness of most reviews on the web.
It's notable that extra, un-hidable latency can occur here when odd and even wavefronts are doing different things (i.e. one is reading while the other is writing).

This appears consistent with a simple VLIW design.
I have to admit I'm intrigued by the combination of wavefront-shared-lane registers and LDS; it sounds pretty potent. It's kind of incredible that there seem to be no detailed discussions of this kind of stuff with working examples coming out of AMD.

Maybe they haven't settled on a final scheme for data sharing?
Your review paints a picture of RV870 being doomed, in my view. AMD needs a huge uplift in bandwidth to make a splash in comparison with HD4890, and it seems pretty unlikely that GDDR5 is maturing fast enough.