AMD: R9xx Speculation

With higher-specced RAM chips you'll profit from price adjustments over time. With a 384-bit wide memory interface you are stuck with a) more chips being necessary (768 or 1,536 MiB) and b) a more complex and probably more-layered PCB.
 
Expecting a die size around 380-400 mm², it would be either the smallest 384-bit GPU or the largest 256-bit GPU ever made :)
 
Or they didn't clock the memory higher and just increased the bus width. The 6.4 GHz GDDR5 GPU-Z shots seem to be fake. ;)
Maybe it's cheaper to buy mass-production 5 GHz chips in the end (the PCB and GPU are also less complex if they don't need to run at such a high frequency, and ATI has already mastered 5 GHz). That's still 240 GB/s at just 5 GHz if it's true.
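To sanity-check that 240 GB/s figure: peak GDDR5 bandwidth is just the effective data rate times the bus width in bytes. A quick back-of-envelope sketch (plain Python, using the rumored numbers from above):

```python
def gddr5_bandwidth_gbps(effective_clock_ghz, bus_width_bits):
    """Peak memory bandwidth in GB/s: effective data rate (GT/s)
    multiplied by the bus width in bytes. The 'GHz' figures quoted
    for GDDR5 are the effective (quad-pumped) data rate."""
    return effective_clock_ghz * bus_width_bits / 8

# Rumored 384-bit bus at 5 GHz effective:
print(gddr5_bandwidth_gbps(5.0, 384))  # 240.0 GB/s
# Cypress (HD 5870) for comparison, 256-bit at 4.8 GHz effective:
print(gddr5_bandwidth_gbps(4.8, 256))  # 153.6 GB/s
```

So the rumored configuration would be roughly a 56% bandwidth jump over Cypress even without faster chips.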

IMHO, higher-clocked (by ~33%) memory will need less die perimeter than a wider bus (by 50%).
 
RV770 was a massive jump over RV670, on the same process node. Granted, RV770 was 33% bigger and I don't expect RV970 to be quite that much bigger than RV870, but it is not impossible.

Considering RV670 -> RV770 represented an ~100% increase in performance at a ~33% larger die, then Evergreen to NI (or SI? bring back numbers, dangit :() could potentially be ~50% faster based on the rumored ~15% larger die. That's going purely by die size though, so don't put too much weight on it. :D

Also, while we won't see a massive increase in ALU count, rumors do point to some significant changes being made to ATI's superscalar structure, 5D -> 4D for example.

Personally, I'm thinking a little less than 50% increase in performance on average. And potentially much larger than that in Eyefinity mode if a 2 GB model is released.

Regards,
SB
 
From chiphell
20100901043701_0000.jpg


I smell fake, but who knows.
 
I'd agree the fastest model will probably come with 2 GB standard. I'm not so sure about the slower one; I could imagine both 1 GB and 2 GB being standard versions there.
I think that'll depend on the availability/pricing of these chips. If a 2 Gbit chip doesn't cost more than two 1 Gbit chips, sure, that's more cost effective. OTOH I don't think using twice the chips increases costs a lot; clamshell mode makes this easy and shouldn't complicate the PCB too much.

Yeah, you're right; they might offer the slower one in 1 GB and 2 GB variants, though the faster one will probably only come with 2 GB.

Usually higher-density chips cost less, and AFAIK the 2 Gbit chips are made on a smaller process tech as well, so they should be cheaper. Clamshell with 8 chips versus 4 chips still requires more PCB real estate and memory traces/VRMs.
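The chip-count trade-off being discussed here can be put in numbers: each GDDR5 chip has a 32-bit interface, and clamshell mode halves that to 16 bits so two chips can share one channel. A rough sketch (hypothetical helper functions, just to make the arithmetic explicit):

```python
def chips_needed(bus_width_bits, clamshell=False):
    """Chips on the board: bus width divided by per-chip I/O width
    (GDDR5 chips have a 32-bit interface; clamshell halves it to 16)."""
    return bus_width_bits // (16 if clamshell else 32)

def chip_density_gbit(bus_width_bits, capacity_gbit, clamshell=False):
    """Density each chip must have to reach the target total capacity."""
    return capacity_gbit / chips_needed(bus_width_bits, clamshell)

# 2 GB (16 Gbit) on a 256-bit bus with 1 Gbit chips -> needs clamshell:
print(chips_needed(256, clamshell=True))           # 16 chips
print(chip_density_gbit(256, 16, clamshell=True))  # 1.0 Gbit per chip
# Same capacity with 2 Gbit chips needs no clamshell:
print(chips_needed(256))                           # 8 chips
print(chip_density_gbit(256, 16))                  # 2.0 Gbit per chip
```

Which is exactly the 8-versus-4-chips (per side) routing question above: the 2 Gbit parts halve the chip count for the same capacity.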

So, is this next release going to be more like HD 2900 -> HD 3870 than HD 3870 -> HD 4870? Except without the reduced-power part, since we don't have a magical new process node.

My bet is it's going to be an X1800 -> X1900 style transition (performance- and die-size-wise; architecturally I have no idea).

I think you can chalk that up to the immature GDDR5 technology in RV770; the HD 4850 had significant improvements over the HD 3870 with only a slight increase in TDP.

But the HD 4850 had a significantly slower clock speed than the 3870. Remember the 4850 also ditched the ring-bus tech, which AFAIK was very power inefficient.

A 384-bit memory bus along with the rumored ~33% higher-clocked memory? Cayman must be quite big then.

I really doubt that memory speed. I don't think it'll exceed 5.6-5.8 GHz, and if they have gone 384-bit I don't think they even need that high a speed; 4.8 GHz should suffice.

With higher-specced RAM chips you'll profit from price adjustments over time. With a 384-bit wide memory interface you are stuck with a) more chips being necessary (768 or 1,536 MiB) and b) a more complex and probably more-layered PCB.

I think they need an increase from 1 GB though; for the new gen I'd expect 1.5 or 2 GB of memory.

Expecting a die size around 380-400 mm², it would be either the smallest 384-bit GPU or the largest 256-bit GPU ever made :)

Don't forget we had R600 with a 512-bit bus at a die size of 420 mm² :D
 
Don't forget we had R600 with a 512-bit bus at a die size of 420 mm² :D
And it performed the same with only 256 bits enabled...

Remember the 4850 also ditched the ring-bus tech, which AFAIK was very power inefficient
Really? I've never noticed... The HD 3850 consumed 10 W less than the similarly sized 7900 GTX, 10 W more than the similarly sized 9600 GT, and 5 W less than the X1950 PRO. All these ~200 mm² GPUs consumed almost the same amount of power.
 
Don't forget we had R600 with a 512-bit bus at a die size of 420 mm² :D
R600's logic core (ALUs, TMUs, ROPs, etc.) wasn't that big in size -- the stacked padding at the perimeter for the 512-bit interface and the fat ring bus occupied a rather large die area.
 
Or you could speculate, like we did months ago, that Barts is where it's at for 2010 and Cayman doesn't show up on 28nm until next year.

So let's say Cayman isn't ready for production yet, who would believe Barts is delivering the numbers discussed here?

I don't know, but let's just assume Barts really consists of 1280 shaders organized in a new, more efficient "1+3"/"2+2" arrangement.

Let's furthermore assume that the "old" 5D arrangement as used in Cypress provided shader utilization averaging somewhere around 50% of its full/peak potential in games (which seems rather reasonable comparing peak FLOPS performance to real-world gaming benchmarks between Cypress' 5D and Fermi's 1D arrangement), then you get:

1600 (total shader count) * 0.5 (real-world utilization) = 800 (average used shader count)

In order for Barts to achieve an average shader performance about 30% better than Cypress (800 * 1.3 = 1040), the new shader arrangement would have to provide an average shader utilization of about 80% (1040/1280 = 0.8125).

Provided that the rest of the new architecture could meet that speculative increase in shader-efficiency, a 1280SP Barts could very well be 30% faster than Cypress in most games.
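The back-of-envelope math above, spelled out as a script (every number is the speculative figure from this post, not a measured one):

```python
# Speculative inputs from the post above.
cypress_shaders = 1600
assumed_utilization = 0.5            # guessed average 5D utilization in games
effective_cypress = cypress_shaders * assumed_utilization  # 800 "useful" SPs

barts_shaders = 1280
target_speedup = 1.3                 # +30% over Cypress
needed_effective = effective_cypress * target_speedup      # 1040

# Utilization the new arrangement would have to deliver:
required_utilization = needed_effective / barts_shaders
print(required_utilization)          # 0.8125, i.e. ~80%
```

In other words, the "1+3"/"2+2" arrangement would need to lift average utilization from an assumed ~50% to ~81% for a 1280SP part to beat a 1600SP one by 30%.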

A 1920SP Cayman @ 28nm (allowing for higher clocks?) would probably put it some 60% over Barts when (speculatively) released in Summer 2011.

So the real question is: Assuming that the current codenames still refer to the "originally planned" architecture @ 32nm, couldn't it be possible that AMD just decided to "bloat" the 32nm high-midrange Barts into a 40nm chip with some aggressive clock tweaks (hence the need for a high-performance 10-layer PCB) - and make it the "new", "half-step" 68xx series?

And couldn't it be possible that the original 32nm high-end Cayman (being too big to be "bloated" to the 40nm process) was accordingly scheduled for a 28nm shrink, making it 2011's new, "full step" 7870 with a 60% performance gain over 40nm Barts?
 
I'm quite positive that all of the "SI-NI-something" series will be 40nm, excluding possibly the lowest end which might be used as "testdrive chips" for 28nm.

I think you're also underestimating the shader utilization of Cypress (and other 4+1D Radeons); at least I remember hearing about 70-80%+ utilization in games.
 
384-bit? AMD would go down that route only if they planned a huge increase in performance over Cypress. If the increase is just 30%, there is no real point in 240 GB/s of bandwidth.
But, well, in that case we would have 480 stream cores (4D), 120 TMUs, a 384-bit MC and probably 48 ROPs, arranged in 3 SIMDs... or maybe even "tri-core" (lol, since this time we haven't heard anything on multicore rumors, it may be the right time :D).
But that's definitely quite a leap forward, and would require a huge die at 40nm, around 480 mm². Charlie said it's around 380 mm², and nApoleon said it's still under 200 W of TDP. So 384-bit must be a fake rumor.
IMHO, considering the philosophy AMD/ATI engineers use to build their chips, I wouldn't be surprised to see, as was said a few pages back, a big increase in the TMU:ALU ratio.
 
I'm quite positive that all of the "SI-NI-something" series will be 40nm, excluding possibly the lowest end which might be used as "testdrive chips" for 28nm.

I think you're also underestimating the shader utilization of Cypress (and other 4+1D Radeons); at least I remember hearing about 70-80%+ utilization in games.

Thanks for your feedback!

I'll admit that I'm just an "interested layman" in terms of GPUs - so my "speculation" is set on a rather low level of actual technical insight.

Nevertheless, not being an expert often helps with thinking "outside the box" - as you arguably can't think in terms of boxes you don't really know ;)

I just read that Cypress' peak shader performance is roughly two times that of Fermi's (GF100) - nevertheless, GF100 seems to (at least narrowly) beat RV870 in most shader-heavy benchmarks. I'm not the right guy to factor in the impact of some of the "surrounding" architectural differences in that respect, but I'd assume that most of that discrepancy in theoretical peak performance vs. actual gaming performance is due to the difference in shader utilization when comparing Cypress' "1+4D" vs. Fermi's "1D" arrangement?
 
Cayman could still maintain a 32-ROP configuration whilst being a 384-bit device. We have already seen that AMD's architecture doesn't explicitly hardwire ROP partitions to memory channels. If AMD decides to double the depth-buffer sampling rate, the extra bandwidth would come in handy; moreover, HPC applications would benefit from the larger installable memory base (due to the wider bus) and the BW, of course, given AMD's intention to be more competitive with NV's Tesla line in this market (are they?).
 
I just read that Cypress' peak shader performance is roughly two times that of Fermi's (GF100) - nevertheless, GF100 seems to (at least narrowly) beat RV870 in most shader-heavy benchmarks. I'm not the right guy to factor in the impact of some of the "surrounding" architectural differences in that respect, but I'd assume that most of that discrepancy in theoretical peak performance vs. actual gaming performance is due to the difference in shader utilization when comparing Cypress' "1+4D" vs. Fermi's "1D" arrangement?
No, that's highly unlikely. Now there's no doubt that the 4+1 shader arrangement does cost some performance in practice (compared to peak throughput), but it typically shouldn't be half. This can be clearly seen in some pixel-shader-oriented tests some sites are running (most prominent is the Vantage Perlin noise feature test, but for instance ixbt runs a few more with RightMark).
There's just a whole bunch of differences that let Fermi reach higher performance relative to peak capability (but not per die area). Among others: higher internal cache bandwidth, 32 vs 64 wavefront size, higher tri throughput (certainly much higher with tessellation if the app uses it, and also better small-tri handling in general it seems), higher Z fillrate (though color fillrate OTOH is a joke compared to Evergreen). It also seems to be more efficient in bandwidth utilization, though the reasons for that are unclear (better buffer compression / hierarchical Z / caches?).
 
Cayman could still maintain a 32-ROP configuration whilst being a 384-bit device. We have already seen that AMD's architecture doesn't explicitly hardwire ROP partitions to memory channels.
I believe this was possible only on R5xx and R6xx (maybe it was related to the ring bus). There wasn't any R7xx or R8xx part which used such a combination; I think ROPs have been hardwired to the MCs since R7xx.
 