AMD: R8xx Speculation

ShaidarHaran · Oct 22, 2008

Kaotik said:
Yep, excluding the "SuperAA" mode where both cores render the same frames completely

Which makes for some awesome eye candy, btw. 14x AA is purty

One of the few benefits of CF when I was running it. Early adoption sucks for the most part. X1900 XTX CF sure looked good, but man was it buggy.

Shtal · Oct 25, 2008

Here's the breakdown: the GPU is going to be a 40nm part. It's going to have 1/4 (or +%25) more shaders compared to the HD 4870. The theoretical computational horsepower is going up to 1.5 TFLOPS, which is pretty insane (the HD 4870 has about 1 TFLOP).

The GPU will be very tiny. The die is supposedly 205 mm². Compare this to the RV770 being 256 mm², and the GTX 280 being a big 576 mm².

ATI looks like it'll stick with GDDR5 memory, tied to a 512 bit memory interface. The source pins the memory bandwidth at 150-160 GB/s, which seems completely reasonable.

And here is where things get really interesting. The HD 5870 supposedly uses some sort of cooling that ATI hasn't tried before.

And thanks in part to the small size of the GPU, the HD 5870 X2 is not going to have two separate GPUs -- instead it will be two RV770 cores stacked on top of each other, sort of like Pentium-D style.

hmmm....

-----------------------------------
http://www.neoseeker.com/news/9078-ati-hd5870-rumors-1-5-tflops-40nm-1000-shaders-and-multi-core-/

Arun · Oct 25, 2008

That article implies this new GPU would have the same clock speeds as the HD 4870 (since 1.2*1.25 = 1.5); therefore, it is clueless speculation. Maybe the rest is somehow correct, but I'm more than a little bit skeptical.
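For reference, the arithmetic behind that objection can be sketched with the usual peak-throughput formula (ALU lanes x 2 FLOPs per clock for a multiply-add x clock). The 800 lanes and 750 MHz for HD 4870 are the commonly quoted figures and are assumptions here, as is the function name.

```python
def peak_gflops(alu_lanes, clock_mhz, flops_per_lane_per_clock=2):
    """Peak single-precision GFLOPS, counting a multiply-add as 2 FLOPs."""
    return alu_lanes * flops_per_lane_per_clock * clock_mhz / 1000.0

# Assumed HD 4870 figures: 800 ALU lanes at 750 MHz.
hd4870 = peak_gflops(800, 750)

# The rumour: 25% more shaders. Holding the clock at 750 MHz gives
# exactly the quoted 1.5 TFLOPS, i.e. no clock increase at all.
rumoured_lanes = 800 * 5 // 4
rumoured = peak_gflops(rumoured_lanes, 750)

# Clock implied by 1.5 TFLOPS on 1000 lanes.
implied_clock_mhz = 1500.0 * 1000 / (rumoured_lanes * 2)

print(hd4870, rumoured, implied_clock_mhz)  # 1200.0 1500.0 750.0
```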

hoom · Oct 25, 2008

So many things impossible there.

Acert93 · Oct 25, 2008

Stacked chips and a 512bit memory bus at 205mm^2--welcome to the distant, and I mean distant, future!

nutball · Oct 25, 2008

They forgot the eDRAM.

Arun · Oct 25, 2008

Joshua Luna said:
Stacked chips and a 512bit memory bus at 205mm^2--welcome to the distant, and I mean distant, future!

Actually, they're just being dumb; they probably heard "Pentium-D style" and concluded that on their own. However, Intel uses MCMs (two chips next to each other on the same package), not SiPs. Not that it matters, since this all sounds like so much bullshit.

fehu · Oct 25, 2008

maybe this is just AMD's FUD, made only to confuse people and nvidia, like they did for the 4800

Jawed · Oct 25, 2008

Hmm, this is kinda interesting. Remember how R600 was intended to be only a moderate performance improvement over R580? This "rumour" seems to suggest that the D3D10.1->11 transition (if this is truly D3D11) is going to cause the same kind of mediocre increase in performance.

I could believe "R800" is an MCM with two RV870's, amounting to a 512-bit bus. Well, I've long wanted to believe such a device was coming, so...

150-160GB/s is certainly not reasonable, though, for a 512-bit GDDR5 configuration. EDIT: for 1.2GHz GDDR5 and a 256-bit bus that would be reasonable, though: 30%+ more bandwidth than HD4870.
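A rough sketch of the bandwidth arithmetic, assuming GDDR5 moves 4 bits per pin per command clock and that HD4870 runs its 256-bit GDDR5 at 900MHz (both are assumptions here, and the function name is just illustrative):

```python
def bandwidth_gb_s(bus_bits, command_clock_mhz, transfers_per_clock=4):
    """Peak memory bandwidth in GB/s for a GDDR5-style interface."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * command_clock_mhz * transfers_per_clock / 1000.0

hd4870 = bandwidth_gb_s(256, 900)          # assumed HD4870 configuration
rumour_256 = bandwidth_gb_s(256, 1200)     # 1.2GHz GDDR5 on a 256-bit bus
same_clock_512 = bandwidth_gb_s(512, 900)  # 512-bit at HD4870's memory clock

print(round(hd4870, 1), round(rumour_256, 1), round(same_clock_512, 1))
# 115.2 153.6 230.4
print(round(rumour_256 / hd4870, 2))  # 1.33
```

So 150-160GB/s sits right where a 256-bit bus at 1.2GHz would land, while a 512-bit bus would be far above it even at HD4870's memory clock.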

Jawed

fellix · Oct 25, 2008

That "1/4 (or +%25)" notion probably means 48 TMU design (960 ALUs), which in this case is more like 20% up?
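A minimal sketch of why +25% does not fit RV770-style clusters, assuming each cluster carries 80 ALU lanes and 4 TMUs and that RV770 has 10 of them (assumed figures; the helper name is made up):

```python
LANES_PER_CLUSTER = 80  # assumed RV770-style SIMD: 16 five-wide units
TMUS_PER_CLUSTER = 4    # assumed one quad-TMU per cluster
RV770_CLUSTERS = 10

def config(clusters):
    """Return (ALU lanes, TMUs) for a chip built from whole clusters."""
    return clusters * LANES_PER_CLUSTER, clusters * TMUS_PER_CLUSTER

base_lanes, base_tmus = config(RV770_CLUSTERS)
lanes, tmus = config(12)
increase_pct = (lanes - base_lanes) * 100 // base_lanes

# An exact +25% would need 1000 lanes, which is not a whole number
# of 80-lane clusters.
target = base_lanes * 5 // 4
fits_whole_clusters = target % LANES_PER_CLUSTER == 0

print(lanes, tmus, increase_pct, target, fits_whole_clusters)
# 960 48 20 1000 False
```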

CarstenS · Oct 25, 2008

They could go with RV730-style SIMDs at 40ish increments, couldn't they?

fellix · Oct 25, 2008

Reducing the ALU:TEX ratio? No way!

CarstenS · Oct 25, 2008

Makes you wonder why they did it for a very price sensitive chip like RV730 in the first place, doesn't it?

fellix · Oct 25, 2008

Yeah, tell me about it!

The cut-down follow ups are always screwed this way.

Jawed · Oct 25, 2008

The lower end chips are supposed to support lower-quality rendering, which is supposed to be less math, less AF, less AA. Or, significantly lower resolutions for the same quality of rendering.

RV730, HD4670, has the following percentages of HD4870:

math - 40%
texturing - 80%
fillrate - 50%
bandwidth - 28%
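Those four percentages can be reproduced from the headline specs. The numbers below (lanes, TMUs, ROPs, clocks, memory data rates) are the commonly quoted ones for the two cards and are assumptions here, not taken from this thread:

```python
# Assumed specs; mem_mtps is the effective data rate in MT/s per pin.
hd4870 = {"alu_lanes": 800, "tmus": 40, "rops": 16,
          "core_mhz": 750, "bus_bits": 256, "mem_mtps": 3600}
hd4670 = {"alu_lanes": 320, "tmus": 32, "rops": 8,
          "core_mhz": 750, "bus_bits": 128, "mem_mtps": 2000}

def rates(gpu):
    """Peak rates in arbitrary but consistent units."""
    return {
        "math": gpu["alu_lanes"] * 2 * gpu["core_mhz"],       # MFLOPS
        "texturing": gpu["tmus"] * gpu["core_mhz"],           # Mtexels/s
        "fillrate": gpu["rops"] * gpu["core_mhz"],            # Mpixels/s
        "bandwidth": gpu["bus_bits"] // 8 * gpu["mem_mtps"],  # MB/s
    }

big, small = rates(hd4870), rates(hd4670)
percent = {name: round(100 * small[name] / big[name]) for name in big}

print(percent)
# {'math': 40, 'texturing': 80, 'fillrate': 50, 'bandwidth': 28}
```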

In HD4870 math amounts to ~30% of the die, with texturing accounting for ~10% (excluding L2 cache). Fillrate/bandwidth aren't easy to separate but amount to ~50% of the die (including L2 cache) - presuming that about 10% goes to things like the hub, CrossFireX sideport, control processor etc.

The memory bus is a major determinant of overall die size, simply due to pad density, so lowering the pad count makes a dramatic difference to die cost. Anyone have any idea what's the smallest GPU die so far with a 128-bit bus?

So HD4670, at around 56% of the die size of HD4870, gives performance in games in the region of 40-60%.

---

A 3:1 HD4870 would have been interesting: 12 clusters, each of 12 quads of ALUs at 750MHz would give 1080GFLOPs, 90% of HD4870's math and 120% of its texturing. I reckon it would have been about the same die size.
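The 1080GFLOPs figure works out if each cluster holds 3 ALU quads (12 five-wide units, 60 lanes) against one quad-TMU, which is one reading of the configuration above. A sketch under that assumption, with a quad taken as 4 pixels x 5 lanes:

```python
LANES_PER_QUAD = 20   # assumption: 4 pixels x 5-wide VLIW unit
TMUS_PER_CLUSTER = 4  # assumption: one quad-TMU per cluster

def cluster_config(clusters, alu_quads_per_cluster, clock_mhz=750):
    """Return (ALU lanes, peak GFLOPS, TMUs) for a given ALU:TEX ratio."""
    lanes = clusters * alu_quads_per_cluster * LANES_PER_QUAD
    gflops = lanes * 2 * clock_mhz / 1000.0
    tmus = clusters * TMUS_PER_CLUSTER
    return lanes, gflops, tmus

rv770 = cluster_config(10, 4)         # 4:1, as shipped
three_to_one = cluster_config(12, 3)  # the hypothetical 3:1 part

print(rv770)         # (800, 1200.0, 40)
print(three_to_one)  # (720, 1080.0, 48)
print(three_to_one[1] / rv770[1], three_to_one[2] / rv770[2])  # 0.9 1.2
```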

Jawed

CarstenS · Oct 25, 2008

Thanks for all the details, Jawed.

I still have my reservations, though. Why, for example would AMD redesign their SIMD-Cores, when they could have had their 20:1 ratio and saved die space plus R&D costs? Especially since RV730 has so little bandwidth compared to HD4800 and only 16 Interpolators.

I still think that this move somehow was a kind of field test for some things to come - smaller branches, more Data fed into shaders... maybe interesting stuff for D3D11 and since TMU aren't that costly - as you've also hinted at...

--

The two smallest ones I've measured by hand were about 104mm²: RV380 and RV515. IIRC the latter was the last low-end-GPU from AMD (then ATi) with a 128 Bit mem interface. On the nv-side of things it was G96-300 (55nm AFAIK) measuring in about 118mm².

Jawed · Oct 26, 2008

CarstenS said:
Thanks for all the details, Jawed.
I still have my reservations, though. Why, for example would AMD redesign their SIMD-Cores, when they could have had their 20:1 ratio and saved die space plus R&D costs?

4 clusters, 320 ALU lanes, 16 TUs? I reckon that would save about 5% of the die space - based on about 17.5% of each SIMD being control (complete guess).

The problem, I guess, is that RV7xx has half RV6xx's performance, per TU, for fp16 filtering. But I still have no idea what proportion of typical rendering time in current games is fp16-filtering.

CarstenS said:
Especially since RV730 has so little bandwidth compared to HD4800 and only 16 Interpolators.

HD4850 is definitely short on bandwidth, while HD4870 has too much.

16 interpolators is still a 2:1 interpolator:fragment ratio:

http://forum.beyond3d.com/showpost.php?p=1193433&postcount=184

like RV770.

CarstenS said:
I still think that this move somehow was a kind of field test for some things to come - smaller branches, more Data fed into shaders... maybe interesting stuff for D3D11 and since TMU aren't that costly - as you've also hinted at...

I'm torn over the smaller branches thing - I presume you mean lower branching-divergence penalty. Lower is clearly useful, but using the control overhead I mentioned earlier, a 2:1 (what you would call 10:1) ratio makes the ALUs (ALUs+control, excluding TUs) about 22% larger (resulting in that 5% penalty for the die as a whole).

With nested branching there's an explosion in the divergence penalty. Sadly GPUSA is still broken for calculating the throughput of complex shaders with nested branching (well with the shaders I've tried, anyway, e.g. Steep Parallax Mapping), so I haven't had a chance to play to see what kind of effects nesting has on performance compared with ALU:TEX. Also, is nesting really relevant at this time (as a proportion of frame rendering time)?

Without nesting it's then a matter of sheer throughput of a 4:1 configuration compared with a 2:1 configuration. If the former has ~20% higher throughput for the same SIMD area, and shaders with DB are still relatively rare...

A 2:1 version of RV770, with ~ 1TFLOP, would have had 16 clusters, which would make 64 TUs

It wouldn't have been hugely bigger, though, around 270mm2, I guess.
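For what it's worth, the 16-cluster figure can be checked the same way, assuming a 2:1 cluster means 2 ALU quads (40 lanes) per quad-TMU at RV770's 750MHz; the constants and the helper are illustrative assumptions:

```python
LANES_PER_QUAD = 20   # assumption: 4 pixels x 5-wide VLIW unit
TMUS_PER_CLUSTER = 4  # assumption: one quad-TMU per cluster
CLOCK_MHZ = 750       # assumed HD4870 core clock

def two_to_one(clusters):
    """Return (ALU lanes, peak GFLOPS, TUs) for 2:1 clusters."""
    lanes = clusters * 2 * LANES_PER_QUAD
    gflops = lanes * 2 * CLOCK_MHZ / 1000.0
    return lanes, gflops, clusters * TMUS_PER_CLUSTER

lanes, gflops, tus = two_to_one(16)
print(lanes, gflops, tus)  # 640 960.0 64
```

That is 960 GFLOPS, close enough to "~1 TFLOP", with 64 TUs against RV770's 40.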

--

CarstenS said:
The two smallest ones I've measured by hand were about 104mm²: RV380 and RV515. IIRC the latter was the last low-end-GPU from AMD (then ATi) with a 128 Bit mem interface. On the nv-side of things it was G96-300 (55nm AFAIK) measuring in about 118mm².

So RV730 is about 40mm2 "over-sized" for its bandwidth - though that doesn't account for power. So in comparison to that, the ~5% increase in die area for using 2:1 clusters, instead of 4:1 clusters like RV770, seems relatively tame.

Jawed

CarstenS · Oct 26, 2008

Jawed,

5 percent less die space would result in about 10-11 (maybe even more, depending on placement options close to the wafer edges) additional dies. Apart from that - 5 percent more or less margins is something most companies, except for Intel maybe, would steal, spy and kill for.

Especially considering the case of RV730, which has no direct competitors in terms of performance/die-size (I do not count Nvidia's vastly larger G92 derivatives here, despite them being sold at a similar price point for customers), the IMO minor gains were not worth passing up the savings in R&D, 5% die space and so on. This, together with the very high numbers of RV730 which are going to be produced, would make this move something I'd not understand when viewed as an isolated case.

Plus, as you've said, the market for which RV730 is supposedly targeted will not have such high demands for expensive filtering methods, which would render the additional TMUs not-as-useful (to put it mildly) as in higher end GPUs.

--
WRT bandwidth/texel we should consider that HD4850 only has 3.18 bytes per texel (interpolated - which should matter most as cache hit rates are getting better with higher degrees of filtering/aniso) available, whereas HD 4670 can use 2.67 bytes per texel (bpt) and you said yourself that HD4850 seems a bit short on bandwidth.

And that's not even taking into account that - IMO! - the vast majority of RV730 will be OEM-style HD4650 and such - which drops that ratio to only 1.67 bpt, which I in turn would not consider that useful anymore.
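These bytes-per-texel figures can be reproduced as memory bandwidth divided by interpolator rate (interpolators x core clock). The sketch below assumes 32 interpolators on RV770 and 16 on RV730, and the commonly quoted clocks and memory data rates for each card; all of those are assumptions chosen because they reproduce the numbers above.

```python
def bytes_per_texel(bus_bits, mem_mtps, interpolators, core_mhz):
    """Memory bytes available per interpolated texel at peak rates."""
    bandwidth_mb_s = bus_bits / 8 * mem_mtps
    texels_m_per_s = interpolators * core_mhz
    return bandwidth_mb_s / texels_m_per_s

# (bus width, memory MT/s, interpolators, core MHz): assumed specs
cards = {
    "HD4850": bytes_per_texel(256, 1986, 32, 625),
    "HD4670": bytes_per_texel(128, 2000, 16, 750),
    "HD4650": bytes_per_texel(128, 1000, 16, 600),
}

for name, bpt in cards.items():
    print(name, round(bpt, 2))
# HD4850 3.18
# HD4670 2.67
# HD4650 1.67
```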

--
As for RV730 being 40mm² oversized: Correct me if I am wrong, but IMO pad size also scales with mem clock frequency - albeit presumably not linearly - and both RV380 and RV515 had much lower mem clocks than RV730. Plus, pads supposedly do not scale very well with process technology, but is it really so that they do not scale at all?

MfA · Oct 26, 2008

Flip chip on substrate pad pitch is not going down as fast as feature width (even given the slowdown there). Thermal warping and solder bridging are putting the hurt on minimization. They have gone down over the last couple of years though.

Jawed · Oct 26, 2008

CarstenS said:
5 percent less die space would result in about 10-11 (maybe even more, depending on placement options close to the wafer edges) additional dies.

Yeah. The "fp16" focus of R600 really really hurt too.

The point I made earlier about the 4:1 version of RV730 having half the fp16-filtering capability might be key here...

CarstenS said:
Apart from that - 5 percent more or less margins is something most companies, except for Intel maybe, would steal, spy and kill for. Especially considering the case of RV730, which has no direct competitors in terms of performance/die-size (I do not count Nvidia's vastly larger G92 derivatives here, despite them being sold at a similar price point for customers), the IMO minor gains were not worth passing up the savings in R&D, 5% die space and so on.

RV730 is huge compared to G96 at 121mm2 on 65nm. Sure, G96 is pathetic, but does RV730 need to be 20% bigger?...

CarstenS said:
This, together with the very high numbers of RV730 which are going to be produced, would make this move something I'd not understand when viewed as an isolated case.

NVidia decided not to compete?

CarstenS said:
Plus, as you've said, the market for which RV730 is supposedly targeted will not have such high demands for expensive filtering methods, which would render the additional TMUs not-as-useful (to put it mildly) as in higher end GPUs.

We need an answer to the "is fp16-filtering important?" question... Also other texture formats and vertex fetching?

--
CarstenS said:
WRT bandwidth/texel we should consider that HD4850 only has 3.18 bytes per texel (interpolated - which should matter most as cache hit rates are getting better with higher degrees of filtering/aniso) available, whereas HD 4670 can use 2.67 bytes per texel (bpt) and you said yourself that HD4850 seems a bit short on bandwidth.

And that's not even taking into account that - IMO! - the vast majority of RV730 will be OEM-style HD4650 and such - which drops that ratio to only 1.67 bpt, which I in turn would not consider that useful anymore.

Sadly it's really hard to find comprehensive results for GPUs:

http://www.techreport.com/articles.x/15559/6
http://www.techreport.com/articles.x/15559/8

In CoD4 1280 (sigh) HD4670 is 83% of the performance of HD4850 with AF/no-AA, but 65% with AF/4xAA. ETQW is 79% and 68% respectively. Any comparable tests at higher resolutions out there?

--
CarstenS said:
As for RV730 being 40mm² oversized: Correct me if I am wrong, but IMO pad size also scales with mem clock frequency - albeit presumably not linearly - and both RV380 and RV515 had much lower mem clocks than RV730. Plus, pads supposedly do not scale very well with process technology, but is it really so that they do not scale at all?

I don't know where the balance lies: process should help scaling but memory clocks shouldn't.

9400GT seems to be worse, being G96b to compete with RV710, which is about 73mm2. How big is G96b?

Jawed
