AMD: R8xx Speculation

How soon will Nvidia respond with GT300 to the upcoming ATI RV870 lineup of GPUs?

  • Within 1 or 2 weeks

    Votes: 1 0.6%
  • Within a month

    Votes: 5 3.2%
  • Within couple months

    Votes: 28 18.1%
  • Very late this year

    Votes: 52 33.5%
  • Not until next year

    Votes: 69 44.5%

  • Total voters
    155
  • Poll closed.
each 5-way SIMD is 16 strands wide (just like now) but the ALU:TEX is doubled to 8:1 - this leads to a massive increase in compute density as the TUs currently cost ~29% of a cluster's area, and this would reduce that penalty to a mere ~17%. Put another way that would double per cluster compute for 71% more area.
If each TU is single-cycle fp16, that means each TU could be ~70% bigger. This huge increase in TU area might be motivation to increase ALU:TEX.

If ALUs also get 10% bigger per SIMD (say for new features) and the redundancy/LDS section gets 40% bigger per SIMD (mostly due to LDS doubling in size), then I estimate that the cluster size, overall, will grow by about 110% (before the 40nm shrink).

After 40nm shrink that would be about 11.9mm² per cluster, or about 11% bigger than on RV770. In this scenario the TU would be about 24% of the cluster, which represents a small saving in comparison with RV770.
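The arithmetic in this scenario can be checked with a short sketch. All inputs are the percentages quoted in the post; the function names are made up for illustration, and the 55nm starting point is RV770's process node. Under those inputs the TU share comes out at roughly 23 to 24%, and the shrink factor implied by "110% bigger, then 11% bigger after the shrink" lands very close to ideal 55nm to 40nm area scaling.

```python
# Relative areas, with one RV770 cluster normalised to 1.0.
TU_SHARE_RV770 = 0.29   # TU share of an RV770 cluster (figure from the post)
TU_GROWTH = 1.70        # single-cycle fp16 TU, ~70% bigger (figure from the post)
CLUSTER_GROWTH = 2.10   # whole cluster ~110% bigger before the shrink (from the post)
SIZE_VS_RV770 = 1.11    # cluster ~11% bigger than RV770's after the shrink (from the post)


def tu_share_after_growth(tu_share=TU_SHARE_RV770,
                          tu_growth=TU_GROWTH,
                          cluster_growth=CLUSTER_GROWTH):
    """Fraction of the enlarged cluster taken up by the enlarged TU."""
    return tu_share * tu_growth / cluster_growth


def implied_shrink_factor(size_vs_rv770=SIZE_VS_RV770,
                          cluster_growth=CLUSTER_GROWTH):
    """Area scale factor needed to go from the pre-shrink to the post-shrink size."""
    return size_vs_rv770 / cluster_growth


def ideal_area_scale(old_nm=55.0, new_nm=40.0):
    """Ideal area scaling between two process nodes (linear ratio squared)."""
    return (new_nm / old_nm) ** 2


if __name__ == "__main__":
    print(f"TU share of new cluster: {tu_share_after_growth():.1%}")
    print(f"Implied shrink factor:   {implied_shrink_factor():.3f}")
    print(f"Ideal 55nm->40nm scale:  {ideal_area_scale():.3f}")
```

Real shrinks rarely hit the ideal factor, so this only shows that the post's numbers are internally consistent, not that they are right.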

So, ahem, guesstimating:
  • Juniper 181mm² - 5 clusters, 800 ALUs, 20 TUs, 16 RBEs, 128-bit
  • Cedar 220mm² - 8 clusters, 1280 ALUs, 32 TUs, 16 RBEs, 128-bit
  • Cypress 316mm² - 12 clusters, 1920 ALUs, 48 TUs, 32 RBEs, 256-bit
:p
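As a rough sanity check, the ALU and TU counts in that list follow directly from the cluster counts if each cluster holds two 16-wide 5-way SIMDs sharing one quad-TU. The sketch below assumes exactly that layout and the ~11.9mm² per-cluster estimate; the helper name and the "everything else" figure are illustrative only, not anything stated by AMD.

```python
ALUS_PER_CLUSTER = 2 * 16 * 5  # two SIMDs x 16 strands x 5-way (assumed layout)
TUS_PER_CLUSTER = 4            # one quad-TU shared by both SIMDs (assumed layout)
CLUSTER_MM2 = 11.9             # per-cluster area estimate at 40nm (from the post)

# name: (clusters, guessed die size in mm^2), as listed in the post
GUESSES = {
    "Juniper": (5, 181),
    "Cedar": (8, 220),
    "Cypress": (12, 316),
}


def chip_summary(clusters, die_mm2):
    """Derive unit counts and the die area left over for everything else."""
    cluster_area = clusters * CLUSTER_MM2
    return {
        "alus": clusters * ALUS_PER_CLUSTER,
        "tus": clusters * TUS_PER_CLUSTER,
        "cluster_mm2": round(cluster_area, 1),
        "other_mm2": round(die_mm2 - cluster_area, 1),
    }


if __name__ == "__main__":
    for name, (clusters, die_mm2) in GUESSES.items():
        print(name, chip_summary(clusters, die_mm2))
```

The leftover area (RBEs, memory controllers, setup, and so on) is simply whatever the guessed die size does not spend on clusters.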

Jawed
 
According to the latest rumors, the launch of AMD's ATI RV870 flagship graphics card is imminent (sometime next month).

At an Asian bulletin board, it seems that the RV870 surfaced with specs and all. It supposedly comes with a 384-bit memory interface, but that info seems shady and somewhat inaccurate, so we'll leave it as is.

Other colleagues already published 3DMark Vantage scores in the Performance preset for the mainstream solution RV840 (Juniper) a few days ago. That chip reached P9500 points, though they did not give any further details about the test system. This puts it between the HD 4850 and HD 4870.

That same source also has a result for the flagship product RV870 (Cypress) on hand. They did not disclose an exact performance number, only a range between P16000 and P18000, which is almost twice the RV840 score and surpasses the HD 4870 X2.

* Cypress/RV870: P16000-P18000
* Juniper/RV840: P95xx
* Redwood/RV830: P46xx
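To put "almost twice" into numbers, a minimal sketch using only the rumored figures: the P9500 score quoted for Juniper and the P16000 to P18000 range quoted for Cypress. The function name is made up for illustration, and the scores themselves are unverified rumors.

```python
JUNIPER_SCORE = 9500            # rumored RV840 Vantage Performance score
CYPRESS_RANGE = (16000, 18000)  # rumored RV870 score range


def speedup_range(base=JUNIPER_SCORE, score_range=CYPRESS_RANGE):
    """Return (low, high) ratio of the flagship score range to the base score."""
    low, high = score_range
    return low / base, high / base


if __name__ == "__main__":
    low, high = speedup_range()
    print(f"Cypress vs Juniper: {low:.2f}x to {high:.2f}x")
```

With these inputs the ratio falls between roughly 1.7x and 1.9x.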

Though 3DMark says very little, it's safe to assume that the new flagship card will double the performance of the last generation.
http://www.guru3d.com/news/ati-rv870-scores-p18000-in-vantage/
 
"Dual-shader" might mean that one cluster contains two 5-way SIMDs, both of which share a quad-TU. This provides two options:
  1. each 5-way SIMD is 8 strands wide, i.e. a thread is 32 wide - this improves branch incoherence penalties substantially
  2. each 5-way SIMD is 16 strands wide (just like now) but the ALU:TEX is doubled to 8:1 - this leads to a massive increase in compute density as the TUs currently cost ~29% of a cluster's area, and this would reduce that penalty to a mere ~17%. Put another way that would double per cluster compute for 71% more area.
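Both options reduce to simple arithmetic. The sketch below assumes a thread is issued over four cycles (so thread width is strands times four) and, for option 2, that everything in the cluster other than the TU doubles while the TU stays the same size. The function names are illustrative.

```python
def thread_width(strands, issue_cycles=4):
    """Threads are issued over several cycles, so width = strands * cycles."""
    return strands * issue_cycles


def double_non_tu_area(tu_share=0.29):
    """Option 2: double everything except the TU (assumption).

    Returns (new cluster area relative to the old one, new TU share).
    """
    non_tu_share = 1.0 - tu_share
    new_area = 2.0 * non_tu_share + tu_share
    return new_area, tu_share / new_area


if __name__ == "__main__":
    print("Option 1 thread width:", thread_width(8))
    print("Current thread width: ", thread_width(16))
    area, share = double_non_tu_area()
    print(f"Option 2 cluster area: {area:.2f}x, TU share: {share:.1%}")
```

With the TU at 29% of the cluster, doubling the rest gives 1.71x the area and a TU share of about 17%, matching the figures in option 2.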
Jawed
Maybe it just refers to the direly needed dual-setup-unit, which is indicated by the double-complex scheduler?
 
So, ahem, guesstimating:
  • Juniper 181mm² - 5 clusters, 800 ALUs, 20 TUs, 16 RBEs, 128-bit
  • Cedar 220mm² - 8 clusters, 1280 ALUs, 32 TUs, 16 RBEs, 128-bit
  • Cypress 316mm² - 12 clusters, 1920 ALUs, 48 TUs, 32 RBEs, 256-bit
:p

Jawed
:oops: Your Juniper specs are very RV670-like. The only significant difference is the higher number of ALUs. I can't believe that Juniper could be the same size as RV670, which is on a 1.9x bigger manufacturing process...
 
Not sure what you mean, so I cannot tell if I mean the same.
Two threads are issued on the ALUs, with thread A issued as a single instruction over four cycles AAAA, then thread B takes its turn. So the SIMD looks like it's executing two threads at the same time.
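The interleaving described here can be pictured as a simple issue timeline. This is only a toy illustration, with a hypothetical function name and parameters; it models nothing beyond the alternating four-cycle issue slots.

```python
def issue_schedule(threads=("A", "B"), cycles_per_instruction=4, instructions=2):
    """Build a per-cycle timeline where each thread issues one instruction
    over several cycles before the next thread takes its turn."""
    return "".join(
        thread * cycles_per_instruction
        for _ in range(instructions)
        for thread in threads
    )


if __name__ == "__main__":
    # One character per cycle: A occupies the SIMD for four cycles, then B.
    print(issue_schedule())
```

Called with the defaults, this yields `AAAABBBBAAAABBBB`: each thread owns the ALUs for four cycles at a time, so over any eight-cycle window the SIMD appears to be running both.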

Jawed
 
:oops: Your Juniper specs are very RV670-like. The only significant difference is the higher number of ALUs. I can't believe that Juniper could be the same size as RV670, which is on a 1.9x bigger manufacturing process...
Well, there's a black hole labelled D3D11 that seems to be sucking up die - it's just a guess...

For what it's worth I'm wary of these big TUs, just because they have such a large hit on 8-bit performance. But, ahem, ATI tried once before, the question is, when does the tipping point come that makes them worthwhile?

Jawed
 
For what it's worth I'm wary of these big TUs, just because they have such a large hit on 8-bit performance. But, ahem, ATI tried once before, the question is, when does the tipping point come that makes them worthwhile?
Maybe never, because at this point it would make more sense to just use shader alus for filtering?
 
I think pretty much any single-cycle FP16 implementation would have to provide double throughput for INT8 right? Otherwise that's a lot of wasted space.
 
I think pretty much any single-cycle FP16 implementation would have to provide double throughput for INT8 right? Otherwise that's a lot of wasted space.
That's why I've been querying how expensive it is to meet the filtering precision specification of D3D11 for 8-bit textures. Single-cycle fp16 TUs are 70% more expensive in R600, apparently. Maybe less (if there is some overhead associated with process/library from back then).

Jawed
 
It is not just the bigger tex units for the 16-bit compute lanes; what about the texture cache and load bandwidth for sampling, etc.? This must also be accounted for in the transistor budget.
Anyway, in that perspective, should we expect some advancement in AF quality this time?
 
It is not just the bigger tex units for the 16-bit compute lanes; what about the texture cache and load bandwidth for sampling? This must also be accounted for in the transistor budget.
The bandwidth is already there. 128 bits of unfiltered data are as fast as 32 bits. L1 is very small, so doubling it is hardly a world of pain - I think RV770 doubled L1 per cluster in comparison with RV670, so for performance parity against the per unit capability of RV670 the cache is already there.

Anyway, in that perspective, should we expect some advancement in AF quality this time?
I guess so, but I don't know what D3D11's filtering specifications cover.

Going back to no-X's comment, Juniper appears to be RV730's replacement. HD4670 was never offering a compelling alternative to HD3870 in terms of absolute performance as far as I can tell - not enough bandwidth - maybe drivers have changed things since its launch?

RV730's 32 TUs were squandered, though it has pretty much the same fp16 filtering rate as HD3870 (does that imply that fp16 rate is more important at the low end?). HD3870 has the stupid 2x Z rate, squandering its bandwidth.

So, ahem, in comparison with RV670, putting 25% more 8-bit and 16-bit texturing into Juniper would still be a performance win, assuming that GDDR5 is there to give some tasty extra bandwidth.

RV740->Cedar, that's more tricky as my speculation says there's only a doubling in fp16 rate and a doubling in ALUs.

Jawed
 
Two threads are issued on the ALUs, with thread A issued as a single instruction over four cycles AAAA, then thread B takes its turn. So the SIMD looks like it's executing two threads at the same time.
Jawed
Thanks, but no, that's not what I meant. I was merely referring to the "rumor" that started with Olick's presentation on id Tech 6 and wormed its way through to a doubled ROP count on all DX11 chips compared to their DX10 predecessors (which may very well be more than just a rumor).

I'd like to know what the major factor is in the R600's TMU performance (compared to today's TMUs): the native FP16 support or the point samplers. I'm still thinking about the low (60%) performance difference between HD2900XT and HD4890 in current games without FSAA.
Be warned, though CB is using the same driver this time, they'd have to revert to medium details quite often in order to let the X800 and 6800 compete at all.

That's why I've been querying how expensive it is to meet the filtering precision specification of D3D11 for 8-bit textures. Single-cycle fp16 TUs are 70% more expensive in R600, apparently. Maybe less (if there is some overhead associated with process/library from back then).
Jawed
Would the TA have to be beefed up significantly, or just a bit, in order to support the required 16k textures?
 