AMD: R7xx Speculation

Shtal · Mar 4, 2008

This is only my theory, thank you, based on speculation about what the specs of RV770 will be.

I believe it will be like the ATI R360 to R420 transition in performance.

R360XT
412MHz core
8 pixel pipeline / 4 vertex Shaders / 8TMU's
256bit 365MHzx2 = 730MHz DDR

R420XT
500MHz core
16 pixel pipeline / 6 vertex Shaders / 16TMU's
256bit 500MHzx2 = 1000MHz GDDR3

It should be something similar with RV770, about 40-65% performance increase over RV670.
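For reference, the memory and fillrate side of that R360 to R420 jump can be checked with a few lines of Python. This is only a rough sketch: it assumes peak bandwidth is bus width times effective data rate, and that each pixel pipeline outputs one pixel per clock. The function names are made up for illustration.

```python
def peak_bandwidth_gb_s(bus_width_bits, memory_clock_mhz, transfers_per_clock=2):
    """Theoretical peak memory bandwidth in GB/s (assumes DDR: 2 transfers per clock)."""
    effective_mt_s = memory_clock_mhz * transfers_per_clock
    return effective_mt_s * 1e6 * (bus_width_bits / 8) / 1e9


def pixel_fillrate_mpix_s(pixel_pipelines, core_clock_mhz):
    """Theoretical pixel fillrate in Mpixels/s (assumes 1 pixel per pipeline per clock)."""
    return pixel_pipelines * core_clock_mhz


# Figures taken from the spec lists above.
r360_bw = peak_bandwidth_gb_s(256, 365)
r420_bw = peak_bandwidth_gb_s(256, 500)
r360_fill = pixel_fillrate_mpix_s(8, 412)
r420_fill = pixel_fillrate_mpix_s(16, 500)

print(f"R360XT: {r360_bw:.2f} GB/s, {r360_fill} Mpix/s")
print(f"R420XT: {r420_bw:.2f} GB/s, {r420_fill} Mpix/s")
print(f"Bandwidth ratio: {r420_bw / r360_bw:.2f}x")
print(f"Fillrate ratio:  {r420_fill / r360_fill:.2f}x")
```

On paper that works out to roughly 1.37x the bandwidth and 2.43x the pixel fillrate. These are theoretical peaks only and say nothing about real game performance.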

Twinkie · Mar 4, 2008

Pete said:
http://www.behardware.com/articles/701-14/product-review-the-amd-radeon-hd-3870-x2.html
http://www.computerbase.de/artikel/...e_9600_gt_sli/24/#abschnitt_performancerating

Thanks for the links.

So basically the RV770 could finally offer 8800ultra like performance in terms of single gpu performance? (seeing as the HD3870 is mostly twice as slow). I guess other features could tip the scale in favor of the RV770, but the 8800ultra is already very old.

If the GT200 is supposedly a new gen architecture in the sense that almost everything was overhauled, then I can see why nVIDIA doesn't need it until later this year, since the 9 series looks like it'll be pretty competitive against the rumored RV770 even if it's the same old stuff from 2006.

Kaotik · Mar 4, 2008

Ailuros said:
If the die size truly is around 250mm2, it could suggest something in between 175-190M more transistors. Something tells me that not all unit amounts have been doubled as some of the funky rumours suggest so far. It's enough if throughput is doubled, while not necessarily having in some spots twice as many units and it makes way more sense too. Combined probably with some further algorithmic optimisations and an expectable frequency increase, the result should have more than good chances to distance itself from an 8800Ultra f.e.

If all that is true it doesn't come as a surprise at all that NV is shrinking G92 down to 55nm. Even if they're preparing a single high end monster for the second half of this year, it'll take them time to get to a cost-effective Performance GPU out of that future family.

Not all of course, but how much do TMUs and SPs take space?
I mean, R520 > 580 tripled the number of pixel shader units, and the die size went up only 64mm^2 (288mm^2 > 352mm^2) and the transistor count only 63 million (321 > 384 million).
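The R520 to R580 figures quoted here are easy to sanity-check. A minimal sketch, using only the die sizes and transistor counts from the post (the dictionary layout and variable names are just for illustration):

```python
# Figures as quoted in the post: die size in mm^2, transistors in millions.
r520 = {"die_mm2": 288, "transistors_m": 321}
r580 = {"die_mm2": 352, "transistors_m": 384}

die_delta = r580["die_mm2"] - r520["die_mm2"]
transistor_delta = r580["transistors_m"] - r520["transistors_m"]

# Relative growth of the whole chip for a tripling of pixel shader units.
die_growth = die_delta / r520["die_mm2"]
transistor_growth = transistor_delta / r520["transistors_m"]

print(f"Die area:    +{die_delta} mm^2 ({die_growth:.1%})")
print(f"Transistors: +{transistor_delta}M ({transistor_growth:.1%})")
```

So tripling the pixel shader units cost roughly a fifth more die area and transistors, going by these numbers.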

Ailuros · Mar 4, 2008

Kaotik said:
Not all of course, but how much do TMUs and SPs take space?
I mean, R520 > 580 tripled the number of pixel shader units, and the die size went up only 64mm^2 (288mm^2 > 352mm^2) and the transistor count only 63 million (321 > 384 million).

I don't recall any higher Pixel/Texel/Z fillrates on R580 compared to R520. Oh and just by the way since when is a PS3.0 ALU any comparison to a SM4.1 ALU?

Tchock · Mar 4, 2008

Ailuros said:
I don't recall any higher Pixel/Texel/Z fillrates on R580 compared to R520. Oh and just by the way since when is a PS3.0 ALU any comparison to a SM4.1 ALU?

R580 was a primary ALU extension though. Considerably effective? Maybe.

Best reference of RV670 would be RV630/635. With only a 65-70mm^2 die area disparity, look what the cat dragged in.

Jawed · Mar 4, 2008

Tchock said:
Best reference of RV670 would be RV630/635. With only a 65-70mm^2 die area disparity, look what the cat dragged in.

Or in transistor count terms, 76% more transistors for ~150-200% more performance.

Jawed

DeadlyNinja · Mar 4, 2008

turtle said:
Anyone read Chinese? The OP from this article was on chiphell saying something about R700 being delayed to Q3 and pointed to this article he wrote:

http://www.pczilla.net/post/236.html

Not sure if it's of any importance, but thought I'd throw it out there.

Actually, it says it might be delayed to the second half of the year, not Q3.

turtle · Mar 5, 2008

Kaotik said:
Not all of course, but how much do TMUs and SPs take space?
I mean, R520 > 580 tripled the number of pixel shader units, and the die size went up only 64mm^2 (288mm^2 > 352mm^2) and the transistor count only 63 million (321 > 384 million).

Right, and don't get me wrong, I have no idea the difference in transistors between the change from 1->3 per ROP in R520->R580 and this suggestion, but I wonder if RV770 could keep its "64" shaders, but instead be composed of 10 MADDs (9+1?), rather than 5 (4+1)?

Could this be done, and with minimal increase in transistors? If the R520->R580 took 63 million, and the theoretical transistor difference between RV670->RV770 is ~150-200m, could it be done while increasing the TMUs to 32, and could the 4x16 setup be kept in essence squishing the MADDs and TMUs into the existing configuration?

DeadlyNinja said:
Actually, it says it might be delayed to the second half of the year, not Q3.

Thank you. It's appreciated. The writer mentioned Q3 in a thread at chiphell.

ShaidarHaran · Mar 5, 2008

IIRC ROPs, caches, and memory controller(s) take up the most trannies/die space with SPs and TMUs being relatively "affordable", so I think it's entirely possible to increase both the TMU and SP count in RV770 given the known trannie/die space budgets.

Ailuros · Mar 5, 2008

Tchock said:
R580 was a primary ALU extension though. Considerably effective? Maybe.

Best reference of RV670 would be RV630/635. With only a 65-70mm^2 die area disparity, look what the cat dragged in.

Well as Jawed pointed out above, maybe you should have a closer look at transistor counts between all your examples. I don't even know why you used the RV630 for that comparison; it is on 65nm, while RV670 is on 55nm.

RV635 has if memory serves well 378M transistors vs. 666M on RV670.
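Those two counts line up with Jawed's "76% more transistors" figure earlier in the thread. A quick sketch, treating his "~150-200% more performance" as a 2.5x to 3.0x range (that reading of the range is an assumption, and the variable names are illustrative):

```python
rv635_transistors_m = 378  # millions, as recalled in the post above
rv670_transistors_m = 666

transistor_ratio = rv670_transistors_m / rv635_transistors_m
extra_transistors_pct = (transistor_ratio - 1) * 100
print(f"RV670 has {extra_transistors_pct:.0f}% more transistors than RV635")

# "~150-200% more performance" read as 2.5x to 3.0x overall performance.
for perf_ratio in (2.5, 3.0):
    per_transistor = perf_ratio / transistor_ratio
    print(f"{perf_ratio:.1f}x performance -> {per_transistor:.2f}x performance per transistor")
```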

DegustatoR · Mar 5, 2008

Twinkie said:
If the GT200 is supposedly a new gen architecture in the sense that almost everything was overhauled

It most probably isn't.
I think it'll be more like NV40->G70 type of upgrade.

Arun · Mar 5, 2008

Ailuros said:
RV635 has if memory serves well 378M transistors vs. 666M on RV670.

But the die size is indeed only ~70mm² higher for RV670. This implies: a) overhead including analogue/IO, b) that the scalable parts are likely denser (i.e. more transistors/mm²) than other digital parts of the chip. Both make perfect sense.

So, this does imply that area scaling for a chip with 50-100% more performance than RV670 would be pretty good - but on the other hand, the power density (per mm²) for those scalable bits is probably also worse than the chip's average. And more transistors per mm² also tends to imply more defects per mm² (for a given synthesis quality), so yields are yet another factor to consider (although their impact shouldn't be overestimated given the redundancy mechanisms used).

Tchock · Mar 5, 2008

Ailuros said:
Well as Jawed pointed out above, maybe you should have a closer look at transistor counts between all your examples. I don't even know why you used the RV630 for that comparison; it is on 65nm, while RV670 is on 55nm.

RV635 has if memory serves well 378M transistors vs. 666M on RV670.

Trannie density went up as more transistor-heavy components went in along with the core parts, that is.
I wanted to edit that to RV635 exclusively but B3D has no edit option. Dang.

The ALUs from a relative standpoint should be equal, could be cheaper than R5XX (although there's much less need to scale upward), I'm not sure on the TMU side as that's a bottleneck that's been identified and rectification should have been done. Taking R520/580 from the equation seems to help out with the perception though.

On the overall side, it seems that the architecture might really benefit from a lot of unit stacking, but exotic cooling OTOH might make it a diminishing return after all.

Jawed · Mar 5, 2008

Arun said:
So, this does imply that area scaling for a chip with 50-100% more performance than RV670 would be pretty good - but on the other hand, the power density (per mm²) for those scalable bits is probably also worse than the chip's average.

How much so, if at all? The reason I ask is that all the scalable parts use memory, sometimes as in the case of ALUs, a lot (register file).

Jawed

Arun · Mar 5, 2008

Jawed said:
How much so, if at all? The reason I ask is that all the scalable parts use memory, sometimes as in the case of ALUs, a lot (register file).

Hmm, that's a very interesting point I had forgotten about completely - oopsie, and good catch!

Given that AMD implied in the past that the register file was basically as big as the ALUs themselves (excluding the scheduling overhead I presume?) then this might imply a substantial part of that density benefit is due to SRAM (which is inherently much denser), which also has lower (rather than higher) power density and better (not worse) yields. Hmmm!

I'd presume that the TMUs, for example, would use a lot less memory (although you *might* want to scale the texture cache along with them) but the fact remains that what you just pointed out made my argument quite different; I still suspect ALU/TMU density is higher than the density of the 'unique' stuff for a variety of reasons, but certainly power density and yields become a much more complex equation now. Hmm!

3dilettante · Mar 5, 2008

Arun said:
And more transistors per mm² also tends to imply more defects per mm² (for a given synthesis quality), so yields are yet another factor to consider (although their impact shouldn't be overestimated given the redundancy mechanisms used).

Just out of curiosity, which mechanism implies that more transistors per unit of area leads to higher defects?
Do smaller transistors lower the threshold where physical defects in the silicon itself impact transistor function?

At least for mature processes at the MPU manufacturers, defect rates approach the baseline of physical defects in the wafer, irrespective of what's patterned on them.

Farhan · Mar 5, 2008

Arun said:
Hmm, that's a very interesting point I had forgotten about completely - oopsie, and good catch!

Given that AMD implied in the past that the register file was basically as big as the ALUs themselves (excluding the scheduling overhead I presume?) then this might imply a substantial part of that density benefit is due to SRAM (which is inherently much denser), which also has lower (rather than higher) power density and better (not worse) yields. Hmmm!

I'd presume that the TMUs, for example, would use a lot less memory (although you *might* want to scale the texture cache along with them) but the fact remains that what you just pointed out made my argument quite different; I still suspect ALU/TMU density is higher than the density of the 'unique' stuff for a variety of reasons, but certainly power density and yields become a much more complex equation now. Hmm!

I don't think a register file SRAM would be very low power unless it is very underutilized. The reason why SRAMs in those large L2 caches have low average power/power density is because of the low activity factor. The vast majority of the cells are just idle/sleeping (and they are optimized for doing this). Accesses are quite expensive in terms of power since you have to drive those long wordlines and bitlines and have decoding logic and do way comparisons etc.
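The activity-factor point can be illustrated with the usual CMOS dynamic power relation, P = a * C * V^2 * f. The sketch below is only an illustration: every numeric value in it (capacitance, voltage, clock, and both activity factors) is a made-up placeholder, not a measurement of any real chip.

```python
def dynamic_power_w(activity_factor, capacitance_f, voltage_v, frequency_hz):
    """Standard CMOS dynamic power estimate: P = a * C * V^2 * f."""
    return activity_factor * capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical placeholder values, identical for both arrays.
CAP_F = 1e-9      # switched capacitance in farads
VDD_V = 1.2       # supply voltage in volts
FREQ_HZ = 750e6   # clock frequency in hertz

# Hypothetical activity factors: a mostly idle L2-style cache
# versus a register file that is accessed almost every cycle.
l2_like = dynamic_power_w(0.01, CAP_F, VDD_V, FREQ_HZ)
regfile_like = dynamic_power_w(0.5, CAP_F, VDD_V, FREQ_HZ)

print(f"L2-like array:       {l2_like:.4f} W")
print(f"Regfile-like array:  {regfile_like:.4f} W")
print(f"Ratio:               {regfile_like / l2_like:.0f}x")
```

With everything else equal, dynamic power scales linearly with the activity factor, which is why a busy register file need not share the low power density of a large, mostly idle cache.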

Jawed · Mar 5, 2008

Arun said:
Hmm, that's a very interesting point I had forgotten about completely - oopsie, and good catch!

It would also be worth doing a per-transistor/per-watt comparison - with the caveat that RV635 has more bandwidth than it needs, to a greater extent than RV670. As far as I can tell, RV670 is up to 3x the performance of RV635, but at <2x the power consumption. I don't know the precise numbers though (might only be able to compare 256MB cards: HD3650 and HD3850).
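As a rough sketch of that per-watt comparison, using only the two bounds stated above ("up to 3x" performance, "<2x" power). Both are loose bounds rather than measurements, so the result is only a ballpark; the helper name is illustrative.

```python
def efficiency_gain(perf_ratio, cost_ratio):
    """Relative gain in performance per unit of cost (watts, transistors, ...)."""
    return perf_ratio / cost_ratio


perf_ratio_best_case = 3.0   # "up to 3x the performance" (an upper bound)
power_ratio_ceiling = 2.0    # "<2x the power consumption" (a ceiling)

# If the 3x case holds and power really stays under 2x,
# performance per watt improves by more than this value.
perf_per_watt_gain = efficiency_gain(perf_ratio_best_case, power_ratio_ceiling)
print(f"Performance per watt: more than {perf_per_watt_gain:.1f}x in the best case")
```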

Given that AMD implied in the past that the register file was basically as big as the ALUs themselves

I don't remember that, I'm guessing foggy memory on my part.

(excluding the scheduling overhead I presume?) then this might imply a substantial part of that density benefit is due to SRAM (which is inherently much denser), which also has lower (rather than higher) power density and better (not worse) yields. Hmmm!

For what it's worth the SRAM in Cell SPEs is ~4x the density of the remainder of the SPE. I dunno how we'd assess this factor in ATI GPUs - and we still don't know how much register file there is (512KB seems like a reasonable guesstimate) or whether it is actually double-implemented (to get dual-porting). L2 cache is another 256KB. The RBE caches could be 10s of KB, I suppose. There's also hierarchical-Z/stencil buffers - but I presume they're small enough to ignore.

I'd presume that the TMUs, for example, would use a lot less memory (although you *might* want to scale the texture cache along with them)

RV670 has 2x the L2 texture cache that RV635 has, so I think there's a strong chance the scaling is linear.

but the fact remains that what you just pointed out made my argument quite different; I still suspect ALU/TMU density is higher than the density of the 'unique' stuff for a variety of reasons, but certainly power density and yields become a much more complex equation now. Hmm!

One of the things I've been fiddling with is just how much of R6xx is "fixed": PCI Express, command processor, input assembler, rasteriser, setup, inter-stage buffers, instruction/constant-buffer caches, virtual memory system, ring bus buffering/command-processing etc.

So far I've narrowed it down to anywhere in the range 90-254M transistors.

If I simply plump for a number in the middle, 180M, then 27% of RV670's transistors are "fixed". That's 48% of RV635.

I don't know how to narrow down R6xx any further...
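The fixed-share percentages follow directly from the transistor counts quoted earlier in the thread (378M for RV635, 666M for RV670). A small sketch across the whole 90-254M range; the helper name is illustrative only:

```python
RV670_M = 666  # transistors in millions, as quoted earlier in the thread
RV635_M = 378


def fixed_share(fixed_m, total_m):
    """Fraction of a chip's transistors taken by the 'fixed' portion."""
    return fixed_m / total_m


for fixed_m in (90, 180, 254):
    print(
        f"fixed = {fixed_m}M: "
        f"{fixed_share(fixed_m, RV670_M):.0%} of RV670, "
        f"{fixed_share(fixed_m, RV635_M):.0%} of RV635"
    )

# With the 180M guess, compare how much the scalable remainder grows.
scalable_ratio = (RV670_M - 180) / (RV635_M - 180)
print(f"Scalable transistors grow {scalable_ratio:.2f}x from RV635 to RV670")
```

With the 180M guess, the scalable part grows about 2.45x while the whole chip grows only 1.76x, which is one way of seeing how a large fixed portion can distort the comparison.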

So, in summary we have:

a substantial part of R6xx is fixed
R6xx is comparatively low in performance per watt (against NVidia) despite R6xx "low" unit counts (TUs, RBEs)
super-linear scaling (from RV635 to RV670) both in per-unit performance and performance per watt

It seems to me that the "fixed" portion of R6xx GPUs is distorting things substantially. Obviously the scaling of RV635->RV670 is a fairly risky basis for this point of view, being a single data point. But what you gonna do?...

---

I can think of one spanner to throw in here: ALU utilisation. It seems to me it's very hard to find any code that exercises R6xx ALUs (whilst also fully exercising the rest of the GPU). So with "easy" code what we might be seeing is that power consumption in RV670 is never topping out, whereas RV635 might be closer to being fully utilised (due to the lower ALU:TEX ratio).

Jawed

DeadlyNinja · Mar 5, 2008

turtle said:
Thank you. It's appreciated. The writer mentioned Q3 in a thread at chiphell.

No problem. If you guys have more you'd like me to translate, I'd be happy to help out. Of course... my reading skills are only around 2nd grade level because of moving out of the country. I should be able to get the gist of most of the article, unless there's extremely heavy tech talk involved. Hell, even if it's in English, the tech talk's beyond me.

silent_guy · Mar 6, 2008

3dilettante said:
Just out of curiosity, which mechanism implies that more transistors per unit of area leads to higher defects?

The ratio of the defect size vs the feature size.

In the case of a defect that creates extra material, it could fall right on 1 wire, which would result in this wire becoming wider but without generating a short. For a missing material defect the opposite would be the case. When the defect size increases relative to the feature size, the chance of a real fault increases.

This page gives a nice explanation.

At least for mature processes at the MPU manufacturers, defect rates approach the baseline of physical defects in the wafer, irrespective of what's patterned on them.

Maybe that's because feature sizes are now so small that most defects are almost always larger than the critical defect area?
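The defect-size versus feature-size argument can be sketched with a very simplified one-dimensional model: parallel wires, a round extra-material defect landing at a random position, and a short only when the defect spans the whole gap between two neighbouring wires. The model, the function name, and all dimensions below are illustrative assumptions, not data for any real process.

```python
def short_probability(defect_diameter_um, wire_width_um, wire_spacing_um):
    """Chance that a randomly placed extra-material defect bridges two adjacent wires."""
    pitch = wire_width_um + wire_spacing_um
    # The defect only shorts when it spans the whole gap between two wires;
    # its centre must then land in a band of width (diameter - spacing) per pitch.
    bridging_band = defect_diameter_um - wire_spacing_um
    return min(1.0, max(0.0, bridging_band / pitch))


DEFECT_UM = 0.10  # hypothetical defect diameter in micrometres

# Hypothetical layouts: (label, wire width, wire spacing), all in micrometres.
for label, width, spacing in (("coarser layout", 0.09, 0.09), ("denser layout", 0.055, 0.055)):
    p = short_probability(DEFECT_UM, width, spacing)
    print(f"{label}: {p:.1%} of such defects cause a short")
```

In this toy model the same defect that is mostly harmless on the coarser layout becomes a real fault far more often on the denser one, and once defects are much larger than the pitch nearly every one of them is fatal.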
