G80 vs R600 Part X: The Blunt & The Rich Feature

Hmm, that's like arguing that only a few clusters nearest the RBEs should do MSAA resolve.
If my only desire is to keep signals from having to cross ever-widening expanses of die space, I'd say yes. If we start from the point of view of sloth, we don't want our signals working any harder than they have to.

As the ALU numbers rise, the proportional cost of leaning on one set of SIMDs over others would actually fall.
Such demarcations would serve to give the design more freedom, if we are to expect continued high ALU counts. Instead of designing a SIMD array where even the furthest units must be timed and their interconnects specified to accept RBE, TMU, *insert unit here* inputs, we can draw a line at a certain number of SIMDs.

The possible benefits include more flexible clocking schemes, slower growth of the crossbar, and possibly lower intensity hot spots or power-consuming repeaters.

In terms of connectivity, texels would be going from cache into the register file, or into those funky, shiny and new LDSs.
The diagrams are probably oversimplified for the ROP and L2 cache relationships, so I don't know how many ports lead to the L1s.
Future increases in unit counts could lead to greater pressure to economize.
A slimmer crossbar that simply means 10 out of 16 SIMDs (random future-design speculation) don't get RBE traffic may be a worthwhile tradeoff.
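To put rough numbers on that, here's a toy link-count calculation; the 4 feeder units and the 6-of-16 split are made-up figures, purely to show the scaling:

```python
# Each feeder (RBE, TMU, etc.) needs a routed, timed path to every SIMD it can
# reach, so restricting which SIMDs accept that traffic shrinks the crossbar.
def crossbar_paths(n_feeders: int, simds_reachable: int) -> int:
    """Number of feeder->SIMD paths that have to be routed and timed."""
    return n_feeders * simds_reachable

full = crossbar_paths(n_feeders=4, simds_reachable=16)  # every SIMD wired: 64 paths
slim = crossbar_paths(n_feeders=4, simds_reachable=6)   # only 6 of 16 wired: 24 paths
print(full, slim)
```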
 
Two little queries:

1) R600's 512-bit bus. What do you guys think the reasoning was for it? It seems like it added very little, if anything, to its performance after seeing 38x0. It's as if ATI engineers didn't realize they needed more fillrate, not more bandwidth.

2) G80 fillrate vs. bandwidth. 8800GTX can't quite keep up with HD 4850 even with a good bit more bandwidth available and 24 ROPs. Is this because it doesn't have G84/G9x's extra texturing rate or is it more related to 4850's significant shader resources?
 
If my only desire is to keep signals from having to cross ever-widening expanses of die space, I'd say yes. If we start from the point of view of sloth, we don't want our signals working any harder than they have to.
How much bigger are ATI GPUs going to get? Bearing in mind that multi-chip puts a "lowered" ceiling on the GPU size (the "process sweet spot"?).

The diagrams are probably oversimplified for the ROP and L2 cache relationships, so I don't know how many ports lead to the L1s.
Nearly 400GB/s of bandwidth from L2 to L1. Note that path is not via the hub.
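For what it's worth, here's one way a figure in that ballpark could be built up; only the engine clock is a published number, the partitioning and port width are my assumptions:

```python
# Hypothetical breakdown of the L2->L1 crossbar bandwidth.
l2_partitions   = 4       # assumed: one L2 slice per memory controller
bytes_per_clock = 128     # assumed read width per slice
core_clock_hz   = 750e6   # RV770 engine clock (HD 4870)
print(l2_partitions * bytes_per_clock * core_clock_hz / 1e9)  # 384.0 GB/s
```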

Jawed
 
Two little queries:

1) R600's 512-bit bus. What do you guys think the reasoning was for it? It seems like it added very little, if anything, to its performance after seeing 38x0. It's as if ATI engineers didn't realize they needed more fillrate, not more bandwidth.
Overall it seems like a white elephant. I've seen D3D10 synthetics that show R600's extra bandwidth taking it way past RV670.

The new architecture pairs the L2s and RBEs with what is effectively a private MC. So a whole pile of bandwidth that once exercised the ring bus has disappeared. That's just a suspicion; I don't feel like doing a long-winded ramble on the subject right now.

RBEs in R5xx/R6xx appear not to have a 1:1 relationship to MCs, implying another wodge of ring bus traffic.

RV7xx appears to have ended that "chaos", which presumably requires a quite different approach to the tiled use of memory for render targets, plus a crossbar between L2 and L1.

Jawed
 
1) R600's 512-bit bus. What do you guys think the reasoning was for it? It seems like it added very little, if anything, to its performance after seeing 38x0. It's as if ATI engineers didn't realize they needed more fillrate, not more bandwidth.
IMO it was just plain stupid design, though maybe we're overstating the cost of going from 256b to 512b on a chip of that size.

2) G80 fillrate vs. bandwidth. 8800GTX can't quite keep up with HD 4850 even with a good bit more bandwidth available and 24 ROPs. Is this because it doesn't have G84/G9x's extra texturing rate or is it more related to 4850's significant shader resources?
I think it's both, but mostly the latter. There's only one third the math ability.
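Rough peak-MAD arithmetic behind that, using the public board clocks and ignoring G80's hard-to-schedule co-issued MUL:

```python
# 2 FLOPs per MAD; shader clocks from the retail boards.
gtx8800_gflops = 128 * 1.35e9 * 2 / 1e9  # 8800 GTX: 128 SPs @ 1.35 GHz, ~346 GFLOPS
hd4850_gflops  = 800 * 625e6 * 2 / 1e9   # HD 4850: 800 ALUs @ 625 MHz, 1000 GFLOPS
print(gtx8800_gflops / hd4850_gflops)    # ~0.35, i.e. roughly a third
```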

G80 and G92 tend to always be pretty close together in benchmarks, so trying to break it down into detail is prone to error from splitting hairs.
 
How much bigger are ATI GPUs going to get? Bearing in mind that multi-chip puts a "lowered" ceiling on the GPU size (the "process sweet spot"?).
With each process transition, the RC delay for signal lines is expected to remain approximately the same or get worse, save ever wonkier low-K process knobs that still only partially offset the general trend.
Even if die sizes stayed exactly the same, the apparent wire lengths would not.
Lowest-level interconnects scale better with the transistor layer, but the higher metal layers that go across wider areas of the die would switch slower without other design changes.

If we only want the exact same clock speeds or gradually lowering clock speeds forever more, this isn't a problem.
That would probably leave some performance on the table, though.
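The textbook picture behind that, for what it's worth (coefficients vary by model, so treat these as shapes rather than exact numbers): an unrepeated distributed RC wire has a delay that grows with the square of its length, while optimal repeater insertion brings it back to linear in length, paid for in repeater area and power. Here r and c are the wire's resistance and capacitance per unit length, and R0 and C0 the repeater's drive resistance and input capacitance:

```latex
t_{\text{unrepeated}} \approx 0.4\, r\, c\, L^{2}
\qquad
t_{\text{repeated}} \approx 2\, L \sqrt{r\, c\, R_{0} C_{0}}
```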


Nearly 400GB/s of bandwidth from L2 to L1. Note that path is not via the hub.
Is the relationship between the L2s and L1s 1 port per cache, or 4 to 10?
 
With each process transition, the RC delay for signal lines is expected to remain approximately the same or get worse, save ever wonkier low-K process knobs that still only partially offset the general trend.
Even if die sizes stayed exactly the same, the apparent wire lengths would not.
Lowest-level interconnects scale better with the transistor layer, but the higher metal layers that go across wider areas of the die would switch slower without other design changes.

If we only want the exact same clock speeds or gradually lowering clock speeds forever more, this isn't a problem.
That would probably leave some performance on the table, though.
This may be relevant:

Integrated Circuit Chip With Repeater Flops and Method for Automated Design of Same

Is the relationship between the L2s and L1s 1 port per cache, or 4 to 10?
I don't know. I think an architect's going to need interrogating...

Jawed
 
I'll try to parse through this in depth. Apparently, the patent pipelines the various legs of a repeater chain.
Rather than giving up if the distance is too great to cover in a short period of time, the design allows a long-distance signal to continue through additional repeater stages.
The actual time per unit of distance doesn't really change, but the maximum distance is upped.

I'm not an EE, so I'm curious what differentiates this methodology from a drive stage in some of the longer pipelined CPUs.

I haven't yet gotten to any claims for what this means for power consumption, so I'm wary unless the repeater flops are notably different in their consumption patterns.
Repeaters are a significant portion of power consumption on another VLIW design, Itanium.
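A toy model of my reading of it, with invented numbers just to show the trade: the per-cycle reach is fixed by the clock, but flops in the chain let the total distance scale with latency instead of with the clock period.

```python
import math

REACH_PER_CYCLE_MM = 3.0  # assumed distance a repeatered wire can cover in one clock

def cycles_to_cross(distance_mm: float) -> int:
    """Latency, in clocks, of a repeater chain with a flop every REACH_PER_CYCLE_MM."""
    return math.ceil(distance_mm / REACH_PER_CYCLE_MM)

# Without flops, anything beyond ~3 mm simply fails timing at this clock;
# with them, a 10 mm run just costs 4 cycles of latency.
print(cycles_to_cross(10.0))  # 4
```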
 
R600's 512-bit bus. What do you guys think the reasoning was for it?
a: R600 was late
b: GDDR3 scaled to high rates & quickly compared to GDDR2
c: Early GDDR4 was available nearly synchronously with R600's late launch

I think if R600 had been on time & GDDR3 had topped out at a lower clock, 512-bit would have made much more sense.
 
a: R600 was late
b: GDDR3 scaled to high rates & quickly compared to GDDR2
c: Early GDDR4 was available nearly synchronously with R600's late launch

I think if R600 had been on time & GDDR3 had topped out at a lower clock, 512-bit would have made much more sense.

It still made little sense in the big picture though.

Reliance on the ring made "special" configs not really available, but I suppose GDDR3 should have already hit sub-1ns speeds back then (hit me if I'm wrong, since there's no real solid info to back this one).

That would have been enough bandwidth for R600 already. With some extra ALUs and TMUs in place of the doubled interface, the chip, although conservative, could have turned out a lot different.
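Back-of-envelope with the usual public numbers (sub-1ns GDDR3 means a clock of at least 1 GHz, i.e. 2 Gb/s per pin):

```python
def gb_per_s(bus_bits: int, gbps_per_pin: float) -> float:
    """Peak memory bandwidth in GB/s."""
    return bus_bits / 8 * gbps_per_pin

print(gb_per_s(256, 2.0))   #  64.0 GB/s - a 256-bit bus with sub-1ns GDDR3
print(gb_per_s(512, 1.65))  # 105.6 GB/s - HD 2900 XT's 512-bit bus as shipped
```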
 
Two little queries:

1) R600's 512-bit bus. What do you guys think the reasoning was for it? It seems like it added very little, if anything, to its performance after seeing 38x0. It's as if ATI engineers didn't realize they needed more fillrate, not more bandwidth.

I wonder if ATI is planning on using a 512-bit bus for their R700 dual die card with the possibility of having a shared memory space?
 
I wonder if ATI is planning on using a 512-bit bus for their R700 dual die card with the possibility of having a shared memory space?
We don't know much about R700, but it's a pretty safe guess it will use a 2x256-bit bus. It could be somewhat unified (that is, at least texture data might be shared).
 
Reliance on the ring made "special" configs not really available, but I suppose GDDR3 should have already hit sub-1ns speeds back then (hit me if I'm wrong, since there's no real solid info to back this one).
What "special" configs were not available? Not sure what this means.
 
The proof is in the pudding. Take a heterogeneous collection of shader workloads that are not ROP bound, and show me an RV770 beating a GT200 by close to the paper spec margin. On paper, it has a big theoretical advantage, but in the real world, it doesn't pan out. So either utilization is low, or they made a poor decision in spending too many trannies on ALUs and not enough on TMUs to balance out the demands of the workloads.
Okay, so we're finally getting some of the more interesting sites to publish custom tests:
[chart: xbitmark.gif]


No, we're not seeing the 4850 beat the 9800 GTX in all tests by the factor of 2.3 suggested by SP and clock specs, but remember that it has only 0.58x the texturing performance. Seeing that it doesn't lose even one test (lots of these are texturing heavy shaders) and exceeds the "paper spec margin" of victory in a few others, that's a pretty impressive show of efficiency against the previous king.
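For reference, the arithmetic behind those two ratios, from the public board specs:

```python
# "Paper spec margin": ALU count x shader clock, HD 4850 vs 9800 GTX.
print((800 * 625e6) / (128 * 1688e6))  # ~2.31x the peak ALU rate

# Texturing: TMU count x core clock.
print((40 * 625e6) / (64 * 675e6))     # ~0.58x the bilinear texel rate
```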

Unfortunately, we don't have GT200 results yet. NVidia probably didn't send XbitLabs a sample.

EDIT: Some more results: http://www.ixbt.com/video3/rv770-part2.shtml. Aside from two PS4.0 shaders (which are outliers, given the results of the PS3.0 equivalents), every pixel shader they throw at the GTX 280 and 4850 runs at similar speed. Is that heterogeneous enough?
 
a: R600 was late
b: GDDR3 scaled to high rates & quickly compared to GDDR2
c: Early GDDR4 was available nearly synchronously with R600's late launch

I think if R600 had been on time & GDDR3 had topped out at a lower clock, 512-bit would have made much more sense.

Errr? The GDDR4-based X1950 had been on the market for a good 9 months by the time R600 came out.
 
What "special" configs were not available? Not sure what this means.

MCs tied to ring stops being the problem (I think only the ALUs and TUs were fun and flexible to play around with; the MCs were tied to the ring overall)?

4*64/128 in R600/RV670 could have done it, but before they threw the ring bus away, a 384-bit interface would presumably have required more stops than optimal.
 