AMD: R9xx Speculation

Would it make sense to release a part which would have very similar die size and very similar performance to RV870? That wouldn't be any more interesting for customers than the current line-up.

200mm² on 32nm seems to be a bit much for a 128-bit GPU (a 128-bit bus on an HD5870-class product would cripple performance) and too little for a 256-bit GDDR5 + E4 GPU. I presume that the original plan was a mainstream 32nm 128-bit GPU, so 150-180mm², so I'd expect a 40nm RV770-sized product...

Also notice that Barts should support Eyefinity 4, not Eyefinity 6. It could indicate a smaller die, which isn't large enough to fit pads for a 256-bit GDDR5 interface PLUS all the pads for E6.

btw. did you mean 80 TMUs?
 
<cough> but I have no source for that, just seems like a good way to get some increased utilisation (cut out some less well utilised hardware) without too much work.
 
If they do go 4-wide that would be funny. AMD goes narrower while Nvidia goes wider. Guess at some point they'll meet in the middle. The thing is, I have no idea what drives AMD. Nvidia is pretty vocal about the features they care about so it's a little bit easier to predict their moves. AMD, on the other hand, could surprise in a lot of ways just like they did with Eyefinity.
 
Looking at CPUs, I'd guess dual issue would be the sweet spot.
I would expect that for GPU workloads it's higher than for CPUs. Maybe 4 is already the sweet spot, especially if you have the ability to do dependent operations in one VLIW instruction like Evergreen and NI. For simple operations (multiplications, additions, and some others) this effectively reduces it to dual issue in some common cases.
 
I would expect that for GPU workloads it's higher than for CPUs. Maybe 4 is already the sweet spot, especially if you have the ability to do dependent operations in one VLIW instruction like Evergreen and NI. For simple operations (multiplications, additions, and some others) this effectively reduces it to dual issue in some common cases.

For graphics, 4-wide VLIW makes sense. But GPU workloads aren't only about graphics anymore.
 
For graphics, 4-wide VLIW makes sense. But GPU workloads aren't only about graphics anymore.
But an inherently parallel workload (which is a must either way) also means that it can be easily vectorized in most cases.

Going quad-issue with the ability to do a "dual dual-issue" for some instructions covers quite a few bases already, especially if ATI's architecture enables them to get roughly twice the theoretical peak performance (I'm talking arithmetic now) of the competition in the same die size.
 
But an inherently parallel workload (which is a must either way) also means that it can be easily vectorized in most cases.

True. Stuff vectorizes just fine on G80 and GT200, which are strictly single issue. :smile: VLIW is orthogonal to both parallelization and vectorization, really.
 
I think you are getting many years ahead of yourself there.

If that's the case why aren't we seeing a massive advantage for AMD's wide ALUs in graphics workloads?

But an inherently parallel workload (which is a must either way) also means that it can be easily vectorized in most cases.

Going quad-issue with the ability to do a "dual dual-issue" for some instructions covers quite a few bases already, especially if ATI's architecture enables them to get roughly twice the theoretical peak performance (I'm talking arithmetic now) of the competition in the same die size.

GPU hyperthreading?
 
Would it make sense to release a part which would have very similar die size and very similar performance to RV870? That wouldn't be any more interesting for customers than the current line-up.
I reckon HD6770 will be <=$150. Look at how close HD4770 gets to HD4850 - it's faster than HD4830. HD5830 is about 20% faster than HD5770. So we're expecting Barts to at least overlap the bottom end of Cypress.

HD4770 is much more than 20% faster than HD4670.

In theory the 32nm chip was defined as:
  • add in the stuff that was chopped out of Evergreen - 20-25% increase in die size?
  • then add another, say, 50% performance (60% more ALUs, 100% more cores + TMUs) - which for ATI often means about the same amount of die space
That's about 300mm² on 40nm, with 32nm shrink taking it to ~200mm².
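Back-of-the-envelope, that works out (the 166mm² Juniper baseline is my assumption for the 40nm starting point; only the percentages and the ~300/~200mm² results are in the estimate above):

```python
# Rough sanity check of the die-size estimate above. The 166mm^2 Juniper
# baseline is an assumption, not an official figure.
juniper = 166.0                    # assumed 40nm starting point, mm^2
restored = juniper * 1.25          # add back the bits chopped out of Evergreen
bigger = restored * 1.5            # +50% performance ~= +50% die area for ATI
shrink = (32.0 / 40.0) ** 2        # ideal area scaling, 40nm -> 32nm (0.64x)
print(round(bigger), round(bigger * shrink))  # ~311mm^2 on 40nm, ~199mm^2 on 32nm
```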

Then 32nm got cancelled.

I admit the 256-bit bus is a tricky point here, since 200mm² is a bit marginal. Faster memory (what's theoretically available, anyway), on its own, isn't enough to make Juniper that fast. Arguably stuff should change internally to make it more efficient (I think NVidia has the edge on memory efficiency now). So perhaps faster memory and greater efficiency together could do it...

Also it could be argued that Barts has a doubling of ROPs per MC. Like RV740... Someone might have seen that in a rumour and drawn up a roadmap that translated the 32 ROP count into 256 bits of bus :p

200mm² on 32nm seems to be a bit much for a 128-bit GPU (a 128-bit bus on an HD5870-class product would cripple performance) and too little for a 256-bit GDDR5 + E4 GPU. I presume that the original plan was a mainstream 32nm 128-bit GPU, so 150-180mm², so I'd expect a 40nm RV770-sized product...
I reckon all the video outputs' physical I/O on RV770 totals 2.2mm². Not sure how to scale that up for Evergreen or SI, once Eyefinity is taken into account. GDDR5 on RV770 takes about 40mm of perimeter (excluding loss to corners - 1mm of depth). A 200mm² die has about 57mm of perimeter, but 2mm is lost to each corner where GDDR5 takes both sides of the corner.

(I reckon the display logic/UVD amounts to 17mm² on RV770. Should be a lot smaller than that on Evergreen/SI.)
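The edge-budget arithmetic checks out for a square die (the 40mm GDDR5 figure and the ~2mm-per-corner loss are the numbers from above; everything else follows from the area):

```python
import math

# Edge budget of a square 200mm^2 die, using the figures quoted above.
area = 200.0
perimeter = 4 * math.sqrt(area)    # ~56.6mm, i.e. "about 57mm"
gddr5 = 40.0                       # mm of edge for the GDDR5 pads (RV770 figure)
corner_loss = 4 * 2.0              # ~2mm lost at each of the four corners
remaining = perimeter - gddr5 - corner_loss
print(round(perimeter, 1), round(remaining, 1))  # 56.6mm total, ~8.6mm left over
```

Not much edge left over for extra display pads on a die that size, which would fit the E4-not-E6 speculation.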

Also notice that Barts should support Eyefinity 4, not Eyefinity 6. It could indicate a smaller die, which isn't large enough to fit pads for a 256-bit GDDR5 interface PLUS all the pads for E6.
It might merely reflect AMD's intention that 2GB per GPU (which is "needed" for E6) will be reserved for Cayman.

I haven't a clue what the difference is between E4 and E6 in terms of I/O.
 
True. Stuff vectorizes just fine on G80 and GT200, which are strictly single issue. :smile: VLIW is orthogonal to both parallelization and vectorization, really.
No, it's not orthogonal, it's a 45° angle, i.e. you open up a new dimension, but when changing your position in the new dimension you still move in the plane of the other dimensions ;)
Or to put it differently, VLIW is a method to exploit instruction-level parallelism. But the incarnation in GPUs can often also be used, with some developer effort, to combine several work items into one, i.e. converting (some of) the thread-level parallelism (GPU-speak threads, so effectively data-level parallelism normally handled by SIMD) into instruction-level parallelism (which is handled by VLIW).

You can use the VLIW units in a virtual SIMD fashion. They can do more of course, but let's forget these additional capabilities for a moment. What then happens is that you increase the effective vector length (to four times the size of a warp or wavefront with the use of float4, for instance). Yes, it's more effort on the developer's side, but it often works very nicely.
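A toy sketch of that packing idea in plain Python (the kernel and the counts are purely illustrative, not real GPU code): each work item handles a float4-style 4-element chunk, so a 64-wide wavefront effectively covers 256 elements per issued instruction.

```python
WAVEFRONT = 64   # work items per wavefront
VLIW = 4         # lanes per VLIW unit, used here as a "virtual SIMD" width

def scalar_kernel(x):
    # one element per work item
    return x * 2.0 + 1.0

def packed_kernel(chunk):
    # same op applied across a float4-style chunk - one VLIW bundle's worth
    return [scalar_kernel(x) for x in chunk]

data = [float(i) for i in range(256)]

scalar_issues = len(data) // WAVEFRONT           # 256/64 = 4 wavefront issues
packed_issues = len(data) // (WAVEFRONT * VLIW)  # 256/256 = 1 wavefront issue

# both variants compute the same result; only the issue count differs
packed = [y for i in range(0, len(data), VLIW)
            for y in packed_kernel(data[i:i + VLIW])]
assert packed == [scalar_kernel(x) for x in data]
print(scalar_issues, packed_issues)  # 4 1
```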
 
If that's the case why aren't we seeing a massive advantage for AMD's wide ALUs in graphics workloads?
There is more to graphics workloads than just the arithmetic. And in some situations it does pay off.
GPU hyperthreading?
No, the execution of two pairs of dependent operations in one instruction (instead of 4 or 5 independent ones), which has been possible since Evergreen for some operations. That means the latency for some instruction sequences is cut in half.
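A quick illustration of why that halves the latency of serial chains (the bundle counts are toy values for the idea, not measured cycles):

```python
# A serial chain of N dependent mul/add operations on a VLIW4-style unit.
# Plain VLIW: each dependent op must wait for the previous bundle.
# Evergreen-style pairing: a dependent mul->add pair fits in one bundle.
N = 8
plain_bundles = N          # one dependent op per issued bundle
paired_bundles = N // 2    # two dependent ops fused per bundle
print(plain_bundles, paired_bundles)  # 8 4
```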
 
No, it's not orthogonal, it's a 45° angle, i.e. you open up a new dimension, but when changing your position in the new dimension you still move in the plane of the other dimensions ;)

You can use the VLIW units in a virtual SIMD fashion.

It works quite well, as long as there is no divergence amongst these 4 "vector lanes". Which is why I'm not particularly fond of using VLIW as a substitute for vectorization, unless you are doing something like Black-Scholes. :D
 
I would expect that for GPU workloads it's higher than for CPUs. Maybe 4 is already the sweet spot, especially if you have the ability to do dependent operations in one VLIW instruction like Evergreen and NI. For simple operations (multiplications, additions, and some others) this effectively reduces it to dual issue in some common cases.
And if T is deleted it gets even better, as transcendentals (or float<->int conversions) often cause a serial dependency. I'd assume that with T deleted these would occupy multiple lanes: transcendentals as I've described before, and float->(u)int would occupy two lanes (as FLT32_TO_FLT64 does currently).

Yes, I know, deletion of T is unlikely. But one can dream, huh?
 