AMD: R7xx Speculation

Status
Not open for further replies.
So that would imply that the SIMDs are longer? Can you explain to me how exactly the texture units are tied into the SIMD arrays?
Yeah, each SIMD would gain 2 extra quads - imagine two extra on the bottom of each.

Now in all of the SIMDs it so happens that quad number X always talks to TU quad number X. So with 4 SIMDs, quad number 5 in each SIMD would talk to TU quad number 5.

So, with a 6-quad SIMD you get 6 TU quads. That's my theory.
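A minimal sketch of that theory, assuming only what is stated above (quad X in every SIMD talks to TU quad X). The function names are hypothetical and purely illustrative.

```python
def tu_quad_for(simd_index: int, quad_index: int) -> int:
    """Quad X of any SIMD is served by TU quad X; the SIMD index plays no part."""
    return quad_index


def tu_quad_count(simd_count: int, quads_per_simd: int) -> int:
    """Count the distinct TU quads needed to serve every ALU quad in every SIMD."""
    served = {
        tu_quad_for(simd, quad)
        for simd in range(simd_count)
        for quad in range(quads_per_simd)
    }
    return len(served)


# 4 SIMDs of 4 quads each need 4 TU quads; stretching each SIMD to 6 quads needs 6.
current_tu_quads = tu_quad_count(simd_count=4, quads_per_simd=4)
theory_tu_quads = tu_quad_count(simd_count=4, quads_per_simd=6)
print(current_tu_quads, theory_tu_quads)
```

Under this mapping the TU quad count follows the SIMD length, not the number of SIMDs.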

Jawed
 
I put the strongest likelihood on the RBE count - 16 is a number that doesn't need to increase (colour fillrates don't need to increase so much). But the MSAA-sample/zixel rate does need to increase, so I have my fingers crossed that there's 4x Z per clock, not RV670's 2x.

With 16 RBEs (4 quads) it seems unavoidable to me that there are 4 SIMDs.

As I've shown already the texturing performance doubles, because of the 50% increase in units and the 35% increase in clocks.
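As a quick check of that arithmetic, assuming texel rate scales linearly with both unit count and clock (the variable names are just for illustration):

```python
# Scaling factors quoted in the post: 50% more units, 35% higher clock.
unit_scale = 1.50
clock_scale = 1.35

# Assumption: texel rate is simply units x clock, so the factors multiply.
texel_rate_scale = unit_scale * clock_scale
print(f"texturing rate scale: {texel_rate_scale:.3f}x")
```

The product comes out just over 2x, which is where the "doubles" figure comes from.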

But hey, I've been bitten a few times by ATI's SIMD v TU configurations...

Jawed

Good point, but RV630 has 3 ALU SIMDs and only one RBE. The next possibility is that there are only "320 SPs", but running at a higher clock than the other units (we get ~1 TFLOP/s with 10*64*1600 MHz).
But we should not forget: are these specifications true or just a bad joke? (The "R650" stuff looked real, too.)
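The ~1 TFLOP/s figure can be sketched as follows. Reading 10*64 as 320 SPs times 2 FLOPs per clock (one multiply-add) is an assumption here, as is the helper name.

```python
def peak_gflops(sp_count: int, flops_per_sp_per_clock: int, clock_mhz: int) -> float:
    """Theoretical peak in GFLOP/s: SPs x FLOPs per SP per clock x clock (MHz) / 1000."""
    return sp_count * flops_per_sp_per_clock * clock_mhz / 1000.0


# Assumption: each SP issues one multiply-add (2 FLOPs) per clock,
# so 320 SPs give the 10*64 = 640 FLOPs per clock used in the post.
shader_peak = peak_gflops(sp_count=320, flops_per_sp_per_clock=2, clock_mhz=1600)
print(f"{shader_peak:.0f} GFLOP/s")
```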
 
Hmm, if the information is true I think the cards would look as follows:
rv770: 6 shader clusters with 16 of the vliw (vec5) units (in contrast to rv670 which has 4 clusters with 16 vliw units). The (8ta, 4tf) texture units are shared among all clusters like with rv670, but there are simply twice as many (or maybe they got downgraded to 4ta, 4tf).
So to make all chips fit:
rv770 = 6 clusters with 16 vec5 units, 8 quad-tu
rv740 = 4 clusters with 12 vec5 units, 6 quad-tu
rv710 = 2 clusters with 4 vec5 units, 2 quad-tu.
So maybe within a cluster you could still only get the results of one quad-tu (for simpler design), but two clusters could get texture sampling results at the same time.
The rv710 seems to have a very high tex:alu ratio (well even with the r6xx series rv615 had a twice as high tex:alu ratio compared to r600, but now it's 3 times higher, and compared to rv610 twice as high), almost overkill, and the rv740 ratio would be quite high too but all chips would have the same coupling of the simd arrays with texture units.
In contrast to "old" chips:
r600/rv670 = 4 clusters with 16 vec5 units, 4 quad-tu
rv630/rv635 = 3 clusters with 8 vec5 units, 2 quad-tu
rv615/rv620 = 2 clusters with 4 vec5 units, 1 quad-tu
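To make the ratios easier to compare, here is a small sketch that derives ALU:TEX from the cluster layouts listed above. It assumes a quad-tu counts as 4 texture units and a vec5 unit as one ALU; the table names are illustrative only, and the rv7xx rows are the rumoured configurations, not confirmed specs.

```python
from fractions import Fraction

# (clusters, vec5 units per cluster, quad-tu count), as listed in the post.
CHIPS = {
    "rv770 (rumour)": (6, 16, 8),
    "rv740 (rumour)": (4, 12, 6),
    "rv710 (rumour)": (2, 4, 2),
    "r600/rv670": (4, 16, 4),
    "rv630/rv635": (3, 8, 2),
    "rv615/rv620": (2, 4, 1),
}


def alu_tex_ratio(clusters: int, vec5_per_cluster: int, quad_tus: int) -> Fraction:
    """ALU:TEX as vec5 units per texture unit (one quad-tu = 4 texture units)."""
    alus = clusters * vec5_per_cluster
    texture_units = quad_tus * 4
    return Fraction(alus, texture_units)


RATIOS = {name: alu_tex_ratio(*layout) for name, layout in CHIPS.items()}

for name, ratio in RATIOS.items():
    print(f"{name:16s} ALU:TEX = {ratio}:1")
```

With these assumptions the rumoured rv770 lands at 3:1 against 4:1 for r600/rv670, and rv710 at 1:1 against 2:1 for rv615/rv620.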

It would be possible the tu's are oct-units instead, but I don't think it would make sense (it would mean all the rv7xx chips would have half the clusters as outlined above, but with twice as many vec5 units in them).
Well that's just how I'd design the chips given the number of tmus / shader units in that rumour :)
I don't mention ROPs because these are boring for speculation. Decoupled completely already in r6xx from both shader clusters and memory interface widths, any number seems plausible.
(btw I still don't quite understand why branch granularity is 64 pixels on r600 - I thought it should be 16, because the clusters don't have to run the same prog. Maybe I misunderstood something there, which would make the whole possible rv7xx configuration I've just written pointless...)
 
So to make all chips fit:
rv770 = 6 clusters with 16 vec5 units, 8 quad-tu
rv740 = 4 clusters with 12 vec5 units, 6 quad-tu
rv710 = 2 clusters with 4 vec5 units, 2 quad-tu.
So maybe within a cluster you could still only get the results of one quad-tu (for simpler design), but two clusters could get texture sampling results at the same time.
That looks quite neat. Though neatness in itself doesn't necessarily get you anywhere (argh).

The rv710 seems to have a very high tex:alu ratio (well even with the r6xx series rv615 had a twice as high tex:alu ratio compared to r600, but now it's 3 times higher, and compared to rv610 twice as high), almost overkill, and the rv740 ratio would be quite high too but all chips would have the same coupling of the simd arrays with texture units.
I see ALU:TEX as the biggest issue with that rumour and suspect that someone is back-projecting from the bilinear texel rate in a naive fashion. Specifically, the idea that ALU:TEX for RV770 would be less than RV670, i.e. 3:1 versus 4:1, strikes me as a significant flaw.

It's arguable that lower spec GPUs such as RV710 really need a lower ratio because users will be forced to rely on "low" shader quality settings, i.e. short shaders that will inherently have a few texture operations with minimal ALU instructions. Though if you compare the low end RV6xx and G8x GPUs, there doesn't appear to be any calling for this ratio to go even lower for ATI GPUs to remain competitive. So, RV710 as you propose looks extremely unlikely.

(btw I still don't quite understand why branch granularity is 64 pixels on r600 - I thought it should be 16, because the clusters don't have to run the same prog.)
While each SIMD is 16 wide, each instruction is issued for 4 successive clock cycles, which is how the batch size (thread size) becomes 64. Each SIMD is independent, as you've observed.
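The batch size arithmetic, as a tiny sketch (constant and function names are illustrative):

```python
SIMD_WIDTH = 16    # pixels processed per clock by one SIMD
ISSUE_CLOCKS = 4   # successive clocks over which each instruction is issued


def batch_size(simd_width: int, issue_clocks: int) -> int:
    """Pixels that share one instruction, which is also the branch granularity."""
    return simd_width * issue_clocks


r600_batch = batch_size(SIMD_WIDTH, ISSUE_CLOCKS)
print(r600_batch)
```

So the 64-pixel granularity comes from the issue cadence within one SIMD, not from the SIMDs being tied together.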

Jawed
 
3:1 ALU:TEX doesn't have to be a flaw. R580 was 3:1 (PS ALUs are used for pixel shading and FP16 filtering). R600 is 4:1 (ALUs are used for pixel shading, vertex shading and MSAA resolve; FP16 filtering was moved to the TUs). If RV770 supports hardwired resolve (even only for box filters, so some workload is moved away from the shader core) and the number of ALUs is boosted 1.5x, I don't think 3:1 (like RV630) is a bad idea. An additional 16 TUs will increase performance more than (let's say) 16-24 5D ALUs. If it's true that the RV770 design is conservative, then 3:1 could be quite a logical step. But 96 5D ALUs + 32 TFs + 16 ROPs (4 multi-samples/clock, fixed resolve) and a core clock near 1GHz seem like overly optimistic specs to me.
 
I think that these 32 TMUs aren't 8 sampler units (how many transistors would this cost?), but 4 sampler units with eight TFs each.
http://www.beyond3d.com/content/reviews/16/9
R600:
Each sampler unit can setup 8 addresses per clock, fetch 16 FP32 values for bilinear filtering, and 4 FP32 values for point sampling, all per clock, and then bilinearly filter at a rate of four INT8 or FP16 bilerps per cycle, from those fetched values.

RV770:
Each sampler unit can setup 8 addresses per clock, fetch 16 FP32 values for bilinear filtering, and 4 FP32 values for point sampling, all per clock, and then bilinearly filter at a rate of eight INT8 or FP16 bilerps per cycle, from those fetched values.
 
Also interesting:
AZX_DRIVER_ATIHDMI }, /* ATI RV770 HDMI */
+ { 0x1002, 0xaa38, PCI_ANY_ID, PCI_ANY_ID, 0, 0,
AZX_DRIVER_ATIHDMI }, /* ATI RV730 HDMI */
+ { 0x1002, 0xaa40, PCI_ANY_ID, PCI_ANY_ID, 0, 0,
AZX_DRIVER_ATIHDMI }, /* ATI RV710 HDMI */
+ { 0x1002, 0xaa48, PCI_ANY_ID, PCI_ANY_ID, 0, 0,
AZX_DRIVER_ATIHDMI }, /* ATI RV740 HDMI */
http://article.gmane.org/gmane.linux.alsa.devel/51536

RV730? 6 ALU-quads, 2+2 texture-quads, 1 RBE?
 
Interesting. Absolutely pathetic if true, but interesting nonetheless. How long is ATi going to drag on this 16 ROP B.S.? The doubling of TMUs was needed at least a generation ago, and the 50% increase in SPs is just meh. Honestly, this sounds like what RV670 (and even R600) should've been.

It looks pretty good to me. Especially if you consider that it's only 2 generations ahead of Xenos and yet the X2 variant offers close to 10x its raw shader power :oops:

Seems a bit light on the memory bandwidth though.
 
If RV770 supports hardwired resolve (even only for box filters, so some workload is moved away from the shader core) and the number of ALUs is boosted 1.5x, I don't think 3:1 (like RV630) is a bad idea.
Shader AA resolve isn't costly, it's staying. RV770 needs higher Z fillrate. It also needs more texturing performance, but I think a doubling there (in performance, not unit count) will be all that's required.

Jawed
 
Seems a bit light on the memory bandwidth though.
1800MHz GDDR5 will provide 115.2GB/s, 60% more than RV670. So I suppose it's a question of how much bandwidth RV670 is wasting (quite a lot when 72GB/s is compared against 8800GTS 512's 62.1 GB/s) and how close to 2x the performance of RV670 they're aiming for.
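A rough sketch of how those bandwidth figures fall out. The post gives only the 1800MHz clock and the resulting GB/s numbers; the 256-bit bus width, the two transfers per quoted clock, and the 1125MHz (RV670) and 970MHz (8800GTS 512) memory clocks are assumptions chosen here because they reproduce the quoted figures.

```python
def bandwidth_gb_s(memory_clock_mhz: float, transfers_per_clock: int, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = memory_clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bytes_per_transfer / 1e9


# Assumptions: 256-bit bus on all three parts, 2 transfers per quoted clock.
rv770_rumour = bandwidth_gb_s(1800, 2, 256)   # rumoured GDDR5 clock from the post
rv670 = bandwidth_gb_s(1125, 2, 256)          # assumed RV670 memory clock
g92_gts512 = bandwidth_gb_s(970, 2, 256)      # assumed 8800GTS 512 memory clock

# If 1800MHz were already the effective (per-transfer) rate, the result halves.
rv770_if_effective = bandwidth_gb_s(1800, 1, 256)

print(f"RV770 (rumour): {rv770_rumour:.1f} GB/s, {rv770_rumour / rv670 - 1:.0%} more than RV670")
print(f"RV670: {rv670:.1f} GB/s, 8800GTS 512: {g92_gts512:.1f} GB/s")
print(f"RV770 if 1800MHz is effective: {rv770_if_effective:.1f} GB/s")
```

The last line shows why it matters whether 1800MHz is the base or the effective rate: read as effective, the same bus would deliver less than RV670 does.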

Jawed
 
The question is, what do they mean by "32 TMUs".
R600 & RV670 already have 32 Texture [strike]Samplers[/strike] "Mapping Units", after their own fashion. Personally, I'm not seeing AMD deviating from its already pretty solid route of "ALUs, ALUs, ALUs".

And since R'v'770 carries the 'V' in it, we might still not see a single high end chip yet again.
 
Lots of stuff about performance but what about power savings?

Isn't the rv7 series supposed to have a lot more than what rv670/635/620 added over r600?
 
1800MHz GDDR5 will provide 115.2GB/s, 60% more than RV670. So I suppose it's a question of how much bandwidth RV670 is wasting (quite a lot when 72GB/s is compared against 8800GTS 512's 62.1 GB/s) and how close to 2x the performance of RV670 they're aiming for.

Jawed

Is that 115.2GB/s per GPU? If so, then that's double what I had thought, as I assumed the 1800MHz meant 1800MHz effective. If it's actually doubled then yeah, that seems like plenty.
 
Do you think that by the time of the r800 family it will be a unified memory architecture for the GPUs on a card?
I suspect the architecture is already capable of unifying memory across multiple chips, but we aren't going to find out for sure for a long time. The patent applications appear to put everything in place...

Jawed
 
I think that these 32 TMUs aren't 8 sampler units (how many transistors would this cost?), but 4 sampler units with eight TFs each.
Well I don't know the transistor count but it's certainly not THAT excessive - remember G92 can already do 64ta and 64tf in total, which would still be twice as much texture filtering capacity (per clock) as the proposed rv770 with 8 rv670-style quad units could do (and the same ta capability).
Each sampler unit can setup 8 addresses per clock, fetch 16 FP32 values for bilinear filtering, and 4 FP32 values for point sampling, all per clock, and then bilinearly filter at a rate of eight INT8 or FP16 bilerps per cycle, from those fetched values.
This idea goes in the same direction as what Jawed proposed, but I think you'd need to be able to fetch more values per clock too.
Personally though I favor 8 simpler units which could only set up 4 addresses per clock and fetch 16 FP32 values (either for bilinear filtering or point sampling). This gives the same final ta and tf capability, but with this solution two shader clusters could do texturing at the same time without having to wait; otherwise you'd still always need to do two simultaneous texture lookups in a single thread for maximum efficiency, which seems very odd. Certainly I could be wrong, and as said, even just doubling the number of texture units does not really look impossible if you compare that to what the competition can do even now (though G92 certainly has excessive ta/tf capability).
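To compare the options being discussed on a per-clock basis, here is a hedged sketch. The per-unit address and bilerp rates come from the posts above, except the 4 bilerps per clock for the "simpler" unit, which is an assumption (the post only says the final tf capability is the same). The class and variable names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplerConfig:
    units: int               # number of sampler units on the chip
    addresses_per_unit: int  # texture addresses set up per unit per clock
    bilerps_per_unit: int    # bilinear filter results per unit per clock

    @property
    def ta(self) -> int:
        """Total texture addresses per clock."""
        return self.units * self.addresses_per_unit

    @property
    def tf(self) -> int:
        """Total bilinear filter results per clock."""
        return self.units * self.bilerps_per_unit


# 8 rv670-style units: 8 addresses and 4 bilerps each.
doubled_rv670_style = SamplerConfig(units=8, addresses_per_unit=8, bilerps_per_unit=4)
# 4 units with filtering doubled to 8 bilerps each.
four_wide_filter = SamplerConfig(units=4, addresses_per_unit=8, bilerps_per_unit=8)
# 8 simpler units: 4 addresses each; 4 bilerps each is an assumption.
eight_simple = SamplerConfig(units=8, addresses_per_unit=4, bilerps_per_unit=4)

# G92 totals as quoted in the post.
G92_TA, G92_TF = 64, 64

for name, cfg in [
    ("8 x rv670-style", doubled_rv670_style),
    ("4 x wide-filter", four_wide_filter),
    ("8 x simple", eight_simple),
]:
    print(f"{name:16s} ta={cfg.ta:2d} tf={cfg.tf:2d}")
```

On these numbers the last two layouts tie on totals, so the difference is only in how many clusters can be served at once, which is the point argued above.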
 
Some more random thoughts on these rumours:
- What's up with those high clocks (on rv770 at least)? We saw 0 clock increase going from R600 to RV670 (which involved a die shrink), and now suddenly a ~25% clock increase is possible with the same process technology (and a supposedly still similar architecture)?
- Why would AMD go back to a 128bit memory interface for the low-end part? It doesn't really look faster than the predecessor (even if those 8 texture units are true, it's not going to be much faster, as it already had quite a balanced tex/alu ratio).
- Performance: rv740 with fewer shader units but more texture units could achieve per-clock performance comparable to rv670 in quite a lot of real-world tests IMHO, but it would probably also have an only slightly smaller die. Should be a very decent midrange part (at the proposed clocks). Obviously, rv770 is bound to beat all g92-based parts easily according to these rumours, with a die size which I'd guesstimate at slightly smaller than G92 (but of course it would only be smaller because of the 55nm advantage).
 