AMD: R7xx Speculation

Jawed · Feb 15, 2008

Sound_Card said:
So that would imply that the SIMDs are longer? Can you explain to me how exactly the texture units are tied into the SIMD arrays?

Yeah, each SIMD would gain 2 extra quads - imagine two extra on the bottom of each.

Now in all of the SIMDs it so happens that quad number X always talks to TU quad number X. So with 4 SIMDs, quad number 5 in each SIMD would talk to TU quad number 5.

So, with a 6-quad SIMD you get 6 TU quads. That's my theory.

Jawed

Arnold Beckenbauer · Feb 15, 2008

Jawed said:
I put the strongest likelihood on the RBE count - 16 is a number that doesn't need to increase (colour fillrates don't need to increase so much). But the MSAA-sample/zixel rate does need to increase, so I have my fingers crossed that there's 4x Z per clock, not RV670's 2x.

With 16 RBEs (4 quads) it seems unavoidable to me that there are 4 SIMDs.

As I've shown already the texturing performance doubles, because of the 50% increase in units and the 35% increase in clocks.

But hey, I've been bitten a few times by ATI's SIMD v TU configurations...

Jawed

Good point. But the RV630 has 3 ALU-SIMDs but one RBE. Another possibility is that there are only "320 SPs", but running at a higher clock than the other units (we get ~1 TFLOP/s with 10*64*1600 MHz).
But we should not forget to ask: are these specifications true or just a bad joke (the "R650" stuff looked real, too)?
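
A quick check of that ~1 TFLOP/s figure, as a minimal Python sketch. It reads 10*64*1600 MHz as 64 vec5 units, each doing 5 multiply-adds (10 flops) per clock at a 1600 MHz shader clock. That reading, and the function name, are assumptions for illustration and not something the rumour spells out.

```python
def peak_gflops(vec5_units: int, shader_clock_mhz: float,
                lanes_per_unit: int = 5, flops_per_lane: int = 2) -> float:
    """Peak rate in GFLOP/s, counting one multiply-add as 2 flops (assumption)."""
    flops_per_clock = vec5_units * lanes_per_unit * flops_per_lane
    return flops_per_clock * shader_clock_mhz / 1000.0


# 64 vec5 units = "320 SPs"; 1600 MHz is the hypothetical shader clock from the post.
print(peak_gflops(64, 1600))  # 1024.0
```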

Jawed · Feb 15, 2008

Arnold Beckenbauer said:
Good point. But the RV630 has 3 ALU-SIMDs but one RBE.

Yeah, I've been pondering that but 6 SIMD quads feeding 4 RBE quads needs to go a step further for symmetry. Something like:

RBE W + X - SIMD A + B + C
RBE Y + Z - SIMD D + E + F

Jawed

mczak · Feb 16, 2008

Hmm, if the information is true I think the cards would look as follows:
rv770: 6 shader clusters with 16 of the vliw (vec5) units (in contrast to rv670 which has 4 clusters with 16 vliw units). The (8ta, 4tf) texture units are shared among all clusters like with rv670, but there are simply twice as many (or maybe they got downgraded to 4ta, 4tf).
So to make all chips fit:
rv770 = 6 clusters with 16 vec5 units, 8 quad-tu
rv740 = 4 clusters with 12 vec5 units, 6 quad-tu
rv710 = 2 clusters with 4 vec5 units, 2 quad-tu.
So maybe within a cluster you could still only get the results of one quad-tu (for simpler design), but two clusters could get texture sampling results at the same time.
The rv710 seems to have a very high tex:alu ratio (well even with the r6xx series rv615 had a twice as high tex:alu ratio compared to r600, but now it's 3 times higher, and compared to rv610 twice as high), almost overkill, and the rv740 ratio would be quite high too but all chips would have the same coupling of the simd arrays with texture units.
In contrast to "old" chips:
r600/rv670 = 4 clusters with 16 vec5 units, 4 quad-tu
rv630/rv635 = 3 clusters with 8 vec5 units, 2 quad-tu
rv615/rv620 = 2 clusters with 4 vec5 units, 1 quad-tu
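
To make the ratios in these lists easier to compare, here is a rough Python sketch that turns each configuration into an ALU:TEX ratio. It assumes the vec5 count is per cluster and that a quad-TU counts as 4 texture units; the dictionary and function names are made up for illustration.

```python
from fractions import Fraction

# (clusters, vec5 units per cluster, quad texture units), as listed in the post.
CONFIGS = {
    "rv770 (rumour)": (6, 16, 8),
    "rv740 (rumour)": (4, 12, 6),
    "rv710 (rumour)": (2, 4, 2),
    "r600/rv670": (4, 16, 4),
    "rv630/rv635": (3, 8, 2),
    "rv615/rv620": (2, 4, 1),
}


def alu_tex_ratio(clusters: int, vec5_per_cluster: int, quad_tus: int) -> Fraction:
    """vec5 ALUs per texture unit, counting 4 texture units per quad-TU."""
    return Fraction(clusters * vec5_per_cluster, quad_tus * 4)


for name, cfg in CONFIGS.items():
    print(f"{name}: {alu_tex_ratio(*cfg)}:1")
```

Under those assumptions the rumoured parts come out at 3:1, 2:1 and 1:1, against 4:1, 3:1 and 2:1 for the r6xx parts.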

It's possible the TUs are oct-units instead, but I don't think that would make sense (it would mean all the rv7xx chips have half as many clusters as outlined above, each with twice as many vec5 units).
Well, that's just how I'd design the chips given the number of TMUs / shader units in that rumour.

I don't mention ROPs because these are boring for speculation. They were already completely decoupled in r6xx from both the shader clusters and the memory interface width, so any number seems plausible.
(btw I still don't quite understand why branch granularity is 64 pixels on r600 - I thought it should be 16, because the clusters don't have to run the same prog. Maybe I misunderstood something there, which would make the whole possible rv7xx configuration I've just written pointless...)

Jawed · Feb 16, 2008

mczak said:
So to make all chips fit:
rv770 = 6 clusters with 16 vec5 units, 8 quad-tu
rv740 = 4 clusters with 12 vec5 units, 6 quad-tu
rv710 = 2 clusters with 4 vec5 units, 2 quad-tu.
So maybe within a cluster you could still only get the results of one quad-tu (for simpler design), but two clusters could get texture sampling results at the same time.

That looks quite neat. Though neatness in itself doesn't necessarily get you anywhere (argh).

The rv710 seems to have a very high tex:alu ratio (well even with the r6xx series rv615 had a twice as high tex:alu ratio compared to r600, but now it's 3 times higher, and compared to rv610 twice as high), almost overkill, and the rv740 ratio would be quite high too but all chips would have the same coupling of the simd arrays with texture units.

I see ALU:TEX as the biggest issue with that rumour and suspect that someone is back-projecting from the bilinear texel rate in a naive fashion. Specifically, the idea that ALU:TEX for RV770 would be less than RV670, i.e. 3:1 versus 4:1, strikes me as a significant flaw.

It's arguable that lower spec GPUs such as RV710 really need a lower ratio because users will be forced to rely on "low" shader quality settings, i.e. short shaders that will inherently have a few texture operations with minimal ALU instructions. Though if you compare the low end RV6xx and G8x GPUs, there doesn't appear to be any calling for this ratio to go even lower for ATI GPUs to remain competitive. So, RV710 as you propose looks extremely unlikely.

(btw I still don't quite understand why branch granularity is 64 pixels on r600 - I thought it should be 16, because the clusters don't have to run the same prog.)

While each SIMD is 16 wide, each instruction is issued for 4 successive clock cycles, which is how the batch size (thread size) becomes 64. Each SIMD is independent, as you've observed.
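
As a rough illustration of why that matters for branching, here is a small Python sketch. The 16-wide, 4-clock figures come from the explanation above; the divergence model (a batch pays for both sides of a branch if its pixels disagree) is the generic SIMD picture and is an assumption here, not a description of R600's actual scheduler.

```python
from typing import Sequence


def batch_size(simd_width: int = 16, issue_cycles: int = 4) -> int:
    """Pixels covered by one instruction issued over successive clocks."""
    return simd_width * issue_cycles


def pixels_in_divergent_batches(taken: Sequence[bool], batch: int) -> int:
    """Count pixels that sit in batches where the branch outcome is mixed."""
    total = 0
    for start in range(0, len(taken), batch):
        chunk = taken[start:start + batch]
        if any(chunk) and not all(chunk):
            total += len(chunk)
    return total


# Hypothetical example: 128 pixels, only pixels 0 and 40 take the branch.
taken = [i in (0, 40) for i in range(128)]
print(batch_size())                                        # 64
print(pixels_in_divergent_batches(taken, batch_size()))    # 64
print(pixels_in_divergent_batches(taken, 16))              # 32
```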

Jawed

no-X · Feb 16, 2008

A 3:1 ALU:TEX ratio doesn't have to be a flaw. R580 was 3:1 (PS ALUs are used for pixel shading and FP16 filtering). R600 is 4:1 (ALUs are used for pixel shading, vertex shading and MSAA resolve; FP16 filtering was moved to the TUs). If RV770 supports hardwired resolve (even only for box filters, so some workload is moved away from the shader core) and the number of ALUs is boosted 1.5x, I don't think 3:1 (like RV630) is a bad idea. 16 additional TUs will increase performance more than (let's say) 16-24 5D ALUs. If it's true that the RV770 design is conservative, then 3:1 could be quite a logical step. But 96 5D ALUs + 32 TFs + 16 ROPs (4 multi-samples/clock, fixed resolve) and a near-1GHz core clock seem overly optimistic specs to me.

Arnold Beckenbauer · Feb 16, 2008

I think that these 32 TMUs aren't 8 sampler units (how many transistors would this cost?), but 4 sampler units with eight TFs.
http://www.beyond3d.com/content/reviews/16/9
R600:

Each sampler unit can setup 8 addresses per clock, fetch 16 FP32 values for bilinear filtering, and 4 FP32 values for point sampling, all per clock, and then bilinearly filter at a rate of four INT8 or FP16 bilerps per cycle, from those fetched values.

RV770:

Each sampler unit can setup 8 addresses per clock, fetch 16 FP32 values for bilinear filtering, and 4 FP32 values for point sampling, all per clock, and then bilinearly filter at a rate of eight INT8 or FP16 bilerps per cycle, from those fetched values.

AnarchX · Feb 16, 2008

Also interesting:

AZX_DRIVER_ATIHDMI }, /* ATI RV770 HDMI */
+ { 0x1002, 0xaa38, PCI_ANY_ID, PCI_ANY_ID, 0, 0,
AZX_DRIVER_ATIHDMI }, /* ATI RV730 HDMI */
+ { 0x1002, 0xaa40, PCI_ANY_ID, PCI_ANY_ID, 0, 0,
AZX_DRIVER_ATIHDMI }, /* ATI RV710 HDMI */
+ { 0x1002, 0xaa48, PCI_ANY_ID, PCI_ANY_ID, 0, 0,
AZX_DRIVER_ATIHDMI }, /* ATI RV740 HDMI */

http://article.gmane.org/gmane.linux.alsa.devel/51536

RV730? 6 ALU-quads, 2+2 texture-quads, 1 RBE?

pjbliverpool · Feb 16, 2008

ShaidarHaran said:
Interesting. Absolutely pathetic if true, but interesting nonetheless. How long is ATi going to drag on this 16 ROP B.S.? The doubling of TMUs was needed at least a generation ago, and the 50% increase in SPs is just meh. Honestly, this sounds like what RV670 (and even R600) should've been.

It looks pretty good to me. Especially if you consider that it's only 2 generations ahead of Xenos and yet the X2 variant offers close to 10x its raw shader power.

Seems a bit light on the memory bandwidth though.

Jawed · Feb 16, 2008

no-X said:
If RV770 supports hardwired resolve (even only for box filters, so some workload is moved away from the shader core) and the number of ALUs is boosted 1.5x, I don't think 3:1 (like RV630) is a bad idea.

Shader AA resolve isn't costly, it's staying. RV770 needs higher Z fillrate. It also needs more texturing performance, but I think a doubling there (in performance, not unit count) will be all that's required.
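
For reference, the "doubling in performance, not unit count" works out from the 50% increase in units and 35% increase in clocks mentioned earlier in the thread. A minimal Python sketch, assuming throughput simply scales with units times clock (the function name is made up):

```python
def relative_throughput(unit_scale: float, clock_scale: float) -> float:
    """Overall scaling if throughput is proportional to units x clock."""
    return unit_scale * clock_scale


# 1.5x the units at 1.35x the clock, per the earlier post.
print(round(relative_throughput(1.5, 1.35), 3))  # 2.025
```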

Jawed

Jawed · Feb 16, 2008

pjbliverpool said:
Seems a bit light on the memory bandwidth though.

1800MHz GDDR5 will provide 115.2GB/s, 60% more than RV670. So I suppose it's a question of how much bandwidth RV670 is wasting (quite a lot when 72GB/s is compared against 8800GTS 512's 62.1 GB/s) and how close to 2x the performance of RV670 they're aiming for.

Jawed

CarstenS · Feb 16, 2008

Arnold Beckenbauer said:
The question is, what do they mean by "32 TMUs".

R600 & RV670 already have 32 Texture [strike]Samplers[/strike] "Mapping Units", after their own fashion. Personally, I'm not seeing AMD deviating from its already pretty solid route of "ALUs, ALUs, ALUs".

And since R'v'770 carries the 'V' in it, we might yet again not see a single high-end chip.

wishiknew · Feb 16, 2008

Lots of stuff about performance but what about power savings?

Isn't the rv7 series supposed to have a lot more than what rv670/635/620 added over r600?

pjbliverpool · Feb 16, 2008

Jawed said:
1800MHz GDDR5 will provide 115.2GB/s, 60% more than RV670. So I suppose it's a question of how much bandwidth RV670 is wasting (quite a lot when 72GB/s is compared against 8800GTS 512's 62.1 GB/s) and how close to 2x the performance of RV670 they're aiming for.

Jawed

Is that 115.2GB/s per GPU? If so then that's double what I had thought, as I assumed the 1800MHz meant 1800MHz effective. If it's actually doubled then yeah, that seems like plenty.

Jawed · Feb 16, 2008

pjbliverpool said:
Is that 115.2GB/s per GPU?

If it wasn't that'd be an awful waste of GDDR5 - I'd be left assuming that GDDR5 was there solely for power saving, not performance.

So, I'm interpreting it as the base clock not the effective rate.
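
The two readings of "1800MHz" can be compared with a short Python sketch. It assumes a 256-bit memory bus (not stated in the rumour as quoted here) and two transfers per base clock, which is what the 115.2GB/s figure implies; the function name is made up for illustration.

```python
def bandwidth_gb_s(transfers_mt_s: float, bus_width_bits: int = 256) -> float:
    """Peak bandwidth in GB/s from transfer rate (MT/s) and bus width (bits)."""
    return transfers_mt_s * bus_width_bits / 8 / 1000


quoted_mhz = 1800
as_effective = bandwidth_gb_s(quoted_mhz)       # 1800 MHz read as the effective rate
as_base = bandwidth_gb_s(quoted_mhz * 2)        # 1800 MHz read as the base clock

print(as_effective)               # 57.6
print(as_base)                    # 115.2
print(round(as_base / 72.0, 2))   # 1.6, i.e. 60% more than RV670's 72GB/s
```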

Jawed

kyetech · Feb 16, 2008

Do you think that by the time of the r800 family it will be a unified memory architecture for the GPUs on a card?

Kaotik · Feb 16, 2008

wishiknew said:
Lots of stuff about performance but what about power savings?

Isn't the rv7 series supposed to have a lot more than what rv670/635/620 added over r600?

I think the site said "under 10W idle"?

Jawed · Feb 17, 2008

kyetech said:
Do you think that by the time of the r800 family it will be a unified memory architecture for the GPUs on a card?

I suspect the architecture is already capable of unifying memory across multiple chips, but we aren't going to find out for sure for a long time. The patent applications appear to put everything in place...

Jawed

mczak · Feb 17, 2008

Arnold Beckenbauer said:
I think that these 32 TMUs aren't 8 sampler units (how many transistors would this cost?), but 4 sampler units with eight TFs.

Well I don't know the transistor count but it's certainly not THAT excessive - remember G92 can already do 64ta and 64tf in total, which would still be twice as much texture filtering capacity (per clock) as the proposed rv770 with 8 rv670-style quad units could do (and the same ta capability).

Each sampler unit can setup 8 addresses per clock, fetch 16 FP32 values for bilinear filtering, and 4 FP32 values for point sampling, all per clock, and then bilinearly filter at a rate of eight INT8 or FP16 bilerps per cycle, from those fetched values.

This idea goes in the same direction as what Jawed proposed, but I think you'd need to be able to fetch more values per clock too.
Personally though I favor 8 simpler units which could only set up 4 addresses per clock and fetch 16 FP32 values (either for bilinear filtering or point sampling). This gives the same final ta and tf capability, but with this solution two shader clusters could do texturing at the same time without having to wait. Otherwise you'd always need two simultaneous texture lookups in a single thread for maximum efficiency, which seems very odd. Certainly I could be wrong, and as I said, even just doubling the number of texture units doesn't look impossible if you compare that to what the competition can already do (though G92 certainly has excessive ta/tf capability).
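
To put the three texture-unit options from this exchange side by side, here is a small Python sketch that totals addresses (ta) and filtered results (tf) per clock. The per-unit figures are the ones quoted in the posts; the class, labels and the G92 tuple are only bookkeeping for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TexConfig:
    units: int
    ta_per_unit: int  # texture addresses set up per clock, per unit
    tf_per_unit: int  # bilinear filtered results per clock, per unit

    def totals(self) -> tuple:
        """Return (total ta, total tf) per clock for the whole chip."""
        return (self.units * self.ta_per_unit, self.units * self.tf_per_unit)


OPTIONS = {
    "8 rv670-style units (8ta, 4tf)": TexConfig(8, 8, 4),
    "4 wider units (8ta, 8tf)": TexConfig(4, 8, 8),
    "8 simpler units (4ta, 4tf)": TexConfig(8, 4, 4),
}
G92_TOTALS = (64, 64)  # 64ta / 64tf, as stated in the post

for label, cfg in OPTIONS.items():
    ta, tf = cfg.totals()
    print(f"{label}: {ta}ta, {tf}tf per clock")
```

On these numbers the last two options tie at 32ta/32tf, while doubling the rv670-style units gives 64ta/32tf, matching G92 on addressing and half of it on filtering.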

mczak · Feb 17, 2008

Some more random thoughts on these rumours:
- What's up with those high clocks (on rv770 at least)? We saw zero clock increase going from R600 to RV670 (which involved a die shrink), and now suddenly a ~25% clock increase is possible on the same process technology (and with a supposedly still similar architecture)?
- Why would AMD go back to a 128bit memory interface for the low-end part? It doesn't really look faster than the predecessor. Even if those 8 texture units are true, it's not going to be much faster (it already had quite a balanced tex/alu ratio).
- Performance: rv740, with fewer shader units but more texture units, could achieve per-clock performance comparable to rv670 in quite a lot of real-world tests IMHO, but it would probably also have an only slightly smaller die. It should be a very decent midrange part (at the proposed clocks). Obviously, rv770 is bound to beat all g92-based parts easily according to these rumours, with a die size which I'd guesstimate at slightly smaller than G92 (but of course it would only be smaller because of the 55nm advantage).
