AMD: R7xx Speculation

Ailuros · Mar 6, 2008

Tchock said:
I wanted to edit that to RV635 exclusively but B3D has no edit option. Dang.

OT but I at least see an edit function for all my past posts.

Pete · Mar 6, 2008

You need to rack up a few posts before you can start covering your tracks willy-nilly.

aca · Mar 6, 2008

3dilettante said:
Just out of curiosity which mechanism implies that more transistors per unit of area leads to higher defects?
Do smaller transistors lower the threshold where physical defects in the silicon itself impact transistor function?

At least for mature processes at the MPU manufacturers, defect rates approach the baseline of physical defects in the wafer, irrespective of what's patterned on them.

More transistors per unit area means less margin for error per transistor (mask, etching, doping, ...). Assuming the same sigmas for those steps, the statement seems to hold. Purely a stochastic process, I would say.
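A toy sketch of that stochastic argument, not a real yield model: assume each patterning step has a zero-mean Gaussian error with a fixed sigma, and that a feature fails when the error exceeds its tolerance. The function name, the 2 nm sigma and the tolerance values are all hypothetical, chosen only to show the trend.

```python
import math

def failure_probability(tolerance_nm: float, sigma_nm: float) -> float:
    """P(|error| > tolerance) for a zero-mean Gaussian error with std sigma."""
    return math.erfc(tolerance_nm / (sigma_nm * math.sqrt(2.0)))

SIGMA_NM = 2.0  # hypothetical process variation, held constant across nodes

# Shrinking the tolerance (denser transistors) with the same sigma
# raises the chance that any single feature falls out of spec.
for tolerance_nm in (10.0, 8.0, 6.0):
    p = failure_probability(tolerance_nm, SIGMA_NM)
    print(f"tolerance {tolerance_nm:4.1f} nm -> P(fail) = {p:.3e}")
```

The point is only the direction of the trend: with sigma fixed, a tighter tolerance means a higher per-feature failure probability.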

Sound_Card · Mar 6, 2008

I have a feeling the architecture is getting longer rather than wider. Perhaps two more quads are added to each SIMD array, along with two more texture blocks.

RV670 --- 16x4 into 4 texture blocks
RV770 --- 24x4 into 6 texture blocks

Because I think the SIMD arrays will stay at 4, I don't think we will see anything higher than 4 render back ends. I guess I will wait and see though.

Jawed · Mar 7, 2008

Sound_Card said:
I have a feeling the architecture is getting longer rather than wider. Perhaps two more quads are added to each SIMD array, along with two more texture blocks.

RV670 --- 16x4 into 4 texture blocks
RV770 --- 24x4 into 6 texture blocks

Because I think the SIMD arrays will stay at 4, I don't think we will see anything higher than 4 render back ends. I guess I will wait and see though.

I was thinking the same a while back and I have to admit despite my subsequent thoughts ("4xRV635" with 32 TUs) I'm biased towards 24 TUs simply because it's not such a huge jump. But the "asymmetry" of 24 TUs does bother me a bit...

Jawed

Sound_Card · Mar 7, 2008

Jawed said:
I was thinking the same a while back and I have to admit despite my subsequent thoughts ("4xRV635" with 32 TUs) I'm biased towards 24 TUs simply because it's not such a huge jump. But the "asymmetry" of 24 TUs does bother me a bit...

Jawed

My guess is exactly 24 TUs, and yes, a while back I also thought it was 4xRV635. However, performance rumours and die size are leading me to think otherwise, and after some thought this looks much more appropriate. But I'm not sure what you mean by the "asymmetry" problem.

The 4:1 ALU:TEX ratio is kept in RV770 because you still have one texture block feeding four quads across four SIMDs. Simply add two more "levels".

S = SPU quad, T = texture block, R = render back end

S S S S - - - T
S S S S - - - T
S S S S - - - T
S S S S - - - T
S S S S - - - T
S S S S - - - T
: : : : :
R R R R

I also believe that 24 TUs is plenty, considering clocks should be higher as well. I'm guessing 900 MHz for RV770 to hit the target performance.
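A rough sketch of the texel-rate arithmetic behind that guess. It assumes one texel per TU per clock and a 775 MHz RV670 reference clock; neither figure comes from this thread, and the 24 TU / 900 MHz numbers are just the speculation above.

```python
def texel_rate_gtexels(texture_units: int, clock_mhz: float) -> float:
    """Peak texel rate in Gtexels/s, assuming one texel per TU per clock."""
    return texture_units * clock_mhz / 1000.0

# 4 texture blocks x 4 TUs each; 775 MHz is an assumed RV670 clock.
rv670 = texel_rate_gtexels(16, 775.0)
# 6 texture blocks x 4 TUs each at the guessed 900 MHz.
rv770_guess = texel_rate_gtexels(24, 900.0)

print(f"RV670:         {rv670:.1f} Gtexels/s")
print(f"RV770 (guess): {rv770_guess:.1f} Gtexels/s")
print(f"ratio:         {rv770_guess / rv670:.2f}x")
```

Under those assumptions the speculated part would have roughly 1.7x the peak texel rate of RV670.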

3dilettante · Mar 7, 2008

Ooh, let me draw some ASCII art!

Code:

SS  SS  SS  SS --TT
SS  SS  SS  SS --TT
SS  SS  SS  SS --TT
R   R   R   R

Yeah I've got nothing, just figured I'd like it better without the 96 element granularity.

I can't remember if the ROPs were each statically tied to specific SIMDs or not.

Jawed · Mar 7, 2008

Sound_Card said:
But I'm not sure what you mean by "asymmetry" problem.

I don't know how the ring stops would be organised. But, then again, does it matter?

I'm assuming that R600 has one ring stop per quad RBE. Then each SIMD also has one ring stop. Finally, each SIMD has one quad TU, implying to me that a TU is directly linked to this same ring stop (since the TU needs to send its results to all the other SIMDs, not just its local SIMD).

6 quads of TUs sorta implies 6 ring stops...

Also, in R600, each SIMD has an equal share (1/4) of the TUs. How do you share 6 quad TUs across 4 SIMDs? 2 SIMDs with 2 quad TUs and 2 SIMDs with 1 quad TU. Hmm.

This asymmetry is why I diagrammed 4xRV635, otherwise it just seems messy:

http://forum.beyond3d.com/showpost.php?p=1130755&postcount=649

Now it's worth pointing out that L2 in R600 is centralised. So all TUs (or their L1s, at least) are accessing the same L2. That implied path, which isn't via the ring bus, could imply that the TUs actually return results back to the requesting SIMD via a central route, not the ring bus. If so, that would mean that 6 ring stops wouldn't be needed. But it still leaves me puzzling over the "ownership" of TUs, normally something that's symmetric across all SIMDs.

Jawed
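The 6-into-4 split described above can be sketched with a small helper. The function name `share_units` is hypothetical, and the model assumes each quad TU is owned by exactly one SIMD, which is itself one of the open questions here.

```python
def share_units(units: int, owners: int) -> list:
    """Spread `units` as evenly as possible across `owners`."""
    base, extra = divmod(units, owners)
    # The first `extra` owners receive one unit more than the rest.
    return [base + 1 if i < extra else base for i in range(owners)]

print(share_units(4, 4))  # R600-style: one quad TU per SIMD
print(share_units(6, 4))  # 24 TUs: two SIMDs get 2, two get 1
print(share_units(8, 4))  # "4xRV635" with 32 TUs: even again
```

Only the 4 and 8 quad-TU cases divide evenly across 4 SIMDs, which is the asymmetry in question.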

Jawed · Mar 7, 2008

3dilettante said:
I can't remember if the ROPs were each statically tied to specific SIMDs or not.

It's a question of screen-space tiling, first deployed in R300, solely for pixel shading: once a batch of pixels is rasterised (determined by their screen space tile) they're localised to a single quad RBE.

So I've been assuming that this is still the case going forwards, i.e. a strict 1:1 relationship.

It's possible to argue that with a centralised L2 screen-space tiling has less benefit (because the tiling also helped texture cache coherency). But there's still a question of hierarchical-Z and hierarchical-stencil both being tiled, implying a 1:1 link between those tiles of data and their owning RBEs.

So, the question is, does screen-space tiling have a locality benefit for RBEs in R600/R7xx? Or is the memory system of R6xx so flexible that it doesn't really matter?

Jawed
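A minimal sketch of static screen-space tiling with a strict 1:1 tile-to-RBE ownership, as described above. The 16-pixel tile size and the diagonal interleave pattern are assumptions made for illustration; the actual hardware mapping is not stated in this thread.

```python
from collections import Counter

TILE_SIZE = 16  # hypothetical tile edge in pixels
NUM_RBES = 4    # quad RBEs

def owning_rbe(x: int, y: int, tile_size: int = TILE_SIZE,
               num_rbes: int = NUM_RBES) -> int:
    """Map a pixel to the single RBE that owns its screen-space tile."""
    tile_x, tile_y = x // tile_size, y // tile_size
    # Diagonal interleave (assumed) so neighbouring tiles go to different RBEs.
    return (tile_x + tile_y) % num_rbes

def tiles_per_rbe(width: int, height: int, tile_size: int = TILE_SIZE,
                  num_rbes: int = NUM_RBES) -> Counter:
    """Count how many whole tiles each RBE owns on a width x height screen."""
    counts = Counter()
    for tile_y in range(height // tile_size):
        for tile_x in range(width // tile_size):
            counts[(tile_x + tile_y) % num_rbes] += 1
    return counts

print(owning_rbe(0, 0), owning_rbe(15, 15), owning_rbe(16, 0))
print(dict(tiles_per_rbe(1280, 1024)))
```

Every pixel in a tile resolves to the same RBE, so two RBEs never touch the same tile's Z/stencil data, and the static split hands each RBE an equal number of tiles per frame.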

Sound_Card · Mar 7, 2008

Jawed said:
I don't know how the ring stops would be organised. But, then again, does it matter?

I'm assuming that R600 has one ring stop per quad RBE. Then each SIMD also has one ring stop. Finally, each SIMD has one quad TU, implying to me that a TU is directly linked to this same ring stop (since the TU needs to send its results to all the other SIMDs, not just its local SIMD).

6 quads of TUs sorta implies 6 ring stops...

Also, in R600, each SIMD has an equal share (1/4) of the TUs. How do you share 6 quad TUs across 4 SIMDs? 2 SIMDs with 2 quad TUs and 2 SIMDs with 1 quad TU. Hmm.

This asymmetry is why I diagrammed 4xRV635, otherwise it just seems messy:

http://forum.beyond3d.com/showpost.php?p=1130755&postcount=649

Now it's worth pointing out that L2 in R600 is centralised. So all TUs (or their L1s, at least) are accessing the same L2. That implied path, which isn't via the ring bus, could imply that the TUs actually return results back to the requesting SIMD via a central route, not the ring bus. If so, that would mean that 6 ring stops wouldn't be needed. But it still leaves me puzzling over the "ownership" of TUs, normally something that's symmetric across all SIMDs.

Jawed

Ahhh thanks Jawed. I can see what you mean.

3dilettante · Mar 7, 2008

That sounds like it might be tiled that way.

Maybe I'm being dense, but why would it matter to the RBE which SIMD it was talking to?
Every form of storage in the R600 diagrams is located everywhere but the SIMDs.
The register file cache, the schedulers, the unified L2 and single-image L1 seem to lean towards keeping the ALUs isolated from the particulars.

Jawed · Mar 7, 2008

3dilettante said:
That sounds like it might be tiled that way.

Maybe I'm being dense, but why would it matter to the RBE which SIMD it was talking to?

Historically it was about load-balancing and cache (for each of TU and RBE) coherence. The load balancing was "automatic", the ALUs were only shading pixels and the assumption was that the (statically defined) set of tiles assigned to each RBE would all amount to an "equal" workload per frame, for ALUs, TUs and RBEs.

If you junk this kind of tiling then every RBE has to be able to talk to all the ALUs and all of the hierarchical buffer units (for Z/stencil). Clearly the ring bus looks like it would trivially support this many-to-many organisation. And the hierarchical buffer units may in fact be a single unit.

It's worth remembering that R600 only rasterises 16 pixels per clock, so it's not as if the hierarchical buffer unit would be straining, provided it can switch the tile it's working against without stalling. The same goes for the interpolators (a "fixed function" block between the rasteriser and the ALUs).

The biggest outstanding question is that the RBEs will all try to access the hierarchical buffers simultaneously. So, are those buffers (Z and stencil) tiled or can a single instance of each support the kind of throughput the RBEs demand? With tiling you guarantee collision-free accesses. Without tiling you have to have some kind of queueing/buffering/re-ordering front end to keep the RBEs happy. I think it's reasonable to assume that the RBEs are the most tetchy when it comes to being told to wait.

Every form of storage in the R600 diagrams is located everywhere but the SIMDs.
The register file cache, the schedulers, the unified L2 and single-image L1 seem to lean towards keeping the ALUs isolated from the particulars.

I see it as a question of whether the advantages that were once gained with screen-space tiling are still relevant, and of how much collision management you are willing to indulge in to get as much many-to-many flexibility as possible.

Clearly, for example, the TUs manage collisions because they support multiple clients - whereas historically TUs only had a single client. That change came with Xenos and RV530.

With unification of the ALUs screen-space tiling has less impact, agreed. But pixel shading is still typically 80%, say, of the average frame workload and it seems that RBEs are, more and more, taking up a greater proportion of memory system bandwidth (explosion in shadowing and MRT-based algorithms).

Jawed

Jawed · Mar 7, 2008

Jawed said:
The biggest outstanding question is that the RBEs will all try to access the hierarchical buffers simultaneously. So, are those buffers (Z and stencil) tiled or can a single instance of each support the kind of throughput the RBEs demand? With tiling you guarantee collision-free accesses. Without tiling you have to have some kind of queueing/buffering/re-ordering front end to keep the RBEs happy. I think it's reasonable to assume that the RBEs are the most tetchy when it comes to being told to wait.

I just want to add that it's entirely possible to build a single hierarchical-buffer update unit (i.e. for early-Z culling/updating) which accesses tiled Z/stencil buffers. This would enable the RBEs to have privately tiled Z/stencil buffers.

So really it's a question of whether the RBEs have private tiles or whether it's a many-to-many configuration.

Jawed

mczak · Mar 8, 2008

Jawed said:
I'm assuming that R600 has one ring stop per quad RBE. Then each SIMD also has one ring stop. Finally, each SIMD has one quad TU, implying to me that a TU is directly linked to this same ring stop (since the TU needs to send its results to all the other SIMDs, not just its local SIMD).

No, the SIMD units do not have quad TUs attached; those must be independent (otherwise, try to make the unit counts match up for RV630...). So simply having 24-wide SIMD arrays works perfectly fine from that point of view and naturally gives you 6 quad TUs. It would definitely be the easiest way to go from RV670 to RV770, assuming the 480 shader unit number is correct. The most obvious downside would be that the branching granularity would increase.
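A quick sketch of the arithmetic behind the 24-wide idea. It assumes 5 ALUs per lane (the 5D unit of R6xx) and a batch that issues over 4 clocks, as on R600; the constant and function names are hypothetical.

```python
ALUS_PER_LANE = 5     # 5D shader unit per SIMD lane (R6xx-style)
CLOCKS_PER_BATCH = 4  # assumption: one batch issues over 4 clocks

def shader_units(simds: int, simd_width: int) -> int:
    """Total scalar ALUs for a given SIMD count and width."""
    return simds * simd_width * ALUS_PER_LANE

def batch_size(simd_width: int) -> int:
    """Elements that must branch together (branching granularity)."""
    return simd_width * CLOCKS_PER_BATCH

for name, width in (("RV670", 16), ("RV770 (24-wide guess)", 24)):
    print(f"{name}: {shader_units(4, width)} shader units, "
          f"batch of {batch_size(width)} elements")
```

With 4 SIMDs this gives 320 units and 64-element batches at 16-wide, against 480 units and 96-element batches at 24-wide, so the 480 figure fits but the branching granularity grows by half.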

Jawed · Mar 8, 2008

mczak said:
No, the SIMD units do not have quad TUs attached; those must be independent (otherwise, try to make the unit counts match up for RV630...). So simply having 24-wide SIMD arrays works perfectly fine from that point of view and naturally gives you 6 quad TUs. It would definitely be the easiest way to go from RV670 to RV770, assuming the 480 shader unit number is correct. The most obvious downside would be that the branching granularity would increase.

OK that would be like a "super-sized" Xenos, 24-wide units instead of 16-wide. I haven't thought of it in those terms, but it does sound feasible. It would certainly make me happier about redundancy as I really dislike the idea of SIMDs narrower than 16, it just seems relatively wasteful.

Jawed

turtle · Mar 8, 2008

Core: AMD/ATI RV770
Die size: ~250mm square
Production: TSMC 55nm
Silicon Revision: A11
Shader core: 160! x 5D (800! SP)
TMU: 32
ROP: 16/32 (?)
MC: 256-bit External, Ring bus
Frequency: 825~875 (xx70), 700~775 (xx50)

http://bbs.chiphell.com/viewthread.php?tid=17621&extra=page=1

160 shaders? Doesn't seem possible.

no-X · Mar 8, 2008

Nice fake...

old image + PS = new RV770

turtle · Mar 8, 2008

I know, right?

It does look bullshitty.

edit: While the picture looks crappy and may be chopped, the specs match what was posted earlier, just with more detail. The earlier post said a little more than twice the shaders, and this says 160, or 2.5x...so to me it sounds like we're starting to get corroborating posts...or maybe just guesses based on the earlier post. I never know with Chiphell, but I try to keep track of the rep of certain posters there. This one is a mod, so you'd think they'd post less dung, but I suppose one never knows.

The strange thing about it though, wouldn't we be right back to ROP/TMU limitation? I wonder if such a massive increase in shaders would be to help with AA because of the way it's done in the R600/R700 arch?

Disharmonic · Mar 8, 2008

turtle said:
I know, right? It does look bullshitty.

The pictures are certainly the same. It's a really poor Photoshop edit.

ShaidarHaran · Mar 8, 2008

LOL, that doesn't even look remotely realistic. Anyone that falls for that deserves to be tricked.
