AMD: R7xx Speculation

:oops: Brain fade on my part: when I counted the columns on the diagram I wasn't paying attention to the fact that 16 single elements (pixels) are linked to a quad of TUs. I should have said 4 SIMDs. ARGH :oops:

In RV670, for example, the width of the TUs (16) agrees with the width of a SIMD, so the SIMDs are 16 wide.

If RV770 has 40-wide TUs then the naive interpretation is that the SIMDs are also 40-wide. 160 elements in a batch?
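Spelling out the arithmetic I'm using (the 4-clock batch issue cadence is my assumption, carried over from R6xx, not something the diagram shows):

# Batch size if every SIMD lane handles one element per clock for 4 clocks,
# as in R600/RV670. The 4-clock cadence for RV770 is assumed, not confirmed.
CLOCKS_PER_BATCH = 4

def batch_size(simd_width):
    return simd_width * CLOCKS_PER_BATCH

print(batch_size(16))  # RV670-style 16-wide SIMD -> 64 elements per batch
print(batch_size(40))  # hypothetical 40-wide SIMD -> 160 elements per batch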

If we go by the possibly charitable assumption that the image is truly representative of the architecture (no simplifications, omissions or mistaken line draws, and that it actually is an AMD picture), I see some different weirdness.
The texture units have an extra set of control links: there's a line from the thread dispatcher, but the thick command line goes right through the SIMDs and right through the texture units.

Which way are the divisions being drawn?

I still like keeping batch sizes at R600 levels or less. Going for 160 is a mondo jump in the other direction.


I've just noticed that the diagram appears to have no vertex-fetch-specific units: there are no address and point-sampling units, just bilinear addressing, sampling and filtering. If true that seems like quite a big change architecturally - and presumably saves a bit of space.
The big black lines going through the SIMDs conveniently pass through where the point sampling units (edit: and vertex fetch units) would have been. Perhaps some of that hardware has been shifted elsewhere?
 
And 40 TMUs...


260 mm^2! :oops:
This is the kind of density I expected from ATI with R600 (scaled appropriately for 80nm, of course). After being wowed by Xenos, I couldn't figure out why R600 was so huge. RV670 was much better, and now RV770 is right back to wowing us. 2.5 times the shader+texture units in only 30% more space? :oops:

Of course, this is assuming that all the leaked info is correct...
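Rough ratios versus RV670, taking ~192mm2 for RV670 and the rumoured 800 lanes / 40 TMUs for RV770 (the RV670 die size and the 800-lane figure are my numbers, not from the slide):

# Density comparison vs RV670; inputs are assumed/rumoured, not official.
rv670_area, rv770_area = 192.0, 260.0   # die sizes in mm^2 (RV670 figure from memory)
rv670_lanes, rv770_lanes = 320, 800     # ALU lanes: 320 in RV670, 800 assumed for RV770
rv670_tmus, rv770_tmus = 16, 40

print(rv770_lanes / rv670_lanes)        # 2.5x the ALU lanes
print(rv770_tmus / rv670_tmus)          # 2.5x the TMUs
print(rv770_area / rv670_area)          # ~1.35x the die area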

I've just noticed that the diagram appears to have no vertex-fetch-specific units: there are no address and point-sampling units, just bilinear addressing, sampling and filtering. If true that seems like quite a big change architecturally - and presumably saves a bit of space.

Jawed
Good catch! It may reduce per shader efficiency, esp. with vertex/geometry shaders, but per-mm2 efficiency would get a huge boost. Quite an opposite strategy from NVidia and even ATI's past tendencies.
 
What's with the local and global share blocks in the diagram?

Something new, something repurposed from one of the other forms of local storage, or something they've renamed that has always been there in some form?
 
The texture units have an extra set of control links: there's a line from the thread dispatcher, but the thick command line goes right through the SIMDs and right through the texture units.
The other notable thing is that "quads" are no longer explicitly drawn within the SIMDs.

Which way are the divisions being drawn?

I still like keeping batch sizes at R600 levels or less. Going for 160 is a mondo jump in the other direction.
Needless to say I dislike the idea of 160...

The big black lines going through the SIMDs conveniently pass through where the point sampling units would have been. Perhaps some of that hardware has been shifted elsewhere?
I've just found the vertex cache, though that doesn't tell us much about the point sampling units.

That damn "PCINLIFE.com" is blocking what appears to be the word "crossbar".

I have to say the way the thick black lines all combine into a single path that feeds Shader Export implies to me that there are 10 SIMDs and that each quad-TU is dedicated to a SIMD, unlike R6xx. That would give a 64-element batch size.
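A quick sanity check on that reading (treating a quad-TU as 4 texture units and keeping SIMDs 16 lanes wide, both of which are my assumptions):

# Sanity check for "10 SIMDs, each with a dedicated quad-TU" (assumptions mine).
simds = 10
tus_per_quad_tu = 4        # assuming a quad-TU is 4 texture units
simd_width = 16            # assuming SIMDs stay 16 lanes wide, as in R6xx
clocks_per_batch = 4

print(simds * tus_per_quad_tu)          # 40 TUs, matching the rumoured 40 TMUs
print(simd_width * clocks_per_batch)    # 64-element batch per SIMD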

Jawed
 
30 GT/s texture fillrate, 1.2 Tflops, and 128GB/s bandwidth sounds like a pretty strong combination on HD4870. GT200 does seem to waste a lot of transistor budget in comparison.
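For what it's worth, those numbers hang together if you assume roughly a 750MHz engine clock, 40 TMUs and 800 ALU lanes - the clock and lane count are my guesses, not leaked figures:

# Back-of-the-envelope check of the rumoured HD4870 headline figures.
# The 750 MHz clock and 800 ALU lanes are guesses, not confirmed numbers.
core_clk = 750e6   # Hz
tmus = 40
alu_lanes = 800

print(tmus * core_clk / 1e9)            # 30.0 GT/s bilinear texture fillrate
print(alu_lanes * 2 * core_clk / 1e12)  # 1.2 TFLOPS, counting a MAD as 2 flops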
 
Good catch! It may reduce per shader efficiency, esp. with vertex/geometry shaders, but per-mm2 efficiency would get a huge boost. Quite an opposite strategy from NVidia and even ATI's past tendencies.
Hmm, I've never seen any sign that NVidia uses dedicated point-sampling units in G80 onwards. NVidia seems to separate out the stages that make up "texturing", with a separate LOD/Bias unit, then an addressing unit, then sampling, then filtering (I suspect I've forgotten something). I've been assuming that point samples (vertex data fetches) are taken simply by issuing commands to the address and sampling units.

So RV770 appears to be a hybrid approach, with less separation of stages in the ATI design (addressing, sampling and filtering), with dual-function samplers (point and filtered fetch) seemingly just like G80 etc.

Jawed
 
30 GT/s texture fillrate, 1.2 Tflops, and 128GB/s bandwidth sounds like a pretty strong combination on HD4870. GT200 does seem to waste a lot of transistor budget in comparison.
I thought it's 115.2GB/s bandwidth, 1800MHz GDDR5.

NVidia appears to have spent a lot on CUDA-related functionality - stuff that will be hidden for quite a while (it'll take a while to appreciate the impact) - although the headline double-precision performance is underwhelming. There may be other aspects to CUDA performance that are much healthier, e.g. register file bandwidth, scatter performance etc.

Jawed
 
I thought it's 115.2GB/s bandwidth, 1800MHz GDDR5.

NVidia appears to have spent a lot on CUDA-related functionality - stuff that will be hidden for quite a while (it'll take a while to appreciate the impact) - although the headline double-precision performance is underwhelming. There may be other aspects to CUDA performance that are much healthier, e.g. register file bandwidth, scatter performance etc.

Jawed

Actually AMD says 1.8GHz, but all HD4870 ES boards are using 2GHz GDDR5, so it's likely to be 2GHz at retail.

BTW: the 2GHz GDDR5 on HD4870 only needs 1.5V.
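For reference, on an assumed 256-bit bus those two speeds work out like this (taking "1.8GHz"/"2GHz" GDDR5 as 3.6/4.0Gbps effective per pin):

# Bandwidth on an assumed 256-bit bus (the bus width is my assumption).
bus_bits = 256
for gbps in (3.6, 4.0):                 # "1.8GHz" and "2GHz" GDDR5
    print(bus_bits / 8 * gbps)          # 115.2 GB/s and 128.0 GB/s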
 
The other notable thing is that "quads" are no longer explicitly drawn within the SIMDs.
The memory read/write cache has migrated a bit as well. It's now hanging off of the shader export section (which is linked to vertex assembler as well), and it's going to the "hub", which is a term I don't remember seeing in any other R6xx slides.

That damn "PCINLIFE.com" is blocking what appears to be the word "crossbar".
The implications for that and the local share are interesting as well. A lot of units are apparently going to be talking to each other more. (Unless the word is "crap" or "cruft", but I doubt anyone would use that for their own slide).

I have to say the way the thick black lines all combine into a single path that feeds Shader Export implies to me that there are 10 SIMDs and that each quad-TU is dedicated to a SIMD, unlike R6xx. That would give a 64-element batch size.
The quad-TU tied to a SIMD of 16 actually might not change too much unit-wise. R600 had 4 quad-groups tied to one TMU, so it's 16 elements either way.
The difference is that each TMU is only going towards one SIMD's batch, while R600 spread the TMU block over possibly 4 separate batches.
I wonder if this hurts utilization, or if R600's scheme didn't really help much in the end.
 
Hmm, I've never seen any sign that NVidia uses dedicated point-sampling units in G80 onwards. NVidia seems to separate out the stages that make up "texturing", with a separate LOD/Bias unit, then an addressing unit, then sampling, then filtering (I suspect I've forgotten something).
That's called marketing... ;) Just like the 'Integer' path in the GT200 diagrams is pure marketing spin, at least according to John Nickolls.
I've been assuming that point samples (vertex data fetches) are taken simply by issuing commands to the address and sampling units.
Exact same thing as in, uhhh, Voodoo Graphics, Riva 128, Rage, GeForce 256, Radeon, [...], R300, R520, NV30, NV40, G70, ...
If anything, as I said many times already, I very heavily suspect that there is a significant amount of sharing between the addressing & filtering parts on all DX10 NVIDIA chips, except maybe G80.

So RV770 appears to be a hybrid approach, with less separation of stages in the ATI design (addressing, sampling and filtering), with dual-function samplers (point and filtered fetch) seemingly just like G80 etc.
Yes, it seems they finally decided they could no longer afford an incredibly inefficient TMU architecture that made neither theoretical nor practical sense. Shock, horror? :) Hopefully the ROPs are sufficiently improved too, so they can get nearer to or even catch NVIDIA in these two respects. But given the die size, what seems really impressive IMO are the ALUs: there's a ridiculous number of them, and they presumably aren't any weaker (if anything, the FP64 numbers would imply they're stronger). They're definitely back in the game.
 
The memory read/write cache has migrated a bit as well. It's now hanging off of the shader export section (which is linked to vertex assembler as well), and it's going to the "hub", which is a term I don't remember seeing in any other R6xx slides.
Curiously, "hub" appears here:

http://66.102.9.104/translate_c?hl=...cle/graphics/2007-06-02/1180790746d363_1.html

on the Xenos diagram. It's unclear to me who actually drew that diagram (it's described as Microsoft's, but...), though I have to say I prefer the clean lines of that one to these newer AMD diagrams. It's interesting to see the R600 diagram there, in the same style.

The implications for that and the local share are interesting as well. A lot of units are apparently going to be talking to each other more. (Unless the word is "crap" or "cruft", but I doubt anyone would use that for their own slide).
Taking it literally, I'm thinking that the 4 separate L2s are communicating via the ring bus to the L1s. Though between the MCs and RBEs there's a thick black bus that appears to be a natural fit for "ring bus".

Hmm, this is confusing, because the L1s and TUs are separated by the crossbar, which implies that individual texels will not normally appear in multiple L1s concurrently. This is nuts :???: I can't work out what's going on. Maybe the crossbar only routes global data share and vertex cache data, while the L1s are in fact private to each TU.

Is "local data share" just the register file?

The quad-TU tied to a SIMD of 16 actually might not change too much unit-wise. R600 had 4 quad-groups tied to one TMU, so it's 16 elements either way.
The difference is that each TMU is only going towards one SIMD's batch, while R600 spread the TMU block over possibly 4 separate batches.
I wonder if this hurts utilization, or if R600's scheme didn't really help much in the end.
I'm thinking that as long as L2 contains the texture data the mapping doesn't really matter - L2 allows ATI to escape from "screen space tiling" data-locality.

Jawed
 
So... any guesses as to architectural features that would enable a more seamless CF experience, or would any such improvement be purely a function of software changes?
 
If you follow the arrows, you can hop from the Xfire port, to the hub, to the memory read/write cache, to shader export, to vertex assembly.

Maybe it's possible to not duplicate all of the vertex workload between multiple GPUs in SFR?
 
TG Daily : AMD puts 5 TFlops in your PC

Since the 9250 has an RV770 core, future owners of 4800-series cards will also be able to squeeze that performance out of their graphics boards. The ATI team declined to say how much more than 1 TFlops the card can hit, but we were told that a 4x Crossfire X configuration is good for almost 5 TFlops. So we would be tempted to assume that ATI is playing with a theoretically possible number of about 1.2 TFlops per RV770 GPU.
And yes, the R700 - the dual-GPU 4870 X2 card - is good for almost twice that performance: according to AMD, the R700 will deliver 2 TFlops per board.
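Taking those board-level figures at face value, the implied per-GPU numbers are just simple division (the rounding is mine):

# Per-GPU TFLOPS implied by TG Daily's board-level figures.
print(5.0 / 4)   # ~1.25 TFLOPS per RV770, from the "almost 5 TFlops" CrossFire X figure
print(2.0 / 2)   # 1.0 TFLOPS per GPU, from the "2 TFlops" R700 figure - oddly lower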
 