AMD: R7xx Speculation

:oops: Brain fade on my part: when I counted the columns on the diagram I wasn't paying attention to the fact that 16 single elements (pixels) are linked to a quad of TUs. I should have said 4 SIMDs. ARGH :oops:

In RV670, for example, the width of the TUs (16) agrees with the width of a SIMD, so the SIMDs are 16 wide.

If RV770 has 40-wide TUs then the naive interpretation is that the SIMDs are also 40-wide. 160 elements in a batch?
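Spelling out the arithmetic I'm using (the 4-clock batch issue cadence is my assumption, carried over from R6xx, not something the diagram shows):

# Batch size if every SIMD lane handles one element per clock for 4 clocks,
# as in R600/RV670. The 4-clock cadence for RV770 is assumed, not confirmed.
CLOCKS_PER_BATCH = 4

def batch_size(simd_width):
    return simd_width * CLOCKS_PER_BATCH

print(batch_size(16))  # RV670-style 16-wide SIMD -> 64 elements per batch
print(batch_size(40))  # hypothetical 40-wide SIMD -> 160 elements per batch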

If we go by the possibly charitable assumption that the image is truly representative of the architecture (no simplifications, omissions or mistaken line draws, and that it actually is an AMD picture), I see some different weirdness.
The texture units have an extra set of control links: there's a line from the thread dispatcher, but the thick command line goes right through the SIMDs and right through the texture units.

Which way are the divisions being drawn?

I still like keeping batch sizes at R600 levels or less. Going for 160 is a mondo jump in the other direction.


I've just noticed that the diagram appears to have no vertex-fetch-specific units: there are no address and point-sampling units, just bilinear addressing, sampling and filtering. If true that seems like quite a big change architecturally - and presumably saves a bit of space.
The big black lines going through the SIMDs conveniently pass through where the point sampling units (edit: and vertex fetch units) would have been. Perhaps some of that hardware has been shifted elsewhere?
 
And 40 TMUs...


260 mm^2! :oops:
This is the kind of density I expected from ATI with R600 (scaled appropriately for 80nm, of course). After being wowed by Xenos, I couldn't figure out why R600 was so huge. RV670 was much better, and now RV770 is right back to wowing us. 2.5 times the shader+texture units in only 30% more space? :oops:

Of course, this is assuming that all the leaked info is correct...
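Rough ratios versus RV670, taking ~192mm2 for RV670 and the rumoured 800 lanes / 40 TMUs for RV770 (the RV670 die size and the 800-lane figure are my numbers, not from the slide):

# Density comparison vs RV670; inputs are assumed/rumoured, not official.
rv670_area, rv770_area = 192.0, 260.0   # die sizes in mm^2 (RV670 figure from memory)
rv670_lanes, rv770_lanes = 320, 800     # ALU lanes: 320 in RV670, 800 assumed for RV770
rv670_tmus, rv770_tmus = 16, 40

print(rv770_lanes / rv670_lanes)        # 2.5x the ALU lanes
print(rv770_tmus / rv670_tmus)          # 2.5x the TMUs
print(rv770_area / rv670_area)          # ~1.35x the die area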

I've just noticed that the diagram appears to have no vertex-fetch-specific units: there are no address and point-sampling units, just bilinear addressing, sampling and filtering. If true that seems like quite a big change architecturally - and presumably saves a bit of space.

Jawed
Good catch! It may reduce per shader efficiency, esp. with vertex/geometry shaders, but per-mm2 efficiency would get a huge boost. Quite an opposite strategy from NVidia and even ATI's past tendencies.
 
What's with the local and global share blocks in the diagram?

Something new, something repurposed from one of the other forms of local storage, or something they've renamed that has always been there in some form?
 
The texture units have an extra set of control links: there's a line from the thread dispatcher, but the thick command line goes right through the SIMDs and right through the texture units.
The other notable thing is that "quads" are no longer explicitly drawn within the SIMDs.

Which way are the divisions being drawn?

I still like keeping batch sizes at R600 levels or less. Going for 160 is a mondo jump in the other direction.
Needless to say I dislike the idea of 160...

The big black lines going through the SIMDs conveniently pass through where the point sampling units would have been. Perhaps some of that hardware has been shifted elsewhere?
I've just found the vertex cache, though that doesn't tell us much about the point sampling units.

That damn "PCINLIFE.com" is blocking what appears to be the word "crossbar".

I have to say the way the thick black lines all combine into a single path that feeds Shader Export implies to me that there are 10 SIMDs and that each quad-TU is dedicated to a SIMD, unlike R6xx. That would give a 64-element batch size.
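A quick sanity check on that reading (treating a quad-TU as 4 texture units and keeping SIMDs 16 lanes wide, both of which are my assumptions):

# Sanity check for "10 SIMDs, each with a dedicated quad-TU" (assumptions mine).
simds = 10
tus_per_quad_tu = 4        # assuming a quad-TU is 4 texture units
simd_width = 16            # assuming SIMDs stay 16 lanes wide, as in R6xx
clocks_per_batch = 4

print(simds * tus_per_quad_tu)          # 40 TUs, matching the rumoured 40 TMUs
print(simd_width * clocks_per_batch)    # 64-element batch per SIMD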

Jawed
 
30 GT/s texture fillrate, 1.2 Tflops, and 128GB/s bandwidth sounds like a pretty strong combination on HD4870. GT200 does seem to waste a lot of transistor budget in comparison.
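For what it's worth, those numbers hang together if you assume roughly a 750MHz engine clock, 40 TMUs and 800 ALU lanes - the clock and lane count are my guesses, not leaked figures:

# Back-of-the-envelope check of the rumoured HD4870 headline figures.
# The 750 MHz clock and 800 ALU lanes are guesses, not confirmed numbers.
core_clk = 750e6   # Hz
tmus = 40
alu_lanes = 800

print(tmus * core_clk / 1e9)            # 30.0 GT/s bilinear texture fillrate
print(alu_lanes * 2 * core_clk / 1e12)  # 1.2 TFLOPS, counting a MAD as 2 flops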
 
Good catch! It may reduce per shader efficiency, esp. with vertex/geometry shaders, but per-mm2 efficiency would get a huge boost. Quite an opposite strategy from NVidia and even ATI's past tendencies.
Hmm, I've never seen any sign that NVidia uses dedicated point-sampling units in G80 onwards. NVidia seems to separate out the stages that make up "texturing", with a separate LOD/Bias unit, then an addressing unit, then sampling, then filtering (I suspect I've forgotten something). I've been assuming that point samples (vertex data fetches) are taken simply by issuing commands to the address and sampling units.

So RV770 appears to be a hybrid approach, with less separation of stages in the ATI design (addressing, sampling and filtering), with dual-function samplers (point and filtered fetch) seemingly just like G80 etc.

Jawed
 
30 GT/s texture fillrate, 1.2 Tflops, and 128GB/s bandwidth sounds like a pretty strong combination on HD4870. GT200 does seem to waste a lot of transistor budget in comparison.
I thought it's 115.2GB/s bandwidth, 1800MHz GDDR5.

NVidia appears to have spent a lot on CUDA-related functionality - stuff that will be hidden for quite a while (it'll take a while to appreciate the impact) - although the headline double-precision performance is underwhelming. There may be other aspects to CUDA performance that are much healthier, e.g. register file bandwidth, scatter performance etc.

Jawed
 
I thought it's 115.2GB/s bandwidth, 1800MHz GDDR5.

NVidia appears to have spent a lot on CUDA-related functionality - stuff that will be hidden for quite a while (it'll take a while to appreciate the impact) - although the headline double-precision performance is underwhelming. There may be other aspects to CUDA performance that are much healthier, e.g. register file bandwidth, scatter performance etc.

Jawed

Actually AMD says 1.8GHz, but all HD4870 ES boards are using 2GHz GDDR5, so it's likely to be 2GHz at retail.

BTW: the 2GHz GDDR5 on HD4870 only needs 1.5V.
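For reference, on an assumed 256-bit bus those two speeds work out like this (taking "1.8GHz"/"2GHz" GDDR5 as 3.6/4.0Gbps effective per pin):

# Bandwidth on an assumed 256-bit bus (the bus width is my assumption).
bus_bits = 256
for gbps in (3.6, 4.0):                 # "1.8GHz" and "2GHz" GDDR5
    print(bus_bits / 8 * gbps)          # 115.2 GB/s and 128.0 GB/s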
 
The other notable thing is that "quads" are no longer explicitly drawn within the SIMDs.
The memory read/write cache has migrated a bit as well. It's now hanging off of the shader export section (which is linked to vertex assembler as well), and it's going to the "hub", which is a term I don't remember seeing in any other R6xx slides.

That damn "PCINLIFE.com" is blocking what appears to be the word "crossbar".
The implications for that and the local share are interesting as well. A lot of units are apparently going to be talking to each other more. (Unless the word is "crap" or "cruft", but I doubt anyone would use that for their own slide).

I have to say the way the thick black lines all combine into a single path that feeds Shader Export implies to me that there are 10 SIMDs and that each quad-TU is dedicated to a SIMD, unlike R6xx. That would give a 64-element batch size.
The quad-TU tied to a SIMD of 16 actually might not change too much unit-wise. R600 had 4 quad-groups tied to one TMU, so it's 16 elements either way.
The difference is that each TMU is only going towards one SIMD's batch, while R600 spread the TMU block over possibly 4 separate batches.
I wonder if this hurts utilization, or if R600's scheme didn't really help much in the end.
 
Hmm, I've never seen any sign that NVidia uses dedicated point-sampling units in G80 onwards. NVidia seems to separate out the stages that make up "texturing", with a separate LOD/Bias unit, then an addressing unit, then sampling, then filtering (I suspect I've forgotten something).
That's called marketing... ;) Just like the 'Integer' path in the GT200 diagrams is pure marketing spin, at least according to John Nickolls.
I've been assuming that point samples (vertex data fetches) are taken simply by issuing commands to the address and sampling units.
Exact same thing as in, uhhh, Voodoo Graphics, Riva 128, Rage, GeForce 256, Radeon, [...], R300, R520, NV30, NV40, G70, ...
If anything, as I said many times already, I very heavily suspect that there is a significant amount of sharing between the addressing & filtering parts on all DX10 NVIDIA chips, except maybe G80.

So RV770 appears to be a hybrid approach, with less separation of stages in the ATI design (addressing, sampling and filtering), with dual-function samplers (point and filtered fetch) seemingly just like G80 etc.
Yes, it seems they finally decided they could no longer afford an incredibly inefficient TMU architecture that made neither theoretical nor practical sense. Shock, horror? :) Hopefully the ROPs are sufficiently improved too, so they can get nearer to or even catch NVIDIA in these two respects. But given the die size, what seems really impressive IMO are the ALUs: there's a ridiculous number of them, and they presumably aren't any weaker (if anything, the FP64 numbers would imply they're stronger). They're definitely back in the game.
 
The memory read/write cache has migrated a bit as well. It's now hanging off of the shader export section (which is linked to vertex assembler as well), and it's going to the "hub", which is a term I don't remember seeing in any other R6xx slides.
Curiously, "hub" appears here:

http://66.102.9.104/translate_c?hl=...cle/graphics/2007-06-02/1180790746d363_1.html

on the Xenos diagram. It's unclear to me who actually drew that diagram (it's described as Microsoft's, but...), though I have to say I prefer the clean lines of that one to these newer AMD diagrams. It's interesting to see the R600 diagram there, in the same style.

The implications for that and the local share are interesting as well. A lot of units are apparently going to be talking to each other more. (Unless the word is "crap" or "cruft", but I doubt anyone would use that for their own slide).
Taking it literally, I'm thinking that the 4 separate L2s are communicating via the ring bus to the L1s. Though between the MCs and RBEs there's a thick black bus that appears to be a natural fit for "ring bus".

Hmm, this is confusing, because the L1s and TUs are separated by the crossbar, which implies that individual texels will not normally appear in multiple L1s concurrently. This is nuts :???: I can't work out what's going on. Maybe the crossbar only routes global data share and vertex cache data, while the L1s are in fact private to each TU.

Is "local data share" just the register file?

The quad-TU tied to a SIMD of 16 actually might not change too much unit-wise. R600 had 4 quad-groups tied to one TMU, so it's 16 elements either way.
The difference is that each TMU is only going towards one SIMD's batch, while R600 spread the TMU block over possibly 4 separate batches.
I wonder if this hurts utilization, or if R600's scheme didn't really help much in the end.
I'm thinking that as long as L2 contains the texture data the mapping doesn't really matter - L2 allows ATI to escape from "screen space tiling" data-locality.

Jawed
 
So... any guesses as to architectural features that would enable a more seamless CF experience, or would any such improvement be purely a function of software changes?
 
If you follow the arrows, you can hop from the Xfire port, to the hub, to the memory read/write cache, to shader export, to vertex assembly.

Maybe it's possible to not duplicate all of the vertex workload between multiple GPUs in SFR?
 
TG Daily : AMD puts 5 TFlops in your PC

Since the 9250 has an RV770 core, future owners of 4800-series cards will also be able to squeeze that performance out of their graphics boards. The ATI team declined to say how much more than 1 TFlops the card can hit, but we were told that a 4x Crossfire X configuration is good for almost 5 TFlops. So we would be tempted to assume that ATI is playing with a theoretically possible number of about 1.2 TFlops per RV770 GPU.
And yes, the R700 - the dual-GPU 4870 X2 card - is good for almost twice that performance: according to AMD, the R700 will deliver 2 TFlops per board.
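Taking those board-level figures at face value, the implied per-GPU numbers are just simple division (the rounding is mine):

# Per-GPU TFLOPS implied by TG Daily's board-level figures.
print(5.0 / 4)   # ~1.25 TFLOPS per RV770, from the "almost 5 TFlops" CrossFire X figure
print(2.0 / 2)   # 1.0 TFLOPS per GPU, from the "2 TFlops" R700 figure - oddly lower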
 