Brain fade on my part: when I counted the columns on the diagram I wasn't paying attention to the fact that 16 single elements (pixels) are linked to a quad of TUs. I should have said 4 SIMDs. ARGH
In RV670, for example, the width of the TUs (16) agrees with the width of a SIMD, so the SIMDs are 16 wide.
If RV770 has 40-wide TUs then the naive interpretation is that the SIMDs are also 40-wide. 160 elements in a batch?
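For anyone following the arithmetic, here is a rough sketch of where 160 comes from. It assumes the R6xx behaviour of a SIMD working through one batch over 4 clocks carries over to RV770; the names `CLOCKS_PER_BATCH` and `batch_size` are just for illustration.

```python
# Sketch only: batch size = SIMD width * clocks spent per batch.
# Assumption: 4 clocks per batch, as on R6xx.
CLOCKS_PER_BATCH = 4


def batch_size(simd_width, clocks_per_batch=CLOCKS_PER_BATCH):
    """Number of elements (pixels/vertices) in one batch."""
    return simd_width * clocks_per_batch


rv670_batch = batch_size(16)        # 16-wide SIMD, matches the 16-wide TUs
naive_rv770_batch = batch_size(40)  # naive reading: SIMD as wide as 40 TUs

print(rv670_batch, naive_rv770_batch)
```

If the SIMDs stay 16 wide, the batch stays at 64 elements, so the 160 figure only follows from the 40-wide reading.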
I've just noticed that diagram appears to have no vertex-fetch specific units: there are no address and point-sampling units; just bilinear addressing, sampling and filtering. If true that seems like quite a big change architecturally - and presumably saves a bit of space.
The big black lines going through the SIMDs conveniently pass through where the point sampling units (edit: and vertex fetch units) would have been. Perhaps some of that hardware has been shifted elsewhere?
HD 4850 -> 147.51€
This is the kind of density I expected from ATI with R600 (scaled appropriately for 80nm, of course). After being wowed by Xenos, I couldn't figure out why R600 was so huge. RV670 was much better, and now RV770 is right back to wowing us. 2.5 times the shader+texture units in only 30% more space?
And 40 TMUs...
260 mm^2!
Good catch! It may reduce per shader efficiency, esp. with vertex/geometry shaders, but per-mm2 efficiency would get a huge boost. Quite an opposite strategy from NVidia and even ATI's past tendencies.
Jawed
The texture units have an extra set of control links: there's a line from the thread dispatcher, but the thick command line goes right through the SIMDs and right through the texture units.
The other notable thing is that "quads" are no longer explicitly drawn within the SIMDs.
Needless to say I dislike the idea of 160...
Which way are the divisions being drawn?
I still like keeping batch sizes at R600 levels or less. Going for 160 is a mondo jump in the other direction.
I've just found the vertex cache, though that doesn't tell us much about the point sampling units.
Hmm, I've never seen any sign that NVidia uses dedicated point-sampling units in G80 onwards. NVidia seems to separate out the stages that make up "texturing", with a separate LOD/Bias unit, then an addressing unit, then sampling, then filtering (I suspect I've forgotten something). I've been assuming that point samples (vertex data fetches) are taken simply by issuing commands to the address and sampling units.
30 GT/s texture fillrate, 1.2 Tflops, and 128GB/s bandwidth sounds like a pretty strong combination on HD4870. GT200 does seem to waste a lot of transistor budget in comparison.
I thought it's 115.2GB/s bandwidth, 1800MHz GDDR5.
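A quick sanity check on the two bandwidth figures being thrown around. This is only a sketch: it assumes a 256-bit memory bus and two data transfers per quoted GDDR5 clock, neither of which is stated in the thread, and `gddr5_bandwidth_gbs` is a made-up helper name.

```python
def gddr5_bandwidth_gbs(quoted_clock_mhz, bus_width_bits=256, transfers_per_clock=2):
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumptions: 256-bit bus, 2 transfers per quoted clock.
    """
    megabytes_per_second = quoted_clock_mhz * transfers_per_clock * bus_width_bits / 8
    return megabytes_per_second / 1000


print(gddr5_bandwidth_gbs(1800))  # the 1800MHz figure
print(gddr5_bandwidth_gbs(2000))  # the 2GHz figure
```

Under those assumptions 1800MHz gives 115.2GB/s and 2GHz gives 128GB/s, so the two numbers differ only in which memory clock you believe.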
NVidia appears to have spent a lot on CUDA-related functionality - stuff that will be hidden for quite a while (it'll take a while to appreciate the impact) - although the headline double-precision performance is underwhelming. There may be other aspects to CUDA performance that are much healthier, e.g. register file bandwidth, scatter performance etc.
Jawed
The memory read/write cache has migrated a bit as well. It's now hanging off of the shader export section (which is linked to vertex assembler as well), and it's going to the "hub", which is a term I don't remember seeing in any other R6xx slides.
That damn "PCINLIFE.com" is blocking what appears to be the word "crossbar".
The implications for that and the local share are interesting as well. A lot of units are apparently going to be talking to each other more. (Unless the word is "crap" or "cruft", but I doubt anyone would use that for their own slide).
I have to say the way the thick black lines all combine into a single path that feeds Shader Export implies to me that there are 10 SIMDs and that each quad-TU is dedicated to a SIMD, unlike R6xx. That would give a 64-element batch size.
The quad-TU tied to a SIMD of 16 actually might not change too much unit-wise. R600 had 4 quad-groups tied to one TMU, so it's 16 elements either way.
Hmm, I've never seen any sign that NVidia uses dedicated point-sampling units in G80 onwards. NVidia seems to separate out the stages that make up "texturing", with a separate LOD/Bias unit, then an addressing unit, then sampling, then filtering (I suspect I've forgotten something).
That's called marketing... Just like the 'Integer' path in the GT200 diagrams is pure marketing spin, at least according to John Nickolls.
I've been assuming that point samples (vertex data fetches) are taken simply by issuing commands to the address and sampling units.
Exact same thing as in, uhhh, Voodoo Graphics, Riva 128, Rage, GeForce 256, Radeon, [...], R300, R520, NV30, NV40, G70, ...
So RV770 appears to be a hybrid approach, with less separation of stages in the ATI design (addressing, sampling and filtering), with dual-function samplers (point and filtered fetch) seemingly just like G80 etc.
Yes, it seems they finally decided they could no longer afford to have an incredibly inefficient TMU architecture that made neither theoretical nor practical sense. Shock, horror? Hopefully the ROPs are sufficiently improved too so they can get nearer or even catch NVIDIA in these two respects. Given the die size, though, what seems really really impressive IMO are the ALUs, as there's a ridiculous amount of them and they presumably aren't any weaker (if anything, the FP64 numbers would imply they're stronger). They're definitely back in the game.
Actually, AMD says 1.8GHz, but all HD4870 ES are using 2GHz GDDR5, so it's likely to be 2GHz @ retail.
But ATI cards have a history of using the memory at significantly less than what's marked on the chips.
Jawed
Curiously, "hub" appears here:
Taking it literally, I'm thinking that the 4 separate L2s are communicating via the ring bus to the L1s. Though between the MCs and RBEs there's a thick black bus that appears to be a natural fit for "ring bus".
I'm thinking that as long as L2 contains the texture data the mapping doesn't really matter - L2 allows ATI to escape from "screen space tiling" data-locality.
The quad-TU tied to a SIMD of 16 actually might not change too much unit-wise. R600 had 4 quad-groups tied to one TMU, so it's 16 elements either way.
The difference is that each TMU is only going towards one SIMD's batch, while R600 spread the TMU block over possibly 4 separate batches.
I wonder if this hurts utilization, or if R600's scheme didn't really help much in the end.
Can I just interject a big, ear-to-ear, shit eating grin at this point in the speculation?
Since the 9250 has an RV770 core, future owners of 4800-series cards will also be able to squeeze that performance out of their graphics boards. The ATI team declined to say how much more than 1 TFlops the card can hit, but we were told that a 4x Crossfire X configuration is good for almost 5 TFlops. So, we would be tempted to assume that ATI is playing with a theoretically possible figure of about 1.2 TFlops per RV770 GPU.
And yes, the R700 - the dual-GPU 4870 X2 card - is good for almost twice that performance: according to AMD, the R700 will deliver 2 TFlops per board.
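The headline numbers hang together if you back out the unit counts. A minimal sketch, assuming 800 ALUs at 750MHz with one multiply-add (2 flops) per ALU per clock; none of those three values is stated in the thread, they are just what 40 TMUs at 30 GT/s and 1.2 TFlops would imply. The function names are hypothetical.

```python
def peak_tflops(alus, clock_mhz, flops_per_alu_per_clock=2):
    """Theoretical peak in TFlops. Assumption: one MAD (2 flops) per ALU per clock."""
    return alus * flops_per_alu_per_clock * clock_mhz / 1e6


def texel_rate_gts(tmus, clock_mhz):
    """Theoretical bilinear texel rate in GT/s, one texel per TMU per clock."""
    return tmus * clock_mhz / 1000


ALUS = 800       # assumption, inferred from 1.2 TFlops
CLOCK_MHZ = 750  # assumption, inferred from 30 GT/s over 40 TMUs
TMUS = 40        # from the diagram discussion above

single_gpu = peak_tflops(ALUS, CLOCK_MHZ)
print(texel_rate_gts(TMUS, CLOCK_MHZ))  # texture fillrate, GT/s
print(single_gpu)                       # one RV770, TFlops
print(4 * single_gpu)                   # 4x CrossFire X, TFlops
```

Four GPUs at that rate come to 4.8 TFlops, which fits "almost 5 TFlops". The 2 TFlops quoted for R700 is below twice the single-GPU figure, so it presumably assumes a lower clock or is simply rounded down.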