AMD: R7xx Speculation

mczak · Jun 24, 2008

trinibwoy said:
I don't think FP16 filtering is half speed in RV770 though. Measured numbers seem to indicate that it's pushing more texels than G92 and it gets up pretty close to GT200 too.

I was referring to Cho's numbers here: http://forum.beyond3d.com/showthread.php?t=48630&page=2
Though these numbers don't make sense for the HD3870 which was used in comparison (I think it could be a 3870x2 instead, then it would make more sense...)

Mintmaster · Jun 24, 2008

jimmyjames123 said:
While I agree that GTX 280 is way overpriced, let's not get carried away and claim that GTX 260 is DOA. I'm sure when all is said and done that there will end up being several games where the GTX 260 has higher playable settings than the 4870 512MB, not to mention possibly lower idle power and also PhysX/CUDA support.

There will always be niche markets for any product, e.g. Parhelia, but this is NVidia we're talking about. It doesn't have to be literally DOA for it to be effectively so. ~10% more system idle power isn't a big deal, and will take several years of 24/7 usage to even reach half the difference in cost.
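The "several years" estimate can be checked with some quick arithmetic. This is only a sketch: the 150 W system idle draw and the $0.10 per kWh electricity price are assumptions, not figures from the thread, and the $100 price gap is the 4870 vs. GTX 260 difference quoted elsewhere in this discussion.

```python
HOURS_PER_YEAR = 24 * 365

def years_to_recoup(extra_watts, price_per_kwh, price_gap_dollars):
    """Years of 24/7 idling before the extra electricity cost equals price_gap_dollars."""
    extra_kwh_per_year = extra_watts * HOURS_PER_YEAR / 1000.0
    yearly_cost = extra_kwh_per_year * price_per_kwh
    return price_gap_dollars / yearly_cost

# Assumed values (hypothetical): 150 W system idle, $0.10 per kWh.
system_idle_watts = 150.0
extra_watts = 0.10 * system_idle_watts   # the "~10% more" idle power
half_price_gap = 100.0 / 2               # half of a $100 price difference

print(round(years_to_recoup(extra_watts, 0.10, half_price_gap), 1))
```

Under these assumptions the extra idle draw costs about $13 per year, so it takes roughly 3.8 years of continuous idling to reach $50. Higher electricity prices shorten that proportionally.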

In fact, I would argue the opposite in that the GTX 280 has more usefulness than the 260 in that it is clearly the best and there will always be people who want that. I think SLI tends to scale a bit better than CF, too (not counting X2 and GX2, which aren't here yet).

Also note that any review that compares Geforce GT200 vs Radeon 48xx with 8x MSAA enabled will almost always show advantages for the Radeon. As mentioned before, it doesn't make sense to run the Geforce cards with 8x MSAA, as performance is much much better with 16x CSAA, with minimal tradeoff in image quality.

Only two games tested on that site used 8xAA, and I ignored them in my assessment, especially because framerates were so high that it didn't matter who won.

jimmyjames123 · Jun 24, 2008

Mintmaster said:
There will always be niche markets for any product, e.g. Parhelia, but this is NVidia we're talking about. It doesn't have to be literally DOA for it to be effectively so. ~10% more system idle power isn't a big deal, and will take several years of 24/7 usage to even reach half the difference in cost.

In fact, I would argue the opposite in that the GTX 280 has more usefulness than the 260 in that it is clearly the best and there will always be people who want that. I think SLI tends to scale a bit better than CF, too (not counting X2 and GX2, which aren't here yet).

I'm just confused why people would consider the GTX 260 "DOA" based on a single review. If one can get higher playable settings some of the time with the GTX 260, in addition to PhysX and CUDA-application support down the road, then some will pay more to get it (a $100 difference between the 4870 and GTX 260 is not that bad compared to the cost differential between GTX 260 and GTX 280).

The real problem is that NVIDIA has to cut the price to keep a competitive price/performance ratio. This really hurts their margins and ROI. Even though GT200 is an expensive chip, they may have to suck it up and bring the price of the GTX 260 down another $50 to make it a more competitive value until GT200b shows up.

Anyway, all things considered, I'm really impressed with the ATI cards this go-around. They are definitely back in the game!

Jawed · Jun 24, 2008

mczak said:
I'll answer instead. Well, these diagrams were leaked before already, so there's nothing new to see here.
I don't even know where to start describing all the differences - I don't think I'm alone not having expected that many changes.

Yeah, talk about a deep clean.

I think it's fair to say that, apart from doubled-Z in the RBEs, there was a general expectation that RV770 would increase the counts of certain items and that would be it. More ALUs were guaranteed, more TUs were the popular choice and apart from slightly increased clocks, that was pretty much the end of it.

Ha, I led the pessimists' charge, asserting continuation of 16 TUs + 16 ROPs - though not without question.

For a change that isn't architectural, RV770 is a pretty thorough refresh. Actually trying to define what makes for architectural changes would be tough in light of this.

I think it was Dave Orton who said that they didn't have the tools they needed to make R600. Maybe that was just post-justification, or maybe it indicates they knew what they couldn't achieve in the R600 timeframe. Much like the uncertainty over which features intended for D3D10 got pushed back into 10.1, I don't think we'll ever know how much of RV770's changes were pushed out of R600.

If R600 had been released on time then we'd be looking at an 18-20-month refresh period between it and RV770. That length of time could be taken to indicate that RV770 is much as planned and that it does not consist of stuff that couldn't make it into R600 due to "tools problems".

1) tmu organization. No longer shared across arrays, each array has its own quad tmu. I'll bet the sampler thread arbiter had to change with that as well.

Assuming that RV770 uses screen-space tiling for fragments, then this means that each quad TU now "owns" a region of the screen, since each SIMD owns a tile and there's a 1:1 relationship between the two.

I think R6xx uses the ring bus to allow the disjoint L2s (and therefore L1s) to share texels, after any one TU has fetched the texel from memory. But I don't remember anyone saying that this is the case. If true, this is extra work for the ring bus. I presume the ring bus also supports the SIMDs in fetching from "foreign" TUs, attached to other SIMDs, since all SIMDs have to use all TUs to get texture/vertex data.

As far as I can tell R6xx's TUs (L1, L2) each have a local ring stop. This ring stop serves the TU, an RBE and an MC. Connecting them is a crossbar. So not all memory operations by TUs and RBEs travel around the ring, as the local MC is "directly connected".

RV770 has a dedicated crossbar twixt L2s and L1s to enable texel distribution. But due to screen-space tiling, the volume of texels that need to land in multiple L1s should be much less than in R6xx. This is because texels at the borders of screen-space tiles are candidates for multi-L1 sharing, whereas in R6xx texels in every quad of screen space could be candidates for multiple L1s.
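The border-sharing argument above can be put in rough numbers. A minimal sketch, assuming 32x32-pixel tiles dealt out to the 10 SIMDs in raster order; both the tile size and the mapping are hypothetical, since the thread gives neither.

```python
NUM_SIMDS = 10   # one quad TU per SIMD, per the 1:1 relationship described above
TILE = 32        # assumed tile edge in pixels (not a known RV770 value)

def owner_simd(x, y, tiles_per_row):
    """Hypothetical mapping: tiles are handed to SIMDs in raster order."""
    tile_index = (y // TILE) * tiles_per_row + (x // TILE)
    return tile_index % NUM_SIMDS

def border_quad_fraction(tile=TILE):
    """Fraction of 2x2 pixel quads in a square tile that touch the tile's edge."""
    quads_per_side = tile // 2
    total = quads_per_side ** 2
    interior = max(quads_per_side - 2, 0) ** 2
    return (total - interior) / total

print(owner_simd(33, 0, tiles_per_row=40))   # pixel in the second tile of the first row
print(border_quad_fraction(32))
print(border_quad_fraction(16))
```

With 32-pixel tiles about 23% of quads sit on a tile edge, and with 16-pixel tiles about 44%. Only those quads are likely candidates for texels that land in more than one L1, whereas without tiling every quad could be one. Real filter footprints can reach further than one quad, so treat this as a lower bound on sharing.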

Vertex data normally consists of one or more streams (1D) that are consumed at roughly equal "element frequency". So it would seem to make sense for there to be a single vertex data cache as in RV770. It's not clear if R6xx had multiple instances of vertex data cache (one per SIMD) though.

I'm still wondering how a single vertex data cache is going to support 10 TUs though. Perhaps the SIMDs take it in turns, strictly round-robin?
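One way to read "strictly round-robin" is that each SIMD gets a fixed turn at the vertex data cache whether or not it has a fetch pending. The sketch below is purely hypothetical and is not a description of RV770's actual arbiter; it only shows what that policy would cost in idle slots.

```python
def round_robin_grants(pending, cycles):
    """Strict round-robin: cycle i belongs to client i % len(pending).

    pending[i] is the number of outstanding requests from client i.
    Returns, per cycle, the client that was served, or None if the
    slot's owner had nothing to fetch (the slot is wasted).
    """
    pending = list(pending)          # do not mutate the caller's list
    grants = []
    for cycle in range(cycles):
        client = cycle % len(pending)
        if pending[client] > 0:
            pending[client] -= 1
            grants.append(client)
        else:
            grants.append(None)
    return grants

# Three clients: client 0 has 2 requests, client 1 has none, client 2 has 1.
print(round_robin_grants([2, 0, 1], cycles=6))   # [0, None, 2, 0, None, None]
```

The wasted slots are the price of the simplicity: no client can starve, but an idle SIMD's turn is not handed to a busy one.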

2) tmu themselves changed. While they always had separate L1 caches (I think - the picture is misleading), now the separate 4 TA and point sampling fetch units are gone (thus one tmu is 4 TA, 16 fetch, 4 TF instead of 8 TA, 20 fetch, 4 TF). Also, early tests indicate they are no longer capable of sampling fp16 at full speed (dropping to half and one quarter at fp32 IIRC).

I have to say I'm confused by the fp16 situation. The amount of design effort that went into making R600 single-cycle fp16, indeed the conversion of int8 texels into fp16 texels, makes me wary of accepting that they've reverted to an int8 setup. Waiting to find out more.

3) ROPs. They now have 4xZ fill capability (at least in some situations) instead of just 2. The R600 picture indicates a fog/alpha unit which is now gone, though I doubt it really was there in the first place (it doesn't make sense; that should be handled in the shader ALU). The picture also indicates a shared color cache for R600, though I don't know if this was true. Could be though (see next item).

Like the uncertainty over L2 texture cache, I'm unsure whether R6xx has a single colour buffer cache or multiple instances each dedicated to an RBE. I suspect the latter, since screen-space tiling makes colour (and Z and stencil) essentially private to an RBE.

I'm pretty sure that RV770's RBEs only use their local MC, whereas R6xx appeared to allow all RBEs to access all MCs. I guess this means a revised way of tiling Colour, Z and stencil data in memory. Though we've never really had much idea how earlier GPUs tiled render targets...

4) no more ring bus. Clearly with rv770 ROPs are tied to memory channels (just like nvidia G80 and up), and there are per-memory channel L2 texture caches. Instead of one ring-bus it seems there's now different "busses" or crossbars or whatever for different data (it's got a "hub", it's got some path for texture data etc.)

The hub appears to be for low-bandwidth (or low duty-cycle) data. This makes me wonder if we'll see the "unification" of two GPUs' memory as has been long discussed, for the X2 board.

I just don't see how there'll be enough bandwidth through the two hubs (one per GPU) to allow anything other than the transmission of completed render targets, i.e. AFR mode.

5) Other stuff like the local data store, read/write cache etc.

LDS is a big deal. I have a suspicion that AMD has configured this as a read-only share between elements in a hardware thread. I wrote my theory here:

http://forum.beyond3d.com/showpost.php?p=1179619&postcount=4340

Making it read-only means it's "collision free" and latency-tolerant. I reckon this means that thread synchronisation (in order to be able to share data across elements safely) becomes very cheap and a normal part of the Sequencer's task of issuing threads and load-balancing them.

I dare say it's notable that, like GT200, RV770 is lower-clocked. I reckon this reflects the process/yield/die-size scaling issues that led AMD to a multi-chip GPU strategy.

Global data share seems fiddlesome, now that's asking SIMDs to cooperate, I presume. Though if SIMDs are cooperating in their use of vertex data cache (taking it in turns) perhaps there's a higher level thread synchronisation mechanism in RV770. Something more interesting than the mundane creation and termination of hardware threads by the command processor.

Jawed

AlphaWolf · Jun 24, 2008

jimmyjames123 said:
I'm just confused why people would consider the GTX 260 "DOA" based on a single review. If one can get higher playable settings some of the time with the GTX 260, in addition to PhysX and CUDA-application support down the road, then some will pay more to get it (a $100 difference between the 4870 and GTX 260 is not that bad compared to the cost differential between GTX 260 and GTX 280).

Probably because the review doesn't really fall outside of expectations. The 260 wasn't exactly kicking the crap out of the 4850. PhysX and CUDA are just unknowns to most consumers; the small number of people who will care amounts to a niche.

The real problem is that NVIDIA has to cut the price to keep a competitive price/performance ratio. This really hurts their margins and ROI. Even though GT200 is an expensive chip, they may have to suck it up and bring the price of the GTX 260 down another $50 to make it a more competitive value until GT200b shows up.

I don't think it'll stay higher than the 4870 price for long, although they'll target the 1GB model most likely. Or maybe they'll start stripping them to 448MB, but I'm not sure how that will affect performance.

Lukfi · Jun 24, 2008

AlphaWolf said:
I don't think it'll stay higher than the 4870 price for long, although they'll target the 1GB model most likely. Or maybe they'll start stripping them to 448MB, but I'm not sure how that will affect performance.

I dunno why, but where memory size is a limiting factor for nVidia, it isn't for ATi. IMO 448 MB would essentially kill the GTX 260 just as 256 MB kills the 8800 GT.

toTOW · Jun 24, 2008

mhouston said:
That's actually from toTOW in the Folding@Home forums. I'll see if I can get him over here to explain how he did this. It's on air. ;-) This totally voids the warranty.

Thank you, Mike, for bringing this thread to my attention ... I'm surprised to see that my adventures are already spreading all over the world.

ChronoReverse said:
What did they use to overclock past 700MHz?

I'm editing the BIOS with a hexadecimal editor (with the help of RBE from techPowerUp to locate the correct offsets). The GPU and memory clock mods work from the BIOS, but I had to physically mod the board for the voltage.

Wirmish said:
725 MHz @ 1.200V -> Screenshot
775 MHz @ 1.265V -> Screenshot
800 MHz @ 1.???V -> Screenshot

1.3V for this bench, but none of these values have been fine-tuned ... I'm trying to get the maximum from the chip.

I've just tested 825 MHz @ 1.3~1.32V, but I'm reaching the limits of a component (the GPU or the VRMs ... I don't know yet) : strange lightning flashes in the middle of the screen and gradient halo ...

Here are some additional pictures : two shots of the modded board

The board in the case, with the two extra fans :

Broken Hope · Jun 24, 2008

Lukfi said:
I dunno why, but where memory size is a limiting factor for nVidia, it isn't for ATi. IMO 448 MB would essentially kill the GTX 260 just as 256 MB kills the 8800 GT.

Always makes me wonder what ATI do differently: they can match Nvidia cards that have much more memory. Is Nvidia more wasteful of RAM than ATI?

digitalwanderer · Jun 24, 2008

Dumb question toTOW, but did you happen to find a way to control fan speeds on the stock cooler whilst playing with that hex editor?

BTW- Nice job, I bow to your voodoo!

mczak · Jun 24, 2008

Jawed said:
The hub appears to be for low-bandwidth (or low duty-cycle) data. This makes me wonder if we'll see the "unification" of two GPUs' memory as has been long discussed, for the X2 board.

I just don't see how there'll be enough bandwidth through the two hubs (one per GPU) to allow anything other than the transmission of completed render targets, i.e. AFR mode.

There's no indication the hub is really low-bandwidth (pcie 2.0 is already 8GB/s each direction). Not suited as a route-all-traffic-around-the-chip catch-all, yes, but might be sufficient for texture fetch / vertex fetch across the CrossfireX port.
Also, maybe it would be possible to operate in "mixed mode" - so for instance vertex buffers, compressed textures and render targets used as textures won't be duplicated but reused fp16 textures will be (though of course unfortunately those also take the most space...).
It certainly shouldn't prevent the use of SuperTiling (assuming vertex work is just all done on both chips).
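For scale, the 8GB/s figure follows directly from the PCIe 2.0 link parameters (16 lanes, 5 GT/s per lane, 8b/10b encoding), and it can be compared with what shipping finished frames in AFR mode would need. The resolution, pixel format and frame rate below are assumptions chosen for illustration, and none of this says what the hub or the CrossfireX port can actually sustain.

```python
def pcie_gb_per_s(lanes, gt_per_s, payload_bits=8, line_bits=10):
    """One-direction payload bandwidth in GB/s for an 8b/10b-encoded link."""
    return lanes * gt_per_s * payload_bits / line_bits / 8

def render_target_gb_per_s(width, height, bytes_per_pixel, frames_per_s):
    """Bandwidth needed to copy one finished render target per frame, in GB/s."""
    return width * height * bytes_per_pixel * frames_per_s / 1e9

link = pcie_gb_per_s(lanes=16, gt_per_s=5.0)
# Assumed workload (hypothetical): 1920x1200, 32-bit colour, 60 frames per second.
afr = render_target_gb_per_s(1920, 1200, 4, 60)

print(link)            # 8.0
print(round(afr, 3))   # 0.553
```

Under these assumptions, copying completed frames uses well under a tenth of a PCIe 2.0 x16 link's one-way bandwidth. Whether texture and vertex fetches across the two chips would also fit depends on traffic patterns that this sketch does not model.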

Entropy · Jun 24, 2008

A.L.M. said:
This diagram shows clearly that the idle consumption is the problem, not the full-load consumption. But the idle power draw can be fixed with a simple driver update.
There's something very fishy in that test...

The two tests do confirm each other. In one the HD4850 draws 38W more than a standard HD3870, in the other it draws 22W more than an overclocked sample. This is normal.
Regarding the power draw of the HD4870, we already have examples here that achieving HD4870 frequencies requires hiking the voltage, indicating that AMD is at a point on the curve where increasing the frequency of the part increases the power draw drastically.
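The usual first-order model for this is that dynamic power scales with frequency times voltage squared. A rough sketch applying that model to two of the overclocking data points posted earlier in this thread (725 MHz @ 1.200V and 775 MHz @ 1.265V); it ignores leakage, memory and VRM losses, so it is an approximation rather than a measurement.

```python
def relative_dynamic_power(f_new, v_new, f_old, v_old):
    """Ratio of dynamic power under the P ~ f * V^2 approximation."""
    return (f_new / f_old) * (v_new / v_old) ** 2

ratio = relative_dynamic_power(775, 1.265, 725, 1.200)
clock_gain = 775 / 725

print(round(clock_gain, 3))   # 1.069
print(round(ratio, 3))        # 1.188
```

By this estimate a clock increase of about 7% costs roughly 19% more dynamic power once the required voltage bump is included, which is the steep part of the curve being described.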

These power draw numbers are at best internally consistent, since different test vectors give different results for different boards, and there is no standardized procedure. Still, there have been several tests that show that the HD4850 behaves as expected in this review, so there is little reason to doubt that the HD4870 data is in the right ballpark. It's disappointing, but not entirely unexpected.

A.L.M. · Jun 24, 2008

Entropy said:
The two tests do confirm each other. In one the HD4850 draws 38W more than a standard HD3870, in the other it draws 22W more than an overclocked sample. This is normal.
Regarding the power draw of the HD4870, we already have examples here that achieving HD4870 frequencies requires hiking the voltage, indicating that AMD is at a point on the curve where increasing the frequency of the part increases the power draw drastically.

These power draw numbers are at best internally consistent, since different test vectors give different results for different boards, and there is no standardized procedure. Still, there have been several tests that show that the HD4850 behaves as expected in this review, so there is little reason to doubt that the HD4870 data is in the right ballpark. It's disappointing, but not entirely unexpected.

I think that 50W of TDP for 125MHz on core and for the GDDR5 are more than enough, don't you?

I know that you can't compare numbers between the two tests, that's obvious, but it's not my point. My point is: how the hell does a GX2 draw only 18W more than a 9800GTX at full load, or an HD3870X2 only 21W more than an HD4850? It's impossible, unless you do a very bad job of defining what should be considered "full load", imho.
Look at the latest Anandtech review:

This one too looks completely understandable and compatible with the declared TDPs, for example.

Entropy · Jun 24, 2008

A.L.M. said:
I think that 50W of TDP for 125MHz on core and for the GDDR5 are more than enough, don't you?

Don't forget the 12W(!) fan on the HD4870.

I haven't looked into nVidia's typical power draws over different reviews, I only checked for consistency over the RV770 data. Anandtech reports by far the smallest increase going from the HD3870 to the HD4850, so I wonder a little at how they measured it.
GDDR5 shouldn't add much or anything over GDDR3 according to the spec sheets. I really would have appreciated it if the rumoured HD4850-clocked GPU with GDDR5 had materialized, even though it probably wouldn't have provided enough of a performance advantage to justify the additional cost.

ChronoReverse · Jun 24, 2008

Interesting how Anandtech shows 8800GT load consumption lower than 3870 load while HardOCP shows about the same.

toTOW · Jun 24, 2008

digitalwanderer said:
Dumb question toTOW, but did you happen to find a way to control fan speeds on the stock cooler whilst playing with that hex editor?

No, I didn't have a look at these parameters ... when automatic regulation started to be the limit, I plugged the fan directly into 12V ... and then I replaced it with the Zalman cooler.

digitalwanderer · Jun 25, 2008

Thanks.

Mintmaster · Jun 25, 2008

jimmyjames123 said:
I'm just confused why people would consider the GTX 260 "DOA" based on a single review. If one can get higher playable settings some of the time with the GTX 260, in addition to PhysX and CUDA-application support down the road, then some will pay more to get it (a $100 difference between the 4870 and GTX 260 is not that bad compared to the cost differential between GTX 260 and GTX 280).

I really doubt any other review will show anything different. When looking at 4850 vs. the rest, this review is very much in line with every other one.

The real problem is that NVIDIA has to cut the price to keep a competitive price/performance ratio. This really hurts their margins and ROI. Even though GT200 is an expensive chip, they may have to suck it up and bring the price of the GTX 260 down another $50 to make it a more competitive value until GT200b shows up.

I don't even know if that's enough, and I doubt NVidia has any desire to sell chips with low/negative margin. From the rumours, it's already priced lower than they wanted. Board cost is probably similar to the 3870X2, which would have similar RAM cost, board complexity, cooling, power, etc. The 3870X2 is generally slower than the 4870 but can't be priced below $299 due to cost (except for clearance purposes, of course, as it will soon be EOL).

IMO NVidia would rather bleed a little market share and live with low sales of the 260. There's still the 9800 GX2 and 9800 SLI for their loyal users to get a more competitive product near that price point.

Anyway, all things considered, I'm really impressed with the ATI cards this go-around. They are definitely back in the game!

Indeed, and it was really needed. Just look at how willing NVidia was to drop $100 off the price. They must have halved the price they were charging AIB partners for G92 chips, because all the other components remained the same cost and retailers/AIBs still want their share of the profit.

It just shows you how NVidia was feeling zero pressure from ATI and was acting almost like a monopoly. Can't say I can blame them, though.

AlphaWolf · Jun 25, 2008

so newegg had the sapphire 4870 512MB up for a bit at $309.99. Doesn't seem like we'll be waiting until July.

As for the memory performance making a difference on the 4870,

If you look at this chart, you'll see that the 4870 performs better than its 20% clockspeed advantage in 33 out of 36 comparisons. Therefore I think it's safe to say that memory speed is helping its performance, unless there's something hidden we don't know about.
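The reasoning here is that if the core clock were the only difference, the 4870 could be at most 750/625 = 1.2x as fast as the 4850, so any result above that ratio has to come from something else, most plausibly memory bandwidth. A small sketch of that check; the frame rates in it are made-up placeholders, not numbers from the chart.

```python
CLOCK_RATIO = 750 / 625   # HD4870 vs HD4850 core clock, the 20% advantage

def beats_clock_scaling(fps_fast, fps_slow, ratio=CLOCK_RATIO):
    """True if the speedup exceeds what the core clock difference alone allows."""
    return fps_fast / fps_slow > ratio

# Hypothetical (4870 fps, 4850 fps) pairs, for illustration only.
samples = [(60.0, 45.0), (55.0, 50.0), (80.0, 62.0)]
over = sum(beats_clock_scaling(fast, slow) for fast, slow in samples)

print(over, "of", len(samples), "placeholder results exceed 1.2x")
print(round(33 / 36, 3))   # share of the chart's comparisons reported above 1.2x
```

With 33 of 36 comparisons above the clock ratio, about 92% of the results show scaling that the core clock alone cannot explain.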

Karoshi · Jun 25, 2008

AlphaWolf said:
so newegg had the sapphire 4870 512MB up for a bit at $309.99. Doesn't seem like we'll be waiting until July.

add salt:
Fudzilla: Wednesday is Radeon 4870 day

tacopaco · Jun 25, 2008

so newegg had the sapphire 4870 512MB up for a bit at $309.99. Doesn't seem like we'll be waiting until July.

Hmmm when was that up? I didn't catch it....
