AMD: R7xx Speculation

4 or 5%?

Look at the transistor counts for R600 versus RV670 and note that the latter has 50% of the former's MCs.

Jawed

More like ~7%.

R600 = 720 million transistors
RV670 = 666 million transistors

If we took PCIe 2.0, UVD and DX10.1 compatibility off the RV670, then we could've been looking at an almost 60~70 million transistor difference between the two, despite the fact that both of them carry the same ALU count and structure.
That would translate to roughly 8~12%, surely not a negligible difference on such complex chip designs.
Cut the memory bus to 128-bit and that's another 5~6% on top of that, for a close to 20% gap.

A fifth of a "relatively" high-end chip discarded for use on a cheap mainstream card means profit pressure for sure.
And I'm also pretty sure that the 55nm half-node is still not at the maturity levels exhibited by the parent 65nm node.
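To make the percentage math above explicit, here's a quick back-of-the-envelope sketch; the 60~70M "stripped feature" figure (and the extra 5~6% for a 128-bit bus) are this thread's estimates, not official numbers:

```python
# Rough percentage math for the R600 vs RV670 transistor comparison above.
# The 60-70M stripped-feature figure is thread speculation, not an official number.
R600 = 720e6    # transistors
RV670 = 666e6   # transistors

actual_gap = (R600 - RV670) / R600
print(f"Actual gap: {actual_gap:.1%}")          # ~7.5%, i.e. the "more like ~7%" above

for hypothetical in (60e6, 70e6):               # gap if PCIe 2.0/UVD/DX10.1 were removed
    print(f"{hypothetical / 1e6:.0f}M gap: {hypothetical / R600:.1%}")   # ~8-10%
```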
 
If we took PCIe 2.0, UVD and DX10.1 compatibility off the RV670, then we could've been looking at an almost 60~70 million transistor difference between the two, despite the fact that both of them carry the same ALU count and structure.
Run the same comparison with RV610->RV620 and RV630->RV635. UVD is supposed to be very small, something like 3M transistors I think.

Jawed
 
To lay out my thoughts on possible improvements over RV670...

Render back ends:

  • quadruple the depth/stencil units
  • quadruple the z/stencil cache
  • 8xMSAA per clock
  • 8xZ per clock

(alpha/blend units and color cache untouched)

Texture Engine Units:

  • Double the texture address units
  • Double the texture samplers
  • Double the texture filtering units
  • Double the L1 cache
  • Double the vertex cache
  • Double the L2 cache

Shaders:

  • More SIMD arrays, but shorter (anywhere from 8 to 12)
  • Tweaks in the setup engine
  • Larger shader instruction and constant cache
  • Ultra-threaded dispatch processor set up with 4 arbiters and 4 sequencers per SIMD (hey, I can hope)

Yeah, some of it seems pretty far out there, like the ultra-threaded dispatch processor and 8xMSAA per clock, but I'd be content enough with 4xMSAA/8xZ per clock to at least match G92's capability.
 
On another note: if RV770 does indeed have 480 ALUs, it looks like it will fall short of the 1 TFLOP mark. I'm not really confident that they can hit the required 1 GHz+ core speed. Nvidia probably doesn't have a chance of hitting 1 TFLOP of MADDs either.

I thought 1 TFLOP was pretty far out there, but RV770 should hit close to 800 GFLOPS.
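As a sanity check on those figures, a minimal sketch assuming 480 ALUs each issuing one MADD (2 flops) per clock; the ALU count and the clock speeds are speculation from this thread, not confirmed specs:

```python
# Peak MADD throughput for a hypothetical 480-ALU RV770.
# ALU count and clock speeds are thread speculation, not confirmed.
ALUS = 480
FLOPS_PER_ALU_PER_CLOCK = 2  # one multiply-add per ALU per clock

def peak_gflops(core_mhz):
    return ALUS * FLOPS_PER_ALU_PER_CLOCK * core_mhz / 1000.0

for clock in (775, 825, 1042):
    print(f"{clock} MHz -> {peak_gflops(clock):.0f} GFLOPS")
# 775 MHz -> 744 GFLOPS, 825 MHz -> 792 GFLOPS ("close to 800"),
# and roughly 1042 MHz would be needed to reach 1 TFLOP.
```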
 
The same way that 3 quads per SIMD worked with 4 quad TMUs in R580 (1 quad TMU per SIMD).

Jawed

Maybe I'm missing something, but isn't the whole point of the R600 setup (where one quad TMU serves the same quad in each SIMD) to better balance texturing work across the chip? If each TMU is instead tied to a SIMD, then that TMU goes unused if that SIMD isn't running code requiring texturing. In R6xx, if at least one of the threads running in the four SIMDs needs texturing work, the TMUs get utilized.
 
Maybe I'm missing something, but isn't the whole point of the R600 setup, where one quad TMU serves the same quad in each SIMD, to better balance texturing work across the chip? If each TMU is instead tied to a SIMD, then that TMU goes unused if that SIMD isn't running code requiring texturing.
I think the balancing of texturing workload comes from:
  • the high level thread allocation policy of the GPU spreads all workloads equally - screen-space tiling of pixel shader workload is the best example of this
  • L2 cache associativity - supports the coherency of multiple concurrent vertex/geometry/pixel threads so that texels aren't evicted too early
The sheer count of threads in each shader unit then keeps the TUs busy. Don't forget that texturing is a "look ahead" process in R6xx (just like R5xx) - texture results can be delivered dozens of clock cycles ahead of when they're actually required.

Looking at the way code is assembled on R6xx it seems that up to 8 texture fetches are performed in a single clause. (This comes at considerable register cost...)

In R6xx, if at least one of the threads running in the four SIMDs needs texturing work, the TMUs get utilized.
I think it's reasonable to view R600 as having a single 16-wide TU which is shared across all four ALU SIMDs (that are 16-wide). We know L2 is centralised in R600 so it makes sense that the TUs are organised as a single SIMD processor. Each texturing clause then runs on the TU over 4 clocks, delivering 64 texturing results back to the originating batch.

So assume that RV770, with its 24 ALU quads, has a 32-wide TU, with quads A-H.

This is where I've revised my thinking, working in terms of batch size, not in terms of ALU SIMD width.

In the 12-SIMD RV770 each batch is 32-wide (2 quads * 4 clocks), or has 8 quads:
  • TU A - batch 1
  • TU B - batch 2
  • TU C - batch 3
  • ...
  • TU H - batch 8
So each of the 12 SIMDs takes it in turn to "control" the TU, for what is effectively 1 TU clock per instruction in the TU clause.

In the 4-SIMD RV770, each batch is 96-wide (6 quads * 4 clocks), or 24 quads:
  • TU A - batch 1, 9, 17
  • TU B - batch 2, 10, 18
  • ...
  • TU H - batch 8, 16, 24
And so each of the 4 SIMDs takes it in turn to control the TU, with each batch's texture clause running for 3 TU clocks per instruction.

Note that the mapping from TU to ALUs is not 1:1. The mapping is from a physical quad in the TU to logical quads in the batch. In the latter configuration, batch quads 1, 9 and 17 belong to SIMD quads 1, 3, 5, while batch quads 8, 16 and 24 belong to SIMD quads 2, 4, 6.

This latter organisation isn't what I proposed earlier :LOL: I've revised because I think the key is that there's a single TU, and I've found a way of thinking about a batch that enables "filling" a single TU processor.

I'm averse to the 12-SIMD version simply because of the large amount of control overhead... Also, I wonder if it's compatible with the concept of a single TU. Note that in this configuration each clause only runs for 1 clock in the TU pipeline. Is it reasonable to presume the TU can execute a different instruction on each successive clock, or does it need to run the same instruction for several clocks?

This is similar to the way the ALU pipeline runs an instruction for 4 clocks. In R600 the TU runs an instruction for 4 clocks (still guessing). In the 4-SIMD RV770 each instruction would run for 3 clocks.

Hmm...
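To put the batch/TU arithmetic above in one place, here's a minimal sketch under the (speculative) assumptions of a single shared TU, 24 ALU quads in RV770 and a 32-wide (8-quad) TU; none of these numbers are confirmed:

```python
# Batch width and TU clocks per clause instruction for the configurations
# discussed above. All of this (single shared TU, quad counts, TU width)
# is speculation from this thread, not a confirmed design.
def tu_schedule(alu_quads, num_simds, tu_quads, batch_clocks=4):
    quads_per_simd = alu_quads // num_simds
    quads_per_batch = quads_per_simd * batch_clocks    # a batch covers 4 ALU clocks
    batch_width = quads_per_batch * 4                  # threads per batch
    tu_clocks = quads_per_batch // tu_quads            # TU clocks per clause instruction
    return quads_per_simd, batch_width, tu_clocks

configs = {
    "R600: 4 SIMDs, 16 ALU quads, 16-wide TU": (16, 4, 4),
    "RV770: 12 SIMDs, 24 ALU quads, 32-wide TU": (24, 12, 8),
    "RV770: 4 SIMDs, 24 ALU quads, 32-wide TU": (24, 4, 8),
}
for name, (alu_quads, simds, tu_quads) in configs.items():
    q, w, c = tu_schedule(alu_quads, simds, tu_quads)
    print(f"{name}: {q} quads/SIMD, {w}-wide batch, {c} TU clock(s) per instruction")
# R600:          4 quads/SIMD, 64-wide batch, 4 TU clocks per instruction
# 12-SIMD RV770: 2 quads/SIMD, 32-wide batch, 1 TU clock per instruction
# 4-SIMD RV770:  6 quads/SIMD, 96-wide batch, 3 TU clocks per instruction
```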

---

My earlier suggestion for a 4 SIMD RV770 would feature four 8-wide TUs. Each TU would be under the control of a single ALU SIMD. Each TU clause would run for 12 clocks per instruction (24 quads in the batch divided by 2 quads in the TU)... Seems pretty unlikely.

---

So, after all that, the 1 clock per TU-clause instruction makes me think it's unlikely that RV770 is a 12-SIMD design. But that presumes all this stuff about there being a single SIMD for the TUs is correct. I'm left reckoning the 4-SIMD design is most likely (though 3 clocks per TU-clause instruction makes me a bit wary; it would be nicer if it were 4).

Jawed
 
Everyone here appears to be stating that GT200 will be available in July and RV770 in June, but this news listing from Tech Fuzz suggests later than that, well into Q3 or even Q4 before we see them:

http://www.techfuzz.com/roadmaps/2008.aspx

I'm also confused about the difference between the RV770 and the R700... as this thread is about R7xx, not the RV770? It appears the R7xx won't arrive until well into Q4, if not Q1 of '09.
 
Well, that site is all wrong. It also says this:

April

* nVidia GeForce 9800 GTX (Code name: D9E = Desktop 9 Enthusiast, aka G100 or GT200) is expected to be launched the first week of April. This will be nVidia's 9th-Gen enthusiast GPU and will phase-out the D8E series. The 9800 GTX will be manufactured using TSMC's 65 nm process, contain over 1 billion transistors, and support DirectX 10.1 and Shader Model 4.1. The 9800 GTX will contain 128 processor cores running at 1688 MHz, a core clock at 675 MHz, and 512MB DDR3 running at 1100 MHz over a 512-bit memory interface. Video card makers will likely launch additional 1GB and higher clocked versions of the card at a later date. The 9800 GTX will compete against AMD R700-based video cards. It will have two SLI bridges and support Tri-SLI. The card requires two 6-pin PCIe power connectors and has a dual-slot cooler. TDP is expected to be around 250W.

The 9800 GTX is just a G92-based card. Nothing next-gen like they're claiming.

According to what I read on Fudzilla, R700 is the name for the whole line of chips. RV7xx will be the actual specific chips within that line, with RV770 as the alleged flagship. So there won't be an uber-powerful true R700 model down the line if they are correct; RV770 is it.
 
Thank you both, that clarifies it for me. All the numbers and brandings are starting to make my head spin... I'm sure you understand ;).
 
More and more it appears that, in a move to get back to profitability, AMD/ATI has for the most part abandoned the ultra-high-end enthusiast chips and is instead focusing on the meat of the market: the budget, mainstream, and performance-mainstream segments.

R600 was the last attempt from ATI at the ultra-high-performance crowd, with the R700 series appearing to follow in the footsteps of RV670 with a focus on the mainstream.

Their efforts in terms of the ultra high end appear to be focused on CrossFire and its direct offspring, the X2 cards.

Oddly enough, that renewed focus seems to have put ATI ahead of Nvidia when it comes to multi-GPU rendering.

Scaling is roughly equivalent for the two, with ATI appearing to have a slight edge in 3-way scaling.

In flexibility there's no competition, with ATI able to mix and match any 385x or 387x derivative card with any other for 2-, 3-, or 4-way CF.

Functionality is also currently higher, with multi-monitor usable alongside multi-GPU.

Nvidia does still have the advantage in user-definable SLI profiles, but again that's an enthusiast-class feature, and not in the markets that ATI currently appears to be focused on.

Just looking at things superficially, it appears that while ATI and Nvidia are competing in the graphics card business, they are each targeting different (yet overlapping) audiences.

AMD is focused on price/perf, functionality and flexibility, while Nvidia seems focused on performance and price/perf (the latter only due to pressure from ATI).

I'm wondering when we'll see indications that Nvidia is moving to be more competitive with SLI. After all, it appears that they are still locking certain cards to certain SLI configurations. It also doesn't appear you can mix and match, say, a GX2 with a GTX or GT. And still not even a rumor of multi-monitor on SLI?

Regards,
SB
 
Actually, they never made an ultra-high-end card with the R600; it targeted the GTS, not the GTX. R5xx was their last attempt at an ultra-high-end card.
 
I'm quite positive ATI/AMD planned R600 to be "ultra high end" at first; they just couldn't deliver that in the end, but it doesn't mean they didn't try.
 
I'm wondering when we'll see indications that Nvidia is moving to be more competitive with SLI. After all, it appears that they are still locking certain cards to certain SLI configurations. It also doesn't appear you can mix and match, say, a GX2 with a GTX or GT. And still not even a rumor of multi-monitor on SLI?

Define competitive. I don't think Nvidia considers themselves behind, considering they essentially have a monopoly on the multi-GPU market. The advantages of CrossFireX right now are mostly academic in terms of the mixing and matching of various SKUs. Those aren't practical combinations that people are actually likely to use.

The big advantage IMO is with respect to multi-display support, but again you're talking about a very small (yet arguably vocal) number of consumers who have both a multi-GPU and multi-display setup. Another issue that Nvidia should be very aware of is the increasing attractiveness of AMD- and Intel-based motherboards... that could be the most dangerous threat to their multi-GPU throne.

In the end though, no amount of flexibility is going to trump perf/$. What three- or four-GPU setup can AMD offer right now that cannot be beaten by an Nvidia solution for the same money?
 
"For the same money" muddies the waters. With the way prices are going, you're probably going to be able to get two 3870X2s for the price (or near it) of two 9800 GTXs, which is a comparison that nV might or might not win. Once you go higher up the food chain though, they have no serious competition for the moment (IMHO).
 
Define competitive. I don't think Nvidia considers themselves behind, considering they essentially have a monopoly on the multi-GPU market. The advantages of CrossFireX right now are mostly academic in terms of the mixing and matching of various SKUs. Those aren't practical combinations that people are actually likely to use.

The big advantage IMO is with respect to multi-display support, but again you're talking about a very small (yet arguably vocal) number of consumers who have both a multi-GPU and multi-display setup. Another issue that Nvidia should be very aware of is the increasing attractiveness of AMD- and Intel-based motherboards... that could be the most dangerous threat to their multi-GPU throne.

In the end though, no amount of flexibility is going to trump perf/$. What three- or four-GPU setup can AMD offer right now that cannot be beaten by an Nvidia solution for the same money?

Meaning that, say, someone who already had a 3870 and had been considering CrossFire now has the option to add a 3870X2 for triple CF.

Or if someone has a 3870X2, they can just get a 3870 or 3850 for triple CF, since quad CF has virtually no gains.

Say someone with a 9800 GX2 wants more performance. They have no choice but to go quad SLI with another GX2, even though quad SLI (just like quad CF) has virtually no performance benefit compared to tri SLI. No other option at all currently.

Likewise, if you have 8800 GTs, tri SLI is not an option for you; you have to buy new cards. With 3850s you can still do tri-CF without having to reinvest in three new cards.

Sure, Nvidia's SLI is better for Nvidia, but ATI's CF is certainly much more beneficial for consumers.

Likewise, multi-monitor may not be a big deal if all you care about is performance, but I'd gladly sacrifice performance to avoid the headache of constantly enabling and disabling SLI every time I wanted to go back to normal desktop use after gaming. Especially the added headache of then having to set up your desktop all over again.

And as stated before, multi-card scaling is generally EQUAL between the two (so performance due to the multi-GPU technology itself is basically equal).

I.e., SLI tech has no performance advantage over CF tech currently.

So if you ONLY look at the SLI and CF implementations, minus the cards, SLI is WAY behind CF.

It's basically equal in multi-GPU scaling, yet behind in flexibility and behind in added-value usefulness (multi-monitor).

The one place it STILL shines is user-definable SLI profiles for games that aren't yet driver-optimized for SLI. Yet that is an enthusiast feature not geared to the mainstream, an area ATI currently appears to be targeting full force.

I really can't see how anyone could seriously think SLI's implementation isn't falling behind at this moment. Sure, the performance of the cards used is greater than the competition's, but the implementation is still far behind.

Currently ATI is the only one that appears to be innovating and refining multi-GPU. I'm sure Nvidia MUST be doing something, but they haven't shown anything yet, other than Hybrid SLI, which won't work without very specific hardware. So it's about equal to ATI's CrossFireX power saving for low-end motherboards, in that it's extremely limited and not available to the vast majority of gamers.

I'm not arguing that Nvidia doesn't have the fastest solutions, thanks to having the fastest cards. However, compared side by side with CF, it's certainly lacking in a lot of areas.

Which I think is just a side benefit of ATI focusing on the mainstream rather than the smaller enthusiast market.

Regards,
SB
 