The LAST R600 Rumours & Speculation Thread

With all those extra transistors, a lower process node, all that memory bandwidth and the 512-bit memory controller, there's got to be something more in R600 that we don't yet know about.

A. First, we don't know for sure what R600 will do with all that bandwidth.
B. Transistor counts don't tell us much on their own (examples):
Between G80 and R600 the difference is 39 million transistors.
Between G71 and R580 the difference is 106 million transistors.
Between G70 and R520 the difference is 19 million transistors.
Between NV45 and R480/R420 the difference is 62 million transistors.
Between NV30 and R300 the difference is 23 million transistors.
Based on the history of GPU transistor counts and smaller process nodes, for me it does not mean much.
G80 = 681 million transistors (90nm)
R600 = 720 million transistors (80nm)
R580 = 384 million transistors (90nm)
G71 = 278 million transistors (90nm)
R520 = 321 million transistors (90nm)
G70 = 302 million transistors (110nm)
NV40/NV45 = 222 million transistors (130nm) GeForce 6800 Ultra
R480 = 160 million transistors (110nm) X850 XT PE
R420 = 160 million transistors (130nm) X800 XT PE
NV30 = 130 million transistors (130nm) GeForce FX 5800 Ultra
R300 = 107 million transistors (150nm) Radeon 9700 Pro

1. As for something more in R600 that we don't yet know about, time will tell.
 
"Between G71 and R580 the different is 106 million transistors"

The reason they were about equal =

A. R580 had only 16 texture units vs. G71 24 textures units.
If Radeon R580 series had 24 or 32 textures units. It would simply destroy G71 performance.

48 pixel shaders on R580 was very awesome vs. only 24 found in G71.

ATI was relying raw-power on pixel-shaders which brought some success but in a lot of situation G71 fought strong because rely of 24 texture units advantage over ATI which was only 16.

Time will tell about how R600 will be utilized 100% efficiency to produce same outcome with nvidia. Because both ATI and Nvidia has different methods to produce same results.
 
A. R580 had only 16 texture units vs. 24 texture units on G71.
If the Radeon R580 series had had 24 or 32 texture units, it would simply have destroyed G71 in performance.

The 48 pixel shaders on R580 were awesome vs. only the 24 found in G71.
No offense, but you have no idea what you're talking about; there's some big confusion here. I would revise your numbers if I were you :)
 
If we consider a MADD-capable ALU/group of ALUs (which can work on an individual pixel) as a "pixel shader", then these numbers are correct, aren't they? (That's a question, not an insult.)
 
FMarks, what else?

I don't think that's strictly true. Both AMD and Nvidia must have pre-production DX10 software from game developers (or they are writing code for those developers), and they must both have their own software written for testing purposes. It's also more than likely that AMD has some retail G80s to compare to their own unreleased R600.
 
So, in short, there's no way I can back any of this up. I'm hopeful that there'll be a discussion, that's all...
Well, the idea is a bit out-of-this-world, but original and makes sense in theory, I guess - I don't think it makes much sense in practice though, simply because it's likely to be more expensive than truly scalar ALUs! :)

To illustrate that, let's take G80. You have 8-wide ALUs with 32-pixel batches, or 16-vertex batches. Let us take the former case; that means a single scalar register can be 32-wide! So, rather than reading registers 1/2/3 simultaneously for a MADD, just first read register 1, then read register 2, then read register 3 - in three different clock cycles. That's fundamentally free.

So, you have your single-ported register file there; you could do some smarter things, but that's the basic idea. So now you have your registers loaded and you just go through the ALU pipeline. The only catch there is that now you'll have 4x more pixels in flight (-> 4x more registers!) in your ALUs. So, to fix that, you can just allow multiple instructions of the same thread/batch/pixel/whatever to be worked on at once. Consider a vec4 instruction, excluding dot products which are horizontal; what prevents you from just having four independent scalar ADD instructions in the pipeline at the same time? Nothing.
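To make that concrete, here's a minimal sketch (assumed pipeline depth, not NVIDIA's actual scheduler) of how one vec4 ADD decomposes into four independent scalar ADDs that can all sit in the ALU pipeline at once:

```python
# A minimal sketch of the point above: a vec4 ADD has no horizontal dependency,
# so it splits into four independent scalar ADDs that overlap in the pipeline
# instead of requiring 4x more pixels (and registers) to keep the ALU busy.

PIPE_DEPTH = 4  # hypothetical ALU pipeline depth, in clocks

def issue_vec4_add(dst, a, b):
    """Split one vec4 ADD into four independent scalar ops."""
    return [(f"{dst}.{c}", f"{a}.{c}", f"{b}.{c}") for c in "xyzw"]

ops = issue_vec4_add("r2", "r0", "r1")

# Issue one scalar op per clock; none depends on another, so all four are
# in flight together and the last one retires at cycle 3 + PIPE_DEPTH.
for cycle, (dst, src_a, src_b) in enumerate(ops):
    print(f"cycle {cycle}: issue {dst} = {src_a} + {src_b}, "
          f"retires at cycle {cycle + PIPE_DEPTH}")
```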

So, there are three semi-obvious ways to implement that. The first is to implement actual vec4/vec3/etc. instructions that let the scheduler know it can send multiple instructions at the same time; this gives subpar results in more complicated scenarios, however. The second is to use a register scoreboard. The third is to implement an actual OoOE engine, which obviously isn't much of an option if you want high perf/mm2!

Erik Lindholm, who is the lead behind the G80 Shader Core architecture, has a patent on that second option. It should be relatively cheap to implement in hardware, and extremely efficient *if* the compiler does its job properly. So, unlike in past architectures, the compiler should both reduce register usage AND reduce the number of threads necessary at any given time to hide latency. Ideally, you want it to mix both goals and find the best compromise in terms of latency tolerance. This is one of the few places where there is clear and obvious potential for driver-based performance boosts. Optimality requires a lot of clever instruction reordering and intelligent register usage.

In case you aren't familiar with the concept, the basic idea of a scoreboard (excluding instruction windows) is that whenever an instruction writes to a register (or rather, 16 or 32 registers, in G80's case, since that's the granularity!), it reserves that register and any instruction that reads that register will block further scheduling until the register has been written to.
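A toy sketch of that idea, assuming only the simple reserve-and-block semantics described above (not the actual patent mechanism):

```python
# A minimal register-scoreboard sketch: a write reserves its destination
# register, and any later instruction touching a reserved register stalls
# scheduling until the outstanding write completes.

class Scoreboard:
    def __init__(self):
        self.pending_writes = set()  # registers with an outstanding write

    def can_issue(self, reads, writes):
        # Block on read-after-write; also block write-after-write for simplicity.
        return not (set(reads) | set(writes)) & self.pending_writes

    def issue(self, reads, writes):
        if not self.can_issue(reads, writes):
            return False
        self.pending_writes |= set(writes)
        return True

    def write_complete(self, reg):
        self.pending_writes.discard(reg)

sb = Scoreboard()
sb.issue(reads=["r0", "r1"], writes=["r2"])       # producer issues, r2 reserved
print(sb.can_issue(reads=["r2"], writes=["r3"]))  # False: consumer must wait
sb.write_complete("r2")
print(sb.can_issue(reads=["r2"], writes=["r3"]))  # True: hazard cleared
```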

So, overall what I'm trying to say here is that through clever design, and in the specific case of a GPU, there is nothing that makes a scalar ALU more expensive than a Vec4 ALU, per-se. Furthermore, a Vec4 ALU will need some hacks for dot products, which might or might not complicate its design - but at the very least, it's likely to reduce its clockability (this is the often cited reason for CELL not to implement dot products...)

I'm not going to say a Vec4 ALU (from the shader writer's pov) is a bad design choice in itself, it depends on your architecture a bit, but if you design things from scratch around a scalar ALU, I don't think it's going to cost you much of anything extra, and it's nearly certainly going to cost you (a lot?) less than sophisticated hacks like the ones you proposed! ;)


Uttar
 
It's also more than likely that AMD has some retail G80s to compare to their own unreleased R600.

I wonder if the 5-10% are a worst case estimate or whether they are comparing hand-tuned stuff for their R600 against G80, and getting 10% in the best case. The former would be okish, but the latter ... ohoh.

I'm very curious to hear some developer opinions on R600 (like "It rocks! 10x faster than G80 xD") ... something is really weird this time, no leaked 3dMarks, nothing ... same goes for the G80 refresh, we should hear something about it already.
 
I wonder if the 5-10% are a worst case estimate or whether they are comparing hand-tuned stuff for their R600 against G80, and getting 10% in the best case. The former would be okish, but the latter ... ohoh.

I'm very curious to hear some developer opinions on R600 (like "It rocks! 10x faster than G80 xD") ... something is really weird this time, no leaked 3dMarks, nothing ... same goes for the G80 refresh, we should hear something about it already.

There is a rule about keeping secrets: don't tell anyone. The second rule is that if you have to tell someone, only tell someone you trust.

Obviously as launch approaches, you end up having to tell people you "don't trust", but it seems to me that over the last few major launches, both AMD and Nvidia have been much, much better at locking down real information. They are both restricting information and pre-release hardware to smaller groups of more trusted people, tied in to stronger NDAs, and have been weeding out weaker people in the chain of information.
 
There is a rule about keeping secrets: don't tell anyone. The second rule is that if you have to tell someone, only tell someone you trust.

Obviously as launch approaches, you end up having to tell people you "don't trust", but it seems to me that over the last few major launches, both AMD and Nvidia have been much, much better at locking down real information. They are both restricting information and pre-release hardware to smaller groups of more trusted people, tied in to stronger NDAs, and have been weeding out weaker people in the chain of information.

Even so, some of the G80 stuff leaked months before launch; it's just that no one believed it.
 
Even so, some of the G80 stuff leaked months before launch; it's just that no one believed it.

Considering how long G80 was in development (several years), it's still pretty good that info only leaked a short while before launch. Even then, Nvidia's misinformation was so good that no one believed the actual specs. A few weeks before launch, we still had very well respected and connected people on these forums categorically stating that G80 would not be what it turned out to be.

Keeping the specs of new products secret before launch is common for big, competitive companies nowadays; we shouldn't be surprised that there are no leaks until the last possible minute. Keeping stuff under wraps has become more important to them than the hype from spoilers.
 
Who knows. I'm sure AMD has a backup plan if the GF 8900 (?) kicks too much butt.

But I doubt we'll see another "R520 fiasco". Hopefully the R600 will pwn. We need two players in this game. ;)
There seems to be a very strong assumption by several people in this thread that R600 must be substantially superior to G80, simply because it is coming out 6 months later. This assumes that ATI has always intended to launch R600 6 months after G80. I don't think that's necessarily true.

If you go back in time a bit, most of the rumours about the R600 release date suggested that R600 and G80 would be out almost simultaneously. Rumoured R600 dates then slipped from November to January, then February, and now March/April.

I think it's very likely that ATI intended R600 to be a direct, immediate rival to G80, but then simply didn't manage to get it out the door. Maybe they were unlucky compared with Nvidia and needed one more respin than G80 did? I don't know. But, either way, I think we are in a situation not unlike what happened with R520 (although not quite as bad!): Nvidia and ATI parts were supposed to launch at the same time, but the ATI one experienced long delays and the Nvidia one didn't. (This is the reason there were only about 3 months between R520 and R580 - they were being worked on by separate teams, and R580 came out more or less on time).

I may well be wrong; but I can't see any reason to assume that ATI did intend there to be such a long gap between G80 and R600. And I think it is therefore very rash to assume that R600 must necessarily be a "next generation after G80" product; it's at least as likely to be the same generation with directly comparable performance.
 
There seems to be a very strong assumption by several people in this thread that R600 must be substantially superior to G80, simply because it is coming out 6 months later.
Count me out of that.

My assumption is based solely on the bandwidth.

There's always the chance that ATI lost the plot totally and designed something with ~120GB/s+ but the TMU/ROP/streamout/constant-buffer capability of ~G80. Utterly wasteful and entirely out of character...

Jawed
 
My assumption is based solely on the bandwidth.
We don't know how efficient either architecture is in terms of bandwidth. Furthermore, if you compared 16x CSAA and 8x MSAA, or 16xQ CSAA and 12x MSAA (this is 100% hypothetical), and one IHV optimized for the CSAA and the other for MSAA...

The architectures are vastly different in terms of how they make use of the available memory bandwidth, anyway. Consider the marketing information on R5xx's Z-compression: it can get up to insanely high compression ratios with FSAA. Now look at G7x, or even G8x: where are the marketing numbers for that? They're nowhere to be seen. And that's not just because they didn't bother to brag about it - but because NVIDIA's compression ratios don't *seem* to scale as much with MSAA. On the other hand, with more MSAA, they are more likely to successfully compress tiles - so the effective ratio goes up anyway.

There are other factors there, such as how "hierarchical-Z" works. NVIDIA, afaik, works by rejecting entire triangles at a time, and their scheme is actually NOT hierarchical (!), and I wouldn't be surprised if they had little on-chip storage for it (just read the min/max from memory whenever the need arises!) - so, all of these factors have to be taken into consideration, and it'll be interesting to *try* to figure out how both G80 and R600 work there, in due time...

Some other things to take note of: texture cache. The G80 has 128KiB of L2 iirc - and a fair bit of L1 too. Overall, that's several times more than R580, I think. I'd expect NVIDIA and AMD's color-compression algorithms to be roughly identical, but that's hard to say - a naive implementation would stop compressing completely when there are two distinct colors per pixel even if you had 6x or 8x MSAA, while a more complicated one would handle that better.
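To illustrate the difference (both schemes here are hypothetical, not a documented G80 or R600 algorithm), compare a tile compressor that gives up as soon as a pixel has two distinct colours with one that stores the distinct colours plus per-sample indices:

```python
# Hedged sketch: "naive" only wins when every sample of a pixel shares one
# colour; the palette/index scheme keeps compressing while the number of
# distinct colours per pixel stays small (e.g. an 8x MSAA edge pixel).

import math

def naive_tile_bits(pixels, bpp=32):
    # Fall back to fully uncompressed storage if any pixel has >1 distinct colour.
    if any(len(set(samples)) > 1 for samples in pixels):
        return sum(len(samples) for samples in pixels) * bpp
    return len(pixels) * bpp  # one colour per pixel

def indexed_tile_bits(pixels, bpp=32):
    # Store each pixel's distinct colours plus a per-sample palette index.
    total = 0
    for samples in pixels:
        distinct = len(set(samples))
        index_bits = max(1, math.ceil(math.log2(distinct)))
        total += distinct * bpp + len(samples) * index_bits
    return total

# 8x MSAA pixel on a triangle edge: two distinct colours across eight samples.
edge_pixel = ["fg"] * 5 + ["bg"] * 3
tile = [edge_pixel] * 4
print(naive_tile_bits(tile))    # 1024 bits: gave up, stored every sample
print(indexed_tile_bits(tile))  # 288 bits: still compresses well
```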

It'll be interesting to compare R600 and G80 at similar memory bandwidth numbers though, that's for sure! :) Although that might be even fairer if we had a G80 GDDR4 sample so that the same memory types are used - hmm!

And before anyone thinks I'm trying to say NVIDIA is more efficient than AMD in terms of compression, caches, etc. - I'm not saying that. Especially so when we don't know how R600 will differ from R580 from that point of view, anyway! For all we know, R600 might be more efficient than G80 there; but then, you'd seriously ponder how it's going to make use of all that memory bandwidth...


Uttar
 
Count me out of that.

My assumption is based solely on the bandwidth.

There's always the chance that ATI lost the plot totally and designed something with ~120GB/s+ but the TMU/ROP/streamout/constant-buffer capability of ~G80. Utterly wasteful and entirely out of character...

Jawed

I agree. As I've been saying since the 512-bit bus came to light, "what the heck does R600 need that memory bandwidth for?" Maybe the answer is pretty simple - there's nothing special about the increased AA (which I'm expecting) or physics; it could just be that R600 has the pure processing ability to fill that bandwidth in normal use. It has enough TMUs and vertex/shader/geometry processors, running fast enough, that it can utilise all that extra bandwidth, and maybe needs it in order not to choke the performance the chip is capable of.

If you're going to pull out all the stops (as Nvidia did with G80), then this is the time in the product lifecycle to do it, i.e. at the DX10/Vista/GPGPU inflection point. We've never had three important inflection points all arrive together like this, and this is the time to bring your best game to the market.
 
Well, the idea is a bit out-of-this-world, but original and makes sense in theory, I guess - I don't think it makes much sense in practice though, simply because it's likely to be more expensive than truly scalar ALUs! :)
Truly scalar ALUs, seen in G80, don't solve the batch-size/dynamic branching problem. In fact, they make it worse (since more objects have to be issued in parallel). G80 partly attacks this problem by splitting each of its shader clusters into two independent SIMD arrays, each of 8 objects. But the cost is that each half now requires its own dedicated instruction fetch/decode, branching and register fetch/store logic. So, G80 suffers a ~2x cost there.
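As a toy estimate of that batch-size cost (made-up probabilities and instruction counts, not measured numbers): if even one object in a SIMD batch takes a branch, the whole batch executes it, so smaller batches waste fewer ALU slots on divergent code.

```python
# Hypothetical divergence-cost model: count ALU slots burned by lanes that are
# dragged through a branch they didn't need, for different batch sizes.

import random

random.seed(0)
TAKEN_PROB = 0.10     # assumed fraction of pixels that take the expensive path
PIXELS = 4096
BRANCH_COST = 20      # assumed ALU instructions inside the branch

def wasted_slots(batch_size):
    waste = 0
    for _ in range(PIXELS // batch_size):
        batch_takes = [random.random() < TAKEN_PROB for _ in range(batch_size)]
        if any(batch_takes):
            # Every lane in the batch runs the branch; the non-taking lanes
            # are wasted work.
            waste += batch_takes.count(False) * BRANCH_COST
    return waste

for size in (64, 32, 16, 4):
    print(f"batch of {size:3d}: {wasted_slots(size):6d} wasted ALU slots")
```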

To illustrate that, let's take G80. You have 8-wide ALUs with 32-pixel batches, or 16-vertex batches. Let us take the former case; that means a single scalar register can be 32-wide! So, rather than reading registers 1/2/3 simultaneously for a MADD, just first read register 1, then read register 2, then read register 3 - in three different clock cycles. That's fundamentally free.
No, you can't do that. You can't have an operand read rate that's slower than your ALU retire rate. All 3 operands must be read in parallel. And then there's the special function co-issue as well, which requires another operand fetch. Go back and re-design your register file ;) NVidia even provides a handy patent to help you...
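Back-of-envelope, with assumed numbers, the objection looks like this: if the ALU wants to retire one MADD per clock, the operand fetch rate has to keep up.

```python
# Hypothetical numbers: a MADD needs three operands, and a co-issued special
# function op needs one more. If the register file delivers fewer operands per
# clock than that, operand fetch (not the ALU) sets the sustained rate.

alu_madds_per_clock = 1       # desired retire rate
operands_per_madd = 3         # a, b, c for a*b + c
sfu_operands_per_clock = 1    # assumed extra fetch for the co-issued SF op

for read_ports in (1, 2, 4):
    fetch_clocks = (operands_per_madd + sfu_operands_per_clock) / read_ports
    sustained = min(alu_madds_per_clock, 1 / fetch_clocks)
    print(f"{read_ports} operand(s)/clock -> {sustained:.2f} MADD/clock sustained")
```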

Erik Lindholm, who is the lead behind the G80 Shader Core architecture, has a patent on that second option. It should be relatively cheap to implement in hardware, and extremely efficient *if* the compiler does its job properly. So, unlike in past architectures, the compiler should both reduce register usage AND reduce the number of threads necessary at any given time to hide latency. Ideally, you want it to mix both goals and find the best compromise in terms of latency tolerance. This is one of the few places where there is clear and obvious potential for driver-based performance boosts. Optimality requires a lot of clever instruction reordering and intelligent register usage.
Agreed, e.g. with clever compilation, it's possible to "unroll" vector operations across a clause of code such that the cost of intermediate temporary registers is much-reduced. I would hope that the bulk of this has already been put into G80's compiler, it's the low-hanging fruit.
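As a toy illustration of that unrolling (hypothetical instruction traces, not real G80 ISA), interleaving the dependent scalar ops lets one temporary be recycled instead of keeping a whole vec4 of intermediates live:

```python
# Compare peak live temporaries for two schedules of d = a * b + c. Fewer live
# registers per pixel means more pixels in flight for the same register file.

def peak_live_temps(schedule):
    live, peak = set(), 0
    for writes, frees in schedule:
        live |= set(writes)
        peak = max(peak, len(live))
        live -= set(frees)
    return peak

# "Vector-wise": all four products are live before any of the adds consume them.
vector_style = [(["t.x"], []), (["t.y"], []), (["t.z"], []), (["t.w"], []),
                ([], ["t.x"]), ([], ["t.y"]), ([], ["t.z"]), ([], ["t.w"])]

# Same math unrolled per component: each product is consumed immediately.
scalar_style = [(["t"], ["t"])] * 4

print(peak_live_temps(vector_style))  # 4 scalar temporaries live at the peak
print(peak_live_temps(scalar_style))  # 1
```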

The scalar ALU and the independent special-function co-issue are dramatically easier to write a compiler for than G7x. While the SF ALU is also the interpolator, it's still less complex than the G7x pipeline design, where SF was split asymmetrically across both ALUs and where TEX locked out most of the first ALU. That was just a mess.

In case you aren't familiar with the concept, the basic idea of a scoreboard (excluding instruction windows) is that whenever an instruction writes to a register (or rather, 16 or 32 registers, in G80's case, since that's the granularity!), it reserves that register and any instruction that reads that register will block further scheduling until the register has been written to.
Without a similar scoreboard Xenos wouldn't work ;)

So, overall what I'm trying to say here is that through clever design, and in the specific case of a GPU, there is nothing that makes a scalar ALU more expensive than a Vec4 ALU, per-se. Furthermore, a Vec4 ALU will need some hacks for dot products, which might or might not complicate its design - but at the very least, it's likely to reduce its clockability (this is the often cited reason for CELL not to implement dot products...)
Xenon clocks at the same speed as Cell and has a DP4 pipeline.

The hack required for a DP is nothing more than a loop-back register in the pipeline that holds the intermediate result from the first component until the second component has been multiplied, then added, then fed-back to the loop-back. Nothing particularly amazing.
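As a toy model of that loop-back (not Xenon's actual pipeline), a DP4 becomes four dependent MADD passes through the same pipe, with a feedback register carrying the running sum:

```python
# Sketch of a dot product via a loop-back register: each pass multiplies one
# component pair and adds the fed-back partial sum from the previous pass.

def dp4_loopback(a, b):
    loopback = 0.0                       # feedback register, cleared at start
    for ai, bi in zip(a, b):             # one component per pass
        loopback = ai * bi + loopback    # MADD: multiply, add the fed-back sum
    return loopback

print(dp4_loopback([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]))  # 20.0
```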

I'm not going to say a Vec4 ALU (from the shader writer's pov) is a bad design choice in itself, it depends on your architecture a bit, but if you design things from scratch around a scalar ALU, I don't think it's going to cost you much of anything extra, and it's nearly certainly going to cost you (a lot?) less than sophisticated hacks like the ones you proposed! ;)
The ALU design I'm proposing is nothing more than a low-cost-per-clock SIMD array. It has:
  • 1-cycle ADD or MUL
  • 2-cycle MADD
  • at least a 2-cycle RCP
  • up to 8-cycle SIN/RSQ/EX2/LOG
You can feed it:
  • 4 objects in parallel, with all 4 components (i.e. RGBA or XYZW)
  • 4 objects in parallel with 3 components
  • 8 objects, with 2 components
  • 16 objects, with 1 component each
The register file arrangement I've proposed enables these different fetch patterns and on top of that enables thread packing, which allows the GPU to ignore quads that are predicated-out and swap-in quads that require execution. Using one of those scoreboard thingies to keep track of per thread per quad predication...
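Here's a rough sketch of those fetch patterns as I read them (assuming a 16-lane array; my numbers, not the patent's): the same lanes get filled with more objects as the per-object component count drops, and in this naive model a 3-component fetch leaves a quarter of the lanes idle.

```python
# Hypothetical 16-lane SIMD array (4 objects x 4 components at full width),
# fed with different object/component groupings per clock.

LANES = 16

def pack(num_objects, components_per_object):
    used = num_objects * components_per_object
    assert used <= LANES
    names = "xyzw"[:components_per_object]
    # Each lane gets (object id, component) for one clock's worth of work.
    lanes = [(obj, comp) for obj in range(num_objects) for comp in names]
    return lanes, used / LANES

for objs, comps in [(4, 4), (4, 3), (8, 2), (16, 1)]:
    lanes, util = pack(objs, comps)
    print(f"{objs} objects x {comps} component(s): "
          f"{len(lanes)} lanes used, {util:.0%} of the array")
```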

The pay-back for the cost of all this logic is a dramatic increase in throughput:
  • the pipeline acts like a "scalar" ALU, which increases the per-component utilisation
  • ALU component utilisation lost to predicated-out objects is reduced because the "effective batch size" can be made significantly smaller
Obviously, all of this is just based upon my interpretation of the patent documents.

Jawed
 
We don't know how efficient either architecture is in terms of bandwidth.
Still waiting for an in-depth B3D review of G80 performance :cry: what's the matter with a DX9 part 1 and a D3D10 part 2?

Furthermore, if you compared 16x CSAA and 8x MSAA, or 16xQ CSAA and 12x MSAA (this is 100% hypothetical), and one IHV optimized for the CSAA and the other for MSAA...
Remember Xenos has 4xAA per clock ROPs and RV530 has double-rate ROPs whether AA is off or on. Whether ATI chose to deploy these techniques in R600 remains to be seen though.

... Z-Compression ... hierarchical-Z ...
It would be nice if someone actually bothered to investigate why R5xx is twice as fast as G7x in Oblivion foliage. Methinks it has something to do with Z and early rejection, etc.

Some other things to take note of: texture cache. The G80 has 128KiB of L2 iirc - and a fair bit of L1 too. Overall, that's several times more than the R580 I think.
In my opinion, the L2 cache in G80 serves five purposes:
  • vertex buffers
  • texels
  • constant buffers
  • colour/z/stencil buffers
  • streamout buffers
it would be nice to know for sure...

I'd expect NVIDIA and AMD's color-compression algorithms to be roughly identical, but that's hard to say - a naive implementation would stop compressing completely when there are two distinct colors per pixel even if you had 6x or 8x MSAA, while a more complicated one would handle that better.
Then there's the ATI patent application for colour compression in floating point render targets with AA.

http://www.beyond3d.com/forum/showpost.php?p=888727&postcount=1293

to be fair, that's compression of the colour in the AA samples, rather than compression of colour, per se.

Jawed
 
If you go back in time a bit, most of the rumours about the R600 release date suggested that R600 and G80 would be out almost simultaneously. Rumoured R600 dates then slipped from November to January, then February, and now March/April.

Where are you getting THAT info from?

When we first had leaks about R600 (mid-2005), we were still waiting on R520 and G70 had just been announced/released.
At that time we actually expected G80 to rival R580; G80 should have launched in May 2006 (around Computex).
R600 has always had 2006's holiday season as its target, never being a straight launch competitor to G80. G80 was half a year late, and R600 will be half a year late.

In the end... it'll be an 80nm R600 vs. an 80nm G81.
 