memberSince97
Newcomer
FMarks what else
With all those extra transistors, a lower process node, all that memory bandwidth and the 512-bit memory controller, there's got to be something more in R600 that we don't yet know about.
No offense, but you have no idea what you're talking about; there's big confusion here, and I would revise your numbers if I were you. R580 had only 16 texture units vs. G71's 24 texture units.
If the Radeon R580 series had 24 or 32 texture units, it would simply destroy G71 in performance.
The 48 pixel shaders on R580 were awesome vs. only the 24 found in G71.
Well, the idea is a bit out-of-this-world, but original and makes sense in theory, I guess - I don't think it makes much sense in practice though, simply because it's likely to be more expensive than truly scalar ALUs!
So, in short, there's no way I can back any of this up. I'm hopeful that there'll be a discussion, that's all...
It's also more than likely that AMD has some retail G80s to compare to their own unreleased R600.
I wonder if the 5-10% are a worst case estimate or whether they are comparing hand-tuned stuff for their R600 against G80, and getting 10% in the best case. The former would be okish, but the latter ... ohoh.
I'm very curious to hear some developer opinions on R600 (like "It rocks! 10x faster than G80 xD") ... something is really weird this time, no leaked 3DMarks, nothing ... same goes for the G80 refresh, we should have heard something about it by now.
There is a rule about keeping secrets: don't tell anyone. The second rule is that if you have to tell someone, only tell someone you trust.
Obviously as launch approaches, you end up having to tell people you "don't trust", but it seems to me that over the last few major launches, both AMD and Nvidia have been much, much better at locking down real information. They are both restricting information and pre-release hardware to smaller groups of more trusted people, tied in to stronger NDAs, and have been weeding out weaker people in the chain of information.
Mind you, some of the G80 stuff did leak months before launch; it's just that no one believed it.
There seems to be a very strong assumption by several people in this thread that R600 must be substantially superior to G80, simply because it is coming out 6 months later. This assumes that ATI has always intended to launch R600 6 months after G80. I don't think that's necessarily true.
Who knows. I'm sure AMD has a backup plan if the GF 8900 (?) kicks too much butt.
But I doubt we'll see another "R520 fiasco". Hopefully the R600 will pwn. We need two players in this game.
Count me out of that.
There seems to be a very strong assumption by several people in this thread that R600 must be substantially superior to G80, simply because it is coming out 6 months later.
We don't know how efficient either architecture is in terms of bandwidth. Furthermore, if you compared 16x CSAA and 8x MSAA, or 16xQ CSAA and 12x MSAA (this is 100% hypothetical), and one IHV optimized for the CSAA and the other for MSAA...
My assumption is based solely on the bandwidth.
There's always the chance that ATI lost the plot totally and designed something with ~120GB/s+ but the TMU/ROP/streamout/constant-buffer capability of ~G80. Utterly wasteful and entirely out of character...
Jawed
Truly scalar ALUs, seen in G80, don't solve the batch-size/dynamic branching problem. In fact, they make it worse (since more objects have to be issued in parallel). G80 partly attacks this problem by splitting each of its shader clusters into two independent SIMD arrays, each of 8 objects. But the cost is that each half now requires its own dedicated instruction fetch/decode, branching and register fetch/store logic. So, G80 suffers a ~2x cost there.
Well, the idea is a bit out-of-this-world, but original and makes sense in theory, I guess - I don't think it makes much sense in practice though, simply because it's likely to be more expensive than truly scalar ALUs!
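To put a rough number on the batch-size point above, here's a toy calculation of what a divergent branch costs as the batch gets wider. The path costs, the 90% figure and the assumption that pixels branch independently of each other are all invented for illustration (real shaders branch far more coherently than this); it's only meant to show the trend.

# Expected per-batch cost of an if/else when the whole batch must run
# both sides as soon as even one object diverges. Numbers are hypothetical.
def expected_branch_cost(batch, p_cheap, cost_cheap, cost_expensive):
    all_cheap = p_cheap ** batch
    all_expensive = (1 - p_cheap) ** batch
    both = 1 - all_cheap - all_expensive
    return (all_cheap * cost_cheap
            + all_expensive * cost_expensive
            + both * (cost_cheap + cost_expensive))

for batch in (8, 16, 32, 64):
    print(batch, round(expected_branch_cost(batch, 0.9, 4, 40), 1))
# Smaller batches dodge the pay-for-both-paths case noticeably more often
# than 32- or 64-object ones, which is the problem described in the post.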
No, you can't do that. You can't have an operand read rate that's slower than your ALU retire rate. All 3 operands must be read in parallel. And then there's the special function co-issue as well, which requires another operand fetch. Go back and re-design your register file; NVidia even provides a handy patent to help you...
To illustrate that, let's take G80. You have 8-wide ALUs with 32-pixel batches, or 16-vertex batches. Let us take the former case; that means a single scalar register can be 32-wide! So, rather than reading registers 1/2/3 simultaneously for a MADD, just first read register 1, then read register 2, then read register 3 - in three different clock cycles. That's fundamentally free.
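For what it's worth, the retire-rate objection can be written as a small operand budget. The figures simply restate the ones quoted in the post (8-wide ALU, 3 MADD sources, one more for the co-issued special-function op); the "serial reads" case assumes each register read is only one ALU-width wide, which is precisely the point being argued over, so treat this as a sketch of the objection rather than a verdict.

# Operands needed to sustain one MADD per lane per clock, plus co-issue.
lanes = 8                    # scalar MADDs retired per clock
madd_sources = 3             # a * b + c
sfu_sources = 1              # co-issued special-function operand

needed_per_clock = lanes * (madd_sources + sfu_sources)   # 32 scalar reads/clock
serial_reads_per_clock = lanes                            # one ALU-wide read per clock

print(needed_per_clock, serial_reads_per_clock)
# 32 vs. 8: if the three MADD sources arrive in separate clocks at ALU
# width, operand fetch, not the ALU, caps throughput well below the
# retire rate - hence "all 3 operands must be read in parallel".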
Agreed, e.g. with clever compilation, it's possible to "unroll" vector operations across a clause of code such that the cost of intermediate temporary registers is much reduced. I would hope that the bulk of this has already been put into G80's compiler; it's the low-hanging fruit.
Erik Lindholm, who is the lead behind the G80 Shader Core architecture, has a patent on that second option. It should be relatively cheap to implement in hardware, and extremely efficient *if* the compiler does its job properly. So, unlike in past architectures, the compiler should both reduce register usage AND reduce the number of threads necessary at any given time to hide latency. Ideally, you want it to mix both goals and find the best compromise in terms of latency tolerance. This is one of the few places where there is clear and obvious potential for driver-based performance boosts. Optimality requires a lot of clever instruction reordering and intelligent register usage.
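To make the "reordering reduces register usage" point concrete, here's a toy register-pressure counter; it has nothing to do with NVIDIA's actual compiler, and the two schedules and register names are invented. It just counts how many clause-local values are live at once for a vector-style ordering vs. a per-component ordering of the same math.

# Walk a schedule backwards and track peak liveness of clause-local values.
def peak_live(schedule, outputs, clause_locals):
    live = set(outputs)
    peak = len(live & clause_locals)
    for dst, srcs in reversed(schedule):
        live.discard(dst)        # the value is created here
        live.update(srcs)        # its sources must already be live
        peak = max(peak, len(live & clause_locals))
    return peak

# r = (a + b) * (c + d), done per component of a vec4, two ways.
vector_order = (                 # "vector style": all the adds, then all the muls
    [("t%d" % i, ["a%d" % i, "b%d" % i]) for i in range(4)] +
    [("u%d" % i, ["c%d" % i, "d%d" % i]) for i in range(4)] +
    [("r%d" % i, ["t%d" % i, "u%d" % i]) for i in range(4)])

interleaved = []                 # finish each component before starting the next
for i in range(4):
    interleaved += [("s0", ["a%d" % i, "b%d" % i]),
                    ("s1", ["c%d" % i, "d%d" % i]),
                    ("r%d" % i, ["s0", "s1"])]

outputs = {"r%d" % i for i in range(4)}
clause_locals = (outputs
                 | {"t%d" % i for i in range(4)}
                 | {"u%d" % i for i in range(4)}
                 | {"s0", "s1"})

print(peak_live(vector_order, outputs, clause_locals))  # 8 temporaries live at the peak
print(peak_live(interleaved, outputs, clause_locals))   # 5: three results plus s0/s1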
Without a similar scoreboard Xenos wouldn't work.
In case you aren't familiar with the concept, the basic idea of a scoreboard (excluding instruction windows) is that whenever an instruction writes to a register (or rather, 16 or 32 registers in G80's case, since that's the granularity!), it reserves that register, and any instruction that reads that register will block further scheduling until the register has been written to.
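A minimal sketch of that scoreboard idea, with the granularity simplified to single register names (per the post, in G80's case each entry would really cover a whole batch-wide register); the instruction format here is made up.

# Each write in flight reserves its destination; anything reading (or
# re-writing) that register blocks further scheduling until writeback.
from collections import namedtuple

Instr = namedtuple("Instr", "op dst srcs")

class Scoreboard:
    def __init__(self):
        self.pending = set()              # registers with an outstanding write

    def can_issue(self, instr):
        # RAW hazard on any source, or WAW hazard on the destination
        return not ((set(instr.srcs) | {instr.dst}) & self.pending)

    def issue(self, instr):
        self.pending.add(instr.dst)       # reserve the result register

    def writeback(self, reg):
        self.pending.discard(reg)         # data arrived; dependents may now issue

sb = Scoreboard()
tex = Instr("TEX", "r0", ["r1"])          # long-latency texture fetch into r0
mad = Instr("MAD", "r2", ["r0", "r3", "r4"])

sb.issue(tex)
print(sb.can_issue(mad))   # False - r0 is reserved until the fetch returns
sb.writeback("r0")
print(sb.can_issue(mad))   # True  - the dependency has cleared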
Xenon clocks at the same speed as Cell and has a DP4 pipeline.
So, overall what I'm trying to say here is that through clever design, and in the specific case of a GPU, there is nothing that makes a scalar ALU more expensive than a Vec4 ALU per se. Furthermore, a Vec4 ALU will need some hacks for dot products, which might or might not complicate its design - but at the very least, it's likely to reduce its clockability (this is the often-cited reason for CELL not implementing dot products...)
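Just to illustrate the dot-product aside: on a purely scalar MADD pipe a DP4 is nothing more than four dependent ops, with no horizontal add across lanes, which is the kind of extra datapath a Vec4 unit has to bolt on and which is blamed above for hurting clockability. A sketch, not any particular ISA.

def dp4_scalar(a, b):
    # MUL, then three dependent MADDs - each is an ordinary scalar op.
    acc = a[0] * b[0]
    acc = a[1] * b[1] + acc
    acc = a[2] * b[2] + acc
    acc = a[3] * b[3] + acc
    return acc

print(dp4_scalar((1, 2, 3, 4), (5, 6, 7, 8)))   # 70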
The design for an ALU I'm proposing is nothing more than a SIMD array with a low cost per clock. It has:
I'm not going to say a Vec4 ALU (from the shader writer's pov) is a bad design choice in itself, it depends on your architecture a bit, but if you design things from scratch around a scalar ALU, I don't think it's going to cost you much of anything extra, and it's nearly certainly going to cost you (a lot?) less than sophisticated hacks like the ones you proposed!
Still waiting for an in-depth B3D review of G80 performance; what's the matter with a DX9 part 1 and a D3D10 part 2?
We don't know how efficient either architecture is in terms of bandwidth.
Remember Xenos has 4xAA per clock ROPs and RV530 has double-rate ROPs whether AA is off or on. Whether ATI chose to deploy these techniques in R600 remains to be seen, though.
Furthermore, if you compared 16x CSAA and 8x MSAA, or 16xQ CSAA and 12x MSAA (this is 100% hypothetical), and one IHV optimized for the CSAA and the other for MSAA...
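To spell out what "4xAA per clock" buys at the ROP, a trivial bit of arithmetic; the sample rates here are illustrative parameters, not vendor specifications.

def rop_clocks_per_pixel(aa_samples, samples_per_rop_per_clock):
    # Ceiling division: clocks one ROP spends writing/blending one pixel
    return -(-aa_samples // samples_per_rop_per_clock)

print(rop_clocks_per_pixel(4, 1))   # 4 clocks: conventional one-sample-per-clock ROP
print(rop_clocks_per_pixel(4, 4))   # 1 clock: Xenos-style ROP, so 4xAA costs no fillrate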
It would be nice if someone actually bothered to investigate why R5xx is twice as fast as G7x in Oblivion foliage. Methinks it has something to do with Z and early rejection, etc.... Z-Compression ... hierarchical-Z ...
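A sketch of the kind of tile-level early rejection being hinted at (hierarchical-Z); the tile bookkeeping and the LESS depth test are assumptions for illustration, not R5xx's documented scheme.

def hier_z_reject(tile_max_z, incoming_min_z):
    # With a LESS depth test: if the nearest incoming depth is still no
    # nearer than the farthest depth already stored for the tile, every
    # fragment in the tile fails, and the whole block is thrown away
    # before any shading or per-pixel Z reads - the sort of saving that
    # adds up fast in heavily overdrawn foliage.
    return incoming_min_z >= tile_max_z

print(hier_z_reject(0.4, 0.7))   # True: whole tile culled in one test
print(hier_z_reject(0.4, 0.2))   # False: fall through to the per-pixel test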
In my opinion, the L2 cache in G80 serves five purposes:
Some other things to take note of: texture cache. The G80 has 128KiB of L2 iirc - and a fair bit of L1 too. Overall, that's several times more than the R580 has, I think.
Then there's the ATI patent application for colour compression in floating-point render targets with AA.
I'd expect NVIDIA and AMD's color-compression algorithms to be roughly identical, but that's hard to say - a naive implementation would stop compressing completely when there are two distinct colors per pixel even if you had 6x or 8x MSAA, while a more complicated one would handle that better.
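A toy version of that naive-vs-smarter distinction; the formats are invented and say nothing about what either IHV actually ships.

def compress_naive(samples):
    # Gives up the moment a pixel isn't fully covered by one colour.
    colours = list(dict.fromkeys(samples))
    if len(colours) == 1:
        return ("1col", colours[0])
    return ("raw", list(samples))

def compress_two_colour(samples):
    # Also handles the common case of exactly two colours per pixel
    # (one triangle edge): two colours plus a one-bit-per-sample mask.
    colours = list(dict.fromkeys(samples))
    if len(colours) == 1:
        return ("1col", colours[0])
    if len(colours) == 2:
        return ("2col", colours, [colours.index(s) for s in samples])
    return ("raw", list(samples))

edge_pixel = ["red"] * 5 + ["blue"] * 3          # an 8xAA pixel on a triangle edge
print(compress_naive(edge_pixel)[0])             # 'raw'  - no compression at all
print(compress_two_colour(edge_pixel)[0])        # '2col' - still compressed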
If you go back in time a bit, most of the rumours about the R600 release date suggested that R600 and G80 would be out almost simultaneously. Rumoured R600 dates then slipped from November to January, then February, and now March/April.