The LAST R600 Rumours & Speculation Thread

I agree. As I've been saying since the 512-bit bus came to light, "what the heck does R600 need that memory bandwidth for?" Maybe the answer is pretty simple - there's nothing special about the increased AA (which I'm expecting) or physics; it could just be that R600 has the pure processing ability to fill that bandwidth in normal use. It has enough TMUs and vertex/shader/geometry processors, running fast enough, that it can utilise all that extra bandwidth, and maybe needs it in order not to choke the performance the chip is capable of.
It could well be that ATI held back on TMU and ROP capabilities and put all that extra bandwidth in there for the new features of D3D10, i.e. streamout and constant buffers.

Streamout is moderately like a simplified ROP:
  • write only
  • writes up to four separate streams concurrently
  • prolly writes in tiled-bursts (which requires a significant amount of tiled buffering)
Constant buffers are like vertex buffers, except that they will be consumed by multiple program instances in parallel as well as multiple times per program, whereas VBs will tend to be accessed once (per invocation). So that also makes CBs similar to textures, except that CBs are never filtered.
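
To make the access-pattern difference concrete, here's a rough CUDA-flavoured sketch (purely an analogy with hypothetical names - a D3D10 CB isn't literally CUDA constant memory): every thread consumes the same CB entries, several times per program, while each VB element is read once per invocation.

  // Hypothetical sketch: constant-buffer-style vs vertex-buffer-style access.
  #include <cuda_runtime.h>

  __constant__ float4 cb[16];   // "CB": identical data broadcast to every thread
                                // (filled from the host via cudaMemcpyToSymbol)

  __global__ void transform(const float4* vb, float4* out, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;

      float4 v = vb[i];         // "VB": each element read once, by one invocation

      // CB entries cb[0..3] are re-read by every thread, multiple times each -
      // and never filtered, which is what distinguishes them from textures.
      float4 r;
      r.x = cb[0].x * v.x + cb[1].x * v.y + cb[2].x * v.z + cb[3].x * v.w;
      r.y = cb[0].y * v.x + cb[1].y * v.y + cb[2].y * v.z + cb[3].y * v.w;
      r.z = cb[0].z * v.x + cb[1].z * v.y + cb[2].z * v.z + cb[3].z * v.w;
      r.w = cb[0].w * v.x + cb[1].w * v.y + cb[2].w * v.z + cb[3].w * v.w;
      out[i] = r;
  }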

So, if you take the balance of these features, then you could argue that streamout and CB/VBs on top of TMUs and ROPs account for the extra bandwidth, without needing to pump up the TMUs and ROPs in any great fashion.

If you're going to pull out all the stops (as Nvidia did with G80), then this is the time in the product lifecycle to do it, i.e. at the DX10/Vista/GPGPU inflection point. We've never had three important inflection points all arrive together like this, and this is the time to bring your best game to the market.
Undoubtedly. But it's also the time of maximum risk, because there's so little experience with the new concepts. Additionally, there have been plenty of features cut from D3D10 for one reason or another, which means that one IHV might well have left them in while the other might never have had a chance in hell of getting there (hence the cut).

Jawed
 
And then there's the special function co-issue as well, which requires another operand fetch.
Very unlikely. The SFU unit needs to be set up by the ALU pipeline. As such, the necessary register is already loaded. And even if it wasn't, you are completely forgetting about the parallel data cache, which works as an extension of the register file, I believe. You've got an extra read and write there, if the compiler is smart about it. In fact, I suspect the compiler is being VERY smart about it already in the case of the vertex shader, but I haven't thought enough about it yet...
Agreed, e.g. with clever compilation, it's possible to "unroll" vector operations across a clause of code such that the cost of intermediate temporary registers is much-reduced. I would hope that the bulk of this has already been put into G80's compiler, it's the low-hanging fruit.
Yes, this is extremely simple to implement - but generalizing it to any independent instruction is a fair bit more complicated than just thinking about vector instructions, and could further help latency tolerance.
No, you can't do that. You can't have an operand read rate that's slower than your ALU retire rate. All 3 operands must be read in parallel.
Unless I'm horribly mistaken, you seem to have a mental block here. Your entire argument seems to revolve around this (and batch sizes), really, and I can't quite seem to figure out how you are justifying it! What you are proposing is a register file with 3 read ports and one write port, with each port being 256-bit wide (8 FP32 registers). Compare this approach with a single-port register file that is 1024-bit wide (32 FP32 registers). This could also be seen as X register files that are 256/X or 1024/X bits wide, respectively. Please note that we are feeding an 8-wide FP32 MADD ALU, not a 32-wide one, which is why this works.

I'm not implying you *have* to use multiple ports - but saying that "All 3 operands must be read in parallel" feels downright wrong to me, based on my understanding of the underlying concepts. If I am mistaken, I'd clearly appreciate an explanation - if you are, then hopefully this example was clear enough... :)
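
To put numbers on it, here's a back-of-the-envelope model (illustrative widths only, not claimed G80 specifics) comparing read supply against the operand demand of the 8-wide FP32 MAD ALU:

  // Toy comparison of the two register file organisations discussed above.
  #include <cstdio>

  int main()
  {
      const int fp32_bits = 32;

      // Option A: 3 read ports + 1 write port, each port 8 registers wide
      int a_read_supply = 3 * 8 * fp32_bits;           // 768 bits/clock

      // Option B: 1 port, 32 registers wide, reading on 3 of every 4 clocks
      // (the 4th clock is the write slot)
      int b_port_width  = 32 * fp32_bits;              // 1024 bits/clock
      int b_read_supply = b_port_width * 3 / 4;        // 768 bits/clock, averaged

      // Demand: 8 MAD lanes x 3 source operands x fp32, every clock
      int demand = 8 * 3 * fp32_bits;                  // 768 bits/clock

      printf("A: %d b/clk, B: %d b/clk (avg), demand: %d b/clk\n",
             a_read_supply, b_read_supply, demand);
      return 0;
  }

Both organisations supply exactly the 768 bits/clock the ALU retires, which is the point: the wide single port trades port count for width.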


Uttar
EDIT: I am thinking in terms of wide etc. which is not really the right way to visualize it, but it's an elegant abstraction that hopefully makes my point clearer, which is why I presented it that way.
 
An SM4 GPU has to support an arbitrary combination of 3 operands for a MAD:
  • r0, r1, r2
  • r27, r432, r4095
etc.

What fetch scheme supports a single-ported read of those operands in G80, in order to issue one MAD on each and every successive clock :?:

The only single-ported fetch scheme I can think of is one in which the entire 64KB for an object is read in one clock :p

Jawed
 
As G80 has shown and R600 rumours augur (with its >120GB/s bandwidth), fast, single-cycle 4xMSAA for PC GPUs is here, at least at the high-end. Presumably, to leverage architecture investments, the featuresets will scale all the way down through the product line, so G86, G83, RV630, RV610 et al would be capable of single-cycle 4x too. Are the new AA modes from both IHVs likely to be performant (perhaps only with minimal application?) on GPUs with <=128-bit (under 30GB/s) external memory interfaces?
 
Very unlikely. The SFU unit needs to be set up by the ALU pipeline.
If that were the case, then SF would never be supported in a co-issue alongside the primary ALU, it would always interrupt it - similar to how TEX in G7x always interrupts ALU 1. The setup you refer to seems to be a function of the SF pipeline, not the primary ALU.

As such, the necessary register is already loaded. And even if it wasn't, you are completely forgetting about the parallel data cache, which works as an extension of the register file, I believe. You've got an extra read and write there, if the compiler is smart about it.
True, I'm forgetting about PDC, which effectively functions like the "loop-back" register I referred to earlier. Great for things like DP.

But to sustain MAD over multiple clock cycles where each MAD uses independent operands, each clock requires all source operands to be fetched at the pipeline's retire rate. Otherwise we're talking about undocumented gotchas in the G80 pipeline, similar to the 4xfp32 gotcha suffered by NV4x/G7x.

I'm certainly open to hearing about these gotchas, but apparently they're all under the CUDA NDA...

Yes, this is extremely simple to implement - but generalizing it to any independent instruction is a fair bit more complicated than just thinking about vector instructions, and could further help latency tolerance.
Independent instructions are normally compiled-out, since they don't actually do anything that affects the result of a program.

So, I'm not sure what you mean by "independent instruction" here.

Jawed
 
What fetch scheme supports a single-ported read of those operands in G80, in order to issue one MAD on each and every successive clock :?:
As I said, consider an 8-component-wide ALU with a 32-component-wide register file.

Clock 1: Write register rA for threads [B;B+32[
Clock 2: Read register rX for threads [00;32[
Clock 3: Read register rY for threads [00;32[
Clock 4: Read register rZ for threads [00;32[
Clock 5: Send threads [00;08[ to ALU
Clock 6: Send threads [08;16[ to ALU
Clock 7: Send threads [16;24[ to ALU
Clock 8: Send threads [24;32[ to ALU

Consider that this is pipelined, and it's easy to see that this scheme works perfectly. During Clock 5, the results from threads [B+32;B+64[ are back, so those are being written; during Clocks [6;8], registers for threads [32;64[ are already being read. Et cetera, et cetera... Do notice that this would not work on a CPU since, by its very definition, a CPU only works on independent threads, and not "batches" of threads.
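
A toy simulation of that steady state, if it helps (batch size and register names lifted from the schedule above; the write-back latency is an arbitrary illustrative value):

  // One 32-wide single port alternates 1 write + 3 reads per 4-clock group,
  // while the 8-wide ALU retires a quarter of the previous batch each clock.
  #include <cstdio>

  int main()
  {
      for (int clk = 0; clk < 12; ++clk) {
          int group = clk / 4;   // each 4-clock group serves one 32-thread batch
          int phase = clk % 4;

          if (phase == 0) {      // the single write slot: older results land here
              if (group >= 2)    // (2-group latency is arbitrary, for illustration)
                  printf("clk %2d: RF write rA, batch %d\n", clk, group - 2);
              else
                  printf("clk %2d: RF write slot idle (warm-up)\n", clk);
          } else {               // three read slots: rX, rY, rZ for batch 'group'
              printf("clk %2d: RF read  r%c, batch %d\n", clk, "XYZ"[phase - 1], group);
          }

          if (group > 0)         // ALU consumes the previous batch, 8 threads/clock
              printf("         ALU: threads [%d;%d[ of batch %d\n",
                     phase * 8, phase * 8 + 8, group - 1);
      }
      return 0;
  }

The port is busy every clock and, after warm-up, so is the ALU - no bubbles, with one read port.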


Uttar
 
As G80 has shown and R600 rumours augur (with its >120GB/s bandwidth), fast, single-cycle 4xMSAA for PC GPUs is here, at least at the high-end. Presumably, to leverage architecture investments, the featuresets will scale all the way down through the product line, so G86, G83, RV630, RV610 et al would be capable of single-cycle 4x too. Are the new AA modes from both IHVs likely to be performant (perhaps only with minimal application?) on GPUs with <=128-bit (under 30GB/s) external memory interfaces?
As far as I can tell, single-cycle 4xAA (as opposed to 2 cycles of 2xAA each) doesn't cost any more in bandwidth. This is because AA is always effected within the die using a colour and a Z buffer. When the AA computations have been completed, the result is then written to the render target in local memory (i.e. off-die).

The cost for the higher AA comes in several places:
  • geometry sampling, to produce the AA samples
  • AA sample colour generation (comparing incoming Z against prior AA samples' Z)
  • AA sample writing to Z buffer
So the cost isn't trivial; it's nearly double the die area for 4xAA.

It still bugs me why high-end R5xx GPUs don't have the double-rate ROPs that RV530 has. The asymmetric implementation of Fetch 4 (RV530 has it, R520 doesn't) hints that the R5xx GPUs had multiple forks in their implementations...

Jawed
 
If that were the case, then SF would never be supported in a co-issue alongside the primary ALU, it would always interrupt it - similar to how TEX in G7x always interrupts ALU 1. The setup you refer to seems to be a function of the SF pipeline, not the primary ALU.
It is a function of the primary pipeline. So yes, it'll "block" the ALUs - but, it cannot block them for more than 25% of the time in theory! (128 scalar ALU ops per cycle, versus 32 SFU ops per cycle...) - and in practice, remember the SFU unit will generally be used for interpolation purposes instead, so the compromise makes sense imo.

So, I'm not sure what you mean by "independent instruction" here.
1: a = b + c;
2: d = b + d; (independent of instruction 1)
3: b = a + c; (dependent of instruction 1; independent of instruction 2)
4: c = d + b; (dependent of both instruction 2 and 3)


Uttar
 
As far as I can tell, single-cycle 4xAA (as opposed to 2 cycles of 2xAA each) doesn't cost any more in bandwidth. This is because AA is always effected within the die using a colour and a Z buffer. When the AA computations have been completed, the result is then written to the render target in local memory (i.e. off-die).
[...]
Ah thanks, although I'm thinking more specifically about the impact of single-cycle use between the two generations, i.e. will single-cycle 4xAA impact performance to a greater degree than single-cycle 2xAA when limited to the same/similar bandwidth, e.g. on 128-bit parts?

For example, with R300 derivatives, performance may be acceptable on a 64-bit X300 or 128-bit 9600 with 2xMSAA (single-cycle 2xAA) at, say, 1024x768 in some pre-2003 games. Will 4xAA (single-cycle 4xAA) likely be acceptable in some instances on equivalent DX10 parts, at resolutions manageable for the fillrate, bandwidth, etc., or will the new low-end/integrated derivatives eschew single-cycle 4xAA in favour of single-cycle 2xAA?
 
Trying to find much concrete stuff about register files in GPUs is extremely hard. One patent application I've got talks about implementing a register file (as an aside, not the main point of the patent) where every location exists twice - that's how dual-porting is "implemented" (it's a suggestion, nothing more). Now, that seems sorta unbelievable to me, actually loony.
The cheapest large-area multi-ported memories have independent read ports that share 1 storage cell. It still has quite a bit of overhead compared to a single-port memory. If a SP memory has a size of X, a DP memory will typically have a size of 1.7 * X, and a triple-port memory will have something like 2.4 * X (with some variance around those numbers, based on the size and width of the memory).
Anything triple ported and most double ported memories require custom design or a specialized vendor: the standard memories that come with a process usually don't provide them. So there are few smaller players who can actually do that.

The alternative is exactly the loony solution: take 2 or 3 single ported memories, tie together the write ports and there's your multi-port memory. It's a very common technique (and expensive too.)
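
In pseudo-hardware terms the trick looks something like this (a minimal sketch, not anyone's actual RTL): N identical single-ported arrays, the write broadcast to all of them, each read port served by its own copy.

  // N single-ported memories with their write ports tied together: each copy
  // holds identical data, so each can serve a different read port per cycle.
  template <int PORTS, int WORDS>
  struct ReplicatedRegFile {
      float copy[PORTS][WORDS];

      void write(int addr, float value)        // one write drives all copies
      {
          for (int p = 0; p < PORTS; ++p)
              copy[p][addr] = value;
      }

      float read(int port, int addr) const     // per-port read from its own copy
      {
          return copy[port][addr];
      }
  };

  // e.g. 3 "read ports" for MAD operands, at 3x the storage cost:
  //   ReplicatedRegFile<3, 4096> rf;
  //   rf.write(5, 1.0f);
  //   float a = rf.read(0, 5), b = rf.read(1, 5), c = rf.read(2, 5);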

Finally, for small register files (say 32 x 32 bits) that are used in smaller, very dedicated execution engines, it's not uncommon to simply build them up with FFs or latches, with a large per-port mux at the output to read the data. This has the additional advantage that you can read your data without a 1-cycle pipeline delay, making your microcode more efficient.
 
As I said, consider an 8-component-wide ALU with a 32-component-wide register file.

Clock 1: Write register rA for threads [B;B+32[
Clock 2: Read register rX for threads [00;32[
Clock 3: Read register rY for threads [00;32[
Clock 4: Read register rZ for threads [00;32[
Clock 5: Send threads [00;08[ to ALU
Clock 6: Send threads [08;16[ to ALU
Clock 7: Send threads [16;24[ to ALU
Clock 8: Send threads [24;32[ to ALU
Aha, thanks. What you've described is a striped fetch of pixels in a thread (your "thread" equating to individual pixels or vertices, I assume). This is a problem that Dave (dnavas) and I mulled over but never got to the bottom of.

Dump the results in the PDC and use that as "mini register file" for the ALU pipelines to operate from. Of course the PDC itself needs to be 3-ported, but being smaller it's less fiddlesome to implement.
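
In CUDA-ish terms, the idea would look something like this hypothetical sketch (my speculation about the PDC as a staging buffer, not a documented mechanism): stage a whole batch's operands into on-chip shared storage, then run the MADs out of the staged copy.

  #define BATCH 32

  // Stage rX/rY/rZ for the whole batch into shared memory (the PDC analogue),
  // then feed the MAD from there. Launch as: mad_batch<<<1, BATCH>>>(...)
  __global__ void mad_batch(const float* rX, const float* rY,
                            const float* rZ, float* out)
  {
      __shared__ float sx[BATCH], sy[BATCH], sz[BATCH];

      int t = threadIdx.x;      // one thread per pixel in the batch
      sx[t] = rX[t];            // the striped register file reads land here
      sy[t] = rY[t];
      sz[t] = rZ[t];
      __syncthreads();

      out[t] = sx[t] * sy[t] + sz[t];   // the MAD reads from the staged copy
  }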

This wouldn't work for a vertex or geometry shader though, where the thread size is only 16 vertices/primitives (2 clocks).

Often at least one operand will be a constant, so the corner case of 3 operands all being registers could be rare enough that the 1-clock bubble it introduces for VS and GS MADs is bearable.

Jawed
 
"Guys over at Rage3D say that their sources tell them that the real R600 is going to be much faster - more like 40 - 60% faster than the G80."
 
:LOL: The bw numbers ought to tell you that those scenarios are going to exist. The question will be how often, and how relevant to enthusiast situations. If they can deliver that at even 1920x1200 that would be sweet indeed. If it takes 2560x1600 to get that, it's not going to be as nice.
 
It is a function of the primary pipeline. So yes, it'll "block" the ALUs - but, it cannot block them for more than 25% of the time in theory! (128 scalar ALU ops per cycle, versus 32 SFU ops per cycle...).
G80 won't let you issue:

clock 1 MUL + RCP
clock 2 ADD + RCP

because the RCP issued in the first clock takes 4 clocks to complete (in a pixel shader, or 2 clocks in a VS or GS).

So RCP would block four clocks on the main ALU.

Otherwise, you're asserting that ALU and SF pipelines are truly independent.

1: a = b + c;
2: d = b + d; (independent of instruction 1)
3: b = a + c; (dependent of instruction 1; independent of instruction 2)
4: c = d + b; (dependent of both instruction 2 and 3)
OK, that's 3 clauses you're describing there:
  1. 1 + 3
  2. 2
  3. 4, dependent upon the completion of clauses 1 and 2
These clauses (or instruction streams, if you prefer) are fairly trivial to identify, something that arises irrespective of the pipeline architecture.
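
A minimal sketch of the identification (true dependences only; anti- and output-dependences are ignored for brevity):

  // An instruction depends on an earlier one if it reads a register the
  // earlier one wrote. Running this on Uttar's four instructions prints the
  // edges behind the three clauses above.
  #include <cstdio>

  struct Instr { char dst, srcA, srcB; };

  int main()
  {
      Instr prog[4] = {
          {'a', 'b', 'c'},   // 1: a = b + c
          {'d', 'b', 'd'},   // 2: d = b + d
          {'b', 'a', 'c'},   // 3: b = a + c
          {'c', 'd', 'b'},   // 4: c = d + b
      };

      for (int i = 0; i < 4; ++i)
          for (int j = 0; j < i; ++j)
              if (prog[i].srcA == prog[j].dst || prog[i].srcB == prog[j].dst)
                  printf("instruction %d depends on instruction %d\n", i + 1, j + 1);
      return 0;
  }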

Jawed
 
Aha, thanks. What you've described is a striped fetch of pixels in a thread (your "thread" equating to individual pixels or vertices, I assume). This is a problem that Dave (dnavas) and I mulled over but never got to the bottom of.
Heh :) Yes, I tend to speak of threads as individual pixels and batches for the bigger chunks. I have been told the terms used in the CUDA docs are much more... exotic, so it'll be interesting to see if those catch on.

Dump the results in the PDC and use that as "mini register file" for the ALU pipelines to operate from. Of course the PDC itself needs to be 3-ported, but being smaller it's less fiddlesome to implement.
This might shock you, but I would suspect the PDC is as big as or bigger than the register file. It is actually 256KiB, I think - and not 128KiB as was suspected earlier based on extrapolation from some PDFs. This is because each half-cluster (8-wide ALU) has its own 16KiB PDC, and there are 16 half-clusters: 16 x 16KiB = 256KiB.

As for the VS having batches of 16, one possibility here is that it has its own dedicated register file. This would explain that the VTF numbers aren't quite as amazing as we'd have hoped them to be, too. I'm not quite convinced by this explanation, however. I am sure there are plenty of tricks with the PDC (which I would assume is single-ported, just like the register file!) they could be doing there that I am not thinking of right now...


Uttar
 
Ah thanks, although I'm thinking more specifically about the impact of single-cycle use between the two generations, i.e. will single-cycle 4xAA impact performance to a greater degree than single-cycle 2xAA when limited to the same/similar bandwidth, e.g. on 128-bit parts?

For example, with R300 derivatives, performance may be acceptable on a 64-bit X300 or 128-bit 9600 with 2xMSAA (single-cycle 2xAA) at, say, 1024x768 in some pre-2003 games. Will 4xAA (single-cycle 4xAA) likely be acceptable in some instances on equivalent DX10 parts, at resolutions manageable for the fillrate, bandwidth, etc., or will the new low-end/integrated derivatives eschew single-cycle 4xAA in favour of single-cycle 2xAA?
Theoretically your 4xAA GPU isn't forced to do 4xAA, it could do 2xAA. Well, Xenos works like this.

But yes, you're right to ponder whether a RV610, with 256MB of hypermemory and a piddling 32MB of local memory, forced to do a minimum of 4xAA might break down and cry. A 4xMSAA render target takes up more space and therefore consumes more bandwidth than a 2xMSAA one.
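
Some quick arithmetic makes the point (assuming 4 bytes of colour plus 4 of Z/stencil per sample, and no compression, which is pessimistic since real GPUs compress MSAA surfaces):

  // Bytes for a multisampled colour + Z surface at 1024x768.
  #include <cstdio>

  int main()
  {
      const int w = 1024, h = 768;
      const int bytes_per_sample = 4 /*colour*/ + 4 /*Z/stencil*/;

      for (int samples = 1; samples <= 4; samples *= 2) {
          long bytes = (long)w * h * samples * bytes_per_sample;
          printf("%dxAA: %.1f MB\n", samples, bytes / (1024.0 * 1024.0));
      }
      return 0;
  }

That's 6MB, 12MB and 24MB respectively - 4xAA alone would eat most of a 32MB local store before textures get a look-in.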

The other side of the coin is what will D3D10.1 or D3D11 require? Will a minimum of 4xAA be mandated (with the option for the user to turn AA off entirely)?

Jawed
 
The cheapest large area multi-ported memories have independent read ports that share 1 storage cell. It still has quite a bit of overhead compared to a single port memory. If a SP memory has a size of X, a DP memory will typically have a size of 1.7* X and a triple port memory will have something like 2.4 * X (which some variance around that number based on the size of the memory and width of the memory.)
Anything triple ported and most double ported memories require custom design or a specialized vendor: the standard memories that come with a process usually don't provide them. So there are few smaller players who can actually do that.
Pretty hairy then! Ponders: can TSMC offer such memory?...

The alternative is exactly the loony solution: take 2 or 3 single ported memories, tie together the write ports and there's your multi-port memory. It's a very common technique (and expensive too.)
:LOL: What else can I say! I was shocked when I read it in the patent and needed to get some confirmation that I wasn't just imagining it.

SIMD Processor and Addressing Method

Paragraph 37.

Jawed
 
Heh :) Yes, I tend to speak of threads as individual pixels and batches for the bigger chunks. I have been told the terms used in the CUDA docs are much more... exotic, so it'll be interesting to see if those catch on.
Let's hope we get to see CUDA at some point - I really don't understand the secrecy, unless it's to keep AMD in the dark about the architecture until after R600 has released.

This might shock you, but I would suspect the PDC is bigger or as big as the register file. It is actually 256KiB, I think - and not 128KiB like it was suspected earlier based on extrapolation from some PDFs. This is because each half-cluster (8-wide ALU) has its own 16KiB PDC.
Well, if you have a thread with 32 pixels, and you assume that the PDC can at least double-buffer threads (i.e. one active, one being set up), then 8KiB for 32 pixels equates to 256 bytes per pixel, or 64 fp32s. That's 16x 128-bit vec4 registers.

As for the VS having batches of 16, one possibility here is that it has its own dedicated register file. This would explain that the VTF numbers aren't quite as amazing as we'd have hoped them to be, too.
What VTF numbers?...

I'm not quite convinced by this explanation, however. I am sure there are plenty of tricks with the PDC (which I would assume is single-ported, just like the register file!) they could be doing there that I am not thinking of right now...
It might be nothing more than there aren't enough threads in flight. G80 might be built upon the assumption that there'll normally be VS, GS and PS threads in flight, so that not all VS latency (from VTF) has to be hidden solely by VS threads. But if you are running a VS-only render pass (or two) then I guess you could trip up on sub-optimal latency-hiding.

Jawed
 
Let's hope we get to see CUDA at some point - I really don't understand the secrecy, unless it's to keep AMD in the dark about the architecture until after R600 has released.
It's pre-release software. I'm sure it'll become public once the compiler has advanced to the final release stage.
 
Pretty hairy then! Ponders: can TSMC offer such memory?...
Dual ported memories are definitely available. Not sure if TSMC has their own, but Artisan does.

If you want to create your own, here's how it goes: each process has design rules: geometry rules you have to comply with to ensure that it is physically possible to produce a design. (E.g. minimum distances between vias, minimum trace lengths, etc. There are hundreds of those.) If you follow the standard rules you cannot build minimum-size memories.

To work around that, fabs also provide single-bit memory cells: those violate the (conservative) rules, but they are still guaranteed by the fab as producible. When you design memories, it's up to you to design everything around it (address decoders, output sense amplifiers, etc.), but you never ever touch the single-bit cell itself.
And then it's a matter of coming up with your own specs... You want a 4-ported memory? You'll need 8 sense lines per bit-cell column.

All in all, custom memories will still give you the best performance and/or area, but it's also error-prone and risky. Most companies stay away from it, but the big guys probably don't. Even then there's a trade-off: for a small memory that's not often used, it may be cheaper to go with a standard offering, while you roll your own for a big one that's used a lot.
 