AMD: R7xx Speculation

I'm not ignoring the improvements, I'm just railing against the BS terminology being used. What it really comes down to is that they have some better scheduling algorithms using effectively similar SIMD ALU arrays; i.e., each one is still being fed by a single instruction pointer.

Yes, but since each thread (Nvidia terminology) can be divergent and have a different instruction pointer, it's not that unreasonable to treat them as logically independent units (even if horribly inefficient when used as such.) I'm ok with calling them stream processors from that point of view (or thread processors, if you're hung up on the already taken definition of stream processor). It's much harder to go along with the ATI terminology, but seeing that they'd be completely outclassed for this particular number, I guess marketing had no other choice than to reinvent that definition.

...but all they are really doing is batch processing of several pixels/vertices using the same instruction stream. Nothing really scalar about that.
There doesn't have to be a contradiction between having a scalar instruction set and a non-scalar architecture.

(*) By instruction pointer per thread, I'm referring to whatever method is used to keep track of the state of each thread. If you're dealing with divergent branches, somehow those threads will have to be executed one after the other, with non-participating threads disabled while the other path runs.
This is actually what I've been wondering about Larrabee: how will it handle divergence within the same warp?
 
From marketing, of course. Like I said, I'm surprised that the graphics vendors haven't touted clock buffers as a feature and given them some nonsense name!

What about my comment was inaccurate with respect to how G80's ALUs are filled?

The hardware itself is SIMD, yes. But aren't things fundamentally different from previous generations with respect to how those SIMDs are filled per clock (especially in the case of G80)? How would you market that, given that it's one of the defining qualities of the new generation?

For as long as I can remember people discussed the structure of individual ALUs. Weren't the ALUs in R300 and NV30 arranged in one big SIMD array? Yet we still talked about the capabilities and structure of each individual ALU and what it took to extract maximum utilization from them. That extended to the next generation, where there was a lot of talk about co-issue and dual-issue and instruction dependencies. Back then we used the term "vector ALU" to refer not to the SIMD array but to an individual ALU. Extrapolating that to today, how is "scalar ALU" an inaccurate term? Back then it was a SIMD array of vector ALUs (everyone remember the pipe? :)), today it's a SIMD array of scalar ALUs. The terminology we use today is completely in line with the language used in the past, so I don't understand the complaints...
 
R300 pixel shader
  • 2 register files
  • 2 SIMDs, each consisting of (4 pixels per clock)
    • 12 lane MAD ALU (with "DX8/ADD" pre-processor stage)
    • 4 lane MAD/transcendental ALU (with "DX8/ADD" pre-processor stage)
R600
  • 4 register files
  • 4 SIMDs, each consisting of (16 objects per clock)
    • 16 lane MAD ALU - "X"
    • 16 lane MAD ALU - "Y"
    • 16 lane MAD ALU - "Z"
    • 16 lane MAD ALU - "W"
    • 16 lane MAD/transcendental ALU - "T"
G80
  • 16 register files
  • 16 SIMDs, each consisting of (8 objects per clock)
    • 8 lane MAD ALU
    • 2 lane transcendental ALU (which can also function as an 8 lane interpolator ALU)
Each SIMD in these GPUs has a single program counter shared by all objects in the batch. Predication is the sole means of "simulating independent program counters" for the objects that make up a single batch, in all these GPUs.
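
To make that concrete, here's a toy C sketch of a batch walking an if/else under one shared program counter - the 8-wide batch, the register names and the whole structure are just my illustration, not anything from either vendor:

/* Toy model: one SIMD, one shared instruction stream, per-object predicate. */
#include <stdio.h>

#define BATCH 8

int main(void)
{
    float r0[BATCH] = {1, -2, 3, -4, 5, -6, 7, -8}; /* per-object register  */
    int   pred[BATCH];                              /* per-object predicate */

    /* if (r0 > 0) : the compare runs for every object in the batch */
    for (int i = 0; i < BATCH; i++) pred[i] = (r0[i] > 0.0f);

    /* "then" clause: r0 *= 2, write-back gated by pred */
    for (int i = 0; i < BATCH; i++) if (pred[i])  r0[i] *= 2.0f;

    /* "else" clause: r0 = -r0, write-back gated by !pred */
    for (int i = 0; i < BATCH; i++) if (!pred[i]) r0[i] = -r0[i];

    /* Both clauses cost the whole batch ALU cycles regardless of which path
       each object "took" - that's the price of sharing one program counter. */
    for (int i = 0; i < BATCH; i++) printf("%g ", r0[i]);
    printf("\n");
    return 0;
}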

Jawed
 
Each SIMD in these GPUs has a single program counter shared by all objects in the batch. Predication is the sole means of "simulating independent program counters" for the objects that make up a single batch, in all these GPUs.
Yes, so basically there are 2 types of predication at work: explicit, at the instruction level with condition codes like in ARM, as a way to avoid branches in the code, and implicit, where you temporarily freeze the state of non-active threads.

Which leads to the question: how does a thread do the bookkeeping about when active threads need to become passive and when passive threads need to become active?
Is it sufficient to have a trigger register with an address value that indicates when an active thread should toggle to inactive and vice versa? (Hmm, wouldn't that be a de-facto instruction pointer in that it's the binary first-order derivative? :D) Or is there more/less to it?

This is what I meant by "whatever method to keep track of program state". It doesn't have to be a per-thread explicit program counter. If my suggestion of a trigger register works, then that comes down to the same thing, with the same amount of FF state as a program counter, though it would be slightly more power efficient.

Anyway, my original point still stands: from a programmer's point of view, it's not outrageous to count a G80 as having 128 processors. I'm having a harder time with the 320 processors on R600...
 
Yes, so basically there are 2 types of predication at work: explicit, at the instruction level with condition codes like in ARM, as a way to avoid branches in the code, and implicit, where you temporarily freeze the state of non-active threads.
As far as I know, it's purely object-based predication.

http://forum.beyond3d.com/showpost.php?p=1123486&postcount=435

Which leads to the question: how does a thread do the bookkeeping about when active threads need to become passive and when passive threads need to become active?
Is it sufficient to have a trigger register with an address value that indicates when an active thread should toggle to inactive and vice versa? (Hmm, wouldn't that be a de-facto instruction pointer in that it's the binary first-order derivative? :D) Or is there more/less to it?
I know in ATI GPUs they implement predication per nested level. So as the PC crosses the boundary formed by each test it pushes and pops predication on a predication stack. (Actually, as far as I can tell from CTM documentation for R5xx, it doesn't act as a true stack since the processor can do funky things reading predication in mixed orderings.)
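
Purely as a sketch of how I picture such a predication stack (my own guesswork - the names, the 16-bit masks and the push/pop helpers are invented, and the real R5xx sequencer is certainly more involved):

/* Guesswork sketch of a per-batch predication stack for nested branches. */
#include <stdint.h>

#define BATCH    16   /* objects per batch */
#define MAX_NEST 32   /* nesting depth     */

typedef struct {
    uint16_t mask[MAX_NEST];  /* one active-object mask per nesting level */
    int      top;
} pred_stack;

static uint16_t current_mask(const pred_stack *p)
{
    return p->top ? p->mask[p->top - 1] : 0xFFFF;  /* no nesting: all active */
}

/* Entering an 'if': AND the condition mask with the enclosing active mask,
   so objects that were already inactive stay inactive inside the clause. */
static void push_if(pred_stack *p, uint16_t cond)
{
    p->mask[p->top] = current_mask(p) & cond;
    p->top++;
}

/* 'else': flip this level's condition, still under the enclosing mask. */
static void do_else(pred_stack *p)
{
    uint16_t outer = (p->top > 1) ? p->mask[p->top - 2] : 0xFFFF;
    p->mask[p->top - 1] = outer & (uint16_t)~p->mask[p->top - 1];
}

/* 'endif': pop back to the enclosing level. */
static void pop_endif(pred_stack *p) { p->top--; }

int main(void)
{
    pred_stack p = { {0}, 0 };
    push_if(&p, 0x00FF);   /* outer if: objects 0-7 take it               */
    push_if(&p, 0x0F0F);   /* nested if: of those, only objects 0-3       */
    /* ALU write-back is now gated by current_mask(&p), which is 0x000F   */
    do_else(&p);           /* nested else: objects 4-7                    */
    pop_endif(&p);
    pop_endif(&p);
    return 0;
}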

Anyway, my original point still stands: from a programmer's point of view, it's not outrageous to count a G80 as having 128 processors. I'm having a harder time with the 320 processors on R600...
G80 is 128 wide and R600 is 64 wide in terms of objects per clock. But in terms of processors it's quite clear that G80 has 16 (NVidia calls them multiprocessors) and R600 has 4 (ATI calls them SIMDs).

Jawed
 
As far as I know, it's purely object-based predication.

http://forum.beyond3d.com/showpost.php?p=1123486&postcount=435
Sorry, but I don't see how divergence is directly related to warp swapping (though there may be secondary reasons): if some threads don't take a branch while others do, then the processor can simply continue executing the threads that don't branch. No need to swap the warp.
I can see why it could be beneficial to swap a warp to avoid a bubble, but that would be a consequence of the VLIW instruction set and not of the SIMD nature.

I'm more interested in the low level bookkeeping. Do you see fundamental flaws in my suggested implementation? At first, I thought some kind of divergence stack would be needed to unwind different levels of divergence, but now I don't think that's necessary. (In case of subroutines, you obviously still need a stack, but that's orthogonal.)

Let's ignore explicit predication, since that's trivial anyway, and only look at real divergent branches.

At the assembly level, you always know what the next instruction to execute will be, so, knowing that, isn't it just a matter of temporarily deactivating a thread until the program counter passes the reactivation mark? Seems really straightforward. I have the feeling that my example would break down in some really stupid case, but I can't pinpoint what, as long as you can guarantee that there will always, eventually, be a point of reconvergence. (Only a restriction if you want to support setjmp/longjmp-like behavior.)
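
To make the bookkeeping concrete, here's roughly what I have in mind, written out as a C sketch (entirely hypothetical - the struct, the park/wake helpers and the 16-wide batch are just mine, something concrete to poke holes in):

/* Hypothetical per-thread "reactivation mark" bookkeeping. */
#include <stdint.h>
#include <stdbool.h>

#define BATCH 16

typedef struct {
    bool     active;     /* does this thread execute the current instruction?    */
    uint32_t rejoin_pc;  /* address at which a parked thread becomes active again */
} thread_state;

/* A thread that doesn't take the path currently being executed parks itself
   until the shared program counter reaches the point where both paths rejoin. */
static void park(thread_state *t, uint32_t rejoin_label)
{
    t->active    = false;
    t->rejoin_pc = rejoin_label;
}

/* Checked once per issued instruction, before execution. */
static void wake_threads(thread_state *t, uint32_t pc)
{
    for (int i = 0; i < BATCH; i++)
        if (!t[i].active && pc == t[i].rejoin_pc)
            t[i].active = true;   /* PC reached the reconvergence point */
}

int main(void)
{
    thread_state t[BATCH] = {{0}};
    for (int i = 0; i < BATCH; i++) t[i].active = true;

    park(&t[3], 0x40);      /* thread 3 skips the taken path, rejoins at 0x40   */
    wake_threads(t, 0x40);  /* when the shared PC reaches 0x40, thread 3 wakes  */
    return 0;
}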

I know in ATI GPUs they implement predication per nested level. So as the PC crosses the boundary formed by each test it pushes and pops predication on a predication stack. (Actually, as far as I can tell from CTM documentation for R5xx, it doesn't act as a true stack since the processor can do funky things reading predication in mixed orderings.)
Interesting. Free-wheeling now: that seems to indicate that they control divergence globally per warp instead of per thread, with a stack that's 16 deep and 16 wide. If indeed the instructions themselves determine whether or not it's a nesting/predication barrier, then that's all you need. In my system, things would be distributed per thread, each having their own 32-bit wide barrier activation toggle pointer + their own logic to calculate the next toggle pointer. Conceptually, I'd think my system is simpler, but maybe not if you want to avoid bubbles...

Edit: one possible reason to go the stack way is that it's less area. I only realized later that you need such a stack per warp, not per multi-processor. So in one case you have threads_per_warp^2 bits per warp, while in the other you'd need threads_per_warp*address_width. Could be significant enough when you're dealing with thousands of warps.
 
Sorry, but I don't see how divergence is directly related to warp swapping (though there may be secondary reasons)
You introduced the 2-way model of ARM, so I contrasted it with the 2-way models in R600 and G80, neither of which works in the way ARM does (or at least what I understand of ARM from your description: the code contains a predication flag per instruction to indicate paths). In order to entirely skip instructions the GPUs have to evaluate the predicate for the batch as a whole.

R600 has to do a batch swap to perform this evaluation - the sequencer unit that does this is outside of the ALU pipe.

The compiler for G80 serialises a branch if it's short. This implies to me that G80 evaluates the predicate status for the entire batch within the ALU pipe and flushes the pipe once it determines the clause no longer contains instructions to be executed. In effect this evaluation is only compiled if the clause is long enough for it to be worth flushing.
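
In compiler terms I read that as a heuristic roughly along these lines (my own sketch of the idea, with a made-up cutoff - not NVidia's actual compiler):

/* Sketch of the short-branch heuristic described above; my reading of it. */
#include <stdio.h>

#define PREDICATION_CUTOFF 8   /* made-up threshold, in instructions */

static void emit_predicated_clause(void)
{
    /* clause is emitted inline with write-back masked by the predicate:
       every object in the batch pays its cost, but nothing is flushed */
    puts("; predicated clause");
}

static void emit_branch_over_clause(void)
{
    /* a real branch is emitted: if no object in the batch takes the clause,
       the pipe is flushed/jumped past it - only worth it for long clauses */
    puts("; branch-if-none-taken + clause");
}

static void compile_clause(int clause_length)
{
    if (clause_length <= PREDICATION_CUTOFF)
        emit_predicated_clause();
    else
        emit_branch_over_clause();
}

int main(void)
{
    compile_clause(3);    /* short: serialise it under predication              */
    compile_clause(40);   /* long: worth evaluating the batch-wide predicate    */
    return 0;
}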

: if some threads don't take a branch while others do, then the processor can simply continue executing the threads that don't branch. No need to swap the warp.
I can see why it could be beneficial to swap a warp to avoid a bubble, but that would be a consequence of the VLIW instruction set and not of the SIMD nature.
R600 has an explicit "swap now" compiled into its assembly code.

Thinking about it, G80 may be simply flushing the pipe and jumping. This presumes that the operands required by the new instruction are ready. I can't help thinking that these operands cannot be guaranteed to be ready, which would then force a batch swap.

I'm more interested in the low level bookkeeping. Do you see fundamental flaws in my suggested implementation? At first, I thought some kind of divergence stack would be needed to unwind different levels of divergence, but now I don't think that's necessary. (In case of subroutines, you obviously still need a stack, but that's orthogonal.)

Let's ignore explicit predication, since that's trivial anyway, and only look at real divergent branches.
OK.

At the assembly level, you always know what the next instruction to execute will be, so, knowing that, isn't it just a matter of temporarily deactivating a thread until the program counter passes the reactivation mark? Seems really straightforward. I have the feeling that my example would break down in some really stupid case, but I can't pinpoint what, as long as you can guarantee that there will always, eventually, be a point of reconvergence. (Only a restriction if you want to support setjmp/longjmp-like behavior.)
I'm not sure if you're referring to instruction-paging here (not sure if it's relevant)...

Interesting. Free-wheeling now: that seems to indicate that they control divergence globally per warp instead of per thread, with a stack that's 16 deep and 16 wide. If indeed the instructions themselves determine whether or not it's a nesting/predication barrier, then that's all you need. In my system, things would be distributed per thread, each having their own 32-bit wide barrier activation toggle pointer + their own logic to calculate the next toggle pointer. Conceptually, I'd think my system is simpler, but maybe not if you want to avoid bubbles...
I'm still trying to understand what you're suggesting: are you suggesting that each object has its own address pointer, set to the address of the instruction it is due to execute next?

Well, for what it's worth, G80 appears to bubble, so bubbling would be a match for the system you're describing.

Edit: one possible reason to go the stack way is that it's less area. I only realized later that you need such a stack per warp, not per multi-processor. So in one case you have threads_per_warp^2 bits per warp,
It needs to be threads_per_warp times the nesting depth, not threads_per_warp squared.

while in the other you'd need threads_per_warp*address_width. Could be significant enough when you're dealing with thousands of warps.
In G80 each multiprocessor can only support a maximum of 24 warps (each of 32 objects). Alternatively 48 warps of 16 objects when vertex shading, I presume.

In R600 all we know is that there's a vague "low-hundreds" of batches, but I'm not aware of anything more specific. R5xx is 128 batches per SIMD, I can't think of any reason why it'd be different in R600. I expect they both share the same predication system for branching.

CTM documentation for R5xx says that nesting is up to 32 deep, but then only 64 batches per SIMD are supported. Otherwise nesting is limited to 6 deep (but subroutines and loops are not supported in this mode). I wasn't aware of these modes until now...

Jawed
 
Yes, so basically there are 2 types of predication at work: explicit, at the instruction level with condition codes like in ARM, as a way to avoid branches in the code, and implicit, where you temporarily freeze the state of non-active threads.

Which leads to the question: how does a thread do the bookkeeping about when active threads need to become passive and when passive threads need to become active?
Is it sufficient to have a trigger register with an address value that indicates when an active thread should toggle to inactive and vice versa? (Hmm, wouldn't that be a de-facto instruction pointer in that it's the binary first-order derivative? :D) Or is there more/less to it?
No, that would be a mask register, which has existed in the vector world for decades. Basically, depending on the architecture, you can have a special-purpose mask register, use a general register, or do bit slicing via a scalar register.

Effectively you do a comparison operation of width X, where X is the number of lanes, then you use that output to mask off operations on a per-lane basis. This is also known as vector-based predication. And no, it has no relation to an instruction pointer, because you are still effectively doing the operations, just not writing the results back for some lanes.
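
In code the classic mask register arrangement looks roughly like this (a generic sketch with made-up widths and names, not any particular ISA):

/* Mask-register vector predication: a compare fills a mask, subsequent lane
   write-backs are gated by it. */
#include <stdint.h>
#include <stdio.h>

#define LANES 16

int main(void)
{
    float    a[LANES], b[LANES], dst[LANES];
    uint16_t mask = 0;

    for (int i = 0; i < LANES; i++) { a[i] = (float)i - 8.0f; b[i] = 1.0f; dst[i] = 0.0f; }

    /* vcmp_gt mask, a, 0  -> one bit per lane */
    for (int i = 0; i < LANES; i++)
        if (a[i] > 0.0f) mask |= (uint16_t)(1u << i);

    /* vadd dst, a, b {mask} : every lane still computes, but only lanes
       whose mask bit is set write the result back */
    for (int i = 0; i < LANES; i++) {
        float result = a[i] + b[i];
        if (mask & (1u << i)) dst[i] = result;
    }

    for (int i = 0; i < LANES; i++) printf("%g ", dst[i]);
    printf("\n");
    return 0;
}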

Anyway, my original point still stands: from a programmer's point of view, it's not outrageous to count a G80 as having 128 processors. I'm having a harder time with the 320 processors on R600...

Can I run 128 INDEPENDENT programs? No? well then you don't have 128 XXXXXX processors! G80 has 16 processors, R600 has 4! And yes the trend is towards more as that is generally more efficient.

aaron spink
speaking for myself inc.
 
Can I run 128 INDEPENDENT programs? No? well then you don't have 128 XXXXXX processors! G80 has 16 processors, R600 has 4! And yes the trend is towards more as that is generally more efficient.
When the only disagreement is about the semantics of the word 'processor', it's probably time to be happy that everything else is solved and move on...
 
Can I run 128 INDEPENDENT programs? No? well then you don't have 128 XXXXXX processors! G80 has 16 processors, R600 has 4! And yes the trend is towards more as that is generally more efficient.

aaron spink
speaking for myself inc.

Each SIMD array has two arbiters and two sequencers, so wouldn't that be similar to 8 processors?

http://www.ixbt.com/video3/images/r600/diag_thread.png

 
Aaron is IMHLO definitely not wrong, as I have no doubt either that the high numbers for today's "stream processors" are mostly for marketing purposes. The truth is that the increase in overall efficiency (especially if someone compares G7x to G8x) is so large that NV's marketing department probably looked for ways to indicate the difference from former generations.

By the way Dave, if the original vaporware S5 was in its basics what SGX looks like today, I'm afraid your past performance prediction for the former was completely on the wrong track.
 
Aaron is IMHLO definitely not wrong, as I have no doubt either that the high numbers for today's "stream processors" are mostly for marketing purposes. The truth is that the increase in overall efficiency (especially if someone compares G7x to G8x) is so large that NV's marketing department probably looked for ways to indicate the difference from former generations.

I think all the marketing Nvidia needed was 20+% more performance with 80% of the resources. In most cases, especially shader-bound cases, G92 is roughly 50% more efficient than RV670.

Which basically points out that AMD really really needs to rethink the way they schedule/organize the SIMD ALUs. They are wasting a lot of resources doing nothing currently.

The big insight in the G80/G92 design is that by serializing the execution on a per-component basis, the scheduling problem becomes much simpler and utilization becomes much higher. Basically each fragment ends up running 4x as long, which both hides latency and reduces the scheduling required by 3/4. It likely reduces register pressure as well.
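
As a toy illustration of the difference (my own simplification - the two functions below are not how either chip is actually wired):

/* Toy contrast of vec4 issue vs. per-component serial issue. */

/* Vec4-style: one issue slot handles all four components of a fragment per
   clock, so the scheduler has to find fragments with full vec4 work (plus
   co-issue tricks) every cycle to keep the lanes busy. */
void vec4_issue(float r[4], const float a[4], const float b[4])
{
    for (int c = 0; c < 4; c++)       /* x, y, z, w consumed in one issue */
        r[c] = a[c] * b[c] + r[c];
}

/* Scalar-style: the same MAD is spread over four issues, one component per
   clock. Each fragment stays in the pipe ~4x as long (more latency hiding)
   and the scheduler only has to find "any ready scalar op" each cycle. */
void scalar_issue(float r[4], const float a[4], const float b[4], int cycle)
{
    int c = cycle & 3;                /* one component per clock */
    r[c] = a[c] * b[c] + r[c];
}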

Without an architectural change and assuming Nvidia does a roughly 2x design for their next high end chip, ATI is going to need roughly a 750-850 ALU design to break even.

Aaron Spink
speaking for myself inc.
 
Intel learned its lesson with the Pentium 4 NetBurst architecture, which used deep pipelines to boost clock frequency for marketing purposes in order to beat AMD in the MHz war. Intel later changed its strategy, cutting the pipeline to 14 stages and redesigning the CPU architecture to boost performance, which resulted in the successful Intel Core 2.

ATI/AMD should learn the same lesson with R6xx/RV6xx, redesign their upcoming generation, and stop confusing people with such high stream processor counts of 320/480/640/800.
 
I think all the marketing Nvidia needed was 20+% more performance with 80% of the resources. In most cases, especially shader-bound cases, G92 is roughly 50% more efficient than RV670.

Well, I wasn't actually comparing G92 to RV670. I was merely trying to find an explanation for the 128-whatever marketing figures of G8x/9x. They told us laymen that G7x has 6 quads, or 24 ALUs if you prefer. Something like 128 simply sells better as a number, and not too many people easily understand the differences between a Vec4 and a "scalar" unit either. If one then goes even deeper and starts explaining how TEX ops have been de-coupled, the average reader starts getting lost. In other words, NVIDIA isn't exactly innocent either when it comes to marketing stunts used to describe their latest architecture. Given, though, that the differences in efficiency compared to their former generation are huge, I'm willing to close one eye.

Which basically points out that AMD really really needs to rethink the way they schedule/organize the SIMD ALUs. They are wasting a lot of resources doing nothing currently.

The big insight in the G80/G92 design is that by serializing the execution on a per-component basis, the scheduling problem becomes much simpler and utilization becomes much higher. Basically each fragment ends up running 4x as long, which both hides latency and reduces the scheduling required by 3/4. It likely reduces register pressure as well.

Without an architectural change and assuming Nvidia does a roughly 2x design for their next high end chip, ATI is going to need roughly a 750-850 ALU design to break even.

Aaron Spink
speaking for myself inc.

Don't you think that blaming any possible differences on arithmetic throughput alone (even if your figures are real) could be a tad shortsighted? You're talking about performance GPUs for the standalone desktop market, and what counts most here for higher sales is performance per buck in today's games. Pardon me, but you'll have a hard time convincing me that RV670 doesn't fall short against the competition in terms of texel/pixel and Z fillrates, and not by just a small margin.

IMHLO a healthy boost both in terms of higher arithmetic throughput and all the aforementioned fillrates will do the trick, and if we're talking about the ultra high end, there's no doubt that AMD will continue to address it with dual-chip designs just like R680. That could be, according to your definition, 8 processors per chip, twice the TMUs and hopefully double the Z throughput per ROP.

For GPGPU stuff I have the impression that NV still has some ground to cover even against RV670, which of course sits rather oddly next to your 50% higher efficiency for G92.
 
In most cases, especially shader-bound cases, G92 is roughly 50% more efficient than RV670.
I don't buy that. It may be a bit more efficient, but I doubt it's really that much. If you quote "shader-bound cases", I'll point you at the 3DMark06 Perlin noise test (which, coincidentally, is pretty much the only benchmark that shows about the scaling you'd expect between G92 and G94, so it's probably REALLY shader ALU bound). In this benchmark, an HD3870 runs neck and neck with an 8800GTS-512, so it doesn't look that much less efficient (based on peak MAD rates, the HD3870 should be just a tad faster).
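
For reference, the peak MAD arithmetic I have in mind (assuming the stock 775MHz RV670 core clock and the 1625MHz G92 shader clock, counting MAD only and ignoring G92's co-issued MUL):

HD3870 (RV670):     320 lanes x 2 flops/MAD x 0.775 GHz ~ 496 GFLOPS
8800GTS-512 (G92):  128 lanes x 2 flops/MAD x 1.625 GHz ~ 416 GFLOPS

So on paper the HD3870 has roughly 20% more peak MAD throughput.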
 
Intel learned its lesson with the Pentium 4 NetBurst architecture, which used deep pipelines to boost clock frequency for marketing purposes in order to beat AMD in the MHz war. Intel later changed its strategy, cutting the pipeline to 14 stages and redesigning the CPU architecture to boost performance, which resulted in the successful Intel Core 2.

ATI/AMD should learn the same lesson with R6xx/RV6xx, redesign their upcoming generation, and stop confusing people with such high stream processor counts of 320/480/640/800.

I disagree. CPUs are not anywhere near as parallel as GPUs, so it is unfair to draw a parallel between the NetBurst -> Core transition and the difference between G8x/9x and R6xx.
 
I don't buy that. It may be a bit more efficient, but I doubt it's really that much. If you quote "shader-bound cases", I'll point you at the 3DMark06 Perlin noise test (which, coincidentally, is pretty much the only benchmark that shows about the scaling you'd expect between G92 and G94, so it's probably REALLY shader ALU bound). In this benchmark, an HD3870 runs neck and neck with an 8800GTS-512, so it doesn't look that much less efficient (based on peak MAD rates, the HD3870 should be just a tad faster).

Yet just about every game on the market runs faster on NV hardware, so one corner case where R6xx runs on par with G8x/9x isn't indicative of the average case at all.
 