Kepler and Tahiti Scheduling

Correct me if I am wrong, but it seems that both Kepler (GK110/GK104) and Tahiti use a combination of hardware and software scheduling to get the job done, but which one really relies on the hardware side more than the software side?

It's obvious that NVIDIA leans on software more than hardware this time, as evidenced by the architectural differences in Kepler compared to Fermi, but I am not sure the methods they used are that different from Tahiti's; they both gave similar performance in the end.
 
Well, with this generation, NV and AMD went in opposite directions regarding instruction scheduling. While Kepler kept scoreboarding only for a limited set of operations and ditched Fermi's overcomplicated scheduler, GCN introduced simple local scheduling to avoid dependencies between the vector units within a CU, lifting some of the burden from the compiler.

Anyway, the whole topic has been discussed to death here and the rest of interwebz. ;)
 
Correct me if I am wrong, but it seems that both Kepler (GK110/GK104) and Tahiti use a combination of hardware and software scheduling to get the job done, but which one really relies on the hardware side more than the software side?

It's obvious that NVIDIA leans on software more than hardware this time, as evidenced by the architectural differences in Kepler compared to Fermi, but I am not sure the methods they used are that different from Tahiti's; they both gave similar performance in the end.
To add some specific tidbits to fellix's post: with the Kepler ISA, the compiler encodes some dependency information directly into the instructions. That way, the scheduler knows about dependencies between nearby ALU instructions without checking for them and can delay the issue of dependent instructions accordingly. Kepler still has a RAW pipeline latency of far more cycles than it needs to issue one instruction for a warp (although it got nearly halved compared to Fermi; it's now between 10 and 12 cycles for common instructions). That means Kepler (like Fermi) needs to consider dependencies between instructions, but that task has mainly been shifted to the compiler (Fermi used scoreboarding for everything), at least for short-running ALU instructions. Memory accesses still use a scoreboard mechanism. But as ALU instructions usually have fixed latency, Fermi's scheduling was relatively complex for not much gain.
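To put the idea in code form (a toy Python sketch only, not Kepler's actual encoding): because common ALU instructions have a known, fixed latency, the compiler can compute the required stall for each RAW dependency itself and encode it into the instruction stream, so the hardware never has to check those dependencies at run time. The 10-cycle latency, the register names and the instruction format below are purely illustrative assumptions; variable-latency memory accesses would still fall back to the hardware scoreboard.

Code:
# Toy model of compiler-side scheduling for fixed-latency ALU instructions.
# The 10-cycle latency, register names and instruction format are
# illustrative assumptions, not Kepler's real encoding.

ALU_LATENCY = 10  # assumed RAW latency of a common ALU instruction

def static_schedule(program):
    """program: list of (opcode, dest, sources). Returns (issue cycle, opcode,
    stall) tuples, where 'stall' is what a compiler would encode statically."""
    ready_at = {}   # register -> cycle its result becomes available
    cycle = 0
    plan = []
    for op, dest, srcs in program:
        needed = max((ready_at.get(s, 0) for s in srcs), default=0)
        stall = max(0, needed - cycle)   # known at compile time, no scoreboard needed
        cycle += stall
        plan.append((cycle, op, stall))
        ready_at[dest] = cycle + ALU_LATENCY
        cycle += 1                       # one issue slot per instruction
    return plan

# r2 depends on r1, so the compiler delays its issue; the independent
# third instruction needs no encoded stall at all.
prog = [("fmul", "r1", ["r0", "r0"]),
        ("fadd", "r2", ["r1", "r0"]),
        ("fadd", "r4", ["r3", "r3"])]
for cyc, op, stall in static_schedule(prog):
    print(f"issue cycle {cyc:2d}: {op} (encoded stall: {stall})")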

GCN, on the other hand, does not need to consider these ALU dependencies at all. The reason is that the RAW pipeline latency appears to be just 4 cycles. Like Fermi, Kepler lacks a bypass network, so the reg file reads and writes add to the perceived latency. GCN probably builds on the PS and PV pipeline registers of the VLIW architectures (a rudimentary form of instruction-controlled result forwarding/bypassing) to cut the reg file accesses out of the dependency chain. It fits exactly with the execution of 64-wide vectors on 16-wide vALUs.

GCN is able to issue and execute dependent vALU instructions back-to-back, so it doesn't need to consider dependencies between them, neither in the compiler nor in the hardware (dependent instructions simply use the results from the bypass network before they get written to the reg file).

This scheme breaks a bit for sALU instructions dependent on a vALU result (for instance, the VCC [vector condition code, basically a 64-bit vector of flags for the 64 elements of a wavefront, set for instance by comparison instructions but also acting as an overflow flag for adds] can't be ready when a dependent sALU instruction gets issued 4 cycles after the vALU instruction producing the VCC value), where the scheduler probably simply blocks the wavefront for one issue cycle (4 cycles) when it encounters such a switch. There are a few special cases where this isn't enough (for instance, branching conditionally on a VCC that is not in the VCC register but in an sGPR pair, when the preceding sALU instruction just wrote one of the sGPRs holding the VCC; another special case is copying a vGPR value to an sGPR and trying to use the just-written sGPR for selecting to which vALU lane an sGPR value should be copied). These special cases basically tell us where AMD saved on the bypass networks. They must be handled by the developer (in 99.99% of cases by the compiler, because almost nobody writes ISA instructions directly), but there are exactly 8 of these special cases (and they should be rare in "normal" code) requiring the insertion of independent instructions (or NOPs).
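As a back-of-the-envelope illustration of why the 4-cycle cadence makes vALU-to-vALU dependencies a non-issue, here is a toy trace using the numbers from above (illustrative model, not a real simulator):

Code:
# Toy issue trace for one GCN SIMD: a 64-wide wavefront executes on the
# 16-wide vALU over 4 cycles, and with ~4-cycle ALU latency via the
# forwarding path a dependent instruction can start in the very next
# issue slot.

ISSUE_INTERVAL = 64 // 16   # cycles one vALU instruction occupies the SIMD
VALU_LATENCY = 4            # latency as seen through the bypass network

def trace(n_dependent_instructions):
    issue = 0
    for i in range(n_dependent_instructions):
        result_ready = issue + VALU_LATENCY
        next_slot = issue + ISSUE_INTERVAL
        stall = max(0, result_ready - next_slot)   # 0 -> back-to-back issue
        print(f"v_instr {i}: issue at cycle {issue}, "
              f"result via bypass at cycle {result_ready}, stall {stall}")
        issue = next_slot + stall

trace(3)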
For variable-latency instructions such as memory accesses, GCN uses dependency counters, as detailed some time ago.
 
Excellent, so it seems AMD has at least a slight edge in scheduling, if not more, but that doesn't explain to me how a 1536-core Kepler GPU matches a 2048-core Tahiti GPU in performance (or comes in slightly behind it when frequencies are equal).

I mean, both architectures are very similar now. NVIDIA took a hit in execution efficiency when they relied more on software; numerically speaking, they should be far behind, not on equal footing.

Of course, I am referencing GK104 here. When talking about a full GK110, everything makes sense: a 2880-core Kepler will be about 40% faster than a 2048-core Tahiti.
 
The full flexibility of instruction scheduling has somewhat more impact on GPU compute than it does on general graphics loads, with the caveat that more complex code is encroaching into graphics as well.

There are also a number of other factors not based on ALU capability that can influence performance, and GCN is not consistently ahead by those measures.

For variable-latency instructions such as memory accesses, GCN uses dependency counters, as detailed some time ago.

As far as dependency cases, the Sea Islands document added a new restriction for the flat addressing mode. There's apparently a race condition where the CU cannot determine if a flat access goes to memory or to the LDS, so using that mode requires setting the waitcnt for memory to 0.
 
There are also a number of other factors not based on ALU capability that can influence performance, and GCN is not consistently ahead by those measures.
Like what else, for example? Triangle throughput? They are tied in most metrics (pixel/texel fill rates, etc.), and Tahiti has much larger memory bandwidth.
 
Why are we trying to break down performance based on one number in a chip and relate it to scheduling? If you look at compute-heavy tasks, Tahiti generally outperforms GK104 quite significantly, and often by more than the difference in unit counts would suggest. If you want to look at gaming performance (which relies on more than just shaders / scheduling), then when you look at other comparably sized chips the performance comparisons are quite different.
 
Excellent, so it seems AMD has at least a slight edge in scheduling, if not more, but that doesn't explain to me how a 1536-core Kepler GPU matches a 2048-core Tahiti GPU in performance (or comes in slightly behind it when frequencies are equal).
The answer is simple. There is a lot more to general 3D performance than pure compute throughput. On the latter, Tahiti actually is considerably faster than Kepler.
 
Like what else, for example? Triangle throughput? They are tied in most metrics (pixel/texel fill rates, etc.), and Tahiti has much larger memory bandwidth.

There are differences in the speed of FP16 filtering, and differences in how the ROPs are tied to the cache hierarchy.
For cases where higher-precision texture formats are being filtered, Nvidia has a more consistent profile. I don't recall which architectures were compared for how their ROPs handled more complex Gbuffer formats, but at least until recently AMD's ROPs had a different (edit: and less consistent) performance profile.

Setup and tessellation capabilities are a strong point for Nvidia, batch sizes are smaller, and there are scenarios where the different arrangement of local store, L1, and a separate texture cache can be advantageous.

Then there's the changing situation with devrel, and the less-than-impressive driver situation.

Historically, we'd see Nvidia doing comparatively better in situations where Tahiti could not leverage its memory bandwidth, capacity, and ALU capability to the point that it could batter through the 680's bottlenecks.
There were some benches where CPU limitations were potentially showing up earlier for AMD, which may be a driver issue.
In other cases, it just looks like Tahiti wasn't as nimble an architecture. It has been less successful at hiding the particular quirks of the graphics pipeline, and it takes more available work for it to get up a head of steam.

There are compute studies that also show that AMD's architecture starts to falter sooner if you start reducing the number of concurrent items, although it has a very strong advantage if you have enough work available.
 
As far as dependency cases, the Sea Islands document added a new restriction for the flat addressing mode. There's apparently a race condition where the CU cannot determine if a flat access goes to memory or to the LDS, so using that mode requires setting the waitcnt for memory to 0.
The reason is that flat memory accesses break the condition of in-order completion of the accesses. A flat memory access increments both vm_cnt and lgkm_cnt (as the request is sent to both), and one doesn't know which one needs to decrease.
That old description also wasn't entirely correct on that point. The strict ordering is only fulfilled for vector memory accesses (also when mixing reads and writes). The lgkm count (and to a lesser extent also the exp_cnt) isn't that strict, especially when handling different classes of accesses. So while LDS accesses complete in order among themselves, as do GDS accesses and messages, mixing these types usually results in a lower value for lgkm_cnt as a parameter to s_waitcnt (basically the minimum of the allowed accesses in flight over all the access types counted there), because a later LDS access can complete before an earlier GDS access. One would simply need more counters.

In the same way, different types of exports can also complete out of order. And scalar memory accesses can always complete out of order, so if one has those in flight, one always has to use lgkm_cnt(0).
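If it helps, this is roughly how I picture the counter bookkeeping in pseudo-Python (a sketch of my reading of the mechanism, not the actual hardware): issuing an access bumps a counter, completion decrements it, and s_waitcnt stalls until the counters have dropped to the given values. The flat case shows why only a wait of 0 is safe there.

Code:
# Rough mental model of the s_waitcnt counters; all details are a sketch,
# not the real implementation.

class WaitCounters:
    def __init__(self):
        self.vm_cnt = 0     # outstanding vector memory accesses
        self.lgkm_cnt = 0   # outstanding LDS/GDS/scalar-memory/message ops

    def issue_vector_access(self):
        self.vm_cnt += 1

    def issue_lds_access(self):
        self.lgkm_cnt += 1

    def issue_flat_access(self):
        # A flat access might go to memory or to the LDS, so it bumps both
        # counters, and it's unknown which of the two will "really" complete;
        # the only safe wait afterwards is one with both counters at 0.
        self.vm_cnt += 1
        self.lgkm_cnt += 1

    def s_waitcnt_satisfied(self, vm=None, lgkm=None):
        ok = True
        if vm is not None:
            ok = ok and self.vm_cnt <= vm
        if lgkm is not None:
            ok = ok and self.lgkm_cnt <= lgkm
        return ok   # in hardware this would stall the wave until True

cnt = WaitCounters()
cnt.issue_flat_access()
print(cnt.s_waitcnt_satisfied(vm=0, lgkm=0))   # False until the access completes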

But at least I got the general things right in that old description. After all, I wrote it half a year before GCN GPUs became available outside of AMD, let alone before an ISA manual existed (besides maybe an internal draft). I'm still a bit proud I figured it out (and got it roughly correct) that early. :LOL:
 
Why are we trying to break down performance based on one number in a chip and relate it to scheduling? If you look at compute-heavy tasks, Tahiti generally outperforms GK104 quite significantly, and often by more than the difference in unit counts would suggest. If you want to look at gaming performance (which relies on more than just shaders / scheduling), then when you look at other comparably sized chips the performance comparisons are quite different.

Hm, but why do you shy away from the GK104/Tahiti comparison in gaming? The 7970 GHz has 50% more bandwidth and about 30% more GFLOPS than the GTX 680, but it's only 10-15% faster on average. There clearly is an efficiency problem of some kind there.
Chip size isn't relevant; it's the specs and how that raw power is translated into performance that count. A GPU with 1000 shaders and 100 GB/s of bandwidth will be equally fast whether it's 150, 200 or 250 mm². I'm surprised to have to explain this to you. I think you know that, though, and just want to divert attention away from an unfavorable comparison.
 
Hm, but why do you shy away from the GK104/Tahiti comparison in gaming? The 7970 GHz has 50% more bandwidth and about 30% more GFLOPS than the GTX 680, but it's only 10-15% faster on average. There clearly is an efficiency problem of some kind there.
Chip size isn't relevant; it's the specs and how that raw power is translated into performance that count. A GPU with 1000 shaders and 100 GB/s of bandwidth will be equally fast whether it's 150, 200 or 250 mm². I'm surprised to have to explain this to you. I think you know that, though, and just want to divert attention away from an unfavorable comparison.
No, because it is not the only comparison to make when you are comparing "architecture". Architectures are designed to cope with a range of applications and a range of chips, and taking one comparison point does not necessarily tell you much.

This thread starts off talking about scheduling / compute capabilities and then tries to apply that to gaming scenarios, specifically looking at GK104 vs Tahiti. As has been pointed out multiple times, Tahiti made specific design choices to accommodate multiple different markets; from a comparison perspective relating to compute performance, Tahiti and GK110 is a perfectly reasonable comparison. However, if you want to look at gaming performance, then looking at the one aspect this thread started from is wrong in the first place, and comparing these particular chips will skew the conclusion based on the product target decisions.

GK104 was designed as a gaming chip (as was GK106), and something like Pitcairn was made with those same target considerations, hence it is a reasonable point for comparison.
 
With all due respect, Mr. Dave, the way I see it, corporate folks will always try to tout the strengths of their products and downplay the weaknesses. I didn't hear you say the same thing when Fermi was reigning supreme over Cayman/Cypress in compute :D.

With the advent of completely new architectures, we are always left scratching our heads over why A is faster than B. It seems the fundamentals of the hardware are still obscure to most of us normal users, clouded by mountains of corporate marketing expressions and diagrams; we are just trying to peel back some of those layers.

As always, guys, you provided the most comprehensive and straight-to-the-point answers, and for that you have my deepest thanks.
 
Chip size isn't relevant; it's the specs and how that raw power is translated into performance that count. A GPU with 1000 shaders and 100 GB/s of bandwidth will be equally fast whether it's 150, 200 or 250 mm². I'm surprised to have to explain this to you.
Using this kind of reductio ad absurdum, all you'd need for a GPU are a couple of ALUs and a memory controller. It's a pointless exercise.

The hard part of throughput systems is less in the compute itself and more in how to feed it with data.

I can assure you that two GPUs or DSPs with identical ALU architecture and identical MC will perform dramatically different if one has 10MB of cache and oversized latency hiding FIFOs and the other has nothing.

A GPU has tons of units that need to be fed with data at high speed while covering latency: MMUs, texture units, ROPs, intermediate triangle data, etc., each with their own bottlenecks and limitations. There is no system like this that doesn't have a large set of individual trade-offs.

Raw compute power and MC BW are a necessary resource, but they are far from being sufficient to extract all the performance.
 
The full flexibility of instruction scheduling has somewhat more impact on GPU compute than it does on general graphics loads, with the caveat that more complex code is encroaching into graphics as well.
I've been wondering about this. Does the fact that some amount of scheduling is off-loaded to the SW necessarily have to have a major impact on final performance? Can't it simply be a way to reduce area by removing a complex (I assume) dependency checker?

Unrelated: AMD has always had a 4-deep pipeline. Fermi has 20+. Kepler has half that. Why the big difference, given that both run at similar clock speeds and are made on the same process? Is there some other trade-off to achieve these 4 cycles?
 
Unrelated: AMD has always had a 4-deep pipeline. Fermi has 20+. Kepler has half that. Why the big difference, given that both run at similar clock speeds and are made on the same process? Is there some other trade-off to achieve these 4 cycles?
Post #4.
If that was too long, here's the short version: nV GPUs appear not to employ result forwarding, so the reg file accesses add to the latency. That's not the case for AMD GPUs, so the 4 cycles are the pure ALU latency, while Kepler's 10 cycles or so for an fp32 madd include the reg reads, the ALU, and the reg write (otherwise 10 cycles @ 1 GHz would be awfully slow).

PS: The VLIW architectures had 8 cycles of latency, but they also allowed a bunch of "horizontal" instructions within a VLIW group. The individual slots could process certain combinations of dependent ops (even dependent madds and two sets of dependent muls), which is now a non-issue and allowed the latency to be reduced to 4.
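For a quick feel of what the latency difference means in practice, here is a trivial estimate (the issue rates are my assumptions, not measured) of how many independent warps/wavefronts, or equivalent ILP, each design needs just to cover its own ALU latency:

Code:
# Independent instruction streams needed to keep the unit busy when every
# instruction in a stream depends on the previous one. Issue rates below
# are assumptions for illustration only.

def streams_needed(dependent_latency, issue_interval):
    return -(-dependent_latency // issue_interval)   # ceiling division

print("GCN SIMD (4-cycle latency, 4-cycle cadence):",
      streams_needed(4, 4), "wavefront")             # dependent ops run back-to-back
print("Kepler (about 10 cycles incl. reg file, assumed 1 instr/cycle):",
      streams_needed(10, 1), "warps or equivalent ILP per scheduler")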
 