Kepler and Tahiti Scheduling

Post #4.
If that was too long, here's the short version: nV GPUs appear not to employ result forwarding, so the reg file accesses add to the latency. That's not the case for AMD GPUs, so the 4 cycles are the pure ALU latency, while Kepler's 10 cycles or so for an fp32 madd include the reg reads, the ALU, and the reg write (otherwise 10 cycles at 1 GHz would be awfully slow).
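For anyone who wants to reproduce such numbers: they usually come from a dependent-chain microbenchmark run with a single thread, so nothing can hide the latency. A minimal CUDA sketch of the idea (the chain length, kernel name and host plumbing are just my choices, not anything from the post quoted above):

Code:
#include <cstdio>

// One thread executes a chain of FMAs where each result feeds the next,
// so the elapsed clocks divided by the chain length approximate the full
// dependent-issue latency (reg reads + ALU + reg write).
__global__ void fma_latency(float *out, long long *cycles, float seed)
{
    float x = seed;
    long long start = clock64();
#pragma unroll
    for (int i = 0; i < 256; ++i)
        x = fmaf(x, 1.000001f, 0.5f);   // each FMA depends on the previous one
    long long stop = clock64();
    *out = x;                           // keep the chain live
    *cycles = stop - start;
}

int main()
{
    float *d_out; long long *d_cycles;
    cudaMalloc((void**)&d_out, sizeof(float));
    cudaMalloc((void**)&d_cycles, sizeof(long long));
    fma_latency<<<1, 1>>>(d_out, d_cycles, 1.0f);
    long long cycles = 0;
    cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("~%.1f cycles per dependent FMA\n", cycles / 256.0);
    cudaFree(d_out); cudaFree(d_cycles);
    return 0;
}

Dividing the measured clocks by the chain length is where figures like the ~10 cycles per madd come from.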
I did read it, but clearly it didn't sink in. Now it does. Thanks!
 
I've been wondering about this. The fact that some amount of scheduling is off-loaded to the SW doesn't necessarily have to have a major impact on final performance, does it?
Speaking of which, I assume Kepler still retains the superscalar arrangement of GF104, or did they default back to a mere scalar arrangement like GF100?
 
Chip size isn't relevant; it's the specs, and how that raw power is translated into performance, that count. A GPU with 1000 shaders and 100 GB/s bandwidth will be equally fast whether it's 150, 200 or 250 mm². I'm surprised to have to explain this to you. I think you know that, though, and just want to divert attention away from an unfavorable comparison.

Really? :???:
 
No, because it is not the only comparison to make when you are comparing "architecture". Architectures are designed to cope with a range of applications and a range of chips, and taking one comparison point does not necessarily tell you much.

This thread starts off talking about scheduling / compute capabilities, then tries to apply that to gaming scenarios, specifically looking at GK104 vs Tahiti. As has been pointed out multiple times, Tahiti made specific design choices to accommodate multiple different markets - from a comparison perspective relating to compute perf, Tahiti vs GK110 is a perfectly reasonable comparison. However, if you want to look at gaming performance, starting from this one aspect the thread opened with is wrong in the first place, and comparing these particular chips will then skew the conclusion based on the product target decisions.

If you look at it, GK104 was designed as a gaming chip (as was GK106), and something like Pitcairn was made with those same target considerations, hence it is a reasonable point for comparison.

Yet Tahiti competes in the gaming segment as well, the respective cards are marketed explicitly as gaming cards. So while this might not necessarily say something about the whole architecture, the comparison itself is valid and interesting. And the question remains: Why is Tahiti not faster in gaming (overall) when looking at its specs, what is the reason?

You can also look at GK104 vs Pitcairn, for example the very similarly specced (in terms of compute power and bandwidth) 660 Ti and 7870. The 660 Ti has a bit less bandwidth and compute power, significantly less pixel fillrate and yet it's 10% faster on average.


Really.

Using this kind of reductio ad absurdum, all you'd need for a GPU are a couple of ALUs and a memory controller. It's a pointless exercise.

The hard part of throughput systems is less the compute itself and more how to feed it with data.

I can assure you that two GPUs or DSPs with identical ALU architecture and identical MC will perform dramatically different if one has 10MB of cache and oversized latency hiding FIFOs and the other has nothing.

A GPU has tons of units that must be fed with data at high speed and have their latency covered: MMUs, texture units, ROPs, intermediate triangle data, etc. Each with their own bottlenecks and limitations. There is no system like this that doesn't have a large set of individual trade-offs.

Raw compute power and MC BW are a necessary resource, but they are far from being sufficient to extract all the performance.

Obviously those two items were just examples since I didn't want to make a long list. If anything, Tahiti should have a more sophisticated data feeding system than Pitcairn (maybe I'm wrong, but I don't see why it should be the other way around), which makes the problem even worse. A GPU might have many transistors allocated to stuff that is not relevant for gaming (like GK110 or Tahiti), making it larger. But how is stuff that might not be used in 3D calculations relevant for 3D performance? It's not. Also, you might clock a GPU higher or lower or deactivate clusters, and the die size doesn't change, yet performance very much does. Die size is not a reliable metric for performance, not by a long shot. It's pure nonsense.
 
The reason is that flat memory accesses break the condition of in-order completion of the accesses. A flat memory access increases both vm_cnt and lgkm_cnt (as the request is sent to both), and one doesn't know which one needs to decrease.

??????????????
You mean that Tahiti's IOMMU maps the LDS (and GDS) into the flat address space??

Essentially, a shader can then 'see' the whole memory as one - and the LDS/GDS is just a specific area of it that I can refer to?
 
??????????????
You mean that Tahiti's IOMMU maps the LDS (and GDS) into the flat address space??

Essentially, a shader can then 'see' the whole memory as one - and the LDS/GDS is just a specific area of it that I can refer to?
Almost (the GDS is not part of the flat memory space, it's still separate). And the support for that flat memory addressing starts with the C.I. generation, or "GCN 1.1", or whatever you want to call it (3dilletante was referring to that). Tahiti doesn't support it. But otherwise it works as you described it. A shader doesn't have to know to which physical location a pointer refers. It can be the LDS, the video memory or the host memory (in principle even swapped out to disk).
C.I. ISA manual said:
Flat Memory instructions read, or write, one piece of data into, or out of, VGPRs; they do this separately for each work-item in a wavefront. Unlike buffer or image instructions, Flat instructions do not use a resource constant to define the base address of a surface. Instead, Flat instructions use a single flat address from the VGPR; this addresses memory as a single flat memory space. This memory space includes video memory, system memory, LDS memory, and scratch (private) memory. It does not include GDS memory. Parts of the flat memory space may not map to any real memory, and accessing these regions generates a memory-violation error. The determination of the memory space to which an address maps is controlled by a set of “memory aperture” base and size registers.
The other stuff (the ordering and the sending out of requests in parallel to both the LDS and the cache/memory hierarchy) is mentioned there too, together with a detailed description of all instructions.
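For those who haven't used it: CUDA's generic address space gives a rough programmer's-eye feel for the same idea (this is an NVIDIA-side analogue purely for illustration, not GCN code - every name in it is mine). One pointer type, and the hardware decides per access whether it lands in on-chip shared memory or in global memory:

Code:
#include <cstdio>

// The callee has no idea which memory space 'p' points into.
__device__ float load_generic(const float *p)
{
    return *p;   // resolved at run time via the generic address space
}

__global__ void demo(const float *global_in, float *out)
{
    __shared__ float lds[64];            // on-chip, the LDS analogue
    int tid = threadIdx.x;
    lds[tid] = global_in[tid] * 2.0f;
    __syncthreads();

    // Same function, two different memory spaces behind the same pointer.
    out[tid] = load_generic(&lds[tid]) + load_generic(&global_in[tid]);
}

int main()
{
    const int n = 64;
    float h_in[64], h_out[64];
    for (int i = 0; i < n; ++i) h_in[i] = float(i);
    float *d_in, *d_out;
    cudaMalloc((void**)&d_in,  n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    demo<<<1, n>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[3] = %.1f\n", h_out[3]);   // expect 3*2 + 3 = 9
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}

The flat instructions described in the quoted manual expose essentially the same convenience at the ISA level, with the aperture registers doing the memory-space determination.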
 
You can also look at GK104 vs Pitcairn, for example the very similarly specced (in terms of compute power and bandwidth) 660 Ti and 7870. The 660 Ti has a bit less bandwidth and compute power, significantly less pixel fillrate and yet it's 10% faster on average.
I disagree here. First off, imho the 660 Ti isn't really any faster than the 7870, but yes, this depends on which benchmarks you believe in (I like ht4u's numbers...). And the 660 Ti does indeed have a bit less bandwidth, but it has no deficit in compute power (if you just count the flops, then depending on whether you count turbo or not it's anywhere from almost the same to quite a bit more; afaict the "average" would definitely be higher) - and don't forget it still has the "uncounted" SFU flops. The 660 Ti also has a rather large theoretical advantage in texture fill rate (in fact a huge advantage for fp16), and it can handle small tris better at least in some situations (4 tris/clock instead of 2). So those two coming out nearly even is hardly a surprise, even just based on theoretical numbers.
But anyway, this gets a bit off-topic. Yes, Tahiti isn't terribly efficient as a gaming chip, nothing new here. I am wondering though if future nvidia chips are also going to do something about the high latencies - of course for cpus I can't even think of one not having forwarding networks, since that is essential for single-thread performance. But I guess for multi-thread performance the advantage is debatable.
 
Hi Gipsel,

can you link to where you got that manual? It seems I can't find it online...

Thanks.

ps: but if the LDS mapping is permanently cached (as it should be), wouldn't it be easier to conditionally modify it? You can't do it for the whole memory mapping, but just the LDS mapping should take little space.
 
So in one benchmark suite it is barely, if at all, faster; in others it is those 10% faster, and in one suite even 15% (TPU). So on average it is definitely faster.
On average the 660 Ti boosts to 980 MHz, which gives it 2.9% more GFLOPS. That alone doesn't explain the performance differences (again, on average). I'm not sure fillrates are that important today. Enough examples can be found where cards with less fillrate consistently beat cards that have more. It certainly isn't that relevant for the average performance.
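For the record, that figure follows straight from the published specs (1344 ALUs at the ~980 MHz average boost for the 660 Ti, 1280 ALUs at 1000 MHz for the 7870, 2 flops per ALU per clock): 1344 × 2 × 0.98 ≈ 2634 GFLOPS vs. 1280 × 2 × 1.0 = 2560 GFLOPS, i.e. about 2.9% more.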

Looking at Tahiti, I would really like to know where the problems are. Scheduling or something else?
 
can you link to where you got that manual? It seems I can't find it online...

Thanks.
AMD took down the C.I. manual after a day or so. But some have downloaded it and linked some mirrors.
ps: but if the LDS mapping is permanently cached (as it should be), wouldn't it be easier to conditionally modify it? You can't do it for the whole memory mapping, but just the LDS mapping should take little space.
I don't get this. What do you mean with a "cached LDS mapping"?
 
I don't get this. What do you mean with a "cached LDS mapping"?
btw thank you for the link - very interesting. I gave it a look, and came to the piece you were pointing to.

What I meant is that a VM space needs translation: whether they use segmentation or paging does not matter. And usually the CPU caches such a mapping for a very fast VM->PHY translation + permission check.

But another idea came to my mind - wouldn't it be enough to put in the restriction of a sequential LDS mapping and then compare the flat address against LDS mapmin and LDS mapmax for every flat address?
 
The reason is that flat memory accesses break the condition of in-order completion of the accesses. A flat memory access increases both vm_cnt and lgkm_cnt (as the request is sent to both), and one doesn't know which one needs to decrease.
This is something which makes feel wonder if there will be a future architectural iteration that is able to get around this. Either the flexibility of having multiple memory accesses fired off sequentially isn't that great, or this compromises the CU's issue flexibility.
Heavy use of the flat addressing would lend some weight to the idea of moving the LDS accesses out of the mushy domain they are currently tracked in.

And scalar memory accesses can always complete out of order. So if one has those in flight, one always has to use lgkm_cnt(0).
This is something else I think needs to be changed someday, in part because I'd prefer promoting the scalar portion to be more of a peer to the vector path.
That, and the various special NOP conditions are ugly low-level weaknesses to have exposed at an ISA level.

What I meant is that a VM space needs translation: whether they use segmentation or paging does not matter. And usually the CPU caches such a mapping for a very fast VM->PHY translation + permission check.
An access isn't going to hit the TLBs until it is determined to be an access that falls in the memory space.

But another idea came to my mind - wouldn't it be enough to put in the restriction of a sequential LDS mapping and then compare the flat address against LDS mapmin and LDS mapmax for every flat address?
With the current setup, the issue logic isn't going to know ahead of time what it's supposed to do. That would require at least 64 checks against the per-lane calculated memory ranges, then a decision to send the necessary data to either domain. The VGPR and range registers would need to be loaded and added in the one issue cycle the wavefront has. Given the size of the wavefront, it seems likely that at least 1 out of the 64 items is going to split issue.
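To make the cost concrete, here is a host-side sketch of what that per-lane decision amounts to (the 64-wide wavefront is a given; the aperture values and all the names are invented for illustration):

Code:
#include <cstdio>
#include <cstdint>

// Hypothetical aperture for the LDS window inside the flat address space.
const uint64_t LDS_APERTURE_BASE = 0x100000000ull;
const uint64_t LDS_APERTURE_SIZE = 64 * 1024;

// Classify the 64 per-lane addresses of one wavefront. Returns true if the
// wavefront would have to split its issue between the LDS path and the
// vector memory path because the lanes straddle the aperture.
bool wavefront_splits(const uint64_t addr[64])
{
    int lds_lanes = 0, mem_lanes = 0;
    for (int lane = 0; lane < 64; ++lane) {         // 64 range checks per issue
        bool in_lds = addr[lane] >= LDS_APERTURE_BASE &&
                      addr[lane] <  LDS_APERTURE_BASE + LDS_APERTURE_SIZE;
        if (in_lds) ++lds_lanes; else ++mem_lanes;
    }
    return lds_lanes > 0 && mem_lanes > 0;
}

int main()
{
    uint64_t addr[64];
    for (int lane = 0; lane < 64; ++lane)           // 60 lanes global, 4 lanes LDS
        addr[lane] = (lane < 60) ? 0x2000 + 4 * lane
                                 : LDS_APERTURE_BASE + 4 * lane;
    printf("split issue needed: %s\n", wavefront_splits(addr) ? "yes" : "no");
    return 0;
}

All of that (the per-lane address generation, 64 compares, and the split decision) would have to fit into the wavefront's single issue cycle, which is the problem being described above.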
 
yeah, 16x4 range checks/cycle might be too much.

Anyway, VA can offer some interesting optimizations - like enforcing that the first 0-X MB are reserved for the LDS. This would reduce the check complexity to a single comparison.


OT about SI:
I found this: "the scalar unit determines that a wait is required (the data is not yet ready), the Vector ALU can switch to another wavefront", which contradicts something I was thinking about SI - if a wait instruction is issued, can a new queued wavefront start if its data is available??
 
I cannot edit... I meant a single comparison, hardwired, with no register load needed, since you just enforce the upper limit once and for all.
 
OT about SI:
I found this: "the scalar unit determines that a wait is required (the data is not yet ready), the Vector ALU can switch to another wavefront", which contradicts something I was thinking about SI - if a wait instruction is issued, can a new queued wavefront start if its data is available??
Where does this quote come from?
Anyway, of course it can switch to another wavefront. Each of the four schedulers in a CU can hold up to 10 wavefronts concurrently (this may be reduced if the kernels use a high amount of registers or LDS). As long as one of them is ready to run, an instruction will be scheduled to the vector ALU. The scheduler can in fact issue multiple instructions (up to 5, plus "internal" ones consumed directly in the instruction buffers, like barriers or waits) in a single cycle if they are of different types. GCN can for instance issue one vALU, one sALU, one vector memory, and one LDS instruction in the same cycle, when each instruction comes from a different wavefront. Why should a single stalled wavefront block all the others too?
But what is clearly not true is that the scalar unit determines that a wait is required. The scalar unit has nothing to do with that.

Edit:
I've seen it was mentioned in some older OpenCL manual referring to s_waitcnt. But this is in fact one of the internal instructions which never leave the instruction buffer (the instructions consumed in the instruction buffer are: s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio, s_setvskip, and s_halt). If it were scheduled to the scalar unit, what would block the wavefront from execution in the next issue cycle? The check against the dependency counters is done in the scheduler itself. If the result indicates a still unresolved dependency, the instruction stays in the instruction buffer and is rechecked in the next issue cycle. If the dependency is resolved, the instruction is removed from the instruction buffer (as its purpose is fulfilled) and the next instruction is then ready to issue.
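A toy model of that per-issue-cycle check, just to illustrate the mechanics (the counter names follow the ones used in this thread; the struct and the numbers are invented):

Code:
#include <cstdio>

// Simplified per-wavefront state: outstanding-request counters plus the
// thresholds encoded in the s_waitcnt sitting in the instruction buffer.
struct Wavefront {
    int vm_cnt;              // outstanding vector memory requests
    int lgkm_cnt;            // outstanding LDS/GDS/constant/scalar requests
    int wait_vm, wait_lgkm;  // s_waitcnt thresholds it is waiting for
};

// The check the scheduler redoes every issue cycle: if the dependency is
// still unresolved the s_waitcnt stays buffered, otherwise it is retired on
// the spot and the next instruction of this wavefront becomes eligible.
bool waitcnt_satisfied(const Wavefront &w)
{
    return w.vm_cnt <= w.wait_vm && w.lgkm_cnt <= w.wait_lgkm;
}

int main()
{
    Wavefront w = {2, 1, 0, 0};              // s_waitcnt 0: wait for everything
    for (int cycle = 0; cycle < 4; ++cycle) {
        printf("cycle %d: %s\n", cycle,
               waitcnt_satisfied(w) ? "next instruction ready to issue"
                                    : "s_waitcnt stays in the buffer");
        if (w.vm_cnt) --w.vm_cnt;            // pretend one request returns per cycle
        else if (w.lgkm_cnt) --w.lgkm_cnt;
    }
    return 0;
}

Nothing here ever touches a scalar ALU, which is the point: the whole thing lives in the scheduler/instruction buffer.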
 
I do not understand... so where are the PC & flags saved when one wavefront is waiting and another takes its place in the SIMD (+ the SALU 1/cycle slice etc.)?
Some of the scalar registers are always dedicated/allocated for this?
 
yes, each SIMD has its own PC, I've seen it in diagrams, but my point was: is there a PC+Flags pair for the maximum number of wavefronts in flight, so 10?
If a wavefront goes on wait, you need to save its PC+Flags+split stack.
hmmm... maybe they are not saved but just kept static for each wavefront, so you don't need to move them. Yep, that sounds like the most reasonable option.
 
yes, each SIMD has its own PC, I've seen it in diagrams, but my point was: is there a PC+Flags pair for the maximum number of wavefronts in flight, so 10?
If a wavefront goes on wait, you need to save its PC+Flags+split stack.
hmmm... maybe they are not saved but just kept static for each wavefront, so you don't need to move them. Yep, that sounds like the most reasonable option.
Yes, each SIMD (or the instruction buffer/scheduler associated with a SIMD) has 10 PCs, one for each wavefront potentially running there. The register files (and the LDS) accommodate all running wavefronts at the same time (as many as fit are run, up to a maximum of 10 wavefronts per SIMD; if a kernel for instance uses no more than 32 vector GPRs, it can run 8 wavefronts on one SIMD; similar restrictions apply to the amount of scalar regs and the LDS). So there is no overhead for switching between wavefronts. You can consider this as some kind of fine-grained (and even simultaneous) multithreading.
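The VGPR part of that occupancy rule is simple enough to write down (the cap of 10 wavefronts per SIMD is from the post above, 256 VGPRs per lane is the standard GCN figure, and the allocation granularity of 4 is my assumption):

Code:
#include <cstdio>

// Maximum wavefronts per SIMD as limited by vector register usage alone.
// A GCN SIMD has a 64 KB vector register file, i.e. 256 VGPRs per lane.
int max_waves_per_simd(int vgprs_per_workitem)
{
    const int vgprs_per_lane = 256;
    const int granularity    = 4;        // assumed allocation step
    int alloc = ((vgprs_per_workitem + granularity - 1) / granularity) * granularity;
    int waves = vgprs_per_lane / alloc;
    return waves > 10 ? 10 : waves;      // hard cap of 10 wavefronts per SIMD
}

int main()
{
    const int tests[] = {24, 32, 64, 84, 128};
    for (int v : tests)
        printf("%3d VGPRs -> %d wavefronts/SIMD\n", v, max_waves_per_simd(v));
    return 0;
}

So 32 VGPRs gives the 8 wavefronts mentioned above, and only at 24 VGPRs or below do you reach the full 10.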
 