Nvidia Volta Speculation Thread

It isn't bad, but it might not be great.

A later post stated that garnering benefits would require software to be written to take advantage of it. Perhaps the first statement would have been worded better if it had said that current or existing game engines would not benefit, which in terms of timing is the same as saying "modern" but without the connotation that what current engines do is the optimum solution.

Code targeting existing hardware would be paranoid about any methods that might have synchronization occurring out of step with the SIMD hardware. As noted by Nvidia, one way to prevent problems is to use very coarse locking over the whole structure, even if contention is likely to be rare. Transforming the structures or algorithms so that they bulk up to match SIMD width can constrain storage or bandwidth, and may be impractical to implement.
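
To make that concrete, here is a minimal CUDA sketch of the coarse-lock workaround, assuming a single global lock over the whole structure; all names are illustrative, the full warp is assumed to call the function, and memory fences are omitted for brevity.

```cuda
// Minimal sketch: one lock guards the whole structure, and only one lane per
// warp ever touches it, so the lock acquisition itself never diverges within
// a warp. Illustrative only; fences omitted.
__device__ int g_lock = 0;

__device__ void warp_coarse_update(float *structure, int pos, float value)
{
    const bool leader = (threadIdx.x & 31) == 0;
    if (leader) {
        // Only lane 0 spins; whoever holds the lock is a *different* warp,
        // which can keep running and release it, so even pre-Volta SIMT
        // scheduling cannot deadlock here.
        while (atomicCAS(&g_lock, 0, 1) != 0) { }
    }
    __syncwarp();                  // remaining lanes wait for the leader
    structure[pos] += value;       // each lane updates its own slot
    __syncwarp();
    if (leader)
        atomicExch(&g_lock, 0);    // release for the next warp
}
```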

Depending on how Volta's selection logic works, there may also be some side benefits in a divergence case, if each path is allowed to issue some long-latency instructions like a memory access. That could overlap some memory latency, reducing some of the cost of occasional divergence at the price of potentially contending for cache and cache bandwidth (which Nvidia significantly increased).

In regards to the deadlock, to my understanding, a branch would run ahead and create an infinite loop. On GCN, for example, the scheduler would be selecting waves based on some metric.
An infinite loop should bias the selection rather quickly. So it may be slow, but it shouldn't lock.
I recall synchronization in the presence of divergent control flow within a wavefront being cautioned against. Whatever balancing point each architecture picks for which path is selected to be active, there are synchronization patterns that will deadlock it, so long as the active path spins waiting on other lanes without ever reaching a point where the execution mask is switched so that the lanes holding the data it needs can make meaningful progress.
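
For reference, the textbook shape of that deadlock is a per-element spinlock where lanes of the same warp can contend for the same lock; a hedged sketch (names are illustrative):

```cuda
// On pre-Volta SIMT the scheduler can keep replaying the spinning (losing)
// path while the winning lane, which would execute the release, sits masked
// off at the reconvergence point, so the loop never exits.
__device__ void locked_add(int *locks, float *data, int idx, float v)
{
    while (atomicCAS(&locks[idx], 0, 1) != 0) { /* spin */ }
    data[idx] += v;                 // critical section
    __threadfence();
    atomicExch(&locks[idx], 0);     // release; never reached if the warp is
                                    // stuck replaying the spin path above
}
```

Volta's independent thread scheduling is what would let the lane that won the CAS keep making progress here instead of waiting behind the spinners.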

Excluding a condition where nothing could advance, which really is a software issue. The scalar unit might be able to detect and resolve those issues as well during runtime.
If this is positing that the existing scalar unit can do this automatically, ascribing intelligence of that sort to existing hardware seems overly optimistic.
If this is some hypothetical:
If done in software, that goes to the point of what kind of abstraction the vendors supply. Nvidia is attempting to seal some of the leaks in the SIMT abstraction. GCN at a low level has backed away from that abstraction somewhat, but AMD doesn't seem to have forsworn the convenience+glass jaw at all levels.
Even if done in software, the exact methods employed may be impractical to employ throughout loops, or require more complex analysis of the code since there may be built-in assumptions about whether the other paths do or do not make forward progress while the current path is active.
 
Perhaps the first statement would have been worded better if it had said that current or existing game engines would not benefit, which in terms of timing is the same as saying "modern" but without the connotation that what current engines do is the optimum solution.
My take was the algorithms using the technique don't commonly occur in graphics. Not necessarily because they don't lend themselves to current hardware. Doesn't mean someone couldn't discover a use though.

If this is positing that the existing scalar unit can do this automatically, ascribing intelligence of that sort to existing hardware seems overly optimistic.
The hardware capacity should be there, but I'm unaware of any capability to do so. Scalar handles control flow and could, if allowed, manipulate scheduling or even data. Volta might be capable of something similar provided a parallel thread for control. Either a dev or drivers would need to provide exception handling. Outcome obviously determined by what happened.

Even if done in software, the exact methods employed may be impractical to employ throughout loops, or require more complex analysis of the code since there may be built-in assumptions about whether the other paths do or do not make forward progress while the current path is active.
Possibly impractical, however designing in a mechanism to handle exceptions would be worthwhile. A low-priority, persistent thread with elevated privileges, or a triggered event, could manage it. Volta's solution may be a software-driven approach that is compiled in when a possible lock is detected. Some creative work with masks and per-lane counters is all that should be required. Will be interesting to see more details.
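
For comparison, the commonly cited software-only restructuring of the spinlock sketched earlier in the thread (not necessarily what Nvidia's compiler will emit, and still somewhat compiler-dependent on older parts) keeps the acquire, critical section, and release inside one loop iteration, so the winning lane releases before the losers retry:

```cuda
// Restructured per-element lock that avoids the pre-Volta deadlock in
// software: the release happens in the same iteration as the acquire, before
// the divergent paths reconverge. Illustrative sketch only.
__device__ void locked_add_safe(int *locks, float *data, int idx, float v)
{
    bool done = false;
    while (!done) {
        if (atomicCAS(&locks[idx], 0, 1) == 0) {
            data[idx] += v;              // critical section
            __threadfence();
            atomicExch(&locks[idx], 0);  // release before reconvergence
            done = true;
        }
    }
}
```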
 
But tensor cores give no benefit to gaming; they're just a GPGPU accelerator. The way Pascal does async compute is already pretty decent; the next thing Nvidia needs to do is match Vega's DX12 feature levels to unchain devs to explore DX12's full power. Nvidia owns 70% of the dGPU market, so few AAA devs feel encouraged to explore DX12's full power the way the situation is. By 2019 Maxwell will be 5 years old, so devs will have no excuse not to explore at least a tad of DX12's power.
 
My take was the algorithms using the technique don't commonly occur in graphics. Not necessarily because they don't lend themselves to current hardware. Doesn't mean someone couldn't discover a use though.
That could likely be the intent of the statement. A later talking point covered new paradigms and algorithms, but perhaps that was for compute only.
Cases where dependencies are resolved by N+1 frame delays, or where list-based structures are eschewed due to per-lane capacity growth, might be places where this can be revisited.

The hardware capacity should be there, but I'm unaware of any capability to do so. Scalar handles control flow and could, if allowed, manipulate scheduling or even data. Volta might be capable of something similar provided a parallel thread for control. Either a dev or drivers would need to provide exception handling. Outcome obviously determined by what happened.
GCN's scalar unit can process changes to the execution mask and control flow, but if the currently active lanes' code commits them to an infinite loop, it's not going to change its mind in the middle without there being a chunk of code in place that would somehow conclude the branch. The scheduler is drawn by the marketing diagrams as a separate block, even if the scalar unit is potentially more closely entwined with it.
The more complex branch scenario for GCN that includes fork and join blocks statically pursues the path with the fewest active lanes, and the branch stack's fixed depth is architecturally linked to that behavior.

Possibly impractical, however designing in a mechanism to handle exceptions would be worthwhile. A low-priority, persistent thread with elevated privileges, or a triggered event, could manage it. Volta's solution may be a software-driven approach that is compiled in when a possible lock is detected. Some creative work with masks and per-lane counters is all that should be required. Will be interesting to see more details.
It seems complicated to have a separate thread asynchronously injecting itself into another warp's scheduling.
Volta's per-thread context tracking separates out the program counter and call stack.
The SIMD hardware could still optimize for the case where PC and S are consistent within a warp, with a smaller amount of resources tracking a generally smaller set of program counter and stack combinations.
Routine SIMD execution has PC and S running with a delta of 0 across the warp. Potentially, relatively straightforward divergence could be detected as a delta showing in PC with S being constant, and the separate path would instantiate a new entry that the relevant threads would be given a pointer to.
The hardware could track the number of issue cycles between paths, or other measures of progress without requiring knowledge of the algorithm on the part of the scheduler.

More complex and progressively slower cases would be influenced by the number and magnitude of the deltas between the per-thread values among other things.
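
Purely as a toy model of that bookkeeping (speculation about the idea in this post, not a description of Volta's actual scheduler), the warp would only need one tracking entry per distinct (PC, stack) pair:

```cuda
// Toy model: count how many distinct (PC, stack) pairs exist in a warp.
// A fully converged warp collapses to one entry, simple divergence gives two,
// and pathological cases grow toward 32. Speculative illustration only.
struct LaneState { unsigned pc; unsigned stack_top; };

__host__ __device__ int count_divergent_groups(const LaneState lane[32])
{
    int groups = 0;
    for (int i = 0; i < 32; ++i) {
        bool seen_before = false;
        for (int j = 0; j < i; ++j)
            if (lane[j].pc == lane[i].pc &&
                lane[j].stack_top == lane[i].stack_top)
                seen_before = true;
        if (!seen_before)
            ++groups;
    }
    return groups;
}
```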
 
Do you mind explaining why the conclusion is so obviously wrong? It seems to me that being able to make forward progress guarantees in the presence of fine-grained synchronization does require some HW to ensure that various diverged "pieces" of a single warp get their fair share of execution time, regardless of what the other pieces are doing. Calling whatever that is "scheduling HW" also seems appropriate.
"scheduling hardware is back" suggest that it was removed at some point. The way I read this, it refers directly to some parts of the Internet that thinks that Nvidia removed hardware scheduling starting with Kepler and that this somehow explains their async compute performance issues. It's hard to believe, but there are actually people out there who think that Nvidia schedules individual warps by the CPU.

I don't know enough about async compute to understand the subtleties, strengths and weaknesses of scheduling between graphics and compute (and I have no doubt that AMD is stronger there), but I do know that scheduling individual warps has always been done in hardware and that Volta isn't going to be any different in this regard. The new independent thread feature will obviously require changes to that warp scheduling hardware but it doesn't seem like a major thing.

That deadlock would be rather specific to Nvidia hardware as well.
AFAIK GCN treats divergence with a PC stack and takes the path with the fewest active lanes first? If so, that creates deadlocks just the same.
 
Kepler removed (large) parts of Fermi's scoreboarding hardware, which (in Fermi!) allowed tracking of variable instruction latencies without any compiler knowledge about it - maybe that's what people confuse? According to Nvidia the main culprit was that this part was a power hog, so in order to trim the fat, they moved fixed instruction latencies (most math ones) into the compiler's scheduling policies. What I am very unsure about is how this compiler scheduling turns out in practice, when (known) math instructions alternate with (unpredictable) memory operations.

Saving power in exchange for a harder job for the software guys, but being able to fine tune after silicon was baked.
 
Kepler removed (large) parts of Fermi's scoreboarding hardware, which (in Fermi!) allowed tracking of variable instruction latencies without any compiler knowledge about it - maybe that's what people confuse?
Yes, that’s exactly it. But when discussions are often reduced to 140 characters, that “compiler” gets replaced by “SW” and suddenly you have a GPU that does all its scheduling in SW. :)

What I am very unsure about is how this compiler scheduling turns out in practice, when (known) math instructions alternate with (unpredictable) memory operations.
I think there’s no other choice than to still have a scoreboard or (if they are returned in order) some kind of reference counting.

Saving power in exchange for a harder job for the software guys, but being able to fine tune after silicon was baked.
I don’t think the SW task is very hard: for Maxwell, it can be done with a Perl script, see Maxas. I’m not aware of a Kepler equivalent though.
 
What I am very unsure about is how this compiler scheduling turns out in practice, when (known) math instructions alternate with (unpredictable) memory operations.
What do you mean by unpredictable? Either it's a L1 hit, or it's a stall of unknown duration. No need to account for anything in between.
 
What do you mean by unpredictable? Either it's a L1 hit, or it's a stall of unknown duration. No need to account for anything in between.
As silent_guy said: unpredictable as in "hope that at some not too distant point in the future a value is returned so that we can continue", which is where my imagination ends as to how the actual scheduling around such holes is performed. Or is it just that: wait, and if condition X is met, continue? If so, you'd need to dynamically reallocate that particular wavefront/warp to the execution resources, right?
 
You need to keep track of memory requests that are in flight before you can schedule the warp that ordered it.

And a stall of unknown duration is by definition unpredictable. :)
Well, if you don't hit the L1, just yield on the read? Same if a store doesn't complete immediately?

Sure, that's still one request per lane which needs to be kept track of. Only for the bad case though, when an L1 miss occurs, not in general, and there is nothing left in the pipeline while stalled.

I would assume that the load is actually repeated after a stall occurred due to cache miss and was resolved, so it then has the same guarantees regarding latency as if it had been a hit on first try.
 
Sure, that's still one request per lane which needs to be kept track of. Only for the bad case though, when an L1 miss occurs, not in general, and there is nothing left in the pipeline while stalled.
One request only? Surely it should be able to issue load requests until there is a dependency stall.
There are CUDA kernels with a very low number of threads where all load prefetching is done by hand. Those wouldn’t have high performance without multiple loads in flight.
 
Prefetch doesn't need to be tracked, does it? Only the actual load.

There is a prefetch instruction, even though I don't know when the compiler is ever emitting it.
 
Prefetch doesn't need to be tracked, does it? Only the actual load.

There is a prefetch instruction, even though I don't know when the compiler is ever emitting it.
I mean: multiple actual loads to prefetch a bunch of data into, for example, shared memory.
 
I'm not sure prefetch is meaningful in the context of Nvidia GPUs - memory instructions are already asynchronous, and are tied to barriers that instructions explicitly wait for when they need the data. In this way, all loads are prefetches if you want to look at it that way. I believe that Maxwell (no idea about newer architectures) has 6 barriers per warp, and so it can have 6 load/stores in flight without stalling. Barriers are statically and explicitly assigned in the ISA. Volta would be somewhat different since it has to either have separate barriers for each thread, or else perform a stall whenever divergent threads contend for one of the barriers.
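
A toy model of those per-warp barriers, taking only the six-slot count and the static assignment from the post above; everything else is an illustrative simplification, not real SASS semantics:

```cuda
// Toy model: a load is issued against a statically chosen barrier slot and
// the consuming instruction stalls until the memory system clears that slot.
struct WarpBarriers {
    bool pending[6];   // one flag per dependency barrier
};

__host__ __device__ void issue_load(WarpBarriers &w, int slot)
{
    w.pending[slot] = true;     // request in flight; the warp keeps issuing
}

__host__ __device__ void load_returned(WarpBarriers &w, int slot)
{
    w.pending[slot] = false;    // memory system signals completion
}

__host__ __device__ bool consumer_ready(const WarpBarriers &w, int slot)
{
    return !w.pending[slot];    // the consumer names the same slot in the ISA
}
```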
 
I'm not sure prefetch is meaningful in the context of Nvidia GPUs - memory instructions are already asynchronous, and are tied to barriers that instructions explicitly wait for when they need the data. In this way, all loads are prefetches if you want to look at it that way. I believe that Maxwell (no idea about newer architectures) has 6 barriers per warp, and so it can have 6 load/stores in flight without stalling. Barriers are statically and explicitly assigned in the ISA. Volta would be somewhat different since it has to either have separate barriers for each thread, or else perform a stall whenever divergent threads contend for one of the barriers.
Let’s not get hung up on the term prefetching.

What I’m getting at is that there are kernels that do away with the whole thing about having as many warps as possible for latency hiding, that micromanage loads into shared memory instead.

Look at the matrix multiplication SGEMM routines of NervanaSys:
https://github.com/NervanaSystems/maxas/wiki/SGEMM

They only use 64 threads (so 2 warps) to reach 98% of the peak FP32 TFLOPS.

I don’t think this could work if the SM wasn’t able to have multiple read requests in flight from the same warp.

Which means that the SM needs HW resources to keep track of all those reads.

I could be wrong about this. :)
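
As an illustration of that point (a sketch in the same spirit, not Nervana's actual code), here is the shape of a kernel that relies on several outstanding loads per warp: the loads are issued back to back and none of their results is consumed until later, so the SM has to track all of them at once.

```cuda
// Each thread issues four independent global loads before consuming any of
// them; with only a couple of warps resident, throughput depends on the SM
// keeping all four requests in flight. Names and unroll factor illustrative.
__global__ void strided_sum(const float * __restrict__ a, float *out, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float r0 = 0.f, r1 = 0.f, r2 = 0.f, r3 = 0.f;

    for (int i = tid * 4; i + 3 < n; i += stride * 4) {
        // Four loads issued back to back; nothing below needs their results
        // yet, so all four can be outstanding at the same time.
        float v0 = a[i + 0];
        float v1 = a[i + 1];
        float v2 = a[i + 2];
        float v3 = a[i + 3];
        r0 += v0; r1 += v1; r2 += v2; r3 += v3;
    }
    out[tid] = r0 + r1 + r2 + r3;
}
```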
 
So who wants to open their wallet and go benchmark Volta?

Today we are making the next generation of GPU-powered EC2 instances available in four AWS regions. Powered by up to eight NVIDIA Tesla V100 GPUs, the P3 instances are designed to handle compute-intensive machine learning, deep learning, computational fluid dynamics, computational finance, seismic analysis, molecular modeling, and genomics workloads.

https://aws.amazon.com/blogs/aws/new-amazon-ec2-instances-with-up-to-8-nvidia-tesla-v100-gpus-p3/
 
That didn’t take long.

Results are pretty much in line with what we’ve seen in other benchmarks.

I’ve been trying to see if LuxMark is more on the compute or BW side of things when it comes to bottleneck.
It seems to be a mixed bag when you compare GP100 and GP102 results.
 