Nvidia Volta Speculation Thread

My question is what is the advantage of being able to do so? When would you use this functionality?
I'd say the idea is to make divergent control flow "not suck". If you could push the lanes/threads that take the "else" path out to another warp and pull lanes/threads that take the "if" path into the current warp, you would end up with "clean" warps that only take the "if" path or only the "else" path, which at first glance would not suck. On closer inspection it would still suck, because the branch then becomes a grid-wide sync point and you would potentially have to shift individual threads across different SMs or even different GPUs.
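To spell out the baseline problem: in plain SIMT, both sides of a branch get issued for the whole warp, with the non-participating lanes masked off. A trivial CUDA sketch of that (made-up kernel, nothing Volta-specific):

Code:
// Plain SIMT divergence: within one warp, lanes taking different sides of the
// branch are serialized -- the warp issues the "if" body with the odd lanes
// masked off, then the "else" body with the even lanes masked off.
__global__ void diverge(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        data[i] = data[i] * 2.0f;   // only the even lanes are active here
    else
        data[i] = data[i] + 1.0f;   // only the odd lanes are active here
}

Repacking would be about filling each warp with lanes that all take the same side, so neither pass runs half-empty.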
 
Double precision is fully programmable, while tensor cores are fixed-function units aimed at AI. There is likely zero chance they'll be in consumer cards, or even Quadros for that matter.

It's just matrix multiplication. For compatibility there will probably be something like 1 Tensor Core per SM in GeForce instead of the 8 per 64 SPs now, just as with DP.
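For reference, this is roughly the granularity CUDA 9 exposes them at: one warp cooperatively computes a small matrix-multiply-accumulate tile through the WMMA API. A minimal sketch (kernel name and layouts are my own choice, needs sm_70, one warp per tile):

Code:
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a 16x16x16 tile: C = A*B + C, FP16 inputs, FP32 accumulate.
__global__ void wmma_tile(const half *a, const half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, a, 16);              // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);                 // the Tensor Core op
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}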

While reading the Twitter thread and Olivier Giroux's answer, I couldn't help thinking about Nvidia's and AMD's design philosophies:
Tom Forsyth
Seems like a tiny bit of generalization (lane-repacking) would help them enormously in so many ways. Must be some reason they didn't do it.
They're clearly not stupid, they've had LOTS of iterations to try it, so... I must be missing something fundamental why they can't do it?

Olivier Giroux @__simt__

It goes like: scheduling hard, heuristics bad, lane repacking burns pJ's, workloads for arch N-1 don't value it, it's hard to get a net win.

I have the feeling that Nvidia's philosophy is to build in features only if they don't cost too much perf/W and will actually be used within a short span of time.
I've read a few times already that some devs find it easier to develop on GCN, as it seems more flexible. Is AMD perhaps more accommodating of developers' wishes, but sacrificing some perf/W for it? Vega, to me, looks like "put in as many features as we can and figure out good perf/W later".
 
I'd say the idea is to make divergent control flow "not suck". If you could push the lanes/threads that take the "else" path out to another warp and pull lanes/threads that take the "if" path into the current warp, you would end up with "clean" warps that only take the "if" path or only the "else" path, which at first glance would not suck. On closer inspection it would still suck, because the branch then becomes a grid-wide sync point and you would potentially have to shift individual threads across different SMs or even different GPUs.
Forget about ever re-joining; that's not going to happen if you truly fork. Also forget about fully forking: for the usual complexity of most shaders this would only be viable if the forked warp stayed on the same SM, which you can't guarantee given occupancy constraints.

You can achieve a minor advantage though, using the advanced lane-shuffling instructions. For an FPU/ALU-limited shader, choose a warp size 2x or 4x the native warp size. Then, before you enter the divergent section, shuffle execution such that you get contiguous blocks of the native warp size which are (as far as possible) either fully enabled or fully disabled.

If the architecture is not completely messed up, then fully masked sub-warps should not be executed at all. (As I said: if the architecture isn't messed up.) As a result, the scheduling rate of the remaining sub-warps should improve.

I am not aware that the architecture supports automatic compaction of divergent lanes in a (larger-than-native) warp yet (that would be a nice surprise!), but with proper inter-lane operations this should at least become possible manually.

Obviously this is only going to work if the workload doesn't branch more than one or two levels deep. But as long as it doesn't, you should at least be able to treat a larger warp as a group of independent SIMD blocks, where the larger warp only provides the lockstep needed to enable lane shuffling across a sufficiently large pool.
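As a hand-wavy illustration of the "skip fully masked sub-warps" part, assuming the repack has already clustered the predicate per native warp of 32 (helper names made up):

Code:
__device__ float expensive_path(float x) { return x * x + 1.0f; }   // stand-in

// Per native warp: if no lane takes the expensive path after repacking,
// bail out before issuing any of it. __ballot_sync returns a bitmask of
// the lanes whose predicate is set; must be called by the full warp.
__device__ void maybe_expensive(bool pred, float *v, int i)
{
    unsigned active = __ballot_sync(0xffffffffu, pred);
    if (active == 0)
        return;                       // whole sub-warp disabled: nothing issued
    if (pred)
        v[i] = expensive_path(v[i]);
}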
 
Is there a GPU that is/was able to do this? I mean what exactly is new information in this tweet storm?
FPGAs or MIMD processors can approximate it. The storm came from confusion over Volta's thread sync and the way it was marketed. The understanding was that some form of packing was occurring to increase efficiency, with each lane somehow acting as an independent scalar unit. In reality it ended up being far more basic.

My question is what is the advantage of being able to do so? When would you use this functionality?
The programmer wouldn't necessarily use it directly. As it exists, it covers for poor programming that would otherwise result in a deadlock. The miscommunication was that they were taking a threadgroup and packing warps from those threads. That could yield a significant boost to execution efficiency, since only the remainder of each diverged path would result in a partially masked warp. For example, four warps each diverging into four paths costs 16 issue slots' worth of execution, whereas with packing it could be 4, with zero masked lanes.

only be viable if the forked warp stayed on the same SM, which you can't guarantee given occupancy constraints.
If registers are spilled, those constraints could be relaxed; it's not all that different from preemption. Performance is another concern, but it may be acceptable for some models.
 
FPGAs or MIMD processors can approximate it. The storm came from confusion over Volta's thread sync and the way it was marketed. The understanding was that some form of packing was occurring to increase efficiency, with each lane somehow acting as an independent scalar unit. In reality it ended up being far more basic.
I must have missed the marketing and just read what was published on their website and in their GTC presentation, where they were pretty clear that the independent-thread feature was mainly a way to avoid deadlocks, which indeed makes them look more like independent scalar lanes. But I never saw any claims by Nvidia about repacking.

They do claim more efficiency for fine grained parallel algorithms, which doesn’t seem incorrect either.

The programmer wouldn't necessarily use it directly. As it exists, it covers for poor programming that would otherwise result in a deadlock.
Exactly. To me, the guarantee that you can’t hang yourself with code that looks totally normal seems to be a huge step forward in terms of programmer friendliness.

It's not about the programmer using it directly; it's mostly about the programmer not being able to hit the broken old behavior inadvertently.
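The textbook case is an intra-warp lock. A sketch like this (untested, names made up) can hang on pre-Volta parts, because the lane that wins the lock and the lanes spinning on it sit on diverged paths of the same warp, so the winner may never be rescheduled to release it. Volta's per-thread forward-progress guarantee is exactly what makes it legal:

Code:
__device__ int lock = 0;

__global__ void per_thread_critical_section(int *counter)
{
    bool done = false;
    while (!done) {
        if (atomicCAS(&lock, 0, 1) == 0) {   // one lane wins the lock
            *counter += 1;                   // critical section
            atomicExch(&lock, 0);            // release
            done = true;
        }
        // Pre-Volta: the spinning lanes can starve the lock holder within the
        // warp, deadlocking it. Volta: every thread keeps making progress.
    }
}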

The miscommunication was that they were taking a threadgroup and packing warps from those threads. That could yield a significant boost to execution efficiency, since only the remainder of each diverged path would result in a partially masked warp. For example, four warps each diverging into four paths costs 16 issue slots' worth of execution, whereas with packing it could be 4, with zero masked lanes.
I think that miscommunication was due to people fundamentally misunderstanding what goes on in a GPU before Volta, and thus after as well.

I mean, here’s Ryan Smith’s comment in his Volta article about independent threads:
This also means that a limited amount of scheduling hardware is back in NV’s GPUs.
If he makes such an obviously wrong conclusion, then it shouldn’t be surprising that the general commenting public or lesser journalists make even bigger ones.

Nvidia didn’t say a thing about thread repacking. Others did. That’s not miscommunication, that’s wishful thinking.
 
They’ve already done that.

I don’t understand the comment about “trying to figure out what they can do”. It’s a matrix multiplication. What is there to figure out???
How to use the Tensor Cores beyond the DL libraries. A lead Nvidia engineer has mentioned it is not that simple and that they are still learning what is realistically feasible and what is not.
Hence they have only recently started talking about simulation-modelling potential with Tensor Cores and Volta, while the very early days were focused on the DL libraries/frameworks; look how many people assume Tensor Cores can only be used for DL applications/frameworks.

It is like GEMM: cuBLAS optimisation is pretty easy with the Nvidia libraries, but there are quite a few who go to C++ to hit 90% scaling on Nvidia Tesla GPUs for their HPC applications.
Anyway, it was a senior Nvidia engineer making the comment that they are still learning what is feasible with Tensor Cores/Volta; consider the recent work they did with Baidu on an alternative technique to improve bandwidth and mixed-precision accuracy for DL training with Volta and Tensor Cores.
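For reference, the "easy" library path boils down to opting a cuBLAS handle into tensor-op math and calling an FP16-input GEMM; a rough sketch against the cuBLAS 9 interface (setup, sizes and error checking assumed/omitted):

Code:
#include <cublas_v2.h>
#include <cuda_fp16.h>

// d_A, d_B are FP16 device matrices, d_C is FP32; column-major, no transpose.
void gemm_with_tensor_ops(cublasHandle_t handle,
                          const half *d_A, const half *d_B, float *d_C,
                          int m, int n, int k)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // allow Tensor Cores
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k, &alpha,
                 d_A, CUDA_R_16F, m,
                 d_B, CUDA_R_16F, k,
                 &beta,
                 d_C, CUDA_R_32F, m,
                 CUDA_R_32F,                            // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}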

Edit:
To clarify, I am not disagreeing with you that in general it is GEMM development; my point is rather how to use this (Volta + Tensor Cores) from a practical perspective, with notable real-world gains both from a broader library perspective and in HPC/scientific applications.
Cheers
 
I think that miscommunication was due to people fundamentally misunderstanding what goes on in a GPU before Volta, and thus after as well.

I mean, here’s Ryan Smith’s comment in his Volta article about independent threads:
This also means that a limited amount of scheduling hardware is back in NV’s GPUs.

If he makes such an obviously wrong conclusion, then it shouldn’t be surprising that the general commenting public or lesser journalists make even bigger ones.

Do you mind explaining why the conclusion is so obviously wrong? It seems to me that being able to make forward progress guarantees in the presence of fine-grained synchronization does require some HW to ensure that various diverged "pieces" of a single warp get their fair share of execution time, regardless of what the other pieces are doing. Calling whatever that is "scheduling HW" also seems appropriate.

However, I'm not sure why the article says the "scheduling hardware is back..." (this doesn't seem like something any previous Nvidia GPU was able to do), and I think the article also got it wrong two sentences back, with:
... threads can now be synchronized at a fine-grain level, and while the SIMT paradigm is still alive and well, it means greater overall efficiency.

These changes seem like they are concerned with increasing the generality of the logical execution model, leading to easier parallel algorithm implementation, not efficiency.
 
I must have missed the marketing and just read what was published on their website and in their GTC presentation, where they were pretty clear that the independent-thread feature was mainly a way to avoid deadlocks, which indeed makes them look more like independent scalar lanes. But I never saw any claims by Nvidia about repacking.
It seemed clear to me as well; however, that entire Twitter thread exists, so there was definitely some miscommunication, and I've seen claims of "Nvidia not using a SIMD architecture" bandied about as well. The way that diagram with all the threads was drawn was a bit confusing, as it otherwise looked like standard diverged paths on SIMD.

They do claim more efficiency for fine grained parallel algorithms, which doesn’t seem incorrect either.
It doesn't appear to be more efficient beyond enabling those algorithms to be implemented at all, though. For gaming, as that thread concluded, there doesn't seem to be much in the way of gains, hardware changes aside. That deadlock would be rather specific to Nvidia hardware as well.

Nvidia didn’t say a thing about thread repacking. Others did. That’s not miscommunication, that’s wishful thinking.
Not wishful thinking so much as the only plausible way to get more efficiency out of diverged code; that, and treating each lane independently. The original blogs were big on efficiency gains which, beyond avoiding a deadlock in a corner case, don't appear to exist from an architectural perspective.

However, I'm not sure why the article says the "scheduling hardware is back..."
There is the MPS hardware that was added. That may be facilitating some synchronization for async compute. SMs are still a bit of an unknown.
 
However, I'm not sure why the article says the "scheduling hardware is back..." (this doesn't seem like something any previous Nvidia GPU was able to do), and I think the article also got it wrong two sentences back, with:
Because, as everyone knows by now, there is only one kind of scheduling on a 21-billion-transistor GPU. Sarcasm off... Dragging up the hardware scheduling of Fermi versus the software scheduling of Maxwell/Pascal (which was about wait states after a warp's memory accesses) for every other possible kind of scheduling is getting old.
So obviously the hardware/software scheduling of waits in Fermi/Maxwell/Pascal is one thing, the scheduling of different kernels is a second, and the scheduling of threads within a warp in Volta is a third. There is zero overlap.
As I understand it, the implementation is fairly simple: if the warp stalls, execution switches to other threads within that warp.

That deadlock would be rather specific to Nvidia hardware as well.
How so?
 
Not wishful thinking so much as the only plausible way to get more efficiency out of diverged code; that, and treating each lane independently. The original blogs were big on efficiency gains which, beyond avoiding a deadlock in a corner case, don't appear to exist from an architectural perspective.
Everything you need is already in the architecture, or at least should be: lane-shuffling instructions that pretty much let you swap full register sets between lanes so that they form packed warps in a thread group. It's only waiting for the matching intrinsic to trigger the repack.

What's unfortunately not quite clear is how these lane-shuffling instructions work with already-masked execution. Could you use them to swap control flow with a lane which is currently masked? Or do all participating lanes actually need to be active, as they did on Maxwell?

Some detail is missing. It's hard to imagine why not even explicit lane repacking is a thing yet. The only thing I can think of is that each lane has a hard-coded offset into the stack or the register file, which makes it impossible to just transfer execution to another lane without also copying a significant share of the state.
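For what it's worth, the CUDA 9 *_sync shuffle variants at least make participation explicit: the first argument is a lane mask, and the lanes named in it have to reach the call. A trivial example of the new form (assumes a 1D block with the full warp active):

Code:
// Rotate a value left by one lane across a full warp of 32.
__device__ float rotate_left(float v)
{
    return __shfl_sync(0xffffffffu, v, (threadIdx.x + 1) % 32);
}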
 
Everything you need is already in the architecture, or at least should be: lane-shuffling instructions that pretty much let you swap full register sets between lanes so that they form packed warps in a thread group.
You can shuffle within a warp/wave, not across warps/waves. This doesn't help you at all with divergent threads in a warp; you'd have to do the repacking over shared memory.
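A rough sketch of what repacking over shared memory could look like for a 128-thread group (everything here is hypothetical: a real kernel would use a proper scan instead of the atomics and carry the original indices along so results can be scattered back):

Code:
// Repack work items across a 128-thread block by predicate so that, after the
// barrier, each native warp of 32 sees (mostly) uniform control flow.
// Assumes blockDim.x == 128 and a grid that covers the data exactly.
__global__ void repacked_divergence(const float *in, float *out)
{
    __shared__ float staged[128];
    __shared__ int   counts[2];            // [1]: "if" items, [0]: "else" items

    int gid   = blockIdx.x * blockDim.x + threadIdx.x;
    float x   = in[gid];
    bool pred = (x > 0.0f);                // the branch we want to de-diverge

    if (threadIdx.x < 2) counts[threadIdx.x] = 0;
    __syncthreads();

    // Crude compaction: "if" items fill the buffer from the front, "else"
    // items from the back. (An exclusive scan would avoid the atomics.)
    int slot = pred ? atomicAdd(&counts[1], 1)
                    : 127 - atomicAdd(&counts[0], 1);
    staged[slot] = x;
    __syncthreads();

    // Low thread indices now hold the "if" population, high indices the
    // "else" population, so whole warps tend to take the same side.
    float y = staged[threadIdx.x];
    out[gid] = (threadIdx.x < counts[1]) ? (y * 2.0f)    // stand-in "if" path
                                         : (y + 1.0f);   // stand-in "else" path
}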
 