Is there a GPU that is/was able to do this? I mean what exactly is new information in this tweet storm?
> Is there a GPU that is/was able to do this?

My question is what is the advantage of being able to do so? When would you use this functionality?
> Is there a GPU that is/was able to do this? I mean what exactly is new information in this tweet storm?

My thoughts exactly, nothing useful about any of this except some ramblings and expressing hate for NVIDIA in two of the comments. Utterly useless.
> My question is what is the advantage of being able to do so? When would you use this functionality?

I'd say the idea is to make divergent control flow "not suck". If you were able to push some lanes/threads that take the "else" path to another warp, and pull in some lanes/threads that take the "if" path into the current warp, you could end up with "clean" warps that only take the "if" path or the "else" path, which at first glance would not suck. At second glance it would still suck, because that branch then becomes a grid-wide sync point and you would potentially have to shift individual threads across different SMs or even different GPUs.
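To make the divergence cost concrete, here is a minimal CUDA sketch (hypothetical kernel; the even/odd split is chosen purely for illustration). On current SIMT hardware the warp runs both sides of the branch back to back with the non-participating lanes masked off, which is the "suck" being referred to; repacking would instead gather the "if" lanes and the "else" lanes into separate, fully active warps.

```
// Hypothetical CUDA kernel illustrating warp divergence (not code from the thread).
// With a 32-wide warp, half the lanes take each path, so the warp executes
// BOTH bodies one after the other, with the inactive lanes masked off each time.
__global__ void divergent(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (i & 1) {                 // "if" path: odd lanes active, even lanes masked
        out[i] = in[i] * 2.0f;
    } else {                     // "else" path: even lanes active, odd lanes masked
        out[i] = in[i] + 1.0f;
    }
    // Repacking would instead sort the odd and even lanes into separate,
    // fully populated warps, so each path runs once with no masked lanes,
    // at the cost of the cross-warp shuffling discussed above.
}
```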
Double precision is fully programmable, while tensor cores are fixed-function units aimed at AI. There is likely zero chance they'll be in consumer cards, or even Quadros for that matter.
Tom Forsyth
Seems like a tiny bit of generalization (lane-repacking) would help them enormously in so many ways. Must be some reason they didn't do it.
They're clearly not stupid, they've had LOTS of iterations to try it, so... I must be missing something fundamental why they can't do it?
olivier giroux @__simt__
It goes like: scheduling hard, heuristics bad, lane repacking burns pJ's, workloads for arch N-1 don't value it, it's hard to get a net win.
> I'd say the idea is to make divergent control flow "not suck". If you were able to push some lanes/threads that take the "else" path to another warp, and pull in some lanes/threads that take the "if" path into the current warp, you could end up with "clean" warps that only take the "if" path or the "else" path, which at first glance would not suck. At second glance it would still suck, because that branch then becomes a grid-wide sync point and you would potentially have to shift individual threads across different SMs or even different GPUs.

Forget about ever joining; that's not going to happen if you truly fork. Also forget about fully forking: for the usual complexity of most shaders, it would only be viable if the forked warp stayed on the same SM, which you can't guarantee given occupancy constraints.
> Is there a GPU that is/was able to do this? I mean what exactly is new information in this tweet storm?

FPGAs or MIMD processors can approximate it. The storm came from confusion over the way Volta's thread sync was marketed. There was an understanding that some form of packing was occurring to increase efficiency, with each lane somehow acting as an independent scalar. In reality it ended up being far more basic.
> My question is what is the advantage of being able to do so? When would you use this functionality?

The programmer wouldn't necessarily use it directly. As it exists, it covers for poor programming that would otherwise result in a deadlock. The miscommunication was that they were taking a threadgroup and packing warps from those threads. That could yield a significant boost to execution efficiency, as only the remainder of every diverged path would result in a masked warp. For example, four warps each diverging four ways results in 16x the cycles of execution, whereas with packing it could be 4x with zero masked lanes.
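A rough sketch of that arithmetic, assuming a hypothetical four-way branch spread evenly across four warps (numbers chosen to match the example above):

```
// Hypothetical 4-way divergence across 4 warps (128 threads).
// Each warp contains lanes taking all 4 cases, so every warp serializes
// all 4 bodies: 4 warps x 4 passes = 16 warp-passes, mostly masked.
// If lanes could be repacked by case, each case would fill exactly one
// warp, and the block would need only 4 fully active warp-passes.
__global__ void four_way(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    switch (i % 4) {
        case 0: out[i] = in[i] + 1.0f; break;
        case 1: out[i] = in[i] * 2.0f; break;
        case 2: out[i] = in[i] - 3.0f; break;
        case 3: out[i] = in[i] / 4.0f; break;
    }
}
```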
> ... it would only be viable if the forked warp stayed on the same SM, which you can't guarantee given occupancy constraints.

If spilling registers, those constraints could be relaxed; it's not all that different from preemption. Performance is another concern, but it may be acceptable for some models.
> FPGAs or MIMD processors can approximate it. The storm came from confusion over the way Volta's thread sync was marketed. There was an understanding that some form of packing was occurring to increase efficiency, with each lane somehow acting as an independent scalar. In reality it ended up being far more basic.

I must have missed the marketing and just read the stuff that was published on their website and their presentation at GTC, where they were pretty clear that their independent threads feature was mainly a way to avoid locks, which indeed makes them look more like independent scalar lanes. But I never saw any claims by Nvidia about repacking.
> The programmer wouldn't necessarily use it directly. As it exists, it covers for poor programming that would otherwise result in a deadlock.

Exactly. To me, the guarantee that you can't hang yourself with code that looks totally normal seems to be a huge step forward in terms of programmer friendliness.
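For concreteness, a sketch of the kind of normal-looking code in question; this is the familiar intra-warp spin-lock pattern (hypothetical example, not code from the thread):

```
// Hypothetical per-block critical section guarded by a spin lock.
// Pre-Volta, all lanes of a warp share one program counter: if the lane that
// holds the lock ends up on the inactive side of the divergent loop, the
// spinning lanes never yield and the warp can hang. Volta's independent
// thread scheduling guarantees forward progress, so this pattern can complete.
__global__ void locked_increment(int *lock, int *counter)
{
    bool done = false;
    while (!done) {
        if (atomicCAS(lock, 0, 1) == 0) {   // try to acquire the lock
            *counter += 1;                  // critical section
            __threadfence();                // make the write visible before releasing
            atomicExch(lock, 0);            // release the lock
            done = true;
        }
    }
}
```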
> The miscommunication was that they were taking a threadgroup and packing warps from those threads. That could yield a significant boost to execution efficiency, as only the remainder of every diverged path would result in a masked warp. For example, four warps each diverging four ways results in 16x the cycles of execution, whereas with packing it could be 4x with zero masked lanes.

I think that miscommunication was due to people fundamentally misunderstanding what goes on in a GPU before Volta, and thus after as well.

I mean, here's Ryan Smith's comment in his Volta article about independent threads:

> ... threads can now be synchronized at a fine-grain level, and while the SIMT paradigm is still alive and well, it means greater overall efficiency.
> This also means that a limited amount of scheduling hardware is back in NV's GPUs.

If he makes such an obviously wrong conclusion, then it shouldn't be surprising that the general commenting public or lesser journalists make even bigger ones.
> They've already done that.

How to use the Tensor Cores beyond the DL libraries: an Nvidia lead engineer has mentioned it is not that simple, and they are still learning what is realistically feasible and what is not.
I don’t understand the comment about “trying to figure out what they can do”. It’s a matrix multiplication. What is there to figure out???
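For reference, the path NVIDIA exposes outside the DL libraries is the CUDA 9 WMMA API, which is indeed essentially a warp-wide matrix multiply-accumulate; a minimal sketch (16x16x16 tile, FP16 inputs with FP32 accumulation, sizes and layouts chosen for illustration):

```
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Minimal sketch of the CUDA 9 WMMA path to the tensor cores: one warp
// (launch with 32 threads) computes a single 16x16x16 tile, C = A*B, with
// FP16 inputs and an FP32 accumulator. The pointers are assumed to describe
// one 16x16 tile each, so the leading dimension is 16.
__global__ void wmma_tile(const half *a, const half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // start from C = 0
    wmma::load_matrix_sync(a_frag, a, 16);           // load the A tile
    wmma::load_matrix_sync(b_frag, b, 16);           // load the B tile
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // tensor core MMA
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

The open question raised above is presumably less the multiply itself than staging data into those fixed FP16 tile shapes for problems that aren't dense, GEMM-like workloads.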
> I must have missed the marketing and just read the stuff that was published on their website and their presentation at GTC, where they were pretty clear that their independent threads feature was mainly a way to avoid locks, which indeed makes them look more like independent scalar lanes. But I never saw any claims by Nvidia about repacking.

It seemed clear to me as well; however, that entire Twitter thread exists. So there was definitely some miscommunication, and I've seen examples of "Nvidia not using a SIMD architecture" bandied about as well. The way that diagram with all the threads was drawn was a bit confusing, as it otherwise looked like standard diverged paths on SIMD.
> They do claim more efficiency for fine grained parallel algorithms, which doesn't seem incorrect either.

It doesn't appear to be more efficient beyond enabling those algorithms to be implemented at all, though. For gaming, as that thread concluded, there doesn't seem to be much in the way of gains, hardware changes aside. That deadlock would be rather specific to Nvidia hardware as well.
> Nvidia didn't say a thing about thread repacking. Others did. That's not miscommunication, that's wishful thinking.

Not wishful thinking so much as the only plausible way to get more efficiency out of diverged code. That, and treating each lane independently. The original blogs were big on efficiency gains which, beyond avoiding a deadlock in a corner case, don't appear to exist from an architectural perspective.
> However, I'm not sure why the article says the "scheduling hardware is back..."

There is the MPS hardware that was added. That may be facilitating some synchronization for async compute. SMs are still a bit of an unknown.
> However, I'm not sure why the article says the "scheduling hardware is back...", (this doesn't seem like something any previous NVidia GPUs was able to do), and I think the article also got it wrong 2 sentences back, with:

Because, as everyone by now knows, there is only one kind of scheduling on a 21-billion-transistor GPU. Sarcasm off... Invoking the old "hardware scheduling in Fermi vs. software scheduling in Maxwell/Pascal" distinction, which was about wait states after a warp's memory accesses, for every other possible kind of scheduling is getting old.
> That deadlock would be rather specific to Nvidia hardware as well.

How so?
> Not wishful thinking so much as the only plausible way to get more efficiency out of diverged code. That, and treating each lane independently. The original blogs were big on efficiency gains which, beyond avoiding a deadlock in a corner case, don't appear to exist from an architectural perspective.

Everything you need is already in the architecture, or at least should be: lane-shuffling instructions pretty much enable you to swap the full register sets between lanes such that they form packed warps in a thread group. It's only waiting for the matching intrinsic to trigger the repack.
> Everything you need is already in the architecture, or at least should be: lane-shuffling instructions pretty much enable you to swap the full register sets between lanes such that they form packed warps in a thread group.

You can shuffle within a warp/wave, not across warps/waves, so this doesn't help you at all with divergent threads in a warp. You'd have to repack over shared memory.
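To make the distinction concrete, a sketch (hypothetical code): __shfl_sync only moves values between lanes of the same warp, so regrouping threads by branch direction across the warps of a block has to be staged through shared (or global) memory, roughly like this:

```
// Hypothetical illustration of the intra-warp limit of shuffles.
// __shfl_sync exchanges a value between lanes of ONE warp only:
__device__ float rotate_within_warp(float v)
{
    // every active lane receives the value held by the next lane over
    return __shfl_sync(0xffffffff, v, (threadIdx.x + 1) % 32);
}

// Moving work between warps of a block, e.g. to group threads by the branch
// they took, has to be staged through shared memory instead. The "else" path
// is omitted for brevity; assumes blockDim.x <= 256.
__global__ void repack_by_branch(const float *in, float *out, int n)
{
    __shared__ float staged[256];
    __shared__ int   count;            // number of "if"-path elements found

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x == 0) count = 0;
    __syncthreads();

    // Phase 1: every thread that takes the "if" path publishes its element.
    if (i < n && in[i] > 0.0f) {
        int slot = atomicAdd(&count, 1);   // compact the "if" lanes
        staged[slot] = in[i];
    }
    __syncthreads();

    // Phase 2: the first `count` threads now do the "if" work in densely
    // packed warps (only the final partial warp may have masked lanes).
    if (threadIdx.x < count)
        out[blockIdx.x * blockDim.x + threadIdx.x] = staged[threadIdx.x] * 2.0f;
}
```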
Thanks for that!
> She works on an open source driver; this is why she does not like Nvidia.

Do you have a link? Or is this your opinion?