Nvidia Volta Speculation Thread

Besides, it is easier to attack the messenger than the message ;) If I follow her, it is because she has good interactions with many people working in the game industry, she is a dev herself, and no other dev has called it bullshit... I verify before posting...

EDIT: Then again, it is partly guesswork, and even coming directly from her it may not be 100% correct in the details...
 
Why would any dev tell you it's bullshit? As I asked before: what's the news here? Or, if you would prefer: what looks bad for the NV architecture, and how is AMD better?

It is not new either; for example, when she speaks of AMD's advantage in async compute, that is not new, and it does not change the fact that in most games Nvidia cards perform better, and sometimes much better, than their AMD counterparts...
 
No one can become an Nvidia architecture expert in a few hours. All the more true when approached from a biased viewpoint ...
 
No one can become an Nvidia architecture expert in a few hours. All the more true when approached from a biased viewpoint ...

She is not a specialist, but a dev working in the game industry, and it is what she heard from a third party, if I understand her first tweet correctly...


yesterday i finally had the chance to get from the horse's mouth (i.e. not via mountains of extremely confused marketing) how nvidia's threading works
 
What started as some potentially interesting discussion points for a tech forum is devolving into biased opinion crap. Frustrating.

It's a horrible shame that the PC Forums community is unable to self-moderate and be better posters. The response from company defenders is exactly what is wrong with the entire PC Forums community: they're doing nothing but creating a toxic environment. Too much worthless and silly infighting. Set aside your differences and focus on better interactions with other posters. This stupid fighting between Nvidia/AMD/Intel needs to stop.
 
Why would an Nvidia dev respond? It's obvious she is biased and her background is primarily working with AMD.

Yeah, but is she wrong about what she heard/said? And saying "I prefer X over Y" doesn't make you anti-Y if Y does good things. Yeah, it seems she prefers how AMD's chips work. That doesn't mean she can't see good things on Nvidia's side when things are good...
 
Why would an Nvidia dev respond? It's obvious she is biased and her background is primarily working with AMD.

Why would anyone respond to or care about this post? It is obviously biased, and his/her background is primarily discrediting anyone who makes negative statements about nVidia ;)
 
She is not a specialist, but a dev working in the game industry, and it is what she heard from a third party, if I understand her first tweet correctly...

Are you sure she is a gamedev? My understanding is she is doing compiler stuff at Apple.
 
Women in Tech – An Overview of Deep Learning
Hear from speakers Renee Yao, Product Marketing Manager of Deep Learning and Analytics at NVIDIA, Kari Briski, Director of Deep Learning Software Product at NVIDIA, and Nazanin Zaker, Data Scientist at SAP, on different deep learning use cases, deep learning workflows, and the use of deep learning within the enterprise.


Volta Tensor Core accelerated training
The Training with Mixed-Precision User Guide introduces NVIDIA's latest architecture, Volta, and summarizes the ways a framework can be fine-tuned to gain additional speedups by leveraging Volta's architectural features.
http://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html
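As a rough sketch of what "leveraging the Volta architectural features" means at the CUDA level: the CUDA 9 WMMA intrinsics feed FP16 fragments to the tensor cores while accumulating in FP32, which is the core of the mixed-precision scheme the guide describes. The kernel below is illustrative only; the 16x16x16 tile shape, layouts, and names are my assumptions, not taken from the guide. Requires sm_70 (Volta).

// One warp computes a single 16x16x16 tile: D = A * B (FP16 inputs, FP32 accumulator).
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_16x16x16(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);            // start the FP32 accumulator at zero
    wmma::load_matrix_sync(a_frag, a, 16);          // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag); // FP16 multiply, FP32 accumulate on tensor cores
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}

// Launched with exactly one warp per tile, e.g. wmma_16x16x16<<<1, 32>>>(dA, dB, dD);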
 
Ok, I'll ask again: what exactly is bad about NV architecture in that posting?

It isn't bad, but it might not be great:

the great irony here is that one of the main selling points: ~~~machine learning~~~ with the ~~~tensor units~~~: gains nothing from all this. it's the dumbest straight-line code you can imagine

absolutely none of this is meaningful for gaming performance in any modern engine sorry
 
In regards to the deadlock, to my understanding, a branch would run ahead and create an infinite loop. On GCN, for example, the scheduler would be selecting waves based on some metric; an infinite loop should bias the selection rather quickly. So it may be slow, but it shouldn't lock: an implicit syncthreads(). Excluding a condition where nothing can advance, which really is a software issue. The scalar unit might be able to detect and resolve those issues at runtime as well, with the parallel INT pipeline on Volta possibly doing the same. The per-thread PC seems more of a software solution than a hardware implementation, which goes back to the whole grouping debate.

Some detail is missing. It's hard to imagine why even explicit lane repacking isn't a thing yet. The only thing I can think of is that each lane has a hard-coded offset into the stack or the register file, which makes it impossible to just transfer execution to another lane without also copying a significant share of the state.
As MDolenc stated, the issue is going beyond the warp size, with the indirection becoming increasingly problematic for hardware. Grouping requires a significantly large pool to pull from for efficiency: taking a threadgroup of 1024 and getting 32 threads doing the same thing. Communication between lanes would be slower, as results would need to be written out to registers. Simply put, indexing into a 1024-wide array becomes problematic, but not impossible. I wouldn't be surprised if that's the direction Volta's successor takes.
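For what it's worth, a software analogue of that grouping already exists inside a single warp: the CUDA 9 ballot/popcount primitives let lanes that took the same branch compute a compacted index. A hypothetical sketch (assumes one 32-thread warp per block; names are made up), which shows why scaling this to a 1024-wide pool is the hard part:

// Software "repacking" within one 32-lane warp: lanes taking the branch
// compact their results to the front of the output. Hardware repacking
// across a 1024-thread group would need the same rank computation, but
// over 32x as many lanes.
__global__ void compact_positive(const int *in, int *out) {
    int lane = threadIdx.x & 31;                        // lane id; assumes blockDim.x == 32
    int v = in[threadIdx.x];
    bool taken = (v > 0);                               // the divergent predicate
    unsigned mask = __ballot_sync(0xffffffffu, taken);  // bitmask of lanes taking the branch
    int rank = __popc(mask & ((1u << lane) - 1));       // my index among lower "taken" lanes
    if (taken)
        out[rank] = v;                                  // compacted, coalesced write
}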
 
In regards to the deadlock, to my understanding, a branch would run ahead and create an infinite loop. On GCN, for example, the scheduler would be selecting waves based on some metric; an infinite loop should bias the selection rather quickly. So it may be slow, but it shouldn't lock: an implicit syncthreads(). Excluding a condition where nothing can advance, which really is a software issue. The scalar unit might be able to detect and resolve those issues at runtime as well, with the parallel INT pipeline on Volta possibly doing the same. The per-thread PC seems more of a software solution than a hardware implementation, which goes back to the whole grouping debate.
The deadlock isn't about the infinite loop specifically, though the infinite loop example also demonstrates it. Say you have a divergent branch in a wave where one path enters an infinite loop waiting for _something_, while one lane in that same wave, on the other path of the branch, does some calculations and sets _something_. Sure, you can execute different waves and even different kernels on that same CU (nothing is stopping NV from doing this as well). But since none of the other waves will set _something_, it all really depends on what happens in that one divergent wave.
If the GPU executes the path that will eventually set _something_ first, it will finish that path and then run the same wave through the other branch with the infinite loop. Since _something_ was set, the infinite loop will break.
If the GPU executes the path that enters the infinite loop waiting for _something_ first, then that wave will be stuck until the GPU gets a reboot. Sure, the GPU might continue on other waves and other kernels in the meantime, but current GPUs can't do anything to unwind from this scenario.
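That scenario is easy to write down as a minimal illustrative CUDA kernel (assuming both lanes are in the same warp; names are made up):

// A divergent producer/consumer handshake inside ONE warp. Pre-Volta, the
// warp executes one branch path at a time to completion: if the spin path
// is picked first, lane 1 never runs and the wave hangs exactly as above.
__global__ void divergent_handshake(volatile int *flag) {
    if (threadIdx.x == 0) {
        while (*flag == 0) { }      // consumer: spin until _something_ is set
    } else if (threadIdx.x == 1) {
        *flag = 1;                  // producer: sets _something_
    }
}

Volta's per-thread program counters let the scheduler interleave the two paths of the branch, so lane 1 can eventually run and break lane 0's loop instead of the wave being stuck until a reboot.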
 