Nvidia Volta Speculation Thread

This kinda shows how much they're expecting in DL and HPC sales, though, to be pushing the limits like this.

Nvidia's datacenter business has simply exploded in the past year, so I figure they now have the volume to push these kinds of risky high-margin products.

[Image: Nvidia quarterly revenue trend, Q1 2017 - FY 2018]
 
Interestingly, though, it looks like anyone in the AI marketplace, even Google's TPU for training, is going to have a tough time against GV100.
 
Yep, one die. It's just insane. And that doesn't even get into the interposer (you can't get a traditional interposer large enough).

Is the interposer made of multiple exposed fields? The chip itself could straddle the boundary, or those regions could be stitched together with coarse enough interconnects. I recall that it is possible to get a single exposed field that is larger than a more standard stepper's reticle, just not cheaply.
 
Winner winner. GV100's interposer has been exposed twice in order to produce a large enough interconnect area.
 
880 mm2 is staggering for an ASIC, even bigger than Intel (cough cough...) Itaniums.

Like a colleague once said, "Tiles're us (tm)".

Pity they went for 960 GFLOPS; their marketing department must have missed their bonus.
 
Given that cost plays a role, if you design the chip accordingly, would it be possible to use two separate interposers (one for each left/right duo of HBM stacks) as well? Since the memories do not talk to each other, I would think this might be possible.
 
In theory I don't see a problem, assuming you leave a gap between them in the GPU's pins as well. But in this case, it's apparently still one big block, double exposed.
 
They increased the die size by 33.6% while impressively keeping to the same 300 W, and yet they went further:

The impressive part for me is more of an "oh shit, I can't believe they were bullish enough to do this" than an actual technical achievement.
Each wafer can only yield very few chips, and most probably the great majority of them come out with a defect (either good only for salvage SKUs with disabled units or scrapped outright). With a 300 mm wafer they're probably getting around 60-65 dies per wafer.
They're only making this because they got clients to pay >$15k per GPU, meaning a 2% yield (practically 1 good GPU per wafer) already provides some profit.
10% yields (6 good chips) means ~$90K in revenue per wafer, of which they're probably keeping well over $80K in profit after putting the cards together.
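As a sanity check on that dies-per-wafer guess, here's the usual gross-die napkin math (ignoring edge exclusion and defect density; the 815 mm2 die area is the reported figure, everything else is an assumption):

#include <math.h>
#include <stdio.h>

int main(void) {
    const double pi = 3.141592653589793;
    const double wafer_d = 300.0;   /* wafer diameter, mm */
    const double die_area = 815.0;  /* reported GV100 die area, mm^2 */
    /* standard approximation: gross dies ~ pi*(d/2)^2/A - pi*d/sqrt(2*A) */
    double gross = pi * (wafer_d / 2.0) * (wafer_d / 2.0) / die_area
                 - pi * wafer_d / sqrt(2.0 * die_area);
    printf("gross GV100-sized dies per 300 mm wafer: ~%.0f\n", gross); /* ~63 */
    return 0;
}

Which lands right in the 60-65 range before a single die is thrown away for defects.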


FP32 compute increases by 41.5%, or 2x (yeah, depends upon the function with Tensor).
FP64 compute increases by 41.5%.
FP16 compute increases by 41.5%, or 4x (yeah, depends upon the function with Tensor).
That's squeezing an extra ~41% more CUDA cores into a 33.6% die-size increase, and importantly with additional functions/units.
The FP32 and FP64 unit increase is almost a match for the increase in die area. Unlike Pascal P100, the FP32 units don't seem to do 2*FP16 operations anymore, as the Tensor cores do that instead.
So what they saved from smaller FP32 units and from the general die-area gain of the 12FF transition, they invested in the Tensor cores.


Is there a game rendering application for the Tensor units?
The Tensor cores are definitely unable to unpack the values at any position in the cubic matrices (otherwise they would just be regular FP16 ALUs). My guess is someone could just multiply 4*4 matrices using two 4*1 matrices with "valid" FP16 values, fill the 3rd dimension with 1s, and in the end just read the first row (EDIT: derp, forgot how to Algebra).
That said, this results in 30 TFLOPS (120/4) of regular FP16 FMAD operations.
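Assuming NVIDIA exposes the Tensor cores as some kind of warp-wide matrix multiply-accumulate intrinsic, usage might look something like this (the wmma names and the 16x16x16 fragment shape are assumptions for illustration, not anything NVIDIA has documented yet):

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp cooperatively computes D = A*B + C, with A/B in FP16 and
// accumulation in FP32, which is what the announced 120 TFLOPS refers to.
__global__ void tensor_mma_16x16x16(const half *A, const half *B, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);       // C = 0 for this sketch
    wmma::load_matrix_sync(a, A, 16);     // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);       // the Tensor core operation
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}

If the granularity really is a full matrix product, then using the units as plain FP16 FMA hardware does mean padding most of the operands, as described above.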

Other than being usable as dedicated FP16 units, I don't see any rendering application for the Tensor units. They could be used for AI inferencing in a game, though.

For gaming, they'd probably be better off going back to FP32 units capable of doing 2*FP16 operations.
Or, like what they did with consumer Pascal, just ignore FP16 altogether, promote all FP16 variables to FP32 and call it a day. That would be risky, because in the future developers could be using a lot of FP16 in rendering, but nvidia's consumer architectures aren't exactly known for being extremely future-proof.
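For reference, the two options look something like this in CUDA (the packed path assumes FP16-capable ALUs like GP100 or Tegra X1, i.e. sm_53+; nothing of the sort is confirmed for consumer Volta):

#include <cuda_fp16.h>

// (a) Packed path: one instruction performs two FP16 FMAs (GP100-style 2*FP16).
__global__ void scale_bias_half2(__half2 *x, __half2 scale, __half2 bias, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = __hfma2(x[i], scale, bias);   // requires sm_53 or newer
}

// (b) Promotion path: FP16 is storage only and all arithmetic is done in FP32,
//     which is effectively what consumer Pascal gives you.
__global__ void scale_bias_promoted(__half *x, float scale, float bias, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = __float2half(fmaf(__half2float(x[i]), scale, bias));
}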
 
In theory I don't see a problem, assuming you leave a gap between them in the GPU's pins as well. But in this case, it's apparently still one big block, double exposed.
Yeah, I was not trying to specifically debate GV100 here, but future product options. For consumer products, multiple exposures on extremely large interposers, as well as the interposers themselves, seem prohibitively expensive. With a modular approach and proper planning, you could use a one-size-fits-all interposer once HBM itself has become more mainstream.
 
Interesting stuff about Volta:
“It has a completely different instruction set than Pascal,” remarked Bryan Catanzaro, vice president, Applied Deep Learning Research at Nvidia. “It’s fundamentally extremely different. Volta is not Pascal with Tensor Core thrown onto it – it’s a completely different processor.”

Catanzaro, who returned to Nvidia from Baidu six months ago, emphasized how the architectural changes wrought greater flexibility and power efficiency.

“It’s worth noting that Volta has the biggest change to the GPU threading model basically since I can remember and I’ve been programming GPUs for a while,” he said. “With Volta we can actually have forward progress guarantees for threads inside the same warp even if they need to synchronize, which we have never been able to do before. This is going to enable a lot more interesting algorithms to be written using the GPU, so a lot of code that you just couldn’t write before because it potentially would hang the GPU based on that thread scheduling model is now possible. I’m pretty excited about that, especially for some sparser kinds of data analytics workloads there’s a lot of use cases where we want to be collaborating between threads in more complicated ways and Volta has a thread scheduler can accommodate that.

“It’s actually pretty remarkable to me that we were able to get more flexibility and better performance-per-watt. Because I was really concerned when I heard that they were going to change the Volta thread scheduler that it was going to give up performance-per-watt, because the reason that the old one wasn’t as flexible is you get a lot of energy efficiency by ganging up threads together and having the capability to let the threads be more independent then makes me worried that performance-per-watt is going to be worse, but actually it got better, so that’s pretty exciting.”

Added Alben: “This was done through a combination of process and architectural changes but primarily architecture. This was a very significant rewrite of the processor architecture. The Tensor Core part is obviously very [significant] but even if you look at FP32 and FP64, we’re talking about 50 percent more performance in the same power budget as where we’re at with Pascal. Every few years, we say, hey we discovered something really cool. We basically discovered a new architectural approach we could pursue that unlocks even more power efficiency than we had previously. The Volta SM is a really ambitious design; there’s a lot of different elements in there, obviously Tensor Core is one part, but the architectural power efficiency is a big part of this design.”
source https://www.hpcwire.com/2017/05/10/nvidias-mammoth-volta-gpu-aims-high-ai-hpc/
 
Hopefully we'll get more details on the SIMT improvements. Sounds like they can now also handle irreducible CFGs? Also, based on http://images.anandtech.com/doci/11360/ssp_445.jpg, it looks like Xavier has deep learning HW that is separate from the tensor cores.

Back in the 2009 Fermi thread, I remember off-handedly discussing a way for SIMT hardware to get past synchronization points that become split across the currently active and inactive sides of a diverged branch, by allowing the hardware to issue from each path round-robin. It may not be round-robin, but the architecture now seems flexible enough not to permanently block further progress on threads that might be holding an operation the active path needs in order to make progress.
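The textbook example is a fine-grained lock taken inside a warp. A minimal sketch of the sort of loop that can livelock under the old scheduling model, but should be safe if Volta really does guarantee per-thread forward progress:

__device__ int lock = 0;

__global__ void per_thread_critical_section(int *counter) {
    // Every thread in the warp contends for the same lock. Pre-Volta, the
    // CAS losers can keep getting issued while the winner sits on the
    // inactive side of the divergence, so the lock is never released and
    // the warp hangs. With per-thread forward progress the winner
    // eventually runs, releases the lock, and everyone gets through.
    bool done = false;
    while (!done) {
        if (atomicCAS(&lock, 0, 1) == 0) {
            (*counter)++;             // critical section
            __threadfence();
            atomicExch(&lock, 0);     // release
            done = true;
        }
    }
}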

Also contained in the Nvidia blog are mentions of a more streamlined ISA, and a shift to L0 instruction caches per SM. I think instruction buffers stopped being buffers in part because the instruction stream going into the SM would no longer be a FIFO sequence belonging to one active path, and a buffer wouldn't keep instructions around for when lanes happened to realign or hit the same code in a scenario like a common function or different iteration counts of the same loop.

Nvidia seems to be committing more strongly to keeping up the SIMT facade, in part by correcting a major reason why Nvidia's threads weren't really threads. It's a stronger departure from GCN or x86, which are more explicitly scalar+SIMD. There are some possible hints of a decoupling of the SIMT thread and hardware thread in some of AMD's concepts of lane or wave packing, but nothing as clearly planned as Nvidia's imminent product.
Perhaps it's time for debating how much of a thread their "thread" is again?
 

Yes, Anandtech reported it that way.

"NVIDIA is genuinely building the biggest GPU they can get away with: 21.1 billion transistors, at a massive 815mm2, built on TSMC’s still green 12nm “FFN” process (the ‘n’ stands for NVIDIA; it’s a customized higher perf version of 12nm for NVIDIA)."

http://www.anandtech.com/show/11367...v100-gpu-and-tesla-v100-accelerator-announced

In addition, I don't think TSMC has historically had an "N" at the end of a process nickname, so it doesn't follow any conventions.
 

The new ISA and refactored scheduler contradict some predictions for a highly iterative change from Pascal, at least for the HPC variant.
The flexibility might not immediately impact code that has already been structured to avoid the program-killing facets of existing SIMD hardware, or possibly games, since they probably favor higher coherence (due to APIs, optimizing for efficiency with current methods, multi-vendor considerations, avoiding the program-killing elements of the architecture, etc.).

However, some of the perf/watt gains to come can come from those algorithms that were walled off from GPU consideration because they were too dangerous for the more restricted architectures. Unfortunately, maintaining two different algorithm bases doesn't sound easier on top of vendor- or device-specific code, so unless other GPUs start doing this, the full upside may not be realized for some time. Some elements could be accelerated, like the impact of pixel-sync type stalls: a workgroup might be able to launch and get much of its work done despite one pixel's hitting a sync barrier, rather than a much more significant gap nearer the front end applying to dozens of other pixels--and it might even be that the hazard would have resolved itself by the time it mattered.

The Nvidia blog's discussion of other measures for creating groups of communicating threads may also be extended by this SIMT change into a more informal way of capturing parallelism dynamically, since the warps within those groups are also more flexible.
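If the warps inside those groups are that flexible, you could imagine helpers that let whichever threads happen to be active at a given point cooperate on the fly, along these lines (a hypothetical cooperative-groups style API with a coalesced_threads() primitive is assumed here, not something NVIDIA has documented for Volta at this point):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Aggregated atomic: the currently active threads elect a leader, the leader
// reserves slots for the whole group with a single atomicAdd, and the base
// offset is broadcast back so every thread gets its own slot.
__device__ int reserve_slot(int *tail) {
    cg::coalesced_group g = cg::coalesced_threads();
    int base = 0;
    if (g.thread_rank() == 0)
        base = atomicAdd(tail, g.size());
    base = g.shfl(base, 0);           // broadcast the leader's result
    return base + g.thread_rank();
}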

I did track down where I first thought about workarounds for SIMT's synchronization problem--back when Nvidia coined the term and much of the architectural arrangement was newer to me.
https://forum.beyond3d.com/posts/1363056/

The whole range of discussion in that portion of the thread and afterwards would be interesting to review in light of what is coming a cool 8 years after. Possibly, one of the bigger "this is dumb" elements of Nvidia's marketing may not be quite as dumb.
 
Yes, Anandtech reported it that way.

"NVIDIA is genuinely building the biggest GPU they can get away with: 21.1 billion transistors, at a massive 815mm2, built on TSMC’s still green 12nm “FFN” process (the ‘n’ stands for NVIDIA; it’s a customized higher perf version of 12nm for NVIDIA)."

http://www.anandtech.com/show/11367...v100-gpu-and-tesla-v100-accelerator-announced

In addition, I don't think TSMC has historically had an "N" at the end of a process nickname, so it doesn't follow any conventions.
More likely it's just all the 12 FFC risk production hogged by NVIDIA
 
Why 12nm and not 10nm for manufacturing Volta? Did every other TSMC customer go 10nm, and Nvidia wanted (paid for) 12nm for some reason?
 
More likely it's just all the 12 FFC risk production hogged by NVIDIA

I thought it could be a different name for 10FF risk production. 10FF is expected to reach high-volume production in two of their gigafabs in H2 2017. 10FF is actually planned to come before 12FFC.
However, transistor density in GV100 is very similar to the Pascal cards, so it doesn't sound like 10FF at all. TSMC claims a >50% area reduction in the 16FF+ -> 10FF transition, so a 21B-transistor chip wouldn't be so big.
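For what it's worth, the back-of-envelope check: 21.1B transistors / 815 mm2 ≈ 25.9 MTr/mm2 for GV100, versus 15.3B / 610 mm2 ≈ 25.1 MTr/mm2 for GP100, i.e. essentially unchanged density, which fits a tweaked 16FF+ far better than 10FF.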

TSMC's roadmap is pretty dense as it is; getting yet another completely different process doesn't sound productive for them.


Maybe it's just 16FF+ with a few tweaks for being able to make such a huge die, and nvidia asked TSMC to call it "12FFN" because it sounds better on paper.
 