Nvidia Post-Volta (Ampere?) Rumor and Speculation Thread

I wonder if it's feasible for NVIDIA to have a "compute-only" big chip on 7nm (sold at high margins) and keep the gaming line-up on 12nm?

Volta was announced at GTC 2017 but barely had any availability before Q4 2017. It could be similar for a 7nm successor in theory - but I'm not sure if that's not still too early for 7nm (even with a significant part of the chip turned off for yields)?
 
Yeah yeah. Turing is for gaming, Ampere is for gaming! They both exist, neither exist! They're launching at 7 separate times all together!

We'll find out when we find out. It is weird that they're launching 2(?) architectures right after Volta, which has 1 die and 1 product. Then again, you can't train with the quoted TOPS for Volta, and deploying inferencing on GPUs that are ultra expensive and already out of stock seems a waste.

I also suppose this means TSMC's ability to produce large die 7nm products won't be around for quite a while. Launching a new gaming architecture, which Volta isn't, this year will be a boon for gamers if there's actual supply available. But how long will it take to get a 7nm Nvidia GPU? Launching yet another new line in a year or less seems unlikely, maybe not till late next year if then. "Delays" seems to be the recurrent watchword for any new silicon node.

You can train with the quoted TFLOPS on Volta. For example, large LSTM models do quite well.
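For what it's worth, a minimal sketch of where those TFLOPS actually get used in training: an FP16 GEMM with FP32 accumulation routed onto the Tensor cores via cuBLAS (CUDA 9). The matrix size and the managed-memory allocation are just placeholders for the example, and nothing is initialised or checked:

Code:
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main()
{
    const int n = 4096;                            // arbitrary size for the sketch
    half *A, *B; float *C;
    cudaMallocManaged(&A, n * n * sizeof(half));
    cudaMallocManaged(&B, n * n * sizeof(half));
    cudaMallocManaged(&C, n * n * sizeof(float));

    cublasHandle_t h;
    cublasCreate(&h);
    cublasSetMathMode(h, CUBLAS_TENSOR_OP_MATH);   // allow the Tensor core (HMMA) path

    const float alpha = 1.0f, beta = 0.0f;
    // FP16 inputs, FP32 accumulate - the mixed-precision mode the Tensor TFLOPS figure refers to.
    cublasGemmEx(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha, A, CUDA_R_16F, n,
                         B, CUDA_R_16F, n,
                 &beta,  C, CUDA_R_32F, n,
                 CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);

    cudaDeviceSynchronize();
    cublasDestroy(h);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}

Frameworks and cuDNN do essentially this under the hood for the LSTM case when Tensor op math is enabled.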
 
I wonder if it's feasible for NVIDIA to have a "compute-only" big chip on 7nm (sold at high margins) and keep the gaming line-up on 12nm?

Volta was announced at GTC 2017 but barely had any availability before Q4 2017. It could be similar for a 7nm successor in theory - but I'm not sure if that's not still too early for 7nm (even with a significant part of the chip turned off for yields)?
Volta, though, was the project-completion milestone after the Pascal P100 and was designed to go live on schedule for their massive HPC contract obligations, meaning Q3-Q4 2017; the Nvidia presentations I have seen regarding Volta all point to HPC/scientific use, none associated with Tesla or Quadro more generally.
The surprise IMO was that the Titan support card for the V100 used the same die/GPU when one would more likely expect a 'GV102'-type design; this suggests either that Nvidia could not delay this cheaper support card (look at why the Titan P was originally launched) to align with the next node size, or that there is a distinct break between Volta, with its full mixed-precision/Tensor acceleration, and the rest of the model range.

But like I mentioned earlier, and as hinted by some posts after yours, there is synergy between all 3 segments, those being Geforce/Quadro/Tesla; the gaming HW designers/engineers at Nvidia feed into the other engineering teams what they would like within the architecture at R&D time, and this results in the three segments being pretty tightly coupled in terms of architecture.
Worth remembering that Quadro also bridges features/design choices and solutions used in both gaming and Tesla, so breaking them apart becomes even more difficult and complex.
The costs and logistics of trying to do this would be high, and I am not sure it really makes sense considering how well Maxwell and especially Pascal have done in all segments. That said, maybe they will have the same architecture spanning 7nm and 12nm *shrug*, but I do not think it worked out well generally when they tried something similar (arch rather than node size) with Maxwell Gen 1 and Maxwell Gen 2 (the latter being the one that was successful).

Just for context, Titan P and Titan V are cheap when considered for their primary role within academic and scientific labs supporting directly or indirectly the P100 and V100.

Edit:
Just for clarification and to add: it could also be argued that some models in the range are distinct from each other due to the compute-capability version supported by each GPU; a simple case in point is that the P100 (SM_60) does not have the same CUDA compute version as the GP102 (SM_61), where one supports native FP16 acceleration while the other supports the DP4A and DP2A vector dot products.
The compute version will also be different for those with the Tensor cores.
Just mentioning it as a matter of perspective with regard to naming and design changes; it is too early to put any weight on how large the differences between the two rumoured names are, or on how the professional one relates to Volta (which IMO is distinct, serving the currently signed HPC/scientific full mixed-precision with Tensor projects).
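To illustrate that SM_60/SM_61 split in practice, a quick sketch (kernel names and shapes are just made up for the example; build with something like -arch=sm_61): the P100 class gets its FP16 throughput from the packed half2 intrinsics, while the GP102/GP104 class instead exposes the DP4A INT8 dot product.

Code:
#include <cuda_fp16.h>

// SM_60 (P100) style: packed half2 FMA, two FP16 ops per instruction.
__global__ void fp16_axpy(const __half2* x, __half2* y, __half2 a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = __hfma2(a, x[i], y[i]);        // y = a*x + y on both FP16 lanes
}

// SM_61 (GP102/GP104) style: DP4A, 4-way INT8 dot product with INT32 accumulate.
__global__ void int8_dot(const int* a, const int* b, int* acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        acc[i] = __dp4a(a[i], b[i], acc[i]);  // each int packs four signed 8-bit values
}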
 
I went back through some of the Nvidia presentations given late last year and early this one.
Of interest would be TensorRT 3, which is Nvidia's programmable inferencing accelerator/optimiser going forward; currently it shows a push for Tesla P4, Tesla V100, Drive PX2 and Jetson TX2. Xavier is mentioned further down but should be part of the slide.
So this may suggest the 2+ Tensor cores per SM will be in whatever replaces the P4 and P40, along with any other changes. Volta will probably remain distinct with its full mixed-precision capability, while the P40 replacement will be faster with regard to FP32 compute but potentially have less Tensor throughput due to SM/cache/BW differences; the replacement for the P4 would be the scale-out efficient GPU.
With 2 Tensor cores per SM minimum, I am not sure where that leaves the INT8 DP2A and DP4A vector products going forward, or whether Nvidia will drop them *shrug*; at some point I assume Nvidia would want the Tensor cores to support such possibilities as an inferencing option.

If they do 2+ Tensor cores per SM, it would also suggest that native FP16 acceleration will be coming down the chain as well, and possibly be part of Geforce, as it is more of a CUDA core feature, albeit one that also ties loosely into Tensor.
For Geforce/Quadro they could disable those Tensor cores on the top GPUs (the GP104/GP102 replacements for Geforce and Quadro) or do another die/GPU without them (it will come down to costs/manufacturing/logistics); anything lower would not have the Tensor cores anyway, looking at how Nvidia positions Tesla and its models.

The key point is that TensorRT 3 is designed to make the most of inferencing using Tensor cores, so looking at the positioning of Nvidia products with TensorRT, it kinda makes sense to see 2+ Tensor cores per SM in whatever replaces the existing GP102 and GP104 Tesla GPUs.
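For context on what the Tensor cores actually expose to CUDA today, here is a minimal single-tile sketch using the CUDA 9 WMMA API on Volta: one warp computing a 16x16x16 D = A*B + C with FP16 inputs and FP32 accumulation (layouts and leading dimensions are just picked for the example; launch as wmma_tile<<<1, 32>>>).

Code:
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile: D = A*B + C (FP16 in, FP32 accumulate).
__global__ void wmma_tile(const half* a, const half* b, float* d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);            // C = 0
    wmma::load_matrix_sync(fa, a, 16);        // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);           // issues the Tensor core HMMA ops
    wmma::store_matrix_sync(d, fc, 16, wmma::mem_col_major);
}

TensorRT obviously sits well above this level, but these HMMA tiles are the primitive its inference GEMMs/convolutions ultimately map onto when Tensor cores are present.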
 
Why is Volta not a new gaming architecture, and what would it take for something to be a new gaming architecture?

A noticeable improvement in perf per watt, or something, for gaming? We know the Titan V doesn't make a huge difference over a Titan Xp despite being nearly twice the die size at the same TDP. That's not a market proposition for gaming. It does, however, do much better in perf per watt for pure compute-heavy tasks, which is great if you want that. It's the same reason it's possible there are 2 new architectures coming from Nvidia instead of one, one for gaming and one for compute.
 
You can't compare Titan Xp and Titan V for gaming, from a die size perspective, just because they have the same marketing family.
 
A noticeable improvement in perf per watt, or something, for gaming? We know the Titan V doesn't make a huge difference over a Titan Xp despite being nearly twice the die size at the same TDP. That's not a market proposition for gaming. It does, however, do much better in perf per watt for pure compute-heavy tasks, which is great if you want that. It's the same reason it's possible there are 2 new architectures coming from Nvidia instead of one, one for gaming and one for compute.
Just to add further perspective that highlights the GPU scope-purpose.
It only has 33% more FP32 CUDA cores than the Titan Xp though, but yeah, even then it is not scaling proportionally in games, for many reasons.
 
FWIW, in UHD resolution we saw +0% in GTA 5, +16% in Wolf2, +19% in Witcher3 and +68% (!) in Quantum Break:
http://www.pcgameshardware.de/Kompl...ntel-Core-i9-7900X-Nvidia-Titan-V-1250265/#a4

Not in the test, but the scene demo 2nd stage boss, which is basically pure compute, runs twice as fast on a Titan V as on an OC'ed GTX 1080 Ti (2,000e-6,000m) and also slightly (~5%) faster than the former dominator in this test, the RX Vega 64 LCE. It was a showcase for Vega all along.
 
The Quantum Break results show just how bad that game was in terms of optimisation for Nvidia hardware, and I cannot help but think of how many times AMD complained about games optimised for Nvidia (which seems rather cynical to me, as this was one of the games AMD promoted for its performance and DX12 design). A very interesting result, especially as it is the DX11 version.

Regarding good performance scaling, Hellblade is another title that scales pretty well on the Titan V, more so at 1440p; PCPer measured a 38% gain over the Titan Xp at 1440p and 29% at 4K.
https://www.pcper.com/reviews/Graphics-Cards/NVIDIA-TITAN-V-Review-Part-1-Gaming/Hellblade
Their Witcher 3 results pretty much align with PCGamesHardware's.
Good performance scaling on the Titan V in games seems to be rare, though.
 
Interesting, our Hellblade scene indicates rather mediocre scaling between TV and TXp, but we're using a lower-fps scene anyway (~102 fps in WQHD for TV, 83 for TXp).

Apart from outlier Quantum Break, Elex and Sniper Elite 4 are able to utilize the TV's resources better than average, and of course our application test in Capture One (raw converter), where the TV eclipses the other cards by a very healthy margin as well, despite being limited to its OpenCL boost of only 1,335 MHz.
 
The Quantum Break results show just how bad that game was in terms of optimisation for Nvidia hardware, and I cannot help but think of how many times AMD complained about games optimised for Nvidia (which seems rather cynical to me, as this was one of the games AMD promoted for its performance and DX12 design). A very interesting result, especially as it is the DX11 version.
So because AMD isn't happy about x^n games optimized for NVIDIA (including dirty tricks), one game optimized for their design (having been developed for consoles using their hardware) means they should never have complained? :rolleyes:

edit: http://www.pcgameshardware.de/Gefor...178/Tests/Benchmark-Review-Release-1242393/2/
Looking at the results at a tad lower resolution, Quantum Break DX11 was never badly optimized for NVIDIA to begin with; the results are pretty well in line with the usual performance difference between the cards. Volta just has something the game really, really likes - much more than it likes AMD hardware.
For DX12 there are few benchmarks to draw conclusions from, but AMD has traditionally been relatively stronger in DX12 than in DX11.
 

Not sure myself, as AMD did not have great results in DX11 back before they started doing DX11 driver improvements and were focusing on DX12 with devs.
Case in point: look at all those early DX12 games and then compare their DX11 performance on AMD cards.
Another factor is just how large the gains are for Volta over Pascal and Maxwell in this very specific title, which has been bad for Nvidia in general; other games that are heavily optimised and designed from scratch for both vendors' hardware show the gains one expects, which is under 33%. A case in point is Wolfenstein 2, which still has a fair amount of emphasis on AMD but is also designed for Nvidia hardware.

Some tests using Presentmon showed the Fury X being 20-25% faster than the 980ti in Quantum Break DX12.
I appreciate this comes down to scene tested and importantly the volumetric lighting and Global Illumination (which is what killed Nvidia performance).
 
Some tests using Presentmon showed the Fury X being 20-25% faster than the 980ti in Quantum Break DX12.
But the reverse happened in DX11:


[Chart: Quantum Break DX11 vs DX12 (ComputerBase)]

[Chart: Quantum Break DX11 (GameGPU)]

https://www.computerbase.de/2016-09/quantum-break-steam-benchmark/3/

Not really; AMD did not have great results because it was DX11, back before they started doing DX11 driver improvements.
It could also be argued that NVIDIA suffered from the DX12 implementation of the game.
 
But the reverse happened in DX11:


[Chart: Quantum Break DX11 vs DX12 (ComputerBase)]

[Chart: Quantum Break DX11 (GameGPU)]

https://www.computerbase.de/2016-09/quantum-break-steam-benchmark/3/


It could also be argued that NVIDIA suffered from the DX12 implementation of the game.
Yeah, Nvidia suffered with most of the early DX12 games as they were more aligned with AMD; some of the more recent ones show Nvidia can actually perform OK in DX12.
Hitman is another classic example that took ages (well after launch) to work well on Nvidia even in DX11, ignoring DX12 entirely, and even then it can be chapter-dependent; but at least its DX11 performance did improve very late in the day.
Regarding this game:
The 1st chart also has very different results to PCGamesHardware's DX11 result; ComputerBase measure 31% in favour of the 980 Ti, while in PCGamesHardware's test the Fury X was 2% faster in DX11, both at 1440p.
Not a fan of GameGPU myself.
It comes back to settings and scene, as Nvidia was hammered by the volumetric lighting and global illumination.
 
A few quick points:
  1. According to my tests, Titan V has less geometry performance: 21 geometry engines vs 28 geometry engines on Titan X (1 engine shared per 4 SMs vs 1 engine per SM).
  2. I've done some low-level analysis of the Volta tensor cores (including instruction set), and my personal opinion is it's a somewhat rushed and suboptimal design. I think there's significant power efficiency gains to be had by rearchitecting it in the next-gen.
  3. There's some interest in FP8 (1s-5e-2m) even for training in the research community, and Scott Gray hinted it would be available on future GPUs/AI HW in the next year or two; so it's likely NVIDIA will support it, not least because Jen-Hsun likes big marketing numbers... (a rough encoding sketch follows after this list)
  4. groq (a startup by ex-Google TPU HW engineers) looks like they might be using 7nm for a chip sampling before the end of the year, so there might be some pressure for NVIDIA to follow sooner rather than later. Remember that 7nm might be (much?) less cost efficient initially, but should still be more power efficient.
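Purely to illustrate the 1s-5e-2m layout mentioned in point 3, a naive host-side float32 truncation; rounding modes, denormals and NaN payloads are ignored, and the real HW format (if it happens) will obviously differ:

Code:
#include <cstdint>
#include <cstring>

// Naive float32 -> FP8 (1 sign, 5 exponent, 2 mantissa bits) by truncation.
// Exponent bias goes from 127 (FP32) to 15 (5-bit exponent); out-of-range values saturate.
uint8_t to_fp8_152(float f)
{
    uint32_t u; std::memcpy(&u, &f, sizeof u);
    uint8_t sign = (u >> 31) & 0x1;
    int     exp  = (int)((u >> 23) & 0xFF) - 127 + 15;  // re-bias the exponent
    uint8_t man  = (u >> 21) & 0x3;                      // keep the top 2 mantissa bits
    if (exp <= 0)       { exp = 0;  man = 0; }           // flush small values to zero
    else if (exp >= 31) { exp = 31; man = 0; }           // overflow -> infinity
    return (uint8_t)((sign << 7) | ((exp & 0x1F) << 2) | man);
}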
My personal guess is that we'll have 2 new NVIDIA architectures in 2018, both derived from Volta (e.g. still dual-issue with 32-wide warps and 32-wide register file but 16-wide ALUs) with incremental changes for their target markets:
  1. Gaming architecture on 12nm (also for Quadro). Might include 2xFP16 and/or DP4A-like instructions for inferencing, but no Tensor Cores or FP64. Availability within 3-6 months.
  2. HPC/AI architecture on 7nm with availability for lead customers in Q4 2018, finally removing the rasterisers/geometry engines/etc... (not usable for GeForce/Quadro).
I'm going to write up what I've found out about the V100 Tensor Cores in the next week or so and hopefully publish it soon - probably just as a blog post on medium, not sure yet... (haven't written anything publicly in ages and sadly the Beyond3D frontpage doesn't have much traffic these days ;) other suggestions welcome though!)
 
Well, the Titan V has 7 TPCs per GPC to Pascal's 5 per GPC; one difference to other models is that the SM structure of the V100, like the P100, has 2 SMs per TPC, meaning the TPC is shared. It comes down to the V100 and P100 having 64 FP32 CUDA cores per SM while the rest of Maxwell/Pascal have 128 FP32 CUDA cores per SM, which causes this divergence.
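If anyone wants to sanity-check that on their own card, the CUDA runtime reports the SM count and compute capability (it does not report cores per SM - 64 on the GP100/GV100-class SMs, 128 on the CC 6.1 consumer Pascal SMs - so that mapping you keep by hand); a trivial query sketch:

Code:
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // multiProcessorCount = number of SMs; major.minor = compute capability.
    printf("%s: %d SMs, compute capability %d.%d\n",
           prop.name, prop.multiProcessorCount, prop.major, prop.minor);
    return 0;
}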
I am pretty sure Nvidia very briefly touched upon the pros/cons of this in the past, and it may have implications for geometry and the SM-TPC-PolyMorph engine arrangement.
Going back to the Fermi whitepaper where they talk about the Polymorph engine and SM.
To facilitate high triangle rates, we designed a scalable geometry engine called the PolyMorph Engine. Each of the 16 PolyMorph engines has its own dedicated vertex fetch unit and tessellator, greatly expanding geometry performance. In conjunction, we also designed four parallel Raster Engines, allowing up to four triangles to be setup per clock. Together, they enable breakthrough triangle fetch, tessellation, and rasterization performance.

The PolyMorph Engine
The PolyMorph Engine has five stages: Vertex Fetch, Tessellation, Viewport Transform, Attribute Setup, and Stream Output. Results calculated in each stage are passed to an SM. The SM executes the game's shader, returning the results to the next stage in the PolyMorph Engine. After all stages are complete, the results are forwarded to the Raster Engines.

The first stage begins by fetching vertices from a global vertex buffer. Fetched vertices are sent to the SM for vertex shading and hull shading. In these two stages vertices are transformed from object space to world space, and parameters required for tessellation (such as tessellation factor) are calculated. The tessellation factors (or LODs) are sent to the Tessellator.

In the second stage, the PolyMorph Engine reads the tessellation factors. The Tessellator dices the patch (a smooth surface defined by a mesh of control points) and outputs a mesh of vertices. The mesh is defined by patch (u,v) values, and how they are connected to form a mesh.

The new vertices are sent to the SM where the Domain Shader and Geometry Shader are executed. The Domain Shader calculates the final position of each vertex based on input from the Hull Shader and Tessellator. At this stage, a displacement map is usually applied to add detailed features to the patch. The Geometry Shader conducts any post processing, adding and removing vertices and primitives where needed.
The results are sent back to the PolyMorph Engine for the final pass.

In the third stage, the PolyMorph Engine performs viewport transformation and perspective correction. Attribute setup follows, transforming post-viewport vertex attributes into plane equations for efficient shader evaluation. Finally, vertices are optionally “streamed out” to memory making them available for additional processing. On prior architectures, fixed function operations were performed with a single pipeline. On GF100, both fixed function and programmable operations are parallelized, resulting in vastly improved performance.

Raster Engine
After primitives are processed by the PolyMorph Engine, they are sent to the Raster Engines. To achieve high triangle throughput, GF100 uses four Raster Engines in parallel.

Recap of the GPC Architecture
The GPC architecture is a significant breakthrough for the geometry pipeline. Tessellation requires new levels of triangle and rasterization performance. The PolyMorph Engine dramatically increases triangle, tessellation, and Stream Out performance. Four parallel Raster Engines provide sustained throughput in triangle setup and rasterization. By having a dedicated tessellator for each SM, and a Raster Engine for each GPC, GF100 delivers up to 8× the geometry performance of GT200.
Quite a lot still seems applicable to current GPC-SM-TPC (with Polymorph engine) design.
 
Yep, the GPC-SM-TPC design still seems applicable, but one twist is that it looks from my benchmarks that there's 21 geometry engines on V100, which would be 1 per 4 SM... while there's only 80 SMs out of 84 enabled, so all the geometry engines are active despite not all the SMs being active. And more confusingly, there's supposedly 6 GPCs according to NVIDIA diagrams (I haven't tested this) which means 14 SMs per GPC... but 14 isn't divisible by 4! So either the association between SMs and GPCs is orthogonal to the association between SMs and geometry engines, or possibly some SMs just don't have access to geometry engines - i.e. it has become impossible to have warps running on all SMs purely from vertex shaders (with no pixel or compute shaders). I don't know which is true (assuming my test is correct) and it doesn't really matter for me to spend any more time on the question, but it's still intriguing.

Sadly, I don't have any good low-level tessellation tests to run - the tests I'm using for geometry are actually the ones I wrote back in 2006 to test G80(!) and they're still better than anything publicly available that I know of - a bit sad, but oh well... :)

EDIT: For anyone who's interested, the results of my old geometry microbenchmark on the Titan V @ 1200MHz fixed core clock...

Code:
CULL: EVERYTHING
Triangle Setup 3V/Tri CST: 2782.608643 triangles/s
Triangle Setup 2V/Tri AVG: 4413.792969 triangles/s
Triangle Setup 2V/Tri CST: 4129.032227 triangles/s
Triangle Setup 1V/Tri CST: 8347.826172 triangles/s

CULL: NOTHING
Triangle Setup 3V/Tri CST: 2445.859863 triangles/s
Triangle Setup 2V/Tri AVG: 2445.859863 triangles/s
Triangle Setup 2V/Tri CST: 2445.859863 triangles/s
Triangle Setup 1V/Tri CST: 2461.538574 triangles/s

i.e. ~21 indices/clock, ~7 unique vertices/clock and ~2 visible triangles/clock (not completely sure why it's very slightly above 2/clk; could be a bug with my test). NVIDIA's geometry engines have had a rate of 1 index/clock for a very long time, which is what implies there's 21 of them in V100 (same test shows 28 indices/clock for P102).
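(Assuming those figures are Mtriangles/s at the fixed 1200 MHz clock, the per-clock numbers fall out directly: 8347.8 / 1200 ≈ 6.96 triangles/clk, i.e. roughly 21 indices/clk if each triangle still fetches 3 indices; 2782.6 / 1200 ≈ 2.32 triangles/clk × 3 unique vertices ≈ 7 unique vertices/clk; and 2445.9 / 1200 ≈ 2.04 visible triangles/clk.)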

I think it's likely that any future gaming-centric architecture will have a higher geometry-per-ALU throughput than V100; this is maybe a trade-off to save area on a mostly compute/AI-centric chip? Or maybe NVIDIA decided the geometry throughput was just too high, which is possible although it feels a bit low now relative to everything else...
 
I wonder if it's feasible for NVIDIA to have a "compute-only" big chip on 7nm (sold at high margins) and keep the gaming line-up on 12nm?

Volta was announced at GTC 2017 but barely had any availability before Q4 2017. It could be similar for a 7nm successor in theory - but I'm not sure if that's not still too early for 7nm (even with a significant part of the chip turned off for yields)?

Not initially. New nodes are getting worse and worse in terms of transition time and initial reliability, e.g. cost per transistor isn't going down nearly as much as it used to. I wouldn't expect a huge die on 7nm for a while. Supposedly AMD is trying MCM stuff for its 7nm parts, but who knows if they're still going for that after Raja left. And AFAIK Nvidia isn't anywhere close to that. Maybe that's why they went with 12nm? Volta proved they aren't thermally limited if they can build a colossal die. It could be that Nvidia is more comfortable putting out large dies on 12nm with a nice gain in performance per watt, as compared to being limited by die size on 7nm, which initially has huge benefits for power/thermals anyway (e.g. what AMD is limited by).
 