Nvidia Post-Volta (Ampere?) Rumor and Speculation Thread

Yep, the GPC-SM-TPC design still seems applicable, but one twist is that it looks from my benchmarks that there are 21 geometry engines on V100, which would be 1 per 4 SMs... while only 80 of the 84 SMs are enabled, so all the geometry engines are active despite not all the SMs being active. And more confusingly, there are supposedly 6 GPCs according to NVIDIA diagrams (I haven't tested this), which means 14 SMs per GPC... but 14 isn't divisible by 4! So either the association between SMs and GPCs is orthogonal to the association between SMs and geometry engines, or possibly some SMs just don't have access to geometry engines - i.e. it has become impossible to have warps running on all SMs purely from vertex shaders (with no pixel or compute shaders). I don't know which is true (assuming my test is correct) and it doesn't really matter enough for me to spend any more time on the question, but it's still intriguing.
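
Just to spell out the mismatch (a trivial sketch of the arithmetic, using the SM/GPC counts above and the engine count inferred from my measurements):

Code:
#include <cstdio>

int main() {
    const int sms_physical = 84;   // full GV100 die
    const int gpcs         = 6;    // per NVIDIA diagrams
    const int geom_engines = 21;   // inferred from ~21 indices/clock measured

    printf("SMs per geometry engine: %d\n", sms_physical / geom_engines); // 4
    printf("SMs per GPC:             %d\n", sms_physical / gpcs);         // 14
    printf("14 SMs/GPC modulo 4 SMs/engine: %d\n",                        // 2, doesn't divide evenly
           (sms_physical / gpcs) % (sms_physical / geom_engines));
    return 0;
}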

Sadly, I don't have any good low-level tessellation tests to run - the tests I'm using for geometry are actually the ones I wrote back in 2006 to test G80(!) and they're still better than anything publicly available that I know of - a bit sad, but oh well... :)

EDIT: For anyone who's interested, the results of my old geometry microbenchmark on the Titan V @ 1200MHz fixed core clock...

Code:
CULL: EVERYTHING
Triangle Setup 3V/Tri CST: 2782.608643 triangles/s
Triangle Setup 2V/Tri AVG: 4413.792969 triangles/s
Triangle Setup 2V/Tri CST: 4129.032227 triangles/s
Triangle Setup 1V/Tri CST: 8347.826172 triangles/s

CULL: NOTHING
Triangle Setup 3V/Tri CST: 2445.859863 triangles/s
Triangle Setup 2V/Tri AVG: 2445.859863 triangles/s
Triangle Setup 2V/Tri CST: 2445.859863 triangles/s
Triangle Setup 1V/Tri CST: 2461.538574 triangles/s

i.e. ~21 indices/clock, ~7 unique vertices/clock and ~2 visible triangles/clock (not completely sure why it's very slightly above 2/clk; could be a bug with my test). NVIDIA's geometry engines have had a rate of 1 index/clock for a very long time, which is what implies there are 21 of them in V100 (same test shows 28 indices/clock for GP102).
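
To be explicit about how I'm reading those numbers (a minimal sketch, assuming the tool reports millions of triangles per second and the 1200MHz fixed clock above):

Code:
#include <cstdio>

int main() {
    const double clock_mhz = 1200.0;          // fixed core clock used for the run above
    // "CULL: EVERYTHING", 1 unique vertex per triangle -> index/vertex fetch limited
    const double mtris_1v_cull = 8347.826172;
    // "CULL: NOTHING" -> limited by visible triangles through setup/raster
    const double mtris_visible = 2445.859863;

    const double tris_per_clk    = mtris_1v_cull / clock_mhz; // ~6.96, i.e. ~7 unique vertices/clock
    const double indices_per_clk = tris_per_clk * 3.0;        // ~20.9, i.e. ~21 indices/clock
    const double visible_per_clk = mtris_visible / clock_mhz; // ~2.04 visible triangles/clock

    printf("indices/clk %.1f, unique vertices/clk %.1f, visible tris/clk %.2f\n",
           indices_per_clk, tris_per_clk, visible_per_clk);
    return 0;
}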

I think it's likely that any future gaming-centric architecture will have a higher geometry-per-ALU throughput than V100; this is maybe a trade-off to save area on a mostly compute/AI-centric chip? Or maybe NVIDIA decided the geometry throughput was just too high, which is possible although it feels a bit low now relative to everything else...
That's interesting.
There are definitely 6 GPCs, as that is the maximum for the Maxwell/Pascal/Volta designs; the architecture has been scaling via SM-TPC/PolyMorph per GPC.
What can be a bit vague is where they disable the SMs.
Because of the shared TPC/PolyMorph, one could consider it as 7 SMs per GPC, since the TPC/PolyMorph and the SM are no longer in a 1:1 relationship; the context here is specifically geometry rather than compute, and yeah, I appreciate it is not fully accurate, but it helps keep the sharing aspect in perspective when comparing to the other Nvidia GPUs. Even accounting for this, your tool shows the V100 still not performing ideally and below expectation.
Outside of P100 and V100, all SMs are meant to have a 1:1 relationship with the associated geometry engines, as per Fermi, which is what you see in your GP102 result.
I think those GV100 results come back to the pros/cons that Nvidia very briefly touched upon at one point when asked about using that SM-GPC setup for gaming; I'm trying to find whether it was ever noted publicly.
Do you know anyone you could share your code with who has access to a P100 (maybe as a Quadro GP100) and could run it to see if the behaviour aligns with the V100?

Really nice tool there, especially as it is identifying quirks with the V100 design.
 
A noticeable improvement per watt, or something, for gaming? We know the Titan V doesn't make a huge difference over a Titan Xp despite being nearly twice the die size at the same TDP. That's not a market proposition for gaming. It does, however, do much better in perf/watt for pure compute-heavy tasks, which is great if you want that. It's the same reason it's possible there are 2 new architectures coming from Nvidia instead of one: one for gaming and one for compute.

This is the most sensible way of looking at it: where space and efficiency are at a premium, why put things in a card that won't be useful for gaming?
 
A few quick points:
  1. According to my tests, Titan V has lower geometry performance: 21 geometry engines vs 28 geometry engines on Titan X (1 engine shared per 4 SMs vs 1 engine per SM).
  2. I've done some low-level analysis of the Volta tensor cores (including the instruction set), and my personal opinion is that it's a somewhat rushed and suboptimal design. I think there are significant power efficiency gains to be had by rearchitecting it in the next gen (a minimal sketch of how they're exposed today follows this list).
  3. There's some interest in FP8 (1s-5e-2m) even for training in the research community, and Scott Gray hinted it would be available on future GPUs/AI HW in the next year or two; so it's likely NVIDIA will support it, not least because Jen-Hsun likes big marketing numbers...
  4. Groq (a startup by ex-Google TPU HW engineers) looks like it might be using 7nm for a chip sampling before the end of the year, so there might be some pressure for NVIDIA to follow sooner rather than later. Remember that 7nm might be (much?) less cost-efficient initially, but should still be more power-efficient.
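
Side note for anyone who hasn't looked at them: the public face of the Volta tensor cores is the wmma API introduced in CUDA 9 for sm_70, where a single warp cooperatively computes a 16x16x16 FP16 multiply with FP32 accumulation (the compiler lowers this to the HMMA machine instructions). A minimal sketch of one tile - not the microbenchmark referred to in point 2:

Code:
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Launch with a single warp (<<<1, 32>>>); a, b are 16x16 half matrices, c is 16x16 float.
__global__ void wmma_tile(const half* a, const half* b, float* c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);      // leading dimension of 16 elements
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc, a_frag, b_frag, acc);   // D = A*B + C on the tensor cores
    wmma::store_matrix_sync(c, acc, 16, wmma::mem_row_major);
}
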
My personal guess is that we'll have 2 new NVIDIA architectures in 2018, both derived from Volta (e.g. still dual-issue with 32-wide warps and 32-wide register file but 16-wide ALUs) with incremental changes for their target markets:
  1. Gaming architecture on 12nm (also for Quadro). Might include 2xFP16 and/or DP4A-like instructions for inferencing, but no Tensor Cores or FP64. Availability within 3-6 months.
  2. HPC/AI architecture on 7nm with availability for lead customers in Q4 2018, finally removing the rasterisers/geometry engines/etc... (not usable for GeForce/Quadro).
I'm going to write up what I've found out about the V100 Tensor Cores in the next week or so and hopefully publish it soon - probably just as a blog post on medium, not sure yet... (haven't written anything publicly in ages and sadly the Beyond3D frontpage doesn't have much traffic these days ;) other suggestions welcome though!)
Thanks for the good info and speculation. I am wondering about some things (maybe they'll be answered in your blog post…):
  1. How large are the potential "significant power efficiency gains" for the Tensor Cores?
  2. If there is FP8 support, would the FP8 be twice the rate of FP16? (Also would this be for the Tensor Cores?)
  3. The Google TPU2 has a significantly lower FLOPS/byte ratio than the V100 (75 FLOPS/byte vs. ~130 FLOPS/byte; rough arithmetic sketched after this list). Do you expect the HPC/AI chip to also have a lower FLOPS/byte than the V100? I've been wondering if NVIDIA may go with > 4 HBM2 stacks or some kind of multi-GPU solution to increase bandwidth, since 4 stacks with even the new 2.4 Gbps HBM2 results in a maximum of 1.2 TB/s. I'm assuming that the HPC/AI chip has a minimum of 2x the FLOPS/W of the V100 (I estimate this lower bound mainly from the process), and not only is 1.2 TB/s "only" 37% more than the V100's 900 GB/s, but the V100 is already bandwidth-limited (or close) according to this post.
  4. For the HPC/AI architecture, do you envision a single chip with both fast DP and lots of Tensor Cores, or one DP-focused chip and another Tensor-focused chip?
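
Rough numbers behind question 3, as a quick sketch (HBM2 has a 1024-bit interface per stack; the TPU2 and V100 peak figures are the commonly quoted ~45 TFLOPS @ 600 GB/s per chip and ~120 tensor TFLOPS @ 900 GB/s, so treat them as approximations):

Code:
#include <cstdio>

int main() {
    // HBM2 bandwidth: per-pin rate (Gbps) * 1024 pins per stack * stacks / 8 bits per byte
    const double gbps_per_pin = 2.4, pins_per_stack = 1024, stacks = 4;
    const double bw_gb_s = gbps_per_pin * pins_per_stack * stacks / 8.0;   // ~1228.8 GB/s

    // Arithmetic intensity the memory system can feed: peak FLOPS / peak bytes per second
    const double v100_tflops = 120.0, v100_tb_s = 0.9;   // V100 tensor peak, 4-stack HBM2
    const double tpu2_tflops = 45.0,  tpu2_tb_s = 0.6;   // per-chip TPU2 (approximate)

    printf("4 stacks of 2.4 Gbps HBM2: %.1f GB/s\n", bw_gb_s);
    printf("V100: %.0f FLOPS/byte, TPU2: %.0f FLOPS/byte\n",
           v100_tflops / v100_tb_s, tpu2_tflops / tpu2_tb_s);
    return 0;
}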
 
...
I've been wondering if NVIDIA may go with > 4 HBM2 stacks or some kind of multi-GPU solution to increase bandwidth, since 4 stacks with even the new 2.4 Gbps HBM2 results in a maximum of 1.2 TB/s.

Since stacking HBM2 is so complex, for a GA102-level card, what factors would stop NV hooking up 18 Gbps GDDR6 to a 512 bit controller? Given the imbalance between bandwidth and compute, plus power savings since their last 512 bit controller (GTX 285?), imagine a new 512-bit GTX getting > 1TB/s before the HPC cards. :oops:
 
There's no necessity right now and it would eat into their margins. That's what would stop them from my point of view.
 
I’m not sure the next Ti card would need a 512-bit bus. GP102 on a 384-bit bus only uses 11 Gbps GDDR5X memory.

Keeping a 384-bit bus and utilizing GDDR6, the bandwidth could be increased by over 50%. GDDR6 tops out at 18 Gbps speeds.
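
For the raw numbers (a quick sketch; GB/s = bus width in bytes x per-pin data rate in Gbps):

Code:
#include <cstdio>

// Peak bandwidth in GB/s from bus width (bits) and per-pin data rate (Gbps)
double gddr_bw(int bus_bits, double gbps) { return bus_bits / 8.0 * gbps; }

int main() {
    printf("384-bit @ 11 Gbps GDDR5X: %4.0f GB/s\n", gddr_bw(384, 11.0)); // ~528 (GP102 today)
    printf("384-bit @ 18 Gbps GDDR6:  %4.0f GB/s\n", gddr_bw(384, 18.0)); // ~864, >50% more
    printf("512-bit @ 18 Gbps GDDR6:  %4.0f GB/s\n", gddr_bw(512, 18.0)); // ~1152, past 1 TB/s
    return 0;
}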
 
Since stacking HBM2 is so complex, for a GA102-level card, what factors would stop NV hooking up 18 Gbps GDDR6 to a 512 bit controller? Given the imbalance between bandwidth and compute, plus power savings since their last 512 bit controller (GTX 285?), imagine a new 512-bit GTX getting > 1TB/s before the HPC cards. :oops:

I doubt we will ever see Nvidia going higher than a 384-bit controller with GDDR6. It could be argued the Tesla P40 ideally should have been a 512-bit controller, as it was promoted as the card for maximum-inference-throughput servers (pre-Volta) and other FP32 HPC requirements, but it had limited bandwidth due to 384-bit GDDR5 (not GDDR5X like GeForce, though even that could be deemed too little).
 
Yep, the GPC-SM-TPC design still seems applicable, but one twist is that it looks from my benchmarks that there are 21 geometry engines on V100, which would be 1 per 4 SMs... [...]

EDIT: For anyone who's interested, the results of my old geometry microbenchmark on the Titan V @ 1200MHz fixed core clock... [...]

i.e. ~21 indices/clock, ~7 unique vertices/clock and ~2 visible triangles/clock (not completely sure why it's very slightly above 2/clk; could be a bug with my test). NVIDIA's geometry engines have had a rate of 1 index/clock for a very long time, which is what implies there are 21 of them in V100 (same test shows 28 indices/clock for GP102).

I wouldn't bet that the 21 is the physical number of geometry engines, since the Titan V is a cut-down chip and this should be a fixed, hard-wired unit (and not a network approach). Did you consider this option? The visible triangles per clock are slightly above 2 because the frequency has not been 100% stable for a couple of generations now (and it seems to fluctuate more on newer ASICs). How did you fix the clock? Via software?
 
What is a geometry engine?

For me, I would say:
At a very high level, the PolyMorph engine (the name rather undersells everything it does) and its relationship with the SM; in the context of Arun's tool, this shows the 1:1 relationship in Nvidia's GPUs (validated with his GP102 result), apart from V100 and probably P100 if it could be tested, where both have 2 SMs per PolyMorph engine and the associated overhead/sharing contention that creates.
Just to say, it is worth noting the PolyMorph engine and all its functions were moved into the TPC with Pascal, rather than being integral to the SM as with Maxwell and earlier; the reason is the evolution we see with P100 and V100 (changes to the ratio of CUDA cores per SM, SMs per GPC, associated registers, etc).

At a more in-depth level, it comes back to the foundations set in place with Fermi for the PolyMorph engine / raster engine / SM:
Nvidia paper said:
To facilitate high triangle rates, we designed a scalable geometry engine called the PolyMorph Engine. Each of the 16 PolyMorph engines has its own dedicated vertex fetch unit and tessellator, greatly expanding geometry performance. In conjunction, we also designed four parallel Raster Engines, allowing up to four triangles to be setup per clock. Together, they enable breakthrough triangle fetch, tessellation, and rasterization performance.

The PolyMorph Engine
The PolyMorph Engine has five stages: Vertex Fetch, Tessellation, Viewport Transform, Attribute Setup, and Stream Output. Results calculated in each stage are passed to an SM. The SM executes the game’s shader, returning the results to the next stage in the PolyMorph Engine. After all stages are complete, the results are forwarded to the Raster Engines.

The first stage begins by fetching vertices from a global vertex buffer. Fetched vertices are sent to the SM for vertex shading and hull shading. In these two stages vertices are transformed from object space to world space, and parameters required for tessellation (such as tessellation factor) are calculated. The tessellation factors (or LODs) are sent to the Tessellator.

In the second stage, the PolyMorph Engine reads the tessellation factors. The Tessellator dices the patch (a smooth surface defined by a mesh of control points) and outputs a mesh of vertices. The mesh is defined by patch (u,v) values, and how they are connected to form a mesh.

The new vertices are sent to the SM where the Domain Shader and Geometry Shader are executed. The Domain Shader calculates the final position of each vertex based on input from the Hull Shader and Tessellator. At this stage, a displacement map is usually applied to add detailed features to the patch. The Geometry Shader conducts any post processing, adding and removing vertices and primitives where needed.
The results are sent back to the PolyMorph Engine for the final pass.

In the third stage, the PolyMorph Engine performs viewport transformation and perspective correction. Attribute setup follows, transforming post-viewport vertex attributes into plane equations for efficient shader evaluation. Finally, vertices are optionally “streamed out” to memory making them available for additional processing. On prior architectures, fixed function operations were performed with a single pipeline. On GF100, both fixed function and programmable operations are parallelized, resulting in vastly improved performance.

Raster Engine
After primitives are processed by the PolyMorph Engine, they are sent to the Raster Engines. To achieve high triangle throughput, GF100 uses four Raster Engines in parallel.

Recap of the GPC Architecture
The GPC architecture is a significant breakthrough for the geometry pipeline. Tessellation requires new levels of triangle and rasterization performance. The PolyMorph Engine dramatically increases triangle, tessellation, and Stream Out performance. Four parallel Raster Engines provide sustained throughput in triangle setup and rasterization. By having a dedicated tessellator for each SM, and a Raster Engine for each GPC, GF100 delivers up to 8× the geometry performance of GT200.

Very late Edit:
And the tool, IMO, is possibly finding some of the cons in the setup of both P100 and V100 in the context of geometry performance, with their ratio and sharing contention.

One interesting aspect, and a consideration going forward, is the raster engine: there is one per GPC, and it was originally designed around 4 SMs with 4 PolyMorph engines per GPC and 4 GPCs in total, while the maximum GPC count has increased to 6 in the largest designs since Maxwell onwards.
However, and importantly, as the architecture continues to scale, how has Nvidia changed this internally for the raster engines, now that Volta has 14 SMs (in a gaming design this would be set up as 7 SMs) and 7 PolyMorph engines per GPC?
Pascal increased this from 4 to 5 SMs and PolyMorph engines per GPC.
If they did not revise the raster engine heavily from Fermi, that is a notable increase in throughput feeding each raster engine since Pascal (a quick sketch of the ratios follows).
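
Putting the scaling in one place (a quick sketch using only the per-GPC figures mentioned in this thread: Fermi's 4 from the whitepaper, Pascal's 5 and Volta's 7 as above; whether the raster engine itself was widened is exactly the open question):

Code:
#include <cstdio>

struct Gpc { const char* chip; int polymorph_per_gpc; };   // raster engines are 1 per GPC

int main() {
    const Gpc configs[] = {
        { "GF100 (Fermi)",  4 },   // 4 GPCs x 4 SM/PolyMorph
        { "GP102 (Pascal)", 5 },   // 6 GPCs x 5 SM/PolyMorph
        { "GV100 (Volta)",  7 },   // 6 GPCs x 7 TPC/PolyMorph (14 SMs)
    };
    for (const Gpc& g : configs)
        printf("%-15s %d PolyMorph engines feeding each raster engine\n",
               g.chip, g.polymorph_per_gpc);
    return 0;
}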
 
Agreed, and since the geometry stages in the pipeline aren't decoupled from the topology, there is no feasible solution or explanation for 21 GEs. With 28 you are fine, since dividing it by 7 gives 4 (math skill +10,000 :D).
 
With the latest announcement around DX12 Raytracing at GDC 2018, one quote from Nvidia stands out, from Tony Tomasi in an article:
PCGamesn said:
“There’s definitely functionality in Volta that accelerates raytracing,” Tomasi told us, “but I can’t comment on what it is.”

But the AI-happy Tensor cores present inside the Volta chips certainly have something to do with it as Tomasi explains:
.....
Nvidia’s new Tensor cores are able to bring their AI power to bear on this problem using a technique called de-noising.

“It’s also called reconstruction,” says Tomasi. “What it does is it uses fewer rays, and very intelligent filters or processing, to essentially reconstruct the final picture or pixel. Tensor cores have been used to create, what we call, an AI de-noiser.

“Using artificial intelligence we can train a neural network to reconstruct an image using fewer samples, so in fact tensor cores can be used to drive this AI denoiser which can produce a much higher quality image using fewer samples. And that’s one of the key components that helps to unleash the capability of real-time raytracing.”
https://www.pcgamesn.com/nvidia-rtx-microsoft-dxr-raytracing

Yeah, it's not going to impact gamers for some time as a complete solution, but it will be interesting to see how this unfolds sooner in the professional world, especially with Volta onwards.
 
Regarding the bolded part of your quote: Even larger caches would already accelerate raytracing, so that's basically a non-statement until elaborated upon further by Nvidia.
 
Regarding the bolded part of your quote: Even larger caches would already accelerate raytracing, so that's basically a non-statement until elaborated upon further by Nvidia.
Cache is not a functionality though *shrug*. Going down that route, you might as well say Volta's smaller CUDA-core-to-SM/register/etc. ratio counts as "functionality", but that is specifically V100 rather than the Volta architecture generally; it depends on whether one can actually differentiate the two in the same way one can with Pascal and P100.

If it were cache, they would not be so hesitant to comment on that being the functionality, as it is already a known factor.
Edit:
I linked earlier the performance gains for V100 with and without AI denoise/reconstruction; the gains are considerable, which takes architectural factors such as the cache/SM/register structure out of the equation.
This was in the Volta speculation thread.
With the AI/Tensor aspect, the gains going back to 2017 were 8x at an SSIM rating of 0.93 and 4.8x at an SSIM rating of 0.95.
The solution has matured since then, and that demo used the rendered Bistro scene, so this does come back to HW functionality rather than caches, IMO.

[Image: voltaperf6-624x682.png]


I included some other links in the Volta speculation thread.
 
Why am I not surprised …
Cache is not a functionality though *shrug*, […]
In terms of marketing, it is. I'm not saying that there isn't something else, but larger caches would suffice for that quote not to be a lie; ergo, the quote as given above is basically worthless.
 
In terms of marketing, it is.
Well, see my edit response with the performance figures for V100 with and without AI denoise/reconstruction.
It goes beyond cache/SM/register/etc.
I really do not think he is implying cache.
Otherwise they might as well say they have functionality in Volta for Amber.
 
He is not implying anything, but giving a marketing answer that's within the scope of his briefing while at the same time giving the impression of having addressed the question asked.

Tony's surname, btw, is Tamasi I believe, not Tomasi as in the PCGamesN article.
 