That's interesting! Yep, the GPC-SM-TPC design still seems applicable, but one twist is that my benchmarks suggest there are 21 geometry engines on V100, which would be 1 per 4 SMs... while only 80 of the 84 SMs are enabled, so all the geometry engines are active despite not all the SMs being active. And more confusingly, there are supposedly 6 GPCs according to NVIDIA's diagrams (I haven't tested this), which means 14 SMs per GPC... but 14 isn't divisible by 4! So either the association between SMs and GPCs is orthogonal to the association between SMs and geometry engines, or possibly some SMs just don't have access to geometry engines - i.e. it has become impossible to have warps running on all SMs purely from vertex shaders (with no pixel or compute shaders). I don't know which is true (assuming my test is correct), and it isn't worth my spending any more time on the question, but it's still intriguing.
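To make the mismatch concrete, here's the trivial arithmetic (a quick Python sketch that just restates the counts above - the 6 GPCs figure is from NVIDIA's diagrams, not something I've measured):

Code:
# Topology arithmetic for GV100, using the numbers from the post above.
SMS_FULL_DIE = 84    # physical SMs on GV100
SMS_ENABLED  = 80    # enabled on V100 / Titan V
GPCS         = 6     # per NVIDIA's diagrams (untested)
GEOM_ENGINES = 21    # implied by the ~21 indices/clock measurement

print(SMS_FULL_DIE / GEOM_ENGINES)       # 4.0 -> one geometry engine per 4 SMs of the full die
print(SMS_ENABLED / GEOM_ENGINES)        # ~3.81 -> enabled SMs don't divide evenly either
print(SMS_FULL_DIE // GPCS)              # 14 SMs per GPC
print((SMS_FULL_DIE // GPCS) % 4 == 0)   # False -> 14 isn't divisible by 4, hence the puzzle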
Sadly, I don't have any good low-level tessellation tests to run - the tests I'm using for geometry are actually the ones I wrote back in 2006 to test G80(!) and they're still better than anything publicly available that I know of - a bit sad, but oh well...
EDIT: For anyone who's interested, the results of my old geometry microbenchmark on the Titan V @ 1200MHz fixed core clock...
Code:
CULL: EVERYTHING
Triangle Setup 3V/Tri CST: 2782.608643 triangles/s
Triangle Setup 2V/Tri AVG: 4413.792969 triangles/s
Triangle Setup 2V/Tri CST: 4129.032227 triangles/s
Triangle Setup 1V/Tri CST: 8347.826172 triangles/s

CULL: NOTHING
Triangle Setup 3V/Tri CST: 2445.859863 triangles/s
Triangle Setup 2V/Tri AVG: 2445.859863 triangles/s
Triangle Setup 2V/Tri CST: 2445.859863 triangles/s
Triangle Setup 1V/Tri CST: 2461.538574 triangles/s
i.e. ~21 indices/clock, ~7 unique vertices/clock and ~2 visible triangles/clock (not completely sure why it's very slightly above 2/clk; could be a bug in my test). NVIDIA's geometry engines have had a rate of 1 index/clock for a very long time, which is what implies there are 21 of them in V100 (the same test shows 28 indices/clock for GP102).
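For anyone who wants to check the conversion, this is roughly how the per-clock figures fall out of the raw results - a quick Python sketch assuming the tool's output is in millions of triangles per second and that each triangle still consumes 3 indices in the 1V/Tri case:

Code:
# Converting the raw results above (assumed Mtris/s) into per-clock rates
# at the 1200MHz fixed core clock.
CLOCK_MHZ = 1200.0

setup_1v_cull_all  = 8347.826172   # CULL: EVERYTHING, 1V/Tri CST
setup_3v_cull_none = 2445.859863   # CULL: NOTHING, 3V/Tri CST

unique_verts_per_clk = setup_1v_cull_all / CLOCK_MHZ    # ~6.96 -> ~7 unique vertices/clock
indices_per_clk      = 3 * unique_verts_per_clk         # 3 indices/tri -> ~20.9 -> ~21 indices/clock
visible_tris_per_clk = setup_3v_cull_none / CLOCK_MHZ   # ~2.04 -> ~2 visible triangles/clock

print(unique_verts_per_clk, indices_per_clk, visible_tris_per_clk)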
I think it's likely that any future gaming-centric architecture will have higher geometry-per-ALU throughput than V100; maybe this is a trade-off to save area on a mostly compute/AI-centric chip? Or maybe NVIDIA decided the geometry throughput was simply higher than it needed to be, which is possible, although it feels a bit low now relative to everything else...
There are definitely 6 GPCs, as that is the maximum design for Maxwell/Pascal/Volta; the architecture has been scaling via SMs-TPCs/PolyMorph engines per GPC.
What can be a bit vague is where they disable the SMs.
Because of the shared TPC/PolyMorph, one could consider it as 7 SMs per GPC, since the TPC/PolyMorph and SM are no longer in a 1:1 relationship; the context here is specifically geometry rather than compute, and yeah, I appreciate it's not fully accurate, but it helps keep the sharing aspect in perspective when comparing to the other Nvidia GPUs. Even accounting for this, your tool shows the V100 still not performing ideally and coming in below expectation.
Outside of P100 and V100, all SMs are meant to have a 1:1 relationship with their associated geometry engines, as per Fermi, which is what you see in your GP102 result.
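Just to spell out how that ratio falls out of the measured index rates (a rough Python sketch; the 1 index/clock per engine figure is from the post above, and the 28-SM GP102 count is an assumption about the specific card tested):

Code:
# Inferring geometry engine counts from the measured index rates,
# assuming 1 index/clock per geometry engine.
# SM counts here are assumptions about the specific cards, not measurements.
measured = {
    "GP102": {"indices_per_clk": 28, "sms": 28},   # assuming a 28-SM part
    "GV100": {"indices_per_clk": 21, "sms": 84},   # full die; 80 SMs enabled on Titan V
}
for chip, d in measured.items():
    engines = d["indices_per_clk"]                 # 1 index/clock per engine
    print(chip, engines, "engines,", d["sms"] / engines, "SMs per engine")
# GP102 -> 1 SM per engine (the Fermi-style 1:1 relationship)
# GV100 -> 4 SMs per engine on the full die, i.e. geometry is shared across SMs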
I think those GV100 results come back to the pros/cons that Nvidia very briefly touched upon at one point when asked about using that SM-GPC setup for gaming; I'm trying to find whether it was ever noted publicly.
Do you know anyone you can share your code with who has access to a P100 (maybe as a Quadro GP100) and could run it to see whether the behaviour aligns with the V100?
Really nice tool there, especially as it is identifying quirks with the V100 design.