Nvidia Post-Volta (Ampere?) Rumor and Speculation Thread

Seeing how "to-the-metal" some of these mining kernels are, I'm somewhat skeptical of Nvidia's ability to stop hand-optimized miners from performing on their hardware.

Given the lack of such optimization in games, that would just make it more obvious.
I think it was mentioned at least early on that some of the kernels were using instructions not being generated for standard APIs at the time, so that's another tell.

The thing about duty cycling or throttling is that it's below the level of the shader, and the shader itself is subject to analysis by the driver.
Hardware vendors these days may not consistently use the monitoring and trusted-computing aspects of their platforms for this specific purpose, but they do for others.
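To make the analysis angle concrete: a driver-side heuristic over the compiled shader's instruction mix might already be enough. A purely hypothetical sketch (the instruction categories and thresholds below are invented for illustration):

Code:
# Hypothetical driver-side check: flag kernels whose instruction mix looks
# like a hash loop (heavy on 32-bit integer xor/add/rotate, no graphics ops,
# few memory ops). Categories and thresholds are made up for illustration.

def looks_like_mining_kernel(histogram):
    total = sum(histogram.values())
    int_alu = sum(histogram.get(op, 0) for op in ("xor", "add_u32", "rotate", "shift"))
    graphics = sum(histogram.get(op, 0) for op in ("tex_sample", "export", "interp"))
    mem = histogram.get("global_load", 0) + histogram.get("global_store", 0)
    return (int_alu / max(total, 1) > 0.7    # dominated by integer ALU work
            and graphics == 0                # never touches texture/export paths
            and mem < 0.05 * total)          # little memory traffic per ALU op

# Example: a SHA/Lyra2-style kernel profile trips the heuristic.
profile = {"xor": 4200, "add_u32": 3100, "rotate": 2600, "shift": 900,
           "global_load": 120, "global_store": 40}
print(looks_like_mining_kernel(profile))  # True

False positives are the obvious risk, which is presumably why you'd pair something like this with throttling rather than outright blocking.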

A miner could try to hack their way through or build their own software toolchain, but that costs them too, and it's not a common skill set.
Competitive pressures also work against taking such a lucrative find and broadcasting it.

Perhaps some of the tweaks or override settings could be put on a timer as well, just to raise the time it takes to build out a mining rack.
They could pay a bit extra to unlock things faster, creating microtransactions for mining hardware.
 

I'm out of my depth here, but I've caught some of the semi-public development of the AMD GCN3-optimized Lyra2Re2 kernel (NextMiner) as it's discussed over on the Vertcoin Discord, and they are compiling initial shader code and then patching in custom GCN3 assembly for all the performance-critical parts. That level of optimization seemed to me to be past the point where the driver would be able to affect what it was doing, but maybe I'm mistaken?
 

One idea would be to code the power management engine to throttle the chip. It probably has the ability to measure usage of categories of instructions. Not dissimilar to Intel's reduced clocks for AVX code, but applying that principle in an exaggerated manner, of course.
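As a rough sketch of that principle (the counters, thresholds, and clock offsets are all invented for illustration):

Code:
# Hypothetical power-management loop: derate the clock target when the
# integer-ALU instruction share stays high across consecutive windows,
# similar in spirit to Intel's AVX offset but deliberately exaggerated.

BASE_CLOCK_MHZ = 1800
INT_HEAVY_OFFSET_MHZ = 600     # aggressive derate for hash-like workloads
INT_SHARE_THRESHOLD = 0.7
SUSTAIN_WINDOWS = 8            # condition must persist before acting

def clock_target(windows):
    """windows: list of (int_ops, total_ops) samples from the perf counters."""
    streak = 0
    for int_ops, total_ops in windows:
        share = int_ops / max(total_ops, 1)
        streak = streak + 1 if share > INT_SHARE_THRESHOLD else 0
    if streak >= SUSTAIN_WINDOWS:
        return BASE_CLOCK_MHZ - INT_HEAVY_OFFSET_MHZ
    return BASE_CLOCK_MHZ

# A sustained integer-heavy workload gets derated; a mixed one does not.
print(clock_target([(900, 1000)] * 10))   # 1200
print(clock_target([(400, 1000)] * 10))   # 1800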
 
Turing is the name of quick-deployment GPUs designed specifically for the crypto market, named after Alan Turing, who broke the Nazi "Enigma" cipher. NVIDIA will indeed block mining on consumer hardware!

https://www.techpowerup.com/241552/...ng-chip-jen-hsun-huang-made-to-save-pc-gaming


There are other sources for the info:
https://www.theinquirer.net/inquire...pto-mining-chips-to-ease-the-strain-on-gamers
https://www.digitaltrends.com/computing/nvidia-turing-ampere-graphics-cards-gtc-2018/

Two possibilities:
-Either Turing will be much better at mining than GeForce, driving difficulty up and making GeForce alternatives irrelevant (aka the GTX 1050).
-Or GeForce will be just as good at mining as Turing, in which case NVIDIA will block or slow down mining algorithms on the GeForces.

The latter of the two is what TPU used as a source, and the Inquirer has never ever been a reliable source for anything. And even then, it's speculation, not fact like you make it out to be.
 
The Inquirer has never ever been a reliable source for anything.
I seem to remember the Inquirer was the one who broke the Meltdown and Spectre story.
And even then, it's speculation, not fact like you make it out to be.
It's just convenient, that's all. Earlier CarstenS posted a link about a potential NVIDIA block for mining on consumer hardware. Now we hear about a potential NVIDIA-specific mining SKU; connect the dots.
It's all rumors at this point, but they are convenient indeed.
 
Agreed, it'd be a bloody shame to waste Turing's name on such a frivolous thing... To focus his legacy just on Enigma feels inadequate! But it does seem about time for NVIDIA to split their architecture into "Graphics" (Ampere?) vs "HPC/AI" (Turing?), given the increasingly large size of the latter (plus GP100/GV100 barely being sold for graphics), and I suppose you could try reusing the latter for crypto, rather than the former as miners do today...

I wonder how much the following V100 config would cost for NVIDIA to manufacture:
- 8GB HBM2 (4 stacks but with 2 dies per stack instead of 4 dies, same bandwidth)
- 68 SMs and 5 GPCs (1 fully disabled GPC for redundancy)
- No display outputs, cheaper power circuitry, lower TDP
- Disabled tensor cores, FP64, FP16, ECC, etc...

I'm not sure if that would end up compute-limited for Ethereum or if it'd still be bandwidth-limited, but it'd probably be a monster for CryptoNight at the very least... HBM2 might make it more power-efficient but less cost-efficient versus a GDDR5X GPU.
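For what it's worth, the back-of-envelope (assuming roughly 900 GB/s for that 4-stack config, and the usual 64 x 128-byte DAG reads per Ethash hash) suggests it would stay bandwidth-limited:

Code:
# Rough Ethash ceiling for a V100-class part. The 64 x 128-byte random DAG
# accesses per hash come from the Ethash design; the ~900 GB/s bandwidth is
# my assumption for this hypothetical 8GB / 4-stack HBM2 config.

bandwidth_bytes_per_s = 900e9
bytes_per_hash = 64 * 128          # 8 KiB of DAG traffic per hash

ceiling = bandwidth_bytes_per_s / bytes_per_hash
print(f"~{ceiling / 1e6:.0f} MH/s upper bound")   # ~110 MH/s

The Keccak work per hash is small next to that, so the ALUs shouldn't be the limiter even with a GPC disabled.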

I'm not sure how you would improve on that for crypto efficiency on Ampere/Turing except by removing the silicon that crypto doesn't need, but it doesn't feel worth it to create a chip just for crypto, so you'd probably want to only remove either "graphics" components or "HPC/AI" components.

For example, you might end up with a "Turing" chip that's HPC/AI-centric (no rasterisers or texture units), and then it could be resold for crypto with tensors/FP16/FP64/ECC disabled. Heck, if you wanted to go really crazy, you could even make FP32 half-rate and only keep the integer and other pipelines full-rate...

It feels like there's some nice potential synergy in making an HPC/AI chip (low volume, high MSRP) and amortising it with the crypto market (medium volume, low MSRP). But the only way for that to generate more profits for NVIDIA overall is if:
1) It allows them to gain share in the cryptocurrency mining market vs AMD.
2) They can buy enough memory chips to not be horribly supply limited on those cards...
 
I'm out of my depth here, but I've caught some of the semi-public development of the AMD GCN3-optimized Lyra2Re2 kernel (NextMiner) as it's discussed over on the Vertcoin Discord, and they are compiling initial shader code and then patching in custom GCN3 assembly for all the performance-critical parts. That level of optimization seemed to me to be past the point where the driver would be able to affect what it was doing, but maybe I'm mistaken?
That would bypass code generation in the driver, but at some point code needs to be loaded and the hardware pointed to it, which leaves it open to inspection by the driver. Higher-level analysis than the limited window known to the execution units could be performed as part of the heuristic. The driver or firmware can also get an idea of the number and variety of kernels and resources being used and their lifetimes/contents, which for mining tend to be more homogeneous.
What the GPU does when it reaches a specific instruction can also be tuned or trapped.
Specific system elements like the DVFS control and hardware monitoring can operate at a low level, but the decision to wall off enough of the GPU would need to be made in advance for a new design rather than retrofitted onto existing ones. AMD has a fair amount of on-die privileged processing available with its PSP and related cores, and Nvidia probably has resources I haven't seen details for, or could add them.
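To illustrate the kernel-variety/lifetime point: even coarse launch telemetry, without looking inside the shader at all, is fairly distinctive. A toy sketch with invented thresholds:

Code:
# Toy illustration: mining workloads tend to dispatch the same one or two
# kernels back-to-back for hours with stable resource bindings, unlike games
# or typical compute apps. Thresholds are invented for illustration.

def workload_looks_like_mining(launches):
    """launches: list of (kernel_id, runtime_seconds) records."""
    unique_kernels = len({kid for kid, _ in launches})
    total_runtime = sum(t for _, t in launches)
    return (unique_kernels <= 2          # tiny, homogeneous kernel set
            and len(launches) > 1000     # dispatched over and over
            and total_runtime > 3600)    # ...for hours on end

miner = [("lyra2_main", 3.0)] * 2000
game = [(f"pass_{i % 40}", 0.016) for i in range(5000)]
print(workload_looks_like_mining(miner))  # True
print(workload_looks_like_mining(game))   # False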
 
Is it possible that the Nvidia "Turing" could be a pure tensor core processor, replacing the SMs with tensor cores?

It's much more reasonable than a crypto chip. I also bet it's an AI-optimized architecture and not a crypto one. But I don't think they will fully replace the SM. I'd expect an inference-optimized chip. V100 gives you FP16 multiply and FP32 accumulation. That's nice for training, but you don't need such accuracy for inference. Maybe INT8 tensor cores? They could be much smaller than V100's TCs. Would it be possible to put double the number of such TCs per SM compared to V100?
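Rough numbers from the public V100 specs put that in context (the INT8 variant at the end is pure speculation on my part):

Code:
# Back-of-envelope from public V100 specs: 80 SMs x 8 tensor cores, each
# doing a 4x4x4 FP16 FMA (64 MACs = 128 ops) per clock at ~1.53 GHz boost.

sms, tc_per_sm, boost_ghz = 80, 8, 1.53
ops_per_tc_per_clock = 4 * 4 * 4 * 2          # FMA counts as 2 ops

fp16_tensor_tflops = sms * tc_per_sm * ops_per_tc_per_clock * boost_ghz * 1e9 / 1e12
print(f"V100 FP16 tensor: ~{fp16_tensor_tflops:.0f} TFLOPS")   # ~125

# Speculative: an INT8-only tensor core at double the MAC rate (or simply
# twice as many per SM, if the area really is much smaller) lands around 2x.
print(f"Hypothetical INT8 part: ~{fp16_tensor_tflops * 2:.0f} TOPS")   # ~251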
 
Is it possible that the Nvidia "Turing" could be a pure tensor core processor, replacing the SMs with tensor cores?
The closest to that would be the full-height, half-length Volta V100 with a 150W TDP, which is more oriented towards its tensor cores than its CUDA FP64/FP32 cores.
If it is ever officially released.
 
Maybe Turing is just the RISC-V-based successor to FaLCon? Though I'd be surprised if they made this a dedicated talking point for their next-gen architecture with a mainstream audience.
 
It makes more sense for Nvidia to make a dedicated AI chip than a crypto chip, because there is almost no risk in terms of future demand.

And the Turing name is just as appropriate.
 
Yes, and even if they decided to build a mining chip, the timeframe doesn't work out. Before Q2 last year no one cared about mining, so Nvidia wouldn't have planned a chip before then. Designing a FinFET-process chip takes time: you need at least 18 months for design, prototyping, and production to get such a chip on the shelves. Probably quite a bit more.
 
One of the names could be to do with the evolution of the mixed-precision compute capabilities (including register-related ones), along with further advancements in how cycles are used by the CUDA cores (consider how Volta improved the ability to issue INT32/FP32/SFU operations simultaneously); if that happens, Turing would fit this better IMO, and some are expecting the mixed-precision compute design to evolve.
IMO Turing would be more to do with improvements around compute, but yeah, it does seem more obvious that the name is associated with machine learning, so who knows *shrug*; it could relate either to the Turing machine (for compute) or to his paper (for machine learning).
But some are expecting an evolution of the CUDA compute-operation capabilities, outside of the debate over names.
 
Volta already improved the compute capabilities a lot. I don't think they will change them much before Volta's 7nm successor. That's why I believe in TCs with lower accuracy. This should be easy to do without too much effort.
We already know that a GPU with more TC FLOPS than V100 is sampling in Q2:

[Attached image: gtc_2017_pre_brief-2 slide]

https://www.anandtech.com/show/1191...-pegasus-at-gtc-europe-2017-feat-nextgen-gpus
Xavier is 2x 30 TOPS, which leaves 2x 130 TOPS for the two GPUs. Nvidia mentioned the GPUs on Pegasus are a post-Volta architecture. So either Nvidia pushed a lot of tensor cores into their gaming architecture and changed the name to Turing because of the company Ampere, or we have a gaming architecture and an AI architecture. But if Turing is a new AI architecture, why don't they put it into Xavier? There are many open questions. It seems the V100's 150W TDP version for inference isn't coming anymore.
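For reference, the 130 TOPS per GPU falls straight out of the announced totals (320 TOPS for Pegasus overall, with two ~30 TOPS Xaviers, as I understand the figures):

Code:
# Pegasus: 320 TOPS total, two Xaviers at ~30 TOPS each, remainder split
# across the two discrete post-Volta GPUs.
pegasus_total_tops = 320
xavier_tops = 30
per_dgpu_tops = (pegasus_total_tops - 2 * xavier_tops) / 2
print(per_dgpu_tops)   # 130.0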
 
Yes and no; it depends upon what your perspective is.
It improved things a bit relative to what one expects as an evolution from Pascal, but considering AMD's capabilities, the mixed-precision compute and concurrent/simultaneous operations can be evolved a lot more from what was done with Pascal.
Tensor cores have no benefit for gaming; improved compute and concurrent/simultaneous operations do.
Volta V100 does 120 TOPS, so the dual dGPUs on Pegasus at 130 TOPS each are an expected improvement in line with that model's design.
I mentioned the 150W V100 because that is closer to what one would expect from a tensor product, and if it can do FP16 inferencing it can do training as well...
There will be a push for mixed-precision requirements (without DP) in gaming down the line, and Nvidia needs to incorporate that into GeForce at some point, but in a way that aligns with the evolution of what they did with Pascal-Volta and provides synergy with the other Tesla and Quadro GPUs beyond the flagship mixed-precision (with DP) parts.
Remember that Volta and all of Nvidia's presentations going back years talk about it within a very specific HPC framework, not in terms of a broad Tesla product line (say GP104 up to the flagship HPC mixed-precision GP100 when looking at Pascal). One also then needs to consider Quadro and that segment: on top of the traditional range, Nvidia is also looking to push mixed-precision with the Quadro GP100.

Anyway, to reiterate, Nvidia still has quite a lot of room to evolve their compute capabilities and operations from a CUDA core/register perspective, including mixed-precision and simultaneous operations.
Do not forget that a complete product line takes over 12 months to come to market, meaning any product design is not just short-term but a strategy meant to last quite a while, covering a broad range of segments and models.
Pascal had an accelerated product release, but it aligned with the need to do the P100, followed by the V100, for large obligated HPC contracts in late 2017; it changed the launch strategy completely, and they started with the highest risk/cost/yield part (the 610mm² Pascal die, or the 815mm² Volta) and worked their way down rather than working their way up as in the past.
 
Volta in Xavier has twice the TensorCore throughput of GV100. So something has changed from GV100.
Also, the DL Tegra-Jetson line (Drive/Drive II/Pegasus/etc.) follows a slightly different cycle to that of Tesla/Quadro/GeForce.
Context being when it is in sample status rather than announced.
 
Yeah yeah. Turing is for gaming, Ampere is for gaming! They both exist, neither exists! They're launching at 7 separate times all together!

We'll find out when we find out. It is weird that they're launching 2(?) architectures right after Volta... which has 1 die and 1 product. Then again, you can't train with the quoted TOPS for Volta, and deploying inferencing on ultra-expensive GPUs that are out of stock as it is seems a waste.

I also suppose this means TSMC's ability to produce large-die 7nm products won't be around for quite a while. Launching a new gaming architecture (which Volta isn't) this year will be a boon for gamers if there's actual supply available. But how long will it take to get a 7nm Nvidia GPU? Launching yet another new line in a year or less seems unlikely, maybe not till late next year if then. "Delays" seems to be the recurrent watchword for any new silicon node.
 