Nvidia Post-Volta (Ampere?) Rumor and Speculation Thread

Discussion in 'Architecture and Products' started by Geeforcer, Nov 12, 2017.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Given the lack of such optimization in games, that would just make it more obvious.
    I think it was mentioned at least early on that some of the kernels were using instructions not being generated for standard APIs at the time, so that's another tell.

    The thing about duty cycling or throttling is that it's below the level of the shader, and the shader itself is subject to analysis by the driver.
    Hardware vendors these days may not consistently use the monitoring and trusted-computing aspects of their platforms for this specific purpose, but they do for others.

    A miner could try to hack their way through or build their own software toolchain, but that costs them too, and it's not a common skill set.
    Competitive pressures also argue against taking such a lucrative find and broadcasting it.

    Perhaps some of the tweaks or override settings could be put on a timer as well, just to raise the time it takes to build out a mining rack.
    Miners could pay a bit extra to unlock things faster--creating microtransactions for mining hardware.
     
  2. mrcorbo

    mrcorbo Foo Fighter
    Veteran

    Joined:
    Dec 8, 2004
    Messages:
    3,564
    Likes Received:
    1,981
    I'm out of my depth here, but I've caught some of the semi-public development of the AMD GCN3-optimized Lyra2Re2 kernel (NextMiner) as it's discussed over on the Vertcoin Discord, and they are compiling initial shader code and then patching in custom GCN3 assembly for all the performance-critical parts. That level of optimization seemed to me to be past the point where the driver would be able to affect what it was doing, but maybe I'm mistaken?
     
  3. entity279

    Veteran Regular Subscriber

    Joined:
    May 12, 2008
    Messages:
    1,229
    Likes Received:
    422
    Location:
    Romania
    One idea would be to code the power management engine to throttle the chip. It probably has the ability to measure usage of categories of instructions. Not dissimilar to Intel's reduced clocks in AVX code, but applying that principle in an exaggerated manner, of course.
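    A crude sketch of what that heuristic could look like in driver/firmware logic (all counter names and thresholds here are invented for illustration; a real DVFS controller would work off hardware performance counters, not Python):

    ```python
    # Hypothetical duty-cycle throttle driven by instruction-mix counters.
    # Hashing kernels skew heavily toward integer/bitfield ops with few
    # memory ops per instruction, so that mix triggers a downclock.

    def throttle_factor(counters, window_total):
        """Return a clock multiplier in (0, 1] for the sampled window."""
        int_heavy = (counters["int32"] + counters["bitfield"]) / window_total
        mem_light = counters["mem"] / window_total < 0.05
        if int_heavy > 0.9 and mem_light:
            return 0.5   # aggressive: an exaggerated AVX-style offset
        if int_heavy > 0.7:
            return 0.8   # milder offset for merely integer-leaning code
        return 1.0       # typical graphics/compute mix, full clocks
    ```

    The Intel analogy holds up to a point: AVX offsets exist for power delivery, whereas this would be a policy decision, so miners would immediately try to pad kernels with dummy ops to dodge the classifier.
    
    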
     
  4. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,166
    Likes Received:
    1,836
    Location:
    Finland

    The latter of the two is what TPU used as a source, and the Inquirer has never been a reliable source for anything. And even then, it's speculation, not fact as you make it out to be.
     
  5. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,749
    Likes Received:
    2,515
    I seem to remember the Inquirer was the one who broke the Meltdown and Spectre story.
    It's just convenient, that's all. Earlier CarstenS posted a link about a potential NVIDIA block for mining on consumer hardware. Now we hear about a potential NVIDIA-specific mining SKU; connect the dots.
    It's all rumors at this point, but they are convenient indeed.
     
  6. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,166
    Likes Received:
    1,836
    Location:
    Finland
    Actually it was The Register (and Reddit)
    I suppose one could mistake The Register for The Inquirer, same color scheme etc
     
  7. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    Agreed it'd be a bloody shame to waste Turing's name on such a frivolous thing... To focus his legacy just on Enigma feels inadequate! But it does seem about time for NVIDIA to split their architecture into "Graphics" (Ampere?) vs "HPC/AI" (Turing?) given the increasingly large size of the latter (+GP100/GV100 barely being sold for graphics) and I suppose you could try reusing the latter for crypto, rather than the former as miners do today...

    I wonder how much the following V100 config would cost for NVIDIA to manufacture:
    - 8GB HBM2 (4 stacks but with 2 dies per stack instead of 4 dies, same bandwidth)
    - 68 SMs and 5 GPCs (1 fully disabled GPC for redundancy)
    - No display outputs, cheaper power circuitry, lower TDP
    - Disabled tensor cores, FP64, FP16, ECC, etc...

    I'm not sure if that would end up compute-limited for Ethereum or if it'd still be bandwidth-limited, but it'd probably be a monster for CryptoNight at the very least... HBM2 might make it more power-efficient but less cost-efficient versus a GDDR5X GPU.

    I'm not sure how you would improve on that for crypto efficiency on Ampere/Turing except for removing the silicon that isn't necessary for other markets, but it doesn't feel worth it to create a chip just for crypto, so you'd probably want to only remove either "graphics" components or "HPC/AI" components.

    For example, you might end up with a "Turing" chip that's HPC/AI-centric (no rasterisers or texture units), and then it could be resold for crypto with tensors/FP16/FP64/ECC disabled. Heck, if you wanted to go really crazy, you could even make FP32 half-rate and only keep the integer and other pipelines full-rate...

    It feels like there's some nice potential synergy with making a HPC/AI chip (low volume, high MSRP) and amortising it with the crypto market (medium volume, low MSRP). But the only way for that to generate more profits for NVIDIA overall is if:
    1) It allows them to gain share in the cryptocurrency mining market vs AMD.
    2) They can buy enough memory chips to not be horribly supply limited on those cards...
     
    #107 Arun, Feb 16, 2018
    Last edited: Feb 16, 2018
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    That would bypass code generation in the driver, but at some point code needs to be loaded and the hardware pointed to it, which leaves it open to inspection by the driver. Higher-level analysis than the limited window known to the execution units could be performed as part of the heuristic. The driver or firmware can also get an idea of the number and variety of kernels and resources being used and their lifetimes/contents, which for mining workloads is more homogeneous.
    What the GPU does when it reaches a specific instruction can also be tuned or trapped.
    Specific system elements like the DVFS control and hardware monitoring can operate at a low level, but the decision to wall off enough of the GPU would need to be done in advance for a new design rather than retrofitting existing ones. AMD has a fair amount of on-die privileged processing available with its PSP and related cores, and Nvidia probably has resources I haven't seen details for or could add them.
     
    mrcorbo likes this.
  9. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    112
    Likes Received:
    129
    It's much more reasonable than a crypto chip. I also bet it's an AI-optimized architecture, not a crypto one. But I don't think they will fully replace the SM; I'd expect an inference-optimized chip. V100 gives you FP16 multiply and FP32 accumulation. That's nice for training, but you don't need such accuracy for inference. Maybe INT8 tensor cores? They could be much smaller than V100's TCs. Would it be possible to put double the amount of such TCs per SM compared to V100?
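    To illustrate why inference tolerates INT8 where training wants FP16 multiply with FP32 accumulate: a toy symmetric quantization round-trip (pure Python; the weights and per-tensor scale are made up for the example):

    ```python
    # Toy symmetric INT8 quantization of a weight tensor, as used for
    # inference. The rounding error is bounded by scale/2 per weight --
    # tolerable for a forward pass, but it would compound across the
    # millions of small gradient updates in training.

    def quantize(xs, scale):
        return [max(-127, min(127, round(x / scale))) for x in xs]

    def dequantize(qs, scale):
        return [q * scale for q in qs]

    weights = [0.51, -0.249, 0.1, -0.875]
    scale = max(abs(w) for w in weights) / 127   # one scale per tensor

    restored = dequantize(quantize(weights, scale), scale)
    errors = [abs(a - b) for a, b in zip(weights, restored)]
    print(max(errors))  # stays below scale/2
    ```

    An INT8 multiplier array is also far cheaper in silicon than an FP16 one with FP32 accumulators, which is the basis of the "much smaller TCs" guess above.
    
    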
     
    #109 Samwell, Feb 17, 2018
    Last edited: Feb 17, 2018
  10. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    The closest to that would be the full-height, half-length Volta V100 with a 150W TDP, which is oriented more towards its tensor cores than its CUDA FP64/FP32 cores.
    If it is ever officially released.
     
  11. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    Maybe Turing is just the RISC-V-based successor to FaLCon? Though I'd be surprised if they made that a dedicated talking point for a next-gen architecture aimed at a mainstream audience.
     
  12. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    It makes more sense for Nvidia to make a dedicated AI chip than a crypto chip, because there is almost no risk in terms of future demand.

    And the Turing name is just as appropriate.
     
    Alexko and xpea like this.
  13. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    112
    Likes Received:
    129
    Yes, and even if they'd decided to build a mining chip, the timeframe doesn't work out. Before Q2 last year no one cared about mining, so Nvidia wouldn't have planned a chip before then. Designing a chip on a FinFET process takes time: you need at least 18 months of design, prototyping, and production to get such a chip on the shelves. Probably quite a bit more.
     
  14. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    One of the names could have to do with the evolution of the mixed-precision compute capabilities (including register-related ones), along with further advances in how the CUDA cores use their cycles (consider how Volta improved the ability to issue INT32/FP32/SFU operations simultaneously); if so, Turing would fit that better IMO, and some are expecting an evolution of the mixed-precision compute design.
    IMO Turing would have more to do with improvements around compute, but yeah, the name does seem more obviously associated with machine learning, so who knows *shrug*; it could relate either to the Turing machine (compute) or to his paper (machine learning).
    Either way, outside the naming debate, some are expecting an evolution of the CUDA compute-operation capabilities.
     
    #114 CSI PC, Feb 18, 2018
    Last edited: Feb 18, 2018
  15. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    112
    Likes Received:
    129
    Volta already improved the compute capabilities a lot. I don't think they will change them much before Volta's 7nm successor. That's why I believe in TCs with lower accuracy. This should be easy to do without too much effort.
    We already know that a GPU with more TC throughput than V100 is sampling in Q2:

    https://www.anandtech.com/show/1191...-pegasus-at-gtc-europe-2017-feat-nextgen-gpus
    Xavier is 2x30 TOPS, which leaves 2x130 TOPS for the two GPUs. Nvidia mentioned the GPUs on Pegasus are a post-Volta architecture. So either Nvidia pushed many tensor cores into their gaming architecture and changed the name to Turing because of the company Ampere, or we have a gaming architecture and an AI architecture. But if Turing is a new AI architecture, why don't they put it into Xavier? There are many open questions. It seems V100's 150W TDP version for inference is not coming anymore.
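    Sanity-checking that arithmetic against the announced Pegasus board total of 320 TOPS:

    ```python
    # Back-of-the-envelope split of NVIDIA's quoted Drive PX Pegasus total
    # (320 TOPS per the GTC Europe 2017 announcement).
    pegasus_total_tops = 320
    xavier_count, xavier_tops = 2, 30   # two Xavier SoCs at ~30 TOPS each
    gpu_count = 2                       # two discrete post-Volta GPUs

    per_gpu_tops = (pegasus_total_tops - xavier_count * xavier_tops) / gpu_count
    print(per_gpu_tops)  # 130.0 -- above V100's quoted 120 TOPS
    ```
    
    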
     
    ImSpartacus likes this.
  16. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Yes and no, depending on your perspective.
    It improved things a bit relative to what one expects as an evolution from Pascal, and considering AMD's capabilities, but the mixed-precision compute and concurrent-simultaneous operations can be evolved a lot further from what was done with Pascal.
    Tensor cores have no benefit for gaming; improved compute and concurrent-simultaneous operations can.
    Volta V100 does 120 TOPS, so the dual dGPUs at 130 TOPS each on Pegasus are an expected improvement for that model design.
    I mentioned the 150W V100 because that is closer to what one would expect from a Tensor product, and if it can do FP16 inferencing it can do training as well.....
    There will be a push for mixed-precision requirements (without DP) in gaming down the line, and Nvidia needs to incorporate that into GeForce at some point, but in a way that aligns with the evolution from Pascal to Volta and provides synergy with the Tesla and Quadro GPUs below the flagship mixed-precision (with DP) part.
    Remember that Volta and all of Nvidia's presentations on it going back years frame it within a very specific HPC context, not within a broad Tesla product line spanning, say, GP104 to GP100 (the flagship HPC mixed-precision part) as with Pascal; one also needs to consider the Quadro segment, where on top of the traditional range Nvidia is likewise pushing mixed-precision with the Quadro GP100.

    Anyway to reiterate Nvidia still has quite a lot of room to evolve their compute capabilities and operations from a CUDA core-register perspective, including the mixed-precision and simultaneous operations.
    Do not forget that a complete product line takes over 12 months to come to market, meaning any product design is not just short-term but a strategy meant to last quite a while across a broad range of segments and models.
    Pascal had an accelerated product release, but that aligned with the need to ship the P100, followed by the V100 in late 2017, for large contracted HPC obligations; it changed the launch strategy completely: they started with the highest-risk, highest-cost, worst-yield die, the 610mm2 Pascal (or 815mm2 Volta), and worked their way down, rather than working their way up as in the past.
     
    #116 CSI PC, Feb 18, 2018
    Last edited: Feb 18, 2018
  17. troyan

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    120
    Likes Received:
    181
    Volta in Xavier has twice the TensorCore throughput of GV100. So something has changed from GV100.
     
  18. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Also the DL Tegra-Jetson (Drive/Drive II/Pegasus/etc) follow a slightly different cycle to that of Tesla/Quadro/Geforce.
    Context being when in sample status rather than announced.
     
    #118 CSI PC, Feb 19, 2018
    Last edited: Feb 19, 2018
  19. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    985
    Likes Received:
    277
    Grall and pharma like this.
  20. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    322
    Likes Received:
    82
    Yeah yeah. Turing is for gaming, Ampere is for gaming! They both exist, neither exist! They're launching at 7 separate times all together!

    We'll find out when we find out. It is weird that they're launching 2(?) architectures right after Volta... which has one die and one product. Then again you can't train with the quoted TOPS for Volta, and deploying inferencing on ultra-expensive, already-out-of-stock GPUs seems a waste.

    I also suppose this means TSMC's ability to produce large die 7nm products won't be around for quite a while. Launching a new gaming architecture, which Volta isn't, this year will be a boon for gamers if there's actual supply available. But how long will it take to get a 7nm Nvidia GPU? Launching yet another new line in a year or less seems unlikely, maybe not till late next year if then. "Delays" seems to be the recurrent watchword for any new silicon node.
     