Impact of nVidia Turing RayTracing enhanced GPUs on next-gen consoles *spawn

Discussion in 'Console Technology' started by vipa899, Aug 18, 2018.

  1. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    7,200
    Likes Received:
    5,469
    yup, they got pretty great results using statistical approximation, which is basically, in a nutshell, what we're doing with AI.
    And you can do this without using a NN; different statistical algorithms would work.
    But it also could have been done using a DNN, which benefits from massive data sets.
    They'd have to run it and see the results, though.
    Post here on resetera:

    I guess it largely depends on the algorithm used and how effective it is.
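    The point that upscaling can be done with plain statistical methods, no NN required, can be made concrete with a toy sketch: fitting a linear 2x upscale filter by ordinary least squares. Everything here (the synthetic 1-D "scanlines", the 3-tap neighbourhood, the 2x factor) is an illustrative assumption, not anyone's actual algorithm.

```python
import numpy as np

# Statistical upscaling without a neural net: learn a linear 2x upscale
# filter from paired low/high resolution data via least squares.
rng = np.random.default_rng(0)

def smooth_signals(n, size=32):
    # Smooth random "scanlines" standing in for image rows.
    x = rng.standard_normal((n, size))
    k = np.array([1.0, 4.0, 6.0, 4.0, 1.0])
    k /= k.sum()
    return np.array([np.convolve(r, k, mode="same") for r in x])

hi = smooth_signals(256)        # high-res training rows (32 samples)
lo = hi[:, ::2]                 # naive 2x downsample (16 samples)

# Each 3-sample low-res neighbourhood predicts the corresponding pair
# of high-res samples (edges skipped for brevity).
X, Y = [], []
for h, l in zip(hi, lo):
    for i in range(1, len(l) - 1):
        X.append(l[i - 1:i + 2])
        Y.append(h[2 * i:2 * i + 2])
X, Y = np.array(X), np.array(Y)

W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # learned 3-in/2-out filter

# Compare against nearest-neighbour upscaling on the same data.
mse_learned = np.mean((X @ W - Y) ** 2)
nearest = np.stack([X[:, 1], X[:, 1]], axis=1)
mse_nearest = np.mean((nearest - Y) ** 2)
```

    A least-squares filter is guaranteed to be at least as good as nearest-neighbour on its training data, since nearest-neighbour is itself one particular linear filter; a DNN generalises the same idea to non-linear filters trained on much more data.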
     
    AlBran likes this.
  2. jlippo

    Veteran Regular

    Joined:
    Oct 7, 2004
    Messages:
    1,263
    Likes Received:
    334
    Location:
    Finland
    Now that Unreal Engine has temporal upscaling and it's really easy to get working, we should see it even in PC games from now on.

    Temporal upscaling is cheap in terms of processing power as well, so I'm really looking forward to the first game using both it and DLSS, to get a decent comparison done.
    I wouldn't be surprised if TU is faster and gives better quality in many cases, and it works with dynamic resolution.
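    The core of temporal upscaling as described here can be sketched in a few lines: jitter the low-res sample positions each frame and blend them into a persistent full-res history. This is a deliberately static-scene toy (no motion vectors, no disocclusion handling), purely to show why the technique is cheap; all sizes and weights are arbitrary.

```python
import numpy as np

# Toy temporal upscaling on a static 1-D "image": each frame renders
# half the samples at an alternating sub-pixel jitter, and an
# exponential blend accumulates them into a full-res history buffer.
N = 64
truth = np.sin(np.linspace(0.0, 4.0 * np.pi, N))   # full-res target
history = np.zeros(N)
alpha = 0.5                                        # history blend weight

for frame in range(16):
    offset = frame % 2                 # alternate jitter between phases
    idx = np.arange(offset, N, 2)      # half-res sample positions
    samples = truth[idx]               # "render" this frame at half res
    history[idx] = (1.0 - alpha) * history[idx] + alpha * samples

# Every position has now been blended 8 times, so the history has
# converged close to the full-res signal: residual = (1 - alpha)^8.
max_err = float(np.max(np.abs(history - truth)))
```

    The per-frame cost is just the low-res render plus one blend per pixel, which is why it is so much cheaper than rendering at full resolution.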
     
  3. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    39,907
    Likes Received:
    9,999
    Location:
    Under my bridge
    I didn't pass judgement on the value of Tensor cores. I said that for upscaling they seemed inefficient.

    If true, that'd make DLSS as an upscaling solution even more silicon-inefficient. ;)

    To be clear, I am only talking about upscaling games (the state of the art of which DavidGraham seems unaware), which we know works wonderfully well in compute. So the case for DLSS on Tensor cores (the Turing implementation, as this thread is about Turing) is: can it upscale better/cheaper? That means comparing results to silicon costs. And if, as you say, DLSS can be performed on compute, then its value needs to be factored into that.

    The main argument you're presenting is that Tensor cores are worth it anyway, and since they're in there, they can be used for upscaling. That may be a fair argument, but it doesn't prove Tensor cores are an efficient solution just for upscaling.

    I'm seeing lots of numbers and no clear way to compare them. Can you provide the Tensor core cost as a figure? What percentage of the RTX 2070 die is Tensor cores? How much compute could be had in the same space, and how much compute is needed for high-quality temporal injection (we know that's a fraction of PS4's 1.8 TFLOPS)?
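    The question being asked can at least be framed as arithmetic, even without the real figures. Every number below (die size, Tensor-core area fraction, total TFLOPS, upscaler cost) is a placeholder assumption for illustration, not a measured value for any actual GPU.

```python
# Back-of-envelope framing of the silicon-cost question. All inputs
# are hypothetical placeholders, not confirmed figures for any die.
def tensor_area_tradeoff(die_mm2, tensor_frac, die_tflops, upscale_tflops):
    """Return: area spent on Tensor cores, the FP32 compute that area
    could hold at the die's average compute density, and how many
    times over that compute could pay for a compute-based upscaler."""
    tensor_mm2 = die_mm2 * tensor_frac
    equiv_tflops = die_tflops * tensor_frac
    return tensor_mm2, equiv_tflops, equiv_tflops / upscale_tflops

# Hypothetical: 445 mm2 die, 8% on Tensor cores, 7.5 FP32 TFLOPS
# total, 0.5 TFLOPS for a high-quality temporal upscaler (i.e. a
# fraction of PS4's 1.8 TFLOPS).
area, tf, ratio = tensor_area_tradeoff(445.0, 0.08, 7.5, 0.5)
```

    Under those made-up inputs the Tensor area would buy 0.6 TFLOPS of extra compute, slightly more than the assumed upscaler cost, which is exactly why the real area fraction is the number that matters.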
     
    Silent_Buddha, mrcorbo and turkey like this.
  4. troyan

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    119
    Likes Received:
    179
    GV100 is 33% bigger and has 40% more compute units (SMs) than GP100.

    nVidia's Bill Dally has talked a lot about the cost of FP units in recent years. They are really cheap on the latest process nodes; the biggest problem is feeding them efficiently. Tensor cores are just FP units included in the normal infrastructure of the compute units. Their size doesn't really matter at 16nm and beyond.
     
  5. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    39,907
    Likes Received:
    9,999
    Location:
    Under my bridge
    I find that difficult to believe. Being able to do that much maths is going to require an amount of silicon. Tensor units also aren't just FP units; they are fixed-function matrix maths units. This reddit discussion goes to great length, with two posters arguing whether Tensor cores take up significant silicon or not, with no clear consensus. Google's Tensor processor is >300 mm² at 28 nm.

    I have no idea whether they take up 3% or 30%, but it's some amount, and we can't judge their efficiency until we know that amount.
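    The fixed-function primitive in question is small and precise: a fused matrix multiply-accumulate with FP16 inputs and FP32 accumulation, operating on 4x4 tiles per core according to NVIDIA's public material. A behavioural sketch of that operation (not a hardware model):

```python
import numpy as np

# Behavioural sketch of the Tensor-core primitive: D = A @ B + C with
# FP16 inputs and FP32 accumulation, on a 4x4 tile per core.
def mma_4x4(A, B, C):
    A16 = A.astype(np.float16).astype(np.float32)  # inputs quantised to FP16
    B16 = B.astype(np.float16).astype(np.float32)
    return A16 @ B16 + C.astype(np.float32)        # accumulate in FP32

rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((4, 4)) for _ in range(3))
D = mma_4x4(A, B, C)
```

    The silicon question is then how much area a wide array of these fixed 4x4 FMA datapaths costs versus general-purpose ALUs of equivalent throughput.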
     
  6. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    7,200
    Likes Received:
    5,469
    lol, from that perspective yes, I suppose you are correct. As a counterpoint I would bring in the analogy of dedicated hardware blocks, such as HEVC decoding/encoding vs compute, or the SHAPE audio block vs TrueAudio on compute. In both instances, choices were made to leverage dedicated silicon for a specific function, to run it at higher speed or with fewer resources than the compute-based method. I have no doubt that we could do HEVC decoding and amazing audio over compute (we know it works), but the hardware was created nonetheless, likely because the payoff of performing these actions on fixed-function hardware, given the available silicon, made a lot of sense, even for features that are not necessarily used all the time.

    If I take that concept and draw it over to Tensor cores: let's just assume that Tensor cores could not support anything but DLSS. Then we're looking at Tensor cores as fixed-function hardware whose main objective is to upscale and antialias, at the cost of silicon that could otherwise go to more compute, for instance.

    So for the sake of discussion there are some assumptions I need to make. The first is that temporal AA and Temporal Injection are similar enough in nature that their performance cost and output should be comparable. The second is that we take JHH's demo at face value, and that he is not attempting to deceive the audience.

    That being said, here we see TAA vs DLSS. In summary, without watching the video, they present that DLSS has better upscaling and enough of a performance increase to be considered, from what I can see, at least a deviation or two better than what we're seeing with TAA.


    Now, I don't have enough data points to prove this, but there are other assumptions we need to make as well. One is that, because the Tensor cores are separate from compute, they are doing their thing without thrashing anything on the compute side; in theory they should be able to go full-tilt. On the TAA side, I have to assume there are, to some degree, drawbacks on the pipeline, because you're asking compute to do everything from rendering to upscaling. There may also be implementation restrictions I'm not fully aware of, which may be a reason why we aren't yet seeing widespread adoption across all titles.

    That being said, if DLSS is indeed better in quality and can process faster, hence the increase in frame rate, the only remaining question is how much silicon is used to support the Tensor cores, which I don't know either. The question then becomes whether that space, filled with even more compute, could perform equally well running DLSS, or say TAA. And we still need to consider things like architecture, being able to feed the compute, shared caches, etc. It's not as simple as just beefing up the compute numbers, whereas Tensor cores are a dedicated unit that doesn't necessarily need to interact with the compute environment, as far as I understand. If we go back to the original example of dedicated HEVC and SHAPE blocks: you can't just add more CUs; Xbox couldn't go from 12 to 13 just because the dedicated blocks were removed. Perhaps you'd move to 16 CUs (2 groups of 8) and make 2 redundant to have 14 CUs. But perhaps you don't have space for that either, yet enough space for, say, a small dedicated block.

    Quick throwback here:
    Xbox One has 14 CUs and PS4 has 20 CUs. Even if you removed all the ESRAM, we're looking at fitting only 3 more pairs of CUs in there to get to 20. So I don't believe we can generalize too easily that swapping the dedicated blocks for more compute is straightforward.
    To round out the discussion, the Tensor hardware also does more than just DLSS, and I think that's an important aspect as well. As we find more ways to leverage NNs for games, I think having an AI accelerator makes a lot of sense. And, without derailing the thread, there is sufficient evidence that the industry is moving in this direction.

    Machine Learning Graphics Intern:
    https://www.indeed.com/viewjob?jk=0...g+in+Games&tk=1d18ti15d5ico803&from=web&vjs=3

    Machine Learning Engineer at Embark, a new studio founded by many senior devs including Johan Andersson (repi), who posts here from time to time
    Machine Learning Engineer

    Machine Learning at Ubisoft MTL
    https://jobs.smartrecruiters.com/Ubisoft2/105304105-machine-learning-specialist-data-scientist

    Senior AI Software Engineer
    https://ea.gr8people.com/index.gp?opportunityID=152736&method=cappportal.showJob&sysLayoutID=122

    EA apparently has an entire AI/NN/ML division within SEED
    https://www.ibtimes.com/ea-e3-2017-ai-machine-learning-research-division-launches-2551067

    This is a small list, but I expect it to continue to grow over the years. With so much being poured into this area of research, I am not opposed to dedicating silicon to support this entirely separate function for faster performance. Yes, compute can do it, but compute cannot do it faster than Tensor cores can, at least when running TensorFlow-based models.
     
    #846 iroboto, Jan 15, 2019
    Last edited: Jan 15, 2019
  7. vipa899

    Regular Newcomer

    Joined:
    Mar 31, 2017
    Messages:
    922
    Likes Received:
    354
    Location:
    Sweden
    What can the Tensor cores be used for in games, besides DLSS reconstruction?
     
  8. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    39,907
    Likes Received:
    9,999
    Location:
    Under my bridge
    Whatever AI can be used for, which is a broad and emergent technology.
     
    vipa899 likes this.
  9. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    339
    Likes Received:
    409
    I assume TAA here means TAA at full-resolution rendering, while DLSS means rendering at low resolution but with upscaling and AA assisted by Tensor cores.
    For a fair comparison they would need to show TAA plus temporal reconstruction upscaling with low-res rendering. (Pretty sure quality and performance could compete easily.)
    Not sure, though.

    Interesting. I thought the Tensor cores are driven from compute: they do only the matrix multiplications, while compute has to manage program flow, logic and data transfer as usual. So I assume the CUs are not free for other tasks while performing DLSS or denoising?
    Would be interesting to know more about this! (I never looked it up, because the Tensor cores aren't exposed anyway yet.)

    I see another disadvantage of DLSS that maybe nobody has mentioned yet: the need to train per game on a supercomputer. So to use it, you need to be AAA and you need NV's help, I guess. Or you at least need a lot of resources.


    A killer feature would be procedural character animation, IMO. I work on this myself using only physics, no AI, but at some point things become complex and fuzzy.
    That's just an example, though. AI might be useful and spur innovations we cannot predict yet.
     
    OCASM likes this.
  10. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    39,907
    Likes Received:
    9,999
    Location:
    Under my bridge
    The analogy there would be dedicated upscale blocks rather than fully programmable AI cores. Tensor and compute are two different problem-solving paradigms. Both can achieve results, so it's a case of picking the best bang per buck. For some workloads compute is actually pretty ideal, especially image work using 'standard' (non-neural-net) algorithms, because it's been developed over 20 years to be ideal for that.

    Well, we know TAA and temporal injection aren't similar in output. TAA doesn't improve resolution. Insomniac's TI does antialiasing akin to TAA but at a relatively higher resolution, just like the DLSS demo.

    Reconstruction has to happen after the image has been constructed, so it's a clean slice. Render, then upscale, then render the next frame.

    Really, this discussion needs a comparable reference. Insomniac haven't gone into detail on how their method works versus other reconstruction techniques, and it's not in any PC titles AFAIK. What games are there on PC with reconstruction options?

    That's probably very true, and I'm not against them in principle. Upscaling games isn't anything new, though, and DLSS isn't really doing anything special yet. DavidGraham's enthusiasm is misplaced because he wasn't aware of what clever developers were already doing, and of the options fully programmable compute has opened up this gen.
     
  11. Allandor

    Regular Newcomer

    Joined:
    Oct 6, 2013
    Messages:
    363
    Likes Received:
    169
    Well, not really. Someone in another forum calculated the chip size, frequency and average benchmark performance of the 2080 Ti. Yes, the card is faster (nothing new here), but if you calculate performance per mm², it dropped by almost 30%.
    Also, DLSS has many flaws (as, by the way, does TAA). There are quite pretty upscaling techniques on consoles with only minor flaws; I really don't know why nvidia didn't invest in those. DLSS is really a waste of resources: it's not that easy to implement after all (only one title so far) and has really mixed results (from "it doesn't do anything at all" to flickering to compression-like artifacts), with a heavy performance cost (comparable to TAA at 1800p vs 1440p DLSS upscaled to 4K).
    nvidia is just trying to invent something new, something that isn't optimized for the use case, just to be first.
     
  12. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    7,200
    Likes Received:
    5,469
    In the demo, both are operating on a 1440p framebuffer, so everything is the same except the algorithm used and the hardware doing it.

    That part is hidden from me. Tensor cores are basically very high-throughput FP16 multiply-adders. I think there are some other little things in there, like accumulators, but generally nothing like rasterizers or texture mapping units touching them. So putting the Tensor cores inside a compute engine seems less direct than it could be. I could be wrong; I'm not a hardware guy.
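    The accumulators mentioned here are the important part. A quick sketch of why accumulating thousands of FP16 products in FP32 matters; the vector length and seed are arbitrary.

```python
import numpy as np

# Summing many FP16 products: an all-FP16 accumulator drifts, while an
# FP32 accumulator (what the Tensor-core datapath provides for the
# running sum) stays close to the exact dot product.
rng = np.random.default_rng(2)
a = rng.standard_normal(4096).astype(np.float16)
b = rng.standard_normal(4096).astype(np.float16)

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for x, y in zip(a, b):
    acc16 = np.float16(acc16 + x * y)                          # FP16 all the way
    acc32 = np.float32(acc32 + np.float32(x) * np.float32(y))  # FP32 accumulate

exact = float(a.astype(np.float64) @ b.astype(np.float64))
err16 = abs(float(acc16) - exact)
err32 = abs(float(acc32) - exact)
```

    This is why "FP16 multiply with FP32 accumulate" is the standard mixed-precision formulation: the products tolerate FP16, the long running sum does not.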

    As for the disadvantage of DLSS:
    Training is cheap, and it could easily become the net positive here when you consider the cost of labour (time, complexity, talent, etc.). Take your finished product, hand it over to another company to do the processing, and it's done. Extrapolate this concept over a variety of tasks, like content creation, and labour costs go down significantly.
     
  13. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    7,200
    Likes Received:
    5,469
    Agreed. Tensor is extremely specific here, which may or may not work against its presence. We could in theory do everything using standard compute, i.e. what the MI60, or just about any non-Volta card, has been doing since 2010.

    Yeah, I'm not sure what the costs here are, but if injection takes more processing power than just AA, then the resulting performance should be worse.

    At least in the version MS showed (I'll grab a picture in a bit), it would appear the two need to come together. It's not upscaling after the image is completed; it's part of the render chain. I suspect Nvidia may do it this way as well.

    Yup, I agree there is a time and place for everything. Unfortunately it's a little early for me to put out real numbers; there just aren't enough data points on DLSS. We'll know more once the DLSS patch for BFV is released.
     
  14. Ike Turner

    Veteran Regular

    Joined:
    Jul 30, 2005
    Messages:
    1,685
    Likes Received:
    1,337
    DLSS has been shown to be "not utterly crappy" only when trained on pre-recorded content (Final Fantasy benchmark, Infiltrator tech demo, Porsche tech demo and now the Futuremark benchmark)... which kind of defeats the whole point of it, unless "playing" benchmarks on your brand-new GPU is the new hip thing to do.
     
  15. Malo

    Malo Yak Mechanicum
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    6,709
    Likes Received:
    2,753
    Location:
    Pennsylvania
    I don't think we have enough games with implementations to draw any conclusions. We have... 1?
     
    DavidGraham and vipa899 like this.
  16. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    339
    Likes Received:
    409
    iroboto: "Yup I agree that there is a time and place for everything. Unfortunately a little early for me to put out some real points, there just aren't enough data points on DLSS. We'll know more once the DLSS patch for BFV is released."

    Makes sense.

    Looking up here https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
    "During program execution, multiple Tensor Cores are used concurrently by a full warp of execution. The threads within a warp provide a larger 16x16x16 matrix operation to be processed by the Tensor Cores. CUDA exposes these operations as warp-level matrix operations in the CUDA C++ WMMA API. These C++ interfaces provide specialized matrix load, matrix multiply and accumulate, and matrix store operations to efficiently utilize Tensor Cores in CUDA C++ programs."

    This is basically the source of my assumption that compute is necessary to control the Tensor cores. But I further assume tensor operations usually take a long time, and the compute warp becomes free for other work. (If not, that option might come with future hardware.)
    Unfortunately this doesn't answer my question of whether async compute can already fill the bubbles. Likely we will see with the upcoming MS API...

    I like this: "Many computational applications use GEMMs: signal processing, fluid dynamics, and many, many others."

    Fluids would be an interesting non-AI application for games. Propagating light through volume data might be another.
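    The fluid idea maps onto matrix-multiply hardware quite directly: one explicit diffusion step on a grid is just a GEMM. A deliberately tiny 1-D toy (real solvers are 2-D/3-D and batched, and the grid size and coefficient here are arbitrary):

```python
import numpy as np

# One explicit diffusion (heat) step on a 1-D grid as a matrix
# multiply: u_next = L @ u, where L holds the finite-difference stencil.
n, d = 16, 0.1                       # grid size, diffusion coefficient
L = np.zeros((n, n))
for i in range(n):
    L[i, i] = 1.0 - 2.0 * d          # centre of the stencil
    if i > 0:
        L[i, i - 1] = d              # left neighbour
    if i < n - 1:
        L[i, i + 1] = d              # right neighbour

u = np.zeros(n)
u[n // 2] = 1.0                      # initial heat spike in the middle
for _ in range(50):
    u = L @ u                        # one diffusion step == one GEMM
```

    Batching many such grids turns the whole simulation into exactly the dense matrix work GEMM units are built for, which is presumably the quoted post's point.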
     
  17. AlBran

    AlBran Ferro-Fibrous
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    20,223
    Likes Received:
    5,227
    Location:
    ಠ_ಠ
    I suppose it's a little curious that nV went to such an extreme on die size (and then R&D for no fewer than 3 separate ASICs), shoving in RT/Tensor hardware as opposed to more of the traditional GPU bits, unless there were mitigating factors to scaling up the GPU that way (power density, bandwidth).

    Clearly they want to keep the pro-grade features/performance for the $1500USD+ market, so it's interesting where the 20xx series fits strategically, in conjunction with the seemingly out-of-nowhere API support from MS, and what the business case would have to be to push game developers into the trenches here. Seems a bit much just to be the next step for GameWorks (if you know what I mean).

    /AlFoilToiletRoll
     
    #857 AlBran, Jan 15, 2019
    Last edited: Jan 15, 2019
  18. turkey

    Regular Newcomer

    Joined:
    Oct 21, 2014
    Messages:
    667
    Likes Received:
    382
    Do we also need to consider the impact on power/thermal efficiency?
    The FF fight sequence is non-deterministic and changes each time; the assets don't change, but it's not a set sequence of images.

    Another point is how difficult it is (or, I assume, is not) to add DLSS to an existing render pipeline. There could be a huge DLSS win there.
     
  19. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    7,200
    Likes Received:
    5,469
    For enterprise and cloud use, we are forced to use the Volta/V100 series. Nvidia strictly forbids using prosumer cards, to protect the resale of the same functionality. We recently racked up a bill of 62K CAD for 4x V100s. Mental, because functionality-wise it's the same. I really don't want to get into that argument here, but honestly... it's a shame; it's just way more expensive.

    So perhaps this is why? Or they are hoping more people will get into the industry, and then, when enough prototyping is done on a small scale, get hit in the face with the V100 bill.
     
    vipa899 and AlBran like this.
  20. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    7,200
    Likes Received:
    5,469
    Yeah, those are Volta ;) I don't know if the Turing ones are different. But thank you for the reading material; I will go through it and adjust my understanding of their tech.

    Indeed, you could be entirely correct here. I don't see them moving away from this just because of Turing. I could be wrong, though.
     
    #860 iroboto, Jan 15, 2019
    Last edited: Jan 15, 2019
    DavidGraham likes this.


  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.