Ray-Tracing, meaningful performance metrics and alternatives? *spawn*

Discussion in 'Rendering Technology and APIs' started by Scott_Arm, Aug 21, 2018.

  1. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    13,020
    Likes Received:
    3,281
    These numbers pretty much match up with Remedy's: 2-5x improvements, which is very significant.

    The most interesting perf number they give is transparency leading to very small speedups. It's pretty obvious, I guess, but they actually have an estimate.
     
  2. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    39,349
    Likes Received:
    9,323
    Location:
    Under my bridge
    I'll discuss Strange Brigade in that thread.
     
  3. JoeJ

    Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    133
    Likes Received:
    163
    35% is a whole lot, but irrelevant for me. I want to invest this time in something more useful than upscaling (note that with texture-space shading, upscaling is no longer necessary).
    I was following AI-based tech. I remember papers about SSAO, AA and other screen-space techniques done with AI. They took twice the time and delivered half the image quality.
    Spending 35% of the power just for upscaling is a joke. But how can this happen at all? I would assume the Tensor cores process the old frame asynchronously while the rest of the GPU works on the next. Another disappointment.
    I really can't get excited about any of this. This is like Apple selling one-button solutions to the masses as a status symbol, along the lines of 'it's hip, easy and it works - show you can afford it'.

    Sorry if I did not follow your thoughts exactly; the same goes for David's point about the 400%+ perf with Turing. Sure there is improvement, but in the end it does not contribute.
    I need to make one more correction - this time not to the math or the numbers, just to the unit names:

    2070: 19.8 fps = ~50 ms per frame; traced resolution 2.5k x 1.4k (1440p) -> (2.5 x 1.4) / 50 = ~0.07 Mpixels of RT work per millisecond
    1080 Ti: 10.1 fps = ~100 ms per frame; traced resolution 3.8k x 2.1k (4K) -> (3.8 x 2.1) / 100 = ~0.079 Mpixels of RT work per millisecond

    My math here is much easier to grasp than yours: the GTX 1080 Ti traces twice the pixels and gets half the framerate, thus both are equally fast.
    Period. And there is no need to look at what affects those totals in detail and how - only the totals matter.
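
    The same normalization in code form - a minimal sketch, assuming 1440p = 2560x1440 and 4K = 3840x2160 and using the PCGH frame rates quoted above (with exact pixel counts the values come out slightly higher than the rounded figures, but the two cards still land in the same ballpark):

    Code:
    # Traced megapixels per millisecond of frame time.
    def rt_work_per_ms(fps, width, height):
        frame_time_ms = 1000.0 / fps
        megapixels = (width * height) / 1e6
        return megapixels / frame_time_ms

    print(rt_work_per_ms(19.8, 2560, 1440))   # RTX 2070 at 1440p  -> ~0.073
    print(rt_work_per_ms(10.1, 3840, 2160))   # GTX 1080 Ti at 4K  -> ~0.084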

    No, according to the poster he uses all his many NV GPUs at the same resolution. His intention is not to downplay NV at all. He is a fan just enjoying how everything works.

    Sure, there is constant per-frame cost like transforms and BVH updates. I ignore that intentionally. Within my own GI work the cost of those things is totally negligible. With RTX it is surely higher, but never high enough to explain equal end results.
    No matter what other pipeline details you bring up for discussion - they can never contribute enough to justify this.

    I'm not aware of a single proper ray-tracing algorithm where ray-triangle checks are the bottleneck. Instead, the ray-box checks already cost more, and cache-missing traversal costs the most, if we assume a simple implementation.
    A complex implementation (MBVH, sorting many rays to larger tree branches, sorting ray hits by material, etc.) has additional costs, but those techniques are key to good performance.
    We do not know what NV does here at all, or how the work is distributed between the RT cores and compute under the hood. I guess the RT cores can do box and triangle checks in parallel, and compute does the batching. Pure speculation - so is yours.
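
    For reference, this is the kind of loop meant by a 'simple implementation' above - a CPU-side sketch of plain stack-based BVH traversal, where every visited node costs a ray-box (slab) test plus a memory fetch and only the leaves cost ray-triangle tests. The slab test and the Moeller-Trumbore routine are generic textbook versions, and the tiny two-leaf BVH is made up; none of this reflects RTX's actual node layout or traversal order:

    Code:
    def sub(a, b):   return [a[i] - b[i] for i in range(3)]
    def cross(a, b): return [a[1]*b[2] - a[2]*b[1],
                             a[2]*b[0] - a[0]*b[2],
                             a[0]*b[1] - a[1]*b[0]]
    def dot(a, b):   return sum(a[i] * b[i] for i in range(3))

    def ray_aabb(orig, inv_dir, lo, hi):
        # Slab test: does the ray (with precomputed 1/dir) hit the box?
        tmin, tmax = 0.0, float("inf")
        for a in range(3):
            t0 = (lo[a] - orig[a]) * inv_dir[a]
            t1 = (hi[a] - orig[a]) * inv_dir[a]
            tmin = max(tmin, min(t0, t1))
            tmax = min(tmax, max(t0, t1))
        return tmin <= tmax

    def ray_tri(orig, dir, v0, v1, v2, eps=1e-7):
        # Moeller-Trumbore: returns the hit distance t, or None on a miss.
        e1, e2 = sub(v1, v0), sub(v2, v0)
        pvec = cross(dir, e2)
        det = dot(e1, pvec)
        if abs(det) < eps:
            return None
        inv = 1.0 / det
        tvec = sub(orig, v0)
        u = dot(tvec, pvec) * inv
        if u < 0.0 or u > 1.0:
            return None
        qvec = cross(tvec, e1)
        v = dot(dir, qvec) * inv
        if v < 0.0 or u + v > 1.0:
            return None
        t = dot(e2, qvec) * inv
        return t if t > eps else None

    # Hand-built toy BVH: one root, two leaves, two triangles per leaf.
    tri = lambda dx, dz: [(-0.5 + dx, -0.5, dz), (0.5 + dx, -0.5, dz), (dx, 0.5, dz)]
    bvh = {"lo": (-1, -1, -1), "hi": (11, 1, 1), "children": [
        {"lo": (-1, -1, -1), "hi": ( 1, 1, 1), "tris": [tri( 0, 0.0), tri( 0, 0.5)]},
        {"lo": ( 9, -1, -1), "hi": (11, 1, 1), "tris": [tri(10, 0.0), tri(10, 0.5)]},
    ]}

    def trace(orig, dir, root):
        inv_dir = [1.0 / d for d in dir]      # direction components assumed non-zero
        box_tests = tri_tests = 0
        closest = None
        stack = [root]
        while stack:                          # the traversal loop that RTX black-boxes
            node = stack.pop()
            box_tests += 1                    # every visited node: box test + fetch
            if not ray_aabb(orig, inv_dir, node["lo"], node["hi"]):
                continue
            if "tris" in node:                # leaf: run the triangle tests
                for v0, v1, v2 in node["tris"]:
                    tri_tests += 1
                    t = ray_tri(orig, dir, v0, v1, v2)
                    if t is not None and (closest is None or t < closest):
                        closest = t
            else:                             # interior node: keep descending
                stack.extend(node["children"])
        return closest, box_tests, tri_tests

    # -> (5.0, 3, 2); in a real scene the interior-node visits (box tests plus the
    #    cache misses they cause) far outnumber the leaf triangle tests.
    print(trace((0.0, 0.0, -5.0), (0.02, 0.02, 1.0), bvh))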

    The guy mentions, however, that Titan performance drops at some spots, if I get him right - not sure. But such problems can be solved. RT cores are useful but not justified, IMHO. Looking at those results, I could even say they are just cooling pads.

    I also agree that geometric complexity will show the RT cores beating the older GPUs. But the proper solution here is a data structure that combines LOD, BVH and geometry. RTX does not allow for this. If compute is equally fast, I prefer to use just that and solve problems directly instead of hiding their symptoms with brute force.


    I hope AMD and Intel draw the proper conclusions here: DXR support yes, improved compute scheduling yes, RT cores no - better to spend that on more flexible CUs or on drawing less power.
    And I hope the next-gen consoles will have 'old school' GCN, and the devs find time to work on RT based on that. Personally I'll do my best here...
     
  4. JoeJ

    Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    133
    Likes Received:
    163
    I am happy with those numbers - I'm always happy with what given hardware can do.
    But a 2-5x speedup does not justify fixed-function HW; 10x would be the minimum. Those cores take chip area that could be used to compensate, and that area could also be utilized for everything else during the whole frame.
    They claim 10x, but in practice it does not work out. It is broken and wrong IMHO.
    Probably driver and game improvements will help over time, but we will lose the options for a proper comparison pretty quickly, I guess. The only hope is the consoles keeping up.
     
  5. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    13,020
    Likes Received:
    3,281
    Could the area they take compensate? I'm legit asking. I think I saw someone say 10% of the SM space is taken up by RT cores. Not sure if that number is correct. So you sacrifice 10% of general ALU for a 200-500% improvement in ray casting and intersection tests. Obviously the former is better if you're not tracing triangles, or if you're not ray tracing at all. The latter is better if you assume most games are still going to be rasterized and devs will be interested in implementing some ray tracing of triangles.

    I don't know what the correct call is, but if you want your game to run at 60+ fps, spending 3-5ms just on ray intersections is very costly. Once you're down to 1ms it's a lot more manageable. Just by using that 10% of space for more general ALU, I don't think you could really compensate for the loss of the RT cores if you want to trace triangles.
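
    Putting rough numbers on that trade-off - the ~10% area figure and the 2-5x range are the unverified numbers from this thread, and the 60 fps budget with a 1 ms ray cost is just the example above, so treat this as a sketch rather than a measurement:

    Code:
    frame_budget_ms = 16.7           # 60 fps target
    ray_cost_rt_cores_ms = 1.0       # time spent on traversal/intersections with RT cores
    rt_speedup = 4.0                 # within the claimed 2-5x range

    # Option A: keep the RT cores; the rest of the frame runs on the normal ALUs.
    shading_time_a = frame_budget_ms - ray_cost_rt_cores_ms

    # Option B: drop the RT cores, trace in compute, and spend the ~10% area on
    # extra general ALU. Express the remaining time as baseline-ALU-equivalent
    # work so the two options are comparable.
    ray_cost_compute_ms = ray_cost_rt_cores_ms * rt_speedup
    shading_work_b = (frame_budget_ms - ray_cost_compute_ms) * 1.10

    print(f"with RT cores:  {shading_time_a:.1f} ms worth of shading/other work")
    print(f"compute only:   {shading_work_b:.1f} ms worth (baseline-ALU equivalent)")
    # -> 15.7 vs ~14.0: the extra 10% of ALU wins back only ~1.3 ms of the ~3 ms
    #    lost to slower traversal, with these example numbers.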
     
    DavidGraham, vipa899, pharma and 2 others like this.
  6. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    338
    Likes Received:
    184
    DLSS also does temporally stable AA with sharp edges and w/o ghosting and other TAA artifacts, if it was just upscaling, it would not cost anywhere near 6 ms on 2080 Ti.

    No, it's hard to tell what you are calculating here at all

    Can you explain what you are doing here? Where did you get the FPS numbers and the other inputs in your steps?
    If these are SW Reflections demo numbers, then they are simply wrong. The 2070 (4K DLSS) is 2.7x faster than the GTX 1080 Ti (1440p) when you subtract the DLSS cost and add the TAA cost instead to get a 1440p result for the 2070. That's almost 3x more work done with ~50% fewer flops and 1/2 the ROPs and geometry pipeline engines.

    I don't separate these; by ray-triangle intersection tests I meant the necessary ray-box checks as well. All the 'giga rays' RT-core numbers also count only effective rays AFAIK (those which hit a final triangle in the BVH).

    These are heavy with RT

    To be honest, this is ridiculous

    AMD/ATI has always sacrificed scheduling in order to gain more flops (with the exception of async compute scheduling, which is pretty easy to amortize anyway), so it would be surprising to see them turn in exactly the opposite direction. RT obviously requires this change, though (I know there are certain patents on hybrid narrow-SIMD / wide-SIMD CUs from AMD, but these are just patents); otherwise a 64-wide wavefront architecture would suffer badly from a workload as divergent as RT. Once they abandon 64-wide wavefronts, they will need many more wavefront schedulers per CU (Volta and Turing already have 4x more schedulers per SM), and these will have to be more complex than what is currently in GCN, since back-to-back instruction execution will no longer be possible (unless they go for very narrow SIMDs). Also, they will have to overhaul the caches and grow them significantly in size (done in Volta/Turing). So yes, if they go for a complete software solution, that will be entertaining.

    There is no way they can go for RT with 'old school' GCN
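
    To illustrate just the divergence point: if every lane needs a different number of traversal steps, the whole wavefront waits for its slowest lane, and the wider the wavefront, the worse the average utilization. The step distribution below is invented, so this is a toy model, not GCN or Turing - the printed utilization simply drops as the width grows:

    Code:
    import random

    def mean_utilization(wave_width, trials=20000):
        """Average fraction of lane-cycles doing useful work per wavefront."""
        random.seed(1)
        total = 0.0
        for _ in range(trials):
            # Steps per lane: 1 + exponential with mean 5 (purely illustrative).
            steps = [1 + int(random.expovariate(0.2)) for _ in range(wave_width)]
            # The wavefront occupies max(steps) cycles; finished lanes sit idle.
            total += sum(steps) / (wave_width * max(steps))
        return total / trials

    for width in (8, 16, 32, 64):
        print(f"SIMD width {width:2d}: ~{mean_utilization(width):.0%} of lane-cycles useful")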
     
    #646 OlegSH, Jan 1, 2019
    Last edited: Jan 1, 2019
    OCASM, iroboto, DavidGraham and 2 others like this.
  7. JoeJ

    Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    133
    Likes Received:
    163
    Yes, it then depends on how much ray tracing you do versus anything else (including shading, physics, mining...).
    But you can see for yourself that there is a potential conflict of interest, and the hardware vendor almost dictates what you have to do to utilize the HW the most.

    And there is another, much worse and more concrete problem here, because the RT cores do both intersection tests and BVH traversal. The former is a simple atomic operation that deserves fixed function with little doubt (although the box test is really simple - faster than loading the data from LDS, I'm sure).
    But the latter, tree traversal, is a fundamental tool of almost any algorithm that is not brute force, and traditionally GPUs are bad at traversing trees. Making this fixed function makes the data structure static, so it cannot be adapted for general purposes. (It is also black-boxed currently anyway.)
    Instead it would make so much more sense to make GPUs powerful at programmable and flexible tree traversal. This could be achieved with device-side enqueue, persistent threads and other techniques available in other APIs but not in gfx APIs. Plus Turing (and the Titan V already) has very fine-grained new stuff here ('work generation shaders', whatever that is).

    If you add all this to compute, you can build the RT you need. In practice this means completely different algorithms for primary / secondary / coherent and incoherent rays.
    It would be much more work, not something you can add in a month, but the result would be faster. NV may implement this under the hood or not, and decide based on driver assumptions or given API hints. They may extend this in hidden form as soon as other vendors catch up with the next gen. But the developer has neither control nor knowledge, nor options to improve and adapt. This is frustrating and hinders progress. (Remember: the developer might want to do something completely different, like physics simulation, which also requires trees and rays.)
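
    As a sketch of what that programmability could look like: the traversal loop below lets the caller decide per node whether to descend, which is exactly what a black-boxed traversal cannot do. The tree layout, the footprint rule and all the names are invented for illustration; this is not how DXR or RTX exposes anything:

    Code:
    def traverse(root, descend_if):
        # Generic traversal; 'descend_if(node)' is user code: an LOD criterion,
        # a physics query, an occlusion heuristic - whatever the algorithm needs.
        hits, stack = [], [root]
        while stack:
            node = stack.pop()
            if "children" in node and descend_if(node):
                stack.extend(node["children"])
            else:
                hits.append(node["name"])     # take the node itself as a coarse proxy
        return hits

    # Toy tree; 'extent' stands in for the world-space size of a node's bounds.
    tree = {"name": "root", "extent": 8.0, "children": [
        {"name": "near", "extent": 2.0, "children": [
            {"name": "near/a", "extent": 0.5}, {"name": "near/b", "extent": 0.5}]},
        {"name": "far", "extent": 2.0, "children": [
            {"name": "far/a", "extent": 0.5}, {"name": "far/b", "extent": 0.5}]},
    ]}

    # Invented LOD rule: stop descending once a node is smaller than the ray's
    # footprint, so distant geometry is answered from a coarser level.
    ray_footprint = 3.0
    print(traverse(tree, lambda n: n["extent"] > ray_footprint))   # -> ['far', 'near']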

    I've said this multiple times. But with these results it is no longer an assumption but a sad truth.
     
  8. JoeJ

    Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    133
    Likes Received:
    163
    I got the numbers from the screenshot of the PCGH site I posted earlier. DavidGraham posted more with similar numbers. I assume mine are worst case, but because they are roughly doubles of each other, my math is as simple as possible with them.

    My numbers come from this test setup: all GPUs output at 4K, but RTX renders only at 1440p - not the other way around! (In the screenshot they say 'native resolution'.)
    But in my initial post I accidentally reversed the resolution numbers - my mistake! Sorry again for all the confusion!

    So once more my final result with correct numbers and math:
    2070: 19.8 fps = ~50 ms per frame; traced resolution 2.5k x 1.4k (1440p) -> (2.5 x 1.4) / 50 = ~0.07 Mpixels of RT work per millisecond
    1080 Ti: 10.1 fps = ~100 ms per frame; traced resolution 3.8k x 2.1k (4K) -> (3.8 x 2.1) / 100 = ~0.079 Mpixels of RT work per millisecond

    I hope it makes sense now.



    Exactly what I think :) :D
     
  9. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    13,020
    Likes Received:
    3,281
    The RTX 2070 is only 7.5 Tflops. The GTX 1080ti is something like 11 Tflops. I don't really understand how this comparison works. The SW Reflections demo is also going to be heavily testing shading performance.

    We know from actual developers at EA SEED and Remedy that ray costs were 2-5x lower (independently of shading) before launch, comparing RT-core performance on a 2080 Ti vs. a card without RT cores (Titan V).
     
    DavidGraham likes this.
  10. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    338
    Likes Received:
    184
    Can you please provide a link for this screenshot? I can't find it anywhere
     
  11. JoeJ

    Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    133
    Likes Received:
    163
    No, it seems they removed the test. I saved the shot to discuss with another dev - it's all I have. I only noticed what it really means while posting here and assumed a benchmarking error.
    The site is the largest gaming tech site in Germany: http://www.pcgameshardware.de/

    But @DavidGraham found this and added links in his post above :) : https://pclab.pl/art78828-20.html
     
  12. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    338
    Likes Received:
    184
    I know, I did my calculations based on these
    2070 shows 21.2 FPS here at 4K with DLSS - https://forum.beyond3d.com/posts/2053525/
    1080 Ti shows 9.7 FPS at 1440p - https://forum.beyond3d.com/posts/2053525/
    As we know, 4K DLSS is 1440p reconstructed to 4K, 4K DLSS cost is 6ms on 2080 Ti - https://forum.beyond3d.com/posts/2053548/
    For 2070, DLSS cost should be around 10 ms (based on FLOPS numbers)
    In order to get a 1440p result we need to subtract the DLSS time and add TAA: (1000 / 21.2) total avg frame time - 10 ms DLSS + 1.6 ms (TAA time in FFXV) = 38.77 ms, or 25.79 FPS at 1440p
    25.79 FPS (RTX 2070) / 9.7 FPS (GTX 1080 Ti) = 2.66
    That's a huge difference considering that the 2070 has almost 50% fewer flops and 1/2 the ROPs / geometry pipeline throughput
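
    The same arithmetic in code form, using only the numbers stated above (the ~10 ms DLSS estimate for the 2070 and the 1.6 ms FFXV TAA figure are this post's assumptions, not measurements of the demo):

    Code:
    fps_2070_4k_dlss = 21.2
    fps_1080ti_1440p = 9.7
    dlss_cost_2070_ms = 10.0   # 6 ms on a 2080 Ti, scaled up by the flops ratio
    taa_cost_ms = 1.6          # TAA time quoted from FFXV

    frame_2070_ms = 1000.0 / fps_2070_4k_dlss                       # ~47.2 ms
    frame_2070_1440p_ms = frame_2070_ms - dlss_cost_2070_ms + taa_cost_ms
    fps_2070_1440p = 1000.0 / frame_2070_1440p_ms

    print(f"estimated 2070 @ 1440p: {frame_2070_1440p_ms:.2f} ms = {fps_2070_1440p:.2f} fps")
    print(f"speedup over 1080 Ti:   {fps_2070_1440p / fps_1080ti_1440p:.2f}x")
    # -> 38.77 ms, 25.79 fps, 2.66x - the figures above.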
     
    DavidGraham and JoeJ like this.
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,982
    Likes Received:
    2,428
    Location:
    Well within 3d
    Tensor cores and RT cores both presumably free up register file accesses, through reuse and dedicated data paths, and possibly some amount of independent sequencing that doesn't go back to the general-purpose register file for each step.
    A general-purpose shader cannot skip that power cost or the bandwidth lost to other workloads, although some of the current excess barrier insertions appear to be losing some of that benefit.
    Even so, if BVH traversal and intersection testing can run through multiple memory accesses, address calculations, and internal functions without contending for the instruction caches, that bypasses a lot of general-purpose paths that are often far too wide for the workload.

    For GCN, parts of this seem better suited to a separate domain like the scalar unit and its register file, but the existing scalar unit is already heavily used and may be too constrained in the operations it supports.

    I'm curious where that statement comes from, or what data is being used to infer it. So far, I have only seen marketing diagrams, which I am reluctant to trust for good proportions.
    There may also be area increase in the SM for elements like the improved load/store pipeline and caches, which the RT hardware likely needs but can be beneficial for other workloads as well.

    Unless you run into a different limit, like register file power consumption or area lost to the infrastructure to support those ALUs.
    The register files are a common refrain for both Nvidia and AMD. We can point to things like Nvidia's operand reuse cache, AMD's reuse slots in its VLIW days, and Vega's marketing talking about input from the Zen team to optimize the register file. Various AMD patents, like the "Super-SIMD" patent there was some buzz about a little while ago, cited the register file as a limiting factor for GPUs. That one concerned itself with finding ways to utilize register-file access cycles lost because the architecture is sized for 3-operand instructions despite half the instructions not needing 3 operands, and with an output cache to reuse outputs and avoid some accesses to the maxed-out register file. While the patents may not provide a complete picture, they seem to indicate there's a rough ceiling in terms of how many vector register files can be deployed, and general-purpose hardware is tied to them.
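
    To make the 3-operand point concrete - the 50/50 instruction mix is the figure from the patent discussion above, and the rest is a made-up illustration, not data about any real chip:

    Code:
    read_ports_per_instr = 3
    mix = {3: 0.5, 2: 0.5}     # fraction of instructions by source-operand count

    used = sum(frac * ops for ops, frac in mix.items())
    print(f"average operand reads used: {used:.2f} of {read_ports_per_instr}")
    print(f"read-port cycles idle:      {1 - used / read_ports_per_instr:.0%}")
    # -> 2.50 of 3, ~17% idle; operand-reuse and output caches try to spend that
    #    slack instead of adding ports to an already maxed-out register file.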
     
    OCASM, OlegSH, iroboto and 2 others like this.
  14. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    13,020
    Likes Received:
    3,281
    I think it was based on a picture in the Turing whitepaper, but I may be remembering totally wrong.
     
  15. JoeJ

    Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    133
    Likes Received:
    163
    I have checked all this:

    DLSS cost:

    4K + TAA = 46.5 fps = 21.5 ms
    4K without TAA = 50.5 fps = 19.8 ms

    TAA cost at 4K = 21.5 - 19.8 = 1.7 ms; / 2.25 = ~0.75 ms at 1440p

    1440p + TAA = 77 fps = 13 ms
    1440p without TAA = ~12.25 ms

    upscaled to 4K (DLSS) = 57 fps = 17.5 ms

    DLSS = 4.5 ms
    -----------------------------------------

    Comparison:

    PCGH 1080Ti @ 4K: 10.1 fps
    pclab 1080Ti @ 4K: 4.3 fps - wtf ???


    continuing with the pclab numbers:

    1080 Ti @ 2560x1440: 9.7 fps = 103 ms
    2070 upscaled to 3840x2160: 21.2 fps = 47 ms - 4.5 ms DLSS = 42.5 ms

    103 / 42.5 = 2.42, so the 2070 is 2.42 times faster than the 1080 Ti
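
    The same estimate in code form, using the frame rates listed above; the '4.5 ms' is read as the extra cost of DLSS over plain 1440p + TAA, which is the only way these numbers line up, but treat that interpretation as an assumption:

    Code:
    ms = lambda fps: 1000.0 / fps

    taa_4k     = ms(46.5) - ms(50.5)       # ~1.7 ms TAA cost at 4K
    taa_1440   = taa_4k / 2.25             # ~0.75 ms, scaled by the pixel ratio
    t_1440_taa = ms(77.0)                  # ~13 ms with TAA
    t_1440     = t_1440_taa - taa_1440     # ~12.25 ms without TAA
    t_4k_dlss  = ms(57.0)                  # ~17.5 ms upscaled to 4K
    dlss_extra = t_4k_dlss - t_1440_taa    # ~4.5 ms on top of 1440p + TAA

    t_1080ti   = ms(9.7)                   # ~103 ms at 1440p (pclab)
    t_2070     = ms(21.2) - dlss_extra     # ~47 - 4.5 = ~42.5 ms
    print(f"DLSS extra cost: {dlss_extra:.2f} ms")
    print(f"speedup:         {t_1080ti / t_2070:.2f}x")
    # -> ~4.56 ms (the post rounds to 4.5) and ~2.42x.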


    Looking closer at the pclab numbers, they vary by a factor of more than 2! But I agree pclab must be correct. (I got a smaller cost for DLSS, which makes more sense too, but let's ignore this.)
    Those numbers look pretty accurate, because looking at DXR numbers here:

    we can see the Titan V is more than 2 times faster than the 2080 Ti with RT, because it already has the 'advanced scheduling' stuff (not sure how we should name this properly?).
    It all makes sense now, thanks for clearing this up!

    So I take back that RT should be enabled for BFV, although at 30 fps or 720p they would still do.
    Apologies to NV - a bit of PCGH's fault.

    Whether a speedup of 2.42 justifies fixed-function restrictions is a matter of opinion. At the moment yes, but at the cost of a restricted future.
    All this seems to confirm the match of Titan V vs. 2080 Ti performance in BFV as well. So IMO the RT cores are still not worth it - totally not.

    But personal opinion aside, I guess we agree on the numbers now?
     
    #655 JoeJ, Jan 1, 2019
    Last edited by a moderator: Jan 1, 2019
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,982
    Likes Received:
    2,428
    Location:
    Well within 3d
    I saw a grainy die shot of TU102 in the whitepaper, and various block diagrams of the chip or an SM. It doesn't seem precise enough to be confident that those can be used to get that figure. Nvidia's diagrams don't necessarily respect the layout or the margins of chip regions.

    Although if Fritzchens Fritz's other die shots on his Flickr account are any indication, we may have a better view of TU102 very soon. Inferring which blocks belong to what parts of the silicon may still be a challenge, and even if the die shot is better, the pictures may be zoomed out enough to show the whole chip rather than digging down to the SM level.

    edit:missing word
     
  17. pixeljetstream

    Newcomer

    Joined:
    Dec 7, 2013
    Messages:
    30
    Likes Received:
    60
    Hopefully this doesn't sound arrogant, but @JoeJ the abstraction chosen for DXR intentionally hides details about what the RT cores can do. Given Nvidia has been actively working on GPU raytracing for a decade and has amassed tons of experts on raytracing in general (SIGGRAPH papers etc.), you can assume that a lot of experience from many people has gone into the design. While that is no guarantee of success, and every iteration in design is a compromise, some of your statements make it sound like the people who worked on this have not thought it through.

    Furthermore, things like DXR are evolving APIs, which will change over time based on feedback. Likewise, under NDA developers know more details, and HW vendors typically visit developers and research studios and exchange ideas/plans/possible solutions for future architectures in advance. So these new features don't happen in a vacuum of a few people.

    The research on doing RT without dedicated HW won't stop and will continue to be important in influencing SW/HW design. If we look at fabrication processes, the free lunch is over; IMO we will see greater efficiency gains from smart fixed-function extensions rather than brute-force general-purpose units.
     
    #657 pixeljetstream, Jan 2, 2019
    Last edited: Jan 2, 2019
    OCASM, milk, DavidGraham and 3 others like this.
  18. pixeljetstream

    Newcomer

    Joined:
    Dec 7, 2013
    Messages:
    30
    Likes Received:
    60
    Personally I am more on the rasterization side of things, but I think having an alternative to 3D textures for spatial lookups is great, and I see DXR as just the beginning of something more generic after a bit of evolution.
     
  19. JoeJ

    Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    133
    Likes Received:
    163
    I do not doubt their expertise. They know much more about RT than I do, and about hardware anyway.

    Although I see serious issues here with bringing the performance to the street, I guess we will see RT cores take off with the following GPUs.
    But as it is now, the hardware seems far from optimal, with a questionable need for RT cores at all - and Titan V == 2080 Ti in BFV really proves this, does it not? But due to the lack of competition their success IS guaranteed anyway.

    And to make adoption easy and hinder competition, they protect their 10 years of experience with an API that is more like OpenGL 1.0 than DX12, and they stamp questionable FF HW into silicon which can only be underutilized as long as we stay hybrid.
    We will stay hybrid much longer than necessary, I'm afraid.

    But the research is now entirely in their hands, locking out all the other experts with 10 years of experience. They become mere consumers, with further invention prevented.
    A more open and flexible approach would have been better, and also possible, as we see now. I would LOVE a Titan V kind of GPU without RT cores and with fine-grained scheduling exposed to GPGPU as well - it would have been f###ing perfect!!!
    It's 2019 - why do you think we need to go through DX8, 9, 10... again, just because it's about rays?

    I do not get why everybody just follows along without any doubt or critique. Just adopting and moving on seems to be the only option, but I'm just not happy with that.
     
  20. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    39,349
    Likes Received:
    9,323
    Location:
    Under my bridge
    Titan V is bigger and more expensive.

    Because they're more willing to accept the compromises needed to get realtime raytracing into affordable hardware. ;) I haven't followed the numbers in this thread closely enough to know how RTX compares to non-RTX, other than that it accelerates aspects by a significant margin, but the lack of clear consensus shows it's not at all obvious, as you imply it should be, that everyone ought to see the disadvantages the RT cores bring. The data would have to say, obviously, "RT cores don't accelerate anything, so what's the point?" for your argument to make sense. Your last numbers show a 2.4x speed increase with RT cores. That's significant.

    At this point though, it's all wild speculation. Until someone (AMD) brings non-RT-core raytracing to the GPU to compare, it's only theoretical improvements one could make in software versus theoretical performance from a fixed-function unit and a couple of GPU comparisons. It's far from conclusive data.
     
    OCASM, pharma and DavidGraham like this.