NVidia Ada Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Jul 10, 2021.

  1. techuse

    techuse Veteran

    I suppose I’m skeptical that a multi chiplet GPU will work consistently across the breadth of games without developers having to account for it. It’s quite a complex problem to solve.
     
    DegustatoR likes this.
  2. CarstenS

    CarstenS Legend Subscriber

    Then your first reply ("it's just more SEs") is basically worthless. So thanks for that.
    edit: Oh wait, did you just say MI200? Wrong thread maybe, again?

    Then they would also have to rewrite Shader Exports (vGPR -> Prim/RB/GDS) and some atomics to be done elsewhere.
     
    Last edited: Aug 2, 2021
  3. trinibwoy

    trinibwoy Meh Legend

    From a software perspective it’s still one big GPU. The game engine or graphics api won’t know that there are multiple dies just like games don’t know or care that there are multiple GPCs or SMs today. Yes it’s a complex problem but it’s a hardware problem. So really the main question is performance. And of course power consumption.
     
    Lightman, PSman1700, no-X and 3 others like this.
  4. Rootax

    Rootax Veteran

    And that's why RDNA3 being that "on paper" is a big deal :D

    On paper PowerVR B series can do that too if I believe the slides.
     
  5. OlegSH

    OlegSH Regular

    It's not just a HW problem. With the current API model, all draw calls must be executed in order. Current single-die GPUs already have a hard time keeping execution units busy during geometry processing, which is why all that work-distribution hardware is needed (beyond work-expansion handling). Moving to multi-die, this can only get worse, and probably by a lot.
     
    DavidGraham, BRiT and DegustatoR like this.
  6. Bondrewd

    Bondrewd Veteran

    Quips again, huh?
    Neat, also cringe.
    No, it's literally that.
    Work distribution is neither a novel problem, nor are modern GPUs small from a big-macroblock POV.
    NV does 7 GPCs for GA102 and that count is only ever going to bloat.
     
  7. JoeJ

    JoeJ Veteran

    Let's say global scheduling fails to distribute a draw call's work over 2 chiplets, and chiplet 1 finishes earlier. (This could be minimized by using more fine-grained queues on both chiplets, and by pulling smaller pieces of work from the top-level queue more frequently.)
    This would be no problem if we have async work around; the more such async work is available, the better it fills the gaps.
    Also, even if it's impossible to guarantee all draws / dispatches finish at exactly the same time, we are still almost twice as fast due to having twice the power.
    But we may want to give up on things like guaranteed draw order for each individual triangle, which we rarely need and which we already have extensions to turn off.
    Am I missing something? (I'm just bad with the graphics pipeline.)

    I imagine it like a job system, where work comes in small chunks and multiple threads pull the next chunk once done, with negligible overhead. We can already do multithreaded rendering this way easily on the CPU.
    We also divide the framebuffer into tiles to avoid hazards, which is no problem either, and dGPUs already have that 'binned rasterizer' stuff which relates to tiles? (Triangle order can be preserved this way too, like mobiles do with the same APIs.)
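    The job-system analogy above can be sketched in plain Python. This is purely illustrative (nothing GPU-specific; `render_tiles`, the worker count, and the tile count are made up for the example): two "chiplet" workers pull framebuffer tiles one at a time from a shared queue, so a worker that finishes a cheap tile early just grabs the next one instead of idling.

    ```python
    import queue
    import threading

    def render_tiles(num_workers=2, tiles=16):
        """Each 'chiplet' worker pulls one tile at a time from a shared
        queue, so the load balances itself even if tile costs vary."""
        work = queue.Queue()
        for t in range(tiles):
            work.put(t)  # one framebuffer tile per job

        done = [[] for _ in range(num_workers)]  # tiles finished per worker

        def worker(wid):
            while True:
                try:
                    tile = work.get_nowait()  # fine-grained pull, not a static split
                except queue.Empty:
                    return  # queue drained: this worker is finished
                # ... rasterize/shade the tile here ...
                done[wid].append(tile)

        threads = [threading.Thread(target=worker, args=(i,))
                   for i in range(num_workers)]
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        return done

    parts = render_tiles()
    # Every tile is rendered exactly once, regardless of how the
    # workers split the queue between themselves.
    assert sorted(t for p in parts for t in p) == list(range(16))
    ```

    The key point is the `get_nowait()` pull model: with a static 50/50 split, one worker can sit idle while the other is still busy; with per-tile pulling, imbalance is bounded by the cost of a single tile.
    
    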
     
  8. CarstenS

    CarstenS Legend Subscriber

    Literally, it's GPCs, not SEs. And work distribution: yeah, tell me more, Cpt. Obvious.
    A100 has already been at 8 GPCs, 4 of which talk to a 20-MByte L2 partition each.

    If we're talking MCM, it's not an intra-chip problem anymore, so the question is a real one: will there be an additional synchronization stage inside the GPC/SM chiplet? How much data will have to be synchronized between chiplets? For graphics, there's a reason why ROPs have for quite some time been part of the GPCs, and why RBs have been moved into the SEs as well. And maybe this can be discussed with more than half-sentence one-liners.
     
  9. Bondrewd

    Bondrewd Veteran

    Those things carry the same purpose.
    Actually it's a bit more funky than that.
    Consult the Citadel paper if it's out.
    It's more or less a given, considering the packaging involved.
    It's a very straightforward solution, if a tad on the complex/expensive side.
    Wut.
    They've moved in there with Ampere.
    AMD moved them one step up the chain with Navi, even.
    That would be less cute and funny and you wouldn't be able to apply for a capeshit writer position thus no.
     
  10. Leoneazzurro5

    Leoneazzurro5 Regular

    Maybe someone is missing the information that inter-die communication in a 3D-stacked configuration can easily reach several TB/s. With V-Cache on Zen 3, they added 64 MBytes of cache at basically zero penalty. On the GPU side I expect somewhat more latency, but bandwidth (for graphics or other data) should be a non-issue.
     
    no-X likes this.
  11. troyan

    troyan Regular

    And a 3090 has nearly 1 TB/s of off-chip bandwidth. Bandwidth is not the problem. Having multiple chips means synchronisation work between them, and every memory movement will cost efficiency.
     
    PSman1700 likes this.
  12. trinibwoy

    trinibwoy Meh Legend

    Sure but that’s a separate problem which as you said also applies to existing single chip solutions. Yes it will get worse on chiplets but it will get worse on wider monolithic chips too.
     
    no-X and Jawed like this.
  13. Leoneazzurro5

    Leoneazzurro5 Regular

    Lol. If the issue is synchronization between SEs/GCDs, it would be quite trivial to put a dedicated synchronization channel in the 3D stack, as a connection through the stack is almost the same as an on-chip one (and THAT is one of the reasons 3D stacking by means of TSVs is so important).
     
  14. DegustatoR

    DegustatoR Veteran

    I think it's fair to expect MCM GPUs not to scale as well as a monolithic GPU would, and there will certainly be cases, particularly some older games, where they may not scale at all.
    Thus saying that it's a purely h/w problem is a bit disingenuous: while it may be somewhat helped with h/w, the more obvious option is to target such an architecture with s/w.
    So while the general move of all HP designs into the multi-chip era is forced by the end of physical scaling, it's hardly clear-cut that the first MCM GPU attempts will actually fare better than the last big single-die ones.
     
  15. troyan

    troyan Regular

    How is the connection through the stack identical to an L2 cache? If that were the case, an L2 cache would be unnecessary.
     
  16. CarstenS

    CarstenS Legend Subscriber

    Hence, literally.
    Correct.
    Whatever that means. I take it, you're trying to be a nice guy and get me another job, no?
     
    Last edited: Aug 2, 2021
    Lightman likes this.
  17. trinibwoy

    trinibwoy Meh Legend

    There's no need to explicitly expose the multi-chip nature of future GPUs to the application. The workload is so massively parallelizable that it would be a failure of hardware and software design if developers had to concern themselves with how many chiplets are on a card. Improvements to APIs to remove unnecessary serialization will help, but as mentioned above this is not a problem unique to chiplet solutions.

    Why would older games not scale? They're still spitting out lots of pixels that are written to a shared framebuffer. None of that changes with chiplets.
     
    Silent_Buddha and CarstenS like this.
  18. Leoneazzurro5

    Leoneazzurro5 Regular

    Where did I mention the L2 cache? I said that in a 3D stack you can easily add not only the connection from the compute die to the cache die but also synchronization signals that propagate through the entire stack, because a 3D stack is basically equivalent to one very big IC with some latency penalties at the interconnection points. I wonder why I even try to explain trivial concepts to someone who thinks a +30% gain in perf/W on the same process node is bad. It's clear that either 1) you don't understand what a 3D stack is, and thus don't understand that every single signal can be passed through the TSVs, not only the memory-interface ones, or 2) you are trying to spread FUD. As your post history makes the second one quite the most probable, let me put you on ignore so my eyes aren't hurt by your FUD spreading.
     
  19. OlegSH

    OlegSH Regular

    That's not a solution. AMD moved in exactly the opposite direction with RDNA for a reason: with GCN it had failed at finding async workloads on PC. Straightforward ports from consoles simply didn't have enough work to fill PC GPUs.
    GCN was bad at scaling beyond 4096 ALUs, and there is nothing magic in RDNA.
    Feeding two ~8000-ALU chips would require having at least 32,000 work items in flight (assuming 2 wavefronts is enough to cover at least instruction latencies and there is perfect prefetching to cover memory stalls).
    Now, there will be plenty of draws in modern games with vertex counts way below 32,000 (that was the whole idea behind DX12, after all), which would require tons of bookkeeping and work balancing to keep executing them in parallel with other draw calls (and there will unavoidably be serialization points, since all draws must be retired in order).
    Feeding such a machine would be hard, distributing work across chips won't be easy, and finding async work isn't easy either; these are all unsolved problems.
    If you have ever dealt with performance modelling, you intuitively know that such a config would have poor scaling in graphics workloads (way below linear), but let's hope I am wrong and all the problems get magically mitigated (even though that's not the case today, and everything points to things only getting worse with inter-chip work distribution).

    This would require either new APIs, completely compute-based pipelines, or a mixed software-hardware TBDR-like approach (which has tons of other drawbacks).
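    The occupancy arithmetic above can be written out explicitly. A minimal sketch using the numbers from the post (the variable names are mine; "2 wavefronts per ALU" is the post's stated assumption, not a measured figure):

    ```python
    # Back-of-the-envelope occupancy estimate for a hypothetical
    # dual-chip GPU, using the figures quoted in the post above.
    chips = 2
    alus_per_chip = 8000
    waves_per_alu = 2  # assumed minimum to hide instruction latency

    # Work items that must be in flight just to keep every ALU fed.
    items_needed = chips * alus_per_chip * waves_per_alu
    print(items_needed)  # 32000
    ```

    So any single draw call with fewer than ~32,000 vertices cannot fill the machine on its own, which is why many small draws would have to execute concurrently, with all the bookkeeping and in-order-retirement constraints that implies.
    
    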
     
    PSman1700 likes this.
  20. Jawed

    Jawed Legend

    NVIDIA Lovelace vs AMD RDNA 3, what has not been told about their GPUs (techunwrapped.com)

    But the article speculates that Lovelace has been confused with Hopper, and then seemingly goes wild: "Hence, we think that the configuration of 144 Shader Units could correspond to Hopper and not Lovelace, since it is said that Lovelace will be a multi-chip GPU."

    Separately, there's noise in the Twitterverse that Lovelace will have much higher clocks, similar to RDNA 2/3. That's one way of using fewer SMs, which is the central bone of contention the article focuses on.
     