NVidia Ada Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Jul 10, 2021.

  1. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Contingencies are fine but you don’t design your fundamental system architecture around foundry capacity. A sensible contingency is to tape out both a monolithic chip and chiplets. That way if your chiplets go bust you have a backup. However you wouldn’t choose to skip chiplets completely because of wafer supply. On the contrary that would be even more reason to embrace chiplets.

    So you don’t believe the rumors that Lovelace is on TSMC 5nm?

    If chiplets make too much business sense and Nvidia’s chiplet arch is ready then they would make chiplets. However you’re also saying that they would choose to not make chiplets which presumably would not make business sense. So which is it?

    If all the stars are aligned to bring chiplets to market it would be beyond silly to just sit on the tech and not bring it to market due to something as fleeting as wafer capacity. There are so many disadvantages to this including lost opportunities to refine the architecture based on real world experience and massive strategic risk to competitive advantage. So at the risk of repeating myself if we don’t see Nvidia gaming chiplets it’s because they decided it actually doesn’t make the best business sense or their tech simply isn’t ready.
     
  2. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    603
    Likes Received:
    1,123
    Is AMD really going through the L3 cache instead of L2? Doesn't sound very efficient.
    With chiplets the workload has to be scheduled evenly every time. Rewriting into and moving data around in the L3 cache is an efficiency killer. Nvidia hasn't even solved that problem in the HPC/DL market, and they have been working on automatic workload scheduling via a programming model for 10 years or so.

    With UE5 Nanite, BVH for hardware RT and Direct3D I/O, VRAM usage will only go up. So an L3 cache won't help AMD, which makes scheduling an even harder part of a chiplet design.
     
    PSman1700, DavidGraham and xpea like this.
  3. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Of course, and that's not what I meant (i.e. not skipping chiplets completely, but going chiplet with only one product line so you have a fallback [not for tech, but for cash] just in case). When you don't have another horse in the race, you don't bet your company on a single technology you've never run a pipe cleaner for. Unless you're really, really desperate and it's either succeed or be bankrupt within another quarter anyway. That's why it makes all the sense in the world to go chiplets with one branch and, while you think you still can, go monolithic with the other one. That's what AMD did, and that's what Intel is kind of doing (in a way). They've all dipped their toes in MCM with HBM, though.
     
    PSman1700, Lightman, pharma and 2 others like this.
  4. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Well yeah, by definition everything can't fit in L2 on RDNA 2 and a lot of traffic hits L3.

    Yeah, presumably scheduling traffic doesn't go through the cache hierarchy, so it will be interesting to see where the PCIe host link sits and how scheduling is managed across GCDs. I imagine there would be a dedicated IF link between dies for this traffic that doesn't go through the MCDs. The question is whether there will be some sort of host I/O scheduler die that controls multiple GCDs, or whether it will be set up like the old-school Crossfire days with one GCD serving as "master".
     
  5. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    603
    Likes Received:
    1,123
    And that doesn't look very efficient. Navi22 has 3MB of L2 and 96MB of L3 cache, and yet it loses to a 3070 with just 4MB of L2 cache in every workload. A 3070 without Tensor and RT cores would be about as big as Navi22.

    And that is the huge problem with a chiplet design: the additional overhead makes it less competitive when the competition can stay well within the limits of the process. So the successor of GA104 can be twice as efficient and easily beat AMD's chiplet competition.
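    To make the tradeoff concrete, here's a back-of-envelope sketch of how an extra cache level changes effective bandwidth. Every hit rate and bandwidth figure in it is a made-up illustrative number, not measured Navi22 or GA104 data:

```python
# Back-of-envelope model of effective memory bandwidth through a cache
# hierarchy. All numbers below are illustrative assumptions, not
# measured figures for any real GPU.

def effective_bandwidth(hit_rates, level_bandwidths):
    """Average bandwidth seen by a request stream.

    hit_rates[i]        -- fraction of the remaining traffic served by level i
    level_bandwidths[i] -- bandwidth of level i in GB/s
    The last level (VRAM) absorbs whatever misses everything above it.
    """
    remaining = 1.0
    bw = 0.0
    for hit, level_bw in zip(hit_rates, level_bandwidths[:-1]):
        bw += remaining * hit * level_bw
        remaining *= (1.0 - hit)
    bw += remaining * level_bandwidths[-1]
    return bw

# Hypothetical "big L3" design: modest L2, large L3 in front of slow VRAM.
big_l3 = effective_bandwidth([0.50, 0.60], [4000, 2000, 512])
# Hypothetical "L2 only" design: higher L2 hit rate, no L3, faster VRAM.
big_l2 = effective_bandwidth([0.65], [4000, 760])

print(f"big-L3 design:  {big_l3:.0f} GB/s effective")
print(f"L2-only design: {big_l2:.0f} GB/s effective")
```

    The point of the sketch is only that a big L3 is a bandwidth amplifier bought with extra area and an extra hop, not a free win over a design that serves more traffic from L2.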
     
    #145 troyan, Aug 1, 2021
    Last edited: Aug 1, 2021
  6. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Of course chiplets aren't as efficient as monolithic dies. The entire point of chiplets is to raise peak performance beyond what is possible with a single die.
     
    DegustatoR likes this.
  7. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,424
    Likes Received:
    908
    I don’t see how chiplets are feasible with the extent of frame data reuse. How will latency and synchronization not kill performance and frametimes?
     
  8. JoeJ

    Veteran

    Joined:
    Apr 1, 2018
    Messages:
    1,523
    Likes Received:
    1,772
    All chiplets share the same VRAM, where the previous frame lives? So no problem at all?
    I'm not that worried. A workgroup does not communicate with other workgroups, and of course a workgroup runs on only one chiplet. Synchronization work, like distributing workgroups over many chiplets and signaling when they are all done so the dispatch has finished, should cost very little work and bandwidth compared to the actual work itself? Also, cache synchronization is more relaxed on GPUs, if that were even a problem?
    I guess it will work pretty well, and the workarounds from earlier speculation don't seem necessary. I also see no hurdle for RT, as it's just reading memory with no communication or sync. Maybe atomics to VRAM become more expensive, but fancy visibility-buffer ideas will still keep their advantages, I guess.
    Though I know very little about HW details compared to other guys here.
     
  9. Rootax

    Veteran

    Joined:
    Jan 2, 2006
    Messages:
    2,400
    Likes Received:
    1,845
    Location:
    France
    The VRAM will be unified, yes?

    Now, I guess multiple chips will introduce some latency here and there, but my guess is it will be negligible versus the raw power gain. And they've put a lot of work into this; there's a reason it wasn't done before (without duplicating VRAM and relying on an SLI/AFR solution).
     
    pharma likes this.
  10. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,424
    Likes Received:
    908
    But what if chiplet A needs information from chiplet B? How do you manage cache locality? How do you manage scheduling?
     
  11. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    ?
    It's just more SEs.
    The same way we do on A100.
    Kinda yuck but it can't be helped.
     
  12. Dictator

    Regular

    Joined:
    Feb 11, 2011
    Messages:
    681
    Likes Received:
    3,969
    I imagine just by being inefficient? AFR-style doubling? Paying for a bit of inefficiency for overall higher performance?
     
    DegustatoR, pharma and PSman1700 like this.
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    This isn’t like SLI where each GPU has its own memory and has to copy data to the other GPU. Both chiplets share the same VRAM and there is no data that belongs to only one chiplet.

    Think about what happens today when an SM or WGP writes new data like populating the g-buffer. How do the other SMs and WGPs get access to the newly written data?
     
    Silent_Buddha and Lightman like this.
  14. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    SEs use GDS.
     
  15. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    Well not anymore.
    MI200 onwards RIP for that rudiment.
     
  16. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    GDS is an application-controlled scratchpad for shared data within a kernel. It doesn't solve the general problem of keeping data coherent across caches on the chip.

    A100 actually is an interesting case study, as it essentially has 2 independent L2 caches on a single die. So there is some coherence mechanism keeping them in sync, with the benefit of really fast on-chip links. Chiplets will have to solve the same problem, just using slower on-package communication.
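    As a toy illustration of what such a coherence mechanism has to do, here's a minimal write-invalidate sketch with two cache partitions in front of one shared backing store. It's purely illustrative; real GPU protocols are far more involved than this:

```python
# Toy write-invalidate coherence between two cache partitions sharing
# one backing store (think: two L2 halves on one die, or two chiplets
# over a package link). Illustrative only.

class L2Partition:
    def __init__(self, name, backing):
        self.name = name
        self.backing = backing     # shared "VRAM" dict
        self.peers = []            # other partitions to invalidate on write
        self.lines = {}            # addr -> cached value

    def read(self, addr):
        if addr not in self.lines:                 # miss: fill from VRAM
            self.lines[addr] = self.backing.get(addr)
        return self.lines[addr]

    def write(self, addr, value):
        for peer in self.peers:                    # invalidate stale copies
            peer.lines.pop(addr, None)
        self.lines[addr] = value
        self.backing[addr] = value                 # write-through, for simplicity

vram = {}
a = L2Partition("L2-A", vram)
b = L2Partition("L2-B", vram)
a.peers.append(b)
b.peers.append(a)

a.write(0x100, "gbuffer-v1")
assert b.read(0x100) == "gbuffer-v1"   # B pulls the fresh data from VRAM
b.write(0x100, "gbuffer-v2")           # B's write invalidates A's copy
assert a.read(0x100) == "gbuffer-v2"   # A re-fetches instead of using a stale line
```

    The cost difference between the on-die and on-package case is exactly those invalidate/fill messages crossing a slower link.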
     
  17. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    798
    Likes Received:
    1,624
    The graphics pipeline does a lot of work reshuffling across GPC partitions: geometry redistribution for workload balancing, screen-space partitioning for better locality, etc.
    As of now, the large crossbar solves all of these distribution issues. If that weren't the case, everyone would simply partition this complex chip-level network into smaller, more efficient and manageable pieces like NVIDIA did in GA100 (luckily, partitioning is OK for compute), which is already essentially 2 GPUs on a single die.
    When we break the crossbar into 2 pieces, there will be obvious and expected efficiency losses for workload-balancing operations, especially in graphics, since work will never be distributed perfectly evenly across 2 chips.
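    A made-up example of that balancing problem: per-tile cost varies wildly across the screen, so a coarse left/right split between two chips leaves one of them idle, while finer-grained interleaved distribution (roughly what a full crossbar enables) balances much better. The tile costs are invented for illustration:

```python
# Illustration of why a coarse screen split across two chips rarely
# balances: shading cost is not uniform across the screen.
# Tile costs below are made up for the example.

def imbalance(cost_a, cost_b):
    """Fraction of the busier chip's frame time the other chip sits idle
    (0.0 = perfect balance)."""
    return abs(cost_a - cost_b) / max(cost_a, cost_b)

# One row of 8 screen tiles; the expensive content sits on the left.
tile_costs = [9, 8, 7, 3, 2, 2, 1, 1]

# Naive left/right split: chip 0 gets the whole expensive half.
left, right = sum(tile_costs[:4]), sum(tile_costs[4:])

# Interleaved (checkerboard-style) split: tiles alternate between chips.
even, odd = sum(tile_costs[0::2]), sum(tile_costs[1::2])

print(f"left/right:  {left} vs {right}, imbalance {imbalance(left, right):.2f}")
print(f"interleaved: {even} vs {odd}, imbalance {imbalance(even, odd):.2f}")
```

    The catch, of course, is that the finer the interleaving, the more redistribution traffic has to cross the inter-chip link, which is the efficiency loss being discussed.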
     
    BRiT and DegustatoR like this.
  18. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,424
    Likes Received:
    908
    Inefficiency wasn't the only problem, frametimes were hugely impacted.
     
    CarstenS likes this.
  19. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    It's not even particularly slower since SoIC is hella fancy.
    Still has funny downsides, too.
    Either way it's just ripping the bandaids off.
     
  20. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Frametimes were inconsistent with AFR because the GPUs were working on different frames in parallel, and it was difficult to ensure that finished frames were delivered to the monitor at consistent intervals. This is not a problem for chiplets, as all of the chiplets will cooperate on rendering a single frame at a time and therefore deliver consistent performance frame to frame.
     