NVidia Ada Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Jul 10, 2021.

  1. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    Co-rrect!
    Clks and physdes in general is a big focus for nextgen nV parts.
     
    Lightman and pjbliverpool like this.
  2. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    603
    Likes Received:
    1,123
    Then don't quote me. I mentioned the L2 cache because it is used to share information between compute units on a chip. With a chiplet design this has to go over the shared "off"-chip cache. "Off"-chip bandwidth is not the problem; it is always efficiency and data locality.
     
  3. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Please do not selectively quote partial sentences to fit whatever you think it is.
    I explicitly said "If the constellation is such, that only with a heavily overclocked Hopper they can claim perf kingship in the desktop and/or gaming space, …" right before the part you quoted. Just saw this now.
     
    #183 CarstenS, Aug 2, 2021
    Last edited: Aug 2, 2021
  4. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    Can't do that.
    Use your eyes better then.
     
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    That article is all over the place. This bit doesn't make any sense -

    "There is another rumor that speaks of change in the organization of your GPUs in the next generation by NVIDIA, where the minimum unit will be the SM and the subcores will disappear, so the SM unit will have a general scheduler instead of having one in each subcore, in that aspect it will look much more like the architecture from AMD where the lowest level cache is shared for all SM equally."

    A partition within an SM is generally equivalent to a SIMD within a CU. Each has its own execution units, wavefront scheduler and register file. If Nvidia gets rid of partitions it will make their architecture less like RDNA not more. Also I have no idea what they mean by AMD's architecture sharing the lowest level cache across all SMs - that's definitely not true. Each CU has a private L0.
     
    DegustatoR likes this.
  6. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Think he's referring to the Mac-Lineup based on RDNA2, where they have a dual GPU card.
    But you're right, TFLOPS rarely tell the whole story. Especially in realtime graphics, there's so much more.

    That's why it'll be very interesting to see whether or not MCM will really behave like a singular large GPU from the first try or if anyone might need some iterations. I guess it'll be a load better already than SLI or Crossfire ever was.
     
  7. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    603
    Likes Received:
    1,123
    Yes, the W6800X Duo has a clock rate of less than 2000 MHz, around 15% higher than the A6000, with less memory and bandwidth per chip. Even a 3090 with nearly 1 TB/s of off-chip bandwidth delivers more compute performance with 50 W less power.

    nVidia doubled compute throughput with Ampere over Turing and didn't scale every (fixed) function with it. It was a genius move. Yet people here think that AMD can just improve efficiency on an architecture level by 4x+ to put three "RDNA2" chips on a package for 75 TFLOPs. At the same time nVidia would supposedly have a problem even doubling compute performance with Lovelace while using 33%+ more power than Ampere.

    These speculations are not grounded in reality and are just baseless.
     
    Bondrewd and PSman1700 like this.
  8. JoeJ

    Veteran

    Joined:
    Apr 1, 2018
    Messages:
    1,523
    Likes Received:
    1,772
    They doubled FP32 units, not compute power. Thus teraflops becomes an even worse unit to compare different architectures. Same for benchmarks, depending on how much math they do.
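
For reference, a minimal sketch of how the paper TFLOPS figure is usually derived, which shows why doubling FP32 units doubles the headline number regardless of what real workloads achieve. The function name and the lane counts and clock below are hypothetical values chosen for illustration, not the specs of any actual part.

```python
def peak_fp32_tflops(fp32_lanes: int, clock_ghz: float) -> float:
    """Peak FP32 TFLOPS, counting one FMA (2 flops) per lane per clock."""
    return 2 * fp32_lanes * clock_ghz / 1000.0


# Hypothetical parts at the same clock: only the FP32 lane count differs.
base = peak_fp32_tflops(fp32_lanes=4096, clock_ghz=1.8)
doubled = peak_fp32_tflops(fp32_lanes=8192, clock_ghz=1.8)

print(f"base: {base:.2f} TFLOPS, doubled lanes: {doubled:.2f} TFLOPS")
```

The formula only counts lanes and clocks, so it says nothing about whether the rest of the chip can keep those lanes fed.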
     
  9. JoeJ

    Veteran

    Joined:
    Apr 1, 2018
    Messages:
    1,523
    Likes Received:
    1,772
    I don't think AMD has moved in the opposite direction, with async compute support working less well than on GCN?
    I also think it's the devs' fault they never utilized compute at all, aside from some culling, binning lights to a grid, and other trivial stuff. But AMD obviously was wrong in assuming they just would. If they had invaded studios with researched applications and support like NV does for RT for example, they now would not need to build a monster GPU just to take the lead.
    So if we get a 75 TF GPU now, 7 times more powerful than consoles, then I don't see why we are worried it could not scale just those console games 7 times faster, or with 7 times more pixels, both being just pointless.
    No, if we want to utilize this monster, we likely have to add some more features to the game, and those features surely use some compute and are thus well suited for async workloads to compensate for the speculated shortcomings. And I guess we may use more parallelism than just one single compute workload besides gfx work.

    But that's just my opinion. If you ask me if we really need such GPUs for games, I'm really not sure.
     
    xpea likes this.
  10. PSman1700

    Legend

    Joined:
    Mar 22, 2019
    Messages:
    7,118
    Likes Received:
    3,090
    Just like 640k was enough. It's typically console users who think specs are high enough.
     
  11. Granath

    Newcomer

    Joined:
    Jul 26, 2021
    Messages:
    80
    Likes Received:
    82
    As far as I remember, the common consensus from reviews was that this doubling of thru... didn't work and didn't scale well. Low-throughput Navi21 was on par with the 3090 (of course excluding ray tracing). Probably in special scenarios Ampere rocks, but in regular games it is on par.
     
  12. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    No shit.
    Twice the FMA per the same amount of r/w ports is what client amperage is.
     
  13. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    If you're talking about the register file, IIRC it has been providing operands for two instructions per clock since the introduction of GV100.
     
    pharma, DegustatoR and PSman1700 like this.
  14. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,240
    Likes Received:
    3,397
    Ampere didn't double the ports and reg file because Turing's were already sufficient for running INT32 in parallel.
    Ampere hits its FP32 peaks fine when the code is pure FP32. Gaming code isn't though and thus it doesn't show double throughput.
    If we assume that Lovelace will be just Ampere scaled up then it will scale just as well as RDNA2 in comparison to RDNA1 did.
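
A rough sketch of the point about pure FP32 versus mixed code, as a toy issue model for one SM partition. It assumes two 16-lane datapaths, with Turing modeled as FP32-only plus INT32-only and Ampere as FP32-only plus a shared FP32/INT32 pipe, and it ignores latency, memory and scheduling entirely. The function names and the 100:36 instruction mix are illustrative assumptions, not measured data.

```python
def cycles_per_partition(fp_ops: int, int_ops: int, arch: str) -> float:
    """Idealized issue cycles for one partition in this toy model."""
    lanes = 16  # width of each of the two datapaths (model assumption)
    if arch == "turing":
        # One FP32-only pipe and one INT32-only pipe run in parallel.
        return max(fp_ops, int_ops) / lanes
    if arch == "ampere":
        # INT32 is confined to the shared pipe; FP32 may use either pipe.
        return max(int_ops, (fp_ops + int_ops) / 2) / lanes
    raise ValueError(f"unknown arch: {arch}")


def ampere_speedup(fp_ops: int, int_ops: int) -> float:
    """Ratio of modeled Turing cycles to modeled Ampere cycles."""
    return (cycles_per_partition(fp_ops, int_ops, "turing")
            / cycles_per_partition(fp_ops, int_ops, "ampere"))


pure_fp32 = ampere_speedup(fp_ops=100, int_ops=0)
mixed = ampere_speedup(fp_ops=100, int_ops=36)  # illustrative mix only

print(f"pure FP32: {pure_fp32:.2f}x, mixed FP32/INT32: {mixed:.2f}x")
```

Under these assumptions the doubled FP32 rate shows up in full only when there is no INT32 work competing for the shared pipe.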
     
    DavidGraham and PSman1700 like this.
  15. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    Which is why I've said double the FMA per the same amount of r/w ports.
    Needs that ILP juice too.
    Yea.
    Kinda the point really.
     
  16. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,240
    Likes Received:
    3,397
    Not really. It's not like Ampere is VLIW2.
     
    PSman1700 likes this.
  17. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    What's the definition of compute power?

    Yeah, where is this idea coming from that Ampere doesn't have enough regfile bandwidth to feed two 16-wide FP32 pipes? The register file has been providing enough operands for 32 FMAs per clock since forever, when the pipes were 32-wide. Each Ampere FP32 pipe is a SIMD-16 and takes 2 clocks to execute each instruction over the 32-wide warp. So the operand collector just alternates between pipes every other clock. It's the same total width as a Pascal SM - 128 FMAs per clock.
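
The alternating-issue argument can be sketched as a small simulation: one warp instruction is issued per clock, round-robin over two SIMD-16 pipes that each stay busy for two clocks. The four-partitions-per-SM figure and all names in the code are assumptions added for illustration; the post itself only states the 128 FMAs per clock total.

```python
WARP_WIDTH = 32          # threads per warp
PIPE_WIDTH = 16          # lanes per FP32 pipe, per the post
PIPES = 2                # FP32 pipes per partition
PARTITIONS_PER_SM = 4    # assumption, not stated in the post


def warp_instructions_issued(clocks: int) -> int:
    """Issue one warp instruction per clock, alternating between pipes."""
    busy_clocks = WARP_WIDTH // PIPE_WIDTH   # 2 clocks per warp instruction
    free_at = [0] * PIPES                    # clock at which each pipe frees up
    issued = 0
    for clk in range(clocks):
        pipe = clk % PIPES                   # operand collector alternates pipes
        if free_at[pipe] <= clk:             # pipe has finished its previous warp
            free_at[pipe] = clk + busy_clocks
            issued += 1
    return issued


clocks = 8
fma_per_clock = warp_instructions_issued(clocks) * WARP_WIDTH // clocks
fma_per_clock_sm = fma_per_clock * PARTITIONS_PER_SM

print(fma_per_clock, fma_per_clock_sm)
```

In this model the register file never has to supply more than one warp instruction's worth of operands per clock, which is the same demand as a single 32-wide pipe.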
     
    DavidGraham, Picao84 and PSman1700 like this.
  18. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Got a source for this? Haven't seen any reference to Turing or Ampere needing to issue overlapping FP32+INT32 instructions from the same warp.
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Is there a tool that lists out the exact machine code that is scheduled on Ampere? Like the various AMD tools that list out the ISA code?
     
  20. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    I am not 100% sure but if the 2 instructions per clock come from different warps what you’re asking for probably doesn’t exist.
     