Nvidia Post-Volta (Ampere?) Rumor and Speculation Thread

Discussion in 'Architecture and Products' started by Geeforcer, Nov 12, 2017.

Thread Status:
Not open for further replies.
  1. w0lfram

    Newcomer

    Joined:
    Aug 7, 2017
    Messages:
    219
    Likes Received:
    39
    I submit that RDNA is superior to Turing at gaming.
    Take the original 2070's Turing die (TU106) and compare it to the 5700 XT's (Navi 10) die.

    Look at the transistor count, SPs, ROPs, etc. Then look at the performance.
     
  2. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,288
    Likes Received:
    3,546
    Take that Navi die, put it at 16nm and come back and ask the same question.

    The 2070 also includes many more kinds of functional units (Tensor cores, ray tracing cores, INT32 units, etc.), which means it does more within the same transistor budget as Navi, and on an older process.
     
  3. PSman1700

    Veteran Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    2,711
    Likes Received:
    904
    Turing is a much better arch than RDNA1 and probably than RDNA2 too. Their next arch, on 7nm, should be here soon; then they will compete on the same node.
     
  4. Qesa

    Joined:
    Feb 23, 2020
    Messages:
    4
    Likes Received:
    6
    As depicted in the SM diagram, it's only capable of issuing a single warp per clock cycle anyway, which makes me believe it's fake. It's not just INT instructions that would cause FP bubbles; LD/ST and MUFU would too, and you'd have to go back as far as big Fermi to see that happen.

    It could be believable if it had a dual-issue scheduler like small Fermi through Pascal had, but that ain't what's in the diagram.
     
    TheAlSpark likes this.
  5. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    949
    Likes Received:
    46
    Location:
    LA, California
    The various units can only process half (or less) of a warp per cycle. Dispatching one warp per cycle means you can do 1 INT and 1 FP warp every 2 cycles, or (in the rumored SM configuration, as opposed to Turing) 2 FP warps every 2 cycles. I don’t think register file bandwidth would need to change at all, since RF bandwidth requirements for concurrent INT+FP warps and concurrent FP+FP warps are the same. (This assumes the INT32 execution units support 3-input, 1-output instructions like multiply-accumulate.)
     
    Lightman and iamw like this.
  6. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    499
    Likes Received:
    220
    From what I can tell that's literally beyond TSMC's reticle limit for 12nm. Not to mention a ton of fake-looking numbers: "Double the performance, 50% more 'cores' at the same time!!!" etc.

    Feels fake. I won't say totally fake, since the 2080 Ti has a huge die too, but how are they going to pack that much more into only another 50mm² of die area? That's less than a 10 percent increase.
     
  7. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    288
    Likes Received:
    189
    Because it wouldn't be 12nm. It would never be 12nm at this point. The only reason they give for claiming it could be 12nm is that it's huge. Well, it's a GA100 that the rumour puts at 826mm², so as a replacement for the 815mm² Volta it's not incredibly "surprising".
     
    Cuthalu, pharma, DavidGraham and 2 others like this.
  8. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    764
    Likes Received:
    360
    Reticle is 858mm² or so.
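
As a quick sanity check on the die sizes quoted in this thread, here is a minimal sketch of the arithmetic. The 826mm² and 815mm² figures are the rumoured GA100 and Volta sizes from the posts above. The 26mm x 33mm exposure field is an assumption brought in here (it is the commonly cited maximum field size and it multiplies out to the 858mm² quoted); the thread itself does not state it.

```python
# Die-size sanity check using figures quoted in the thread.
# Assumption (not from the thread): the reticle field is 26 mm x 33 mm.
RETICLE_W_MM = 26
RETICLE_H_MM = 33

GA100_RUMOURED_MM2 = 826  # rumoured figure quoted in the thread
GV100_MM2 = 815           # Volta figure quoted in the thread

reticle_mm2 = RETICLE_W_MM * RETICLE_H_MM
growth_pct = 100 * (GA100_RUMOURED_MM2 - GV100_MM2) / GV100_MM2
reticle_use_pct = 100 * GA100_RUMOURED_MM2 / reticle_mm2

print(f"reticle field: {reticle_mm2} mm^2")
print(f"rumoured GA100 vs Volta: +{growth_pct:.1f}% area")
print(f"rumoured GA100 fills {reticle_use_pct:.1f}% of the field")
```

Under that assumption the rumoured die fits inside the field, and it is only marginally larger than Volta.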
     
    Frenetic Pony likes this.
  9. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,604
    Likes Received:
    648
    Location:
    New York
    32 FP + 16 INT ops certainly require more operands than the current Turing config of 16 FP + 16 INT.
     
  10. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    127
    Likes Received:
    154
    Ignore all the other stuff written there. It's just Wccftech mixing together many different rumours.
    As Benetanegia wrote, it's of course not 12nm. It would be 826mm² on 7nm (probably EUV). That's nothing special for Nvidia after Volta. Nvidia goes to the max for HPC chips. But this die size gives us no indication about consumer GPUs.

    If they need to go to 826mm² for 70-75% more performance, as written in The Next Platform's article about the GA100 supercomputer, that's pretty underwhelming. Consumer chips will have smaller dies, so we can expect a 40-50% speed increase at most.
     
    #510 Samwell, Feb 24, 2020
    Last edited: Feb 24, 2020
  11. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,943
    Likes Received:
    2,286
    Location:
    Germany
    Performance as in theoretical TFLOPS throughput or in real, non-cherry-picked applications within the same TDP budget?
     
  12. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    949
    Likes Received:
    46
    Location:
    LA, California
    But the rumored config seems to be 16 INT + 16 FP or 16 FP + 16 FP, which does not require more operands. Supporting concurrent execution of 3 warps would require increased RF bandwidth, as you say. It would also require the ability to dispatch 3 or more warps every 2 cycles and, if maintaining the current level of latency hiding is important, a larger RF to support additional warps.
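
The operand argument can be put into numbers with a rough sketch. It assumes 32-thread warps, 16-lane units that take 2 cycles per warp, and 3-input, 1-output instructions, as described in the posts above. `rf_demand` is a hypothetical helper written for this illustration; it counts register operands only and ignores operand caching, banking and anything else a real register file does.

```python
WARP_SIZE = 32

def rf_demand(active_units, lanes=16, inputs=3, outputs=1):
    """Per-cycle register traffic and dispatch rate needed to keep
    `active_units` execution units busy at the same time (toy model)."""
    cycles_per_warp = WARP_SIZE // lanes  # 2 cycles for a 16-lane unit
    return {
        "reads_per_cycle": active_units * lanes * inputs,
        "writes_per_cycle": active_units * lanes * outputs,
        "warps_per_cycle": active_units / cycles_per_warp,
    }

two_units = rf_demand(2)    # INT+FP (Turing) or FP+FP (rumoured)
three_units = rf_demand(3)  # all three 16-wide units busy at once

print("2 units busy:", two_units)
print("3 units busy:", three_units)
```

In this model, any pair of 16-wide units needs the same 96 reads per cycle and one dispatched warp per cycle, while keeping three busy would need 144 reads per cycle and 3 warps every 2 cycles.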
     
  13. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    244
    Likes Received:
    107
    One thing that was in the leaks a long time ago: NVIDIA wants to improve the rasterizer. What can they do to make it run better?
     
  14. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    3,577
    Likes Received:
    2,296
    Could they be referring to improvements for mesh shaders? They revamped the rasterization pipeline when they added programmable mesh shaders.
    March can't come soon enough!
     
  15. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    244
    Likes Received:
    107
    Do modern chips have a scheduling issue because the rasterizer doesn't get enough pixels done? Or why do they want to improve it? I can understand that the rasterizer is at the front, so when the front end is lame, the rest of the chip is lame too.

    A big question for me is: do rasterizer data and shading data have to be processed in order, or can you do shading work for a pixel that has not even been rasterized yet?
     
  16. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,604
    Likes Received:
    648
    Location:
    New York
    There is no bottleneck as long as the rasterizer and ROPs can handle the same number of pixels. Any improvements to the rasterizer are likely to improve functionality and not raw speed.

    For any non trivial application the shader core should not be bottlenecked by the rasterizer. Specific use cases like a depth pre-pass may lean more on the rasterizer but your typical shader will be memory or compute bound.
     
  17. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,604
    Likes Received:
    648
    Location:
    New York
    I’m not following you. The diagram posted on Twitter shows 32 FP units and 16 INT units.

    Turing can schedule one full warp per clock. It takes 2 clocks to actually execute a warp because the execution units are only 16 wide. This allows the Turing scheduler to alternate between issuing INT and FP ops each clock for full utilization of all execution units.

    If Nvidia goes back to 32-wide execution for FP then there will be no free clock in which to issue INT ops, and there will be bubbles in the FP pipeline.
     
  18. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    949
    Likes Received:
    46
    Location:
    LA, California
    I'm reading the diagram as two 16-wide FP units and one 16-wide INT unit, so that the scheduler can switch between issuing to different FP units every cycle (unlike Turing), or switch between INT and FP units (like Turing). Yes, one wouldn't be able to use all three 16-wide units concurrently, so there'd be a bubble in at least one INT or FP unit every cycle. But it seems like a pretty non-invasive way to increase peak FP throughput without having to scale other aspects of the SM. If the power spent in instruction execution is relatively small compared to the cost of obtaining and moving operands, then this design seems to double peak FP throughput without increasing peak SM power consumption very much. So it all seems pretty plausible to me...
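
The two readings of the diagram can be compared with a toy cycle-by-cycle model. This is a sketch under loud assumptions, not a description of any real scheduler: at most one warp instruction is issued per cycle, a warp of the wanted kind is always ready, a unit with N lanes is occupied for 32/N cycles per warp, and the scheduler picks the least recently used free unit. `simulate` and the three configuration names are hypothetical.

```python
WARP_SIZE = 32

def simulate(units, cycles=1200, ready=("FP", "INT")):
    """Toy model: at most one warp instruction issued per cycle.

    units: list of (kind, lanes) tuples. `ready` lists the instruction kinds
    that always have a warp waiting (an idealised assumption).
    Returns (busy fraction per unit, warps issued per cycle).
    """
    free_at = [0] * len(units)      # first cycle each unit can take a new warp
    last_issue = [-1] * len(units)  # used for least-recently-used selection
    busy = [0] * len(units)
    issued = 0
    for cycle in range(cycles):
        candidates = [i for i, (kind, _) in enumerate(units)
                      if kind in ready and free_at[i] <= cycle]
        if not candidates:
            continue  # the issue slot goes unused this cycle
        pick = min(candidates, key=lambda i: last_issue[i])
        occupancy = WARP_SIZE // units[pick][1]
        free_at[pick] = cycle + occupancy
        last_issue[pick] = cycle
        busy[pick] += min(occupancy, cycles - cycle)  # clip at end of run
        issued += 1
    return [b / cycles for b in busy], issued / cycles

TURING = [("FP", 16), ("INT", 16)]
RUMOURED = [("FP", 16), ("FP", 16), ("INT", 16)]  # the 2x16 FP reading
WIDE_FP = [("FP", 32), ("INT", 16)]               # the 32-wide FP reading

for name, cfg in [("Turing", TURING), ("2x16 FP + 16 INT", RUMOURED),
                  ("32 FP + 16 INT", WIDE_FP)]:
    util, rate = simulate(cfg)
    _, fp_only_rate = simulate(cfg, ready=("FP",))
    print(name, [round(u, 2) for u in util],
          f"issue={rate:.2f}", f"fp_only={fp_only_rate:.2f}")
```

In this model Turing keeps both units essentially full on a mixed stream but issues only one FP warp every 2 cycles on FP-only code. The 2x16 FP reading issues an FP warp every cycle on FP-only code, and on a mixed stream the three units average about two-thirds busy each, which is the bubble described above. The 32-wide FP reading loses an FP slot every time an INT instruction is issued.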
     
    TheAlSpark, pharma, Putas and 2 others like this.
  19. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,604
    Likes Received:
    648
    Location:
    New York
    Ah, now I get it. Yeah, that would be interesting and a relatively cheap way to increase FP throughput. It raises the question, though, of why go through all that trouble instead of just using Pascal-style 32-wide combined INT+FP units.
     
  20. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    288
    Likes Received:
    189
    For the same reason they included the INT unit in the first place, I suppose (it's more efficient?). This move (if it's real at all and not fake) would simply fix the FP:INT ratio. Right now in Turing there's nothing to switch to, nothing to schedule to the INT pipe, in 64% of cases, since there are supposedly only 36 INT instructions per 100 FP instructions.
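
The 64% figure follows from the quoted 36:100 mix, and the same arithmetic gives a rough bound on what a second FP unit could buy. This is a back-of-the-envelope sketch, assuming 16-lane units (2 cycles per warp), one issue slot per cycle, and no stalls from dependencies or memory. `cycles_for_mix` is a hypothetical helper, and the result is an idealised bound rather than a performance prediction.

```python
WARP_SIZE = 32

def cycles_for_mix(n_fp, n_int, fp_units, int_units, lanes=16, issue_per_cycle=1):
    """Lower bound on cycles to run a mix of warp instructions:
    limited by the FP units, the INT units, or the issue slot."""
    occupancy = WARP_SIZE // lanes  # cycles a unit is busy per warp
    fp_cycles = n_fp * occupancy / fp_units
    int_cycles = n_int * occupancy / int_units
    issue_cycles = (n_fp + n_int) / issue_per_cycle
    return max(fp_cycles, int_cycles, issue_cycles)

N_FP, N_INT = 100, 36  # instruction mix quoted in the post

turing = cycles_for_mix(N_FP, N_INT, fp_units=1, int_units=1)
rumoured = cycles_for_mix(N_FP, N_INT, fp_units=2, int_units=1)

int_busy_cycles = N_INT * (WARP_SIZE // 16)
int_idle_turing = 1 - int_busy_cycles / turing

print(f"Turing-like:   {turing:.0f} cycles, INT pipe idle {int_idle_turing:.0%}")
print(f"2 FP + 1 INT:  {rumoured:.0f} cycles")
print(f"ideal speedup: {turing / rumoured:.2f}x")
```

With this mix the Turing-like configuration is bound by its single FP unit, leaving the INT pipe idle 64% of the time, while the configuration with two FP units becomes bound by the single issue slot instead.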

     
