NVIDIA Maxwell Speculation Thread

Discussion in 'Architecture and Products' started by Arun, Feb 9, 2011.

  1. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Awesome.

    I thought Kepler was a step backwards. They increased the ALUs per core but didn't increase the shared memory/L1 at all. All in all, it's great that GPUs are finally getting more cache.
    KNL is supposed to be brutal with its caching, see RWT.
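
    A quick ratio check on that "step backwards" point, assuming the usual public figures (Fermi SM = 32 ALUs, Kepler SMX = 192 ALUs, both with 64 KB of configurable SMEM/L1); a sketch, not a spec:

    Code:
    // On-chip SMEM/L1 bytes per ALU, Fermi SM vs. Kepler SMX.
    // Assumed figures: 32 vs. 192 ALUs, 64 KB of SMEM/L1 in both cases.
    #include <cstdio>

    int main() {
        printf("Fermi SM  : %.0f bytes of SMEM/L1 per ALU\n", 65536.0 / 32);
        printf("Kepler SMX: %.0f bytes of SMEM/L1 per ALU\n", 65536.0 / 192);
        return 0;   // 2048 vs. ~341 bytes per ALU
    }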
     
  2. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    Revisiting the hot clock thing seems unlikely for power reasons. Maybe better IPC, due to e.g. smaller SIMD size? The diagram also makes it look like the warp scheduler to ALU ratio went up.

    I'm wondering if the L2 on that diagram should really read L3.
     
  3. constant

    Newcomer

    Joined:
    Feb 9, 2014
    Messages:
    22
    Likes Received:
    8
    Yes, it looks like 32 SPs per warp scheduler (like good old GF100).

    This signals that they're improving the on-chip memory to SP ratio. Hopefully they will also increase the number of shared memory banks from 32 to 64; that will require an increase in the number of load/store units.
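
    To illustrate why the bank count matters: a toy conflict counter (assuming the usual mapping of consecutive 4-byte words to consecutive banks; not a statement about the real hardware):

    Code:
    // Worst-case shared memory bank conflict for a stride-2 float access
    // across one 32-thread warp, with 32 vs. 64 banks.
    #include <cstdio>

    static int worst_conflict(int nbanks, int stride) {
        int hits[64] = {0};                 // per-bank access counts
        for (int t = 0; t < 32; ++t)
            ++hits[(t * stride) % nbanks];  // word t*stride -> bank
        int worst = 0;
        for (int b = 0; b < nbanks; ++b)
            if (hits[b] > worst) worst = hits[b];
        return worst;                       // 1 means conflict-free
    }

    int main() {
        printf("32 banks, stride 2: %d-way conflict\n", worst_conflict(32, 2));
        printf("64 banks, stride 2: %d-way conflict\n", worst_conflict(64, 2));
        return 0;
    }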

    All in all it looks much more balanced than Kepler. It will be interesting to see if they can also increase the L1/SMEM scratchpad from 64 KB to 128 KB.

    Also, why not increase the register file from 256 KB to 512 KB? Registers don't require as much die area as cache does, right?

    More registers would enable more threads in flight etc. => better latency hiding => better utilization of bandwidth => higher throughput.
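
    Back-of-envelope on that (purely illustrative numbers: 32-bit registers, an assumed 64 registers per thread, warp size 32):

    Code:
    // How register file size caps warps in flight, at an assumed
    // allocation of 64 registers per thread. Illustrative only.
    #include <cstdio>

    int main() {
        const int warp_size       = 32;
        const int regs_per_thread = 64;   // assumed allocation
        const int bytes_per_reg   = 4;    // 32-bit registers
        const int bytes_per_warp  = warp_size * regs_per_thread * bytes_per_reg;

        printf("256 KB RF: %d warps in flight\n", 256 * 1024 / bytes_per_warp);
        printf("512 KB RF: %d warps in flight\n", 512 * 1024 / bytes_per_warp);
        return 0;   // 32 vs. 64 warps of latency-hiding headroom
    }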

    OK, this post turned out to be more like my wishlist :)

    128 KB L1/smem
    64+ LD/ST units
    512 KB registers

    Exciting times!
     
  4. UniversalTruth

    Veteran

    Joined:
    Sep 5, 2010
    Messages:
    1,747
    Likes Received:
    22
    GTX 480, GTX 570 and GTX 660 should have practically identical performance. What are you arguing? I have checked three TechPowerUp reviews to reach this conclusion.
     
  5. Novum

    Regular

    Joined:
    Jun 28, 2006
    Messages:
    335
    Likes Received:
    8
    Location:
    Germany
    I think that the SMM itself is not the compute unit (in OpenCL speak) anymore, but that the four sub-elements are. I speculated before that they were going to break up the SMX-equals-CU design.

    In my opinion it's very similar to AMD's CUs, with the difference that it's 32 ALUs instead of 64 (because of the thread-group size), and therefore the quad-TMU needs to be shared by two CUs, since 32 ALUs with a quad-TMU would be too low an ALU:TMU ratio. The question is whether it's 1xSIMD32, 2xSIMD16 or 4xSIMD8. The last option would give them scheduling similar to an AMD CU: round-robin scheduling from one warp scheduler, with 4 clocks of latency between instructions to use for forwarding.
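
    For reference, the three layouts as issue cadence for one 32-thread warp (pure arithmetic, not a claim about the actual hardware):

    Code:
    // Clocks to issue one 32-thread warp for each SIMD-width option.
    #include <cstdio>

    int main() {
        const int widths[] = {32, 16, 8};   // 1xSIMD32, 2xSIMD16, 4xSIMD8
        for (int i = 0; i < 3; ++i)
            printf("SIMD%-2d: %d clock(s) per warp instruction\n",
                   widths[i], 32 / widths[i]);
        // At SIMD8 each warp instruction occupies 4 clocks -- the same
        // cadence GCN uses to hide result latency on 64-wide wavefronts.
        return 0;
    }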
     
  6. constant

    Newcomer

    Joined:
    Feb 9, 2014
    Messages:
    22
    Likes Received:
    8
    GT200 --> 8 SPs executed one warp in 4 clocks (8 SPs / SM - one scheduler)
    Fermi --> 16 SPs executed one warp in 2 clocks (32 SPs / SM - two schedulers*) **
    Kepler --> 32 SPs executed one warp in 1 clock (192 SPs / SM - 4 schedulers*)

    Now for Maxwell I can spot 4 schedulers as well.

    *Capable of doing dual-issue
    ** Fermi had an exception with the GF114, which had 48 SPs / SM (think GTX 460); Kepler is similar to this.
     
  7. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,015
    Likes Received:
    112
    Yes, I read that as "higher effective throughput compared to theoretical throughput" (so really better utilization) as well.

    I wouldn't really call this an exception; rather, GF100 (and the unbroken version, GF110) was the exception, since all the other chips were like that, not just GF114.
     
  8. Novum

    Regular

    Joined:
    Jun 28, 2006
    Messages:
    335
    Likes Received:
    8
    Location:
    Germany
    It's not 4 global schedulers per SMM; it's one scheduler per sub-CU. Like I said, most likely the SMM is not the CU itself anymore.
     
  9. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    You need more bandwidth to access registers than cache because you need to read up to 3 operands and write 1 result at the same time. So I'm afraid it's going to be just the opposite.
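
    A rough sketch of the operand traffic in question (illustrative: 32 lanes issuing one 3-read/1-write FMA per clock; real register files are banked and use operand collectors rather than truly multi-ported cells):

    Code:
    // Register file bytes moved per clock by one SIMD32 issuing an FMA
    // (d = a*b + c): three 32-bit operand reads plus one result write per lane.
    #include <cstdio>

    int main() {
        const int lanes        = 32;
        const int ops_per_fma  = 3 + 1;   // 3 reads + 1 write
        const int bytes_per_op = 4;       // 32-bit registers

        printf("RF traffic: %d bytes/clock per SIMD32\n",
               lanes * ops_per_fma * bytes_per_op);
        return 0;   // 512 bytes/clock sustained, for one SIMD alone
    }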
     
  10. constant

    Newcomer

    Joined:
    Feb 9, 2014
    Messages:
    22
    Likes Received:
    8
    Using your terminology: one per CU and 4 CUs per SMM (btw, you haven't clearly defined what you mean by a CU yet).
     
  11. Novum

    Regular

    Joined:
    Jun 28, 2006
    Messages:
    335
    Likes Received:
    8
    Location:
    Germany
    Yeah, I did: what OpenCL defines as a compute unit.
     
  12. tviceman

    Newcomer

    Joined:
    Mar 6, 2012
    Messages:
    191
    Likes Received:
    0
    That's why I said nevermind. I was looking at gtx 660 ti scores, not gtx 660.
     
  13. fellix

    Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,489
    Likes Received:
    400
    Location:
    Varna, Bulgaria
    Any idea how the LDS/L1 will fit into this new SMM subdivision?
    Probably either shared among all four sub-SMs, or local to each with a second-level shared cache per SMM, followed by the big 2 MB "L3". :???:
     
  14. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    Yeah, the hot-clock thing was a joke. I know sometimes it's hard to tell whether I'm being dumb on purpose or from ignorance :> Warning: I'm about to be dumb from ignorance....

    Yes, there appears to be a dispatcher per 16 ALUs, and a scheduler per 32. Before, we had 8 dispatchers and 4 schedulers across 192 ALUs (which makes for 24 and 48). Those same dispatchers also fed DP work, and we have yet to see the DP capabilities here.

    Are the dispatchers doing less work? Is batch size decreasing? I would have thought such decreases would be reflected in CUDA, and I thought those updates were for in-pipeline job spawning (the raison d'être of Denver), but maybe there are other changes as well? Maybe the dispatchers are not doing less, and some instructions can be dual-issued -- maybe int & fp, maybe mul & add. Mul & add doesn't seem particularly crazy if you assume the bandwidth for an additional argument, but it isn't clear to me that there'd be a lot of gain. Another crazy note: the 2011 talk that was mostly about power use also mentioned work being done on VLIW architectures....
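
    The ratios above, spelled out (the Maxwell-side figures are read off the diagram, not confirmed):

    Code:
    // ALUs per scheduler and per dispatcher, Kepler SMX vs. the diagrammed SMM.
    #include <cstdio>

    int main() {
        // Kepler SMX: 192 ALUs, 4 schedulers, 8 dispatchers
        printf("Kepler SMX : %d ALUs/scheduler, %d ALUs/dispatcher\n",
               192 / 4, 192 / 8);
        // Diagrammed SMM: 128 ALUs, 4 schedulers, 8 dispatchers
        printf("Maxwell SMM: %d ALUs/scheduler, %d ALUs/dispatcher\n",
               128 / 4, 128 / 8);
        return 0;
    }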
     
  15. zorg

    Newcomer

    Joined:
    Aug 1, 2012
    Messages:
    32
    Likes Received:
    0
    Location:
    Sweden
    I heard that the four subdivisions will share one 64 KB LDS and one L1 data/texture cache, plus the texturing block. The LDS bandwidth is 64 bytes/cycle. Each subdivision has a 64 KB register file and 8 LD/ST units.

    But I don't have the cards; I'm just going by what my Chinese friend says.

    It won't be a compute monster if true. Maybe even worse than Kepler. :(
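
    If the 64 bytes/cycle figure is right, the per-ALU LDS bandwidth comparison looks like this (assuming GK110's documented 256 bytes/clock per SMX, i.e. 32 banks x 8 bytes, and 128 ALUs per SMM; all of it rumor-grade):

    Code:
    // LDS bytes per clock per ALU: Kepler SMX vs. the rumored SMM.
    #include <cstdio>

    int main() {
        printf("Kepler SMX : %.2f LDS bytes/clock per ALU\n", 256.0 / 192);
        printf("Rumored SMM: %.2f LDS bytes/clock per ALU\n",  64.0 / 128);
        return 0;   // ~1.33 vs. 0.5 bytes/clock per ALU
    }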
     
  16. tviceman

    Newcomer

    Joined:
    Mar 6, 2012
    Messages:
    191
    Likes Received:
    0
    That just goes to show Nvidia is further bifurcating their GPUs from the flagship die. I, for one, think it's a good move from a business / die size / profit margin perspective. Let the big die be big and do what it does best; let the other dies continue to focus on the best possible bang per mm^2 in graphics.
     
  17. zorg

    Newcomer

    Joined:
    Aug 1, 2012
    Messages:
    32
    Likes Received:
    0
    Location:
    Sweden
    If the rumors are accurate, that is. Don't get me wrong, it seems legit, but who knows.
    He also said that the API functionality will be the same as Kepler's. It's hard to believe that they still don't support 64 UAVs ... :cry:
     
  18. DSC

    Banned

    Joined:
    Jul 12, 2003
    Messages:
    689
    Likes Received:
    3
    Fermi and Kepler support UAVs. MS is too boneheaded in their D3D11.1 design.

    http://nvidia.custhelp.com/app/answers/detail/a_id/3196/~/fermi-and-kepler-directx-api-support

     
  19. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,015
    Likes Received:
    112
    Why would that be worse? That would mean the LDS is still shared per SMX/SMM, and there's still a (slight) net increase in LDS size, as well as in LD/ST units (because the SMM is smaller, so you get more of them per chip). Said differently, each SMM has the same amount of LDS and the same number of LD/ST units as an SMX, but fewer ALUs (and TMUs), and of course there are more SMMs. I don't know off-hand what LDS bandwidth Kepler had, but otherwise I don't see anything that would make compute worse.
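
    On the capacity side it works out like this (same assumed figures as above: 64 KB of LDS per SM in both cases, 192 vs. 128 ALUs):

    Code:
    // LDS capacity per ALU: Kepler SMX vs. the rumored Maxwell SMM.
    #include <cstdio>

    int main() {
        printf("Kepler SMX : %.0f LDS bytes per ALU\n", 65536.0 / 192);
        printf("Rumored SMM: %.0f LDS bytes per ALU\n", 65536.0 / 128);
        return 0;   // ~341 vs. 512 bytes per ALU: a per-ALU gain
    }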
     
  20. Novum

    Regular

    Joined:
    Jun 28, 2006
    Messages:
    335
    Likes Received:
    8
    Location:
    Germany
    If the sub-elements are the CUs, then every one of them must have its own LDS (required by the programming models). But the L1 could still be shared.
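
    In CUDA terms (OpenCL's __local is analogous), the constraint is that shared memory and its barrier are scoped to a single thread block, so whichever unit a block is resident on has to own that block's LDS slice. A minimal sketch:

    Code:
    // __shared__ storage is visible to exactly one thread block, and
    // __syncthreads() only orders threads within that block.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void reverse32(int *out, const int *in) {
        __shared__ int buf[32];      // one copy per resident block
        int t = threadIdx.x;
        buf[t] = in[t];
        __syncthreads();             // block-local barrier
        out[t] = buf[31 - t];
    }

    int main() {
        int h[32], r[32];
        for (int i = 0; i < 32; ++i) h[i] = i;
        int *d_in, *d_out;
        cudaMalloc(&d_in, sizeof h);
        cudaMalloc(&d_out, sizeof r);
        cudaMemcpy(d_in, h, sizeof h, cudaMemcpyHostToDevice);
        reverse32<<<1, 32>>>(d_out, d_in);
        cudaMemcpy(r, d_out, sizeof r, cudaMemcpyDeviceToHost);
        printf("r[0] = %d (expect 31)\n", r[0]);
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }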
     