Nvidia Pascal Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 25, 2014.

Thread Status:
Not open for further replies.
  1. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    149
    Likes Received:
    183
    1:3 is impossible, but the architecture should be able to handle any multiple of 4 DP units per SMM: 4 DP per 128 SP like now, or 8, 12, ...
     
    iMacmatician likes this.
  2. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,176
    Location:
    La-la land
    Impossible? Wow, strong words here. :)

    The SP:DP ratio clearly does not have to be a power of two. Cell, for example, has a 1:7 ratio. I'm sure there are a bunch of others too.
     
  3. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland

    I'm really not sure the "CELL" architecture is worth using as an example for anything...

    I don't know what changes have been made between Kepler and Pascal in this respect (let's leave Maxwell out of this)...
     
  4. Frenetic Pony

    Regular

    Joined:
    Nov 12, 2011
    Messages:
    808
    Likes Received:
    478
    Wasn't Cell 1:8 but one was disabled to increase yields? Besides, general practice seems to be power of two for various good reasons.
     
  5. HKS

    HKS
    Newcomer

    Joined:
    Apr 26, 2007
    Messages:
    32
    Likes Received:
    17
    Location:
    Norway
    Correct. If you bought the IBM BladeCenters with Cell, you got CPUs with all 8 SPEs enabled.
     
  6. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    955
    Likes Received:
    52
    Location:
    LA, California
  7. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    149
    Likes Received:
    183
    Not a power of two, but you can't divide 128 by 3 :) You can have a bit over 1:3 or a bit under, just not exactly 1:3, as long as you stay with 128-SP SMMs.

    Then it would be just 96 FP32 SPs per SMM, or, if you use the FP64 units for FP32 as well, you are at 1:4 again. That won't happen, as Pascal should be just an evolution of Maxwell. Some parts, like the shaders, were already changed in Maxwell, while Pascal should go more for things like proper async support, HBM and so on. 128 FP32 per SMM will stay.
    My assumption of n times 4 FP64 units per SMM is based on this hardware.fr diagram. Right now there are 4 FP64 units per SMM, and they can add any multiple of that. So the nearest to 1:3 would be 44:128.
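The 44:128 figure can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name and the idea of searching in steps of 4 are illustrative assumptions taken from the post, not anything documented by Nvidia.

```python
from fractions import Fraction

def nearest_dp_count(sp_per_smm=128, step=4, target=Fraction(1, 3)):
    """Return the multiple of `step` whose DP:SP ratio is closest to `target`."""
    candidates = range(step, sp_per_smm + 1, step)
    return min(candidates, key=lambda dp: abs(Fraction(dp, sp_per_smm) - target))

best = nearest_dp_count()
# Exact fractions avoid any floating-point ties between neighbouring candidates.
print(best, Fraction(best, 128))
```

With 128 SPs, 40 units would give 1:3.2 and 44 units roughly 1:2.9, so 44 is the closer of the two neighbours.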
     
  8. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    If there's no difference between the Maxwell ULP GPU FP16 implementation and Pascal's FP16, I don't see why they would really "need", in absolute terms, to avoid dedicated FP64 SPs. The above presentation with the supposed 4 TFLOPs DP must be ancient, by the way. But for dumb speculative math's sake: synthesis of an FP64 SP at 1 GHz under 16FF+ should come to less than 30 mm² for 2000 units all together. What am I missing?
     
  9. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    955
    Likes Received:
    52
    Location:
    LA, California
    Well, that's another option: 32-wide DP shared between 2 partitions, which can also be used for SP, giving 1:3. I don't find your arbitrary multiple of 4 theory very convincing though. If you had e.g. 22-wide execution for DP shared across 2 partitions, you'd have to be able to issue from 2 warps per cycle to fill the execution units, and there wouldn't be a static mapping between execution lane and index inside the warp. That sounds complicated.
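The static-mapping point can be illustrated with a toy model. The sketch below assumes 32-thread warps packed back to back onto the execution lanes with no gaps; that packing rule and the function names are assumptions made for illustration, not a description of how any real scheduler works.

```python
WARP_SIZE = 32

def lanes_for_thread(thread_idx, width, n_warps=4):
    """Lanes that thread `thread_idx` lands on across consecutive warps,
    assuming warps are packed back to back onto `width` lanes with no gaps."""
    return {(w * WARP_SIZE + thread_idx) % width for w in range(n_warps)}

def has_static_mapping(width, n_warps=4):
    """True if every thread index always maps to the same lane."""
    return all(len(lanes_for_thread(t, width, n_warps)) == 1
               for t in range(WARP_SIZE))

for width in (16, 32, 22):
    print(width, has_static_mapping(width))
```

Under this model, widths that divide 32 keep each thread index on a fixed lane, while a 22-wide unit shifts the mapping from warp to warp and has to mix two warps in one cycle to stay full.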
     
  10. Ext3h

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    428
    Likes Received:
    497
    There is no indication of this. Async execution of pure compute load has worked since Kepler (when using CUDA, not the 3D queue), but mixed dispatch and draw call load is supported by neither the frontend nor the SM/X/M.

    Nor has Nvidia announced any improvements in this area for Pascal.
     
  11. Godmode

    Newcomer

    Joined:
    May 3, 2004
    Messages:
    11
    Likes Received:
    0
    I think your interpretation of the chart is wrong. The captions (Pascal, Volta, ...) in the chart indicate the computing power and memory bandwidth, not the squares. The squares are markers for the years.

    This was my initial guess:
    http://www.forum-3dcenter.org/vbulletin/showpost.php?p=10852347&postcount=1368

    I tweaked my calculation a bit: with 48 GFlop/W of FP32 computing power, a 1:3 ratio of FP64 to FP32 units and about 225 W TDP, you get about 10.8 TFlop/s FP32 and 3.6 TFlop/s FP64, which exactly matches the chart on page 7.

    I also assume we will see an organisation of compute units similar to GK210 (192 FP32 units with 64 dedicated FP64 units), the GK110 refresh with larger caches and register files. If you assume 1:2 or 1:4, the FP64 computing power doesn't match the chart on page 7.
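The estimate above is simple enough to reproduce. A minimal sketch, using only the numbers quoted in the post (48 GFlop/W, 225 W); the function name is made up for the example.

```python
def peak_tflops(gflops_per_watt, tdp_watts, fp64_per_fp32):
    """Return (FP32, FP64) peak throughput in TFLOP/s."""
    fp32 = gflops_per_watt * tdp_watts / 1000.0
    return fp32, fp32 * fp64_per_fp32

# Compare the three FP64:FP32 ratios discussed in the thread.
for label, ratio in (("1:2", 1 / 2), ("1:3", 1 / 3), ("1:4", 1 / 4)):
    fp32, fp64 = peak_tflops(48, 225, ratio)
    print(f"{label}: FP32 {fp32:.1f} TFLOP/s, FP64 {fp64:.1f} TFLOP/s")
```

Only the 1:3 case lands on the 3.6 TFlop/s FP64 figure; 1:2 and 1:4 give about 5.4 and 2.7 respectively.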
     
  12. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    797
    Likes Received:
    223
    That interpretation doesn't make sense for the existing chips.

    Code:
    Part   DP GFLOPS  DP GFLOPS  DP GFLOPS
           (Square)   (Caption)  (Actual)
    K80         1900       2200       1864 (base), 2912 (max boost)
    K40         1400       2000       1430 (base), 1680 (max boost)
    K20         1200       1500       1173
    M2090        600       1100        666
    
    Part   GB/s       GB/s        GB/s
           (Square)   (Caption)   (Actual)
    K40          290        360        288
    K20          210        270        208
    M2090        170        230        177
    
    Square and Caption estimates are from rough pixel counting.
    
    EDIT: Fixed K80 DP GFLOPS base value, it should be 1864 not 1870.
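A short Python sketch makes the comparison explicit. It only reuses the numbers from the table above (with the base value taken as "actual" where a boost figure is also listed); the variable and function names are illustrative.

```python
# (square estimate, caption estimate, actual value), copied from the table above
dp_gflops = {
    "K80": (1900, 2200, 1864),
    "K40": (1400, 2000, 1430),
    "K20": (1200, 1500, 1173),
    "M2090": (600, 1100, 666),
}
bandwidth_gbs = {
    "K40": (290, 360, 288),
    "K20": (210, 270, 208),
    "M2090": (170, 230, 177),
}

def closer_reading(square, caption, actual):
    """Name the chart reading that is nearer to the real figure."""
    return "square" if abs(square - actual) < abs(caption - actual) else "caption"

for name, table in (("DP GFLOPS", dp_gflops), ("GB/s", bandwidth_gbs)):
    for part, (square, caption, actual) in table.items():
        print(f"{name:9} {part:6} closer: {closer_reading(square, caption, actual)}")
```

For every existing part, the square position is the closer reading, which is why the caption-based interpretation does not hold up.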
     
    #392 iMacmatician, Nov 19, 2015
    Last edited: Feb 10, 2016
  13. Godmode

    Newcomer

    Joined:
    May 3, 2004
    Messages:
    11
    Likes Received:
    0
    My bad! And thank you for the comparison.
     
  14. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,176
    Location:
    La-la land
    Yeah, sure! But, you could theoretically have 128 SMMs that each produce a DP result in 3 cycles instead of 1, right...? :p
     
    Ext3h likes this.
  15. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    627
    Likes Received:
    414
    There would be no point in that. If you use the same ALUs for the job, you should either be doing 1/4 (waste half internal bandwidth when doing fp64, use the existing MUL resources) or 1/2 (use bandwidth efficiently, have twice the mul resources that you'd need for fp32).
     
  16. Ext3h

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    428
    Likes Received:
    497
    In fact, there would be. Think 1:2 in terms of width and MUL array utilization, but factor in that the MUL array still has a higher latency when operating on fp64, as do all the adders needed later in the FPU.
     
  17. Ext3h

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    428
    Likes Received:
    497
    OK, I feel stupid now.

    ~3:1 makes sense, but for a different reason: if you don't use 2 quarters of a 52-bit MUL array, but only two 27-bit MUL arrays with a single loop (one pass \, one pass /, one final full-width addition). What stays mostly constant is the latency of the subsequent IEEE 754 specific exponent addition and shift circuit. So it's technically 4:1, but if the tail is long enough, it looks like 3:1, since fp64 gets (almost) the same static penalty as fp32.
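One way to read the "one pass \, one pass /" description is as the two diagonals of a 2x2 grid of partial products. The sketch below follows that reading, which is an interpretation rather than something the post spells out: a 53-bit significand (52 stored bits plus the implicit leading one) is split into a 26-bit high half and a 27-bit low half, and only 27x27-bit multiplies are used.

```python
HALF = 27                 # width of one multiplier array, in bits
MASK = (1 << HALF) - 1

def mul53_two_pass(a, b):
    """Multiply two 53-bit significands using only 27x27-bit products."""
    assert 0 <= a < (1 << 53) and 0 <= b < (1 << 53)
    a_hi, a_lo = a >> HALF, a & MASK   # 26-bit high part, 27-bit low part
    b_hi, b_lo = b >> HALF, b & MASK
    # Pass 1 (the "\" diagonal): both arrays work in parallel.
    low, high = a_lo * b_lo, a_hi * b_hi
    # Pass 2 (the "/" diagonal): the two cross terms.
    cross = a_hi * b_lo + a_lo * b_hi
    # Final full-width addition of the aligned partial products.
    return (high << (2 * HALF)) + (cross << HALF) + low

x, y = (1 << 53) - 1, (1 << 52) + 12345
print(mul53_two_pass(x, y) == x * y)
```

Four partial products over two arrays means two passes through the multiplier, which is where the "technically 4:1" in terms of multiplier work comes from.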

    If you can cut the latency of the tail, the 4:1 becomes more obvious, while 2:1 means wasting die space on the MUL array in exchange for being loop-free. It is actually even closer to 1.5:1 if you don't reduce the tail for SP. Low latency in the tail comes at the cost of pipelining options, potentially requiring partially dedicated DP and SP backends.
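A toy cost model shows how a fixed tail compresses the apparent ratio. This is purely an illustrative assumption: each operation is charged its multiplier passes plus a constant tail, both in the same arbitrary time unit, and the tail values are picked only to reproduce the ratios mentioned in the post.

```python
def apparent_ratio(dp_mul_passes, tail, sp_mul_passes=1):
    """Apparent SP:DP rate if each op costs its multiplier passes plus a fixed tail.

    `tail` is the constant back-end cost, expressed in multiplier passes
    (a hypothetical unit chosen for this sketch).
    """
    return (dp_mul_passes + tail) / (sp_mul_passes + tail)

for passes in (4, 2):
    for tail in (0.0, 0.5, 1.0):
        print(f"passes={passes} tail={tail}: {apparent_ratio(passes, tail):.2f}:1")
```

With no tail the ratios are the plain 4:1 and 2:1; a tail of half a pass pulls 4:1 down to 3:1, and a tail of one full pass pulls 2:1 down to 1.5:1.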

    Architectures with a ratio worse than 4:1 aren't looping more in the MUL array, the cost comes from reusing resources in the backend. Full width for SP, looping operations or even reused function units for DP.

    Oh, and that even number ratio? Most likely to simplify scheduling to the SMMs, allowing fp64 ops only at fixed rate. Saves a lot of hassle if you can rely on having a virtually fixed pipeline length at a time (^= no simultaneous mixed operation), otherwise you would need to handle stalls.

    Hey, FPUs aren't actually that complicated :-D
     
    nnunn likes this.
  18. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    You could also do 1:3 by having 11 FP64 units and looping over 3 cycles (wasting 1/33 of the FP64 ALU slots, with clock gating). Not pretty, not likely, but not impossible.
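The 11-unit arrangement works out as follows. A minimal sketch, assuming 32-thread warps and an SP path that handles one warp per cycle; the function name is made up for the example.

```python
import math
from fractions import Fraction

WARP_SIZE = 32

def looped_dp(units, warp_size=WARP_SIZE):
    """Cycles needed per warp and the fraction of ALU slots left idle."""
    cycles = math.ceil(warp_size / units)
    slots = units * cycles
    return cycles, Fraction(slots - warp_size, slots)

for units in (8, 11, 16):
    cycles, idle = looped_dp(units)
    print(f"{units} units: 1:{cycles} vs SP, idle fraction {idle}")
```

Eleven units need three cycles to cover 32 threads, leaving one of the 33 slots unused, whereas 8 or 16 units divide the warp evenly and give 1:4 or 1:2.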

    Also, I still suspect this is finally the generation where NVIDIA makes an HPC-only chip without a rasteriser. They've already gone in that direction with K80 by making a new chip with a bigger register file etc. The next logical step is to optimise it further by removing 3D-only subsystems, and to optimise the flagship 3D GPU by keeping a very low FP64 ratio (and a smaller local memory than the HPC chip, and possibly not including NVLink).
     
  19. spworley

    Newcomer

    Joined:
    Apr 19, 2013
    Messages:
    146
    Likes Received:
    190
    Is it possible GK210 already removed the raster/video specific units and used the die savings for the doubled shared memory and register files? That would explain why the compute-only K80 is the only SKU that uses GK210.
     
  20. huebie

    Newcomer

    Joined:
    Apr 10, 2012
    Messages:
    29
    Likes Received:
    5
    It is possible in Terms of technical possibilities, but in Terms of costs I would not bet on it. ;) There may be a slow Transition from a pure rasterizer to a more elegant compute chip.
    But Keep in mind that some of the 3D functionality is still a good choice for HPC too (e.g. the TAU/TFU). What you can strike with a red pencil is definitely the UVD, Display-scanout, ROPs (not 100% sure about that, but I haven't seen any Code using that), some Things on the frontend...

    ps: sorry for small and big letters, my MS Edge Browser sets them to high or low automatically X-D
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.