Nvidia Pascal Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 25, 2014.

Thread Status:
Not open for further replies.
  1. Tridam

    Regular Subscriber

    Joined:
    Apr 14, 2003
    Messages:
    541
    Likes Received:
    47
    Location:
    Louvain-la-Neuve, Belgium
    Please note that Nvidia has never been crystal clear about where those DP units sit. I did my best to represent what they were willing to share. And some details are missing to keep those diagrams readable: each 32-wide SP unit should actually be two 16-wide units.
     
  2. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    Don't atomic memory ops execute in the ROP?
     
  3. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,489
    Likes Received:
    400
    Location:
    Varna, Bulgaria
    Data/texture sampling will probably remain the last dedicated HW feature of all to be removed from the pipeline. It's still so much more efficient than a pure kernel substitute. I wouldn't mind a higher quality AF alternative though, since all of the latest GPU architectures now offer direct access to the texture cache by the shaders.
     
  4. huebie

    Newcomer

    Joined:
    Apr 10, 2012
    Messages:
    29
    Likes Received:
    5
    I have no clue :) I'm not a programmer... Edit: But since atomics heavily use VRAM access, it makes sense to me.

    Agreed.
    Does anybody know what happened to the SFUs in the transition from Kepler to Maxwell? Haven't read anything about it. What exactly was in the SFU in GK110 (and below)? Some sort of extension like SSE on CPUs?
     
  5. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,489
    Likes Received:
    400
    Location:
    Varna, Bulgaria
    The SFU units are dedicated to handling higher-order, non-pipelined operations like RCP, RSQ, SIN, COS, MOV, attribute interpolation and similar. In some past NV architectures, the SFU was able to execute a parallel FMUL op under certain conditions.
     
  6. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    Noted, and interesting - do you know if each 16 wide unit executes a different warp over 2 cycles? If so, and a warp only has a few active threads, can a warp be processed in just 1 cycle?

    Regarding DP... I initially thought that having 1 DP unit per partition would be simpler than having to arbitrate for 2 shared units. But... it seems like even if your DP:SP ratio is very low, you probably want to do all your register reads for a warp at once, rather than as needed (otherwise, wouldn't there be interference with register file reads for subsequent SP ops?). And I guess then you are in the situation where you have a bunch of source data and you want to do a computation which will take a long time (because you have very few DP units). That starts looking like a texture lookup, so it seems like a good idea to ship off your DP source register values and switch warps, just like you would ship off addresses and gradients when doing a texture lookup (I don't really know what happens for texture lookups, so this is yet another assumption on my part), and you could presumably use the same datapaths for DP and texturing. So in the end, I feel like your diagram is very plausible.

    But... assuming my speculation has any relation to reality (and I am not a HW person, so that is very questionable), then I also think this approach probably makes a lot less sense if the DP:SP throughput ratio is high.
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Doesn't Maxwell have an operand collector?
     
  8. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Fermi had them, so they probably do.
     
  9. Ryan Smith

    Regular

    Joined:
    Mar 26, 2010
    Messages:
    611
    Likes Received:
    1,052
    Location:
    PCIe x16_1
  10. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    No idea. I guess my point is that you likely want something along those lines for long/unpredictable-latency ops, because you want to be able to execute other stuff in the meantime and not have to deal with unpredictable contention for register file ports. But you might want to just feed register file lookups directly to the ALUs for other kinds of operations. If my mental model of things is busted though, please let me know :)
     
  11. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    372
    Likes Received:
    309
    I don't know if my interpretation is correct, but from the last SC15 leaked slide it looks like big HPC Pascal will have a 1:2:4 ratio for FP64:FP32:FP16 precision:
    http://cdn.wccftech.com/wp-content/uploads/2015/11/NVIDIA-Pascal-GPU-Mixed-Precision.jpg
    Unified ALUs or dedicated FP64 ones, the question is still pending...

    My guess is that GP100 will have new mixed-precision ALUs to replace the old Kepler series and compete with Intel KNL, while GP102 will be the new gaming flagship with only FP32/16 Maxwell/X1-derived ALUs. Titan may be GP100 or GP102 depending on AMD pressure.

    source: http://wccftech.com/nvidia-pascal-volta-gpus-sc15/ (yes I know :embarrased:)
     
  12. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    My guess would be that we will see more dedicated units again in the future (for FP64 at least; 16/32 should integrate nicely). The reason is in the posted picture: in the past there was much more area pressure, while today the focus shifts more and more toward power pressure.
     
  13. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    372
    Likes Received:
    309
    So you think HPC Pascal will have a 1:2 ratio of FP64:FP32 ALUs? :runaway:
    I may be wrong, but I think a single mixed-precision FP16/32/64 ALU will cost less die/logic/registers than one FP64 + two FP16/32 ALUs.
     
  14. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,420
    Likes Received:
    179
    Location:
    Chania
    A wee bit more die area for quite a bit less power consumption.
     
    CarstenS likes this.
  15. itisravi

    Joined:
    Nov 25, 2015
    Messages:
    1
    Likes Received:
    0
    I am thinking Pascal will have 4096 FP32 and 2048 FP64 units. (128 FP32 & 64 FP64 per SM).

    Since DP GFlops = 2 x Freq x FP64 units, and assuming ~1000 MHz frequency, to achieve 4000+ GFLOPs it must have 2000+ FP64 units. Therefore, 2048 FP64 units make the most sense.

    For Volta, it should be 6144 FP32 and 3072 FP64 units. With a higher peak frequency of 1140 MHz, it will be able to hit 7 TFlops.

    Now as for the FP64:FP32 ratio, if it is 1:3, then

    1. Each SM must move back to "192 FP32 + 64 FP64" like K40/K80, as it will be difficult to implement 1:3 with 128 FP32 units.
    2. Pascal will have 6144 CUDA cores. I don't know if Nvidia would want to build such a big chip on a new process, considering the yield issues.
    3. Volta will have 9216 cores, which I don't think is even possible with the 16nm process.

    Looking from the perspective of consumer GPUs, it would make more sense for Nvidia to go from Maxwell (3072 cores) to Pascal (4096 cores) and then, as the process matures, to Volta (6144 cores).
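
    The peak-FLOPS arithmetic above can be sanity-checked with a quick sketch. Note that the unit counts and clocks here are this post's speculation, not confirmed specs:

    ```python
    # Peak throughput: each FMA unit does 2 FLOPs per cycle, so
    #   GFLOP/s = 2 * units * freq_MHz / 1000
    def peak_gflops(units, freq_mhz):
        return 2 * units * freq_mhz / 1000

    # Speculated Pascal: 2048 FP64 units at ~1000 MHz -> ~4096 DP GFLOP/s (4+ TFLOPs)
    print(peak_gflops(2048, 1000))

    # Speculated Volta: 3072 FP64 units at ~1140 MHz -> ~7004 DP GFLOP/s (~7 TFLOPs)
    print(peak_gflops(3072, 1140))
    ```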
     
  16. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    710
    Likes Received:
    282
    We know Pascal is 4 TFlop/s DP.
    It has also been stated FP16 would be 4x Maxwell.
    That implies 12 TFlop/s SP and 24 TFlop/s FP16.

    48 SMs, each doing 128 FP32 / 64 FP64 (in 1.5 cycles?) and 256 FP16, could do that.
    So basically twice Maxwell, with flexible FP32 / FP16 / FP64.
    (Alternatively, 32 SMs with 192 FP32 / 384 FP16 and dedicated 64 FP64 (in 1 cycle).)

    For anything less (like 4096 cores), the 1 TB/s bandwidth seems total overkill.
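
    As a rough check of the 48-SM scenario above (the SM count, per-SM unit counts and the ~1 GHz clock are all speculative assumptions, not confirmed figures):

    ```python
    # Peak TFLOP/s for a given config; FMA counts as 2 FLOPs per unit per cycle.
    def peak_tflops(sms, units_per_sm, freq_ghz):
        return 2 * sms * units_per_sm * freq_ghz / 1000

    clk = 1.0  # assumed ~1 GHz
    print(peak_tflops(48, 128, clk))  # FP32: ~12.3 TFLOP/s
    print(peak_tflops(48, 256, clk))  # FP16: ~24.6 TFLOP/s
    ```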
     
    #416 Voxilla, Nov 25, 2015
    Last edited: Nov 25, 2015
  17. superjoeyprof

    Joined:
    Mar 11, 2015
    Messages:
    8
    Likes Received:
    2
    4 TFLOPS DP is planned for the top Tesla, but it will be a dual-GPU solution like the Tesla K80.
     
    Grall and Lightman like this.
  18. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    710
    Likes Received:
    282
    Ok, that is interesting. Then we only need a Pascal with 12 TFlop/s FP32 and 2 TFlop/s FP64.
    That makes it more feasible with:
    48x SM each doing 128 FP32 / 64 FP64 (in 3 cycles) and 256 FP16
     
  19. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    773
    Likes Received:
    200
    Does the 1 TB/s memory bandwidth in the slide also refer to a dual-GPU card?

    The Tesla K80 has fewer enabled SMXs and a lower base clock than the K40, resulting in much lower FLOPS per GPU (at base clock) for the former. Since the slide from the last page seems to use the base clock of the K80 for the DP GFLOPS, I think the 4 DP TFLOPS for Pascal is also from the base clock. If the relevant Pascal chip is anywhere near Big Kepler in terms of power consumption, I would expect it to have closer to 3 DP TFLOPS in a single-GPU Tesla configuration, unless superjoeyprof's 4 TFLOPS value is unrelated to the value in the slide (in that case disregard this paragraph).
     
    pharma likes this.
  20. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    360
    Likes Received:
    252
    Why doesn't it match HBM2 bandwidth then?
    If the DP rate were 1/4 with 2 GPUs, bandwidth would be rated at 2 TBytes/second with 2x HBM2. K80 bandwidth perfectly matches its flops in the charts on page 7 - http://www.ecmwf.int/sites/default/files/HPC-WS-Posey_0.pdf
     
    pharma likes this.