Nvidia Pascal Announcement

Discussion in 'Architecture and Products' started by huebie, Apr 5, 2016.

  1. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    384
    Likes Received:
    389
    Thanks. __half2 is already the packed format (2x half precision in a 32bit word), contrary to the old __half which is just 16 significant bits in a 32 bit word.
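    For illustration, here's a minimal sketch of the two layouts using the cuda_fp16.h conversion intrinsics (the variable names are just placeholders):
    Code:
    #include <cuda_fp16.h>

    __device__ void pack_demo(float lo, float hi)
    {
        __half  s = __float2half(lo);           // one FP16 value (old-style scalar __half)
        __half2 p = __floats2half2_rn(lo, hi);  // two FP16 values packed into one 32-bit register
        // unpack again for FP32 math
        float lo_back = __low2float(p);
        float hi_back = __high2float(p);
        (void)s; (void)lo_back; (void)hi_back;  // silence unused-variable warnings
    }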
     
  2. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,175
    Location:
    La-la land
    I've seen several posts here by sebbbi and whatnot claiming that FP16 would be beneficial to graphics tasks as well. The thing is not that they didn't make FP16 2x as fast as FP32, it's that they made it 64x slower than FP32. What in the name of fuck is that. Seriously.

    At least if FP16 had been the same rate as FP32, one could have benefited from the data storage gains even if there were no performance gains, but this is ridiculous. We can only hope AMD rides to the rescue here.
     
    Lightman, Kaarlisk and CarstenS like this.
  3. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    384
    Likes Received:
    389
    Why would Nvidia possibly do FP16x2 with a dedicated unit? Is that tiny efficiency gain on the pure FP32 core, compared to a mixed-precision FPU, really worth killing FP16 altogether?

    Btw.: old FP16 should still work fine; at least I don't see why it wouldn't still be handled by the FP32 CUDA cores at truncated precision. Even though it's a really strange decision to include an FP16x2 unit at all then, if you could achieve higher throughput just by unpacking and running through the regular SP cores...
     
  4. Ryan Smith

    Regular

    Joined:
    Mar 26, 2010
    Messages:
    623
    Likes Received:
    1,095
    Location:
    PCIe x16_1
    IMO, you guys are overthinking this. FP16 is a valuable feature to the deep learning community, so NVIDIA is relegating it to Tesla.

    Note that, as best as I can currently tell, FP16 performance is only limited for CUDA (and potentially OpenCL) use cases. Sandra's DX Compute Shader test currently reports that FP16 operations run at FP32 speeds, though I don't know a ton about the test or just how much native FP16 support DXCS really offers.
     


    homerdog, BRiT, fellix and 1 other person like this.
  5. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,934
    Likes Received:
    2,264
    Location:
    Germany
    It's because it is not exposed as native in the other APIs, so it's probably running through FP32 mode there. Weird.
     
  6. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    596
    Likes Received:
    306
    FP16 is great for tons of operations in graphics where FP16 is precise enough. But if it doesn't improve the rate over FP32, then there is no reason to support FP16. So, FP16 support in the consumer version of Pascal is a joke; "deep learning only / machine learning only" is just an excuse.

    I am talking about architecture features, not about performance. Everyone can affirm without any doubt that the performance is great.
     
  7. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,934
    Likes Received:
    2,264
    Location:
    Germany
    So, that leaves only one possible reason: they crippled it in order to make sure it is not used for games (or by developers getting in cheap, like it was done for cryptocurrencies), where architectures supporting it natively could actually draw an advantage from it. Is that what you're saying?
     
  8. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    384
    Likes Received:
    389
    Because those are most likely the old, non-vectorized operations, which are essentially just single-precision ops where the result is truncated and 16 bits of the 32-bit register are wasted. They can still be used to pack the 16-bit floats for compact storage, but they can't be utilized to improve the ALU throughput.
     
    BRiT likes this.
  9. constant

    Newcomer

    Joined:
    Feb 9, 2014
    Messages:
    22
    Likes Received:
    8
    Honestly, claiming that FP16 throughput is crippled to 1/64th is slightly disingenuous, as in practice you would still be able to get full FP32 throughput WITH the benefit of truncated storage.

    Just do:

    //load
    __half2float conversion
    // Compute as if FP32
    __float2half conversion
    //store
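    Concretely, a minimal sketch of that pattern as a complete kernel (assuming the cuda_fp16.h intrinsics; the scaling math and names are illustrative):
    Code:
    #include <cuda_fp16.h>

    __global__ void scale_half(const __half* in, __half* out, float scale, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = __half2float(in[i]);   // load + up-convert
            float y = x * scale;             // compute as if FP32
            out[i] = __float2half(y);        // down-convert + store
        }
    }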
     
  10. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Also worth remembering that NVIDIA ensured some aspects of Tesla did not make it to the consumer products; there were certainly differences in Dynamic Parallelism between Tesla and consumer parts going back to the Kepler days.
    Cheers
     
  11. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    384
    Likes Received:
    389
    As long as you are using the __half type, you still get full SP throughput as it is. The problem only occurs with the new __half2 type, which is two 16-bit values packed into the low and high words of a 32-bit register.
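    For illustration, the scalar and packed arithmetic intrinsics look like this (a sketch; the half arithmetic intrinsics require __CUDA_ARCH__ >= 530, and the names here are placeholders):
    Code:
    #include <cuda_fp16.h>

    __device__ void add_demo(__half a, __half b, __half2 pa, __half2 pb)
    {
    #if __CUDA_ARCH__ >= 530
        __half  s = __hadd(a, b);      // scalar form: one FP16 add
        __half2 p = __hadd2(pa, pb);   // packed form: two FP16 adds in one instruction
        (void)s; (void)p;
    #endif
    }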
     
    BRiT likes this.
  12. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    You're mistaken on this. The FP16 instructions in Nvidia's implementation use dedicated FMA units that consume a serious amount of transistors and die area. Just like FP64: the big xx0 chip has full rate units, but all the xxy chips have just enough units to allow software to run.
     
    pharma likes this.
  13. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,031
    Likes Received:
    5,576
    Yet IMG/ PowerVR has had dedicated FP16 units in their mobile GPUs for years, and we've seen developers go on record saying that FP16 is enough for a number of shading effects and using the smaller variables when possible translates into lower bandwidth requirements, higher performance and lower power consumption. IIRC, GCN was initially praised for bringing dedicated instructions for FP16.

    So which one is it?
     
    pixelio likes this.
  14. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,934
    Likes Received:
    2,264
    Location:
    Germany
    Are you saying that GP100 has
    - 3,840 FP32 ALUs and
    - 1,920 FP64 ALUs and
    - what, 7,680 FP16 ALUs,

    all tied to the same register files and serviced by the same schedulers? Sounds like a massive waste of resources to my layman's mind.

    Maybe the FP64 units rather act as quad-speed FP16 ones? So they could easily adjust architectures - like reducing THEM back to 1/8th speed for consumer products?
     
    Lightman, Grall and BRiT like this.
  15. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    384
    Likes Received:
    389
    Option B: the GP100 actually has all mixed-precision ALUs, while the GP104 has mostly single-mode FPUs and only one or two full mixed-precision cores per SMM?

    The fundamental assumption that all CUDA cores are equal is possibly wrong. It might actually not be 128 CUDA cores on the GP104, but more like 127+1, except that this time the 127 are the incomplete ones.
     
  16. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Just to add: why does the Pascal Tesla architecture only show FP64 and FP32 cores if there were dedicated FP16 ALUs?

    Also quoting NextPlatform from their Pascal Tesla article - it looks like part of this information came from discussions with Jonah Alben (involved with GPU design at NVIDIA for many years).
    Cheers
     
    Lightman, pixelio, pharma and 2 others like this.
  17. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    384
    Likes Received:
    389
    I don't think that quote actually refers to the CUDA cores used in the GP104.

    If my suspicion is correct, then GP104 isn't fully equipped with the "new" cores, but still uses old single-precision cores for the most part, plus only a couple of the new ones to provide at least formally the same instruction set. Essentially the same way it only includes enough double-precision cores to formally fulfill the specification, but not enough that you could actually make good use of them.

    With the GP100, Nvidia apparently didn't have to make such a compromise: every single FP32 core supports the new vectorized FP16 ops, and there are enough FP64 cores to saturate the feeds as well.

    In order to verify that, we would need to test whether the throughput is also limited for other instructions newly added with Pascal, at least for those where one could assume they would require a significant amount of additional die space to implement.
     
  18. pixelio

    Newcomer

    Joined:
    Feb 17, 2014
    Messages:
    47
    Likes Received:
    75
    Location:
    Seattle, WA
    A few points:
    1. Scalar fp16<>fp32 conversions aren't free. They're 32 ops/clock in CUDA. Note that you can get "free" type conversions on read-only texture loads but that's not what we're discussing.
    2. Type conversions really begin to add up. For example, a simulated fp16x2 FMA operation would require 8 type conversions, two fp32 FMAs and, in my example, one register "merge". That's 11 ops!
    3. A native full-rate fp16x2 FMA would have throughput of 2 FMAs per clock.
    4. I don't recall the source, but I thought I read that an fp16 unit is like 1/3rd the size of an fp32 unit. Presumably fp16x2 would be 2/3rds. That's not "free".
    5. An fp16 is suitable for representing color and performing blends. It's also useful for image processing and saves bandwidth.
    6. But why the excitement over native fp16x2 in a shader/compute kernel? Register pressure! You can squeeze more of them into a register file while doubling your throughput.
    7. An incomplete list of platforms with "wide" fp16x2 support: Intel Gen 8+, PowerVR, Adreno 530, Tegra X1, GP100, ...
    Here's a simulated fp16x2 FMA:
    Code:
    #include <cuda_fp16.h>

    // Packed FP16 FMA: one native HFMA2 on sm_53+, otherwise simulated with FP32 math.
    __device__ half2 fma_half2(half2 a, half2 b, half2 c)
    {
    #if __CUDA_ARCH__ >= 530
      return __hfma2(a, b, c);   // single fp16x2 FMA instruction
    #else
      // 8 type conversions, two fp32 FMAs and a register merge (see the SASS below)
      return __floats2half2_rn(fmaf( __low2float(a),  __low2float(b),  __low2float(c)),
                               fmaf(__high2float(a), __high2float(b), __high2float(c)));
    #endif
    }
    
    Here's the SASS for the simulated hfma2():
    Code:
    F2F.F32.F16 R3, R0;
    F2F.F32.F16 R6, R0.H1;
    F2F.F32.F16 R4, R1;
    F2F.F32.F16 R7, R1.H1;
    F2F.F32.F16 R5, R2;
    F2F.F32.F16 R8, R2.H1;
    FFMA R3, R3, R4, R5;
    FFMA R4, R6, R7, R8;
    F2F.F16.F32 R0, R3;
    F2F.F16.F32 R5, R4;
    XMAD.PSL.CLO R0, R5, 0x1, R0;
    While the native hfma2() is just a single instruction:
    Code:
    HFMA2 R3, R0, R1, R2;
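    For completeness, a hypothetical kernel using the helper above on a packed array (the kernel name and launch setup are illustrative, not from the original post):
    Code:
    __global__ void fma_half2_kernel(const half2* a, const half2* b, half2* c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = fma_half2(a[i], b[i], c[i]);  // two FP16 FMA lanes per call
    }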
     
    #1178 pixelio, May 29, 2016
    Last edited: May 29, 2016
    nnunn, Grall, Alessio1989 and 5 others like this.
  19. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    So you do not think they can tweak the FP32 core in the P100 (Tesla in general) back to being 1:1 for FP16 when released for consumers?
    They also mention that in theory they could do FP16 over FP64, giving 1:4 in one cycle with these CUDA cores, it's just not doable due to other constraints.
    Anyway, it makes me wonder where this will be going with Volta, which could be a monster in terms of mixed-precision capability.

    Cheers
     
  20. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    I was under the same impression: that FP16 was good enough precision-wise for certain things and there's performance to be had by using it.
     