Recent content by pixelio

  1. pixelio

    Pascal FP16/FP32/INT8 Performance

    More on Kepler vs. Maxwell integer instruction counts and throughput here: https://devtalk.nvidia.com/default/topic/804281/maxwell-integer-mul-mad-instruction-counts/
  2. pixelio

    Nvidia Pascal Announcement

    FWIW, that's exactly what GPU-Z reports (42-46W) when I drive the GT3e to 99%. GPU-Z reports far less power being used by a 384-core Quadro K620 for the same workload (38% of TDP). Presumably Pascal will be even more efficient. But is there a good description anywhere of how a GeForce card...
  3. pixelio

    Nvidia Pascal Announcement

    I've been hacking some compute kernels on an Intel Broadwell GT3e lately and it's a fascinating GPU. I'm still tracking down some seemingly odd GEN codegen for 64-bit load/stores to local memory but, otherwise, for my use case performance seems to be competitive with a similarly spec'd discrete...
  4. pixelio

    Nvidia Pascal Reviews [1080XP, 1080ti, 1080, 1070ti, 1070, 1060, 1050, and 1030]

    Perfect. Thanks, that answers the question. And you're correct about 90%... ~86% seems to be the ceiling on my cards (which is exactly 275 out of 320). So the CUDA folks will have to file some more bug reports.
  5. pixelio

    Nvidia Pascal Reviews [1080XP, 1080ti, 1080, 1070ti, 1070, 1060, 1050, and 1030]

    Has anyone verified that a GDDR5X 1080 can actually achieve over ~300 GB/sec of bandwidth? Is there a good OpenGL synthetic that can test this? No one on the CUDA forums has broken ~230 GB/sec. Maybe the 10x0 devices are similar to Maxwell v2 and also default to a stock MEM clock power state...
  6. pixelio

    Nvidia Pascal Announcement

    Kepler and Maxwell can't perform any fp16 computations unless there are some triple-secret non-CUDA instructions that I don't know about. I'm guessing the source of confusion here is that you can get "free" conversion from fp16 to fp32 if you pull your data through the texture hardware...
  7. pixelio

    Nvidia Pascal Reviews [1080XP, 1080ti, 1080, 1070ti, 1070, 1060, 1050, and 1030]

    You are definitely seeing an HFMA2 op in the SASS. I was just pointing out that the 1080 throughput you report seems very close to the throughput of an FMA built out of the half/half2 data conversion intrinsics which are available on pre-sm_53 GPUs. So perhaps it's microcoded?
  8. pixelio

    Nvidia Pascal Reviews [1080XP, 1080ti, 1080, 1070ti, 1070, 1060, 1050, and 1030]

    No... sm_50 and sm_52 do not have this capability. If they did, you would simply be able to access half-words and would achieve 64 ops/clock/SMM. So 2*M*N fp16 vs. 1*M*N fp32 takes 6x more time? A simulated (necessary for pre-sm_53) half2 FMA takes 11 instructions but you get 2 fp16 FMAs per thread...
  9. pixelio

    ARM Bifrost Architecture

    It looks good! The other new "feature" is that you can easily calculate total AFUs by multiplying the number of Bifrost cores by 12. :) A Bifrost 32 core design would have 384 AFUs.
  10. pixelio

    Nvidia Pascal Announcement

    I agree! Maxwell is a fantastic development platform and Pascal appears to be all of that but faster, cheaper, more efficient and better, with new instructions (dpXa) and other as-yet-unknown improvements. FP16x2 is a neat feature and it was sort of expected after the Tegra X1 (sm_53) and...
  11. pixelio

    Nvidia Pascal Announcement

    A few points: Scalar fp16<>fp32 conversions aren't free. They're 32 ops/clock in CUDA. Note that you can get "free" type conversions on read-only texture loads but that's not what we're discussing. Type conversions really begin to add up. For example, a simulated fp16x2 FMA operation would...
  12. pixelio

    Nvidia Pascal Announcement

    Inspection of GP104's fp16x2 and new dot product-accumulate (dpXa) SASS instructions is happening over here.
  13. pixelio

    Nvidia Pascal Announcement

    It seems more likely they're dedicated units? My original guess of 16 ops/clock imagined that 8 fp16x2 units (16 fma16/clock) would've shipped paired with 8 FP64 units (4 fma64/clock) lifted from the GP100. We'll all know soon once CUDA devs start digging. Or NVIDIA could just tell us. :)
  14. pixelio

    Nvidia Pascal Announcement

    The code was generated from explicit PTX but CUDA intrinsics would have the same result.
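
For reference, a few of the back-of-envelope figures quoted in the snippets above (the 275-out-of-320 bandwidth ceiling in post 4, the 11-instruction simulated half2 FMA in post 8, and the 12-AFUs-per-core rule in post 9) can be reproduced with a short Python sketch. The function names are made up for illustration. The per-FMA instruction cost is only a rough proxy for run time: it assumes every instruction issues at the same rate, which post 11 suggests is not the case for fp16<>fp32 conversions.

```python
# Back-of-envelope checks for figures quoted in the posts above.
# All function names here are illustrative, not part of any library.

def bandwidth_efficiency(achieved_gb_s: float, peak_gb_s: float) -> float:
    """Fraction of peak memory bandwidth actually achieved."""
    return achieved_gb_s / peak_gb_s

def bifrost_afus(cores: int, afus_per_core: int = 12) -> int:
    """Total AFUs, using the 12-per-core rule from post 9."""
    return cores * afus_per_core

def instructions_per_fp16_fma(instructions: int = 11, fmas: int = 2) -> float:
    """Instruction cost per fp16 FMA for the simulated half2 FMA in post 8.

    Assumption: treats every instruction as equally expensive.
    """
    return instructions / fmas

if __name__ == "__main__":
    print(f"{bandwidth_efficiency(275, 320):.1%}")  # 85.9%
    print(bifrost_afus(32))                         # 384
    print(instructions_per_fp16_fma())              # 5.5
```

The 5.5 instructions per simulated fp16 FMA, against a single instruction for a native fp32 FMA, is in the same ballpark as the "6x" slowdown questioned in post 8.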