Recent content by pixelio

Pascal FP16/FP32/INT8 Performance

More on Kepler vs. Maxwell integer instruction counts and throughput here: https://devtalk.nvidia.com/default/topic/804281/maxwell-integer-mul-mad-instruction-counts/
- pixelio
- Post #9
- Aug 14, 2016
- Forum: Architecture and Products
Nvidia Pascal Announcement

FWIW, that's exactly what GPU-Z reports (42-46W) when I drive the GT3e to 99%. GPU-Z reports far less power being used by a 384-core Quadro K620 for the same workload (38% of TDP). Presumably Pascal will be even more efficient. But is there a good description anywhere of how a GeForce card...
- pixelio
- Post #1,927
- Jul 28, 2016
- Forum: Architecture and Products
Nvidia Pascal Announcement

I've been hacking some compute kernels on an Intel Broadwell GT3e lately and it's a fascinating GPU. I'm still tracking down some seemingly odd GEN codegen for 64-bit load/stores to local memory but, otherwise, for my use case performance seems to be competitive with a similarly spec'd discrete...
- pixelio
- Post #1,912
- Jul 28, 2016
- Forum: Architecture and Products
Nvidia Pascal Reviews [1080XP, 1080ti, 1080, 1070ti, 1070, 1060, 1050, and 1030]

Perfect. Thanks, that answers the question. And you're correct about 90%... ~86% seems to be the ceiling on my cards (which is exactly 275 out of 320). So the CUDA folks will have to file some more bug reports.
- pixelio
- Post #490
- Jul 20, 2016
- Forum: Architecture and Products
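
The ~86% ceiling quoted in the post above is achieved bandwidth divided by theoretical peak. A minimal sketch of that arithmetic, assuming both numbers are in GB/s (the post gives them only as "275 out of 320"); the function name is hypothetical:

```python
def bandwidth_efficiency(achieved_gb_s, peak_gb_s):
    """Return achieved memory bandwidth as a fraction of theoretical peak."""
    if peak_gb_s <= 0:
        raise ValueError("peak bandwidth must be positive")
    return achieved_gb_s / peak_gb_s


# Figures from the post: 275 achieved out of 320 peak (units assumed GB/s).
ratio = bandwidth_efficiency(275, 320)
print(f"{ratio:.1%}")  # 85.9%
```
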
Nvidia Pascal Reviews [1080XP, 1080ti, 1080, 1070ti, 1070, 1060, 1050, and 1030]

Has anyone verified that a GDDR5X 1080 can actually achieve over ~300GB/sec of bandwidth? Is there a good OpenGL synthetic that can test this? No one on the CUDA forums has broken ~230 GB/sec. Maybe the 10x0 devices are similar to Maxwell v2 and also default to a stock MEM clock power state...
- pixelio
- Post #486
- Jul 20, 2016
- Forum: Architecture and Products
Nvidia Pascal Announcement

Kepler and Maxwell can't perform any fp16 computations unless there are some triple-secret non-CUDA instructions that I don't know about. I'm guessing the source of confusion here is that you can get "free" conversion from fp16 to fp32 if you pull your data through the texture hardware...
- pixelio
- Post #1,221
- Jun 1, 2016
- Forum: Architecture and Products
Nvidia Pascal Reviews [1080XP, 1080ti, 1080, 1070ti, 1070, 1060, 1050, and 1030]

You are definitely seeing an HFMA2 op in the SASS. I was just pointing out that the 1080 throughput you report seems very close to the throughput of an FMA built out of the half/half2 data conversion intrinsics which are available on pre-sm_53 GPUs. So perhaps it's microcoded?
- pixelio
- Post #284
- Jun 1, 2016
- Forum: Architecture and Products
Nvidia Pascal Reviews [1080XP, 1080ti, 1080, 1070ti, 1070, 1060, 1050, and 1030]

No... sm_50 and sm_52 do not have this capability. If they did, you could simply access half-words and would achieve 64 ops/clock/SMM. So 2*M*N fp16 vs. 1*M*N fp32 takes 6x more time? A simulated (necessary for pre-sm_53) half2 FMA takes 11 instructions but you get 2 fp16 FMAs per thread...
- pixelio
- Post #279
- Jun 1, 2016
- Forum: Architecture and Products
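
To make the "simulated half2 FMA" idea concrete, here is a rough host-side emulation in Python, not the CUDA code the post refers to. It shows the same shape of work: unpack two fp16 lanes from a 32-bit word, widen them, do the multiply-add per lane, then round and repack. The helper names are hypothetical, and the arithmetic runs in Python's double precision rather than fp32, so rounding could differ from real hardware in rare cases.

```python
import struct


def unpack_half2(word):
    """Split a 32-bit word into its (low, high) fp16 lanes as Python floats."""
    return struct.unpack("<2e", struct.pack("<I", word))


def pack_half2(lo, hi):
    """Round two floats to fp16 and pack them into one 32-bit word."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]


def simulated_hfma2(a, b, c):
    """Emulate a per-lane a*b+c on packed fp16x2 words via widening conversions."""
    a_lo, a_hi = unpack_half2(a)
    b_lo, b_hi = unpack_half2(b)
    c_lo, c_hi = unpack_half2(c)
    # Each lane is computed at wider precision, then rounded back to fp16.
    return pack_half2(a_lo * b_lo + c_lo, a_hi * b_hi + c_hi)


a = pack_half2(1.5, 2.0)
b = pack_half2(2.0, 0.5)
c = pack_half2(0.25, 1.0)
print(unpack_half2(simulated_hfma2(a, b, c)))  # (3.25, 2.0)
```

Every conversion in this sketch corresponds to extra work on a GPU without native fp16 arithmetic, which is why the simulated version costs far more than a single instruction.
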
ARM Bifrost Architecture

It looks good! The other new "feature" is that you can easily calculate total AFUs by taking the number of Bifrost cores and multiplying by 12. :) A Bifrost 32 core design would have 384 AFUs.
- pixelio
- Post #3
- May 30, 2016
- Forum: Mobile Graphics Architectures and IP
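
The AFU count from the post above, written out as a quick sketch; the constant and function names are hypothetical, and the multiplier of 12 is taken directly from the post:

```python
AFUS_PER_BIFROST_CORE = 12  # multiplier quoted in the post


def total_afus(cores):
    """Return the total AFU count for a Bifrost design with `cores` cores."""
    return cores * AFUS_PER_BIFROST_CORE


print(total_afus(32))  # 384
```
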
Nvidia Pascal Announcement

I agree! Maxwell is a fantastic development platform and Pascal appears to be all of that but faster, cheaper, more efficient and better, with new instructions (dpXa) and other as-yet-unknown improvements. FP16x2 is a neat feature and it was sort of expected after the Tegra X1 (sm_53) and...
- pixelio
- Post #1,193
- May 30, 2016
- Forum: Architecture and Products
Nvidia Pascal Announcement

A few points: Scalar fp16<>fp32 conversions aren't free. They're 32 ops/clock in CUDA. Note that you can get "free" type conversions on read-only texture loads but that's not what we're discussing. Type conversions really begin to add up. For example, a simulated fp16x2 FMA operation would...
- pixelio
- Post #1,178
- May 29, 2016
- Forum: Architecture and Products
Nvidia Pascal Announcement

Inspection of GP104's fp16x2 and new dot product-accumulate (dpXa) SASS instructions is happening over here.
- pixelio
- Post #1,149
- May 28, 2016
- Forum: Architecture and Products
Nvidia Pascal Announcement

It seems more likely they're dedicated units? My original guess of 16 ops/clock imagined that 8 fp16x2 units (16 fma16/clock) would've shipped paired with 8 FP64 units (4 fma64/clock) lifted from the GP100. We'll all know soon once CUDA devs start digging. Or NVIDIA could just tell us. :)
- pixelio
- Post #1,147
- May 28, 2016
- Forum: Architecture and Products
Nvidia Pascal Announcement

The code was generated from explicit PTX but CUDA intrinsics would have the same result.
- pixelio
- Post #1,146
- May 28, 2016
- Forum: Architecture and Products
Nvidia Pascal Announcement
- pixelio
- Post #1,141
- May 28, 2016
- Forum: Architecture and Products