Is fp16 enough for the transform matrix?
If you want to position something on a 2048 pixel screen, then yes, otherwise, no.
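
For reference, fp16 has a 10-bit mantissa (11 significant bits), so integers are representable exactly only up to 2048. A quick numpy check (this is plain IEEE half precision, nothing GPU specific):

```python
import numpy as np

# fp16 value spacing: 0.5 in [512, 1024), 1 in [1024, 2048), 2 in [2048, 4096)
for x in (1023.5, 2047.0, 2048.0, 2049.0, 4095.0):
    print(f"{x:>7} -> {float(np.float16(x)):>7}")
```

2049.0 already rounds back to 2048.0, so past a 2048 pixel screen you can't even hit individual pixel centers, let alone sub-pixel positions.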

> If you want to position something on a 2048 pixel screen, then yes, otherwise, no.

Tiling, perhaps? It makes sense for all the interpolation work, and devs should be able to work around the precision issue in screen space easily enough. It would save a lot of space in various distributed caches.
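
A toy numpy sketch of that screen-space workaround (the 256 px tile size is just an assumption for illustration): store coordinates relative to a tile origin, and fp16 keeps sub-pixel precision even on a wide framebuffer:

```python
import numpy as np

TILE = 256  # hypothetical tile size for a framebuffer wider than 2048 px

for gx in (3000.25, 3000.5):                 # global x coordinates > 2048
    direct = np.float16(gx)                  # stored directly in fp16: rounds
    tile_origin = (gx // TILE) * TILE        # tile origin kept in fp32/int
    local = np.float16(gx - tile_origin)     # tile-local offset: exact in fp16
    print(f"global={gx}: fp16 direct={float(direct)}, "
          f"tile origin + fp16 local={tile_origin + float(local)}")
```

Both global values collapse to 3000.0 when stored directly in fp16, while the tile-local form round-trips exactly.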

> The output is FP32 (which has 23 bits of mantissa). So you'll be able to position things far more precisely than that.

That won't improve accuracy much. For deep learning it works because the operations sum many consecutive FP16 multiplications, essentially accumulating the results of many binary operations. In graphics I can't think of many good examples of doing that. Maybe kinematics or tessellation, where the results can be a bit more fungible with longer dependency chains. It will be interesting to see what devs come up with, since a decade ago tessellation and kinematics weren't practical enough to warrant anything less than FP24. Blending is mostly multiplication, but downsampling would involve some addition.
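
That accumulation point is easy to see on the CPU with a toy numpy dot product (just an illustration of the rounding behaviour, not actual tensor-core code):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(10_000).astype(np.float16)
b = rng.random(10_000).astype(np.float16)

acc16 = np.float16(0.0)  # fp16 multiply, fp16 accumulate
acc32 = np.float32(0.0)  # fp16 multiply, fp32 accumulate (the tensor-core scheme)
for x, y in zip(a, b):
    p = x * y                      # fp16 product
    acc16 = np.float16(acc16 + p)  # small products round away as the sum grows
    acc32 += np.float32(p)
ref = np.dot(a.astype(np.float64), b.astype(np.float64))

print(f"fp16 accumulate: {float(acc16):8.1f}")  # stalls far below the true sum
print(f"fp32 accumulate: {float(acc32):8.1f}")  # tracks the fp64 reference
print(f"fp64 reference : {ref:8.1f}")
```

Once the fp16 running sum passes 2048, its value spacing is 2, so every remaining sub-1.0 product rounds to nothing; the fp32 accumulator has mantissa to spare.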

> Weird results - comparing V100 OpenCL and Cuda

Below, P100 vs V100 on CUDA:
Benchmark              | P100                         | DGX-1 V100
CUDA Score             | 320031                       | 743537
Sobel                  | 528482 (23.3 Gpixels/sec)    | 1382119 (60.9 Gpixels/sec)
Histogram Equalization | 455379 (14.2 Gpixels/sec)    | 996475 (31.1 Gpixels/sec)
SFFT                   | 66489 (165.7 Gflops)         | 101670 (253.5 Gflops)
Gaussian Blur          | 538403 (9.43 Gpixels/sec)    | 1897300 (33.2 Gpixels/sec)
Face Detection         | 49263 (14.4 Msubwindows/sec) | 108700 (31.7 Msubwindows/sec)
RAW                    | 1139825 (11.0 Gpixels/sec)   | 2743361 (26.6 Gpixels/sec)
Depth of Field         | 571644 (1.66 Gpixels/sec)    | 1499040 (4.35 Gpixels/sec)
Particle Physics       | 397917 (62904.7 FPS)         | 786603 (124350.1 FPS)

> Well, you still need to take into account the large increase in CUDA cores; you could work out the potential per-SM performance increase for those types of workloads. You can also look at a perf/mm² comparison, since V100 is massive.

Yes, sure, but if you look at the Sobel, Histogram, and Gaussian Blur scores, they all show roughly a 3x performance improvement, which is way beyond the V100 CUDA core increase...
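
Quick back-of-the-envelope from the table above (3584 vs 5120 FP32 CUDA cores are the published P100/V100 specs; boost clocks are close enough to ignore here):

```python
# Geekbench CUDA scores from the table above: (P100, V100)
scores = {
    "Sobel":                  (528482, 1382119),
    "Histogram Equalization": (455379,  996475),
    "Gaussian Blur":          (538403, 1897300),
    "SFFT":                   ( 66489,  101670),
}
core_ratio = 5120 / 3584  # V100 vs P100 FP32 core count, ~1.43x

for name, (p100, v100) in scores.items():
    total = v100 / p100
    print(f"{name:22s} {total:.2f}x total, {total / core_ratio:.2f}x per core")
```

Even after dividing out the core count, Sobel and Gaussian Blur still show large per-core gains, while SFFT barely moves past parity.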

> Yes, sure, but if you look at the Sobel, Histogram, and Gaussian Blur scores, they all show roughly a 3x performance improvement, which is way beyond the V100 CUDA core increase...

I'm not saying the numbers are wrong or in any way incomprehensible. I'm saying that you can't simply extrapolate the potential V100 performance delta from these numbers, and that additional factors, such as CUDA core count, need to be considered as well. It's stupid clickbait articles from the usual sites (like the "source" here) that lead to nvidiots posting on forums claiming that GV104 will be 2-3 times as fast as Pascal.

> It's stupid clickbait articles from the usual sites (like the "source" here) that lead to nvidiots posting on forums claiming that GV104 will be 2-3 times as fast as Pascal.

Hopefully, not a single sane person will claim that :|

> Weird results - comparing V100 OpenCL and Cuda

NVIDIA often supports OpenCL only at the bare minimum level. GP100 scores 320K with CUDA, but only 278K with OpenCL.

> NVIDIA often supports OpenCL only at the bare minimum level. GP100 scores 320K with CUDA, but only 278K with OpenCL.

True, and it's also interesting that Volta shows a much bigger gap between OpenCL and CUDA performance than Pascal does. Maybe NVIDIA put all their initial effort into CUDA, leaving OpenCL behind for now...

V100 compute uarch seems extremely solid.