- INT32 cores enable Turing GPUs to execute floating point and non-floating point operations in parallel
- 36% additional throughput for floating point operations.
- “50% improvement in delivered performance per CUDA core”.
- “50% increase in effective bandwidth on Turing compared to Pascal”.
- NVDEC decoder with HEVC YUV444 10/12-bit HDR, H.264 8K and VP9 10/12-bit HDR support
And more...
A bit of a strange decision by Nvidia to keep the number of raster pipes on TU104 the same as on TU102: the fragment scan-out rate will now outpace the ROPs.
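If I have the configurations right (my own numbers, not from the post above): both chips keep 6 GPC raster engines at roughly 16 pixels/clock each, but TU104's 256-bit bus brings it down to 64 ROPs versus 96 on TU102. That would give 6 × 16 = 96 pixels/clock coming out of the rasterizers against only 64 pixels/clock of ROP throughput, so on TU104 the ROPs would be the limiter for pure fill.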
So the question I have: are GPU shader cores already superscalar, or were the Turing ones the first to be made superscalar?
This would imply that a significant chunk of the work done by the shader core consists of integer ops. The compiler could squeeze out extra performance by arranging the compiled shader code to schedule effectively for this.
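As a rough sketch of where that integer work comes from (hypothetical CUDA example, not taken from Nvidia's material): the index and clamping math below compiles to INT32 instructions (IMAD/IADD and friends), while the accumulation compiles to FP32 FFMAs, so on Turing the two streams can occupy the INT32 and FP32 pipes concurrently, whereas on Pascal they compete for the same cores' issue slots.

```cuda
// Hypothetical 3x3 box-filter kernel: addressing work is INT32, math is FP32.
__global__ void mixed_int_fp(const float* __restrict__ in,
                             float* __restrict__ out,
                             int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;    // INT32: index math
    int y = blockIdx.y * blockDim.y + threadIdx.y;    // INT32
    if (x >= width || y >= height) return;

    float acc = 0.0f;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int cx = min(max(x + dx, 0), width  - 1); // INT32: clamp
            int cy = min(max(y + dy, 0), height - 1); // INT32: clamp
            acc += in[cy * width + cx] * 0.111111f;   // INT32 address calc feeding an FP32 FFMA
        }
    }
    out[y * width + x] = acc;                         // INT32 addressing, FP32 data
}
```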
Kepler was superscalar, but I don’t think that definition applies to Turing. Based on what Nvidia has shared so far, Volta/Turing can’t issue multiple instructions per clock from the same warp.
So, the new architecture allows FP32 instructions to run at full rate and uses the remaining 50% of issue slots to execute all other instruction types - INT32 for index/loop calculations, load/store, branches, SFU, FP64 and so on. And unlike Maxwell/Pascal, full GPU utilization doesn't require packing pairs of co-issued instructions into the same thread - each successive cycle can execute an instruction from a different thread, so one thread performing a series of FP32 instructions and another thread performing a series of INT32 instructions will load both blocks to 100%.
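To make that concrete, here is a rough issue timeline under my reading of the whitepaper (assuming one instruction issued per clock per SM partition and 16-wide FP32/INT32 units, so each warp-wide instruction occupies its pipe for two clocks):

```
clk 0: issue FP32 from warp A  -> FP32 units busy clk 0-1
clk 1: issue INT32 from warp B -> INT32 units busy clk 1-2
clk 2: issue FP32 from warp A  -> FP32 units busy clk 2-3
clk 3: issue INT32 from warp B -> INT32 units busy clk 3-4
```

Neither warp has to mix instruction types itself, yet both pipes sit at 100%.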
It doesn’t say explicitly that the instructions have to come from different warps. I always assumed it wasn’t necessary, but that may simply be wrong.
Either way: since the FP32 and INT instructions are issued in an alternating way, I don’t think you can call it superscalar?