In the general case, yes. For inference, and the heavy contribution of the 8-bit dot products that feed each wide layer of a neural net, it is apparently very significant.

Current GPUs are balanced around their fp32 performance. Suddenly increasing ALU performance by 4:1 without scaling up anything else isn't going to bring 4x performance gains.
It's particularly so in the case of the MI8, which AMD is marketing for inference: a 150W @ 5.7 TFLOPS Polaris 10, versus the nearest inference-targeted Nvidia Pascal, the 50-70W @ 5.5 TFLOPS @ 22 INT8 TOPS Tesla P4.
A lot gets forgiven if there's an order of magnitude difference in effectiveness, and that's without even considering Nvidia's superior positioning in this space.
The Vega-based MI25, if it is ~300W @ 25 TFLOPS FP16 with INT8 throughput >= 25 TOPS, would be much closer to the 250W @ 47 INT8 TOPS P40, perhaps enough to compensate for those other bottlenecks.
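For context on what those INT8 TOPS figures are counting: the core operation is a packed 4-element dot product of 8-bit values accumulated into a 32-bit integer. The plain scalar C++ below is just an illustration of what one such ALU op does per lane (the function name and scalar form are mine, not vendor code):

```cpp
#include <cstdint>

// Illustrative sketch only: what a single packed INT8 dot-product-accumulate
// ALU operation computes. Four signed 8-bit values per 32-bit operand,
// products summed into a 32-bit accumulator.
int32_t dot4_i8_accum(uint32_t a, uint32_t b, int32_t acc)
{
    for (int i = 0; i < 4; ++i) {
        int8_t ai = static_cast<int8_t>((a >> (8 * i)) & 0xFFu);
        int8_t bi = static_cast<int8_t>((b >> (8 * i)) & 0xFFu);
        acc += static_cast<int32_t>(ai) * static_cast<int32_t>(bi);
    }
    return acc;
}
```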
The promotion step is expensive in terms of occupancy and power, and in certain networks it is actually possible to shave things below 8-bit INT multiplication with 32-bit accumulation, potentially making it denser. Much of the functionality falls out of the pre-existing type handling of GPU hardware and the re-emerging reduced-precision path, with some change on top of it to provide easier use or more physical throughput. Perhaps AMD considers it moot, since there are indications that some of the consumers of inference hardware are looking for dedicated hardware, or what it's putting out there as being suited for inference is the best it can rush out.

GPUs already have narrow typed memory loads/stores (RGBA8, RGBA16, RGBA16f), so there's no need to extract/pack 8-bit values manually (in shader code). Your neural network memory layout (and off-chip bandwidth) would be exactly the same as it would be if you only had fp16 or fp32 ALUs.
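For anyone following along, the manual extract/pack that typed loads make unnecessary is just shifts and masks over a 32-bit word, something like the sketch below (helper names are mine). The point stands either way: four 8-bit weights occupy the same 32-bit word in memory whether the ALUs are 8-, 16-, or 32-bit wide, so layout and bandwidth don't change.

```cpp
#include <array>
#include <cstdint>

// Sketch of the manual unpack/pack a shader would otherwise need:
// four 8-bit values travel in one 32-bit word regardless of ALU width.
std::array<uint8_t, 4> unpack_rgba8(uint32_t word)
{
    return { static_cast<uint8_t>(word         & 0xFF),
             static_cast<uint8_t>((word >> 8)  & 0xFF),
             static_cast<uint8_t>((word >> 16) & 0xFF),
             static_cast<uint8_t>((word >> 24) & 0xFF) };
}

uint32_t pack_rgba8(const std::array<uint8_t, 4>& v)
{
    return  static_cast<uint32_t>(v[0])
         | (static_cast<uint32_t>(v[1]) << 8)
         | (static_cast<uint32_t>(v[2]) << 16)
         | (static_cast<uint32_t>(v[3]) << 24);
}
```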
That's a generalization for most code, but it seems marketable in this niche.

4x "faster" ALU would hit other GPU bottlenecks in most code.
I am trying to track down a paper where AMD's GPUs were used for radio signal analysis and were made to perform double-rate 8-bit math, with some creative packing into a 32-bit register and periodic overflow handling to keep the two portions from interfering with one another.

Also, I am sure that future games will use neural networks for many tasks. Super-resolution, antialiasing, and temporal techniques are the most promising ones currently. But there have also been physics simulation papers where the network was trained to solve fluid Navier-Stokes equations, etc.
That was on hardware that predates SDWA, if I recall correctly.
SDWA could reduce the overhead of working through a rigidly 32-bit hardware path, and more benefit could come if future hardware actually increased throughput as well.
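The packing trick probably looked something like the following hypothetical reconstruction in plain C++ (not the paper's actual code): two 8-bit samples share one 32-bit register, one per 16-bit half, so a single 32-bit multiply yields both 8x8 products, and the sums are drained into full-width accumulators before a carry can cross between the halves.

```cpp
#include <cstdint>

// Hypothetical reconstruction of double-rate 8-bit math on 32-bit hardware.
// Each product is at most 255*255 = 65025, which fits in its 16-bit field,
// so the low half cannot carry into the high half within one multiply.
// Partial sums are drained every iteration, before a carry could cross fields.
void packed_dual_mac(const uint8_t* a0, const uint8_t* a1,
                     const uint8_t* b, int n,
                     uint32_t& acc0, uint32_t& acc1)
{
    for (int i = 0; i < n; ++i) {
        uint32_t packed_a = static_cast<uint32_t>(a0[i])
                          | (static_cast<uint32_t>(a1[i]) << 16);
        uint32_t packed_p = packed_a * b[i];   // two 8x8 products in one multiply

        acc0 += packed_p & 0xFFFFu;            // low-half product: a0[i] * b[i]
        acc1 += packed_p >> 16;                // high-half product: a1[i] * b[i]
    }
}
```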
That is what I've been alluding to as a possible way to close some of the marketing/performance gap in the space the MI line is targeting.

Can't you already do generic 8-bit register processing using SDWA in GCN3/GCN4? This modifier allows instructions to access registers at 8/16/32-bit granularity (4+2+1 = 7 choices). No extra instructions to pack/unpack data needed. We need to wait until GCN5 ISA documents are published to know exactly how SDWA interacts with 16b packed ops.
Getting SDWA to feed into double-rate hardware could carry the benefits of the compressed representation through the entire pipeline. Possibly there are ways of fiddling with the representation, given the more arbitrary extraction, to get a packing level other than 4, such as 3 or 2 with the precision shifting between the fields.
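Purely as an illustration of the layout idea (made-up field widths, in the same spirit as the existing R11G11B10-style formats), a packing level of 3 could look like this:

```cpp
#include <cstdint>

// Hypothetical 3-wide packing: three fields of 11/11/10 bits in one 32-bit
// word, trading one lane of the 4x8-bit layout for a couple of extra bits
// of precision per value.
struct Unpacked3 { uint32_t x, y, z; };

Unpacked3 unpack_11_11_10(uint32_t word)
{
    return { word         & 0x7FFu,    // 11 bits
             (word >> 11) & 0x7FFu,    // 11 bits
             (word >> 22) & 0x3FFu };  // 10 bits
}
```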
I am curious whether there are scenarios, such as image processing, where something like the built-in delta-compression hardware could be elaborated on. In cases of high correlation, or some of the trivial-reject instances, a control word for a tile could resolve down to a broadcast (or a high-performance scalar thread, if one were interested), although I'm not sure how highly correlated the values would be. The more forgiving error margins with inference might allow a level of fudging, possibly if the control word plus information on the reference value indicates that a tile's values do not rise to a level sufficient to perturb the output for a whole swath of inputs/neurons.
However, there are various reasons why the current way of things wouldn't help as much, such as the compression being incoherent and not directly queryable. I suppose the math could be done in compute, it's just that the dedicated hardware, synchronization, and automatic generation of metadata would not be available.
Metadata like that, or a way to add some programmable behavior to the compression/correlation path, if queried prior to wavefront generation, could even do something like change the allocation requirements for a kernel, or not launch one at all in trivial cases. That kind of tiling, plus the hierarchical nature of some ways data can be compressed, plus maybe address-range monitoring for edits to tiles, could suppress more calculations that wind up doing little. That's just me thinking to myself without too much time to reflect, however.
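A toy version of that early-out, with completely made-up metadata fields (this is not GCN's actual DCC metadata, just the shape of the idea): if a tile's recorded deviation from its reference value is below what the network can absorb, the per-element work collapses to a broadcast.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical tile metadata: a representative value plus the largest
// deviation from it within the tile.
struct TileMeta {
    uint8_t reference;   // representative value for the tile
    uint8_t max_delta;   // largest deviation from the reference in the tile
};

float tile_contribution(const TileMeta& meta,
                        const uint8_t* tile, std::size_t count,
                        const float* weights, float tolerance)
{
    // Cheap path: deviations too small to perturb the output, so treat the
    // whole tile as a broadcast of the reference value.
    if (meta.max_delta <= tolerance) {
        float weight_sum = 0.0f;
        for (std::size_t i = 0; i < count; ++i)
            weight_sum += weights[i];
        return meta.reference * weight_sum;
    }

    // Full path: per-element multiply-accumulate over the tile.
    float acc = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        acc += tile[i] * weights[i];
    return acc;
}
```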
***edit: Apologies for the late edit, but the correct MI product is the MI6, as the TFLOP count would indicate.