AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Current GPUs are balanced around their fp32 performance. Suddenly increasing ALU performance by 4:1 without scaling up anything else isn't going to bring 4x performance gains.
In the general case, yes. For inference, though, with the heavy contribution of the 8-bit dot products that feed each wide layer of a neural net, it is apparently very significant.
It's particularly so in the case of the MI8 product AMD is marketing as an inference product, which is a 150W, 5.7 TFLOPS Polaris 10, versus the nearest inference-targeted Nvidia Pascal: the 50-75W, 5.5 TFLOPS, 22 INT8 TOPS Tesla P4.
A lot gets forgiven if there's an order of magnitude difference in effectiveness without even considering Nvidia's superior positioning in this space.
The Vega MI25, if it lands at ~300W with 25 TFLOPS of FP16 and at least 25 INT8 TOPS, would be much closer to the 250W, 47 INT8 TOPS P40, perhaps close enough to compensate for those other bottlenecks.

GPUs already have narrow typed memory loads/stores (RGBA8, RGBA16, RGBA16f), so there's no need to extract/pack 8-bit values manually (in shader code). Your neural network memory layout (and off-chip bandwidth) would be exactly the same as it would be if you only had fp16 or fp32 ALUs.
The promotion step is expensive in terms of occupancy and power, and in certain networks it is actually possible to shave things below 8-bit INT multiplication with 32-bit accumulation, potentially making it denser still. Much of the functionality falls out of the pre-existing type handling of GPU hardware and the re-emerging reduced-precision path, with some change on top of that to provide easier use or extra physical throughput. Perhaps AMD considers it moot, since there are indications that some of the consumers of inference hardware are looking for dedicated silicon, or perhaps what it's positioning as suited for inference is simply the best it can rush out.
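To make that concrete, here is a minimal scalar sketch (plain C, illustrative names only, not any vendor's intrinsic) of the packed-INT8 dot product with 32-bit accumulation being discussed: four signed 8-bit lanes per 32-bit word, multiplied pairwise and summed into a 32-bit accumulator, so nothing needs promotion to fp16/fp32 along the way.

Code:
#include <stdint.h>

/* Scalar model of a packed 4x INT8 dot product with 32-bit accumulation.
 * Four signed 8-bit lanes sit in each 32-bit word; each product is at
 * most 16 bits wide, and the running sum stays in a 32-bit accumulator. */
static int32_t dot4_i8_accum(uint32_t a, uint32_t b, int32_t acc)
{
    for (int lane = 0; lane < 4; ++lane) {
        int8_t ai = (int8_t)(a >> (8 * lane));  /* lane's low byte, signed */
        int8_t bi = (int8_t)(b >> (8 * lane));
        acc += (int32_t)ai * (int32_t)bi;
    }
    return acc;
}

A dp4a-style hardware unit does this per instruction, which is where the 4x ALU-density claim comes from.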

4x "faster" ALU would hit other GPU bottlenecks in most code.
That's a generalization for most code, but it seems marketable in this niche.


Also, I am sure that future games will use neural networks for many tasks. Super-resolution, antialiasing and temporal techniques are the most promising ones currently. But there have also been physics-simulation papers where the network was trained to solve the fluid Navier-Stokes equations, etc.
I am trying to track down a paper where AMD's GPUs were used for radio signal analysis and were made to perform double-rate 8-bit math with some creative packing into a 32-bit register and periodic overflow handling to keep the two portions from interfering with one another.
That was on hardware that predates SDWA, if I recall correctly.
SDWA could reduce the overhead related to using a rigidly 32-bit hardware path, and possibly more benefit could come if future hardware actually increased throughput as well.
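I couldn't find the paper either, but the trick described sounds like a fairly standard SWAR arrangement. A rough C illustration (the layout and names are my own guesses, not the paper's): two 8-bit operands share a 32-bit register with guard bits between them, so one 32-bit add accumulates both, and the "periodic overflow handling" is just unpacking and resetting before the low field's carries can reach the high field.

Code:
#include <stdint.h>

/* Two 8-bit lanes per 32-bit register, separated by 8 guard bits:
 *   [ hi lane: bits 16..23 ][ guard: bits 8..15 ][ lo lane: bits 0..7 ] */
static uint32_t pack2(uint8_t hi, uint8_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Accumulate both lanes with a single 32-bit add per pair of inputs. */
static uint32_t add2(uint32_t acc, uint8_t hi, uint8_t lo)
{
    return acc + pack2(hi, lo);
}

/* Safe for up to 256 accumulations of max-value bytes: the low lane's
 * partial sum stays within its 8 guard bits and never disturbs the high
 * lane.  The periodic cleanup is simply unpacking before that point. */
static void unpack2(uint32_t acc, uint32_t *hi_sum, uint32_t *lo_sum)
{
    *lo_sum = acc & 0xFFFFu;   /* low lane plus its guard bits */
    *hi_sum = acc >> 16;       /* high lane, untouched by low-lane carries */
}

Multiplies are messier to keep separated, which is presumably where the creative part of that paper came in.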

Can't you already do generic 8-bit register processing using SDWA in GCN3/GCN4? This modifier allows instructions to access registers at 8/16/32-bit granularity (4+2+1 = 7 choices), so no extra instructions are needed to pack/unpack data. We need to wait until the GCN5 ISA documents are published to know exactly how SDWA interacts with 16-bit packed ops.
That is what I've been alluding to as a possible way to close some of the marketing/performance gap in the space the MI line is targeting.
Getting SDWA to feed into double-rate hardware could carry the benefits of the compressed representation through the entire pipeline. Possibly there are also ways of fiddling with the representation, given the more arbitrary extraction, to get a packing level other than 4, such as 3 or 2, with precision shifted between the fields.
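For contrast, this is roughly the overhead that SDWA-style operand selection removes; in plain C terms (illustrative only, not ISA code), every packed byte otherwise costs explicit shift/mask work before it can feed the 32-bit ALU, whereas a byte-select modifier folds that extraction into the consuming instruction.

Code:
#include <stdint.h>

/* Without byte-granularity operand selects, each packed 8-bit lane needs
 * explicit unpacking before the multiply; with SDWA-style selects the
 * extraction becomes an operand modifier instead of separate shift/mask
 * instructions.  Double-rate hardware would then also raise throughput
 * rather than merely saving the unpack step. */
static int32_t mul_packed_lane(uint32_t a, uint32_t b, int lane)
{
    int32_t ai = (int8_t)((a >> (8 * lane)) & 0xFF);  /* explicit unpack */
    int32_t bi = (int8_t)((b >> (8 * lane)) & 0xFF);  /* explicit unpack */
    return ai * bi;                                   /* the actual work */
}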


I am curious whether there are scenarios such as image processing where something like the built-in delta-compression hardware could be built upon. In cases of high correlation, or some of the trivial-reject instances, a control word for a tile could resolve down to a broadcast (or a high-performance scalar thread, if one were interested), although I'm not sure how highly correlated the values would be. The more forgiving error margins with inference might point to a level of fudging being possible, particularly if the control word plus information on the reference value indicates that a tile's values do not rise to a level sufficient to perturb the output for a whole swath of inputs/neurons.

However, there would be various reasons why the current way of things wouldn't help as much, such as the compression being incoherent and not able to be queried directly. I suppose the math could be done in compute, just that the dedicated hardware, synchronization, and automatic generation of metadata would not be available.
Metadata like that, or a way to add some programmable behavior to the compression/correlation path, could be queried prior to wavefront generation and could even do something like change the allocation requirements for a kernel, or not launch one at all in trivial cases. That kind of tiling, plus the hierarchical nature of some of the ways data can be compressed, plus perhaps address-range monitoring for edits to tiles, could suppress more calculations that wind up doing little. That's just me thinking to myself without too much time to reflect, however.
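To put the hand-waving in slightly more concrete form, a hypothetical sketch of what consulting a per-tile control word before doing per-element work might look like (none of the names or fields reflect real DCC metadata formats):

Code:
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-tile control word of the kind delta-color-compression
 * hardware already tracks, consulted before doing per-element work. */
enum tile_mode { TILE_UNIFORM, TILE_NEGLIGIBLE, TILE_FULL };

struct tile_meta {
    uint8_t mode;        /* one of tile_mode                 */
    float   reference;   /* valid when mode == TILE_UNIFORM  */
};

/* Accumulate one tile's contribution to a neuron's pre-activation. */
static float tile_contribution(const struct tile_meta *m,
                               const float *values,    /* tile elements    */
                               const float *weights,   /* matching weights */
                               size_t n)
{
    switch (m->mode) {
    case TILE_NEGLIGIBLE: {          /* below the error margin: skip entirely */
        return 0.0f;
    }
    case TILE_UNIFORM: {             /* broadcast: one value times summed weights */
        float wsum = 0.0f;
        for (size_t i = 0; i < n; ++i)
            wsum += weights[i];
        return m->reference * wsum;
    }
    case TILE_FULL:
    default: {                       /* no shortcut available */
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i)
            sum += values[i] * weights[i];
        return sum;
    }
    }
}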

***edit: Apologies for the late edit, but the correct MI product is the MI6 product, as the TFLOP count would indicate.
 
I would imagine that 4xINT8 becomes useful in power-limited scenarios like self-driving cars. The cameras in a specific car are a well-known quantity, and parts of the inferencing algorithm could be optimized for INT8. Quite likely the feed from the cameras would also be preprocessed to make the inferencing algorithm more efficient.
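As a rough illustration of that preprocessing (the names and calibration scheme are mine, not anything a particular vendor does): with the camera's value range known in advance, a single per-tensor scale can map fp32 samples into INT8 once, ahead of the 4xINT8 inference path.

Code:
#include <stdint.h>
#include <stddef.h>
#include <math.h>

/* Symmetric per-tensor quantization: one scale chosen offline from the
 * known camera/activation range maps fp32 values into signed 8-bit. */
static void quantize_i8(const float *in, int8_t *out, size_t n, float max_abs)
{
    float scale = max_abs / 127.0f;        /* from offline calibration */
    for (size_t i = 0; i < n; ++i) {
        float q = roundf(in[i] / scale);
        if (q >  127.0f) q =  127.0f;      /* clamp into int8 range */
        if (q < -128.0f) q = -128.0f;
        out[i] = (int8_t)q;
    }
}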

Even fp16 as-is might not be too awesome for the learning side, at least not without understanding the context-specific math well and doing a proper implementation.

There is much outside of games that is done with neural networks. Limiting thinking to the gaming context would be a mistake.

From Nvidia, it will be interesting to see what Volta is and whether it gets deployed to supercomputers during 2017.
 
Cross-posting here as it relates to the discussion.

A native FP64 high-performance scalar to absorb the special functions, with MADD, MUL, and DP in the SIMD, could, I'd think, get away with it without significantly increasing logic. A similar concept may work for 4xINT8: disable the SIMD and use all the decoding logic to feed the scalar. Not quite the same as 4xINT8 across the SIMD, but it probably wouldn't bottleneck as easily. It might even be possible to consider INT4 at that point.
 
and how would the driver interpret that?

Edit: I'm asking this because they still need to maintain backward compatibility in whatever they are doing. Keep in mind that even for the pieces in cars, they are trying to make those pieces "upgradeable" in the future.
 
and how would the driver interpret that?
Same way it currently does when an instruction doesn't follow the cadence? Most of the implementation I'd expect to be hidden from the driver and programmer. The scalar could probably be made to run every instruction the SIMD does. I doubt it would be particularly efficient, but CPUs can run GPU code. As for the other way around, masking off all but the first lane of the SIMD would be simple enough.

The way I see the implementation working shouldn't affect backwards compatibility; it would just offer a whole host of new options as the scalar and SIMD get specialized a bit more. It only works if they are separate units to schedule against within a SIMD. Even without my scalar idea, they could take a vector instruction and farm it off to the scalar if the required instructions existed. I'm sure AMD could already make ISA instructions to serialize entire vectors if they wanted. The ROCm stack they keep talking about is already designed to move across CPU and GPU, and that's already a scalar-to-SIMD transition right there.

If they added FP32, I could see FP64 along with some special functions being an interesting addition; FP64 special functions would be particularly nasty to implement efficiently. In the past I thought some infrequent instructions were limited to the first lane; I'd have to go double-check that, however. Even in the Instinct video they mention running 4-bit kernels, and we already know byte-level addressing exists. It would be a lot more efficient with a scalar on each SIMD possessing an LDS-style crossbar and some internal registers. All the old HSA documents mentioned fusing CPU, DSP, and GPU. A CU-level scalar, 4x SIMDs, and, say, 4x DSP/scalar alongside the SIMDs would do it. Surely someone at AMD thought of that combination after all their HSA work on that very concept back around 2012.
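On the 4-bit point, purely as an illustration of the storage side (not AMD's actual layout): eight signed 4-bit values fit per 32-bit word, with the unpack being a shift, a mask, and a sign-extend.

Code:
#include <stdint.h>

/* Eight signed 4-bit lanes per 32-bit word, each value in [-8, 7]. */
static uint32_t pack_i4(const int8_t v[8])
{
    uint32_t w = 0;
    for (int i = 0; i < 8; ++i)
        w |= ((uint32_t)v[i] & 0xFu) << (4 * i);
    return w;
}

static int8_t unpack_i4(uint32_t w, int lane)
{
    int8_t nibble = (int8_t)((w >> (4 * lane)) & 0xFu);
    return (nibble & 0x8) ? (int8_t)(nibble - 16) : nibble;  /* sign-extend */
}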
 
This preview is coming earlier than I thought.
Let's hope this also means the cards themselves are coming within a couple of months at most. Otherwise it'll just become off-putting.
 
AMD sent an email where they state: AMD has 83% share of the global VR system market. This year at GDC, we took the stage with Crytek, Fox Studios, Microsoft, Oculus, Rebellion, Sulon, Ubisoft, and more to announce AMD's great VR wins. AMD has been named the exclusive GPU technology provider for Crytek's VR First™ initiative. AMD Radeon Pro Duo, combined with the AMD LiquidVR™ API, is the world's fastest and most powerful virtual reality (VR) creator platform.

So it seems that this will be a good year for AMD (it has to be, for them) and that they are dominating(?) the (yet very small) VR market.
 
AMD sent an email where they state: AMD has 83% share of the global VR system market. This year at GDC, we took the stage with Crytek, Fox Studios, Microsoft, Oculus, Rebellion, Sulon, Ubisoft, and more to announce AMD's great VR wins. AMD has been named the exclusive GPU technology provider for Crytek's VR First™ initiative. AMD Radeon Pro Duo, combined with the AMD LiquidVR™ API, is the world's fastest and most powerful virtual reality (VR) creator platform.

So it seems that this will be a good year for AMD (it has to be, for them) and that they are dominating(?) the (yet very small) VR market.
This was last year's GDC press release and is no longer relevant, IMO. They sure as hell don't have an 83% share of anything.
 
AMD sent an email where they state: AMD has 83% share of the global VR system market. This year at GDC, we took the stage with Crytek, Fox Studios, Microsoft, Oculus, Rebellion, Sulon, Ubisoft, and more to announce AMD's great VR wins. AMD has been named the exclusive GPU technology provider for Crytek's VR First™ initiative. AMD Radeon Pro Duo, combined with the AMD LiquidVR™ API, is the world's fastest and most powerful virtual reality (VR) creator platform.

So it seems that this will be a good year for AMD (it has to be, for them) and that they are dominating(?) the (yet very small) VR market.

I wonder how they get to 83%. The 970 is officially VR-worthy and it sold ridiculously well (it's found in nearly 5% of all Steam machines surveyed).

I've become a little disillusioned with AMD marketing in the past couple of years. They end up lying (or nearly lying) to dress up products that don't match the hype.

It's hard to stay optimistic about Vega when you hear about them throwing out statistics that are obviously bogus (albeit not directly related to Vega in this case).

I'd be happy if they got within spitting distance.

No kidding; just getting close would do wonders for their laptop business. The likes of Apple can't be too tickled about being limited to Polaris 11 levels of performance when GP106 is just kicking ass and taking names.
 
I wouldn't call Polaris a complete failure, as it is also used in the PS4 Pro, and Vega will be used in the XB1+.
Nobody is complaining there, and checkerboarding is even perceived as a good solution for 4K.
Nvidia isn't even competing.


Polaris didn't gain anything for AMD; although it improved considerably over the previous-gen products, Pascal put Polaris right back to where AMD was before.

Let's take a look at the "<150 watts" power consumption claim, and then the live presentations of P11 vs. the GTX 950; that was all a shame. Performance expectations were just blown up by the internet because of the perf/watt information AMD stated. Without reference points, people expected more.
 