AMD Polaris Rumors and Discussion


I am very interested in that Radeon 530. If there's a desktop version, I have to wonder whether it's a laptop-only GPU (Iceland) paired with a display I/O chip!
On the other hand, if you go to the product pages you quoted and click on "Supported Rendering Format" and "Connectivity", it says "No" to everything: no to every H264/H265 video codec and no to every kind of monitor output.

So if this thing supports "Desktop"... either there's an actual desktop version and their listings are a bit wrong, or their sense of humor is weird. Or are they making a headless "graphics" board that only pairs through PCIe 8x with an Excavator / Bristol Ridge AM4 APU (same GCN tech and similar performance, for CrossFire)?
Is it strictly an OEM product? That would make the most sense. Perhaps the "desktops" are all-in-ones or SFF machines, basically built from laptop hardware?


I take AMD's word on Linux support (that's a motivation to get low-end GCN 1.2), but without outputs I think this will be a dead weight that only lets me execute OpenCL code (let's see, what do I want a Linux-supported graphics card for... smooth graphics, hardware H264 and little games, or OpenCL-accelerated offline rendering? lol)

http://www.amd.com/en/products/graphics/radeon-530
http://www.amd.com/en/products/graphics/radeon-520

Will be banging my head for the next hour wondering what it's about.

(not a Polaris, but close enough)
 
I wonder if you could use the Tensor Cores' narrow capabilities effectively for some post-process effects. Maybe as part of some GameWorks libraries? It would be fitting for Nvidia and could be leveraged as an incentive against both AMD cards and older-gen Nvidia cards: "Get xyz-effect for free on Volta-based cards".

Recently, I found this while scavenging eBay for some retro hardware. It is from the description of an x87 co-processor of the 286 era.
[attached image: ebcwQiW.jpg]

The reference to dramatic improvement in graphics applications made me smile, even though at that time, it was likely correct.
 
I wonder if you could use the Tensor Cores' narrow capabilities effectively for some post-process effects.
Tensor core operation = Multiply two fp16 4x4 matrices and add a third one (fp32). Output a 4x4 matrix as fp32.

Because the multiply precision is only fp16, it obviously isn't enough for most coordinate system transforms (world, view, projection, etc.). But you could do color space transforms with it. Not that common in games, however. I am much more interested in double-rate fp16 & int16 with regard to games. These will provide tangible perf boosts without the need to rewrite your algorithms.
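To make the color-space-transform idea concrete, here's a minimal numpy sketch of the D = A×B + C operation described above, assuming fp16-stored operands and fp32 accumulation; the approximate BT.709-style matrix and the random pixel values are made up for the example, not taken from any API:

```python
import numpy as np

def tensor_core_mma(a_fp16, b_fp16, c_fp32):
    # fp16-stored operands; matmul and accumulation carried out in fp32
    return a_fp16.astype(np.float32) @ b_fp16.astype(np.float32) + c_fp32

# Made-up BT.709-style RGB->YCbCr matrix padded to 4x4, applied to 4 pixels at
# once -- the kind of color-space transform mentioned above.
color_mat = np.array([[ 0.2126,  0.7152,  0.0722, 0.0],
                      [-0.1146, -0.3854,  0.5000, 0.5],
                      [ 0.5000, -0.4542, -0.0458, 0.5],
                      [ 0.0,     0.0,     0.0,    1.0]], dtype=np.float16)

pixels = np.random.rand(4, 4).astype(np.float16)  # columns = [R, G, B, 1] per pixel
pixels[3, :] = 1.0
bias = np.zeros((4, 4), dtype=np.float32)

ycbcr = tensor_core_mma(color_mat, pixels, bias)  # 4x4 fp32 output
print(ycbcr)
```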
 
Tensor core operation = Multiply two fp16 4x4 matrices and add a third one (fp32). Output a 4x4 matrix as fp32.

Because the multiply precision is only fp16, it obviously isn't enough for most coordinate system transforms (world, view, projection, etc.). But you could do color space transforms with it. Not that common in games, however. I am much more interested in double-rate fp16 & int16 with regard to games. These will provide tangible perf boosts without the need to rewrite your algorithms.
Yeah, double-rate FP16 in the general processing cores seems much more useful, especially given all the research that has possibly gone into it from mobile and consoles.

But I am also looking forward to seeing if anything useful can be done with the Tensor ALUs. Obviously, they ARE quite specialized, but OTOH, there's quite a bit of calculating power to be tapped into. What I cannot assess, though, is whether or not this warrants looking into algorithms from a different point of view, especially when it's unclear if the consumer-grade GPUs will have those Tensor blocks at all.
 
Tensor core operation = Multiply two fp16 4x4 matrices and add a third one (fp32). Output a 4x4 matrix as fp32.

Because the multiply precision is only fp16, it obviously isn't enough for most coordinate system transforms (world, view, projection, etc.). But you could do color space transforms with it. Not that common in games, however. I am much more interested in double-rate fp16 & int16 with regard to games. These will provide tangible perf boosts without the need to rewrite your algorithms.

I thought Nvidia's tensor unit did the multiplication operation in FP32 and the FP16 limitation was purely for the storage of the inputs, hence why the output is an FP32 matrix.

Granted, that's according to this post:

https://forum.beyond3d.com/posts/1980946/

Volta is a really interesting design; it looks almost like an ASIC for DL.

As for the tensor core, I believe Nvidia has already made it clear: it uses FP16 only for storage, does the multiply ops in full precision (FP32), and then adds the result to an FP32 variable.

That's why they use SgemmEx to benchmark against Pascal, since SgemmEx does exactly the same thing (in contrast to the Hgemm routine): it loads the data in various precisions but does the computation (multiply + add) in full (FP32) precision.

Which means the tensor core is a full-precision matrix multiplication unit with FP16 data input; that's why Nvidia is more confident putting this tensor core to work not just on inference/forecasting but also on training the network.

And since the computation is fully FP32, just like SgemmEx, the precision loss is limited to the FP16 storage stage, so I can think of many applications outside the DL domain that could benefit from the vast computing resources GV100 offers.
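As a rough illustration of that distinction (plain numpy standing in for the cuBLAS routines, not actual SgemmEx/Hgemm calls): with fp16-stored inputs, doing the multiply-adds in fp32 leaves only the storage rounding, while rounding every partial sum back to fp16 degrades the result noticeably. The vector length and values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x64 = rng.random(4096)
y64 = rng.random(4096)
ref = x64 @ y64                                   # fp64 reference

x16, y16 = x64.astype(np.float16), y64.astype(np.float16)

# SgemmEx-style: fp16 storage, fp32 multiply + accumulate
fp32_math = float(x16.astype(np.float32) @ y16.astype(np.float32))

# Hgemm-style: round every product and every partial sum back to fp16
acc = np.float16(0.0)
for xi, yi in zip(x16, y16):
    acc = np.float16(acc + np.float16(xi * yi))

print("fp64 reference:      ", ref)
print("fp16 data, fp32 math:", fp32_math, "abs error", abs(fp32_math - ref))
print("fp16 all the way:    ", float(acc), "abs error", abs(float(acc) - ref))
```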
 
Obviously, they ARE quite specialized, but OTOH, there's quite a bit of calculating power to be tapped into.
Thinking about the arrangement, tensor cores shouldn't be that different from the 16-wide SIMDs in Polaris, but with the adders doubled for 2xFP32. Then run two operations. Double again to maximize register throughput with accumulation and four operations per lane. An accumulator is just an adder with a single storage register.

That could be Vega's RF cache with 4xFP32 accumulators per lane. MUL is faster than FMA, so it boosts clocks a bit while requiring two operands. It doesn't work well if the FMA result needs to be flushed, but I'm guessing in most cases it gets used shortly after and discarded.

I thought Nvidia's tensor unit did the multiplication operation in FP32 and the FP16 limitation was purely for the storage of the inputs, hence why the output is an FP32 matrix.
It's a question of significant figures. A repeating decimal has as many bits of precision as you care to hold on to. Then if the exponents are in different domains, the result is discarded anyway. To my understanding, that's why fuzzy/DL math drives engineers crazy but works in practice. The results are more or less Boolean.
 
I thought Nvidia's tensor unit did the multiplication operation in FP32 and the FP16 limitation was purely for the storage of the inputs, hence why the output is an FP32 matrix.
In a real FP32 mul, you enter the multiplier with two 23-bit inputs and end up with a 23-bit output.
In the Tensor core case, if it doesn't throw away any bits in the multiplication, you enter with two 10-bit inputs and end up with 20 bits (which then enter an adder with 23 bits).

So even in the highest-precision case, it doesn't really make sense to say that the multiply is FP32, because both the input and the output have fewer bits than a real FP32.
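A quick numpy check of that bit-counting argument (an experiment on the number formats, not on the hardware): the product of two fp16 values is always exactly representable in fp32, since two ~11-bit significands need at most ~22 bits and fp32 has 24, so a wider multiplier cannot add precision beyond what the fp16 inputs already lost.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(1_000_000).astype(np.float16)
b = rng.random(1_000_000).astype(np.float16)

prod32 = a.astype(np.float32) * b.astype(np.float32)   # "FP32 multiply" of fp16 data
prod64 = a.astype(np.float64) * b.astype(np.float64)   # exact products of the fp16 values

# True: the FP32 products carry no rounding of their own; all the loss already
# happened when the inputs were stored as fp16.
print(np.array_equal(prod32.astype(np.float64), prod64))
```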
 
But it's fine because they changed the specification on the page right? Consumers will know to look for the faster version if they care? :rolleyes:
 
In a real FP32 mul, you enter the multiplier with two 23-bit inputs and end up with a 23-bit output.
In the Tensor core case, if it doesn't throw away any bits in the multiplication, you enter with two 10-bit inputs and end up with 20 bits (which then enter an adder with 23 bits).

So even in the highest-precision case, it doesn't really make sense to say that the multiply is FP32, because both the input and the output have fewer bits than a real FP32.

Yeah, Nvidia mentions the only 32-bit option is for the accumulate.
But then the latest approach from Nvidia and Baidu showed real-world accuracy as good as 32-bit for training/inferencing when using loss scaling, and performance improved even with the additional processing involved; I linked the paper some time ago. Yeah, it requires fine-tuning for each DL solution, and it remains to be seen how many will adopt this with Volta.
In summary, as Nvidia states:
There are several options to choose the loss scaling factor. The simplest one is to pick a constant scaling factor. We trained a number of feed-forward and recurrent networks with Tensor Core math for various tasks with scaling factors ranging from 8 to 32K (many networks did not require a scaling factor), matching the network accuracy achieved by training in FP32. However, since the minimum required scaling factor can depend on the network, framework, minibatch size, etc., some trial and error may be required when picking a scaling value. A constant scaling factor can be chosen more directly if gradient statistics are available. Choose a value so that its product with the maximum absolute gradient value is below 65,504 (the maximum value representable in FP16).

A more robust approach is to choose the loss scaling factor dynamically. The basic idea is to start with a large scaling factor and then reconsider it in each training iteration. If no overflow occurs for a chosen number of iterations N then increase the scaling factor. If an overflow occurs, skip the weight update and decrease the scaling factor. We found that as long as one skips updates infrequently the training schedule does not have to be adjusted to reach the same accuracy as FP32 training. Note that N effectively limits how frequently we may overflow and skip updates. The rate for scaling factor update can be adjusted by picking the increase/decrease multipliers as well as N, the number of non-overflow iterations before the increase. We successfully trained networks with N = 2000, increasing scaling factor by 2, decreasing scaling factor by 0.5, many other settings are valid as well. Dynamic loss-scaling approach leads to the following high-level training procedure:
  1. Maintain a master copy of weights in FP32.
  2. Initialize S to a large value.
  3. For each iteration:
    1. Make an FP16 copy of the weights.
    2. Forward propagation (FP16 weights and activations).
    3. Multiply the resulting loss with the scaling factor S.
    4. Backward propagation (FP16 weights, activations, and their gradients).
    5. If there is an Inf or NaN in weight gradients:
      1. Reduce S.
      2. Skip the weight update and move to the next iteration.
    6. Multiply the weight gradient with 1/S.
    7. Complete the weight update (including gradient clipping, etc.).
    8. If there hasn’t been an Inf or NaN in the last N iterations, increase S.
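A framework-agnostic sketch of that dynamic procedure (not Nvidia's code; forward_backward, the data argument, and the 0.01 learning rate are placeholders for whatever your framework provides):

```python
import numpy as np

def train_with_dynamic_loss_scaling(weights_fp32, data, num_iters, forward_backward,
                                    S=2.0 ** 15, N=2000, inc=2.0, dec=0.5):
    # Step 1: weights_fp32 is the FP32 master copy. Step 2: S starts large.
    iters_since_overflow = 0
    lr = 0.01  # placeholder learning rate
    for _ in range(num_iters):
        w16 = weights_fp32.astype(np.float16)                # 3.1 FP16 copy of the weights
        grads16 = forward_backward(w16, data, loss_scale=S)  # 3.2-3.4 fwd/bwd on S-scaled loss
        if not np.all(np.isfinite(grads16)):                 # 3.5 Inf/NaN in gradients?
            S *= dec                                         #     reduce S ...
            iters_since_overflow = 0
            continue                                         #     ... and skip this update
        grads32 = grads16.astype(np.float32) / S             # 3.6 multiply gradients by 1/S
        weights_fp32 -= lr * grads32                         # 3.7 update the FP32 master copy
        iters_since_overflow += 1
        if iters_since_overflow >= N:                        # 3.8 N clean iterations: raise S
            S *= inc
            iters_since_overflow = 0
    return weights_fp32
```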

For a framework such as Caffe2 they mention:
Caffe2 includes support for FP16 storage and Tensor Core math. To achieve optimum performance, you can train a model using Tensor Core math and FP16 mode on Caffe2.

When training a model on Caffe2 using Tensor Core math and FP16, the following actions take place:
  • Prepare your data. You can generate data in FP32 and then cast it down to FP16. The GPU transforms path of the ImageInput operation can do this casting in a fused manner.
  • Forward pass. Since data is given to the network in FP16, all of the subsequent operations will run in FP16 mode, therefore:
    • Select which operators need to have both FP16 and FP32 parameters by setting the type of Initializer used. Typically, the Conv and FC operators need to have both parameters.
    • Cast the output of the forward pass, before SoftMax, back to FP32.
    • To enable Tensor Core, pass enable_tensor_core=True to ModelHelper when representing a new model.
  • Update the master FP32 copy of the weights using the FP16 gradients you just computed. For example:
    • Cast up gradients to FP32.
    • Update the FP32 copy of parameters.
    • Cast down the FP32 copy of parameters to FP16 for the next iteration.
  • Gradient scaling.
    • To scale, multiply the loss by the scaling factor.
    • To descale, divide LR and weight_decay by the scaling factor.
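Boiled down to plain numpy (not Caffe2 code; grads_fp16, lr, weight_decay, and loss_scale are assumed inputs, and the descale is applied to the gradients directly rather than folded into lr/weight_decay as the notes suggest), the per-iteration master-weight bookkeeping looks roughly like this:

```python
import numpy as np

def master_weight_update(master_w32, grads_fp16, lr=0.01, weight_decay=1e-4,
                         loss_scale=1024.0):
    g32 = grads_fp16.astype(np.float32) / loss_scale      # cast gradients up + descale
    master_w32 -= lr * (g32 + weight_decay * master_w32)  # update the FP32 copy (SGD + L2)
    return master_w32.astype(np.float16)                  # FP16 copy for the next iteration
```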
 
It really depends on what they do with clocks. It could be faster in some cases. Fewer cores and less power at higher clocks would improve geometry performance, which would be beneficial in that segment. It's not uncommon to see AMD having difficulty feeding the cores in poorly optimized titles without async.
 
Looks like there are some stock issues with the normal 560s, so they're filling the inventory with these instead. They're clocked a lot lower as well, so there'd be a significant performance gulf. Some "new" 560s are priced higher than in-stock real 560s.
 
AMD Statement About Radeon RX 560 896 shader SKUs
We just received an official response from AMD to this observation, this is the official statement:

“It’s correct that 14 Compute Unit (896 stream processors) and 16 Compute Unit (1024 stream processors) versions of the Radeon RX 560 are available. We introduced the 14CU version this summer to provide AIBs and the market with more RX 500 series options. It’s come to our attention that on certain AIB and etail websites there’s no clear delineation between the two variants. We’re taking immediate steps to remedy this: we’re working with all AIB and channel partners to make sure the product descriptions and names clarify the CU count, so that gamers and consumers know exactly what they’re buying. We apologize for the confusion this may have caused.”

AMD subsequently changed the specifications on its website without informing the public or the media about it.
http://www.guru3d.com/news-story/amd-radeon-rx-560-statement-from-amd-on-896-shader-skus.html
 
I couldn't find a webpage for the "560D" on AMD's website. They just updated the RX 560 product page to include the lower specs, and the expression "560D" cannot be found anywhere on the page. But now they blame others for not being clear?
AMD should have just called it RX 555, but no, that would be too simple and easy for AMD, as their PR department is always looking for some action, like a workaholic bomb squad.
 
Is it public relations that would be responsible for deciding the product numbering?
At least part of the problem is built into what board partners are doing, and I don't think they count as part of the general public.

The stealth update to the AMD product site might be somewhat on marketing, although if a marketer's bosses say everything has to fit in the 560 category, they don't have many good options. At this point, since there are already 560 cards out there, marketing wouldn't be able to walk it back.

Is there a known date for the spec page change, or when these salvage 560 products made it to market? Given lead times, it seems like this must have been initiated a fair amount of time before it was noticed by consumers.
 