In that particular deep learning part, I think AMD is literally just saying that the card will meet the PCIe spec limit of 300W.
But yeah, for consumer desktop cards, we're probably stuck at 250W for the simple reason that OEMs are used to it. They all have cooling solutions for exactly that amount of heat in a blower form factor, so it'd be a drop-in for them. Remember that OEMs actually care about this stuff.
Yeah, and Polaris met the <150W? Guess what, they used the same figure for MI6, so I wouldn't trust any of their TDP figures, notwithstanding that it is the max they will use and it's probably right around there.
Regarding Vega's PCB, it seems Raja Koduri showed a non-functional prototype of what he called a "Vega Cube", which is four Vega boards attached together perpendicularly.
Although the interposers aren't there (they obviously want to keep the chip's size a secret for a while longer), we can see that Vega 10 can (and probably will) be used with small PCBs.
He called it a "100TF cube", suggesting it has the performance of 4x MI25 cards.
I guess that thing would have to be watercooled, because we're looking at >1kW within the cube alone.
I wonder if it would run Crysis.
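For reference, the arithmetic behind both figures above, assuming roughly 25 TFLOPS of FP16 per MI25 (as the name suggests) and the 300W PCIe board limit mentioned earlier:

$$4 \times 25\ \mathrm{TFLOPS\ (FP16)} = 100\ \mathrm{TFLOPS}, \qquad 4 \times 300\ \mathrm{W} = 1.2\ \mathrm{kW}$$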
To my understanding the GPUs are supposed to be towards the inside.
At the same time, why couldn't Vega have packed 8-bit math on the same paths, even if it's only 2x INT8? It wouldn't be enough to close the full gap in that regard, but it would help.

I don't see much benefit in 2xINT8 when you have 2xINT16 and 2xFP16. For integer math, INT16 will produce bit-exact results unless you overflow your 8-bit integer, and I fail to see many use cases where you'd want to use an 8-bit register and overflow purposefully. Also, fp16 has 10 mantissa bits + 1 hidden mantissa bit = 11-bit worst-case precision for a [0, 1] normalized range and 12-bit precision for a [-1, 1] normalized range, thus always better than 8-bit normalized math (for deep learning stuff). I don't see cases where 2xINT8 (normalized) would be preferable to 2xFP16 in deep learning.
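A quick sketch of that precision argument: fp16 spacing (ulp) grows with magnitude, so for normalized values the worst case on [0, 1] occurs in the top binade [1/2, 1):

$$\mathrm{ulp}_{\mathrm{fp16}}(x) = 2^{\lfloor \log_2 x \rfloor - 10}, \qquad \max_{x \in [0,1)} \mathrm{ulp}_{\mathrm{fp16}}(x) = 2^{-11}$$

So fp16 resolves at least $2^{11}$ steps across [0, 1] (the sign bit mirrors this to $2^{12}$ steps across [-1, 1]), versus the $2^{8}$ steps of 8-bit normalized math; spacing near zero is finer still.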
GCN in particular can access its registers in byte granularity, so the motivation would be getting 2x the math rate over single-rate operations while doubling the effective register capacity over FP16. INT8 would be set as a possible minimum since there are other architectures going for 8-bit integer math, and AMD isn't in a position to be rewarded for standing out from the crowd.

Yeah. 8-bit register storage helps, and 8-bit addressing is already there. Good if you need lots of registers. But do you really need that many registers often? 16-bit already halves your register pressure. 8-bit LDS storage should be more important for storing the neural net.
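For a concrete point of reference, packed 2xFP16 math of the sort discussed here is already exposed on the CUDA side. A minimal sketch (the kernel and its setup are hypothetical, and GCN5's exact packed-op semantics weren't public at this point):

```cuda
#include <cuda_fp16.h>

// Two FP16 fused multiply-adds issue as one instruction on hardware with
// double-rate FP16 (sm_53+), doubling the math rate and halving the register
// footprint versus FP32 -- the same trade discussed above for GCN.
__global__ void fma_fp16x2(const __half2* a, const __half2* b,
                           const __half2* c, __half2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __hfma2(a[i], b[i], c[i]);  // 2 FP16 FMAs per op
}
```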
It's more of a niche, but allowing for more contexts in the register file can reduce the load on the LDS and improve occupancy. As noted, the 16-bit paths already have excess capacity for INT8, and could possibly allow for partial gating of the units while INT8 is in use.
A lot of neural net research is focused on compressing state in order to get as much context as possible swapped in at a time and calculated at low power. Getting a lot of evaluations done seems to be very mathematically dense, given the amount of iteration and the additional work taken to shuffle the bits around once they are loaded. Compressing the data also helps with interconnect bottlenecks; or, in the case of possible GPUs with non-volatile storage, there is more benefit from the limited bandwidth and read/write endurance the more data is kept on-chip and iterated on between broadcast/writeback.
I am also a bit sceptical about the gains of 4xUINT8 packed ALU ops. Fp16/int16 already double the ALU rate, and not even all fp32 shaders are ALU bound; doubled ALU (16b) means significantly fewer ALU-bound cases. Boosting only the ALU doesn't reduce the other bottlenecks. Bandwidth is the same (you obviously load/store 8-bit values to memory in both cases).
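A crude way to formalize that point, assuming ALU and memory time fully overlap:

$$\mathrm{speedup} = \frac{\max(t_{\mathrm{ALU}},\, t_{\mathrm{mem}})}{\max(t_{\mathrm{ALU}}/2,\, t_{\mathrm{mem}})}$$

If $t_{\mathrm{mem}} \ge t_{\mathrm{ALU}}$, the speedup from doubling the ALU rate is exactly 1; you only approach 2x when the kernel was almost entirely ALU-bound to begin with.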
If these are compute boards, would the remaining necessary hardware be offloaded to some other PCB real estate or put through a switch fabric?

I thought they were cubes from one of the racks they demoed from a partner. Four internally facing Vegas with an 80mm fan blowing through a central heatsink for better cooling, then 4 of those across the width of the rack. I'd imagine the SSDs are optional in that configuration, but they may be using some form of external cable for a fabric. If they're designing a proprietary cage to hold them, it stands to reason there will be a fair few Vegas in whatever they build.
Would that make the little rectangles connectors? It would seem like the communications capabilities of the GPUs would be expanded over what they were previously, although if it's three per GPU it's not as high as GP100's link count.
Current GPUs are balanced around their fp32 performance. Suddenly increasing ALU performance by 4:1 without scaling up anything else isn't going to bring 4x performance gains.
AFAIK, the usefulness of 4*UINT8 ops has been marketed only for neural network inference and not graphics.

Yes. However, if AMD doesn't have 4xUINT8, it doesn't mean that their GPUs are 2x slower in neural networks (compared to P100). There will be some kernels that run slower on fp16/int16, but I'd be surprised to see ~2x gains over fp16/int16 in common real-world use cases. Of course 4xUINT8 will help, and any gains with zero extra power consumption are welcome. 2x higher numbers certainly help marketing as well.
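For reference, the inference-oriented op in question looks like this in CUDA on the Pascal parts that expose it (illustration only; the kernel is a hypothetical sketch):

```cuda
// __dp4a: four 8-bit multiplies plus a 32-bit accumulate in one instruction
// (sm_61+). Each int holds four packed int8 values, so this quadruples the
// multiply-accumulate rate -- but only for workloads shaped like this.
__global__ void dot_int8x4(const int* a, const int* b, int* acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        acc[i] = __dp4a(a[i], b[i], acc[i]);
}
```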
Apart from current shaders not being thoroughly compute-limited, would a 2x or 4x increase be enough to warrant looking into new algorithms, moving some solutions from being fetch/bandwidth-bound to more compute? IIRC this has been the case in the past as well, when some things could be computed more quickly and efficiently compared to earlier solutions where they had to be loaded from memory.

Agreed. Shifts in ALU:TMU:BW allow/force developers to change their algorithm/data design. However, in modern games and even in modern GPGPU, FP32 ALU performance isn't the biggest limiting factor. New faster GPUs keep roughly the same ALU:TMU:BW balance as old ones. People are used to thinking that 2x more flops = roughly 2x faster. This mostly holds as everything else is also increased by roughly 2x. But if you increase only the flops (and not bandwidth and the count of CUs / SMs, or their other functionality like samplers), the result is nowhere close to 2x faster in the general case.
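A toy sketch of that kind of rebalancing, trading a memory fetch for recomputation (hypothetical kernel, CUDA purely for illustration):

```cuda
// Spending ALU instead of bandwidth: a sine-table fetch replaced by
// recomputation with the fast-math intrinsic. With enough spare flops,
// the compute version can win over a memory-bound gather.
__global__ void wave_compute(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float phase = (float)i / n;
        out[i] = __sinf(6.28318530f * phase);  // instead of tbl[i & mask]
    }
}
```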
Can't you already do generic 8 bit register processing using SDWA in GCN3/GCN4? This modifier allows instructions to access registers at 8/16/32 bit granularity (4+2+1 = 7 choices). No extra instructions to pack/unpack data needed. We need to wait until GCN5 ISA documents are published to know exactly how SDWA interacts with 16b packed ops.
All I am saying is, don't blindly look at the marketing 8/16-bit ALU flops.

Won't do, don't worry.