In that particular deep learning part, I think AMD is literally just saying that the card will meet the PCIe spec limit of 300W.
But yeah, for consumer desktop cards, we're probably stuck at 250W for the simple reason that OEMs are used to it. They all have cooling solutions for exactly that amount of heat in a blower form factor, so it'd be a drop-in for them. Remember that OEMs actually care about this stuff.
Yeah, and Polaris met the <150W? Guess what, they used the same figure for MI6, so I wouldn't trust any of their TDP figures, notwithstanding that it is the max they will use and it's probably right around there.
Regarding Vega's PCB, it seems Raja Koduri showed a non-functional prototype of what he called a "Vega Cube", which is four Vega boards attached together perpendicularly.
Although the interposers aren't there (they obviously want to keep the chip's size a secret for a while longer), we can see that Vega 10 can (and probably will) be used with small PCBs.
He called it a "100TF cube", suggesting it has the performance of 4x MI25 cards.
I guess that thing would have to be watercooled, because we're looking at >1kW within the cube alone.
I wonder if it would run Crysis.
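For reference, the arithmetic behind both figures above, assuming roughly 25 TFLOPS of FP16 per MI25 (as the name suggests) and the 300W PCIe board limit mentioned earlier:

$$4 \times 25\ \mathrm{TFLOPS\ (FP16)} = 100\ \mathrm{TFLOPS}, \qquad 4 \times 300\ \mathrm{W} = 1.2\ \mathrm{kW}$$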
To my understanding the GPUs are supposed to be towards the inside.
At the same time, why couldn't Vega have packed 8-bit math on the same paths, even if it's only 2x INT8? It wouldn't be enough to close the full gap in that regard, but it would help.

I don't see much benefit in 2xINT8 when you have 2xINT16 and 2xFP16. For integer math, INT16 will produce bit-exact results unless you overflow your 8-bit integer, and I fail to see many use cases where you'd want to use an 8-bit register and overflow purposefully. Also, fp16 has 10 mantissa bits + 1 hidden mantissa bit = 11-bit worst-case precision for a [0, 1] normalized range and 12-bit precision for a [-1, 1] normalized range, thus always better than 8-bit normalized math (for deep learning stuff). I don't see cases where 2xINT8 (normalized) would be preferable to 2xFP16 in deep learning.
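A quick sketch of that precision argument: fp16 spacing (ulp) grows with magnitude, so for normalized values the worst case on [0, 1] occurs in the top binade [1/2, 1):

$$\mathrm{ulp}_{\mathrm{fp16}}(x) = 2^{\lfloor \log_2 x \rfloor - 10}, \qquad \max_{x \in [0,1)} \mathrm{ulp}_{\mathrm{fp16}}(x) = 2^{-11}$$

So fp16 resolves at least $2^{11}$ steps across [0, 1] (the sign bit mirrors this to $2^{12}$ steps across [-1, 1]), versus the $2^{8}$ steps of 8-bit normalized math; spacing near zero is finer still.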
GCN in particular can access its registers in byte granularity, so the motivation would be getting 2x the math rate over single-rate operations while doubling the effective register capacity over FP16. INT8 would be set as a possible minimum since there are other architectures going for 8-bit integer math, and AMD isn't in a position to be rewarded for standing out from the crowd.

Yeah. 8-bit register storage helps, and 8-bit addressing is already there. Good if you need lots of registers. But do you really need that many registers often? 16-bit already halves your register pressure. 8-bit LDS storage should be more important for storing the neural net.
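For a concrete point of reference, packed 2xFP16 math of the sort discussed here is already exposed on the CUDA side. A minimal sketch (the kernel and its setup are hypothetical, and GCN5's exact packed-op semantics weren't public at this point):

```cuda
#include <cuda_fp16.h>

// Two FP16 fused multiply-adds issue as one instruction on hardware with
// double-rate FP16 (sm_53+), doubling the math rate and halving the register
// footprint versus FP32 -- the same trade discussed above for GCN.
__global__ void fma_fp16x2(const __half2* a, const __half2* b,
                           const __half2* c, __half2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __hfma2(a[i], b[i], c[i]);  // 2 FP16 FMAs per op
}
```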
It's more of a niche, but allowing for more contexts in the register file can reduce the load on the LDS and improve occupancy. As noted, the 16-bit paths already have excess capacity for INT8, and could possibly allow for partial gating of the units while INT8 is in use.
A lot of neural net research is focused on compressing state in order to get as much context as possible swapped in at a time and calculated at low power. Getting a lot of evaluations done seems to be very mathematically dense, given the amount of iteration and the additional work taken to shuffle the bits around once they are loaded. Compressing the data also helps with interconnect bottlenecks; or, in the case of possible GPUs with non-volatile storage, there is more benefit from the limited bandwidth and read/write endurance the more data is kept on-chip and iterated on between broadcast/writeback.
I am also a bit sceptical about the gains of 4xUINT8 packed ALU ops. Fp16/int16 already double the ALU rate, and not even all fp32 shaders are ALU bound; doubled ALU (16b) means significantly fewer ALU-bound cases. Boosting only the ALU doesn't reduce the other bottlenecks. Bandwidth is the same (you obviously load/store 8-bit values to memory in both cases).
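A crude way to formalize that point, assuming ALU and memory time fully overlap:

$$\mathrm{speedup} = \frac{\max(t_{\mathrm{ALU}},\, t_{\mathrm{mem}})}{\max(t_{\mathrm{ALU}}/2,\, t_{\mathrm{mem}})}$$

If $t_{\mathrm{mem}} \ge t_{\mathrm{ALU}}$, the speedup from doubling the ALU rate is exactly 1; you only approach 2x when the kernel was almost entirely ALU-bound to begin with.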
If these are compute boards, would the remaining necessary hardware be offloaded to some other PCB real estate or put through a switch fabric?

I thought they were cubes from one of the racks they demoed from a partner. Four internally facing Vegas with an 80mm fan blowing through a central heatsink for better cooling, then 4 of those across the width of the rack. I'd imagine the SSDs are optional in that configuration, but they may be using some form of external cable for a fabric. If they're designing a proprietary cage to hold them, it stands to reason there will be a fair few Vegas in whatever they build.
Would that make the little rectangles connectors? It would seem like the communications capabilities of the GPUs would be expanded over what they were previously, although if it's three per GPU it's not as high as GP100's link count.
Current GPUs are balanced around their fp32 performance. Suddenly increasing ALU performance by 4:1 without scaling up anything else isn't going to bring 4x performance gains.
AFAIK, the usefulness of 4*UINT8 ops has been marketed only for neural network inference and not graphics.

Yes. However, if AMD doesn't have 4xUINT8, it doesn't mean that their GPUs are 2x slower in neural networks (compared to P100). There will be some kernels that run slower on fp16/int16, but I'd be surprised to see ~2x gains over fp16/int16 in common real-world use cases. Of course 4xUINT8 will help, and any gains with zero extra power consumption are welcome. 2x higher numbers certainly help marketing as well.
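For reference, the inference-oriented op in question looks like this in CUDA on the Pascal parts that expose it (illustration only; the kernel is a hypothetical sketch):

```cuda
// __dp4a: four 8-bit multiplies plus a 32-bit accumulate in one instruction
// (sm_61+). Each int holds four packed int8 values, so this quadruples the
// multiply-accumulate rate -- but only for workloads shaped like this.
__global__ void dot_int8x4(const int* a, const int* b, int* acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        acc[i] = __dp4a(a[i], b[i], acc[i]);
}
```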
Apart from current shaders not being thoroughly compute-limited, would a 2x or 4x increase be enough to warrant looking into new algorithms, moving some solutions from being fetch/bandwidth-bound to more compute? IIRC this has been the case in the past as well, when some things could be computed more quickly and efficiently compared to earlier solutions where they had to be loaded from memory.

Agreed. Shifts in ALU:TMU:BW allow/force developers to change their algorithm/data design. However, in modern games and even in modern GPGPU, FP32 ALU performance isn't the biggest limiting factor. New faster GPUs keep roughly the same ALU:TMU:BW balance as old ones. People are used to thinking that 2x more flops = roughly 2x faster. This mostly holds as everything else is also increased by roughly 2x. But if you increase only the flops (and not bandwidth and the count of CUs / SMs, or their other functionality like samplers), the result is nowhere close to 2x faster in the general case.
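A toy sketch of that kind of rebalancing, trading a memory fetch for recomputation (hypothetical kernel, CUDA purely for illustration):

```cuda
// Spending ALU instead of bandwidth: a sine-table fetch replaced by
// recomputation with the fast-math intrinsic. With enough spare flops,
// the compute version can win over a memory-bound gather.
__global__ void wave_compute(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float phase = (float)i / n;
        out[i] = __sinf(6.28318530f * phase);  // instead of tbl[i & mask]
    }
}
```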
Can't you already do generic 8 bit register processing using SDWA in GCN3/GCN4? This modifier allows instructions to access registers at 8/16/32 bit granularity (4+2+1 = 7 choices). No extra instructions to pack/unpack data needed. We need to wait until GCN5 ISA documents are published to know exactly how SDWA interacts with 16b packed ops.
All I am saying is, don't blindly look at the marketing 8/16-bit ALU flops.

Won't do, don't worry.