Why not then simply add 50% more CUs? That should be more power-efficient than making each CU 50% fatter. Fewer shared resources = good for power efficiency.
I agree that would be a conceptually straightforward way of increasing throughput. I was coming from the context of recent Vega rumors that had already given a CU count as a constraint.
In that scenario, GCN as we know it is at its CU maximum. Conceptually, raising that limit is a minor tweak, although items with chip-crossing interconnects (the L1-L2 network, shared scalar L1 caches, instruction fetch blocks, and message/export paths) would need to expand their routing and addressing capability to service more clients if that were the only change.
Some cases would benefit from fatter CUs: large workgroups, or workgroups in the long tail of execution before a barrier, only speed up with quicker or fatter CUs, since a single workgroup cannot split across multiple CUs. Some levels of clock gating benefit from high-speed information within a CU, while others might benefit from multiple CUs with more aggressive coarse gating.
Shutting down the extra CUs could give power to boost the remaining CUs.
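As a rough back-of-the-envelope sketch of that trade (a toy model, not measured data: it assumes dynamic power scales as f·V² and that voltage has to rise roughly linearly with frequency near the top of the range, so power grows roughly with f³):

```python
# Toy model: dynamic power ~ C * f * V^2, with V roughly proportional to f
# near the top of the frequency range, so power grows roughly as f^3.
# All numbers are illustrative assumptions, not measurements.

def boosted_clock(base_clock_mhz, active_fraction):
    """Clock the remaining CUs could reach if the whole power budget is
    redirected to the active fraction, under the f^3 approximation."""
    return base_clock_mhz * (1.0 / active_fraction) ** (1.0 / 3.0)

# Power-gate a third of the CUs and spend the savings on frequency:
print(round(boosted_clock(1000, 2 / 3)))  # ~1145 MHz
```

The cube-root relationship is why this trade tends to look unimpressive: gating a third of the CUs buys only about 14% more clock for the rest in this model.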
What makes me uncertain at this point with Polaris is that it is another data point in a string spanning multiple years, generations, fabs, and nodes, where GCN seems to show something consistent, unforgiving, and physical about the architecture's clock headroom and efficiency past its target range. Perhaps this next time the headroom will appear, although I am not sure this eagerness to reach for such speeds is optimal long-term for either vendor.
If there were ever an analysis or post-mortem of the architecture under the hood, I would be delighted to see it.
The thoughts below are more of a rambling tangent:
There are some things about GCN that, from a CPU standpoint, look rather heavy: a specialized encoding with some non-streamlined operand and execution behavior, broad hardware resources, ways to route data between the lanes of a decently wide unit, a rather switch-happy multithreading policy, inconsistently handled memory spaces, and a hardwired 4-cycle latency. On the CPU side, there would be no question why a processor whose pipelined FPU has 4-cycle latency doesn't clock as high as one with 6 or more. That's not to say there couldn't be other critical paths, or other pipelines not exposed to the ISA that are themselves too short for the amount of work they do per stage to be clocked high.
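The latency-versus-clock point can be sketched with a toy timing model (all numbers are illustrative assumptions): a fixed amount of FPU logic split across more pipeline stages means less logic per cycle, so a shorter cycle time, minus diminishing returns from flop overhead.

```python
# Toy timing model: a fixed amount of combinational logic (in ns) split
# across N pipeline stages. Cycle time is the per-stage logic delay plus a
# fixed latch/flop overhead, so deeper pipelines clock higher, with
# diminishing returns. Numbers are made up for illustration.

def max_clock_ghz(total_logic_delay_ns, stages, flop_overhead_ns=0.05):
    cycle_time_ns = total_logic_delay_ns / stages + flop_overhead_ns
    return 1.0 / cycle_time_ns

FPU_LOGIC_NS = 2.0  # assumed total delay of the same FPU datapath

print(round(max_clock_ghz(FPU_LOGIC_NS, 4), 2))  # 4-stage: ~1.82 GHz
print(round(max_clock_ghz(FPU_LOGIC_NS, 6), 2))  # 6-stage: ~2.61 GHz
```

Same datapath, same work; the 6-stage version simply has less logic between flops per cycle, which is the unremarkable CPU-world answer to why a 4-cycle design tops out lower.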
Nvidia transitioned to a more streamlined ISA, did not pursue very short forwarding latency as aggressively, and, among other things, has been moving its in-house custom controller to RISC-V, an ISA philosophically focused on straightforward, performant implementation.
It's not mutually exclusive with increasing SIMD count, either. One way to do it would be to stretch the 4-cycle cadence so the new SIMDs fit into the issue loop, which effectively extends the pipeline and so could allow a higher clock. That might not play nicely with how everything ties together in GCN (SIMD count sets the cadence, cadence times SIMD width gives the batch/cache/export size, etc.). However, it may be that this elegance raises the cost of changing any one element, keeping implementations at a local minimum when there are examples elsewhere that appear more optimal globally.
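How tightly those knobs are coupled can be shown with a few lines. The first call reflects public GCN parameters (4 SIMDs of 16 lanes per CU, round-robin issue, wave64 over 4 cycles); the 6-SIMD variant is purely hypothetical.

```python
# In GCN, each CU has 4 SIMDs of 16 lanes; the scheduler round-robins over
# them, issuing to one SIMD per cycle, so each SIMD gets a new instruction
# every 4 cycles. A 64-wide wavefront then drains through a 16-lane SIMD in
# exactly those 4 cycles, hiding the 4-cycle ALU latency.

SIMD_WIDTH = 16

def gcn_parameters(simds_per_cu):
    cadence = simds_per_cu            # issue-loop length = SIMD count
    wave_size = SIMD_WIDTH * cadence  # batch size that keeps the loop full
    return cadence, wave_size

print(gcn_parameters(4))  # shipping GCN: (4, 64)
# Hypothetical 6-SIMD CU: the cadence stretches to 6 and the natural batch
# size (and everything keyed to it: caches, export, etc.) grows to 96.
print(gcn_parameters(6))  # (6, 96)
```

That second line is the local-minimum problem in miniature: touch the SIMD count and the wavefront size, and everything sized around it, wants to move too.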