Hm, I don't really know much about this, but from a strictly layperson's perspective it seems likely there's no single cause of Vega's power draw. Would scheduling really account for a hundred-watt increase in dissipation?
The speculation that it's the hardware that makes GCN work well with DX12 sounds to me like another case of conflating Nvidia's compile-time scheduling at the warp register-dependence level with scheduling at the front-end command processor level.
DX12 doesn't care about the former, and both Nvidia and AMD have some amount of hardware management of the latter.
The exact amount and specific limitations of the front-end hardware for each vendor and generation aren't strictly clear, but both have processors in their front ends for managing queues, ASIC state, and kernel dispatch.
Going by AMD's command processors, these are simple proprietary cores that fetch command packets and run them against stored microcode, putting work on internal queues or updating internal registers. Other processors then set up kernel launches internally based on this. These cores don't have the bandwidth, compute, large ISA, or multiple hardware pipelines of a CU. From Vega's marketing die shot, the command processor section (~8 blocks?) appears to take up about 1/3 of the centerline of the GPU section.
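Just to make the fetch/decode/queue flow concrete, here's a minimal Python sketch of that kind of front end. The packet names, fields, and handler table are made up for illustration; this is not AMD's actual packet format or microcode, just the general shape of "fetch packet, run handler, update state or queue a dispatch."

from collections import deque

class ToyCommandProcessor:
    def __init__(self):
        self.state_regs = {}           # stand-in for internal ASIC state registers
        self.dispatch_queue = deque()  # stand-in for an internal hardware queue
        # "microcode": packet opcode -> handler routine
        self.handlers = {
            "SET_REG": self._set_reg,
            "DISPATCH": self._dispatch,
        }

    def _set_reg(self, packet):
        self.state_regs[packet["reg"]] = packet["value"]

    def _dispatch(self, packet):
        # Hand the launch off to a separate dispatch unit (modelled as a queue),
        # with a snapshot of the current state.
        self.dispatch_queue.append({
            "kernel": packet["kernel"],
            "grid": packet["grid"],
            "state": dict(self.state_regs),
        })

    def run(self, ring_buffer):
        for packet in ring_buffer:     # fetch packets in submission order
            self.handlers[packet["op"]](packet)

cp = ToyCommandProcessor()
cp.run([
    {"op": "SET_REG", "reg": "COMPUTE_DIM", "value": (64, 1, 1)},
    {"op": "DISPATCH", "kernel": "reduce_sum", "grid": (1024, 1, 1)},
])
print(cp.dispatch_queue)

The point being that this is bookkeeping work over small packets, nothing like the wide datapaths of a CU.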
We know AMD doesn't have billions and billions to spend, and what money they do have must be shared with console SoC and x86 CPU divisions.
One item to note is that AMD doesn't budget the money its semi-custom clients pay for engineering costs the same way as its purely in-house R&D. That's part of the point of the semi-custom division, so what is actually spent on developing the IP shared with the consoles is not as straightforward as just looking at the R&D line for what AMD spends out of its own budget.
As such, GCN may be resource-starved, but not strictly as much as it might appear.
However, another potential set of effects may stem from trying to fold architectural features from multiple partners into one mix, when those partners' vision of and requirements for the base architecture may not align with advancing AMD's own products. Then there's the potential cost of having resources diverted towards contradictory or partially backwards-looking directions, and of making the hardware general enough that the architecture can be tweaked--potentially in ways that are sub-optimal for AMD.
As a result, doesn't it seem the chances are fairly high that Vega isn't nearly as efficiently laid out as it could have been, and that much of the power is spent/lost just on shuffling bits around the die? The hardware units themselves might also be inefficiently designed compared to NV's chips.
GCN also architecturally defines certain elements that move more data more frequently than Nvidia's, such as the 64-lane wavefronts versus Nvidia's 32-wide warps, and things like the write-through L1s.
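A quick back-of-envelope on the wavefront width alone, just for operand traffic through the register file. The 64 vs 32 widths are the documented wavefront/warp sizes; the assumption of 32-bit operands and roughly three register operands (two reads, one write) per simple vector ALU op is mine, and real instruction mixes obviously vary:

def operand_bytes_per_instruction(simd_width, operand_bytes=4, operands=3):
    # bytes moved through the register file per vector instruction
    return simd_width * operand_bytes * operands

gcn_wave = operand_bytes_per_instruction(64)   # GCN: 64-wide wavefront
nv_warp  = operand_bytes_per_instruction(32)   # Nvidia: 32-wide warp
print(f"64-wide wavefront: {gcn_wave} B per instruction")
print(f"32-wide warp:      {nv_warp} B per instruction")

Twice the bits in flight per instruction issued, before you even get to the caches.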
Further details are somewhat obscured by Nvidia's reluctance to disclose low-level details the way AMD does, but we do know Nvidia has stated there was an ISA revamp after Fermi, a notable change with Maxwell, and Volta apparently changes the ISA to some extent again. GCN has had some revisions, and documents at least one notable encoding change since SI. I think there are pros and cons to either vendor's level of transparency.
Iteration rate alone may weigh against GCN, and AMD has spent resources on a number of custom architecture revisions that do not necessarily advance AMD's main line. The ISA and hardware are kept generic enough and complex enough to support, or potentially support, quirks that no individual instantiation or family of them will actually use, which means some incremental reduction in how svelte a given chip can be. That doesn't include how much of this could be improved if the effort and investment were expended--but weren't.
IIRC, Intel has like, hundreds of skilled silicon engineers just for laying out their CPUs (and GPUs too these days, I suppose, heh). No idea how big the team under Raja is, but hundreds of engineers would probably be quite an expense. And AMD typically has to release multiple dies in different performance brackets, so automation is probably out of necessity, not choice.
Die shots of Ryzen, Intel's chips, and others show a lot of those blobby areas that are indicative of automated tools. With advancing nodes and expanding problem spaces, tools have eroded the set of items that humans can do better--or at least better enough. AMD might use automation more, but even Intel is automating most of its blocks as well.
There are still some specific areas, like SRAM or really performance-critical sub-blocks, that can be targeted by hand. AMD even marketed Zen's design team being brought in on Vega's register file, although I wonder if it was the team involved in the L1 or L2/L3 given the use model, versus Zen's highly-ported 4GHz+ physical register file.
That may also be helped by AMD's aiming to put Vega on the same process as Ryzen for Raven Ridge, but it can also be a case where the physical building blocks are forced to straddle the needs of a CPU and a GPU.
I wonder if maybe AMD spends more effort on reining in power consumption in a chip like Xbox Scorpio's than they do for desktop PCs, seeing as consoles are much more sensitive to heat dissipation than a gaming PC is...?
Console GPUs fit in a niche that doesn't try to beat Nvidia, doesn't try that hard to reach laptop/portable power levels, and doesn't care too much about features that might matter to a professional or HPC product. It's a design pipeline and philosophy that apparently does decently as long as it doesn't aim that high or that low.
The very rough first-order approximation for power consumption, assuming everything is running on the same clock, is the number of gates spent on something. You can refine it further with the number of gates actually toggling, and the number of RAM block accesses.
There are also the wires, and GCN defines a lot of them being driven, shuffled, and sent varying distances.
Per AMD, a good chunk of Vega's transistor growth was about driving them.
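To put some very rough shape on that first-order model, here's a crude sketch that sums the three buckets above: toggling gates, RAM accesses, and wires, all on one clock. Every number in it (gate count, toggle rate, gate capacitance, energies per access and per bit-mm) is a placeholder order of magnitude I picked for a 14/16nm-class process, not a measured Vega figure:

def dynamic_power_w(num_gates, toggle_rate, gate_c_f, vdd, freq_hz):
    # P_dyn ~= alpha * C * V^2 * f, summed over the switching gates
    return num_gates * toggle_rate * gate_c_f * vdd**2 * freq_hz

def sram_power_w(accesses_per_s, energy_per_access_j):
    return accesses_per_s * energy_per_access_j

def wire_power_w(bits_per_s, mm_per_bit, energy_per_bit_mm_j):
    # moving a bit costs roughly energy-per-bit-per-mm times the distance driven
    return bits_per_s * mm_per_bit * energy_per_bit_mm_j

freq = 1.5e9   # ~Vega-class clock
vdd  = 1.0     # volts, placeholder

total = (
    dynamic_power_w(num_gates=5e9, toggle_rate=0.02, gate_c_f=0.5e-15,
                    vdd=vdd, freq_hz=freq)
    + sram_power_w(accesses_per_s=2e12, energy_per_access_j=10e-12)
    + wire_power_w(bits_per_s=1e13, mm_per_bit=2.0, energy_per_bit_mm_j=0.1e-12)
)
print(f"~{total:.0f} W (illustrative only)")

The split is the interesting part, not the total: gate switching dominates, but the RAM and wire terms scale directly with how much data the architecture chooses to move and how far it has to go, which is exactly where wider wavefronts, write-through L1s, and long on-die routes show up.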