And/or bandwidth, and/or geometry performance. In the article's conditions, the Vega 56 and 64 have exactly the same bandwidth and the same four geometry engines running at the same clocks.
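For concreteness, a quick sketch of the shared ceilings once clocks are matched; the clocks plugged in below are placeholder assumptions, not the article's actual settings:

    /* Both cards carry 4 geometry engines and a 2048-bit HBM2 interface,
     * so at matched core and memory clocks the peak primitive rate and
     * peak bandwidth come out identical. Clocks here are placeholders. */
    #include <stdio.h>

    int main(void)
    {
        double core_ghz = 1.4;   /* assumed matched core clock    */
        double pin_gbps = 1.6;   /* assumed matched HBM2 pin rate */

        double prim_rate = 4.0 * core_ghz;            /* 4 engines x 1 prim/clk    */
        double bandwidth = (2048.0 / 8.0) * pin_gbps; /* bus width in bytes x rate */

        printf("Peak primitive rate: %.1f Gprim/s on either card\n", prim_rate);
        printf("Peak DRAM bandwidth: %.1f GB/s on either card\n", bandwidth);
        return 0;
    }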
With primitive shaders working, these two factors might be less of a bottleneck.
Aside from HBM and the data fabric, there is also the same L2-L1 bandwidth and the same RBE/export behavior to the L2 and beyond.
Latencies could be generally equivalent iso-clock, and if there are internal variations with the firmware settings they don't show up in the testing.
Some of those elements might change with DSBR and workable primitive shaders. Backpressure from RBE thrashing and conflicts with CU traffic could slow the execution of wavefronts or delay their final export and the release of their resources. Less successful early culling could also leave the iso-clock setup pipeline and wavefront launch process more burdened.
If AMD's patents on how it implements binning and the tiled rasterizer reflect Vega's implementation, the front end's behavior with those features inactive would make the setup process longer-latency, which Vega may not be balanced for if that is the norm rather than a minority case. Since this happens in a stage whose output is generally amplified into a larger amount of pixel shader work, the level of parallelism and latency tolerance may profile differently, with more specialized concerns given the interaction with dedicated paths and fixed-function blocks.
The latency angle has made me curious about the significantly higher wait count limit for Vega's vector memory instructions, and whether this is a case where decisions at each level of abstraction are bleeding through.
It doesn't seem like GCN suddenly incurred four times the memory latency, but it might matter more for the new shader variants than it does for more free-form pixel or compute shaders. The tendency for front-end work to wind up spanning fewer CUs, the interactions with fixed-function paths, its occupying a minority of CUs because it runs pre-amplification, and the more complex merged/primitive shader code might place a greater premium on per-wavefront latency handling. That's admittedly speculation in the absence of knowing how the new shaders profile.
That Vega's ISA splits the latency count field the way it does may be another indication of a desire for binary backwards compatibility, or perhaps, like the implementation-specific triangle coverage instruction, it is a sign of the ISA reflecting different scenarios (or different CU revisions?) that need to be able to ignore the new bits.
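As a rough illustration of how a split count field preserves old binaries (my reading of the public GFX9 encoding, so treat the exact bit positions as an assumption): the legacy 4-bit vmcnt stays where it was in the s_waitcnt immediate, the two new high bits land in previously unused positions, and an older binary that never sets them decodes to the same small counts as before.

    /* Sketch of a split wait-count field: low 4 bits of vmcnt in
     * simm16[3:0] as on older GCN, the new high 2 bits in simm16[15:14].
     * Bit positions assumed from the public GFX9 ISA description. */
    #include <stdint.h>
    #include <stdio.h>

    static uint16_t encode_vmcnt(uint16_t simm16, unsigned vmcnt)  /* vmcnt 0..63 */
    {
        simm16 &= (uint16_t)~(0x000Fu | 0xC000u);           /* clear both halves */
        simm16 |= (uint16_t)(vmcnt & 0x0Fu);                /* legacy low 4 bits */
        simm16 |= (uint16_t)(((vmcnt >> 4) & 0x3u) << 14);  /* new high 2 bits   */
        return simm16;
    }

    static unsigned decode_vmcnt(uint16_t simm16)
    {
        return (simm16 & 0x0Fu) | (((simm16 >> 14) & 0x3u) << 4);
    }

    int main(void)
    {
        uint16_t imm = encode_vmcnt(0, 63);
        /* Prints simm16=0xC00F, decoded vmcnt=63. An encoder that only knows
         * the low 4 bits still produces immediates a new decoder reads the
         * same way, since the high bits stay zero. */
        printf("simm16=0x%04X decoded vmcnt=%u\n", imm, decode_vmcnt(imm));
        return 0;
    }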
Raven Ridge brings up to 11 NCUs, which is more than enough to meet the performance target of a GT4 Iris Pro even if it clocks at ~900MHz. The problem here is bandwidth. Supporting LPDDR4X or a single HBM stack would have done wonders for Raven Ridge, but it doesn't look like that's happening.
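Back-of-the-envelope, and with assumed memory configurations rather than known Raven Ridge specs, the compute-to-bandwidth ratio shows why:

    /* Rough bytes-per-FLOP for a hypothetical 11-NCU APU near 900 MHz
     * against a few assumed memory options; all bandwidth figures are
     * peak theoretical numbers for the stated configurations. */
    #include <stdio.h>

    int main(void)
    {
        /* 11 CUs x 64 lanes x 2 FLOPs (FMA) x 0.9 GHz ~= 1267 GFLOPS FP32 */
        double gflops = 11 * 64 * 2 * 0.9;

        double ddr4_dual  = 2 * 8 * 2.4;  /* 2ch x 64-bit x 2400 MT/s = 38.4 GB/s    */
        double lpddr4x    = 16 * 4.266;   /* assumed 128-bit x 4266 MT/s = 68.3 GB/s */
        double hbm2_stack = 128 * 1.6;    /* 1024-bit x 1.6 Gbps = 204.8 GB/s        */

        printf("Compute: %.0f GFLOPS\n", gflops);
        printf("DDR4-2400 dual ch. : %6.1f GB/s  %.3f B/FLOP\n", ddr4_dual,  ddr4_dual  / gflops);
        printf("LPDDR4X-4266 128b  : %6.1f GB/s  %.3f B/FLOP\n", lpddr4x,    lpddr4x    / gflops);
        printf("HBM2, one stack    : %6.1f GB/s  %.3f B/FLOP\n", hbm2_stack, hbm2_stack / gflops);
        return 0;
    }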
The pricing and volume situation may keep this from happening going forward, given the orders-of-magnitude greater volume of the laptop market and the memory market's pricing trends. HBM seems to be hitting a point where it is only acceptable to buyers willing to pay a premium, which is a poor fit for APUs that will likely need to hit low price points--something likely to be assumed of AMD for some time, just because that's generally what AMD gets and it would take time to reverse that perception.
The pricing clock tends to reset with every new memory type or variation as well.
Perhaps if Raven Ridge is among the last of the monolithic APUs, future implementations can allow flexibility where today AMD would need to absorb added costs to balance uncertainties in DRAM pricing against the disparate price points it needs to hit.
To avoid cluttering the review thread, I will append a note in response to a post you made:
https://forum.beyond3d.com/posts/2001017/
3 - Also as mentioned by Raja in the same tweet, the Infinity Fabric being used in Vega 10 wasn't optimized for consumer GPUs and that also seems to be holding the GPU back (maybe by holding back the clocks at iso TDP). Why did they use IF in Vega 10? Perhaps because iterating IF in Vega 10 was an important stepping stone for optimizing the implementation for Navi or even Vega 11 and Raven Ridge. Perhaps HBCC was implemented around IF from the start. Perhaps Vega SSG doesn't have a PCIe controller for the SSDs and IF is being used to implement a PCIe controller in Vega.
I think that tweet was in reference to area more than other factors; the IF strip takes up a measurable amount of die area.
I don't really know why it would be a limitation beyond that, given that it is described as a mesh and client Vega really shouldn't be stressing it enough for it to become a notable bottleneck.
Its clock domain is constant, which likely wouldn't change for a client-optimized version for power reasons. It may also help service certain heterogeneous compute functions if its domain is used as a timekeeper.
HBCC appears to sit in the coherent-slave position noted in descriptions of Zen's fabric implementation, where an intermediary sits between the links and a memory controller, although what it's tasked with shouldn't be a major limiter since what gaming needs is a small subset of it. The unused features would generally be an area cost.
IF itself doesn't implement controllers or PCIe. In Zen, the fabric interfaces with controllers, which in turn plug into their respective interfaces and PHYs.