Sorry for double-posting, but it just occurred to me that AMD actually did NOT say Vega was designed with four geometry engines, only that in Vega, four geometry engines could handle 11 polygons per clock. Vega's physical implementation could have more than those 4, if they're trying to understate a characteristic for once. A little far-fetched, I know, but still possible.
The text of the slide says:
New Programmable Geometry Pipeline
Over 2X peak throughput per clock
The respective footnote (full):
Geometry throughput slide: Data based on AMD Engineering design of Vega. Radeon R9 Fury X has 4 geometry engines and a peak of 4 polygons per clock. Vega is designed to handle up to 11 polygons per clock with 4 geometry engines. This represents an increase of 2.6x. VG-3
Geometry engines are the fixed-function portion of a shader engine, and Vega 10's CU count has apparently been pinned at 64. The design norm would be 4 shader engines with 16 CUs each. That doesn't rule out AMD changing something, but if the pattern holds, the next increment is a big jump to 8 shader engines, with a lower CU to fixed-function ratio than Polaris.
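To put numbers on that ratio, here is a quick back-of-the-envelope sketch. The 64 CU figure and the 4 vs. 8 shader engine split come from the paragraph above; the Polaris 10 configuration (36 CUs across 4 shader engines) is my own assumption for the comparison, and the function name is just for illustration.

```python
# Rough CU-per-shader-engine comparison. Illustrative only.

def cus_per_shader_engine(total_cus, shader_engines):
    """Return how many CUs sit behind each shader engine's fixed-function block."""
    return total_cus / shader_engines

# Vega 10 with the usual layout: 64 CUs over 4 shader engines.
vega_4se = cus_per_shader_engine(64, 4)

# Hypothetical Vega 10 with 8 shader engines (the "big jump" case).
vega_8se = cus_per_shader_engine(64, 8)

# Assumption, not from the slide: Polaris 10 as 36 CUs over 4 shader engines.
polaris10 = cus_per_shader_engine(36, 4)

print(f"Vega, 4 SE:    {vega_4se:.1f} CUs per shader engine")
print(f"Vega, 8 SE:    {vega_8se:.1f} CUs per shader engine")
print(f"Polaris 10:    {polaris10:.1f} CUs per shader engine")
```

Under those assumptions, the 8 shader engine case lands just below Polaris 10 in CUs per shader engine, which is what "a lower CU to fixed-function ratio than Polaris" refers to.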
Rasterizer and ROP throughput would presumably jump by a similar magnitude.
The number of RBE clients now plugged into AMD's L2 is another item of concern: that many ROPs would have been less to worry about back when they were incoherent, and AMD didn't dare even then.
As notable as that would be, it would seemingly be contrary to the efficiency goals implied by the binning logic and probably belied by the emphasis on a clock increase. It would also be rather ironic if Vega is supposed to bring in features heralding the replacement of fixed-function primitive handling by having the highest ratio of dedicated hardware to programmable throughput in generations.
I'm open to being pleasantly surprised, but the conservative interpretation of the footnote is that Fury X has 4 geometry engines and a peak of 4 polygons per clock, and Vega (non-specific, possibly speaking for the whole Vega family) can go up to ("up to" is another way of saying "peak") 11 polygons per clock with 4 geometry engines. A >5x increase in any SKU of the Vega line would be something AMD would be sorely tempted to put into marketing.
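For reference, the arithmetic behind the two readings can be sketched as follows. The 4 and 11 polygons-per-clock figures are from the footnote; the 8 geometry engine case is purely the hypothetical discussed above, and it assumes peak throughput scales linearly with engine count, which nothing in the slide confirms.

```python
# Peak geometry throughput under the two readings of the footnote.

FURY_X_PEAK = 4            # polygons per clock, 4 geometry engines (footnote)
VEGA_PEAK_4_ENGINES = 11   # polygons per clock, 4 geometry engines (footnote)

def peak_polygons(engines, per_engine_rate):
    """Peak polygons per clock, assuming linear scaling with engine count."""
    return engines * per_engine_rate

# Per-engine peak implied by the footnote.
vega_per_engine = VEGA_PEAK_4_ENGINES / 4

# Conservative reading: Vega has exactly the 4 engines mentioned.
conservative_ratio = VEGA_PEAK_4_ENGINES / FURY_X_PEAK

# Hypothetical reading: 8 engines at the same per-engine peak.
hypothetical_peak = peak_polygons(8, vega_per_engine)
hypothetical_ratio = hypothetical_peak / FURY_X_PEAK

print(f"Conservative: {conservative_ratio:.2f}x over Fury X")
print(f"Hypothetical 8 engines: {hypothetical_peak:.0f} polygons/clock, "
      f"{hypothetical_ratio:.2f}x over Fury X")
```

Note that the plain 11/4 ratio comes out slightly above the 2.6x the footnote states; I don't know what accounts for the difference. The 8-engine case is where the ">5x" figure comes from.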
I look forward to more information on the L2 and memory controller/interconnect. One interpretation of all of this is that the high bandwidth cache controller is where the Infinity Fabric sits, which leaves the L2 less disrupted, since the fabric's throughput would not constrain the L2's traditionally higher internal bandwidth, particularly with geometry, compute, and pixel data paths all hitting the L2. It doesn't seem to make too much sense for a discrete consumer graphics part, but perhaps this is a hallmark of Vega's non-consumer ambitions.
Another question I have is the L2's slice structure and capacity. Fiji had 32 channels of HBM, and memory synthetics showed a general equivalence to Hawaii until access patterns found a way to exceed the on-die capabilities of the whole hierarchy, rather than showing any scaling of L2 capability with the higher channel count. That's almost as if the L2 cache was not fully distributed across the 4 stacks of HBM. Vega keeping to 2 stacks of HBM2 allows a Hawaii-type pairing of slices to channels, which might make for better utilization of memory bandwidth.
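The channel counts behind that comparison work out roughly as below. The 8 channels per stack figure (for both HBM and HBM2, ignoring pseudo-channel mode) and Hawaii's 512-bit bus split into 32-bit GDDR5 channels are my assumptions, not something from the slides; only Fiji's 32 channels and the stack counts are stated above.

```python
# Memory channel counts for the L2 slice-to-channel comparison.

CHANNELS_PER_HBM_STACK = 8   # assumption: 8 channels per stack, HBM and HBM2

def hbm_channels(stacks):
    """Total memory channels for a given number of HBM/HBM2 stacks."""
    return stacks * CHANNELS_PER_HBM_STACK

fiji_channels = hbm_channels(4)   # 4 stacks of HBM
vega_channels = hbm_channels(2)   # 2 stacks of HBM2

# Assumption: Hawaii's 512-bit GDDR5 bus as 32-bit channels.
hawaii_channels = 512 // 32

print(f"Fiji:   {fiji_channels} channels")
print(f"Vega:   {vega_channels} channels")
print(f"Hawaii: {hawaii_channels} channels")
```

If those assumptions hold, Vega's channel count matches Hawaii's, which is why a Hawaii-style slice-to-channel pairing looks plausible.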