AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

When I first saw that diagram, I did not think of asymmetrical SIMDs but of lane gating, which can help quite a bit if you expect to run into power limits or have a rather aggressive clock boost in place. Since the number of ALUs itself hasn't been AMD's problem even on 28nm, they surely would be able to invest some of the area saved on 14nm into more sophisticated power management. Initially I'd have expected only one or maybe two of the four SIMDs in a CU to sport a feature like this, for area reasons, since I honestly have no idea how costly it would be. But the truck diagram, even if it may not be legit, seems to imply otherwise (the trailers are still all there, none completely saved).
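For readers who haven't run into the term: lane gating just means power/clock-gating the SIMD lanes whose execution-mask bits are off, so divergent or partially filled wavefronts don't burn energy in idle ALUs. A toy C++ model of the idea (the mask width and the energy numbers are obviously made up for illustration, nothing AMD-specific):

Code:
#include <cstdint>
#include <cstdio>

// Toy model of lane gating: only lanes with their exec-mask bit set do work
// (and cost energy); gated lanes are skipped entirely. Numbers are made up.
constexpr int kLanes = 64;

double issue(std::uint64_t exec_mask, bool lane_gating) {
    double energy = 0.0;
    for (int lane = 0; lane < kLanes; ++lane) {
        bool active = (exec_mask >> lane) & 1u;
        if (active)
            energy += 1.0;    // active lane executes the op
        else if (!lane_gating)
            energy += 0.3;    // ungated idle lane still gets clocked
    }
    return energy;
}

int main() {
    std::uint64_t mask = 0x00000000FFFFFFFFull;  // only 32 of 64 lanes active
    std::printf("without gating: %.1f, with gating: %.1f\n",
                issue(mask, false), issue(mask, true));
    return 0;
}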
 
No mention of lane gating or variable sized wavefronts (so probably not coming or at least not heavily advertised).
New geometry pipeline with tile bins and HSR (one could call it "partially deferred"?), increasing throughput to up to 11 triangles/clock with 4 geometry engines and improving bandwidth and energy efficiency.
No more ROP caches (handled by L2 now).
The NCU can handle 128 FP32 ops per clock instead of 64. Just larger? And the NCU is optimized for higher clock speeds.
HBC is not HBM.
 
CUs now have 128 ALUs, or maybe they are clocked at twice the frequency. I can't really tell from this slide.
Geometry performance seems to be substantially enhanced over Polaris, which was already seemingly on par with Pascal.
ROPs using L2 cache probably means there's a lot more L2 cache in it. It could also mean Vega is even less dependent on VRAM bandwidth.
 
  • "Storage Network" on the cache controller along with NVRAM and System DRAM
  • "NCU is optimized for higher clock speeds and higher IPC" (Looks like 2x packed math for higher IPC, no idea on optimizations)
  • 512 8-bit ops/clock; Double Precision is "Configurable" (No idea how, but seemingly different from the 8-bit, 16-bit, and 32-bit power-of-two rates, or it would have been shown)
Not much on CU configuration beyond apparently twice as many lanes as before. The CU-to-NCU graphic is kind of strange. Regular/packed math at seemingly twice the frequency. No explanation of how the clocks increased.
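To make the power-of-two scaling from those bullets concrete, here is a minimal C++ sketch deriving the per-NCU, per-clock rates from 64 lanes with 2x/4x packing and FMA counted as two ops. This is just one possible reading of the slide (the MADD-as-two-ops idea comes up again further down the thread); the lane count and packing factors are assumptions, not confirmed details.

Code:
#include <cstdio>

// Rough model of the per-NCU, per-clock rates implied by the slide, assuming
// 64 FP32 lanes, FMA counted as 2 ops, and packed math for the narrower
// types. All of this is speculation, not a spec.
int main() {
    const int lanes       = 64; // classic GCN CU width (assumption)
    const int ops_per_fma = 2;  // multiply + add counted separately

    const int fp32_per_clock = lanes * 1 * ops_per_fma;  // 128 32-bit ops
    const int fp16_per_clock = lanes * 2 * ops_per_fma;  // 256 16-bit ops (2-wide packing)
    const int int8_per_clock = lanes * 4 * ops_per_fma;  // 512  8-bit ops (4-wide packing)

    std::printf("FP32: %d, FP16: %d, INT8: %d ops/clock\n",
                fp32_per_clock, fp16_per_clock, int8_per_clock);
    return 0;
}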

No mention of the scalar unit, but I'm guessing there has to be more than one now? Without ridiculously higher clocks beyond the SIMD increase, I doubt a single one could keep pace with 8 SIMDs generating just flow control. The design would then seemingly have half as many CUs, or the quoted FLOPs would be substantially higher with double the ALUs and higher clocks. It would also correspond to half as many nodes on that mesh.
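Rough numbers to illustrate the FLOPs point (the CU count and clock below are placeholders for the sake of the arithmetic, not figures from the slides):

peak FP32 ≈ CUs × lanes × 2 (FMA) × clock
64 CUs × 64 lanes × 2 × 1.5 GHz ≈ 12.3 TFLOPS
64 CUs × 128 lanes × 2 × 1.5 GHz ≈ 24.6 TFLOPS

So if the lanes per CU really doubled, either the CU count roughly halves for the same headline figure, or the quoted FLOPs should come in about twice as high.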

It could also mean Vega is even less dependent on VRAM bandwidth.
For the deferred stuff with that binning, it should be substantially less dependent. A lot more cache seems likely, both for the bins and for driving twice as many SIMD lanes as before, so that should help too.
 
CUs now have 128 ALUs, or maybe they are clocked at twice the frequency. I can't really tell from this slide.
Both. Twice the throughput per clock (128 fp32 ops) and higher clocked.
Geometry performance seems to be substantially enhanced over Polaris, which was already seemingly on par with Pascal.
Polaris isn't on par with Pascal. The tile binning rasterizer could propel its performance past a Pascal part with the same number of rasterizers, depending on details like how much on-chip storage there is for the tile bin cache and how well the work distribution ties in with the framebuffer tile cache in the L2 (as the specialized ROP caches may be gone, one needs much more bandwidth there; or the L2 just backs the still-existing ROP caches [the slide also shows an L1 for the pixel pipeline backed by the L2]).
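For anyone unfamiliar with what a tile-binning pass buys you, here is a bare-bones C++ sketch of the idea: triangles are sorted into screen-space bins first, then each bin is processed on its own, so framebuffer traffic stays local to a tile that can live in on-chip storage. This is a generic illustration of binning, not AMD's actual pipeline; the tile size and data structures are made up.

Code:
#include <algorithm>
#include <cstdio>
#include <vector>

// Generic illustration of a tile-binning rasterizer front end.
// Pass 1 sorts triangles into screen-space bins; pass 2 processes one bin at
// a time, so color/depth for that tile need not be streamed to and from VRAM
// for every triangle.

struct Tri { float x0, y0, x1, y1, x2, y2; };    // screen-space vertices
constexpr int kTile = 32;                        // bin size in pixels (made up)
constexpr int kScreenW = 256, kScreenH = 256;

int main() {
    std::vector<Tri> tris = { {10, 10, 200, 20, 50, 180},
                              {120, 120, 250, 130, 130, 250} };

    const int binsX = (kScreenW + kTile - 1) / kTile;
    const int binsY = (kScreenH + kTile - 1) / kTile;
    std::vector<std::vector<int>> bins(binsX * binsY);

    // Pass 1: bin triangles by their screen-space bounding box.
    for (int i = 0; i < (int)tris.size(); ++i) {
        const Tri& t = tris[i];
        int bx0 = (int)std::min({t.x0, t.x1, t.x2}) / kTile;
        int by0 = (int)std::min({t.y0, t.y1, t.y2}) / kTile;
        int bx1 = (int)std::max({t.x0, t.x1, t.x2}) / kTile;
        int by1 = (int)std::max({t.y0, t.y1, t.y2}) / kTile;
        for (int by = by0; by <= by1; ++by)
            for (int bx = bx0; bx <= bx1; ++bx)
                bins[by * binsX + bx].push_back(i);
    }

    // Pass 2: walk the bins; each bin's triangle list would be shaded against
    // a tile-sized framebuffer kept on chip (here we just report the lists).
    for (int b = 0; b < (int)bins.size(); ++b)
        if (!bins[b].empty())
            std::printf("bin %d: %zu triangle(s)\n", b, bins[b].size());
    return 0;
}

The HSR part would sit in pass 2: with all of a tile's triangles known up front, covered fragments can be rejected before shading, which is where the bandwidth and shading savings would come from.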
 
Gipsel, why are you sure HBC is not HBM? I did not see that in the leaked slides.
 
And where does it say anything about HBC NOT being HBM (gen2)? Please do not go only by fancy names printed on a marketing slide.


Remember Xeon Phi? One of the applications for its HMC is DRAM caching.
 
It's clearly a new cache on the chip, and it would be news to me if one were needed for HBM(2) integration.
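To illustrate the "DRAM caching" idea in software terms: below is a minimal C++ sketch of treating a small fast pool (a stand-in for HBM/HBC) as a page cache in front of a large slow pool (system DRAM or NVRAM), with pages migrated in on demand. Everything here (page size, capacity, eviction policy) is an assumption for illustration, not how the HBCC actually works.

Code:
#include <cstdint>
#include <cstdio>
#include <list>
#include <unordered_map>

// Toy model of "fast memory as a cache over slow memory": pages are pulled
// into the fast pool on first touch and evicted LRU when it fills up.
constexpr std::uint64_t kPageSize  = 64 * 1024;  // 64 KiB pages (assumption)
constexpr std::size_t   kFastPages = 4;          // tiny fast pool for the demo

class PageCache {
    std::list<std::uint64_t> lru_;  // most recently used page at the front
    std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> map_;
public:
    // Returns true on a hit in the fast pool, false when the page had to be
    // migrated in from the slow pool (possibly evicting another one).
    bool touch(std::uint64_t addr) {
        std::uint64_t page = addr / kPageSize;
        auto it = map_.find(page);
        if (it != map_.end()) {                  // hit: refresh LRU position
            lru_.splice(lru_.begin(), lru_, it->second);
            return true;
        }
        if (lru_.size() == kFastPages) {         // miss, pool full: evict LRU
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(page);                   // migrate page into fast pool
        map_[page] = lru_.begin();
        return false;
    }
};

int main() {
    PageCache hbc;
    std::uint64_t accesses[] = {0x0, 0x10000, 0x20000, 0x0, 0x50000, 0x90000, 0x10000};
    for (std::uint64_t a : accesses)
        std::printf("addr 0x%llx -> %s\n", (unsigned long long)a,
                    hbc.touch(a) ? "hit (fast pool)" : "miss (migrated)");
    return 0;
}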
 
Polaris isn't on par with Pascal.

I'm simply talking about this:

[image: benchmark chart from the linked review]


source: http://www.anandtech.com/show/10540/the-geforce-gtx-1060-founders-edition-asus-strix-review/15



Sure, it's a single data point, but Polaris seems to behave a lot better than previous GCN GPUs in titles that are geometry-intensive (e.g. GameWorks games).
 
Both. Twice the throughput per clock (128 fp32 ops) and higher clocked.
The only way to get twice the FP32 throughput per clock is if the SIMDs were chained together and executed consecutive instructions in a single cycle. Simply doubling the number of ALUs isn't what I'd call twice the throughput; that would mean each ALU doubled its throughput. Or they're quoting native FP64 performance, which doesn't appear to be the case.

edit: just for the sake of it, do you think it's a coincidence that the HBM2 slides come right after the HBC slide and before the HBCC one?
Worth mentioning that it followed the slide about "Introducing the world's most scalable GPU memory architecture".
 
Second thought on those 128 ALUs: what if MADD counts as two ops, or they can dual-issue certain FP32 instructions? In that case we're looking at the traditional SIMD count, but with the packed math adding some functionality.

EDIT: This could also be the FMA4 instructions and new GPU-specific ones. Extra operands would be useful with the packed math.
 