AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

When I first saw that diagram, I did not think of asymmetrical SIMDs but of lane gating, which can help quite a bit if you expect to run into power limits or have a rather aggressive clock boost in place. Since the number of ALUs itself hasn't been AMD's problem even on 28nm, they surely would be able to invest some of the area saved on 14nm into more sophisticated power management. Initially I'd have expected only one or maybe two of the four SIMDs in a CU to sport a feature like this, for area reasons, since I honestly have no idea how costly it would be. But the truck diagram, even if it may not be legit, seems to imply otherwise (the trailers are still all there, none completely saved).
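For readers who haven't run into the term: lane gating just means power/clock-gating the SIMD lanes whose execution-mask bits are off, so divergent or partially filled wavefronts don't burn energy in idle ALUs. A toy C++ model of the idea (the mask width and the energy numbers are obviously made up for illustration, nothing AMD-specific):

Code:
#include <cstdint>
#include <cstdio>

// Toy model of lane gating: only lanes with their exec-mask bit set do work
// (and cost energy); gated lanes are skipped entirely. Numbers are made up.
constexpr int kLanes = 64;

double issue(std::uint64_t exec_mask, bool lane_gating) {
    double energy = 0.0;
    for (int lane = 0; lane < kLanes; ++lane) {
        bool active = (exec_mask >> lane) & 1u;
        if (active)
            energy += 1.0;    // active lane executes the op
        else if (!lane_gating)
            energy += 0.3;    // ungated idle lane still gets clocked
    }
    return energy;
}

int main() {
    std::uint64_t mask = 0x00000000FFFFFFFFull;  // only 32 of 64 lanes active
    std::printf("without gating: %.1f, with gating: %.1f\n",
                issue(mask, false), issue(mask, true));
    return 0;
}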
 
No mention of lane gating or variable sized wavefronts (so probably not coming or at least not heavily advertised).
New geometry pipeline with tile bins and HSR (one could call it "partially deferred"?), increasing throughput to up to 11 triangles/clock with 4 geometry engines and improving bandwidth and energy efficiency.
No more ROP caches (handled by L2 now).
The NCU can handle 128 FP32 ops per clock instead of 64. Just larger? And the NCU is optimized for higher clock speeds.
HBC is not HBM.
 
CUs now have 128 ALUs, or maybe they are clocked at twice the frequency. I can't really tell from this slide.
Geometry performance seems to be substantially enhanced over Polaris, which was already seemingly on par with Pascal.
ROPs using L2 cache probably means there's a lot more L2 cache in it. It could also mean Vega is even less dependent on VRAM bandwidth.
 
  • "Storage Network" on the cache controller along with NVRAM and System DRAM
  • "NCU is optimized for higher clock speeds and higher IPC" (Looks like 2x packed math for higher IPC, no idea on optimizations)
  • 512 8-bit ops/clock; Double Precision is "Configurable" (No idea how, but seemingly different from the 8-bit, 16-bit, and 32-bit power-of-two rates, or it would have been shown)
Not much on CU configuration beyond apparently twice as many lanes as before. The CU-to-NCU graphic is kind of strange. Regular/packed math at seemingly twice the frequency. No explanation of how the clocks increased.
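To make the power-of-two scaling from those bullets concrete, here is a minimal C++ sketch deriving the per-NCU, per-clock rates from 64 lanes with 2x/4x packing and FMA counted as two ops. This is just one possible reading of the slide (the MADD-as-two-ops idea comes up again further down the thread); the lane count and packing factors are assumptions, not confirmed details.

Code:
#include <cstdio>

// Rough model of the per-NCU, per-clock rates implied by the slide, assuming
// 64 FP32 lanes, FMA counted as 2 ops, and packed math for the narrower
// types. All of this is speculation, not a spec.
int main() {
    const int lanes       = 64; // classic GCN CU width (assumption)
    const int ops_per_fma = 2;  // multiply + add counted separately

    const int fp32_per_clock = lanes * 1 * ops_per_fma;  // 128 32-bit ops
    const int fp16_per_clock = lanes * 2 * ops_per_fma;  // 256 16-bit ops (2-wide packing)
    const int int8_per_clock = lanes * 4 * ops_per_fma;  // 512  8-bit ops (4-wide packing)

    std::printf("FP32: %d, FP16: %d, INT8: %d ops/clock\n",
                fp32_per_clock, fp16_per_clock, int8_per_clock);
    return 0;
}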

No mention of the scalar unit, but I'm guessing there has to be more than one now? Without ridiculously higher clocks beyond the SIMD increase, I doubt a single one could keep pace with 8 SIMDs generating just flow control. The design would then seemingly have half as many CUs, or the quoted FLOPs would be substantially higher with double the ALUs and higher clocks. It would also correspond to half as many nodes on that mesh.
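Rough numbers to illustrate the FLOPs point (the CU count and clock below are placeholders for the sake of the arithmetic, not figures from the slides):

peak FP32 ≈ CUs × lanes × 2 (FMA) × clock
64 CUs × 64 lanes × 2 × 1.5 GHz ≈ 12.3 TFLOPS
64 CUs × 128 lanes × 2 × 1.5 GHz ≈ 24.6 TFLOPS

So if the lanes per CU really doubled, either the CU count roughly halves for the same headline figure, or the quoted FLOPs should come in about twice as high.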

It could also mean Vega is even less dependent on VRAM bandwidth.
For the deferred stuff with that binning, it should be substantially less dependent. A lot more cache seems likely, both for the bins and for driving twice as many SIMD lanes as before, so that should help too.
 
CUs now have 128 ALUs, or maybe they are clocked at twice the frequency. I can't really tell from this slide.
Both. Twice the throughput per clock (128 fp32 ops) and higher clocked.
Geometry performance seems to be substantially enhanced over Polaris, which was already seemingly on par with Pascal.
Polaris isn't on par with Pascal. The tile binning rasterizer could propel its performance past a Pascal part with the same number of rasterizers, depending on details like how much on-chip storage there is for the tile bin cache and how well the work distribution ties in with the framebuffer tile cache in the L2 (as the specialized ROP caches may be gone, one needs much more bandwidth there; or the L2 just backs the still-existing ROP caches [the slide also shows an L1 for the pixel pipeline backed by the L2]).
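For anyone unfamiliar with what a tile-binning pass buys you, here is a bare-bones C++ sketch of the idea: triangles are sorted into screen-space bins first, then each bin is processed on its own, so framebuffer traffic stays local to a tile that can live in on-chip storage. This is a generic illustration of binning, not AMD's actual pipeline; the tile size and data structures are made up.

Code:
#include <algorithm>
#include <cstdio>
#include <vector>

// Generic illustration of a tile-binning rasterizer front end.
// Pass 1 sorts triangles into screen-space bins; pass 2 processes one bin at
// a time, so color/depth for that tile need not be streamed to and from VRAM
// for every triangle.

struct Tri { float x0, y0, x1, y1, x2, y2; };    // screen-space vertices
constexpr int kTile = 32;                        // bin size in pixels (made up)
constexpr int kScreenW = 256, kScreenH = 256;

int main() {
    std::vector<Tri> tris = { {10, 10, 200, 20, 50, 180},
                              {120, 120, 250, 130, 130, 250} };

    const int binsX = (kScreenW + kTile - 1) / kTile;
    const int binsY = (kScreenH + kTile - 1) / kTile;
    std::vector<std::vector<int>> bins(binsX * binsY);

    // Pass 1: bin triangles by their screen-space bounding box.
    for (int i = 0; i < (int)tris.size(); ++i) {
        const Tri& t = tris[i];
        int bx0 = (int)std::min({t.x0, t.x1, t.x2}) / kTile;
        int by0 = (int)std::min({t.y0, t.y1, t.y2}) / kTile;
        int bx1 = (int)std::max({t.x0, t.x1, t.x2}) / kTile;
        int by1 = (int)std::max({t.y0, t.y1, t.y2}) / kTile;
        for (int by = by0; by <= by1; ++by)
            for (int bx = bx0; bx <= bx1; ++bx)
                bins[by * binsX + bx].push_back(i);
    }

    // Pass 2: walk the bins; each bin's triangle list would be shaded against
    // a tile-sized framebuffer kept on chip (here we just report the lists).
    for (int b = 0; b < (int)bins.size(); ++b)
        if (!bins[b].empty())
            std::printf("bin %d: %zu triangle(s)\n", b, bins[b].size());
    return 0;
}

The HSR part would sit in pass 2: with all of a tile's triangles known up front, covered fragments can be rejected before shading, which is where the bandwidth and shading savings would come from.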
 
Gipsel, why are you sure HBC is not HBM? I did not see that in the leaked slides.
 
And where does it say anything about HBC NOT being HBM (gen2)? Please do not go only by fancy names printed on a marketing slide.


Remember Xeon Phi? One of the applications for its HMC is DRAM caching.
 
It's clearly a new cache on the chip, and it would be news to me if one were needed for HBM(2) integration.
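To illustrate the "DRAM caching" idea in software terms: below is a minimal C++ sketch of treating a small fast pool (a stand-in for HBM/HBC) as a page cache in front of a large slow pool (system DRAM or NVRAM), with pages migrated in on demand. Everything here (page size, capacity, eviction policy) is an assumption for illustration, not how the HBCC actually works.

Code:
#include <cstdint>
#include <cstdio>
#include <list>
#include <unordered_map>

// Toy model of "fast memory as a cache over slow memory": pages are pulled
// into the fast pool on first touch and evicted LRU when it fills up.
constexpr std::uint64_t kPageSize  = 64 * 1024;  // 64 KiB pages (assumption)
constexpr std::size_t   kFastPages = 4;          // tiny fast pool for the demo

class PageCache {
    std::list<std::uint64_t> lru_;  // most recently used page at the front
    std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> map_;
public:
    // Returns true on a hit in the fast pool, false when the page had to be
    // migrated in from the slow pool (possibly evicting another one).
    bool touch(std::uint64_t addr) {
        std::uint64_t page = addr / kPageSize;
        auto it = map_.find(page);
        if (it != map_.end()) {                  // hit: refresh LRU position
            lru_.splice(lru_.begin(), lru_, it->second);
            return true;
        }
        if (lru_.size() == kFastPages) {         // miss, pool full: evict LRU
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(page);                   // migrate page into fast pool
        map_[page] = lru_.begin();
        return false;
    }
};

int main() {
    PageCache hbc;
    std::uint64_t accesses[] = {0x0, 0x10000, 0x20000, 0x0, 0x50000, 0x90000, 0x10000};
    for (std::uint64_t a : accesses)
        std::printf("addr 0x%llx -> %s\n", (unsigned long long)a,
                    hbc.touch(a) ? "hit (fast pool)" : "miss (migrated)");
    return 0;
}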
 
Polaris isn't on par with Pascal.

I'm simply talking about this:

[image: benchmark chart from the linked review]


source: http://www.anandtech.com/show/10540/the-geforce-gtx-1060-founders-edition-asus-strix-review/15



Sure, it's a single data point, but Polaris seems to behave a lot better than previous GCN GPUs in titles that are geometry-intensive (e.g. GameWorks games).
 
Both. Twice the throughput per clock (128 fp32 ops) and higher clocked.
The only way to get twice the FP32 throughput per clock is if the SIMDs were chained together and executed consecutive instructions in a single cycle. Simply doubling the number of ALUs isn't what I'd call twice the throughput; that would mean each ALU doubled its throughput. Or they're quoting native FP64 performance, which doesn't appear to be the case.

edit: just for the sake of it, do you think it's a coincidence that the HBM2 slides come right after the HBC slide and before the HBCC one?
Worth mentioning that it followed the slide about "Introducing the world's most scalable GPU memory architecture".
 
Second thought on those 128 ALUs: what if MADD counts as two ops, or they can dual-issue certain FP32 instructions? In that case we're looking at the traditional SIMD count, but with the packed math adding some functionality.

EDIT: This could also be the FMA4 instructions and new GPU-specific ones. Extra operands would be useful with the packed math.
 