AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Are we still operating under the assumption that it has 4096 ALUs? 1500 MHz still seems excessive to me. Since this is an HPC part as well, it would strongly imply that any consumer version would clock at 1600 MHz or higher.
That would indeed be a ~25% clock increase over what was possible with Polaris. Maybe that is doable with lots of tweaking (assuming the CUs are still similar in the first place), but I wouldn't be surprised if it ends up with more than 64 CUs either (like 72 or 80). From a die size perspective, more units should certainly be doable.
 
Even operating under the assumption that they were able to drastically improve clocking for Vega, it's unusual for HPC/Pro cards to clock high, both because of reliability/stability concerns in long-term operation and because they have to meet efficiency targets.

A 1520 MHz pro card would necessarily entail even higher-clocked consumer chips, which would put it squarely in Pascal territory despite the usual pattern of AMD's designs being denser in terms of both transistor density and ALU density.

Admittedly, one can envision a sort of middle ground, with the Vega chip in question being larger than the node shrink would allow (less dense than Polaris). I would be quite impressed if this were the case. With the front-end improvements from Polaris, hopefully further improvements at the SE level (possibly more than four?) and a higher clock envelope, this could actually be a very competitive chip.

Somewhat unrelated but I thought I'd post http://blogs.barrons.com/techtrader...phics-supplies-piling-up-warns-pacific-crest/
 
Even operating under the assumption that they were able to drastically improve clocking for Vega, it's unusual for HPC/Pro cards to clock high, both because of reliability/stability concerns in long-term operation and because they have to meet efficiency targets.
There is a lot of room though. You assume the design is targeted at the lower "usual" clocks, but they may have planned and designed Vega for higher ones. Not to mention that the Pro WX 7100 already has a pretty "unusual" boost clock in this sense.
 
If it's close to 2X Polaris 10 in size then it would be 36*2 = 72 CUs.
Assuming the same number of ALUs per CU as past GCN architectures, it would be 4608 ALUs. So 9.216 TFLOPs at 1GHz.
12.5 TFLOPs would be achieved with a 1.356GHz "boost" clock. Rounded down to 1.35GHz, that gives 24.88 TFLOPs FP16, at which point they could be entitled to call it an "MI25".

Depending on how many transistors-per-CU AMD is spending to achieve 2*FP16 throughput, we would probably be looking at a chip size between 400 and 500mm^2.
So definitely not a competitor to GP104 in size, but rather to GP102.
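
For anyone who wants to check the arithmetic, here is a minimal Python sketch of it. It assumes 64 ALUs per CU as in earlier GCN parts, counts one FMA as 2 FLOPs per ALU per clock, and models packed FP16 as a flat 2x rate. The function names are made up for illustration and none of the inputs are confirmed Vega specs.

```python
# Peak-throughput arithmetic for a speculative 72 CU part.
ALUS_PER_CU = 64          # assumption: unchanged from earlier GCN
FLOPS_PER_ALU_CLOCK = 2   # one fused multiply-add counted as 2 FLOPs


def peak_tflops(cus, clock_ghz, rate=1):
    """Theoretical peak in TFLOPs; rate=2 models double-rate FP16."""
    return cus * ALUS_PER_CU * FLOPS_PER_ALU_CLOCK * rate * clock_ghz / 1000.0


def clock_for_tflops(cus, target_tflops, rate=1):
    """Clock in GHz needed to reach target_tflops."""
    return target_tflops * 1000.0 / (cus * ALUS_PER_CU * FLOPS_PER_ALU_CLOCK * rate)


cus = 36 * 2                                # 2x Polaris 10
alus = cus * ALUS_PER_CU                    # 4608
fp32_at_1ghz = peak_tflops(cus, 1.0)        # 9.216
boost_ghz = clock_for_tflops(cus, 12.5)     # about 1.356
fp16_at_1350 = peak_tflops(cus, 1.35, rate=2)

print(alus, fp32_at_1ghz, round(boost_ghz, 3), round(fp16_at_1350, 2))
```

Plugging in 64 CUs instead shows how much higher the clock would have to be for the same 12.5 TFLOPs.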



I'm saying two Vega10s on the same interposer providing a package roughly the same size as Fiji.

I think the biggest reason why the MI25 card isn't carrying 2* Vega 10 chips is because such a chip would probably overlap with Polaris 10 in gaming performance.
12.5 TFLOPs / 2 = 6.25 TFLOPs per Vega chip. This is about the same FP32 throughput as a Polaris 10 clocked at ~1.35 GHz, and many custom RX480 models are already being sold with those clocks.

Not only could this make AMD cards cannibalize themselves, but the performance level is probably not interesting for investing in an interposer + HBM2 solution.
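
The overlap with Polaris 10 can be checked the same way. This is only a sketch of the hypothetical dual-chip split discussed here; the 36 CU / 64 ALUs-per-CU figures for Polaris 10 are the known RX 480 configuration, and the helper name is made up.

```python
# Hypothetical dual-chip MI25 split versus an overclocked Polaris 10.
def fp32_tflops(alus, clock_ghz):
    """Theoretical FP32 peak, counting one FMA as 2 FLOPs."""
    return alus * 2 * clock_ghz / 1000.0


per_chip = 12.5 / 2                        # 6.25 TFLOPs per chip
polaris10 = fp32_tflops(36 * 64, 1.35)     # 2304 ALUs at 1.35 GHz
gap_pct = (per_chip / polaris10 - 1) * 100

print(per_chip, round(polaris10, 4), round(gap_pct, 2))
```

The gap comes out at under one percent, which is the overlap being described.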
 
That's pretty uncanny, I just wrote the same thing a few minutes ago on another forum lol.

12 CUs per SE would allow for 6 SEs, and it would make sense given that it would mean a less dense design, which plays well with higher clocks.

The LinkedIn profile leak specifically mentioned a 4096 ALU Vega chip though.
 
I think the biggest reason why the MI25 card isn't carrying 2* Vega 10 chips is because such a chip would probably overlap with Polaris 10 in gaming performance.
12.5 TFLOPs / 2 = 6.25 TFLOPs per Vega chip. This is about the same FP32 throughput as a Polaris 10 clocked at ~1.35 GHz, and many custom RX480 models are already being sold with those clocks.
That can be tricky though. TFLOPs may not be comparable between generations. On top of that Polaris isn't suitable for mid to high tier APUs. We know they are designing two Vegas that ideally both work for APUs as well as being able to compete with P100. HBM will be expensive, but the APU can offset that a bit.
 
I'm saying two Vega10s on the same interposer providing a package roughly the same size as Fiji. So it makes sense that if AMD has 25% higher FP16 performance than P100 at ~600mm2 they would at least have similar areas. Vega11 was in theory the "big" one. While possible, I doubt they made a chip significantly larger than 600mm2. Vega10 at ~300mm2 benchmarking a bit faster than an overclocked 1080 (314mm2) makes sense. Then a pair of them at lower clocks for the big card with double the RAM. Otherwise they're horribly inefficient if 50% more bandwidth and double the die size is only marginally faster than GP104 in games, but outpaces P100 (which already has 2xFP16) in deep learning with 30% less bandwidth.
Ah, I didn't follow you there. Sorry. But it seems to have been cleared up in the meantime anyway. :)

One thing to keep in mind is that AMD already set aside $340 million in Q3 2016 to pay GlobalFoundries for producing chips at TSMC.
Given the timing of Vega's production schedule (news websites are claiming the Vega chips being presented only came back a few weeks ago), there's a good possibility that Vega is being produced at TSMC, whose 16FF+ seems to be achieving substantially higher clocks than Samsung's/GF's 14F.
So Vega could be very different from Polaris in terms of achievable clocks (it had better be, otherwise AMD spent >$340M for nothing?).
Yeah, that's also a factor, and it reinforces my sentiment that the perf/watt sweet spot should have moved up considerably.
 
One thing to keep in mind is that AMD already set aside $340 million in Q3 2016 to pay GlobalFoundries for producing chips at TSMC.
Given the timing of Vega's production schedule (news websites are claiming the Vega chips being presented only came back a few weeks ago), there's a good possibility that Vega is being produced at TSMC, whose 16FF+ seems to be achieving substantially higher clocks than Samsung's/GF's 14F.
So Vega could be very different from Polaris in terms of achievable clocks (it had better be, otherwise AMD spent >$340M for nothing?).

Is the $340M figure the same charge listed in the following:
http://www.anandtech.com/show/10631/amd-amends-globalfoundries-wafer-supply-agreement-through-2020

The waiver payment was $100M, with the stock warrant being $235M.
That doesn't mean Vega wouldn't be part of the waived inventory, just that the cost isn't quite that high.

Given the Radeon group's financial performance thus far, even if its whole output were under the waiver's umbrella, it probably would not justify $340M. Discrete on its own might not earn that back for quite some time. It's possible that there is semicustom/x86 hardware included. Those lines might have the financial impact to justify even the $100M payment.

This doesn't include the quarterly payments for the wafer capacity that is transferred.
 
That would indeed be a ~25% clock increase over what was possible with Polaris. Maybe that is doable with lots of tweaking (assuming the CUs are still similar in the first place), but I wouldn't be surprised if it ends up with more than 64 CUs either (like 72 or 80). From a die size perspective, more units should certainly be doable.


Well, to get that kind of base clock they would have needed to make a lot of low-level changes. If they are talking about boost clocks, I think they can reach that with 300 watts as their power envelope. GF vs. TSMC, I don't think that matters much in this context.
 
I think the biggest reason why the MI25 card isn't carrying 2* Vega 10 chips is because such a chip would probably overlap with Polaris 10 in gaming performance.
12.5 TFLOPs / 2 = 6.25 TFLOPs per Vega chip. This is about the same FP32 throughput as a Polaris 10 clocked at ~1.35 GHz, and many custom RX480 models are already being sold with those clocks.
Just realized an error in my reading skills. While the FP16 performance is in theory faster than P100, it was benchmarked against GP102, which lacks double-rate FP16, not against the P100. MI25 was therefore ~46% faster than GP102 in deep learning. It is likely slower than P100 (not shown), in line with the bandwidth limits. That also lines up with Vega10 being little Vega and likely ~300mm2, on par with GP104, albeit with HBM. If fabbed at TSMC with a slightly higher ideal clock speed, 4096+1024 (equivalent scalar) at 1.25GHz makes a lot of sense. The scalar portion wouldn't be very efficient for deep learning and nicely parallel benchmarks, so 20% of theoretical performance is left on the table. Eliminate the scalars and the benchmark is 46% higher FP16 performance for MI25 versus 55% higher theoretical TFLOPs. Well within what I'd consider a margin of error and reasonably efficient. That lines up much better with realistic expectations.
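
As a sanity check on the 4096+1024 idea, here is a small Python sketch. It treats the "equivalent scalar" units as 1024 extra lanes counted in the marketing number, counts one FMA as 2 FLOPs, and assumes double-rate FP16. All of that is this post's speculation, not a confirmed Vega configuration.

```python
# Speculative split: 4096 vector lanes plus 1024 "scalar-equivalent" lanes.
VECTOR_LANES = 4096
SCALAR_EQUIV_LANES = 1024   # assumption taken from the post above
CLOCK_GHZ = 1.25


def fp16_tflops(lanes, clock_ghz):
    """FMA = 2 FLOPs, packed FP16 = 2x rate."""
    return lanes * 2 * 2 * clock_ghz / 1000.0


total = fp16_tflops(VECTOR_LANES + SCALAR_EQUIV_LANES, CLOCK_GHZ)   # 25.6
vector_only = fp16_tflops(VECTOR_LANES, CLOCK_GHZ)                  # 20.48
scalar_share = SCALAR_EQUIV_LANES / (VECTOR_LANES + SCALAR_EQUIV_LANES)

print(total, vector_only, scalar_share)
```

Under those assumptions the total lands close to the 25 TFLOPs implied by the MI25 name, and the scalar share is the 20% mentioned above.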

So this wouldn't be the enthusiast class product, just a high-end part around the 1080 and likely approaching GP102 graphics performance. A dual Vega10 could in theory still exist, but we'd be looking at double the performance, not accounting for thermal issues. That also leaves room for a Vega11 around 450mm2 along the lines of GP102 that is still likely small enough to fit onto an APU. Getting Vega10 faster than P100 at FP16 was what messed up all the math.

Posting here for reference.
 
If the single-chip Vega confirmed to be in MI25 is the smaller variant, how much power budget do you think the larger chip will be able to wield beyond the max of 299.9999 watts that is more or less confirmed for MI25? Yes, maybe the SSD cache (if present) eats into it as well, but I do not see awfully large amounts left on the table.
 
MI25 has the same TDP as the GP100/GP102 offerings by NV. If a 300mm² chip needs a similar amount (and remember they had the Fiji-based card at 175W), I fail to see how a bigger 450mm² chip should be possible.
 
If the single-chip Vega confirmed to be in MI25 is the smaller variant, how much power budget do you think the larger chip will be able to wield beyond the max of 299.9999 watts that is more or less confirmed for MI25? Yes, maybe the SSD cache (if present) eats into it as well, but I do not see awfully large amounts left on the table.
May depend on the workload. Drop the voltages and clocks to the optimal point on the curve and scale clocks based on power budget. I doubt it doubles performance under optimal conditions, but probably gains 50-75% in the power budget. Might actually double if tasked with less than ideal workloads that can't stress the chip. Hit a bottleneck and that power limit disappears fast. Also offers the possibility of double the bandwidth and density if that is a concern.

MI25 has the same TDP as the GP100/GP102 offerings by NV. If a 300mm² chip needs a similar amount (and remember they had the Fiji-based card at 175W), I fail to see how a bigger 450mm² chip should be possible.
To the best of my knowledge they haven't stated the power requirements of MI25. Nothing beyond an entire rack needs x amount of power. Realistically it may only be using around 200W. The <300W figure is the standard design limit for a server. Thanks to performance curves, more silicon at lower clocks scales better. Doubling silicon or increasing clocks by 50% both likely result in double or worse power consumption. Rough comparison, but maximizing silicon nearly always yields the most efficient solution. Same reason the Nano has the perf/watt it does.
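
To put rough numbers on the silicon-versus-clocks comparison, here is a toy Python model. It assumes dynamic power scales as units x clock x voltage squared, and (a strong simplification that is not stated above) that voltage has to rise linearly with clock. Real v/f curves are chip-specific, so treat the outputs as illustrative only.

```python
# Toy model: relative power and performance versus a baseline chip (1 unit, clock 1.0).
def relative_power(units, clock, volt_per_clock=1.0):
    """Dynamic power ~ units * clock * V^2, with V assumed linear in clock."""
    voltage = 1.0 + volt_per_clock * (clock - 1.0)
    return units * clock * voltage ** 2


def relative_perf(units, clock):
    """Ideal scaling: throughput ~ units * clock."""
    return units * clock


wide_power = relative_power(2.0, 1.0)   # double the silicon, same clock
fast_power = relative_power(1.0, 1.5)   # same silicon, +50% clock
wide_perf_per_watt = relative_perf(2.0, 1.0) / wide_power
fast_perf_per_watt = relative_perf(1.0, 1.5) / fast_power

print(wide_power, fast_power, wide_perf_per_watt, round(fast_perf_per_watt, 3))
```

In this model doubling the silicon doubles power for double the throughput, while a 50% clock bump costs more than triple the power for 1.5x the throughput, which is the direction of the argument above.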
 
May depend on the workload. Drop the voltages and clocks to the optimal point on the curve and scale clocks based on power budget. I doubt it doubles performance under optimal conditions, but probably gains 50-75% in the power budget. Might actually double if tasked with less than ideal workloads that can't stress the chip. Hit a bottleneck and that power limit disappears fast. Also offers the possibility of double the bandwidth and density if that is a concern.


To the best of my knowledge they haven't stated the power requirements of MI25. Nothing beyond an entire rack needs x amount of power. Realistically it may only be using around 200W. The <300W figure is the standard design limit for a server. Thanks to performance curves, more silicon at lower clocks scales better. Doubling silicon or increasing clocks by 50% both likely result in double or worse power consumption. Rough comparison, but maximizing silicon nearly always yields the most efficient solution. Same reason the Nano has the perf/watt it does.

So you're implying AMD might have intentionally undersold its MI25 product in its announcement, because they either did not utilize the 300W bracket fully or pushed the card beyond the optimal/reasonable point on the v/f curve? That would be a bad move in my book. And besides, they are rather specific about watts with MI8, which hits neither the 150W nor the 225W bracket that server guys usually care about the most.
 
I wouldn't say undersold. They simply stated it will consume between 0W and the maximum allowed by the slot. They simply avoided the question. I have no doubt it could probably use all 300W, it's just that they didn't specify. The other two cards demoed were essentially a 480 and a Nano, which were known. Both of those can hit 300W, but that's not what they quoted. Nearly all cards are past the optimal v/f point on the curve. It's just a matter of maximizing the silicon for the architecture and pushing up clocks from there. There is always some slack to pick up. The optimal die size for 300W is probably well north of 1000mm2. There are some obvious limitations with that.

I believe even Raja has stated Vega 11 is larger than Vega 10 and it still has to fit in that 300W window. So even if both designs consume 300W the larger die will be significantly faster.
 
Well, it would be strange to market a 200W TDP card as <300W, because everybody understands <300W as roughly 250W+.

In that particular deep learning part, I think AMD is literally just saying that the card will meet the PCIe spec limit of 300W.

But yeah, for consumer desktop cards, we're probably stuck at 250W for the simple reason that OEMs are used to it. They all have cooling solutions for exactly that amount of heat in a blower form factor, so it'd be a drop-in for them. Remember that OEMs actually care about this stuff.
 