AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Weird, but easily the result of software limiting.
And/or bandwidth and/or geometry performance. In the article's conditions the Vega 56 and 64 have the exact same bandwidth and the same 4 geometry engines working at the same clocks.
With primitive shaders working these two factors might be less of a bottleneck.



Has anyone actually tested a "midrange" configuration with Vega? Somewhere around where that Nano would sit. We've seen the card, but not in any official capacity. Raja did mention not testing dynamic power. With the prices Intel charges for Iris Pro, AMD could probably work an oversized Vega into the lineup just to gain share. Up to 64 CUs at ~1GHz without competition. Nvidia can't make an APU and Intel lacks large enough graphics chips. Lower margins, but higher revenue and share gains to establish themselves.

Raven Ridge is bringing up to 11 NCUs which is more than enough to meet the performance target of a GT4 Iris Pro even if it clocks at ~900MHz. Problem here is bandwidth. Supporting LPDDR4X or a single HBM stack would have done wonders for Raven Ridge, but it doesn't look like it's happening.
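A back-of-envelope peak-FP32 comparison supports that (a rough sketch with assumed clocks; the ~1.05 GHz Iris Pro boost figure is my recollection, not confirmed anywhere above):

```python
# Back-of-envelope peak FP32: lanes * 2 FLOPs/FMA * clock (GHz) -> GFLOPs.
# Clocks here are assumptions for illustration, not confirmed specs.
def peak_gflops(fp32_lanes, clock_ghz):
    return fp32_lanes * 2 * clock_ghz

rr_11ncu = peak_gflops(11 * 64, 0.9)   # 11 NCUs x 64 lanes @ ~0.9 GHz -> ~1267 GFLOPs
gt4e     = peak_gflops(72 * 8, 1.05)   # 72 EUs x 8 lanes (2x SIMD-4 FMA) @ ~1.05 GHz -> ~1210 GFLOPs

print(rr_11ncu, gt4e)
```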
 
And/or bandwidth and/or geometry performance. In the article's conditions the Vega 56 and 64 have the exact same bandwidth and the same 4 geometry engines working at the same clocks.
With primitive shaders working these two factors might be less of a bottleneck.
Unless more than half the frame is spent on geometry, async should work around a geometry bottleneck. Bandwidth limits should show at higher resolutions as cache bandwidth becomes more significant. At 4K a Vega 64 should be pulling well ahead as both geometry and bandwidth become relatively less significant.
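A toy model of that reasoning, with purely illustrative numbers (not measurements from Vega): if async lets shading/compute overlap the geometry-bound portion, frame time degenerates to the max of the two, so geometry only becomes the limiter once it exceeds the shading share of the frame.

```python
# Toy bottleneck-overlap model; the millisecond figures are illustrative only.
def frame_time_ms(geometry_ms, shading_ms, overlapped):
    # With async overlap the longer of the two dominates; serially they add up.
    return max(geometry_ms, shading_ms) if overlapped else geometry_ms + shading_ms

print(frame_time_ms(4.0, 6.0, overlapped=True))   # 6.0  -> geometry (40% of frame) fully hidden
print(frame_time_ms(6.0, 4.0, overlapped=True))   # 6.0  -> geometry >50%, now the limiter
print(frame_time_ms(4.0, 6.0, overlapped=False))  # 10.0 -> no overlap, the costs add
```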

Raven Ridge is bringing up to 11 NCUs which is more than enough to meet the performance target of a GT4 Iris Pro even if it clocks at ~900MHz. Problem here is bandwidth. Supporting LPDDR4X or a single HBM stack would have done wonders for Raven Ridge, but it doesn't look like it's happening.
If the supply was there, I'd think HBM and a small-form-factor push would have been a huge opportunity for AMD. Driver work for RR on Android is still ongoing, so there may still be hope for some Chromebox and Steam Box designs.

What I'm suggesting is to stick a Nano and a 4/8-core Ryzen into a Threadripper socket, or simply embed it entirely. Push an entirely new market: a small box with a lot of external IO. Do to Intel's graphics market share what integrated graphics did to the low-end market, beyond just 11 NCUs. Take integration a step further and go for 80-90% integrated.
 
Vega 56 and 64 operate basically on the same memory bandwidth as Fiji. Hence, every byte saved by the measures put in place should be very welcome.
 
And/or bandwidth and/or geometry performance. In the article's conditions the Vega 56 and 64 have the exact same bandwidth and the same 4 geometry engines working at the same clocks.
With primitive shaders working these two factors might be less of a bottleneck.
Aside from HBM and the data fabric, there is also the same L2-L1 bandwidth and the same RBE/export behavior to the L2 and beyond.
Latencies could be generally equivalent iso-clock, and if there are internal variations with the firmware settings they don't show up in the testing.

Some of those elements might change with DSBR and workable primitive shaders. Backpressure due to RBE thrashing and conflicts with CU traffic could slow execution of wavefronts or delay their final export and releasing of resources. Less successful early culling can potentially leave the iso-clock setup pipeline and wavefront launch process more burdened as well.

If AMD's patents on how it implements binning and a tiled rasterizer reflect Vega's implementation, the front end's behavior without those features active would make the setup process longer-latency, which Vega may not be balanced for if that is the norm rather than a minority case. Since this happens in a stage whose output is generally amplified into a larger amount of pixel shader work, the level of parallelism and latency tolerance may profile differently, and there may be more specialized concerns given the interaction with dedicated paths and fixed-function blocks.
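For anyone who hasn't dug into the patents, here is a minimal sketch of what a generic two-phase binning rasterizer does (my own toy illustration, not a claim about AMD's actual DSBR): primitives are first binned to screen tiles, then each tile's bin is walked so render-target traffic for that tile can stay on chip and covered primitives can be culled per tile before pixel work is launched.

```python
# Toy two-phase binned rasterization -- illustrative only, not AMD's DSBR.
# Each triangle is assumed to carry an integer screen-space bounding box.
TILE = 32  # tile size in pixels, arbitrary for the sketch

def bin_triangles(triangles, width, height):
    """Phase 1: append each triangle to every tile its bounding box overlaps."""
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    bins = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}
    for tri in triangles:
        x0, y0, x1, y1 = tri["bbox"]
        for ty in range(max(0, y0 // TILE), min(tiles_y, y1 // TILE + 1)):
            for tx in range(max(0, x0 // TILE), min(tiles_x, x1 // TILE + 1)):
                bins[(tx, ty)].append(tri)
    return bins

def shade_binned(bins):
    """Phase 2: per tile, cull hidden primitives, then emit pixel work.
    The tile's colour/depth working set can live in on-chip storage here."""
    for (tx, ty), tris in bins.items():
        visible = [t for t in tris if not t.get("occluded", False)]  # stand-in for per-tile culling
        for tri in visible:
            pass  # rasterize and launch pixel shading for this tile only

tris = [{"bbox": (0, 0, 40, 40)}, {"bbox": (100, 100, 120, 130), "occluded": True}]
shade_binned(bin_triangles(tris, 256, 256))
```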

The latency angle has made me curious about the significantly higher wait-count limit for Vega's vector memory instructions, and whether this is a case of decisions at one level of abstraction bleeding through to another.
It doesn't seem like GCN suddenly incurred four times the memory latency, but it might matter more for the new shader variants than it does for more free-form pixel or compute shaders. The tendency for front-end work to wind up spanning fewer CUs (it runs pre-amplification, so it occupies a minority of them), the interactions with fixed-function paths, and the more complex merged/primitive shader code might place a greater premium on per-wavefront latency handling. That's admittedly speculation in the absence of knowing how the new shaders profile.

That Vega's ISA splits the latency count field the way it does may be another indication of wanting binary backwards compatibility, or perhaps like the implementation-specific triangle coverage instruction it is a sign of the ISA reflecting different scenarios (or different CU revisions?) that need to be able to ignore the new bits.
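For reference, the split as I read the Vega (GFX9) ISA document: vmcnt grew from 4 to 6 bits, with the two new bits placed at the previously unused top of the s_waitcnt immediate, so older 4-bit encodings keep their meaning. A small packing helper to illustrate (my own sketch, not from any AMD tool; field positions are per my reading of the doc):

```python
# s_waitcnt immediate layout on GFX9 as I read the ISA doc (treat as an assumption):
#   vmcnt[3:0] -> bits [3:0],  expcnt -> bits [6:4],
#   lgkmcnt    -> bits [11:8], vmcnt[5:4] -> bits [15:14].
# Encodings with vmcnt <= 15 are bit-identical to pre-Vega, hence the compatibility angle.
def s_waitcnt_simm16(vmcnt=63, expcnt=7, lgkmcnt=15):
    assert 0 <= vmcnt < 64 and 0 <= expcnt < 8 and 0 <= lgkmcnt < 16
    return ((vmcnt & 0xF)
            | (expcnt << 4)
            | (lgkmcnt << 8)
            | ((vmcnt >> 4) << 14))

print(hex(s_waitcnt_simm16(vmcnt=0)))  # wait for all outstanding vector memory ops
```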


Raven Ridge is bringing up to 11 NCUs which is more than enough to meet the performance target of a GT4 Iris Pro even if it clocks at ~900MHz. Problem here is bandwidth. Supporting LPDDR4X or a single HBM stack would have done wonders for Raven Ridge, but it doesn't look like it's happening.
The pricing and volume situation may keep this from happening going forward, given the orders-of-magnitude greater volume of the laptop market and the memory market's pricing trends. HBM seems to be hitting a point where it is acceptable to buyers willing to pay a premium, which prices it out of APUs that will likely need to hit low price points--something likely to be assumed to be the case for AMD for some time, just because that's generally what AMD gets, and it would take time to reverse that perception.
The pricing clock tends to reset with every new memory type or variation as well.
Perhaps if Raven Ridge is among the last of the monolithic APUs, future implementations can allow flexibility where now AMD would need to add costs for itself to balance uncertainties in DRAM pricing and the disparate price points it needs to hit.

To avoid cluttering the review thread, I will append a note in response to a post you made:
https://forum.beyond3d.com/posts/2001017/

3 - Also as mentioned by Raja in the same tweet, the Infinity Fabric being used in Vega 10 wasn't optimized for consumer GPUs and that also seems to be holding the GPU back (maybe by holding back the clocks at iso TDP). Why did they use IF in Vega 10? Perhaps because iterating IF in Vega 10 was an important stepping stone for optimizing the implementation for Navi or even Vega 11 and Raven Ridge. Perhaps HBCC was implemented around IF from the start. Perhaps Vega SSG doesn't have a PCIe controller for the SSDs and IF is being used to implement a PCIe controller in Vega.
I think that tweet was in reference to area more than other factors. The IF strip is a measurable amount of area.
I don't really know why it would be a limitation beyond that, given it is described as a mesh and client Vega really shouldn't be stressing it enough to cause it to be a notable limitation.

Its clock domain is constant, which likely wouldn't change for a client-optimized version for power reasons. It may also help service certain heterogeneous compute functions if its domain is used as a timekeeper.
HBCC appears to sit at the coherent-slave position noted in Zen's fabric implementation, where there is an intermediary between the links and a memory controller, although what it's tasked with shouldn't be a major limiter, since gaming needs only a small subset of it. The unused features would generally be an area cost.
IF itself doesn't implement controllers or PCIe. In Zen, the fabric interfaces with controllers that then plug into the interface and PHY.
 
What's the difference? Are you calling perf/mm^2 "architectural efficiency", leaving TDP aside?

If so, has anyone made a Vega 10 vs GP102 comparison at ISO clocks? Downclock a Titan X to, say, 1400MHz, do the same with a Vega 64, and see how they compare?

Last time I saw something like that, I think Polaris 10 actually comes very close to a GP104 at ISO clocks for core and memory.



They at least have an impact on the clocks GP100 can achieve at a given TDP compared to GP102. According to nvidia's own whitepapers, GP100's peak FP32 throughput is 10.6 TFLOPs (56 SMs @ 1480MHz) with a 300W TDP, whereas GP102 can get about 20% more at 250W. This obviously has an impact on its graphics performance.
So the answer to your question is yes: GP100's 1/2 FP64 + 2xFP16 + more cache + nvlinks etc. do in fact have a negative impact on gaming performance.
They're not responsible for decreasing IPC, they're responsible for decreasing clocks at iso TDP.
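The arithmetic behind those figures, for anyone checking (peak FP32 = cores x 2 FLOPs/FMA x boost clock; the GP102 line assumes the full 3840-core Titan Xp at its nominal boost, and sustained clocks on GP102 boards typically run higher, which is where a ~20% gap would come from):

```python
# Peak FP32 TFLOPs = FP32 cores * 2 FLOPs per FMA * boost clock (MHz) / 1e6.
# Nominal spec-sheet boost clocks; real sustained clocks differ.
def peak_tflops(cores, boost_mhz):
    return cores * 2 * boost_mhz / 1e6

gp100 = peak_tflops(56 * 64, 1480)  # ~10.6 TFLOPs at 300 W
gp102 = peak_tflops(3840, 1582)     # ~12.1 TFLOPs at 250 W (assuming Titan Xp config)
print(gp100, gp102)
```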




There's a number of reasons why Vega isn't reaching the same gaming performance as GP102 at iso TDP:

1 - GlobalFoundries' 14LPP is substantially less efficient than TSMC's 16FF+ (from the posts of experts in this forum, there's at least a 20% difference in power consumption at iso clocks; a rough translation of that into clocks is sketched after this list).

2 - As Raja confirmed 2 weeks ago, some of the features aren't implemented in the driver yet (his statement implies they will be, and so do @Rys' statements so far). Perhaps this discussion will be different when DSBR gets enabled even in automatic mode, since it'll affect both geometry performance and effective bandwidth.

3 - Also as mentioned by Raja in the same tweet, the Infinity Fabric being used in Vega 10 wasn't optimized for consumer GPUs and that also seems to be holding the GPU back (maybe by holding back the clocks at iso TDP). Why did they use IF in Vega 10? Perhaps because iterating IF in Vega 10 was an important stepping stone for optimizing the implementation for Navi or even Vega 11 and Raven Ridge. Perhaps HBCC was implemented around IF from the start. Perhaps Vega SSG doesn't have a PCIe controller for the SSDs and IF is being used to implement a PCIe controller in Vega.

4 - Compute-oriented features like 2*FP16, larger caches and HBCC prevent Vega 10 from achieving higher clocks at iso TDP, just like what happens with GP100.
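As referenced in point 1, a crude back-of-envelope for how a ~20% iso-clock power deficit could translate into clocks at iso TDP (assuming dynamic power scales roughly with f*V^2 and voltage tracks frequency near the top of the DVFS curve, i.e. power ~ f^3; real curves are usually steeper up there, so treat this as a floor):

```python
# Crude iso-TDP estimate under a power ~ f^3 assumption (illustrative only):
# a 20% power penalty at iso clocks costs roughly (1.2)^(-1/3) in frequency at iso power.
power_penalty = 1.20
clock_scale = power_penalty ** (-1.0 / 3.0)
print(round(clock_scale, 3))  # ~0.941 -> about 6% lower clocks for the same TDP
```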

GP100 has a completely different SM than GP102. The ratio of scheduling to math hardware and on-chip memory is quite different. So this comparison is not as straightforward as you'd like to make it.
 
GP100 has a completely different SM than GP102. The ratio of scheduling to math hardware and on-chip memory is quite different. So this comparison is not as straightforward as you'd like to make it.

The bulk of the conversation was about comparing Vega 10 to GP102, but you're worried about the GP100->GP102 comparison not being straightforward enough?

Regardless, as posted above this isn't a conversation for this topic.
 
What's the difference? Are you calling perf/mm^2 "architectural efficiency", leaving TDP aside?
No, I'm talking about improved compute somehow impacting graphics at the cycle level within the same family.

There is no evidence at all that the additional compute features of GP100 have a negative impact at the cycle level on graphics compared to GP102.

If so, has anyone made a Vega 10 vs GP102 comparison at ISO clocks? Downclock a Titan X to, say, 1400MHz, do the same with a Vega 64, and see how they compare?
It's a fun exercise, but not particularly relevant in this context. The trick is to make an architecture that combines best of everything. High clock speeds, low power, good performance for both games and compute workloads.

They at least have an impact on the clocks GP100 can achieve at a given TDP compared to GP102. According to nvidia's own whitepapers, GP100's peak FP32 throughput is 10.6 TFLOPs (56 SMs @ 1480MHz) with a 300W TDP, whereas GP102 can get about 20% more at 250W. This obviously has an impact on its graphics performance.

So the answer to your question is yes: GP100's 1/2 FP64 + 2xFP16 + more cache + nvlinks etc. do in fact have a negative impact on gaming performance.
They're not responsible for decreasing IPC, they're responsible for decreasing clocks at iso TDP.
Of course, TDP has an impact. But that doesn't apply to Vega vs GP102, since AMD already happily decided to give it a 350W power budget in order to not have to scale back clocks. And thus there's still no indication that Vega's disappointing performance can be explained by its compute features. It'd be different if AMD restricted Vega's clocks so that it could stay within the power envelope of GP102.

There's a number of reasons why Vega isn't reaching the same gaming performance as GP102 at iso TDP:
You give 4 reasons, but forgot the most obvious one: GCN was always a power-inefficient architecture, and one that's unable to translate its TFLOPS into graphics work, and that hasn't changed with Vega.
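The gap being pointed at is easy to see with spec-sheet numbers (nominal boost clocks, so a rough illustration rather than a measurement):

```python
# Nominal peak FP32 from spec-sheet boost clocks -- a rough illustration of the
# TFLOPS-vs-gaming-performance gap, not a measurement.
vega_64  = 4096 * 2 * 1546 / 1e6   # ~12.7 TFLOPs
gtx_1080 = 2560 * 2 * 1733 / 1e6   # ~8.9 TFLOPs
print(vega_64, gtx_1080)           # ~40% more peak FLOPs, yet broadly similar game results
```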
 
Cross-posting, since it's no longer appropriate to continue the discussion in the hardware review thread, as ToTTenTranz rightfully pointed out to me. So we should continue the topic here.

There's a number of reasons why Vega isn't reaching the same gaming performance as GP102 at iso TDP:

1 - GlobalFoundries' 14LPP is substantially less efficient than TSMC's 16FF+ (from the posts of experts in this forum, there's at least a 20% difference in power consumption at iso clocks).

2 - As Raja confirmed 2 weeks ago, some of the features aren't implemented in the driver yet (his statement implies they will be, and so do @Rys' statements so far). Perhaps this discussion will be different when DSBR gets enabled even in automatic mode, since it'll affect both geometry performance and effective bandwidth.

3 - Also as mentioned by Raja in the same tweet, the Infinity Fabric being used in Vega 10 wasn't optimized for consumer GPUs and that also seems to be holding the GPU back (maybe by holding back the clocks at iso TDP). Why did they use IF in Vega 10? Perhaps because iterating IF in Vega 10 was an important stepping stone for optimizing the implementation for Navi or even Vega 11 and Raven Ridge. Perhaps HBCC was implemented around IF from the start. Perhaps Vega SSG doesn't have a PCIe controller for the SSDs and IF is being used to implement a PCIe controller in Vega.

4 - Compute-oriented features like 2*FP16, larger caches and HBCC prevent Vega 10 from achieving higher clocks at iso TDP, just like what happens with GP100.

All those reasons sound like deliberate and willful decisions made by AMD, don't they?
 
So this just appeared on reddit:



[image: 9vOBodI.jpg]


IIRC, the Ryzen 5 2500U has the lower-end Vega GPU with 8 NCUs, and the Ryzen 7 "U" parts should get 11 NCUs. All of them have a 15W TDP.
If true, I'm left wondering how the Vega iGPU isn't getting drowned in bandwidth bottlenecks.





All those reasons sound like deliberate and willful decisions made by AMD, don't they?
AMD yes, RTG no (the GF divestment deal, for example, precedes the creation of RTG).
Point being?
 
So this just appeared on reddit:



[image: 9vOBodI.jpg]


IIRC, the Ryzen 5 2500U has the lower-end Vega GPU with 8 NCUs, and the Ryzen 7 "U" parts should get 11 NCUs. All of them have a 15W TDP.
If true, I'm left wondering how the Vega iGPU isn't getting drowned in bandwidth bottlenecks.

AMD yes, RTG no (the GF divestment deal, for example, precedes the creation of RTG).
Point being?
Going by the name, Ryzen 7 mobile should have 10 CUs, not 11.
Got a link to the reddit thread?

edit:
 
So this just appeared on reddit:



[image: 9vOBodI.jpg]


IIRC, the Ryzen 5 2500U has the lower-end Vega GPU with 8 NCUs, and the Ryzen 7 "U" parts should get 11 NCUs. All of them have a 15W TDP.
If true, I'm left wondering how the Vega iGPU isn't getting drowned in bandwidth bottlenecks.
How is Vega able to outperform Fiji with lower bandwidth?

Vega itself is built to be more memory bandwidth efficient, and it shows in the comparison between Fiji and Vega.

All that 45MB of SRAM has to be doing something :???:
 
All that 45MB of SRAM has to be doing something :???:

What SRAM? Are AMD APUs going to have SRAM like Iris Pro? Because Intel 640 has 64 MB....

Regardless, they cannot come soon enough! My Core M ZenBook is getting awfully limited now that my use case for it changed from merely browsing/office work to software development.

EDIT - Oh, you probably mean the 45 MB of cache Vega has! Although an 8CU part would probably have way less cache...
 
How is Vega able to outperform Fiji with lower bandwidth?

Vega itself is built to be more memory bandwidth efficient, and it shows in the comparison between Fiji and Vega.

All that 45MB of SRAM has to be doing something :???:

Actually, in synthetic benchmarks Vega 10 seems to get lower effective bandwidth per-clock than Fiji, at least for now.
Plus, I doubt Raven Ridge will have 45MB of SRAM. It's a 15-35W APU, not a 200-350W GPU.

Maybe these results were obtained using high-clocked DDR4, though notebook memory currently tops out at 3000MT/s (I think) and that's 48GB/s total.
In Intel's Iris Plus 640, Crystalwell's eDRAM is 50GB/s duplex (100GB/s total), plus the system's 25-30GB/s using LPDDR3 1866.
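The 48GB/s figure checks out for a dual-channel (2x64-bit) configuration at 3000MT/s, and the same arithmetic gives the LPDDR3-1866 pool mentioned above (the channel widths are my assumption):

```python
# Peak DRAM bandwidth = channels * 8 bytes per transfer (64-bit channel) * MT/s.
# Assumes a 2x64-bit configuration for both systems.
def dram_gbs(channels, mts):
    return channels * 8 * mts / 1000

print(dram_gbs(2, 3000))  # 48.0 GB/s, shared between the CPU cores and the iGPU
print(dram_gbs(2, 1866))  # ~29.9 GB/s, the LPDDR3-1866 system pool next to Crystalwell
```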
 
Despite all the derp that has taken over this forum, buying a near-MSRP 580 (about 75 over MSRP on average in my country) or a Vega is still a challenge. If AMD made desktop graphics the third cousin to the consoles and Zen (honestly, it looks like they did), it couldn't have turned out much better for them.

1. A CPU that's IPC-competitive with Intel
2. A HEDT platform that's just better than Intel's
3. If the above is in the ballpark, an APU that's going to be great for those 15-25W ultrabooks and 25-35W standard laptops
4. A GPU that at least gets them into the enthusiast space
5. A platform (CPU + GPU) that gives them a solid compute/datacentre product (Intel's going to Intel, and POWER + G[V,P]100 is an expensive platform)

Considering that AMD's R&D budget is currently less than NV's, let alone Intel's, I think there is some room for optimism for the future; revenue and R&D should increase from where they have been over the last couple of years.

edit: I just want to point out that we don't really know if Vega is bandwidth bottlenecked*; increasing the memory clock reduces latency, and that could be just as big a factor, so tasty DDR4 could go hand in hand nicely.

*it certainly isn't compared to Polaris, which gets very good perf gains from mem OC.
 
What SRAM? Are AMD APUs going to have SRAM like Iris Pro? Because Intel 640 has 64 MB....

Regardless, they cannot come soon enough! My Core M ZenBook is getting awfully limited now that my use case for it changed from merely browsing/office work to software development.

EDIT - Oh, you probably mean the 45 MB of cache Vega has! Although an 8CU part would probably have way less cache...
Yeah, I mean the cache. Vega has a surprisingly high amount of it, and most of it is still unaccounted for.
 
Actually, in synthetic benchmarks Vega 10 seems to get lower effective bandwidth per-clock than Fiji, at least for now.
Plus, I doubt Raven Ridge will have 45MB of SRAM. It's a 15-35W APU, not a 200-350W GPU.

Maybe these results were obtained using high-clocked DDR4, though notebook memory currently tops out at 3000MT/s (I think) and that's 48GB/s total.
In Intel's Iris Plus 640, Crystalwell's eDRAM is 50GB/s duplex (100GB/s total), plus the system's 25-30GB/s using LPDDR3 1866.
Well, of course it would have less mem bandwidth per clock. It's half the bus width :-|
 
Yeah, I mean the cache. Vega has a surprisingly high amount of it, and most of it is still unaccounted for.

Well, although it has a surprising amount of it, the ROPs are now using it as well (although I have no idea if that means only reading or writing too), so it's not like the increase benefits just the units that were using the L2 cache already; they have to share the benefits across more units.
 
Despite all the derp that has taken over this forum, buying a near-MSRP 580 (about 75 over MSRP on average in my country) or a Vega is still a challenge. If AMD made desktop graphics the third cousin to the consoles and Zen (honestly, it looks like they did), it couldn't have turned out much better for them.

1. A CPU that's IPC-competitive with Intel
2. A HEDT platform that's just better than Intel's
3. If the above is in the ballpark, an APU that's going to be great for those 15-25W ultrabooks and 25-35W standard laptops
4. A GPU that at least gets them into the enthusiast space
5. A platform (CPU + GPU) that gives them a solid compute/datacentre product (Intel's going to Intel, and POWER + G[V,P]100 is an expensive platform)

Considering that AMD's R&D budget is currently less than NV's, let alone Intel's, I think there is some room for optimism for the future; revenue and R&D should increase from where they have been over the last couple of years.

edit: I just want to point out that we don't really know if Vega is bandwidth bottlenecked*; increasing the memory clock reduces latency, and that could be just as big a factor, so tasty DDR4 could go hand in hand nicely.

*it certainly isn't compared to Polaris, which gets very good perf gains from mem OC.

Sure. But if you're an enthusiast looking for a good value the purchasing experience around Vega is frustrating and that's NOT good for AMD and, even though they at least have a product offering now, they are still effectively absent from this market for many people. If I had confidence that there would be custom Vega 56s readily available @ $399 within a reasonable period of time I would have considered waiting to purchase one of those over the custom GTX 1080 I actually purchased @ $499.

Clearly, there are people who will take any and all opportunities to present AMD's efforts in a negative light (and the reverse, of course), but this situation is not great for them, even if it does benefit them financially in the short term.
 