AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

How do you read it like that? The support is (will be) there if the dev decides to build the game using them
Right. Just as many games as supported that 3D sound shit AMD introduced with R9 290 series cards, which is zero for all intents and purposes.

With AMD's minuscule and shrinking marketshare, any feature requiring Vega hardware-specific coding is going to be dead on arrival.
 
Because it was supposed to be...

I guess it doesn't make much sense to spend resources on that now because few gamers have Vega cards and no gamer is buying AMD cards to play games, much less Vega chips.

It does bother me that driver development for games is seemingly decelerating, but high-end PC gaming as a whole is actually dying, and it might die really fast.
I wonder what will happen to PC game sales after a year of severe drought of performance graphics cards on the shelves.


The IHVs need to come up with a solution fast. AMD needs to move up the launch of high-performance gaming APUs with HBM as much as they can.
These high-performance APUs need Primitive Shaders!
That's the problem.
 
No, it would only imply that IF has a certain amount of overhead that only starts to amortize beyond the current count and requirements of its clients.
What clients though? The vast majority of the bandwidth is between the CUs and HBM2, which applies just as much to MI25 or one of the server products. There may be some overhead, but that shouldn't be making the chip significantly larger than otherwise necessary. There don't appear to be any indications Vega is scaling past 64 CUs either to add more clients for IF. There's just no reason for it in the current incarnation. None of the clients listed in the Hot Chips presentation should have significantly large requirements for IF.

AMD said on multiple occasions that DSBR is only useful for SKUs with limited resources, and now their own testing confirms it. At Ultra settings, most games gain about 5% or less, and likely in very specific scenarios too. And now with the cancellation of driver primitive shaders, I think it's time you abandon your theory of 30% more performance than a TitanXP through unicorn drivers. It was never a good theory to begin with. The writing was on the wall that Vega was missing several things when the features were never enabled at launch or even a few months after.
So situations where it's GPU limited? Absolutely shocking that reducing GPU load would increase performance when limited by the GPU. Almost as if bandwidth wasn't a limiting factor for Vega. With AMD's limited FLOPs holding them back and all. That might have been one of the dumbest comments I've seen in a while.

My prediction still looks rather accurate, even if that primitive shader news holds up, as we've already seen a Vega around Titan performance with limited implementation. RPM usage is still a bit hit or miss and the DSBR state uncertain. No need for unicorn drivers, just need a dev to implement all the features. Something they are already doing, especially with Vega poised to hold a significant chunk of the graphics market. Leaving Nvidia a distant third in marketshare.

These high-performance APUs need Primitive Shaders!
That's the problem.
The alternative to primitive shaders was using a compute shader asynchronously to perform the culling, ideally with RPM/FP16, so the front end has little if any culling to perform. It might even be faster. Wolfenstein, I believe, was doing something similar, but it was disabled in some benchmarks due to the FP16 issues, to my understanding. At least the last time anyone looked at it. It might be a better implementation for mGPU workload distribution as well.
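For illustration, here's a rough CPU-side sketch (Python/NumPy) of the kind of pre-pass such a compute shader would do: backface plus trivial frustum rejection that emits a compacted index buffer for the traditional pipeline. The tests, winding convention, and buffer layout are just assumptions for the sketch, not Wolfenstein's actual implementation.

```python
import numpy as np

def cull_triangles(positions, indices, view_proj):
    """Reference for a compute-shader culling pre-pass: take an indexed
    triangle list, reject back-facing and clearly off-screen triangles,
    and return a compacted index buffer for the traditional pipeline."""
    tris = indices.reshape(-1, 3)
    # Project vertices to clip space (positions are Nx3, promoted to Nx4).
    clip = np.hstack([positions, np.ones((len(positions), 1))]) @ view_proj.T
    ndc = clip[:, :3] / clip[:, 3:4]                      # perspective divide
    v0, v1, v2 = ndc[tris[:, 0]], ndc[tris[:, 1]], ndc[tris[:, 2]]
    # Backface test: signed area of the projected triangle
    # (assumes counter-clockwise front faces after projection).
    area = (v1[:, 0] - v0[:, 0]) * (v2[:, 1] - v0[:, 1]) \
         - (v2[:, 0] - v0[:, 0]) * (v1[:, 1] - v0[:, 1])
    facing = area > 0.0
    # Trivial frustum reject: all three vertices outside the same NDC bound.
    outside = np.zeros(len(tris), dtype=bool)
    for axis in range(3):
        outside |= (v0[:, axis] < -1) & (v1[:, axis] < -1) & (v2[:, axis] < -1)
        outside |= (v0[:, axis] >  1) & (v1[:, axis] >  1) & (v2[:, axis] >  1)
    keep = facing & ~outside
    return tris[keep].ravel()                             # compacted index buffer
```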
 
With async compute, having an execution gap between the primitive pipeline and fragment shaders is less of an issue.
My impression was there's still a direct feed from the primitive pipeline into the pixel shader stage. I'm not sure where the compute shader would interject.

That's possible, but wouldn't necessarily explain AMD not attaching a similar chip to Ryzen.
AMD's offering a discrete mobile Vega that appears roughly in the same range as the Intel custom chip.
Intel's EMIB implementation is riding a curve of manufacturing grunt, volume, and profitability that AMD is not positioned to match.

An 8 core Ryzen with 32 CU Vega, HBM2, and a big 120mm cooler would dominate right now, in part because discrete parts have become scarce.
It would be a niche product, which AMD seems to be de-prioritizing in favor of lower-hanging fruit in the form of Ryzen in its 8-core form or existing MCM CPU products. An APU would take on a significant portion of the up-front cost of a new package format while the GPU tied to an 8-core makes it less desirable for most markets either the GPU or CPU would otherwise be sold to.
Commercial customers would be less likely to care for the overkill GPU, while the GPU is rather low-tier for rigs that might want an 8-core. The larger amount of silicon would compromise its power characteristics for mobile.

The niche it might fit in is dominated by Intel, and with the custom product AMD is making money in that niche with Intel taking on most of the hassle.

Not counter as much as attacking the problem from different angles. VGPR spilling technically allows the larger register file size, just with unacceptable performance in most cases. The virtual RF would address that with a renaming and paging mechanism that should be transparent to the shader or DSBR model.
Indexing doesn't spill registers, and the overall register file is not really growing much or possibly shrinking if the virtual register scheme is implemented.
The DSBR doesn't have visibility on register addressing or spilling anyway, it's not in that part of the chip.

It would be transparent to the original design as it would be on par with simply providing a larger cache or register file and relaxing the bin size requirements.
Not if the sizing is dominated by other resource limits and fixed-function pipeline granularity, which the various patches indicate is the case.
Register usage is a CU occupancy constraint, which is highly variable with the primitive count, surface format, and sample count that the binning logic seems concerned with. Whether a bin is larger or smaller doesn't strongly correlate with the CU occupancy level, all else being equal. One bin needing X wavefronts would, if subdivided into N bins, generally give bins each needing X/N wavefronts--barring potentially redundant work for items spanning bins.

Extra space in the form of larger/additional PHYs with internal routing for growing the network like Epyc. 32 PCIe lanes on a gaming part would be largely wasted, but practical on SSG, duo, or APU if using the same part.
We have pictures of Vega, which don't seem to show extra IO (edit: besides some SATA, probably). PHYs don't do internal networking.

How do you read it like that? The support is (will be) there if the dev decides to build the game using them; it's just not the automatic conversion from vertex+geometry or whatever that was first advertised
Perhaps AMD would be publishing documentation on them, as well as what it would take to expose them.
Part of the earlier discussion of the internal driver path was that AMD hadn't figured out how it could give devs the chance to use them, and it's ominous if the company that best knows how to wrangle the architecture reversed its position after other engineers indicated it was a serious pain to use (for a driver that was supposedly almost always good at generating primitive shaders???).
 
No need for unicorn drivers, just need a dev to implement all the features.
So now no need for unicorn drivers? We just need a unicorn developer to code stuff for a GPU that pretty much has near zero marketshare?

especially with Vega poised to hold a significant chunk of the graphics market. Leaving Nvidia a distant third in marketshare.
Poised and have are two different things. Historically, APUs have never captured a great deal of the important marketshare that developers cater to. As of right now NVIDIA holds the majority of dGPUs. AMD's share has dwindled to a historically dangerous level due to a combination of uncompetitive high-end solutions and the mining craze. It's not hard to see who is in the distant third right now and in the future.

as we've already seen a Vega around Titan performance with limited implementation.
We've also seen a Vega around a 1070 several times with no effort whatsoever! So does that make GP104 the future king of GPUs, according to your logic?

My prediction still looks rather accurate
At this point, this is not a prediction, just wishful thinking wrapped into denial.
 
AMD reverted to the original story. I distinctly remember an AMD representative saying these features are something that need to be specifically implemented in games. Then the story somehow changed to driver magic. Now we are back at the beginning.
 
It's good to see Raja gone.

No, it's sad to see him gone.

Engineering is hard and a product can fail for a variety of reasons.

For me, firing (or not trying to stop from leaving) a high-profile, talented technical person when things go bad rather indicates management acting on impulse. Or pressure from even higher management/shareholders.
It would have been equally logical for AMD to encourage Raja to continue, as now he would have had a clear reference point and should be much more able to apply his experience to concrete improvements.
 
What clients though? The vast majority of the bandwidth is between the CUs and HBM2, which applies just as much to MI25 or one of the server products. There may be some overhead, but that shouldn't be making the chip significantly larger than otherwise necessary. There don't appear to be any indications Vega is scaling past 64 CUs either to add more clients for IF. There's just no reason for it in the current incarnation. None of the clients listed in the Hot Chips presentation should have significantly large requirements for IF.
You're asking the right questions. Exactly because none of the clients listed in the Hot Chips presentation has a very large bandwidth requirement, IF is overbuilt. IMHO, that's what the argument started with, and now we have reached common ground answering that question.
 
You're asking the right questions. Exactly because none of the clients listed in the Hot Chips presentation has a very large bandwidth requirement, IF is overbuilt. IMHO, that's what the argument started with, and now we have reached common ground answering that question.

For IF, a client in this case would be any endpoint. One memory controller would have a coherent slave, which interfaces with the fabric and counts as a client. How many memory controllers there are isn't clear, but the fabric's bandwidth is set to be equal to the memory controllers'.
It seems like the GPU's L2-L1 domain is not a direct client, but there's a diagram with 16 L2 slices touching the fabric.

That's 512 GB/s of memory controller bandwidth and 16 L2 clients that at least traditionally were able to sink something on the order of 32 bytes of bandwidth each cycle.
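A quick back-of-the-envelope check of those figures; the HBM2 data rate and core clock below are assumed round numbers for the sketch, not official spec.

```python
# Back-of-the-envelope check of the bandwidth figures above.
hbm2_bus_bits   = 2048          # two HBM2 stacks, 1024-bit each
hbm2_data_rate  = 2.0e9         # assumed ~2 Gbps per pin (Vega 64 ships slightly below)
mem_bw = hbm2_bus_bits * hbm2_data_rate / 8 / 1e9
print(f"memory controller bandwidth ~ {mem_bw:.0f} GB/s")    # ~512 GB/s

l2_slices       = 16
bytes_per_clock = 32            # traditional per-slice sink rate
core_clock      = 1.5e9         # assumed ~1.5 GHz GPU clock
l2_bw = l2_slices * bytes_per_clock * core_clock / 1e9
print(f"aggregate L2 slice bandwidth ~ {l2_bw:.0f} GB/s")     # ~768 GB/s
```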


Besides that, however, the marketing slides for Vega 10 had the IF domain shaded in as a strip between the ROPs and HBCC. It's likely bigger than traditional interconnects, but removing it entirely wouldn't take off that much area for Vega.
 
Poised and have are two different things. Historically, APUs have never captured a great deal of the important marketshare that developers cater to. As of right now NVIDIA holds the majority of dGPUs. AMD's share has dwindled to a historically dangerous level due to a combination of uncompetitive high-end solutions and the mining craze. It's not hard to see who is in the distant third right now and in the future.
So consoles aren't important nor catered to by game developers? NVIDIA's "majority" of dGPUs was something along the lines of 15% of the graphics marketshare last I checked. AMD's share will dwindle as the move towards APUs and integrated designs continues, per that discussion we had a year or so ago. We've seen Raven with better performance, Intel added Vega graphics as well that are advertised as VR capable, and the Nintendo Switch seems rather popular with its typically less-than-dGPU level of performance. With 580/1060 mid-range marketshare shifting towards APU designs those dGPU sales will drop. Even Intel is getting in on that action.

At this point, this is not a prediction, just wishful thinking wrapped into denial.
Not really wishful, we've already seen parity without everything fully enabled in well optimized titles where the cards run close to max. So in this case "denial" is just common sense unless the new features slow things down or DX12/Vulkan titles are optimized horribly. The potential FP16 performance is well above 1080ti and that feature is attractive if going towards that low-power APU market with lots of users.

For me, firing (or not trying to stop from leaving) a high-profile, talented technical person when things go bad rather indicates management acting on impulse. Or pressure from even higher management/shareholders.
It would have been equally logical for AMD to encourage Raja to continue, as now he would have had a clear reference point and should be much more able to apply his experience to concrete improvements.
That can be a tricky situation. It can be preferable for all involved if the head steps aside during any reorganization regardless of performance.

You're asking the right questions. Exactly because none of the clients listed in the Hot Chips presentation has a very large bandwidth requirement, IF is overbuilt. IMHO, that's what the argument started with, and now we have reached common ground answering that question.
None of the clients normally have large requirements, but some (xDMA for example) could. IF is scalable, so there is no reason to really overbuild it unless there exists a situation where it may be necessary. Back to the xDMA part, that would be the Crossfire connection to other cards over PCIe. Typically limited to 16x PCIe, making it larger would make sense and still be covered in the Hot Chips presentation, just not drawn to scale. That would nominally be 16GB/s, doubled to 32GB/s if a second link or PCIe 4 was added (IF on Epyc runs above PCIe spec), and perhaps even >64/128GB/s if designed to extend Epyc's fabric. It's conceivable that xDMA could be on the level of a memory controller. That client may also be switching between multiple IO devices, including video encode/decode.
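For reference, rough per-direction PCIe bandwidth for those xDMA scenarios; the link widths and generations are taken from the post above, and effective rates would be lower after protocol overhead.

```python
# Rough per-direction PCIe bandwidth for the xDMA scenarios mentioned above.
def pcie_gbps(lanes, gen):
    per_lane = {3: 0.985, 4: 1.969}[gen]   # GB/s per lane after 128b/130b encoding
    return lanes * per_lane

print(f"PCIe 3.0 x16: {pcie_gbps(16, 3):.1f} GB/s")   # ~15.8 GB/s
print(f"PCIe 4.0 x16: {pcie_gbps(16, 4):.1f} GB/s")   # ~31.5 GB/s
print(f"PCIe 4.0 x32: {pcie_gbps(32, 4):.1f} GB/s")   # ~63.0 GB/s
```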
 
So consoles aren't important nor catered to by game developers?
Nope, consoles are not that important for the "PC" gaming ecosystem. Almost all titles already come with NVIDIA-optimized code for PC despite the console heritage, which enables NVIDIA GPUs to run them better, further alienating AMD GPUs in the PC space, making them less desirable, impacting R&D, impacting competitiveness, and eventually it might lead AMD to lose even the console market (like they lost the Switch). It's a cycle of cause and effect.
DX12/Vulkan titles are optimized horribly.
We are already seeing 70% of DX12 titles being optimized horribly.
The potential FP16 performance is well above 1080ti and that feature is attractive if going towards that low-power APU market with lots of users.
Not according to developers invested in the matter (like DICE). Right now the feature is in limited use (being only available on the PS4 Pro), and even when in use it accelerates parts of the code by 30%, which means the overall impact is probably less than 10% depending on the code in question. Not every feature should be blown out of proportion like it's the second coming. How many chips have FP16 enabled again?
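To put numbers on that, a quick Amdahl-style estimate, assuming (as above) the accelerated portions run ~30% faster; the fractions of frame time are just illustrative.

```python
# Amdahl-style estimate of the overall frame-time win from RPM/FP16,
# assuming the accelerated portions run ~30% faster (local_speedup = 1.3).
def overall_speedup(fraction_accelerated, local_speedup=1.3):
    return 1.0 / ((1.0 - fraction_accelerated) + fraction_accelerated / local_speedup)

for f in (0.2, 0.3, 0.5):
    print(f"{f:.0%} of frame time accelerated -> {overall_speedup(f) - 1:.1%} overall")
# 20% -> ~4.8%, 30% -> ~7.4%, 50% -> ~13.0%
```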
we've already seen parity without everything fully enabled in well optimized titles where the cards run close to max.
Actually, we've seen huge disparity more than parity. The concept of "well optimized titles" can be played by both sides; the number of games where a 1080 far exceeds the Vega 64 is higher than the one or two cases where Vega 64 achieves results close to a 1080Ti (let alone the fully enabled TitanXP).
With 580/1060 mid-range marketshare shifting towards APU designs those dGPU sales will drop.
Intel's Kaby Lake G is nowhere near the 1060/580 level, not unless you believe some botched and skewed Intel marketing numbers. And Intel is only doing it for a niche market. AMD seems reluctant to even approach that level any time soon, and by the time they decide to, the midrange will have moved on to an even higher level. Rinse and repeat. I won't go into the economics of this again, not when both Intel and AMD don't see it as a feasible option just yet.

Your point is basically this: AMD will convince developers to use some Vega features because they might gain marketshare because of APUs? How long is this going to take? 2 years? 3 Years? 5? By then it won't matter anymore, the landscape will change, the competition will catch on, and these features might not even exist in the new generations. Volta is here in 2018 already, while AMD is having a hiatus for another 1.5 Year at the minimum. How will this play into AMD capability of competing?
 
My impression was there's still a direct feed from the primitive pipeline into the pixel shader stage. I'm not sure where the compute shader would interject.
The compute shader wouldn't be injected; it would generate a coherent stream of culled geometry or indices prior to the traditional pipeline. The 4 SEs can still only rasterize so much geometry, with the high primitive shader geometry rate applying to culling. The compute variation should also work on Nvidia hardware.

AMD's offering a discrete mobile Vega that appears roughly in the same range as the Intel custom chip.
Intel's EMIB implementation is riding a curve of manufacturing grunt, volume, and profitability that AMD is not positioned to match.
Which chip was that? What I recall was Mobile Ryzen with up to a dozen CUs and KabyG's up to 24CUs. While similar performance could be possible based on power envelopes, I wouldn't consider those in direct competition. I haven't seen any >24CU Vega variants besides Vega56/64 and possibly a rumored 32CU Nano. I'd agree AMD can't match Intel's scale there, but Ryzen APUs would be well positioned for affordable NUCs, Chromebook/box, etc that wouldn't require thin and low powered designs.

It would be a niche product
...
The niche it might fit in is dominated by Intel, and with the custom product AMD is making money in that niche with Intel taking on most of the hassle.
Perhaps, but it may be the only option with current supplies, with high RAM prices and GPUs near impossible to find. For OEMs the smaller form factors would be more popular as the market moves away from larger desktops. Inevitably AMD needs to get into that market. They could cede the higher-end mobile/ultra-thin market and still be positioned with moderately sized NUCs.

Indexing doesn't spill registers, and the overall register file is not really growing much or possibly shrinking if the virtual register scheme is implemented.
The DSBR doesn't have visibility on register addressing or spilling anyway, it's not in that part of the chip.
We're not disagreeing here. I'm stating that some scheduling constraints could be relaxed and larger, more complex shaders scheduled with acceptable performance, avoiding situations where a spike in VGPR usage at some point in a shader's execution consumes most of the registers. The virtual registers would allow those inactive registers to be paged, discarded, etc. so more work could be dispatched. VGPR scheduling requirements change temporally within a shader.

Not if the sizing is dominated by other resource limits and fixed-function pipeline granularity, which the various patches indicate is the case.
Register usage is a CU occupancy constraint, which is highly variable with the primitive count, surface format, and sample count that the binning logic seems concerned with. Whether a bin is larger or smaller doesn't strongly correlate with the CU occupancy level, all else being equal. One bin needing X wavefronts would, if subdivided into N bins, generally give bins each needing X/N wavefronts--barring potentially redundant work for items spanning bins.
I'm suggesting a bin would be a fixed, consistent number of wavefronts with a variable amount of geometry and long-running shaders. Shift the pipeline entirely into one CU with bin-sized screen-space dimensions. The occupancy constraint would shift as waves could be oversubscribed with the paging mechanism. A more compute-oriented pipeline than the traditional fixed function, one that could be better distributed. Primitive count would be variable, with bin dimensions determined by the ROP cache. Have a CU focus on one section of screen space until all geometry has been rendered there, removing the global synchronization, with primitives only being ordered within the scope of a CU/bin.
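To make the binning idea concrete, here's a toy sketch of assigning triangles to fixed-size screen-space bins so each bin's geometry could be processed in one place; the bin size, resolution, and data layout are purely illustrative, not Vega's actual parameters.

```python
import numpy as np

# Toy sketch of the screen-space binning step described above: assign each
# triangle's screen-space bounding box to fixed-size bins, so a batch of
# geometry can be processed one bin (one section of screen space) at a time.
def bin_triangles(screen_tris, width, height, bin_size=32):
    bins_x = (width + bin_size - 1) // bin_size
    bins_y = (height + bin_size - 1) // bin_size
    bins = {}
    for tri_id, tri in enumerate(screen_tris):               # tri is 3x2 (x, y) pixels
        x0, y0 = np.floor(tri.min(axis=0) / bin_size).astype(int)
        x1, y1 = np.floor(tri.max(axis=0) / bin_size).astype(int)
        for by in range(max(y0, 0), min(y1, bins_y - 1) + 1):
            for bx in range(max(x0, 0), min(x1, bins_x - 1) + 1):
                bins.setdefault((bx, by), []).append(tri_id)  # triangles spanning
    return bins                                               # bins get duplicated
```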

We have pictures of Vega, which don't seem to show extra IO (edit: besides some SATA, probably). PHYs don't do internal networking.
I wasn't suggesting the PHYs for internal networking, but tying in external adapters that may only be present on server parts. Sized for 20 PCIe lanes instead of 16 or perhaps larger. If we can get a good die shot it might be interesting to compare Vega and Epyc PHYs to see if they are similarly sized.
 
A slide from the CES techday:


Source:
https://www.forum-3dcenter.org/vbulletin/showthread.php?p=11612392#post11612392
 
The compute shader wouldn't be injected; it would generate a coherent stream of culled geometry or indices prior to the traditional pipeline. The 4 SEs can still only rasterize so much geometry, with the high primitive shader geometry rate applying to culling.
That would be able to avoid the insertion issue, since that is a different spot than was originally stated.

Which chip was that? What I recall was Mobile Ryzen with up to a dozen CUs and KabyG's up to 24CUs. While similar performance could be possible based on power envelopes, I wouldn't consider those in direct competition. I haven't seen any >24CU Vega variants besides Vega56/64 and possibly a rumored 32CU Nano. I'd agree AMD can't match Intel's scale there, but Ryzen APUs would be well positioned for affordable NUCs, Chromebook/box, etc that wouldn't require thin and low powered designs.
Radeon Vega Mobile was announced at CES. A discrete Vega with 1 stack of HBM2 and 28 CUs.
It should link to a CPU over PCIe.
The Intel package is a 24 CU solution using EMIB to link to its one stack of HBM2. The GPU links to the CPU with an 8x PCIe link.
It's somewhat in the same realm, although the discrete could benefit from better cooling given the extra space it has--which is the extra space that makes it less compact than the Intel solution.

As a discrete with no obligation to be mated to a specific CPU in a package, it has more flexibility for what it can be plugged into and would have required less up-front investment as well.

We're not disagreeing here. I'm stating that some scheduling constraints could be relaxed and larger, more complex shaders scheduled with acceptable performance.
Then perhaps I did not interpret the statement I was replying to as it was intended.
I would characterize the situation as being more stark than relaxing scheduling constraints. The shaders' register allocations are defined as worst-case and are a fixed amount. The current situation isn't that they can be scheduled together with better performance so much as sufficiently large worst-case allocations prevent more than a few from being scheduled at all.
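To illustrate how stark it gets, a small GCN-style occupancy calculation: 256 VGPRs per SIMD and a hardware cap of 10 waves, with the allocation granularity of 4 VGPRs being an assumption for the sketch.

```python
# GCN-style occupancy check: a wave's worst-case VGPR allocation caps how many
# waves a SIMD can host (256 VGPRs per SIMD, hardware limit of 10 waves).
def waves_per_simd(vgprs_per_wave, simd_vgprs=256, max_waves=10, granularity=4):
    alloc = ((vgprs_per_wave + granularity - 1) // granularity) * granularity
    return min(max_waves, simd_vgprs // alloc)

for v in (24, 64, 96, 128, 168):
    print(f"{v:3d} VGPRs per wave -> {waves_per_simd(v)} waves per SIMD")
# 24 -> 10, 64 -> 4, 96 -> 2, 128 -> 2, 168 -> 1
```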

I'm suggesting a bin would be a fixed, consistent number of wavefronts with a variable amount of geometry and long-running shaders. Shift the pipeline entirely into one CU with bin-sized screen-space dimensions. The occupancy constraint would shift as waves could be oversubscribed with the paging mechanism.
This seems to be assuming an additional change to the architecture, since waves are capped at 10 per SIMD. Relaxing register occupancy only goes so far.

Have a CU focus on one section of screen space until all geometry has been rendered there, removing the global synchronization, with primitives only being ordered within the scope of a CU/bin.
That's likely to run into more complex trade-offs. There's a potentially somewhat stable working set of opaque geometry that may give somewhat similar wavefront counts, or at least with limited variation between adjacent batches assuming similar coverage and overdraw with the DSBR's shade-once fully in effect. The forward progress guarantees for the virtual register file would likely only give one primary wavefront per SIMD a guarantee of progress if register spills start to spike. More complex regions of screen space may do better spread over more CUs, as they can give more wavefronts forward progress guarantees while spreading any spill storms across more register files and caches.
The DSBR is already rather serializing, so committing a batch to one CU and letting one CU's limitations exert back pressure may be constraining.
 