AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Discussion in 'Architecture and Products' started by ToTTenTranz, Sep 20, 2016.

  1. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,173
    Location:
    La-la land
    Right. Just as many games as supported that 3D sound shit AMD introduced with R9 290 series cards, which is zero for all intents and purposes.

    With AMD's minuscule and shrinking marketshare, any feature requiring Vega hardware-specific coding is going to be dead on arrival.
     
    xpea likes this.
  2. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    547
    Likes Received:
    247
    These high-performance APUs need Primitive Shaders!
    That's the problem.
     
  3. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,289
    Likes Received:
    4,871
    They'd need working drivers first...
     
  4. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    What clients though? The vast majority of the bandwidth is between the CUs and HBM2, and that applies just as much to MI25 or one of the server products. There may be some overhead, but that shouldn't be making the chip significantly larger than otherwise necessary. There don't appear to be any indications Vega is scaling past 64 CUs either to add more clients for IF. There's just no reason for it in the current incarnation. None of the clients listed in the Hot Chips presentation should have significantly large requirements for IF.

    So in situations that are GPU limited? Absolutely shocking that reducing GPU load would increase performance when limited by the GPU. Almost as if bandwidth wasn't a limiting factor for Vega, with AMD's limited FLOPs holding them back and all. That might have been one of the dumbest comments I've seen in a while.

    My prediction still looks rather accurate, even if that primitive shader news holds up, as we've already seen a Vega around Titan performance with a limited implementation. RPM usage is still a bit hit or miss and the DSBR state is uncertain. No need for unicorn drivers, just a dev to implement all the features, something they are already doing, especially with Vega poised to hold a significant chunk of the graphics market, leaving Nvidia a distant third in marketshare.

    The alternative to primitive shaders was using a compute shader asynchronously to perform the culling, ideally with RPM/FP16, so the front end has little if any culling left to perform. Might even be faster. Wolfenstein, I believe, was doing something similar, but to my understanding it was disabled in some benchmarks due to the FP16 issues, at least the last time anyone looked at it. Might be a better implementation for mGPU workload distribution as well.
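    To make the idea concrete, here's a rough CPU-side sketch of what such a culling pre-pass could look like; the function name and the simple back-face/zero-area test are mine for illustration, not anything AMD or the Wolfenstein devs have published. On the GPU this would be a parallel filter-and-compact over the index buffer.

    Code:
    # Hypothetical culling pre-pass, shown in plain Python for clarity. A real
    # version would run as a compute shader that tests triangles in parallel and
    # writes a compacted index buffer for the traditional pipeline to consume.
    def cull_and_compact(positions, indices):
        """positions: (x, y, w) clip-space vertices; indices: flat triangle list."""
        out = []
        for t in range(0, len(indices), 3):
            i0, i1, i2 = indices[t], indices[t + 1], indices[t + 2]
            (x0, y0, w0), (x1, y1, w1), (x2, y2, w2) = (
                positions[i0], positions[i1], positions[i2])
            # Signed area after the perspective divide (assumes w > 0 here; a
            # real pass would also clip/guard against w <= 0).
            ax, ay = x0 / w0, y0 / w0
            bx, by = x1 / w1, y1 / w1
            cx, cy = x2 / w2, y2 / w2
            area = (bx - ax) * (cy - ay) - (cx - ax) * (by - ay)
            if area <= 0.0:           # back-facing (CCW front convention) or degenerate: cull
                continue
            out.extend((i0, i1, i2))  # survivor goes to the compacted stream
        return out

    # One front-facing and one back-facing triangle over the same vertices.
    verts = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
    print(cull_and_compact(verts, [0, 1, 2, 0, 2, 1]))  # -> [0, 1, 2]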
     
  5. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,170
    Likes Received:
    3,069
    Location:
    Well within 3d
    My impression was there's still a direct feed from the primitive pipeline into the pixel shader stage. I'm not sure where the compute shader would interject.

    AMD's offering a discrete mobile Vega that appears roughly in the same range as the Intel custom chip.
    Intel's EMIB implementation is riding a curve of manufacturing grunt, volume, and profitability that AMD is not positioned to match.

    It would be a niche product, which AMD seems to be de-prioritizing in favor of lower-hanging fruit in the form of Ryzen in its 8-core form or existing MCM CPU products. An APU would take on a significant portion of the up-front cost of a new package format, while tying the GPU to an 8-core makes it less desirable for most markets either the GPU or CPU would otherwise be sold to.
    Commercial customers would be less likely to care for the overkill GPU, while the GPU is rather low-tier for rigs that might want an 8-core. The larger amount of silicon would compromise its power characteristics for mobile.

    The niche it might fit in is dominated by Intel, and with the custom product AMD is making money in that niche with Intel taking on most of the hassle.

    Indexing doesn't spill registers, and the overall register file is not really growing much or possibly shrinking if the virtual register scheme is implemented.
    The DSBR doesn't have visibility on register addressing or spilling anyway, it's not in that part of the chip.

    Not if the sizing is dominated by other resource limits and fixed-function pipeline granularity, which the various patches indicate is the case.
    Register usage is a CU occupancy constraint, and it is highly variable relative to the primitive count, surface format, and sample count that the binning logic seems concerned with. Whether a bin is larger or smaller doesn't strongly correlate with the CU occupancy level, all else being equal. One bin needing X wavefronts would, if subdivided into N, generally give N bins each needing X/N wavefronts--barring potentially redundant work for items spanning bins.
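    As a toy numerical check of that point (numbers are mine, purely illustrative):

    Code:
    # Subdividing one bin that needs X wavefronts into N sub-bins leaves the
    # aggregate demand unchanged; only the slicing differs (ignoring redundant
    # work for primitives that span bin boundaries).
    X, N = 16, 4                 # hypothetical: 16 wavefronts, split 4 ways
    per_bin = X / N              # 4 wavefronts per sub-bin
    print(per_bin, N * per_bin)  # -> 4.0 16.0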

    We have pictures of Vega, which don't seem to show extra IO (edits: besides some SATA, probably). PHYs don't do internal networking.

    Perhaps AMD would be publishing documentation on them, as well as what it would take to expose them.
    Part of the earlier discussion of the internal driver path was that AMD hadn't figured out how it could give devs the chance to use them, and it's ominous if the company that best knows how to wrangle the architecture reversed its position after other engineers indicated it was a serious pain to use (for a driver that was supposedly almost always good at generating primitive shaders???).
     
    pharma likes this.
  6. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,986
    Likes Received:
    2,913
    So now no need for unicorn drivers? We just need a unicorn developer to code stuff for a GPU that pretty much has near zero marketshare?

    Poised and have are two different things; historically, APUs have never captured a great deal of the important marketshare that developers cater for. As of right now NVIDIA holds the majority of dGPUs. AMD's share has dwindled to a historically dangerous level due to a combination of uncompetitive high-end solutions and the mining craze. It's not hard to see who is the distant third right now and in the future.

    We've also seen a Vega around a 1070 several times with no effort whatsoever! So does that make the GP104 the future king of GPUs according to your logic, right?

    At this point, this is not a prediction, just wishful thinking wrapped into denial.
     
    Picao84 and A1xLLcqAgt0qc2RyMz0y like this.
  7. Dayman1225

    Newcomer

    Joined:
    Sep 9, 2017
    Messages:
    58
    Likes Received:
    84
  8. yuri

    Newcomer

    Joined:
    Jun 2, 2010
    Messages:
    190
    Likes Received:
    162
    Paper dragons and shiny slides. That's really a shame. It's good to see Raja gone.
     
    xpea likes this.
  9. manux

    Veteran Regular

    Joined:
    Sep 7, 2002
    Messages:
    1,586
    Likes Received:
    416
    Location:
    Earth
    AMD reverted to the original story. I distinctly remember an AMD representative saying these features are something that needs to be specifically implemented in games. Then the story somehow changed to driver magic. Now we are back at the beginning.
     
    Picao84 likes this.
  10. entity279

    Veteran Regular Subscriber

    Joined:
    May 12, 2008
    Messages:
    1,236
    Likes Received:
    428
    Location:
    Romania
    No, it's sad to see him gone.

    Engineering is hard and a product can fail for a variety of reasons.

    For me, firing (or not trying to stop from leaving) a high-profile, talented technical person when things go bad rather indicates management acting on impulse, or pressure from even higher management/shareholders.
    It would have been equally logical for AMD to encourage Raja to continue, as now he would have had a clear reference point and should be much more able to apply his experience to concrete improvements.
     
    BRiT likes this.
  11. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,817
    Likes Received:
    2,098
    Location:
    Germany
    You're asking the right questions. Exactly because none of the clients listed in the Hot Chips presentation has a very large bandwidth requirement, IF is overbuilt. IMHO, that's what the argument started with, and now we have reached common ground in answering that question.
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,170
    Likes Received:
    3,069
    Location:
    Well within 3d
    For IF, a client in this case would be any endpoint. One memory controller would have a coherent slave, which interfaces with the fabric and counts as a client. How many memory controllers there are isn't clear, but the fabric's bandwidth is set to equal that of the memory controllers.
    It seems like the GPU's L2-L1 domain is not a direct client, but there's a diagram with 16 L2 slices touching the fabric.

    That's 512 GB/s of memory controller bandwidth and 16 L2 clients that at least traditionally were able to sink something on the order of 32 bytes of bandwidth each cycle.
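    Back-of-envelope, with the caveat that the ~1.5 GHz clock below is my assumption rather than anything from the slides:

    Code:
    # Rough aggregate-bandwidth check: 16 L2 slices each able to sink ~32 B/clk
    # versus the 512 GB/s the HBM2 controllers can supply. Clock is assumed.
    slices, bytes_per_clk, clock_hz = 16, 32, 1.5e9
    l2_side = slices * bytes_per_clk * clock_hz / 1e9   # ~768 GB/s on the L2 side
    hbm2    = 2048 * 2.0e9 / 8 / 1e9                    # 2048-bit HBM2 @ 2 Gb/s/pin
    print(l2_side, hbm2)                                # -> 768.0 512.0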


    Besides that, however, the marketing slides for Vega 10 had the IF domain shaded in as a strip between the ROPs and HBCC. It's likely bigger than traditional interconnects, but removing it entirely wouldn't save that much area for Vega.
     
  13. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    So consoles aren't important or catered to by game developers? Nvidia's "majority" of dGPUs was something along the lines of 15% of the graphics marketshare last I checked. AMD's share will dwindle as the move towards APUs and integrated designs continues, per that discussion we had a year or so ago. We've seen Raven with better performance, Intel added Vega graphics as well that are advertised as VR capable, and the Nintendo Switch seems rather popular with its typically less-than-dGPU level of performance. With 580/1060 mid-range marketshare shifting towards APU designs, those dGPU sales will drop. Even Intel is getting in on that action.

    Not really wishful; we've already seen parity without everything fully enabled in well-optimized titles where the cards run close to max. So in this case "denial" is just common sense, unless the new features slow things down or DX12/Vulkan titles are optimized horribly. The potential FP16 performance is well above a 1080 Ti, and that feature is attractive if going towards that low-power APU market with lots of users.

    That can be a tricky situation. It can be preferable for all involved if the head steps aside during any reorganization regardless of performance.

    None of the clients normally have large requirements, but some (xDMA for example) could. IF is scalable, so there is no reason to really overbuild it unless there exists a situation where it may be necessary. Back to the xDMA part: that would be the Crossfire connection to other cards over PCIe. Typically limited to 16x PCIe, so making it larger would make sense and still be covered in the Hot Chips presentation, just not drawn to scale. That would nominally be 16 GB/s, doubled to 32 GB/s if a second link or PCIe 4 was added (IF on Epyc runs above PCIe spec), and perhaps even >64/128 GB/s (if designed to extend Epyc's fabric). It's conceivable that xDMA could be on the level of a memory controller. That client may also be switching between multiple IO devices, including video encode/decode.
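    For reference, the nominal link arithmetic behind those figures (per direction, counting only line coding overhead):

    Code:
    # PCIe 3.0 runs 8 GT/s per lane with 128b/130b encoding; PCIe 4.0 doubles the rate.
    lane_gen3 = 8e9 * 128 / 130 / 8 / 1e9       # ~0.985 GB/s per lane
    x16_gen3  = 16 * lane_gen3                  # ~15.75 GB/s, the "16 GB/s" figure
    x16_gen4  = 2 * x16_gen3                    # ~31.5 GB/s with gen4 or a second link
    print(round(x16_gen3, 2), round(x16_gen4, 2))   # -> 15.75 31.51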
     
  14. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,986
    Likes Received:
    2,913
    Nope, consoles are not that important for the "PC" gaming ecosystem. Almost all titles already come with NVIDIA-optimized code for PC despite the console heritage, which enables NVIDIA GPUs to run them better, further alienating AMD GPUs in the PC space, making them less desirable, impacting R&D, impacting competitiveness, and eventually might lead AMD to lose even the console market (like they lost the Switch). It's a cycle of cause and effect.
    We are already seeing 70% of DX12 titles being optimized horribly.
    Not according to developers invested in the matter (like DICE). Right now the feature is in limited use (being only available on the PS4 Pro), and even when in use it accelerates parts of the code by 30%, which means the overall impact is probably less than 10% depending on the code discussed. Not every feature should be blown out of proportion like it's the second coming. How many chips have FP16 enabled again?
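    For what it's worth, a quick Amdahl-style check of that (the 30% affected fraction of frame time below is my assumption for illustration):

    Code:
    # If a 1.3x speedup only applies to a fraction of the frame, the overall
    # gain is far smaller than 30%. The affected fraction here is assumed.
    def overall_gain(fraction_affected, local_speedup):
        new_time = (1.0 - fraction_affected) + fraction_affected / local_speedup
        return 1.0 / new_time - 1.0
    print(round(overall_gain(0.30, 1.30) * 100, 1))   # -> 7.4 (% overall)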
    Actually, we've seen huge disparity more than parity. The concept of "well optimized titles" can be played by both sides; the number of games where a 1080 far exceeds the Vega 64 is higher than the one or two cases where Vega 64 comes close to a 1080 Ti (let alone the fully enabled Titan Xp).
    Intel's Kaby Lake G is nowhere near the 1060/580 level, not unless you believe some botched and skewed Intel marketing numbers. And Intel is only doing it for a niche market. AMD seems reluctant to even approach that level any time soon, and by the time they decide to, the midrange will have moved on to an even higher level. Rinse and repeat. I won't go into the economics of this again, not when both Intel and AMD don't see it as a feasible option just yet.

    Your point is basically this: AMD will convince developers to use some Vega features because they might gain marketshare because of APUs? How long is this going to take? 2 years? 3 years? 5? By then it won't matter anymore: the landscape will change, the competition will catch on, and these features might not even exist in the new generations. Volta is here in 2018 already, while AMD is having a hiatus for another 1.5 years at the minimum. How will this play into AMD's ability to compete?
     
    Picao84 and xpea like this.
  15. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    The compute shader wouldn't be injected; it would generate a coherent stream of culled geometry or indices prior to the traditional pipeline. The 4 SEs can still only rasterize so much geometry, with the high primitive shader geometry rate applying to culling. The compute variation should also work on Nvidia hardware.

    Which chip was that? What I recall was Mobile Ryzen with up to a dozen CUs and Kaby Lake G's up to 24 CUs. While similar performance could be possible based on power envelopes, I wouldn't consider those in direct competition. I haven't seen any >24 CU Vega variants besides Vega 56/64 and possibly a rumored 32 CU Nano. I'd agree AMD can't match Intel's scale there, but Ryzen APUs would be well positioned for affordable NUCs, Chromebooks/boxes, etc. that wouldn't require thin and low-powered designs.

    Perhaps, but it may be the only option with current supplies, with high RAM prices and GPUs near impossible to find. For OEMs the smaller form factors would be more popular as the market moves away from larger desktops. Inevitably AMD needs to get into that market. They could cede the higher-end mobile/ultra-thin market and still be positioned with moderately sized NUCs.

    We're not disagreeing here. I'm stating that some scheduling constraints could be relaxed and larger, more complex shaders scheduled with acceptable performance, avoiding situations where a spike in VGPR usage at some point in a shader's execution consumes most of the registers. The virtual registers would allow those inactive registers to be paged, discarded, etc., so more work could be dispatched, with VGPR scheduling requirements changing temporally within a shader.

    I'm suggesting a bin would be a fixed, consistent number of wavefronts with a variable amount of geometry and long-running shaders. Shift the pipeline entirely into one CU with bin-sized screen-space dimensions. The occupancy constraint would shift, as waves could be oversubscribed with the paging mechanism. A more compute-oriented pipeline than the traditional fixed function, one that could be better distributed. Primitive count would be variable, with bin dimensions determined by the ROP cache. Have a CU focus on one section of screen space until all geometry has been rendered there, removing the global synchronization, with primitives only being ordered within the scope of a CU/bin.

    I wasn't suggesting the PHYs for internal networking, but tying in external adapters that may only be present on server parts. Sized for 20 PCIe lanes instead of 16 or perhaps larger. If we can get a good die shot it might be interesting to compare Vega and Epyc PHYs to see if they are similarly sized.
     
  16. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    380
    Likes Received:
    124
    That's not too surprising. It was always a monumental challenge to discard triangles this way without messing with standards and still having the GPU do what developers expect of it. Nvidia's fast primitive discard feature is similarly there only for developer enablement.
     
  17. Locuza

    Newcomer

    Joined:
    Mar 28, 2015
    Messages:
    45
    Likes Received:
    101
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,170
    Likes Received:
    3,069
    Location:
    Well within 3d
    That would be able to avoid insertion, since that is a different spot than originally stated.

    Radeon Vega Mobile was announced at CES. A discrete Vega with 1 stack of HBM2 and 28 CUs.
    It should link to a CPU over PCIe.
    The Intel package is a 24 CU solution using EMIB to link to its one stack of HBM2. The GPU links to the CPU with an 8x PCIe link.
    It's somewhat in the same realm, although the discrete could benefit from better cooling given the extra space it has--which is the extra space that makes it less compact than the Intel solution.

    As a discrete with no obligation to be mated to a specific CPU in a package, it has more flexibility for what it can be plugged into and would have required less up-front investment as well.

    Then perhaps I did not interpret the statement I was replying to as it was intended.
    I would characterize the situation as being more stark than relaxing scheduling constraints. The shaders' register allocations are defined as worst-case and are a fixed amount. The current situation isn't that they can be scheduled together with better performance so much as that sufficiently large worst-case allocations prevent more than a few from being scheduled at all.

    This seems to be assuming an additional change to the architecture, since waves are capped at 10 per SIMD. Relaxing register occupancy only goes so far.
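    The arithmetic behind that cap, as a rough sketch (standard GCN-style figures; the per-shader allocations are hypothetical):

    Code:
    # Per-SIMD occupancy: the worst-case VGPR allocation alone caps resident
    # waves, no matter how few registers are actually live at any moment.
    VGPR_FILE = 256     # VGPRs per SIMD (64 KB / 64 lanes / 4 bytes each)
    WAVE_CAP  = 10      # architectural wave slots per SIMD
    def waves_per_simd(vgprs_per_wave):
        return min(WAVE_CAP, VGPR_FILE // vgprs_per_wave)
    for v in (24, 32, 64, 128):                    # hypothetical allocations
        print(v, "VGPRs ->", waves_per_simd(v), "waves")
    # -> 24: 10, 32: 8, 64: 4, 128: 2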

    That's likely to run into more complex trade-offs. There's a potentially somewhat stable working set of opaque geometry that may give somewhat similar wavefront counts, or at least with limited variation between adjacent batches assuming similar coverage and overdraw with the DSBR's shade-once fully in effect. The forward progress guarantees for the virtual register file would likely only give one primary wavefront per SIMD a guarantee of progress if register spills start to spike. More complex regions of screen space may do better spread over more CUs, as they can give more wavefronts forward progress guarantees while spreading any spill storms across more register files and caches.
    The DSBR is already rather serializing, so committing a batch to one CU and letting one CU's limitations exert back pressure may be constraining.
     
  19. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    547
    Likes Received:
    247