AMD Vega Hardware Reviews

Discussion in 'Architecture and Products' started by ArkeoTP, Jun 30, 2017.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Slide 19

    http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf

    It's a key differentiator as compared with the VLIW SIMD design: "Vector back-to-back wavefront instruction issue", versus "Interleaved wavefront instruction required" for VLIW.

    I have got respectable performance out of GCN with just a single wavefront per SIMD (i.e. more than a 128-VGPR allocation). It depends on the ALU:MEM ratio and on incoherent control flow, in the end (a rough sketch of the occupancy arithmetic follows at the end of this post).

    32KiB (shared by several CUs) of I$ is plenty large enough for fairly complex compute (a single very heavy kernel). Multiple, large, competing kernels sharing I$ is obviously going to be a factor with the various kernels seen by graphics. Still doesn't change the fact that GCN was designed explicitly for back-to-back execution of instructions from a single wavefront.
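    As a back-of-envelope for the "more than 128 VGPRs means a single wavefront per SIMD" point, here is a minimal sketch of the usual GCN occupancy arithmetic, assuming the commonly cited figures of 256 VGPRs per SIMD lane, a cap of 10 wavefronts per SIMD, and a 4-VGPR allocation granularity; the function name and the exact granularity are illustrative rather than taken from the slide deck.

    # Toy back-of-envelope: wavefronts per GCN SIMD as limited by VGPR allocation.
    # Assumes 256 VGPRs per SIMD lane, a 10-wave cap, 4-VGPR allocation granularity.

    def waves_per_simd(vgprs_per_wave: int) -> int:
        """Wavefronts one SIMD can hold, considering only the VGPR budget."""
        granularity = 4
        allocated = -(-vgprs_per_wave // granularity) * granularity  # round up to a multiple of 4
        return min(10, 256 // allocated)

    if __name__ == "__main__":
        for v in (24, 64, 84, 128, 129, 256):
            print(f"{v:>3} VGPRs/wave -> {waves_per_simd(v)} wave(s) per SIMD")

    With that arithmetic, anything over 128 VGPRs per wave leaves room for only one wavefront on the SIMD, which is the regime described above.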
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    A workload with a single wavefront per SIMD, and the same fully cached kernel for all CUs sharing a front end, would go far in making the CU's fetch arbitration straightforward, perhaps more so after Polaris increased its buffer sizes and added whatever it calls instruction prefetch.
    It makes it seem like GCN can heavily exercise its capacity for back-to-back issue within a wavefront on a SIMD if the workload gives the CU nothing else to switch to.
     
  3. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,749
    Likes Received:
    2,516
    Lightman likes this.
  4. Malo

    Malo Yak Mechanicum
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    6,977
    Likes Received:
    3,057
    Location:
    Pennsylvania
    How so? Comparable performance subjectively in a variable refresh environment.
     
  5. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    Well, Antal says in the blog:
    „Even if we pitted Radeon RX Vega against the mightier GeForce 1080Ti,[…]“
    which might be interpreted as RX Vega being compared to the 1080 Ti as well as to the 1080 non-Ti, if you take the context of the preceding paragraphs into account.

    If you take this sentence of Antal's blog
    „That said, the biggest difference between the two systems was the price based simply on the monitor itself, with the G-Sync display costing $300 more than its FreeSync counterpart.“
    you might conclude that the RX Vega edition shown in Budapest will sell for roughly the GTX 1080's price level.
     
    Lightman, DavidGraham and pharma like this.
  6. Malo

    Malo Yak Mechanicum
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    6,977
    Likes Received:
    3,057
    Location:
    Pennsylvania
    I hope so, and I also hope the AIO version isn't more than a $100 premium.
     
  7. BacBeyond

    Newcomer

    Joined:
    Jun 29, 2017
    Messages:
    73
    Likes Received:
    43
    I wasn't contradicting you, I was just rewording it and adding to it.
     
  8. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,749
    Likes Received:
    2,516
    Well, in addition to what Carsten said, there is AMD's choice of the 1080 for the comparison, there are AMD's earlier demos which indicated 1080 performance, and there is the FE's gaming performance, which is around the 1080's. And then there are the blind tests and the lack of fps counters. Too many signs now point to 1080-level performance for the air-cooled Vega. Maybe the water-cooled version will be noticeably better than the 1080?
     
    Lightman likes this.
  9. homerdog

    homerdog donator of the year
    Legend Veteran Subscriber

    Joined:
    Jul 25, 2008
    Messages:
    6,133
    Likes Received:
    905
    Location:
    still camping with a mauler
    This is actually an interesting point. Vega's high power consumption makes it a poor choice for mining, meaning retail prices for RX Vega could be substantially lower than for similar-performing cards with lower TDP.
     
  10. Malo

    Malo Yak Mechanicum
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    6,977
    Likes Received:
    3,057
    Location:
    Pennsylvania
    Yep, you're right. Given everything available previously, and now the blind tests and the follow-up article, it's pretty much confirmed that RX Vega is at 1080 speed. I imagine the water-cooled version will be able to sustain the higher clocks (around 1600 MHz) but will likely be no more than 10% faster.

    Now let's move on to SIGGRAPH and announce pricing!
     
    DavidGraham likes this.
  11. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,296
    Likes Received:
    395
    Location:
    Australia


    Well, at least the Vega SSG looks like it's awesome in its space.
     
    BacBeyond, Alexko, Cat Merc and 5 others like this.
  12. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    3,984
    Likes Received:
    34
    That it does. This is the only Vega SKU that looks appealing to me, at this point. That is not to say no other SKUs will appeal to anyone else, just that this is the only one I would consider buying. I'm sure it will cost an arm and a leg though.
     
  13. homerdog

    homerdog donator of the year
    Legend Veteran Subscriber

    Joined:
    Jul 25, 2008
    Messages:
    6,133
    Likes Received:
    905
    Location:
    still camping with a mauler
    When does the NDA lift for RX Vega reviews?
     
  14. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    Not yet announced.
     
  15. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    Yep, that's one USP for Vega right now. People needing seamless, predictable streaming access to massive amounts of data surely will love Vega SSG.
     
  16. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    It's the multiple large kernels, amplified with async, that would be the concern. Then there are DPP and other conditions requiring wait states, where back-to-back execution is obviously problematic. I was never disputing that GCN is capable of back-to-back instructions from the same wave, but rather that the scheduling would prefer not to execute a single wave indefinitely, two waves being more ideal. In fact, an unbound VALU-heavy wave would ideally be the lowest priority for scheduling.

    It's a complex issue, but for multiple thread groups, especially with async, keeping them somewhat in lockstep would, I think, be preferable: ensure all related waves hit the cache before another kernel or SIMD trashes it. Taken a step further, periodically switching groups to try to always have one stalled on every possible bottleneck would increase utilization, assuming limited OoO execution capabilities. Instruction prefetch as a non-dependent fetch to prime a cache could do that, but there are other bottlenecks. It would make sense for the schedulers to prioritize whatever waves weren't likely to be bound by currently observed bottlenecks (a toy sketch of that idea follows at the end of this post); I know AMD has mentioned they actively track some of those metrics. The HWS could use them, but so could wave scheduling.

    Back to my original software scheduling point: Nvidia was explicitly addressing temporary registers for the RF caching. Unlike VGPRs, they would block other waves from execution; waves not using them could still schedule and execute. The programming/scheduling model would likely prefer short execution bursts, allowing the hardware scheduler's prioritization mechanisms to balance the load. That's not an issue for Nvidia, as they lack the async and hardware scheduling there. So while a single wave could execute indefinitely, that would be less than ideal. A CU/group-level scalar for controlling timing could really help there.

    Polaris also added HWS, so it stands to reason some conventional wisdom may have changed. As mentioned above, non-dependent cached memory accesses make sense for the prefetch as an OoO optimization to prime caches, followed by some barrier close to the expected latency to ensure all waves in a group hit the cache ASAP, or at the very least stall for only a short time.
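    Purely as a toy illustration of the kind of arbitration heuristic described above, and definitely not AMD's documented scheduling policy, a picker that deprioritizes waves bound by a currently observed bottleneck might look like the following; all names, fields and the "memory unit busy" signal are made up for the sketch.

    # Toy wave-issue arbitration sketch: prefer waves that are not waiting on a
    # currently contended resource, with oldest-first as the tiebreak.
    # Illustrative only; not a description of real GCN hardware behaviour.
    from dataclasses import dataclass

    @dataclass
    class Wave:
        wave_id: int
        age: int               # cycles since this wave last issued
        pending_mem_ops: int   # outstanding vector memory requests
        wants_valu: bool       # next instruction is a VALU op

    def pick_wave(waves: list[Wave], mem_unit_busy: bool) -> Wave | None:
        """Pick the next wave to issue: skip waves that would hit the observed
        bottleneck (here, a saturated memory path), then favour waves with the
        fewest outstanding memory ops, oldest first."""
        ready = [w for w in waves if w.wants_valu or not mem_unit_busy]
        if not ready:
            return None
        return min(ready, key=lambda w: (w.pending_mem_ops, -w.age))

    if __name__ == "__main__":
        waves = [Wave(0, age=12, pending_mem_ops=3, wants_valu=False),
                 Wave(1, age=7,  pending_mem_ops=0, wants_valu=True),
                 Wave(2, age=20, pending_mem_ops=1, wants_valu=True)]
        chosen = pick_wave(waves, mem_unit_busy=True)
        print(f"issue wave {chosen.wave_id}" if chosen else "stall")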
     
  17. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    It's true that occupancy = 1 is enough for a GCN SIMD to achieve peak performance, but only in massive ALU-heavy kernels that don't do any memory operations. Occupancy = 1 prevents any co-issuing, so the scalar unit can't give you perf gains (bad if you have ALU operations that could run at wave granularity). And obviously each memory access = stall, even when it hits the caches. Even an L1$ access is a >100 cycle stall (but one instruction takes 4 cycles to execute, so the stall is 4x shorter counted in instructions). LDS accesses stall the SIMD and barriers stall the SIMD. The same is obviously true for texture sampling (even from L1$). So in practice occupancy = 1 is useless, but occupancy = 2 is already much better, because it can co-issue vector + scalar + mem and hide infrequent LDS latency and infrequent K$ scalar unit loads (constants). But it will still stall the SIMD every time you do vector memory loads (unless they hit L1$). This is still practical for some highly ALU-dense kernels that mostly operate in registers and in LDS and have huge loops inside the kernel.

    Most shaders require a SIMD occupancy of 4-7 for peak performance. I've never seen any shader gain anything from occupancy higher than 8. I have seen some shaders that reach peak perf at a SIMD occupancy of 3. I would guess that 4K demos and Shadertoy shaders with no texture sampling (pure analytic SDF sphere tracing) reach peak perf already at an occupancy of 2. But this obviously means that you can't do any groupshared memory (LDS) optimizations or any communication (LDS or cross-lane ops) between lanes. Shadertoy only supports pixel shaders, so there's no communication, so this problem doesn't exist.
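    To put rough numbers on those occupancy figures, here is a toy steady-state model of a single SIMD, using the 4 cycles per VALU instruction mentioned above and an illustrative ~120-cycle stall per vector memory access, with co-issue and bandwidth limits ignored; the function and its parameter values are assumptions for the sketch, not measurements.

    # Toy latency-hiding model for one GCN SIMD: each wave alternates between a
    # burst of VALU work (4 cycles per instruction) and a memory stall.
    # Illustrative numbers only.

    def simd_utilization(valu_per_mem: int, mem_latency: int, waves: int) -> float:
        """Fraction of cycles the SIMD issues VALU work in steady state,
        ignoring co-issue of scalar/LDS/export and any bandwidth limits."""
        alu_cycles = valu_per_mem * 4        # time one wave keeps the SIMD busy
        period = alu_cycles + mem_latency    # busy time plus stall time per wave
        return min(1.0, waves * alu_cycles / period)

    if __name__ == "__main__":
        for waves in range(1, 11):
            u = simd_utilization(valu_per_mem=10, mem_latency=120, waves=waves)
            print(f"{waves:>2} wave(s): ~{u:.0%} VALU utilization")

    With roughly 10 VALU instructions per memory access, this toy model saturates at around 4 waves, broadly in line with the 4-7 occupancy figure above; a kernel with far more ALU work per access saturates correspondingly earlier.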
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    This is why I wrote about heavy kernels: kernels that can run for hundreds to thousands of milliseconds, etc. (I don't do graphics), e.g. sorting millions of lists that are each hundreds to thousands of entries long.

    The real problem with GCN is that the compiler stops your sweet algorithm that consumes N VGPRs from running sweetly, because the compiler decides to add a ridiculous number of additional VGPRs, trashing your carefully organised memory access patterns designed for N VGPRs (and the resulting wavefront count). So then you have to hunt down a lesser algorithm to work around the compiler.
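    To illustrate the occupancy cliffs that those extra compiler-allocated VGPRs push a kernel over, here is a small sketch assuming the usual 256-VGPR budget and 10-wave cap per SIMD, with per-generation allocation granularity ignored; the names and constants are illustrative.

    # Occupancy cliffs: the largest per-wave VGPR count that still allows a
    # given number of wavefronts on one GCN SIMD (256-VGPR budget assumed).

    BUDGET, MAX_WAVES = 256, 10

    def max_vgprs_for(waves: int) -> int:
        """Largest per-wave VGPR count that still fits `waves` wavefronts."""
        return BUDGET // waves

    if __name__ == "__main__":
        for waves in range(MAX_WAVES, 0, -1):
            print(f"{waves:>2} wave(s)/SIMD possible up to {max_vgprs_for(waves):>3} VGPRs per wave")

    Designing for, say, 84 VGPRs gives three waves per SIMD; a compiler bump to 90 drops that to two, and anything past 128 drops it to one, which is exactly the kind of cliff described above.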
     
  19. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    HWS were already present in Tonga and Fiji. They are useful, but not magical unicorns.
     
  20. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    HWS were introduced with Polaris as I recall but, being programmable, were backported to Tonga and Fiji. In reference to the whitepaper and scheduling, the HWS didn't exist at the time of publication; the paper is dated 2011 and I believe was released in 2013, going off the URL. Nor did a readily usable implementation for async compute exist; Mantle was being experimented with at that time. It stands to reason the scheduling process is actively evolving over time, which also explains the programmable hardware in the first place.

    While not affecting individual shader performance, it might make sense to use those extra waves for leading-edge pixels if possible: triggering page faults with HBCC as soon as possible, since those waves will likely stall, and then using the duration of the frame to cover the stall. It wouldn't be surprising if the raster pattern on Vega changed direction, going opposite to camera movement, if that's even possible in the driver given the order the application draws in, or programmable via wave distribution.
     