AMD Vega Hardware Reviews

Discussion in 'Architecture and Products' started by ArkeoTP, Jun 30, 2017.

  1. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,152
    Likes Received:
    3,055
    Location:
    Finland
    http://www.pcgameshardware.de/AMD-Radeon-Grafikkarte-255597/Specials/Vega-10-HBM2-GTX-1080-1215734/
    There's one; it should be mentioned more or less directly in all the articles about AMD's Doom demos at the Tech Summit last December.

    I would guess it comes down mostly to the similar unit configuration of the GPU and the HBM memory controllers, which is why it's "Fiji-based" rather than "Polaris-based".

    edit: and here's the reddit post suggesting Fiji-drivers
    edit: ffs autoparsing all the reddit links
    reddit.com /r/Amd/comments/6kdwea/vega_fe_doesnt_seem_to_be_doing_tiled/

     
  2. Cat Merc

    Newcomer

    Joined:
    May 14, 2017
    Messages:
    124
    Likes Received:
    108
    GP100 has 1:2 FP64, as well as more cache. That's your culprit for die size. FP16 shouldn't take nearly the die size penalty we see on Vega.
     
  3. Love_In_Rio

    Veteran

    Joined:
    Apr 21, 2004
    Messages:
    1,582
    Likes Received:
    198
    If true, it would be enough for AMD to show us a benchmark with tiling activated in at least one game, so we could see the improvements that will come...
     
  4. kalelovil

    Regular

    Joined:
    Sep 8, 2011
    Messages:
    558
    Likes Received:
    95
    Lightman, pharma and T1beriu like this.
  5. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,298
    Likes Received:
    3,575
    The same is true for the Titan Xp. If I were to pick the top results for each, the massive difference between them would stay the same.
     
    xpea likes this.
  6. seahawk

    Regular

    Joined:
    May 18, 2004
    Messages:
    511
    Likes Received:
    141
    And how would you get the Int8 and FP16 throughput Vega is showing with a Fiji driver?
     
    Geeforcer likes this.
  7. kalelovil

    Regular

    Joined:
    Sep 8, 2011
    Messages:
    558
    Likes Received:
    95
    More data is needed, but the lower Vega results were submitted a month before the card's release.
     
  8. Sxotty

    Legend Veteran

    Joined:
    Dec 11, 2002
    Messages:
    5,110
    Likes Received:
    479
    Location:
    PA USA
    I have been waiting to buy a FreeSync display for when AMD finally got a high-performing card out. Sad. Plus I want Nvidia to be forced to support FreeSync. Grrr. I liked it better when there was competition top to bottom.
     
    #69 Sxotty, Jul 3, 2017
    Last edited: Jul 4, 2017
  9. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,122
    Likes Received:
    5,665
    From the reddit post:

    Is anyone here capable of replicating this process?
     
  10. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    Even if we take this statement at face value, would it explain the performance?

    What kind of performance improvement can you reasonably expect from Vega's architectural changes?

    As I understand it, HBCC does little unless the software is written for it, so it wouldn't make a difference for today's games.

    NCU supposedly has better IPC. That's great, but how much of a difference does that really make? Are we talking here about being better in cases of warp divergence? For games, would that give better than 5% on average?

    Then there's tiling. A nice boost for BW, but that doesn't explain why the FE has a hard time keeping up with a 1080, which has 20% less BW. And let's not forget that the impact of extra BW on gaming performance isn't that high. (That is: if you increase memory clocks by 10%, gaming performance typically goes up less than 5%.)
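    As a very rough sketch of that last point (the 40% bandwidth-bound share below is my own guess, not a measured figure):

    ```python
    # Back-of-the-envelope Amdahl-style model: only part of the frame time
    # actually scales with memory bandwidth.

    def fps_gain(mem_clock_gain, bw_bound_fraction):
        """Expected frame-rate gain when only `bw_bound_fraction` of the frame
        time shrinks with a `mem_clock_gain` memory overclock."""
        new_time = (1 - bw_bound_fraction) + bw_bound_fraction / (1 + mem_clock_gain)
        return 1 / new_time - 1

    # +10% memory clock, ~40% of the frame assumed bandwidth-bound
    print(f"{fps_gain(0.10, 0.40):.1%}")  # -> about 3.8%, i.e. under 5%
    ```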

    FP16 isn't currently used a lot (if at all) in games, so that doesn't make a difference either.

    Something else is going on.
     
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,368
    Likes Received:
    3,962
    Location:
    Well within 3d
    There should be some synergy between the binning rasterizer and better divergence handling. Divergence handling wasn't mentioned by AMD when it started talking about Vega in more detail, though.
     
  12. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    The reason I brought up divergence handling is that it's one of the only things I could immediately come up with. :) I thought GCN already had a pretty efficient shader core. But I'm all ears about other potential improvements.
     
  13. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,947
    Likes Received:
    2,293
    Location:
    Germany
    Regarding Die Size: I am not 100% sure that the 1/16th DP rate we're seeing advertised for current productization is really the maximum available with this Vega ASIC. I am not saying that it has to have half-rate DP or so, but it's a possibility that it's more than 1/16th.
     
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,368
    Likes Received:
    3,962
    Location:
    Well within 3d
    (*late correction: Meant to say 4 threads per CU below)

    GCN does have a pretty coarse threshold for reaching good utilization; testing of compute loads shows the architecture takes longer to spin up, needing groups that approach the significantly wider wavefronts and at least 4 threads per SIMD (then probably at least 4 more to help hide a death by a thousand low-latency cuts, and now the number of registers per wave comes into play).
    I suppose divergence handling can partially apply to that.
    Another way is finding something to match Volta's ability not to roll over and die with divergent or irreducible control flow when synchronization is in the mix.

    Outside of that, there was discussion about how nice it would be if there were a scalar equivalent to the floating-point and other capabilities in the SIMDs. Also, there were some discussions, from a console context but probably generally applicable, about kernels or shaders benefiting if it were possible to pull more than one scalar or LDS value per clock into a vector operation.

    There are some signs that AMD has tried to improve the shared fetch front end and instruction caches, which might be a scaling barrier.

    Register file capacity has not changed, so occupancy is still a source of pressure. I suppose FP16 is the hoped-for mitigation, but one reason why it's not as flexible is that packed math doesn't expand items like the conditional masks or EXE mask.
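    To put rough numbers on that pressure (64 KiB VGPR file per SIMD, i.e. 256 VGPRs per lane, and a 10-wave cap; the 4-register allocation granularity is my assumption of the usual GCN figure):

    ```python
    # Rough GCN occupancy sketch: wavefronts resident per SIMD as a function of
    # per-thread VGPR usage, given a 256-entry VGPR file and a 10-wave cap.

    def waves_per_simd(vgprs_per_thread, alloc_granularity=4, max_waves=10):
        alloc = -(-vgprs_per_thread // alloc_granularity) * alloc_granularity  # round up
        return min(max_waves, 256 // alloc)

    for v in (24, 32, 48, 64, 84, 128):
        print(f"{v:>3} VGPRs/thread -> {waves_per_simd(v)} waves/SIMD")
    # 24 -> 10, 32 -> 8, 48 -> 5, 64 -> 4, 84 -> 3, 128 -> 2
    ```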
    There was some discussion about wanting some way to reduce occupancy pressure, or to allow execution on pending registers, when utilizing GCN's ability to fire off multiple loads. Vega's waitcnt limit for memory is significantly higher, split up in an odd way so as not to conflict with existing encodings, which points to the NCU perhaps not being as large a departure as some statements indicate. Whether that means new capabilities, or perhaps something that really increases the memory subsystem's latency, is unknown.

    Then there's the static allocation of registers based on the worst-case consumption of a kernel. I think AMD might have some speculative work on this, but no mention in any roadmap or marketing.

    There's the desire to have better memory consistency/coherence (quick release, timestamped coherence, etc.), with various papers from AMD but no mention so far here.

    It's more of an aesthetic preference on my part, but I feel some of the low-level details leaking into the architecture with regards to wait states or specific architectural quirks could use tightening up.

    Maybe more bandwidth from the L1 (Knights Landing is up to 128B)?

    *edit: Maybe start looking at the whole cadence and CU implementation at some point?
     
    #75 3dilettante, Jul 3, 2017
    Last edited: Jul 3, 2017
  15. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    Are you sure that they need 4 warps to keep the CU busy? I thought, with a 4-deep pipeline, a 64-wide warp, and a 16-wide SIMD, they could do it with just 2 warps?
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,368
    Likes Received:
    3,962
    Location:
    Well within 3d
    For full vector throughput, there needs to be a wavefront active for each of the 4 16-wide SIMDs in a CU. The CU moves round-robin between the SIMDs, although given single-instruction-per-wavefront issue and other sundry stall sources, it helps to have additional wavefronts to fill in the blanks.
    Thrashing and other issues (perhaps bigger L1s would help?) can make that a problem, and there are certain types of compute that can get away with fewer wavefronts with unusually large register allocations.
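    A toy model of that issue cadence, for anyone who wants to see the 4-wavefront floor fall out of the numbers (very simplified: one vector issue per cycle, round-robin over the 4 SIMDs, each 64-wide wavefront occupying its 16-lane SIMD for 4 cycles, everything else ignored):

    ```python
    # Toy GCN CU vector-issue model: 4 SIMDs, round-robin issue, one vector
    # instruction per cycle, each 64-wide wavefront busies its 16-lane SIMD
    # for 4 cycles (64 / 16). Shows why ~1 wavefront per SIMD (4 per CU) is
    # the floor for full vector throughput.

    def valu_utilization(waves_per_cu, cycles=1000):
        simd_busy_until = [0, 0, 0, 0]
        # spread the resident wavefronts round-robin across the 4 SIMDs
        simd_has_wave = [any(w % 4 == s for w in range(waves_per_cu))
                         for s in range(4)]
        busy = 0
        for c in range(cycles):
            s = c % 4                       # the CU considers one SIMD per cycle
            if simd_has_wave[s] and simd_busy_until[s] <= c:
                simd_busy_until[s] = c + 4  # 64 work-items / 16 lanes = 4 cycles
            busy += sum(1 for b in simd_busy_until if b > c)
        return busy / (4 * cycles)

    for n in (1, 2, 4, 8):
        print(f"{n} wavefronts/CU -> {valu_utilization(n):.0%} vector ALU busy")
    # roughly: 1 -> 25%, 2 -> 50%, 4 -> 100%, 8 -> 100%
    ```

    With only 2 wavefronts you only ever cover 2 of the 4 SIMDs, which is why the answer to the question above is 4 per CU rather than 2.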

    GCN might be somewhat more prone to switching between threads, or need it more, than other architectures. That's a potentially cheaper way to hide latency, but perhaps the level of switching behavior needs review on the latest nodes. Losing coherence in the data paths and traffic patterns, and perhaps losing some of the predictability for power/clock gating, may be more costly.
    GCN's write-through L1s to distant L2 cache hierarchy might need a look as well.
     
  17. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    3,585
    Likes Received:
    2,310
    The "new" MI8 drivers would probably be Fiji based so maybe there is a reference in the drivers .
     
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,368
    Likes Received:
    3,962
    Location:
    Well within 3d
    Oops, I saw where I made a mistake. I meant CU in the earlier post, but somehow wrote SIMD.
     
  19. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    HBCC should just work. It's just a 4K page faulting mechanism from what I've seen. Usefulness on a 16GB card is another matter.
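    If that's right, it's conceptually just demand paging of 4 KiB pages into VRAM; a minimal sketch of the idea (the class name and LRU policy are mine, purely illustrative):

    ```python
    # Minimal demand-paging sketch: VRAM as a cache of 4 KiB pages over a larger
    # virtual allocation, pages faulted in from system memory on first touch.

    from collections import OrderedDict

    PAGE = 4096

    class HbccLikeCache:
        def __init__(self, vram_bytes):
            self.capacity = vram_bytes // PAGE
            self.resident = OrderedDict()   # page number -> True, in LRU order

        def touch(self, address):
            page = address // PAGE
            if page in self.resident:
                self.resident.move_to_end(page)
                return "hit"
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)   # evict least-recently-used page
            self.resident[page] = True              # "fault": copy page into VRAM
            return "fault"
    ```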
     
    Lightman likes this.