AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Discussion in 'Architecture and Products' started by Deleted member 13524, Sep 20, 2016.

  1. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    Most awesome news in a long time. Someone is finally talking about games and fine-grained automated memory paging from CPU memory. Hopefully Nvidia follows suit. The professional ($5000+) Pascal P100 already supports this in CUDA. Link: http://www.techenablement.com/key-aspects-pascal-commercial-exascale-computing/. Now we just need consumer NV GPU support and graphics API support.

    Future looks bright :)
     
  2. SimBy

    Regular

    Joined:
    Jun 21, 2008
    Messages:
    700
    Likes Received:
    391
    Expecting a 530mm² Vega not to clean up against the 1080 is lunacy. That would be the biggest failure AMD has ever conceived.
     
  3. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    What we forget to take into account is that the Vega run was with V-sync locked at 60 fps. We have no idea what frame rate it would actually reach; it could be barely 61 fps, or an average of 75, who knows. Honestly, I don't know what its performance will be based on this demo.

    It will be the same thing for the Titan X if V-sync is enabled (and it is in some of the links posted above).

    I just can't tell, for SW, what its performance was based on a frame rate artificially fixed at 60 fps (V-sync on).
     
    #603 lanek, Jan 5, 2017
    Last edited: Jan 5, 2017
  4. seahawk

    Regular

    Joined:
    May 18, 2004
    Messages:
    511
    Likes Received:
    141
    At 3840x2400, my 4790 + 1080 delivers 60 fps with some dips to 53 fps during heavy action, in the settings used and on Endor. So I think the performance shown is around 1080 level.
     
    DavidGraham likes this.
  5. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    609
    Likes Received:
    1,142
    Is this unified memory? Every Pascal GPU supports 49-bit addressing and the "Page Migration Engine"; see the comment section:

    https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/
     
    Razor1 likes this.
  6. Psycho

    Regular

    Joined:
    Jun 7, 2008
    Messages:
    746
    Likes Received:
    41
    Location:
    Copenhagen
    Could be, but this slide indicates both twice the clock (which I doubt should be taken literally) and twice the ops (/units) per CU.
     
  7. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    22,146
    Likes Received:
    8,533
    Location:
    ಠ_ಠ
    Would be strange for them to pull a Fermi (RE: clocks), wouldn't it? :confused:o_O
     
  8. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    696
    Likes Received:
    446
    Location:
    Slovenia
    DX8 did that, not DX9. And merging four programmable stages all running at different cadences into a single programmable stage is a far, far more complex thing to do than what was done in DX8.
     
    Razor1 likes this.
  9. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    It takes some work out of developers' hands, but I'm pretty sure most of the work being done now for streaming assets still has to be there. It does take some load off the CPU with the new implementation methods, I think, but I'm just guessing at this point.
     
  10. Esrever

    Regular

    Joined:
    Feb 6, 2013
    Messages:
    846
    Likes Received:
    647
    I keep seeing people saying that they use fiji drivers for Vega in the demos, is that even possible?
     
  11. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,213
    VideoCardz mentioned the demo dropped to 57 fps several times.
     
  12. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    not really possible lol.
     
    no-X likes this.
  13. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,245
    Likes Received:
    4,465
    Location:
    Finland
    IIRC AMD said at the December event that the drivers were "based off Fiji with few tweaks and a debug layer, early alpha", but that was early December, not CES
     
    xEx likes this.
  14. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    417
    Likes Received:
    381
    Assuming it is on-demand page migration, it should work differently though (push/copy vs pull/paging). Everything would behave like automatic tiled resources, probably except render targets. Subsets of the resources would be swapped in only on demand (upon a page fault when accessed, or when a prefetch hint is given). The current model still requires manual management with assumptions about the size of VRAM.

    Although the real question is whether it is that simple... AFAIU it can be done rather easily with the abstractions and coherency guarantees the graphics stack provides, as long as the GPU address translation hierarchy is architected to handle it. But probably not for compute (HSA/OCL), especially for HSA, which requires agents to share/mirror the process VAS.

    Sounds like a fit for Linux's HMM effort though.
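    As a rough sketch of that pull/paging model (everything here is made up for illustration; no real driver, HMM, or tiled-resources API is being quoted): pages of a resource stay in system memory and are migrated into VRAM only when a GPU access faults on them, or when a prefetch hint arrives ahead of the fault.

```python
# Hypothetical sketch of on-demand (pull) page migration, as opposed to the
# current push/copy model. All names are illustrative, not any real API.

PAGE_SIZE = 64 * 1024  # assume 64 KiB pages, a common GPU translation granule

class MigratingResource:
    def __init__(self, size_bytes, vram_budget_pages):
        self.num_pages = (size_bytes + PAGE_SIZE - 1) // PAGE_SIZE
        self.resident = set()          # pages currently in VRAM
        self.lru = []                  # residency order, oldest first
        self.vram_budget = vram_budget_pages
        self.faults = 0

    def _make_resident(self, page):
        if len(self.resident) >= self.vram_budget:
            victim = self.lru.pop(0)   # evict the least recently used page
            self.resident.discard(victim)
        self.resident.add(page)
        self.lru.append(page)

    def gpu_access(self, offset):
        """Simulate a GPU read: fault and migrate if the page is not resident."""
        page = offset // PAGE_SIZE
        if page not in self.resident:
            self.faults += 1           # page fault: pull the page into VRAM
            self._make_resident(page)
        else:
            self.lru.remove(page)      # touch: move to most-recently-used
            self.lru.append(page)
        return page

    def prefetch(self, offset):
        """Prefetch hint: migrate ahead of time so a later access won't fault."""
        page = offset // PAGE_SIZE
        if page not in self.resident:
            self._make_resident(page)
```

    The point of the sketch is the contrast with today's model: the VRAM budget lives inside the paging mechanism instead of inside the application's streaming code.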
     
    #615 pTmdfx, Jan 5, 2017
    Last edited: Jan 5, 2017
    Razor1 likes this.
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Based on the recent die shot of Polaris, if the scalar portion is next to the shared instruction fetch/scalar cache portion of a CU group and below what appears to be the LDS, quadrupling that band gets around the area of one SIMD partition (2 SIMDs).
    However, some of that area might be due to the scalar portion being integrated with part of the overall scheduling pipeline, which goes to the later portion.

    Breaking that 16-wide SIMD into 4 independent quad-width units creates 4x peak scheduling demand+area, but this is for only 1/4 of a current CU's ALU capacity.

    What happens with the vector register file might be interesting, since the file's depth per lane would increase in order to house a 64-wide wavefront's context in a narrower SIMD.
    If GCN optimized its register read process for the current configuration, there's possibly a bit more math: finding the physical register needs to take the wavefront width into account, since a register ID may map to 1, 2, or 4 rows before considering 64-bit operands.
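    To make that row arithmetic concrete (the widths and layout here are my assumptions for the example, not documented GCN behaviour): a 64-wide wavefront's register deepens as the physical SIMD narrows, so the row lookup has to fold the width ratio in.

```python
# Illustrative mapping of a logical register ID of a 64-wide wavefront onto
# rows of a register file whose rows are simd_width lanes wide.

WAVE_WIDTH = 64  # assumed wavefront width

def physical_row(reg_id, lane, simd_width):
    """Return (row, column) of one lane of v[reg_id]."""
    rows_per_reg = WAVE_WIDTH // simd_width   # 4 rows at 16 wide, 16 at 4 wide
    row = reg_id * rows_per_reg + lane // simd_width
    col = lane % simd_width
    return row, col

# On a 16-wide SIMD, one 64-wide register spans 4 rows:
assert physical_row(0, 63, 16) == (3, 15)
# On a quad-width SIMD, the same register deepens to 16 rows per register:
assert physical_row(1, 0, 4) == (16, 0)
```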


    I'm curious whether lane clock gating isn't already being done for inactive lanes even without physically separate SIMDs.
    Unless a wavefront monopolizes a SIMD for multiple vector cycles, shutting down lanes based on one wavefront's mask is not going to realize much saving if the next issue cycle comes from another wavefront with a contradictory mask, unless there's fine-grained detection of lane use. And once you have that level of detection, it should work fine with or without the diagram's method.
    The diagram's claim of space savings doesn't make sense if it's just gating: inactive lanes don't go away in that case, barring some other change in the relationship between lanes and storage.
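    A toy model of that gating argument (masks and the idle-run threshold are invented for the example): two wavefronts with contradictory execution masks alternating on one SIMD leave no lane idle for long, so anything coarser than per-issue detection saves nothing.

```python
# Count lane-cycles that could be clock-gated, assuming a lane must stay
# idle for at least min_idle_run consecutive issues to be worth gating.

def gateable_cycles(masks_in_issue_order, min_idle_run):
    lanes = len(masks_in_issue_order[0])
    saved = 0
    for lane in range(lanes):
        run = 0
        for mask in masks_in_issue_order:
            if mask[lane]:
                run = 0                # lane re-activated: idle streak ends
            else:
                run += 1
                if run >= min_idle_run:
                    saved += 1
    return saved

wave_a = [1, 1, 0, 0]                  # lanes 0-1 active
wave_b = [0, 0, 1, 1]                  # contradictory mask: lanes 2-3 active
schedule = [wave_a, wave_b] * 4        # the two wavefronts alternate issue

# Fine-grained (per-issue) detection still finds the idle lane-cycles:
assert gateable_cycles(schedule, 1) == 16
# Require 2+ consecutive idle issues and the alternating masks defeat gating:
assert gateable_cycles(schedule, 2) == 0
```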


    Up until execution/export could become reordered with binning, the pipelined import/export of ROP tiles may have posed a risk of thrashing the L2 more than saving ROP traffic back to memory. Shading and exporting on the basis of a bin's consolidated lifetime rather than a mix of fragments with conflicting exports might have helped. It could also help reduce or avoid thrashing of the compression pipeline, and possibly give a point of consistency for an in-frame read after write to produce valid data.
    That might depend on where the compression pipeline and its own DCC cached data resides.

    Part of the ROV process might fall out of binning. ROV mode could make a bin terminate upon detecting an overlap, then start a new bin with the most recent fragment. If the binning process has multiple bins buffered for each tile, the front end might be able to switch over to another tile if the next bin also hits a conflict.

    The Vega MI25 is supposedly 25 TF of FP16. Unless there are half as many CUs, it should be higher than 25: 4096 SPs x 2 (FMA) x 2 (FP16) x ~1.5 GHz (clock) gives ~25 TF.
     
    #616 3dilettante, Jan 5, 2017
    Last edited: Jan 5, 2017
  16. Psycho

    Regular

    Joined:
    Jun 7, 2008
    Messages:
    746
    Likes Received:
    41
    Location:
    Copenhagen
    Yes, that's why I say we probably should not take the clock doubling in the diagram literally, but just as a significant increase, as mentioned in the text.

    Yeah, twice-as-wide CUs (and I agree that 128 sounds too wide) would mean half as many CUs. The slide could also indicate better packing of the 'ops' (higher utilization), for instance the wavefront compaction, but I think we concluded that was for Navi only?
     
  17. kalelovil

    Regular

    Joined:
    Sep 8, 2011
    Messages:
    568
    Likes Received:
    104
    Indeed. That may also explain why the release is likely further away than many were expecting.
    AMD would want to avoid a re-run of Tahiti's release, where initial drivers gave a poorer impression of the architecture than was eventually the case.
     
  18. xEx

    xEx
    Veteran

    Joined:
    Feb 2, 2012
    Messages:
    1,060
    Likes Received:
    543
    Talking about that, I saw a recent review where Tahiti came out ahead of the 680.

    Sent from my HTC One using Tapatalk
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The preview footnotes also had the triangle rate at up to 11 polygons per clock over 4 shader engines, which would put CUs per engine lower in this performance tier than it has been. There was no mention of texture or load/store throughput, which would presumably see sharply higher demand if the vector path doubled.

    Maybe the higher ops count is about removing sources of stalls. Another possibility is that a CU could issue more than one instruction from a wavefront, possibly from more than one category in the absence of a dependence, such as issuing a vector and a scalar op together.
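    A minimal sketch of that co-issue check (the operand encoding and category names are mine, purely for illustration): allow two ops from the same wavefront in one cycle only if they use different categories and the second doesn't read what the first writes.

```python
# Hypothetical dual-issue legality check: different execution category and
# no read-after-write dependence between the two candidate instructions.

def can_dual_issue(op1, op2):
    """op = (category, dests, srcs); categories e.g. 'vector', 'scalar'."""
    cat1, dst1, _ = op1
    cat2, _, src2 = op2
    different_category = cat1 != cat2
    no_raw_dependence = not (set(dst1) & set(src2))
    return different_category and no_raw_dependence

v_add = ('vector', ['v0'], ['v1', 'v2'])
s_cmp = ('scalar', ['scc'], ['s0', 's1'])
v_use = ('vector', ['v3'], ['v0'])

assert can_dual_issue(v_add, s_cmp)        # independent vector + scalar: yes
assert not can_dual_issue(v_add, v_use)    # same category and RAW on v0: no
```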

    I don't recall much being concluded about Navi, or if any compaction is in the cards at all.
     