AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Discussion in 'Architecture and Products' started by ToTTenTranz, Sep 20, 2016.

  1. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
Clocks, though, are currently what make the difference between Fiji and Vega in compute on the professional models, setting aside the mixed-precision and FP64 differences.
Without the ~50% peak clock increase they would have very similar theoretical FP32 TFLOPs specs from AMD. I appreciate it is more complex than this in real-world HPC (memory, for instance), but clocks are still pretty fundamental here.
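The theoretical numbers above fall out of a simple formula: 2 FLOPs per ALU per clock (one fused multiply-add). A minimal sketch, assuming 4096 shader ALUs for both chips and illustrative round-number clocks (roughly 1.0 GHz for professional Fiji parts, 1.5 GHz for professional Vega 10; not exact product specs):

```python
# Theoretical FP32 rate: 2 FLOPs (one fused multiply-add) per ALU per clock.
def fp32_tflops(shader_alus: int, clock_ghz: float) -> float:
    return 2 * shader_alus * clock_ghz / 1000.0

# Both Fiji and Vega 10 have 64 CUs x 64 ALUs = 4096 shader ALUs.
# Clocks here are illustrative round numbers, not exact product specs.
fiji = fp32_tflops(4096, 1.0)   # ~8.2 TFLOPs
vega = fp32_tflops(4096, 1.5)   # ~12.3 TFLOPs
print(f"Fiji ~{fiji:.1f} TFLOPs, Vega 10 ~{vega:.1f} TFLOPs")
```

Same ALU count, so the entire paper gap comes from the clock term, which is the point being made: without the clock increase the two specs would be nearly identical.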
     
  2. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,257
    Likes Received:
    1,947
    Location:
    Finland
    Are they though? Do we actually know anything else about Vega 20 than 4096bit HBM2 memory and 7nm manufacturing process?
     
  3. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    227
    Likes Received:
    99
I hope AMD makes a tech demo and shows the performance of Vega with all its features enabled.
     
  4. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
I don't know if this would tell the public (or us) much, as it would certainly only compare Vega against itself (meaning performance comparisons against other hardware would be really hard to judge).
     
  5. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
Simply ramping clocks on an improved process isn't what I'd consider an architectural improvement. Routing adjustments perhaps, but those could just as easily apply to 14nm. For compute and professional use, HBCC and RPM would be the big gains; for graphics, DSBR, DX12.1 features, primitive shaders, etc. would make a difference. Initial benchmarks did have Vega looking like an overclocked Fiji, but once the new features are used there is an added benefit.

    I'm not sure we can tell yet, but a simple change to 7nm alone shouldn't change the architecture. Unless an existing feature is somehow fixed or quad packed math is a thing, there doesn't seem to be much added. Even bandwidth with 4 stacks may be in proportion to clock gains from 7nm. Yes that's ignoring any efficiency hit from FP64 that was added. If there are more features coming with Vega 20, nobody has heard anything. Even the old roadmap leaks didn't show anything. I'd think there would have been some rumors or hints at what is coming. Changing cache sizes is all that really comes to mind.
     
  6. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    523
    Likes Received:
    240
    xGMI?
     
  7. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    227
    Likes Received:
    99
Or they should implement it in at least one game. They invested so much time in WattMan and Link; they should invest more in the NGG fast path.

People buy GPUs because they look good in benchmarks, not because the driver GUI is nice.
     
  8. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    523
    Likes Received:
    240
    AMD is morbidly silent about it for a good reason.
     
  9. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    227
    Likes Received:
    99
There is only one reason: the hardware is broken.

The patents are already public, so AMD should have no fear of talking about it. And if it is a software issue, AMD could use the open-source community to help them get the driver working.
     
  10. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    523
    Likes Received:
    240
    What do you mean by "open source community"?
     
  11. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
I agree, but that is different from the original point, which comes back to compute and Vega 20: whether 64 CUs matter and how efficient GCN is. Both are fine for compute/HPC-type applications, or even crypto mining on the consumer side, and clocks are part of that discussion.
The points you raise are definitely valid, but they mix the HPC segment with others. Vega 20 is primarily a compute/HPC card, and the functions and features you mention fit more readily with other segments (see below).

The context here is compute and theoretical TFLOPs spec; for that you need core clocks and cores/CUs. As I said, the real world also demands other aspects, such as the memory already touched upon.
RPM does not help with scientific FP32 or FP64, though it is nice for mixed precision. HBCC (beyond the unified memory between HBM and system memory) still needs to be seen operating with a real-world HPC application on scaled-up/out nodes; until then it is what one expects of a modern accelerator with unified memory.

For future generations in the HPC/compute segment, AMD will probably have to implement some of those patents you and others have found that focus either on scaling CUs or on changes at the vector-unit/ALU level.
It is also fair to say Nvidia can only scale their own current architecture so far (although it is scaling surprisingly well in some ways) before hitting contention/bottlenecks; Arun's tool may have shown some of this with V100 when looking purely at the geometry-engine ratio, though that is more of an issue for the GeForce than the Quadro/Tesla segment.
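To illustrate the 2xFP16 point: RPM (rapid packed math) holds two FP16 values in each 32-bit register lane and operates on both halves with one packed instruction (e.g. v_pk_add_f16), doubling FP16 throughput while leaving FP32/FP64 untouched. A rough sketch of the packing itself, using NumPy bit views (the helper names are mine):

```python
import numpy as np

# Rapid Packed Math: each 32-bit VGPR lane holds two FP16 values, and a
# packed instruction (e.g. v_pk_add_f16) operates on both halves per clock.
# That doubles FP16 throughput, but does nothing for FP32 or FP64 work.

def pack_fp16_pair(lo: np.float16, hi: np.float16) -> np.uint32:
    """Pack two FP16 values into one 32-bit register image."""
    return np.uint32(lo.view(np.uint16)) | (np.uint32(hi.view(np.uint16)) << 16)

def unpack_fp16_pair(word: np.uint32):
    """Recover the two FP16 halves from a 32-bit register image."""
    lo = np.uint16(word & 0xFFFF).view(np.float16)
    hi = np.uint16(word >> 16).view(np.float16)
    return lo, hi

w = pack_fp16_pair(np.float16(1.5), np.float16(-2.0))
lo, hi = unpack_fp16_pair(w)
print(lo, hi)  # 1.5 -2.0
```

The doubling only applies when the data genuinely fits in FP16, which is why it helps mixed-precision deep learning but not FP32/FP64 scientific code.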
     
    Lightman likes this.
  12. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    227
    Likes Received:
    99
They should post code and information on GitHub and hand it over to the Linux open-source driver team.

Maybe with a reward: $10,000 for a working implementation.
     
  13. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    523
    Likes Received:
    240
The amdgpu team is (almost?) all paid devs working at AMD.
     
  14. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    227
    Likes Received:
    99
There are two Linux drivers out there. One is from AMD and one is from the community.

    The reward is of course for the community ;)
     
  15. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Interesting, but more or less a fat PCIe connection. IMHO it won't really change the architecture, but accelerate certain tasks with a lot of traffic going off chip. The exception would be if it was enabling some sort of MCM interconnection like Epyc.

    Not necessarily broken, but just not worth the effort given a more versatile solution. With compute based culling the implicit paths may perform worse. The explicit paths probably better, but that requires Vega specific programming that in the current mining environment I'm not sure makes sense. No devs, or even AMD for that matter, would see a point in CURRENTLY developing it for the existing Vega marketshare. That's in addition to or replacing the compute based culling implementation that should run on most if not all current hardware.

    RPM and Tensor Cores are somewhat analogous, so more than just mixed precision. HBCC is probably working on some HPC workloads. Specifically the oil and gas or astronomy guys with extremely parallel workloads on large datasets. The oil guys were the ones requesting the feature in the first place as I recall. They were also likely using a SAN for storage and we haven't seen any relevant cards publicly.

I'd imagine there will be some changes to the CUs in the future. Personally I still favor cascaded DSPs to avoid heavy VGPR traffic and essentially create deeper pipelines where practical, forming a systolic network for macro blocks. That design would be a step beyond what Nvidia is doing with register file caching.

    That being said, I'm not sure AMD needs to worry about scaling past 64 CUs as process tech isn't getting much smaller and their goal was multiple chips working together. The 64 CU limit is a nice square number from a parallel hardware perspective. The solution would be to make extremely efficient, mobile-like cores with an infinitely fast interconnect between them. Some of the upcoming differential or modulated signaling implementations may allow that. It's always possible Vega 12 has xGMI and is doing something similar. The current roadmap is mobile or 7nm compute parts so it would seem plausible. Leave Navi for scaling beyond two(?) chips efficiently.

    AMD actively maintains both, although the community one does have additional devs working on it. The problem lies in code that can't be freely distributed or readily meet kernel standards.
     
  16. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    227
    Likes Received:
    99
You use the NGG fast path to reduce the load on the shaders, and then you do culling on the shaders instead? That makes no sense.

I think the NGG fast path works fine; see the example in the whitepaper, which was a real-world result. It is just hard to use in games, because every game has its own LOD system.
     
  17. Dayman1225

    Newcomer

    Joined:
    Sep 9, 2017
    Messages:
    57
    Likes Received:
    78
An example from the labs in a whitepaper isn't really a "real-world result", IMO.
     
  18. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    227
    Likes Received:
    99
By "real world" I mean there was a program and a driver with which you could test it on real hardware; it was not a calculation on paper.
     
  19. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
AMD has shown HBCC, with all its functions, as a working demo/proof of concept (not quite the right word, as it is more than that) on a single GPU rather than on a scaled-up/out accelerator node. So far in this segment it is a traditional modern unified-memory accelerator solution, which is good, but it is best to temper expectations, as one should have done when AMD showed all of Vega's features pre-launch. An HBCC storage cache would be incredibly complex (with overheads) in a scaled-up/out HPC solution; look at the new HPC cache/storage solutions built around CCIX or IBM's next generation of CAPI (not just the interconnect, but where it is positioned, the control flow, etc.).
RPM is mixed precision; where are you getting the idea it can do accelerated tensor/GEMM mathematics beyond 2xFP16?
If it could, AMD would have shown it in a demo with current Vega. I am also not sure it would align with RPM, as it would ideally need specific architectural functions/instructions to co-exist with the current design.

    Edit:
    I went back to the HotChips presentation, and they say "Optimized GEMM for Deep learning".
     
    #5219 CSI PC, Apr 8, 2018
    Last edited: Apr 8, 2018
  20. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
NGG wouldn't necessarily reduce the load on the shaders; it addresses the 4 tri/clock bottleneck in the front end by removing culled triangles. That can happen with primitive shaders or with a compute shader culling and compacting the stream. Technically they could be chained together, but that would be a bit redundant. Ignoring, for the sake of argument, that both shader types run on the shader array, neither a PS nor a CS would reduce the shader load unless the developer supplied some interesting culling capability: some culling operation, hinted by the developer, that the default path couldn't pick up on or implement due to API guarantees. Typically GCN has had difficulty using the shader array because of an inability to feed geometry through the front end. Async works around this bottleneck, and in this case the culling shaders are scheduled during the previous frame.
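For a concrete picture of the "culling and compacting the stream" idea: the compute pass tests each triangle (backfacing or zero-area in screen space) and writes only the survivors into a dense stream, so the fixed-function front end never spends its 4 tri/clock budget on them. A toy NumPy sketch; real implementations do this in a compute shader, and the function name is mine:

```python
import numpy as np

# Sketch of compute-based triangle culling: test each triangle, then compact
# the survivors into a dense index stream so the fixed-function front end
# never sees the culled ones. Purely illustrative.

def backface_and_degenerate_cull(tri_verts: np.ndarray) -> np.ndarray:
    """tri_verts: (N, 3, 2) screen-space triangle vertices.
    Returns indices of triangles that survive culling."""
    a, b, c = tri_verts[:, 0], tri_verts[:, 1], tri_verts[:, 2]
    # Signed area: negative means backfacing, zero means degenerate.
    area = (b[:, 0] - a[:, 0]) * (c[:, 1] - a[:, 1]) \
         - (b[:, 1] - a[:, 1]) * (c[:, 0] - a[:, 0])
    return np.flatnonzero(area > 0)

tris = np.array([
    [[0, 0], [1, 0], [0, 1]],   # counter-clockwise: kept
    [[0, 0], [0, 1], [1, 0]],   # clockwise (backfacing): culled
    [[0, 0], [1, 1], [2, 2]],   # degenerate (zero area): culled
], dtype=np.float32)
survivors = backface_and_degenerate_cull(tris)
print(survivors)  # [0]
```

Running the test and compaction asynchronously, a frame ahead of the front end, is what lets the culling cost hide in otherwise idle shader cycles.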

Complexity depends on how they use it. For certain HPC tasks with large read-only datasets, most of that overhead would be nonexistent. That should be the case for the oil and gas guys, with the implementation somewhat proprietary. Large-scale rendering or raytracing could be similar: multiple GPUs each working a subset of screen space, with HBCC caching the pages that get hit. As each GPU would be completely independent, the control-flow issues go away. CCIX and CAPI are interesting for a certain class of problems, but shouldn't be necessary for an SSG-type problem with a SAN. Once the GPUs have to start synchronizing it gets more difficult, but automated paging makes that far easier; it is no different from CPU programming, where data pages automatically. That would be familiar to many researchers with limited programming ability. It becomes a question of efficiency, and any gaps are filled with other work thanks to async compute where practical. So long as all the jobs don't stall simultaneously, the chips should stay near peak performance; if that does occur, the implementation will be problematic on any hardware.
     
    Grall likes this.
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.