AMD Vega Hardware Reviews

Discussion in 'Architecture and Products' started by ArkeoTP, Jun 30, 2017.

  1. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    Do FMACs dream of shaded sheep?

    edit: sheep, not sheeps, right? *facepalm*
     
    #1201 CarstenS, Aug 26, 2017
    Last edited: Aug 28, 2017
  2. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    862
    Likes Received:
    263
    The sad news is that the only chance to get this into user-land asap appears to be to build your own ecosystem like Nvidia did, except Nvidia's motives weren't exactly about pushing the pipeline paradigm, I suppose.

    Understatement much. :D

    I don't think it's a "possibility" problem. It's a "performance" problem. You compete with your own older architectures and with the competition's architectures. You've fallen into a local minimum and there is no way out without making your product less competitive, unless you manage to jump from one local minimum to another that's further away but better.
    We programmers are (I hope) well aware of local-minimum problems; they're everywhere, in literally every scope of code and hardware (and life :)), because finding global minima is computationally hard. There is ... a huge acceptance problem around this in culture, business and marketing: everything needs to be the best or die, and nothing is ever allowed to get worse.
    Anyway, you can't optimize the architecture for one paradigm and be optimal under another as well. The tradeoffs made to make GCN1 competitive under e.g. a DX11-style API make it non-competitive under a "Primitive Shader" API. That's the effect of the optimization process I mentioned before, not of some feature inhibitor somewhere in the chip.

    I wonder if memoryless buses are such a good thing. Imagine the data path had storage capacity by design, in the specs, in the API ... and you could do something with it, specifically for the fixed-function stuff. It would be another tier accompanying registers and caches in the traditional memory hierarchy, and it would certainly make some things easier and more robust. I'm not a hardware designer; maybe there is no such sweet spot and it's futile to think about. :)

    Always. But you don't want the deadlock; you have to break out, one way or the other.
     
  3. w0lfram

    Newcomer

    Joined:
    Aug 7, 2017
    Messages:
    159
    Likes Received:
    33
    The gaming market right now is the consoles. Unfortunately, PC gamers have just received "console ports" over the last 5+ years.

    Yes, many game companies have stuck with the PC while developing their engines for consoles. But that was the past; EVERYTHING is unified now and will be even more so under 4K. The Xbox One X is essentially a high-end HTPC for the household, but it uses x86, 64-bit Windows, and an AMD APU.


    Vega is forward-thinking and built for a new era of gaming compute. Volta is coming and will try to offer all the stuff Vega does, because modern game engines are already on board with AMD. Even Volta's slides seem lacking...
     
  4. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    It was apparent that graphics APIs were going in the wrong direction when a single new DX11 feature required two new hard-coded programmable shader stages (hull and domain shaders) and one new fixed-function stage (tessellator). Apparently nobody learned from the obvious failure of geometry shaders. You can't simply design graphics APIs around random use cases extracted from offline/production rendering. I have waited for configurable shader stages (with on-chip storage to pass data between them) since DX10. All we have got is some IHV-specific hacks around the common cases that aren't exposed to developers. But I don't have high hopes of getting fully configurable shader stages anytime soon, since the common shading languages (HLSL, GLSL) haven't improved much either. HLSL & GLSL design is still based on the DX9 era. Good for 1:1 input:output (pixel and vertex shaders), but painful for anything else.
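    A minimal CUDA sketch of that 1:1 versus amplification point, in compute terms (everything here is invented for illustration, not from the post or any particular API): the 1:1 stage knows its output slot from its thread index, while the geometry-shader-like stage has a data-dependent output count and must allocate slots dynamically.

    ```cuda
    #include <cuda_runtime.h>

    struct Vtx { float x, y, z; };

    // 1:1, vertex-shader-like: the output slot is implied by the thread index.
    __global__ void transform_1to1(const Vtx* in, Vtx* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = Vtx{ in[i].x * 2.0f, in[i].y * 2.0f, in[i].z };
    }

    // 1:N, geometry-shader-like: each thread emits 0..4 vertices, so slots
    // must be reserved dynamically and primitive order is lost in the process.
    __global__ void amplify_1toN(const Vtx* in, Vtx* out, int* outCount, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int emit = i % 5;                      // data-dependent output count
        int base = atomicAdd(outCount, emit);  // reserve space (unordered)
        for (int k = 0; k < emit; ++k)
            out[base + k] = in[i];
    }
    ```

    The second kernel is where the ordering, buffering and load-balancing headaches start, which is roughly why geometry shaders never performed well and why fully configurable stages would need on-chip storage to pass data between them.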
     
    #1204 sebbbi, Aug 28, 2017
    Last edited: Aug 28, 2017
    T1beriu, Alexko, chris1515 and 2 others like this.
  5. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    [moving to pricing]
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I think the problem with software-defined pipelines is that on-die buffering will be rapidly exhausted, especially if there's no bound on the count of stages in the pipeline. Which suggests that caching is the only way to do arbitrary buffering for these pipelines.

    So then we loop back round to the subject of cache-line locking and other techniques(?) to make a cache behave in a manner that makes software defined pipelines viable.

    Buffer usage lies at the heart of load-balancing algorithms, so a configurable pipeline requires tight control over load-balancing as well as buffer apportionment.

    As a compromise, one might argue that a mini-pipeline (only two kernels, one intermediate buffer, and off-die output for the pipeline results) might be a useful first API improvement. One could also argue that such a mini-pipeline could produce output conforming to the later stages of the conventional graphics pipeline, and therefore wouldn't need to write its results off-die at all. That sounds like "primitive shader" or "surface shader", doesn't it?
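    A hedged CUDA sketch of what that mini-pipeline compromise looks like with today's tools, assuming the two kernels are fused into one launch and the single intermediate buffer lives in shared memory, the closest user-visible analogue of on-die storage (all names invented):

    ```cuda
    #include <cuda_runtime.h>

    constexpr int TILE = 256;

    __device__ float stageA(float v) { return v * v; }     // "first kernel": produce
    __device__ float stageB(float v) { return v + 1.0f; }  // "second kernel": consume

    __global__ void miniPipeline(const float* in, float* out, int n)
    {
        __shared__ float staging[TILE];            // the single intermediate buffer,
        int i = blockIdx.x * TILE + threadIdx.x;   // held in on-chip shared memory

        if (i < n) staging[threadIdx.x] = stageA(in[i]);
        __syncthreads();                           // hand-off between the two stages

        if (i < n) out[i] = stageB(staging[threadIdx.x]);
    }
    ```

    Shared memory caps the intermediate buffer per work-group and allows no amplification between the stages, which is exactly the buffer-exhaustion and load-balancing problem raised above for anything deeper or more general.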

    I still think we could have a viable Larrabee-like graphics processor now. The available compute in these latest GPUs is barely growing (50% growth in two years from AMD - that is actually horrifying to me), which I think proves that we're truly in the era of smarter graphics (software defined on-chip pipelines) rather than the nonsense of wrestling with fixed-function TS, RS and OM (what else?). All the transistors spent on anything but cache and ALUs just seem wasted most of the time in modern rendering algorithms.
     
    DavidGraham likes this.
  7. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,815
    Likes Received:
    2,637
  8. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    13,320
    Likes Received:
    3,809
    Fury X better than Vega64 in some VR titles ... what?
     
  9. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    From what I see in that table, only with reduced details (the 2-line entries), except for The Unspoken, where Vega performs so abysmally that it has to be a driver bug.
     
  10. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,299
    Likes Received:
    249
    I think some launch review stated that the current drivers lack VR optimisations.
     
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,135
    Likes Received:
    2,935
    Location:
    Well within 3d
    There are higher-level concepts, although whether they would be specific forms of complex shaders or overall renderer algorithms that become more practical with them isn't clear to me. The exception would be the surface shader, which AMD gives as a name for a position in the pipeline matching the merged LS-HS stages. Until there's more information, I am wary of reading too much into what may be more aspirational or version-2.0 possibilities. The stages merged are consistent with making the output of the more general vertex stage and its variants more general, but it looks like stages remain separated at junctions with fixed-function hardware and/or changes in item count, be it amplification or decimation.

    The programmable stages do not appear able to fully encapsulate the tessellation hardware or the geometry shader stage without at least some recognition of the discontinuity, although it seems the GS stage is closer to being generalized, except for the VS pass-through stage.
    The VS to PS transition point is an area of fixed-function and amplification, potentially covering the post-batch path of the DSBR through scan-conversion and through the wavefront initialization done by the SPI.

    A hypothetical splitting of parameter cache writes from the other primitive shader functions, such that only writes of parameters feeding visible pixels occur, would require straddling the VS to DSBR to PS path, which has some interestingly complex behaviors to traverse.


    I think it also relies on the GPU to implicitly maintain semantics, ordering, and tighter consistency than the more programmable shader array provides. The graphics domain's tracking of internal pipelines, protection, and higher-level behavior provides a certain amount of overarching state that the programs are often only dimly aware of, like the hierarchical culling and compression methods. It's not entirely alien to some of the meta-state handled for speculative reasons in CPU pipelines (branch prediction, stack management, prefetching, etc.), or for their respective forms of context switching and kernel management.

    The idea of having side-band hardware functions is also not unique to the GPU domain and has some ongoing adoption for non-legacy reasons. Compression is making its way into CPUs, where it saves on some critical resource through autonomous, opportunistic meta-work, without injecting that work into the instruction stream that benefits from it.

    It might not quite be there yet, since load-balancing pain points seem to be part of where the old divisions remain, and the VS to PS division persists for some reason despite them being ostensibly similarly programmable.

    There are implicit elements in current recommendations, like the amount of ALU work between position and parameter exports, in an attempt to juggle internal cache thrashing versus pipeline starvation. It's part of why I'm leery of exposing byzantine internal details to software. Low-level details bleeding into higher levels of abstraction tend to obligate future revisions to honor them, or to force developers to juggle variations. It's an area where having abstractions, or a smarter driver, would stop today's hardware from strangling generation N+1.

    It would be interesting to see how some of the initial conditions would be revisited. The default rasterization scheme was willing to accept some amount of the front-end work being serialized to a core, a constraint today's GPUs have at various points relaxed by becoming more parallel than they were.

    Line locking seems to come up as an elegant solution to specific workloads, but much mightier coherent system architectures have looked at this, and designers consistently shoot it down for behaviors more complex than briefly holding a cache line for atomics and the like. There are implicit costs to impinging on a coherent space that have stymied such measures for longer than GPUs have been a concept.

    More advanced synchronization proposals tend towards registering information with some form of arbiter or specialized context like transactions or synchronization monitors.
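    To make the contrast concrete, a hedged CUDA illustration (all names invented) of the difference between briefly owning a line for a single atomic and holding a software lock across an arbitrary critical section, which is closer to what general line locking would enable:

    ```cuda
    #include <cuda_runtime.h>

    __device__ int g_counter = 0;
    __device__ int g_lock = 0;          // 0 = free, 1 = held
    __device__ int g_protected = 0;

    __global__ void briefOwnership()
    {
        // (a) Ownership of the line ends when the read-modify-write retires.
        atomicAdd(&g_counter, 1);
    }

    __global__ void persistentOwnership()
    {
        // (b) One contender per block to avoid intra-warp livelock on pre-Volta
        // parts; a "held" state persists across an arbitrary critical section.
        if (threadIdx.x == 0) {
            while (atomicCAS(&g_lock, 0, 1) != 0) { /* spin */ }
            g_protected += 1;           // arbitrary work while "owning" the data
            __threadfence();
            atomicExch(&g_lock, 0);     // release
        }
    }
    ```

    Pattern (b) is also easy to get wrong on a GPU (intra-warp livelock, forward-progress guarantees), which is one more reason exposing general line locking to software is a hard sell.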


    Increasing programmability does not lessen the dependence on compute, which apparently has slowed for AMD during its laurel-resting period. It's also potentially not a straightforward comparison between Larrabee's x86 cores and GCN. The memory hierarchies remain quite different, and the threading models differ. GCN's domain-specific hardware still performs functions that the CUs are not tasked with but that Larrabee's cores can/must handle.

    Without actual implementations, we wouldn't know how many truths of the time would hold up now.
    It's not even necessarily just the ALUs and caches to worry about, with the increasing networks of DVFS sensors, data fabrics, interconnects, and all the transistors AMD said it spent on wire delay and clock driving. Sensors, controllers, offload engines, and wires seem to be showing up more.
     
    pharma likes this.
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Line locking was part of the Larrabee approach, if I remember right. It's worth bearing in mind that spilling off die is basically a good way to lose GPU acceleration entirely.

    Yes, for the first time NVidia was ahead.

    Larrabee was approximately within a factor of 2 back when it could have happened.

    Now GPUs are implementing algorithms in hardware (some kind of tile-binned rasterisation, which was explicitly part of what Larrabee did in its rasterisation) and using data-dependent techniques such as delta colour compression because the API is such a lumbering dinosaur and developers still don't have the liberty to do the optimisations that they want to do. Unless they're on console, in which case they have slightly more flexibility.

    I don't see how any of that stuff would be beyond the wit of Intel.
     
  13. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,972
    Likes Received:
    1,656
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,135
    Likes Received:
    2,935
    Location:
    Well within 3d
    I saw speculation for it, but I did not find a definitive instruction for locking a cache line. I only found certain references like a prefetch instruction that would set a fetched line to exclusive. If that's the equivalent of locking, it would have been a significant naming collision with the usual meaning of the word for cache lines.

    It does not look like compression is something that exists because of the API. Putting compression functionality as specialized hardware at specific points in the memory hierarchy relieves the executing code of needing compression/decompression routines embedded in the most optimized routines. An opportunistic method can reduce complexity and power consumption by not engaging a compressor outside of less common events like cache line writeback, which is not something another thread within a core or a different fully-featured core can intercept.
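    A toy CUDA sketch of that placement argument, assuming an invented base-plus-delta scheme (not real DCC): the hot path touches only raw values in shared memory, and the "compressor" runs solely at the writeback to global memory.

    ```cuda
    #include <cuda_runtime.h>
    #include <cstdint>

    constexpr int TILE = 256;

    __global__ void accumulateAndCompress(const uint32_t* in, uint32_t* outBase,
                                          uint16_t* outDelta, int n)
    {
        __shared__ uint32_t tile[TILE];
        int i = blockIdx.x * TILE + threadIdx.x;

        // Hot path: plain loads/stores, no compressor in sight.
        tile[threadIdx.x] = (i < n) ? in[i] + 1u : 0u;
        __syncthreads();

        // "Writeback": emit the tile as a base value plus 16-bit deltas.
        // (Toy scheme; a real compressor is opportunistic and falls back
        // when the deltas don't fit, which is omitted here.)
        __shared__ uint32_t base;
        if (threadIdx.x == 0) {
            base = tile[0];
            for (int k = 1; k < TILE; ++k) base = min(base, tile[k]);
            outBase[blockIdx.x] = base;
        }
        __syncthreads();
        if (i < n) outDelta[i] = static_cast<uint16_t>(tile[threadIdx.x] - base);
    }
    ```

    The scheme itself is beside the point; what matters is where the compressor sits, outside the inner loop, at the infrequent writeback event.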

    I would expect Intel could have good implementations of those techniques, since it made note of its enhancements for duty cycling and superior integration with the L3 hierarchy in its consumer chips. I was noting that the old metrics used to judge something like RV770 and Larrabee have been joined by a raft of other considerations in the intervening time period. Xeon Phi doesn't maintain Larrabee's graphics or consumer focus, which makes it an unclear indicator of what Larrabee could have been if it hadn't been frozen in time by its cancellation.
     
  15. Rodéric

    Rodéric a.k.a. Ingenu
    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,991
    Likes Received:
    850
    Location:
    Planet Earth.
    Come on, give AMD a little time to tweak the Vega drivers; it's a new architecture, even if it's based on an existing one.
     
  16. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    What do you think: how much longer do AMD's driver programmers need? Back in December, they already had fallback mode up and running with what appears to be near-final Vulkan performance in Doom.
     
  17. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    T1beriu and Lightman like this.
  18. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,972
    Likes Received:
    1,656
  19. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    I was not involved in that game test; I only had to hand over the cards.
    "Test measurements with Radeon Software 17.8.2 and the GeForce 385.41 drivers, however, showed no performance improvements whatsoever over the earlier drivers - neither CPU- nor GPU-limited."

    This, however, tells you that my colleague cross-checked with newer drivers and there were no performance changes. I guess he simply did not want to lie about the driver version used - maybe that was too honest. :)
     
  20. leoneazzurro

    Regular

    Joined:
    Nov 3, 2005
    Messages:
    518
    Likes Received:
    25
    Location:
    Rome, Italy
    And a disaster at resolutions above that... it makes no sense at all; the drivers are probably still very buggy (either the 1080 results are skewed because of incorrect rendering, or the higher resolutions are bugged, or both, though in that case there should be a mention of incorrect rendering in the review).
     