AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Discussion in 'Architecture and Products' started by Deleted member 13524, Sep 20, 2016.

  1. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
Remember I quoted Ryan at PCPer, who stated it can be wrapped into the driver package and that would in theory provide a performance boost; this came in the section where he seems to have been given additional information by AMD.
I appreciate you do not put much weight on what he said due to a different opinion on its context, but he was also absolutely correct when he stated that to fully utilise it, it must be accessed via an API/coding/SDK, and this is backed up in the interview Razor linked, as Scott mentions it there.
The 11 polygons, listening to Scott, is a theoretical maximum when/if using 4 Geometry Engines, because he goes on to say "realistically"; it rather reminds me of Async Compute and the difference between theoretical maximum and real-world results.
He did say "I think that is an example and not a specific product detail about the number of Geometry Engines in a particular chip". That could be construed to mean either that there are multiple designs (one being the traditional 4 Geometry Engines) with the information still to come out, or that all the various chips have more than 4 Geometry Engines.
I think we will see multiple types: a standard version still with 4 Geometry Engines, as the first one is meant/rumoured to be 4096 cores with the same CU count (known from the released FP32/clock info). If you can increase the Geometry Engines (and associated functions), then it makes sense to also increase the CU count and all that entails, since the limitation has been removed.

    Cheers
     
    #861 CSI PC, Jan 18, 2017
    Last edited: Jan 18, 2017
    pharma and Razor1 like this.
  2. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
Still sounded like Scott to me, just with audio peak-limiter issues *shrug*.
Yeah, makes sense, but for AMD this is a benefit separate from aspects of the Primitive Shader, and draw-stream binning may help anyway to some extent in various games.
Also worth noting Scott clarifies the massive polygon count in the Deus Ex example: per frame it is actually a fraction of that, since the quoted number also includes what is around the visible scene.
    Cheers
     
    #862 CSI PC, Jan 18, 2017
    Last edited: Jan 18, 2017
  3. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
You really meant that as troll bait, didn't you? That article is as vague as can be while adding no significant information beyond what's directly contained in the slides.
     
    Razor1, Lightman and DavidGraham like this.
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    My interpretation of that interview was that the 11 was not being derived in the same way as the Fiji count, with the latter being a literal physical maximum of the hardware. If there's a modifier of "achievable" with some kind of test program, that would leave room for a nicer round number in the hardware, like 12 (or 16 if it's "realistic" and 16-wide SIMD utilization penalties were on the order of 20-30% back in the day of Larrabee).

    That article seems to be using some acronyms and initialisms in ways I haven't seen before. There are some potential tidbits (graphics pipeline has an equal wave dispatch throughput as the HWS/ACEs?, extended BAR and mapped memory accesses), but I don't know how much I can run with them with the surrounding information being rather contrary to my understanding of the terminology, architecture, and the general interpretation of the architecture by all the other articles and interviews I've seen (NCU is an ACE?, Geometry Command Processor?).

One potential item to contemplate is the nature of the allegedly single, big Intelligent Workload Distributor. Its ability to load-balance across primitives and calls has been discussed before, although without more detail on the hardware's mapping I am curious what is being balanced versus being re-ordered.
One possibility: if a geometry processor running a primitive shader (or more traditional geometry-related shaders) finds that execution is starting to move or spawn vertices outside of the local raster tile, an earlier-speculated feedback loop from the geometry engine to the IWD could take the split point and shader information and spawn an instance of it (which would filter out the geometry it doesn't need, concurrently or later), rather than trying to route triangles from one overworked shader engine to the others.
A banked IWD could provide a queue/buffer per tile range, and the outcome of scheduling work across calls or primitives would fall out of it without centralizing decision-making in one monolithic unit, which I would see as a potential scalability barrier.

If the shader engine and raster/RBE allocation are less static, then perhaps a more traditional load-balancing could occur if the heuristics can weigh the cost of that kind of distribution versus soldiering through locally. Compute may be able to benefit if the lanes in the IWD can serve as work queues that can be checked or stolen from.
     
    CarstenS and Lightman like this.
  5. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
Yes, because of how the operation/function probably works with the Primitive Shader rather than just HW scaling; that is why I brought Async Compute into this as another example, including its theoretical maximum benefit compared to reality, which also depends a lot on the developers and the API.
There is no nice round number relating to the hardware-operation structure that, IMO, logically fits with what you see with Fiji.
From an absolute performance perspective, in theory it would be up to 2.75 times more powerful if everything aligned, with perfect programming/API and a perfect scene, going by the 11-polygon footnote versus a 4 Geometry Engine solution. Yes, it will never happen, just as you will never see the absolute theoretical maximum performance gain from Async Compute.
Even "over 2x" is probably stretching it for many PC games when they talk about the various Vega-related functions *shrug*
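For what it's worth, the arithmetic behind that up-to-2.75x figure is just the ratio of the two per-clock peaks (a quick sketch; the 11 and 4 polygons/clock values are the numbers being discussed above, not anything beyond the footnote):

```python
# Theoretical per-clock geometry scaling from the thread's numbers:
# Fiji's 4 geometry engines at 1 triangle/clock each, versus the
# "up to 11 polygons per clock" Vega footnote.
fiji_polys_per_clock = 4      # 4 geometry engines x 1 polygon/clock
vega_polys_per_clock = 11     # "up to" figure from AMD's footnote

speedup = vega_polys_per_clock / fiji_polys_per_clock
print(f"theoretical per-clock geometry speedup: {speedup:.2f}x")  # 2.75x
```

Which is exactly why "up to" is doing so much work in that footnote: it is a ratio of peaks, not a performance claim.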

Anyway, remember what that footnote relating to the up to 11 polygons also said.
This suggests to me the source goes beyond just the marketing team, especially as it was also included within the preview slides.
    Cheers
     
  6. revan

    Newcomer

    Joined:
    Nov 9, 2007
    Messages:
    55
    Likes Received:
    18
    Location:
    look in the sunrise ..will find me
Well, I know the man has a bad mouth, but I thought he might have some inside info... When information is sparse (and even "sparse" may be an overstatement where Vega is concerned), you have to look for bits of information even in foul places like Fudzilla or SemiAccurate (sometimes water lilies grow out of the mud, as the saying goes). But if nothing of value comes from there, so be it; we can go on with what we have right now... which is not much, truth be told.
     
    CarstenS likes this.
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Why not add up the throughput of the physical paths in 4 supposedly parallel engines?

    Fiji's number is the theoretical max of its physical hardware. Any discussion of having an ideal scene or running at 11 for Vega should have made Fiji's listing be 2.x-3.x, and AMD has never shied away from giving the physical ALU throughput of the whole GPU when talking about compute. Every architecture falls short of theoretical peak, so that footnote wouldn't be treating the two equivalently unless there's something physically different with one of the pipelines or there's something architecturally in the way of using the full width of 4 independent units.

    The value in an "up to" is generally the peak. It suggests to me that the footnote was mangled, regardless of source.
     
  8. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
I think a lot of us may not be disagreeing much, just coming from different points of view.
Both are theoretical maxima in the context of polygons/clock; it's just that Vega has greater flexibility that goes beyond HW scaling of the Geometry Engine. You still end up with up to 2.75x if those engineers are comfortable with the maximum potential theory of both.
Remember Async Compute and AMD saying it could improve performance by up to 46% in DX12?
At least that was with a publicly presented shader demo they provided.
Yet in reality the benefit is a fraction of that, when the greatest gain in this regard for Doom under Vulkan (as an example) came from the API extensions rather than Async Compute.
What was the theoretical maximum performance gain of Async Compute on Fiji compared to GCN1, where it is currently disabled, and what is the real benefit?
The point is, I do not think one can fit the Primitive Shader into the same template as Fiji and the traditional use of the 4 geometry engines; only the end results and the theoretical versus real-world numbers can be compared.
For real-world figures (well, as much as a focused engineering demo gives; it would be much higher than games, being closer to the ideal theory), we have not yet been presented with a new shader demo like we saw with Async Compute.
    Cheers
     
    #868 CSI PC, Jan 18, 2017
    Last edited: Jan 18, 2017
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
This is rearranging AMD's footnote, which might have the effect of correcting it if it is inaccurate, but that isn't how the footnote is written. The 2.5x was given without the modifier; it was the 11 polygons part that was "up to".
Pairing a statement of a design's physical characteristic on one side with a derived or measured value on the other only has a few consistent interpretations, and they don't really make much sense. The 2.75x factor without caveats actually makes it harder to find a coherent interpretation, since one half of the comparison is an unwavering constant.
    That then leaves the question of what is physically preventing fully consistent handling either on a per-clock basis (units cannot behave the same on every cycle) or on a parallel pipeline basis (one or more units cannot function the same as the others).
    Alternately, and possibly more likely, there isn't a sensible way to interpret combining incompatible measurements.

    That's the difference between a performance claim for some set of software workloads, and a statement of an architectural characteristic. AMD didn't claim that AC made the GPU capable of more FMA operations per clock.

    Even if there is a non-traditional component to the throughput in Vega due to the primitive shader or some other feature, there's still a finite set of inputs or paths in the hardware to process it, and those have a finite amount of output.
    If this cannot be calculated, then it means something else is being misapplied or misworded, like the "polygon" one architecture is processing is not the same as the other.
     
    pTmdfx likes this.
  10. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
I think we can agree it is a fudge until they present a new engineering shader demo like they did for Async Compute. It still will not necessarily reflect real-world situations/games, but it would be closer to the theoretical maximum with a more valid number; the same can be said for the draw-stream binning rasterizer.
    Cheers
     
    #870 CSI PC, Jan 19, 2017
    Last edited: Jan 19, 2017
  11. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
Just optimized my console game (GCN2). Got a 15% gain from async compute + barrier refactoring (two weeks' work). Should still see another 10% when everything is fully overlapped. I'd say 25% is easily achievable in modern game code. If the engine architecture was designed around async compute from the start, you could even double your performance in heavily geometry-bound cases (100% geometry pipeline usage during the whole frame; the next frame's lighting and post-processing overlapped via async compute).

    Theoretical maximum gain of async compute... That is a silly question:
    Worst case shader for graphics pipe is a single wave wide fully serial shader (with various stalls). Occupies one wave (out of max 40) of a single CU (of 64). 0.03% GPU wide occupancy. It could theoretically saturate one SIMD (if no memory ops) = 0.3% GPU utilization. If you run any async compute simultaneously, you'll get 100x+ GPU utilization gains. Of course the graphics pipe could be just waiting for fences (waiting to load some huge texture from CPU memory). In this case async compute gives you infinite gains (over zero work done) :)

    Minimum gain is negative. In the worst case async compute can trash caches or increase latency -> stalls. This question is as hard to answer properly as "how much multithreading helps performance?".
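The worst-case numbers above fall out of simple arithmetic (a sketch assuming a Fiji-class GPU: 64 CUs, up to 40 resident waves per CU, 4 SIMDs per CU):

```python
# Worst-case graphics occupancy, per the post above: a single fully
# serial wave resident on a GPU with 64 CUs and 40 wave slots per CU.
cus = 64
waves_per_cu = 40
simds_per_cu = 4

wave_occupancy = 1 / (cus * waves_per_cu)    # fraction of wave slots in use
simd_utilization = 1 / (cus * simds_per_cu)  # one SIMD busy, rest idle

print(f"occupancy: {wave_occupancy:.2%}")    # ~0.04% (quoted above as 0.03%)
print(f"SIMD util: {simd_utilization:.2%}")  # ~0.39% (quoted above as 0.3%)
```

Anything launched on the async queues in that situation is nearly pure gain, which is why the "maximum gain" question has no meaningful single answer.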
     
  12. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
Well, about as silly as the maximum gain from the Primitive Shader or draw-stream binning rasterizer, which go beyond HW scaling of the architecture due to their function/design, and the figures provided by AMD, including the footnote that has been the topic of the last several pages.
Nice results; you will have managed the best results I have read for Async Compute (considering console and other developer comments historically) if you hit 25%. Gears of War 4 is anywhere from 2% to 8%, Doom under Vulkan I thought was around 8% to 12%, AoTS around 10% at best (depends upon resolution), etc.
But fair to say you optimised for console, while Vega so far is a dGPU, like the presentation for Async Compute with Fiji?
That said, still one of the best figures I have heard mentioned, so nice work if you manage to get to 25%.
And do you agree that the figure can fluctuate notably depending upon settings, resolution, and scene for the same game?

Even putting aside the theory of the Async Compute maximum, AMD showed with their specific Async Compute shader demo on PC with Fiji that they managed a 46% performance gain, fitting my point that the theoretical maximum/engineering demo probably gets closest to the ideal, compared with real-world dGPU PC game gains, including your results.
I will be curious to see whether AMD present any new engineering shader demo for Vega regarding the changes; it is needed, because the figures focused on 'dGPU' Vega are meaningless except as theoretical maximums/peaks.
    Thanks
     
    #872 CSI PC, Jan 19, 2017
    Last edited: Jan 19, 2017
    Razor1 and pharma like this.
  13. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    Heinrich4, Lightman, Alexko and 3 others like this.
  14. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
Thanks, and yeah, that is around some of the figures I have read that other devs managed. Even 30% would still be short of what AMD achieved on PC with the engineering demo specific to Async Compute, but that is to be expected; it will be interesting to see how close well-optimised console game development can get to their figure.
This is not a multi-platform release, and that is an interesting point, as I would need to check the trend: is your game multi-platform, exclusive to one console, on both consoles, or on all platforms?
The trend, I feel, will be the same with the Vega functions and game rendering engines: single-console exclusives with the highest gains, console multi-platform next (specifically Microsoft and Sony), and multi-platform consoles-plus-PC last.
Yeah, nothing earth-shattering there, and as one would expect, but food for thought on how such technology benefits play out when talking broadly about a GPU architecture and its performance increase over previous generations.

    Thanks
     
    #874 CSI PC, Jan 19, 2017
    Last edited: Jan 19, 2017
  15. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
I am sure you can get higher gains than AMD's presentation showed.

    Simple example:
    - Use 100% (16.6ms) of your main/graphics pipe time to rasterize (shadow maps and g-buffer). Normally you could use only up to 8 ms for these tasks.
    - Use a very lightweight g-buffer technique: https://forum.beyond3d.com/threads/modern-textureless-deferred-rendering-techniques.57611/. These techniques are almost entirely fixed function (geometry & ROP) bound. Very little ALU/BW/sampler usage.
    - Continue with lighting and post processing in async compute pipeline (overlaps with the next frame). Ensure that lighting + post takes roughly 16.6ms (equal time to rasterization tasks).

I would expect to see up to 60-80% perf gains compared to executing both rasterization and lighting/post sequentially. The downside is that the graphics pipe and compute pipe frame lengths need to be roughly equal to get perfect results, meaning you really need to spend that 16.6 ms rasterizing triangles in every scene. This is hard to achieve, as viewport rendering tends to be the most fluctuating part of scene rendering. When you double your geometry budget (8 ms -> 16 ms), you also increase the fluctuation. Hopefully somebody tries techniques like this and presents the results at GDC/SIGGRAPH. It would be interesting to see how big a jump they can make, and what kind of problems they hit.
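The overlap gain here can be sketched as a two-stage pipeline whose throughput is limited by the longer stage (a toy model under that assumption, not actual engine numbers):

```python
# Pipelined-frame sketch: the graphics pipe rasterizes frame N while
# async compute does lighting/post for frame N-1. Frame throughput is
# set by the longer of the two stages, so the gain over running them
# sequentially depends on how well-balanced they are.
def overlap_gain(t_raster_ms, t_compute_ms):
    sequential = t_raster_ms + t_compute_ms      # both stages back-to-back
    overlapped = max(t_raster_ms, t_compute_ms)  # pipeline bottleneck
    return sequential / overlapped - 1

print(f"{overlap_gain(16.6, 16.6):.0%}")  # perfectly balanced: 100%
print(f"{overlap_gain(16.6, 10.0):.0%}")  # imbalanced: 60%
```

The imbalanced case illustrates why fluctuating viewport cost eats into the ideal 2x: any cycle where one pipe finishes early is lost overlap.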
     
    #875 sebbbi, Jan 20, 2017
    Last edited: Jan 20, 2017
    Heinrich4, Razor1, CSI PC and 2 others like this.
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    http://www.freepatentsonline.com/y2016/0378565.html

    METHOD AND APPARATUS FOR REGULATING PROCESSING CORE LOAD IMBALANCE

    I have no idea if this kind of hybrid of both work-donation and work-stealing has been done before. PS3 game developers appeared to pioneer work-stealing but I don't know if work-donation was a part of those kinds of solutions. I'm not aware of arenas other than game development where work-stealing is commonly used (I presume it is still used) and know nothing about work-donation.

    What's really interesting with some of these diagrams is seeing "work" in L1 and L2 caches.
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Perhaps it's too new to be in Vega? The known introduction of the binning rasterizer was first mooted in 2013, and then continued in August 2016.

There are some elements to this that make it unclear how transparent it is to software, or what workgroups and wavefronts are in this scheme, with wavefronts able to enqueue and dequeue work items after they have been allocated and are executing, and those items able to be pulled into other workgroups or other processors.
It seems there would be challenges if this were implemented in the current way of doing things, where a whole workgroup's contexts are allocated fully in advance, and a different workgroup/wavefront currently wouldn't align itself with a foreign context's particulars or have room to support another context on top of its own.
It seems like there's some kind of indirection in the execution loop (not 1:1 between a workgroup and the code/context it is running?), making thread contexts uniform (no negotiation for LDS, register space, etc. before taking items?) to make that happen.

    The various ways for monitoring and moving work can lean on specialized hardware, with work being able to be donated locally and stolen more globally. Keeping track of the queue pointers could provide a way to know how much work there is, by knowing the current pointer's offset from the queue's base. Fiddling with that value atomically via a specialized path should help maintain integrity, and I think it might allow scoped operations that could be synchronized via CU-level or GDS-level barriers, at a minimum.
    It would seem like some kind of protection would be put in place so that the relevant ranges can be checked by the dedicated hardware without being hit by some external buggy or malicious access.
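As a rough illustration of the hybrid donate/steal queueing being discussed (a toy single-threaded sketch; all names are illustrative rather than from the patent, and real hardware would use atomic head/tail pointer updates on a dedicated path rather than Python data structures):

```python
# Toy model of per-workgroup work queues: an owner "donates" surplus
# items to its own queue and pops from its own end (LIFO, cache-warm),
# while an idle workgroup steals from the opposite end (FIFO) of a
# busier queue. Queue depth (tail offset minus head offset) tells the
# distributor how much work is pending.
from collections import deque

class WorkgroupQueue:
    def __init__(self):
        self.items = deque()

    def donate(self, item):  # local enqueue by the owning workgroup
        self.items.append(item)

    def take(self):          # owner pops newest item from its own end
        return self.items.pop() if self.items else None

    def steal(self):         # thief dequeues oldest item from the far end
        return self.items.popleft() if self.items else None

    def depth(self):         # pending work, i.e. tail - head
        return len(self.items)

q_busy, q_idle = WorkgroupQueue(), WorkgroupQueue()
for wave in range(4):
    q_busy.donate(f"wave{wave}")

# An idle workgroup finds its own queue empty and steals the oldest item.
stolen = q_busy.steal() if q_idle.depth() == 0 else q_idle.take()
print(stolen)  # wave0
```

Taking from opposite ends is the usual work-stealing trick to keep owner and thief from contending on the same pointer; whether the patent's scheme does this is not stated.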
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    In the fixed function view of the graphics pipeline, work is solely divided into wavefronts. Compute is when workgroups become entangled in one of a short list of configurations (e.g. a workgroup of 2 wavefronts though of course a compute wavefront may be defined by a workgroup that is populated by a single wavefront).

    So, graphics wavefronts are always at the most finely-grained level of tasks in such a system.

This document refers solely to work that has not started execution on a processor core. Once work is underway, it is no longer compatible with a work donation queue, and therefore cannot be donated or stolen. The document is simply describing a distributed pool of available work whose tasks (wavefronts or groups of wavefronts) can be migrated around the component parts of the pool. When tasks are initially assembled and pushed into the distributed pool, a "best estimate" can be made about which processor core should localise the available work.

I think it's also worth noting that a processor core in this document is not well defined. E.g. translating this concept into GCN, it can be argued that a set of CUs actually forms a core, since work-queuing and program caching operate at a level above the compute unit. See the 2nd image on this page:

    http://www.guru3d.com/articles-pages/amd-radeon-hd-7970-review,5.html

    where you can see that K$ and I$ are shared by 4 compute units. The important point here is that program code size for individual kernels may prevent the co-existence of distinct kernels within a "processor core" of 3 or 4 compute units. We know that GCN has a general problem with code size and work distribution, because "code length" is a parameter frequently referenced in optimisation guidelines.

So code complexity, especially with competing graphics kernels or with very long kernels whose code needs to be paged into I$, effectively becomes another parameter the work distributor has to handle, implying that it would also need to be a parameter of work-stealing. There's no point trashing I$ (or K$) within a destination processor core by stealing work from another core.

    This seems like a low-quality document to be honest. I couldn't find any aspect of it that solves the supposed problem outlined by:

    In the end I see this as a fully transparent system. A graphics wavefront that hasn't commenced work is defined by quite a small amount of data, e.g. for a wavefront of fragment quads there's the coordinates of each quad and its gradient (usually several quads share gradient information) and the kernel Id (which translates into program Id and the state Id) shared by all the quads. What else?

    For a compute kernel, individual workgroups (and therefore their constituent wavefronts) are defined by "top-left" coordinates and kernel Id.

    Generally, the state associated with work that has not yet been commenced is pretty small and easy to move around the GPU. The most arduous state is DS, since it derives from TS output which is entangled with HS state. NVidia solved this problem ages ago and it's been tedious waiting for AMD to tackle it properly.

    Sadly this document doesn't touch on that subject.

    (One might argue that tessellation is a dying concept in modern rendering - if it isn't already dead.)
     
  19. Anarchist4000

    Veteran

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Perhaps, although AMD has an affinity towards using FPGAs to manage their queues of late. This may very well be the work distribution mechanism we've seen mentioned.

With a heterogeneous configuration this may make a lot more sense. While not explicitly stated in that patent, a variable-SIMD or heterogeneous design might use a similar technique. For example, a parallel workload moved onto a scalar unit wouldn't easily be transferable until some portion of the work completed. ACEs sort of implement this at a global level, but if work were broken into subroutines within a CU/cluster for more efficient scheduling, a more local mechanism would be warranted. That would account for donation and stealing: throw a wave/workgroup back on the local queue when the width changes.

    I'd hazard a guess this is related to the CWSR (compute wave save restore) they added a year or so ago. If the instructions were transitioned to subroutines, practical if moving a wave onto a CPU/scalar or narrower SIMD, work technically would not have begun execution. Possibly extending the queuing mechanism upwards over a cluster or set of CUs. ACE/HWS-like functionality kicking in at migration, compaction, and synchronization points. Run a parallel code path, then reschedule/migrate within a CU/cluster for a specialized path.

    It also mentions that the L1/L2 cache are shown as on-chip memory, but "appreciated that they may be any suitable memory, whether on-chip or off-chip." That's very similar to what the ACEs were doing when spilling contexts into ram.
     
    #879 Anarchist4000, Jan 22, 2017
    Last edited: Jan 23, 2017
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
The current method for allocation is that all wavefronts in a workgroup are initialized on a CU up-front, if only to avoid potential deadlock on synchronization operations.
    However, the proposal seems to weaken the link between a workgroup and the software threads it is running, and later has workgroups or even wavefronts initiating queue/de-queue actions, so execution is at least partially underway or there's some meta-programming capable of asserting a different form of ownership over software threads by other software threads.

    "The work donation process allows a workgroup associated with a particular processing core to donate unprocessed workloads (e.g. a software thread waiting or needing to be executed) to another workgroup associated with the same processing core, such that unprocessed workloads may be transferred from one execution unit to another."

    "Similarly, workgroup queue N 118 stores work queue elements representing unprocessed workloads that may be executed by software threads associated with workgroup N 136."

    "The method begins at block 504, where wavefronts in each workgroup enqueue and dequeue elements from and to each workgroup queue using processing core scope atomic operations. For example, the hybrid donation and stealing control module 140 may allow wavefronts associated with workgroup 1 138 and workgroup N 136 to enqueue and dequeue queue elements to workgroup queue 1 116 and workgroup queue N 118."

The lifetime and scope of the workgroup relative to the instructions it executes seem to be changed. Perhaps this is a case of awkward wording in the document, but it seems to be saying workgroups are resident on a core and have ownership of some kind over their associated queue of pending elements. They can then take on new elements for execution, without specifying that there's uniformity or consistency between them and the arbitrary elements they've just stolen.

    Intra-CU work-item movement is more compatible with the current method, since items like barriers and synchronization are still CU-level. It's still not a full match without further elaboration since the current architecture has effectively kicked everything off inside of a workgroup at that time. The movement between processors is even further out.

    What I find different is that it is saying that the workgroups or their constituent wavefronts are performing the action of donating their items--which hints at something different since the current method will not allow a workgroup or wavefront to be in a state capable of action until all wavefronts are already in a state of fetching instructions into their instruction buffers and executing.

    I think the determination of core may be informed by where the per-wave instruction buffers and program counters are maintained, which AMD's GCN presentation shows are present within a CU. The determination of program flow and the various special instructions can be encapsulated by everything contained in the CU (buffer-consumed instructions, scalar pipe, CU-level barriers), if the instruction buffers can be treated as a small L0 or L1 instruction cache.
    (edit: On reflection, potentially not an L1 cache due to its consumption of instructions, although complicated somewhat by the supposed "prefetch" that might allow reuse. It would function like instruction buffers in CPUs prior to including instruction caches, which in no way invalidated that they were cores.)

    The discussion of wavefronts taking actions and specially tagging instructions means it's not transparent to something.
     
    #880 3dilettante, Jan 23, 2017
    Last edited: Jan 23, 2017