AMD: Speculation, Rumors, and Discussion (Archive)

Discussion in 'Architecture and Products' started by iMacmatician, Mar 30, 2015.

Thread Status:
Not open for further replies.
  1. kalelovil

    Regular

    Joined:
    Sep 8, 2011
    Messages:
    555
    Likes Received:
    93
    no-X likes this.
  2. Pixel

    Regular

    Joined:
    Sep 16, 2013
    Messages:
    981
    Likes Received:
    437
    If the PS4 Neo Polaris rumors are true, this is a huge leg up for Polaris (and GCN 1.3) over Pascal. Developers will be more incentivized to optimize their code for the Polaris architecture, and we may have another scenario where AMD's arch punches above its weight on some game engines. And that's not even including AMD's asynchronous compute advantage.
     
  3. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    Even if you take it there, they still have a ~50% perf/watt deficit to Maxwell 2, actually closer to 60% against the highest perf/watt Maxwell card, the GTX 980.
     
  4. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Console code doesn't seem to port over that well to desktop GPUs, at least from what we have seen so far, though this might change. You also have to look at lead time there. AMD and others have been saying for years that AMD will have an advantage because of their console wins, yet nothing has materialized from it; there seem to be short stints where it works, then once new cards come out it's just a wash.

    Also, devs are going to be making games for both versions of the console, PS4 Neo and PS4. Will this create an easier time for them or a harder one? It's going to be harder, since now you have two separate paths, two separate feature sets, etc. So how will this translate over to PC gaming? I think it's going to be more complex. Add to this the two separate console families (PS4 and Xbox) and that's four separate paths on two different APIs.

    We have seen GameWorks being much more effective than AMD's console wins thus far. What if GameWorks throws DX12 into the mix; what happens if it's black-boxed like before? With less driver control, wouldn't this play right into GameWorks' hands? What if nV decides to build an entire library of compute effects for Pascal based on async compute? What will happen to developers trying to pursue code efficiency for all hardware with async compute effects? I think many of them will say: it's there, it's working, we don't need to worry about getting it to work on other cards since that's not our top priority.

    What it all comes down to is that if we take the best case for each IHV, yeah, it's all rosy, but the middle ground is where it's probably going to end up, which is roughly where we are now. AMD is struggling with perf/watt, and it doesn't seem it will have any major advantage over nV since it has to overcome the deficit it is in now.

    As for all the other things about consoles and async, there are many ways that can play out, because it's not just AMD that is on the game board.
     
    #1424 Razor1, May 2, 2016
    Last edited: May 2, 2016
  5. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    The issue wrt perf/W between GCN and Maxwell was that the difference, 2x between an R9 390X and a GTX 980, was so ridiculously high that it allowed Nvidia to use it as a major marketing point, reinforcing a story of AMD's own making thanks to the not-yet-forgotten crappy cooler on the R9 290X.

    If Polaris fixes that deficit, nobody is going to care whether it's 10% better or 10% worse than Pascal: perf/W will go back to being just a point of comparison that doesn't make a material difference. (Though if Polaris is a bit better than Pascal, I expect lots of people to complain about bias when review sites correctly point out that perf/W has gone back to being a non-issue.)

    I think AMD has once again set wrong expectations by leading people to believe that Polaris will crush Pascal in terms of perf/W.
     
    egoless likes this.
  6. Esrever

    Regular Newcomer

    Joined:
    Feb 6, 2013
    Messages:
    594
    Likes Received:
    298
    Some people bought Kepler because it was 10% more efficient than GCN. A lot of people chose the 770 over the 280X for that exact reason.
     
  7. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Looking at the hardware.fr review of the R9 280X, I see performance of the 770 and the 280X that's almost identical, yet power for the latter is 20 to 50 W higher, or a perf/W difference that varies from 15% to 35%. According to TechPowerUp, though, the average difference is around 13%. (I don't know how many games they test.)
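
    As a sanity check on those percentages: at roughly equal performance the perf/W gap is just the power ratio, and the quoted 15% to 35% range is consistent with a baseline of around 140 W for the 770 (an implied, illustrative figure, not hardware.fr's published number):

    (140 + 20) / 140 ≈ 1.14, i.e. the ~15% end of the range
    (140 + 50) / 140 ≈ 1.36, i.e. the ~35% end of the range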

    Maybe a difference of 10% will make some people decide one way or the other, but it's not going to be the headline topic that it was for Maxwell. Especially if Pascal doesn't have async compute.
     
  8. dbz

    dbz
    Newcomer

    Joined:
    Mar 21, 2012
    Messages:
    98
    Likes Received:
    41
    I think I know of more people who preferred the 770 over the 280X (or any similar comparison down the product stack) because they use their computers and monitors as media centers as well as for gaming/productivity/content creation. The 770 and its brethren tend to be very quiet for Blu-ray/DVD/video playback. Some of AMD's cards can be noticeably louder in an environment where silence is a definite asset.
     
  9. Cookie Monster

    Newcomer

    Joined:
    Sep 12, 2008
    Messages:
    167
    Likes Received:
    8
    Location:
    Down Under
    And this comes back to perf/W. Lower power consumption while delivering similar performance means less heat -> a quieter cooler (or a budget one for increasing margins... 7900 GT, anyone?). The AMD stock reference coolers, dating all the way back to Northern Islands, have really hurt AMD big time in terms of brand perception. I was surprised to find out that they never really changed the cooler design at all for three generations, until the introduction of the AIO CLC on the Fury X.
     
  10. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Remember this is compounded by the fact that NVIDIA seem to provide enough of a window that their best-performing AIB cards are much better, which unfortunately cannot always be said about some of the AMD models.
    So the gap should be wider in certain comparisons (depending upon the AMD model). Using a reference NVIDIA card is IMO pointless (especially trying to overclock it), as many will prefer to buy an AIB card that they leave at default, while enthusiasts eke out even more performance from a well-designed AIB card with overclocking headroom.
    Case in point, just look at how large the performance difference can be between the reference design and an all-out AIB card:
    http://www.techpowerup.com/reviews/Gigabyte/GTX_980_Ti_XtremeGaming/23.html
    http://www.techpowerup.com/reviews/Gigabyte/GTX_980_Ti_XtremeGaming/24.html

    And when they overclock that card, the gap widens even more, although more in performance than in perf/watt.
    So part of what will be interesting is how much headroom there is for AIB cards with Polaris.
    Cheers
     
  11. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    From what I've seen, you need to give the compiler hints (fences?) as to what can safely be computed asynchronously.

    I am not sure we're still talking about the same thing here. I was referring to having full ALU utilization as a design goal from the get-go.
     
  12. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    No, that is something different. This was about probing the CUs to find the best fit, in case multiple CUs have spare resources and multiple (different) wavefronts are to be scheduled from any of the pipes, regardless of which context the wavefront originates from.

    And yes, there are quite a few parameters probed (or tracked/profiled?) for this, even though the specific details are a well-kept secret.
     
    Anarchist4000 and CarstenS like this.
  13. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    I stand corrected. Thank you, sir.
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    The most obvious example of a rendering pass that doesn't make much use of the ALUs is shadow map rendering. So while doing the post-processing for frame 1 as a compute shader, you can do the shadow map rendering for frame 2. G-buffer fill also tends to be ALU-light.

    There is no way to make shadow map rendering and G-buffer fill use all available ALU capacity - it's inherent in the algorithms being used. Their throughput is constrained by fixed function units.

    Compute kernels submitted into the dedicated compute queue will run asynchronously unless a fence is put in place to say that a specific item of graphics work has to finish first. You need to use fences because you want to control when the compute work runs asynchronously, rather than letting it fire off as soon as possible. If you don't control when it starts, you can end up with compute clashing with other ALU-heavy tasks and other issues like cache thrashing. Fences in D3D12 are expensive (they are a bit like preemption: the whole GPU needs to reach a settled state), so you shouldn't use dozens or hundreds of fences per frame to generate an interleaving of graphics-only, graphics+compute and compute-only bundles of work.

    So some analysis of work durations and compatible pairings of graphics and compute is required before deploying asynchronous compute. A bit of runtime benching should allow the game to calibrate whether asynchronous compute is going to be a win on the installed hardware.

    Console developers have been using asynchronous compute for a while now. If PC developers want to keep up then they're going to have to learn how to use the GPU efficiently. That's why D3D12 was created in the way it was, to give PC developers access to the kinds of advanced techniques that console developers have been using for a long time. Asynchronous compute is just one of those techniques.
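
    To make that queue-plus-fence arrangement concrete, here is a minimal D3D12 sketch (an illustration only, not code from any particular engine), assuming the device, a direct graphics queue and the recorded command lists already exist:

    #include <windows.h>
    #include <d3d12.h>

    // Submit graphics work, then let the async compute work begin only once that
    // specific graphics work has finished, using a single fence as discussed above.
    // (In a real renderer the queue and fence would be created once, not per submission,
    // and the HRESULTs would be checked.)
    void SubmitWithOneFence(ID3D12Device* device,
                            ID3D12CommandQueue* graphicsQueue,
                            ID3D12CommandList* const* graphicsLists, UINT numGraphicsLists,
                            ID3D12CommandList* const* computeLists, UINT numComputeLists)
    {
        // A dedicated compute queue: anything submitted here may overlap the graphics queue.
        D3D12_COMMAND_QUEUE_DESC desc = {};
        desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
        ID3D12CommandQueue* computeQueue = nullptr;
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));

        ID3D12Fence* fence = nullptr;
        device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

        // Graphics queue: frame 1's main rendering, which the post-processing depends on.
        graphicsQueue->ExecuteCommandLists(numGraphicsLists, graphicsLists);
        graphicsQueue->Signal(fence, 1);   // "this specific item of graphics work has finished"

        // Compute queue: a GPU-side wait, after which the frame-1 post-processing compute
        // work is free to overlap whatever the graphics queue does next (e.g. frame-2
        // shadow map rendering).
        computeQueue->Wait(fence, 1);
        computeQueue->ExecuteCommandLists(numComputeLists, computeLists);
    }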
     
    trinibwoy and silent_guy like this.
  15. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,185
    Likes Received:
    1,841
    Location:
    Finland
  16. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    I'd agree that keeping them all utilized as a design goal is a good thing. If a new effect shows up, refactoring your entire pipeline for a new effect that takes slightly longer can be impractical. Not too different from the games where performance tanks after a GameWorks effect is integrated in a patch after launch. The point I was getting at was that semi-intelligent hardware scheduling could do a lot of that work for you. That would be far more beneficial for dev teams that don't have the time or the experts to properly integrate and optimize new effects. I'm reasonably certain newer generations of GCN can do this beyond what console devs have experience with.

    I think hardware can work around that expense to a significant degree. It seems likely the ACEs could track barriers and fences to handle dependencies.

    This is the part I have a feeling newer hardware might actually be taking over, with far better control than a developer could manage. It would just be too difficult to make different paths for hardware with varying compute capabilities. It would work for consoles, but less so for the PC market. I'd agree it's still required, but if only targeting modern GCN hardware it might not be, as the hardware would do the work for you.

    I'm not sure even console devs are familiar with what's coming. It seems safe to assume AMD has iterated on asynchronous compute since the console hardware was created. Developing for a Nintendo NX might be far easier than for Neo, which still requires compatibility with older hardware. In a couple of years asynchronous compute could be trivial.
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    ACEs do track synchronization events placed on their queues, but relying on front-end processors that are effectively at the front of a chain of events starting with the timeslice they give to their queues, interacting with the CU array, an unknown number of DMA/wave launches/wave completions/system events, and the latency of the completion signalling path is where a lot of the expense arises.

    Coordinating with the graphics front end can be expensive because it does a lot of advance work (measured in front-end commands, not the many possible operations and wavefronts they can spawn) and juggles a lot of context--which may have contributed to why preemption did not come there as quickly. Older presentations described the command processor as being several custom controllers, and later leaked documents on the consoles point out that at least some of that functionality is to run ahead as far as possible to get initial context loads out of the way.

    Reducing how long the pipeline is might also come with a non-zero cost in terms of graphics pipeline effectiveness, if it constrains how much the graphics pipeline can put into motion without worrying about atomicity, checkpointing, or context storage.
     
  18. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    It seems that people are still confusing the terms "async compute", "async shaders" and "compute queue". Marketing and the press don't seem to understand the terms properly and spread the confusion :)

    Hardware:
    AMD:
    Each compute unit (CU) on GCN can run multiple shaders concurrently. Each CU can run both compute (CS) and graphics (PS/VS/GS/HS/DS) tasks concurrently. The 64 KB LDS (local data store) inside a CU is dynamically split between the currently running shaders. Graphics shaders also use it for intermediate storage. AMD calls this feature "Async shaders".

    Intel / Nvidia: These GPUs do not support running graphics + compute concurrently on a single compute unit. One possible reason is the LDS / cache configuration (GPU on-chip memory is configured differently when running graphics; CUDA even allows direct control of it). There are most likely other reasons as well. According to Intel documentation, it seems they run the whole GPU in either compute mode or graphics mode. Nvidia is not as clear about this. Maxwell likely can run compute and graphics simultaneously, but not both in the same "shader multiprocessor" (SM).

    Async compute = running shaders in the compute queue. The compute queue is like another "CPU thread": it doesn't have any ties to the main queue. You can use fences to synchronize between queues, but this is a very heavy operation and likely causes stalls. You don't want to do more than a few fences (preferably one) per frame. Just like "CPU threads", the compute queue doesn't guarantee any concurrent execution. The driver can time-slice queues (just like the OS does for CPU threads when you have more threads than CPU cores). This can still be beneficial if you have big stalls (GPU waiting for CPU, for instance). AMD's hardware works a bit like hyperthreading: it can feed multiple queues concurrently to all the compute units. If a compute unit stalls (even small stalls can be exploited), the CU immediately switches to another shader (also graphics<->compute). This results in higher GPU utilization.

    You don't need to use the compute queue in order to execute multiple shaders concurrently. DirectX 12 and Vulkan by default run all commands concurrently, even from a single queue (at the level of concurrency supported by the hardware). The developer needs to manually insert barriers in the queue to represent synchronization points for each resource (to prevent read<->write hazards). All modern GPUs are able to execute multiple shaders concurrently. However, on Intel and Nvidia the GPU runs either graphics or compute at a time (though it can run multiple compute shaders or multiple graphics shaders concurrently). So in order to maximize performance, you'd want to submit large batches of either graphics or compute to the queue at once (not alternating between the two rapidly). You get a GPU stall ("wait until idle") on each graphics<->compute switch (unless you are on AMD, of course).
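
    For a concrete picture of those per-resource barriers, here is a minimal D3D12 sketch (my own illustration, assuming the command list and a texture written by a compute pass already exist): a single transition barrier between the dispatch that writes the texture and the draws that sample it.

    #include <windows.h>
    #include <d3d12.h>

    // Record the one synchronization point needed so that work reading 'texture' cannot
    // overlap the compute dispatch that wrote it; everything not separated by a barrier
    // like this is free to run concurrently.
    void BarrierBetweenWriteAndRead(ID3D12GraphicsCommandList* cmdList, ID3D12Resource* texture)
    {
        D3D12_RESOURCE_BARRIER barrier = {};
        barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
        barrier.Transition.pResource   = texture;
        barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
        barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_UNORDERED_ACCESS;      // written by CS
        barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE; // sampled by PS
        cmdList->ResourceBarrier(1, &barrier);
    }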
     
    elect, Otto Dafe, Ike Turner and 20 others like this.
  19. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    Thank you, sebbbi. That (2nd paragraph) was basically (though in much more detail) what I was thinking as well, before people confused the hell out of me. :)

    Still, you throw SM/compute unit and GPU into the mix. It's still unclear to me whether Nvidia GPUs operate on only compute OR graphics at the chip level (your 3rd paragraph), or whether this is possible at the GPC level (which I would think sounds most logical), or even at the SM level (unlikely).
     
  20. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,490
    Likes Received:
    400
    Location:
    Varna, Bulgaria
    Does that mean the async efficiency is related to the MIMD "width" of the whole device? It could be one of the reasons Nvidia chopped down the multiprocessors in Pascal, aside from relieving register pressure.
     