AMD: Speculation, Rumors, and Discussion (Archive)

If the PS4 Neo Polaris rumors are true, this is a huge leg up for Polaris (and GCN1.3) over Pascal. Developers will be more incentivized to optimize their code for the Polaris architecture, and we may have another scenario where AMD's arch punches above its weight on some game engines. And that's not even including AMD's asynchronous compute advantage.
 
That depends on which implementation of Tonga. The initial 285 was particularly poor, but by the 380X perf/watt had noticeably improved (though still only to 270X levels).
https://www.techpowerup.com/reviews/ASUS/R9_380X_Strix/24.html
http://www.hardwareluxx.de/index.ph...0x-von-asus-und-sapphire-im-test.html?start=6


Even if you take it there, they still have a ~50% deficit to Maxwell 2, and actually closer to 60% against the highest perf/watt Maxwell card, the GTX 980.
 
If the PS4 Neo Polaris rumors are true, this is a huge leg up for Polaris (and GCN1.3) over Pascal. Developers will be more incentivized to optimize their code for the Polaris architecture, and we may have another scenario where AMD's arch punches above its weight on some game engines. And that's not even including AMD's asynchronous compute advantage.

Console code doesn't seem to port over that well to desktop GPUs, at least from what we have seen so far; this might change, though. You also have to look at lead time there. AMD and others have been saying for years that AMD will have an advantage because of their console wins, yet nothing has materialized from it; there are short stints where it seems to work, then once new cards come out it's just a wash.

Also, devs are going to be making games for both versions of the console, PS4 Neo and PS4. Will this make things easier for them or harder? Harder, since now you have two separate paths, two separate feature sets, etc. So how will this translate over to PC gaming? I think it's going to be more complex. Add to this the two separate console families (PS4 and Xbox), and that's four separate paths on two different APIs.

We have seen GameWorks being much more effective than AMD's console wins thus far. What if GameWorks throws DX12 into the mix, and what happens if it's black-boxed like before? With less driver control, wouldn't this play right into GameWorks' hands? What if nV decides to make an entire library of async-compute-based effects for Pascal? What will happen to developers trying to pursue code efficiency for all hardware with async compute effects? I think many of them will say: it's there, it's working, we don't need to worry about getting it to work with other cards since that's not our top priority.

What it all comes down to is this: if we take the best case for each IHV, yeah, it's all rosy, but the middle ground is where it's probably going to end up, which is roughly where we are now. AMD is struggling with perf/watt, and it doesn't seem it will have any major advantage over nV since it first has to overcome the deficit it is in now.

As for all the other things about consoles and async, there are many ways that can play out, because it's not just AMD on the game board.
 
The issue wrt perf/W between GCN and Maxwell was that the difference, 2x between an R9 390X and a GTX 980, was so ridiculously high that it allowed Nvidia to use it as a major marketing point, reinforcing a story of AMD's own making that started with the not-yet-forgotten crappy cooler on the R9 290X.

If Polaris fixes that deficit, nobody is going to care whether it's 10% better or 10% worse than Pascal: perf/W will go back to being just a point of comparison that doesn't make a material difference. (Though if Polaris is a bit better than Pascal, I expect lots of people to complain about bias when review sites correctly point out that perf/W has gone back to being a non-issue.)

I think AMD has once again set wrong expectations by leading people to believe that Polaris will crush Pascal in terms of perf/W.
 
The issue wrt perf/W between GCN and Maxwell was that the difference, 2x between an R9 390X and a GTX 980, was so ridiculously high that it allowed Nvidia to use it as a major marketing point, reinforcing a story of AMD's own making that started with the not-yet-forgotten crappy cooler on the R9 290X.

If Polaris fixes that deficit, nobody is going to care whether it's 10% better or 10% worse than Pascal: perf/W will go back to being just a point of comparison that doesn't make a material difference. (Though if Polaris is a bit better than Pascal, I expect lots of people to complain about bias when review sites correctly point out that perf/W has gone back to being a non-issue.)

I think AMD has once again set wrong expectations by leading people to believe that Polaris will crush Pascal in terms of perf/W.
Some people bought Kepler because it was 10% more efficient than GCN. A lot of people chose the 770 over the 280X for that exact reason.
 
Some people bought Kepler because it was 10% more efficient than GCN. A lot of people chose the 770 over the 280X for that exact reason.
Looking at the hardware.fr review of the R9 280X, I see nearly identical performance between the 770 and the 280X, yet power for the latter is 20 to 50 W higher, or a perf/W difference that varies between 15% and 35%. But according to TechPowerUp, the average difference is around 13%. (I don't know how many games they test.)

Maybe a difference of 10% will make some people decide one way or the other, but it's not going to be the headline topic that it was for Maxwell. Especially if Pascal doesn't have async compute.
 
Some people bought Kepler because it was 10% more efficient than GCN. A lot of people chose the 770 over the 280X for that exact reason.
I think I know of more people who preferred the 770 over the 280X (or any similar comparison down the product stack) because they use their computers and monitors as media centers as well as for gaming/productivity/content creation. The 770 and its brethren tend to be very quiet for Blu-ray/DVD/video playback. Some of AMD's cards can be noticeably louder, in an environment where silence is a definite asset.
 
I think I know of more people who preferred the 770 over the 280X (or any similar comparison down the product stack) because they use their computers and monitors as media centers as well as for gaming/productivity/content creation. The 770 and its brethren tend to be very quiet for Blu-ray/DVD/video playback. Some of AMD's cards can be noticeably louder, in an environment where silence is a definite asset.

And this comes back to perf/W. Lower power consumption while delivering similar performance figures means less heat -> quieter cooler (or a budget one for increasing margins... 7900 GT, anyone?). The AMD stock reference coolers, dating all the way back to Northern Islands, have really hurt AMD big time in terms of brand perception. I was surprised to find out that they never really changed the cooler design at all for three generations until the introduction of the AIO CLC on the Fury X.
 
Looking at the hardware.fr review of the R9 280X, I see nearly identical performance between the 770 and the 280X, yet power for the latter is 20 to 50 W higher, or a perf/W difference that varies between 15% and 35%. But according to TechPowerUp, the average difference is around 13%. (I don't know how many games they test.)

Maybe a difference of 10% will make some people decide one way or the other, but it's not going to be the headline topic that it was for Maxwell. Especially if Pascal doesn't have async compute.
Remember, this is compounded by NVIDIA seeming to provide enough of a window that their best-performing AIB cards are much better, which unfortunately cannot always be said about some of the AMD models.
So the gap should be wider in certain comparisons (depending upon the AMD model). Using a reference NVIDIA card is IMO pointless (especially trying to overclock it), as many will prefer to buy an AIB card and leave it at default, while enthusiasts eke out even more performance with a well-designed AIB card that has overclocking headroom.
Case in point: just look at how large the performance difference can be between the reference design and an all-out AIB card:
http://www.techpowerup.com/reviews/Gigabyte/GTX_980_Ti_XtremeGaming/23.html
http://www.techpowerup.com/reviews/Gigabyte/GTX_980_Ti_XtremeGaming/24.html

And when they overclock that card, the gap just widens further, although more in performance than in perf/watt.
So part of what will be interesting is how much headroom there is for AIB cards with Polaris.
Cheers
 
At least on newer GCN versions I think it does work automatically. It has to do some sort of tuning, unless it's using a round-robin dispatch of all the queues, or you'd just flood the hardware with a single available kernel. I really haven't seen any clarification from AMD on just how they select wavefronts for scheduling. If you follow the thought process that they wanted concurrent execution, having the scheduler target ratios of graphics:compute or fetch:ALU doesn't seem unreasonable. I'd imagine it's not available because Nvidia is still working out the details of their implementation.
From what I've seen, you need to give the compiler hints (fences?) as to what you can safely compute asynchronously.

Power efficiency, going off that recent patent. Throughput would be reduced to whatever was required by disabling or downclocking ALUs. You would basically guarantee the hardware was always close to full utilization.
I am not sure we're still talking about the same thing here. I was referring to having full ALU utilization as a design goal from the get-go.
 
From what I've seen, you need to give the compiler hints (fences?) as to what you can safely compute asynchronously.
No, that is something different. This was about probing the CUs to find the best fit when multiple CUs have spare resources and multiple (different) wavefronts are to be scheduled from any of the pipes, regardless of which context the wavefront originates from.

And yes, there are quite a few parameters probed (or tracked/profiled?) for this, even though the specific details are a well-kept secret.
 
The most obvious example of a rendering pass that doesn't make much use of the ALUs is shadow map rendering. So while doing the post-processing for frame 1 as a compute shader, you can do the shadow map rendering for frame 2. G-buffer fill also tends to be ALU-light.

There is no way to make shadow map rendering and G-buffer fill use all available ALU capacity - it's inherent in the algorithms being used. Their throughput is constrained by fixed function units.

Compute kernels submitted into the dedicated compute queue will run asynchronously unless a fence is put in place to say that a specific item of graphics work has to finish first. You need fences because you want to control when the compute work runs asynchronously, rather than letting it fire off as soon as possible. If you don't control when it starts, you can end up with compute clashing with other ALU-heavy tasks, and with other issues like cache thrashing. Fences in D3D12 are expensive (they are a bit like preemption: the whole GPU needs to reach a settled state), so you shouldn't use dozens or hundreds of fences per frame to generate an interleaving of graphics-only, graphics+compute and compute-only bundles of work.
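
To make that concrete, here is a minimal sketch of the pattern in D3D12 terms, assuming a separate compute queue and a single GPU-side fence per frame that gates the compute work on one specific piece of graphics work. All queue, command-list and function names here are illustrative, not from the post:

#include <d3d12.h>

// Per-frame submission: the async compute pass depends only on the G-buffer
// pass, so one fence marks that point and everything else overlaps freely.
void SubmitFrame(ID3D12CommandQueue* gfxQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12CommandList* gbufferList,       // graphics work the compute pass reads
                 ID3D12CommandList* asyncComputeList,  // e.g. post-processing as a compute shader
                 ID3D12CommandList* shadowList,        // ALU-light graphics to overlap with
                 ID3D12Fence* fence, UINT64& fenceValue)
{
    gfxQueue->ExecuteCommandLists(1, &gbufferList);
    gfxQueue->Signal(fence, ++fenceValue);             // GPU-side signal once the G-buffer pass retires
    computeQueue->Wait(fence, fenceValue);             // compute queue waits on the GPU; the CPU never blocks
    computeQueue->ExecuteCommandLists(1, &asyncComputeList);
    gfxQueue->ExecuteCommandLists(1, &shadowList);     // shadow rendering runs alongside the compute work
}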

So some analysis of work durations and compatible pairings of graphics and compute is required before deploying asynchronous compute. A bit of runtime benching should allow the game to calibrate whether asynchronous compute is going to be a win on the installed hardware.
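
As an illustration of that calibration step, something as simple as the sketch below would do, assuming hypothetical engine hooks RenderFrame() and WaitForGpuIdle() (neither name comes from the post):

#include <chrono>

void RenderFrame(bool useAsyncCompute);  // hypothetical: submits the frame to one or two queues
void WaitForGpuIdle();                   // hypothetical: drains the GPU before the clock stops

// Time a burst of frames with and without the compute queue and keep the winner.
static double AverageFrameMs(bool useAsyncCompute, int frames)
{
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < frames; ++i)
        RenderFrame(useAsyncCompute);
    WaitForGpuIdle();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count() / frames;
}

bool ShouldUseAsyncCompute()
{
    const int kFrames = 240;  // e.g. measured during a loading screen or on first run
    return AverageFrameMs(true, kFrames) < AverageFrameMs(false, kFrames);
}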

Console developers have been using asynchronous compute for a while now. If PC developers want to keep up then they're going to have to learn how to use the GPU efficiently. That's why D3D12 was created in the way it was, to give PC developers access to the kinds of advanced techniques that console developers have been using for a long time. Asynchronous compute is just one of those techniques.
 
http://videocardz.com/59487/amd-polaris-11-and-10-gpus-pictured

Polaris 10 die comparison to Tonga


AMD-Polaris-10-GPU-vs-Tonga-GPU.jpg

To be more specific, it's a comparison of Tonga to what they assume is the Polaris 10 die size, when they really only know the length of two sides - it might not have exactly the same aspect ratio as Tonga.
 
I am not sure we're still talking about the same thing here. I was referring to having full ALU utilization as a design goal from the get-go.
I'd agree that keeping them all utilized as a design goal is a good thing. If a new effect shows up, refactoring your entire pipeline for a new effect that takes slightly longer can be impractical. Not too different from the games where performance tanks after a GameWorks effect is integrated in a post-launch patch. The point I was getting at was that semi-intelligent hardware scheduling could do a lot of that work for you. That would be far more beneficial for dev teams that don't have the time or experts to properly integrate and optimize new effects. I'm reasonably certain newer generations of GCN can do this beyond what console devs have experience with.

Fences in D3D12 are expensive (they are a bit like preemption
I think hardware can work around that expense to a significant degree. It seems likely ACEs could track barriers and fences to handle dependencies.

So some analysis of work durations and compatible pairings of graphics and compute is required before deploying asynchronous compute. A bit of runtime benching should allow the game to calibrate whether asynchronous compute is going to be a win on the installed hardware.
This is the part I have a feeling newer hardware might actually take over, with far better control than a developer could manage. It would just be too difficult to make different paths for hardware with varying compute capabilities. It would work for consoles, but less so for the PC market. I'd agree it's still required, but if only targeting modern GCN hardware it might not be, as the hardware would do the work for you.

Console developers have been using asynchronous compute for a while now. If PC developers want to keep up then they're going to have to learn how to use the GPU efficiently. That's why D3D12 was created in the way it was, to give PC developers access to the kinds of advanced techniques that console developers have been using for a long time. Asynchronous compute is just one of those techniques.
I'm not sure even console devs are familiar with what's coming. It seems safe to assume AMD has iterated on asynchronous compute since the console hardware was created. Developing for a Nintendo NX might be far easier than for Neo, which still requires compatibility with older hardware. In a couple of years asynchronous compute could be trivial.
 
ACEs do track synchronization events placed on their queues, but a lot of the expense arises from relying on front-end processors that sit at the front of a long chain of events: the timeslice they give to their queues, interaction with the CU array, an unknown number of DMA transfers/wave launches/wave completions/system events, and the latency of the completion signalling path.

Coordinating with the graphics front end can be expensive because it does a lot of advance work (measured in front-end commands, not the many possible operations and wavefronts they can spawn) and juggles a lot of context--which may have contributed to why preemption did not come there as quickly. Older presentations described the command processor as being several custom controllers, and later leaked documents on the consoles point out that at least some of that functionality is to run ahead as far as possible to get initial context loads out of the way.

Reducing how long the pipeline is might also come with a non-zero cost in terms of graphics pipeline effectiveness, if it constrains how much the graphics pipeline can put into motion without worrying about atomicity, checkpointing, or context storage.
 
It seems that people are still confusing the terms "async compute", "async shaders" and "compute queue". Marketing and the press don't seem to understand the terms properly and spread the confusion :)

Hardware:
AMD:
Each compute unit (CU) on GCN can run multiple shaders concurrently. Each CU can run both compute (CS) and graphics (PS/VS/GS/HS/DS) tasks concurrently. The 64 KB LDS (Local Data Share) inside a CU is dynamically split between the currently running shaders. Graphics shaders also use it for intermediate storage. AMD calls this feature "Async shaders".

Intel / Nvidia: These GPUs do not support running graphics + compute concurrently on a single compute unit. One possible reason is the LDS / cache configuration (GPU on-chip memory is configured differently when running graphics - CUDA even allows direct control over it). There most likely are other reasons as well. According to Intel documentation, it seems they run the whole GPU either in compute mode or in graphics mode. Nvidia is not as clear about this. Maxwell likely can run compute and graphics simultaneously, but not both in the same "streaming multiprocessor" (SM).

Async compute = running shaders in the compute queue. The compute queue is like another "CPU thread". It doesn't have any ties to the main queue. You can use fences to synchronize between queues, but this is a very heavy operation and likely causes stalls. You don't want to do more than a few fences (preferably one) per frame. Just like with CPU threads, the compute queue doesn't guarantee any concurrent execution. The driver can time-slice queues (just like the OS does for CPU threads when you have more threads than CPU cores). This can still be beneficial if you have big stalls (the GPU waiting for the CPU, for instance). AMD's hardware works a bit like hyperthreading: it can feed multiple queues concurrently to all the compute units. If a compute unit has stalls (even small stalls can be exploited), the CU will immediately switch to another shader (also graphics<->compute). This results in higher GPU utilization.
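
In API terms that second "thread" is literally just another queue. A rough D3D12 sketch (device pointer assumed, error handling omitted, names illustrative):

#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// The compute queue is created independently of the direct (graphics) queue and
// is scheduled independently; only fences tie the two timelines together.
ComPtr<ID3D12CommandQueue> CreateComputeQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute/copy work only, no draw calls
    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
    return queue;
}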

You don't need to use the compute queue in order to execute multiple shaders concurrently. DirectX 12 and Vulkan by default run all commands concurrently, even from a single queue (at the level of concurrency supported by the hardware). The developer needs to manually insert barriers in the queue to represent synchronization points for each resource (to prevent read<->write hazards). All modern GPUs are able to execute multiple shaders concurrently. However, on Intel and Nvidia the GPU is running either graphics or compute at a time (but can run multiple compute shaders or multiple graphics shaders concurrently). So in order to maximize performance, you'd want to submit large batches of either graphics or compute to the queue at once (not alternating between both rapidly). You get a GPU stall ("wait until idle") on each graphics<->compute switch (unless you are on AMD, of course).
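
A rough sketch of that single-queue barrier idea in D3D12 (resource and pipeline-state names are made up for illustration, and root-signature/descriptor setup is omitted): a compute pass writes a UAV, a later fullscreen draw samples it, and only that hand-off is serialized by a transition barrier.

#include <d3d12.h>

void RecordPasses(ID3D12GraphicsCommandList* cmdList,
                  ID3D12Resource* lightingBuffer,      // written by the compute pass, read by the draw
                  ID3D12PipelineState* computePso,
                  ID3D12PipelineState* graphicsPso)
{
    cmdList->SetPipelineState(computePso);
    cmdList->Dispatch(1920 / 8, 1080 / 8, 1);          // compute pass writes lightingBuffer as a UAV

    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Transition.pResource   = lightingBuffer;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_UNORDERED_ACCESS;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
    cmdList->ResourceBarrier(1, &barrier);             // only this read-after-write hand-off is serialized

    cmdList->SetPipelineState(graphicsPso);
    cmdList->DrawInstanced(3, 1, 0, 0);                // fullscreen triangle reads the result
}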
 
Thank you sebbi. That (2nd paragraph) was basically (though much more detailed than) what I was thinking as well before people confused the hell out of me. :)

Still, you throw SM/compute unit and GPU into the mix. It's still unclear to me whether Nvidia GPUs operate only on compute OR graphics at the chip level (your 3rd §), or whether this split is possible at the GPC level (which I would think sounds most logical) or even at the SM level (unlikely).
 
Intel / Nvidia: These GPUs do not support running graphics + compute concurrently on a single compute unit. One possible reason is the LDS / cache configuration (GPU on-chip memory is configured differently when running graphics - CUDA even allows direct control over it). There most likely are other reasons as well. According to Intel documentation, it seems they run the whole GPU either in compute mode or in graphics mode. Nvidia is not as clear about this. Maxwell likely can run compute and graphics simultaneously, but not both in the same "streaming multiprocessor" (SM).
Does that mean the async efficiency is related to the MIMD "width" of the whole device? It could be one of the reasons Nvidia chopped down the multiprocessors in Pascal, aside from relieving register pressure.
 