We really won't know until they post their findings, will we? That's assuming there is a problem in the first place. Expect a 10% bump at most in Pascal performance once they enable and fine-tune async compute, and a 0% bump on anything older.
It would be nice if people would stop conflating async compute with instruction scheduling. How many times does it have to be explained that they are not related?! If you think they are, you still don't get what async compute is.
So I am making a technical argument and you reply with that? Please put me on your ignore list so you can save time and effort.
It's quite likely that people are just using the same terms that the developers themselves are using. In the specific case of Doom and Vulkan, the developers are stressing Async Compute and Shader Intrinsics.
If you want people to stop using the terms as they are using them, you'll need to get the developers to stop using the terms in those ways, especially when interviewed by gaming/tech sites.
Regards,
SB
Not sure why you took it personally. In your technical argument you said you're waiting for someone to start claiming "but GCN gains more" and then saying it's just exposing inefficiencies in the GCN uarch. That was a prevalent argument before Pascal was released: Maxwell being more efficient, Nvidia not needing async compute.
Now that Pascal actually shows significant gains with async compute, I wonder how much backtracking we will see here.
Well, let's imagine you have a scene containing 8.75 million vertices that draws 93 million pixels. Now let's imagine you have a GPU that contains some special processors that can only process vertices; let's call them vertex shaders. This GPU also contains some special processors that can only process pixels; let's call them pixel shaders. Say that such a GPU runs at 350MHz and has 6 vertex processors and 16 pixel processors. Said GPU will be able to render the scene at 60 frames a second. Now let's imagine another GPU which still runs at 350MHz and contains 24 general-purpose processors that can process either vertices or pixels. How much faster will such a GPU be on the above-mentioned scene? And what if the scene changes and now contains 17.5 million vertices?

I would not call the gains of NVidia significant. They just show that NV does gain a bit from the new APIs and that there is no performance penalty for NV hardware, but it is still a long way off from the jump in performance that GCN is showing.
I would agree that the gains are mostly a sign of not so good utilisation of processing power with AMD in the old APIs and not so much a sign of superiority in the new APIs.
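For what it's worth, here is a quick back-of-the-envelope version of the unified-shader thought experiment above. The throughput assumption (one vertex or pixel per processor per clock, perfect load balancing) is mine, purely for illustration:

```python
# Toy model of the fixed vs. unified shader thought experiment above.
# Assumption (not from the post): each processor finishes exactly one vertex
# or one pixel per clock, and work distributes perfectly across the pool.

CLOCK_HZ = 350e6            # 350 MHz
VERTS_PER_FRAME = 8.75e6
PIXELS_PER_FRAME = 93e6

def fps_fixed(n_vertex=6, n_pixel=16, verts=VERTS_PER_FRAME, pixels=PIXELS_PER_FRAME):
    """Frame rate when vertex and pixel processors are separate pools."""
    vertex_time = verts / (n_vertex * CLOCK_HZ)    # seconds spent on vertices
    pixel_time = pixels / (n_pixel * CLOCK_HZ)     # seconds spent on pixels
    return 1.0 / max(vertex_time, pixel_time)      # the slower pool is the bottleneck

def fps_unified(n_units=24, verts=VERTS_PER_FRAME, pixels=PIXELS_PER_FRAME):
    """Frame rate when 24 general-purpose units share all the work."""
    return (n_units * CLOCK_HZ) / (verts + pixels)

print(fps_fixed())               # ~60 fps: pixel shaders ~100% busy, vertex shaders ~25% busy
print(fps_unified())             # ~83 fps from the same total number of units
print(fps_fixed(verts=17.5e6))   # still ~60 fps: vertex load doubles, pixels remain the bottleneck
print(fps_unified(verts=17.5e6)) # ~76 fps: the shared pool absorbs the extra vertex work
```

Under those assumptions the split design leaves the vertex shaders roughly 75% idle while the pixel shaders are the bottleneck, whereas the unified pool turns that idle time into frame rate and copes with the doubled vertex load.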
I'm not too sure something memory-heavy is actually a good candidate for this. Have you tried it? As you said, latency is high, which means loads of pixels in flight to hide it, which in turn means high register pressure within a CU. So running a compute kernel, with its own register and threads-in-flight requirements, alongside might not actually be the best idea.

It's not just shadow maps; there can be a lot of other shaders which are memory bound. Storing the G-buffer and rendering lights is also memory bound in a deferred renderer. At that point the shader cores are actually doing nothing. So what do you do? Push more work in between. It's just like how new "active" warps are scheduled by hardware to hide memory latency where the ALU:TEX ratio is low. With async you're just adding more work to cover the latency cost even further, or rather, decreasing the chance that the cores sit idle. Fury X in particular has low-frequency memory (HBM), so the latencies are pretty high (I've actually tested this using an OpenCL memory latency benchmark), and the shaders were sitting idle quite a few times. I agree the gains depend on the amount of concurrent compute work you can push, but then again, with the complexity and variety of shaders typically in a game frame, there's always something to overlap.
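To make the trade-off concrete, here is a crude occupancy model. Every number in it is invented for illustration (nothing is measured): a memory-bound pass keeps the SIMDs busy only a fraction of the time, and an ALU-heavy async kernel can soak up the remaining issue slots, at the price of some of the graphics pass's resident waves, which is exactly the register-pressure concern raised above.

```python
# Toy occupancy model of why async compute helps a memory-bound pass.
# All numbers are made up for illustration; they are not measurements.

MEM_LATENCY = 400       # cycles to satisfy a memory request (e.g. high-latency HBM)
WAVES_IN_FLIGHT = 8     # waves the register file lets us keep resident per SIMD
ALU_PER_WAVE = 20       # ALU cycles each wave can issue between memory requests

def busy_fraction(alu_per_wave, waves, latency):
    """Fraction of cycles the SIMD has ALU work to issue.
    Each wave issues `alu_per_wave` cycles of work, then stalls for `latency`
    cycles; the other resident waves cover part of that stall."""
    useful = alu_per_wave * waves
    return min(1.0, useful / (alu_per_wave + latency))

gfx_only = busy_fraction(ALU_PER_WAVE, WAVES_IN_FLIGHT, MEM_LATENCY)
print(f"memory-bound pass alone: SIMDs busy {gfx_only:.0%}")   # ~38%, the rest is idle

# Async compute: a math-heavy kernel shares the CU, but it also needs registers,
# so the graphics pass can now only keep, say, 6 waves resident instead of 8.
gfx_shared = busy_fraction(ALU_PER_WAVE, 6, MEM_LATENCY)
# Assume the compute kernel is ALU-bound with enough work queued to fill every free slot.
compute_work = 1.0 - gfx_shared
print(f"graphics {gfx_shared:.0%} + compute {compute_work:.0%} = SIMDs busy 100%")
```

Whether the net result is a win depends on whether the extra work recovered in the idle slots outweighs what the graphics pass loses to the reduced wave count, which is why both posts above can be right depending on the workload.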
A PetaScale?

I personally never expected any kind of magic performance to come from Nvidia GPUs, as it is my opinion that their utilization was already at or near capacity, and that AMD's architecture is only now finally showing its capabilities. I'm just glad that AMD is now able to compete on a whole new scale.
36 CUs, does that remind you of something?

Remind, you mean? It's a coincidence. It's an APU; it has nothing to do with any discrete part aside from the iGPU architecture, which you could implement with (almost; there's obviously an upper limit due to die size) any number of CUs you want.