AMD: Speculation, Rumors, and Discussion (Archive)

It would be nice if people would stop conflating async compute with instruction scheduling. How many times does it have to be explained/repeated that they are not related?! If you think they are, you still don't get what async compute is.

It's quite likely that people are just using the same terms that the developers themselves are using. In the specific case of Doom and Vulkan, the developers are stressing Async Compute and Shader Intrinsics.

If you want people to stop using the terms as they are using them, you'll need to get the developers to stop using the terms in those ways, especially when interviewed by gaming/tech sites.

Regards,
SB
 
So I am making a technical argument and you reply with that? Please put me on your ignored list so you can save time and effort.

Not sure why you took it personally. In your technical argument you said you're waiting for someone to start claiming 'but GCN gains more' and then saying it's just exposing inefficiencies in GCN uarch. That was a prevalent argument before Pascal was released. Maxwell being more efficient. Nvidia not needing async compute.

Now that Pascal actually shows significant gains with async compute I wonder how much backtracking we will see here.
 
It's quite likely that people are just using the same terms that the developers themselves are using. In the specific case of Doom and Vulkan, the developers are stressing Async Compute and Shader Intrinsics.

If you want people to stop using the terms as they are using them, you'll need to get the developers to stop using the terms in those ways, especially when interviewed by gaming/tech sites.

Regards,
SB


People are using them without KNOWING what they are saying. Is scheduling an integral part of async? Yes, but how it's done is not part of async, is it? Sure, if there is a performance penalty in the scheduling portion then async becomes useless, but if there isn't, then it doesn't matter how the scheduling is done. That is what renderstate was getting at, and it was noted by Sebbi, Mdolenc, Andrew Lauritzen, myself and many others outside of this forum as well. GDC conference talks notwithstanding, I don't think a single developer who has talked about async has explained how the schedulers work at a hardware level, outside of Oxide, and at the time even they, like everyone else outside of nV, didn't know wtf was going on. But they did jump on the ACE scheduling claims along with AMD marketing when they posted on a forum, which wasn't correct AT ALL.
 
Not sure why you took it personally. In your technical argument you said you're waiting for someone to start claiming 'but GCN gains more' and then saying it's just exposing inefficiencies in GCN uarch. That was a prevalent argument before Pascal was released. Maxwell being more efficient. Nvidia not needing async compute.

Now that Pascal actually shows significant gains with async compute I wonder how much backtracking we will see here.

I would not call NVidia's gains significant. They just show that NV does gain a bit from the new APIs and that there is no performance penalty for NV hardware, but it is still a long way off from the jump in performance that GCN is showing.
 
I would not call NVidia's gains significant. They just show that NV does gain a bit from the new APIs and that there is no performance penalty for NV hardware, but it is still a long way off from the jump in performance that GCN is showing.
Well let's imagine you have a scene containing 8.75 million vertices that draws 93 million pixels. Now let's imagine you have a GPU that contains some special processors that can only process vertices, let's call them vertex shaders. This GPU also contains some special processors that can only process pixels, let's call them pixel shaders. Say that such a GPU runs at 350MHz and has 6 vertex processors and 16 pixel processors. Said GPU will be able to render the scene at 60 frames a second. Now let's imagine another GPU which still runs at 350MHz and contains 24 general purpose processors that can process either vertices or pixels. How much faster will such a GPU be on the above mentioned scene? What if the scene changes though and now contains 17.5 million vertices?
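As a rough illustration, here is the arithmetic that thought experiment implies, as a minimal Python sketch. The per-vertex and per-pixel cycle costs (4 and 1) are assumptions chosen so the fixed-function GPU lands on the quoted 60 fps; the clock, unit counts and scene sizes are the ones given above.

```python
CLOCK_HZ = 350e6          # 350 MHz, as in the example
CYCLES_PER_VERTEX = 4     # assumed cost per vertex
CYCLES_PER_PIXEL = 1      # assumed cost per pixel

def fixed_fps(vertices, pixels, vertex_units=6, pixel_units=16):
    """Fixed pipeline: the slower of the two dedicated stages sets the frame rate."""
    vertex_time = vertices * CYCLES_PER_VERTEX / (vertex_units * CLOCK_HZ)
    pixel_time = pixels * CYCLES_PER_PIXEL / (pixel_units * CLOCK_HZ)
    return 1.0 / max(vertex_time, pixel_time)

def unified_fps(vertices, pixels, units=24):
    """Unified pipeline: all work shares one pool of general-purpose units."""
    total_cycles = vertices * CYCLES_PER_VERTEX + pixels * CYCLES_PER_PIXEL
    return units * CLOCK_HZ / total_cycles

for verts in (8.75e6, 17.5e6):
    print(f"{verts / 1e6:.2f}M vertices: fixed {fixed_fps(verts, 93e6):.1f} fps, "
          f"unified {unified_fps(verts, 93e6):.1f} fps")
```

With the balanced scene the unified GPU only manages about 66 fps versus 60, despite 24 units against 22, because the dedicated units were already almost fully busy. Double the vertex load and the fixed design drops to 30 fps while the unified one holds around 51: the size of the gain mostly measures how imbalanced the workload was.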

There's a gross misunderstanding of where the gains are coming from. It's a similar situation with async compute, only more intertwined. Gains from running graphics and compute concurrently are only possible if graphics does not fully utilise the compute cores. As has been mentioned in another thread, the most common way of gaining an advantage from this today is to do frame post-processing in compute and start shadow map rendering for the next frame in graphics. Even rendering shadow maps requires some compute resources (vertex shaders). If you pick a specific frame on a specific GPU, there's X milliseconds of work that you don't want to overlap, there's Y milliseconds of work that's totally graphics bound and there's Z milliseconds of work that's only compute. Total time with async off will be X+Y+Z and total time with async on will be roughly X+max(Y, Z). If you switch to a different vendor's GPU you can't possibly expect that X, Y, Z will stay the same. They won't even stay within the same ratios!
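To put some (entirely made-up) numbers on that model: the sketch below compares two hypothetical GPUs with the same async-off frame time but different X/Y/Z splits, which is all it takes for the async gain to differ widely.

```python
def frame_time_ms(x, y, z, async_on):
    # x: work that cannot be overlapped, y: graphics-bound work,
    # z: compute-only work (all in milliseconds)
    return x + (max(y, z) if async_on else y + z)

# Hypothetical splits; both GPUs take 15.5 ms with async off.
gpus = {
    "GPU A (hypothetical)": (6.0, 5.0, 4.5),  # compute nearly fills the graphics-bound gap
    "GPU B (hypothetical)": (9.0, 5.0, 1.5),  # little compute-only work to overlap
}

for name, (x, y, z) in gpus.items():
    off = frame_time_ms(x, y, z, async_on=False)
    on = frame_time_ms(x, y, z, async_on=True)
    print(f"{name}: async off {off:.1f} ms, async on {on:.1f} ms "
          f"({(off - on) / off:.0%} gain)")
```

Same total frame time, yet one card gains roughly 29% and the other roughly 10%, purely from how the work happens to split.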
 
Not just shadow maps, there can be a lot of other shaders which are memory bound. Storing the gbuffer and rendering lights is also memory bound in a deferred renderer. At that point the shader cores are actually doing nothing. So what do you do? Push more work in between. It's just like how new 'active' warps are scheduled by HW to hide memory latency where the ALU:TEX ratio is low. Now with async you're just adding more work to cover the latency cost even more, or rather decrease the chance that cores would be idle. Especially on Fury X, which has low-frequency memory (HBM), the latencies are pretty high (I've actually tested this using an OpenCL memory latency benchmark), so the shaders were sitting idle quite a lot of the time. I agree the gains depend on the amount of concurrent compute work you can push, but then again, with the complexity and variety of shaders typically in a game frame, there's always something to overlap.
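For a feel of why a low ALU:TEX ratio leaves cores idle, here is a crude latency-hiding estimate. The 350-cycle memory latency and the ALU counts are illustrative assumptions, not results from the OpenCL benchmark mentioned above.

```python
def waves_to_hide_latency(mem_latency_cycles, alu_cycles_per_load):
    # While one wave waits on memory, the SIMD needs enough other waves
    # with ALU work to fill the gap (a crude Little's-law style estimate).
    return -(-mem_latency_cycles // alu_cycles_per_load)  # ceiling division

MEM_LATENCY = 350        # assumed DRAM latency in cycles
MAX_WAVES_PER_SIMD = 10  # GCN limit on resident wavefronts per SIMD

for alu in (20, 80, 200):
    need = waves_to_hide_latency(MEM_LATENCY, alu)
    note = "cores idle" if need > MAX_WAVES_PER_SIMD else "latency hidden"
    print(f"{alu:3d} ALU cycles per load -> ~{need} waves needed ({note})")
```

When a shader only has ~20 cycles of math per memory access, far more waves are needed than a SIMD can hold, and that idle time is exactly what independent async compute work can fill.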
 
I would agree that the gains are mostly a sign of not so good utilisation of processing power with AMD in the old APIs and not so much a sign of superiority in the new APIs.
 
http://www.pcworld.com/article/3095...geforce-in-this-major-new-dx12-benchmark.html

[Image: Time Spy results]

[Image: Time Spy results, async compute disabled]
 
I personally never expected any kind of magic performance to come from Nvidia GPUs, as it is my opinion that their utilization was near or at capacity already and that AMD's architecture is only now finally showing its capabilities. I'm just glad that AMD is now able to compete on a whole new scale.
 
I would agree that the gains are mostly a sign of not so good utilisation of processing power with AMD in the old APIs and not so much a sign of superiority in the new APIs.

You are talking about overall gains coming from DX11 and OGL, right? No doubt.

Can you imagine the graphics landscape in the past few years if that were NOT the case?
 
Not just shadow maps, there can be a lot of other shaders which are memory bound. Storing the gbuffer and rendering lights is also memory bound in a deferred renderer. At that point the shader cores are actually doing nothing. So what do you do? Push more work in between. It's just like how new 'active' warps are scheduled by HW to hide memory latency where the ALU:TEX ratio is low. Now with async you're just adding more work to cover the latency cost even more, or rather decrease the chance that cores would be idle. Especially on Fury X, which has low-frequency memory (HBM), the latencies are pretty high (I've actually tested this using an OpenCL memory latency benchmark), so the shaders were sitting idle quite a lot of the time. I agree the gains depend on the amount of concurrent compute work you can push, but then again, with the complexity and variety of shaders typically in a game frame, there's always something to overlap.
I'm not too sure something memory heavy is actually a good candidate for this. Have you tried it? As you said, latency is high, which means loads of pixels in flight to hide it. Which further means high register pressure within the CU. So running a compute kernel alongside, with its own register and threads-in-flight requirements, might not actually be the best idea.
As far as I have played with this, it's fairly easy to knock GCN out of "concurrent mode".
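To make the register-pressure point concrete, here is a minimal GCN-style occupancy sketch. It only models two limits, 10 wavefronts per SIMD and 256 VGPRs per work-item slot, and ignores SGPRs, LDS and allocation granularity; the register counts are hypothetical.

```python
MAX_WAVES_PER_SIMD = 10   # GCN architectural limit
VGPR_BUDGET = 256         # VGPRs per work-item slot on a GCN SIMD

def waves_per_simd(vgprs_per_thread):
    # Register use alone caps how many wavefronts a SIMD can keep resident.
    return min(MAX_WAVES_PER_SIMD, VGPR_BUDGET // vgprs_per_thread)

for vgprs in (32, 64, 84, 128):
    print(f"{vgprs:3d} VGPRs/thread -> {waves_per_simd(vgprs)} waves per SIMD")
```

A pixel shader sitting at 84 VGPRs already limits the SIMD to 3 resident waves, so a concurrent compute kernel with its own register appetite has to squeeze into whatever is left, which is one way the concurrency gets knocked out.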
 
High register pressure is a tricky problem to solve, but you could certainly do better than completely preallocating the register file and letting it stay allocated for the entire execution of all the wavefronts/warps of your shader. Swapping in and out to L2 cache, with preloads before a wave is about to start, seems a possibility, but I am not sure. Or use some other on-chip memory like GDS? I will need to run some tests. But there is certainly some benefit to overlapping on a single CU, because if you go the other way and divide your CUs between graphics and compute using a dynamic scheduler, then graphics takes more time, which might not give any perf benefits.

Also, if you look at the perf guideline for async compute here: http://gpuopen.com/performance-tweets-series-rendering-optimizations/
"Asynchronous queues can make some workloads free. Schedule compute & graphics jobs with different bottlenecks together."
Which means either graphics is memory bound or compute is memory bound.
 
I personally never expected any kind of magic performance to come from Nvidia GPUs, as it is my opinion that their utilization was near or at capacity already and that AMD's architecture is only now finally showing its capabilities. I'm just glad that AMD is now able to compete on a whole new scale.
A PetaScale? ;)
 
36CU, does that remain you something?
Remind, you mean? It's a coincidence. It's an APU; it has nothing to do with any discrete part aside from the iGPU architecture, which you could implement with almost any number of CUs you want (there's obviously an upper limit due to die size).
 
Remind, you mean? It's a coincidence. It's an APU; it has nothing to do with any discrete part aside from the iGPU architecture, which you could implement with almost any number of CUs you want (there's obviously an upper limit due to die size).

Of course, if the 480 did not exist, I would be thinking that this 36 CU count corresponds to an older GCN architecture with 36 CUs. Is there any correspondence with an older GCN GPU? What do you think? That the Neo is still using a 7000/Hawaii-series based GPU on 28nm?
 