AMD: Speculation, Rumors, and Discussion (Archive)

Does that mean the async efficiency is related to the MIMD "width" of the whole device? It could be one of the reasons Nvidia chopped down the multiprocessors in Pascal, aside from relieving the register pressure.
Maybe someone with more knowledge about Nvidia hardware should answer your question. I am just pointing out potential reasons based on the information available.

If you assume that a single Pascal SM cannot run mixed graphics + compute, then splitting the MPs should improve the granularity. Compute and graphics might also share some higher-level (more global) resources as well. Nvidia has quite sophisticated load balancing in their geometry processing. Distributed geometry data needs to be stored somewhere (SM L1 at least is partially pinned for graphics work, see this presentation: http://on-demand.gputechconf.com/gtc/2016/video/S6138.html). Also, Nvidia doesn't have separate ROP caches (AMD still does). Some portion of their L2 needs to serve ROPs when rendering graphics. This might be transparent (just another client of the cache) or might be statically pinned based on the GPU state. I don't know :)
 
Thanks, sebbbi! Where does preemption fit in this story?
Async shaders are not a "quality of service" mechanism. If multiple shaders are running simultaneously, one of them might get all the clock cycles (*). Also, the GPU has only limited resources (registers, LDS), and when these run out the scheduler can't schedule new tasks to the compute units until some tasks (thread groups) are finished. You still need pre-emption to guarantee that all applications get a fair share of the GPU.

(*) This is actually a real "problem", even without preemption. Example: you are running an ALU-heavy task that has high occupancy and are rendering shadow maps concurrently. The shadow map shader might for example only have max 10% occupancy (triangle setup or ROP bound). A 10%/90% split of occupancy would be the best case: shadow maps render at full speed (as they only need 10% of the CU waves) and the other task runs at 90% speed. However, the GPU might just fill the CUs completely with waves of the other task, starving the shadow map rendering completely (it waits and runs after the other ALU-heavy shader). Unfortunately PC DirectX 12 doesn't have priority adjustment per draw/dispatch. So the driver must do some black magic to ensure that concurrent tasks are not starved.
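To make that queue-level granularity concrete, here is a minimal C++ sketch (the helper name and the pre-existing device pointer are illustrative, not taken from any post above): in PC D3D12, a priority can only be set on a whole command queue when it is created, so every draw/dispatch submitted through that queue inherits it.

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Hypothetical helper: creates an async-compute queue with an elevated,
// queue-wide priority. There is no finer-grained (per draw/dispatch) knob.
ComPtr<ID3D12CommandQueue> MakeHighPriorityComputeQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;    // async compute queue
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;  // coarse, applies to the whole queue
    desc.Flags    = D3D12_COMMAND_QUEUE_FLAG_NONE;

    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
    // Every dispatch recorded on command lists executed by this queue shares
    // this priority; starvation between individual dispatches is still up to
    // the driver/hardware scheduler.
    return queue;
}
```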
 
If you had it available as a tool, is preemption something that would be useful to you within an app, one way or the other? Or will its benefit be mostly at the OS level, where it will prevent one task from gobbling up the resources, the way it currently works on the CPU?
 
At least in the early days of the PS4, the developers of the Infamous launch title still pointed to long-running compute kernels being problematic, which could have been due to the immature platform and a weakness in maintaining QoS or response-time requirements.
For the purposes of HSA and audio, Sony's audio engineer indicated that, with the limited prioritization and context switching that had been researched at the time, the latency for compute wavefront launches under load was tens of milliseconds, when under 7 or 5 ms was the goal.

Per some leaks regarding the Xbox One, switching from the game portion to the system/application timeslice can be a fraught exercise if there are shaders that do not respond in a timely fashion.
 
Thank you, sebbbi. That (the 2nd paragraph) was basically (though much more detailed than) what I was thinking as well before people confused the hell out of me. :)

Still, you throw SM/Compute Unit and GPU into the mix. It's still unclear to me whether Nvidia GPUs operate only on compute OR graphics at the chip level (your 3rd §), or if this is possible at the GPC level (which I would think sounds most logical), or even at the SM level (unlikely).

So far as I know, there is, and will still be for Pascal, a relatively heavy context-switch hit on Nvidia hardware if you want to switch between compute and graphics on the same hardware subunits, as for whatever reason you can't run them concurrently on the same units like you can on GCN. Why? I unfortunately don't know, though I'd assume it's something not easily fixed or they'd have done so for Pascal.
 
Unfortunately PC DirectX 12 doesn't have priority adjustment per draw/dispatch. So the driver must do some black magic to ensure that concurrent tasks are not starved.

It's interesting to see that in some ways, D3D12 still doesn't provide enough control to developers. Does Vulkan fare any better in this regard?
 
A cursory search turned up queue priorities. This seems to conform with a concept like the quick response queue, but is not the granularity under discussion.
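For reference, the queue priorities that turn up in Vulkan are normalized floats assigned per queue at device-creation time, and the spec treats them only as a scheduling hint, so they match queue granularity rather than per-dispatch control. A minimal C++ sketch, assuming a compute-capable queue family index (with at least two queues) has already been picked; the function name is illustrative:

```cpp
#include <vulkan/vulkan.h>

// Hypothetical helper: create a device with two compute queues from the same
// family, one "background" and one "urgent", distinguished only by the
// per-queue priority floats fixed here at device creation.
VkDevice CreateDeviceWithPrioritizedQueues(VkPhysicalDevice physicalDevice,
                                           uint32_t computeFamilyIndex)
{
    const float priorities[2] = { 0.5f, 1.0f };  // background, urgent

    VkDeviceQueueCreateInfo queueInfo = {};
    queueInfo.sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfo.queueFamilyIndex = computeFamilyIndex;
    queueInfo.queueCount       = 2;
    queueInfo.pQueuePriorities = priorities;

    VkDeviceCreateInfo deviceInfo = {};
    deviceInfo.sType                = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    deviceInfo.queueCreateInfoCount = 1;
    deviceInfo.pQueueCreateInfos    = &queueInfo;

    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);
    // How strongly the implementation honors the relative priorities is
    // driver-specific; there is no per-dispatch priority in core Vulkan.
    return device;
}
```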
 
It's interesting to see that in some ways, D3D12 still doesn't provide enough control to developers. Does Vulkan fare any better in this regard?
Vulkan allows the driver to expose a number of compute queues, as well as vendor extensions, so there are a lot of ways it could be surfaced there. That said, I hadn't seen anything that looks like priority levels as of the last time I checked the API.

A cursory search turned up queue priorities. This seems to conform with a concept like the quick response queue, but is not the granularity under discussion.
The GCN ISA whitepaper mentions 4 priority levels (likely an ACE limit), but they aren't exposed in any APIs beyond some proprietary tools that I'm aware of. Console SDKs probably have those controls, per sebbbi's "PC DX12" remark. It wouldn't be surprising if the mapping were QRQ=0, Graphics=1, Compute=2, Long-Running Compute=3. So long as the distributor limits the occupancy of long-running shaders on a CU, jobs should keep flowing. I'd have to imagine AMD has worked through most of these hazards by now.
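To restate that guess in one place (this is pure speculation mirroring the sentence above, not something taken from AMD documentation or any API header):

```cpp
// Speculative mapping of GCN's four hardware priority levels to work types,
// as guessed above; the numbering and ordering are assumptions.
enum class SpeculatedGcnPriority : int {
    QuickResponseQueue = 0,  // presumably the most latency-sensitive work
    Graphics           = 1,
    Compute            = 2,
    LongRunningCompute = 3   // kept at limited CU occupancy so other jobs keep flowing
};
```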
 
GCN3 introduced context switching, while the consoles were considered part of the GCN2 (1.1, CI, etc.) IP pool.
Prioritization was something that hadn't really been tried by Sony's audio engineer when the launch behavior of the architecture for HSA audio was researched, which is indicative of the immaturity of the tools at the time.
I was imprecise in my earlier post by mentioning preemption; it's been a while since I've seen the presentation. I think he may have only mentioned prioritization.

Context switching is labeled as a GCN3 feature, although if there are tweaks in the consoles I suppose they might not be documented. Long-running compute probably wouldn't have been considered problematic if the PS4 had had preemption at the time of the first wave of games.

This goes to leaked data, but at least in the past the Xbox One indicated a strong preference for wavefronts that reached a completion point before they threatened the context switch to the system time-slice. That seems like it shouldn't have been a concern if preemption were available.
 
Makes sense for them to launch after Nvidia. Nvidia has spoiled their launches so many times in the past that I have a feeling they wanted to see what Nvidia's lineup would be and then price accordingly. Luckily for them, Nvidia didn't do a hard launch, so they have a bit of time before actually having to have their products out, especially considering that the Founders Edition will be the only thing available for a bit. And anyone getting a Founders Edition card is highly unlikely to have gotten an AMD card no matter what. It's also interesting that they made sure the rumored press conference won't be until the review embargo lifts on the GeForce 10x0 cards, likely to see exactly where Polaris 10 will fit.

If rumors are true that Polaris 10 is ~Fury X/980 Ti speed, it should compete favorably with the 1070 at least, in many ways making the situation somewhat similar to the one they were in with the Radeon 4870/4850 versus the GeForce 280/260 (albeit without the huge perf/W advantage that Radeon had over Nvidia at the time), although also eerily similar to the Radeon 3870 versus the Radeon 2900 XT (similar speed, but a smaller and more efficient chip).

That would mean that any remaining stock of Fiji chips would likely be reserved for the professional market. Although if Radeon Pro Duo doesn't do well, that's going to be stock just sitting there waiting to be written off.

Regards,
SB
 
Time for Raja Koduri to show us that historic jump in performance he was talking about :) The sweet time they are taking is not confidence-inspiring, though.

I hope that time is/was well spent.
 
Yes, at the very least we need AMD to stay relatively competitive even if they never grab the performance crown. But until something is actually shown and reviewed, it's all up in the air. DX12 at least levels the playing field a bit. Now it's up to AMD to deliver on hardware.

Regards,
SB
 
To be honest, speaking about the performance crown, I'm more interested in Vega vs. GP10x, and since we already know a bit about GP10x, the big unknown is Vega.
 
I think there are some pretty unrealistic expectations regarding Polaris 10 in this thread.

Polaris 10 is a 232mm^2 chip that is set to replace the lower-end Hawaii and GM204 variants (R9 290 and GTX970) within the same performance levels but at substantially lower price and power consumption.
AMD's statements about Polaris 10 have been rather consistent, which is to make the "minimum-VR" performance level more affordable.

I doubt very much that even the highest-end Polaris 10 will be able to significantly surpass R9 390X levels of performance, much less reaching the GTX 1070 or even the 980 Ti.
Polaris 10 will be about taking down the GM204 which is a current sales champion and phasing out Hawaii which is too old, too big, too power-hungry and is probably giving them too little margins.
Reaching 980 performance might be a bonus but may not even be a requirement.

In fact, a very possible reason for AMD not rushing any further news until Computex is the fact that Nvidia didn't (paper-)launch anything that will go directly against their 2016 lineup.

As a side note, if AMD does indeed officially present Polaris 10 and 11 in Macau before Computex, then during Computex itself I think they will be focusing heavily on design wins in laptops, 2-in-1s and AiOs.
If AMD manages to grab the current market of the GM107 in laptops and the GM204 in desktops, then they will gain a significant chunk of market share from Nvidia.
 
Yes, but if Fury is 300 watts and AMD said that Polaris is 2 times the perf/watt (Koduri said 2.5), then there should be a chip at 150 watts with that performance; if not, they will have failed to fulfill the performance target they themselves created with their marketing buzz.

Another thing is that Polaris is a 110-watt maximum-TDP chip that destroys its perf/W ratio if taken to faster clocks.
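For reference, the arithmetic behind that 150-watt expectation, taking Fiji's roughly 300 W board power as given and assuming the quoted perf/W multipliers hold at the same performance level:

\[
P_{\text{Polaris}} \approx \frac{P_{\text{Fiji}}}{\text{perf/W gain}} = \frac{300\,\mathrm{W}}{2} = 150\,\mathrm{W},
\qquad
\frac{300\,\mathrm{W}}{2.5} = 120\,\mathrm{W}
\]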
 