And why does the 750Ti show the same 32 command pattern as the 9 series? Shouldn't it be different since it doesn't support even the 31+1 of the 9 series?
It doesn't. It jumps after every 16.
Slide 22 of this presentation (http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf) lists many GCN-specific things that would be hard to port to PC DirectX. And if people start to do crazy stuff such as writing to the GPU command queue by a compute shader (to spawn tasks by the GPU), then porting to PC becomes almost impossible.

I think the EDRAM plus the HSA-like architecture of the consoles makes a load of console-specific, performance-centric design decisions moot in the PC space.
Also, if publishers hand over console games to some fly-by-night studio whose only job is to get the game working on PC, with just the art assets and the console gaming experience as a guide, then you get something like Batman: Arkham Knight.
I think gamers are learning an important lesson: there's no such thing as "full support" for DX12 on the market today.
There have been many attempts to distract people from this truth through campaigns that deliberately conflate feature levels, individual untiered features and the definition of "support." This has been confusing, and it has caused a great deal of unnecessary heartache and rumor-mongering.
Here is the unvarnished truth: Every graphics architecture has unique features, and no one architecture has them all. Some of those unique features are more powerful than others.
Yes, we're extremely pleased that people are finally beginning to see the game of chess we've been playing with the interrelationship of GCN, Mantle, DX12, Vulkan and LiquidVR.
Rasterizer Ordered Views and Conservative Raster. Thankfully, the techniques these enable (like global illumination) can already be done in other ways at high frame rates (see: DiRT Showdown).
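For anyone wondering how an engine actually finds out what a given card exposes: you query the D3D12 runtime per feature rather than trusting a blanket "supports DX12" label. Here is a minimal sketch, assuming an already-created ID3D12Device (PrintOptionalFeatures is just an illustrative helper name, not anything from the posts above):

#include <windows.h>
#include <d3d12.h>
#include <cstdio>

// Print the optional-feature caps the driver reports for this GPU.
void PrintOptionalFeatures(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS opts = {};
    if (SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                              &opts, sizeof(opts))))
    {
        // Rasterizer Ordered Views are a plain yes/no capability bit.
        std::printf("ROVs supported:           %d\n", opts.ROVsSupported);

        // Conservative rasterization and resource binding are reported as
        // tiers, which is exactly why blanket "full support" claims are slippery.
        std::printf("Conservative raster tier: %d\n", opts.ConservativeRasterizationTier);
        std::printf("Resource binding tier:    %d\n", opts.ResourceBindingTier);
    }
}

Different architectures fill in different combinations of these fields, which is the point being made above: support is per feature and per tier, not all-or-nothing.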
Tier 2 vs Tier 3 binding is a completely separate issue from Async Compute. It has to do with the number of root-level descriptors we can pass. In Tier 3, it turns out we basically never have to update a descriptor during a frame, but in Tier 2 we sometimes have to build a few. I don't think it's a significant performance issue though, just a technical detail.
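For context, "root level descriptors" are parameters placed directly in the D3D12 root signature, as opposed to descriptor tables that point into a descriptor heap; roughly speaking, the resource binding tiers differ in how much can be bound each way and how the tables must be kept populated. A minimal sketch of the two kinds of parameter, assuming a valid ID3D12Device and skipping error handling (MakeRootSignature is just an illustrative name, not anything from Nitrous):

#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Build a root signature with one root-level CBV and one descriptor table.
ComPtr<ID3D12RootSignature> MakeRootSignature(ID3D12Device* device)
{
    // A range of shader resource views reached through a descriptor table.
    D3D12_DESCRIPTOR_RANGE srvRange = {};
    srvRange.RangeType = D3D12_DESCRIPTOR_RANGE_TYPE_SRV;
    srvRange.NumDescriptors = 8;          // t0..t7
    srvRange.BaseShaderRegister = 0;

    D3D12_ROOT_PARAMETER params[2] = {};

    // Root-level descriptor: a CBV passed directly in the root signature.
    params[0].ParameterType = D3D12_ROOT_PARAMETER_TYPE_CBV;
    params[0].Descriptor.ShaderRegister = 0;   // b0
    params[0].ShaderVisibility = D3D12_SHADER_VISIBILITY_ALL;

    // Descriptor table: an extra indirection through a descriptor heap.
    params[1].ParameterType = D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE;
    params[1].DescriptorTable.NumDescriptorRanges = 1;
    params[1].DescriptorTable.pDescriptorRanges = &srvRange;
    params[1].ShaderVisibility = D3D12_SHADER_VISIBILITY_ALL;

    D3D12_ROOT_SIGNATURE_DESC desc = {};
    desc.NumParameters = 2;
    desc.pParameters = params;

    ComPtr<ID3DBlob> blob, errors;
    D3D12SerializeRootSignature(&desc, D3D_ROOT_SIGNATURE_VERSION_1, &blob, &errors);

    ComPtr<ID3D12RootSignature> rootSig;
    device->CreateRootSignature(0, blob->GetBufferPointer(), blob->GetBufferSize(),
                                IID_PPV_ARGS(&rootSig));
    return rootSig;
}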
Regarding the purpose of Async Compute, there are really two main reasons for it (a minimal D3D12 sketch of the queue setup follows this list):
1) It allows jobs to be cycled into the GPU during dormant phases. It can vaguely be thought of as the GPU equivalent of hyper-threading. Like hyper-threading, how important this is really depends on the workload and GPU architecture. In this case, it is used for performance. I can't divulge too many details, but GCN can cycle in work from an ACE incredibly efficiently. Maxwell's scheduler has no analog, just as a non-hyper-threaded CPU has no analog feature to a hyper-threaded one.
2) It allows jobs to be cycled in completely out of band with the rendering loop. This is potentially the more interesting case, since it can allow gameplay to offload work onto the GPU, as the latency of that work is greatly reduced. I'm not sure of the background of Async Compute, but it's quite possible that it is intended for use on a console as a sort of replacement for the Cell processors on a PS3. In a console environment, you really can use them in a very similar way. This could mean that jobs could even span frames, which is useful for longer, optional computational tasks.
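At the API level, "cycling work in out of band" just means feeding the GPU from a second, COMPUTE-type queue next to the usual DIRECT (graphics) queue, and only fencing where the two actually depend on each other. A minimal sketch in plain D3D12 (a generic illustration, not Oxide's code; the compute pipeline state and Dispatch() recording are omitted to keep it short):

#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

int main()
{
    ComPtr<ID3D12Device> device;
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));

    // The normal rendering loop lives on a DIRECT (graphics) queue.
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    ComPtr<ID3D12CommandQueue> gfxQueue;
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

    // A second, COMPUTE-type queue: work submitted here can be scheduled by
    // the GPU alongside (or in the gaps of) the graphics work.
    D3D12_COMMAND_QUEUE_DESC compDesc = {};
    compDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&compDesc, IID_PPV_ARGS(&computeQueue));

    // Record a compute command list. The root signature, pipeline state and
    // Dispatch() calls would go here; they are omitted in this sketch.
    ComPtr<ID3D12CommandAllocator> alloc;
    device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_COMPUTE, IID_PPV_ARGS(&alloc));
    ComPtr<ID3D12GraphicsCommandList> cl;
    device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_COMPUTE, alloc.Get(),
                              nullptr, IID_PPV_ARGS(&cl));
    cl->Close();

    // Submit on the compute queue and signal a fence when the GPU finishes.
    ID3D12CommandList* lists[] = { cl.Get() };
    computeQueue->ExecuteCommandLists(1, lists);
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
    computeQueue->Signal(fence.Get(), 1);

    // The graphics queue only stalls here if it actually consumes the compute
    // results; otherwise both queues keep the GPU fed independently.
    gfxQueue->Wait(fence.Get(), 1);
    return 0;
}

Whether the GPU genuinely overlaps the two queues, or quietly serializes them, is exactly the hardware difference being argued about in this thread.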
It didn't look to me like there was a hardware defect on Maxwell, just some unfortunate, complex interaction with the software scheduling trying to emulate it, which appeared to incur some heavy CPU costs. Since we were trying to use it for #1, not #2, it made little sense to bother. I don't believe there is any specific requirement that Async Compute be supported for D3D12, but perhaps I misread the spec.
Regarding trying to figure out bottlenecks on GPUs, it's important to note that GPUs do not scale simply by adding more cores, especially for graphics tasks, which have a lot of serial points. My $.02 is that GCN is a bit triangle-limited, which is why you see greater performance at 4K, where the average triangle size is 4x the triangle size at 1080p (4K has four times the pixels of 1080p, so with the same geometry each triangle covers roughly four times as many pixels).
I think you're also being a bit short-sighted on the possible use of compute for general graphics. It is not limited to post-processing. Right now, I estimate about 20% of our graphics pipeline occurs in compute shaders, and we are projecting this to be more than 50% in the next iteration of our engine. In fact, it is even conceivable to build a rendering pipeline entirely in compute shaders. For example, there are alternative rendering primitives to triangles which are actually quite feasible in compute. There was a great talk at SIGGRAPH this year on this subject. If someone gave us a card with only a compute pipeline, I'd bet we could build an engine around it which would be plenty fast. In fact, this was the main motivating factor behind the Larrabee project. The main problem with Larrabee wasn't that it wasn't fast; it was that they failed to map DX9 games to it well enough to be a viable product. I'm not saying that the graphics pipeline will disappear anytime soon (or ever), but it's by no means certain that it's necessary. It's quite possible that in 5 years' time Nitrous's rendering pipeline is 100% implemented via compute shaders.
https://docs.unrealengine.com/lates...ing/ShaderDevelopment/AsyncCompute/index.html
AsyncCompute should be used with caution as it can cause more unpredictable performance and requires more coding effort for synchronization.
The Rendering Hardware Interface (RHI) now supports asynchronous compute (AsyncCompute) for Xbox One. This is a good way to utilize unused GPU resources (Compute Units (CUs), registers and bandwidth), by running dispatch() calls asynchronously with the rendering.
(...)
This feature was implemented by Lionhead Studios.
We integrated it and intend to make use of it as a tool to optimize the XboxOne rendering. As more APIs expose the hardware feature, we would like to make the system more cross-platform.
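The "more coding effort for synchronization" part is the cross-queue fencing: nothing orders the async compute work against the graphics work unless you insert fences yourself. UE4's RHI wraps this, but underneath, the D3D12-level pattern looks roughly like the following sketch (SubmitFrameWithAsyncCompute is a hypothetical helper; all objects are assumed to be created elsewhere, and this is not the UE RHI API itself):

#include <windows.h>
#include <d3d12.h>

// Hypothetical helper: run one async compute job between two pieces of
// graphics work, ordered by a single shared fence.
void SubmitFrameWithAsyncCompute(ID3D12CommandQueue* gfxQueue,
                                 ID3D12CommandQueue* computeQueue,
                                 ID3D12CommandList*  gfxProduceInput,
                                 ID3D12CommandList*  computeJob,
                                 ID3D12CommandList*  gfxConsumeResults,
                                 ID3D12Fence*        fence,
                                 UINT64              base)  // last value signaled on the fence
{
    // 1. Graphics renders whatever the compute job will read.
    gfxQueue->ExecuteCommandLists(1, &gfxProduceInput);
    gfxQueue->Signal(fence, base + 1);

    // 2. The compute queue must not start until that input is ready...
    computeQueue->Wait(fence, base + 1);
    computeQueue->ExecuteCommandLists(1, &computeJob);
    computeQueue->Signal(fence, base + 2);

    // 3. ...and graphics must not consume the results until compute is done.
    gfxQueue->Wait(fence, base + 2);
    gfxQueue->ExecuteCommandLists(1, &gfxConsumeResults);

    // Get either fence value wrong and you either read stale data or
    // serialize the two queues completely -- hence the note of caution above.
}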
Although it backfired really hard because of its even-worse-than-expected performance on Kepler cards, Arkham Knight is nVidia's dream come true.
If only gamers would suck it up, buy the game and not complain like the entitled little whining brats that they are...
Your selective quoting abilities are great. Let me try it too.
Same link: https://docs.unrealengine.com/lates...ing/ShaderDevelopment/AsyncCompute/index.html
Mark Cerny said: And there's a lot of features in the GPU to support asynchronous fine-grain computing. 'Asynchronous' is just saying it's not directly related to graphics, 'fine grain' is just saying it's a whole bunch of these running simultaneously on the GPU. So I think we're going to see the benefits of that architecture around 2016 or so.
With the PlayStation 4, it's even such things as the shader cores have a beautiful instruction set and can be programmed in assembly. If you were willing to invest the time to do that, you could do some incredibly efficient processing on the GPU for graphics or for anything else. But the timeframe for that kind of work would not be now. I don't even think it would be three years from now.
I just find it curious that you chose to quote the only sentence in the whole damn page that seems to demean the use of Async compute, that's all.

Yeah, so that's why I linked the entire page? Your point? Outside of useless banter?
A test run on a GK110 board would give us a bit more clarity.
I just find it curious that you chose to quote the only sentence in the whole damn page that seems to demean the use of Async compute, that's all.
Very curious.
One idea I had: if this is an internal processor, or potentially a SIMD running a firmware routine, it's a 32-slot structure.

The boundary is 31, 64, 96 and 128. So the first boundary is the outlier in this case, though it appears that the first boundary always behaves this way on the 3 NVidia architectures documented so far...
If the GPU is juggling two distinct modes internally, it might be that it cannot readily run both at the same time, hence the discussion of an expensive context switch.

If nVidia's chips need to add the rendering time to a compute task with even one active kernel, doesn't this mean that "Async Compute" is not actually working, and that nVidia's hardware, at least in this test, does not support Async Compute? Even if the driver does allow Async Compute tasks to be submitted, the hardware just seems to be doing rendering+compute in a serial fashion, not in parallel at all.
AMD_Robert said: Oxide effectively summarized my thoughts on the matter. NVIDIA claims "full support" for DX12, but conveniently ignores that Maxwell is utterly incapable of performing asynchronous compute without heavy reliance on slow context switching.
Kollok said: Curiously, their driver reported this feature was functional, but attempting to use it was an unmitigated disaster in terms of performance and conformance, so we shut it down on their hardware. As far as I know, Maxwell doesn't really have Async Compute, so I don't know why their driver was trying to expose that.
(...)
AFAIK, Maxwell doesn't support Async Compute, at least not natively. We disabled it at the request of Nvidia, as it was much slower to try to use it than not to.
(...)
It didn't look to me like there was a hardware defect on Maxwell, just some unfortunate, complex interaction with the software scheduling trying to emulate it, which appeared to incur some heavy CPU costs.