No DX12 Software is Suitable for Benchmarking *spawn*

Another aside: wow, Intel's driver situation is not good. Only the latest beta driver supports Doom, and a previous beta driver supports Doom's Vulkan path... just not the latest one.
To be fair, the developers didn't even test on Intel because even a 580 is a fair bit below the min spec. Given the lack of OpenGL and Vulkan games out there (and the general situation with both APIs and their extensions), it's unreasonable to assume anything that hasn't been tested will work. I was actually fairly shocked to find that the GL path does work fairly decently as of the latest driver update, given how far below the min spec (~2.5 TFLOP GPU @ 720p) even the 580 is. That said, their "min spec" is a ways off consoles, so not sure what's up with that...

Realistically in GL and Vulkan though if you haven't tested it it's fair to assume it's broken - and that goes for every IHV. DX is a much more solid API from that perspective and obviously gets more attention on Windows as the vast majority of games use DX. I don't think most folks would want us to invert those priorities :)
 
Regarding Time Spy, it's definitely a bit odd that it goes out of its way to use async compute, but doesn't even support FL12 stuff like bindless that is fairly ubiquitous at this point. That said, benchmarks are mostly useful for being pretty these days as engine designs have diverged quite a bit from the goals of these benchmark vendors, so I'll enjoy it for what it is :)

And yeah, A-buffer = yuck. Let's all emphasize how much it sucks that AMD doesn't have ROVs yet I guess! :S
 
IIRC it's the lowest common denominator that includes GCN, which still doesn't support conservative rasterization and rasterizer ordered views.

It would have been nice to see those two new features being used in this benchmark :(

Ah poop I also forgot to mention that FL 12_0 doesn't require either conservative rasterization or rasterizer ordered views, so neither of those would be why Time Spy is FL 11_0.

As mentioned, it's likely FL 11_0 as that's the highest that is supported by Fermi (not sure about Kepler), and they'd want to support as broad a selection of relatively modern graphics cards as they can.

Regards,
SB
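For what it's worth, targeting that lowest common denominator at the API level just means creating the device against the minimum feature level and treating everything above it as optional caps. A minimal sketch of that (my own illustration, not Futuremark's code; adapter selection and error handling omitted):

```cpp
#include <d3d12.h>
#include <dxgi1_4.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

// Create the device against FL 11_0, the lowest feature level D3D12 supports.
// Any DX12-capable GPU (Kepler, Maxwell, GCN, Intel Gen9, ...) will pass this;
// anything fancier is queried afterwards as an optional capability.
ComPtr<ID3D12Device> CreateBaselineDevice(IDXGIAdapter1* adapter)
{
    ComPtr<ID3D12Device> device;
    // Fails only if the adapter can't do D3D12 at feature level 11_0 at all.
    D3D12CreateDevice(adapter, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));
    return device;  // null on failure
}
```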
 
IIRC it's the lowest common denominator that includes GCN, which still doesn't support conservative rasterization and rasterizer ordered views.

It would have been nice to see those two new features being used in this benchmark :(
Speaking of feature levels, the lowest common denominators are actually called Kepler and Maxwell 1 - those are feature level 11_0 cards. GCN Gen 1 GPUs are 11_1 (they do not support min/max filtering or MSAA for sparse/tiled/reserved resources), while all other GCN parts are actually 12_0. Intel Haswell/Broadwell are 11_1, and Skylake is 12_1 (and provides more complete feature support than Pascal).
Pixel Sync (ROVs) and Conservative Rasterization tier 1 are required for feature level 12_1.
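For anyone who wants to see what their own card reports, the relevant bits are all exposed through ID3D12Device::CheckFeatureSupport - the highest feature level via D3D12_FEATURE_FEATURE_LEVELS, and the two 12_1 gates (ROVs, conservative rasterization tier) via D3D12_FEATURE_D3D12_OPTIONS. A rough sketch, assuming a device has already been created:

```cpp
#include <d3d12.h>
#include <cstdio>

// Print the max supported feature level plus the two caps that gate FL 12_1:
// rasterizer ordered views and conservative rasterization tier >= 1.
void PrintFeatureLevelInfo(ID3D12Device* device)
{
    static const D3D_FEATURE_LEVEL requested[] = {
        D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_11_1,
        D3D_FEATURE_LEVEL_12_0, D3D_FEATURE_LEVEL_12_1,
    };
    D3D12_FEATURE_DATA_FEATURE_LEVELS levels = {};
    levels.NumFeatureLevels = _countof(requested);
    levels.pFeatureLevelsRequested = requested;
    device->CheckFeatureSupport(D3D12_FEATURE_FEATURE_LEVELS, &levels, sizeof(levels));

    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS, &options, sizeof(options));

    printf("Max feature level:        0x%x\n", (unsigned)levels.MaxSupportedFeatureLevel);
    printf("ROVs supported:           %d\n", options.ROVsSupported);
    printf("Conservative raster tier: %d\n", options.ConservativeRasterizationTier);
}
```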
 
Ah poop I also forgot to mention that FL 12_0 doesn't require either conservative rasterization or rasterizer ordered views, so neither of those would be why Time Spy is FL 11_0.

As mentioned, it's likely FL 11_0 as that's the highest that is supported by Fermi (not sure about Kepler), and they'd want to support as broad a selection of relatively modern graphics cards as they can.

Regards,
SB
Kepler and Maxwell 1 are both 11_0 and are likely the reason that's the spec used (as opposed to 11_1 for GCN 1.0). Fermi's DX12 driver was never finished/released, so it doesn't really matter.
 
Kepler and Maxwell 1 are both 11_0 and are likely the reason that's the spec used (as opposed to 11_1 for GCN 1.0). Fermi's DX12 driver was never finished/released, so it doesn't really matter.

Damn, I'd forgotten about that. Yeah, I thought I'd remembered seeing that Kepler and Maxwell 1 were both 11_0, but couldn't remember for sure so didn't want to mention them. As well, I couldn't remember if GCN 1.0 was 11_0 or 11_1. Thanks for the clarification.

Regards,
SB
 
Regarding Time Spy, it's definitely a bit odd that it goes out of its way to use async compute
Agree - it's hard to imagine a developer who would want to spend 20% of rendering time on low-occupancy stuff, at least on PC.

PS: I hate inter-frame async - why would somebody want to trade a few % gain in framerate for a many-% loss in input latency?
 
Regarding Time Spy, anyone have any idea what they are doing with 70 million compute shader invocations per frame? It seems rather excessive.
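For rough scale (assuming Time Spy's default 2560x1440 render resolution, which is my assumption here): that's about 3.7 million pixels, so 70 million invocations works out to roughly 19 full-screen dispatches per frame - a lot, but maybe not crazy if several passes (lighting, particles, post-processing and the like) run in compute at full or reduced resolution.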
 
Anyone know why this shows a performance gain for Pascal when async is used?

This is the only case so far where we see gains with NVIDIA hardware when "asynchronous compute" is used. I expect we'll be back to the usual no gains with actual games, but it would be interesting to understand what's happening in this particular case. Developers are still "working with nvidia" to get gains in their games with async, so I would guess Futuremark really worked with NVIDIA to get this result.

And regarding CR and ROV in 12_1, has anyone tested an implementation of these on Maxwell/Pascal? It could well be pointless if the performance is worse for the same end result.
 
Because even NVIDIA's GPUs have execution bubbles that can be filled.

Sure, but what's different about Time Spy such that the results differ from games? It's likely taking advantage of Pascal's load balancing and better pre-emption, which seem to only help with shader efficiency. Last time I saw that discussed here, I think I saw people saying it was not asynchronous compute.

It would be nice to have an analysis of this rather than what sites seem to be doing, which is taking it at face value. "What is going on here, and will it really be this way in any significant number of games?" Seems a significant discrepancy.
 
Sure, but what's different about Time Spy such that the results differ from games? It's likely taking advantage of Pascal's load balancing and better pre-emption, which seem to only help with shader efficiency. Last time I saw that discussed here, I think I saw people saying it was not asynchronous compute.

It would be nice to have an analysis of this rather than what sites seem to be doing, which is taking it at face value. "What is going on here, and will it really be this way in any significant number of games?" Seems a significant discrepancy.


You can still see async benefits in games as well, but at different resolutions you are going to have bottleneck shifts, and that will affect the noticeable async benefits. You might even see this with Time Spy too - I didn't see any reviews using different resolutions.

Also, depending on what developers have used and how they have programmed for async, it could affect different IHVs' hardware differently, so things like that we won't really know.

And it's not preemption on Pascal that is being used, it's dynamic load balancing - preemption defeats the purpose of async compute.
 
You can still see async benefits in games as well, but at different resolutions you are going to have bottleneck shifts, and that will affect the noticeable async benefits. You might even see this with Time Spy too - I didn't see any reviews using different resolutions.

Also, depending on what developers have used and how they have programmed for async, it could affect different IHVs' hardware differently, so things like that we won't really know.

And it's not preemption on Pascal that is being used, it's dynamic load balancing - preemption defeats the purpose of async compute.

I do think it should be possible to see gains in some situations with Pascal's load balancing. But do they even have to use the compute queue for this? Compute tasks in the graphics queue could be assigned to some clusters while the rest handle graphics, or assigned to idle clusters while a graphics task is running, without touching the compute queue. My understanding of the load balancing is that previously there was a fixed division of shaders within the graphics queue - some doing compute, some doing graphics for each task - and this did not change while the task was running. With Pascal the clusters can be changed on the fly to do either compute or graphics, to reduce idle states where either graphics or compute finishes before the entire task is done. Since the previous situation was all in the graphics queue, would it be safe to assume what Pascal is doing is also in the graphics queue? That would mean it's not actually asynchronous compute that's going on with Time Spy and Pascal. It is still concurrent, but it sounds like it was concurrent before as well.
 
I do think it should be possible to see gains in some situations with Pascal's load balancing. But do they even have to use the compute queue for this? Compute tasks in the graphics queue could be assigned to some clusters while the rest handle graphics, or assigned to idle clusters while a graphics task is running, without touching the compute queue. My understanding of the load balancing is that previously there was a fixed division of shaders within the graphics queue - some doing compute, some doing graphics for each task - and this did not change while the task was running. With Pascal the clusters can be changed on the fly to do either compute or graphics, to reduce idle states where either graphics or compute finishes before the entire task is done. Since the previous situation was all in the graphics queue, would it be safe to assume what Pascal is doing is also in the graphics queue? That would mean it's not actually asynchronous compute that's going on with Time Spy and Pascal. It is still concurrent, but it sounds like it was concurrent before as well.


The queues aren't where the problem is - all DX12 hardware must expose graphics, compute and copy queues, so that isn't something they can do without; it has to be there.

Well, two different things: at the cluster/block level, all DX12 graphics cards have no problem with load balancing.

At the SMX level prior to Pascal, once load balancing was done initially (the partitioning of the SMXs), it could not change after that. So if, in an application, the amount of compute or graphics changes percentage-wise, that would be disastrous on pre-Pascal architectures on nV's side, because now you will end up with underutilized units - it would actually be better to do things sequentially - and this will happen in a game environment, so this is why we see performance deficits when Maxwell 2 is forced to do async compute. The only way for pre-Pascal architectures on nV's side to change the partitioning is for everything to be stopped so re-partitioning can take place, pretty much a context switch on all kernels and queues, which of course is probably even worse than leaving the partition alone lol.
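As a side note on the queue point above: at the API level those queues are just something the application creates - how (or whether) the compute queue actually runs concurrently with graphics is entirely up to the driver and hardware. A minimal sketch, assuming a valid device and with error handling omitted:

```cpp
#include <d3d12.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

// One queue of each of the three D3D12 queue types. Any D3D12 driver will
// give you all three; whether compute work actually overlaps graphics work
// is decided by the driver/hardware, not by the API.
struct FrameQueues {
    ComPtr<ID3D12CommandQueue> direct;   // graphics + compute + copy
    ComPtr<ID3D12CommandQueue> compute;  // compute + copy
    ComPtr<ID3D12CommandQueue> copy;     // copy only
};

FrameQueues CreateQueues(ID3D12Device* device)
{
    FrameQueues q;
    D3D12_COMMAND_QUEUE_DESC desc = {};

    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&q.direct));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&q.compute));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COPY;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&q.copy));

    return q;
}
```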
 
The queues aren't where the problem is - all DX12 hardware must expose graphics, compute and copy queues, so that isn't something they can do without; it has to be there.

Well, two different things: at the cluster/block level, all DX12 graphics cards have no problem with load balancing.

At the SMX level prior to Pascal, once load balancing was done initially (the partitioning of the SMXs), it could not change after that. So if, in an application, the amount of compute or graphics changes percentage-wise, that would be disastrous on pre-Pascal architectures on nV's side, because now you will end up with underutilized units - it would actually be better to do things sequentially - and this will happen in a game environment, so this is why we see performance deficits when Maxwell 2 is forced to do async compute.

I think that lines up with what I was saying. The queues exist; the question is whether things are being done on both queues (compute and graphics) at the same time. This would be what I understand to be async compute (though it seems to go beyond shaders).

Pascal's load balancing looks like better use of the graphics queue to do compute and graphics - more efficient concurrent execution.
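To make that distinction concrete, this is roughly what "using the compute queue" looks like from the application's side: the engine records the graphics and compute work independently, submits them to the two queues, and only inserts fences where there is a real dependency - whether the two actually overlap is then the driver's/hardware's call. A hand-wavy sketch (not Futuremark's code; queue, command list and fence creation assumed elsewhere, names made up for illustration):

```cpp
#include <d3d12.h>

// gfxQueue / computeQueue: queues of type DIRECT and COMPUTE.
// gfxList: e.g. the G-buffer pass; computeList: an independent compute pass
// (particles, light culling, ...). All names here are hypothetical.
void SubmitFrame(ID3D12CommandQueue* gfxQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12CommandList* gfxList,
                 ID3D12CommandList* computeList,
                 ID3D12Fence* fence,
                 UINT64& fenceValue)
{
    // Kick off the graphics work on the direct queue.
    gfxQueue->ExecuteCommandLists(1, &gfxList);

    // Kick off compute work that does not depend on the graphics pass.
    // A driver that supports concurrent execution can overlap the two;
    // one that doesn't is free to serialize them (the "async off" case).
    computeQueue->ExecuteCommandLists(1, &computeList);

    // Publish a fence value when the compute work finishes...
    const UINT64 computeDone = ++fenceValue;
    computeQueue->Signal(fence, computeDone);

    // ...and make subsequent graphics work that consumes the compute results
    // wait for it on the GPU timeline (the CPU is not blocked here).
    gfxQueue->Wait(fence, computeDone);
}
```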

Some info from someone at Futuremark was posted on the AnandTech forums.

http://forums.anandtech.com/search.php?searchid=2805545
 
From one of the Futuremark developers on the Steam forums (disclaimer: the thread is a hot mess, so click at your own risk... seriously, do not click).

http://steamcommunity.com/app/223850/discussions/0/366298942110944664/

Replying to claims that Time Spy doesn't support "real" Async Compute (whatever the hell that means... meh):
It was not tailored for any specific architecture. It overlaps different rendering passes for asynchronous compute, in parallel when possible. Drivers determine how they process these - multiple parallel queues are filled by the engine.

The reason Maxwell doesn't take a hit is because NVIDIA has explicitly disabled async compute in Maxwell drivers. So no matter how much we pile things into the queues, they cannot be set to run asynchronously because the driver says "no, I can't do that". Basically the NV driver tells Time Spy to go "async off" for the run on that card. If NVIDIA enables async compute in the drivers, Time Spy will start using it. Performance gain or loss depends on the hardware & drivers.

Yes it is. But the engine cannot dictate what the hardware has available or not.

Async compute is about utilizing "idle" shader units. The slower the card, the fewer idle ones you have. Less capable hardware may also be hard pressed to utilize all of them even if the engine asks nicely. Also there may be limitations as to what workloads in the engine *can* run in parallel. Yes, Time Spy is very graphics-heavy, since, well, it's a graphics benchmark. But even there many of the rendering passes have compute tasks that can use this.

Ultimately some AMD cards gain quite a bit (i.e. they have a lot of shader units idling while rendering and they are very good at using them for the available parallel loads). Some AMD cards gain less or not at all (either less capable at parallelizing, fewer idle shader units, or no idle shader units at all - for example a HD 7970 is hard pressed to have any to "spare").

Some NVIDIA cards cannot do this at all. The driver simply says "hold your horses, we'll do this nicely in order". Some NVIDIA cards can do some of it. They might use another way than AMD (more driver/software based), but the end result is the same - the card hardware is capable of doing more through some intelligent juggling of the work.

Regarding the bolded part... isn't the 7970 actually getting some rather substantial gains in Doom with Vulkan? I'm guessing that it's mainly due to lower CPU overhead and Shader Intrinsics. Does anyone have async on/off benches on a 7970?

Same FM dev on the AnandTech forums:

http://forums.anandtech.com/showpost.php?p=38362082&postcount=30

http://forums.anandtech.com/showpost.php?p=38362194&postcount=46
 
So, a simple question: why is it that when Nvidia disables 'async compute' at the driver level on anything older than Pascal (to prevent performance tanking, obviously), the Time Spy score is valid, but when you do the same and disable 'async compute' in the Time Spy benchmark itself, it invalidates your score?
 