No DX12 Software is Suitable for Benchmarking *spawn*

I think AMD covered that too, specifically noting the difference between DX11/DX12:

[AMD slide comparing DX11 and DX12 multi-GPU]
 
Yeah I was hoping to also see Nvidia focus on the benefits of Dx12 for multiple GPU rendering. Unfortunately it appears they've done the stupid thing and doubled down on AFR through the driver along with reinforcing the need for physical bridges.

But since that doesn't lock you into their eco-system, it's not something they'd want to promote.

Regards,
SB
 
That's not really a difference; Crossfire and SLI had options for SFR in older DX versions too, going all the way back to DX9. ;)

It's actually a huge difference. SFR was attempted (as well as other methods, like AMD's alternating-tiles mode), but it didn't work well because there was no way for the driver to intelligently split the workload across two devices. A lot of work got duplicated, since you can't just arbitrarily cut a scene in half (what if triangles overlap, as an extremely simple example?). Thus performance at best saw minor speedups and at worst was actually lower than with no SFR at all. SFR would have been the ideal solution if they could have gotten it to work well, but neither Nvidia nor AMD could. And the situation only got worse for driver-level SFR once games stopped using pure forward rendering.

Dx12 gives the developer the ability to intelligently split the workload across multiple GPUs. This allows scaling regardless of the complexity of the rendering engine. It doesn't even have to follow the SFR model of multi-GPU; the developer is free to implement whatever method they want to split the workload. Just hopefully they aren't stupid enough, or lazy enough, to do AFR, as that is by far the worst method possible.
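For what it's worth, the "explicit" part is fairly small at the API level: the application enumerates the adapters itself and creates a device on each one, and everything after that (how the work gets split) is the engine's business. A rough sketch of that setup, error handling omitted, with the actual work-distribution policy left completely open:

```cpp
// Sketch only: create one ID3D12Device per hardware adapter (explicit multi-adapter).
// Error handling omitted; real code would also check per-adapter feature support.
#include <windows.h>
#include <d3d12.h>
#include <dxgi1_4.h>
#include <wrl/client.h>
#include <vector>

using Microsoft::WRL::ComPtr;

std::vector<ComPtr<ID3D12Device>> CreateDevicePerAdapter()
{
    ComPtr<IDXGIFactory4> factory;
    CreateDXGIFactory1(IID_PPV_ARGS(&factory));

    std::vector<ComPtr<ID3D12Device>> devices;
    ComPtr<IDXGIAdapter1> adapter;
    for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i)
    {
        DXGI_ADAPTER_DESC1 desc;
        adapter->GetDesc1(&desc);
        if (desc.Flags & DXGI_ADAPTER_FLAG_SOFTWARE)
            continue; // skip WARP / software adapters

        ComPtr<ID3D12Device> device;
        if (SUCCEEDED(D3D12CreateDevice(adapter.Get(), D3D_FEATURE_LEVEL_11_0,
                                        IID_PPV_ARGS(&device))))
            devices.push_back(device);
    }
    // From here the developer decides how to distribute work across 'devices':
    // SFR halves, tiles, per-pass splits, or anything else.
    return devices;
}
```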

Hell, I initially tried Crossfire back with the Radeon X850 XT just because their tiled M-GPU method looked like it had the best potential to do a balanced split of a scene between cards. Yeah, that didn't go well, and I ended up only using AFR, which is just horrible.

Regards,
SB
 
So, we need to hope that some developers out there are not rushed by their publishers and show more dedication to a 5-7% market share than they do for the whole PC gaming community when they do their usual console port - allegedly optimized for PC, when sometimes you even get "Press Triangle/Circle/Box to continue"? Hm, hope dies last.
 
So, we need to hope that some developers out there are not rushed by their publishers and show more dedication to a 5-7% market share than they do for the whole PC gaming community when they do their usual console port - allegedly optimized for PC, when sometimes you even get "Press Triangle/Circle/Box to continue"? Hm, hope dies last.

Well it does happen from time to time. I'm actually interested to see if the next Civilization game does M-GPU in the engine again like they did with Civ V.

But yeah, the other option is for reusable game engines (UE, Unity, Frostbite, Chrome Engine, etc.) to offer some basic level of M-GPU which may or may not be optimal for the developer's game but would offer either some benefits, or a base to allow the developer to expand/customize it to their game's requirements.

And that 5-7% could potentially grow if there was something used other than AFR. I know myself and quite a few others would give M-GPU a try again if it happened.

Regards,
SB
 
Well it does happen from time to time. I'm actually interested to see if the next Civilization game does M-GPU in the engine again like they did with Civ V.
Regards,
SB
If the Civ VI engine is in the same continuity as the previous engine, I'm quite sure they will; they even supported SFR already in Civ: BE (the Mantle version).
 
Well it does happen from time to time.

From time to time, agreed. But for mGPU to have any merit, it needs to happen more often than not. At least.

It's just that, in day-to-day business, I see vastly more games where the developers, for whatever reasons (and there may be good reasons, mind you), do not focus much on advanced techniques in any area.
 
Yeah I was hoping to also see Nvidia focus on the benefits of Dx12 for multiple GPU rendering. Unfortunately it appears they've done the stupid thing and doubled down on AFR through the driver along with reinforcing the need for physical bridges.

But since that doesn't lock you into their eco-system, it's not something they'd want to promote.

Regards,
SB
You can totally do explicit mGPU on NVIDIA cards. We've seen that first-hand with AotS, and by and large it even works well.:p

NVIDIA's position is basically that implicit mGPU probably isn't going away any time soon, and that it's handy to be able to keep mGPU traffic off of the PCIe bus. But the use of SLI bridges is not mutually exclusive from supporting explicit mGPU.
 
You can totally do explicit mGPU on NVIDIA cards. We've seen that first-hand with AotS, and by and large it even works well.:p

NVIDIA's position is basically that implicit mGPU probably isn't going away any time soon, and that it's handy to be able to keep mGPU traffic off of the PCIe bus. But the use of SLI bridges is not mutually exclusive from supporting explicit mGPU.

Oh I realize that they can support it, but they won't promote it.

Regards,
SB
 
Screen-space techniques and temporal reprojection do not prevent multi-GPU techniques. AFR doesn't work well with techniques using last-frame data, but who would want to use AFR in DX12? AFR adds one extra frame of latency. AFR was fine when developers could not write their own custom load balancing inside a frame: the DX9-11 driver had to split the workload between two GPUs automatically, and splitting odd/even frames worked best with no extra developer support needed. But it was a big compromise.

I would have included some multi-GPU and async compute thoughts in my Siggraph 2015 presentation if I had had more time. With GPU-driven rendering, it is highly efficient to split the viewport in half and do precise (sub-object) culling for both sides to evenly split the workload between two GPUs. This is much better than AFR.
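To illustrate just the raster-side half of that idea (the per-half GPU-driven culling is the interesting part and isn't shown here), a sketch of restricting each GPU to one half of the frame, assuming a plain 50/50 split:

```cpp
// Sketch only: keep the full viewport (so the projection is unchanged) and use a
// scissor rect to restrict this GPU to one half of the frame. A culling pass
// (not shown) would also reject objects that don't touch this half.
#include <d3d12.h>

void RestrictToHalf(ID3D12GraphicsCommandList* cmdList,
                    UINT width, UINT height, bool topHalf)
{
    D3D12_VIEWPORT vp = {};
    vp.Width    = static_cast<FLOAT>(width);
    vp.Height   = static_cast<FLOAT>(height);
    vp.MaxDepth = 1.0f;

    const LONG splitY = static_cast<LONG>(height / 2);
    D3D12_RECT scissor = {};
    scissor.left   = 0;
    scissor.right  = static_cast<LONG>(width);
    scissor.top    = topHalf ? 0 : splitY;
    scissor.bottom = topHalf ? splitY : static_cast<LONG>(height);

    cmdList->RSSetViewports(1, &vp);
    cmdList->RSSetScissorRects(1, &scissor);
}
```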

Temporal data reuse is a very good technique. Why would anyone want to render every pixel completely from scratch at 60 fps? The change between two sequential images is minimal. Most data can/could be reused to speed up the rendering and/or to improve the quality. It is also a well-known (and major) optimization to reuse shadow map data between frames; it's easy to save 50% or more of your shadow map rendering time with it. AFR chokes on this optimization as well. Sure, you can brute-force refresh everything every frame if you detect AFR, but this makes no sense. DX12 finally allows developers to use modern optimized techniques and make them work perfectly with multi-GPU. There is no compromise.
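The shadow map reuse point boils down to logic roughly like the following (hypothetical names, just to show why depending on last frame's data clashes with AFR, where "last frame" lives on the other GPU):

```cpp
// Sketch of per-light shadow map caching: only re-render a light's shadow map
// when something that affects it has changed since last frame.
#include <cstdint>

struct ShadowCacheEntry
{
    uint64_t lightStateHash = 0;  // light position/direction/parameters
    uint64_t casterSetHash  = 0;  // hash of the casters intersecting the light
    bool     valid          = false;
};

bool NeedsShadowUpdate(ShadowCacheEntry& entry,
                       uint64_t lightStateHash, uint64_t casterSetHash)
{
    const bool dirty = !entry.valid ||
                       entry.lightStateHash != lightStateHash ||
                       entry.casterSetHash  != casterSetHash;
    if (dirty)
    {
        entry.lightStateHash = lightStateHash;
        entry.casterSetHash  = casterSetHash;
        entry.valid          = true;
    }
    return dirty; // false -> reuse last frame's shadow map as-is
}
```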
There's a question about explicit multi-GPU that has been stuck in my mind for a while now, but I haven't come around to asking someone yet... Why does it seem that no one has experimented with this on DX11?
I mean, you can fire up two D3D11 devices just the same as D3D12 devices. I realize performance might not be quite up there with DX12, but is the hit just so big that no one even bothered to try?
 
Until I read the footnotes, I discount GoW. The rest seems reasonable, although deceptive for Hitman given the issues you guys say it has in chapter 1 versus chapter 2.

Overclock3D (where that image comes from) did test it at 4K and found the Fury X double digits faster.

http://www.overclock3d.net/reviews/...son_16_4_2_-_directx_12_performance_boosted/5

Though the problem with such benchmarks is that the game can favor different cards in different conditions. For instance, I looked at a benchmark of Tomb Raider pitting the Fury X against the Titan X, and while the Titan X was mostly faster, the video still had parts where the Fury X was consistently around 10% faster.
 
I think AMD covered that too, specifically noting the difference between DX11/DX12:

[AMD slide comparing DX11 and DX12 multi-GPU]
It's worth noting that a custom SFR split doesn't have to be 50/50. You can dynamically adjust the split based on GPU performance differences. That works much better in situations where you upgrade your GPU (GTX 970 + GTX 1080, for example), or with a fast iGPU + an entry-level dGPU.
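The dynamic adjustment really only amounts to something like this (numbers are purely illustrative): nudge the split line each frame based on the measured GPU times, so the faster card gets a larger share of the screen.

```cpp
// Sketch: rebalance an SFR split based on last frame's GPU timings.
#include <algorithm>

float UpdateSplitRatio(float currentRatio,         // fraction of the screen given to GPU0
                       float gpu0Ms, float gpu1Ms) // measured frame times per GPU
{
    // If GPU0 took longer than GPU1, shrink its share, and vice versa.
    // Damped so the split line doesn't oscillate from frame to frame.
    const float imbalance = (gpu1Ms - gpu0Ms) / (gpu0Ms + gpu1Ms);
    const float damping   = 0.1f;
    return std::clamp(currentRatio + damping * imbalance, 0.2f, 0.8f);
}
```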

You could also tile the scene instead of splitting it. If you have a precise culling system, splitting the rendering into, for example, 128x128 or 256x256 pixel tiles is practical. If the tile fits completely inside the ROP caches (or L2 on Nvidia, or L3 on Intel), there is zero memory bandwidth cost for overdraw. Blending to an HDR render target gets the biggest wins (I have measured a 2x+ performance gain in a simple huge-overdraw particle blending case). GPU caches are getting bigger, meaning that you could use bigger tiles -> less vertex overhead (sub-object culling is not pixel precise).
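A trivial sketch of dealing tiles out to the GPUs (round-robin here purely for illustration; a real implementation would balance by estimated tile cost, and the tile size would be tuned to the cache sizes mentioned above):

```cpp
// Sketch: carve the frame into fixed-size tiles and assign them to GPUs.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Tile { uint32_t x, y, width, height; };

std::vector<std::vector<Tile>> AssignTiles(uint32_t frameW, uint32_t frameH,
                                           uint32_t tileSize, uint32_t gpuCount)
{
    std::vector<std::vector<Tile>> perGpu(gpuCount);
    uint32_t index = 0;
    for (uint32_t y = 0; y < frameH; y += tileSize)
        for (uint32_t x = 0; x < frameW; x += tileSize)
        {
            const Tile t{ x, y, std::min(tileSize, frameW - x),
                                std::min(tileSize, frameH - y) };
            perGpu[index++ % gpuCount].push_back(t); // naive round-robin
        }
    return perGpu;
}
```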

Tiled rendering also means that you have the final depth + g-buffer of some regions ready very early in the frame. You can start lighting and processing these regions using asynchronous compute while you are rendering more tiles with the ROPs + geometry pipe. This way you can utilize the fixed-function hardware for a much longer period of time during a frame (get more out of the triangle rate and fill rate), while simultaneously filling the CUs with compute shader work. This is just one way to do it (with its own drawbacks, of course). Explicit multi-adapter and asynchronous compute give some nice new ways to implement things efficiently. But big changes to engine architecture are unfortunately needed to get full advantage of these new features. Maybe we need to wait a few years until the big engines catch up.
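The graphics/compute overlap described above maps onto an ordinary two-queue + fence handshake in DX12. A bare-bones sketch (command list contents, PSOs and barriers all omitted):

```cpp
// Sketch: the graphics queue signals a fence after a tile's g-buffer is done;
// the compute queue waits on that value (on the GPU timeline, not the CPU)
// before launching the lighting work for the same tile.
#include <d3d12.h>

void SubmitTile(ID3D12CommandQueue* graphicsQueue,
                ID3D12CommandQueue* computeQueue,
                ID3D12CommandList*  tileGBufferList,
                ID3D12CommandList*  tileLightingList,
                ID3D12Fence*        fence,
                UINT64&             fenceValue)
{
    graphicsQueue->ExecuteCommandLists(1, &tileGBufferList);
    graphicsQueue->Signal(fence, ++fenceValue);

    computeQueue->Wait(fence, fenceValue);   // GPU-side wait; the CPU keeps recording
    computeQueue->ExecuteCommandLists(1, &tileLightingList);
}
```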
 
There's a question about explicit multi-GPU that has been stuck in my mind for a while now, but I haven't come around to asking someone yet... Why does it seem that no one has experimented with this on DX11?
I mean, you can fire up two D3D11 devices just the same as D3D12 devices. I realize performance might not be quite up there with DX12, but is the hit just so big that no one even bothered to try?
DX11 shared resources (between two devices) have huge limitations:
https://msdn.microsoft.com/en-us/library/windows/desktop/ff476531(v=vs.85).aspx

I haven't tried two devices + shared resources, but I would expect that performance is not that great, as this feature is not designed for multi-GPU use (you need to flush manually to see the results, most likely reducing the parallelism).
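For reference, the DX11 mechanism in question looks roughly like this (sketch only, error handling omitted): the texture has to be created with the SHARED misc flag on one device and opened by handle on the other, and with plain DX11 there are no fences, so Flush() is about the only synchronization control you get. Sharing across different adapters is heavily restricted on top of that.

```cpp
// Sketch of DX11 resource sharing between two devices.
#include <windows.h>
#include <d3d11.h>
#include <dxgi.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

ComPtr<ID3D11Texture2D> ShareTexture(ID3D11Device* deviceA,
                                     ID3D11DeviceContext* contextA,
                                     ID3D11Device* deviceB,
                                     UINT width, UINT height)
{
    D3D11_TEXTURE2D_DESC desc = {};
    desc.Width            = width;
    desc.Height           = height;
    desc.MipLevels        = 1;
    desc.ArraySize        = 1;
    desc.Format           = DXGI_FORMAT_R8G8B8A8_UNORM;
    desc.SampleDesc.Count = 1;
    desc.Usage            = D3D11_USAGE_DEFAULT;
    desc.BindFlags        = D3D11_BIND_RENDER_TARGET | D3D11_BIND_SHADER_RESOURCE;
    desc.MiscFlags        = D3D11_RESOURCE_MISC_SHARED;   // required for sharing

    ComPtr<ID3D11Texture2D> texA;
    deviceA->CreateTexture2D(&desc, nullptr, &texA);

    // ... render into texA on device A ...
    contextA->Flush();   // no explicit fences in plain DX11: flush and hope

    ComPtr<IDXGIResource> dxgiRes;
    texA.As(&dxgiRes);
    HANDLE handle = nullptr;
    dxgiRes->GetSharedHandle(&handle);

    ComPtr<ID3D11Texture2D> texB;
    deviceB->OpenSharedResource(handle, IID_PPV_ARGS(&texB));
    return texB;   // device B's view of the same resource
}
```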
 
These tests were performed using the High preset.
Same for AMD probably. The game has an Ultra preset. And an HBAO+ option, where NV has the lead.
And also 4K (I appreciate they are limited in their testing), which is not really the best resolution to test these at, IMO.
Cheers
 
These tests were performed using the High preset.
Same for AMD probably. The game has an Ultra preset. And an HBAO+ option, where NV has the lead.

PCPer's review of the 1080 uses High too, though it does show the 980 Ti in the lead at 1440p while being equal at 4K.

As for the HBAO+ option, I'm not sure if it's there, but it giving Nvidia the lead wouldn't be a surprise.

And also 4K (I appreciate they are limited in their testing), which is not really the best resolution to test these at, IMO.
Cheers

Surely 4K High is better than 4K Ultra if you're using that reasoning.
 
PCPer's review of the 1080 uses High too, though it does show the 980 Ti in the lead at 1440p while being equal at 4K.

As for the HBAO+ option, I'm not sure if it's there, but it giving Nvidia the lead wouldn't be a surprise.



Surely 4K High is better than 4K Ultra if you're using that reasoning.
I am saying it is better to use 1440p with uncapped frames, along with showing 4K.
I would say not even the 1080 FE is truly designed, in terms of the HW spec/technology implemented, to be a single-card solution for 4K when playing games with enthusiast settings - all of the games they benchmarked show sub-optimal fps at 4K (let alone what a frame analysis would show).
Sure you can make it work, but you really need to see a broad range of resolutions and ideally 1440p.
I assume they used 4k to overcome forced VSYNC?
Although I thought that was now resolved.
Cheers
 
Overclock3D (where that image comes from) did test it at 4K and found the Fury X double digits faster.

http://www.overclock3d.net/reviews/...son_16_4_2_-_directx_12_performance_boosted/5

Though the problem with such benchmarks is that the game can favor different cards in different conditions. For instance, I looked at a benchmark of Tomb Raider pitting the Fury X against the Titan X, and while the Titan X was mostly faster, the video still had parts where the Fury X was consistently around 10% faster.


Yep, that is true; different frames are going to give us different results. This is why I always say every benchmark is valid - you just have to average them out to get the results. Yeah, we can have outliers favoring one of the IHVs, but they will, or should, even themselves out.
 