No DX12 Software is Suitable for Benchmarking *spawn*

Is there a description of what the benchmark contains versus a gameplay scenario?
The priorities for a canned benchmark and a consistent interactive experience are not the same, and the structure and effort put into the two can differ.
Stronger resource demands along certain axes might be avoided in the general game because they would lead to performance drops or responsiveness issues on varied hardware, or to unexpected peaks in engine demand, concerns a benchmark wouldn't have.

It is possible there is optimization for the easily targeted built-in benchmark, but I also find it plausible that an architecture with generally weaker optimization could start falling down in less structured scenarios due to unexpected complexities.
 
Is there a description of what the benchmark contains versus a gameplay scenario?
It's a flyby over a crowded area. Gameplay scenarios used by review sites have largely centered on Prague and other large city hubs in the game, which also have crowds and a lot of complicated exterior scenes.

AMD also stressed that the game be tested with Medium settings (dubbed High) to show any tangible fps boost in DX12 at all; when the game is tested at Max Quality, the DX12 boost in the benchmark ranges from nothing to very small.
 
Different locations and traversal of the scene create a lot of room for variance. It'd be nice if there were a walk through a crowded area in the exact same place, or if Jensen were given a jetpack if sticking with a fly over.

While complexity may be superficially the same, I'd prefer doing more to make sure conditions that could prompt resource loads or hits to shader caches/compiles were kept similar. Hitches and latency spikes sound like something like that is interrupting things.
The benchmark could be more planned in how it handles all of that, which could benefit some GPUs more. Another scenario is that the benchmark is serving as a showcase rather than an evaluation tool, and certain weak points in the engine's own management are being worked around in ways not possible with gameplay.
One architecture or driver team might be more successful in catching where that engine messes up than another.

Also unclear is how gameplay and benchmark runs differ in the load that actual gameplay or IO might put on a session. Latency weirdness has come up for both vendors in a few places recently, and AI and input loads could differ.
Mantle was benchmarked with BF4's network multiplayer because that's where it was supposed to help, for example, and another random problem could be that there are DRM and DLC-validation network hooks in the actual gameplay loop.
 
This could get real interesting with the ports. Apparently the Mac port of Deus Ex is using Metal, and there's a possibility the Linux port uses Vulkan. No idea whether the Windows version could get a Vulkan patch if that's what's ultimately used.
 
Is there a description of what the benchmark contains versus a gameplay scenario?

Is there a description of pcgameshardware's own gameplay scenario at all?
Their performance findings are much more opaque than the integrated benchmark, which anyone with the game can run.

Anyone could make gameplay runs with FRAPS (or whatever software they're using) showing better performance with a GTX 960 than a GTX 1080, with the same settings. Just stand still facing the floor when doing the GTX 960 run, and casually play the game with the GTX 1080.



I'm not suggesting that's what pcgameshardware did, but if their game input wasn't scripted to the millisecond, then any number of runs can get completely random results.
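
Just to make "scripted to the millisecond" concrete, here's a rough sketch of what a replayable input script could look like (Python with pyautogui; the path and timings are made up, and plenty of games read raw input and ignore injected mouse movement, so treat it as the idea rather than a working harness):

```python
# Rough sketch of a scripted benchmark run: the same key presses and camera
# movements replayed with fixed timings on every run and every test system.
# Assumes pyautogui's injected input actually reaches the game (many titles
# use raw input and will ignore it); the path and timings are placeholders.
import time
import pyautogui

SCRIPT = [
    ("key",   "w", 5.0),        # walk forward for 5 seconds
    ("mouse", (400, 0), 1.0),   # pan the camera to the right over 1 second
    ("key",   "w", 5.0),        # walk forward again
]

def run_script():
    start = time.perf_counter()
    for kind, arg, duration in SCRIPT:
        if kind == "key":
            pyautogui.keyDown(arg)
            time.sleep(duration)
            pyautogui.keyUp(arg)
        elif kind == "mouse":
            dx, dy = arg
            steps = 50
            for _ in range(steps):                 # smooth relative pan
                pyautogui.moveRel(dx // steps, dy // steps)
                time.sleep(duration / steps)
    print(f"run length: {time.perf_counter() - start:.2f} s")

if __name__ == "__main__":
    time.sleep(5)   # time to alt-tab into the game
    run_script()
```

Even with something like this, asset streaming and NPC randomness still differ between runs, which is the other half of the repeatability problem.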
 
While complexity may be superficially the same, I'd prefer doing more to make sure conditions that could prompt resource loads or hits to shader caches/compiles were kept similar. Hitches and latency spikes sound like something like that is interrupting things.
That's why repeated run throughs of the same area are done to exclude sudden hiccups and random latency spikes. In the case of Deus Ex, three test sites came to the same conclusion, even with GPUs across different families.

DX11 results are fine and stable, but once you go DX12 frames are just unstable. This is beyond normal gameplay variation, and it's obvious DX12 is the cause.

The benchmark could be more planned in how it handles all of that, which could benefit some GPUs more
In-engine cut scenes are always a good fit for benchmarking, especially if they are loaded with taxing effects; so are areas at the start of a mission/level, where the player is standing still facing an identical fixed view.
 
That's why repeated run throughs of the same area are done to exclude sudden hiccups and random latency spikes. In the case of Deus Ex, three test sites came to the same conclusion, even with GPUs across different families.

DX11 results are fine and stable, but once you go DX12 frames are just unstable. This is beyond normal gameplay variation, and it's obvious DX12 is the cause.
The context in which the hitches were brought up was that they occurred in gameplay, while the discussion was centered on the possibility of actively targeting the benchmark versus actual gameplay for AMD DX12, which was cited as having gameplay spikes.

How do repeated run throughs exclude spikes in this context? Did that mean they went away?
 
How do repeated run throughs exclude spikes in this context? Did that mean they went away?
It excludes hitches related to random loading of assets (e.g., when loading them for the first time), or hitches related to random unexpected events in the run through (like sudden bursts of fire/smoke, etc.). Hitches coinciding with these events disappear with multiple play throughs: assets have already been loaded, and random events don't happen all the time.
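
In practice that's just: record the same route several times, throw away the first pass where assets are still streaming in, and check that the remaining passes agree. A rough sketch of that bookkeeping (assuming plain text logs with one frame time in milliseconds per line; the file names and format are made up, adapt to whatever the capture tool actually writes):

```python
# Sketch: aggregate repeated runs of the same route, discard the first run
# as a warm-up (assets still loading for the first time), and compare average
# fps and worst-case (99th percentile) frame times across the kept runs.
# Assumes hypothetical logs with one frame time in milliseconds per line.
import statistics
from pathlib import Path

def load_frametimes(path):
    return [float(line) for line in Path(path).read_text().split()]

def summarize(frametimes_ms):
    avg_fps = 1000.0 / statistics.mean(frametimes_ms)
    p99_ms = sorted(frametimes_ms)[int(0.99 * len(frametimes_ms))]
    return avg_fps, p99_ms

run_files = ["run1.txt", "run2.txt", "run3.txt", "run4.txt"]   # hypothetical
results = [summarize(load_frametimes(f)) for f in run_files]

warmup, kept = results[0], results[1:]
print("warm-up (discarded): %.1f fps, 99th pct %.1f ms" % warmup)
for n, (fps, p99) in enumerate(kept, start=2):
    print("run %d:               %.1f fps, 99th pct %.1f ms" % (n, fps, p99))
print("fps spread across kept runs: %.1f" %
      (max(r[0] for r in kept) - min(r[0] for r in kept)))
```

If a hitch shows up only in the warm-up pass, it was a first-time load; if it keeps showing up in every pass on one vendor's card only, that's the interesting case.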
 
It excludes hitches related to random loading of assets (e.g., when loading them for the first time), or hitches related to random unexpected events in the run through (like sudden bursts of fire/smoke, etc.). Hitches coinciding with these events disappear with multiple play throughs: assets have already been loaded, and random events don't happen all the time.
The assumption built into that is that the engine is optimally managing the data that needs to carry over.

Without having a better profile of what is going on under the hood, and the timing of some of the events, the possibility remains that there are errant flushes or spurious loads that could be ignored if profiled, or combinations of events that one architecture is more vulnerable to.
The benchmark may space out its management or front-load it in a way that free navigation through a level cannot.

If the mix is not equivalent between the benchmark and gameplay, it can be ambiguous as to whether it's disproportionate optimization of the benchmark or equal levels of optimization that are insufficient for gameplay.

If there is something else, like invasive DLC or DRM checks, they may not be active in benchmark mode at all.
 
It's interesting that certain sites think they know more than the developers about the most taxing render scenarios in a game. Well actually it isn't interesting, it's just page impressions and advertising $ at work.
 
Different locations and traversal of the scene create a lot of room for variance. It'd be nice if there were a walk through a crowded area in the exact same place, or if Jensen were given a jetpack if sticking with a fly over.

While complexity may be superficially the same, I'd prefer doing more to make sure conditions that could prompt resource loads or hits to shader caches/compiles were kept similar. Hitches and latency spikes sound like something like that is interrupting things.
The benchmark could be more planned in how it handles all of that, which could benefit some GPUs more. Another scenario is that the benchmark is serving as a showcase rather than an evaluation tool, and certain weak points in the engine's own management are being worked around in ways not possible with gameplay.
One architecture or driver team might be more successful in catching where that engine messes up than another.

Also unclear is how gameplay and benchmark runs differ in the load that actual gameplay or IO might put on a session. Latency weirdness has come up for both vendors in a few places recently, and AI and input loads could differ.
Mantle was benchmarked with BF4's network multiplayer because that's where it was supposed to help, for example, and another random problem could be that there are DRM and DLC-validation network hooks in the actual gameplay loop.

Another thing to note is that the Canned benchmark appears to stress the system much more than the custom benchmarks that are being used.

http://www.pcgameshardware.de/Deus-.../Specials/DirectX-12-Benchmarks-Test-1207260/

Taking a look at that, the canned benchmark scores are universally lower than the custom "gameplay" benchmark scores. That applies to both Dx11 and Dx12.

That would make sense as you'd generally want the canned benchmark to represent the worst case scenario.

Looking at the pcgameshardware.de results would seem to imply that whenever the engine is stressed the AMD cards do relatively better, while once the engine is less stressed Nvidia gains considerably more performance than AMD hardware does.

Regards,
SB
 
All this recent demonization of internal benchmarks (made mostly by the BFFs in this subforum) is ridiculous. And of course this had to appear at a time when the new API shows one vendor consistently getting considerable performance advantages over their competitors within the same price range.


It's their new thing. "B-but AMD is optimizing only for the internal benchmarks".
Yeah, everyone is picturing AMD >6 years ago when laying down the plans for GCN having secret meetings with Square-Enix to optimize their architecture for a runtime demo of an unannounced game yet to start production while running under a Beta implementation of an unannounced API. Not ridiculous at all.
Even suggesting they're doing heavy driver optimizations for the internal benchmarks is ridiculous because DX12 weighs a lot less on the driver and a lot more on the hardware and the game's own code. Either the hardware performs adequately to the game/engine or it does not.
Though everything is permitted when trying to follow the narrative, I guess.


Scripted internal benchmarks are the only practical way to guarantee that all system setups are going through the exact same loads.
Does pcgameshardware use scripted keyboard+mouse input for their "own" benchmark, and do they publish video recordings of said benchmarks showing the exact same playthroughs with the exact same number of polygons and particles on screen?
If not, then pcgameshardware's "non-internal" benchmark results mean utter crap, because no one will ever know if they're playing the whole run facing the floor or a wall on AMD cards and facing the scenery on Nvidia cards, or vice versa.
Even if they did use scripted KB+M inputs, in-game NPC AIs rely on random variables so the end process would still never be the same.
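
The AI point is easy to illustrate with a toy example: even with identical scripted inputs, anything driven by an unseeded random source diverges between runs, whereas a canned benchmark can pin the seed. (Pure illustration, nothing to do with the actual engine.)

```python
# Toy illustration: two "runs" with identical inputs but unseeded randomness
# (the usual case for live NPC AI) end up different, while seeding the RNG the
# way a canned benchmark can makes them bit-identical. Purely illustrative.
import random

def simulate_run(seed=None):
    rng = random.Random(seed)            # None -> OS entropy, like live gameplay
    npc_positions = [0.0] * 10
    for _frame in range(1000):
        for i in range(len(npc_positions)):
            npc_positions[i] += rng.uniform(-1.0, 1.0)   # random NPC wander
    return round(sum(npc_positions), 3)

print("unseeded:", simulate_run(), "vs", simulate_run())      # almost surely differ
print("seeded:  ", simulate_run(42), "vs", simulate_run(42))  # identical
```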


It's not by pure chance that all scientific papers in respected publications must always provide the readers with methods to accurately reproduce their findings, either through descriptions in the paper itself or through references with said descriptions. Otherwise, it's worth shit.


Where was all this condemnation of internal benchmarks when we were looking at previous games until not long ago, like GTA V, Arkham Knight, every single Far Cry game to date, etc.?
And now with DX12 the internal benchmarks are suddenly a dirty word?


But nevermind this little bit of sanity. By all means, carry on with these conspiracy theories.

EDIT: Not computerbase. It was pcgameshardware. Wrong german site.

Thank you for pointing out the missing video of the benchmark run - it is actually the same as for the same test in DX11, which has a video. We do pride ourselves a bit on documenting quite extensively what we're doing there, so thx again for making me look up my colleagues' article again! I have added the appropriate video now, which should have been there all along.

WRT repeatability: with a margin of error of 0.1 fps in a 110-ish fps range, it's rather OK-ish in my books - and we DO check for that too, before we do benchmarks. But you're welcome to disagree, of course. With the integrated benchmark - just run on my system thrice now for comparison here - I got 75.3/75.5/75.4, which is not substantially better than our run-through results. And that's with both a fixed-clock card and a CPU with a bit of OC, so no Turbo variations (we check for that...).
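
(Worked out, just so the spread is explicit; trivial arithmetic, in Python for clarity:)

```python
# The three integrated-benchmark runs quoted above, as spread and relative margin.
runs = [75.3, 75.5, 75.4]
mean = sum(runs) / len(runs)
spread = max(runs) - min(runs)
print(f"mean {mean:.2f} fps, spread {spread:.2f} fps ({100 * spread / mean:.2f}% of mean)")
# -> mean 75.40 fps, spread 0.20 fps (0.27% of mean)
```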

FWIW - whenever possible we're using non-internal benchmarks, aka real gameplay scenes, and have been for as long as I've been there, which is now a bit more than 11 years. You know, the thing with Nvidia's cheats in 3DMark 03 and the scanned camera path etc. was not something we wanted to fall for again.

And time and again, canned/internal benchmarks have proven not to represent real-life gameplay.

It's interesting that certain sites think they know more than the developers about the most taxing render scenarios in a game. Well actually it isn't interesting, it's just page impressions and advertising $ at work.
Thank you for this assessment. Maybe we should work on our communication - or maybe it's Google Translate that's doing a poor job from German to English.

Actually, since you're obviously referring to our article, we're drawing the conclusion that the integrated benchmark does not reflect what you get while actually playing the game. And that's the main focus on our site when doing benchmark tests for individual games.
 
If there is something else, like invasive DLC or DRM checks, they may not be active in benchmark mode at all.
Why should we care if they are present or not during gameplay (like the Mantle multiplayer test)? After all, these things should affect test systems equally.

If the mix is not equivalent between the benchmark and gameplay, it can be ambiguous as to whether it's disproportionate optimization of the benchmark or equal levels of optimization that are insufficient for gameplay.
All of this will remain theoretical, because once we go to the realm of theory, even benchmarks like 3DMark, Time Spy, and the like should be considered in GPU evaluation. The reality is that benchmarking GPUs is a practice meant to help consumers select a better product to play their games with. GPUs are not isolated systems; they are part of a whole and need to work in tandem with other hardware to deliver the required experience.

Built-in benches don't necessarily deliver on that concept; they tend to test isolated parts of the system. As an example, built-in benches often stress the GPU far more than the CPU, mitigating CPU overhead, which isn't realistic from a gameplay perspective. If a product is good in a theoretical benchmark but stutters and makes gameplay a mess, then it's delivering an inferior experience, and that is what matters for consumers and tech enthusiasts alike.
 
Another thing to note is that the Canned benchmark appears to stress the system much more than the custom benchmarks that are being used.

http://www.pcgameshardware.de/Deus-.../Specials/DirectX-12-Benchmarks-Test-1207260/

Taking a look at that, the canned benchmark scores are universally lower than the custom "gameplay" benchmark scores. That applies to both Dx11 and Dx12.

That would make sense as you'd generally want the canned benchmark to represent the worst case scenario.

Looking at the pcgameshardware.de results would seem to imply that whenever the engine is stressed the AMD cards do relatively better, while once the engine is less stressed Nvidia gains considerably more performance than AMD hardware does.

Regards,
SB
But until the developer provides full disclosure on internal/canned benchmarks, we do not know if they set additional post-processing/'async compute'/etc. beyond the in-game settings; as an example, look at how AoTS can create different loadings with batch settings that do not relate to the game as played.
Not suggesting AoTS is wrong, just pointing out that further 'synthetic' processing overhead can be added that does not entirely reflect the actual game, and this can be further exacerbated by the developer's methodology for measuring/presenting frames in the benchmark compared to the game.

This is relevant for all games no matter if they perform well for Nvidia or AMD.

Cheers
 
Why should we care if they are present or not during gameplay (like the Mantle multiplayer test)? After all, these things should affect test systems equally.
The main item I was addressing was the contention that AMD was specifically targeting the benchmark, based on the non-equivalent behaviors. I do think it is in the realm of possibility, but since the benchmark and the gameplay scenarios are substantially different and have different priorities I think there are other interpretations that can be made with the limited external information we have.

As far as items as networking and DRM/DLC hooks affecting things equivalently, Nvidia was working on driver issues related to Pascal and DPC latency, and Techreport had to re-do some Polaris benchmarking due to some beta firmware and possible interaction with the integrated NIC that was more detectable for AMD.

One perverse outcome that might emerge for lower-level APIs, once engines are designed for them, is that without a single heavy-duty optimizing driver pegging a core or two, low-level timing considerations like core parking or migration might show up at points where the engine is swinging between low- and high-demand periods. That's something of a theoretical consideration given the small pool of samples, but latency issues with parking cores have shown up before.
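
For what it's worth, that kind of thing is at least crudely observable from user space: spin on a high-resolution timer and log any gap far above the normal iteration cost. It only shows interruptions of that one process (scheduler, power management, DPCs), not GPU frame delivery, but it's enough to see whether a platform is dropping multi-millisecond holes into a supposedly idle-free loop. A minimal sketch, nothing vendor- or game-specific:

```python
# Crude user-space probe for scheduling/power-management hiccups: busy-wait on
# a high-resolution timer and record every gap well above the typical
# iteration cost. Core parking, thread migration, or DPC storms tend to show
# up as outlier gaps. This observes only this process, not GPU frame delivery.
import time

def probe(duration_s=5.0, threshold_ms=1.0):
    gaps_ms = []
    end = time.perf_counter() + duration_s
    prev = time.perf_counter()
    while prev < end:
        now = time.perf_counter()
        gap_ms = (now - prev) * 1000.0
        if gap_ms > threshold_ms:
            gaps_ms.append(gap_ms)
        prev = now
    return gaps_ms

if __name__ == "__main__":
    spikes = probe()
    if spikes:
        print(f"{len(spikes)} gaps over 1 ms, worst {max(spikes):.2f} ms")
    else:
        print("no gaps over 1 ms in the sampled window")
```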
 
But until the developer provides full disclosure on internal/canned benchmarks, we do not know if they set additional post-processing/'async compute'/etc. beyond the in-game settings; as an example, look at how AoTS can create different loadings with batch settings that do not relate to the game as played.
Not suggesting AoTS is wrong, just pointing out that further 'synthetic' processing overhead can be added that does not entirely reflect the actual game, and this can be further exacerbated by the developer's methodology for measuring/presenting frames in the benchmark compared to the game.

This is relevant for all games no matter if they perform well for Nvidia or AMD.

Cheers

Sure, but the thing people need to be aware of is that internal game benchmarks aren't, or at least shouldn't be, representative of typical gameplay. They should represent the worst possible situation, so that when a user finds a set of settings that offers acceptable performance, they don't then experience worse behavior in the game itself.

To that end, the benchmark appears to serve its purpose. If an IHV is optimizing for that benchmark they are also, in turn, optimizing for the game in general. If the optimizations are so out of whack that they do not translate to actual gameplay, then you would see that in situations where the game performs significantly worse than the internal benchmark. In both Dx11 and Dx12, both IHVs' hardware does worse in the benchmark than it does in the game, so the benchmark is still serving the purpose it was designed for.

It is also entirely possible that whatever Dx12 features were put in show a greater effect in the benchmark versus regular gameplay, as the benchmark should be more stressful than the game. Considering the game was created as a Dx11 game with absolutely zero thought put into Dx12, it shouldn't come as a surprise that in regular gameplay the Dx12 code doesn't show as great an effect as it does in the benchmark. Especially when you consider both the BETA nature of the Dx12 path and the fact that it is just tacked onto a Dx11 engine.

To think of it another way: the Dx12 path in DE: MD could be viewed as a good opportunity for the engine programmers to experiment with Dx12 without regard to its impact on the game, in order to test things in a shipped title for application in an engine for a future title.

That testing may have little to no effect or even an adverse effect in gameplay itself, as again, the game is still using a Dx11 engine. That testing may or may not favor a particular architecture without the actual intent of making a particular architecture look better or worse than another. It's just that one or the other is particularly well suited to the optimizations they are trying out.

Regards,
SB
 
Late here so I cannot fully respond; good news, there's a retest by computerbase.de with the patch update from today (or yesterday).
I think we agree on some points, Silent_Buddha, but probably disagree about whether the benchmark served its purpose - have not had time to look at the latest retest, but maybe this will influence that:
Patch for more power in the CPU limit under DirectX 12: https://www.computerbase.de/2016-09/deus-ex-patch-dx12/

Cheers
 
The patch does nothing except give the 1060 a boost over the 480 if CPU limited in DX12, but the current problems with DX12 performance being less than DX11 on both GPUs still stand.
It looks to me like the 480 has the advantage now rather than the 1060 if one looks at 1080p, and to a lesser extent in terms of performance gains at 1440p; I doubt many would play at 720p.
A quick look shows that the 480 improved by a fair margin with the update in DX11 at 1080p (around 10%), but there was a big improvement in DX12, with the 480 going from 42 fps to 53.4 fps at 1080p with the update (roughly a 27% gain).
However, no love for Nvidia, as it had no improvement in either DX11 or DX12 above 720p - makes one wonder if this patch was specific to AMD, even if they did go on about improving for the CPU, which shows at 720p :)
This is looking at just the 6700K, and I am mostly ignoring 720p and 1440p for these lower-mid GPUs as they are better suited to 1080p for PC gamers; 1440p would probably be more likely than 720p.

Shame they did not retest the internal benchmark as well, which was what I thought they would also do.
Fingers crossed we get other follow-ups also looking at say 390x/Fury X/1070FE/1080FE along with the internal benchmark either from computerbase.de or other publications.
Cheers
 