DX12 Performance Discussion And Analysis Thread

But how much additional input latency?
If I'm not mistaken, dividing rendering into stages and doing them on separate GPUs decreases input latency, since you can start processing a new frame sooner. It's simple pipelining, if I'm thinking about this correctly.
 
But how much additional input latency? Would gamers trade 10% more FPS for 30% more latency?

This would be a major problem with AFR rendering - if we assume that the Intel GPU is only 10% the speed of a 980 Ti, then it's only going to produce roughly one out of every ten frames. The problem is that that frame will take ten times longer, and thus, from the time it starts to the time it finishes, about ten other frames will have been rendered. That's roughly a tenfold increase in latency for that frame, which is the difference between ~33 ms at 30 FPS and ~330 ms at 33 FPS. I don't think 330 ms is even playable, let alone acceptable.
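A back-of-the-envelope sketch of that arithmetic, using hypothetical numbers (a ~33 ms frame on the fast GPU, a secondary GPU assumed to be exactly 10% as fast, perfect scheduling, and no transfer cost):

```cpp
#include <cstdio>

int main() {
    const double strong_ms  = 33.3;                  // fast GPU alone: ~30 FPS
    const double weak_ratio = 0.10;                  // assumed relative speed of the weak GPU
    const double weak_ms    = strong_ms / weak_ratio; // the same frame takes ~333 ms on it

    // AFR: the fast GPU renders 10 frames in the time the weak one renders 1,
    // so the pair delivers 11 frames per (10 * strong_ms) of wall-clock time.
    const double afr_fps = 11.0 / (10.0 * strong_ms / 1000.0);

    printf("fast GPU alone: %.1f FPS, every frame ~%.0f ms old at completion\n",
           1000.0 / strong_ms, strong_ms);
    printf("AFR pair:       %.1f FPS, but the weak GPU's frame is ~%.0f ms old\n",
           afr_fps, weak_ms);
    return 0;
}
```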

If I'm not mistaken, dividing rendering into stages and doing them on separate GPUs decreases input latency, since you can start processing a new frame sooner. It's simple pipelining, if I'm thinking about this correctly.

The problem here is that it's very hard to load balance when one device is vastly more powerful than the other - if the powerful device has to stall, even a little, in order to give the weak device some work, you end up going slower overall, since the weak device isn't powerful enough to even cover the stall, let alone increase performance noticeably. This is particularly troublesome for something like rendering a single frame, since dividing the work in such a way that you don't have to transfer large portions of the frame buffer back and forth is unlikely. And if you overshoot and accidentally give the weak device 20% of the work instead of 10%, the frame render time suddenly doubles. Oops.

Thing is, with a real scene, it's hard to partition work - scenes are just too irregular to be predictable, especially when the player can move the camera about as they wish. Remember, with the weak device at 10% the power of the strong one, you end up with a performance loss if the weak device ever has to do more than 10% of the work - that is, more than its entire throughput relative to the strong one. This is a very slim margin to hit.
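A quick sketch of that break-even math, using the same hypothetical 10%-speed secondary GPU and ignoring transfer and synchronization costs entirely (which only flatters the split):

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    const double weak_ratio = 0.10;  // assumed speed of the weak GPU relative to the fast one

    // Frame time is normalized so the fast GPU doing 100% of the work = 1.0.
    for (double share : {0.05, 0.09, 0.10, 0.11, 0.20}) {
        const double fast_time = 1.0 - share;         // fast GPU does the remainder
        const double weak_time = share / weak_ratio;  // weak GPU is 10x slower per unit of work
        const double frame     = std::max(fast_time, weak_time);
        printf("weak GPU gets %4.1f%% of the frame -> %.2fx frame time (%+.0f%%)\n",
               share * 100.0, frame, (frame - 1.0) * 100.0);
    }
    return 0;
}
```

With those assumptions, the best case (offloading about 9% of the work) shaves only about 9% off the frame time, break-even sits at exactly 10%, and at 20% the frame time doubles.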

Bottom line: single-digit % gains in the best cases, major slowdowns in sub-optimal cases. Oh, and lots of extra developer work to even get any of the best cases.
 
Oh, forgot to mention. Pipelining doesn't decrease latency - the time for a single frame is the same, it's just that you can have more than one frame in flight at a time, so you get more throughput.
 
The pcgh review does look better, but it doesn't seem to be multiplayer, and at 4K the overclocked 980 Ti is barely faster than a Fury, which means it'd still be slower than a Fury X.

Barring the 980 Ti Lightning, I don't see many AIB 980 Tis jumping across a 20% gulf in performance.
And most will buy, say, an MSI Twin Frozr equivalent that is about 5% slower from the reviews I have read (depending upon the game), which brings it in line with the Fury X rather than radically slower.
http://www.techpowerup.com/reviews/MSI/GTX_980_Ti_Lightning/13.html
http://www.techpowerup.com/reviews/MSI/GTX_980_Ti_Gaming/18.html
Not sure where you get it being equal to a Fury when looking at the pcgameshardware review I linked. Furthermore, it seems they benchmarked a mix that included match play on a large map; in the review they say: "This took place on the biggest map and with the maximum 40 players. It was played with maximum details, including Temporal-AA and in 1440p."

Cheers
 
Yeah, a person who doesn't know the merits of the scientific method and how to read those benchmarks can twist those numbers to mean just about anything, just like you are doing right now.

By no means are the benchmark numbers incorrect, but to use them as backing for what you have said - hell no, that is not reasonable thinking. It's as inaccurate as using raw statistics to prove a point. It doesn't work.

They do work, and they work for all the reviewers who try their best to control for other factors. If you think scientists get perfect scenarios for their experiments, then you're living in la-la land.

nvidia cards do better in BF4 multiplayer; they do worse in Battlefront's. That's the bottom line.

Oh yeah, many reviewers are quite aware of my stance on their reviews. Just ask Kyle at [H]; I blasted him here when he went on a tirade about how one way of benchmarking is the best way... that was well before you were posting here, I think, so you might not know that.

Ok, I'm guessing nvidia was behind in that review as well.


Right, the first part - that wasn't the part I was talking about ;). Does "pass Go and collect $200" ring a bell?

It wasn't implied; what I quoted was what he stated. Pretty stupid move, when what was stated was false. I think you really should read this entire thread to see what was really going on.

What's ringing my bell is that a discussion with you is going to be a futile exercise.

I want you to read that document again, split it up based on what will work best on different IHVs' hardware, and post that, and let's see where that goes, because I can tell you right now it won't go anywhere.

No, you go read that thread and come back here. It's clear as day what was stated in that quote.


And most will buy, say, an MSI Twin Frozr equivalent that is about 5% slower from the reviews I have read (depending upon the game), which brings it in line with the Fury X rather than radically slower.
http://www.techpowerup.com/reviews/MSI/GTX_980_Ti_Lightning/13.html
http://www.techpowerup.com/reviews/MSI/GTX_980_Ti_Gaming/18.html
Not sure where you get it being equal to a Fury when looking at the pcgameshardware review I linked. Furthermore, it seems they benchmarked a mix that included match play on a large map; in the review they say: "This took place on the biggest map and with the maximum 40 players. It was played with maximum details, including Temporal-AA and in 1440p."

Cheers

It'd sell more because it's an nvidia card.

The 1920x1080 200% resolution-scale bench puts the Fury at a 2% disadvantage compared to that 980 Ti. As for the map, they mention in the graph's table that the results are from "Star Wars Battlefront BETA (Origin), PCGH-Benchmark 'Tatooine Survival'", which is the single-player mode in the beta.
 
They do work, and they work for all the reviewers who try their best to control for other factors. If you think scientists get perfect scenarios for their experiments, then you're living in la-la land.

nvidia cards do better in BF4 multiplayer; they do worse in Battlefront's. That's the bottom line.

Perfect scenarios are not necessary, but you do need a foundation or baseline to work from; that is the bottom line, and that is what is missing. Extrapolating from data that doesn't have that causes false predictions, and the errors only get magnified. Standard deviations are necessary to characterize those errors, and of course if you benchmark without looking into what is causing those possible errors, the standard deviation can't be explained either.


Ok, I'm guessing nvidia was behind in that review as well.

Are you kidding me? It wasn't just one review, it was many reviews. And that wasn't what it was about; it was about Kyle's approach to benchmarking. Real-world testing is good, but again, there was no baseline. He did start doing a baseline afterwards, which is great.

It's like benchmarking a PowerPC system vs. an Intel system back in the day: you could kind of say one was faster than the other by looking at the specs, but without equalizing the testing field with the same programs, the results were meaningless.

This is what happens when you benchmark multiplayer: you are not getting the same scenario in every test, and I would presume it's not even close. There are too many factors to consider to guess at what is going on, especially since you see weird numbers from the same IHV's hardware.

What's ringing my bell is that a discussion with you is going to be a futile exercise.

Just like your above statement? Come on, get real.


No, you go read that thread and come back here. It's clear as day what was stated in that quote.


Yeah, that is a very incorrect statement. He stated it clear as day, but what was in that document is pretty much the same thing as what AMD's own async compute and DX12 documents say, too. It's really just best practices for ALL DX12 hardware, not specific to nV's hardware.
 
Reviewers need to take note of this: potential slight flicker/shimmer due to discrepancies in the implementation of AF, transparency AA, etc. Though AoTS does not look like a good example for exposing this kind of problem, due to its gameplay PoV.


Yep, as a benchmark, AoTS is basically not much more than a glorified CPU overhead test. It is probably what happened when the 3DMark API overhead test gave AMD+Oxide some ideas. On their blog, Oxide even brags about having a null-renderer feature built into it. The 3DMark API tests show that NVidia still has some small overhead in DX12 due to the way its driver handles scheduling, as opposed to AMD's hardware scheduler. AMD performs better in these pure API overhead tests even without the much-hyped async compute involved at all. AoTS's secret sauce is just to pound the CPU heavily, even on the cream of the crop of CPUs, which chokes NVidia's driver scheduler and in turn reduces its GPU performance.

Star Swarm, on the other hand, had not much going on on the CPU, so it is more of a pure high-draw-call benchmark -- in which current NVidia hardware dominates.

The multi-adapter tests further demonstrate that dual Maxwell is obviously severely CPU bottlenecked. It has tighter frame-time deltas compared to the more micro-stutter-prone Fury pairs, indicating that its performance is largely governed by how well the CPU can keep up. Also, when the resolution increases from 1440p to 4K, the relative FPS performance of the Maxwell pair jumps from trailing the Furys to matching them -- not something anyone would expect in a normal situation (i.e. it's usually the other way around).

Which might also explain why the mixed Fury X + 980 Ti is faster than 980 Ti + TX: AoTS leaves the CPU with only enough grunt to properly feed one NVidia GPU.

This should leave a question of whether AoTS is a true representation of future DX12 games, as it totally runs counter to all the DX12 premises about giving more power to lesser CPUs. Though of course, Intel wouldn't mind.

It doesn't. A corollary of reducing CPU driver overhead is that you increase CPU utilization. Then, when your driver demands more of it than your competition's does, you fall behind, because the CPU is a bottleneck again: for you (Nvidia).
 
This would be a major problem with AFR rendering - if we assume that the Intel GPU is only 10% the speed of a 980 Ti, then it's only going to produce roughly one out of every ten frames. The problem is that that frame will take ten times longer, and thus, from the time it starts to the time it finishes, about ten other frames will have been rendered. That's roughly a tenfold increase in latency for that frame, which is the difference between ~33 ms at 30 FPS and ~330 ms at 33 FPS. I don't think 330 ms is even playable, let alone acceptable.
His post was in regard to sebbbi's, which was in regard to Unreal Engine's multi-adapter support, which wasn't AFR; it was dividing the work of a single frame into discrete stages across multiple GPUs.

Oh, forgot to mention. Pipelining doesn't decrease latency - the time for a single frame is the same, it's just that you can have more than one frame in flight at a time, so you get more throughput.
I was talking about input latency, as in when input is sampled. Input latency would be reduced, since the first half of rendering runs in parallel with the other stage(s), whereas with a single GPU, the GPU wouldn't start rendering the next frame until the whole of the last frame is done. Since all stages of a pipeline are basically running at the same time, the input gets sampled at an earlier time than when serialized.
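A minimal sketch of that argument with made-up stage times (stage A = main rendering on the dGPU, stage B = post-processing, done either by the dGPU itself or handed off to a slower second GPU); only the interval between input samples is considered here:

```cpp
#include <cstdio>

int main() {
    const double stageA_ms      = 25.0;  // hypothetical: geometry, lighting, etc. on the dGPU
    const double stageB_ms      = 8.0;   // hypothetical: post-processing on the dGPU itself
    const double stageB_slow_ms = 15.0;  // hypothetical: the same post-processing on a slower GPU

    // Single GPU: a new frame (and its input sample) can only start every A+B ms.
    const double single_interval = stageA_ms + stageB_ms;
    // Split across two GPUs: the dGPU can sample input for the next frame as soon
    // as stage A is done, i.e. every max(A, B_slow) ms in steady state.
    const double split_interval = (stageA_ms > stageB_slow_ms) ? stageA_ms : stageB_slow_ms;

    printf("single GPU: new input sample every %.0f ms (%.1f FPS)\n",
           single_interval, 1000.0 / single_interval);
    printf("two GPUs:   new input sample every %.0f ms (%.1f FPS)\n",
           split_interval, 1000.0 / split_interval);
    return 0;
}
```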
 
Guys, to be clear, this is a thread about how DX12 can potentially alter the status quo with respect to performance and what new rendering techniques are now possible. This is not a Fury vs. 980 Ti thread. It's also not a thread about how and what various sites benchmark. Let's stay on topic.
 
But how much additional input latency? Would gamers trade 10% more FPS for 30% more latency?

It shouldn't increase latency per frame; it should actually decrease it.

If I'm not mistaken, dividing rendering into stages and doing them on separate GPUs decreases input latency, since you can start processing a new frame sooner. It's simple pipelining, if I'm thinking about this correctly.

Correct, although it's not due to pipelining. That's something completely different.

What's being done is to split the GPU work for any given frame across multiple GPUs. Similar to what Firaxis did with Mantle for Civilization: Beyond Earth, except with GPUs from different vendors. In that case, average and max fps didn't go up significantly as they would with traditional split-GPU rendering (AFR), but latency was reduced and minimum fps increased significantly.

Something 3dilettante mentioned that is worth keeping in mind, however, is how generally you can code for something like this when low-level architectural differences can be relatively large between GPU vendors, and even between GPU generations from the same vendor. Will it be something that can be done somewhat generically while maintaining performance consistency? Will there be some method for a program to balance things automatically? Will it need to be coded specifically for each architecture, with manual balancing of the split workloads?

It's still early days. But there are a lot of exciting possibilities. As well as quite likely unexpected pitfalls.

Regards,
SB
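For anyone wondering what this looks like at the API level, here is a minimal, untested sketch of the setup side of explicit multi-adapter in DX12: enumerate every adapter, create a device and a queue on each, and share a fence across adapters so the secondary GPU can wait on the primary GPU's output. Cross-adapter heaps for the intermediate frame data, error handling and the actual command lists are all omitted, and everything beyond the D3D12/DXGI calls themselves (the helper struct and function names) is made up for illustration.

```cpp
#include <windows.h>
#include <d3d12.h>
#include <dxgi1_4.h>
#include <wrl/client.h>
#include <vector>

using Microsoft::WRL::ComPtr;

struct GpuNode {                       // hypothetical helper struct, not a DX12 type
    ComPtr<ID3D12Device>       device;
    ComPtr<ID3D12CommandQueue> queue;
};

// Create a device and a direct queue on every D3D12-capable adapter.
std::vector<GpuNode> CreateDevicesForAllAdapters()
{
    std::vector<GpuNode> nodes;
    ComPtr<IDXGIFactory4> factory;
    CreateDXGIFactory1(IID_PPV_ARGS(&factory));

    ComPtr<IDXGIAdapter1> adapter;
    for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i) {
        GpuNode node;
        if (FAILED(D3D12CreateDevice(adapter.Get(), D3D_FEATURE_LEVEL_11_0,
                                     IID_PPV_ARGS(&node.device))))
            continue;  // skip adapters that can't do D3D12

        D3D12_COMMAND_QUEUE_DESC qdesc = {};
        qdesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
        node.device->CreateCommandQueue(&qdesc, IID_PPV_ARGS(&node.queue));
        nodes.push_back(node);
    }
    return nodes;
}

// Create a fence on the primary device that the secondary device can also see,
// so "secondary waits until primary has finished rendering frame N" is possible.
void CreateCrossAdapterFence(const GpuNode& primary, const GpuNode& secondary,
                             ComPtr<ID3D12Fence>& fenceOnPrimary,
                             ComPtr<ID3D12Fence>& fenceOnSecondary)
{
    primary.device->CreateFence(0,
        D3D12_FENCE_FLAG_SHARED | D3D12_FENCE_FLAG_SHARED_CROSS_ADAPTER,
        IID_PPV_ARGS(&fenceOnPrimary));

    HANDLE shared = nullptr;
    primary.device->CreateSharedHandle(fenceOnPrimary.Get(), nullptr,
                                       GENERIC_ALL, nullptr, &shared);
    secondary.device->OpenSharedHandle(shared, IID_PPV_ARGS(&fenceOnSecondary));
    CloseHandle(shared);
}
```

Worth noting that the load-balancing questions above live entirely in the application: DX12 only hands you the devices, queues and synchronization primitives; how much of the frame each adapter gets is up to the engine.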
 
It doesn't. A corollary of reducing CPU driver overhead is that you increase CPU utilization. Then, when your driver demands more of it than your competition's does, you fall behind, because the CPU is a bottleneck again: for you (Nvidia).
It takes a ridiculous amount of "CPU utilization" to bottleneck a 980 Ti on a top-dog Intel CPU under DX12.

Even AoTS does not look like a true representation of its own final version, with progress like this:
85 CPU FPS @ version 0.50
131 CPU FPS @ version 0.55

I was talking about input latency, as in when input is sampled. Input latency would be reduced, since the first half of rendering runs in parallel with the other stage(s), whereas with a single GPU, the GPU wouldn't start rendering the next frame until the whole of the last frame is done. Since all stages of a pipeline are basically running at the same time, the input gets sampled at an earlier time than when serialized.
Your definition of input latency is incomplete at best. Refer to the first paragraph of this classic article for a more proper definition of input latency.

The IGP in the given example is doing post-processing. It takes its input from the DGPU in the form of a completely rendered frame (minus post-processing). Thus, within the context of any particular frame, they cannot be running in parallel, although they can work in parallel on different, consecutive frames - i.e. the DGPU is rendering frame n + 1 while the IGP is doing post-processing on frame n.

So while input is indeed getting picked up at a slightly faster rate, the total time it takes to process and deliver the final image for display is longer, because the IGP can only work at a fraction of the speed of the DGPU.
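Using the same made-up stage times as the sketch above, the counterpoint is how old the sampled input is by the time its frame reaches the display; the copy between the two GPUs is ignored here, which still flatters the split case:

```cpp
#include <cstdio>

int main() {
    // Same hypothetical numbers as before: 25 ms main render, 8 ms post-processing
    // on the DGPU itself, 15 ms for the same post-processing on the slower IGP.
    const double stageA_ms = 25.0, stageB_ms = 8.0, stageB_slow_ms = 15.0;

    const double single_latency = stageA_ms + stageB_ms;       // frame done entirely on the DGPU
    const double split_latency  = stageA_ms + stageB_slow_ms;  // DGPU render, then IGP post

    printf("single GPU: input is %.0f ms old when its frame is displayed\n", single_latency);
    printf("two GPUs:   input is %.0f ms old when its frame is displayed\n", split_latency);
    return 0;
}
```

So both posters can be right about different numbers: with these assumptions the split setup samples input more often (every 25 ms instead of every 33 ms), while each individual sample is older by the time it is shown (40 ms instead of 33 ms).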
 
Really, that's interesting. I doubled the performance of the software I'm working on. Its previous version was "not final" according to your logic, even though it was sold worldwide, just because I could improve performance that much...
"My logic" has nothing to do with what is going to happen after the game gets finalized/released though. And I don't see anything strange for some one to expect that the final version of the game (say, 4-6 months ahead) to have a significant performance improvement over its current version. Are you trying to say that it is the norm to expect otherwise?

So how much performance improvement did you make to your software in the last 4-6 months before it was released?
 
Guys, to be clear, this is a thread about how DX12 can potentially alter the status quo with respect to performance and what new rendering techniques are now possible. This is not a Fury vs. 980 Ti thread. It's also not a thread about how and what various sites benchmark. Let's stay on topic.

What is the status quo with respect to performance? Apparently we need confidence intervals and p-values before we can even start thinking about that.

And Fury vs. 980Ti is the current status quo until the next gen cards release.
 
What is the status quo with respect to performance? Apparently we need confidence intervals and p-values before we can even start thinking about that.

The status quo being DirectX 9/10/11. I think at this point it's quite clear applications have to be structured differently for DirectX 12 than for traditional 3D APIs. This thread's aim is to discuss which of those differences have performance ramifications and what those ramifications mean (are there new limits, new possibilities, etc.).

I don't believe confidence intervals or p-values are required to have this discussion. :smile:

And Fury vs. 980Ti is the current status quo until the next gen cards release.

That's orthogonal to this thread. Certainly various architectures will receive different levels of performance benefits/drawbacks from these new rendering techniques. I think it's important and interesting to understand why those architectures perform the way that they do.

What's not relevant, though, is that website X said card Y with overclock Z had an average fps of W, which is greater than card V's, but website U got a completely different result. Who cares? It's not very interesting and doesn't further our understanding of DirectX 12.
 
Your definition of input latency is incomplete at best. Refer to the first paragraph of this classic article for a more proper definition of input latency.
I am aware of that definition, but I was concentrating on my point, so I left out the rest.

The IGP in the given example is doing post-processing. It takes its input from the DGPU in the form of a completely rendered frame (minus post-processing). Thus, within the context of any particular frame, they cannot be running in parallel, although they can work in parallel on different, consecutive frames - i.e. the DGPU is rendering frame n + 1 while the IGP is doing post-processing on frame n.

So while input is indeed getting picked up at a slightly faster rate, the total time it takes to process and deliver the final image for display is longer, because the IGP can only work at a fraction of the speed of the DGPU.
The Unreal example showed a frame rate increase, so the total frame time was still less than with a single GPU.

Correct, although it's not due to pipelining. That's something completely different.
The pipelining takes place in a single-GPU setup as well, since one rendering stage's output is used by the next, but when the pipeline is "implemented" across multiple GPUs, the new frame does/can indeed start earlier than usual. So long as you get an fps increase from the setup, the input lag is indeed reduced because of this.
 
Ashes of the Singularity is still in Early Access.
... and it is the first "shipped" DX12 game, and also the first DX12 game to use explicit multi-adapter. Bugs are expected at this stage in the graphics drivers, in the DX12 API itself, in the Windows 10 OS, and also in their game code (it's their first DX12 game, built with unfinished DX12 tools). I expect all the relevant OS/API/driver bugs to be fixed by the 2016 holiday period at the latest (when the first big wave of AAA games ships using DX12).
 