DX12 Performance Discussion And Analysis Thread

Did I misinterpret something, or is the status quo that async compute (AC) works fine on Maxwell? What has changed so far? I lost track of this discussion, so a short summary would be very kind. :)
 
Apparently the current situation is "yes and it's a disaster"
Yes: it can run graphics and compute tasks concurrently.
It's a disaster: you have to partition the GPU resources up front. Say you split them 50/50 between graphics and compute, and your graphics task finishes before the compute task. Now 50% of your resources sit idle, waiting for the compute task to finish. You can't change the split on the fly; instead you have to do an expensive context switch to change the partitioning.
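To put a number on the idle share, here is a minimal sketch of the arithmetic. It assumes a deliberately simple model that is not taken from any driver: each task runs only on its own partition, and the split stays fixed until both tasks are done. The function name `idle_share` and the example timings are made up for illustration.

```python
def idle_share(graphics_share, graphics_time_ms, compute_time_ms):
    """Fraction of total GPU capacity left idle under a fixed split.

    Hypothetical model: each task runs only on its own partition, and the
    split cannot change until both tasks have finished.
    """
    if not 0.0 < graphics_share < 1.0:
        raise ValueError("graphics_share must be strictly between 0 and 1")
    total_ms = max(graphics_time_ms, compute_time_ms)
    if total_ms <= 0.0:
        raise ValueError("at least one task must take a positive amount of time")
    compute_share = 1.0 - graphics_share
    # Capacity is counted in "share of the GPU x milliseconds".
    used = graphics_share * graphics_time_ms + compute_share * compute_time_ms
    return 1.0 - used / total_ms


# Made-up timings: 50/50 split, graphics done after 4 ms, compute needs 10 ms.
# The graphics half then idles for 6 of the 10 ms.
print(idle_share(0.5, 4.0, 10.0))
```

With these numbers roughly 30% of the GPU's capacity over the whole interval goes unused, and the waste grows the more the two task durations differ.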
 
Here's HD530 (def. clocks, but DDR4-3000) in an overclocked i7-6700K.
 

Attachment: perf_HD530.txt (471.2 KB)
And since I know no one else will bother, here's GTX 580. Yep, _5_80.
BTW, MDolenc: does your program require 64-bit Windows or a certain amount of dedicated video memory? It does not run on my cheap-ass tablet with an Atom Z3735G and 32-bit (x86) Windows 10.
 

Attachment: perf_GTX580.txt (11.4 KB)
Yes, it's 64-bit and requires at least 128 MB of video memory. Let me know if you want to check that one out. :) But it would probably be better to shorten the graphics part for that one; it takes half a second per run even on not-so-slow integrated Intel GPUs.
 
Na, it's ok. No one's really interested in that 4-EU-crap anyway I guess. It displays the browser window alright and shows some YT vids.
 
I checked this real quick today. I added a new case to the sample, so the scenario is a bit different:
- the main queue renders to an offscreen target (no buffering of frames, trivial VS/PS), 128 draws.
- there's a high priority queue that executes a compute kernel after a 10 ms delay.
It seems to work on GCN only. On the 380X, graphics finishes in 70 ms and compute in 1.5 ms. Reaction time (time from issue to the completion signal on the high priority queue, minus the average kernel runtime) seems to be in the 0.2-0.5 ms range. I checked on Maxwell and HD 4600, and in both cases the high priority queue only kicked in after the graphics queue was done.
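The reaction-time figure is a simple subtraction, sketched below. This is only an illustration of the formula described in the post, not code from the test program; the function name and the sample timings are hypothetical.

```python
from statistics import mean


def reaction_time_ms(issue_to_signal_ms, kernel_runtimes_ms):
    """Reaction time = issue-to-completion-signal time minus average kernel runtime.

    issue_to_signal_ms: measured time from submitting the kernel on the
        high priority queue until its completion fence signals.
    kernel_runtimes_ms: runtimes of the same kernel measured on an otherwise
        idle GPU, used to estimate the average kernel runtime.
    """
    return issue_to_signal_ms - mean(kernel_runtimes_ms)


# Hypothetical sample: the kernel takes about 1.5 ms on its own, and the
# fence signals 1.8 ms after submission while graphics is still running.
print(reaction_time_ms(1.8, [1.4, 1.5, 1.6]))
```

A result of a few tenths of a millisecond means the kernel started almost immediately. A result close to the remaining graphics time means the queue simply waited for graphics to finish.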
It would be interesting to see how the high priority queue will react on GCN Gen 4 GPUs: both latency and total rendering time. PS: NDA for RX 480 ends tomorrow :D
 
Here are stock Fury-X results for comparison, as requested. There's a lot of variation in the latency test. I ran it twice and both runs were the same (browsers closed).
 

Attachment: Fury-X-Stock.txt (11.1 KB)
AMD HD 7970 (1050 MHz). I'm not quite sure about the results, as I have two cards installed (but CFX was disabled).
 

Attachment: AMD HD7970.txt (11.1 KB)
Here's HD530 (def. clocks, but DDR4-3000) in an overclocked i7-6700K.
Oops. That was not intentional.
600+ milliseconds of waiting = no high priority queue support at all (the work is only queued once the GPU is idle). Ouch!
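Why a wait of that size points to missing priority support can be seen from a toy timeline. The sketch below is a hypothetical model, not a description of any driver: it assumes the compute kernel either starts right away (priority honoured) or only after the graphics work already in flight has drained. The function name and the timings are invented for illustration.

```python
def compute_wait_ms(graphics_ms, submit_delay_ms, honours_priority,
                    dispatch_overhead_ms=0.0):
    """Time a compute kernel waits between submission and start of execution.

    graphics_ms: how long the graphics queue is busy, counted from t = 0.
    submit_delay_ms: when the compute kernel is submitted, counted from t = 0.
    honours_priority: True if the high priority queue can start immediately.
    dispatch_overhead_ms: fixed scheduling cost (an assumed constant).
    """
    if honours_priority:
        return dispatch_overhead_ms
    # Without priority support the kernel waits for the remaining graphics work.
    remaining_graphics_ms = max(0.0, graphics_ms - submit_delay_ms)
    return remaining_graphics_ms + dispatch_overhead_ms


# Invented numbers: graphics busy for 650 ms, compute submitted after 10 ms.
print(compute_wait_ms(650.0, 10.0, honours_priority=False))
print(compute_wait_ms(650.0, 10.0, honours_priority=True, dispatch_overhead_ms=0.3))
```

In this model a wait that tracks the length of the graphics workload is the signature of a queue that ignores its priority, while a small, roughly constant wait is what a working high priority queue would show.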

High priority queues are definitely NOT working properly on Intel GPUs. Intel has UMA and shared caches making low latency GPGPU a perfect use case. Too bad the latency completely tanks when the GPU is rendering at the same time.

Of course, it's also possible to use the Intel iGPU solely for gameplay GPGPU tasks and the discrete GPU solely for rendering. DX12 explicit multiadapter makes this possible. But if you use the same shared scene data structures in the GPGPU code and in the rendering code, you need to duplicate them and maintain the state of both copies (copying modifications between the memory pools). That makes things complicated. And even this doesn't solve the case where the consumer only has an Intel iGPU, or has a 6+ core Xeon/i7 (no iGPU, only discrete).

It seems that high priority compute queues are NOT yet ready for shipping games. AMD has been ready since 2011. Nvidia is now ready with Maxwell and Pascal (Maxwell suffers some penalty, but gets the job done). Hopefully Intel can fix their high priority queues with a new driver (to match Maxwell's functionality). People seem to be blinded by concurrent execution, which is solely a GPU performance gain. High priority queues, on the other hand, enable games to offload game logic to the GPU, allowing completely new gameplay. Some modern console games do this already, making it hard to port them to PC.
 
GCN Gen 4 should provide better support for high priority compute queues. I guess the same feature will be in the next iteration of Microsoft and Sony consoles (though I hope for a new rasterizer too!). It will be very interesting to see how important this feature becomes. But yes, we are still far from having three priority options on engine queues (D3D12 actually exposes only two priority values, normal and high; a low/background priority is missing).
 
GCN Gen 4 should provide better support for high priority compute queue.
In fact, that part is only software. GCN3, especially Fiji, *used to* have worse support than it does now. With a recent driver, Fiji uses an entirely different MEC firmware, feature-wise very similar to what is known about Polaris. I'm not exactly sure about Tonga; I can't remember whether Tonga already had sufficient memory for the full MEC firmware, or only for a cut-down version.
 
I see that on Polaris the HWS feature can be updated via microcode, but I imagine that was already the case before.
 
It was. However, the maximum possible size of the microcode differs significantly. Only since Fiji has there been sufficient memory available to pack all the desired functionality into a single firmware image.
 