Damage control... Will this "hardware" analysis hold up with Fable Legends or other DX12 benchmarks, or is this just an Ashes phenomenon?
The first poster thinks that ROPs control the front end and tessellation performance, and that Fiji not improving them over Hawaii is the reason it isn't doing that much better.
http://www.overclock.net/t/1569897/...singularity-dx12-benchmarks/400#post_24321843
And his 'analysis' of hardware that has only now come into prominence is supposedly all the rage.
The antagonist quoted above doesn't know that the CUDA miner was there before the OpenCL one for AMD.
The whole thing was a bit funny, like what all OCN threads turn into, before it started being plastered everywhere.
Has anybody tested the 3DMark API Overhead test (D3D12 vs. D3D11) on HD 530? I read somewhere that Ashes doesn't work yet with HD 530. It would be interesting to see the results, especially on a GT4e laptop with a limited TDP. Current desktop Skylakes with low-end GT2 graphics should be 100% GPU bound, so DX12 shouldn't improve things much unless Ashes uses async compute or some other new DX12 feature that improves GPU utilization.
I don't think the majority of the work is going into the graphics API/driver, actually; they aren't even *really* pushing that many draw calls compared to some of the microbenchmarks that have come out to date. The fact that NVIDIA's DX11 implementation gets similar performance to DX12 is further evidence of this. It's good to get the API out of the way, but beyond that I expect most of the CPU load is the engine work itself (AI, physics and all that). In fact, if you set the settings to "low" to reduce/remove the GPU bottleneck (which also lightens the API load), that's where you get the best multi-core scalability. I expect that screenshot of 100% CPU usage was on low settings with a high-end GPU or similar. On higher settings the game is clearly GPU bound.

Why waste 16 CPU cores pushing draw calls (and determining visibility)? You can instead do the culling on the GPU and perform a few ExecuteIndirects to draw the whole scene, saving 15.9 CPU cores for tasks that are better suited to the CPU.
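Purely as an illustration (this is not code from Ashes or the Nitrous engine), here is roughly what that ExecuteIndirect path can look like in D3D12. The device, root signature, command list, and the culling compute pass that fills the hypothetical argBuffer/countBuffer are all assumed to already exist:

```cpp
// Sketch: draw a GPU-culled scene with a single ExecuteIndirect.
// Assumes a compute pass has written D3D12_DRAW_INDEXED_ARGUMENTS records for
// the surviving objects into argBuffer and the surviving count into countBuffer.
#include <windows.h>
#include <d3d12.h>

ID3D12CommandSignature* MakeDrawSignature(ID3D12Device* device)
{
    D3D12_INDIRECT_ARGUMENT_DESC arg = {};
    arg.Type = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW_INDEXED;

    D3D12_COMMAND_SIGNATURE_DESC desc = {};
    desc.ByteStride       = sizeof(D3D12_DRAW_INDEXED_ARGUMENTS);
    desc.NumArgumentDescs = 1;
    desc.pArgumentDescs   = &arg;

    // A null root signature is fine when the signature only contains draw arguments.
    ID3D12CommandSignature* sig = nullptr;
    device->CreateCommandSignature(&desc, nullptr, IID_PPV_ARGS(&sig));
    return sig;
}

void DrawCulledScene(ID3D12GraphicsCommandList* cmdList,
                     ID3D12CommandSignature*    drawSig,
                     ID3D12Resource*            argBuffer,
                     ID3D12Resource*            countBuffer,
                     UINT                       maxDraws)
{
    // The GPU reads up to maxDraws argument records, clamped by the count the
    // culling shader wrote, so the CPU never touches per-object visibility.
    cmdList->ExecuteIndirect(drawSig, maxDraws, argBuffer, 0, countBuffer, 0);
}
```

This is just the standard GPU-driven rendering pattern; the function and buffer names above are made up for the example.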
Not to get too into the weeds here, but what exactly is the benchmark doing? You have to be a bit careful when "testing async compute" as the notion isn't even a well-defined concept. Depending on the specifics of the load on the various units on the machine, certain architectures may or may not get performance benefits from async queues, but it's certainly not as simple as "supported or not". Similar notion with async copy as well: you're providing the driver/hardware with some additional parallelism that it may be able to make use of to improve performance, but it depends highly on the specific characteristics of both the net workloads running on the machine at the time and the architecture.

Ok, so here's a little micro benchmark that I wrote. Maybe it will point out something interesting about whether async compute is the culprit for the AotS results or not.
It crashes on GCN 1.0 Tahiti, driver version 15.200.1062.1002
Right, so if you're doing pure compute I'd wager that every DX11+ architecture can actually run multiple kernels at once (certainly Intel can). Yet the way you've structured it across different queues will likely defeat some of the pipelining...

It pre-constructs 128 command lists that each run this shader once and creates 128 command queues, which each get fed one command list to execute. Then it goes in a loop from 1 to 128 and executes 1, 2, 3, ... 128 command queues, the idea being that if the GPU can dispatch multiple kernels simultaneously, it shows up as the total time not growing linearly.
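A rough sketch of that measurement loop, assuming a valid device and 128 pre-recorded compute command lists already exist. This is a reconstruction of the described structure, not the poster's actual benchmark code; the function name and fence scheme are invented for the example:

```cpp
// Hypothetical reconstruction of the queue-scaling test described above.
// Assumes each entry of computeLists is a compute-type command list containing
// a single Dispatch() of the test shader.
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
#include <chrono>
#include <cstdio>
#include <vector>
using Microsoft::WRL::ComPtr;

void MeasureQueueScaling(ID3D12Device* device,
                         ID3D12GraphicsCommandList* const* computeLists)
{
    const UINT kMaxQueues = 128;

    // One compute queue and one fence per command list.
    std::vector<ComPtr<ID3D12CommandQueue>> queues(kMaxQueues);
    std::vector<ComPtr<ID3D12Fence>>        fences(kMaxQueues);
    D3D12_COMMAND_QUEUE_DESC qDesc = {};
    qDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    for (UINT i = 0; i < kMaxQueues; ++i) {
        device->CreateCommandQueue(&qDesc, IID_PPV_ARGS(&queues[i]));
        device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fences[i]));
    }
    HANDLE evt = CreateEvent(nullptr, FALSE, FALSE, nullptr);

    for (UINT n = 1; n <= kMaxQueues; ++n) {
        auto start = std::chrono::high_resolution_clock::now();

        // Submit one pre-recorded command list to each of the first n queues.
        for (UINT i = 0; i < n; ++i) {
            ID3D12CommandList* lists[] = { computeLists[i] };
            queues[i]->ExecuteCommandLists(1, lists);
            queues[i]->Signal(fences[i].Get(), n);   // fence values only ever increase
        }
        // Wait until all n queues report completion before stopping the timer.
        for (UINT i = 0; i < n; ++i) {
            if (fences[i]->GetCompletedValue() < n) {
                fences[i]->SetEventOnCompletion(n, evt);
                WaitForSingleObject(evt, INFINITE);
            }
        }

        double ms = std::chrono::duration<double, std::milli>(
                        std::chrono::high_resolution_clock::now() - start).count();
        // If kernels from separate queues overlap, ms stays roughly flat as n grows;
        // fully serialized execution grows roughly linearly with n.
        printf("%3u queue(s): %.2f ms\n", n, ms);
    }
    CloseHandle(evt);
}
```

Waiting on all fences before the next iteration also keeps the command list reuse legal, since D3D12 requires a list to finish executing before it is resubmitted.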
Aren't async shaders just shaders that are invoked differently? And doesn't the compiler handle architecture-specific optimizations? Moreover, since there is no way to tell the compiler that a shader is meant to be run asynchronously (or anything about the overall "invocation architecture"), there's no way for it to optimize beyond its normal settings, unless there is a way to tell the compiler how many registers to use. The only thing I can think of that the developer might do is minimize shared memory usage, but I'm not sure of that either. Of interest is how much the driver takes into account when scheduling shaders to run in parallel, and driver interplay in general. I'm also curious how detrimental cache pollution will be when running shaders in parallel.

The benchmark for AoS is probably valid, but async shader performance is sensitive to how the shaders are written for the architecture (much more so than serially executed shaders); there are many variables involved.
Slide 12: and GCN can do 4 or 5 co-issue.