DX12 Performance Discussion And Analysis Thread

Any ideas on the negative gains for the Fury X, Carsten? The overflow seems awkward, since that should affect all of them, no?
 
With all the sh*t storm we are about to see regarding the Microsoft Store/WDDM 2.0 - Windows composition engine/VSYNC,
has anyone tested the very latest version of Ashes of the Singularity (after the recent update) with both G-Sync and FreeSync?

Going back a while, I thought FreeSync could only operate correctly with full-screen applications, and I wonder if G-Sync will also be screwed by the impact of the Windows Store/DX12 requirements being imposed on games.
I meant to post a while back, but reading Ryan's latest article on PCPer, and the one at Guru3D, it seems this is going to throw some spanners into the PC game-development world, let alone support for CrossFire and SLI.
http://www.pcper.com/reviews/Genera...up-Ashes-Singularity-DX12-and-Microsoft-Store

I get the bad feeling Microsoft is creating another facepalm moment for the PC gaming environment; I wonder if this debacle is something else Phil Spencer will need to take control of and put on the right track, as we saw with the Xbox One project.
Cheers

About the FreeSync/V-Sync issue: I think Guru3D explained that AMD has a driver fix for it that will be released soon for Ashes. I wouldn't read too much into this situation right now.
 
Any ideas on the negative gains for the Fury X, Carsten? The overflow seems awkward, since that should affect all of them, no?
Nothing substantial. But it was repeatable behaviour, not singular events. Might help to look at it with GPUView or similar. If only I had time...
 
Recommended specs for the DX12 game Gears of War: Ultimate Edition:

Those are only estimates, but it's still interesting which cards are grouped together by the devs: DX12, savior of AMD?

Ideal -> Nvidia GeForce GTX 980 Ti / AMD Radeon R9 390X
Recommended -> Nvidia GeForce GTX 970 / AMD Radeon R9 290X
Min -> Nvidia GeForce GTX 650 Ti / AMD Radeon R7 260X
 
Nothing substantial. But it was repeatable behaviour, not singular events. Might help to look at it with GPUView or similar. If only I had time...
Isn't this kinda expected, and hasn't it been pointed out in this thread already? Running compute and graphics concurrently makes both queues compete for resources, most notably bandwidth. As the amount of work increases, so does the amount of L1/L2 evictions. The HBM interface is clocked far slower than GDDR5 interfaces are, and thus it's able to service fewer transactions.
 
Recommended specs for the DX12 game Gears of War: Ultimate Edition:

Those are only estimates, but it's still interesting which cards are grouped together by the devs: DX12, savior of AMD?

Ideal -> Nvidia GeForce GTX 980 Ti / AMD Radeon R9 390X
Recommended -> Nvidia GeForce GTX 970 / AMD Radeon R9 290X
Min -> Nvidia GeForce GTX 650 Ti / AMD Radeon R7 260X
Given the minimal differences between the R9 390X and R9 290X, I'd rather guess it's based on available video memory (8/6 GiB in the top tier, 4/3.5+0.5 GiB in the mid tier, and 1/2 GiB in the lowest tier). But that's just a wild guess.

Isn't this kinda expected, and hasn't it been pointed out in this thread already? Running compute and graphics concurrently makes both queues compete for resources, most notably bandwidth. As the amount of work increases, so does the amount of L1/L2 evictions. The HBM interface is clocked far slower than GDDR5 interfaces are, and thus it's able to service fewer transactions.
Sounds reasonable - I hadn't thought about the comparatively low clock speed of the HBM.
 
GCN3's poor DX12 performance got even worse.

What you’re watching is the Radeon Fury running the Gears of War: Ultimate Edition benchmark on my capable Intel test bench, at 1440p with High quality settings. These settings include FXAA and Ambient Occlusion. You’re also seeing horrendous hitching and stuttering, and some visual corruption thrown in for good measure, making the game completely unplayable on an excellent $500 graphics card.

AMD’s Radeon Fury X and Radeon 380 also choked when switching quality to High and running at 1440p or higher.

Surely the performance gets even worse as you make your way down the Radeon product stack, right? Oddly enough, no. I tested an Asus Strix R7 370 under the same demanding 4K benchmark, and it turned in only a 13% lower average framerate. Crucially, no stuttering or artifacting was present.
The Radeon 390X is just fine, achieving double the framerate at High Quality/4K of the more expensive Fury and Nano cards.

http://www.forbes.com/sites/jasonev...-disaster-for-amd-radeon-gamers/#33c0e9857e7e
 
Isn't this kinda expected, and hasn't it been pointed out in this thread already? Running compute and graphics concurrently makes both queues compete for resources, most notably bandwidth. As the amount of work increases, so does the amount of L1/L2 evictions. The HBM interface is clocked far slower than GDDR5 interfaces are, and thus it's able to service fewer transactions.

Is there a particular restriction in mind, such as an access pattern that does not adequately stripe across all channels, or a GPU/DRAM bottleneck?
GDDR5 has a burst length of 8 per transaction. For the 390X, with data at 6 Gb/s per pin, that is 0.75 Gtxn/s per channel.
Across 16 such channels, that is 12 Gtxn/s.

Fiji would be running 32 channels; with 1 Gb/s per pin and HBM's burst length of 2, that is 0.5 Gtxn/s per channel, and there are 2x as many channels, giving 16 Gtxn/s.
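As a quick sanity check on those figures, here's the same arithmetic as a back-of-the-envelope Python sketch. It assumes 32-byte transactions throughout (a burst of 8 on a 32-bit GDDR5 channel, a burst of 2 on a 128-bit HBM1 channel); treat it as rough math, not vendor data.

```python
# Back-of-the-envelope transaction rates, assuming 32-byte transactions:
# burst of 8 on a 32-bit GDDR5 channel, burst of 2 on a 128-bit HBM1 channel.

def gtxn_per_channel(pin_rate_gbps, channel_width_bits, txn_bytes=32):
    """Transactions per second for one channel, in Gtxn/s."""
    channel_gb_per_s = pin_rate_gbps * channel_width_bits / 8  # GB/s per channel
    return channel_gb_per_s / txn_bytes

# Hawaii / 390X: 16 x 32-bit GDDR5 channels at 6 Gb/s per pin
r390x = gtxn_per_channel(6.0, 32)
print(r390x, r390x * 16)    # 0.75 Gtxn/s per channel, 12.0 Gtxn/s total

# Fiji: 32 x 128-bit HBM1 channels at 1 Gb/s per pin
fiji = gtxn_per_channel(1.0, 128)
print(fiji, fiji * 32)      # 0.5 Gtxn/s per channel, 16.0 Gtxn/s total
```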
 
With all the sh*t storm we are about to see regarding the Microsoft Store/WDDM 2.0 - Windows composition engine/VSYNC,
Most of the shit storm in the past 2-3 days springs from all types of folks mixing up forced composition (i.e. the lack of exclusive full screen) for Microsoft Store apps with the (entirely unrelated) move from Direct Flip to Immediate Flip that AMD has made.

AotS is running in exclusive full screen on both platforms, and unthrottled at that. It's not subject to the forced composition required for Windows Store apps. It's really just that FCAT couldn't properly detect writes to the swap chain when the buffers were not overwritten continuously (Direct Flip) but remapped instead (Immediate Flip), and it breaks down completely once the flip only happens once per refresh, as anything else would cause tearing. It's not even able to measure when the frame was actually submitted for presentation.
 
Forbes is spam.

I also wish that MS would make the full WDDM 2.0 documentation public. I would like to better understand how WDDM 2.0 works for many things, like presentation. Reading the WDK headers is not all that useful.
 
Is there a particular restriction in mind, such as an access pattern that does not adequately stripe across all channels, or a GPU/DRAM bottleneck?
GDDR5 has a burst length of 8 per transaction. For the 390X, with data at 6 Gb/s per pin, that is 0.75 Gtxn/s per channel.
Across 16 such channels, that is 12 Gtxn/s.

Fiji would be running 32 channels; with 1 Gb/s per pin and HBM's burst length of 2, that is 0.5 Gtxn/s per channel, and there are 2x as many channels, giving 16 Gtxn/s.
There's striping over all the channels, so that's still 0.75 Gtxn/s per channel feeding 44 CUs vs. 0.5 Gtxn/s per channel feeding 64 CUs. In this particular case, at 64k particles, the graphics queue is rendering from a 2 MB buffer and the compute queue is reading from another 2 MB buffer, so that definitely blows past the cache sizes. And since it's an n-body problem, it needs the entire 2 MB buffer for each of the 64k particles.
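For what it's worth, the 2 MB figure works out to 32 bytes per particle at 64k particles; the layout in the sketch below (a float4 position plus a float4 velocity) is only an assumed example to make the arithmetic concrete.

```python
# Rough sizing of the n-body buffers mentioned above.
# 32 bytes per particle is an assumption (e.g. float4 position + float4 velocity).
particles = 64 * 1024
bytes_per_particle = 32
buffer_mib = particles * bytes_per_particle / (1024 * 1024)
print(buffer_mib)   # 2.0 MiB per buffer; graphics and compute each stream one such
                    # buffer, which alone is on the order of the whole L2 on these parts
```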
 
GCN3's poor DX12 performance got even worse.

GCN3's poor performance at what? GCN performs really well in DX12... In this game there's a serious problem (10 fps average with HBAO+), but it's not due to GCN, it's due to the developers.

I'm sorry, but your title is quite inappropriate. (Or have I misunderstood it?)
 
There's striping over all the channels, so that's still 0.75 Gtxn/s per channel feeding 44 CUs vs. 0.5 Gtxn/s per channel feeding 64 CUs. In this particular case, at 64k particles, the graphics queue is rendering from a 2 MB buffer and the compute queue is reading from another 2 MB buffer, so that definitely blows past the cache sizes. And since it's an n-body problem, it needs the entire 2 MB buffer for each of the 64k particles.
Using the single-channel figure means the GPU cannot use all channels?
If it is striped over all available channels, the aggregate transaction capability of the memory subsystem is 12 Gtxn/s for 44 CUs, or ~0.27 Gtxn/s per CU.
Fury would get 16 Gtxn/s for 64 CUs, or 0.25 Gtxn/s per CU.
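Spelled out, with the equivalent bytes per second per CU at 32 bytes per transaction:

```python
# Aggregate striped rates divided across the CU counts.
print(12.0 / 44, 12.0 * 32 / 44)   # ~0.27 Gtxn/s and ~8.7 GB/s per CU (390X, 44 CUs)
print(16.0 / 64, 16.0 * 32 / 64)   # 0.25 Gtxn/s and 8.0 GB/s per CU (Fury, 64 CUs)
```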
 
About the FreeSync/V-Sync issue: I think Guru3D explained that AMD has a driver fix for it that will be released soon for Ashes. I wouldn't read too much into this situation right now.

Not just you, lanek, but everyone who responded: did you read the PCPer link I provided? Ryan clarifies it much better and has been in discussion at length with various teams.
The issue goes way beyond just setting Direct Flip and comes down to an architectural strategy from Microsoft. That is what is going to create the sh$t storm, which IMO has not hit yet, although some hints are starting to show, and it looks like it will heavily affect both AMD and NVIDIA, and of course developers, who end up with even more responsibility while no clear practical strategy has yet appeared from Microsoft.
Anyway, the PCPer link is a good read and IMO puts into better context why this could be a right nightmare, why I am curious whether both G-Sync and FreeSync are working correctly with the latest update to Ashes, and how SLI and CrossFire will work with the implementation Microsoft wants to push (which aligns with the Microsoft Store).

Cheers
 
Do we actually know that Fiji has 32 channels internally?

What if AMD decided that with 4 shader engines it would aggregate the 8 channels at each HBM device into a single channel, targeting a simple crossbar twixt 4 HBM controllers and 4 shader engines?
 
AMD did not compare or state the L2 bandwidth for Fiji versus Hawaii, while the L2 bandwidth change between Tahiti and Hawaii due to the rise in channel count was noted. Maybe something changed.

The HBM spec states that the channels in a stack are independent, although that doesn't necessarily mean the chip couldn't send essentially the same transaction to each channel. Earlier CPUs have ganged two 64-bit channels into a 128-bit interface, such as earlier AMD IMCs.
How wieldy a 1024-bit channel would be is unclear, and the old relationship between a channel, an L2 slice, and each slice's ability to service requests would be changed.

That would point to a GPU-side constraint not attributable to the memory stack or its clock speed. The amount of data fetched with 8 channels supplying 32B per burst would be large compared to the cache line granularity.
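Putting a rough number on that last point: if the 8 channels of a stack were ganged and each still delivered its 32-byte burst per transaction, every access would span several cache lines. The 64-byte line size below is an assumption (the usual GCN figure).

```python
# Minimum fetch if a stack's 8 channels were ganged into one logical channel.
channels_per_stack = 8
burst_bytes = 32            # 32 bytes per channel per transaction
cache_line = 64             # assumed GCN cache line size
ganged_fetch = channels_per_stack * burst_bytes
print(ganged_fetch, ganged_fetch // cache_line)   # 256 bytes, i.e. 4 cache lines per access
```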
 
Isn't this kinda expected, and hasn't it been pointed out in this thread already? Running compute and graphics concurrently makes both queues compete for resources, most notably bandwidth. As the amount of work increases, so does the amount of L1/L2 evictions.
This could be the case.

The HBM interface is clocked far slower than GDDR5 interfaces are, and thus it's able to service fewer transactions.
That doesn't make a lot of sense: ignoring refresh, HBM is theoretically able to saturate the bus (just like GDDR5), so if HBM has higher absolute BW, it can also service more transactions.

For GDDR5 and HBM, the transaction size is 32 bytes. A 12-device GDDR5 GPU has 12 channels. A 4-stack HBM1 GPU has 32 channels. There are probably some adjustments w.r.t. the number of clock cycles per command, but, again, since peak BW is achievable, it's not an issue anyway.
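Rough numbers for that, using a 12-channel GDDR5 layout at 6 Gb/s and a 4-stack (32-channel) HBM1 layout at 1 Gb/s as illustrative configurations: the per-channel rate is lower on HBM, but the aggregate transaction rate tracks the aggregate bandwidth.

```python
# Peak bandwidth and peak 32-byte transaction rate for the two layouts above.
TXN_BYTES = 32

def peak(channels, channel_width_bits, pin_rate_gbps):
    bw_gb_s = channels * channel_width_bits * pin_rate_gbps / 8
    return bw_gb_s, bw_gb_s / TXN_BYTES        # (GB/s, Gtxn/s)

print(peak(12, 32, 6.0))    # 12-channel GDDR5 @ 6 Gb/s: (288.0, 9.0)
print(peak(32, 128, 1.0))   # 32-channel HBM1 @ 1 Gb/s:  (512.0, 16.0)
```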
 
1024-bit ganged at the HBM controller could be 512-bit at 1GHz across the chip?

But you're right, the cache line abuse this implies makes this seem very unlikely.

On the other hand, there do seem to be some oddities in Fiji memory system performance. I've failed to find decent data, though.
 