But I thought Fiji's TMUs can't read it directly (a decompression pass is needed; Polaris introduced that capability to the TMUs, if I'm not mixing things up), so the difference between black and random content there seems strange (the TMUs shouldn't care at all). The behaviour is somewhat similar to the pixel fillrate tests at hardware.fr (for older GPUs; there is no data for Vega yet), which is why I'd like to know what the tests actually measure and how.
AMD's description of DCC from GCN 1.2 onward is that the shader core can read the format directly, although the compression ratio is worse if a compressed resource is shader-readable. I didn't see a special case made of Fiji or Tonga.
However, unless the test is low-level enough to evade the driver, an intra-frame dependence on a render target by a shader triggers a barrier and/or an automatic disabling of DCC, since the metadata path is not coherent with the shader's view.
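To make that concrete, here is a minimal sketch (assuming D3D12; the function and names are illustrative, not the test's actual code) of the kind of same-frame transition where the driver has to step in:

```cpp
#include <d3d12.h>

// Transition a render target so a shader can sample it in the same frame.
// On GCN, a transition like this is where the driver may insert a decompress
// pass or disable DCC, since the shader path must see coherent data.
void TransitionForShaderRead(ID3D12GraphicsCommandList* cmdList,
                             ID3D12Resource* renderTarget)
{
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Transition.pResource   = renderTarget;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
    cmdList->ResourceBarrier(1, &barrier);
}
```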
For the texture filtering test at least, I thought it could get by with writing to a render target and not consuming it until frame N+1.
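Something like a two-target ping-pong would do it (again just a sketch under that assumption; `targets` and the parity scheme are mine, not the test's):

```cpp
#include <d3d12.h>

// Two render targets alternate roles each frame, so the shader only ever
// samples last frame's output and never depends on the target being written.
ID3D12Resource* targets[2]; // assumed created elsewhere

void FrameTargets(unsigned frameIndex,
                  ID3D12Resource** writeTarget,
                  ID3D12Resource** readTarget)
{
    *writeTarget = targets[frameIndex & 1];       // rendered to this frame
    *readTarget  = targets[(frameIndex + 1) & 1]; // written last frame, safe to sample
}
```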
I think the bandwidth test could have concurrent reads by the texture units and output from the ROPs. As long as they aren't hitting the same target, taking the total data moved by both sets of operations and dividing it by the time they take to complete would give a rough indicator of whether compression was easing the burden. I'm not sure what would be measured if the two paths hit the same resource, given the conservative barrier and disabling of DCC by the driver.
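The arithmetic would be something like this (a sketch, assuming one can count the logical bytes sampled by the TMUs and written by the ROPs):

```cpp
// Apparent bandwidth from logical bytes moved. If compression is working,
// fewer physical bytes cross the bus, elapsed time shrinks, and the apparent
// figure rises; in Nvidia's case it can even exceed the theoretical bus peak.
double ApparentBandwidthGBps(double texelBytesRead,
                             double ropBytesWritten,
                             double elapsedSeconds)
{
    return (texelBytesRead + ropBytesWritten) / elapsedSeconds / 1e9;
}
```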
Unlike Nvidia's results, however, the apparent bandwidth doesn't seem to exceed the theoretical maximum delivery of the memory bus. That there is a difference at all does support the idea that something is being compressed.
Edit:
And MDolenc is completely right that the texel fillrate tests (which are done with tiny textures easily fitting into the caches) are also somewhat off for Vega. Something weird is going on, and the texture bandwidth numbers may be skewed by a side effect of that.
Perhaps a change in how Vega handles texture ops, or more latency than there used to be?
I think there was additional code added that mentioned modifying the math for fetch address calculation (edit: or some documented lines indicating the math was currently buggy). It might have a negative effect on a tight loop that spams trivially cached bilinear fetches, whereas more complex workloads are more likely to have other bottlenecks.
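For reference, the kind of loop I mean would look roughly like this, embedded HLSL-in-C++ style (purely illustrative; the texture, sampler, and iteration count are my assumptions, not the actual test's source):

```cpp
// A texel fillrate microbenchmark's hot loop: back-to-back bilinear fetches
// from a tiny, cache-resident texture, so the TMU path is the only bottleneck.
static const char* kFillratePS = R"(
Texture2D    tex   : register(t0);  // tiny texture, fits in L1
SamplerState bilin : register(s0);  // bilinear filtering

float4 main(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
{
    float4 acc = 0;
    [loop]
    for (int i = 0; i < 256; ++i)       // spam trivially cached fetches
        acc += tex.SampleLevel(bilin, uv + i * 1e-6, 0);
    return acc * (1.0 / 256.0);         // keep the fetches live
}
)";
```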
The texturing block is something GCN's architects admitted had more hidden state than they would have liked, which might be due for a change. If that portion is now more programmable, it might inject additional sources of delay at the issue of the relevant instructions or at register writeback.
Vega does seem to be gearing up for some kind of change to its memory handling or latency, since the LLVM patches show the vector memory waitcnt is quadrupled. I do not recall the particulars of the test, but perhaps cycling texture data through the L1 no longer covers the delays, if latency is expected to grow in proportion to the wait counts meant to hide it.
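Back-of-envelope on that (assuming "quadrupled" refers to the vmcnt field widening from 4 to 6 bits, as in the public GFX9 LLVM sources):

```cpp
// Maximum outstanding vector memory operations a wave can track:
const unsigned vmcntMaxGfx8 = (1u << 4) - 1; // 15 on earlier GCN (4-bit field)
const unsigned vmcntMaxGfx9 = (1u << 6) - 1; // 63 on Vega (6-bit field), 4x the states
```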