AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

How would tiling help with a low level micro-benchmark that only tests texture bandwidth?
You're right, this test should have zero overdraw. So DSBR shouldn't be doing anything. Or it's doing the wrong thing and the GPU is tripping over itself trying to bin something that doesn't need to be binned :runaway:
 
Alas, that's a rather ancient OpenGL test. I'll see if I can run it tomorrow in the office. But IIRC the results have been... strange for a couple of other cards a few years back, so I stopped using it on a regular basis. I still don't see, however, how I could correlate certain filtering modes to ALUs. Except that the results between filtering modes differ wildly from those on Fiji/Polaris - which the ones tested with the modern B3D suite do not indicate.

Right, having had a look at my Excel data, the last two cards I tested were the HD 7970 and the 780 Ti. And looking at that data, I see now why I stopped using it: it does indeed draw to the framebuffer and seemingly uses only a single texture layer, so in the majority of cases, even with trilinear filtering, the results were ROP-limited (or, on Nvidia, sometimes shader-export limited). It was a great tool in its time, when texture filtering with the (then) more exotic formats was half- or even quarter-rate, sometimes not even supported.

I'll give it a quick run nevertheless, but hopes are low that we can learn anything useful from it.
 
Right, having had a look at my Excel data, the last two cards I tested were the HD 7970 and the 780 Ti. And looking at that data, I see now why I stopped using it: it does indeed draw to the framebuffer and seemingly uses only a single texture layer, so in the majority of cases, even with trilinear filtering, the results were ROP-limited (or, on Nvidia, sometimes shader-export limited). It was a great tool in its time, when texture filtering with the (then) more exotic formats was half- or even quarter-rate, sometimes not even supported.

I'll give it a quick run nevertheless, but hopes are low that we can learn anything useful from it.
Any chance you could do an HBM-OC-only run? 1050 core + HBM OC @ 1100, or whatever the highest you can get is? And 1440 @ 1100?

Just wondering how much the memory bandwidth is holding it back, both at Fury clocks and at stock clocks. I have a feeling that GN's test will only have both OC'd, so it won't show the bottlenecking as clearly.
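For a rough sense of what an HBM OC buys, here's a back-of-the-envelope sketch; the 2048-bit bus and 945 MHz stock HBM2 clock are the commonly quoted Vega FE figures and are assumptions here, not measurements:

```cpp
#include <cstdio>

// Rough HBM bandwidth: bus width (bits) / 8 * 2 transfers per clock * memory clock.
// Bus widths and clocks below are assumed/commonly quoted figures, not measured ones.
double hbm_bandwidth_gbs(double bus_bits, double clock_mhz) {
    return bus_bits / 8.0 * 2.0 * clock_mhz * 1e6 / 1e9;
}

int main() {
    std::printf("Vega FE stock  945 MHz: %.0f GB/s\n", hbm_bandwidth_gbs(2048, 945));   // ~484
    std::printf("Vega FE OC    1100 MHz: %.0f GB/s\n", hbm_bandwidth_gbs(2048, 1100));  // ~563
    std::printf("Fury X  HBM1   500 MHz: %.0f GB/s\n", hbm_bandwidth_gbs(4096, 500));   // ~512
    return 0;
}
```

So an 1100 MHz HBM OC would be worth roughly 16% more bandwidth than stock and would put the card slightly above Fury X; if performance scales noticeably with that, bandwidth is clearly part of the problem.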
 
It's shaping up to be AMD's Fermi, except for AMD having a bigger chip than RV870 that was 50% better.
Nope. Fermi was (at least when it was released) the fastest GPU available. It was really horrible, but it still beat its competitors.

So far Vega is faster than the GTX 1070, but not the GTX 1080, 1080 Ti, Titan X or Titan Xp. From this POV it seems more like AMD's FX 5800.
 
How would tiling help with a low level micro-benchmark that only tests texture bandwidth?
Actually, I would like to know what this test is actually measuring and how. Because as you see an effect from the framebuffer compression (the prevalent explanation for the difference between the random and black textures), it appears to measure ROP bandwidth or throughput. And if there is only a single texture fetch per fragment/pixel (or what does that "1x texture" mean?), it is going to be ROP-limited and not indicative of texturing bandwidth no matter what.
 
As I wrote in my previous post: I doubt very much it's an effective bandwidth issue. Check the texturing fillrate (standard 32-bit A8R8G8B8)... that's quite far off as well. The texture there should be something small that fits into the cache (and based on other results it is). The low effective bandwidth seems to me more like a consequence of the weird texturing performance.
 
As I wrote in my previous post: I doubt very much it's an effective bandwidth issue. Check the texturing fillrate (standard 32-bit A8R8G8B8)... that's quite far off as well. The texture there should be something small that fits into the cache (and based on other results it is). The low effective bandwidth seems to me more like a consequence of the weird texturing performance.

And I don't really see why this "texture" performance problem would be there (if this is the problem)... What could have changed compared to Fiji in this regard (architecture-wise)?
 
Actually, I would like to know what this test is actually measuring and how. Because as you see an effect from the framebuffer compression (the prevalent explanation for the difference between the random and black textures), it appears to measure ROP bandwidth or throughput.
But this assumes that ROPs are responsible for fetching data for the texture unit, doesn't it? I don't think that's necessary.
 
But this assumes that ROPs are responsible for fetching data for the texture unit, doesn't it? I don't think that's necessary.
Of course not. But you can fetch way more texels through the TMUs than you can export pixels through the ROPs. So if there is only a single texture fetch per pixel (as the "1x texture" designation suggests), you don't measure texturing bandwidth. And for a texturing benchmark it is at least somewhat strange that the framebuffer compression (which is applied while writing to memory through the ROPs) is supposed to have such a large effect. Ideally, the test shouldn't tax the ROPs (so compression wouldn't have an influence). Alternatively, the used texture is created on the fly by the GPU (so it may be created in a DCC-compressed format), and the test measures some combination of bandwidth and how well the TMUs can handle reading this DCC-compressed format. But I thought Fiji's TMUs can't read it directly (a decompression pass is needed; Polaris introduced that feature to the TMUs, if I don't mix it up), so the difference between black and random there appears strange (as the TMUs shouldn't care at all). The behaviour is somewhat similar to the pixel fillrate tests of hardware.fr (for older GPUs, there is no data for Vega yet), which is why I would like to know what the test actually measures and how.

Edit:
And MDolenc is completely right that the texel fillrate tests (which are done with tiny textures easily fitting to the caches) are also somewhat off for Vega. Something weird is going on and the texture bandwidth numbers may be skewed by a side effect of that.
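To put rough numbers on the TMU-versus-ROP point, here's a quick sketch; the unit counts and the ~1.5 GHz clock are assumed ballpark Vega 10 figures, not measurements:

```cpp
#include <cstdio>

int main() {
    // Assumed ballpark Vega 10 figures: 256 TMUs, 64 ROPs, ~1.5 GHz core clock.
    const double clock_hz = 1.5e9;
    const double tmus = 256.0, rops = 64.0;

    const double texel_rate = tmus * clock_hz;  // bilinear fetches per second
    const double pixel_rate = rops * clock_hz;  // 32-bit pixel exports per second

    std::printf("peak texel rate: %.0f GTex/s\n", texel_rate / 1e9);  // ~384
    std::printf("peak pixel rate: %.0f GPix/s\n", pixel_rate / 1e9);  // ~96

    // With a single fetch per pixel the ROP/export path saturates first and the
    // TMUs sit mostly idle, so such a test reads out export throughput, not
    // texturing bandwidth. You need at least this many fetches per pixel before
    // the texture path can even keep up with the ROPs:
    std::printf("fetches per pixel to match ROP export: %.0f\n", texel_rate / pixel_rate);  // 4
    return 0;
}
```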
 
It's a hype machine of course, so no one should expect things like in-game settings and FPS counters to be available. People need to be realistic about the purpose of these events.

Why not? AMD showed off FPS months ago when demoing Vega hands-on.


He checked, and it was 4K Ultra + 8x TSAA at 60+ fps.
 
Edit:
And MDolenc is completely right that the texel fillrate tests (which are done with tiny textures easily fitting to the caches) are also somewhat off for Vega. Something weird is going on and the texture bandwidth numbers may be skewed by a side effect of that.
It would probably be prudent to run a compute-oriented memory bandwidth test in this case, just to rule out an actual memory bandwidth issue.
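A minimal sketch of such a test, assuming an OpenCL-capable setup (error handling omitted; the buffer size and iteration count are arbitrary), could look like this:

```cpp
#include <CL/cl.h>
#include <chrono>
#include <cstdio>

int main() {
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    // Pure streaming copy: no texturing and no ROP exports involved.
    const char* src =
        "__kernel void copy(__global const float4* in, __global float4* out) {"
        "    size_t i = get_global_id(0);"
        "    out[i] = in[i];"
        "}";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "copy", &err);

    const size_t n = 32 * 1024 * 1024;               // 32 Mi float4 = 512 MiB per buffer
    const size_t bytes = n * sizeof(cl_float4);
    cl_mem in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, nullptr, &err);
    cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, nullptr, &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &out);

    // Warm-up launch, then time a batch of launches.
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clFinish(queue);

    const int iters = 20;
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i)
        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clFinish(queue);
    auto t1 = std::chrono::high_resolution_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("effective bandwidth: %.1f GB/s\n",
                2.0 * bytes * iters / secs / 1e9);    // read + write traffic
    return 0;
}
```

If that kind of pure streaming copy also lands well below Fiji's figure, it points at a genuine bandwidth problem rather than a quirk of the texturing path.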
 
But I thought Fiji's TMUs can't read it directly (a decompression pass is needed; Polaris introduced that feature to the TMUs, if I don't mix it up), so the difference between black and random there appears strange (as the TMUs shouldn't care at all). The behaviour is somewhat similar to the pixel fillrate tests of hardware.fr (for older GPUs, there is no data for Vega yet), which is why I would like to know what the test actually measures and how.
AMD's description of DCC from GCN 1.2 onward is that the shader core can read the format directly, although the compression ratio is worse if a compressed resource is shader-readable. I didn't see a special case made of Fiji or Tonga.
However, unless the test is low-level enough to evade the driver, an intra-frame dependence on a render target by a shader triggers a barrier and/or automatic disabling of DCC, since the metadata path is incoherent.

For the texture filtering test at least, I thought it could get by with writing to a render target and not testing the consumption of it until frame N+1.
I think the bandwidth test could have concurrent reads by the texture units and output from the ROPs. As long as they aren't hitting the same target, measuring the total work both sets of operations complete divided by the time taken would give a rough indicator of whether compression was easing the burden. I'm not sure what would be measured if the two paths hit the same resource, given the conservative barrier and disabling of DCC by the driver.

Unlike Nvidia's method, however, the compression doesn't seem to exceed the max theoretical delivery of the memory bus. That there is a difference does support the idea that something is being compressed.
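To spell that check out: a black-vs-random gap only shows that something is compressing, while a result above the physical bus peak proves real traffic was saved. Everything in this sketch (the assumed 484 GB/s peak and both "measured" values) is hypothetical, purely for illustration:

```cpp
#include <cstdio>

int main() {
    // Assumed Vega FE physical peak and hypothetical test readings (GB/s).
    const double bus_peak_gbps  = 484.0;  // assumed 2048-bit HBM2 @ 945 MHz
    const double black_texture  = 380.0;  // hypothetical, highly compressible data
    const double random_texture = 300.0;  // hypothetical, incompressible data

    // A gap between the two shows *something* is compressing, but only a result
    // above the physical bus peak proves that real memory traffic was saved.
    std::printf("black vs random gap: %.0f%%\n",
                100.0 * (black_texture / random_texture - 1.0));
    std::printf("black-texture result %s the physical bus limit\n",
                black_texture > bus_peak_gbps ? "exceeds" : "stays below");
    return 0;
}
```

Nvidia's compressible-data runs reportedly do land above the bus peak, which is why the compression there is unambiguous.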

Edit:
And MDolenc is completely right that the texel fillrate tests (which are done with tiny textures easily fitting to the caches) are also somewhat off for Vega. Something weird is going on and the texture bandwidth numbers may be skewed by a side effect of that.
Perhaps a change in how Vega handles texture ops, or more latency than there used to be?

I think there was additional code added that mentioned modifying the math for fetch address calculation (edit: or some documented lines indicating the math was currently buggy). It might have a negative effect on a tight loop that spams trivially cached bilinear fetches, whereas more complex workloads are more likely to have other bottlenecks.
The texturing block is something GCN's architects admitted had more hidden state than they would have liked, which might be due for a change. If that portion is more programmable it might inject additional sources of delay at the issue of the relevant instructions or register writeback.

Vega does seem to be gearing up for some kind of change to its memory handling or latency since the LLVM patches show the vector memory waitcnt is quadrupled. I do not recall the particulars of the test, but perhaps the cycling of texture data in the L1 may no longer cover delays if the latency is expected to grow in proportion to the wait counts intended to handle it.
 
The memory transfer bandwidth as measured by AIDA64's GPGPU test is 20% better on Fiji (367 GB/s) than on Vega (303 GB/s); right in between sits the Titan X (Pascal) at 336 GB/s.
That's roughly 20% lower effective bandwidth from only about 6% lower theoretical bandwidth.
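Spelling that out, with the 512 GB/s (Fiji) and 484 GB/s (Vega FE) theoretical peaks assumed from the usual spec sheets:

```cpp
#include <cstdio>

int main() {
    // Measured AIDA64 copy numbers from the post vs. assumed theoretical peaks (GB/s).
    const double fiji_measured = 367.0, fiji_peak = 512.0;
    const double vega_measured = 303.0, vega_peak = 484.0;

    std::printf("Fiji efficiency: %.0f%%\n", 100.0 * fiji_measured / fiji_peak);  // ~72%
    std::printf("Vega efficiency: %.0f%%\n", 100.0 * vega_measured / vega_peak);  // ~63%
    std::printf("Fiji advantage: %.0f%% measured, %.0f%% theoretical\n",
                100.0 * (fiji_measured / vega_measured - 1.0),   // ~21%
                100.0 * (fiji_peak / vega_peak - 1.0));          // ~6%
    return 0;
}
```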
 