Yes, there is something iffy with the new pixel engine. So anyone is welcome to correct me here, but as a layman I see at least two major culprits:
1 - Texel fillrate (per TMU per clock) is pretty terrible compared to Polaris and even Fiji (maybe connected to new ROPs as L2 cache clients?)
2 - Effective bandwidth is actually lower than Fiji
Something strange going on with geometry performance too, as the promised 2.6x geometry performance boost due to the new primitive shader simply isn't there. It was supposed to be 11 triangles/clock when in fact we're seeing the same 4 triangles/clock as Fiji. At 1050MHz Vega is hitting close to 4000MTriangles/s, when the slides suggested it should reach up to 11000MTri/s at ~1GHz, putting it above the Pascal cards if it was at the default clocks.
Combine this with current Vega FE clocks barely going above Polaris 20's, when slides from January suggested a 2x clock increase, and it's pretty much the perfect storm of anticipated vs. current performance.
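For anyone who wants to check the triangle-rate arithmetic above, here is a minimal sketch. It only multiplies triangles per clock by clock speed, using the figures quoted in this thread; the function names are made up for illustration and it says nothing about how the benchmark itself measures throughput.

```python
def peak_tri_rate_mtris(tris_per_clock: float, clock_mhz: float) -> float:
    """Theoretical peak in MTriangles/s: triangles per clock times clock in MHz."""
    return tris_per_clock * clock_mhz


def implied_tris_per_clock(measured_mtris: float, clock_mhz: float) -> float:
    """Triangles per clock implied by a measured rate at a known clock."""
    return measured_mtris / clock_mhz


# Figures quoted in the thread.
fiji_like_peak = peak_tri_rate_mtris(4, 1050)      # 4 tri/clock at 1050MHz
slide_claim_peak = peak_tri_rate_mtris(11, 1000)   # 11 tri/clock at ~1GHz
implied = implied_tris_per_clock(4000, 1050)       # "close to 4000MTriangles/s"

print(f"4 tri/clk @ 1050MHz : {fiji_like_peak:.0f} MTri/s")
print(f"11 tri/clk @ 1000MHz: {slide_claim_peak:.0f} MTri/s")
print(f"implied tri/clk     : {implied:.2f}")
```

So the roughly 4000MTri/s result works out to just under 4 triangles per clock, which is why it looks like the Fiji front end rather than the 11 per clock from the slides.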
What if Vega doesn't have TMUs? With the 2xFP16(INT16?) and 4xINT8 they could be filtering with the shader cores. Then lower bandwidth and/or register pressure slowing things down. With everything seemingly programmable that makes sense. Could apply to ROPs as well. Still leaves the question of what's taking up all the space.
1 - Texel fillrate (per TMU per clock) is pretty terrible compared to Polaris and even Fiji (maybe connected to new ROPs as L2 cache clients?)
This is probably the real killer. How exactly is this measured? Might be a weird measuring error due to Infinity. Tying a cluster to a particular channel might not align well to the testing.
2 - Effective bandwidth is actually lower than Fiji
The option is there though to make visible the other tests - as is the case with almost all other sub-tests. I've only culled a few texturing results which I felt show redundant information.
By default, PCGH present the results for the 100% culled test (using strips, so a modern GPU's peak throughput ideally) and the 50% culled test (using lists). In both cases the geometry is always fully submitted to the GPU with no host-side culling.
Something strange going on with geometry performance too, as the promised 2.6x geometry performance boost due to the new primitive shader simply isn't there. It was supposed to be 11 triangles/clock when in fact we're seeing the same 4 triangles/clock as Fiji. At 1050MHz Vega is hitting close to 4000MTriangles/s, when the slides suggested it should reach up to 11000MTri/s at ~1GHz, putting it above the Pascal cards if it was at the default clocks.
When looking at the FE clocked at 1050, the results are identical to Fury X. So while they have obviously improved clock speeds, the first order pipeline structure seems to be (unsurprisingly) unchanged.
Latency and ALU look very good and improved vs Fiji.
The option is there though to make visible the other tests - as is the case with almost all other sub-tests. I've only culled a few texturing results which I felt show redundant information.
So, during PCGH testing, the card does indeed throttle down heavily (to about 1269MHz under heavy load). They also found that 1.2V is used by default for the 1600MHz stage. Thus I feel the discussion about other clocks is rather academic at this point. Fact of the matter is, Vega FE consumes so much power at sustained 1600MHz clocks that, to keep it from throttling under load, the Power Target has to be unlocked, which increases power consumption even further.
He says in the video, setting the power management manually disables the auto voltage of the GPU, causing it to always pull 1.2V.
Given your numbers, we at least know Vega achieves some of its goals in geometry processing (it's 80% faster than Fiji clock for clock in the 100% culling strip), which defeats the Fiji drivers argument for good this time.
And the product specs for Vega FE also mention 4 triangles per clock.
AMD hasn't confirmed the number of TMUs as of yet. This should be straightforward information. Supposedly it is the same number as Fiji, unless there are fewer, in which case this could explain the scores we are seeing.
Which leaves the question whether or not it can be solved for the RX.
Not that I know of. I have nothing omitted, except the mentioned redundant texture fillrate tests.
Is there also a polygon test with strip and 0% culling?
Yes, compressible (one color) vs. basically non-compressible (random) textures.
The random texture unit results are strange. This is essentially just another BW test, isn't it?
That's actually been done with Polaris already.
Given your numbers, we at least know Vega achieves some of its goals in geometry processing (it's 80% faster than Fiji clock for clock in the 100% culling strip), which defeats the Fiji drivers argument for good this time.
No, they did not. I guess everyone automatically assumed that the "quad TMU per CU" arrangement did not change. When I first saw the results, and repeated re-runs showed the same numbers, I wondered about possible inefficiency of, or issues with, the texture cache. Nvidia did unify that with the L1 data cache... maybe there's an unsolved contention here.
AMD hasn't confirmed the number of TMUs as of yet. This should be straightforward information. Supposedly it is the same number as Fiji, unless there are fewer, in which case this could explain the scores we are seeing.
I'll need to double check, but I recall from drivers the scalars used 5 of 16 registers for addressing each wave. That could be where the 11 triangles come from. Broadcasting limitation to VALUs perhaps?
Something along the lines of being able to work on 11 triangles concurrently
You lost me here.
Did that test suite have any of the filtering tests from way back? Curious how well that aligns to the theoretical flops.
I recall some tests from a while (long while) ago testing filtering on various texture formats. Depth, HDR, INT8 rates etc for point sampling, bi/trilinear, anisotropic, etc. Theory being the TMUs had better filtering capability than ALUs. Might confirm a lack of TMUs if the different rates align to the ALU ratios.
You lost me here.
Don't use these results, it's an old Polaris driver!
That's actually been done with Polaris already.
That's not true if the texture units are held back by memory BW, as can be seen from the fact that the numbers are the same for the FE clocked at 1050 and at 1600.
AMD hasn't confirmed the number of TMUs as of yet. This should be straightforward information. Supposedly it is the same number as Fiji, unless there are fewer, in which case this could explain the scores we are seeing.
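To make the bandwidth-bound argument concrete, here is a rough sketch of a "minimum of two limits" model. Everything in it is an assumption for illustration only: the TMU count (not confirmed by AMD, as noted in this thread), the bandwidth figure, the bytes fetched per texel, and the idea that every texel comes from memory with no cache hits. It is not how the test suite computes anything.

```python
def texel_rate_gtexels(tmus: int, clock_mhz: float,
                       bandwidth_gbs: float, bytes_per_texel: float) -> float:
    """Achievable texel rate as the lower of a TMU limit and a bandwidth limit."""
    # TMU limit: one texel per TMU per clock, converted to GTexels/s.
    tmu_bound = tmus * clock_mhz / 1000.0
    # Bandwidth limit: every texel fetched from memory (no cache reuse assumed).
    bandwidth_bound = bandwidth_gbs / bytes_per_texel
    return min(tmu_bound, bandwidth_bound)


# Hypothetical inputs, NOT measured or confirmed values.
ASSUMED_TMUS = 256
ASSUMED_BW_GBS = 480.0

# Non-compressible texture: 4 bytes per texel actually hit memory.
random_1050 = texel_rate_gtexels(ASSUMED_TMUS, 1050, ASSUMED_BW_GBS, 4.0)
random_1600 = texel_rate_gtexels(ASSUMED_TMUS, 1600, ASSUMED_BW_GBS, 4.0)

# Highly compressible texture: assume far fewer bytes per texel reach memory.
flat_1050 = texel_rate_gtexels(ASSUMED_TMUS, 1050, ASSUMED_BW_GBS, 0.5)
flat_1600 = texel_rate_gtexels(ASSUMED_TMUS, 1600, ASSUMED_BW_GBS, 0.5)

print(random_1050, random_1600)  # same value at both clocks: bandwidth-bound
print(flat_1050, flat_1600)      # scales with clock: TMU-bound
```

In this toy model, a result that does not move between 1050 and 1600 points at the bandwidth limit, so it cannot tell you how many TMUs there are; only a case that scales with clock would.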
It's an expensive, finished, released-to-market product. Even if you don't want to draw RX Vega conclusions, the results are still interesting on their own.
I think it might be pointless to discuss Vega FE like it was a finished product and the current findings are representative.
You'd have a point if only fill rate were an issue: you could blame the tiler not being enabled for that. Maybe.
My theory is AMD put out Vega FE to do what it can currently do, for whatever reason. It was not ready for gaming. They didn't have whatever software supports most of the new gaming-relevant hardware ready for launch.
Alas, that's a rather ancient OpenGL test. Will see if I can run it tomorrow in the office. But IIRC the results have been... strange for a couple of other cards a few years back, so I stopped using it on a regular basis. I still don't see, however, how I can correlate certain filtering modes to ALUs, unless the results between filtering modes differ wildly from the ones on Fiji/Polaris, which the results from the modern B3D suite do not indicate.
I recall some tests from a while (long while) ago testing filtering on various texture formats. Depth, HDR, INT8 rates etc for point sampling, bi/trilinear, anisotropic, etc. Theory being the TMUs had better filtering capability than ALUs. Might confirm a lack of TMUs if the different rates align to the ALU ratios.