AMD Vega Hardware Reviews

SPECviewperf and compute results (except mining) do show a sizeable price/performance advantage over the existing competition.
Maybe because they fail to mention that the Titan Xp is not running pro drivers while Vega FE is using them? Or fail to mention that a Quadro GP104 still beats Vega FE in SPECviewperf? As for CompuBench, Vega FE is so far behind that it's not even a competition.
 
SPECviewperf needs to be thoroughly understood before numbers are posted*. At least that's how I feel about it now, after having toyed around with it for a couple of hours. There are still plenty of scores that don't make much sense at first glance, and by that I do not mean the obvious differences between GeForce and Quadro, for example (or Radeon and Radeon WX, for that matter).

*Disclaimer: I am not our editor-in-chief, so don't hold it against me if we do end up posting something in that regard.
 
Our (p)review (in German) is out.

I didn't know you did PCGH; I've been reading it for a long time (thanks to Google Translate). Always enjoyed it: lots of great testing on both CPUs and GPUs. I appreciate the hard work you guys do and all the issues you find (like the UWP oddities) that most US sites never mention.

Thanks for running the B3D Suite; it was interesting to see how it compared to Fury, especially clock for clock.
 
Expanding on the "FGL" fun fact:
Code:
SPECviewperf MESSAGE: Welcome to SPECviewperf 12.1.1
---------------------------------------------------------------
SPECviewperf 12.1.1 settings:
setting          source          value
---------------------------------------------------------------
viewperf root    default         (null)
viewset name     viewset config  snx-02
viewset library  viewset config  snx-02
window x         viewset config  10
window y         viewset config  20
window width     viewset config  1900
window height    viewset config  1060
multisample      viewset config  0
screen           viewset config  0
threads          viewset config  1
processes        viewset config  1
results dir      viewset config  c:\SPEC\SPECgpc\SPECviewperf12\results\snx-02
---------------------------------------------------------------
END SPECviewperf settings
---------------------------------------------------------------
SPECviewperf MESSAGE: Viewset Message: Graphics Renderer: ATI Technologies Inc. Radeon Vega Frontier Edition 4.5.13486 Compatibility Profile Context FireGL 22.19.384.2.
[my bold]
 
Well, the polygon throughput is certainly impressive, but it appears to be hampered by the culling performance. It seems like something that could swing performance considerably depending on the engine/game (though it doesn't appear to be something AMD will be handling passively)...

The only other oddity appears to be a performance regression in effective bandwidth. Hopefully that is just a driver issue; it would be pretty unfortunate otherwise...

Ultimately, my impression thus far is that Vega is shaping up to be more or less AMD's Fermi. The power consumption and resulting heat are a bit disheartening, but it looks like it will be a very robust architecture to iterate on, and the path forward seems pretty clear.
 
I've just noticed something interesting about the Effective Texture Bandwidth results: the "shape" of Vega FE's 1x random, 1x black, 8x random, 8x black results is very similar to the shape seen with RX 580. The percentages of all four versus 1x black:

Vega FE: 65, 100, 69, 74
RX 580: 67, 100, 75, 83
Fury X: 85, 100, 110, 121

Fury X is quite differently "shaped" though.

So, the reason that Vega FE's 8x results are worse than 1x results could be the same as for RX 580.
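For anyone who wants to check the "shape" idea against their own runs, here's a minimal sketch of the normalization described above. The raw GB/s numbers in the example are purely illustrative placeholders (not measured values); only the normalization itself mirrors what was done here.

Code:
# Normalize effective-texture-bandwidth results to the 1x-black-texture case,
# i.e. express each test as a percentage of the 1x black result.
def shape(results):
    """results: dict of test name -> effective bandwidth in GB/s."""
    base = results["1x black"]
    return {test: round(100 * bw / base) for test, bw in results.items()}

# Hypothetical raw numbers (placeholders only, not measurements) to show
# how a "shape" like 65 / 100 / 69 / 74 would be derived.
example = {"1x random": 260, "1x black": 400, "8x random": 276, "8x black": 296}
print(shape(example))  # {'1x random': 65, '1x black': 100, '8x random': 69, '8x black': 74}

Comparing shapes this way factors out absolute clock and bandwidth differences, which is why the Vega FE and RX 580 similarity stands out despite very different memory subsystems.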

Why is the RX 580 slower at 8x?
 

I think Tonga has generally escaped scrutiny because it was old news by the time the tester came out, but perhaps if it were tested again we'd have a comparison with a GPU whose IP was aligned with Fiji's.

The ratio of channels to RBEs is different between Fury and the rest.
Fiji had 64 ROPs in 16 RBEs, with the compression occurring at RBE cache spill/fill to memory.
8 channels per stack times 4 stacks (32 128-bit channels) meant there were twice as many channels per ROP/RBE as Polaris (8 32-bit channels for 32 ROPs) or Vega (16 128-bit channels for 64 ROPs).
Tonga might have had the same issue, or perhaps, if Tonga had ever enabled the other 128 bits of its interface, it might have added more data on the effect of channel count.
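To make the ratio explicit, here's the back-of-the-envelope arithmetic behind that comparison. The channel and ROP counts are the ones stated above; the Polaris and Vega RBE counts assume the usual 4 ROPs per RBE, so treat those two as my assumption rather than anything confirmed.

Code:
# Memory channels per ROP / per RBE for Fiji, Polaris and Vega.
gpus = {
    #          (channels, channel width in bits, ROPs, RBEs)
    "Fiji":    (32, 128, 64, 16),  # 4 HBM stacks x 8 channels
    "Polaris": (8,   32, 32,  8),  # 256-bit GDDR5 bus as 8 channels; 8 RBEs assumed
    "Vega":    (16, 128, 64, 16),  # 2 HBM2 stacks x 8 channels; 16 RBEs assumed
}

for name, (channels, width, rops, rbes) in gpus.items():
    print(f"{name}: {channels / rops:.2f} channels/ROP, "
          f"{channels / rbes:.1f} channels/RBE, {channels * width}-bit bus")
# Fiji: 0.50 channels/ROP, 2.0 channels/RBE, 4096-bit bus
# Polaris: 0.25 channels/ROP, 1.0 channels/RBE, 256-bit bus
# Vega: 0.25 channels/ROP, 1.0 channels/RBE, 2048-bit bus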

I am unclear on how Fiji's L2 slices were arranged versus its HBM channels, although there were signs that its on-die bandwidth didn't match the growth in channel count. Synthetics showed results in the neighborhood of Hawaii until memory strides thoroughly blew through the caches and started queuing up on the DRAM channels.

How AMD's compression works is not clear. There is a separate path for metadata such as compression state, and the compression logic sits after the caches and before DRAM. That could mean there is more compressor throughput if it is per channel, or perhaps there are channels to spare if the compression pipeline winds up blocking a controller for some number of cycles or is fetching/evicting from its own metadata cache.

There was some discussion on how this test worked, back in the mists of time. I'm having difficulty finding it. AMD cautions with its DCC description that sufficiently poor or random texturing loads can cause DCC to be a performance regression, and if the 8x test is trying to throw traffic to 8x as many separate locations it might be hitting a corner case for it.
 
I did some comparisons last year with clocks set to identical speeds (in the case of memory, to achieve identical bandwidth):
@ same clock-speeds R9 280X UC R9 380X OC RX 470 UC
Bandwidth (GB/s) RSCE 16.10.1 RSCE 16.10.1 RSCE 16.10.1
1 Random Texture 156 153 157
1 Black Texture 153 214 221
8 Random Textures 179 178 178
8 Black Textures 179 197 199

basically identical behaviour between Tonga and Polaris.
 
I did some comparisons last year with clocks set to identical speeds (in the case of memory, to achieve identical bandwidth):

code tags. *ahem* :p

Code:
                    R9 280X UC    R9 380X OC   RX 470 UC
Bandwidth (GB/s)    RSCE 16.10.1  RSCE 16.10.1 RSCE 16.10.1
1 Random Texture    156           153          157
1 Black Texture     153           214          221
8 Random Textures   179           178          178
8 Black Textures    179           197          199

Cheers,
 

The GPUs show a pattern of increasing utilization from the 1x to the 8x random-texture tests, which seems consistent with the memory subsystem extracting more workable transaction patterns when there is more concurrency to draw from.
Vega's scaling with concurrency is relatively poorer than all the others', which isn't ideal but could be expected of an immature platform.

Fiji's 32 channels start off less utilized, but there's a consistent ramp, and it gets closer to the theoretical bandwidth of its interface in the 8x black texture case. The other component is that the compression pipeline, which at least theoretically should be similar to Tonga's, has limited upside in the 1x case. Perhaps this is because Fiji's channels are so underutilized that the compression method is saving bus cycles that wind up being wasted anyway.
It seems as if the compression path is less able to extract bandwidth savings past a certain level of concurrent traffic.
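As a rough sanity check, the black-vs-random ratio in the 280X/380X/470 table posted above can serve as a crude proxy for the compression upside, and it does shrink from the 1-texture to the 8-texture case. A small sketch of that calculation (just my reading of those numbers, nothing more):

Code:
# Black-vs-random bandwidth ratio as a rough proxy for DCC upside,
# using the R9 280X / R9 380X / RX 470 table posted earlier (GB/s).
data = {
    "R9 280X": {"1 random": 156, "1 black": 153, "8 random": 179, "8 black": 179},
    "R9 380X": {"1 random": 153, "1 black": 214, "8 random": 178, "8 black": 197},
    "RX 470":  {"1 random": 157, "1 black": 221, "8 random": 178, "8 black": 199},
}

for gpu, r in data.items():
    upside_1x = r["1 black"] / r["1 random"]
    upside_8x = r["8 black"] / r["8 random"]
    print(f"{gpu}: 1x upside {upside_1x:.2f}x, 8x upside {upside_8x:.2f}x")
# R9 280X: 1x upside 0.98x, 8x upside 1.00x
# R9 380X: 1x upside 1.40x, 8x upside 1.11x
# RX 470: 1x upside 1.41x, 8x upside 1.12x

On that reading, the 280X (Tahiti, which has no DCC) shows no upside at all, while Tonga and Polaris drop from roughly 1.4x with one texture to roughly 1.1x with eight, which fits the idea that compression saves less once the channels are already being kept busy by concurrent traffic.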

It'd be nice to know if there are multiple components to this system. Is there a separate ceiling in the compression pipeline related to how much data it can deliver internally to the chip? It seems like Nvidia's method is more aggressive in what it has for an upside, or perhaps tiling accelerates things more.
Is there a separate limit on how much concurrency AMD's compressor can support, such as per-channel context, per-channel metadata DRAM traffic, and/or per-channel compression logic?

Some combinations of answers could make a hypothetical Tonga with all 384-bits of its DRAM bus enabled scale differently.
 