AMD Vega Hardware Reviews

SPECviewperf and compute results (except mining) do show a sizeable price/performance advantage over the existing competition.
Maybe because they fail to mention that the Titan Xp is not running pro drivers while Vega FE is using them? Or fail to mention that a Quadro GP104 still beats Vega FE in SPECviewperf? As for CompuBench, Vega FE is so far behind that it's not even a competition.
 
SPECviewperf needs to be thoroughly understood before numbers are posted*. At least that's how I feel about it now, after having toyed around with it for a couple of hours. There are still plenty of scores that don't make much sense at first glance, and by that I do not mean the obvious differences between GeForce and Quadro, for example (or Radeon and Radeon WX, for that matter).

*Disclaimer: I am not our editor-in-chief, so don't hold it against me if we do end up posting something in that regard.
 
Our (p)review (in German) is out.

I didn't know you did PCGH; I've been reading it for a long time (thanks to Google Translate). Always enjoyed it: lots of great testing on both CPUs and GPUs. I appreciate the hard work you guys do and all the issues you find (like the UWP oddities) that most US sites never mention.

Thanks for running the B3D Suite; it was interesting to see how it compared to Fury, especially clock for clock.
 
Expanding on the "FGL" fun fact:
Code:
SPECviewperf MESSAGE: Welcome to SPECviewperf 12.1.1
---------------------------------------------------------------
SPECviewperf 12.1.1 settings:
setting          source          value
---------------------------------------------------------------
viewperf root    default         (null)
viewset name     viewset config  snx-02
viewset library  viewset config  snx-02
window x         viewset config  10
window y         viewset config  20
window width     viewset config  1900
window height    viewset config  1060
multisample      viewset config  0
screen           viewset config  0
threads          viewset config  1
processes        viewset config  1
results dir      viewset config  c:\SPEC\SPECgpc\SPECviewperf12\results\snx-02
---------------------------------------------------------------
END SPECviewperf settings
---------------------------------------------------------------
SPECviewperf MESSAGE: Viewset Message: Graphics Renderer: ATI Technologies Inc. Radeon Vega Frontier Edition 4.5.13486 Compatibility Profile Context FireGL 22.19.384.2.
[my bold]
 
Well, the polygon throughput is certainly impressive, but it appears to be hampered by the culling performance. It seems like something that could swing performance considerably depending on the engine/game (though it doesn't appear to be something AMD will be handling passively)...

The only other oddity appears to be a performance regression in effective bandwidth. Hopefully that is just a driver issue; it would be pretty unfortunate otherwise...

Ultimately, my impression thus far is that Vega is shaping up to be more or less AMD's Fermi. The power consumption and resulting heat are a bit disheartening, but it looks like it will be a very robust architecture to iterate on, and the path forward seems pretty clear.
 
I've just noticed something interesting about the Effective Texture Bandwidth results: the "shape" of Vega FE's 1x random, 1x black, 8x random, 8x black results is very similar to the shape seen with RX 580. The percentages of all four versus 1x black:

Vega FE: 65, 100, 69, 74
RX 580: 67, 100, 75, 83
Fury X: 85, 100, 110, 121

Fury X is quite differently "shaped" though.

So, the reason that Vega FE's 8x results are worse than 1x results could be the same as for RX 580.
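For anyone who wants to check the "shape" idea against their own runs, here's a minimal sketch of the normalization described above. The raw GB/s numbers in the example are purely illustrative placeholders (not measured values); only the normalization itself mirrors what was done here.

Code:
# Normalize effective-texture-bandwidth results to the 1x-black-texture case,
# i.e. express each test as a percentage of the 1x black result.
def shape(results):
    """results: dict of test name -> effective bandwidth in GB/s."""
    base = results["1x black"]
    return {test: round(100 * bw / base) for test, bw in results.items()}

# Hypothetical raw numbers (placeholders only, not measurements) to show
# how a "shape" like 65 / 100 / 69 / 74 would be derived.
example = {"1x random": 260, "1x black": 400, "8x random": 276, "8x black": 296}
print(shape(example))  # {'1x random': 65, '1x black': 100, '8x random': 69, '8x black': 74}

Comparing shapes this way factors out absolute clock and bandwidth differences, which is why the Vega FE and RX 580 similarity stands out despite very different memory subsystems.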

Why is the RX 580 slower at 8x?
 

I think Tonga has generally escaped scrutiny because it was old news by the time the tester came out, but perhaps if it were tested again we'd have a comparison with a GPU whose IP was aligned with Fiji's.

The ratio of channels to RBEs is different between Fury and the rest.
Fiji had 64 ROPs in 16 RBEs, with the compression occurring at RBE cache spill/fill to memory.
8 channels per stack times 4 stacks (32 128-bit channels) meant there were twice as many channels per ROP/RBE as Polaris (8 32-bit channels for 32 ROPs) or Vega (16 128-bit channels for 64 ROPs).
Tonga might have had the same issue, or perhaps, if Tonga had ever enabled the other 128 bits of its interface, it might have added more data on the effect of channel count.
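To make the ratio explicit, here's the back-of-the-envelope arithmetic behind that comparison. The channel and ROP counts are the ones stated above; the Polaris and Vega RBE counts assume the usual 4 ROPs per RBE, so treat those two as my assumption rather than anything confirmed.

Code:
# Memory channels per ROP / per RBE for Fiji, Polaris and Vega.
gpus = {
    #          (channels, channel width in bits, ROPs, RBEs)
    "Fiji":    (32, 128, 64, 16),  # 4 HBM stacks x 8 channels
    "Polaris": (8,   32, 32,  8),  # 256-bit GDDR5 bus as 8 channels; 8 RBEs assumed
    "Vega":    (16, 128, 64, 16),  # 2 HBM2 stacks x 8 channels; 16 RBEs assumed
}

for name, (channels, width, rops, rbes) in gpus.items():
    print(f"{name}: {channels / rops:.2f} channels/ROP, "
          f"{channels / rbes:.1f} channels/RBE, {channels * width}-bit bus")
# Fiji: 0.50 channels/ROP, 2.0 channels/RBE, 4096-bit bus
# Polaris: 0.25 channels/ROP, 1.0 channels/RBE, 256-bit bus
# Vega: 0.25 channels/ROP, 1.0 channels/RBE, 2048-bit bus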

I am unclear on how Fiji's L2 slices were arranged versus its HBM channels, although there were signs that its on-die bandwidth didn't match the growth in channel count. Synthetics showed results in the neighborhood of Hawaii until memory strides thoroughly blew through the caches and started queuing up on the DRAM channels.

How AMD's compression works is not clear. There is a separate path for metadata such as compression state, and the compression logic sits after the caches and before DRAM. That could mean there is more compressor throughput if it is per channel, or perhaps there are channels to spare if the compression pipeline winds up blocking a controller for some number of cycles or is fetching/evicting from its own metadata cache.

There was some discussion on how this test worked, back in the mists of time. I'm having difficulty finding it. AMD cautions with its DCC description that sufficiently poor or random texturing loads can cause DCC to be a performance regression, and if the 8x test is trying to throw traffic to 8x as many separate locations it might be hitting a corner case for it.
 
I did some comparisons last year with clocks set to identical speeds (in the case of memory, to achieve identical bandwidth):
@ same clock-speeds R9 280X UC R9 380X OC RX 470 UC
Bandwidth (GB/s) RSCE 16.10.1 RSCE 16.10.1 RSCE 16.10.1
1 Random Texture 156 153 157
1 Black Texture 153 214 221
8 Random Textures 179 178 178
8 Black Textures 179 197 199

basically identical behaviour between Tonga and Polaris.
 
I did some comparisons last year with clocks set to identical speeds (in the case of memory, to achieve identical bandwidth):

code tags. *ahem* :p

Code:
                    R9 280X UC    R9 380X OC   RX 470 UC
Bandwidth (GB/s)    RSCE 16.10.1  RSCE 16.10.1 RSCE 16.10.1
1 Random Texture    156           153          157
1 Black Texture     153           214          221
8 Random Textures   179           178          178
8 Black Textures    179           197          199

Cheers,
 

The GPUs show a pattern of increasing utilization from the 1x to the 8x random-texture tests, which seems consistent with the memory subsystem extracting more workable transaction patterns when there is more concurrency to draw from.
Vega's scaling with concurrency is relatively poorer than all the others', which isn't ideal but could be expected of an immature platform.

Fiji's 32 channels start off less utilized, but there's a consistent ramp, and it gets closer to the theoretical bandwidth of its interface in the 8x black texture case. The other component is that the compression pipeline, which at least theoretically should be similar to Tonga's, has limited upside in the 1x case. Perhaps this is because Fiji's channels are so underutilized that the compression method is saving bus cycles that wind up being wasted anyway.
It seems as if the compression path is less able to extract bandwidth savings past a certain level of concurrent traffic.
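As a rough sanity check, the black-vs-random ratio in the 280X/380X/470 table posted above can serve as a crude proxy for the compression upside, and it does shrink from the 1-texture to the 8-texture case. A small sketch of that calculation (just my reading of those numbers, nothing more):

Code:
# Black-vs-random bandwidth ratio as a rough proxy for DCC upside,
# using the R9 280X / R9 380X / RX 470 table posted earlier (GB/s).
data = {
    "R9 280X": {"1 random": 156, "1 black": 153, "8 random": 179, "8 black": 179},
    "R9 380X": {"1 random": 153, "1 black": 214, "8 random": 178, "8 black": 197},
    "RX 470":  {"1 random": 157, "1 black": 221, "8 random": 178, "8 black": 199},
}

for gpu, r in data.items():
    upside_1x = r["1 black"] / r["1 random"]
    upside_8x = r["8 black"] / r["8 random"]
    print(f"{gpu}: 1x upside {upside_1x:.2f}x, 8x upside {upside_8x:.2f}x")
# R9 280X: 1x upside 0.98x, 8x upside 1.00x
# R9 380X: 1x upside 1.40x, 8x upside 1.11x
# RX 470: 1x upside 1.41x, 8x upside 1.12x

On that reading, the 280X (Tahiti, which has no DCC) shows no upside at all, while Tonga and Polaris drop from roughly 1.4x with one texture to roughly 1.1x with eight, which fits the idea that compression saves less once the channels are already being kept busy by concurrent traffic.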

It'd be nice to know if there are multiple components to this system. Is there a separate ceiling in the compression pipeline related to how much data it can deliver internally to the chip? It seems like Nvidia's method is more aggressive in what it has for an upside, or perhaps tiling accelerates things more.
Is there a separate limit on how much concurrency AMD's compressor can support, such as per-channel context, per-channel metadata DRAM traffic, and/or per-channel compression logic?

Some combinations of answers could make a hypothetical Tonga with all 384-bits of its DRAM bus enabled scale differently.
 