AMD Vega Hardware Reviews

You can change that fan behaviour in gaming mode's wattman without ever touching the BIOS switch.

Yeah I know, that's what I did during my own testing. I'm just thinking along the lines of users who might not be willing to mess around in those settings, since many folks are talking about "out of the box" configuration.
 
AMD themselves call it „packed math“, don't they? And also in case of
It's still packed and occurring in the same clock cycle for scheduling. Only difference is the execution unit serializes it thanks to lower propagation delay. Consider a simple integer adder where the MSB depends on the LSB. Half the size, half the delay, so double the clockspeed. It's a bit more complex in reality, but keeping it simple: the same number of bits enters and exits the registers. Like I mentioned above, 4x should be doable with a little work, but the packed part breaks down there, as it would require twice the bandwidth. At which point a lot of changes are being made.
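To make the "packed" part concrete in software terms, here's a minimal SWAR-style sketch (my own illustration with plain integer halves rather than FP16, not a description of the actual hardware) of two independent 16-bit adds sharing one 32-bit register, with the halves kept from carrying into each other:
Code:
#include <cstdint>
#include <cstdio>

// Illustration only: two independent 16-bit adds packed into one 32-bit word (SWAR).
// Masking bit 15 of each half before the add keeps a carry out of the low half from
// spilling into the high half; the XOR folds each half's MSB back into the result.
static uint32_t packed_add16(uint32_t a, uint32_t b) {
    uint32_t sum  = (a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu); // add with each half's MSB masked, so no carry crosses halves
    uint32_t msbs = (a ^ b) & 0x80008000u;                 // XOR of the masked-off MSBs
    return sum ^ msbs;                                     // fold MSBs back in (carry into bit 15 is already in sum)
}

int main() {
    uint32_t a = (7u << 16) | 40000u;  // halves: {7, 40000}
    uint32_t b = (9u << 16) | 30000u;  // halves: {9, 30000}
    uint32_t r = packed_add16(a, b);   // halves: {16, 4464} -- low half wraps mod 65536
    printf("hi=%u lo=%u\n", (unsigned)(r >> 16), (unsigned)(r & 0xFFFFu));
    return 0;
}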
 
Then why is 2*FP16-per-ALU absent from every single consumer Pascal graphics chip, despite Nvidia implementing it in TX1 and GP100?
Segmentation and complexity savings could be reasons. One claim I did see, although it's been some time and I forget if it was a thread here or elsewhere, is that some of the products' development timelines did not match up for the readiness of hardware supporting certain precisions.
GV100 is the first product to have DP, FP32, FP16, and INT8 in the 1/2x, 1x, 2x, 4x (ed: fixed 4x) rate hierarchy that one would expect in a top product, while the Pascal generation's hardware is less uniform.
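Spelling out what that rate hierarchy means in throughput terms, a trivial sketch; the 15 TFLOPS FP32 base rate below is a placeholder of mine, not a quoted spec:
Code:
#include <cstdio>

// Throughput implied by a 1/2x, 1x, 2x, 4x rate hierarchy relative to FP32.
// The FP32 base rate is a placeholder, not a measured or quoted figure.
int main() {
    const double fp32_tflops = 15.0;                          // hypothetical FP32 base rate
    printf("FP64: %5.1f TFLOPS (1/2x)\n", fp32_tflops * 0.5);
    printf("FP32: %5.1f TFLOPS (1x)\n",   fp32_tflops * 1.0);
    printf("FP16: %5.1f TFLOPS (2x)\n",   fp32_tflops * 2.0);
    printf("INT8: %5.1f TOPS   (4x)\n",   fp32_tflops * 4.0);
    return 0;
}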


The logic behind it wouldn't be overly difficult. Technically you could just increase clocks until half the logic ends up in an indeterminate state from propagation delay, and then ignore that half. At least for INT. Then a simple crossbar swizzles the first and second halves of the input and output.
This, and the elaborations that follow, raise many concerns for me.
From how to get a deterministic breakdown into indeterminate behavior, to where this is happening, when these things happen, how they can be accomplished simply, circuit behavior, knock-on effects in the pipeline, pipeline balance, clocking complexity, "simple" circuits that somehow map to situations with nebulous bounds on complexity, inserting layers for decision making, gating, wires in, wires throughout, wires going out, and so on.


It's still packed and occurring in the same clock cycle for scheduling. Only difference is the execution unit serializes it thanks to lower propagation delay. Consider a simple integer adder where the MSB depends on the LSB. Half the size, half the delay, so double the clockspeed.
The EXE stage is a serial chain of single-bit adders and a sequencer for instruction type propagation delay adjustment?

Generally, operand size has not had a linear effect on stage delay. The number of logic layers doesn't scale linearly with width, and there are many elements outside the ALU, like pipeline latches and settling time, or the delay of other stages, that moderate the effect. The switch from 32-bit to 64-bit ALUs for x86 CPUs didn't halve clock speeds, and I think I've seen estimates of the execution stage's internal delay increasing by only a modest amount (years ago I vaguely recall figures in the tens of percent, on top of tens of picoseconds of intra-stage delay), but the overall pipeline was able to mostly absorb the incremental delay, since more pressing limits constrained the clock period before the ALU circuit itself could limit speeds.
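One way to see why the scaling isn't linear: the logic depth of a carry-lookahead/prefix adder grows roughly with log2 of the operand width, so going from 32 to 64 bits adds about one gate level rather than doubling the critical path. A back-of-the-envelope sketch (the per-level delay is an arbitrary placeholder, not a process figure):
Code:
#include <cmath>
#include <cstdio>

// Back-of-the-envelope: a prefix/carry-lookahead adder's depth scales ~log2(width),
// so 32 -> 64 bits adds roughly one level, not double the critical path.
// The delay per level is an arbitrary placeholder, not a real process number.
int main() {
    const double ps_per_level = 15.0;
    const int widths[] = {16, 32, 64};
    for (int width : widths) {
        int levels = (int)std::ceil(std::log2((double)width));
        printf("%2d-bit add: ~%d levels, ~%.0f ps of adder delay\n",
               width, levels, levels * ps_per_level);
    }
    return 0;
}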
 
AMD Radeon RX Vega Lands in Budapest – Shows Performance Close To a GTX 1080, Launching in 2 Weeks
The AMD Radeon RX Vega was tested on a Ryzen 7 system inside a closed PC. People were not able to see the graphics card at all and could only play through one gaming title. The title selected was DICE’s Battlefield 1, running at a resolution of 3440×1440. The graphics card was compared to a second PC running an NVIDIA GeForce GTX 1080 (reference card). This was confirmed by an AMD representative, who said that they were using a GTX 1080 (non-Ti) against the Radeon RX Vega graphics card.

Both systems were running a curved display; one had G-Sync and one had Freesync, so it's obvious that the G-Sync system had a GeForce installed and the Freesync system had a Radeon installed. However, there was no indication which system was using the Freesync + Radeon setup and which had the G-Sync + GeForce setup. Both monitors were fully covered, so even the specific models weren't mentioned.

Anyway, moving onward, one system faced a little hiccup and was performing worse. There's no way to tell which system that was, but we know that a single GeForce GTX 1080 does over 60 FPS (average) at 4K resolution with Ultra settings. The tested resolution is lower than 4K, so it's likely that the GTX 1080 was performing even better here. Plus, AMD not publicly showing any FPS counter and capping the frame rate with syncing suggests that there was possibly a problem with the Radeon RX Vega system against the GeForce part.

  • The AMD rep guy was asked and he said it’s a GTX 1080 non Ti against the RX VEGA
  • We were given 2 systems with an RX and a GTX to play BF1 on.
  • They do use free- and g-sync and yes there were no fps counters. From my experience there were no fps drops on any of the systems.
  • There was a little hiccup, but they resolved it in an instant and from my experience and many others the difference was unnoticeable. Mind you we were not told and are not going to be told which setup is which. Via Szunyogg Reddit
Lastly, AMD reps told the public that the AMD system has a $300 US price difference, which implies two things. First, AMD puts the average price of a Freesync monitor at $200 US less than a comparable G-Sync monitor. If we take that $200 out of the $300 AMD quoted, the Radeon RX Vega could be as much as $100 US cheaper than the GeForce GTX 1080 at launch, which would be a good deal. But they haven't said anything beyond that: no performance numbers in other titles, no power consumption figures and, most importantly, no word on what clocks Vega was running at, which seems a little sad.
http://wccftech.com/amd-radeon-rx-vega-gtx-1080-battlefield-1-comparison/
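For what it's worth, the arithmetic behind that price claim is just a subtraction over the article's own assumptions; a quick sketch (the $200 monitor delta is the assumed average from the article, not confirmed pricing):
Code:
#include <cstdio>

// The article's reasoning, spelled out: the quoted whole-system delta minus an
// assumed Freesync-vs-G-Sync monitor delta leaves the implied GPU price gap.
// Both inputs are the article's claims/assumptions, not confirmed pricing.
int main() {
    const int system_delta_usd  = 300; // AMD rep: the Radeon system is $300 cheaper overall
    const int monitor_delta_usd = 200; // assumed average Freesync vs G-Sync monitor difference
    printf("Implied GPU price gap: $%d\n", system_delta_usd - monitor_delta_usd);
    return 0;
}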

Edit: Why all the smoke and mirrors?
 
If the AMD representative said the Radeon setup was $300 cheaper, I think it is prudent to not assume that they are using the supposed $200 average difference between Freesync and G-Sync unless they said so explicitly.
Covering up the monitor model information means they are free to pick a pair of monitors that are outliers, or at least well above the average price differential.

Assuming that the rest of the system specs or their pricing are equivalent or not misleading would be nice, but the rest of the spectacle makes me wary.

edit:
The Tomshardware temp numbers for Vega and the HBM2 stacks seem strange to me.
95 °C should be something AMD's GPU silicon can readily sustain, but not generally the HBM2.
How the DRAM can hit that temperature when it's at least partly in contact with a cooler that keeps the GPU at 85 °C runs counter to my intuition.

Could the layers really be that insulating versus a GPU at full tilt?
 
If the AMD representative said the Radeon setup was $300 cheaper, I think it is prudent to not assume that they are using the supposed $200 average difference between Freesync and G-Sync unless they said so explicitly.
Covering up the monitor model information means they are free to pick a pair of monitors that are outliers, or at least well above the average price differential.

Assuming that the rest of the system specs or their pricing are equivalent or not misleading would be nice, but the rest of the spectacle makes me wary.

assuming the translation of this video is correct:


then they were explicitly using the price difference between Freesync and G-Sync as the selling point.

Edit: actually he says "G-Sync system", so that could be interpreted as the price difference of monitor + card, if they are using the $200 average monitor difference.
 
From how to get a deterministic breakdown into indeterminate behavior, to where this is happening, when these things happen, how they can be accomplished simply, circuit behavior, knock-on effects in the pipeline, pipeline balance, clocking complexity, "simple" circuits that somehow map to situations with nebulous bounds on complexity, inserting layers for decision making, gating, wires in, wires throughout, wires going out, and so on.
In this case I'd have inserted a few gates to break the circuit, and yes, for FP it would be a little more involved. That said, my understanding was the packed math only accounts for relatively simple math operations. It could very well be its own parallel logic, but they also could have reused portions to conserve space.

The bounds are always going to be the 32 bits in and out in x amount of time. I'm not proposing any changes beyond extending time slightly if it proved necessary.

The EXE stage is a serial chain of single-bit adders and a sequencer for instruction type propagation delay adjustment?
I'm not sure it would even be that complex. It could be a matter of doubling the clockspeed and planning on the circuit stabilizing in a single clock for faster double rate math. Mask off portions to likely save a little power on indeterminate states and comply with standards. Other instructions would now take two cycles to maintain the status quo. Ultimately the timing would need to be analyzed throughout the pipeline to determine clocks, but I'm operating on the assumption that the ALU is the limiting factor, since only a slower ALU could affect other parts of the pipeline.

Generally, operand size has not had a linear effect on stage delay. The number of logic layers doesn't scale linearly with width, and there are many elements outside the ALU, like pipeline latches and settling time, or the delay of other stages, that moderate the effect. The switch from 32-bit to 64-bit ALUs for x86 CPUs didn't halve clock speeds, and I think I've seen estimates of the execution stage's internal delay increasing by only a modest amount (years ago I vaguely recall figures in the tens of percent, on top of tens of picoseconds of intra-stage delay), but the overall pipeline was able to mostly absorb the incremental delay, since more pressing limits constrained the clock period before the ALU circuit itself could limit speeds.
FP32 to FP64 didn't halve the clocks, but it did require doubling operand bandwidth to sustain throughput. In the case of P100 we definitely aren't seeing the 2GHz clocks of the consumer parts, so FP64 likely became the bottleneck. Only the ALUs should have that serial dependency; everything else is a per-bit operation, so most of the timing is fixed regardless of operand size.
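To illustrate the operand-bandwidth point with generic FMA operand counts (nothing chip-specific here, just arithmetic):
Code:
#include <cstdio>

// Generic arithmetic: an FMA reads 3 operands and writes 1 result per op, so
// sustaining the same op rate at FP64 moves twice the register bits per cycle
// that FP32 does. Illustration only, not a statement about any specific design.
int main() {
    const int widths[] = {32, 64};
    for (int bits : widths) {
        int bits_per_fma = (3 + 1) * bits;  // 3 source reads + 1 result write
        printf("FP%d FMA: %d register bits moved per operation\n", bits, bits_per_fma);
    }
    return 0;
}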
 
Could the layers really be that insulating versus a GPU at full tilt?
"Technical" reasons for not having 8-Hi HBM1. I'd imagine AMD would have liked those for Fiji. Not to mention Nvidia wanting the same for P100. I doubt cost was that harmful to margins. Only other possibility was poor yields from the assembly process. I always assumed the thermal issues were the reason for Fiji's AIO as temps were rather low for a typical GPU.
 
In this case I'd have inserted a few gates to break the circuit, and yes, for FP it would be a little more involved. That said, my understanding was the packed math only accounts for relatively simple math operations. It could very well be its own parallel logic, but they also could have reused portions to conserve space.

The bounds are always going to be the 32 bits in and out in x amount of time. I'm not proposing any changes beyond extending time slightly if it proved necessary.
The proposal was to force the timings fast enough to cause half of the logic to fail, without specifying how this would be determined or why this failure mode should occur in a consistent pattern. It's not clear why this method should be linked to doubling the clock rate. If the pipeline is balanced and tight enough, there may not be that much slack, even before adding intra-stage state monitoring, one or two arbitrary crossbars, clock-doubling hardware, and gates reacting at sub-cycle granularity.
Is there a step or design feature that was omitted that makes the analog behavior of these circuits reach indeterminate states in a deterministic manner? The total solution should be simple and not inject more delay than it saves in both the normal and the doubled case.

I'm not sure it would even be that complex. It could be a matter of doubling the clockspeed and planning on the circuit stabilizing in a single clock for faster double rate math. Mask off portions to likely save a little power on indeterminate states and comply with standards.
It's not a question of it being that complex; rather, it has to be simple enough to create the asserted linear relationship between execution cycle time and the width of the data unit.
This is inserting a doubled clock or clock-doubling circuitry. Is this applying the faster clock to the whole unit and register pipeline and then selectively down-clocking one half-speed stage, or is the whole pipeline at one clock while one stage has a doubled clock plus the buffering necessary to move signals into and out of that domain?

FP32 to FP64 didn't halve the clocks, but it did require doubling operand bandwidth to sustain throughput.
I don't see the relevance this has to whether half the ALU logic can settle within a given clock period.
 
During further testing we quite accidentally found something Vega really excels at: +72% compared to Fury X, which itself is faster than a Titan X - so maybe not a good indicator of performance against Nvidia, but of the performance jump from prior Radeon generations when everything falls into place.
A pure-compute 4k entry from a demoscene competition, called 2nd Stage Boss:
http://www.pcgameshardware.de/Vega-...ase-AMD-Radeon-Frontier-Edition-1232684/3/#a5

From the demo's readme:
Description:
2nd_stage_boss.1280x720.20160221.exe (party version)
- file size 4088 bytes
shader for the graphics = 2486 bytes
compute shader for the sound synthesizer = 841 bytes
other = 761 bytes
 
During further testing we quite accidentally found something Vega really excels at: +72% compared to Fury X, which itself is faster than a Titan X - so maybe not a good indicator of performance against Nvidia, but of the performance jump from prior Radeon generations when everything falls into place.
A pure-compute 4k entry from a demoscene competition, called 2nd Stage Boss:
http://www.pcgameshardware.de/Vega-...ase-AMD-Radeon-Frontier-Edition-1232684/3/#a5

From the demo's readme:
Description:

Interesting thanks :), can you run it @ Fury clocks to see how much is clocks and how much is arch changes?
 
During further testing we quite accidentally found something Vega really excels at: +72% compared to Fury X, which itself is faster than a Titan X - so maybe not a good indicator of performance against Nvidia, but of the performance jump from prior Radeon generations when everything falls into place.
A pure-compute 4k entry from a demoscene competition, called 2nd Stage Boss:
http://www.pcgameshardware.de/Vega-...ase-AMD-Radeon-Frontier-Edition-1232684/3/#a5

From the demo's readme:
Description:
4K demos like this tend to be pure ALU (sphere tracing of analytical scene). No memory accesses at all (HBM2 and memory controllers are idling). Also ROPs, TMUs and geometry units are idling. It's entirely possible that Vega reaches max clocks in code like this. That gets us to 52% increase. The shader is pretty big (whole demo in one shader), so it should benefit from the instruction prefetch introduced in GCN4 (Polaris). Tiled rasterizer and improved DCC aren't used at all because this demo is compute shader based.
 
This sounds as if you had a background in that kind of programming, too. :) Much appreciated insights. And yes, it runs at max clocks. As does Fury X. Compared to Polaris, Fury X scores 21-23% higher while having 31% higher TFLOPS throughput, so maybe that's the prefetch effect showing.

TFLOPS per Fps (less is better)
Code:
Fury X           221,7
Polaris 20       209,1
Vega 10          192,4
Hawaii XT        223,6
Titan X (Pascal) 355,4
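Putting the numbers from the posts above together: +72% total over Fury X, with roughly 52% of that explained by clocks, leaves about a 13% per-clock gain (1.72 / 1.52 ≈ 1.13). A quick sketch of that split, assuming the commonly cited 1600 MHz Vega FE boost and 1050 MHz Fury X clocks:
Code:
#include <cstdio>

// Splitting the +72% Vega FE vs Fury X demo result into clock-speed and
// per-clock components. The clock figures are assumptions (commonly cited
// boost clocks); the +72% is the demo result quoted above.
int main() {
    const double total_gain = 1.72;     // +72% in the 2nd Stage Boss demo
    const double vega_mhz   = 1600.0;   // assumed Vega FE max boost clock
    const double fury_mhz   = 1050.0;   // Fury X clock
    double clock_gain     = vega_mhz / fury_mhz;     // ~1.52
    double per_clock_gain = total_gain / clock_gain; // ~1.13
    printf("clock: +%.0f%%, per-clock: +%.0f%%\n",
           (clock_gain - 1.0) * 100.0, (per_clock_gain - 1.0) * 100.0);
    return 0;
}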
 
During further testing we quite accidentally found something Vega really excels at: +72% compared to Fury X, which itself is faster than a Titan X - so maybe not a good indicator of performance against Nvidia, but of the performance jump from prior Radeon generations when everything falls into place.
A pure-compute 4k entry from a demoscene competition, called 2nd Stage Boss:
http://www.pcgameshardware.de/Vega-...ase-AMD-Radeon-Frontier-Edition-1232684/3/#a5
Could you run this shader-heavy demo on Vega: http://www.geeks3d.com/20151231/impressive-pixel-shader-of-a-snail-glsl/

The setup is a bit complicated, just follow the instructions.
 
Complicated is not good atm, we have our mag's deadline to hit. :) But I'll try.

edit:
Ah, not too complicated.

Titan X (Pascal) 228-234 Fps with 1x AA, 12 Fps with 4x AA at default 800x480 window
Did not run on Vega FE. GeexLab just closed down.
 
If you have to promote the price advantage of the full system incl. the monitor (and most of it comes from the monitor), your performance cannot look too awesome.

That's what I've been wondering for the past week or two following this thread. If Vega were in any way competitive, either on pure speed or on price/performance, surely AMD would have made some noise about it by now? Power usage while gaming is not really that important IMO, but given how coy AMD is being about everything, it seems to me Vega is slower, hotter and not really (if at all) cheaper than the competition.

Hopefully that isn't the case. Apart from a GF4MX and a 560 Ti I've always had AMD cards, and I'm perfectly happy with my R9 290, but I'm not sure if I'll buy a power-hungry card again. Not because it's power-hungry, but because of the noise. Even the Gigabyte R9 I have can get pretty loud in summer.
 