Why do GCN-architecture chips tend to use more power than comparable Pascal/Maxwell chips?

The manufacturing process is likely a factor. That should affect the voltages they have to pump through their GPUs, right? Many folks have had success in reducing power consumption to near Pascal levels by undervolting. I assume they use higher voltages to preserve yields rather than being aggressive with voltage for higher efficiency.

That's probably one of the clearer causes. Though last gen they were both on TSMC, and a 980 Ti did use around the same power as a Fury.
 

Yeah, power demand was pretty comparable to a reference 980 Ti, but a reference Fury X generally did not match the reference 980 Ti's performance; not sure if that came down to throttling (even the reference 980 Ti throttled at default settings) or just the limits of the performance envelope AMD chose for Fiji relative to power efficiency.
Vega has managed some efficiency gains, but not by scaling both clocks and the architecture the way Pascal and then Volta have.
You can reduce voltage/power even on Nvidia cards if you really want to play around with efficiency, which is why there is a Pascal Tesla P4 that is in essence GP104 running at 50W or 75W with 5.5 FP32 TFLOPS, and that is pretty incredible performance/efficiency.
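Rough numbers on that P4 figure, as a sanity check (a back-of-envelope sketch; the 2560-lane count for a full GP104 and the FMA-based FLOP counting are my assumptions):

```python
# Back-of-envelope: what sustained clock does GP104 need for 5.5 FP32 TFLOPS?
# Assumes 2560 FP32 lanes (full GP104) and 2 FLOPs per lane per clock (FMA).
lanes = 2560
tflops = 5.5
clock_ghz = tflops * 1e12 / (2 * lanes) / 1e9
print(f"~{clock_ghz:.2f} GHz sustained for {tflops} TFLOPS")  # ~1.07 GHz

# Perf/W at the two board-power points quoted above:
for watts in (50, 75):
    print(f"{tflops * 1000 / watts:.0f} GFLOPS/W at {watts} W")  # 110 and 73
```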
 
The same basic concept as Fury Nano: wide and "slow".
Btw, where's Vega Nano?
 
It's not really that big of a difference, right? Big Vega can hit its rated frequency for a theoretical 12.5 TF at around 300W. Big Pascal can do the same for around 250W. Seems like Nvidia has a 10-15% advantage, possibly due to process or maybe just better optimizations. The fact that GCN apparently has much worse resource efficiency probably makes the disparity look much bigger than it actually is.
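Quick arithmetic on that, using ballpark retail specs as assumptions (roughly 12.7 TF / 295 W board power for a Vega 64 and 12.1 TF / 250 W for a Titan Xp):

```python
# Rough perf-per-watt comparison; the TFLOPS and board-power figures are
# ballpark retail specs (assumptions), not measured numbers.
cards = {
    "Vega 64 (assumed)":  (12.7, 295),   # TFLOPS, watts
    "Titan Xp (assumed)": (12.1, 250),
}
eff = {name: tf * 1000 / w for name, (tf, w) in cards.items()}  # GFLOPS/W
for name, e in eff.items():
    print(f"{name}: {e:.1f} GFLOPS/W")
advantage = eff["Titan Xp (assumed)"] / eff["Vega 64 (assumed)"] - 1
print(f"Paper advantage for Nvidia: ~{advantage:.0%}")  # roughly in the 10-15% band
```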
 
You cannot compare their TDP/TBP figures in the same way (look at TomsHardware/PCPer, who use scopes to see the difference between the two manufacturers), nor how they rate them against TFLOPs; Nvidia usually never quotes boost performance unless they state it, such as with the Tesla P4 at 75W, so actual performance on their other cards is higher than officially stated.
Remember, though, that AMD back then did have an edge in compute performance (even allowing for the above), but this does not translate into games, nor necessarily into certain applications, nor into sustained peak performance relative to a comparable 980 Ti.
That said, for AMD, Fiji was a step in the right direction and did seem to stay within its stated TDP/TBP spec.
But unfortunately Vega is not enough of a step relative to Pascal, and Volta after it, in terms of scalable efficiency: clocks plus architecture/cores, and the jump in compute performance this gives Nvidia, catching up with AMD once you allow for the normal GPU performance envelope and thermal/power dynamic boost.
 
Jensen mentioned Nvidia's focus on clockspeeds with Pascal, and AMD seem to have followed suit with Vega but are still woefully short.
I believe he stated that they were somewhat surprised how effective their optimizations were, which can happen. Sometimes a good prediction pays higher dividends than expected.

Not really, more like 'why is that Rottweiler eating so much if it can't run as fast'.
I'm more out of my depth with canine physiology than I am with car analogies, but from the following it seems like Greyhound breeding has made them very optimized for what they do.
https://en.wikipedia.org/wiki/Greyhound

"The key to the speed of a Greyhound can be found in its light but muscular build, large heart, highest percentage of fast-twitch muscle of any breed, double suspension gallop, and extreme flexibility of its spine. "Double suspension rotary gallop" describes the fastest running gait of the Greyhound in which all four feet are free from the ground in two phases, contracted and extended, during each full stride."
Additionally, there are swaths of body tissue that are minimized, such as body fat, undercoat, red blood cell count, and liver capacity. Racing emphasizes their prey chasing instincts, apparently to the point that it can seriously endanger them in environments with car traffic.

From https://en.wikipedia.org/wiki/Rottweiler, we see gait, physiology, and temperament that will not motivate them to the extremes of a Greyhound or will physically work against them at the operating points of forcing a canine body to 70 km/h.
They would lose out in acceleration and energy loss per stride, and devote resources to maintaining bulk and strength for purposes other than running. A thin-skinned 70 km/h lightweight with lesser clotting capacity and a tendency to bolt doesn't provide utility for herding or holding its own in a drag-out fight.

To make the analogy fit GPU design more closely, it would probably be more like AMD and Nvidia had the chance to design a dog, and they had to guess in more general terms what sorts of muscle fibers they'd have on hand or how far they could push the spine and legs, then compare it to how many races versus fights the dog would be in.
Then there's a question of how often they could iterate on their guesses, and what unexpected stumbling blocks each would hit and the timing of them.

As for it using more die space, I think Nvidia had to go that route as well; they didn't mention transistor figures, however.
Circuits face a trade-off between transistor performance and wire delay, with the latter becoming significantly more limiting with every node. There are a number of choices that increase the capability of the transistor portion to drive signals down wires by sacrificing area and power, such as repeaters or extra pipeline stages with their corresponding latches. Since wire delay is quadratic with the distance traveled, cutting a path in half can significantly boost signal propagation at the cost of the switching delay of the transistors involved--which have significantly outpaced wire delay.
If the overall mix of wire versus transistor is mostly locked in, then driving voltages up will provide more drive strength to the transistors--to a point. Eventually the transistors' ability to carry current saturates, or the limits of their connections to the wires are reached, and the roughly quadratic power cost of higher voltage takes over.
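To put a toy number on the quadratic point (a sketch with made-up unit constants, not a real process model):

```python
# Toy RC wire-delay model: an unrepeated wire's delay grows with length^2,
# so splitting it into N repeated segments trades wire delay for gate delay.
# r, c and t_rep are arbitrary illustrative constants, not process data.
r, c = 1.0, 1.0          # resistance / capacitance per unit length
t_rep = 5.0              # switching delay added by each repeater

def wire_delay(length, segments=1):
    seg = length / segments
    # Elmore-style estimate: each segment ~ 0.5 * r * c * seg^2, plus repeaters.
    return segments * 0.5 * r * c * seg**2 + (segments - 1) * t_rep

L = 100.0
for n in (1, 2, 4):
    print(f"{n} segment(s): delay ~ {wire_delay(L, n):.0f}")
# 1 segment: 5000, 2: 2505, 4: 1265 -> big win, paid for in area and power
```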

In the nodes since 28nm, the foundries have become more conservative with wire scaling than with transistors, and the emphasis on density can lead to worse wire delay because thin wires have higher resistance. GF's 7nm presentation shows that its upcoming FinFETs have significantly more refined transistors, with better materials and taller fins. The high-performance variant of the process explicitly goes for more metal layers and less dense cells and wires, versus the single density-optimized option with relatively stubby fins available on GF's 14nm.

Why is AMD behind in clockspeed? Is it the architecture, as in the shader/TMU/ROP/scheduler layout, or the transistors themselves?
It's more of a holistic question at this point. Architectures are designed with certain predictions about their workloads and the physical and electrical realities they will face at the time of manufacturing.
Decisions such as the overall balance of per-cycle work and working transistors versus on-die connectivity will be decided long before the chips are taped out and need to face the realities of their manufacturing.

I think it's impossible for us to tell.
There is something of a hint in the fact that AMD started talking about wire delay with Vega, when the concept didn't come up before.
 
Reading this, it occurred to me that in every die shot we've seen so far, AMD's individual CUs seem to be rather lengthy "stripes" while Nvidia's SMs are more compact and shaped like a fat L. While the first lends itself to denser packing in rows and columns, the latter seems able to have shorter intra-SM wiring, albeit with chip designers having a harder time filling the gaps.
 
My mental model is that GCN 4-cycle cadence works like this: fetch reg A, fetch reg B, fetch reg C, execute.
I've described the current operation of GCN here:

https://forum.beyond3d.com/threads/...ors-and-discussion.59649/page-13#post-1977187

The register file is really 256x 2048-bit registers [...]

As I understand it, the current design uses three cycles to read operands (64 operands from address A, then address B and finally address C in FMA A * B + C) and a fourth to write resultants.

The 64 operand slots need to be "swizzled in time" in order to feed them into the SIMD: A, B and C 0-15, then A, B and C 16-31 etc. So GCN already has an "operand collector" to support this (not exactly a FIFO, but the lanes read from it as though it were a per-lane FIFO per operand).

[...]

In GCN as it currently exists there has to be some kind of resultant collector, so that a single write cycle can send all 64 resultants to the register file. The resultants arrive in this collector over four cycles, but need to be written coherently as a single operation in a single cycle to a single address (with masking for resultants that should not be written - GCN already has masking for resultant writes, completely distinct from predication).
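A toy model of that cadence, purely as a restatement of the description above (the structure is my own sketch, not AMD documentation):

```python
# Toy model of the 4-cycle GCN cadence for one FMA wavefront (64 work-items)
# on a 16-lane SIMD: three register-read cycles (A, B, C), then a write cycle,
# while execution walks the wave in four groups of 16 lanes.
A = list(range(64))                       # stand-in operand vectors
B = [2.0] * 64
C = [1.0] * 64

def fma_wave(a, b, c):
    collector = {}                        # the "operand collector", filled over 3 cycles
    for name, vec in (("A", a), ("B", b), ("C", c)):
        collector[name] = list(vec)       # cycles 0-2: one 64-wide (2048-bit) read each
    results = []
    for quarter in range(4):              # lanes 0-15, 16-31, 32-47, 48-63
        lo = quarter * 16
        results += [collector["A"][i] * collector["B"][i] + collector["C"][i]
                    for i in range(lo, lo + 16)]
    return results                        # cycle 3: all 64 resultants written as one op

print(fma_wave(A, B, C)[:4])              # [1.0, 3.0, 5.0, 7.0]
```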
 
Other elements of Volta's architecture show efforts to keep things more local. There's an L0 instruction cache near the SIMDs, rather than an L1 shared among 3 CUs as in Vega. AMD's choice to limit Vega's sharing to 3 CUs was discussed in the context of reducing wire delay, and if we take Nvidia's diagram as being roughly true to the hardware, its instruction pipeline is closer. Also, if Vega's capping of the number of CUs sharing a front end was for wire delay, it wouldn't seem like going from 4 to 3 would change the picture that drastically.
Each CU has a buffer for instructions, L0 effectively. That buffer is shared amongst the kernels that are running in the CU.

So the buffer changes slowly. But, for example, there could be problems if latency-hiding has caused a single, immense kernel to have multiple wavefronts, each with a program counter that is far from all the other program counters. So there could be contention for buffer pages if each of 16 hardware threads is in a distinct part of the kernel (10 compute threads or 16 graphics threads are the limits per CU).
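A crude way to picture that contention (the line size and the example program counters below are purely illustrative):

```python
# Crude illustration: how many distinct 64-byte I$ lines do a CU's wavefronts
# want at once? Clustered program counters share fetches; scattered ones don't.
def distinct_lines(pcs, line_bytes=64):
    return len({pc // line_bytes for pc in pcs})

clustered = [0x1000 + 8 * i for i in range(16)]      # waves marching through one loop
scattered = [0x1000 + 0x400 * i for i in range(16)]  # waves far apart in a huge kernel

print(distinct_lines(clustered))   # 2  -> fetches are amortised across waves
print(distinct_lines(scattered))   # 16 -> every wave wants its own line
```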
 
The manufacturing process is likely a factor. That should affect the voltages they have to pump through their GPUs, right?
There may be some merit in this.

Many folks have had success in reducing power consumption to near Pascal levels by undervolting. I assume they use higher voltages to preserve yields rather than being aggressive with voltage for higher efficiency.
But AMD has always "over-volted" its chips. This is so that the chip will still run in 5 years' time. As time goes by the "under-volt" that people use will fail and they will have to increase voltages to keep the chip running.

Of course one has to run a chip for years to notice this.
 
The GCN arch is oldish.

R300 family - Aug 2002 to Oct 2005. That's over 3 years.

R500 family - Oct 2005 to May 2007. That's over 1.5 years.

TeraScale - May 2007 to Dec 2011. That's roughly 4.5 years.

GCN is already 6 years old and its EOL seems to come sometime in 2019. So GCN's lifespan will be nearly 8 years by then. Given there have been only minor architectural improvements, it's showing its age.
 
The voltage difference between GloFo and TSMC is not that large and would not be a notable contributor when looking at each GPU in the context of this thread.
In fact the default voltage was higher on Fiji than it is on Vega, even comparing a custom Fury X to a custom Vega in reviews that analyse voltage/power, while for Nvidia the difference between the 1050 Ti (Samsung) and the 1060 (TSMC) was around 5%, 1.09V vs 1.04V in-game, and still within a performance envelope that would be moderately optimal.

The custom Sapphire Fury X ran at around 1.2V while the custom Sapphire Vega sits at 1.05V when pushed in gaming at default GPU settings; this is Fury X compared to Vega 64.
Other reviews analysing the reference Fury X and its voltages still had it at 1.2V under 3D loads, while Vega in whichever form is quite a lot lower unless pushing an OC.
Also worth noting the 980 Ti also went to 1.2V under 3D loads with normal boost behaviour when not throttling.

It could be argued that the smaller node is more sensitive to higher voltages in terms of silicon wear, along with the impact of higher density.
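For scale, dynamic switching power scales roughly with the square of voltage, so those gaps are not small; a quick sketch using the figures above (leakage and any clock or capacitance differences ignored, so this isolates only the voltage term):

```python
# Dynamic switching power scales roughly with f * C * V^2; holding clock and
# capacitance fixed, the ratio below isolates just the voltage term.
def v_squared_ratio(v_a, v_b):
    return (v_a / v_b) ** 2

# ~1.05V Vega vs ~1.2V Fury X, per the reviews mentioned above
print(f"Vega vs Fury X: ~{v_squared_ratio(1.05, 1.20):.0%} of the switching power per clock")  # ~77%
# The ~5% gap between 1.04V and 1.09V is ~9% in switching power
print(f"1.04V vs 1.09V: ~{v_squared_ratio(1.04, 1.09):.0%}")  # ~91%
```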
 
Each CU has a buffer for instructions, L0 effectively. That buffer is shared amongst the kernels that are running in the CU.
I'm aware of a per-SIMD instruction buffer mentioned for GCN, and an instruction buffer documented in the ISA docs that seems to match. It drains rather quickly if the timing or NOP packing for cache invalidation or special mode operations are an indication. Per Vega's ISA, it's 16 DWORDs long, up from 12 in prior generations.
Nvidia's description for Volta indicated SIMD-local buffers existed prior to whatever it is calling an L0.

Perhaps the per-CU buffer you mention is part of the instruction prefetch functionality in recent GCN generations, or is there a reference to what AMD calls it?
 
I think you're correct: this is an instruction buffer localised to a SIMD and holding instructions for a single hardware thread, remembering that a CU can issue 5 instructions in parallel (e.g. scalar, vector, export, LDS, VMEM) and these are chosen from a distinct thread for each type of unit.

The original whitepaper:

https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

implies that the up-to-5 hardware threads selected for instruction issue all come from buffers on a single SIMD:

"The selected SIMD can decode and issue up to 5 instructions from the 10 wavefront buffers to the execution units."

This also implies that each hardware thread has its own instruction buffer within the SIMD. So there's 10 buffers, each of 16 DWords, in Vega. (This number, 10, is troubling, because in graphics 16 hardware threads per SIMD are supported, I believe.)

When a branch is taken, or when a different thread is swapped into the SIMD, the buffer for that hardware thread needs to be re-filled according to the new program counter, and the buffer will contain a varying count of instructions ("pre-fetch") depending upon the mix of DWord- and QWord-sized instructions.

So I$ read contention is reduced by making this buffer 33% larger. In Polaris the sharing of I$ was changed from 4 CUs down to 3. So both of these step changes have been about I$ contention.

One thing that seems weird to me is that the original whitepaper describes an instruction cache line as 64 bytes. Which is 16 DWords. Yet in original GCN the instruction buffers were 12 DWords? So a cache line fetch would always produce more instructions than could fit in the buffer. If anything, this is the clearest sign that the instruction buffer in GCN is not a cache, since there is no direct mapping between a 64B cache line and what's in the instruction buffer. Instead, it seems, there would be 1 or 2 fetches from I$ required to populate the 12 DWords of the instruction buffer. That's because contiguous fetches of 12 DWords cannot be aligned with 16 DWord cache lines.

Also, it's worth noting that the buffer may not contain a fully complete final instruction. The final instruction might be QWord sized, but only one DWord of that instruction can fit because the previous instructions used up 15 DWords (in Vega), 11 in Polaris or earlier.

So the burstiness of fetches from I$ into the instruction buffer is, in general, worse than envisioned.

But Vega might have made the instruction buffer cache aligned, since its size now matches a cache line. This would mean there'd be no burstiness (half the time, 2 fetches from I$ are required contiguously in Polaris and earlier). On the other hand, even if Vega's instruction buffer is cache aligned, this presents a problem: what if the first instruction after a JMP is QWord sized and split across two cache lines? There's no way to issue from an incomplete instruction. So I think that's enough to rule out the possibility that the instruction buffer is cache-aligned. Instruction fetch into the buffer always requires futzing to put the first new instruction at the beginning of the buffer, and a complete buffer fill will require fetching either one or two 64B lines from I$.
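A quick model of that burstiness (my own sketch of contiguous refills in straight-line code, not anything from AMD's docs):

```python
# Sketch: how many 64B (16-DWord) I$ lines does each contiguous buffer refill
# touch, for a 12-DWord buffer (Polaris and earlier) vs a 16-DWord one (Vega),
# assuming straight-line code and the first refill starting cache aligned?
LINE = 16  # DWords per I$ line

def lines_touched(start_dword, size):
    return (start_dword + size - 1) // LINE - start_dword // LINE + 1

for size in (12, 16):
    fetches = [lines_touched(refill * size, size) for refill in range(8)]
    print(f"{size}-DWord buffer: lines per refill = {fetches}")
# 12-DWord: [1, 2, 2, 1, 1, 2, 2, 1] -> two I$ fetches half the time
# 16-DWord: [1, 1, 1, 1, 1, 1, 1, 1] -> one fetch, but only if it stays aligned
```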

You might like to rummage in this slide-deck to see if you can find more clues:

https://32ipi028l5q82yhj72224m8j-wp...DC2017-Advanced-Shader-Programming-On-GCN.pdf
 
Two reasons: the GF node instead of TSMC's, and underutilization of theoretical FP power. The RX 480 has around the same power consumption for around the same FP power as a GTX 1070.
Radeon begs for a real new uarch.
 
This also implies that each hardware thread has its own instruction buffer within the SIMD. So there's 10 buffers, each of 16 DWords, in Vega. (This number, 10, is troubling, because in graphics 16 hardware threads per SIMD are supported, I believe.)
My recollection was that it was 40 wavefronts per CU for all types.

One thing that seems weird to me is that the original whitepaper describes an instruction cache line as 64 bytes. Which is 16 DWords. Yet in original GCN the instruction buffers were 12 DWords? So a cache line fetch would always produce more instructions than could fit in the buffer. If anything, this is the clearest sign that the instruction buffer in GCN is not a cache, since there is no direct mapping between a 64B cache line and what's in the instruction buffer. Instead, it seems, there would be 1 or 2 fetches from I$ required to populate the 12 DWords of the instruction buffer. That's because contiguous fetches of 12 DWords cannot be aligned with 16 DWord cache lines.
In other instances, instruction fetch has been internally subdivided into one or more windows per clock that are less than the width of a cache line, probably because it helps with bandwidth wasted on jumps and with the complexity of processing a full line at once. Bulldozer, for example, had a front end that fetched from two sub-line-length windows, with some nuances to how the windows were delivered per cycle(s).
It's possible that there's something similar going on with a smaller granularity, with less than 12 DWords fetched at a time, and possibly delivered in some kind of internal pipeline cadence with some cycles idle or reserved for other CUs.

12 DWords may be somewhat associated with peak issue, which is 5 instructions that can be issued to an execution stage (1-2 DWords each) plus one special instruction that can be executed in the instruction buffer, so 6 instructions at 1-2 DWords each. 12 DWords makes more sense in that light.
Upping things to 16 might mean AMD profiled an increase due to more features being used and new instructions like packed math often leveraging 64-bit encodings. Another maybe-related figure is that a front end servicing 4 clients tuned for 12 DWords has the same throughput as a front end servicing 3 clients with 16.
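Putting that throughput point in numbers (nothing more than the arithmetic implied above):

```python
# Peak issue per the GCN whitepaper figures above: up to 6 instructions
# (5 to execution units plus one handled in the buffer) at 1-2 DWords each.
print(6 * 2)           # 12 -> worst-case DWords consumed per issue
# Aggregate I$ front-end demand is unchanged going from 4 CU clients at
# 12 DWords each to 3 CU clients at 16 DWords each.
print(4 * 12, 3 * 16)  # 48 48
```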
 