Why do GCN Architecture chips tend to use more power than comparable Pascal/Maxwell Chips?

Genotypical

Newcomer
I figure some here would have good ideas on this. One reason I suspect is the difference in scheduling. I usually link people to this article

https://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/3

The end result is an interesting one, if only because by conventional standards it’s going in reverse. With GK104 NVIDIA is going back to static scheduling. Traditionally, processors have started with static scheduling and then moved to hardware scheduling as both software and hardware complexity has increased. Hardware instruction scheduling allows the processor to schedule instructions in the most efficient manner in real time as conditions permit, as opposed to strictly following the order of the code itself regardless of the code’s efficiency. This in turn improves the performance of the processor.

However based on their own internal research and simulations, in their search for efficiency NVIDIA found that hardware scheduling was consuming a fair bit of power and area for few benefits. In particular, since Kepler’s math pipeline has a fixed latency, hardware scheduling of the instruction inside of a warp was redundant since the compiler already knew the latency of each math instruction it issued. So NVIDIA has replaced Fermi’s complex scheduler with a far simpler scheduler that still uses scoreboarding and other methods for inter-warp scheduling, but moves the scheduling of instructions in a warp into NVIDIA’s compiler. In essence it’s a return to static scheduling.

Ultimately it remains to be seen just what the impact of this move will be. Hardware scheduling makes all the sense in the world for complex compute applications, which is a big reason why Fermi had hardware scheduling in the first place, and for that matter why AMD moved to hardware scheduling with GCN. At the same time however when it comes to graphics workloads even complex shader programs are simple relative to complex compute applications, so it’s not at all clear that this will have a significant impact on graphics performance, and indeed if it did have a significant impact on graphics performance we can’t imagine NVIDIA would go this way.

I assume later Nvidia architectures continued and improved on this. Things weren't as great with Kepler so I left it out of the title, but it started there. As long as just this one difference remains, it's very unlikely consumption numbers will match at the same or similar manufacturing node, right? Are there other factors?
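
To make the difference in that quote a bit more concrete, here's a minimal toy sketch in Python (my own illustration; the 4-cycle latency and the instruction format are assumptions, not anything from Nvidia's docs). With a fixed, compiler-known ALU latency the whole issue schedule can be worked out at compile time, while a scoreboard has to re-check register readiness every cycle at runtime. Both arrive at the same schedule for a simple dependent chain; the difference is where, and in what hardware, the tracking happens.

Code:
# Toy contrast between compile-time (static) scheduling and a runtime
# scoreboard. Purely illustrative: the 4-cycle latency and the instruction
# format are assumptions, not Kepler's or Fermi's real behaviour.

ALU_LATENCY = 4  # assumed fixed, compiler-known math-pipe latency

# A tiny dependent instruction stream: (dest, src_a, src_b)
program = [("r2", "r0", "r1"),   # r2 = r0 op r1
           ("r3", "r2", "r1"),   # depends on r2
           ("r4", "r0", "r1")]   # independent

def static_schedule(prog):
    # Compiler-style: with a fixed, known latency the issue cycle of every
    # instruction can be computed ahead of time. No tracking hardware needed.
    ready, cycle, schedule = {}, 0, []
    for dest, a, b in prog:
        cycle = max(cycle, ready.get(a, 0), ready.get(b, 0))
        schedule.append((cycle, dest))
        ready[dest] = cycle + ALU_LATENCY
        cycle += 1                       # one issue slot per cycle
    return schedule

def scoreboard_run(prog):
    # Hardware-style: a scoreboard re-checks source readiness every cycle.
    # The same schedule comes out for this chain; the tracking just happens
    # at runtime, in dedicated hardware, instead of in the compiler.
    busy_until, cycle, issued = {}, 0, []
    pending = list(prog)
    while pending:
        dest, a, b = pending[0]
        if busy_until.get(a, 0) <= cycle and busy_until.get(b, 0) <= cycle:
            issued.append((cycle, dest))
            busy_until[dest] = cycle + ALU_LATENCY
            pending.pop(0)
        cycle += 1
    return issued

print("static    :", static_schedule(program))   # [(0,'r2'), (4,'r3'), (5,'r4')]
print("scoreboard:", scoreboard_run(program))    # same issue cycles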
 
You have a chip with 3072 or 4096 FP32 FMA units, tons of texture units and ROPs, and large register files and caches.

They take up the vast majority of die size and have wires and transistors that are toggling all the time.

Yet the first thing that comes to mind about what makes the most difference in power consumption is ... scheduling.
 
One reason I suspect is the difference in scheduling.
From what I recall, those statements are mostly a misinterpretation, and besides, "scheduling" would not make up a very large portion of the GPU energy budget anyhow. Certainly not the 50-100W variations in power that can be seen in recent comparable AMD/NV GPU generations.

As a clueless layperson, I suspect it's largely down to differences in R&D budgets, and probably also to differences in the quality and experience of the engineers involved... :p
 
I figure some here would have good ideas on this. One reason I suspect is the difference in scheduling. I usually link people to this article

https://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/3



I assume later Nvidia architectures continued and improved on this. Things weren't as great with Kepler so I left it out of the title, but it started there. As long as just this one difference remains, it's very unlikely consumption numbers will match at the same or similar manufacturing node, right? Are there other factors?

That appears to have been a conflation of two very different levels of scheduling.
GCN's instruction scheduling is arguably even more simplistic than Nvidia's. ALU latency doesn't even need specific code annotation, since GCN has its fixed 4-cycle cadence, and the operation classes with variable latency are tracked with simple counters rather than Nvidia's reduced set of scoreboards.
Going just by that, I think the impression would have been that Nvidia would have consumed more power.
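
As a rough illustration of the "simple counters" point, here's a minimal sketch (only loosely modeled on GCN's wait-count mechanism; the class, the 256-entry figure and the stall numbers are invented for illustration): fixed-cadence ALU results need no tracking at all, variable-latency ops just bump a single counter that compiler-inserted waits drain, whereas a scoreboard has to carry and check per-register state.

Code:
# Toy model of the "simple counters" idea. Everything here is an illustrative
# assumption, not the real ISA or hardware: back-to-back ALU ops need no
# tracking (the fixed cadence guarantees results are ready in time), while
# variable-latency ops (memory loads here) just bump one counter that
# compiler-inserted waits drain before a dependent instruction issues.

class SimpleCounter:
    """One counter for all in-flight loads; no per-register state."""
    def __init__(self):
        self.in_flight = 0

    def issue_load(self):
        self.in_flight += 1

    def retire_load(self):
        self.in_flight -= 1

    def wait_until(self, max_in_flight, cycles_per_retire=100):
        # A compiler-inserted wait: stall until at most `max_in_flight` loads
        # remain outstanding. Returns the (made-up) stall cycles spent.
        stalls = 0
        while self.in_flight > max_in_flight:
            stalls += cycles_per_retire   # pretend one load retires per step
            self.retire_load()
        return stalls

cnt = SimpleCounter()
cnt.issue_load()              # load A (variable latency)
cnt.issue_load()              # load B
# ... ALU work that doesn't touch A or B: no tracking, no waits needed ...
print("stalls before dependent op:", cnt.wait_until(0))

# For contrast, a per-register scoreboard carries state for every register an
# in-flight op might write, checked every issue cycle:
scoreboard = {f"v{i}": True for i in range(256)}   # hundreds of entries to maintain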
 
The simplest answer is the huge disparity in R&D budgets, which allows Nvidia to spend a lot more man-hours on optimization as well as to hire and/or poach the brighter minds that come up, plus AMD being stuck with GlobalFoundries, which provides a cheaper-per-mm^2 but significantly less power-efficient process than TSMC.
 
That appears to have been a conflation of two very different levels of scheduling.
GCN's instruction scheduling is arguably even more simplistic than Nvidia's. ALU latency doesn't even need specific code annotation, since GCN has its fixed 4-cycle cadence, and the operation classes with variable latency are tracked with simple counters rather than Nvidia's reduced set of scoreboards.
Going just by that, I think the impression would have been that Nvidia would have consumed more power.
Not only that, but as you recall, Fermi had hardware scoreboarding involved in latency tracking. If memory serves right, that piece of hardware was specifically what Nvidia referred to in this context.

AMD, on the other hand, never had a comparably complex and power-hungry piece of hardware in the first place. Apart from Fermi being the first design to really hit the power wall hard, and thus people having no experience in how to deal with it, I think that's the main reason for the paragraphs AT wrote on this matter.

For today's situation though, I don't think that has much relevance any more. Here, the more important part is that Nvidia had to focus hard on power consumption in the first place after the Fermi experience and drew the right conclusions, while AMD was still in the comfortable position of having power budget left, and thus no, or at least less pressing, need to deal with this topic. For performance improvements, they could just keep throwing more ALUs (and TMUs) at a given problem until Hawaii. By that time, Nvidia had a lead in power-efficient design, and AMD's dwindling resources made it very difficult to keep up, let alone catch up.

IMHO, of course.
 
It also bears recalling that for a long time AMD had a consistent lead in manufacturing process AND that Nvidia was designing chips aimed at multiple applications much more so than AMD was at the time.

People bring up R600 in regard to Vega, but really it's AMD's Fermi 1.0: a large multi-functional design on an inferior process, with the power issues that come with that. Nvidia, by contrast, has marshaled enough resources to segment its designs along various markets and/or have fabs tailor to its needs. Hopefully, just like Fermi evolved beyond its initial form, so will future Vega products.
 
I think it'd be fascinating to do power analysis for microbenchmarks on specific extreme workloads. You can only get so far guessing these things...
 
It's the clockspeeds, stupid.

I think it's pretty obvious that the biggest reason for the difference is the clockspeed disparity. Running cards closer to their limits isn't new for AMD, but nvidia have opened up such a gap with Pascal that AMD fell way behind on performance to make up for it.
 
It's the clockspeeds, stupid.

I think it's pretty obvious that the biggest reason for the difference is the clockspeed disparity. Running cards closer to their limits isn't new for AMD, but nvidia have opened up such a gap with Pascal that AMD fell way behind on performance to make up for it.
But then the immediate follow-up is: why are AMD's clocks so low to start with?

I always thought it had to do with the math pipeline being 4 deep, but thanks to Volta that theory flew right out of the window.
 
It's the clockspeeds, stupid.

I think it's pretty obvious that the biggest reason for the difference is the clockspeed disparity. Running cards closer to their limits isn't new for AMD, but nvidia have opened up such a gap with Pascal that AMD fell way behind on performance to make up for it.

IMHO that's like saying that a rottweiler is slower than a greyhound because it can't run as fast.
Clockspeed and its operating comfort zone are just a consequence of the underlying electrical and architectural reality.
It's self-evident that AMD is behind in clock speed; the question is why.
Why does Vega burn small villages to reach 1.7GHz, while GP102 does it gracefully? What's in her DNA that causes such high consumption?

How many times have we heard: all AMD needs is higher clocks and they will be fine.
Well, Vega expands on clockspeed, and impressively so, but how... allegedly by using more die space (wth).
 
The biggest difference between nVidia and AMD isn't hardware, it's mentality. After Fermi, nVidia started to rethink their approach for the next hardware designs. There are talks from nVidia's Dally about power consumption from 7 years ago. AMD hasn't cared about it; there was always the fixation on transistors and die size. But after Fermi, power consumption became the biggest problem for nVidia to tackle.

People bring up R600 in regard to Vega, but really it's AMD's Fermi 1.0: a large multi-functional design on an inferior process, with the power issues that come with that. Nvidia, by contrast, has marshaled enough resources to segment its designs along various markets and/or have fabs tailor to its needs. Hopefully, just like Fermi evolved beyond its initial form, so will future Vega products.

Vega doesn't come close to Fermi. In retrospect, Fermi was a true next-gen architecture and a huge leap forward for nVidia. Vega, on the other hand, is just another GCN iteration lacking multiple features like FP64 (performance), Tensor Cores (DL), a better HBM memory controller, etc. What Vega offers isn't really better than what nVidia provides with their Tegra X1 chip...
 
Thing is, I wonder how much of Vega we will find in Navi. If it's a lot, then Vega was a necessary step. If it's not, then I will consider Vega an overclocked Fiji with more memory (since DSBR, primitive shaders, etc. seem either not enabled or doing squat for performance), and wonder WTF they were doing for years with that architecture. Even Polaris looks "nicer".

Back on topic, I thought I read somewhere that nVidia uses a lot of "by hand" design, whereas AMD/RTG automated a lot of that because of a lower R&D budget, and so it was less efficient. Could this be a reason?
 
I always thought it had to do with the math pipeline being 4 deep, but thanks to Volta that theory flew right out of the window.
My mental model is that GCN's 4-cycle cadence works like this: fetch reg A, fetch reg B, fetch reg C, execute. So only a single cycle to execute. Nvidia nowadays has an operand reuse cache. Couldn't they do: fetch reg A, fetch reg B, execute, execute, if we assume that at least one multiply-add operand comes from the operand reuse cache? Just some random thoughts, no concrete info. Also, Nvidia simplified their register files a lot for Maxwell (more bank conflicts, leaning more on the operand reuse cache). Here's some info: https://github.com/NervanaSystems/maxas/wiki/SGEMM. Register files are a big part of GPU power consumption. AMD's Vega presentations also said that the register files were optimized in cooperation with Ryzen engineers to allow higher clock rates.
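
To put a rough number on the register-file angle, here's a toy counter (my own assumptions about the chain and the reuse rule, not measured behaviour of any real GPU) for a short FMA chain, with and without per-slot operand reuse along the lines of the .reuse flags described in the maxas page linked above:

Code:
# Toy operand-read counter: register-file traffic for a short FMA chain with
# and without per-slot operand reuse. The chain, reuse rule and savings are
# assumptions for illustration, not measured Maxwell/Pascal/GCN behaviour.

# d = a*b + c, expressed as (dest, (src0, src1, src2))
fma_chain = [
    ("acc", ("x", "w0", "acc")),
    ("acc", ("x", "w1", "acc")),   # "x" and "acc" repeat from the previous op
    ("acc", ("x", "w2", "acc")),
    ("acc", ("x", "w3", "acc")),
]

def rf_reads_no_reuse(ops):
    # Every source operand is fetched from the register file.
    return sum(len(srcs) for _, srcs in ops)

def rf_reads_with_slot_reuse(ops):
    # If the same register sits in the same operand slot as in the previous
    # instruction, it is served from a small reuse latch instead of the
    # register file (roughly the idea behind the .reuse flags in the maxas
    # write-up linked above).
    prev, reads = (None, None, None), 0
    for _, srcs in ops:
        for slot, reg in enumerate(srcs):
            if reg != prev[slot]:
                reads += 1
        prev = srcs
    return reads

print("RF reads, everything from the register file:", rf_reads_no_reuse(fma_chain))         # 12
print("RF reads, with per-slot operand reuse       :", rf_reads_with_slot_reuse(fma_chain)) # 6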
 
My mental model is that GCN's 4-cycle cadence works like this: fetch reg A, fetch reg B, fetch reg C, execute. So only a single cycle to execute. Nvidia nowadays has an operand reuse cache. Couldn't they do: fetch reg A, fetch reg B, execute, execute?

That's how I imagine it. It costs as much to move the data from the reg file to the ALUs as it does to do the computation itself. GCN loads all operands from the reg file and always stores the result. I imagine Nvidia can elide a store + subsequent load if a result is consumed by a following operation.
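
A similar toy count for the elision idea, as a sketch only (made-up parameters, not how either vendor actually implements it): for a dependent chain where each result is consumed exactly once by the next op, forwarding it saves both the write-back and the subsequent re-read.

Code:
# Toy count of register-file accesses for a dependent chain, contrasting
# "always write the result back and read it again" with "forward the result
# straight to the consuming op". Parameters are made up; this only assumes
# each intermediate result is consumed exactly once by the next instruction.

CHAIN_LENGTH = 4
FRESH_OPERANDS_PER_OP = 2       # e.g. a and b of a*b + prev

def rf_accesses(always_writeback):
    reads = writes = 0
    for i in range(CHAIN_LENGTH):
        reads += FRESH_OPERANDS_PER_OP
        if i > 0 and always_writeback:
            reads += 1          # previous result re-read from the register file
        if always_writeback or i == CHAIN_LENGTH - 1:
            writes += 1         # with forwarding, only the final result lands
                                # in the register file
    return reads + writes

print("always write back + reload:", rf_accesses(True))    # 15 accesses
print("forward to the next op    :", rf_accesses(False))   #  9 accesses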

Cheers
 
My mental model is that GCN's 4-cycle cadence works like this: fetch reg A, fetch reg B, fetch reg C, execute. So only a single cycle to execute.
That's the right mental model, but it's the wrong implementation model, because it's just not possible to execute in 1 cycle.

That's why they fetch 64 values per cycle, but only launch 16 operations per cycle.

Here's my implementation model:
You want to calculate:
Z0 = A0 * B0 + C0
Z1 = A1 * B1 + Z0
Z2 = A2 * B2 + Z0
You have a register file that can read or write 64 values at a time. Only 1 port to save area.
The execution pipeline is 4 deep.
You have a result collector to store results Z[47:0] before writing the values back to the register file.
You have a bypass path from Execute D to Execute A.

Timeline:
Code:
RegFile   Execute A Execute B Execute C Execute D Output Collector
Fetch A0
Fetch B0
Fetch C0
Write Z-2 Z0[15:0]
Fetch A1  Z0[31:16] Z0[15:0]
Fetch B1  Z0[47:32] Z0[31:16] Z0[15:0]
 <IDLE>   Z0[63:48] Z0[47:32] Z0[31:16] Z0[15:0]
Write Z-1 Z1[15:0]  Z0[63:48] Z0[47:32] Z0[31:16] Z0[15:0]
Fetch A2  Z1[31:16]           Z0[63:48] Z0[47:32] Z0[31:0]
Fetch B2  Z1[47:32]                     Z0[63:48] Z0[47:0]
 <IDLE>   Z1[63:48]                               Z0[63:0]
Write Z0  Z2[15:0]

In this model, you have 2 bypass options: one directly from Execute D to Execute A (Z0 -> Z1), and another one from the output collector to Execute A (Z0 -> Z2).

(I should have done this a long time ago. Only after sweating the details now do I realize that you need 2 bypasses.)
 
But then the immediate follow-up is: why are AMD's clocks so low to start with?

I always thought it had to do with the math pipeline being 4 deep, but thanks to Volta that theory flew right out of the window.
Volta's dependent FMA latency is 4 cycles, but unlike GCN I'm not sure how broadly that carries over to other operations.
It wouldn't just be what the cycle count is, but what is done in that cycle, over what distance, and how much work was put into optimizing it.
One assumption is that GCN's 4-cycle cadence is the limiting path, which is possible but unverified. Some of AMD's discussion about Vega indicated they did make special effort to keep it at 4.

Other elements of Volta's architecture show efforts to keep things more local. There's an L0 instruction cache near the SIMDs, rather than the L1 that gets shared between 3 CUs in Vega. AMD's choice to limit Vega's sharing to 3 CUs was discussed in the context of reducing wire delay, and if we take Nvidia's diagram as being roughly true to the hardware, its instruction pipeline is physically closer. Also, if capping the number of CUs sharing a front end in Vega was done for wire delay, it wouldn't seem like going from 4 to 3 would change the picture that drastically.

Other items: both architectures have 16-wide SIMDs, but Nvidia has split off more instruction types into separate SIMD groups rather than having it all go through the one bucket AMD does. That would affect how much has to happen in a cycle in a relatively confined area.
Nvidia's warp size is half the width of a GCN wavefront, and there are various operations that would scale with that width occurring in a single cycle.
GCN's ISA is more freely documented, so how its complexity compares with Volta's instructions isn't clear, and that would speak to how much has to be done in a cycle. GCN has a fair number of shuffles and alternate data sources/routings over 64 items that get put into that cadence, which Nvidia might have relaxed the timing for.

Beyond that is the question of how much work is put into optimizing the implementation. Nvidia has made note of at least some level of partial customization in some products, and it's willing to more aggressively target efficiency measures and physical changes in its architecture with things like its register operand cache. It's also been able to leverage its less successful mobile tech to better manage the power consumption of its other products.

AMD generally hasn't displayed that level of interest in optimizing its GPUs as of late. It's true that Vega's register file was apparently optimized by the Zen team to notable effect, which seems to hint that the baseline level of optimization leaves a lot on the table. And given what the register file's custom work achieved, is the rest of the hardware closer to bog standard?
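
As a very rough back-of-envelope on the width point above (purely illustrative arithmetic from the widths mentioned in this thread, nothing more):

Code:
# Very rough back-of-envelope on the width point: operand bytes implied by a
# single FP32 FMA instruction at different SIMT widths. Illustrative only;
# real hardware banks and sequences these reads over several cycles, as the
# timeline post earlier in the thread shows.

BYTES_PER_FP32 = 4
SRC_OPERANDS = 3                 # d = a*b + c
DST_OPERANDS = 1

def operand_bytes(simt_width):
    return simt_width * (SRC_OPERANDS + DST_OPERANDS) * BYTES_PER_FP32

for name, width in [("GCN wavefront (64 lanes)", 64), ("Nvidia warp (32 lanes)", 32)]:
    print(f"{name}: {operand_bytes(width)} bytes of register traffic per FMA")
# 1024 vs 512 bytes; spread over the 4-cycle cadence of a 16-lane SIMD, the
# 64-wide case is still ~256 bytes of operand movement per SIMD per cycle.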
 
But then the immediate follow-up is: why are AMD's clocks so low to start with?

I always thought it had to do with the math pipeline being 4 deep, but thanks to Volta that theory flew right out of the window.

I think the other immediate follow-up is: why are Nvidia's clocks so high with Pascal?

If Pascal had clocked at stock what Maxwell could do overclocked, 1.45-1.5GHz, AMD would have no trouble with its performance, but Pascal shot up to 1.8GHz at stock and could do 2GHz on about 100% of chips. Jensen mentioned Nvidia's focus on clockspeeds with Pascal, and AMD seem to have followed suit with Vega but are still woefully short.


IMHO that's like saying that a rottweiler is slower than a greyhound because it can't run as fast.
Clockspeed and its operating comfort zone are just a consequence of the underlying electrical and architectural reality.
It's self-evident that AMD is behind in clock speed; the question is why.
Why does Vega burn small villages to reach 1.7GHz, while GP102 does it gracefully? What's in her DNA that causes such high consumption?

How many times have we heard: all AMD needs is higher clocks and they will be fine.
Well, Vega expands on clockspeed, and impressively so, but how... allegedly by using more die space (wth).

Not really, more like 'why is that Rottweiler eating so much if it can't run as fast'.

While it might be self-evident that AMD is behind in clock speeds, it isn't self-evident why they consume so much power, which is why the OP thinks of scheduling first. And it isn't self-evident why clockspeeds would matter, since the chips are so different. But they do.

If AMD could keep up with Nvidia at the same clocks, then of course one course of action would be to increase clockspeeds, and they'd be fine. If AMD finds architectural changes that are more important and might allow them to compete at lower clockspeeds, they'd go for that. They seem to have gone for both with Vega, but while the former has somewhat panned out, the latter is still MIA.

As for it using more die space, I think Nvidia had to go that route as well; they didn't mention transistor figures, however.

Why is AMD behind in clockspeed? Is it the architecture, as in the shader/TMU/ROP/scheduler layout, or the transistors themselves? I think it's impossible for us to tell.
 