Why do GCN Architecture chips tend to use more power than comparable Pascal/Maxwell Chips?

Discussion in 'Architecture and Products' started by Genotypical, Dec 21, 2017.

  1. Genotypical

    Newcomer

    Joined:
    Sep 25, 2015
    Messages:
    38
    Likes Received:
    11
    I figure some here would have good ideas on this. One reason I suspect is the difference in scheduling. I usually link people to this article:

    https://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/3

    I assume later Nvidia architectures continued and improved on this. Things weren't as great with Kepler, so I left it out of the title, but it started there. As long as just this one difference remains, it's very unlikely the consumption numbers will match at the same or a similar manufacturing node, right? Are there other factors?
     
  2. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    You have a chip with 3072 or 4096 FP32 FMAs, tons of texture units and ROPs, and large register files and caches.

    Those take up the vast majority of the die area and have wires and transistors that are toggling all the time.

    Yet the first thing that comes to mind about what makes the most difference in power consumption is ... scheduling.
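
    To put rough numbers on that, here's a back-of-envelope sketch. The per-op energies are assumed ballpark figures in the spirit of published energy-per-operation tables (Horowitz's ISSCC survey, Dally's talks), not measurements of any particular chip.

    Code:
    # Back-of-envelope: where the power goes in a 4096-FMA GPU at 1.5 GHz.
    # Per-op energies are assumed ballpark figures, not measurements.
    FMAS, CLK = 4096, 1.5e9
    PJ_FMA = 1.5   # one FP32 fused multiply-add (assumed)
    PJ_RF  = 1.5   # one 32-bit register-file access (assumed)

    alu_w = FMAS * CLK * PJ_FMA * 1e-12
    rf_w  = FMAS * CLK * 4 * PJ_RF * 1e-12  # 3 operand reads + 1 write per FMA

    print(f"ALUs alone:       ~{alu_w:.0f} W")
    print(f"Register traffic: ~{rf_w:.0f} W")
    # ~9 W + ~37 W before touching TMUs, caches, DRAM or clock trees;
    # a scheduler would have to be enormous to rival the datapath.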
     
    orangpelupa and Bob like this.
  3. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,171
    Location:
    La-la land
    From what I recall, those statements are mostly a misinterpretation, and besides, "scheduling" would not make up a very large portion of the GPU energy budget anyhow. Certainly not the 50-100W variations in power seen between recent comparable AMD/NV GPU generations.

    As a clueless layperson, I suspect it's largely down to differences in R&D budgets, and probably also to differences in the quality and experience of the engineers involved... :p
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,119
    Likes Received:
    2,864
    Location:
    Well within 3d
    That appears to have been a conflation of two very different levels of scheduling.
    GCN's instruction scheduling is arguably even simpler than Nvidia's. ALU latency doesn't even need specific code annotation, since GCN hides it outright with its 4-cycle cadence, and the operation classes with variable latency are tracked with simple counters rather than Nvidia's reduced set of scoreboards.
    Going just by that, the impression would have been that Nvidia would consume more power.
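
    To illustrate what "simple counters" means, here's a rough Python sketch of that style of dependency tracking: issuing a variable-latency op bumps a per-class counter, ops within a class complete in order, and a wait just compares the counter against a threshold. The class names mirror GCN's vmcnt/lgkmcnt; the mechanics are simplified for illustration, not a faithful model of the hardware.

    Code:
    class CounterTracker:
        """Counter-per-class dependency tracking, loosely GCN-flavored:
        no per-register scoreboard entries, just in-order counters."""

        def __init__(self):
            # Names mirror GCN's vmcnt (vector memory) and lgkmcnt
            # (LDS/scalar memory); mechanics simplified for illustration.
            self.pending = {"vmcnt": 0, "lgkmcnt": 0}

        def issue(self, op_class):
            self.pending[op_class] += 1   # one more op in flight

        def complete_oldest(self, op_class):
            self.pending[op_class] -= 1   # in-order completion: decrement

        def can_proceed(self, **limits):
            # Like s_waitcnt: wait until at most N ops are outstanding.
            return all(self.pending[c] <= n for c, n in limits.items())

    t = CounterTracker()
    t.issue("vmcnt")                   # vector load in flight
    print(t.can_proceed(vmcnt=0))      # False: result not back yet
    t.complete_oldest("vmcnt")
    print(t.can_proceed(vmcnt=0))      # True: safe to consume the value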
     
  5. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,780
    Likes Received:
    4,431
    The simplest answer is the huge disparity in R&D budgets, which allows Nvidia to spend a lot more man-hours on optimization and to hire and/or poach the brightest minds coming up, plus AMD being stuck with GlobalFoundries, whose process is cheaper per mm^2 but significantly less power-efficient than TSMC's.
     
  6. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,796
    Likes Received:
    2,054
    Location:
    Germany
    Not only that, but as you recall, in Fermi there was hardware scoreboarding involved in latency tracking. If memory serves right, it was specifically that piece of hardware Nvidia was referring to in this context.

    AMD, on the other hand, never had a comparably complex and power-hungry piece of hardware in the first place. That, plus Fermi being the first time the power wall was really hit hard (so nobody had experience dealing with it), is, I think, the main reason for the paragraphs AT wrote on this matter.

    For today's situation, though, I don't think that has much relevance any more. The more important part is that Nvidia, after the Fermi experience, had to focus hard on power consumption and drew the right conclusions, while AMD was still in the comfortable position of having power budget left, and thus no, or at least less, pressing need to deal with the topic. For performance improvements, they could just keep throwing more ALUs (and TMUs) at a given problem, right up until Hawaii. By that time Nvidia had a lead in power-efficient design, and AMD's dwindling resources made it very difficult to keep up, let alone catch up.

    IMHO, of course.
     
  7. Geeforcer

    Geeforcer Harmlessly Evil
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,297
    Likes Received:
    464
    It also bears recalling that for a long time AMD had a consistent lead in manufacturing process, AND that Nvidia was designing chips aimed at multiple applications much more so than AMD at the time.

    People bring up R600 in regard to Vega, but really it's AMD's Fermi 1.0: a large multi-functional design on an inferior process, with the power issues that come with that. Nvidia, meanwhile, has marshaled enough resources to segment its designs along various markets and/or have fabs tailor processes to its needs. Hopefully, just like Fermi evolved beyond its initial form, so will future Vega products.
     
    no-X likes this.
  8. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    I think it'd be fascinating to do power analysis for microbenchmarks on specific extreme workloads. You can only get so far guessing these things...
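
    In that spirit, here's a rough Python sketch of how one might bracket a microbenchmark with power sampling on an Nvidia card. The workload binary named here is a hypothetical placeholder, and polling nvidia-smi is coarse (driver-side averaging applies), so treat it as a starting point for comparing extreme workloads against each other, not a measurement methodology.

    Code:
    # Rough sketch: sample board power around a microbenchmark by
    # polling nvidia-smi's power.draw readout on a background thread.
    import subprocess, threading, time

    def sample_power(samples, stop, interval=0.1):
        while not stop.is_set():
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=power.draw",
                 "--format=csv,noheader,nounits"],
                capture_output=True, text=True)
            samples.append(float(out.stdout.strip().splitlines()[0]))
            time.sleep(interval)

    samples, stop = [], threading.Event()
    t = threading.Thread(target=sample_power, args=(samples, stop))
    t.start()
    subprocess.run(["./fma_microbench"])  # hypothetical extreme workload
    stop.set(); t.join()
    print(f"{len(samples)} samples, avg {sum(samples)/len(samples):.1f} W,"
          f" peak {max(samples):.1f} W")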
     
    3dcgi and Lightman like this.
  9. gamervivek

    Regular Newcomer

    Joined:
    Sep 13, 2008
    Messages:
    715
    Likes Received:
    220
    Location:
    india
    It's the clockspeeds, stupid.

    I think it's pretty obvious that the biggest reason for the difference is the clockspeed disparity. Running cards close to their limits isn't new for AMD, but Nvidia opened up such a gap with Pascal that AMD fell too far behind on performance to make it up.
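
    Rough arithmetic on why chasing clocks is so expensive: dynamic power scales roughly with f*V^2, and higher frequency demands higher voltage, so the last few hundred MHz cost disproportionately. The voltage/frequency points below are assumed for illustration, not measured Pascal or Vega values.

    Code:
    # Dynamic power scales roughly as P ~ C * f * V^2; C cancels in a ratio.
    # The (f, V) points are assumed for illustration, not measured values.
    def rel_power(f_ghz, volts):
        return f_ghz * volts * volts

    sweet_spot = rel_power(1.2, 0.90)  # chip near its efficiency point
    pushed     = rel_power(1.6, 1.15)  # same chip chasing clocks

    print(f"frequency: +{(1.6 / 1.2 - 1) * 100:.0f}%")            # +33%
    print(f"power:     +{(pushed / sweet_spot - 1) * 100:.0f}%")  # +118%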
     
  10. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    But then the immediate follow-up is: why are AMD's clocks so low to start with?

    I always thought it had to do with the math pipeline being 4 deep, but thanks to Volta that theory flew right out of the window.
     
  11. sonen

    Newcomer

    Joined:
    Jul 13, 2012
    Messages:
    53
    Likes Received:
    33
    IMHO that's like saying a rottweiler is slower than a greyhound because it can't run as fast.
    Clockspeed, and its operating comfort zone, is just a consequence of the underlying electrical and architectural reality.
    It's self-evident that AMD is behind in clock speed; the question is why.
    Why does Vega burn small villages to reach 1.7 GHz, while GP102 gets there gracefully? What's in its DNA that causes the high consumption?

    How many times have we heard: all AMD needs is higher clocks and they'll be fine.
    Well, Vega does expand on clockspeed, and impressively so, but how? Allegedly by using more die space (wth).
     
  12. troyan

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    120
    Likes Received:
    181
    The biggest difference between Nvidia and AMD isn't hardware, it's mentality. After Fermi, Nvidia started to rethink its approach for the next hardware designs; there are talks from Nvidia's Dally about power consumption from 7 years ago. AMD hasn't cared about it. There was always the fixation on transistor count and die size, but after Fermi, power consumption became the biggest problem to tackle.

    Vega doesn't come close to Fermi. In retrospect, Fermi was a true next-gen architecture and a huge leap forward for Nvidia. Vega, on the other hand, is just another GCN iteration, lacking multiple features like FP64 (performance), tensor cores (DL), a better HBM memory controller, etc. What Vega offers isn't really better than what Nvidia provides with their Tegra X1 chip...
     
    DrYesterday and Lightman like this.
  13. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    511
    Likes Received:
    232
    By that logic, I could just as well call Fermi another G80 iteration.
     
  14. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,148
    Likes Received:
    570
    Location:
    France
    Thing is, I wonder how much of Vega we will find in Navi. If it's a lot, then Vega was a necessary step. If it's not, then I'll consider Vega an overclocked Fiji with more memory (since the DSBR, primitive shaders, etc. seem to be disabled or doing squat for performance), and wonder WTF they were doing for all those years with that architecture. Even Polaris looks "nicer".

    Back on topic, I thought I read somewhere that Nvidia uses a lot of hand design where AMD/RTG, because of a lower R&D budget, automated a lot of it, and so ended up less efficient. Could this be a reason?
     
  15. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    My mental model is that the GCN 4-cycle cadence works like this: fetch reg A, fetch reg B, fetch reg C, execute. So only a single cycle to execute. Nvidia nowadays has an operand reuse cache. Couldn't they do: fetch reg A, fetch reg B, execute, execute, if we assume that at least one multiply-add operand comes from the operand reuse cache? Just some random thoughts, no concrete info.

    Also, Nvidia simplified their register files a lot for Maxwell (more bank conflicts, leaning more on the operand reuse cache). Here's some info: https://github.com/NervanaSystems/maxas/wiki/SGEMM. Register files are a big part of GPU power consumption. AMD's Vega presentations also said that the register files were optimized in cooperation with the Ryzen engineers to allow higher clock rates.
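
    As a toy illustration of the register-file traffic at stake, the sketch below counts RF reads for a short FMA stream with and without a last-value-per-operand-slot reuse cache. This is a simplification of the reuse flags maxas describes, and the instruction stream is invented.

    Code:
    # Count register-file reads for an FMA stream (d = a * b + c), with
    # and without a reuse cache holding the last register seen in each
    # operand slot.  Stream invented for illustration.
    stream = [
        ("r0", "r4", "r8"),
        ("r0", "r5", "r8"),  # r0 and r8 repeat in the same slots
        ("r0", "r6", "r8"),
        ("r1", "r6", "r9"),  # r6 repeats in slot 1
    ]

    naive = sum(len(ops) for ops in stream)

    cached, last_in_slot = 0, {}
    for ops in stream:
        for slot, reg in enumerate(ops):
            if last_in_slot.get(slot) != reg:
                cached += 1            # miss: really read the register file
            last_in_slot[slot] = reg   # the slot now holds this register

    print(f"RF reads: {naive} naive vs {cached} with reuse cache")
    # 12 vs 7 here: about 40% of the reads (and their energy) elided.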
     
    orangpelupa, BRiT and Lightman like this.
  16. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,509
    Likes Received:
    839
    That's how I imagine it. It costs as much to move the data from the register file to the ALUs as it does to do the computation itself. GCN loads all operands from the register file and always stores the result. I imagine Nvidia can elide a store plus subsequent load if a result is consumed by a following operation.
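
    A quick tally of what that elision buys for a dependent pair, Z0 = A*B + C followed by Z1 = D*E + Z0, using assumed round per-access energies (order of magnitude only, not figures for any real GPU):

    Code:
    # Energy for Z0 = A*B + C; Z1 = D*E + Z0, with and without forwarding
    # Z0 straight into the next FMA.  Per-access energies are assumed
    # round numbers, not measured figures for any GPU.
    PJ_RF  = 1.5  # read or write one FP32 operand in the reg file (assumed)
    PJ_FMA = 1.5  # one FP32 fused multiply-add (assumed)

    # All operands from the RF, every result written back (GCN-style).
    no_fwd = 2 * (3 * PJ_RF + PJ_FMA + PJ_RF)

    # Forwarding elides Z0's write-back and its re-read as an operand.
    fwd = no_fwd - 2 * PJ_RF

    print(f"{no_fwd:.1f} pJ vs {fwd:.1f} pJ ({(1 - fwd/no_fwd)*100:.0f}% saved)")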

    Cheers
     
  17. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    That's the right mental model, but it's the wrong implementation model, because it's just not possible to execute in 1 cycle.

    That's why they fetch 64 values per cycle, but only launch 16 operations per cycle.

    Here's my implementation model:
    You want to calculate:
    Z0 = A0 * B0 + C0
    Z1 = A1 * B1 + Z0
    Z2 = A2 * B2 + Z0
    You have a register file that can read or write 64 values at a time. Only 1 port to save area.
    The execution pipeline is 4 deep.
    You have a result collector to store results Z[47:0] before writing the values back to the register file.
    You have a bypass path from Execute D to Execute A.

    Timeline:
    Code:
    RegFile   Execute A Execute B Execute C Execute D Output Collector
    Fetch A0
    Fetch B0
    Fetch C0
    Write Z-2 Z0[15:0]
    Fetch A1  Z0[31:16] Z0[15:0]
    Fetch B1  Z0[47:32] Z0[31:16] Z0[15:0]
     <IDLE>   Z0[63:48] Z0[47:32] Z0[31:16] Z0[15:0]
    Write Z-1 Z1[15:0]  Z0[63:48] Z0[47:32] Z0[31:16] Z0[15:0]
    Fetch A2  Z1[31:16]           Z0[63:48] Z0[47:32] Z0[31:0]
    Fetch B2  Z1[47:32]                     Z0[63:48] Z0[47:0]
     <IDLE>   Z1[63:48]                               Z0[63:0]
    Write Z0  Z2[15:0]
    
    In this model, you have 2 bypass options: one directly from Execute D to Execute A (Z0 -> Z1), and another from the output collector to Execute A (Z0 -> Z2).

    (I should have done this a long time ago. Only after sweating the details now do I realize that you need 2 bypasses.)
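
    For anyone who wants to poke at it, here's the same model as a tiny cycle simulator. It only covers the four execute stages and the output collector, with the register-file port schedule baked into a hard-coded launch pattern and write-back left out, so it's a sanity check of the bypass timing rather than a full model.

    Code:
    # Tiny cycle simulator for the model above: four execute stages plus
    # the output collector.  Launch pattern hard-coded; write-back not
    # modeled.
    STAGES = 4                    # Execute A .. Execute D
    launches = {                  # cycle -> quarter-wave entering Execute A
        3: "Z0.q0", 4: "Z0.q1", 5: "Z0.q2", 6: "Z0.q3",
        # Z1 needs Z0: each Z0 quarter is bypassed from Execute D's
        # output into Execute A just as it retires (cycles 7..10).
        7: "Z1.q0", 8: "Z1.q1", 9: "Z1.q2", 10: "Z1.q3",
        # Z2 also needs Z0, which by now sits in the output collector.
        11: "Z2.q0", 12: "Z2.q1", 13: "Z2.q2", 14: "Z2.q3",
    }

    pipe = [None] * STAGES        # pipe[0] = Execute A ... pipe[3] = Execute D
    collector = []                # retired quarter-waves awaiting write-back
    for cycle in range(19):
        if pipe[-1]:
            collector.append(pipe[-1])            # quarter-wave retires out of D
        pipe = [launches.get(cycle)] + pipe[:-1]  # everything advances a stage
        row = " ".join(f"{s or '-':>6}" for s in pipe)
        print(f"cycle {cycle:2}:  {row}   collected: {len(collector)}")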
     
    #17 silent_guy, Dec 22, 2017
    Last edited: Dec 22, 2017
    sebbbi, Jawed and Grall like this.
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,119
    Likes Received:
    2,864
    Location:
    Well within 3d
    Volta's dependent FMA latency is 4 cycles, but unlike GCN I'm not sure how broadly that carries over to other operations.
    It wouldn't just be what the cycle count is, but what is done in that cycle, over what distance, and how much work was put into optimizing it.
    One assumption is that GCN's 4-cycle cadence is the limiting path, which is possible but unverified. Some of AMD's discussion of Vega indicated they did make a special effort to keep it at 4.

    Other elements of Volta's architecture show efforts to keep things local. There's an L0 instruction cache near the SIMDs, rather than the instruction L1 that is shared between 3 CUs in Vega. AMD discussed limiting Vega's sharing to 3 CUs in the context of reducing wire delay, and if we take Nvidia's diagram as roughly true to the hardware, its instruction supply is physically closer. Also, if capping the number of CUs sharing a front end was about wire delay, going from 4 to 3 wouldn't seem to change the picture that drastically.

    Other items: both architectures have 16-wide SIMDs, but Nvidia has split more instruction types off into separate SIMD groups rather than putting everything through the one bucket AMD does. That affects how much has to happen in a cycle in a relatively confined area.
    Nvidia's warp size is half the width of a GCN wavefront, and various operations scale with that width within a single cycle.
    GCN's ISA is more freely documented, so its complexity versus Volta's instructions is hard to compare, and that would speak to how much must be done in a cycle. GCN has a fair number of shuffles and alternate data sources/routings across 64 items that get put into that cadence, where Nvidia might have relaxed the timing.

    Beyond that is the question of how much work goes into optimizing the implementation. Nvidia has noted at least some level of partial customization in some products, and it's willing to target efficiency measures and physical changes in its architecture more aggressively, with things like its register operand cache. It has also been able to leverage its less-successful mobile efforts to better manage the power consumption of its other products.

    AMD generally hasn't displayed that level of interest in optimizing its GPUs of late. It's true that Vega's register file was apparently optimized by the Zen team to notable effect, which hints that the baseline level of optimization leaves a lot on the table. And given what the register file's custom work achieved, is the rest of the hardware closer to bog standard?
     
  19. gamervivek

    Regular Newcomer

    Joined:
    Sep 13, 2008
    Messages:
    715
    Likes Received:
    220
    Location:
    india
    I think the other immediate follow-up is: why are Nvidia's clocks so high with Pascal?

    If Pascal had clocked at stock what Maxwell could do overclocked, 1.45-1.5 GHz, AMD would have no trouble with its performance, but Pascal shot up to 1.8 GHz at stock and could do 2 GHz on just about every chip. Jensen mentioned Nvidia's focus on clockspeeds with Pascal, and AMD seem to have followed suit with Vega but are still woefully short.


    Not really, more like "why is that rottweiler eating so much if it can't run as fast?"

    While it might be self-evident that AMD are behind in clock speeds, it isn't self-evident why they consume so much power, which is why the OP reached for scheduling first. Nor is it self-evident why clockspeeds would matter, since the chips are so different. But they do.

    If AMD could keep up with Nvidia at the same clocks, then of course one course of action would be to raise clockspeeds and they'd be fine. If architectural changes mattered more, letting them compete at lower clockspeeds, they'd go for that. They seem to have gone for both with Vega, but while the former has somewhat panned out, the latter is still MIA.

    As for using more die space, I think Nvidia had to go that route as well, though they didn't mention transistor figures.

    Why is AMD behind in clockspeed? Is it the architecture, as in the shader/TMU/ROP/scheduler layout, or the transistors themselves? I think it's impossible for us to tell.
     
  20. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    511
    Likes Received:
    232
    I'd say both, in about a 70/30 ratio.
     