Why do GCN-architecture chips tend to use more power than comparable Pascal/Maxwell chips?

Discussion in 'Architecture and Products' started by Genotypical, Dec 21, 2017.

  1. Genotypical

    Newcomer

    Joined:
    Sep 25, 2015
    Messages:
    38
    Likes Received:
    11
    The manufacturing process is likely a factor. That should affect the voltages they have to pump through their GPUs, right? Many folks have had success in reducing power consumption to near-Pascal levels by undervolting. I assume AMD uses higher voltages to preserve yield rather than being aggressive with voltage for higher efficiency.

    That's probably one of the clearer causes. Though last gen they were both using TSMC, and a 980 Ti used around the same power as a Fury.
     
  2. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    It was rather the norm that both used TSMC.
     
    Pixel likes this.
  3. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Yeah, power demand was pretty comparable to a reference 980 Ti, but then a reference Fury X generally did not have the same performance as the reference 980 Ti. Not sure if this came down to throttling (even the reference 980 Ti throttled at default settings) or just the limits of the performance envelope AMD chose for Fiji relative to power efficiency.
    Vega has managed some efficiency gains, but not by scaling both clocks and the architecture, as has been done with Pascal and Volta.
    You can reduce voltage/power on Nvidia cards too if one really wants to play around with efficiency; that is why there is a Pascal Tesla P4, in essence a GP104 running at 50W or 75W with 5.5 FP32 TFLOPS, which is pretty incredible performance/efficiency.
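
    For scale, here is a rough perf/W comparison from those numbers. A minimal sketch, assuming reference GTX 1080 figures of roughly 8.9 FP32 TFLOPS at 180W for the full GP104 (the 1080 numbers are my assumption, not from this thread):

    ```python
    # Rough perf/W comparison for the Tesla P4 point above. The GTX 1080
    # figures (~8.9 FP32 TFLOPS at ~180 W board power) are assumed reference
    # numbers for the full GP104, not taken from this thread.

    def gflops_per_watt(tflops: float, watts: float) -> float:
        return tflops * 1000.0 / watts

    p4 = gflops_per_watt(5.5, 75)        # Tesla P4 at its 75W limit: ~73 GFLOPS/W
    gtx1080 = gflops_per_watt(8.9, 180)  # reference GTX 1080: ~49 GFLOPS/W
    print(f"P4: {p4:.0f} GFLOPS/W, 1080: {gtx1080:.0f} GFLOPS/W, ratio {p4 / gtx1080:.2f}x")
    # P4: 73 GFLOPS/W, 1080: 49 GFLOPS/W, ratio 1.48x
    ```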
     
    #23 CSI PC, Dec 27, 2017
    Last edited: Dec 27, 2017
  4. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    515
    Likes Received:
    237
    The same basic concept as Fury Nano: wide and "slow".
    Btw, where's Vega Nano?
     
  5. bazooka_penguin

    Joined:
    Oct 21, 2017
    Messages:
    3
    Likes Received:
    1
    It's not really that big of a difference, right? Big Vega can hit its rated frequency for a theoretical 12.5 TF at around 300W. Big Pascal can do the same for around 250W. Seems like Nvidia has a 10-15% advantage, possibly due to process or maybe just better optimizations. The fact that GCN apparently has much worse resource efficiency probably makes the disparity look much bigger than it actually is.
     
  6. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    Are you comparing rated TDPs here?
     
  7. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    You cannot compare their TDP/TBP figures in the same way (look at TomsHardware/PCPer, who use scopes to see the difference between both manufacturers' measurements), nor how they rate them against TFLOPS; Nvidia usually does not quote boost performance unless they state it explicitly, as with the Tesla P4 at 75W, so the actual performance of their other cards is higher than officially stated.
    Remember, though, that AMD back then did have an edge in terms of compute performance (even allowing for the above), but this did not translate into games, nor necessarily into certain applications, nor into sustained peak performance relative to the comparable 980 Ti.
    That said, for AMD, Fiji was a step in the right direction and did seem to be within its stated TDP/TBP spec.
    But unfortunately Vega is not enough of a step relative to Pascal, and now 'Volta', in terms of scalable efficiency: the clocks, the architecture/cores, and the jump in compute performance these provide have Nvidia catching up with AMD even when allowing for the normal GPU performance envelope and thermal/power dynamic boost.
     
    #27 CSI PC, Dec 27, 2017
    Last edited: Dec 28, 2017
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    I believe he stated that they were somewhat surprised at how effective their optimizations were, which can happen. Sometimes a good prediction can pay higher dividends than normal.

    I'm more out of my depth with canine physiology than I am with car analogies, but from the following it seems like Greyhound breeding has made them very optimized for what they do.
    https://en.wikipedia.org/wiki/Greyhound

    "The key to the speed of a Greyhound can be found in its light but muscular build, large heart, highest percentage of fast-twitch muscle of any breed, double suspension gallop, and extreme flexibility of its spine. "Double suspension rotary gallop" describes the fastest running gait of the Greyhound in which all four feet are free from the ground in two phases, contracted and extended, during each full stride."
    Additionally, there are swaths of body tissue that are minimized, such as body fat, undercoat, red blood cell count, and liver capacity. Racing emphasizes their prey chasing instincts, apparently to the point that it can seriously endanger them in environments with car traffic.

    From https://en.wikipedia.org/wiki/Rottweiler, we see gait, physiology, and temperament that will not motivate them to the extremes of a Greyhound or will physically work against them at the operating points of forcing a canine body to 70 km/h.
    They would lose out in acceleration and energy loss per stride, and devote resources to maintaining bulk and strength for purposes other than running. A thin-skinned 70 km/h lightweight with lesser clotting capacity and a tendency to bolt doesn't provide utility for herding or holding its own in a drag-out fight.

    To make the analogy fit GPU design more closely, it would probably be more like AMD and Nvidia had the chance to design a dog, and they had to guess in more general terms what sorts of muscle fibers they'd have on hand or how far they could push the spine and legs, then compare it to how many races versus fights the dog would be in.
    Then there's a question of how often they could iterate on their guesses, and what unexpected stumbling blocks each would hit and the timing of them.

    Circuits face a trade-off between transistor performance and wire delay, with the latter becoming significantly more limiting with every node. There are a number of choices that increase the capability of the transistor portion to drive signals down wires by sacrificing area and power, such as repeaters or extra pipeline stages with their corresponding latches. Since wire delay is quadratic with the distance traveled, cutting a path in half can significantly boost signal propagation at the cost of the switching delay of the transistors involved--which have significantly outpaced wire delay.
    If the overall mix of wire versus transistor is mostly locked in, then driving voltages up will provide more drive strength to the transistors--to a point. At some point the transistors' ability to carry current saturates, or the limit of their connections to their wires is reached, and the quadratic power cost of greater voltage stops paying for itself.
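
    To put a number on the quadratic relationship, here is a minimal Elmore-style sketch with made-up per-mm RC constants (the values are illustrative assumptions, not any real node's parameters), showing how splitting a wire into repeated segments trades repeater switching delay against the L² wire term:

    ```python
    # Back-of-the-envelope sketch of the quadratic wire-delay argument: both
    # resistance and capacitance grow linearly with wire length L, so the RC
    # delay of an unrepeated wire grows with L^2. Splitting the wire into N
    # repeated segments cuts the wire term by N at the cost of N-1 repeater
    # switching delays. All constants are made up for illustration.

    R_PER_MM = 1_000.0   # ohms per mm (assumed, not a real node's value)
    C_PER_MM = 0.2e-12   # farads per mm (assumed)
    T_REPEATER = 5e-12   # seconds per inserted repeater (assumed)

    def wire_delay(length_mm: float, segments: int = 1) -> float:
        """Elmore-style delay of a wire split into equal repeated segments."""
        seg = length_mm / segments
        rc = segments * 0.5 * (R_PER_MM * seg) * (C_PER_MM * seg)
        return rc + (segments - 1) * T_REPEATER

    for n in (1, 2, 4):
        print(f"{n} segment(s): {wire_delay(2.0, n) * 1e12:.0f} ps")
    # 1 segment(s): 400 ps
    # 2 segment(s): 205 ps
    # 4 segment(s): 115 ps
    ```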

    In the nodes since 28nm, the foundries became more conservative with wire scaling than they have with transistors, and the emphasis on density can lead to worse wire delay because thin wires have higher resistance. GF's 7nm presentation shows that its upcoming FinFETs have significantly more refined transistors, with better materials and taller fins. The high-performance variant of the process explicitly goes for more metal layers and less dense cells+wires versus the single density-optimized option with relatively stubby fins available for GF's 14nm.

    It's more of a holistic question at this point. Architectures are designed with certain predictions about their workloads and the physical and electrical realities they will face at the time of manufacturing.
    Decisions such as the overall balance of per-cycle work and working transistors versus on-die connectivity will be decided long before the chips are taped out and need to face the realities of their manufacturing.

    There is something of a hint in AMD starting to talk about wire delay with Vega, when the concept didn't come up before.
     
    DavidGraham likes this.
  9. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    Reading this, it occurred to me that in every die shot we've seen so far, AMD's individual CUs seem to be rather lengthy "stripes", while Nvidia's SMs are more compact and shaped like a fat L. While the first lends itself to denser packing in rows and columns, the latter seems able to have shorter intra-SM wiring, albeit with chip designers having a harder time filling the gaps.
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I've described the current operation of GCN here:

    https://forum.beyond3d.com/threads/...ors-and-discussion.59649/page-13#post-1977187

     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Each CU has a buffer for instructions, L0 effectively. That buffer is shared amongst the kernels that are running in the CU.

    So the buffer changes slowly. But there could be problems if, for example, latency-hiding has caused a single, immense kernel to have multiple wavefronts, each with a program counter far from all the other program counters. So there could be contention for buffer pages if each of 16 hardware threads is in a distinct part of the kernel (10 compute threads or 16 graphics threads are the limits per CU).
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    There may be some merit in this.

    But AMD has always "over-volted" its chips. This is so that the chip will still run in 5 years' time. As time goes by the "under-volt" that people use will fail and they will have to increase voltages to keep the chip running.

    Of course one has to run a chip for years to notice this.
     
    el etro and DavidGraham like this.
  13. yuri

    Newcomer

    Joined:
    Jun 2, 2010
    Messages:
    178
    Likes Received:
    147
    The GCN arch is oldish.

    R300 family - Aug 2002 to Oct 2005. That's over 3 years.

    R500 family - Oct 2005 to May 2007. That's over 2.5 years.

    TeraScale - May 2007 to Dec 2011. That's roughly 4.5 years.

    GCN is already 6 years old and its EOL seems to come sometime in 2019. So GCN's lifespan will be nearly 8 years by then. Given there have been only minor architectural improvements, it's showing its age.
     
    Picao84 and el etro like this.
  14. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    The voltage difference between GloFo and TSMC is not that much and would not be a notable contribution when looking at each GPU in the context of this thread.
    In fact the default voltage was higher with Fiji than it is with Vega, even comparing a custom Fury X to a custom Vega in reviews analysing voltage/power, while for Nvidia the difference was around 5% between the 1050 Ti (which uses Samsung) and the 1060 (TSMC), at 1.09V vs 1.04V in-game, and still within a performance envelope range that would be moderately optimal.

    The custom Sapphire Fury X was around 1.2V while the custom Sapphire Vega is 1.05V when pushed in gaming with default GPU settings; this is Fury X compared to Vega 64.
    Other reviews analysing the reference Fury X and its voltages still had it at 1.2V under 3D loads, while Vega in whichever form is quite a lot lower unless pushing an OC.
    Also worth noting the 980 Ti also went to 1.2V in 3D loads with normal boost behaviour when not throttling.
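
    To put those voltage numbers in perspective: dynamic switching power scales roughly with C·V²·f, so at a fixed clock the Fiji-to-Vega voltage drop alone is worth a sizable chunk of power. A minimal sketch using the voltages quoted above (leakage and clock differences are ignored, so this is only a first-order estimate):

    ```python
    # Dynamic switching power scales roughly with C * V^2 * f, so at the same
    # clock and capacitance the power ratio between two voltage points is
    # (V_new / V_old)^2. Leakage and clock differences are ignored here.

    def dynamic_power_ratio(v_new: float, v_old: float) -> float:
        """Relative dynamic power at v_new versus v_old, same C and f."""
        return (v_new / v_old) ** 2

    # Voltages quoted above: ~1.2V on Fury X / 980 Ti versus ~1.05V on Vega 64.
    r = dynamic_power_ratio(1.05, 1.20)
    print(f"1.05V vs 1.20V: {r:.2f}x dynamic power, ~{(1 - r) * 100:.0f}% saving")
    # 1.05V vs 1.20V: 0.77x dynamic power, ~23% saving
    ```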

    It could be argued that the reduced node size is more sensitive to higher voltages in terms of silicon wear, along with the impact of higher density.
     
    #34 CSI PC, Dec 28, 2017
    Last edited: Dec 28, 2017
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    I'm aware of a per-SIMD instruction buffer mentioned for GCN, and an instruction buffer documented in the ISA docs that seems to match. It drains rather quickly, if the timing or the NOP packing for cache invalidation and special-mode operations is any indication. Per Vega's ISA, it's 16 DWORDs long, up from 12 in prior generations.
    Nvidia's description for Volta indicated SIMD-local buffers existed prior to whatever it is calling an L0.

    Perhaps the per-CU buffer you mention is part of the instruction prefetch functionality in recent GCN generations, or is there a reference to what AMD calls it?
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I think you're correct; this is an instruction buffer localised to a SIMD and holding instructions defined by a single hardware thread, remembering that a CU can issue 5 instructions in parallel (e.g. scalar, vector, export, LDS, VMEM) and these are chosen from a distinct thread for each type of unit.

    The original whitepaper:

    https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

    implies that the selection of up to 5 hardware threads required for instruction issue all come from buffers on a single SIMD:

    "The selected SIMD can decode and issue up to 5 instructions from the 10 wavefront buffers to the execution units."

    This also implies that each hardware thread has its own instruction buffer within the SIMD. So there's 10 buffers, each of 16 DWords, in Vega. (This number, 10, is troubling, because in graphics 16 hardware threads per SIMD are supported, I believe.)

    When a branch is taken, or when a different thread is swapped into the SIMD, the buffer for that hardware thread needs to be re-filled from the new program counter, and the buffer will contain a varying count of instructions ("pre-fetch") depending upon the mix of DWord- and QWord-sized instructions.

    So I$ read contention is reduced by making this buffer 33% larger. In Polaris the sharing of I$ was changed from 4 CUs down to 3. So both of these step changes have been about I$ contention.

    One thing that seems weird to me is that the original whitepaper describes an instruction cache line as 64 bytes, which is 16 DWords. Yet in original GCN the instruction buffers were 12 DWords? So a cache line fetch would always produce more instructions than could fit in the buffer. If anything, this is the clearest sign that the instruction buffer in GCN is not a cache, since there is no direct mapping between a 64B cache line and what's in the instruction buffer. Instead, it seems, 1 or 2 fetches from I$ would be required to populate the 12 DWords of the instruction buffer, because contiguous fetches of 12 DWords cannot, in general, be aligned with 16-DWord cache lines.

    Also, it's worth noting that the buffer may not contain a fully complete final instruction. The final instruction might be QWord sized, but only one DWord of that instruction can fit because the previous instructions used up 15 DWords (in Vega), 11 in Polaris or earlier.

    So the burstiness of fetches from I$ into the instruction buffer is, in general, worse than envisioned.

    But, Vega might have made the instruction buffer cache aligned, since its size now matches a cache line. This would mean that there'd be no burstiness (half the time 2 fetches from I$ are required contiguously in Polaris and earlier). On the other hand, even if Vega's instruction buffer is cache aligned this presents the problem: what if the first instruction after a JMP is QWord sized and that instruction is split across two cache lines. There's no way to issue from an incomplete instruction. So I think that's enough to rule out the possibility that the instruction buffer is cache-aligned. Instruction fetch into the buffer always requires futzing to put the first new instruction so that it starts at the beginning of the buffer and a complete buffer fill will require fetching either one or two 64B lines from I$.
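
    To make the alignment arithmetic concrete, here is a small sketch counting how many 64B I$ lines a buffer fill spans, assuming the first instruction can start at any DWord offset within a line (real code won't hit offsets uniformly, so treat the ratios as illustrative):

    ```python
    # Sketch of the fetch-alignment argument: how many 64-byte I$ lines
    # (16 DWords each) must be read to fill a K-DWord instruction buffer if
    # the first instruction starts at an arbitrary DWord offset in a line?

    LINE_DWORDS = 16  # 64-byte cache line holds 16 DWords

    def lines_needed(buffer_dwords: int, start_offset: int) -> int:
        """Cache lines spanned by a buffer_dwords-long fill from start_offset."""
        first = start_offset // LINE_DWORDS
        last = (start_offset + buffer_dwords - 1) // LINE_DWORDS
        return last - first + 1

    for size in (12, 16):
        spans = [lines_needed(size, off) for off in range(LINE_DWORDS)]
        two = sum(1 for s in spans if s == 2)
        print(f"{size}-DWord buffer: 2 line fetches for {two} of 16 start offsets")
    # 12-DWord buffer: 2 line fetches for 11 of 16 start offsets
    # 16-DWord buffer: 2 line fetches for 15 of 16 start offsets
    ```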

    You might like to rummage in this slide-deck to see if you can find more clues:

    https://32ipi028l5q82yhj72224m8j-wp...DC2017-Advanced-Shader-Programming-On-GCN.pdf
     
  17. el etro

    Newcomer

    Joined:
    Mar 9, 2014
    Messages:
    95
    Likes Received:
    12
    Two reasons: the GF node instead of TSMC's, and underutilization of theoretical FP power. The RX 480 has around the same power consumption for around the same FP throughput as a GTX 1070.
    Radeon begs for a real new uarch.
     
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    My recollection was that it was 40 wavefronts per CU for all types.

    In other instances, instruction fetch has been internally subdivided into one or more windows per clock that are less than the width of a cache line, probably because it helps with the bandwidth wasted by jumps and with the complexity of processing a full line at once. Bulldozer, for example, had a front end that fetched from two sub-line-length windows, with some nuances to the windows delivered per cycle(s).
    It's possible that there's something similar going on with a smaller granularity, with less than 12 DWords fetched at a time, and possibly delivered in some kind of internal pipeline cadence with some cycles idle or reserved for other CUs.

    12 DWords may be somewhat associated with peak issue, which is 5 instructions that can be issued to an execution stage (1-2 DWords each) plus one special instruction that can be executed within the instruction buffer, so 6 instructions at up to 2 DWords each. 12 DWords makes more sense then.
    Upping it to 16 might mean AMD profiled an increase due to more features being used and new instructions, like packed math, often leveraging 64-bit encodings. Another maybe-related figure is that a front end that can service 4 clients tuned for 12 DWords has the same throughput as a front end servicing 3 clients with 16.
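
    Just restating that arithmetic as a quick check (nothing here beyond the numbers already given above):

    ```python
    # Restating the accounting above: 6 instructions (5 to execution units plus
    # 1 handled within the instruction buffer) at up to 2 DWords each is 12
    # DWords, and 4 clients x 12 DWords equals 3 clients x 16 DWords per round.

    print(6 * 2)            # 12 DWords of peak issue
    print(4 * 12, 3 * 16)   # 48 48 -> equal front-end throughput
    ```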
     
  19. Locuza

    Newcomer

    Joined:
    Mar 28, 2015
    Messages:
    45
    Likes Received:
    101
  20. AlBran

    AlBran Ferro-Fibrous
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    20,667
    Likes Received:
    5,759
    Location:
    ಠ_ಠ
    Mmhm...

    [Attached image: GCN VGPR table.jpg]
     