AMD Vega Hardware Reviews

Discussion in 'Architecture and Products' started by ArkeoTP, Jun 30, 2017.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    If expressing how much of peak is used, why not utilization?
     
  2. leoneazzurro

    Regular

    Joined:
    Nov 3, 2005
    Messages:
    518
    Likes Received:
    25
    Location:
    Rome, Italy
Don't ask me, I did not use the term "IPC" :-D
     
  3. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Interesting. Do you know which instructions have this or does it apply in general to instructions that don't use the pipeline?
Do the scalar instructions need it? (For those, the cost of a multi-ported register file would be much smaller than for the SIMD ops)
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Setting VSKIP mode can make the buffer skip vector instructions. This occurs at a rate of 10 waves skipping one per cycle, which the GCN manual indicates is faster than issuing and discarding them.

    *edit: Admittedly, I think this might be ambiguous as to how it is handled.
    **much later edit: On second thought, the max rate of 10 skipping an instruction per cycle may point to it being limited to the issue cycle of a given SIMD. If so, that would keep even the skip rate at effectively 1/4.
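
As a back-of-envelope illustration, here is a minimal sketch of the two readings (the constants are the GCN manual figures quoted above; the split into two interpretations is my speculation, not anything official):

```python
# Two readings of GCN's VSKIP rate (a sketch; the two-interpretation
# split is speculative, not from the ISA manual).

CU_SIMDS = 4          # a GCN CU round-robins instruction issue over 4 SIMDs
SKIPS_PER_CYCLE = 10  # manual: up to 10 waves each skip one instruction/cycle

# Reading 1: waves may skip on any cycle, regardless of which SIMD
# currently owns the issue slot -- 10 skips/cycle CU-wide.
rate_any_cycle = SKIPS_PER_CYCLE

# Reading 2: a SIMD's waves can only skip on that SIMD's issue cycle
# (once every 4 cycles) -- the "effectively 1/4" interpretation above.
rate_issue_cycle_only = SKIPS_PER_CYCLE / CU_SIMDS

print(rate_any_cycle, rate_issue_cycle_only)  # 10 2.5
```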
     
    #144 3dilettante, Jul 6, 2017
    Last edited: Jul 6, 2017
    silent_guy likes this.
  5. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
I would say that discussing IPC can't be done in isolation but is tied to the architecture (and the ISA: what/how much an instruction is doing). Having said that, one should do exactly this and also state that vector instructions are only part of the instruction stream. There is usually a significant number of scalar instructions (and sometimes also some fraction of LDS instructions). From the architectural point of view, these are of course actual instructions doing something meaningful (maybe not for flops, but for control flow, for instance).
Other GPUs need to use vector instructions for these tasks. One part of optimizing for GCN can be trying to shift more work to the scalar unit, as scalar instructions have the same maximum issue rate as vector instructions and can be issued in parallel. GCN's vector instructions basically do the heavy lifting for the calculations, in a (very loosely) similar way to how SIMD units in CPUs do it. And nobody talking about IPC in the case of CPUs is talking only about the instruction throughput of AVX instructions. That's only one part (albeit an important one) of the equation.
In some (maybe contrived) circumstances, GCN can issue 1 vector, 1 scalar, 1 LDS, and 1 memory instruction (plus some of the special instructions which are handled internally by the instruction buffers) in parallel. And the CU as a whole is able to sustain an issue rate of >2.5 instructions per clock (without counting the internal ones) with the "right" instruction mix.
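
A toy simulation of that issue rule (one instruction per wave per cycle, at most one instruction of each type per cycle, from one SIMD's set of 10 wave buffers; the type names and instruction mix are invented for illustration) shows how a friendly mix sustains around 3 issues per clock:

```python
from collections import deque

def issue_cycle(wave_queues):
    """Issue at most one instruction of each type per cycle, each from
    a different wave buffer of the currently selected SIMD."""
    issued = 0
    used_types = set()
    for q in wave_queues:
        if not q:
            continue
        if q[0] not in used_types:       # one instruction of each type
            used_types.add(q.popleft())  # pop the wave's next instruction
            issued += 1
    return issued

# Ten waves on one SIMD, each cycling vector/scalar/LDS work with
# staggered phases so their front instructions differ in type.
pattern = ["vector", "scalar", "lds"]
waves = [deque((pattern[i % 3:] + pattern[:i % 3]) * 4) for i in range(10)]

total = sum(issue_cycle(waves) for _ in range(12))
print(total / 12)  # 3.0 issues per clock with this friendly mix
```

With an unfriendly mix (all waves fronting the same type) the same model degrades to 1 issue per clock, which is why the "right" mix matters.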
     
    #145 Gipsel, Jul 7, 2017
    Last edited: Jul 7, 2017
    silent_guy, Jawed and Lightman like this.
  6. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Thanks. I never really thought this through beyond the simple 4 cycle cadence...

But the way I look at it now is that GCN can issue one instruction per cycle for non-vector instructions, and that those non-vector instructions can 'hide' in the 3 remaining cycles during which the SIMD is still doing its thing.
    So you can freely mix scalar and LDS operations with vector operations without inserting a bubble in the vector path.

    Meanwhile, Nvidia Maxwell GPUs are able to dual issue in the same cycle to the vector pipeline and the LDS, which they need to avoid bubbles in the vector pipeline because they don't have the 4 cycle cadence of AMD. (https://devtalk.nvidia.com/default/topic/743377/understanding-cuda-scheduling/)

    In Volta, the new thing is that they have a 2 cycle vector cadence. So they probably don't have to dual issue anymore and can issue that LDS operation in the second cycle, or even use it to issue an integer vector operation.

    Is that understanding correct?
     
    pharma likes this.
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    That goes to the semantic density of the instruction stream, what operations come out of the instruction sequence, and why it's better used when comparing within an architectural family or line. Adjusting for an ISA gap has been attempted, although it usually comes with some controversy.

    At best, an apples to apples comparison would occur on the same instruction stream to see how two different implementations perform on the same workload. Specialized instructions that require specific coding muck up the metric by modifying the instruction stream.
So far, Vega versus Fiji doesn't seem so disparate as to negate what saying an architecture has higher IPC usually denotes, and if packed math isn't currently being used, it is a closer comparison.

    If AMD chose to use packed math instructions as the basis for its claims of higher IPC, they knew what they were trying to lead people into believing.

    If AMD used packed math to make its IPC claims for Vega, that is effectively what they are doing.

    That would not come from the same thread. AMD could throw more parallel threads at the problem and the claim of higher IPC should be given as much credence as Sun's Niagara core having higher IPC than Opteron.
    I'm not saying it couldn't be a good thing to increase overall throughput across threads, but there are ways of stating it that wouldn't inflate NCU's departure from the prior architecture.

The other instructions belong to a different thread, and they aren't really hiding, since this is a pipeline built around the 4-cycle cadence. There are three other sets of 10 threads that will be trying to issue something before getting back to the first one.

    Within the same (hardware) thread? Not exactly, although some of the interactions are related to the internal gaps of the semi-independent pipelines/sequencers in the CU. LDS in particular is going to pose a stall risk unless it is totally aligned and the current thread is alone in using that storage.
     
    pharma likes this.
  8. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
That's the model that I used to have in mind: everything being on the 4-cycle cadence, and no dual-issue. But that goes against what Gipsel (not Giselle, dear iPhone) is saying.

But in that very simple model, a scalar/LDS operation will insert a bubble in the vector pipeline no matter what. (Just as it does for integer operation bookkeeping pre-Volta.)
     
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
The CU can issue one instruction per thread, and will pick up to 5 from a set of 10 per-wave instruction buffers (instruction buffer operations or VSKIP excepted). What issues cannot be of the same type, and each must be from a different thread.

    The next cycle, the CU's sequencer moves on to the next SIMD and another set of 10 waves.
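
That round-robin can be sketched trivially (the modulo-4 selection is my simplification of the sequencer, not a claim about the real hardware):

```python
# Sketch of the CU sequencer's round-robin: one SIMD's set of 10 wave
# buffers is considered per cycle, so a wave resident on a given SIMD
# sees an issue opportunity only every 4th cycle.

NUM_SIMDS = 4

def simd_selected(cycle):
    return cycle % NUM_SIMDS

# A wave on SIMD 2 can only issue on cycles 2, 6, 10, ...
opportunities = [c for c in range(12) if simd_selected(c) == 2]
print(opportunities)  # [2, 6, 10]
```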
     
    pharma, Lightman and silent_guy like this.
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    SALU can be issued on all four cycles within the VALU cadence.
     
    silent_guy likes this.
  11. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    @3dilettante:
Exactly. In a GCN CU, more than one wavefront (up to five, as you said) can advance their IP (or PC, as they call it) in the same clock cycle. They don't have multiple issue for a single wave, but they can issue instructions from multiple waves (assigned to the same SIMD) in the same clock, provided they are of a different type (vector, scalar, LDS, vector memory, export/GDS + the special instructions consumed directly in the instruction buffer). A CU can sustain 1 scalar + 1 vector + 0.5 LDS (+ an occasional vector memory or export) instructions per clock (edit: and yes, the stars have to align at least a bit so the LDS access doesn't cause conflicts and therefore stalls, but it should be possible). So having a higher occupancy can also help, because you can issue more instructions in parallel (more waves to choose from).
    Without taking that capability into account (and as said, other GPUs have to use more vector instructions as they lack scalar ones), the isolated discussion of IPC for a single thread may be a bit misleading.

    @Jawed:
Yes, but these scalar instructions have to come from a different set of waves each cycle. Each cycle, instructions from only one of the four blocks of instruction buffers (each assigned to one of the SIMDs) can be issued.
     
    #151 Gipsel, Jul 7, 2017
    Last edited: Jul 7, 2017
    silent_guy likes this.
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    It wasn't my choice to use the term in AMD's marketing, and the thrust of my argument is that if it is being used in a manner contrary to convention, it is misleading and likely purposefully so.
     
  13. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Ok, I finally got it, and it makes much more sense.

My mistake was assuming that there was one scalar unit per SIMD, but also one instruction fetch/decode/issue stage statically allocated per SIMD/scalar combo.

With this out of the way, here's something that has been bugging me: GCN's use of a 4-cycle vector pipe makes operand collection and result storage very easy, but it also imposes a pipeline depth that is a multiple of 4.

    I also thought that this was a constraint with significant peak clock speed impact. Maxwell and Pascal have a pipeline depth of 6 for common operations.

    However, the Volta white paper says the following: "Dependent instruction issue latency is also reduced for core FMA math operations, requiring only four clock cycles on Volta, compared to six cycles on Pascal." This undercuts that whole clock speed theory. :)
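
Using the quoted numbers, a small sketch of what that dependent-issue latency means for a single warp (the chain model is my simplification; real scheduling has more going on):

```python
# Cycles to issue a chain of back-to-back dependent FMAs from one warp,
# assuming each FMA can issue only `dep_latency` cycles after the FMA
# it depends on (a simplification of the quoted white-paper figures).

def cycles_for_chain(n_fmas, dep_latency):
    return 1 + (n_fmas - 1) * dep_latency

pascal = cycles_for_chain(16, 6)  # 16 dependent FMAs on Pascal: 91 cycles
volta = cycles_for_chain(16, 4)   # the same chain on Volta: 61 cycles
print(pascal, volta)  # 91 61

# Conversely, roughly `dep_latency` independent warps suffice to keep
# the scheduler issuing an FMA every cycle: ~6 on Pascal, ~4 on Volta.
```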
     
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
It probably means at least certain portions of the pipeline are some multiple of 4, and it would be difficult to see whether some functions could be implemented at a different cycle count.

    One item is that this is using Nvidia's operand delivery latency, which may be affected by presence or latency of its result forwarding. GCN is likely getting away with a register writeback stage even more than 4 cycles away--since certain vector operations like DPP that do not have vector forwarding have several software-visible wait states.
    Using the logic applied to Nvidia, GCN's pipeline would be zero cycles, as would the vector pipelines in a CPU like Zen. (*edit: correction posted later)

    Also, the specific clock impact would depend on physical optimizations and what it tried packing into those stages.
    Volta's SIMDs are narrower, which makes some of the routing/forwarding potentially easier while possibly borrowing some of the extra time built into the 32-wide warp over 16-wide SIMD loop.

    edit: To clarify, the SIMD is narrower especially relative to Maxwell. Pascal to Volta subdivides the schedulers and instruction hardware further.

    Volta also split out INT functionality into its own SIMD, which would reduce the complexity of the FP area. Pascal had to fit more into that area and the stages it contained. GCN's SIMDs have a lot more going on inside of a black box, so the tightness of the loop is unclear.

    Nvidia has also been making other architectural improvements (new ISA, different units, specialized blocks), so Pascal's corners may have been shaved off enough to make it possible for Volta. It's also been improving at a regular cadence and been willing to do physical optimizations and tweaks.

    My perception of the effort put into GCN along these axes has been that it's behind, iterating slower, less focused, and not as aggressive.
     
    #154 3dilettante, Jul 7, 2017
    Last edited: Jul 7, 2017
    pharma, xpea and silent_guy like this.
  15. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    I don't follow - are you saying Nvidia is not including execution latency in "dependent FMA issue latency = 4", just operand delivery?
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The use of the term dependent indicates the latency is related to operand delivery.
    With fully pipelined operations, execution latency is hidden from instruction issue by the work from multiple instructions being split across stages.
    The stall counts mentioned for Nvidia dealt with resolving data dependences. Independent instructions could issue every cycle, at least for Maxwell and Pascal.


Correction: I see your point after reviewing, and to further correct: execution latency does show for a CPU as part of this delay, due to superscalar issue. Nvidia's pipeline may show execution latency as part of the delay, although the single issue for each type can pipeline more of the hazard than a CPU can.

    It would take some kind of hazard like the inability to have operands ready to show the actual pipeline depth in that section, or something like a branch misprediction that would show the length of the pipeline for a CPU.
     
    #156 3dilettante, Jul 7, 2017
    Last edited: Jul 7, 2017
  17. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,184
    Likes Received:
    1,841
    Location:
    Finland
Answering something a bit old perhaps, but VideoCardz never explained why they think some scores are OC'd and others are not - 3DMark doesn't show any difference in clocks, it's the same 1630/945 for every result
     
  18. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,930
    Likes Received:
    1,626
  19. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland

The only thing that seems to tell them those scores are overclocked is that they all use the same driver (well, at least they show the same driver version, which doesn't mean the build inside is the same) and the same stated default frequencies, but the graphics scores are higher.
     
  20. Malo

    Malo Yak Mechanicum
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    7,031
    Likes Received:
    3,102
    Location:
    Pennsylvania
    If the clockspeed fluctuates during the test and downclocks a lot, what is the actual speed sent to Futuremark?
     