AMD Vega Hardware Reviews

There are specific cases where the instruction buffer can churn through at 1, but those skip the rest of the pipeline.
Interesting. Do you know which instructions have this or does it apply in general to instructions that don't use the pipeline?
Don't the scalar instructions need it? (For those, the cost of a multi-ported register file would be much smaller than for the SIMD ops.)
 
Interesting. Do you know which instructions have this or does it apply in general to instructions that don't use the pipeline?
Don't the scalar instructions need it? (For those, the cost of a multi-ported register file would be much smaller than for the SIMD ops.)

Setting VSKIP mode can make the buffer skip vector instructions. This occurs at a rate of 10 waves each skipping one instruction per cycle, which the GCN manual indicates is faster than issuing and discarding them.

*edit: Admittedly, I think this might be ambiguous as to how it is handled.
**much later edit: On second thought, the maximum rate of 10 waves skipping an instruction per cycle may point to it being limited to the issue cycle of a given SIMD. If so, that would keep even the skip rate at effectively 1/4.
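
For what it's worth, the gap between the two readings is easy to put in numbers. A minimal sketch in Python; the 4-SIMD round robin is standard GCN, but the cycle window and the one-skip-per-opportunity assumption are illustrative only:

SIMDS = 4     # SIMDs per CU, standard GCN
CYCLES = 100  # arbitrary window, for illustration

# Reading 1: a wave may skip one instruction on any cycle.
skips_any_cycle = CYCLES  # 1 skip/cycle per wave

# Reading 2: a wave only skips on its SIMD's issue cycle, which comes
# around once every SIMDS cycles -> effectively a 1/4 rate per wave.
skips_issue_cycle_only = CYCLES // SIMDS

print(skips_any_cycle, skips_issue_cycle_only)  # 100 vs 25 skips per wave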
 
As Sebbi noted, it's 1/4 for vector utilization. There are specific cases where the instruction buffer can churn through at 1, but those skip the rest of the pipeline. I was thinking in terms of what it logically appears as to the software, but IPC is more of a statement about what the implementation is actually doing.
I would say that discussing IPC can't be done in isolation but is tied to the architecture (and the ISA, i.e. what/how much an instruction is doing). Having said that, one should do exactly this and also state that vector instructions are only part of the instruction stream. There is usually a significant number of scalar instructions (and sometimes also some fraction of LDS instructions). From the architectural point of view, these are of course actual instructions doing something meaningful (maybe not for flops, but for control flow, for instance).
Other GPUs need to use vector instructions for these tasks. One part of optimizing for GCN can be trying to shift more work to the scalar unit, as scalar instructions have the same maximum issue rate as vector instructions and can be issued in parallel. GCN's vector instructions basically do the heavy lifting for the calculations in a (very loosely) similar way to SIMD units in CPUs. And nobody talking about IPC in the case of CPUs is talking only about the instruction throughput of AVX instructions. That's only one part (albeit an important one) of the equation.
In some (maybe contrived) circumstances, GCN can issue 1 vector, 1 scalar, 1 LDS, and 1 memory instruction (plus some of the special instructions which are handled internally by the instruction buffers) in parallel. And the CU as a whole is able to sustain an issue rate of >2.5 instructions per clock (without counting the internal ones) with the "right" instruction mix.
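
To see where >2.5 comes from, one can just sum the sustained per-clock rates (the 1 vector + 1 scalar + 0.5 LDS breakdown appears in a later post in this thread; the memory/export figure below is a placeholder for "occasional", not a documented rate):

# Summing the sustained per-clock issue mix per CU.
sustained_mix = {
    "vector ALU": 1.0,
    "scalar": 1.0,
    "LDS": 0.5,
    "vector memory / export": 0.1,  # placeholder for "occasional"
}
print(sum(sustained_mix.values()))  # 2.6 -> "> 2.5 instructions per clock"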
 
In some (maybe contrived) circumstances, GCN can issue 1 vector, 1 scalar, 1 LDS, and 1 memory instruction (plus some of the special instructions which are handled internally by the instruction buffers) in parallel. And the CU as a whole is able to sustain an issue rate of >2.5 instructions per clock (without counting the internal ones) with the "right" instruction mix.
Thanks. I never really thought this through beyond the simple 4 cycle cadence...

But the way I look at it now is that GCN can issue one instruction per cycle for non-vector instructions, and that those non-vector instructions can 'hide' in the 3 remaining cycles during which the SIMD is still doing its thing.
So you can freely mix scalar and LDS operations with vector operations without inserting a bubble in the vector path.

Meanwhile, Nvidia Maxwell GPUs are able to dual issue in the same cycle to the vector pipeline and the LDS, which they need to avoid bubbles in the vector pipeline because they don't have the 4 cycle cadence of AMD. (https://devtalk.nvidia.com/default/topic/743377/understanding-cuda-scheduling/)

In Volta, the new thing is that they have a 2 cycle vector cadence. So they probably don't have to dual issue anymore and can issue that LDS operation in the second cycle, or even use it to issue an integer vector operation.

Is that understanding correct?
 
I would say that discussing IPC can't be done in isolation but is tied to the architecture (and the ISA, what/how much is an instruction doing).
That goes to the semantic density of the instruction stream, what operations come out of the instruction sequence, and why the metric is better used when comparing within an architectural family or line. Adjusting for an ISA gap has been attempted, although it usually comes with some controversy.

At best, an apples to apples comparison would occur on the same instruction stream to see how two different implementations perform on the same workload. Specialized instructions that require specific coding muck up the metric by modifying the instruction stream.
So far, Vega versus Fiji doesn't seem so disparate as to negate what saying an architecture has higher IPC usually denotes, and if packed math isn't currently being used, it is a closer comparison.

If AMD chose to use packed math instructions as the basis for its claims of higher IPC, they knew what they were trying to lead people into believing.

And nobody talking about IPC in the case of CPUs is talking only about the instruction throughput of AVX instructions. That's only one part (albeit an important one) of the equation.
If AMD used packed math to make its IPC claims for Vega, that is effectively what they are doing.

In some (maybe contrived) circumstances, GCN can issue 1 vector, 1 scalar, 1 LDS, and 1 memory instruction (plus some of the special instructions which are handled internally by the instruction buffers) in parallel. And the CU as a whole is able to sustain an issue rate of >2.5 instructions per clock (without counting the internal ones) with the "right" instruction mix.
That would not come from the same thread. AMD could throw more parallel threads at the problem and the claim of higher IPC should be given as much credence as Sun's Niagara core having higher IPC than Opteron.
I'm not saying it couldn't be a good thing to increase overall throughput across threads, but there are ways of stating it that wouldn't inflate NCU's departure from the prior architecture.

But the way I look at it now is that GCN can issue one instruction per cycle for non-vector instructions, and that those non-vector instructions can 'hide' in the 3 remaining cycles during which the SIMD is still doing its thing.
The other instructions belong to a different thread, and they aren't really hiding, since this is a pipeline built around the 4-cycle cadence. There are three other sets of 10 threads that will be trying to issue something before getting back to the first one.

So you can freely mix scalar and LDS operations with vector operations without inserting a bubble in the vector path.
Within the same (hardware) thread? Not exactly, although some of the interactions are related to the internal gaps of the semi-independent pipelines/sequencers in the CU. LDS in particular is going to pose a stall risk unless it is totally aligned and the current thread is alone in using that storage.
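
A concrete way to see the stall risk: GCN-era LDS is banked (32 banks, 4 bytes wide), and two lanes hitting the same bank at different addresses in the same cycle serialize. A minimal sketch of the conflict condition, assuming that banking scheme:

BANKS = 32  # assumed GCN-style LDS: 32 banks, 4 bytes wide

def lds_conflict_free(byte_addrs):
    # Accesses in one cycle conflict if two lanes hit the same bank at
    # different dword addresses (same-address reads can broadcast).
    seen = {}  # bank -> dword address
    for addr in byte_addrs:
        dword = addr // 4
        bank = dword % BANKS
        if bank in seen and seen[bank] != dword:
            return False
        seen[bank] = dword
    return True

# Stride-1 dwords: every lane in its own bank -> conflict-free.
print(lds_conflict_free([4 * lane for lane in range(32)]))    # True
# Stride-32 dwords: all lanes map to bank 0 -> worst-case serialization.
print(lds_conflict_free([128 * lane for lane in range(32)]))  # False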
 
The other instructions belong to a different thread, and they aren't really hiding, since this is a pipeline built around the 4-cycle cadence. There are three other sets of 10 threads that will be trying to issue something before getting back to the first one.
That's the model that I used to have in mind: everything being on the 4-cycle cadence, and no dual-issue. But that goes against what Gipsel (not Giselle, dear iPhone) is saying.

But in that very simple model, a scalar/LDS operation will insert a bubble in the vector pipeline no matter what. (Just as it does for integer operation bookkeeping pre-Volta.)
 
That's the model that I used to have in mind: everything being on the 4-cycle cadence, and no dual-issue. But that goes against what Gipsel (not Giselle, dear iPhone) is saying.

But in that very simple model, a scalar/LDS operation will insert a bubble in the vector pipeline no matter what. (Just as it does for integer operation bookkeeping pre-Volta.)

The CU can issue at most one instruction per thread, and will pick up to 5 from a set of 10 per-wave instruction buffers (instruction-buffer-internal operations or VSKIP excepted). What issues together cannot be of the same type and must come from different threads.

The next cycle, the CU's sequencer moves on to the next SIMD and another set of 10 waves.
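
A toy model of that selection step (Python; the buffer contents and the five type categories are illustrative, and the real arbitration rules are surely more involved):

from collections import deque

TYPES = {"vector", "scalar", "lds", "vmem", "export"}

def issue_cycle(wave_buffers):
    # Pick at most 5 instructions of distinct types, one per wave,
    # from one SIMD's 10 wave buffers.
    picked_types = set()
    picks = []
    for wave, buf in enumerate(wave_buffers):
        if buf and buf[0] in TYPES and buf[0] not in picked_types:
            picked_types.add(buf[0])
            picks.append((wave, buf.popleft()))
            if len(picks) == 5:
                break
    return picks

buffers = [deque(["vector"]), deque(["vector"]), deque(["scalar"]),
           deque(["lds"]), deque(["vmem"]), deque(["export"]),
           deque(), deque(), deque(), deque()]
print(issue_cycle(buffers))
# [(0, 'vector'), (2, 'scalar'), (3, 'lds'), (4, 'vmem'), (5, 'export')]
# The second "vector" wave (wave 1) has to wait for this SIMD's next
# turn, four cycles later.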
 
@3dilettante:
Exactly. In a GCN CU, more than one wavefront (up to five, as you said) can advance their IP (or PC, as they call it) in the same clock cycle. They don't have multiple issue for a single wave, but they can issue instructions from multiple waves (assigned to the same SIMD) in the same clock, provided they are of a different type (vector, scalar, LDS, vector memory, export/GDS + the special instructions consumed directly in the instruction buffer). A CU can sustain 1 scalar + 1 vector + 0.5 LDS (+ an occasional vector memory or export) instructions per clock (edit: and yes, the stars have to align at least a bit so the LDS access doesn't cause conflicts and therefore stalls, but it should be possible). So having a higher occupancy can also help, because you can issue more instructions in parallel (more waves to choose from).
Without taking that capability into account (and as said, other GPUs have to use more vector instructions as they lack scalar ones), the isolated discussion of IPC for a single thread may be a bit misleading.

@Jawed:
Yes, but these scalar instructions have to come from a different set of waves each cycle. Each cycle, instructions from only one of the four blocks of instruction buffers (each assigned to one of the SIMDs) can be issued.
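
The occupancy point lends itself to a quick Monte Carlo sketch (Python; a random instruction mix with no dependencies modeled, which is of course a gross simplification):

import random

TYPES = ["vector", "scalar", "lds", "vmem", "export"]

def issued_per_cycle(n_waves, trials=10_000):
    total = 0
    for _ in range(trials):
        # Each resident wave offers one pending instruction of a random type;
        # the picker takes at most one instruction per distinct type.
        heads = [random.choice(TYPES) for _ in range(n_waves)]
        total += len(set(heads))
    return total / trials

for n in (1, 2, 4, 10):
    print(n, round(issued_per_cycle(n), 2))
# The per-cycle issue rate climbs toward 5 as more resident waves give
# the scheduler more distinct types to choose from.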
 
@3dilettante:
Without taking that capability into account (and as said, other GPUs have to use more vector instructions as they lack scalar ones), the isolated discussion of IPC for a single thread may be a bit misleading.
It wasn't my choice to use the term in AMD's marketing, and the thrust of my argument is that if it is being used in a manner contrary to convention, it is misleading and likely purposefully so.
 
Ok, I finally got it, and it makes much more sense.

My mistake was assuming that there was one scalar unit per SIMD, but also one instruction fetch/decode/issue block statically allocated per SIMD/scalar combo.

With this out of the way, here's something that has been bugging me: GCN's use of a 4-cycle vector pipe makes operand collection and result storage very easy, but it also imposes a pipeline that is a multiple of 4 deep.

I also thought that this was a constraint with significant peak clock speed impact. Maxwell and Pascal have a pipeline depth of 6 for common operations.

However, the Volta white paper says the following: "Dependent instruction issue latency is also reduced for core FMA math operations, requiring only four clock cycles on Volta, compared to six cycles on Pascal." This undercuts that whole clock speed theory. :)
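
Taking the white paper numbers at face value, the cost difference for a dependent FMA chain is simple arithmetic (the GCN row assumes its 4-cycle cadence makes back-to-back dependent vector ops issue every 4 cycles, per the discussion above):

def chain_cycles(n_fmas, dependent_issue_latency):
    # A dependent chain serializes at the issue latency: each FMA can
    # only issue once its predecessor's result is ready.
    return n_fmas * dependent_issue_latency

for name, latency in (("Pascal", 6), ("Volta", 4), ("GCN, 4-cycle cadence", 4)):
    print(f"{name}: {chain_cycles(100, latency)} cycles for 100 dependent FMAs")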
 
With this out of the way, here's something that has been bugging me: GCN's use of a 4-cycle vector pipe makes operand collection and result storage very easy, but it also imposes a pipeline that is a multiple of 4 deep.
It probably means at least certain portions of the pipeline are some multiple of 4 deep, and that it would be difficult to tell from the outside whether some functions could have been implemented at a different cycle count.

I also thought that this was a constraint with significant peak clock speed impact. Maxwell and Pascal have a pipeline depth of 6 for common operations.

However, the Volta white paper says the following: "Dependent instruction issue latency is also reduced for core FMA math operations, requiring only four clock cycles on Volta, compared to six cycles on Pascal." This undercuts that whole clock speed theory. :)

One item is that this is using Nvidia's operand delivery latency, which may be affected by the presence or latency of its result forwarding. GCN is likely getting away with a register writeback stage even more than 4 cycles away--since certain vector operations like DPP that do not have vector forwarding have several software-visible wait states.
Using the logic applied to Nvidia, GCN's pipeline would be zero cycles, as would the vector pipelines in a CPU like Zen. (*edit: correction posted later)

Also, the specific clock impact would depend on physical optimizations and what it tried packing into those stages.
Volta's SIMDs are narrower, which makes some of the routing/forwarding potentially easier, while possibly borrowing some of the extra time built into looping a 32-wide warp over a 16-wide SIMD.

edit: To clarify, the SIMD is narrower especially relative to Maxwell. Pascal to Volta subdivides the schedulers and instruction hardware further.

Volta also split out INT functionality into its own SIMD, which would reduce the complexity of the FP area. Pascal had to fit more into that area and the stages it contained. GCN's SIMDs have a lot more going on inside of a black box, so the tightness of the loop is unclear.

Nvidia has also been making other architectural improvements (new ISA, different units, specialized blocks), so Pascal's corners may have been shaved off enough to make it possible for Volta. It's also been improving at a regular cadence and been willing to do physical optimizations and tweaks.

My perception of the effort put into GCN along these axes has been that it's behind, iterating slower, less focused, and not as aggressive.
 
One item is that this is using Nvidia's operand delivery latency, which may be affected by the presence or latency of its result forwarding. GCN is likely getting away with a register writeback stage even more than 4 cycles away--since certain vector operations like DPP that do not have vector forwarding have several software-visible wait states.
Using the logic applied to Nvidia, GCN's pipeline would be zero cycles, as would the vector pipelines in a CPU like Zen.

I don't follow - are you saying Nvidia is not including execution latency in "dependent FMA issue latency = 4", just operand delivery?
 
I don't follow - are you saying Nvidia is not including execution latency in "dependent FMA issue latency = 4", just operand delivery?

The use of the term dependent indicates the latency is related to operand delivery.
With fully pipelined operations, execution latency is hidden from instruction issue by the work from multiple instructions being split across stages.
The stall counts mentioned for Nvidia dealt with resolving data dependences. Independent instructions could issue every cycle, at least for Maxwell and Pascal.


Correction: I see your point after reviewing, and to further correct myself: execution latency does show up for a CPU as part of this delay, due to superscalar issue. Nvidia's pipeline may show execution latency as part of the delay, although single issue for each type can pipeline more of the hazard than a CPU can.

It would take some kind of hazard, such as the inability to have operands ready, to show the actual pipeline depth in that section, or something like a branch misprediction to show the length of the pipeline for a CPU.
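
That distinction is easy to sketch with a trivial model (Python; the depths are made-up values, and the model ignores forwarding entirely):

def cycles(n_instr, depth, dependent):
    if dependent:
        return n_instr * depth        # each op waits out the full latency
    return depth + (n_instr - 1)      # fill the pipe once, then 1 per cycle

for depth in (4, 6, 10):
    print(depth, cycles(100, depth, dependent=False),
          cycles(100, depth, dependent=True))
# Independent streams barely notice the depth; dependent chains scale
# with it, which is why only hazards expose the pipeline length.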
 
Answering a bit late perhaps, but VideoCardz never explained why they think some scores are OCed and others are not - 3DMark doesn't show any difference in clocks; it's the same 1630/945 for every result.


The only thing that seems to tell them those scores are overclocked is that they all use the same driver (well, at least they show the same driver version, which doesn't mean the builds inside are the same) and the same listed default frequencies, but the graphics scores are higher.
 
If the clockspeed fluctuates during the test and downclocks a lot, what is the actual speed sent to Futuremark?
 