If expressing how much of peak is used, why not utilization?
> If expressing how much of peak is used, why not utilization?
Don't ask me; I did not use the term "IPC".
> There are specific cases where the instruction buffer can churn through at 1, but those skip the rest of the pipeline.
Interesting. Do you know which instructions have this, or does it apply in general to instructions that don't use the pipeline?
Don't the scalar instructions need it? (For those, the cost of a multi-ported register file would be much smaller than for the SIMD ops.)
> As Sebbi noted, it's 1/4 for vector utilization. There are specific cases where the instruction buffer can churn through at 1, but those skip the rest of the pipeline. I was thinking in terms of what it logically appears as to the software, but IPC is more of a statement about what the implementation is actually doing.
I would say that discussing IPC can't be done in isolation but is tied to the architecture (and the ISA: what/how much is an instruction doing). Having said that, one should do exactly this and also state that vector instructions are only part of the instruction stream. There is usually a significant amount of scalar instructions (and sometimes also some fraction of LDS instructions). From the architectural point of view, these are of course actual instructions doing something meaningful (maybe not for flops, but for control flow, for instance).
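To make the utilization-vs-IPC distinction concrete, here is a toy calculation (my own sketch, not from the thread): a 64-item wavefront pumped through a 16-lane SIMD keeps every lane busy, yet the SIMD retires only one vector instruction per 4 clocks.

```python
# Toy numbers for one GCN SIMD (a sketch of the relationship, not vendor data):
# a 64-item wavefront is pumped through a 16-lane SIMD over 4 clocks.
WAVE_SIZE = 64
SIMD_WIDTH = 16
CYCLES_PER_VECTOR_OP = WAVE_SIZE // SIMD_WIDTH  # the 4-cycle cadence

# Every lane does useful work on each of the 4 clocks...
lane_utilization = WAVE_SIZE / (SIMD_WIDTH * CYCLES_PER_VECTOR_OP)  # -> 1.0
# ...yet only one vector instruction completes per 4 clocks.
vector_ipc = 1 / CYCLES_PER_VECTOR_OP  # -> 0.25

print(f"utilization={lane_utilization}, vector IPC={vector_ipc}")
```

So "1/4" and "100%" describe the same steady state, depending on whether one counts instructions or lane-cycles.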
> In some (maybe contrived) circumstances, GCN can issue 1 vector, 1 scalar, 1 LDS, and 1 memory instruction (plus some of the special instructions which are handled internally by the instruction buffers) in parallel. And the CU as a whole is able to sustain an issue rate of >2.5 instructions per clock (without counting the internal ones) with the "right" instruction mix.
Thanks. I never really thought this through beyond the simple 4-cycle cadence...
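Gipsel's issue-rate point can be sketched with a toy co-issue counter (my own illustration; the per-type caps and the instruction mix below are assumptions for the example, not measured behavior):

```python
# Toy model: per clock, the scheduler can co-issue at most one instruction of
# each type, from different waves (the 1 vector + 1 scalar + 1 LDS + 1 memory
# limits described above).
from collections import Counter

CO_ISSUE_CAPS = {"vector": 1, "scalar": 1, "lds": 1, "memory": 1}

def issued_per_cycle(ready):
    """ready: instruction types ready across the waves considered this cycle.
    Returns how many instructions co-issue under the per-type caps."""
    counts = Counter(ready)
    return sum(min(counts[t], cap) for t, cap in CO_ISSUE_CAPS.items())

# A favourable (invented) mix: vector+scalar+LDS every cycle, memory every
# other cycle.
cycles = [["vector", "scalar", "lds", "memory"],
          ["vector", "scalar", "lds"]]
ipc = sum(issued_per_cycle(c) for c in cycles) / len(cycles)
print(f"aggregate issue rate: {ipc} instructions/clock")  # above the >2.5 figure
```

The point of the sketch is only that the aggregate rate is a property of the mix across waves, not of any single thread.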
> I would say that discussing IPC can't be done in isolation but is tied to the architecture (and the ISA, what/how much is an instruction doing).
That goes to the semantic density of the instruction stream (what operations come out of the instruction sequence), and why the metric is better used when comparing within an architectural family or line. Adjusting for an ISA gap has been attempted, although it usually comes with some controversy.
> And nobody talking IPC in the case of CPUs is talking only about the instruction throughput of AVX instructions. That's only one part (albeit an important one) of the equation.
If AMD used packed math to make its IPC claims for Vega, that is effectively what they are doing.
> In some (maybe contrived) circumstances, GCN can issue 1 vector, 1 scalar, 1 LDS, and 1 memory instruction (plus some of the special instructions which are handled internally by the instruction buffers) in parallel. And the CU as a whole is able to sustain an issue rate of >2.5 instructions per clock (without counting the internal ones) with the "right" instruction mix.
That would not come from the same thread. AMD could throw more parallel threads at the problem, and the claim of higher IPC should be given as much credence as Sun's Niagara core having higher IPC than an Opteron.
> But the way I look at it now is that GCN can issue one instruction per cycle for non-vector instructions, and that those non-vector instructions can 'hide' in the 3 remaining cycles during which the SIMD is still doing its thing.
The other instructions belong to a different thread, and they aren't really hiding, since this is a pipeline built around the 4-cycle cadence. There are three other sets of 10 threads that will be trying to issue something before getting back to the first one.
> So you can freely mix scalar and LDS operations with vector operations without inserting a bubble in the vector path.
Within the same (hardware) thread? Not exactly, although some of the interactions are related to the internal gaps of the semi-independent pipelines/sequencers in the CU. LDS in particular is going to pose a stall risk unless it is totally aligned and the current thread is alone in using that storage.
> The other instructions belong to a different thread, and they aren't really hiding since this is a pipeline built around the 4-cycle cadence. There are three other sets of 10 threads that will be trying to issue something before getting back to the first one.
That's the model that I used to have in mind: everything being on the 4-cycle cadence, and no dual issue. But that goes against what Gipsel (not Giselle, dear iPhone) is saying.
But in that very simple model, a scalar/LDS operation will insert a bubble in the vector pipeline no matter what. (Just as it does for integer-operation bookkeeping pre-Volta.)
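The difference between the two mental models can be made concrete with a toy count of issue slots (my own sketch; the instruction stream is invented):

```python
# One SIMD's (invented) issue stream: vector ops interleaved with scalar/LDS.
stream = ["vector", "scalar", "vector", "lds", "vector"]

def issue_cycles(stream, co_issue):
    """Issue-slot cycles consumed to get all three vector ops into the pipe.
    Single-issue model: every scalar/LDS op costs a slot (a bubble in the
    vector path). Co-issue model: scalar/LDS ride along with vector issues."""
    if co_issue:
        return sum(1 for op in stream if op == "vector")
    return len(stream)

print(issue_cycles(stream, co_issue=False))  # 5 slots: two bubbles
print(issue_cycles(stream, co_issue=True))   # 3 slots: no bubbles
```

In the strict single-issue reading, the two non-vector ops each cost the vector pipe a slot; in the co-issue reading they are free, which is exactly the disagreement in the posts above.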
It wasn't my choice to use the term in AMD's marketing, and the thrust of my argument is that if it is being used in a manner contrary to convention, it is misleading and likely purposefully so.

@3dilettante:
Without taking that capability into account (and, as said, other GPUs have to use more vector instructions as they lack scalar ones), the isolated discussion of IPC for a single thread may be a bit misleading.
> With this out of the way, here's something that has been bugging me: GCN's usage of a 4-cycle vector pipe makes operand collection and result storage very easy, but it also imposes a pipeline that is a multiple of 4 deep. I also thought that this was a constraint with significant peak clock-speed impact. Maxwell and Pascal have a pipeline depth of 6 for common operations. However, the Volta white paper says the following: "Dependent instruction issue latency is also reduced for core FMA math operations, requiring only four clock cycles on Volta, compared to six cycles on Pascal." This undercuts that whole clock-speed theory.
It probably means at least certain portions of the pipeline are some multiple of 4, and that it would be difficult to see when some functions could be implemented at a different cycle count.
One item is that this is using Nvidia's operand-delivery latency, which may be affected by the presence or latency of its result forwarding. GCN is likely getting away with a register-writeback stage even more than 4 cycles away, since certain vector operations like DPP that do not have vector forwarding have several software-visible wait states.
Using the logic applied to Nvidia, GCN's pipeline would be zero cycles, as would the vector pipelines in a CPU like Zen.
I don't follow - are you saying Nvidia is not including execution latency in "dependent FMA issue latency = 4", just operand delivery?
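As a back-of-envelope illustration of why the dependent-issue latency figure matters (my own sketch; "chains" here stands for independent warps, or ILP within one warp):

```python
# Sketch: with dependent-issue latency L clocks and one FMA issued per clock,
# a single dependency chain can only issue once every L clocks, reaching 1/L
# of peak; L independent chains are enough to fill every issue slot.
def single_chain_fraction_of_peak(latency_clocks):
    return 1.0 / latency_clocks

def chains_to_saturate(latency_clocks):
    return latency_clocks

for name, latency in (("Pascal", 6), ("Volta", 4)):
    print(f"{name}: one chain reaches {single_chain_fraction_of_peak(latency):.3f} "
          f"of peak; {chains_to_saturate(latency)} chains saturate the unit")
```

Under this reading, Volta's drop from 6 to 4 clocks means fewer independent warps (or less ILP) are needed to keep the FMA units busy, independent of any clock-speed argument.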
> https://videocardz.com/70777/amd-radeon-rx-vega-3dmark11-performance
Answering a bit old perhaps, but VideoCardz never explained why they think some scores are overclocked and others are not: 3DMark doesn't show any difference in clocks; it's the same 1630/945 for every result.

That's the result of one of the multiple overclocked runs over 1630 MHz of the same card. The non-overclocked one is still slower than a stock 1080. So, still similar results to Vega FE.