22 nm Larrabee

Also, the notion that programmable HW gives you better data locality is baseless; it's not like fixed-function units cannot share on-chip storage.

It's not like they can't just add new/additional terminals onto Heathrow. It's just construction, after all. I'm sure it won't impact anything at all.
 
It's not like they can't just add new/additional terminals onto Heathrow. It's just construction, after all. I'm sure it won't impact anything at all.
This logic applies to programmable or fixed function units so I don't see how it refutes nAo's comment.

I do think that current heterogeneous designs require data to travel further than a CPU core with AVX, though it's all implementation dependent.
 
It's not like they can't just add new/additional terminals onto Heathrow. It's just construction, after all. I'm sure it won't impact anything at all.
Oh, you should see how incredibly diverse Heathrow's terminals are, but don't be afraid, you can visit them all on the same train.
 
I do think that current heterogeneous designs require data to travel further than a CPU core with AVX, though it's all implementation dependent.
If you process the data in the AVX units, it gets loaded into the XMM/YMM registers through the L1/L2/L3/memory hierarchy.
If you process it in a throughput-optimized unit (with its own scheduler), the data gets loaded into the vector registers of this unit through the L1/L2/L3/memory hierarchy, where the L3, and possibly the L2, will be the same as for the AVX units. How is this supposed to cause much more or further data traffic for your typical throughput-oriented workload?
 
If you process the data in the AVX units, it gets loaded into the XMM/YMM registers through the L1/L2/L3/memory hierarchy.
If you process it in a throughput-optimized unit (with its own scheduler), the data gets loaded into the vector registers of this unit through the L1/L2/L3/memory hierarchy, where the L3, and possibly the L2, will be the same as for the AVX units. How is this supposed to cause much more or further data traffic for your typical throughput-oriented workload?
I said current designs and I don't believe you're describing one. Current heterogeneous designs like AMD Fusion still have distinct CPU and GPU parts and there is no L1 or register level communication. Data transfer between units is over a bus or a higher level of the cache hierarchy.
 
There's no switching activity in the branch predictors unless a branch is encountered, of which throughput-oriented code has few.

This is not correct.

Instruction fetch and decode can consist of multiple pipeline stages, and if we want our (correctly predicted) branches to be as fast as possible, we cannot wait until we know whether the instruction is a branch or not; a branch target buffer has to be used, and that branch target buffer may have to be accessed on every fetch, before we know whether the instruction was a branch at all.
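To make that concrete, here's a minimal sketch of a direct-mapped BTB probed with every fetch address; the table size, tag split and 16-byte fetch block are made-up numbers, purely for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 4096

typedef struct {
    bool     valid;
    uint64_t tag;     /* upper bits of the branch's fetch address */
    uint64_t target;  /* predicted branch target */
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];

/* Called for every fetch address: this lookup (and its switching activity)
 * happens whether or not the fetched instruction turns out to be a branch,
 * which is the point being made above. */
uint64_t predict_next_fetch(uint64_t fetch_pc)
{
    uint64_t index = (fetch_pc >> 2) & (BTB_ENTRIES - 1);
    uint64_t tag   = fetch_pc >> 14;

    if (btb[index].valid && btb[index].tag == tag)
        return btb[index].target;   /* predicted taken branch */
    return fetch_pc + 16;           /* assumed fetch-block size: fall through */
}
```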

Also, throughput-oriented code contains quite a lot of small loops, each with one easily predictable branch instruction. Only by unrolling those loops do we get rid of the branches, but a full unroll is often not an option, and a lack of registers often prevents aggressive partial unrolling.
 
...we cannot wait until we know whether the instruction is a branch or not; a branch target buffer has to be used, and that branch target buffer may have to be accessed on every fetch, before we know whether the instruction was a branch at all.
With a uop cache, you know instantly whether an instruction is a branch or not.
Also, throughput-oriented code contains quite a lot of small loops, each with one easily predictable branch instruction. Only by unrolling those loops do we get rid of the branches, but a full unroll is often not an option, and a lack of registers often prevents aggressive partial unrolling.
AVX-1024 instructions executed on 256-bit units are equivalent to fourfold unrolling. They implicitly enable access to four times as many registers.
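To make the analogy concrete, here's a rough sketch in C: a plain 256-bit AVX loop, with a comment describing what a hypothetical 1024-bit encoding of the same body would buy (AVX-1024 intrinsics don't exist; this is only illustrating the argument):

```c
#include <immintrin.h>

/* One AVX 256-bit iteration handles 8 floats per instruction. */
void saxpy_avx(float *y, const float *x, float a, int n)
{
    __m256 va = _mm256_set1_ps(a);
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(y + i, _mm256_add_ps(_mm256_mul_ps(va, vx), vy));
    }
    /* Remainder iterations omitted for brevity.
     *
     * A hypothetical AVX-1024 version (no such intrinsics exist; purely
     * illustrative) would step i by 32 and issue one load, one mul-add and
     * one store per 1024-bit register.  Executed in 256-bit chunks over
     * four cycles, each instruction covers what fourfold unrolling covers
     * here, while each architectural register names four 256-bit chunks of
     * physical state; that is, four times the effective registers. */
}
```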
 
Since I don't seem to understand correctly: did your plan include executing AVX-1024 over four cycles and clock-/power-gating the issue logic in the remaining three cycles?
 
I think several people told you several times already, that there is no significant additional data movement involved.
Which is blatantly false. The vast majority of workloads consists of a mix of sequential code and parallel code. Each of these typically achieves high L1/L2 cache hit rates on heterogeneous cores optimized for its kind of workload; however, when you transition between them you have to go through a bandwidth-limited, high-latency and high-power L3 cache interconnect. With cores capable of handling both workloads instead, you continue to benefit from L1/L2 cache hits. Hence homogeneous architectures reduce data movement.

Note that heterogeneous architectures force you to strictly partition your code into large sequential and parallel parts. When you have small parallel tasks it's not possible to gain anything by offloading them to the GPU cores, due to the bandwidth and latency overhead, so you must compute them on the CPU cores instead. Likewise, the GPU cores often have to perform tasks which would have run more efficiently on the CPU cores, but the round-trip makes that prohibitively expensive, so you just waste GPU power on them instead. So even when the data traffic between the cores of a heterogeneous architecture seems limited, that doesn't mean the system runs optimally.
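Here's a back-of-the-envelope model of that round-trip argument; all throughput and overhead numbers are invented, the point is only the shape of the break-even:

```c
#include <stdio.h>

/* Toy break-even model for offloading a parallel task of n elements.
 * All numbers are made up for illustration, not measurements. */
int main(void)
{
    const double cpu_elems_per_us    = 50.0;   /* wide-SIMD CPU throughput      */
    const double gpu_elems_per_us    = 400.0;  /* GPU throughput                */
    const double offload_overhead_us = 100.0;  /* launch latency + data copies  */

    for (double n = 1000; n <= 64000; n *= 2) {
        double t_cpu = n / cpu_elems_per_us;
        double t_gpu = offload_overhead_us + n / gpu_elems_per_us;
        printf("n=%6.0f  cpu=%7.1f us  gpu=%7.1f us  -> %s\n",
               n, t_cpu, t_gpu, t_gpu < t_cpu ? "offload" : "keep on CPU");
    }
    return 0;
}
```

Small tasks never amortize the fixed overhead, so they stay on the CPU even though the throughput unit is nominally much faster.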

The solution is to unify the capabilities of both cores. And that's easy enough starting from a classic CPU with out-of-order execution: add gather support and very wide vector instructions which are executed in an in-order fashion on narrower (but still wide enough) execution units. Intel figured that out years ago.
It shows that it can go further in the optimization for throughput-oriented workloads. And more optimized for the workload means higher performance in the given power budget. And that is a good thing.
Exactly how would an entire APU "go further in the optimization for throughput-oriented workloads" than a homogeneous high-throughput CPU with extensive clock gating? Note that no matter how much you optimize the IGP, you still need the CPU cores.
And that can't be done on a GPU?
Certainly, but only when you make them more optimized for sequential code, converging them toward a homogeneous architecture.
That PCU doesn't replace anything, it adds. By the way, I think you can do very similar stuff as proposed there in the hull/domain shader stages.
It replaces fixed-function (hierarchical) depth tests with fully programmable tests containing 'kill' instructions. It only retains the fixed-function hardware for handling the kill itself. And whether the cull unit is implemented using dedicated programmable hardware or the cull shaders are executed in the unified cores, the fact of the matter is that the programmability increases yet the effective performance/Watt improves. And again, this is just the tip of the iceberg. Lots of fixed-function hardware can be replaced with programmable hardware without sacrificing anything.
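For illustration only (this is not the proposal's actual interface), a sketch of what "programmable test, fixed-function kill" could look like: the cull predicate is arbitrary code, and the only thing left for dedicated hardware is acting on the verdict:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative only; not the cited proposal's actual interface.
 * The "cull shader" is an arbitrary programmable predicate over a tile
 * of geometry; only acting on the kill decision stays fixed-function. */
typedef struct {
    float    min_depth;   /* nearest depth of the geometry covering the tile */
    uint32_t tile_x, tile_y;
} tile_t;

/* Programmable test: hierarchical depth here, but it could just as well
 * be backface, frustum, or an app-specific occlusion heuristic. */
static bool cull_shader(const tile_t *t, const float *hiz_max, int hiz_pitch)
{
    /* Kill if the tile's nearest geometry is farther than the farthest
     * depth already stored for that screen tile (larger = farther). */
    return t->min_depth > hiz_max[t->tile_y * hiz_pitch + t->tile_x];
}

/* The "kill" itself stays trivial and could remain dedicated hardware. */
void cull_tiles(const tile_t *tiles, bool *alive, int count,
                const float *hiz_max, int hiz_pitch)
{
    for (int i = 0; i < count; i++)
        alive[i] = !cull_shader(&tiles[i], hiz_max, hiz_pitch);
}
```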
 
Since I don't seem to understand correctly: did your plan include executing AVX-1024 over four cycles and clock-/power-gating the issue logic in the remaining three cycles?
Yes, during the next three cycles no out-of-order issue is required for the port executing the AVX-1024 instruction since you already have in-order work for it. Only a tiny bit of sequencing logic has to stay live. And this lower instruction rate resonates through the rest of the architecture, creating more clock gating opportunities.
 
That's a pretty positive way to say that the instruction will cause a processor to stall.

I think gating the decoder when there's a hit in the uop cache is a good thing.
I think gating the decoder because the front end needs to stall due to an ROB or scheduling blockage is not as good.

I think that an instruction that can mostly take the place of 4 smaller instructions is a good thing.
There is tension between the space savings and power gating too aggressively: the savings could be used for up to three times as much opportunity to continue reordering, improvements to the hit rate of the uop cache, and less instruction decode bandwidth.

Thinking back on the duties of the scheduler, I think the gating opportunities may be more modest.
The scheduler has functionality to receive new uops from the front end, monitor the operand status of the ones it already has, and dispatch those it deems ready (and somewhere in here is the secret sauce of measurements and heuristics in how it makes the determination).

When the 4-cycle instruction begins issue, the scheduler's buffer still has 3 entries to spare (if the instruction is not cracked) where there would have been 3 additional 256-bit instructions otherwise. I'd rather not gate it off, which would leave the space underutilized.
At the same time, until the 4-cycle instruction's results become available some number of cycles (possibly 5-7 before the first completion for some arithmetic ops) in the future, the operand readiness logic would be active in order to detect returns for instructions still in the pipeline.

When the operand results start to return, the dispatch and pick logic would have already moved on to new things.
Maybe the result monitoring part for the port would not care until the final cycle, although perhaps the exception monitoring part would need to be awake unless the delivery of exception information is deferred to the end, while the register write part is done per 256-bit chunk.
However, this may leave performance on the table.

Much like how the P4 fast ALU forwarded a 16-bit chunk to the first half of a dependent op, these 4-cycle ops could forward the first chunks of the result to a dependent 4-cycle op, that is, if the scheduler were paying attention to the port.
This would save latency, since a single 4-cycle instruction could take ten cycles to hit the wakeup period at the end.
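As a rough illustration of what's at stake, here's a tiny calculation with an assumed per-chunk latency (the numbers are made up) comparing a dependent pair of 4-cycle ops with and without chunk-level forwarding:

```c
#include <stdio.h>

/* Toy latency model: a 1024-bit op issues as four 256-bit chunks on
 * consecutive cycles; L is an assumed per-chunk execution latency.
 * All numbers are assumptions for illustration, not real pipeline data. */
int main(void)
{
    const int chunks = 4;
    const int L = 5;   /* assumed per-chunk latency in cycles */

    /* Producer's last chunk completes here (chunks issue at cycles 0..3). */
    int producer_done  = (chunks - 1) + L;

    /* No chaining: the dependent op waits for the producer's full result,
     * then issues its own four chunks (simplified: no extra issue delay). */
    int unchained_done = producer_done + (chunks - 1) + L;

    /* Chaining: dependent chunk i starts as soon as producer chunk i is done. */
    int chained_done   = (chunks - 1) + 2 * L;

    printf("producer completes at cycle %d\n", producer_done);
    printf("dependent completes: unchained %d, chained %d (saves %d cycles)\n",
           unchained_done, chained_done, unchained_done - chained_done);
    return 0;
}
```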

I think there are tradeoffs to be made, and there could be some nice opportunities for improvement with measurable if not revolutionary benefits.

edit:
In addition, Intel has worked on reducing the latency of clock gating units. I think the earliest chips with it had a small cycle penalty because of the wakeup period. I think this has been improved, but I haven't seen mention of it.
 
That's a pretty positive way to say that the instruction will cause a processor to stall.

I think gating the decoder when there's a hit in the uop cache is a good thing.
I think gating the decoder because the front end needs to stall due to an ROB or scheduling blockage is not as good.

I think that an instruction that can mostly take the place of 4 smaller instructions is a good thing.
There is tension between the space savings and power gating too aggressively: the savings could be used for up to three times as much opportunity to continue reordering, improvements to the hit rate of the uop cache, and less instruction decode bandwidth.

Thinking back on the duties of the scheduler, I think the gating opportunities may be more modest.
The scheduler has functionality to receive new uops from the front end, monitor the operand status of the ones it already has, and dispatch those it deems ready (and somewhere in here is the secret sauce of measurements and heuristics in how it makes the determination).

When the 4-cycle instruction begins issue, the scheduler's buffer still has 3 entries to spare (if the instruction is not cracked) where there would have been 3 additional 256-bit instructions otherwise. I'd rather not gate it off, which would leave the space underutilized.
At the same time, until the 4-cycle instruction's results become available some number of cycles (possibly 5-7 before the first completion for some arithmetic ops) in the future, the operand readiness logic would be active in order to detect returns for instructions still in the pipeline.

When the operand results start to return, the dispatch and pick logic would have already moved on to new things.
Maybe the result monitoring part for the port would not care until the final cycle, although perhaps the exception monitoring part would need to be awake unless the delivery of exception information is deferred to the end, while the register write part is done per 256-bit chunk.
However, this may leave performance on the table.

Much like how the P4 fast ALU forwarded a 16-bit chunk to the first half of a dependent op, these 4-cycle ops could forward the first chunks of the result to a dependent 4-cycle op, that is, if the scheduler were paying attention to the port.
This would save latency, since a single 4-cycle instruction could take ten cycles to hit the wakeup period at the end.

I think there are tradeoffs to be made, and there could be some nice opportunities for improvement with measurable if not revolutionary benefits.

edit:
In addition, Intel has worked on reducing the latency of clock gating units. I think the earliest chips with it had a small cycle penalty because of the wakeup period. I think this has been improved, but I haven't seen mention of it.

More to the point, do you think a perf/W increase of ~2x is plausible for the entire core with this? The entire core being defined to include everything up to the L2.
 
Compared to what?
2x seems like a very large improvement if all we're talking about is 4-cycle ops and scheduler gating.

Multicycle ops could allow parts of a segment of the scheduler to gate off some of the time, and they can reduce pressure on the uop cache and ROB. This can lead to savings, but I'm hesitant to ascribe a blanket 2x improvement when larger changes have led to more modest jumps.
 
Yes, during the next three cycles no out-of-order issue is required for the port executing the AVX-1024 instruction since you already have in-order work for it. Only a tiny bit of sequencing logic has to stay live. And this lower instruction rate resonates through the rest of the architecture, creating more clock gating opportunities.

Ok. So your AVX-1024 unit would need four cycles to finish its work. Except for the ability to process wider workloads and the power gating opportunities, I cannot see how this would improve compute density or throughput compared to an otherwise identical AVX-256 unit which would be able to retire one operation per cycle. For the latter, you'd need issue logic only a quarter as wide, which of course wouldn't be power gated since it'd feed the AVX-256 unit continuously.
 
My interpretation was that this was an AVX 256/(512?)/1024 unit. The unit and the scheduler would have the footprint of an AVX 256 unit, plus some additional logic to handle the larger mode.
 
But would you be able to feed the larger execution width in a single cycle without getting all the needed operands first and checking for dependencies and the like? I don't know, so I thought I'd rather ask.
 
Compared to what?
2x seems like a very large improvement if all we're talking about is 4-cycle ops and scheduler gating.

Multicycle ops could allow parts of a segment of the scheduler to gate off some of the time, and they can reduce pressure on the uop cache and ROB. This can lead to savings, but I'm hesitant to ascribe a blanket 2x improvement when larger changes have led to more modest jumps.

Your analysis was very illuminating. I think it's safe to say that AVX-1024 over a 256-bit ALU wouldn't affect this transition in any material way.
 
But would you be able to feed the larger execution width in a single cycle without getting all the needed operands first and checking for dependencies and the like? I don't know, so I thought I'd rather ask.

In this scheme, execution and completion are distinct. The first quarter of a register would be available 4 clocks before the entire register is available to feed into the next operation.

This is basically Cray vectors all over again. We all know how that worked out.
 
But would you be able to feed the larger execution width in a single cycle without getting all the needed operands first and checking for dependencies and the like? I don't know, so I thought I'd rather ask.
The data would be worked on in 256-bit chunks, so only that portion of the register is needed in a cycle.
An implementation detail in this is the physical width of the registers. It was discussed earlier whether the registers themselves are 256 bits, which would entail logically combining 4 separate physical registers into a single 1024-bit register, or whether the registers are extended again.

Keeping 256-bit registers encourages an increase in the number of registers, since 1024-bit mode would consume rename space more quickly.
Native 1024-bit registers would keep the number of registers consumed unchanged, though that would lead to a large amount of unused capacity in 256-bit mode.
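A quick arithmetic sketch of that tradeoff; the 144-entry physical register file is an assumed figure for the example, not a real design parameter:

```c
#include <stdio.h>

/* Assumed physical register file of 256-bit entries; how many 1024-bit
 * architectural values can be renamed at once under each option?
 * All figures are illustrative assumptions. */
int main(void)
{
    const int phys_256bit_entries = 144;   /* assumed PRF size */

    /* Option A: keep 256-bit physical registers, gang four per 1024-bit value. */
    int renameable_a = phys_256bit_entries / 4;

    /* Option B: widen the physical registers to 1024 bits (same entry count,
     * roughly 4x the storage), so each value costs one entry. */
    int renameable_b = phys_256bit_entries;

    printf("Option A (ganged 256-bit regs): %d in-flight 1024-bit values\n",
           renameable_a);
    printf("Option B (native 1024-bit regs): %d in-flight values, "
           "but ~4x the register file storage, mostly idle in 256-bit mode\n",
           renameable_b);
    return 0;
}
```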
 
That's a pretty positive way to say that the instruction will cause a processor to stall.
It doesn't stall execution, which is all we really care about.

Note that this isn't very different from NVIDIA's full-clock execution and half-clock instruction issue. Except that instead of a 2:1 ratio it's up to 4:1, and the CPU can ramp up the instruction issue rate for sequential code.
Thinking back on the duties of the scheduler, I think the gating opportunities may be more modest.
The scheduler has functionality to receive new uops from the front end, monitor the operand status of the ones it already has, and dispatch those it deems ready (and somewhere in here is the secret sauce of measurements and heuristics in how it makes the determination).
While executing AVX-1024 code, the reservation station will quickly become full. This means the logic to receive new uops from the front-end can be clock gated for a while (and as mentioned before, once the ROB becomes full the entire front-end can be clock gated). Also, monitoring the result register of the ALU executing the AVX-1024 instruction can be clock gated during the three cycles that it doesn't change. And finally, monitoring the result register of every ALU that didn't receive an instruction can also be clock gated (which should already be the case today). So while executing AVX-1024 code, which also eventually slows down the issue rate of any scalar code mixed in, the scheduler's components can be clock gated up to 3/4 of the time (not simultaneously).
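As a crude counting exercise (assumed behavior, not a model of any real scheduler), treat the port's out-of-order issue logic as gateable on the three sequencing cycles of each back-to-back AVX-1024 op:

```c
#include <stdio.h>

/* Count how often the out-of-order issue logic for one port could be gated
 * while a stream of AVX-1024 ops runs back to back.  Assumes one real issue
 * decision per op and three follow-on cycles of pure chunk sequencing. */
int main(void)
{
    const int ops = 1000;          /* AVX-1024 instructions issued to the port */
    const int cycles_per_op = 4;   /* 1024 bits executed on a 256-bit unit */

    int total_cycles = ops * cycles_per_op;
    int issue_active = ops;                    /* one issue decision per op */
    int issue_gated  = total_cycles - issue_active;

    printf("port busy %d cycles: issue logic active %d, gateable %d (%.0f%%)\n",
           total_cycles, issue_active, issue_gated,
           100.0 * issue_gated / total_cycles);
    return 0;
}
```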
Maybe the result monitoring part for the port would not care until the final cycle, although perhaps the exception monitoring part would need to be awake unless the delivery of exception information is deferred to the end, while the register write part is done per 256-bit chunk.
However, this may leave performance on the table.
Actually the result monitoring should only care about the first cycle, not the last, to allow chained execution.

Exception handling would essentially be no different than with cracked vector operations. Note that with the Pentium 4's cracked vector operations you got two identical uops, except for the physical register numbers. These can be fused together, and you only need a tiny bit of logic in a few places to sequence the register numbers.
I think there are tradeoffs to be made, and there could be some nice opportunities for improvement with measurable if not revolutionary benefits.
Indeed there are some tradeoffs to be made. The above proposal allows nearly instant recovery to full-speed scalar execution, but is relatively complex. A potentially simpler alternative would be to drain the execution pipelines and put the entire core on a 1/4 clock regime, except for the ALU, register file and caches, once an AVX-1024 instruction reaches the scheduler. Basically you'd put the core into 'GPU mode'. Note that there wouldn't be any chained execution though; every instruction could only begin execution every 4 cycles. There would also be a penalty for transitioning in and out of this state. But the peak throughput would be the same and the dynamic power consumption of all control logic would be guaranteed to be cut to a quarter. It's a valuable thought experiment, but it seems to me that better performance/Watt can be achieved by going the more complex route of clock gating individual components, mainly because consuming static power while useful work could have been performed instead would be a waste.
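For the power intuition behind that 'GPU mode' alternative: dynamic power scales roughly with frequency at a fixed voltage, so quartering the control logic's clock quarters its dynamic power while leakage stays put. A toy calculation with invented numbers:

```c
#include <stdio.h>

/* Toy power model: P_dynamic ~ C * V^2 * f, plus a fixed leakage term.
 * All numbers are invented to show the shape of the tradeoff. */
int main(void)
{
    const double control_dynamic_w = 4.0;  /* assumed control-logic dynamic power at full clock */
    const double control_static_w  = 1.0;  /* assumed leakage of the same logic */

    double full_clock  = control_dynamic_w + control_static_w;
    double quarter_clk = control_dynamic_w / 4.0 + control_static_w;

    printf("control logic: full clock %.1f W, quarter clock %.1f W\n",
           full_clock, quarter_clk);
    printf("dynamic power cut to a quarter, static power untouched\n");
    return 0;
}
```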
 