22 nm Larrabee

Could anyone help me interpret the following: http://www.indeed.com/r/b1b4922f298d8c33

"Graphics and Media cluster's micro-architecture validation for iLRB (Larabee-3 slice for Haswell Client product and Discrete Larabee-3)"

It makes no sense to me that Haswell would get a Larrabee-based IGP, considering AVX2's gather and FMA support and its future extensibility to AVX-1024, which would lower power consumption.

Since he's referring to slices, as were kind of disclosed at last year's SF-IDF, it's probably not set in stone as they are quite modular. Maybe, if LRB3 fulfills expectations, they might even attach a few slices for an Intel APU?
 
Since he's referring to slices, as were kind of disclosed at last year's SF-IDF, it's probably not set in stone as they are quite modular. Maybe, if LRB3 fulfills expectations, they might even attach a few slices for an Intel APU?

At this point, Haswell is probably all about validation only.
 
AVX2 isn't power efficient enough yet. We need AVX-1024 for that. But yes it could be part of a longer term convergence.
There are several assumptions inherent to this statement. One is that AVX-1024 will be a large improvement in power efficiency such that it is needed, rather than a nice addition. The other is that what constitutes the assumed large improvement for a highly complex OoO engine is relevant to an in-order throughput core.


AVX2 doesn't support scatter, only gather. A one cacheline per clock implementation makes the most sense. It doesn't require any changes to the cache itself and doesn't complicate coherency.
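As a concrete illustration, here is a minimal C sketch of what that asymmetry looks like to software (AVX2 intrinsics; the scalar loop is the assumed workaround for the missing scatter, not an instruction):

#include <immintrin.h>

/* Gather 8 floats from arbitrary indices: a single AVX2 instruction (vgatherdps). */
static inline __m256 gather8(const float *base, __m256i idx)
{
    return _mm256_i32gather_ps(base, idx, 4);   /* scale = sizeof(float) */
}

/* Scatter has no AVX2 instruction; it has to be emulated with stores element by element. */
static inline void scatter8(float *base, __m256i idx, __m256 val)
{
    int   i_arr[8];
    float v_arr[8];
    _mm256_storeu_si256((__m256i *)i_arr, idx);
    _mm256_storeu_ps(v_arr, val);
    for (int k = 0; k < 8; ++k)
        base[i_arr[k]] = v_arr[k];
}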

There are possibilities on either side of that in how aggressively the implementation can extract data, and how many corner cases it can smooth over.
The graphics slice has certain advantages: its cache architecture does not need to support the same assumed loads as the general-purpose cores, it doesn't need to aggressively speculate past unresolved memory ops, and Intel has shown with Ivy Bridge that it can place the graphics slice behind an additional cache, which confines coherence traffic to that single client on the bus.
 
There are several assumptions inherent to this statement. One is that AVX-1024 will be a large improvement in power efficiency such that it is needed, rather than a nice addition.
The instruction rate would be up to four times lower, so it should allow for some very aggressive clock gating of the front-end. Note that Sandy Bridge's uop cache is primarily intended for lowering power consumption. So this gives us an indication that requiring four times fewer instructions allows for significant power savings. Even additional stages like register renaming can take a break while the execution units are chewing on AVX-1024 instructions. The switching activity in the schedulers would be lower as well.

But on top of direct power savings it would also offer higher performance due to improved latency hiding. Which in turn would allow for less aggressive prefetching, another power saving.

So all of it combined should result in a substantial performance/Watt improvement. Given the negligible cost, I can't imagine anything that would offer a better ROI. All the effort put into AVX and AVX2 makes it obvious Intel wants to run high throughput workloads on the CPU, and it would be silly to stop short of achieving competitive performance/Watt.
The other is that what constitutes the assumed large improvement for a highly complex OoO engine is relevant to an in-order throughput core.
I'm not suggesting running AVX-1024 on in-order cores. I'm talking about a fully homogeneous architecture where all cores are out-of-order, which when executing AVX-1024 approach the efficiency of in-order cores. Of course these cores are larger than in-order purely throughput oriented cores due to having to run other workloads as well, but just like unified shader cores I believe it should be feasible to gain more than you lose. Maybe not today, but graphics is getting more complex and the convergence can be spread over multiple years.
 
The instruction rate would be up to four times lower, so it should allow for some very aggressive clock gating of the front-end. Note that Sandy Bridge's uop cache is primarily intended for lowering power consumption.
Instruction fetch should only cease if there is no room left for the uops in the OoO engine.
AVX 1024 would mean that 4 times as many FLOPs can be performed per ROB entry, which means there are 3 additional instructions that can be fetched.
It's not too difficult for an OoO core with multiple threads to find things to fetch even if the AVX unit is busy. To halt instruction fetch because of a set of 1024-bit ops would have a measurable negative impact on the performance of the rest of the core.

The appropriate response for an OoO core is to keep the instruction rate the same, and use the throughput to allow for further latency hiding.

Even additional stages like register renaming can take a break while the execution units are chewing on AVX-1024 instructions. The switching activity in the schedulers would be lower as well.
I am unsure the AVX unit's being busy is an acceptable reason to stall pipeline stages needed for every other unit on the chip, which can and will continue running.

But on top of direct power savings it would also offer higher performance due to improved latency hiding. Which in turn would allow for less aggressive prefetching, another power saving.
As far as memory demands go, AVX 1024 would have the same burden on the data cache as four 256 bit ops, so data prefetching would work close to the same.
The instruction cache hit rates are already high, but the code density could improve so long as the data granularity of the target problem is a close enough fit to the width of the vector.


So all of it combined should result in a substantial performance/Watt improvement. Given the negligible cost, I can't imagine anything that would offer a better ROI.
There is the assertion that various costs are negligible. I think some of them are unacceptable.
A helpful hint might be a processor status flag or OS scheduler tracker that indicates which cores are using AVX 1024, so every integer and non-1024 thread can steer clear.

I'm talking about a fully homogeneous architecture where all cores are out-of-order, which when executing AVX-1024 approach the efficiency of in-order cores.
The first part of this sentence is a proposal, the second half is begging the question.
 
Instruction fetch should only cease if there is no room left for the uops in the OoO engine.
AVX 1024 would mean that 4 times as many FLOPs can be performed per ROB entry, which means there are 3 additional instructions that can be fetched.
4 uops can be decoded per cycle, so those additional 3 you're talking about are already there. When all instructions are AVX-1024, the instruction execution rate drops by a factor of 4, so all the stages above it can also slow down without actually affecting the processing rate.
It's not too difficult for an OoO core with multiple threads to find things to fetch even if the AVX unit is busy. To halt instruction fetch because of a set of 1024-bit ops would have a measurable negative impact on the performance of the rest of the core.
It's not just the AVX unit that is busy. It's the entire port that is occupied for 4 cycles! So even if your second thread contains no AVX instructions, you're quickly out of scheduling opportunities, the queues and buffers in the front-end fill up, and you can start clock gating things without affecting performance.
I am unsure the AVX unit's being busy is an acceptable reason to stall pipeline stages needed for every other unit on the chip, which can and will continue running.
The clock gating would be triggered by the fill level of queues and buffers, not by the activity of AVX units.
As far as memory demands go, AVX 1024 would have the same burden on the data cache as four 256 bit ops, so data prefetching would work close to the same.
The arithmetic instructions also take four cycles, so it's easier to schedule around cache misses. Hence it's acceptable/desirable for the prefetching to be less aggressive. What works for GPUs can also work for CPUs.
There is the assertion that various costs are negligible. I think some of them are unacceptable.
A helpful hint might be a processor status flag or OS scheduler tracker that indicates which cores are using AVX 1024, so every integer and non-1024 thread can steer clear.
Efficient thread switching is already handled by the VZEROUPPER instruction.
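For reference, a minimal sketch of how it is used (intrinsics; note that VZEROUPPER avoids the AVX/SSE transition penalty within a thread, while the OS still saves the full YMM state on a context switch):

#include <immintrin.h>

void avx_then_legacy_sse(float *dst, const float *a, const float *b)
{
    __m256 x = _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
    _mm256_storeu_ps(dst, x);

    /* Zero the upper 128 bits of all YMM registers before executing
       legacy SSE code, so the core doesn't have to preserve them. */
    _mm256_zeroupper();

    /* ... legacy SSE code or a library call may follow here ... */
}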
 
4 uops can be decoded per cycle, so those additional 3 you're talking about are already there.
Are we back to cracking a single instruction into multiple uops?
That would negate a portion of the possible power savings, and would reduce the effectiveness of the ROB and retire logic.
This is a possible contributor to Bulldozer's performance degradation when running AVX-256.

When all instructions are AVX-1024, the instruction execution rate drops by a factor of 4, so all the stages above it can also slow down without actually affecting the processing rate.
There will be non-vector instructions interspersed even in a lot of FP-heavy code, and with two or more threads there is the very real possibility that non-FP threads will be running concurrently.

It's not just the AVX unit that is busy. It's the entire port that is occupied for 4 cycles!
There are five ports in SB, possibly more in future designs. There are many sorts of stack, prefetch, branch, and memory operations that could find a way to run ahead, if they are able to be decoded and issued.
Depending on move elimination strategies, there may even be AVX 1024 instructions that can be done in fewer than 4 cycles.
There are optimizations, such as getting prefetches sent out and the trick SB uses to get its high read+write bandwidth in 256-bit mode with only 2 AGUs, that don't work if the front end doesn't get to them.

So even if your second thread contains no AVX instructions, you're quickly out of scheduling opportunities, the queues and buffers in the front-end fill up, and you can start clock gating things without affecting performance.
This would go back to a fundamental difference in our viewpoints. When I hear "out of scheduling opportunities" for an OoO core, I think "not a good thing". And perhaps things can be clock gated without affecting performance further, but this does not mean performance hasn't been degraded.

The clock gating would be triggered by the fill level of queues and buffers, not by the activity of AVX units.
At least some of the details present seem to be making those queues better for AVX in particular, but globally less effective.

The arithmetic instructions also take four cycles, so it's easier to schedule around cache misses.

Hence it's acceptable/desirable for the prefetching to be less aggressive. What works for GPUs can also work for CPUs.
It's not scheduling around an L2 miss, so expectations outside of the L1 don't appear to be improved.
The same or more data is going to be processed at some point.
What could be different is that the granularity can no longer rely on the implicit prefetch of a cache line fill, since AVX-1024 at a minimum spans two cache lines. There could be an argument for additional stride detection logic and additional use of prefetch instructions to counteract the loss of the assumption that operand size << cache line size. Prefetch instructions would need that front end, though.
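A hedged sketch of the kind of explicit prefetching meant here, assuming a simple strided loop; the prefetch distance of one cache line ahead is a made-up tuning parameter:

#include <stddef.h>
#include <immintrin.h>

void scale_array(float *data, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i) {
        /* Software prefetch one cache line (64 B) ahead; the hint still has to
           be fetched and decoded by the front end, which is the point above. */
        if (i + 16 < n)
            _mm_prefetch((const char *)&data[i + 16], _MM_HINT_T0);
        data[i] *= k;
    }
}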


Efficient thread switching is already handled by the VZEROUPPER instruction.
Is this used by the OS or other threads to control which core they are assigned to? Its primary benefit seems to be reducing context when hopping out of 256-bit mode, which is not what I was addressing.
 
Are we back to cracking a single instruction into multiple uops?
That would negate a portion of the possible power savings, and would reduce the effectiveness of the ROB and retire logic.
This is a possible contributor to Bulldozer's performance degradation when running AVX-256.
Nope, you can still get all the goodies of a single uop. With 3 arithmetic units the highest instruction rate of code containing only AVX-1024 instructions would be 0.75 per cycle. But the front-end can deliver 4 instructions per cycle. Hence the front-end only has to be active for less than 1/5 of the time!
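Spelling that out as a back-of-envelope sketch, using the 3 ports, 4-cycle occupancy and 4-wide front-end figures assumed above:

#include <stdio.h>

int main(void)
{
    const double ports          = 3.0;  /* arithmetic ports assumed usable by AVX-1024  */
    const double cycles_per_op  = 4.0;  /* each AVX-1024 op occupies its port 4 cycles  */
    const double frontend_width = 4.0;  /* instructions delivered per cycle             */

    double issue_rate = ports / cycles_per_op;        /* 0.75 instructions/cycle */
    double duty_cycle = issue_rate / frontend_width;  /* 0.1875, i.e. < 1/5      */

    printf("issue rate = %.3f instr/cycle\n", issue_rate);
    printf("front-end duty cycle = %.2f %%\n", duty_cycle * 100.0);
    return 0;
}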
There will be non-vector instructions interspersed even in a lot of FP-heavy code, and with two or more threads there is the very real possibility that non-FP threads will be running concurrently.
So what? If you have lots of scalar code it's fine to burn some more Watts on pushing instructions. Nothing changes on that front. With SIMD-heavy code more power is consumed in processing data and less in handling instructions. With mixed code, you get mixed results. Nothing wrong with that. Every AVX-1024 instruction would save power in the front-end.
There are five ports in SB, possibly more in future designs. There are many sorts of stack, prefetch, branch, and memory operations that could find a way to run ahead, if they are able to be decoded and issued.
You seem to fail to understand that instructions still have to retire in-order. When one execution port is clogged up, it doesn't matter how many other free ports there are, you'll quickly run out of instructions within the scheduling window which either don't need the clogged port or don't depend on an instruction which needs the clogged port.

Of course clogging sounds like a bad thing, but to the execution units one AVX-1024 instruction would be no different than four sequential AVX-256 instructions. The only relevant difference is that it reduces the number of instructions that need to pass through the front-end.
Depending on move elimination strategies, there may even be AVX 1024 instructions that can be done in fewer than 4 cycles.
Sounds like a plan.
This would go back to a fundamental difference in our viewpoints. When I hear "out of scheduling opportunities" for an OoO core, I think "not a good thing". And perhaps things can be clock gated without affecting performance further, but this does not mean performance hasn't been degraded.
You have to seriously change/widen your perspective on this. All power consumed by handling instructions is wasted power. So the aim is to minimize that while maximizing actual useful work. If any part of the out-of-order execution engine can be made idle while the ALUs stay active, that's definitely a very good thing.

That said, one AVX-1024 instruction performs the same amount of useful work as four AVX-256 instructions. But the scheduling opportunities actually improve due to consuming fewer uop slots! All I was saying is that in the case of Hyper-Threading the other thread will eventually still run out of instructions to schedule at the usual rate, making it possible to clock gate the front-end. Once again that's not a bad thing since you have to look at the combined computational throughput, which would be higher than when using AVX-256.
At least some of the details present seem to be making those queues better for AVX in particular, but globally less effective.
Please explain your sentiment.
It's not scheduling around an L2 miss, so expectations outside of the L1 don't appear to be improved.
Just like on a GPU, that completely depends on the ratio of memory accesses to arithmetic instructions. An L2 cache miss takes only 11 cycles on Sandy Bridge, while two threads can hide 8 cycles with a single AVX-1024 instruction. It would be incredibly unlikely if none of the other 54 instructions in the scheduling window could help cover the remaining 3 cycles.
Is this used by the OS or other threads to control which core they are assigned to? Its primary benefit seems to be reducing context when hopping out of 256-bit mode, which is not what I was addressing.
There are many performance counters, including several for AVX. I'm not sure if any O.S. uses them for thread scheduling or if that would even help, but the possibilities are there.
 
Nope, you can still get all the goodies of a single uop. With 3 arithmetic units the highest instruction rate of code containing only AVX-1024 instructions would be 0.75 per cycle. But the front-end can deliver 4 instructions per cycle. Hence the front-end only has to be active for less than 1/5 of the time!
This is focusing on a workload with 100% AVX-1024 instructions. There are vectorizable workloads that spend over 90% of their time in vector execution, and this small subset may reach a saturation point on occasion, so long as the core doesn't pick up a second integer thread.
For workloads with more book-keeping code, branching, and integer ops, the threshold is reached less often.
I would tentatively suggest that there may also be certain loads whose data granularity is such that the wider vectors can lead to a drop in the total number of vector instructions, but without a matching drop in non-vector instructions, leading to a worsened ratio.
I would consider a vector-heavy thread reducing the utilization of other units a sign of an unbalanced design.

With mixed code, you get mixed results. Nothing wrong with that.
Pure vector code is a minority, and with multithreading, not guaranteed even if an application is written that way.
Intel's hyperthreading solutions have led to more consistent performance and fewer cases of negative scaling with each generation. A design that expects instruction execution to be a source of front-end stalls would be a regression.

Every AVX-1024 instruction would save power in the front-end.
Some power would be saved. Allowing the front end to be throttled or gated runs counter to the goals of an OoO core, which in the general case is limited by instruction throughput, not data or execution latency. If non 1024-bit ops are delayed in issue due to congestion caused by AVX, the latency cannot be hidden because the front end and OoO engine are what would be needed to hide it.

You seem to fail to understand that instructions still have to retire in-order. When one execution port is clogged up, it doesn't matter how many other free ports there are, you'll quickly run out of instructions within the scheduling window which either don't need the clogged port or don't depend on an instruction which needs the clogged port.
This is untrue in the multithreaded case, which involves explicitly independent code. A stalled front-end is a global penalty.
In the single-threaded case, the front end and OoO engine in modern designs actually go through a lot of effort, using renaming tricks and sideband stack units to reduce the number of ops sent to the back end. A throttled front end reduces the rate of encountering these, and any ops that cannot issue to non-AVX units can lead to a detectable drop in throughput outside of the AVX unit.

You have to seriously change/widen your perspective on this. All power consumed by handling instructions is wasted power. So the aim is to minimize that while maximizing actual useful work. If any part of the out-of-order execution engine can be made idle while the ALUs stay active, that's definitely a very good thing.
My first contention with that is that the justifications for that OoO core are that it prioritizes execution speed in the general case. An OoO core that regresses in that performance in favor of 1024-bit throughput is a weaker proposition over a broader range of workloads.
The OoO engine can be turned off for power, imperfectly. The broad front end can be turned off for power, imperfectly.
If those general purpose elements all too often are not needed, it should be noted that not having them at all in an execution core is a very effective form of power gating.

Please explain your sentiment.
4-cycle ops complicate instruction scheduling, since they carry an additional 3 cycles of latency per operation. The fact that the port itself cannot issue for an additional 3 cycles can complicate the heuristics in the scheduler, and the buildup in the ROB and other buffers can lead to non-AVX ports being starved for want of scheduler space.
Sandy Bridge already arranged its units in such a way that each port had consistent execution latencies internally to help reduce contention.

With latencies of 1, 3, and 5 cycles possible now, overlaying operations of 6, 8, or more cycles is a complex problem. I'm curious if they'd be given their own domain to keep them from interfering with the other ports. It might be possible that instead of overloading existing ports and reducing access to all their functionality, the bulkier ops will be issued to new ones.

Just like on a GPU, that completely depends on the ratio of memory accesses to arithmetic instructions. An L2 cache miss takes only 11 cycles on Sandy Bridge, while two threads can hide 8 cycles with a single AVX-1024 instruction.
An L2 cache hit is 11 cycles. A miss to the L3 is roughly in the upper 20s.
Actually serving an AVX-1024 memory access is of uncertain latency without defining how many cache ports of what width we are talking about, cache line length, the width of the L1-L2 bus, and the additional latency of consecutive line fills.
The cache would need to be rearchitected significantly. Sandy Bridge's cache, for example, could not cleanly service two threads, each with an AVX-1024 memory access hitting the L2.
Bandwidth is insufficient even for an L1 hit.
Bank conflicts would take their toll with at most one cache line per cycle.
If the loads were broken up into separate 128-bit accesses, the L1 cannot have more than 10 misses to higher levels of memory, meaning one of the instructions would be stuck pending the completion of two or more L1 cache line fills.

edit:

The cache controller may be able to pick up on the sequential accesses and coalesce them into a smaller number of L1 cache line misses to the L2. If so, then somewhere between 3 and 5 may be handled somewhat cleanly by the L2.
The atomicity of the ops may need additional provisions. AVX-1024 would always cross at least one cache line boundary, which may insert additional latency to prevent an update to one of the lines while the op is in-flight.
 
This is focusing on a workload with 100% AVX-1024 instructions. There are vectorizable workloads that spend over 90% of their time in vector execution, and this small subset may reach a saturation point on occasion, so long as the core doesn't pick up a second integer thread.
That's irrelevant. You don't judge a GPU's power efficiency at throughput computing by adding in the CPU's power consumption for running the driver and application, do you? If the CPU has to run scalar code and this lowers the clock gating opportunities that's totally fine. Business as usual.

All you have to care about is how it compares at high-throughput workloads. So let's go with your example of 90% vector instructions, and an arithmetic port usage of let's say 82% x 3. This works out to the front-end only having to be active 1/6 of the time, even less than my previous "best case" example! With only AVX-256, the front-end would have to be active 62% of the time instead, and that's without taking into account that performance would be lower due to critical path latencies and extra spilling instructions.
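Reproducing that estimate explicitly (a sketch; the 90% and 82% x 3 figures are the assumptions stated above):

#include <stdio.h>

int main(void)
{
    double port_util      = 0.82 * 3.0;            /* AVX-1024 ops completing per cycle       */
    double vec_issue_rate = port_util / 4.0;       /* each op occupies its port for 4 cycles  */
    double total_issue    = vec_issue_rate / 0.90; /* vector ops are 90% of the mix           */
    double duty_cycle     = total_issue / 4.0;     /* 4-wide front end                        */

    printf("total issue rate = %.3f instr/cycle\n", total_issue); /* ~0.68 */
    printf("front-end duty cycle = %.3f (~1/6)\n", duty_cycle);   /* ~0.17 */
    return 0;
}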

So the conclusion is still that every AVX-1024 instruction represents a substantial opportunity for clock gating. Which could make the performance/Watt of a CPU much more competitive to a GPU, at throughput workloads.
4-cycle ops complicate instruction scheduling, since they carry an additional 3 cycles of latency per operation. The fact that the port itself cannot issue for an additional 3 cycles can complicate the heuristics in the scheduler, and the buildup in the ROB and other buffers can lead to non-AVX ports being starved for want of scheduler space.
To the scheduler an AVX-1024 instruction would be identical to four AVX-256 instructions, except that it only occupies one uop slot. It doesn't "starve" anything any more than those four AVX-256 uops would. In fact you get more things to schedule.

It seems like you're still stuck in the idea that processing four instructions is better than processing one instruction... But when they perform the same amount of useful work, having more instructions is just overhead!
With latencies of 1, 3, and 5 cycles possible now, overlaying operations of 6, 8, or more cycles is a complex problem.
No, it doesn't have to wait for the entire AVX-1024 instruction to finish before the next dependent one can be issued. They can execute in a staggered fashion, or "chained" if you prefer the Cray terminology. That's also how it works when it would be split into separate uops.
An L2 cache hit is 11 cycles. A miss to the L3 is roughly in the upper 20s.
My bad, I got misses and hits confused for a moment. Still, just like a GPU isn't able to schedule around a miss if it happens on every cycle, you shouldn't look at the worst case but the average case. Out-of-order execution already does a pretty decent job hiding L1, L2 and L3 misses. AVX-1024 would make it even more effective. With a scheduling window of 54 uops like in Sandy Bridge, it's like having a four times bigger scheduling window, the equivalent of 216 uops! The chances of finding enough instructions in there to cover the vast majority of misses are extremely high.
 
That's irrelevant. You don't judge a GPU's power efficiency at throughput computing by adding in the CPU's power consumption for running the driver and application, do you? If the CPU has to run scalar code and this lowers the clock gating opportunities that's totally fine. Business as usual.
It goes to the significance of 4-cycle AVX-1024 as a crucial improvement for making a hypothetical wide OoO core approach the power efficiency of an in-order core. The claim that it is a large improvement is not mine.

All you have to care about is how it compares at high-throughput workloads. So let's go with your example of 90% vector instructions, and an arithmetic port usage of let's say 82% x 3. This works out to the front-end only having to be active 1/6 of the time, even less than my previous "best case" example!
Why does the case of 100% AVX 1024 lead to 1/4 front-end activity while 90% leads to 1/6? The 10% of non-vector instructions means cycles where the front end remains on longer than if it were purely 1024-bit.


To the scheduler an AVX-1024 instruction would be identical to four AVX-256 instructions, except that it only occupies one uop slot. It doesn't "starve" anything any more than those four AVX-256 uops would. In fact you get more things to schedule.
It is a contiguous run of clock cycles where the port is unavailable for instructions within the thread and those from another thread. This complicates the task of the scheduler's heuristics to interleave uops for different operations as it encounters them. For example, it may be in the interest of the scheduler to push newer instructions ahead of a vector operation if they lead to an earlier calculation of a load/store address, taking advantage of MLP.

From the point of view of the scheduler, AVX 1024 is behaving like an unpipelined instruction that compromises the issue of non-dependent instructions of various types from the current or separate thread.

It seems like you're still stuck in the idea that processing four instructions is better than processing one instruction... But when they perform the same amount of useful work, having more instructions is just overhead!
I am hung up on the fact that while this may increase the amount of useful work done within the ideal AVX-1024 context, it has broader effects outside of the vector execution stream, especially if the power savings are assumed to come from throttling via frequent stalls of the front end.
For the intended purpose of general purpose cores, this is a regression on loads not amenable to AVX 1024, and those are a more important case for these kinds of cores.

More appropriate changes would be to expand the window and port layout such that AVX-1024 has less of an impact on the OoO capabilities for other types of unit. This means the front end will not be apt to stall as often, which means the savings are more likely to be modest.

No, it doesn't have to wait for the entire AVX-1024 instruction to finish before the next dependent one can be issued. They can execute in a staggered fashion, or "chained" if you prefer the Cray terminology. That's also how it works when it would be split into separate uops.
Chained cracked ops have the benefit of having standard writeback latency for the port in question. An AVX-1024 instruction's writeback latency is standard+3 additional cycles. In the pipeline, non-1024 bit ops that have only the standard latency can contend for writeback if they reach writeback somewhere in those additional 3 cycles. This leads to stalling writeback for one of the ops, most likely the non-1024, since the alternative is to stall the 1024-bit writeback in mid-stream.

This is why I postulated separate ports or maybe even separate domains, since SB does have known penalties for mixing uops of differing latencies.
 
Why does the case of 100% AVX 1024 lead to 1/4 front-end activity while 90% leads to 1/6?
Because in the 100% case I also assumed an IPC of 3 arithmetic operations per cycle. With a more realistic lower IPC the front-end has to be less active.
The 10% of non-vector instructions means cycles where the front end remains on longer than if it were purely 1024-bit.
Yes, but note that those 10% only affect instruction throughput by 8%. So the need to have a bit of scalar code in between the vector code doesn't significantly change the fact that AVX-1024 would create major clock-gating opportunities.
Chained cracked ops have the benefit of having standard writeback latency for the port in question. An AVX-1024 instruction's writeback latency is standard+3 additional cycles.
No, it can behave in exactly the same way as cracked instructions, but only take a single uop. All that's required is a two-bit counter to keep track of how many parts remain.
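A toy sketch of how such a counter could sit in a scheduler entry (purely hypothetical; the field names and widths are invented to illustrate the idea, not an actual design):

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical scheduler entry for a single AVX-1024 uop executed as
   four staggered 256-bit "beats" on one execution port. */
typedef struct {
    uint8_t  opcode;
    uint8_t  dst, src1, src2;   /* physical register tags               */
    unsigned beats_left : 2;    /* the two-bit counter, initialized to 3 */
} avx1024_uop;

/* Issue one 256-bit beat per cycle; the uop stays in its single slot
   until the counter reaches zero, then it can retire as one entry. */
static bool issue_beat(avx1024_uop *u)
{
    /* ... drive the 256-bit datapath for beat (3 - beats_left) here ... */
    if (u->beats_left == 0)
        return true;            /* fourth beat issued, uop complete */
    u->beats_left--;
    return false;
}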
 
Nick, aren't you arguing the merits of software 3D graphics rendering on the basis that it'll be compared against the APUs of the day?

Say your 1TFlop SP Haswell becomes a reality. To achieve that it needs 8 cores, AVX2, and 4GHz frequency. And let's even assume you'll be right that it'll be competitive with Trinity's successor.

The only problem with that is that it can't scale down without sacrificing more 3D performance than a design with an iGPU would. A 2 core 3GHz Haswell would be at 190GFlops. Not only do early signs for Haswell go against using the CPU for graphics processing, it doesn't even use Larrabee and continues with Gen X instead.
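The peak-FLOPS arithmetic behind those figures, as a sketch (assuming two 256-bit FMA ports per Haswell core; an FMA counts as 2 FLOPs per lane):

#include <stdio.h>

static double peak_gflops(int cores, double ghz)
{
    const int lanes     = 8;   /* 256-bit AVX, single precision      */
    const int fma_flops = 2;   /* fused multiply-add = 2 FLOPs/lane  */
    const int fma_ports = 2;   /* assumed FMA ports per Haswell core */
    return cores * lanes * fma_flops * fma_ports * ghz;
}

int main(void)
{
    printf("8 cores @ 4 GHz: %.0f GFLOPS\n", peak_gflops(8, 4.0)); /* 1024 */
    printf("2 cores @ 3 GHz: %.0f GFLOPS\n", peak_gflops(2, 3.0)); /*  192 */
    return 0;
}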

Ivy Bridge shows that 17/35/45W mobile parts offer equal or better graphics compared to full desktop parts. With your suggestion, Haswell on the laptop will have significantly slower graphics than the high-end 8 core chips. And even with the 8 cores it's still only at the iGPU level.

I repeat, the above scenario is based on IF software rendering can achieve 1:1 ratio relative to the GPU based on flops numbers.
 
Nick, aren't you arguing the merits of software 3D graphics rendering on the basis that it'll be compared against the APUs of the day?
No. I'm arguing that CPUs and GPUs are converging and that in the distant future they might become unified. I have no doubt that anything on the roadmaps today still includes an IGP, so it's not going to happen before the end of the decade. It's just an interesting exercise to see what kind of technological steps would be required to continue the convergence. It will take many more, but after AVX2 it looks like AVX-1024 is the most promising.
Say your 1TFlop SP Haswell becomes a reality. To achieve that it needs 8 cores, AVX2, and 4GHz frequency. And let's even assume you'll be right that it'll be competitive with Trinity's successor.

The only problem with that is that it can't scale down without sacrificing more 3D performance than a design with an iGPU would. A 2 core 3GHz Haswell would be at 190GFlops.
If you want more graphics performance out of a homogeneous CPU, get more cores. Where's the harm in that? Note that an A8-3870K has a little over twice the peak CPU power of an A4-3400, but also over twice the GPU power. So CPU and GPU performance go hand in hand. Your assumption is that there are lots of people who want a weak CPU but a powerful GPU. The reality is that such an abomination leads to big disappointment. If you're increasing the die size and bandwidth for the GPU, you might as well spend a few more percent on balancing CPU power.

NVIDIA even thinks quad-core makes sense for mobile chips. And with a homogeneous CPU the area you save on the IGP can be used for more CPU cores. So a low-end homogeneous CPU would definitely have more than 2 cores. That said, even 190 GFLOPS is well over the peak performance of today's HD Graphics 3000!
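For comparison, the usual peak figure for HD Graphics 3000 works out as follows (a sketch using the commonly quoted 12 EUs, 8 FLOPs per EU per clock and 1350 MHz turbo):

#include <stdio.h>

int main(void)
{
    const int    eus          = 12;    /* HD Graphics 3000 (GT2)            */
    const int    flops_per_eu = 8;     /* MAD on a 4-wide unit, per clock   */
    const double clock_ghz    = 1.35;  /* max turbo on desktop SNB parts    */

    printf("HD 3000 peak: %.1f GFLOPS\n", eus * flops_per_eu * clock_ghz); /* ~130 */
    return 0;
}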
Not only do early signs for Haswell go against using the CPU for graphics processing, it doesn't even use Larrabee and continues with Gen X instead.

Ivy Bridge shows that 17/35/45W mobile parts offer equal or better graphics compared to full desktop parts. With your suggestion, Haswell on the laptop will have significantly slower graphics than the high-end 8 core chips. And even with the 8 cores it's still only at the iGPU level.
Are you saying that Ivy Bridge mobile parts will achieve 1 TFLOP?
I repeat, the above scenario is based on IF software rendering can achieve 1:1 ratio relative to the GPU based on flops numbers.
I can honestly tell you that software rendering is remarkably efficient per FLOP if you take into account there is no gather support yet and all the fixed-function stuff takes multiple instructions. The thing is, a GPU's fixed-function hardware is massively over-dimensioned so it doesn't become a bottleneck during peak usage, but the average utilization is pretty low. So a programmable core with a versatile ISA doesn't spend a lot of cycles on this functionality. Also, graphics is becoming more programmable with every generation, so by the time we have AVX-1024 there will be even less stuff to emulate.
 
Because in the 100% case I also assumed an IPC of 3 arithmetic operations per cycle. With a more realistic lower IPC the front-end has to be less active.
In a majority of cases, the front end is significantly more active than the stream of committed instructions indicates. IPC does not measure the activity of the front end directly, since quashed operations are excluded.
The front end sees much greater throughput than the output, with some older studies indicating 30-40% of all instructions that go through the front end never committing.
Of course, if it didn't have that throughput, the core would not have the peak performance or latency hiding capability it has.


Yes, but note that those 10% only affect instruction throughput by 8%.
The number is potentially higher because there are various sideband optimizations that could create additional consumers of integer ops above those available to AVX.
The exact impact is more difficult to discern because the front end has an indirect relationship with utilization on the back end.

No, it can behave in exactly the same way as cracked instructions, but only take a single uop. All that's required is a two-bit counter to keep track of how many parts remain.
I had assumed a scheme that kept things almost as simple in the back end and scheduler.
Cracked ops have all of the information about instruction progress and dependence handled in the rename stage and ROB allocation, ahead of instruction issue. The OoO engine operates as normal.
This scheme involves additional feedback from the end of the EXE stage and a small amount of arithmetic being done on the uop itself. This inserts complexities above what I had raised concerns about before.
 
If you want more graphics performance out of a homogeneous CPU, get more cores. Where's the harm in that?

Easy. Higher TDP. Again I note the laptop parts.

The reality is that such an abomination leads to big disappointment. If you're increasing the die size and bandwidth for the GPU,

I guess that's why Intel is giving beefier graphics to mobile parts? Your suggestion would mean 17W parts would have less than a quarter of the performance, and regular laptop parts half or less.

That said, even 190 GFLOPS is well over the peak performance of today's HD Graphics 3000!

Because we all know that iGPUs will stand still in Ivy Bridge, let alone Haswell? A 1GHz Ivy Bridge iGPU would offer 256GFlops, and even the 17W parts are at 1.15GHz. So with your suggestion, the next-generation Haswell parts would offer two-thirds of the performance. And that's before considering that in reality Intel is planning to offer 50% or more graphics performance over Ivy Bridge with Haswell, so the mobile software-rendering parts would be even further behind.
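The 256GFlops figure works out as follows (a sketch; 16 EUs at 16 FLOPs per clock is the commonly quoted HD 4000 configuration):

#include <stdio.h>

int main(void)
{
    const int eus          = 16;    /* Ivy Bridge HD 4000 (GT2)    */
    const int flops_per_eu = 16;    /* 2 x 4-wide MAD units per EU */

    printf("at 1.00 GHz: %d GFLOPS\n", eus * flops_per_eu);           /* 256  */
    printf("at 1.15 GHz: %.0f GFLOPS\n", eus * flops_per_eu * 1.15);  /* ~294 */
    return 0;
}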

Are you saying that Ivy Bridge mobile parts will achieve 1 TFLOP?

Read again please. To requote what I said: "Ivy Bridge shows that 17/35/45W mobile parts offer equal or better graphics compared to full desktop parts."

Ivy Bridge mobile iGPUs clock at 1.3GHz while desktops top out at 1.15GHz. A 1TFlop CPU is not happening in laptops.

I can honestly tell you that software rendering is remarkably efficient per FLOP if you take into account there is no gather support yet and all the fixed-function stuff takes multiple instructions.

"Remarkbly efficient". What does that even mean? That seems merely like a spin on "oh its close enough".

All this doesn't matter because of what I said in the previous post. It can't scale down. It would be barely competitive, and that's only when using a massive-die, power-hungry 8 cores @ 4GHz.
 
NVIDIA even thinks quad-core makes sense for mobile chips. And with a homogeneous CPU the area you save on the IGP can be used for more CPU cores. So a low-end homogeneous CPU would definitely have more than 2 cores.
While it seems that quad-cores should probably become more common eventually, Nvidia's reasons for having 4 (actually 5) cores so early have less to do with a belief in many homogeneous cores than with a desire to offset the disadvantage of using standard ARM cores and a disadvantage in physical design, at least in the near term.
4 okay cores with a shorter time to market was Nvidia's gambit when facing the engineering resources of competitors Qualcomm and TI.

The deployment curve for higher CPU core counts in the Tegra range is not as steep for the competition, perhaps because multithreading is not as common as on the desktop and because power consumption is felt more keenly.
 
No. I'm arguing that CPUs and GPUs are converging and that in the distant future they might become unified. I have no doubt that anything on the roadmaps today still includes an IGP, so it's not going to happen before the end of the decade. It's just an interesting exercise to see what kind of technological steps would be required to continue the convergence. It will take many more, but after AVX2 it looks like AVX-1024 is the most promising.

I'm afraid you are ignoring reality a bit, and taking your wishful thinking for fact.
 
In a majority of cases, the front end is significantly more active than the stream of committed instructions indicates. IPC does not measure the activity of the front end directly, since quashed operations are excluded.
The front end sees much greater throughput than the output, with some older studies indicating 30-40% of all instructions that go through the front end never committing.
Of course, if it didn't have that throughput, the core would not have the peak performance or latency hiding capability it has.
That's true for scalar workloads with lots of relatively hard to predict branches, but frankly it's ridiculous to think it affects high throughput workloads.
I had assumed a scheme that kept things almost as simple in the back end and scheduler.
Cracked ops have all of the information about instruction progress and dependence handled in the rename stage and ROB allocation, ahead of instruction issue. The OoO engine operates as normal.
This scheme involves additional feedback from the end of the EXE stage and a small amount of arithmetic being done on the uop itself. This inserts complexities above what I had raised concerns about before.
Why would there be any additional feedback from the end of the execution stage required? With cracked instructions, the uops are nearly identical. So all I'm proposing is to smack them together into a single uop and keep a 2 bit counter to indicate which one is next in line.

Intel has apparently even done it before with the Pentium 4, since the performance counters only indicate a single uop for each 128-bit SSE instruction, which executes in 2 cycles (credit for noticing this goes to Agner Fog).
 