22 nm Larrabee

Easy. Higher TDP. Again I note the laptop parts.
Lower power consumption is exactly what AVX-1024 is all about.
I guess that's why Intel is giving beefier graphics on mobile parts?
None of Intel's current IGPs are "beefy".
Your suggestion would mean 17W parts would have less than a quarter of the performance, and regular laptop parts half or less.

"One million dollars!"

Seriously now, you're not thinking far enough into the future. At the very earliest we'll see AVX-1024 on a 14 nm process in 2015. It's currently the biggest but not the last prerequisite to make homogeneous graphics a reality. By the end of the decade even some of the most power constrained CPUs will achieve 1 TFLOP. So it would be a real shame if by then graphics hasn't become fully programmable. If you look back at the previous progress it becomes instantly obvious that GPUs and CPUs have been converging for many years. And the end result can be nothing other than an architecture where any form of programmable computing is performed by homogeneous cores.
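
For what it's worth, here's the kind of back-of-envelope arithmetic behind that 1 TFLOP figure. Every number below is an assumption (a power-constrained quad-core with two FMA ports per core executing hypothetical AVX-1024), not a product spec:

    /* rough illustration only; all parameters are assumptions */
    #include <stdio.h>
    int main(void)
    {
        double lanes     = 1024.0 / 32.0; /* 32 single-precision lanes per 1024-bit op */
        double fma       = 2.0;           /* an FMA counts as multiply + add           */
        double ports     = 2.0;           /* assumed FMA ports per core                */
        double cores     = 4.0;
        double clock_ghz = 2.0;           /* modest clock for a power-constrained part */
        printf("%.0f GFLOPS\n", lanes * fma * ports * cores * clock_ghz); /* -> 1024, i.e. ~1 TFLOP */
        return 0;
    }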

Only a few years ago a lot of people would have claimed discrete graphics cards to be the dominant graphics solution for all eternity. But now it has become clear that integrating graphics into the CPU is the new mainstream. So don't act surprised if the next major shift will be the unification of CPU and GPU cores...
 
I'm afraid you are ignoring reality a bit, and mistaking your wishful thinking for facts.
Strange, that's exactly what people told me when I suggested adding gather support to AVX...

There isn't a single wishful thought here where reality is being ignored. It's a simple technical matter of costs and gains. And right now it's looking like AVX-1024 would offer substantial gains at a negligible cost. If you can think of something that would offer a bigger increase in performance/Watt at a lower cost, and thus would be a more likely feature to be implemented next, then by all means please share your thoughts.
 
If you can think of something that would offer a bigger increase in performance/Watt at a lower cost, and thus would be a more likely feature to be implemented next, then by all means please share your thoughts.

Just replace the x86/x64 anachronism with ARM tech.
 
That's true for scalar workloads with lots of relatively hard to predict branches, but frankly it's ridiculous to think it affects high throughput workloads.
SPEC 2006 has 14 FP workloads. Most have a branch mix in the low single digits. Three are around ~10%, if I recall correctly. Some kinds of simulation have a higher percentage of branches, but can easily use more FP throughput.

Why would there be any additional feedback from the end of the execution stage required? With cracked instructions, the uops are nearly identical.
Sorry, I thought we were still talking about an implementation able to interleave different ops in the midst of the multi-cycle operation. A fully cracked instruction could have varying delays to each sub-op, and matching this behavior with a single uncracked uop would require the scheduler to monitor for possible hazards at the writeback stage.

Intel has apparently even done it before with the Pentium 4, since the performance counters only indicate a single uop for each 128-bit SSE instruction, which executes over 2 cycles (credit for noticing this goes to Agner Fog).
I think Intel stated plainly that 128-bit ops were done 64 bits at a time. I'm trying to find the pdf where the architects stated that they had initially looked at a design capable of 128 bits per cycle, but found the die costs did not match the marginal benefits.
I hope I can find it, because one point in the paper that I would have brought up was that while the 128-bit ops did take up two cycles, the ports those units were on were able to issue non-SSE ops to the other unit types on the second cycle of the SSE op. The reciprocal throughput governed what could go on the specific unit, not the execution port.
 
Just replace the x86/x64 anachronism with ARM tech.
"I got news for everyone. ARM has decoders, and they have ugly variable length instructions.

It's obviously cleaner than x86, but it's all a matter of degrees." - David Kanter

So I'm afraid you're being far too optimistic about ARM. By the time anyone has created a CPU that is on par with Haswell, the actual ISA is an insignificant detail. Just look at all the other ones which haven't been able to dethrone Intel. Note that RISC requires separate instructions for every memory operation, which negates some of the benefits. It also seems really doubtful that four 256-bit ARM instructions would be more power efficient than one AVX-1024 instruction.
 
None of Intel's current IGPs are "beefy".

You know what I am talking about, don't act like you don't. They give better GPUs on mobile than on desktops, and the trend is set to continue in the future.

Actually, 2015 is useless to predict, unless you are one of those actually defining the future (like the guys at Intel, for example). It's just like the Apple analysts predicting that the next iPad/iPhone will come next month, every month. Successful companies always have people trying to predict their next move.
 
You know what I am talking about, don't act like you don't. They give better GPUs on mobile than on desktops, and the trend is set to continue in the future.
Actually the fastest IGPs from Intel are on the desktop, as they have higher base clocks (by quite a bit) and the max dynamic frequency is also ever so slightly higher (1.35 GHz vs. 1.3 GHz, unless I missed some odd part, which is entirely possible given the nonsense naming with six random letters and numbers, so you still have to look at every single part). True enough, though, HD 3000 on the desktop is indeed rare. I don't know why Intel only ships the slower ones in most desktop CPUs; maybe to keep AMD alive or something.
 
So I'm afraid you're being far too optimistic about ARM. By the time anyone has created a CPU that is on par with Haswell, the actual ISA is an insignificant detail. Just look at all the other ones which haven't been able to dethrone Intel. Note that RISC requires separate instructions for every memory operation, which negates some of the benefits. It also seems really doubtful that four 256-bit ARM instructions would be more power efficient than one AVX-1024 instruction.

Isn't competition sweet. Once Windows 8 ARM devices hit the market, things will not get easier for x86/x64.
 
It also seems really doubtful that four 256-bit ARM instructions would be more power efficient than one AVX-1024 instruction.
The guesstimates on overhead for x86 versus more regular ISAs are 10-20% on a core, assuming all else is equal.
Now, if the only difference is that the x86 one has 1024-bit instructions versus 256-bit ARM, are you arguing that ISA is irrelevant, but a subset of those instructions is not?

It is particularly important to define the implementation details more clearly in this debate, as it seems we have been debating slightly different visions of what 1024-bit AVX could be. AVX-1024 is a possible future elaboration of AVX, but as the ISA debate points out, it is implementation we are looking at.

On a side note, it occurred to me that cracked ops can be made in response to several design challenges.
For example, cracked ops are helpful if the physical registers are 256-bits, as this simplifies individual uop behavior. It is less necessary if logical=physical width.
One other implementation can use cracked ops to shorten the execution time of dependence chains of AVX-1024 ops, especially after the addition of multiple FMA ports.
With cracking, uops could flow to more than one port, creating an effective 512-bit unit in terms of throughput in a dependence chain. It is possible to have a single op do the same, with some complexity added.
These are all subject to various implementation decisions, of course.
 
You know what I am talking about, don't act like you don't.
Your argument was that an architecture with homogeneous computing cores doesn't scale down. All I'm saying is that I don't see proof of that in Intel's current CPU+IGP offerings. All I see is that they didn't scale the graphics up for desktop parts.
They give better GPUs on mobile than on desktops, and the trend is set to continue in the future.
No, Haswell will double the peak throughput per core. And there's also a steady evolution from dual-core to quad-core in the mobile market. There's no trend towards a growing imbalance in CPU and IGP performance. With ever more complex graphics and compute APIs there should even be a relative decrease in computing density for the IGP. Discrete GPUs have had decreasing computing density for the last few generations for this reason.
Actually, 2015 is useless to predict, unless you are one of those actually defining the future (like the guys at Intel, for example). It's just like the Apple analysts predicting that the next iPad/iPhone will come next month, every month. Successful companies always have people trying to predict their next move.
I find it no less useful than the weather man trying to predict the weather. Even if the prediction is off, a detailed analysis has been made and we learn about the aspects that weren't properly taken into account. Of course shoulda coulda woulda can never be taken into account.

As I've said before, this forum is all about speculation. Some of us will be right and some of us will be wrong, about individual aspects. And sometimes the only reason someone's wrong is because the IHV is wrong. But if there's some technical reason why you think I'm wrong about something, please explain it. If you think this is all a useless exercise, don't feel obliged to join in on the discussion.
 
Isn't competition sweet. Once Windows 8 ARM devices hit the market, things will not get easier for x86/x64.
Yes, competition is great, and I'd love to see a fresh new ISA take over. But realistically I don't think ARM can make enough of an impact to obliterate Intel's process advantage.

Look at it this way; if ARM was a threat to Intel, then Intel would switch ISA itself. Easy peasy, and that ISA could be even better than ARM. But there just doesn't appear to be enough advantage in that to justify breaking binary compatibility with the humongous x86 software ecosystem.

So the best we can really hope for is that ARM will keep Intel on its toes. And since x86 is way closer to achieving high throughput computing I'd rather not get sidetracked by a discussion about ARM before it's actually a viable option.
 
The guesstimates on overhead for x86 versus more regular ISAs are 10-20% on a core, assuming all else is equal.
The problem is you can't assume all else to be equal. For instance, x86 allows a memory operand for pretty much every instruction. This can be exploited to avoid separate instructions for register spilling/filling. Which in turn affects the L1 cache design and things like store-to-load forwarding. Likewise, ARM has a cascaded barrel shifter, which sometimes reduces the instruction count but also affects the timings of arithmetic instructions that don't use it. So clock frequency comes into play as well, which then likely has an effect on the width of the architecture, and can even influence process decisions. And then there's differences in memory coherency, the availability of branching hints, macro-op fusion, the synergy between software and hardware, etc.

Suffice to say I won't settle for a "guesstimate". :)
Now, if the only difference is that the x86 one has 1024-bit instructions versus 256-bit ARM, are you arguing that ISA is irrelevant, but a subset of those instructions is not?
For argument's sake let's say that an AVX-256 instruction takes 120% of the power consumption of an equivalent 256-bit ARM instruction, of which 70% in the front-end and 50% in the back-end. Then by executing an AVX-1024 instruction in four cycles there would be a 50% improvement in performance/Watt over the 256-bit ARM instruction(s). Also if ARM had 1024-bit instructions executed in this way, the advantage over x86 would only be 8%, not 20%.

So for throughput computing where an instruction is executed over multiple cycles, the ISA becomes less relevant.
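
To make the arithmetic behind those percentages explicit, here is the bookkeeping in a few lines of C. The unit costs are purely illustrative (the same hypothetical 50/70/50 split as above), not measurements:

    /* illustrative energy accounting only; none of these figures are measured */
    #include <stdio.h>
    int main(void)
    {
        double arm_front = 50.0, arm_back = 50.0;   /* assumed 256-bit ARM op: 100 units */
        double x86_front = 70.0, x86_back = 50.0;   /* assumed AVX-256 op: 120 units     */

        double arm256_x4 = 4.0 * (arm_front + arm_back);   /* 400 */
        double avx1024   = x86_front + 4.0 * x86_back;     /* 270: front-end once, back-end 4 cycles */
        double arm1024   = arm_front + 4.0 * arm_back;     /* 250: hypothetical 1024-bit ARM         */

        printf("AVX-1024 vs 4x ARM-256: +%.0f%% perf/Watt\n", 100.0 * (arm256_x4 / avx1024 - 1.0)); /* ~48% */
        printf("ARM-1024 vs AVX-1024:   +%.0f%% perf/Watt\n", 100.0 * (avx1024 / arm1024 - 1.0));   /* ~8%  */
        return 0;
    }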
It is particularly important to define the implementation details more clearly in this debate, as it seems we have been debating slightly different visions of what 1024-bit AVX could be. AVX-1024 is a possible future elaboration of AVX, but as the ISA debate points out, it is implementation we are looking at.
I'm sorry for all the confusion. It started out as a very rough idea and it has been a learning process for me to understand the challenges and try to come up with solutions. But I think we've reached a point where we'd have to learn about lots of implementation details of today's x86 processors to know how to best fit in AVX-1024 support (assuming the overarching design isn't changed).

The thing I was really most concerned about is whether it's a viable idea. And I'm glad to see that you appear to agree it's worth considering.
On a side note, it occurred to me that cracked ops can be made in response to several design challenges.
For example, cracked ops are helpful if the physical registers are 256-bits, as this simplifies individual uop behavior. It is less necessary if logical=physical width.
I've always envisioned the physical registers to remain 256-bit.
One other implementation can use cracked ops to shorten the execution time of dependence chains of AVX-1024 ops, especially after the addition of multiple FMA ports.
That's an interesting idea! I'm pretty certain that Haswell will have two FMA ports. I wonder how you could detect a dependency chain though (or if you even should). In any case it's a detail that shouldn't affect the overall viability.
 
The problem is you can't assume all else to be equal.
We are debating a few new instructions in an ISA while discarding an entire ISA change as being insufficient. We can't debate much at all if we give one side an implied process node and physical design advantage. I've already noted that ISA does not override implementation, but even apparently minor implementation decisions could swing it either way.

Suffice to say I won't settle for a "guesstimate". :)
None of the parties in question that have the most detailed knowledge have an interest in giving an honest assessment, even if there were comparable designs.
Medfield and the oncoming optimized 28nm ARM cores from Qualcomm and others might bring some clarity in the tablet/phone range. At least then we'd have cores with the same target market and a few competitors to Intel that devote at least some effort to optimized physical design.

For argument's sake let's say that an AVX-256 instruction takes 120% of the power consumption of an equivalent 256-bit ARM instruction, of which 70% in the front-end and 50% in the back-end.
I see we might have started guesstimating again.
The numerical basis seems a little muddled, since we have a value that is 120% of an unknown base and then two percentages that don't add up to 100%. What are the 70% and 50% percentages of?
Would my interpretation sound correct to you:
Assume an ARM 256-bit op costs 100 units of energy.
Assume an AVX 256-bit op costs 120 units.
Your next assumption is that for AVX, 70 units are expended by the front end and 50 by the back.
In the 1024-bit comparison:
100 x 4 ARM ops = 400.
70+50+50+50+50 = 270.
This would make the 1024-bit x86 about 33% more power efficient than 4 ARM ops and 44% more efficient than 4 AVX-256, using proportions I cannot verify as being accurate for either ARM or x86.
There are certain assumptions to this, such as it being so simple to divide 60% of an AVX op's power cost into a "front end" bucket that can be made 0 units for 3 of the 4 cycles.
There are parts of the front end that can be very difficult to truly turn off, so there is some undetermined additive factor that persists across all 4 cycles. Also, the usual definition of front end is instruction fetch and decode, but are you including the scheduling phase as front end? The back-end effort is also a little more complex, so what happens in a 1024-bit case is not quite the same as what is done in the 256-bit case.

The fact that our base numbers are unverifiable is the first source of uncertainty.
The second is that the units of power devoted to each portion will be different between an AVX 256 and 1024 implementation, since there are slight differences in how the pipeline behaves.

My earlier arguments concerned the context of these instructions. Are we assuming multithreaded aggressive OoO superscalar cores for each case? There are costs associated with this that I would place as a floor of power consumption that becomes more significant as the rest becomes more efficient.
My other point was that the nature of these cores is to have a policy of keeping the front end and scheduler awake, but that has been covered already.

This then goes back to my post at the start of this chain on the idea of an AVX-1024 core approaching the power-efficiency of an in-order throughput core. The unit costs for the front end and back end when comparing either ARM or x86 fat cores to a power optimized simple core are very different.

The thing I was really most concerned about is whether it's a viable idea. And I'm glad to see that you appear to agree it's worth considering.
I agree that it is doable and that it can be an improvement. I believe we have been discussing as best we can the question of how much.

That's an interesting idea! I'm pretty certain that Haswell will have two FMA ports. I wonder how you could detect a dependency chain though (or if you even should). In any case it's a detail that shouldn't affect the overall viability.
With cracked ops, it would be handled in the renaming stage as the 256-bit registers are allocated. The dependence would be indicated by the rename registers used, and the scheduler would put the ops across multiple ports as naturally as if they were separate 256-bit ops.

If not cracked, it would still be possible. One port would need the ability to send control signals to a unit on the other, and the scheduler would in this scenario suppress instruction issue on the same domain for the secondary port. The scheduler would have a more active role in tracking how often it has an opportunity to gang the ports together, and it's not as transparent to the back end in that case.
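
To illustrate the cracked case in code: a minimal sketch where a 1024-bit architectural register is tracked as four 256-bit physical registers at rename, so the lane-wise dependences fall out naturally and the resulting uops are free to spread over multiple ports. The data structures and register names are mine, purely for illustration, not Intel's implementation:

    #include <stdio.h>

    enum { LANES = 4 };                           /* 1024 / 256 */

    typedef struct { int phys[LANES]; } reg_map;  /* one 1024-bit architectural register */

    static int next_phys = 4;                     /* physical regs 0..3 already hold v0  */

    /* rename one AVX-1024 op "v1 = op(v0)": emit one 256-bit uop per lane */
    static void rename_op(reg_map *dst, const reg_map *src)
    {
        for (int lane = 0; lane < LANES; lane++) {
            int src_phys = src->phys[lane];       /* dependence: same lane of the producer */
            dst->phys[lane] = next_phys++;        /* fresh 256-bit physical register       */
            printf("uop lane %d: p%d <- op(p%d)\n", lane, dst->phys[lane], src_phys);
        }
    }

    int main(void)
    {
        reg_map v0 = {{0, 1, 2, 3}};              /* producer, already renamed */
        reg_map v1;
        rename_op(&v1, &v0);                      /* dependent consumer: 4 independent lane chains */
        return 0;
    }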
 
Look at it this way; if ARM was a threat to Intel, then Intel would switch ISA itself. Easy peasy, and that ISA could be even better than ARM. But there just doesn't appear to be enough advantage in that to justify breaking binary compatibility with the humongous x86 software ecosystem.

Surely Intel prefers a semi-monopoly, allowing outrageous margins that the ARM ecosystem would not allow.
 
We are debating a few new instructions in an ISA while discarding an entire ISA change as being insufficient.
I'm not dismissing the advantages a new ISA could bring. I just think it's largely orthogonal and while I would love a more power efficient ISA to displace x86 I just don't think that's realistic in the foreseeable future given the dominant software ecosystem.

And I'm not even convinced that a RISC architecture would be the answer given that it requires additional instructions for memory accesses. So then the question would become: how do you keep x86 functionally the same, but encoded more efficiently? I'm sure that Intel has the answer to that question, and if it made a significant difference they'd switch over to that ISA and have a thin code-morphing layer to guarantee backward compatibility...

Maybe that's worth considering further down the road, but right now AVX-1024 seems more important. Like I said before, there are many more steps to take toward fully homogeneous computing. AVX-1024 would be the nail in the coffin for GPGPU, but it will take additional convergence for real-time graphics.
Medfield and the oncoming optimized 28nm ARM cores from Qualcomm and others might bring some clarity in the tablet/phone range. At least then we'd have cores with the same target market and a few competitors to Intel that devote at least some effort to optimized physical design.
Intel has previously announced that it will ramp up Atom's development cycle, so it can also enjoy the same process lead. So Medfield is just the very first glimpse of what Intel can offer in this market. It's a foot in the door and they'll only start putting their full weight into it when demand picks up. Furthermore, it's an x86-64 capable design, while ARMv8 designs still have to be announced. So it's questionable whether we'll see parts which allow an accurate assessment of x86 versus ARM any time soon.
I see we might have started guesstimating again.
No. The intention was not to make estimates, but to merely illustrate that AVX-1024 could offer significant gains even if x86 itself is less efficient by a certain amount. You can play with those numbers any way you want and still reach that same conclusion.
The numerical basis seems a little muddled, since we have a value that is 120% of an unknown base and then two percentages that don't add up to 100%. What are the 70% and 50% percentages of?
Would my interpretation sound correct to you:
Assume an ARM 256-bit op costs 100 units of energy.
Assume an AVX 256-bit op costs 120 units.
Your next assumption is that for AVX, 70 units are expended by the front end and 50 by the back.
In the 1024-bit comparison:
100 x 4 ARM ops = 400.
70+50+50+50+50 = 270.
This would make the 1024-bit x86 about 33% more power efficient than 4 ARM ops and 44% more efficient than 4 AVX-256, using proportions I cannot verify as being accurate for either ARM or x86.
Indeed, my percentages and your units of energy work out to the same thing. 33% less power means 50% higher performance/Watt. And that's more than any realistic number for a more efficient encoding.
There are certain assumptions to this, such as it being so simple to divide 60% of an AVX op's power cost into a "front end" bucket that can be made 0 units for 3 of the 4 cycles.
There are parts of the front end that can be very difficult to truly turn off, so there is some undetermined additive factor that persists across all 4 cycles. Also, the usual definition of front end is instruction fetch and decode, but are you including the scheduling phase as front end? The back-end effort is also a little more complex, so what happens in a 1024-bit case is not quite the same as what is done in the 256-bit case.
Yes, there are many assumptions, but again I think that any reasonable change in the parameters will still lead to the same conclusion. Depending on the exact implementation there could be lower switching activity beyond the classic definition of the front-end, which offers further improvement, but you're right that clock gating isn't perfect, which probably cancels out such an advantage. But then there's also a reduction in spilling instructions versus the equivalent unrolled AVX-256 code, which independently improves performance and reduces memory accesses.
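
To illustrate the spilling point with a concrete (if contrived) kernel: a 4x-unrolled AVX-256 loop needs four live accumulators, and in a larger kernel that register pressure turns into spill/fill traffic, whereas a hypothetical AVX-1024 register would hold the same 1024 bits of accumulator state under a single architectural name and a single instruction per iteration. The intrinsics below assume an FMA-capable (Haswell-class) compiler target; remainder handling is omitted for brevity:

    #include <immintrin.h>
    #include <stddef.h>

    /* 4x-unrolled AVX-256 dot product: four live __m256 accumulators */
    float dot(const float *a, const float *b, size_t n)
    {
        __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
        __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
        for (size_t i = 0; i + 32 <= n; i += 32) {
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),      _mm256_loadu_ps(b + i),      acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),  _mm256_loadu_ps(b + i + 8),  acc1);
            acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16), _mm256_loadu_ps(b + i + 16), acc2);
            acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24), _mm256_loadu_ps(b + i + 24), acc3);
        }
        /* with AVX-1024 the four accumulators (and the four FMAs per iteration)
           would collapse into one register name and one instruction */
        __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
        float t[8];
        _mm256_storeu_ps(t, acc);
        float s = 0.0f;
        for (int i = 0; i < 8; i++) s += t[i];
        return s;
    }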
Are we assuming multithreaded aggressive OoO superscalar cores for each case? There are costs associated with this that I would place as a floor of power consumption that becomes more significant as the rest becomes more efficient.
Yes, reducing scalar performance would be unacceptable. But note that there are the new BMI1 and BMI2 instruction extensions (which are non-destructive). And by extending macro-op fusion they could implement non-destructive operations for legacy instructions as well. Interestingly just like AVX-1024 that has the effect of reducing the number of uops for the same amount of work. So techniques like this would help balance things out. Granted, other architectures have had non-destructive instructions for decades, but there are code density advantages to not using non-destructive instructions when not needed.
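
A tiny illustration of the non-destructive point. The interesting part is the instruction selection, which I'm assuming a BMI1-enabled compiler target performs; the C itself is trivial:

    /* dst = value & ~mask, with both inputs still live afterwards.
       With BMI1 a compiler can emit a single, non-destructive three-operand
       ANDN; with legacy two-operand x86 it needs a register copy plus NOT
       plus AND, i.e. more uops (and possibly a spill) for the same work. */
    unsigned int select_clear(unsigned int value, unsigned int mask)
    {
        return value & ~mask;
    }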
This then goes back to my post at the start of this chain on the idea of an AVX-1024 core approaching the power-efficiency of an in-order throughput core. The unit costs for the front end and back end when comparing either ARM or x86 fat cores to a power optimized simple core are very different.
Keep in mind that in-order throughput cores are not free from schedulers. They have to keep a scoreboard to know which thread can run next. And while that's not as complex as Tomasulo's out-of-order execution, I'm sure it worsens the power consumption of these in-order throughput cores compared to in-order blocking cores, converging them closer to out-of-order cores. Furthermore, they need humongous register files to store the context of many threads, and storing a useful amount of thread-local data into caches is barely an option, so they often have to reach out to RAM. And this both creates a bottleneck and increases power consumption over cache accesses.

In other words, scheduling instructions within the same thread can be more power efficient overall than scheduling across threads, depending on the workload characteristics. As I've noted before, we already see the effect of that in GPU designs which perform superscalar execution. And as long as the workloads get more complex, this convergence will continue.
If not cracked, it would still be possible. One port would need the ability to send control signals to a unit on the other, and the scheduler would in this scenario suppress instruction issue on the same domain for the secondary port. The scheduler would have a more active role in tracking how often it has an opportunity to gang the ports together, and it's not as transparent to the back end in that case.
Yeah, I'm sure it can be done, but after giving it some thought it doesn't seem worthwhile. The purpose would be to reduce dependency chains, but remember that it already covers 8 cycles (using 2 threads). So there could only be some benefit if the dependency chain consists of instructions with an average latency over 8 cycles. Furthermore, dependent instructions with a latency less than 4 cycles could be chained through the other port, shortening the dependency chain. Of course that assumes forwarding paths at the appropriate stages and there could be a latency penalty. Anyway, there are many trade-offs that can be made and it would require very detailed analysis of typical workloads to determine which would be beneficial. Since the use of AVX-1024 won't be very high for the first generation, and the subsequent generations can still tune it further, it seems wise to initially go with a cheap implementation.
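
As a sanity check on that "8 cycles" figure, a trivial back-of-envelope (the latency is an assumption, not a Haswell number):

    #include <stdio.h>
    int main(void)
    {
        int uops_per_instr = 4;   /* 1024 bits over a 256-bit data path            */
        int threads        = 2;   /* two threads alternating on the port           */
        int issue_gap      = uops_per_instr * threads;   /* 8 cycles between       */
                                                          /* dependent issues       */
        int fma_latency    = 5;   /* assumed execution latency                     */
        printf("issue gap %d cycles, latency %d cycles -> %s\n",
               issue_gap, fma_latency,
               fma_latency <= issue_gap ? "latency hidden" : "chain stalls");
        return 0;
    }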
 
Surely Intel prefers a semi-monopoly, allowing outrageous margins that the ARM ecosystem would not allow.
That's exactly why I added "if ARM was a threat to Intel". And even if the ISA becomes a problem they'd go with code-morphing of some kind. So x86 is guaranteed to stick around and be competitive for a long long time.
 
That's exactly why I added "if ARM was a threat to Intel". And even if the ISA becomes a problem they'd go with code-morphing of some kind. So x86 is guaranteed to stick around and be competitive for a long long time.

At least Intel is not a threat to ARM; I'm not sure the same can be said the other way around.
 
And I'm not even convinced that a RISC architecture would be the answer given that it requires additional instructions for memory accesses.
My point in bringing up Medfield and 28nm ARM is that it is the first time in about a decade that x86 and a RISC competitor have targeted the same market where a number of variables have been partially brought into at least a similar ballpark. For the sake of a theoretical comparison, this is probably the best data point we can get.

Maybe that's worth considering further down the road, but right now AVX-1024 seems more important. Like I said before, there are many more steps to take toward fully homogeneous computing. AVX-1024 would be the nail in the coffin for GPGPU, but it will take additional convergence for real-time graphics.
It may damage GPGPU, but it can serve as a trojan horse for non-homogeneous computing.

Furthermore, it's an x86-64 capable design, while ARMv8 designs still have to be announced. So it's questionable whether we'll see parts which allow an accurate assessment of x86 versus ARM any time soon.
This is unlikely to matter for Medfield where it meets ARM, and x86-64 versus x86 with the 64 bit fused off is probably one of the smaller variables to worry about.


Indeed, my percentages and your units of energy work out to the same thing. 33% less power means 50% higher performance/Watt. And that's more than any realistic number for a more efficient encoding.
Well, the ISA guesstimate is a ballpark on everything as opposed to the subset that uses a very high concentration of ultrawide vectors. Designers would need to evaluate the workloads they are targeting.
It's a very good improvement on that element of the system if the proportions work out that way.

Keep in mind that in-order throughput cores are not free from schedulers. They have to keep a scoreboard to know which thread can run next. And while that's not as complex as Tomasulo's out-of-order execution, I'm sure it worsens the power consumption of these in-order throughput cores compared to in-order blocking cores, converging them closer to out-of-order cores.
This does assume a many-thread throughput core, rather than a more modestly threaded core with very wide vector resources. The gap between even these and a single-threaded powerhouse is very wide. The prior math we've been going on about for an OoO core's improvement may be in the wrong order of magnitude.

Furthermore, they need humongous register files to store the context of many threads, and storing a useful amount of thread-local data into caches is barely an option, so they often have to reach out to RAM. And this both creates a bottleneck and increases power consumption over cache accesses.
The register file for an OoO in the AVX-1024 time frame is going to be large. It may not match a GPU SM in size, but it may only differ by a factor of two or so from a reg file like what was on GT200 (edit: Make that G80, if Intel happens to increase the number of threads to 4; otherwise it's 1/4 the size for architected state alone.), and the actual porting is many times higher for the OoO one.
Creating a separate domain for AVX-1024 could avoid this by allowing the inclusion of a separate reg file that doesn't need so many ports.
 
The register file for an OoO in the AVX-1024 time frame is going to be large. It may not match a GPU SM in size, but it may only differ by a factor of two or so from a reg file like what was on GT200 (edit: Make that G80, if Intel happens to increase the number of threads to 4; otherwise it's 1/4 the size for architected state alone.), and the actual porting is many times higher for the OoO one.
Creating a separate domain for AVX-1024 could avoid this by allowing the inclusion of a separate reg file that doesn't need so many ports.

You could implement a big AVX1K register file in multiple levels. Have the lowest level hold 64 or 128 entries, ported umpteen ways, and a second level with back store for all the registers with just a few ports. Recall how Pentium Pro made do with just two read and one store port in its register file. Most values were provided by the ROB.

IMO, you don't want to banish AVX to an in-order co-processor; you end up with your OoO CPU core stalling on AVX RAW hazards, effectively turning it into an in-order machine.

Cheers
 
You could implement a big AVX1K register file in multiple levels. Have the lowest level hold 64 or 128 entries, ported umpteen ways, and a second level with back store for all the registers with just a few ports.
Is the core going to handle re-renaming its rename registers and migrating values based on use data? There would be no software way to control this.

Recall how Pentium Pro made do with just two read and one store port in its register file. Most values were provided by the ROB.
Result forwarding is a significant source, though the port restriction was still noticeable, per Agner for many preceding Intel cores. With SB, this seems to have been alleviated or significantly reduced, though I haven't found a port count. With a physical register file, the ROB no longer contains data.

IMO, you don't want to banish AVX to an in-order co-processor; you end up with your OoO CPU core stalling on AVX RAW hazards, effectively turning it into an in-order machine.
I suggested a different domain. Sandy Bridge has several, and they are out of order.
 