Intel working on a completely new x86 implementation dropping SIMD compatibility

fehu

Veteran

http://www.bitsandchips.it/english/52-english-news/7854-rumor-even-intel-is-studying-a-new-x86-uarch

According to our sources – the same ones who reported the development of Zen two days before AMD's official public announcement – Intel is studying a new uArch intended to replace the current x86 uArchs in the Desktop and Enterprise markets.

TigerLake (2019) will be the last evolutionary step of this Core generation, which started with Sandy Bridge (developed by the Haifa team). We can say that Haswell, Skylake and Cannonlake are mainly adjustments (Skylake is the first Intel uArch developed to scale from the Mobile market to the Enterprise market).

The next Intel uArch will be very similar to the approach used by AMD with Zen – a perfect balance of power consumption/performance/price – but with one big piece of news: in order to save physical space (a smaller die) and to improve the power consumption/performance ratio, Intel will throw away some old SIMD extensions and other old hardware leftovers.

100% backward x86 hardware compatibility will no longer be guaranteed, but that may not be a handicap (some SIMD extensions are useless today, and we can also use emulators or cloud systems). Nowadays a lot of software houses have to develop code both for ARM and for x86, but ARM is lacking useful SIMD, so this software frequently ends up as a watered-down compromise.

Intel will be able to develop a lean and fast x86 uArch, and ICC will be able to optimize code for both ARM and x86.

This new uArch will be ready in 2019-2020.
 
ARM has useful SIMD. The 64-bit specification has a 128-bit (32 regs) NEON SIMD instruction set. Floats, doubles, integers. Not as wide as AVX, but definitely better than SSE, and supported by all 64-bit ARM designs.
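Just to make that concrete, here's a minimal AArch64 NEON snippet (my own sketch; arm_neon.h and vaddq_f32 are the standard ACLE names, the wrapper function is only illustrative):

#include <arm_neon.h>

/* AArch64 NEON: 128-bit registers, here 4 x float32 per vector. */
float32x4_t add4(float32x4_t a, float32x4_t b)
{
    return vaddq_f32(a, b);   /* one instruction adds four floats */
}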

Also ARM recently announced new SVE instructions to battle with AVX-512 and future AVX-1024:
https://www.community.arm.com/proce...or-extension-sve-for-the-armv8-a-architecture

The good thing about SVE is that it separates the software vector width from the hardware width. Software can be written with 2048-bit vectors, but the hardware can be narrower. For example, a 128-bit vector unit processes a 2048-bit wide instruction in 16 cycles. I would guess that Intel wants to move to a similar instruction set. It would allow full software compatibility with all processors, and at the same time allow them to produce CPUs with narrow SIMD for markets where SIMD performance isn't critical (dense servers, etc.). This would also make the instruction set simpler.
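To make the software-width vs. hardware-width idea concrete, a vector-length-agnostic loop written with the SVE ACLE intrinsics looks roughly like the sketch below (the intrinsics are the documented ACLE names; the function itself is just an illustrative example). The same code runs unchanged whether the hardware vector is 128 or 2048 bits wide, and the predicate handles the tail.

#include <arm_sve.h>
#include <stdint.h>

/* Vector-length-agnostic add: dst[i] = a[i] + b[i] for i < n.
   The loop never hard-codes the vector width; svcntw() reports
   how many 32-bit lanes this particular CPU implements. */
void vla_add(float *dst, const float *a, const float *b, int64_t n)
{
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {
        svbool_t pg = svwhilelt_b32_s64(i, n);          /* predicate: active lanes while i < n */
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_m(pg, va, vb));
    }
}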
 
We should start keeping track of all the extraordinary rumors Bits and Chips publishes and grade them over time. I have a very good feeling that this is going to be wrong, not least of all because, as sebbbi says, the claim that ARM lacks useful SIMD is already incorrect.

The good thing about SVE is that it separates the software vector width from the hardware width. Software can be written with 2048-bit vectors, but the hardware can be narrower. For example, a 128-bit vector unit processes a 2048-bit wide instruction in 16 cycles. I would guess that Intel wants to move to a similar instruction set. It would allow full software compatibility with all processors, and at the same time allow them to produce CPUs with narrow SIMD for markets where SIMD performance isn't critical (dense servers, etc.). This would also make the instruction set simpler.

SVE is not that simple, it's more that it allows software to be written with a vector size that is not known at compile-time (and to some extent may not even be constant at run-time). Simply specifying a fixed huge vector size would perform badly on narrow uarchs given workloads that don't divide well by that width.

I agree that ultimately something like SVE makes the most sense for Intel and all things considered I'd say they kind of blew it by not moving AVX in this direction, but they made their decision and I doubt they'll back away from it this easily. If they do I expect an SVE-like extension that builds on AVX instead of replacing it, maybe replacing the reserved 1024-bit width specifier with one that means variable width.
 
SVE is not that simple, it's more that it allows software to be written with a vector size that is not known at compile-time (and to some extent may not even be constant at run-time). Simply specifying a fixed huge vector size would perform badly on narrow uarchs given workloads that don't divide well by that width.
Vector length needs to be a multiple of 128 bits. 128-bit HW SIMD divides every allowed vector width perfectly. 256-bit HW SIMD is perfect for 256/512/768/1024/.../2048-bit vector widths. Executing wide vectors on narrow SIMD isn't a bad idea. All GPUs already do that. AMD's vectors are 64 wide (2048 bits) and the HW SIMD is 16 wide (512 bits) = 4x difference. Intel GPU HW SIMD is 4 wide (128 bits) and vectors are 8 - 32 wide (256 bits to 1024 bits), up to an 8x difference. Wide vectors on narrow SIMD provide lots of independent instructions to execute (SIMD instruction lanes are independent, excluding cross-lane / horizontal ops). This kind of execution is easier to latency-hide (caches, pipelines, memory). It is also much more friendly to instruction caches and allows OoO to see further into the "future".
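As a rough software analogue of that sequencing (not any real ISA, just my illustration in C with SSE intrinsics): a hypothetical 2048-bit architectural vector op becomes 16 independent passes over a 128-bit datapath, and every pass is independent work the hardware can schedule around stalls.

#include <immintrin.h>

/* Illustrative only: a "2048-bit" architectural vector held as 64 floats,
   executed as 16 independent passes over 128-bit (4-float) hardware SIMD. */
typedef struct { float lane[64]; } wide_vec_t;

void wide_add(wide_vec_t *dst, const wide_vec_t *a, const wide_vec_t *b)
{
    for (int pass = 0; pass < 16; ++pass) {             /* 2048 / 128 = 16 passes */
        __m128 va = _mm_loadu_ps(&a->lane[pass * 4]);
        __m128 vb = _mm_loadu_ps(&b->lane[pass * 4]);
        _mm_storeu_ps(&dst->lane[pass * 4], _mm_add_ps(va, vb));
    }
}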

Intel already has scalable narrow SIMD vector execution in their GPU. Maybe their future scalable CPU SIMD has borrowed some ideas from their GPUs.
 
Intel working on a new x86 uarch is not particularly surprising.
The slavish praise for Zen's "perfect" balancing of integer and FP is noise.
Discarding old opcodes is eyebrow-raising if only because it is uncommonly done in x86.
It's not without precedent, since at some junctures x86 does lose/repurpose opcodes, and Zen itself is an example of AMD abandoning some of its extensions.
It's not clear from that alone that Intel is going to do something like SVE. AVX evolved over a period where it needed to plug into prior implementation choices and uncertainties (encoding, 32-bit and 64-bit, methods of handling execution, FMA3/4, etc.), which led to some implementation choices bleeding up from the uarch into the architecture, like the delayed promotion of integer SIMD and the later introduction of gather/scatter. SVE benefits from coming out with a clearer direction, the knowledge that there is now transistor budget to blow on it, and a clean decision to dedicate opcode space in a fresh architectural context.

Dropping instructions would seem to hint at a new mode.
If nothing else, I would think Intel has been coveting the opcode space, and that fetch complexity and code density have suffered to the point that adding more prefixes might have proven prohibitive.
Cracking instructions might also not be done in the same way AMD handles the 256-bit split, since it could need to be implemented differently to run out of order and at speed for longer or varied lengths.
 
Vector length needs to be a multiple of 128 bits. 128-bit HW SIMD divides every allowed vector width perfectly. 256-bit HW SIMD is perfect for 256/512/768/1024/.../2048-bit vector widths. Executing wide vectors on narrow SIMD isn't a bad idea. All GPUs already do that. AMD's vectors are 64 wide (2048 bits) and the HW SIMD is 16 wide (512 bits) = 4x difference. Intel GPU HW SIMD is 4 wide (128 bits) and vectors are 8 - 32 wide (256 bits to 1024 bits), up to an 8x difference. Wide vectors on narrow SIMD provide lots of independent instructions to execute (SIMD instruction lanes are independent, excluding cross-lane / horizontal ops). This kind of execution is easier to latency-hide (caches, pipelines, memory). It is also much more friendly to instruction caches and allows OoO to see further into the "future".

Intel already has scalable narrow SIMD vector execution in their GPU. Maybe their future scalable CPU SIMD has borrowed some ideas from their GPUs.

I agree that what you're saying works well for GPUs. But if everything that worked well for GPUs was a good idea for CPUs too we'd just use GPUs for everything instead of worrying about extending SIMD on CPUs.

It's pretty normal for CPUs to execute vectorized loops over batches that a) aren't really that large (not as embarrassingly parallel), b) vary in size pretty routinely, and c) are fairly latency sensitive (especially if they're switching to some non-vectorizable component). Going with something like a fixed 2048-bit SIMD width and executing over N 128-bit cycles is disastrous if the batch size is frequently not close to some multiple of 2048 bits. CPU SIMD code often works on integer data types that could be 16-bit or even 8-bit, so the problem is exacerbated; you could need batch sizes well into the hundreds to prevent poor occupancy/wasted cycles. For a simple example, imagine string operations.
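To put a rough number on the occupancy problem (illustrative arithmetic only, not measured data): with a fixed 2048-bit vector of 8-bit elements there are 256 lanes per op, so a 40-byte string fills fewer than a sixth of them.

/* Illustrative only: fraction of lanes wasted when a batch of n 8-bit
   elements is processed with a fixed 2048-bit (256-lane) vector. */
static double wasted_fraction(int n)
{
    const int lanes = 2048 / 8;                     /* 256 byte lanes per op  */
    int ops = (n + lanes - 1) / lanes;              /* ops needed, rounded up */
    return 1.0 - (double)n / (double)(ops * lanes); /* idle-lane fraction     */
}
/* wasted_fraction(40) ~= 0.84: about 84% of the lanes do nothing. */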

You could talk about short circuiting the vector sequencing in conjunction with predication, but this isn't really the simple hard width ratio you're describing anymore. Nonetheless, SVE chooses a different path of being architected for vector width agnostic code.

Not saying that architectural width > hardware width doesn't make some sense on CPUs too, but not to the point where you'd want a one-size-fits-all 2048-bit architectural width for every uarch, giving 128-bit implementations a whopping 16x ratio. CPUs usually just don't benefit that much from that kind of latency hiding in SIMD; they have a bunch of other mechanisms for that. Ratios of 2x or maybe 3x or 4x are more reasonable.
 
I agree that what you're saying works well for GPUs. But if everything that worked well for GPUs was a good idea for CPUs too we'd just use GPUs for everything instead of worrying about extending SIMD on CPUs.

It's pretty normal for CPUs to execute vectorized loops over batches that a) aren't really that large (not as embarrassingly parallel), b) vary in size pretty routinely, and c) are fairly latency sensitive (especially if they're switching to some non-vectorizable component). Going with something like a fixed 2048-bit SIMD width and executing over N 128-bit cycles is disastrous if the batch size is frequently not close to some multiple of 2048 bits. CPU SIMD code often works on integer data types that could be 16-bit or even 8-bit, so the problem is exacerbated; you could need batch sizes well into the hundreds to prevent poor occupancy/wasted cycles. For a simple example, imagine string operations.

You could talk about short circuiting the vector sequencing in conjunction with predication, but this isn't really the simple hard width ratio you're describing anymore. Nonetheless, SVE chooses a different path of being architected for vector width agnostic code.

Not saying that architectural width > hardware width doesn't make some sense on CPUs too, but not to the point where you'd want a one-size-fits-all 2048-bit architectural width for every uarch, giving 128-bit implementations a whopping 16x ratio. CPUs usually just don't benefit that much from that kind of latency hiding in SIMD; they have a bunch of other mechanisms for that. Ratios of 2x or maybe 3x or 4x are more reasonable.
I am assuming that most SVE hardware is narrow (128/256-bit SIMD), and I am assuming that the software developer knows to select an appropriate vector width for each use case (128-2048 bits). I was not talking about wide fixed-width (2048-bit) SIMD, but more generally about using narrow (fixed-width) SIMD to execute wider vectors (Intel GPUs already use variable-width vectors).

When you have a very wide loop, you use 2048-bit wide vectors; when you have a 4D AoS vector class, you use 128-bit vectors. Maybe 256-bit is best for string processing. You can select the vector width separately for each algorithm, based on your data size and your branch granularity, among other factors. This approach should be the "best of both worlds". 2048-bit vectors provide very dense code (excellent i$ utilization, good latency hiding). Every instruction provides work for multiple cycles. But for smaller jobs and/or incoherent, branch-heavy code you'd likely want to use narrow vectors such as 128/256/512 bits in your code.

I 100% agree with you that wide fixed-size SIMD has problems with lane utilization. We already see this with AVX (8 wide) in some cases. AVX-512 (16 wide) is practically only usable for large-scale batch processing (similar to code that runs on a GPU). AVX-1024 would match the CUDA warp size in branch granularity. I would understand perfectly if Intel were interested in a unified variable-width vector ISA with "narrow" HW SIMD (128-bit on common hardware, 256/512-bit for HPC).

It's also worth noting that x86-64 doesn't use dedicated floating point registers and floating point hardware anymore. The bottom 32 bits (or 64 bits) of XMM0-XMM15 are used for floating point math. A single SIMD lane only; the other lanes are disabled. Similarly, when software written for SVE uses 4-wide (128-bit) vectors on a wider CPU SIMD (256-bit, for example), the upper lanes would simply be disabled. This is certainly not the optimal case for CPU transistor count, but we can assume that the extra lanes don't consume much power when disabled. AVX hardware already power manages the wide vector hardware pretty well. Agner Fog has some interesting measurements of how long it takes to power up the SIMD units completely (after a long period with no AVX code). SVE-style code would actually allow better power management, since you could always execute it with 128-bit SIMD. The CPU could keep the wide SIMD lanes disabled until it notices that the 128-bit SIMD is becoming a bottleneck.
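As an aside, the "single lane of an XMM register" point is literally what scalar float math looks like through the SSE intrinsics (a sketch; the intrinsics are standard, the wrapper function is just an example):

#include <immintrin.h>

/* x86-64 scalar float math runs on lane 0 of an XMM register;
   the upper lanes are simply unused. */
float scalar_add(float a, float b)
{
    __m128 va = _mm_set_ss(a);                 /* lane 0 = a, lanes 1-3 = 0 */
    __m128 vb = _mm_set_ss(b);
    return _mm_cvtss_f32(_mm_add_ss(va, vb));  /* ADDSS touches lane 0 only */
}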
 
Cracking instructions might also not be done in the same way AMD handles the 256-bit split, since it could need to be implemented differently to run out of order and at speed for longer or varied lengths.
GPUs don't even split the wide instruction. For example, a GCN SIMD just starts executing the same instruction for 4 cycles in a row (16 lanes at a time). A SIMD has four 16-wide instructions pipelined (executing at different stages). Of course this kind of execution wouldn't be optimal for OoO hardware. Splitting the wide instruction into micro-ops would allow executing the parts separately in case of stalls (a cache miss, for example). This would actually be very nice with OoO hardware, as it would "see" lots of wide (2048 bit) vector instructions simultaneously and could optimize the execution order (of SIMD-wide partial vectors) based on L1 cache residency.

AMD (Bulldozer/Jaguar) splits 256-bit vectors into 128-bit SIMD instructions when the instructions are decoded. This is half rate compared to the fast path, and costs as much as two 128-bit SIMD instructions after that. It practically only saves i$ lines. It would be better to split the vectors later, but if you want to track the data of each 128-bit partition separately (for OoO), then you need to split much earlier than GPUs do.
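Expressed in software rather than in the decoder (purely my illustration with AVX/SSE intrinsics, not what the hardware literally emits), the 256-bit split is roughly this shape of work: two 128-bit adds plus the bookkeeping to put the halves back together.

#include <immintrin.h>

/* Illustrative: one 256-bit add done as two 128-bit halves,
   roughly the work a double-pumped 128-bit machine performs. */
__m256 add256_as_two_128(__m256 a, __m256 b)
{
    __m128 lo = _mm_add_ps(_mm256_castps256_ps128(a),
                           _mm256_castps256_ps128(b));
    __m128 hi = _mm_add_ps(_mm256_extractf128_ps(a, 1),
                           _mm256_extractf128_ps(b, 1));
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}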
 
Zen only has 1 cop/mop for 256-bit instructions, and they are split into uops at the dispatch/schedule stage (just like all other mops/cops). So it's still 1/2-rate for 256-bit ops, but it also has 4 128-bit units.
 
GPUs don't even split the wide instruction. For example, a GCN SIMD just starts executing the same instruction for 4 cycles in a row (16 lanes at a time).
Lately, I have been wondering if that's entirely the case for the internal pipelines in the CU.
Externally there wouldn't be much difference, but there are certain operations, such as LDS direct reads and scalar sources, that change behavior without an explicit wait count, and vector instruction types like DPP that would involve at least some different internal control signals for each row in a wavefront, including adjustments for the wavefront's base register and indexing. A micro-sequencer could do some of the monitoring for special operands/values, and make changes for each row. In the case of designs with a different FP ratio, it would save design work if there's one vector sequencer that looks up a different set of internal control words based on the ratio, particularly for the GPUs that have a higher architectural FP rate that is reduced for consumer cards.
On the other hand, there are places where manual wait states are required where one would think an integrated pipeline could catch the hazard, unless perhaps there's a sequencer that has limits to what it can check.

I'm not sure where the translation would happen, but a lot of the routine behaviors for GCN would be a complex op if this were a CPU.


A SIMD has four 16-wide instructions pipelined (executing at different stages). Of course this kind of execution wouldn't be optimal for OoO hardware. Splitting the wide instruction into micro-ops would allow executing the parts separately in case of stalls (a cache miss, for example). This would actually be very nice with OoO hardware, as it would "see" lots of wide (2048 bit) vector instructions simultaneously and could optimize the execution order (of SIMD-wide partial vectors) based on L1 cache residency.
The GPU is load/store with a weak memory model, which helps isolate difficulties. The x86 cores have complicated and coherent memory pipelines, and semantics that keep a mem/reg instruction's worst case as an access split across 2 cache lines.
For Intel, at least, the steady state execution model would have native-width registers, so any split L1 access would not reach an ALU until the source register has been fully populated. The transition states or mixed modes incur penalties.
For the allocation/rename stage, ROB tracking/retire, scheduler capacity, memory access, and status/exceptions, splitting a wide op isn't quite that nice. A vector format that cannot reside in a single cache line could require some rethinking of the internal sequencing and the architecture in general.
 