SSE4, future processors and GPGPU thoughts

In that case you either make up your own arrays or you simply don't use them.

Huh?

So what? Many are still amenable to being SoAed.

"can" is very different from "practical". And I'd say most of typical 3D related tasks are not fit for SoA. Most stuff just isn't 100% parallel. You need both horizontal and vertical operations. A typical simple vector operation like dot(lightPos - pos, normal) requires both.

Having smaller granularity almost always gives you a better memory access pattern above a certain threshold.

You typically don't access only one attribute at a time. When you do, SoA makes sense, since the attributes are independent anyway. That's usually not the case, though, so AoS is the more natural structure, and gives a better memory access pattern: localized accesses rather than scattered reads from several streams.
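For readers less used to the terminology, here is a minimal C++ sketch of the two layouts being argued about (the struct names, the attribute set, and the array size are all made up for illustration):

```cpp
#include <cassert>
#include <cstddef>

// AoS: all attributes of one vertex are adjacent, so computing something
// like dot(lightPos - pos, normal) for a single vertex reads one
// contiguous chunk of memory.
struct VertexAoS {
    float x, y, z;      // position
    float nx, ny, nz;   // normal
};

// SoA: each attribute gets its own stream. Great when you process the
// same attribute of many vertices; scattered reads when you need all
// attributes of one vertex.
constexpr std::size_t kCount = 1024;  // illustrative size
struct VerticesSoA {
    float x[kCount], y[kCount], z[kCount];
    float nx[kCount], ny[kCount], nz[kCount];
};
```

In the AoS struct the six attributes of a vertex sit within 24 bytes of each other; in the SoA struct the same six values of one vertex are spread across six arrays that are 4 KB apart.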

SIMD for small computations here and there won't change your performance one bit; no one is going to gain a thing just because they compute a few dot products instead of a few madds.

Those dot products are going to be scattered across your entire application and will speed up lots of operations. You can write your code with inline functions and get perfectly readable code, yet get all the advantages of SIMD.

Not really: SoftWire. A far better alternative if you want to get the assembly code you actually wrote...

Could you elaborate on what exactly this app does? There's no description on that page.

I fully agree. The only way to get the real potential of SIMD is to write the whole bottleneck in assembly. In many cases that means a whole loop or a function, not just a few of the vector operations inside of it.

Assuming you have a single bottleneck.
 
You need both horizontal and vertical operations. A typical simple vector operation like dot(lightPos - pos, normal) requires both.
Code:
movaps xmm0, [lightPos_x]
movaps xmm1, [lightPos_y]
movaps xmm2, [lightPos_z]
subps xmm0, [pos_x]
subps xmm1, [pos_y]
subps xmm2, [pos_z]
mulps xmm0, [normal_x]
mulps xmm1, [normal_y]
mulps xmm2, [normal_z]
addps xmm0, xmm1
addps xmm0, xmm2
No horizontal operations needed, 4 values in 11 instructions (versus 1 in 5 with horizontal operations).

The hard part is getting 4 different x coordinates in pos_x, etc., and keeping data in registers without running out of them. Here's one software renderer doing it this way: softwarerenderer.com. Many SSE-optimized raytracers work like this as well. Horizontal operations are still more practical, but I wouldn't say you truly need them, or that 3D-related tasks are not fit for SoA.
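For reference, here is a scalar C++ sketch of what the SSE sequence above computes: the "lighting" value dot(lightPos - pos, normal) for four points in SoA form. Each loop iteration corresponds to one SIMD lane; subps/mulps/addps would do all four lanes per instruction, with no horizontal adds needed. The function name and parameter layout are illustrative, not from any real codebase:

```cpp
#include <cassert>

// Four lighting dot products at once, data in SoA form.
void lighting_soa(const float lx[4], const float ly[4], const float lz[4],
                  const float px[4], const float py[4], const float pz[4],
                  const float nx[4], const float ny[4], const float nz[4],
                  float out[4]) {
    for (int i = 0; i < 4; ++i) {
        const float dx = lx[i] - px[i];  // subps
        const float dy = ly[i] - py[i];
        const float dz = lz[i] - pz[i];
        out[i] = dx * nx[i] + dy * ny[i] + dz * nz[i];  // mulps + 2x addps
    }
}
```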
Those dot products are going to be scattered across your entire application and will speed up lots of operations. You can write your code with inline functions and get perfectly readable code, yet get all the advantages of SIMD.
...
Assuming you have a single bottleneck.
The dot products will indeed be scattered across the application, but only a few out of the hundred will be in a bottleneck and potentially contribute to a speedup. In any given application there are one, two, maybe three bottlenecks. Not more. And it will give vastly better results to focus on just these, rewriting them in pure assembly, than to 'optimize' all hundred dot products with badly inlined intrinsics.
Could you elaborate on what exactly this app does? There's no description on that page.
It's a dynamic code generator for C++, with a syntax very close to Intel assembly:
Code:
movaps(xmm0, xword_ptr [lightPos_x]);
movaps(xmm1, xword_ptr [lightPos_y]);
movaps(xmm2, xword_ptr [lightPos_z]);
subps(xmm0, xword_ptr [pos_x]);
subps(xmm1, xword_ptr [pos_y]);
subps(xmm2, xword_ptr [pos_z]);
mulps(xmm0, xword_ptr [normal_x]);
mulps(xmm1, xword_ptr [normal_y]);
mulps(xmm2, xword_ptr [normal_z]);
addps(xmm0, xmm1);
addps(xmm0, xmm2);
It also supports x86-64. Its successor powers SwiftShader.
 
What I think could make a lot of sense is to add SPE-like x86 cores. By removing out-of-order execution, we could have for example two full cores, and about eight SPE-like cores, on the die space of a regular quad-core.

That would be more painful for x86 than it would be for other ISAs. Even with 16 registers in x86-64, the register pressure would be immense. Register renaming is one of the reasons why x86 got by with just 8 integer registers for so long.

For an in-order core, the lack of registers would be brutal. The SPE's large set is no accident.
Adding enough while keeping the cores x86 compatible would be hard, since this would affect every instruction that uses an integer register. It would be akin to yanking out a car's transmission and rebuilding it while the car's running.

edit:
Actually, I minimized the issue. The limited number of regs would apply to any instruction that used a register: int, MMX, FP, SSE. Unless they intend to completely junk x86 semantics, there isn't much room to add addressing for more registers.
 
Last edited by a moderator:
The hard part is getting 4 different x coordinates in pos_x, etc, and keeping data in registers without running out of them.

Yup. By the time you've assembled 4 x-values, 4-y values etc. you've pretty much already lost to a plain FPU implementation. Unless you know beforehand exactly what values go together and can place coordinates together statically I'd say this is not practical. Sometimes it's just plain impossible to arrange things this way. Take traversing a BSP for example. You need to do a single dot product at each node, then based on the result you decide to go to the front tree or back tree. With horizontal adds this is no problem.
 
That would be more painful for x86 than it would be for other ISAs. Even with 16 registers in x86-64, the register pressure would be immense. Register renaming is one of the reasons why x86 got by with just 8 integer registers for so long.
With an in-order architecture, there would be no need to have register renaming. Also note that latencies can be lower than with an out-of-order architecture, meaning that most of the time dependent instructions can execute the next clock cycle. The 'old' Pentium was in-order, and was able to compete with an out-of-order Pentium Pro. Extended with SSE, x64, and Hyper-Threading it could make a fine mini-core. Note also that Hyper-Threading lowers register pressure.

I realize this is no small feat. And the x86 mini-cores would definitely not perform like the full-blown cores. But I believe that a large number of mini-cores could perform better than just a few full-blown cores that take the same die space, in multi-process/thread situations. Getting rid of out-of-order execution is one approach, but there might be better ones...
For an in-order core, the lack of registers would be brutal. The SPE's large set is no accident.
I believe the Cell SPEs mainly have a large register set because they have no real cache. So it's best to really keep everything in registers as much as possible. With a cache and Hyper-Threading an x86 mini-core could make up for having only 16 registers.

It would still be hard to reach peak performance, but being x86 compatible would be a major advantage. It would run legacy x86 software without a recompile, existing tools could be used, and there would be no bottlenecks caused by ISA-specific threads.
 
Yup. By the time you've assembled 4 x-values, 4-y values etc. you've pretty much already lost to a plain FPU implementation.
It's been a while since I wrote FPU code but I'll take the challenge: ;)
Code:
fld lightPos_1x
fsub pos_1x
fmul normal_1x
fld lightPos_1y
fsub pos_1y
fmul normal_1y
fld lightPos_1z
fsub pos_1z
fmul normal_1z
faddp
faddp
fstp result_1
So that's one 'lighting' operation. To do four you need 48 instructions. The SoA implementation took 11 instructions, 12 when also storing to memory. So we have to do the AoS to SoA conversion in less than 36 instructions:
Code:
movlps xmm0, pos_1[0]
movhps xmm0, pos_2[-8]
movlps xmm1, pos_3[0]
movhps xmm1, pos_4[-8]
movaps xmm2, xmm0
shufps xmm2, xmm1, 0x88   // pos_x
shufps xmm0, xmm1, 0xDD   // pos_y
movlps xmm1, pos_1[8]
movhps xmm1, pos_2[0]
movlps xmm3, pos_3[8]
movhps xmm3, pos_4[0]
shufps xmm1, xmm3, 0x88   // pos_z
And again for the normals (lightPos can be assumed to be constant). So that's a total of 36 instructions for four lighting operations. Even with some stalls it still beats the FPU code. Also note that with horizontal operations it's not incredibly faster. If the result has to be written to memory it takes 24 instructions.
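As a scalar view of what the movlps/movhps/shufps sequence above achieves: it transposes four AoS points (x,y,z triplets) into three SoA streams. The SSE version does it in 12 instructions by loading pairs of points and shuffling; in this hypothetical C++ sketch each assignment stands in for one lane of those loads/shuffles:

```cpp
#include <cassert>

// Transpose four packed xyz points into separate x, y and z arrays.
void aos_to_soa(const float pos[4][3],
                float x[4], float y[4], float z[4]) {
    for (int i = 0; i < 4; ++i) {
        x[i] = pos[i][0];
        y[i] = pos[i][1];
        z[i] = pos[i][2];
    }
}
```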

But the biggest advantage is that once the data is in SoA format, every following operation can be done with highly efficient SoA code (note that with horizontal operations the w component is often not used). If besides the lighting operation you also need some other processing the balance easily tips toward SoA (imagine you also have to do vector normalization and specular lighting), and you don't need horizontal operations at all. Register management can be a bitch, but this was pretty much solved with x86-64. When working with scalar data (e.g. the dot product result), SoA is actually more efficient.

I'm not claiming SoA is suited for everything, but as illustrated it's often underestimated. But you have to write the whole processing loop in assembly. With an inlined dot product, even if using horizontal operations, you're not using the full potential of SSE.
Unless you know beforehand exactly what values go together and can place coordinates together statically I'd say this is not practical.
It's practical, and when you can convert to SoA beforehand, any code with horizontal operations doesn't even get close in terms of performance. Raytracers are an example of this.
Sometimes it's just plain impossible to arrange things this way. Take traversing a BSP for example. You need to do a single dot product at each node, then based on the result you decide to go to the front tree or back tree. With horizontal adds this is no problem.
It's unlikely you have to determine just one point-in-node. But even if you do it's possible to already compute the dot product for the child nodes, so you can traverse two nodes per iteration. Note that the position is constant so it can be stored in SoA form, and the plane data can also be packed in triplets. Ok maybe it's not faster in practice, or just not practical to code, but don't say it's plain impossible. Anyway, traversing the BSP for multiple points at once would be the way to go, in which case SoA is definitely the fastest, and practical. I can hardly believe that if you have to locate just one point it would be a bottleneck of the application.
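The "multiple points at once" idea can be sketched like this: classify four query points against one node plane a·x + b·y + c·z + d in a single pass, each lane deciding front vs. back independently. This is a hedged illustration in scalar C++; the function and parameter names are invented, and no real BSP traversal code from the thread is reproduced:

```cpp
#include <cassert>

// Classify four SoA points against the plane a*x + b*y + c*z + d = 0.
// front[i] is true when point i lies on or in front of the plane.
void classify_vs_plane(const float x[4], const float y[4], const float z[4],
                       float a, float b, float c, float d, bool front[4]) {
    for (int i = 0; i < 4; ++i)
        front[i] = (a * x[i] + b * y[i] + c * z[i] + d) >= 0.0f;
}
```

An SSE version would replace the loop body with one mulps/addps chain plus a cmpps, producing a four-lane mask instead of a bool array.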
 
With an in-order architecture, there would be no need to have register renaming.
Then the full penalty of the tiny register pool would be revealed. OoO does a very impressive job of hiding or removing the latencies associated with such a limited number of register addresses.

Also note that latencies can be lower than with an out-of-order architecture, meaning that most of the time dependent instructions can execute the next clock cycle.
Dependent instructions can execute in the next clock cycle with an OoO chip, there's no restriction against that. In-order can potentially shorten a clock cycle or reduce the power draw, but it doesn't magically do anything to how instructions are processed.

The 'old' Pentium was in-order, and was able to compete with an out-of-order Pentium Pro. Extended with SSE, x64, and Hyper-Threading it could make a fine mini-core. Note also that Hyper-Threading lowers register pressure.

I'm not recalling any major wins against the Pentium Pro, especially not once code moved to 32-bit. I can't remember enough about what the gains or penalties were from that situation.
I doubt any modern code that isn't exclusively pointer chasing would run better on an in-order Pentium, since modern code has adapted to a different target architecture.

Hyperthreading does not lower register pressure for a given process any more than having two cores lowers register pressure. The register set is unique to each thread, so each thread will have to deal with it.

If having multiple hardware threads allows smaller contexts, it might indirectly reduce problems with register pressure, but at the cost of degrading cache performance and contention for shared resources.

I believe the Cell SPEs mainly have a large register set because they have no real cache. So it's best to really keep everything in registers as much as possible. With a cache and Hyper-Threading an x86 mini-core could make up for having only 16 registers.
Is that why the PPE has a larger register set than x86, or why the Itanium has such a large register set despite having perhaps some of the best cache architecture around?

In-order cores rely heavily on compiler optimizations, including things like software pipelining and loop unrolling. These optimizations burn through software-visible registers very quickly. Unrolling is, at least for small loops, done by OoO scheduling hardware, and a lot of speculation is handled without explicitly consuming visible registers.

Any kind of heavy software optimization is likely to consume a large portion of the visible register set, which is why it was so helpful for a register-starved architecture to use renaming that gave programs more room.
 
With an in-order architecture, there would be no need to have register renaming.
Register renaming and OoO execution are orthogonal concepts. The presence of both can magnify the performance gain, but, especially on an x86, just the presence of renaming would already be a win. Register renaming was already used sometime in the early seventies (I believe), long before OoO was in use.
 
Dependent instructions can execute in the next clock cycle with an OoO chip, there's no restriction against that. In-order can potentially shorten a clock cycle or reduce the power draw, but it doesn't magically do anything to how instructions are processed.
I know, but the essential part is to create x86 mini-cores. You have to sacrifice something. Only a minor fraction of a full-blown x86 core is execution units and other essential parts, the rest is to support out-of-order execution. So going in-order is one way to allow lots of mini-cores and increase total throughput.
Hyperthreading does not lower register pressure for a given process any more than having two cores lowers register pressure. The register set is unique to each thread, so each thread will have to deal with it.
Two cores take twice the die space; Hyper-Threading only requires extra registers and a bit of control to keep track of which thread an instruction belongs to. It lowers register pressure in the sense that the core stalls less because of register starvation. With four threads per mini-core it might get the same utilization as a full-blown core, but with far fewer transistors!
If having multiple hardware threads allows smaller contexts, it might indirectly reduce problems with register pressure, but at the cost of degrading cache performance and contention for shared resources.
The cache will always have to be shared, whether the cores are in-order or out-of-order. There's also only a limited amount of thread-specific variables, the rest is shared/streaming data. Contention for shared resources is ok as long as there is good utilization.
Is that why the PPE has a larger register set than x86, or why the Itanium has such a large register set despite having perhaps some of the best cache architecture around?
I won't deny that x86 had too few registers from the start. But consider the alternatives. If you add mini-cores with a different ISA then you get a Cell clone. It wouldn't sell, because it takes many years to develop software for, and there's no backward compatibility. That's ok for a game console but unacceptable for the server/desktop markets. A GPU core would have the same problem. How do you write applications for systems with a GPU in the CPU and/or chipset and/or PCI-Express slot? With x86 mini-cores you keep full compatibility for old and new software.

The in-order Pentium compensated for the lack of registers with a very fast L1 cache. Stack variables almost function as extra registers. Modern caches require multiple clock cycles, but this is where the Hyper-Threading comes in. Individual threads run quite slow, but it's the combined throughput that counts.
 
Register renaming and OoO execution are orthogonal concepts. The presence of both can magnify the performance gain, but, especially on an x86, just the presence of renaming would already be a win. Register renaming was already used sometime in the early seventies (I believe), long before OoO was in use.
The Pentium Pro introduced both out-of-order execution and register renaming at the same time. As far as I know, you could have out-of-order execution without register renaming, but it's rather inefficient. Register renaming without out-of-order execution seems pointless to me. But I'm interested in any technique that improves in-order execution performance. Could you give me a simple example of code executed in-order which would benefit from register renaming?
 
I know, but the essential part is to create x86 mini-cores. You have to sacrifice something. Only a minor fraction of a full-blown x86 core is execution units and other essential parts, the rest is to support out-of-order execution. So going in-order is one way to allow lots of mini-cores and increase total throughput.
The quote I was addressing was the claim that having in-order allowed dependent instructions to execute in the next cycle, as if OoO prevented it. Both in-order and out-of-order cores do this.

Two cores take twice the die space; Hyper-Threading only requires extra registers and a bit of control to keep track of which thread an instruction belongs to. It lowers register pressure in the sense that the core stalls less because of register starvation. With four threads per mini-core it might get the same utilization as a full-blown core, but with far fewer transistors!
That has nothing to do with register pressure. Since each individual thread still only sees the architectural registers in the ISA, you could have five billion hyper-threads, and they'd all be constrained by the same small software-visible register pool.

The cache will always have to be shared, whether the cores are in-order or out-of-order. There's also only a limited amount of thread-specific variables, the rest is shared/streaming data. Contention for shared resources is ok as long as there is good utilization.
There are limited gains beyond a certain point of SMT. Intel's magic limit was 2 simultaneous threads. IBM had up to 4, I believe. After that, the penalties from contention of buffers, register ports, and other critical resources make it better to just go with separate cores.

Hyperthreading also assumes a significant number of free parallel resources, which for an SPE-type core wouldn't exist. There wouldn't be 5 spare execution units waiting for another thread to take over, and there wouldn't be 5 times as many register ports to handle the traffic.

SMT is useful when a wide processor is having trouble utilizing its resources. If the core is narrow, it is not that useful.

Some switch on event or coarser methods would be preferable.

I won't deny that x86 had too few registers from the start. But consider the alternatives. If you add mini-cores with a different ISA then you get a Cell clone. It wouldn't sell, because it takes many years to develop software for, and there's no backward compatibility. That's ok for a game console but unacceptable for the server/desktop markets. A GPU core would have the same problem. How do you write applications for systems with a GPU in the CPU and/or chipset and/or PCI-Express slot? With x86 mini-cores you keep full compatibility for old and new software.
Sometimes it is better to write code for something that is new but performs well as opposed to using old code on something that performs like a joke. Software programs using a system wouldn't like being ambushed by suddenly horrible performance.

The in-order Pentium compensated for the lack of registers with a very fast L1 cache. Stack variables almost function as extra registers. Modern caches require multiple clock cycles, but this is where the Hyper-Threading comes in. Individual threads run quite slow, but it's the combined throughput that counts.

That's more amenable to a Niagara approach, with non-threaded simple cores. Hyperthreading would do little without bloating a core of this type, and even Niagara's cores have more registers.

edit: non-hyperthreaded simple cores

The Pentium Pro introduced both out-of-order execution and register renaming at the same time. As far as I know, you could have out-of-order execution without register renaming, but it's rather inefficient. Register renaming without out-of-order execution seems pointless to me. But I'm interested in any technique that improves in-order execution performance. Could you give me a simple example of code executed in-order which would benefit from register renaming?

OoO through scoreboarding doesn't use register renaming. As a result, it is vulnerable to stalls on false dependencies where register names are reused. With a small register pool, it becomes much more likely.

I believe register renaming is very helpful for speculating past branches in the case of loops, which wind up using the same register names over and over.

I think some eager execution schemes try to speculatively execute in-order and try to avoid false dependencies through renaming.
 
The Pentium Pro introduced both out-of-order execution and register renaming at the same time. As far as I know, you could have out-of-order execution without register renaming, but it's rather inefficient. Register renaming without out-of-order execution seems pointless to me. But I'm interested in any technique that improves in-order execution performance. Could you give me a simple example of code executed in-order which would benefit from register renaming?

The Pentium-Classic CPU used register renaming heavily in order to be able to wring reasonable performance out of its FPU; in particular, the FXCH instruction was carefully implemented to do a register renaming instead of a physical register swap. This was widely used as e.g. follows:
Code:
FADD st,st(2)
FXCH st(3)
FADD st,st(4)
where the FXCH, instead of adding an extra data dependency between the two FADDs, actually breaks a dependency, allowing the second FADD to start before the first one is finished. I have also seen register renaming used to break dependencies in other in-order processors - the code sequence
Code:
ADD EAX, EBX
ADD EBX, ECX
ran in 1 cycle on e.g. the old Cyrix 6x86; by renaming the EBX register, the processor would break a false dependency (the Pentium-Classic, OTOH, did not do renaming on integer registers, so that code sequence took 2 cycles.)
 
The Pentium Pro introduced both out-of-order execution and register renaming at the same time. As far as I know, you could have out-of-order execution without register renaming, but it's rather inefficient. Register renaming without out-of-order execution seems pointless to me. But I'm interested in any technique that improves in-order execution performance. Could you give me a simple example of code executed in-order which would benefit from register renaming?

(It's been a long time since my last x86 code...)

MOV ax, [...]
MOV bx, [...]
ADD BX, AX, BX
STR [...], BX
MOV BX, [...]
ADD CX, AX, BX

The second usage of BX isn't in any way related to the first one, but as long as the previous instructions are still in the pipeline, the second MOV BX can stall.

Now introduce loop unrolling in your code and you will have tons of pesky dependencies that are purely there because of register pressure. It's in those cases that renaming without OoO can already give a significant benefit.
 
The quote I was addressing was the claim that having in-order allowed dependent instructions to execute in the next cycle, as if OoO prevented it. Both in-order and out-of-order cores do this.
Sorry for the confusion but that was not a claim. In fact I mentioned forwarding on the first page already. What I did say is that with in-order the complexity is lower, reducing execution latencies. So some instructions that previously took two clock cycles to execute might now require only one.
That has nothing to do with register pressure. Since each individual thread still only sees the architectural registers in the ISA, you could have five billion hyper-threads, and they'd all be constrained by the same small software-visible register pool.
Look, 16 registers is really ok as long as you're not doing heavy software pipelining. It's not ideal, there will still be some spills, but it's close enough to ideal. Now, Hyper-Threading makes software pipelining unnecessary, because instead of working on multiple tasks in the same thread simultaneously, which would require extra logical registers, you work on them simultaneously in separate threads, which uses extra physical registers. So it does help reduce register pressure.

Just compare it with a GPU. It runs many threads per shader unit, but each of them individually only has a small number of physical registers. Correct me if I'm wrong, but don't they typically have far fewer than 16 registers?

So that's the other end of the spectrum. To create x86 mini-cores, the road in the middle would be to have, say, four threads per mini-core, and 64 physical registers. So both the number of logical and physical registers is manageable and the end result is a good throughput per core.
There are limited gains beyond a certain point of SMT. Intel's magic limit was 2 simultaneous threads. IBM had up to 4, I believe. After that, the penalties from contention of buffers, register ports, and other critical resources make it better to just go with separate cores.
That's with out-of-order execution. Beyond two threads the complexity quickly becomes unmanageable. You have to widen every component required to support out-of-order execution, which is a large part of the core area, to the point where having separate cores is indeed far more interesting. But with in-order execution the complexity is fairly low and you can keep adding threads up to the point where there's a good balance between core utilization and core area.
Hyperthreading also assumes a significant number of free parallel resources, which for an SPE-type core wouldn't exist. There wouldn't be 5 spare execution units waiting for another thread to take over, and there wouldn't be 5 times as many register ports to handle the traffic.
The Pentium 4 with Hyper-Threading didn't have extra execution units, yet the combined throughput of both threads was higher than one thread with Hyper-Threading disabled. That's because each thread only uses a limited number of execution units at any time. The major reason it didn't work so well in practice was because speculative execution of each thread took away precious resources from the other thread. This doesn't happen with just one thread (even if 1 out of 10 speculatively executed instructions is correct that's still a win). So essentially out-of-order execution was the problem.

With in-order execution, Hyper-Threading would be used to hide latencies, and only has a positive effect on the combined throughput. Every executed instruction is as good as one from another thread. And even with four threads I don't see a big need to add extra execution units (although that's definitely an option). We just need high utilization. If adding extra execution units lowers the total utilization, even if it makes an individual thread run faster, don't do it. Increasing single-thread performance at all costs used to be the motto for Intel for over a decade. Now they really have to start looking at throughput per area, just like a GPU.
SMT is useful when a wide processor is having trouble utilizing its resources. If the core is narrow, it is not that useful.

Some switch on event or coarser methods would be preferable.
That's true when out-of-order execution is used to keep utilization high. With smaller in-order cores, wide or narrow, it has to hide latency to increase utilization.

Out-of-order execution is in terms of hardware just a very expensive way to have a high utilization. It was devised specifically to keep CPUs single-threaded. Now that the barriers have been broken, there's no strict need to hang on to it. With in-order execution and Hyper-Threading ideally we'd have the same utilization per core, just far smaller cores.
Sometimes it is better to write code for something that is new but performs well as opposed to using old code on something that performs like a joke. Software programs using a system wouldn't like being ambushed by suddenly horrible performance.
If that were true for CPUs, x86 would be something only my dad remembers, and we'd be coding for 16-core already. And every couple of years a new ISA.

Don't underestimate the importance of software. If you create a new CPU and none of the legacy software runs any faster on it, it won't sell. Just like you can't sell a GPU that isn't DirectX compatible. And it's the server market that pays for much of the research cost, before the architecture goes mainstream.

That's why I believe x86 mini-cores would be very interesting. You do get a speedup for legacy software, and especially server applications would get almost the full potential. I also don't believe it would "perform like a joke". There's no reason why an x86 CPU with mini-cores couldn't perform equivalently to a Cell processor. It also has the benefit of already having excellent development tools.
That's more amenable to a Niagara approach, with non-threaded simple cores. Hyperthreading would do little without bloating a core of this type, and even Niagara's cores have more registers.
Niagara cores are threaded. Strictly speaking it's CMT. In fact that's what I pretty much have in mind for x86 mini-cores. Heck, Intel could easily get a bite out of Sun's server market with an x86 processor with mini-cores. And with 32 registers, Niagara's SPARC cores don't really have an abundance like Cell's SPEs either. So I'm sure 16 is adequate when you have a fast L1 cache.
 
The Pentium-Classic CPU used register renaming heavily in order to be able to wring reasonable performance out of its FPU; in particular, the FXCH instruction was carefully implemented to do a register renaming instead of a physical register swap.
It took one clock cycle and did a physical register swap. Only starting with the out-of-order Pentium Pro is it a free instruction, implemented via register renaming.

Even if it did use something similar to register renaming on the classic Pentium, I would still categorize it as 'software register renaming'. It's not an option nor would it be very useful to add it for integer, MMX or SSE. FXCH was specifically for the FPU's stack-based register set. In fact if you add an FXCH to every FPU instruction it would behave like a linear register set. But it doesn't improve in-order execution.
I have also seen register renaming used to break dependencies in other in-order processors - the code sequence [...] ran in 1 cycle on e.g. the old Cyrix 6x86; by renaming the EBX register, the processor would break a false dependency (the Pentium-Classic, OTOH, did not do renaming on integer registers, so that code sequence took 2 cycles.)
Cyrix 6x86 has out-of-order execution. Furthermore I doubt this code created a false dependency on a Pentium. These instructions can execute simultaneously without register renaming.
 
(It's been a long time since my last x86 code...)
You didn't have to write x86 code, but that is not x86 code. ;)
The second usage of BX isn't in any way related to the first one, but as long as the previous instructions are still in the pipeline, the second MOV BX can stall.
You can execute it as long as they complete in-order. Allow me to illustrate:
Code:
mov r0, [...]   ; load (2-cycle latency assumed)
mov [...], r0   ; store: r0 goes into the write buffer
mov r0, [...]   ; reuses r0, but can start before the store completes
mov [...], r0
...
This is a typical situation for copying memory blocks. Assume every instruction has a latency of two clock cycles. Then these four instructions can still complete in seven clock cycles (instead of eight). The third instruction can start before the second has finished because the value of r0 is already in the write buffer (assuming of course there is a write buffer and it doesn't stall because of aliasing analysis). Anyway, register renaming wouldn't help as long as instructions are executed in-order. Also note that CMT would completely hide the latency.
 
Look, 16 registers is really ok as long as you're not doing heavy software pipelining. It's not ideal, there will still be some spills, but it's close enough to ideal. Now, Hyper-Threading makes software pipelining unnecessary, because instead of working on multiple tasks in the same thread simultaneously, which would require extra logical registers, you work on them simultaneously in separate threads, which uses extra physical registers. So it does help reduce register pressure.
You've traded register pressure for inter-thread synchronization issues and cache thrashing, that's not exactly a home-run. All those extra register spills also tie up the cache ports and load/store units, which every thread is going to have to share.

Just compare it with a GPU. It runs many threads per shader unit, but each of them individually only has a small number of physical registers. Correct me if I'm wrong, but don't they typically have far less than 16 registers?
Each shader unit isn't a fully independent core, its context is defined already by the command processor and its memory traffic goes to a separate and shared memory unit. Much of the extra work that would have gone into a larger register set has been computed and done already.

If its workload weren't so data parallel, the cost of doing that centrally would have been prohibitive. Each thread is an independent context, the virtual processor running it can't assume that it will be spoon-fed anything.

That's with out-of-order execution. Beyond two threads the complexity quickly becomes unmanageable. You have to widen every component required to support out-of-order execution, which is a large part of the core area, to the point where having separate cores is indeed far more interesting. But with in-order execution the complexity is fairly low and you can keep adding threads up to the point where there's a good balance between core utilization and core area.
Cache performance also suffers significantly, and going in-order doesn't make SMT's demands for more ports go away. Simultaneity is the issue with SMT, since a large part of the bypass, cache, and register file routing must still be designed as if the processor were as wide as the number of threads expected to execute simultaneously.

If that is not what is desired, switch-on-event or fine-grained threading would be better when there isn't a large surplus of execution units available, whether from the width of the processor or from a large number of bubbles in a very long pipeline.

The Pentium 4 with Hyper-Threading didn't have extra execution units, yet the combined throughput of both threads was higher than one thread with Hyper-Threading disabled. That's because each thread only uses a limited number of execution units at any time. The major reason it didn't work so well in practice was that speculative execution of each thread took away precious resources from the other thread. This doesn't happen with just one thread (even if only 1 out of 10 speculatively executed instructions turns out to be on the correct path, that's still a win). So essentially out-of-order execution was the problem.
Yes it did. The Pentium 4 was much wider than an SPE, about 3 times as wide.
SMT did poorly because Intel's implementation cut everything in half, without respect to a given thread's needs. That, coupled with the extremely involved replay mechanism that gobbled up much of the spare units anyway, made threaded performance so uneven.

IBM's SMT does much better at four threads, and other, coarser threading methods don't go much further per core, because everything else around the core becomes the bottleneck.

With in-order execution, Hyper-Threading would be used to hide latencies, and only has a positive effect on the combined throughput. Every executed instruction is as good as one from another thread. And even with four threads I don't see a big need to add extra execution units (although that's definitely an option). We just need a high utilization. If adding extra execution units lowers the total utilization, even if it makes an individual thread run faster, don't do it. Increasing single-thread performance at all cost was Intel's motto for over a decade. Now they really have to start looking at throughput per area, just like a GPU.
High utilization doesn't mean anything if each thread sees a disproportionate amount of contention. One ALU shared between 1000 threads will have 100% utilization, but you still wouldn't want it to run code.

That's true when out-of-order execution is used to keep utilization high. With smaller in-order cores, wide or narrow, it has to hide latency to increase utilization.
Unless that core is wide, or its pipeline is very long, SMT will have limited effect. There are less aggressive strategies that would do better at lower silicon cost.

Out-of-order execution is in terms of hardware just a very expensive way to have a high utilization. It was devised specifically to keep CPUs single-threaded. Now that the barriers have been broken, there's no strict need to hang on to it. With in-order execution and Hyper-Threading ideally we'd have the same utilization per core, just far smaller cores.
The barriers have been broken in the tasks where they have been found to be broken, except when they're not.

If that was true for CPUs, x86 would be something only my dad remembers, and we'd be coding for 16-core already. And every couple years a new ISA.

Don't underestimate the importance of software. If you create a new CPU and none of the legacy software runs any faster on it, it won't sell. Just like you can't sell a GPU that isn't DirectX compatible. And it's the server market that pays for much of the research cost, before the architecture goes mainstream.
How much legacy software do you expect to run faster on a SPE-like in-order x86 core?
Do you think going superscalar and OoO only affected programs made after they were introduced?

There's no reason why an x86 CPU with mini-cores couldn't perform equivalently to a Cell processor. It also has the benefit of already having excellent development tools.
Perform equivalently to a Cell processor in which workloads?
The ones Cell is already good at, or just the ones it does passably or underwhelmingly at?

Niagara cores are threaded. Strictly speaking it's CMT. In fact that's what I pretty much have in mind for x86 mini-cores. Heck, Intel could easily get a bite out of Sun's server market with an x86 processor with mini-cores. And with 32 registers Niagara's Sparc cores don't really have an abundance like Cell's SPEs either. So I'm sure 16 is adequate when you have a fast L1 cache.
CMT is not hyperthreading, and the cores actually use a hybrid of fine-grained round-robin threading and CMT.
Hyperthreading is specifically Intel's implementation of SMT.
Due to fine-grained threading, each thread issues an instruction every 4 cycles. That actually hides latency, and it keeps the pressure off of the L1 cache, which you assume can be made fast enough for 4 simultaneous threads to compete with a register file.
 
Nick, as for mini-cores, I have something that might interest you:

'AMD's system resource management involves OS-CPU communication about what sort of execution resources an application needs at a given time. Presumably, such information could be used to invoke the reconfiguration of one or more processor cores in order to meet the app's demands, and to help direct the dispatching of threads to those cores. Such a facility might also be used to enable a power-saving idea that AMD has mentioned a number of times: the use of a simple, low-power, x86-compatible processor core to handle basic system chores and spin off more intensive threads to beefier cores when needed.'

But something sounds strange: if the future is mini-cores, then what type of core are they talking about? A core even smaller than a mini-core? Or will AMD only go for heavy cores plus one simple core? Maybe they will go for something like a PPE for handling system chores.

Did you know that the UltraSPARC T2 has 8 cores with 8 threads per core, for a total of 64 threads?
 
It took one clock cycle and did a physical register swap. Only from the out-of-order Pentium Pro onward is it a free instruction, implemented via register renaming.
Nope. The Agner Fog optimization guides (manual #3, page 34) (which have been as close to an authoritative guide as we have had for Pentium optimizations for the last decade) disagree with you. However, the PentiumPro FPU did get an additional, fairly large speed boost from doing OoO + hardware register renaming in addition to the software-renaming-FXCH trick. Although you are correct about FXCH-type tricks not really being applicable to processing units with a linear register file. (As for the other example I gave, the Agner guide disagrees with me instead; the ADD EAX,EBX; ADD EBX,ECX sequence does indeed pair on the Pentium-Classic with no renaming.)

Sorry for the confusion but that was not a claim. In fact I mentioned forwarding on the first page already. What I did say is that with in-order the complexity is lower, reducing execution latencies. So some instructions that previously took two clock cycles to execute might now require only one.
I don't see how this could be the case. The added complexity of OoO does not appear within the execution units themselves or their forwarding networks, but in all the support/scheduling logic around them. The decision logic needed to decide whether forwarding should happen or not will be a fair bit more complex in an OoO processor than in an in-order processor, but once the decision is made, the actual forwarding & execution itself will take the exact same amount of time. In a high-clock-speed processor, you would probably even place the decision-logic and the actual-forwarding in separate pipeline stages. AFAICS, the only instructions that inherently get longer visible latencies in an OoO architecture are mispredicted branches (OoO -> additional pipeline stages for instruction reordering -> heavier mispredict penalty).
 