SSE4, future processors and GPGPU thoughts

You've traded register pressure for inter-thread synchronization issues and cache thrashing; that's not exactly a home run.
You can't increase throughput without extra threads, so inter-thread synchronization issues are simply inherent. But for running many processes, like on servers, that's not really a problem. And for multimedia, where the throughput is really needed, there's lots of parallelism. Cache thrashing is also relative: with more threads the cache is holding more useful data. In the extreme case, like a GPU, you even hide memory latency completely with threading, without a cache.
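As a rough illustration of the latency-hiding claim (a toy model with made-up numbers, not a description of any real core):

```python
# Toy model: each thread does C compute cycles, then waits on a memory
# access of latency L. With T threads interleaved, core utilization is
# capped at T*C/(C+L), saturating at 1.0 once enough threads exist.
def utilization(threads, compute_cycles, mem_latency):
    return min(1.0, threads * compute_cycles / (compute_cycles + mem_latency))

print(round(utilization(1, 10, 100), 2))   # 0.09: one thread mostly stalls
print(utilization(12, 10, 100))            # 1.0: enough threads hide it all
```

With a single thread the core sits idle for the whole miss; past a dozen threads (in this made-up example) the memory latency is fully covered, which is the "GPU extreme" described above.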
CMT is not hyperthreading... Hyperthreading is specifically Intel's implementation of SMT.
Both run multiple threads per core. Hyper-Threading is just Intel's marketing name for CMT and I wouldn't be surprised at all if they reused it for other implementations. The essential thing for this discussion is that it's an effective way to increase core utilization, especially for in-order execution. Sun's idea of "throughput computing" is exactly what I have in mind for x86 mini-cores, and I don't believe the ISA would be a major obstacle.
 
'AMD's system resource management involves OS-CPU communication about what sort of execution resources an application needs at a given time. Presumably, such information could be used to invoke the reconfiguration of one or more processor cores in order to meet the app's demands, and to help direct the dispatching of threads to those cores.
That's interesting, especially for legacy software. For new software it would be optimal to let the developer decide which type of core a certain thread should be optimized for, and direct the O.S. to prefer execution on such a core.
Such a facility might also be used to enable a power-saving idea that AMD has mentioned a number of times: the use of a simple, low-power, x86-compatible processor core to handle basic system chores and spin off more intensive threads to beefier cores when needed.'

But something sounds strange: if the future is mini-cores, then what type of core are they talking about, a core even smaller than a mini-core? Or will AMD only go for heavy cores plus one simple core? Maybe they will go for something like a PPE for handling system chores.
Maybe 'beefier cores' doesn't necessarily mean larger ones. Cores with optimized SSE implementations could be quite powerful for multimedia but less suited for O.S. tasks. So I don't think these ideas are mutually exclusive.
Did you know that the UltraSPARC T2 has

8 cores with 8 threads per core, for a total of 64 threads?
Yeah, now imagine the x86 equivalent with full 128-bit SSE units...
 
Nope. The Agner Fog optimization guides (manual #3, page 34), which have been as close to an authoritative guide as we have had for Pentium optimizations for the last decade, disagree with you.
Tomáto, tomäto. I don't know for sure whether the 'register renaming' implementation does physical register copying or not (and I doubt Agner knows), but it behaves like it does. There is no separate logical and physical register set. What matters to the discussion is that it doesn't improve in-order execution performance. It just makes the register stack behave more like a linear register file at low cost.
I don't see how this could be the case. The added complexity of OoO does not appear within the execution units themselves or their forwarding networks, but in all the support/scheduling logic around them. The decision logic needed to decide whether forwarding should happen or not will be a fair bit more complex in an OoO processor than in an in-order processor, but once the decision is made, the actual forwarding & execution itself will take the exact same amount of time. In a high-clock-speed processor, you would probably even place the decision-logic and the actual-forwarding in separate pipeline stages. AFAICS, the only instructions that inherently get longer visible latencies in an OoO architecture are mispredicted branches (OoO -> additional pipeline stages for instruction reordering -> heavier mispredict penalty).
All the wires and multiplexers used for forwarding take a significant bite out of a clock cycle, leaving only a fraction for actual execution. The Pentium 4 even has 'drive' stages, solely for transporting data from one point in the chip to another point. So I'm positive that some instructions that currently take two clock cycles could fit within one clock cycle on a simplified architecture. The ones that take say six clock cycles on an out-of-order architecture could be ready in four on an in-order architecture, making it possible to hide all latency with four-way CMT.
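The four-way CMT claim above can be put in simple arithmetic (my own simplification; it ignores structural hazards and cache misses):

```python
# If T threads issue round-robin, a given thread gets an issue slot once
# every T cycles, so a dependent instruction with latency L stalls for
# max(0, L - T) cycles. (Simplified: no structural hazards, no misses.)
def stall_cycles(num_threads, instr_latency):
    return max(0, instr_latency - num_threads)

print(stall_cycles(1, 4))  # 3: a single in-order thread eats the latency
print(stall_cycles(4, 4))  # 0: four-way threading hides a 4-cycle latency
```

So if simplifying the core brings a six-cycle instruction down to four cycles, four-way threading is exactly enough to make dependent instructions stall-free.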
 
Both run multiple threads per core. Hyper-Threading is just Intel's marketing name for CMT and I wouldn't be surprised at all if they reused it for other implementations. The essential thing for this discussion is that it's an effective way to increase core utilization, especially for in-order execution. Sun's idea of "throughput computing" is exactly what I have in mind for x86 mini-cores, and I don't believe the ISA would be a major obstacle.

Here's the set of acronyms I'm working with.

SMT = Simultaneous Multi-Threading: instructions from different threads are issued and exist in the same pipeline stages with other threads at the same time
CMT = Coarse-grained Multi-Threading: multiple thread contexts are maintained, but only one thread is actually in the pipeline at a given time
FMT = Fine-grained Multi-Threading: multiple thread contexts are maintained, and multiple threads can be in the pipeline, but they never share the same pipe stages

Hyperthreading= The specific implementation Intel has for SMT on the P4.
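A toy schedule makes the CMT/FMT distinction concrete (my own sketch; real implementations vary, and SMT can't be shown this way since it issues from several threads in the same cycle):

```python
# Toy issue schedules: which thread owns the issue slot on each cycle.
# FMT rotates threads every cycle; CMT runs one thread until a switch
# event, modeled here as a fixed interval for simplicity.
def fmt_schedule(num_threads, cycles):
    return [c % num_threads for c in range(cycles)]

def cmt_schedule(num_threads, cycles, switch_every):
    return [(c // switch_every) % num_threads for c in range(cycles)]

print(fmt_schedule(2, 6))     # [0, 1, 0, 1, 0, 1]
print(cmt_schedule(2, 6, 3))  # [0, 0, 0, 1, 1, 1]
```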

The only other core Intel has multithreaded is Montecito, which uses Switch on Event (a variant of CMT). It is not branded as Hyperthreading.

Since SMT has the highest silicon cost for multi-threading, the utility of SMT on a core as narrow as an SPE is reduced, because there is little chance of a unit sitting idle when there aren't any extra units lying around.

A coarser method such as Switch-on-Event or even Niagara's fine-grained shifting between threads would be better, even if not ideal for an in-order x86 mini-core. There is perhaps a lesson in how Niagara's cores don't do multimedia. There certainly would not be a bonus going in-order for any legacy code, except in rather uncommon circumstances.

For the SPE, its large register pool is a sign of what is needed to get good performance out of an architecture that is incapable of routing around stalls. Without a large register pool, loop unrolling, and software pipelining, an in-order core's performance is almost invariably worse.

Since x86 lacks the SIMD capabilities of an SPE, and a mini-x86 offers none of the improvements needed to keep churning through data, an SPE would likely need to be matched by several x86 minis. At that point, the silicon savings are negated.
 
Here's the set of acronyms I'm working with.

SMT = Simultaneous Multi-Threading: instructions from different threads are issued and exist in the same pipeline stages with other threads at the same time
CMT = Coarse-grained Multi-Threading: multiple thread contexts are maintained, but only one thread is actually in the pipeline at a given time
FMT = Fine-grained Multi-Threading: multiple thread contexts are maintained, and multiple threads can be in the pipeline, but they never share the same pipe stages
OK, Sun calls CMT chip multithreading. AnandTech says it's FMT. And that's what I have in mind for x86 mini-cores. But obviously lots of variations are possible. The goal is just to increase core utilization.
Hyperthreading= The specific implementation Intel has for SMT on the P4.

The only other core Intel has multithreaded is Montecito, which uses Switch on Event (a variant of CMT). It is not branded as Hyperthreading.
I don't think it's branded at all. ;)
...or even Niagara's fine-grained shifting between threads would be better, even if not ideal for an in-order x86 mini-core.
Why not?
There is perhaps a lesson in how Niagara's cores don't do multimedia.
They were just never intended for multimedia at all. They even share just one FPU between all eight cores. They were designed specifically for transaction workloads. That doesn't mean that such an architecture with multimedia instructions (SSE) would be bad at multimedia too.
There certainly would not be a bonus going in-order for any legacy code, except in rather uncommon circumstances.
If 64 threads on 8 mini-cores have a higher combined throughput than 2 threads on 2 big cores then I would call that a bonus. Even if they don't, it's still more interesting than one big core and a number of mini-cores with a new ISA, because that doesn't run legacy code faster at all.
For the SPE, its large register pool is a sign of what is needed to get good performance out of an architecture that is incapable of routing around stalls. Without a large register pool, loop unrolling, and software pipelining, an in-order core's performance is almost invariably worse.
GPUs prove you wrong. They don't have 128 registers per thread (not even close), are in-order, and yet total throughput is phenomenal. I realize a GPU thread is not comparable to a CPU thread, but I believe that several small cores each running multiple threads can improve CPU throughput significantly.
Since x86 offers neither the SIMD capabilities of an SPE...
SSE is more than adequate. If you have eight mini-cores each processing four floating-point numbers per clock cycle, that's still fantastic performance for an x86-compatible CPU.
 
GPUs prove you wrong. They don't have 128 registers per thread (not even close), are in-order, and yet total throughput is phenomenal. I realize a GPU thread is not comparable to a CPU thread, but I believe that several small cores each running multiple threads can improve CPU throughput significantly.
The programming model is so different that it really questions your point imho :)
 
GPUs prove you wrong. They don't have 128 registers per thread (not even close), are in-order, and yet total throughput is phenomenal.
http://www.beyond3d.com/forum/showthread.php?t=34122

Mike Houston: There is a BrookGPU CTM backend currently being worked on and we hope to have it public when CTM is public. It is not being used for the current algorithms running in the Folding client though. However, it will enable other algorithms we weren't able to do in the past because of access to larger register files (128 registers!), scatter, and explicit control of the memory formats and memory system.
More registers for the win. More registers does mean fewer threads in flight, though.

Jawed
 
I don't think it's branded at all. ;)
You're right.
For the sake of correctness, I should have said Intel has Hyperthreading trademarked.

My main concern is whether the minicores share a cache like the cores do in Niagara. x86 has those awkward reg/memory instructions that would raise the average amount of contention for cache ports. This could be reduced if the code is compiled to use only register ops (this rules out a lot of legacy code), but then the register set's small size gets in the way.

They were just never intended for multimedia at all. They even share just one FPU between all eight cores. They were designed specifically for transaction workloads. That doesn't mean that such architecture with multimedia instructions (SSE) would be bad at multimedia too.
It wouldn't set the world on fire either. The SPE's software architecture is targeted for the task, an x86 minicore would fare poorly in comparison to an SPE or dedicated hardware.

If 64 threads on 8 mini-cores have a higher combined throughput than 2 threads on 2 big cores then I would call that a bonus. Even if they don't, it's still more interesting than one big core and a number of mini-cores with a new ISA, because that doesn't run legacy code faster at all.

I'm not disputing the concept, just the idea that removing all of x86's crutches is the way to go.
Theoretical peak doesn't always reflect what can be achieved. 8 minicores that spend half their time spilling registers or fighting over cache are less effective than 6 slightly larger cores that do not.

Minicores don't really help legacy software either. If the code isn't explicitly threaded (most legacy code is not) then those extra cores won't matter either way.

GPUs prove you wrong. They don't have 128 registers per thread (not even close), are in-order, and yet total throughput is phenomenal. I realize a GPU thread is not comparable to a CPU thread, but I believe that several small cores each running multiple threads can improve CPU throughput significantly.
Minicores would suffer from the same lack of execution units as bigger cores. Actually, the much larger thread contexts and separate control hardware needed to make room for other threads (GPUs try to minimize this; CPU-type threading increases it) would mean they wouldn't do that much better compared to specialized hardware.

SSE is more than adequate. If you have eight mini-cores each processing four floating-point numbers per clock cycle, that's still fantastic performance for an x86-compatible CPU.

I suppose in the context of x86 chips, sure. That won't be the only competition in the future.
Perhaps if there were a way to get another extension to the ISA, but that still rules out legacy code.
I also would not really call anything that uses SSE legacy software.
 
My main concern is whether the minicores share a cache like the cores do in Niagara. x86 has those awkward reg/memory instructions that would raise the average amount of contention for cache ports. This could be reduced if the code is compiled to use only register ops (this rules out a lot of legacy code), but then the register set's small size gets in the way.

Spills are normally to the stack, so they have excellent temporal and spatial locality. I don't think individual cores thrashing a shared L2 cache because of spills would be a problem at all.

But, I do agree that x86 would be a poor fit for a massively multicored CPU. Since one would have to rethink software development more or less completely anyway to take advantage of the massive amount of cores, there's really no reason to take the variable length instruction scheme, the low number of registers and the cumbersome flag semantics from x86 into a new mini-core.

Just stick 64 R4400s on a die and be done with it.

Cheers
 
Tomáto, tomäto. I don't know for sure whether the 'register renaming' implementation does physical register copying or not (and I doubt Agner knows), but it behaves like it does. There is no separate logical and physical register set. What matters to the discussion is that it doesn't improve in-order execution performance. It just makes the register stack behave more like a linear register file at low cost.
A physical register swap requires both of the values that are to be swapped to physically exist at the same time. The instruction timings that the Pentium-Classic exhibits in the presence of FXCH cannot be reconciled with such a constraint.
All the wires and multiplexers used for forwarding take a significant bite out of a clock cycle, leaving only a fraction for actual execution.
If you have an in-order processor with N execution units and an out-of-order processor with the same N execution units, they will have exactly the same forwarding network between the execution units themselves. Only the control differs, and that control is cleanly separated from the actual MUXes themselves.
The Pentium 4 even has 'drive' stages, solely for transporting data from one point in the chip to another point.
The drive stages are not within the execution units themselves or their forwarding networks. The first one appears after ICache fetch and the second one between branch check and instruction retire. The PPE (which is the in-order processor that comes closest to matching Pentium4 on raw no-holds-barred clock speed) has 'drive' stages too, by the way.
 
Really interesting thread! It reminds me of the one about the future of console CPUs in the console section.

The sad point is that I don't clearly understand everything...

Nick, what do you mean by mini-x86?
When I read you I am under the impression that you speak more of a PPE-like version of x86 than of an SPE (sorry, I'm dumb... but I try to understand ;)).

So I need some examples. (Sorry, this is really interesting.)

So why not try to argue by extrapolating from actual processors,
and discuss the pros/cons of the choices?
(NB: I'm just trying to give orders of magnitude.)

The PPE in the Xenon is quite tiny, ~40 million transistors; the whole Xenon is ~160 million transistors (I'll count ~40 million transistors per MB of cache... sorry).

The Cell is ~a PPE + 8 SPEs; a PPE is ~40 million transistors, an SPE is ~20 million transistors.

A quad core is ~580 million transistors.

For the transistor budget of the quad core, what would you choose?

A Xenon-like x86 with 8 PPEs (in-order execution + poor branch prediction + long execution pipeline + SMT) + 6 MB of L2 cache?

A Cell-like design: 2 Cells + some more cache?

This could be an interesting discussion in regard to the first PS3 PPE Linux benchmarks ;)
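A quick back-of-the-envelope tally of the two options, using the rough per-block figures quoted above (all numbers are estimates from this post, not official transistor counts):

```python
# Transistor budgets in millions, using the rough figures quoted above
# (~40M per PPE, ~20M per SPE, ~40M per MB of cache, ~580M total).
PPE, SPE, MB_CACHE = 40, 20, 40
BUDGET = 580  # approximate quad-core transistor count

# Option 1: a Xenon-like x86 with 8 PPE-class cores + 6 MB of L2.
xenon_like = 8 * PPE + 6 * MB_CACHE

# Option 2: two Cells (each a PPE + 8 SPEs), before any extra cache.
cell_like = 2 * (PPE + 8 * SPE)

print(xenon_like)          # 560: just fits the ~580M budget
print(BUDGET - cell_like)  # 180: left over for additional cache
```

So with these rough figures both options fit the same budget, with the Cell-like design leaving room for roughly 4 extra MB of cache.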
 
Spills are normally to the stack, so they have excellent temporal and spatial locality. I don't think individual cores thrashing a shared L2 cache because of spills would be a problem at all.
That was a mistake of mine. For some reason I misremembered Niagara's L1 as being partially shared. Spills wouldn't hurt too much, unless the L1s were write-through.

Intel would probably have to change the write policy for its L1s if it went with that many minicores.
 
GPUs prove you wrong. They don't have 128 registers per thread (not even close), are in-order, and yet total throughput is phenomenal. I realize a GPU thread is not comparable to a CPU thread, but I believe that several small cores each running multiple threads can improve CPU throughput significantly.
nAo is right. In the context of 3dilettante's remarks, GPUs have far more registers. They interleave the workload of thousands of pixels at once, and after each instruction all the data for that group of pixels is put aside in the register file until the shader engine comes back to it a couple hundred clocks later (in the case of a texture fetch). That's how they avoid stalls.

Each cluster on the 8800GTS probably has 30,000 FP32 registers (but it's not completely arbitrary random access though). Then it could keep in flight 10 float4's of per-pixel temp values for 200 quads. This lets it fetch one quad of texture data each clock hopefully without stalling.

So you can see that it's not even close. I suppose you can liken it to cache on a CPU, but no CPU cache has the ability to swap 8*40 FP32 values from the registers each clock.
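The per-cluster estimate above works out as follows (just a sanity check of the figures quoted, nothing more):

```python
# Per-cluster register estimate from the post above: 10 float4
# temporaries per pixel, 200 quads (of 4 pixels each) in flight.
regs_per_pixel = 10 * 4        # 10 float4s = 40 FP32 values per pixel
pixels_in_flight = 200 * 4     # 200 quads * 4 pixels
total_fp32 = regs_per_pixel * pixels_in_flight

print(total_fp32)  # 32000, i.e. roughly the 30,000 figure quoted
```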
 
GPUs prove you wrong. They don't have 128 registers per thread (not even close), are in-order, and yet total throughput is phenomenal. I realize a GPU thread is not comparable to a CPU thread, but I believe that several small cores each running multiple threads can improve CPU throughput significantly.
I replied to this already, but I had some time to review.

I went back and read a section in the ATI CTM documentation, which indicates that each program that can be dispatched to an array can address 128 temporary registers and 256 constant registers.
I believe it has been mentioned in another thread that AMD/ATI at least does unroll loops in their drivers, though I didn't hear anything about Nvidia doing the same.

This is achievable even with the GPU being highly threaded because GPUs try to minimize the extra context and synchronization that many CPU threads would bring.

I'm not disagreeing with the idea that multiple minicores can increase throughput, just that the balance between the costs of that approach for x86 vs. its benefits exists at a different point than it does for an SPE or for a GPU.

Each disadvantage or advantage for any given architecture changes the "sweet spot". I think for x86 it would probably be closer to having at least some register renaming capability and possibly some OoO ability, unless the designers create another extension that significantly alters how the ISA works, with more registers or some kind of architectural support for the kind of pervasive threading GPUs use.

edit:
On another note, OoO or register renaming would not take up as much room as they do if the mini-cores were scalar. The big contributor to complexity is the need to do all that scheduling for N instructions simultaneously. If these mini-cores were as narrow as an SPE, the cost for renaming would be manageable, and even basic OoO wouldn't be too bad in my opinion.
 
Spills wouldn't hurt too much, unless the L1s were write-through.

Intel would probably have to change the write policy for its L1s if it went with that many minicores.

Yeah, they would have to go the AMD way and make the L2, essentially, one big shared victim cache for the L1 caches.

Cheers
 
The programming model is so different that it really questions your point imho :)
I want you to question it. Like I said, I realize a GPU thread is not comparable to a CPU thread. I'm not at all claiming that a CPU could get anywhere close to the throughput/area density of a GPU. I'm just saying that some GPU-like concepts could be partially applied to CPUs to increase total throughput.
 
Mike Houston said:
There is a BrookGPU CTM backend currently being worked on and we hope to have it public when CTM is public. It is not being used for the current algorithms running in the Folding client though. However, it will enable other algorithms we weren't able to do in the past because of access to larger register files (128 registers!), scatter, and explicit control of the memory formats and memory system.
More registers for the win. More registers does mean fewer threads in flight, though.
You need that many registers for GPGPU only because there is no fast cache. Even infrequently accessed variables consume precious register space.

Giving mini-cores 128 registers wouldn't help that much if there's a fast L1 cache, and it would completely rule out x86 compatibility. I'd even say that 128 registers without a real cache is far worse. A lot of general-purpose applications wouldn't run on it, simply because even a moderate call depth requires significant stack space. Think towards the future: a Cell-like approach is only good for the kind of multimedia workloads we know today. But Cell isn't going to last much longer than one console generation.
 
My main concern is whether the minicores share a cache like the cores do in Niagara. x86 has those awkward reg/memory instructions that would raise the average amount of contention for cache ports. This could be reduced if the code is compiled to use only register ops (this rules out a lot of legacy code), but then the register set's small size gets in the way.
As Gubbi already noted, your main concern is not a concern at all if each mini-core has its own L1 cache. There's still contention for the L2 cache, but you can buffer requests and transfer whole cache lines so each core doesn't need L2 cache all the time.

So with a good cache hierarchy there's no great need for a large register set.
It wouldn't set the world on fire either. The SPE's software architecture is targeted for the task, an x86 minicore would fare poorly in comparison to an SPE or dedicated hardware.
It doesn't have to set the world on fire in the sense you're thinking. We already have GPUs for the ultra-parallel workloads. There's no need to have GPU-like or SPE-like cores in a typical server or desktop. What we do need is CPUs that can work on many different general-purpose tasks simultaneously. A server can run different processes; a game engine can run every component on one or several threads while keeping compatibility; and for something like raytracing all threads could together be processing rays. Such a CPU architecture would also be ready for future applications (whatever they may be), without requiring all software to be fully rewritten again. That's what I call fire.
Theoretical peak doesn't always reflect what can be acheived. 8 mincores that spend half their time spilling registers or fighting over cache are less effective than 6 slightly larger cores that do not.
I fully agree. Whatever configuration works best in practice should obviously be implemented. I'm just comparing current CPUs with Cell, Niagara, and GPUs, and I suggest adding in-order x86 mini-cores that each run multiple threads as one approach to cover future CPU workloads.
Minicores don't really help legacy software either. If the code isn't explicitely threaded (most legacy code is not) then those extra cores won't matter either way.
You can still run multiple processes, which is very important for the server market, and as I mentioned before that market is an important driver for the CPU market. For other legacy software it's no worse than mini-cores with a new ISA. By the way, even single-threaded applications could benefit if the libraries they use go multi-threaded.
Minicores would suffer from the same lack of execution units as bigger cores.
Yes, but because they are smaller you can have more of them. Hence the density of execution units is still higher.
Actually, the much larger thread contexts and separate control hardware needed to make room for other threads (GPUs try to minimize this; CPU-type threading increases it) would mean they wouldn't do that much better compared to specialized hardware.
True, but that's a fair price for x86 compatibility. Specialized hardware would do great for one specific workload but poorly for another, and is not even an option for a lot of others. If I look at GPGPU applications, even when run on the latest hardware, I see some applications with fantastic speedups and others where the GPU is clearly not efficient at all. That's not what we need for the CPU market. We need a fairly consistent speedup across the whole line, with minimal software effort, at consumer prices. I believe x86 mini-cores can offer that. Applications that still run faster on a GPU can keep running on a GPU.
I also would not really call anything that uses SSE legacy software.
That's not an argument. Mini-cores with a new ISA wouldn't be compatible with anything at all. x86 mini-cores would run legacy code (with or without SSE), and it would take little effort to make recent and future software make good use of them. With a new ISA you're starting from scratch. Besides, every five years there's a new ISA that would be more optimal, but it's really not an option to rewrite all software every five years. So it's better to stick to x86 and hide its imperfections. That's working fine so far; Intel is doing such a great job that even Apple goes x86. The only reason this ISA switch works is because Apple offers most software itself and because the emulation is fast enough. Mini-cores with a new ISA would have to be able to run x86 threads efficiently, on all operating systems, before they are widely accepted. But then it's more interesting to just make them x86 from the start and work around the limitations.
 
A physical register swap requires both of the values that are to be swapped to physically exist at the same time. The instruction timings that the Pentium-Classic exhibits in the presence of FXCH cannot be reconciled with such a constraint.
I still don't see why. When you pair FXCH with, for example, FABS, you write st(0) to st(1), use st(1) as input to the abs unit, and write the result to st(0). There are no extra timing constraints. OK, you could call that register renaming, but there's still a physical swap happening. Real register renaming works with a physical register set that is larger than the logical register set, where any physical register can correspond to any logical register. That's not the case for x87. If you still would like to call that register renaming, fine, but this type of register renaming does not help in-order execution with a linear register set.
If you have an in-order processor with N execution units and an out-of-order processor with the same N execution units, they will have exactly the same forwarding network between the execution units themselves. Only the control differs, and that control is cleanly separated from the actual MUXes themselves.
The classic Pentium did not have a forwarding network. It just writes results directly to the register set. The only MUXes are the small and fast ones for the register set.
The drive stages are not within the execution units themselves or their forwarding networks. The first one appears after ICache fetch and the second one between branch check and instruction retire. The PPE (which is the in-order processor that comes closest to matching Pentium4 on raw no-holds-barred clock speed) has 'drive' stages too, by the way.
I know. What I tried to say is that with out-of-order execution the distances are so big that it takes a significant amount of time just to move data around. Likely not just in the drive stages, but also in other crucial execution (related) stages. By going back to in-order execution, everything can be closer together, with a positive effect on timing constraints.
 
nAo is right. In the context of 3dilettante's remarks, GPU's have far more registers. They interleave the workload of thousands of pixels at once, and after each instruction all the data for that group of pixels is put aside in the register file until the shader engine comes back to it a couple hundred clocks later (in the case of a texture fetch). That's how they avoid stalls.

Each cluster on the 8800GTS probably has 30,000 FP32 registers (but it's not completely arbitrary random access though). Then it could keep in flight 10 float4's of per-pixel temp values for 200 quads. This lets it fetch one quad of texture data each clock hopefully without stalling.

So you can see that it's not even close. I suppose you can liken it to cache on a CPU, but no CPU cache has the ability to swap 8*40 FP32 values from the registers each clock.
I know that. But 8 in-order x86 mini-cores each running 8 threads would already have 4096 FP32 registers. Also note that they could run 64 totally independent and completely general-purpose tasks, at an affordable price. And L1 caches add a relatively large amount of data they can access with low latency.
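For what it's worth, the 4096 figure works out if each thread context counts the 16 XMM registers of x86-64 (my assumption; 32-bit x86 exposes only 8 XMM registers, which would halve the total):

```python
# Register tally for the proposed mini-core CPU. Assumes x86-64's
# 16 XMM registers per thread context (32-bit x86 would only have 8).
cores = 8
threads_per_core = 8
xmm_per_thread = 16   # x86-64 assumption
fp32_per_xmm = 4      # each 128-bit register holds four FP32 values

total_fp32_regs = cores * threads_per_core * xmm_per_thread * fp32_per_xmm
print(total_fp32_regs)  # 4096
```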

Again, I don't see a need to have a GPU-like unit in the CPU. We already have the GPU for that. And I believe there is great value in being x86 compatible. The only reason I'm referring to GPUs is because they reach their high throughput by threading and having many pipelines with simplified control. Translating that approach to CPUs gives me x86 mini-cores with FMT and in-order execution...
 