SSE4, future processors and GPGPU thoughts

So with a good cache hierarchy there's no great need for a large register set.
Without a software model that supports running out of a cache, similar to what G80 can do internally, that's rather unlikely.
With a standard software model, the claim there's no need for a large register set is dubious. A register set on such a core would have a 0-cycle use penalty.
Going by x86 caches, the best that can be hoped for at present is 3 or 4 cycles that must pass before an operand is available. With fine-grained threading, this can be somewhat hidden, but the penalty cannot be brought to zero.

Unless a cache can magically match that, there will always be a need for a register set that doesn't spill over all the time.
From the point of view of optimizing compilers, the small architectural register pool severely limits optimization options, and in-orders desperately need software help to get full utilization.

It doesn't have to set the world on fire in the sense you're thinking. We already have GPUs for the ultra-parallel workloads. There's no need to have GPU-like or SPE-like cores in a typical server or desktop. What we do need is CPUs that can work on many different general-purpose tasks simultaneously. For a server that means running different processes, for a game engine it means running every component on one or several threads while keeping compatibility, and for something like raytracing all threads could together be processing rays.
That would allow applications to transparently walk into a performance minefield. Performance will magically improve and degrade depending on where the thread is routed. Any program that existed prior to these minicores will stutter or zoom by, depending on whether there is a stronger IPC core that exists alongside the mini-cores.

If there is a wide gap in performance between cores that look identical to the software, it is also likely that a fair amount of legacy systems code and a good number of other applications will break.

There's no way for a system to really know if that's the case, not with software that isn't made to be aware. Such a chip would be potentially unsafe for any multithreaded program made before the chip's release, and single-threaded programs would be highly vulnerable to performance upsets.

A conservative OS might just decide not to route anything to the minicores.

If the chip is only made of minicores, it is likely that the x86 variant would do worse than a CELL made only of SPEs. Unlike x86, the SPE's instruction set hasn't made the job of getting usable performance so difficult.

You can still run multiple processes, which is very important for the server market, which as I mentioned before is an important driver for the CPU market. For other legacy software it's no worse than mini-cores with a new ISA.
I'm unconvinced that this is true for x86. By not being differentiated from other cores, they ensnare old programs into situations where they falter.

Yes but because they are smaller you can have more. Hence the density of execution units is still higher.
And because of the minicores' complete inferiority to any other architecture, you're going to need a huge number of them.
I wouldn't be surprised if a core with as many ISA limitations as x86 would deliver half or less of the per-clock performance of a core like an SPE or SPARC, something that will likely seriously impact the amount of usable throughput.

True, but that's a fair price for x86 compatibility. Specialized hardware would do great for one specific workload but poorly for another, and is not even an option for a lot of others. If I look at GPGPU applications, even when run on the latest hardware, I see some applications with fantastic speedups and others where clearly the GPU is not efficient at all. That's not what we need for the CPU market. We need a fairly consistent speedup along the whole line, with minimal software effort, and at consumer prices. I believe x86 mini-cores can offer that. Applications that still run faster on a GPU can keep running on a GPU.
If you look at where throughput computing chips are targeted, you would see that the speed-up would be wildly inconsistent. Given the clunky nature of the ISA, the gains would be noticeably less than they could be.

That's not an argument. Mini-cores with a new ISA wouldn't be compatible with anything at all. x86 mini-cores would run legacy code (with or without SSE), and it would take little effort to make recent and future software make good use of them. With a new ISA you're starting from scratch. Besides, every five years there's a new ISA that would be more optimal, but it's really not an option to rewrite all software every five years.
The mini-cores on a new ISA would at least be usable within the set of programs made for them. Most software will need to be completely redeveloped for such a shift in paradigm, so the pain of shifting to another ISA isn't as great.
x86 is so clunky that it isn't just sub-optimal, it's performance poison.

Since the minicores are going to need special treatment anyway, why not go for a bigger change? It's better than getting stuck with x86 for another 20 years.

So it's better to stick to x86 and hide its imperfections. That's working fine so far; Intel is doing such a great job that even Apple goes x86.
What do you think the addition of OoO, aggressive speculation, and superscalar issue were for? Apple didn't go Intel because of the Pentium classic.

The only reason this ISA switch works is because Apple offers most software itself and because the emulation is fast enough.
Fast enough on a design that extracts more ILP than any desktop processor before it, one that isn't even multithreaded.

Mini-cores with a new ISA would have to be able to run x86 threads efficiently, on all operating systems, before they are widely accepted. But then it's more interesting to just make them x86 from the start and work around the limitations.

The ways for working around the limitations are either hardware or software. You've thrown out most of the hardware ones, and x86 keeps out a vast number of the software ones.

I think a chip with a bunch of cores using a highly extended or new ISA, or with slightly more complex (but still enhanced) x86 cores, would be better than a bunch of minicores that would almost be actively sabotaging the code running on them.
 
I still don't see why. When you pair FXCH with for example FABS, you write st(0) to st(1), use st(1) as input to the abs unit, and write the result to st(0). There are no extra timing constraints. Ok you could call that register renaming, but there's still a physical swap.
Real register renaming works with a physical register set that is larger than the logical register set, and any physical register could correspond with any logical register. That's not the case for x87. If you still would like to call that register renaming, fine, but this type of register renaming is not helping in-order execution with a linear register set.
OK. Those two renaming methods are qualitatively different, and an FXCH-style mechanism (which is only really relevant if you are truly fond of x87, something which no sane person is) can, as you describe, even be implemented in a non-register-renaming manner in the in-order processor without adversely affecting performance (except on some esoteric code sequences that perform FXCH on two work-in-progress values, but such sequences are both rare and useless.)
The classic Pentium did not have a forwarding network. It just writes results directly to the register set. The only MUXes are the small and fast ones for the register set.
Doing both register file read AND register file write within your 1-cycle timing window is actually quite painful, and likely also an important part of the reason why PentiumMMX was so handily outclocked by PentiumII on the same manufacturing process (about 30%, in both stock clock speeds and overclocking potential).
I know. What I tried to say is that with out-of-order execution the distances are so big that it takes a significant amount of time just to move data around. Likely not just in the drive stages, but also in other crucial execution (related) stages. By going back to in-order execution, everything can be closer together, with a positive effect on timing constraints.
Increased wire delays within the execution units themselves are only really an issue if the execution units themselves become larger or are placed farther away from each other. Large wire delays in the OoO-specific logic are generally just absorbed by splitting the delays over as many pipeline stages as needed; this ultimately only affects the cycle count of a branch mispredict (and power consumption; these are however perfectly valid concerns in their own right, as they did eventually force Intel to abandon Pentium4.)
 
You need that many registers for GPGPU only because there is no fast cache. Even infrequently accessed variables consume precious register space.
The reason's more fundamental than that: read after write is outside of the tradition of GPU pipelines (except for pixel blending in the render target, after shading has been completed).

I simply pointed out the contrary example to indicate that GPUs are changing.

G80 offers some kind of small memory area ("cache"), per thread, which fully supports read after write and cross-thread sharing of data.

Jawed
 
I know that. But 8 in-order x86 mini-cores each running 8 threads would already have 4096 FP32 registers.
I was just replying to your remark that GPUs have far fewer registers, that's all. 3dilettante was saying that CELL indicates that a large register pool is how you get around stalls, and you said GPU's prove him wrong. Even 4096 registers for this hypothetical CPU is only a couple of percent of what G80 has, so GPUs in fact support his point even if it doesn't seem so when pixel shader code doesn't go beyond r9.
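For reference, here is one way Nick's 4096 figure plausibly breaks down; the per-thread register accounting is my assumption (16 architectural XMM registers per x86-64 thread, 4 packed FP32 values each), since the post doesn't spell out the math:

// Hedged arithmetic sketch; the per-thread breakdown is an assumption.
enum {
    kCores        = 8,   // in-order x86 mini-cores
    kThreadsPer   = 8,   // hardware threads per mini-core
    kXmmPerThread = 16,  // architectural SSE registers in x86-64
    kLanesPerXmm  = 4    // FP32 lanes in a 128-bit register
};
static_assert(kCores * kThreadsPer * kXmmPerThread * kLanesPerXmm == 4096, "");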

Again, I don't see a need to have a GPU-like unit in the CPU. We already have the GPU for that.
Neither do I, actually. To me it looks like computation tasks are mostly either massively parallel or tough to split into even a few threads. For the former, GPUs are the way to go. In the latter case, don't go crazy with the number of cores. 2-4 seems like plenty to me, and once you have that, serial execution rate is the number one priority.

Going forward, bolting a few CELLs together seems silly to me. If you can split your work into 32 threads then it's likely you can split it among hundreds or thousands of threads. In that case future GPU's will slaughter any CPU because they're built to hide all data latency and have extremely high FLOPS/mm2.
 
Going by x86 caches, the best that can be hoped for at present is 3 or 4 cycles that must pass before an operand is available. With fine-grained threading, this can be somewhat hidden, but the penalty cannot be brought to zero.

Unless a cache can magically match that, there will always be a need for a register set that doesn't spill over all the time.
The 3 GHz Northwood processor had an 8 kB L1 cache with 2 cycle latency. Core 2 Duo has a 32 kB L1 cache with 3 cycle latency. So even with just four threads per in-order mini-core you can totally hide that, practically solving the small register set problem. Also, while x86-32 does indeed spill all the time, x86-64 is quite comfortable with 16 general-purpose and 16 SIMD registers.
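As a minimal sketch of that latency-hiding argument (assumptions: idealized round-robin issue, one instruction per cycle, and every instruction is a dependent load with a 3-cycle use latency; a real core is messier):

#include <array>
#include <cstdio>

int main() {
    constexpr int kThreads = 4;
    constexpr int kLoadLatency = 3;          // cycles before a loaded value is usable
    std::array<int, kThreads> ready_at{};    // cycle at which each thread may issue again
    int stalls = 0;

    for (int cycle = 0; cycle < 1000; ++cycle) {
        int t = cycle % kThreads;            // strict round-robin thread selection
        if (cycle < ready_at[t]) { ++stalls; continue; }
        ready_at[t] = cycle + kLoadLatency;  // pretend every instruction is a dependent load
    }
    std::printf("stalled issue slots: %d\n", stalls);  // 0 with 4 threads and 3-cycle latency
    return 0;
}

With only two threads the same model stalls on every other slot, which is the gap the rest of this exchange argues about.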
That would allow applications to transparently walk into a performance minefield. Performance will magically improve and degrade depending on where the thread is routed. Any program that existed prior to these minicores will stutter or zoom by, depending on whether there is a stronger IPC core that exists alongside the mini-cores.
I would definitely keep at least one heavy core, just for the situation where there's only one application running that relies on single-threaded performance. Every other situation benefits from having mini-cores. Running two single-threaded applications would result in performance slightly higher than a single-core, but most importantly it wouldn't stutter. If the O.S. detects that a mini-core is fully used then it should give each thread some time on the heavy core.

By the way, this kind of situation already happens on current dual-core processors. If you run one application that would consume 50% of one core, and two applications that occupy 100% of a core, you get 100% utilization for both cores and stutter-free behaviour for all three applications.

In fact I see having mini-cores with a different ISA as a bigger performance minefield. One of the two parts is always a bottleneck. You're either not fully using the heavy core or you're not fully using the mini-cores (or not using them at all).
 
Doing both register file read AND register file write within your 1-cycle timing window is actually quite painful...
I can imagine that's a problem on an architecture with 128+ registers, but with 8 or 16 registers I can't see how that would be a significant issue. The write is practically for free since the MUX can switch while the ALU is working. With an out-of-order architecture with forwarding network, the result of every execution unit, and every stage at which a result can be ready, has to be routed to the top again and MUXed as well.
 
The reason's more fundamental than that: read after write is outside of the tradition of GPU pipelines (except for pixel blending in the render target, after shading has been completed).

I simply pointed out the contrary example to indicate that GPUs are changing.

G80 offers some kind of small memory area ("cache"), per thread, which fully supports read after write and cross-thread sharing of data.
Yes, but wasn't this 128 register example for a GPU without such cache? In other words, can't G80 offer high GPGPU performance without using that many registers per thread, thanks to the cache?
 
I was just replying to your remark that GPUs have far fewer registers, that's all. 3dilettante was saying that CELL indicates that a large register pool is how you get around stalls, and you said GPU's prove him wrong. Even 4096 registers for this hypothetical CPU is only a couple of percent of what G80 has, so GPUs in fact support his point even if it doesn't seem so when pixel shader code doesn't go beyond r9.
We can't directly compare G80 with any CPU architecture like this. G80 is gigantic, costs a fortune, and has hardly anything else but execution units. I was mainly referring to how a GPU keeps one shader unit fully utilized while using only a few registers per pixel.

Furthermore, even a CPU with 16 SPE-like cores with 128 registers each would still only have 2048 in total. So clearly GPUs don't support 3dilettante's point.
Neither do I, actually. To me it looks like computation tasks are mostly either massively parallel or tough to split into even a few threads. For the former, GPUs are the way to go. In the latter case, don't go crazy with the number of cores. 2-4 seems like plenty to me, and once you have that, serial execution rate is the number one priority.
There are a few problems with that. The first is that you don't always have a (powerful) GPU. A low-end or integrated GPU is close to worthless for running GPGPU tasks. A CPU with 8 mini-cores running at several times higher frequency and each with a 128-bit SIMD unit could be significantly more powerful. And even for better GPUs it's not always a good option to use them for GPGPU purposes when they're already being used for 3D.

Which brings me to the second problem. GPUs will always be primarily designed for 3D rasterization. It's a massively parallel task, and very floating-point oriented. A CPU with mini-cores could run many totally independent threads, with lots of integer operations, pointer tracing, and branching. Threads of a megabyte each wouldn't be a gigantic problem. Take for example ray-tracing. A GPU would process rays in groups, but it requires lots of branching. And with the many indirect and incoherent memory accesses the memory latency hiding mechanisms start to fail. And with practically every memory access being a RAM access, it could also be bandwidth limited. A CPU with mini-cores can have a thread per ray, they can branch independently, the code can be arbitrarily complex, and most memory accesses would still be within the L1 and L2 caches. As another example, I don't think a GPU would ever be suited for running server tasks. It's also not designed for handling a high call depth, or even recursive calls.
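As a rough sketch of that "thread per ray" model, where each ray branches and chases pointers independently (trace() here is a hypothetical stand-in for an arbitrarily complex traversal):

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

struct Ray { float ox, oy, oz, dx, dy, dz; };

// Hypothetical placeholder for a branchy, pointer-chasing traversal.
static int trace(const Ray& r) { return r.dx > 0.0f ? 1 : 0; }

int main() {
    std::vector<Ray> rays(1 << 16);
    std::vector<int> hit(rays.size());
    std::atomic<std::size_t> next{0};

    auto worker = [&] {                        // one worker per hardware thread
        for (std::size_t i; (i = next.fetch_add(1)) < rays.size(); )
            hit[i] = trace(rays[i]);           // each ray branches on its own
    };

    std::vector<std::thread> pool;
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    for (unsigned t = 0; t < n; ++t)
        pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return 0;
}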

Finally, it is my conviction that a lot of applications that seemingly have low parallelism can actually be split into many threads. We only just have dual-core for consumers, and most developers are still getting their feet wet with concurrent multi-threading. But for the next generation of programmers it will be a natural thing, and they will have no problem extracting lots of parallelism. Current software optimized for dual-core is actually old software that through difficult refactorings has been made multi-threaded. But for new software written from scratch the design can be optimized for multi-core from the start.

Tools and programming languages have to evolve as well. Most popular languages still imply one point of execution, but declarative languages can control many things at the same time. Think about a spreadsheet; change one value and all other (dependent) cells are re-evaluated. Each cell could be implemented with a thread and evaluated concurrently. Most software is very similar. Every module and sub-module can have well-defined dependencies, and each instance can update its state when a dependent instance changes. Today's code is still very much spaghetti code when it comes to defining dependencies (and independencies). With stricter approaches to defining module boundaries and message protocols it's very possible to let the compiler spread the code over many threads.
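A tiny sketch of that spreadsheet-style dataflow idea (hypothetical cells; a real system would track the dependency graph and schedule re-evaluation automatically):

#include <cstdio>
#include <future>

int main() {
    int a = 2, b = 3;                        // input "cells"
    auto sum  = [&] { return a + b; };       // dependent cell: =A+B
    auto prod = [&] { return a * b; };       // dependent cell: =A*B

    a = 5;                                   // the user edits cell A
    // The two dependents depend only on A and B, not on each other,
    // so they can be re-evaluated concurrently after the edit.
    auto f1 = std::async(std::launch::async, sum);
    auto f2 = std::async(std::launch::async, prod);
    std::printf("sum=%d prod=%d\n", f1.get(), f2.get());
    return 0;
}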

In the meantime x86 mini-cores would be excellent for servers and software that does parallelize without much extra effort, like most multimedia. It solves the chicken-and-egg problem that would exist for CPUs with a new ISA. You can't sell the hardware because there's no software for it, and you can't develop software for it because nobody is buying the hardware.
 
Nick said:
A low-end or integrated GPU is close to worthless for running GPGPU tasks. A CPU with 8 mini-cores running at several times higher frequency and each with a 128-bit SIMD unit could be significantly more powerful. And even for better GPUs it's not always a good option to use them for GPGPU purposes when they're already being used for 3D.
So, somehow a 600 USD GPU is too expensive for most users but a 1200 USD CPU is common? Or are you talking about the hypothetical future where 8 cores are commonplace and cost 60 USD or less? In that timeframe, you're likely to be able to pick a G80 for that price too...
 
Yes, but wasn't this 128 register example for a GPU without such cache?
Yes. But I quoted it to show that GPUs are changing. Your arguments seem to be on the basis that GPUs have done all the changing they're going to do.

In other words, can't G80 offer high GPGPU performance without using that many registers per thread, thanks to the cache?
Yes. In fact it should be able to do both, since D3D10 says it should support 4096 registers per object (fragment/vertex/primitive) - we have no idea how malleable G80 is in this respect.

It may be that there's some kind of sharing where the thread cache and register file are a single physical entity in G80 - and the split between the two is determined by the kind of thread (GPU or GPGPU) or it may be some decision taken by the driver compiler or it might be something that an application can make a call to request.

It's too early to tell. R600 will prolly offer variations on these themes too. After G80/R600 bed-in, there'll be another significant evolutionary step in 18 months' time.

Jawed
 
The 3 GHz Northwood processor had an 8 kB L1 cache with 2 cycle latency. Core 2 Duo has a 32 kB L1 cache with 3 cycle latency. So even with just four threads per in-order mini-core you can totally hide that, practically solving the small register set problem.
No, it would not. At bare minimum, firing off the necessary read from cache will take at least one additional cycle for the thread that requests it, even with cycling between threads to overlay execution. To "hide" that last cycle as best as you could, you would need to double the number of threads all over again.

Northwood's L1 cache was also write-through, which would be absolute murder when it comes to port contention if the L2 is shared. How well 8 kB for an L1 can do with a core that has 4 (or 8) active threads is another interesting question.

Every instruction that has a memory operand will take at least 2 cycles to complete after being issued, since address calculation cannot be started until after decode and cannot be collapsed into the same time frame as decoding what registers an instruction has.

It wouldn't be as large a problem if a program relied heavily on register-register instructions, but we're getting back to the small software-visible pool again.

Also, while x86-32 does indeed spill all the time, x86-64 is quite comfortable with 16 general-purpose and 16 SIMD registers.

It appears comfortable because the only x86-64 cores are heavily OoO and speculative. Take that away, and things are not so comfortable.
It also ignores the trend that the code expansion from using the necessary prefixes in long mode tends to decrease cache effectiveness by approximately the same amount that having the larger register pool gains in performance. It would be worse if the architecture weren't designed to work around the latency anyway.

Junking the ability to compensate for worse average memory latency will make things even less comfortable. Adding several threads to fight over the same cache is also not going to make relying on the L1 a comfortable bet.

I would definitely keep at least one heavy core, just for the situation where there's only one application running that relies on single-threaded performance. Every other situation benefits from having mini-cores. Running two single-threaded applications would result in performance slightly higher than a single-core, but most importantly it wouldn't stutter. If the O.S. detects that a mini-core is fully used then it should give each thread some time on the heavy core.
Every other situation benefits as long as an OS is capable of predicting which threads desperately need to stay on the big core. If there are two relatively demanding tasks, one can easily get shafted by being sent to one of the minicores.
A minicore is also very likely to spike to full utilization on even minor tasks. It's so narrow that utilization is an unreliable metric.

By the way, this kind of situation already happens on current dual-core processors. If you run one application that would consume 50% of one core, and two applications that occupy 100% of a core, you get 100% utilization for both cores and stutter-free behaviour for all three applications.
The threads on the other core don't suddenly take an 80% dive in performance on a dual-core processor. A thread can safely assume it gets the same kind of resources on core1 as it did on core0.

The differential between full-sized core and minicore is almost to the point that it behaves like an asymmetric setup. If the cores are not treated differently, performance becomes unpredictable to the application.

There are ways around this, but I'm having trouble thinking of a reliable metric that can be used to determine this when each part of the system has incomplete information.

In fact I see having mini-cores with a different ISA as a bigger performance minefield. One of the two parts is always a bottleneck. You're either not fully using the heavy core or you're not fully using the mini-cores (or not using them at all).
At least it is possible to know when this happens. Identical front ends that freely allow threads to migrate will hide the bottleneck that will still exist with the minicores behind a wall of confusion.
For legacy software, no software run could be expected to complete within a given range of time.

Completely arbitrary and invisible factors can cut an application's performance to less than a quarter at times a program cannot predict or prevent.
If the software is reconfigured so heavily that it is able to handle this, why not go the extra step and get minicores that can actually do their job well--even if not transparently, as opposed to lagging behind transparently?

A CPU with 8 mini-cores running at several times higher frequency and each with a 128-bit SIMD unit could be significantly more powerful.
That is not going to happen. Complex cores can't clock as high as simple ones, but that doesn't mean simple cores are free to crank up the clock all they wish. The plots for the SPEs show that power draw scales supralinearly with clock speed.

Furthermore, even a CPU with 16 SPE-like cores with 128 registers each would still only have 2048 in total. So clearly GPUs don't support 3dilettante's point.

My objection lies in the number of software-visible registers, not physical ones. The SPE's ISA allows the addressing of 128; the CTM manual (which I've only skimmed) indicates that there are 128 variable registers (and 256 constant registers) that can be addressed by a program that is issued to an array (and ATI's drivers apparently do unroll loops); x86-64 has 16, assuming the code actually uses the prefix necessary to access the additional 8.

How much harder do you think it would be to unroll a loop or use software pipelining with even a simple loop with 16 registers than it is with 128?
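To make the register-pressure point concrete, here is a sketch (illustration only; actual spilling depends on the compiler and target, and n is assumed to be a multiple of 4):

// Unrolling a dot product by 4 already needs 4 independent accumulators live
// at once, plus the two pointers and the loop counter.
float dot_unroll4(const float* a, const float* b, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
// With 128 architectural registers (SPE-style) the same loop can be unrolled
// 16 or 32 deep and software-pipelined, keeping every partial sum and
// prefetched operand in registers; with 16 registers the compiler soon has to
// spill to the stack, reintroducing the memory latency the unrolling was
// meant to hide.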

Register renaming and OoO play a large part in allowing the limited number of architectural registers to stretch farther than they might.

I'm of the opinion that trying to force x86 hardware so far back that it returns to the times when it lagged RISC in integer as well as floating point is a dead-end.

Either other designs will do what the minicores do in a superior manner, or a few medium-sized x86 cores will do the work without the burden of unnecessary thread context.
 
I can imagine that's a problem on an architecture with 128+ registers, but with 8 or 16 registers I can't see how that would be a significant issue. The write is practically for free since the MUX can switch while the ALU is working. With an out-of-order architecture with forwarding network, the result of every execution unit, and every stage at which a result can be ready, has to be routed to the top again and MUXed as well.

With more than just 1 execution unit, you need a 2nd layer of MUXes for each register to choose which execution unit the register is supposed to accept a write from; the path from the execution unit to these MUXes also has a fairly large fanout. You also need some logic to shut off the MUXes in case of an error anywhere in your computations as well (an "error" in this case would be any instruction with an unexpected running time, such as CPU exceptions, branch mispredicts, L1 cache-misses/port-contention/misaligned-accesses/STLF-failures, TLB misses, that sort of stuff); with direct register-file access, it's hard to avoid having this kind of error checking right in the middle of your critical path - especially if the critical path goes anywhere close to your load/store unit.

For load/store units, it's also worth noting that speculatively retrieving data from an L1 cache (just make a guess that the data is there, and then block instruction retire several pipeline steps downstream if we were wrong) can actually take a much shorter time than checking whether we even have a valid cache hit in the cache in the first place; if you insist on doing direct register-file writes, you cannot actually exploit this discrepancy to cut down the visible latency of your L1 cache (for an L1 cache with size and clock speed as that of the Northwood Pentium4, I would expect that if you do not allow for speculative accesses, a cache lookup would take about 5-6 cycles.)
 
We can't directly compare G80 with any CPU architecture like this. G80 is gigantic, costs a fortune, and has hardly anything else but execution units. I was mainly referring to how a GPU keeps one shader unit fully utilized while using only a few registers per pixel.

Furthermore, even a CPU with 16 SPE-like cores with 128 registers each would still only have 2048 in total. So clearly GPUs don't support 3dilettante's point.
Dude, go read your post again. CELL was his primary example. You were the one saying GPU's are counterevidence of the need for more registers when in fact it has more. You don't need a huge G80 to make that point, either. G71/RSX would have barely any fewer for the calculations I made: 880 pixels in flight in each of six shading quads, and 8 float4 registers without penalty.

About your other price argument, the whole point of this thread is to discuss the best way to maximize performance for your dollar. A $1000 super CPU could well be more expensive than a cheaper one paired with a fast GPU and fast memory.

Also, we don't even know what kind of limitations G80 has for general computation, let alone architectures that will be around in the time-frame of this CPU you're talking about. For all we know G80 could be good enough for many tasks. It's already juggling hundreds of pixel groups that are at different points in an instruction stream, so that's close enough to "independent". I have no doubt raytracing will become very fast on GPU's once scatter support is used. Server tasks could be tough, but I don't know how CPU-hungry they are in day-to-day use beyond benchmarks, except for a handful of customers.

I still think CPUs should focus on maximizing each thread's speed, and only go multicore when the perf-area elasticity goes below say 0.3 (i.e. 3% boost for 10% area increase). Companies won't sacrifice coding productivity to make everything highly multithreaded, so threads won't grow in number very easily IMO except for the really parallel stuff which would likely map to a GPU.
 
No, it would not. At bare minimum, firing off the necessary read from cache will take at least one additional cycle for the thread that requests it, even with cycling between threads to overlay execution. To "hide" that last cycle as best as you could, you would need to double the number of threads all over again.
Ok, but even if four threads is not enough to hide all latency in the worst case, it could still do pretty well on average. If it's one clock cycle short, then that can still be hidden if one of the other three threads still has some work.
It appears comfortable because the only x86-64 cores are heavily OoO and speculative. Take that away, and things are not so comfortable.
Add FMT and things could be comfortable again.
It also ignores the trend that the code expansion from using the necessary prefixes in long mode tends to decrease cache effectiveness by approximately the same amount that having the larger register pool gains in performance.
I believe that's incorrect. A REX prefix is only needed when using 64-bit operands or the registers beyond the first eight. But most code still uses 32-bit values ('int' is still 32-bit), and if you're using the upper half of the register set it means you've avoided spill instructions. So in practice the x86-64 code is more compact.
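For illustration, the encodings behind that claim (bytes per the x86-64 ISA; the specific add examples are mine):

#include <cstdint>
// add eax, ebx : 32-bit operands, low eight registers, no REX prefix
constexpr std::uint8_t add_eax_ebx[] = {0x01, 0xD8};        // 2 bytes
// add rax, rbx : 64-bit operand size, REX.W prefix added
constexpr std::uint8_t add_rax_rbx[] = {0x48, 0x01, 0xD8};  // 3 bytes
// add r8d, r9d : upper half of the register set, REX.R + REX.B
constexpr std::uint8_t add_r8d_r9d[] = {0x45, 0x01, 0xC8};  // 3 bytes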

A possible decrease in performance comes from the use of 64-bit pointers. But that's where everything is evolving to anyway.
That is not going to happen. Complex cores can't clock as high as simple ones, but that doesn't mean simple cores are free to crank up the clock all they wish. The plots for the SPEs show that power draw scales supralinearly with clock speed.
I wasn't suggesting to clock the mini-cores any higher than the rest of the CPU. What I meant was that GPUs are typically clocked lower than CPUs. So 8 mini-cores with high utilization clocked at 3 GHz could beat a future low-end GPU with 16 pipelines clocked at 1 GHz. So even for massively parallel tasks, using the GPU is not always the best option. And if you do have a powerful GPU but it's busy with 3D then it's interesting to still have a CPU capable of lots of GFLOPS.
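A back-of-the-envelope version of that comparison; the clock speeds and core/pipeline counts come from the paragraph above, but the "one 4-wide FP32 MADD per cycle" throughput assumed for both sides is purely my own assumption:

// Hedged arithmetic sketch; per-cycle throughput on both sides is assumed.
constexpr double kMiniCoreGflops = 8  /*cores*/ * 4 /*FP32 lanes*/ * 2 /*mul+add*/ * 3.0 /*GHz*/;  // 192
constexpr double kLowEndGpuGflops = 16 /*pipes*/ * 4 /*FP32 lanes*/ * 2 /*mul+add*/ * 1.0 /*GHz*/; // 128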
Either other designs will do what the minicores do in a superior manner, or a few medium-sized x86 cores will do the work without the burden of unnecessary thread context.
I'm really interested to hear about other designs, as long as they are x86 and they maximize throughput. Theoretically I fully agree that x86 should be ditched. But in practice it's not that simple, and any new ISA would also become a limitation in the long-term future, making x86 almost just as good.
 
With more than just 1 execution unit, you need a 2nd layer of MUXes for each register to choose which execution unit the register is supposed to accept a write from; the path from the execution unit to these MUXes also has a fairly large fanout.
I see. So out-of-order architectures store the results of the execution units in a buffer register at the end of a clock, and the next clock they select the right one and write it to the register file? Ok that way the MUXes are pure overhead for in-order architectures. Still pretty small MUXes though, nothing like the ones for 128 registers...
You also need some logic to shut off the MUXes in case of an error anywhere in your computations as well (an "error" in this case would be any instruction with an unexpected running time, such as CPU exceptions, branch mispredicts, L1 cache-misses/port-contention/misaligned-accesses/STLF-failures, TLB misses, that sort of stuff); with direct register-file access, it's hard to avoid having this kind of error checking right in the middle of your critical path - especially if the critical path goes anywhere close to your load/store unit.
Isn't that only for predictive out-of-order execution? An in-order architecture would just wait till the branch or cache miss is resolved. With FMT the other threads keep the execution units busy.
For load/store units, it's also worth noting that speculatively retrieving data from an L1 cache (just make a guess that the data is there, and then block instruction retire several pipeline steps downstream if we were wrong) can actually take a much shorter time than checking whether we even have a valid cache hit in the cache in the first place; if you insist on doing direct register-file writes, you cannot actually exploit this discrepancy to cut down the visible latency of your L1 cache (for an L1 cache with size and clock speed as that of the Northwood Pentium4, I would expect that if you do not allow for speculative accesses, a cache lookup would take about 5-6 cycles.)
Well my basic strategy would be to hide that latency by doing work on other threads first (like GPUs do), not to do anything speculatively (which takes lots of extra logic).
 
Dude, go read your post again. CELL was his primary example. You were the one saying GPU's are counterevidence of the need for more registers when in fact it has more. You don't need a huge G80 to make that point, either. G71/RSX would have barely any fewer for the calculations I made: 880 pixels in flight in each of six shading quads, and 8 float4 registers without penalty.
Relax, dude. Have a look at ps 1.0 hardware (or even fixed-function). It has extremely few logical registers, yet the throughput is still phenomenal compared to a CPU. So GPUs do prove that you don't need 128 logical registers like Cell for high throughput. Threading is key when the register set is limited. So x86 doesn't have to be ruled out for a high-throughput CPU architecture. Exactly how to achieve that is still a big challenge, but there's no fundamental law forbidding it.
About your other price argument, the whole point of this thread is to discuss the best way to maximize performance for your dollar. A $1000 super CPU could well be more expensive than a cheaper one paired with a fast GPU and fast memory.
A Pentium D costs only 91.99 US$. That's a fantastic price/performance ratio, for a CPU. Neither does Cell cost a fortune. So there's no reason to believe that an x86 CPU with a high throughput architecture would be very expensive.
Also, we don't even know what kind of limitations G80 has for general computation, let alone architectures that will be around in the time-frame of this CPU you're talking about. For all we know G80 could be good enough for many tasks. It's already juggling hundreds of pixel groups that are at different points in an instruction stream, so that's close enough to "independent". I have no doubt raytracing will become very fast on GPU's once scatter support is used. Server tasks could be tough, but I don't know how CPU-hungry they are in day-to-day use beyond benchmarks, except for a handful of customers.
I'm afraid that even with the most optimistic vision a GPU would do very poorly at running lots of independent tasks with deep call stacks and megabytes of code. We can't magically have a CPU that has the throughput of a GPU, and neither can we have a GPU that is as versatile as a CPU. So even though this forum is mainly concerned about GPU performance I don't believe CPU performance will become any less important even if GPUs become more powerful in every way.
I still think CPUs should focus on maximizing each thread's speed, and only go multicore when the perf-area elasticity goes below say 0.3 (i.e. 3% boost for 10% area increase). Companies won't sacrifice coding productivity to make everything highly multithreaded, so threads won't grow in number very easily IMO except for the really parallel stuff which would likely map to a GPU.
I think this is the main point of discussion. In my opinion there really is a lot of parallelism in all software. We just need to get our hands on the hardware. Personally I think dual-core is brilliant. Finally a CPU upgrade can give you almost a 2x performance increase, whereas previously we mainly only had small increases in clock frequency. Currently only multimedia experiences this 2x speedup, but in time all software development can be multi-threaded. Several game studios already claim they'll be able to make good use of quad-core, and it's not even available. So by the time we have hexa-core and octa-core there will be a use for it as well. Just look at it this way: if previously a 100% area increase meant a 30% single-threaded performance increase, then how hard can it be to get 30% out of twice the number of cores?

In the future I see almost every function call running on a separate thread. Just like today we have out-of-order instruction execution to extract parallelism, we would have out-of-order function execution... It takes new programming languages and advanced compilers to automate this, but I have little doubt that whatever route Intel and AMD take, the software will eventually follow. But the only way to bridge that time is to keep it x86 compatible, so old and new software in every segment can benefit from it.
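A minimal sketch of what "out-of-order function execution" could look like at the source level (hypothetical functions; std::async here just stands in for whatever a compiler or runtime would do automatically):

#include <cstdio>
#include <future>

static int f(int x) { return x * x; }   // hypothetical independent functions
static int g(int x) { return x + 1; }

int main() {
    // f(3) and g(4) have no mutual dependency, so they may run in either
    // order, or concurrently; only the final sum has to wait for both.
    auto a = std::async(std::launch::async, f, 3);
    auto b = std::async(std::launch::async, g, 4);
    std::printf("%d\n", a.get() + b.get());
    return 0;
}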
 
I see. So out-of-order architectures store the results of the execution units in a buffer register at the end of a clock, and the next clock they select the right one and write it to the register file? Ok that way the MUXes are pure overhead for in-order architectures. Still pretty small MUXes though, nothing like the ones for 128 registers...
Yep, that's the basic idea. Although the penalty is really a penalty for not having a forwarding network, rather than a penalty for being in-order in general; it is entirely possible to put a forwarding network in an in-order processor as well to avoid this kind of overhead. (Such a network can also be used to mask from the execution units the overhead of reading the physical register file as well; there are good reasons that most high-clock-speed processors, in-order or not, have a dedicated pipeline stage for register file read.)
Isn't that only for predictive out-of-order execution? An in-order architecture would just wait till the branch or cache miss is resolved. With FMT the other threads keep the execution units busy.
No. The argument I gave applies specifically to in-order processors with early register writes; you need to detect all cases where you need to stall and/or block register writes, and you need to do it FAST; stalling a long pipeline and blocking register writes requires some extremely high-fanout (and thus very slow) gates just after the stall decision logic. The usual solution at high (1GHz+) clock speeds is to defer real-register-file writes a pipeline step or two (allowing you to pipeline the fanout) and, instead of stalling, replay the failed instructions from the first point of failure - both of these tricks are entirely possible to pull off in an in-order processor.
Well my basic strategy would be to hide that latency by doing work on other threads first (like GPUs do), not to do anything speculatively (which takes lots of extra logic).
What you propose sounds essentially like an x86 barrel processor; while you can hide an arbitrarily large amount of latency that way by just piling on enough threads, you run into the problem of splitting up your workload into enough threads in the first place - although as time progresses and we get new tools and programming practices/languages, the problem is becoming increasingly tractable.

But x86? The main technical reason to keep x86 around is to retain decent support for legacy software, most of which is extremely deficient in its ability to take advantage of multithreading. However, x86 still actually makes (a twisted sort of) sense in this scenario. If people are to distribute applications as machine code, there are major benefits (of an organizational rather than technical kind :!:) to support only one version, and distributing as anything else than machine code is likely to be inadequate for reasons that have to do with performance (there does not seem to be any way to e.g. get Java or C# to take advantage of SSE unless you allow for x86 binary libraries compiled from a language that is NOT Java/C# ) or IP protection; as such, a common standard for machine code is still a virtual necessity. And displacing x86 as the de-facto standard for machine code has several times in the past proven to be a shockingly difficult task.
 
Nick said:
It has extremely few logical registers, yet the throughput is still phenomenal compared to a CPU. So GPUs do prove that you don't need 128 logical registers like Cell for high throughput.
You'd be surprised at how many registers are floating around in a GeForce 256's fragment pipe. Hint: it's not close to the register counts of CPUs.
 
Yep, that's the basic idea. Although the penalty is really a penalty for not having a forwarding network, rather than a penalty for being in-order in general; it is entirely possible to put a forwarding network in an in-order processor as well to avoid this kind of overhead. (Such a network can also be used to mask from the execution units the overhead of reading the physical register file as well; there are good reasons that most high-clock-speed processors, in-order or not, have a dedicated pipeline stage for register file read.)
Thanks for the information. I've had a closer look at the Cell SPE architecture and it has a forwarding network as well. So it must be a good trade-off between transistor count and performance.
No. The argument I gave applies specifically to in-order processors with early register writes; you need to detect all cases where you need to stall and/or block register writes, and you need to do it FAST; stalling a long pipeline and blocking register writes requires some extremely high-fanout (and thus very slow) gates just after the stall decision logic. The usual solution at high (1GHz+) clock speeds is to defer real-register-file writes a pipeline step or two (allowing you to pipeline the fanout) and, instead of stalling, replay the failed instructions from the first point of failure - both of these tricks are entirely possible to pull off in an in-order processor.
The whole pipeline wouldn't stall, just one thread out of several. Any replaying would negatively affect the performance of the other threads, so I'd keep speculative instructions out of the execution pipeline until their arguments have been read and branches have been taken.
What you propose sounds essentially like an x86 barrel processor; while you can hide an arbitrarily large amount of latency that way by just piling on enough threads, you run into the problem of splitting up your workload into enough threads in the first place - although as time progresses and we get new tools and programming practices/languages, the problem is becoming increasingly tractable.
Indeed, it's very close to a barrel processor. But I wouldn't restrict it to round-robin execution. That would require as many threads as the longest possible latency. I'd only use as many as the average latency, executing an instruction from any thread that is ready. So it's still important to have a low minimum latency (which makes a forwarding network an interesting option). The goal would be high throughput without an insane number of threads.
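A small sketch of that "any ready thread" policy versus strict round-robin, under the same idealized model as before (one issue slot per cycle, mostly 1-cycle instructions with an occasional long-latency one; the latency pattern is an assumption for illustration):

#include <array>
#include <cstdio>

constexpr int kThreads = 4;
constexpr int kCycles  = 100000;

// Mostly 1-cycle ops with a rare 20-cycle miss: the average latency is low
// even though the worst case is high.
static int latency(int n) { return (n % 16 == 0) ? 20 : 1; }

template <bool kRoundRobin>
static int idle_slots() {
    std::array<int, kThreads> ready_at{};   // cycle at which each thread can issue again
    std::array<int, kThreads> issued{};     // per-thread instruction counter
    int idle = 0;
    for (int cycle = 0; cycle < kCycles; ++cycle) {
        int pick = -1;
        if (kRoundRobin) {
            int t = cycle % kThreads;       // the slot belongs to one thread only
            if (cycle >= ready_at[t]) pick = t;
        } else {
            for (int t = 0; t < kThreads; ++t)          // issue from ANY ready thread
                if (cycle >= ready_at[t]) { pick = t; break; }
        }
        if (pick < 0) { ++idle; continue; }
        ready_at[pick] = cycle + latency(issued[pick]++);
    }
    return idle;
}

int main() {
    std::printf("round-robin idle slots: %d, any-ready idle slots: %d\n",
                idle_slots<true>(), idle_slots<false>());
    return 0;
}

In this toy model the round-robin scheduler wastes every slot belonging to a thread stuck on the long miss, while the any-ready scheduler fills those slots from the other threads, which is the point about not needing threads for the worst-case latency.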
But x86? The main technical reason to keep x86 around is to retain decent support for legacy software, most of which is extremely deficient in its ability to take advantage of multithreading. However, x86 still actually makes (a twisted sort of) sense in this scenario. If people are to distribute applications as machine code, there are major benefits (of an organizational rather than technical kind :!:) to support only one version, and distributing as anything else than machine code is likely to be inadequate for reasons that have to do with performance (there does not seem to be any way to e.g. get Java or C# to take advantage of SSE unless you allow for x86 binary libraries compiled from a language that is NOT Java/C# ) or IP protection; as such, a common standard for machine code is still a virtual necessity. And displacing x86 as the de-facto standard for machine code has several times in the past proven to be a shockingly difficult task.
Good point. I'm certainly not the biggest fan of x86 myself, but I'm afraid we're stuck with it for at least another decade. I believe Intel had plans for a radical new ISA for 64-bit desktops, but AMD beat them to it with an x86 extension, and it proved very popular thanks to the x86-32 compatibility.

The future I envision is that we'll have 4-core on 45 nm, 8-core on 32 nm, 16-core on 22 nm, but somewhere around that point it actually becomes cheaper to have mini-cores running 32 threads. Software has to adapt to Cores-a-plenty™ anyway...
 
You'd be surprised at how many registers are floating around in a GeForce 256's fragment pipe. Hint: it's not close to the register counts of CPUs.
D3DTA_CURRENT and D3DTA_TEMP? Can't think of any other software-accessible ones (for Direct3D).
 