SSE4, future processors and GPGPU thoughts

Can you give me an example of a VLIW processor that didn't have a decoder at all?
That would be one huge instruction word.

This is from memory (from my beer-hazy university days), but IIRC, Multiflow had 1024 bit instructions. The idea was that the instruction word had, erh, instructions for each individual execution unit in the processor, so instead of fetch, decode, issue, execute, you only had fetch and execute stages.

In retrospect it is clear why that wasn't such a good idea: it was impossible to fill the instruction word enough to keep all execution units busy, resulting in a massive number of nops and correspondingly piss-poor instruction stream density. Another problem is that the register file would have to support two read ports and one write port for every execution unit, rather than for every issued instruction that does actual work, as in a regular CPU. It's fairly easy to imagine this as a showstopper for subsequent, wider implementations.

Some of the ideas carried over into IPF, which can be thought of as compressed VLIW - sort of. The template bits in every 128-bit instruction bundle (holding three instructions) dictate which execution units the individual instructions are sent to. Compared to "real" VLIW, IPF has a decode stage, but the issue logic is vastly simplified compared to a normal processor.
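
To make the bundle idea concrete, here's a rough sketch in plain C of how the template bits could steer a bundle's three slots to unit types. The template table below is illustrative only (a handful of the 32 possible templates), not the real IA-64 encoding.

Code:
/* Sketch of the "compressed VLIW" idea in IPF: a 128-bit bundle is three
 * 41-bit instruction slots plus a 5-bit template that tells the front end
 * which unit type each slot is routed to. Illustrative template table,
 * not the real IA-64 encoding. */
#include <stdint.h>
#include <stdio.h>

enum unit { M, I, F, B };            /* memory, integer, floating-point, branch */

struct bundle {
    uint8_t  template_bits;          /* 5 bits: selects a fixed slot-to-unit pattern */
    uint64_t slot[3];                /* three 41-bit instructions                    */
};

/* Hypothetical subset of the template table: template id -> unit per slot. */
static const enum unit template_map[][3] = {
    { M, I, I },
    { M, M, I },
    { M, F, I },
    { M, I, B },
};

static void route(const struct bundle *b)
{
    const enum unit *units = template_map[b->template_bits];
    for (int s = 0; s < 3; s++)
        printf("slot %d -> unit type %d\n", s, (int)units[s]);
}

int main(void)
{
    struct bundle b = { 0, { 0, 0, 0 } };   /* an "MII"-style bundle */
    route(&b);
    return 0;
}

The "decoding" then amounts to little more than a table lookup per bundle, which is why the issue logic can stay so simple.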

Cheers
 
Regarding the other discussion with umpteen simple "mini" x86 cores on a die.

I don't think it makes sense. At all.

As 3dilettante already pointed out, the sole reason x86 still exists is because binary backwards compatibility matters (and matters a lot!). Every generation of x86 CPUs has not only run new software faster, but also run old software significantly faster.

A chip with umpteen minicores would run older software slower than a "big" core with a similar silicon budget. A minicore chip would therefore only be useful for new software. Because of that, you might as well start with a clean slate instead of replicating the unfortunate things in x86:

1. Non-uniform instruction size, which complicates the fetch and decode stages.
2a. Reg-mem ops, and
2b. the small number of registers. Yes, current x86 has good performance despite being register starved, but only because of OOO and fantastic L1 cache systems (low latency, store-to-load forwarding, memory disambiguation, etc.) - systems that take up silicon real estate and, almost more importantly, burn power.
3. Flags. Almost all ALU instructions modify the flags, which introduces read-after-write hazards (stalls) when there's absolutely no reason for them.

Then there's the whole memory model with segments and what not that just needs to die.

Cheers
 
Without round-robin, one well-utilized thread will block 7 other threads.
Since a cache read takes multiple clock cycles, thread switches would happen very often anyway. It could also use round-robin among the threads that are ready. So it's not that hard to have fairness.
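
For what it's worth, a minimal sketch in plain C of that selection policy (hypothetical thread-state structure, not how Niagara or any real core implements it):

Code:
/* Round-robin among the ready threads: pick the next hardware thread after
 * the one that issued last, skipping any thread blocked on a cache miss. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_THREADS 4

struct hw_thread {
    bool ready;                     /* false while waiting on a cache miss etc. */
    unsigned pc;
};

/* Returns the thread to issue from this cycle, or -1 if none is ready. */
static int pick_thread(const struct hw_thread t[], int last_issued)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int cand = (last_issued + i) % NUM_THREADS;
        if (t[cand].ready)
            return cand;
    }
    return -1;                      /* all threads stalled: the core idles */
}

int main(void)
{
    struct hw_thread t[NUM_THREADS] = {
        { true, 0 }, { false, 0 }, { true, 0 }, { true, 0 }
    };
    int last = 0;
    for (int cycle = 0; cycle < 6; cycle++) {
        last = pick_thread(t, last);
        printf("cycle %d: issue from thread %d\n", cycle, last);
    }
    return 0;
}

A thread stalled on a miss simply gets skipped until it's ready again, so one busy thread can't starve the rest.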
Your willingness to discount the importance of Niagara's larger register file seems to assume thread context takes no room in a cramped L1.
It only has 8 local registers. There are also three sets of 8 registers for input, output, and global data. But that's really not much different from having 16 registers and a fast L1 cache.
Since no L1 I'm aware of is 32-way associative or 32 times the size of the smallest working cache on a single-threaded processor, the mythical 32-threaded minicore will thrash as a rule, unless you want a huge L1 with latencies that would require >128 threads to hide.
An AMD K8's L1 cache is only 2-way associative, so with 4 threads per core 8-way associativity would be workable. You also have to realize that with an architecture like this some extra conflicts are unavoidable, but it's the combined throughput that counts. Like I already said, the L1 cache just holds more useful data. In the future I also expect threads to become somewhat smaller. The core isn't working on one big thread doing several tasks, but several threads doing one task.
If you want monster throughput, why not just toss out the cache entirely and add another 30 minicores, since the cache is unlikely to work anyway?
Having entirely no cache would mean it's constantly waiting for memory accesses, so utilization per core is extremely low.
The programming model used by GPUs allows them to minimize the working set and context of each thread they run, and they go to great lengths to keep it that way. Independent CPUs cannot assume this.
Yes, but keeping CPU threads independent is important for versatility. GPUs will always be worthless for server tasks, and wouldn't be able to run applications split into many threads. Also, Intel isn't going to let GPUs dominate the high-throughput computing market. So I really think there's a need for a CPU with high throughput for independent threads. Some compromises are unavoidable.
A scalar and limited OoO core would probably be in the same ballpark of complexity and the L1 would still be useful for a much wider niche than a poor-man's Niagara. It could even do some limited multithreading to 2 threads, since OoO can be repurposed to do most of the lifting.
Indeed, that seems like an interesting option as well. I've compared some transistor counts and it actually seems possible to fit 32 Pentium IIIs on a Core 2 Duo die! Of course caches are much denser, but at 45 nm it definitely seems possible to have lots of cores with an efficient out-of-order architecture and good cache sizes. Given that Intel is already planning to reintroduce Hyper-Threading, that might mean we get 8-core CPUs running 2 threads each. Without the negative effect of Pentium 4's replay mechanism this could increase utilization significantly (current CPUs have quite low IPC compared to their number of execution units).
Do you mean 32 mini-cores running 32 threads each?
No, I meant more like 8 mini-cores running 4 threads each, which would be cheaper than 8 big cores running 1 thread, but might offer the same total throughput. But my idea of ditching out-of-order execution to make the cores really small might be a little extreme. By simplifying the cores a little and reintroducing Hyper-Threading Intel might already be maximizing throughput per area.
It's not too difficult to maintain incremental gains in x86 performance, and slowly ease in a few non-compatible cores.
With a steady increase in x86 performance and cheap but high-performance additional cores, that would indeed be an interesting option. But that's a big if. You can't sell a Cell as a Power processor at several times the price, and with fewer SPEs it's no longer an interesting high-throughput architecture (compared to a dual-core Power, for instance)...
 
I believe Intel had plans for a radical new ISA for 64-bit desktops, but AMD beat them to it with an x86 extension, and it proved very popular thanks to the x86-32 compatibility.
the sole reason x86 still exists is because binary backwards compatibility matters (and matters a lot!). Every generation of x86 CPUs has not only run new software faster
Yes, I believe Intel's strategy was Itanium from the top (big iron) down to the desktop. AMD was quick, snuck x86-64 onto the desktop first, and grew it up through the low-end server market. I'm sure everyone but Intel will be happy for this in the end. =)
 
Since a cache read takes multiple clock cycles, there's a thread switch very often. It could also use round-robin on the threads that are ready. So it's not that hard to have fairness.

It would be a variant of Niagara's threading model. Round-robin unless a long-term event such as a cache miss occurs.
Cache reads on x86 are so common that if a switch can occur on every cache access, the thing would devolve into a round-robin processor anyway. The corner case is if somebody does cram some code's data into the register set, in which case a trivial loop could block every other thread without active intervention by the issue hardware.

Since a register read is part of the instruction, switching on that would be round-robin.

It only has 8 local registers. There are also three sets of 8 registers for input, output, and global data. But that's really not much different from having 16 registers and a fast L1 cache.
There are 8 local registers per register window. Niagara's implementation supports 4 windows. That's a total of 32 local registers (I think the windows are implemented by a renaming scheme, by the way) that are software-visible. They are not all immediately accessible, but they are still software-visible.
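
For the window overlap, a minimal sketch (assuming 4 windows, SAVE moving to the adjacent lower window, and a window's %i registers aliasing the next window's %o registers - a simplified model, not Niagara's actual implementation):

Code:
/* Map an architectural register in a given window to a slot in the
 * physical file: 8 shared globals plus 16 unique registers per window. */
#include <stdio.h>

#define NWINDOWS 4
#define NPHYS    (8 + 16 * NWINDOWS)

static int phys_index(int w, int r)        /* r = 0..31 within window w */
{
    if (r < 8)  return r;                              /* %g0-%g7: shared */
    if (r < 16) return 8 + 16 * w + (r - 8);           /* %o0-%o7         */
    if (r < 24) return 8 + 16 * w + 8 + (r - 16);      /* %l0-%l7         */
    return 8 + 16 * ((w + 1) % NWINDOWS) + (r - 24);   /* %i0-%i7 alias the
                                                          next window's %o */
}

int main(void)
{
    int caller = 1;                /* CWP before SAVE     */
    int callee = caller - 1;       /* SAVE decrements CWP */
    /* The caller's %o0 and the callee's %i0 are the same physical register. */
    printf("caller %%o0 -> phys %d\n", phys_index(caller, 8));
    printf("callee %%i0 -> phys %d\n", phys_index(callee, 24));
    printf("physical registers per thread: %d\n", NPHYS);
    return 0;
}

With 4 windows that comes to 72 physical registers of software-visible per-thread state, which is why the register file context is not small.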

I'm sure there's some debate about which scheme is more awkward, the REX hack or the rigid windowing system in SPARC.

{EDIT: The following is not an issue if the caches are not write-through. This was covered earlier, but I had forgotten.
Also important, register files aren't kept coherent. Any write to memory must be kept coherent, so reliance on coherent memory for a core's internal spills is not without cost. }

I would expect an x86 version of Niagara with everything else being equal (this would never happen, but still) to lag noticeably, especially at high thread counts.


An AMD K8's L1 cache is only 2-way associative, so with 4 threads per core 8-way associativity would be workable. You also have to realize that with an architecture like this some extra conflicts are unavoidable, but it's the combined throughput that counts. Like I already said, the L1 cache just holds more useful data. In the future I also expect threads to become somewhat smaller. The core isn't working on one big thread doing several tasks, but several threads doing one task.
The K8's L1 cache is 64 KB and 2-way with a 3-cycle penalty. It is unlikely that an 8-way variant (8-way and 64 KB? 8-way and 256 KB?) is going to keep that latency.

If it climbs to over four cycles, then 4-way threading cannot hide the cache accesses.

With a steady increase in x86 performance and cheap but high-performance additional cores, that would indeed be an interesting option. But that's a big if. You can't sell a Cell as a Power processor at several times the price, and with fewer SPEs it's no longer an interesting high-throughput architecture (compared to a dual-core Power, for instance)...

Odds are, Cell is at least a process node ahead of that point. IBM could probably redesign a more suitable chip at 65 nm or below, if it had the desire.
POWER6 is the server architecture, and I've read elsewhere that some people argue IBM used the Cell design partnership to test-run some of their ideas for POWER6 with Sony and Toshiba defraying some of the cost.

Since POWER6 is a more throughput-intensive design with streamlined OoO abilities, we may see the compromise variant win out. It may not beat Sun's offering on power consumption or cost at that point, but it's likely POWER6 will be significantly ahead on performance.
 
Also important, register files aren't kept coherent. Any write to memory must be kept coherent, so reliance on coherent memory for a core's internal spills is not without cost.

This is only a problem if you use a write-through cache. Spills always go to the stack. The stack is private to the context (normally, anyway). So the stack cache lines end up in the 'E' state (assuming M(O)ESI nomenclature) in the cache, and should never be queried for coherence, since no other core/context would request them.
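
As a toy illustration of that (MESI only, no 'O' state; the helper functions are made up, not real protocol code):

Code:
/* A line that only one core/context ever touches is filled in E and
 * silently upgraded to M on the first spill, so later spills and fills
 * generate no coherence traffic. */
#include <stdio.h>

enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

/* Read miss: fill exclusive if no other cache holds the line. */
static enum mesi on_local_read(enum mesi s, int others_have_it)
{
    if (s != INVALID) return s;
    return others_have_it ? SHARED : EXCLUSIVE;
}

/* Write: E -> M is a silent upgrade; S or I would need an invalidate/RFO. */
static enum mesi on_local_write(enum mesi s, int *coherence_msgs)
{
    if (s == SHARED || s == INVALID) (*coherence_msgs)++;
    return MODIFIED;
}

int main(void)
{
    int msgs = 0;
    enum mesi stack_line = INVALID;
    stack_line = on_local_read(stack_line, 0);       /* first touch of a stack line */
    stack_line = on_local_write(stack_line, &msgs);  /* register spill              */
    printf("state %d (3 = MODIFIED), coherence messages for the spill: %d\n",
           stack_line, msgs);
    return 0;
}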

I would expect an x86 version of Niagara with everything else being equal (this would never happen, but still) to lag noticeably, especially at high thread counts.
I agree, even though SPARC has its own dead wood to lug around (the register windows, as you already pointed out).

A PPC or MIPS should fare even better.

POWER6 is the server architecture, and I've read elsewhere that some people argue IBM used the Cell design partnership to test-run some of their ideas for POWER6 with Sony and Toshiba defraying some of the cost.

Microarchitecturally, they probably found out that an in-order core with a 2-cycle basic ALU and 6-cycle L1 D$ load-to-use latency would kill performance... and went the other way. From a circuit design point of view, they probably learned a lot from the shallow-FO4 methodology used in Cell.

Cheers
 
This is only a problem if you use a write-through cache. Spills always go to the stack. The stack is private to the context (normally, anyway). So the stack cache lines end up in the 'E' state (assuming M(O)ESI nomenclature) in the cache, and should never be queried for coherence, since no other core/context would request them.

Oh that makes sense. I didn't know the line status could be set for that purpose.
 
Gubbi, 3dilettante,

Is it possible/desirable to implement SPARC register windows via register renaming? The reason I ask is that Sun's upcoming Rock architecture is rumored to be doing all kinds of fancy OoO things (out-of-order retirement, scout threading):

I don't think there's any reason why they can't be. Fujitsu's SPARC64 V is OoO.

I can't find any in-depth information on the implementation, but it looks from a high-level to be based on a Tomasulo OoO design with reservation stations, which would use renaming.

I think windowing complicates renaming a little bit, since an instruction that shifts windows would require updating the rename status of any instructions that are issued alongside it.
Since some registers carry over to the new window, something must keep track of which physical register can be addressed as a passed-in value or a local one - in effect, some kind of double rename (or a brute-force copy instead).
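
A minimal sketch of what that map update could look like (hypothetical structures, not SPARC64 V's or Rock's actual mechanism):

Code:
/* On SAVE, the new window's %i names must pick up whatever physical
 * registers the old window's %o names were mapped to. */
#include <stdio.h>

static int next_tag = 100;      /* stand-in for a free list of physical regs */
static int rename_map[32];      /* architectural register -> physical tag    */

static void do_save(void)
{
    int new_map[32];
    for (int r = 0;  r < 8;  r++) new_map[r] = rename_map[r];       /* %g: shared      */
    for (int r = 24; r < 32; r++) new_map[r] = rename_map[r - 16];  /* new %i = old %o */
    for (int r = 8;  r < 24; r++) new_map[r] = next_tag++;          /* fresh %o and %l */
    for (int r = 0;  r < 32; r++) rename_map[r] = new_map[r];
    /* The old %l/%o mappings would be checkpointed for RESTORE; omitted here. */
}

int main(void)
{
    for (int r = 0; r < 32; r++) rename_map[r] = r;   /* initial identity map */
    do_save();
    /* %i0 of the new window now reads the caller's %o0. */
    printf("new %%i0 -> physical tag %d\n", rename_map[24]);
    return 0;
}

Any instruction renamed in the same group as the SAVE has to see the updated map, which is where the extra complication comes in.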

It reminds me a little of what x86 chips do with x87 fp instructions, which require an additional rename stage in order to simulate the stack-based ISA on a register set that is then run OoO.

I could be wrong, since I'm not too familiar with SPARC beyond some generalities.
 
D3DTA_CURRENT and D3DTA_TEMP? Can't think of any other software accessible (for Direct3D).
Let me get this straight.

You said GPU's are an example of why a CPU that can't route around stalls doesn't need many registers like Cell's SPE's do. After being proven wrong by pretty much any DX9 card (they all have tens of thousands of registers), you then say a 16-SPE CPU has "only" 2048 registers, contradicting your earlier stance that SPE-esque register counts aren't needed. Finally, you mention fixed-function GPUs for throughput with only two "software accessible" registers per thread, even though these cannot accommodate anywhere near today's shader complexity, let alone general CPU code, and you also ignore the number of pixels in flight.

You need to understand that a "thread" in a GPU is not a single pixel, but a batch of pixels. Feed G71 only 100 pixels with a shader program and it's extremely slow, probably running in the same time as 5000 pixels would. Each shader quad runs the same program and needs a batch of about 880 pixels to be efficient. If the program needs four float4's, that's about 14,000 FP32 registers per thread. Even a Geforce256 will have at least 500 pixels in flight, so the two temps you mentioned would total 4,000 registers.

The GPU is effectively doing the same loop unrolling that 3dilettante was talking about. Each loop iteration goes over a pixel, and each pixel needs, say, 4 float4 registers. If you only want to hide instruction latency, then you just need to unroll and interleave 7 iterations at a time on Cell, but that still fills up a lot of registers (7*4*4=112). If you want to hide data stalls to get the "phenomenal throughput" of GPUs, then you either need many, many more registers or a cache with such high bandwidth that you can swap all necessary data in and out of the registers fast enough.
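
For instance, a sketch of that interleaving in plain C with SSE intrinsics (the 4-wide interleave factor and the trivial multiply "shader" are just placeholders):

Code:
/* Process four independent "pixels" per loop iteration so their loads and
 * multiplies can overlap, hiding instruction latency. */
#include <xmmintrin.h>

void shade(float *dst, const float *src, __m128 k, int n)
{
    /* n is assumed to be a multiple of 16 floats (4 pixels x float4). */
    for (int i = 0; i < n; i += 16) {
        __m128 p0 = _mm_loadu_ps(src + i + 0);
        __m128 p1 = _mm_loadu_ps(src + i + 4);
        __m128 p2 = _mm_loadu_ps(src + i + 8);
        __m128 p3 = _mm_loadu_ps(src + i + 12);
        _mm_storeu_ps(dst + i + 0,  _mm_mul_ps(p0, k));
        _mm_storeu_ps(dst + i + 4,  _mm_mul_ps(p1, k));
        _mm_storeu_ps(dst + i + 8,  _mm_mul_ps(p2, k));
        _mm_storeu_ps(dst + i + 12, _mm_mul_ps(p3, k));
    }
}

int main(void)
{
    float in[16], out[16];
    for (int i = 0; i < 16; i++) in[i] = (float)i;
    shade(out, in, _mm_set1_ps(2.0f), 16);
    return 0;
}

Even this trivial shader with four pixels in flight ties up half of the 8 architectural SSE registers; anything non-trivial runs out of names very quickly.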

In any case, GPU's are not evidence that you don't need many registers for an architecture unable to route around stalls.
 
In any case, GPU's are not evidence that you don't need many registers for an architecture unable to route around stalls.
Certainly. But I wasn't talking about an architecture that can't route around stalls. GPUs are proof that you don't need a lot of logical registers to route around stalls and reach high throughput.

I know GPUs process pixels in batches, but if you regard every pixel as a separate thread (entering the pipeline in round-robin fashion), then the key to hiding latency is really threading. So it's theoretically possible to get high throughput from x86 cores (even without out-of-order execution), as long as you add enough threads to adequately hide latency. I realize it's more complicated than that in practice, but the number of registers is certainly not the limiting factor. I'm also not claiming that more registers wouldn't be a good thing; I'm just saying we can route around stalls even with few logical registers. GPUs prove that, and nobody has shown that it's impossible for CPUs to take the same approach.
 
If you want to talk about pixels as threads, then that's next to useless for any analogy in the CPU world. The only reason GPU's can hide latency is that they have an insanely parallel workload. I'm talking about thousands of "threads" that are completely independent. If you reduce the number of threads even into the hundreds, their performance collapses.

With this in mind, thousands of threads each with their own copy of relatively few logical registers adds up to a lot of effective registers. You're basically saying that instead of unrolling loops to use many registers like you would for an SPE, spawn a ton of threads with each using only a few registers. The latter is a lot tougher for compilers and developers to do effectively, and less flexible also.
 
If you reduce the number of threads even into the hundreds, their performance collapses.
For a GPU, yes, but a CPU already minimizes execution latency and has a cache for fast memory access. So a fairly low number of threads can keep a simplified core utilized.
The latter is a lot tougher for compilers and developers to do effectively, and less flexible also.
Well, we're not going to get things for free. More registers can help a bit to get higher utilization, but that doesn't scale. For high throughput we need lots of cores, and that implies lots of threads, whether you (and I) like it or not.
 
Certainly. But I wasn't talking about an architecture that can't route around stalls. GPUs are proof that you don't need a lot of logical registers to route around stalls and reach high throughput.

The software model is significantly different, but I think it's probably not impossible to hack x86 into emulating it. The hardware would have to dynamically spawn internal threads to get the full latency-hiding ability of a GPU, or even approach it. Software threads could emulate it partially, but it would be hackish beyond imagining.

Then again...
CTM specification: section 3.3.1 said:
Each instruction can specify the addresses for 6 different sources--3 RGB vectors and 3 Alpha scalars. Each source can come from one of 128 temporary registers (which can be modified during the program, and be different for each processor), or from one of 256 constant registers (which can only be changed when a program is not running).

The number of architectural registers is somewhere between 8 and 24 times (or 4 and 12 times) what x86 makes available, depending on how you rate the constant registers and whether we count x86's SSE and GP sets.
 
The software model is significantly different, but I think it's probably not impossible to hack x86 into emulating it. The hardware would have to dynamically spawn internal threads to get the full latency-hiding ability of a GPU, or even approach it. Software threads could emulate it partially, but it would be hackish beyond imagining.
Could you elaborate that thought for me? What exactly would you emulate, and why would it be hackish?
The number of architectural registers is somewhere between 8 and 24 times (or 4 and 12 times) what x86 makes available, depending on how you rate the constant registers and whether we count x86's SSE and GP sets.
swShader didn't have much trouble with only 8 SSE registers. You really have to count the L1 cache as quickly accessible data.

What's the CTM specification?
 
Could you elaborate that thought for me? What exactly would you emulate, and why would it be hackish?

I'm just saying trying to match a GPU would require special measures either in hardware or software.
GPU programs are dispensed by a command processor to local schedulers that control arrays of processors that run their threads from the same program or portion of a program.

For x86 to do the same, there would need to be hardware support for spawning internal threads on the cores. This has been discussed in other contexts, but not emulating a GPU.

The same end-result could be made by creating many, many separate threads at the OS scheduler level, but it would be very inefficient compared to how a GPU can thread transparently to the external system.

Another way would be to create some kind of software emulator that dispatches work in a similar manner to the minicores, but that forces low-level GPU behavior into software running on cores that are over-designed for use as array processors.
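
A bare-bones sketch of that last option (POSIX threads; the batch size and the stand-in "kernel" are arbitrary):

Code:
/* A command-processor-style dispatcher hands out batches of work items to
 * worker threads, loosely mimicking how a GPU scheduler feeds its arrays. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NUM_ITEMS   4096
#define BATCH       64
#define NUM_WORKERS 4

static atomic_int next_item;
static float data[NUM_ITEMS];

static void kernel(int i) { data[i] = data[i] * 2.0f + 1.0f; }  /* stand-in "shader" */

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int start = atomic_fetch_add(&next_item, BATCH);        /* grab a batch */
        if (start >= NUM_ITEMS) break;
        int end = start + BATCH > NUM_ITEMS ? NUM_ITEMS : start + BATCH;
        for (int i = start; i < end; i++) kernel(i);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++) pthread_join(t[i], NULL);
    printf("done\n");
    return 0;
}

It works, but every batch boundary, every switch, and every bit of scheduling is software overhead that a GPU gets essentially for free in hardware.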
 
What a weird world we live in. In one direction, computers are getting more dedicated processing units (PPU, APU, GPU, NPU), while in the other direction, they want to integrate the CPU and GPU into one.

If they want absolute processing power, we have to go all dedicated, but we'll pay for it with increased power consumption, increased prices, and more waste (as units are replaced by newer versions).

If they go the way of integrating everything, we can go smaller, consume less power, and produce less waste, but we'll get much less processing power.

Reading a couple of pages of this thread, the impression I got is "why have a CPU when GPUs are far superior at everything CPUs can do".

IMO, CPUs are comparatively slower than dedicated units because they can do everything. As GPUs get closer to CPUs, and if in the future they are able to do everything CPUs can, they won't do it any faster.

There's a reason CPUs haven't gone much beyond the 3-issue core (a little more with Conroe): the software CPUs run has a limit on instruction-level parallelism, and on a multi-core CPU you've got to change your software to take advantage of the multiple cores.

Now if GPUs can do everything a CPU can do (going beyond whether GPUs have programming languages that are good for general-purpose processing), they won't be faster, as they'll encounter exactly the same problems CPUs encountered for general-purpose processing, and the extra parallel processing is basically useless.

Some people can get a GPU to accelerate their 3D games, some may want a PPU to accelerate physics, some want an NPU to lower ping, hell maybe we'll have AI accelerators and, as some mentioned, WinZIP accelerators, but for me I'll stick to the simple CPU+GPU combo, maybe even a simpler CPU integrating the GPU (I use an IGP btw). I sure don't want a computer that takes up a whole room and costs $300 a month in electricity by itself, even if it may be super-fast.

A processor that can run thousands of apps at CPU speed equals, in its own way, a processor that can run one app N times faster (where N is some number).

For the sake of the planet itself (and money), I hope everything goes back to before the 3D-graphics-card days and runs everything on the core processing unit (like a CPU). A motherboard that lasts more than one generation wouldn't hurt either. (Doesn't Intel feel guilty about users having to update their ENTIRE system every CPU generation??)
 
Reading couple of pages from this thread, the impression I got is "why have a CPU when GPUs are far superior at everything CPUs can do"
GPU's are only good at highly parallel tasks. For everything else, they're nearly worthless. You can throw pretty much any code at a CPU and it will execute it at a billion instructions per second. Single-threaded code with lots of branches and incoherent memory accesses - something a GPU would choke on - a CPU just churns through. The downside is that the maximum throughput is limited.
There's a reason CPUs haven't got much beyond the 3-issue core(little more with Conroe), because the software CPU runs have a limit on instruction level parallelism...
There's significantly more instruction-level parallelism in every application; it's just very hard to extract in hardware. Independent instructions can be very far apart, across several functions. And to issue more than three instructions per clock, the increase in complexity is prohibitive.
...and on a multi-core CPU, you gotta change your software to take advantage of the multi-core.
It's the only way forward. But it's not as terrible as it looks; it's just a one-time paradigm shift. Once applications are properly multi-threaded they can easily scale with the number of cores. Several applications already claim to be ready for quad-core, while the hardware is hardly even available yet. So bring on the hexa-core and octa-core...
 
What do you think about the idea of multiple CPUs inserted into multiple ZIF sockets? I think 1-cycle DOT and the DX10 shader instructions inside a new SSE, plus the ability to put like 2 or 4 of those CPUs on the motherboard, could rock!

You really seem to have issues with understanding some basic concepts of electronics and physics.

For example, don't you realize that multiple CPUs on one board would:

1. Increase the cost of the board because of:
a) Much more expensive signal routing (probably more layers needed)
b) More elements required for decoupling of 4 sockets instead of 1 (solid or aluminium capacitors)
c) More power MOSFETs for powering 4 CPUs instead of 1
d) Bigger surface needed to accommodate all the elements
e) Next to impractical synchronization for memory access

2. Increase the cost of the whole system because of:
a) mainboard layout change needs new case layout
b) You would need to install 4 coolers instead of 1 (also think of the heat and noise that go with the power required for the 1-cycle DOT you are so obsessed with)
c) RAM would have to be much faster (like 10 times today's bandwidth)

3. Increase the complexity of software design. Not to mention that they would have to develop the tools for programming that monstrosity first. Don't count on free compilers there.

Furthermore, you are ignoring one simple fact in your requests for more powerful instructions -- the GPU is much better at random access to memory than the CPU. That is because it has multiple smaller texture caches and the data layout in memory is optimized for certain types of accesses (the data order is not linear AFAIK). On the CPU side you have roughly 10 times slower RAM (8.5 GB/s or less versus ~85 GB/s on an 8800 GTX) and one big cache, both delivering their greatest bandwidth in simple streaming (linear RAM access) operations. It is not something that can just be changed overnight.
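
For the curious, one common example of such a non-linear layout is Morton (Z-order) tiling; a sketch in plain C (real GPU layouts vary and are mostly undocumented):

Code:
/* Morton (Z-order) addressing keeps 2D-local texels close together in
 * memory, which suits the access patterns of texture filtering. */
#include <stdint.h>
#include <stdio.h>

/* Spread the low 16 bits of v into the even bit positions. */
static uint32_t part1by1(uint32_t v)
{
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

static uint32_t morton2d(uint32_t x, uint32_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}

int main(void)
{
    /* The four texels of a 2x2 quad land in four consecutive offsets. */
    for (uint32_t y = 0; y < 2; y++)
        for (uint32_t x = 0; x < 2; x++)
            printf("texel (%u,%u) -> offset %u\n", x, y, morton2d(x, y));
    return 0;
}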
 