SSE4, future processors and GPGPU thoughts

Thanks for the information. I've had a closer look at the Cell SPE architecture and it has a forwarding network as well. So it must be a good trade-off between transistor count and performance.
For an architecture as narrow as the SPE, the cost would be relatively minor. Forwarding networks have delays that scale quadratically with issue width, but a mostly scalar design keeps that cost small enough to be worthwhile.
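To put rough numbers on that quadratic scaling (a back-of-the-envelope sketch only, assuming one result bus per issued op and two source operands per op):

#include <stdio.h>

int main(void) {
    /* Count bypass paths as (result buses) x (operand read ports). */
    for (int width = 1; width <= 4; width *= 2) {
        int result_buses  = width;      /* one result written back per issued op */
        int operand_ports = 2 * width;  /* two source operands per issued op     */
        printf("%d-wide: ~%d forwarding paths\n",
               width, result_buses * operand_ports);
    }
    return 0;
}

A scalar pipe pays for a handful of paths; a 4-wide machine pays for roughly 16x as many, which is why the SPE gets away with it so cheaply.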

Considering how sensitive in-orders are to latency, it was probably a very important feature to include.

Indeed, it's very close to a barrel processor. But I wouldn't restrict it to round-robin execution. That would require as many threads as the longest possible latency. I'd only use as many as the average latency, executing an instruction from any thread that is ready. So it's still important to have a low minimum latency (which makes a forwarding network an interesting option). The goal would be high throughput without an insane number of threads.
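To put invented but plausible numbers on it: if the longest latency you ever have to cover is ~24 cycles but the average is closer to 4, a pure barrel design needs ~24 threads per core, while picking any ready thread gets by with something like 4 to 8, at the cost of a small amount of selection logic.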
An Out-of-Order processor tries to execute an instruction that is ready to execute and is waiting in some reservation station at the front of an execution unit.
A superscalar Out-of-Order processor attempts to do the same; it just has to scan for multiple instructions from a wider buffer.

A barrel processor just iterates through the thread list.

Your non-barrel FMT processor attempts to execute a ready instruction by simultaneously scanning multiple entries from some buffer of instructions in front of the execution unit...

It's not exactly the same, since superscalar OoO has to wrestle with the quadratic problems of multiple issue and forwarding.

The multithreaded example has some drawbacks compared to a barrel model: determining readiness involves cross-chip communication, the multiple-issue check becomes slightly more timing-constrained (on a scalar processor, no less), and the lack of round-robin brings in questions of fairness. Without round-robin, one well-utilized thread will block the 7 other threads.

Your solution would probably be a minor variation of Niagara's barrel / switch-on-event hybrid scheme.
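Something like the following (a minimal sketch, with the readiness test left abstract) is roughly what I'd assume that hybrid boils down to: rotate for fairness, but skip any thread that's stalled.

/* Pick the next thread to issue from: round-robin over ready threads only. */
static int pick_thread(const int ready[], int num_threads, int last_picked) {
    for (int i = 1; i <= num_threads; i++) {
        int t = (last_picked + i) % num_threads;  /* rotate, starting after the last pick */
        if (ready[t])
            return t;                             /* first ready thread in rotation wins  */
    }
    return -1;                                    /* every thread is stalled this cycle   */
}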

Your willingness to discount the importance of Niagara's larger register file seems to assume thread context takes no room in a cramped L1. Since no L1 I'm aware of is 32-way associative or 32 times the size of the smallest working cache on a single-threaded processor, the mythical 32-threaded minicore will thrash as a rule, unless you want a huge L1 with latencies that would require >128 threads to hide.
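Rough arithmetic (per-thread working set assumed at 8 KB purely for illustration): 32 threads x 8 KB is 256 KB of live data competing for a 32 KB L1, i.e. roughly 8x oversubscribed before you've even counted instruction fetch.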

If you want monster throughput, why not just toss out the cache entirely and add another 30 minicores, since the cache is unlikely to work anyway?

The programming model used by GPUs allows them to minimize the working set and context of each thread they run, and they go to great lengths to keep it that way. Independent CPUs cannot assume this.

A scalar and limited OoO core would probably be in the same ballpark of complexity, and the L1 would still be useful for a much wider niche than a poor man's Niagara. It could even do limited multithreading with 2 threads, since the OoO hardware can be repurposed to do most of the lifting.

We'll have to see what level of utilization Sun's upcoming Rock cores get, since that design still combines multiple cores with OoOE.

The future I envision is that we'll have 4-core on 45 nm, 8-core on 32 nm, 16-core on 22 nm, but somewhere around that point it actually becomes cheaper to have mini-cores running 32 threads. Software has to adapt to Cores-a-plenty™ anyway...

Do you mean 32 mini-cores running 32 threads each?

That would be multithreaded to an absurd degree, and from a wiring perspective, it would be pointless. Signal propagation through cache will not be 8 times as fast at 22nm as it is now, and the cache will be bigger to handle that much data. You'd probably need to increase the number of threads again in order to hide the latency of the cache needed to support 32 threads, then increase the number of threads again to hide the latency of a cache that can support the increased number of threads, and so on.
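To make that feedback loop concrete (every constant below is invented purely for illustration; the point is the shape of the loop, not the numbers):

#include <stdio.h>

int main(void) {
    /* Invented scaling: 16 KB of cache per thread context, and 1 extra cycle
       of cache latency per 8 KB of capacity on top of a 4-cycle base. */
    int threads = 32;
    for (int round = 0; round < 4; round++) {
        int cache_kb = threads * 16;
        int latency  = 4 + cache_kb / 8;
        printf("round %d: %3d threads -> %4d KB cache -> ~%3d cycle latency\n",
               round, threads, cache_kb, latency);
        threads = latency;   /* threads needed to hide that latency next round */
    }
    return 0;
}

With those made-up constants it never settles; whether a real design converges depends entirely on how slowly cache latency grows with capacity.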

I'd rather have 16 effective threads with mostly utilized hardware than have 100% utilization on weak hardware that spends most of its time dealing with the busywork of 32x32 ineffective threads.

I believe that's incorrect. A REX prefix is only needed when using 64-bit operands or when the upper registers (r8-r15) are used. But most code still uses 32-bit values ('int' is still 32-bit), and if you're using the upper half of the register set it means you've avoided spill instructions. So in practice the x86-64 code is more compact.
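A quick worked example (encodings written from memory, so double-check them): 'mov eax, ebx' encodes as 89 D8 (2 bytes, no prefix), 'mov rax, rbx' as 48 89 D8 (3 bytes, REX.W), and 'mov r8d, eax' as 41 89 C0 (3 bytes, REX.B to reach r8-r15). The extra byte only shows up when you actually use 64-bit values or the upper registers, and one byte is far cheaper than the load/store pair a spill would have cost.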
In known cases, x86-64 expands code enough that it reduces the effectiveness of the L1 cache.
This leaves the overall improvement a wash in many cases on aggressively OoO cores.
A high-clocked minicore as heavily dependent on the cache as the one you're proposing would suffer even more.

I'm really interested to hear about other designs, as long as they are x86 and they maximize throughput. Theoretically, I fully agree that x86 should be ditched. But in practice it's not that simple, and any new ISA would also become a limitation in the long-term future, making x86 almost just as good.
In a million years, any ISA will be a drag on performance. x86 is a horrible drag now, unless you use hardware to compensate. It will be even worse in the future, as silicon performance improvements falter and the process lead the x86 manufacturers rely on for competitive performance becomes harder to hold.
It's not too difficult to maintain incremental gains in x86 performance, and slowly ease in a few non-compatible cores.
 
Intel and/or AMD could simply add a second instruction decoder to support a much better, completely different instruction set.
 
It might be easier to have one or two strong x86 cores for performance-demanding tasks, then have another core occasionally running a translator to convert x86 code to the new ISA for the rest of the cores.

DEC had a pretty interesting way of doing it (translate, store results) that would smooth the transition to another ISA considerably if applied now.

Intel's trying something similar to keep some old x86 apps running on Itanium, since they've junked hardware translation entirely. It doesn't store results, to my recollection.
 
It might be easier to have one or two strong x86 cores for performance-demanding tasks, then have another core occasionally running a translator to convert x86 code to the new ISA for the rest of the cores.
The instruction decoder on x86/64 cores already is a translator that feeds multiple, different processors (execution units).
 
Btw, it might make more sense if you look at registers as local storage that is shared by all the execution units.
 
The instruction decoder on x86/64 cores already is a translator that feeds multiple, different processors (execution units).

The instruction decoder in an x86-64 processor is a complicated piece of silicon, and it feeds multiple schedulers that feed multiple ALUs. An execution unit has no idea what an instruction is, or how to get one. It just takes a collection of signals and spits out an output that corresponds to the inputs.

Putting two decoders in would complicate a critical stage in the pipeline even more.
The decode stage may take 50% longer, even if one of the decoders never gets used.
 
If you don't mind a bit more initial latency, you could use a separate processor for that, and upload your preferred JIT compiler.
 
The programming model used by GPUs allows them to minimize the working set and context of each thread they run, and they go to great lengths to keep it that way. Independent CPUs cannot assume this.
Simple as that... smart as that. Thanks, 3dilettante :) (and with a name like that I wonder if you are Italian ;) )
 
If you don't mind a bit more initial latency, you could use a separate processor for that, and upload your preferred JIT compiler.

That's what Intel does for Itanium. DEC did things a little differently by keeping some of the translated code between runs, so the initial latency and overhead got better over time.
 
Am I right in thinking that a single 8800 ALU more or less repeats the following steps during each pipeline pass:

1) Fetch current register contents from local storage
2) Rename them
3) Execute the instruction(s)
...
4) Store the results back to local storage

?
 
Am I right in thinking that a single 8800 ALU more or less repeats the following steps during each pipeline pass:

1) Fetch current register contents from local storage
2) Rename them
3) Execute the instruction(s)
...
4) Store the results back to local storage

?

The ALU's a small component of a functional unit. It doesn't do any renaming at all, or fetch anything. The control logic sends it the signals that correspond to the data for each operand and what portions of circuitry should be active when those signals arrive.

As a result of the inputs, a certain combination of bits comes out that gets passed to hardware that worries about what it was the instruction wanted to happen.
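A crude way to picture it (a sketch, with made-up control signal names): the ALU itself is nothing more than a function from control signals plus operands to a result bit pattern.

#include <stdint.h>

enum alu_op { ALU_ADD, ALU_SUB, ALU_AND, ALU_OR };   /* the "control word" */

/* Combinational behaviour of the ALU: no fetch, no rename, no notion of
   an instruction -- just inputs in, result out. */
static uint32_t alu(enum alu_op op, uint32_t a, uint32_t b) {
    switch (op) {
    case ALU_ADD: return a + b;
    case ALU_SUB: return a - b;
    case ALU_AND: return a & b;
    case ALU_OR:  return a | b;
    }
    return 0;   /* unreachable for well-formed control signals */
}

Everything about fetching operands, renaming, and deciding what to do with the result lives in the logic wrapped around it.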
 
The ALU's a small component of a functional unit. It doesn't do any renaming at all, or fetch anything. The control logic sends it the signals that correspond to the data for each operand and what portions of circuitry should be active when those signals arrive.

As a result of the inputs, a certain combination of bits comes out that gets passed to hardware that worries about what it was the instruction wanted to happen.
Yes, I know that. Control words. I simply wanted to outline the steps.
 
Exactly. And that goes for all the other execution units: they have support logic around them that does the same things. They're no different.
 
On one side you have Very Long Instruction Words, which remove the need for an instruction decoder. They also leave you no option to improve performance in later generations by changing the architecture, unless you add an instruction decoder, and that won't work.

On the other side, you can virtualize the CPU: make it into a black box, and run a complex instruction decoder to keep all your execution units as busy as possible.
 
On one side you have Very Long Instruction Words, which remove the need for an instruction decoder.
This wording is confusing. It sounds like you are saying VLIW chips don't have decoders. They do.

On the other side, you can virtualize the CPU: make it into a black box, and run a complex instruction decoder to keep all your execution units as busy as possible.
I still don't get what this means. CPUs are already black boxes to software; as long as the final software state matches what the instruction semantics define, the CPU can do whatever it wants.

Complex instruction decoders tend to keep execution units idle, waiting for the decoder to do its job.
 
This wording is confusing. It sounds like you are saying VLIW chips don't have decoders. They do.
The original idea behind VLIW chips is that the instruction is the actual bit pattern needed to set everything in motion. It didn't work out quite that cleanly, so complex decoders were added to be able to optimize it.
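Something like this is what I have in mind (field names and widths invented purely to illustrate the idea): each slot of the word is close to the raw control bits for one functional unit, so "decode" is little more than slicing the word up.

#include <stdint.h>

/* Control bits for a single functional-unit slot of the long word. */
struct vliw_slot {
    uint32_t op   : 7;    /* operation select lines         */
    uint32_t src1 : 5;    /* source register / port numbers */
    uint32_t src2 : 5;
    uint32_t dst  : 5;    /* destination register / port    */
};

/* One very long instruction word: a slot per unit, issued together. */
struct vliw_word {
    struct vliw_slot alu0, alu1, mem, branch;
    uint32_t stop : 1;    /* marks the end of an issue group */
};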

Complex instruction decoders tend to keep execution units idle, waiting for the decoder to do its job.
No, they only add latency, if done well. And if they have the time (during loops, for example), they can start optimizing.
 
The original idea behind VLIW chips is that the instruction is the actual bit pattern needed to set everything in motion. It didn't work out quite that cleanly, so complex decoders were added to be able to optimize it.

Can you give me an example of a VLIW processor that didn't have a decoder at all?
That would be one huge instruction word.

No, they only add latency, if done well. And if they have the time (during loops, for example), they can start optimizing.

Latency=waiting

Are we talking about the same "decoder" here, or are we using the word differently?

There is no time during a loop to optimize. A decoder still has to decode an instruction it has run into previously. An instruction is meaningless to the back end without the decoder.

I've certainly never heard of a silicon run-time optimizing compiler, either.
 
more permutations possible

There are many ways to slice and dice the various ways CPU and GPU can be coupled together.

In one configuration, Nvidia could couple an ARM core with a cut-down version of NV80 and sell it to the handheld market. Buying Portal Player may be a hint of this kind of set-up.
 