Intel vs. AMD

Screw all those deep pipelines ... just give me a dual-issue in-order x86 core. One issue slot for x86 instructions, or rather macro-ops, and non-arithmetic SIMD instructions, and one for SIMD arithmetic (MMX/3DNow! only, forget about SSE). Give me as many of those as you can fit on a chip, with say 16/64 KB L1/L2 cache per core and a shared L3 cache of say 512-1024 KB. Oh, and support 3+ thread contexts in hardware (won't take a lot of space; this type of processor doesn't need huge register files ... less space needed for renaming, and for multiple accesses per cycle).
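
Roughly, the pairing rule I have in mind is this cheap (just a sketch, the op classes and names are made up):

[code]
/* Hypothetical sketch of the proposed dual-issue pairing rule:
 * slot 0 takes x86 macro-ops plus non-arithmetic SIMD ops,
 * slot 1 takes SIMD arithmetic (MMX/3DNow!) only. */
#include <stdbool.h>

typedef enum {
    OP_MACRO,       /* decoded x86 macro-op (integer, load/store, branch) */
    OP_SIMD_MOVE,   /* non-arithmetic SIMD: moves, shuffles, unpacks */
    OP_SIMD_ARITH   /* MMX/3DNow! arithmetic */
} op_class;

static bool slot0_accepts(op_class c) { return c != OP_SIMD_ARITH; }
static bool slot1_accepts(op_class c) { return c == OP_SIMD_ARITH; }

/* Two adjacent ops dual-issue only if they can fill different slots. */
static bool can_dual_issue(op_class a, op_class b)
{
    return (slot0_accepts(a) && slot1_accepts(b)) ||
           (slot1_accepts(a) && slot0_accepts(b));
}
[/code]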

Of course I have said that before, but now people on comp.arch are saying it too ... so maybe AMD/VIA/Transmeta will stop being such pussies and actually do this, instead of waiting for Intel to set the trend. It is going to happen sooner or later. There is a niche for these kinds of processors, and if the Cell patents are anything to go by, Sony/IBM's design will leave plenty of opportunity for x86 to live in that niche too.
 
Yeah, you mentioned this a while back (Console forums, IIRC), and I agreed back then. More so now, after messing around with scheduling algorithms and my recent bouts with multiprogramming -- Active Object Systems are cool. It just makes a lot of sense to move into this area; libraries would, of course, be key.

Mind you, I'd rather have one or more small chips which serve as decoders only; they'd have large caches, decode the various instruction streams, and likely handle memory control. These would then feed smaller (no decode logic and so on) and faster (reduced critical paths) units. The rest of the cores being in-order and able to handle 2 states at the least (Hyperthreading) would be nice; that should for the most part take care of the performance lost from not having OoOE.
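
Something like this toy sketch is what I mean (all names invented): the decode chip runs ahead and fills a queue of pre-decoded ops, and the small cores just drain it, carrying no decode logic at all.

[code]
#include <stdbool.h>

#define QUEUE_DEPTH 64          /* power of two, assumed */

typedef struct { unsigned opcode, dst, src0, src1; } decoded_op;

typedef struct {
    decoded_op slot[QUEUE_DEPTH];
    unsigned   head, tail;      /* core reads at head, decoder writes at tail */
} op_queue;

/* Decode side: push a pre-decoded op if there's room. */
static bool decode_push(op_queue *q, decoded_op op)
{
    if (q->tail - q->head == QUEUE_DEPTH)
        return false;           /* queue full: the decoder stalls */
    q->slot[q->tail++ % QUEUE_DEPTH] = op;
    return true;
}

/* Execution side: pop the next op, no decoding needed here. */
static bool core_pop(op_queue *q, decoded_op *out)
{
    if (q->head == q->tail)
        return false;           /* queue empty: the core stalls */
    *out = q->slot[q->head++ % QUEUE_DEPTH];
    return true;
}
[/code]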
 
MfA said:
Screw all those deep pipelines ... just give me a dual-issue in-order x86 core. One issue slot for x86 instructions, or rather macro-ops, and non-arithmetic SIMD instructions, and one for SIMD arithmetic (MMX/3DNow! only, forget about SSE). Give me as many of those as you can fit on a chip, with say 16/64 KB L1/L2 cache per core and a shared L3 cache of say 512-1024 KB. Oh, and support 3+ thread contexts in hardware (won't take a lot of space; this type of processor doesn't need huge register files ... less space needed for renaming, and for multiple accesses per cycle).

You are hard core.

An in-order CPU in this day and age?

Level 1 cache will be 2-4 cycles away, level 2 cache 5-10 cycles away, level 3 cache 20+ cycles away and main memory eons away.

Manually scheduling 2 ops/cycle for those latencies is going to be very painful.
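
To see how painful, take a trivial reduction. With a load-use latency of, say, 4 cycles (number assumed for illustration), the naive loop stalls on every element, and the "schedule" has to be baked in by hand:

[code]
/* Naive: each add waits ~4 cycles for the load it depends on. */
long sum_naive(const long *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];              /* load -> stall -> add, every iteration */
    return s;
}

/* Hand-scheduled: four independent chains keep loads in flight,
 * so each add lands roughly when its load completes. */
long sum_pipelined(const long *a, int n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* remainder */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
[/code]

And that's the easy case; now imagine doing it for every loop in an OS and its applications.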

Cheers
Gubbi
 
MfA said:
Screw all those deep pipelines ... just give me a dual-issue in-order x86 core. One issue slot for x86 instructions, or rather macro-ops, and non-arithmetic SIMD instructions, and one for SIMD arithmetic (MMX/3DNow! only, forget about SSE). Give me as many of those as you can fit on a chip, with say 16/64 KB L1/L2 cache per core and a shared L3 cache of say 512-1024 KB. Oh, and support 3+ thread contexts in hardware (won't take a lot of space; this type of processor doesn't need huge register files ... less space needed for renaming, and for multiple accesses per cycle).

Of course I have said that before, but now people on comp.arch are saying it too ... so maybe AMD/VIA/Transmeta will stop being such pussies and actually do this, instead of waiting for Intel to set the trend. It is going to happen sooner or later. There is a niche for these kinds of processors, and if the Cell patents are anything to go by, Sony/IBM's design will leave plenty of opportunity for x86 to live in that niche too.

:oops:

8)
 
Such a long topic on a very good subject... how did I miss it? :cry:

Now it's gonna take a week to read and dissect all of these remarkable comments... ;)
 
Gubbi said:
You are hard core.

An in-order CPU in this day and age?

Level 1 cache will be 2-4 cycles away, level 2 cache 5-10 cycles away, level 3 cache 20+ cycles away and main memory eons away.

Manually scheduling 2 ops/cycle for those latencies is going to be very painful.

Well, Sun says it's cool. ;) Mind you, they have a less crippling ISA and control not only that but everything all the way up to the OS and even some apps.

Additionally, Gubbi, if you read carefully, he wants one slot for macro-ops, one for non-arithmetic SIMD and one for SIMD arithmetic. 3 ops/cycle.

One thing, however: does MfA's scheme require that each instruction slot be filled from the same instruction stream?
 
Saem said:
Gubbi said:
You are hard core.

An in-order CPU in this day and age?

Level 1 cache will be 2-4 cycles away, level 2 cache 5-10 cycles away, level 3 cache 20+ cycles away and main memory eons away.

Manually scheduling 2 ops/cycle for those latencies is going to be very painful.

Well, Sun says it's cool. ;) Mind you, they have a less crippling ISA and control not only that but everything all the way up to the OS and even some apps.

Sun is at the bottom of the heap performance-wise.

Saem said:
Additionally, Gubbi, if you read carefully, he wants one slot for macro-ops, one for non-arithmetic SIMD and one for SIMD arithmetic. 3 ops/cycle.

Restricting instruction issue is just going to compound the pain (VLIW, *ouch*).

I'd like to see a bunch of fast and narrow cores as well, but I'd want them to be self-scheduling (like an 80-100 instruction scheduling window). Having an out-of-order core also makes SMT easier to implement.

Of course, with that kind of resources committed to speeding up thread-level parallelism, fast synchronization primitives are going to be essential to performance. So I want a Tera MTA-style 65-bit memory subsystem with one empty/full bit for every 64-bit data word.

When storing to a full address (one with the 65th bit set), a thread should halt until the full/empty bit is emptied. That would allow super-fast inter-thread synchronization.
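
In software terms the semantics would look roughly like this minimal sketch (emulated here with C11-style atomics and spinning, single producer/single consumer; the MTA proper keeps the bit in the memory system and halts the hardware thread instead):

[code]
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic int full;           /* the "65th bit": 0 = empty, 1 = full */
    int64_t     data;           /* the 64-bit word itself */
} tagged_word;

/* Store to a full address halts until the word is emptied. */
void sync_store(tagged_word *w, int64_t v)
{
    while (atomic_load_explicit(&w->full, memory_order_acquire))
        ;                       /* word still full: wait */
    w->data = v;
    atomic_store_explicit(&w->full, 1, memory_order_release);
}

/* Load from an empty address halts until the word is filled. */
int64_t sync_load(tagged_word *w)
{
    while (!atomic_load_explicit(&w->full, memory_order_acquire))
        ;                       /* word still empty: wait */
    int64_t v = w->data;
    atomic_store_explicit(&w->full, 0, memory_order_release);
    return v;
}
[/code]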

Cheers
Gubbi
 
Gubbi said:
Sun is at the bottom of the heap performance wise.

I didn't say they were right. =P Then again, IIRC, isn't their Niagara project basically a bunch (8) of in-order CPUs, each with 8-way SMT, on one die? I'd imagine that would put many things to shame.

Gubbi said:
Restricting instruction issue is just going to compound the pain (VLIW, *ouch*).

Hrm, this is what I'm wondering about. Restricting it might not be that bad on the execution engine end (VLIW); just because one talks to that interface doesn't say much about what's on the other side of the fence. IIRC, Power4 and up use this method. I'm just thinking about it on a much larger scale.

Gubbi said:
I'd like to see a bunch of fast and narrow cores as well, but I'd want them to be self-scheduling (like an 80-100 instruction scheduling window). Having an out-of-order core also makes SMT easier to implement.

A window that large might seriously hamper the clock rate, and the rough 30% gain from OoOE (which you seem to imply) might be wasted. If you decouple the scheduling heavy lifting, you could have the execution resources running very fast -- much faster than the scheduling -- and, to put things into balance, make the scheduling more parallel. Somewhat like the P4 with its double-pumped ALU versus the rest of the chip. It makes more sense to me to have high-clocked execution units and then make the feeding end the brainiac. You get the best of both worlds, IMHO.
 
Saem said:
Gubbi said:
Restricting instruction issue is just going to compound the pain (VLIW, *ouch*).

Hrm, this is what I'm wondering about. Restricting it might not be that bad on the execution engine end (VLIW); just because one talks to that interface doesn't say much about what's on the other side of the fence. IIRC, Power4 and up use this method. I'm just thinking about it on a much larger scale.

Instruction groups in Power4/5 are not (V)LIW. The individual instructions in a group can be issued/executed out of order. The groups are used to reduce the amount of tracking the OoO engine has to do; instructions are retired when all the instructions in a group have been executed (i.e. the entire group is retired).
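
Conceptually the bookkeeping is as cheap as this sketch (structure invented for illustration, not IBM's actual design): one pending bitmask per group instead of per-instruction retirement state.

[code]
#include <stdbool.h>
#include <stdint.h>

#define GROUP_SIZE 5            /* Power4 forms groups of up to 5 ops */

typedef struct {
    uint8_t pending;            /* bit set = instruction not yet executed */
} inst_group;

/* A new group dispatches with all n of its slots pending. */
static void group_dispatch(inst_group *g, int n)
{
    g->pending = (uint8_t)((1u << n) - 1);
}

/* Instructions in the group may finish in any order. */
static void mark_done(inst_group *g, int slot)
{
    g->pending &= (uint8_t)~(1u << slot);
}

/* The whole group retires in one step once the mask drains. */
static bool group_can_retire(const inst_group *g)
{
    return g->pending == 0;
}
[/code]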

Saem said:
Gubbi said:
I'd like to see a bunch of fast and narrow cores as well, but I'd want them to be self-scheduling (like an 80-100 instruction scheduling window). Having an out-of-order core also makes SMT easier to implement.

A window that large might seriously hamper the clock rate, and the rough 30% gain from OoOE (which you seem to imply) might be wasted. If you decouple the scheduling heavy lifting, you could have the execution resources running very fast -- much faster than the scheduling -- and, to put things into balance, make the scheduling more parallel. Somewhat like the P4 with its double-pumped ALU versus the rest of the chip.

It'll be fast because it's narrow; the size of the instruction window is inconsequential (if necessary, dice scheduling into multiple pipeline stages -- just look at the P4).

If you look at it more generally, scheduling is all about pairing producers (completed instructions) with consumers (instructions in the scheduler). The fewer execution units you have, the fewer producers; the narrower your scheduler, the fewer consumers. This all helps to *greatly* reduce the result forwarding network. Remember, the result forwarding network is basically one huge MUX/DEMUX and hence lots of wire -- and wire delay is getting proportionally larger compared to transistor speeds going forward.
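
In sketch form (structure invented for illustration): every completing producer broadcasts its result tag to every waiting source operand, so the comparator and wire cost scales with results-per-cycle times scheduler width -- exactly what narrow issue shrinks.

[code]
#include <stdbool.h>

#define SRC_MAX 2               /* source operands per instruction */

typedef struct {
    int  src[SRC_MAX];          /* register tags this op consumes */
    bool ready[SRC_MAX];        /* has each source been produced yet? */
} waiting_op;

/* One wakeup broadcast: a producer's result tag reaches all consumers.
 * In hardware this loop is the forwarding/wakeup network itself:
 * (tags broadcast per cycle) x (window entries) x SRC_MAX comparators. */
static void wakeup(waiting_op *window, int entries, int result_tag)
{
    for (int i = 0; i < entries; i++)
        for (int s = 0; s < SRC_MAX; s++)
            if (window[i].src[s] == result_tag)
                window[i].ready[s] = true;
}

/* Select: an op may issue once all of its sources are ready. */
static bool can_issue(const waiting_op *op)
{
    return op->ready[0] && op->ready[1];
}
[/code]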

I believe that a few generations ahead most designers will regard execution units as mostly free - both in terms of performance (delay) and in terms of die space. Keeping the exec units fed is where the challenge lies.

Cheers
Gubbi
 
Gubbi said:
It'll be fast because it's narrow; the size of the instruction window is inconsequential (if necessary, dice scheduling into multiple pipeline stages -- just look at the P4).

I was under the impression that the P4 didn't examine all that many instructions for scheduling. But that's just a vague memory from AnandTech or some such.

Gubbi said:
If you look at it more generally, scheduling is all about pairing producers (completed instructions) with consumers (instructions in the scheduler). The fewer execution units you have, the fewer producers; the narrower your scheduler, the fewer consumers. This all helps to *greatly* reduce the result forwarding network. Remember, the result forwarding network is basically one huge MUX/DEMUX and hence lots of wire -- and wire delay is getting proportionally larger compared to transistor speeds going forward.

I agree, wire delay is the new battlefront, along with other awkward effects that are cropping up. Though, the difference between this (MUX/DEMUX) and large bus networks is hard to see.

Gubbi said:
I believe that a few generations ahead most designers will regard execution units as mostly free - both in terms of performance (delay) and in terms of die space. Keeping the exec units fed is where the challenge lies.

Wire delay aside, this was always a problem. Things like speculative execution, though a win at times, could come to be perceived as wasters of data-delivery resources as we walk down this path.
 