Larrabee, console tech edition: analysis and competing architectures

Either way, the compiler/programmer has a pretty big load-to-use window to fill on either chip. Of course, the XeCPU has two threads to hide some of the latency if needed.

Wait, doesn't this:
Up to two instructions are issued per cycle; one issue slot supports fixed- and floating-point operations and the other provides loads/stores and a byte permutation operation as well as branches. Simple fixed-point operations take two cycles, and single-precision floating-point and load instructions take six cycles. Two-way SIMD double-precision floating point is also supported, but the maximum issue rate is one SIMD instruction per seven cycles. All other instructions are fully pipelined.
(from the IBM paper I linked earlier. I included the cycle times for anyone interested)

accomplish the same type of thing for the SPE that two threads do for the XeCPU, as far as latency is concerned?
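
To make the question concrete, here's the kind of pairing I have in mind (a made-up sketch of my own, assuming the Cell SDK's spu_intrinsics.h; the function and names are hypothetical):

#include <spu_intrinsics.h>

/* Software-pipelined SPU loop: each iteration pairs an even-pipe
   multiply-add with an odd-pipe load for a future iteration, so the
   6-cycle latencies of both are covered by independent work rather
   than by a second hardware thread. Assumes n >= 1. */
void scale(vec_float4 *dst, const vec_float4 *src, int n)
{
    vec_float4 k  = spu_splats(2.0f);  /* constant kept in a register      */
    vec_float4 in = src[0];            /* prologue: first load in flight   */
    for (int i = 0; i < n - 1; i++) {
        vec_float4 next = src[i + 1];  /* odd pipe: load for iteration i+1 */
        dst[i] = spu_madd(in, k, in);  /* even pipe: FMA for iteration i   */
        in = next;
    }
    dst[n - 1] = spu_madd(in, k, in);  /* epilogue */
}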
 
So how do Intel intend to get 256 KB of 1-cycle cache working? As I understood it, the larger the cache, the slower the access, so increasing size while reducing latency would be an amazing achievement.
 
So how do Intel intend to get 256 KB of 1-cycle cache working? As I understood it, the larger the cache, the slower the access, so increasing size while reducing latency would be an amazing achievement.

They can test different parts of architectures, can't they? (Sorry if I sound ignorant)

So does that mean they can be confident ahead of time in getting it working?
 
So how do Intel intend to get 256 KB of 1-cycle cache working? As I understood it, the larger the cache, the slower the access, so increasing size while reducing latency would be an amazing achievement.
Probably static vs dynamic RAM? Static RAM takes way more space, though.
 
Not the same thing. You can think of the PPU (or a Xenon core) as *two* 1.6GHz processors with the execution latencies halved.

Cheers

Why would the speed be cut in half? Cell's PPE has almost everything, including registers, duplicated. I would assume the Xenon core would have at least some duplication also. I could see it if they were competing for resources, but that's not supposed to be the case with the PPE, at least as I understand it.

As for the SPE, it has two instruction slots, with restrictions on the types of instructions that can be dual-issued, but a 6-cycle SP float and a 6-cycle load, which can be dual-issued, should start at the same time and finish at the same time, unless I'm misunderstanding something.
 
I'm surprised at the amount of Cell and Xenon talk in this thread, yet there's no mention of Xenos.
After all, it looks like Xenos might be the smartest thing MS has done this gen (assuming it doesn't have much to do with RRoD).

So how do Intel intend to get 256 KB of 1-cycle cache working?
That would be a very, very significant achievement even at a reduced clock rate, but I thought the 256 KB is the L2, which should most definitely have more than 1 cycle of latency.
 
Why would the speed be cut in half? Cell's PPE has almost everything, including registers, duplicated. I would assume the Xenon core would have at least some duplication also. I could see it if they were competing for resources, but that's not supposed to be the case with the PPE, at least as I understand it.

Instruction issue isn't duplicated. The two threads will issue in alternating cycles. This means that context_0 will issue instructions on cycles 0, 2, 4; context_1 will issue on cycles 1, 3, 5.

From context_0's POV you have a 1.6GHz dual-issue CPU core. Now look at instruction latencies: say you have a load that hits the data cache, with a 6-cycle load-to-use latency. You (or your compiler) need to find something useful to do for the next 6 cycles, or stall. Six cycles equals up to 12 instructions when you run the CPU in single-thread mode. When you run the PPE or a Xenon core with two contexts, the same 6 cycles equals 6 instructions, because each context only gets to issue instructions every other cycle. From a single context's POV you have just halved all execution latencies and thus have an easier job scheduling instructions.
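
To put numbers on it (illustrative, using the 6-cycle load-to-use latency above):

single-thread mode: context_0 issues on cycles 0,1,2,3,4,5 -> 6 issue cycles x 2-wide = up to 12 instructions to find
two contexts:       context_0 issues on cycles 0,2,4 only  -> 3 issue cycles x 2-wide = up to 6 instructions to find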

As for the SPE, it has two instruction slots, with restrictions on the types of instructions that can be dual-issued, but a 6-cycle SP float and a 6-cycle load, which can be dual-issued, should start at the same time and finish at the same time, unless I'm misunderstanding something.

Correct, but not related to the above.


Cheers
 
So how do Intel intend to get 256 KB of 1-cycle cache working? As I understood it, the larger the cache, the slower the access, so increasing size while reducing latency would be an amazing achievement.

Nobody talked about a one-cycle 256KB L1 cache. There was mention of a single cycle D$ and a 256KB private L2 cache.

Personally I'd be surprised if it had a single cycle cache.

Cheers
 
Nobody talked about a one-cycle 256KB L1 cache. There was mention of a single cycle D$ and a 256KB private L2 cache.

Personally I'd be surprised if it had a single cycle cache.

The L1 caches (probably 32KB) will likely have a cycle or two of extra latency on Larrabee. We're not going to see anything like the 6-cycle latencies in the 3.2 GHz Cell and Xenon processors. Part of the reason is 90nm vs 45nm, and Larrabee isn't even expected to be running at 3.2 GHz. So, between two process generations (90nm->65nm->45nm) and a slower clock, a few cycles of latency will be reasonable.

Larrabee's 256KB private (per-core) L2 cache is reported to have something like a 10-cycle latency (which actually seems a little slow to me... perhaps they made it slower to reduce the power it takes to access it?).
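
As a sanity check on the arithmetic (the 2 GHz figure below is just an assumed clock for illustration, not a known spec):

At 3.2 GHz one cycle is 1/3.2 ≈ 0.31 ns, so a 6-cycle L1 hit is ≈ 1.9 ns.
At an assumed 2 GHz one cycle is 0.5 ns, so even an unchanged 1.9 ns array would need only ~4 cycles, and a 45nm array should be faster than a 90nm one in absolute terms anyway.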
 
Instruction issue isn't duplicated. The two threads will issue in alternating cycles. This means that context_0 will issue instructions on cycles 0, 2, 4; context_1 will issue on cycles 1, 3, 5.

From context_0's POV you have a 1.6GHz dual-issue CPU core. Now look at instruction latencies: say you have a load that hits the data cache, with a 6-cycle load-to-use latency. You (or your compiler) need to find something useful to do for the next 6 cycles, or stall. Six cycles equals up to 12 instructions when you run the CPU in single-thread mode. When you run the PPE or a Xenon core with two contexts, the same 6 cycles equals 6 instructions, because each context only gets to issue instructions every other cycle. From a single context's POV you have just halved all execution latencies and thus have an easier job scheduling instructions.
Ah, that makes sense. Thanks for the explanation.

Correct, but not related to the above.


Cheers
While I understand that the limited dual-issue nature of the SPE is not the same as two threads, I meant that it helps the SPE avoid lost clock cycles, which is what I thought ArchitectureProfessor meant when he was talking about hiding some of the latency.
 
From the figure, that would give a die size of 185 mm^2 for 165 million transistors. That seems in the ballpark. Cell is 221 mm^2 for 234 million transistors in the same process.

Let me correct my earlier estimate: according to the Microprocessor Report article (Oct 2005), the XeCPU has 165 million transistors and a 168 mm^2 die (not the 185 mm^2 I estimated above).

That means it is almost as dense (in terms of transistors per mm^2) as Cell. It also means that Cell's die is 30% larger. For a similar die area, the XeCPU could have increased the number of CPUs from three to four (and added a correspondingly larger L2 cache).
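
To show the math behind those numbers:

XeCPU: 165M transistors / 168 mm^2 ≈ 0.98M per mm^2
Cell:  234M transistors / 221 mm^2 ≈ 1.06M per mm^2
Die ratio: 221 / 168 ≈ 1.32, i.e. Cell's die is roughly 30% larger.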
 
For a similar die area, the XeCPU could have increased the number of CPUs from three to four (and added a correspondingly larger L2 cache).

Perhaps, but considering the rushed nature of the 360's launch schedule, a much larger die might well have hampered worldwide unit availability even more.
 
Perhaps, but considering the rushed nature of the 360's launch schedule, a much larger die might well have hampered worldwide unit availability even more.

I'm not saying the smaller die was a mistake. If anything, perhaps Cell was too big. ;)

The smaller die was also probably cheaper than Cell's, just one factor that contributed to the cost difference between the two systems.

My only reason for bringing this up is that if you want an equal-resource comparison between Cell and the XeCPU, you would need to shrink Cell or add another core to the XeCPU. With an additional core, the XeCPU would be much more competitive with Cell in raw FLOPS (though Cell would still have more) and with a simpler programming model.

I would say that a XeCPU with five cores would be pretty competitive with Cell. Each XeCPU core has two-way vector execution (whereas I think the Cell SPEs can only do one vector op per cycle). The XeCPU cores have two threads, helping hide pipeline bubbles and memory latency (and 32 more general-purpose registers). Perhaps four XeCPU cores are roughly comparable to eight of Cell's SPEs? The fifth XeCPU core would match up one-to-one with the PowerPC core on Cell, with the XeCPU having a slight edge because it has 128 SIMD registers plus 32 scalar registers versus the Cell PowerPC core's far fewer registers.

Of course, five XeCPU cores would be larger than Cell...
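
For what it's worth, here is the peak-FLOPS arithmetic under those assumptions (paper peaks only, and the two-vector-ops-per-cycle figure for the XeCPU is my assumption above, not a verified spec):

One 4-wide SP FMA pipe at 3.2 GHz: 4 lanes x 2 flops x 3.2 GHz = 25.6 GFLOPS
One SPE (one such pipe): 25.6 GFLOPS; eight SPEs: 204.8 GFLOPS
A XeCPU core issuing two vector ops/cycle: 51.2 GFLOPS; four cores: 204.8 GFLOPS

So under those assumptions, four XeCPU cores would indeed match eight SPEs on paper.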
 
I'm not saying the smaller die was a mistake. If anything, perhaps Cell was too big. ;)

The smaller die was also probably cheaper than Cell's, just one factor that contributed to the cost difference between the two systems.

My only reason for bringing this up is that if you want an equal-resource comparison between Cell and the XeCPU, you would need to shrink Cell or add another core to the XeCPU. With an additional core, the XeCPU would be much more competitive with Cell in raw FLOPS (though Cell would still have more) and with a simpler programming model.

I would say that a XeCPU with five cores would be pretty competitive with Cell. Each XeCPU core has two-way vector execution (whereas I think the Cell SPEs can only do one vector op per cycle). The XeCPU cores have two threads, helping hide pipeline bubbles and memory latency (and 32 more general-purpose registers). Perhaps four XeCPU cores are roughly comparable to eight of Cell's SPEs? The fifth XeCPU core would match up one-to-one with the PowerPC core on Cell, with the XeCPU having a slight edge because it has 128 SIMD registers plus 32 scalar registers versus the Cell PowerPC core's far fewer registers.

Of course, five XeCPU cores would be larger than Cell...
You can't just slap more cores on that CPU given that they share the L2 cache; it would scale badly.
BTW, SPUs can execute up to two instructions per clock cycle, and on decently written code a single SPU runs circles around a XeCPU core at any time of the day.
 
You can't just slap more cores on that CPU given that they share the L2 cache; it would scale badly.
BTW, SPUs can execute up to two instructions per clock cycle, and on decently written code a single SPU runs circles around a XeCPU core at any time of the day.

And there is Cell's greatest advantage: relatively tiny cores compared to a full-fledged core. Microsoft has to be considering how the brute-force approach just does not scale so well.

Perhaps with an extended number of years for this generation, they can more properly fix the problems with Xenon and figure out other ways of extending "CPU power". IIRC, devs are using extra threads for decompression or audio. Why not develop mini-cores that deal with those directly? Hardware decompression for those advanced compression schemes (done in software today, as mentioned in the Gamefest 2007 slides)... or a customized CPU for audio (or is that essentially an SPE)?
 
Perhaps with an extended number of years for this generation, they can more properly fix the problems with Xenon and figure out other ways of extending "CPU power". IIRC, devs are using extra threads for decompression or audio. Why not develop mini-cores that deal with those directly? Hardware decompression for those advanced compression schemes (done in software today, as mentioned in the Gamefest 2007 slides)... or a customized CPU for audio (or is that essentially an SPE)?
360 has a hardware WMA Pro audio decoder.
 
360 has a hardware WMA Pro audio decoder.

But they're still using a thread on Xenon for audio work. I'm not talking about just decoding a particular format. As the SPEs are proving, you don't need a full PPE with VMX128 enhancements et al. to do audio calculations.
 
The smaller die was also probably cheaper than Cell's, just one factor that contributed to the cost difference between the two systems.
Having a redundant, reconfigurable SPU may very well offset the cost of the larger die, depending of course on the defect distribution.

You can't just slap more cores on that CPU given that they share the L2 cache; it would scale badly.
It's not only diminishing returns; it would probably not even work at the same clock speed (with the same latencies). The same goes for an increased cache size.
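
As a very rough rule of thumb (the kind of scaling CACTI-style cache models suggest), array access time grows on the order of the square root of capacity, since wordlines and bitlines get longer as the array grows:

t_access ∝ sqrt(C)  (very roughly)

So quadrupling a cache would roughly double its access time, other things being equal; to keep the same cycle count you'd have to drop the clock.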
 