<delurk>
OK, let's look at some actual numbers.
Grab your copy of the CBEA Programming Handbook (Version 1.11), if you want to read along.
Section 6.1.2 gives us an overview of the PPE caches. There's a separate L1 I$ and D$, 32KB each. If you look at figure 6-1, you can see the entire setup, but the important part to notice is that the two SMT threads share all caches. What this effectively means is that for the purpose of this comparison, one PPE should be assumed to execute one thread. SMT doesn't help when you're going for peak performance. Both SMT threads can share the data in the caches, so if both threads run the same code, at least the I$ contention can be minimized. But again, not that interesting.
If you look at 6.1.3.1, you'll notice that the L1 I$ contents do not need to be in L2, so this gives the PPE the ability to use the entire L2 as a D$, as long as your code fits into the I$. This is actually really nice.
Of course, caches suffer from aliasing, which you can look up in sections 6.1.3.5 and 6.1.3.10. I'll be ignoring this.
The L2 is 512KB.
So how fast are these? Sadly, there is no official information out there that I can find. You can google around a bit and find some ball-park numbers, but it's not really super important, as you'll see in a bit.
Also, if you enjoy these things, look at section 6.2, so the next time someone tells you that the SPE doesn't have caches, you can make them angry at you.
A.3.2. gives us the latencies and throughput of the VMX32 instructions. The latency of an instruction depends on what unit processes it, so we have VXU load/store (memory access, 2 cycles), permute (4 cycles), simple (integer stuff, 4 cycles), FPU (single-precision floating point, 12 cycles), estimate (single precision floating point estimator for rcp and rsqrt, 14 cycles) and complex (integer multiplication, 9 cycles). All of these are well behaved and have a throughput of 1 instruction per cycle.
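To see why I keep talking about latency and not just throughput (more on this below), here's a minimal sketch assuming the usual altivec.h intrinsics; the polynomial evaluation itself is made up for illustration. Every vec_madd feeds the next one, so on the 12-cycle FPU pipeline each iteration sits out the full latency even though the unit could accept a new instruction every cycle:

    #include <altivec.h>

    /* Horner-style polynomial evaluation: every vec_madd depends on the
       previous result, so the PPE's 12-cycle FPU latency is paid on every
       single iteration. */
    vector float poly_eval(vector float x, const vector float *coeff, int n)
    {
        vector float acc = coeff[0];
        for (int i = 1; i < n; ++i)
            acc = vec_madd(acc, x, coeff[i]);   /* serial dependency on acc */
        return acc;
    }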
The register file is 32 128b entries large, or 512B.
Also note that VXU load/store is coupled with the LSU, which has implications for dual issue.
As it turns out, the PPE actually does support dual issue, as defined in A.5. But it's complicated. There is this great figure A-1, which explains the rules. So if you have a VSU Type 1 instruction, which is all the math stuff, in slot 0, you can get a Type 2 in slot 1. Cool stuff. You can even dual-issue scalar integer stuff with VMX, which the SPU can't. No dual-issue of FPU and VMX, however.
Now let's compare that to the SPU.
The Synergistic Processing Element has 256KB of Local Store, an SPU with 128x128b registers (that's 2KB) and 4 execution units (+ change), and a Memory Flow Controller, which is the funky DMA unit.
Figure 3-2 gives you a nice idea how the SPU is set up. The important part is that every cycle, you can execute an instruction in both the even and odd pipelines (or EVEN and ODD, as they are usually referred to). If this reminds you of the PPU earlier, you wouldn't be too far off the mark.
I'd also recommend looking at section 3.2.4.2 about DMA lists, which should make clear that the MFC is actually quite powerful and can take a lot of work off the SPU's shoulders. SPU beginners often fail to exploit the MFC fully, as it's not really something you have in a regular CPU. You can often lay out your data in a way that the MFC will be significantly more efficient than any prefetcher could ever be, simply because you have fine-grained control. There will never be speculative fetching that wastes bandwidth and LS, for example.
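To make the DMA list point a bit more concrete, here's a rough sketch assuming the mfc_getl interface and the mfc_list_element_t layout from spu_mfcio.h; the block count, sizes and the gather function itself are invented for the example:

    #include <spu_mfcio.h>

    #define NBLOCKS 8
    #define BLKSZ   (16 * 1024)              /* 16KB, the per-element maximum */

    static mfc_list_element_t list[NBLOCKS] __attribute__((aligned(8)));
    static char buf[NBLOCKS * BLKSZ]        __attribute__((aligned(128)));

    /* Gather NBLOCKS scattered blocks from main memory with one list DMA.
       Once queued, the MFC walks the list on its own while the SPU keeps
       computing. */
    void gather(uint64_t ea_base, const uint32_t offset[NBLOCKS], unsigned tag)
    {
        for (int i = 0; i < NBLOCKS; ++i) {
            list[i].notify = 0;
            list[i].size   = BLKSZ;
            list[i].eal    = (uint32_t)ea_base + offset[i]; /* low 32 bits of the EA */
        }

        /* the high 32 bits of the effective address come from the ea argument */
        mfc_getl(buf, ea_base, list, sizeof(list), tag, 0, 0);

        /* ... unrelated work goes here ... */

        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();           /* block until the whole list has landed */
    }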
But well, let's jump to table B-2, because this is getting ridiculously long.
Look at those latencies and stalls. Notice something? Most of the instructions that are “simple” on the PPE are 2 cycles now instead of 4 and all that “float” stuff is mostly 6 cycles instead of 12. Estimates are a lot faster. The lq* and stq* instructions are listed as 6 cycles, which is literally how long it takes to get the data out of local store.
Now that we've seen some numbers, what does that mean?
Starting from the top: Why do you care about instruction latencies and not just throughput? In terms of throughput the two PEs are pretty much the same, after all.
This comes down to the dependencies and critical path of your computation. You need to wait the number of cycles given by the instruction latency before you can use the result of that instruction. If you don't want to stall the chip, you will need to have other work to do in that time. And this basically means having more than one computation in flight at a time and interleaving those. Of course, to do that you need to be able to store the data for all those computations, which first and foremost means you need more register space the higher your latencies are. So the SPUs have 4 times the registers and half the latency. That's a pretty major advantage. Just imagine the SPU only had 16 registers: with half the latency, that would already be roughly equivalent to the VMX and its 32.
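And to show what interleaving looks like in practice, here's a minimal sketch assuming the spu_intrinsics.h intrinsics (spu_madd, spu_add, spu_splats, spu_extract); the dot product is just a stand-in. Four independent accumulators cost four live registers instead of one, but they cover most of the ~6-cycle float latency with the other chains; six chains would be enough to fully cover it, and the register file has plenty of room for that:

    #include <spu_intrinsics.h>

    /* Sum of products over n quadwords (n assumed to be a multiple of 4).
       The four accumulators are independent chains, so the SPU doesn't have
       to wait for one madd to finish before starting the next. */
    float dot4(const vector float *a, const vector float *b, int n)
    {
        vector float acc0 = spu_splats(0.0f);
        vector float acc1 = spu_splats(0.0f);
        vector float acc2 = spu_splats(0.0f);
        vector float acc3 = spu_splats(0.0f);

        for (int i = 0; i < n; i += 4) {
            acc0 = spu_madd(a[i+0], b[i+0], acc0);
            acc1 = spu_madd(a[i+1], b[i+1], acc1);
            acc2 = spu_madd(a[i+2], b[i+2], acc2);
            acc3 = spu_madd(a[i+3], b[i+3], acc3);
        }

        vector float acc = spu_add(spu_add(acc0, acc1), spu_add(acc2, acc3));
        return spu_extract(acc, 0) + spu_extract(acc, 1)
             + spu_extract(acc, 2) + spu_extract(acc, 3);
    }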
You could make an argument that for a lot of code the SPU's register file is a bit overkill, but I'd disagree with that for the general case.
Then how do local store and the cache hierarchy compare? With less register space, we have fewer opportunities to hide memory accesses with computation. Let's assume that the 2-cycle latency we saw for VXU load/store is the actual guaranteed time it takes to load data into the VMX registers from L1 D$. That would mean we have 32KB which is 3 times faster than the 256KB of LS. To illustrate what that means, let's normalize those numbers. 32KB/2cycles is 16KB/cycle and 256KB/6cycles is 42.7KB/cycle. We learned earlier that the L1 I$ is independent, and the SPU's instructions reside in local store, so we either pretend we have twice the L1 cache on the PPU (giving 32KB/cycle for the PPU) or that we only have 224KB of LS (37.3KB/cycle for the SPU). And that's assuming you actually use all 32KB of I$. In either case, the SPU wins here before we even factor in register set size.
You can do the same comparison with L2 if you have the numbers, but let's just say that L2 is twice the size of LS and more than twice the latency.
So what good is the bytes/cycle metric? It's basically an estimate of how many simultaneous data instances you can have in the system at any given level. If it takes twice as long for new data to arrive, I'll need twice as much data readily available to prevent stalls.
This of course means that having a higher value further down the memory hierarchy is only useful if you can support that many instances at higher levels. This is the entire point of a cache hierarchy.
Using this metric, the SPUs shine with their low-latency data pipes and large register file.
When doing post-processing, a data instance can be either a pixel or a scanline or a tile. If it's a pixel, chances are a PPE can compete with an SPE (on a one-to-one basis, not one PPE vs. 6 SPEs). If it's a scanline, this becomes a lot harder. The reason for this is simple: If I'm tight on memory, the SPE will need less than half the memory that the PPE needs to stay saturated. So the SPE has a much better chance of not needing to hit main memory more than once per scanline. Once you need to loop the data through main memory, you are consuming precious main memory bandwidth (of which the PS3 has plenty, but not nearly as much as the aggregate LS bandwidth), which is bad for many reasons.
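For the scanline case, the usual shape is a double buffer. Here's a rough sketch assuming the mfc_get/mfc_getf/mfc_put calls from spu_mfcio.h; the line width, the process_line callback and the tag assignment are all invented, and the effective address is assumed to be 128-byte aligned:

    #include <spu_mfcio.h>

    #define LINE_BYTES (1280 * 4)            /* a hypothetical 1280-pixel RGBA scanline */

    static char buf[2][LINE_BYTES] __attribute__((aligned(128)));

    extern void process_line(char *line);    /* the actual pixel work */

    /* Double-buffered scanline pass: while line i is processed out of LS, the
       MFC is already fetching line i+1 and draining line i-1, so every line
       crosses main memory exactly once in and once out. */
    void process_image(uint64_t ea, int lines)
    {
        int cur = 0;
        mfc_get(buf[cur], ea, LINE_BYTES, cur, 0, 0);

        for (int i = 0; i < lines; ++i) {
            int next = cur ^ 1;
            if (i + 1 < lines)
                /* fenced get: ordered after the put still pending on this tag,
                   so we never overwrite a buffer that hasn't drained yet */
                mfc_getf(buf[next], ea + (uint64_t)(i + 1) * LINE_BYTES,
                         LINE_BYTES, next, 0, 0);

            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();       /* wait for the current line to arrive */

            process_line(buf[cur]);

            mfc_put(buf[cur], ea + (uint64_t)i * LINE_BYTES, LINE_BYTES, cur, 0, 0);
            cur = next;
        }

        mfc_write_tag_mask(3);               /* drain whatever is still in flight */
        mfc_read_tag_status_all();
    }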
</delurk>