Local Store: Possible with a "traditional" cache CPU?

nAo said:
I only have one doubt
Indeed, a PPU with 256KB L1 might actually be a half-decent performer as opposed to what we have now.
If it's supposed to be as easily applicable in practice as the 8 local stores on that same chip, I invite all the resident hw engineers to tell me why this hasn't been used?
Because the actual result we got is console PPC cores that are fundamentally crippled.
 

Because PPC does not stand for Performance Optimization With Enhanced RISC PC, but for Performance through Pain Control: you must learn to cope with difficulties to reach your goal, so the PPE team added some challenges to your learning.

And I thought you were the crazy mad-scientist kind of programmer... tsk... tsk...
 
even though you still need to make an effort and code as if you were working on a local-store-based architecture to extract maximum performance from your cache-based architecture.

This paper is somewhat confusing.

Stream programming is essentially fine-grained parallelism. That implies very strong data locality (maintaining coherency would be too costly).

Having a cache makes life easier, fair enough, but as nAo suggested, the programmer still has to work "as if" it weren't there...

IMO, what the paper is fundamentally about is:
- the continuity between coarse- and fine-grained parallelism: how can we deal with the stuff in the middle?
- the lack of good tools/languages that intuitively lead the average programmer to make better choices

As a side note, "pre-fetching" data in a C program is basically turning a cache into a local store: the programmer takes more control over what's in the cache.
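
For illustration, a minimal sketch of that idea, assuming GCC's __builtin_prefetch (which becomes a dcbt-style touch hint on PPC); the kernel, array names and prefetch distance are all made up for the example:

    #include <stddef.h>

    /* Hypothetical streaming kernel: scale an array, prefetching a fixed
       distance ahead so data is (hopefully) already in cache when touched.
       LINES_AHEAD is a tuning knob, not a guarantee. */
    #define LINES_AHEAD 8                            /* 128-byte lines ahead of the read pointer */
    #define FLOATS_PER_LINE (128 / sizeof(float))

    void scale_stream(float *dst, const float *src, size_t n, float k)
    {
        for (size_t i = 0; i < n; i++) {
            /* hint only: rw = 0 (read), locality = 0 (streaming, low reuse) */
            __builtin_prefetch(src + i + LINES_AHEAD * FLOATS_PER_LINE, 0, 0);
            dst[i] = src[i] * k;
        }
    }

Unlike a real local store, nothing here pins the line: another miss can still evict it before the loop reaches it, which is why you still have to code "as if" the cache weren't automatic.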
 

I agree; they also neglect that the stream programming model can attack the "memory wall" problem more efficiently by double- and triple-buffering coarse-grained data, which means the application can continuously utilise close to the maximum off-chip bandwidth without any stalls. Prefetch instructions do not offer the same kind of scalpel, and write-backs can be organised much better by the streaming approach to save bandwidth.
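
As a rough sketch of that double-buffering pattern, assuming a placeholder asynchronous-copy API (dma_get_async/dma_wait stand in for whatever the platform provides, e.g. the MFC calls on an SPE) and assuming the total size is a multiple of the chunk size:

    #include <stddef.h>

    #define CHUNK 4096                        /* bytes pulled from main memory per block */

    /* Placeholder async-copy API, invented for the example. */
    void dma_get_async(void *local, const char *remote, size_t bytes, int tag);
    void dma_wait(int tag);
    void process(char *block, size_t bytes);  /* the actual compute kernel */

    void stream_process(const char *remote_src, size_t total)
    {
        static char buf[2][CHUNK];            /* the two halves of the double buffer */
        int cur = 0;

        dma_get_async(buf[cur], remote_src, CHUNK, cur);            /* prime the pipeline */

        for (size_t off = CHUNK; off < total; off += CHUNK) {
            int nxt = cur ^ 1;
            dma_get_async(buf[nxt], remote_src + off, CHUNK, nxt);  /* start the next transfer... */
            dma_wait(cur);                                          /* ...wait for this one to land */
            process(buf[cur], CHUNK);                               /* compute overlaps the transfer */
            cur = nxt;
        }
        dma_wait(cur);
        process(buf[cur], CHUNK);             /* drain the last block */
    }

Triple buffering adds a third slot so outgoing write-backs can also overlap, at the cost of more on-chip memory.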

The memory model for their streaming implementation, with 24 kB LS and 8 kB cache, is very different from the SPUs in Cell, which makes it hard to compare them. The fact that they were also multiplexing threads on those cores, which is a no-no on SPEs, makes the results hard to interpret.

The original Xenon and Cell 90 nm implementations were 168 mm² and 235 mm², respectively.
-> ~40% more die area on Cell buys more than 100% more on-chip memory (approx. 1 MB vs. 2.5 MB) and 100-200% more floating-point capacity (depending on which inflated numbers you choose)

Obviously the Cell SPE implementation must have some advantages if you look at die size vs. capacity, but it also has some trade-offs, like the 16-byte block read/write access to the LS, manual branch prediction, etc. I guess a lot of the quirks can be worked around when you are not too dependent on legacy code.

But I am open to their suggestion that "small, potentially incoherent caches in streaming memory could vastly simplify the use of static data structures with abundant temporal locality". Maybe we will see something like that in Cell2, implemented as a cache shared by all SPEs; IIRC there have been some diagrams of Cell suggesting such a cache.
 
Sharing lower-level caches is also hard. Even with CMP, hardware cache coherency is hardly low cost if you allow writable data to be replicated (personally, I would prefer writable data never to be replicated between caches).

I'm not disagreeing with your sentiment that memory coherency isn't free. I just don't think abandoning it completely and relying on software-controlled coherence is justified.

Write-through caches are a dying breed; even Intel abandoned them for the Core 2 microarchitecture. Writing every single store through to the lower-level caches does seem like an extravagant waste of power, especially in x86 land where most stores are to the stack segment (and hence are local to the current context/core).

Reducing the coherency traffic to only broadcasting invalidates on cache state changes (shared -> exclusive/modified) is imperative for massively multicore systems.

Of course you are screwed if you have many cores updating a shared data structure, but that's either really bad programming practice or unavoidable (in which case you'd be equally screwed on a software-controlled, explicit-coherence memory architecture).
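
A toy sketch of the invalidate-only idea, in simplified MESI terms (the types and the broadcast_invalidate hook are invented for illustration; real protocols also handle data responses, write misses, snoop filtering, etc.):

    /* A write to a Shared line broadcasts one invalidate and upgrades to
       Modified; a write to a line already held Exclusive/Modified generates
       no coherency traffic at all. */
    enum state { INVALID, SHARED, EXCLUSIVE, MODIFIED };

    struct line { enum state st; /* tag, data, ... omitted */ };

    extern void broadcast_invalidate(struct line *l);  /* placeholder for the bus/snoop op */

    void on_local_write(struct line *l)
    {
        switch (l->st) {
        case SHARED:                   /* other caches may hold read copies */
            broadcast_invalidate(l);   /* the only traffic this write generates */
            l->st = MODIFIED;
            break;
        case EXCLUSIVE:                /* sole owner: silent upgrade */
        case MODIFIED:
            l->st = MODIFIED;
            break;
        case INVALID:                  /* write miss: read-for-ownership, not shown */
            break;
        }
    }

The pathological case in the paragraph above is exactly the one where many cores keep forcing lines through the SHARED -> MODIFIED transition and the invalidates ping-pong.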

The great advantage of large shared lower-level caches is of course the ease with which you get constructive interference from large read-only data structures. This advantage grows as the number of cores and the amount of cache go up.

Cheers
 
A CPU with a full range of cache-control instructions would be nice, e.g. a cache-block invalidate that worked, a prefetch to L1 that worked, etc. :) Then you wouldn't always need to double up data in L1 & L2.
I suppose having every possible transaction covered would be nice, e.g. evict from L1 to L2 but keep it in L2 (a hint for sharing, to do the job of inter-LS DMA).
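
For reference, a sketch of how the user-level PPC cache-control ops that do exist look from C, via GCC inline assembly (dcbt = touch/prefetch, dcbf = flush, dcbz = zero-allocate); the finer-grained transactions wished for above, such as "evict from L1 but keep in L2", have no user-level encoding that I'm aware of:

    /* Thin wrappers around standard PowerPC cache-block operations.
       Which cache level they actually affect, and how reliably, is
       implementation-defined; that is exactly the complaint above. */
    static inline void cache_touch(const void *p)   /* prefetch the block holding p */
    {
        __asm__ volatile ("dcbt 0,%0" : : "r"(p));
    }

    static inline void cache_flush(const void *p)   /* write back and invalidate the block */
    {
        __asm__ volatile ("dcbf 0,%0" : : "r"(p) : "memory");
    }

    static inline void cache_zero(void *p)          /* allocate and zero the whole block without reading RAM */
    {
        __asm__ volatile ("dcbz 0,%0" : : "r"(p) : "memory");
    }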

A user-controlled LS is nice because you can tailor the code AND a game's content to work around it (e.g. design the game to keep things in the sweet spots!).
Ideal algorithms look different depending on whether they fit in L1 or not, so it's good for the programmer & designers to know how much bandwidth they're taking away from the graphics that consumers care about so much :)
 
As a side note, "pre-fetching" data in a C program is basically turning a cache into a local store: the programmer takes more control over what's in the cache.
In theory yes, with enough outstanding prefetches ... which in practice you often don't have.
 
Indeed, a PPU with 256KB L1 might actually be a half-decent performer as opposed to what we have now.
If it's supposed to be as easily applicable in practice as the 8 local stores on that same chip, I invite all the resident hw engineers to tell me why this hasn't been used?
LS area is roughly half the area of a cache of the same capacity.
A 256K L1D could increase the PPE's area by ~50-60% (25-30% if counted together with the L2 area) and seriously increase latency.
That would mean a major processor redesign, and it is unlikely to give any significant advantage.
 