deathkiller said:
They hide latency/prevent stalling using more of the shared cache, so I think they shouldn't be ignored when counting the cache.
Shared cache also has lower size efficiency because of aliasing.
In this case, the cores/hardware threads in question would be sharing and hitting the locked (part of the) cache. I assume the remaining cache continues to serve its purpose, although the now-smaller cache will have a lower hit rate. Isn't this correct?
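To put very rough numbers on that intuition (everything below is a made-up placeholder, not a measured Xenon figure), the effect of losing part of the 1 MB shared L2 to a locked region is just the usual average-access-time arithmetic:

```c
/* Back-of-the-envelope effect of locking part of a shared L2.
 * All numbers are illustrative guesses, not measured Xenon figures. */
#include <stdio.h>

static double avg_access_cycles(double hit_rate, double hit_cycles, double miss_cycles)
{
    /* average access time = hit time + miss rate * miss penalty */
    return hit_cycles + (1.0 - hit_rate) * miss_cycles;
}

int main(void)
{
    const double l2_hit   = 6.0;    /* ~6-cycle L2 hit, per the quote below */
    const double mem_miss = 500.0;  /* guessed main-memory penalty          */

    /* Guessed hit rates: full 1 MB L2 vs. the 768 KB left after locking 256 KB. */
    printf("full L2   : %.1f cycles/access\n", avg_access_cycles(0.95, l2_hit, mem_miss));
    printf("768 KB L2 : %.1f cycles/access\n", avg_access_cycles(0.90, l2_hit, mem_miss));
    return 0;
}
```

Even a few points of hit rate lost to the locked region show up as a noticeable jump in average latency, which is all I'm handwaving at here.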
I do agree that since Cell is built around the LS model, it should be more performant... like the "fast LS vs cache management overhead" issue below:
Shifty Geezer said:
L2 cache is slower than the LS, I think in the order of 50% slower (6 cycles versus LS 4 cycles).
and
Jesus2006 said:
But still, how does the memory management work? What does a locked cache do? When you write data to it, will it write back to main memory immediately, or do you have control over when to write back and when to read from it? Because every write-back will make performance drop significantly, since an SPE works on its own LS until the task is finished and then DMAs the results back under user control. I thought that locked cache only guarantees an amount of cache being reserved for a specific thread, but nothing else (like addressability etc.).
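For reference, the SPE-side flow described there (pull data into the LS, work locally, push results back only when the program says so) looks roughly like this. This is a from-memory sketch of the Cell SDK's spu_mfcio.h intrinsics; process(), ea_in, ea_out and BUF_SIZE are made-up names for illustration:

```c
/* Rough SPE-side sketch of the "DMA in, compute in LS, DMA back" flow.
 * spu_mfcio.h is the SPU-side MFC interface from the Cell SDK;
 * process(), ea_in, ea_out and BUF_SIZE are hypothetical names. */
#include <spu_mfcio.h>

#define BUF_SIZE 16384                        /* 16 KB chunk, 128-byte aligned */
static char buf[BUF_SIZE] __attribute__((aligned(128)));

extern void process(void *data, unsigned int size);   /* placeholder kernel */

void run_job(unsigned long long ea_in, unsigned long long ea_out)
{
    const unsigned int tag = 0;

    /* Pull the input from main memory (effective address) into the LS. */
    mfc_get(buf, ea_in, BUF_SIZE, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();                /* block until the transfer lands */

    process(buf, BUF_SIZE);                   /* work entirely out of the LS */

    /* Results go back only when we decide to -- nothing is written back
     * automatically, unlike a cache. */
    mfc_put(buf, ea_out, BUF_SIZE, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}
```

The key difference from a locked cache, as Jesus2006 describes it, is that here the write-back is entirely under program control.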
Instead of zooming into specific hardware advantages, can we layer the analysis on top of some sort of context (a model or application framework) to get a coherent picture?
Here's me thinking out loud [Remember: I am not a game programmer]:
Assuming the application/game requirements are known beforehand ( ;-) ) ...
(1) (Near) real-time, event-driven run-time forms the foundation. This is where basic things like the game loop, user controls, storage/streaming, memory management, network code, animation, rendering, audio, AI and physics are master-planned, laid out and tied down.
(2) Key resources (e.g., cores, memory) are budgeted for each focus area (e.g., visuals, animation, network) based on experience and requirements, although in the interest of time some may be folded into the PPE until the devs have time to spin them off to another SPE. This also determines the planned worst-case scenarios for individual areas (e.g., how small a time slice are we talking about, how small/big the data structures can be).
(3) Within each focus area, specific techniques are prototyped/developed.
(4) Things are put together and run end-to-end.
(5) More variety of stuff (e.g., weapons) is added.
(6) More optimization follows to hit specific performance targets.
Now, one way to do this on Cell seems to be:
(A) The cooperative game loop and user controls go to the PPE to form the basic skeleton. The basic math library goes to the VMX unit on the PPE. Esoteric math stuff goes into domain-specific SPEs.
(B) Supporting storage and network code go into 1 SPE (although these are rather slow and we may not need compression that much -- thanks to Blu-ray and the HDD -- I assign the work to a separate SPE to free up the PPE, in case other tasks below get thrown back to the PPE because of the 256 KB LS limit).
(C) All sorts of AI and world simulation assistance = 1 SPE
(D) Rendering assistance (?) = 1 SPE
(E) Audio assistance = 1 SPE
We still have 2 SPEs to spare for touch-ups. We can also arrange it so that the 4 SPEs above share some of their workload using a job queue model (a minimal sketch of what I mean follows below). Each of these cores would run at almost full speed.
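By "job queue model" I mean something like the following: the PPE publishes fixed-size job records into a ring in main memory, and whichever SPE is idle claims the next one. This is only a conceptual sketch; the names, sizes and the use of C11 atomics are all assumptions, and on real hardware the SPE side would claim jobs through its atomic DMA facilities rather than plain loads and stores:

```c
/* Conceptual job-queue sketch (not real Cell SDK code): a single producer
 * (the PPE) pushes fixed-size job records into a ring; idle workers claim
 * the next slot. On Cell the workers would be SPEs; C11 atomics are used
 * here just to show the shape of the scheme. */
#include <stdatomic.h>
#include <stdint.h>

#define QUEUE_LEN 256

typedef struct {
    uint32_t kind;        /* which of the (B)-(E) task families this is  */
    uint64_t input_ea;    /* effective address of the input data         */
    uint64_t output_ea;   /* where the worker should put the result      */
    uint32_t size;        /* payload size, kept well under the 256 KB LS */
} Job;

typedef struct {
    Job         jobs[QUEUE_LEN];
    atomic_uint head;     /* next slot the producer will write */
    atomic_uint tail;     /* next slot a worker will claim     */
} JobQueue;

/* Producer (PPE) side: publish a job if there is room. */
int push_job(JobQueue *q, const Job *j)
{
    unsigned int h = atomic_load(&q->head);
    if (h - atomic_load(&q->tail) >= QUEUE_LEN)
        return 0;                              /* queue full, try later */
    q->jobs[h % QUEUE_LEN] = *j;
    atomic_store(&q->head, h + 1);             /* publish after the record is written */
    return 1;
}

/* Worker side: claim the next job, or return 0 if none are pending. */
int pop_job(JobQueue *q, Job *out)
{
    unsigned int t = atomic_load(&q->tail);
    for (;;) {
        if (t == atomic_load(&q->head))
            return 0;                          /* nothing pending right now */
        *out = q->jobs[t % QUEUE_LEN];
        if (atomic_compare_exchange_weak(&q->tail, &t, t + 1))
            return 1;                          /* we won the claim */
        /* another worker got there first; t has been reloaded, retry */
    }
}
```

The point of the model is that the per-SPE assignments above become soft: a rendering-heavy frame can borrow cycles from the AI or audio SPE simply because whoever is idle pulls the next job.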
Now let's take a step back and look at the overhead. Let's say putting 1 SPE to work takes up an additional X% of main memory, Y% of SPE performance, Z% of PPE performance, and W more man-months of development (because we need code to re-organize memory for the LS, or the data structures have to be SPE-friendly to begin with).
For such a 4-SPE arrangement, we would take up 4 * Z% of PPE cycles, 4 * X% of main memory and 4 * W man-months. The worst part is that these overheads may hit the PPE simultaneously if we're not careful (e.g., all 4 SPEs needing the PPE to repack memory at the same time to meet near real-time requirements). For teams that cannot make it work, we may need to scale back the problem size (e.g., use simpler AI), or reassign the work back to the PPE to avoid the 256 KB LS overhead.
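Just to make the bookkeeping concrete, here is the same cost model with numbers plugged in; X, Y, Z and W are whatever your team actually measures, and the values below are pure guesses for illustration:

```c
/* Toy version of the overhead bookkeeping above. Every figure here is an
 * invented placeholder, not a measurement. */
#include <stdio.h>

int main(void)
{
    const int    n_spes     = 4;    /* SPEs actually put to work             */
    const double mem_pct    = 2.0;  /* X: extra main memory per SPE          */
    const double spe_pct    = 5.0;  /* Y: SPE cycles lost to DMA/management  */
    const double ppe_pct    = 3.0;  /* Z: PPE cycles spent feeding one SPE   */
    const double man_months = 0.5;  /* W: extra development effort per SPE   */

    printf("main memory overhead : %.1f%%\n", n_spes * mem_pct);
    printf("PPE overhead         : %.1f%% (worst case: all in the same frame)\n",
           n_spes * ppe_pct);
    printf("per-SPE overhead     : %.1f%% (paid on each SPE, not aggregated)\n", spe_pct);
    printf("extra effort         : %.1f man-months\n", n_spes * man_months);
    return 0;
}
```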
Now, _if_ Cell were a fictitious SMP system, these advanced features would be possible, since the same data structures could be used without restriction. But we would run about 50% slower, borrowing the Xenon shared-cache numbers above and assuming a 100% cache hit rate.
What kind of conclusion can we draw from this simplified example?