Don't you remember the bandwidth slide? It showed RSX having fairly full access to both RAM pools, and Cell having limited access to GDDR.
If the backbuffer were stored in XDR, Cell could read it in and write a processed buffer to GDDR for output (assuming the front buffer is in GDDR here). 1080p@60 Hz consumes c. 360 MB/s of bandwidth, so the 4 GB/s Cell>>GDDR link can easily accommodate that. Although, how does that figure affect RSX? If Cell were to write to GDDR at 4 GB/s, would that consume all of the 22 GB/s bandwidth and freeze RSX out, or would it consume 4 GB/s and leave 18 GB/s for RSX?
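For anyone who wants to check the arithmetic, here is a rough sketch. The pixel sizes are my assumption (the slide doesn't say): the c. 360 MB/s figure matches 24-bit colour counted in binary megabytes, while a 32-bit buffer comes out nearer 475 MB/s. Either way it is a small slice of 4 GB/s. The function name is just for illustration.

```python
# Rough framebuffer bandwidth estimate. Pixel formats are assumptions,
# not something stated on the bandwidth slide.
MIB = 1024 ** 2  # binary megabyte

def framebuffer_bandwidth(width, height, bytes_per_pixel, fps):
    """Bytes per second needed to move one full frame every refresh."""
    return width * height * bytes_per_pixel * fps

bw_24bit = framebuffer_bandwidth(1920, 1080, 3, 60)  # 24-bit colour
bw_32bit = framebuffer_bandwidth(1920, 1080, 4, 60)  # 32-bit colour

CELL_TO_GDDR = 4e9  # 4 GB/s Cell write figure quoted above

print(f"24-bit: {bw_24bit / MIB:.0f} MiB/s")
print(f"32-bit: {bw_32bit / MIB:.0f} MiB/s")
print(f"share of Cell>>GDDR link (32-bit): {bw_32bit / CELL_TO_GDDR:.1%}")
```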
Actually, I don't recall if I've seen that picture, but I'm happy to see it now, it helps, thanks.
Let's see. So RSX can really use the XDR almost as well as its own GDDR3. It's pretty hard to figure out what the optimum use of all this might be, but I'm going to make a (note: MY) first guess (I'm expecting to be corrected/slapped for most of the stuff I'll write next):
- 4 GB/s from Cell to RSX. This seems to me most useful for streaming in textures and vertex data. It would be more efficient to stream this into the appropriate RSX buffers, as this would still keep RSX in control and probably not interfere with the RSX to GDDR3 bandwidth too much. Because of the nature of GDDR3, efficient use of its bandwidth is best left to one controlling device, whereas XDR is very much optimised for shared access (correct?)
This seems to indicate that the main location for storing textures is GDDR3, which obviously makes sense, but the Cell can update the textures in GDDR3 memory at a fair pace. However, the RSX itself could read in textures from XDR memory on its own at a much higher bandwidth still, nearly four times as fast, in fact. The main advantage of Cell being able to write at 4 GB/s to RSX would therefore seem to apply if the above is indeed the case, i.e. the Cell can stream data into certain RSX buffers without directly taxing the GDDR3. Again, maybe Cell-generated vertex and texture data ...
- The 16 MB/s Cell read from GDDR3 is probably mostly intended for messaging / debugging / monitoring purposes, and may not be used at all in most instances (?)
- RSX and Cell can read equally well from XDR memory. This seems to be plenty fast, to the point where there's hardly a difference between RSX accessing its own local memory and accessing main XDR memory. Presumably there may be a difference in latency, though, and obviously if the RSX accesses GDDR3 memory, this should leave more bandwidth for the Cell to play around with the XDR memory in parallel, and vice versa.
- RSX can write quite fast to XDR memory too (10 GB/s), though not as fast as Cell (24.9 GB/s; I'm using measured speeds for now).
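To put the figures in the list above in perspective, here is a quick sketch of how long it would take to push one frame-sized buffer over each link. The 1080p 32-bit buffer is my assumption, and the link names are just labels I made up; the rates are the ones quoted in this thread. It ignores latency and contention entirely, so treat it as a best case.

```python
# Time to move one frame-sized buffer over each link, best case.
# FRAME_BYTES assumes a 1080p, 32-bit buffer (assumption, not from the slide).
FRAME_BYTES = 1920 * 1080 * 4
FRAME_BUDGET_MS = 1000 / 60  # time available per frame at 60 Hz

# Rates in GB/s as quoted in this thread (measured figures where given).
LINKS_GB_PER_S = {
    "Cell write to GDDR3": 4.0,
    "Cell read from GDDR3": 0.016,  # the 16 MB/s path
    "RSX write to XDR": 10.0,
    "Cell write to XDR": 24.9,
}

def ms_per_frame(gb_per_s, frame_bytes=FRAME_BYTES):
    """Milliseconds to transfer frame_bytes at gb_per_s (1 GB = 1e9 bytes)."""
    return frame_bytes / (gb_per_s * 1e9) * 1e3

for name, rate in LINKS_GB_PER_S.items():
    ms = ms_per_frame(rate)
    verdict = "fits" if ms < FRAME_BUDGET_MS else "does NOT fit"
    print(f"{name:22s} {ms:8.2f} ms  ({verdict} in a 60 Hz frame)")
```

The 16 MB/s read path needs about half a second for a single frame, which is why I think it can only be meant for small status or debug reads.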
So, to summarise the basic Rendering Pipeline:
0. Cell pre-processes vertex data (animations, decompression, etc.) and textures (decompression or conversion to the compressed format that RSX likes, maybe generate textures from scratch, modify them to make them darker or add shadow, etc.) and sends them to RSX (Cell read from XDR, perhaps write to XDR, then write to RSX)
1. RSX renders a scene to GDDR3 framebuffer (RSX write to GDDR3 memory)
2. RSX copies the framebuffer from GDDR3 to XDR (read GDDR3, write to XDR memory)
3. Cell post-processes the scene into an XDR framebuffer (Cell read/write XDR memory)
4. RSX copies the framebuffer to GDDR3 memory (RSX read from XDR memory, write to GDDR3 memory)
5. RSX displays the newly read framebuffer (adjust vram pointer with correct v-sync timing)
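As a sanity check on steps 1 to 5, here is a small sketch that counts how many frame-sized passes each memory pool sees per frame and converts that to sustained bandwidth. The assumptions are all mine: a 1080p 32-bit buffer at 60 Hz, rendering counted as a single write pass (no overdraw, no texture or Z traffic), and scan-out counted as one read. Step 0 is left out because its data sizes are unknown.

```python
from collections import Counter

FRAME_BYTES = 1920 * 1080 * 4  # assumed 1080p, 32-bit framebuffer
FPS = 60

# (step, unit, pools read, pools written), transcribed from steps 1-5 above.
STEPS = [
    (1, "RSX", [], ["GDDR3"]),        # render (counted as one write pass)
    (2, "RSX", ["GDDR3"], ["XDR"]),   # copy out to main memory
    (3, "Cell", ["XDR"], ["XDR"]),    # post-process
    (4, "RSX", ["XDR"], ["GDDR3"]),   # copy back
    (5, "RSX", ["GDDR3"], []),        # scan-out (counted as one read)
]

def passes_per_pool(steps):
    """Count frame-sized reads plus writes hitting each memory pool."""
    counts = Counter()
    for _step, _unit, reads, writes in steps:
        for pool in reads + writes:
            counts[pool] += 1
    return counts

def traffic_gb_per_s(steps, frame_bytes=FRAME_BYTES, fps=FPS):
    """Sustained framebuffer traffic per pool in GB/s (1 GB = 1e9 bytes)."""
    return {pool: n * frame_bytes * fps / 1e9
            for pool, n in passes_per_pool(steps).items()}

for pool, gbs in sorted(traffic_gb_per_s(STEPS).items()):
    print(f"{pool}: {gbs:.2f} GB/s of framebuffer traffic")
```

Under those assumptions the round trip costs roughly 2 GB/s on each pool, which looks affordable next to the figures on the slide, though real rendering traffic in step 1 would be far higher than one write pass.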
And I'm assuming that some of these may not have to wait for each other either ... I expect some reads and writes to overlap.
I also have a few questions. For instance, could the RSX render a scene directly to XDR, and would that be beneficial?
I'm also not sure yet what kind of streaming will happen where. Right now, we only have information about direct memory access, but we don't know how directly we can connect streams of data from, say, an SPE to RSX buffers. Maybe these fall under the 4 GB/s to GDDR3 memory?
Certainly there is a lot of stuff to play around with here, because a game could also almost exclusively use the RSX and GDDR3 memory to render, leaving the Cell out of it almost completely (just basic AI and main loop stuff).
So all in all I can see that there are very many different ways to set up a render pipeline, and then there are all the different programming models for the SPEs too. I'm starting to understand why figuring out the best way to use the Cell isn't all that obvious from day one.
Am I on the right track?