New Technical Information About RSX???...But In Japanese

Even with 1 SPU disabled and 1 SPU reserved for the OS, it should be 6 SPUs. At least I think that is what blakjedi found strange. But I think the authors of that paper just chose to use 5 SPUs and leave 1 SPU idle to handle other hypothetical code.

I often see Xenon resource consumption referred to as "20% of one core" or some other % of a core. However, with Cell, the standard seems to be reserving whole SPUs for similar tasks. Can a % of an SPU be used, or is the design such that you need to dedicate one to things like audio, decompression, game code, etc.?
 
I think the cordoning off of an entire SPU has as much to do with running the OS in a 'sandboxed,' secure environment as it has to do with anything else. Security was actually a central feature of the Cell BE architecture when it was being developed, if I recall.
 
Short answer is: Yes, you can "multi-task" on an SPU. It depends on how you write the SPU software/kernel. I think DeanoC mentioned a little bit (just a little) about this when he was doing his crowd AI. As long as you can keep the SPU busy while waiting for memory... and also if you have enough space in the Local Store, anything goes. :)
 
Mapping Deferred Pixel Shaders onto the Cell Architecture
Alan Heirich - Sony Computer Entertainment America

Abstract
This paper studies a deferred pixel shading algorithm implemented on a Cell-based computer entertainment system. The pixel shader runs on the Synergistic Processing Units (SPUs) of the Cell and works concurrently with the GPU to render images. The system's unified memory architecture allows the Cell and GPU to exchange data through shared textures. The SPUs use the Cell DMA list capability to gather irregular fine-grained fragments of texture data generated by the GPU. They return resultant shadow textures the same way. The shading computation ran at up to 85 Hz at HDTV 720p resolution on 5 SPUs and generated 30.72 gigaops of performance. This is comparable to the performance of the algorithm running on a state of the art high end GPU. These results indicate that a hybrid solution in which the Cell and GPU work together can produce higher performance than either device working alone.
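
As a rough sanity check on those figures, the sketch below divides the quoted throughput by the pixel rate. It assumes that 720p means 1280x720 and that the 85 Hz and 30.72 gigaops peaks come from the same run, neither of which the abstract states; the function name is made up for illustration.

```python
def ops_per_pixel(total_ops_per_second, width, height, frames_per_second):
    """Average number of operations available per pixel per frame."""
    pixels_per_second = width * height * frames_per_second
    return total_ops_per_second / pixels_per_second


# Assumed inputs: 30.72e9 ops/s, 1280x720 pixels, 85 frames per second.
print(round(ops_per_pixel(30.72e9, 1280, 720, 85)))  # prints 392
```

Under those assumptions the shader would be spending a few hundred operations on every pixel, which gives a feel for how heavy the per-pixel work is.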

This is what I was talking about...deferred composite rendering schemes, or interleaving work on each frame between CELL and RSX before the frame is rendered. Regardless of the slightly higher theoretical shader efficiency Xenos has over RSX's architecture and its bigger framebuffer bandwidth...Xenos probably would not be able to compensate for the work of both of these processors (CELL + RSX) working together in synergy. People don't seem to realize that they were made very specifically to work together well. In fact, that may have been the emphasis of the design...that is, to focus not on the core of the chip but on how it relates externally to the rest of the system. I believe that is why Sony "borrowed" NVidia's tried and true NV47 core design and focused on working on other things.
 
From what I have read, dedicating an entire SPE to a certain task is the least efficient way of programming the CELL in most situations. I have read that the best way is to break all the various threads (physics, AI, game code, animation, sound, ray casting, vertex work, etc.) into many, many smaller threads that would be streamed from SPE to SPE. When an SPE is finished with one of these threads it would automatically ask for another. This way, with so many small tasks being streamed around the SPEs, you can do a better job of keeping all the SPEs busy. The only dedicated SPE should be the one already reserved for the OS.
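
The "finish one small task, then ask for another" idea can be sketched as a shared job queue. This is only an illustration in Python using ordinary threads as stand-ins for SPEs; the worker count, the job tuple layout and the `run_jobs` name are all assumptions and not anything from a real Cell SDK.

```python
import queue
import threading

NUM_WORKERS = 5  # hypothetical number of SPEs left for game code


def run_jobs(jobs, num_workers=NUM_WORKERS):
    """Run (name, func, arg) jobs on workers that pull the next job when idle."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = {}
    jobs_done = [0] * num_workers  # how many jobs each worker ended up taking
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                name, func, arg = pending.get_nowait()  # ask for another job
            except queue.Empty:
                return  # nothing left to do, worker goes idle
            value = func(arg)
            with lock:
                results[name] = value
                jobs_done[worker_id] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, jobs_done


if __name__ == "__main__":
    small_jobs = [("job%d" % i, lambda x: x * x, i) for i in range(20)]
    results, jobs_done = run_jobs(small_jobs)
    print(len(results), sum(jobs_done))
```

No worker is tied to physics or audio here; whichever one is free takes the next piece of work, which is the load-balancing property the post describes.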

I'm not a programmer, so this information might be incorrect. But I have heard more than one developer say this.

What would really be nice is if they find a practical way to include some pixel and/or vertex shading work in the mix to help out the RSX. That would be amazingly awesome.
 
I might be wrong, but I don't see (at least in the foreseeable future) SPEs being used to substitute for RSX to shade pixels in the general case. It can be done, but there are so many other useful (graphics-related) tasks that a bunch of SPEs can do to help RSX that I would not think about implementing something so complex (yet).
SPEs are very good at operating on vertices and triangles and, from many POVs, much more flexible than any D3D10 GPU (and maybe even faster, who knows ;) ). They can also be useful for processing pixels, as with post-processing effects that must be run all over the image, etc.

Marco
 
Considering the Stanford F@H implementation on R580+ already performs twice as well as the Cell one, I doubt it would come even close to rivalling e.g. an R600 on its home turf...
 
In which way is this relevant wrt what I wrote?
 
They can also be useful for processing pixels, as with post-processing effects that must be run all over the image, etc.

Hang on - I feel this is very significant, because if an SPE is going to be able to do full screen post-processing then it's going to have some very good access to the framebuffer / GDDR3 memory in some way. This is relatively new to me.

How does this work? Not too long ago I read that the Cell and RSX can exchange information through textures, which seemed to imply this is bi-directional (apparently through some sort of pointer-list? I have to look that up again). And now you indicate that an SPE can do full-screen post-processing on a full image. Now if I understand this correctly, this would mean that the SPE has direct access to framebuffers, or at least that the contents of a framebuffer can be streamed through an SPE somehow, at reasonable speed and in both directions.

The only other option I can think of is that the RSX can send the framebuffer to the XDR as a texture, have the SPE process it there, and then have it sent back again as a texture. That seems to be bandwidth intensive, but maybe possible.

Still, some kind of clever hook-up between the SPE and the RSX seems more logical, and if it is possible to use the SPE as a pixel shader, then that seems to imply that the RSX and SPE can work together more closely than many people so far suspected.

At any rate, given what you are saying, the most obvious thing I can come up with in this regard is that an SPE may be responsible for MSAA in Heavenly Sword as well as other games.

@Pinky: that's not so relevant a comparison as you might think, because the F@H client for ATI cards in fact limits itself to certain kinds of calculations most suitable for GPGPU on ATI, whereas the Cell version accepts all forms like all the other versions out there so far for CPUs.
 
Hang on - I feel this is very significant, because if an SPE is going to be able to do full screen post-processing then it's going to have some very good access to the framebuffer / GDDR3 memory in some way. This is relatively new to me.
New? Kutaragi explicitly said (E3 2005!) that CELL and RSX can read/write each other's data

At any rate, given what you are saying, the most obvious thing I can come up with in this regard is that an SPE may be responsible for MSAA in Heavenly Sword as well as other games.
How can CELL be responsible for something (multisampling rendering) that is running on a GPU?
 
New? Kutaragi explicitly said (E3 2005!) that CELL and RSX can read/write each other's data

True, but so far at least in public people seemed to assume that the RSX to Cell bandwidth was 16Mbit/s or something. Remember that stupid story? :D

But thanks for reminding me of that. I didn't know anything about 3D graphics back then (!). ;) Still very little now, but comparatively, a LOT more. ;)

How can CELL be responsible for something (multisampling rendering) that is running on a GPU?

Ehm. You are asking me hard questions. I was just trying to figure out what a real-time application of full screen post-processing could be ... But now I'm starting to understand that multi-sampling rendering is something the RSX does while drawing the triangles, so to speak, right? So then MSAA by Cell could never happen - only something like FSAA (if that is, in fact, Full Screen AA ;) ), and maybe certain kinds of filters ... Maybe also ways to enhance certain colors / contrasts? - I am thinking also of effects like when explosions happen and the full screen turns red as happens in Resistance, or maybe even the water drops on the screen in F1. Maybe even other kinds of photo-shop style stuff.
 
Hang on - I feel this is very significant, because if an SPE is going to be able to do full screen post-processing then it's going to have some very good access to the framebuffer / GDDR3 memory in some way. This is relatively new to me.
Don't you remember the BW slide? It showed RSX having fairly full access to the RAM pools, and Cell having limited access to GDDR.

[Image: PS3_memory_bandwidths.jpg]


If the backbuffer were stored in XDR, Cell could read it in and write a processed buffer to GDDR for output (assuming the front buffer is in GDDR here). 1080p@60 Hz consumes c. 360 MB/s of BW, so the 4 GB/s Cell>>GDDR link can easily accommodate that. Although, how does that figure affect RSX? If Cell were to write to GDDR at 4 GB/s, would that consume all of the 22 GB/s BW, freezing RSX out, or would it consume 4 GB/s and leave 18 GB/s for RSX?
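
For anyone wanting to check the 1080p@60 figure, here is a minimal sketch of the arithmetic. The bytes-per-pixel values are assumptions (3 for packed RGB, 4 for RGBA); only the 4 GB/s link speed comes from the post, and the function name is invented.

```python
def scanout_bytes_per_second(width, height, fps, bytes_per_pixel):
    """Bytes per second needed to write one full frame every refresh."""
    return width * height * fps * bytes_per_pixel


CELL_TO_GDDR_BYTES_PER_SECOND = 4e9  # the 4 GB/s figure, taken as decimal GB

for bpp in (3, 4):  # assumed pixel sizes
    needed = scanout_bytes_per_second(1920, 1080, 60, bpp)
    share = needed / CELL_TO_GDDR_BYTES_PER_SECOND
    print(bpp, "bytes/pixel:", round(needed / 1e6, 1), "MB/s,", round(share * 100, 1), "% of the link")
```

With 3 bytes per pixel this comes to roughly 373 MB/s (about 356 MiB/s, close to the "c. 360" quoted); with 4 bytes per pixel it is closer to 0.5 GB/s. Either way it is a small slice of 4 GB/s.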
 
True, but so far at least in public people seemed to assume that the RSX to Cell bandwidth was 16Mbit/s or something. Remember that stupid story? :D
See my above pic. That's the speed at which Cell can read from GDDR.

But now I'm starting to understand that multi-sampling rendering is something the RSX does while drawing the triangles, so to speak, right? So then MSAA by Cell could never happen - only something like FSAA (if that is, in fact, Full Screen AA ;) ), and maybe certain kinds of filters
You should go dig up some explanations, such as good old Wiki, on AA techniques. Both MS and SS are usually FSAA. FSAA just means antialiasing is applied across the whole image. What you're thinking of is SSAA, or supersampling. Yes, Cell could do that by taking a larger buffer and shrinking it down to add AA. It'd be quicker to have RSX do that, I imagine, unless you want something more advanced than a straight regular bilinear filter.
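
To make the "larger buffer shrunk down" idea concrete, here is a minimal sketch of a 2x supersampling resolve on a grayscale image stored as a list of rows. It uses a plain 2x2 box average, which is what a bilinear fetch at the centre of four texels would produce; the function name and data layout are assumptions for illustration only.

```python
def downsample_2x(image):
    """Average each 2x2 block of a grayscale image given as a list of rows."""
    height = len(image)
    width = len(image[0])
    out = []
    for y in range(0, height - 1, 2):
        row = []
        for x in range(0, width - 1, 2):
            total = (image[y][x] + image[y][x + 1]
                     + image[y + 1][x] + image[y + 1][x + 1])
            row.append(total / 4)
        out.append(row)
    return out


# A 4x2 "rendered at double size" buffer becomes a 2x1 output image.
big = [[0, 4, 8, 8],
       [4, 8, 8, 8]]
print(downsample_2x(big))  # prints [[4.0, 8.0]]
```

A fancier resolve would just swap the four equal weights for a wider or weighted kernel, which is where doing it on a general-purpose processor might pay off.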

Maybe also ways to enhance certain colors / contrasts? - I am thinking also of effects like when explosions happen and the full screen turns red as happens in Resistance, or maybe even the water drops on the screen in F1. Maybe even other kinds of photo-shop style stuff.
Cell would allow any image processing, like PhotoShop, such as a B&W filter followed by adding grain if you wanted a film noir look on your cutscenes. The water drops are likely particle effects (can't recall them), which is different from post processing, but something Cell seems to be used for. Thus in post processing, Cell seems to be doing composition work in some games.
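
As an illustration of the B&W-plus-grain example, here is a small sketch operating on an image stored as rows of (r, g, b) tuples. The luma weights are the standard Rec. 601 ones; the grain model (uniform integer noise), the default strength and all function names are assumptions chosen for brevity.

```python
import random


def to_grayscale(pixel):
    """Convert an (r, g, b) pixel to a single luma value using Rec. 601 weights."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def add_grain(value, rng, strength):
    """Add uniform noise in [-strength, strength] and clamp to 0..255."""
    noisy = value + rng.randint(-strength, strength)
    return max(0, min(255, noisy))


def film_noir(image, seed=0, strength=8):
    """Apply the B&W filter, then grain, to every pixel of the image."""
    rng = random.Random(seed)  # seeded so the grain is repeatable
    return [[add_grain(to_grayscale(p), rng, strength) for p in row]
            for row in image]


frame = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (255, 255, 255)]]
print(film_noir(frame, strength=0))  # strength 0 shows the pure B&W pass
```

Both passes touch each pixel independently, which is why this kind of filter splits easily across several processors.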
 
First of all, I don't think you need a whole lot of bandwidth for these post-processing tasks. You just fill the SPE's LS with the next chunk of your render target and go at it. Besides, you'd probably have RSX output to a render target in main memory and let the SPE(s) take it from there, then copy it to VRAM for scanout when finished (the bandwidth cost should be rather negligible even at 1080p/60 Hz, somewhere around 0.5 GB/s). The Folding comparison was in regard to Cell being able to outdo next-gen GPUs on vertex shading, which I sincerely doubt. Come unified shaders, I would expect that even the huge clock lead will not help Cell a whole lot in comparison to something like R600. It'd be really interesting to see what happens once shaders get more complex control-flow-wise. How bad is the branch penalty on something like Xenos (considering it's working on three quads per pipeline)?
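
The "fill the LS with the next chunk and go at it" approach can be sketched as a loop over bands of rows. Each SPE local store is 256 KB; the 64 KB per-buffer budget below is a made-up figure meant to leave room for code and a second buffer, and the list slices merely stand in for DMA transfers. None of the names come from a real SDK.

```python
LOCAL_STORE_BYTES = 256 * 1024              # size of one SPE local store
CHUNK_BUDGET_BYTES = LOCAL_STORE_BYTES // 4  # assumed budget per pixel buffer


def rows_per_chunk(width, bytes_per_pixel, budget_bytes=CHUNK_BUDGET_BYTES):
    """How many full rows of the render target fit in one chunk."""
    return max(1, budget_bytes // (width * bytes_per_pixel))


def process_in_chunks(image, bytes_per_pixel, kernel, budget_bytes=CHUNK_BUDGET_BYTES):
    """Apply a per-pixel kernel to an image one band of rows at a time."""
    width = len(image[0])
    step = rows_per_chunk(width, bytes_per_pixel, budget_bytes)
    out = []
    for start in range(0, len(image), step):
        chunk = image[start:start + step]            # stands in for a DMA read
        processed = [[kernel(p) for p in row] for row in chunk]
        out.extend(processed)                        # stands in for a DMA write
    return out


print(rows_per_chunk(1280, 4))  # rows of a 720p RGBA target per 64 KB chunk
```

On real hardware the next chunk would be fetched while the current one is being processed, so the SPE rarely waits on memory; this sketch leaves that overlap out.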
 
Regarding that table, could someone please enlighten me, as to why writes are way more efficient than reads on the xdr memory?
 
Actually, I don't recall if I've seen that picture, but I'm happy to see it now, it helps, thanks. :)

Let's see. So RSX can really use the XDR almost as well as its own GDDR3. It's pretty hard to figure out what the optimum use of all this might be, but I'm going to make a (note: MY) first guess (I'm expecting to be corrected/slapped for most of the stuff I'll write next):

- 4 GB/s from Cell to RSX. This seems to me most useful for streaming in textures and vertex data. It would be more efficient to stream this into the appropriate RSX buffers, as this would still keep RSX in control and probably not interfere with the RSX to GDDR3 bandwidth too much - efficient use of this bandwidth, because of the nature of GDDR3, is best left to one controlling device, whereas XDR is very much optimised for shared access (correct?)

This seems to indicate that the main location for storing textures is GDDR3, which makes sense obviously, but the Cell can update the textures in GDDR3 memory at a fair pace. However, the RSX itself could read in textures from XDR memory on its own at a much higher bandwidth still, nearly four times as fast, in fact. The main advantage of Cell being able to write at 4 GB/s to RSX would therefore seem to apply only if the above is indeed the case, i.e. the Cell can stream data to the RSX into certain buffers that do not directly tax the GDDR3. Again, maybe Cell-generated vertex and texture data ...

- the 16 MB/s Cell read from GDDR3 is probably mostly intended for messaging / debugging / monitoring purposes, and may not be used at all in most instances (?)

- RSX and Cell can read equally well from XDR memory. This seems to be plenty fast, to the point where there's hardly a difference between RSX accessing its own local memory or main XDR memory. Presumably though there may be a difference in latency, and obviously if the RSX accesses GDDR3 memory, this should leave more bandwidth for the Cell to play around with the XDR memory in parallel and vice versa.

- RSX can write quite fast to XDR memory too (10 GB/s), though not as fast as Cell (24.9 GB/s - I'm using measured speeds for now).

So, to summarise the basic Rendering Pipeline:

0. Cell pre-processes vertex data (animations, decompression, etc.) and textures (decompression or conversion to the compressed format that RSX likes, maybe generate textures from scratch, modify them to make them darker or add shadow, etc.) and sends them to RSX (Cell read from XDR, perhaps write to XDR, then write to RSX)
1. RSX renders a scene to GDDR3 framebuffer (RSX write to GDDR3 memory)
2. RSX copies the framebuffer from GDDR3 to XDR (read GDDR3, write to XDR memory)
3. Cell post-processes the scene into an XDR framebuffer (Cell read/write XDR memory)
4. RSX copies the framebuffer to GDDR3 memory (RSX read from XDR memory, write to GDDR3 memory)
5. RSX displays the newly read framebuffer (adjust vram pointer with correct v-sync timing)
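
To get a feel for which hops in a pipeline like this are cheap, the sketch below estimates how long one frame takes to cross each link. It uses only the bandwidth figures mentioned in this thread, treated as decimal units, and assumes a 1280x720 frame at 4 bytes per pixel; the link names and function are invented for illustration.

```python
# Bandwidth figures quoted in the thread, in bytes per second (decimal units assumed).
LINKS = {
    "cell_read_gddr3": 16e6,
    "cell_write_gddr3": 4e9,
    "rsx_write_xdr": 10e9,
    "cell_write_xdr": 24.9e9,
}

FRAME_BYTES = 1280 * 720 * 4  # assumed 720p frame with 4 bytes per pixel


def transfer_ms(num_bytes, bytes_per_second):
    """Milliseconds needed to move num_bytes over a link at full speed."""
    return num_bytes / bytes_per_second * 1000


for name, speed in LINKS.items():
    print(name, round(transfer_ms(FRAME_BYTES, speed), 3), "ms per frame")
```

At 16 MB/s a single frame would take roughly 230 ms to read, which is why a step like 2 (RSX pushing the frame into XDR) looks necessary, while the other links move a frame in about a millisecond or less. These are best-case numbers that ignore contention and latency.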

And I'm assuming that some of these may not have to wait for each other either ... I expect some reads and writes to overlap.

Also a few questions, like could the RSX render a scene directly to the XDR and would that be beneficial?

Also, I'm not sure yet what kind of streaming will happen where. Right now, we only have information on direct memory access, but we don't know how directly we can connect streams of data from, say, the SPE to RSX buffers. Maybe these fall under the 4 GB/s to GDDR3 memory?

Certainly there is a lot of stuff to play around with here, because a game could also almost exclusively use the RSX and GDDR3 memory to render, leaving the Cell out of it almost completely (just basic AI and main loop stuff).

So all in all I can see how there are very many different ways to set up a render pipeline, and then there are all the different programming models for SPEs too, so I'm starting to understand why figuring out the best way to use the Cell isn't all that obvious from day one. ;)

Am I on the right track?
 
You got it all wrong, read those slides again, they are about reading and writing data from/to main or video MEMORY.
 
Now I really don't understand anymore. Every line in the theoretical pipeline I set up illustrates what kind of read and write to which kind of memory takes place, so obviously I know that the slides are about reading and/or writing data to video memory.

So the only point at which I'm not clear about whether buffers are used or whether it would be a write directly to memory, was in the discussion of the Cell-to-GDDR3 memory, where I wasn't sure (nor probably clear enough) about how Cell would communicate most efficiently with RSX and whether or not this impacted the Cell-to-GDDR3 bandwidth, or whether there are different channels for pipelining, say, textures to RSX without the Cell directly accessing the GDDR3 memory, and without the RSX copying these textures from XDR on its own being more efficient.

Could you be a tiny bit more specific when you say I got it all wrong? Or are you saying that I should read all the available slides of that presentation? Maybe there is more information there that I am not aware of.

I really got it *all* wrong? Wow. :cry:
 