Tile rendering on PS3

purpledog

Here we go, another cannabis abuse:

What about using 3 or 4 SPUs to mimic the Dreamcast rendering engine:
- small tiles, let's say 64x64 = 4096 pixels
- per-pixel sorting, transparency "for free"

* The first SPU sorts the triangles and dispatches them to the tiles
* when there are enough triangles in one tile, they are sent to the second SPU, which rasterises them and accumulates the fragments (possibly writing a Z value for hierarchical-Z visibility optimisation)
* a third SPU processes the fragment lists (per pixel) and shades them

Colour and depth tile-buffers would be in main RAM, and each tile is processed one at a time.
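To make the "transparency for free" part concrete, here is a minimal C++ sketch of per-pixel fragment lists inside one 64x64 tile. All the names (Fragment, Tile, resolve) are invented for illustration; a real SPU version would use fixed-size local-store buffers rather than std::vector:

```cpp
#include <cstdint>
#include <vector>
#include <algorithm>

constexpr int kTileDim = 64; // one 64x64 tile, as in the post above

struct Fragment {
    float         depth;  // eye-space depth
    std::uint32_t rgba;   // shaded colour, 8 bits per channel
    float         alpha;  // transparency
};

struct Tile {
    // One fragment list per pixel, ~20 bytes per fragment as in the
    // back-of-envelope numbers later in the thread.
    std::vector<Fragment> pixels[kTileDim * kTileDim];

    void addFragment(int x, int y, const Fragment& f) {
        pixels[y * kTileDim + x].push_back(f);
    }

    // Resolve one pixel: sort its fragment list back to front, then
    // blend, so transparency needs no global sorted submission order.
    std::uint32_t resolve(int x, int y) const {
        auto list = pixels[y * kTileDim + x]; // copy so we can sort
        std::sort(list.begin(), list.end(),
                  [](const Fragment& a, const Fragment& b) {
                      return a.depth > b.depth; // farthest first
                  });
        float r = 0, g = 0, b = 0;
        for (const Fragment& f : list) {
            r = (1 - f.alpha) * r + f.alpha * ((f.rgba >> 16) & 0xff);
            g = (1 - f.alpha) * g + f.alpha * ((f.rgba >> 8) & 0xff);
            b = (1 - f.alpha) * b + f.alpha * (f.rgba & 0xff);
        }
        return (std::uint32_t(r) << 16) | (std::uint32_t(g) << 8) |
               std::uint32_t(b);
    }
};
```

Sorting at resolve time is what makes order-independent transparency fall out naturally, which is essentially what the Dreamcast's PowerVR hardware did per pixel.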




Having a "GPU" so close to the PPU and other SPUs would also open new opportunity to dynamic vertex generation, like adaptative multi-resolution mesh.

I guess making this kind of algorithm fast enough on Cell is very hard.
Also, the RSX's power should not be wasted. Post-processing? "General purpose"?
Anyway, it could be quite ironic in the end to find a good balance where the RSX is not doing what it's supposed to do.

Probably an insane idea, but "tile" rhymes with "256k local store"...
 
Sorting and checking for visibility might be a tall task without specialized hardware. The size of display lists could be hard to manage. Also, having some of the tile buffers out in external memory would seem to negate a lot of the advantage.
 
Lazy8s said:
Sorting and checking for visibility might be a tall task without specialized hardware. The size of display lists could be hard to manage.

A solution could be to have the first SPU accumulate triangles into the areas corresponding to the tiles (but the tiles themselves are not loaded in memory). Then, when there are "enough" triangles, the triangle list of one tile is sent to the next SPU. Something like a "multiple queue with automatic pressure-release".
If all the tiles fill up at the same time without reaching the wanted triangle-count limit, then the biggest one is sent.
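A minimal C++ sketch of that pressure-release binning, with invented names (TileBinner, flushThreshold); on the real hardware the bins would be fixed-size buffers and "send" would be a DMA transfer to the next SPU's queue:

```cpp
#include <cstddef>
#include <vector>

struct Triangle { float v[3][4]; }; // three clip-space vertices

struct TileBinner {
    std::vector<std::vector<Triangle>> bins; // one queue per tile
    std::size_t flushThreshold;  // "enough" triangles for one tile
    std::size_t totalBudget;     // overall memory budget across all bins
    std::size_t total = 0;

    TileBinner(std::size_t tileCount, std::size_t thresh, std::size_t budget)
        : bins(tileCount), flushThreshold(thresh), totalBudget(budget) {}

    // sendToRasteriser stands in for the transfer to the second SPU.
    template <class Send>
    void add(std::size_t tile, const Triangle& t, Send sendToRasteriser) {
        bins[tile].push_back(t);
        ++total;
        if (bins[tile].size() >= flushThreshold) {
            flush(tile, sendToRasteriser);
        } else if (total >= totalBudget) {
            // Pressure release: no bin reached the wanted limit,
            // so flush the biggest one.
            std::size_t biggest = 0;
            for (std::size_t i = 1; i < bins.size(); ++i)
                if (bins[i].size() > bins[biggest].size()) biggest = i;
            flush(biggest, sendToRasteriser);
        }
    }

    template <class Send>
    void flush(std::size_t tile, Send send) {
        total -= bins[tile].size();
        send(tile, bins[tile]);
        bins[tile].clear();
    }
};
```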

Sorting the per-tile triangle lists from front to back doesn't scare me that much. I guess it could be done just before sending the list. If the triangles are not perfectly sorted, it's not a big deal; we still have the per-pixel sorting.

Also, ideally, since dynamically generating triangles is easier in this context, I would go for a "mesh generator" which always outputs triangles of roughly the same area (around a hundred pixels), which would help a lot with sorting and accumulating into tiles.

Lazy8s said:
Also, having some of the tile buffers out in external memory would seem to negate a lot of the advantage.

I'm not quite sure what you're implying here. The Cell's internal bandwidth is huge, and I don't see anything wrong with reading tiles from and writing them back to memory. This is actually how you get maximum performance from the SPUs: by localising data.

Now some numbers:
- the tiles will cover something like 1000x1000 = 1e6 pixels
- let's be greedy and say each "pixel" can contain a list of 10 fragments, each of them 20 bytes. We now have a structure whose size is 1e6 * 10 * 20 = 2e8 bytes
- now let's say we want to read and write each tile 5 times per frame; we are then using 2e8 * 5 * 2 = 2e9 bytes per frame

The Cell's internal bandwidth is about 300 GB/s, which at 30 fps is 10 GB per frame, i.e. 1e10 bytes per frame. Well, we have been very greedy, but still, it's OK.
Please note that I'm only dealing with orders of magnitude here.
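For what it's worth, the same order-of-magnitude budget written as a compile-time check in C++ (a sketch; the 300 GB/s figure is the commonly quoted EIB peak, and real sustained numbers would be lower):

```cpp
// Back-of-envelope tile-traffic budget from the post above.
constexpr long long kPixels       = 1000LL * 1000;        // 1e6 pixels
constexpr long long kStructBytes  = kPixels * 10 * 20;    // 2e8 bytes
constexpr long long kTrafficFrame = kStructBytes * 5 * 2; // 2e9 bytes/frame
constexpr long long kEIBPerFrame  =
    300LL * 1000 * 1000 * 1000 / 30;                      // ~1e10 bytes/frame
static_assert(kTrafficFrame < kEIBPerFrame, "greedy, but within budget");
```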

The idea is to use and *abuse* the Cell's internal bandwidth, because it's one of the greatest resources in this chip.
 
It seems that this has been discussed in the threads about a Cell-based GPU.
Earlier rumours (before the switch to an Nvidia-based graphics solution) spoke of a multi-Cell PS3 design.
I guess (and I'm clearly not alone, lol) that performance issues pushed Sony to switch to a full-blown graphics card.
How the SPUs can help in graphics rendering has already been discussed in lots of other posts.
 
Cost

liolio said:
It seems that this has been discussed in the threads about a Cell-based GPU.
Earlier rumours (before the switch to an Nvidia-based graphics solution) spoke of a multi-Cell PS3 design.
I guess (and I'm clearly not alone, lol) that performance issues pushed Sony to switch to a full-blown graphics card.
How the SPUs can help in graphics rendering has already been discussed in lots of other posts.

My friend, the PS3 is not so different from the PS2 concept of a CPU with large floating-point capacity and a renderer from an outside source. I feel the choice of Nvidia for the renderer, instead of a custom chip from another source as in the PS2, was to solve the biggest PS2 problem, incompatibility with DirectX and OpenGL, for the sake of developers.

But for custom software by expert Cell developers, Cell can do very powerful rendering. IBM showed a ray-casting tech demo with texture filtering, bump-map computation, dynamic cloud generation, atmosphere computation, MSAA with 4 to 16 samples per pixel (!), etc., with no GPU, only Cell, at 720p @ 30fps.


http://img310.imageshack.us/img310/4893/38il.jpg
http://img310.imageshack.us/img310/3575/44gq.jpg

With a 2-Cell design the same demo runs at 1080p @ 30fps, so a 2-Cell design may be enough on raw performance, but the obstacles for developers of full-length game graphics are too high, since all effects must be custom software. For the standards of next-gen games, this obstacle is simply too much to ask of developers.
 
Well ihamoitc2005, I'm happy that you call me a friend ;).
In fact, my English is quite poor, so I use simple sentences; sometimes I feel misunderstood, lol...
I wasn't trying to prove you wrong, or being too enthusiastic.
I was just telling you that I remember this E3 tech demo, and that the graphics abilities of Cell have been discussed and still are in other threads, because I saw that your thread is not very successful.
I can't make an efficient search on this forum; my English is somehow ******g lol.

Hence post-processing, advanced lighting, and some vertex work seem good candidates for Cell's mathematical power.
 
Difficult

liolio said:
Well ihamoitc2005, I'm happy that you call me a friend ;).
In fact, my English is quite poor, so I use simple sentences; sometimes I feel misunderstood, lol...
I wasn't trying to prove you wrong, or being too enthusiastic.
I was just telling you that I remember this E3 tech demo, and that the graphics abilities of Cell have been discussed and still are in other threads, because I saw that your thread is not very successful.
I can't make an efficient search on this forum; my English is somehow ******g lol.

Hence post-processing, advanced lighting, and some vertex work seem good candidates for Cell's mathematical power.

I am sorry, my friend, but your post is confusing for me. I have read it many times and I am not certain what your pattern of thought is.
 
Something that would be even better is if the SPUs could be used for detailed bounding-box checking before fetching the real geometry.
That way, bandwidth for geometry, textures and the framebuffer could be saved.
 
Squeak said:
Something that would be even better is if the SPUs could be used for detailed bounding-box checking before fetching the real geometry.
That way, bandwidth for geometry, textures and the framebuffer could be saved.

What do you mean by "detailed bounding box"? A box whose precision is somewhere between a simple cube and the detailed mesh?
 
purpledog said:
What do you mean by "detailed bounding box"? A box whose precision is somewhere between a simple cube and the detailed mesh?

Well, my reading of that would be bounding boxes used at a finer granularity than usual - i.e. around small lumps of primitives rather than whole meshes.

You could do that kind of thing on PS2 - stick a bounding volume in with a packet of geometry being sent to VU1 and have it reject stuff that isn't visible. You still spend the EE bus bandwidth, but you save on VU1 time and GS bandwidth/time. Bearing in mind that a VU1 packet is probably only a small number of polys (probably in double figures, but not too high), it's a pretty fine grain of culling. The only caveat is that it works better when your mesh has locality, whereas the GS and VU will prefer stuff with aggressive stripping - and the two are not necessarily compatible. However, a balance can probably be reached in most cases that results in a net gain.
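For illustration only (this is C++ with invented names, not actual PS2 code), the per-packet rejection described above could be a bounding-sphere test against the six frustum planes:

```cpp
// A bounding sphere travels with each small packet of geometry; the
// packet is rejected before transform if the sphere is fully outside
// any frustum plane.
struct Plane { float nx, ny, nz, d; }; // nx*x + ny*y + nz*z + d >= 0 is inside

struct PacketBound { float cx, cy, cz, radius; };

bool packetVisible(const PacketBound& b, const Plane planes[6]) {
    for (int i = 0; i < 6; ++i) {
        float dist = planes[i].nx * b.cx + planes[i].ny * b.cy +
                     planes[i].nz * b.cz + planes[i].d;
        if (dist < -b.radius)
            return false; // entirely behind one plane: cull the packet
    }
    return true; // visible or intersecting: transform and draw it
}
```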

I think using SPUs to do GPU tasks is insane though - it's interesting in an academic sense, especially if you follow the idea that GPUs and CPUs are slowly reconverging. However, in a practical system that actually has a decent GPU... nah.
 
MrWibble said:
The only caveat is that it works better when your mesh has locality, whereas the GS and VU will prefer stuff with aggressive stripping - and the two are not necessarily compatible. However, a balance can probably be reached in most cases that results in a net gain.
Actually I've found this kind of "packet" culling to be a win pretty much always on PS2, without any special rules for strip construction (the only rule we enforced on PS2 was that strips cannot cross packet borders, but that was solely to simplify the VU code).
That said, I actually culled packets on the CPU side rather than on VU1, but that's beside the point :p

However, in a practical system that actually has a decent GPU... nah.
Key word being a 'decent' gpu? :devilish: :p :oops:
Anyway, keep in mind that many of us aren't what you'd consider clinically sane when we have our moments of "revelation". ;)


purpledog said:
Having a "GPU" so close to the PPU and other SPUs would also open new opportunity to dynamic vertex generation, like adaptative multi-resolution mesh.
Oh, by all means... that is, if you have a GPU and CPU actually close.
At any rate, I am a big proponent of certain tiling solutions for the SPUs, but they don't rhyme with scan conversion.
 
purpledog said:
Here we go, another cannabis abuse:

What about using 3 or 4 SPUs to mimic the Dreamcast rendering engine:
- small tiles, let's say 64x64 = 4096 pixels
- per-pixel sorting, transparency "for free"

* The first SPU sorts the triangles and dispatches them to the tiles
* when there are enough triangles in one tile, they are sent to the second SPU, which rasterises them and accumulates the fragments (possibly writing a Z value for hierarchical-Z visibility optimisation)
* a third SPU processes the fragment lists (per pixel) and shades them
If you want to correctly handle and sort all non-opaque fragments, you also have to store a full description of the scene; you can't just send stuff to an SPE when there are enough triangles per tile (well... you can do that, but only for opaque primitives).
I believe an SPE can be very good at computing triangle coverage masks (clipping included) and interpolating quantities over a triangle... but what about texturing?
BTW, when I say it can be good, I'm not saying it can be competitive in the general case with a GPU. A modern 3D pipeline comprises quite a number of different tasks; some of them map nicely to CELL, some others are probably a nightmare ;)

I guess making this kind of algorithm fast enough on Cell is very hard.
What are you talking about here, tessellation or rasterisation?
Also, the RSX's power should not be wasted. Post-processing? "General purpose"?
Anyway, it could be quite ironic in the end to find a good balance where the RSX is not doing what it's supposed to do.
If your rendering pipeline is unbalanced, there's nothing wrong with trying to make it more balanced.
Probably an insane idea, but "tile" rhymes with "256k local store"...
Good cannabis indeed :)
 
MrWibble said:
You could do that kind of thing on PS2 - stick a bounding volume in with a packet of geometry being sent to VU1 and have it reject stuff that isn't visible. You still spend the EE bus bandwidth, but you save on VU1 time and GS bandwidth/time.
Bearing in mind that a VU1 packet is probably only a small number of polys (probably in double figures, but not too high), it's a pretty fine grain of culling. The only caveat is that it works better when your mesh has locality, whereas the GS and VU will prefer stuff with aggressive stripping - and the two are not necessarily compatible.
However, a balance can probably be reached in most cases that results in a net gain.
I found that it worked better (in the end it was faster) for me to test each vertex in a packet (48 triangles per packet...) against the frustum, for every mesh that had an AABB intersecting the view frustum, instead of storing/testing some kind of bounding box per packet. It was more accurate (with the BBox I was getting a lot of 'false' to-be-clipped packets), it culled more packets, and it worked with skinned characters/procedural geometry too.
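A hedged sketch of that per-vertex variant, in plain C++ rather than VU1 microcode (names invented): the packet is culled only when all of its vertices fall behind the same frustum plane, and since the test runs on post-deform vertices it naturally handles skinned and procedural geometry:

```cpp
struct Vec3  { float x, y, z; };
struct Plane { float nx, ny, nz, d; }; // >= 0 on the inside

// Returns true when every vertex of the packet (e.g. the vertices of
// 48 triangles) is behind one and the same frustum plane.
bool packetCompletelyOut(const Vec3* verts, int count,
                         const Plane planes[6]) {
    for (int p = 0; p < 6; ++p) {
        bool allOutside = true;
        for (int v = 0; v < count; ++v) {
            float dist = planes[p].nx * verts[v].x +
                         planes[p].ny * verts[v].y +
                         planes[p].nz * verts[v].z + planes[p].d;
            if (dist >= 0.0f) { allOutside = false; break; }
        }
        if (allOutside) return true; // every vertex behind one plane: cull
    }
    return false; // at least partially visible: keep the packet
}
```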

I think using SPUs to do GPU tasks is insane though - it's interesting in an academic sense, especially if you follow the idea that GPUs and CPUs are slowly reconverging. However, in a practical system that actually has a decent GPU... nah.
I agree, there's no way you can match a GPU... at the moment :)
What you can do, maybe, is find a little niche where CELL can be competitive enough.
 
nAo said:
It was more accurate (with the BBox I was getting a lot of 'false' to-be-clipped packets), it culled more packets, and it worked with skinned characters/procedural geometry too.
The main benefit of the BBox test is detecting packets that cross the visible frustum but still fall within the guardband frustum - the same reason why it helps a lot to test every triangle that fails the screen (narrow) clip check against the second (wide) frustum.
IIRC you had tests against dual frustums as well, no?

That said, your reversing of the order of the inner loops (clip first, then shade) is still good if you have the storage space for the list of triangles that will need scissoring. Worst case, it'll have the same performance as the normal order, so every block it might 'early cull' is a free bonus.
 
MrWibble said:
I think using SPUs to do GPU tasks is insane though - it's interesting in an academic sense, especially if you follow the idea that GPUs and CPUs are slowly reconverging. However, in a practical system that actually has a decent GPU... nah.

Fafalada said:
At any rate, I am a big proponent of certain tiling solutions for the SPUs, but they don't rhyme with scan conversion.

nAo said:
but what about texturing?

Granted, rendering materials on an SPU is insane, so let's simplify the discussion and agree that the GPU does the material rasterisation pass. That also means that we no longer have per-pixel sorting. Granted.




Still, we can do some per-tile depth rasterisation to quickly decide whether a pack of triangles is visible or not. Yes, I'm talking about a (low-res?) tiled Z-buffer on the SPUs.
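A minimal C++ sketch of such a coarse occlusion test, under invented assumptions (an 80x60 grid of screen cells, each storing its farthest depth, smaller Z meaning nearer): a pack of triangles is represented by its screen rectangle and nearest depth, and is rejected only if it is behind the stored depth in every covered cell:

```cpp
#include <algorithm>

constexpr int kCoarseW = 80, kCoarseH = 60; // hypothetical cell grid

struct CoarseZ {
    float maxZ[kCoarseW * kCoarseH]; // farthest depth per cell

    // x0..x1, y0..y1 is the pack's bounding rectangle in cell units;
    // nearZ is the nearest depth of the whole pack.
    bool maybeVisible(int x0, int y0, int x1, int y1, float nearZ) const {
        x0 = std::max(x0, 0); y0 = std::max(y0, 0);
        x1 = std::min(x1, kCoarseW - 1); y1 = std::min(y1, kCoarseH - 1);
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                if (nearZ <= maxZ[y * kCoarseW + x])
                    return true; // could be in front somewhere
        return false; // occluded in every covered cell: skip the pack
    }
};
```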

Fafalada said:
Actually I've found this kind of "packet" culling to be a win pretty much always on PS2.

Actually, I'm sure there's a way to generate these packets using some progressive-mesh representation. The idea is simple: before rendering at the wanted precision, a lower-detail version is first checked against our SPU depth buffer.

But we need more than a low-res mesh, we need a bounding box. So let's say the progressive mesh always contains two offset versions of itself which enclose a volume around all the geometry it can produce by adding detail.

Then we render our progressive mesh front to back, taking care to check visibility before adding detail.
Note: because adding detail to a triangle depends on its position, all the mesh deformers are processed beforehand, and therefore on the SPUs (no skinning on the GPU).

We end up with a big list of triangles (per tile) that we send to the GPU according to the classic optimisation rules (per material, per texture...).
If we want transparency, then we sort the triangles (and re-cut them) before sending (expensive, I guess).

nAo said:
If you want to correctly handle and sort all non-opaque fragments, you also have to store a full description of the scene; you can't just send stuff to an SPE when there are enough triangles per tile (well... you can do that, but only for opaque primitives).

If you can ensure good-enough front-to-back rendering, I believe you can: at some point, you know that nothing will mask what you are about to send.
 
Fafalada said:
The main benefit of the BBox test is detecting packets that cross the visible frustum but still fall within the guardband frustum - the same reason why it helps a lot to test every triangle that fails the screen (narrow) clip check against the second (wide) frustum.
IIRC you had tests against dual frustums as well, no?
The main rendering loop was something like this:

On the EE core (a precomputed PVS gave me the visible instances from a spatial node):
for each instance:
- test whether the instance is visible against the view frustum; if it's not, reject it; otherwise test whether it intersects the guardband frustum, and if it does, set its clipping flag to true.

On VU1:
for each packet in an instance:
- if the clipping flag is false, transform and rasterise; otherwise check all the vertices in that packet against a (narrower) guardband frustum, at the same time storing the intersection count and clip flags for each vertex into the packet itself. If the packet is completely out, don't process it and go to the next packet (in the meantime, the DMA will have uploaded a new packet into VU1 memory).

To be fair, I remember it was far more complex than that... there were other tests, but I don't remember exactly what I was doing anymore, I'm getting old!

btw PS2 ROCKS!! emh..sorry :oops:

That said, your reversing of the order of the inner loops (clip first, then shade) is still good if you have the storage space for the list of triangles that will need scissoring. Worst case, it'll have the same performance as the normal order, so every block it might 'early cull' is a free bonus.
yep!
 
purpledog said:
Still, we can do some per-tile depth rasterisation to quickly decide whether a pack of triangles is visible or not. Yes, I'm talking about a (low-res?) tiled Z-buffer on the SPUs.
Modern GPUs already do that; the real rejection rate can be way higher than the theoretical fillrate. But I see your point, you want to avoid the extra tessellation work.
If you can ensure good-enough front-to-back rendering, I believe you can: at some point, you know that nothing will mask what you are about to send.
To ensure that, you need some kind of full description of the scene, since you can't tell the future, can you? :)
 
purpledog said:
If you can ensure good-enough front-to-back rendering, I believe you can: at some point, you know that nothing will mask what you are about to send.

Thing is, a GPU these days can do occlusion queries anyway. You could insert a low-res test mesh of some kind into your drawing order and have the GPU skip the "real" geometry if it doesn't rasterise. Again, that's likely to be a lot quicker and easier than spending CPU resources on implementing a secondary rasteriser.
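An OpenGL-flavoured C++ sketch of that occlusion-query pattern (the PS3's actual graphics library differs, drawProxyMesh/drawRealMesh are hypothetical stand-ins, and a production version would avoid stalling on the query result):

```cpp
#include <GL/gl.h> // assumes an OpenGL 1.5+ header/loader for query calls

void drawWithOcclusionQuery(GLuint query,
                            void (*drawProxyMesh)(),
                            void (*drawRealMesh)()) {
    // Rasterise a cheap low-res stand-in with colour/depth writes off.
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_FALSE);
    glBeginQuery(GL_SAMPLES_PASSED, query);
    drawProxyMesh();
    glEndQuery(GL_SAMPLES_PASSED);
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_TRUE);

    // Draw the real geometry only if any proxy sample passed the Z test.
    GLuint samples = 0;
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samples);
    if (samples > 0)
        drawRealMesh();
}
```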

Bearing in mind the limited space on an SPU, you'd have to be talking about a very low-res (and possibly low-precision) depth buffer, unless you're tiling it across multiple SPUs.

It *could* work, but I'm sceptical that you'd really be making the best use of the system.
 
nAo said:
If you want to correctly handle and sort all non-opaque fragments, you also have to store a full description of the scene; you can't just send stuff to an SPE when there are enough triangles per tile (well... you can do that, but only for opaque primitives).

nAo said:
To ensure that, you need some kind of full description of the scene, since you can't tell the future, can you? :)

I'm not quite sure what you mean by "full description of the scene".
Of course, the scene exists *somewhere*, in main RAM or on the disc.
Let's simplify and say the mesh is just one progressive mesh.
The idea is to quickly reach the desired precision at the front, and progressively move towards the back of the mesh.

Obviously, you don't want the entire mesh (fully detailed) in an SPU; you just want to stream in the needed detail.
Nothing easy here: the tessellator has to constantly look for other parts of the mesh to tessellate while waiting for the old queries to finally arrive. But that's what it is all about: hiding latency.

Well, doing so, triangles can be accumulated within tiles and sent to the GPU while geometry is still being processed: the tessellator can just say, "there's no more geometry whose depth is smaller than minDepth".
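A sketch of that early flush, under the stated assumption that the tessellator emits roughly front to back and keeps raising a conservative minDepth (BinnedTriangle and sendToGPU are invented names): once a binned triangle lies entirely nearer than minDepth, nothing still to come can mask it, so it is safe to stream out.

```cpp
#include <cstddef>
#include <vector>

struct BinnedTriangle { float farZ; /* vertices, material id... */ };

void flushCommitted(std::vector<BinnedTriangle>& bin, float minDepth,
                    void (*sendToGPU)(const BinnedTriangle&)) {
    std::size_t keep = 0;
    for (std::size_t i = 0; i < bin.size(); ++i) {
        if (bin[i].farZ <= minDepth)
            sendToGPU(bin[i]);    // all future geometry lies behind it
        else
            bin[keep++] = bin[i]; // might still be masked: keep for later
    }
    bin.resize(keep);
}
```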

Having said that, the whole scene could also be accumulated in main RAM and resent after the whole thing is done. Considering the huge bandwidth between the SPUs and main RAM, why not... But I'm sure you agree that it would be better to directly "stream it out".
 
MrWibble said:
Thing is, a GPU these days can do occlusion queries anyway. You could insert a low-res test mesh of some kind into your drawing order and have the GPU skip the "real" geometry if it doesn't rasterise. Again, that's likely to be a lot quicker and easier than spending CPU resources on implementing a secondary rasteriser.


You're getting to the heart of it :)
The whole point is to keep frustum/visibility/precision culling within the CELL to allow truly "dynamic" meshes. Yes, the GPU can do it faster. But the flexibility we gain on the Cell (supposedly) largely outweighs this drawback:
- Ideally, the GPU more or less renders exactly what it needs to fill the screen, which means very complex shaders and a lot of time left for other effects.
- Also, using a progressive mesh would generate triangles of roughly the "good" size, solving a lot of aliasing problems.


MrWibble said:
Bearing in mind the limited space on an SPU, you'd have to be talking about a very low-res (and possibly low-precision) depth buffer, unless you're tiling it across multiple SPUs.

Not tiling across multiple SPUs. Tiling on one SPU, with only one tile present at a time, hence the use of the triangle accumulation lists.
Tiles are constantly being read from and written back to memory. If I'm right, the Cell EIB can support this kind of extreme bus (ab)use.
 