GPU/CPU sync

purpledog

Newcomer
Traditionally, GPU sync is quite simple: the GPU is one frame behind (or more).
The CPU doesn't need anything from the GPU; it just sends commands/streams/textures without receiving anything back. Simple and efficient...

The question is: will this very simple model still be optimal for next-gen consoles?

Some facts:
- being able to retrieve data from the GPU allows very interesting tricks
- next-gen consoles are multi-core, so coders have to face the parallelism issue anyway

One way of allowing the CPU to use the GPU is to make the GPU only a fraction of a frame behind (digits = frame number):
CPU: 444444444444 555555555555
GPU: 3333 44444444444 555555555555


That way, the CPU can send commands to the GPU and get the result back later within the same frame. This model can be seen as an exact copy of the previous (historical) one, but on a "sub-frame" basis. Indeed, it acts as if a frame were composed of n virtual sub-frames (n ~ 10).
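To make the idea concrete, here is a minimal sketch of that sub-frame model, with the GPU simulated by a worker thread and the fence by an atomic counter (the names, the n = 10 split and the 2-sub-frame read-back distance are all made up, not a real console API):

#include <atomic>
#include <cstdio>
#include <thread>

// Hypothetical sub-frame fence: the "GPU" (a worker thread here) bumps the
// counter after finishing each sub-frame; the CPU reads a result back only
// once the counter has passed the sub-frame that produced it.
std::atomic<int> gpu_fence{0};
constexpr int kSubFrames = 10;               // n ~ 10 virtual sub-frames per frame

void fake_gpu()                              // stands in for the real command processor
{
    for (int s = 0; s < kSubFrames; ++s) {
        // ... execute the command list for sub-frame s ...
        gpu_fence.store(s + 1, std::memory_order_release);
    }
}

int main()
{
    std::thread gpu(fake_gpu);
    for (int s = 0; s < kSubFrames; ++s) {
        // submit_commands(s);               // CPU keeps feeding the FIFO, a few sub-frames ahead
        int needed = s - 2;                  // consume a result produced ~2 sub-frames ago
        if (needed >= 0) {
            while (gpu_fence.load(std::memory_order_acquire) <= needed)
                std::this_thread::yield();   // the potential stall lives here
            std::printf("CPU consumes result of sub-frame %d\n", needed);
        }
    }
    gpu.join();
}

The point is just that the potential stall is confined to the spin on the fence, which stays short as long as the CPU never asks for a result the GPU hasn't had a couple of sub-frames to produce.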

Yeah, that's a bit tricky, but I can't live without bringing some of the RSX's power back into the Cell. What do you think?
 
purpledog said:
Traditionally, GPU sync is quite simple: the GPU is one frame behind (or more).
The CPU doesn't need anything from the GPU; it just sends commands/streams/textures without receiving anything back. Simple and efficient...

The question is: will this very simple model still be optimal for next-gen consoles?

Some facts:
- being able to retrieve data from the GPU allows very interesting tricks
- next-gen consoles are multi-core, so coders have to face the parallelism issue anyway

One way of allowing the CPU to use the GPU is to make the GPU only a fraction of a frame behind (digits = frame number):
CPU: 444444444444 555555555555
GPU: 3333 44444444444 555555555555


That way, the CPU can send commands to the GPU and get the result back later within the same frame. This model can be seen as an exact copy of the previous (historical) one, but on a "sub-frame" basis. Indeed, it acts as if a frame were composed of n virtual sub-frames (n ~ 10).

Yeah, that's a bit tricky, but I can't live without bringing some of the RSX's power back into the Cell. What do you think?

I could hear DeanoC and DeanA sharpening their knives!

I dunno, I suspect RSX will have a DMA engine like the SPUs, and each can DMA to themselves/to others...
 
The traditional model is just a FIFO.
It could be one frame ahead, 20 frames ahead or 1/100th of a frame ahead. As long as there is more data in the FIFO than the GPU is working on, both the CPU and GPU can keep going without stalling.
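A toy, single-threaded sketch of such a FIFO; the packet format is made up, and a real implementation would need proper memory ordering between the two processors:

#include <array>
#include <cstdint>
#include <optional>

// The depth of the ring is the only thing that decides how far ahead the CPU
// can run: 1/100th of a frame or 20 frames, the mechanism is the same.
struct CommandFifo {
    std::array<std::uint32_t, 4096> ring{};
    std::size_t head = 0;                    // written by the CPU
    std::size_t tail = 0;                    // consumed by the GPU

    bool push(std::uint32_t packet) {        // CPU side
        if ((head + 1) % ring.size() == tail)
            return false;                    // full: the only time the CPU stalls
        ring[head] = packet;
        head = (head + 1) % ring.size();
        return true;
    }
    std::optional<std::uint32_t> pop() {     // GPU side
        if (tail == head)
            return std::nullopt;             // empty: the only time the GPU starves
        std::uint32_t packet = ring[tail];
        tail = (tail + 1) % ring.size();
        return packet;
    }
};

int main()
{
    CommandFifo fifo;
    fifo.push(0xCAFE);                       // CPU submits a packet...
    return fifo.pop().has_value() ? 0 : 1;   // ...and the GPU drains it later
}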

What has made (makes) feedback systems difficult is that GPUs are actually hundreds of instructions into the FIFO, so rendering something based on the result of an earlier operation in the same frame is always a potential GPU stall if the CPU has to read the data. It's one of the reasons GPU-based occlusion queries haven't really taken off: it's difficult to schedule the occluders and react to the occlusion query results while keeping the GPU from starving.

If you can pipeline the results of the CPU operations so they affect the following frame, then you can trivially eliminate the potential stall, and most GPUs have some mechanism to trigger an interrupt on the CPU side for it to do work. Say, base your lens flare on the previous frame's occlusion data, or add a frame of latency if you want to do a pass on an initially rendered frame to do work on the CPU before passing it back into the GPU to create the final image.
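The lens-flare case, for example, might look something like this; every gpu_* and draw_* call below is a made-up placeholder for whatever query mechanism the platform actually exposes:

#include <cstdint>
#include <cstdio>

// Placeholders only: these stand in for the platform's occlusion-query and
// draw calls, they are not a real API.
constexpr std::uint32_t kFlareQueryPixels = 64 * 64;     // area of the query quad
void gpu_begin_query()           {}
void gpu_end_query()             {}
bool gpu_query_ready()           { return true; }
std::uint32_t gpu_query_result() { return kFlareQueryPixels / 2; }
void draw_flare_query_quad()     {}
void draw_lens_flare(float i)    { std::printf("flare intensity %.2f\n", i); }

std::uint32_t visible_pixels[2] = {0, 0};                // double-buffered query results

void render_frame(int frame)
{
    const int cur = frame & 1, prev = cur ^ 1;

    // Draw the flare from LAST frame's visibility count: no GPU wait at all.
    draw_lens_flare(visible_pixels[prev] / float(kFlareQueryPixels));

    // Collect the query issued last frame (long finished by now), then issue
    // a fresh one around a small quad at the sun's position.
    if (gpu_query_ready())
        visible_pixels[cur] = gpu_query_result();
    gpu_begin_query();
    draw_flare_query_quad();
    gpu_end_query();
}

int main() { for (int f = 0; f < 3; ++f) render_frame(f); }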

I guess what I'm saying is that currently there is no requirement that the GPU be a frame behind the CPU. But that doesn't solve the working-together problem, because GPUs have very high latency. It doesn't mean you can't do it; it just means you have to have enough work for the GPU to do while you're farting around on the CPU before submitting the results back to it.

If you're just talking about one way (say, a final CPU pass) then it's trivial.
 
Predicated tiling in XB360 requires that the CPU mark up the vertex buffers after the GPU has performed its z-only prepass. There's a special function to do this.

Additionally, under Xbox Procedural Synthesis, the GPU is pulling data directly out of CPU-L2 cache.

So in XB360, at least, the CPU won't be running as far ahead. Indeed it could well be argued that at least one hardware thread on Xenon will be Xenos's slave.

As far as I can tell it's the intention within the XB360 design that skinning and tessellation be performed in the GPU, so the share of the workload between CPU and GPU is shifted as compared with earlier designs.

Jawed
 
ERP said:
I guess what I'm saying is that currently there is no requirement that the GPU be a frame behind the CPU. But that doesn't solve the working-together problem, because GPUs have very high latency.

I would like a practical figure for this "very high GPU latency" on next-gen consoles. Within one frame, the CPU sends many "atomic lists of commands" to the GPU: render shadow map, render scene, post-process...
IMHO, the time taken by one of these atomic lists to go through the GPU is typically no more than a fraction of a frame (or you're screwed anyway).
If the FIFO is empty enough, it seems reasonable to expect the result of one of these atomic lists within the same frame.

For instance, let's say we're dealing with a fairly complex rendering engine which needs to render 10 of these atomic lists, L0..L9.
The CPU sends L3, say, and can expect the result back after having sent L6.

ERP said:
It doesn't mean you can't do it; it just means you have to have enough work for the GPU to do while you're farting around on the CPU before submitting the results back to it.

As you say, it's not trivial to make sure the GPU never starves. To do so, you can see the rendering engine as a dependency graph of atomic lists Li; some special nodes are "sync nodes" which can only be walked through once a given Li is complete. Then you process the graph taking the dependencies into account...

This graph is just a formal way of imagining a complex rendering engine able to do something interesting instead of "stupidly" waiting for a linear list of commands to complete.

Doing so is not easy, but the increasing complexity of rendering engines actually helps here. People now frequently deal with: normal geometry, terrain, sky, shadow maps, stencil shadows, various procedural stuff, particles, effects, post-processing... blah blah. There must be a way to organise all that smartly so that the blocking time is sufficiently reduced.
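For instance, the dependency graph could be walked by a little scheduler like the toy sketch below (the graph, the fence numbers and the node names are all invented for the example):

#include <cstdio>
#include <vector>

// Toy dependency graph of "atomic command lists". A node becomes runnable once
// all of its dependencies have completed; a "sync node" additionally needs a
// GPU fence, so instead of stalling on it the scheduler just picks any other
// runnable node.
struct Node {
    const char* name;
    std::vector<int> deps;        // indices of prerequisite nodes
    int gpu_fence = -1;           // -1: pure CPU dependency, otherwise wait on fence
    bool done = false;
};

bool fence_passed(int fence, int gpu_progress) { return fence <= gpu_progress; }

int main()
{
    std::vector<Node> graph = {
        {"shadow map",   {},      -1},
        {"z prepass",    {},      -1},
        {"occlusion rd", {1},      1},   // sync node: needs fence 1 (z prepass finished on GPU)
        {"main scene",   {0, 2},  -1},
        {"post process", {3},      2},   // sync node: needs fence 2 (scene finished on GPU)
    };

    int gpu_progress = 0;                // fences the (simulated) GPU has passed
    for (bool progress = true; progress; ) {
        progress = false;
        for (auto& n : graph) {
            if (n.done) continue;
            bool ready = true;
            for (int d : n.deps) ready &= graph[d].done;
            if (!ready || !fence_passed(n.gpu_fence, gpu_progress)) continue;
            std::printf("submit %s\n", n.name);
            n.done = true;
            ++gpu_progress;              // pretend the GPU keeps pace, for the demo
            progress = true;
        }
    }
}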

To put it in a nutshell, it's a hard job, but it seems doable. Mostly a question of imagination, I would say...
 
Jawed said:
Predicated tiling in XB360 requires that the CPU mark up the vertex buffers after the GPU has performed its z-only prepass. There's a special function to do this.

It also makes me think that tiling is a nice way of reducing the size of the "atomic command lists", and thus makes it easier to achieve a tighter CPU/GPU sync.
Or am I missing something?

Jawed said:
Additionally, under Xbox Procedural Synthesis, the GPU is pulling data directly out of CPU-L2 cache.
So in XB360, at least, the CPU won't be running as far ahead. Indeed it could well be argued that at least one hardware thread on Xenon will be Xenos's slave.

... very interesting indeed. Does it mean there is NO way to have the GPU far behind when using this technique? Seems like a very tough constraint!

Jawed said:
As far as I can tell it's the intention within the XB360 design that skinning and tessellation be performed in the GPU, so the share of the workload between CPU and GPU is shifted as compared with earlier designs.

I'm not sure I understand the relevance of this point to the thread... Can you elaborate?
 
purpledog said:
It also makes me think that tiling is a nice way of reducing the size of the "atomic command lists", and thus makes it easier to achieve a tighter CPU/GPU sync.
Or am I missing something?
Perversely, I'm not sure that's going to happen any time soon.

Xenos can support up to 8 concurrent render states, I believe. I don't honestly understand the implications of that, apart from one stated advantage: because render states can overlap, the cost of switching states, which in DX9 is fairly high, is diminished.

Beyond geometry, there's still plenty of work for the CPU to do with respect to managing the shaders being run, and the multiple rendering passes it will take to construct a frame.

I say all this as a non-dev though - and getting devs to comment specifically about this stuff is pretty hard.

... very interesting indeed. Does it mean there is NO way to have the GPU far behind when using this technique? Seems like a very tough constraint!
I think it's possible, but that's more like programming older style GPUs. And because of the multiple rendering passes, the CPU has to keep on top of what the GPU is doing anyway.

I'm not sure I understand the relevance of this point to the thread... Can you elaborate?
Memexport is designed to allow Xenos to create vertex data.

This means it can
  • take vertex data from the CPU, plus animation parameters,
  • skin the models as appropriate
  • write the models to memory with parameters for the next stage
  • perform basic tessellation and/or adaptive tessellation and generate higher order surfaces
  • write the finalised models to memory
  • perform z-based tasks such as shadow generation (multiple passes)
  • perform the z-only prepass (to prepare predicated tiling)
  • perform predicated tiling - multiple passes of pixel shading
Something along those lines. The CPU initiates those stages. The GPU performs the actual computation for those stages.

I'm still trying to understand this stuff - so that's what I can gather.

You could program XB360 more traditionally, but it would miss out on the advantages of the architecture. Although predicated tiling does force a degree of change in itself, particularly if the developer wants to do any fancy multiple render target work - as the EDRAM is too small to support MRTs without futzing around with predicated tiling (unless they're all small).

The other question mark is on the performance (or performance implications) of memexport. It may put unexpectedly low ceilings on the capabilities of some of those stages I outlined above - a limit on the amount of data written, or a limit on the number of passes. Too early to say.

Jawed
 
purpledog said:
Some facts:
- being able to retrieve data from the GPU allows very interesting tricks
Can you elaborate on this?
- next-gen consoles are multi-core, so coders have to face the parallelism issue anyway
Yep, but that's not a good excuse to cripple the GPU too.. :)
One way of allowing the CPU to use the GPU is to make the GPU only a fraction of a frame behind (digits = frame number):
CPU: 444444444444 555555555555
GPU: 3333 44444444444 555555555555


That way, the CPU can send commands to the GPU and get the result back later within the same frame. This model can be seen as an exact copy of the previous (historical) one, but on a "sub-frame" basis. Indeed, it acts as if a frame were composed of n virtual sub-frames (n ~ 10).

Yeah, that's a bit tricky, but I can't live without bringing some of the RSX's power back into the Cell. What do you think?
In practice you're trying to avoid stalls while introducing a lot of 'mini' synchronizations.
It could work in some very specific cases, if you can guarantee that none of these tasks is going to bite you back (i.e. take a quasi-random time to complete), but in the general case I wouldn't use this, since it could stall the GPU or the CPU for a considerable amount of time, and there's nothing you can do to bound it.
One way to avoid this kind of problem is to 'waste' some memory and buffer all the data you need in advance.
Let's say you wrote your own stencil shadow renderer on CELL to offload shadowing from RSX: then you might need to read the z-buffer for any given frame back from video RAM onto CELL to render the stencil shadows.
At the same time you don't want RSX to wait for CELL to complete its shadowing work in order to be able to correctly shade the scene.
What you can do is 'slightly' change your classical rendering pipeline in order to use RSX to generate the z-buffer a frame in advance.
The first thing RSX does is generate the z-buffer for frame N; then it starts to shade frame N-1, since the z-buffer for that frame was generated during the previous frame.
So while RSX is shading frame N-1, CELL can perform shadowing on frame N.
Maybe CELL will have to wait for RSX to generate the z-buffer for the next frame, but this is not a big deal since we already know that CELL is going to execute a lot of different jobs at the same time, so that time could be spent working on something else, as long as our SPE job granularity is fine enough.
Btw... that's just a stupid example, but at this time I can't think of anything better than this ;)
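In code, the per-frame schedule would look roughly like this; every call below is just a placeholder for work kicked off on RSX or the SPEs:

#include <cstdio>

// Rough per-frame schedule for the example above; all functions are
// hypothetical stand-ins for real RSX command buffers and SPE jobs.
void rsx_render_zbuffer(int n)   { std::printf("RSX : z-buffer, frame %d\n", n); }
void rsx_shade_scene(int n)      { std::printf("RSX : shading,  frame %d\n", n); }
void cell_stencil_shadows(int n) { std::printf("CELL: shadows,  frame %d\n", n); }

void run_frame(int n)
{
    rsx_render_zbuffer(n);       // 1. RSX writes the z-buffer for frame N first
    // 2. RSX then shades frame N-1 (its z-buffer and CELL shadow data were
    //    produced last frame), while CELL reads back the frame-N z-buffer and
    //    builds stencil shadows for it. Neither processor waits on the other.
    rsx_shade_scene(n - 1);
    cell_stencil_shadows(n);     // runs concurrently with the shading in reality
}

int main() { for (int n = 1; n <= 3; ++n) run_frame(n); }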
 
"Maybe CELL will have to wait for RSX to generate"

I really have a problem with this. This is what I wrote before:

"Maybe i'm making it sound too simple but I think Cell could be a traditional GPU nightmare. Cell is only CPU around that could overload modern GPU's. Parallalism of Cell's independent spe goes far beyond simply wideing data bus or add faster memory. This SPE's will be sending instructions for physics processing, ray casting, and whatever esle cell can throw at the GPU simultaneously. So why did Sony decide Cell need to be a multi-core CPU in order execute these tasks simultaneously only to have a GPU bottleneck and killing the potential computing throughput?

Cell is to me like a team of eight movers told to load a moving truck. The GPU is that one guy waiting in the truck to stack up the boxes neatly. Now the movers all come back to the truck at the same time, every time, with different-sized boxes and hand them to that poor guy waiting to stack them neatly, simultaneously! How is the guy on the truck going to take all those boxes and stack them properly? I'd say this is what Nvidia were talking about when they said they'd never had a CPU that could feed a chip like this before. So how do you fix the problem? You get more people on the truck. If you put four guys on the truck to stack the different-sized boxes the eight movers are delivering, you avoid backups or bottlenecks."

nAo,

your statement just made me worry more because it seems Cell will generally overload the RSX. I don't see any reason why Sony and Nvidia would create a GPU with this problem. There must be a better way.
 
leechan25 said:
Cell is to me like a team of eight movers told to load a moving truck. The GPU is that one guy waiting in the truck to stack up the boxes neatly. Now the movers all come back to the truck at the same time, every time, with different-sized boxes and hand them to that poor guy waiting to stack them neatly, simultaneously!
Why do all the SPEs have to be passing data to the GPU? Why don't you want any SPEs working on physics, audio, AI and such? A better variation of your analogy is a removal firm with an office, a van and 8 employees, all capable of manning the phones, driving the truck, loading and unloading, etc. There's one guy who always stacks the boxes inside the truck. When there's a big job, perhaps loading thousands of individually boxed VCRs, all 8 employees can take items to the truck, with 1 helping to pass the boxes to the stacker. But most of the time the employees are doing different jobs and the stacker only has a couple of loaders helping him.
 
leechan25 said:
Cell is to me like a team of eight movers told to load a moving truck. The GPU is that one guy waiting in the truck to stack up the boxes neatly. Now the movers all come back to the truck at the same time, every time, with different-sized boxes and hand them to that poor guy waiting to stack them neatly, simultaneously! How is the guy on the truck going to take all those boxes and stack them properly? I'd say this is what Nvidia were talking about when they said they'd never had a CPU that could feed a chip like this before. So how do you fix the problem? You get more people on the truck. If you put four guys on the truck to stack the different-sized boxes the eight movers are delivering, you avoid backups or bottlenecks."

nAo,

your statement just made me worry more because it seems Cell will generally overload the RSX. I don't see any reason why Sony and Nvidia would create a GPU with this problem. There must be a better way.

Won't the DMA part of the Cell (the bit which does the grouping) provide some relief for this? In my mind it will be far more like sending a huge pallet of boxes to stack onto shelves than lots of little ones (and as we're dealing with RAM this is no issue, it just has to be consistent when RSX requires it). As for overload, yes, I'd imagine it is perfectly possible and probably intended to be so; after all, it's better that it is not limited by throughput.
 
leechan25 said:
your statement just made me worry more because it seems Cell will generally overload the RSX. I don't see any reason why Sony and Nvidia would create a GPU with this problem. There must be a better way.
My statement is not related at all to your fears; let me explain it better.
In my example CELL has to wait for RSX to generate a z-buffer, but in the meantime it can work on other stuff, so waiting is not a problem in this case.
What I want to avoid are stalls, not synchronization. Synchronization is fine as long as it doesn't introduce stalls, or doesn't cripple performance in any other way.
You're worried about CELL overloading RSX, and it can happen, but it's not like all the work CELL does has to be sent back to the GPU somehow, so don't worry.
Nonetheless a balance point has to be found.. ;)
 
OK, but would it be easier if the Cell SPEs were able to work directly with programmable SPEs on the GPU (if there are SPEs on the GPU) for more efficient synchronization between CPU and GPU (parallelism)?
 
The topic of CPU and GPU synchronisation is an extremely complex one (one which Frank and I have spent a lot of time talking about ;-) ).

The problem is that most people don't understand the importance of timing, and that these days we aren't running on a hard real-time system. Back in the days of yore (I did it on Amigas, but it's a lot older...) we used to do just this ("chasing the raster"), where we knew how many instructions the GPU would take to render something and synced everything perfectly.

I won't rule it out yet, but until I have had time to examine the real-world timing issues I won't even go there. Trust me, I've seen many clever people try this and waste masses of performance on much simpler systems; you have to understand at the most fundamental level how the machine works. Things like how the CPU and GPU communicate (polled token, interrupt, token stack?), bus contention, coherency models...

I feel it's something I should write an article on; over the years I've encountered so many people who just don't get it, because there aren't any good descriptions out there of why it's so hard. Don't take that the wrong way, I've seen several hardware vendors cock it up as well :D
 
leechan25 said:
OK, but would it be easier if the Cell SPEs were able to work directly with programmable SPEs on the GPU (if there are SPEs on the GPU) for more efficient synchronization between CPU and GPU (parallelism)?

It would make no difference.
The problem here isn't processor architecture; it's that any time you introduce additional synchronisation, you introduce a potential stall. This is fundamentally why algorithms don't generally scale with the number of processors.

A GPU is just another processor with a very specific job. Usually that processor is very loosely coupled, stalling the CPU only if the FIFO fills up, and stalling itself only if the CPU fails to submit polygons fast enough. But if you want closer cooperation, you will introduce more synchronisation primitives and, as a result, more stalls.
 
Stall

ERP said:
It would make no difference.
The problem here isn't processor architecture; it's that any time you introduce additional synchronisation, you introduce a potential stall. This is fundamentally why algorithms don't generally scale with the number of processors.

A GPU is just another processor with a very specific job. Usually that processor is very loosely coupled, stalling the CPU only if the FIFO fills up, and stalling itself only if the CPU fails to submit polygons fast enough. But if you want closer cooperation, you will introduce more synchronisation primitives and, as a result, more stalls.

Is there ever a decision in this kind of software to accept the possibility of an "x" amount of random stall, budgeting more time than the average load for predictable performance, or is it always tuned for the maximum performance limit?
 
nAo said:
Can you elaborate on this?
I believe GPGPU methods are all good examples, but I don't know them well enough to really argue here. So let's talk about rendering.

The holy grail of a renderer is to be output-sensitive, meaning that the complexity of the rendering is proportional to the number of pixels on the screen, not to the size of the scene. To do so, a renderer has to rasterize only visible triangles of a decent size, and must ignore hidden or too-small triangles. That directly leads to two very difficult problems: visibility and multi-resolution.

To solve these two issues, CPU-based techniques compute a very coarse approximation of these criteria:
- visibility: portals, precomputed visibility cells, bounding boxes
- multires: discrete LOD, smoother procedural stuff (SpeedTree)
Note that some of this is also done on the GPU:
- visibility: conditional rendering, view-frustum/back-face culling
- multires: anisotropic filtering + mipmaps for textures

But this is always completely on ONE side, which has always sounded weird to me. Indeed, one can see the CPU as the world, meaning that it produces all the description (meshes, animation...) of the 3D scene. The GPU (and the screen behind it) is an observer, an "eye". To achieve output sensitivity, the observer must query the world for what it needs, rather than the world blindly sending a copy of itself to the eye. Then the world sends the requested information... Ideally, it should be an exchange.

Ok, concretely...

The scene is described as a multi-resolution object. You first send the objects in front and start filling the z-buffer. Then you only render the bounding boxes of more distant objects and WAIT for the occlusion query to come back before rendering them: you don't want to waste work on hidden objects...
That way, RSX is saying to Cell: "That big object over there, well, I don't give a shit. Nevertheless, this one here takes up a considerable portion of the screen, could you please elaborate?". And then RSX and Cell get married and have a lot of children. Hum...
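On the CPU side, that exchange could boil down to something like this (all the calls are placeholders, and the pixel threshold is arbitrary):

#include <cstdio>

// Placeholder stubs: the GPU reports how many pixels each bounding box
// covered, and the CPU only generates/submits refined geometry for objects
// that actually matter on screen.
unsigned bbox_visible_pixels(int obj) { return obj * 1000u; }   // stand-in for a query readback
void submit_coarse_mesh(int obj)      { std::printf("object %d: coarse\n", obj); }
void submit_refined_mesh(int obj)     { std::printf("object %d: refined\n", obj); }

void refine_pass(int object_count)
{
    constexpr unsigned kRefineThreshold = 2000;     // pixels of screen coverage
    for (int obj = 0; obj < object_count; ++obj) {
        unsigned covered = bbox_visible_pixels(obj);
        if (covered == 0)
            continue;                               // hidden: don't waste any work
        if (covered < kRefineThreshold)
            submit_coarse_mesh(obj);                // small on screen: coarse LOD is enough
        else
            submit_refined_mesh(obj);               // "could you please elaborate?"
    }
}

int main() { refine_pass(4); }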

nAo said:
Yep, but that's not a good excuse to cripple the GPU too..
Granted. I guess it always comes down to what is gained compared to how much algorithmic complexity has been added. Nevertheless, let's argue:
- first, making a good game on a next-gen console will require programmers to have a good knowledge of parallelism. This knowledge applies to the CPU as well as the GPU (cheap)
- you don't want the CPU to stall. Well, no problem, because the CPU part of the game will already be divided into threads that you can trigger "anytime". (less cheap, but still cheap)
 
It is possible to use hardware occlusion queries in a fashion similar to the one proposed, although I have never actually seen it be a win in a real application.

The algorithm goes like this (a rough sketch in code follows the steps).

Submit occluders to GPU
Submit query volumes to GPU

Submit stuff that can't be occluded and can't occlude to GPU

Wait for the results of the occlusion queries; this will likely take multiple milliseconds

Use results to submit remaining geometry.
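Roughly, in code, with every call a placeholder stub:

#include <cstdio>

bool occlusion_results_ready()      { return true; }          // placeholder stubs
bool query_visible(int i)           { return (i & 1) != 0; }
void submit_occluders()             {}
void submit_query_volumes()         {}
void submit_unoccludable_geometry() {}
void submit_object(int i)           { std::printf("draw object %d\n", i); }
void do_other_cpu_work()            {}   // animation, audio, next frame's setup...

void render_frame(int object_count)
{
    submit_occluders();              // big, cheap occluders first
    submit_query_volumes();          // bounding volumes issued as occlusion queries
    submit_unoccludable_geometry();  // keeps the GPU fed during the wait

    while (!occlusion_results_ready())
        do_other_cpu_work();         // the multi-millisecond gap must be filled

    for (int i = 0; i < object_count; ++i)
        if (query_visible(i))        // only objects reported visible get submitted
            submit_object(i);
}

int main() { render_frame(8); }

The interesting part is the do_other_cpu_work() loop: if there is nothing useful to put there, the whole scheme is just a stall with extra steps.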

OK, assuming we can transition back to submitting geometry after we get the occlusion results without stalling the GPU, and that we can occupy the CPU while we're waiting, and we have enough CPU time to submit the remaining geometry without running over the end of the frame, then we might have a win.

IME the reason it usually isn't a win is because you are generally not vertex bound, so rendering the occluded objects isn't that much more expensive than rendering the occlusion queries, since the GPU is already discarding pixel work if it's occluded. It might be a win if your occluded objects were very high polygon count.

It's also difficult, though not impossible, to not end up with a significant stall somewhere.

There are some non-obvious issues associated with the way the FIFO is actually implemented, which varies from chip to chip and is rarely documented. NVIDIA on NV2x, for instance, doesn't actually have a hardware FIFO; it's implemented in software, which breaks the FIFO up into sections and only deals with resource locking at the start and end of each section. This is not documented anywhere as far as I know.
 
ERP said:
The problem here isn't processor architecture; it's that any time you introduce additional synchronisation, you introduce a potential stall.

IMO, that's indeed the core of the problem. It could almost be the title of this thread...

ihamoitc2005 said:
Is there ever a decision in this kind of software to accept the possibility of an "x" amount of random stall, budgeting more time than the average load for predictable performance, or is it always tuned for the maximum performance limit?

From what I understand, you're screwed because it's random. One solution here: the stall must be amortised, meaning that you must find something else to do. On the CPU side, I would not worry too much: as nAo said, Cell has to be massively parallel anyway. On the GPU side, well, that's the issue...

nAo said:
One way to avoid this kind of problem is to 'waste' some memory and buffer all the data you need in advance.
Let's say you wrote your own stencil shadow renderer on CELL to offload shadowing from RSX: then you might need to read the z-buffer for any given frame back from video RAM onto CELL to render the stencil shadows.
At the same time you don't want RSX to wait for CELL to complete its shadowing work in order to be able to correctly shade the scene.
What you can do is 'slightly' change your classical rendering pipeline in order to use RSX to generate the z-buffer a frame in advance.
The first thing RSX does is generate the z-buffer for frame N; then it starts to shade frame N-1, since the z-buffer for that frame was generated during the previous frame.
So while RSX is shading frame N-1, CELL can perform shadowing on frame N.

Looks like a brilliant solution, but I'm not sure I understand everything here... Let me rephrase. You're saying that it's possible to tightly synchronise the CPU and GPU on frame N as long as you can avoid GPU stalls by finishing the job of frame N-1.

Very simply, something like:

[attached diagram: syncpoint.png]


I'm wondering what happens when adding new sync points to achieve finer granularity for the GPU thread (using tiles?).
Hey! That sounds weird: "GPU threads".
 
DeanoC said:
I won't rule it out yet, but until I have had time to examine the real-world timing issues I won't even go there. Trust me, I've seen many clever people try this and waste masses of performance on much simpler systems; you have to understand at the most fundamental level how the machine works. Things like how the CPU and GPU communicate (polled token, interrupt, token stack?), bus contention, coherency models...

So...
Do you see any way of doing some complex CPU/GPU sync on PS3? Or is it perhaps too early to speculate on what's possible and what's not?
Please keep in mind that I'm not necessarily looking for a solution that a game company would want to commit to, let's say, tomorrow.
I'm more interested in "potential" solutions. Let's say: "light at the end of the tunnel" ;)
 