IMRs and temporal coherency

nAo

Nutella Nutellae
Veteran
Stimulated by SA's post about IMR efficiency, I was thinking about how NV30 could perform as fast as an R300 even while adopting just a 128-bit data bus.
IMHO nvidia will employ some really smart and efficient compression scheme for frame and z/stencil buffers and an improved version of their memory controller (actually their compression engine sits in the memory controller).
Then I asked myself what the next move might be in the rendering-efficiency war.
Maybe this was discussed in the past but I'm not sure at all.
So... what about the hw extracting information from a sequence of frames and re-using that info in the next frame to improve performance?
I know this is not a novel idea... my question revolves around what kind of info the hw could extract and re-use.
What about keeping track (using internal performance counters) of some functional parameter the hw could change, frame by frame, according to the nature of the data the GPU is processing?
A more advanced technique could track screen areas and try to differentiate and predict the work on each given area from an already-rendered sequence of frames (it could be anything from memory access patterns and statistics on buffer compression to occlusion information; what about tracking occluder movements...).
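Something like this rough sketch, maybe (all names and numbers here are invented for illustration and have nothing to do with real hardware): keep a short per-tile history of overdraw counts from recent frames, and use the trend to decide which screen areas deserve extra occlusion work next frame.

```python
import numpy as np

TILES_X, TILES_Y, HISTORY = 8, 8, 4   # screen split into 8x8 tiles, 4 frames of history

class TileHistory:
    """Rolling per-tile statistics (here: overdraw counts) from the last few frames."""
    def __init__(self):
        self.history = np.zeros((HISTORY, TILES_Y, TILES_X), np.float32)

    def end_of_frame(self, overdraw_counts):
        # push this frame's per-tile counters, dropping the oldest frame
        self.history = np.roll(self.history, 1, axis=0)
        self.history[0] = overdraw_counts

    def predicted_overdraw(self):
        # crude prediction: weight recent frames more heavily
        w = np.array([0.5, 0.25, 0.15, 0.10])[:, None, None]
        return (self.history * w).sum(axis=0)

    def tiles_worth_extra_occlusion_work(self, threshold=3.0):
        # tile coordinates where predicted overdraw justifies more aggressive culling
        return np.argwhere(self.predicted_overdraw() > threshold)
```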
Speculative rendering? :)
I'd love some comment....

ciao,
Marco
 
Some thoughts:
  • If you do temporal antialiasing (motion blur) by rendering N frames, each representing a time step, and then blending them together, you could for each polygon in the scene render that polygon to all the N frames before you move on to the next polygon. This should allow extensive reuse of vertex and texture data, decreasing memory traffic substantially. It may strain the framebuffer cache, though (a rough sketch of this loop ordering follows the list).
  • The performance counters present in modern GPUs could conceivably be used to monitor performance at run-time, and these data may be used to adjust stuff like texture prefetching behavior, memory access granularity, cache replacement algorithms, memory layout for textures/framebuffer etc. even from one 3d object to another. This may present an interesting challenge to driver writers, though (if this sort of thing is not already being done today).
  • Virtual texture memory, such as that supported by 3dlabs P10, can help reduce the pressure on onboard texture memory as long as the working set of textures (texels actually drawn) does not change too dramatically from one frame to the next. A similar idea can be applied to vertex arrays as well, if bounding volumes can be used to determine that their associated objects are never drawn.
  • Direct reuse of framebuffer data from one frame to the next probably only makes sense when you can guarantee that at least some objects or landscapes absolutely do not move relative to camera at all or change in any other way (like e.g. dynamic lighting) - in this case, you may be better off just grabbing the immobile portions of the frame and making a texture map out of them (barring Z/stencil problems; some modern cards do allow you to write a texture map to the Z buffer these days, though).
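A rough sketch of the loop ordering from the first point, with a trivial software "rasterizer" standing in for the GPU (the rectangle scene and the draw_rect helper are invented purely for illustration): each polygon is drawn into all N sub-frame buffers before the next polygon is touched, so its data only needs to be fetched once, and the sub-frames are then blended with equal weights.

```python
import numpy as np

N, W, H = 4, 64, 64                         # sub-frames per output frame, resolution
subframes = np.zeros((N, H, W, 3), np.float32)

# toy "scene": each polygon is an axis-aligned rect with a per-frame velocity
scene = [
    {"pos": (5, 5), "size": (10, 10), "vel": (8, 0), "color": (1.0, 0.2, 0.2)},
    {"pos": (30, 40), "size": (12, 6), "vel": (0, -6), "color": (0.2, 0.8, 1.0)},
]

def draw_rect(buf, x, y, w, h, color):
    """Stand-in for rasterization: fill a rectangle, clipped to the buffer."""
    x0, y0 = max(int(x), 0), max(int(y), 0)
    x1, y1 = min(int(x + w), W), min(int(y + h), H)
    if x0 < x1 and y0 < y1:
        buf[y0:y1, x0:x1] = color

# outer loop over polygons, inner loop over time steps: the vertex/texture data
# for one polygon is touched once and then reused for all N sub-frames
for poly in scene:
    w, h = poly["size"]
    for i in range(N):
        t = i / N                           # sub-frame time within this output frame
        x = poly["pos"][0] + poly["vel"][0] * t
        y = poly["pos"][1] + poly["vel"][1] * t
        draw_rect(subframes[i], x, y, w, h, poly["color"])

final = subframes.mean(axis=0)              # equal-weight blend = box-filtered motion blur
```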
 
arjan de lumens said:
Direct reuse of framebuffer data from one frame to the next probably only makes sense when you can guarantee that at least some objects or landscapes absolutely do not move relative to camera at all or change in any other way (like e.g. dynamic lighting) - in this case, you may be better off just grabbing the immobile portions of the frame and making a texture map out of them (barring Z/stencil problems; some modern cards do allow you to write a texture map to the Z buffer these days, though).

I do not want MPEG compression artefacts in my 3D games!
 
I don't think rendering the scene N times per frame is a good way of doing temporal anti-aliasing. This has been done many times by the demoscene, and it has never looked particularly good.

Anyways, it doesn't sound fast even if you did do what you described: Consider a skinned and skeletally animated model. For each vertex you would have to calculate N positions, each involving a potentially large update of the vertex shader constants... If you calculate just 2 and then lerp between them this is less prohibitive, but: you don't know in advance how long it will take to render your frame, so the end-of-frame vertex positions will be incorrect. Also, if the frame rate is low, then the path of a vertex becomes much more non-linear. And what about collisions?
The path of a whole object's vertices would be discontinuous...

I dunno, it just doesn't sound like this approach models motion blur very well. Why not just render the N frames normally (meaning each frame will have accurate vertex positions and projection matrices etc... supplied by the graphics engine)? So long as graphics cards get faster (and monitors support faster refresh rates), you will automatically get the type of motion blur you describe (but each frame will be more accurate than if you attempt to approximate game engine animation output inside the GPU).

Interesting topic though... Does anyone know how motion blur is handled by off-line renderers?
 
Riddle me this:

I've experimented with some low-end offline renderers (I own truespace 6) and, while I haven't checked recently, it seemed to just render n frames and blend them together with equal weight. Using a 16-frame sample actually gives relatively good results provided you use a short enough time window and also provided the motion isn't too quick relative to that time window.

Anyhow, let's assume (for a starting spot) that you can get 60 FPS from your app/card combo.

What would be the problem with a system where you have, say, five output buffers. Each time you generate a frame, you put it in the next buffer in a cyclical fashion. For the output to the monitor (or dac, whatever) you combine all five buffers, yet you weight them bell curve style, such that the 3rd frame in line is, say, 50%, the 2nd and 4th are 20%, and the 1st and 5th are 5%. The numbers can be dinked with until the output is pleasing (whatever that means.)
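Roughly like this toy sketch, with frames as plain numpy arrays and the 5/20/50/20/5 weights from above (nothing here pretends to be how a real card would implement it):

```python
import numpy as np
from collections import deque

WEIGHTS = np.array([0.05, 0.20, 0.50, 0.20, 0.05])   # bell-curve style, sums to 1.0

class BlurQueue:
    """Keep the last five rendered frames and blend them for scan-out."""
    def __init__(self, height, width):
        self.frames = deque(maxlen=5)
        self.shape = (height, width, 3)

    def push(self, frame):
        self.frames.append(frame.astype(np.float32))

    def display_frame(self):
        n = len(self.frames)
        if n < 5:
            return sum(self.frames) / n        # warm-up: plain average of what we have
        out = np.zeros(self.shape, np.float32)
        for weight, frame in zip(WEIGHTS, self.frames):   # oldest ... newest
            out += weight * frame
        return out

# usage: push each newly rendered frame, scan out the weighted blend
q = BlurQueue(480, 640)
for t in range(8):
    q.push(np.full((480, 640, 3), t * 32.0))   # stand-in for a freshly rendered frame
    blended = q.display_frame()
```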

Sure, you aren't really viewing "this" frame; the center of the time window is the frame from 1/30th (2/60ths) of a second ago, but I honestly don't think that would be much of an issue.

As a side note, do you remember the "latency" diatribes concerning the old ATI MAXX cards? Specifically, some gamers said they would be able to tell if they weren't viewing "this" frame, but were rather viewing the frame before "this" frame. Had something to do with each vpu displaying its image while the other vpu was working on the current image for the next frame.

Anyhow, I don't think a 1/30th of a second difference between what the computer displays and what we see is going to make a whole heck of a lot of difference. We have built-in biological latencies in our bodies that are probably worse than this with regards to the vision system.

Anyhoo... is this totally off the mark, or what? I personally don't see why people want to render five frames for each frame when we could simply keep a cache of previous frames for averaging. (I'm kind of sidestepping the issues of what happens when the actual FPS gets too low, or for that matter, too fast.)
 
flf said:
As a side note, do you remember the "latency" diatribes concerning the old ATI MAXX cards? Specifically, some gamers said they would be able to tell if they weren't viewing "this" frame, but were rather viewing the frame before "this" frame. Had something to do with each vpu displaying its image while the other vpu was working on the current image for the next frame.

I know it's totally ridiculous sounding, but I wouldn't discount their statements out of hand. From my own experiences, I have found that people actually have acute senses...much more so than we might expect. For example, when your FPS drops down to 30 from 60, you can really feel the difference. And some things like mouse lag...you can't really see them, but you can just feel them. In that case you'd never be able to tell unless you actually are moving the mouse yourself, but that doesn't mean it doesn't exist.

All that said, there's a good chance they were just full of BS, but just enabling triple buffering seems to give me a sense of mouse lag in UT2k3. If I'm really sensing that tiny little difference thanks to enabling an extra buffer, then I'm sure they could do the same with the Maxx. On the other hand maybe TB is doing some other weirdness that's causing it, so who knows...
 
I'm not sure if it's related or not, but someone mentioned to me that NVIDIA has a patent on reusing generated pixel data in some cases.
 
psurge said:
I don't think rendering the scene N times per frame is a good way of doing temporal anti-aliasing. This has been done many times by the demoscene, and it has never looked particularly good.
Fair enough. Any better ideas?
Anyways, it doesn't sound fast even if you did do what you described: Consider a skinned and skeletally animated model. For each vertex you would have to calculate N positions, each involving a potentially large update of the vertex shader constants... If you calculate just 2 and then lerp between them this is less prohibitive, but: you don't know in advance how long it will take to render your frame, so the end-of-frame vertex positions will be incorrect. Also, if the frame rate is low, then the path of a vertex becomes much more non-linear. And what about collisions?
The path of a whole object's vertices would be discontinuous...
The problem with the vertex shader constants can be solved by just storing N instances of them on-chip, one for each of the N time steps, and then swapping between them as needed, which should be quite fast. (This would, of course, impose a hardware limitation on N.) Non-linear/discontinuous vertex paths shouldn't be a problem - just set up N sets of transform matrices/vertex shader constants corresponding to a set of N points on the path. The time it takes to render a frame can, most of the time, be estimated by the time it took to render the previous frame (although there are obviously cases where this doesn't quite work).

The benefit of this scheme is that each vertex is loaded once instead of N times (although it, of course, gets T&Led/shaded N times), and that redrawing the same polygon into the N frames will allow the texture cache to be loaded once for all the N instances - instead of N times, once for each instance (assuming that each polygon by itself isn't large enough to overflow the texture cache) - in sum saving memory bandwidth.

The main disadvantage is that you need to estimate an appropriate value of N before you start rendering the frame, rather than being able to adjust N in the middle of the rendering.
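A rough sketch of the intended loop structure, with a plain matrix multiply standing in for the vertex shader (all the numbers are invented): the N constant banks are set up once, each vertex is fetched from the vertex buffer a single time, but it is shaded once against each bank.

```python
import numpy as np

N = 4                                        # hardware-limited number of time steps

def make_transform(t):
    """Stand-in for the app supplying per-time-step constants (here: a small rotation)."""
    a = 0.1 * t
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

# N banks of "vertex shader constants", set up once before rendering starts
constant_banks = [make_transform(i / N) for i in range(N)]

vertex_buffer = np.random.rand(1000, 3)      # fetched from memory once...
shaded = np.empty((N, *vertex_buffer.shape))
for i, m in enumerate(constant_banks):       # ...but transformed once per bank
    shaded[i] = vertex_buffer @ m.T
```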
 
nAo said:
Stimulated by a SA's post about IMR efficiency

We had this conversation 2 years ago :) Feeling a bit nostalgic, anyways:


"By taking full advantage of the spatial and temporal coherence in a typical realistic scene, it is possible to access and process (essentially) only the visible triangles and to access them only once using a tiling algorithm. I say essentially since there are instances where some forms of dynamic behavior require access to some hidden primitives. In realistic scenes these are rare, since most of the scene is fairly static (terrain, walls, floors, furniture, buildings, etc.). In addition most dynamic objects and dynamic vertices move coherently so they rarely require access to hidden primitives. - SA, circa 2000

I have the rest of his post, but feel kinda awkward quoting a member of the board like this, in this situation.
 
As a side note, do you remember the "latency" diatribes concerning the old ATI MAXX cards?

What's the difference between the inherent frame of latency when using AFR and that found in current Region-Based Deferred Rendering schemes? Because I don't ever recollect people conversing on this topic c/o a Tiler.
 
Vince said:
What's the difference between the inherent frame of latency when using AFR and that found in current Region-Based Deferred Rendering schemes? Because I don't ever recollect people conversing on this topic c/o a Tiler.

Yes, they should be the same for tilers. However, since many people have not used a tiler, they have no way to know whether it is a problem or not. Furthermore, FPS games are getting faster and faster (UT2003 is even faster than Q3A IMHO). Such a problem (if it really exists) may be more evident now than in the past.

I think the only way to know whether such latency could be a problem or not is to try it yourself. When people were debating 30fps vs 60fps, some believed that the human brain can't process more than 20 images per second, and therefore that there can't be any difference between 30fps and 60fps. However, that is clearly wrong. Some people believe that 48kHz is enough for digital sampling of audio; however, there are still some people who can hear the difference between 48kHz and 96kHz even on the same system. So I won't just say "it is a non-issue for everyone" in a review, since I don't really know about it. I still think it is worth pointing out.
 
Vince said:
nAo said:
Stimulated by a SA's post about IMR efficiency

We had this conversation 2 years ago :) Feeling a bit nostalgic, anyways:


"By taking full advantage of the spatial and temporal coherence in a typical realistic scene, it is possible to access and process (essentially) only the visible triangles and to access them only once using a tiling algorithm. I say essentially since there are instances where some forms of dynamic behavior require access to some hidden primitives. In realistic scenes these are rare, since most of the scene is fairly static (terrain, walls, floors, furniture, buildings, etc.). In addition most dynamic objects and dynamic vertices move coherently so they rarely require access to hidden primitives. - SA, circa 2000

I have the rest of his post, but feel kinda awkward quoting a member of the board like this, in this situation.

I remember this quite well. We talked about the fluidstudios REVi engine, which was meant to do (IMHO) just what you describe above in SA's post. Unfortunately Fluidstudios no longer has information available about this concept.
 
arjen,

let me think about it some more... On a related note, what happens to d-mapped and tessellated primitives in this scheme? The actual geometry data (vertices and number of triangles) would be changing within the N frames being simultaneously rendered.

Serge
 
psurge said:
arjen,

let me think about it some more... On a related note, what happens to d-mapped and tessellated primitives in this scheme? The actual geometry data (vertices and number of triangles) would be changing within the N frames being simultaneously rendered.

Serge

Haven't thought that fully through yet - hmmm.. OK.

Tessellation comes before T&L/vertex shading in the 3d pipeline, so if the tessellation level is kept constant across the N frames (and, for displacement mapping, a static d-map is used), it should be possible to do the tessellation only once and then pass on the results to be vertex-shaded and rendered N times.

On the other hand, if the tessellation level varies or a dynamic displacement map is used, then you would, instead of rendering one triangle to N frames at a time, render one higher-order-surface patch to N frames at a time, requiring you to do the tessellation N times per patch before you move on to the next patch. If the d-map is kept constant, it should be possible to cache it so that it is read only once for all the N tessellation passes.

That should work just fine, as far as I can see.
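A rough sketch of the two cases, with hypothetical tessellate() and vertex_shade() functions standing in for the real pipeline stages (the toy displacement and the i/N timing are invented):

```python
import numpy as np

N = 4                                    # time steps rendered per output frame

def tessellate(level, t=0.0):
    """Hypothetical tessellator: grid of (u, v, height) vertices for one patch.
    t only matters when the displacement map itself is dynamic."""
    u = np.linspace(0.0, 1.0, level)
    uu, vv = np.meshgrid(u, u)
    disp = 0.1 * np.sin(6.28 * uu.ravel() + t)        # toy displacement map
    return np.stack([uu.ravel(), vv.ravel(), disp], axis=1)

def vertex_shade(verts, t):
    """Stand-in for the per-time-step vertex shader constants (here: a translation)."""
    return verts + np.array([t, 0.0, 0.0])

level = 8

# constant tessellation level + static d-map: tessellate once, shade N times
verts = tessellate(level)
frames_static = [vertex_shade(verts, i / N) for i in range(N)]

# dynamic d-map: the tessellation itself depends on the time step,
# so it has to run N times per patch before moving on to the next patch
frames_dynamic = [vertex_shade(tessellate(level, t=i / N), i / N) for i in range(N)]
```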
 
arjan,

1. sorry for misspelling your name in my previous post.

2. not treating things as N discrete time steps seems really complicated. The best motion blur I have ever seen was the output of a ray-tracer which, for each pixel, sent out rays stochastically distributed in both time and space. Just as some form of sample jittering gives better FSAA, I think you would need to jitter samples temporally, per pixel, to get good motion blur... here's the best I can think of.

Take N states, producing N sets of vertices ready for rasterization. For step i (1 <= i <= N), compute sample colors the way you normally would, but, in addition, use the geometry from step i + 1 to compute a sample velocity (in 3d).

Now, for a sample S at (x,y) in sample buffer B taken at time T (which looks like a frame-buffer but with a velocity attached to each sample position), you could use the sample velocity to figure out which pixel the sample contributes to at time T + dT.

Convert each sample to multiple pixels by varying T (smear the sample across the actual framebuffer). How much a sample contributes to a single pixel would depend on the length of time it takes the sample to traverse the pixel.
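A rough sketch of that smearing step for a single sample, in plain Python (the buffer sizes, the sub-stepping, and the compositing against a background are all invented; a real implementation would need something smarter for coverage):

```python
import numpy as np

W, H = 64, 64
accum = np.zeros((H, W, 3), np.float32)     # color weighted by dwell time
dwell = np.zeros((H, W), np.float32)        # fraction of the exposure spent in each pixel

def smear_sample(x, y, vx, vy, color, exposure=1.0, substeps=32):
    """Spread one shaded sample along its screen-space velocity over the exposure;
    each pixel's share is proportional to the time the sample spends inside it."""
    for k in range(substeps):
        t = exposure * (k + 0.5) / substeps
        px, py = int(x + vx * t), int(y + vy * t)
        if 0 <= px < W and 0 <= py < H:
            accum[py, px] += np.asarray(color, np.float32) / substeps
            dwell[py, px] += 1.0 / substeps

# usage: one bright sample moving 20 pixels to the right during the frame
smear_sample(10.5, 32.0, 20.0, 0.0, (1.0, 0.9, 0.6))

# composite over a background, treating dwell time as coverage
background = np.zeros((H, W, 3), np.float32)
coverage = np.clip(dwell, 0.0, 1.0)[..., None]
framebuffer = accum + (1.0 - coverage) * background
```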

just an idea... i'm honestly not sure if this would work or produce results much better than just upping N and doing as you suggest.

best regards,
Serge
 
Well, I heard of one motion blur technique that randomly renders only about 10% of the frame, then steps time forward, renders another 10%, and so on. This could produce a nice motion blur effect, but it may require significant supersampling to avoid a grainy look.
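Something like this toy sketch (the sweeping-bar "shader" and the 10 time steps are invented; the pixels that never get picked are exactly where the graininess comes from):

```python
import numpy as np

rng = np.random.default_rng(1)
W, H, STEPS = 64, 64, 10

accum = np.zeros((H, W, 3), np.float32)
hits = np.zeros((H, W, 1), np.float32)

def shade(x, y, t):
    """Toy 'shader': a bright vertical bar sweeping across the screen over the exposure."""
    return np.ones(3, np.float32) if abs(x - t * W) < 4 else np.zeros(3, np.float32)

for step in range(STEPS):
    t = step / STEPS
    mask = rng.random((H, W)) < 0.10          # render ~10% of the pixels this step
    for y, x in zip(*np.nonzero(mask)):
        accum[y, x] += shade(x, y, t)
        hits[y, x] += 1.0

result = accum / np.maximum(hits, 1.0)        # untouched pixels stay black -> grain
```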

For realtime graphics, I don't think motion blur for anything other than special effects is going to be feasible for some time, unfortunately. The first attempts that will be feasible will likely use linear interpolation to generate intermediate frames more quickly than the same number of discrete frames.

That is, the hardware calculates the velocity of each vertex in screen space, and interpolates its position across the screen for each of the intermediate frames. This might not decrease the fillrate requirements of the intermediate frames, but should make them a heck of a lot easier on the vertex shader.
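A rough sketch of that interpolation, with post-transform screen positions as plain arrays (everything here is invented for illustration):

```python
import numpy as np

def intermediate_positions(pos_start, pos_end, n_intermediate):
    """Linearly interpolate post-transform screen-space vertex positions
    across the frame interval; only the endpoints ran the full vertex shader."""
    ts = np.linspace(0.0, 1.0, n_intermediate + 2)[1:-1]   # strictly between the endpoints
    return [pos_start + (pos_end - pos_start) * t for t in ts]

# usage: three extra frames between two fully vertex-shaded frames
start = np.array([[100.0, 200.0], [150.0, 210.0], [120.0, 260.0]])   # one triangle
end = start + np.array([30.0, -10.0])                                # where it ended up
for verts in intermediate_positions(start, end, 3):
    # rasterize the triangle at these positions; fill rate is still paid per frame,
    # but the vertex shader only ran for the two endpoint frames
    pass
```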
 
psurge,

I am not sure if the pixel 'smearing' method could be made to work. Specifically, using plain alpha blending for the smearing will almost certainly fail, resulting in an object that is translucent where it should be opaque (especially noticeable for slow-moving objects). Other methods for computing the per-pixel coverage might work - although figuring out a data representation that will work may be a lot of work.

Another idea that builds on the jittered grid supersampling method you suggested: if we have a triangle present at time steps X and X+1, we could extrude a prism-like object along the time axis, with the triangle at the two time steps becoming the two endpoints (just visualize the time 'axis' as a 3rd space axis for now), and then do the jittered sampling, rasterizing for the triangle only the sample points that actually end up inside the 'prism'. In case of multiple time steps, we form such a 'prism' and sample within it for each pair of adjacent time steps. This way, we can get sample points anywhere in between the N time steps rather than just at the time steps themselves.
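A rough sketch of that prism test for one pixel, in plain Python (the coordinates and sample count are invented): jitter samples in x, y and t, slice the prism at each sample's t by lerping the triangle's vertices, and count the hits.

```python
import numpy as np

rng = np.random.default_rng(0)

def point_in_tri(p, a, b, c):
    """2D point-in-triangle test via the signs of the edge cross products."""
    def edge(o, u, v):
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])
    d1, d2, d3 = edge(a, b, p), edge(b, c, p), edge(c, a, p)
    has_neg = (d1 < 0) or (d2 < 0) or (d3 < 0)
    has_pos = (d1 > 0) or (d2 > 0) or (d3 > 0)
    return not (has_neg and has_pos)

def coverage(tri_t0, tri_t1, px, py, samples=64):
    """Fraction of jittered (x, y, t) samples in pixel (px, py) that land inside
    the 'prism' swept by the triangle between two adjacent time steps."""
    hits = 0
    for _ in range(samples):
        x, y, t = px + rng.random(), py + rng.random(), rng.random()
        # slice the prism at time t: lerp the triangle's vertices to that time
        tri = [(1 - t) * np.asarray(v0) + t * np.asarray(v1)
               for v0, v1 in zip(tri_t0, tri_t1)]
        hits += point_in_tri((x, y), *tri)
    return hits / samples

# usage: a small triangle moving 8 pixels to the right between the two time steps
t0 = [(2.0, 2.0), (6.0, 2.0), (4.0, 6.0)]
t1 = [(10.0, 2.0), (14.0, 2.0), (12.0, 6.0)]
print(coverage(t0, t1, px=7, py=3))
```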

The amount of supersampling needed to give really good quality motion blur is a bit difficult to estimate offhand - I've heard estimates of about 100x being used for solutions like the ray-tracer you mentioned.

Also, a question for me to ponder for a while: does temporal texture filtering make any sense?

Chalnoth said:
Well, I heard of one motion blur technique that randomly renders only about 10% of the frame, then steps time forward, renders another 10%, and so on. This could produce a nice motion blur effect, but it may require significant supersampling to avoid a grainy look.

Yes, it will require a lot of supersampling - I'd estimate about 10x or more is needed to get rid of the worst graininess, which would correspond to ~100 time steps. Also, the framebuffer cache is going to thrash insanely with such a scheme.
 
arjan... thought about it some more, and... neither of our methods will capture motion blur of, for lack of a better term, "indirect rendering effects". They jitter sample positions based on local geometric velocity without considering how to jitter pixel shader constants, or the non-local effects of the moving geometry. Examples of what I mean: the blurred shadows of a high-velocity occluder, or the blur resulting from a point light moving across a static scene very quickly.

Perhaps some form of time jitter could be used, but it sounds like you would still need a very large number of samples per pixel to get acceptable results...

anyways... i'm back to thinking about temporal coherency.
Serge
 
Isn't this what Talisman did? Using compositing and 2D image-space processing (IBR warping), it reused previously rendered parts of the scene.


This seems to work nicely if the camera isn't moving, or is moving slowly, or if large parts of the screen aren't changing.
 