AFR: Preferred SLI Rendering Mode

pcchen said:
I'm curious. How does it improve response time, as the time to render a particular frame does not change?
Right, but rendering a particular frame is only part of the entire latency between input and display. The time to render doesn't change, but the other parts take less time (assuming you're not CPU-limited, of course... but then the framerate won't increase).
 
Chalnoth said:
3dilettante said:
I recall that a disconcertingly large number of situations had it performing worse than the single-chip variant, and that's with the two chips on the same PCB. Perhaps it was just bad luck.
Bad luck? You see a poor implementation from XGI and you put it down to bad luck?

I was trying to be polite. ;)
 
Chalnoth said:
Right, but rendering a particular frame is only part of the entire latency between input and display. The time to render doesn't change, but the other parts take less time (assuming you're not CPU-limited, of course... but then the framerate won't increase).

Could you elaborate on this? I think the response time is the time between the input and the completion of the rendering. In the case of AFR, a particular frame is rendered entirely on one video card. Although another card may start to render another frame, that would not decrease the response time.
 
pcchen said:
Could you elaborate on this? I think the response time is the time between the input and the completion of the rendering. In the case of AFR, a particular frame is rendered entirely on one video card. Although another card may start to render another frame, that would not decrease the response time.
Well, here we go:
1. First, you need to calculate the frame on the CPU.
2. Second, you need to wait for the buffer between the CPU and GPU. This needs to be at least double-buffered for there to be shared processing.
3. Third, the GPU renders (this is the only part whose latency does not decrease)
4. Fourth, you have to wait for the buffer swapping for display.
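Put another way, the end-to-end latency is just the sum of these four stages, and only the render stage itself stays fixed; a quick sketch with made-up numbers:

```python
# End-to-end latency as the sum of the four stages above (all times in ms).
# Only the GPU render stage stays fixed under AFR; the waits in stages 2
# and 4 are what shrink when frames are produced more often.
def frame_latency(cpu_ms, buffer_wait_ms, render_ms, swap_wait_ms):
    return cpu_ms + buffer_wait_ms + render_ms + swap_wait_ms

# Purely illustrative numbers: 10 ms of CPU work, 40 ms stalled waiting on
# the CPU->GPU buffer, 50 ms of rendering, no wait for the swap (vsync off).
print(frame_latency(10, 40, 50, 0))  # 100
```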
 
Chalnoth said:
Well, here we go:
1. First, you need to calculate the frame on the CPU.
2. Second, you need to wait for the buffer between the CPU and GPU. This needs to be at least double-buffered for there to be shared processing.
3. Third, the GPU renders (this is the only part whose latency does not decrease)
4. Fourth, you have to wait for the buffer swapping for display.

1. AFR does not decrease anything.
2. This does not need to be double buffered. You can use rename buffers.
3. Ditto.
4. This does not necessarily decrease when Vsync is on, and does not change when Vsync is off.
 
pcchen said:
Chalnoth said:
Well, here we go:
1. First, you need to calculate the frame on the CPU.
2. Second, you need to wait for the buffer between the CPU and GPU. This needs to be at least double-buffered for there to be shared processing.
3. Third, the GPU renders (this is the only part whose latency does not decrease)
4. Fourth, you have to wait for the buffer swapping for display.

1. AFR does not decrease anything.
2. This does not need to be double buffered. You can use rename buffers.
3. Ditto.
4. This does not necessarily decrease when Vsync is on, and does not change when Vsync is off.
1. If you are not CPU-limited, the CPU can calculate more frames per unit time if AFR is enabled, thus reducing latency.
2. Not sure exactly what you mean by a rename buffer, but the CPU can't be processing AI, physics, and whatnot if it's busy dealing with the stuff being sent to the graphics card, i.e. you need double buffering.
3. ...
4. Once again, as long as this is not the limitation, there will be more frames available for display, and thus the latency of this step will again be reduced.
 
If you are GPU-limited, it means your command buffer is filling up faster than it can be emptied. Because AFR mode addresses each chip independently, you end up with two command buffers. This allows the commands for frame 2 to be issued to GPU B immediately, instead of waiting for GPU A to finish accepting all the commands for frame 1.

Frame 3 will use GPU A while GPU B is still finishing Frame 2
Frame 4 will use GPU B while GPU A is still finishing Frame 3
Frame 5 will use GPU A while GPU B is still finishing Frame 4
and so on, in theory.
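A toy sketch of that per-frame routing (purely illustrative; real drivers are more involved):

```python
# Toy sketch of AFR command submission: each GPU has its own command
# buffer, and whole frames alternate between them, so the commands for
# frame N+1 can be queued while the other GPU is still draining frame N.
from collections import deque

command_buffers = {"GPU A": deque(), "GPU B": deque()}

def submit_frame(frame_number, commands):
    """Send an entire frame's command list to GPU A or GPU B, alternating."""
    gpu = "GPU A" if frame_number % 2 == 1 else "GPU B"
    command_buffers[gpu].append((frame_number, commands))
    return gpu

for n in range(1, 6):
    print(f"frame {n} -> {submit_frame(n, [f'draw calls for frame {n}'])}")
# frame 1 -> GPU A, frame 2 -> GPU B, frame 3 -> GPU A, and so on
```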
 
Then again, you can only react to the game after you see the latest frame on the screen. That's where the clock starts, so it doesn't matter that the CPU has been able to calculate the next frame while one of the GPUs is drawing the frame. Actually, as you react to the frame you just saw, won't the perceived latency be *larger*, if the CPU has already decided what the next frame will be?
 
Daliden said:
Then again, you can only react to the game after you see the latest frame on the screen. That's where the clock starts, so it doesn't matter that the CPU has been able to calculate the next frame while one of the GPUs is drawing the frame. Actually, as you react to the frame you just saw, won't the perceived latency be *larger*, if the CPU has already decided what the next frame will be?
Well, I think you kind of confused yourself there. Better to think of this by starting with the input, which goes directly into how the CPU calculates a given frame, and on from there.
 
Chalnoth said:
1. If you are not CPU-limited, the CPU can calculate more frames per unit time if AFR is enabled, thus reducing latency.
2. Not sure exactly what you mean by a rename buffer, but the CPU can't be processing AI, physics, and whatnot if it's busy dealing with the stuff being sent to the graphics card, i.e. you need double buffering.
3. ...
4. Once again, as long as this is not the limitation, there will be more frames available for display, and thus the latency of this step will again be reduced.

I think the CPU should be able to work in parallel with the GPU in many situations, i.e. the CPU sets up a command buffer and lets the GPU fetch from it. It would be very inefficient if the CPU had to stay busy sending all that information itself.

Of course, sometimes the CPU and GPU can't be parallelized very well, and AFR may reduce some latency (mostly only the part that isn't parallel). However, in GPU-limited cases, I think that part tends to be quite small.
 
Chalnoth said:
Daliden said:
Then again, you can only react to the game after you see the latest frame on the screen. That's where the clock starts, so it doesn't matter that the CPU has been able to calculate the next frame while one of the GPUs is drawing the frame. Actually, as you react to the frame you just saw, won't the perceived latency be *larger*, if the CPU has already decided what the next frame will be?
Well, I think you kind of confused yourself there. Better to think of this by starting with the input, which goes directly into how the CPU calculates a given frame, and on from there.

When thinking about gaming, you cannot think about it in any other way than "stimulus first, response next, result last" -- right?
 
pcchen said:
I think the CPU should be able to work in parallel with the GPU in many situations, i.e. the CPU sets up a command buffer and lets the GPU fetch from it. It would be very inefficient if the CPU had to stay busy sending all that information itself.
Sure, but the issue is that the system has no idea what needs to be rendered until the CPU is done with its processing for the frame. Therefore, you need the double-buffering (or more: I believe DirectX allows between 1 and 4 intermediate buffers).
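Roughly, the buffering being described works like this toy producer/consumer model (the limit of two frames in flight is just an assumed number for illustration, not the actual DirectX setting):

```python
# Toy model of the CPU->GPU frame buffering being described: the CPU may
# queue a limited number of finished frames; once the queue is full it
# stalls until the GPU retires one. The limit of 2 is only an assumption.
from collections import deque

MAX_FRAMES_IN_FLIGHT = 2
in_flight = deque()

def gpu_retire():
    """GPU finishes the oldest queued frame, freeing its buffer."""
    if in_flight:
        print(f"GPU finished frame {in_flight.popleft()}")

def cpu_submit(frame):
    """CPU hands a finished frame to the driver, stalling while the queue is full."""
    while len(in_flight) >= MAX_FRAMES_IN_FLIGHT:
        gpu_retire()          # CPU is blocked here; only GPU progress unblocks it
    in_flight.append(frame)
    print(f"CPU queued frame {frame} ({len(in_flight)} in flight)")

for n in range(1, 6):
    cpu_submit(n)
```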
 
Daliden said:
When thinking about gaming, you cannot think about it in any other way than "stimulus first, response next, result last" -- right?
But then you're adding in human response, and that just clutters the whole picture when we're talking about a phenomenon that can be analyzed just by looking at the way the computer works.
 
Not to mention it is not necessarily true. Let's say you see someone running past a window, so you know in advance he will get to a door; you do not need to see him at the door before you move to intercept him there...
 
I think it's easiest to consider an example. For the sake of simplicity, let's assume vsync is off, and the driver never buffers more render commands than necessary, i.e. no more than one frame that isn't being processed currently.
Let's say calculating a single frame takes 10ms on the CPU, and 50ms for rendering on a single card.

Single card system:
010 - CPU finished first frame, GPU starts rendering
020 - driver has buffered one frame (#2) in the command buffer, stalls
060 - GPU finishes first frame, displays it, starts second. Driver returns from stall. CPU captures user input for #3
# we're now in normal operation #
070 - CPU finished frame #3, driver stalls again
110 - GPU finishes frame #2, displays it, starts rendering third one. Driver returns from stall. CPU captures user input for #4
120 - CPU finishes #4, driver stalls
160 - GPU finishes frame #3, displays it, starts rendering #4. CPU captures user input and starts computing #5
170 - CPU finishes #5, driver stalls
210 - GPU finishes frame #4, displays it, starts #5
etc.

Now let's see. At 110, the CPU captures, for frame #4, all the keypresses that happened during the preceding 50ms. During that same time, 60 to 110, frame #1 became visible. At 210, frame #4 starts to become visible on screen (but it will take a full refresh cycle until it is completely visible). So you have at least two frames' time delay (100ms in this case) between keypress and visible response.


AFR system:
010 - CPU finished first frame. GPU1 starts rendering
020 - CPU finished second frame. GPU2 starts rendering
030 - CPU finished third frame. Driver stalls
060 - GPU1 finished #1, starts #3. CPU captures input, starts #4
070 - GPU2 finished #2. Knowing the rendering time of both frames, the driver (hopefully) attempts to balance frame distribution and delays displaying #2 for 15ms. CPU finished #4, driver stalls.
085 - display swaps to #2. GPU2 starts rendering #4. CPU starts #5
095 - CPU finished #5, driver stalls
110 - GPU1 finishes #3, starts #5. CPU starts #6
120 - CPU finishes #6, driver stalls
135 - GPU2 finishes #4, starts #6. CPU starts #7
145 - CPU finishes #7, driver stalls
160 - GPU1 finishes #5, starts #7. CPU starts #8
etc.

At 085, the CPU captures, for frame #5, all the keypresses that happened during the preceding 25ms. During that same time, 60 to 85, frame #1 became visible. At 160, frame #5 starts to become visible on screen (but it will take a full refresh cycle until it is completely visible). So you have at least three frames' time delay (75ms in this case) between keypress and visible response.


Overall, the relative latency will increase by one frame, but the absolute latency will actually go down.
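Boiling the two timelines down to their steady-state numbers (same assumed 10ms CPU / 50ms GPU figures as above, vsync off):

```python
# Steady state of the two timelines above: a new frame reaches the screen
# every 50 ms with one card, every 25 ms with two cards in AFR.
single_frame_interval = 50.0       # ms between displayed frames, one GPU
afr_frame_interval    = 50.0 / 2   # ms between displayed frames, two GPUs

# Input captured for a frame shows up a fixed number of frame intervals
# later: two intervals in the single-card case, three with AFR (the extra
# queued frame is the price of keeping both GPUs busy).
single_latency = 2 * single_frame_interval   # 100 ms
afr_latency    = 3 * afr_frame_interval      #  75 ms

print(single_latency, afr_latency)  # 100.0 75.0 -> one more frame of relative
                                    # latency, but lower absolute latency
```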
 
Chalnoth said:
1. If you are not CPU-limited, the CPU can calculate more frames per unit time if AFR is enabled, thus reducing latency.
Well, if for you SLI means more fps, yes; but if you use SLI to get a higher resolution at the same framerate, it does not reduce anything, right :?:
 
MfA said:
The problem is that all these methods are driver hacks more than anything else, just an afterthought to push a few more boards to niche markets that aren't worth spending any real money on during hardware design.
So 3dfx's "SLI design from the get go (best on a single board)" should be adopted by the current IHVs? Wonder why neither ATI nor (especially) NVIDIA has adopted this since 1999...
 
Marc said:
Well, if for you SLI means more fps, yes; but if you use SLI to get a higher resolution at the same framerate, it does not reduce anything, right :?:

In this case, I think the latency will be worse.

To summarize, the main reason AFR can reduce latency (relative to a non-SLI setup, not to SFR) is that the CPU doesn't know when the GPU will finish its rendering, so to maximize parallelism the CPU has to read input long before the GPU may finish its work. That latency can be reduced by AFR. However, this is compared to a non-SLI setup.

Compared to a real double-speed setup (be it SFR or simply a faster card), the latency of AFR will still be roughly the same as with triple buffering (which is what Xmas' example shows).
 
Reverend said:
So 3dfx's "SLI design from the get go (best on a single board)" should be adopted by the current IHVs? Wonder why neither ATI nor (especially) NVIDIA has adopted this since 1999...

Considering current high-end cards already need massive cooling solutions, I doubt it would be a good idea to put even more chips on them :)

Seriously, a dual-card set-up still has its merits. For example, if you need stereo vision, a dual-card set-up is an obvious way to maintain speed. However, I think a real "SLI" design is important for high-end use. Most high-end users render complex or high-resolution scenes, and AFR won't help much to reduce the latency there. On the other hand, it's very hard to speed up vertex shader computation in a real "SLI" set-up (not impossible, especially if you have a fast interconnect between the two chips).

IMO a really nice dual-card set-up would be something like this: the two chips have a very fast, low-latency interconnect between them, and exchange vertex shader results and render-to-texture output over it. However, that's probably too expensive for a dual-card set-up.
 
I think Chalnoth is saying that two 6800GTs will not have any more latency than one 6800GT, and will generally have slightly lower latency when GPU-limited because the CPU waits less.

He's not saying two 6800GTs will have less latency than a 7800, or whatever the next-gen single-chip product at 2x the speed will be.

60fps AFR will have more latency than 60fps normal, but 60fps AFR will have lower latency than 30fps normal.
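Plugging those three cases into the same frames-in-flight approximation as Xmas' example (two queued frames for a single card, three for AFR; a rough model, not a measurement):

```python
# Rough input-to-display latency = frames in flight * frame time.
def latency_ms(fps, frames_in_flight):
    return frames_in_flight * 1000.0 / fps

print(round(latency_ms(60, 2)))  # ~33 ms: 60 fps, single card
print(round(latency_ms(60, 3)))  # ~50 ms: 60 fps with AFR   -> more than 60 fps normal
print(round(latency_ms(30, 2)))  # ~67 ms: 30 fps, single card -> more than 60 fps AFR
```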
 