Synchronization issues with SLI and CrossFire

Still

We measured an anomaly a while ago on SLI/CrossFire systems that should be of great interest to anyone shopping for this kind of hardware, because it destroys most of the performance gain of a dual-GPU setup.

The problem is that the drawing of frames is not effectively synchronized between the cards.

Let us consider a scene that is rendered at 30 FPS, which should mean a homogeneous frame time of 33 ms.
The reality on dual-GPU setups, however, is that every other frame is displayed only a few milliseconds after the previous one.

So what we measured consistently, for example in a ~30 FPS scene, is that every even-numbered frame is displayed after ~11 ms, while every odd-numbered frame takes ~55 ms.
The problem gets so bad that it is clearly visually noticeable, especially at lower frame rates.

We made sure that this issue is not caused by AFR-related data fetching, lazy scene updates, or the like, and we observed this behavior in every application we tested and with all Windows and driver versions.

We confronted Nvidia with this problem a while ago, but we did not receive a useful reply.
Our attention fell back on it after we saw a discussion of the problem in a German PC enthusiast forum (3DCenter.de), and I think the whole market should know about these very important facts.
 
Does it matter who "we" refers to? :D

By the way, you can easily test this yourself.
All you need to do is log the frame times with a tool such as Fraps.
The problem is easier to see at lower frame rates.
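
A minimal sketch of such an analysis (Python; it assumes a Fraps-style frametimes log with a "Frame, Time (ms)" header and cumulative timestamps in milliseconds, like the table posted further down in this thread; the file name is just a placeholder):

Code:
# Split per-frame deltas from a Fraps-style frametimes log into
# even- and odd-numbered frames.
times = []
with open("frametimes.csv") as f:   # placeholder file name
    next(f)                         # skip the "Frame, Time (ms)" header
    for line in f:
        line = line.strip()
        if line:
            frame, t = line.split(",")
            times.append(float(t))

deltas = [b - a for a, b in zip(times, times[1:])]
even, odd = deltas[0::2], deltas[1::2]
print("even frames: %.1f ms avg" % (sum(even) / len(even)))
print("odd frames:  %.1f ms avg" % (sum(odd) / len(odd)))
# On a single card the two averages are nearly identical; with the
# problem described here they split into two distinct values.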
 
I saw this myself back when the 7800s came out and SLI was supposed to be "fixed" compared with the 6800 mess.

As Still says, just plot frame times and you will see two distinct peaks instead of the single one you should get. I find it most evident at high frame rates, though, because the standard deviation of the presented frame times drops and the spread of each peak narrows.

Regardless of why, it sucks if you can see it during gameplay, though as we know people perceive these things differently. There is no denying the frame time data, however, unless Fraps is somehow calculating it incorrectly.
 
Fraps is not measuring incorrectly; I checked this.
Our own timer in the application delivered the same results.

And yes, this really is a huge thing, because most of the gain from the second card evaporates due to this problem. People buying a second card don't realize it because all hardware reviews (of course) only report FPS averaged over long sample intervals.
 
So what we measured consistently, for example in a ~30 FPS scene, is that every even-numbered frame is displayed after ~11 ms, while every odd-numbered frame takes ~55 ms.
The problem gets so bad that it is clearly visually noticeable, especially at lower frame rates.

This is just a basic side effect of how Nvidia and ATI implement AFR in order to get high benchmark scores without regard for actual quality and performance.

Effectively, as soon as the driver finishes receiving frame A, it reports it as completed and starts receiving frame B. Meanwhile frame A runs on GPU A and frame B runs on GPU B.

While AFR in general is just a borked mode, both companies could make it much better by dynamically delaying frame A's completion signal from the driver back to the app until half of frame A's computation time has passed.

This would end up with a higher effective real-world frame rate and a better display of the actual game state, but it would probably lower benchmark performance.
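
Here is a toy timing model of that idea, purely a sketch under simplifying assumptions (fixed 66 ms render time, 5.5 ms app-to-driver submission time, two GPUs, unlimited driver queuing), not a description of either vendor's actual driver:

Code:
# Toy model of two-GPU AFR pacing.
RENDER, SUBMIT, FRAMES = 66.0, 5.5, 8

def simulate(extra_delay):
    """Display times when the driver waits extra_delay ms after
    accepting a frame before letting the app submit the next one."""
    displayed, t_submit = [], 0.0
    gpu_free = [0.0, 0.0]                  # when each GPU becomes idle
    for i in range(FRAMES):
        gpu = i % 2                        # frames alternate between GPUs
        start = max(t_submit + SUBMIT, gpu_free[gpu])
        gpu_free[gpu] = start + RENDER
        displayed.append(gpu_free[gpu])
        t_submit += SUBMIT + extra_delay   # app is blocked, then delayed
    return displayed

for label, delay in (("as-is:  ", 0.0), ("metered:", RENDER / 2 - SUBMIT)):
    d = simulate(delay)
    print(label, [round(b - a, 1) for a, b in zip(d, d[1:])])
# as-is:   display gaps alternate ~5.5 ms / ~60.5 ms (micro-stutter)
# metered: display gaps settle to an even 33.0 ms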

In general though, both companies should be designing to increase performance and capabilities in parallel rendering instead of relying on the crutch of AFR.

Aaron Spink
speaking for myself inc.
 
This is just a basic side effect of how Nvidia and ATI implement AFR in order to get high benchmark scores without regard for actual quality and performance.

Effectively, as soon as the driver finishes receiving frame A, it reports it as completed and starts receiving frame B. Meanwhile frame A runs on GPU A and frame B runs on GPU B.
But if that were the only thing going on, then we wouldn't see irregular frame times.

So I rather think this is a synchronization problem.
 
But if that were the only thing going on, then we wouldn't see irregular frame times.

So I rather think this is a synchronization problem.

No, it's as I described. What happens generally is that the frame takes 66 ms to render, and say it takes 5.5 ms to dump the frame into the driver:

Frame sent : Frame displayed
     -5.5  :  66.0
      0.0  :  77.0
     66.0  : 137.5
     71.5  : 148.5
      ...


It matches what you are seeing pretty much exactly. AFR isn't changing the time it takes to render a frame, just allowing twice the frames to be rendered in the same time.

In order to fix the issue, you have to delay the request for the B frame. In this case you would want to delay until 27.5 ms after the first frame was received.

In essence, AFR is little more than flipping the same frame buffer twice in a lot of cases. Most of the sync issues probably do go away as you add more GPUs in AFR, as the app->driver frame transmit time will fill in the dead spots.

It's also one of the reasons you see diminishing returns as you add more GPUs to the AFR config. As the cumulative app->driver transfer time approaches the per-frame render time, you hit a wall and can't increase the frame rate.
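
A back-of-the-envelope illustration of that wall, reusing the assumed 66 ms render and 5.5 ms transfer times from above:

Code:
# The GPU pool finishes at most n frames per RENDER ms, while the
# app->driver path feeds at most one frame per SUBMIT ms.
RENDER, SUBMIT = 66.0, 5.5
for n in (1, 2, 4, 8, 12, 16):
    fps = min(n * 1000.0 / RENDER, 1000.0 / SUBMIT)
    print("%2d GPU(s): %6.1f FPS cap" % (n, fps))
# Scaling is linear until n * SUBMIT reaches RENDER (here n = 12);
# past that point, adding GPUs cannot raise the frame rate.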

Aaron Spink
speaking for myself inc.
 
I've seen this when playing at low res on my pair of 3870s with a 4.33 GHz E8400.

Example: playing Oblivion at 1024x768 without HDR demonstrates this behavior; the framerate is quite high, but the "effective framerate" is visibly lower. However, I don't play at those settings or resolutions -- 1680x1050 with HDR and antialiasing is sufficient to bring the cards much more into line with what the CPU is capable of, and while the framerate does come down, the frame issue rate is FAR more even and the stutter nearly undetectable.
 
Yeah, as I said, we were reminded of our old observations when we saw the discussion going on at 3DCenter.de (the related thread also triggered the PCGames article).
It is very unfortunate that there is no international, widespread awareness of this huge problem.
 

Good thing I don't read German, or I could be accused of plagiarism.

Basically it boils down to this: AFR is pretty borked. What we really should be demanding is parallel rendering, i.e. real SLI or chessboard division between the graphics cards, which instead of increasing the number of frames per given time decreases the actual rendering time. But the problem for Nvidia/ATI is that real parallel rendering is harder and won't give the *bling* framerates of AFR.

Aaron Spink
speaking for myself inc.
 
Yeah, as I said, we were reminded of our old observations when we saw the discussion going on at 3DCenter.de (the related thread also triggered the PCGames article).
It is very unfortunate that there is no international, widespread awareness of this huge problem.

That's because no one has actually written a benchmark that marks FAIL on the cards when they do this. It wouldn't actually be so hard to develop a benchmark scene that would easily verify this: just a black-and-white or numbered wall and motion across that wall, pretty much the same thing you do with a high-speed camera to determine how fast something like a bullet is going.

Aaron Spink
speaking for myself inc.
 
No, it's as I described. What happens generally is that the frame takes 66 ms to render, and say it takes 5.5 ms to dump the frame into the driver:

Frame sent : Frame displayed
     -5.5  :  66.0
      0.0  :  77.0
     66.0  : 137.5
     71.5  : 148.5
      ...
I don't understand what you mean by "dump the frame into the driver".
Every card has its own driver context, and the frame is updated as if it were a single card. The only difference is that the slave card sends the completed frame to the master card, where it is copied into the front buffer.

But yes, this is roughly what it looks like without proper synchronization, and that's the issue I am pointing at.
 
That's because no one has actually written a benchmark that marks FAIL on the cards when they do this. It wouldn't actually be so hard to develop a benchmark scene that would easily verify this: just a black-and-white or numbered wall and motion across that wall, pretty much the same thing you do with a high-speed camera to determine how fast something like a bullet is going.
I think it would be sufficient to measure the slowest of two (or more) frames and base the frame rate on this frame time.
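
A quick sketch of that metric (the sample deltas are made up to resemble the alternating pattern reported in this thread):

Code:
# Rate each group of n_gpus consecutive frames by its slowest frame
# instead of averaging everything.
def effective_fps(frame_times_ms, n_gpus=2):
    worst = [max(frame_times_ms[i:i + n_gpus])
             for i in range(0, len(frame_times_ms) - n_gpus + 1, n_gpus)]
    return 1000.0 / (sum(worst) / len(worst))

sample = [11.2, 41.0, 10.1, 47.8, 10.1, 47.7]   # alternating deltas (ms)
print(1000.0 / (sum(sample) / len(sample)))     # plain average: ~36 FPS
print(effective_fps(sample))                    # slowest-of-pair: ~22 FPS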
 
No, it's as I described. What happens generally is that the frame takes 66 ms to render, and say it takes 5.5 ms to dump the frame into the driver:

Frame sent : Frame displayed
     -5.5  :  66.0
      0.0  :  77.0
     66.0  : 137.5
     71.5  : 148.5
      ...

Hmmm, thanks... I had always naively assumed that the dumping time was close enough to the rendering time to give each rendered frame roughly equal screen time. It looks like all AFR does is let the CPU burn through frames faster, so that GPU A gets a new frame slightly sooner than it would if running by itself. If the difference is as great as you indicate, shouldn't this stuttering be glaringly obvious in faster-paced games?
 
Hmmm, thanks... I had always naively assumed that the dumping time was close enough to the rendering time to give each rendered frame roughly equal screen time. It looks like all AFR does is let the CPU burn through frames faster, so that GPU A gets a new frame slightly sooner than it would if running by itself. If the difference is as great as you indicate, shouldn't this stuttering be glaringly obvious in faster-paced games?
aaronspink's example doesn't really reflect what is happening.

It looks more like this:

Code:
Frame, Time (ms)
    1,     0.000
    2,    11.231
    3,    52.277
    4,    62.364
    5,   110.195
    6,   120.274
    7,   167.977
    8,   178.608
    9,   225.865
   10,   236.023

It really is just that the cards are not synchronized, so every second frame is delivered too early or too late.
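
For reference, here are the per-frame deltas behind the table above, computed from the cumulative timestamps:

Code:
times = [0.000, 11.231, 52.277, 62.364, 110.195,
         120.274, 167.977, 178.608, 225.865, 236.023]
print([round(b - a, 1) for a, b in zip(times, times[1:])])
# [11.2, 41.0, 10.1, 47.8, 10.1, 47.7, 10.6, 47.3, 10.2]
# i.e. ~10 ms and ~47 ms alternating instead of a steady ~26 ms.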
 
This is not new (I'm surprised anyone would think it was), and there are different ways of handling AFR synchronization in the Nvidia drivers.

Some games respond differently; hence the per-game differences needed in the Nvidia profile system, and why some standard AFR profiles won't work.

Chris
 