NVIDIA Working on Tile-based Multi-GPU Rendering Technique Called CFR - Checkered Frame Rendering

pharma

November 21, 2019
"Forum user Blair at 3DCenter had a sharp eye noticed an added entry towards the drivers for Muli-GPU rendering, the technique is called CFR and basically slices up a frame in many small pieces, in order for the GPUs the render them in a parallel manner.

You could also refer to the technique as checkerboard rendering, where you split everything up into smaller tiles and have the GPUs render them based on an algorithm or simply, FIFO, first-in, fist-out, this could increase scaling performance but also helps with things like micro stuttering as frames and their output pacing are processed in way more stable manner. The basis is, of course, an existing technique applied in many solutions. NVIDIA, however, wants to use if for multi-GPU rendering.
...
Since CFR is currently activated with the help of extra tools and/or requires some manual work at Tweaking. The results and entries that NVIDIA is actively working on this methodology. The new technique would be DirectX compatible only, and as it seems for Turning and upcoming based GPUs as it will require NVLink."


https://www.guru3d.com/news-story/n...que-called-cfr-checkered-frame-rendering.html
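To picture the basic scheme: here's a minimal sketch of how a frame might be split into a checkerboard of tiles and distributed across GPUs. This is only an illustration of the general idea, not NVIDIA's actual implementation; the 64-pixel tile size and the round-robin assignment are assumptions.

```cpp
#include <cstdio>

// Hypothetical tile size; the real driver's tile granularity is unknown.
constexpr int kTileSize = 64;

// Checkerboard assignment: adjacent tiles go to different GPUs,
// like the black and white squares of a chessboard.
int GpuForTile(int tileX, int tileY, int numGpus) {
    return (tileX + tileY) % numGpus;
}

int main() {
    const int width = 3840, height = 2160;
    const int tilesX = (width  + kTileSize - 1) / kTileSize;
    const int tilesY = (height + kTileSize - 1) / kTileSize;

    // Count how many tiles land on each of two GPUs; the split comes
    // out nearly 50/50, which is what makes scaling plausible at all.
    int count[2] = {0, 0};
    for (int ty = 0; ty < tilesY; ++ty)
        for (int tx = 0; tx < tilesX; ++tx)
            ++count[GpuForTile(tx, ty, 2)];

    std::printf("GPU0: %d tiles, GPU1: %d tiles\n", count[0], count[1]);
    return 0;
}
```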
 
Copied post to a new thread for follow-up discussions.
 
Won't work for realtime rendering. Pixel N needs access to pixel Y in pass Z because it's screenspace tracing, and oops, it's on another GPU; better stall the frame while the data copies over. Or worse, as is suspected, wait while data is synced every bloody pass. How long is that going to take? GPUs are already highly latency sensitive.

Nvidia has gotten the "too in the lead for too long" syndrome, where they try things because they have the money to do so rather than because it's a good idea. There's a reason multi-GPU support was dropped already.
 
It already works; some users have enabled it in games with no multi-GPU support.
Works and "works" are two different things, though. We'd need a thorough review of how it works, whether there are artifacts because of it, performance anomalies and whatnot, before making any conclusions on whether it really works or not.
 
Won't work for realtime rendering. Pixel N needs access to pixel Y in pass Z because it's screenspace tracing, and oops, it's on another GPU; better stall the frame while the data copies over. Or worse, as is suspected, wait while data is synced every bloody pass. How long is that going to take? GPUs are already highly latency sensitive.

Nvidia has gotten the "too in the lead for too long" syndrome, where they try things because they have the money to do so rather than because it's a good idea. There's a reason multi-GPU support was dropped already.

Well, they can innovate and make it work. I'm not saying it's working, but just because the concept was problematic in the past doesn't mean some clever dudes can't find solutions.
We'll see.
 
Won't work for realtime rendering. Pixel N needs access to pixel Y in pass Z because it's screenspace tracing, and oops, it's on another GPU; better stall the frame while the data copies over. Or worse, as is suspected, wait while data is synced every bloody pass. How long is that going to take? GPUs are already highly latency sensitive.

Nvidia has gotten the "too in the lead for too long" syndrome, where they try things because they have the money to do so rather than because it's a good idea. There's a reason multi-GPU support was dropped already.

Yeah, seems like it would require a ton of inter-GPU bandwidth, and latency will be a problem. Also, scaling will be limited due to redundant geometry processing.

Maybe it’s just a proof of concept for a future MCM implementation. Can’t hate them for trying even if it’s because they have R&D dollars to burn.
 
How much latency are we really looking at?
DX12 supports explicit multi-adapter, which also means it knows how to pool memory. We've seen mGPU operate very well on titles (Tomb Raider series) optimized in this way. Why is mGPU successful in those cases, but this CFR style will suffer all sorts of bottlenecks?
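For context, "explicit multi-adapter" means the application itself enumerates every GPU and drives each one directly, so the engine can schedule work and copies where they're cheap. A minimal sketch of the enumeration step, using standard DXGI/D3D12 calls (error handling omitted):

```cpp
#include <d3d12.h>
#include <dxgi1_4.h>
#include <wrl/client.h>
#include <vector>

using Microsoft::WRL::ComPtr;

// Enumerate all adapters and create a D3D12 device on each one.
// Under explicit multi-adapter the app (not the driver) decides how
// work and resources are split between these devices.
std::vector<ComPtr<ID3D12Device>> CreateAllDevices() {
    ComPtr<IDXGIFactory4> factory;
    CreateDXGIFactory1(IID_PPV_ARGS(&factory));

    std::vector<ComPtr<ID3D12Device>> devices;
    ComPtr<IDXGIAdapter1> adapter;
    for (UINT i = 0;
         factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND;
         ++i) {
        ComPtr<ID3D12Device> device;
        if (SUCCEEDED(D3D12CreateDevice(adapter.Get(),
                                        D3D_FEATURE_LEVEL_11_0,
                                        IID_PPV_ARGS(&device))))
            devices.push_back(device);
    }
    return devices;
}
```

The difference with CFR is that the driver would be doing the splitting transparently, without the engine's knowledge of which passes depend on which pixels, which is where the bottleneck worry comes from.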
 
Games it currently works in show a 40-50% fps increase using 2x 2080 Ti @ 4K

Metro Exodus (DX11/DX12) works (DLSS not compatible!)
Battlefield V (DX11/DX12) not compatible
Borderlands 3 (DX11 works, DX12 crash)
Chernobylite (DX11) works
Crysis 3 (DX11) works
Shadow of the Tomb Raider (DX12 doesn't start, DX11 works)
Deus Ex Mankind Divided (DX12 doesn't start, DX11 works)
GRID (2019) (DX12) crash
Control (DX12) stability problems
F1 2019 (DX12) crash
Hitman 2 (DX11+DX12) (visual problems, flicker)
Forza Horizon 4 (DX12/UWP -> crash BSOD)
The Elder Scrolls Skyrim SE (DX11) (no scaling)
Final Fantasy XV (DX11) (no scaling)
A Plague Tale Innocence (DX11) works
Mafia III (DX11) works
Monster Hunter: World (DX11) crash
Tomb Raider (2013) (uneven GPU)
Middle Earth Shadow of Mordor (DX11) works (shadows show sometimes problems)
Devil May Cry 5 (DX11) works
Quantum Break (DX11) (no scaling)
Resident Evil 7 (DX11) works
Far Cry 5 (DX11) (terrain flickers)
Resident Evil 2 Remake (DX11) works
The Division II (DX12) (works, z-fighting and frame-pacing problems)

https://www.forum-3dcenter.org/vbulletin/showpost.php?p=12144578&postcount=3586
 
Games it currently works in show a 40-50% fps increase using 2x 2080 Ti @ 4K

https://www.forum-3dcenter.org/vbulletin/showpost.php?p=12144578&postcount=3586

Neat! Glad to be proven wrong. But there are still the bandwidth and latency problems with copying. Do they force a sync after each pass? It'll be interesting to see more details.

I'm also surprised, and skeptical, that it'd work at all under DX12/Vulkan; specifically, the list says Metro Exodus works under DX12, but unless you build drivers for each specific game it seems unlikely to work (maybe they did so for Exodus?). Still, looking forward to details on how they handled the bandwidth/latency problem.
 
Neat! Glad to be proven wrong. But there are still the bandwidth and latency problems with copying. Do they force a sync after each pass? It'll be interesting to see more details.

I'm also surprised, and skeptical, that it'd work at all under DX12/Vulkan; specifically, the list says Metro Exodus works under DX12, but unless you build drivers for each specific game it seems unlikely to work (maybe they did so for Exodus?). Still, looking forward to details on how they handled the bandwidth/latency problem.
Just keep a copy of the memory on both GPUs. Why the need to copy back and forth?
 
Just keep a copy of the memory on both GPUs. Why the need to copy back and forth?

Because then each GPU would have to do the exact same work for both copies to match up perfectly, making it pointless? That's not what they're doing. Half the frame is rendered on one GPU and half on the other, in some sort of tiled manner apparently. Does SSR just not work? Are there obvious SSAO lines from missing info? How would you handle non-graphics work?

The way this is described couldn't work for anything other than primary visibility without a lot of cross-GPU data syncing, and even then there'd be obvious artefacts. So either it's useless, or they've figured out some frame sync scheme to copy all the data, which in my head is screaming stalls; but the performance numbers look good. I'm really trying to figure out what it is they're doing. The graphic they have doesn't seem related at all; that's just checkerboard temporal reconstruction. Guess it'll have to wait for a paper or some other explanation.
 
Ok, might've figured it out. The graphic might be correct: you could just dual-checkerboard render on two GPUs, so the "tiling" thing could be kind of misleading, as it's not totally important. One renders the "Frame N" pixels, the other the "Frame N+1" pixels.

You then only have to sync and resolve the final frame output. Clever, really; I feel dumb now. Things like SSAO and SSR will indeed be a bit glitchy, but at a decently high resolution it shouldn't be that noticeable. And the more GPUs you use the less you scale, same with the lower you set the resolution, as frame setup will start to dominate in both cases. Still, overall a good solution for most anyone who would buy two GPUs to begin with.

Definitely something AMD and Intel can replicate with some effort, as well as on APIs like DX12/Vulkan if the developer supports it. E.g., it's highly likely to show up in UE4/Unity, as the first already sells to pre-viz VFX and the second wants to. And hey, if that's not what Nvidia's doing, from an initial impression it could work anyway.
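If that reading is right, the final resolve step could be as simple as interleaving the two checkerboards. A toy CPU-side sketch of that resolve, purely my own speculation about the scheme and nothing from NVIDIA:

```cpp
#include <cstdint>
#include <vector>

// Toy resolve: GPU0 shaded the "black" squares of the checkerboard,
// GPU1 the "white" squares. Each buffer is full resolution but only
// half its pixels are valid; the resolve picks the valid one per pixel.
void ResolveCheckerboard(const std::vector<uint32_t>& gpu0,
                         const std::vector<uint32_t>& gpu1,
                         std::vector<uint32_t>& out,
                         int width, int height) {
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            const size_t i = static_cast<size_t>(y) * width + x;
            out[i] = (((x + y) & 1) == 0) ? gpu0[i] : gpu1[i];
        }
}
```

In practice something like this would presumably run as a GPU pass after syncing one half-frame across NVLink, so only a frame's worth of final pixels crosses the link rather than every intermediate render target.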
 
Why should cooperative work on large tiles of the same frame be implausible? It's not as if NVLink didn't at least provide the bandwidth needed to compete eye to eye with a local memory access. Well, at least it's only a factor of 2 behind, but full duplex in return.

Well, yes; as this was published with a graphic, it does appear possible that the paired GPUs are actually performing driver-side TAA.

Even though it's not quite clear which framerate was actually measured then. +50% in terms of performed present calls (and then cut in half by driver-side TAA recombination before display)? Because the other possible reading, a +125% boost per GPU just from effectively halving the shading rate, doesn't appear plausible. Well, for the first option it's not necessarily cut in half, as frame N+1 can be combined with both N and N+2, so it's not as limited as a classic interlaced video stream.

Not sure what they are doing internally. Hijacking multisampling with a programmable pattern and effectively lower-res targets, or clever use of variable rate shading to avoid interfering with the data layout?
 
Why should cooperative work on large tiles of the same frame be implausible? It's not as if NVLink didn't at least provide the bandwidth needed to compete eye to eye with a local memory access. Well, at least it's only a factor of 2 behind, but full duplex in return.

My concern wasn't bandwidth necessarily, but latency. Anything over a link like NVLink tends to be far slower than local access, as in microseconds versus nanoseconds. That's just a lot of time to wait for whatever stalls crop up. Though I suppose if Nvidia carefully built a driver profile for each and every enabled title, those could be somewhat minimized.

Well, yes; as this was published with a graphic, it does appear possible that the paired GPUs are actually performing driver-side TAA.

Even though it's not quite clear which framerate was actually measured then. +50% in terms of performed present calls (and then cut in half by driver-side TAA recombination before display)? Because the other possible reading, a +125% boost per GPU just from effectively halving the shading rate, doesn't appear plausible. Well, for the first option it's not necessarily cut in half, as frame N+1 can be combined with both N and N+2, so it's not as limited as a classic interlaced video stream.

Not sure what they are doing internally. Hijacking multisampling with a programmable pattern and effectively lower-res targets, or clever use of variable rate shading to avoid interfering with the data layout?

Yeah I don't know what the performance metrics are. Maybe the framerate was just "125%" of normal?
 
This patent seems to be related.

https://patentswarm.com/patents/US10430915B2

Multi-GPU frame rendering

In one embodiment, two or more GPUs are configured to operate as peers, with one peer able to access data (e.g., surfaces) in local memory of another peer through a high-speed data link (e.g., NVLINK, high-speed data link 150 of FIG. 1E). For example, a first GPU of the two or more GPUs may perform texture mapping operations using surface data residing remotely within a memory of a second GPU of the two or more GPUs.

In certain embodiments, a given frame to be rendered is partitioned into regions (e.g., rectangular regions) forming a checkerboard pattern, with non-overlapping adjacent regions sharing a common edge in the checkerboard pattern generally assigned to different GPUs.
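The peer-access model the patent describes has a public analogue in the CUDA runtime, which gives a feel for the mechanics; whether the display driver does anything like this internally for CFR is an assumption on my part:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Check whether GPU 0 can directly address GPU 1's memory.
    // Over NVLink this avoids a round trip through host memory.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, /*device=*/0, /*peerDevice=*/1);
    if (!canAccess) {
        std::printf("No peer access between GPU 0 and GPU 1\n");
        return 1;
    }

    // Enable peer access in both directions.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(/*peerDevice=*/1, /*flags=*/0);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(/*peerDevice=*/0, /*flags=*/0);

    // From here, work running on GPU 0 can dereference pointers into
    // GPU 1's memory, analogous to the patent's texture mapping reads
    // from a peer's surface data.
    std::printf("Peer access enabled\n");
    return 0;
}
```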
 
Playing around with CFR, the two main visual artifacts I've seen are object/terrain/texture flickering (Far Cry 5/New Dawn, RDR2) and/or visible tiles in volumetric effects (RDR2, Quantum Break, Alien Isolation). Some of those can be fixed by using a custom resolution. For instance, using a custom resolution of 3840x2159 (instead of the native 3840x2160) fixes the tile artifacts in Quantum Break and RDR2.
 
The primary innovation in the patent is a reduction of latency between GPUs by initiating the copy of data from the remote GPU before the data is actually needed. In any given frame, the data to be copied is chosen by an educated guess, driven by heuristics, based on what data was copied in the prior frame(s).

It makes sense at a high level, but I don't see how this doesn't result in a stuttery mess, as it could never predict 100% of data dependencies. Some requests will stall and wait for the long trip over NVLink.

Theoretically this is no different from requests to off-chip memory, so maybe much of the NVLink latency can be hidden behind other processing.

The patent doesn't explain how this would work for surface data generated by compute shaders. Those surfaces may not align themselves to a neat checkerboard pattern, so how would the driver know which data was produced by which GPU?
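A toy version of that prefetch heuristic might look like the following: remember which remote regions the previous frame actually sampled and speculatively copy those at the start of the next one. This is entirely my own sketch of the idea, and PrefetchFromPeer is a hypothetical stand-in for whatever copy path the hardware actually uses:

```cpp
#include <unordered_set>

// Hypothetical id for a region of surface data living on the peer GPU.
using RegionId = int;

// Hypothetical stand-in for kicking off an async copy of one region
// from the peer GPU over NVLink.
void PrefetchFromPeer(RegionId /*region*/) {
    // e.g. queue a DMA transfer here
}

struct PeerPrefetcher {
    std::unordered_set<RegionId> sampledLastFrame;
    std::unordered_set<RegionId> sampledThisFrame;

    // Frame start: speculatively copy everything the previous frame
    // needed, hiding the NVLink latency behind other per-frame work.
    void BeginFrame() {
        for (RegionId r : sampledLastFrame) PrefetchFromPeer(r);
        sampledThisFrame.clear();
    }

    // Called when rendering actually touches remote data; a region
    // that was not prefetched is the miss that stalls and stutters.
    void RecordRemoteSample(RegionId r) { sampledThisFrame.insert(r); }

    // Frame end: this frame's accesses become next frame's prediction.
    void EndFrame() { sampledLastFrame.swap(sampledThisFrame); }
};
```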
 