Will NVLINK finally allow better parallelization?

Discussion in 'Architecture and Products' started by MfA, Mar 26, 2014.

  1. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,770
    Likes Received:
    470
    Can we finally say goodbye to AFR with NVLINK or will the misery continue?

    I'm sceptical 5xPCIe3 is going to be enough for sort-middle parallelization, but let's hope I'm wrong.
     
  2. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    14,858
    Likes Received:
    2,275
    What do you expect AFR to be replaced with?
     
  3. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    MfA mentioned one possibility already: sort middle rendering. See http://www.cs.cmu.edu/afs/cs/academic/class/15869-f11/www/readings/molnar94_sorting.pdf for the details of what that is. But basically, your GPUs collaborate on a single frame at a time, reducing rendering latency. There is a work redistribution step between geometry processing and pixel processing, hence the name.
     
  4. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,853
    Likes Received:
    4,463
    Ever since (nVidia's) SLI first came out, we've all been hoping for a unified memory pool across multiple GPUs and something like the per-object rendering of Lucid's now-defunct Hydra.
     
  5. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,743
    Likes Received:
    106
    Location:
    Taiwan
    Basically, the link only needs to be fast enough for the two GPUs to share rendered results. Textures can be duplicated, since memory is cheap. So the link has to carry the final rendered result (as in current AFR implementations) plus the off-screen rendering results.

    So, to understand how much bandwidth you need, you have to estimate how many off-screen renderings are needed, and that can be described as a factor: e.g. if off-screen rendering needs around 5 times the on-screen resolution, then the factor is 5.

    5x PCIe3 is ~5GB/s bi-directional. You need ~500MB/s for 1920x1080 @ 60 fps, so with that you can handle a factor of around 9. However, for modern game engines it's probably a bit too tight. For example, if you do deferred rendering you'll need to render a depth buffer first (that also has to be shared), which may have to be at a higher resolution if you use some sort of multi-sampling AA, and this alone takes out a factor of 4 if you use 4X AA. If you only use morphological AA then maybe it's fine, but on the other hand 1920x1080 @ 60 fps is a little too low for a multi-GPU setup.
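
    The estimate above can be reproduced with a rough sketch. It assumes 4 bytes per pixel (the post does not state the pixel format), and the function names are only illustrative. The factor is taken as the number of on-screen-sized surfaces the link can carry per second, minus one for the final frame.

```python
def frame_bandwidth(width, height, fps, bytes_per_pixel=4):
    """Bytes per second needed to ship one full-resolution surface every frame.

    bytes_per_pixel=4 is an assumption (e.g. 8-bit RGBA), not a figure from the thread.
    """
    return width * height * bytes_per_pixel * fps


def offscreen_factor(link_bytes_per_s, width, height, fps, bytes_per_pixel=4):
    """On-screen-sized off-screen surfaces that fit after the final frame is sent."""
    surfaces = link_bytes_per_s / frame_bandwidth(width, height, fps, bytes_per_pixel)
    return surfaces - 1.0


if __name__ == "__main__":
    link = 5e9  # the ~5GB/s figure used in the post
    for w, h in [(1920, 1080), (3840, 2160)]:
        mb = frame_bandwidth(w, h, 60) / 1e6
        print(f"{w}x{h} @ 60 fps: {mb:.0f} MB/s per surface, "
              f"factor {offscreen_factor(link, w, h, 60):.1f}")
```

    Under these assumptions 1920x1080 @ 60 fps comes out at just under 500MB/s per surface and a factor of about 9, while a higher resolution such as 3840x2160 leaves far less headroom on the same link.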
     
  6. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,770
    Likes Received:
    470
    That's only true if there is geometry-level tiling, and the GPU can't do that on its own ... the only practical way is to assign tiles to GPUs but just divide geometry/vertex shading evenly and sort in the middle. The geometry then takes bandwidth in addition to the replication of all rendertargets (which can include intermediary stuff like shadow maps).
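
    A toy sketch of that scheme, under loudly stated assumptions: triangles are reduced to integer screen-space bounding boxes, geometry work is split round-robin, and tiles are owned in a checkerboard pattern. None of this reflects how any real GPU or driver does it; it only shows where the cross-GPU geometry traffic comes from.

```python
def sort_middle(triangles, num_gpus=2, tile_size=64):
    """Bin triangles to the GPUs that own the screen tiles they touch.

    triangles: list of (x0, y0, x1, y1) integer bounding boxes (hypothetical input).
    Returns (bins, crossings): per-GPU triangle lists, and how many times a
    triangle had to be sent to a GPU other than the one that shaded it.
    """
    bins = {gpu: [] for gpu in range(num_gpus)}
    crossings = 0
    for index, (x0, y0, x1, y1) in enumerate(triangles):
        geometry_gpu = index % num_gpus  # geometry/vertex work divided evenly
        owners = set()
        for ty in range(y0 // tile_size, y1 // tile_size + 1):
            for tx in range(x0 // tile_size, x1 // tile_size + 1):
                owners.add((tx + ty) % num_gpus)  # checkerboard tile ownership
        for gpu in sorted(owners):
            bins[gpu].append(index)
            if gpu != geometry_gpu:
                crossings += 1  # this triangle crosses the inter-GPU link
    return bins, crossings


if __name__ == "__main__":
    tris = [(0, 0, 10, 10), (70, 0, 80, 10), (60, 0, 70, 10)]
    print(sort_middle(tris))
```

    In this toy model, every triangle that lands in a tile owned by the other GPU counts as link traffic, on top of whatever rendertarget replication is needed.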
     
  7. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    5x PCIe 3, if that means five times the bandwidth of PCIe 3.0 x16, would be about 80GB/s bi-directional.
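
    As a quick check on that figure, here is a small sketch using the PCIe 3.0 signalling rate of 8 GT/s per lane with 128b/130b encoding. It gives the raw rate per direction and ignores packet/protocol overhead; reading "5x" as five times an x16 link is the interpretation assumed in this post.

```python
PCIE3_TRANSFERS_PER_S = 8e9      # 8 GT/s per lane
PCIE3_ENCODING = 128 / 130       # 128b/130b line encoding


def pcie3_bytes_per_s(lanes):
    """Raw PCIe 3.0 bandwidth per direction, in bytes per second."""
    return PCIE3_TRANSFERS_PER_S * PCIE3_ENCODING / 8 * lanes


if __name__ == "__main__":
    x16 = pcie3_bytes_per_s(16)
    print(f"PCIe 3.0 x16: {x16 / 1e9:.2f} GB/s per direction")
    print(f"5x that:      {5 * x16 / 1e9:.1f} GB/s per direction")
```

    That works out to a little under 16GB/s for x16 and a little under 80GB/s for five times that, per direction.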
     
  8. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,743
    Likes Received:
    106
    Location:
    Taiwan
    Yeah, I forgot about geometry shaders. However, to my understanding geometry shaders tend not to be as computation/bandwidth intensive as pixel rendering. So in order to avoid possible pipeline bubbles it might be faster to just replicate the work on both GPUs.
     
  9. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,743
    Likes Received:
    106
    Location:
    Taiwan
    Yeah, for HPC applications it's probably more likely to be the case.
    If we can have this bandwidth on a normal PC-based multi-GPU setup then it could be a game changer.
     
  10. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
    When I read about it I first thought of the implications for the CPU side of things. Nvidia has stated that they want to have Denver cores in every GPU in the mid term, so they need a bus to coherently connect multiple SoCs.
    It may help with multi-GPU in a gaming rig, I don't know, but I guess the primary purpose is more on the compute side of things, like selling "all Nvidia" CUDA stations without Intel or AMD CPUs being required.
     
  11. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,770
    Likes Received:
    470
    Actually, as I said a decade ago, it might make more sense to copy chunks of the dynamic textures on demand, since a given tile in screen space will generally only require a small part of an environment/shadow/etc map.
     
  12. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    It *might* be enough for sort middle tiled parallelization.

    AMD/NV will never implement sort middle GPUs. Their current architectures are too optimized for sort last.

    A more hopeful scenario is that we see multiple graphics queues on a chip, doing sort-first parallelization for opaque geometry, the G-buffer, etc.
     
  13. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,770
    Likes Received:
    470
    Internally they already are in some ways because of the ROPs.
     
  14. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    How so? ROPs make them sort last, not sort middle.

    You could make the argument that the triangle setup/raster parallelization introduced with Fermi makes them sort middle to a degree, but that is miles away from sort middle tiled. At realistic desktop triangle rates, I doubt any realistic off-chip multi-GPU interconnect would avoid being a latency and bandwidth bottleneck.

    Multi-socket interconnects like QPI etc. *might* work, but they don't seem to be anywhere on the roadmap for multi-GPU systems.
     
  15. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    As long as there are longer fps bars to be had in benchmarks with any method (in this case AFR), this is gonna stay a wet dream. Remember what multi-GPU launched with (AFR, SFR/scissoring, tiling) and what it has evolved into today.

    Except that on latest-generation multi-GPU setups, people will quite likely be using higher resolutions, e.g. 3840 x 2160 or multi-monitor configurations.
     
  16. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I don't see how sort middle has any hope of becoming reality now that we're way past the single tri per clock era. It could have been done before, but now there's too much data at that stage.

    I think object level sorting is the only route to getting rid of AFR, and if Xbox 360 wasn't enough impetus to encourage that kind of coarse tiling, then it's a lost cause. The number of gamers who have an SLI system and would buy one game over another because it used an alternative to AFR (likely at lower frame rates) is minuscule.
     
  17. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,910
    Likes Received:
    1,607
    http://www.eetimes.com/author.asp?section_id=36&doc_id=1321693&page_number=1
     
    #17 pharma, Mar 29, 2014
    Last edited by a moderator: Mar 29, 2014