AMD: Speculation, Rumors, and Discussion (Archive)

Discussion in 'Architecture and Products' started by iMacmatician, Mar 30, 2015.

Thread Status:
Not open for further replies.
  1. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    When my solution A is the way it works, they're absolutely right that it's better.

    And it's not even hard to implement: treat the SLI bridge as a virtual video interface, skip way past the intricacies of resource sharing and whatnot, and you're done.

    It seems much smarter to me to bet on that solution than to think that you're smarter than the combined intelligence of a bunch of engineers who think about it all day.
     
    pharma likes this.
  2. Infinisearch

    Veteran

    Joined:
    Jul 22, 2004
    Messages:
    779
    Likes Received:
    146
    Location:
    USA
    Are you guys talking about implicit or explicit multi-GPU? Because AFAIK explicit multi-GPU uses PCIe, but the way it's talked about, it seems geared toward bulk transfers.
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    You asked why I think 32 ROPs are enough and I think geometry discard could make ROPs more efficient. Is that a minor optimisation if that's how it works?

    Why has it taken so long for delta colour compression to get to where it is? Or do you think that bandwidth isn't the single most important constraint in GPU design since forever?
     
  4. Alessio1989

    Regular

    Joined:
    Jun 6, 2015
    Messages:
    614
    Likes Received:
    321
    We are talking about linked-adapter node (SLI/CFX). Both linked and unlinked modes are explicit under D3D12 and Vulkan.
     
  5. Infinisearch

    Veteran

    Joined:
    Jul 22, 2004
    Messages:
    779
    Likes Received:
    146
    Location:
    USA
    If you are talking about linked-adapter mode, then why are you talking about sharing (non-framebuffer) resources? Aren't resources replicated across adapters in linked mode?
     
  6. Alessio1989

    Regular

    Joined:
    Jun 6, 2015
    Messages:
    614
    Likes Received:
    321
    Cross-node sharing/linked-adapter mode (though some adapters may come with multiple linked nodes, mostly 2), known commercially as SLI and CFX, allows explicit sharing of resources under Direct3D 12 (and I guess Vulkan too, though I did not study that one in depth). This is different from Direct3D 11 and older high-level APIs, where almost every resource needs a copy in every VRAM pool (though the NVAPI and AGS driver APIs later allowed a little control over them). What can and cannot be shared is expressed by the cross-node sharing tier. A higher cross-node sharing tier can be nullified if the cross-node configuration is limited by bandwidth: SLI bridges, even Pascal's new "double" bridge, have lower bandwidth than the PCIe lanes a typical multi-GPU configuration has available.
    Cross-node sharing is historically tied to AFR, but the new APIs allow many different and more efficient techniques and implementations.

    Cross-adapter mode, instead, is when you have different adapters (even from different vendors or architectures) seen as different devices. In cross-adapter mode you can share a resource heap too. In this scenario "share" literally means "copy", but that's not a major issue, since in this mode you really want the GPUs to do different jobs and to minimize the dependencies. Of course, different hardware and different configurations may differ in efficiency and in how cross-adapter resource sharing is implemented (i.e. two cards of the same vendor with similar architectures may optimize the sharing compared to two cards from different vendors).
    Cross-adapter mode can be very efficient if the workload is well balanced between the GPUs involved. AOS is at least one application using this kind of multi-GPU technique, and it demonstrated that it can be very efficient, and also that the results may vary if the GPUs involved swap jobs.
     
    #3266 Alessio1989, Jun 26, 2016
    Last edited: Jun 26, 2016
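A back-of-the-envelope sketch of the bandwidth point made above: a high cross-node sharing tier matters little if the bridge carrying the traffic is the bottleneck. All figures below are approximate, commonly cited peak numbers, not measurements, and the workload is just one illustrative example.

```python
# Compare approximate peak bandwidths of the links discussed above with one
# representative cross-node workload (an AFR-style 4K@60 frame stream).
# All bandwidth figures are rough, commonly quoted peaks, not measurements.

def gb_per_s(bytes_per_s):
    return bytes_per_s / 1e9

sli_bridge_classic = 1e9      # ~1 GB/s, original single SLI bridge (approx.)
sli_bridge_pascal_hb = 4e9    # ~4 GB/s, Pascal high-bandwidth bridge (approx.)
pcie3_x16 = 15.75e9           # ~15.75 GB/s, PCIe 3.0 x16, one direction

# One 4K frame of 32-bit pixels, 60 of them per second
frame_bytes = 3840 * 2160 * 4
stream = frame_bytes * 60

print(f"4K@60 stream:        {gb_per_s(stream):5.2f} GB/s")
print(f"classic SLI bridge:  {gb_per_s(sli_bridge_classic):5.2f} GB/s")
print(f"Pascal HB bridge:    {gb_per_s(sli_bridge_pascal_hb):5.2f} GB/s")
print(f"PCIe 3.0 x16:        {gb_per_s(pcie3_x16):5.2f} GB/s")
```

Even with these rough numbers, the bridge sits well below what the PCIe slot itself offers, which is the sense in which a generous sharing tier can be "nullified" by the physical link.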
  7. xEx

    xEx
    Veteran

    Joined:
    Feb 2, 2012
    Messages:
    1,060
    Likes Received:
    543
    I think a while ago someone was blown away by Polaris' discard abilities. I think he said "in some cases it's faster than anything on the market", referring to the discard capabilities Polaris had.

    Having BW is always important, but BW alone is useless unless you can use it (Fury).

    In my opinion, if Polaris can do the things AMD is saying it can do, then Polaris' resources are "good enough" for the work AMD wants Polaris to do. My biggest concern is what will happen in the gap between the $230 Polaris and the $400 Pascal. You can't just throw a bunch of "high-end custom designs" of a mid-range GPU at it and populate the $170 gap... especially if Polaris OC is as weak as the latest rumors suggest.
     
  8. Anarchist4000

    Veteran

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    More surprising is that we haven't seen NVLink used over their existing SLI connector. It might be a pin issue, but it would be a rough substitute for the lack of PCIe bandwidth. You could have a second card that was all memory.

    In the case of compute shaders, especially asynchronous ones, I could see the sharing issues getting rather interesting. Some of the uses would only be practical on a single adapter.
     
  9. Otto Dafe

    Regular

    Joined:
    Aug 11, 2005
    Messages:
    400
    Likes Received:
    59
    Are you referring to the quote in this post?
     
  10. spworley

    Newcomer

    Joined:
    Apr 19, 2013
    Messages:
    146
    Likes Received:
    190
    SLI was introduced for Nvidia cards in 2004, back when AGP slots were the standard. PCIe 1.0 came out, enabling multiple cards to be installed simultaneously... so cool! But PCIe 1.0 bandwidth is only about 1/8 of today's PCIe 3.0 bandwidth, and implementations were not as polished. The SLI bridge gave Nvidia the ability to directly share rendered images (alternate frames, partial frames, interleaved frames...) over a private bus they could control, and therefore one not sensitive to exact PCIe bus contention, motherboard quality, or northbridge chipset behavior... (ah, 2004, we had northbridges!)

    It's now over 10 years later, and PCIe bandwidth is no longer a major issue for graphics cards, even for sharing 4K@60Hz frames. So why keep SLI?
    The answer is simple: guaranteed bandwidth, and more importantly, guaranteed low-latency communication. You want to minimize frame stuttering when using multiple GPUs? Then a low, fixed-latency pipe makes it much easier. Using PCIe only is still possible of course, but if you want a guaranteed, low-latency transfer of video frames, the SLI backchannel gives it to you.

    Starting now with Pascal (and perhaps Polaris??), new unified memory models will mean that GPUs share RAM with each other (and the CPU) more transparently and likely more frequently. So our PCIe buses are going to be busier than they have been in the past... meaning the backchannel bus is more useful now than it was last year. Nvidia's new SLI bridges are higher bandwidth, likely to handle 4K or higher frames at higher refresh rates.

    In summary, SLI bridges are useful not because of extra bandwidth, but for the guaranteed low-latency communication. The freed PCIe bandwidth is a pleasant side effect.
     
    Grall and nnunn like this.
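The "PCIe bandwidth is no longer a major issue" claim above is easy to sanity-check with arithmetic: how long does one 4K frame take to move over each bus generation, against a 60 Hz frame budget? The per-lane figures are approximate x16 peaks, not measured throughput.

```python
# Rough check of the post's claim: time to move one 4K frame over each PCIe
# generation, compared with the 60 Hz frame budget.  Peak bandwidth figures
# are approximate, x16 slots assumed, protocol overhead ignored.

frame_bytes = 3840 * 2160 * 4        # one 4K frame, 32-bit pixels (~33 MB)
frame_budget_ms = 1000 / 60          # ~16.7 ms available per frame at 60 Hz

pcie1_x16 = 4.0e9                    # ~4 GB/s, PCIe 1.0 x16 (approx. peak)
pcie3_x16 = 15.75e9                  # ~15.75 GB/s, PCIe 3.0 x16 (approx. peak)

for name, bw in [("PCIe 1.0 x16", pcie1_x16), ("PCIe 3.0 x16", pcie3_x16)]:
    ms = frame_bytes / bw * 1000
    print(f"{name}: {ms:.2f} ms per frame "
          f"({ms / frame_budget_ms * 100:.0f}% of the 60 Hz budget)")
```

On PCIe 3.0 the transfer is a small slice of the budget, which is why the post argues the bridge's value is guaranteed latency rather than raw bandwidth.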
  11. xEx

    xEx
    Veteran

    Joined:
    Feb 2, 2012
    Messages:
    1,060
    Likes Received:
    543
    Actually, no. It was on another forum a while ago (I think he got an ES).
     
  12. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    When talking specifically about making ROPs more efficient, I mean it from a narrow point of view: increasing efficiency from, say, 90% of their theoretical peak to 95%, not avoiding them altogether.

    Of course, improved discard would help to avoid the alleged 32 ROP bottleneck...

    Because bandwidth is not improving as fast as core shader performance, and the gates required to implement compression are becoming relatively cheaper.

    That's exactly what I think. I've stated many times in the past, in the context of the introduction of HBM, that there's no need yet to go all out on bandwidth. That doesn't mean that it's not important, but it's definitely not the single most important constraint. If it were, Fury X would blow all other GPUs out of the water.
     
  13. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    I agree with the general content of your post, but PCIe has enough bandwidth to guarantee the QoS levels needed to transfer real-time video streams by adding enough buffering: either off-chip RAM (which would eat up some extra MC bandwidth) or on-chip FIFOs that feed straight into the display output port.
    The latter turns an SLI/Crossfire-over-PCIe implementation into a matter of cost: how much FIFO die area do you need to guarantee no underflows when scanning out pixels?
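The FIFO-sizing question above reduces to one multiplication: the buffer must cover the longest PCIe stall without the scanout side underflowing. A minimal sketch, where the worst-case stall duration is a made-up assumption for illustration, not a measured figure:

```python
# Sizing sketch for the on-chip scanout FIFO idea above: the FIFO must hold
# enough pixels to ride out the worst-case PCIe stall.  The stall duration
# below is an assumed illustrative value, not a characterized number.

scanout_bytes_per_s = 3840 * 2160 * 4 * 60   # 4K@60, 32-bit pixels (~1.99 GB/s)
worst_case_stall_s = 100e-6                  # assume PCIe can stall up to 100 us

fifo_bytes = scanout_bytes_per_s * worst_case_stall_s
print(f"FIFO needed: {fifo_bytes / 1024:.0f} KiB")
```

A couple hundred KiB of SRAM per display head is the kind of "cost" trade-off the post is describing; a tighter stall bound shrinks it proportionally.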
     
  14. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    With NVLink-like signaling, signal integrity could be a real issue.
     
    pharma, CSI PC, Lightman and 2 others like this.
  15. AnomalousEntity

    Newcomer

    Joined:
    Jun 6, 2016
    Messages:
    38
    Likes Received:
    25
    Location:
    Silicon Valley
    Geometry discard is of 3 types:
    1. Backface culling : Solved
    2. View volume culling : Solved
    3. Hidden surface culling : Solved in TBDR

    My point: there's not much left in culling more triangles to ultimately reduce pixel shading and ROP work. So not much to be gained here, I think.

    You could scale the geometry engines, i.e. make more shader clusters with, say, 8 CUs instead of 16 as on Fiji; this would also increase the rate of discard. This won't actually save pixel shading work, it just reduces the probability of geometry being a bottleneck. But then again, I am pretty sure it's either pixel shading or memory bandwidth that's the bottleneck in most modern games, rather than geometry.
     
    Heinrich04 and silent_guy like this.
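The first of the three discard types listed above, backface culling, is simple enough to show concretely: a triangle whose face normal points away from the viewer can be dropped before any pixel shading or ROP work is spent on it. A minimal sketch (a real pipeline does this per-primitive in fixed-function hardware, usually via the signed area after projection; the vector math here is just the textbook form):

```python
# Textbook backface culling: compute the triangle's face normal from two
# edges and test it against the view direction.  Triangles facing away
# from the viewer are culled before rasterization.

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

def dot(a, b):
    return sum(x*y for x, y in zip(a, b))

def is_backfacing(v0, v1, v2, view_dir=(0, 0, -1)):
    """True if the triangle's face normal points away from the viewer."""
    edge1 = tuple(b - a for a, b in zip(v0, v1))
    edge2 = tuple(b - a for a, b in zip(v0, v2))
    normal = cross(edge1, edge2)
    return dot(normal, view_dir) >= 0   # away from (or edge-on to) the viewer

# Counter-clockwise winding, seen by a viewer looking down -z, is front-facing.
front = ((0, 0, 0), (1, 0, 0), (0, 1, 0))
back  = ((0, 0, 0), (0, 1, 0), (1, 0, 0))
print(is_backfacing(*front), is_backfacing(*back))   # -> False True
```

For a closed opaque mesh this discards roughly half the triangles, which is why the post treats it as a solved, already-harvested win.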
  16. Anarchist4000

    Veteran

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    True, but Nvidia still managed to get 8 working in a box. Granted that was using that mezzanine connector. With HBM likely removing some pin constraints, I'm still wondering if we'll see a dual die card where a link could be hardwired. What's to say they couldn't modify a motherboard for a socketed GPU either? The interconnects are likely all GMI now. A dual or quad socket board, while typically reserved for servers, could have GPUs stuck in some sockets.
     
    pharma, spworley and nnunn like this.
  17. itsmydamnation

    Veteran

    Joined:
    Apr 29, 2007
    Messages:
    1,349
    Likes Received:
    470
    Location:
    Australia
    I don't know if this was linked here before, for whatever its worth:

    http://semiaccurate.com/forums/showpost.php?p=266518&postcount=2022
     
  18. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    505
    Likes Received:
    189
    NVLink has a fairly exotic signaling protocol, which is one reason it gets good pJ/bit. But AFAIK, NVLink cables are not currently possible from a technical point of view because of signal integrity.
     
    Kaarlisk, Grall, pharma and 1 other person like this.
  19. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    505
    Likes Received:
    189
    This is incorrect. GPUDirect RDMA is about network transfers, but the GPUDirect brand also includes peer-to-peer transfers within the same PCIe root complex.
     
    BRiT, pharma and silent_guy like this.
  20. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    Ignorant question of the day: on today's non-Polaris GPUs, how are pipeline stalls visible to the programmer, and how does one typically deal with them? And what did Larrabee do differently? Come to think of it: was Larrabee ever available as a GPU?
     