AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Discussion in 'Architecture and Products' started by Jawed, Mar 23, 2016.

  1. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    692
    Likes Received:
    441
    Location:
    Slovenia
    In the case of mining or rendering, the application is aware of multiple GPUs in the system, and it's the application that partitions the load. It's also the application that ensures the data needed for a specific partition of the problem is available on that specific GPU (though this part is changing with unified memory).

    Games, on the other hand, do not use two explicit graphics devices in the SLI/CF case, so there were always limits to what the driver could figure out on its own. AFR is simple (render one frame on GPU 1, render the next frame on GPU 2), but even this is beginning to break down these days, as some intermediate results persist across multiple frames and therefore require syncing between the GPUs. Other approaches started dying earlier as render-to-texture became more widespread. Rendering, say, half of a render target (say a shadow map) on one GPU and the other half on the other requires the driver to sync both halves to both GPUs and merge them on both before later use. Note that z-buffers (shadow maps) were also compressed before normal colour render targets were...
    Multi-GPU has been a sort of dark magic that "just worked" unless you did this, that and the other, for almost its entire existence, with the exception of the original Voodoo, but those were simpler times.
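
    A minimal sketch of the AFR idea and the persistent-resource problem described above (my own illustration; the types and function names are invented, not anyone's driver code):

    Code:
    #include <cstdint>
    #include <vector>

    struct GpuCopy { uint32_t srcGpu, dstGpu; /* resource handle would go here */ };

    // Automatic AFR at its core: round-robin the frames over the available GPUs.
    uint32_t SelectGpuForFrame(uint64_t frameIndex, uint32_t gpuCount)
    {
        return static_cast<uint32_t>(frameIndex % gpuCount);   // GPU 0, 1, 0, 1, ...
    }

    // Any resource produced on the previous frame's GPU and consumed this frame
    // (persistent shadow maps, temporal history buffers, ...) forces an inter-GPU
    // copy plus a sync point, which is exactly what breaks "free" AFR scaling.
    std::vector<GpuCopy> CopiesNeeded(uint32_t prevGpu, uint32_t curGpu,
                                      size_t persistentResourceCount)
    {
        std::vector<GpuCopy> copies;
        if (prevGpu != curGpu)
            copies.resize(persistentResourceCount, GpuCopy{prevGpu, curGpu});
        return copies;
    }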
     
  2. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    Yeah. Automatic AFR worked as long as each frame was fully independent. But that meant the developer could not reuse any data generated by previous frames. As the complexity of rendering has increased, reuse has become a crucial part of generating high-quality images, and this trend will continue in the future. Two consecutive frames at 60 fps are very similar; there's so much work already done in the previous frame that the developer wants to reuse.

    At first, games adopted optimizations such as reducing the refresh rate of far-away shadow maps and shadow cascades. This made it possible to have more dynamic shadow-casting light sources at once; static lighting wasn't the only sane option anymore. Then people started caching blended multilayer materials and decals to offscreen texture arrays and atlases to reduce the cost of repeatedly blending complex materials (especially in terrain rendering). Only a small part of a huge texture was (permanently) changed every frame; the other data was reused, and the quality of materials and terrain rendering increased. Then some people started thinking about moving culling to the GPU side. To reduce CPU->GPU traffic, some devs kept scene data structures on the GPU side and partially updated them every frame. And nowadays most games do temporal reprojection for anti-aliasing and stochastic techniques, and temporal upscaling (including checkerboard rendering) is gaining popularity. Automatic AFR has no future. Developers either need to manually split the work across multiple GPUs, or GPUs need to adopt a simplified CPU-style multi-socket coherency model that allows them to cooperate transparently on the parallel workload of every frame.
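
    A minimal sketch of the reprojection step behind temporal AA / temporal upscaling (my own illustration; the vector/matrix types and function name are invented, and in a real renderer this math runs in a shader against a history buffer):

    Code:
    // Reproject the current pixel's world position with last frame's
    // view-projection matrix to find where the same surface point was on
    // screen, then blend with the stored history sample at that UV.
    struct Vec3 { float x, y, z; };
    struct Vec4 { float x, y, z, w; };
    struct Mat4 { float m[4][4]; };

    static Vec4 Mul(const Mat4& m, const Vec4& v)
    {
        Vec4 r{};
        r.x = m.m[0][0]*v.x + m.m[0][1]*v.y + m.m[0][2]*v.z + m.m[0][3]*v.w;
        r.y = m.m[1][0]*v.x + m.m[1][1]*v.y + m.m[1][2]*v.z + m.m[1][3]*v.w;
        r.z = m.m[2][0]*v.x + m.m[2][1]*v.y + m.m[2][2]*v.z + m.m[2][3]*v.w;
        r.w = m.m[3][0]*v.x + m.m[3][1]*v.y + m.m[3][2]*v.z + m.m[3][3]*v.w;
        return r;
    }

    // Returns the previous frame's UV for this surface point, or false if it
    // was off screen last frame (history miss -> fall back to the new sample).
    bool ReprojectToPrevUv(const Vec3& worldPos, const Mat4& prevViewProj, float uv[2])
    {
        Vec4 clip = Mul(prevViewProj, Vec4{worldPos.x, worldPos.y, worldPos.z, 1.0f});
        if (clip.w <= 0.0f) return false;
        float ndcX = clip.x / clip.w, ndcY = clip.y / clip.w;
        uv[0] = ndcX * 0.5f + 0.5f;
        uv[1] = 1.0f - (ndcY * 0.5f + 0.5f);
        if (uv[0] < 0.0f || uv[0] > 1.0f || uv[1] < 0.0f || uv[1] > 1.0f) return false;
        return true;   // caller blends the history sample at uv with the new sample
    }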
     
  3. giannhs

    Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    38
    Likes Received:
    40
    What he is trying to say is:
    devs can barely create a game nowadays that runs well on all single cards (Doom being one of the exceptions). 99% of games are full of bugs and the performance is lower than expected, so since devs don't even care to fix the damn game before release (let alone later with patches), they most surely won't give a single %^& for mGPU setups...
    That forces NVIDIA and AMD to take the matter into their own hands and create a hardware+software solution that makes mGPU work regardless of what the dev is doing.
     
  4. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    I'm not entirely sure that it is so desirable to have mutant cancerous ISAs that grow new instructions whenever given a chance and change or remove old ones on a whim. Adding without ever removing as a non-mutant alternative may be of interest, but people scream bloody murder about x86 all the time.
     
    Silent_Buddha and DavidGraham like this.
  5. Rys

    Rys PowerVR
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,164
    Likes Received:
    1,461
    Location:
    Beyond3D HQ
    That graph is misleading too. If it wanted to be more accurate it should just have time and some measure of processor complexity as the x-axis, not node changes. And even then it simply doesn't marry up with reality.

    Covering why is a whole discussion topic of its own; I don't want to derail this thread too much.
     
  6. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    Discrete CPUs worked fine in the past, so the reasonable expectation is that a similar glued solution will only work better.

    Discrete GPUs did NOT work well in the past, so it's unreasonable to expect that just gluing them together with a bus that's similar to past solutions will magically make it work.

    The solution that you're proposing, discrete dies that act like one monolithic GPU, has not been done before. It's definitely not like AFR, and not even like SFR.
     
  7. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    126
    Likes Received:
    154
    Of course, you're right. If you have the time and are in the mood someday to explain more about that in a different thread, that would be great. :-D

    Just because it has not been done before doesn't mean that it wouldn't be possible and work well nowadays. Technology is evolving. The same could have been said about stacked chips, and we're getting more and more of them.
    The problem before was limited interconnect speed, but with interposers especially, this isn't such a problem nowadays.
     
  8. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    I see the incompetence of game developers at making something efficient as an even stronger endorsement for using single-die solutions as long as possible. ;-)

    I have to believe sebbbi when he says that it's possible. I'm sure that, given enough effort and HW resources, it will eventually be possible. But I also believe that the solution is much harder than just slapping between two dies a bus with moderate performance compared to DRAM BW, and that it will be more expensive and less efficient than a single die.

    In other words, it will only be useful once single-die solutions have hit the wall completely and there's nowhere else to go.
     
  9. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    Two GPU dies with a single shared memory seem to be "doable" for existing graphics workloads (where "doable" only considers data sharing, not implementation cost). However, future graphics workloads might require more coherency between thread groups, and HPC/scientific workloads could already use algorithms that need it. For example, fast global prefix sum algorithms are already latency bound (read the previous group's sum -> write the sum for the next group -> ...). My experience is mostly with game engines. GPU designs nowadays have to be balanced between professional compute use and gaming, and I have no idea how much slower operations requiring coherency between thread groups would become; it might be a showstopper for some non-gaming use cases.

    What I am trying to say is that graphics APIs (including the compute shaders offered by those APIs) don't actually require as much GPU<->GPU memory coherency as most people believe they do. As rasterizers are becoming more and more memory local (self-contained tiles), there's even more opportunity to split the workload across multiple processors without needing high-frequency data synchronization. However, a system like this would need shared memory; split memory (each GPU having its own dedicated memory) obviously wouldn't work. You'd need multiple dies on the same PCB (preferably on the same package, EPYC-style, to reduce the GPU<->GPU latency to a minimum).
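
    A CPU-side sketch of why such chained prefix sums are latency bound (my own emulation of the GPU pattern, not shader code; GroupSlot and PublishGroupSum are invented names): group i cannot publish its running total until group i-1 has, so every hop in the chain is a cross-group, and with a split GPU potentially cross-die, read-after-write.

    Code:
    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    struct GroupSlot {
        std::atomic<int>     ready{0};            // 1 once inclusivePrefix is valid
        std::atomic<int64_t> inclusivePrefix{0};  // running total up to this group
    };

    // Called once per group after it has computed its local sum.
    int64_t PublishGroupSum(GroupSlot* slots, size_t groupId, int64_t localSum)
    {
        int64_t exclusivePrefix = 0;
        if (groupId > 0) {
            // The serial, latency-bound hop: wait for the previous group to publish.
            while (slots[groupId - 1].ready.load(std::memory_order_acquire) == 0) { /* spin */ }
            exclusivePrefix = slots[groupId - 1].inclusivePrefix.load(std::memory_order_relaxed);
        }
        slots[groupId].inclusivePrefix.store(exclusivePrefix + localSum, std::memory_order_relaxed);
        slots[groupId].ready.store(1, std::memory_order_release);
        return exclusivePrefix;   // each element in the group adds this to its local scan result
    }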
     
    Lightman and BacBeyond like this.
  10. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    939
    Likes Received:
    399
    Yes, my question was about what the difference is that would make a multi-die GPU different (and transparent) compared to multiple VGAs/GPUs. Is it new software (driver), which could also have an impact in a multi-VGA config, or is it the short distance between dies, allowing lower latency and higher bandwidth, that makes this approach viable?
     
  11. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    From a technical standpoint it isn't hard to do. Doing it with effective performance at a reasonable price is another question. Ideally the interconnect isn't even required and the frame is rendered as independent tiles.

    Something to consider: having the engine split the render into two independent frames isn't too difficult. The problem is that it would roughly double CPU load, with effectively twice as many draws; twice the GPU with half the CPU is then problematic. With DX12/Vulkan that overhead is much lower, and with GPU-driven rendering lower still. So while the driver could accelerate mGPU in various ways, it's best done by the programmer, as that's the only way to guarantee there are no coherency issues; the driver can't always assume that's the case and split the work appropriately.
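
    A hedged sketch of the doubled-CPU-cost argument above, using hypothetical engine types (DrawItem, CommandRecorder, SubmitToGpu are not a real API): the naive split records and submits the same visible draw list once per GPU.

    Code:
    #include <cstdint>
    #include <vector>

    struct DrawItem        { uint32_t mesh, material; };
    struct CommandRecorder { void RecordDraw(const DrawItem&) { /* encode one draw */ } };
    void SubmitToGpu(uint32_t /*gpuIndex*/, CommandRecorder&) { /* hypothetical submit */ }

    void RenderFrameOnTwoGpus(const std::vector<DrawItem>& visibleDraws)
    {
        for (uint32_t gpu = 0; gpu < 2; ++gpu) {
            CommandRecorder rec;
            for (const DrawItem& d : visibleDraws)   // same list walked twice: ~2x CPU submission work
                rec.RecordDraw(d);
            SubmitToGpu(gpu, rec);
        }
    }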
     
  12. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    We haven't yet had any systems with multiple GPUs plus uniform graphics memory shared between the GPUs; all implementations so far have had split graphics memories. I don't know how doable this is. We already have integrated GPUs and CPUs accessing the same unified memory, but even that configuration has downsides. The first question should be: would it be efficient to share fast graphics memory between two GPUs? And I am not talking about full coherency. Only a small subset of the accesses need to be coherent between the GPUs (as explained in my previous post). But if we don't have shared memory, then fine-grained automated load balancing becomes a pretty hard problem to solve.
    Splitting the viewport into two frustums (left & right) isn't that expensive. You simply add a single extra plane test (you already have 5) to your frustum culling code. Have two arrays, and put objects into the left/right array depending on the plane test result (intersecting objects go into both). Of course objects crossing the center plane require two draw calls, but that's only a small subset of the visible objects. I don't think the draw calls and/or g-buffer rendering are an issue at all. Issues mostly occur in the lighting and post-processing steps, where you need to access neighbor pixels. This is problematic if the neighbor is in the other half of the screen. Examples: screen space ambient occlusion, screen space reflections, temporal AA, bloom, depth of field, motion blur, refraction effects... The two halves of the screen aren't independent of each other. Even if you solve these problems (for example by rendering a wide overlap region between the halves), there's a bigger problem left: shadow maps. You don't want to render each shadow map twice. Sun light cascades are huge, and both the left & right frustums would often sample the same locations (imagine a low sun angle with the sun shining directly from the side). With GPU-driven rendering you can do much better than traditional CPU shadow culling, because you can go through the z-buffer to identify exactly which surface fragments are visible (= all possible receivers) for both sides of the screen. There's still some overlap, but nowhere near 2x.
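
    A minimal C++ sketch of that extra plane test (my own illustration under the assumptions above; the Plane/Sphere types and function names are invented, and a real engine would test whatever bounding volumes it already uses):

    Code:
    #include <vector>

    struct Plane  { float nx, ny, nz, d; };   // plane equation nx*x + ny*y + nz*z + d
    struct Sphere { float x, y, z, radius; }; // bounding sphere of a visible object

    static float SignedDistance(const Plane& p, const Sphere& s)
    {
        return p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
    }

    // Runs after the normal frustum cull: one extra test against the vertical
    // centre plane buckets each object into the left list, the right list, or
    // both (objects crossing the plane get drawn by both GPUs).
    void SplitVisibleObjects(const std::vector<Sphere>& visible,   // already frustum-culled
                             const Plane& centrePlane,             // splits the view into left/right
                             std::vector<int>& leftList,
                             std::vector<int>& rightList)
    {
        for (int i = 0; i < static_cast<int>(visible.size()); ++i) {
            float dist = SignedDistance(centrePlane, visible[i]);
            if (dist >= -visible[i].radius) rightList.push_back(i);  // at least partly right
            if (dist <=  visible[i].radius) leftList.push_back(i);   // at least partly left
        }
    }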

    If two GPUs had (non-coherent) unified memory, this would be a much easier problem to solve. You would simply use two render queues (one per GPU), some compute queues, and some fences to ensure cache flushes at the correct points. In the example case you would do a fence wait to ensure that both halves of the screen are finished, flush caches, and continue post-processing on both sides separately (both can read each other's data). You wouldn't need any cache coherence at all. But with some form of limited cache coherency, it would be even easier to split the workload between two GPUs; it could be mostly automated.
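
    A hedged pseudo-API sketch of that queue/fence scheme (Gpu, Fence and their methods are hypothetical stand-ins, not D3D12 or Vulkan calls; only the ordering matters):

    Code:
    struct Fence {
        bool signalled = false;
        void Signal()             { signalled = true; }
        void WaitUntilSignalled() { /* on real hardware: a queue wait, not a CPU spin */ }
    };

    struct Gpu {
        void RenderHalf(int /*half*/)        { /* g-buffer + lighting for its half of the screen */ }
        void FlushCaches()                   { /* make its writes visible in the shared memory */ }
        void RunPostProcessing(int /*half*/) { /* may sample pixels produced by the other GPU */ }
    };

    // Each GPU renders its half, flushes, and signals; before post-processing it
    // waits on the other GPU's fence. No cache coherency needed, only these
    // explicit sync points over the shared (non-coherent) graphics memory.
    void RenderFrameOnSharedMemory(Gpu& gpu0, Gpu& gpu1, Fence& f0, Fence& f1)
    {
        gpu0.RenderHalf(0);  gpu0.FlushCaches();  f0.Signal();   // left half done and visible
        gpu1.RenderHalf(1);  gpu1.FlushCaches();  f1.Signal();   // right half done and visible

        f1.WaitUntilSignalled();  gpu0.RunPostProcessing(0);     // GPU 0 can read GPU 1's pixels
        f0.WaitUntilSignalled();  gpu1.RunPostProcessing(1);     // GPU 1 can read GPU 0's pixels
    }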
     
  13. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,035
    Likes Received:
    5,576
    Lightman, BRiT and Anarchist4000 like this.
  14. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Unified memory shouldn't be a problem with Vega; that would already occur with system memory, SSGs, and external storage involving multiple adapters. VRAM is likely partitioned for use with HBCC, with and without coherency and paging. I can't think of any good reason, performance-wise, for a coherent or paged framebuffer. HBCC would likely page resources in a way that won't allow them to be efficiently mapped, but that shouldn't be an issue, as you wouldn't want to share them anyway.

    My concern would be reducing bandwidth on the interconnect as much as possible. The lighting and compute passes, I'd imagine, are difficult for the driver to partition efficiently beyond simple use cases; compute especially, as the frustum or screen space isn't necessarily apparent or evenly distributed. Tiled screen space and the results will be interesting.

    Definitely faster memory, as there is another Ripper further down at higher (4.2 vs 3.4 GHz) core clocks. Now for Epyc with twice as many links!
     
  15. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,035
    Likes Received:
    5,576
    But how are we sure these bandwidth results refer to inter-die and not inter-core within the same die? Is the benchmark measuring speeds between all cores and showing only the slowest result?

    I thought Epyc used the same number of links between each core, and that it doesn't work as a mesh.
     
  16. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    The one-socket solution should have had links between all the dies. The two-socket case is where the mesh started to break down, going to two hops and four links to the other socket.

    Can't be sure, as I'm not familiar with the test, but more links should make more bandwidth available. The exception would be if all links to one chip had to use the same controller and the speed were limited, but that seems a really bad design all things considered; not being able to sustain the bandwidth of multiple independent links would be leaving extremely low-hanging fruit. The hard part is driving all the lanes, not internally routing the data at the equivalent of L1 cache speeds.
     
  17. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    I was not talking about a GPU accessing unified DDR4 system memory. I was talking about unified graphics memory (GDDR5 or HBM2) shared between two GPUs. No paging, obviously: direct cache-line-granularity access by both GPUs to the same memory.

    This would be conceptually similar to two CPU sockets accessing one shared system memory. However, the GPU programming model doesn't need full coherency (many resources are read-only, and UAV writes by default are only seen by the same group, so no global coherency is needed in the default case = much less coherency traffic between the GPUs).
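
    A rough classification sketch of that point (my own reading, expressed with invented C++ names rather than any real API): only the last category would generate coherency traffic between two GPUs sharing one graphics memory.

    Code:
    enum class AccessClass {
        ReadOnlyResource,     // textures, buffers, constants: no coherency needed at all
        GroupLocalUavWrite,   // default UAV writes: only need to be seen by the same thread group
        GloballyCoherentWrite // explicitly globally visible writes: the rare, costly case
    };

    // Everything except GloballyCoherentWrite can be handled with the usual
    // cache flushes at barrier/fence boundaries instead of live coherency traffic.
    bool NeedsCrossGpuCoherence(AccessClass c)
    {
        return c == AccessClass::GloballyCoherentWrite;
    }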
     
    #117 sebbbi, Aug 8, 2017
    Last edited: Aug 8, 2017
    DavidGraham likes this.
  18. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,311
    Likes Received:
    411
    Location:
    Australia
    Maybe I'm being overly simplistic, but if everything (ROPs, TMUs, ALUs) meets at the L2, could you just join your GPUs at the L2 slices (ring, mesh, whatever), with maybe another cross-connect for the front end? So the front end looks like one big front end and the L2 looks like one big L2.

    Remember we are talking about a silicon interposer here, not an organic one like Naples, so the links should be able to be driven faster and at lower power.
     
  19. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    That's the current way: compute units, ROPs, etc. communicate with each other through the L2 cache. But the L2 is a performance-critical part, so it must be on the same die as those units. Theoretically you could move the L2 out of the GPU die (a shared Crystalwell-style cache), but the latency to the L2 cache would then be much higher, so all operations would suffer. The bandwidth between the compute units and the L2 is huge, so the off-chip interconnect would need to be wider than anything we have seen before. This would also add one extra chip (connected with the GPUs on the same interposer), so it would add cost.
     
  20. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,311
    Likes Received:
    411
    Location:
    Australia
    I don't mean move it out, I mean extend the L2 fabric. Does every GCN "core" have the same bandwidth and latency to every part of the L2 right now (I don't know, but I doubt it)? Obviously it's a question of power consumption and added latency, but those are two of the areas where a silicon interposer was supposed to be much better than an organic one.
     