AMD: Navi Speculation, Rumours and Discussion [2017-2018]

MDolenc · Aug 4, 2017

CarstenS said:
In fact, even multiple graphics cards used for mining or rendering (blender etc.) or the accelerators in supercomputers do work together very well already. The culprit is gaming: Vendors insisted on maximum length of benchmark bars for gaming and focused on AFR, which in turn introduces a whole load of troubles on it's own.

In the early days, screen partitioning in one way or the other was the method of choice and it worked rather well - at least compared to AFR-style mGPU. Problem is/was: How to market all the hassle with two or more GPUs [cost (2 cards, mainboards with 2x PEG, PSU, electricity) and noise] when you won't get 2× performance - while your competitor might actually do that by accepting all that is bad in MGPU (aka AFR). That's what broke the neck for MGPU in gaming, IMHO.

In case of mining or rendering application is aware of multiple GPUs in the system and it's the application that's partitioning the load. It's also applications that ensure that data needed for that specific partition of the problem is available on that specific GPU (this part is changing with unified memory though).

Games on the other hand are not using two explicit graphics devices in SLI/CF case. So there were always limits as to what driver has been able to figure out on its own. AFR is simple (render one frame on GPU 1, render next frame on GPU 2), but even this is beginning to break down this days when some intermediate steps may persist for multiple frames and as such require sync between the GPUs. Other approaches started dying earlier as render to texture became more widespread. Rendering say half of render target (say shadow map) on one gpu and another half on the other will need driver to sync both halves to both gpus and merge it on both gpus before using it later. Note that z-buffers (shadow maps) were also compressed before normal colour render targets were...
Multi GPU has been a sort of dark magic that "just worked" unless you did this and that and that for almost its entire existence with the exception of original Voodoo, but those were simple times.

sebbbi · Aug 4, 2017

MDolenc said:
Games on the other hand are not using two explicit graphics devices in SLI/CF case. So there were always limits as to what driver has been able to figure out on its own. AFR is simple (render one frame on GPU 1, render next frame on GPU 2), but even this is beginning to break down this days when some intermediate steps may persist for multiple frames and as such require sync between the GPUs. Other approaches started dying earlier as render to texture became more widespread. Rendering say half of render target (say shadow map) on one gpu and another half on the other will need driver to sync both halves to both gpus and merge it on both gpus before using it later. Note that z-buffers (shadow maps) were also compressed before normal colour render targets were...

Yeah. Automatic AFR worked as long as each frame was fully independent. But that meant that developer could not reuse any data generated by previous frames. As the complexity of the rendering has increased, reuse has become a crucial part of generating high quality images. And this trend will continue in the future. Two frames at 60 fps are very similar. There's so much work already done in the previous frame that developer wants to reuse.

At first games adapted optimizations such as reducing the refresh rate far away shadow maps and shadow cascades. This made it possible to have more dynamic shadow casting light sources at once. Static lighting wasn't the only sane option anymore. Then people started caching blended multilayer materials and decals to offscreen texture arrays and atlases to reduce the cost of repeated blending operations of complex materials (especially in terrain rendering). Only a small part of a huge texture was (permanently) changed every frame. Other data was reused. The quality of materials and terrain rendering increased. Then some people started to think about moving culling on GPU side. To reduce the CPU->GPU traffic, some devs kept scene data structures on GPU side and partially upgraded them every frame. And nowadays most games do temporal reprojection for anti-aliasing and stochastic techniques and temporal upscaling (including checkerboard rendering) is gaining popularity. Automatic AFR has no future. Developers either need to manually split the work to multiple GPUs, or GPUs need to adapt a simplified CPU-style multi-socket coherency model that allows them to cooperate transparency on the parallel workload of every frame.

giannhs · Aug 4, 2017

silent_guy said:
I don't think that answers the original question. ;-)

Motherboards with multiple CPUs have existed for decades and worked quite well. Adding multiple CPU dies on the same substrate is almost the same thing. The interface between just has a higher BW and there's some cache coherency protocol (I think).

It's not at all clear to me that the same can be done for GPUs without a massive BW interface between GPUs, and what the cost of that would be.

what he is trying to say is
devs can barely create a game nowdays that runs well on all single cards (doom) 99% of games are full with bugs and the perfomance is lower than expected so therefore since devs dont even care to fix the damn game before the release(let alone later with patches) they most surely wont give a single %^& for mgpu setups...
that forces nvidia and amd to take the matter into their hands and create a hardware+software solution to force mgpu regardless of what the dev is doing

AlexV · Aug 4, 2017

sebbbi said:
There's only 40 new instructions in the GCN5 ISA (http://diit.cz/clanek/architektura-radeon-rx-vega). Vega isn't a huge ISA jump. Vega seems to focus more on graphics side (tiled rasterizer, ROP caches, geometry pipes, etc) and optimizing the power usage and rising the clocks. GCN3 was a much bigger ISA change.

I'm not entirely sure that it is so desirable to have mutant cancerous ISAs that grow new instructions whenever given a chance and change or remove old ones on a whim. Adding without ever removing as a non-mutant alternative may be of interest, but people scream bloody murder about x86 all the time.

Rys · Aug 4, 2017

Samwell said:
As i don't have insight i can only take the public data, so might be much too high.

100M might be really too high, but are you really sure that it does not shoot up per node? Your ex company also showed numbers in which verification cost skyrocket, while true that it's on a way lower level. But 28nm to 16nm here we have a doubling and looking at the trend from 65nm this happened every node jump. Looking at the 25 M in this Graph i would expect 50 M at 7nm. Also verification cost should be bigger in a bigger chip i would expect or am i wrong? So maybe in big chips with vega size you could even reach 100M. At least that would've been my laymans thought that a chip much bigger than socs would cost way more to verify. Correct me if i'm wrong
https://www.imgtec.com/blog/imagination-tsmc-collaborate-on-iot-subsystems/

That graph is misleading too. If it wanted to be more accurate it should just have time and some measure of processor complexity as the x-axis, not node changes. And even then it just simply doesn't marry up with reality.

It's a whole discussion topic to cover why, don't want to derail this thread too much.

silent_guy · Aug 4, 2017

BacBeyond said:
How does it not answer his question?

His question was "How is it different from what we have now for multiple gpu support". The answer is, instead of requiring developer / extra driver hacks, it will work as a single GPU and not multiple. Say it's 1024 cores per "gpu", the system would see one 2048 core gpu instead of two 1024 ones.

Discrete CPUs worked fine in the past, so the reasonable expectation is that a similar glued solution will only work better.

Discrete GPUs did NOT work well in the past, so it's unreasonable to expect that just gluing them together with a bus that similar to the past solutions will magically make it work.

The solution that you're proposing, discrete dies that act like one monolithic GPU has not been done before. It's definitely not like AFR and not even like SFR.

Samwell · Aug 4, 2017

Rys said:
That graph is misleading too. If it wanted to be more accurate it should just have time and some measure of processor complexity as the x-axis, not node changes. And even then it just simply doesn't marry up with reality.

It's a whole discussion topic to cover why, don't want to derail this thread too much.

Of course, you're right. If you have the time and are in the mood someday to explain more about that in a different thread it would be great.

silent_guy said:
Discrete CPUs worked fine in the past, so the reasonable expectation is that a similar glued solution will only work better.

Discrete GPUs did NOT work well in the past, so it's unreasonable to expect that just gluing them together with a bus that similar to the past solutions will magically make it work.

The solution that you're proposing, discrete dies that act like one monolithic GPU has not been done before. It's definitely not like AFR and not even like SFR.

Just because it has not been done before doesn't mean that it wouldn't be possible and work good nowadays. Technology is evolving. The same could've been said about stacked chips and we get more and more of them.
The problem before was the limited interconnect speed, but especially by using interposers this isn't such a problem nowadays.

silent_guy · Aug 4, 2017

giannhs said:
that forces nvidia and amd to take the matter into their hands and create a hardware+software solution to force mgpu regardless of what the dev is doing

I seen the incompetence of game developers in making something efficient as an even stronger endorsement for using single die solutions as long as possible. ;-)

I have to believe sebbbi when he says that it's possible. I'm sure that, given enough effort and HW resources, it will eventually be possible. But I also believe that the solution is much harder than just slapping a bus between two dies that has moderate performance compared to DRAM BW, and that the solution will be more expensive and less efficient than a single die.

In other words, it will only be useful once single die solutions have hit the wall completely and there's nowhere else to go.

sebbbi · Aug 4, 2017

silent_guy said:
I have to believe sebbbi when he says that it's possible. I'm sure that, given enough effort and HW resources, it will eventually be possible. But I also believe that the solution is much harder than just slapping a bus between two dies that has moderate performance compared to DRAM BW, and that the solution will be more expensive and less efficient than a single die.

Two GPU dies with single shared memory seems to be "doable" for existing graphics workloads (where "doable" is only considering data sharing, not implementation cost). However future graphics workloads might require more coherency between thread groups. Also HPC/scientific workloads could already use algorithms that need more coherency between thread groups. For example fast global prefix sum algorithms are already latency bound (read previous group sum -> write sum to next group -> ...). My experience is mostly about game engines. GPU designs nowadays have to be balanced between professional compute use and gaming. I have no idea how much slower operations requiring coherency between thread groups would become. It might be a showstopper for some non-gaming use cases. What I am trying to say is that graphics APIs (including compute shaders offered by those APIs) don't actually require as much GPU<->GPU memory coherency as most people believe they do. As rasterizers are becoming more and more memory local (self contained tiles), there's even more opportunity to split the workload to multiple processors without needing high frequency data synchronization. However a system like this would need shared memory. Split memory (each GPU has their own dedicated memory) wouldn't work obviously. You'd need multiple dies on the same PCB (preferably on the same interposer like EPYC to reduce the GPU<->GPU latency to minimum).

xEx · Aug 4, 2017

Yes my question was what was or where were the difference that can it different(and transparent) between multi VGAs/GPUs and multi die GPUs. it was new software(driver) that can also have an impact in multi VGAs config or it was because of the short distance between dies that allows for less latency and higher bandwidth that can make this approach viable.

Anarchist4000 · Aug 4, 2017

xEx said:
Yes my question was what was or where were the difference that can it different(and transparent) between multi VGAs/GPUs and multi die GPUs. it was new software(driver) that can also have an impact in multi VGAs config or it was because of the short distance between dies that allows for less latency and higher bandwidth that can make this approach viable.

From a technical standpoint it isn't hard to do. Doing it with effective performance at a reasonable price is another question. Ideally the interconnect isn't even required and the frame rendered as independent tiles.

Something to consider. Having the engine split the render into two independent frames isn't too difficult. Problem being it would double CPU load with effectively twice as many draws. Twice the GPU with half the CPU is then problematic. In the case of DX12/Vulkan that overhead is much lower. GPU driven rendering even lower. So while the driver could accelerate mGPU in various ways, it's best done by a programmer as that's the only way to guarantee there are no coherency issues. The driver can't always assume that is the case and split the work appropriately.

sebbbi · Aug 7, 2017

xEx said:
Yes my question was what was or where were the difference that can it different(and transparent) between multi VGAs/GPUs and multi die GPUs. it was new software(driver) that can also have an impact in multi VGAs config or it was because of the short distance between dies that allows for less latency and higher bandwidth that can make this approach viable.

We haven't yet had any systems with multiple GPUs + uniform graphics memory between the GPUs. All implementations so far have has split graphics memories. I don't know how doable this is. We already have integrated GPUs and CPUs accessing the same unified memory. However even that configuration has downsides. The first question should be: Would it be efficient to share fast graphics memory between two GPUs? And I am not talking about full coherency. Only a small subset of the accesses need to be coherent between the GPUs (as explained in my previous post). But if we don't have shared memory, then fine grained automated load balancing becomes a pretty hard problem to solve.

Anarchist4000 said:
Something to consider. Having the engine split the render into two independent frames isn't too difficult. Problem being it would double CPU load with effectively twice as many draws. Twice the GPU with half the CPU is then problematic. In the case of DX12/Vulkan that overhead is much lower. GPU driven rendering even lower. So while the driver could accelerate mGPU in various ways, it's best done by a programmer as that's the only way to guarantee there are no coherency issues. The driver can't always assume that is the case and split the work appropriately.

Splitting viewport to two frustums (left & right) isn't that expensive. You simply add a single extra plane test (you already have 5) to your frustum culling code. Have two arrays, and put objects to the left/right array depending on the plane test result (intersecting objects to both). Of course objects crossing the center plane require two draw calls, but that's only a small subset of the visible objects. I don't think the draw calls and/or g-buffer rendering are an issue at all. Issues mostly occur in lighting and post processing steps, where you need to access neighbor pixels. This is problematic if the neighbor is on the other side of the screen. Examples: screen space ambient occlusion, screen space reflections, temporal AA, bloom, depth of field, motion blur, refraction effects... Two halves of the screen aren't independent of each other. Even if you solve these problems (for example by rendering wide overlap section between the halves), there's a bigger problem left: Shadow maps. You don't want to render each shadow map twice. Sun light cascades are huge and both left & right frustum would often sample the same location (imagine low sun angle and sun shining directly from the side). With GPU-driven rendering, you can do much better vs traditional CPU shadow culling, because you can go though the z-buffer to identify exactly which surface fragments are visible (= all possible receivers) for both sides of the screen. There's still some overlap, but nowhere near 2x.

If two GPUs had (non-coherent) unified memory, this would be much easier problem to solve. You would simply use two render queues (one per GPU) and some compute queues and some fences to ensure cache flushes at correct locations. In the example case you would do a fence wait to ensure that both halves of the screen are finished. Flush caches and continue post processing on both sides separately (both can read each others' data). You wouldn't need any cache coherence at all. But with some form of limited cache coherency, it would be even easier to split workload between two GPUs. It could be mostly automated.

Deleted member 13524 · Aug 7, 2017

So it seems a Threadripper is now leading as the world's fastest in inter-core/inter-die bandwidth.

http://ranker.sisoftware.net/top_ru...f4c9f8deb68bbe98e0ddeccaafcaf7c7e192af9f&l=en

The CPU isn't overclocked, so I'm guessing this might be the result of overclocked memory and the effect it has on infinity fabric?

Anarchist4000 · Aug 7, 2017

sebbbi said:
If two GPUs had (non-coherent) unified memory, this would be much easier problem to solve. You would simply use two render queues (one per GPU) and some compute queues and some fences to ensure cache flushes at correct locations. In the example case you would do a fence wait to ensure that both halves of the screen are finished. Flush caches and continue post processing on both sides separately (both can read each others' data). You wouldn't need any cache coherence at all. But with some form of limited cache coherency, it would be even easier to split workload between two GPUs. It could be mostly automated.

Unified memory shouldn't be a problem with Vega. That would already occur with system memory, SSGs, and external storage involving multiple adapters. VRAM likely partitioned for use with HBCC with and without coherency and paging. I can't think of any good reason for a coherent or paged framebuffer in regards to performance. HBCC would likely page resources in a way that won't allow them to be efficiently mapped. That shouldn't be an issue though as you wouldn't want to share them.

My concern would be reducing bandwidth on the interconnect as much as possible. The lighting and compute passes I'd imagine are difficult to partition efficiently for the driver. Beyond simple use cases. Compute especially as the frustum or screen space isn't necessarily apparent or evenly distributed. Tiled screen space and results will be interesting.

ToTTenTranz said:
So it seems a Threadripper is now leading as the world's fastest in inter-core/inter-die bandwidth.

http://ranker.sisoftware.net/top_ru...f4c9f8deb68bbe98e0ddeccaafcaf7c7e192af9f&l=en

The CPU isn't overclocked, so I'm guessing this might be the result of overclocked memory and the effect it has on infinity fabric?

Definitely faster memory as there is another Ripper further down at higher(4.2 vs 3.4) core clocks. Now for Epyc with twice as many links!

Deleted member 13524 · Aug 7, 2017

Anarchist4000 said:
Definitely faster memory as there is another Ripper further down at higher(4.2 vs 3.4) core clocks.

But how are we sure these bandwidth results refer to inter-die and not inter-core within the same die? Is the benchmark measuring speeds between all cores and showing only the slowest result?

Anarchist4000 said:
Now for Epyc with twice as many links!

I thought Epyc used the same amount of links between each core, and it doesn't work as a mesh.

Anarchist4000 · Aug 7, 2017

ToTTenTranz said:
I thought Epyc used the same amount of links between each core, and it doesn't work as a mesh.

The one socket solution should have had links between all the dies. The two was where the mesh started to break down going to two hops and four links to the other socket.

ToTTenTranz said:
But how are we sure these bandwidth results refer to inter-die and not inter-core within the same die? Is the benchmark measuring speeds between all cores and showing only the slowest result?

Can't be sure as I'm not familiar with the test, but more links should make more bandwidth available. Exception being if all links to one chip must use the same controller and speed is limited, but that seems a really bad design all things considered. Not being able to sustain the bandwidth of multiple independent links would be extremely low hanging fruit. The hard part is driving all the lanes, not internally routing the data at the equivalent of L1 cache speeds.

sebbbi · Aug 8, 2017

Anarchist4000 said:
Unified memory shouldn't be a problem with Vega. That would already occur with system memory, SSGs, and external storage involving multiple adapters. VRAM likely partitioned for use with HBCC with and without coherency and paging. I can't think of any good reason for a coherent or paged framebuffer in regards to performance. HBCC would likely page resources in a way that won't allow them to be efficiently mapped. That shouldn't be an issue though as you wouldn't want to share them.

I was not talking about GPU accessing unified DDR4 system memory. I was talking about unified graphics memory (GDDR5 or HBM2) between two GPUs. No paging obviously. Direct cache line granularity access by both GPUs to the same memory.

This would be conceptually similar to two CPU sockets accessing one shared system memory. However the GPU programming model doesn't need full coherency (many resources are read only, and UAV writes by default are only seen by the same group, so no global coherency is needed in default case = much less coherency traffic between the GPUs).

itsmydamnation · Aug 8, 2017

Maybe im being over simplistic, but if all worlds (rops,mtu's,ALU's) meet at the L2, could you just join your GPU's at the L2 slices ( ring , mesh /whatever) and maybe another cross connect for the front end. So the front end looks like one big front end and the L2 look like one big L2.

Remember we are talking silicon not organic interposer here unlike naples so they should be able to be driven faster and for less power.

sebbbi · Aug 8, 2017

itsmydamnation said:
Maybe im being over simplistic, but if all worlds (rops,mtu's,ALU's) meet at the L2, could you just join your GPU's at the L2 slices ( ring , mesh /whatever) and maybe another cross connect for the front end. So the front end looks like one big front end and the L2 look like one big L2.

That's the current way. Compute units, ROPs, etc communicate with each other through the L2 cache. But L2 is a performance critical part, so it must be on the same die as these units. Theoretically you could move the L2 out of the GPU die (shared Crystalwell style cache), but the latency would be much longer to L2 cache, so all operations would suffer. The bandwidth between the compute units and L2 is huge, so the off-chip interconnect would need to be larger than anything we have seen before. This would also add one extra chip (connected with the GPUs under the same interposer). So it would add cost.

itsmydamnation · Aug 8, 2017

sebbbi said:
That's the current way. Compute units, ROPs, etc communicate with each other through the L2 cache. But L2 is a performance critical part, so it must be on the same die as these units. Theoretically you could move the L2 out of the GPU die (shared Crystalwell style cache), but the latency would be much longer to L2 cache, so all operations would suffer. The bandwidth between the compute units and L2 is huge, so the off-chip interconnect would need to be larger than anything we have seen before. This would also add one extra chip (connected with the GPUs under the same interposer). So it would add cost.

I dont mean move it out, i mean extend the L2 fabric, Does every GCN "core" have the same bandwidth and latency to every part of the L2 right now ( i dont know but i doubt it) ? obvious its a question of power consumption and added latency, but those are two of the area's where a silcon interposer was supposed to be much better then an organic one.

AMD: Navi Speculation, Rumours and Discussion [2017-2018]

MDolenc

sebbbi

giannhs

AlexV

Heteroscedasticitate

Rys

Graphics @ AMD

silent_guy

Samwell

silent_guy

sebbbi

xEx

Anarchist4000

sebbbi

Deleted member 13524

Guest

Anarchist4000

Deleted member 13524

Guest

Anarchist4000

sebbbi

itsmydamnation

sebbbi

itsmydamnation