AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Discussion in 'Architecture and Products' started by Jawed, Mar 23, 2016.

Thread Status:
Not open for further replies.
  1. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    681
    Likes Received:
    544
    Location:
    55°38′33″ N, 37°28′37″ E
    There are only two types of workloads - computationally intensive and memory bandwidth intensive.

    No, it was using 768 GB/s links, which they assumed to be practically possible today, with a 16 MByte L1.5 cache per die and a 'first touch' virtual page allocation policy.
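    A 'first touch' policy can be sketched in a few lines: a virtual page is homed in the local memory of whichever GPU module touches it first, so subsequent accesses from that module stay local. This is a minimal illustrative model, not the paper's implementation; all names here are invented.

    ```python
    class FirstTouchAllocator:
        """Toy model of first-touch page placement across GPU modules (GPMs)."""

        def __init__(self, num_gpms):
            self.num_gpms = num_gpms
            self.page_home = {}  # virtual page number -> owning GPM

        def access(self, gpm_id, page):
            """Record an access; home the page on its first touch."""
            home = self.page_home.setdefault(page, gpm_id)
            return "local" if home == gpm_id else "remote"

    alloc = FirstTouchAllocator(num_gpms=4)
    print(alloc.access(0, page=42))  # first touch by GPM 0 -> "local"
    print(alloc.access(1, page=42))  # GPM 1 now pays the inter-die cost -> "remote"
    print(alloc.access(0, page=42))  # -> "local"
    ```

    The point of the policy is that, combined with a scheduler that keeps a thread block on the module that first touched its data, most traffic never crosses the inter-die link.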
     
    #521 DmitryKo, Jun 30, 2018
    Last edited: Jul 1, 2018
    Bondrewd and ImSpartacus like this.
  2. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    So the kind of user interaction with the outcome of those workloads does not play a role either? I tend to disagree.
     
    #522 CarstenS, Jun 30, 2018
    Last edited: Jun 30, 2018
  3. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,773
    Likes Received:
    2,560
    This is directly from NVIDIA's paper:

    https://www.pcper.com/reviews/Graphics-Cards/NVIDIA-Discusses-Multi-Die-GPUs

    So yeah, up to 3 TB/s is postulated in a simulation; it could be more in a real-world workload.
     
  4. ImSpartacus

    Regular Newcomer

    Joined:
    Jun 30, 2015
    Messages:
    252
    Likes Received:
    199
    That's the interpretation of PCPerspective. Reading the actual paper, I'm getting a different interpretation.
    • From page 4, the authors examined performance scaling as you move from 384 GB/s all the way to 6 TB/s of inter-GPM bandwidth.

    • From page 10, the authors examined performance scaling going from a lowly multi-GPU config all the way up to a hypothetical equivalent monolithic GPU, with MCM-style solutions in between using 768 GB/s and 6 TB/s links.


    Maybe I'm misinterpreting the paper as I'm just a layman, but it feels like Nvidia investigated more than 768 GB/s to 3 TB/s. It's more like 384 GB/s to 6 TB/s.
     
    pharma, DmitryKo and DavidGraham like this.
  5. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,773
    Likes Received:
    2,560
    From the figure, a 3TB/s solution provided 95~99% of the 6TB/s solution performance. So maybe that's why PCPer were content with mentioning only the 3TB/s.
     
  6. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    681
    Likes Received:
    544
    Location:
    55°38′33″ N, 37°28′37″ E
    It does not create a different kind of GPU workload which you need to optimize for.

    No. The principal point of the paper is cache controller and thread scheduler optimisations for a multi-die GPU which allow 'practical' 768 GB/s links to achieve around 90-95% of the performance of 'ideal' very-high-bandwidth links (or an equivalent monolithic die).

    They expressly state that 3 TB/s links are beyond the current state of technology, and that an equivalent monolithic GPU is not possible to implement at all, so these are provided for comparison only.

    This is not an NVIDIA paper - the links to the actual research paper and my short summary of their findings are given in the post above.

    Which would be a gross misinterpretation of the results of this research.

    Exactly - they research optimisations which allow a 768 GB/s link to perform on par with a multi-terabyte link.
     
    #526 DmitryKo, Jul 1, 2018
    Last edited: Jul 1, 2018
    CSI PC and pharma like this.
  7. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Just to say, as this has popped up again: Nvidia's design R&D specific to this type of solution goes back to 2014 and ties back into Volta, one aspect of which is NVSwitch (very loosely).
    I think the approach between Nvidia and AMD is pretty different when it comes to integrating the MCM-GPU design and the signalling-data/coherency.
     
  8. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    That mindset probably led people to believe SLI and Crossfire were a good idea, before both were massively toned down in marketing visibility as well as in support in the recent architecture generations for gamers. Already from a pipeline design perspective it makes a difference whether you can fill up your results file over seconds, minutes or hours (CUDA), or whether you not only have to be ready with a host of differently bottlenecked calculations as often as, say, 144 times a second, but also have to display the results in a proper manner.
     
    DavidGraham and BRiT like this.
  9. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    681
    Likes Received:
    544
    Location:
    55°38′33″ N, 37°28′37″ E
    AMD's approach to MCM-GPU is not just 'different' - it's inferior. If Nvidia's MCM-GPUs look like a single big GPU to the OS while AMD can only expose MCM-GPUs as an explicit multi-GPU configuration, then, as AMD rightly said, this won't be popular with application developers.

    These figures define performance targets, not the type of workload.
     
    DavidGraham likes this.
  10. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    Yes and no, and that's where the whole sub-debate started. In contrast to most CUDA tasks, your typical gaming workload changes its characteristics many times even within a single frame, of which you want at least 60 and preferably 120 or more per second. Subtasks change from compute-bound to (graphics) memory-bound to I/O-bound many times a second. That's what has made it so hard to deliver a satisfying experience to gamers over the years. If forced to rigidly apply a categorization like yours, I'd propose that games are inherently I/O-bound (off-card I/O, that is) when compared with the amount of computation or bandwidth relative to I/O that's needed in many CUDA/OpenCL tasks.

    Sure, it looks good in some demos, you can boost benchmark scores and for some people it works in some games. Generally though, as is evident with the diminishing effort put into marketing Crossfire and SLI to gamers or even to include support in certain types of graphics cards, AMD and Nvidia seem to have all but given up on that idea and focus on the professional market with MGPU.
     
    DavidGraham and BRiT like this.
  11. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    That's not quite what DmitryKo is pointing to. Yes, there's a bunch of different bottlenecks when rendering a single frame. And without making every component of the GPU 2x wider you won't get 2x performance in all scenarios. You optimize for parts that are most often bottlenecked and for the longest duration of time.

    But that's not the problem with SLI/CF. AFR takes advantage of parallelism across frames. SFR basically says to hell with vertex processing (you have to do it twice), and then you have the problem of splitting the pixel load 50:50 between two cards (though the checkerboard approach AMD had solves this part quite nicely). The reason for the diminishing effort on SLI and CF is that games nowadays tend to break parallelism across frames by introducing dependencies between frames, which are effectively sync points. SFR died way earlier, when games started using a whole bunch of different render targets which again had to be synced across GPUs, killing the benefits. That has nothing to do with the specific bottlenecks a game would experience on a single GPU.
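    The sync-point argument can be made concrete with a toy timing model: with independent frames, two GPUs running AFR overlap and roughly halve total time, but as soon as each frame needs the previous frame's result, the frames serialize and the second GPU buys nothing. The numbers are invented for illustration.

    ```python
    FRAME_TIME = 10.0  # ms to render one frame on one GPU (illustrative)

    def afr_total_time(num_frames, has_dependency):
        """Total time to render num_frames with 2-GPU alternate frame rendering."""
        gpu_free = [0.0, 0.0]   # time at which each GPU becomes available
        prev_done = 0.0         # completion time of the previous frame
        for i in range(num_frames):
            gpu = i % 2         # AFR: GPUs alternate frames
            start = gpu_free[gpu]
            if has_dependency:  # frame needs the previous frame's output
                start = max(start, prev_done)
            done = start + FRAME_TIME
            gpu_free[gpu] = done
            prev_done = done
        return prev_done

    independent = afr_total_time(100, has_dependency=False)
    dependent = afr_total_time(100, has_dependency=True)
    print(independent)  # frames overlap: ~half the single-GPU total
    print(dependent)    # sync points serialize: same as a single GPU
    ```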
     
  12. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    Yet he started this line of argument by directly quoting, and contradicting, the point that CUDA workloads are not the same as gaming workloads. So why does MGPU seem very valid for a lot of CUDA applications, while not quite so for many games?
     
    BRiT likes this.
  13. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    681
    Likes Received:
    544
    Location:
    55°38′33″ N, 37°28′37″ E
    Exactly. Every workload is "different" and you can record a thousand performance indicators, but basically there are only two independent factors - computational power and memory access bandwidth. Every other factor is dependent on these two variables, which limit your maximum performance.
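    This two-factor view is essentially the roofline model: attainable throughput is the minimum of peak compute and arithmetic intensity times peak bandwidth. A sketch with round, illustrative numbers (not any specific GPU):

    ```python
    PEAK_FLOPS = 14e12   # 14 TFLOP/s peak compute (illustrative)
    PEAK_BW = 900e9      # 900 GB/s peak memory bandwidth (illustrative)

    def attainable(flops_per_byte):
        """Roofline: performance is capped by compute or by bandwidth."""
        return min(PEAK_FLOPS, flops_per_byte * PEAK_BW)

    for ai in (1, 4, 16, 64):  # arithmetic intensity in FLOP/byte
        perf = attainable(ai)
        bound = "bandwidth-bound" if perf < PEAK_FLOPS else "compute-bound"
        print(f"AI={ai:3d} FLOP/B -> {perf / 1e12:5.1f} TFLOP/s ({bound})")
    ```

    Below the ridge point (here about 15.6 FLOP/byte) a kernel is bandwidth-bound, above it compute-bound; every other metric you could record rides on top of these two limits.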

    And what would be the practical implications of this variability for GPU design?

    No matter which of these testing workloads are computationally bound or memory bandwidth bound, there is only so much that can be done to alleviate the bottleneck of inter-die memory access (unless you engage explicit multi-GPU mode and try to avoid inter-die NUMA memory access entirely, in a kind of application-side SLI-mode implementation).


    BTW there is a follow-up research paper which discusses additional improvements to NUMA-aware multi-chip GPUs, such as
    1) bi-directional inter-die links with dynamic reconfiguration between read and write lanes, and
    2) improved cache policies with L2 cache coherency protocols and dynamic partitioning between local and remote data.

    http://research.nvidia.com/publication/2017-10_Beyond-the-socket:
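    The gist of improvement 1) can be sketched as repartitioning a fixed pool of link lanes between read and write directions to track the observed traffic mix. The heuristic below is my own simplification for illustration, not the paper's algorithm.

    ```python
    TOTAL_LANES = 16  # lanes on one inter-die link (illustrative)

    def partition_lanes(read_bytes, write_bytes, min_lanes=1):
        """Split lanes proportionally to recent traffic, keeping at least
        one lane in each direction so neither side can starve."""
        total = read_bytes + write_bytes
        if total == 0:
            return TOTAL_LANES // 2, TOTAL_LANES // 2
        read_lanes = round(TOTAL_LANES * read_bytes / total)
        read_lanes = max(min_lanes, min(TOTAL_LANES - min_lanes, read_lanes))
        return read_lanes, TOTAL_LANES - read_lanes

    print(partition_lanes(300, 100))  # read-heavy phase -> (12, 4)
    print(partition_lanes(0, 500))    # write burst, keep 1 read lane -> (1, 15)
    ```

    The appeal is that GPU phases are often strongly read- or write-skewed, so a reconfigurable link delivers more usable bandwidth than a static half-and-half split of the same wires.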

    It's a freakin' academic research paper - why would the learned gentlemen resort to the torture of making first-person shooter games run at 0.001 fps, instead of scheduling some well-known HPC benchmarks from a command prompt?
     
    #533 DmitryKo, Jul 4, 2018
    Last edited: Jul 4, 2018
  14. firstminion

    Newcomer

    Joined:
    Aug 7, 2013
    Messages:
    217
    Likes Received:
    46
    That's a very big if. Right now this seems apples and oranges to me.
     
  15. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    That is true, but you can extrapolate somewhat, albeit not conclusively, how both this and geometry are affected relative to the scaling of a design, and at a high level which aspects fall short of it.
    Case in point: Titan V, while being BW/ROP-limited relative to V100, still seems to match the core performance scaling of 42%, as can be seen in compute/BW-heavy applications such as Amber. Nvidia seem to have understood very accurately where they could cut aspects while still matching the relative core scaling.
    The point is that it is certainly limited from an absolute perspective, but viewed in terms of relative core performance scaling the change is equal, not worse (specifically for compute- and BW-related work), and sometimes it needs to be looked at from that relative scaling perspective; you are right, but this other perspective is valuable as well.

    The geometry side with Arun's tool is quite insightful and reflects what is being seen in game performance, which is either marginal or on average only 18-25% faster than comparable Pascal, with only one or two games coming closer to the relative scaling performance (due to their compute-related aspects), while certain rendering/benchmark operations also come in below relative scaling.
    For Titan V it seems that computational power and BW are not the obstacle to hitting the relative scaling figure of 42%; for now (it may or may not be a solvable issue) something seems off on the geometry side. Possibly this comes back to it being the first time the geometry side has broken its 1:1 relationship with the architecture and now has sharing/contention within SMs/TPCs (even when allowing for the 64 CUDA cores per SM rather than the 128 design). That matches both Arun's tool and the broad level of performance results when said geometry is utilised, although one outlier is Luxmark OpenCL with very high gains; some of this was discussed, I think, in the Volta thread.

    I appreciate this is focused on Nvidia, but fundamentally it also fits here when looking at performance in terms of both absolute gains/limitations and relative scaling, with primary factors that go beyond computational power and BW in this context.
     
    #535 CSI PC, Jul 9, 2018
    Last edited: Jul 9, 2018
  16. giannhs

    Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    37
    Likes Received:
    40
    So your whole idea of AMD's version of MCM being inferior is speculation based on literally nothing more than a rumour?
    How is Nvidia going to expose their interconnected dies as one? By magic? Obviously not - they face the very same problem as AMD (not to mention that their version is literally what Ryzen does), but AMD already has a working model in Ryzen and probably knows quite a lot more, from a practical point of view, about how bad or not it can be for a GPU.
     
    Lightman and no-X like this.
  17. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    681
    Likes Received:
    544
    Location:
    55°38′33″ N, 37°28′37″ E
    My question still stands: how would this variability affect chip design? There are generational improvements to individual blocks, but it's still a far cry from a dynamically reconfigurable processor configuration that would allow maximizing performance for every individual workload. There are still hard limits such as bus width, clocks, cache size, wavefront depth etc. - although the recent Nvidia research paper proposes some real-time variability in data bus direction and cache size partitioning to account for multi-GPU access to far memory...
    http://research.nvidia.com/publication/2017-10_Beyond-the-socket:


    It is based on their earlier comments about MCM-GPU being an explicit multi-GPU configuration which is not convenient for gaming.
     
  18. giannhs

    Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    37
    Likes Received:
    40
    Because there is no possible way to expose them as one to the game engines unless the engines undergo a major overhaul of their SC, which won't happen any time soon...
    So this leaves even the Nvidia way in the same situation, and since GameWorks still dominates the market, MCM GPUs for the masses are pretty surely just a dream unless AMD somehow creates a better, simpler and open GameWorks equivalent...
     
  19. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,170
    Likes Received:
    576
    Location:
    France
    Drivers can't expose them as one ? Real question. If the hardware is designed this way...
     
  20. Magnum_Force

    Newcomer

    Joined:
    Mar 12, 2008
    Messages:
    102
    Likes Received:
    70
    For multi gpu rendering, I wonder if it's possible to use some kind of multi frame super resolution, where each gpu renders the same scene, but from an ever so slightly different perspective - literally a pixel shift or two, then combine into the final upscaled image. I think some VR rendering techniques are somewhat similar?
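    The idea above can be sketched in 1-D: two renders of the same scene sample at positions offset by half a pixel, and interleaving the two sample grids yields an image at twice the resolution. This is purely illustrative; real temporal or multi-GPU supersampling must also handle motion, occlusion and resampling, which this ignores.

    ```python
    import math

    def render(width, offset):
        """Sample a continuous 'scene' at pixel centres shifted by `offset`."""
        return [math.sin((x + offset) * 0.5) for x in range(width)]

    gpu_a = render(8, offset=0.0)  # GPU A: unshifted view
    gpu_b = render(8, offset=0.5)  # GPU B: half-pixel-shifted view

    # Interleave the two half-resolution sample grids into one 2x image.
    combined = [sample for pair in zip(gpu_a, gpu_b) for sample in pair]
    print(len(combined))  # 16 samples recovered from two 8-sample renders
    ```

    VR techniques are indeed related in spirit: each eye already renders the scene from a slightly different viewpoint, though there the two images are displayed separately rather than merged.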
     