AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Discussion in 'Architecture and Products' started by Jawed, Mar 23, 2016.

Thread Status:
Not open for further replies.
  1. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    681
    Likes Received:
    544
    Location:
    55°38′33″ N, 37°28′37″ E
    There are only two types of workloads - computationally intensive and memory bandwidth intensive.

    No, it was using 768 GB/s links, which they assumed to be practically possible today, with a 16 MByte L1.5 cache per die and a 'first touch' virtual page allocation policy.
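    A 'first touch' policy can be sketched in a few lines: a virtual page is homed in the local memory of whichever GPU module touches it first, so subsequent accesses from that module stay local. This is a minimal illustrative model, not the paper's implementation; all names here are invented.

    ```python
    class FirstTouchAllocator:
        """Toy model of first-touch page placement across GPU modules (GPMs)."""

        def __init__(self, num_gpms):
            self.num_gpms = num_gpms
            self.page_home = {}  # virtual page number -> owning GPM

        def access(self, gpm_id, page):
            """Record an access; home the page on its first touch."""
            home = self.page_home.setdefault(page, gpm_id)
            return "local" if home == gpm_id else "remote"

    alloc = FirstTouchAllocator(num_gpms=4)
    print(alloc.access(0, page=42))  # first touch by GPM 0 -> "local"
    print(alloc.access(1, page=42))  # GPM 1 now pays the inter-die cost -> "remote"
    print(alloc.access(0, page=42))  # -> "local"
    ```

    The point of the policy is that, combined with a scheduler that keeps a thread block on the module that first touched its data, most traffic never crosses the inter-die link.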
     
    #521 DmitryKo, Jun 30, 2018
    Last edited: Jul 1, 2018
    Bondrewd and ImSpartacus like this.
  2. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    So the kind of user interaction with the outcome of those workloads does not play a role either? I tend to disagree.
     
    #522 CarstenS, Jun 30, 2018
    Last edited: Jun 30, 2018
  3. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,773
    Likes Received:
    2,560
    This is directly from NVIDIA's paper:

    https://www.pcper.com/reviews/Graphics-Cards/NVIDIA-Discusses-Multi-Die-GPUs

    So yeah, up to 3 TB/s is postulated in a simulation; it could be more in a real-world workload.
     
  4. ImSpartacus

    Regular Newcomer

    Joined:
    Jun 30, 2015
    Messages:
    252
    Likes Received:
    199
    That's the interpretation of PCPerspective. Reading the actual paper, I'm getting a different interpretation.
    • From page 4, the authors examined performance scaling as you move from 384 GB/s all the way to 6 TB/s of inter-GPM bandwidth.

    • From page 10, the authors examined performance scaling going from a lowly multi-GPU config all the way up to a hypothetical equivalent monolithic GPU, with MCM-style solutions in between using 768 GB/s and 6 TB/s links.


    Maybe I'm misinterpreting the paper as I'm just a layman, but it feels like Nvidia investigated more than 768 GB/s to 3 TB/s. It's more like 384 GB/s to 6 TB/s.
     
    pharma, DmitryKo and DavidGraham like this.
  5. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,773
    Likes Received:
    2,560
    From the figure, a 3TB/s solution provided 95~99% of the 6TB/s solution performance. So maybe that's why PCPer were content with mentioning only the 3TB/s.
     
  6. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    681
    Likes Received:
    544
    Location:
    55°38′33″ N, 37°28′37″ E
    It does not create a different kind of GPU workload which you need to optimize for.

    No. The principal point of the paper is cache controller and thread scheduler optimisations for a multi-die GPU which allow 'practical' 768 GB/s links to achieve around 90-95% of the performance of 'ideal' very-high-bandwidth links (or an equivalent monolithic die).

    They expressly state that 3 TB/s links are beyond the current state of technology, and that an equivalent monolithic GPU is not possible to implement at all, so these are provided for comparison only.

    This is not an NVIDIA paper - the links to the actual research paper and my short summary of their findings are given in the post above.

    Which would be a gross misinterpretation of the results of this research.

    Exactly - they research optimisations which allow a 768 GB/s link to perform on par with a multi-terabyte link.
     
    #526 DmitryKo, Jul 1, 2018
    Last edited: Jul 1, 2018
    CSI PC and pharma like this.
  7. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Just to say, as this has popped up again: Nvidia's design R&D specific to this type of solution goes back to 2014 and ties back into Volta, one aspect of which is NVSwitch (very loosely).
    I think the approach between Nvidia and AMD is pretty different when it comes to integrating the MCM-GPU design and the signalling-data/coherency.
     
  8. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    That mindset probably led people to believe SLI and Crossfire were a good idea, before both were massively toned down in marketing visibility as well as in support in the recent architecture generations for gamers. Already from a pipeline design perspective it makes a difference whether you can fill up your results file over seconds, minutes or hours (CUDA), or whether you not only have to be ready with a host of differently bottlenecked calculations as often as, say, 144 times a second, but also have to display the results in a proper manner.
     
    DavidGraham and BRiT like this.
  9. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    681
    Likes Received:
    544
    Location:
    55°38′33″ N, 37°28′37″ E
    AMD's approach to MCM-GPU is not just 'different' - it's inferior. If Nvidia's MCM-GPUs look like a single big GPU to the OS while AMD can only expose MCM-GPUs as an explicit multi-GPU configuration, then, as AMD rightly said, this won't be popular with application developers.

    These figures define performance targets, not the type of workload.
     
    DavidGraham likes this.
  10. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    Yes and no, and that's where the whole sub-debate started. In contrast to most CUDA tasks, your typical gaming workload changes its characteristics many times even within a single frame, of which you want at least 60 and preferably 120 or more per second. Subtasks change from compute-bound to (graphics) memory-bound to I/O-bound many times a second. That's what has made it so hard to deliver a satisfying experience to gamers over the years. If forced to rigidly apply a categorization like yours, I'd propose that games are inherently I/O-bound (off-card I/O, that is) when compared with the amount of computation or bandwidth relative to I/O that's needed in many CUDA/OpenCL tasks.

    Sure, it looks good in some demos, you can boost benchmark scores and for some people it works in some games. Generally though, as is evident with the diminishing effort put into marketing Crossfire and SLI to gamers or even to include support in certain types of graphics cards, AMD and Nvidia seem to have all but given up on that idea and focus on the professional market with MGPU.
     
    DavidGraham and BRiT like this.
  11. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    That's not quite what DmitryKo is pointing to. Yes, there's a bunch of different bottlenecks when rendering a single frame. And without making every component of the GPU 2x wider you won't get 2x performance in all scenarios. You optimize for parts that are most often bottlenecked and for the longest duration of time.

    But that's not the problem with SLI/CF. AFR takes advantage of parallelism across frames. SFR basically says to hell with vertex processing (you have to do it twice), and then you have the problem of splitting the pixel load 50:50 between two cards (though the checkerboard approach AMD had solves this part quite nicely). The reason for the diminishing effort on SLI and CF is that games nowadays tend to break parallelism across frames by introducing dependencies between frames, which are effectively sync points. SFR died way earlier, when games started using a whole bunch of different render targets which again had to be synced across GPUs, killing the benefits. That has nothing to do with the specific bottlenecks a game would experience on a single GPU.
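    The sync-point argument can be made concrete with a toy timing model: with independent frames, two GPUs running AFR overlap and roughly halve total time, but as soon as each frame needs the previous frame's result, the frames serialize and the second GPU buys nothing. The numbers are invented for illustration.

    ```python
    FRAME_TIME = 10.0  # ms to render one frame on one GPU (illustrative)

    def afr_total_time(num_frames, has_dependency):
        """Total time to render num_frames with 2-GPU alternate frame rendering."""
        gpu_free = [0.0, 0.0]   # time at which each GPU becomes available
        prev_done = 0.0         # completion time of the previous frame
        for i in range(num_frames):
            gpu = i % 2         # AFR: GPUs alternate frames
            start = gpu_free[gpu]
            if has_dependency:  # frame needs the previous frame's output
                start = max(start, prev_done)
            done = start + FRAME_TIME
            gpu_free[gpu] = done
            prev_done = done
        return prev_done

    independent = afr_total_time(100, has_dependency=False)
    dependent = afr_total_time(100, has_dependency=True)
    print(independent)  # frames overlap: ~half the single-GPU total
    print(dependent)    # sync points serialize: same as a single GPU
    ```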
     
  12. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    Yet he started this line of argument by directly quoting, and contradicting, the point that CUDA workloads are not the same as gaming workloads. So why does MGPU seem very valid for a lot of CUDA applications, while not quite so for many games?
     
    BRiT likes this.
  13. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    681
    Likes Received:
    544
    Location:
    55°38′33″ N, 37°28′37″ E
    Exactly. Every workload is "different" and you can record a thousand performance indicators, but basically there are only two independent factors - computational power and memory access bandwidth. Every other factor is dependent on these two variables, which limit your maximum performance.
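    This two-factor view is essentially the roofline model: attainable throughput is the minimum of peak compute and arithmetic intensity times peak bandwidth. A sketch with round, illustrative numbers (not any specific GPU):

    ```python
    PEAK_FLOPS = 14e12   # 14 TFLOP/s peak compute (illustrative)
    PEAK_BW = 900e9      # 900 GB/s peak memory bandwidth (illustrative)

    def attainable(flops_per_byte):
        """Roofline: performance is capped by compute or by bandwidth."""
        return min(PEAK_FLOPS, flops_per_byte * PEAK_BW)

    for ai in (1, 4, 16, 64):  # arithmetic intensity in FLOP/byte
        perf = attainable(ai)
        bound = "bandwidth-bound" if perf < PEAK_FLOPS else "compute-bound"
        print(f"AI={ai:3d} FLOP/B -> {perf / 1e12:5.1f} TFLOP/s ({bound})")
    ```

    Below the ridge point (here about 15.6 FLOP/byte) a kernel is bandwidth-bound, above it compute-bound; every other metric you could record rides on top of these two limits.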

    And what would be the practical implications of this variability for GPU design?

    No matter which of these testing workloads are computationally bound or memory bandwidth bound, there is only so much that can be done to alleviate the bottleneck of inter-die memory access (unless you engage explicit multi-GPU mode and try to avoid inter-die NUMA memory access entirely, in a kind of application-side SLI-mode implementation).


    BTW there is a follow-up research paper which discusses additional improvements to NUMA-aware multi-chip GPUs, such as
    1) bi-directional inter-die links with dynamic reconfiguration between read and write lanes, and
    2) improved cache policies with L2 cache coherency protocols and dynamic partitioning between local and remote data.

    http://research.nvidia.com/publication/2017-10_Beyond-the-socket:
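    The gist of improvement 1) can be sketched as repartitioning a fixed pool of link lanes between read and write directions to track the observed traffic mix. The heuristic below is my own simplification for illustration, not the paper's algorithm.

    ```python
    TOTAL_LANES = 16  # lanes on one inter-die link (illustrative)

    def partition_lanes(read_bytes, write_bytes, min_lanes=1):
        """Split lanes proportionally to recent traffic, keeping at least
        one lane in each direction so neither side can starve."""
        total = read_bytes + write_bytes
        if total == 0:
            return TOTAL_LANES // 2, TOTAL_LANES // 2
        read_lanes = round(TOTAL_LANES * read_bytes / total)
        read_lanes = max(min_lanes, min(TOTAL_LANES - min_lanes, read_lanes))
        return read_lanes, TOTAL_LANES - read_lanes

    print(partition_lanes(300, 100))  # read-heavy phase -> (12, 4)
    print(partition_lanes(0, 500))    # write burst, keep 1 read lane -> (1, 15)
    ```

    The appeal is that GPU phases are often strongly read- or write-skewed, so a reconfigurable link delivers more usable bandwidth than a static half-and-half split of the same wires.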

    It's a freakin' academic research paper - why would the learned gentlemen resort to the torture of making first-person shooter games run at 0.001 fps, instead of scheduling some well-known HPC benchmarks from a command prompt?
     
    #533 DmitryKo, Jul 4, 2018
    Last edited: Jul 4, 2018
  14. firstminion

    Newcomer

    Joined:
    Aug 7, 2013
    Messages:
    217
    Likes Received:
    46
    That's a very big if. Right now this seems apples and oranges to me.
     
  15. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    That is true, but you can extrapolate somewhat, albeit not conclusively, how both this and geometry are affected relative to the scaling of a design, and at a high level which aspects fall short of it.
    Case in point: Titan V, while being BW/ROP-limited relative to V100, still seems to match the core performance scaling of 42%, as can be seen in compute/BW-heavy applications such as Amber. Nvidia seem to have understood very accurately where they could cut aspects while still matching the relative core scaling.
    The point is that it is certainly limited from an absolute perspective, but viewed in terms of relative core performance scaling the change is equal, not worse (specifically for compute- and BW-related work), and sometimes it needs to be looked at from that relative scaling perspective; you are right, but this other perspective is valuable as well.

    The geometry side with Arun's tool is quite insightful and reflects what is being seen in game performance, which is either marginal or on average only 18-25% faster than comparable Pascal, with only one or two games coming closer to the relative scaling performance (due to their compute-related aspects), while certain rendering/benchmark operations also come in below relative scaling.
    For Titan V it seems that computational power and BW are not the obstacle to hitting the relative scaling figure of 42%; for now (it may or may not be a solvable issue) something seems off on the geometry side. Possibly this comes back to it being the first time the geometry side has broken its 1:1 relationship with the architecture and now has sharing/contention within SMs/TPCs (even when allowing for the 64 CUDA cores per SM rather than the 128 design). That matches both Arun's tool and the broad level of performance results when said geometry is utilised, although one outlier is Luxmark OpenCL with very high gains; some of this was discussed, I think, in the Volta thread.

    I appreciate this is focused on Nvidia, but fundamentally it also fits here when looking at performance in terms of both absolute gains/limitations and relative scaling, with primary factors that go beyond computational power and BW in this context.
     
    #535 CSI PC, Jul 9, 2018
    Last edited: Jul 9, 2018
  16. giannhs

    Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    37
    Likes Received:
    40
    So your whole idea of AMD's version of MCM being inferior is speculation based on literally nothing more than a rumour?
    How is Nvidia going to expose their interconnected dies as one? By magic? Obviously not - they face the very same problem as AMD (not to mention that their version is literally what Ryzen does), but AMD already has a working model in Ryzen and probably knows quite a lot more, from a practical point of view, about how bad or not it can be for a GPU.
     
    Lightman and no-X like this.
  17. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    681
    Likes Received:
    544
    Location:
    55°38′33″ N, 37°28′37″ E
    My question still stands: how would this variability affect chip design? There are generational improvements to individual blocks, but it's still a far cry from a dynamically reconfigurable processor configuration that would allow maximizing performance for every individual workload. There are still hard limits such as bus width, clocks, cache size, wavefront depth etc. - although the recent Nvidia research paper proposes some real-time variability in data bus direction and cache size partitioning to account for multi-GPU access to far memory...
    http://research.nvidia.com/publication/2017-10_Beyond-the-socket:


    It is based on their earlier comments about MCM-GPU being an explicit multi-GPU configuration which is not convenient for gaming.
     
  18. giannhs

    Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    37
    Likes Received:
    40
    Because there is no possible way to expose them as one to the game engines unless the engines undergo a major overhaul of their SC, which won't happen any time soon...
    So this leaves even the Nvidia way in the same situation, and since GameWorks still dominates the market, MCM GPUs for the masses are pretty surely just a dream unless AMD somehow creates a better, simpler and open GameWorks equivalent...
     
  19. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,170
    Likes Received:
    576
    Location:
    France
    Drivers can't expose them as one ? Real question. If the hardware is designed this way...
     
  20. Magnum_Force

    Newcomer

    Joined:
    Mar 12, 2008
    Messages:
    102
    Likes Received:
    70
    For multi gpu rendering, I wonder if it's possible to use some kind of multi frame super resolution, where each gpu renders the same scene, but from an ever so slightly different perspective - literally a pixel shift or two, then combine into the final upscaled image. I think some VR rendering techniques are somewhat similar?
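    The idea above can be sketched in 1-D: two renders of the same scene sample at positions offset by half a pixel, and interleaving the two sample grids yields an image at twice the resolution. This is purely illustrative; real temporal or multi-GPU supersampling must also handle motion, occlusion and resampling, which this ignores.

    ```python
    import math

    def render(width, offset):
        """Sample a continuous 'scene' at pixel centres shifted by `offset`."""
        return [math.sin((x + offset) * 0.5) for x in range(width)]

    gpu_a = render(8, offset=0.0)  # GPU A: unshifted view
    gpu_b = render(8, offset=0.5)  # GPU B: half-pixel-shifted view

    # Interleave the two half-resolution sample grids into one 2x image.
    combined = [sample for pair in zip(gpu_a, gpu_b) for sample in pair]
    print(len(combined))  # 16 samples recovered from two 8-sample renders
    ```

    VR techniques are indeed related in spirit: each eye already renders the scene from a slightly different viewpoint, though there the two images are displayed separately rather than merged.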
     