AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Discussion in 'Architecture and Products' started by Jawed, Mar 23, 2016.

Thread Status:
Not open for further replies.
  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    So which cache is the crossbar that distributes to the raster units from the other linked slides?


    I think it's saying those are just placeholders for tests, not that the placeholders are wrong.
     
  2. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    From the article: "But this patch isn't even for the AMD Linux driver itself... It's UMR: AMD's open-source GPU debugger they started work on about one year ago. "
    It seems to be for the GPU debugger not the driver, and when I heard the original news it was stated as driver... hence me saying 'wrong'.
     
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The earlier post was written as if the patch was wrong, rather than the news sites reporting it.
     
  4. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    My working theory has been for a while: syncing via the L2 cache is the reason why, in certain synthetic tests like the B3D suite, geometry rates start to be limited by the individual L2's R/W rate, hence no scaling beyond 4 GPCs in those synthies (it will be a while before we reach that limit in real-world scenarios). My best guess is that, after syncing, it is determined whether the geometry processed can stay inside the GPC or has to move out.
     
  5. BoMbY

    Newcomer

    Joined:
    Aug 31, 2017
    Messages:
    68
    Likes Received:
    31
    Hidden in the 17.12.1 drivers:

    [image]
     
    Nemo likes this.
  6. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Was the Nvidia Distributed Tiled Cache patent ever posted?
    http://www.freepatentsonline.com/y2014/0118361.html
    Figures 2, 3 and 3A are of interest to the points below and worth checking out while reading.
     
    _cat and pharma like this.
  7. Nemo

    Newcomer

    Joined:
    Sep 15, 2012
    Messages:
    125
    Likes Received:
    23
    There are two new Navi chips -- 1000 and 1001 GPUs
     
  8. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    859
    Likes Received:
    262
    This is really no problem, as the hardware access is behind so many layers that you can make it transparent at the API level. DX11 co-exists with DX12 for that reason. DX12 already has all the API needed to cover chiplets, but well, if you don't even want to see that there are chiplets, then drop in an alternative kernel implementation which has exactly one graphics queue and one compute queue, for compatibility's sake. Or lift it up to DX13.

    From experience I can tell you, bringing an engine to support multi-core CPUs was a way, way more prolonged and painful path than bringing one to support multi-GPU. The number of bugs occurring for the former is still very high, by the nature of it; the number of bugs occurring for the latter is ... fairly low, because it's all neatly protectable [in DX12] and bakable into management code.

    I wish I could address individual CUs already, so I could have my proto-chiplet algorithms stabilize before hitting the main target.

    In any case, for GPUs, I don't believe you have to drop all this behaviour into the circuits; you have the luxury of thick software layers which you can bend. A super-cluster has no hardware to make the cluster itself appear coherent, but it has software to do so to some degree. The scheduling is below a thick layer of software; it has to be, since there are way too many sub-systems involved to do this yourself.

    I wonder about the impact of the x86 memory model on people's mind-set. I had to target ARM multi-core lately, and uggg, no coherence sucks if all your sync and atomic code depends on it. But then the C++11 threading API was a great relief, because I only need to state what I need and then it will be done one way or another. I then started believing that incorporating these specifics into your code-base without a wrapper in the first place is counter-productive. Lesson: don't hack away; anticipate variance, do good software design, be verbose, really really verbose, very semantic (it will vanish in the compiler) - then you won't have that many problems when your target changes "radically".
    The language/API/ISA is only half the solution regardless, because you have batch cache-invalidations, for example; in general you have to design the data-sync points more carefully, more semantically. :)

    But hey, all this is comfort-land in comparison with PS3-to-PS4 transitions (for all the mentioned reasons).
     
    DeeJayBump, fuboi and Lightman like this.
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Is the alternative kernel with one queue for each type expected to scale with an MCM GPU, or is it expected to perform at 1/N the headline performance on the box? Is it expected to run with 1/N the amount of memory?

    One interpretation of the statements of the then-head of RTG was that AMD's plans were to have developers manage the N-chip case explicitly, not extend the abstraction.
    I think this would reduce the desirability of such a product for a majority of the gaming market, absent some changes and a period of time where AMD proves that this time it's different. There's been evidence in the past that SLI on a stick cards lost ground against the largest single-GPU solutions, and multi-GPU support from the IHVs has in recent products regressed.

    If there's certainty that competing products will make the same jump to multi-GPU, and with AMD's claimed level of low-level exposure, then AMD might be able to make the case that there is no alternative. Even then, legacy code (e.g. everything at hardware launch time) might hinder adoption versus the known-working prior generations.
    If the possibility exists that a larger and more consistent single-GPU competitor might launch, or gains might not be consistent for existing code, that's a riskier bet for AMD to take.

    Further, if the handling of the multi-adapter case is equally applicable to an MCM GPU as it is to two separate cards, how does MCM product distinguish itself?

    However, I would note that nobody can buy a one-core CPU anymore, and software can be expected to use at least 2-4 cores at this point. There's been limited adoption and potentially a negative trend for SLI and Crossfire. The negative effect those implementations have had on user experience for years isn't easily forgotten, and we have many games that are not implemented to support multi-GPU while the vendors are lowering the max device count they'll support.

    Fair or not, the less-forgiving nature of CPU concurrency has been priced in as an unavoidable reality, and the vendors have made the quality of that infrastructure paramount despite how difficult it is to use.

    This gives me the impression of wanting things both ways. The CUs are currently held beneath one or two layers of hardware abstraction and management within the ASIC itself, and those would be below the layers of software abstraction touted earlier. There are specific knobs that console and driver devs might have and high-level constructs that give some hints for the low level systems, but there are architectural elements that would run counter to exposing the internals further.

    Not knowing the specifics of the implementation, there's potentially the game, engine, API, driver(userspace/kernel), front-end hardware, and back-end hardware levels of abstraction, with some probable omissions/variations.
    The lack of confidence in many of those levels is where the desire for a transparent MCM solution comes from.

    I think there are a number of implementation details that can change the math on this, and if a cluster uses message passing it could skip the illusion of coherence in general. The established protocols are heavily optimized throughout the stack, however. I'm not sure how comparable the numbers are for some of the cluster latency figures versus some of those given for GPU synchronization.

    Relying on the C++11 standard doesn't remove the dependence on the hardware's memory model. It maps the higher-level behaviors desired to the fences or barriers provided for a given architecture. For the more weakly-ordered accesses that aren't considered synchronized, x86's regular accesses are considered too strong, but for synchronization points x86 is considered only somewhat stronger than strictly necessary. Its non-temporal instructions are considered too weak.

    More weakly-ordered architectures have more explicitly synchronized accesses or heavyweight barriers, and the architectural trend for CPUs from ARM to Power has been to implement a more strongly-ordered subset closer to x86 for load-acquire/store-release semantics.
    The standard's unresolved issues frequently cover cases where parts of its model are violated, often when weaker hardware models turn out to be unable to meet the standard's assumptions, or unable to do so without impractically heavy barriers.

    That aside, the question isn't so much whether an architecture is as strongly-ordered as x86, but whether the architecture has defined and validated its semantics and implementations, or has the necessary elements to do so. The software standard that sits at the higher level of abstraction assumes the foundation they supply is sound, and some of its thornier problems arise when it is not. The shader-level hierarchy would be nominally compliant with the standard, but the GPU elements not exposed to the shader ISA are not held to it and have paths that go around it.
     
  10. mrcorbo

    mrcorbo Foo Fighter
    Veteran

    Joined:
    Dec 8, 2004
    Messages:
    3,578
    Likes Received:
    1,986
    Wouldn't there be a much faster interconnect between the individual chips and their directly attached resources and wouldn't that yield a substantial performance benefit over a standard multi-adapter setup?
     
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    It's a good item to leverage if the bandwidth is there.
    In a more explicitly developer-managed system, it seems like this would be a different class of multi-adapter where certain combinations of independent devices have non-uniform performance effects.
    Copy operations would be faster if they were previously limited by bandwidth.

    If the game's multi-adapter path was coded with traditional multi-slot bandwidth in mind, the MCM GPU could find itself in a similar situation to how Vega's HBCC shows more arguable benefits when most games remain coded for the constraints of regular cards.
    HBCC's handling is already more hardware-managed than the scenario I was addressing.

    Application-level queues or device commands used to initiate transfers or sync the devices wouldn't necessarily see this interconnect, since that leaves each chip engaged in its own back and forth with the host side. That direction is less about bandwidth than it is the scheduling and device latencies, although creating fast paths in the architecture that let the chips collaborate on their own could leverage the interconnect.
    If this does not replace traditional multi-card setups, it's an additional niche to target.
     
    mrcorbo likes this.
  12. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Well, I was thinking about this some more and have some thoughts. I think the key to AMD's strategy with a multi-chip Navi centers around the DSBR and its deferred-work feature. If we consider opaque triangles with no writes to UAVs in the pixel shader (and no memory accesses other than the initial geometry in the geometry stages), you can guess at an interchip bandwidth/performance optimization strategy. Suppose the chips are set up similarly to a NUMA system, with each chip having one or more memory channels, and assume they are logically striped at per-chiplet granularity (in case I'm not being clear: all memory channels hooked up to chiplet 1 form the first stripe, chiplet 2 the second stripe, and so on). My guess is that they localize all work up to and including rasterization to that chiplet. They perform binning and defer the work until visibility is worked out. The temporary tiles created in each chiplet are then compared against each other in the chiplet whose local memory contains the backing store of the frame-buffer tile in question; final visibility is determined there and pixel shading proceeds in that chiplet. This doesn't solve the interchip bandwidth problem for texture accesses, but it does solve it for geometry and framebuffer. Since the DSBR with deferred work reduces the number of pixels to be shaded, it reduces texture access bandwidth somewhat as well. This may require a different memory layout for the tiles of the frame buffer, but that shouldn't be a problem...

    I'll think some more about other types of work later, thanks for listening.
     
  13. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    What kind of additional latency do you guys expect from inter-chip(let) connections anyway? It's not like with 3D Rendering in Cinema 4D or Blender, where you have a nice sorting up front and then much much rendering happening in tiny tiles.
     
  14. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Isn't that made even more complex by such a design looking to hide/reduce latency, but then being impacted, and not in a linear fashion, by certain workloads while others are insensitive (low parallelism), plus other variables such as caches and writebacks?
    There doesn't seem to be an easy answer to your question without limiting the scope of real-world use.
     
  15. CaptainGinger

    Newcomer

    Joined:
    Feb 28, 2004
    Messages:
    92
    Likes Received:
    47
    Why is everyone so convinced Navi is a multi chiplet solution?
     
    _cat, xpea, DavidGraham and 1 other person like this.
  16. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,172
    Location:
    La-la land
    @CaptainGinger
    ...Because fantasy is more interesting than reality? :D
     
    CSI PC, Kej, Lightman and 2 others like this.
  17. Malo

    Malo Yak Mechanicum
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    7,029
    Likes Received:
    3,101
    Location:
    Pennsylvania
    In the AMD roadmap, the keyword describing Navi is "Scalability". I believe that is where the concept was born?
     
  18. CaptainGinger

    Newcomer

    Joined:
    Feb 28, 2004
    Messages:
    92
    Likes Received:
    47
    I agree this is probably where it started but it makes no sense to me. We know that for whatever reason AMD has not "scaled" past 64 compute units and we also know they can't sit at 64 compute units forever. In my mind the most likely interpretation of "Scalability" for Navi is just that it will be the first AMD GPU core to go past this number.
     
    Newguy, xpea, Grall and 6 others like this.
  19. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    I think it all started with Ryzen’s external interconnect and then everybody just assumed it could be expanded just like that for GPUs.

    I’m in the “it’s not going to happen this generation” camp.
     
    _cat, CSI PC, xpea and 3 others like this.
  20. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,170
    Likes Received:
    576
    Location:
    France
    I think one other argument is "it's their only chance to compete": they no longer have the resources to make a big fat monolithic die that can compete with nvidia (Fiji, Vega,...). I hope I'm wrong, but the difference in R&D budget is so huge... I'm not sure they can pull an R300 or, in another scenario, an RV770 again...
     
    Grall and ToTTenTranz like this.
