AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Discussion in 'Architecture and Products' started by Jawed, Mar 23, 2016.

Thread Status:
Not open for further replies.
  1. CaptainGinger

    Newcomer

    Joined:
    Feb 28, 2004
    Messages:
    92
    Likes Received:
    47
    Vega is not an IMC.
     
  2. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Everything else AMD is releasing points that way, and research from both AMD and Nvidia supports the idea. GPUs scale far more easily than CPUs.

    Just treat each stack/PIM as an independent cache and duplicate vertex data with a paging mechanism from system memory. Same idea as HBCC where only ~3% of the frame changes each iteration. Any modifications can be brute forced from there with heavy frustum culling. Something which primitive shaders have been suggested to be good at.
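    The dirty-page idea above can be sketched in a few lines. This is purely illustrative (the page size and helper names are my assumptions, not AMD's HBCC design): if only ~3% of pages are touched per frame, only those pages need to cross the inter-chip link to keep duplicated copies in sync.

```python
# Hypothetical sketch of HBCC-style dirty-page sync between duplicated
# buffers on different chiplets. Names and page size are illustrative.

PAGE_SIZE = 4096

def pages_touched(writes, page_size=PAGE_SIZE):
    """Map a list of (address, length) writes to the set of dirty pages."""
    dirty = set()
    for addr, length in writes:
        first = addr // page_size
        last = (addr + length - 1) // page_size
        dirty.update(range(first, last + 1))
    return dirty

def sync_cost(writes, total_bytes, page_size=PAGE_SIZE):
    """Fraction of the duplicated buffer that must cross the link this frame."""
    dirty = pages_touched(writes, page_size)
    return len(dirty) * page_size / total_bytes
```

    With a 100-page buffer and writes touching 3 pages, `sync_cost` comes out at 0.03, matching the ~3% figure quoted above.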

    Vega should have all the pieces required for MCM already, assuming the new features were functioning. I'm still of the mindset Navi is an ~200mm2 Vega as opposed to a big one. The current small Vegas are all integrated, excluding the Intel thing.
     
  3. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    That's the point... if we assume the application programmer optimized the order of vertices/indices for the post- and pre-transform caches, the majority of triangles can be formed with data local to one chiplet. This will most likely require more complicated vertex buffering, but the point is that most vertex data, the work associated with it, and the triangle data can be kept local.
    It's not really about the vertex data though, although that's a bonus along with providing work for all chiplets. The real point is that doing things this way minimizes transfers related to the output of the rasterizer. If you don't batch visibility all at once, overdraw would consume 'extra' bandwidth.
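    One way to quantify the locality argument above: stripe the vertex range across chiplets and count how many triangles can be assembled entirely from one chiplet's stripe. This is a toy model of my own, not any disclosed AMD scheme; the stripe size and chiplet count are arbitrary assumptions.

```python
# Toy model: which fraction of triangles is "local" to one chiplet if
# vertices are striped round-robin in blocks across chiplets?

def home_chiplet(vertex_index, stripe, n_chiplets):
    """Chiplet owning this vertex under block-striped assignment."""
    return (vertex_index // stripe) % n_chiplets

def local_fraction(triangles, stripe=1024, n_chiplets=4):
    """Fraction of triangles whose three vertices share a home chiplet."""
    local = sum(
        1 for a, b, c in triangles
        if home_chiplet(a, stripe, n_chiplets)
        == home_chiplet(b, stripe, n_chiplets)
        == home_chiplet(c, stripe, n_chiplets)
    )
    return local / len(triangles)
```

    A cache-optimized index buffer keeps the three indices of a triangle close together, so most triangles stay within one stripe; only triangles straddling a stripe boundary need remote vertex fetches.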
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    EPYC shows something like ~50ns extra latency over a local access if going to another die in the same package. Over xGMI, it seems to take ~120ns.
    https://www.servethehome.com/amd-epyc-infinity-fabric-latency-ddr4-2400-v-2666-a-snapshot/

    I haven't heard whether HBM low-cost is confirmed, since the last I saw Samsung was still shopping the idea around. HBM3 is apparently late 2019/2020, which I would need to reconcile with AMD's roadmap having Navi apparently earlier and Next Gen in that slot.
    Vega 20 supposedly has xGMI, which would be an off-package bus running over a PCIe physical interface. There are use cases for that in compute nodes or servers, although if used in the client space it would mostly serve to accelerate resource moves prompted by the copy queue or AMD's existing transfer-over-PCIe capabilities.

    Assuming the Polaris and Carrizo refreshes are insignificant changes, and that the shrinks for the Xbox One S and PS4 Slim aren't big enough changes despite being different chips, that leaves Xbox One X, PS4 Pro, Vega 10, Raven Ridge, and the Intel custom chip.
    I would say the console shrinks and subsequent distinct SOCs are evidence that AMD can find the means to roll out more than 4, if it wants to.
    Given the quality of Vega's rollout, I'll grant that it's apparently not able to roll them out very well.

    It technically could, but the sort of overhead AMD documented for EPYC would lead to more die area and power efficiency lost to the attempt than if they hadn't bothered.

    Then there's the part when the head of RTG was asked about transparently integrated multiple GPUs, and he said he didn't want that.
    I can admit to some skepticism for AMD's chance for implementing this, because I think they've been saying they don't want to do that.
    The things they do want to do, however, are actually hard and likely not realizable until after 2020.

    I think the more useful interpretation is that Nvidia gave a reasonable bare minimum for what has to be done for any such solution to be adequate (I think even that is optimistic for what people expect), and even then it only discussed things in terms of compute.
    That includes a significantly better interconnect than what's available presently, hardware specialization beyond EPYC's MCM method, and a significant change in the internal memory hierarchy.

    And what has AMD offered?

    The filing date is June 2016. There's usually a delay between filing and when a feature shows up in a product, if it does. For example the hybrid rasterizer for Vega had an initial filing in March of 2013.
    Vega's development pipeline may have had some unusual stalls in it, so we may need to come back to this to see when Navi or its successor is finalized and whether this method appears in it.

    The items that AMD discloses for GPU chiplets talk about them being paired with memory standards 2 generations beyond HBM2. Should I only take every other sentence AMD says as evidence and ignore the ones that contradict my desired outcome?
     
    pharma and DavidGraham like this.
  5. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,172
    Location:
    La-la land
    One would expect that AMD prioritized its console contracts (IE: PS4 Pro, Bone X) ahead of its own graphics line-up. Anecdotal evidence seems to back that assumption up... :p

    *edit: grammer...
     
    #365 Grall, Jan 1, 2018
    Last edited: Jan 1, 2018
    Lightman likes this.
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    My contention is with your original sentence where you say "all work" ([...] they localize all work up to and including rasterization to that chiplet). Now you're saying "not all work". So, uhuh, we agree.
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    If we could have kernels with no pre-defined limit on the register allocation, oh boy that would be so sweet.
     
  8. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    225
    Likes Received:
    97
    What would a multi-chip design look like? Would they build 2-3 complete engines on a die and combine 2-4 of those dies?

    Or maybe build a frontend chip, then a shader chip, and lastly a backend chip?
     
    Grall likes this.
  9. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    I guess I wasn't clear in my first post. I should have said "they localize all work that can be generated from the local data, up to and including rasterization". But I thought my pointing out the NUMA thing and the striping methodology right before that statement alluded to that meaning. Sorry for the confusion.
     
  10. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Lower cost, assuming Intel's EMIB is analogous to not using an interposer.

    Yes, because not all concepts in a research paper will make the final cut, and technology changes. To conserve energy, source and destination need to be close. Even for scaling it makes more sense to tightly couple them first, then add the ability to share data on top of that. Limit coherence to only the data where it really matters, which isn't most textures and untransformed geometry.

    Splitting out by shader engine makes the most sense, with 1, 2, and 4 SE parts. Two binning passes and a leap towards 16 SEs across an Epyc/Ripper backplane may be doable.
     
  11. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,981
    Likes Received:
    4,569
    The slide is very clear: Navi in 2018 with "Nextgen Memory", after Vega with HBM2:

    [IMG: AMD roadmap slide]

    Perhaps it's HBM2.5 or HBM Low-cost, perhaps it's HBM2 using Intel's EMIB and given a different name, perhaps it's GDDR6 or perhaps it's HBM3 by SK-Hynix coming before Samsung's.
    It's not like this industry is entirely predictable over a span of 2-3 years. How long did it take between GDDR5X being announced and launching in a final product? 3 months or so?
    Regardless, after this slide I doubt Navi is coming with HBM2 like Vega.


    I don't think console chips count for 2 reasons:
    1 - They're not developed solely by AMD, they're a joint venture between teams belonging to AMD and Sony/Microsoft.
    2 - The teams who worked on PS4, PS4 Pro, XBone and Xbone X are probably working on the next gen already.

    Like you said, Polaris isn't a long way from a 14nm shrink of the GFX8 architectures Tonga/Fiji. Carrizo is actually from 2015, but you probably meant Bristol Ridge, which is practically Carrizo with Excavator v2 and an untouched GPU.
    So in practice, what we got was Polaris 10/11 in 2016 and Vega 10/11 + Polaris 12 in 2017. I remember Raja saying 2 distinct GPUs per year was just about what RTG could do.

    And that's just not enough to compete with Nvidia, who launched GP100 + GP102 + GP104 + GP106 + GP107 + GP108 + GV100 within the same period of time.



    We've been through that before. Raja was specifically following up on a conversation about ending "Crossfire" (i.e. driver-ridden AFR that needs work per-game on AMD's side) and leaving multi-GPU in DX12 to game developers. Which is what they're progressively doing already.
    What exactly did he state in that interview that makes one think he was talking about multi-chip GPUs?


    Like I said. It's hard. And AMD can't do hard things.


    More than some ideas on a paper.


    You mean this is not what you're doing? Trying to invalidate all facts that point to "Yes" in order to prove your opinion of a "No"?
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    It adds a hardware management engine to distinguish less-active register use from a smaller set of hot registers, which allows a smaller local register file, a larger, less energy-efficient register file behind it, and then spill/fill in the memory hierarchy. The theory is that it would dynamically relax the per-wavefront register occupancy constraints, although the pre-defined 256-register ISA limit would remain.
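    The two-level register idea described above can be modeled as a small hot file backed by a larger cold file, which itself spills to the memory hierarchy. This is a toy software analogue of my own (the sizes and LRU policy are assumptions, not the paper's actual mechanism):

```python
# Toy model of a hierarchical register file: small "hot" file, larger
# "cold" file, spill/fill to memory. Sizes and policy are illustrative.
from collections import OrderedDict

class TwoLevelRegFile:
    def __init__(self, hot_slots=16, cold_slots=64):
        self.hot = OrderedDict()   # reg -> value, kept in LRU order
        self.cold = {}             # larger, less energy-efficient file
        self.memory = {}           # spill/fill backing store
        self.hot_slots, self.cold_slots = hot_slots, cold_slots

    def write(self, reg, value):
        self.hot[reg] = value
        self.hot.move_to_end(reg)
        if len(self.hot) > self.hot_slots:        # demote coldest register
            victim, v = self.hot.popitem(last=False)
            self.cold[victim] = v
            if len(self.cold) > self.cold_slots:  # spill to memory hierarchy
                spill = next(iter(self.cold))
                self.memory[spill] = self.cold.pop(spill)

    def read(self, reg):
        if reg in self.hot:
            self.hot.move_to_end(reg)
            return self.hot[reg]
        value = self.cold.pop(reg, None)
        if value is None:
            value = self.memory.pop(reg)          # fill from memory
        self.write(reg, value)                    # promote back to hot file
        return value
```

    The point of the hardware scheme is that the common case hits the small, cheap structure, while rarely-touched architectural registers get pushed outward, relaxing the effective occupancy constraint.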

    That would be different than what Samsung's variant is attempting. EMIB is structurally a small silicon bridge capable of having the same interconnect density as a silicon interposer. Samsung's reduced-cost memory drops the bus width so that it can avoid using silicon, and a memory standard captive to something only Intel seems to have would be questionable.
    The reduced-cost HBM was still something Samsung was making inquiries about, to gauge customer interest.

    GDDR6 would be a next-generation memory, at least compared to GDDR5. Compared to HBM2, the possibilities seem limited in terms of an upgrade, like a form of HBM2 that lives up to its original specifications. The low-cost variant is a potential cost reduction, but would be somewhat inferior.
    I haven't ruled out some distinction in terms of memory revision, caches, or HBCC version that lets AMD say something counts as Next Gen memory, although I do not recall if that bullet point has persisted into more recent slide decks.

    That's a balance that exists by AMD's choice in priorities, and doesn't take into account how much IP cross-pollination is going on.
    The description of the process by the designers for the PS4 Pro and Xbox One X show a lot of high-level interchange and pulling from a menu of options AMD provides. Adjusting for custom external IP, AMD's internal investment in terms of engineering seems on the order of at least a modestly different GPU variant. The implementation and design for manufacturing on chips the same size or larger than Polaris also put the onus on AMD, since AMD eats the manufacturing costs if they cannot get it to yield sufficiently.

    Polaris was "refreshed" in the last 18 months, which is what I wasn't counting. I considered the initial Polaris launch something of a borderline case, and had forgotten to add that to the count.
    That would be a refresh that I did not count, I'm not sure if the steppings changed from the end of one line to the start of the next.

    Actually, I had blanked on the other Polaris chips as well, so add two more.

    Koduri's comments concerned moving away from Crossfire's abstracting of multiple GPUs as if they were a single GPU. Going forward with the new APIs, the intention was to involve and invest developers in the explicit management of individual GPUs.
    (Source: https://www.pcper.com/news/Graphics...past-CrossFire-smaller-GPU-dies-HBM2-and-more, after 1:50)

    The technologies at AMD and its chain of manufacturing partners do not show a reasonable path to implementing 3D and 2.5D active interposers and chiplets that scale up and down the stack in this decade, no. Nor does it seem like its competitors are realistically positioned to do any better, although some have on occasion expressed skepticism on steps even earlier in AMD's chain of improvements.
    I do not think AMD has promised such a transition as early as within the next 12-18 months, which seems consistent with none of the OSATs or foundries AMD relies on gearing up for this, or talking about anything beyond the next set of fan-out packaging methods coming in 1-2 years, which are not close to that goal.

    Is the "more" in this case slides for a CPU division product? I do not recall what the "more" is for AMD's GPUs, which I recall is rather vague and long-term.

    To quote someone's list: "2 - No official news or leaks about Navi have ever appeared that suggest it's a multi-chip solution."

    I think it's a stretch to apply EPYC's MCM method, and a mistake to use AMD's post-2020 HPC plans as a guideline for 2018/2019.
    AMD's stated position was that it intended to have explicit multi-adapter allow developers to manage their GPUs up and down the stack.

    Once it's exposed and delegated to the developers, there's less pressure for interconnect scaling like what was done for EPYC's more variable allocation needs. The granularity for a lot of the transfers at an API level can be done at a coarser inter-frame granularity or after some heavier synchronization points. There's less back and forth and more coalescing into pages or finalized buffers, so something like PCIe 3.0 or the PCIe 4.0/xGMI in the Vega 20 slides could provide sufficient grunt.
    To me that seems like a reasonable path from what we see today to Navi's supposed time window, which isn't that far away anymore.
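    To put a rough number on "sufficient grunt" for those coarse, per-frame transfers: dividing the effective x16 link bandwidth by frame rate gives the per-frame budget. The effective-bandwidth figures below are the commonly cited ones for x16 links, with protocol overhead ignored.

```python
# Back-of-the-envelope: how much data an x16 link can move per frame
# at 60 fps, assuming commonly cited effective bandwidths.

def gb_per_frame(link_gbs, fps=60):
    """GB that can cross the link within one frame interval."""
    return link_gbs / fps

pcie3_x16 = 15.75   # GB/s, PCIe 3.0 x16 effective
pcie4_x16 = 31.5    # GB/s, PCIe 4.0 x16 effective

budget3 = gb_per_frame(pcie3_x16)   # ~0.26 GB per frame
budget4 = gb_per_frame(pcie4_x16)   # ~0.52 GB per frame
```

    A quarter to half a gigabyte per frame is plenty for coalesced pages or finalized buffers moved at frame boundaries, though nowhere near enough for fine-grained mid-frame sharing.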

    xGMI or PCIe 4.0 can also go over a PCB rather than require the GPUs stay on the same substrate, which seems like it would make it easier to scale the product up and down without creating a series of modules of increasing size.
     
    AlBran likes this.
  13. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Different, but reduce the number of pins from HBM2 and you're looking at GDDR. The only difference may be placing the memory close enough to avoid large drivers in the ICs. It still avoids the large silicon interposer, but has the small bridge to retain a high level of IO. Sort of a middle ground, if you will. Your thinking was my original thinking as well, but in hindsight we may have been wrong. At the very least they would seem to be competing technologies, unless HBM3 is far lower bandwidth or involves interesting signaling.
     
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    I'm not thinking this as much as reading what Samsung stated.
    It's roughly half as wide as HBM2, and running at 50% higher bit rate. It strips out ECC and the base die, and aims for an organic rather than silicon material for its interposer.
    https://www.anandtech.com/show/1058...as-for-future-memory-tech-ddr5-cheap-hbm-more
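    The per-stack bandwidth implied by those figures is easy to check, taking HBM2's 1024-bit bus at 2.0 Gb/s per pin as the baseline:

```python
# Per-stack bandwidth from bus width and per-pin rate.
# Baseline assumption: HBM2 at 1024 bits wide, 2.0 Gb/s/pin.

def stack_bandwidth_gbs(bus_width_bits, gbps_per_pin):
    return bus_width_bits * gbps_per_pin / 8   # GB/s per stack

hbm2     = stack_bandwidth_gbs(1024, 2.0)      # 256.0 GB/s
low_cost = stack_bandwidth_gbs(512, 3.0)       # 192.0 GB/s
```

    So half the width at 50% higher bit rate nets out to 0.75x the bandwidth of a full HBM2 stack, which fits the "cheaper but somewhat inferior" characterization.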
     
  15. Megadrive1988

    Veteran

    Joined:
    May 30, 2002
    Messages:
    4,638
    Likes Received:
    148
    Okay, what would be your guess and expectations for the next Xbox, if Microsoft is targeting a late 2021 release (4 years after X1X, 8 years after XB1) in terms of AMD GPU architecture, number of CUs and memory bandwidth, and would HBM3 be feasible by then?
     
  16. fuboi

    Newcomer

    Joined:
    Aug 6, 2011
    Messages:
    90
    Likes Received:
    46
    Oh boy, imagine having up to 2^60 registers. Wait, isn't that a typical CPU with LD/ST instructions? Quick, let's also add a 5th cache hierarchy for 'registers' (after tex/const/D$/I$) including a tiny L0-cache.
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    If the pattern from the current generation holds, whatever architecture adopted by the console would be much closer to a new card launched in a similar time frame, possibly with a slight delay like with Bonaire. That's potentially something next gen to AMD's Next Gen. Navi hopefully shouldn't be the design under consideration by that point.

    However, if the current gen's pattern of 6x to 8x over the prior generation holds, I think it's possible for a 64 CU Navi to get rather close to that if the basis is Durango or Orbis, and that doesn't necessarily go beyond the high-level chip organization we have now. Versus Sea Islands variants of 12 CUs at 0.853 GHz and 18 CUs at 0.8 GHz, a chip with 64 CUs running at a notch below Vega's overly high 14nm clock speeds could actually improve GPU shader performance by almost an order of magnitude over the Xbox One, and that's without 7nm's power scaling or any architectural improvements since Sea Islands. Durango's something of a lower bar to clear, however.
    Getting that level of hardware improvement over the launch PS4 would likely require some of those other improvements being taken into account. This is roughly mapping the foundry marketing of 4 node transitions where 28->~20->~16->~12->~7 into something close to the improvement from the 90->65->40->28 that denoted the prior generation's transition, although I'd be more confident with somewhere between 7nm and 5nm to get enough padding to compensate for marketing's inflating the progress being made these days.
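    The "almost an order of magnitude" claim above follows from GCN FP32 throughput (CUs x 64 lanes x 2 ops per FMA x clock). The 1.4 GHz figure is my assumption for "a notch below Vega's 14nm clocks":

```python
# Rough GCN FP32 throughput arithmetic; 1.4 GHz for the hypothetical
# 64 CU part is an assumption, not a disclosed spec.

def gcn_tflops(cus, ghz, lanes=64, ops_per_clock=2):
    return cus * lanes * ops_per_clock * ghz / 1000.0

xbox_one  = gcn_tflops(12, 0.853)   # ~1.31 TFLOPS (Durango)
ps4       = gcn_tflops(18, 0.800)   # ~1.84 TFLOPS (Orbis)
navi_64cu = gcn_tflops(64, 1.4)     # ~11.47 TFLOPS

ratio_vs_xbox = navi_64cu / xbox_one   # ~8.8x
ratio_vs_ps4  = navi_64cu / ps4        # ~6.2x
```

    So versus Durango the ratio lands near the top of the 6x-8x pattern on raw shader throughput alone, while clearing it against Orbis needs some of the other improvements mentioned.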

    If the baseline for 2021's jump is the 16nm PS4 Pro or Xbox Scorpio, then there are probably going to be problems getting the same spec ratio, and there's a question of diminishing returns where getting the same perceived improvement may need more performance and physical integration tricks. However, the cost and risk picture at that point for AMD is a question mark.

    HBM3 is potentially a case where trends are not going in AMD's favor. It doesn't sound like Samsung wants it to be a value play, and it's not certain if the DRAM market's long term price picture matches what AMD expected when it committed to HBM so long ago. Consolidation and process node difficulties in the DRAM market coupled with enterprise and mobile competition for memory production may put a price premium per stack that scales with chip count if doing something like an MCM or PIM.
    HBCC and AMD's HPC proposals do show recognition of this, but it is something that is more awkward to handle if there's an architecturally higher minimum in memory cost.
    Four years is a long time, however.

    I'd imagine there's some potential course adjustments built in based on whether some of these pan out.
    However, and this is more of a grenade to throw in the console forum, it's a loaded question to ask about what's feasible for a console more than 3 years out (roughly the design time of the current gen) just in terms of AMD.
    2021 is pretty far, and conditions aren't the same in terms of leadership, some of the new products, or the positioning of competitors. Some of the console makers seem to have demonstrated a willingness to Switch, or have shown infrastructure in their backwards compatibility measures to handle a switch.
     
  18. BoMbY

    Newcomer

    Joined:
    Aug 31, 2017
    Messages:
    68
    Likes Received:
    31
    _cat likes this.
  19. Nemo

    Newcomer

    Joined:
    Sep 15, 2012
    Messages:
    125
    Likes Received:
    23
  20. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    520
    Likes Received:
    239