AMD: Navi Speculation, Rumours and Discussion [2019]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

  1. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,446
    Likes Received:
    302
    I don't know what that means. I suppose it's someone's username so no.
     
    ethernity likes this.
  2. SimBy

    Regular Newcomer

    Joined:
    Jun 21, 2008
    Messages:
    641
    Likes Received:
    313
    Not really. What AMD should do is follow their own release schedule and completely ignore all the noise. There's no need for subtle marketing attempts with thinking faces and the like. Let the product do the talking. If it's great, it makes the doubters look like complete idiots. If it's not great, you save face by not having hyped it up with poor marketing attempts.
     
  3. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    572
    Likes Received:
    266
    And if it doesn't have a latency advantage I don't see what it could be for. You're still going to get slowed down by a narrow bus any time you read from GDDR with this scheme. Yes, IBM claims its eDRAM cache hits 3 TB/s, which is very fast. But without a cache of 512 MB or so you'd still need to read and write GDDR a lot for the current frame's buffers, and the cache would take up enough die space that you could just widen the bus instead. Unless it's some chiplet scheme, but I'd figure the first use of that would be to separate I/O from logic, like Intel did with Lakefield.

    Overall, a giant eDRAM cache isn't a new idea, but AFAIK IBM is the only one still using it to any great extent.
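
    As a back-of-the-envelope comparison of the two options above (the 16 Gbps signalling rate is an assumption for illustration; the 3 TB/s figure is IBM's claim, not a measured number):

    ```python
    # Rough bandwidth comparison: widening a GDDR6 bus vs. a big on-die cache.
    # Assumes 16 Gbps GDDR6 signalling (illustrative, not a confirmed spec).
    GB_PER_S_PER_PIN = 16 / 8          # each data pin moves 2 GB/s at 16 Gbps

    for bus_bits in (256, 384, 512):
        print(f"{bus_bits}-bit bus: {bus_bits * GB_PER_S_PER_PIN:.0f} GB/s")

    EDRAM_BW = 3000                    # GB/s, the ~3 TB/s eDRAM claim above
    print(f"eDRAM-style cache: {EDRAM_BW} GB/s, but only for data that hits")
    ```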
     
    sonen likes this.
  4. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    11,747
    Likes Received:
    2,727
    The absence of official information is a problem. I don't know why AMD waits so long to release all their products. An Oct 8th announcement barely puts them in time for holiday sales, and unless they have extremely good availability it will make a lot of people switch companies
     
  5. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,350
    Likes Received:
    3,340
    Location:
    Finland
    They're not "waiting so long to release their products", they're releasing them as soon as they're ready to be released. Can you imagine the shitstorm if they did a proper paper launch at this point?
     
    Per Lindstrom, no-X, CeeGee and 2 others like this.
  6. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    11,747
    Likes Received:
    2,727
    Wait, so you expect it not to be a paper launch in Oct? I have a feeling Oct 8th is the announcement, with availability in late Oct or even early Nov
     
  7. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    1,818
    Likes Received:
    749
    Location:
    msk.ru/spb.ru
    RDNA2 is Oct 28th
     
  8. GeniusMonkey

    Joined:
    Sep 12, 2020
    Messages:
    4
    Likes Received:
    14
    128MB of L3 cache with "only" a 256-bit GDDR6 bus is plausible. It is hard to say how such an architecture would perform without doing a bunch of simulations of different workloads. The cache replacement policy would be critical to performance, since you don't want to cache a bunch of data that is not going to be reused; see the sketch below. It would to a large degree be an engineering trade-off between die area and bus width (and the costs associated with each).
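
    A toy illustration of that replacement-policy point (everything here is made up for illustration; real hit rates need traces of real workloads): plain LRU lets a long streaming pass thrash a reusable working set, while a policy that bypasses known-streaming lines keeps it resident.

    ```python
    from collections import OrderedDict

    def hit_rate(trace, capacity, bypass_streaming=False):
        """Simulate a fully-associative cache; return the fraction of hits."""
        cache, hits = OrderedDict(), 0
        for addr, streaming in trace:
            if addr in cache:
                hits += 1
                cache.move_to_end(addr)            # refresh LRU position
            elif not (bypass_streaming and streaming):
                cache[addr] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)      # evict least recently used
        return hits / len(trace)

    # 1000 reusable lines re-read every frame, plus an 8000-line streaming pass.
    reused = [(a, False) for a in range(1000)]
    stream = [(10_000 + a, True) for a in range(8000)]
    trace = (reused + stream) * 4                  # four "frames"

    print("LRU:             ", hit_rate(trace, capacity=2000))
    print("bypass streaming:", hit_rate(trace, capacity=2000, bypass_streaming=True))
    ```

    Even in this crude model the thrashed LRU cache hits nothing, while selectively bypassing streams at least keeps the reusable set resident; real policies (and real traces) are of course far more nuanced.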
     
  9. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    11,747
    Likes Received:
    2,727
    Thanks for correcting me. So yeah, we may not even have shipping cards before the end of Nov at this point
     
    Cuthalu likes this.
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,112
    Likes Received:
    1,202
    Location:
    London
    As I noted more recently, the labelled XSX die shot has nothing like a massive quantity of RAM. So my comments about these "color/depth" blocks are irrelevant.

    Xenos benefitted hugely from the daughter die's performance.

    Xenos's daughter die didn't have the capacity for all render target formats that could be bound by the GPU, so there were performance uncertainties and cliffs (full speed 4xMSAA was limited to 720p, I think - fuzzy memory alert).

    "128MB" is gargantuan in comparison, 16 bytes per pixel at 4K, but I guess there will be cases where delta colour compression or MSAA sample compression fail to satisfy the 16 bytes limit.

    NVidia's tiled rendering (with tile sizes that vary depending upon the count of vertex parameters, pixel format, etc.) is some kind of cache scheme that only seems to "touch" relatively few tiles at any given time.

    By comparison the rumoured "128MB Infinity Cache" seems like dumb brute force if it were dedicated solely to ROPs.

    I'll be honest, I think "128MB Infinity Cache" is a hoax or at the very least a grave misunderstanding.

    I like this. I have a fuzzy memory of a previous discussion about an active interposer used this way.

    I'm still haunted by "Nexgen Memory":

    [image: "Nexgen Memory" slide]

    Frankly, I don't believe there is such a large "cache". The frame rate targets for XSX/XSS without an obvious hunk of memory in the die shot (frame rates that we should expect from 6800XT/6700XT, it seems) tell me that "Infinity Cache" in a giant amount is not part of the Navi 2x architecture.

    I can believe "Infinity Cache" is a property of the architecture, but this magic number of 128MB has been conjured out of thin air by the leakerverse. I can believe that every type of memory on Navi 21 (registers, caches, buffers) adds up to a total of 128MB, but not that there is a cache of that size.
     
    TheAlSpark and Lightman like this.
  11. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    1,818
    Likes Received:
    749
    Location:
    msk.ru/spb.ru
    That's GDDR6.
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,112
    Likes Received:
    1,202
    Location:
    London
    6 and 6X certainly have features/performance (channels and signalling) that are beyond 5.

    I've always interpreted it as features/performance beyond HBM2, but whatever.
     
    Lightman likes this.
  13. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    341
    Likes Received:
    282
    IMO immediate mode GPUs these days are too programmable to pull off the same feat again. Schemes like “tile based rendering” and “DSBR” are basically trying to improve the spatial locality of caches within the current programming model. They don't change the intrinsic nature of immediate mode: it cannot guarantee that the GPU (or a specific screen-space partition of it) will only touch one rasteriser bin/tile at a time. So the worst case scenario remains chaotic/pathological API submissions that lead to little to no binning in practice, where the partition always has to be ready to deal with any number of tiles simultaneously. In other words, we have no bounded/fixed memory use should on-chip tile memory be introduced to an immediate mode GPU. That's unlike a TBDR, which sorts and bins all primitives before rasterisation and fragment shading.

    Larger caches (not on-chip tile memory) could in theory help, but they would never be as intrinsically effective as TBDR. All immediate mode GPU vendors seemingly have collectively decided so far that the cost outweighs the benefit; otherwise they have always been in a position to stack up the caches in this age of dark silicon (see GA100's 40MB L2).
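
    A minimal sketch of the contrast being drawn (simplified and hypothetical: primitives are reduced to axis-aligned rectangles and shading to bookkeeping):

    ```python
    TILE = 32  # screen-tile size in pixels, illustrative

    def tiles_of(prim):
        """Set of (tx, ty) screen tiles a primitive's bounding box touches."""
        x0, y0, x1, y1 = prim
        return {(tx, ty)
                for tx in range(x0 // TILE, x1 // TILE + 1)
                for ty in range(y0 // TILE, y1 // TILE + 1)}

    def immediate_mode(prims):
        # Shades primitives in submission order: the set of "live" tiles is
        # unbounded, so no fixed-size on-chip tile memory can be guaranteed
        # to hold the working set.
        for prim in prims:
            for tile in tiles_of(prim):
                pass  # rasterise/shade into whatever cache holds this tile now

    def tbdr(prims):
        # Bins *all* primitives first, then shades tile by tile: exactly one
        # tile's worth of framebuffer ever needs to be on chip.
        bins = {}
        for prim in prims:
            for tile in tiles_of(prim):
                bins.setdefault(tile, []).append(prim)
        for tile, tile_prims in bins.items():
            for prim in tile_prims:
                pass  # shade this primitive's fragments within this one tile

    immediate_mode([(0, 0, 100, 100), (50, 50, 300, 200)])
    tbdr([(0, 0, 100, 100), (50, 50, 300, 200)])
    ```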
     
    #3093 pTmdfx, Sep 13, 2020
    Last edited: Sep 13, 2020
    OlegSH likes this.
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,472
    Likes Received:
    4,410
    Location:
    Well within 3d
    The cited benefit was compensating for the lack of a major update to the external memory bus, so bandwidth amplification seems to be the primary motivation.
    However, leaving things as-is and introducing another cache layer leaves the question of how much bandwidth this is supposed to amplify, and whether it becomes large enough to reduce the relative effectiveness of the L2's amplification. The L2's parallelism and bandwidth have seen limited scaling from the GCN generations through Navi. If the aggregate cache and memory bandwidth supplied into the hierarchy rises, a cache pipeline that isn't rebalanced could find an L2 with marginally more bandwidth than prior generations, and potentially no additional capability to avoid bank conflicts, becoming a bottleneck.
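
    To put the "how much amplification" question in concrete terms, a rough model (the DRAM figure assumes a 256-bit GDDR6 bus at 16 Gbps, and the hit rates are placeholders, not measurements):

    ```python
    # If a new cache level absorbs a fraction `hit_rate` of memory traffic,
    # external DRAM bandwidth is effectively amplified by 1 / (1 - hit_rate),
    # but the levels above it must carry that whole amplified request stream.
    DRAM_BW = 512  # GB/s, assumed 256-bit GDDR6 at 16 Gbps

    for hit_rate in (0.0, 0.25, 0.50, 0.75):
        print(f"hit rate {hit_rate:.0%}: "
              f"~{DRAM_BW / (1 - hit_rate):.0f} GB/s of upstream demand")
    ```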

    Latency could become a problem, unless workloads that are known to be sensitive can bypass multiple layers of cache. GCN and RDNA have the ability to control this, although it's not clear that there are latency benefits, since it seems to be more about cache invalidation and miss control than avoiding long-latency paths in the pipeline. Additionally, having yet another cache leaves the question of whether it's being used differently or transparently to the GPU, because yet another layer of cache control seems like it's pushing things. The level of exposure of architectural details, and the hand-holding the ISA does for the cache hierarchy, already increased with RDNA1.


    eDRAM has specific process needs. IBM's use extends to Power9, which is fabricated on the 14nm SOI process IBM sold to GF. IBM's Power10 is to be on Samsung's 7nm node and reverts to SRAM.
    Per the following, it seems like the space savings weren't particularly good, and were potentially negative, at the speed and bandwidth level required for an L3 cache, but eDRAM did save on static leakage versus SRAM.
    https://www.itjungle.com/2020/08/24/drilling-down-into-the-power10-chip-architecture/
    This actually brings up a possible pain point for an AMD large-cache GPU if it's using SRAM in such quantity. Perhaps it's compensated for by reducing the need for a high-speed bus, but if this is supposed to scale down to mobile, 128MB of powered SRAM may need some attention paid to standby power.

    There may be a limit of sorts to the pathological case, in that we don't know how many batches the DSBR can close and have in flight for any given screen tile. The capacity for tracking API order for primitives in batches exists; or at least we know the opposite exists, where the GPU can be told to ignore API order for cases like depth/coverage passes.
    If it's known that the hardware cannot generate an unbounded number of batches in flight, the depth of its queues or backlog could provide decision data for the lifetime of a given screen tile in local storage.
     
    Jawed, PSman1700, TheAlSpark and 3 others like this.
  15. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    572
    Likes Received:
    266
    AFAIK 2:1 compression is still the most common case for DCC, thus 16 bytes per pixel only gives you a big g-buffer at 4K and not anything else, or a nonexistent one and little else. You can't evict it back to GDDR since you need it throughout much of the frame, and then you don't have room for all the rest of the frame's data.
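
    For a sense of scale, a hypothetical deferred g-buffer layout (the formats and target choices are illustrative, not any particular engine's):

    ```python
    # Hypothetical 16 B/px deferred g-buffer at 4K; formats are illustrative.
    pixels = 3840 * 2160
    targets = {              # bytes per pixel
        "albedo   RGBA8":   4,
        "normals  RGB10A2": 4,
        "material RGBA8":   4,
        "depth    D32":     4,
    }
    bpp = sum(targets.values())
    print(f"{bpp} B/px -> {pixels * bpp / 2**20:.0f} MiB at 4K")  # ~127 MiB
    ```

    That alone consumes essentially the whole 128 MB before any HDR colour buffer, shadow maps, or history buffers are counted.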

    My feeling is someone saw that "leaked" pic, which had an ASIC label for who knows what reason, assumed it was true, and came up with a reason it could be the big card.

    Besides, bus die area isn't what seems to be holding GPUs back today, at least not more than nominally. Ampere shows how easily you can run through your thermal and power budget. Targeting that would seem far more obvious than trying to save a bit of die area by going with a smaller bus.
     
  16. LordEC911

    Regular

    Joined:
    Nov 25, 2007
    Messages:
    837
    Likes Received:
    139
    Location:
    'Zona
    Power efficiency is always the main priority, but it still needs to be balanced with other parts, and the bandwidth requirement is near the top of that list for reaching a high performance level. That's why Nvidia partnered with Micron to develop GDDR6X with a new signalling spec, higher speed and better power efficiency.
    That's also why we have seen engineers preaching about locality: when you do need to go off-chip it is going to use more power. See the video below.
    It is also why we have seen more innovation at the marchitecture level, along with other ways to feed/maintain performance levels with other means of I/O, like DirectStorage, Infinity Fabric, and NVLink.
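
    To put rough numbers on that locality point (the pJ/bit values are public ballpark figures for these classes of memory, not vendor data):

    ```python
    # Ballpark energy for moving data on-chip vs. off-chip at GPU bandwidths.
    # The pJ/bit values are rough public ballparks, not vendor numbers.
    BITS_PER_BYTE = 8
    traffic = 512e9                            # bytes/s of sustained traffic

    for name, pj_per_bit in [("on-chip SRAM", 0.5), ("off-chip GDDR6", 7.0)]:
        watts = traffic * BITS_PER_BYTE * pj_per_bit * 1e-12
        print(f"{name:>14}: ~{watts:.0f} W to move 512 GB/s")
    ```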

    I remember an interview with an AMD engineer around the Fiji launch, talking about how bandwidth requirements in the future were going to scale across the entire system, not just on-chip and/or to the memory. It seems like we are reaching that point.
    I can picture a snapshot of the interview, but I couldn't find it trawling through AMD Fiji YouTube videos.

    [embedded video]

     
    #3096 LordEC911, Sep 14, 2020
    Last edited: Sep 14, 2020
  17. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,717
    Likes Received:
    1,080
    Location:
    France
    RGT is doubling down on a 256-bit bus (with GDDR6). If true, it will be interesting to see the bandwidth efficiency of RDNA2.
     
  18. manux

    Veteran Regular

    Joined:
    Sep 7, 2002
    Messages:
    2,347
    Likes Received:
    1,284
    Location:
    Earth
    Power usage for Ampere is probably higher because it can run more things efficiently at the same time. The diagrams show tensor cores, RT cores and regular shading/compute running simultaneously. The memory bandwidth is too limited though, as from those same diagrams it looks like there is barely any left for the tensor cores to use. If you don't have the memory bandwidth then the hardware will idle more and power consumption will go down, though the side effect is a slower end result.
    [attached images: Ampere concurrency diagrams]
     
    #3098 manux, Sep 14, 2020
    Last edited: Sep 14, 2020
  19. SimBy

    Regular Newcomer

    Joined:
    Jun 21, 2008
    Messages:
    641
    Likes Received:
    313
    For N21? To feed 80 CUs? I'm sure consoles would benefit from that alien tech.
     
  20. xEx

    xEx
    Regular Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    964
    Likes Received:
    428
    [image: new AMD cooler]

    This is the new cooler AMD will use. I like it.
     