AMD: Navi Speculation, Rumours and Discussion [2019]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

  1. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    384
    Likes Received:
    389
    Are you sure about that part? IIRC the deal with the SXM socket series was that the GPUs are allowed to draw as much power as they need, so they're never throttled by the power limit. Also flat side-by-side mounting rather than "stacked", so fewer thermal issues too, and thus no thermal throttling either.

    Not to mention that the other models have NVLink too, just not as the system interface.
     
    pharma and DavidGraham like this.
  2. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,278
    Likes Received:
    3,523
    A100 is not your typical run-of-the-mill GPU, it's a Tensor Core GPU, NVIDIA even named it as such. It lacks RT cores, video encoders, and display connectors, so you don't judge it based on traditional FP32/FP64 throughput. You judge it based on its AI acceleration performance, of which it has plenty to offer.

    You might be onto something, SXM2 power limit was 300W, SXM3 power limit was raised to 350W, and later to even 450W. A100 is SXM4, which explains the 400W TDP limit.
     
    #2162 DavidGraham, May 26, 2020
    Last edited: May 26, 2020
  3. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,079
    Likes Received:
    2,949
    Location:
    Finland
    Yeah, but since when have GPUs been compared at roughly ISO power anywhere? By every other usual metric it should be compared to Vega 64 rather than 56.
     
  4. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,079
    Likes Received:
    2,949
    Location:
    Finland
    Sorry, forgot those. The 4GB model, since that's the one they use as the "reference model" in their charts, at 1080p to make it as fair as possible for the 5500 XT.
    According to their tests the 8GB models can be up to ~10% faster, but I think they're all running at higher clocks too.
     
  5. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,031
    Likes Received:
    5,573
    1 - I obviously meant something connected to the graphic I put right after that sentence, which showed relative gaming performance numbers, not TFLOPs. I wouldn't show a graphic about gaming performance, then talk about the graphic but somehow implicitly mean I was talking about TFLOPs instead. That makes no sense.

    2 - CU count alone doesn't tell you TFLOP throughput. At what imaginary clocks are your CUs working? The exact same as a 5700 XT? Knowing the PS5's GPU boosts up to 2.23GHz and according to Sony officials it "stays there most of the time", I find it hard to believe Big Navi will run at 1850MHz on average.

    3 - "Navi 10 x2" wouldn't result in 30% more TFLOPs than the 2080 Ti, so this wouldn't hold either way.

    4 - Comparing AMD vs. nvidia in TFLOPs and gaming-performance-per-TFLOP stopped being relevant the moment Turing brought dedicated INT32 ALUs that work in parallel with the FP32 ALUs. Not that it's ever been an especially good indicator of relative gaming performance between architectures from different IHVs.


    Which sounded really sarcastic TBH, and that was further emphasized by the doubting and questioning of the data points I provided.
    If this was not the case, then I apologize.


    It could be that the main culprit for the relatively low clocks on GA100 is related to nvidia using a larger proportion of area-optimized transistors because they're close to the reticle limit.



    Then why wouldn't the chip increase the core clocks when not using the tensor cores, effectively making the GPU more competitive for non-ML HPC tasks? It's not like clock boosting is a new thing to nvidia.
    Occam's Razor says these 1410MHz are simply how high the chip can clock considering its 400W TDP, regardless of it using the tensor cores or the regular FP32/FP64 ALUs.


    I still find it odd that a new process node is being used yet the core clocks are lower, considering how nvidia has historically increased their chips' clocks with each new node that allows it. That, and the FP32 and FP64 throughput increase not even keeping up with the TDP increase, which is also odd.
    Regardless, this isn't the Ampere thread.
     
    w0lfram likes this.
  6. szatkus

    Newcomer

    Joined:
    Mar 17, 2020
    Messages:
    30
    Likes Received:
    23
    TDP roughly correlates with market segment. In theory they could release an Oland GPU at 300W, but its target market isn't interested in such power-hungry cards. Navi was released as an upper mid-range GPU, while Vega was in high-end territory.
     
    Cuthalu likes this.
  7. neckthrough

    Newcomer

    Joined:
    Mar 28, 2019
    Messages:
    14
    Likes Received:
    31
    Actually it does that too. From the conclusions page, when talking about raw performance:

    "This makes the card a whopping 20% faster than Vega 64 and puts it just 8% behind Radeon VII, which is really impressive."

    Later on that same page, the article makes an observation about the improvement in power efficiency:

    "One important cornerstone is the long overdue reduction in power draw to make up lost ground against NVIDIA. In our testing, the RX 5700 XT is much more power efficient than anything we've ever seen from AMD, improving efficiency by 30%–40% over the Vega generation of GPUs. With 220 W in gaming, the card uses slightly less power than RX Vega 56 while delivering 30% higher performance."

    Though the wording seems slightly incorrect - Vega 56 draws slightly less power than the 5700 XT, not the other way around (at least based on reference specs)?

    Anyway we're debating over the quality of TPU's article, which is a little silly and not the point of this thread. Apologies for stretching this tangent.
     
  8. BRiT

    BRiT Verified (╯°□°)╯
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    15,531
    Likes Received:
    14,070
    Location:
    Cleveland
    This is the AMD Navi thread. If you want to talk about stuff that isn't Navi, do it somewhere else.
     
  9. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,934
    Likes Received:
    2,264
    Location:
    Germany
    Accepted. But honestly, it's nothing less than can be expected given the two data points I provided (timespan, mfg. tech) - at least for a „Big Navi“ (i.e. a meaningfully more sizeable chip than Navi 10, like 450-550 mm², not a 320 mm² half-breed).
     
  10. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,031
    Likes Received:
    5,573
    It was, up until 2016, when a GTX 1080 was consuming little more power than an RX 480, effectively putting two cards from very different market segments at a similar TDP, and then cards within the same market segment with very different TDPs (e.g. Vega 64 vs. GTX 1080).
    And that was the case for 3 years, until Navi 10 released last year.



    I think @Kaotik 's point is that Vega 64 was the Vega 10 high-end card that sacrifices power efficiency to maximize performance whereas Vega 56 is the cut-down version that clocks closer to the ideal power efficiency curves.
    The 5700XT is the Navi 10 high-end card that sacrifices power efficiency to maximize performance whereas 5700 is the cut-down version that clocks closer to the ideal power efficiency curves.
    This is further supported by their respective price ranges at the time Navi 10 released.



    I think it depends on how high they can clock the new chips (which should clock pretty high, given the PS5's example) without hitting a power wall.
    I'm guessing a 30 WGP / 60CU part with a 320bit bus using 16Gbps GDDR6 (640GB/s?) with clocks averaging at 2.3GHz would already put pressure on the 2080 Ti's performance bracket, and it wouldn't require a 450mm^2 chip. It would actually be smaller than 350mm^2, considering how big the SeriesX's APU is.
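    (For reference, the bandwidth arithmetic checks out: 320 bits x 16 Gbps per pin = 5120 Gbit/s = 640 GB/s.)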
     
    Kaotik likes this.
  11. ethernity

    Newcomer

    Joined:
    May 1, 2018
    Messages:
    39
    Likes Received:
    73
    Some new patent applications from AMD for motion-based VRS and texture decompression:

    This is a logical extension of the current Adrenalin Radeon Boost feature.
    20200169734 VARIABLE RATE RENDERING BASED ON MOTION ESTIMATION

    20200118299 REAL TIME ON-CHIP TEXTURE DECOMPRESSION USING SHADER PROCESSORS

     
    #2171 ethernity, May 28, 2020
    Last edited: May 28, 2020
    w0lfram, DmitryKo, PSman1700 and 3 others like this.
  12. ethernity

    Newcomer

    Joined:
    May 1, 2018
    Messages:
    39
    Likes Received:
    73
    Lol, this is actually Radeon Boost :)
     
  13. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,079
    Likes Received:
    2,949
    Location:
    Finland
    Nah, Radeon Boost is simpler than what's described there, it just lowers the whole rendering resolution.
     
    ethernity likes this.
  14. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,804
    Likes Received:
    1,091
    Location:
    Guess...
    Interesting. I wonder if that could be used for what I speculate about here, but using shaders rather than a hardware block. It would certainly have the advantage of not being fixed to a specific standard. But I wonder what the GPU hit would be...
     
    ethernity and PSman1700 like this.
  15. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,278
    Likes Received:
    3,523
    In the past, some games used NVIDIA's CUDA cores to do texture decompression, like Rage 2 and Wolfenstein Old Blood, with no measurable hit to performance, though the advantages were limited too.
     
  16. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,464
    Likes Received:
    831
    Location:
    France
    I don't get what the difference is with S3TC, which doesn't need shaders to manage compressed textures... ?
     
  17. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,804
    Likes Received:
    1,091
    Location:
    Guess...
    Interesting to know there is precedent here. I think the dynamics of IO speed and CPU power will be significantly different this generation though (at least for a couple of years) which may make GPU decompression a much bigger win.

    It's a second level of compression over the GPU-native formats like S3TC. Those formats can be handled natively by the GPU without first being decompressed, but they don't have great compression ratios. If you compress them again with something like Kraken or BCPACK, you can make them even smaller, but then you need to decompress that outer layer before they are processed by the GPU.
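    To make that concrete, here's a minimal CPU-side sketch of the idea in Python, with zlib standing in for a Kraken/BCPACK-style codec: the texture stays in a GPU-native block-compressed format, and a general-purpose codec is layered on top for storage, so only that outer layer has to be undone before upload.

    import zlib

    def pack_for_disk(bc_texture: bytes) -> bytes:
        # Second-level compression over the already BC-compressed texture.
        # zlib is just a stand-in for a Kraken/BCPACK-style codec.
        return zlib.compress(bc_texture, level=9)

    def load_for_gpu(packed: bytes) -> bytes:
        # Undo only the outer layer; the result is still BC-compressed
        # and can be uploaded to the GPU as-is (e.g. BC1/BC3/BC7).
        return zlib.decompress(packed)

    bc_data = bytes(8) * 1024          # 1024 dummy 8-byte BC1 blocks
    on_disk = pack_for_disk(bc_data)
    assert load_for_gpu(on_disk) == bc_data
    print(len(bc_data), "->", len(on_disk), "bytes on disk")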
     
    Lightman, PSman1700 and Rootax like this.
  18. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,482
    Likes Received:
    15,934
    Location:
    Under my bridge
    S3TC can't reach compression ratios as high as these other compression schemes, which can be used on top of S3TC (and other) texture formats to decrease asset size and increase load speeds. Realtime texture compression formats are specifically designed to be fast and usable in the texture units, so they have to sacrifice some opportunities for greater packing.
     
    Lightman, pharma, ethernity and 3 others like this.
  19. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,464
    Likes Received:
    831
    Location:
    France
    Thx for the answers.
     
  20. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    384
    Likes Received:
    389
    And one of the obvious, necessary requirements is usually a constant compression factor, so that addressing in a compressed image works the same as in an uncompressed image. Which results in only being able to use lossy compression codecs, which on top may not compress even highly compressible features properly.

    E.g. take a 3-channel image of something resembling a typical company logo, just two colors and few features, at 1k x 1k resolution. 4MB raw, still 1MB with the S3TC family, while a decent general-purpose compression algorithm can usually bring this down to a few kB.
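    For illustration, a quick sketch of that size gap (assuming RGBA8 storage for the raw case, the fixed bits-per-texel rates of BC1/BC3, and zlib standing in for a "decent" general-purpose codec):

    import zlib

    W = H = 1024
    raw = W * H * 4                      # RGBA8: 4 bytes/texel   -> 4 MiB
    bc1 = W * H // 2                     # BC1: 0.5 bytes/texel   -> 0.5 MiB
    bc3 = W * H                          # BC3/BC7: 1 byte/texel  -> 1 MiB

    # Two-colour "logo": mostly white, one small black rectangle.
    pixels = bytearray(b"\xff\xff\xff\xff" * (W * H))
    for y in range(100, 200):
        start = (y * W + 100) * 4
        pixels[start:start + 300 * 4] = b"\x00\x00\x00\xff" * 300

    packed = len(zlib.compress(bytes(pixels), 9))
    print(raw // 1024, bc1 // 1024, bc3 // 1024, "KiB vs", packed, "bytes")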

    Of course it doesn't get as good as e.g. PNG format. That format simply isn't made for random access at all.

    But imagine e.g. a layer on top of good old BC1, except you declare that one BC1 block may be up-scaled to be representative of an entire macro block of 16x16 or 128x128 texels, rather than just the usual 4x4 block. Still good enough for flat color, and maybe even most gradients. And the same lower-level block may also be re-used for several upper-level macro blocks. The lookup table is still somewhat easy to construct yourself: just reduce the original picture resolution by 4x4, and store a 32-bit index to the representative block. If you feel like it, spend 1 or 2 bits on flagging scaled use of macro blocks.

    So far, this is something you can implement yourself, in pure software: unconditionally trade a texture lookup in a 16x size-reduced LUT for the corresponding savings in memory footprint (32x reduced if you can live with a 16-bit index), and still benefit from HW-accelerated S3TC decompression in the second stage, while likely getting a cache hit on both the first and second stage. Up to this point, you already have a decent variable-compression-rate texture scheme: applied virtual texturing, without the deferred feedback path.
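    A toy version of that two-stage lookup in Python (the dedup/build step and real BC1 decoding are left out; the point is just the 4x4-granular index table into a pool of shared BC1 blocks):

    import struct

    BLOCK = 4  # BC1 blocks cover 4x4 texels

    class IndirectBC1:
        def __init__(self, width, height, block_pool, index_table):
            self.w, self.h = width, height
            self.pool = block_pool        # unique 8-byte BC1 blocks
            self.lut = index_table        # (h//4) x (w//4) block indices

        def fetch_block(self, x, y):
            # Stage 1: cheap lookup in the 16x size-reduced LUT...
            idx = self.lut[y // BLOCK][x // BLOCK]
            # Stage 2: ...then the shared BC1 block (decoded by the TMU in HW).
            return self.pool[idx]

    # 16x16 texture where every tile reuses the same flat-colour block:
    flat = struct.pack("<HH4B", 0xFFFF, 0xFFFF, 0, 0, 0, 0)   # one BC1 block
    tex = IndirectBC1(16, 16, [flat], [[0] * 4 for _ in range(4)])
    assert tex.fetch_block(13, 7) is flat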

    You can just use the existing tiled resources feature, with a pre-compiled (feature-density-aware) heap, to get this type of compression. Aggressively use deduplication, and whenever possible, just omit the high LODs outside of regions of interest.

    ... of course this isn't what the patent describes.

    The patent goes one step further, and declares that there are still huge savings to be made by caching the decoding output of the second stage. E.g. you could actually go for much bigger macro blocks (e.g. 16x16), using a high-compression algorithm (akin to JPEG). But the hardware doesn't provide hard-wired decompression logic any more; instead it supports invoking a custom decompression shader on demand, based on a cache miss in a dedicated cache for decompressed macro blocks. This enables the use of compression schemes which are significantly more costly than the S3TC family.
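    Roughly, that flow could be modeled like this on the CPU (names are made up and zlib stands in for the JPEG-like codec; in the real thing the miss handler would be a shader invoked from the texture path):

    import zlib
    from functools import lru_cache

    MACRO = 16  # macro block covers 16x16 texels

    # Compressed macro blocks keyed by block coordinate.
    compressed_blocks = {
        (0, 0): zlib.compress(b"\x80" * (MACRO * MACRO * 4)),
    }

    @lru_cache(maxsize=256)              # the dedicated decompressed-block cache
    def decompress_block(bx, by):
        # Only runs on a cache miss - this plays the "decompression shader".
        return zlib.decompress(compressed_blocks[(bx, by)])

    def sample(x, y):
        block = decompress_block(x // MACRO, y // MACRO)   # hit, or miss + decode
        off = ((y % MACRO) * MACRO + (x % MACRO)) * 4
        return block[off:off + 4]                          # RGBA8 texel

    print(sample(3, 5), decompress_block.cache_info())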

    So I suppose this means there will be support for conditional decompression callbacks made from within texture lookups in the near future. From the programming interface it's going to look like a dedicated decompression shader bound to the pipeline. Note that this can then also be abused to simply run a generative shader instead of "decompression", essentially providing forward texture space shading. That assumes AMD had enough foresight to provide a data channel into the decompression shader, and to explicitly provide sufficient memory space to hold an entire frame's worth of decompressed blocks.

    In terms of existing API usage: rather than polling CheckAccessFullyMapped() afterwards, you get a shader invoked at the point where the access would have failed.

    What's not covered by the patent is whether this is potentially also applicable to memory accesses outside of texturing, e.g. on-demand block decompression not only for textures, but also for arbitrary buffer accesses.
     
    #2180 Ext3h, May 29, 2020
    Last edited: May 29, 2020
    w0lfram, Lightman, Ethatron and 10 others like this.