AMD: Navi Speculation, Rumours and Discussion [2019-2020]

The V100S PCIe shaves 50W of power despite running faster core and memory clocks, simply because it drops NVLink in favour of PCIe.
Are you sure about that part? IIRC the deal about the SXM socket series was that the GPUs would be allowed to draw as much power as they need to never be throttled by power limit. Also flat side-by-side mounting, rather than "stacked", so fewer thermal issues too, and hence no thermal throttling either.

Not to mention that the other models have NVLink too, just not as system interface.
 
A lower clock on a similarly sized chip with a higher TDP, despite the jump to a (supposedly) significantly improved process node, together with a modest increase in FP32 and FP64 throughput.
A100 is not your typical run-of-the-mill GPU, it's a Tensor Core GPU, NVIDIA even named it as such. It lacks RT cores, video encoders, and display connectors, so you don't judge it based on traditional FP32/FP64 throughput. You judge it based on its AI acceleration performance, for which it has plenty of performance to offer.

Are you sure about that part? IIRC the deal about the SXM socket series was that the GPUs would be allowed to draw as much power as they need to never be throttled by power limit
You might be onto something: the SXM2 power limit was 300W, the SXM3 power limit was raised to 350W and later even to 450W. A100 is SXM4, which would explain the 400W TDP limit.
 
Seems reasonable if you're trying to do a perf comparison at roughly iso-power.

5700XT = 225W
Vega56 = 210W
Vega64 = 295W

5700XT vs. Vega56 is a 7% difference. Vega64 vs. 5700XT is a 31% difference.
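For what it's worth, the ratios check out from the reference board powers listed above (trivial Python, numbers copied straight from that list):

```python
# Reference board power figures from the list above, in watts.
tdp = {"5700 XT": 225, "Vega 56": 210, "Vega 64": 295}

print(f"5700 XT vs Vega 56: {tdp['5700 XT'] / tdp['Vega 56'] - 1:+.0%}")  # ~ +7%
print(f"Vega 64 vs 5700 XT: {tdp['Vega 64'] / tdp['5700 XT'] - 1:+.0%}")  # ~ +31%
```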
Yeah, but since when have GPUs been compared at roughly ISO power anywhere? In every other usual metric it should be compared to Vega 64 rather than 56.
 
Which 5500XT? 4GB? 8GB? And at which resolution?
Sorry, forgot those. 4GB, since that's the one they have as the "reference model" in their charts, at 1080p to make it as fair for the 5500 XT as possible.
According to their tests, 8GB models can be up to ~10% faster, but I think they're all running at higher clocks too.
 
Yeah, my bad. By "80 CU Navi 2x will be 20-30% over the 2080 Ti. Just do Navi 10 x2 that's where you stand." you obviously meant something totally not connected to the number of CUs and frequency (which would be TFLOPs, for instance), which is why you recommended to "just do Navi 10 x2".

1 - I obviously meant something connected to the graphic I put right after that sentence, which showed relative gaming performance numbers, not TFLOPs. I wouldn't show a graphic about gaming performance, then talk about the graphic but somehow implicitly mean I was talking about TFLOPs instead. That makes no sense.

2 - CU count alone doesn't tell you TFLOP throughput. At what imaginary clocks are your CUs running (see the quick sketch below these points)? The exact same as a 5700 XT's? Knowing the PS5's GPU boosts up to 2.23GHz and, according to Sony officials, "stays there most of the time", I find it hard to believe Big Navi will run at a 1850MHz average.

3 - "Navi 10 x2" wouldn't result in 30% more TFLOPs than the 2080 Ti, so this wouldn't hold either way.

4 - Comparing AMD vs. nvidia in TFLOPs and gaming-performance-per-TFLOP stopped being relevant the moment Turing brought dedicated INT32 ALUs that work in parallel with the FP32 ALUs. Not that it's ever been an especially good indicator of relative gaming performance between different architectures from different IHVs.
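Quick sketch for point 2, i.e. how much the clock assumption alone swings the TFLOP number for a hypothetical 80 CU part (toy Python; 64 lanes x 2 ops per clock is the usual peak-FP32 formula, and all clocks here are made-up placeholders):

```python
# Peak FP32 in TFLOPS = CUs * 64 lanes * 2 ops per clock (FMA) * clock in GHz / 1000.
def fp32_tflops(cus, clock_ghz):
    return cus * 64 * 2 * clock_ghz / 1000

# Same hypothetical 80 CU chip, three different clock assumptions:
for clock in (1.755, 1.85, 2.23):
    print(f"80 CU @ {clock} GHz: {fp32_tflops(80, clock):.1f} TFLOPS")
# -> roughly 18.0, 18.9 and 22.8 TFLOPS: a >25% spread from the clock guess alone.
```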


You seem to mistake me for someone who is contending the possibility of this becoming reality. Instead I wrote "After 2+ years and on a full node advantage, I'm really looking forward to your math coming true."
Which sounded really sarcastic TBH, and that was further emphasized by the doubting and questioning of the data points I provided.
If this was not the case, then I apologize.


1410 MHz, FWIW. And what GA100 seemingly has done is invest a large portion of that "7nm goodness" into more transistors. They won't switch free of charge, and they won't come free in terms of clock speeds either, considering how tightly they are packed.
It could be that the main culprit for the relatively low clocks on GA100 is nvidia using a larger proportion of area-optimized transistors because they're close to the reticle limit.



Personally, I've yet to see that TDP linked to any specific task, but it's immediately obvious to me that it probably refers to the TDP required for the 320 TFlops in tensor cores, which is a 2.5x increase over Volta.

Then why wouldn't the chip increase the core clocks when not using the tensor cores, effectively making the GPU more competitive for non-ML HPC tasks? It's not like clock boosting is a new thing to nvidia.
Occam's Razor says these 1410MHz are simply how high the chip can clock considering its 400W TDP, regardless of it using the tensor cores or the regular FP32/FP64 ALUs.


I don't think there's anything suspect or out of place at all with GA100's "normal FP32 cores", so why would I discuss "something" being "different" in consumer Ampere?
I still find it odd that a new process node is being adopted yet the core clocks are lower, considering how nvidia has historically increased their chips' clocks with each new node that allows it. That, and the FP32 and FP64 throughput increase not even keeping up with the TDP increase, which is also odd.
Regardless, this isn't the Ampere thread.
 
Yeah, but since when have GPUs been compared at roughly ISO power anywhere? In every other usual metric it should be compared to Vega 64 rather than 56.
TDP is roughly correlated with a market segment. In theory they could release an Oland GPU at 300W, but its target audience isn't interested in such power-hungry cards. Navi was released as an upper mid-range GPU, Vega was in high-end territory.
 
Yeah, but since when have GPUs been compared at roughly ISO power anywhere? In every other usual metric it should be compared to Vega 64 rather than 56.
Actually it does that too. From the conclusions page, when talking about raw performance:

"This makes the card a whopping 20% faster than Vega 64 and puts it just 8% behind Radeon VII, which is really impressive."

Later on that same page, the article makes an observation about the improvement in power efficiency:

"One important cornerstone is the long overdue reduction in power draw to make up lost ground against NVIDIA. In our testing, the RX 5700 XT is much more power efficient than anything we've ever seen from AMD, improving efficiency by 30%–40% over the Vega generation of GPUs. With 220 W in gaming, the card uses slightly less power than RX Vega 56 while delivering 30% higher performance."

Though it seems the wording is slightly incorrect - Vega 56 draws slightly less power than the 5700 XT, not the other way around (at least based on reference specs)?

Anyway we're debating over the quality of TPU's article, which is a little silly and not the point of this thread. Apologies for stretching this tangent.
 
Which sounded really sarcastic TBH, and that was further emphasized by the doubting and questioning of the data points I provided.
If this was not the case, then I apologize.
Accepted. But honestly, it's nothing less than can be expected after the two data points I provided (timespan, mfg. tech). If it's anything less than that on a „Big Navi“ (i.e. a meaningfully larger chip than Navi 10, like 450-550 mm², not a 320 mm² half-bred), it would be a disappointment.
 
TDP is roughly correlated with a market segment.
It was... up until 2016, when a GTX 1080 consumed little more power than an RX480, effectively putting two cards from very different market segments at a similar TDP, and later cards within the same market segment at very different TDPs (e.g. Vega 64 vs. GTX 1080).
And that was the case for 3 years, until Navi 10 released last year.



Actually it does that too. From the conclusions page, when talking about raw performance:

"This makes the card a whopping 20% faster than Vega 64 and puts it just 8% behind Radeon VII, which is really impressive."

Later on that same page, the article makes an observation about the improvement in power efficiency:

"One important cornerstone is the long overdue reduction in power draw to make up lost ground against NVIDIA. In our testing, the RX 5700 XT is much more power efficient than anything we've ever seen from AMD, improving efficiency by 30%–40% over the Vega generation of GPUs. With 220 W in gaming, the card uses slightly less power than RX Vega 56 while delivering 30% higher performance."

Though it seems the wording is slightly incorrect - Vega 56 draws slightly less power than the 5700 XT, not the other way around (at least based on reference specs)?

I think @Kaotik 's point is that Vega 64 was the Vega 10 high-end card that sacrifices power efficiency to maximize performance, whereas Vega 56 is the cut-down version that clocks closer to the ideal power efficiency curve.
The 5700 XT is the Navi 10 high-end card that sacrifices power efficiency to maximize performance, whereas the 5700 is the cut-down version that clocks closer to the ideal power efficiency curve.
This is further supported by their respective price ranges at the time Navi 10 released.



...nothing less than can be expected after the two data points I provided (timespan, mfg. tech). If it's anything less than that on a „Big Navi“ (i.e. a meaningfully larger chip than Navi 10, like 450-550 mm², not a 320 mm² half-bred), it would be a disappointment.
I think it depends on how high they can clock the new chips (which should clock pretty high, given the PS5's example) without hitting a power wall.
I'm guessing a 30 WGP / 60 CU part with a 320-bit bus using 16Gbps GDDR6 (640GB/s) with clocks averaging 2.3GHz would already put pressure on the 2080 Ti's performance bracket, and it wouldn't require a 450mm² chip. It would actually be smaller than 350mm², considering how big the Series X's APU is.
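Rough numbers behind that guess (toy Python; every figure in it is speculation from the paragraph above):

```python
# Hypothetical 60 CU / 320-bit GDDR6 part from the post above; all values are guesses.
cus, avg_clock_ghz = 60, 2.3
bus_width_bits, gddr6_gbps = 320, 16

fp32_tflops = cus * 64 * 2 * avg_clock_ghz / 1000   # 64 FP32 lanes per CU, 2 ops/clock
bandwidth_gbs = bus_width_bits * gddr6_gbps / 8     # bus bits * Gbps per pin -> GB/s

print(f"~{fp32_tflops:.1f} TFLOPS FP32, {bandwidth_gbs:.0f} GB/s")  # ~17.7 TFLOPS, 640 GB/s
```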
 
Some new patent applications from AMD for motion based VRS and Texture decompression

This is a logical extension of current Adrenalin Radeon Boost feature.
20200169734 VARIABLE RATE RENDERING BASED ON MOTION ESTIMATION

Abstract

A rendering processor assigns varying logical pixel dimensions to regions of an image frame and rendering pixels of the image frame based on the logical pixel dimensions. The rendering processor renders in highest resolution (i.e., with smaller logical pixel dimensions) those areas of the image that are more important (on which the viewer is expected to focus (the "foveal region"), or regions with little-to-no motion), and renders in lower resolution (i.e., with larger logical pixel dimensions) those areas of the image outside the region of interest, or regions that are speedily moving, so that loss of detail in those regions will be less noticeable to the viewer. For regions with less detail or greater magnitude of motion, larger logical pixel dimensions reduce the computational workload without affecting the quality of the displayed graphics as perceived by a user.
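Just to make the abstract a bit more concrete, here's a toy sketch of what motion-adaptive shading rates could look like; the tile granularity, thresholds and rates below are entirely made up by me, not taken from the patent:

```python
# Toy illustration of motion-based variable rate shading: screen tiles with more
# estimated motion (or outside the region of interest) get larger "logical pixels".
# All thresholds and rates here are invented for the example.

def logical_pixel_size(motion_px_per_frame: float, in_region_of_interest: bool):
    """Return the logical pixel dimensions (w, h) to use for a screen tile."""
    if in_region_of_interest or motion_px_per_frame < 2.0:
        return (1, 1)   # full resolution where the viewer is expected to focus
    if motion_px_per_frame < 8.0:
        return (2, 2)   # moderate motion: shade one sample per 2x2 pixel block
    return (4, 4)       # fast motion: loss of detail is far less noticeable

for motion, roi in [(0.5, False), (5.0, False), (12.0, False), (12.0, True)]:
    print(f"motion={motion} px, ROI={roi}: {logical_pixel_size(motion, roi)}")
```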

20200118299 REAL TIME ON-CHIP TEXTURE DECOMPRESSION USING SHADER PROCESSORS

Abstract

A processing unit, method, and medium for decompressing or generating textures within a graphics processing unit (GPU). The textures are compressed with a variable-rate compression scheme such as JPEG. The compressed textures are retrieved from system memory and transferred to local cache memory on the GPU without first being decompressed. A table is utilized by the cache to locate individual blocks within the compressed texture. A decompressing shader processor receives compressed blocks and then performs on-the-fly decompression of the blocks. The decompressed blocks are then processed as usual by a texture consuming shader processor of the GPU.
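To illustrate the "table utilized by the cache to locate individual blocks" part: with a variable-rate codec like JPEG, blocks no longer sit at fixed offsets, so an offset table is needed to fetch block i without touching anything else. A trivial sketch (the block sizes are made-up numbers):

```python
# With a variable-rate codec, compressed block sizes differ, so an offset table is
# needed to locate block i inside the compressed texture without decompressing it all.
compressed_block_sizes = [612, 418, 1290, 95, 740]   # bytes per block, made-up values

offsets = [0]
for size in compressed_block_sizes:
    offsets.append(offsets[-1] + size)               # prefix sums = block start offsets

def locate_block(i):
    """Return (offset, length) of compressed block i for the cache to fetch."""
    return offsets[i], compressed_block_sizes[i]

print(locate_block(3))   # -> (2320, 95): fetch 95 bytes at offset 2320, then decompress
```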
 
Some new patent applications from AMD for motion based VRS and Texture decompression

This is a logical extension of current Adrenalin Radeon Boost feature.
20200169734 VARIABLE RATE RENDERING BASED ON MOTION ESTIMATION
Lol, this is actually Radeon Boost :)
 
Interesting. I wonder if that could be used for what I speculate about here, but using shaders rather than a hardware block. It would certainly have the advantage of not being fixed to a specific standard. But I wonder what the GPU hit would be...
In the past, some games used NVIDIA's CUDA cores to do texture decompression, like Rage 2 and Wolfenstein Old Blood, with no measurable hit to performance, though the advantages were limited too.
 
In the past, some games used NVIDIA's CUDA cores to do texture decompression, like Rage 2 and Wolfenstein Old Blood, with no measurable hit to performance, though the advantages were limited too.

Interesting to know there is precedent here. I think the dynamics of IO speed and CPU power will be significantly different this generation though (at least for a couple of years) which may make GPU decompression a much bigger win.

I don't get what the difference is with S3TC, which doesn't need shaders to manage compressed textures...?

It's a second level of compression over the GPU-native formats like S3TC. Those formats can be handled natively by the GPU without first being decompressed. But they don't have great compression ratios. If you compress them again with something like Kraken or BCPACK, you can make them even smaller, but then you need to decompress that outer layer before they are processed by the GPU.
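You can see the "second level" idea with any general lossless packer on top of a fixed-rate GPU format. In this sketch zlib merely stands in for Kraken/BCPACK, and the BC1-ish block data is a made-up repetitive pattern, so the exact ratio means nothing; it only shows that the fixed-rate data still shrinks while staying GPU-consumable once unpacked:

```python
import zlib

# Fake BC1-style data: 8 bytes per 4x4 block, deliberately repetitive (flat regions).
bc1_blocks = b"\x1f\xf8\x00\x00\xaa\xaa\xaa\xaa" * 2048   # 16 KiB of "compressed" texture

packed = zlib.compress(bc1_blocks, 9)   # second, lossless layer (Kraken/BCPACK stand-in)

print(len(bc1_blocks), "->", len(packed), "bytes on disk")
# The GPU still consumes the fixed-rate BC1 data; only the outer layer is undone on load.
```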
 
S3TC can't get as high compression rates as these other compression schemes, which can be used alongside S3TC (and other) texture formats to decrease asset size and increase load speeds. Realtime texture compression formats are specifically designed to be fast and useable in the texture units, so they have to sacrifice some opportunities for greater packing.
 
Realtime texture compression formats are specifically designed to be fast and useable in the texture units, so they have to sacrifice some opportunities for greater packing.
And one of the obvious, necessary requirements is usually a constant compression factor, so that addressing in a compressed image works the same as in an uncompressed image. Which results in only being able to use lossy compression codecs, which on top of that may not compress otherwise well-compressible features properly either.

E.g. take a 3 channel image of something which resembles a typical company logo, just two colors and few features, in a 1k x 1k resolution. 4MB raw (stored at 32bpp), still 1MB with the S3TC family. While a decent compression algorithm can usually bring this down to a few kB.

Of course it doesn't get as good as e.g. PNG format. That format simply isn't made for random access at all.

But imagine e.g. a layer on top of good old BC1, except you declare that one BC1 block may be up-scaled to be representative of an entire macro block of 16x16 or 128x128 pixels, rather than just the usual 4x4 block. Still good enough for flat color, and maybe even most gradients. And the same lower-level block may also be re-used for several upper-level macro blocks. Now, the lookup table is still somewhat easy to construct yourself: just reduce the original picture resolution by 4x4, and store a 32-bit index to the representative block. If you feel like it, spend 1 or 2 bits on flagging scaled use of macro blocks.

So far, this is something you can implement yourself, in pure software: unconditionally trade a texture lookup in a 16x size-reduced LUT for the corresponding savings in memory footprint (32x reduced if you can live with a 16-bit index), and then still benefit from HW-accelerated S3TC decompression in the second stage, while likely getting a cache hit on both the first and second stage. Up to this point, you already have a decent variable-compression-rate texture compression scheme. Applied virtual texturing, without the deferred feedback path.
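A minimal software sketch of that two-stage lookup, with the index map at 1/4 x 1/4 resolution pointing into a pool of shared blocks (plain Python for illustration; the BC1 decode itself is faked here, on a GPU that second stage would just be the normal hardware BC1 path):

```python
# Stage 1: index map, one entry per 4x4 macro tile (16x fewer entries than pixels).
# Stage 2: fetch the referenced block from a shared pool; flat regions reuse one block.
block_pool = {0: "flat white BC1 block", 1: "flat blue BC1 block", 2: "logo edge BC1 block"}

index_map = [            # 4x4 entries -> covers a 16x16 pixel texture in this toy
    [0, 0, 0, 0],
    [0, 2, 2, 0],
    [0, 2, 1, 0],
    [0, 0, 0, 0],
]

def sample(x, y):
    tile = index_map[y // 4][x // 4]   # stage 1: lookup in the 16x reduced LUT
    return block_pool[tile]            # stage 2: decode the shared block (HW BC1 on a GPU)

print(sample(0, 0))   # flat white BC1 block
print(sample(6, 5))   # logo edge BC1 block
```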

You can just use existing tiled resources feature, with a pre-compiled (feature-density-aware) heap, to get this type of compression. Aggressively use deduplication, and whenever possible, just omit the high LOD outside of regions of interest.

... of course this isn't what the patent describes.

The patent goes one step further, and declares that there are still huge savings to be made from caching the decoding output of the second stage. E.g. you could actually go for much bigger macro blocks (e.g. 16x16), and use a highly compressing algorithm (akin to JPEG). But the hardware doesn't provide hard-wired decompression logic any more; instead it supports invoking a custom decompression shader on demand, based on a cache miss in a dedicated cache for decompressed macro blocks. Thereby enabling the use of compression schemes which are significantly more costly to decode than the S3TC family.

So I suppose this means there will be support for conditional decompression callbacks made from within texture lookup calls in the near future. From the programming interface side it's going to look like a dedicated decompression shader bound to the pipeline. Note that this can then also be abused to simply run a generative shader instead of a "decompression" one, essentially providing forward texture space shading. Assuming that AMD had enough foresight to provide a data channel into the decompression shader, and to explicitly provide sufficient memory space to hold an entire frame's worth of decompressed blocks.
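A rough CPU-side model of that decode-on-miss control flow, just to make it concrete; the "decompression shader" is an ordinary Python function here and the JPEG-like codec is stubbed out, so this is only about when the callback fires, not about how the hardware would do it:

```python
# Sketch of decode-on-miss: a cache of decompressed macro blocks, plus a user-supplied
# "decompression shader" that is only invoked when a texture fetch misses that cache.
decompressed_cache = {}                     # block_id -> decompressed macro block

def decompression_shader(block_id, compressed_store):
    # Stand-in for a JPEG-like (or even purely generative) decode of one macro block.
    return f"decoded<{compressed_store[block_id]}>"

def texture_fetch(block_id, compressed_store):
    if block_id not in decompressed_cache:                       # cache miss...
        decompressed_cache[block_id] = decompression_shader(     # ...fires the callback
            block_id, compressed_store)
    return decompressed_cache[block_id]

compressed = {7: "jpeg-ish bits for block 7"}
print(texture_fetch(7, compressed))   # miss: decoded on demand
print(texture_fetch(7, compressed))   # hit: served straight from the block cache
```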

In terms of existing API usage, rather than polling CheckAccessFullyMapped() afterwards, you get a shader invoked at the time where the access would have failed.

What's not covered by the patent is whether this is potentially also applicable to memory accesses outside of texturing, e.g. on-demand block decompression not only for textures, but also for arbitrary buffer accesses.
 