AMD: Navi Speculation, Rumours and Discussion [2019-2020]

DuckThor Evil · Apr 28, 2019

Entropy said:
Are you saying that Vega 20 clocks 30% higher than Vega 10 at ISO power for the products indicated? That’s just not the case and you know it.

The slide paints an area of "high efficiency operating range", an area where performance increases relatively linearly with power consumption. At the top end of that area it states 30% higher frequency for the 7nm chip. Both of these actual products have been pushed past that efficiency range though and the slide doesn't state what the clocks would be at those points.

I don't think that slide is too far off with the numbers. TechPowerUp has Radeon VII offering 40% higher performance per watt at 4K (the most GPU-limited resolution) than Vega 64, so the combination of higher frequency and slightly lower power consumption does add up to something.
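
As a quick sanity check on how those two factors combine, here is a minimal Python sketch. The input ratios are made-up placeholders, not figures from the slide or from TechPowerUp; only the arithmetic is the point.

```python
def perf_per_watt_gain(perf_ratio, power_ratio):
    """Relative perf/W change, given new/old performance and new/old power."""
    return perf_ratio / power_ratio - 1.0

# Hypothetical inputs: 30% more performance at 7% less board power.
gain = perf_per_watt_gain(perf_ratio=1.30, power_ratio=0.93)
print(f"perf/W gain: {gain:.1%}")
```

Plugging in the measured performance and board power ratios from any one review would give that review's perf/W delta.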

Deleted member 13524 · Apr 28, 2019

Entropy said:
Are you saying that Vega 20 clocks 30% higher than Vega 10 at ISO power for the products indicated? That’s just not the case and you know it.
That doesn’t necessarily mean all that much for Navi though, that has the benefit of over two years of development since Vega, and almost a year at the node. That should count for something.

IIRC that slide appeared in mid 2018. Then in late 2018 we got reports saying that TSMC's 7nm process was not going to be fully utilized because performance was below expectations.

Regardless, I'd like to see how much power a Radeon VII consumes at 1.4/1.5GHz. Maybe it does go down to 150W or lower.

Bondrewd · Apr 28, 2019

ToTTenTranz said:
TSMC's 7nm process was not going to be fully utilized because performance was below expectations.

what.
It's underutilized because of slow phone cycles, particularly Apple's.

Deleted member 91098 · Apr 28, 2019

ToTTenTranz said:
[...]

Regardless, I'd like to see how much power a Radeon VII consumes at 1.4/1.5GHz. Maybe it does go down to 150W or lower.

Yes, that would certainly be interesting, considering how much undervolting already lowers the power draw at 1800 MHz. I would really like to see a graph of power consumption against clock speed, like the ones there were for some of AMD's previous GPUs and for their Zen CPUs.

Image is taken from tomshw.de: https://translate.google.com/translate?sl=auto&tl=en&u=https://www.tomshw.de/2019/02/27/rtg-radeon-tweaker-group-amd-radeon-vii-mods-tweaks-fuer-neueinsteiger-und-profis-igorslab/
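
Until someone measures it, a first-order estimate is possible with the usual dynamic power relation, P proportional to V^2 * f. The Python sketch below is only that: it ignores leakage, HBM2 and board power, and every voltage and wattage in it is an assumed placeholder rather than a measured Radeon VII operating point.

```python
def scaled_power(p0_watts, f0_mhz, v0, f_mhz, v):
    """First-order dynamic power estimate: P ~ C * V^2 * f (leakage ignored)."""
    return p0_watts * (f_mhz / f0_mhz) * (v / v0) ** 2

# Assumed baseline and target operating points, for illustration only.
baseline = dict(p0_watts=250.0, f0_mhz=1800.0, v0=1.05)
for f_mhz, v in [(1800.0, 0.95), (1500.0, 0.85), (1400.0, 0.80)]:
    watts = scaled_power(**baseline, f_mhz=f_mhz, v=v)
    print(f"{f_mhz:.0f} MHz @ {v:.2f} V -> ~{watts:.0f} W")
```

The real curve depends on the actual voltage/frequency table of the chip, which is exactly what a measured graph would show.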

Entropy · Apr 28, 2019

Dr Evil said:
The slide paints an area of "high efficiency operating range", an area where performance increases relatively linearly with power consumption. At the top end of that area it states 30% higher frequency for the 7nm chip. Both of these actual products have been pushed past that efficiency range though and the slide doesn't state what the clocks would be at those points.

I don't think that slide is too far off with the numbers. TechPowerUp has Radeon VII offering 40% higher performance per watt at 4K (the most GPU-limited resolution) than Vega 64, so the combination of higher frequency and slightly lower power consumption does add up to something.

That's a function of memory bandwidth, not the GPU itself. To evaluate the GPU, you have to look at lower resolutions where it shows less of an improvement, 15-30% depending on which review site you're looking at.
Be that as it may, again, I'm not too sure how Vega 20 results relate to Navi. Even without any architectural tweaks and improvements what would the positive effects on density and power draw be of ditching half-rate FP64, and anything else that is associated with Vega that either targets HPC or never really worked out in practice? Are there savings in not supporting a 4096-bit wide memory interface? But of course, there should be some improvements with two years of additional design work? And to what extent can they leverage a more solid lithographic foundation?

To my eyes, Navi looks a lot like Polaris. Same memory bus only supporting GDDR6, similar (by the rumours) die area. Over time, production costs should be in the same rough ballpark. In a way, I hope they are a bit boring and tune the process a bit more for density and less for maximum frequency, and use that to provide roughly Vega 64 computational resources in a much smaller and economic package, and with some architectural tweaks. It would be an excellent volume product, and could scale downwards in price over time similar to Polaris. Even has an easy tweak node now in TSMC 6nm.

Bondrewd · Apr 28, 2019

Entropy said:
Even has an easy tweak node now in TSMC 6nm.

6nm is barely entering risk production; by the time it goes volume, Navi will be no more.

3dilettante · Apr 28, 2019

One bit I missed from the gfx1010 link I made earlier was the line about a destination register cache (HWRC) in the scheduling section.
This may be along the lines of the earlier patents that introduced a small amount of storage at the output of the ALUs.

Revisiting that section, and the gfx10 speed model introduced, I'm curious if the lack of a 1/4 conversion from raw clock to the SIMD cadence has some additional purpose.
The simplest and lowest-import interpretation is that this is a typo or incompatible documentation style injected into an initial version of the targeting code.

If not an error or style conflict, an incremental change could be related to the vector path having some potentially exposed latency for dependent operations.
The model seems consistent with a pipeline that has a rough 4 cycles of cadence for many things like scalar operations, export, LDS, and memory. Those would presumably not have the more complicated register cache and file arrangement, but may have their raw cycle counts given to be consistent with the units that were affected.
The vector numbers themselves appear to be mostly 4x that of the prior GCN numbers, outside of a +1 for operations writing to vector ALU registers. The +1 may be related to the register cache and/or the more complex operand delivery network, but one interpretation of the higher cycle count is that something has changed with regards to the cadence presented to software.

An interpretation that would represent a somewhat larger change is that the architecture is exposing some of the lower-level details of the register file's implementation.
The various operations seem to have similar latency to the 4-cycle cadence, but maybe somehow it's less hidden? Some of the register file patents used different language to describe the wavefront execution model and register addressing versus the SIMD or super-SIMD patents. The register patents more clearly treated each row of a wavefront as its own operation, and unlike the ALU-focused patent they took the 4-bank register file and labelled each bank's row as its own register--rather than saying all registers at the same row in every bank count as one register from the point of view of a 4-cycle wavefront instruction.
That could be related to the feature flag indicating gfx10 has a banked register file. Having banks of registers isn't new, but it could be more explicit in the execution model presented to software if certain situations can expose it.
I am not sure what situations would be unique to gfx10, since there are lane-crossing operations like the DPP instructions that already partially expose the lower-level detail. The patent language seemed to imply there was still internal logic that was mapping operand IDs to register accesses without the program being aware of it.
Some possibilities are that there are more complex inter-lane operations, the register cache has some limitations in how it can be used, or a more significant change like some of the register mapping not being done transparently anymore.

A more impactful interpretation is that code may be able to see latencies formerly hidden by the cadence. For example, a wavefront may see scalar ops as having 1 cycle of latency despite their taking 4 actual cycles, because the wavefront will not be able to issue an instruction for another 4 cycles. Having some level of dual-issue within the SIMD path, or an issue interval shorter than 4 cycles achieved in some other manner, could do this.
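
To make the two readings concrete, here is a small Python sketch. It assumes the classic GCN arrangement of a 64-wide wavefront issued over 4 clocks on a 16-lane SIMD, and takes the 4x and +1 relationships from my reading of the speed model above; the function names are made up for illustration and are not anything in LLVM.

```python
CADENCE = 4  # clocks per wavefront instruction (64 lanes over a 16-lane SIMD)

def raw_clocks(cadence_latency, writes_vgpr=False):
    """Latency in raw clocks from a latency counted in 4-clock issue slots."""
    return cadence_latency * CADENCE + (1 if writes_vgpr else 0)

def exposed_clocks(op_latency, issue_interval=CADENCE):
    """Clocks of latency a dependent instruction in the same wavefront still sees."""
    return max(0, op_latency - issue_interval)

print(raw_clocks(1))                        # plain 4x conversion
print(raw_clocks(1, writes_vgpr=True))      # 4x plus the extra clock
print(exposed_clocks(4))                    # fully hidden behind the cadence
print(exposed_clocks(5))                    # the extra clock would peek out
print(exposed_clocks(4, issue_interval=1))  # faster issue exposes more
```

The last call is the "more impactful" case: nothing about the operation changed, only how soon the same wavefront can issue again.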

anexanhume · Apr 29, 2019

3dilettante said:
One bit I missed from the gfx1010 link I made earlier was the line about a destination register cache (HWRC) in the scheduling section.
This may be along the lines of the earlier patents that introduced a small amount of storage at the output of the ALUs.

[...]

Are you referencing this patent? Specifically, this part:

Also, the parallel processing unit is configured to leverage the RAM's output flops as a last level cache to reduce duplicate operand requests between multiple instructions. The parallel processing unit includes a vector destination cache to provide additional R/W bandwidth for the vector register file.

3dilettante · Apr 29, 2019

The vector destination cache mentioned in the second sentence is potentially related to the destination register cache in the LLVM changes. A similar destination cache at the output of the ALUs is mentioned in the so-called super-SIMD concept (http://www.freepatentsonline.com/y2018/0121386.html).

What's curious to me is that both have diagrams of a representative SIMD based on the "old" architecture, which has multiple register banks. So why GFX10 would be labelled as having a banked register file even if it implements something like those patents is unclear, unless something else can somehow alter the throughput of those banks in a way that a shader can detect. (Also unclear, how different the "old" operand network is from the new one.)
The super-SIMD patent labels each bank as belonging to one of the rows of a wavefront--which correspond to different cycles in the cadence. That patent lists the register file as being multiple banks, with registers 0 through N in each bank.
The register file patent, on the other hand, numbers the registers more as a global count (V0,V4,V8 in bank 0, V1,V5,V9 in bank 1, and so on).
That can come about by a designer specializing in specific subsections of the architecture seeing things in terms of their chosen specialty, so the patents could be using different language for the same thing. It's also possible that the different emphases can lose correctness for the parts outside the scope of the individual design element, or they may not be describing the same exact embodiments. The way the register cache is banked, and how it is connected to the operand network, is not rendered to the same depth in both.
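
The two numbering styles can be related with a trivial mapping. This Python sketch assumes four banks and the interleaved numbering given in the register file patent; whether the super-SIMD patent's per-bank numbering lines up with it in exactly this way is my assumption.

```python
NUM_BANKS = 4  # assumed bank count

def global_to_bank_row(vreg):
    """Global style (V0, V4, V8 in bank 0; V1, V5, V9 in bank 1) to (bank, row)."""
    return vreg % NUM_BANKS, vreg // NUM_BANKS

def bank_row_to_global(bank, row):
    """Per-bank style (register `row` of bank `bank`) back to a global index."""
    return row * NUM_BANKS + bank

for v in range(10):
    bank, row = global_to_bank_row(v)
    print(f"V{v}: bank {bank}, row {row}")
```

Under this mapping the two descriptions are the same storage seen with different labels, which is one way both patents could be correct at once.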

Even so, I'm not sure what in GFX10 would make this worthy of an external target flag unless there's some specific combination of claims or omissions from one or both that makes GFX10 act differently in practice.

late edit:
Also, the register file one does indicate the destination cache is banked, but going by my understanding it is banked in a way that should line up with the established cadence.
There is a new source of stalls possible with the cache, if the ALU cannot allocate an entry in the cycle its output is to be written out. That is new, and might be part of the +1 latency mentioned in the GFX10 changes, but what it takes to get that kind of stall and whether this is affected by banking is not clear.

anexanhume · Apr 29, 2019

Was this incorporated and/or enabled in Vega?

http://www.freepatentsonline.com/y2019/0122417.html

Original filing is 2013. Koduri talking about it here:

3dilettante · Apr 29, 2019

anexanhume said:
Was this incorporated and/or enabled in Vega?

http://www.freepatentsonline.com/y2019/0122417.html

Original filing is 2013. Koduri talking about it here:

The public description of Vega's DSBR appears to match many of the claims in the patent. Whether any of the patent's specific embodiments or a combination of claims are identical to what Vega has would require deeper knowledge of the chip than has been disclosed.

The DSBR has been announced as being enabled in most instances, and there are various open-source driver changes related to enabling it or managing it. There are binning, deferring, and culling components to the functionality that may be selectively enabled or disabled based on the primitives submitted, but we don't have statistics on how often the DSBR is partially unused or reverts to standard rendering.
There are some specific professional applications that do benefit a fair amount from the DSBR, and some cited games with a certain amount of saved bandwidth and performance. Some Raven Ridge marketing gave some modest improvements to having it on, and a few driver submissions indicated minor improvement or at least no significant regression.

The pipeline's scoreboarding and context management have limited capacity, and it's possible the DSBR might have been more effective if more culling was done before consuming limited bin context storage. The full impact of that might depend on internal details on how flexible the DSBR is about pruning its bin context of culled data, or if the larger volume of unculled primitives may hit some other batch-closing thresholds.
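
To illustrate the capacity point, here is a toy binning loop in Python. It is not how Vega's DSBR works internally; the tile size, the per-batch primitive limit, and the idea of closing a batch on a simple count are all assumptions chosen to show why culling before binning would stretch a fixed-size bin context further.

```python
def bin_batches(prims, tile=32, max_prims_per_batch=2, cull=None):
    """Group bounding boxes (x0, y0, x1, y1) into batches of tile -> primitive ids."""
    batches, bins, count = [], {}, 0
    for pid, (x0, y0, x1, y1) in enumerate(prims):
        if cull is not None and cull((x0, y0, x1, y1)):
            continue  # culled before it consumes any batch capacity
        for ty in range(y0 // tile, y1 // tile + 1):
            for tx in range(x0 // tile, x1 // tile + 1):
                bins.setdefault((tx, ty), []).append(pid)
        count += 1
        if count == max_prims_per_batch:  # assumed batch-closing threshold
            batches.append(bins)
            bins, count = {}, 0
    if count:
        batches.append(bins)
    return batches

def zero_area(box):
    """Hypothetical cull test: the bounding box has no width or no height."""
    return box[0] == box[2] or box[1] == box[3]

prims = [(0, 0, 10, 10), (5, 5, 5, 5), (7, 7, 7, 7), (40, 0, 50, 10)]
print(len(bin_batches(prims)), "batches without early culling")
print(len(bin_batches(prims, cull=zero_area)), "batches with early culling")
```

In the toy, the degenerate primitives force an extra batch unless they are dropped first; whether the real hardware behaves like this depends on the internal details mentioned above.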

anexanhume · Apr 29, 2019

3dilettante said:
The public description of Vega's DSBR appears to match many of the claims in the patent. Whether any of the patent's specific embodiments or a combination of claims are identical to what Vega has would require deeper knowledge of the chip than has been disclosed.

[...]

Thanks for the detailed explanation. It would seem that whatever was done is not enough, given how Radeon VII opens its performance gap over Vega 64 at 4K resolution, presumably benefitting from the higher memory bandwidth as opposed to 1440p resolutions.

iamw · Apr 30, 2019

3dilettante said:
The vector destination cache mentioned in the second sentence is potentially related to the destination register cache in the LLVM changes. A similar destination cache at the output of the ALUs is mentioned in the so-called super-SIMD concept (http://www.freepatentsonline.com/y2018/0121386.html).

What's curious to me is that both have diagrams of a representative SIMD based on the "old" architecture, which has multiple register banks. So why GFX10 would be labelled as having a banked register file even if it implements something like those patents is unclear, unless something else can somehow alter the throughput of those banks in a way that a shader can detect. (Also unclear, how different the "old" operand network is from the new one.)
[...]

The old GCN architecture is a single-issue architecture without bank conflicts. But if VLIW2 or another multiple-issue method is introduced, bank conflicts could become a problem. I think that's why LLVM mentions a banked register file.
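
A toy model of what I mean, in Python. It assumes four banks with registers interleaved across them (V0, V4, V8 in bank 0 and so on, as in the register file patent numbering mentioned earlier in the thread), one read per bank per cycle, and that each instruction fetches its i-th source operand in cycle i. None of this is confirmed for GFX10.

```python
NUM_BANKS = 4  # assumed bank count

def conflict_cycles(instr_a, instr_b):
    """Cycles where two co-issued instructions read different registers in one bank."""
    return [
        cycle
        for cycle, (a, b) in enumerate(zip(instr_a, instr_b))
        if a != b and a % NUM_BANKS == b % NUM_BANKS
    ]

fma = (0, 1, 2)  # source VGPR indices of one instruction
print(conflict_cycles(fma, (1, 2, 3)))  # partner reads from other banks
print(conflict_cycles(fma, (4, 5, 6)))  # partner reads from the same banks
```

With only one instruction in flight there is no second reader, so the question never comes up; with two, the compiler would want to know the bank layout to pick registers that avoid it.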

Rootax · Apr 30, 2019

anexanhume said:
Thanks for the detailed explanation. It would seem that whatever was done is not enough, given how Radeon VII opens its performance gap over Vega 64 at 4K resolution, presumably benefitting from the higher memory bandwidth as opposed to 1440p resolutions.

Some DSBR "testers" didn't show it was enabled for these tests... and seeing how Fiji was doing against Vega clock for clock, I believe it was never used in games despite what AMD is claiming. Or it was screwed up / bugged in some way that made it useless; I wouldn't be surprised. After all, primitive shaders didn't work either in the end.
And I doubt Raja left/was fired because Navi was heading in the right direction...

no-X · Apr 30, 2019

How do you explain that Vega is 25% faster (clock for clock) than Fiji in Battlefield 1, Titanfall 2, etc.?

Rootax · Apr 30, 2019

no-X said:
How do you explain that Vega is 25% faster (clock for clock) than Fiji in Battlefield 1, Titanfall 2, etc.?

That's not what the article I had in mind shows: https://www.hardocp.com/article/2017/09/12/radeon_rx_vega_64_vs_r9_fury_x_clock_for/10

Also, if 4 GB is limiting, then of course Vega with 8 GB of VRAM will show less slowdown than Fury with 4 GB.

no-X · Apr 30, 2019

https://www.computerbase.de/2017-08...battlefield-1-rx-vega-56-vs-vega-64-vs-fury-x

Performance of Fury X in this benchmark isn't limited by 4GB VRAM, it's almost the same as with GeForce GTX 980 Ti (6GB) and higher than many 8GB cards.

Deleted member 13524 · Apr 30, 2019

It seems that testing the influence of VRAM capacity has become a whole lot harder than before. Gamers Nexus just tested the GTX 960 2GB vs. 4GB to see how those cards aged depending on VRAM, and they noticed that at least one game is dynamically changing texture quality to fit it into the framebuffer.

yuri · Apr 30, 2019

no-X said:
How do you explain that Vega is 25% faster (clock for clock) than Fiji in Battlefield 1, Titanfall 2, etc.?

Maybe somebody finally made use of that doubled L2 or other pieces of that quoted 45MB SRAM?

3dilettante · Apr 30, 2019

iamw said:
The old GCN architecture is a single-issue architecture without bank conflicts. But if VLIW2 or another multiple-issue method is introduced, bank conflicts could become a problem. I think that's why LLVM mentions a banked register file.

Going by the super-SIMD patent, operands are gathered from a bank over several clock cycles and stored in buffers ahead of the ALUs. A single row would collect each source register once per cycle, and then move down the ALU pipeline. That prevents a bank conflict occurring within a single FMA instruction, and a significant point of the patent was to utilize wasted register access cycles for instructions that didn't consume as many operands as a 3-operand FMA by allowing a simpler operation to borrow the unused cycles. At least within the scope of that method, the existing access method could permit conflict-free access.
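
As a toy illustration of the borrowing idea, in Python: assume each issue slot provisions three register read cycles (enough for a 3-operand FMA), and a second, simpler operation may ride along only if its reads fit in what the first one leaves unused. The names and the co-issue rule are my simplification, not the patent's claims.

```python
READ_CYCLES = 3  # read cycles provisioned per issue, sized for a 3-operand FMA

def unused_reads(num_operands):
    """Register read cycles left idle by an instruction with this many sources."""
    if not 0 <= num_operands <= READ_CYCLES:
        raise ValueError("operand count out of range")
    return READ_CYCLES - num_operands

def can_borrow(primary_operands, secondary_operands):
    """True if the secondary op's reads fit in the primary op's idle cycles."""
    return secondary_operands <= unused_reads(primary_operands)

for primary, secondary in [(3, 1), (2, 1), (1, 2)]:
    print(primary, secondary, can_borrow(primary, secondary))
```

Because the reads still happen one per cycle from a given bank, nothing in this scheme needs two accesses to the same bank at once.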
