AMD: Navi Speculation, Rumours and Discussion [2019-2020]

That's Navi.

Why are you so sure that Navi will be a big jump while it's still GCN?

(I'm not saying you're wrong, but we saw that Fiji to Vega was pretty much a bunch of nothing. PS isn't available, it's very hot, and all the changes they made on paper don't seem to affect real performance, only clock speeds, etc. It's like GCN is out of gas; they can't tweak it much more, imo.)
 
Why are you so sure that Navi will be a big jump while it's still GCN?

(I'm not saying you're wrong, but we saw that Fiji to Vega was pretty much a bunch of nothing. PS isn't available, it's very hot, and all the changes they made on paper don't seem to affect real performance, only clock speeds, etc. It's like GCN is out of gas; they can't tweak it much more, imo.)
Occam's razor says it's far more likely a lack of R&D money, especially with all the patents that are GCN-based.
 
That is the instruction I mentioned earlier that was explicitly flagged as causing shaders to hang in the GFX1010 changes.
What seems odd to me about the GFX10 I$ is that it seems more appropriate for some kind of wavefront buffer, rather than an instruction cache. GCN's I$ is referenced as being 32KB (shared between several CUs), so GFX10 having only 256 bytes seems very impractical unless this comment is addressing a closer tier of storage that isn't being shared. To some extent, there's always been some storage in this range of capacity for the per-wavefront buffers for 40 wavefronts in a CU.
Another interpretation of the 4x64 comment is that the existing GCN I$ is 4-way associative and has 64B lines, but that doesn't make sense in the context of the other changes, including using an instruction to keep three consecutive line loads from thrashing the cache.
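To make the two readings concrete, here's the arithmetic as a quick illustrative sketch (C++; the 32KB, 4-way and 64B figures are just the interpretations above, not anything disclosed):

#include <cstdio>

int main() {
    // Reading 1: "4x64" as a tiny per-CU / per-wavefront instruction buffer.
    constexpr int line_bytes = 64;
    constexpr int lines      = 4;
    constexpr int buffer     = lines * line_bytes;                 // 256 bytes

    // Reading 2: "4x64" as the geometry of the existing shared 32KB I$,
    // i.e. 4-way set associative with 64-byte lines.
    constexpr int cache_bytes = 32 * 1024;
    constexpr int ways        = 4;
    constexpr int sets        = cache_bytes / (ways * line_bytes); // 128 sets

    std::printf("small buffer: %d bytes\n", buffer);
    std::printf("shared I$: %d sets x %d ways x %d B lines = %d KB\n",
                sets, ways, line_bytes, sets * ways * line_bytes / 1024);
    return 0;
}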

I saw a whole bunch of HW bugs in the LLVM commit.
https://github.com/llvm-mirror/llvm...939#diff-983f40a891aaf5604e5f0b955e4051d2R733

Does someone have some idea how severe they are?
If it says hazard, I'm not sure how different the situation is than before in terms of importance, or if they're "bugs" (even though they're listed in the bug list). Hazards in the ISA docs usually involve referring to a list of required stall cycles where there could be invalid or unpredictable behavior. All architectures have them, and GCN as a somewhat loosely integrated set of sequencers and pipelines has a long history of them listed out in the various ISA docs. GFX10 actually has a flag indicating it may have removed a lot of those longstanding hazards.
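Mechanically, hazards like these are usually handled by the compiler or assembler consulting a table of required wait states and padding with NOPs or independent instructions. A minimal sketch of that idea, with entirely made-up hazard names and counts (nothing here is taken from the GFX10 commits):

// Hypothetical hazard kinds and wait-state requirements, for illustration only.
enum class Hazard { None, ValuWriteThenScalarRead, SmemThenExecWrite };

int requiredWaitStates(Hazard h) {
    switch (h) {
    case Hazard::ValuWriteThenScalarRead: return 4;  // made-up number
    case Hazard::SmemThenExecWrite:       return 2;  // made-up number
    default:                              return 0;
    }
}

// How many NOPs to insert, given how many independent instructions already
// sit between the producing and consuming instruction.
int nopsToInsert(Hazard h, int independentInstrsBetween) {
    int remaining = requiredWaitStates(h) - independentInstrsBetween;
    return remaining > 0 ? remaining : 0;
}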

The ones that have bug in their name aren't well-enough disclosed to be clear on how problematic they are. There are some bug flags for other GCN variants, but this list does seem like it could be longer than the others for this part of the code changes.
FeatureInstFwdPrefetchBug - the previously mentioned prefetch instruction. I'm not sure in which specific scenarios this may fail, or how bad it can be, given that other code changes go into more detail about using it; presumably they wouldn't if the problem were that common.
FeatureNSAtoVMEMBug - an apparent failure if an NSA instruction is followed by a vector memory instruction when the upper 32 bits of the EXEC mask are 0 or the lower 32 bits of the EXEC mask are 0. Architecturally, the EXEC mask is mapped to two 32-bit scalar registers. What happens on a failure is unclear, but it may be a situation where the instruction mix needs to be adjusted so that one instruction type isn't immediately followed by the other.
FeatureFlatSegmentOffsetBug - this one seems rather far-ranging for a bug. GCN has multiple increasingly complex addressing modes, and in this case many instructions that use flat addressing wind up ignoring an architected offset value in that region of memory, which seems like a significant thing to miss. The mitigation is to have the shader code do the math that was implicit to the instruction encoding, so it leaves the functionality usable, although less compact or efficient. There are other code commits that seem to outline workarounds for this across a list of instructions.
FeatureLdsMisalignedBug - limited detail on this one, although it seems limited to WGP mode. There are two scheduling modes for assigning wavefronts, CU mode and Workgroup Processor mode (if I remember right). CU mode is what I think is the "traditional" mode, where a work group's wavefronts are all assigned to the same CU. WGP mode appears to allow a single shader's wavefronts to span two CUs, which may have implications in other ways for the architecture. What specifically is bugged about this corner case isn't clear at this point.
https://github.com/llvm-mirror/llvm...37b3f7f#diff-ad4812397731e1d4ff6992207b4d38fa

"// In WGP mode the waves of a work-group can be executing on either CU of
// the WGP. Therefore need to bypass the L0 which is per CU. Otherwise in
// CU mode and all waves of a work-group are on the same CU, and so the
// L0 does not need to be bypassed."

Also, not sure what the implications are for the L0/L1 nomenclature.
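As a rough illustration of what that quoted comment implies for code generation, the decision could look something like this sketch (the names are placeholders, not the actual LLVM interfaces):

// Sketch only: should a load that must be coherent across a work-group
// bypass the first-level (per-CU) cache?
struct WaveScheduling {
    bool cuMode;   // true: all waves of a work-group run on one CU (one L0)
};

bool needsL0Bypass(const WaveScheduling &sched, bool workgroupScopeAccess) {
    if (!workgroupScopeAccess)
        return false;       // narrower-scope accesses don't need it
    // In WGP mode the work-group's waves may run on either CU of the WGP,
    // each with its own L0, so the access has to bypass L0 (e.g. via a
    // cache-policy bit). In CU mode they all share one L0, so no bypass.
    return !sched.cuMode;
}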


General consensus says GCN and Fermi (with some arguing over whether Fermi really was that big of a change or not).
I suppose this could go either way.
Southern Islands GCN was preceded by VLIW4, and depending on which parts of the overall GPU or architecture we focus on, there were elements outside of the CUs themselves that were more like the prior generation. GCN as an architectural model presented to software changed many things, although I think elements like the scalar path were there to a significant extent before, just not for general consumption.

Depending on what level we look at things, Nvidia has had some strong architectural shifts after Fermi. Fermi's more CISC-like load-op architecture and scoreboarded dependency checking were removed at a hardware and ISA level in later gens, and Nvidia has added/removed microarchitectural elements and most recently more fully enabled a SIMT model where the "threads" behave much closer to actual threads than ever before.

Both vendors have thrown big swaths of their ISA encodings around between generations, and Nvidia has done some rather broad encoding changes even when the architectures didn't seem that deeply revamped.

Isn't GCN an ISA? I doubt AMD would switch off it completely in the foreseeable future. Every iteration of GCN changes some functions but the core of it still gets called GCN. Switching off GCN is kind of like switching off X86 for CPU. Even after x64 came out, people still call it x86.
GCN's something of a mix of high and low-level architectural details, beyond just the ISA. At the very least the instructions themselves have at times been subject to encoding whiplash that would have been completely unacceptable for x86.
Large numbers of opcodes have shifted between generations, instructions have been dropped, instructions with slightly different behaviors have taken the names of prior operations, etc. GFX10 reverses a range of GCN3 encoding changes, as mentioned earlier. At one point, Vega's code commits had comments describing Vega's handling of taking over FMA instructions as a mess, in the words of AMD's own programmers.

As architectures, x86 and GCN are heavily underpinned by their instructions, but also by a broad range of other specifications and behaviors not necessarily reducible to the instruction encodings. GCN as an architecture is comparatively much more sloppy and less disciplined about which low-level and high-level details bleed into the architecture, or about consistency across generations.


Now... with a wavefront of 32 that must be different! Although I doubt final values are different.
So if I reverse the calculation, for a 64 CU Navi I would get the same 40 per CU; the difference would be that each SIMD would get 20 wavefronts.

Not sure this is correct, or even possible, but I doubt total capacity has been reduced.
A GFX10 wavefront is listed as being 64 in the recent code commits, although changing the hardware underneath that would have implications.
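For what it's worth, the capacity arithmetic in that speculation does work out if you assume the classic 4-SIMD CU and simply halve the wave width while keeping total lane capacity; a back-of-the-envelope sketch (assumptions, not anything from the commits):

#include <cstdio>

int main() {
    // Classic GCN CU: 4 SIMDs x 10 wavefronts x 64 lanes.
    constexpr int simds          = 4;
    constexpr int waves_per_simd = 10;
    constexpr int lane_capacity  = simds * waves_per_simd * 64;  // 2560 work items

    // Hypothetical wave32 CU with the same total lane capacity:
    constexpr int wave32_per_cu   = lane_capacity / 32;          // 80 wavefronts
    constexpr int wave32_per_simd = wave32_per_cu / simds;       // 20 per SIMD

    std::printf("GCN CU capacity: %d work items\n", lane_capacity);
    std::printf("wave32 equivalent: %d waves/CU, %d waves/SIMD\n",
                wave32_per_cu, wave32_per_simd);
    return 0;
}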

:p

Anyhoo, previously on GCN...

So Polaris 10 moved to 3 CUs sharing the K$ & I$ for wiring purposes. Wonder if they're taking another step with just paired CUs while also moving to the (apparent) super-SIMD :?:
That might not have been a point of architectural focus; if it was 3-per, it may just have been that the CU count worked out. I thought Vega was the one that had this limit stated.


Some things I've been seeing while skimming recent commits:
It seems like GFX10 is commented as not supporting cbranch fork and join instructions, and the setvskip instruction appears to be unsupported as well.

The NSA instructions are Non-Sequential Addressing forms of image instructions. I was skimming, but this may mean that unlike texturing instructions in current GCN, these don't require that an instruction's source data be put into a sequential set of vector registers, possibly allowing for something more flexible in accessing memory without duplicating or juggling registers to keep the pattern. I think this may come at a cost in terms of code density and other features/modifiers that might not be available to the new encoding.
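Conceptually, the register-allocation difference could be modelled like the sketch below (this is just the constraint as I read it, nothing to do with the actual encodings):

#include <vector>

// Classic style: address components must occupy one contiguous run of VGPRs,
// so a base register and a count describe them all.
struct SequentialAddr {
    int baseVgpr;
    int count;                // uses baseVgpr .. baseVgpr + count - 1
};

// NSA style: each address component names its own VGPR, so the register
// allocator doesn't have to copy values into a contiguous block first.
struct NonSequentialAddr {
    std::vector<int> vgprs;   // any registers, in any order
};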

There's a set of primitive export instructions.

There's another bit setting for memory coherence besides the SLC and GLC bits. There's a DLC bit that might be set.
*I'm trying to track down which commit described register banking in more detail. I think it was something like 4 banks for VREGs, and the scalar register file had 8 banks (two registers assigned to a bank before moving on). I'm not sure how much of this is strictly "new", as some of this also makes sense for the existing GCN, even if it weren't exposed. Perhaps some of the recent rumors about shifting the number of lanes per wave cycle could affect things, or perhaps the register files were banked previously in this way, and these new descriptions indicate even more banking?
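If the banking really is as described, expressing it is trivial; a sketch under exactly those assumptions (4 VGPR banks, 8 SGPR banks with two consecutive registers per bank, all taken from my reading above rather than any documentation):

// Assumed banking scheme, for illustration only.
int vgprBank(int v) { return v % 4; }        // 4 banks, one register per bank in turn
int sgprBank(int s) { return (s / 2) % 8; }  // 8 banks, two consecutive SGPRs per bank

// Two operands read in the same cycle would conflict if they map to the same
// bank; a compiler could use this to steer register assignment.
bool vgprConflict(int a, int b) { return a != b && vgprBank(a) == vgprBank(b); }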
 
Since Navi has been in the pipeline for years now, I doubt they benefited from noticeably increased R&D money at the time...
That has nothing to do with what I said, you're moving the goal posts. You said:

(I'm not saying you're wrong, but we saw that Fiji to Vega was pretty much a bunch of nothing. PS isn't available, it's very hot, and all the changes they made on paper don't seem to affect real performance, only clock speeds, etc. It's like GCN is out of gas; they can't tweak it much more, imo.)

All the patents would disagree with you; the problem is that ideas are cheap and execution is expensive.

I would expect the extra R&D money since Zen's release, especially from semi-custom, to help Navi, but obviously the project would have started before that was available, so changes might not be as aggressive or wide-ranging as they could have been if money hadn't been a problem at the start.
 
Who cares about patents if they don't deliver something good or aren't implemented? Look at Vega. We'll see with Navi soon.

And what I said had something to do with your remark about money...
 
Who cares about patents if they don't deliver something good or aren't implemented? Look at Vega. We'll see with Navi soon.

And what I said had something to do with your remark about money...

Again, you can't stick to the point that you made. You made a point, and it doesn't match what we can see: we know the GFX teams got gutted under Rory, we know AMD's R&D hit really low levels, and Zen was a priority above all else. You said "GCN is maxed", with not one bit of evidence to support that; from what we can see it is far better explained by a lack of resources to execute. They have still been doing the ideas work; now they have the money to do the execution work.

Money is always relevant, because when you have none of it you can't do anything. Lack of money is a far better explanation than GCN being maxed. Look at AMD/ATI's past: VLIW5, VLIW4 and then GCN over a 5-year period, and not much since. But have a look at R&D spending in 2007-2012 vs 2014 through end of 2016: only in mid 2017 did they reach 2014 R&D levels, and they are still $100M a quarter behind 2008-level R&D spending.
 
It doesn't match what we can see? Do you look at real products instead of patents?
They have a hard time making gains with GCN, it's a fact. So you say it's because of money. All right.
If you don't have the money to support your uarch, then it's maxed out in the end. You can't make it evolve any more, you do simpler things. Give them "fuck you" money and I'm sure they could still have tweaked TeraScale?

Btw, being less condescending wouldn't hurt.
 
AMD makes their own set of compromises. While they have been criticised for less than stellar performance/W or performance/FLOP the last couple of generations, what has been ignored is that they actually are pretty good at performance/mm2 and particularly shader FLOPS/mm2. (I shy away from saying performance/$ since that is too dependent on market forces)
What that implies is that if they had so decided they could have supported their shader cores with larger caches/buffers/queues or multi porting or more registers or more front end/back end processing. But that may not have made sense to them overall, since this would have come at a cost in die size that may not have paid for itself in overall performance. Nvidia has enjoyed better margins on their products, and are able to ship larger dies (with each FLOP better supported). Now, the balance that AMD has struck in the past may or may not change on 7nm lithography. But even for these higher power chips, the improvement in density at 7nm is significant, maybe enough to pay a price in die area. Or not, and they might choose to produce smaller chips (improving both dies/wafer and percentage of good dies) so they have greater pricing flexibility.

I have nowhere near the detailed GPU architectural knowledge needed to know what is and is not possible to do within the GCN ISA, but I find it dubious to assume that it would be "maxed out" from a technical standpoint. I'm more versed in CPU architecture, and similar arguments have been made about "ARM" (now-ish) and x86 (30 years ago) performance potential, and they have been proven emphatically wrong. You can always throw more hardware at a problem. The question is if it makes overall sense to do so, and that's decided by the market.
 
AMD makes their own set of compromises. While they have been criticised for less than stellar performance/W or performance/FLOP the last couple of generations, what has been ignored is that they actually are pretty good at performance/mm2 and particularly shader FLOPS/mm2. (I shy away from saying performance/$ since that is too dependent on market forces)
What that implies is that if they had so decided they could have supported their shader cores with larger caches/buffers/queues or multi porting or more registers or more front end/back end processing. But that may not have made sense to them overall, since this would have come at a cost in die size that may not have paid for itself in overall performance. Nvidia has enjoyed better margins on their products, and are able to ship larger dies (with each FLOP better supported). Now, the balance that AMD has struck in the past may or may not change on 7nm lithography. But even for these higher power chips, the improvement in density at 7nm is significant, maybe enough to pay a price in die area. Or not, and they might choose to produce smaller chips (improving both dies/wafer and percentage of good dies) so they have greater pricing flexibility.

I have nowhere near the detailed GPU architectural knowledge needed to know what is and is not possible to do within the GCN ISA, but I find it dubious to assume that it would be "maxed out" from a technical standpoint. I'm more versed in CPU architecture, and similar arguments have been made about "ARM" (now-ish) and x86 (30 years ago) performance potential, and they have been proven emphatically wrong. You can always throw more hardware at a problem. The question is if it makes overall sense to do so, and that's decided by the market.

Point taken, but couldn't the same be said about Bulldozer? Instead of throwing more money into it, they rather ditched it in favour of a new arch. And it was the right decision.

Let's be frank. GCN is a more than 8 year (!) old GPU architecture, with just modest updates through the years. It has its limits and weaknesses. Its development started somewhere between 2007/2008. Sure, they can "tweak it" hard at the end of its life and spend tons of money to make it better, or they can rather focus on a new GPU architecture which has higher potential. One scenario here is more probable than the other, I guess....
 
Point taken, but couldn't the same be said about Bulldozer? Instead of throwing more money into it, they rather ditched it in favour of a new arch. And it was the right decision.

Let's be frank. GCN is a more than 8 year (!) old GPU architecture, with just modest updates through the years. It has its limits and weaknesses. Its development started somewhere between 2007/2008. Sure, they can "tweak it" hard at the end of its life and spend tons of money to make it better, or they can rather focus on a new GPU architecture which has higher potential. One scenario here is more probable than the other, I guess....
True.
When you develop a new architectural base, you have to make assumptions not only about which markets you want to address (gaming? HPC? mining? xxxxx?) but also how the needs within those markets will develop over time. It may well be that a decade down the line, you would not only see where you mispredicted, but also see new markets either develop or disappear. To some extent you can rebalance within the ISA (for instance reduce 64-bit FP performance if your target is gaming, reduce ROPs if HPC, and so on) or tweak the underlying hardware implementation as mentioned before. But it may also be that it would pay off nicely from an efficiency standpoint to reassess completely what markets you want to attack, what trends fell by the wayside, and which seem likely to develop.
My point is that it is really difficult to judge just how much you would gain by completely rearchitecting vs. tweaking and rebalancing the existing design. (Not to mention that no modern CPU/GPU architecture is going to be completely new anyway. That would be incredibly wasteful, as there are a lot of wheels that really don't need reinventing.)

The Hot Chips presentation will be interesting. I hope.
 
Point taken, but couldn't the same be said about Bulldozer? Instead of throwing more money into it, they rather ditched it in favour of a new arch. And it was the right decision.
Zen carries a lot of BD (and overall AMD-cores-since-K7) legacy, down to it inheriting the entire BPU.
 
Zen carries a lot of BD (and overall AMD-cores-since-K7) legacy, down to it inheriting the entire BPU.

Of course, as some said before, you don't redo everything. Still, Zen is considered by AMD as a new x86 architecture.

With all the changes in Zen (even if, yes, some things didn't change), don't tell me it's just a tweaked Excavator...
 
It doesn't match what we can see? Do you look at real products instead of patents?
They have a hard time making gains with GCN, it's a fact. So you say it's because of money. All right.

There's a direct correlation between AMD's competitiveness on consumer GPUs and their R&D expenditure.
The typical time period for designing a graphics chip is between 2 and 4 years. The launch of the HD4000 series in mid 2008 was preceded by AMD's historical maximum R&D budget of $450-500M/quarter the year before.
GCN was preceded by 2 years of ~$350M/quarter; in 2012 their R&D expenditure started a steep decline, plus they started working on the new Zen architecture.
Both Polaris and Vega were preceded by 3 years of $230-250M/quarter (most of which probably went to Zen development), their historical low of the past 15 years without even adjusting for inflation.



So yes. Ideas are cheap. Execution is not.



Navi will be preceded by 2 years of $320-370M/quarter. It's a lot more, but it's still anemic compared to nvidia's >$500M/quarter during the last year (which they don't share as much with CPU development).
 
Navi will be preceded by 2 years of $320-370M/quarter. It's a lot more, but it's still anemic compared to nvidia's >$500M/quarter during the last year (which they don't share as much with CPU development).
And both are meager versus Intel's ~$13B a year, yet their execution is as close to pathetic as a company of their size can get.
Money can't buy execution either.
 
And both are meager versus Intel's ~$13B a year, yet their execution is as close to pathetic as a company of their size can get.
Money can't buy execution either.

One of the most basic logical fallacies.

A being incomplete without B doesn't mean all A with B is complete.
You can't do execution without money. It doesn't mean that all execution with money is successful.

Regardless, any sane person would much rather manage a company in Intel's current condition than AMD's.
Calling Intel's execution "pathetic" can only come from a place of ignorance IMO. They managed to steadily increase their profits year after year by huge percentages by progressively shrinking their quad-core CPUs while selling them at the same price. And they did so as much as their competition allowed them to.


Besides, Intel also has around 20 fabrication plants, so they need to spend R&D to update them, and besides CPUs, GPUs and chipsets they also spend R&D on NAND memory, NICs, FPGAs, computer vision, software security, IoT, WiFi modems, 4G modems (until last month) and others.
It's not like Intel has 10x more money to spend on GPU development than AMD or 5x than nvidia.
 
Point taken, but couldn't the same be said about Bulldozer? Instead of throwing more money into it, they rather ditched it in favour of a new arch. And it was the right decision.

Let's be frank. GCN is a more than 8 year (!) old GPU architecture, with just modest updates through the years. It has its limits and weaknesses. Its development started somewhere between 2007/2008. Sure, they can "tweak it" hard at the end of its life and spend tons of money to make it better, or they can rather focus on a new GPU architecture which has higher potential. One scenario here is more probable than the other, I guess....

You can easily see Bulldozer's weaknesses:

L1i cache aliasing issues
write-through L1d
terrible L2 latency/shared L2
extremely terrible inter-module latency
very narrow integer execution
high FPU latencies from the FMA unit
shared fetch/predict/decode

But what can you say about GCN? Nothing: games that are compute heavy generally do well on it, games that hit other parts of the GPU don't. So your answer is to throw it away? The other interesting thing to note is, if you address all the above issues with BD, you can very easily come out looking like Zen :).

edit: Also, the things that got worked on the most in Steamroller/Excavator, like the SMU, 100% went into Zen; that is turbo boost, XFR, etc.
 