AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Status
Not open for further replies.
Sure, but the cost of more CUs isn't just the FMA unit itself; for a given level of performance, it may or may not be cheaper to make transcendentals cheaper rather than adding more CUs. It may also be more power efficient; i.e. AMD's approach may (or may not) be more area efficient but less power efficient (see: dark silicon). Many GPU architectures have fully co-issued special function units, including Pascal/Volta/Turing (even past architectures from AMD; e.g. Xenos isn't really Vec5, it's Vec4 FMA + scalar special function).

Everything's a trade-off and clearly AMD went strongly in the direction of doing more on the general-function FMA units compared to their VLIW4/VLIW5 architectures and compared to NVIDIA. It's not obvious to me whether that has actually paid off for them...

Also I'm still not 100% sure whether GCN special function ops really stall the pipeline for 4 cycles or only 1 cycle.

Interestingly, Lisa Su said at CES that heterogeneous computing was the answer to the slowing down of Moore's Law. I wonder whether that vision goes down to the sub-CU level.
 
Sure, but the cost of more CUs isn't just the FMA unit itself; for a given level of performance, it may or may not be cheaper to make transcendentals cheaper rather than adding more CUs. It may also be more power efficient; i.e. AMD's approach may (or may not) be more area efficient but less power efficient (see: dark silicon).

I agree. Evaluating the pros and cons isn't a simple problem; you might save power evaluating transcendentals on special-purpose units, but FMA workloads at iso-performance would consume less power (power ≃ clock³).
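The power ≃ clock³ point above can be sketched with a toy model. This is purely illustrative (the baseline unit count and the exact cube-law scaling are assumptions, not real silicon data): at equal throughput (units × clock held constant), a wider, slower design comes out ahead on dynamic power, which is the usual argument for "more units at lower clocks".

```python
# Toy model of the iso-performance trade-off: dynamic power per unit is
# assumed to scale roughly with clock^3 (P ∝ C·V²·f, with voltage scaling
# roughly linearly with frequency). Numbers are illustrative only.
def relative_power(num_units, clock, base_units=64, base_clock=1.0):
    """Total power relative to a baseline config, with per-unit P ∝ f³."""
    return (num_units / base_units) * (clock / base_clock) ** 3

# Same throughput (units × clock is constant), two configurations:
wide_slow = relative_power(num_units=128, clock=0.5)  # 2x units, half clock
baseline  = relative_power(num_units=64,  clock=1.0)

print(wide_slow, baseline)  # 0.25 1.0 — the wider, slower design wins on power
```

Real designs obviously don't follow a clean cube law (leakage, voltage floors, dark silicon), but it shows why "more CUs at lower clocks" and "faster transcendentals per CU" aren't power-equivalent even at iso-performance.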

Cheers
 
Interestingly, Lisa Su said at CES that heterogeneous computing was the answer to the slowing down of Moore's Law. I wonder whether that vision goes down to the sub-CU level.
Is HSA still alive? I haven't actually heard them talk much about it for a long time.
 
Perhaps because apps that benefit from GPU compute already implemented OpenCL and CUDA half a decade ago, and nowadays, with 4+ cores being the norm, the biggest bottleneck in 99% of daily use is that pesky JavaScript code in the browser or other single-threaded stuff.

Facebook video chat, I'd guess, uses the fixed-function video codec units, so I don't know where that came from.
 
Redgamingtech, the website that leaked Radeon VII two weeks prior to launch, has a new rumor/report on Navi.
It doesn't say whether the source is the same as the one from December, though it does say it's a source that has proven true in the past.

http://www.redgamingtech.com/navi-a...-july-more-powerful-navi-launching-next-year/

Well, according to a source Navi will not be announced until around E3 2019 (which takes place in June). The card(s) will then launch about a month later (I don’t have an exact date). According to the source, AMD said that the GPU is looking good (at least the general feeling of the company is confident).

For the performance targets – I can’t give specifics, as I don’t have exact numbers (I wish I did). But I can tell you that according to the source, the company is targeting the “low to mid-range” which, as you can imagine is a pretty wide net.
(...)
I can also reveal that the lower end Navi’s of July will then be followed by a higher end Navi part that is supposedly arriving in 2020 (though no time window was given).


So to summarize:

- Low to Midrange Navi chips replace Polaris 11 up to Vega 10, announced in June to launch in July
- High-end Navi that replaces (and hopefully upgrades upon) Vega VII to launch in 2020


Then we have Arcturus also announced for 2020, though that could have slipped by now.


It could be that AMD is targeting a "tick-tock" of sorts with GPU architectures within 6 month periods:
Q1'19: 7nm high-end Vega (tock)
Q3'19: 7nm low/mid-end Navi (tick)
Q1'20: 7nm+ high-end Navi (tock)
Q3'20: 7nm+ low/mid-end Arcturus (tick) - pure speculation
Q1'21: 5nm high-end Arcturus (tock) - pure speculation


That would require perfect execution, obviously, which we shouldn't expect from AMD.
7nm+ being TSMC's 7nm EUV, where yields should be significantly better; chips wouldn't necessarily clock higher or be smaller, but they could be significantly larger.
 
High-end Navi that replaces (and hopefully upgrades upon) Vega VII to launch in 2020

I don't think AMD will release such a chip, especially when they're announcing a next-gen GPU in 2020. But that doesn't stop big Navi rumours from re-emerging over and over again...
 
I don't think AMD will release such a chip, especially when they're announcing a next-gen GPU in 2020. But that doesn't stop big Navi rumours from re-emerging over and over again...
Large Navi would be old architecture to act as pipe cleaner for 7nm EUV, like Vega 20 is for 7nm DUV.
Arcturus would be new architecture on then-known 7nm EUV.

As I suggested, a tick or tock every 6 months would be extremely optimistic for AMD, let alone RTG.
 
Large Navi would be old architecture to act as pipe cleaner for 7nm EUV, like Vega 20 is for 7nm DUV.

Yes, the same was said about Polaris being a "pipe cleaner" for Vega 10, although that tick-tock never materialized; then the same was said about Vega 20 being a pipe cleaner for Navi, and Navi for the next gen, etc. The GCN architecture is almost 9 years old; why should AMD bet its future on it? AMD desperately needs something fresh, and old GCN doesn't seem like the right way to achieve it. I hope small Navi is the last GCN GPU from AMD...
 
Rumours are Navi needs a re-spin. Maybe it had some aggressive changes to improve clock scaling.

Why would it need a respin when it's being designed by Sony? Is Sony not as great as some claim?


:runaway:


Sorry, I couldn't resist. Feel free to delete this post after the lulz.
 
Do we believe the "Super SIMD" / "VLIW2" patent is applicable to Navi? It doesn't feel like a huge departure from GCN to me, so it feels very plausible to me that it'd be considered for Navi (whether it makes it into the final design is another question; patents don't always result in end-products).
It would take an encoding change for the architecturally single-issue ISA, since the compiler needs to determine whether instructions can go to the primary or core slots, instead of the hardware checking for dependences.
Perhaps that's a matter of yet another instruction format, which has precedent for GCN. Whether that can fit in the current instruction lengths, or whether this threatens to require a new length, is unclear. (It could be company for Volta, which has gone to 128-bit instructions per the "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking" paper. https://arxiv.org/pdf/1804.06826.pdf)
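The compiler-side dependence check implied above can be sketched in a few lines. This is a hypothetical illustration of the idea (the instruction representation and register names are made up, not any real ISA encoding): under a "Super SIMD"/VLIW2-style scheme, the compiler, not the hardware, decides whether two adjacent instructions may share a bundle's pair of slots.

```python
# Hypothetical VLIW2-style pairing check: two instructions can share a bundle
# only if the second doesn't read what the first writes (RAW hazard) and they
# don't write the same register (WAW hazard). Instruction = (dest, sources).
def can_pair(first, second):
    dst1, _srcs1 = first
    dst2, srcs2 = second
    return dst1 not in srcs2 and dst1 != dst2

a = ("v0", ("v1", "v2", "v3"))  # v0 = fma(v1, v2, v3)
b = ("v4", ("v5", "v6", "v7"))  # independent of a: can dual-issue
c = ("v8", ("v0", "v6", "v7"))  # reads v0: must go in a later bundle

print(can_pair(a, b), can_pair(a, c))  # True False
```

The point is simply that this check moves from issue-time hardware into the encoder/compiler, which is exactly what would force an encoding change in an architecturally single-issue ISA.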

After that, it would seem as if the CU would treat the two lanes as separate instruction queues that are drained in the same fashion as before.

The data output cache on the other side might be a somewhat larger departure. It seems like the hardware does a bit more checking in order to make a hit in the cache, or perhaps the decode and queuing process in the front end has a small table of cache slots and last-used architectural registers to override the source operands of subsequent instructions.

One thing I'm still confused about with GCN and my Google-fu is failing me (asking console devs on Twitter might be the easiest route but hopefully someone here knows as well): transcendental/special-function is 1/4 rate on GCN, but do they stall the entire pipeline for 4 cycles, or can FMAs be issued in parallel for some of these cycles?

Everything I've found implies that they stall the pipeline for 4 cycles, which is pretty bad (speaking from experience with mobile workloads *sigh*; maybe not as bad for PC-level workloads) and compares pretty poorly with NVIDIA, which on Volta/Turing is able to co-issue SFU instructions 100% for free: they don't stall the pipeline unless they're the overall bottleneck (as they've got spare decoder and spare register bandwidth, and they deschedule the warp until the result is ready; obviously they can't co-issue FP+INT+SFU, but FP+SFU and INT+SFU are fine).
From the ISA docs, I do not see any reference to wait states for transcendental operations. If such an instruction actually did require 4 vector cycles to fully output the results for all waves, presumably co-issue would inject the risk of a subsequent fast instruction being able to source from the slow instruction's output register several cycles ahead of the writeback.
The architecture has various other places where it does not interlock, and the wait counts do not control for within-VALU dependences. Rather, the vestigial references to a VALUCNT in old docs may point to a time where the possibility was brought up but discarded. The more straightforward method that seems consistent with the ISA is that the architecture won't issue until the prior instruction has completed for these longer-duration instructions.
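A back-of-envelope cycle count makes clear why the stall-vs-co-issue question above matters. Both behaviours here are speculative models, not confirmed GCN behaviour, and the instruction counts are arbitrary:

```python
# Issue-cycle model for a quarter-rate special-function unit.
# coissue=False: each SFU op blocks the VALU pipe for its full 4 cycles.
# coissue=True:  SFU cycles overlap with FMA issue; the longer stream wins.
def cycles(num_fma, num_sfu, coissue):
    if coissue:
        return max(num_fma, 4 * num_sfu)
    return num_fma + 4 * num_sfu

# E.g. 100 FMAs and 10 transcendentals per wave:
print(cycles(100, 10, coissue=False))  # 140 cycles if SFU stalls the pipe
print(cycles(100, 10, coissue=True))   # 100 cycles if SFU issue is free
```

With a 10:1 FMA-to-transcendental mix, that's a 40% issue-cycle difference between the two interpretations, which is why pinning down the actual behaviour matters for the AMD-vs-NVIDIA effective-FLOPs comparison.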

It feels to me like at this point, 1 NVIDIA "CUDA core" is actually quite a bit more "effective flops" than an AMD ALU. It's not just the SFU but also interpolation, cubemap instructions, etc... We can examine other parts of the architecture in a lot of detail as much as we want, but I suspect the lower effective ALU throughput is probably a significant part of the performance difference at this point... unlike the Kepler days when NVIDIA was a lot less efficient per claimed flop than they are today.
It's been some time since Kepler, but my recollection is that the impression of AMD's architectures consuming more general purpose FLOPs in mixed-use scenarios goes at least as far back as Tahiti, and possibly Cayman. (edit: VLIW5 had an AMD FLOP vs Nvidia FLOP debate as well.) The question was whether AMD's chip would have enough spare FLOPs to overcome the impact of the higher-cost special function instructions.
Other than perhaps Fermi's hobbled start, the impression with the VLIW GPUs was that AMD FLOPs weren't as meaningful for graphics as Nvidia FLOPs, and that's mostly held true for GCN.

EDIT: Also this would allow a "64 CU" chip to have 2x as many flops/clock as today's Vega 10 without having to scale the rest of the architecture (for better or worse). It feels like 8192 ALUs with 256-bit GDDR6 and better memory compression could be a very impressive mainstream GPU.
The exemplar image in the patent at least doesn't draw sufficient paths in the operand delivery from the register file, with just 4 reads overall for the ALUs and vector IO. If this is combined with the bandwidth from the destination operand cache, the ALU section sees a possible peak of 6 operands sufficient for 2 FMAs. It doesn't seem unreasonable to consider this close enough to 2x peak, given that many CPUs have needed the bypass network to compensate for a register file with too few ports for all the ALUs, and Nvidia's operand reuse cache does compensate for cases where its vector register bandwidth cannot be fully used.
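The operand arithmetic above is just counting, but it's worth making explicit. The port counts come from my reading of the patent's exemplar drawing, so treat them as assumptions rather than a confirmed design:

```python
# Operand-bandwidth counting exercise for the patent's exemplar figure:
# 4 register-file reads shared by the ALUs and vector I/O, plus operands
# forwarded from the destination-operand cache. Figures are assumptions.
REG_FILE_READS = 4    # read ports drawn in the exemplar image
CACHE_FORWARDS = 2    # operands supplied by the destination operand cache
OPERANDS_PER_FMA = 3  # d = a*b + c

available = REG_FILE_READS + CACHE_FORWARDS
print(available // OPERANDS_PER_FMA)  # 2 — enough operands for 2 FMAs/cycle
```

So peak 2x FMA throughput is reachable only when the operand cache supplies its share, which mirrors how CPU bypass networks and Nvidia's operand reuse cache paper over under-ported register files.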

One wrinkle to this going from the patent is that the operand cache interjects itself in the way of the forwarding network to the vector IO bus, potentially requiring some extra tracking or wait states since the cache does not feed into the bus used by the ALU and IO sections. It's a local ALU bus or write to the register file, so an export or memory read dependent on an operand in the cache may force an immediate writeback or require some additional checks of the mapped register list. There are some existing short wait states for some register hazards like this already, though this cache may make for longer explicit delays without pipeline interlocking.

Everything's a trade-off and clearly AMD went strongly in the direction of doing more on the general-function FMA units compared to their VLIW4/VLIW5 architectures and compared to NVIDIA. It's not obvious to me whether that has actually paid off for them...
If I recall correctly, VLIW4 is where the T-unit was broken up and the special-function elements distributed among the remaining four ALUs. An operation would cascade from one lane to the next over four cycles, with successive approximations or lookups occurring each time. GCN's lane orientation flipped things by 90 degrees, but it's possible that what it's doing for special instructions is from that lineage. The quad-based arrangement and 4-way crossbar available between the 4 ALUs in a quad for some instructions may fit with GCN acting like VLIW4. Otherwise, every lane would need the full complement of lookup tables and miscellaneous hardware, incurring an area cost while not realizing potentially significantly higher throughput if a full transcendental unit were in every lane.
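The cascaded evaluation described above — a seed refined once per lane/cycle over four steps — can be illustrated with Newton-Raphson reciprocal refinement. To be clear, this is a generic successive-approximation sketch in that spirit, not AMD's actual algorithm (real hardware uses lookup tables plus polynomial/iterative refinement in fixed-point/FP pipelines):

```python
# Illustrative successive-approximation reciprocal: a crude seed is refined
# once per step, one step per lane/cycle in the cascade analogy.
# Assumes x >= 1 so the toy integer-truncation "lookup" seed is valid.
def recip(x, steps=4):
    y = 1.0 / float(int(x))       # stand-in for a small lookup-table seed
    for _ in range(steps):
        y = y * (2.0 - x * y)     # Newton-Raphson step: error squares each time
    return y

print(abs(recip(3.7) - 1 / 3.7) < 1e-9)  # True — 4 steps reach ~full precision
```

Quadratic convergence is the reason four cascaded steps suffice: each step roughly doubles the number of correct bits, so even a poor seed lands near full single precision, and distributing one step per lane avoids duplicating the full table/refinement hardware in every lane.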
Like the patent's side ALUs, the VLIW5 T-unit didn't have a corresponding set of hookups into the operand network, requiring unused operand cycles or shared operands with neighboring ALU slots. Unlike the T-unit, the side ALUs lack a multiplier and so cannot on their own perform complex operations. They're less generalist than the units that preceded them. Instead, it seems like the patent has two core ALUs with FMA capability, and then in some more complex scenario the side ALU can pair with one of them to perform instructions that require a full ALU.
 
- Low to Midrange Navi chips replace Polaris 11 up to Vega 10, announced in June to launch in July
- High-end Navi that replaces (and hopefully upgrades upon) Vega VII to launch in 2020
I would expect Navi to replace Polaris, but Radeon VII as it stands is a tier higher, at the enthusiast level - so if high-end Navi happens, it would rather stand at the 'Vega 40' level (which never materialized) or replace Vega 56.

'Replace' meaning 'to offer a similar or higher performance at the same or lower price'.

It could be that AMD is targeting a "tick-tock" of sorts with GPU architectures within 6 month periods:
Q1'19: 7nm high-end Vega (tock)
Q3'19: 7nm low/mid-end Navi (tick)
Q1'20: 7nm+ high-end Navi (tock)
Q3'20: 7nm+ low/mid-end Arcturus (tick) - pure speculation
Q1'21: 5nm high-end Arcturus (tock) - pure speculation
A six-month interval would be too early to introduce a completely new graphics architecture. The cycle of updates has been at a much slower pace in the last 3 years, not just for AMD but also for Nvidia.

I'd think Navi would be both a new/updated architecture and a discrete mid-range 7 nm chip implementing that architecture.
Arcturus would then be an implementation of the Navi (or post-Navi) architecture for the high-end and enthusiast levels, using big caches and HBM3 memory, and wouldn't be available in low- or mid-end parts.

Polaris => Navi
Vega => Arcturus

tick or tock every 6 months would be extremely optimistic for AMD, let alone RTG
Arcturus coming 12 months after Navi? On a 5 nm EUV node? Nah, you've gotta be kidding...:cool:
 
Arcturus coming 12 months after Navi?
Arcturus was actually planned to release in late 2019 in the roadmap slides from mid-2017 (back then called "Next Gen"). It's Navi that's awfully late.
If Navi and Arcturus are being developed by two distinct teams in parallel, it's not impossible that the second isn't as late as the first. Besides, Arcturus in mid-2020 is already a 6-month delay from what was initially planned.

As for 5nm, both Samsung and TSMC are planning risk production in mid 2019 with high volume production early 2020.
It all depends on how much volume the smartphone companies will require, considering the market is slowing down significantly.

Regardless, there's no indication of Arcturus being 5nm. That was pure and super optimistic speculation on my part, as I mentioned.
 
Right now we have yet another 9-month delay for Navi though.
9 months from when?
Navi was supposed to have an early 2018 release.
I.e. it was initially planned as a 10nm chip at best, 14nm at worst.
 
Redgamingtech, the website that leaked Radeon VII two weeks prior to launch, has a new rumor/report on Navi.
It doesn't say whether the source is the same as the one from December, though it does say it's a source that has proven true in the past.

http://www.redgamingtech.com/navi-a...-july-more-powerful-navi-launching-next-year/




So to summarize:

- Low to Midrange Navi chips replace Polaris 11 up to Vega 10, announced in June to launch in July
- High-end Navi that replaces (and hopefully upgrades upon) Vega VII to launch in 2020


Then we have Arcturus also announced for 2020, though that could have slipped by now.


It could be that AMD is targeting a "tick-tock" of sorts with GPU architectures within 6 month periods:
Q1'19: 7nm high-end Vega (tock)
Q3'19: 7nm low/mid-end Navi (tick)
Q1'20: 7nm+ high-end Navi (tock)
Q3'20: 7nm+ low/mid-end Arcturus (tick) - pure speculation
Q1'21: 5nm high-end Arcturus (tock) - pure speculation


That would require perfect execution, obviously, which we shouldn't expect from AMD.
7nm+ being TSMC's 7nm EUV, where yields should be significantly better; chips wouldn't necessarily clock higher or be smaller, but they could be significantly larger.

Sounds legit, at least. TSMC's 7nm+ is already in production; I think AMD is waiting for the 7nm generation after that (7nm "C", I think it's called?), which has even more EUV layers (almost as many as 5nm), allowing faster, cheaper production and cheaper tapeouts without having to learn a new process. I.e. it's better for large chips.

And of course, Computex reveals are nothing new; that seems far more likely than E3. Are we getting two GPUs or one? The scalability for one GPU to go from a 15-watt laptop up to a 2070 competitor seems suspect. Shouldn't it be two GPUs, one at 20 CUs and the other at 40, or something like that?

As for Arcturus, I wouldn't expect that until when AMD said it: 2021. Next year will maybe see 5nm for smartphone SoCs and such, and perhaps a super expensive corporate card like the Vega ML cards. More relevantly for Arcturus, I'd expect whatever the PS5 chip is to be some sort of early-release, half-Navi/half-Arcturus-like architecture, just like the PS4 Pro had FP16 support before Vega came out.

I.e. whatever Arcturus is, we'll probably get a preview at next year's E3 or whenever the PS5 preview happens.
 