AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Could lower voltage logic maintain signal integrity?
Future PCIe implementations are exploring multi-level signaling, with PCIe 5.0 in theory finalized this year and hitting the market next. As IF follows PCIe signaling, lower voltages may not be the solution; higher voltage with more bandwidth is a possibility.

What if we scale down the package to 2 HBM and 2 GPU dies and a 7nm respin of the IO die?
For a GPU I'm mixed on moving the GPU dies to 7nm with a separate IO die for graphics. While 7nm has straightforward benefits for the cores, they aren't the bottleneck. (Take "7nm" here as shorthand for relative process iterations.) In a parallel architecture, using low-power processes generally makes sense for scaling. The solution may be to move the command processor and front end off die to a frequency optimized process. So IO/cores, HBM stacks, and high performance cores as an MCM. Perhaps replace one HBM stack with a chiplet. That high performance die should be relatively light on bandwidth needs, and latency to most of the cores is less of a concern. Scaled up, that becomes multiple processing chiplets optimized for their specialty. Different variations of 7nm for chiplets should be possible to optimize performance. Say 3+1 chiplets with IO and HBM for a high end solution. That's not too far from Milan, and there's no reason all parts couldn't be within a single chip for cost-sensitive markets.
 
move the command processor and front end off die to a frequency optimized process
So, a large 14nm IO die with the command processor.

What about global thread scheduling - wouldn't it require significant inter-die traffic?

Different variations of 7nm for chiplets should be possible to optimize performance. Say 3+1 chiplets with IO and HBM for a high end solution.
So, again, a 14 nm IO die, one HBM stack, and 3 GPU chiplets?

What kind of specialization the different chiplets could have - something like full and simple ALUs from the recent patent application?
 
The point is, larger caches and/or faster links with improved protocols may allow further unification - like it happened in Vega10 for pixel and geometry engines which were connected to the L2 cache.
They were connected to the L2, but the descriptions seem more consistent with them using it as a means to either spill or stream out in a unidirectional manner. There's no participation in a protocol, and there are instances where there are heavy synchronization and flush events that would not be needed if they were full clients.
As far as the L2 is concerned, there is currently no protocol, nor do they need one, because there is only one destination and minimal interference from other clients. Examples where interference can happen tend to be ones with flushes and command processor stalls.

The experts are on the x86 side. GCN's architecture reflects little to none of that expertise. It has a memory-side L2 cache where data can only exist in one L2 slice, and the L1s are weakly coherent by writing back any changes within a handful of cycles. Coherence between the GPU and CPU space is one-sided. The CPU side can invalidate and respond with copies in the CPU caches--which are designed by the experts. The GPU side with the GCN-level expertise cannot be snooped and cannot use its cache hierarchy to interact with the CPUs. At most, the GPU code can be directed to write back to DRAM and avoid the caches, after which it may be safe to read that data after some kind of synchronization command packet is satisfied.

But how this would be different from the current architecture?
Coherence in GCN's L2 is handled by there only being one physical place that a given cache copy can be in, the L2 slice assigned to a subset of the memory channels.
This is a trivial sort of coherence. A copy cannot be incoherent with another copy because it cannot be in more than one place.
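
A minimal sketch of that idea, with a made-up slice count and interleave granularity (not the real GCN hashing scheme): because the slice index is a pure function of the address, two slices can never hold copies of the same line, so there is nothing between them to keep coherent.

```python
# Toy model: a memory-side L2 where each address maps to exactly one slice.
# The 16-slice count and 256-byte interleave are assumptions for illustration.
NUM_SLICES = 16          # one L2 slice per memory channel (assumed)
INTERLEAVE_BYTES = 256   # channel interleave granularity (assumed)

def l2_slice_for(address: int) -> int:
    """Every physical address resolves to a single slice, so a cache line
    can only ever live in one place -- 'coherence' by construction."""
    return (address // INTERLEAVE_BYTES) % NUM_SLICES

# Addresses in the same 256B block go to the same slice; there is never a
# second slice that could hold a stale copy of that block.
assert l2_slice_for(0x1000) == l2_slice_for(0x10FF)
assert l2_slice_for(0x1000) != l2_slice_for(0x1100)
```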

Vega whitepaper and Vega ISA document imply that L2 is split between separate memory controllers/channels and interconnected through IF/crossbar.
The L2 has always been split between memory controllers. Vega's mesh could in theory permit flexibility in how the slices map physically, but the current products seem to have the same L2 to memory channel relationship for the same chip. The fabric would not make the L2 function correctly if it somehow allowed two slices to share the same address ranges.
What happens if another L2 slice has a copy is undefined. What is "coherent" as far as GCN goes is that a client be write-through to the L2, which only the CUs do.
For global coherence, GCN works by skipping the L1 and L2 caches entirely and writing back to DRAM, since none of them can participate in system memory coherently.

What if we scale down the package to 2 HBM and 2 GPU dies and a 7nm respin of the IO die?
Respinning the IO die would have limited benefit, which is why AMD is able to use 14nm IO dies. The size of the PHYs and the perimeter length that has to be allocated to IO are where the process node has limited benefit. Shrinking the IO die might add some challenges based on how extreme its length-to-width ratio would have to get to maintain sufficient perimeter.
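
To make the aspect-ratio point concrete, a toy calculation (the 400 mm² starting area is an arbitrary assumption, not the real IO die size): halving the area while keeping the same perimeter for the PHYs forces the die from a square into a long, thin rectangle.

```python
import math

# Toy geometry behind the length-to-width concern above.
A0 = 400.0                       # assumed original IO die area, mm^2
P0 = 4 * math.sqrt(A0)           # perimeter if it were square: 80 mm of beachfront

A1 = A0 / 2                      # hypothetical shrink to half the area
# Keep the same perimeter for the PHYs: solve w + h = P0/2 with w * h = A1.
half_perim = P0 / 2
w = (half_perim + math.sqrt(half_perim**2 - 4 * A1)) / 2
h = A1 / w
print(f"{w:.1f} x {h:.1f} mm, aspect ratio {w / h:.1f}:1")  # ~34.1 x 5.9 mm, ~5.8:1
```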

Cutting the number of HBM stacks in half means evaluating a system with 2.5-5x the link bandwidth of Rome.
It's not clear how much of the overall DRAM bandwidth Rome can supply to a single chiplet. If it's between 0.5 and 1.0 of the overall bandwidth, something like 5x the link capability may stretch the capabilities of the IO die and the perimeter of a Rome-style chiplet.
Since we do not have a clear die shot of the involved silicon in Rome, it's hard to say how much area on the IO die 2.5-5x the links would take up, or how much perimeter it would need. Supplying one GPU chiplet with the bandwidth of 2 HBM stacks would be all or most of one side of the IO die, before considering the other GPU chiplet. (edit: at least if going by the drawn block diagram given by AMD)
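
Rough arithmetic behind those multiples, using round assumed figures (roughly 256 GB/s per HBM2 stack, roughly 205 GB/s of DDR4-3200 across Rome's eight channels); none of these are measured values for the actual parts.

```python
# Back-of-the-envelope for the "2.5-5x the link bandwidth of Rome" range.
# All figures are round assumptions, not measured values.
HBM2_STACK_GBPS = 256.0                    # ~256 GB/s per HBM2 stack (assumed)
gpu_chiplet_need = 2 * HBM2_STACK_GBPS     # one GPU chiplet fed by 2 stacks: 512 GB/s

rome_dram_gbps = 8 * 25.6                  # 8 channels of DDR4-3200: ~204.8 GB/s (assumed)
link_high = 1.0 * rome_dram_gbps           # if one chiplet's link can carry all of it
link_low = 0.5 * rome_dram_gbps            # if it can carry about half

print(gpu_chiplet_need / link_high)        # ~2.5x
print(gpu_chiplet_need / link_low)         # ~5x
```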

At least for current products and this incoming generation of MCMs, it doesn't seem like the technology has advanced enough to make this as practical as it might be for CPUs like Rome.

Could lower voltage logic maintain signal integrity?
The on-package links already do operate at lower voltages, or at least as low as the engineers at a given vendor have managed while still being able to get usable signals from them. The reduced wire lengths and reduced error handling requirements bring the power cost per bit much lower than the inter-socket connections.
There may be future technologies that reduce these downsides, although AMD's projections for GPU MCMs have included interposers rather than assuming sufficient scaling for package interconnects.
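
As a hedged sense of scale: figures on the order of 2 pJ/bit have been cited for Zen's on-package IF links versus high single digits socket-to-socket. Treating both as assumptions rather than specs for any GPU product, moving HBM-class bandwidth across the package still costs real watts, which fits with the projections leaning on interposers.

```python
# Hedged estimate of link power at GPU-class bandwidth. The pJ/bit figures
# are ballpark assumptions in line with what has been cited for Zen's
# Infinity Fabric links, not specs for any GPU product.
PJ_PER_BIT_ON_PACKAGE = 2.0
PJ_PER_BIT_SOCKET_TO_SOCKET = 9.0

def link_watts(gbytes_per_s: float, pj_per_bit: float) -> float:
    bits_per_s = gbytes_per_s * 1e9 * 8
    return bits_per_s * pj_per_bit * 1e-12

# Streaming two HBM2 stacks' worth of bandwidth (512 GB/s) between dies:
print(link_watts(512, PJ_PER_BIT_ON_PACKAGE))        # ~8 W
print(link_watts(512, PJ_PER_BIT_SOCKET_TO_SOCKET))  # ~37 W
```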
 
What about global thread scheduling - wouldn't it require significant inter-die traffic?
Insignificant compared to the aggregate bandwidth of all the cores' data. Should be within IF's capability. Latency possibly a concern, but GCN overlaps waves within a CU which should mask it.
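
A hedged way to see the ratio; the packet and working-set sizes below are illustrative assumptions, not GCN figures.

```python
# Why global scheduling traffic should be small next to the data the waves move.
# Both sizes are illustrative assumptions, not measured GCN values.
DISPATCH_BYTES_PER_WAVE = 128        # assumed descriptor/state per wavefront launch
THREADS_PER_WAVE = 64
BYTES_TOUCHED_PER_THREAD = 256       # assumed modest per-thread working set

data_bytes = THREADS_PER_WAVE * BYTES_TOUCHED_PER_THREAD   # 16 KiB per wave
print(DISPATCH_BYTES_PER_WAVE / data_bytes)                # ~0.008, i.e. under 1%
```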

So, again, a 14 nm IO die, one HBM stack, and 3 GPU chiplets?

What kind of specialization the different chiplets could have - something like full and simple ALUs from the recent patent application?
As a high end product possibly. Goal would be chiplets with radically different clock domains. That 3+1 config could work, but more bandwidth would seem ideal from a 2nd HBM stack.

I haven't seen that patent yet, but it looks similar to something I theorized over a year ago with scalar processors and cascaded SIMDs. Raja did mention someone was really close at the time. That could have been it. I theorized one or more scalars feeding quads in a SIMD, with each quad allowing full indirection to avoid writing out results. That goes along with the Nvidia study where most data was reused immediately or within 3-4 cycles. I was only assuming the chiplets were specialized for max clocks or low power. That ALU change could just as easily apply. Deunify pixel and vertex shaders, for example. It could apply to any stage of the pipeline that didn't explode the volume of data.

Not unlike what Nvidia is doing with tensors and DLSS, but with chiplets. The simpler example I think would just be a single large GPU with one fast chiplet/coprocessor hanging off to the side. Same as current design, but with added chiplet and some specialized domains. Pixels, with texturing and ROPs, on the primary chip should account for most of the bandwidth needs of the GPU.

I believe IF was already usable as a bridge between processing clusters in Vega. This should just be a step further, which may make sense for Navi. Integrating IF has to do something beyond connecting the GPU to the memory controller within the chip. Or it was there with a future design in mind. Vega in theory could have a few CUs dedicated to one stage of the pipeline.
 
What kind of specialization the different chiplets could have - something like full and simple ALUs from the recent patent application?
The primary goal of that patent is to better utilize a vector register file capable of 4 reads and 4 writes in a 4-cycle cadence. The physical structures and power consumption of the register file are sized for the peak requirements of the vector units, but in general only half of that bandwidth is used with the single-issue vector path. The area and power impact of the register file is proportional to its over-engineered size, and that is a scaling barrier for the CU and GPU overall.
The patent outlines a VLIW2 instruction stream that has one standard ALU and a core ALU that performs instructions with fewer operands. There is hardware for selecting either a VLIW2 from one thread or selecting one VLIW element from two, and an output cache that tries to avoid writes back to the vector register file in order to reduce the power cost and reduce the burden on the register file due to there being twice as many outputs per cycle.
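
The port bookkeeping that motivates this, as I read it (the operand mix of the paired op is my assumption, not something the patent spells out in these terms):

```python
# Register file ports available per 4-cycle cadence, per the description above.
RF_READS, RF_WRITES = 4, 4

def port_utilization(reads_used: int, writes_used: int) -> float:
    return (reads_used + writes_used) / (RF_READS + RF_WRITES)

# Single-issue FMA: 3 source reads, 1 destination write.
print(port_utilization(3, 1))          # 0.5 -> half the built bandwidth sits idle

# VLIW2 pairing: the same FMA plus a core-ALU op that pulls most of its
# operands from the output cache / forwarding network instead of the RF
# (assumed here as 1 extra read and 1 extra write actually hitting the RF).
print(port_utilization(3 + 1, 1 + 1))  # 0.75 -> better use of what was built
```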

That seems to be addressing a design barrier for the GPU architecture in general, so a chiplet that specializes by not having it doesn't seem to be specializing in a positive way.
 
As far as the L2 is concerned, there is currently no protocol, nor do they need one, because there is only one destination and minimal interference from other clients.
Coherence in GCN's L2 is handled by there only being one physical place that a given cache copy can be in, the L2 slice assigned to a subset of the memory channels.

Coherence between the GPU and CPU space is one-sided. The CPU side can invalidate and respond with copies in the CPU caches--which are designed by the experts. The GPU side with the GCN-level expertise cannot be snooped and cannot use its cache hierarchy to interact with the CPUs.

What is "coherent" as far as GCN goes is that a client be write-through to the L2, which only the CUs do.
For global coherence, GCN works by skipping the L1 and L2 caches entirely and writing back to DRAM, since none of them can participate in system memory coherently.
I get your points, but we're looping back to the original statement that the cache subsystem would have to be redesigned to allow ccNUMA, besides the need to provide enough inter-die link bandwidth.

I'd just assume RTG will fully use the expertise that exists within their own organization, should they really decide to go this route.

At least for current products and this incoming generation of MCMs, it doesn't seem like the technology has advanced enough to make this as practical as it might be for CPUs like Rome.
There may be future technologies that reduce these downsides, although AMD's projections for GPU MCMs have included interposers rather than assuming sufficient scaling for package interconnects.

Well there are new endeavors at TSMC, like wafer-level packaging with vertically stacked controller/memory packages, though they are still a few years away.
Goal would be chiplets with radically different clock domains
Why not just de-clock or power-off some parts of the GPU?

I believe IF was already usable as a bridge between processing clusters in Vega. This should just be a step further, which may make sense for Navi. Integrating IF has to do something beyond connecting the GPU to the memory controller within the chip.

Vega 20 does include external IF links.

Considering how Lisa Su responded to the direct question on GPU chiplets at the CES press party, they may be doing some research to that end - but it's unlikely to be on the list of design requirements for Navi...
 
Why not just de-clock or power-off some parts of the GPU?
Helps, but not to the extent of a power optimized process. If already pursuing chiplets, avoiding a performance process may have a benefit as power is generally the limiting factor.

Vega 20 does include external IF links.
I meant internally and routing more traffic over IF. Allowing the current external links to share or possibly offload some functionality. Not currently usable in any meaningful way, but a stepping stone towards some future capability.
 
The primary goal of that patent is to better utilize a vector register file capable of 4 reads and 4 writes in a 4-cycle cadence. The physical structures and power consumption of the register file are sized for the peak requirements of the vector units, but in general only half of that bandwidth is used with the single-issue vector path. The area and power impact of the register file is proportional to its over-engineered size, and that is a scaling barrier for the CU and GPU overall.
As a side note - I'm still slightly confused why AMD uses 1R/1W SRAMs rather than shared port 1RW SRAMs if they admit the former is over-specced. The only reason I can think of for both NV and AMD using 1R/1W for practically everything is that maybe it scales up to higher clocks than 1RW? I've never worked on GPU designs that clocked up to 2GHz so I'm not quite sure...

I wouldn't call it a "scaling barrier" by the way - I'd call it an efficiency barrier. It doesn't prevent scaling the GPU beyond 64 CUs; it just reduces the efficiency, whereas the current top-level GCN architecture does limit scaling beyond 64 CUs and is a scaling barrier. Sorry if that feels a bit pedantic, I trust you know what you're saying, I just want to avoid misunderstanding by others :)

Regarding the Super-SIMD/VLIW2 patent, I've got a few thoughts about it I'll post in the Navi 2019 thread...
 
I get your points, but we're looping back to the original statement that the cache subsystem would have to be redesigned to allow ccNUMA, besides the need to provide enough inter-die link bandwidth.
I believe the point of requiring a cache system redesign was in response to an earlier statement saying all technical issues had been resolved with IF.

I'd just assume RTG will fully use the expertise that exists within their own organization, should they really decide to go this route.
If AMD considers it worthwhile. If RTG wants x86 system designers, it would happen when AMD decides its x86 side has enough to spare and that investing in that specific direction for graphics is worthwhile. At least for now, the slowed cadence of investment into RTG and oblique references from former employees concerning resources make for some doubt. For the architecture, if I recall from past papers on HPC APUs and TOP-PIM, the graphics chiplet's architecture seemed at a high level somewhat like Tahiti, and significant software work was promised for handling an architecture that did not appear to be approaching x86 all that quickly.
So I guess we wait for some kind of disclosure about what direction RTG takes, and if AMD's pattern of investment shows changes at the far end of the pipeline.


As a side note - I'm still slightly confused why AMD uses 1R/1W SRAMs rather than shared port 1RW SRAMs if they admit the former is over-specced.
The read ports are somewhat overspecced relative to average port usage, but would be sized to support AMD's traditional emphasis on high peak utilization of vector units with 3-operand FMA operations. The patent shows an additional path out of the register file and bypass networks for vector IO, presumably for functions like concurrent memory accesses reading from the register file. Updates into the file from outside units can proceed more readily without blocking the VALUs, but once the decision is made to provide enough read bandwidth by having separate ports, the write bandwidth comes along by design.

Possibly, there are benefits to keeping the SRAM and access hardware in line with common architectures that have at least 1R and 1W. There may be benefits from the standpoint of simplifying the process for avoiding bank contention, and the read and write paths can potentially run parts of their stages concurrently rather than consistently blocking each other. Tuning circuits and transistor choices may also be something more readily specialized if the paths are separate.

I recall that Nvidia had a patent for a double-pumped shared read/write port SRAM, which was mentioned in the console threads back at the Xbox One launch. I wouldn't know if it were used in designs at the time or now. I'd imagine there would be some challenges at the speeds the units operate now, and SRAM physical design is getting more complicated without trying to combine frequently separate ports.

I wouldn't call it a "scaling barrier" by the way - I'd call it an efficiency barrier. It doesn't prevent scaling the GPU beyond 64 CUs; it just reduces the efficiency, whereas the current top-level GCN architecture does limit scaling beyond 64 CUs and is a scaling barrier. Sorry if that feels a bit pedantic, I trust you know what you're saying, I just want to avoid misunderstanding by others :)
I guess I don't see power efficiency as a subordinate consideration in a power-limited scenario. Power has been considered one of the paramount scaling barriers in the field, whereas I see GCN's SE, CU, and RBE limits as microarchitectural quirks that AMD has indicated it could change if it wanted.
AMD's patent states the register file on its own can take 30% of the area and power budget of a SIMD compute unit. Since that file is inevitably tied to at least the minimum SIMD VALU complement, that minimum set would be a dominant portion of the CU's power budget. The physical reality that AMD's facing is a more profound question than whether it can be bothered to tweak SE counts.
A design trying to get increased throughput by expanding the "footprint" of vector and register resources would have problems even if GCN had higher architectural limits.
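
Putting loose numbers on it (the 30% share is from the patent as quoted above; the per-CU power is a placeholder assumption, not a measured value):

```python
# Rough scale of the register file's share in a power-limited, 64 CU design.
RF_SHARE = 0.30        # from the patent: RF can be ~30% of a SIMD CU's area/power
CU_POWER_W = 3.0       # assumed placeholder for per-CU power, not a measured value
NUM_CUS = 64

compute_power = NUM_CUS * CU_POWER_W            # ~192 W in the CU array
rf_power = compute_power * RF_SHARE             # ~58 W just in register files
print(compute_power, rf_power)
# Doubling vector throughput by doubling the VALU/RF footprint adds tens of
# watts of register file alone, before any other structure grows.
```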

Since part of the patent includes an output cache that has some of the capabilities of the VLIW reuse flags and Nvidia's operand reuse cache, the way the register file can impede scaling seems like a bigger issue than just GCN.
 
At least for now, the slowed cadence of investment into RTG and oblique references from former employees concerning resources make for some doubt.

There are rumors/reports from last year claiming that as Zen 2 development wrapped up, AMD started sending CPU engineers towards RTG to help on their core clock scaling issues.
Assuming Zen 2's "wrapping up" was close to Vega 7nm, I'm guessing we won't see the results of that until at least a Navi refresh.
 
The primary goal of that patent is to better utilize a vector register file capable of 4 reads and 4 writes in a 4-cycle cadence. The physical structures and power consumption of the register file are sized for the peak requirements of the vector units, but in general only half of that bandwidth is used with the single-issue vector path. The area and power impact of the register file is proportional to its over-engineered size, and that is a scaling barrier for the CU and GPU overall.
The patent outlines a VLIW2 instruction stream that has one standard ALU and a core ALU that performs instructions with fewer operands. There is hardware for selecting either a VLIW2 from one thread or selecting one VLIW element from two, and an output cache that tries to avoid writes back to the vector register file in order to reduce the power cost and reduce the burden on the register file due to there being twice as many outputs per cycle.

That seems to be addressing a design barrier for the GPU architecture in general, where a chiplet specializing in not having it doesn't seem to be specializing in a positive way.

Recent patent
STREAM PROCESSOR WITH HIGH BANDWIDTH AND LOW POWER VECTOR REGISTER FILE
 

That looks like it explains changes in an area surrounding the ALU portion of the super-SIMD patent, which might pertain to the Navi thread. The more aggressive forwarding and coalescing handled by the operand network and the behavior defined for the output cache seem to fill out the description of the parts of the other patent that weren't specifically about the SIMD organization.

I am thinking through the two documents to see how they work together, although my thoughts might go in the other thread since I have not seen descriptions of Vega's hardware indicating it has these features.
 
That looks like it explains changes in an area surrounding the ALU portion of the super-SIMD patent, which might pertain to the Navi thread. The more aggressive forwarding and coalescing handled by the operand network and the behavior defined for the output cache seem to fill out the description of the parts of the other patent that weren't specifically about the SIMD organization.

I am thinking through the two documents to see how they work together, although my thoughts might go in the other thread since I have not seen descriptions of Vega's hardware indicating it has these features.
Mark Leather, one of the signers of the new stream processor patent, is the chief architect of the new AMD efficiency-centric architecture and, as his LinkedIn says, finished its design in December 2017. More than enough time to be taped out into Navi.
 
Are there any news/rumors of a next-gen workstation card based on Vega 11/12? (I suppose it'll be named WX 3200/4200/5200...)

Nope. The only workstation card you'll see with those chips is the one in the MacBook Pro.
 
So it seems AMD reduced the Vega 56's MSRP to $275.
Just in time to cockblock the GTX 1660 Ti, which is reportedly coming out today.

Just a gimmick. Very limited supply that sold out in minutes.

Nvidia and their partners will continue to offer the GTX 1660 Ti at $279 going forward whereas AMD will not.

Instead, AMD’s competitor for the GTX 1660 Ti looks like it will be the Radeon RX Vega 56. The company sent word last night that they are continuing to work with partners to offer lower promotional prices on the card, including a single model that was available for $279, but as of press time has since sold out. Notably, AMD is asserting that this is not a price drop

https://www.anandtech.com/show/13973/nvidia-gtx-1660-ti-review-feat-evga-xc-gaming
 
So AMD tried to cock block the GTX 1660 Ti with a momentary discount, and wasn't even brave enough to claim it wasn't momentary?

Wow, that's lame.
 
So AMD tried to cock block the GTX 1660 Ti with a momentary discount, and wasn't even brave enough to claim it wasn't momentary?

Wow, that's lame.
To be specific, AMD actually didn't claim any price drop. They only said that specific Vega 56 model was available at Newegg at that particular price.
 