AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

I remember it from an interview with Raja Koduri.

I think the logic was that if you have enough memory bandwidth to spare, then general performance won't be affected by more frequent memory swaps through the PCIe bus.


What is the bandwidth of the PCIe bus, and what is the memory bandwidth? (PCIe 3.0 really shows no benefit over PCIe 2.0, at least not for single cards.) That should answer your question right there: what the limiting factor is, and why that statement was BS. And yeah, it was AMD that stated that too.

https://www.techpowerup.com/reviews/AMD/R9_Fury_X_PCI-Express_Scaling/3.html
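
To put rough numbers on it (back-of-the-envelope only: 8 GT/s per lane with 128b/130b encoding for PCIe 3.0, versus Fury X's quoted 512 GB/s of HBM):

Code:
#include <cstdio>

int main() {
    // Raw PCIe 3.0 per-lane rate and 128b/130b line coding.
    const double gt_per_s = 8.0;             // GT/s per lane
    const double encoding = 128.0 / 130.0;   // usable fraction after line coding
    const int    lanes    = 16;

    const double pcie3_x16 = gt_per_s * encoding * lanes / 8.0;  // GB/s
    const double fury_hbm  = 512.0;                              // GB/s (Fury X HBM)

    std::printf("PCIe 3.0 x16: %.2f GB/s\n", pcie3_x16);         // ~15.75
    std::printf("Fury X HBM  : %.0f GB/s (~%.0fx the bus)\n",
                fury_hbm, fury_hbm / pcie3_x16);                 // ~33x
}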
 
The OS manages all the process VAS. Migrating pages from a physical location to another while maintaining coherency cannot bypass the OS in any way.
Largely for security (obvious arbitrary memory read issues), but at a lower level the hardware should be able to perform the operations without the OS being involved. It need not be coherent either. What I'm saying is that if a CPU resided on the GPU, at a low level it would have full access to system memory without requiring a host CPU to be involved. On P100, as well as most GCN models, what you suggested is likely accurate. The exception being the higher bandwidth links (IBM/NVLink and AMD's APUs) that likely have more robust interfaces with the system memory controller. The SSGs for example obviously don't have to go through the OS to access their local storage. This was demonstrated by AMD to be superior to even reading from system memory over PCIe as I recall.

Most mobile games use double rate FP16 extensively. Both Unity and Unreal Engine are optimized for FP16 on mobiles. So far no PC discrete card has gained noticeable perf from FP16 support, so these optimizations are not yet enabled on PC. But things are changing. PS4 Pro already supports double rate FP16, and both of these engines support it. Vega is coming soon to consumer PCs with double rate FP16. Intel's Broadwell and Skylake iGPUs support double rate FP16 as well. Nvidia also supports it on P100 and on mobiles, but they have disabled support on consumer PC products. As soon as games show noticeable gains on competitor hardware from FP16, I am sure NV will enable it in future consumer products as well.
I'd imagine Volta, possibly a Pascal refresh, will have the support enabled if only for marketing.
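
For context, "double rate" means two 16-bit values packed into each 32-bit register, with one instruction operating on both halves per cycle. A naive CPU-side sketch of just the packing (truncating conversion, normals only, no NaN/denormal handling, purely illustrative):

Code:
#include <cstdint>
#include <cstring>
#include <cstdio>

// Naive float32 -> float16 conversion: normals only, round-to-zero,
// no NaN/denormal handling. Just enough to show the bit layout.
static uint16_t to_half(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    uint32_t sign = (bits >> 16) & 0x8000u;
    int32_t  exp  = static_cast<int32_t>((bits >> 23) & 0xFFu) - 127 + 15;
    uint32_t mant = (bits >> 13) & 0x3FFu;
    if (exp <= 0)  return static_cast<uint16_t>(sign);            // flush to zero
    if (exp >= 31) return static_cast<uint16_t>(sign | 0x7C00u);  // clamp to inf
    return static_cast<uint16_t>(sign | (static_cast<uint32_t>(exp) << 10) | mant);
}

// Two FP16 values in one 32-bit word: the layout a packed-math ALU operates
// on, executing the same op on both halves in a single cycle.
static uint32_t pack_half2(float lo, float hi) {
    return static_cast<uint32_t>(to_half(lo)) |
           (static_cast<uint32_t>(to_half(hi)) << 16);
}

int main() {
    std::printf("half2(1.5, -2.25) = 0x%08X\n", pack_half2(1.5f, -2.25f)); // 0xC0803E00
}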

More immediately, would it be suitable for accelerating tessellation? The marketing material suggests over twice the geometry throughput. I could see packed FP16 being sufficient accuracy for offsets within a sufficiently small patch. Switch to FP16 for generating all the triangles, then convert back into the domain of world coordinates for whatever survives the culling. Maybe go to FP16 once you get past the 8x(?) tessellation level or determine the primitive to be sufficiently small.
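
Rough numbers on the precision side (assuming the patch domain is parameterised to [0, 1] and picking a hypothetical 10 cm world-space patch, just for an order of magnitude):

Code:
#include <cstdio>
#include <cmath>

int main() {
    // FP16 has a 10-bit mantissa: step size is 2^-11 just below 1.0 and
    // 2^-10 just above it. Use the coarser step as a conservative bound.
    const double step = std::ldexp(1.0, -10);      // ~0.00098

    // Hypothetical 10 cm patch in world space, domain parameterised to [0, 1].
    const double patch_mm = 100.0;
    std::printf("FP16 step near 1.0     : %.5f\n", step);
    std::printf("worst-case offset error: ~%.2f mm\n", patch_mm * step);  // ~0.10 mm
}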

I certainly hope that improvements like this get adopted by other IHVs, just like Nvidia adapted Intel's PixelSync and Microsoft made it a DX12.1 core feature (ROV). Same for conservative raster.
Programming question for you. The primitive shaders look like a more specialized compute shader, along the lines of the GeometryFX AMD presented and the culling mentioned in those papers; that seems to be the origin of the primitive shaders. So how practical would it be to emulate the capability on existing hardware? I'm guessing the issue is performance, as there is insufficient parallel work? Would this be something where that flexible scalar, for example, could reasonably enable the capability?
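
For reference, the GeometryFX-style filtering I mean amounts to a per-triangle test that writes out a compacted index buffer before the hardware front end ever sees it; on the GPU it runs as a compute shader with an atomic counter. A rough serial sketch of the idea (backface test only; the real pass adds frustum and small-primitive tests):

Code:
#include <cstdint>
#include <cstdio>
#include <vector>

struct Vec4 { float x, y, z, w; };

// Clip-space backface test via the sign of the projected triangle's area.
// (A GeometryFX-style pass also adds frustum and small-primitive tests.)
static bool is_backfacing(const Vec4& a, const Vec4& b, const Vec4& c) {
    float det = (a.x * b.y - a.y * b.x) * c.w +
                (b.x * c.y - b.y * c.x) * a.w +
                (c.x * a.y - c.y * a.x) * b.w;
    return det <= 0.0f;  // assumes counter-clockwise front faces
}

// Filter pass: keep only surviving triangles in a compacted index buffer.
// On the GPU this runs as a compute shader using an atomic counter.
static std::vector<uint32_t> filter_triangles(const std::vector<Vec4>& clip_pos,
                                              const std::vector<uint32_t>& indices) {
    std::vector<uint32_t> surviving;
    surviving.reserve(indices.size());
    for (size_t i = 0; i + 2 < indices.size(); i += 3) {
        const Vec4& a = clip_pos[indices[i]];
        const Vec4& b = clip_pos[indices[i + 1]];
        const Vec4& c = clip_pos[indices[i + 2]];
        if (!is_backfacing(a, b, c)) {
            surviving.push_back(indices[i]);
            surviving.push_back(indices[i + 1]);
            surviving.push_back(indices[i + 2]);
        }
    }
    return surviving;  // the normal draw then consumes this smaller buffer
}

int main() {
    std::vector<Vec4> pos = { {0, 0, 0, 1}, {1, 0, 0, 1}, {0, 1, 0, 1} };
    std::vector<uint32_t> idx = { 0, 1, 2,   0, 2, 1 };  // one front, one back facing
    std::printf("surviving indices: %zu of %zu\n",
                filter_triangles(pos, idx).size(), idx.size());  // 3 of 6
}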

It is really, really common to have a USB debugger connected to a microprocessor's serial debug bus on a development board, which can gain access to internal state or perform whatever manipulation was designed in. I sincerely doubt it is anything new at all...
Common, but I can't say I've come across many prototype boards with an extension and USB3 port affixed. Normally boards just add headers for USB and/or a separate breakout board if needed. That works for development and quality control. The only time I've seen actual ports were for dev kits or training devices that were more widely distributed.

It could be power related, but I doubt there are so many lanes coming off the package to warrant the extension. A simple serial interface and headers would make more sense. The infinity fabric in theory doesn't extend to the PCB so the extension wouldn't be required to wire that in. My guess would be a breakout board for all the spare PCIE or IO lanes on the board. Something partners can use to test a variety of interfaces. Thunderbolt, USB, fibre channel, storage, etc with a riser or ribbon cable. Something where all the headers on the board would be impractical.

Vega 20 says PCIe 4.0 host, meaning it might be just a 4-lane host for future NVMe SSDs. The current Fiji SSG uses two NVMe Samsung 950 Pros in RAID0 (theoretical 6.4 GB/s read speeds) through a PLX controller. A single PCIe 4.0 x4 link should be able to reach about the same bandwidth with just one SSD.
4 lanes is sufficient for most NVMe devices. I'd say there is a good chance Vega has more than just 20-24 lanes without using a bridge chip. I'm guessing 32 lanes: 16 for the host adapter and then another 16 for IO on the board. That should be sufficient for the fastest fibre channel currently available (128 Gbit/s, 12.8 GB/s in each direction). The next Thunderbolt interface is 10 GB/s for 8k UHD. Odds are the board won't run 8k UHD and a SAN at the same time, so that should be sufficient IO without a lot of bridge chips. It also works for communicating with other chips. Vega 20 in theory would use the next iteration of the standards with everything doubled.
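
Quick sanity check on the 4-lane figure (assuming PCIe 4.0 keeps 128b/130b encoding at 16 GT/s):

Code:
#include <cstdio>

int main() {
    const double gt_per_s = 16.0;            // PCIe 4.0 per-lane rate
    const double encoding = 128.0 / 130.0;   // line-code efficiency
    const double per_lane = gt_per_s * encoding / 8.0;   // GB/s per lane

    std::printf("PCIe 4.0 x4 : %.1f GB/s\n", per_lane * 4);   // ~7.9
    std::printf("PCIe 4.0 x16: %.1f GB/s\n", per_lane * 16);  // ~31.5
    // The Fiji SSG's RAID0 pair is quoted around 6.4 GB/s reads, so one
    // x4 PCIe 4.0 SSD could plausibly match it.
}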

I remember it from an interview with Raja Koduri.

I think the logic was that if you have enough memory bandwidth to spare, then general performance won't be affected by more frequent memory swaps through the PCIe bus.
This logic is BS. I still maintain that someone just pulled it out of their behind or at least misinterpreted what was said by AMD.
It was the basis for all those charts of how much RAM games actually used, along with their future box marketing of bandwidth in addition to capacity.

EDIT: Around 4:20 in the video linked above.
 
Well, I didn't say nothing changed. What I'm saying is this kind of information already exists, as publicly shown by these shiny rectangular boxes, so it isn't surprising at all to have the ability to probe into the chip and intercept these signals, or the on-chip microcontroller, for bring-up or whatever. It is just that somebody happened to ask "what is this" in an interview, while generally no one ever bothers to talk about it because it isn't quite relevant to the marketing of the end product at all.

So if you think you are sarcastic, maybe it is not. ;)

No, it isn't at all surprising, sure. But in that interview where the recorder battery ran out, Raja went on speaking along the lines of how that debugging functionality is his favorite part of the chip, even though it won't ever reach Vega customers. That would be a surprising thing to say if it were just something inherited from previous generations, especially given the context of these previews.

So yeah, I wish I knew what he meant. The interviewer should have reminded him of PowerTune just like you did. Maybe then he'd have accepted the challenge and talked more.
 
Largely for security (obvious arbitrary memory read issues), but at a lower level the hardware should be able to perform the operations without the OS being involved. It need not be coherent either.
Not just for security (which is enforced by the hardware anyway). AMD64 provides an architected TLB, but in the end it is the OS that owns the page table. Let's say you migrate a page table entry that belongs to a swappable page without involving the OS: you would at least be racing against the swapping or migration process (in a NUMA system). Good luck with that.

Though of course if you are dealing with managed API allocations (CUDA/D3D/Vulkan/OCL/HSA coarse grained), the requirements can be relaxed to only what the API guarantees. It's just that I don't think AMD aims that low. It is nice to have the entire process VAS available to compute, to keep it simple (KISS), at least for the dGPU.

What I'm saying is that if a CPU resided on the GPU, at a low level it would have full access to system memory without requiring a host CPU to be involved. On P100, as well as most GCN models, what you suggested is likely accurate. The exception being the higher bandwidth links (IBM/NVLink and AMD's APUs) that likely have more robust interfaces with the system memory controller. The SSGs for example obviously don't have to go through the OS to access their local storage. This was demonstrated by AMD to be superior to even reading from system memory over PCIe as I recall.
To be clear, I didn't mean that accesses in the cache hierarchy requiring the host CPU to be involved. Apparently, if the NVMe SSD is mapped into the physical address space via PCIe BAR, a virtual page can always target the location directly. What I meant is that — in a compute context — hot page migration from anywhere in the system triggered by the address translation hierarchy requires host CPU involvement.
 
It was the basis for all those charts of how much RAM games actually used, along with their future box marketing of bandwidth in addition to capacity.

EDIT: Around 4:20 in the video linked above.
What does a Vega interview have to do with the Fiji tangent to which I was responding? Yes, Vega changes this because it's able to swap stuff in and out of system memory at much finer granularity (you don't need to swap the whole 8k x 8k texture mip chain if the GPU only accesses level 8 of the mip chain). That's not something that applies to Fury.
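
To put numbers on that granularity (toy example, assuming an uncompressed RGBA8 8k x 8k texture):

Code:
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t dim = 8192, bpp = 4;       // 8k x 8k, RGBA8 (uncompressed)

    uint64_t full_chain = 0;
    for (uint64_t d = dim; d >= 1; d /= 2)    // every mip level resident
        full_chain += d * d * bpp;

    const uint64_t mip8 = (dim >> 8) * (dim >> 8) * bpp;   // level 8 only: 32x32

    std::printf("whole mip chain: %.1f MiB\n", full_chain / (1024.0 * 1024.0)); // ~341 MiB
    std::printf("mip level 8    : %llu bytes\n", (unsigned long long)mip8);     // 4096
}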
 
In regards to the power connectors, they likely won't tell us much. Any card implementing Thunderbolt should be capable of outputting 100W over the display connector. 100W will obviously have an impact on the board's power requirements. Dual 8-pins would probably be reasonable, and no, I don't expect most Thunderbolt connections to be drawing anywhere near that, yet it is the specification.

To be clear, I didn't mean that accesses in the cache hierarchy requiring the host CPU to be involved. Apparently, if the NVMe SSD is mapped into the physical address space via PCIe BAR, a virtual page can always target the location directly. What I meant is that — in a compute context — hot page migration from anywhere in the system triggered by the address translation hierarchy requires host CPU involvement.
In existing models I'd agree, but an APU (possibly IBM/NVLink) specifically might have a more relaxed requirement. The GPU may very well reside in host address space without translation. At a low level it may very well think it is a CPU and operate accordingly. This is highly speculative, but with ROCm and an APU with only system memory, it's not necessarily unreasonable.

What does a Vega interview have to do with the Fiji tangent to which I was responding?
While not mentioned directly, Raja alluded to current 4GB cards that outperform 8GB cards. Might have been just before the 4:20 mark. I would take that to be Fiji with HBM being distinctly different.
 
While not mentioned directly, Raja alluded to current 4GB cards that outperform 8GB cards. Might have been just before the 4:20 mark. I would take that to be Fiji with HBM being distinctly different.
A 4GB card outperforming an 8GB card (even when a game is allocating, say, 6GB of RAM) has nothing to do with the 4GB card using memory more efficiently (outside what I was saying about GCN1/2 differences to GCN3). It has to do, as mentioned by Raja, with the fact that games do not actively use the entire, say, 6GB of allocated memory all the time. And it flies in the face of the frame buffer size craze of the last few years. None of this has any basis in the idea that you can do this now because HBM is faster (with the exception of skipping DCC on surfaces you can't texture from). You can do this on classic GDDR just as well with proper GPU support.
 
In existing models I'd agree, but an APU (possibly IBM/NVLink) specifically might have a more relaxed requirement. The GPU may very well reside in host address space without translation. At a low level it may very well think it is a CPU and operate accordingly.
Not sure what you really meant. If the GPU wants per-process shared virtual memory and ideally cache coherency (in a compute context), it needs to interoperate with the host translation system through the IOMMU, and translation is essential. Being an APU wouldn't make a difference. It is just like adding a CPU core or a NUMA processor whose MMU(s) have to participate in the coherence domain, and cooperate with other MMUs (cores & OS kernel) to maintain the coherency of the translation system.

This is highly speculative, but with ROCm and an APU with only system memory, it's not necessarily unreasonable.
For APU (and some dGPUs), the GPU can already access pageable system memory, regardless of being anonymous or I/O-mapped. The point is always that migrating a page for compute use (without changing the virtual address & breaking the "illusion" of coherency) from the system memory to the GPU local memory requires OS & driver intervention. It is not something that "can be solved by a new model", but a fundamental requirement even for NUMA SMP unless you are going to abolish OS-managed virtual memory and process isolation.
 
What does a Vega interview have to do with the Fiji tangent to which I was responding? Yes, Vega changes this because it's able to swap stuff in and out of system memory at much finer granularity (you don't need to swap the whole 8k x 8k texture mip chain if the GPU only accesses level 8 of the mip chain). That's not something that applies to Fury.

He mentioned 4GB cards that "beat the crap out of 8GB cards", so he was definitely talking about Fiji (which BTW has the same bandwidth as Vega 10). Vega 10 will most probably be 8GB cards unless Hynix start making 2-Hi stacks. He wasn't talking about Vega 10.
Some sentences later he mentions Megatexture, which AFAIK is a method to stream a texture/textures directly through the PCIe bus, instead of copying the whole thing to the VRAM.


This logic is BS. I still maintain that someone just pulled it out of their behind or at least misinterpreted what was said by AMD.
(...)
Which then got translated by the online world into what you're saying in your post. Which is BS.
Dude, you need to take a deep breath...
Yes, I may be wrong. I can't find the specific interview where I heard that but I don't seem to be the only one here who remembers it.


What is the bandwidth of the PCIe bus, and what is the memory bandwidth? (PCIe 3.0 really shows no benefit over PCIe 2.0, at least not for single cards.) That should answer your question right there: what the limiting factor is, and why that statement was BS. And yeah, it was AMD that stated that too.

https://www.techpowerup.com/reviews/AMD/R9_Fury_X_PCI-Express_Scaling/3.html

So you do agree that it was AMD who stated that, right?
As for that article, AFAIK average FPS values see little impact with PCIe bandwidth unless you go for really tight values.
What you do see is substantially lower minimum FPS and/or worse frametimes in the 99th percentile, meaning that tight PCIe bandwidth causes stuttering.
 
So you do agree that it was AMD who stated that, right?
As for that article, AFAIK average FPS values see little impact with PCIe bandwidth unless you go for really tight values.
What you do see is substantially lower minimum FPS and/or worse frametimes in the 99th percentile, meaning that tight PCIe bandwidth causes stuttering.

It was AMD that stated that it was the reason why 4 GB was enough for Fiji, which is just BS though lol.

Yeah, Vega has features where it can swap without the penalty of the PCIe bottleneck, but Fiji doesn't have them. And to that effect, we still don't know how useful it will be for Vega in current games. I suspect not, as it seems to need primitive shaders to take advantage of the high bandwidth cache in in-game situations.

The bus is the bottleneck when streaming textures or assets from system memory to video memory; the bandwidth that Fiji has wasn't even being used to its fullest potential when games had to stream data over the PCIe bus.
 
A 4GB card outperforming an 8GB card (even when a game is allocating, say, 6GB of RAM) has nothing to do with the 4GB card using memory more efficiently (outside what I was saying about GCN1/2 differences to GCN3). It has to do, as mentioned by Raja, with the fact that games do not actively use the entire, say, 6GB of allocated memory all the time. And it flies in the face of the frame buffer size craze of the last few years. None of this has any basis in the idea that you can do this now because HBM is faster (with the exception of skipping DCC on surfaces you can't texture from). You can do this on classic GDDR just as well with proper GPU support.
The argument being made was that the quality of RAM was more important than the capacity, as it has been poorly utilized. That's the only reason Raja would suggest putting bandwidth figures on boxes in the future. Adding capacity is unlikely to improve performance given the paging model; adding bandwidth, however, should.

For APU (and some dGPUs), the GPU can already access pageable system memory, regardless of being anonymous or I/O-mapped. My point is always that migrating a page (without changing the virtual address) from the system memory to the GPU local memory requires OS & driver intervention. It is not something that "can be solved by a new model", but a fundamental requirement unless you are going to abolish OS-managed virtual memory and process isolation.
This wouldn't be migrating pages at all but working out of the same pool. Coherency wouldn't exist in that model.
 
This wouldn't be migrating pages at all but working out of the same pool. Coherency wouldn't exist in that model.
Promoting pages to GPU local memory in a UMA APU still has a slight advantage: bypassing the coherence domain, and improved bandwidth. Either way, coherency is essential to achieve the optimal GPU-as-first-class-citizen programming model that is envisioned by both AMD and Nvidia. Not for managed API buffers though.
 
The argument being made was that the quality of RAM was more important than the capacity, as it has been poorly utilized. That's the only reason Raja would suggest putting bandwidth figures on boxes in the future. Adding capacity is unlikely to improve performance given the paging model; adding bandwidth, however, should.
I'm not arguing against this. I'm arguing that this applies to Vega, though, and not to the current AMD lineup, Fiji included.
 
Vega 20 says PCIe 4.0 host, meaning it might be just a 4-lane host for future NVMe SSDs.
Another possibility is that, if this is a board enabled with peer-to-peer GMI links, the PCIe interface is the traditional connection to a host CPU.

The corresponding Greenland APU has the GPU connected to the CPU through GMI, and the CPU part being a 1st-gen Zen probably has a PCIe 3.0 host.
Greenland was supposedly the lead chip for Vega, while Vega 20 is labelled as being a 7nm chip in the slide. The 1/2 rate DP Vega 20 wouldn't appear to be the lead chip if AMD has physical Vega 10/11 chips without it.
The lack of ECC in Vega 10 is another missing feature that an HPC architecture like Greenland should have had.

Perhaps Greenland is still out there, but if it isn't cancelled, it for some reason isn't getting a server board presence.
The 7nm Vega 20 also throws AMD's Polaris roadmap into question. The power-efficiency gains from the process would make the Navi datapoint rather unimpressive.
 
Greenland was supposedly the lead chip for Vega, while Vega 20 is labelled as being a 7nm chip in the slide. The 1/2 rate DP Vega 20 wouldn't appear to be the lead chip if AMD has physical Vega 10/11 chips without it.
The lack of ECC in Vega 10 is another missing feature that an HPC architecture like Greenland should have had.

Perhaps Greenland is still out there, but if it isn't cancelled, it for some reason isn't getting a server board presence.
The 7nm Vega 20 also throws AMD's Polaris roadmap into question. The power-efficiency gains from the process would make the Navi datapoint rather unimpressive.

I think I've only ever seen Greenland inside the Zeppelin APU.

[Slides: Zeppelin server APU package with the Greenland GPU and HBM on an interposer alongside the CPU portion]

What if Vega 20 is simply Greenland outside the APU?
The only thing that doesn't match is the number of HBM stacks, but maybe the discrete version will simply use more stacks to access more HBM memory (maybe using half-rate speed to each stack to maintain bandwidth, if that's ever possible...).

EDIT: Scratch that. 4 TFLOPs obviously don't fit with Vega 20 either.
Maybe Zeppelin <-> Greenland was cancelled? Maybe it's Vega 11?
 
I think I've only ever seen Greenland inside the Zeppelin APU.
Per the second slide, Greenland and its HBM memory are on an interposer and separate from the CPU portion. They are part of an MCM package. Going by that, AMD's definition of APU has been stretched to the point of breaking, even for a marketing term.


What if Vega 20 is simply Greenland outside the APU?
It's the wrong node if using Zeppelin.

Vega 20 may be a re-implementation of Greenland on a more advanced node, possibly as a pipe cleaner given the aggressive timing of the next node from either TSMC or GF. Where this leaves Navi is curious. Given Navi 11's appearance at the bottom of the tier and its timing, perhaps the PS5 or next Xbox might echo some of it.
 
I'm not arguing against this. I'm arguing that this applies to Vega, though, and not to the current AMD lineup, Fiji included.
Does it though? Fiji in many cases appeared to be doing the same thing, but perhaps not as efficiently. Most other cards in AMD's lineup weren't in performance tiers where memory was likely to be a limit. Geometry issues aside, Fiji performance only dropped when using a lot of memory. Better paging would help, but I doubt the memory requirements would drop significantly in the games where it's an issue. The advantage for Vega, beyond the obvious specs, I'd expect to be a finer granularity in pages, although details on HBCC are a bit scarce. Doom for example seemed to perform rather well on Fiji, and its virtual texturing would seem a fair comparison to what AMD is after with the HBCC.
 
I think I've only ever seen Greenland inside the Zeppelin APU.

What if Vega 20 is simply Greenland outside the APU?
The only thing that doesn't match is the number of HBM stacks, but maybe the discrete version will simply use more stacks to access more HBM memory (maybe using half-rate speed to each stack to maintain bandwidth, if that's ever possible...).

EDIT: Scratch that. 4 TFLOPs obviously don't fit with Vega 20 either.
Maybe Zeppelin <-> Greenland was cancelled? Maybe it's Vega 11?
4 TFLOPs could very well mean DPFP, which gives you 8+ SP TFLOPs assuming a half rate DP. One might also consider that it would be packed with the 16-core CPU in the same package, which might mean a lower clocked bin to cope with the thermal density & socket limits.

By the way, the server APU was listed in the 2016-17 server roadmap, which already (kinda) has one of the three items scrapped, i.e. K12. The alleged target timeframe of Vega 20 is off a few quarters.
 
Does it though? Fiji in many cases appeared to be doing the same thing, but perhaps not as efficiently. Most other cards in AMD's lineup weren't in performance tiers where memory was likely to be a limit.
Everything already does that (to a point). There's no problem allocating 16 GB of textures from the DX11 or DX12 side, even if the graphics card has only 8 GB or even only 4 GB of RAM. If a game does that, it won't jump out at you from GPU-Z memory usage, or tank your performance. It will work just fine as long as the amount of memory the GPU is actually accessing per frame (or over a few frames) fits in GPU local memory. That's the amount we should be interested in. There is, however, no tool I'm aware of that will report this to you in an easy GPU-Z fashion.
AMD says only about half of allocated resources are accessed each frame (or series of frames). That's probably measured, but you can work it out without that as well: a game won't access all of its textures every frame, nor will it access all mip levels of each texture every frame.
There will come a point, however, when the GPU will need to access something that's not in its local memory. This presents a problem because the driver will first have to evict something large enough from GPU to system memory and upload the new resource from system memory to GPU memory (or access it directly from system memory at PCIe bandwidth). A resource can be hundreds of megabytes large! And there might not be just one resource you need to pick up. If the driver doesn't spot this far enough in advance (and remember, GPU drivers love having all the commands queued a few frames ahead), frame times will spike.

And yes, many games (not just Doom) are already treating GPU memory as a cache: they allocate some amount of memory for this and then copy smaller tiles into GPU memory. At the moment this is done completely in software, on APIs that are not really designed for it (DX11 I'd say is a pain; I don't know OpenGL that well, but I imagine it's better there). DX12 (and DX11 on Win 8.1) improves this by giving developers the ability to "allocate" a texture that's not completely resident, so game engines are able to upload only the tiles they are actually accessing.
Vega goes a step further in that it would appear to treat all resources like this transparently, independently of whether they are tiled by the API or not.
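
Stripped to its core, the software approach above is a tile pool with LRU eviction. A toy sketch of the engine-side bookkeeping (names made up; the actual miss path would be an upload over PCIe):

Code:
#include <cstdint>
#include <cstdio>
#include <list>
#include <unordered_map>

// Tile id = (texture, mip, tile-within-mip) packed into one key.
using TileId = uint64_t;

class TileCache {
public:
    explicit TileCache(size_t max_resident_tiles) : capacity_(max_resident_tiles) {}

    // Called for every tile the current frame samples from.
    // Returns true if the tile had to be "uploaded" (it was not resident).
    bool touch(TileId tile) {
        auto it = lookup_.find(tile);
        if (it != lookup_.end()) {                  // hit: move to MRU position
            lru_.splice(lru_.begin(), lru_, it->second);
            return false;
        }
        if (lru_.size() == capacity_) {             // pool full: evict the LRU tile
            lookup_.erase(lru_.back());             // (real engine: free its slot)
            lru_.pop_back();
        }
        lru_.push_front(tile);                      // miss: copy the tile in here
        lookup_[tile] = lru_.begin();
        return true;
    }

private:
    size_t capacity_;
    std::list<TileId> lru_;                                           // MRU .. LRU
    std::unordered_map<TileId, std::list<TileId>::iterator> lookup_;
};

int main() {
    TileCache cache(2);
    std::printf("%d %d %d\n", cache.touch(1), cache.touch(2), cache.touch(1));  // 1 1 0
}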
 
By the way, the server APU was listed in the 2016-17 server roadmap, which already (kinda) has one of the three items scrapped, i.e. K12. The alleged target timeframe of Vega 20 is off a few quarters.
K12 might not be canned; there is an ARM semi-custom gig AMD has. Also, in the Hot Chips presentation Q&A, Michael Clarke said that the ARM roadmap hasn't changed.
 