> Some of Pascal's architectural features like the 49-bit address space and an automatic facility for page migration seem to be leading features where Vega is playing catch-up.

In order to support larger data sets, you need a multi-tier memory system. A big chunk of DDR for capacity plus a small, fast HBM pool is a perfect solution for games, as long as you can page data from DDR to HBM on demand at low latency. Intel has already done this with the MCDRAM-based cache on their Xeon Phi processors, and Nvidia's P100 and V100 have hardware virtual memory paging. I am thrilled that this is actually happening in the consumer space so quickly. Nvidia's solution is CUDA-centric and geared towards professional usage. I don't know whether their hardware could support automated paging for OpenGL, DirectX, and Vulkan, or whether it only supports the CUDA memory model (which doesn't use resource descriptors to abstract resources).
Looking at the difference between Pascal/Volta and Vega, is the following statement correct?
Pascal/Volta need OS support for each page swap into the GPU and Vega does not.
Some of Pascal's architectural features like the 49-bit address space and an automatic facility for page migration seem to be leading features where Vega is playing catch-up.
https://www.pcper.com/reviews/Graph...ecture-Preview-Redesigned-Memory-Architecture

With a total addressable memory space of 512 TB (a 49-bit address space), this new system is comparable to the 48-bit (256 TB) x86-64 address space. That leaves a lot of room for growth in the GPU memory area, even when you start to get into massive network storage configurations.
Polaris (edit: I think) added a 49-bit address space:
https://www.pcper.com/reviews/Graph...ecture-Preview-Redesigned-Memory-Architecture
I'm reading that as being in the context of HBCC and the 512TB address space. I haven't had much luck finding a Polaris-specific reference for this.
The ATC bit goes back further, as it's mentioned for Sea Islands in Table 8.5.
> Looking at the difference between Pascal/Volta and Vega, is the following statement correct? Pascal/Volta need OS support for each page swap into the GPU and Vega does not.

OS support, and a process or service to handle the transaction. Vega would only need minimal OS support or awareness for certain capabilities where the two interact: security, virtualization, shared pointers, configuration/layout, etc., which are usually handled by the driver. Arbitrary reads of system memory aren't necessarily desirable, and are more a programming issue in any case, but would be possible.
So this might sound like a bit of a strange question, but I just want to make sure my understanding is correct, as I'm having a discussion about Vega on another forum where this has come up. With respect to the next-generation geometry engines in Vega, my understanding is that the shaders within the geometry engines themselves (which were all fixed-function in Fiji) have been (mostly) replaced with programmable non-compute shaders, which can be reconfigured to act as primitive shaders rather than their default behavior. It is not the case that primitive shaders work by bypassing the geometry engines entirely and using the compute units to process geometry. Is that right?
The ATC bit goes back further, as it's mentioned for Sea Islands in Table 8.5.
http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf
I'm not clear if the 48-bit base and ATC bit correspond to the same situation as the 48-bit base and mapped tile bit. Potentially, the HBCC doesn't need a bit in the shader's context to manage the resource's placement in the overall memory space.
Having an explicit ATC bit as part of a resource's descriptor seems to be related to the inability of prior generations to autonomously manage memory straddling CPU and GPU pools, since the resource needs to know which side it's on. Rather than a 49-bit space managed in a common handler, it's two disparate 48-bit spaces.
As far as the hardware's capability goes, what the shader sees may not be representative of the virtualized resource.
I'm not sure if the primitive shaders (a shader is a program, not a physical hardware component) are running in the ALUs / shader processors that reside in the NCUs or if they're completely new units that go into the front-end, but I think they're using the NCUs. That being the case, what's new in Vega compared to older GCN are the bridges (bus and caches) created between the shader processors and the front-end (geometry processors).
My understanding is that the primary processing grunt of the shader engines is the CUs. Primitive shaders, as they're presented, do not replace the fixed-function elements of the geometry pipeline. They exist in addition to the standard pipeline, which already has some capability to be fed via compute.
They aren't talking about processing being done within the geometry engines as shown on the Vega block diagram? I guess that means I should ask which stages of the rendering pipeline are normally done within the geometry engines?

> To highlight one of the innovations in the new geometry engine, primitive shaders are a key element in its ability to achieve much higher polygon throughput per transistor. Previous hardware mapped quite closely to the standard Direct3D rendering pipeline, with several stages including input assembly, vertex shading, hull shading, tessellation, domain shading, and geometry shading. Given the wide variety of rendering technologies now being implemented by developers, however, including all of these stages isn't always the most efficient way of doing things. Each stage has various restrictions on inputs and outputs that may have been necessary for earlier GPU designs, but such restrictions aren't always needed on today's more flexible hardware.
> I also don't expect the HBCC controller being invoked on each and every address or descriptor presented to the GPU.

The HBCC wouldn't generally be involved without a page fault. If it's operating in a shared mode, it would need to track which ranges might fault. Hardware-generated offsets, like some of the wave-level base pointers, may have implicit restrictions where the GPU will use its known-local address range.
> CPU address space is 48 bits anyway, so having a true unified address pool of 49 bits is useless if the GPU can't potentially address more than 48 bits itself, which I don't think is the case.

The motivation the vendors state for 49-bit addressing is unified memory addressing, where the GPU can access the full CPU range in addition to what it can address independently. If the GPU wants to be generally capable of accessing the host's 48-bit range, it would exhaust the address space of its own paging system without the extra room.
> There is the possibility that ATC is/was for Hyper-V or GPU virtualization, and it's in reality N virtual address spaces (sequentially accessible, not simultaneously), N being the number of virtualized instances.

ATC is the implementation of an IOMMUv2 feature for heterogeneous memory access. It allows the GPU to interface with the host's page tables and cache translations. For protection, everything the GPU does with unified memory treats it as a virtual guest.
> Okay, now I'm definitely confused. The Vega whitepaper talks about "Next Generation Geometry Engines" and says:

The only truly fixed-function elements in the diagram are the solid dark gray blocks. The various VS/DS and GS elements are actually running on the CUs. Internally, some of those are decomposed into different variants depending on whether tessellation and geometry shaders were invoked.
Very unlikely; they won't use more processes than they absolutely have to (cost, you know), and their main producer will always be GloFo. Vega 20 will most likely be their first 7LP product from GloFo, somewhere around Q4 2018. It could also be mostly (or only) a Pro product, because it may have 4 stacks of HBM2.
> Well if GF's solutions keep increasing the efficiency gap to TSMC's and Samsung's equivalent processes, then AMD had better find a way to avoid being dragged down by them. At least for their halo products (which Vega 20 should be).

7LP, despite the name sounding like it, isn't a low-power process like the 14LPx nodes are; it's "7nm Leading Performance", aimed at high-performance products (read: GPUs, big x86 CPUs).