AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Not by the one bit alone, but they could have influenced the ability to add more bits. While 512TB is a lot of space, it's not a whole lot for exascale systems. Might be something they could change for customers actually needing more than 512TB on a GPU in a single pool. HMM would definitely be a possibility along the same lines as that compute wave save/restore.
The system virtual memory address space is capped at 48 bits so far for both AMD64 and ARMv8. So in reality before Intel's 5-level paging lands, there isn't an apparent need for GPUs to grow beyond 48 bits.

As for addressing huge amounts of physical memory, PCIe BAR-mapped memory and memory-mapped I/O, that would be in the realm of the physical address space, not the virtual address space.
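
For what it's worth, the 48-bit cap shows up directly in AMD64's canonical-address rule: bits 63:48 of a virtual address must be a sign extension of bit 47. A quick C check of just that bit pattern:

[code]
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* AMD64 "canonical form" check for today's 48-bit virtual addresses:
   bits 63:47 must be all zeros or all ones. */
static bool is_canonical_48(uint64_t va)
{
    uint64_t top = va >> 47;
    return top == 0 || top == 0x1ffff;
}

int main(void)
{
    printf("%d\n", is_canonical_48(0x00007fffffffffffULL)); /* 1: top of lower half */
    printf("%d\n", is_canonical_48(0xffff800000000000ULL)); /* 1: start of upper half */
    printf("%d\n", is_canonical_48(0x0000800000000000ULL)); /* 0: non-canonical */
    return 0;
}
[/code]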
 
The system virtual memory address space is capped at 48 bits so far for both AMD64 and ARMv8. So in reality before Intel's 5-level paging lands, there isn't an apparent need for GPUs to grow beyond 48 bits.

As for addressing huge amounts of physical memory, PCIe BAR-mapped memory and memory-mapped I/O, that would be in the realm of the physical address space, not the virtual address space.
The problem being if they implemented the flat addressing model and were using a fabric with a large dataset: virtual space could exceed 48 bits while the physical space is even larger. Like you said, it is strange to add only a single bit, but maybe that was sufficient for a partner, all they could manage, or some conceptual design?
 
The problem being if they implemented the flat addressing model and were using a fabric with a large dataset: virtual space could exceed 48 bits while the physical space is even larger. Like you said, it is strange to add only a single bit, but maybe that was sufficient for a partner, all they could manage, or some conceptual design?
You seem to have misunderstood what exactly the flat virtual address space is.

It is just a flattened, segmented view of the process virtual address space, the workgroup memory and the work-item private memory. There are also a few utility segments, but they can be collapsed into the global/private segment.

If in your mind "flat" means flat "across agents", that's not the case at all. Agents are interoperating within the global segment (i.e. the process virtual address space), even for agents that accept coarse-grained allocations to their non-coherent local memory.

Take GCN, for example. A flat address would either:
1. lie within the 48-bit platform virtual address space (global/kernarg/readonly);
2. lie within the workgroup memory aperture (group); or
3. lie within the private aperture (private/arg/spill).

The first case doesn't need translation. The second and third cases are handled by subtracting the aperture base and redirecting the access elsewhere. Group segment addresses apparently go to the LDS. Private segment addresses would be computed from a base address that AFAIK lies within the 48-bit GPUVM address space, at least for the Linux implementation.

So nope. Addresses that point to system memory would never exceed 48 bits.
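
Roughly, that resolution amounts to the sketch below; the aperture bases, the window size and the resolve_flat helper are all made up for illustration (the real apertures are set up by the driver), but the subtract-the-base redirection is the gist of it.

[code]
#include <stdint.h>
#include <stdio.h>

/* Hypothetical aperture windows, for illustration only; the real group and
   private aperture bases are configured by the driver, not hard-coded. */
#define GROUP_APERTURE_BASE   0xffff800000000000ULL
#define PRIVATE_APERTURE_BASE 0xffff900000000000ULL
#define APERTURE_SIZE         0x0000010000000000ULL  /* 1 TiB window */

typedef enum { SEG_GLOBAL, SEG_GROUP, SEG_PRIVATE } segment_t;

/* Resolve a flat address into a segment plus a segment-relative offset. */
static segment_t resolve_flat(uint64_t flat, uint64_t *offset)
{
    if (flat - GROUP_APERTURE_BASE < APERTURE_SIZE) {
        *offset = flat - GROUP_APERTURE_BASE;     /* redirected to the LDS */
        return SEG_GROUP;
    }
    if (flat - PRIVATE_APERTURE_BASE < APERTURE_SIZE) {
        *offset = flat - PRIVATE_APERTURE_BASE;   /* rebased into scratch */
        return SEG_PRIVATE;
    }
    *offset = flat;                               /* plain process VA, <= 48 bits */
    return SEG_GLOBAL;
}

int main(void)
{
    uint64_t off;
    segment_t s = resolve_flat(0xffff800000001000ULL, &off);
    printf("segment %d, offset 0x%llx\n", s, (unsigned long long)off);
    return 0;
}
[/code]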
 
So nope. Addresses that point to system memory would never exceed 48 bits.
So if my shader is accessing an array in excess of 512TB as a scratchpad, which address would I use to access a specific byte? In the case of an SSG exceeding 512TB of memory, likely across pools, how is that handled? That pool wouldn't necessarily be visible to the host system as it would be connected directly to the GPU.
 
So if my shader is accessing an array in excess of 512TB as a scratchpad, which address would I use to access a specific byte? In the case of an SSG exceeding 512TB of memory, likely across pools, how is that handled? That pool wouldn't necessarily be visible to the host system as it would be connected directly to the GPU.

You can have the physical address space larger than the GPU's private virtual address space, if this SSG thing is ever the first-order design priority of a GPU (which it apparently isn't). So you can have them mapped into the physical address space, and there's no reason that isn't the case now. Multiple processes would do, then.

For sanity though, even if we put aside whether SSDs can grow that large (256x to hit the wall), it would be quite fun to see data at that scale not being tiered or partitioned.
 
You can't access an array in excess of 512TB. You can't access an array in excess of 256TB on a CPU. If you have more data than that, then you'll have to access it in segments (unmap part from virtual memory, map another part in).
Why not more than 49 bits? The largest address space that the GPU will connect to is the host virtual address space, which is 48 bits.
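
As a minimal sketch of that segmented access on the CPU side, using POSIX mmap to slide a window over a file; the file name and the 1 GiB window size are placeholders:

[code]
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define WINDOW_BYTES (1ULL << 30)   /* map 1 GiB of the dataset at a time */

int main(void)
{
    /* "huge_dataset.bin" is a placeholder for some multi-hundred-TB store. */
    int fd = open("huge_dataset.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    off_t file_size = lseek(fd, 0, SEEK_END);
    for (off_t base = 0; base < file_size; base += WINDOW_BYTES) {
        size_t len = (size_t)(file_size - base < (off_t)WINDOW_BYTES
                              ? file_size - base : (off_t)WINDOW_BYTES);
        /* Map one window into the (48-bit) virtual address space... */
        uint8_t *window = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, base);
        if (window == MAP_FAILED) { perror("mmap"); return 1; }

        /* ...work on it (here we just touch the first byte)... */
        volatile uint8_t first = window[0];
        (void)first;

        /* ...then unmap it before sliding the window forward. */
        munmap(window, len);
    }
    close(fd);
    return 0;
}
[/code]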
 
For sanity though, even if we put aside whether SSDs can grow that large (256x to hit the wall), it would be quite fun to see data at that scale not being tiered or partitioned.
The largest address space that the GPU will connect to is the host virtual address space, which is 48 bits.
What about a SAN wired directly to the GPU? Take the CPU and host adapters out of the design completely. That's by far the more practical solution for a datacenter implementing SSG technology on a large scale. I would definitely consider that a priority for deployment in an exascale market like that 3PFLOP rack that was demoed. Yes host address space is historically the largest space that would be connected, but not necessarily a requirement. GPU memory, while historically impractical, could exceed that of the host. Those connectors could have been what was taped off on the Instinct demonstrations as opposed to display outputs.
 
This would then leave the realm of PCI-Ex, as 48 bits is also the maximum available there per bus/device/function (and SSG connects to the GPU through PCI-Ex). I think you'd basically come to the point where you'd have to run some sort of OS on the GPU that would handle page faults and fetch the missing pages over whatever protocol SANs use.
 
You can't access an array in excess of 512TB. You can't access an array in excess of 256TB on a CPU. If you have more data than that, then you'll have to access it in segments (unmap part from virtual memory, map another part in).
Why not more than 49 bits? The largest address space that the GPU will connect to is the host virtual address space, which is 48 bits.

Do we know the address space of Zen server products yet? Maybe we do, but atm I cannot remember.

512TB virtual address space presumably means there is one extra address bit (49 bits) in the GPU's own VM hierarchy over GCN3 (48 bits). No idea why they would bump it up by just one bit though... Are they going to map the entire host virtual address space into the GPUVM and unify the address translation hierarchies (ATC/GPUVM), heh?
Isn't it 512 TB, aka 4096 Tbit? So 52 bits of address space?
 
So, I have to log in to link this patent application:

http://www.freepatentsonline.com/y2016/0371873.html

HYBRID RENDER WITH PREFERRED PRIMITIVE BATCH BINNING AND SORTING



Don't understand it as yet...

It's a continuation of an earlier application discussed before. Subdivide screen space into tiles, feed primitives into the front end in a deferred mode that collects fragments per-bin and culls/occludes non-visible fragments. Then, send the bin and its share of visible fragments for shading. This collects spatially coherent accesses since this is done per-bin rather than per-primitive, and only shades visible pixels (for the finite span/resources of the bin).
The largest difference I can see is the discussion of how the tiling is done: a horizontal coarse rasterization loop feeding a sorter, which then passes to a vertical coarse rasterizer and sorter.
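
As a toy software analogue of the binning step, with a bounding-box overlap test standing in for the coarse rasterizers and sorters the application actually describes:

[code]
#include <stdint.h>
#include <stdlib.h>

#define BIN_W 32   /* bin (tile) size in pixels; purely illustrative */
#define BIN_H 32

typedef struct { float x0, y0, x1, y1; } aabb_t;              /* screen-space bounds */
typedef struct { uint32_t *prims; int count, cap; } bin_t;    /* per-bin primitive list */

/* Append a primitive to every bin its bounding box overlaps. The hardware
   described walks bins with horizontal/vertical coarse rasterizers and
   sorters; a bounding-box overlap test is the simplest software stand-in. */
static void bin_primitive(bin_t *bins, int bins_x, int bins_y,
                          uint32_t prim_id, aabb_t box)
{
    int bx0 = (int)(box.x0 / BIN_W), bx1 = (int)(box.x1 / BIN_W);
    int by0 = (int)(box.y0 / BIN_H), by1 = (int)(box.y1 / BIN_H);
    if (bx0 < 0) bx0 = 0;
    if (by0 < 0) by0 = 0;
    if (bx1 >= bins_x) bx1 = bins_x - 1;
    if (by1 >= bins_y) by1 = bins_y - 1;

    for (int by = by0; by <= by1; by++)         /* "vertical" loop   */
        for (int bx = bx0; bx <= bx1; bx++) {   /* "horizontal" loop */
            bin_t *b = &bins[by * bins_x + bx];
            if (b->count == b->cap) {
                b->cap = b->cap ? b->cap * 2 : 16;
                b->prims = realloc(b->prims, (size_t)b->cap * sizeof *b->prims);
            }
            b->prims[b->count++] = prim_id;
        }
}

int main(void)
{
    int bins_x = 1920 / BIN_W, bins_y = 1080 / BIN_H;
    bin_t *bins = calloc((size_t)(bins_x * bins_y), sizeof *bins);
    aabb_t tri = { 100.0f, 50.0f, 180.0f, 90.0f };
    bin_primitive(bins, bins_x, bins_y, 0, tri);  /* lands in a handful of bins */
    free(bins);  /* per-bin lists leak; fine for a sketch */
    return 0;
}
[/code]

Each bin's list would then be rasterized and depth/occlusion-culled before any fragment is handed to shading, which is the deferred part of the hybrid scheme.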
 
A binning rasteriser sounds like a match to Nvidia's tiled rasteriser, with "draw stream" probably meaning it isn't TBDR (?).

Draw-streams are a necessary part of TBDR architectures as you need to first identify which primitives hit which screen-space tiles (binning) and generate a draw-stream (list of primitives/drawcalls) for each tile. But it still isn't a real TBDR architecture unless it has an on-chip tile buffer which actually saves the memory write BW of depth and color writes. So my guess is they are generating draw-streams as they are doing binning, but that's only used to parallelize rasterization which is very similar to Nvidia's tiled rasterization. In theory, you could use your L2 cache instead of the on-chip buffer for saving the BW but that would reduce the effective cache size for reads which kinda negates the effect in practice.
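
To put a rough number on the write-bandwidth argument, a back-of-envelope with made-up overdraw and buffer formats (not measurements of any GPU):

[code]
#include <stdio.h>

int main(void)
{
    /* Back-of-envelope for one 1920x1080 frame: 4 bytes color + 4 bytes
       depth per fragment write, average overdraw of 3. All numbers are
       illustrative only. */
    const double pixels   = 1920.0 * 1080.0;
    const double overdraw = 3.0;

    /* Immediate-mode style: every surviving fragment writes color + depth
       to memory (caches help, but the traffic still exists). */
    double imr_bytes = pixels * overdraw * (4.0 + 4.0);

    /* TBDR with an on-chip tile buffer: intermediate color/depth stay on
       chip; only the final color per pixel is flushed out. */
    double tbdr_bytes = pixels * 4.0;

    printf("immediate-mode writes : %.1f MB\n", imr_bytes / 1e6);
    printf("tile-buffer writes    : %.1f MB\n", tbdr_bytes / 1e6);
    return 0;
}
[/code]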
 
Do we know the address space of Zen server products yet? Maybe we do, but atm I cannot remember.


Isn't it 512 TB, aka 4096 Tbit? So 52 bits of address space?
You can't address individual bits, so no 52-bit address space.
What would be the point of Zen increasing that (from 48 bits)?
 
You can't address individual bits, so no 52-bit address space.
What would be the point of Zen increasing that (from 48 bits)?
I don't know if I fully understand what you're saying. Why would there be no point in Zen (or other hardware) supporting larger address spaces?
 
Do we know the address space of Zen server products yet? Maybe we do, but atm I cannot remember.


Isn't it 512 TB, aka 4096 Tbit? So 52 bits of address space?

Memory is byte-addressable, not bit-addressable.

The virtual address space of AMD64 is 48-bit, unless AMD is going after Intel's 5-level paging extension (which still supports 48-bit for compatibility though).
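
To spell the arithmetic out: 512 TB is 512 * 2^40 bytes = 2^49 bytes, so byte addressing needs 49 bits; counting bits instead (2^52 of them) doesn't add any addressable units. A trivial check:

[code]
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t bytes = 512ULL << 40;   /* 512 TB = 512 * 2^40 = 2^49 bytes */
    int bits = 0;
    while ((1ULL << bits) < bytes)
        bits++;
    /* Prints 49: each address selects one byte, so 2^49 bytes need 49 bits.
       Multiplying by 8 to get bits doesn't change the number of addresses. */
    printf("512 TB = %llu bytes -> %d address bits\n",
           (unsigned long long)bytes, bits);
    return 0;
}
[/code]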
 
What about a SAN wired directly to the GPU? Take the CPU and host adapters out of the design completely. That's by far the more practical solution for a datacenter implementing SSG technology on a large scale. I would definitely consider that a priority for deployment in an exascale market like that 3PFLOP rack that was demoed. Yes host address space is historically the largest space that would be connected, but not necessarily a requirement. GPU memory, while historically impractical, could exceed that of the host. Those connectors could have been what was taped off on the Instinct demonstrations as opposed to display outputs.

Well, the GPU could maintain its own VA as it does now.

But AFAIU the whole point of SSG is caching resources for the GPU to page in from. It is not an addressable pool, and I would not expect it to be in any sense (bandwidth, heh). So even if you connect it to a SAN — which is against any bullet point SSG was claimed to have — it would likely still be a piece of managed filesystem, rather than something directly addressable.
 
I don't know if I fully understand what you're saying. Why would there be no point in Zen (or other hardware) supporting larger address spaces?
What device that you have at the moment doesn't fit into a 48-bit virtual address space? There aren't going to be 256TB-per-socket RAM options. Other PCI-Express devices will still be at 48 bits. So if Zen goes from 48 to 50 bits (1PiB), all you get is the ability to map more PCI-Ex devices directly into the virtual memory space at the same time.
There was an idea above to wire a SAN directly to the GPU. ...Or the CPU. These are large enough devices, but they talk over Ethernet or InfiniBand. So to put it a bit more broadly: you'd want to memory-map the internet? :)

But AFAIU the whole point of SSG is caching resources for the GPU to page in from. It is not an addressable pool, and I would not expect it to be in any sense (bandwidth, heh). So even if you connect it to a SAN — which is against any bullet point SSG was claimed to have — it would still be a piece of managed filesystem, rather than something directly addressable.
AFAIU SSG is a PCI-Ex device/function. The CPU copies to it, the GPU then reads/writes from it. The problem with a SAN is that it's not like an SSD/HDD; it's more of a network card with additional protocols on top.
 
An upcoming step for HPC and servers is directly-addressed non-volatile VRAM as a complement for DRAM. Intel is promising socket-attached versions, and AMD's hypothetical exascale HPC direction includes using NVRAM to compensate for DRAM slowing in power and capacity scaling.

For large clusters for HPC and big data, it seems like there is a desire for being able to directly address beyond the volatile address space. Various newly announced interconnect consortia are aiming for forms of direct addressing across devices. Gen-Z has aspirations for memory addressing between 4096 devices that are individually capable of addressing a 64-bit space, if 50 bits seems constraining.
 
AFAIU SSG is a PCI-Ex device/function. The CPU copies to it, the GPU then reads/writes from it. The problem with a SAN is that it's not like an SSD/HDD; it's more of a network card with additional protocols on top.
An upcoming step for HPC and servers is directly-addressed non-volatile VRAM as a complement for DRAM. Intel is promising socket-attached versions, and AMD's hypothetical exascale HPC direction includes using NVRAM to compensate for DRAM slowing in power and capacity scaling.

For large clusters for HPC and big data, it seems like there is a desire for being able to directly address beyond the volatile address space. Various newly announced interconnect consortia are aiming for forms of direct addressing across devices. Gen-Z has aspirations for memory addressing between 4096 devices that are individually capable of addressing a 64-bit space, if 50 bits seems constraining.
But at least in Gen-Z's case, it is a problem in the realm of physical addresses. The VM space should be agnostic of it.
 
512TB virtual address space presumably means there is one extra address bit (49 bits) in the GPU's own VM hierarchy over GCN3 (48 bits). No idea why they would bump it up by just one bit though... Are they going to map the entire host virtual address space into the GPUVM and unify the address translation hierarchies (ATC/GPUVM), heh?
All current GCN iterations have a 40-bit virtual address space, not 48-bit. Or at least current Windows drivers expose "only" a 40-bit virtual address space.
Yes, same reason why nVidia supports this with Pascal:
This makes sense... Except they solve a non-problem and increase the architectural complexity... Especially considering x86 currently cannot cover more than 48 bits of virtual address space... I do not know how this will suit an iGPU with shared memory...
 