Your wishlist for DX12

I'd like to see DirectCompute reach the same programmability level as CUDA. Function pointers and dynamic parallelism FTW! Also, the HLSL ISA should be changed to scalar, since vec4 has pretty much been dropped in modern hardware; having a vec4 virtual register file causes problems with alignment restrictions, and in some cases it is impossible to identify unused lanes in a computation, so the GPU has to do extra work to process them. E.g.:

float4 result = input[a] + input[b];
output[c] = result; // only result.xyz is ever actually used, but the compiler can't tell what might read the output array later, so it has to compute all of result.xyzw

Finally, we need a good way to do a deep copy of complicated structures to the GPU memory space.
 
I got an idea during the night; it would need more work to be turned into something coherent, but here I go. It's a PC-centric API design that assumes the CPU and GPU each have their own pool of memory.

API:
Exclusively use 64-bit handles/pointers; bit 63 has a special meaning: 0 means CPU memory (GART), 1 means GPU memory.
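
As a minimal sketch of what that tagging convention could look like on the application side (the helper names below are made up for illustration):

#include <cstdint>

// Hypothetical helpers for the bit-63 convention above.
constexpr uint64_t kGpuMemoryBit = 1ull << 63;

inline bool IsGpuAddress(uint64_t address) { return (address & kGpuMemoryBit) != 0; }
inline bool IsCpuAddress(uint64_t address) { return (address & kGpuMemoryBit) == 0; }

// Strip the tag to recover the plain virtual address within its pool.
inline uint64_t UntaggedAddress(uint64_t address) { return address & ~kGpuMemoryBit; }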

Code:
Device
{

/* DATA */
void* MemAlloc( CPU|GPU, size, RESERVE&|COMMIT );	//c.f. VirtualAlloc
void MemFree( address, size, RELEASE|DECOMMIT );	//c.f. VirtualFree

Descriptor* CreateDescriptor( Type, data's address, size, ... );
//Basically create a descriptor for IB, VB, CB, TB, Texture, RenderTarget, UAV... 
//May need one function for each type, or some may be merged into a single function


/* PROGRAM/PIPELINE */
Program* CreateProgram( Type, source );
//I'm inclined to have a programming language more akin to Chapel (http://chapel.cray.com/); that needs more thinking.
//The basic idea (for now) is to have programs of the following types: Vertex, { Hull, Tessellator, Domain, } Geometry, Rasterizer, Fragment, DepthStencil, Blender.
//I think { Hull, Tessellator, Domain } ought to be a single thing.
//Tessellator, Rasterizer, DepthStencil & Blender may be more akin to Register Combiners/States, but I put them as-is to end up with a single Pipeline object below.
//There are no SamplerStates; those are embedded in their respective programs instead. [Could be changed.]

Pipeline* CreatePipeline( Programs[], count );
//That's the whole graphics pipeline in one object.


/* COMMAND QUEUE */
CommandQueue* CreateCommandQueue();
void enqueue( CommandQueue& queue );
//Could be called execute instead, but it won't really do so until all previously enqueued CommandQueues have executed...

}


CommandQueue
{
//Every function in here is added to the CommandQueue and doesn't execute until the CommandQueue is enqueued on the Device and is run.

/* MEMORY */
void MemSet( address, size, value );
void MemCopy( src, dst, size );
void MemCopyEx( src, dst, size, DXGI_FORMAT );
//When src is CPUmem, data is linear; when dst is GPUmem it will be optimised to the native layout [Textures].
//When src is GPUmem & dst is CPUmem, data will be turned back into linear.


/* PIPELINE SETUP */
void Pipeline( Pipeline* );
void PipelineProgramDescriptors( Descriptor[]*, uint32_t* counts );
//For each Program in Pipeline, provide a pointer to its Descriptors.
//That's akin to the "SetShaderResources"/"SetConstantBuffers"... functions of D3D10+, except you just provide a Descriptor array for each Program.

}
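
To make the flow concrete, here's a rough usage sketch of how I imagine the pieces fitting together; the flag names, descriptor types and exact signatures are placeholders, not a spec (and there's no draw call yet, so it stops at pipeline setup):

// Reserve + commit a vertex buffer in GPU memory, fill a staging copy on the CPU,
// then record the upload and pipeline setup into a CommandQueue.
void* vb      = device->MemAlloc( GPU, vbSize, RESERVE|COMMIT );   // bit 63 set
void* staging = device->MemAlloc( CPU, vbSize, RESERVE|COMMIT );   // CPU (GART) memory
// ... fill 'staging' from the CPU ...

Descriptor* vbDesc = device->CreateDescriptor( VERTEX_BUFFER, vb, vbSize );
Program* programs[] = { device->CreateProgram( Vertex,   vsSource ),
                        device->CreateProgram( Fragment, fsSource ) };
Pipeline* pipeline = device->CreatePipeline( programs, 2 );

CommandQueue* queue = device->CreateCommandQueue();
queue->MemCopy( staging, vb, vbSize );          // upload across PCIe
queue->Pipeline( pipeline );
Descriptor* vertexDescs[] = { vbDesc };         // Descriptors for the Vertex program
uint32_t    counts[]      = { 1, 0 };           // per-Program descriptor counts (none for Fragment here)
queue->PipelineProgramDescriptors( vertexDescs, counts );
device->enqueue( *queue );                      // runs after previously enqueued queues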

The design is primitive and lacks an HEVENT object to be waited upon or to get a timestamp from.
I think such an HEVENT would be on the Device and inserted between CommandQueues, or maybe each enqueue should be changed to:
HEVENT enqueue( CommandQueue& queue );
So you can wait until the CommandQueue is executed, or get a TimeStamp to benchmark.
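
For example (still just a sketch; the Wait and GetTimeStamp calls are invented here):

HEVENT frameDone = device->enqueue( *queue );
// ... the CPU is free to do other work here ...
device->Wait( frameDone );                        // block until the CommandQueue has executed
uint64_t gpuTicks = device->GetTimeStamp( frameDone );  // timestamp for benchmarking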

It might be possible to have multiple CommandQueues running in parallel too.

I think there's a fair amount of thought to put into the programming language to use with the GPU. HLSL is a fair domain-specific language, but it might not match current/upcoming GPUs.
As a programmer I want something simple and powerful (hint: http://chapel.cray.com/ ;p), and I'd like to run the GPU asynchronously from the CPU to some extent.
(This design, as-is, doesn't really allow that, it still treats the GPU as a slave to the CPU.)
 
More details.

CPU addresses are accessible from both CPU & GPU.
GPU addresses are only accessible from the GPU; you need to copy the data back to read it from the CPU.
The programmer is responsible for performance, as he can create a resource in CPU memory and ask the GPU to read from it across PCIe.

Since every resource has a unique virtual address, the driver can easily check that different Descriptors for the same resource are compatible during CreateDescriptor(...).
Having more than one Descriptor usable for the same resource means there's a concept of views. (like D3D10+)

There's no reason a given CommandQueue cannot be kept around and reused.
The HEVENT returned when enqueuing a CommandQueue allows the CPU to know which resources are free to be evicted/decommitted.


I like the Pipeline object encapsulating all the state necessary for execution; I think there's still room for the driver to optimise changes between two Pipeline objects.


I got the idea after working on resource management in my engine last night, which is why the design is more detailed in that respect.
 
All I think of during the night is naked ladies


Why have you gone that way, since the current fashion is HSA?

I do think the Haswell GPU will have dedicated on-chip memory, so this design might still hold; otherwise, allocating from GPU memory would just return an address with bit 63 set to 0, which is simple enough.
 
Just setting bit 63 wouldn't really be adequate. What about multi-GPU setups, where an address could be resident on any one of them? I think a better solution would be to keep track of physical memory locations in the page table. Then any processor (CPU or GPU) could read from any memory address, at the cost of a page fault if that page is resident on the wrong device (this would evict it from that device and bring it into the device requesting that memory - you would also greatly benefit from using a MESI protocol for coherency, since you have a lot of read-only stuff).

You know, when Microsoft says that DirectX is deeply tied to the OS, this is the sort of thing they're talking about. Getting this to work requires a rewrite of the virtual memory engine. :p
 
Just setting bit 63 wouldn't really be adequate. What about multi-GPU setups, where an address could be resident on any one of them? I think a better solution would be to keep track of physical memory locations in the page table. Then any processor (CPU or GPU) could read from any memory address, at the cost of a page fault if that page is resident on the wrong device (this would evict it from that device and bring it into the device requesting that memory - you would also greatly benefit from using a MESI protocol for coherency, since you have a lot of read-only stuff).

You know, when Microsoft says that DirectX is deeply tied to the OS, this is the sort of thing they're talking about. Getting this to work requires a rewrite of the virtual memory engine. :p

Ah! Multi-vendor cache coherency...

That's going to be fun.
 
Just setting bit 63 wouldn't really be adequate. What about multi-GPU setups, where an address could be resident on any one of them? I think a better solution would be to keep track of physical memory locations in the page table. Then any processor (CPU or GPU) could read from any memory address, at the cost of a page fault if that page is resident on the wrong device (this would evict it from that device and bring it into the device requesting that memory - you would also greatly benefit from using a MESI protocol for coherency, since you have a lot of read-only stuff).

You know, when Microsoft says that DirectX is deeply tied to the OS, this is the sort of thing they're talking about. Getting this to work requires a rewrite of the virtual memory engine. :p

I'd rather use more bits in the memory address and control each GPU's memory & queue separately.

Having memory coherency through CPU & GPU instructions would be nice; that's the programming part I didn't really cover.
I wonder how many people work on those things...
 
I'd rather use more bits in the memory address and control each GPU's memory & queue separately.

Having memory coherency through CPU & GPU instructions would be nice; that's the programming part I didn't really cover.
I wonder how many people work on those things...

Again, the problem with separate address spaces is that it's a royal pain to marshal any sort of complex multilevel structure across. For instance, imagine copying an acceleration structure for ray tracing. Or worse, a complex graph structure, where there are cyclical links between elements. You have to copy each element, keep track of all the new addresses, and then go in and modify all the pointers accordingly. Ugh.

The issue isn't exactly coherency, since these structures are typically read only for the heavy lifting part of the algorithm.
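
To illustrate the pain, a minimal node type with embedded pointers already forces a per-element copy plus pointer patching (and a visited map to survive cycles). Everything below is illustrative; AllocGpuNode and UploadToGpu stand in for whatever the API actually provides:

#include <cstddef>
#include <unordered_map>

struct Node {
    float bounds[6];
    Node* left;    // CPU-side pointers...
    Node* right;   // ...are meaningless in the GPU address space
};

// Hypothetical helpers: allocate a Node in GPU memory / copy bytes across PCIe.
Node* AllocGpuNode();
void  UploadToGpu(Node* dst, const Node* src, std::size_t bytes);

// Copy every node, remember the old->new mapping, then rewrite every pointer.
Node* DeepCopyToGpu(const Node* node, std::unordered_map<const Node*, Node*>& remap)
{
    if (!node) return nullptr;
    auto it = remap.find(node);
    if (it != remap.end()) return it->second;   // already copied (handles shared/cyclic links)

    Node* gpuNode = AllocGpuNode();
    remap[node] = gpuNode;                      // register before recursing, so cycles terminate

    Node patched = *node;                       // patch the pointers before uploading
    patched.left  = DeepCopyToGpu(node->left,  remap);
    patched.right = DeepCopyToGpu(node->right, remap);
    UploadToGpu(gpuNode, &patched, sizeof(Node));
    return gpuNode;
}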
 
I suppose you'd handle it in software in the driver. It might not be significantly worse than dealing with virtual memory and the page file on the hard drive.
Handling page faults and cache coherency protocols are apples and oranges, with miles of separation between them. It's one thing to standardize on a page table structure and handle each access as a page fault. It's quite another to bounce cache lines around with low latency.
 
There's only a single virtual address space in my idea, not multiple; a flag (i.e. an address range) indicates which physical memory pool the data is in (CPU, GPU0, GPU1, ...).
As rpg.314 said what you want is at a completely different level.

What my idea is about is a simple resource management API that removes the need to create/destroy Textures, Buffers and whatnot, by reserving address space and COMMIT-ing/DECOMMIT-ing pages as needed by the program. (Plus keeping a number of Descriptors/Views in memory to describe what's at a given memory address range.)

What you are talking about is a coherency protocol for a small subset of the data in your working set, which is a different problem that I don't address. (And one which I don't really see being solved unless the CPU & GPU are on the same die, which is already the case for a small [growing] set of computers out there.)


I'll try to find some time to investigate using Chapel for GPUs; writing the whole pipeline in a Chapel-like language might be an interesting option. (Chapel's domain concept is of special interest to me, but I also prefer its syntax :p)
 
This: http://www.leehowes.com/files/gaster-2013-formalizing_address_spaces.pdf attempts to step in the direction you are discussing (from a software standpoint, in OCL). Note that C++ AMP is provisioning for a similar future through the array_view abstraction (and I think DX Next is in better shape in this area than DX11), and hardware based on GCN or Fermi and up does have the harness for implementing it. Personally, I am not entirely convinced that there's no merit in being able to specify where something lives, at least when one considers discrete GPUs and other classes of accelerators (yeah, I know, discrete is dead, yadda yadda).
 
Handling page faults and cache coherency protocols are apples and oranges, with miles of separation between them. It's one thing to standardize on a page table structure and handle each access as a page fault. It's quite another to bounce cache lines around with low latency.

Coherency can be handled entirely at the virtual memory level. It's high latency, but you only pay the cost when you have a page fault, which is rare (or a one-time cost) if you designed the algorithm well. Mind you, page faults are expensive for memory on the hard drive too!

This is really no different from what a shared memory supercomputer would do, where you have multiple nodes which communicate over a fast network rather than PCIe. Speeds are very comparable in both cases, on the order of 10s of GB/s, although the latency on the supercomputer would be significantly worse, since there are more hops involved (PCIe->fiber->PCIe).

Basically, when a processor has a page fault, it looks up which processor (or hard drive) has the memory in question, and sends a message to that processor. The other processor sees this as an interrupt, and proceeds to do what it has to do to invalidate that page and send it back to the original processor if it's been modified. Once the first processor receives the acknowledgement message, it can use the memory, which has been conveniently DMA'd into its memory. Notice that there's nothing here that can't be done with just software - it's just message passing.
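
In rough pseudocode, the fault handler on the requesting device might look something like this; the directory, the message functions and the page states are all hypothetical, it's just meant to show that this is plain message passing:

#include <cstdint>

// Hypothetical per-page directory entry kept by the (software) coherency layer.
enum class PageState { Invalid, Shared, Modified };
struct PageEntry {
    int       owner;   // which device currently holds the page (CPU = 0, GPU0 = 1, ...)
    PageState state;
};

// Hypothetical message/MMU primitives.
void SendInvalidateRequest(int ownerDevice, std::uint64_t pageAddr); // the owner sees an interrupt
void WaitForAck(std::uint64_t pageAddr);        // owner DMAs the page back if modified, then acks
void MapPageLocally(int device, std::uint64_t pageAddr);             // update the device's page table

void OnPageFault(std::uint64_t pageAddr, int faultingDevice, PageEntry& entry)
{
    if (entry.owner != faultingDevice) {
        SendInvalidateRequest(entry.owner, pageAddr);
        WaitForAck(pageAddr);
        entry.owner = faultingDevice;
    }
    entry.state = PageState::Modified;          // the faulting device now has the only copy
    MapPageLocally(faultingDevice, pageAddr);
}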
 
Coherency can be handled entirely at the virtual memory level. It's high latency, but you only pay the cost when you have a page fault, which is rare (or a one-time cost) if you designed the algorithm well. Mind you, page faults are expensive for memory on the hard drive too!

This is really no different from what a shared memory supercomputer would do, where you have multiple nodes which communicate over a fast network rather than PCIe. Speeds are very comparable in both cases, on the order of 10s of GB/s, although the latency on the supercomputer would be significantly worse, since there are more hops involved (PCIe->fiber->PCIe).

Basically, when a processor has a page fault, it looks up which processor (or hard drive) has the memory in question, and sends a message to that processor. The other processor sees this as an interrupt, and proceeds to do what it has to do to invalidate that page and send it back to the original processor if it's been modified. Once the first processor receives the acknowledgement message, it can use the memory, which has been conveniently DMA'd into its memory. Notice that there's nothing here that can't be done with just software - it's just message passing.

But that's a very inefficient way to transfer a multilevel data structure, like a BVH, which is where you started out from. For something like a BVH, you really need a cache coherence protocol. Abusing page fault handlers just won't do.
 
But that's a very inefficient way to transfer a multilevel data structure, like a BVH, which is where you started out from. For something like a BVH, you really need a cache coherence protocol. Abusing page fault handlers just won't do.

What I'm describing is a cache coherency protocol, operating at a memory page level rather than cache line level. Trying to keep track of it at a cache line level would be far too slow, since the data transfer would be dominated by overhead, since cache lines are so small. In addition, the directory would be huge.

The reason to do it at the page level in particular is that you already have dedicated hardware. The big limitation is that you wouldn't be able to do read only sharing, since I don't believe the TLB has any concept of read-only, and thus can't throw an interrupt when the first memory write is encountered. You could however have a state in your handler emulating read only sharing by simply not requesting an invalidation whenever a page belonging to a read sharable memory space is brought in, though this would have to be respected by the programmer.

How efficient transferring a multilevel structure would be depends on how well malloc can cluster objects together. If the memory page holds a large number of other elements of the structure, it would be much more efficient to transfer the entire page than to copy each element separately and update pointers (a BVH is, admittedly, an extreme example, but there are many cases where you have small multilevel structures you want to use on both devices). However, false sharing would be *bad*.

What you'd need then is some concept of memory spaces for malloc, so that it wouldn't inadvertently allocate some CPU scratch variable in the middle of a memory page containing GPU data. A memory space would simply be some set of memory allocations that's guaranteed not to alias pages with memory allocations associated with any other memory space. It'd also be important to ensure that any page from a memory space that can be read by multiple devices is small.

The ultimate result would be a system that, while not an ideal coherency protocol, could still retain many of the benefits with regards to usability, and would not make anything slower in the traditional usage pattern. The important part is that this could be implemented in software only, so the only thing that would have to be updated would be the OS and the drivers.
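
As a small sketch of the "memory spaces for malloc" idea, the allocator interface might look something like this (all names and flags invented); each space hands out whole pages, so allocations from different spaces never share a page:

#include <cstddef>

// Hypothetical memory-space allocator: one space = a set of pages that never
// alias pages from any other space, which prevents false sharing at page granularity.
struct MemSpace;
MemSpace* CreateMemSpace(unsigned flags);               // e.g. MEMSPACE_SHARED_READ, MEMSPACE_CPU_PRIVATE
void*     SpaceAlloc(MemSpace* space, std::size_t bytes);

void Example()
{
    MemSpace* shared   = CreateMemSpace(/*MEMSPACE_SHARED_READ*/ 0x1);   // read-shareable between devices
    MemSpace* cpuLocal = CreateMemSpace(/*MEMSPACE_CPU_PRIVATE*/ 0x2);   // never migrated

    void* bvhNodes = SpaceAlloc(shared,   64 * 1024);   // structure both devices will read
    void* scratch  = SpaceAlloc(cpuLocal,  4 * 1024);   // CPU-only temporary, kept off shared pages
}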
 
Trying to keep track of it at a cache line level would be far too slow, since the data transfer would be dominated by overhead, since cache lines are so small. In addition, the directory would be huge.
Maybe for a discrete GPU, but I don't see a problem on integrated where it's using the same memory hierarchy anyways... both AMD and Intel are going this direction.

How much effort to throw at discrete GPUs/memory spaces depends on how much you believe they are going to matter in the future. With all of the consoles going unified and arguably everything laptop level and down as well, it's only going to be the very high end desktop stuff left as discrete. One could make an argument that those systems could take a more brute force path and still be acceptable. I have a hard time accepting that APIs should be designed around their constraints going forward, even though I love my massive discrete GPUs :)
 
What I'm describing is a cache coherency protocol, operating at a memory page level rather than cache line level. Trying to keep track of it at a cache line level would be far too slow, since the data transfer would be dominated by overhead, since cache lines are so small. In addition, the directory would be huge.

The reason to do it at the page level in particular is that you already have dedicated hardware. The big limitation is that you wouldn't be able to do read only sharing, since I don't believe the TLB has any concept of read-only, and thus can't throw an interrupt when the first memory write is encountered. You could however have a state in your handler emulating read only sharing by simply not requesting an invalidation whenever a page belonging to a read sharable memory space is brought in, though this would have to be respected by the programmer.

How efficient transferring a multilevel structure would be depends on how well malloc can cluster objects together. If the memory page holds a large number of other elements of the structure, it would be much more efficient to transfer the entire page than to copy each element separately and update pointers (a BVH is, admittedly, an extreme example, but there are many cases where you have small multilevel structures you want to use on both devices). However, false sharing would be *bad*.

What you'd need then is some concept of memory spaces for malloc, so that it wouldn't inadvertently allocate some CPU scratch variable in the middle of a memory page containing GPU data. A memory space would simply be some set of memory allocations that's guaranteed not to alias pages with memory allocations associated with any other memory space. It'd also be important to ensure that any page from a memory space that can be read by multiple devices is small.

The ultimate result would be a system that, while not an ideal coherency protocol, could still retain many of the benefits with regards to usability, and would not make anything slower in the traditional usage pattern. The important part is that this could be implemented in software only, so the only thing that would have to be updated would be the OS and the drivers.

Too high latency. Every page fault kicks to the kernel. This is not a solution to any real problem. Much better to batch up the stuff and do a single copy call.
 
Too high latency. Every page fault kicks to the kernel. This is not a solution to any real problem. Much better to batch up the stuff and do a single copy call.

Well, of course. An explicit copy will always be faster. That said, batching up and copying anything but trivial structures is quite ugly. Also, index based indirection doesn't really cut it, since it forces you to use an extra register to access anything (base address + offset), in addition to extra address math. There really needs to be some way of dealing with pointers without jumping through hoops.
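
For instance, a trivial traversal shows the difference: every hop in the index form needs the base address plus index arithmetic, while the pointer form is a single dependent load (the node layouts below are just for illustration):

// Index-based nodes: usable across address spaces, but every access needs base + offset.
struct NodeIdx { int left; int right; };              // -1 = no child
int LeftmostIdx(const NodeIdx* nodes, int root)       // the 'nodes' base pointer ties up a register
{
    int i = root;
    while (nodes[i].left >= 0) i = nodes[i].left;     // base + index*stride on every hop
    return i;
}

// Pointer-based nodes: one dependent load per hop, but the pointers only make
// sense in the address space they were built in.
struct NodePtr { NodePtr* left; NodePtr* right; };
const NodePtr* LeftmostPtr(const NodePtr* n)
{
    while (n->left) n = n->left;
    return n;
}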

HSA is interesting, but until the CPU socket catches up and closes the order of magnitude wide memory bandwidth gap with the GPU socket, it's useless for high end stuff.
 
Well, of course. An explicit copy will always be faster. That said, batching up and copying anything but trivial structures is quite ugly. Also, index based indirection doesn't really cut it, since it forces you to use an extra register to access anything (base address + offset), in addition to extra address math. There really needs to be some way of dealing with pointers without jumping through hoops.

HSA is interesting, but until the CPU socket catches up and closes the order of magnitude wide memory bandwidth gap with the GPU socket, it's useless for high end stuff.

It's ugly vs slower. Ugly will win because performance matters.
 