Fixed function hardware is a double-edged sword. Yes, it does one thing efficiently, but each added fixed function unit splits the computing resources further. To get the best performance out of the hardware you have to utilize all the fixed function units all the time, and the more fixed function units you have, the harder this becomes. One of them is always going to be a bottleneck, causing the others to idle. Examples: vertex & pixel shader hardware was unified in DX10, and it improved shader unit utilization a lot. New Radeons removed texture coordinate calculation from the texture sampling units and do it in the programmable shader ALUs instead (fixed function arithmetic hardware was removed and more programmable units were added, improving performance in ALU bound shaders).
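The bottleneck effect is easy to quantify: overall throughput is capped by the slowest fixed function unit for the current workload, and every other unit idles in proportion. A minimal sketch (the unit names and throughput numbers are hypothetical, just for illustration):

```python
# Hypothetical per-unit peak throughputs (work items per cycle) for one workload.
# The slowest unit caps the whole pipeline; every other unit partially idles.
units = {
    "vertex_fetch": 4.0,
    "rasterizer":   2.0,   # bottleneck for this workload
    "texture":      8.0,
    "alu":          16.0,
}

bottleneck = min(units.values())

# Utilization of each unit = bottleneck rate / its own peak rate.
utilization = {name: bottleneck / rate for name, rate in units.items()}

for name, u in utilization.items():
    print(f"{name:13s} {u:6.1%}")
# The rasterizer runs at 100% while the ALUs sit at 12.5% -- and a different
# workload (e.g. shadow map rendering) flips which unit is the bottleneck.
```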
Often the best way to utilize all the fixed function units all the time is to run several parts of the algorithm in parallel. This is a popular approach for example in TV sets (fixed function image decompression & processing). You have to apply N processing steps to the source image before you can display it. With programmable hardware you'd use all your processing units to process a single step at a time. All processing units would be working 100% of the time, and you would have a latency of one frame to display the result (just enough hardware to finish the frame just in time to begin processing the next). With fixed function hardware, if you process one step at a time, all the fixed function hardware designed for the other steps idles. To utilize all this hardware you need to process N frames in parallel: the hardware for each step processes a different frame and sends its output to the next step. This approach yields 100% hardware utilization (and likely a lower total power requirement than programmable hardware), but introduces a latency of N frames.
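The latency/throughput trade above can be sketched as a tiny simulation, assuming N hypothetical stages that each take exactly one frame-time: every stage is always busy and one frame completes per tick, but each frame spends N ticks in flight.

```python
# Sketch of an N-stage fixed function pipeline, one frame-time per stage.
# Each tick, every stage hands its frame to the next stage and the first
# stage accepts a new one: all N stages stay busy (100% utilization, one
# frame finished per tick), but each frame takes N ticks to come out.

from collections import deque

def pipeline_latencies(n_stages, n_frames):
    """Return (completion_tick - submission_tick) for each frame."""
    stages = deque([None] * n_stages)   # stage i holds (frame_id, submit_tick)
    latencies = {}
    tick = 0
    next_frame = 0
    while len(latencies) < n_frames:
        done = stages.pop()             # frame leaving the last stage
        if done is not None:
            frame_id, submitted = done
            latencies[frame_id] = tick - submitted
        if next_frame < n_frames:
            stages.appendleft((next_frame, tick))
            next_frame += 1
        else:
            stages.appendleft(None)     # pipeline drains at the end
        tick += 1
    return [latencies[i] for i in range(n_frames)]

print(pipeline_latencies(n_stages=4, n_frames=6))   # every frame: 4 ticks of latency
print(pipeline_latencies(n_stages=1, n_frames=3))   # the "programmable" case: latency 1
```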
The good thing about programmable hardware is that it removes the fixed function hardware bottlenecks. With fixed function hardware you always have to program around the bottlenecks, and it doesn't help that the bottlenecks change several times per frame (shadow map rendering, for example, has drastically different bottlenecks than deferred lighting). In a fully programmable architecture, all execution units can help in solving the current algorithm as fast as possible. Idle time is minimized (and so is latency). It also saves a lot of programming time (inventing workarounds for fixed function bottlenecks is a very time-consuming process).
Just pick a paradigm and add more ff hw to accelerate it. Other paradigms will not be disadvantaged since better compute helps everything.
Of course other paradigms will be disadvantaged. There's a huge piece of unused (fixed function) hardware just sitting there doing nothing. If you are not using the hardware, it is wasted. That's not something (console) developers are willing to accept.
Pick tris for continuity/interop with classic rasterization.
I personally believe that triangles will be one of the reasons why we leave rasterization behind (if it happens in the future). Subpixel sized triangles are a huge waste (of processing power, bandwidth and memory). Pure triangles are very inefficient for modeling high detail meshes (high detail CAD geometry can be several gigabytes in size, and that's just for a single building or airplane). With other methods you can have the same level of detail with a much smaller memory footprint (and smaller bandwidth requirements). For example, patches with displacement maps are considerably better, but they are a much harder problem for ray intersection calculation. Voxels, on the other hand, are a very efficient target for ray casting (while sitting somewhere between the two in storage requirements for super high detail models, depending of course on the scene).
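To illustrate why voxels are such a friendly target for ray casting: traversing a regular grid reduces to a few adds and compares per step (the classic Amanatides & Woo 3D DDA), with no per-primitive intersection tests. A minimal sketch against a hypothetical dense grid (real engines use sparse octrees, but the traversal core is the same idea):

```python
import math

def cast_ray(grid, origin, direction, max_steps=256):
    """Step a ray through a dense voxel grid (Amanatides & Woo 3D DDA).
    grid[z][y][x] is truthy for solid voxels; voxel (i,j,k) spans the unit
    cube [i,i+1)x[j,j+1)x[k,k+1). Returns the first solid voxel hit, or None."""
    pos = [int(math.floor(c)) for c in origin]   # current voxel coordinates
    step, t_max, t_delta = [], [], []
    for axis in range(3):
        d = direction[axis]
        if d > 0:
            step.append(1)
            t_max.append((pos[axis] + 1 - origin[axis]) / d)  # t to next boundary
            t_delta.append(1 / d)                             # t per whole voxel
        elif d < 0:
            step.append(-1)
            t_max.append((pos[axis] - origin[axis]) / d)
            t_delta.append(-1 / d)
        else:
            step.append(0)
            t_max.append(math.inf)
            t_delta.append(math.inf)
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    for _ in range(max_steps):
        i, j, k = pos
        if not (0 <= i < nx and 0 <= j < ny and 0 <= k < nz):
            return None                          # ray left the grid
        if grid[k][j][i]:
            return (i, j, k)                     # hit a solid voxel
        axis = t_max.index(min(t_max))           # cross the nearest voxel boundary
        pos[axis] += step[axis]
        t_max[axis] += t_delta[axis]
    return None

# Usage: an 8^3 grid with one solid voxel at (4, 4, 4).
n = 8
grid = [[[0] * n for _ in range(n)] for _ in range(n)]
grid[4][4][4] = 1
print(cast_ray(grid, (0.5, 4.5, 4.5), (1, 0, 0)))   # ray along +x hits (4, 4, 4)
```

The inner loop is just an index comparison, one add and one compare per step, which is why grid-like structures map so well to hardware ray casting.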