Agreed
Can you see anything missing from DX11 you'd like? Or anything from OGL4.4 you'd rather see exposed before geometry shaders and tessellation?
Quick answer (in order of my preference):
- Good, well-defined, portable texture compression (raw data accessible from compute shaders = can do realtime GPU compression).
- Asynchronous compute (multiple concurrent compute queues in addition to render queue). (CUDA, GCN*).
- Multi draw indirect (from OpenGL 4.3 / GCN*)
- Multi draw indirect with the draw call count read from a GPU buffer (OpenGL 4.4) (https://www.opengl.org/registry/specs/ARB/indirect_parameters.txt)
- Ballot (from CUDA and GCN*). Return value in 32/64 bit integer (one wave, each thread sets one bit).
- Sparse texture (PRT / hardware virtual texture) (from OpenGL 4.4 / DirectX 11.2)
- Bindless resources (from OpenGL 4.4 / Nvidia extensions / GCN*)
GCN* = see AMD Sea Islands instruction set (here:
http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf). This hardware is close to next gen / hardware used by Mantle API.
Long answer:
Multi draw indirect is not strictly required, since you can render the whole scene in a single draw call without it. All the required ingredients are included in ES 3.1: indirect draw, indirect dispatch, gl_VertexID, and unordered (UAV) buffer loads from the vertex shader.
My concern here is the performance of UAV (buffer) reads on mobile hardware. On modern PC hardware, storing vertex (and constant) data in UAVs in SoA layout is actually more efficient for the GPU than using vertex buffers. This way the shader compiler can reorder the calculations between the partial vertex stream reads and hide latency much better than with AoS style (big struct) vertices. So performance is actually better when vertex data is stored in custom (UAV) buffers than in a fat vertex buffer. I am just hoping that mobile hardware will behave similarly. Modern PC hardware has flexible general purpose L1 and L2 caches that work as well for UAVs as they do for constant buffers or vertex buffers.
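To illustrate the layout difference (a minimal CPU-side sketch in Python; the function names are illustrative, not real API): in AoS one fat struct is read per vertex, while in SoA each attribute lives in its own tightly packed stream, so a shader compiler can issue the independent per-stream loads early and overlap their latency.

```python
# Hypothetical sketch of AoS vs SoA vertex storage (names are illustrative).

def make_aos(positions, normals, uvs):
    # AoS: one interleaved "fat vertex" struct per vertex.
    return [(p, n, t) for p, n, t in zip(positions, normals, uvs)]

def fetch_aos(aos, vertex_id):
    # One dependent load of the whole struct.
    return aos[vertex_id]

def make_soa(positions, normals, uvs):
    # SoA: a separate tightly packed stream per attribute.
    return {"position": list(positions), "normal": list(normals), "uv": list(uvs)}

def fetch_soa(soa, vertex_id):
    # Three independent loads -- on a GPU these can be reordered and overlapped.
    return (soa["position"][vertex_id],
            soa["normal"][vertex_id],
            soa["uv"][vertex_id])
```

Both layouts return the same vertex data; the difference is purely in how independently the partial reads can be scheduled.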
Nobody except the hardware engineers themselves knows yet how well PowerVR chips perform in compute shaders and (cache friendly, but multiple indirection) UAV buffer reads.
Hardware sparse texturing (PRT) and/or bindless resources are not that critical for us, because we have been using software virtual texturing (shader based indirection) for multiple projects, and are perfectly happy with it. An optimal virtual texture indirection is just 4 (1d) ALU instructions. Custom anisotropic filtering is quite hacky, but trilinear is straightforward and fast (and definitely enough for a mobile game).
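A hedged sketch of what such a software indirection can look like (this is an illustrative reconstruction, not the poster's actual code; the page table layout and sizes are assumptions): the indirection texture stores, per virtual page, a scale and bias that map virtual UVs into the physical cache atlas, so the final mapping is one multiply-add per axis.

```python
# Illustrative software virtual-texture indirection (all names hypothetical).

PAGE_TABLE_SIZE = 4  # virtual pages per axis (tiny, just for the example)

def lookup_indirection(page_table, u, v):
    # Point-sample the indirection texture at the virtual UV.
    px = min(int(u * PAGE_TABLE_SIZE), PAGE_TABLE_SIZE - 1)
    py = min(int(v * PAGE_TABLE_SIZE), PAGE_TABLE_SIZE - 1)
    return page_table[py][px]  # (scale, bias_u, bias_v) for that page

def virtual_to_physical(page_table, u, v):
    scale, bias_u, bias_v = lookup_indirection(page_table, u, v)
    # One multiply-add per axis -- the handful of ALU ops the text refers to.
    return (u * scale + bias_u, v * scale + bias_v)
```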
The thing I am most concerned about is the state of texture compression in OpenGL in general. Our virtual texturing relies heavily on real time DXT texture compression. We write directly from a compute shader on top of a DXT5 compressed VT atlas (aliased to a 32-32-32-32 integer target). Modern GPUs actually do optimized DXT5 compression (simple endpoint selection) faster than they copy uncompressed 8888 data (DXT5 compression is also BW bound, but obviously the write BW is only 25% of the uncompressed case). Even if real time texture compression were slightly slower on mobile devices than copying data to the VT cache atlas, it wouldn't matter much, since the amortized cost is so small. On average, each generated texture page is sampled 200+ times (60 frames per second, 4 seconds = 240 frames) before it goes out of the screen. Texture compression saves 75% of the bandwidth cost of those 200+ sampling operations, and thus would save a huge amount of battery life on mobile devices (and also boost performance on BW limited mobile devices). We need to do real time compression of virtual texture pages because we blend decals on top of the texture data (this saves a huge amount of rendering cost in scenes that have lots of decals, and decals are needed to get lots of texture variety into scenes).
I just hope we don't need to use uncompressed data on mobile devices while we can use proper texture compression on consoles and PCs. That would be very awkward.
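The amortization argument above can be checked with back-of-envelope arithmetic (the page size is an assumption for illustration; the frame count and byte sizes come from the text: DXT5 is 1 byte/texel vs 4 bytes/texel for uncompressed 8888):

```python
# Back-of-envelope bandwidth amortization (page size is an assumed example).

PAGE_TEXELS = 128 * 128          # assumed VT page size (illustrative)
FRAMES_SAMPLED = 240             # ~4 seconds at 60 fps, from the text
BYTES_UNCOMPRESSED = 4           # RGBA 8888
BYTES_DXT5 = 1                   # 128 bits per 4x4 texel block

# Total bytes read while the page stays on screen, per format.
read_bw_uncompressed = PAGE_TEXELS * BYTES_UNCOMPRESSED * FRAMES_SAMPLED
read_bw_dxt5 = PAGE_TEXELS * BYTES_DXT5 * FRAMES_SAMPLED

# Fraction of sampling bandwidth saved by compression.
savings = 1.0 - read_bw_dxt5 / read_bw_uncompressed  # 0.75
```

The one-time compression cost is paid once per page, while the 75% read savings is collected over every one of the 240 frames, which is why the amortized cost is so small.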
Asynchronous compute is great. Our shaders doing rasterization (shadow map or textureless g-buffer rendering) are completely bound by fixed function units (such as triangle/primitive setup, ROP fill rate, attribute caches, etc). Executing ALU & BW heavy operations such as lighting and post processing simultaneously increases performance and GPU utilization dramatically (as the bottlenecks are different). We need this for mobile devices as well.
The ballot instruction (in CUDA and GCN*) is good for reducing LDS traffic (and instruction counts in general), because it allows you to do a prefix sum for a wave/warp using just a few instructions. Prefix sum is very important for many GPU algorithms. ES 3.1 has bitCount and bitFieldExtract instructions; all we need is a ballot instruction. Ballot = each thread inputs one boolean to the ballot instruction, and the ballot instruction returns the same packed (one bit per thread) 32/64 bit integer for all threads in the wave/warp.
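A minimal CPU-side simulation of that idea (illustrative Python, not GPU code): ballot packs one bit per lane into a shared mask, and each lane then counts the set bits below its own index to get its exclusive prefix sum, mirroring what ballot + bitFieldExtract + bitCount would do per wave.

```python
# Simulated wave-level ballot and prefix sum (lane = thread index in the wave).

def ballot(flags):
    # Each lane contributes one boolean; every lane receives the same
    # packed (one bit per lane) integer mask.
    mask = 0
    for lane, flag in enumerate(flags):
        if flag:
            mask |= 1 << lane
    return mask

def exclusive_prefix_sum(flags):
    # Per lane: mask off the bits below this lane (bitFieldExtract) and
    # count them (bitCount) -- just a few instructions on real hardware.
    mask = ballot(flags)
    return [bin(mask & ((1 << lane) - 1)).count("1")
            for lane in range(len(flags))]
```

With this, a wave can compute compaction offsets without any LDS traffic at all.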
What kind of LDS usage and access patterns do you think matter in practice? And how much would a slow LDS implementation hurt performance of the kind of renderer you're thinking of?
If append buffers (atomic counter buffers) are as fast on mobile hardware as they are on GCN (almost equal in speed to a normal linear write), they can be used for many tasks that require compacting data. This greatly reduces the need for fast LDS (in steps like occlusion culling and scene setup). However, LDS is still needed for post processing, blur kernels being the most important use case: LDS saves lots of bandwidth and sampling cost in blur kernels. Modern lighting algorithms also load the potentially visible lights into LDS (by screen region or hashed cluster identifier) and read the light data from LDS for each pixel in the same cluster (again saving bandwidth). Hopefully these use cases are fast enough, as compute shaders are much more efficient (saving battery life) than pixel shaders here: the data is as close to the execution units as possible, which makes repeated reads much more energy efficient.
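The append-buffer compaction pattern can be sketched like this (a CPU-side stand-in in Python; the class and function names are hypothetical): each "thread" tests one object, and visible objects reserve an output slot via an atomic counter increment, producing a densely packed result.

```python
import itertools

class AppendBuffer:
    # CPU-side stand-in for an atomic-counter append buffer (hypothetical API).
    def __init__(self):
        self._counter = itertools.count()  # stands in for the atomic counter
        self._data = {}

    def append(self, value):
        slot = next(self._counter)  # atomic increment reserves a dense slot
        self._data[slot] = value
        return slot

    def as_list(self):
        return [self._data[i] for i in range(len(self._data))]

def compact_visible(objects, is_visible):
    # E.g. occlusion culling: keep only the objects that pass the test,
    # written contiguously with no gaps.
    out = AppendBuffer()
    for obj in objects:  # each iteration models one GPU thread
        if is_visible(obj):
            out.append(obj)
    return out.as_list()
```

On a GPU the threads run concurrently and the output order is unordered, but the result is still a gap-free compacted stream, which is what the scene-setup steps need.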