Khronos Releases OpenGL ES 3.1 Specification

https://www.khronos.org/news/press/khronos-releases-opengl-es-3.1-specification

http://www.anandtech.com/show/7867/khronos-announces-opengl-es-31

March 17, 2014 – San Francisco, Game Developer’s Conference – The Khronos™ Group today announced the immediate release of the OpenGL® ES 3.1 specification, bringing significant functionality enhancements to the industry-leading, royalty-free 3D graphics API that is used on nearly all of the world’s mobile devices. OpenGL ES 3.1 provides access to state-of-the-art graphics processing unit (GPU) functionality with portability across diverse mobile and embedded operating systems and platforms. The full specification and reference materials are available for immediate download at http://www.khronos.org/registry/gles/.
 
Yeah! Everything I need :)

(and not some crap like tessellation or geometry shaders)

---

Compute shaders: check
Indirect dispatch: check :)
Indirect draw: check :)
Atomics: check
Atomic counter buffers: check
Barriers: check
Reinterpret cast float<->int: check
Buffers (UAVs): check :)
Image store/load (RW texture UAVs): check
Packing / unpacking instructions (for small types) + float exp/mantissa generation: check
Gather: check
Thread block shared memory (LDS): check!!!!1 :)

This API is basically as good as DirectX 11. Everything important is there. You could run a next gen console engine on top of this :)

EDIT: Hah, the forum said I had too many images (smilies) in my post. Good news is always worth celebrating.
 
Yeah! Everything I need :)
(and not some crap like tessellation or geometry shaders)
Agreed :D Can you see anything missing from DX11 that you'd like? Or anything from OGL4.4 you'd rather see exposed before geometry shaders and tessellation?

Thread block shared memory (LDS): check!!!!1 :)
I'm very curious how that will work out in practice for game developers. Unlike on the desktop, the way shared memory is implemented varies greatly from architecture to architecture (you could argue it's either flawed or too different on some architectures).

What kind of LDS usage and access patterns do you think matter in practice? And how much would a slow LDS implementation hurt performance of the kind of renderer you're thinking of?
 
What's wrong with tessellation on mobile?

Given an identical end result, isn't a lower polygon count + tessellation cheaper/more power-efficient than a higher polygon count?
 
What's wrong with tessellation on mobile?

Given an identical end result, isn't a lower polygon count + tessellation cheaper/more power-efficient than a higher polygon count?
In theory, yes. But in practice making tessellation work properly with LODs and complex content pipelines (including third-party 3D modeling/animation tools) is not that straightforward, especially if your intention is to reduce the polygon count (optimize rendering). Artists used to polygon modelling also need to learn a new (quite different) way to model things.

Tessellation cannot efficiently replace per-pixel displacement mapping techniques (such as parallax occlusion mapping or QDM), because tessellating to single-pixel triangles both kills quad efficiency and overloads the triangle/primitive setup engines. If you don't tessellate to single-pixel triangles, you basically cannot use tessellation on rough surfaces, because the vertices will wobble over the small surface details (causing an unstable look).
 
Are OES3.1 compute shaders comparable to OGL4.3 and DX11 compute shaders, or are there limitations? Omitting tessellation saves die area and power by dropping the tessellator, but if compute shaders, vertex shaders, and pixel shaders are all fully featured and comparable to OGL4.x/DX11, with unified shaders, is there really much saving in omitting geometry shaders?
 
Agreed :D Can you see anything missing from DX11 that you'd like? Or anything from OGL4.4 you'd rather see exposed before geometry shaders and tessellation?
Quick answer (in order of my preference):
- Good, well-defined and portable texture compression (raw data accessible from compute shaders = can do real-time GPU compression).
- Asynchronous compute (multiple concurrent compute queues in addition to render queue). (CUDA, GCN*).
- Multi draw indirect (from OpenGL 4.3 / GCN*)
- Multi draw with the draw call count read from a GPU buffer (OpenGL 4.4) (https://www.opengl.org/registry/specs/ARB/indirect_parameters.txt)
- Ballot (from CUDA and GCN*). Return value in 32/64 bit integer (one wave, each thread sets one bit).
- Sparse texture (PRT / hardware virtual texture) (from OpenGL 4.4 / DirectX 11.2)
- Bindless resources (from OpenGL 4.4 / Nvidia extensions / GCN*)

GCN* = see the AMD Sea Islands instruction set (here: http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf). This hardware is close to the next gen consoles / the hardware used by the Mantle API.

Long answer:

Multi draw is not necessarily required, since you can render the whole scene in a single draw call without it. All the required ingredients are included in ES 3.1: indirect draw, indirect dispatch, gl_VertexID, and unordered (UAV) buffer loads from the vertex shader.

My concern here is the performance of mobile hardware UAV (buffer) reads. On modern PC hardware, storing vertex (and constant) data in UAVs in SoA layout is actually more efficient for the GPU than using vertex buffers. This way the shader compiler can reorder the calculations between the partial vertex stream reads and hide latency much better than with AoS-style (big struct) vertices. So performance is actually better when storing vertex data in custom (UAV) buffers than in a fat vertex buffer. I am just hoping that the performance on mobile hardware will behave similarly. Modern PC hardware has flexible general-purpose L1 and L2 caches that work as well for UAVs as they do for constant buffers or vertex buffers.
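To make that concrete, here is a minimal sketch of that kind of vertex pulling in ES 3.1 GLSL. The buffer names, bindings and layouts are illustrative assumptions, and note that ES 3.1 allows GL_MAX_VERTEX_SHADER_STORAGE_BLOCKS to be zero, so vertex-stage SSBO reads are not guaranteed on every implementation:

```glsl
#version 310 es
// Hedged sketch: pull SoA vertex data with gl_VertexID instead of fat AoS vertex buffers.
// Buffer names/bindings are illustrative. Vertex-stage SSBO support is optional in ES 3.1.
layout(std430, binding = 0) readonly buffer Positions { vec4 positions[]; };
layout(std430, binding = 1) readonly buffer Normals   { vec4 normals[];   };

uniform mat4 uViewProj;
out vec3 vNormal;

void main()
{
    // Separate tightly packed streams (SoA): the compiler can schedule the two
    // independent loads freely and hide more latency than one big struct load.
    vec4 p = positions[gl_VertexID];
    vec4 n = normals[gl_VertexID];
    vNormal     = n.xyz;
    gl_Position = uViewProj * vec4(p.xyz, 1.0);
}
```

Paired with glDrawArraysIndirect / glDispatchComputeIndirect, the draw parameters themselves can also live in a GPU-written buffer, which is the "single draw call without multi draw" path mentioned above.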

Nobody except the hardware engineers themselves knows yet how well PowerVR chips perform in compute shaders and (cache friendly, but multiple indirection) UAV buffer reads.

Hardware sparse texturing (PRT) and/or bindless resources are not that critical for us, because we have been using software virtual texturing (shader-based indirection) for multiple projects, and are perfectly happy with it. The most optimal virtual texture indirection code is just 4 (1d) ALU instructions. Custom anisotropic filtering is quite hacky, but trilinear is straightforward and fast (and definitely enough for a mobile game).
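For illustration, a hedged sketch of what that shader-based indirection can look like (the indirection texel encoding and names below are my own assumptions, not the exact code referred to above):

```glsl
#version 310 es
precision highp float;
// Hedged sketch of software virtual texturing: one indirection fetch, then a
// scale + bias (the handful of ALU ops mentioned above) remaps the virtual UV
// into the physical page atlas. Texture contents and encodings are illustrative.
uniform highp sampler2D uIndirection; // one texel per VT page: xy = atlas offset, z = scale
uniform highp sampler2D uPageAtlas;   // cache of resident pages
in vec2 vVirtualUV;
out vec4 oColor;

void main()
{
    vec4 page    = textureLod(uIndirection, vVirtualUV, 0.0);
    vec2 atlasUV = vVirtualUV * page.z + page.xy; // one vec2 mad = a few scalar ALU ops
    oColor = texture(uPageAtlas, atlasUV);
}
```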

The thing I am most concerned about is the state of texture compression in OpenGL in general. Our virtual texturing relies heavily on real-time DXT texture compression. We write directly from a compute shader on top of the DXT5-compressed VT atlas (aliased to a 32-32-32-32 integer target). Modern GPUs actually do optimized DXT5 compression (simple endpoint selection) faster than they copy uncompressed 8888 data (DXT5 texture compression is also BW bound, but obviously the write BW is only 25% of the uncompressed case). Even if real-time texture compression were slightly slower on mobile devices than copying data to the VT cache atlas, it wouldn't matter much, since the amortized cost is so small. On average, each generated texture page is sampled 200+ times (60 frames per second, 4 seconds = 240 frames) before it goes off screen. Texture compression saves 75% of the bandwidth cost of these 200+ sampling operations, and thus would save a huge amount of battery life on mobile devices (and also boost performance on BW-limited mobile devices). We need to do real-time compression of virtual texture pages because we blend decals on top of the texture data (this saves a huge amount of rendering cost in scenes that have lots of decals, and decals are needed to get lots of texture variety into scenes).

I just hope we don't need to use uncompressed data on mobile devices while we can use proper texture compression on consoles and PCs. That would be very awkward.
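For reference, a heavily hedged sketch of the compute-shader block-writing idea. This is not a real endpoint-selection compressor: it just emits flat single-color 4x4 blocks so the 128-bit DXT5 block layout is visible, and the rgba32ui output image is assumed to alias the compressed atlas memory, which is exactly the kind of aliasing standard GL ES doesn't expose:

```glsl
#version 310 es
precision highp float;
// Hedged sketch: write DXT5 blocks from a compute shader. Each thread emits one
// flat (single-color) 4x4 block; a real compressor would pick endpoints/indices.
layout(local_size_x = 8, local_size_y = 8) in;

layout(binding = 0) uniform highp sampler2D uSrc;                       // uncompressed source
layout(rgba32ui, binding = 0) writeonly uniform highp uimage2D uBlocks; // assumed alias of the DXT5 atlas

uint packRGB565(vec3 c)
{
    uvec3 q = uvec3(round(clamp(c, 0.0, 1.0) * vec3(31.0, 63.0, 31.0)));
    return (q.r << 11) | (q.g << 5) | q.b;
}

void main()
{
    ivec2 blockCoord = ivec2(gl_GlobalInvocationID.xy);   // one thread = one 4x4 block
    vec4 texel = texelFetch(uSrc, blockCoord * 4, 0);     // top-left texel stands in for the block

    uint c565 = packRGB565(texel.rgb);
    uint a8   = uint(round(clamp(texel.a, 0.0, 1.0) * 255.0));

    // 16-byte DXT5 block packed little-endian into a uvec4:
    uvec4 block;
    block.x = a8 | (a8 << 8);      // alpha0, alpha1; first alpha index bits = 0
    block.y = 0u;                  // remaining 3-bit alpha indices = 0 (select alpha0)
    block.z = c565 | (c565 << 16); // color0 and color1 endpoints (identical)
    block.w = 0u;                  // 2-bit color indices = 0 (select color0)
    imageStore(uBlocks, blockCoord, block);
}
```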

Asynchronous compute is great. Our shaders doing rasterization (shadow map or textureless g-buffer rendering) are completely bound by fixed function units (such as triangle/primitive setup, ROP fill rate, attribute caches, etc). Executing ALU & BW heavy operations such as lighting and post processing simultaneously increases performance and GPU utilization dramatically (as the bottlenecks are different). We need this for mobile devices as well.

The ballot instruction (in CUDA and GCN*) is good for reducing LDS traffic (and instruction counts in general), because it allows you to do a prefix sum calculation for a wave/warp using just a few instructions. Prefix sums are very important for many GPU algorithms. ES 3.1 has bitCount and bitfieldExtract instructions. All we need is a ballot instruction. Ballot = each thread inputs one boolean to the ballot instruction, and the ballot instruction returns the same packed (one bit per thread) 32/64-bit integer to all threads in the wave/warp.
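To show what ballot would buy, here is a hedged ES 3.1 sketch of a per-wave exclusive prefix sum. Since ES 3.1 has no ballot, the vote mask is emulated with shared memory and a barrier; a real ballot instruction would collapse that emulation into a single instruction. Buffer names are illustrative, and mapping one 32-wide workgroup to one wave/warp is itself an assumption about the hardware:

```glsl
#version 310 es
// Hedged sketch: exclusive prefix sum of per-thread votes within a 32-wide workgroup.
// The atomicOr + barriers emulate what a single ballot instruction would return.
layout(local_size_x = 32) in;

layout(std430, binding = 0) readonly  buffer InData  { uint values[];    };
layout(std430, binding = 1) writeonly buffer OutData { uint prefixSum[]; };

shared uint sVoteMask;

void main()
{
    uint lane = gl_LocalInvocationID.x;
    bool vote = values[gl_GlobalInvocationID.x] != 0u;  // each thread's boolean input

    // Emulated ballot: every voting lane sets its bit in a shared mask.
    if (lane == 0u) sVoteMask = 0u;
    memoryBarrierShared();
    barrier();
    if (vote) atomicOr(sVoteMask, 1u << lane);
    memoryBarrierShared();
    barrier();
    uint mask = sVoteMask;                              // ballot would return this directly

    // Exclusive prefix sum over the wave: count votes from lower-indexed lanes
    // using the bitfieldExtract/bitCount that ES 3.1 already provides.
    uint lower = bitfieldExtract(mask, 0, int(lane));   // extracting 0 bits yields 0 for lane 0
    prefixSum[gl_GlobalInvocationID.x] = uint(bitCount(lower));
}
```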
What kind of LDS usage and access patterns do you think matter in practice? And how much would a slow LDS implementation hurt performance of the kind of renderer you're thinking of?
If append buffers (atomic counter buffers) are as fast on mobile hardware as they are on GCN (almost equal in speed to a normal linear write), they can be used for many tasks that require compacting data. This greatly reduces the need for fast LDS (in steps like occlusion culling and scene setup). However, LDS is still needed for post processing, blur kernels being the most important use case. LDS saves lots of bandwidth and sampling cost in blur kernels. Modern lighting algorithms also load the potentially visible lights into LDS (by screen region or hashed cluster identifier), and read the light data from LDS for each pixel in the same cluster (again saving bandwidth). Hopefully these use cases are fast enough, as compute shaders are much more efficient (saving battery life) than pixel shaders for these use cases (the data is kept as close to the execution units as possible = much more energy efficient to read it repeatedly).
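As a concrete example of the blur-kernel pattern, here is a minimal ES 3.1 compute shader sketch: each workgroup loads a tile (plus halo) of the source image into shared memory once, then every thread reads its whole neighbourhood from LDS instead of re-sampling the texture. Tile size, kernel radius and formats are illustrative assumptions:

```glsl
#version 310 es
precision highp float;
// Hedged sketch of an LDS-backed box blur: cooperative tile load into shared
// memory, then the 25 reads per pixel come from LDS instead of the texture.
layout(local_size_x = 8, local_size_y = 8) in;

layout(binding = 0) uniform highp sampler2D uSrc;
layout(rgba8, binding = 0) writeonly uniform highp image2D uDst;

const int RADIUS = 2;
const int TILE   = 8 + 2 * RADIUS;          // 8x8 tile plus halo
shared vec4 sTile[TILE][TILE];

void main()
{
    ivec2 srcSize    = textureSize(uSrc, 0);
    ivec2 tileOrigin = ivec2(gl_WorkGroupID.xy) * 8 - RADIUS;

    // Cooperative load: 64 threads fill TILE*TILE shared entries.
    for (int i = int(gl_LocalInvocationIndex); i < TILE * TILE; i += 64) {
        ivec2 coord = clamp(tileOrigin + ivec2(i % TILE, i / TILE), ivec2(0), srcSize - 1);
        sTile[i / TILE][i % TILE] = texelFetch(uSrc, coord, 0);
    }
    memoryBarrierShared();
    barrier();                               // whole tile visible before anyone reads it

    ivec2 local = ivec2(gl_LocalInvocationID.xy) + RADIUS;
    vec4 sum = vec4(0.0);
    for (int y = -RADIUS; y <= RADIUS; ++y)
        for (int x = -RADIUS; x <= RADIUS; ++x)
            sum += sTile[local.y + y][local.x + x];

    vec4 result = sum / float((2 * RADIUS + 1) * (2 * RADIUS + 1));
    imageStore(uDst, ivec2(gl_GlobalInvocationID.xy), result);
}
```

The clustered-light pattern described above has the same structure: load the cluster's light list into shared memory once, barrier, then loop over it per pixel.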
 
Are OES3.1 compute shaders comparable to OGL4.3 and DX11 compute shaders, or are there limitations? Omitting tessellation saves die area and power by dropping the tessellator, but if compute shaders, vertex shaders, and pixel shaders are all fully featured and comparable to OGL4.x/DX11, with unified shaders, is there really much saving in omitting geometry shaders?
According to the OpenGL ES 3.1 reference (http://www.khronos.org/opengles/sdk/docs/man31/), the compute shaders are fully featured. I didn't find any DirectX 11 compute shader (or indirect draw) feature that I would miss. The only thing not mentioned was the guaranteed minimum for the maximum LDS size... Hopefully no implementation returns zero for this :(
 
sebbi said:
- Asynchronous compute (multiple concurrent compute queues in addition to render queue). (CUDA, GCN*).

Just expanding on sebbi's point here.

These multiple async compute queues are not really exposed in DX and GL right now. DX and GL seem to serialize all kernels (graphics and compute) into the same queue, afaik. It would be nice to have this support.

GCN (in the Tahiti generation) has up to 2 ACEs, and newer GCN parts such as Bonaire and Kaveri have up to 8. These are exposed in OpenCL, and I did find nice performance improvements in some apps by using multiple queues. From Mantle's public presentations so far, it looks like Mantle supports multiple async compute queues on GCN as well.

On the Nvidia side, some support has been present since Fermi, though it had weird restrictions, and Kepler improves things a lot. This support has only been exposed in CUDA so far, with Nvidia's lackluster OpenCL driver not exposing it, afaik.

Anyway, it would be great if these were available in DX or GL.
 
As long as Apple and Google adopt this update to their API, GPGPU can finally move forward a little more on mobile.
 
These multiple async compute queues are not really exposed in DX and GL right now. DX and GL seem to serialize all kernels (graphics and compute) into the same queue, afaik. It would be nice to have this support.
Yes, this is the biggest thing I would want to have in future desktop OpenGL and DirectX versions as well. It shouldn't be that hard, since CUDA and OpenCL (and it seems Mantle as well) already expose it. It should be a big GPU performance boost, even for those GPUs that only have 2 ACEs (two simultaneous tasks should be enough to get most of the gains).
 
Do we need new mobile hardware, or is the current generation (Adreno 3XX / Rogue / Mali T6XX) enough (at least hardware-wise)?
 
Tessellation cannot efficiently replace per-pixel displacement mapping techniques (such as parallax occlusion mapping or QDM), because tessellating to single-pixel triangles both kills quad efficiency and overloads the triangle/primitive setup engines. If you don't tessellate to single-pixel triangles, you basically cannot use tessellation on rough surfaces, because the vertices will wobble over the small surface details (causing an unstable look).

Tessellation is not meant to replace displacement mapping, but rather to complement it, while replacing some existing bump mapping techniques that have their own set of drawbacks: http://www.nvidia.com/object/tessellation.html.

"At its most basic, displacement mapping can be used as a drop-in replacement for existing bump mapping techniques. Current techniques such as normal mapping create the illusion of bumpy surfaces through better pixel shading. All these techniques work only in particular cases, and are only partially convincing when they do work. Take the case of parallax occlusion mapping, a very advanced form of bump mapping. Though it produces the illusion of overlapping geometry, it only works on flat surfaces and only in the interior of the object (see image above). True displacement mapping has none of these problems and produces accurate results from all viewing angles."

And while there are obviously tradeoffs even with programmable tessellation, NVIDIA believes that tessellation is a "key technology for efficient geometry", even in the ultra mobile space. They claim "> 50x Triangle Savings vs Brute Force ES2.0".

OpenGL ES 3.1 is clearly a great step forward for the ultra mobile space, but it may be challenging to quickly and easily bring console games to the ultra mobile space without full OpenGL 4.x support (including support for tessellation and geometry shaders). I can understand why OpenGL ES 3.x was implemented as such, but I don't consider it a great thing that some key features of OpenGL 4.x were left out entirely when upcoming ultra mobile hardware from NVIDIA, Qualcomm, etc. will support those very features.
 
Really nice step forward :)

it may be challenging to quickly and easily bring console games to the ultra mobile space without full OpenGL 4.x support
Why would it be difficult?

I'm not sure I've seen a game with tessellation where turning it off wasn't an option. Geometry shaders are a little more ingrained, but not hard to swap out. I'm all for tessellation, since when used well it is indeed a key technology for efficient geometry. However, let's be honest: the majority of applications of tessellation to date have been to excessively over-tessellate surfaces to make them look round. There aren't many titles where enabling tessellation actually decreases the geometry load.
 
Here is what Tim Sweeney had to say:

Most importantly, the software runs full OpenGL on mobile hardware. To have the full graphics API that's available on PC and the highest-end platforms in the industry is just a breakthrough. It enables us to bring graphics up to the next level without any compromises. With full OpenGL it knocks down the remaining major barrier between PC-level graphics and mobile-level graphics. From here onward, I think we're going to see the performance between mobile, PC and high-end console gaming continue to narrow to the point where the differences between the platforms really blur.
 
I don't believe he's saying he would find it difficult to port desktop content onto an API without tessellation/GS in that quote... Maybe I'm missing something?

It would of course be easier not to change anything and run desktop OpenGL 4 directly, but I'm suggesting it isn't meaningfully harder to use this new OGLES either. Perhaps naively, I'd like to hope most developers know that simply copy-pasting a game unmodified from a 100W CPU + 200W GPU environment onto a mobile SoC using closer to 2W would be a pretty bad idea (assuming we don't want heatsinks, fans and permanently connected power cables to start appearing in tablets).

Once you're profiling your apps for power consumption and bandwidth, the API isn't going to be too high on the priority list. Just because something can be done doesn't mean it should be.

(PS: not knocking NVidia btw, I'm excited to see what K1 can do - part of me hopes they don't get unfairly penalised for inefficient ports off the back of the API support, as it looks like a decent architecture).
 