PowerVR Rogue Architecture

So this is so you can pre-fill the texture memory with textures that aren't visible?

Well, yeah; the point here is that the APIs for virtual texturing (AMD's mechanism, DX11.2, GL4.x etc) are being defined in such a way that it is not possible to avoid doing this for overdrawn geometry.
 
It'll only come back to haunt them.

Ummm, the impossibility of avoiding this kind of virtual-texture overfetch is a consequence of how the feature is defined at the API level, not of the specifics of their hardware design. For immediate-mode renderers like those of Nvidia and AMD, the only harm done is that you fetch somewhat bigger parts of the texture than you actually need for your final framebuffers. For renderers that remove overdraw as part of their normal operation - PowerVR and Mali - it also breaks that overdraw removal, specifically because the feature manifests itself in shaders as accesses to UAVs and/or other forms of non-framebuffer memory, which is a kind of activity that is in general NOT safe to cull.

This will not haunt THEM, it will haunt US.
 
Speaking strictly for myself, obviously...

It's UAVs that are the problem because MS specified DirectX 11 without tilers in mind - virtual texturing by itself is perfectly fine.

In GLES3.1 or Vulkan, you would simply use an atomic buffer to get a unique index and then write to that location without order guarantees. The specification is extremely clear that there is no guarantee a fragment will be executed if it does not ultimately contribute to the framebuffer, irrespective of side effects. See section 7.11.1 of the GLES3.1 Specification:

https://www.khronos.org/registry/gles/specs/3.1/es_spec_3.1.pdf said:
In addition, if early per-fragment tests are enabled (see section 13.6), the fragment shader will not be executed if it is discarded during the early per-fragment tests, and a fragment may not be executed if the fragment will never contribute to the framebuffer. For example, if a fragment A written to a pixel or sample from primitive A will be replaced by a fragment B written to a pixel or sample from primitive B, then fragment A may not be executed even if primitive A is specified prior to primitive B.

I must admit I didn't know myself and had to ask around then double-check the spec to make sure we weren't missing anything, but thankfully mobile APIs have always been designed with TBDRs in mind so we can avoid this kind of issue. It's definitely beneficial for us that Vulkan is also a Desktop API :)
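
For what it's worth, here's a minimal sketch of that atomic-buffer approach as a GLES 3.1 fragment shader (names, bindings and the recorded payload are purely illustrative; also note ES 3.1 only guarantees SSBO/atomic counter support in compute shaders, so fragment-shader support has to be queried):

Code:
#version 310 es
precision highp float;

// Atomic counter used to hand out unique slots in the feedback buffer.
layout(binding = 0, offset = 0) uniform atomic_uint uFeedbackCount;

// Unordered feedback storage; entries can land in any order.
layout(std430, binding = 1) buffer FeedbackBuffer {
    uint entries[];
} feedback;

uniform sampler2D uAlbedo;
in vec2 vUV;
flat in uint vMaterialId;   // example payload to record
out vec4 oColor;

void main()
{
    // Grab a unique index; no ordering is implied between fragments.
    uint slot = atomicCounterIncrement(uFeedbackCount);
    if (slot < uint(feedback.entries.length()))
        feedback.entries[slot] = vMaterialId;

    oColor = texture(uAlbedo, vUV);
}

Because the index comes from an atomic counter rather than from any rasterisation order, whatever consumes the buffer must not care about ordering - which is exactly the property that lets the implementation skip fragments that never reach the framebuffer.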

I'm not completely sure how this works with ARM's publicly stated approach of killing fragment shaders mid-way into their execution when a new fragment obscuring the former is rasterised, though. While the specification clearly allows either executing or not executing a fragment that does not ultimately contribute to the framebuffer, it makes no mention of *partially* executing the fragment shader.

To take an extreme example, imagine a shader updating a linked list, and it is killed after it has updated some of the pointers, but not all of them - the linked list is now degenerate. I am not certain but I think it's not possible to safely kill a fragment mid-execution if it has any side-effects. Of course you can still kill it if you find it is not required before it starts execution, so you would still get most of the benefit.
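
To make that hazard concrete, here's a rough sketch of a per-pixel linked-list insertion (the classic order-independent-transparency structure; all names are mine, and on ES the image atomics and fragment-side buffer access are optional features, e.g. via GL_OES_shader_image_atomic). Killing the fragment between the two marked steps leaves the head pointer referring to a node whose next pointer was never written:

Code:
#version 310 es
#extension GL_OES_shader_image_atomic : require
precision highp float;

// Counter handing out node indices (bounds checking omitted for brevity).
layout(binding = 0, offset = 0) uniform atomic_uint uNodeCount;

// One list-head index per pixel.
layout(r32ui, binding = 0) coherent uniform highp uimage2D uHeadPtr;

struct Node { uint next; uint packedColor; float depth; };
layout(std430, binding = 1) buffer NodeBuffer { Node nodes[]; };

in vec4 vColor;

void main()
{
    uint newNode = atomicCounterIncrement(uNodeCount);

    // (1) Publish the new node as this pixel's list head.
    uint oldHead = imageAtomicExchange(uHeadPtr, ivec2(gl_FragCoord.xy), newNode);

    // (2) Link it to the rest of the list and fill in its payload.
    //     Killing the fragment between (1) and (2) leaves nodes[newNode].next
    //     pointing at garbage, so the whole list for this pixel is now broken.
    nodes[newNode].next = oldHead;
    nodes[newNode].packedColor = packUnorm4x8(vColor);
    nodes[newNode].depth = gl_FragCoord.z;
}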
 
It's UAVs that are the problem because MS specified DirectX 11 without tilers in mind - virtual texturing by itself is perfectly fine.

In GLES3.1 or Vulkan, you would simply use an atomic buffer to get a unique index and then write to that location without order guarantees. The specification is extremely clear that there is no guarantee a fragment will be executed if it does not ultimately contribute to the framebuffer, irrespective of side effects. See section 7.11.1 of the GLES3.1 Specification.



I must admit I didn't know myself and had to ask around then double-check the spec to make sure we weren't missing anything, but thankfully mobile APIs have always been designed with TBDRs in mind so we can avoid this kind of issue. It's definitely beneficial for us that Vulkan is also a Desktop API :)

Oh. OK. That clearly solves the issue I had in mind, at least for GLES3.1 and Vulkan. Thanks.
 
I'm not completely sure how this works with ARM's publicly stated approach of killing fragment shaders mid-way into their execution when a new fragment obscuring the former is rasterised, though. While the specification clearly allows either executing or not executing a fragment that does not ultimately contribute to the framebuffer, it makes no mention of *partially* executing the fragment shader.

To take an extreme example, imagine a shader updating a linked list, and it is killed after it has updated some of the pointers, but not all of them - the linked list is now degenerate. I am not certain but I think it's not possible to safely kill a fragment mid-execution if it has any side-effects. Of course you can still kill it if you find it is not required before it starts execution, so you would still get most of the benefit.

You can kill a shader like that in the middle just fine, as long as you don't actually kill it in between side-effects.
 
The "if early per-fragment tests are enabled (see section 13.6)" part is notable, though. This can be enabled in fragment shader code in ES3.1 but is disabled by default. I wonder if the same applies for Vulkan.
 
You can kill a shader like that in the middle just fine, as long as you don't actually kill it in between side-effects.
Good point - and you could possibly share some of that logic with the pixel output RAW (read-after-write) logic which you might need for e.g. Pixel Local Storage already, depending on your architecture (you similarly don't want to interrupt a fragment mid-way after it has started writing to one of the PLS outputs).

The "if early per-fragment tests are enabled (see section 13.6)" part is notable, though. This can be enabled in fragment shader code in ES3.1 but is disabled by default. I wonder if the same applies for Vulkan.
Agreed, it's an easy way for developers to shoot themselves in the foot... But remember that leaving early fragment tests disabled in a shader with side effects will kill performance not just for TBDRs but also for IMRs, as it effectively disables Early-Z completely! So it's really not any more of a problem for IMG/ARM than it is for e.g. NVIDIA.
 
Unfortunately we had to reduce the texture resolution and compression quality to fit into the texture budget of the Nexus Player, which doesn't have much RAM available (we wanted to demo this on a public device with only modified drivers). This is to highlight the fact that all Series6 GPUs support Vulkan(!)

Also the parallax offset for the reflections in the Physically Based Shading model had to be removed due to early Vulkan driver issues, which makes some of the reflections look a bit weird. The OpenGL ES version looks better than the Vulkan version, but of course we didn't want to highlight that fact in a blog post about Vulkan ;) Especially as these are just early driver issues that will be fully resolved sooner rather than later.

The main downside of the demo is the lack of dynamic lighting, as it is spending most of its performance on implementing a full Physically Based Shading model (not just a simplified version as in e.g. Mobile UE4 - although their trade-offs make a lot more sense for real-world applications as opposed to tech demos).
 
That demo is not particularly impressive... It reminds me of early DX8.1 demos.

I was rather pointing to the quite interesting aspects of Vulkan as an API. Although today's PowerVR demos are a LOT better than in the past, they still lack in the artistic department. I loved their Kitten techdemo, and this one in its original form was quite atmospheric, but not really something that would knock my socks off either.

Vulkan looks quite good so far :)
 
I don't know what rpg.314 is referring to either, but I can think of at least one thing that can break both the overdraw removal and the submission order independence of TBDRs quite badly: Virtual Texturing.

Especially if the form introduced by AMD (http://www.anandtech.com/show/5261/amd-radeon-hd-7970-review/6) and made part of DX11.2 becomes the preferred form: this is a form of virtual texturing where, at the shader level, the texturing operation returns as part of its result whether it hit an unmapped part of the texture, and where the shader itself is supposed to record any such hits in a data structure on the side. Since such a shader necessarily has the ability to write to memory other than just the framebuffer, it is required to run even if it is opaquely-overdrawn later in the same frame, and as such, a TBDR cannot just skip it like it can with more traditional content.
You could use another render target (MRT) and write the missing page ids to the render target (one page per pixel). This way you don't need UAV output. This is 100% compatible with TBDRs.

However, using the hardware PRT in the way you describe is very bad for performance. First, if you try to access an unmapped texture area, you get a full TLB miss (page fault) per pixel = VERY SLOW (actually two faults if trilinear/aniso is used and both mips fail). A single object usually has multiple textures. Now assume you open a door, and suddenly 5 million texture fetches miss the TLB :)

When you notice that you access an unmapped part of a texture, you must somehow solve the situation RIGHT NOW, or you will present a corrupted image. You could for example have a dynamic loop that tries to load lower resolution mip data. That data might also page fault (STALL). Also a dynamic loop isn't exactly cheap for each texture read, and UAV writes (with atomics) per pixel aren't cheap either (in cases where hundreds of thousands of pixels page fault). A single 256x256 pixel page can cover 64k pixels in the screen. Meaning that even a single missing texture page can cost a lot (64k faults + 64k UAV atomic writes).

Virtual textured games do not access unmapped texture areas. You have an R8 map that has one texel per page (256x256 pixels), and that map tells you the most precise mip level that is available in this virtual UV area. First you point sample that 8 bit map (with a max filter), then you clamp your mip level to the highest available, and then you read the real texture. This way you never read unmapped areas, and you don't need dynamic loops to solve the data corruption issue.
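
Something like this, as a rough sketch (the names and the 8-bit mip encoding are mine, and the conservative "max" filtering of the residency map across page borders is left out for brevity):

Code:
#version 310 es
precision highp float;

uniform sampler2D uResidencyMap;   // R8, one texel per 256x256 page: finest resident mip
uniform sampler2D uVirtualTexture; // the sparse/virtual texture itself
uniform vec2 uVirtualSizeTexels;   // virtual texture size in texels

in vec2 vUV;
out vec4 oColor;

// Standard explicit LOD computation from the UV derivatives.
float desiredMip(vec2 uv)
{
    vec2 texelUV = uv * uVirtualSizeTexels;
    vec2 dx = dFdx(texelUV);
    vec2 dy = dFdy(texelUV);
    return 0.5 * log2(max(max(dot(dx, dx), dot(dy, dy)), 1.0));
}

void main()
{
    // Point sample the per-page residency map (sampler set to NEAREST,
    // value stored as mip/255).
    float finestResident = textureLod(uResidencyMap, vUV, 0.0).r * 255.0;

    // Clamp to the finest mip that is actually resident, so unmapped
    // pages are never touched.
    float mip = max(desiredMip(vUV), finestResident);

    oColor = textureLod(uVirtualTexture, vUV, mip);
}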

Also you never want to load data that is hidden (textures of objects/terrain hidden by walls/hills/etc). You want to write the page ids to an additional render target (32 bit id is more than enough), and read that render target when the scene rendering is complete (= only the visible surfaces remain). Many games render the page id buffer at lower resolution (using a predicted camera to hide the loading latency).
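
For example (a sketch with my own packing scheme; the second output goes to e.g. an R32UI attachment, typically rendered at reduced resolution, which a later pass or the CPU then scans to decide what to stream in):

Code:
#version 310 es
precision highp float;
precision highp int;

uniform vec2 uVirtualSizeTexels;    // virtual texture size in texels
const float kPageSize = 256.0;      // page size in texels

in vec2 vUV;

layout(location = 0) out vec4 oColor;   // normal shading output
layout(location = 1) out uint oPageId;  // feedback: page needed by this pixel

void main()
{
    // ...normal shading as usual...
    oColor = vec4(vUV, 0.0, 1.0);

    // Desired mip from the UV derivatives, as in the previous sketch.
    vec2 texelUV = vUV * uVirtualSizeTexels;
    vec2 dx = dFdx(texelUV);
    vec2 dy = dFdy(texelUV);
    float lod = 0.5 * log2(max(max(dot(dx, dx), dot(dy, dy)), 1.0));

    // Pack page X, page Y and mip into a single 32-bit id.
    vec2 page = floor(texelUV / kPageSize);
    uint mip  = uint(clamp(lod, 0.0, 15.0));
    oPageId   = (uint(page.y) << 16u) | (uint(page.x) << 4u) | mip;
}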

Intentionally reading unmapped areas is not a good idea. The error code is mostly useful for debugging purposes (it allows you to output debug instead of crashing the GPU).
It's UAVs that are the problem because MS specified DirectX 11 without tilers in mind - virtual texturing by itself is perfectly fine.

In GLES3.1 or Vulkan, you would simply use an atomic buffer to get a unique index and then write to that location without order guarantees. The specification is extremely clear that there is no guarantee a fragment will be executed if it does not ultimately contribute to the framebuffer, irrespective of side effects.
If I understand correctly, DirectX doesn't give you any depth culling ordering guarantees when it does not affect the end result (blending, stencil, etc), unless you add the [earlydepthstencil] attribute to your pixel shader entry function. The documentation is a little bit vague about this, but it states that the GPU is allowed to perform depth culling either before or after the pixel shader execution if this attribute is not present. I am not sure whether this means that a TBDR is allowed to run the pixel shader only for the closest surface. If not, then Microsoft needs to add another attribute that allows the GPU to behave that way. Let's call this [allowdepthrejection]. Problem solved :)
 
The render passes sound like a nice design decision. Explicit control for tilers is a good thing. The real question is: does Vulkan support custom resolve (compute) shaders? And if yes, then the next question is, can you access the other pixels in the tile in the resolve compute shader (*)?

(*) This would allow developers to implement tiled deferred lighting VERY efficiently, assuming of course that the resolve compute shader is also allowed to use workgroup shared memory. Basically the resolve compute shader would first go through the depth buffer pixels in the tile to calculate the depth min/max, then it would clamp the tile frustum to that depth min/max, cull the lights, apply the lights to the pixels and write the result out. All the temporary data (including the g-buffers) would never get written to memory. Only the final lit pixels would be written to memory. One write per pixel :)
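
To make the shape of that concrete, here's a rough GLES 3.1 compute shader doing tiled light culling per 16x16 workgroup (all names and formats are mine, the tile frustum test is reduced to a depth-range test for brevity, and on current APIs the g-buffer has to be re-read from memory - which is exactly the round trip a tile resolve shader would avoid):

Code:
#version 310 es
precision highp float;
precision highp int;

layout(local_size_x = 16, local_size_y = 16) in;   // one workgroup per 16x16 tile

struct Light { vec4 posRadius; vec4 color; };      // xyz = view-space pos, w = radius

layout(std430, binding = 0) readonly buffer Lights { Light lights[]; };
uniform int uLightCount;

uniform highp sampler2D uDepth;                    // g-buffer inputs
uniform highp sampler2D uAlbedo;
uniform highp sampler2D uNormal;
uniform mat4 uInvProj;                             // to reconstruct view-space position

layout(rgba16f, binding = 0) writeonly uniform highp image2D uOutput;

const int MAX_TILE_LIGHTS = 64;
shared uint sMinDepth;
shared uint sMaxDepth;
shared uint sTileLightCount;
shared int  sTileLights[MAX_TILE_LIGHTS];

vec3 viewPosFromDepth(ivec2 coord, float depth)
{
    vec2 uv = (vec2(coord) + 0.5) / vec2(textureSize(uDepth, 0));
    vec4 clip = vec4(uv * 2.0 - 1.0, depth * 2.0 - 1.0, 1.0);
    vec4 view = uInvProj * clip;
    return view.xyz / view.w;
}

void main()
{
    ivec2 coord = ivec2(gl_GlobalInvocationID.xy);
    float depth = texelFetch(uDepth, coord, 0).r;

    // 1) Per-tile depth min/max via shared atomics (depth is in [0,1], so
    //    its bit pattern orders correctly as an unsigned integer).
    if (gl_LocalInvocationIndex == 0u) {
        sMinDepth = 0xFFFFFFFFu;
        sMaxDepth = 0u;
        sTileLightCount = 0u;
    }
    barrier();
    atomicMin(sMinDepth, floatBitsToUint(depth));
    atomicMax(sMaxDepth, floatBitsToUint(depth));
    barrier();

    float zA = viewPosFromDepth(coord, uintBitsToFloat(sMinDepth)).z;
    float zB = viewPosFromDepth(coord, uintBitsToFloat(sMaxDepth)).z;

    // 2) Cull the lights against the tile's depth bounds, 256 threads cooperating.
    for (int i = int(gl_LocalInvocationIndex); i < uLightCount; i += 16 * 16) {
        float z = lights[i].posRadius.z;
        float r = lights[i].posRadius.w;
        if (z + r >= min(zA, zB) && z - r <= max(zA, zB)) {
            uint slot = atomicAdd(sTileLightCount, 1u);
            if (slot < uint(MAX_TILE_LIGHTS)) sTileLights[slot] = i;
        }
    }
    barrier();

    // 3) Shade this pixel with the surviving lights and write the final colour:
    //    one write to memory per pixel.
    vec3 P = viewPosFromDepth(coord, depth);
    vec3 N = normalize(texelFetch(uNormal, coord, 0).xyz * 2.0 - 1.0);
    vec3 albedo = texelFetch(uAlbedo, coord, 0).rgb;

    vec3 lit = vec3(0.0);
    uint lightCount = min(sTileLightCount, uint(MAX_TILE_LIGHTS));
    for (uint i = 0u; i < lightCount; ++i) {
        Light light = lights[sTileLights[i]];
        vec3  d = light.posRadius.xyz - P;
        float dist = max(length(d), 1e-4);
        float atten = max(1.0 - dist / light.posRadius.w, 0.0);
        lit += albedo * light.color.rgb * max(dot(N, d / dist), 0.0) * atten;
    }
    imageStore(uOutput, coord, vec4(lit, 1.0));
}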

This kind of tiled light renderer is used widely in console games (with the exception that traditional GPUs need to store the g-buffers to main memory). You can easily push thousands of visible lights per frame with these. I would love to see similar lighting techniques being possible on mobile devices. There is not much missing to make this possible on tiling GPUs super efficiently. This kind of tile resolve compute shader would be enough... Pretty please? :)
 
The render passes sound like a nice design decision. Explicit control for tilers is a good thing. The real question is: does Vulkan support custom resolve (compute) shaders? And if yes, then the next question is, can you access the other pixels in the tile in the resolve compute shader (*)?
A compute shader running directly on the tile buffer is ideal but hard to generalise, as just about every TBDR handles its tiles differently. In an idealised TBDR, a compute pass following a render pass would be able to grab the pixel data from the tile using the same HW used in the render pass, do its thing, then output using the render HW.

However, that's a lot of render fixed-function HW being used by a compute shader, which up to now hasn't used much beyond the texture sampler from the render HW.
 
A compute shader running directly on the tile buffer is ideal but hard to generalise, as just about every TBDR handles its tiles differently. In an idealised TBDR, a compute pass following a render pass would be able to grab the pixel data from the tile using the same HW used in the render pass, do its thing, then output using the render HW.

However, that's a lot of render fixed-function HW being used by a compute shader, which up to now hasn't used much beyond the texture sampler from the render HW.
There might be some hardware limitations (like the compute workgroup shared memory and the tile buffer using the same HW resource). However, from an API perspective the 2D workgroups are already tiles (rectangles) independent of each other, and the programmer must manually ensure that the thread count per tile and the shared memory per tile do not exceed certain values. When combined with tiling, you'd have to enforce that the workgroup size is aligned to the tile size... And that's pretty much it. I don't see other restrictions (from the API point of view).
 
Such a tile compute shader would be great for a lot of effects. What may be difficult is to define how exactly you could access framebuffer values and if/how you could overwrite them. The actual tile framebuffer layout may be somewhat surprising, especially for multisampling and/or certain pixel formats (e.g. Metal is fairly explicit about the fact that some formats may take up more space than what you'd expect).
 
Such a tile compute shader would be great for a lot of effects. What may be difficult is to define how exactly you could access framebuffer values and if/how you could overwrite them. The actual tile framebuffer layout may be somewhat surprising, especially for multisampling and/or certain pixel formats (e.g. Metal is fairly explicit about the fact that some formats may take up more space than what you'd expect).
The g-buffer values could be exposed as workgroup shared memory.

The space per pixel could be computed by the API. The workgroup size for such a compute shader could be exposed as an API-specific intrinsic inside the shader.
 
Series 7 would not easily support this, unfortunately.

I'm very interested in having a conversation about possible use cases I might be missing, but I've been through this *in significant detail* while trying to make the decision of whether I should ask HW to add support for this in the short-term or not. My conclusion was that the use cases were highly limited, especially when excluding techniques which are unlikely to be a good fit to our architecture in practice.

I can really only think of non-linear downsampling as a use case, and that would be mostly useful for tonemapping, for which we have a few other tricks at our disposal (unfortunately not all exposed at the moment...)

Sebbi, I've done some analysis of this recently, and I'm quite confident that no matter what clever tricks anyone comes up with, tiled deferred lighting is nearly certainly NOT the optimal solution for either PowerVR or ARM. It is much better for us to use lightly tessellated geometry (~50 vertices/triangles per sphere or cone maximum) with Pixel Local Storage (or simply discarding the outputs at the end of the pass in Metal).

Series 7XT should have fast 256-bit PLS (or fast 128-bit + 4xMSAA). I would strongly encourage developers to design around that rather than around algorithms which were originally invented around the limitations of IMRs...
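
For reference, a rough sketch of what one light volume's fragment shader looks like with GL_EXT_shader_pixel_local_storage (the storage layout and formats here are my own choice of a 16-byte footprint; what actually fits depends on the GPU):

Code:
#version 310 es
#extension GL_EXT_shader_pixel_local_storage : require
precision mediump float;

// Per-pixel storage that stays on-chip for the whole render pass: written by
// the g-buffer pass, accumulated into by every light volume pass, and only
// resolved to a colour attachment at the very end.
__pixel_localEXT FragData {
    layout(r11f_g11f_b10f) mediump vec3 lightAccum;
    layout(rgba8)          mediump vec4 albedo;
    layout(rgb10_a2)       mediump vec4 packedNormal;
    layout(r32f)           highp   float viewZ;
} pls;

// Parameters of the lightly tessellated sphere/cone being rasterised.
uniform mediump vec3  uLightPosView;
uniform mediump vec3  uLightColor;
uniform mediump float uLightRadius;

in highp vec3 vViewRay;   // interpolated view ray for position reconstruction

void main()
{
    // Reconstruct the view-space position and normal of the surface in this pixel.
    highp vec3 P = vViewRay * pls.viewZ;
    mediump vec3 N = normalize(pls.packedNormal.xyz * 2.0 - 1.0);

    mediump vec3  d = uLightPosView - P;
    mediump float dist = max(length(d), 1e-4);
    mediump float atten = max(1.0 - dist / uLightRadius, 0.0);

    // Accumulate into pixel local storage; nothing is written to main memory here.
    pls.lightAccum += pls.albedo.rgb * uLightColor *
                      max(dot(N, d / dist), 0.0) * atten;
}

A final full-screen pass then reads pls.lightAccum and writes it out to the actual colour attachment; everything else in the block never leaves the chip.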
 