PowerVR Rogue Architecture

Sebbi, I've done some analysis of this recently, and I'm quite confident that no matter what clever tricks anyone comes up with, tiled deferred lighting is almost certainly NOT the optimal solution for either PowerVR or ARM. It is much better for us to use lightly tessellated geometry (~50 vertices/triangles per sphere or cone maximum) with Pixel Local Storage (or simply discarding the outputs at the end of the pass in Metal).
This is also my current mobile lighting pipeline design. During my career I have implemented pretty much every single variation of deferred rendering. I suppose it would be faster to discard (or branch out) pixels outside the light's influence in the light pixel shader (before doing the lighting math) instead of using stencil to mark pixels that are inside the light volume (stencil-mark the pixels where the light's front faces fail the depth test, then render the light back faces with depth fail)? Without stencil you can pick only one culling direction (render light back faces with depth fail, or light front faces with depth pass). Both have their good and bad cases.
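For illustration, here is a minimal sketch of the discard variant, written as a GLSL ES 3.0 fragment shader for a single point-light volume. All uniform and variable names are hypothetical, and it assumes a classic sampled G-buffer; on PowerVR/ARM the reads would come from Pixel Local Storage or framebuffer fetch instead.

```cpp
// Hypothetical light fragment shader: reject pixels outside the light's
// influence before doing any lighting math, as discussed above.
static const char* kPointLightFS = R"GLSL(#version 300 es
precision highp float;

uniform vec3  uLightPosView;        // light position in view space
uniform float uLightRadius;
uniform vec3  uLightColor;
uniform sampler2D uGBufferNormal;   // sampled G-buffer for this sketch
uniform sampler2D uGBufferDepth;
uniform mat4  uInvProj;

in  vec4 vClipPos;                  // clip-space position from the light-volume VS
out vec4 oColor;

vec3 reconstructViewPos(vec2 uv)
{
    float d = texture(uGBufferDepth, uv).r;
    vec4  p = uInvProj * vec4(uv * 2.0 - 1.0, d * 2.0 - 1.0, 1.0);
    return p.xyz / p.w;
}

void main()
{
    vec2  uv      = vClipPos.xy / vClipPos.w * 0.5 + 0.5;
    vec3  posView = reconstructViewPos(uv);
    vec3  toLight = uLightPosView - posView;
    float distSq  = dot(toLight, toLight);

    // Early out: this pixel is outside the light's influence.
    if (distSq > uLightRadius * uLightRadius)
        discard;

    vec3  n     = normalize(texture(uGBufferNormal, uv).xyz * 2.0 - 1.0);
    vec3  l     = toLight * inversesqrt(distSq);
    float atten = 1.0 - distSq / (uLightRadius * uLightRadius);
    oColor = vec4(uLightColor * max(dot(n, l), 0.0) * atten, 0.0);
}
)GLSL";
```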

I am a bit worried that in the worst case there are going to be considerably more lighting shader invocations (unless two-sided stencil culling is used) compared to tiled lighting. Tiled lighting does both min/max depth bounds tests, so the culling is quite precise (and of course the test is executed only once per tile, not once per pixel). We also have a preprocess light transform + viewport + occlusion culling compute shader that reduces the number of lights by a big amount (usually 10x or more). This makes tile culling more efficient. Unfortunately the light occlusion culling pass (depth pyramid test) cannot be done efficiently on tiled architectures, as the whole depth buffer is not available before the lighting step starts. I suppose occlusion culling lights is not that important for tiled architectures, if the light geometry is very simple and if future mobile APIs support multidraw (Vulkan does). Multidraw is definitely needed if you want to reach similar light counts as tiled lighting pipelines (we can do 16k visible animating lights at 60 fps).
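A rough CPU-side sketch of the per-tile culling test being described (names and conventions are mine, not from any particular engine): a light survives only if its bounding sphere overlaps both the tile's sub-frustum and the tile's min/max depth range, and the test runs once per tile rather than once per pixel.

```cpp
// View-space conventions here are hypothetical: depth is positive along the
// view direction, and tile frustum plane normals point into the tile.
struct Plane  { float nx, ny, nz, d; };   // view-space side plane of a tile
struct Sphere { float x, y, z, radius; }; // light bounds; z = view-space depth

struct Tile
{
    Plane planes[4];   // left/right/top/bottom planes of the tile's sub-frustum
    float minDepth;    // nearest depth found in the tile (from the depth buffer)
    float maxDepth;    // farthest depth found in the tile
};

static bool lightIntersectsTile(const Sphere& s, const Tile& t)
{
    // Depth-bounds test: this is the part that needs the whole depth buffer,
    // which a tile-based GPU does not have available before shading starts.
    if (s.z + s.radius < t.minDepth || s.z - s.radius > t.maxDepth)
        return false;

    // Sub-frustum test: reject if the sphere lies fully outside any side plane.
    for (const Plane& p : t.planes)
    {
        float signedDist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (signedDist < -s.radius)
            return false;
    }
    return true;
}
```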
 
I suppose it would be faster to discard (or branch out) pixels outside the light's influence in the light pixel shader (before doing the lighting math)
That is an interesting question. GFXBench 3.x / Manhattan does a per-pixel discard to optimise the lighting equations, so every major mobile GPU designer has had to optimise for this case in the last year (or was lucky to already be optimised for it). So it's a safe bet that it's not going to hit an ultra-slow path on any modern mobile architecture, and it is likely to only get faster in the future. Certainly for some architectures it was significantly faster to branch ~2 years ago, but that benchmark pressure has meant discard should be just as fast now (note that multiple discards per shader are handled differently by different architectures and/or compilers, and you may or may not be saving the work in between the first and last discard).
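For contrast, the "branch out" alternative mentioned above can look like the excerpt below (reusing the hypothetical names from the discard sketch earlier in the thread); which of the two wins is architecture and compiler dependent, as noted.

```cpp
// Hypothetical GLSL ES excerpt: keep the pixel alive, skip the lighting math
// for uncovered pixels, and let additive blending (GL_ONE, GL_ONE) turn the
// zero result into a no-op instead of discarding.
static const char* kBranchVariant = R"GLSL(
    vec3 contribution = vec3(0.0);
    if (distSq <= uLightRadius * uLightRadius)   // pixel inside the light's influence
    {
        vec3  l     = toLight * inversesqrt(distSq);
        float atten = 1.0 - distSq / (uLightRadius * uLightRadius);
        contribution = uLightColor * max(dot(n, l), 0.0) * atten;
    }
    oColor = vec4(contribution, 0.0);            // zero adds nothing under additive blend
)GLSL";
```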

instead of using stencil to mark pixels that are inside the light volume (stencil-mark the pixels where the light's front faces fail the depth test, then render the light back faces with depth fail)? Without stencil you can pick only one culling direction (render light back faces with depth fail, or light front faces with depth pass). Both have their good and bad cases.
My intuition is that double-sided stencil is probably faster than discard/branch on every shipping PowerVR GPU. However, I suppose this is slightly dependent on the position and especially the size of the lights in your scene; if most lights are very small and positioned close to walls, so that one-sided depth tests are nearly identical to two-sided tests, it may turn out that doing neither stencil nor discard/branch is slightly faster. Alternatively, you could do a hybrid approach where e.g. larger lights use stencil tests and others don't.
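A hedged sketch (GL ES 3.0 C API) of the classic two-sided stencil light-volume technique being compared here; exact state management differs between engines, and the stencil buffer is cleared per light purely for simplicity. drawLightVolume() is a hypothetical helper.

```cpp
#include <GLES3/gl3.h>

void drawLightVolume();   // hypothetical: issues the draw for the light's sphere/cone mesh

void shadePointLightWithStencil()
{
    // Pass 1: stencil-mark pixels whose geometry lies inside the light volume.
    glEnable(GL_DEPTH_TEST);
    glDepthMask(GL_FALSE);                                 // no depth writes
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);   // no colour writes
    glDisable(GL_CULL_FACE);                               // rasterise both faces in one draw
    glEnable(GL_STENCIL_TEST);
    glClear(GL_STENCIL_BUFFER_BIT);
    glStencilFunc(GL_ALWAYS, 0, 0xFF);
    glStencilOpSeparate(GL_BACK,  GL_KEEP, GL_INCR_WRAP, GL_KEEP);  // back face fails depth (geometry in front) -> increment
    glStencilOpSeparate(GL_FRONT, GL_KEEP, GL_DECR_WRAP, GL_KEEP);  // front face fails depth (geometry in front) -> decrement
    drawLightVolume();

    // Pass 2: shade only the marked pixels (stencil != 0), accumulating additively.
    glStencilFunc(GL_NOTEQUAL, 0, 0xFF);
    glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP);
    glDisable(GL_DEPTH_TEST);
    glEnable(GL_CULL_FACE);
    glCullFace(GL_FRONT);                                  // draw back faces so a camera inside the volume still shades
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE);                           // additive light accumulation
    drawLightVolume();
}
```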

Honestly, it feels like this is the kind of thing that GPU designers themselves should analyse, providing easy-to-understand SDK samples *with* performance data for different scene types, so that developers know what high-level algorithms are recommended for their architecture. Otherwise they waste weeks (or even months) testing out different alternatives without the necessary low-level knowledge to know exactly how each technique should be implemented for optimal performance, which makes for an unfair comparison. Alternatively, they should provide the optimised paths for common engines like Unity and Unreal themselves, since these engines already do GPU detection to some extent. Of course, this is effectively what NVIDIA already does by just giving free engineers to AAA games in the TWIMTBP program, but only they can afford this and it only benefits a small part of the developer community for a single GPU architecture...
 
My intuition is that double-sided stencil is probably faster than discard/branch on every shipping PowerVR GPU. However, I suppose this is slightly dependent on the position and especially the size of the lights in your scene; if most lights are very small and positioned close to walls, so that one-sided depth tests are nearly identical to two-sided tests, it may turn out that doing neither stencil nor discard/branch is slightly faster. Alternatively, you could do a hybrid approach where e.g. larger lights use stencil tests and others don't.
It would definitely be a good idea to switch the approach based on the light's screen-space radius (and other factors). For lights that overlap the camera position, stencil is useless (and wastes a lot of fill rate); just rendering the light back faces with depth fail is optimal in this case. Tessellation would also be beneficial for light rendering. You could generate the light mesh with the tessellator (saving bandwidth) and reduce the triangle count of the light geometry based on camera distance. This would give both better quad efficiency (for distant lights) and less wasted area (for near lights).
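A hypothetical per-light path selection along the lines of the hybrid approach described above; the thresholds and the "camera inside" margin are illustrative guesses, not measured numbers.

```cpp
enum class LightPath
{
    DepthFailBackFaces,   // camera inside the volume: back faces with an inverted depth test
    TwoSidedStencil,      // big lights: stencil-mark first, then shade
    PlainFrontFaces       // small lights: front faces with a normal depth test
};

static LightPath chooseLightPath(float cameraToLightCenter,   // view-space distance to light center
                                 float lightRadius,
                                 float screenRadiusPixels)
{
    // Stencil (and front-face rendering) buys nothing when the camera sits
    // inside the light volume, so fall back to depth-fail back faces.
    if (cameraToLightCenter < lightRadius * 1.05f)
        return LightPath::DepthFailBackFaces;

    // Large on screen: the extra stencil pass pays for itself in culled pixels.
    if (screenRadiusPixels > 128.0f)
        return LightPath::TwoSidedStencil;

    return LightPath::PlainFrontFaces;
}
```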
Honestly, it feels like this is the kind of thing that GPU designers themselves should analyse, providing easy-to-understand SDK samples *with* performance data for different scene types, so that developers know what high-level algorithms are recommended for their architecture. Otherwise they waste weeks (or even months) testing out different alternatives without the necessary low-level knowledge to know exactly how each technique should be implemented for optimal performance, which makes for an unfair comparison. Alternatively, they should provide the optimised paths for common engines like Unity and Unreal themselves, since these engines already do GPU detection to some extent. Of course, this is effectively what NVIDIA already does by just giving free engineers to AAA games in the TWIMTBP program, but only they can afford this and it only benefits a small part of the developer community for a single GPU architecture...
More information is always better, especially for PC and mobile platforms. Console generations last so long that developers have the incentive and time to find the optimal solution for their needs. But developers can't afford to spend as much time on each different PC and mobile GPU (unfortunately).
 
A couple of guys on my demo team wrote it in the last month or so. It's part of a two-system demo comparing Vulkan to OpenGL ES3, running on Android on an Intel Moorefield-based platform (PowerVR G6430). More details after Tobias has shown it off at SIGGRAPH next week.
 
I sincerely hope so. I can't promise anything of course, it's not my job to decide any of that or make it happen, but I can see it happening at some undecided point in the future if things go well. Before anyone takes that away and prints it somewhere as "PowerVR to open source Vulkan graphics driver", that is not what I have just said. We might, maybe, someday, hopefully. Nothing more.
 
There's some more technical detail coming which we can't preannounce before the Vulkan BOF, so take a look at that blog post again later on Wednesday (I think/hope).
 
Predictably, Mantle 2.0 Vulkan seems to be doing wonders on smartphone SoCs.
 
Predictably, Mantle 2.0 Vulkan seems to be doing wonders on smartphone SoCs.

My gut feeling tells me that Vulkan's added efficiency won't be limited to just ULP mobile SoCs. It's obviously OT here, but I'd love to hear a DirectX 12 vs. Vulkan comparison for the desktop from developers.
 
My gut feeling tells me that Vulkan's added efficiency won't be limited to just ULP mobile SoCs.

Of course not, but low-power (and consequently lower-performing) multi-core CPUs will benefit more.

On the x86 desktop, the gaming industry evolved towards smaller numbers of very high-clocked CPU cores with much higher IPC.
DX12 and Vulkan will probably bring a huge boost to AMD 4-8 core solutions, but not much to the latest Core i5 and i7.
 
Of course not, but low-power (and consequently lower-performing) multi-core CPUs will benefit more.

On the x86 desktop, the gaming industry evolved towards smaller numbers of very high-clocked CPU cores with much higher IPC.
DX12 and Vulkan will probably bring a huge boost to AMD 4-8 core solutions, but not much to the latest Core i5 and i7.
So I wonder who will be able to afford Vulkan development. Big AAA titles on consoles will probably benefit (but the APIs on consoles were already low level). Big AAA titles on PC probably won't benefit, as you mention.

So that leaves mobile devices, where Vulkan could have the largest impact. However, budgets for mobile game development are small, and Vulkan code is more complex, since many things must be done manually that were done by the driver.

I'm wondering how many titles will actually use Vulkan. It seems like it's in a difficult place due to market forces.
 
Robust support in UE4 and Unity is enough to cover off a huge swathe of the market on mobile. Metal uptake on iOS looks strong, anecdotally (might ask Apple if they have real figures to back up my suspicion), which I think is a good signal. When the benefits of a new technology are as clear as they are here, I think uptake will naturally tend to be strong.

Personally I see a bright future for the API, especially now that it's going to be a first class citizen on Android.
 
So that leaves mobile devices, where Vulkan could have the largest impact. However, budgets for mobile game development are small, and Vulkan code is more complex, since many things must be done manually that were done by the driver.

I'm wondering how many titles will actually use Vulkan. It seems like it's in a difficult place due to market forces.
(In my opinion) Vulkan code is much more readable and less complex than OpenGL code. The Vulkan API is super clean. OpenGL on mobile also has an unbelievable amount of CPU overhead, making Vulkan even more attractive. If (when) Vulkan supports enough existing hardware configurations, it will become very popular.
 
The middleware (Unity/Unreal/etc) will put Vulkan everywhere on mobile but I suspect uptake's going to be pretty strong even for the direct graphics coder...

The big win for Vulkan might actually be the clean API and a decent set of documentation. Even with more complex memory control, a clearly documented API is a pretty massive win. Plus, with a thinner driver and decent conformance testing, it should get rid of the randomness between IHVs. Personally I find DX development much simpler these days simply because of MSDN, good SDK sample code, and a good chance that much of the driver will work the same way regardless of vendor. This is less true on <= DX9, and some recent proprietary extensions muddy the waters, but basically if it's broken it's your fault. GL's hardest part for a hobby coder isn't really the language but the mix of legacy documentation and tutorials scattered across the web, written by people with very mixed abilities. GLES is better just by virtue of less baggage (imo).

Documentation may be boring, but good/accurate docs with a reliable source of sample code/tutorials open a language up to hobby/student developers in a big way. A lot of people who want to write directly to graphics APIs can cope with complexity, but there has to be at least one definitive spec for those requirements, and if the quicksand of driver bugs can be kept away from the core functionality then they can learn and experiment in safety.
 
Of course not, but low-power (and consequently lower-performing) multi-core CPUs will benefit more.

Will they?

On the x86 desktop, the gaming industry evolved towards smaller numbers of very high-clocked CPU cores with much higher IPC.
DX12 and Vulkan will probably bring a huge boost to AMD 4-8 core solutions, but not much to the latest Core i5 and i7.

Why are the core counts on desktop smaller? Just because ARM has standard big.LITTLE CPU configs doesn't mean that the majority of mobile games actually benefit from piles of cores beyond 4 at a time. The primary purpose of big.LITTLE is to save power while trading tasks between big and small clusters. Yes, global task scheduling is there, but any benefit from it is mostly theoretical and shows up more in useless synthetic benchmarks than anything else. I can still see "only" 3 "fat" cores (for the ULP market) in the A8X, and SoC manufacturers that go for custom cores based on ARM ISAs are mostly following similar trends.

If we really were using far more cores on our mobile devices, wouldn't that actually mean we can do far more efficient multitasking on them than on our desktops? :D
 