PowerVR Rogue Architecture

Those extra mobile CPU cores may be underutilized in general device usage, and currently in graphics usage under GL ES, but Vulkan's (and similarly Metal's) multi-threaded capabilities help fix that for graphics and GPU compute, at least.

Great demo for clearly showing the benefits of regenerating command buffers across multiple cores and keeping the GPU well fed.
 

True. However, the purpose of big.LITTLE isn't to have all 8 cores (e.g. in two quad clusters) always running at full tilt. If you did that, you'd actually burn far more power than it was ever intended for. As with all things, what we need is a fine balance between performance and power consumption, or better battery life. It won't do me any good to have all my "8 cores" maxed out as much as possible if I have to run for a power plug about every hour.
 
The thing to remember is there are CPU benefits to be had whether you have one CPU or many. Even outside of multi-threaded use, there's just less CPU work happening in a Vulkan app and client driver in order to have the GPU do work.
 
Don't you mean potentially less? Since the responsibilities of the driver have been handed off to the application, doesn't it depend on exactly how the application deals with what the driver was responsible for before?
 
Drivers for APIs like GLES do far far more than a Vulkan application now has to do, even though some of the responsibilities have shifted.
 
Can you give me some insight into this? I'm at a loss... I thought all the responsibility was shifted, but since the application knows things about what it's doing, it can handle said responsibilities more efficiently depending on the implementation.
 
I don't want to do a disservice to the story of how a GPU driver goes about the business of commanding a GPU via a client API, but things like workarounds for badly behaved apps (you'd cry if you could see how much of that crap happens), online shader compilation, support for inherently branchy host work like render state validation, and code to figure out what to do because the spec is so loose and ill-defined are all things that are either completely gone or significantly reduced in a Vulkan driver and a Vulkan-using app.

That's not to say Vulkan cruft can't accrue on either side of the app-driver contract over time, but the clean slate is fundamentally liberating and removes swathes of code from the overall interaction.
 
Forgive my ignorance, but do mobile OSes such as Android and iOS use standard APIs (e.g. GL ES 3.0) within the OS to drive the user interface, composition etc., or is that all done at a lower level? Basically asking if existing OSes (as well as apps) are also suffering due to the use of GL ES x.x.
Is Metal perhaps Apple just exposing/formalising as an API what they have been using internally within the OS for quite some time?
 
Can't tell you what happens in iOS for obvious reasons, but in Android hardware accelerated drawing and composition is via standard APIs.
 
As part of the performance enhancements in iOS 9, Apple mentioned that Metal will now be used in place of GL ES (on applicable devices) for the OS's Core Graphics and Core Animation APIs.

Internally, I imagine they've been tapping the GPU in a relatively direct manner since iOS 1 for at least some aspects of OS graphics operations as well as some limited compute (browser acceleration, camera/photo/video acceleration, etc.)

What I've been wondering since Metal's introduction is how it compares to the proprietary PowerVR SGL API.
 
Is SGL still alive? (honest question)

For Metal: is it me, or is Apple getting a high rate of efficiency out of their drivers for Rogue GPUs even without Metal? I'm asking because there might be a benefit with Metal in GFXBench results, but the percentages are relatively small (≤9%). Unless of course GFXBench isn't a good indication of what Metal can do in general for GPUs.
 
PowerSGL? No.
 
There's no real general indication when it comes to this kind of thing; it's all case-by-case. GFXBench was already heavily GPU limited, so the gains to come from switching to Metal were (and are) slim. You can't make a call about overall driver efficiency just by taking a look at that case (unfortunately).

SGL is still alive, but there's no point comparing it to anything since it'll never see the light of day in a way that helps a non-IMG person understand its utility.
 
Can you give me some insight into this? I'm at a loss... I thought all the responsibility was shifted, but since the application knows things about what it's doing, it can handle said responsibilities more efficiently depending on the implementation.
That's been a back and forth throughout history.

In GL 1.0 the process was quite simple (a minimal sketch follows the steps):
1. you set your drawing settings (e.g. flat shading, smooth, ...)
2. you set your texture
3. you draw by telling GL what kind of primitive you want (e.g. GL_QUADS) and then push vertex by vertex to the rasterizing device.
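Roughly what that looks like in code; a minimal sketch of my own (not from any presentation), assuming a GL context is already current and `pixels` points at RGBA8 texel data:

```c
#include <GL/gl.h>

/* GL 1.0 style: every setting, the texture data and every vertex go through
   the API again each time you draw. */
void draw_textured_quad(const void *pixels, int w, int h)
{
    /* 1. drawing settings */
    glShadeModel(GL_SMOOTH);
    glEnable(GL_TEXTURE_2D);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);

    /* 2. set the texture -- no texture objects yet, so switching textures
          means respecifying the data, and the driver may have to convert it
          to whatever layout the hardware actually wants */
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, w, h, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels);

    /* 3. push the geometry vertex by vertex */
    glBegin(GL_QUADS);
    glTexCoord2f(0.f, 0.f); glVertex3f(-1.f, -1.f, 0.f);
    glTexCoord2f(1.f, 0.f); glVertex3f( 1.f, -1.f, 0.f);
    glTexCoord2f(1.f, 1.f); glVertex3f( 1.f,  1.f, 0.f);
    glTexCoord2f(0.f, 1.f); glVertex3f(-1.f,  1.f, 0.f);
    glEnd();
}
```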

Problems:
the settings and data you set are usually not in the format the GPU wants (e.g. rasterizers used fixed-point vertices, not float; texels could be R8G8B8A8 or A8R8G8B8 or R4G4B4A4 or ...) and every time you did that, the driver had to convert it. That's why in the old days everyone was sorting drawcalls by texture switches (even nowadays some have that mindset without knowing the historical reason).

GL 1.1 solution: display lists (sketch below)
With display lists you can:
init:
1. start display list recording
2.1. set your drawing settings (e.g. flat shading, smooth, ...)
2.2. set your texture
2.3. draw by telling GL what kind of primitive you want (e.g. GL_QUADS) and then push vertex by vertex to the rasterizing device
3. stop recording
drawing:
4. replay that recording as many times as you want; the driver will not do any conversion
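Again a minimal sketch of my own, assuming a current GL context and an already-created texture object (strictly also a GL 1.1 feature):

```c
#include <GL/gl.h>

/* Record the settings, texture bind and geometry once... */
GLuint record_quad_list(GLuint texture)
{
    GLuint list = glGenLists(1);

    glNewList(list, GL_COMPILE);           /* 1. start recording          */
    glShadeModel(GL_SMOOTH);               /* 2.1 drawing settings        */
    glEnable(GL_TEXTURE_2D);
    glBindTexture(GL_TEXTURE_2D, texture); /* 2.2 texture                 */
    glBegin(GL_QUADS);                     /* 2.3 the geometry itself     */
    glTexCoord2f(0.f, 0.f); glVertex3f(-1.f, -1.f, 0.f);
    glTexCoord2f(1.f, 0.f); glVertex3f( 1.f, -1.f, 0.f);
    glTexCoord2f(1.f, 1.f); glVertex3f( 1.f,  1.f, 0.f);
    glTexCoord2f(0.f, 1.f); glVertex3f(-1.f,  1.f, 0.f);
    glEnd();
    glEndList();                           /* 3. stop recording           */

    return list;
}

/* ...then replay it every frame; the driver does no per-call conversion. */
void draw_frame(GLuint list)
{
    glCallList(list);                      /* 4. replay                   */
}
```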

Problems:
1. how do you change something? You need to record the display list again.
2. hardware appeared that supported an awesome new feature, multitexturing, but OpenGL was overwriting the previous "thing" you'd set, so you always had just one texture.

GL 1.2 solution: texture objects, vertex arrays (I think those were actually there earlier, but slower); a sketch follows the list.
init:
1. create texture objects
drawing:
1. you set your drawing settings (e.g. flat shading, smooth, ...)
2. you set your texture objects (barely any driver work)
3. you draw by pointing GL at a vertex array in memory and telling it how many primitives to draw
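A minimal sketch of my own of that pattern, assuming a current GL context and a texture object created up front; the vertex data still lives in application memory:

```c
#include <GL/gl.h>

/* Client-side vertex arrays: GL is handed pointers into app memory and a count. */
static const GLfloat verts[] = { -1.f, -1.f,   1.f, -1.f,   1.f, 1.f,   -1.f, 1.f };
static const GLfloat uvs[]   = {  0.f,  0.f,   1.f,  0.f,   1.f, 1.f,    0.f, 1.f };

void draw_with_arrays(GLuint texture)
{
    glShadeModel(GL_SMOOTH);                     /* 1. drawing settings       */
    glEnable(GL_TEXTURE_2D);
    glBindTexture(GL_TEXTURE_2D, texture);       /* 2. barely any driver work */

    glEnableClientState(GL_VERTEX_ARRAY);        /* 3. point at the arrays... */
    glEnableClientState(GL_TEXTURE_COORD_ARRAY);
    glVertexPointer(2, GL_FLOAT, 0, verts);
    glTexCoordPointer(2, GL_FLOAT, 0, uvs);

    glDrawArrays(GL_TRIANGLE_FAN, 0, 4);         /* ...and say how much to draw */

    glDisableClientState(GL_TEXTURE_COORD_ARRAY);
    glDisableClientState(GL_VERTEX_ARRAY);
}
```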

Problems:
1. we're at Riva 128 and Voodoo Graphics times now; those were actually way faster than CPUs by using very smart pipelining and dedicated memory. Dedicated memory is fast, but moving data to it was super slow (I think that was still ISA or EISA times?), so you really became limited by the vertices you could copy to the rasterizer chips. With the GeForce 256, TnL hit consumers and the situation was even more unbalanced.

From now on, mostly extensions took over.
Quick solution: VAS (vertex array storage; don't kill me if I'm calling it wrong, that's like ~1999 I think).
You can tell GL that the array you point at will not be altered until you say so; that way the rasterizer can keep it in its memory and just redraw. TnL was taking care of transforms, so the CPU was not involved at all.
Problem: you still had to copy.

Now we have handles for all the memory: vertex data (VBO), render targets (RBO), uniforms/constants (UBO; I think that was in GL 3.2).
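A minimal VBO sketch of my own (the data is copied into GL-managed memory once and reused every frame), assuming GL 1.5+ or the ARB_vertex_buffer_object extension is available:

```c
#include <GL/gl.h>   /* glGenBuffers etc. may need an extension loader on some platforms */

/* Upload the vertex data once, into memory the driver/GPU controls. */
GLuint create_vbo(const GLfloat *verts, GLsizeiptr bytes)
{
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, bytes, verts, GL_STATIC_DRAW); /* one-time copy */
    return vbo;
}

/* Every frame, just source the draw from the buffer that already lives GPU-side. */
void draw_from_vbo(GLuint vbo, GLsizei vertex_count)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(2, GL_FLOAT, 0, (const void *)0); /* an offset into the VBO, not a pointer */
    glDrawArrays(GL_TRIANGLE_FAN, 0, vertex_count);
    glDisableClientState(GL_VERTEX_ARRAY);
}
```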

At this time nobody maintained display lists, because they became overly complicated for the driver to track. Display lists allowed some data to be static (e.g. textures) but some data to be dynamic (whatever you set outside them was not recorded, and thus overwritten inside the display list). With VBOs, RBOs etc. it went beyond the specs. I think the last attempt was by NVIDIA, which for a short time supported PBuffers (kind of a predecessor of framebuffer objects), but the driver guys said this became insanity.


From now on the API was pushing all commands the good old OpenGL 1.0 way; the driver recorded the commands into some buffer and pushed it to the driver thread.

But why a driver thread if all the data is on the GPU and we just push commands? Well, the GPU guys figured that everything you add to a GPU which isn't used all the time is a waste, hence: let's remove everything static and make it dynamic (i.e. emulate it in shaders).
I think PowerVR is the pioneer of this (I don't know exactly to what extent they've gone, but if you look at their GL extensions you'll get quite some hints). As an example: transparency. That's not needed for most objects, and if you need it, the shader could do it just as well, right? OK, but OpenGL has dozens of settings, so how do we know which combination to create? We cannot compile all 100 different permutations... well, let's do that in the driver when it's needed.
Problem: there are tons of settings that can change every drawcall (texture formats, framebuffer formats, blend settings, vertex layouts, shaders, samplers...) and all of those trigger a new permutation of those super flexible units.
Well, the set of permutations you really need in a game is small, because there are 100 trees that render the exact same way, but every game has a different way to render its trees. Thus the driver needs to evaluate all settings on the first drawcall, and the consecutive drawcalls need to at least check all settings for a possible change... insane work nowadays... that's why it takes a lot of CPU time.
And it's not just the average cost, but the unpredictable cost that makes this solution bad. If in one frame some more objects/drawcalls appear, the CPU will spend way more time preparing those drawcalls than the GPU needs to execute them.
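A heavily simplified, hypothetical sketch of my own of what that per-drawcall work looks like inside a driver: hash the currently bound state and compile a new permutation the first time a combination shows up. All names and the structure here are invented for illustration; real drivers are far hairier.

```c
#include <stdint.h>
#include <stddef.h>

/* A made-up snapshot of the state that can change between drawcalls. */
typedef struct {
    uint32_t blend_mode;
    uint32_t vertex_layout;
    uint32_t texture_format;
    uint32_t framebuffer_format;
    uint32_t shader_id;
} BoundState;

typedef struct {
    uint64_t key;
    void    *compiled;   /* stand-in for a compiled shader/pipeline permutation */
} CacheEntry;

#define CACHE_SIZE 1024
static CacheEntry cache[CACHE_SIZE];

/* Trivial FNV-1a-style mix over the state words, just for illustration. */
static uint64_t hash_state(const BoundState *s)
{
    const uint32_t *w = (const uint32_t *)s;
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < sizeof(*s) / sizeof(uint32_t); ++i)
        h = (h ^ w[i]) * 1099511628211ull;
    return h;
}

/* Stand-in for the slow part: generating and compiling the permutation. */
static void *compile_permutation(const BoundState *s)
{
    (void)s;
    return (void *)1;
}

/* What happens (conceptually) on every drawcall: check everything, and
   hitch badly if this exact combination has never been seen before. */
void *validate_drawcall_state(const BoundState *s)
{
    uint64_t key = hash_state(s);
    CacheEntry *e = &cache[key % CACHE_SIZE];
    if (e->compiled == NULL || e->key != key) {
        e->key = key;
        e->compiled = compile_permutation(s);   /* the unpredictable cost */
    }
    return e->compiled;
}
```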


What was the GL 1.1 solution for "the driver does it every drawcall, but the data doesn't change between frames"? Ah, yes: display lists... or let's call those command lists or command buffers now :)

As you can see in the presentation http://blog.imgtec.com/powervr/gnomes-per-second-in-vulkan-and-opengl-es , the world is divided into those display-list-like chunks just like back then; once you see a new one, or an existing one needs to be modified, a new command buffer is recorded. For all the other frames the CPU just tells the API to replay the list...
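A minimal Vulkan sketch of my own of that record-once/replay pattern (not code from the presentation); it assumes the device, command buffer, render pass, framebuffer, pipeline and queue were created elsewhere, and all error handling and synchronization is omitted:

```c
#include <vulkan/vulkan.h>

/* Record the "display list" once... */
void record_once(VkCommandBuffer cmd, VkRenderPass render_pass,
                 VkFramebuffer framebuffer, VkPipeline pipeline,
                 VkExtent2D extent)
{
    VkCommandBufferBeginInfo begin = { VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
    vkBeginCommandBuffer(cmd, &begin);                   /* start recording   */

    VkRenderPassBeginInfo rp = { VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO };
    rp.renderPass        = render_pass;
    rp.framebuffer       = framebuffer;
    rp.renderArea.extent = extent;
    vkCmdBeginRenderPass(cmd, &rp, VK_SUBPASS_CONTENTS_INLINE);

    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipeline);
    vkCmdDraw(cmd, 4, 1, 0, 0);                          /* the actual draws  */

    vkCmdEndRenderPass(cmd);
    vkEndCommandBuffer(cmd);                             /* stop recording    */
}

/* ...then every frame the CPU only asks the queue to replay it. */
void replay_frame(VkQueue queue, VkCommandBuffer cmd)
{
    VkSubmitInfo submit = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
    submit.commandBufferCount = 1;
    submit.pCommandBuffers    = &cmd;
    vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);
}
```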

The problems with Vulkan and DX12 will obviously be the same as back then with GL 1.1 display lists:
"1. how do you change something? you need to record the display list again"

My prediction for DX13 and, emm... (Mantle... Vulkan...) Magma is a programmable command processor (which is just like moving back to the GL 1.0 style of pushing commands, except the GPU does it instead of the CPU).
The PCP would allow you to evaluate a scene on the GPU and push data in a flexible way to the GPU backend...

I hope everybody is sleeping well by now
 
@Davros Damn, I remember Ultimate Race Pro... I owned that, once upon a time, via a PowerVR PCX2 card I bought 2nd hand off of a guy on FidoNET back in the day if any of you guys remember that old thing (a card which I subsequently burnt while hardware overclocking it slightly too enthusiastically by the way... :( Not that it really matters anymore though now that PCI is legacy junk.)
 
Thank you for the trip through time. I didn't know that about sorting by texture; I always assumed it had to do with textures being uploaded to video memory. Any other insight you wanna get off your chest, you have my attention. Thanks once again.
 