Apple (PowerVR) TBDR GPU-architecture speculation thread

And yet render passes are a thing in modern APIs now and are here to stay, especially with the advent of WebGPU. I really don't see why you are making such a big issue out of it.
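
For what it's worth, a render pass in Metal is just an explicitly delimited bundle of draws with declared load/store behaviour. A minimal Swift sketch (the device, queue and texture are assumed to already exist) looks roughly like this; the load/store actions are what tell a tiler which attachments actually need to enter or leave tile memory:

```swift
import Metal

// Minimal sketch of an explicit render pass (assumed setup: device, queue, colorTexture).
func encodeSimplePass(device: MTLDevice, queue: MTLCommandQueue, colorTexture: MTLTexture) {
    let pass = MTLRenderPassDescriptor()
    pass.colorAttachments[0].texture = colorTexture
    pass.colorAttachments[0].loadAction = .clear
    pass.colorAttachments[0].clearColor = MTLClearColor(red: 0, green: 0, blue: 0, alpha: 1)
    pass.colorAttachments[0].storeAction = .store

    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: pass)!
    // ... all draw calls belonging to this pass go here ...
    encoder.endEncoding()
    commandBuffer.commit()
}
```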

They're still optional on D3D12, while you're forced to use them on other modern APIs. Console APIs don't expose render passes either, since they're not necessary ...

Geometry shaders are a bad abstraction for the GPU execution model. They don't scale. Metal only exposes stuff that makes sense for the hardware. If you need the functionality of geometry shaders, use compute shaders with a GPU-driven render loop.
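
To make that suggestion concrete, here is a hedged sketch of the pattern in Swift against the public Metal API (the pipeline objects and pass descriptor are assumptions, not anything stated in the thread): a compute pass generates or amplifies geometry into a vertex buffer and fills an indirect draw argument, and a render pass then consumes it without a CPU round trip.

```swift
import Metal

// Sketch: compute-generated geometry consumed by an indirect draw.
// `expandPipeline`, `renderPipeline` and `passDesc` are assumed to exist.
func gpuDrivenDraw(device: MTLDevice, queue: MTLCommandQueue,
                   expandPipeline: MTLComputePipelineState,
                   renderPipeline: MTLRenderPipelineState,
                   passDesc: MTLRenderPassDescriptor) {
    let vertexBuffer = device.makeBuffer(length: 1 << 20, options: .storageModePrivate)!
    let indirectArgs = device.makeBuffer(
        length: MemoryLayout<MTLDrawPrimitivesIndirectArguments>.stride,
        options: .storageModePrivate)!

    let commandBuffer = queue.makeCommandBuffer()!

    // 1. Compute pass: the kernel decides how many vertices to emit and records
    //    the count in the indirect argument buffer.
    let compute = commandBuffer.makeComputeCommandEncoder()!
    compute.setComputePipelineState(expandPipeline)
    compute.setBuffer(vertexBuffer, offset: 0, index: 0)
    compute.setBuffer(indirectArgs, offset: 0, index: 1)
    compute.dispatchThreadgroups(MTLSize(width: 64, height: 1, depth: 1),
                                 threadsPerThreadgroup: MTLSize(width: 64, height: 1, depth: 1))
    compute.endEncoding()

    // 2. Render pass: draw whatever the kernel produced via an indirect draw.
    let render = commandBuffer.makeRenderCommandEncoder(descriptor: passDesc)!
    render.setRenderPipelineState(renderPipeline)
    render.setVertexBuffer(vertexBuffer, offset: 0, index: 0)
    render.drawPrimitives(type: .triangle, indirectBuffer: indirectArgs, indirectBufferOffset: 0)
    render.endEncoding()
    commandBuffer.commit()
}
```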

"Metal only exposes stuff that makes sense for the hardware."

Maybe this is true on Apple hardware but I don't think it's true for AMD hardware ...

Metal has some sub-optimal API design decisions for other hardware. Resource state transitions and barriers are resolved implicitly, which can complicate using async compute and limits how efficiently compute work can be scheduled on some GPUs. Also, if you look at the Sea Islands register documentation, the register GpuF0MMReg:0x28b54 specifically has a bit for controlling whether the geometry shader is active or not, and the now-deprecated Mantle API exposed geometry shaders accordingly; they're also available on D3D12 and Vulkan. On the RDNA architecture, the geometry shader implementation was designed to be more effective in several more cases than in the previous generation as well ...

For modern desktop GPUs, it's not out of the realm of possibility that they have an acceptable implementation of geometry shaders, so why does Apple keep actively avoiding exposing this feature in Metal when it can be somewhat usable on IMRs?

I heard that Apple plans on exposing programmable blending everywhere, but that is going to decimate performance on most IMRs, so in what way does Metal only expose features that "make sense" for the hardware? From the perspective of AMD hardware, out of all the modern APIs, Metal exposes the most abstractions that don't make sense for their hardware ...

I think your information might be a bit outdated? Metal has supported bindless resources (they call them Argument Buffers) for a while now, including arbitrary pointer chasing, writeable resource handles and whatever you want. You can build entire hierarchies of resource bindings on the GPU. This year's update to Metal includes a fully featured implementation of ray tracing, with programmable intersection shaders, function pointers for dynamic function loading, recursive shader execution etc. The performance is not the best right now, since ray tracing is done in "software" (using compute shaders), but it is obvious that hardware ray tracing support is coming.

Regarding mesh shaders: I agree that they are not the best fit for the TBDR hardware, since the generated vertices cannot just be passed through to the rasterizer immediately. So I am not sure whether we will ever see them in Metal. But if all you need to do is generate some geometry on the GPU, well, compute shaders + GPU-driven rendering loops have you covered. And of course, Metal has sparse textures, variable rate shading, fully programmable multisampling and so on.

Besides, you are leaving out some advantages TBDR brings to the table, like programmable blending or direct control over tile memory. Explicit render passes, memoryless render targets and tile shaders simplify the implementation of many advanced rendering techniques and end up being much more efficient.
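
For concreteness, a memoryless render target in Metal is just a texture whose storage mode says it never needs system-memory backing. A small Swift sketch (the pass descriptor is assumed to exist) might look like this:

```swift
import Metal

// Sketch of a memoryless attachment: the depth buffer lives only in on-chip
// tile memory, which is only valid on TBDR hardware.
func attachMemorylessDepth(device: MTLDevice, width: Int, height: Int,
                           pass: MTLRenderPassDescriptor) {
    let desc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .depth32Float,
                                                        width: width,
                                                        height: height,
                                                        mipmapped: false)
    desc.usage = .renderTarget
    desc.storageMode = .memoryless
    let depth = device.makeTexture(descriptor: desc)!

    pass.depthAttachment.texture = depth
    pass.depthAttachment.loadAction = .clear
    pass.depthAttachment.clearDepth = 1.0
    pass.depthAttachment.storeAction = .dontCare  // contents never leave the tile
}
```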

Argument buffers look like they provide similar functionality to descriptor indexing on other APIs, in that they all let you create unbounded/bindless arrays/tables. I'm not quite sure if Metal's argument buffers are comparable to some of Nvidia's OpenGL bindless extensions. How much longer do we also have to wait for Apple silicon to support hardware-accelerated ray tracing? I don't see most of the industry tolerating a wait of another generation ...

Programmable blending and tile memory are nice to have but they aren't all that compelling for most desktop vendors to implement or expose ...
 
"Metal only exposes stuff that makes sense for the hardware."

Maybe this is true on Apple hardware but I don't think it's true for AMD hardware ...

Did geometry shaders ever make sense for any hardware? Does modern Nvidia and AMD hardware have built-in support for them, or are they run as driver-managed compute shaders with bad parallelism? I can't claim in-depth knowledge of the industry, but to me it seems that geometry shaders are all but deprecated. Nvidia is pushing mesh shaders instead, and I don't really know what AMD's current stance on these things is. Apple made a pragmatic choice. Instead of giving the programmer a tool that works suboptimally in most cases, they don't give you this tool at all. Their argument goes like this: for GPU-side geometry generation, use a tool that works well — compute shaders with GPU-driven render loops. You can do anything you want while using a programming model that is much closer to how GPUs actually work.


Metal has some sub-optimal API design decisions for other hardware. Resource state transitions and barriers are resolved implicitly, which can complicate using async compute and limits how efficiently compute work can be scheduled on some GPUs.

Metal offers two modes of operation. By default it does state management (barriers, memory, residency) automatically — convenient but not the most efficient. You are, however, free to do low-level management yourself.
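
As a rough illustration of that second mode (a sketch under the assumption that suitable pipeline state and a pass descriptor already exist): resources placed in an untracked heap get no automatic hazard tracking, and the app orders the producing compute pass against the consuming render pass itself with an MTLFence.

```swift
import Metal

// Sketch of manual state management: untracked heap + explicit fence.
func manualTracking(device: MTLDevice, queue: MTLCommandQueue,
                    producer: MTLComputePipelineState,
                    passDesc: MTLRenderPassDescriptor) {
    let heapDesc = MTLHeapDescriptor()
    heapDesc.size = 16 << 20
    heapDesc.storageMode = .private
    heapDesc.hazardTrackingMode = .untracked   // driver will NOT insert barriers for you
    let heap = device.makeHeap(descriptor: heapDesc)!
    let buffer = heap.makeBuffer(length: 1 << 20, options: .storageModePrivate)!

    let fence = device.makeFence()!
    let commandBuffer = queue.makeCommandBuffer()!

    let compute = commandBuffer.makeComputeCommandEncoder()!
    compute.setComputePipelineState(producer)
    compute.setBuffer(buffer, offset: 0, index: 0)
    compute.dispatchThreadgroups(MTLSize(width: 1, height: 1, depth: 1),
                                 threadsPerThreadgroup: MTLSize(width: 64, height: 1, depth: 1))
    compute.updateFence(fence)                  // signal: writes are done
    compute.endEncoding()

    let render = commandBuffer.makeRenderCommandEncoder(descriptor: passDesc)!
    render.waitForFence(fence, before: .vertex) // wait before reading in the vertex stage
    render.setVertexBuffer(buffer, offset: 0, index: 0)
    // ... draws that read the buffer ...
    render.endEncoding()
    commandBuffer.commit()
}
```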

I heard that Apple plans on exposing programmable blending everywhere, but that is going to decimate performance on most IMRs, so in what way does Metal only expose features that "make sense" for the hardware? From the perspective of AMD hardware, out of all the modern APIs, Metal exposes the most abstractions that don't make sense for their hardware ...

Where did you hear that? As to your second sentence, I think you might be exaggerating a little bit. Probably the only abstraction Metal exposes that's not necessary for AMD hardware is render passes.

Argument buffers look like they provide similar functionality to descriptor indexing on other APIs, in that they all let you create unbounded/bindless arrays/tables. I'm not quite sure if Metal's argument buffers are comparable to some of Nvidia's OpenGL bindless extensions.

I am not sure that I am up to date enough to answer that confidently. At any rate, data buffers that contain resource handles are exposed in Metal shaders as pointers to user-defined structs. Components of these structs can be freely manipulated by the shaders, which allows you to build sets of resource bindings on the GPU. As far as I understand, it is a strict superset of the functionality exposed by Vulkan or DX12.
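
To make that more tangible, here is a hedged Swift sketch of the CPU side: an argument encoder writes resource handles into an ordinary MTLBuffer, and the shader then sees that buffer as a pointer to a struct whose members are those resources. The Material layout and index assignments are assumptions for illustration only.

```swift
import Metal

// Illustrative sketch: the MSL-side struct (something like
//   struct Material { texture2d<float> albedo; device float4 *params; };
// ) is a made-up example, and `function` is assumed to take it at buffer(0).
func encodeMaterial(device: MTLDevice, function: MTLFunction,
                    albedo: MTLTexture, params: MTLBuffer) -> MTLBuffer {
    let encoder = function.makeArgumentEncoder(bufferIndex: 0)
    let argumentBuffer = device.makeBuffer(length: encoder.encodedLength, options: [])!
    encoder.setArgumentBuffer(argumentBuffer, offset: 0)
    encoder.setTexture(albedo, index: 0)            // becomes the `albedo` member
    encoder.setBuffer(params, offset: 0, index: 1)  // becomes the `params` pointer
    // At draw time the referenced resources still have to be made resident, e.g.
    // renderEncoder.useResource(albedo, usage: .read)
    return argumentBuffer
}
```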

How much longer do we also have to wait for Apple silicon to support hardware-accelerated ray tracing? I don't see most of the industry tolerating a wait of another generation ...

How much longer do we have to wait for AMD to support hardware accelerated ray tracing? For Intel?

Programmable blending and tile memory are nice to have but they aren't all that compelling for most desktop vendors to implement or expose ...

No, but we are not talking about most desktop vendors. We are talking about Apple's TBDR architecture, which is coming to the desktop this year. I seriously doubt that Apple will ever make GPUs that can compete with the large desktop IMR brute-force renderers, but they should be able to leverage the efficiency of their approach to deliver GPUs that are very fast in the compact laptop space.
 
Did geometry shaders ever make sense for any hardware? Does modern Nvidia and AMD hardware have built-in support for them, or are they run as driver-managed compute shaders with bad parallelism? I can't claim in-depth knowledge of the industry, but to me it seems that geometry shaders are all but deprecated. Nvidia is pushing mesh shaders instead, and I don't really know what AMD's current stance on these things is. Apple made a pragmatic choice. Instead of giving the programmer a tool that works suboptimally in most cases, they don't give you this tool at all. Their argument goes like this: for GPU-side geometry generation, use a tool that works well — compute shaders with GPU-driven render loops. You can do anything you want while using a programming model that is much closer to how GPUs actually work.

Geometry shaders are at least somewhat acceptable on modern desktop GPUs, and their drivers almost certainly do NOT emulate geometry shaders with compute shaders; otherwise it'd be near impossible for them to handle cases involving interactions with other features like streamout/transform feedback. Geometry shaders have ordered geometry amplification when transform feedback is enabled, which can't exactly be emulated efficiently with compute shaders ...

Nvidia has a mesh shading pipeline, but they still offer the traditional geometry pipeline in their hardware. On AMD's latest RDNA architecture, their "next-generation geometry pipeline" is actually a superset of the traditional geometry pipeline, and they also have unique hardware functionality to efficiently emulate transform feedback. If you take a look at console APIs, for instance GNM or NVN, they also have the geometry shaders that you seem to dread so much ...

Let's just stop beating around the bush and admit that Apple doesn't want to expose geometry shaders because it would wreck their TBDR GPU designs in comparison to IMRs, which can have a passable implementation. The disappointment is that Apple still doesn't have a competitive alternative ...

Metal offers two modes of operation. By default it does state management (barriers, memory, residency) automatically — convenient but not the most efficient. You are, however, free to do low-level management yourself.

Metal doesn't have any concept of resource state transitions, so they're handled by the driver, which will be sub-optimal for AMD hardware, and Metal does not offer any control over this ...

Where did you hear that? As to your second sentence, I think you might be exaggerating a little bit. Probably the only abstraction Metal exposes that's not necessary for AMD hardware is render passes.

Here's what the manager behind their driver team had to say. They plan on exposing all of those features he just mentioned on all of the GPUs that they support. Exposing programmable blending would be a spectacular disaster on AMD hardware. Metal is only ever "low-level" in the sense that it mostly only applies to Apple hardware ...

How much longer do we have to wait for AMD to support hardware accelerated ray tracing? For Intel?

On AMD, it's guaranteed that they'll be launching ray tracing hardware in less than 2 months on consoles ...

Can you even argue that there's a solid timeline for an implementation to show up in Apple silicon?
 
Geometry shaders are at least somewhat acceptable on modern desktop GPUs, and their drivers almost certainly do NOT emulate geometry shaders with compute shaders; otherwise it'd be near impossible for them to handle cases involving interactions with other features like streamout/transform feedback. Geometry shaders have ordered geometry amplification when transform feedback is enabled, which can't exactly be emulated efficiently with compute shaders ...

Fair enough. I am curious to know how they can achieve this functionality on a massively parallel processor. Seems to me there would be a lot of synchronization overhead. I don't think it contradicts my main point however: a lot of tasks that geometry shaders are used for can be more efficiently implemented via compute shaders, by harnessing the parallelism offered by the GPUs explicitly. We need fewer shader stages, not more :)

Here's what the manager behind their driver team had to say. They plan on exposing all of those features he just mentioned on all of the GPUs that they support. Exposing programmable blending would be a spectacular disaster on AMD hardware. Metal is only ever "low-level" in the sense that it mostly only applies to Apple hardware ...

I think there might have been some miscommunication and/or misunderstanding. I am fairly sure that the first tweet refers to Macs with Apple GPUs (i.e. Apple Silicon) only. The second tweet is about the other new features in Metal (raytracing, pull interpolation model, debugging tools etc.). Note that there was a long stream of posts Mr. Avkarogullari made before answering the question.

Can you even argue that there's a solid timeline for an implementation to show up in Apple silicon?

No, I certainly cannot. But Metal includes a fully featured ray tracing API, which at least to me suggests that hardware ray tracing is something they are working on. Let's also not forget about their renewed deal with Imagination, who already has ray tracing IP for TBDR...
 
Well, Imagination/PowerVR has a lot of IP and stuff on paper, but implementing it in a real product is often another story :/

(Damn, I wish they could have continued on PC after Series 3 / Kyro)
 
Can you even argue that there's a solid timeline for an implementation to show up in Apple silicon?

Given recent events, we can safely say that even if there were a short-term timeline, and even if they had outside developers cooperating, we wouldn't necessarily know. Apple is good at information security.

The moment ray tracing takes over primary intersections, forward rendering is dead and all the silicon dedicated to optimizing it becomes so much dead weight for modern applications. A G-buffer tile will still be useful though.
 
Fair enough. I am curious to know how they can achieve this functionality on a massively parallel processor. Seems to me there would be a lot of synchronization overhead. I don't think it contradicts my main point however: a lot of tasks that geometry shaders are used for can be more efficiently implemented via compute shaders, by harnessing the parallelism offered by the GPUs explicitly. We need fewer shader stages, not more :)

On Nvidia's 2nd-gen Maxwell architecture and above, there's a fast path for a specific set of geometry shaders, and they mentioned a 30% speed-up in their voxelization pass for VXAO by using the so-called "pass-through geometry shader" functionality. The side effect of this extension is that it restricts the capability of geometry shaders, meaning no geometry amplification or transform feedback is allowed in conjunction with it, but the bonus is that it bypasses the synchronization overhead that you mentioned ...

AMD's RDNA architecture takes things a step further by removing all of the restrictions imposed by Nvidia's extension. Their hardware can also handle nasty edge cases, like ordered geometry amplification with transform feedback, relatively elegantly, since it can use global ordered append (via the DS_Ordered_Count instruction) for very fast synchronization ...

Intel seemingly never struggled with geometry shaders, since they have a unique SIMD-group mode (in MSL terminology) which might contribute to GS performance ...

On TBDR GPUs, geometry shaders are an antithetical concept. The problem with geometry shaders there is that they happen AFTER the tiling stage. Geometry shaders can do arbitrary transformations or geometry amplification on the screen-space primitives right before the rasterization stage, and that can break tiling optimizations which were decided beforehand, so there's a potential mismatch since the tiling may not necessarily match the screen-space primitives that get submitted to the rasterizer. Geometry shaders will inevitably cause load imbalance on tilers ...

As far as shader stages are concerned, we've just added an entirely new ray tracing pipeline with 5 separate shader stages (ray generation, intersection, miss, closest-hit and any-hit shaders), so I don't think we'll be getting rid of any shader stages soon. Neither AMD nor Nvidia is thinking about removing support for geometry shaders anytime soon, either ...

No, I certainly cannot. But Metal includes a fully featured ray tracing API, which at least to me suggests that hardware ray tracing is something they are working on. Let's also not forget about their renewed deal with Imagination, who already has ray tracing IP for TBDR...

Let's hope that Apple is also working on exposing conservative rasterization in the Metal API, since it's universally supported among IMRs ...
 
On TBDR GPUs, geometry shaders are an antithetical concept. The problem with geometry shaders there is that they happen AFTER the tiling stage. Geometry shaders can do arbitrary transformations or geometry amplification on the screen-space primitives right before the rasterization stage, and that can break tiling optimizations which were decided beforehand, so there's a potential mismatch since the tiling may not necessarily match the screen-space primitives that get submitted to the rasterizer. Geometry shaders will inevitably cause load imbalance on tilers ...

Sorry, but that's simply not correct. In all modern TBR/TBDR architectures, ALL geometry processing happens before tiling; as such, the GS does not break the tiling optimisations.
 
The multi-core part sounds alright given TBDR is already binning all triangles before rasterization. I am curious how workload is distributed before binning, and how the graphics pipeline is managed entirely through memory (and doorbells?), though.
 
The multi-core part sounds alright given TBDR is already binning all triangles before rasterization. I am curious how workload is distributed before binning, and how the graphics pipeline is managed entirely through memory (and doorbells?), though.

It's somewhat dependent on the tiling strategy chosen by the driver. A driver can choose either larger or smaller tiles, with different trade-offs ...

If a larger tile size is chosen, there will likely be fewer primitives crossing tile boundaries, which translates into processing fewer duplicated vertex shader invocations. The downside is that larger tiles are more likely to have variable geometry density between the different tiles, which can cause load imbalance, since some tiles will have larger clumps of densely packed geometry than others ...

If a smaller tile size is chosen, there will be more primitives crossing tile boundaries, which means more duplicated vertex shader invocations being processed. A smaller tile size can give you a more even distribution of geometry density between the different screen-space tiles, so it results in better load balance ...

Too big, and certain tiles will dominate the frame latency; too small, and there'll be lots of redundant geometry processing. Tile-based GPUs and their drivers try to pick the ideal middle ground that gives them the lowest latency ...
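
As a toy illustration of that tension (made-up geometry, plain Swift, not tied to any real GPU), counting the tiles a triangle's screen-space bounding box overlaps shows how quickly the duplicated work grows as tiles shrink:

```swift
// Count how many tiles a triangle's bounding box overlaps for a given tile size.
// Every extra tile a primitive touches is another tile list it has to be binned
// into and processed from.
func tilesTouched(xs: [Float], ys: [Float], tileSize: Int) -> Int {
    let x0 = Int(xs.min()!) / tileSize, x1 = Int(xs.max()!) / tileSize
    let y0 = Int(ys.min()!) / tileSize, y1 = Int(ys.max()!) / tileSize
    return (x1 - x0 + 1) * (y1 - y0 + 1)
}

// A ~100 px triangle lands in a single 128x128 tile, but in up to 16 tiles at 32x32.
print(tilesTouched(xs: [10, 110, 60], ys: [10, 10, 110], tileSize: 128))  // 1
print(tilesTouched(xs: [10, 110, 60], ys: [10, 10, 110], tileSize: 32))   // 16
```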
 
By the way, regarding the other myth about PowerVR and T&L units: the SEGA Naomi 2 arcade machines had, besides the mGPU config, a PowerVR T&L chip named ELAN clocked at 100 MHz, capable of 10 million polys/sec with 6 light sources, which was fairly strong for the year 2000.
 
If a larger tile size is chosen, there will likely be fewer primitives crossing tile boundaries, which translates into processing fewer duplicated vertex shader invocations. The downside is that larger tiles are more likely to have variable geometry density between the different tiles, which can cause load imbalance, since some tiles will have larger clumps of densely packed geometry than others ...

If a smaller tile size is chosen, there will be more primitives crossing tile boundaries, which means more duplicated vertex shader invocations being processed. A smaller tile size can give you a more even distribution of geometry density between the different screen-space tiles, so it results in better load balance ...

TBDR GPUs usually have a hardware-determined tile size. For Apple (I assume PowerVR is the same) it's 32x32, 32x16 or 16x16 (depending on how much data you want to store per fragment).

Why do you mention "duplicated vertex shader invocations"? The vertex shader is only invoked once — binning happens after the vertex shader stage.
 
TBDR GPUs usually have a hardware-determined tile size. For Apple (I assume PowerVR is the same) it's 32x32, 32x16 or 16x16 (depending on how much data you want to store per fragment).

Why do you mention "duplicated vertex shader invocations"? The vertex shader is only invoked once — binning happens after the vertex shader stage.

The bolded isn't the exact truth ...

The vertex pipeline is split into two parts on tiling architectures. Position-only shading happens before the tiling stage. The varying shading, on the other hand, happens after the tiling stage ...
 
The bolded isn't the exact truth ...

The vertex pipeline is split into two parts on tiling architectures. Position-only shading happens before the tiling stage. The varying shading, on the other hand, happens after the tiling stage ...

Where did you get this information from? This seems at odds with Apple's and Imagination's documentation. Not to mention that if parts of the vertex shader were executed multiple times, it would completely break some supported behavior (such as writing to memory from vertex shaders etc.).
 