Apple (PowerVR) TBDR GPU-architecture speculation thread

OK then, I'm confused: if Apple's (and Imagination's) TBDR GPUs are so awesomely power efficient, why isn't everyone doing it? What are the inherent limitations of a TBDR GPU?
For most of the past two decades it's been patents. The fundamental ones should have expired by now, but I'm sure there are more recent ones. Qualcomm's Adreno GPUs are said to be some kind of deferred renderers, but we have very little info on their architectures.
 
For most of the past two decades it's been patents. The fundamental ones should have expired by now, but I'm sure there are more recent ones. Qualcomm's Adreno GPUs are said to be some kind of deferred renderers, but we have very little info on their architectures.


Can you "renew" the patent if it's still used ?
 
By microarchitecture I was referring to the design of each "core". And do we know for sure that the M1's single-precision throughput per core is better than the A14's?

This has been confirmed by Apple's GPU driver team leader on Twitter.


I have also run some benchmarks (look in the posts above) that show FMA throughput on M1 and A14 GPUs.

OK then, I'm confused: if Apple's (and Imagination's) TBDR GPUs are so awesomely power efficient, why isn't everyone doing it? What are the inherent limitations of a TBDR GPU?

I would guess mainly because it's a very intricate engineering puzzle. Getting the deferred rendering behavior in hardware while considering all the edge cases sounds extremely complicated... In the end, it seems that IMG were the only ones to get it right (which kind of makes sense when you consider that they have been at it consistently since forever), and they have patented the hell out of it. Apple has inherited their tech, possibly making some improvements here and there. I don't think anything else on the market is currently "real" TBDR — Mali and friends seem to be using a simple tiled immediate renderer, often relying on hacks like vertex shader splitting to make the data management simpler.

As to the inherent limitations, well, that has been the very topic of this thread, so I would recommend reading it. Just keep in mind that there has been a lot of confusion concerning the various rendering approaches, and some posts discuss limitations of mobile hardware that do not necessarily apply to Apple GPUs.

In my book, the main limitation of TBDR is that it has to keep the results of the vertex shader stage until the tile is fully shaded. Vertex shader outputs are stored in a buffer in device RAM, which puts additional bandwidth pressure on the already bandwidth-limited GPU. Usually this is not a problem, since TBDR will save much more bandwidth in the fragment shading stage later on, but if you have a lot of small primitives (with things like transparency etc. on top), deferred shading might end up more expensive. This is the reason why TBDR is thought to be incompatible with some of the more recent techniques like mesh shading. The latter is built around the idea of generating many small primitives on the GPU and shading them immediately — a great fit for modern desktop GPUs, but it doesn't really work for TBDR GPUs.
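To put some very rough numbers on it (everything below is an assumption picked for illustration, not a measurement of any real GPU or data format), here is a back-of-the-envelope sketch in C++ of the extra traffic the parameter buffer can generate:

```cpp
#include <cstdio>

// Rough sketch: estimate the per-frame traffic of writing post-transform
// vertex data to a parameter buffer in device RAM and reading it back during
// tile shading. All constants below are assumptions for illustration.
int main() {
    const double triangles      = 2.0e6;  // assumed binned triangles per frame
    const double verts_per_tri  = 3.0;    // ignoring vertex reuse for simplicity
    const double bytes_per_vert = 48.0;   // assumed: position plus a few varyings
    const double fps            = 60.0;

    // Written once after vertex shading, read at least once during tile shading.
    double gb_per_s = triangles * verts_per_tri * bytes_per_vert * 2.0 * fps / 1e9;
    std::printf("Approx. parameter-buffer traffic: %.1f GB/s\n", gb_per_s);
    // Roughly 35 GB/s with these numbers: small next to the fragment bandwidth
    // a TBDR saves, but it grows linearly with primitive count, which is why
    // huge numbers of tiny triangles erode the advantage.
}
```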

What I really like about Apple's implementation is that shading ultimately occurs over well-defined tiles with strong memory ordering guarantees. Basically, pixel shading is nothing more than a compute shader invocation over a 16x16 or a 32x32 pixel grid, and this compute shader will fetch the necessary primitive and texture data. And Metal makes this fact fully transparent to the programmer. You can actually use your own compute shaders to work on tile data and transform it in a variety of ways. Even more, you can run a sequence of pixel and compute shaders, and they will share the same fast persistent on-chip memory (local/shared memory) to communicate between different shader invocations. This allows a lot of things to be implemented in a rather elegant fashion and makes the GPU less reliant on memory bandwidth.
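For those who haven't played with it, here is a minimal sketch in Metal Shading Language (a C++-based dialect) of the most visible consequence of the tile staying on chip: a fragment function can read the pixel's current color attachment value directly (framebuffer fetch / programmable blending) instead of going through a fixed-function blend unit or device memory. The function name and the "blend" itself are just placeholders I made up; the tile compute kernels mentioned above build on the same on-chip tile storage.

```cpp
#include <metal_stdlib>
using namespace metal;

// The current value of color attachment 0 for this pixel is passed in via
// [[color(0)]]; on an Apple TBDR it is read straight from tile memory.
fragment float4 invert_tile_pixel(float4 framebuffer [[color(0)]])
{
    // Arbitrary "programmable blend": invert whatever is already in the tile.
    // The return value is written back to the same on-chip tile storage.
    return float4(1.0 - framebuffer.rgb, framebuffer.a);
}
```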
 
Can you "renew" the patent if it's still used ?
Generally speaking, no.
That’s the point of patents - you get a protected market in exchange for sharing your knowledge and methods.
Trademarks (and copyright) work differently. Allowing bastardizations such as "design patents" is unfortunate.
 
Generally speaking, no.
That’s the point of patents - you get a protected market in exchange for sharing your knowledge and methods.
Trademarks (and copyright) work differently. Allowing bastardizations such as "design patents" is unfortunate.
I'm not familiar with the exact patents and what they contain, but generally I think it should be viewed as a group of patents building on each other throughout the years, solving issues and addressing new techniques as graphics technology evolved from simple fixed-function pipelines to programmable shaders.

Similarly, the x86-64 patents have expired, but AVX and newer extensions have not. It's possible that making a switch to TBDR might no longer require the original patents now that they're expired, but to build a working modern TBDR you'd be entering a minefield of various more recent patents that are critical to making a TBDR GPU work today.
 
For most of the past two decades it's been patents. The fundamental ones should have expired by now, but I'm sure there are more recent ones. Qualcomm's Adreno GPUs are said to be some kind of deferred renderers, but we have very little info on their architectures.
I've never heard of engineering decisions being driven by fear of patents. Architects shouldn't be reading patents in the first place, so they don't know which implementation ideas are new and which are patented. Switching a company from an IMR to a TBDR or vice versa requires a significant shift in architecture, and it takes a lot to convince a large architecture team to deviate that much from what they are already doing, especially when there are negatives to both methods.
 
I've never heard of engineering decisions being driven by fear of patents. Architects shouldn't be reading patents in the first place, so they don't know which implementation ideas are new and which are patented. Switching a company from an IMR to a TBDR or vice versa requires a significant shift in architecture, and it takes a lot to convince a large architecture team to deviate that much from what they are already doing, especially when there are negatives to both methods.

From what I had heard, ARM tried more than once to get IMG patents dismissed, simply because for some of those headaches the workarounds are unique. IMG's tiling methods are unique, regardless of whether they defer or not.
 
This has been confirmed by Apple's GPU driver team leader on Twitter.


I have also run some benchmarks (look in the posts above) that show FMA throughput on M1 and A14 GPUs.



I would guess mainly because it's a very intricate engineering puzzle. Getting the deferred rendering behavior in hardware while considering all the edge cases sounds extremely complicated... In the end, it seems that IMG were the only ones to get it right (which kind of makes sense when you consider that they have been at it consistently since forever), and they have patented the hell out of it. Apple has inherited their tech, possibly making some improvements here and there. I don't think anything else on the market is currently "real" TBDR — Mali and friends seem to be using a simple tiled immediate renderer, often relying on hacks like vertex shader splitting to make the data management simpler.

As to the inherent limitations, well, that has been the very topic of this thread, so I would recommend reading it. Just keep in mind that there has been a lot of confusion concerning the various rendering approaches, and some posts discuss limitations of mobile hardware that do not necessarily apply to Apple GPUs.

In my book, the main limitation of TBDR is that it has to keep the results of the vertex shader stage until the tile is fully shaded. Vertex shader outputs are stored in a buffer in device RAM, which puts additional bandwidth pressure on the already bandwidth-limited GPU. Usually this is not a problem, since TBDR will save much more bandwidth in the fragment shading stage later on, but if you have a lot of small primitives (with things like transparency etc. on top), deferred shading might end up more expensive. This is the reason why TBDR is thought to be incompatible with some of the more recent techniques like mesh shading. The latter is built around the idea of generating many small primitives on the GPU and shading them immediately — a great fit for modern desktop GPUs, but it doesn't really work for TBDR GPUs.

What I really like about Apple's implementation is that shading ultimately occurs over well-defined tiles with strong memory ordering guarantees. Basically, pixel shading is nothing more than a compute shader invocation over a 16x16 or a 32x32 pixel grid, and this compute shader will fetch the necessary primitive and texture data. And Metal makes this fact fully transparent to the programmer. You can actually use your own compute shaders to work on tile data and transform it in a variety of ways. Even more, you can run a sequence of pixel and compute shaders, and they will share the same fast persistent on-chip memory (local/shared memory) to communicate between different shader invocations. This allows a lot of things to be implemented in a rather elegant fashion and makes the GPU less reliant on memory bandwidth.

I'd have to dig up a few of John's comments here from a past discussion about tessellation being a major problem on a TBDR. The patents that worked around the issue were also under his name, from what I recall. It's true that DRs are not fond of high geometry volumes, but that doesn't mean workarounds aren't possible. The real question at the end of the equation is whether a DR is left with advantages or not. IMHO yes, but nothing as groundbreaking as many would imagine.

As for buffering on a PowerVR, there are buffers, and they use patented parameter and geometry compression techniques, amongst others, to keep the data traffic under control.
 
As to the inherent limitations, well, that has been the very topic of this thread, so I would recommend reading it. Just keep in mind that there has been a lot of confusion concerning the various rendering approaches, and some posts discuss limitations of mobile hardware that do not necessarily apply to Apple GPUs.

In my book, the main limitation of TBDR is that it has to keep the results of the vertex shader stage until the tile is fully shaded. Vertex shader outputs are stored in a buffer in device RAM, which puts additional bandwidth pressure on the already bandwidth-limited GPU. Usually this is not a problem, since TBDR will save much more bandwidth in the fragment shading stage later on, but if you have a lot of small primitives (with things like transparency etc. on top), deferred shading might end up more expensive. This is the reason why TBDR is thought to be incompatible with some of the more recent techniques like mesh shading. The latter is built around the idea of generating many small primitives on the GPU and shading them immediately — a great fit for modern desktop GPUs, but it doesn't really work for TBDR GPUs.

I'm sure you know this, but small geometry can also cause bottlenecks in PowerVR/Apple's fixed-function hardware unit known as the "Tile Accelerator". I wonder if this bottleneck is down to the prohibitive cost of sorting lots of geometry into the bins?

Another consequence of the parameter buffer and its limited space is that hidden surface removal becomes less effective on current TBDR implementations if multiple tile shading passes are necessary to free up the memory within the parameter buffer. Hidden surface removal only applies after the geometry is processed ...
 
I'm sure you know this, but small geometry can also cause bottlenecks in PowerVR/Apple's fixed-function hardware unit known as the "Tile Accelerator". I wonder if this bottleneck is down to the prohibitive cost of sorting lots of geometry into the bins?
I guess you're unaware that tiling is basically the same as coarse-grained (conservative) rasterisation, and is no more of a bottleneck than rasterisation is in an IMR, in fact considerably less so due to running at tile resolution, and that it is also possible to scale it up in a near linear fashion? In short, small geometry does not cause bottlenecks during the tiling process.
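To make the "tiling is just coarse rasterisation" point concrete, here is a conceptual C++ sketch (not how any real Tile Accelerator is implemented; tile size and data structures are assumptions): the work per triangle scales with the number of tiles it touches, and a triangle that fits inside a single tile degenerates to one list append.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Vec2 { float x, y; };

constexpr int kTileSize = 32;  // assumed 32x32 pixel tiles

// Bin one triangle into per-tile primitive lists. Assumes on-screen
// (non-negative) coordinates and that tile_lists is sized tiles_x * tiles_y.
void bin_triangle(const Vec2 v[3], int tiles_x,
                  std::vector<std::vector<int>>& tile_lists, int tri_id)
{
    // Tile-space bounding box of the triangle.
    float min_x = std::min({v[0].x, v[1].x, v[2].x});
    float max_x = std::max({v[0].x, v[1].x, v[2].x});
    float min_y = std::min({v[0].y, v[1].y, v[2].y});
    float max_y = std::max({v[0].y, v[1].y, v[2].y});

    int tx0 = int(std::floor(min_x)) / kTileSize;
    int tx1 = int(std::floor(max_x)) / kTileSize;
    int ty0 = int(std::floor(min_y)) / kTileSize;
    int ty1 = int(std::floor(max_y)) / kTileSize;

    // Fast path: a small triangle that fits in a single tile is inserted
    // with one append, no per-tile tests needed at all.
    if (tx0 == tx1 && ty0 == ty1) {
        tile_lists[ty0 * tiles_x + tx0].push_back(tri_id);
        return;
    }

    // Otherwise walk the (few) candidate tiles; a real tiler would refine the
    // bounding box with conservative edge tests to drop corner tiles the
    // triangle misses.
    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx)
            tile_lists[ty * tiles_x + tx].push_back(tri_id);
}
```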
 
I'm sure you know this, but small geometry can also cause bottlenecks in PowerVR/Apple's fixed-function hardware unit known as the "Tile Accelerator". I wonder if this bottleneck is down to the prohibitive cost of sorting lots of geometry into the bins?

Another consequence of the parameter buffer and its limited space is that hidden surface removal becomes less effective on current TBDR implementations if multiple tile shading passes are necessary to free up the memory within the parameter buffer. Hidden surface removal only applies after the geometry is processed ...

I have the feeling that we already had this very conversation a few pages back? Tiling certainly adds an additional per-primitive cost to the process, but this cost should be proportional to the number of tiles a primitive intersects. Small primitives should actually be the cheapest. Not that this (or many other tiling-related arguments you have mentioned) matters much, since modern desktop GPUs also do tiling.

And if you have a lot of very small primitives, yes, the parameter buffer will overflow and a tile will get shaded multiple times. I don't see why this would put a TBDR architecture at a disadvantage, however. Many small triangles per tile in practice means a lot of overdraw, so TBDR should still be able to save a substantial amount of shading work even with tile flushes. Not to mention that a lot of small triangles means bad SIMD ALU utilization on an IMR (you need a homogeneous 8x4 pixel block to fill up a modern 32-wide SIMD processor), whereas a TBDR always dispatches shader work over the entire tile.
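For what it's worth, the overflow behaviour can be sketched in a few lines of C++ (purely conceptual, the capacity and data layout are made up): when the parameter buffer fills, the hardware does a partial render, so hidden surface removal only sees the primitives binned up to that flush.

```cpp
#include <cstddef>
#include <vector>

struct Primitive { /* post-transform vertex data */ };

// Conceptual model of a parameter buffer that forces a "partial render"
// (shade the tiles with whatever has been binned so far, then free the
// buffer) whenever it runs out of space.
class ParameterBuffer {
public:
    explicit ParameterBuffer(std::size_t capacity) : capacity_(capacity) {}

    // Returns the number of partial renders that were forced by overflow.
    int submit(const std::vector<Primitive>& frame_prims) {
        int flushes = 0;
        for (const Primitive& p : frame_prims) {
            if (binned_.size() == capacity_) {
                shade_tiles_and_resolve();  // partial render: HSR only saw
                ++flushes;                  // the primitives binned so far
            }
            binned_.push_back(p);
        }
        shade_tiles_and_resolve();          // end-of-pass render
        return flushes;
    }

private:
    void shade_tiles_and_resolve() { binned_.clear(); }

    std::size_t capacity_;
    std::vector<Primitive> binned_;
};
```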
 
The entire argument is moot as long as modern game engines rather resort to compute solutions for fine-grained geometry.
 
I guess you're unaware that tiling is basically the same as coarse-grained (conservative) rasterisation, and is no more of a bottleneck than rasterisation is in an IMR, in fact considerably less so due to running at tile resolution, and that it is also possible to scale it up in a near linear fashion? In short, small geometry does not cause bottlenecks during the tiling process.

Would you be willing to elaborate more on page 11 of PowerVR's Performance Recommendations guide?

They imply that submitting many small triangles will cause a bottleneck in their TA (Tile Accelerator). They also mention that doing face culling is advantageous to relieve the TA by binning a lower amount of geometry. I'm curious, but how does the cost of binning with the TA scale with increasing geometry density?
 
Would you be willing to elaborate more on page 11 of PowerVR's Performance Recommendations guide?
Unfortunately I can't speak for IMG as I don't work for them anymore, so perhaps the guys from PowerVR/IMG could explain why they've worded it as they have?

In my opinion, as written it doesn't make sense from the HW perspective, as we would generally attempt to balance the performance of the tiling with that of the rest of the pipeline, i.e. geometry processing and tiling throughput would normally be matched to rasterisation triangle throughput. The guidance I would have given is 'manage model LOD correctly to avoid excessive redundant small triangles and associated aliasing', i.e. the same guidance you would generally give for any architecture.

They imply that submitting many small triangles will cause a bottleneck in their TA (Tile Accelerator). They also mention that doing face culling is advantageous to relieve the TA by binning a lower amount of geometry. I'm curious, but how does the cost of binning with the TA scale with increasing geometry density?
In general the cost of the binning process decreases as triangle size decreases, to the point where a triangle (or multiple attached triangles) lying entirely within the bounds of a tile can be trivially inserted into that tile (effectively bypassing the bulk of the tiling process).
 
Would you be willing to elaborate more on page 11 of PowerVR's Performance Recommendations guide?

A complete amateur opinion here, but maybe they refer to the fact that transformed triangle data must be collected prior to rasterization, so a large number of very small triangles will quickly fill up the buffers, causing premature flushes and additional memory operations? TBDR only really makes sense when there is more fragment shading work than vertex shading work. Then again, if you are drawing huge amounts of very small triangles, you most likely have a bigger problem. As JohnH points out, managing very small triangles is a general recommendation, TBDR or not.

By the way, talking about tessellation: it seems that modern Apple devices are faster at tessellating on the fly than at rendering pre-tessellated geometry of the same complexity: https://github.com/gpuweb/gpuweb/issues/445 (in the test linked in that issue my A13 is around 6% faster with tessellation enabled).
 
Unfortunately I can't speak for IMG as I don't work for them anymore, so perhaps the guys from PowerVR/IMG could explain why they've worded it as they have?

In my opinion, as written it doesn't make sense from the HW perspective, as we would generally attempt to balance the performance of the tiling with that of the rest of the pipeline, i.e. geometry processing and tiling throughput would normally be matched to rasterisation triangle throughput.

Wouldn't that imply that if either the TA or the rasterizer were bottlenecked, then the other unit would become bottlenecked as well?

The guidance I would have given is 'manage model LOD correctly to avoid excessive redundant small triangles and associated aliasing', i.e. the same guidance you would generally give for any architecture.

A mesh shading pipeline would be very helpful in this scenario since we can now do dynamic LoD selection with amplification/task shaders! ;)

In general the cost of the binning process decreases as triangle size decreases, to the point where a triangle (or multiple attached triangles) lying entirely within the bounds of a tile can be trivially inserted into that tile (effectively bypassing the bulk of the tiling process).

What about the case where we have many 'thin' and small triangles crossing tile boundaries all around the corners of just about every tile? How awfully or gracefully could that be handled by a TBDR pipeline?
 
I'm not familiar with the exact patents and what they contain, but generally I think it should be viewed as a group of patents building on each other throughout the years, solving issues and addressing new techniques as graphics technology evolved from simple fixed-function pipelines to programmable shaders.

Similarly, the x86-64 patents have expired, but AVX and newer extensions have not. It's possible that making a switch to TBDR might no longer require the original patents now that they're expired, but to build a working modern TBDR you'd be entering a minefield of various more recent patents that are critical to making a TBDR GPU work today.
Extending x86 in order to lock out competitors works quite well, since Intel is dominant enough to push through at least some utilisation, and the main benefit of x86 is binary backwards compatibility.
It isn't as prohibitive in graphics, as you access GPU capabilities through an API. Also, depending on the market, you may not have to care much about backwards compatibility at all. So if the fundamental ideas of how to build an efficient tiler are in the public domain by now, it is at least much easier to build a viable modern GPU architecture on that foundation than, say, a commercially viable x86 CPU. Hell, not having to go through (and stay backwards compatible with) the intervening 20 years of evolution may even be an advantage in some respects!
That said, not having to design around good ideas that are still patent protected is obviously beneficial as well.

(Aside: during a period a couple of decades+ back, I felt that Intel was introducing ISA features in order to mess up AMD, who would be three years or so behind in implementing said feature and have to graft it onto a design foundation where it wouldn't necessarily fit as well. Those extensions weren't necessarily used by anyone outside a few benchmark programs/companies that were in Intel's pocket, or even outright owned by them, but it created an impression that AMD was behind. Of course, those new features also had to be supported going forward.)
 
Wouldn't that imply that if either the TA or the rasterizer were bottlenecked, then the other unit would become bottlenecked as well?
Not sure what you mean here? Balanced means that the pipeline throughputs match under certain circumstances. For example, if the rasteriser has a triangle setup rate of 1 every 2 clocks, then ideally you want the tiler to have similar throughput in the circumstances where you expect to be able to hit that rate, e.g. for many small triangles the tiler and the rasteriser end up taking similar amounts of time.

A mesh shading pipeline would be very helpful in this scenario since we can now do dynamic LoD selection with amplification/task shaders! ;)
BTW I'm not anti mesh shader, I just think they're the wrong solution for now. If it had been originally proposed instead of the mess that is the original FF Dx tessellation pipeline, I think it would be universal at this point, but at this time I think things may be moving towards more generalised compute solutions. Personally I'm not that convinced that generalised compute is necessarily the answer here either, given the power consumption implications that come along with it.

What about the case where we have many 'thin' and small triangles crossing tile boundaries all around the corners of just about every tile? How awfully or gracefully could that be handled by a TBDR pipeline?
Those long thin triangles also have to be processed in the rasteriser; their long thin nature means that they yield many rasteriser sub-blocks, so the rasteriser is more likely to be the bottleneck than the tiler in that case. E.g. tiling is at 32x32 and the raster sub-block is 4x4; for long thin diagonal triangles (intersecting tile corners), any one tile insertion yields raster work of 10-15 blocks within a tile, which means the tiler has 10-15 clocks available per tile (assuming 1 raster block/clock), more than enough time for a tiler running at, say, 1 triangle/tile/clock.
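Putting that arithmetic into a trivial C++ snippet with the same assumed numbers (32x32 tiles, 4x4 raster sub-blocks, 1 raster block per clock, 1 tile insertion per clock):

```cpp
#include <cstdio>

// Worked example of the tiler-vs-rasteriser budget for a thin diagonal
// triangle crossing a tile. All rates are the assumptions quoted above.
int main() {
    const int tile_size      = 32;
    const int block_size     = 4;
    const int blocks_per_row = tile_size / block_size;   // 8 block rows per tile

    // A thin diagonal triangle touches roughly 1-2 blocks per block row it
    // crosses, i.e. on the order of 10-15 blocks per tile; take ~1.5 per row.
    const int raster_blocks  = (3 * blocks_per_row) / 2; // ~12 blocks

    const int raster_clocks_per_tile = raster_blocks;    // 1 block per clock
    const int tiler_clocks_per_tile  = 1;                // 1 insertion per clock

    std::printf("rasteriser: %d clocks per tile, tiler: %d clock(s) per tile\n",
                raster_clocks_per_tile, tiler_clocks_per_tile);
    // The rasteriser needs ~12 clocks for the work that one tile insertion
    // creates, so the tiler has plenty of slack and is not the bottleneck.
}
```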
 