Apple (PowerVR) TBDR GPU-architecture speculation thread

MuteyM · Oct 24, 2020

rikrak said:
Where did you get this information? This seems at odds with Apple's and Imagination's documentation. Not to mention that if parts of the vertex shader were executed multiple times, it would completely break some supported behavior (such as writing to memory from vertex shaders etc.).

I can't post links yet, but if you google "Arm Mali Offline Compiler User Guide" [ModEdit Link: https://developer.arm.com/documenta...ler/Performance-analysis/IDVS-shader-variants ] and read the IDVS Shader Variants chapter you can see that they split their vertex shaders exactly as @Lurkmass said. I imagine if a vertex shader has side effects like memory writes, then it'd be up to the specific graphics API and/or shading language to define which of the two shaders those side effects go into.
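To make the split concrete, here is a toy Python model of the idea, not Arm's actual implementation: a position pass runs for every vertex, culling happens in between, and a varying pass runs only for vertices still referenced by surviving triangles. The function name and the `is_visible` predicate are made up for illustration.

```python
def idvs_invocations(num_vertices, triangles, is_visible):
    """Toy model of a split vertex shader.

    Returns (position_runs, varying_runs) for an indexed triangle list.
    """
    # Stage 1: the position variant runs once for every vertex.
    position_runs = num_vertices

    # Culling happens between the two stages.
    surviving = [tri for tri in triangles if is_visible(tri)]

    # Stage 2: the varying variant runs only for vertices still referenced.
    needed = {v for tri in surviving for v in tri}
    varying_runs = len(needed)
    return position_runs, varying_runs


# Two triangles sharing an edge; pretend the second one is culled.
tris = [(0, 1, 2), (2, 1, 3)]
pos_runs, var_runs = idvs_invocations(4, tris, lambda tri: tri != (2, 1, 3))
print(pos_runs, var_runs)
```

In this model no single stage runs twice for the same vertex, but a side effect would land in whichever stage the compiler assigns it to.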

rikrak · Oct 28, 2020

MuteyM said:
I can't post links yet, but if you google "Arm Mali Offline Compiler User Guide" [ModEdit Link: https://developer.arm.com/documenta...ler/Performance-analysis/IDVS-shader-variants ] and read the IDVS Shader Variants chapter you can see that they split their vertex shaders exactly as @Lurkmass said. I imagine if a vertex shader has side effects like memory writes, then it'd be up to the specific graphics API and/or shading language to define which of the two shaders those side effects go into.

But we are not discussing ARM Mali here, are we? We are discussing Apple GPUs which are based on PowerVR TBDR. That's completely different hardware, not to mention Mali's performance is significantly lower. I mean, you wouldn't argue that IMR shades fragments in 4x2 groups just because Intel GPUs happen to, would you?

Looking at the PowerVR documentation (https://cdn.imgtec.com/sdk-documentation/PowerVR.Performance+Recommendations.pdf), section 4.5.6 specifically seems to suggest that all vertex shader outputs are written out to the tile parameter buffer and later fetched for interpolation. There is no mention of vertex shader splitting. Nor was I able to find any reference to a similar procedure in the Apple documentation.

Anyway, I suppose this can be tested in a simple way: write a vertex shader that makes all outputs dependent on an atomic read-and-increment, then count how often the increment happened. I'll try to do it quickly if I have 20 minutes of free time.

Ailuros · Oct 29, 2020

A simple test as rikrak suggests should be an indication of what is going on. PowerVR documentation might be worthless considering how much of the GPU code Apple has actually re-written for its GPUs.

Rodéric · Oct 29, 2020

rikrak said:
Where did you get this information? This seems at odds with Apple's and Imagination's documentation. Not to mention that if parts of the vertex shader were executed multiple times, it would completely break some supported behavior (such as writing to memory from vertex shaders etc.).

He described ARM Mali's inner workings; you described PowerVR.
I doubt Apple has changed something that fundamental to PowerVR.

Ailuros · Oct 29, 2020

Someone correct me if I'm wrong, but the two-stage VS affects the Bifrost and Valhall architectures only?

rikrak · Oct 29, 2020

Ailuros said:
A simple test as rikrak suggests should be an indication of what is going on. PowerVR documentation might be worthless considering how much of the GPU code Apple has actually re-written for its GPUs.

Rodéric said:
He described ARM Mali's inner workings; you described PowerVR.
I doubt Apple has changed something that fundamental to PowerVR.

So I did try it out, both on my Mac (AMD Navi GPU) and on my iPhone 11. I used a trivial vertex shader that atomically increments a per-vertex counter each time the vertex shader is run (to guarantee this, all outputs depend on the previous value in the buffer). The buffer is initialized to all zeros at the start of the rendering loop.

Code:

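#include <metal_stdlib>
using namespace metal;

// Assumed definition; the struct was not shown in the original post.
struct VertexOut {
    float4 position [[position]];
    float4 color;
};

vertex VertexOut vertex_shader(const device float2 *vertices [[buffer(0)]],
                               device atomic_uint *counter [[buffer(1)]],
                               uint vid [[vertex_id]]) {
    VertexOut out;

    // Count this invocation. The previous counter value feeds every output,
    // so the atomic cannot be optimized away or separated from the outputs.
    float s = atomic_fetch_add_explicit(&counter[vid], 1u, memory_order_relaxed) / 4.0f;

    float2 vpos = vertices[vid] + s;

    out.position = float4(vpos.x, vpos.y, 0.0f, 1.0f);
    out.color = float4(0.5f + s, 0.0f, 0.0f, 1.0f);

    return out;
}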

The idea is that after the frame is completed, the buffer will contain the exact number of times a vertex shader was run for a given vertex. To control for this, I am using instanced rendering — the counts in the buffer must match the number of rendered instances if the shader is run exactly once per vertex.
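The host-side check is simple enough to sketch on the CPU. The Python below is only a stand-in for reading the counter buffer back after the frame; `simulate_draw`, `shaded_once` and the `runs_per_vertex` knob are hypothetical names used to show what a "shaded twice" result would look like, not part of the actual test app.

```python
def simulate_draw(num_vertices, num_instances, runs_per_vertex=1):
    """CPU stand-in for the counter buffer after one instanced draw."""
    counter = [0] * num_vertices  # zeroed at the start of the frame
    for _ in range(num_instances):
        for vid in range(num_vertices):
            # Each shader invocation does one atomic increment on counter[vid].
            counter[vid] += runs_per_vertex
    return counter


def shaded_once(counter, num_instances):
    """True if every vertex was shaded exactly once per instance."""
    return all(count == num_instances for count in counter)


once = simulate_draw(num_vertices=3, num_instances=16)
twice = simulate_draw(num_vertices=3, num_instances=16, runs_per_vertex=2)
print(shaded_once(once, 16), shaded_once(twice, 16))  # True False
```

Any re-execution of the full shader for a vertex would push its count above the instance count, which is what the comparison is looking for.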

The result is identical for the Navi GPU and the A13 GPU. The counts in the buffer after the frame execution exactly match the number of drawn instances.

I hope this disproves the idea that Apple GPUs do any kind of shader splitting.

Lurkmass · Nov 3, 2020

Rodéric said:
He described ARM Mali's inner workings; you described PowerVR.
I doubt Apple has changed something that fundamental to PowerVR.

On Mali, their vertex pipeline is split into two stages ...

On Adreno, by some developer accounts their tile based rendering pipeline is completely disabled in the presence of geometry shaders ...

PowerVR's low-level rendering pipeline details are still very much unknown, other than the fact that they natively support geometry shading and tessellation in hardware, but it remains a mystery how well their hardware implementation matches current APIs. The only remotely useful information about geometry shading on their hardware is that using it for geometry amplification is a bad idea ...

Also, this is somewhat proprietary and possibly outdated information, but on older PowerVR HW, when the parameter buffer gets full, they free up memory by doing partial rendering: vertex shading is stalled and they start fragment shading for some tiles before all of the draws have been processed through the vertex pipeline. This means that tiles at the same screen-space locations can in this case potentially be accessed multiple times. I don't know if this is relevant to modern PowerVR HW, but they still recommend avoiding small triangles as much as possible to avoid thrashing their parameter buffer ...

rikrak · Nov 3, 2020

Lurkmass said:
On Mali, their vertex pipeline is split into two stages ...
Also, this is somewhat proprietary and possibly outdated information, but on older PowerVR HW, when the parameter buffer gets full, they free up memory by doing partial rendering: vertex shading is stalled and they start fragment shading for some tiles before all of the draws have been processed through the vertex pipeline. This means that tiles at the same screen-space locations can in this case potentially be accessed multiple times. I don't know if this is relevant to modern PowerVR HW, but they still recommend avoiding small triangles as much as possible to avoid thrashing their parameter buffer ...

Apple also mentions this in their documentation. Buffer overflows or transparency spills can cause partial tile flushes. But each vertex is still only processed once and even with partial flushes a TBDR renderer still does less work than an IMR one. Anyway, you are focusing so much on finding problems with tilers that you seem to forget that both Nvidia and AMD are tilers too. And they flush early (the limit seems to be around 500 primitives per tile on Navi in my experiments, which is not much when you consider that their tiles are huge).

As to rendering many small triangles... it will impose a performance hit on any mainstream GPU. They rasterize and shade in SIMD-aligned blocks. A lot of small edges = suboptimal SIMD utilization. TBDR avoids this problem by shading the entire tile by the way.
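To put a rough number on the SIMD utilization point, here is a small Python sketch. It assumes pixels are shaded in 2x2 quads and uses a plain edge-function coverage test at pixel centers without a proper fill rule, so treat it as an illustration rather than a model of any specific GPU. The function names and triangle coordinates are made up.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area test: which side of edge A->B the point P lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def quad_utilization(tri, width, height):
    """Return (covered_pixels, shaded_lanes, utilization) for one triangle."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    covered = 0
    quads = set()
    for y in range(height):
        for x in range(width):
            px, py = x + 0.5, y + 0.5  # sample at the pixel center
            w0 = edge(x1, y1, x2, y2, px, py)
            w1 = edge(x2, y2, x0, y0, px, py)
            w2 = edge(x0, y0, x1, y1, px, py)
            inside = (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
                     (w0 <= 0 and w1 <= 0 and w2 <= 0)
            if inside:
                covered += 1
                quads.add((x // 2, y // 2))  # the 2x2 quad this pixel is in
    lanes = 4 * len(quads)  # every touched quad occupies four lanes
    return covered, lanes, (covered / lanes if lanes else 0.0)


small = quad_utilization(((1.2, 1.2), (3.3, 1.2), (1.2, 3.3)), 8, 8)
large = quad_utilization(((0.2, 0.2), (7.7, 0.2), (0.2, 7.7)), 8, 8)
print(small, large)
```

In this toy setup the small triangle covers 3 pixels but occupies 12 lanes, while the larger one covers 28 pixels across 40 lanes, so most of the waste comes from quads along the edges.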

Xmas · Nov 3, 2020

Scene complexity is unbounded, so having to flush the parameter buffer is always a possibility; the alternative would be to accept incorrect output when putting too much geometry into a single frame. Of course, by the time you're filling up the parameter buffer, the relative cost of partially rendering tiles has gone down.
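A toy Python model of that overflow behavior, assuming a fixed-capacity parameter buffer that is fully drained by each partial render. Real hardware is surely more subtle; `count_partial_renders` and the byte sizes are made up for illustration.

```python
def count_partial_renders(primitive_sizes, capacity):
    """Count how often a fixed-size parameter buffer must be flushed."""
    used = 0
    flushes = 0
    for size in primitive_sizes:
        if size > capacity:
            raise ValueError("primitive larger than the whole buffer")
        if used + size > capacity:
            # Buffer full: rasterize and shade what is binned so far, then
            # reuse the memory. This is the partial render.
            flushes += 1
            used = 0
        used += size
    return flushes


# 1000 primitives at an assumed 64 bytes each.
frame = [64] * 1000
print(count_partial_renders(frame, capacity=16384))  # 3
print(count_partial_renders(frame, capacity=64000))  # 0
```

The number of flushes grows only with how far the geometry exceeds the buffer, which fits the point that the relative cost shrinks as the frame gets heavier.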

Lurkmass · Nov 3, 2020

rikrak said:
Apple also mentions this in their documentation. Buffer overflows or transparency spills can cause partial tile flushes. But each vertex is still only processed once and even with partial flushes a TBDR renderer still does less work than an IMR one. Anyway, you are focusing so much on finding problems with tilers that you seem to forget that both Nvidia and AMD are tilers too. And they flush early (the limit seems to be around 500 primitives per tile on Navi in my experiments, which is not much when you consider that their tiles are huge).

If my information is correct, this "parameter buffer" is stored in off-chip video memory on PowerVR devices that feature their TBDR rendering pipeline, while on IMRs the transformed geometry is kept inside the on-chip caches, so the concerns about a smaller amount of buffering space don't necessarily apply to their case ...

How much extra latency would this incur for the pipeline in case we do see the parameter buffers being flushed?

Xmas · Nov 3, 2020

Lurkmass said:
If my information is correct, this "parameter buffer" is stored in off-chip video memory on PowerVR devices that feature their TBDR rendering pipeline, while on IMRs the transformed geometry is kept inside the on-chip caches, so the concerns about a smaller amount of buffering space don't necessarily apply to their case ...

In what way does the location of the parameter buffer matter? Its bandwidth requirements practically don't change whether you need to flush multiple times or not.

How much extra latency would this incur for the pipeline in case we do see the parameter buffers being flushed?

It would depend on how much extra work you need to do due to the extra shading (HSR being no longer optimal) and tile transfers. I'd think of it as similar to finishing a frame and beginning a new one without framebuffer clear.

rikrak · Nov 3, 2020

Lurkmass said:
If my information is correct, this "parameter buffer" is stored in off-chip video memory on PowerVR devices that feature their TBDR rendering pipeline, while on IMRs the transformed geometry is kept inside the on-chip caches, so the concerns about a smaller amount of buffering space don't necessarily apply to their case ...

And I think this is the big elephant in the room with TBDR and the main reason why one would be skeptical of its scaling potential. All transformed vertices have to be streamed out, so if your pixel shading is trivial while your geometry is very complex, TBDR will be less efficient than IMR approaches. A similar concern applies to mesh shaders: TBDR sounds to me fundamentally incompatible with the core idea of mesh shading.
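A back-of-the-envelope sketch of that streaming cost in Python. It assumes every transformed vertex is written once and read back once, and ignores culling, compression and the per-tile primitive lists. The vertex count, bytes per vertex and frame rate are invented numbers, not measurements.

```python
def parameter_buffer_traffic(vertices_per_frame, bytes_per_vertex, fps):
    """Bytes per second written to and read back from the parameter buffer."""
    per_frame = vertices_per_frame * bytes_per_vertex
    return 2 * per_frame * fps  # one write plus one read per vertex


# Assumed workload: 10 million vertices, 32 bytes of outputs each, 60 fps.
traffic = parameter_buffer_traffic(10_000_000, 32, 60)
print(traffic / 1e9, "GB/s")  # 38.4 GB/s
```

With geometry that heavy, the vertex stream alone would eat a large share of a mobile memory budget, independent of how cheap the pixel shading is.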

I suppose one could "solve" it by having the parameter buffer reside in on-chip memory, but that sounds terribly expensive, as you would need a lot of SRAM...

Lurkmass said:
How much extra latency would this incur for the pipeline in case we do see the parameter buffers being flushed?

I would be surprised if this mattered much — rasterization+tile shading and vertex+primitive processing can be scheduled asynchronously, so I would suspect these devices are very good at hiding latency.

MfA · Nov 3, 2020

rikrak said:
TBDR sounds to me fundamentally incompatible with the core idea of mesh shading.

So is RTX ray tracing and both are incompatible with the core idea of Unreal Nanite. Mesh shading isn't the core of future rendering.

Also Apple will likely get big enough that developers start doing geometry level tiling.

Rodéric · Nov 3, 2020

What matters is how well PowerVR could handle Mesh shaders which are IMO the short term future of the graphics pipeline.

MfA · Nov 3, 2020

I think it will be another pointlessly specific shader stage consigned to the dustbin of history with so many others.

I wish Sweeney had delivered on his promise of a computational pipeline so much earlier, but better late than never.

3dcgi · Nov 4, 2020

rikrak said:
Not to mention that if parts of the vertex shader were executed multiple times, it would completely break some supported behavior (such as writing to memory from vertex shaders etc.).

APIs don't specify vertex shaders must be run a specific number of times. This allows each IHV to implement their own vertex reuse algorithms. So if you write to memory from a vertex shader you can't expect identical results across IHVs.
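A minimal Python sketch of why the invocation count is implementation-dependent, assuming a simple FIFO post-transform cache. Actual reuse schemes differ per vendor and are not described in this thread; `vertex_shader_runs` is a hypothetical helper.

```python
from collections import deque


def vertex_shader_runs(indices, cache_size):
    """Count vertex shader invocations for an index stream with a FIFO cache."""
    cache = deque(maxlen=cache_size)  # oldest entries fall out automatically
    runs = 0
    for idx in indices:
        if idx not in cache:
            runs += 1          # cache miss: the vertex is shaded (again)
            cache.append(idx)
    return runs


# Two triangles sharing an edge.
indices = [0, 1, 2, 2, 1, 3]
print(vertex_shader_runs(indices, cache_size=0))  # 6
print(vertex_shader_runs(indices, cache_size=1))  # 5
print(vertex_shader_runs(indices, cache_size=4))  # 4
```

The same index buffer yields a different number of runs depending on the cache, so a counter written from the vertex shader only tells you about the implementation you ran it on.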

rikrak · Nov 4, 2020

3dcgi said:
APIs don't specify vertex shaders must be run a specific number of times. This allows each IHV to implement their own vertex reuse algorithms. So if you write to memory from a vertex shader you can't expect identical results across IHVs.

Makes total sense to me, but we were discussing a specific point: does tiling incur additional vertex shading cost on a modern TBDR implementation? My experiments (see the post above) suggest that it does not.

Ailuros · Nov 4, 2020

rikrak said:
Makes total sense to me, but we were discussing a specific point: does tiling incur additional vertex shading cost on a modern TBDR implementation? My experiments (see the post above) suggest that it does not.

A few years ago the suggestion circulated that TBDRs suck with DX11 tessellation. One of the reasons was that anything relevant performed terribly on the SGX GPU IP in the Sony Vita handheld. With that GPU being DX9L3, I wouldn't expect any tessellation to yield anything better than seconds per frame.

Now the discussion here is moving back and forth about Apple's SoC GPUs, which are based on a PowerVR Series7 Plus GPU. Despite the tessellation unit or anything else present, that GPU is no more than DX10.x compliant (if memory serves, it might only be up to DX10.0) because they skipped high-precision values in the ALUs, and yet here we are discussing whether DX12 feature A or B works well on a TBDR. I have no idea where Albiorix A or B lies in capabilities, but I wouldn't be surprised if their baseline starts with DX11.0 compliance, and we don't have a single integrated unit from the A-Series yet.

https://forum.beyond3d.com/posts/2168556/

Considering he's been leading the GPU designs at IMG for at least the past two decades, I suggest he knows what he's talking about.

Lurkmass · Nov 4, 2020

MfA said:
I think it will be another pointlessly specific shader stage consigned to the dustbin of history with so many others.

I wish Sweeney had delivered on his promise of a computational pipeline so much earlier, but better late than never.

Actually, Nanite still benefits from using the graphics pipeline like we see on PS5 where they are using primitive shaders ...

Even then, Nanite doesn't handle dynamic geometry, so a geometry pipeline will still have a role to play for many more years to come, especially with more advanced pipelines like mesh shading. Mesh shaders aren't going anywhere, as AMD, Nvidia, and Microsoft decided to standardize the functionality with DX12 Ultimate ...

MfA · Nov 5, 2020

Lurkmass said:
Actually, Nanite still benefits from using the graphics pipeline like we see on PS5 where they are using primitive shaders ...

I know you know, but just to be clear they also use primitive shaders. For the "vast majority" of triangles it's true compute. Very likely the impact of primitive shaders is just eking out the final few percentage points.

Even then, Nanite doesn't handle dynamic geometry, so a geometry pipeline will still have a role to play for many more years to come, especially with more advanced pipelines like mesh shading.

For now ... but engines can only avoid dicing to memory for the engines which don't let world space raytracing and spatiotemporal denoising handle all the lighting. Less relevant to Unreal which uses image based hacks, but amplifying geometry transiently with primitive/mesh shaders without going through memory is under attack from two sides. Relegated to a trivial niche on one and incompatible on the other.
