How does PowerVR SGX handle ddx/ddy?

RacingPHT

Newcomer
I'm a bit confused, as the PowerVR chips are claiming zero overdraw. If this is true, the chip has to operate at the per-pixel level instead of the per-quad level that has been pretty much standard for years. And according to the configurations, some PowerVR parts seem to use 2 pixel shaders/TMUs, which does not seem natural for a quad-based renderer.

So it's possible that only one fragment survives depth testing. In this case, only interpolated attributes have well-defined ddx/ddy; other, non-linear functions probably do not. For example, the mip-map level calculation of a dependent texture fetch.

Can someone shed some light on this?
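
To make the concern concrete, here's a rough sketch (purely illustrative, not PowerVR-specific, all names made up) of how a quad-based rasterizer derives the mip level by finite-differencing texture coordinates across a 2x2 stamp. Interpolated attributes can always be differenced like this, but for a dependent fetch the coordinates only exist once the neighbouring fragments have computed them:

Code:
// Minimal sketch (not PowerVR-specific): deriving ddx/ddy and a mip level
// from a 2x2 stamp of texture coordinates. All names are illustrative only.
#include <algorithm>
#include <cmath>
#include <cstdio>

struct Vec2 { float x, y; };

// quad[0]=top-left, quad[1]=top-right, quad[2]=bottom-left, quad[3]=bottom-right
float mipLevelFromQuad(const Vec2 quad[4], float texWidth, float texHeight)
{
    // Horizontal and vertical finite differences across the stamp
    // (one derivative per quad here; real hardware does this per pixel).
    Vec2 ddx = { quad[1].x - quad[0].x, quad[1].y - quad[0].y };
    Vec2 ddy = { quad[2].x - quad[0].x, quad[2].y - quad[0].y };

    // Scale the derivatives into texel space and take the larger footprint.
    float lenX = std::hypot(ddx.x * texWidth, ddx.y * texHeight);
    float lenY = std::hypot(ddy.x * texWidth, ddy.y * texHeight);
    float rho  = std::max(lenX, lenY);

    return std::max(0.0f, std::log2(rho)); // LOD, clamped at the base level
}

int main()
{
    // UVs for a stamp whose footprint covers about 4 texels of a 256x256 texture.
    const float step = 4.0f / 256.0f;
    Vec2 quad[4] = { {0.10f, 0.10f},        {0.10f + step, 0.10f},
                     {0.10f, 0.10f + step}, {0.10f + step, 0.10f + step} };
    std::printf("mip level = %f\n", mipLevelFromQuad(quad, 256.0f, 256.0f));
    return 0;
}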
 
Good question. Another facet which supports SGX not being in any way quad based is literature that says every USSE/USSE2 (basically, an ALU pipeline) has completely independent flow control and operates on completely independent threads.
 
2x2 quad processing is unrelated to the deferred rendering aspect of the architecture.

As with _all_ architectures, any instructions relying on gradients or instructions that feed into such instructions must be executed across all pixels within the 2x2 stamp even if those pixels don't lie within the triangle. This process is unrelated to why a pixel may or may not be present/within the triangle when it reaches the fragment shader.

Note that any early Z architecture will also have cases where pixels within a triangle are removed by the hidden surface removal phase so this is not specific to TBDR.
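
As a generic illustration of how any quad-based design can handle this (a sketch only, not a statement about SGX internals): every lane of the stamp executes the gradient-feeding code, and only the coverage mask decides which lanes are allowed to write their results.

Code:
// Rough sketch only (generic, not SGX internals): helper lanes in a 2x2 stamp.
// All four lanes execute the code that feeds a gradient, but only lanes
// covered by the triangle are allowed to write their colour out.
#include <cstdio>

int main()
{
    float u[4]       = {0.10f, 0.15f, 0.10f, 0.15f}; // per-lane shader value
    bool  covered[4] = {true, false, false, false};  // only one real fragment
    float output[4]  = {};

    // The gradient needs neighbouring lanes, so it is computed using every
    // lane, covered or not (lanes 1 and 2 act as helpers for lane 0 here).
    float dudx = u[1] - u[0];
    float dudy = u[2] - u[0];

    for (int lane = 0; lane < 4; ++lane)
    {
        float colour = dudx * 100.0f + dudy * 100.0f + u[lane]; // uses the gradient
        if (covered[lane])              // write-out is masked by coverage
            output[lane] = colour;
    }

    for (int lane = 0; lane < 4; ++lane)
        std::printf("lane %d: covered=%d colour=%.3f\n", lane, covered[lane], output[lane]);
    return 0;
}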

John.
 
Good question. Another facet which supports SGX not being in any way quad based is literature that says every USSE/USSE2 (basically, an ALU pipeline) has completely independent flow control and operates on completely independent threads.

Well, IMO the documents say the USSE1/2 operates at a 4-thread granularity (pretty much like AMD's wavefronts, only much smaller). So it's very possible that the SGX runs 4 threads (maybe fragments) in an "unrolled loop". That's pretty different from many desktop GPUs. For instance, some GPUs claiming 24 shader units could actually be 6 quad pixel pipes.
 
If required, we process pixels in quads.

That's a very interesting statement. You are actually suggesting there's an ability to switch between a quad mode and a pixel mode.

IMO, OpenGL ES 1.x does not have dependent texture fetches or shaders. So it sounds perfectly fine to me if they (or maybe you guys) run the pixel pipes outside of quad mode there. It obviously saves resources.

That's cool.
 
2x2 quad processing is unrelated to the deferred rendering aspect of the architecture.

As with _all_ architectures, any instructions relying on gradients or instructions that feed into such instructions must be executed across all pixels within the 2x2 stamp even if those pixels don't lie within the triangle. This process is unrelated to why a pixel may or may not be present/within the triangle when it reaches the fragment shader.

Note that any early Z architecture will also have cases where pixels within a triangle are removed by the hidden surface removal phase so this is not specific to TBDR.

John.

I don't worry about early-Z GPUs. I know they are doing quads; they are just free to do whatever they want and simply mask out unused fragments.

It's the zero-overdraw statements that actually confused me, and many other things led me to think that PowerVR chips might not be quad-pixel GPUs.

So you have "officially" confirmed that PVRs are also doing quads, if not exclusively. Thanks!

It's still fun to guess how it collects quads for a given state in a tile, though. But because the PowerVRs use a thread granularity of the absolute minimal size (4), there might not be a "quad collector (for a given state)" as in some other GPUs, because it can switch back and forth easily enough. That also explains why the granularity has to be 4: otherwise it would lose too much efficiency in the single-pixel case.

So it's not so different compared to an early-Z IMR. But given the finer granularity, it's still a big win in terms of efficiency in some cases.

Maybe I'll check some patents;)
 
As with _all_ architectures, any instructions relying on gradients or instructions that feed into such instructions must be executed across all pixels within the 2x2 stamp even if those pixels don't lie within the triangle
I think that's an important distinction - only the parts of the pixel shader that are related to quads need to run with a branch granularity of 4; the other parts of the same shader could run at a granularity of 1. This would remove both the 'very small triangle' and 'divergent branching' penalties for these parts of the program.

I don't think Imagination has ever said whether they can do that but *in theory* it's certainly possible.
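
One way to picture it (just a sketch of the idea, not something IMG has described; the IR and names below are invented): a backward pass over the shader's dependence graph marks everything that feeds a gradient or implicit-LOD fetch as quad-granular, and the rest could in principle be scheduled per pixel.

Code:
// Sketch of the idea only (hypothetical IR, not anything IMG has described):
// mark the instructions that must run at quad granularity because a gradient
// or implicit-LOD dependent fetch consumes their results, directly or not.
// Assumes the program is in topological order (producers before consumers),
// so a single backward pass suffices.
#include <cstdio>
#include <vector>

struct Inst {
    const char* name;
    std::vector<int> inputs;   // indices of producer instructions
    bool usesGradient;         // ddx/ddy or implicit-LOD dependent fetch
    bool quadGranular;
};

// Propagate the quad requirement backwards through the dependence graph.
void markQuadGranular(std::vector<Inst>& prog)
{
    for (int i = (int)prog.size() - 1; i >= 0; --i) {
        if (prog[i].usesGradient) prog[i].quadGranular = true;
        if (prog[i].quadGranular)
            for (int in : prog[i].inputs) prog[in].quadGranular = true;
    }
}

int main()
{
    std::vector<Inst> prog = {
        {"interp uv",             {},     false, false},  // 0
        {"mul uv2 = uv*2",        {0},    false, false},  // 1
        {"tex t = fetch(uv2)",    {1},    true,  false},  // 2: implicit LOD -> quad
        {"interp normal",         {},     false, false},  // 3
        {"dot ndl",               {3},    false, false},  // 4: per-pixel is fine
        {"mad colour = t*ndl+c",  {2, 4}, false, false},  // 5: only consumes t
    };
    markQuadGranular(prog);
    for (const Inst& i : prog)
        std::printf("%-22s %s\n", i.name, i.quadGranular ? "quad(4)" : "pixel(1)");
    return 0;
}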
 
I think that's an important distinction - only the parts of the pixel shader that are related to quads need to run with a branch granularity of 4; the other parts of the same shader could run at a granularity of 1. This would remove both the 'very small triangle' and 'divergent branching' penalties for these parts of the program.

I don't think Imagination has ever said whether they can do that but *in theory* it's certainly possible.

Didn't Imagination's document say their USSE units execute threads in a 4x4 fashion? 4 threads in a group is the basic unit of context switching, and there are at most 4 thread groups running on a USSE processor. So first of all, I don't think they can run shaders with a granularity of 1.

Second, I think it might be too good to be true that they process the shaders in the way you said. Analyzing the program and separating which instructions should run at quad versus pixel granularity is definitely possible, but I don't know how this execution could be made possible (efficiently) in hardware. Because what he said is not specific to PowerVR, what ends up happening in hardware might still be the usual predication or masking.
 
I've always wondered what the exact implications of the 4 thread grouping were. Nonetheless, on page 22 of the SGX OpenGL ES 2.0 Application Developer Recommendations manual the following is indicated:

"The branching granularity on SGX is one fragment or one vertex. This means you don’t have to worry a lot about making branching coherent for an area of fragments."
 
Didn't Imagination's document say their USSE units execute threads in a 4x4 fashion? 4 threads in a group is the basic unit of context switching, and there are at most 4 thread groups running on a USSE processor. So first of all, I don't think they can run shaders with a granularity of 1.
These threads are obviously grouped 4 by 4 so that you can run 2x2 quad processing when you need to. But the branch granularity is definitely 1 (as Exophase just said).

Since dependent texture operations aren't that uncommon, you'd expect Imagination to warn that branch granularity increases to 4 (or a 2x2 quad) when doing that. But they don't, yet it must - which implies it's quite possible that branch granularity only increases for that part of the program and each active thread can run a different instruction at a given time. Or maybe the document is simply incomplete; I suppose JohnH or Rys could clarify ;)

---

As for having only 16 threads in groups of 4: it seems to me there is some rough similarity to the two-level scheduler hierarchy proposed by NVIDIA (and presumably targeted at either Kepler or Maxwell) in this paper: http://cva.stanford.edu/publications/2011/gebhart-isca-2011.pdf (there's no indication that Imagination implements register file caching but they certainly could in theory and it'd save a tiny bit of power). I don't know whether they can switch an individual vertex (or thread for OpenCL) rather than the full group of four - I'd expect not but who knows.

It also means there's relatively little threading to hide memory latency and so (like NVIDIA) they are very dependent on Instruction Level Parallelism to hide it. But most likely it doesn't need ILP to hide ALU latency and that means the ALU pipeline can't be more than 4 cycles (presumably the total pipeline is longer which makes things slightly more complex for branches).
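
Back-of-the-envelope version of that argument (my assumed numbers, not IMG's): with 4 active threads issued round-robin, a given thread only issues an instruction every 4 cycles, so back-to-back dependent instructions never stall as long as the ALU result latency is 4 cycles or less.

Code:
// Back-of-the-envelope only (assumed numbers, not IMG's): with T active
// threads issued round-robin, a thread issues one instruction every T cycles,
// so a dependent instruction stalls only if the ALU result latency P exceeds T.
#include <algorithm>
#include <cstdio>

int main()
{
    const int T = 4;                       // active threads per USSE (per the doc)
    for (int P = 1; P <= 8; ++P) {         // hypothetical ALU result latencies
        int stall = std::max(0, P - T);    // bubbles per dependent instruction
        std::printf("latency %d cycles -> %d stall cycle(s) without ILP\n", P, stall);
    }
    return 0;
}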
 
http://www.imgtec.com/powervr/insid...cture guide for developers.1.0.8.External.pdf

4.1.2. Thread scheduling
Each USSE execution unit in a given SGX graphics core has its own thread scheduler. Each
scheduler manages 16 threads, 4 of which can be active at a given time (Figure 10).


Could be 2 different concepts. It might be my misunderstanding.

Because the PowerVRs execute a thread group in a multi-cycle style, it's technically possible that every thread in a group goes down a different execution path (not the case with AMD and NVIDIA). In my understanding, the USSE operates on groups of 4 threads, but I have no idea what restrictions or execution style they are using. I just assumed that they are doing SIMD, but I could very possibly be wrong.

So now there are 2 different questions:
1: Can the threads in a group go down different paths? (very possibly yes)
2: Can threads in a group be different shader programs/states?
 
Because the PowerVRs execute a thread group in a multi-cycle style, it's technically possible that every thread in a group goes down a different execution path (not the case with AMD and NVIDIA). In my understanding, the USSE operates on groups of 4 threads, but I have no idea what restrictions or execution style they are using. I just assumed that they are doing SIMD, but I could very possibly be wrong.
As you say yourself execution is multi-cycle for a group - or rather it's really single cycle for a single pixel. And SGX520 only has a single 'pipeline' so that's the real level of granularity.

I haven't heard of any pipelined processor architecture that takes more than one cycle (throughput not latency) for instruction fetch or decode. I think you can't gain any area that way (even if you can save a little bit of power by clock gating). Therefore each thread must necessarily be able to go a different path (the only problem I can see is that the instruction cache would get thrashed faster but I doubt that's a big problem).

2: Can threads in a group be different shader programs/states?
Indeed and I have no idea. At least presumably different groups must be able to have different programs since otherwise vertex shaders would force the texture unit to idle. Beyond that... who knows.
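
For what it's worth, here's a toy model of that kind of fine-grained temporal multithreading (nothing SGX-specific, a completely invented instruction set): each thread keeps a private program counter and the core fetches and executes one instruction per cycle round-robin, so diverged threads simply interleave without any masking.

Code:
// Toy model only (nothing SGX-specific): fine-grained temporal multithreading
// with a private program counter per thread. One instruction is fetched and
// executed per cycle, round-robin, so diverged threads simply interleave.
#include <cstdio>

enum Op { ADD, JUMP_IF_ODD, HALT };
struct Inst { Op op; int arg; };

int main()
{
    // Tiny program: odd-valued threads skip ahead, even ones do extra adds.
    const Inst prog[] = {
        {JUMP_IF_ODD, 3},  // 0: odd threads branch to instruction 3
        {ADD, 10},         // 1: even path
        {ADD, 10},         // 2: even path
        {ADD, 1},          // 3: common tail
        {HALT, 0},         // 4
    };

    const int NUM_THREADS = 4;
    int  pc[NUM_THREADS]   = {0, 0, 0, 0};   // private PC per thread
    int  acc[NUM_THREADS]  = {0, 1, 2, 3};   // per-thread data (thread id)
    bool done[NUM_THREADS] = {};

    for (int cycle = 0, live = NUM_THREADS; live > 0; ++cycle) {
        int t = cycle % NUM_THREADS;         // round-robin issue, one inst/cycle
        if (done[t]) continue;
        const Inst& i = prog[pc[t]];
        switch (i.op) {
            case ADD:         acc[t] += i.arg; ++pc[t]; break;
            case JUMP_IF_ODD: pc[t] = (acc[t] & 1) ? i.arg : pc[t] + 1; break;
            case HALT:        done[t] = true; --live; break;
        }
    }
    for (int t = 0; t < NUM_THREADS; ++t)
        std::printf("thread %d: acc=%d\n", t, acc[t]);
    return 0;
}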
 
As you say yourself execution is multi-cycle for a group - or rather it's really single cycle for a single pixel. And SGX520 only has a single 'pipeline' so that's the real level of granularity.

I haven't heard of any pipelined processor architecture that takes more than one cycle (throughput not latency) for instruction fetch or decode. I think you can't gain any area that way (even if you can save a little bit of power by clock gating). Therefore each thread must necessarily be able to go a different path (the only problem I can see is that the instruction cache would get thrashed faster but I doubt that's a big problem).

Maybe AMD/NVIDIA still save area because the cost of decode and fetch is amortized over the many ALUs attached, and that's the point of SIMD.

Indeed and I have no idea. At least presumably different groups must be able to have different programs since otherwise vertex shaders would force the texture unit to idle. Beyond that... who knows.

I'm interested in this because it seems to me that if you can let the threads in a group go down different paths, it's not much harder to let them run different programs. They are using different context pointers and instruction pointers already.

And that relates to the topic: the efficiency in the case of a single fragment passing the depth test. If you can mix programs in a group, you don't have to idle most of the remaining slots in the group.
 
Maybe AMD/NVIDIA still save area because the cost of decode and fetch is amortized over the many ALUs attached, and that's the point of SIMD.
Right, exactly - but SGX clearly only handles one pixel/vertex/whatever at a time. So there would be no way for them to save any area without fundamentally changing their architecture.

And that relates to the topic: the efficiency in the case of a single fragment passing the depth test. If you can mix programs in a group, you don't have to idle most of the remaining slots in the group.
Yes, although even if it didn't support that, the vast majority of sub-quad triangles have adjacent triangles from the same batch (same shader/state). So yes, if all of this is correct, then ironically SGX should theoretically be more efficient at shading very small triangles than modern IMRs despite being less bandwidth efficient for them.

But then what happens with DX11 tessellation? If they do it pre-binning, it'll be very bandwidth inefficient. If they do it post-binning, it will save a lot of bandwidth and they could genuinely claim to be more efficient for small triangles than IMRs. That would certainly be an interesting twist but I have no idea what they're going to do there.
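
Some rough arithmetic with entirely made-up numbers, just to show the scale of the difference: pre-binning you stream the expanded vertices out to the parameter buffer, post-binning you only store control points plus tessellation factors and expand inside the tile.

Code:
// Entirely made-up numbers, purely to illustrate the pre- vs post-binning
// trade-off for tessellation. Pre-binning writes the expanded geometry to the
// parameter buffer; post-binning writes only control points + factors. This
// ignores per-tile re-expansion costs and binning pointer overheads.
#include <cstdio>

int main()
{
    const double patches       = 10000.0;
    const double tessFactor    = 16.0;                     // assumed edge factor
    const double vertsPerPatch = tessFactor * tessFactor;  // rough vertex count
    const double bytesPerVert  = 32.0;                     // position + a few attributes
    const double ctrlPoints    = 16.0;                     // e.g. a bicubic patch
    const double bytesPerCP    = 48.0;                     // fatter control-point data
    const double bytesFactors  = 6 * 4.0;                  // 6 tess factors, 4 bytes each

    double preBinning  = patches * vertsPerPatch * bytesPerVert;
    double postBinning = patches * (ctrlPoints * bytesPerCP + bytesFactors);

    std::printf("pre-binning : %.1f MB written\n", preBinning  / 1e6);
    std::printf("post-binning: %.1f MB written\n", postBinning / 1e6);
    return 0;
}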
 
Right, exactly - but SGX clearly only handles one pixel/vertex/whatever at a time. So there would be no way for them to save any area without fundamentally changing their architecture.

Yes, although even if it didn't support that, the vast majority of sub-quad triangles have adjacent triangles from the same batch (same shader/state). So yes, if all of this is correct, then ironically SGX should theoretically be more efficient at shading very small triangles than modern IMRs despite being less bandwidth efficient for them.

Yeah.

For small triangles, there should be some kind of fragment merging or single-pixel threads, otherwise it loses a lot of efficiency. I think for desktop, the better way would be quad fragment merging. Or, to use a fancier word, "coalescing".
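
Something like this sketch of the general quad-merging idea (not any particular GPU, names invented): fragments from different triangles with the same state that land in the same 2x2 screen quad get packed into one quad slot instead of each burning three idle lanes.

Code:
// Sketch of the general quad-merging / coalescing idea (no particular GPU):
// fragments from different triangles with the same state that land in the
// same 2x2 screen-space quad share one quad slot instead of three idle lanes.
#include <cstdio>
#include <map>
#include <tuple>

struct Fragment { int x, y, stateId; };

int main()
{
    // Four tiny triangles, one covered pixel each, all using the same state,
    // all falling inside the same 2x2 quad at (10..11, 20..21).
    const Fragment frags[] = {
        {10, 20, 7}, {11, 20, 7}, {10, 21, 7}, {11, 21, 7},
        {30, 40, 7},                        // a lone fragment elsewhere
    };

    // Key a quad by (x/2, y/2, state); the value is the 4-bit lane coverage mask.
    std::map<std::tuple<int, int, int>, unsigned> quads;
    for (const Fragment& f : frags) {
        unsigned lane = (f.x & 1) | ((f.y & 1) << 1);
        quads[std::make_tuple(f.x / 2, f.y / 2, f.stateId)] |= 1u << lane;
    }

    int withoutMerging = (int)(sizeof(frags) / sizeof(frags[0])); // one quad per fragment
    std::printf("quads shaded without merging: %d\n", withoutMerging);
    std::printf("quads shaded with merging   : %zu\n", quads.size());
    return 0;
}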

But then what happens with DX11 tessellation? If they do it pre-binning, it'll be very bandwidth inefficient. If they do it post-binning, it will save a lot of bandwidth and they could genuinely claim to be more efficient for small triangles than IMRs. That would certainly be an interesting twist but I have no idea what they're going to do there.

Yeah, to me the only logical way is post-binning. But it's a really complicated problem if the API has no support for TBDRs.
 
The document is pretty accurate and we worked hard to get as much in there as we could to describe the hardware well. There are 16 threads, each with 1 or 4 things to work on. If it's a pixel quad, depending on branching in the shader, they can run in lock-step with a single PC (because they have to otherwise rendering would break). That's about as 'restrictive', if you can even call it that, as it gets. That's all wrapped up in the scheduling model.
 
The document is pretty accurate and we worked hard to get as much in there as we could to describe the hardware well. There are 16 threads, each with 1 or 4 things to work on.
Wait so it can handle up to 64 pixels in theory? Wouldn't that increase register pressure too much?
 
As for DX11, it'd be really helpful if the API had some way of basically giving us a bounding box for expansion (which would help a nice handful of things in the hardware and software), but we're not really hampered by the current state of play. The hardware has some specific changes to the execution model just for this stuff, and geometry is processed almost completely differently to SGX when tessellating.
 