ARM Midgard Architecture

One of the classical selling points of Mali GPUs is "No renderer state change penalty"; all Mali cores from Mali55 to the T604 have the ability to switch render state intra-tile with zero cycles of overhead - including TMU state.
Frankly, I'm not sure shader state switching would be much of an issue with any traditional tiler - maintaining a pool of ready (pixel) threads the size of the (traditionally small) tile would be a natural solution to minimize thread-switching costs. IIRC, SGX53x did this with up to 16 threads, though I don't remember if there were any limitations to the number of different shader programs that could participate in the scheme.

What I don't see being as trivial, is the TMU state, and particularly texture caches. The latter, alone, would pose a massive bandwidth multiplicity problem, if those caches were really offering no penalty for context switching. By 'no penalty' I'm referring to eviction of hot data and the associated extra cache misses. So I'm really curious to see how Mali achieves the advertised behavior.

Don't you usually have a different drawcall for every model, in order to change the modelview matrix? If that's the case and what you're saying is true, TBDR would rarely help you, since a model will rarely occlude itself. Even if the texture binding stays the same, it could still have totally different texture coordinates from one pixel to the other, so I dunno.. being able to handle texture state changes from pixel to pixel shouldn't be a huge problem. Cache wouldn't be explicitly flushed, but you'd get potentially worse locality of reference - but that's going to happen no matter how you draw the tile.
Just to make clear, what I'm referring to is basically this: imagine a TBDR tile having a (post-occlusion) content (no translucencies, for simplicity) of:

Code:
AAABBCCC
AABBBCCC
CBBBBCCC
CCCBBCCC

where A-, B- and C-marked zones are pixels belonging to drawcalls A, B and C respectively. The order I think would be natural for filling in the tile would be for each of these zones to get processed without intermixing pixels from other zones, i.e. other drawcalls, or IOW, in a drawcall-by-drawcall order. Aside from minimising shader context switching, such a scheme would also optimise the utilisation of texture caches.
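
A minimal sketch of that fill order, purely to illustrate what I mean (the tile contents above and the bind_state step are hypothetical, not any real driver's behaviour):

Code:
# Hypothetical sketch of a drawcall-by-drawcall tile fill: group the
# surviving pixels by the drawcall that owns them, then shade each
# group in one batch so shader/TMU state is bound at most once per
# drawcall and texture cache locality is kept.

from collections import defaultdict

tile = ["AAABBCCC",
        "AABBBCCC",
        "CBBBBCCC",
        "CCCBBCCC"]          # post-occlusion owner of every pixel

def pixels_by_drawcall(tile):
    groups = defaultdict(list)
    for y, row in enumerate(tile):
        for x, owner in enumerate(row):
            groups[owner].append((x, y))
    return groups

for drawcall, pixels in sorted(pixels_by_drawcall(tile).items()):
    # bind_state(drawcall)   # imaginary: shader + textures for this drawcall
    print(f"drawcall {drawcall}: shade {len(pixels)} pixels in one go")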

I do think there are some render catch-up events in the tile rendering, at the very least when going between opaque, alpha, and alpha test (hence why IMG suggests binning by this), and quite likely when changing shaders. But for things like texture binding or uniform changes, I'm not so sure. So I expect the granularity to be a little coarser than per drawcall.
I'm not sure the fact IMG suggest binning by opaqueness/alpha-op is related so much to shader context switching as it is to avoiding stalling the entire deferred-shading mechanism through the frag_kill type of ops, which is what alpha testing is.

Somewhere I recall IMG material claiming that the USSEs can switch to completely independent thread contexts with unique program position etc., so it's not completely out of the question that it can switch shaders on a per-pixel level within the tile. It'd just have to tag the pixels with a shader number (and deal with the case where that gets exhausted).
It's quite likely the case, but to reiterate: it's not the shader context switch that I see the major issue with - it's the extra stress on the texture caches that would be brought by any fill-in scheme other than drawcall-by-drawcall.
 
One of the classical selling points of Mali GPUs is "No renderer state change penalty"; all Mali cores from Mali55 to the T604 have the ability to switch render state intra-tile with zero cycles of overhead - including TMU state.

No idea how SGX handles this, though.

Interesting, analysis seems to show that Mali only supports a single shader program within its fragment pipeline at any time; this would imply an extremely high state change cost imo.

SGX (and MBX for that matter) fully pipelines all state changes and supports multiple shaders executing simultaneously, so it's safe to say it has this covered.

John.
 
Just to make clear, what I'm referring to is basically this: imagine a TBDR tile having a (post-occlusion) content (no translucencies, for simplicity) of:

Code:
AAABBCCC
AABBBCCC
CBBBBCCC
CCCBBCCC

where A-, B- and C-marked zones are pixels belonging to drawcalls A, B and C respectively. The order I think would be natural for filling in the tile would be for each of these zones to get processed without intermixing pixels from other zones, i.e. other drawcalls, or IOW, in a drawcall-by-drawcall order. Aside from minimising shader context switching, such a scheme would also optimise the utilisation of texture caches.
Yep, that's basically correct.
I'm not sure the fact IMG suggest binning by opaqueness/alpha-op is related so much to shader context switching as it is to avoiding stalling the entire deferred-shading mechanism through the frag_kill type of ops, which is what alpha testing is.
Series2 binned by opaqueness/alpha; however, this was dropped from Series 3 onwards and we bin and process the bins preserving presentation/draw order.

John.
 
Thanks for the clarification darkblu, I see what you mean now. It seems rather complex, since an ordering table the size of the tile would have to be maintained. Surely the texture cache would be larger than the tile, so you shouldn't get capacity collisions no matter how you access it.. but you'd get associativity collisions if there are too many textures, I suppose (you'd still get them of course, this would just help minimize it). Maybe texture prefetching is something that works better this way? I can see how state change costs for texture binding would be minimized at least.
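
Rough sketch of the sort of associativity collision I have in mind (the cache geometry and addresses here are invented, not real SGX/Mali figures):

Code:
# Toy set-associative texture cache, only to show how several textures
# can fight over the same sets even when total capacity exceeds the
# tile's working set. All geometry/addresses here are invented.

LINE_BYTES = 64
NUM_SETS   = 64              # e.g. 16 KB = 64 sets * 4 ways * 64 B
WAYS       = 4

def set_index(addr):
    return (addr // LINE_BYTES) % NUM_SETS

# Textures whose base addresses share alignment: their first lines all
# map to set 0, so once more than WAYS of them are hot they start
# evicting each other regardless of overall cache size.
for base in (0x100000, 0x200000, 0x300000, 0x400000, 0x500000):
    print(f"texture @ {base:#x} -> set {set_index(base)}")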

This is interesting though, since it at least says a little bit about the stuff Andrew Richards was saying about texture cache locality problems with TBDRs.

So I think the answer here is that some state changes may have costs, either implicit or explicit, in the tile rendering execution, but few of them will stall the TBDR entirely - i.e., cause the entire tile to be rendered up to the state change point.

Series2 binned by opaqueness/alpha; however, this was dropped from Series 3 onwards and we bin and process the bins preserving presentation/draw order.

What we're referring to is the IMG suggestion that the user bin by opaque, alpha test, and alpha (as far as I can remember, in that order, although I'm not sure why the order matters). My presumption (and I think darkblu's) is that this is so that all of the opaque primitives can be rendered fully deferred, while where alpha is detected the renderer switches into a non-deferred rendering state.

And alpha-test seems to be different/more expensive, at least on SGX530 and SGX535.. but I heard that later versions change this.. I don't really understand the details though.
 
This is interesting though, since it at least says a little bit about the stuff Andrew Richards was saying about texture cache locality problems with TBDRs.
Who is Andrew Richards, and why would he think there'd be problems with cache locality in a TBDR?
 
Thanks for the clarification darkblu, I see what you mean now. It seems rather complex, since an ordering table the size of the tile would have to be maintained. Surely the texture cache would be larger than the tile, so you shouldn't get capacity collisions no matter how you access it.. but you'd get associativity collisions if there are too many textures, I suppose (you'd still get them of course, this would just help minimize it). Maybe texture prefetching is something that works better this way? I can see how state change costs for texture binding would be minimized at least.

Tile size vs cache size/structure is one of the balances that a tile-based design must get right; as long as you get that right, locality becomes a non-issue in the overall scheme of things.

What we're referring to is the IMG suggestion that the user bin by opaque, alpha test, and alpha (as far as I can remember, in that order, although I'm not sure why the order matters). My presumption (and I think darkblu's) is that this is so that all of the opaque primitives can be rendered fully deferred, while where alpha is detected the renderer switches into a non-deferred rendering state.
I don't think the suggestion of grouping these types is specific to IMG; as an optimisation it is equally applicable to all modern architectures. In Series 5 we can in fact maintain deferral even in the presence of alpha-blended primitives, so although these optimisations are still worthwhile, they're less important than they used to be, for us anyway.
And alpha-test seems to be different/more expensive, at least on SGX530 and SGX535.. but I heard that later versions change this.. I don't really understand the details though.
As I said in a previous thread, the cost of alpha test on 530/535 has nothing to do with the deferred nature of the architecture. In general, alpha test reduces the effectiveness of all current architectures that perform an early Z check, due to the feedback path from the back to the front of the pipeline.
 
I don't see why alpha test would necessarily reduce effectiveness of early-Z, so long as the depth check happens at the start of the pipeline and the depth update happens at the end. You'd have some latency before you could allow overdraw but this doesn't seem like it'd usually be a problem; you'd normally draw a whole mesh which won't usually have self-overdraw. Z pre-pass would botch things, but that goes for depth updating translucent primitives too.

Hierarchical-Z would have issues, though. And it could get in the way of allowing higher depth fill than fragment fill.
 
I don't see why alpha test would necessarily reduce effectiveness of early-Z, so long as the depth check happens at the start of the pipeline and the depth update happens at the end. You'd have some latency before you could allow overdraw but this doesn't seem like it'd usually be a problem; you'd normally draw a whole mesh which won't usually have self-overdraw. Z pre-pass would botch things, but that goes for depth updating translucent primitives too.

Hierarchical-Z would have issues, though. And it could get in the way of allowing higher depth fill than fragment fill.

The issue is that you have a delay on the feedback of up-to-date depth values to the front end. This means that once alpha test has been used you need to either insert a bubble into the pipeline in order to get the front-end depth back up to date, or run with out-of-date values. The latter means that you won't reject some things up front that you otherwise might have, and if the depth compare mode changes you may have to insert the bubble anyway. This applies irrespective of the use of per-pixel or hierarchical tests.

A simple example of this is: draw mesh 1 with alpha test, draw mesh 2 without; mesh 2 is behind mesh 1 but is still sent down to be shaded because feedback from mesh 1 hasn't come back yet, i.e. the benefit of the early Z check is defeated.

Now, you can argue that there are lots of cases where the feedback will have come back, but in reality there will also be lots of cases where it won't have done.

This reduction in efficiency applies to all current architectures afaik.
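
A toy model of that effect, purely illustrative (the pipeline lag, pixel addressing and two-mesh scene are all made up):

Code:
# Toy model of the lagged depth feedback: an alpha-tested draw can only
# write depth once its shader has run, so for LAG cycles the front-end
# test works on stale values and lets through fragments it could have
# rejected. All numbers and the two-mesh scene are made up.

LAG = 8                       # cycles from early test to depth write-back
front_z = {}                  # per-pixel depth as seen by the front end
pending = []                  # (ready_cycle, pixel, z) writes still in flight

def submit(cycle, pixel, z, alpha_test=False):
    """Early-Z stage; returns True if the fragment gets shaded."""
    global pending
    # apply any depth write-backs that have arrived by now
    arrived = [w for w in pending if w[0] <= cycle]
    pending = [w for w in pending if w[0] > cycle]
    for _, p, d in arrived:
        front_z[p] = min(front_z.get(p, 1.0), d)

    if front_z.get(pixel, 1.0) <= z:
        return False                              # rejected up front
    if alpha_test:
        pending.append((cycle + LAG, pixel, z))   # depth known only after the kill decision
    else:
        front_z[pixel] = min(front_z.get(pixel, 1.0), z)  # safe to update at the test
    return True

shaded = 0
for c in range(4):                            # mesh 1: alpha-tested, in front (z = 0.2)
    shaded += submit(c, pixel=c, z=0.2, alpha_test=True)
for c in range(4):                            # mesh 2: opaque, behind (z = 0.8), issued
    shaded += submit(4 + c, pixel=c, z=0.8)   # before mesh 1's depth has fed back
print("fragments shaded:", shaded, "- ideal would be 4")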
 
The issue is that you have a delay on the feedback of up-to-date depth values to the front end. This means that once alpha test has been used you need to either insert a bubble into the pipeline in order to get the front-end depth back up to date, or run with out-of-date values. The latter means that you won't reject some things up front that you otherwise might have, and if the depth compare mode changes you may have to insert the bubble anyway. This applies irrespective of the use of per-pixel or hierarchical tests.

A simple example of this is: draw mesh 1 with alpha test, draw mesh 2 without; mesh 2 is behind mesh 1 but is still sent down to be shaded because feedback from mesh 1 hasn't come back yet, i.e. the benefit of the early Z check is defeated.

Now, you can argue that there are lots of cases where the feedback will have come back, but in reality there will also be lots of cases where it won't have done.

This reduction in efficiency applies to all current architectures afaik.

I understand what you're saying, but you're talking about having to insert static stalls, whereas I was looking at it from the perspective of dynamic stalls. How much complexity would it be to add interlocks for the case where the overdraw will actually happen before the pipeline drains? Of course the interlock would be based on tile position. I know you say this feedback is needed a lot, but it'd be nicer to have some picture of how statistically often the interlocks would happen, because it seems to me it wouldn't be that frequent. I suppose it depends on pipeline length (including fragment shader length - having the discard nearer the start would potentially be better), tile size, and the ratio of ALUs/TMUs to ROPs. IMRs would probably run up against it less than tilers, since they can draw the entire mesh and not just the part in the tile.
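
Something like this is what I mean by a tile-position interlock (completely hypothetical, just sketching the bookkeeping, not how any real part does it):

Code:
# Hypothetical tile-position interlock: instead of a blanket bubble after
# an alpha-tested draw, stall only fragments that land on a pixel whose
# depth result is still in flight. Names and structure are made up.

in_flight = set()             # tile-local pixels awaiting depth feedback
front_z = {}                  # front-end depth per pixel

def early_z(pixel, z):
    if pixel in in_flight:
        return "stall"        # dynamic stall: wait for this pixel only
    if front_z.get(pixel, 1.0) <= z:
        return "reject"       # killed up front, nothing shaded
    return "shade"

def issue_alpha_tested(pixel):
    in_flight.add(pixel)      # depth unknown until kill/keep comes back

def depth_feedback(pixel, z):
    front_z[pixel] = min(front_z.get(pixel, 1.0), z)
    in_flight.discard(pixel)

# e.g. pixel 5 gets an alpha-tested fragment, then an opaque one behind it
issue_alpha_tested(5)
print(early_z(5, 0.8))        # -> "stall" until depth_feedback(5, ...) arrives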

Depth compare mode changing having a cost is a totally different story, I doubt that's something that's done a lot in a typical scene.
 
I understand what you're saying, but you're talking about having to insert static stalls, whereas I was looking at it from the perspective of dynamic stalls. How much complexity would it be to add interlocks for the case where the overdraw will actually happen before the pipeline drains? Of course the interlock would be based on tile position. I know you say this feedback is needed a lot, but it'd be nicer to have some picture of how statistically often the interlocks would happen, because it seems to me it wouldn't be that frequent. I suppose it depends on pipeline length (including fragment shader length - having the discard nearer the start would potentially be better), tile size, and the ratio of ALUs/TMUs to ROPs. IMRs would probably run up against it less than tilers, since they can draw the entire mesh and not just the part in the tile.

Depth compare mode changing having a cost is a totally different story, I doubt that's something that's done a lot in a typical scene.

I'm not sure what you mean by static vs dynamic stalls in this context; if you want to sync the front end with the back end you have a stall, and I would call that a dynamic stall condition. You could reduce the impact of this by "localising" feedback zones, however this significantly increases complexity elsewhere. This is probably why most implementations just let the early Z check be conservative and deal with anything that was erroneously let through at the back end.

I don't have any statistics that I could give out online for this, but bear in mind that as shaders get longer this will likely become the predominant part of the delay.

This is probably less of an issue for an IMR that processes things on a conventional polygon by polygon basis as in theory this would tend to avoid any contention as within mesh you'd expect to generally move away from the current locality. I say "conventional" as I suspect newer IMRs may actually do some localised tiling of the incoming data stream in order to get best use of their many parallel pipelines and to increase the effectiveness of Z and FB locality caches.
 
I mentioned static/dynamic because the impression I got was that you were saying that bubbles would always have to be inserted to serialize the front and back ends if frag-kill were present. When I say interlock I mean stalling only when a given pixel's depth data is detected as in-flight and not ready. It appears you're describing this.

I still find it a little hard to believe that you'd hit overdraw a lot for all tilers.. wouldn't it at least vary a lot depending on tile size? Platforms like Adreno (and especially Xenos) will allow for much bigger tile sizes than PowerVR or Mali platforms.

At any rate, on SGX 53x performance is halved by frag-kill being present in the shader, regardless of whether or not it's ever used. I don't know if this is hit less hard on other platforms (Xmas once told me that other SGXs have some kind of earlier frag-kill support, IIRC) but I definitely don't expect the hit to be this bad on a tiler that interlocks only as necessary.
 
I mentioned static/dynamic because the impression I got was that you were saying that bubbles would always have to be inserted to serialize the front and back ends if frag-kill were present. When I say interlock I mean stalling only when a given pixel's depth data is detected as in-flight and not ready. It appears you're describing this.
Nope, only on a transition away from alpha test and then the cost would be dependent on how long the shaders are.
I still find it a little hard to believe that you'd hit overdraw a lot for all tilers.. wouldn't it at least vary a lot depending on tile size? Platforms like Adreno (and especially Xenos) will allow for much bigger tile sizes than PowerVR or Mali platforms.
I wouldn't call it a "lot", but what you need to consider is that in a tiler you are actually preventing yourself from moving away from the current locality with the natural flow of the mesh, so you end up more likely to get overdraw.
At any rate, on SGX 53x performance is halved by frag-kill being present in the shader, regardless of whether or not it's ever used. I don't know if this is hit less hard on other platforms (Xmas once told me that other SGXs have some kind of earlier frag-kill support, IIRC) but I definitely don't expect the hit to be this bad on a tiler that interlocks only as necessary.
As I keep saying, the frag kill cost on 53x has nothing to do with frag kill itself; the drop-off that you're seeing relates to another, non-core architectural issue.
 
Nope, only on a transition away from alpha test and then the cost would be dependent on how long the shaders are.

Sorry, but you're totally losing me here. What I'm talking about is stalling when alpha test is done, and only if the depth value is in flight (like you said, the length of the shader - more specifically, it can be minimized to the length from the start of the shader to the frag kill, because the depth update can be done immediately afterwards). Since you're talking about the cost being incurred with overdraw, this has to be the case; otherwise the cost would happen for every pixel.

I wouldn't call it a "lot", but what you need to consider is that in a tiler you are actually preventing yourself from moving away from the current locality with the natural flow of the mesh, so you end up more likely to get overdraw.

I get that, but the "big" tilers can still cover many thousands of pixels.

As I keep saying, the frag kill cost on 53x has nothing to do with frag kill itself; the drop-off that you're seeing relates to another, non-core architectural issue.

How does it not have anything to do with frag kill itself if it goes away if the frag kill isn't there? I get that it's an issue not fundamental to early-Z or TBDR for that matter, but I don't really understand what the issue is exactly. Xmas did explain it to me once as far as I can remember, but I either forgot or didn't really get it then.
 
Sorry, but you're totally losing me here. What I'm talking about is stalling when alpha test is done, and only if the depth value is in flight (like you said, the length of the shader - more specifically, it can be minimized to the length from the start of the shader to the frag kill, because the depth update can be done immediately afterwards). Since you're talking about the cost being incurred with overdraw, this has to be the case; otherwise the cost would happen for every pixel.
There isn't a stall on overdraw in alpha-tested stuff itself; it only suffers a reduction in efficiency due to the lag on the returned depth values (the early check becomes conservative), which in itself doesn't result in a stall condition. Stall conditions only potentially exist when switching from alpha test to non alpha test, specifically in a TBDR, although there are ways of getting around this as well.
I get that, but the "big" tilers can still cover many thousands of pixels.
Yes, but most "performant" tilers don't use a big tile size for other reasons.
How does it not have anything to do with frag kill itself if it goes away if the frag kill isn't there? I get that it's an issue not fundamental to early-Z or TBDR for that matter, but I don't really understand what the issue is exactly. Xmas did explain it to me once as far as I can remember, but I either forgot or didn't really get it then.
Incorrect phrasing on my part; perhaps I should say the performance reduction has nothing to do with the interaction between the early depth check/deferral and frag kill/alpha test. I'm happy to explain the issue, but not in a public forum at this time.
 
http://translate.google.de/translate?u=http%3A%2F%2Fwww.4gamer.net%2Fgames%2F137%2FG013737%2F20110725062%2F&sl=ja&tl=en&hl=&ie=UTF-8

[attached slide: 006.jpg]


Am I reading that slide wrong, or are the up to 2.0 GPixels/s and 68 GFLOPs for the entire T604 MP4? If so, is each core again a 1-TMU design? Additionally, if the 68 GFLOPs are for the MP4, that's 17 GFLOPs/core. That doesn't strike me as a lot, or else I'm missing something essential here.
 
If that's true, the T604 aims to compete in performance with AMD's Cedar GPU in the Fusion E-350.

That should do wonders in a 540p screen, for example.

(EDIT: wrong codename)
 
17 GFLOPS per core at 500MHz = 34 FLOPs/cycle...

It looked like a 2:1 ALU:TMU ratio (from what we saw about the "tri-pipe" architecture). A single ALU handling 17 FLOPs/cycle seems like a lot. I guess 2x vec4 FMA + 1 something else. Maybe it's double throughput for FP16 or something.
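
Just the arithmetic behind that guess (the ALU count and the vec4-FMA split are my speculation, not anything from the slide):

Code:
# Sanity check on the numbers above; the per-ALU breakdown is a guess.
gflops_mp4    = 68
cores         = 4
clock_ghz     = 0.5
alus_per_core = 2                                  # guessed from the 2:1 ALU:TMU "tri-pipe"

per_core = gflops_mp4 / cores / clock_ghz          # 34 FLOPs/cycle per core
per_alu  = per_core / alus_per_core                # 17 FLOPs/cycle per ALU
vec4_fma = 2 * 4 * 2                               # 2x vec4 FMA = 16 FLOPs
print(per_core, per_alu, per_alu - vec4_fma)       # 34.0 17.0 1.0 -> "+ 1 something else"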

If that's true, the T604 aims to compete in performance with AMD's Caicos GPU in the Fusion E-350.

That should do wonders in a 540p screen, for example.

And get beaten by SGX 543MP4, a part a generation behind.
 
In their results yesterday, ARM indicated that they had 2 Mali licensees for "NG".

eetimes has picked that up and states that "NG" is the one AFTER T604.

http://www.eetimes.com/electronics-news/4218195/ARM-financial-results-Q2

One might think that was just a misunderstanding, but there was an interview with Warren East in Feb in eetimes, in which he specifically referred to Mali NG as being the one after T604.

http://www.eetimes.com/electronics-news/4212774/ARM-updates-roadmap-with-Kingfisher--Cygnet

"East also highlighted Mali NG as the core to follow on from the Mali T604, which was launched at the ARM Technology Conference in November 2010"


Now, something looks weird here. T604 was announced in Nov 2010. 10 or so weeks later, ARM were talking about Mali "NG", and now they state that they've signed two NG licences sometime before the end of June. It would appear to me that, given the above timeline, NG can at most be little more than a T604+.
 