ARM Midgard Architecture

Atomics are very nice, but they don't mention local memory (aka shared memory in CUDA) at all. This is not a good sign... Here's a direct quote from the OpenCL 1.1 specification:
Local Memory: A memory region local to a work-group. This memory region can be used to allocate variables that are shared by all work-items in that work-group. It may be implemented as dedicated regions of memory on the OpenCL device. Alternatively, the local memory region may be mapped onto sections of the global memory.
So is there any on-chip local memory (i.e. SRAM, whether dedicated or somehow repurposing cache) - and if not, how can they claim any significant advantage from supporting Full Profile when one of the most frequent idioms would come at a massive performance and especially power cost? Hopefully I'm just reading too much into the lack of an explicit mention...
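Just to illustrate the idiom I'm worried about - a bog-standard work-group reduction that stages data through __local memory (a sketch; the kernel name and arguments are placeholders). If local memory is merely mapped onto global memory, every scratch access below goes out to DRAM:

Code:
// Work-group reduction staged through __local memory. If "local"
// is really just a window onto global memory, each of these
// accesses pays full DRAM latency and power - hence my concern.
__kernel void reduce(__global const float *in,
                     __global float *out,
                     __local float *scratch)
{
    uint lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction entirely within the work-group.
    for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}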

Exophase: I agree 64-bit pointers are a pure marketing gimmick on that specific iteration of the architecture, but as it scales up in performance, I could see them being very useful, so it makes sense to invest R&D into it now (although unfortunately it doesn't make as much sense to invest die area into it...)

I agree the overdraw post is ludicrous. Maybe it's simply phrased very badly and what they meant is that games with a rough front-to-back ordering never shade their pixels more than twice on average. I don't have any hard numbers personally, but something between 1.0x and 2.0x but slightly closer to 1.0x seems very realistic to me. Determining the performance penalty of a Z-only pass would be another good way of evaluating the real cost of not being a TBDR.

Of course many games have the iPhone (i.e. SGX) as their lead development platform, so casual devs might not even bother with front-to-back ordering or Z-only passes. Then again as long as the performance is high enough, that's just a power penalty in something where the display takes the vast majority of the power anyway - so performance in more demanding games which will usually receive the necessary optimisations arguably matters more, and there the difference is much smaller (even if it'd be insane to pretend it's negligible).

Lazy8s: Well the Mali400 was justified no matter what by the fact the 543MP wouldn't have been ready in that timeframe and it's much faster (at 4MP) than a 540. I suppose T604 also has the big advantage that in terms of APIs, it's more comparable to Series6 than SGX. I do not personally believe that's of much practical significance given its performance level, but I suppose Samsung would have taken it into consideration in their choice.

BTW, T604 (as per those blog posts) is only a 1 TMU design. That means a 4-core T604 is comparable per MHz to a single-core 544MP. I'd certainly like to know how their die sizes compare!
 
What I don't like about Samsung switching to Mali 400 is that FP16 fragment shading is going to push a new lowest common denominator across the board that'll discourage highp from being used where useful. It would have been better if it had it in some capacity, even if it were much slower.

I don't recall reading anything about Mali having a higher Z-fill rate than pixel rate like Tegra and afaik Adreno do. Nowhere have I seen ARM recommend a Z pre-pass, instead citing average 50% early-Z performance. Also, can someone fill me in on this: does OpenGL ES 2 have any real provisions for reusing geometry so the vertex shaders don't have to be run twice to do a Z pre-pass? Or is it expected that the drivers have an option to perform it automatically? Because Mali 400MP4 with its one vertex shader doesn't sound like it needs that workload to be doubled.
 
What I don't like about Samsung switching to Mali 400 is that FP16 fragment shading is going to push a new lowest common denominator across the board that'll discourage highp from being used where useful. It would have been better if it had it in some capacity, even if it were much slower.
*shrugs* I see your point, and for a long time I optimistically assumed it would be FP24. Oh well.

I don't recall reading anything about Mali having a higher Z-fill rate than pixel rate like Tegra and afaik Adreno do.
Hmmm. Well, they've got free 4xMSAA - I always assumed that they could use it for 4xZ too (a la NVIDIA), but I could be horribly wrong.

does OpenGL ES 2 have any real provisions for reusing geometry so the vertex shaders don't have to be run twice to do a Z pre-pass?
Pretty sure it doesn't.
Or is it expected that the drivers have an option to perform it automatically?
ES 2.0 is all about VBOs for rendering, so I suppose that in theory you certainly could detect that the position VBO is the same one. However keep in mind that the shader will also be different because the vertex shader can do more than applying a single matrix to the position vector; so you'd have to determine whether the calculations done with the shared VBOs are the exact same ones if you want to reuse the results. And that's before we consider memory management complexity and the necessity to know in advance that it's going to happen.

So I'd be surprised if any commercial driver actually did it.
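For reference, this is roughly what the manual version looks like from the app side in ES 2.0 (just a sketch - depth_only_program, full_program and draw_scene are hypothetical), and it makes the doubled vertex work obvious:

Code:
/* Pass 1: depth only. Colour writes off, trivial fragment shader. */
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_TRUE);
glDepthFunc(GL_LESS);
glUseProgram(depth_only_program);
draw_scene();                  /* vertex shaders run once here...  */

/* Pass 2: full shading, only on fragments that won the depth test. */
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);
glDepthFunc(GL_EQUAL);
glUseProgram(full_program);
draw_scene();                  /* ...and a second time here.       */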
Because Mali 400MP4 with its one vertex shader doesn't sound like it needs that workload to be doubled.
Hmm, yeah, that's certainly a disadvantage for 400MP4. I think you're right that they encourage developers to do front-to-back ordering rather than a Z-Only pass however, but I'm not sure.
 
I was actually not thinking of an implicit driver mode that detects duplicate geometry, but something you call out as an extension. I mean, something that would automatically do a Z pre-pass after scene gathering, or maybe with an actual state machine extension call to perform it.
 
A four core T-604 should be well out of the league of a single core 544MP, in performance and features as well as die size. They're not really directly comparable... perhaps that example was a typo?

543MP would be ready in time for Orion. Samsung had a choice, and I think they simply liked what ARM was offering.

I still think they'll take out a new PowerVR license in addition to that, though.
 
What I don't like about Samsung switching to Mali 400 is that FP16 fragment shading is going to push a new lowest common denominator across the board that'll discourage highp from being used where useful. It would have been better if it had it in some capacity, even if it were much slower.
I'm not sure we could blame this vendor or that implementation - that's a provision in the API design, ergo lowp-exclusive fragment shaders were bound to happen. *shrug*

I don't recall reading anything about Mali having a higher Z-fill rate than pixel rate like Tegra and afaik Adreno do. Nowhere have I seen ARM recommend a Z pre-pass, instead citing average 50% early-Z performance. Also, can someone fill me in on this: does OpenGL ES 2 have any real provisions for reusing geometry so the vertex shaders don't have to be run twice to do a Z pre-pass? Or is it expected that the drivers have an option to perform it automatically? Because Mali 400MP4 with its one vertex shader doesn't sound like it needs that workload to be doubled.
Hmm, I don't remember having seen SoC vendors recommending z pre-passes. And frankly, I don't think it would bring a large (fillrate) gain, statistically, vis-a-vis a front-to-back pre-sort. That in the context of early-z, of course.
 
Guess we'll have to wait and see how it does in real world comparisons. There are a lot of hidden variables that could make a big difference.. I mean, last I recall Mali 400 could go all the way to 400MHz and if somehow a 400MP4 takes a reasonable amount of power at that clock it could easily provide an advantage even if it doesn't have much to offer per clock.

So to match 4x 2ALU/1TMU Mali-T604 cores wouldn't 544MP have to be 8ALU/4TMU?? This is assuming that USSE2 can do vec4 FP16 FMADD per cycle, which I think it can. Aren't those unit counts more 554 territory? Was 544 a typo? I'm actually not sure what the differences are between 543 and 544.

I'm not sure we could blame this vendor or that implementation - that's a provision in the API design, ergo lowp-exclusive fragment shaders were bound to happen. *shrug*

For what it's worth, FP16 is mediump. You certainly wouldn't want to use lowp on texture coordinates.

It's just a pretty serious issue that now game programmers may have to make the choice to not use highp in fragment shaders, ever. Granted, they would have had to limit themselves to FP24 already, but that's much better than FP16. I'm concerned about cases where avoiding highp will mean a completely different design and not something that can be accepted at a cost of visual degradation. A ton of dynamic range is being lost here.
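A contrived but representative sketch of what I mean (hypothetical shader, purely to show where FP16 falls over): deriving UVs from world-space position in the fragment shader. With highp unavailable, mediump's ~10-bit mantissa and reduced exponent range make the texture visibly swim on anything large:

Code:
precision mediump float;           /* FP16 on Mali-400 */

uniform sampler2D tex;
varying vec3 world_pos;            /* hypothetical varying */

void main()
{
    /* Planar mapping: once world_pos grows large relative to the
       tiling frequency, FP16 can no longer resolve adjacent texels. */
    vec2 uv = world_pos.xz * 0.25;
    gl_FragColor = texture2D(tex, uv);
}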

Hmm, I don't remember having seen SoC vendors recommending z pre-passes. And frankly, I don't think it would bring a large (fillrate) gain, statistically, vis-a-vis a front-to-back pre-sort. That in the context of early-z, of course.

I know Z pre-pass is big on Xbox 360. There's probably lower-level API exposure that makes it efficient. If vendors aren't interested in Z pre-passes, why have and advertise a higher Z-fill rate at all?
 
Stencils should be an advantage for PowerVR, going hand in hand with the speed of the visible surface determination.
 
If it can do a fast Z pre-pass it should be able to do a fast stencil pre-pass too. Not that SGX doesn't still have the win in being able to do these things implicitly and concurrently, usually for free.

Fortunately SGX also supports explicit stencil-only geometry that it can fill really fast.
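Something along these lines, I mean (a sketch; draw_volume_geometry is a placeholder) - with colour and depth writes masked off, there's nothing left for the hardware to touch but stencil:

Code:
/* Stencil-only pass, e.g. for shadow volumes (z-fail, back faces
   shown; front faces would decrement via glStencilOpSeparate). */
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_FALSE);
glEnable(GL_STENCIL_TEST);
glStencilFunc(GL_ALWAYS, 0, 0xFF);
glStencilOp(GL_KEEP, GL_INCR_WRAP, GL_KEEP);
draw_volume_geometry();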
 
For what it's worth, FP16 is mediump. You certainly wouldn't want to use lowp on texture coordinates.
My bad, I keep mentally relegating fp16 to the lowp fp corner. Anyway, what I meant was that if the 'low energy' state was allowed by the API, exclusive hardware was bound to happen.

It's just a pretty serious issue that now game programmers may have to make the choice to not use highp in fragment shaders, ever. Granted, they would have had to limit themselves to FP24 already, but that's much better than FP16. I'm concerned about cases where avoiding highp will mean a completely different design and not something that can be accepted at a cost of visual degradation. A ton of dynamic range is being lost here.
Well, a good chunk of shaders would not really care about the degradation, but for the rest, you're right to suspect a re-design might be needed if preservation of appearance is really a must. The thing is, most of the time in multi-platform production pipelines it's an 'either-or' scenario: one platform supports feature X and so gets the fancy shaders; another platform does not support feature X and so gets the stick (read: a downgraded illumination model, yadda-yadda). So scenarios where people will really sweat over shoehorning their state-of-the-art shaders down to something that poses a serious challenge will be more the exception than the norm. IMO, of course.

The whole reason vendors go for the lowest threshold of admittance is that they know that they'll get away with it ; )

I know Z pre-pass is big on Xbox 360. There's probably lower-level API exposure that makes it efficient. If vendors aren't interested in Z pre-passes, why have and advertise a higher Z-fill rate at all?
Well, for one, a whole generation of engines/shading models relied on predominantly-z passes (read: Doom 3, and the whole one-pass-per-light generation, where the scene gets a first depth-plus-ambient pass). But even today, z-only fillrate is an important factor in depth-based volumetric techniques (e.g. most shadow algos used nowadays), and z-only fillrate is directly related to early-z performance, which, as already mentioned, works for front-to-back sorted scenes as well.
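Schematically, that frame structure looks like this (not any particular engine's code - draw_scene and the program names are placeholders):

Code:
/* One-pass-per-light frame: the first pass is nearly pure z
   fillrate, and every light pass leans on early-z thereafter. */
glDepthFunc(GL_LESS);
draw_scene(ambient_program, NULL);   /* depth + ambient, laid down once */

glDepthMask(GL_FALSE);
glDepthFunc(GL_EQUAL);               /* only already-visible fragments  */
glEnable(GL_BLEND);
glBlendFunc(GL_ONE, GL_ONE);         /* additive light accumulation     */
for (int i = 0; i < num_lights; ++i)
    draw_scene(light_program, &lights[i]);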
 
Ah yeah, I wasn't even considering how fast-Z helps early-Z. Oops.

I suppose it's possible for something like Mali to have multiple Z resolutions per pixel for early-Z, but no provision for using them between the state changes you'd need when transitioning from a depth-only pass to a rendering pass. I've actually always wondered how tilers handled these sorts of state changes, especially hardware binners like Mali and PowerVR. At least for deferred shading, if the shaders have to change you could end up with a tile with a bunch of pixels from different shaders. I guess for direct rendering it wouldn't really make a difference.
 
About the difference between the 543 and 544: the 544 is their DirectX 9 Feature Level 3-focused core, kinda like the difference between the UMPC-originated 535 and the mobile-targeted 530. The 544 also brought that DirectX 9L3 focus to a part with four ALU pipes and the XT architectural enhancements.

I don't think it makes for much of a direct comparison to the T604. Not even to the Mali400s, really.
 
Oops, obviously meant 554MP as being comparable to T604MP4, and actually even that is far from certain given how little we know of either's specs.
 
I've actually always wondered how tilers handled these sorts of state changes, especially hardware binners like Mali and PowerVR. At least for deferred shading, if the shaders have to change you could end up with a tile with a bunch of pixels from different shaders.
IANAL, but the only logical explanation is that the tile is filled in a drawcall-by-drawcall order, as otherwise it's not just shader state that would need to be changed across pixels, but TMU state/texture caches as well, and that could really hurt.
 
IANAL, but the only logical explanation is that the tile is filled in a drawcall-by-drawcall order, as otherwise it's not just shader state that would need to be changed across pixels, but TMU state/texture caches as well, and that could really hurt.

One of the classical selling points of Mali GPUs is "No renderer state change penalty"; all Mali cores from Mali55 to the T604 have the ability to switch render state intra-tile with zero cycles of overhead - including TMU state.

No idea how SGX handles this, though.
 
IANAL, but the only logical explanation is that the tile is filled in a drawcall-by-drawcall order, as otherwise it's not just shader state that would need to be changed across pixels, but TMU state/texture caches as well, and that could really hurt.

Well I'm not sure what being a lawyer would have to do with this ;)

Don't you usually have a different drawcall for every model, in order to change the modelview matrix? If that's the case and what you're saying is true TBDR would rarely help you, since a model will rarely occlude itself. Even if the texture binding stays the same, it could still have totally different texture coordinates from one pixel to the other, so I dunno.. being able to handle texture state changes from pixel to pixel shouldn't be a huge problem. Cache wouldn't be explicitly flushed, but you'd get potentially worse locality of reference - but that's going to happen no matter how you draw the tile.

I do think there are some render catch-up events in the tile rendering, at the very least when going between opaque, alpha, and alpha test (hence why IMG suggests binning by this), and quite likely when changing shaders. But for things like texture binding or uniform changes, I'm not so sure. So I expect the granularity to be a little coarser than per-draw call.

Somewhere I recall IMG material claiming that the USSEs can switch to completely independent thread contexts with unique program position et al., so it's not completely out of the question that it can switch shaders on a per-pixel level within the tile. It'd just have to tag the pixels with a shader number (and deal with the case where that gets exhausted).
 
I'm not sure I understand what you're saying.
Sorry, I meant that I expect Series6 to support OpenCL Full Profile, which currently only T604 supports. The reason why I left it so vague is that I'm not sure how the various IPs compare in terms of future OpenGL ES version support or even DirectX. ARM said DX11 to some people, but I assume that's DX11's DX9-level profile which is comparable to SGX544/554. Or maybe not, I don't know!
 
Sorry, I meant that I expect Series6 to support OpenCL Full Profile, which currently only T604 supports.
I don't think "currently" is a word that can be applied to T604 (at least in the context of public perception) given that ARM have publicly stated they expect devices with it to start shipping in the 2012/2013 timeframe.
 
I don't think "currently" is a word that can be applied to T604 (at least in the context of public perception) given that ARM have publicly stated they expect devices with it to start shipping in the 2012/2013 timeframe.
Oh, that's what you meant. They've said they would be delivering final RTL in early 2011, which compares to Imagination's shipping final RTL for SGX543MP in late 2009 (unless ARM and IMG don't mean exactly the same thing, but either way it's probably not completely different). I would be very (positively) surprised if we saw Series6 final RTL in a similar timeframe to T604, but hey, you wouldn't see me complaining! :D
 