ARM Mali-400 MP

That's a pretty small size; you could describe it as being "nano" centric.....

Apparently the SGX factsheet has been adjusted too.

At 65nm and 200MHz, for SGX520 up to SGX545:

7-40M Triangles/s
250-1000 MPixels/s
2.6-12.5mm^2

Meaning SGX520 can achieve 250 MPixels/s (including overdraw) and 7M Tris/s at 200MHz, with a <50% shader load. If the 520 could achieve as many pixels/clock as its larger brothers, it would beat a KYRO senseless at merely 2.6mm^2.... :oops:
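To put those figures on a per-clock basis (quick Python sketch, using only the numbers quoted above):

CLOCK_HZ = 200e6  # both ends of the range are quoted at 200MHz

for name, pix_per_s, tri_per_s in [("SGX520", 250e6, 7e6), ("SGX545", 1000e6, 40e6)]:
    print(name, "pixels/clock:", pix_per_s / CLOCK_HZ,
          "triangles/clock:", tri_per_s / CLOCK_HZ)

# SGX520 pixels/clock: 1.25 triangles/clock: 0.035
# SGX545 pixels/clock: 5.0 triangles/clock: 0.2

So the 520 does 1.25 pixels/clock against the 545's 5, which is the gap the "larger brothers" remark is about.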
 
ARM's smallest Mali core that is OpenGL ES 2.0 compliant is the Mali200, and at 65nm it's 5mm^2, which makes it just about twice the size of SGX520, and yet it is only slightly higher performing (9M triangles and 275M pixels vs 7M triangles and 250M pixels).


THE one thing missing from both spec sheets of course is power consumption.
 

I believe the 5mm^2 for Mali200 excludes the Mali GP; I suspect it's closer to 6.5mm^2 with that included.

Power is roughly proportional to area, so assuming Mali200 is as aggressive with its clock gating as SGX, Mali200 will consume ~2x the power. This does, however, ignore the fact that Mali isn't a deferred rendering device, so in the presence of overdraw its power consumption is likely to be higher per unit area for both core and IO.
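As a rough sanity check on that ~2x (Python, with the simplifying assumption that power scales linearly with area and nothing else differs):

sgx520_mm2 = 2.6        # quoted earlier in the thread
mali200_mm2 = 5.0       # quoted figure, excludes the Mali GP
mali200_gp_mm2 = 6.5    # my guess with the Mali GP included

print("Mali200/SGX520 power ratio:", round(mali200_mm2 / sgx520_mm2, 2))   # 1.92
print("...with Mali GP included:", round(mali200_gp_mm2 / sgx520_mm2, 2))  # 2.5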

John.
 


According to the feature list in this PDF, Mali is both tile-based and deferred rendering... but they also mention immediate rendering?

http://www.arm.com/miscPDFs/21863.pdf

Power is roughly proportional to area, so assuming Mali200 is as aggressive with its clock gating as SGX, Mali200 will consume ~2x the power. This does, however, ignore the fact that Mali isn't a deferred rendering device, so in the presence of overdraw its power consumption is likely to be higher per unit area for both core and IO.

Not only is clock gating important, i.e. turning off the bits that are not needed at any one time, but it may well be that one or other solution inherently results in more of the chip being able to be turned off at any one time.
 
According to the feature list in this PDF, Mali is both tile-based and deferred rendering... but they also mention immediate rendering?

http://www.arm.com/miscPDFs/21863.pdf

They're an early-Z based tiler, not a deferred tiler; they mention deferred rendering in a rather obscure context, but I know for a fact that they are early-Z tile based. The mention of IMR is just random marketing; they are fundamentally a tile-based renderer, end of story.
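To make the early-Z vs deferred distinction concrete (toy Python, one pixel covered by three opaque triangles; lower depth = closer):

def early_z_shaded(depths):
    shaded, nearest = 0, float("inf")
    for d in depths:
        if d < nearest:      # passes the early-Z test, so it gets shaded
            shaded += 1
            nearest = d
    return shaded

print(early_z_shaded([0.9, 0.5, 0.1]))  # back-to-front: shades all 3
print(early_z_shaded([0.1, 0.5, 0.9]))  # front-to-back: shades just 1

# A deferred tiler shades 1 fragment in both cases; an early-Z tiler only
# matches that when the submission order co-operates.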

Not only is clock gating important, i.e. turning off the bits that are not needed at any one time, but it may well be that one or other solution inherently results in more of the chip being able to be turned off at any one time.

Of course clock gating is important, I specifically said "assuming Mali200 is as aggressive with its clock gating as SGX". They are a tile-based renderer; if anything this offers less opportunity for clock gating than a deferred tile-based renderer, due to how the latter's pipeline fits together.

John.
 
Yeah, but then again they're not unified so that means more opportunities for clock gating. Doesn't necessarily mean lower *overall* power consumption, but it does have an effect on average power consumption per mm² - thus given all of these factors, the latter seems like a very problematic metric to use here! :)
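To illustrate with a toy model (made-up numbers, purely to show the shape of the argument):

active_power_per_mm2 = 1.0   # arbitrary units while a block is busy
gated_power_per_mm2 = 0.1    # leakage etc. while a block is clock-gated

# Unified core: 3 mm^2, busy ~100% of the frame.
unified = 3.0 * active_power_per_mm2

# Non-unified: 2 mm^2 PS always busy + 2 mm^2 VS gated ~70% of the frame.
non_unified = 2.0 * active_power_per_mm2 + \
              2.0 * (0.3 * active_power_per_mm2 + 0.7 * gated_power_per_mm2)

print("unified     total:", unified, "per mm^2:", unified / 3.0)          # 3.0, 1.0
print("non-unified total:", non_unified, "per mm^2:", non_unified / 4.0)  # 2.74, ~0.69

Similar totals, quite different power/mm^2 -- the bigger core "wins" the per-area metric partly by carrying idle silicon.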
 

LOL. Interesting point: make your core bigger with loads of logic that sits around idle so that you can claim lower power per mm^2, although even in an LP process leakage does need to be factored into this...

Perhaps I shouldn't have used the term "unit area" ;)

John.
 
LOL. Interesting point: make your core bigger with loads of logic that sits around idle so that you can claim lower power per mm^2, although even in an LP process leakage does need to be factored into this...
A certain bright green company seems to have followed that train of thought to an extreme too! :) (there's a patent on a pre-pixel shader stage from them that is basically a systematic waste of area with no way to improve performance, but it can save a tiny bit of power, sometimes, if the stars are aligned right)

However I would argue the main advantage of non-unified hardware from a power POV is to be able to use FP24 instead of FP32 in the pixel shader. If you don't care about having MIMD everywhere, it also allows you to have higher branching granularity in the PS than in the VS; how much that helps you depends a lot on how naive your architecture is though, ofc... (and what real-world handheld applications are & will be like)

In the end though, anyone not having a fully unified architecture with maximum utilization all the time on 28nm is playing a very dangerous game. Right now handheld 3D cores are always implemented in minimum-leakage technology, but given that triple-gate oxide is standard on 28nm (at least for TSMC, certainly not for IBM!) that seems like an awful plan to me compared to the alternatives.

Perhaps I shouldn't have used the term "unit area" ;)
Heheh, indeed you probably shouldn't have! :D
 
However I would argue the main advantage of non-unified hardware from a power POV is to be able to use FP24 instead of FP32 in the pixel shader. If you don't care about having MIMD everywhere, it also allows you to have higher branching granularity in the PS than in the VS; how much that helps you depends a lot on how naive your architecture is though, ofc... (and what real-world handheld applications are & will be like)

Going forward FP24 won't cut it in the PS; part of the reason for this is that these devices are already being applied to UIs running at HD resolutions, making FP24 marginal for texture calculations. Throw GP-GPU into the mix and FP24 doesn't cut it at all. Coarser branch granularity also falls foul of GP-GPU, and it is possible to architect for the finest level of branch granularity without adding significant area to the design imo.

John.
 
Going forward FP24 won't cut it in the PS; part of the reason for this is that these devices are already being applied to UIs running at HD resolutions, making FP24 marginal for texture calculations.
Obviously FP24 per se shouldn't be a problem for high target resolutions, so I presume you're thinking of very large textures? I have some difficulty believing this is a big problem even at 1080p, but heh! :)

Throw GP-GPU into the mix and FP24 doesn't cut it at all. Coarser branch granularity also falls foul of GP-GPU, and it is possible to architect for the finest level of branch granularity without adding significant area to the design imo.
I definitely agree with both points, but it's not my fault PowerVR's competitors don't seem able to figure out how to implement efficient MIMD to save their lives! ;) Obviously Imagination's processor/DSP heritage with META helps a lot there.
 
Obviously FP24 per se shouldn't be a problem for high target resolutions, so I presume you're thinking of very large textures? I have some difficulty believing this is a big problem even at 1080p, but heh! :)

FP24 gives you 15 bits of mantissa; 1:1 HD textures are 1920 wide, so you need 11 bits to achieve texel-level addressing. This leaves you with 4 bits of sub-texel accuracy, which I consider borderline, but artifacts do depend on the application.
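i.e. the bit budget works out like this (quick Python check):

import math

mantissa_bits = 15                          # FP24, as above
texel_bits = math.ceil(math.log2(1920))     # 11 bits for texel addressing
subtexel_bits = mantissa_bits - texel_bits  # 4 bits left over

print("sub-texel bits:", subtexel_bits,
      "-> finest step:", 1 / 2**subtexel_bits, "of a texel")  # 1/16 = 0.0625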

Incidentally, I could be wrong but I thought ARM's fragment shaders were restricted to FP16, or maybe that was the old Bitboys part <shrugs>...

John.
 
FP24 gives you 15 bits of mantissa; 1:1 HD textures are 1920 wide, so you need 11 bits to achieve texel-level addressing. This leaves you with 4 bits of sub-texel accuracy, which I consider borderline, but artifacts do depend on the application.
Yes but if it's applied 1:1 I'm not sure I see how it could go wrong? Surely you don't need to rotate it or anything like that... (or even if you did for a special effect it'd go fast enough that nobody would ever notice there's a 0.25 pixel error)

Incidentally, I could be wrong but I thought ARM's fragment shaders were restricted to FP16, or maybe that was the old Bitboys part <shrugs>...
I think that was Bitboys, yeah... :)
http://web.archive.org/web/20051124174700/www.bitboys.fi/g40.php
http://web.archive.org/web/20050305122106/www.bitboys.fi/comparison.php
 
Yes but if it's applied 1:1 I'm not sure I see how it could go wrong? Surely you don't need to rotate it or anything like that... (or even if you did for a special effect it'd go fast enough that nobody would ever notice there's a 0.25 pixel error)

A lot of the examples we're seeing aren't 1:1, and the greater the zoom factor, the more significant it becomes.
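Roughly (quick sketch, reusing the 4 sub-texel bits from earlier):

subtexel_step = 1 / 16      # from the 4 sub-texel bits worked out above

for zoom in (1, 2, 4, 8):   # screen pixels per texel
    print("zoom", zoom, "-> worst-case addressing error ~",
          zoom * subtexel_step, "screen pixels")

# At 1:1 that's 1/16 of a pixel; at 8x magnification it's half a pixel.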

John.
 
I definitely agree with both points, but it's not my fault PowerVR's competitors don't seem able to figure out how to implement efficient MIMD to save their lives! ;) Obviously Imagination's processor/DSP heritage with META helps a lot there.

Sorry for the OT, but although the last sentence shouldn't be wrong per se, I'd say that Metagence was created somewhere in the middle of IMG's history, based on PowerVR experience. Now you may shoot me and carry on :p

***edit: as for the higher texture needs, if there hadn't been such needs IMG wouldn't have rushed out and inserted the SGX531 without a reason. There's still the question of whether they'll have anything beyond the 545 and, if yes, what it's going to look like.
 
FP24 gives you 15 bits of mantissa; 1:1 HD textures are 1920 wide, so you need 11 bits to achieve texel-level addressing. This leaves you with 4 bits of sub-texel accuracy, which I consider borderline, but artifacts do depend on the application.

Incidentally, I could be wrong but I thought ARM's fragment shaders were restricted to FP16, or maybe that was the old Bitboys part <shrugs>...

John.

It's not the bits you've got, it's what you do with them that counts ;)

Input and storage precision are not the same as intermediate result precision. There are ways of managing numerical computation in an architecture such that you don't need to maintain a complete FP24 pipe to maintain accuracy.
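As a sketch of the general idea (illustrative only, this is not ARM's actual datapath): keep the stored coordinate at whatever precision you like, but do the address arithmetic in a wide fixed-point intermediate:

def texel_address(u, texture_width, frac_bits=16):
    # u is a [0,1) coordinate that may have been *stored* at low precision;
    # promote it to a wide fixed-point intermediate before scaling.
    u_fixed = int(round(u * (1 << frac_bits)))   # wide intermediate
    addr = u_fixed * texture_width               # exact integer multiply
    texel = addr >> frac_bits                    # integer texel index
    frac = (addr & ((1 << frac_bits) - 1)) / (1 << frac_bits)
    return texel, frac                           # index + sub-texel weight

print(texel_address(0.5003, 1920))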

Besides which, Mali200 is still the only IP core to have achieved Khronos conformance at 1080p resolution... so evidently it's not as much of a problem as people seem to think.

You'd think if SGX was capable of passing conformance at 1080p they'd have press-released it by now (they press-release every other bleeping thing).
 
Two notes of caution here:

It's well known in the industry that ARM has a track record of conservatively estimating their core sizes; the PowerVR guys can be a little more, errr, "creative" shall we say.

Similarly, don't just take it as read that the performance numbers are correct. I worked with one of the chips implementing the original MBX and it was nowhere near the performance envelope stated in their material. Remember SGX is a unified shader. Ask yourself: are they quoting SGX peak fill rate with the core 100% dedicated to fragment processing? A similar question goes for vertex processing...
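For instance (made-up split, just to show why the headline number flatters): if the unified core spends, say, 30% of its cycles on vertex work, the sustained fill rate falls well short of the quoted peak:

peak_mpix = 250       # quoted peak, core 100% on fragments
vertex_share = 0.3    # hypothetical share of cycles spent on vertices
print("sustained:", peak_mpix * (1 - vertex_share), "MPix/s")   # 175.0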

Lies, lies, damn lies and GPU marketing material and all that.

On the power consumption front there are a number of variables to take into account.

Total power efficiency for the GPU core will depend on the number of gates in the core, the number and area of the RAMs in the core, how many of those are active at any one time and (this is the key bit you've missed so far) the amount of external BW consumed by the GPU core.

Not to trivialise it, though: the gate/RAM area is a big issue without power gating. Sub-65nm, static power consumption through leakage is a big deal, so SGX would seem to have the edge over Mali there. However, if the utilisation of the core is 100% during a rendering phase then there is no/limited opportunity to power gate (you need to keep the gates powered up to do the work), and this is where SGX gets let down.

SGX being a unified shader architecture, its compute core is shared between vertex and fragment processing (which, incidentally, is probably why it's smaller). It attempts to load balance using some hoopy hyper-threading system; this will likely have the effect that the core is active a lot more of the time, meaning aggressive power gating really may not buy you that much. Mali has the advantage that the Mali GP can be completely power gated after it's finished processing (that's about 30% of the architecture powered off). That's gotta be worth something!

Another factor that plays in here is the number and size of the RAM instances in the design. I don't know the ins and outs of the implementation of SGX (I haven't seen any die shots I can analyse), but to keep a hyper-threaded unified shader architecture fed they probably have a big-ass cache RAM to context switch in and out of to keep the thing ticking over. That's gonna cost big on the power consumption front. As long as the core is active that RAM needs to stay powered up.

Mali has some neat (and patent-protected) tricks up its sleeve in that regard. It doesn't have any context-switch overhead thanks to a nifty trick of carrying the context with each thread. This means they have little or no pipeline flush overhead and no need for a munging great cache to store it.
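As I read that claim, the mechanism is roughly this shape (simplified Python, names hypothetical):

from collections import namedtuple

# Each in-flight thread carries a reference to its own render state, so a
# state change never forces a pipeline flush or a bulk context reload.
Thread = namedtuple("Thread", "x y state")

state_a = {"shader": "opaque"}
state_b = {"shader": "blended"}

threads = [Thread(0, 0, state_a), Thread(1, 0, state_b), Thread(2, 0, state_a)]
for t in threads:   # interleaved states processed back to back, no flush
    print(t.x, t.y, t.state["shader"])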

Last thing you need to take into account is the memory bandwidth consumed by the two cores. External memory bandwidth to DRAM consumes stupid amounts of power, and nothing hammers the crap out of memory quite like a GPU running 3D graphics.

I attended a tech seminar (come to think of it I think it was an ARM one) where they talked about external DRAM accesses being 10x the cost of internal computation in some cases. While I'm not sure I buy 10x, even 2x would be a significant effect, and reducing the bandwidth used by a GPU would make a significant difference to overall power consumption.
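A back-of-envelope model of that effect (the unit costs are made up, sitting between the 2x and 10x figures above):

E_COMPUTE = 1.0   # energy per pixel's worth of internal work (arbitrary units)
E_DRAM = 5.0      # energy per pixel's worth of external DRAM traffic

def frame_energy(pixels, overdraw, external_traffic_per_pixel):
    work = pixels * overdraw
    return work * E_COMPUTE + work * external_traffic_per_pixel * E_DRAM

hd = 1280 * 720
print("baseline:", frame_energy(hd, 2.5, 1.0))                   # 13,824,000
print("half the external traffic:", frame_energy(hd, 2.5, 0.5))  # 8,064,000

Halving external traffic cuts total energy by well over a third here -- which is why BW-reduction claims matter so much.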

I've heard ARM make some pretty bold claims about Mali's BW-reduction techniques. Whilst I don't have any first-hand experience to confirm or deny those claims, I am told by trusted sources that they are on the level, and that they do have an advantage compared to SGX with real-world workloads. Enough to offset the size difference? I can't say, but interesting to note.

So what's my point? The above are just a few obvious things that I've observed which tell me it's impossible to make an apples-for-apples comparison of the two based on publicly available data. We are only getting a tiny glimpse of the whole picture.

Going on experience I'd say PowerVR will over-promise and under-deliver on the SGX, but they'll sell a bundle of them anyway, and so we'll suffer more mediocre graphics experiences on handsets for another generation. ARM are winning designs away from PowerVR, however, so there must be something in this that's making sense to some big names.

As for the Mali400 MP, I think it is a very poorly thought-out product. If you are going to introduce a multi-core scalable product, why the hell not scale both the fragment and vertex shader cores? This smacks of something nailed together in a hurry to meet some spurious customer request, if you ask me (wonder if that's anything to do with them losing one of their key strategic technical people earlier in the year...).

Anyway, let's hope they get more of a clue with the next one and give PowerVR a real run for their money.
 