ARM Mali-400 MP

It's not the bits you've got, it's what you do with them that counts ;)

Input and storage precision are not the same as intermediate result precision. There are ways of managing numerical computation in an architecture such that you don't need to maintain a complete FP24 pipe to maintain accuracy.
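A toy example of the general idea (deliberately nothing Mali-specific): keep the stored values narrow, do the intermediate arithmetic wide, and round once at the end.

Code:
# Toy illustration: values can be stored and passed around at a narrow
# precision while the intermediate arithmetic is carried out wider, with a
# single rounding step at the end.  Rounding once typically loses far less
# accuracy than rounding after every operation.
import math

def quantise(x, mantissa_bits):
    """Round x to roughly the given number of mantissa bits (toy model)."""
    if x == 0.0:
        return 0.0
    exp = math.floor(math.log2(abs(x)))
    scale = 2.0 ** (exp - mantissa_bits)
    return round(x / scale) * scale

values = [0.1 * i for i in range(1, 101)]
stored = [quantise(v, 10) for v in values]      # narrow "storage" precision

narrow_sum = 0.0                                # rounded after every add
for v in stored:
    narrow_sum = quantise(narrow_sum + v, 10)

wide_sum = quantise(sum(stored), 10)            # accumulated wide, rounded once

print(narrow_sum, wide_sum, sum(values))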

Besides which, Mali200 is still the only IP core to have achieved Khronos conformance at 1080p resolution... so evidently it's not as much of a problem as people seem to think.

You'd think if SGX was capable of passing conformance at 1080p they'd have press released it by now (they press release every other bleeping thing).

Although the precision within the texturing pipeline itself can be optimised per stage and can in some places be reduced, we're talking about the shader pipe here: to make your shader pipe FP24 you cannot drop its precision below FP24. This is of course secondary to the fact that only having 4 bits of sub-texel precision, in terms of your ability to accurately sample textures, is borderline imo.

In terms of precision and its effect on texture sampling, it is the resolution of the source textures that has an impact, not the target resolution, i.e. sub-pixel resolution does not change with screen size, so it is unlikely that target resolution will have any impact on the results of the conformance tests.
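To put rough numbers on that (back-of-envelope only, and assuming purely for illustration an FP24 layout with 16 explicit mantissa bits):

Code:
# Back-of-envelope: how many fractional (sub-texel) bits are left once a
# normalised texture coordinate is scaled into texel space?  Assumes an
# FP24-style layout with 16 explicit mantissa bits (~17 significant bits with
# the implicit leading one) -- the exact split is an assumption here.
SIGNIFICANT_BITS = 17

def subtexel_bits(texture_size):
    integer_bits = (texture_size - 1).bit_length()   # bits eaten by the texel index
    return max(SIGNIFICANT_BITS - integer_bits, 0)

for size in (256, 1024, 2048, 4096):
    print(f"{size:5d}-texel texture -> ~{subtexel_bits(size)} sub-texel bits")
# The render-target resolution never enters the calculation, which is the point.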

John.
 
You'd think if SGX was capable of passing conformance at 1080p they'd have press released it by now (they press release every other bleeping thing).
I'm sure if a customer actually wanted a specific resolution tested it would be done. FWIW, of those companies who do actually have conformant OpenGL ES 2.0 systems, most seem to have opted for a "bog standard" VGA resolution when doing the test. <shrug>

I worked with one of the chips implementing the original MBX and it was nowhere near the performance envelope stated in their material.
Well, there are a range of MBX models and SOCs that they are put into, and clearly some perform better than others. I can't comment on an individual case as I don't know their "innards" in enough detail.
 
Given that the only two IP vendors left in that space seem to be ARM and PowerVR, yes. Let's hope they give them a real run for their money.


Actually Vivante have been showing signs of life lately. They showed some silicon running at an event recently.

I also heard a rumour that Matrox were selling IP now as well.
 
Well, there are a range of MBX models and SOCs that they are put into, and clearly some perform better than others. I can't comment on an individual case as I don't know their "innards" in enough detail.

You could have picked a better benchmark to illustrate your point. GLBenchmark is soooooo bad.

Isn't it the same benchmark that claims that some MBX implementations don't support bilerp, MIP mapping, etc., when actually they are key hardware features?
 
In terms of precision and its effect on texture sampling, it is the resolution of the source textures that has an impact, not the target resolution, i.e. sub-pixel resolution does not change with screen size, so it is unlikely that target resolution will have any impact on the results of the conformance tests.

John.

Hold on a minute, how many embedded systems do you know that actually have enough room to store a 1920x1080x32 texture (8 MB for the top MIP level), let alone have the need to zoom into it by nearly 16x?????
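Rough arithmetic behind those numbers, for anyone checking (and my reading of where the "nearly 16x" comes from):

Code:
# Rough arithmetic behind the numbers above (nothing vendor-specific).
width, height, bytes_per_texel = 1920, 1080, 4            # RGBA8888
top_mip = width * height * bytes_per_texel
print(round(top_mip / 2**20, 1), "MiB for the top MIP level")   # ~7.9 MiB

full_chain = top_mip * 4 // 3                              # a full MIP chain adds ~1/3
print(round(full_chain / 2**20, 1), "MiB with the MIP chain")

# 2**4 = 16 distinct sample positions between adjacent texels is presumably
# where the "nearly 16x" zoom figure comes from.
print(2 ** 4, "positions per texel with 4 sub-texel bits")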

Well I suppose, viewing JPG stills maybe with some zoom, but then you can do a partial decode on them to limit the source texture size so it's not a problem.

Or post-processing of HD video, but then that's not likely to need to be zoomed by 16x...

I'd like to see your use case...
 
Two notes of caution here :-

It's well known in the industry that ARM has a track record of conservatively estimating their core sizes; the PowerVR guys can be a little more, errr, "creative" shall we say.
IMG figures are actual synthesis figures in the same way ARM's are claimed to be.

Similarly, don't just take it as read that the performance numbers are correct. I worked with one of the chips implementing the original MBX and it was nowhere near the performance envelope stated in their material.
It is well known that not all MBX systems are alike; some clearly do hit our performance claims, which suggests that the performance of MBX itself was exactly as stated.

Remember SGX is a unified shader. Ask yourself: are they quoting SGX peak fill rate with the core 100% dedicated to fragment processing? The same question goes for vertex processing...

SGX figures are quoted at 50% shader load for fill or poly throughput. The reality is that I've only ever seen contrived cases where a unified design doesn't win out over a similar-area non-unified design.
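To make that concrete with deliberately invented numbers (these are not SGX specs):

Code:
# Deliberately made-up peak figures, not actual SGX numbers -- the point is
# only how a "quoted at 50% shader load" figure relates to the peak.
peak_fill_mpix = 200.0        # hypothetical peak fill, core 100% on fragments
peak_tris_m    = 10.0         # hypothetical peak setup, core 100% on vertices
quoted_load    = 0.5          # the load assumed in the quoted figures

print("quoted fill rate:", peak_fill_mpix * quoted_load, "Mpix/s")
print("quoted poly rate:", peak_tris_m * quoted_load, "Mtris/s")
# A non-unified design has to sustain both workloads at once anyway, so a
# 50%-load quote is the fairer like-for-like comparison.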

Lies, lies, damn lies and GPU marketing material and all that.

And of course ARM's marketing team only dish out absolutely truthful and factual information <rolls eyes>

On the power consumption front there are a number of variables to take into account.

Total power efficiency for the GPU core will depend on the number of gates in the core, the number and area of the RAMs in the core, how many of those are active at any one time and (this is the key bit you've missed so far) the amount of external BW consumed by the GPU core.

Not to trivialise it though, the gate/RAM area is a big issue without power gating. Sub-65nm static power consumption through leakage is a big deal, so SGX would seem to have the edge over Mali there. However, if the utilisation of the core is 100% during a rendering phase then there is no/limited opportunity to power gate (you need to keep the gates powered up to do the work) and this is where SGX gets let down.

SGX being a unified shader architecture, its compute core is shared between vertex and fragment processing (which incidentally is probably why it's smaller). It attempts to load balance using some hoopy hyper-threading system; this will likely have the effect that the core is active a lot more of the time, meaning aggressive power gating really may not buy you that much. Mali has the advantage that the MaliGP can be completely power gated after it's finished processing (that's about 30% of the architecture powered off). That's gotta be worth something!

By being unified you expose maximum compute power to the problem at hand, irrespective of being vertex or pixel bound; this increases the opportunities for idle power gating the entire core, which is the granularity most power gating schemes work at at this time. Beyond this, finer-granularity gating remains possible within those parts of the chip dedicated to vertex and pixel processing. Basically, being unified is a net gain w.r.t. power; if it wasn't, we wouldn't have designed the architecture like we did.

Another factor that plays in here is the number and size of the RAM instances in the design. I don't know the ins and outs of the implementation of SGX (I haven't seen any die shots I can analyse), but to keep a hyper-threaded unified shader architecture fed they probably have a big-ass cache RAM to context switch in and out of to keep the thing ticking over. That's gonna cost big on the power consumption front. As long as the core is active that RAM needs to stay powered up.
You seem to be making assumptions about the SGX architecture which aren't based on reality.

Mali has some neat (and patent-protected) tricks up its sleeve in that regard. It doesn't have any context switch overhead thanks to a nifty trick of carrying the context with each thread. This means they have little or no pipeline flush overhead and no need for a munging great cache to store it.
You seem to be assuming that SGX has an inherent context switch overhead, again, this is simply an incorrect assumption.

Last thing you need to take into account is the memory bandwidth consumed by the two cores. External memory bandwidth to DRAM consumes stupid amounts of power and nothing hammers the crap out of memory quite like a GPU running 3D graphics.

I attended a tech seminar (come to think of it I think it was an ARM one) where they talked about external DRAM accesses being 10x the cost of internal computation in some cases. While I'm not sure I buy 10x, even 2x would be a significant effect and reducing the bandwidth used by a GPU would make a significant difference to overall power consumption.
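A quick sanity check on how much that could matter, with entirely invented figures:

Code:
# Purely illustrative sensitivity check (all numbers invented): if a chunk of
# the GPU's energy goes on external DRAM traffic, bandwidth reduction pays off
# even when the per-access cost ratio is well below 10x.
compute_energy = 1.0                       # arbitrary units per frame
offchip_ops    = 0.3                       # assumed fraction of "work" going off-chip

for cost_ratio in (2, 5, 10):              # DRAM access cost vs on-chip compute
    dram_energy = offchip_ops * compute_energy * cost_ratio
    total       = compute_energy + dram_energy
    halved      = compute_energy + dram_energy / 2
    saving      = 1 - halved / total
    print(f"{cost_ratio:2d}x cost ratio -> halving BW saves ~{saving:.0%} overall")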

I've heard ARM make some pretty bold claims about Mali's BW reduction techniques. Whilst I don't have any first-hand experience to confirm or deny those claims, I am told by trusted sources that they are on the level and that they do have an advantage compared to SGX with real-world workloads. Enough to offset the size difference? I can't say, but interesting to note.

Perhaps you should allude to where you think this difference comes from, because I'm pretty certain that an equivalent SGX consumes less BW than Mali in every instance.

So what's my point? The above are just a few obvious things I've observed which tell me it's impossible to make an apples-to-apples comparison of the two based on publicly available data. We are only getting a tiny glimpse of the whole picture.

Going on experience I'd say PowerVR will over promise and under deliver on the SGX, but they'll sell a bundle of them anyway and so we'll suffer more mediocre graphics experiences on handsets for another generation. ARM are winning designs away from PowerVR however, so there must be something in this that's making sense to some big names.

The fact of the situation is that previous generations of Mali were pretty unimpressive when compared to the equivalent PowerVR cores. You've obviously decided the new cores are going to reverse this situation, which is odd given the absence of ratified public benchmark information.

As for the Mali400 MP, I think it is a very poorly thought-out product. If you are going to introduce a multi-core scalable product, why the hell not scale both fragment and vertex shader cores? This smacks of something nailed together in a hurry to meet some spurious customer request, if you ask me (wonder if that's anything to do with them losing one of their key strategic technical people earlier in the year...).

Anyway, let's hope they get more of a clue with the next one and give PowerVR a real run for their money.

Nothing wrong with a bit of competition.

John.
 
Hold on a minute, how many embedded systems do you know that actually have enough room to store a 1920x1080x32 texture (8 MB for the top MIP level), let alone have the need to zoom into it by nearly 16x?????

Well I suppose, viewing JPG stills maybe with some zoom, but then you can do a partial decode on them to limit the source texture size so it's not a problem.

Or post-processing of HD video, but then that's not likely to need to be zoomed by 16x...

I'd like to see your use case...

Obviously I can't talk about the applications our customers are using this technology for.

However, the key point here is that at FP24 you have little headroom for other math when dealing with large textures, hence the statement that FP24 is borderline.

John.
 
Lies, lies, damn lies and GPU marketing material and all that.

And of course ARM's marketing team only dish out absolutely truthful and factual information <rolls eyes>

Well actually I was suggesting that *all* GPU marketing stretches the truth, but hey hoo.
 
First of all, welcome to the forum TheArchitect, enjoy your stay! :)

I don't want to take part in this Holy War too much, but here are a few quick points that hopefully can't be perceived as anything but fairly objective...
TheArchitect said:
It's well known in the industry that ARM has a track record of conservatively estimating their core sizes; the PowerVR guys can be a little more, errr, "creative" shall we say.
JohnH said:
IMG figures are actual synthesis figures in the same way ARM's are claimed to be.
As JohnH implied, both are pre-layout. By looking at some of the META cores, I've come to the conclusion that sometimes PowerVR will indicate a clock target and a die size, but those clocks are for speed-optimized designs and the size is for area-optimized designs. It's not a lie per se (clocks are 'up to') of course, just overly aggressive marketing. But on the other hand, ARM (at least for CPU designs) has a tendency to more clearly associate a die size with a specific frequency target. It's not usually a massive difference, and this might not systematically be true (or it might be outdated) but it does seem noteworthy to me. Actually ARM seems to be doing the same with the Cortex-A9...

In the interest of looking overly balanced about these two companies, I would like to claim that neither is very trustworthy about die size estimates and it's always infinitely better to look at real numbers from finished designs - in fact this is true for most IP houses. The one exception I've seen is Tensilica, which *seems* stunningly honest about post-synthesis vs post-layout, clocks, etc.

JohnH said:
By being unified you expose maximum compute power to the problem at hand, irrespective of being vertex or pixel bound; this increases the opportunities for idle power gating the entire core, which is the granularity most power gating schemes work at at this time.
I agree with John here, power gating only makes sense during long inactivity times. If your VS is 5x faster than required during part of the processing, it'll still need to be active 20% of the time and it's not viable to have an absurdly massive FIFO to let it idle for sufficiently long times. The fact MaliGP can be power gated individually makes sense and obviously can't hurt, but a unified architecture is still likely to benefit more from power gating in general. I haven't thought enough about deferred rendering in this context, though, to be sure whether it has an impact of its own (good or bad).
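Just to spell the duty-cycle arithmetic out (a toy model, not either architecture):

Code:
# The duty-cycle point spelled out (nothing architecture-specific).
def gating_saving(overspeed, idle_power_fraction):
    """overspeed: how much faster the unit is than the workload requires.
    idle_power_fraction: power it still burns while idle (leakage, clocks)
    relative to its active power, if it cannot be gated.  Returns the fraction
    of that unit's energy recovered by perfect power gating."""
    active  = 1.0 / overspeed                  # 5x faster -> active 20% of the time
    ungated = active + (1.0 - active) * idle_power_fraction
    gated   = active                           # idle periods cost nothing
    return 1.0 - gated / ungated

print(round(gating_saving(overspeed=5.0, idle_power_fraction=0.3), 2))
# ...but only if the idle windows are long enough to be worth switching for,
# which is exactly the catch.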

TheArchitect said:
I don't know the ins and outs of the implementation of SGX (I haven't seen any die shots I can analyse), but to keep a hyper-threaded unified shader architecture fed they probably have a big-ass cache RAM to context switch in and out of to keep the thing ticking over. That's gonna cost big on the power consumption front. As long as the core is active that RAM needs to stay powered up.
As JohnH implied, you couldn't be any more wrong here: http://www.eetimes.com/news/design/...cleID=210003530&cid=RSSfeed_eetimes_designRSS
http://i.cmpnet.com/eet/news/08/07/1538UTH_1.gif
SGX is the core in the top right. It's very clear that it has incredibly little SRAM; it's nearly pure logic. At the right, based on a SRAM cell size of ~0.5, there seems to be 64-80KiB of SRAM. On the left, at the bottom and maybe in the center, there's also a very little amount of extra SRAM. That's more than enough for texture caches, FIFOs, and register files.
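For anyone wanting to redo that estimate (assuming the ~0.5 is a 6T bit-cell size in µm², which is my reading of it):

Code:
# Redoing that estimate.  Assumes the "~0.5" is a 6T SRAM bit-cell of about
# 0.5 um^2 at 65nm and ignores array overhead (decoders, sense amps), so
# treat the result as a rough upper bound on capacity.
def sram_kib(area_mm2, cell_um2=0.5):
    bits = (area_mm2 * 1e6) / cell_um2
    return bits / 8 / 1024

for area_mm2 in (0.25, 0.30, 0.35):        # eyeballed SRAM area from the die shot
    print(f"{area_mm2:.2f} mm^2 -> ~{sram_kib(area_mm2):.0f} KiB")
# 0.25-0.35 mm^2 of cells lands in the same 60-90 KiB ballpark as above.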

Given that SGX 530 has two "shader cores" presumably, you could assume that the top right and bottom right SRAM is the shader pipe-specific stuff (incl. RF) and the texture caches, while the center right is for the FIFOs. The rest are misc. buffers, for example to communicate with the outside world. Compared to a non-deferred renderer, they can also save quite a bit of SRAM by not needing on-chip HierZ and stuff like that; and obviously the memory controller is off-block.

JohnH said:
Perhaps you should allude to where you think this difference comes from, because I'm pretty certain that an equivalent SGX consumes less BW than Mali in every instance.
Uhoh, note to self: not reply to TBDR Bandwidth Holy Wars. Ever! :D

TheArchitect said:
Going on experience I'd say PowerVR will over promise and under deliver on the SGX, but they'll sell a bundle of them anyway and so we'll suffer more mediocre graphics experiences on handsets for another generation.
Now THAT's being opinionated! ;)

TheArchitect said:
ARM are winning designs away from PowerVR however, so there must be something in this that's making sense to some big names.
Why must we expect every licensee to be rational, and why must we expect performance, die size, and power consumption to be the only factors? I'm not saying this to diminish either ARM or PowerVR; however my point is the only thing this tells us is the difference isn't so massive that the choice is always clear-cut for potential licensees in the real world. You would expect that to be the case anyway for the surviving players in an open market...
 
You could have picked a better benchmark to illustrate your point. GLBenchmark is soooooo bad.
I'd be interested in hearing of any other benchmark with published figures.

Now, as I can't hide behind a pseudonym I shan't comment further.
 
First of all, welcome to the forum TheArchitect, enjoy your stay! :)

Thank you, huge fun so far :)

BTW - There is no holy war here, I have no alignment to either company, just thought I'd make that clear.

The discussion seemed to lack a protagonist for the counter-argument, so I thought I'd pitch in. ;)

I agree with John here, power gating only makes sense during long inactivity times. If your VS is 5x faster than required during part of the processing, it'll still need to be active 20% of the time and it's not viable to have an absurdly massive FIFO to let it idle for sufficiently long times.

I think you are assuming that the VS and FS cores are decoupled by a FIFO, correct? In actual fact this is not the case for either architecture (I think, and I'm sure JohnH will be very quick to correct me if I'm not... btw does anyone from ARM, Vivante or Matrox follow this forum???), but the intermediate data between the VS and FS processing stages is actually stored to main memory (post VS, post binning). Therefore it would actually be possible and even reasonable to power off the MaliGP (this was the same with MBX equipped with a VGP, but I'm not sure you could power gate it in the same way).
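In grossly simplified terms (a toy sketch, not either vendor's actual pipeline), the phase split is what makes the gating possible:

Code:
# Grossly simplified toy of a tile-based frame -- just to show why the
# geometry unit can be powered off: the two phases only talk to each other
# through memory (the binned tile lists), not through an on-chip FIFO.
from collections import defaultdict

def render_frame(triangles, num_tiles):
    tile_lists = defaultdict(list)          # stands in for the lists in DRAM

    # Phase 1: geometry processing + binning (MaliGP-style block busy here)
    for i, tri in enumerate(triangles):
        tile_lists[i % num_tiles].append(tri)   # toy "binning"

    print("phase 1 done -> geometry unit idle for the rest of the frame")

    # Phase 2: per-tile fragment processing reads the lists back from memory
    shaded = sum(len(tile_lists[t]) for t in range(num_tiles))
    return shaded

print(render_frame(["triA", "triB", "triC"], num_tiles=4))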

As JohnH implied, you couldn't be any more wrong here: http://www.eetimes.com/news/design/...cleID=210003530&cid=RSSfeed_eetimes_designRSS
http://i.cmpnet.com/eet/news/08/07/1538UTH_1.gif
SGX is the core in the top right. It's very clear that it has incredibly little SRAM; it's nearly pure logic. At the right, based on a SRAM cell size of ~0.5, there seems to be 64-80KiB of SRAM. On the left, at the bottom and maybe in the center, there's also a very little amount of extra SRAM. That's more than enough for texture caches, FIFOs, and register files.

Hmmm, the legend at the bottom of the graphic is not clear about which components are which. It's easy to pick out the Cortex-A8, 'cos it's implemented on a semi-hard flow so looks a little more "rigid" than normal synthesised logic. You can imagine that the IVA 1 and IVA 2 would be pretty similar so you can probably spot those, but I think it's non-obvious where the boundaries of the rest of the cores are.

Given that SGX 530 has two "shader cores" presumably, you could assume that the top right and bottom right SRAM is the shader pipe-specific stuff (incl. RF) and the texture caches, while the center right is for the FIFOs. The rest are misc. buffers, for example to communicate with the outside world.

I think PowerVR refers to SGX530 as being two shader "pipes" rather than cores. The premise being that you can share infrastructure between the two pipes and save area (common in programmable architectures) rather than stamping down two identical cores.

Compared to a non-deferred renderer, they can also save quite a bit of SRAM by not needing on-chip HierZ and stuff like that; and obviously the memory controller is off-block.

Uhoh, note to self: not reply to TBDR Bandwidth Holy Wars. Ever! :D

LOL - Okay, we'll forget you said anything (wave your hand and say after me: "these aren't the posts you're looking for")

Now THAT's being opinionated! ;)

Is that not allowed? :D

BTW - If JohnH and anyone from ARM want to pass me the TRMs for their cores I'll do a proper teardown for you guys...
 
Well actually I was suggesting that *all* GPU marketing stretches the truth, but hey hoo.

Make that:

Well actually I was suggesting that *all* marketing stretches the truth...
...and we'll have an immediate agreement.

As for the rest:

Anyway, let's hope they get more of a clue with the next one and give PowerVR a real run for their money.
I've been hearing the same story since the birth of 3D in the mobile/PDA market. In fact we should have seen some fierce competition after MBX from ATI (now AMD), which abandoned the market with flying colours, the Bitboys (absorbed by ATI before AMD bought the latter, and the result lies in the former sentence...), NVIDIA (which seems to be doing a lot better with the APX 2500 than with the initial GoForce) and Falanx (absorbed by ARM), etc.

However, the result with SGX today doesn't differ (as in the number of major licensing deals and, by extension, success) from what it was in the past with MBX. I actually expected competition to heat up with the OGL ES 2.0 generation, but I don't see any earth-shattering changes either.

If it's really about who operates with "smoke and mirrors", then I'd like to hear which of them all is innocent for one, which of them cannibalise prices in order to gain even one deal, or which of them give their IP away for free just to claim a deal after all.

Before anyone throws any stones at IMG, I'd like to hear the entire rotten story that plays out behind the curtains, including all the FUD like "tiling stinks" (which obviously doesn't come from ARM). Besides marketing there's only one thing to say to any potential competitor at all times: deliver or shut up. It's a free market and the more competition the merrier, especially for consumers.
 
Make that:
If it's really about who operates with "smoke and mirrors", then I'd like to hear which of them all is innocent for one, which of them cannibalise prices in order to gain even one deal, or which of them give their IP away for free just to claim a deal after all.

Arrrrh! Now there be some tales to tell... but perhaps for another thread?

Before anyone throws any stones at IMG, I'd like to hear the entire rotten story that plays out behind the curtains, including all the FUD like "tiling stinks" (which obviously doesn't come from ARM).

Oooo yeah, that's a good one and it goes waaaaaaaaay back to the mid-90s when you couldn't move in the industry without tripping over a 3D graphics chip company! Remember Rendition's Vérité, the Cirrus Logic Laguna, the NV1? The scandal around the DX1 Tunnel test (I seem to remember that was the genesis of Tom's Hardware). Damn, that makes me feel old!

Like I said, perhaps another thread is required...
 
Arrrrh! Now there be some tales to tell... but perhaps for another thread?

I know that you know what I'm getting at. Apart from that, why another thread? It's perfectly fine here and I don't feel it's off-topic either.


Oooo yeah, that's a good one and it goes waaaaaaaaay back to the mid-90s when you couldn't move in the industry without tripping over a 3D graphics chip company! Remember Rendition's Vérité, the Cirrus Logic Laguna, the NV1? The scandal around the DX1 Tunnel test (I seem to remember that was the genesis of Tom's Hardware). Damn, that makes me feel old!

Like I said, perhaps another thread is required...

3D history is full of such stories, and not just one 3D graphics company or just one scandal. You've obviously been around as long as I have and I doubt you're that much older than me. The difference being I'm a simple user with no interests whatsoever nor ulterior motives. Now that truly is material for another thread.
 
BTW - There is no holy war here, I have no alignment to either company, just thought I'd make that clear.
I figured that, no problem - I rather meant in the context of TBDR vs IMR vs [...] debates, which tend to have rather bold and highly contradictory arguments on both sides when it comes to things like bandwidth and impact of newer APIs. Honestly, 99% of the arguments I've seen personally proved little but the lack of understanding of the other side of the aisle - that doesn't mean there isn't a real winner (I have no idea), but if there is then the arguments probably aren't (only?) those made oh-so-often.

Furthermore I am not very interested in comparing actual implementations as they rarely mean all that much when it comes to the core algorithms. There are plenty of incredibly smart things you can do on a TBDR that I've never seen any IMR proponent mention, and there are plenty of incredibly smart things you can do on an IMR architecture that I've never seen any TBDR proponent mention. Heck, I've never even seen anyone mention most of those publicly! The foundations of the debate really tend to be set at the wrong level... And any argument about 'real-world' workloads is often inherently biased and is always different for an open ecosystem such as smartphones and a fixed one such as a handheld gaming console.

I think you are assuming that the VS and FS cores are decoupled by a FIFO, correct? In actual fact this is not the case for either architecture (I think, and I'm sure JohnH will be very quick to correct me if I'm not... btw does anyone from ARM, Vivante or Matrox follow this forum???)
You are correct, I obviously knew that for SGX but had a brainfart for Mali. Regarding your question, Arjan (who made a quick post earlier in the thread :)) is a Falanx/ARM engineer. I am sadly not aware of anyone from Vivante or Matrox, although who knows who's lurking out there! (*cough* you *cough*)

but the intermediate data between the VS and FS processing stages is actually stored to main memory (post VS, post binning). Therefore it would actually be possible and even reasonable to power off the MaliGP (this was the same with MBX equipped with a VGP, but I'm not sure you could power gate it in the same way).
Yes, that is perfectly correct, and so obviously in Mali's specific case you could use power gating for the geometry processing. In the case of NVIDIA, they don't do binning/tiling of any sort so they obviously couldn't do that (at least not for the whole unit - the APX 2500's version is relatively high-end and can be scaled down 2x or possibly more, so maybe they could power gate half of it all the time if VS requirements aren't very high); I tend to confuse Tegra's 3D core and Mali on a few things, heh...

Hmmm, the legend at the bottom of the graphic is not clear about which components are which.
It's not but they say it's 5.5mm² while the total chip is 60mm²; based on that it is very easy to see what it is... (or at least what TechInsights thinks it is, but it makes perfect sense from a size POV at least given PowerVR's claims)

I think PowerVR refers to SGX530 as being two shader "pipes" rather than cores. The premise being that you can share infrastructure between the two pipes and save area (common in programmable architectures) rather than stamping down two identical cores.
I don't care how they refer to them; they are cores. And when I say cores, I mean real cores, not the kind of marketing hyperbole NVIDIA goes in for. SGX's shader 'pipes' are full-blown VLIW FP32 processors with 16 concurrent threads (and 4 being prepared pre-ALU at the same time, in the same stages; I will leave it as an exercise to the reader to figure out why this saves valuable register file die space and power; it's really not any different from PC GPUs, but very different from Larrabee, which however benefits more from its L1 cache...)

Is that not allowed? :D
Of course it is, it was just funny because I was making the extra effort to be as objective as possible here and not voice clear opinions, while you come along and decide to say things like that - it's funny, that's all! :D
 