ARM Mali-400 MP
Rob Evans
02-Jun-2008, 11:43
A competitor for Imagination Technologies SGX?
From http://www.prnewswire.co.uk/cgi/news/release?id=228914
ARM Mali-400 MP Technology Brings High-End Graphics Performance to All Consumer Devices
CAMBRIDGE, England, June 2 /PRNewswire/ --
- Multicore Graphics Solution Revolutionizes User Experiences With Pioneering Scalability
ARM ((LSE:ARM); (Nasdaq:ARMH)) today announced the ARM(R) Mali(TM)-400 MP scalable multiprocessor graphics solution, capable of delivering performance of up to 1G pixels per second and enabling licensees to serve multiple product markets with the same architecture, whilst retaining the flexibility to choose the optimum power, performance and area configuration for their application. The pioneering Mali-400 MP architecture offers breakthrough scalability, also reducing costs for developers and OEMs associated with platform fragmentation, as no changes are required to support one to four processors.
"Architectural reuse of software and hardware components is of increasing importance to SoC developers," said Frank Dickson, co-founder and chief research officer, MultiMedia Intelligence. "The scalability of the ARM Mali-400 MP GPU, from 300 million to over 1 billion pixels per second, will enable OEMs to deliver a wide range of market-leading products on the same underlying architecture, reducing their total cost of ownership and maximizing ROI."
The ARM family of Mali GPUs is opening up new product markets that will benefit from graphics acceleration, from mobile feature phones through to 1080p-based iDTVs. Recent Mali GPU licensees in the set-top box market mean that the consumer experience in the home is set for a radical change.
"We see an increasing need for pixel processing of up to 1G pixels in the home as HD screens become ubiquitous," said Ola Larsén, vice president of marketing at TAT, an ARM Mali Developer Relations Program Partner. "The set-top box and digital TV user interface will never be the same once graphics acceleration such as Mali technology becomes the standard."
The Mali-400 MP solution builds on ARM's experience and knowledge gained in the widely-adopted ARM MPCore(TM) technology, implemented in the ARM11(TM) MPCore and Cortex(TM)-A9 MPCore multicore processors, designed to reduce system bandwidth, optimize performance and reduce power consumption.
"The battle for consumers' attention across all consumer electronics product markets means that the graphics acceleration capability of these devices is rapidly becoming a must-have feature. Consumers expect an equally compelling user experience when accessing content from their mobile and digital home entertainment devices," said Michael Dimelow, director of marketing, Media Processing Division, ARM. "The ARM Mali-400 MP graphics solution enables our customers to bring scalable graphics processing performance to a wider range of product markets and more economically than before."
Power consumption and area efficiency are key aspects of the Mali-400 MP GPU design. The ability to scale performance to meet different price points and power budgets allows the Mali-400 MP GPU to address the widest possible market while bringing significant cost benefits through utilizing the same single software stack across multiple devices.
The ability to capture the attention of consumers through the display on their electronic devices, as demonstrated by a range of high-end designs today, is driving the need for hardware graphics acceleration in a wide range of consumer devices. The scalable multicore graphics processing design of the Mali-400 MP solution will help bring graphics acceleration to almost any device with a screen.
Availability
The ARM Mali-400 MP GPU is available for licensing today. For more information about the ARM Mali graphics stack, please visit: http://www.arm.com/products/esd/multimediagraphics_home.html.
roninja
06-Jun-2008, 14:38
SGX is better!
Laurent06
09-Jun-2008, 07:48
SGX is better!
That was constructive :)
roninja
09-Jun-2008, 10:20
I know, I thought I would just be to the point. But in all seriousness, the competitive environment is heating up. I just think the PowerVR folks have a lead in this area, but the likes of both Nvidia and ARM can still catch up....
roninja
10-Jun-2008, 11:51
credit to Rob Evans for unearthing this article as well....
Multi-core GPU in ARM strategy
ARM advances its graphics processing strategy with multi-core versions of Mali 3D graphics processor cores
EW 11-17 JUNE 2008 ElectronicsWeekly.com
DAVID MANNERS
ARM has laid out its graphics processor strategy with scalable multi-processor versions of its Mali 3D graphics processing cores.
The family, called Mali-400, has four variants: single core, dual core, triple core and four core. The four core can deliver a graphics processing performance of a billion pixels a second or 30 million triangles a second.
“It has the lowest memory bandwidth of any GPU (graphics processing unit), which directly translates into lower power consumption,” Chris Porthouse, senior product manager for media hardware at ARM, told Electronics Weekly.
Low memory bandwidth translates into lower power usage because the GPU is not writing and reading from memory all the time.
The single core version delivers 275 million pixels per second, the dual core delivers 550 million pixels per second, and the triple core delivers 825 million pixels per second. The 275 million to 1 billion pixel per second spread allows licensees to make a range of products using the same architecture, and the same software stack, which cuts down their costs, while ARM, as an IP vendor, can spread its cost of developing Mali across many users.
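The per-core figures above follow a simple linear pattern. As a minimal sketch (the function name is made up for illustration, and perfectly linear scaling is an assumption read off the quoted numbers, not a statement about the hardware), the pixel rates can be reproduced like this:

```python
# Per-core peak fill rate quoted in the article, in pixels per second.
PIXELS_PER_CORE = 275_000_000


def mali400_pixel_rate(cores: int) -> int:
    """Peak pixel rate, assuming it scales linearly with core count.

    Hypothetical helper: linear scaling is an assumption taken from the
    275M / 550M / 825M figures quoted in the article.
    """
    if not 1 <= cores <= 4:
        raise ValueError("the article describes one to four cores")
    return cores * PIXELS_PER_CORE


for n in range(1, 5):
    print(f"{n} core(s): {mali400_pixel_rate(n) / 1e6:.0f} Mpixels/s")
```

Under that assumption the four core variant comes out at 1,100 Mpixels/s, which fits the press release wording of "over 1 billion pixels per second".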
The Mali-400 is designed for 65nm process technologies. The dual core version occupies nine square millimetres of silicon. “We have customers who are asking us for 32nm and 22nm versions,” said Porthouse, “it scales to 32nm quite easily.”
ARM has been talking to customers and expects to sign the initial licenses for Mali-400 shortly. “We are very close to signing licences,” said Porthouse.
Clearly the big market is smartphones, where every phone will need a GPU, and ARM expects 600 million smartphones to be made every year by 2012.
The less highly featured smartphones will not all have GPUs but will still represent a 400 million annual unit market for GPUs by 2012, reckons ARM.
Other products which may in part use GPUs are portable media players, which ARM reckons could be a 200 million unit market by 2012 and satnav, a 65 million unit market by 2012.
High definition TVs are driving a rapidly growing market for high-end GPUs in set-top boxes, representing a potential 231 million unit market by 2012, and in digital TV, which could be a market of 113 million units by 2012.
So ARM is looking at a substantial target market for the multi-processor graphics core in the next four years.
-End-
Rob Evans
10-Jun-2008, 14:27
“It has the lowest memory bandwidth of any GPU (graphics processing unit), which directly translates into lower power consumption,” Chris Porthouse, senior product manager for media hardware at ARM, told Electronics Weekly.
Interesting that they believe they have a lower bandwidth solution than IMG's PowerVR series.
Rob.
http://www.arm.com/news/23608.html
ARM signed 13 processor licenses in Q3. The quarter was characterised by licensing of ARM® technologies across the portfolio, with licenses being signed for the ARM7™, ARM9™, ARM11™ and Cortex™ processor families, as well as for the Mali™ graphics processor, including with STMicroelectronics who licensed ARM’s latest graphics processor, the Mali 400MP GPU.
Hah. I wonder what'll happen with Qualcomm then - unless they'll actually purchase AMD's handheld division? I would actually expect them to given how much of their IP is sourced from them, FWIW... And congratulations to the ex-Falanx guys/ARM! :)
Of course as is the nature of press releases, the most interesting things are what they don't say.
For example, what clock frequencies are used to get the stated performance figures, and at what power cost? No size is stated for the quad-core part.
The PR also says
"The four core can deliver a graphics processing performance of billion pixels a second or 30 million triangles a second."
which seems strange because the docs on the ARM site
http://www.arm.com/miscPDFs/21863.pdf
indicate that the dual core can do 30M tri/s and 550M pixels. If the data is true, it would seem to say that there is no extra triangle capability with quad over dual?
The latest SGX datasheet states that at the high end, its IP can do 100M tri/s and 4B pixels per sec. This is at a 20mm size and at 200MHz. So that's the high-end crown taken.
The smallest SGX part is 1.5mm square and still does 200M pixels per sec. The smallest Mali-400 is 5.5mm. To get down to 1mm, you have to go to Mali-55, which doesn't have any vertex engine and thus isn't OpenGL ES 2.0 compliant, and has 1/2 the triangle performance of the 1.5mm SGX part, which is OpenGL ES 2.0 compliant. So that's the low end taken too.
...
The PR also says
"The four core can deliver a graphics processing performance of billion pixels a second or 30 million triangles a second."
which seems strange because the docs on the ARM site
http://www.arm.com/miscPDFs/21863.pdf
indicate that the dual core can do 30M tri/s and 550M pixels. If the data is true, it would seem to say that there is no extra triangle capability with quad over dual?
...
Beware of marketing fud, they appear to only use a single MaliGP irrespective of the number of cores so the geometry throughput does not scale with cores laid down...
John.
SGX 530 is 2 TMUs, SGX 520 is 1 TMU, SGX 510 is canned. Clock speeds for SGX, Mali, Imageon and GoForce are all similar on 65nm, which is to say low to mid 100MHz (sigh...)
Anyway a single-core Mali 400MP is most similar to a SGX 530 in terms of theoretical performance excluding potential boosts related to TBDR (avoiding a Z-Pass and/or shading fewer useless pixels). In this context, their perf/mm² don't seem to be fundamentally different.
SGX 530 is 2 TMUs, SGX 520 is 1 TMU, SGX 510 is canned. Clock speeds for SGX, Mali, Imageon and GoForce are all similar on 65nm, which is to say low to mid 100MHz (sigh...)
Anyway a single-core Mali 400MP is most similar to a SGX 530 in terms of theoretical performance excluding potential boosts related to TBDR (avoiding a Z-Pass and/or shading fewer useless pixels). In this context, their perf/mm² don't seem to be fundamentally different.
Well, at the low end IMG offer a core which is much smaller yet still ES2.0 compliant, and at the higher end the equivalent perf Mali MP solution is going to be quite a bit bigger than the single core SGX solution (which may also have higher poly throughput).
To be clear, I don't think there's anything wrong with multi-core as a concept (IMG have already done this in the past, after all); however, I just don't think ARM have got it right in terms of the base core that they are scaling from, in terms of hitting good area efficiency/performance per unit area.
John.
John,
Point taken regarding the "cores"; however, it's their term, and the quad core one and the dual core one appear to have exactly the same tri/s rate (but the pixel rate is different).
SGX 530 is 2 TMUs, SGX 520 is 1 TMU, SGX 510 is canned. Clock speeds for SGX, Mali, Imageon and GoForce are all similar on 65nm, which is to say low to mid 100MHz (sigh...)
What you say is correct for the mobile phone space.
However SGX530 is used in the System Controller Hub that is the chipset for the Z series Atom processor. There are 3 SKUs and 2 of them clock the graphics at 200MHz. Depending on which spec sheet you look at, the SCH chip is either fabbed at 90nm or 130nm.
There is a rumoured refresh of Menlow coming out that suggests a different SCH SKU. It will be interesting to see if the graphics core is changed; they may also take the opportunity to fab it differently.
http://www.digitimes.com/NewsShow/NewsSearch.asp?DocID=PD000000000000000000000000007 764&query=MENLOW
What you say is correct for the mobile phone space.
However SGX530 is used in the System Controller Hub that is the chipset for the Z series Atom processor. There are 3 SKUs and 2 of them clock the graphics at 200MHz. Depending on which spec sheet you look at, the SCH chip is either fabbed at 90nm or 130nm.
That's a SGX 535, but yes, it's available in two SKUs: one at 100MHz and without the PowerVR HD video decoder, and one at 200MHz with the latter.
What's much more interesting is what will happen with Moorestown, where instead of being manufactured on 130nm (definitely not 90nm), it will be on 45nm (!!!) so clocks might increase quite a bit if they don't screw up! :) AFAIK, it's still a SGX 535 though...
arjan de lumens
07-Nov-2008, 18:30
If the data is true, it would seem to say that there is no extra triangle capability with quad over dual?
This is indeed correct. Mali-400MP is not a unified shader (it has separate processing units for vertex processing and fragment processing), and we decided against making the vertex processing section scalable this time around. The 30Mtris/s is basically the upper limit of what our current vertex/binning processing unit can do; the fragment processors can do ~18Mtris/s per core.
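To make the numbers above concrete, here is a rough sketch that treats the vertex/binning unit and the fragment cores as two stages in series, so the slower one sets the peak rate. The min-of-two-stages model and the function name are my own simplification for illustration; only the 30 Mtris/s and ~18 Mtris/s figures come from the post.

```python
# Figures from the post above, in millions of triangles per second.
VERTEX_LIMIT_MTRIS = 30.0        # single vertex/binning unit, does not scale
FRAGMENT_MTRIS_PER_CORE = 18.0   # approximate, per fragment core


def peak_mtris(fragment_cores: int) -> float:
    """Peak triangle rate under a simple bottleneck model (an assumption).

    The slower of the two stages limits throughput: the fixed vertex unit
    or the combined fragment cores.
    """
    if fragment_cores < 1:
        raise ValueError("need at least one fragment core")
    return min(VERTEX_LIMIT_MTRIS, FRAGMENT_MTRIS_PER_CORE * fragment_cores)


for n in (1, 2, 3, 4):
    print(f"{n} fragment core(s): {peak_mtris(n):.0f} Mtris/s")
```

In this model the dual core and quad core configurations both land on the 30 Mtris/s ceiling, which is the behaviour being asked about.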
Moorestown was described as having 50% more graphics performance than the SCH, so a difference in clock speed and not SGX variant seems likely.
This is indeed correct. Mali-400MP is not a unified shader (it has separate processing units for vertex processing and fragment processing), and we decided against making the vertex processing section scalable this time around. The 30Mtris/s is basically the upper limit of what our current vertex/binning processing unit can do; the fragment processors can do ~18Mtris/s per core.
Of course you also won't be able to handle higher VS load, so it's not only poly throughput that doesn't scale....
John.
Well, at the low end IMG offer a core which is much smaller yet still ES2.0 compliant, and at the higher end the equivalent perf Mali MP solution is going to be quite a bit bigger than the single core SGX solution (which may also have higher poly throughput).
To be clear, I don't think there's anything wrong with multi-core as a concept (IMG have already done this in the past, after all); however, I just don't think ARM have got it right in terms of the base core that they are scaling from, in terms of hitting good area efficiency/performance per unit area.
John.
You're still talking about the cancelled SGX510, right?
You're still talking about the cancelled SGX510, right?
No, I'm talking about SGX520 which is smaller than any OGL ES2.0 capable part offered by ARM.
John.
If the SGX530 is ~8mm2, what's the area of the 520? 6mm2?
Don't think current sizing info on 520 is public...
Simon F
20-Nov-2008, 07:18
It is now: http://www.imgtec.com/News/Release/index.asp?NewsID=411
POWERVR SGX520 is less than 2.6mm2 in TSMC 65LP process.
roninja
20-Nov-2008, 15:19
That's a pretty small size, could describe it as being "nano" centric.....
Ailuros
20-Nov-2008, 18:52
That's a pretty small size, could describe it as being "nano" centric.....
Apparently the SGX factsheet has been adjusted too.
At 65nm 200MHz for SGX520 up to SGX545:
7-40M Triangles/s
250-1000 MPixels/s
2.6-12.5mm^2
Meaning SGX520 can achieve 250 MPixels/s (including overdraw) and 7 M Tris/s at 200MHz, with a <50% shader load. If the 520 could achieve as many pixels/clock as its larger brothers, it would beat a KYRO senseless at merely 2.6mm^2.... :shock:
ARM's smallest Mali core that is OpenGL ES 2.0 compliant is the Mali200, and at 65nm it's 5mm square, which makes it just about twice the size of SGX520, and yet it is only slightly higher performing (9M triangles and 275M pixels vs 7M triangles and 250M pixels).
The one thing missing from both spec sheets, of course, is power consumption.
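For what it's worth, the area comparison above can be put into numbers. This is only a sketch using the figures quoted in this thread; it assumes "5mm square" means 5 mm² of silicon and that both sets of figures are at comparable clocks, neither of which the spec sheets confirm. The class and field names are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    """Quoted datasheet figures for one GPU core (values from this thread)."""
    name: str
    area_mm2: float       # assumed to be mm^2 at 65nm
    mpixels_per_s: float  # millions of pixels per second
    mtris_per_s: float    # millions of triangles per second

    def mpixels_per_mm2(self) -> float:
        return self.mpixels_per_s / self.area_mm2

    def mtris_per_mm2(self) -> float:
        return self.mtris_per_s / self.area_mm2


SGX520 = Core("SGX520", area_mm2=2.6, mpixels_per_s=250.0, mtris_per_s=7.0)
MALI200 = Core("Mali200", area_mm2=5.0, mpixels_per_s=275.0, mtris_per_s=9.0)

for core in (SGX520, MALI200):
    print(f"{core.name}: {core.mpixels_per_mm2():.1f} Mpix/s per mm^2, "
          f"{core.mtris_per_mm2():.2f} Mtris/s per mm^2")
```

On those quoted numbers the SGX520 comes out somewhat under twice the Mali200 on pixels per unit area, but as noted, none of this says anything about power.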
ARM's smallest Mali core that is OpenGL ES 2.0 compliant is the Mali200, and at 65nm it's 5mm square, which makes it just about twice the size of SGX520, and yet it is only slightly higher performing (9M triangles and 275M pixels vs 7M triangles and 250M pixels).
The one thing missing from both spec sheets, of course, is power consumption.
I believe the 5mm for Mali200 excludes the Mali GP; I suspect it's closer to 6.5mm with that included.
Power is roughly proportional to area, so assuming Mali200 is as aggressive with its clock gating as SGX then Mali200 will consume ~2x the power. This does however ignore the fact that Mali isn't a deferred rendering device so in the presence of overdraw its power consumption is likely to be higher per unit area for both core and IO.
John.
This does however ignore the fact that Mali isn't a deferred rendering device so in the presence of overdraw its power consumption is likely to be higher per unit area for both core and IO.
John.
according to the feature list in this pdf, mali is both tile-based and deferred rendering....but they also mention immediate rendering ?
http://www.arm.com/miscPDFs/21863.pdf
Power is roughly proportional to area, so assuming Mali200 is as aggressive with its clock gating as SGX then Mali200 will consume ~2x the power. This does however ignore the fact that Mali isn't a deferred rendering device so in the presence of overdraw its power consumption is likely to be higher per unit area for both core and IO.
Not only is clock gating important, i.e turning off the bits that are not needed at any one time, but it may well be that one or other solution inherently results in more of the chip being able to be turned off at any one time.
according to the feature list in this pdf, mali is both tile-based and deferred rendering....but they also mention immediate rendering ?
http://www.arm.com/miscPDFs/21863.pdf
They're an early Z based tiler not a deferred tiler, they mention deferred rendering in a rather obscure context, but I know for a fact that they are early Z tile based. The mention of IMR is just random marketing, they are fundamentally a tile based renderer, end of story.
Not only is clock gating important, i.e turning off the bits that are not needed at any one time, but it may well be that one or other solution inherently results in more of the chip being able to be turned off at any one time.
Of course clock gating is important, I specifically said "assuming Mali200 is as aggressive with its clock gating as SGX". They are a tile based renderer; if anything this offers less opportunity for clock gating than a deferred tile based renderer due to how the latter's pipeline fits together.
John.
They are a tile based renderer; if anything this offers less opportunity for clock gating than a deferred tile based renderer due to how the latter's pipeline fits together.
Yeah, but then again they're not unified so that means more opportunities for clock gating. Doesn't necessarily mean lower *overall* power consumption, but it does have an effect on average power consumption per mm² - thus given all of these factors, the latter seems like a very problematic metric to use here! :)
Yeah, but then again they're not unified so that means more opportunities for clock gating. Doesn't necessarily mean lower *overall* power consumption, but it does have an effect on average power consumption per mm² - thus given all of these factors, the latter seems like a very problematic metric to use here! :)
LOL Interesting point, make your core bigger with loads of logic that sits around idle so that you can claim lower power per mm^2, although even in an LP process leakage does need to be factored into this...
Perhaps I shouldn't have used the term "unit area" ;)
John.
LOL Interesting point, make your core bigger with loads of logic that sits around idle so that you can claim lower power per mm^2, although even in an LP process leakage does need to be factored into this...
A certain bright green company seems to have followed that train of thought to an extreme too! :) (there's a patent on a pre-pixel shader stage from them that is basically a systematic waste of area with no way to improve performance, but it can save a tiny bit of power, sometimes, if the stars are aligned right)
However I would argue the main advantage of non-unified hardware from a power POV is to be able to use FP24 instead of FP32 in the pixel shader. If you don't care about having MIMD everywhere, it also allows you to have higher branching granularity in the PS than in the VS; how much that helps you depends a lot on how naive your architecture is though, ofc... (and what real-world handheld applications are & will be like)
In the end though, anyone not having a fully unified architecture with maximum utilization all the time on 28nm is playing a very dangerous game. Right now handheld 3D cores are always implemented in minimum-leakage technology, but given that triple-gate oxide is standard on 28nm (at least for TSMC, certainly not for IBM!) that seems like an awful plan to me compared to the alternatives.
Perhaps I shouldn't have used the term "unit area" ;)
Heheh, indeed you probably shouldn't have! :D
..
However I would argue the main advantage of non-unified hardware from a power POV is to be able to use FP24 instead of FP32 in the pixel shader. If you don't care about having MIMD everywhere, it also allows you to have higher branching granularity in the PS than in the VS; how much that helps you depends a lot on how naive your architecture is though, ofc... (and what real-world handheld applications are & will be like)
Going forward FP24 won't cut it in the PS; part of the reason for this is that these devices are already being applied to UIs running at HD resolutions, making FP24 marginal for texture calculations. Throw GP-GPU into the mix and FP24 doesn't cut it at all. Coarser branch granularity also falls foul of GP-GPU, and it is possible to architect for the finest level of branch granularity without adding significant area to the design imo.
John.
Going forward FP24 won't cut it in the PS; part of the reason for this is that these devices are already being applied to UIs running at HD resolutions, making FP24 marginal for texture calculations.
Obviously FP24 per-se shouldn't be a problem for high target resolutions, so I presume you're thinking of very large textures? I have some difficulty believing this is a big problem even at 1080p, but heh! :)
Throw GP-GPU into the mix and FP24 doesn't cut it at all. Coarser branch granularity also falls foul of GP-GPU, and it is possible to architect for the finest level of branch granularity without adding significant area to the design imo.
I definitely agree with both points, but it's not my fault PowerVR's competitors don't seem able to figure out how to implement efficient MIMD to save their lives! ;) Obviously Imagination's processor/DSP heritage with META helps a lot there.
Obviously FP24 per-se shouldn't be a problem for high target resolutions, so I presume you're thinking of very large textures? I have some difficulty to believe this is a big problem even at 1080p, but heh! :)
FP24 gives you 15 bits of mantissa, 1:1 HD textures are 1920 wide, so you need 11 bits to achieve texel level addressing, this leaves you with 4 bits of sub texel accuracy, I consider this borderline, but artifacts do depend on the application.
Incidentally, could be wrong but I thought ARM's fragment shaders were restricted to FP16, or maybe that was the old Bitboys part <shrugs>...
John.
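The bit budget argument in the post above can be sketched as a few lines of arithmetic. This is only an illustration: the 15-bit mantissa figure is taken from the post as-is, the function names are invented, and the model ignores details such as an implicit leading bit or how texture coordinates are actually normalised in any given hardware.

```python
def texel_address_bits(texture_width: int) -> int:
    """Bits needed to address every texel across one texture dimension."""
    if texture_width < 1:
        raise ValueError("texture width must be positive")
    return (texture_width - 1).bit_length()


def subtexel_bits(mantissa_bits: int, texture_width: int) -> int:
    """Mantissa bits left for sub-texel positioning (simplified model)."""
    return mantissa_bits - texel_address_bits(texture_width)


FP24_MANTISSA_BITS = 15   # figure quoted in the post, treated as an assumption
HD_TEXTURE_WIDTH = 1920   # a 1:1 texture at HD width

bits = subtexel_bits(FP24_MANTISSA_BITS, HD_TEXTURE_WIDTH)
print(f"address bits: {texel_address_bits(HD_TEXTURE_WIDTH)}")
print(f"sub-texel bits: {bits}, step = 1/{2 ** bits} texel")
```

With these inputs the model gives 11 address bits and 4 sub-texel bits, matching the figures in the post; whether a 1/16 texel step is visible depends on the application, as stated.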
FP24 gives you 15 bits of mantissa, 1:1 HD textures are 1920 wide, so you need 11 bits to achieve texel level addressing, this leaves you with 4 bits of sub texel accuracy, I consider this borderline, but artifacts do depend on the application.
Yes but if it's applied 1:1 I'm not sure I see how it could go wrong? Surely you don't need to rotate it or anything like that... (or even if you did for a special effect it'd go fast enough that nobody would ever notice there's a 0.25 pixel error)
Incidentally, could be wrong but I thought ARM's fragment shaders were restricted to FP16, or maybe that was the old Bitboys part <shrugs>...
I think that was Bitboys, yeah... :)
http://web.archive.org/web/20051124174700/www.bitboys.fi/g40.php
http://web.archive.org/web/20050305122106/www.bitboys.fi/comparison.php
Yes but if it's applied 1:1 I'm not sure I see how it could go wrong? Surely you don't need to rotate it or anything like that... (or even if you did for a special effect it'd go fast enough that nobody would ever notice there's a 0.25 pixel error)
A lot of the examples we're seeing aren't 1:1 and the greater the zoom factor the more significant it becomes.
John.
Ailuros
24-Nov-2008, 05:55
I definitely agree with both points, but it's not my fault PowerVR's competitors don't seem able to figure out how to implement efficient MIMD to save their lives! ;) Obviously Imagination's processor/DSP heritage with META helps a lot there.
Sorry for the OT, but albeit the last sentence shouldn't be wrong per se, I'd say that Metagence was created somewhere in the middle of IMG's history based on PowerVR experience. Now you may shoot me and carry on :P
***edit: as for the higher texture needs, if there hadn't been such needs IMG wouldn't have rushed out and inserted the SGX531 without a reason. There's still the question of whether they'll have anything beyond the 545 and, if yes, what it's going to look like.
TheArchitect
25-Nov-2008, 11:22
FP24 gives you 15 bits of mantissa, 1:1 HD textures are 1920 wide, so you need 11 bits to achieve texel level addressing, this leaves you with 4 bits of sub texel accuracy, I consider this borderline, but artifacts do depend on the application.
Incidentally, could be wrong but I thought ARM's fragment shaders were restricted to FP16, or maybe that was the old Bitboys part <shrugs>...
John.
It's not the bits you've got, it's what you do with them that counts :wink:
Input and storage precision are not the same as intermediate result precision. There are ways of managing numerical computation in an architecture such that you don't need to maintain a complete FP24 pipe to maintain accuracy.
Besides which Mali200 is still the only IP Core to have achieved Khronos conformance at 1080p resolution... so evidently it's not as much of a problem as people seem to think.
You'd think if SGX was capable of passing conformance at 1080p they'd have press released it by now (they press release every other bleeping thing).
TheArchitect
25-Nov-2008, 12:43
Two notes of caution here :-
It's well known in the industry that ARM has a track record of conservatively estimating their core sizes; the PowerVR guys can be a little more, errr, "creative" shall we say.
Similarly, don't just take it as read that the performance numbers are correct. I worked with one of the chips implementing the original MBX and it was nowhere near the performance envelope stated in their material. Remember SGX is a unified shader - ask yourself, are they quoting SGX peak fill rate with the core 100% dedicated to fragment processing? Similar question goes for vertex processing...
Lies, lies, damn lies and GPU marketing material and all that.
On the power consumption front there are a number of variables to take into account.
Total power efficiency for the GPU core will depend on the number of gates in the core, the number and area of the RAMs in the core, how many of those are active at any one time and (this is the key bit you've missed so far) the amount of external BW consumed by the GPU core.
Not to trivialise it though: the gate/RAM area is a big issue without power gating. Below 65nm, static power consumption through leakage is a big deal, so the SGX would seem to have the edge over Mali there. However, if the utilisation of the core is 100% during a rendering phase then there is no/limited opportunity to power gate (you need to keep the gates powered up to do the work), and this is where the SGX gets let down.
SGX being a unified shader architecture, its compute core is shared between vertex and fragment processing (which incidentally is probably why it's smaller). It attempts to load balance using some hoopy hyper threading system; this will likely have the effect that the core is active a lot more of the time, meaning aggressive power gating really may not buy you that much. Mali has the advantage that MaliGP can be completely power gated after it's finished processing (that's about 30% of the architecture powered off). That's gotta be worth something!
Another factor that plays in here is the number and size of the RAM instances in the design. I don't know the ins and outs of the implementation of SGX (I haven't seen any die shots I can analyse), but to keep a hyper threaded unified shader architecture fed they probably have a big ass cache RAM to context switch in and out of to keep the thing ticking over. That's gonna cost big on the power consumption front. As long as the core is active that RAM needs to stay powered up.
Mali has some neat (and patent protected) tricks up its sleeve in that regard. It doesn't have any context switch overhead thanks to a nifty trick of carrying the context with each thread. This means they have little or no pipeline flush overhead and no need for a munging great cache to store it.
Last thing you need to take into account is the memory bandwidth consumed by the two cores. External memory bandwidth to DRAM consumes stupid amounts of power and nothing hammers the crap out of memory quite like a GPU running 3D graphics.
I attended a tech seminar (come to think of it I think it was an ARM one) where they talked about external DRAM accesses being 10x the cost of internal computation in some cases. While I'm not sure I buy 10x, even 2x would be a significant effect and reducing the bandwidth used by a GPU would make a significant difference to overall power consumption.
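To show why even a 2x cost ratio matters, here is a toy model. Everything in it is an illustrative assumption: it treats a workload as a mix of internal operations costing 1 unit each and external DRAM accesses costing `dram_cost_ratio` units each, and the 20% DRAM fraction used below is a made-up number, not a measurement of any GPU.

```python
def relative_energy(dram_fraction: float, dram_cost_ratio: float) -> float:
    """Energy relative to a workload with no external DRAM accesses.

    Toy model (an assumption, not a measured relationship):
      dram_fraction   - share of operations that go to external DRAM
      dram_cost_ratio - cost of one DRAM access vs one internal operation
    """
    if not 0.0 <= dram_fraction <= 1.0:
        raise ValueError("dram_fraction must be between 0 and 1")
    if dram_cost_ratio < 0.0:
        raise ValueError("dram_cost_ratio must be non-negative")
    return (1.0 - dram_fraction) + dram_fraction * dram_cost_ratio


HYPOTHETICAL_DRAM_FRACTION = 0.2  # made-up value for illustration only

for ratio in (2.0, 10.0):
    energy = relative_energy(HYPOTHETICAL_DRAM_FRACTION, ratio)
    print(f"DRAM cost ratio {ratio:.0f}x -> relative energy {energy:.2f}")
```

The point of the sketch is only the shape of the relationship: halving the DRAM fraction cuts the DRAM term in half regardless of which cost ratio you believe.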
I've heard ARM make some pretty bold claims about Mali's BW reduction techniques. Whilst I don't have any first hand experience to confirm or deny those claims, I am told by trusted sources that they are on the level and that they do have an advantage compared to SGX with real world workloads. Enough to offset the size difference? I can't say, but interesting to note.
So what's my point? The above are just a few obvious things that I've observed which tell me it's impossible to make an apples for apples comparison of the two based on publicly available data. We are only getting a tiny glimpse of the whole picture.
Going on experience I'd say PowerVR will over promise and under deliver on the SGX, but they'll sell a bundle of them anyway and so we'll suffer more mediocre graphics experiences on handsets for another generation. ARM are winning designs away from PowerVR however, so there must be something in this that's making sense to some big names.
As for the Mali400 MP, I think it is a very poorly thought out product. If you are going to introduce a multi core scalable product, why the hell not scale both fragment and vertex shader cores? This smacks of something nailed together in a hurry to meet some spurious customer request if you ask me (wonder if that's anything to do with them losing one of their key strategic technical people earlier in the year...).
Anyway lets hope they get more of a clue with the next one and give PowerVR a real run for their money.
oh my....thats quite a thought provoking post, not bad for your 2nd one !
RussSchultz
25-Nov-2008, 13:30
Anyway lets hope they get more of a clue with the next one and give PowerVR a real run for their money.
Given that the only two IP vendors left in that space seem to be ARM and PowerVR, yes. Lets hope they give them a real run for their money.
Its not the bits you've got its what you do with them that counts :wink:
Input and storage precision are not the same as intermediate result precision. There are ways of managing numerical computation in an architecture such that you don't need to maintain a complete FP24 pipe to maintain accuracy.
Besides which Mali200 is still the only IP Core to have achieved Khronos conformance at 1080p resolution... so evidently it's not as much of a problem as people seem to think.
You'd think if SGX was capable of passing conformance at 1080p they'd have press released it by now (they press release every other bleeping thing).
Although the precision within the texturing pipeline itself can be optimised per stage and can in some places be reduced, we're talking about the shader pipe here; to make your shader pipe FP24 you cannot drop its precision below FP24. This is of course secondary to the fact that having only 4 bits of sub-texel precision is borderline, imo, in terms of your ability to accurately sample textures.
In terms of precision and its effect on texture sampling, it is the resolution of the source textures that has an impact, not the target resolution, i.e. sub-pixel resolution does not change with screen size, so it is unlikely that target resolution will have any impact on the results of the conformance tests.
John.
Simon F
25-Nov-2008, 13:34
You'd think if SGX was capable of passing conformance at 1080p they'd have press released it by now (they press release every other bleeping thing).
I'm sure if a customer actually wanted a specific resolution tested it would be done. FWIW, of those companies who do actually have conformant OpenGL ES 2.0 systems, most seem to have opted for a "box standard" VGA resolution when doing the test. <shrug>
I worked with one of the chips implementing the original MBX and it was nowhere near the performance envelope stated in their material.
Well, there are a range of MBX models and SOCs that they are put into, and clearly some perform better than others (http://www.glbenchmark.com/result.jsp). I can't comment on an individual case as I don't know their "innards" in enough detail.
TheArchitect
25-Nov-2008, 14:26
Given that the only two IP vendors left in that space seem to be ARM and PowerVR, yes. Lets hope they give them a real run for their money.
Actually Vivante have been showing signs of life lately. They showed some silicon running at an event recently.
I also heard a rumour that Matrox were selling IP now as well.
TheArchitect
25-Nov-2008, 14:37
Well, there are a range of MBX models and SOCs that they are put into, and clearly some perform better than others (http://www.glbenchmark.com/result.jsp). I can't comment on an individual case as I don't know their "innards" in enough detail.
You could have picked a better benchmark to illustrate your point. GLBenchmark is soooooo bad.
Isn't it the same benchmark that claims that some MBX implementations don't support bi-lerp, MIP mapping, etc., when actually they are key hardware features?
TheArchitect
25-Nov-2008, 14:46
In terms of precision and its effect on texture sampling, it is the resolution of the source textures that has an impact, not the target resolution, i.e. sub-pixel resolution does not change with screen size, so it is unlikely that target resolution will have any impact on the results of the conformance tests.
John.
Hold on a minute, how many embedded systems do you know that actually have enough room to store a 1920x1080x32 texture (8 MB for the top MIP level), let alone have the need to zoom into it by nearly 16x?????
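For reference, the 8 MB figure can be checked with a quick calculation. This is only a sketch: it assumes 4 bytes per texel (32-bit colour) and an uncompressed texture, and the function names are made up for illustration.

```python
def level_bytes(width, height, bytes_per_texel=4):
    """Size in bytes of a single uncompressed MIP level."""
    return width * height * bytes_per_texel


def mip_chain_bytes(width, height, bytes_per_texel=4):
    """Total size of the full MIP chain, halving each dimension
    (rounding down, minimum 1) until the 1x1 level is reached."""
    total = 0
    while True:
        total += level_bytes(width, height, bytes_per_texel)
        if width == 1 and height == 1:
            break
        width, height = max(1, width // 2), max(1, height // 2)
    return total


top = level_bytes(1920, 1080)        # 8,294,400 bytes
chain = mip_chain_bytes(1920, 1080)  # roughly a third larger than the top level
print(f"top level:  {top / 2**20:.2f} MiB")
print(f"full chain: {chain / 2**20:.2f} MiB")
```

So "8 MB" is about right for the top level alone; a full MIP chain would add roughly another third on top of that.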
Well I suppose, viewing JPG stills maybe with some zoom, but then you can do a partial decode on them to limit the source texture size so it's not a problem.
Or post processing of HD video, but then that's not likely to need to be zoomed by 16x...
I'd like to see your use case...
Two notes of caution here :-
It's well known in the industry that ARM has a track record of conservatively estimating their core sizes, the PowerVR guys can be a little more errr, "creative" shall we say.
IMG figures are actual synthesis figures in the same way ARM's are claimed to be.
Similarly, don't just take it as read that the performance numbers are correct. I worked with one of the chips implementing the original MBX and it was nowhere near the performance envelope stated in their material.
It is well known that not all MBX systems are alike; some clearly do hit our performance claims, which suggests that the performance of MBX itself was exactly as stated.
Remember SGX is a unified shader - Ask yourself are they quoting SGX peak fill rate with the core 100% dedicated to fragment processing? Similar question goes for vertex processing...
SGX figures are quoted at 50% shader load for fill or poly throughput. The reality is that I've only ever seen contrived cases where a unified design doesn't win out over a similar area non unified design.
Lies, lies, damn lies and GPU marketing material and all that.
And of course ARM's marketing team only dish out absolutely truthful and factual information <rolls eyes>
On the power consumption front there are a number of variables to take into account.
Total power efficiency for the GPU core will depend on the number of gates in the core, the number and area of the RAMs in the core, how many of those are active at any one time and (this is the key bit you've missed so far) the amount of external BW consumed by the GPU core.
Not to trivialise it though: the gate/RAM area is a big issue without power gating. Sub-65nm static power consumption through leakage is a big deal, so the SGX would seem to have the edge over Mali there. However, if the utilisation of the core is 100% during a rendering phase then there is no/limited opportunity to power gate (you need to keep the gates powered up to do the work) and this is where the SGX gets let down.
SGX being a unified shader architecture, its compute core is shared between vertex and fragment processing (which incidentally is probably why it's smaller). It attempts to load balance using some hoopy hyper-threading system; this will likely have the effect that the core is active a lot more of the time, meaning aggressive power gating really may not buy you that much. Mali has the advantage that MaliGP can be completely power gated after it's finished processing (that's about 30% of the architecture powered off). That's gotta be worth something!
By being unified you expose maximum compute power to the problem at hand irrespective of being vertex or pixel bound; this increases the opportunities for idle power gating the entire core, which is the granularity that most power gating schemes work at at this time. Beyond this, finer granularity gating remains possible within those parts of the chip dedicated to vertex and pixel processing. Basically being unified is a net gain wrt power; if it wasn't we wouldn't have designed the architecture like we did.
Another factor that plays in here is the number and size of the RAM instances in the design. I don't know the ins and outs of the implementation of SGX (I haven't seen any die shots I can analyse), but to keep a hyper-threaded unified shader architecture fed they probably have a big ass cache RAM to context switch in and out of to keep the thing ticking over. That's gonna cost big on the power consumption front. As long as the core is active that RAM needs to stay powered up.
You seem to be making assumptions about the SGX architecture which aren't based on reality.
Mali has some neat (and patent protected) tricks up its sleeve in that regard. It doesn't have any context switch overhead thanks to a nifty trick of carrying the context with each thread. This means they have little or no pipeline flush overhead and no need for a munging great cache to store it.
You seem to be assuming that SGX has an inherent context switch overhead, again, this is simply an incorrect assumption.
Last thing you need to take into account is the memory bandwidth consumed by the two cores. External memory bandwidth to DRAM consumes stupid amounts of power and nothing hammers the crap out of memory quite like a GPU running 3D graphics.
I attended a tech seminar (come to think of it I think it was an ARM one) where they talked about external DRAM accesses being 10x the cost of internal computation in some cases. While I'm not sure I buy 10x, even 2x would be a significant effect and reducing the bandwidth used by a GPU would make a significant difference to overall power consumption.
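To put rough numbers on the 2x versus 10x point, here is a toy energy model. It is a sketch only: the workload mix (1000 internal operations per 100 DRAM accesses) is invented for illustration, energy is counted in units of "one internal operation", and none of it reflects measured figures for Mali or SGX.

```python
def relative_energy(compute_ops, dram_accesses, dram_cost_ratio):
    """Total energy in units of 'one internal operation'.

    dram_cost_ratio is how many times more expensive one external DRAM
    access is than one internal operation (2x, 10x, ...).
    """
    return compute_ops + dram_accesses * dram_cost_ratio


def saving_from_bandwidth_cut(compute_ops, dram_accesses, dram_cost_ratio, cut):
    """Fraction of total energy saved when DRAM traffic is reduced by `cut` (0..1)."""
    before = relative_energy(compute_ops, dram_accesses, dram_cost_ratio)
    after = relative_energy(compute_ops, dram_accesses * (1 - cut), dram_cost_ratio)
    return (before - after) / before


# Hypothetical workload: 1000 internal operations per 100 DRAM accesses.
for ratio in (2, 10):
    saved = saving_from_bandwidth_cut(1000, 100, ratio, cut=0.5)
    print(f"DRAM cost {ratio}x: halving bandwidth saves {saved:.0%} of total energy")
```

The higher the cost ratio, the more a given bandwidth reduction is worth, which is why the exact ratio matters so much to the argument.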
I've heard ARM make some pretty bold claims about Mali's BW reduction techniques. Whilst I don't have any first-hand experience to confirm or deny those claims, I am told by trusted sources that they are on the level and that they do have an advantage compared to SGX with real-world workloads. Enough to offset the size difference? I can't say, but interesting to note.
Perhaps you should allude to where you think this difference comes from, because I'm pretty certain that equivalent SGX consumes less BW than Mali in every instance.
So what's my point? The above are just a few obvious things that I've observed which tell me it's impossible to make an apples-to-apples comparison of the two based on publicly available data. We are only getting a tiny glimpse of the whole picture.
Going on experience I'd say PowerVR will over-promise and under-deliver on the SGX, but they'll sell a bundle of them anyway and so we'll suffer more mediocre graphics experiences on handsets for another generation. ARM are winning designs away from PowerVR however, so there must be something in this that's making sense to some big names.
The fact of the situation is that previous generations of Mali were pretty unimpressive when compared to the equivalent PowerVR cores; you've obviously decided the new cores are going to reverse this situation, which is odd given the absence of ratified public benchmark information.
As for the Mali400 MP, I think it is a very poorly thought out product. If you are going to introduce a multi-core scalable product, why the hell not scale both fragment and vertex shader cores? This smacks of something nailed together in a hurry to meet some spurious customer request if you ask me (wonder if that's anything to do with them losing one of their key strategic technical people earlier in the year...).
Anyway let's hope they get more of a clue with the next one and give PowerVR a real run for their money.
Nothing wrong with a bit of competition.
John.
Captain Chickenpants
25-Nov-2008, 15:17
Anyone for popcorn! :-)
Hold on a minute, how many embedded systems do you know that actually have enough room to store a 1920x1080x32 texture (8 MB for the top MIP level), let alone have the need to zoom into it by nearly 16x?????
Well I suppose, viewing JPG stills maybe with some zoom, but then you can do a partial decode on them to limit the source texture size so it's not a problem.
Or post processing of HD video, but then that's not likely to need to be zoomed by 16x...
I'd like to see your use case...
Obviously I can't talk about the applications our customers are using this technology for.
However the key point here is that at FP24 you have little headroom for other math when dealing with large textures, hence the statement that FP24 is borderline.
John.
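The headroom argument can be sketched numerically. The snippet below assumes "FP24" means 1 sign bit, 7 exponent bits and 16 explicit mantissa bits; the thread never states the layout, so treat that as an assumption. It does not try to reproduce the 4-bit sub-texel figure mentioned above, which depends on hardware details not given here.

```python
import math


def sub_texel_bits(texture_width, mantissa_bits=16):
    """Fractional (sub-texel) bits left when a normalised coordinate in
    [0.5, 1) is scaled to a texture `texture_width` texels wide.

    A float with `mantissa_bits` explicit mantissa bits has a spacing of
    2**-(mantissa_bits + 1) between adjacent values in [0.5, 1).
    """
    return (mantissa_bits + 1) - math.log2(texture_width)


for width in (256, 1024, 2048, 4096):
    print(f"{width:5d} texels wide: {sub_texel_bits(width):.1f} sub-texel bits")
```

Each doubling of texture width costs one fractional bit, and any further arithmetic on the coordinate in the shader eats into whatever is left.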
TheArchitect
25-Nov-2008, 15:39
Lies, lies, damn lies and GPU marketing material and all that.
And of course ARM's marketing team only dish out absolutely truthful and factual information <rolls eyes>
Well actually I was suggesting that *all* GPU marketing stretches the truth, but hey hoo.
First of all, welcome to the forum TheArchitect, enjoy your stay! :)
I don't want to take part in this Holy War too much, but here are a few quick points that hopefully can't be perceived as anything but fairly objective...
It's well known in the industry that ARM has a track record of conservatively estimating their core sizes, the PowerVR guys can be a little more errr, "creative" shall we say.
IMG figures are actual synthesis figures in the same way ARM's are claimed to be.
As JohnH implied, both are pre-layout. By looking at some of the META cores, I've come to the conclusion that sometimes PowerVR will indicate a clock target and a die size, but those clocks are for speed-optimized designs and the size is for area-optimized designs. It's not a lie per se (clocks are 'up to') of course, just overly aggressive marketing. But on the other hand, ARM (at least for CPU designs) has a tendency to more clearly associate a die size with a specific frequency target. It's not usually a massive difference, and this might not systematically be true (or it might be outdated) but it does seem noteworthy to me. Actually ARM seems to be doing the same with the Cortex-A9...
In the interest of looking overly balanced about these two companies, I would like to claim that neither is very trustworthy about die size estimates and it's always infinitely better to look at real numbers from finished designs - in fact this is true for most IP houses. The one exception I've seen is Tensilica, which *seems* stunningly honest about post-synthesis vs post-layout, clocks, etc.
By being unified you expose maximum compute power to the problem at hand irrespective of being vertex or pixel bound; this increases the opportunities for idle power gating the entire core, which is the granularity that most power gating schemes work at at this time.
I agree with John here, power gating only makes sense during long inactivity times. If your VS is 5x faster than required during part of the processing, it'll still need to be active 20% of the time and it's not viable to have an absurdly massive FIFO to let it idle for sufficiently long times. The fact MaliGP can be power gated individually makes sense and obviously can't hurt, but a unified architecture is still likely to benefit more from power gating in general. I haven't thought enough about deferred rendering in this context though to be sure if it has an impact of its own (good or bad) however.
I don't know the ins and outs of the implementation of SGX (I haven't seen any die shots I can analyse), but to keep a hyper-threaded unified shader architecture fed they probably have a big ass cache RAM to context switch in and out of to keep the thing ticking over. That's gonna cost big on the power consumption front. As long as the core is active that RAM needs to stay powered up.
As JohnH implied, you couldn't be any more wrong here: http://www.eetimes.com/news/design/rss/showArticle.jhtml?articleID=210003530&cid=RSSfeed_eetimes_designRSS
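The 5x / 20% point, and why short idle gaps are not worth gating, can be sketched as follows. The leakage and wake-up energy figures are purely hypothetical placeholders, not numbers for any real core.

```python
def duty_cycle(speedup):
    """Fraction of time a unit must be active if it is `speedup` times
    faster than the workload requires."""
    return 1.0 / speedup


def gating_pays_off(idle_seconds, leakage_watts, wake_energy_joules):
    """True if the leakage saved while gated exceeds the energy needed to
    power the block down and back up."""
    return idle_seconds * leakage_watts > wake_energy_joules


active = duty_cycle(5)  # 0.2, i.e. active 20% of the time
# Hypothetical figures: 10 mW leakage, 2 uJ to cycle the power domain.
short_gap = gating_pays_off(50e-6, 10e-3, 2e-6)  # 50 us idle gap
long_gap = gating_pays_off(5e-3, 10e-3, 2e-6)    # 5 ms idle gap
print(active, short_gap, long_gap)
```

With these made-up numbers the 50 us gap loses energy to the power cycle while the 5 ms gap wins, which is the sense in which gating only helps over long inactivity times.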
http://i.cmpnet.com/eet/news/08/07/1538UTH_1.gif
SGX is the core in the top right. It's very clear that it has incredibly little SRAM; it's nearly pure logic. At the right, based on a SRAM cell size of ~0.5, there seems to be 64-80KiB of SRAM. On the left, at the bottom and maybe in the center, there's also a very small amount of extra SRAM. That's more than enough for texture caches, FIFOs, and register files.
Given that SGX 530 has two "shader cores" presumably, you could assume that the top right and bottom right SRAM is the shader pipe-specific stuff (incl. RF) and the texture caches, while the center right is for the FIFOs. The rest are misc. buffers, for example to communicate with the outside world. Compared to a non-deferred renderer, they can also save quite a bit of SRAM by not needing on-chip HierZ and stuff like that; and obviously the memory controller is off-block.
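As a rough sanity check on the SRAM estimate, the sketch below converts between capacity and raw bit-cell area. It assumes the "~0.5" cell size means about 0.5 square microns per bit, ignores all array overhead (sense amps, decoders, redundancy), and uses the 5.5 mm² block size quoted elsewhere in the thread.

```python
BITS_PER_KIB = 1024 * 8


def sram_cell_area_mm2(kib, cell_area_um2=0.5):
    """Raw bit-cell area in mm^2 for `kib` KiB of SRAM (array overhead ignored)."""
    return kib * BITS_PER_KIB * cell_area_um2 / 1e6  # 1 mm^2 = 1e6 um^2


def kib_from_area(area_mm2, cell_area_um2=0.5):
    """Inverse: how many KiB of raw bit cells fit in `area_mm2`."""
    return area_mm2 * 1e6 / cell_area_um2 / BITS_PER_KIB


for kib in (64, 80):
    area = sram_cell_area_mm2(kib)
    print(f"{kib} KiB -> {area:.3f} mm^2 ({area / 5.5:.1%} of a 5.5 mm^2 block)")
```

Under those assumptions the bit cells account for only a few percent of the block, consistent with the "nearly pure logic" reading of the die shot.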
Perhaps you should allude to where you think this difference comes from, because I'm pretty certain that equivalent SGX consumes less BW than Mali in every instance.
Uh-oh, note to self: do not reply to TBDR Bandwidth Holy Wars. Ever! :D
Going on experience I'd say PowerVR will over-promise and under-deliver on the SGX, but they'll sell a bundle of them anyway and so we'll suffer more mediocre graphics experiences on handsets for another generation.
Now THAT's being opinionated! ;)
ARM are winning designs away from PowerVR however, so there must be something in this that's making sense to some big names.
Why must we expect every licensee to be rational, and why must we expect performance, die size, and power consumption to be the only factors? I'm not saying this to diminish either ARM or PowerVR; however my point is the only thing this tells us is the difference isn't so massive that the choice is always clear-cut for potential licensees in the real world. You would expect that to be the case anyway for the surviving players in an open market...
Simon F
25-Nov-2008, 16:37
You could have picked a better benchmark to illustrate your point. GLBenchmark is soooooo bad.
I'd be interested in hearing of any other benchmark with published figures.
Now, as I can't hide behind a pseudonym I shan't comment further.
TheArchitect
25-Nov-2008, 17:12
First of all, welcome to the forum TheArchitect, enjoy your stay! :)
Thank you, huge fun so far :-)
BTW - There is no holy war here, I have no alignment to either company, just thought I'd make that clear.
The discussion seemed to lack a protagonist for the counter argument, so I thought I'd pitch in. :wink:
I agree with John here, power gating only makes sense during long inactivity times. If your VS is 5x faster than required during part of the processing, it'll still need to be active 20% of the time and it's not viable to have an absurdly massive FIFO to let it idle for sufficiently long times.
I think you are assuming that the VS and FS cores are decoupled by a FIFO, correct? In actual fact this is not the case for either architecture (I think and I'm sure JohnH will be very quick to correct me if I'm not... btw does anyone from ARM, Vivante or Matrox follow this forum???), but the intermediate data between the VS and FS processing stages is actually stored to main memory (post VS, post binning). Therefore it would actually be possible and even reasonable to power off the MaliGP (this was the same with MBX equipped with VGP, but I'm not sure you could power gate it in the same way).
As JohnH implied, you couldn't be any more wrong here: http://www.eetimes.com/news/design/rss/showArticle.jhtml?articleID=210003530&cid=RSSfeed_eetimes_designRSS
http://i.cmpnet.com/eet/news/08/07/1538UTH_1.gif
SGX is the core in the top right. It's very clear that it has incredibly little SRAM; it's nearly pure logic. At the right, based on a SRAM cell size of ~0.5, there seems to be 64-80KiB of SRAM. On the left, at the bottom and maybe in the center, there's also a very small amount of extra SRAM. That's more than enough for texture caches, FIFOs, and register files.
Hmmm the legend at the bottom of the graphic is not clear about which components are which. It's easy to pick out the Cortex-A8, 'cos it's implemented on a semi-hard flow so looks a little more "rigid" than normal synthesised logic. You can imagine that the IVA 1 and IVA 2 would be pretty similar so you can probably spot those, but I think it's non-obvious where the boundaries of the rest of the cores are.
Given that SGX 530 has two "shader cores" presumably, you could assume that the top right and bottom right SRAM is the shader pipe-specific stuff (incl. RF) and the texture caches, while the center right is for the FIFOs. The rest are misc. buffers, for example to communicate with the outside world.
I think PowerVR refers to SGX530 as being two shader "pipes" rather than cores. The premise being that you can share infrastructure between the two pipes and save area (common in programmable architectures) rather than stamping down two identical cores.
Compared to a non-deferred renderer, they can also save quite a bit of SRAM by not needing on-chip HierZ and stuff like that; and obviously the memory controller is off-block.
Uh-oh, note to self: do not reply to TBDR Bandwidth Holy Wars. Ever! :D
LOL - Okay we'll forget you said anything (wave your hand and say after me "these aren't the posts you're looking for")
Now THAT's being opinionated! ;)
Is that not allowed? :grin:
BTW - If JohnH and anyone from ARM wants to pass me the TRMs for their cores I'll do a proper tear down for you guys...
Ailuros
25-Nov-2008, 17:24
Well actually I was suggesting that *all* GPU marketing stretches the truth, but hey hoo.
Make that:
Well actually I was suggesting that *all* marketing stretches the truth...
...and we'll have an immediate agreement.
As for the rest:
Anyway let's hope they get more of a clue with the next one and give PowerVR a real run for their money.
I've been hearing the same story since the birth of 3D in the mobile/PDA market. In fact we should have seen some fierce competition after MBX from ATI (now AMD) which abandoned the market with flying colours, the Bitboys (absorbed by ATI before AMD bought the latter and the result lies in the former sentence...), NVIDIA (which seems to do a lot better with APX2500 than with the initial GoForce) and Falanx (absorbed by ARM) etc.
However the result doesn't differ (as in amount of major licensing deals and by extension success) with SGX today from what it was in the past with MBX. I actually expected competition to heat up with the OGL_ES2.0 generation, but I don't see any earth-shattering changes either.
If it's really about who operates with "smoke and mirrors" then I'd like to hear which of them all is innocent for one, which cannibalize prices in order to gain even one deal or which of them give their IP away for free to even claim a deal after all.
Before anyone throws any stones at IMG, I'd like to hear the entire rotten story that stages itself behind the curtains including all FUD like that tiling stinks (which obviously doesn't come from ARM). Besides marketing there's only one thing to say for any potential competitor at all times: deliver or shut up. It's a free market and the more competition the merrier especially for consumers.
roninja
25-Nov-2008, 17:50
Wow lively debate folks..
TheArchitect
25-Nov-2008, 17:51
Make that:
If it's really about who operates with "smoke and mirrors" then I'd like to hear which of them all is innocent for one, which cannibalize prices in order to gain even one deal or which of them give their IP away for free to even claim a deal after all.
Arrrrh! Now there be some tales to tell... but perhaps for another thread?
Before anyone throws any stones at IMG, I'd like to hear the entire rotten story that stages itself behind the curtains including all FUD like that tiling stinks (which obviously doesn't come from ARM).
Oooo yeah that's a good one and goes waaaaaaaaay back to the mid 90's when you couldn't move in the industry without tripping over a 3D Graphic Chip company! Remember Rendition's Verite, the Cirrus Logic Laguna, the NV1? The scandal around the DX1 Tunnel test (I seem to remember that was the genesis of Tom's Hardware). Damn that makes me feel old!
Like I said, perhaps another thread is required...
Ailuros
25-Nov-2008, 18:11
Arrrrh! Now there be some tales to tell... but perhaps for another thread?
I know that you know where I'm getting at. Apart from that why another thread? It's perfectly fine here and I don't feel it's off topic either.
Oooo yeah that's a good one and goes waaaaaaaaay back to the mid 90's when you couldn't move in the industry without tripping over a 3D Graphic Chip company! Remember Rendition's Verite, the Cirrus Logic Laguna, the NV1? The scandal around the DX1 Tunnel test (I seem to remember that was the genesis of Tom's Hardware). Damn that makes me feel old!
Like I said, perhaps another thread is required...
The 3D history is full of such stories and not just one 3D Graphics company or just one scandal. You've obviously been around as long as I have and I doubt you're that much older than me. Difference being I'm a simple user with no interests whatsoever nor ulterior motives. Now that's truly material for another thread.
BTW - There is no holy war here, I have no alignment to either company, just thought I'd make that clear.
I figured that, no problem - I rather meant in the context of TBDR vs IMR vs [...] debates, which tend to have rather bold and highly contradictory arguments on both sides when it comes to things like bandwidth and impact of newer APIs. Honestly, 99% of the arguments I've seen personally proved little but the lack of understanding of the other side of the aisle - that doesn't mean there isn't a real winner (I have no idea), but if there is then the arguments probably aren't (only?) those made oh-so-often.
Furthermore I am not very interested in comparing actual implementations as they rarely mean all that much when it comes to the core algorithms. There are plenty of incredibly smart things you can do on a TBDR that I've never seen any IMR proponent mention, and there are plenty of incredibly smart things you can do on an IMR architecture that I've never seen any TBDR proponent mention. Heck I've never even seen anyone mention most of those publicly! The foundations of the debate really tend to be set at the wrong level... And any argument about 'real-world' workloads are often inherently biased and are always different for an open ecosystem such as smartphones and a fixed one such as a handheld gaming console.
I think you are assuming that the VS and FS cores are decoupled by a FIFO, correct? In actual fact this is not the case for either architecture (I think and I'm sure JohnH will be very quick to correct me if I'm not... btw does anyone from ARM, Vivante or Matrox follow this forum???)
You are correct, I obviously knew that for SGX but had a brainfart for Mali. Regarding your question, Arjan (who made a quick post earlier in the thread :)) is a Falanx/ARM engineer. I am sadly not aware of anyone from Vivante or Matrox, although who knows who's lurking out there! (*cough* you *cough*)
but the intermediate data between the VS and FS processing stages is actually stored to main memory (post VS, post binning). Therefore it would actually be possible and even reasonable to power off the MaliGP (this was the same with MBX equipped with VGP, but I'm not sure you could power gate it in the same way).
Yes, that is perfectly correct, and so obviously in Mali's specific case you could use power gating for the geometry processing. In the case of NVIDIA, they don't do binning/tiling of any sort so they obviously couldn't do that (at least not for the whole unit - the APX 2500's version is relatively high-end and can be scaled down 2x or possibly more, so maybe they could power gate half of it all the time if VS requirements aren't very high); I tend to confuse Tegra's 3D core and Mali on a few things, heh...
Hmmm the legend at the bottom of the graphic is not clear about which components are which.
It's not, but they say it's 5.5mm² while the total chip is 60mm²; based on that it is very easy to see what it is... (or at least what TechInsights thinks it is, but it makes perfect sense from a size POV at least given PowerVR's claims)
I think PowerVR refers to SGX530 as being two shader "pipes" rather than cores. The premise being that you can share infrastructure between the two pipes and save area (common in programmable architectures) rather than stamping down two identical cores.
I don't care how they refer to them; they are cores. And when I say cores, I mean real cores, not the marketing hyperbole like NVIDIA does it. SGX's shader 'pipes' are full-blown VLIW FP32 processors with 16 concurrent threads (and 4 being prepared pre-ALU at the same time, in the same stages; I will leave it as an exercise to the reader to figure out why this saves valuable register file die space and power; it's really not any different from PC GPUs, but very different from Larrabee which however benefits more from its L1 cache...)
Is that not allowed? :grin:
Of course it is, it was just funny because I was making the extra effort to be as objective as possible here and not voice clear opinions, while you come along and decide to say things like that - it's funny, that's all! :D
Ailuros
26-Nov-2008, 06:12
A bit of digging around and I found an article that caught my attention in 2005:
http://www.hardocp.com/article.html?art=ODAyLCwsaGVudGh1c2lhc3Q=
Now Falanx wasn't part of ARM back then and they of course marketed their IP themselves. In any case, even back then I didn't see anything weird in some of the marketing claims, but if we're going to get into any marketing hyperbole let's just pick two incidents:
When questioned about Mali200 performance we were told that Mali200 could be offering much more efficient integrated graphics by the second half of 2006 that would be "on par" with the current add-in board graphics processing units of today.
http://www.hardocp.com/images/articles/1123007615PtcKMOrr3N_1_9_l.jpg
TheArchitect
26-Nov-2008, 08:48
It's likely that ARM's priorities were different to Falanx's in terms of roadmap and while the claim was probably true at the time (remember we were only really seeing Gen 1 of "proper" shader architecture on the PC, i.e. DX9c/SM2 and not reg. combiners and Mali200 was full shader programmable) attacking that higher end market was probably a low priority for ARM.
I guess it's one of the things they (Falanx) traded for financial stability. They may want to revisit that in the face of Intel's charge into ARM space with Atom though (ARM have a big software and performance hill to climb there on the CPU side). A symmetrically (both VS and FS) scalable core would have given them the ability to scale much more easily. Mali 400 was rooted in a good idea, but like I said it feels like a rush job.
Speaking of Atom I'd have to feel a bit concerned if I was PowerVR as well... Intel have their own GPU play with LRB now, how long before they get that to a point where it could be scaled into an Atom core? They don't have to worry about the inconvenience of other people's fabrication tech, they can leverage every last inch of their own. With that coupled to the coming of heterogeneous computing, PowerVR (and Nvidia on the desktop for that matter), without a proper CPU play, could be out in the cold.
Makes you wonder if they'll get bought out along the way. Come to think of it that does beg the question why weren't they bought when Falanx, Bitboys and Hybrid were snapped up? AMD and ARM certainly had the financial resource, and they are an openly traded company. They look a little expensive based on revenues (and the management is probably arrogant enough to demand a huge premium for a controlling interest).
Like we discussed earlier though there isn't always technical reasoning to these things - ATI buying Bitboys for portable seemed crazy (G40 was not the best tech by any means), but if you just think about it from a "buying the bodies" point of view it makes total sense.
However I digress, this is a graphics tech forum not an investment one.
TheArchitect
26-Nov-2008, 09:57
I know that you know where I'm getting at. Apart from that why another thread? It's perfectly fine here and I don't feel it's off topic either.
The 3D history is full of such stories and not just one 3D Graphics company or just one scandal. You've obviously been around as long as I have and I doubt you're that much older than me. Difference being I'm a simple user with no interests whatsoever nor ulterior motives. Now that's truly material for another thread.
In all seriousness you've prompted me to start work on an (already pretty long) article which discusses the relative merits of different approaches and how that pans out for the user in terms of performance.
I'm also looking at releasing some work I did on designing a more robust benchmark for GPUs. It looks at the premise that it's not a case of "one figure to rule them all" but a balance of curves, like a dyno read out for a car.
Might take a while, but there seems to be a need for something like that... I've actually been approached about writing up some of this as a book, but a book sounds like a lot of work to me. Would be a bit niche as well, hardly likely to be a best seller!
I could throw some Jackie Collins moments in to spice it up a bit I guess - "His attempts to get ambient occlusion to light the contours of her ample bosom amused her. She knew she had to have him...."
Speaking of Atom I'd have to feel a bit concerned if I was PowerVR as well... Intel have their own GPU play with LRB now, how long before they get that to a point where it could be scaled into an Atom core?
I could scale anything down to the size of a single transistor, doesn't mean it's a good idea. Intel's arguments for why they should in theory have lower bandwidth per frame are the same as those of PowerVR and Falanx; they don't have anything magical on their side to go even farther than that.
Furthermore they have some disadvantages, such as the apparent lack of real compression hardware; this will bite them when, for example, reading back from a shadowmap. They could try to improve things slightly through even more clever software, however doing any form of compression on a programmable x86 system will always take more power than dedicated hardware. Therefore it is hard to see how they could be competitive on power with handheld hardware (I'm not willing to make a clear statement versus PC hardware), and this is just one reason among many.
They don't have to worry about the inconvenience of other peoples fabrication techIn Kucinich's eternal words to Neel Kashkari: "That statement that you just made, you will hear
about for the rest of your career." ;) (I kid, but I wish it was so simple)
they can leverage every last inch of their own. That coupled to the coming of heterogeneous computing without a proper CPU play PowerVR (and Nvidia on the desktop for that mater) could be out in the cold.CPU? Is that the thing I buy along with the rest of my groceries at Walmart? Or was that Walmarm or perhaps Ibmart? :)
Makes you wonder if they'll get bought out along the way.Well Apple is certainly putting them at the very core of their massively lucrative handheld business, so if things go south for any reason they'll always have a helping hand on that side... Not that I think it will, their design pipeline seems extremely solid and they surely must have been investing in longer-term R&D for some time.
However I digress this is a graphics tech forum not and investment one.Feel free to discuss that in the 3D & Semiconductor Industry forum, I'm sad it has died so badly in recent months... :(
TheArchitect said:
"Come to think of it that does beg the question why weren't they bought when Falanx, Bitboys and Hybrid were snapped up?"
We can hardly view much of what AMD/ATI did in the mobile graphics space with much authority given their quick exit.
ARM bought Falanx; why not IMG? That is indeed a really intriguing question. Was Falanx available for a comparative song because of the dearth of design wins? In buying IMG, would ARM be closing so many doors to potential graphics licensees (Intel and Apple spring to mind) that the company they bought would no longer be worth what was paid?
I'd also be interested to know the exact order of events. Did IMG halt the co-sell they had with ARM (where ARM would sell IMG IP and get a share of the royalties) because ARM bought Falanx, or did ARM buy Falanx because IMG halted the agreement?
TheArchitect
26-Nov-2008, 12:49
I could scale anything down to the size of a single transistor, doesn't mean it's a good idea. Intel's arguments for why they should in theory have lower bandwidth per frame are the same as those of PowerVR and Falanx; they don't have anything magical on their side to go even farther than that.
Furthermore they have some disadvantages, such as the apparent lack of real compression hardware; this will bite them when, for example, reading back from a shadowmap. They could try to improve things slightly through even more clever software, however doing any form of compression on a programmable x86 system will always take more power than dedicated hardware. Therefore it is hard to see how they could be competitive on power with handheld hardware (I'm not willing to make a clear statement versus PC hardware), and this is just one reason among many.
Now you see you are applying technical reason and I got told off for that :wink:
Independence from a third party vendor who could be at risk of being purchased by someone with different goals to yours (such as consolidation) is a valid reason to pursue your own home grown tech in this instance.
You are also comparing the scant details released about the current LRB platform with a potential yet to be designed PPA optimised mobile version. So yes at the moment that is the case, but you've gotta think they'd realise that before building a mobile version.
In Kucinich's eternal words to Neel Kashkari: "That statement that you just made, you will hear about for the rest of your career." ;) (I kid, but I wish it was so simple)
You have to admit though AMD and Nvidia's reliance on foundries has not exactly paid off recently. Maybe that's them paying for the folly of pushing unproven tech on unproven fabrication and playing around in the margins of its characterisation.
CPU? Is that the thing I buy along with the rest of my groceries at Walmart? Or was that Walmarm or perhaps Ibmart? :)
But when the world goes hetero and it's all on the same die... what then? AMD, ARM and Intel will more than likely pull in that direction (or if they have any sense they will, it plays to their strength and position to do that). Closer coupling between the GPU and CPU is inevitable for lots of reasons (not least of which is the albatross that is PCIe on the PC and distributed non-coherent caching on mobile).
Well Apple is certainly putting them at the very core of their massively lucrative handheld business, so if things go south for any reason they'll always have a helping hand on that side... Not that I think it will, their design pipeline seems extremely solid and they surely must have been investing in longer-term R&D for some time.
Feel free to discuss that in the 3D & Semiconductor Industry forum, I'm sad it has died so badly in recent months... :(
Apple still have other options at the moment - Mali, Vivante, etc. If there is another buying spree or consolidation move then this may happen, but Apple like to stay somewhat independent of any one tech vendor in my experience (hell they bought their own ARM design team pretty much).
If I was a betting man I'd actually say TI has more invested in PowerVR than Apple from that stand point. TI's revenue stream is probably worth a lot more to PowerVR $ for $ as well.
TheArchitect
26-Nov-2008, 13:28
TheArchitect said:
"Come to think of it that does beg the question why weren't they bought when Falanx, Bitboys and Hybrid were snapped up?"
We can hardly view much of what AMD/ATI did in the mobile graphics space with much authority given their quick exit.
I'll give them credit for winning Qualcomm though. That's a big chunk of the mobile units shipped. However, there was a lot of channel conflict between the two from what I heard, which basically didn't bode well (rumour was ATI sales force were encouraging Qualcomm customers to take a cheap non-GPU enabled Qualcomm part and then adding an adjunct from ATI, because they'd get more revenue and the sales guys' numbers would look good so they'd get their bonus).
The rapid exit was I suspect a co-opetition issue; enabling their partners who they would/could eventually be competing against wouldn't have sat well with AMD management post merger.
ARM bought Falanx; why not IMG? That is indeed a really intriguing question. Was Falanx available for a comparative song because of the dearth of design wins?
There may be some truth to that, but not from that perspective though. PowerVR is part of the Imagination Technologies group of companies which has a whole host of baggage. No doubt the Imagination majority shareholders would have wanted to offload the whole lot in one go, making it less attractive to ARM who don't need a DAB radio factory or a DSP. Some of the video stuff might have been useful, but ARM hasn't got a great track record with taking on other people's architecture (they didn't do much with that Philips offshoot they bought, which was a shame, looked like it had promise).
In buying IMG, would ARM be closing so many doors to potential graphics licensees (Intel and Apple spring to mind) that the company they bought would no longer be worth what was paid?
I've had a quote from ARM's CTO repeated to me before - goes something like "Now we've established we are whores it's just a matter of dealing with the price", basically, they'll sell to anyone if the price is right and the customer has some morality. So the door to Intel, Apple etc. would still be open I think.
Very much in the spirit of Sir Robin, a great character from the industry. The man's a legend!
I'd also be interested to know the exact order of events. Did IMG halt the co-sell they had with ARM (where ARM would sell IMG IP and get a share of the royalties) because ARM bought Falanx, or did ARM buy Falanx because IMG halted the agreement?
I expect we will never know the truth, the press statement was fairly nondescript. Just like Hollywood, Vegas wedding, no fault quickie divorce. Wonder if PowerVR signed a pre-nup?
It's likely that ARM's priorities were different to Falanx's in terms of roadmap and while the claim was probably true at the time (remember we were only really seeing Gen 1 of "proper" shader architecture on the PC, i.e. DX9c/SM2 and not reg. combiners and Mali200 was full shader programmable) attacking that higher end market was probably a low priority for ARM.
The only thing that has changed for Mali 200 since ARM acquired Falanx is that it got slower and bigger than Falanx originally claimed. Incidentally, for someone who claims to know so much about graphics I'm surprised that you don't know that SM3.0 was commonplace when Mali200 was first announced, and shaders were well beyond the first gen of such architectures.
...
Speaking of Atom I'd have to feel a bit concerned if I was PowerVR as well... Intel have their own GPU play with LRB now, how long before they get that to a point where it could be scaled into an Atom core? They don't have to worry about the inconvenience of other people's fabrication tech, they can leverage every last inch of their own. That coupled to the coming of heterogeneous computing without a proper CPU play PowerVR (and Nvidia on the desktop for that matter) could be out in the cold.
If you weren't so tied up in putting PowerVR down, you would have looked at how LRB works and thought about how exactly you scale that down to produce something that is competitive in the performance, power and area space PowerVR sells into, and would realise that it isn't entirely sensible.
Makes you wonder if they'll get bought out along the way. Come to think of it that does beg the question why weren't they bought when Falanx, Bitboys and Hybrid were snapped up? AMD and ARM certainly had the financial resource, they are an openly traded company. They look a little expensive based on revenues (and the management is probably arrogant enough to demand a huge premium for a controlling interest).
What a ridiculous comment; at the time ARM purchased Falanx I think IMG were valued at over £200M on the open stock market compared to a few million quid for the other two. That is the answer to your question.
Like we discussed earlier though there isn't always technical reasoning to these things - ATI buying BitBoys for portable seemed crazy (G40 was not the best tech by any means), but if you just think about it from a "buying the bodies" point of view it makes total sense.
However, I digress; this is a graphics tech forum, not an investment one.
Yes you have digressed, and I think it's time to come clean about exactly who you are rather than claiming to be a neutral observer.
Having done a little search on LinkedIn involving the keywords ARM, ex employee, architect, and coupled to a rather familiar approach to negative marketing I'm pretty certain I know who you are, the question is do you have the integrity to come clean?
John.
Independence from a third party vendor who could be at risk of being purchased by someone with different goals to yours (such as consolidation) is a valid reason to pursue your own home grown tech in this instance.
Obviously, but then again someone the size of Intel has the leverage to negotiate contracts that make such things of relatively little concern in the short & mid-terms. There are certainly plenty of advantages to doing things in-house, but the real question is whether they're worth the extra costs and especially the risk of simply designing an inferior solution. And then what, you just spent $50M in R&D, do you just throw it to the garbage bin? Very few companies have the guts to do that... :)
You are also comparing the scant details released about the current LRB platform with a potential yet to be designed PPA optimised mobile version. So yes at the moment that is the case, but you've gotta think they'd realise that before building a mobile version.
The real question then is what technical advantages such a solution could have over the competition. In theory, you have all the cost of maximum flexibility along with the cost of fixed-function hardware. Therefore, for such an approach to be superior, the base 'cores' must be supremely executed and *more* efficient per transistor than the competition's shader processors. This is far from impossible, but I think you'll have to agree that it's normal for me to be very skeptical that it is the most likely outcome...
You have to admit though AMD and Nvidia's reliance on foundries has not exactly paid off recently. Maybe that's them paying for the folly of pushing unproven tech on unproven fabrication and playing around in the margins of its characterisation.
I disagree, it has paid off just fine. RV670 or G80 are textbook examples of how a great foundry relationship can go. Yes, there are hiccups from time to time, but that is not necessarily related to the foundry model! (*cough* AMD/65nm/Barcelona *cough*)
But when the world goes hetero and it's all on the same die... what then? AMD, ARM and Intel will more than likely pull in that direction (or if they have any sense they will, it plays to their strength and position to do that). Closer coupling between the GPU and CPU is inevitable for lots of reasons
In the specific case of Larrabee, the inclusion of a MIMD aspect on each core makes a single-chip solution with OoOE cores rather redundant in my mind. By definition, the OoOE cores are only useful for tasks that are not highly parallel; therefore, the data transit in an optimized software algorithm should not be a real problem.
Ideally everything would always be a single chip, but ideally I'd also have all of my DRAM on-chip. Sadly a little thing called 'reality' tends to get in the way of such things happening, at least in successful products :twisted:
Apple still have other options at the moment - Mali, Vivante, etc. If there is another buying spree or consolidation move then this may happen, but Apple like to stay somewhat independent of any one tech vendor in my experience (hell they bought their own ARM design team pretty much).
If I was a betting man I'd actually say TI has more invested in PowerVR than Apple from that stand point. TI's revenue stream is probably worth a lot more to PowerVR $ for $ as well.
Historically you would be correct. But TI will never license PowerVR's VXD or VXE; Apple, however, did. My guesses for their upcoming SoCs are:
0) 90nm/ARM11/MBXLite/In-House or Samsung Audio&Video
1) 65nm/ARM11/SGX520/VXD330/In-House Audio&ISP
2) 45nm/Cortex-A9/SGX540/VXD380/VXE280/In-House Audio&ISP
*If* this is correct, it's pretty clear that the amount of PowerVR IP into future Apple products is extremely high. And when you depend on 3 separate pieces of IP from a company, it becomes less desirable to switch to someone else for just one of the three...
I've had a quote from ARM's CTO repeated to me before - goes something like "Now we've established we are whores it's just a matter of dealing with the price"
What is it with UK semiconductor CTOs that make them so cool? I really like all I've seen/heard from the top technical guys from ARM, CSR, Icera, etc. - I especially liked this story: http://www.electronicsweekly.com/blogs/david-manners-semiconductor-blog/2008/07/the-late-simon-knowles.html - as Ailuros once said, maybe it's the tea! :)
If you weren't so tied up in putting PowerVR down, you would have looked at how LRB works and thought about how exactly you scale that down to produce something that is competitive in the performance, power and area space PowerVR sells into, and would realise that it isn't entirely sensible.
To TheArchitect's credit, he was likely thinking of a much longer-term horizon than you are; say, 22nm or so... It would make little sense to scale Larrabee down to less than 1 core/16 ALUs, but such a level of performance is perfectly sensible for handhelds in that technology generation. Therefore the real question is more what its real efficiency is, and that's quite a debate in itself to say the least!
Of course if you are willing to criticize Larrabee's overall efficiency publicly/on the record and start a catfight about it here, be my guest - I'm all for good television! ;) :D
Having done a little search on LinkedIn involving the keywords ARM, ex employee, architect, and coupled to a rather familiar approach to negative marketing I'm pretty certain I know who you are, the question is do you have the integrity to come clean?
Hah! This is getting pretty hot - just a quick comment from a moderation POV: please don't force people to come clean about their RL identities publicly, and don't bring RL animosities in if they exist. However in certain circumstances, I would obviously find it entirely appropriate/sensible (and certainly desirable from your POV) to come clean in private via PMs.
In case PMs are disabled for you because of your low post count, please just let me know and I'll take care of it.
TheArchitect
26-Nov-2008, 16:00
That would certainly seem to be the case Arun, PMs are not enabled due to low post count (must be a high threshold, how many posts do you need?!?).
Simon F
26-Nov-2008, 16:16
That would certainly seem to be the case Arun, PMs are not enabled due to low post count (must be a high threshold, how many posts do you need?!?).
I can't recall but at "10.71 posts per day (http://forum.beyond3d.com/member.php?u=17590)" you should be there in no time :grin:
TheArchitect
26-Nov-2008, 16:24
Bloody hell - perhaps I should get some real work done :grin:
Ailuros
26-Nov-2008, 16:50
It's likely that ARM's priorities were different to Falanx's in terms of roadmap and while the claim was probably true at the time (remember we were only really seeing Gen 1 of "proper" shader architecture on the PC, i.e. DX9c/SM2 and not reg. combiners and Mali200 was full shader programmable) attacking that higher end market was probably a low priority for ARM.
20GFLOPs out of 2-3 square millimeters at 90nm? It didn't and doesn't sound realistic to me. Incidentally the article above at HardOCP went public merely 5 days after Eurasia/SGX had been announced. My only other point was that marketing can play nasty tricks from all sides.
I guess it's one of the things they (Falanx) traded for financial stability. They may want to revisit that in the face of Intel's charge into ARM space with Atom though (ARM have a big software and performance hill to climb there on the CPU side). A symmetrically (both VS and FS) scalable core would have given them the ability to scale much more easily. Mali 400 was rooted in a good idea, but like I said it feels like a rush job. I've no idea what the human resources look like at ARM's graphics department these days. If it hasn't grown significantly since the Falanx days, though, I have the feeling that they might need some serious reinforcements.
Speaking of Atom I'd have to feel a bit concerned if I was PowerVR as well... Intel have their own GPU play with LRB now, how long before they get that to a point where it could be scaled into an Atom core? They don't have to worry about the inconvenience of other people's fabrication tech, they can leverage every last inch of their own. That coupled to the coming of heterogeneous computing without a proper CPU play PowerVR (and Nvidia on the desktop for that matter) could be out in the cold.
Let's separate IP from fabless semiconductor markets when it comes to graphics, otherwise we'll get lost in the long run.
Even if IMG wanted, could (add whatever feels more comfortable) try to touch the CPU market it would end up with conflicting interests with their most significant partners. The above most certainly is a consideration, but you seem to forget that "now" when it comes to LRB is more likely a first sample release in somewhere early 2010 and it's not like it takes a snip of your hand to scale down from the high end into the small form factor either. At least for the lifetime of SGX IMG is safe; and here's the trick for the next generation. They need to secure other markets with SGX already to build on in order to further stabilize them with their next generation. One good example would be the handheld console gaming market, for which for the next generation consoles I can see only two contenders; one of them being announced indirectly already.
Makes you wonder if they'll get bought out along the way. Come to think of it that does beg the question why weren't they bought when Falanx, Bitboys and Hybrid were snapped up? AMD and ARM certainly had the financial resource, they are an openly traded company. They look a little expensive based on revenues (and the management is probably arrogant enough to demand a huge premium for a controlling interest).
AMD had the financial resources back then, but we all know how their story with the PDA/mobile market ended. Besides there's quite a difference in cost between say the Bitboys and IMG.
So despite your digression as you said (and I'm more than well aware after all these years what Beyond3D stands for), shall we cut a bit deeper into things and get a wee bit more specific?
You don't strike me as someone that doesn't have anything to do on a professional level with the markets debated here, nor do I think that you never ever had any direct or indirect ties with ARM. Prove me wrong, expose your identity and I will have the dignity to offer a full public apology.
That said:
I'll give them credit for winning Qualcomm though. That's a big chunk of the mobile units shipped. However, there was a lot of channel conflict between the two from what I heard, which basically didn't bode well (rumour was ATI sales force were encouraging Qualcomm customers to take a cheap non-GPU enabled Qualcomm part and then adding an adjunct from ATI, because they'd get more revenue and the sales guys' numbers would look good so they'd get their bonus).
The rapid exit was I suspect a co-opetition issue; enabling their partners who they would/could eventually be competing against wouldn't have sat well with AMD management post merger.
Ironically there was somebody in the past that worked for said company above that had a very similar reasoning as you seem to have. He was very vocal by that time how ATI/NVIDIA and others will recap for the 2nd generation and all they have now is that they're left dry with very little to battle with.
Didn't I mention before that some IHVs cannibalized prices to gain some deals? Just don't tell me that it wasn't the case here. I myself had been fooled in the past when reading of a "mini-Xenos" that the result would be as impressive as its larger XBox360 brother; it turns out it had very little in common after all. 2nd case of the usual marketing stunts and other participants being "innocent" in the usual marketing circus.
There may be some truth to that, but not from that perspective though. PowerVR is part of the Imagination Technologies group of companies which has a whole host of baggage. No doubt the Imagination majority shareholders would have wanted to offload the whole lot in one go, making it less attractive to ARM who don't need a DAB radio factory or a DSP. Some of the video stuff might have been useful, but ARM hasn't got a great track record with taking on other people's architecture (they didn't do much with that Philips offshoot they bought, which was a shame, looked like it had promise).
PowerVR is the heart of IMG and their other subdivisions had been created during the years based on the experience collected from the first. It's a whole web of interconnected patents and I don't think it would make any sense for IMG to sell off PowerVR, especially since the latter is the subdivision with the highest revenue.
ARM had an agreement with IMG to sell MBX IP. Along the way IMG saw that they could channel/sell their IP themselves too and decided to carry on by themselves for the 2nd generation. Meaning that for the first generation, and as long as ARM was selling MBX on IMG's behalf, IMG might have been in the "good books" for some, and afterwards became the rotten company that has flooded the market during the decades of its existence with lies, smoke and mirrors.
If I ever had been employed by ARM for relevant matters I'd too today might have similar feelings, but I'd never go out in public with them and not in that sense by all means.
I've had a quote from ARM's CTO repeated to me before - goes something like "Now we've established we are whores it's just a matter of dealing with the price", basically, they'll sell to anyone if the price is right and the customer has some morality. So the door to Intel, Apple etc. would still be open I think.
And I happen to have heard from someone that works at a large company that enquired for a license that they nowadays get some extra IP virtually for free if they'd also buy a CPU. Does that also sound familiar?
There's beyond a constant price war for any contenders in this market. When one of them though tries to get rid of them for free just because at least some market penetration after all is necessary, then some folks shouldn't point fingers. I wonder do they have any mirrors by any chance?
I expect we will never know the truth, the press statement was fairly nondescript. Just like Hollywood, Vegas wedding, no fault quickie divorce. Wonder if PowerVR signed a pre-nup?
You mean to say you weren't there? I start to feel terrible now.....*snicker*
Ailuros
26-Nov-2008, 16:59
To TheArchitect's credit, he was likely thinking of a much longer-term horizon than you are; say, 22nm or so... It would make little sense to scale Larrabee down to less than 1 core/16 ALUs, but such a level of performance is perfectly sensible for handhelds in that technology generation. Therefore the real question is more what its real efficiency is, and that's quite a debate in itself to say the least!
Of course if you are willing to criticize Larrabee's overall efficiency publicly/on the record and start a catfight about it here, be my guest - I'm all for good television! ;) :D
ROFL :D *now it's my turn to fetch popcorn if it ever gets as far...*
...
To TheArchitect's credit, he was likely thinking of a much longer-term horizon than you are; say, 22nm or so... It would make little sense to scale Larrabee down to less than 1 core/16 ALUs, but such a level of performance is perfectly sensible for handhelds in that technology generation. Therefore the real question is more what its real efficiency is, and that's quite a debate in itself to say the least!
What level of performance do you think will be acceptable for handhelds in the time scales of 22nm? Obviously can't say much, but I think people may be surprised. Wrt LRB I think the approach being used fits the high end where the scale of the devices being built is close to a couple of orders of magnitude greater than typical handheld graphics cores; the drop to 22nm just about gets you one order of magnitude. Personally I'm not convinced that a 10x scaled back LRB would be competitive.
Of course if you are willing to criticize Larrabee's overall efficiency publicly/on the record and start a catfight about it here, be my guest - I'm all for good television! ;) :D
Nah, not something I feel inclined to debate, although I will say I do like their approach.
Hah! This is getting pretty hot - just a quick comment from a moderation POV: please don't force people to come clean about their RL identities publicly and bringing RL animosities in if they exist. However in certain circumstances, I would obviously find it entirely appropriate/sensible (and certainly desirable from your POV) to come clean in private via PMs.
In case PMs are disabled for you because of your low post cost, please just let me know and I'll take care of it.
Hey I'm not going to out anyone on a public forum, but given the nature of some of the PowerVR slamming being done I think it's fair that people should be aware of current/recent affiliations so that the correct level of salt can be taken with stated opinion.
John.
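As a rough sanity check on the "one order of magnitude" figure in the post above, here is a minimal sketch of ideal area scaling between process nodes. It assumes area shrinks with the square of the feature size, which real processes only approximate, and the starting nodes in the loop are illustrative picks rather than anything stated in the thread.

```python
def ideal_area_scale(from_nm: float, to_nm: float) -> float:
    """Ideal area shrink factor when moving from one process node to another.

    Assumption: area scales with the square of the feature size. Real
    processes fall short of this, so treat the result as an upper bound.
    """
    return (from_nm / to_nm) ** 2


# Illustrative starting nodes (assumed, not taken from the thread).
for node in (90, 65, 45):
    print(f"{node}nm -> 22nm: ~{ideal_area_scale(node, 22):.1f}x")
```

How close the shrink gets to 10x depends heavily on which node you start from, which is part of why the scaled-down comparison is hard to settle on paper.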
darkblu
26-Nov-2008, 23:41
Hold on a minute, how many embedded systems do you know that actually have enough room to store a 1920x1080x32 texture (8 MB for the top MIP level), let alone have the need to zoom into it by nearly 16x?????
i believe the whole iphone family has a hefty chunk of memory dedicated to video, but i can't quote any figures..
Well I suppose, viewing JPG stills maybe with some zoom, but then you can do a partial decode on them to limit the source texture size so its not a problem.
i take it you have not been acquainted with apple's doings on the iphone: kinetic scrolling of surfaces, 'core animation' fx over those, including, but not limited to, auto-rotating views, etc.
a scenario where a jpg is viewed in portrait, zoomed in, and then the device is rotated to landscape and the view follows suit smoothly is a trivial use case on the iphone.
actually, right now, i have a rather intriguing GraphViz-produced 1,985 x 1,064, png on my ipod touch, and i entertain myself by rotating the device and watching the view rotate, like a good texture on a quad would.. left, then right. then left again. nice.
ed: i'm not suggesting the png surface is 2k x 2k, but it's fairly high resolution, nevertheless.
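For reference, the "8 MB for the top MIP level" figure quoted at the top of this post can be checked with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and it assumes an uncompressed texture with each MIP level halving in both dimensions (rounding down, minimum 1).

```python
def texture_bytes(width: int, height: int, bits_per_pixel: int,
                  with_mips: bool = False) -> int:
    """Bytes needed for an uncompressed texture, optionally with a full MIP chain."""
    total = 0
    w, h = width, height
    while True:
        total += w * h * bits_per_pixel // 8
        if not with_mips or (w == 1 and h == 1):
            break
        # Each MIP level halves both dimensions, never dropping below 1.
        w, h = max(1, w // 2), max(1, h // 2)
    return total


top = texture_bytes(1920, 1080, 32)
full = texture_bytes(1920, 1080, 32, with_mips=True)
print(f"top level:  {top / 2**20:.2f} MiB")
print(f"full chain: {full / 2**20:.2f} MiB")
```

The top level alone comes to 8,294,400 bytes, so "8 MB" is a fair round number. Hardware restricted to power-of-two textures would need a padded 2048x2048 surface instead, which is 16 MiB at 32 bits per pixel.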
TheArchitect
27-Nov-2008, 01:06
i believe the whole iphone family has a hefty chunk of memory dedicated to video, but i can't quote any figures..
That possibly isn't for performance reasons, some systems require static allocation of the Gfx memory pool for various reasons (OS security/stability, SoC implementation constraints, etc.).
i take it you have not been acquainted with apple's doings on the iphone: kinetic scrolling of surfaces, 'core animation' fx over those, including, but not limited to, auto-rotating views, etc.
a scenario where a jpg is viewed in portrait, zoomed in, and then the device is rotated to landscape and the view follows suit smoothly is a trivial use case on the iphone.
actually, right now, i have a rather intriguing GraphViz-produced 1,985 x 1,064, png on my ipod touch, and i entertain myself by rotating the device and watching the view rotate, like a good texture on a quad would.. left, then right. then left again. nice.
ed: i'm not suggesting the png surface is 2k x 2k, but it's fairly high resolution, nevertheless.
Yeah I've seen the iPhone do its thing, very nice (still the best browser experience on a mobile device at the moment, mind I haven't got my hands on a G1 yet, apparently that gives iPhone a run for its money). Like I said though you can do clever things with a JPG because it's constructed of macroblocks.
When looking at the entire image you are downsampling to the screen size, so I would have thought large amounts of sub-pixel accuracy are not an issue (you are mostly sampling groups of whole pixels to create a single pixel). Planar rotation when you tilt the device is done on the screen-sized version most of the time, so again no sub-pixel accuracy issues. When zoomed in you only need to access the macroblock groups from the JPG that your zoomed target area is made of, limiting the U,V range and subsequently the requirement for high levels of sub-pixel precision.
(btw - if this isn't the way they are doing it I call dibs on the IP :grin: )
I'm not sure how you'd do it on a PNG file, but probably taking a texture sub-image, which again would limit the section you're accessing when zoomed to a "manageable precision" (I know Apple desktop solution providers have to supply a number of required extensions; perhaps an extension attaching a sub-region of a texture to a texture is one of them).
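The partial-decode idea in this post can be sketched roughly as follows. This is not a JPEG decoder and not a claim about how Apple does it; it only works out which blocks cover a zoomed region and how many texture coordinate bits that region needs. The function names, the 16-pixel block size (typical for 4:2:0 subsampled JPEGs) and the 8 fractional bits are all assumptions made for illustration.

```python
def blocks_for_region(x0, y0, x1, y1, img_w, img_h, block=16):
    """Block index range covering the pixel rectangle [x0, x1) x [y0, y1).

    Returns (col_start, row_start, col_end, row_end) with exclusive ends,
    or None if the rectangle lies entirely outside the image.
    """
    # Clamp the requested region to the image bounds.
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(img_w, x1), min(img_h, y1)
    if x0 >= x1 or y0 >= y1:
        return None
    # Floor the start, ceil the end, so partially covered blocks are included.
    return (x0 // block, y0 // block, -(-x1 // block), -(-y1 // block))


def coord_bits(extent: int, frac_bits: int = 8) -> int:
    """Bits needed to address `extent` texels with `frac_bits` of sub-texel precision."""
    return (extent - 1).bit_length() + frac_bits


# Hypothetical zoom: a 120x68 pixel window inside a 1920x1080 image.
c0, r0, c1, r1 = blocks_for_region(900, 500, 1020, 568, 1920, 1080)
sub_w = (c1 - c0) * 16
print(f"decode blocks: cols {c0}..{c1 - 1}, rows {r0}..{r1 - 1}")
print(f"U bits, full image: {coord_bits(1920)}; sub-image: {coord_bits(sub_w)}")
```

In this example only an 8x5 block patch has to be decoded and uploaded, and the horizontal coordinate needs 15 bits instead of 19, which is the precision saving the post is describing.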
Ailuros
27-Nov-2008, 05:18
What level of performance do you think will be acceptable for handhelds in the time scales of a 22nm? Obviously can't say much, but I think people may be surprised. Wrt LRB I think the approach being used fits the high end where the scale of the devices being built is close to a couple of orders of magnitude greater than typical handheld graphics cores, the drop to 22nm just about gets you one order of magnitude, personally I'm not convinced that a 10x scaled back LRB would be competitive.
Albeit it's not that relevant, I doubt NV's APX2500 has that much in common with their current G8x/GT2x0 high end architectures either. It's not an absolute necessity that if Intel purely hypothetically wanted to continue from point X in the PDA/mobile market with their own technology that they couldn't. And no whatever Intel decides in the future it doesn't have to necessarily make sense either, since we've seen weirder things happening.
Nah, not something I feel inclined to debate, although I will say I do like their approach.
But you could answer a quick OT question to a layman: given the nature of the hardware did they have an alternative choice?
Hey I'm not going to out anyone on a public forum, but given the nature of some of the PowerVR slamming being done I think its fair that people should be aware of current/recent affiliations so that the correct level of salt can be taken with stated opinion.
As I said in my former post I've seen another similar attempt in the past here. Eventually I could understand that guy because he was amongst the ones that convinced his company to get license X and nowadays his past efforts didn't bring the results he had wished for.
Anyway if you want to chew on some further material, here you go:
http://www.design-reuse.com/articles/9591/competitive-advantages-of-the-mali-graphics-architecture.html
When I first heard of the original Mali I said back then that if a small IP company can integrate single cycle 4xMSAA then it's high time we see it in standalone GPUs too and we did in 2006, albeit many before that considered that it doesn't make "sense".
Two things for the above apart from the long list of obvious misconceptions: when it takes 4 cycles for 16xMSAA someone might also enable 4x Supersampling instead. Before anyone says that 16xMSAA delivers far better polygon edge/intersections AA than 4xSSAA (and be of course right), I'll say that with the latter you usually get also a -1.0 LOD offset which is near 2xAF quality.
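The -1.0 LOD offset mentioned above follows from the sample layout. A minimal sketch, assuming ordered-grid supersampling (the scene is effectively rendered at a higher resolution per axis and filtered down) and the usual log2 mip selection; the function names are illustrative only and not taken from any driver or API:

```python
import math

def mip_level(texels_per_pixel, lod_bias=0.0):
    """Mip level chosen for a given texel footprint per screen pixel."""
    return max(0.0, math.log2(texels_per_pixel) + lod_bias)

def ssaa_lod_offset(samples):
    """LOD shift implied by ordered-grid supersampling with `samples` samples.

    Assumption: samples form a square grid, so resolution per axis grows by
    sqrt(samples) and each sample's texel footprint shrinks by the same factor.
    """
    per_axis_scale = math.sqrt(samples)
    return -math.log2(per_axis_scale)

offset_4x = ssaa_lod_offset(4)               # one mip level sharper
plain = mip_level(4.0)                       # footprint of 4 texels per pixel
supersampled = mip_level(4.0, offset_4x)     # same surface under 4x SSAA
```

With 4 samples the grid is 2x2, so every texture lookup lands one mip level finer than it would without supersampling, which is where the comparison with roughly 2x AF sharpness comes from.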
The 2nd thing is something that might be neglected by most if not all IHVs in this market: albeit I understand the importance of having 2x,4x or more Multisampling on devices with small screens, it does sound to me that none of them has the power to add at least some portion of anisotropic filtering.
In my head Multisampling compared to Supersampling is a sort of "performance optimization", yet the first should also come with at least 2xAF to be comparable. And yes I understand that an advanced analytical AF algorithm along with the required TMU strength is a tough cookie to break for now when die area is so limited.
To me though it takes a bit more than good polygon edge/intersection antialiasing (which is what nowadays 5-10% of the total screen space?) and blurry bilinear filtering for the majority of the screen.
Captain Chickenpants
27-Nov-2008, 09:47
When looking at the entire image you are down sampling to the screen size, so I would have thought large amounts of sub pixel accuracy are not an issue (you are mostly sampling groups of whole pixels to create a single pixel). Planar rotation when you tilt the device is done on the screen sized version most of the time, so again no sub pixel accuracy issues. When zoomed in you only need to access the macro block groups from the JPG that your zoomed target area is made of, limiting the U,V range and subsequently the requirement for high levels of sub pixel precision.
I can't really comment on the 3d pipeline stuff (I am a video rather than 3d guy), but I can say that accessing a subset of blocks in a jpeg is not that straightforward. You would generally decode the whole thing and then view sub-sections of it, and then it is just a really big image.
CC
Simon F
27-Nov-2008, 10:07
I can't really comment on the 3d pipeline stuff (I am a video rather than 3d guy), but I can say that accessing a subset of blocks in a jpeg is not that straightforward.
CC
Indeed. That's why they aren't used (directly) as textures!
TheArchitect
27-Nov-2008, 10:46
Indeed. That's why they aren't used (directly) as textures!
Yes you are right, you can't do random access into the fully compressed stream. You'd have to do the initial stages (huffman decode, etc.) to get the block information and then only do the full IDCT, etc. on the block you're interested in though, but that should be pretty quick I would have thought.
I think I'm right in saying that there has been provision in the spec for the JPG file format from the beginning for you to insert "markers" or "bookmarks" to subset the image (kinda region of interest or print optimisation thing - hangover from when it took an age to decode jpg's). Would make it less efficient on the compression side, but much easier to do this sort of trick. Not sure it was ever that popular though.
Wonder what you'd do with JPG2000? Theoretically you can partial decode a wavelet based image to a lesser resolution by leaving out different frequency components... if the image is moving fast you'd be less likely to see low (or is that high) frequency components anyway.
I suppose you could always use the progressive image techniques they used in the early days of the web to reference a lower effective resolution image when the image is moving as well (is that part of the JPG spec? or something proprietary?).
Probably overthinking this a bit - most of these things seem to be achieved through brute force, more's the pity. <Sigh> Where's the ingenuity you used to see in the old days...
Captain Chickenpants
27-Nov-2008, 11:04
Yes you are right, you can't do random access into the fully compressed stream. You'd have to do the initial stages (huffman decode, etc.) to get the block information and then only do the full IDCT, etc. on the block you're interested in though, but that should be pretty quick I would have thought.
I think I'm right in saying that there has been provision in the spec for the JPG file format from the beginning for you to insert "markers" or "bookmarks" to subset the image (kinda region of interest or print optimisation thing - hangover from when it took an age to decode jpg's). Would make it less efficient on the compression side, but much easier to do this sort of trick. Not sure it was ever that popular though.
Wonder what you'd do with JPG2000? Theoretically you can partial decode a wavelet based image to a lesser resolution by leaving out different frequency components... if the image is moving fast you'd be less likely to see low (or is that high) frequency components anyway.
I suppose you could always use the progressive image techniques they used in the early days of the web to reference a lower effective resolution image when the image is moving as well (is that part of the JPG spec? or something proprietary?).
Probably overthinking this a bit - most of these things seem to be achieved through brute force, more's the pity. <Sigh> Where's the ingenuity you used to see in the old days...
The DC coefficients in the block are predicted based on the previous blocks DC coefficient. So you will need to zip through the previous blocks in the frame to figure out your DC coefficient.
Yes jpeg has restart markers which allow you to start decoding from a point part way through the stream. These are primarily for error resilience though rather than region of interest.
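The two points above (DC prediction and restart markers) can be sketched as follows. This is a simplified illustration, not a real decoder: it assumes a single colour component with one block per MCU and that the DC differences have already been Huffman-decoded; the function names are hypothetical.

```python
def decode_dc(diffs, restart_interval=0):
    """Turn per-block DC differences into absolute DC values.

    Each block stores only the difference from the previous block's DC,
    so the decoder carries a running predictor along the scan.
    """
    dc_values = []
    predictor = 0
    for index, diff in enumerate(diffs):
        if restart_interval and index % restart_interval == 0:
            predictor = 0  # a restart marker resets the DC predictor
        predictor += diff
        dc_values.append(predictor)
    return dc_values

def blocks_to_walk(target_block, restart_interval=0):
    """Blocks that must be entropy-decoded to learn the DC of `target_block`."""
    if restart_interval:
        return target_block % restart_interval + 1
    return target_block + 1

no_restarts = decode_dc([5, -2, 3, 1])
with_restarts = decode_dc([5, -2, 3, 1], restart_interval=2)
cost_plain = blocks_to_walk(1000)
cost_restart = blocks_to_walk(1000, restart_interval=8)
```

Even with restart markers the decoder still has to locate the right marker in the byte stream, since the file carries no index of their positions, so this only bounds the entropy-decoding work and not the scan through the file.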
The progressive jpeg is also part of the standard, basically the order that coefficients were transmitted is different so that you get the DC and low frequency coefficients first, then you 'transmit' the higher frequency coefficients which gives you your detail back.
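A rough sketch of why the low frequency coefficients arriving first already give a usable picture. It assumes an unquantised orthonormal 8x8 DCT and approximates the zigzag scan by sorting on u+v; the sample block and all names are made up for illustration.

```python
import math

N = 8  # JPEG block size

def dct_matrix():
    """Orthonormal DCT-II basis, one row per frequency."""
    rows = []
    for u in range(N):
        scale = math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)
        rows.append([scale * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                     for x in range(N)])
    return rows

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

C = dct_matrix()
CT = [list(col) for col in zip(*C)]  # transpose of C

def forward_dct(block):
    return matmul(matmul(C, block), CT)

def inverse_dct(coeffs):
    return matmul(matmul(CT, coeffs), C)

# Low-to-high frequency order (approximation of the zigzag scan).
SCAN_ORDER = sorted(((u, v) for u in range(N) for v in range(N)),
                    key=lambda p: (p[0] + p[1], p[0]))

def reconstruct(coeffs, kept):
    """Inverse transform using only the first `kept` coefficients in scan order."""
    partial = [[0.0] * N for _ in range(N)]
    for u, v in SCAN_ORDER[:kept]:
        partial[u][v] = coeffs[u][v]
    return inverse_dct(partial)

def rms_error(a, b):
    total = sum((a[i][j] - b[i][j]) ** 2 for i in range(N) for j in range(N))
    return math.sqrt(total / (N * N))

# Hypothetical sample block: a gradient with a little texture on top.
block = [[float(16 * x + 8 * y + (x * y) % 5) for x in range(N)] for y in range(N)]
coeffs = forward_dct(block)
errors = [rms_error(block, reconstruct(coeffs, k)) for k in (1, 6, 21, 64)]
```

The error shrinks as more coefficients are kept and vanishes (up to rounding) once all 64 are in, which is the behaviour a progressive scan exploits: an early, blurry image that sharpens as later scans arrive.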
I am not sure that the modern way of doing things is necessarily 'brute force'. It is really making better use of available resources. When you have a big chunk of memory, why not use it to store the decoded image rather than re-decode it every time?
CC
TheArchitect
27-Nov-2008, 11:58
I am not sure that the modern way of doing things is necessarily 'brute force'. It is really making better use of available resources. When you have a big chunk of memory, why not use it to store the decoded image rather than re-decode it every time?
CC
Computation and running from local cached storage is lower power cost than accessing a large area of memory from external DRAM.
Accessing a compressed image and partial decoding would theoretically require less trips to DRAM and therefore cost less power (depends on your decoder I guess).
But hey, you might be right, it might not be worth the effort.
Computation and running from local cached storage is lower power cost than accessing a large area of memory from external DRAM.
Accessing a compressed image and partial decoding would theoretically require less trips to DRAM and therefore cost less power (depends on your decoder I guess).
But hey, you might be right, it might not be worth the effort.
The partial decode solution results in multiple decodes and multiple data copies, a large number of these transactions are unlikely to be to local storage. Basically I think you get into swings and roundabouts territory when it comes to BW and the best approach would depend on the specific application.
It's also worth pointing out that the image captured from the local camera is highly likely to be in the form of a high res texture.
Wrt screen size or more specifically the resolution of panels in hand held devices I recommend a bit of research into where things are heading.
And, of course these cores are also being used in STB's, where 1920x1080p is the target resolution, so you won't be viewing a heavily zoomed out image.
Whichever way you look at it FP24 is borderline going forward imo.
John.
...
But you could answer a quick OT question to a layman: given the nature of the hardware did they have an alternative choice?
I think the simple answer is that there are plenty of ways to skin a cat! You're probably right that LRB isn't really relevant as you would end up with a completely new architecture if you come at it from the low end, in my opinion you probably end up with something that looks like SGX.
When I first heard of the original Mali I said back then that if a small IP company can integrate single cycle 4xMSAA then it's high time we see it in standalone GPUs too and we did in 2006, albeit many before that considered that it doesn't make "sense".
Two things for the above apart from the long list of obvious misconceptions: when it takes 4 cycles for 16xMSAA someone might also enable 4x Supersampling instead. Before anyone says that 16xMSAA delivers far better polygon edge/intersections AA than 4xSSAA (and be of course right), I'll say that with the latter you usually get also a -1.0 LOD offset which is near 2xAF quality.
The 2nd thing is something that might be neglected by most if not all IHVs in this market: albeit I understand the importance of having 2x,4x or more Multisampling on devices with small screens, it does sound to me that none of them has the power to add at least some portion of anisotropic filtering.
In my head Multisampling compared to Supersampling is a sort of "performance optimization", yet the first should also come with at least 2xAF to be comparable. And yes I understand that an advanced analytical AF algorithm along with the required TMU strength is a tough cookie to break for now when die area is so limited.
To me though it takes a bit more than good polygon edge/intersection antialiasing (which is what nowadays 5-10% of the total screen space?) and blurry bilinear filtering for the majority of the screen.
The way I look at it is that both MSAA and AF are optimisations of FSAA: MSAA localises the cost of FSAA to edges of geometry only and AF localises the cost of super sampling textures only to those surfaces that need it. Combining the two gives very good results with lower cost than FSAA.
John.
darkblu
27-Nov-2008, 14:06
I'm not sure how you'd do it on a PNG file, but probably taking a texture sub image which again would limit the section you're accessing when zoomed to a "manageable precision" (I know Apple desktop solution providers have to supply a number of required extensions, perhaps an extension attaching a sub region of a texture to a texture is one of them).
i'm not sure i'm following with the extension - glTexSubImage has been a part of the gl core spec for generations now.
other than that, if you play with an iphone long enough you'll notice two things - the average size of kinetically pannable surfaces is impressive, particularly for such a class of device, and the ultra-responsive scroll-ability, stretchability (pinch-zoom) and rotatability aspects put an extra stress on the dimensions and response times for working with surfaces. for me, personally, iphone's GUI is the most impressive mobile GPU showcase anybody has been able to come up with till now. i'm nothing short of amazed at the things the little mbx (lite) is able to pull, and had apple exposed the VGP's shading extension it'd have been my dream handheld.
why i'm bringing up all that - because a whole generation of mobile graphics devs will expect nothing less from the next generation of iphone (and handhelds, in general) - so those better have all the niceties that lowly mbx has: 2k x 2k textures, 2k x 1k viewports, 24bit depth buffers.
Ailuros
27-Nov-2008, 15:18
I think the simple answer is that there are plenty of ways to skin a cat! You're probably right that LRB isn't really relevant as you would end up with a completely new architecture if you come at it from the low end, in my opinion you probably end up with something that looks like SGX.
I meant that if Intel left LRB as it is on the hardware side of things and the driver went in the IMR direction instead, the overall logic budget would have been a lot higher than with sw tbdr. In my mind (and I could be wrong of course) the way they've designed their hardware their current drivers were the only sensible option.
The way I look at it is that both MSAA and AF are optimisations of FSAA: MSAA localises the cost of FSAA to edges of geometry only and AF localises the cost of super sampling textures only to those surfaces that need it. Combining the two gives very good results with lower cost than FSAA.
Agreed. I'd still love to see really usable AF even on small form factor devices. If you'd give me the weird dilemma between having MSAA or AF I'd immediately go for the latter.
Simon F
27-Nov-2008, 15:52
Yes you are right, you can't do random access into the fully compressed stream. You'd have to do the initial stages (huffman decode, etc.) to get the block information and then only do the full IDCT, etc. on the block you're interested in though, but that should be pretty quick I would have thought.
The comments Capt. CP made make it rather unpleasant. The closest you'll get in the literature to JPEG-like textures, AFAIK, was Microsoft's TREC that was in the Talisman system. The problem is that needs indirection which can result in GBH if you mention it to a hardware engineer. Please see the first two sections of this fool's paper (http://web.onetel.net.uk/%7Esimonnihal/assorted3d/fenney03texcomp.pdf).
TheArchitect
27-Nov-2008, 16:11
i'm not sure i'm following with the extension - glTexSubImage has been a part of the gl core spec for generations now.
You are quite right it has, the form addressing 3D texture is an extension in ES 2.0. I'm getting old and my brain's going. :sad:
other than that, if you play with an iphone long enough you'll notice two things - the average size of kinetically pannable surfaces is impressive, particularly for such a class of device, and the ultra-responsive scroll-ability, stretchability (pinch-zoom) and rotatability aspects put an extra stress on the dimensions and response times for working with surfaces. for me, personally, iphone's GUI is the most impressive mobile GPU showcase anybody has been able to come up with till now.
No arguments from me about how nice the IPhone UI is, glorious piece of work. To anyone selling Gfx IP into this space Iphone must have been a god send, it was all looking a bit ropey before that as no-one had really pushed gaming or even the UI experience in commercial devices.
i'm nothing short of amazed at the things the little mbx (lite) is able to pull, and had apple exposed the VGP's shading extension it'd have been my dream handheld.
Two possible reasons for that VGP extension being omitted are Apple not wanting reliance on single vendor extensions (paranoia about lock in) or, I think I'm right in saying, they did their own driver and that type of extension is quite a big chunk of work (implementation and verification, etc.), so may have been left out to reduce the workload.
However the discussion point was, "is FP24 enough representation to work with HD/2Kx2K textures or not?".
For the use cases described it's sufficient (I don't think there was any disagreement was there?). For some cases it might be a bit of a stretch, but for screen aligned content (that is anything planar to the screen) it's perfectly workable (probably a bit more of an issue if you start projecting the image into the Z).
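To put a rough number on the FP24 question, here is a back-of-envelope sketch. It assumes an FP24 layout with 16 stored mantissa bits (1 sign, 7 exponent, 16 mantissa); the thread does not state the exact format, so treat that as an assumption, along with the helper names. It counts how many fractional bits are left inside one texel when a normalised coordinate close to 1.0 addresses a texture of a given width.

```python
import math

def spacing(value, mantissa_bits):
    """Gap between adjacent representable floats near `value` (normalised range)."""
    _, exponent = math.frexp(value)  # value = m * 2**exponent with 0.5 <= m < 1
    return 2.0 ** (exponent - 1 - mantissa_bits)

def subtexel_bits(texture_size, mantissa_bits):
    """Fractional bits per texel for a normalised coordinate just below 1.0."""
    worst_gap = spacing(0.999, mantissa_bits)  # coarsest spacing within [0, 1)
    steps_per_texel = 1.0 / (texture_size * worst_gap)
    return int(math.log2(steps_per_texel))

fp16_bits = subtexel_bits(2048, 10)  # half float: 10 stored mantissa bits
fp24_bits = subtexel_bits(2048, 16)  # assumed FP24 layout
fp32_bits = subtexel_bits(2048, 23)  # IEEE single precision
```

Under these assumptions FP24 leaves 6 fractional bits per texel on a 2K-wide texture, against 13 for FP32 and none for FP16. That is enough for screen aligned content, but it leaves little headroom once further arithmetic is done on the coordinates.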
John's point about mobile and HD is valid as well, more and more handset vendors are talking HD out from handsets (that is via HDMI, wireless USB/BT connections and built in micro projectors in the future), MID's increase the possibility of larger built in resolution screens, although there are some limits to the usefulness of going too mad with res within the physical constraints of the devices (the physical ability of your eyeball to differentiate beyond a certain DPI at close range would render the investment in more pixels a bit pointless - Nokia have an R&D white paper on that some place).
why i'm bringing up all that - because a whole generation of mobile graphics devs will expect nothing less from the next generation of iphone (and handhelds, in general) - so those better have all the niceties that lowly mbx has: 2k x 2k textures, 2k x 1k viewports, 24bit depth buffers.
I think I'm right in saying most of the current crop of IP already meets or exceeds this specification (the dark horses would be AMD/ATI and Nvidia). The problem going further with precision etc. all comes at a cost in gate area, storage, power etc.
Whilst Iphone is a good example of the use of a GPU in a device like that and probably the first time graphics has played a significant role in the purchasing decision of the consumer it's still quite a way from being the primary factor for most people. Packing an ever more powerful GPU with more cost (in power, area and $$$ terms) has to meet some ROI criteria for the handset implementors before they'll deploy it. Although a lot of handsets are shipping with acceleration it's a long way from being ubiquitous at present and many may be happy to stick with what they have for the mainstay of the market.
...
However the discussion point was, "is FP24 enough representation to work with HD/2Kx2K textures or not?".
For the use cases described it's sufficient (I don't think there was any disagreement was there?). For some cases it might be a bit of a stretch, but for screen aligned content (that is anything planar to the screen) it's perfectly workable (probably a bit more of an issue if you start projecting the image into the Z).
It becomes marginal as soon as you start doing more interesting stuff, and it simply isn't future proof, where "future" probably lies within the life span of the current generation, particularly when you throw OpenCL into the mix.
...
I think I'm right in saying most of the current crop of IP already meets or exceeds this specification (the dark horses would be AMD/ATI and Nvidia). The problem going further with precision etc. all comes at a cost in gate area, storage, power etc.
I believe some members of the current crop are still limited to 16 bit Z.
When you amortise the cost of FP32 across vertex and pixel processing it isn't as significant a cost as you would think, but then that's another benefit of a unified shading engine.
Whilst Iphone is a good example of the use of a GPU in a device like that and probably the first time graphics has played a significant role in the purchasing decision of the consumer it's still quite a way from being the primary factor for most people. Packing an ever more powerful GPU with more cost (in power, area and $$$ terms) has to meet some ROI criteria for the handset implementors before they'll deploy it. Although a lot of handsets are shipping with acceleration it's a long way from being ubiquitous at present and many may be happy to stick with what they have for the mainstay of the market.
I wouldn't underestimate how big a seller eye candy is. Apple have already demonstrated this and have set the benchmark for user experience and I suspect that they're far from finished; those that don't follow may struggle to survive in my opinion.
John.
TheArchitect
28-Nov-2008, 00:12
It becomes marginal as soon as you start doing more interesting stuff, and it simply isn't future proof, where "future" probably lies within the life span of the current generation, particularly when you throw OpenCL into the mix.
I guess we'll have to wait and see on the CL front, not enough detail around to make that call at present, but that's probably true.
I believe some members of the current crop are still limited to 16 bit Z.
Really now that is interesting, probably not ES 2.0 though right?
When you amortise the cost of FP32 across vertex and pixel processing it isn't as significant a cost as you would think, but then that's another benefit of a unified shading engine.
It's still a cost on something that is already pretty big in terms of area versus the perceived value from a business point of view. That's what I was getting at.
We haven't got to the point where Phil Taylor can say - "you don't need a big CPU in mobile, just a big GPU" :wink:
I wouldn't underestimate how big a seller eye candy is. Apple have already demonstrated this and have set the benchmark for user experience and I suspect that they're far from finished; those that don't follow may struggle to survive in my opinion.
The eye candy has been made a factor by Apple that's for sure, until IPhone everything was a little bland. However it's still not the primary focus for most purchasers. Remember IPhone, N95 and friends (the high end HTC's, LG Prada etc.) are the tip of the iceberg in terms of total handset sales and revenue. Most people are still more wrapped up in "can I get it free on my contract?" or "yeah I use crackberry(tm) 'cos its standard issue at my firm".
Outside the US, IPhone (and similar devices) are also hampered by not having an "all you can eat" data plan (don't get me started on how much of a rip EU data provision is!)
I think the IPhone is still sold more on the aspirational qualities of the device than technical merit (although what it does it does very very well, but then Apple have always been very adept at making the complex very simple).
To get back to the point I was originally going to make in the last post (I forgot what I was going to write, then remembered after I sent it!). What is likely to drive the acceptance of more GPU grunt (more share of the PPA budget) is the heterogeneous compute model. This will grow the ROI for the GPU by enabling other "Visual Computing" applications to run successfully across the system (CPU, GPU, DSP, VLIW engines, etc.). To work well though, this will have to be a close partnership between GPU, CPU and DSP which will be hard to manage as a provider of a single piece of the puzzle (there is a whole ton of system crap you need to make that run well as well).
Ailuros
28-Nov-2008, 05:54
I think the IPhone is still sold more on the aspirational qualities of the device than technical merit (although what it does it does very very well, but then Apple have always been very adept at making the complex very simple).
Excuse the OT but I was actually almost sold on getting an iPhone. The reason I pulled back is that it doesn't support my native language (Greek) up to now and you can't send any MMS (yet?) either. Relevant updates are supposed to come soon, but I'll get the device only if some of the essential functionalities that I need will be included in the package.
I hope Apple hasn't made the same mistake in huge markets like the Chinese market; granted Greece is only a minuscule market compared to that, but when you advertise the iPhone as much as it has been here (in fact I've never seen up to now such an aggressive marketing campaign) you can't come along with such serious drawbacks. As it stands right now if I'd get it, it would be half way useless as a mobile phone.
TheArchitect
28-Nov-2008, 10:16
Excuse the OT but I was actually almost sold in getting an iPhone. The reason I pulled back is that it doesn't support my native language (Greek) up to now and you can't send any MMS (yet?) either. Relevant updates are supposed to come soon, but I'll get the device only if some of the essential functionalities that I need will be included in the package.
Bit of an odd one that (the Greek language omission), you'd have thought they'd have had most of the essential dialogues translated already.
No MMS support is a very poor show I have to agree, it's a bankable feature on pretty much every phone these days.
I hope Apple hasn't made the same mistake in huge markets like the Chinese market
Ah but then they get all those fun clones over there to fill the gaps - I love all of that stuff on Engadget :lol:
granted Greece is only a minuscule market compared to that, but when you advertise the iPhone as much as it has been here (in fact I've never seen up to now such an aggressive marketing campaign) you can't come along with such serious drawbacks. As it stands right now if I'd get it, it would be half way useless as a mobile phone.
The lack of aggressive marketing from others I think shows that the mainstay of the market is still focused on the "free with contract" handset.
Musing on this Apple might be playing a clever game and using early adopters and the closed shop of service providers as kind of a large scale beta test to hone the product before they make the push for the mainstream (IPhone nano anyone?)
Still though, in the grand cell phone scheme of things these devices represent a limited portion of the market at present and penetration into the wider market has a price sensitivity. Originally the hope was that operator average revenue per user (known as ARPU) would be driven by selling additional services and this would help sub the costs of the handset. Increasingly though we've seen the service providers marginalised partly through their own failure to execute and partly through "unrealistic revenue expectations" (a polite way of saying greed) and the big players (Nokia, Google, Apple, etc.) supplying value added service direct to customer. If operator services continue to be commoditised (voice/SMS is already there, with the aggregation of broadband and mobile data, that's probably not far behind either) and they don't get a value add play then they can't finance the subs and growth will be limited outside of those who can afford it. That's not an attractive thought in the current climate...
Okay now I'm depressed!
I guess we'll have to wait and see on the CL front, not enough detail around to make that call at present, but that's probably true.
The OpenCL spec is pretty much there so there's plenty of detail available now..
Really now that is interesting, probably not ES 2.0 though right?
I believe so.
It's still a cost on something that is already pretty big in terms of area versus the perceived value from a business point of view. That's what I was getting at.
We haven't got to the point where Phil Taylor can say - "you don't need a big CPU in mobile, just a big GPU" :wink:
A (very) small delta that future proofs your core sounds like a good deal to me.
The eye candy has been made a factor by Apple that's for sure, until IPhone everything was a little bland. However it's still not the primary focus for most purchasers. Remember IPhone, N95 and friends (the high end HTC's, LG Prada etc.) are the tip of the iceberg in terms of total handset sales and revenue. Most people are still more wrapped up in "can I get it free on my contract?" or "yeah I use crackberry(tm) 'cos its standard issue at my firm".
It is true that the bulk of units are what comes for free but we will see ripple down, and that aside, it's innovative products like the iPhone that will push people to want and expect more.
Outside the US, IPhone (and similar devices) are also hampered by not having an "all you can eat" data plan (don't get me started on how much of a rip EU data provision is!)
I think the IPhone is still sold more on the aspirational qualities of the device than technical merit (although what it does it does very very well, but then Apple have always been very adept at making the complex very simple).
Hmm, I know a lot of people who own iPhones because of its fluidity of execution as opposed to the aspirational factor, but hey I haven't seen any marketing data on who's been buying the thing.
To get back to the point I was originally going to make in the last post (I forgot what I was going to write, then remembered after I sent it!). What is likely to drive the acceptance of more GPU grunt (more share of the PPA budget) is the heterogeneous compute model. This will grow the ROI for the GPU by enabling other "Visual Computing" applications to run successfully across the system (CPU, GPU, DSP, VLIW engines, etc.). To work well though, this will have to be a close partnership between GPU, CPU and DSP which will be hard to manage as a provider of a single piece of the puzzle (there is a whole ton of system crap you need to make that run well as well).
Lol, glad to see you're still toeing the ARM company line even though you don't work for them now. Notwithstanding the fact that IMG already works very closely with its key partners, initiatives like OpenCL will address the base issue of portability of key algorithms. Further, there has already been some interesting work by 3rd parties with NV's Cuda to do load balancing across CPU and GPU. If anything I think this is a time where the likes of ARM probably need to be looking over their shoulder.
John.
TheArchitect
28-Nov-2008, 19:52
A (very) small delta that future proofs your core sounds like a good deal to me.
On the FP32 thing? Maybe, but I wasn't just talking about that, the statement I was making is more about a general market issue, not that specific feature. The mobile graphics community seems to be on a "go large" kick (to a certain extent survival depends on it), I'm just expressing a view that it may not be a sustainable business model to keep going bigger and bigger. At some point OEM's may call a halt in an attempt to consolidate on wringing out the power from the tech they have now.
It is true that the bulk of units are what comes for free but we will see ripple down, and that aside, it's innovative products like the iPhone that will push people to want and expect more.
I used to buy that, but I'm not so sure now, it's taking much longer for HW graphics to filter down and I think part of that is associated cost. 2420 has been in the market for an age for a mobile chip, you would have expected the price to start rolling off and the trickle down to have begun already, but no signs yet.
Hmm, I know a lot of people who own iPhones because of its fluidity of execution as opposed to the aspirational factor, but hey I haven't seen any marketing data on who's been buying the thing.
We probably move in different circles to the mass market, most of the people we know are more "informed" consumers or have been influenced by one, so probably not an entirely representative sample. As I pointed out in the last post though, no argument that the IPhone is a well executed user experience and other handset manufacturers should take note.
I was actually looking for some market data myself last night and came across an article claiming that distributors had been gagged by Apple with regards to sales figures etc. Not sure how reliable that is, but I bet that information would be gold.
Lol, glad to see you're still toeing the ARM company line even though you don't work for them now.
Is this ARM's line? From what I've seen I don't think they've worked out that it might be an idea to have a more integrated story on this stuff! Yes they own all the right bits, but that's only half the story. Still very much a CPU-centric mindset there. You don't see the levels of activity you see from AMD, IBM and more recently Intel in blurring the line and moving to the heterogeneous model.
I'm of the mind that if I'm presented with a system full of programmable units and they are not fully occupied then I should be allowed to use them to speed up something else. I don't give a crap if its CPU, GPU, DSP or whatever as long as I get to use it to innovate and differentiate my product (as long as I don't have to learn yet another instruction set, programming model or tool kit).
Notwithstanding the fact that IMG already works very closely with its key partners, initiatives like OpenCL will address the base issue of portability of key algorithms.
Couldn't agree more, OpenCL addressing the portability issue is a huge leap forward. I've heard some concern about the ability of non-CPU compute units (GPUs desktop and embedded, DSPs etc.) to handle larger and more complex code sections, rather than performance hot spots, which to be honest is fine: this is V1.0 and it's designed to fit today's hardware, but that needs to be looked at going forward. Then there is also the issue of system level data management/marshalling and movement.
Poor implementation of that side of things will kill performance and hamper the uptake (or at least reduce the scope of usefulness); this is where owning more of the system level should be a strength. The infrastructure management is always the critical bit in stitching together a system with IP from multiple vendors (particularly when they may be competing for more of the solution) and is often the most painful bit to get right. There is little compulsion from an IP provider's point of view to sink lots of man hours into this stuff as it has a low overall return for the business (versus say putting those guys on the embedded compiler for the devices) unless they own more of the overall IP or key system level enabling components.
Further, there has already been some interesting work by 3rd parties with NV's CUDA to do load balancing across CPU and GPU.
I've seen a lot of demos where they record big numbers, but nothing in a real running system deployment as yet. A lot of the claim about running physics and AI code on the GPU through CUDA turns out to be marketing fluff when you dig into it with the ISV's. I'm not saying it's a bad idea, just that your application hotspot needs to fit the current system level limitations to see a net return. Has Khronos made any statements about OpenCL and auto or directed load balancing yet?
(I did like the Myth Busters paintball demo at Nvision - wonder how much that cost them? - yes I know that wasn't CUDA, but was just a nice bit of marketing, which they do very well).
If anything I think this is a time where the likes of ARM probably need to be looking over their shoulder.
They are on the defensive on several fronts, not least of which is against Intel (it's like watching the Croc Hunter when he's poking a stick up an anaconda's bum: you sit there thinking "any minute now and wham!"). They still haven't completely killed off MIPS, PowerPC, Tensilica or ARC either. Maybe one of those will turn out to be a stingray with an attitude?
Having said that though the snake did bite itself in the arse at their last IDF (I'd ask for a refund on that guy's PR training - struth!).
That got me thinking actually - (not the stick up a big snake's bum bit) If you've got your largest partners off doing "rolling your own" versions of ARM cores, doesn't the value of ARM diminish to them? Does it get reduced to the point where the only thing people are buying is the ISA and the tools?
Anyway it always pays to keep looking over your shoulder no matter who you are! Only the paranoid survive and all that... (or is it just because you're paranoid doesn't mean they aren't out to get you).
Ailuros
29-Nov-2008, 06:02
I'm of the mind that if I'm presented with a system full of programmable units and they are not fully occupied then I should be allowed to use them to speed up something else. I don't give a crap if its CPU, GPU, DSP or whatever as long as I get to use it to innovate and differentiate my product (as long as I don't have to learn yet another instruction set, programming model or tool kit).
Can't disagree here; the only other thing I'd like to add is that for now any change in that direction sounds to me quite easier on a SystemOnChip than on a PC as we know it today. At least in the first case the CPU isn't sitting on the wrong side of the bus.
Of course the immediate answer to the latter will be "ideas" like Fusion and the likes and albeit it goes a couple of years down the road I severely doubt that they'll be able to capture anything above the budget to mainstream PC market after all.
Besides in the less foreseeable future I'd rather suspect - as seen several times in the past - that when programmability advances to a certain point the circle closes and there's usually some drift back to fixed function hardware under different terms each time. In my mind it's a perpetuum mobile as demands constantly rise especially for graphics. There's no such thing as we've reached good enough IQ and we can now easily concentrate on everything else. In 1998 most played in 800*600-1024*768 with 16bpp and bilinear. Now only 10 years later you'd expect from a high end system to give you equivalent performance with at least 4xMSAA, 16xAF, fp HDR in 1680*1050-1900*1200 at 32bpp.
silent_guy
29-Nov-2008, 06:24
I'm of the mind that if I'm presented with a system full of programmable units and they are not fully occupied then I should be allowed to use them to speed up something else. I don't give a crap if its CPU, GPU, DSP or whatever as long as I get to use it to innovate and differentiate my product (as long as I don't have to learn yet another instruction set, programming model or tool kit).
That's why every company I've worked for tries extremely hard to keep the innards of different blocks in their chip obfuscated to the customer. Not because exposing them would be too educational for competitors, by the time they'd learn about it, it'd be too late anyway, but to protect the customer from their own dumb ideas and because the support issues would be a nightmare, if only because all complex chips on the market have a ton of bugs, with often very peculiar SW work-arounds.
Pretty much all complex chips, no matter which application, have hidden ARM's, MIPS'en, Sparc's, Tensilica's or other C programmable CPU's that are carefully locked down before being shipped to customers. In theory, they could be used to actually calculate something useful, but it's just not worth exposing them.
TheArchitect
29-Nov-2008, 09:24
Can't disagree here; the only other thing I'd like to add is that for now any change in that direction sounds to me quite easier on a SystemOnChip than on a PC as we know it today. At least in the first case the CPU isn't sitting on the wrong side of the bus.
I'd agree. Achieving this in a SoC is relatively easy when compared to a modern PC, but internal blocks in a SoC still have issues with competition for the minuscule bandwidth you get in embedded systems, which is often compounded by poorly implemented access schemes yielding low utilisation of the bus.
Besides in the less foreseeable future I'd rather suspect - as seen several times in the past - that when programmability advances to a certain point the circle closes and there's usually some drift back to fixed function hardware under different terms each time. In my mind it's a perpetuum mobile as demands constantly rise especially for graphics. There's no such thing as we've reached good enough IQ and we can now easily concentrate on everything else. In 1998 most played in 800*600-1024*768 with 16bpp and bilinear. Now only 10 years later you'd expect from a high end system to give you equivalent performance with at least 4xMSAA, 16xAF, fp HDR in 1680*1050-1900*1200 at 32bpp.
I'm with you for the desktop, but the physical constraints of a pocketable device (even when using the "docked" model - i.e. mobile device "plugged in" to a home dock with mains power, a hardline to ethernet and HDMI connections) make chasing these things considerably more problematic and cannot be solved by just 'roiding up each of the subsystems generation on generation.
The high end system of today has a 1kW PSU, massive amounts of active cooling and scant regard for its carbon footprint other than to prevent its power rails going molten under load. You can't pull the same trick in a mobile in the near future and this is why I'm suggesting it's better to get more from what's there already.
Ailuros
30-Nov-2008, 05:54
I'd agree. Achieving this in a SoC is relatively easy when compared to a modern PC, but internal blocks in a SoC still have issues with competition for the minuscule bandwidth you get in embedded systems, which is often compounded by poorly implemented access schemes yielding low utilisation of the bus.
If it makes sense to fuse CPU and GPU capabilities into single chips in the desktop space, I don't see why it wouldn't make sense for the smaller markets either. I'd think that such solutions could capture up to the mainstream segment of the mobile market, and while I don't think it would solve all possible problems, at least for central processing and graphics you shouldn't have two units fighting for bandwidth since I'd expect the device itself to balance out/control the whole process.
I'm with you for the desktop, but the physical constraints of a pocketable device (even when using the "docked" model - i.e. mobile device "plugged in" to a home dock with mains power, a hardline to ethernet and HDMI connections) make chasing these things considerably more problematic and cannot be solved by just 'roiding up each of the subsystems generation on generation.
The high end system of today has a 1kW PSU, massive amounts of active cooling and scant regard for its carbon footprint other than to prevent its power rails going molten under load. You can't pull the same trick in a mobile in the near future and this is why I'm suggesting it's better to get more from what's there already.
I'd say that the biggest problem mobile GPUs have at the moment is with advanced texture filtering. Other than that they're scaling in IQ improving features faster than the desktop GPUs ever did. There's 4xRGMS available both on Mali and on SGX and both can utilize up to 16x sample AA for OpenVG (where performance isn't as much an issue and that amount of samples is actually necessary) if needed.
Anisotropic algorithms have the huge advantage nowadays of being highly adaptive; they filter only the surfaces that actually need X amount of samples, and while antialiasing mostly needs bandwidth (a non issue on TBDRs and the reason why NVIDIA has implemented coverage sampling on its APX2500) anisotropic on the other hand mostly needs fillrate. Today's high end GPUs might have massive fill-rates yet none of them even bother to do full trilinear with AF by default; GeForces wouldn't have such a big problem with it due to their massive bilerp rates, but Radeons might have it slightly tougher there if they didn't use that amalgam between bi- and trilinear often referred to as "brilinear". With that they virtually get an approximation of trilinear for free.
A current GPU for mobile/PDA devices mostly has 1 TMU today and in some rare exceptions 2 TMUs. While I understand that adding filtering capabilities will add die area, I don't think it would be as big a problem for future generations. In the desktop space you have dozens of TMUs (up to 96 for the GT200 currently) and there the "bill" is proportionally a lot higher, as of course mobile resolutions are still minuscule compared to the desktop. How large can you get a screen on a mobile phone anyway, without the device ending up the size of a shoe?
Besides I have the feeling that fixed function hw will further minimize in the future for texture mapping/filtering and render outputs might get even more absorbed by other units like the ALUs and/or memory controllers.
Finally while I understand that competing IHVs struggle amongst other things to have the highest possible feature set for each generation per mW, I personally feel that a line could be drawn and not follow the desktop parts in that regard as closely. SGX already mentions procedural geometry in its whitepapers; I have the feeling that none of it is actually necessary for up to 10.1 and I severely doubt that any mobile developer would deal with any of it for the lifetime of that generation. The question here is if transistor budget for feature X could have been a better investment elsewhere.
What are the differences between the single core Mali-400 MP and a Mali 200?
Ailuros
02-Dec-2008, 13:19
What are the differences between the single core Mali-400 MP and a Mali 200?
Is Mali400 a single or a dual core config? I had the impression it's the latter.
Laurent06
02-Dec-2008, 15:00
Is Mali400 a single or a dual core config? I had the impression its the latter.
If we are to believe these documents
- ARM site (http://www.arm.com/miscPDFs/21863.pdf)
- cbinews (www.cbinews.com/uploadfile/whitepaper/2008-06-10/10151751.pdf)
Mali-400 can have one to four cores.
Also according to the document from ARM site above, it looks like Mali-400 with 1 core has the same pixel performance as Mali-200 but twice its geometry performance.
Here's a presentation that ARM did on Mali at the ARM DEVCON 2008
http://library.corporate-ir.net/library/19/197/197211/items/310754/Mike_Dimelow.pdf
A lot of interesting info and assertions, some highlights are:-
Mali 400MP "beats 6 SGX cores 520/530/535/540/545/550"
Hmmm...comparing Mali 400MP to 520/530/535 ?? what's that about
550 never existed ?
Mali 400MP dual core fill rate 550Mpix 30M tri, SGX545 does 1000Bpix and 40M tri ?
http://www.imgtec.com/factsheets/powervr/POWERVR_SGX_Series5_IP_Core_Family_[2.3].pdf
13 GPU licenses and state the areas as being PMP, PND, wireless and STB. Lead partner for Mali-400 is ST Micro. Other stated licencees (for various Mali cores) are Zoran, ST, Micronas, Ericsson (note not Sony Ericsson), Cisco Systems, NXP, Telechips, RMI, Broadcom. Remaining licencees are "private".
Another OpenGLES1.x core coming out in 2009, higher end multi-core Vithar coming in 2010, and multi-core top end THOR in 2012. (slide 13)
Slide 15 shows target silicon will be shipping to 8 customers in 2008/9
Launch dates for various Mali-ed products are on slide 13.
Here's a presentation that ARM did on Mali at the ARM DEVCON 2008
Cheers.
Hmmm...comparing Mali 400MP to 520/530/535 ?? whats that about
550 never existed ?
Surely they meant 555?
Mali 400MP dual core fill rate 550Mpix 30M tri, SGX545 does 1000Bpix and 40M tri ?
Tsk tsk, marketing ftw? PowerVR's numbers include a 2.5x multiplier because of the claimed higher efficiency of TBDRs when there is overdraw. That number is a tad excessive, but it is fair to say that if you designed a game for both IMRs and TBDRs then you could save on a Z-Pass which does increase your effective bandwidth.
Mali 400 has 1 TMU per core, which is similar to SGX 530. Their dual-core variant would, TMU-wise, be equivalent to a SGX 540/543, while their quad-core variant would be equivalent to a SGX 555. All per clock, ignoring any potential effective fillrate gain from SGX being a TBDR. One of the key reasons why they claim they 'beat' SGX performance-wise is that Mali is supposedly clockable at 240-275MHz, while PowerVR quotes 200MHz and in practice OMAP3 only delivers 100-133MHz (probably about 200MHz on 45nm though! ;))
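The per-clock arithmetic in the two paragraphs above can be sketched in a few lines of Python. This is only an illustration under the thread's own assumptions: one pixel per TMU per clock, one TMU per Mali-400 core, and a flat 2.5x overdraw multiplier baked into PowerVR's quoted figures. The function names are made up for the example, and the 1000 MPix/s input is just the SGX545 figure discussed above read as MPix/s.

```python
def raw_fillrate_mpix(tmus: int, clock_mhz: float) -> float:
    """Peak raw fillrate in MPixels/s, assuming one pixel per TMU per clock."""
    return tmus * clock_mhz


def strip_overdraw_factor(quoted_mpix: float, factor: float = 2.5) -> float:
    """Convert an 'effective' fillrate that includes an overdraw multiplier back to a raw one."""
    return quoted_mpix / factor


# Mali-400 MP: 1 TMU per core (per the post), at the clocks ARM is said to quote.
for cores, mhz in [(1, 275), (2, 275), (2, 240), (4, 275)]:
    rate = raw_fillrate_mpix(cores, mhz)
    print(f"Mali-400 MP x{cores} @ {mhz} MHz: {rate:.0f} MPix/s raw")

# A figure quoted with the 2.5x multiplier, reduced to a raw number.
print(f"1000 MPix/s effective -> {strip_overdraw_factor(1000):.0f} MPix/s raw")
```

With these inputs the dual-core case at 275MHz gives the 550Mpix from the presentation, and the quad-core case lands just over 1G pixels per second, so the headline Mali numbers look like plain TMU x clock figures rather than effective ones.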
Given that the same presentation mentions a benchmark done at 220MHz, I'm a little bit skeptical that even 240MHz is realistic for the first SoCs but we'll see what happens. The history of meeting clock speed targets in the industry is pretty damn awful.
And of course, there's more to handheld GPUs than the number of TMUs and peak polygon throughput - although I guess many people in the industry often forget about that, or just don't care. The only two companies I'm aware of which have detailed (up to a certain extent) their shader pipeline is ATI and... Samsung (see my other post a minute ago). Oh well, maybe one day!
Launch dates for various Mali-ed products are on slide 13.
Yes, it's interesting that ST-Micro seems to be their only Mali-400MP licensee at this point though! At least Mali 200 has a lot more momentum than I thought... :)
Here's a presentation that ARM did on Mali at the ARM DEVCON 2008
Slide 7 there doesn't match http://www.arm.com/miscPDFs/21863.pdf on performance & area though (at least for M200).
Yes, it's interesting that ST-Micro seems to be their only Mali-400MP licensee at this point though!
Lead licensee in ARM lingo usually means flagship or "driving customer" rather than "the only one", although I haven't seen any other announcements for M400 so far. Could it have anything to do with the "high-end 3-D graphics accelerator" mentioned here? (http://www.redorbit.com/news/technology/1640224/stericsson_and_nokia_announce_cooperation_to_provide_nextgeneration_smartphone_platform/index.html?source=r_technology)
Ailuros
28-Feb-2009, 07:48
Slide 7 there doesn't match http://www.arm.com/miscPDFs/21863.pdf on performance & area though (at least for M200).
5mm2 vs. 4.1mm2 at 65nm and it doesn't have a ~ as with M400 estimated die sizes. Past estimates were a lot more optimistic:
http://www.hardocp.com/article.html?art=ODAyLDEsLGhlbnRodXNpYXN0
Mali 400MP "beats 6 SGX cores 520/530/535/540/545/550"
Hmmm...comparing Mali 400MP to 520/530/535 ?? whats that about
550 never existed ?
If you want to compare MP vs. MP (with 2 cores at a time) and always on what each of the two has officially announced:
Mali MP dual core:
275MHz / 550MPixels/s / 30M Tris/s = ~9mm2@65nm (which should actually read at 240MHz since that's their rated max frequency for 65nm)
SGX543 dual core:
200MHz / 800MPixels/s / 70M Tris/s = 16mm2@65nm
SGX543 single core:
200MHz / 400MPixels/s / 35M Tris/s = 8mm2@65nm
Since diagrams show that MaliMP scales fragment processors but still has just one vertex processor (which has been confirmed by arjan in another thread) I'd rather give MaliMP 18M Tris/s than 30 after all. Besides the fact that you gain higher geometry efficiency on a USC in general, if you normalize both on the same frequency level (and since I haven't used any overdraw factor for SGX fillrates) I sure hope their die estimates for Mali MP are more accurate than in the past.
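To make the "normalize both on the same frequency level" step concrete, here is a rough Python sketch using only the figures listed above. It assumes throughput scales linearly with clock and that the quoted die areas are comparable, neither of which is guaranteed; the class and method names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class GpuConfig:
    name: str
    clock_mhz: float
    fill_mpix: float   # MPixels/s as quoted, no overdraw factor applied
    tris_m: float      # M triangles/s as quoted
    area_mm2: float    # estimated die area at 65nm

    def at_clock(self, target_mhz: float) -> "GpuConfig":
        """Rescale throughput to another clock, assuming linear scaling."""
        scale = target_mhz / self.clock_mhz
        return GpuConfig(self.name, target_mhz, self.fill_mpix * scale,
                         self.tris_m * scale, self.area_mm2)

    def fill_per_mm2(self) -> float:
        return self.fill_mpix / self.area_mm2


mali_mp2 = GpuConfig("Mali MP dual core", 275, 550, 30, 9)
sgx543_mp2 = GpuConfig("SGX543 dual core", 200, 800, 70, 16)

# Compare both at SGX's quoted 200MHz.
for gpu in (mali_mp2.at_clock(200), sgx543_mp2):
    print(f"{gpu.name} @ {gpu.clock_mhz:.0f}MHz: {gpu.fill_mpix:.0f} MPix/s, "
          f"{gpu.tris_m:.1f}M Tris/s, {gpu.fill_per_mm2():.1f} MPix/s per mm2")
```

At 200MHz the dual-core Mali figure drops to 400 MPix/s and roughly 21.8M Tris/s on paper, before any further discount for the single vertex processor mentioned above.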
5mm2 vs. 4.1mm2 at 65nm and it doesn't have a ~ as with M400 estimated die sizes. Past estimates were a lot more optimistic:
http://www.hardocp.com/article.html?art=ODAyLDEsLGhlbnRodXNpYXN0
Looking at that presentation, I can't help but notice the 300MP/s @ 200MHz number. Err, what? 1.5 TMUs or ROPs, really? Gives me a hunch it might be inflated in the same way TBDR numbers are (although in this case I can't figure out what the reasoning might be!) and so the 2-3mm2 was really for a half-pixel TMU configuration, while the only Mali200 config that still exists today is a full-pixel TMU. I could be horribly wrong, of course. Of course, in this context it is also interesting to look at ARM's claimed scaling numbers from 90GP to 65LP in the presentation tangey posted - they are pretty awful (1.5->1mm2 for Mali55!) and can also help explain the numbers a bit.
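Both sanity checks in that post are simple ratios, sketched below in Python. The "ideal" shrink here is the textbook assumption that area scales with the square of the feature size; real processes rarely achieve it, and 90GP and 65LP are different process flavours, so treat the comparison as indicative only. The function names are made up for the example.

```python
def pixels_per_clock(fill_mpix: float, clock_mhz: float) -> float:
    """Pixels written per clock implied by a quoted fillrate."""
    return fill_mpix / clock_mhz


def ideal_area_scale(from_nm: float, to_nm: float) -> float:
    """Area ratio for a perfect optical shrink (area ~ feature size squared)."""
    return (to_nm / from_nm) ** 2


# 300MP/s at 200MHz: the "1.5 TMUs or ROPs" puzzle.
print(f"pixels per clock: {pixels_per_clock(300, 200)}")

# Claimed Mali55 area scaling, 1.5mm2 -> 1mm2, against an ideal 90nm -> 65nm shrink.
claimed = 1.0 / 1.5
ideal = ideal_area_scale(90, 65)
print(f"claimed area ratio: {claimed:.2f}, ideal shrink: {ideal:.2f}")
```

The claimed ratio comes out around 0.67 against roughly 0.52 for an ideal shrink, which is the gap being described as "pretty awful".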
BTW: I found this little gem in an ARM presentation. A die partitioning for the Mali55! ;) There's a fair bit of SRAM, but the biggest block is obviously the 'Texture Mapper', followed by the 'Tri setup master', and then a bunch of smaller ones that are much harder to read but which include a 'Framebuffer/Blenders' block and a 'MMU AMB' one.
http://www.jp.arm.com/event/pdf/forum2007/t3-1.pdf - Page 11
I actually also found ARM11, Cortex-A8 and Cortex-A9 partitionings in the past, although I can't remember where I saved them if I did at all... Might not be easy to find again sadly but it is out there.
Ailuros
28-Feb-2009, 13:24
Looking at that presentation, I can't help but notice the 300MP/s @ 200MHz number. Err, what? 1.5 TMUs or ROPs, really? Gives me a hunch it might be inflated in the same way TBDR numbers are (although in this case I can't figure out what the reasoning might be!) and so the 2-3mm2 was really for a half-pixel TMU configuration, while the only Mali200 config that still exists today is a full-pixel TMU.
That presentation was before ARM bought Falanx and it fairly sounds like an effective fillrate. It's the fillrate that struck you as weird? Try 20GFLOPs@200MHz from a mere 3mm2@90nm core. Damn creative math if you ask me.
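For a sense of scale, the 20GFLOPs@200MHz claim can be turned into a per-clock figure. This is a back-of-the-envelope sketch; counting a multiply-add as 2 FLOPs is the usual marketing convention and is an assumption here, not something the presentation states.

```python
def flops_per_clock(gflops: float, clock_mhz: float) -> float:
    """Floating point operations per clock implied by a GFLOPs figure."""
    return (gflops * 1e9) / (clock_mhz * 1e6)


per_clock = flops_per_clock(20, 200)
print(f"{per_clock:.0f} FLOPs per clock")
# Assuming each multiply-add counts as 2 FLOPs:
print(f"{per_clock / 2:.0f} scalar MADs per clock")
```

That works out to 100 FLOPs, or 50 scalar multiply-adds, every clock from a core of about 3mm2 at 90nm.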
Instead of trying to convince with their presentations that they consume X less bandwidth than the competition (which I have severe doubts it's even true), it would be nice for a change not to explain their final die sizes but each cores final capabilities in relation to the real final die size.
BTW: I found this little gem in an ARM presentation. A die partitioning for the Mali55! :wink: There's a fair bit of SRAM, but the biggest block is obviously the 'Texture Mapper', followed by the 'Tri setup master', and then a bunch of smaller ones that are much harder to read but which include a 'Framebuffer/Blenders' block and a 'MMU AMB' one.
http://www.jp.arm.com/event/pdf/forum2007/t3-1.pdf - Page 11
Jebus I never would have noticed without you pointing me at it.
That presentation was before ARM bought Falanx and it fairly sounds like an effective fillrate. It's the fillrate that struck you as weird? Try 20GFLOPs@200MHz from a mere 3mm2@90nm core. Damn creative math if you ask me.
Well, it's the fillrate that struck me as not plausibly being a raw number. The GFlops number, I'd find believable at 3mm2 if you have a large batch size and it's FP16. However it was my understanding that their PS is FP24 and the VS is FP32... So I'll admit it certainly wasn't believable as a raw number either.
Instead of trying to convince with their presentations that they consume X less bandwidth than the competition (which I have severe doubts it's even true)
WRT bandwidth, I think many of their comparisons make perfect sense relative to a basic kind of tiler that doesn't really exist in the industry, although it'd probably be nearest what ATI did in the OpenGL ES 1.x generation (but not quite that either).
Anyway, most of the bandwidth claims in the industry are even less credible than most of the spam e-mails I get in my mailbox. Probably at the same level as home fitness equipment marketing...
Jebus I never would have noticed without you pointing me at it.
Well it helps that I saw similarly colored diagrams in much larger formats for other ARM cores in the past, so my brain instantly realized the similarity... :)
Ailuros
01-Mar-2009, 18:21
Well, it's the fillrate that struck me as not plausibly being a raw number.
Bottomline is they're theoretically claiming higher raw fillrates, which they're not on the same frequency basis. I won't exclude that their cores could end up more tolerant to higher frequencies, but those figures they're presenting don't even point that way.
Anyway, most of the bandwidth claims in the industry are even less credible than most of the spam e-mails I get in my mailbox. Probably at the same level as home fitness equipment marketing...
Or that the 1st generation GoForce had pixel shaders.... :P
Don't get me wrong I really like Falanx and I find Mali as an architecture very interesting, in fact more interesting than Tegra. I just don't see the reason for that type of marketing; you lose more than you gain after all at least IMHLO.
Or that the 1st generation GoForce had pixel shaders.... :P
Heh, the most terrifying thing is it did have pixel shaders; in a few ways they seemed in fact more advanced than the GF3's... Yet it didn't have true Early-Z; depth testing saved power through clock gating and preventing memory accesses, but it never improved shading performance one iota. Ugh... It's a pretty weird and not always very logical architecture.
Here's the relevant patent: http://v3.espacenet.com/publicationDetails/description?CC=EP&NR=1759380A2&KC=A2&FT=D&date=20070307&DB=EPODOC&locale=en_EP
What is much more laughably primitive in the original GoForce is the transform engine, which just reuses the setup engine to do very basic transforms in HW instead of on the CPU. Honestly, I'm not sure why they even bothered... heh. And of course, the whole 'let's keep the framebuffer/textures in on-chip SRAM!' idea was insane. The original GoForce probably was awful at hiding memory latency therefore; I wonder how/if that evolved in the 4800/5500 when they started being dependent on external memory for textures.
Don't get me wrong I really like Falanx and I find Mali as an architecture very interesting, in fact more interesting than Tegra. I just don't see the reason for that type of marketing; you lose more than you gain after all at least IMHLO.
Yeah, it is definitely interesting. The basic rendering strategy is certainly much more interesting than Tegra's. I don't know how exciting/smart the low-level details are for either since that's basically unknown for everybody in the handheld world, but I can honestly say I'd love to know in both cases... ;)
Ailuros
02-Mar-2009, 08:47
Heh, the most terrifying thing is it did have pixel shaders; in a few ways they seemed in fact more advanced than the GF3's... Yet it didn't have true Early-Z; depth testing saved power through clock gating and preventing memory accesses, but it never improved shading performance one iota. Ugh... It's a pretty weird and not always very logical architecture.
Here's the relevant patent: http://v3.espacenet.com/publicationDetails/description?CC=EP&NR=1759380A2&KC=A2&FT=D&date=20070307&DB=EPODOC&locale=en_EP
I just skimmed through it, but it rather sounds like a generic scalar ALU, which (unless I've missed something) I wouldn't necessarily conclude is capable of pixel shading.
What is much more laughably primitive in the original GoForce is the transform engine, which just reuses the setup engine to do very basic transforms in HW instead of on the CPU. Honestly, I'm not sure why they even bothered... heh. And of course, the whole 'let's keep the framebuffer/textures in on-chip SRAM!' idea was insane. The original GoForce probably was awful at hiding memory latency therefore; I wonder how/if that evolved in the 4800/5500 when they started being dependent on external memory for textures.
Didn't they also claim a Geometry engine?
Anyway we're way OT with that kind of stuff.
Yeah, it is definitely interesting. The basic rendering strategy is certainly much more interesting than Tegra's. I don't know how exciting/smart the low-level details are for either since that's basically unknown for everybody in the handheld world, but I can honestly say I'd love to know in both cases... ;)
All in all I have the feeling that IMG announced 543MP just to take the wind out of their marketing sails; ok way too exaggerated but I think you get my point. Even if you'd get to linear scaling with multiple cores, there's always a portion of redundancy involved and in those particular markets die area and power consumption are way more critical than in any other market.
The fact that die estimates were way off in the past isn't really annoying me with Mali. What annoys me is that estimated performance and featureset of that 2005 presentation are quite on a different level than the final result.
Mali can through 4 passes yield 16xMSAA; under normal gaming conditions the resources for that are way too high. For anything OpenVG though (always depending on the amount of sub-paths in each path) it's certainly a sample amount that might be needed there.
I've no idea if NV made any modifications to their CSAA algorithm for OpenVG; if not it might give some nasty side-effects with VG content.
I just skimmed through it, but it rather sounds like a generic scalar ALU, which (unless I've missed something) I wouldn't necessarily conclude is capable of pixel shading.
So what else do you want? Don't you remember how incredibly basic DX8 Pixel Shading was? :)
The fact that die estimates were way off in the past isn't really annoying me with Mali. What annoys me is that estimated performance and featureset of that 2005 presentation are quite on a different level than the final result.
Yup, it's pretty depressing seeing how every single handheld chip/IP I've ever looked at, I've *always* overestimated its 3D performance by at least 2x until I had the real info. At least in Mali's case, I can claim it's not my fault... ;)
I've no idea if NV made any modifications to their CSAA algorithm for OpenVG; if not it might give some nasty side-effects with VG content.
Sigh, I really am a retard. See, Neil Trevett (president of Khronos/chair of OpenGL ES) was at the NV stand and I didn't realize it until it was too late, so I didn't ask him any questions except stuff obviously related to NV/Tegra. Bah! :( Heck, I even realized he was probably at the stand, but didn't realize that was him....
Ailuros
03-Mar-2009, 06:45
So what else do you want? Don't you remember how incredibly basic DX8 Pixel Shading was? :)
I as a layman have a hard time calling register combiners as pixel shaders but that's just me LOL :P
Sigh, I really am a retard. See, Neil Trevett (president of Khronos/chair of OpenGL ES) was at the NV stand and I didn't realize it until it was too late, so I didn't ask him any questions except stuff obviously related to NV/Tegra. Bah! :( Heck, I even realized he was probably at the stand, but didn't realize that was him....
It's never too late to find out ;)
RMI have recently launched a series of MIPS based app pros (Au1300) with some of them including Mali200 cores. That's the first I've seen of Mali being mated to a non-ARM processor.
http://www.rmicorp.com/products/Au1300.htm
Performance stats for the graphics core are stated as:-
• Open GL ES 1.1 and 2.0 and OpenVG 1.1 standards support.
• Vertex and Fragment shaders.
• 10M polygons per second.
• 4x full-screen anti-aliasing with no impact on performance.
• Up to 25x FSAA supported.
• Alpha blending and texture caching.
So in this implementation at least, they are getting 10M polys, which is quite different from the 16M stated here:-
http://www.arm.com/miscPDFs/21863.pdf
But again it's hard to make comparisons as there is nothing in the RMI data that hints at either the 3D graphics clock, or the fab process used for the chip.
Ailuros
05-Mar-2009, 06:36
Polygon rates should be truly subject to core frequency used. I'm just a bit puzzled with the up to 25xFSAA odd sample amount.
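If polygon rate really does scale with core clock, the 10M vs. 16M gap implies a clock ratio rather than a different design. A quick Python sketch, assuming strictly linear scaling and the same triangles per clock in both cases; the 200MHz reference clock in the second call is purely hypothetical, since neither document's clock is given here.

```python
def implied_clock_ratio(observed_mtris: float, reference_mtris: float) -> float:
    """Clock ratio implied by two polygon rates, assuming linear scaling with frequency."""
    return observed_mtris / reference_mtris


def scaled_tri_rate(reference_mtris: float, reference_mhz: float, target_mhz: float) -> float:
    """Polygon rate at another clock, again assuming linear scaling."""
    return reference_mtris * target_mhz / reference_mhz


# RMI Au1300 figure vs. the ARM document's figure.
ratio = implied_clock_ratio(10, 16)
print(f"implied clock ratio: {ratio}")

# Hypothetical: if the 16M figure were quoted at 200MHz, 10M would correspond to:
print(f"{200 * ratio:.0f}MHz -> {scaled_tri_rate(16, 200, 200 * ratio):.0f}M polys/s")
```

So under those assumptions the RMI part would simply be running its Mali200 at 62.5% of whatever clock ARM's 16M figure was quoted at.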
A recent Mali-200 demo
http://www.youtube.com/watch?v=g0bwMCe6IaA
A quick bump to point out that there are now two Mali development boards available on the ARM website, also confirming that the U8500 uses the Mali 400:
- ST-Ericsson U8500: http://www.malideveloper.com/platforms/boards/st-e-mop500-development-platform.php (Mali 400)
- Telechips TCC8900: http://www.malideveloper.com/platforms/boards/telechips-tcc8900-development-platform.php (Mali 200)
There's no indication of the number of cores in the U8500, so I assume it's just one. No indications of MHz for either the A9 or the Mali400 in there, but two interesting tidbits there and on the new ST-Ericsson page about the U8500: it sports a 1080p H.264 *High Profile* camcorder but, unlike OMAP4, only 32-bit LPDDR2. Also has two camera ISPs: 18MP for the primary, 5MP for the secondary. Nice, I wonder how much a phone like that would cost in 1H11... Probably more than anyone sane would ever pay but heh ;)
Laurent06
22-Oct-2009, 08:51
There's no indication of the number of cores in the U8500, so I assume it's just one. No indications of MHz for either the A9 or the Mali400 in there, but two interesting tidbits there and on the new ST-Ericsson page about the U8500: it sports a 1080p H.264 *High Profile* camcorder but, unlike OMAP4, only 32-bit LPDDR2. Also has two camera ISPs: 18MP for the primary, 5MP for the secondary. Nice, I wonder how much a phone like that would cost in 1H11... Probably more than anyone sane would ever pay but heh ;)
That's a dual A9-based SoC.
Ref: http://www.malideveloper.com/platforms/boards/st-e-mop500-development-platform.php
EDIT: Hum, you link the same page as I do, and I clearly see mention of dual core, odd...
That's a dual A9-based SoC.
Err, gosh, I realize now my sentence was very ambiguous: I didn't mean the number of cores for the A9. I meant the number of cores for the Mali 400! :runaway:
If it's a single core, then that's a 1 TMU design and, assuming it's clocked above 200MHz as ARM claims should be easy to do, would probably be most comparable to the SGX530 in the 45nm OMAP3 (although probably a bit faster in a good day, i.e. poly rate, and a bit slower on a bad one, i.e. overdraw). Frankly not very impressive 3D-wise for such an otherwise very powerful design, but we'll see.
Are the current SE Omap3 (Satio etc) phones using the U380 platform, i.e. have they integrated their HSPA along with an Omap3430 into a one-chip solution, or did they drop this platform and go with separate Omap3 + HSPA ?
http://www.ericsson.com/ericsson/press/releases/20080208-1189711.shtml
Also, whatever happened to the U500 platform, which was ARM11 + Mali200... did that just get completely dropped?
http://www.ericsson.com/ericsson/press/releases/20080206-1188885.shtml
Both platforms were announced Feb '08.
U380 was a lie; it was two chips in a package (I'd guess OMAP3430 and M340); I do believe it's used in the Satio though. I can't seem to find the presentation where they indicated that anymore, though. U500 probably got canned in favour of the U8500 - it was a very weird architecture with 3 ARM11 cores (1 for apps, 1 for modem, 1 for multimedia iirc) and I honestly doubt it would have been very impressive.
roninja
22-Oct-2009, 17:55
Thought U8500 was due to be announced at MWC but it did not make an appearance. Interesting to see whether this part heads Nokia's way or OMAP4 beats 'em to it?
Interesting to see whether this part heads Nokia's way or OMAP4 beats 'em to it?
This at least shows some Nokia interest:
http://zomgitscj.com/nokia-signs-with-st-ericsson-gets-chip-for-1080p-video-recording/
A fun looking "UI playground" concept is presented, controlled by multi-touch where icons can be 3D objects and can interact under a physics model, running on a Mali 200 platform.
http://www.youtube.com/watch?v=g0bwMCe6IaA
Got a chance to see this at GDC in ARM's booth. Very nice demo!
Behind the curtain the panel was connected to a notebook PC.
Was the Mali200 clocked around 300 MHz?
That's higher than a mobile phone implementation would use, I'd expect.
More footage of the Canvas demo is shown. Performance on the high resolution display is nice: the video-mapped-to-the-cube getting thrown around as a 3D object and colliding with the pins was a good touch.
http://www.youtube.com/watch?v=rAiK8jrI8CI&sns=em
Ailuros
10-Aug-2010, 12:37
It's the first time I see anything containing Mali200 listed:
http://www.glbenchmark.com/phonedetails.jsp?benchmark=glpro11&D=SmartQ%20V5&testgroup=lowlevel
uhhmm yikes it didn't pass even one quality test...crappy drivers?
rpg.314
10-Aug-2010, 14:15
Where are the ES 2.0 results? I can't find them on a quick look. Or are they MIA?
uhhmm yikes it didn't pass even one quality test...crappy drivers?
Looks like a bug in their glReadPixels().
Exophase
15-Sep-2010, 16:01
Now that Nufront's chip is announced as having Mali 400MP and Samsung Orion is rumored to have it, I wanted to open this thread up again.
I had a bunch of stuff here, but I'm seeing more that it's basically superseded by this document: http://infocenter.arm.com/help/topic/com.arm.doc.dui0363d/DUI0363D_opengl_es_app_dev_guide.pdf
Apparently the ALUs are VLIW, SIMD, and 32-bit float. It also has hardware support for 16-bit float: it appears to suggest using these to save bandwidth post-geometry, not to improve computational throughput. One Mali400 ALU looks way more flexible/powerful than a USSE ALU (dunno as much about USSE2), more comparable to that of z430. So a Mali400MP should have pretty comparable computational performance.
Apparently the ALUs are VLIW, SIMD, and 32-bit float.
That's the geometry processor not the fragment shader. Frag shader is apparently FP24.
John.
Exophase
15-Sep-2010, 17:58
That's the geometry processor not the fragment shader. Frag shader is apparently FP24.
John.
Yeah okay, I missed that it just said "geometry processor." Do you have a source on the fragment shader being FP24?
By the way, earlier in this thread you mentioned that 24-bit gives you 15-bits of fractional data. If this is the usual FP24 implementation which is like IEEE-754 float, ie, 1.8.15 for sign, exponent, and fractional portion then the effective absolute resolution is really minimally 16-bits. Normalized floats have an implicit higher order 1 bit; effectively, this is encoded by the exponent. So with 11-bit texture addressing you should have an effective 5-bits fractional at the full index magnitude.
That document actually says that the fragment shader uses FP16.
darkblu
15-Sep-2010, 19:30
Yeah okay, I missed that it just said "geometry processor." Do you have a source on the fragment shader being FP24?
I don't have a link to a source handy, but I do remember Mali as having fp24 fragment shader ALUs from earlier presentations/workshop sessions.
Generally, both non-unified shader model, SoC-class GPUs I'm aware of follow the fp32-vertex / fp24-fragment scheme.
edit: I just noticed Xmas' remark. It appears I have a case of faulty memory cells.. Rats. *reports to factory for repair*
By the way, earlier in this thread you mentioned that 24-bit gives you 15-bits of fractional data. If this is the usual FP24 implementation which is like IEEE-754 float, ie, 1.8.15 for sign, exponent, and fractional portion then the effective absolute resolution is really minimally 16-bits. Normalized floats have an implicit higher order 1 bit; effectively, this is encoded by the exponent. So with 11-bit texture addressing you should have an effective 5-bits fractional at the full index magnitude.
Correct.
fp24 (15-bit mantissa) gives you 2^-16 relative precision (mandated by the GLSL ES specs for highp, btw);
fp16 (10-bit mantissa) gives you 2^-11 relative precision, etc.
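To make the arithmetic behind those two figures explicit, here is a minimal sketch. It assumes normalized values, round-to-nearest, and an implicit leading 1, so a format with m stored mantissa bits has m+1 significant bits and a worst-case relative rounding error of 2^-(m+1). The function name and the format table are made up for illustration; the 15-bit fp24 mantissa is the layout assumed in this thread, not a confirmed Mali detail.

```python
def relative_precision(mantissa_bits: int) -> float:
    """Worst-case relative rounding error for a normalized float.

    Assumes round-to-nearest and one implicit leading bit on top of the
    `mantissa_bits` stored fraction bits (an assumption, as in the thread).
    """
    return 2.0 ** -(mantissa_bits + 1)


# Stored mantissa widths; the fp24 entry is the 1.8.15 layout assumed above.
FORMATS = {"fp16": 10, "fp24": 15, "fp32": 23}

for name, bits in FORMATS.items():
    print(f"{name}: 2^-{bits + 1} = {relative_precision(bits):.3e}")
```

Under those assumptions fp24 lands exactly on the 2^-16 figure quoted for highp, and fp16 falls five bits short of it.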
Exophase
15-Sep-2010, 20:09
That document actually says that the fragment shader uses FP16.
Wow, you're right, I read the section and somehow completely misinterpreted it as meaning that FP16 is supported and should be used to save bandwidth between vertex shading and fragment shading.
That only gives effective 11-bits of guaranteed precision, not very good for HD texture coordinates...
I wonder if maybe the ALUs are FP16, but it uses FP24 internally and can access it for some purposes. Like if the TMUs can be addressed with it and varyings produce it. I guess even FP16 isn't that bad for texture coordinates (and on SGX you'd probably usually opt for it), since it still gives you 0.25 sub-texel precision at up to 512x512. 1024x1024 if somehow the texture coordinates can range from -1 to 1 (pretty sure it can't work that way but I don't really know for sure)
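A rough way to check the sub-texel figure, as a sketch only: take a normalized texture coordinate in the top binade [0.5, 1), where representable values are spaced furthest apart, and scale that spacing by the texture size. The helper names are hypothetical, denormals and exponent range limits are ignored, and the 10-bit and 15-bit mantissa widths are the fp16 and fp24 layouts assumed in this thread.

```python
import math


def ulp(x: float, mantissa_bits: int) -> float:
    """Spacing between adjacent representable values near x (normalized floats only)."""
    exponent = math.floor(math.log2(abs(x)))
    return 2.0 ** (exponent - mantissa_bits)


def texel_step(texture_size: int, mantissa_bits: int) -> float:
    """Coarsest coordinate spacing, in texels, for a coordinate in [0.5, 1)."""
    return ulp(0.75, mantissa_bits) * texture_size


for size in (512, 1024, 2048):
    print(size, "fp16:", texel_step(size, 10), "fp24:", texel_step(size, 15))
```

With a 10-bit mantissa the step is 0.25 texel at 512 and a full texel at 2048, while a 15-bit mantissa still leaves five fractional bits at 2048, which is why the FP16 vs FP24 question matters for HD-sized textures.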
So ARM promotes a lot that Mali has very efficient bandwidth utilization, even compared to "traditional tile renderers." Are they claiming that they have better post-transform geometry data compression than IMG does?
Exception has been taken to that characterization of PowerVR as the "traditional tile renderer".
Ailuros
16-Sep-2010, 10:16
So ARM promotes a lot that Mali has very efficient bandwidth utilization, even compared to "traditional tile renderers." Are they claiming that they have better post-transform geometry data compression than IMG does?
Their webpage states as of recently:
Advanced tile-based deferred rendering and local buffering of intermediate pixel states.
http://www.arm.com/products/multimedia/mali-graphics-hardware/mali-400-mp.php
While in their dev_guide I read the following:
The Mali GPUs use tile-based immediate-mode rendering.
For this type of rendering, the framebuffer is divided into tiles of size 16 by 16 pixels. The
Polygon List Builder (PLB) organizes input data from the application into polygon lists. There
is a polygon list for each tile. When a primitive covers part of a tile, an entry, called a polygon
list command, is added to the polygon list for the tile.
The pixel processor takes the polygon list for one tile and computes values for all pixels in that
tile before starting work on the next tile. Because this tile-based approach uses a fast, on-chip
tile buffer, the GPU only writes the tile buffer contents to the framebuffer in main memory at
the end of each tile. Non-tiled-based, immediate-mode renderers generally require many more
framebuffer accesses. The tile-based method therefore consumes less memory bandwidth, and
supports operations such as depth testing, blending and anti-aliasing efficiently.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0363d/index.html
Since I'm having a hard time making out a difference can I call the Mali a TBIDR (tile based immediately deferred renderer)?
Imo the correct description of Mali is "Early Z Tile based rendering", this differs from the PowerVR method in that they don't do deferred shading/texturing.
Exophase - yes 15 bit mantissa is actually 16 bits of precision when you take into account the implied 1, assuming that's the representation used.
Note that FP16 is not generally sufficient precision to manipulate texture coordinates with the shader (think accumulated error). I still believe they're FP24, but I'm not sure why they wouldn't expose it as such if that really was the case.
John.
Exophase
16-Sep-2010, 14:43
Had this one out with aaronspink/darkblu/et al once.
It's deferred and not immediate in the sense that it's scene-grabbing instead of rendering primitives as they're issued.
It's not deferred and is immediate in the sense that it performs full rendering (minus early-Z elimination) within a tile as opposed to having a fast internal Z-path and then index based rendering.
An important distinction between it and z430 is that it appears to have more explicit tiling and a fixed small (16x16) tile size, and therefore probably has a hardware binning pass prior to geometry as opposed to binning with geometry with skips for already binned polygons. This probably saves a lot of bandwidth in comparison, but like IMG I imagine they employ some kind of compression on their post-transform data. PowerVR should be as "traditional" as you get, having been the only tile based renderers for years, but yeah, there are clearly holes in their statements.
What I've always wanted to see was a tile-based early-Z renderer with hardware binning that also performed some level of depth sorting. If it's going to bin per-tile anyway this can't be that much more expensive (or maybe it can, I don't really know the binning algorithms and haven't thought this through that thoroughly). That'd make early-Z much more effective, and if you also add in binning of opaque vs alpha primitives you'd get order-independent translucency too... seems like it'd bring the per-pixel savings much closer to TBDR (especially w/faster than fill early-Z) while not having highly costly alpha test and having order-independent translucency.
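To illustrate why submission order matters so much for early-Z, here is a toy model of a single pixel covered by several opaque fragments. It is a sketch under simplifying assumptions (no alpha test, no blending, smaller depth means nearer, one shading cost per surviving fragment), and both function names are invented for the example. It does not model any real GPU.

```python
def shaded_early_z(depths):
    """Count fragments that get shaded at one pixel with an early depth test."""
    nearest = float("inf")
    shaded = 0
    for z in depths:
        if z < nearest:  # fragment passes the depth test, so it is shaded
            nearest = z
            shaded += 1
    return shaded


def shaded_deferred(depths):
    """With visibility resolved before shading, only the front opaque fragment is shaded."""
    return 1 if depths else 0


back_to_front = [0.9, 0.7, 0.5, 0.3]
print("early-Z, back to front:", shaded_early_z(back_to_front))
print("early-Z, depth sorted: ", shaded_early_z(sorted(back_to_front)))
print("deferred:              ", shaded_deferred(back_to_front))
```

In this model a back-to-front stream shades every layer under early-Z, while the same stream sorted front to back shades only one, matching the deferred case. That is the gap a per-tile depth sort would be trying to close.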
rpg.314
16-Sep-2010, 15:30
If you have done the spatial binning for all the triangles, is it really much harder to make a deferred renderer than an early-Z renderer?
darkblu
16-Sep-2010, 17:28
If you have done the spatial binning for all the triangles, is it really much harder to make a deferred renderer than an early-Z renderer?
Surely JohnH would give you a much better answer, but hypothetically, it should not be that much more difficult - you need an extra pixel ownership test/enumeration step during the shading stage. That also introduces certain interesting moments with partial derivatives & co, in situations where you don't have enough neighbour pixels of the same draw call, to trivially compute deltas with. But that's not unlike similar situations that arise from the use of fragment discard ops, or mere z-early-out, in any other rasterizer architecture.
Had this one out with aaronspink/darkblu/et al once.
It's deferred and not immediate in the sense that it's scene-grabbing instead of rendering primitives as they're issued.
It's not deferred and is immediate in the sense that it performs full rendering (minus early-Z elimination) within a tile as opposed to having a fast internal Z-path and then index based rendering.
Technically I'd agree, however within the industry TBDR, i.e. tile based deferred rendering, has always referred to PowerVR's particular brand of tile based rendering. Trying to call Mali a hybrid IMR/TBR is just marketing smoke and mirrors to try and claim that they have something clever, when in fact it's just a bog standard tile based renderer that is missing one of the useful optimisations that TBDR provides.
An important distinction between it and z430 is that it appears to have more explicit tiling and a fixed small (16x16) tile size, and therefore probably has a hardware binning pass prior to geometry as opposed to binning with geometry with skips for already binned polygons. This probably saves a lot of bandwidth in comparison, but like IMG I imagine they employ some kind of compression on their post-transform data. PowerVR should be as "traditional" as you get, having been the only tile based renderers for years, but yeah, there are clearly holes in their statements.
I'm pretty certain that ARM's binning process is very similar to PowerVR's, with the exception that they do bounding box based tiling instead of perfect tiling.
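For anyone unfamiliar with the distinction, bounding box tiling can be sketched in a few lines. This is only an illustration of the general technique, not ARM's or IMG's actual hardware algorithm: the 16x16 tile size comes from the dev guide quoted earlier, while the function name, the data layout, and the clamping behaviour are assumptions.

```python
import math
from collections import defaultdict

TILE = 16  # tile edge in pixels, per the dev guide quoted earlier in the thread


def bin_triangles(triangles, width, height):
    """Map (tile_x, tile_y) -> list of triangle indices using bounding boxes only."""
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    bins = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # Convert the bounding box to tile coordinates, clamped to the framebuffer.
        x0 = max(math.floor(min(xs)) // TILE, 0)
        x1 = min(math.floor(max(xs)) // TILE, tiles_x - 1)
        y0 = max(math.floor(min(ys)) // TILE, 0)
        y1 = min(math.floor(max(ys)) // TILE, tiles_y - 1)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins[(tx, ty)].append(index)
    return bins


# A right triangle whose hypotenuse cuts across the 64x64 framebuffer.
bins = bin_triangles([[(0, 0), (40, 0), (0, 40)]], 64, 64)
print(sorted(bins))
```

The triangle above is listed in all nine tiles of its bounding box, including tile (2, 2), which it never actually touches. Perfect tiling would test each tile against the triangle's edges and drop such entries, at the cost of more work per primitive.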
To the best of my knowledge the Z430's "tiling" processes the incoming geometry twice, once to tile it and once to rasterise it. This works fine within a closed system (the XBox360) that has very large tiles, but rapidly falls to pieces with smaller tile sizes and in an open API where applications can and do modify their VBOs mid-scene.
What I've always wanted to see was a tile-based early-Z renderer with hardware binning that also performed some level of depth sorting. If it's going to bin per-tile anyway this can't be that much more expensive (or maybe it can, I don't really know the binning algorithms and haven't thought this through that thoroughly). That'd make early-Z much more effective, and if you also add in binning of opaque vs alpha primitives you'd get order-independent translucency too... seems like it'd bring the per-pixel savings much closer to TBDR (especially w/faster than fill early-Z) while not having highly costly alpha test and having order-independent translucency.
The problem is that you need to avoid n log n time (or worse) within the binning process. Although possible, this is actually non-trivial in HW and can also result in issues with memory access patterns. Alternatively you could apply an up-front tile-resolution Z test, but this doesn't work well, as the absence of per-pixel Z values makes it difficult to reject based on the Z extents of anything other than objects that fully cover the tile.
John.