View Full Version : An overview of Qualcomm's Snapdragon Roadmap
convergedw
01-Nov-2010, 12:55
This is a report from the Linley group that Qualcomm posted on their site. It gives a pretty good overview/analysis of Qualcomm's roadmap over the next year or so and is a nice source to keep track of all of their various Snapdragon iterations.
http://www.qualcomm.com/documents/linley-report-dual-core-snapdragon
It sounds like they might be aiming to get a revamp of the Scorpion core to market in 2012. If so, I would imagine that we'll be hearing about it soon....either their analyst day in November or at 3GSM.
Wishmaster
01-Nov-2010, 18:33
Didn't know scorpion is OoOE CPU :razz:
very nice technical document(not too complicated so good for those who are not into those kind of things but still are curious), shame they didn't include samsung with its hummingbird when they compared scorpion to what others have to offer.
After reading that and seeing results from scorpion 2nd generation devices(HTC desire HD running msm7230) I can't wait to see how will dual-core snapdragon do(especially that 512kb L2 'faster than in A9'). Shame about lack of h.264 HP :sad:
metafor
01-Nov-2010, 19:41
Didn't know scorpion is OoOE CPU :razz:
very nice technical document(not too complicated so good for those who are not into those kind of things but still are curious), shame they didn't include samsung with its hummingbird when they compared scorpion to what others have to offer.
After reading that and seeing results from scorpion 2nd generation devices(HTC desire HD running msm7230) I can't wait to see how will dual-core snapdragon do(especially that 512kb L2 'faster than in A9'). Shame about lack of h.264 HP :sad:
Since Hummingbird is functionally a Cortex A8, you can use everything (except the max clock and power consumption) of the A8 in that paper as a comparison.
Wishmaster
01-Nov-2010, 20:05
Since Hummingbird is functionally a Cortex A8, you can use everything (except the max clock and power consumption) of the A8 in that paper as a comparison.
I should have been more precise. I meant not the CPU only but the whole 'platform' as such. It means sgx540, combined with that hummingbird CPU and multimedia chip which is capable of playing back up to 1080p video streams. Either Linley didn't see the necessity of including it in this comparison or didn't think samsung should be considered a key player on the market.
metafor
01-Nov-2010, 20:26
I should have been more precise. I meant not the CPU only but the whole 'platform' as such. It means sgx540, combined with that hummingbird CPU and multimedia chip which is capable of playing back up to 1080p video streams. Either Linley didn't see the necessity of including it in this comparison or didn't think samsung should be considered a key player on the market.
Most of the article was focused on CPU performance and power as well as microarchitecture and some comments on RF integration. I agree a detailed comparison of various GPU architectures would've been nice but there's relatively less information available on Adreno/Yamato/z430 than Scorpion.
Exophase
01-Nov-2010, 21:18
The article makes it sound like Scorpion and Cortex-A9 are using more or less the same OoO technologies, but I doubt this. The word from Anand was that Scorpion could do "some" things OoO, but is "not A9 class."
The way I figure it works is that there are separate pipelines for integer and load/store (and maybe more, like multiply) that each take different stages. These pipelines can complete out of order from each other in terms of writeback, so integer instructions could keep being issued ahead of a stalled load/store pipe so long as there are no dependencies. This would be like ARM11/XScale (but dual issue of course). But it would mean there's no reorder queues in front of the pipelines, so you couldn't for instance run one ALU operation ahead of another that was dependency stalled. I also figure there's no register renaming.
Most of the article was focused on CPU performance and power as well as microarchitecture and some comments on RF integration. I agree a detailed comparison of various GPU architectures would've been nice but there's relatively less information available on Adreno/Yamato/z430 than Scorpion.
Quite the contrary, until this article I've seen very little on Scorpion. This one doesn't have an awful lot either, but at least it discloses pipeline lengths and cache sizes. I've seen a lot of information on z430 in the i.MX51 user guide and AMD's slides, I'd say we know a lot about it... what we don't really know is how Adreno 205 and 220 improved on it. I blame Qualcomm for just being secretive in general.
Wishmaster
01-Nov-2010, 21:51
Quite the contrary, until this article I've seen very little on Scorpion. This one doesn't have an awful lot either, but at least it discloses pipeline lengths and cache sizes. I've seen a lot of information on z430 in the i.MX51 user guide and AMD's slides, I'd say we know a lot about it... what we don't really know is how Adreno 205 and 220 improved on it. I blame Qualcomm for just being secretive in general.
Well said, the only other detailed article about scorpion can be found on insidedsp.com but this one shows what kind of upgrades were made to the first generation and what can we expect from dual-core scorpion. Still I wonder how will that 512kb L2 compare to the standard used in A9.
True that we don't know anything about adreno 205 or 220 apart from the raw numbers.
Exophase
01-Nov-2010, 22:07
Still I wonder how will that 512kb L2 compare to the standard used in A9.
This article actually highlighted a pretty significant concern, that L2 in Cortex-A9 is running over the external AXI bus instead of an internal one. It'll be good to know what bus speed AXI is running at in e.g. OMAP4 and Tegra 2. In OMAP4 at least it goes to a higher level interconnect (L3) so they might be able to ramp it higher if only the L2 cache is hanging off of it. Will have to look at the OMAP4430 TRM again sometime.. I'd check it now if it weren't so huge -_-
metafor
01-Nov-2010, 22:27
The article makes it sound like Scorpion and Cortex-A9 are using more or less the same OoO technologies, but I doubt this. The word from Anand was that Scorpion could do "some" things OoO, but is "not A9 class."
The article is written quite fondly and probably in a better light than it should've been. But given the difference in pipeline lengths and a plethora of other factors, I wouldn't discount how "OoO" Scorpion is based solely on what little performance numbers we have between the two.
Quite the contrary, until this article I've seen very little on Scorpion. This one doesn't have an awful lot either, but at least it discloses pipeline lengths and cache sizes. I've seen a lot of information on z430 in the i.MX51 user guide and AMD's slides, I'd say we know a lot about it... what we don't really know is how Adreno 205 and 220 improved on it. I blame Qualcomm for just being secretive in general.
Perhaps. But my point stands. This is a microprocessor article comparing various ARM CPU's.
Exophase
02-Nov-2010, 00:20
I'm not discounting OoO based on performance differences. It's more based on what Qualcomm hasn't been saying about it, and the comment on Anandtech.
I agree that the article doesn't come off as 100% NPOV, which is a little alarming for an analyst group. I also found the citation of Tegra 2 being the "graphics leader" a little suspect, almost as if it has to be just because it's made by nVidia. I fully expect SGX 540 to be at least competitive, if not clearly leading itself. But there isn't really a lot to go on.
I agree that the article doesn't come off as 100% NPOV, which is a little alarming for an analyst group. I also found the citation of Tegra 2 being the "graphics leader" a little suspect, almost as if it has to be just because it's made by nVidia. I fully expect SGX 540 to be at least competitive, if not clearly leading itself. But there isn't really a lot to go on.
I think you'll find OMAP4 and Adreno 220 to be leading Tegra 2 quite clearly, yes (barring any miraculous driver improvements which seem unlikely given the Tegra 3 focus at this point).
Of course Linley has no independent capacity to verify NVIDIA's claims and others are unlikely to counter them directly (or be able to prove otherwise publicly). This is a general problem with analysts in this business; heck, they practically never even have any access to anything resembling a die size estimate! How you can evaluate Qualcomm's competitive position in the standalone baseband market (for example) without roughly knowing their die size and that of their competitors is completely beyond me. MDM8200 was over 100mm², but they obviously didn't go around letting everyone know about it.
So these analysts find themselves to be in a position where they need to be 'slightly too optimistic about everybody', which is not a bad compromise, but far from ideal.
metafor
02-Nov-2010, 02:45
I'm not discounting OoO based on performance differences. It's more based on what Qualcomm hasn't been saying about it, and the comment on Anandtech.
No offense to Anand, but his speculation is just that and hardly anything to do with actual info. As for what Qualcomm doesn't say, well, they don't say much about anything....
I agree that the article doesn't come off as 100% NPOV, which is a little alarming for an analyst group. I also found the citation of Tegra 2 being the "graphics leader" a little suspect, almost as if it has to be just because it's made by nVidia. I fully expect SGX 540 to be at least competitive, if not clearly leading itself. But there isn't really a lot to go on.
I read the article as pretty much reiterating what separate companies say in press releases, but with more technical information involved. Based on at least GLBench, Tegra 2 is slightly ahead of the 540 and Adreno 205 is somewhere around 20% slower than the 540.
I expect 220 to be a significant leap as that is meant for true console-level graphics.
Exophase
02-Nov-2010, 03:47
I'm pretty sure we've had this discussion before, but it wasn't mentioned as speculation so much as a direct comment:
"Qualcomm claims the ability to do some things out of order, but by and large the pipeline is in order which ultimately keeps it out of the A9 classification."
Suggesting that this isn't him guessing but actually knowing. Why would this necessarily have nothing to do with actual info?
On the other hand, I've questioned some of the conclusions Linley has drawn in the past. For instance, a claim was once made that perf/MHz of Cortex-A9 and Atom were the same, and that both of them were only 25% the perf/MHz of Nehalem.
metafor
02-Nov-2010, 05:09
I'm pretty sure we've had this discussion before, but it wasn't mentioned as speculation so much as a direct comment:
"Qualcomm claims the ability to do some things out of order, but by and large the pipeline is in order which ultimately keeps it out of the A9 classification."
See, it's hard to tell where the comment from QCOM begins and where Anand's conclusions begin. I somehow doubt a QPerson actually said that it's "out of the A9 classification".
On the other hand, I've questioned some of the conclusions Linley has drawn in the past. For instance, a claim was once made that perf/MHz of Cortex-A9 and Atom were the same, and that both of them were only 25% the perf/MHz of Nehalem.
From a theoretical perspective, those aren't really unreasonable claims. Keep in mind this is solely from the point of view of CPU performance. The difference in memory subsystem and system bus performance dramatically changes the end result of course.
Exophase
02-Nov-2010, 05:21
See, it's hard to tell where the comment from QCOM begins and where Anand's conclusions begin. I somehow doubt a QPerson actually said that it's "out of the A9 classification".
I'm not claiming a comment from Qualcomm here, but I am claiming that Anand isn't pulling this out of thin air. What it appears to be is a statement made based on him knowing things about the architecture the rest of us don't, things he's not at liberty to divulge. He may have already said too much. Nonetheless, I imagine he has a good reason for saying what he is. Furthermore, Linley's comment (that the core has some manner of speculative execution) doesn't contradict this; I just feel it's not enough to draw much of a conclusion from.
From a theoretical perspective, those aren't really unreasonable claims. Keep in mind this is solely from the point of view of CPU performance. The difference in memory subsystem and system bus performance dramatically changes the end result of course.
Solely from a CPU point of view it's even less reasonable, IMO. Just the same, the claim was made in the context of real world performance, that you would need 4x more A9 cores per/MHz to keep up.
rpg.314
02-Nov-2010, 10:34
It could be doing out of order execution but without any speculation.
\Shrugs
EDIT: It doesn't do that. :oops:
metafor
02-Nov-2010, 14:28
I'm not claiming a comment from Qualcomm here, but I am claiming that Anand isn't pulling this out of thin air. What it appears to be is a statement made based on him knowing things about the architecture the rest of us don't, things he's not at liberty to divulge. He may have already said too much. Nonetheless, I imagine he has a good reason for saying what he is. Furthermore, Linley's comment (that the core has some manner of speculative execution) doesn't contradict this; I just feel it's not enough to draw much of a conclusion from.
It's not. And without a description from Qualcomm, I don't think any journalist out there can reliably claim information. I like Anand a lot but I'm not going to take his word for it.
Solely from a CPU point of view it's even less reasonable, IMO. Just the same, the claim was made in the context of real world performance, that you would need 4x more A9 cores per/MHz to keep up.
Compared to Nehalem? Dhrystone (I know, I know, not indicative of real world performance, but you'd be surprised how often it's used as a metric in CPU design) puts Nehalem at roughly 22 DMIPS/MHz from the benchmarks I've seen. The A9 pulls ~2.5 according to ARM.
Exophase
02-Nov-2010, 14:32
There are lots of things it could or couldn't be doing.. probably about all we have any confidence in is that it's doing at least something OoO under some circumstance. It could be doing everything Cortex-A9 is and more, but I doubt it.
I remember Intel actually referred to Atom as having OoO capabilities because it can execute integer instructions ahead of floating point ones. That's kinda like calling something OoO because stores go off asynchronously on a write buffer.
Anything with branch prediction performs speculative execution. Prefetching can be considered speculative, and technically so can predicated instructions (although it's explicitly instrumented by the program). I was never really sure what else speculative execution referred to that would be specific to OoOE.
My expectation is still that it's in-order execution and out-of-order completion.
Compared to Nehalem? Dhrystone (I know, I know, not indicative of real world performance, but you'd be surprised how often it's used as a metric in CPU design) puts Nehalem at roughly 22 DMIPS/MHz from the benchmarks I've seen. The A9 pulls ~2.5 according to ARM.
I was talking perf/MHz per core. 22 DMIPS/MHz is for 4 cores, or 5.5 DMIPS/MHz per core. That's only 2.2x more than Cortex-A9, which is closer to what you'd realistically expect. Note that the Cortex-A9 number is also per core, and Cortex-A9 can also be implemented as quad core.
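The per-core arithmetic above can be checked in a few lines. This is only a sketch using the figures quoted in this thread (22 DMIPS/MHz for a quad-core Nehalem, ~2.5 DMIPS/MHz for a single Cortex-A9 core); the variable names are made up for illustration and nothing here is measured.

```python
# Figures as quoted in the thread, not independently measured.
nehalem_dmips_per_mhz_chip = 22.0  # whole quad-core chip
nehalem_cores = 4
a9_dmips_per_mhz_core = 2.5        # ARM's per-core figure

# Normalise Nehalem to a single core, then compare per core, per MHz.
nehalem_per_core = nehalem_dmips_per_mhz_chip / nehalem_cores
ratio = nehalem_per_core / a9_dmips_per_mhz_core

print(f"Nehalem per core: {nehalem_per_core:.1f} DMIPS/MHz")
print(f"Nehalem vs Cortex-A9 (per core, per MHz): {ratio:.1f}x")
```

Comparing a whole-chip number against a per-core number is what produces the much larger gap quoted earlier in the thread.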
While different benchmarks can highlight architectural strengths in different GPUs, SGX540 is ahead in GLBenchmark.
http://glbenchmark.com/result.jsp?benchmark=glpro11&certified_only=1
Wishmaster
15-Nov-2010, 20:37
Something I've just found http://developer.qualcomm.com/sites/default/files/IQ-Tech-Track-AdrenoGPUandPerformanceTools.pdf
The only interesting part is about next generation of adreno graphics - Adreno 3xx. According to this paper it'll be GPGPU with OpenCL support running new openGL ES 'Halti' core(openGL ES 3.0 codename?) and if the presentation is to be believed it's going to be used on the 28nm snapdragon next year.
So if we add what linley said about improved scorpion architecture which should come around 2012 and this new GPU it gives us one hell of an interesting SoC! At least on paper :grin:
Love this never-ending performance race! Soon every smartphone will become outdated a month after launch just like it is now with PC's.
convergedw
17-Nov-2010, 18:11
Qualcomm has informally introduced their next generation Snapdragon family. I had expected something more detailed today, but it looks like that will wait until next year (probably 3GSM).
Page 35 has the limited details.
http://files.shareholder.com/downloads/QCOM/1068233203x0x420036/945326df-e767-4ebe-8cc6-6df7aa88b848/2010NYAnalystdeckweb_SM%20final.pdf
<i>CPU UPGRADE
New micro-architecture
~5x performance,
~75% lower power
MULTI-MODE MODEM
Integrated LTE Multi-Mode
All 3G modes supported
GRAPHICS
UPGRADE
~4x performance</i>
During the talk, it was stated that the 5x performance measure was based on DMIPS. That would take them from 2,100 DMIPS to ~10,500 DMIPS. No mention of what speed the 75% lower power figure is referencing...but I think the specs for the original Snapdragon were for 500mW at 1Ghz.
Sampling is expected in 2011 with first products in 2012.
Slacker
Wishmaster
17-Nov-2010, 21:09
Qualcomm has informally introduced their next generation Snapdragon family. I had expected something more detailed today, but it looks like that will wait until next year (probably 3GSM).
Page 35 has the limited details.
http://files.shareholder.com/downloads/QCOM/1068233203x0x420036/945326df-e767-4ebe-8cc6-6df7aa88b848/2010NYAnalystdeckweb_SM%20final.pdf
<i>CPU UPGRADE
New micro-architecture
~5x performance,
~75% lower power
MULTI-MODE MODEM
Integrated LTE Multi-Mode
All 3G modes supported
GRAPHICS
UPGRADE
~4x performance</i>
During the talk, it was stated that the 5x performance measure was based on DMIPS. That would take them from 2,100 DMIPS to ~10,500 DMIPS. No mention of what speed the 75% lower power figure is referencing...but I think the specs for the original Snapdragon were for 500mW at 1Ghz.
Sampling is expected in 2011 with first products in 2012.
Slacker
Those are bold statements if you ask me. But it fits what linley wrote in their report and what I've found about the next GPU.
I'll remember to keep an eye on this one, cause I can't wait to learn more about this beast.
metafor
17-Nov-2010, 21:27
Those are bold statements if you ask me. But it fits what linley wrote in their report and what I've found about the next GPU.
I'll remember to keep an eye on this one, cause I can't wait to learn more about this beast.
The DMIPS portion is likely due to multi-core compared to single core. There will be per-core DMIPS improvement as well, of course, but nowhere near 5x.
I'm curious whether the 4x GPU is Adreno 220 or something beyond.
Exophase
18-Nov-2010, 02:08
The 5x claim only really sounds attainable if they're comparing at least triple core to single core. Particularly if we're talking a comparison to a 1.3GHz Scorpion. But I can see them saying it for double core, I just doubt it's a totally fair comparison. Seems like everyone is claiming some vague "5x improvement" for something these days.
Yet another flaw of DMIPS is that the benchmark scales unrealistically well with more cores, hence DMIPS numbers have gone up so dramatically for x86 CPUs.
metafor
18-Nov-2010, 04:25
The 5x claim only really sounds attainable if they're comparing at least triple core to single core. Particularly if we're talking a comparison to a 1.3GHz Scorpion. But I can see them saying it for double core, I just doubt it's a totally fair comparison. Seems like everyone is claiming some vague "5x improvement" for something these days.
Yet another flaw of DMIPS is that the benchmark scales unrealistically well with more cores, hence DMIPS numbers have gone up so dramatically for x86 CPUs.
The 8960 is a dual-core part. It's likely a combination of IPC increase, frequency increase as well as the 2 cores that contribute to the DMIPS increase. And it's likely a comparison against the current 1GHz Scorpion (likely the 65nm one).
Judging from the other graphs provided by the presentation, it wouldn't surprise me that they fiddled with the candidates being compared to make that "5x" claim.
Yet another flaw of DMIPS is that the benchmark scales unrealistically well with more cores, hence DMIPS numbers have gone up so dramatically for x86 CPUs.
They're good for putting up impressive numbers but the reason they're used is because that's the benchmark one uses when designing a CPU to gauge throughput. So it's the first benchmark number that becomes available for flashy presentations such as this.
Exophase
18-Nov-2010, 05:16
So 2.5x increase per core given perfect scaling. Memory hierarchy probably isn't improving by the 5x to match that, at least not in latency. Good thing DMIPS don't care about things like memory performance.
I agree entirely that there's fiddling going on here, if you ask me the whole presentation kinda stunk. Really vague claims and calling the other platforms unnamed competitors, as if they were somehow not allowed to say who they were actually comparing against. Hard to take this sort of thing seriously.
metafor
18-Nov-2010, 05:42
Memory bandwidth certainly isn't improving 5x but given the present (rather pathetic) state of memory performance in mobile SoC's, a 2-3x improvement in load/store performance wouldn't be out of the question, especially compared to Scorpion.
Wishmaster
18-Nov-2010, 07:18
The 8960 is a dual-core part. It's likely a combination of IPC increase, frequency increase as well as the 2 cores that contribute to the DMIPS increase. And it's likely a comparison against the current 1GHz Scorpion (likely the 65nm one).
Judging from the other graphs provided by the presentation, it wouldn't surprise me that they fiddled with the candidates being compared to make that "5x" claim.
I agree that it has to be a combination of improved IPC and dual core architecture. Who knows maybe they will clock it at 2Ghz and with dual core it would be possible to achieve 5x performance of snapdragon1.
But I don't think that they mean adreno 220 when talking about 4x performance. In msm8x60 they talk about 4x performance(which is possible thanks to adreno 220), besides it should use something new where adreno 220 is probably still heavily relying on amd z480.
Ailuros
18-Nov-2010, 07:52
I can see a popular trend with those "4x times" or "5x times" the performance claims from different IHVs and/or manufacturers. It's not the first time we've seen those and most of us should know how realistic they turn out to be in real time, irrespective of which corner they come from.
Wishmaster
18-Nov-2010, 08:19
I can see a popular trend with those "4x times" or "5x times" the performance claims from different IHVs and/or manufacturers. It's not the first time we've seen those and most of us should know how realistic they turn out to be in real time, irrespective of which corner they come from.
At least it sounds impressive! :grin:
Wonder how competitive it will be when compared to tegra3 and omap5(?)
Hmm - I suppose a per-core/mhz DMIPS target roughly similar to A15 is likely given that they weren't going to revamp the architecture significantly for OoOE only to remain effectively dual-issue, nor are they going to increase issue width without full OoOE. That would get us to a DMIPS of about 3.5/MHz/core or more iirc, which is 1.67x Snapdragon's. So to achieve 5x, you're looking at a 1.5GHz dual-core.
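To make that decomposition explicit, here is a rough sketch of the arithmetic. It assumes the original Snapdragon baseline is 2,100 DMIPS at 1GHz (the figure mentioned earlier in the thread), perfect scaling across two cores, and the "iirc" 3.5 DMIPS/MHz/core target; all names are illustrative.

```python
# Assumptions taken from the thread; none of these are confirmed specs.
baseline_dmips_per_mhz = 2.1   # original Snapdragon: 2,100 DMIPS at 1GHz
baseline_clock_ghz = 1.0
target_dmips_per_mhz = 3.5     # assumed A15-class per-core figure
cores = 2                      # assumes perfect multi-core DMIPS scaling
target_speedup = 5.0           # Qualcomm's "~5x performance" claim

# Per-core, per-MHz gain from the new micro-architecture.
ipc_gain = target_dmips_per_mhz / baseline_dmips_per_mhz

# Clock needed so that ipc_gain * cores * clock_gain hits the 5x claim.
required_clock_ghz = baseline_clock_ghz * target_speedup / (ipc_gain * cores)

print(f"Per-core gain: {ipc_gain:.2f}x")
print(f"Clock needed for 5x: {required_clock_ghz:.2f} GHz")
```

If the per-core figure ends up lower than 3.5, the same formula shows how much extra clock would be needed to make up the difference.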
As for the GPU's 4x, that's not very impressive assuming it's also compared to the original Snapdragon with its 133MHz Adreno 200. In fact, that's exactly the performance level of the Adreno 220! The only chance for this to be more interesting is if they're actually referring to the 45nm shrink, which some people have indicated (perhaps mistakenly) uses an Adreno 205, in which case we'd be looking at roughly twice the MSM8x60's performance. Wasn't there a PDF somewhere that indicated they were working on an OpenGL ES 3.0 architecture? If this isn't double the performance, I suppose either it's coming in a 28nm refresh or they're not doubling the number of units, both of which would be slightly disappointing.
On how it will compare to the competition: I don't know for certain about OMAP5, but Tegra3's design target was a quad-core Cortex-A9 at 1.2GHz on 28LPT. That means Snapdragon would be 1.75x as fast per-core but for optimally scaling multi-core workloads (yeah right...) Tegra3 would be 1.14x faster. That's for integer; for floating-point, you need to consider Tegra3 doesn't include NEON (and even if it did, Cortex-A9's NEON is only 64-bit wide). I'd argue that from a marketing perspective, a quad-core with lower IPC remains very attractive, although I don't know how OEMs would evaluate both overall.
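The two ratios in that comparison follow from the figures already used in this thread. A quick sketch, assuming 3.5 DMIPS/MHz at 1.5GHz for the dual-core Snapdragon and 2.5 DMIPS/MHz at 1.2GHz for a quad-core Cortex-A9 Tegra3, with ideal multi-core scaling (all of these are the posters' estimates, not confirmed specs):

```python
# Per-core throughput in relative units: DMIPS/MHz * clock in GHz.
snapdragon_per_core = 3.5 * 1.5   # assumed next-gen Snapdragon core
tegra3_per_core = 2.5 * 1.2       # assumed Cortex-A9 at Tegra3's target clock

snapdragon_cores = 2
tegra3_cores = 4

per_core_ratio = snapdragon_per_core / tegra3_per_core
# Whole-chip ratio only holds for workloads that scale perfectly with cores.
chip_ratio = (tegra3_cores * tegra3_per_core) / (snapdragon_cores * snapdragon_per_core)

print(f"Snapdragon per-core advantage: {per_core_ratio:.2f}x")
print(f"Tegra3 whole-chip advantage (ideal scaling): {chip_ratio:.2f}x")
```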
rpg.314
18-Nov-2010, 14:43
I don't know for certain about OMAP5, but Tegra3's design target was a quad-core Cortex-A9 at 1.2GHz on 28LPT.
Forget smartphones or tablets, can anyone tell me what's the point of a quad core even in anything but a laptop/desktop? It sounds pointless even in the former if you ask me.
I'd prefer a higher clocked and a wider dual core.
Exophase
18-Nov-2010, 15:46
Hmm - I suppose a per-core/mhz DMIPS target roughly similar to A15 is likely given that they weren't going to revamp the architecture significantly for OoOE only to remain effectively dual-issue, nor are they going to increase issue width without full OoOE. That would get us to a DMIPS of about 3.5/MHz/core or more iirc, which is 1.67x Snapdragon's. So to achieve 5x, you're looking at a 1.5GHz dual-core.
How do you know they aren't going to revamp the architecture significantly for OoOE only to remain effectively dual-issue? Cortex-A9 did. Cortex-A15 level is very lofty for a chip that'll be ready in 2011, not to mention a chip that's "75% the power consumption." Maybe it means 75% consumption when clocked to the same performance levels, ie 1/5th the clock speed using the perfect DMIPS core scaling. Cortex-A15 is going to be positioned to take a bigger market share outside of mobile, ie netbooks/laptops and server space. That gives them some incentive to push an architecture that has a higher baseline power draw while still keeping A9 available. Hard to imagine Qualcomm pushing nearly as much into these markets.
I think the 5x will be more viable with a > 1.5GHz clock than with a Cortex-A15 level architecture. They've already slated 45nm products for 1.5GHz, so wouldn't you expect their 28nm chip to clock higher?
Forget smartphones or tablets, can anyone tell me what's the point of a quad core even in anything but a laptop/desktop? It sounds pointless even in the former if you ask me.
In theory, the point is that for workloads that *do* scale with four cores, both perf/watt and perf/mm2 are better. For perf/watt, this is because of voltages: two undervolted cores at 750MHz will take a lot less power than one overvolted core at 1.5GHz. For perf/mm2, this can be seen with the Cortex-A15 which is nearly twice as big as the A9 but probably 'only' 60-70% faster overall (counting both IPC and frequency on the same process). That's still pretty good scaling, but obviously you'll get diminishing returns the more you try to scale up per-core performance.
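The voltage argument can be illustrated with the usual first-order model for dynamic power, P ~ C * V^2 * f. The sketch below is only that: the 0.9V and 1.2V supply voltages are hypothetical values picked for illustration, leakage is ignored, and both configurations are assumed to deliver the same aggregate throughput.

```python
def dynamic_power(voltage_v: float, freq_ghz: float, cores: int = 1) -> float:
    """Relative dynamic power, P ~ C * V^2 * f, with C normalised to 1 per core."""
    return cores * voltage_v ** 2 * freq_ghz

# Hypothetical operating points; real voltage/frequency curves will differ.
two_slow_cores = dynamic_power(voltage_v=0.9, freq_ghz=0.75, cores=2)
one_fast_core = dynamic_power(voltage_v=1.2, freq_ghz=1.5, cores=1)

power_ratio = two_slow_cores / one_fast_core
print(f"Two cores at 750MHz use {power_ratio:.0%} of the power of one core at 1.5GHz")
```

With equal voltages the two configurations would come out identical in this model, so the whole saving comes from being able to lower the voltage at the lower clock.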
I'd prefer a higher clocked and a wider dual core.
I'd definitely prefer that too, but keep in mind that there will not be a single Cortex-A15 application processor taping out for about one year after Tegra3 taped-out. This is simply as high-end as you can get in this timeframe without designing your own CPU ala Qualcomm, except for the lack of NEON presumably. NV is probably right that it's worth the fairly negligible extra silicon even if it's more useful for marketing than real apps. I think there will be big incentives for AAA game developers to exploit those four cores sooner rather than later, though... :)
I'm more cautious about Tegra4 as they aren't a lead licensee for Cortex-A15 so it'll presumably still be quad-core A9 (perhaps clocked noticeably higher if they go for 28HPM instead of 28LPT though). We'll see...
How do you know they aren't going to revamp the architecture significantly for OoOE only to remain effectively dual-issue? Cortex-A9 did. Cortex-A15 level is very lofty for a chip that'll be ready in 2011, not to mention a chip that's "75% the power consumption." Maybe it means 75% consumption when clocked to the same performance levels, ie 1/5th the clock speed using the perfect DMIPS core scaling.
That 75% figure nearly certainly means 4x performance/watt, which you obviously don't want to point out that way or people might realise that means up to 1.25x total power ;) I agree it's an ambitious goal, but presumably they've had a team working on it since before ARM even finished the A9 (not sure if the A15 started as a parallel project though, since the A9 was unusual in being created primarily by the Sophia Antipolis design center), so it's far from impossible.
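Spelled out, the 4x perf/watt and 1.25x total power figures come from reading "~75% lower power" as power at equal performance. A small sketch under that assumption (the interpretation is this poster's guess, not something Qualcomm stated):

```python
# Qualcomm's headline claims, as quoted in the thread.
performance_gain = 5.0    # "~5x performance"
power_reduction = 0.75    # "~75% lower power", assumed to be at equal performance

# Power needed to deliver the old performance level, relative to the old chip.
power_at_equal_perf = 1.0 - power_reduction
perf_per_watt_gain = 1.0 / power_at_equal_perf

# If perf/watt stays at 4x while running flat out at 5x performance:
total_power_at_peak = performance_gain / perf_per_watt_gain

print(f"Perf/watt gain: {perf_per_watt_gain:.1f}x")
print(f"Total power at 5x performance: {total_power_at_peak:.2f}x")
```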
Hard to imagine Qualcomm pushing nearly as much into these markets.
Qualcomm is very ambitious wrt tablets, but obviously they don't care about servers or set-top boxes.
I think the 5x will be more viable with a > 1.5GHz clock than with a Cortex-A15 level architecture. They've already slated 45nm products for 1.5GHz, so wouldn't you expect their 28nm chip to clock higher?
I don't buy it. They achieve 1.5GHz with a high-voltage part for tablets on 40LPT, and there's not a lot of extra performance on the table for 28LPT (20% maybe?). Finally and unlike the A9, they've already got a fairly long pipeline so there's not as much to gain on that front either. It also probably wouldn't be as power-efficient.
It's possible that it's really 4.6x as fast at 1.75GHz, which would get us to a DMIPS/MHz of 2.76 - that's perfectly plausible on a dual-issue OoOE design. Not very exciting though, and I suspect not as likely to be true, but we'll see.
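For reference, the 2.76 DMIPS/MHz figure can be reproduced as follows. This sketch assumes the 2,100 DMIPS baseline mentioned earlier in the thread, a dual-core part, and perfect multi-core scaling; the names are illustrative only.

```python
baseline_dmips = 2100.0   # original Snapdragon, as quoted in the thread
speedup = 4.6             # hypothetical "really 4.6x" scenario
cores = 2                 # assumes perfect DMIPS scaling across cores
clock_mhz = 1750.0        # hypothetical 1.75GHz clock

total_dmips = baseline_dmips * speedup
dmips_per_mhz_per_core = total_dmips / cores / clock_mhz

print(f"Total: {total_dmips:.0f} DMIPS")
print(f"Per core: {dmips_per_mhz_per_core:.2f} DMIPS/MHz")
```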
Exophase
18-Nov-2010, 19:14
4x perf/Watt with much higher peak performance over a design that's already highly competitive in perf/Watt.. one process node better, but that still strikes me as a little hard to believe. Even for DMIPS.
Ailuros
18-Nov-2010, 20:07
At least it sounds impressive! :grin:
As long as nobody falls for it, no harm done.
Wonder how competitive it will be when compared to tegra3 and omap5(?)
http://www.anandtech.com/show/4024/qualcomm-reveals-nextgen-snapdragon-msm8960-28nm-dualcore-5x-performance-improvement
http://images.anandtech.com/reviews/SoC/Qualcomm/ng-snapdragon/adreno3xx_sm.jpg
Since I recall another funky claim from IMG at xbitlabs mentioning something about PS3 performance (I just don't recall the exact claim and am too bored to dig it out), all falls into the same category. 2013 is a mighty long time from today. In 2012 NV should already be producing Tegra4. Different sides are just throwing around vague data about next generation devices. Only when they've all announced the specifics of each future architecture we'll get a tad wiser.
In any other case and until then we'll be reading about the Uber-T604, the Ultra-Adreno3xx, the Fantastic-Tegra3/4 and the Super-Series6 amongst others.
A far more important question would be what Uber-OGL_ES-Halti exactly stands for (what a stupid codename for an API anyway....).
Wishmaster
18-Nov-2010, 20:12
As long as nobody falls for it, no harm done.
I'm sure there will be some that will fall for it.
http://www.anandtech.com/show/4024/qualcomm-reveals-nextgen-snapdragon-msm8960-28nm-dualcore-5x-performance-improvement
http://images.anandtech.com/reviews/SoC/Qualcomm/ng-snapdragon/adreno3xx_sm.jpg
Since I recall another funky claim from IMG at xbitlabs mentioning something about PS3 performance (I just don't recall the exact claim and am too bored to dig it out), all falls into the same category. 2013 is a mighty long time from today. In 2012 NV should already be producing Tegra4. Different sides are just throwing around vague data about next generation devices. Only when they've all announced the specifics of each future architecture we'll get a tad wiser.
In any other case and until then we'll be reading about the Uber-T604, the Ultra-Adreno3xx, the Fantastic-Tegra3/4 and the Super-Series6 amongst others.
I know that all of them use roughly the same type of 'language' so on paper they all seem to be as good as the other cause they all have 'PS3 graphics performance', I wonder how good IRL they'll be.
A far more important question would be what Uber-OGL_ES-Halti exactly stands for (what a stupid codename for an API anyway....).
OpenGL 3.0 LITE?
I assume the '75% lower power' (i.e. 4x performance per watt) is relative to the original 65nm Snapdragon, not the 45nm shrink.
As for PS3/XBox360-level performance... What's the probability that any handheld chip has equivalent performance to a chip with 24 TMUs at 550MHz before 14nm in 2015? Zero.
(although in practice RSX's utilisation isn't mind-blowing, it's more optimised towards perf/mm2 than perf/unit, so I suppose equivalent performance on an ultra-high-end tablet chip on 20nm isn't strictly impossible).
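The "75% lower power, i.e. 4x performance per watt" conversion is simple enough to sketch. It assumes the power reduction is quoted at equal performance, which the marketing slide does not actually confirm; `perf_per_watt_gain` is a hypothetical helper.

```python
def perf_per_watt_gain(power_reduction, perf_ratio=1.0):
    """Perf/W multiplier when power drops by `power_reduction` (a fraction
    between 0 and 1) while performance changes by `perf_ratio`."""
    return perf_ratio / (1.0 - power_reduction)


print(perf_per_watt_gain(0.75))  # 4.0
```

If the 75% figure were instead quoted at a higher performance level, `perf_ratio` would push the result above 4x.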
metafor
18-Nov-2010, 21:55
Scorpion at 1.5GHz required the LPG process at 45nm. This is a significantly more power-hungry process than LP and far more so than 28LP. IIRC, 1.4W using a 1.5GHz LPG core and a 1.3GHz LP core. I'm going to make a rough, blind assumption that the LPG core consumes the majority of that; around 900mW-1W.
Seeing as this will likely run ~1.5GHz in 28LP and that you'd likely need ~1.0GHz or lower to match the performance of the previous 1.5GHz LPG Scorpion, having a new u-arch on 28LP that consumes ~300mW isn't out of the question.
But yes, this is with very fiddled numbers.
IIRC, 1.4W using a 1.5GHz LPG core and a 1.3GHz LP core. I'm going to make a rough, blind assumption that the LPG core consumes the majority of that; around 900mW-1W.
Ohhhhh. Wait, that chip uses two different synthesis jobs for the two cores, one of which uses only LP transistors and the other mostly G? I didn't know that, very intriguing. I thought the Marvell Armada 628 was the first to do something like that. This would also explain why Qualcomm is the only company that has invested in a dedicated DC/DC for each core; it would be problematic to share one DC/DC if the cores were rated for very different frequencies at a given voltage.
My assumption (if this is true) is that at very low frequencies the LP core takes significantly less power than the G core (which is therefore always power gated off in that case) due to lower leakage, but at the maximum frequency the G core takes *less* power than the LP core due to lower dynamic power. Anything else would be rather absurd and defy the whole point (with your numbers, you'd be better off with an overvolted LP core, that's insane!)
It's at times like this I feel like I really should finish that article on Icera one of these days! See slide 36: http://www.lirmm.fr/arith18/slides/ARITH18_keynote-Knowles.pdf :)
metafor
19-Nov-2010, 03:28
Ohhhhh. Wait, that chip uses two different synthesis jobs for the two cores, one of which uses only LP transistors and the other mostly G? I didn't know that, very intriguing. I thought the Marvell Armada 628 was the first to do something like that. This would also explain why Qualcomm is the only company that has invested in a dedicated DC/DC for each core; it would be problematic to share one DC/DC if the cores were rated for very different frequencies at a given voltage.
Regardless of whether the two cores are asymmetrical, separate power rails is a good thing for power. I believe Nehalem took this approach as well. The 8x60, for instance, has two symmetrical LP cores, but has separate regulators for each. It just makes sense if you can afford the engineering effort.
My assumption (if this is true) is that at very low frequencies the LP core takes significantly less power than the G core (which is therefore always power gated off in that case) due to lower leakage, but at the maximum frequency the G core takes *less* power than the LP core due to lower dynamic power. Anything else would be rather absurd and defy the whole point (with your numbers, you'd be better off with an overvolted LP core, that's insane!)
Well, the G process takes similar dynamic power. The thing to remember is that the LP process can only scale so high without drastically over-volting. And this being (well, intended) to be a tablet/netbook part, chasing performance was important. And let's face it, 1.5GHz sounds great for marketing.
Exophase
19-Nov-2010, 03:53
I too think the dual process approach sounds intriguing, especially if you can shut off the higher power core entirely during low load.
Regardless of whether the two cores are asymmetrical, separate power rails is a good thing for power. I believe Nehalem took this approach as well. The 8x60, for instance, has two symmetrical LP cores, but has separate regulators for each. It just makes sense if you can afford the engineering effort.
You're perfectly correct of course - I should have remembered our earlier discussion better (although I'd note an extra DC/DC is not free, and Qualcomm has the slight advantage of making their own PMICs). Here's probably a better way to put it: if your cores are asymmetric, then there is also a fair bit of engineering complexity in sharing a single DC/DC, so it makes even more sense to invest in the superior approach of dual DC/DC.
Well, the G process takes similar dynamic power. The thing to remember is that the LP process can only scale so high without drastically over-volting. And this being (well, intended) to be a tablet/netbook part, chasing performance was important. And let's face it, 1.5GHz sounds great for marketing.
I don't buy this - at all. Practically speaking, the LP and G process are mostly LSTP (Low STandby Power) and LOP (Low Operating Power) processes respectively, as defined by the ITRS.
See slide 8 of the same presentation: http://www.lirmm.fr/arith18/slides/ARITH18_keynote-Knowles.pdf (I assume none of the LOP transistors are LP and none of the LSTP are G, but that would be an implementation detail anyway).
There's no free lunch - the LP transistors aren't magically lower power. You simply trade lower leakage for higher voltages at a given frequency, and therefore higher dynamic power at a given performance level. What the G transistors allow you to do in handhelds is to efficiently go higher up the leakage curve for those transistors that either truly require the speed (e.g. a frequency-optimised CPU core), or are nearly always either busy or power-gated off, or are a small but problematic bottleneck in your critical path. All three cases make perfect sense and the last two can genuinely reduce total power in real use cases.
This is not completely different from multi-Vt where you use multiple transistors at different points on the curve throughout your chip. LPG simply lets you go higher up on the curve where it makes sense without compromising power efficiency for the rest of your chip. One further point is that the LPG process works by giving you access to two different oxide thickness (Tox) for your transistors, whereas different transistors in either the LOP or LSTP category achieve different leakages by varying gate length. I assume (but could be horribly wrong) that the highest-leakage LSTP transistors would therefore have worse dynamic power than LOP transistors with similar target leakage because gate length scaling would probably result in diminishing returns eventually.
So overall, I would be extremely shocked if the G core did not take less power than the LP core at the same frequency (e.g. 1.3GHz).
metafor
19-Nov-2010, 15:34
I don't buy this - at all. Practically speaking, the LP and G process are mostly LSTP (Low STandby Power) and LOP (Low Operating Power) processes respectively, as defined by the ITRS.
See slide 8 of the same presentation: http://www.lirmm.fr/arith18/slides/ARITH18_keynote-Knowles.pdf (I assume none of the LOP transistors are LP and none of the LSTP are G, but that would be an implementation detail anyway).
There's no free lunch - the LP transistors aren't magically lower power. You simply trade lower leakage for higher voltages at a given frequency, and therefore higher dynamic power at a given performance level.
Well yes, hence the comment "LP won't scale that high without drastic over-volting" :)
What the G transistors allow you to do in handhelds is to efficiently go higher up the leakage curve for those transistors that either truly require the speed (e.g. a frequency-optimised CPU core), or are nearly always either busy or power-gated off, or are a small but problematic bottleneck in your critical path. All three cases make perfect sense and the last two can genuinely reduce total power in real use cases.
This is not completely different from multi-Vt where you use multiple transistors at different points on the curve throughout your chip. LPG simply lets you go higher up on the curve where it makes sense without compromising power efficiency for the rest of your chip. One further point is that the LPG process works by giving you access to two different oxide thickness (Tox) for your transistors, whereas different transistors in either the LOP or LSTP category achieve different leakages by varying gate length. I assume (but could be horribly wrong) that the highest-leakage LSTP transistors would therefore have worse dynamic power than LOP transistors with similar target leakage because gate length scaling would probably result in diminishing returns eventually.
I'm trying to understand why you think it would have higher dynamic power with a longer gate length (unless you mean having to crank up voltage to reach the same frequency, as I've addressed above). There are many differences between LPG and LP, but the primary of which are that LP uses thicker oxide than LPG and that LP is doped lighter than LPG.
This results in both a lower Ion as well as Ioff. LPG should actually consume more dynamic power as well as more leakage as a trade-off for frequency but as you pointed out, it requires lower voltage to reach the same frequencies. So if we can lower the voltage by ~20-40mV to achieve the same frequency, we offset any (and keep in mind, leakage is still a dominant factor even while the core is running) disadvantages of using high-current transistors.
Keep in mind that this is all a non-linear relationship and at some point, the LP process would not need a significantly higher voltage compared to the LPG to reach a certain frequency. At that point, the LP core will take significantly less power for the same frequency. At 45LP, that point isn't that low :)
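To make the LP-versus-G trade-off in the posts above a bit more concrete, here is a toy model: total power is dynamic power (C·V²·f) plus leakage (V·I_leak), and each process needs a different voltage to reach a given frequency. Every number below is invented purely for illustration (they are not 45LP or 45LPG data), the linear voltage/frequency relation is a simplification, and `Process` and `crossover_ghz` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Process:
    name: str
    c_eff_nf: float    # effective switched capacitance in nF (made up)
    i_leak_ma: float   # leakage current in mA (made up, held constant)
    v_min: float       # volts needed at very low frequency (made up)
    v_per_ghz: float   # extra volts needed per GHz (made up)

    def voltage(self, f_ghz):
        return self.v_min + self.v_per_ghz * f_ghz

    def power_mw(self, f_ghz):
        v = self.voltage(f_ghz)
        # nF * V^2 * GHz gives watts; multiply by 1000 for mW.
        dynamic = self.c_eff_nf * v * v * f_ghz * 1000.0
        leakage = v * self.i_leak_ma  # V * mA gives mW
        return dynamic + leakage


# LP: low leakage, but voltage climbs quickly with frequency.
# G: slightly more capacitance and much more leakage, but a flatter V/f curve.
LP = Process("LP", c_eff_nf=0.40, i_leak_ma=10.0, v_min=0.90, v_per_ghz=0.25)
G = Process("G", c_eff_nf=0.42, i_leak_ma=80.0, v_min=0.85, v_per_ghz=0.12)


def crossover_ghz(lp, g, start=0.1, stop=2.0, step=0.05):
    """First scanned frequency at which the G core draws less than the LP core."""
    steps = int(round((stop - start) / step))
    for i in range(steps + 1):
        f = start + i * step
        if g.power_mw(f) < lp.power_mw(f):
            return f
    return None


for f in (0.2, 1.3):
    print(f, round(LP.power_mw(f), 1), round(G.power_mw(f), 1))
print("crossover near", crossover_ghz(LP, G), "GHz")
```

With these made-up parameters the LP core wins at low clocks and the G core wins at 1.3GHz, which is the shape both sides of the argument agree on. Where the crossover actually sits depends almost entirely on `v_per_ghz`, i.e. on whether the voltage saving is tens of millivolts or a lot more, and that is exactly the point being disputed.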
I'm trying to understand why you think it would have higher dynamic power with a longer gate length (unless you mean having to crank up voltage to reach the same frequency, as I've addressed above).
That's what I meant, yes.
There are many differences between LPG and LP, but the primary of which are that LP uses thicker oxide than LPG and that LP is doped lighter than LPG.
Ah yes, I forgot about doping, thanks.
This results in both a lower Ion as well as Ioff. LPG should actually consume more dynamic power as well as more leakage as a trade-off for frequency but as you pointed out, it requires lower voltage to reach the same frequencies.
Right, lower voltage at a given frequency. But then you say...
So if we can lower the voltage by ~20-40mV to achieve the same frequency, we offset any (and keep in mind, leakage is still a dominant factor even while the core is running) disadvantages of using high-current transistors.
20-40mV?! That's it? I can see why you think the inherently slightly higher power at a given frequency+voltage would compensate the lower voltage for a given frequency if that's the most you could lower the voltage for the same frequency. Ah well, I suppose I'll never get much more real-world data than this Icera presentation, which is very nice but might not apply perfectly to other cases (and there's no clear comparison of LSTP Low Vt and LOP High Vt sadly, which surely is the question here).
Keep in mind that this is all a non-linear relationship and at some point, the LP process would not need a significantly higher voltage compared to the LPG to reach a certain frequency. At that point, the LP core will take significantly less power for the same frequency. At 45LP, that point isn't that low :)
Hmmm. But surely that point is still much lower than 1.3GHz, no? So I'd still expect the G core to take noticeably less total power at 1.3GHz than the LP core.
metafor
19-Nov-2010, 16:45
20-40mV?! That's it? I can see why you think the inherently slightly higher power at a given frequency+voltage would compensate the lower voltage for a given frequency if that's the most you could lower the voltage for the same frequency. Ah well, I suppose I'll never get much more real-world data than this Icera presentation, which is very nice but might not apply perfectly to other cases (and there's no clear comparison of LSTP Low Vt and LOP High Vt sadly, which surely is the question here).
That depends on who designs your library. There's a lot more to HVT vs LVT than just gate length. More complex cells use different transistor configurations (favoring parallel vs serial FET arrangement for higher current with higher leakage). Typically, LVT is orders of magnitude (50ps vs 300ps) faster than HVT but both leakage and dynamic power are orders of magnitude higher.
And make no mistake, 40mV is a lot both in performance as well as power impact :) It's the difference between 1.4GHz and 1.7GHz.
Hmmm. But surely that point is still much lower than 1.3GHz, no? So I'd still expect the G core to take noticeably less total power at 1.3GHz than the LP core.
Possibly. But Scorpion scales really really well at 45LP (people have OC'ed it to 1.9GHz, though I'm not sure what the voltage was). So I don't know how conservative 1.3GHz was in the voltage/frequency curve.
A far more important question would be what Uber-OGL_ES-Halti exactly stands for (what a stupid codename for an API anyway....).
Halti (http://en.wikipedia.org/wiki/Halti)
Entropy
21-Nov-2010, 11:35
On how it will compare to the competition: I don't know for certain about OMAP5, but Tegra3's design target was a quad-core Cortex-A9 at 1.2GHz on 28LPT. That means Snapdragon would be 1.75x as fast per-core but for optimally scaling multi-core workloads (yeah right...) Tegra3 would be 1.14x faster. That's for integer; for floating-point, you need to consider Tegra3 doesn't include NEON (and even if it did, Cortex-A9's NEON is only 64-bit wide). I'd argue that from a marketing perspective, a quad-core with lower IPC remains very attractive, although I don't know how OEMs would evaluate both overall.
I don't know how attractive a quad-core is marketing wise in this segment. I'd contend that as we migrate from desktops to laptops to nettops to tablets to cell phones, the less impressed the average consumer generally gets with having a number such as this thrown in their face.
From a more technical/engineering standpoint, I'd like to question just how useful a quad-core CPU would be. It seems to me that not only do you have the multi-core utilization problems I'm familiar with, but on top of that you would add main memory (and possibly even cache, hi JohnH and metafor ;)) contention with the GPU, and on top of that you have pitiful main memory throughput compared to the ALU resources. So for the life of me, I just can't see utilization being very good. But then, I don't know the mobile application space all that well, am I missing something?
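As a side note, the 1.75x and 1.14x figures in the quoted passage can be reproduced with a few lines. The inputs are assumptions on my part: a 1GHz Scorpion baseline at about 2.1 DMIPS/MHz, the claimed 5x taken at face value for a dual-core part, and the commonly quoted 2.5 DMIPS/MHz for Cortex-A9.

```python
# All inputs below are assumptions used to reproduce the quoted ratios.
SCORPION_BASELINE_DMIPS = 2.1 * 1000   # original 1 GHz Snapdragon
A9_DMIPS_PER_MHZ = 2.5                 # commonly quoted Cortex-A9 figure

snapdragon_total = 5 * SCORPION_BASELINE_DMIPS   # the claimed 5x
snapdragon_per_core = snapdragon_total / 2       # dual-core

tegra3_per_core = A9_DMIPS_PER_MHZ * 1200        # 1.2 GHz target
tegra3_total = 4 * tegra3_per_core               # quad-core

per_core_ratio = snapdragon_per_core / tegra3_per_core
multi_core_ratio = tegra3_total / snapdragon_total

print(round(per_core_ratio, 2))    # 1.75
print(round(multi_core_ratio, 2))  # 1.14
```

The multi-core ratio assumes perfect scaling across four cores, which is the part questioned above.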
Ailuros
21-Nov-2010, 17:35
I assume the '75% lower power' (i.e. 4x performance per watt) is relative to the original 65nm Snapdragon, not the 45nm shrink.
As for PS3/XBox360-level performance... What's the probability that any handheld chip has equivalent performance to a chip with 24 TMUs at 550MHz before 14nm in 2015? Zero.
(although in practice RSX's utilisation isn't mind-blowing, it's more optimised towards perf/mm2 than perf/unit, so I suppose equivalent performance on an ultra-high-end tablet chip on 20nm isn't strictly impossible).
Last sentence describes your second thoughts? Before I answer your question whether there's going to be a chip with 24 TMUs@550MHz in half a decade, you might want to re-think of how exactly TMUs are incorporated into G7x. I'm sure I can come up with far better ideas regarding texturing and/or TMUs since G80 and even more so since GF100. 5 years from now is a mighty long time and technical advancements in the embedded space are more than just rapid. Even worse embedded CPU development isn't static either.
Not sure it directly fits with this conversation, but I have heard the CEO of IMG (Hossein Yassaie) say this year that performance of handheld graphics cores would hit around 100x today's performance, on roughly the same power, within 5 years.
500 MHz is the rate at which RSX actually ended up being specced.
Exophase
22-Nov-2010, 06:54
Wow, 100 times. I take it that's got to be incorporating a move to big gobs of dedicated on-die memory.
For the sake of posterity and history can we quantify that 100x performance level?
First let's pick a part for today's performance levels.
I suggest PowerVR SGX540 which seems to be on par with the Adreno 205:
Current specs:
90 million *triangles/sec.
In 5 years time we can expect 9 billion triangles/sec.
Awesome.
*These are vapo-triangles not real triangles
metafor
22-Nov-2010, 18:52
I seriously doubt it'll be inflated geometry pipeline expansion. Likely it'll be more improvements in the memory system (using GMEM on-die, for instance) than anything else. And that could easily result (once we get as much on-die SRAM as some of today's desktop processors) in a 100x improvement.
Intriguing metafor, cheers (obviously not very specific so I can't make much out of it, but heh :))
Regarding PS3-level performance and 100x in 5 years: I'm willing to bet those are GFlops relative to the Apple A4. The SGX535 there has only two ALU pipelines with 4 flops each, so that's about 2GFlops. I think 200GFlops on a high-end tablet chip on 14nm is not out of the question given the increasing ALU ratio (doubles in SGX540, doubles further in SGX543MP, presumably increases further in next-gen).
A SGX543 4MP @ 400MHz would already have 25x as many flops, and that's perfectly realistic on 28HPM. Even if you didn't change the ALU ratio, you'd still get to 100x pretty easily on 14nm in 2H15. Of course, as metafor says, the memory system will need a pretty big boost to keep up. I think external memory is likely to improve better than some expect there - for tablet chips in that timeframe, we should be looking at 64-bit DDR4, which is nice. In fact... now that I look at these numbers, may I change my prediction? Probability that we reach PS3-level performance on 28nm: practically zero. Probability that we reach it on 20nm: reasonably high! (and yes, I know G7x efficiency per unit is pretty bad, although I suppose I was thinking of the case where the dev hand-optimised quite a bit for it. Also keep in mind G8x isn't magically better there; unit efficiency is much better, but perf/mm2 isn't, as can be seen from G71 vs G84 - it's probably a better idea to only bother comparing handheld chips to Xenos anyway).
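The flops arithmetic above can be sketched roughly as follows. The 250MHz SGX535 clock is an assumption chosen so that 8 flops/clock gives "about 2GFlops", and I am reading the two doublings as per-core flops/clock; neither is confirmed by the post.

```python
def gflops(flops_per_clock, mhz, cores=1):
    """Peak GFlops for a given per-core flops/clock, clock and core count."""
    return flops_per_clock * mhz * cores / 1000.0


SGX535_FLOPS_PER_CLOCK = 2 * 4   # two ALU pipelines, 4 flops each (from the post)
SGX535_MHZ = 250                 # assumption, gives roughly 2 GFlops

# Assumption: "doubles in SGX540, doubles further in SGX543MP" is per core.
SGX543_FLOPS_PER_CLOCK = SGX535_FLOPS_PER_CLOCK * 2 * 2

a4_gpu = gflops(SGX535_FLOPS_PER_CLOCK, SGX535_MHZ)
sgx543_mp4 = gflops(SGX543_FLOPS_PER_CLOCK, 400, cores=4)

print(a4_gpu)                # 2.0
print(sgx543_mp4)            # 51.2
print(sgx543_mp4 / a4_gpu)   # 25.6
print(100 * a4_gpu)          # 200.0, the "100x" target in GFlops
```

Under these assumptions the 4MP part lands at roughly the 25x quoted, leaving about another 4x to find from process and architecture to reach 100x.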
First let's pick a part for today's performance levels.
I suggest PowerVR SGX540 which seems to be on par with the Adreno 205:
90 million *triangles/sec.
90M was a samsung marketing figure, I think IMG would be more comfortable with 20-30M
EDIT
Sorry, just saw your "vapo" reference.
Ailuros
23-Nov-2010, 17:26
For the sake of posterity and history can we quantify that 100x performance level?
First let's pick a part for today's performance levels.
I suggest PowerVR SGX540 which seems to be on par with the Adreno 205:
Well, even Qualcomm doesn't seem to place it there in its own graphs, but that's beside the point.
Current specs:
90 million *triangles/sec.
In 5 years time we can expect 9 billion triangles/sec.
Awesome.
*These are vapo-triangles not real triangles
The Samsung S5PC110 manual specifically mentions for SGX540 20M Tris and I assume that's at 200MHz.
Besides triangle rates are as important as on paper FLOP rates to define performance for any GPU.
But if we really have to speculate on theoretical triangle rates, MBX in its fastest incarnation (>230MHz) was capable of 7M Tris, just like the lowest end SGX520, while the lowest end MBX Lite of the OGL_ES 1.1 generation was somewhere at 700k Tris if memory serves well. If that should help in speculating where more or less the generation after Series5 could be regarding triangle rates, then fine.
If you now want to play silly marketing games: if I take in theory a 16MP@400MHz, that would equal 1120M Tris/s, which is 160x over the highest end MBX and 1600x over MBX Lite. Of course it is possible on paper if Series6 goes multi-core in due time like Series5 did; and it's pretty irrelevant if any marketed figure is just the maximum latency of a multi-core X configuration and never gets used in any device in the end.
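For the curious, the 1120M figure and the two multipliers follow from linear scaling of the per-core rate. The sketch below assumes the 35M polygons/sec at 200MHz that IMG's SGX543 press release quotes, and it assumes triangle rate scales linearly with both clock and core count, which is exactly the kind of on-paper figure being mocked here.

```python
SGX543_MTRIS_AT_200MHZ = 35.0   # per core, from IMG's SGX543 press release
MBX_MTRIS = 7.0                 # fastest MBX, from the post
MBX_LITE_MTRIS = 0.7            # lowest end MBX Lite, from the post


def paper_mtris(cores, mhz):
    """Theoretical M Tris/s, assuming linear scaling with clock and cores."""
    return SGX543_MTRIS_AT_200MHZ * (mhz / 200.0) * cores


peak = paper_mtris(cores=16, mhz=400)
print(peak)                          # 1120.0
print(round(peak / MBX_MTRIS))       # 160
print(round(peak / MBX_LITE_MTRIS))  # 1600
```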
The whole point was that it was a silly marketing statement. 100x performance in 5 years is highly unlikely, unless some weird, unrealistic and pointless metric was used as a comparison.
As to SGX 540 and Adreno 205 comparisons I was referring to:
http://androidandme.com/2010/10/news/3dmarkmobile-gpu-showdown-adreno-205-vs-powervr-sgx540/
Unfortunately in this test it is nearly impossible to completely rule out the CPU and memory subsystem which will impact scores but Adreno 205 does in all cases show itself as a competent performer in this particular benchmark.
This debate is far from settled. We can clearly see Qualcomm has come a long way from their first Snapdragons and made great progress with the Adreno 205 GPU. I was starting to worry that Qualcomm was in trouble (http://androidandme.com/2010/09/news/is-qualcomm-and-htc-in-trouble-with-their-dual-core-processors/) with their Adreno GPU family, but it holds its own against PowerVR and now I’m pretty excited about the Adreno 220 GPU coming in future dual-core Snapdragons.
The biggest problem as ever is power consumption and as alluded by another member here - the move to large quantities of ondie memory in the future will certainly not hurt performance.
How has the PC graphics market improved in performance over that time? May not be entirely relevant but gives some kind of indication of what is possible.
GeForce 7800 GTX or Radeon X850XT Platinum Edition
vs
Geforce GTX 580 or Radeon 5970.
PS I don't know where I got the 90 million triangles per second number from. Is there any place to easily check various specs apart from the press releases?
Ailuros
23-Nov-2010, 22:34
The whole point was that it was a silly marketing statement. 100x performance in 5 years is highly unlikely, unless some weird, unrealistic and pointless metric was used as a comparison.
Look at it that way: chances are few to none that someone will build today or in the future a 16MP SGX543/4. It won't happen because it's too large for a handheld, tablet or smartphone and as manufacturing processes scale down after a specific point their next generation will make more sense than that one. That doesn't mean that the IP doesn't exist or isn't technically feasible.
And yes of course, with so much time between then and today, the real-world performance difference is going to be huge irrespective of any theoretical measurement; also of course because SGX has a multi-core variant and I don't expect either ARM or IMG to abandon the multi-core idea for their GPU IP.
As to SGX 540 and Adreno 205 comparisons I was referring to:
http://androidandme.com/2010/10/news/3dmarkmobile-gpu-showdown-adreno-205-vs-powervr-sgx540/
Unfortunately in this test it is nearly impossible to completely rule out the CPU and memory subsystem which will impact scores but Adreno 205 does in all cases show itself as a competent performer in this particular benchmark.
One benchmark out of many, but again beside the point. Let's move on.
The biggest problem as ever is power consumption and as alluded by another member here - the move to large quantities of ondie memory in the future will certainly not hurt performance.
Again 5 years down the line is a mighty long time.
How has the PC graphics market improved in performance over that time? May not be entirely relevant but gives some kind of indication of what is possible.
GeForce 7800 GTX or Radeon X850XT Platinum Edition
vs
Geforce GTX 580 or Radeon 5970.
And that's relevant how exactly? But since you're eager to work out a parallel example for that one, assume you would build today a super-GPU with 16 GF110 cores on it with a solution that would guarantee nearly linear scaling; how do you think would that one compare to a G70?
PS I don't know where I got the 90 million triangles per second number from. Is there any place to easily check various specs apart from the press releases?
Trust me, the manual from SAMSUNG states 20M Tris for SGX540. It depends on the manufacturer: whether they list those kinds of specs and, if so, whether they're realistic enough to represent something as close as possible to reality.
Apart from upcoming SGX54x multi-core configs there's not a single embedded GPU out there that can achieve anything close to real 90M Tris/s. The Tegra2 GPU if memory serves well is capable of 70M vertices/s.
And that's relevant how exactly? But since you're eager to work out a parallel example for that one, assume you would build today a super-GPU with 16 GF110 cores on it with a solution that would guarantee nearly linear scaling; how do you think would that one compare to a G70?
Relevant as it shows scaling in GPU's over 5 years. And since you did bite :P
I think you misunderstood my example. You would literally compare the high end 5 years ago to the high end now.
So that would be a quad GPU GF110 system? I am not sure how to design a single system with 16 GF110's right now. By my rough calculations we still have not hit G70 x 100 speeds yet, (and be fair, by your metrics you should at least let me SLI my G70's but even without SLI I think maybe we have approached 40x with 8x more power consumption and approx 6x die size).
Everything is based on power consumption, die size and what the market is willing to bear (in regards to cost).
We have in many cases in raw theoretical power not gone to 100x the power in 5 years in the discrete PC GPU field. I find it unlikely the mobile platforms will either since......... they also face the same issues but with differing priorities for their end consumer (us).
I tried to bring some reality into these theoretical figures. That is all, and looking further into it I still don't think the 100x claim is possible. We all have the laws of physics to contend with after all.
Since that Android and Me link has shown up here, I should point out that the build of 3DMarkMobile ES2 v1.0 that they used was buggy and that performance of all the phones is misrepresented; particularly so in the Epic 4G's case.
I can't really speak for Qualcomm as to what their actual performance is, but for us in the 4G, using the 4G's shipping drivers, performance should be significantly higher in the Hoverjet test (over 2x), and higher in Taiji too by a smaller amount.
Ailuros
24-Nov-2010, 12:41
Relevant as it shows scaling in GPU's over 5 years. And since you did bite :P
I think you misunderstood my example. You would literally compare the high end 5 years ago to the high end now.
That's what I actually did. I compared a high end embedded GPU of 5 years ago with what the maximum possible today.
So that would be a quad GPU GF110 system? I am not sure how to design a single system with 16 GF110's right now.
There are lightyears of differences between desktop GPUs and an embedded GPU block in an SoC. That should be clear and that's the reason why I asked for the relevance in the first place. But since you can today scale a SGX543/4 up to 16 cores in an SoC, I used your rather weird example and asked how it would look like if you'd scale 16 GF110 cores on a GPU cluster, because exactly there are multi-core configs involved in the embedded space.
By my rough calculations we still have not hit G70 x 100 speeds yet, (and be fair, by your metrics you should at least let me SLI my G70's but even without SLI I think maybe we have approached 40x with 8x more power consumption and approx 6x die size).
Problem being that neither IMG nor any other IHV had, for the first OGL-ES1.x generation, any cores that were fit for multi-core configs. Can we stay in the embedded space for a change to keep track of things?
We have in many cases in raw theoretical power not gone to 100x the power in 5 years in the discrete PC GPU field. I find it unlikely the mobile platforms will either since......... they also face the same issues but with differing priorities for their end consumer (us).
See above. And mark once more that I clearly pointed out that 16MP for Series5 XT is the maximum latency of the design.
I tried to bring some reality into these theoretical figures. That is all, and looking further into it I still don't think the 100x claim is possible. We all have the laws of physics to contend with after all.
Arun already made a few points how someone could interpret that claim. He doesn't have to be on spot, he picked up the correct reasoning behind the marketing blurb. If someone would tell you that super-douper-ultra core config of the future will gain 6000 fps in Q3 in 1080p then of course it would be a complete joke.
But if you're actually following the embedded market you'll see that IMG, Qualcomm and ARM are potentially targeting GPGPU amongst other things, which means an even healthier boost in floating point performance with all of their next generations than today.
In fact Arun took a perfectly sensible example of a 4MP@400MHz. Take now a 16MP on the same frequency and the floating point difference compared to a SGX540 if my math isn't screwed up is even over 140x times. And I hope I won't have to repeat that chances are very few that we'll ever see a SGX543/4 16MP config.
Do you expect their next generation to sport the same floating point power per ALU as on SGX543/4? Obviously it will be quite a bit higher. Now try your speculative math again for 5 years down the road and for <20nm.
Since that Android and Me link has shown up here, I should point out that the build of 3DMarkMobile ES2 v1.0 that they used was buggy and that performance of all the phones is misrepresented; particularly so in the Epic 4G's case.
I can't really speak for Qualcomm as to what their actual performance is, but for us in the 4G, using the 4G's shipping drivers, performance should be significantly higher in the Hoverjet test (over 2x), and higher in Taiji too by a smaller amount.
That does explain the anomaly. Thanks.
Entropy
25-Nov-2010, 14:03
I think it is clear that great advances in graphics performance cannot come from the GPU core alone. Metafor pointed out advances in on-chip memory as a necessity, and Arun mentioned advances in the main memory performance.
There is an interesting underlying question here about SoC designs, and how much of a say the GPU IP suppliers have in the overall design of a TI OMAP. There are balancing issues that do not look trivial, and where GPU needs may be at odds with price/size/power draw/et cetera concerns. And of course the priorities of the volume customers when it comes to their devices is another powerful influence. Compared to, say, AMD providing a complete graphics card, the graphics IP designers have much less of a say in the physical implementation of their designs. Extreme uses of the IP, even if possible, may well never see the light of day.
I found some performance figures from an old ImgTech press release regarding the performance of the SGX 543 - 35 million polygons/sec at 200MHz assuming a 2.5x depth complexity.
http://www.imgtec.com/News/Release/index.asp?NewsID=428
The first generation Adreno is claimed to perform up to around 22 million triangles/sec with a 133 megapixel/sec fill rate.
Second generation up to 41 million triangles/sec and fill rate of 245 megapixels/sec.
Third generation (dual CPUs) up to 88 million triangles/sec and a fill rate up to 532 megapixels/sec.
http://www.qualcomm.com/products_services/chipsets/snapdragon.html
One thing that surprises me is how little performance information is available in the public domain for SoC's (that incorporate these GPU technologies). Alternatively it could be argued there is too much information for PC CPU's and Gfx chips.
Still there is a big gaping hole that could be filled... any takers?
A SGX543 4MP @ 400MHz would already have 25x as many flops, and that's perfectly realistic on 28HPM. Even if you didn't change the ALU ratio, you'd still get to 100x pretty easily on 14nm in 2H15. Of course, as metafor says, the memory system will need a pretty big boost to keep up. I think external memory is likely to improve better than some expect there - for tablet chips in that timeframe, we should be looking at 64-bit DDR4, which is nice. In fact... now that I look at these numbers, may I change my prediction? Probability that we reach PS3-level performance on 28nm: practically zero. Probability that we reach it on 20nm: reasonably high! (and yes, I know G7x efficiency per unit is pretty bad, although I suppose I was thinking of the case where the dev hand-optimised quite a bit for it. Also keep in mind G8x isn't magically better there; unit efficiency is much better, but perf/mm2 isn't as can be seen via G71 vs G84 - it's probably a better idea to only bother comparing handheld chips to Xenos anyway).
Wow totally missed this paragraph even after Ailuros pointed towards it.
metafor
25-Nov-2010, 16:06
I found some performance figures from an old ImgTech press release regarding the performance of the SGX 543 - 35 million polygons/sec at 200MHz assuming a 2.5x depth complexity.
http://www.imgtec.com/News/Release/index.asp?NewsID=428
The first generation Adreno is claimed to perform up to around 22 million triangles/sec with a 133 megapixel/sec fill rate.
Second generation up to 41 million triangles/sec and fill rate of 245 megapixels/sec.
Third generation (dual CPUs) up to 88 million triangles/sec and a fill rate up to 532 megapixels/sec.
http://www.qualcomm.com/products_services/chipsets/snapdragon.html
One thing that surprises me is how little performance information is available in the public domain for SoC's (that incorporate these GPU technologies). Alternatively it could be argued there is too much information for PC CPU's and Gfx chips.
Still there is a big gaping hole that could be filled... any takers?
I don't know how reliable those numbers can be considered, even at their base. If you look at the numbers of the Adreno 205 vs the 200, it's almost a direct scaling of the higher clockspeed the 205 uses.
But there were some micro-architectural changes as well, which aren't reflected.
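
To put a number on the "almost direct scaling" remark, here is a rough sketch that just divides the peak figures quoted above generation by generation. The `gen1`/`gen2`/`gen3` labels are shorthand for this example only, and nothing here uses actual clock speeds, which are not given in the thread.

```python
# Peak figures as quoted in the thread from Qualcomm's Snapdragon page.
# The generation keys are illustrative labels, not official product names.
ADRENO_CLAIMS = {
    "gen1": {"mtris_per_s": 22, "mpix_per_s": 133},
    "gen2": {"mtris_per_s": 41, "mpix_per_s": 245},
    "gen3": {"mtris_per_s": 88, "mpix_per_s": 532},
}


def scaling(older, newer):
    """Return (triangle-rate ratio, fill-rate ratio) between two generations."""
    a, b = ADRENO_CLAIMS[older], ADRENO_CLAIMS[newer]
    return b["mtris_per_s"] / a["mtris_per_s"], b["mpix_per_s"] / a["mpix_per_s"]


for older, newer in (("gen1", "gen2"), ("gen2", "gen3")):
    tri, fill = scaling(older, newer)
    print(f"{older} -> {newer}: triangles x{tri:.2f}, fill x{fill:.2f}")
```

In each step the triangle and fill-rate ratios land within a few percent of each other, which is what you would expect if the quoted peaks were derived from one common scale factor (such as clock) rather than measured per workload.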
Ailuros
25-Nov-2010, 17:40
I found some performance figures from an old ImgTech press release regarding the performance of the SGX 543 - 35 million polygons/sec at 200MHz assuming a 2.5x depth complexity.
Claimed poly rates are IMO irrelevant to depth complexity but rather for fillrate. Each 543 clocked at 200MHz has a fill-rate of 400MPixels/s * 2.5x overdraw = 1000 MPixels/s effective fill-rate. 4 (USSE2) ALUs, 2TMUs, 16 z/stencil units.
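
The fill-rate arithmetic in that post can be written out as a small sketch. The 2 pixels/clock figure is inferred here from the quoted 400MPixels/s at 200MHz; it is an assumption of this example, not a published spec, and the function names are made up for illustration.

```python
def raw_fill_rate_mpix(pixels_per_clock, clock_mhz):
    """Raw fill rate in MPixels/s: pixels written per clock times clock in MHz."""
    return pixels_per_clock * clock_mhz


def effective_fill_rate_mpix(raw_mpix, depth_complexity):
    """Marketing-style 'effective' fill rate.

    Multiplies the raw rate by the assumed scene depth complexity (overdraw),
    on the premise that hidden layers are rejected before they cost fill rate.
    """
    return raw_mpix * depth_complexity


# Assumed: 2 pixels/clock, derived from 400 MPixels/s at 200 MHz in the post.
raw = raw_fill_rate_mpix(pixels_per_clock=2, clock_mhz=200)
effective = effective_fill_rate_mpix(raw, depth_complexity=2.5)
print(raw, effective)
```

Note that depth complexity only enters the second function; nothing in this arithmetic touches a polygon rate.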
SGX535 is a totally different chapter. It consists of 2 USSE1 ALUs, 2 TMUs, 8 z/stencil units. USSE2 ALUs as found only in Series5XT (SGX543/544) have over twice the floating point throughput per ALU.
USSE1/SGX520-545 per ALU:
1 FP32 scalar or
2 FP16 (Vec2) or
4 INT8 (Vec3 or 4)
USSE2 = more than 2x USSE1 in throughput, and that's still Series5XT.
The first generation Adreno is claimed to perform up to around 22 million triangles/sec with a 133 megapixel/sec fill rate.
Second generation up to 41 million triangles/sec and fill rate of 245 megapixels/sec.
Third generation (dual CPUs) up to 88 million triangles/sec and a fill rate up to 532 megapixels/sec.
http://www.qualcomm.com/products_services/chipsets/snapdragon.html
One thing that surprises me is how little performance information is available in the public domain for SoC's (that incorporate these GPU technologies). Alternatively it could be argued there is too much information for PC CPU's and Gfx chips.
Still there is a big gaping hole that could be filled... any takers?
See metafor's reply for that. Qualcomm itself sets the Adreno 2xx generation roughly on par with iPhone3GS which contains a SGX535@200MHz (not sure if the frequency is correct). 540 is a step higher since it might contain the same amount of TMUs as 535 but has twice the ALU amount (4 instead of 2 in 535).
I think it is clear that great advances in graphics performance cannot come from the GPU core alone. Metafor pointed out advances in on-chip memory as a necessity, and Arun mentioned advances in the main memory performance.
There is an interesting underlying question here about SoC designs, and how much of a say the GPU IP suppliers have in the overall design of a TI OMAP. There are balancing issues that do not look trivial, and where GPU needs may be at odds with price/size/power draw/et cetera concerns. And of course the priorities of the volume customers when it comes to their devices is another powerful influence. Compared to, say, AMD providing a complete graphics card, the graphics IP designers have much less of a say in the physical implementation of their designs. Extreme uses of the IP, even if possible, may well never see the light of day.
I fully agree.
Tahir2,
Read up the following up until the end of the "wait what we're working on" paragraph here: http://pvrinsider.imgtec.com/
snip:
But we have just gotten started. The next-next-next generation graphics technologies we are working on at any point in time will be around 5-6 years away from shipping consumer products. Knowing how powerful the next POWERVR graphics technologies will be, we can confidently say that you haven’t seen anything yet! Very soon, we’ll see devices with our multi-core SGX XT (http://bit.ly/bXAzcR), which can scale to almost any level of performance needed. All in the palm of your hand.
Just to help the entire perspective.
Claimed poly rates are IMO irrelevant to depth complexity but rather for fillrate. Each 543 clocked at 200MHz has a fill-rate of 400MPixels/s * 2.5x overdraw = 1000 MPixels/s effective fill-rate. 4 (USSE2) ALUs, 2TMUs, 16 z/stencil units.
The wording is taken from ImgTech's press release; I realise that depth complexity figures are used to calculate best case scenario advantages for fillrate in PVR's architecture, all the way back to before the Kyro.
http://www.imgtec.com/News/Release/index.asp?NewsID=428
It is in there and the indication is depth complexity helps arrive at the polygons/sec figures.
Thanks for the heads up, will read the rest of the post and links a little later.
It is in there and the indication is depth complexity helps arrive at the polygons/sec figures.
Depth complexity has no bearing on quoted polygon throughput, it only affects fill rate.
Small correction to Ailuros's ALU throughput quote,
USSE1/SGX520-545 per ALU:
1 FP32 scalar min, 2x F32 max, or
2 FP16 (Vec2) or
4 INT8 (Vec3 or 4)
John.
Exophase
26-Nov-2010, 03:29
Isn't it really fixed point 10-bit vec3/vec4? 1 bit sign, 1 bit whole, 8 bits fractional.
Isn't it really fixed point 10-bit vec3/vec4? 1 bit sign, 1 bit whole, 8 bits fractional.
Hmm, yes, couple of data types missing there!
USSE1/SGX520-545 per ALU:
1 FP32 scalar min, 2x F32 max, or
2 FP16 (Vec2) or
2 INT16 (Vec2)
4 ES2.0 Lowp (Vec3 or 4)
4 INT8 (Vec3 or 4)
John.
Exophase
27-Nov-2010, 02:17
Thanks JohnH. It's cool that you get 2x int16 and not just 1x via doctored fp32s.
Does 2x FP32 mean 1 fmadd, or something more? Or if fmadd counts as one op, can you do 2 on FP16 per clock? If you can say, of course.
Thanks JohnH. It's cool that you get 2x int16 and not just 1x via doctored fp32s.
Does 2x FP32 mean 1 fmadd, or something more? Or if fmadd counts as one op, can you do 2 on FP16 per clock? If you can say, of course.
It's 2xF32 fmadd, however because of the data path width constraints getting to two requires some commonality between the inputs to each, for example as when multiplying a vector by a matrix or a vector by a scalar etc. F16 and INT16 don't have the data path width constraint so can always do 2x madd. For Lowp and INT8 it's 4x full sum of products. All are per pipe per clock.
John.
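
As a rough way to read JohnH's list, the per-pipe figures can be turned into peak operation rates. This is only a sketch: the table holds (min, max) operations per pipe per clock as stated in the posts above, "operation" means a madd for the float/int16 types and one sum-of-products lane for Lowp/INT8, and the pipe counts and clock in the example calls are placeholders rather than confirmed specs.

```python
# (min, max) operations per pipe per clock for USSE1, per the list above.
# FP32 only reaches 2 when the two madds share some of their inputs.
USSE1_OPS_PER_CLOCK = {
    "fp32": (1, 2),
    "fp16": (2, 2),
    "int16": (2, 2),
    "lowp": (4, 4),
    "int8": (4, 4),
}


def peak_mops(dtype, pipes, clock_mhz):
    """Return (min, max) peak rate in millions of operations per second."""
    lo, hi = USSE1_OPS_PER_CLOCK[dtype]
    return lo * pipes * clock_mhz, hi * pipes * clock_mhz


# Placeholder configurations: 2 pipes and 4 pipes at an assumed 200 MHz.
print(peak_mops("fp32", pipes=2, clock_mhz=200))
print(peak_mops("int8", pipes=4, clock_mhz=200))
```

These are ceilings per data type, not sustained throughput; real shaders mix types and rarely hit the FP32 maximum because of the shared-input constraint.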
Wishmaster
23-Jan-2011, 21:41
According to this article (http://www.itproportal.com/2011/01/11/pictures-qualcomm-demos-worlds-fastest-smartphone/) qualcomm msm8x60 is manufactured at 28nm and not 45 as we thought till now.
Do you think it's possible considering that we should see first devices running on this chip in few months? Did qualcomm outrun the competition?
There is absolutely no way Qualcomm started sampling a 28nm chip in June 2010. It is not at all credible. I could imagine them eventually releasing a 28nm shrink of this 40nm chip though (ala MSM7200A), who knows...
Wishmaster
23-Jan-2011, 23:09
There is absolutely no way Qualcomm started sampling a 28nm chip in June 2010. It is not at all credible. I could imagine them eventually releasing a 28nm shrink of this 40nm chip though (ala MSM7200A), who knows...
That's what I thought
Just wanted to get some confirmation from people that know more than I do :smile:
Question to you Arun, at CES qualcomm showed this msm8x60 streaming 1080p 3D video through HDMI cable to TV, it's not something they talked about earlier so do you think there could've been some modifications in the chipset?
convergedw
23-Jan-2011, 23:27
I think the first 28nm chip from Qualcomm is supposed to be the MSM8960. It is slated to sample sometime in 2011. It will have their next gen Snapdragon CPU and likely the Adreno 300 GPU. My guess is we'll hear more specs at MWC.
Unfortunately, Qualcomm's roadmap has become so complicated that quite a few blogs confuse the MSM8260/MSM8660 and the MSM8960. Hell, their own CEO confused a couple of their dual-core chips in September.
http://www.intomobile.com/2010/09/09/qualcomm-our-dual-core-1-5ghz-snapdragon-processor-will-arrive-in-q3-4-2011-not-q1/
Question to you Arun, at CES qualcomm showed this msm8x60 streaming 1080p 3D video through HDMI cable to TV, it's not something they talked about earlier so do you think there could've been some modifications in the chipset?
I know that at one point Qualcomm was going to use AMD IP for 1080p encode/decode, but I don't know if they did in the end. That IP was based on Tensilica Xtensa, so it should be quite flexible - it should probably be able to reuse the same resources to do 3D video at a lower bitrate without changes (or just a moderately higher clock). Of course maybe that's not the IP they use, in which case either they made some changes or it's also fairly flexible...
Unfortunately, Qualcomm's roadmap has become so complicated
They are indeed very good at coming up with ridiculously complex roadmaps. I tried to figure out Qualcomm's RF roadmap based on a few presentations - it's arguably even more complicated than their chipset roadmap! :) I'd post it but I doubt anyone would care, heh.
convergedw
21-Apr-2011, 19:11
Here is a presentation by Qualcomm giving some numbers on Snapdragon including some comparisons between their next-gen architecture and the A15.
http://www.kandroid.org/board/data/board/conference/file_in_body/1/4.session.제7회_KANDROID_세미나_퀄컴.pdf
Lots of good info in there. The quad version of their next GPU could be interesting depending on when it reaches product.
Some of the claims in there (especially on the graphics side) are highly questionable, though.
Hah @ them using NEON-based stuff for most of their current generation CPU performance comparisons, and also things like V8 which while very very relevant they were actually responsible for porting to ARM (and told me point blank they did their best to make it more tolerant of Snapdragon's long pipeline - which is a very good thing and a great investment on their part, just makes it less representative of other workloads).
Another very surprising tidbit is this presentation seems to imply the 2.5GHz Snapdragon might be done on SiON (28 LP) and not High-K (although it's far from explicit on this point). I was quite shocked to discover that OMAP5 will be done on a 28nm SiON process at UMC and GF, not High-K - so that's possibly 2GHz versus 2.5GHz on the same process - although I still suspect Qualcomm is probably using High-K despite that slide, and there's also the question of whether they both use Triple Gate Oxide if they're SiON. Either way I'm skeptical about their performance claims; if they're both SiON, then 23% more headroom would therefore imply identical DMIPS/MHz as A15, which seems extremely unlikely given all of their claims so far. I think they must just be underestimating A15 clocks and/or DMIPS/MHz.
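
The "identical DMIPS/MHz" reasoning can be sketched as simple arithmetic. This assumes the 23% figure is a claimed overall performance advantage and that performance is just per-clock rate times clock; both are assumptions of this example, and the 2.5GHz and 2GHz clocks are the ones mentioned in the post.

```python
def implied_per_clock_ratio(perf_ratio, clock_a_ghz, clock_b_ghz):
    """Per-clock (DMIPS/MHz) ratio of A over B implied by an overall
    performance ratio, assuming performance = per-clock rate * clock."""
    return perf_ratio / (clock_a_ghz / clock_b_ghz)


# Assumed inputs: 23% claimed advantage, 2.5 GHz versus 2.0 GHz.
ratio = implied_per_clock_ratio(perf_ratio=1.23, clock_a_ghz=2.5, clock_b_ghz=2.0)
print(f"implied per-clock ratio: {ratio:.3f}")
```

A 25% clock advantage soaks up essentially all of a 23% performance advantage, leaving a per-clock ratio just under 1, which is the "identical DMIPS/MHz" conclusion.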
Either way, Qualcomm's roadmap is solid as always, there's no denying that.
metafor
21-Apr-2011, 21:46
Another very surprising tidbit is this presentation seems to imply the 2.5GHz Snapdragon might be done on SiON (28 LP) and not High-K (although it's far from explicit on this point).
2.5 won't be on LP. But the slide is fairly market-speak and doesn't distinguish that :/
I was quite shocked to discover that OMAP5 will be done on a 28nm SiON process at UMC and GF, not High-K - so that's possibly 2GHz versus 2.5GHz on the same process
2GHz on LP? I suppose it's plausible but difficult to believe even with the A15's pipeline.
- although I still suspect Qualcomm is probably using High-K despite that slide, and there's also the question of whether they both use Triple Gate Oxide if they're SiON.
2.5GHz is HK, 1.4-1.7GHz is LP. 8960 won't be 2.5.
2GHz on LP? I suppose it's plausible but difficult to believe even with the A15's pipeline.
I was surprised as well, but TI apparently said so explicitly: http://www.eetimes.com/electronics-news/4214774/Upset-TI-slams-Samsung-s-foundry-efforts (EETimes reporting isn't always right these days, but this is a very good article overall and written by Mark LaPedus to boot, so I'd tend to trust it). Keep in mind it might use Triple Gate Oxide at least (TSMC certainly supports it at 28LP, presumably UMC/GF do too but I don't know for certain).
ARM's A15 presentation says "Feasibility work showed critical loops balancing at about 15-16 gates/clk" - if that means ~16 FO4 on A15, then it's an absolute speed demon and 2GHz on SiON might not be that surprising, however I read that as meaning 'relatively simple gates' rather than necessarily FO4. Also being 'feasible' doesn't mean that's necessarily what they did I suppose. I don't know the terminology enough to know what is the most likely meaning, any ideas?
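
For a feel of what "15-16 gates/clk" means at these clocks, here is a back-of-the-envelope sketch of the time budget per gate. It deliberately ignores flop setup and clock-to-Q overhead, skew and wire delay, and it says nothing about whether the "gates" are FO4 or NAND-equivalent, so treat it as an upper bound on the per-gate budget rather than a process figure.

```python
def cycle_time_ps(freq_ghz):
    """Clock period in picoseconds for a frequency in GHz."""
    return 1000.0 / freq_ghz


def per_gate_budget_ps(freq_ghz, gates_per_clock):
    """Time available per gate if the whole cycle were spent in logic.

    Ignores sequencing overhead (flop setup/clk-to-Q, skew), so the real
    budget per gate is smaller than this.
    """
    return cycle_time_ps(freq_ghz) / gates_per_clock


for freq in (2.0, 2.5):
    for gates in (15, 16):
        budget = per_gate_budget_ps(freq, gates)
        print(f"{freq} GHz, {gates} gates/clk: {budget:.2f} ps per gate")
```

Whether a given SiON or High-K process can deliver a gate in that time is exactly the open question in the posts above.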
2.5GHz is HK, 1.4-1.7GHz is LP. 8960 won't be 2.5.
Oh, so 2.5GHz is only APQ8064. That makes a lot of sense, the PR certainly wasn't very clear though :???:
EDIT: BTW, it's nice that Snapdragon has a tightly coupled L2 unlike A9 (but like A15), however I think it's worth pointing out that an OoOE core like the A9 can hide L2 latency better than an in-order one like Snapdragon.
metafor
21-Apr-2011, 22:29
I was surprised as well, but TI apparently said so explicitly: http://www.eetimes.com/electronics-news/4214774/Upset-TI-slams-Samsung-s-foundry-efforts (EETimes reporting isn't always right these days, but this is a very good article overall and written by Mark LaPedus to boot, so I'd tend to trust it)
UMC 28nm. Probably still SiON but could potentially be faster than TSMC's 28LP. Plus TI's never been a slouch at pushing frequency from the back-end side.
ARM's A15 presentation says "Feasibility work showed critical loops balancing at about 15-16 gates/clk" - if that means ~16 FO4 on A15, then it's an absolute speed demon and 2GHz on SiON might not be that surprising, however I read that as meaning 'relatively simple gates' rather than necessarily FO4. I don't know the terminology enough to know what is the most likely meaning though, any ideas?
*Shrug*. Based on the pipeline, I'd say it's based on NAND-equivalent delay but who knows. I really really really doubt they're able to pull off 12-cycle NEON VMLA with 16 FO4.
Oh, so 2.5GHz is APQ8064. That makes a lot of sense, the PR certainly wasn't very clear though :???:
I forget which model name it is, but only HK variants will go above 1.4-2.0GHz. I'm not sure what frequency of the A15 they're comparing to, but I'm reasonably sure it isn't 2.0GHz. PR is never clear, unfortunately :/
UMC 28nm. Probably still SiON but could potentially be faster than TSMC's 28LP. Plus TI's never been a slouch at pushing frequency from the back-end side.
Yeah, although presumably they will dual-source with GF again, using the same 28LP SiON process there as Qualcomm. Either way ST-Ericsson's 2.5GHz peak on High-K is suddenly looking less impressive.
*Shrug*. Based on the pipeline, I'd say it's based on NAND-equivalent delay but who knows. I really really really doubt they're able to pull off 12-cycle NEON VMLA with 16 FO4.
NAND-equivalent delay makes sense, thanks. On the A8/A9, MAC was done as separate MUL then ADD with a dedicated MAC FIFO afaik (so it had twice the latency), however A15 supports a fused FMAC which must indeed presumably be done in 12 cycles (including Issue & Writeback). That might indeed be a frequency bottleneck.
metafor
21-Apr-2011, 22:58
Yeah, although presumably they will dual-source with GF again, using the same 28LP SiON process there as Qualcomm.
Krait is on TSMC.....
NAND-equivalent delay makes sense, thanks. On the A8/A9, MAC was done as separate MUL then ADD with a dedicated MAC FIFO afaik (so it had twice the latency), however A15 supports a fused FMAC which must indeed presumably be done in 12 cycles (including Issue & Writeback). That might indeed be a frequency bottleneck.
12 cycles is a lot for VFMA. IIRC, A15's VMLA throughput is 1 quad per cycle so they aren't double-pumping. The 12 cycles is likely only for VMLA, but we'll have to wait until instruction latencies are released to be sure.
I would suspect ARM would use NAND-equivalents more than they would FO4 as they're more front-end oriented and their modeling is likely based on gate-delay rather than wire delay. But 2.0GHz on 28 SiON is indeed impressive.
Krait is on TSMC.....
My understanding is that TSMC is the lead supplier with GF as a likely second source down the line, based on this article: http://semimd.com/blog/2011/02/07/qualcomm-shies-away-from-high-k-at-28nm/ - either way, we should probably leave that discussion at this, right or wrong :)
12 cycles is a lot for VFMA. IIRC, A15's VMLA throughput is 1 quad per cycle so they aren't double-pumping.
I think that's right, yes.
The 12 cycles is likely only for VMLA, but we'll have to wait until instruction latencies are released to be sure. I would suspect ARM would use NAND-equivalents more than they would FO4 as they're more front-end oriented and their modeling is likely based on gate-delay rather than wire delay. But 2.0GHz on 28 SiON is indeed impressive.
Indeed. BTW, I just remembered Bulldozer has a latency of 6 cycles for fused FMA and it's also quite a speed demon. David Kanter from RealWorldTech mentioned a (presumably NAND-equivalent) ~17 gate delay rumour on comp.arch in his article, not sure if it's true but either way it should have a lower gate delay than the vast majority of CPUs out there. If they can do 6 cycles on ~17 NAND-equivalent, then 10+ on 16 FO4 doesn't seem so impossible anymore. However even then I'd be skeptical ARM would be willing to trade-off area/power to achieve that latency, and your reasoning on why ARM would talk in NAND-equivalent makes sense to me.
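
The Bulldozer comparison boils down to total logic depth, i.e. latency in cycles times gate delays per cycle. A quick sketch, with the caveat already raised above that NAND-equivalent and FO4 delays are different units, so the totals are only loosely comparable; the function name is made up for this example.

```python
def logic_depth(latency_cycles, gates_per_cycle):
    """Upper bound on the gate delays available across a pipelined operation,
    assuming every stage is filled to the per-cycle limit."""
    return latency_cycles * gates_per_cycle


cases = {
    "Bulldozer FMA, 6 cycles x ~17 gates": logic_depth(6, 17),
    "A15-style, 10 cycles x 16 gates": logic_depth(10, 16),
    "A15-style, 12 cycles x 16 gates": logic_depth(12, 16),
}
for label, depth in cases.items():
    print(f"{label}: {depth} gate delays")
```

On these numbers a 10 to 12 cycle pipeline at 16 gates per cycle has noticeably more total depth to work with than the rumoured Bulldozer figure, which is why the latency alone does not rule out a short cycle time.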
metafor
21-Apr-2011, 23:13
Indeed. BTW, I just remembered Bulldozer has a latency of 6 cycles for fused FMA and it's also quite a speed demon. David Kanter from RealWorldTech mentioned a (presumably NAND-equivalent) ~17 gate delay rumour on comp.arch in his article, not sure if it's true but either way it should have a lower gate delay than the vast majority of CPUs out there. If they can do 6 cycles on ~17 NAND-equivalent, then 10+ on 16 FO4 doesn't seem so impossible anymore. However even then I'd be skeptical ARM would be willing to trade-off area/power to achieve that latency, and your reasoning on why ARM would talk in NAND-equivalent makes sense to me.
A 12-cycle VFMA would be possible with 16 FO4's but it'd be a colossal waste of gates and flops. Plus it'd also mean VMLA would be ~20 cycles, which I don't believe it is. The long pole really is VMLA, not VFMA if we're talking a throughput of 1 quad/cycle.
Laurent06
22-Apr-2011, 08:30
Here is a presentation by Qualcomm giving some numbers on Snapdragon including some comparisons between their next-gen architecture and the A15.
http://www.kandroid.org/board/data/board/conference/file_in_body/1/4.session.제7회_KANDROID_세미나_퀄컴.pdf
That link doesn't work for me, or the document was removed. Did someone save a copy?
convergedw
22-Apr-2011, 15:13
For some reason the file opened fine for me in Chrome but not in IE or Firefox. Here is the file at an upload site.
http://www.speedyshare.com/files/28087534/4.session._7_KANDROID_.pdf
Laurent06
22-Apr-2011, 16:09
I'm indeed using FF. Thanks a lot for sharing!