An overview of Qualcomm's Snapdragon Roadmap

Ohhhhh. Wait, that chip uses two different synthesis jobs for the two cores, one of which uses only LP transistors and the other mostly G? I didn't know that, very intriguing. I thought the Marvell Armada 628 was the first to do something like that. This would also explain why Qualcomm is the only company that has invested in a dedicated DC/DC for each core; it would be problematic to share one DC/DC if the cores were rated for very different frequencies at a given voltage.

Regardless of whether the two cores are asymmetrical, separate power rails are a good thing for power. I believe Nehalem took this approach as well. The 8x60, for instance, has two symmetrical LP cores, but has separate regulators for each. It just makes sense if you can afford the engineering effort.

My assumption (if this is true) is that at very low frequencies the LP core takes significantly less power than the G core (which is therefore always power gated off in that case) due to lower leakage, but at the maximum frequency the G core takes *less* power than the LP core due to lower dynamic power. Anything else would be rather absurd and defy the whole point (with your numbers, you'd be better off with an overvolted LP core, that's insane!)
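To make that concrete, here's a toy model of the trade-off (all constants invented for illustration, nothing Qualcomm-specific): the LP core leaks less but needs more voltage at a given frequency, so there's a crossover frequency above which the G core draws less total power.

```python
# Toy LP-vs-G power model; every constant here is made up purely
# to illustrate the shape of the trade-off.

C_EFF = 0.3e-9  # effective switched capacitance (F), illustrative

def total_power(v, f_hz, i_leak):
    """P_total = dynamic (C*V^2*f) + static (V*I_leak)."""
    return C_EFF * v**2 * f_hz + v * i_leak

for f_mhz in (200, 600, 1000, 1300):
    f = f_mhz * 1e6
    v_lp = 0.9 + 0.50 * (f_mhz / 1300)  # LP: steeper V/f curve
    v_g  = 0.9 + 0.25 * (f_mhz / 1300)  # G: flatter V/f curve
    p_lp = total_power(v_lp, f, i_leak=0.002)  # LP leaks ~2 mA (toy)
    p_g  = total_power(v_g,  f, i_leak=0.020)  # G leaks ~20 mA (toy)
    print(f"{f_mhz:>4} MHz: LP {p_lp*1e3:6.1f} mW   G {p_g*1e3:6.1f} mW")
```

With these made-up numbers, LP wins at 200MHz (leakage dominates) and G wins well before 1.3GHz (voltage dominates); the real crossover point is exactly what's being argued about below.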

Well, the G process takes similar dynamic power. The thing to remember is that the LP process can only scale so high without drastically over-volting. And this being (well, intended to be) a tablet/netbook part, chasing performance was important. And let's face it, 1.5GHz sounds great for marketing.
 
I too think the dual process approach sounds intriguing, especially if you can shut off the higher power core entirely during low load.
 
Regardless of whether the two cores are asymmetrical, separate power rails are a good thing for power. I believe Nehalem took this approach as well. The 8x60, for instance, has two symmetrical LP cores, but has separate regulators for each. It just makes sense if you can afford the engineering effort.
You're perfectly correct of course - I should have remembered our earlier discussion better (although I'd note an extra DC/DC is not free, and Qualcomm has the slight advantage of making their own PMICs). Here's probably a better way to put it: if your cores are asymmetric, then there is also a fair bit of engineering complexity in sharing a single DC/DC, so it makes even more sense to invest in the superior approach of dual DC/DC.

Well, the G process takes similar dynamic power. The thing to remember is that the LP process can only scale so high without drastically over-volting. And this being (well, intended to be) a tablet/netbook part, chasing performance was important. And let's face it, 1.5GHz sounds great for marketing.
I don't buy this - at all. Practically speaking, the LP and G process are mostly LSTP (Low STandby Power) and LOP (Low Operating Power) processes respectively, as defined by the ITRS.

See slide 8 of the same presentation: http://www.lirmm.fr/arith18/slides/ARITH18_keynote-Knowles.pdf (I assume none of the LOP transistors are LP and none of the LSTP are G, but that would be an implementation detail anyway).

There's no free lunch - the LP transistors aren't magically lower power. You simply trade lower leakage for higher voltages at a given frequency, and therefore higher dynamic power at a given performance level. What the G transistors allow you to do in handhelds is to efficiently go higher up the leakage curve for those transistors that either truly require the speed (e.g. a frequency-optimised CPU core), or are nearly always either busy or power-gated off, or are a small but problematic bottleneck in your critical path. All three cases make perfect sense and the last two can genuinely reduce total power in real use cases.

This is not completely different from multi-Vt where you use multiple transistors at different points on the curve throughout your chip. LPG simply lets you go higher up on the curve where it makes sense without compromising power efficiency for the rest of your chip. One further point is that the LPG process works by giving you access to two different oxide thicknesses (Tox) for your transistors, whereas different transistors in either the LOP or LSTP category achieve different leakages by varying gate length. I assume (but could be horribly wrong) that the highest-leakage LSTP transistors would therefore have worse dynamic power than LOP transistors with similar target leakage because gate length scaling would probably result in diminishing returns eventually.

So overall, I would be extremely shocked if the G core did not take less power than the LP core at the same frequency (e.g. 1.3GHz).
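As a rough illustration of the multi-Vt point (a toy, not a real synthesis flow; all cell numbers are invented): the leaky, fast cells only get used on the paths that actually need them, so most of the chip stays on low-leakage devices.

```python
# Toy multi-Vt cell selection: pick the lowest-leakage cell flavour
# that still meets each path's timing budget. All numbers invented.

# (delay_ps, leakage_nA) per hypothetical cell flavour.
CELLS = {"HVT": (300, 1), "SVT": (150, 10), "LVT": (50, 100)}

def pick_cell(budget_ps):
    """Lowest-leakage flavour whose delay fits the timing budget."""
    for name, (delay, leak) in sorted(CELLS.items(),
                                      key=lambda kv: kv[1][1]):
        if delay <= budget_ps:
            return name, leak
    return "LVT", CELLS["LVT"][1]  # critical path: no choice left

# Most paths have slack, so only a few end up on leaky LVT cells.
budgets_ps = [320, 400, 280, 150, 90, 60, 500, 350]
choices = [pick_cell(b) for b in budgets_ps]
print([name for name, _ in choices])
print("total leakage:", sum(leak for _, leak in choices), "nA",
      "(all-LVT would be", len(budgets_ps) * CELLS["LVT"][1], "nA)")
```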
 
I don't buy this - at all. Practically speaking, the LP and G process are mostly LSTP (Low STandby Power) and LOP (Low Operating Power) processes respectively, as defined by the ITRS.

See slide 8 of the same presentation: http://www.lirmm.fr/arith18/slides/ARITH18_keynote-Knowles.pdf (I assume none of the LOP transistors are LP and none of the LSTP are G, but that would be an implementation detail anyway).

There's no free lunch - the LP transistors aren't magically lower power. You simply trade lower leakage for higher voltages at a given frequency, and therefore higher dynamic power at a given performance level.

Well yes, hence the comment "LP won't scale that high without drastic over-volting" :)

What the G transistors allow you to do in handhelds is to efficiently go higher up the leakage curve for those transistors that either truly require the speed (e.g. a frequency-optimised CPU core), or are nearly always either busy or power-gated off, or are a small but problematic bottleneck in your critical path. All three cases make perfect sense and the last two can genuinely reduce total power in real use cases.

This is not completely different from multi-Vt where you use multiple transistors at different points on the curve throughout your chip. LPG simply lets you go higher up on the curve where it makes sense without compromising power efficiency for the rest of your chip. One further point is that the LPG process works by giving you access to two different oxide thicknesses (Tox) for your transistors, whereas different transistors in either the LOP or LSTP category achieve different leakages by varying gate length. I assume (but could be horribly wrong) that the highest-leakage LSTP transistors would therefore have worse dynamic power than LOP transistors with similar target leakage because gate length scaling would probably result in diminishing returns eventually.

I'm trying to understand why you think it would have higher dynamic power with a longer gate length (unless you mean having to crank up voltage to reach the same frequency, as I've addressed above). There are many differences between LPG and LP, but the primary ones are that LP uses thicker oxide than LPG and that LP is doped more lightly than LPG.

This results in both a lower Ion as well as Ioff. LPG should actually consume more dynamic power as well as more leakage as a trade-off for frequency but as you pointed out, it requires lower voltage to reach the same frequencies. So if we can lower the voltage by ~20-40mV to achieve the same frequency, we offset any disadvantages of using high-current transistors (and keep in mind, leakage is still a dominant factor even while the core is running).

Keep in mind that this is all a non-linear relationship and at some point, the LP process would not need a significantly higher voltage compared to the LPG to reach a certain frequency. At that point, the LP core will take significantly less power for the same frequency. At 45LP, that point isn't that low :)
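A sketch of that non-linearity, using the alpha-power delay model (f roughly proportional to (V - Vt)^a / V); the Vt, alpha and k values here are invented, so only the shape of the curves matters: the voltage gap between the high-Vt (LP-like) and low-Vt (G-like) device grows as you chase frequency and shrinks toward the bare Vt offset at low frequencies.

```python
# Alpha-power-law sketch of the non-linear V/f relationship.
# Vt, alpha, and k are invented; only the curve shape is the point.

def freq(v, vt, alpha=1.3, k=3.0e9):
    """Rough alpha-power-law frequency model (Hz)."""
    if v <= vt:
        return 0.0
    return k * (v - vt) ** alpha / v

def volts_for(f_target, vt, lo=0.4, hi=1.5):
    """Bisect for the supply voltage that reaches f_target."""
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if freq(mid, vt) < f_target else (lo, mid)
    return (lo + hi) / 2

for f_mhz in (300, 800, 1300):
    v_lp = volts_for(f_mhz * 1e6, vt=0.45)  # LP-like: higher Vt
    v_g  = volts_for(f_mhz * 1e6, vt=0.35)  # G-like: lower Vt
    print(f"{f_mhz:>4} MHz: LP {v_lp:.3f} V, G {v_g:.3f} V, "
          f"delta {(v_lp - v_g) * 1e3:.0f} mV")
```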
 
I'm trying to understand why you think it would have higher dynamic power with a longer gate length (unless you mean having to crank up voltage to reach the same frequency, as I've addressed above).
That's what I meant, yes.
There are many differences between LPG and LP, but the primary ones are that LP uses thicker oxide than LPG and that LP is doped more lightly than LPG.
Ah yes, I forgot about doping, thanks.
This results in both a lower Ion as well as Ioff. LPG should actually consume more dynamic power as well as more leakage as a trade-off for frequency but as you pointed out, it requires lower voltage to reach the same frequencies.
Right, lower voltage at a given frequency. But then you say...
So if we can lower the voltage by ~20-40mV to achieve the same frequency, we offset any disadvantages of using high-current transistors (and keep in mind, leakage is still a dominant factor even while the core is running).
20-40mV?! That's it? I can see why you think the inherently slightly higher power at a given frequency+voltage would compensate the lower voltage for a given frequency if that's the most you could lower the voltage for the same frequency. Ah well, I suppose I'll never get much more real-world data than this Icera presentation, which is very nice but might not apply perfectly to other cases (and there's no clear comparison of LSTP Low Vt and LOP High Vt sadly, which surely is the question here).
Keep in mind that this is all a non-linear relationship and at some point, the LP process would not need a significantly higher voltage compared to the LPG to reach a certain frequency. At that point, the LP core will take significantly less power for the same frequency. At 45LP, that point isn't that low :)
Hmmm. But surely that point is still much lower than 1.3GHz, no? So I'd still expect the G core to take noticeably less total power at 1.3GHz than the LP core.
 
20-40mV?! That's it? I can see why you think the inherently slightly higher power at a given frequency+voltage would compensate the lower voltage for a given frequency if that's the most you could lower the voltage for the same frequency. Ah well, I suppose I'll never get much more real-world data than this Icera presentation, which is very nice but might not apply perfectly to other cases (and there's no clear comparison of LSTP Low Vt and LOP High Vt sadly, which surely is the question here).

That depends on who designs your library. There's a lot more to HVT vs LVT than just gate length. More complex cells use different transistor configurations (favoring parallel vs serial FET arrangements for higher current with higher leakage). Typically, LVT is orders of magnitude (50ps vs 300ps) faster than HVT, but both leakage and dynamic power are orders of magnitude higher.

And make no mistake, 40mV is a lot both in performance as well as power impact :) It's the difference between 1.4GHz and 1.7GHz.

Hmmm. But surely that point is still much lower than 1.3GHz, no? So I'd still expect the G core to take noticeably less total power at 1.3GHz than the LP core.

Possibly. But Scorpion scales really really well at 45LP (people have OC'ed it to 1.9GHz, though I'm not sure what the voltage was). So I don't know how conservative 1.3GHz was in the voltage/frequency curve.
 
On how it will compare to the competition: I don't know for certain about OMAP5, but Tegra3's design target was a quad-core Cortex-A9 at 1.2GHz on 28LPT. That means Snapdragon would be 1.75x as fast per-core, but for optimally scaling multi-core workloads (yeah right...) Tegra3 would be 1.14x faster. That's for integer; for floating-point, you need to consider that Tegra3 doesn't include NEON (and even if it did, Cortex-A9's NEON is only 64-bit wide). I'd argue that from a marketing perspective, a quad-core with lower IPC remains very attractive, although I don't know how OEMs would evaluate both overall.
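For reference, the back-of-envelope arithmetic behind those 1.75x / 1.14x figures; the ~1.4x per-clock advantage assumed for Scorpion over Cortex-A9 is my inference from the quoted numbers, not an official figure.

```python
# Reproducing the 1.75x per-core and 1.14x aggregate figures.
# The 1.4 relative-IPC number is inferred, not official.

snapdragon = {"cores": 2, "ghz": 1.5, "ipc": 1.4}
tegra3     = {"cores": 4, "ghz": 1.2, "ipc": 1.0}

def per_core(chip):
    return chip["ghz"] * chip["ipc"]

def aggregate(chip):
    return chip["cores"] * per_core(chip)

print(f"per-core, Snapdragon vs Tegra3: "
      f"{per_core(snapdragon) / per_core(tegra3):.2f}x")    # 1.75x
print(f"all cores (ideal scaling), Tegra3 vs Snapdragon: "
      f"{aggregate(tegra3) / aggregate(snapdragon):.2f}x")  # 1.14x
```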

I don't know how attractive a quad-core is marketing-wise in this segment. I'd contend that as we migrate from desktops to laptops to nettops to tablets to cell phones, the average consumer gets less and less impressed by having a number such as this thrown in their face.

From a more technical/engineering standpoint, I'd like to question just how useful a quad-core CPU would be. It seems to me that not only do you have the multi-core utilization problems I'm familiar with, but on top of that you would add main memory (and possibly even cache, hi JohnH and metafor ;)) contention with the GPU, and on top of that you have pitiful main memory throughput compared to the ALU resources. So for the life of me, I just can't see utilization being very good. But then, I don't know the mobile application space all that well, am I missing something?
 
I assume the '75% lower power' (i.e. 4x performance per watt) is relative to the original 65nm Snapdragon, not the 45nm shrink.

As for PS3/XBox360-level performance... What's the probability that any handheld chip has equivalent performance to a chip with 24 TMUs at 550MHz before 14nm in 2015? Zero.
(although in practice RSX's utilisation isn't mind-blowing, it's more optimised towards perf/mm2 than perf/unit, so I suppose equivalent performance on an ultra-high-end tablet chip on 20nm isn't strictly impossible).

Does the last sentence describe your second thoughts? Before I answer your question whether there's going to be a chip with 24 TMUs@550MHz in half a decade, you might want to re-think how exactly TMUs are incorporated into G7x. I'm sure I can come up with far better ideas regarding texturing and/or TMUs since G80, and even more so since GF100. 5 years from now is a mighty long time and technical advancements in the embedded space are more than just rapid. Even worse, embedded CPU development isn't static either.
 
Not sure it directly fits with this conversation, but I have heard the CEO of IMG (Hossein Yassaie) say this year that the performance of handheld graphics cores would hit around 100x today's performance, at roughly the same power, within 5 years.
 
For the sake of posterity and history can we quantify that 100x performance level?

First let's pick a part for today's performance levels.
I suggest the PowerVR SGX540, which seems to be on par with the Adreno 205:

Current specs:

90 million *triangles/sec.

In 5 years' time we can expect 9 billion triangles/sec.

Awesome.

*These are vapo-triangles, not real triangles
 
I seriously doubt it'll be inflated geometry pipeline expansion. Likely it'll be more improvements in the memory system (using GMEM on-die, for instance) than anything else. And that could easily result (once we get as much on-die SRAM as some of today's desktop processors) in a 100x improvement.
 
Intriguing metafor, cheers (obviously not very specific so I can't make much out of it, but heh :))

Regarding PS3-level performance and 100x in 5 years: I'm willing to bet those are GFlops relative to the Apple A4. The SGX535 there has only two ALU pipelines with 4 flops each, so that's about 2GFlops. I think 200GFlops on a high-end tablet chip on 14nm is not out of the question given the increasing ALU ratio (doubles in SGX540, doubles further in SGX543MP, presumably increases further in next-gen).

A SGX543 4MP @ 400MHz would already have 25x as many flops, and that's perfectly realistic on 28HPM. Even if you didn't change the ALU ratio, you'd still get to 100x pretty easily on 14nm in 2H15. Of course, as metafor says, the memory system will need a pretty big boost to keep up. I think external memory is likely to improve more than some expect there - for tablet chips in that timeframe, we should be looking at 64-bit DDR4, which is nice. In fact... now that I look at these numbers, may I change my prediction? Probability that we reach PS3-level performance on 28nm: practically zero. Probability that we reach it on 20nm: reasonably high! (and yes, I know G7x efficiency per unit is pretty bad, although I suppose I was thinking of the case where the dev hand-optimised quite a bit for it. Also keep in mind G8x isn't magically better there; unit efficiency is much better, but perf/mm2 isn't, as can be seen via G71 vs G84 - it's probably a better idea to only bother comparing handheld chips to Xenos anyway).
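To spell out the flop arithmetic: the clocks are my guesses, and the per-pipe counts assume the commonly cited Vec4 MADD figures (4 flops/pipe for SGX535, 8 for SGX543), so treat this as paper numbers only.

```python
# Peak-GFlops arithmetic; clocks and per-pipe flop counts assumed
# as described above, so these are paper numbers only.

def gflops(pipes, flops_per_pipe, mhz, cores=1):
    return pipes * flops_per_pipe * mhz * 1e6 * cores / 1e9

a4 = gflops(pipes=2, flops_per_pipe=4, mhz=250)             # ~2.0
sgx543_4mp = gflops(pipes=4, flops_per_pipe=8, mhz=400, cores=4)

print(f"A4 (SGX535):       ~{a4:.1f} GFlops")
print(f"SGX543 4MP@400MHz: ~{sgx543_4mp:.1f} GFlops "
      f"(~{sgx543_4mp / a4:.1f}x)")                         # ~25x
print(f"100x target:       ~{100 * a4:.0f} GFlops")
```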
 
First let's pick a part for today's performance levels.
I suggest the PowerVR SGX540, which seems to be on par with the Adreno 205:
90 million *triangles/sec.


90M was a Samsung marketing figure; I think IMG would be more comfortable with 20-30M

EDIT
Sorry, just saw your "vapo" reference.
 
For the sake of posterity and history can we quantify that 100x performance level?

First let's pick a part for today's performance levels.
I suggest the PowerVR SGX540, which seems to be on par with the Adreno 205:

Well, even Qualcomm doesn't seem to place it there in its own graphs, but that's beside the point.

Current specs:

90 million *triangles/sec.

In 5 years' time we can expect 9 billion triangles/sec.

Awesome.

*These are vapo-triangles, not real triangles

The Samsung S5PC110 manual specifically mentions 20M Tris for SGX540, and I assume that's at 200MHz.

Besides, triangle rates are about as useful as on-paper FLOP rates for defining performance for any GPU.

But if we really have to speculate on theoretical triangle rates: MBX in its fastest incarnation (>230MHz) was capable of 7M Tris, just like the lowest-end SGX520, while the lowest-end MBX Lite of the OGL_ES 1.1 generation was somewhere around 700k Tris if memory serves. If that helps speculating where, more or less, the generation after Series5 could land regarding triangle rates, then fine.

If you now want to play silly marketing games: if I take, in theory, a 16MP @ 400MHz, that would equal 1120M Tris/s, which is 160x over the highest-end MBX and 1600x over MBX Lite. Of course it is possible on paper if Series6 goes multi-core in due time like Series5 did; and it's pretty irrelevant if any marketed figure is just the theoretical maximum of a multi-core configuration that never gets used in any device in the end.
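For transparency, here's where the 1120M comes from; the per-core baseline assumes SGX543's commonly quoted ~35M Tris/s at 200MHz, scaled purely linearly with clock and core count (paper math, as said).

```python
# Paper triangle-rate arithmetic; the 35M/200MHz per-core baseline
# is an assumed figure, and the scaling is purely linear.

PER_CORE_AT_200MHZ = 35e6  # Tris/s, assumed SGX543 baseline

def paper_tris(cores, mhz):
    return PER_CORE_AT_200MHZ * (mhz / 200.0) * cores

t = paper_tris(cores=16, mhz=400)
print(f"16MP @ 400MHz:      {t / 1e6:.0f}M Tris/s")   # 1120M
print(f"vs MBX (7M):        {t / 7e6:.0f}x")          # 160x
print(f"vs MBX Lite (0.7M): {t / 0.7e6:.0f}x")        # 1600x
```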
 
The whole point was that it was a silly marketing statement. 100x performance in 5 years is highly unlikely, unless some weird, non-realistic, and pointless metric was used as a comparison.

As to SGX 540 and Adreno 205 comparisons I was referring to:

http://androidandme.com/2010/10/news/3dmarkmobile-gpu-showdown-adreno-205-vs-powervr-sgx540/

Unfortunately, in this test it is nearly impossible to completely rule out the CPU and memory subsystem, which will impact scores, but the Adreno 205 does in all cases show itself as a competent performer in this particular benchmark.

This debate is far from settled. We can clearly see Qualcomm has come a long way from their first Snapdragons and made great progress with the Adreno 205 GPU. I was starting to worry that Qualcomm was in trouble with their Adreno GPU family, but it holds its own against PowerVR and now I’m pretty excited about the Adreno 220 GPU coming in future dual-core Snapdragons.

The biggest problem as ever is power consumption, and as alluded to by another member here, the move to large quantities of on-die memory in the future will certainly not hurt performance.

How has the PC graphics market improved in performance over that time? It may not be entirely relevant, but it gives some kind of indication of what is possible.

GeForce 7800 GTX or Radeon X850XT Platinum Edition
vs
GeForce GTX 580 or Radeon HD 5970.

PS I don't know where I got the 90 million triangles per second number from. Is there any place to easily check various specs apart from the press releases?
 
The whole point was that it was a silly marketing statement. 100x performance in 5 years is highly unlikely, unless some weird, non-realistic, and pointless metric was used as a comparison.

Look at it this way: chances are slim to none that someone will build, today or in the future, a 16MP SGX543/4. It won't happen because it's too large for a handheld, tablet or smartphone, and as manufacturing processes scale down, after a specific point their next generation will make more sense than that one. That doesn't mean that the IP doesn't exist or isn't technically feasible.

And yes, of course, with so much time between then and today, the real-time performance difference is going to be huge irrespective of any theoretical measurement; also, of course, because SGX has a multi-core variant and I don't expect either ARM or IMG to abandon the multi-core idea for their GPU IP.

As to SGX 540 and Adreno 205 comparisons I was referring to:

http://androidandme.com/2010/10/news/3dmarkmobile-gpu-showdown-adreno-205-vs-powervr-sgx540/

Unfortunately, in this test it is nearly impossible to completely rule out the CPU and memory subsystem, which will impact scores, but the Adreno 205 does in all cases show itself as a competent performer in this particular benchmark.
One benchmark out of many, but again beside the point. Let's move on.

The biggest problem as ever is power consumption, and as alluded to by another member here, the move to large quantities of on-die memory in the future will certainly not hurt performance.
Again, 5 years down the line is a mighty long time.

How has the PC graphics market improved in performance over that time? It may not be entirely relevant, but it gives some kind of indication of what is possible.

GeForce 7800 GTX or Radeon X850XT Platinum Edition
vs
GeForce GTX 580 or Radeon HD 5970.
And that's relevant how exactly? But since you're eager to work out a parallel example for that one, assume you would build today a super-GPU with 16 GF110 cores on it with a solution that would guarantee nearly linear scaling; how do you think that one would compare to a G70?

PS I don't know where I got the 90 million triangles per second number from. Is there any place to easily check various specs apart from the press releases?
Trust me, the manual from Samsung states 20M Tris for SGX540. It depends on the manufacturer, whether they list those kinds of specs and, if so, whether they're realistic enough to represent something as close as possible to reality.

Apart from upcoming SGX54x multi-core configs, there's not a single embedded GPU out there that can achieve anything close to a real 90M Tris/s. The Tegra2 GPU, if memory serves, is capable of 70M vertices/s.
 
And that's relevant how exactly? But since you're eager to work out a parallel example for that one, assume you would build today a super-GPU with 16 GF110 cores on it with a solution that would guarantee nearly linear scaling; how do you think that one would compare to a G70?

Relevant as it shows scaling in GPUs over 5 years. And since you did bite :p

I think you misunderstood my example. You would literally compare the high end 5 years ago to the high end now.

So that would be a quad-GPU GF110 system? I am not sure how to design a single system with 16 GF110s right now. By my rough calculations we still have not hit G70 x 100 speeds yet (and to be fair, by your metrics you should at least let me SLI my G70s, but even without SLI I think maybe we have approached 40x with 8x more power consumption and approx 6x die size).

Everything is based on power consumption, die size and what the market is willing to bear (in regards to cost).

In many cases we have not gone to 100x the raw theoretical power in 5 years in the discrete PC GPU field. I find it unlikely the mobile platforms will either, since they also face the same issues, but with differing priorities for their end consumer (us).

I tried to bring some reality into these theoretical figures. That is all, and looking further into it I still don't think the 100x claim is possible. We all have the laws of physics to contend with, after all.
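To put a number on that, using only the rough figures above: ~40x performance at ~8x the power is about 5x perf/W over 5 years of PC GPUs, while 100x at roughly constant power requires ~100x perf/W, i.e. around 20x the efficiency gain the PC space actually managed.

```python
# Perf/W compounding, using only the rough figures quoted above.

pc_perf, pc_power = 40, 8         # ~5 years of discrete PC GPUs
pc_eff = pc_perf / pc_power       # ~5x perf/W

claim_perf, claim_power = 100, 1  # "100x at the same power"
claim_eff = claim_perf / claim_power

print(f"PC GPUs: ~{pc_eff:.0f}x perf/W in 5 years")
print(f"Claim:   ~{claim_eff:.0f}x perf/W in 5 years "
      f"({claim_eff / pc_eff:.0f}x more efficiency gain needed)")
```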
 