An overview of Qualcomm's Snapdragon Roadmap

Since that Android and Me link has shown up here, I should point out that the build of 3DMarkMobile ES2 v1.0 that they used was buggy and that performance of all the phones is misrepresented; particularly so in the Epic 4G's case.

I can't really speak for Qualcomm as to what their actual performance is, but for us in the 4G, using the 4G's shipping drivers, performance should be significantly higher in the Hoverjet test (over 2x), and higher in Taiji too by a smaller amount.
 
Relevant as it shows scaling in GPUs over 5 years. And since you did bite :p

I think you misunderstood my example. You would literally compare the high end 5 years ago to the high end now.

That's what I actually did. I compared a high end embedded GPU of 5 years ago with the maximum possible today.

So that would be a quad GPU GF110 system? I am not sure how to design a single system with 16 GF110s right now.

There are light-years of difference between desktop GPUs and an embedded GPU block in an SoC. That should be clear, and it's the reason I asked about the relevance in the first place. But since you can today scale an SGX543/4 up to 16 cores in an SoC, I took your rather odd example and asked how it would look if you scaled 16 GF110 cores in a GPU cluster, because that's exactly where multi-core configs come into play in the embedded space.

By my rough calculations we still have not hit G70 x 100 speeds (and to be fair, by your metrics you should at least let me SLI my G70s; but even without SLI I think we have maybe approached 40x, with 8x more power consumption and approximately 6x the die size).

Problem being that neither IMG nor any other IHV had, in the first OGL-ES1.x generation, any cores fit for multi-core configs. Can we stay in the embedded space for a change, to keep track of things?

In raw theoretical power, we have in many cases not gone to 100x in 5 years in the discrete PC GPU field. I find it unlikely the mobile platforms will either, since they face the same issues, just with different priorities for their end consumers (us).

See above. And mark once more that I clearly pointed out that 16MP is the maximum configuration of the Series5 XT design.

I tried to bring some reality into these theoretical figures. That is all, and looking further into it I still don't think the 100x claim is possible. We all have the laws of physics to contend with after all.

Arun already made a few points about how someone could interpret that claim. He doesn't have to be spot on; he picked up the correct reasoning behind the marketing blurb. If someone told you that some super-duper ultra core config of the future would reach 6000 fps in Q3 at 1080p, then of course it would be a complete joke.

But if you're actually following the embedded market, you'll see that IMG, Qualcomm and ARM are potentially targeting GPGPU amongst other things, which means an even healthier boost in floating point performance with all of their next generations than today.

In fact Arun took a perfectly sensible example of a 4MP@400MHz. Now take a 16MP at the same frequency, and the floating point difference compared to an SGX540, if my math isn't screwed up, is over 140x. And I hope I won't have to repeat that the chances are very slim that we'll ever see an SGX543/4 16MP config.

Do you expect their next generation to sport the same floating point power per ALU as the SGX543/4? Obviously it will be quite a bit higher. Now try your speculative math again, for 5 years down the road and for <20nm.
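
For anyone who wants to sanity-check this kind of scaling math, here is a minimal back-of-envelope sketch. Every input is an assumption pulled from figures quoted in this thread (an SGX540 as a 4-ALU USSE1 part at an assumed 200MHz, USSE2 at roughly double the per-ALU FP32 rate, a hypothetical 16MP at 400MHz), not vendor-confirmed specs, and the resulting multiplier moves with whatever per-ALU throughput you assume.

Code:
# Back-of-envelope peak-FLOPS scaling for multi-core SGX configs.
# All inputs are assumptions based on figures quoted in this thread,
# not official vendor specs.

def peak_gflops(cores, alus_per_core, flops_per_alu_per_clock, clock_mhz):
    """Peak theoretical GFLOPS = cores * ALUs/core * flops/clock * MHz."""
    return cores * alus_per_core * flops_per_alu_per_clock * clock_mhz / 1000.0

# SGX540: 1 core, 4 USSE1 ALUs; assume 1 FP32 madd (2 flops) per ALU
# per clock, at an assumed 200MHz.
sgx540 = peak_gflops(cores=1, alus_per_core=4,
                     flops_per_alu_per_clock=2, clock_mhz=200)

# Hypothetical SGX543 16MP: 16 cores, 4 USSE2 ALUs each; assume the
# best-case 2 FP32 madds (4 flops) per ALU per clock, at 400MHz.
sgx543_16mp = peak_gflops(cores=16, alus_per_core=4,
                          flops_per_alu_per_clock=4, clock_mhz=400)

print(f"SGX540 (assumed):      {sgx540:.1f} GFLOPS")       # 1.6
print(f"SGX543 16MP (assumed): {sgx543_16mp:.1f} GFLOPS")  # 102.4
print(f"scaling factor:        {sgx543_16mp / sgx540:.0f}x")  # 64x

Under these particular assumptions the ratio comes out to 64x; more aggressive assumptions about USSE2's per-ALU advantage, or a lower baseline for the SGX540, are what push such estimates past 100x.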
 
Since that Android and Me link has shown up here, I should point out that the build of 3DMarkMobile ES2 v1.0 that they used was buggy and that performance of all the phones is misrepresented; particularly so in the Epic 4G's case.

I can't really speak for Qualcomm as to what their actual performance is, but for us in the 4G, using the 4G's shipping drivers, performance should be significantly higher in the Hoverjet test (over 2x), and higher in Taiji too by a smaller amount.

That does explain the anomaly. Thanks.
 
I think it is clear that great advances in graphics performance cannot come from the GPU core alone. Metafor pointed out advances in on-chip memory as a necessity, and Arun mentioned advances in main memory performance.

There is an interesting underlying question here about SoC designs, and how much of a say the GPU IP suppliers have in the overall design of a TI OMAP. There are balancing issues that do not look trivial, where GPU needs may be at odds with price/size/power-draw/et cetera concerns. And of course the priorities of the volume customers when it comes to their devices are another powerful influence. Compared to, say, AMD providing a complete graphics card, the graphics IP designers have much less of a say in the physical implementation of their designs. Extreme uses of the IP, even if possible, may well never see the light of day.
 
I found some performance figures from an old ImgTech press release regarding the performance of the SGX 543 - 35 million polygons/sec at 200MHz assuming a 2.5x depth complexity.

http://www.imgtec.com/News/Release/index.asp?NewsID=428

The first generation Adreno is claimed to perform up to around 22 million triangles/sec with a 133 megapixel/sec fill rate.

Second generation: up to 41 million triangles/sec and a fill rate of 245 megapixels/sec.

Third generation (dual CPUs): up to 88 million triangles/sec and a fill rate of up to 532 megapixels/sec.

http://www.qualcomm.com/products_services/chipsets/snapdragon.html

One thing that surprises me is how little performance information is available in the public domain for the SoCs that incorporate these GPU technologies. Alternatively, it could be argued there is too much information for PC CPUs and graphics chips.

Still there is a big gaping hole that could be filled... any takers?
 
Arun said:
A SGX543 4MP @ 400MHz would already have 25x as many flops, and that's perfectly realistic on 28HPM. Even if you didn't change the ALU ratio, you'd still get to 100x pretty easily on 14nm in 2H15. Of course, as metafor says, the memory system will need a pretty big boost to keep up. I think external memory is likely to improve better than some expect there - for tablet chips in that timeframe, we should be looking at 64-bit DDR4, which is nice.

In fact... now that I look at these numbers, may I change my prediction? Probability that we reach PS3-level performance on 28nm: practically zero. Probability that we reach it on 20nm: reasonably high! (And yes, I know G7x efficiency per unit is pretty bad, although I suppose I was thinking of the case where the dev hand-optimised quite a bit for it. Also keep in mind G8x isn't magically better there; unit efficiency is much better, but perf/mm2 isn't, as can be seen via G71 vs G84 - it's probably a better idea to only bother comparing handheld chips to Xenos anyway.)
Wow totally missed this paragraph even after Ailuros pointed towards it.
 
I found some performance figures from an old ImgTech press release regarding the performance of the SGX 543 - 35 million polygons/sec at 200MHz assuming a 2.5x depth complexity.

http://www.imgtec.com/News/Release/index.asp?NewsID=428

The first generation Adreno is claimed to perform up to around 22 million triangles/sec with a 133 megapixel/sec fill rate.

Second generation: up to 41 million triangles/sec and a fill rate of 245 megapixels/sec.

Third generation (dual CPUs): up to 88 million triangles/sec and a fill rate of up to 532 megapixels/sec.

http://www.qualcomm.com/products_services/chipsets/snapdragon.html

One thing that surprises me is how little performance information is available in the public domain for the SoCs that incorporate these GPU technologies. Alternatively, it could be argued there is too much information for PC CPUs and graphics chips.

Still there is a big gaping hole that could be filled... any takers?

I don't know how reliable those numbers are, even at face value. If you look at the numbers for the Adreno 205 vs the 200, it's almost a direct scaling with the 205's higher clockspeed.

But there were some micro-architectural changes as well, which aren't reflected.
 
I found some performance figures from an old ImgTech press release regarding the performance of the SGX 543 - 35 million polygons/sec at 200MHz assuming a 2.5x depth complexity.

Claimed poly rates are IMO not a function of depth complexity; that factor matters for fillrate. Each 543 clocked at 200MHz has a fill-rate of 400 MPixels/s * 2.5x overdraw = 1000 MPixels/s effective fill-rate. 4 (USSE2) ALUs, 2 TMUs, 16 z/stencil units.
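
As a quick illustration of that effective fill-rate arithmetic, a trivial sketch; the TMU count and clock are the per-core figures quoted above, and the 2.5x overdraw is the press release's assumption:

Code:
# Effective fill-rate of a single SGX543 core, per the figures above.
# A TBDR is credited with an "effective" multiplier equal to the assumed
# scene overdraw, since occluded pixels are never textured or shaded.

tmus = 2           # TMUs per SGX543 core
clock_mhz = 200    # clock assumed in the press release figures
overdraw = 2.5     # assumed average depth complexity

raw_fillrate = tmus * clock_mhz       # 400 MPixels/s
effective = raw_fillrate * overdraw   # 1000 MPixels/s

print(f"raw: {raw_fillrate} MPixels/s, effective: {effective:.0f} MPixels/s")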

SGX535 is a totally different chapter. It consists of 2 USSE1 ALUs, 2 TMUs and 8 z/stencil units. The USSE2 ALUs, found only in Series5XT (SGX543/544), have over twice the floating point throughput per ALU.

USSE1/SGX520-545 per ALU:

1 FP32 scalar or
2 FP16 (Vec2) or
4 INT8 (Vec3 or 4)

USSE2 > 2*USSE1 in throughput, and that's still Series5XT.


The first generation Adreno is claimed to perform up to around 22 million triangles/sec with a 133 megapixel/sec fill rate.

Second generation: up to 41 million triangles/sec and a fill rate of 245 megapixels/sec.

Third generation (dual CPUs): up to 88 million triangles/sec and a fill rate of up to 532 megapixels/sec.

http://www.qualcomm.com/products_services/chipsets/snapdragon.html

One thing that surprises me is how little performance information is available in the public domain for the SoCs that incorporate these GPU technologies. Alternatively, it could be argued there is too much information for PC CPUs and graphics chips.

Still there is a big gaping hole that could be filled... any takers?
See metafor's reply for that. Qualcomm itself sets the Adreno 2xx generation roughly on par with the iPhone 3GS, which contains an SGX535@200MHz (not sure if the frequency is correct). The 540 is a step higher since it contains the same type of ALUs as the 535 but twice as many (4 instead of 2).

I think it is clear that great advances in graphics performance cannot come from the GPU core alone. Metafor pointed out advances in on-chip memory as a necessity, and Arun mentioned advances in main memory performance.

There is an interesting underlying question here about SoC designs, and how much of a say the GPU IP suppliers have in the overall design of a TI OMAP. There are balancing issues that do not look trivial, where GPU needs may be at odds with price/size/power-draw/et cetera concerns. And of course the priorities of the volume customers when it comes to their devices are another powerful influence. Compared to, say, AMD providing a complete graphics card, the graphics IP designers have much less of a say in the physical implementation of their designs. Extreme uses of the IP, even if possible, may well never see the light of day.

I fully agree.

Tahir2,

Read the following, up until the end of the "wait what we're working on" paragraph, here: http://pvrinsider.imgtec.com/

snip:

But we have just gotten started. The next-next-next generation graphics technologies we are working on at any point in time will be around 5-6 years away from shipping consumer products. Knowing how powerful the next POWERVR graphics technologies will be, we can confidently say that you haven’t seen anything yet! Very soon, we’ll see devices with our multi-core SGX XT, which can scale to almost any level of performance needed. All in the palm of your hand.
Just to help the entire perspective.
 
Claimed poly rates are IMO not a function of depth complexity; that factor matters for fillrate. Each 543 clocked at 200MHz has a fill-rate of 400 MPixels/s * 2.5x overdraw = 1000 MPixels/s effective fill-rate. 4 (USSE2) ALUs, 2 TMUs, 16 z/stencil units.

The wording is taken from ImgTech's press release. I realise that depth complexity figures have been used to calculate best-case fillrate advantages for PVR's architecture all the way back to before the Kyro.

http://www.imgtec.com/News/Release/index.asp?NewsID=428

It is in there, and the indication is that depth complexity helps arrive at the polygons/sec figures.

Thanks for the heads up will read the rest of the post and links a little later.
 
It is in there, and the indication is that depth complexity helps arrive at the polygons/sec figures.

Depth complexity has no bearing on quoted polygon throughput; it only affects fill rate.

Small correction to Ailuros's ALU throughput quote,

USSE1/SGX520-545 per ALU:

1 FP32 scalar min, 2x F32 max, or
2 FP16 (Vec2) or
4 INT8 (Vec3 or 4)

John.
 
Isn't it really fixed point 10-bit vec3/vec4? 1 bit sign, 1 bit whole, 8 bits fractional.

Hmm, yes, a couple of data types were missing there!

USSE1/SGX520-545 per ALU:

1 FP32 scalar min, 2x F32 max, or
2 FP16 (Vec2) or
2 INT16 (Vec2) or
4 ES2.0 Lowp (Vec3 or 4) or
4 INT8 (Vec3 or 4)


John.
 
Thanks JohnH. It's cool that you get 2x int16 and not just 1x via doctored fp32s.

Does 2x FP32 mean 1 fmadd, or something more? Or if fmadd counts as one op, can you do 2 on FP16 per clock? If you can say, of course.
 
Thanks JohnH. It's cool that you get 2x int16 and not just 1x via doctored fp32s.

Does 2x FP32 mean 1 fmadd, or something more? Or if fmadd counts as one op, can you do 2 on FP16 per clock? If you can say, of course.

It's 2x F32 fmadd; however, because of data path width constraints, getting to two requires some commonality between the inputs to each, for example when multiplying a vector by a matrix, or a vector by a scalar, etc. F16 and INT16 don't have the data path width constraint, so they can always do 2x madd. For Lowp and INT8 it's 4x full sum of products. All are per pipe per clock.

John.
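
To collect JohnH's per-pipe, per-clock rates in one place, a small sketch follows. The clock is an arbitrary illustrative value, and the FP32 dual-issue entry is the best case that, as he notes, requires input commonality between the two madds:

Code:
# Peak madds per USSE1 pipe per clock, per JohnH's description above.
# The lowp/INT8 case is a 4-wide sum of products, counted here as
# 4 madd-equivalents for this rough tally.

MADDS_PER_CLOCK = {
    "FP32 (best case)": 2,  # needs shared inputs, e.g. vector * matrix
    "FP32 (minimum)":   1,
    "FP16 (Vec2)":      2,  # no data path width constraint
    "INT16 (Vec2)":     2,  # no data path width constraint
    "lowp / INT8":      4,  # 4x full sum of products
}

clock_mhz = 200  # illustrative clock, not tied to a specific part

for dtype, madds in MADDS_PER_CLOCK.items():
    gops = madds * 2 * clock_mhz / 1000.0  # one madd = 2 ops
    print(f"{dtype:18} {madds} madds/clock -> {gops:.1f} Gops/s per pipe")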
 
According to this article, Qualcomm's MSM8x60 is manufactured at 28nm and not 45nm as we thought until now.
Do you think that's possible, considering we should see the first devices running on this chip in a few months? Did Qualcomm outrun the competition?
 
There is absolutely no way Qualcomm started sampling a 28nm chip in June 2010. It is not at all credible. I could imagine them eventually releasing a 28nm shrink of this 40nm chip though (ala MSM7200A), who knows...
 
There is absolutely no way Qualcomm started sampling a 28nm chip in June 2010. It is not at all credible. I could imagine them eventually releasing a 28nm shrink of this 40nm chip though (ala MSM7200A), who knows...

That's what I thought.
Just wanted to get some confirmation from people who know more than I do :smile:

Question for you, Arun: at CES Qualcomm showed this MSM8x60 streaming 1080p 3D video through an HDMI cable to a TV. It's not something they talked about earlier, so do you think there could have been some modifications to the chipset?
 
I think the first 28nm chip from Qualcomm is supposed to be the MSM8960. It is slated to sample sometime in 2011. It will have their next-gen Snapdragon CPU and likely the Adreno 300 GPU. My guess is we'll hear more specs at MWC.

Unfortunately, Qualcomm's roadmap has become so complicated that quite a few blogs confuse the MSM8260/MSM8660 and the MSM8960. Hell, their own CEO confused a couple of their dual-core chips in September.

http://www.intomobile.com/2010/09/0...on-processor-will-arrive-in-q3-4-2011-not-q1/
 
Question for you, Arun: at CES Qualcomm showed this MSM8x60 streaming 1080p 3D video through an HDMI cable to a TV. It's not something they talked about earlier, so do you think there could have been some modifications to the chipset?
I know that at one point Qualcomm was going to use AMD IP for 1080p encode/decode, but I don't know if they did in the end. That IP was based on Tensilica Xtensa, so it should be quite flexible - it should probably be able to reuse the same resources to do 3D video at a lower bitrate without changes (or just a moderately higher clock). Of course maybe that's not the IP they use, in which case either they made some changes or it's also fairly flexible...

Unfortunately, Qualcomm's roadmap has become so complicated
They are indeed very good at coming up with ridiculously complex roadmaps. I tried to figure out Qualcomm's RF roadmap based on a few presentations - it's arguably even more complicated than their chipset roadmap! :) I'd post it but I doubt anyone would care, heh.
 