Qualcomm Krait & MSM8960 @ AnandTech

Hypothetically some sort of "bridge" core or any sort of additional scheduling logic would be an idea.



Exactly why I'm asking myself whether they're just using 2 Mali400MP4 blocks.

Do you know if this setup would still be shown as Mali-400 in the specification string for Taiji?

Because when NHW dug up the string for this mystery GPU it was this:

Samsung Galaxy Note 10.1
"android": {
"hardware": "smdk4x12",
"manufacturer": "samsung",
"device": "p4noterf",
"brand": "samsung",
"display": "IMM76D.N8000XXALD6",
"version_sdk": "4.0.4",
"board": "smdk4x12",
"version_code": "1"
GPU Vendor: ARM

While the SIII still showed Mali-400:

Samsung Galaxy S III
"android": {
"hardware": "smdk4x12",
"manufacturer": "samsung",
"device": "m0",
"brand": "samsung",
"display": "IMM76D.I9300XWALE1",
"version_sdk": "4.0.4",
"board": "smdk4x12",
"version_code": "1"
GPU Vendor: ARM MALI-400MP
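
To make the comparison between the two dumps explicit, here is a minimal Python sketch that diffs them field by field. The values are copied from the two listings above; the flat dictionary layout and the `gpu_vendor` key name are assumptions made for illustration, not the actual format of wherever the strings came from.

```python
# Fields copied from the two dumps above. The flat dict layout and the
# "gpu_vendor" key are assumptions made for this sketch.
note_10_1 = {
    "hardware": "smdk4x12",
    "manufacturer": "samsung",
    "device": "p4noterf",
    "brand": "samsung",
    "display": "IMM76D.N8000XXALD6",
    "version_sdk": "4.0.4",
    "board": "smdk4x12",
    "version_code": "1",
    "gpu_vendor": "ARM",
}
galaxy_s3 = {
    "hardware": "smdk4x12",
    "manufacturer": "samsung",
    "device": "m0",
    "brand": "samsung",
    "display": "IMM76D.I9300XWALE1",
    "version_sdk": "4.0.4",
    "board": "smdk4x12",
    "version_code": "1",
    "gpu_vendor": "ARM MALI-400MP",
}

def diff_fields(a, b):
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

for key, (note, s3) in diff_fields(note_10_1, galaxy_s3).items():
    print(f"{key}: Note 10.1={note!r}  SIII={s3!r}")
```

Only the device name, the build string and the GPU vendor string differ; the board and hardware fields are identical on both.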
 
Adding the scheduling to support cores five through eight and rebalancing for performance (including the expected 2x vertex performance, though previous Mali cores showed that being conservative there wasn't much of a bottleneck) would make Mali-450 essentially 2x Mali-400MP4 at its highest configuration.

Mali cores are traditionally well optimized for die area, so I don't see that as any kind of obstacle at 32 nm. The challenge will of course be delivering compelling performance for the amount of power consumed (and maybe heat generated) versus competing architectures of the time.
 
It could also be 28nm and not necessarily 32nm. Part of the issue is already gone under 28nm, and it might still be smaller (yet lower-performing) than, let's say, an SGX54xMP6 or higher.
 
[Image: 48696.png]


Pretty impressive, but it still trails the MP4. While the performance is impressive it is kind of disappointing that a brand new GPU can't beat something that is almost 5 years old. I can't wait to see what the T604 does.
 
Keep in mind that, amongst other possible differences, the SGX543MP4@250MHz has quite a fillrate and z/stencil fillrate advantage compared to the Adreno320. The former has 8 TMUs and 64 z/stencil units.

Egypt (offscreen/OGL_ES2.0) is obviously more shader heavy than PRO (offscreen), but if you look at the PRO (offscreen/OGL_ES1.1) results in Anand's link, it might be the fillrate differences that to some extent explain the gap.

Adreno320 has 4 TMUs afaik, probably meaning 1 TMU/cluster, which I figure should be also the same for coming generation Mali T6xx GPU IP. IMG Rogue on the other hand sounds suspiciously like 2 TMUs/cluster.

***edit: by the way it seems there's something finally moving for GLBenchmark2.5 results at Kishonti. No results yet published but it shouldn't take too long, since they just added a tab for it.
 
Yeah, I think they are impressive numbers... I would think that it also consumes a bit less power?

The Direct3D 11 feature level 9_3 part was also an interesting read, and it still has me confused, to be honest.

But for me the best bit was the revelation that we would be seeing a Krait v3... from what I have read it's not just a speed bump... so what could it be?

Edit: version 2.5 is already used in GLBenchmark Pro, I think?
 
Having an architectural license means you can continually iterate the uarch.
 
http://www.glbenchmark.com/phonedetails.jsp?D=LG+E971&benchmark=glpro25

Frankly I expected more considering how close the 320 came to the 543MP4 in GLBenchmark2.1. Those are early results, so let's see if other devices and/or driver optimisations will help.

Disappointing... I was expecting 50+ fps... but as Arun mentioned, the Adreno uarch is likely very sensitive to drivers, and with this being a totally new uarch that's probably still the case.

The iPad 3 does carry much more bandwidth... maybe some decent drivers and LPDDR3 would push it above the A5X.
 
It should be noted that compiler/driver improvements could appear from all GPU IHVs sooner or later; there's no such thing as an architecture ever being "maxed out" in every scenario.

In any case, if you look at the 2.1 results there's quite a difference between the crop of Adreno320 devices. Granted, the LG E971 from the 2.5 results isn't there, but the LG E970 scores 12922 frames in 2.1 while the SKY IM-850 is at 15194 frames. Lord knows how each manufacturer sets frequencies and what kind of bandwidth each device has.
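
For a sense of scale, the spread between those two 2.1 scores can be worked out directly. A minimal Python sketch using only the two frame counts quoted above (the helper name is made up for illustration):

```python
# Frame counts quoted above for two Adreno 320 devices in GLBenchmark 2.1.
scores = {"LG E970": 12922, "SKY IM-850": 15194}

def relative_gap(low, high):
    """Return how much higher `high` is than `low`, as a fraction of `low`."""
    return (high - low) / low

gap = relative_gap(scores["LG E970"], scores["SKY IM-850"])
print(f"SKY IM-850 is {gap:.1%} ahead of the LG E970")
```

That works out to a gap of roughly 17.6% between two devices with nominally the same GPU, which is the kind of spread clocks, bandwidth and drivers could plausibly account for.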
 
Yea I agree.

Right, this new version of an old Basemark OpenGL ES 2.0 benchmark has been up on the Play Store...

As more complex games get built, we are starting to see the real benefit of unified shaders... 1 month ago the Mali 400 was the king of the hill in smartphone graphics, by quite a margin.

However, if you look at the 3 most complex benchmarks (Taiji, GLBenchmark 2.5 and this GUI benchmark) you see the real power of the Adreno 225, which smacks the Exynos around in all three.
http://www.anandtech.com/show/6150/rightware-launches-basemark-gui-free-on-android-market

Seriously, Adreno 320 with LPDDR3 and decent drivers SHOULD be more than good enough for future mobile gaming.
 
Well it's noted for Basemark GUI that the benchmark is geometry focused so the Mali-400 is vertex bound.
 
Seriously, Adreno 320 with LPDDR3 and decent drivers SHOULD be more than good enough for future mobile gaming.

"Good enough" is always relative compared to what any of the competition might come up with in the foreseeable future. Since you also mention the Mali400MP4 above: the 32nm SoC has the GPU clocked at 440MHz and, from memory, that's for 4 Vec4 PS ALUs, 1 Vec2 VS ALU and 4 TMUs. The upcoming T604 will also have 4 TMUs clocked at 500MHz, with probably a similar number of SIMDs as Adreno320 (whereby on T604 some SPs are probably capable of more than just 1 MADD/2 FLOPs), and that's merely one example.

Granted, Qualcomm may over time increase units and/or frequency (or both); however, from the looks of it they still probably need to work on the sw side of things. Competition has heated up quite a bit and it's not going to ease anytime soon, rather the contrary. The real big advantage Qualcomm has right now is their execution, both for new generation CPUs and for new generation GPUs in their SoCs. Give or take, they're about a year ahead of the competition.
 
Also Qualcomm has an integrated LTE baseband AND a node advantage, even if the latter carries with it poor production quantity as of now.

What frequency do you expect the Adreno 320 was clocked at?

The Mali T604 might well be a dark horse... but I think everyone expects Rogue to wipe the floor with the competition.
 
I assume Adreno320 is clocked at 400MHz.
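
Putting the TMU counts and clocks mentioned in this thread together gives a rough theoretical texel rate per GPU. A quick Python sketch, assuming one texel per TMU per clock (a simplification on my part) and taking the 400MHz figure above as a guess; the T604 numbers are the expected ones quoted earlier, not measurements.

```python
# (TMUs, clock in MHz) as quoted in this thread. The Adreno 320 clock is the
# 400MHz assumption above and its TMU count is an "afaik" figure.
gpus = {
    "SGX543MP4": (8, 250),
    "Mali-400MP4": (4, 440),
    "Adreno 320": (4, 400),
    "Mali-T604": (4, 500),
}

def peak_mtexels(tmus, mhz, texels_per_tmu_per_clock=1):
    """Theoretical peak texel rate in MTexels/s (assumes 1 texel/TMU/clock)."""
    return tmus * mhz * texels_per_tmu_per_clock

for name, (tmus, mhz) in gpus.items():
    print(f"{name:12s} {peak_mtexels(tmus, mhz):5d} MTexels/s")
```

Under those assumptions the Adreno 320 lands at 1600 MTexels/s against 2000 for the SGX543MP4, which fits the fillrate gap suggested earlier, though real throughput will also depend on bandwidth and drivers.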

The Mali T604 might well be a dark horse... but I think everyone expects Rogue to wipe the floor with the competition.

Since T6xx has native FP64 support, it most likely cost them quite a bit in terms of die area. I'd expect the Adreno320 to be quite a bit smaller than T604, since there's no sign of FP64 on the former. As for Rogue, we probably still have some time ahead to get excited about those.
 
I wonder how much die area a really low performance FP64 implementation takes.. is it necessarily significant? Especially for a device that can otherwise do a few dozen SP operations per cycle.
 
Power is a bigger problem with higher precision computation for modern < 2W SoCs.
 
But if the FP64 datapath is separate and sufficiently low performance, nominal power consumption shouldn't be that much higher. Particularly when we're talking about a ~500MHz operating frequency.
 
How much overhead is it to have (possibly just some) FP32 units which can sequence FP64 over four or more cycles? The MADD units would have to be a little wider (AFAIK 26 bits instead of 23), the normalization would have to be wider and the additions staggered, but I don't think all the data paths need to be extended that much.
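
As a rough illustration of the sequencing idea, the Python sketch below multiplies two FP64 significands using four narrower partial products, one per "cycle", and checks the result against a direct multiply. The 27/26-bit split is an assumption of this sketch (the FP64 significand is 53 bits once the implicit leading one is counted); it models only the integer significand product, not exponents, rounding or normalization, and says nothing about how any real GPU does it.

```python
import math

P = 53          # FP64 significand bits, including the implicit leading 1
LO_BITS = 27    # assumed split: low 27 bits, high 26 bits
LO_MASK = (1 << LO_BITS) - 1

def significand(x):
    """Return the 53-bit integer significand of a normal, nonzero float."""
    m, _ = math.frexp(abs(x))      # 0.5 <= m < 1
    return int(m * (1 << P))       # exact: m carries at most 53 bits

def multiply_in_four_passes(a, b):
    """Multiply two 53-bit significands with four narrow partial products."""
    a_hi, a_lo = a >> LO_BITS, a & LO_MASK
    b_hi, b_lo = b >> LO_BITS, b & LO_MASK
    acc = 0
    # Each tuple is one pass through a narrow (at most 27x27-bit) multiplier.
    for x, y, shift in (
        (a_lo, b_lo, 0),
        (a_lo, b_hi, LO_BITS),
        (a_hi, b_lo, LO_BITS),
        (a_hi, b_hi, 2 * LO_BITS),
    ):
        acc += (x * y) << shift    # staggered addition of partial products
    return acc

a, b = significand(math.pi), significand(math.e)
assert multiply_in_four_passes(a, b) == a * b
```

The multiplier itself only grows by a few bits per operand in this scheme; the cost moves into the wider accumulator and the extra passes.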
 
The biggest problem (and often the critical path) is alignment for FP64 MADD. Since FP64 has a much wider range of exponents compared to FP32, the initial exponent comparator would have to be widened, and the shift would have to be done over multiple cycles with the intermediates stored someplace. The partial product would be 4x wider, meaning the shift value would have to be 4x wider.

This is all assuming fused MADD. Of course, if they're only implementing MUL and ADD and perhaps some sort of chained MADD, it gets easier but that presents its own problems of both latency and area used for a dedicated adder. But even then, the shifter for alignment would have to cover 2x the range.
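
To make the alignment step concrete for the separate MUL/ADD case, here is a minimal Python sketch of how far the smaller operand of an addition has to be shifted, alongside the IEEE 754 field widths that bound the comparator and shifter. The helper name is hypothetical, and taking the significand width as the useful shift range of a plain adder is a simplification (anything shifted further only affects rounding); this is not a model of a real fused datapath.

```python
import math

# IEEE 754 field widths: (exponent bits, significand bits incl. implicit 1).
FORMATS = {"FP32": (8, 24), "FP64": (11, 53)}

def alignment_shift(x, y):
    """Exponent difference between two nonzero floats, i.e. how many bit
    positions the smaller operand's significand must be shifted right
    before the two significands can be added."""
    _, ex = math.frexp(x)
    _, ey = math.frexp(y)
    return abs(ex - ey)

print("shift for 1.0 + 2**-40:", alignment_shift(1.0, 2.0 ** -40))

for name, (exp_bits, sig_bits) in FORMATS.items():
    print(f"{name}: {exp_bits}-bit exponent compare, "
          f"~{sig_bits}-position alignment shift")

# Simplifying assumption: a plain adder's shifter spans about one significand.
ratio = FORMATS["FP64"][1] / FORMATS["FP32"][1]
print(f"FP64/FP32 shift range ratio: {ratio:.2f}")
```

With 53 versus 24 significand bits the ratio comes out at a bit over 2, in line with the "2x the range" figure for a dedicated adder.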
 