Can iPad Pro out-game an XB360? *spawn

Is this still up-to-date? Because it seems to suggest that the Shield Tablet is faster, for instance:
http://www.pcworld.com/article/3006...he-ipad-pro-really-isnt-as-fast-a-laptop.html

Uhmm, you mean the Shield Android TV, which isn't exactly a mobile battery-powered device? And in one GPU synthetic only (3DMark)? I expected to see a list of mobile games where the reviewer would have compared the devices against each other....

But for the more important part, when you look at a device that really is battery powered, from the link above:

I tried to heat up the A9X to check for performance throttling by repeatedly running 3DMark and simply gave up. It just does not get hot. I can’t say the same for the admittedly smaller (and harder-to-cool) Google Nexus 9, which gets hot just browsing the web. Watching the same 4K movie file (that I couldn’t actually edit) on the iPad Pro was buttery-smooth even playing in a background window. It’s a very impressive tablet.
 
Only high contrast areas (such as edges) need native (300+ DPI) rendering. It's stupid to brute-force render all pixels at equal quality on retina resolutions. In the future we are going to see smarter techniques.

A good example of variable rate shading (using a custom ordered grid MSAA pattern): http://www.pmavridis.com/research/coarse_shading/
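As a toy sketch of the idea (mine, not the paper's actual algorithm - the tile size, contrast metric and threshold are all made up for illustration): estimate local contrast per tile and shade low-contrast tiles at a reduced rate.

```python
# Toy illustration: derive a per-tile shading rate from local luminance
# contrast, so only high-contrast tiles (edges) get full-rate shading.
import numpy as np

def shading_rate_map(luma, tile=8, threshold=0.1):
    """1.0 = shade every pixel; 0.25 = one sample per 2x2 quad."""
    h, w = luma.shape
    rates = np.empty((h // tile, w // tile))
    for ty in range(h // tile):
        for tx in range(w // tile):
            block = luma[ty*tile:(ty+1)*tile, tx*tile:(tx+1)*tile]
            contrast = block.max() - block.min()  # crude edge detector
            rates[ty, tx] = 1.0 if contrast > threshold else 0.25
    return rates

luma = np.random.rand(64, 64)  # stand-in for a rendered luminance buffer
print(shading_rate_map(luma).mean())
```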
Are we still going to use RGB, or are people trying YUV-type buffers more? We were talking YUV at the beginning of the PS3 era - I'm sure we all remember nAo16, or whatever that format discussed for Heavenly Sword was called. YUV/HSV would better support different resolutions, it seems to me.
 
Are we still going to use RGB, or are people trying YUV-type buffers more? We were talking YUV at the beginning of the PS3 era - I'm sure we all remember nAo16, or whatever that format discussed for Heavenly Sword was called. YUV/HSV would better support different resolutions, it seems to me.
Crytek used this on last gen consoles. Definitely worth trying also on mobiles, as bandwidth is the biggest limitation.

http://graphics.cs.aueb.gr/graphics/docs/papers/YcoCgFrameBuffer.pdf
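The transform itself is trivial. A minimal sketch of the standard RGB <-> YCoCg conversion; the bandwidth win in the paper comes from storing full-resolution Y plus checkerboarded half-resolution Co/Cg, i.e. two channels per pixel instead of three:

```python
# Standard RGB <-> YCoCg transform; the inverse is exact.
def rgb_to_ycocg(r, g, b):
    y  =  0.25 * r + 0.5 * g + 0.25 * b
    co =  0.50 * r           - 0.50 * b
    cg = -0.25 * r + 0.5 * g - 0.25 * b
    return y, co, cg

def ycocg_to_rgb(y, co, cg):
    return y + co - cg, y + cg, y - co - cg

print(ycocg_to_rgb(*rgb_to_ycocg(0.2, 0.7, 0.4)))  # -> (0.2, 0.7, 0.4)
```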
 
Who's not using HDR these days? The LogLuv colour space seems a smart change. IIRC we even discussed GPUs supporting this in hardware. I'm guessing shaders now make that unnecessary, unless the ROPs are still hard-coded RGB and can't work differently.
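As a rough illustration of the idea: chroma stays linear while luminance is stored on a log2 scale, so even an 8888 target can hold HDR. A sketch of the luminance half, assuming a [-64, 64] log2 range as in the common shader variant (a real implementation also splits the result across two 8-bit channels):

```python
import math

def encode_log_luma(Y, lo=-64.0, hi=64.0):
    """Map luminance Y > 0 into [0, 1] on a log2 scale."""
    l = max(lo, min(hi, math.log2(Y)))
    return (l - lo) / (hi - lo)

def decode_log_luma(e, lo=-64.0, hi=64.0):
    return 2.0 ** (e * (hi - lo) + lo)

Y = 1234.5  # an HDR luminance value well outside [0, 1]
print(decode_log_luma(encode_log_luma(Y)))  # exact here; storage quantizes it
```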
 
Is this still up-to-date? Because it seems to suggest that the Shield Tablet is faster, for instance:
http://www.pcworld.com/article/3006...he-ipad-pro-really-isnt-as-fast-a-laptop.html

The SHIELD Tablet is slower, but it also uses a two-year-old SoC made on 28nm.
Regardless, AnandTech's iPad Pro review shows a completely different scenario in GFXBench, where the A9X outmatches a 15W Core i5.
However, as stated earlier, GFXBench is using OpenGL and lower-precision shaders.
It's possible that Intel's OpenGL drivers just suck and that the lower-precision shaders actually make a substantial difference, so GFXBench's results aren't representative of how the A9X would behave compared to Intel's HD 515/520 if it had to deal with actual "PC grade" games. For example, the fact that it's a non-threaded dual core could be quite the issue in modern games.
 
I'm still waiting for a plausible answer as to why it doesn't make an inch of a difference in performance in GFXBench graphics tests on Rogue GPUs where FP16 SPs are completely absent. The article above aims to check whether the Pro is as fast as a laptop, and it obviously isn't. Not that the comparison makes any particular sense either, since a laptop and a tablet have completely different power envelopes.
 
Is this still up-to-date? Because it seems to suggest that the Shield Tablet is faster, for instance:
http://www.pcworld.com/article/3006...he-ipad-pro-really-isnt-as-fast-a-laptop.html
Hmm, I see the Shield Tablet (Tegra K1) behind the iPad. The Shield TV (Tegra X1), though, is faster.

Ice Storm physics is a different beast, since it's mostly a CPU test, and one where the Apple cores have been doing somewhat badly for a few generations.

EDIT - Missed the third page, sorry :oops:
 
I'm still waiting for a plausible answer as to why it doesn't make an inch of a difference in performance in GFXBench graphics tests on Rogue GPUs where FP16 SPs are completely absent.
What are you talking about? What SoCs are you comparing?

Rogue GPUs without FP16 SPs? Is there such a thing? AFAIK the only difference between Series6 and 6XT is that 6 used 3-way FP16 units in the same number as the FP32 units, whereas 6XT used 2-way FP16 units at twice the FP32 unit count. The theoretical FP16 output is just 33% more.

[PowerVR Series6 / Series6XT ALU architecture diagrams]
 
Thank you for the funky diagrams, as if I wouldn't know what each consists of, LOL. I have both devices here with me: one has a G6200, which DOESN'T have any FP16 SPs, and the other a G6230, which does. With one exception, I verified the supplied results here in real time before I included them above.

Again for reference:

Allwinner A80, G6230 @ 533MHz (64 FP32 SPs, 96 FP16 SPs)
T-Rex offscreen: 20.60 fps
Manhattan 3.0: 8.60 fps
Manhattan 3.1: 3.90 fps

Mediatek Helio X10T, G6200 @ 700MHz (64 FP32 SPs)
T-Rex offscreen: 27.10 fps
Manhattan 3.0: 10.20 fps
Manhattan 3.1: 4.90 fps

The first Rogue batch (Series6) came either without FP16 SPs (6200, 6400) or with FP16 SPs (6230, 6430). Mediatek has been using the G6200 for two SoC generations now; the latest is in the Helio X10, clocked at 700MHz.

https://imgtec.com/powervr/graphics/series6/

The table at the bottom of the page clarifies exactly what each variant contains.

AFAIK the only difference between Series6 and 6XT is that 6 used 3-way FP16 units in the same number as the FP32 units, whereas 6XT used 2-way FP16 units at twice the FP32 unit count. The theoretical FP16 output is just 33% more.

Series6 (6230, 6430, 6630) => FP16 SPs: 1.5x the FP32 SP count
Series6XT & Series7XT => FP16 SPs: 2x the FP32 SP count
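Which makes the "just 33% more" figure from above simply the ratio of the two:

```python
# FP16-to-FP32 SP ratios for the two Rogue generations discussed here.
series6_ratio = 96 / 64   # e.g. G6230: 96 FP16 SPs vs 64 FP32 SPs -> 1.5
series6xt_ratio = 2.0     # 6XT/7XT: twice as many FP16 as FP32 SPs

# 6XT's theoretical FP16 throughput vs Series6's, at equal clocks:
print(series6xt_ratio / series6_ratio)  # 1.33 -> "just 33% more"
```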

FP16 output is workload dependent; if you go for something like deep learning, for instance, you're most certainly not stuck at 33% more output. As I said, the SIMDs as you see them in the former marketing diagrams can be fed either FP32 or FP16 instructions, yet not a mix of both at the same time. I would think that they share datapaths, as otherwise at least some of them could be used in parallel.

Those Rogues that have dedicated FP16 units IMHO mostly save power in a 3D game compared to channelling everything through the FP32 SPs. No one would use FP16 in a game or benchmark where FP32 is recommended, since the difference would show.

To come back to the above results: a 6230 @ 533MHz with 102 GFLOPs FP16, vs. the 90 GFLOPs FP32 of the 6200 @ 700MHz, should be at least close in performance in benchmarks that supposedly use FP16 extensively. Instead, their performance difference tracks the respective FP32 FLOP difference between the two too closely to suggest anything outside the norm you'd find in any real mobile game out there.
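To make that explicit (GFLOPs = SPs * 2 FLOPs per FMA * GHz; a quick sanity check, not official numbers):

```python
# Peak GFLOPs = SPs * 2 (one FMA = 2 FLOPs) * clock in GHz.
def gflops(sps, ghz):
    return sps * 2 * ghz

g6230_fp16 = gflops(96, 0.533)  # ~102 GFLOPs FP16 (G6230 @ 533 MHz)
g6230_fp32 = gflops(64, 0.533)  # ~68 GFLOPs FP32
g6200_fp32 = gflops(64, 0.700)  # ~90 GFLOPs FP32 (G6200 @ 700 MHz)

# The measured gaps (e.g. T-Rex: 27.10 / 20.60 = 1.32x) track the FP32
# FLOP ratio, not anything FP16 related:
print(g6200_fp32 / g6230_fp32)  # ~1.31x
```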
 
So we can summarise from the above that:

The Mediatek Helio X10T has a Rogue G6200.
The Rockchip Allwinner A80 has a Rogue G6230.

There are no FP16 SPs in the G6200.

Comparing the relative T-Rex scores would seem to suggest that the presence of FP16 SPs does not majorly influence the scores.

If that suggestion is factually correct, then the presence of FP16 in the iPad Pro is not given a false benefit when comparing T-Rex scores between the Pro and the HD 7770.
 
For hairsplitting's sake, the A80 is from Allwinner and not Rockchip.

As for the conclusion: the data at hand is too sparse for my taste to jump to any conclusions, since there could be other factors at play I'm not aware of. I'm just noting that all three benchmark scores (T-Rex, Manhattan 3.0 & Manhattan 3.1) track the frequency difference of the two GPUs too closely.

Other than that, why exactly would AMD or any other IHV bother to optimize a desktop GPU for a ULP mobile benchmark (assuming there's no other culprit for it)? It should go without saying that IHVs like Apple, QCOM, Samsung and others heavily optimize for GFXBench, amongst other synthetic benchmarks, for their ULP SoC GPUs.
 
Considering how ALU bound Manhattan 3.0, and even more so Manhattan 3.1, is, the X1 GPU should actually waltz all over the A9X GPU because:
X1 GPU:
256 SPs @ 1GHz = 512 GFLOPs FP32 or 1024 GFLOPs FP16
A9X GPU
768 SPs @ 0.47GHz = 360 GFLOPs FP32 or 720 GFLOPs FP16
A9X GPU has 384 SPs, its frequency is in 0.7 GHz - 1 GHz range, so both GPUs have almost equal number of flops
Considering how bandwidth bound deferred shading is, the A9X GPU with 2x the bandwidth should run circles around the X1 GPU. Instead, the perf difference shrinks with the more ALU-bound tiled deferred rendering in Manhattan 3.1, and it's possible to see the reverse situation in the more modern Car Chase test. I wonder how the A9X GPU would deal with tessellation.
 
A9X GPU has 384 SPs, its frequency is in 0.7 GHz - 1 GHz range, so both GPUs have almost equal number of flops

I severely doubt that Apple would all of a sudden not only increase the unit count as usual, but at the same time increase the frequency of their GPUs by that much. Is there at least a single indication for such an absurd frequency, or is it again another gut feeling? Please tell me it's not extrapolated from the fillrate results in GFXBench....

For correctness' sake: yes, it has 384 SPs; I mistakenly used FP32 FLOPs/clock.

https://forum.beyond3d.com/posts/1901055/

Same unit count obviously in the 9.7" tablet; over half a TFLOP is more likely an FP16 figure than FP32. At an estimated 400MHz that gives 307 GFLOPs FP32 or 614 GFLOPs FP16.
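Spelled out (the 400MHz clock being the estimate above, not a confirmed figure):

```python
# 12-cluster Rogue: 384 FP32 SPs; one FMA = 2 FLOPs; 7XT-style FP16 = 2x FP32.
sps = 384
ghz = 0.4             # estimated clock, not confirmed by Apple
fp32 = sps * 2 * ghz  # 307.2 GFLOPs FP32
fp16 = fp32 * 2       # 614.4 GFLOPs FP16
print(fp32, fp16)
```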

Considering how bandwidth bound deferred shading is, the A9X GPU with 2x the bandwidth should run circles around the X1 GPU. Instead, the perf difference shrinks with the more ALU-bound tiled deferred rendering in Manhattan 3.1, and it's possible to see the reverse situation in the more modern Car Chase test. I wonder how the A9X GPU would deal with tessellation.

How it'll fare in Car Chase is subject to Apple delivering 3.2 drivers, and it won't give any considerable results with tessellation, since tessellation is most likely all channelled through the ALUs themselves, as ARM does it. That, however, has nothing to do with the above.

For the record's sake, and in case you haven't noticed, the A8X fares many times worse in Manhattan 3.1, and it's not the fault of the architecture itself but of strangled resources in Series6XT vs. 7XT. The latter cores fare quite a bit better in 3.1; however, whatever they increased in 7XT obviously wasn't as generous as it could have been.

Here's the Salvator-X from Renesas (R-Car H3) with a GX6650 (6XT), which actually gives you a first tessellation result from a 6-cluster config; heck, even the 12-cluster T880 in the S7 reaches over 43fps in that one, all channelled through compute, and no, I don't expect the A9X to reach even as high:

https://gfxbench.com/device.jsp?benchmark=gfx40&os=Android&api=gl&cpu-arch=ARM&hwtype=GPU&hwname=Imagination Technologies PowerVR Rogue GX6650&did=30930332&D=Renesas Salvator-X

But since you're bound to entertain us with self-invented frequencies: that one gives an offscreen fillrate of 10739 MTexels/s. With 12 TMUs the frequency is obviously not over 900MHz; more like 600MHz. Now tell me what the heck could be "wrong" with that fillrate test..... :D

With the 24 TMUs of the A9X, the fillrate should be 21478 MTexels/s, and that at 600MHz. However, the A9X GPU gets "only" 15862 MTexels/s, or 26% less. Do the math....
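The scaling argument, spelled out (fillrates in MTexels/s from the GFXBench links above):

```python
# If the fillrate test scaled purely with TMU count at equal clocks, the
# 24-TMU A9X should score double the 12-TMU GX6650 running at 600 MHz.
gx6650 = 10739                     # measured, 12 TMUs @ 600 MHz
expected_a9x = gx6650 * (24 / 12)  # 21478 MTexels/s, if also at 600 MHz
measured_a9x = 15862               # what the A9X actually scores
print(1 - measured_a9x / expected_a9x)  # ~0.26 -> "26% less", hence a lower clock
```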

http://documentation.renesas.com/doc/DocumentServer/R70PF0027ED1000.pdf
(page 6 for R-Car H3 frequencies)
 
With the 24 TMUs of the A9X, the fillrate should be 21478 MTexels/s, and that at 600MHz. However, the A9X GPU gets "only" 15862 MTexels/s, or 26% less. Do the math....
Ok, I will do it for you: 24 filtered texels per clock * 0.6 GHz = 14.4 GTexels/s, which, unsurprisingly, is a close match to what the iPad Pro actually does (14.085). We don't know the efficiency, though it should obviously be close to the peak theoretical value for a TBDR. The iPad Air 2 does 7.56 GTexels/s, so the 12-cluster iPad Pro is 1.86x faster with 1.5x the number of TMUs; 1.86 / 1.5 = 1.24, so the A9X GPU should have a 1.24x higher frequency to achieve its fillrate.

16nm FF+ allows up to 1.35x higher frequencies at the same power, thanks to vastly reduced dynamic power consumption in comparison with 20nm, while the density gains are minimal. It would be utterly stupid not to use the strong points of the process and rely instead on the weak density gains only; obviously that's perfectly well known to the engineers at Apple, hence the higher frequency of the A9X GPU.

I was wrong with my initial frequency estimate at a glance; the frequency should be somewhere in the 650-750MHz range depending on the efficiency. Still, it doesn't change any of my conclusions: the number of FLOPs is about the same for both chips, 500-538 GFLOPs for the A9X @ 0.65-0.7GHz vs. 512 GFLOPs for the TX1 @ 1GHz, and bandwidth is a lot higher for the A9X.
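The frequency estimate, spelled out (measured GFXBench fillrates from above; 16 TMUs for the A8X, 24 for the A9X):

```python
# Scale the iPad Air 2 (A8X, 16 TMUs) fillrate by the TMU ratio and
# attribute whatever is left over to a clock difference.
air2_gtex = 7.56     # measured GTexels/s, iPad Air 2
pro_gtex = 14.085    # measured GTexels/s, iPad Pro
tmu_ratio = 24 / 16  # A9X has 1.5x the TMUs of the A8X
freq_gain = (pro_gtex / air2_gtex) / tmu_ratio
print(freq_gain)     # ~1.24x higher clock than the A8X GPU
```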
 
I don't care what you consider wise or unwise, and you may very well believe whatever suits your imagery better. The fillrate test uses alpha blending, for the record. Other than that, I've provided a wee bit more documentation than your usual gut feeling. If in doubt, ask around; it shouldn't be too hard to find out. Frequency is not over 500MHz, whichever way you want to twist it.

Apple ITSELF claimed in a marketing blurb that the A9X GPU has 360x the GPU power of the original iPad. The SGX535 in that one does 2 GFLOPs, so figure it out.
 
I don't care what you consider wise or unwise, and you may very well believe whatever suits your imagery better
A9X as well as A8X have the same number of TMUs and ROPs, so all numbers are still perfectly valid

The fillrate test uses alpha blending, for the record
So what?

Apple ITSELF claimed in a marketing blurb that the A9X GPU has 360x the GPU power of the original iPad. The SGX535 in that one does 2 GFLOPs, so figure it out.
That's great; you can do the math by yourself now. Just pick up the 1.6 GFLOPs number from http://www.anandtech.com/show/4225/the-ipad-2-review/5 and multiply it by 360, and hopefully you will get something like 576 GFLOPs for the A9X :smile:
 
A9X as well as A8X have the same number of TMUs and ROPs, so all numbers are still perfectly valid

There are always 2 TMUs per cluster, but I don't know how they scale the back end with increasing cluster counts. The above reality obviously works just in the cases you seem to want to select, since I still haven't received an inch of a viable answer why on God's green earth the 6-cluster GX6650 with 12 TMUs yields over 10 GTexels/s while clocked at a mere 600MHz. Did you even bother to compare those results to the A8X results to see if they make sense?


It won't give you TMUs * frequency = fillrate just because you think it will. Again: 10739 MTexels/s / 12 TMUs = ~895MHz. See the official product link above; it clocks at 600MHz.
https://gfxbench.com/compare.jsp?be...logies+PowerVR+Rogue+GX6650&D2=Google+Nexus+9

The fillrate results still make sense, yes? It has been noted here on the boards many times that the latest GFXBench fillrate test is highly misleading when it comes to extrapolating frequencies from its results. Does the GX6650 above have 2.4x the fillrate of the GK20A in the K1, or rather a <15% difference in peak texel fillrate due to frequency differences?

What would make sense is to compare same-architecture GPUs, preferably from the same generation.

That's great; you can do the math by yourself now. Just pick up the 1.6 GFLOPs number from http://www.anandtech.com/show/4225/the-ipad-2-review/5 and multiply it by 360, and hopefully you will get something like 576 GFLOPs for the A9X :smile:

The original iPad GPU clocks at 250MHz; 2 Vec2 FP16 * 0.25GHz = 2.0 GFLOPs FP16. I actually remember helping Anand himself back when he was writing that page of the article, because there was some confusion with MADDs.

Apple marketing back then also claimed a 9x increase for the GPU from the iPad to the iPad 2.

SGX543MP2:
2 cores * [ (4 Vec4) + 1 SFU MUL ] * 0.25GHz = 18 GFLOPs; 18 GFLOPs / 2 GFLOPs = a 9x increase, and yes, that's just as much marketing as the 360x claim, which goes for FP16 FLOPs on the 535 of the iPad, since it was capable of 2 Vec2 FP32 only under conditionals. 2 GFLOPs * 360 = 720 GFLOPs FP16. Counting that single 9th OP from the SFU is just another of those dubious stories; yes, it can be used, but under conditionals again.
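Both marketing multipliers fall out of those unit counts; a quick sanity check (counting a MADD as 2 FLOPs, both GPUs at 250MHz):

```python
GHZ = 0.25  # both the iPad and iPad 2 GPUs clock at 250 MHz

# iPad (SGX535): 2 Vec2 FP16 units, one MADD = 2 FLOPs per lane.
sgx535 = 2 * 2 * 2 * GHZ               # 2.0 GFLOPs FP16

# iPad 2 (SGX543MP2): per core, 4 Vec4 MADD units plus 4 SFU MULs.
sgx543mp2 = 2 * (4 * 4 * 2 + 4) * GHZ  # 18.0 GFLOPs

print(sgx543mp2 / sgx535)  # 9x  -> the iPad -> iPad 2 marketing claim
print(sgx535 * 360)        # 720 GFLOPs FP16 -> the 360x claim
```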

As one can see above, Apple is rather consistent with GPU frequencies within each of their respective generations. For Series5/XT it was always in the 250-325MHz range, and for anything Rogue since the A7/iPad Air, frequencies are in the 400-533MHz ballpark for Apple.

I know that Intel used a frequency of somewhere around 460-470MHz for the G6430 they integrated into their smartphone SoCs, with a burst frequency of 533MHz, but I doubt Apple used something like that. The unfortunate thing is that the new Manhattan 3.1 long-term performance test isn't available yet. It would be interesting to see if and how much either the iPad Pro or the 9.7" iPad Pro throttles. A small, tolerable percentage of GPU throttling would rather favour the low-clock theory, and is actually the reason IMO why Apple prefers to go wide with relatively low frequencies for its GPUs.
 
I still haven't received an inch of a viable answer why on God's green earth the 6-cluster GX6650 with 12 TMUs yields over 10 GTexels/s while clocked at a mere 600MHz
Ask the one who did the test; it could have been done at any possible frequency, which isn't guaranteed to be limited to 600MHz, nor are the results guaranteed to be correct for some development board or whatever the thing is

Did you even bother to compare those results to the A8X results to see if they make sense?
I don't bother to compare some random results from some random board, because the board could be overclocked, it could be cooled with an air solution, it might not be limited by the same thermal and power constraints as the A8X, and there are simply not enough data samples to draw any worthwhile conclusions at all

The fillrate results still make sense, yes?
They don't make any sense, but for totally different reasons. It's not the test's issue if some results are random garbage

It has been noted here on the boards many times that the latest GFXBench fillrate test is highly misleading when it comes to extrapolating frequencies from its results
We can compare texture filtering results, but this won't change anything - https://gfxbench.com/compare.jsp?be...GPU&hwname1=Apple+A9X+GPU&D2=Apple+iPad+Air+2

The original iPad GPU clocks at 250MHz; 2 Vec2 FP16 * 0.25GHz = 2.0 GFLOPs FP16
These are FP32 FLOPs; since the USSE is unified, it has to support FP32 for vertex processing - "USSE enables up to IEEE 754 single precision floating point data processing essential for the best possible image quality", + https://imagination-technologies-cl...m/documentation/PowerVR_graphics_brochure.pdf (page 13)

as the 360x claim, which goes for FP16 FLOPs on the 535 of the iPad, since it was capable of 2 Vec2 FP32 only under conditionals
I don't think so; 360x goes for FP32 FLOPs, as this is the only possible way to be on par with the Shield ATV in the ALU test - https://gfxbench.com/compare.jsp?be...me1=Apple+A9X+GPU&D2=NVIDIA+Shield+Android+TV
 
Ask the one who did the test; it could have been done at any possible frequency, which isn't guaranteed to be limited to 600MHz, nor are the results guaranteed to be correct for some development board or whatever the thing is
I don't bother to compare some random results from some random board, because the board could be overclocked, it could be cooled with an air solution, it might not be limited by the same thermal and power constraints as the A8X, and there are simply not enough data samples to draw any worthwhile conclusions at all

Ironically, it roughly yields the results you'd expect from a 600MHz 6-cluster 6XT compared to the A8X GPU, which also clocks lower.

They don't make any sense, but for totally different reasons. It's not the test's issue if some results are random garbage.
We can compare texture filtering results, but this won't change anything - https://gfxbench.com/compare.jsp?be...GPU&hwname1=Apple+A9X+GPU&D2=Apple+iPad+Air+2

Preferably from the same generation.... I wouldn't suggest that the front and back ends for a GX6450 and a GT7600 are identical (duplicated or not for both):

https://gfxbench.com/compare.jsp?be...U&hwname1=Apple+A9+GPU&D2=Apple+iPhone+6+Plus

These are FP32 FLOPs; since the USSE is unified, it has to support FP32 for vertex processing - "USSE enables up to IEEE 754 single precision floating point data processing essential for the best possible image quality", + https://imagination-technologies-cl...m/documentation/PowerVR_graphics_brochure.pdf (page 13)

Without conditionals you don't get 2 Vec2 FP32 out of a 535, but rather 1 Vec2 FP32 or 2 Vec2 FP16; peak FP32 is obviously 2 Vec2.

I don't think so; 360x goes for FP32 FLOPs, as this is the only possible way to be on par with the Shield ATV in the ALU test - https://gfxbench.com/compare.jsp?be...me1=Apple+A9X+GPU&D2=NVIDIA+Shield+Android+TV

Who says it has to match the X1 GPU in one ALU test, while it surpasses the former in the ALU2 test? Different architectures, different strengths and weaknesses. As you already noted, the difference shrinks for the A9X in Manhattan 3.1, and I'd expect another shrink in Car Chase, until Apple delivers a DX11.x GPU, which doesn't sound like it's coming all that soon.

I don't even recall what the ES 2.0 ALU test does, but the ES 3.0 "ALU2" test:

....approximates the fragment shader computing load of the Manhattan test better, by rendering 64 point lights in multiple passes over a captured Manhattan frame. Emissive and ambient terms are also evaluated along with the diffuse lighting.

[sarcasm start] Other than that: sure, of course they've clocked a ULP SoC GPU that exceeds the 1B-transistor mark at 930MHz, because it's the only way it can make you feel better. And while you run any mobile game on it, it throttles after a couple of minutes to half its frequency, because it is really "that" common for Apple to follow such a strategy..... [/end of sarcasm]
 