Mobile GPGPU

Lazy8s

I was impressed to see some actual examples of GPU compute running on a mobile device.

http://www.youtube.com/watch?v=h_Xp_Heog18

It's not a showcase by any means, and the optimization is highly suspect, if for no other reason than that, amusingly, the Tegra 2/Xoom version runs more smoothly than the iPad 2's. Still, it's good to see it as a (virtual) reality and not just hear about it as a theoretical one.

http://www.youtube.com/watch?v=8xnDxXGy0SI

Accelereyes is not waiting for OpenCL.
 
Interesting, but how is the battery impact...? It seems to me that most "smart"-type devices have decent or even good battery life as long as you don't really use them, but once the processor starts spinning, power drains at a vastly higher rate. :p
 
Accepting that mobile devices will, in the future, enable applications like image enhancement, photo editing, augmented reality, visual detection, speech recognition, and so on, the GPU, as mentioned, is the far more efficient processor on which to run the bulk of such workloads.
 
Since the Jacket NDK uses OpenGL ES 2.0, they're presumably packing data into textures and using render-to-texture to farm the work out to fragment shaders. It's not hard to see how the bottlenecks of that approach might be more pronounced than expected on some GPUs, especially if the data has to round-trip through the CPU at any regular interval.
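
Just to make the mechanics concrete, here is a rough sketch (my own illustration, not anything from AccelerEyes) of what that GLES2 path typically looks like: pack the input array into a texture, run the per-element math in a fragment shader rendering into a texture attached to an FBO, then read the results back with glReadPixels. It assumes an EGL context is already current and that prog was linked from frag_src plus a trivial pass-through vertex shader supplying a_pos and v_uv.

#include <GLES2/gl2.h>

// The per-element math lives in the fragment shader; prog is assumed to be
// built from this plus a pass-through vertex shader.
static const char *frag_src =
    "precision mediump float;\n"
    "uniform sampler2D u_input;\n"
    "varying vec2 v_uv;\n"
    "void main() { gl_FragColor = texture2D(u_input, v_uv) * 2.0 + 1.0; }\n";

void run_pass(GLuint prog, const void *in_pixels, void *out_pixels, int w, int h)
{
    GLuint in_tex, out_tex, fbo;

    // Upload the input array as a texture (RGBA8 here; real float data needs
    // the OES_texture_float extension).
    glGenTextures(1, &in_tex);
    glBindTexture(GL_TEXTURE_2D, in_tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, w, h, 0, GL_RGBA, GL_UNSIGNED_BYTE, in_pixels);

    // Output texture bound as the render target (check glCheckFramebufferStatus
    // in real code).
    glGenTextures(1, &out_tex);
    glBindTexture(GL_TEXTURE_2D, out_tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, w, h, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, out_tex, 0);
    glViewport(0, 0, w, h);

    // Draw a full-screen quad so the fragment shader touches every element.
    static const GLfloat quad[] = { -1,-1,  1,-1,  -1,1,  1,1 };
    glUseProgram(prog);
    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, in_tex);
    glUniform1i(glGetUniformLocation(prog, "u_input"), 0);
    GLint pos = glGetAttribLocation(prog, "a_pos");
    glEnableVertexAttribArray((GLuint)pos);
    glVertexAttribPointer((GLuint)pos, 2, GL_FLOAT, GL_FALSE, 0, quad);
    glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);

    // This is the painful part: pulling results back through the CPU.
    glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, out_pixels);
}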

I'm sure proper OpenCL will perform much better.
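
Compare that with the same element-wise operation written as an OpenCL kernel (again just my own sketch): no texture packing, no FBO, and the buffers it reads and writes can stay resident on the device between kernel launches rather than round-tripping through glReadPixels every pass.

// OpenCL C kernel: y[i] = 2*x[i] + 1, one work-item per element.
__kernel void scale_add(__global const float *x, __global float *y)
{
    size_t i = get_global_id(0);
    y[i] = x[i] * 2.0f + 1.0f;
}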

A presentation was given detailing some early OpenCL work on mobile hardware, with measurements of both power consumption and performance.

https://www.cvg.ethz.ch/teaching/2011spring/gpgpu/mobile_gpgpu.pdf

The conclusion drawn at the end:

"High parallelism at low clock frequencies (110 MHz)
is better than
low parallelism at high clock frequencies (550 MHz)"
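
(The back-of-the-envelope reasoning behind a claim like that, which the slides don't really spell out, is that dynamic power scales roughly as P ≈ a·C·V²·f, and since supply voltage has to rise more or less with frequency, power grows closer to cubically with clock speed. Spreading the same throughput across more units at a lower clock should therefore be a big win in perf/W, at least on paper.)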

That conclusion is all but completely invalidated, though, by the fact that they're comparing code compiled for the GPU against plain VFP code on the CPU. I also don't agree with their suggestion to use fixed point because the CPU's floating point is slow; what the CPU side should be using is hand-optimized NEON code. Even at its base speed of 500 MHz, the CPU here offers over four times the peak FLOPS of the GPU.

Admittedly, the CPU is much worse at hiding what is a pretty substantial amount of latency in FMADD kernels, and I'm sure their code/data structures would have to be rearranged to extract better parallelism on the CPU, with the scaling probably done in a quite different fashion, in a separate pass at that. But if you pushed both to their limits, you'd probably still get better performance out of the CPU here.
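
To show what I mean by hand-optimized NEON, here's the sort of multiply-accumulate inner loop I'm talking about (purely illustrative, not their code), with two independent accumulators per iteration to help cover the VMLA latency:

#include <arm_neon.h>
#include <stddef.h>

/* y[i] += a * x[i]; assumes n is a multiple of 8, a real version
 * also needs a scalar tail loop. vmlaq_f32 maps to VMLA.F32. */
void saxpy_neon(float *y, const float *x, float a, size_t n)
{
    float32x4_t va = vdupq_n_f32(a);
    for (size_t i = 0; i < n; i += 8) {
        float32x4_t x0 = vld1q_f32(x + i);
        float32x4_t x1 = vld1q_f32(x + i + 4);
        float32x4_t y0 = vld1q_f32(y + i);
        float32x4_t y1 = vld1q_f32(y + i + 4);
        /* Two independent chains so back-to-back dependent VMLAs
         * don't stall the pipeline. */
        y0 = vmlaq_f32(y0, x0, va);
        y1 = vmlaq_f32(y1, x1, va);
        vst1q_f32(y + i, y0);
        vst1q_f32(y + i + 4, y1);
    }
}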

Now, on an iPad 2, where you have 4x the peak NEON performance (albeit with worse multi-issue/latency hiding) but something like 20 times the GPU ALU performance, the story could be very different.
 