Qualcomm Krait & MSM8960 @ AnandTech

I wonder how much die area a really low performance FP64 implementation takes.. is it necessarily significant? Especially for a device that can otherwise do a few dozen SP operations per cycle.

Once the implementation is "really low performance", why have it at all, other than to satisfy a feature checkbox?

Put another way: would you rather the mobile GPU vendors produce a "really low performance FP64 implementation", or spend that time further improving the F32 path?
 
I wonder how much die area a really low performance FP64 implementation takes.. is it necessarily significant? Especially for a device that can otherwise do a few dozen SP operations per cycle.

Do we know the performance characteristics of the FP64 implementation in T6xx cores yet? The ARM blog posts I've read on the subject claim that FP64 might not be needed all that often, yet when it is needed, it's needed badly. It's a point I can understand; however, as a layman I can't tell whether, or to what degree, it's valid right now for the small form factor space.

I don't know what the SP/DP ratio looks like, but I'd be very surprised if it's lower than 4:1; it could range between 4:1 and 8:1. Purely speculative: if it has 4*SIMD16 and yields 72 GFLOPs@500MHz, then some part of the SIMDs is capable of more than 2 FLOPs/clock. That raises the question of whether and how they fuse to achieve FP64.
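To make the arithmetic behind that speculation explicit, here is a rough sketch in Python. The 4*SIMD16 layout, the 72 GFLOPs figure and the 500MHz clock are the speculative numbers from the post, not confirmed specs, and the variable names are only for illustration.

```python
# Back-of-the-envelope check of the speculated figures (assumptions, not specs).
SIMD_UNITS = 4          # speculated number of SIMD blocks
LANES_PER_SIMD = 16     # speculated SIMD width
CLOCK_HZ = 500e6        # 500 MHz
CLAIMED_FLOPS = 72e9    # 72 GFLOPs

lanes = SIMD_UNITS * LANES_PER_SIMD

# If every lane did exactly one multiply-add (2 FLOPs) per clock:
baseline_gflops = lanes * 2 * CLOCK_HZ / 1e9

# What the claimed figure implies per clock and per lane:
flops_per_clock = CLAIMED_FLOPS / CLOCK_HZ
flops_per_lane = flops_per_clock / lanes

print(f"{lanes} lanes, baseline {baseline_gflops} GFLOPs at 2 FLOPs/lane/clock")
print(f"claimed figure implies {flops_per_clock} FLOPs/clock, "
      f"{flops_per_lane} per lane")
```

With 64 lanes, 2 FLOPs per lane per clock only gets you to 64 GFLOPs, so the 72 GFLOPs figure works out to 2.25 FLOPs per lane per clock on average, which is why some of the lanes would have to do more than a single MAD.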
 
Good news for the Adreno320: the LG E970 just edged past the iPad3 in GLBenchmark2.5

http://www.glbenchmark.com/phonedetails.jsp?D=LG+E970&benchmark=glpro25

Awesome. I predict there's still quite a lot of performance in the locker for the 320.
Do they do an MSM version of the quad-core Krait with integrated LTE? If so, I think that is going to be the best SoC this generation in my humble opinion: better than this year's Exynos and even early OMAP 5, and very nicely balanced in every area.
 
The majority of small form factor GPUs scale in performance over time, via additional units, higher frequencies, or both. Qualcomm has predicted up to Xbox360/PS3 console GPU performance for Adreno3xx over time, which isn't necessarily absurd if you consider where Adreno2xx started and that it reached Adreno225 with 8 Vec4 ALUs@400+MHz.
 
I mean with current shipping frequencies, just with driver improvements and maybe LPDDR3.

That's common for any platform out there. It's not that other IHVs' driver departments sit around idle or that SoC manufacturers and OEMs don't try to constantly improve aspects of their products.
 
I didn't say that. However, with it being a new uarch, and with what some have said about the likely sensitivity of Adreno GPUs to poor drivers, decent drivers would likely have a bigger effect on the 320.
 
Once the implementation is "really low performance", why have it at all, other than to satisfy a feature checkbox?

Put another way: would you rather the mobile GPU vendors produce a "really low performance FP64 implementation", or spend that time further improving the F32 path?

There's a big gap between the performance levels I'm describing and what you'd get with pure software emulation, probably several times faster. A relatively small amount of hardware can go a long way in tasks like this. So if you want FP64 at all then yes I'd rather have a slow hardware implementation than nothing, but I question the usefulness of FP64 in this space right now.

Starting with something low performance at least encourages more people to have code that uses it, even if it isn't fast. I doubt that GPUs w/o hardware support even have to support it in GLSL, so hardware support would encourage software development. It has to start somewhere if it's going to ever become something, and this is the path that desktop GPUs took.

I was only really thinking about synthesizing MUL and ADD separately, though.. doing full FMA would be more demanding as metafor points out (but even more so in software).. does anyone know if OGL ES 3.0 requires FMA?
 
Nope, there's no requirement for a single precision FMA (or DP of course) in ES3.0.
 
I didn't say that. However, with it being a new uarch, and with what some have said about the likely sensitivity of Adreno GPUs to poor drivers, decent drivers would likely have a bigger effect on the 320.

Adreno 320 is a relatively simple GPU architecture to write drivers and a shader compiler for with only a couple of exceptions (as far as these things go of course, it's certainly not easy by any means), so you'd expect it to be fast out of the blocks.

In cases where it doesn't seem to be as fast as you'd expect, I'd first start thinking about and looking at system-level bottlenecks rather than the driver and compiler software, even at this early stage.
 
Adreno 320 is a relatively simple GPU architecture to write drivers and a shader compiler for with only a couple of exceptions (as far as these things go of course, it's certainly not easy by any means), so you'd expect it to be fast out of the blocks.

In cases where it doesn't seem to be as fast as you'd expect, I'd first start thinking about and looking at system-level bottlenecks rather than the driver and compiler software, even at this early stage.

Thanks.
 
What are the typical use cases for FP64 in GPGPU? Even Kepler takes a significant hit when performing MAC compared to standalone MUL and ADD, so I imagine MACs aren't particularly desired?
 
I would put good money on almost everyone that needs double precision floats being happy enough with the precision of the unfused ops, when using GPUs as accelerators. I don't know the exact algorithms used by those I know that put value in GPU double precision, or their numerical precision and stability requirements, but I've never heard them say they need FMAC or they can't use the hardware.
 
I would put good money on almost everyone that needs double precision floats being happy enough with the precision of the unfused ops, when using GPUs as accelerators. I don't know the exact algorithms used by those I know that put value in GPU double precision, or their numerical precision and stability requirements, but I've never heard them say they need FMAC or they can't use the hardware.

Rys: could I ask your personal opinion on what the main advantages SGX 5 series has over an Adreno 225, please?
Would it be fair to say (based on various conversations) that SGX is a good all-rounder, whereas the Adreno is better suited to modern, shader-heavy workloads?
Exynos and Tegra prefer the simple stuff.
 
What are the typical use cases for FP64 in GPGPU? Even Kepler takes a significant hit when performing MAC compared to standalone MUL and ADD, so I imagine MACs aren't particularly desired?

Any variable that will accumulate a lot of stuff, or will have HDR, or will be used in conjunction with stuff having HDR.

FMACs/FMADDs etc. by themselves are no more or no less desirable. If good FP64 is there, then fused ops add speed, not precision and as such are no more or less desirable than unfused ops.
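As a rough illustration of the accumulation point, the sketch below sums 0.1 a million times with a single precision and a double precision running total. It emulates FP32 on the CPU by round-tripping through `struct`, so it only shows the rounding behaviour, not anything about a particular GPU; the helper names `to_f32` and `accumulate` are made up for this example.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(step: float, count: int, single: bool) -> float:
    """Add `step` to a running total `count` times.

    With single=True the step and the total are rounded to FP32 after
    every add, mimicking an FP32 accumulator.
    """
    if single:
        step = to_f32(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single:
            total = to_f32(total)
    return total

COUNT = 1_000_000
exact = COUNT / 10  # the mathematically exact sum of 0.1 * COUNT

err32 = abs(accumulate(0.1, COUNT, single=True) - exact)
err64 = abs(accumulate(0.1, COUNT, single=False) - exact)

print(f"FP32 accumulator error: {err32}")
print(f"FP64 accumulator error: {err64}")
```

The FP32 total drifts by a visible amount because once the running sum is large, each small addend loses most of its low bits, while the FP64 total stays within a tiny fraction of the exact answer.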
 
Any variable that will accumulate a lot of stuff, or will have HDR, or will be used in conjunction with stuff having HDR.

FMACs/FMADDs etc. by themselves are no more or no less desirable. If good FP64 is there, then fused ops add speed, not precision and as such are no more or less desirable than unfused ops.

I wasn't referring to fused vs unfused. I was referring to a single MAC op vs standalone ops for MUL and ADD. Are there significant advantages to having a MAC op for most highly parallel, loosely memory coupled algorithms?
 
Rys: could I ask your personal opinion on what the main advantages SGX 5 series has over an Adreno 225, please?
Would it be fair to say (based on various conversations) that SGX is a good all-rounder, whereas the Adreno is better suited to modern, shader-heavy workloads?
Exynos and Tegra prefer the simple stuff.

Hard for me to have a personal opinion since this is what I get paid for :p

There are multiple obvious good bits to SGX when running modern GLES2 content, including content with complex shaders: unified shader, TBDR / bandwidth efficiency, good shader compiler, good basic fillrate.

Some SGX implementations also have a lot of ALU, as Adreno 220 and 225 do.

In general, and it's pretty obvious, you tend to want a very low bandwidth unified shader. Arguably we're the only ones that have built that so far.
 
I wasn't referring to fused vs unfused. I was referring to a single MAC op vs standalone ops for MUL and ADD. Are there significant advantages to having a MAC op for most highly parallel, loosely memory coupled algorithms?

Advantage: A MAC unit is smaller than standalone ADD and MUL units, so you can have more of them. You also save on instruction issue rate, so if your code fits MAC, it's a win.

Traditionally, the main disadvantage of fused mul-add is the lack of intermediate rounding of the product. This can produce different (but not wrong!) results than separate mul+add.
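A small sketch of that rounding difference, in Python. The fused result is emulated here with exact rational arithmetic followed by a single rounding, which is an assumption made for portability and is not how hardware does it; the function names and input values are picked just for this example.

```python
from fractions import Fraction

def mul_then_add(a: float, b: float, c: float) -> float:
    """Separate ops: the product is rounded to FP64, then the sum is rounded."""
    return a * b + c

def fused_mul_add(a: float, b: float, c: float) -> float:
    """Emulated fused op: a*b + c computed exactly, rounded to FP64 once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# Inputs chosen so the exact product has a bit below FP64 precision:
# (1 + 2^-27)^2 = 1 + 2^-26 + 2^-54, and the 2^-54 term is lost when
# the product is rounded on its own.
a = b = 1.0 + 2.0**-27
c = -(1.0 + 2.0**-26)

unfused = mul_then_add(a, b, c)
fused = fused_mul_add(a, b, c)

print(f"mul then add: {unfused}")
print(f"fused:        {fused}")
```

The separate ops return exactly 0.0, while the fused version keeps the 2^-54 residue. Both are valid roundings of their respective operation sequences, which is the "different but not wrong" point.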

Cheers
 
Whoa, the new Xiaomi Mi-Two with 1.5 GHz Qualcomm Snapdragon S4 Pro APQ8064 SoC and 2 GB of LPDDR2 RAM looks really impressive! Quad-core Krait and Adreno 320 for $315.

Comes with a 720p IPS 4.3" screen, 2000mAh battery (optional 3000mAh battery available) and Jelly Bean based MIUI skin.

AnandTech has the skinny on it.
 