Exophase said:
Although it'll vary a lot depending on load, I don't think Cortex-A15 will deliver such a big increase in IPC over A7 on average. ARM said they expected 50% better IPC in integer code than Cortex-A9, but in practice I'm not seeing numbers anywhere close to this very often, at least not with Exynos 5. And I think the expectation that A7 will be roughly A8 level and often better is credible: although it can't pair as many instruction combinations, it does have a shorter branch mispredict penalty, claims very low latency caches (we'll see) and AFAIK benefits from auto prefetch and what have you.
So I'd give a number like ~1.8x in the best case but maybe as low as 1.4-1.5x. Very hand-waved, of course. That may change a little with compilers.
Hmm, interesting, I think you're right that 2x is too high for integer workloads (although you'd expect A15 to be much closer to 2x IPC with 128-bit FP, for what little it's worth). I'm honestly not sure what number would be most realistic, but I suppose it really depends on the workload and memory subsystem. Given how complex the A15 is though, I'd be really disappointed if it was less than 1.5x A7 on average, but we'll see...
OlegSH said:
I heard from one developer that a single SGX Vec4 SIMD unit issues 4 FP16 MADs per cycle and only 2 FP32 MADs. Does anybody know if this is true or not? And does it mean that the 76 theoretical GFLOPS of SGX554MP4 in iPad4 are equal to 38 FP32 GFLOPS?
IMG DevTech's team has historically claimed that to some developers as a simplification of what you can expect in terms of performance, but the shader core can definitely achieve full-rate Vec4 FP32 MADDs. The problem is that there are other restrictions (not ALUs and not operand sharing per se) on Vec4 FP32 instructions, and if you don't meet those then you only get a Vec2 FP32 or Vec4 FP16 operation at full speed. These restrictions are not a problem for typical vertex shader workloads.
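To make the arithmetic in the question explicit, here is a minimal sketch. It assumes only that peak throughput scales linearly with the number of MAD lanes usable per cycle; the function name and parameters are made up for illustration, and the 76 GFLOPS figure is simply the one quoted above.

```python
def fp32_peak_gflops(quoted_peak_gflops, lanes_total=4, lanes_fp32=2):
    """Scale a quoted peak by the fraction of Vec4 lanes usable at FP32.

    Assumption: peak GFLOPS is proportional to MADs issued per cycle.
    """
    if not 0 < lanes_fp32 <= lanes_total:
        raise ValueError("lanes_fp32 must be between 1 and lanes_total")
    return quoted_peak_gflops * lanes_fp32 / lanes_total


# Restricted case: only a Vec2 FP32 operation goes through at full speed.
restricted = fp32_peak_gflops(76.0, lanes_fp32=2)

# Unrestricted case: full-rate Vec4 FP32 MADDs.
full_rate = fp32_peak_gflops(76.0, lanes_fp32=4)

print(restricted, full_rate)
```

So the 38 GFLOPS figure would only describe shaders that fall foul of the Vec4 FP32 restrictions, not the peak the core can reach.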
Exophase said:
However, according to a post by John H (sorry, don't have a link) there are instructions with a shared 32-bit operand that can perform two FMADDs in one cycle. Although he wasn't any more specific about what this meant, my guess is that it's some form of (a * b) + (a * c) + d. This instruction would be used in the kernel of matrix * vector multiplication, i.e. linear transformations, which are the fundamental building block for geometry transformation & lighting.
Series 5XT could dual-issue the above instructions, so you get 4xFP16 or limited 4xFP32.
Correct for SGX as per JohnH's post, but the SGX-XT instruction set is new, so you shouldn't extrapolate too much from his post. It's not simply dual-issuing original SGX instructions.
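For anyone wondering why a shared operand fits matrix * vector work so well, here is a rough sketch of one possible reading. It is not the actual SGX instruction (nobody in the thread has pinned that down); `dual_mad_shared` and `mat4_mul_vec4` are hypothetical names, and the matrix is assumed to be stored as four columns.

```python
def dual_mad_shared(a, b, c, acc0, acc1):
    """Two multiply-adds that share the operand `a` (hypothetical model)."""
    return a * b + acc0, a * c + acc1


def mat4_mul_vec4(columns, v):
    """Compute M * v, with M given as a list of four 4-element columns.

    Each column is scaled by one scalar of v, so every pair of lanes
    reuses the same operand, which is what a shared-operand MAD wants.
    """
    r = [0.0, 0.0, 0.0, 0.0]
    for col, s in zip(columns, v):
        r[0], r[1] = dual_mad_shared(s, col[0], col[1], r[0], r[1])
        r[2], r[3] = dual_mad_shared(s, col[2], col[3], r[2], r[3])
    return r


# A translation by (5, 6, 7) applied to the point (1, 2, 3, 1).
translate = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [5.0, 6.0, 7.0, 1.0],
]
moved = mat4_mul_vec4(translate, [1.0, 2.0, 3.0, 1.0])
print(moved)
```

In this model a full 4x4 transform takes eight shared-operand pairs instead of sixteen separate MADs, which is the kind of saving that would make such an instruction worthwhile for vertex work.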
Exophase said:
I have never read anything about that, and I read that pretty detailed optimization manual nVidia released.
Same here, but when I did some low-level testing, I noticed higher performance (but not 2x) with LowP for some extreme corner case shaders I wrote. It might just be register pressure, although it wasn't behaving the way I would have expected based on my understanding of NV40's register system. I never had the time to investigate.
Exophase said:
Whole thing might be moot, since according to an earlier post by Arun in this thread (http://forum.beyond3d.com/showpost.php?p=1659557&postcount=271) it looks like the 8/10-bit operations may not dual-issue in Series5XT.
Correct. A fully 10-bit shader can still be very slightly faster because you wouldn't need to do any FP->INT conversion at the end, but that's about it. In practice it could even be slower (before considering dual-issue) due to both hardware and compiler limitations, although the compiler should detect those cases and revert to FP16.
AlexV said:
I would consider the odds for a wonky partially unified solution to be quite low. That being said, and given the Tegra GPU's heritage (G7x something), I'd expect the VS to be able to interact with the samplers in order to implement Vertex Texture Fetch (do we know if older Tegras didn't support this? It'd go against what G7x can do). IIRC though, there were some strict bounds on just how you could sample - possibly no filtering? It's been a while. Side note: amusing how in the handheld / embedded world most of what is old is new again.
Tegra has much less to do with G7x than most people realise. It's inspired by G7x but fundamentally a different architecture (e.g. VS and PS are definitely 4 MADDs per scheduler, while G7x was 5 MADDs for VS and 8 MADDs for PS). The feature set is also completely different (no MSAA, no framebuffer compression, lower precision - the main new features compared to G7x are programmable blending, a general depth/color cache, and a hacky CSAA implementation).
3dcgi said:
I suspect Nvidia won't go with a Kepler based core until they're ready to pay the silicon cost for dx11 functionality.
Yeah, although the bigger question is the power cost, obviously. It might compensate for some of that by being more bandwidth efficient, but likely not enough.