A lot of game code is just whatever is fastest to get it working. Game quality is more about speed of iteration than it is about code quality.
Agreed. That seems to be almost universally true for game logic and UI code (and basically all game-specific code). Fast iteration time is critical for fine-tuning the gameplay and making it fun.
Technology code tends to be developed differently, but that's usually less than 10% of the whole code base.
Hell I've worked on big teams where the majority of "game programmers" couldn't explain a cache hierarchy.
That's also true for smaller teams. Most game programmers don't need to worry about things like cache hierarchies and store forwarding stalls. As long as the technology programmers understand the low level hardware, things usually proceed fine. Technology code (graphics rendering, physics simulation, area/ray/etc queries) tends to use the majority of CPU cycles, so designing this part of the code to be as efficient as possible is often a huge step toward that goal. Of course low level programmers need to monitor the performance of game code, and once in a while fix performance bottlenecks caused by the game/UI code as well.
Most modern games use Flash-based UI engines, so there's basically no hope of getting the UI code to run well, no matter what you do.
Nobody is saying two or three-fold boost, I don't know where this is coming from. Something like a 20% improvement in perf/clock easily qualifies as significant.
I think the 2x-3x boost figure comes from the rumors about the WiiU clock rate. Compared to Xenon's 3.2 GHz, the WiiU CPU needs much more than a 20% IPC improvement to match its performance.
I'm not assuming that VMX isn't used, I'm assuming that integer heavy code is still reasonably common. Often even alongside VMX code, for instance for address generation and flow control. But in between VMX code as well: for instance, the lack of gather/scatter instructions means you sometimes need to move stuff into integer registers for computed loads/stores, and you will often want to use the ALUs for this since the load/store unit has weak address generation. AFAIK there's a huge penalty for transferring registers between VMX and the integer ALU, so you probably don't want to try using VMX as a second integer port.
Back to my original point that a lot of code is C/C++/whatever and not ASM, which you agree with - compilers are even further from optimal when it comes to vectorization. Although game developers may be using ones that are better than what I'm used to. Still, it's one of those things that benefits a lot from tight control which is difficult to communicate in high level source code.
VMX code is mostly used in performance critical areas only (less than 5% of the code base). However these areas of the code can easily use more than 50% of your CPU cycles.
I don't know about other developers, but we do not use any compiler autovectorization tools. Autovectorization is just too fragile to work properly (especially on an architecture that has no scalar<->vector register moves and no store forwarding).
Basically all our VMX code is hand-written intrinsics. We don't have lots of it, but we have enough to run many of our performance critical parts completely in VMX. On Xenon, you do not want to mix (tiny pieces of) vectorized code with standard code. You want to run large, aggressively unrolled number crunching loops that are pure VMX. This is because the VMX pipelines are very long, and you have big penalties when moving data between register sets (you need to do it through the L1 cache, and since you have no store forwarding, you have to wait a long time before the data becomes available). OoO CPUs have register renaming, so loop unrolling is not required, but fortunately Xenon has lots of vector registers, so loop unrolling gives the compiler lots of options to reorder instructions / cascade registers to keep the long VMX pipelines filled.
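To illustrate the unrolling structure (not their actual code): here is a minimal portable sketch, with plain scalar floats standing in for VMX intrinsics. The point is the four *independent* accumulator chains, which give an in-order core the instruction-level parallelism it needs to keep long pipelines busy; the function name and unroll factor are illustrative.

```cpp
#include <cstddef>

// Unrolled dot product with four independent accumulators. Real Xenon
// code would use VMX intrinsics over vector registers, but the shape is
// the same: independent dependency chains keep long pipelines filled.
float dot4x(const float* a, const float* b, std::size_t n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {   // main unrolled body: 4 chains
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)             // scalar tail for leftover elements
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);  // combine the chains at the end
}
```

A naive single-accumulator loop serializes every addition through one register; the unrolled version lets four multiplies/adds be in flight at once.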
Another reason why this kind of brute force VMX batch processing is a good fit for Xenon is memory prefetching. You have to manually cache prefetch, or you will suffer A LOT. Batch processing has a predictable memory access pattern, so it's easy to prefetch things in time (even if you are using some kind of (cache-line) bucketed structures). Running this kind of unrolled, cache friendly vector processing code on a modern OoO core doesn't improve its performance that much, since the code doesn't need branch predictors, automated cache prefetching, register renaming, etc, etc.
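The manual prefetch pattern looks roughly like this sketch. I'm using the GCC/Clang `__builtin_prefetch` builtin as a stand-in for Xenon's dcbt-style intrinsics, and the prefetch distance is a made-up tuning value, not a measured one:

```cpp
#include <cstddef>

// How far ahead to prefetch, in elements. In practice this is tuned per
// platform/workload; 16 floats (one 64-byte cache line) is just a guess.
constexpr std::size_t kPrefetchAhead = 16;

float sumBatch(const float* data, std::size_t count) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        // Issue the prefetch well before the data is needed, so the
        // memory fetch overlaps with the arithmetic on earlier elements.
        if (i + kPrefetchAhead < count)
            __builtin_prefetch(&data[i + kPrefetchAhead]);
        sum += data[i];
    }
    return sum;
}
```

Because a batch loop walks memory linearly, the address to prefetch is known many iterations in advance, which is exactly what an in-order core with no automatic prefetcher needs.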
It's true that you can often do things branchless (like with predication or equivalent). But often it'll be slower on CPUs that do have good branch prediction, so it's a balancing act: if you're not writing assembly, then you probably want to keep the code reasonably well performing on all of your target platforms.
(+ lots of other branching discussion in this thread)
Talk about branching is always close to my heart
First of all, most branches inside tight loops are just bad code. Clean code should separate common cases from special cases. You should take the branch out of the loop, and execute the code (that was inside the branch) only for the elements that have that property (extract it to another separate loop, preferably to a separate file). This style of programming makes the code easier to read & understand, and allows you to easily extend it (add new special case functionality without the need to modify existing code). This kind of processing is also very cache friendly (if you also extract the data needed by each special property to a separate linear array).
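As a hypothetical sketch of that extraction (the data and function names here are invented for illustration): instead of checking a "burning" flag per element inside the hot loop, the special case gets its own index list and its own loop, and the common-case loop becomes branch-free and linear over its arrays.

```cpp
#include <cstdint>
#include <vector>

// Structure-of-arrays layout: each property lives in its own linear array.
struct Positions {
    std::vector<float> x, y;
};

// Common case: pure linear pass, no per-element branch at all.
void integrate(Positions& p, const std::vector<float>& vx,
               const std::vector<float>& vy, float dt) {
    for (std::size_t i = 0; i < p.x.size(); ++i) {
        p.x[i] += vx[i] * dt;
        p.y[i] += vy[i] * dt;
    }
}

// Special case extracted to its own loop: only the elements that have
// the property are visited, via a separate index list.
void applyBurnDamage(std::vector<float>& health,
                     const std::vector<std::uint32_t>& burningIndices,
                     float damage) {
    for (std::uint32_t idx : burningIndices)
        health[idx] -= damage;
}
```

New special-case behaviour becomes a new loop over a new index list, so existing code never needs to grow another `if`.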
Random branches are always hard to predict. If you don't analyze your data, and do nothing to control your branching behaviour (for example sort your structures to improve branching regularity), you have to design around the worst case. For structures that might at some point in the game contain around half of the elements with a certain property (requiring branching), you have to estimate a 50% branch mispredict rate (as the elements are in "random" order).
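One simple way to get that regularity, sketched here with `std::partition` and an invented `Particle` type: group the elements by the branched-on property before the hot loop runs, so the predictor sees one long run of taken branches followed by one long run of not-taken, instead of a random mix.

```cpp
#include <algorithm>
#include <vector>

struct Particle {
    float life;
    bool  onFire;   // the property the hot loop branches on
};

// Reorder so all onFire particles come first. After this, a per-element
// branch on onFire mispredicts roughly once (at the boundary) instead of
// ~50% of the time on randomly ordered data.
void sortForBranchRegularity(std::vector<Particle>& particles) {
    std::partition(particles.begin(), particles.end(),
                   [](const Particle& p) { return p.onFire; });
}
```

`std::partition` is O(n) and doesn't preserve relative order, which is usually fine for this kind of batch data; use `std::stable_partition` if order matters.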
How much does a single branch mispredict cost? On Sandy Bridge, 14-17 cycles. How much work can Sandy Bridge do in that time? It can do 14-17 eight-wide AVX additions + multiplies (plus lots of loads, stores and address generation in the other ports). An AVX optimized 4x4 matrix multiply executes in 12 cycles (http://software.intel.com/en-us/forums/topic/302778). A mispredicted branch costs more than a 4x4 matrix multiply! And Haswell makes things even worse (it has two FMA ports + more integer ports): it will be able to do two matrix multiplies for each mispredicted branch.
Branches that are easy to predict are often fine (as long as they do not pollute the code base). I had a discussion about branching with a colleague of mine. It was basically about things dying in the game, and whether it would be better to have a listener-based solution that tells entities to remove references to a dead entity, or to do a "null check" on use (we don't use pointers, so "null check" is not exactly what we do). In this case a single check (branch instruction) on use costs a single cycle, because it always returns true until the entity dies (so it's always predicted properly). When the entity dies, you pay a 20+ cycle misprediction penalty for each entity that holds a reference to it (the next time it tries to access it). In comparison, the listener-based solution causes a 600 cycle cache miss for each entity that needs to be informed of the death (and likely another cache miss for each entity that registers itself as a listener). The single cycle of a predicted branch can often be masked out entirely (executed in parallel with other instructions in separate pipelines), so it is the better performing option. And it also sidesteps all the problems of programmers occasionally forgetting to register listeners, or to send the proper messages around... Not that that has ever happened, heh heh
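A common way to implement that kind of non-pointer "null check" is a generation-counted handle; this is my guess at the shape, not necessarily what their codebase does. The check on use is a single comparison that is always true (and thus perfectly predicted) until the entity actually dies:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical handle: a slot index plus the generation it was issued
// with. Destroying an entity bumps the slot's generation, which makes
// every outstanding handle to it stale.
struct Handle {
    std::uint32_t index;
    std::uint32_t generation;
};

struct EntityPool {
    std::vector<std::uint32_t> generations;  // one counter per slot

    // The "null check" on use: one load + one compare + one branch.
    // Predicted correctly on every use until the entity dies.
    bool alive(Handle h) const {
        return generations[h.index] == h.generation;
    }

    void destroy(Handle h) {
        ++generations[h.index];  // invalidates all existing handles
    }
};
```

No listener registration, no death messages: a stale handle simply fails its check the next time it is used.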
Branchless constructs (for isel, fsel, min, max, etc) are often very good, but you have to be careful to use the correct bit shuffling tricks for each platform, as all modern CPUs are superscalar, but have different numbers of pipelines/ports for different instructions. But that's nothing a good inlined function that is #ifdef'd differently for each platform can't handle (of course it's also important to name these functions properly, and teach your programmers to use them). If some platform performs better with branches, you can of course easily change those select functions to use branches on that platform.
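A sketch of one such wrapped select function. The platform macro and native-intrinsic branch are placeholders; the portable fallback uses the classic sign-mask bit trick:

```cpp
#include <cstdint>

// Branchless integer select: returns a when cond >= 0, else b.
// Each platform can #ifdef in its best instruction sequence.
inline std::int32_t isel(std::int32_t cond, std::int32_t a, std::int32_t b) {
#if defined(PLATFORM_WITH_NATIVE_SELECT)   // placeholder macro
    return __native_isel(cond, a, b);      // placeholder intrinsic
#else
    // Arithmetic right shift replicates the sign bit: mask is 0 when
    // cond >= 0 and all-ones when cond < 0. (Implementation-defined for
    // signed types pre-C++20, but universal on game platforms.)
    std::int32_t mask = cond >> 31;
    return (a & ~mask) | (b & mask);
#endif
}
```

Callers write `isel(health - damage, aliveValue, deadValue)` style code and never see a branch; if profiling shows branches win on some platform, only this one function changes.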
Branchless constructs are more important for GPU programming, as 32/64 threads in the same warp need to have identical control flow. They are also important for VMX programming (do multiple simultaneous branchless operations in vector registers, and keep the results there).