Yay, very nice. Gonna go pick up your game today. I will probably suck.
Fixed.
Oh thanks, it's nice to know you think I'm gonna be bad at it too!! I hope you're playing so I can kick your arse.
Excellent read indeed, from the article:

Generally the new CPUs were running our old PowerPC-optimised code very well. We only had to rewrite a few VMX128-optimised loops using AVX instructions to allow a higher number of simultaneously active animations and physics objects. In the end we decided to double the complexity limits of our in-game editor compared to the Xbox 360 version, allowing users to build larger and more dynamic tracks on the next-generation consoles.

So it seems the jump from Xenon to Jaguar isn't that big after all. Most gains are due to the wider FP execution units, no?
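For a rough sense of scale (back-of-envelope, using the commonly quoted clocks and per-cycle throughputs, so treat the exact figures as illustrative rather than official): Xenon is 3 cores at 3.2 GHz with a 4-wide VMX128 fused multiply-add per cycle, i.e. 8 flops/cycle/core, so roughly 3 x 3.2 x 8 = ~77 GFLOPS peak. An 8-core Jaguar at ~1.6 GHz issuing a 4-wide multiply plus a 4-wide add per cycle is also 8 flops/cycle/core, so roughly 8 x 1.6 x 8 = ~102 GFLOPS peak. Same ballpark, which is why the practical wins tend to come from lower latencies, out-of-order execution and the friendlier instruction set rather than from raw peak FLOPS.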
I'm assuming by code they are referring to code produced by a higher level language compiler, given PowerPC is nothing like 80x86.
What's wrong with a "lower level language compiler"?

Ha! High/low depends on your perspective. If you write a lot in assembly, everything else feels like high level.
I dunno... using more or less completely unoptimized code on a CPU that has half the clock rate seems quite impressive, no? I don't doubt that they'll allow for quite a bit more in the future, when "Jaguar Code" will be used.
So it seems the jump from Xenon to Jaguar isn't that big after all. Most gains are due to the wider FP execution units, no?

I wouldn't say that. The jump is significant when you take into account the code optimisation and maintenance cost.
Obviously in vector-processing loops you need to port the VMX128 intrinsics to AVX (they wouldn't even compile otherwise), but that's less than 1% of the code base. It's not that hard to port, really, since the AVX instruction set is more robust (mostly it's a 1:1 mapping, and sometimes a single AVX instruction replaces two VMX128 instructions).
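To make the mapping point concrete, a hypothetical sketch of both directions (made-up code, not the actual Trials Fusion loops; the intrinsics are the standard SSE/AVX ones, and the VMX128 side is described from memory in the comments):

#include <immintrin.h>

// The VMX128/Altivec original would be roughly d = vec_madd(a, b, c),
// a single fused multiply-add. Jaguar has no FMA, so the straight
// SSE/AVX translation is two instructions, mul + add:
static inline __m128 madd_ps(__m128 a, __m128 b, __m128 c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}

// The opposite direction: as far as I remember VMX128 had no unaligned
// vector load, so an unaligned read took two aligned loads plus permute
// work (lvsl/lvx/vperm). On x86 it is a single instruction:
static inline __m128 load_unaligned_ps(const float* p)
{
    return _mm_loadu_ps(p);   // one vmovups
}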
You asked me about the FP execution units. All I can say is that I am very happy that the Jaguar FP/SIMD execution units have super low latency. Most of the important instructions have just one or two cycles of latency. That's awesome compared to those old CPUs that had 12+ cycles of latency for most of the SIMD ALU operations. If you are interested in Jaguar, the AMD Family 16h Software Optimization Guide is freely available (download it from the AMD website). It includes an Excel sheet that lists all instruction latencies/throughputs. It's a good read if you are interested in comparing Jaguar's low-level SIMD performance to other architectures.
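A toy illustration of why that latency matters (hypothetical code; the cycle counts are the one-or-two versus 12+ figures quoted above, not fresh measurements):

#include <immintrin.h>

// Latency-bound loop: every iteration's min depends on the previous one,
// so the loop-carried dependency chain, not the number of SIMD pipes,
// sets the speed. With ~1-2 cycle latencies (Jaguar) this is cheap to
// sprinkle anywhere; with 12+ cycle latencies (old PPC vector units)
// the chain dominates. Assumes n is a multiple of 4 and n >= 4.
static inline __m128 running_min(const float* data, int n)
{
    __m128 m = _mm_loadu_ps(data);
    for (int i = 4; i < n; i += 4)
        m = _mm_min_ps(m, _mm_loadu_ps(data + i));   // carried dependency on m
    return m;
}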
... Amazingly detailed and useful info deleted, coz that's how I roll ...

There are a couple of gotchas to the Jaguar SIMD/FP pipeline. It doesn't have FMADD, which can increase the instruction count for some loads, and while the eight Jaguar cores give it more than double the number of units, they run at half the clock speed. The theoretical peak FP performance of an 8-core Jaguar is about the same as the theoretical peak FP performance of the XCPU. For most workloads the Jaguar is going to win handily, but if you had a beautiful, hand-optimized audio filter on the XCPU, you're going to have a tough time replicating it on the Jaguar with the same performance.
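A made-up inner loop to illustrate the FMADD point (hypothetical, not anyone's shipping filter code):

#include <immintrin.h>

// Filter-style accumulation, 4 taps at a time. On VMX128 each iteration's
// multiply-add is a single vmaddfp; without FMA on Jaguar it is mulps + addps,
// so the FP instruction count of the loop body roughly doubles.
// Assumes n is a multiple of 4.
static inline __m128 accumulate(const float* samples, const float* coeffs, int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
    {
        __m128 s = _mm_loadu_ps(samples + i);
        __m128 c = _mm_loadu_ps(coeffs + i);
        acc = _mm_add_ps(_mm_mul_ps(s, c), acc);   // two instructions instead of one FMA
    }
    return acc;
}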
Amazing, didn't know x86 could save that much programmer time and effort; coding for those PPC CPUs must have been a real chore. Thanks for the enlightening answer.

The PPC in-order CPU bottlenecks have been talked to death, but it's always good to look back and see how the modern CPUs (including Jaguar) make our life much easier.

x86 has relatively little to do with it. It's an ISA, so it has a substantial influence on the available registers and operations, but it doesn't have any strict control over the pipeline structure.
For most workloads the Jaguar is going to win handily, but if you had a beautiful, hand-optimized audio filter on the XCPU, you're going to have a tough time replicating it on the Jaguar with the same performance.

Agreed. If you have a super-optimized, FMA-heavy vector crunching loop (heavily unrolled, of course, to utilize all 128 VMX registers) you will reach similar throughput on the XCPU (the whole CPU). In general, however, it's very hard to even reach an FMA utilization rate of 50% (pure linear algebra does, of course). The XCPU had a vector unit that was way better than any x86 CPU released during the last decade (and the VMX128 instruction set was awesome, except that it lacked integer multiply). But SSE3 -> AVX is a huge jump. And Jaguar's unit in particular is nice, because the latencies are so low (and even the integer multiply is fast). On the old PPC cores you had to move data between vector and scalar registers through memory (LHS stalls everywhere); on modern PC CPUs you have direct instructions for this (1 cycle vs ~40 cycles). This, combined with the low-latency vector pipeline, allows you to use vector instructions pretty much everywhere. On the XCPU you had to separate your vector code into long unrolled loops and be extra careful that all instructions touching the data inside the loop were vector instructions (or pay the heavy LHS stall costs). That pretty much limited vector instructions to special cases.
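To put the vector<->scalar point in code (a made-up example; the cycle counts are the rough 1 vs ~40 figures mentioned above, not fresh measurements):

#include <immintrin.h>

// Modern x86: direct moves between vector and scalar registers.
static inline float first_lane(__m128 v)
{
    // scalar floats live in XMM registers on x86-64 anyway, so this is essentially free
    return _mm_cvtss_f32(v);
}

static inline int first_lane_i(__m128i v)
{
    // direct vector -> GPR move (movd), a cycle or so of latency
    return _mm_cvtsi128_si32(v);
}

// The old PPC-era pattern was effectively this memory round trip:
static inline float first_lane_ppc_style(__m128 v)
{
    alignas(16) float tmp[4];
    _mm_store_ps(tmp, v);   // store the vector to memory...
    return tmp[0];          // ...then reload as scalar (load-hit-store stall, ~40 cycles on those cores)
}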