Digital Foundry Article Technical Discussion Archive [2014]

Discussion in 'Console Technology' started by DieH@rd, Jan 11, 2014.

Thread Status:
Not open for further replies.
  1. Shortbread

    Shortbread Island Hopper
    Veteran

    Joined:
    Jul 1, 2013
    Messages:
    4,093
    Likes Received:
    2,316
    Fixed. :wink:
     
  2. ThePissartist

    Veteran Regular

    Joined:
    Jul 15, 2013
    Messages:
    1,559
    Likes Received:
    507
    Oh thanks, it's nice to know you think I'm gonna be bad at it too!! I hope you're playing so I can kick your arse. ;)
     
  3. Shortbread

    Shortbread Island Hopper
    Veteran

    Joined:
    Jul 1, 2013
    Messages:
    4,093
    Likes Received:
    2,316
    LOL...

Sad part is I purchased it but haven't gotten around to playing it. Somehow, I got stuck playing BloodRayne again (x1000) on the PC. :oops:
     
  4. shredenvain

    Regular

    Joined:
    Sep 12, 2013
    Messages:
    921
    Likes Received:
    189
    Location:
    Somewhere in southern U.S.
Great interview, Sebbi. One of the best dev interviews I have read on Digital Foundry in quite some time.
     
  5. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,805
    Likes Received:
    1,093
    Location:
    Guess...
    Yeah great interview. Tons of detailed and honest tech evaluation in there. I particularly enjoyed the sections about DX12 and the cross platform optimization possibilities. Interesting stuff in there about the previous version of Trials on PC as well and why it had some issues. I guess it's worth us PC gamers remembering when we get what is considered to be a poor console port just how much these games are optimised for the old console architectures.

    Even this generation it sounds like developers will have to tread carefully around the unified memory aspect even though other aspects of the console architectures are far more applicable to modern PC hardware.
     
  6. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,279
    Likes Received:
    3,527
Excellent read indeed. From the article:

So it seems the jump from Xenon to Jaguar isn't that big after all. Most gains are due to the wider FP execution units, no?
     
  7. TheWretched

    Regular

    Joined:
    Oct 7, 2008
    Messages:
    830
    Likes Received:
    23
I dunno... using more or less completely unoptimized code on a CPU that has half the clock rate seems quite impressive, no? I don't doubt that they'll allow for quite a bit more in the future, once "Jaguar code" is the norm.
     
  8. DSoup

    DSoup meh
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    12,494
    Likes Received:
    7,746
    Location:
    London, UK
    I'm assuming by code they are referring to code produced by a higher level language compiler, given PowerPC is nothing like 80x86.
     
  9. betan

    Veteran

    Joined:
    Jan 26, 2007
    Messages:
    2,315
    Likes Received:
    0
    What's wrong with a "lower level language compiler"? :)
     
  10. DSoup

    DSoup meh
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    12,494
    Likes Received:
    7,746
    Location:
    London, UK
    Ha! High/low depends on your perspective. If you write a lot in assembly everything else feels like high level ;)

    Perhaps sebbbi can clarify which language they used but I assume C or C++.
     
  11. chris1515

    Veteran Regular

    Joined:
    Jul 24, 2005
    Messages:
    4,679
    Likes Received:
    3,578
    Location:
    Barcelona Spain
It is the first Trials on Xbox One and PS4, and like sebbi said, the gap between first-generation and second-generation titles on PS4, Xbox One and PC will be huge.

They found the best solution to ship the title now. The next Trials will be much better...:grin: And having the same ISA on the two consoles and on PC for the 64-bit version is very good.
     
    #1191 chris1515, May 2, 2014
    Last edited by a moderator: May 2, 2014
  12. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    I wouldn't say that. The jump is significant when you take into account the code optimization and maintenance cost.

The PPC in-order CPU bottlenecks have been talked to death, but it's always good to look back to see how modern CPUs (including Jaguar) make our lives much easier.

    The 3 biggest time sinks when programming older in-order PPC CPUs:

    1. Lack of store forwarding hardware
Store buffer stalls (also known as load-hit-store stalls or LHS stalls) are the biggest pain in the ass when programming a CPU without store forwarding hardware. The stall is caused by the fact that memory writes must be delayed by ~40 cycles (because of buffering & possible branch misses). If the CPU reads a memory location that was just written to, it stalls for up to 40 cycles. These stalls are everywhere. C/C++ compilers are notorious for pushing things to the stack and reading the data back from the stack right afterwards (register spilling, function calls pushing parameters to the stack, read & modify on class variables in loops, etc). Normally you just want to hand optimize the most time critical part of your program (expert programmers are good at this), but LHS stalls affect every single piece of code, so you must teach every single junior programmer techniques to avoid them... or you are in trouble. LHS stalls are a huge time sink.

    Modern CPUs have robust store forwarding hardware. You no longer need to worry about this at all. Result: Lots of saved programmer time.

    2. Cache misses and lack of automatic data pre-fetching
The second largest time sink when writing code for older CPUs. Caches have been the most important thing for CPU performance for a long, long time. If the data you access is not in the L2 cache (a cache miss), you have to wait for up to 600 cycles. On old in-order CPUs the CPU does nothing during this time (you lose up to 600 instruction slots). Modern out-of-order CPUs can reorder some instructions to partially hide the memory stalls. Modern CPUs also have automatic data cache pre-fetchers that find patterns in your load addresses and load the cache lines you will likely access before you need them. Unfortunately the old PPC cores didn't have automated data pre-fetching hardware. You had to manually pre-fetch data, even for linear accesses (going through an array, for example). Again, every programmer must know this and add manual cache pre-fetch instructions to their code to avoid the up-to-600-cycle stalls.

Modern CPUs have robust, fully automatic data prefetching hardware that does a better job than a human almost every time, with no extra coding & maintenance cost. Modern CPUs also have larger caches (Jaguar has 2 MB per 4-core cluster) that are faster (lower load-to-use latency).

    3. Lack of out-of-order execution, register renaming + long pipeline
A long pipeline means that instructions have long latencies. Without register renaming, the same register cannot be reused while it is still in use by some instruction in the pipeline. Without out-of-order execution this results in lots of stalls. The main way to avoid these stalls is to unroll all tight loops. Unfortunately unrolling often needs to be done manually, and this takes time and leads to code that is hard to maintain and modify.

Modern Intel CPUs (and AMD Jaguar) have relatively short pipelines (and loop caches). All modern CPUs have out-of-order execution and register renaming. On these CPUs, loop unrolling often actually degrades performance instead of improving it (because of the extra instruction footprint). So the choice is clear: keep the code clean and let the compiler write a proper loop. Save a lot of time now, and even more time when you need to modify the existing code.

    Answer

Jaguar is a fully modern out-of-order CPU. It has good caches, good pre-fetchers and a fantastic branch predictor (which AMD later adopted for Steamroller, according to Real World Tech: http://www.realworldtech.com/jaguar/). With Jaguar, coders can focus on optimizing things that actually matter, instead of writing boilerplate "robot" optimizations throughout the vast code base.

Jaguar pushes through the huge majority of the old C/C++ code hand optimized for PPC without breaking a sweat. You can actually remove some of the old optimizations and make it even faster. Obviously, in vector processing loops you need to port the VMX128 intrinsics to AVX (they wouldn't even compile otherwise), but that's less than 1% of the code base. It's not that hard to port really, since the AVX instruction set is more robust (mostly it's a 1:1 mapping, and sometimes a single AVX instruction replaces two VMX128 instructions).

You asked me about the FP execution units. All I can say is that I am very happy that the Jaguar FP/SIMD execution units have super low latency. Most of the important instructions have just one or two cycles of latency. That's awesome compared to those old CPUs that had 12+ cycles of latency for most of the SIMD ALU operations. If you are interested in Jaguar, the AMD Family 16h Optimization Guide is freely available (download it from the AMD website). It includes an Excel sheet that lists all instruction latencies/throughputs. It's a good read if you are interested in comparing Jaguar's low level SIMD performance to other architectures.
     
  13. DSoup

    DSoup meh
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    12,494
    Likes Received:
    7,746
    Location:
    London, UK
    Nice explanation. Please do a YouTube channel - with sock puppets.
     
  14. bkilian

    Veteran

    Joined:
    Apr 22, 2006
    Messages:
    1,539
    Likes Received:
    3
There are a couple of gotchas to the Jaguar SIMD/FP pipeline. It doesn't have an FMADD, which can increase the instruction count for some workloads, and while it has more than double the number of units, they run at half the clock speed. The theoretical peak FP performance of an 8-core Jaguar is about the same as the theoretical peak FP performance of the XCPU. Generally, for most workloads the Jaguar is going to win handily, but if you had a beautiful, hand optimized audio filter on the XCPU, you're going to have a tough time replicating it on the Jaguar with the same performance.
     
  15. Lalaland

    Regular

    Joined:
    Feb 24, 2013
    Messages:
    596
    Likes Received:
    266
    Perhaps it's just me but I imagined you staring at said beautiful filter in Visual Studio and releasing a long, wistful sigh as I read this.
     
  16. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,805
    Likes Received:
    1,093
    Location:
    Guess...
    As I understand it, the peak SIMD performance of Xenon is 76.8 GFLOPS while the 6 Jaguar cores of the XB1 roll in at 88.8 GFLOPS.

    Not that that means an awful lot without knowing what the real world utilization is but it should give a decent idea of the relative potential.
     
  17. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,279
    Likes Received:
    3,527
Amazing, I didn't know x86 could save that much programmer time and effort; coding for those PPC CPUs must have been a real chore. Thanks for the enlightening answer.
     
  18. HTupolev

    Regular

    Joined:
    Dec 8, 2012
    Messages:
    936
    Likes Received:
    564
x86 has relatively little to do with it. It's an ISA, so it exerts substantial influence on the available registers and operations, but it doesn't have any strict control over the pipeline structure.

    You could design a Xenon-like x86 CPU that has a very deep pipeline and very little logic to compensate for it, if you wanted to. You could also design an x86 CPU that uses single-cycle execution a la small embedded microcontrollers, and which has very predictable and consistent (albeit also very low) performance as a result.
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,363
    Likes Received:
    3,944
    Location:
    Well within 3d
I would characterize it as the benefits of having a non-sucky CPU. There are PPC cores more capable than Jaguar, and nothing that Jaguar does much better than Xenon is inherent to x86.

    What's something of an outlier this generation is the fact that the design partner was willing to use so much of its crown-jewel IP, with a new CPU core and its best GPU tech.
    It's not a Bulldozer-derived core, but that might have been on the table, and skipping it looks to have been a good call.

    The original Xbox had an Intel Coppermine-derived CPU, which was a notch below its best but not bad.
That gap was still far smaller than the vast gulf in quality between Xenon and POWER5.
     
  20. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
Agreed. If you have a super optimized FMA-heavy vector crunching loop (heavily unrolled, of course, to utilize all 128 VMX registers) you will reach similar throughput on the XCPU (the whole CPU). In general, however, it's very hard to even reach an FMA utilization rate of 50% (pure linear algebra does, of course). The XCPU had a vector unit that was way better than any x86 CPU released during the last decade (and the VMX128 instruction set was awesome, except that it lacked int mul). But SSE3 -> AVX is a huge jump.

And Jaguar's unit in particular is nice, because the latencies are so low (and even the int mul is fast). On the old PPC cores you had to move data between vector<->scalar registers through memory (LHS stalls everywhere); on modern PC CPUs you have direct instructions for this (1 cycle vs ~40 cycles). This, combined with the low latency vector pipeline, allows you to use vector instructions pretty much everywhere. On the XCPU you had to separate your vector code into long unrolled loops and be extra careful that all instructions touching the data inside the loop were vector instructions (or pay the heavy LHS stall costs). That pretty much limited vector instructions to special cases.

Microsoft had a good presentation about AVX2 / FMA integration in Visual Studio. On the first try their FMA support reduced average performance, because FMA has higher latency than mul. For example, if you do two adds and a mul based on the two add results, the total latency is add + mul (as the two adds execute simultaneously). If you replace this with add + FMA, the latency will be add + FMA (since the FMA requires the add result before it can start, the two can't execute simultaneously). This is a general problem for instructions that require more inputs: the more inputs you need, the harder it is to execute other instructions concurrently.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.