Digital Foundry Article Technical Discussion Archive [2014]

Discussion in 'Console Technology' started by DieH@rd, Jan 11, 2014.

Thread Status:
Not open for further replies.
  1. Shortbread

    Shortbread Island Hopper
    Veteran

    Joined:
    Jul 1, 2013
    Messages:
    4,093
    Likes Received:
    2,316
    Fixed. :wink:
     
  2. ThePissartist

    Veteran Regular

    Joined:
    Jul 15, 2013
    Messages:
    1,559
    Likes Received:
    507
    Oh thanks, it's nice to know you think I'm gonna be bad at it too!! I hope you're playing so I can kick your arse. ;)
     
  3. Shortbread

    Shortbread Island Hopper
    Veteran

    Joined:
    Jul 1, 2013
    Messages:
    4,093
    Likes Received:
    2,316
    LOL...

Sad part is I purchased it but haven't gotten around to playing it. Somehow, I got stuck playing BloodRayne again (x1000) on the PC. :oops:
     
  4. shredenvain

    Regular

    Joined:
    Sep 12, 2013
    Messages:
    921
    Likes Received:
    189
    Location:
    Somewhere in southern U.S.
Great interview, Sebbi. One of the best dev interviews I have read on Digital Foundry in quite some time.
     
  5. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,805
    Likes Received:
    1,093
    Location:
    Guess...
    Yeah great interview. Tons of detailed and honest tech evaluation in there. I particularly enjoyed the sections about DX12 and the cross platform optimization possibilities. Interesting stuff in there about the previous version of Trials on PC as well and why it had some issues. I guess it's worth us PC gamers remembering when we get what is considered to be a poor console port just how much these games are optimised for the old console architectures.

    Even this generation it sounds like developers will have to tread carefully around the unified memory aspect even though other aspects of the console architectures are far more applicable to modern PC hardware.
     
  6. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,279
    Likes Received:
    3,527
Excellent read indeed. From the article:

So it seems the jump from Xenon to Jaguar isn't that big after all. Most gains are due to the wider FP execution units, no?
     
  7. TheWretched

    Regular

    Joined:
    Oct 7, 2008
    Messages:
    830
    Likes Received:
    23
I dunno... using more or less completely unoptimized code on a CPU that has half the clock rate seems quite impressive, no? I don't doubt that they'll allow for quite a bit more in the future, once "Jaguar code" is the norm.
     
  8. DSoup

    DSoup meh
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    12,494
    Likes Received:
    7,746
    Location:
    London, UK
    I'm assuming by code they are referring to code produced by a higher level language compiler, given PowerPC is nothing like 80x86.
     
  9. betan

    Veteran

    Joined:
    Jan 26, 2007
    Messages:
    2,315
    Likes Received:
    0
    What's wrong with a "lower level language compiler"? :)
     
  10. DSoup

    DSoup meh
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    12,494
    Likes Received:
    7,746
    Location:
    London, UK
    Ha! High/low depends on your perspective. If you write a lot in assembly everything else feels like high level ;)

    Perhaps sebbbi can clarify which language they used but I assume C or C++.
     
  11. chris1515

    Veteran Regular

    Joined:
    Jul 24, 2005
    Messages:
    4,679
    Likes Received:
    3,578
    Location:
    Barcelona Spain
It is the first Trials on Xbox One and PS4, and like sebbi said, the gap between first-generation and second-generation titles on PS4, Xbox One and PC will be huge.

They found the best solution to ship the title now. The next Trials will be much better...:grin: And having the same ISA on the two consoles and on PC for the 64-bit version is very good.
     
    #1191 chris1515, May 2, 2014
    Last edited by a moderator: May 2, 2014
  12. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    I wouldn't say that. The jump is significant when you take into account the code optimization and maintenance cost.

The PPC in-order CPU bottlenecks have been talked to death, but it's always good to look back to see how modern CPUs (including Jaguar) make our lives much easier.

    The 3 biggest time sinks when programming older in-order PPC CPUs:

    1. Lack of store forwarding hardware
Store buffer stalls (also known as load-hit-store stalls or LHS stalls) are the biggest pain in the ass when programming a CPU without store forwarding hardware. The stall is caused by the fact that memory writes must be delayed by ~40 cycles (because of buffering & possible branch misses). If the CPU reads a memory location that was just written to, it stalls for up to 40 cycles. These stalls are everywhere. C/C++ compilers are notorious for pushing things to the stack and reading the data back from the stack right afterwards (register spilling, function calls pushing parameters to the stack, read & modify on class variables in loops, etc). Normally you just want to hand optimize the most time critical part of your program (expert programmers are good at this), but LHS stalls affect every single piece of code, so you must teach every single junior programmer techniques to avoid them... or you are in trouble. LHS stalls are a huge time sink.

    Modern CPUs have robust store forwarding hardware. You no longer need to worry about this at all. Result: Lots of saved programmer time.

    2. Cache misses and lack of automatic data pre-fetching
The second largest time sink when writing code for older CPUs. Caches have been the most important thing for CPU performance for a long, long time. If the data you access is not in the L2 cache (a cache miss), you have to wait for up to 600 cycles. On old in-order CPUs the CPU does nothing during this time (you lose up to 600 instruction slots). Modern out-of-order CPUs can reorder some instructions to partially hide the memory stalls. Modern CPUs also have automatic data cache pre-fetchers that find patterns in your load addresses and load the cache lines you will likely access before you need them. Unfortunately the old PPC cores didn't have automated data pre-fetching hardware. You had to manually pre-fetch data, even for linear accesses (going through an array, for example). Again, every programmer must know this and add manual cache pre-fetch instructions to their code to avoid the up-to-600-cycle stalls.

Modern CPUs have robust, fully automatic data prefetching hardware that does a better job than a human almost every time, with no extra coding & maintenance cost. Modern CPUs also have larger caches (Jaguar has 2 MB per 4-core cluster) that are faster (lower load-to-use latency).

    3. Lack of out-of-order execution, register renaming + long pipeline
A long pipeline means that instructions have long latencies. Without register renaming, the same register cannot be reused while it is still in use by some instruction in the pipeline. Without out-of-order execution this results in lots of stalls. The main way to avoid these stalls is to unroll all tight loops. Unfortunately unrolling often needs to be done manually, and this takes time and leads to code that is hard to maintain and modify.

Modern Intel CPUs (and AMD Jaguar) have relatively short pipelines (and loop caches). All modern CPUs have out-of-order execution and register renaming. On these CPUs, loop unrolling often actually degrades performance instead of improving it (because of the extra instruction footprint). So the choice is clear: keep the code clean and let the compiler write a proper loop. Save a lot of time now, and even more time when you need to modify the existing code.

    Answer

Jaguar is a fully modern out-of-order CPU. It has good caches, good pre-fetchers and a fantastic branch predictor (which AMD later adopted for Steamroller, according to Real World Tech: http://www.realworldtech.com/jaguar/). With Jaguar, coders can focus on optimizing things that actually matter, instead of writing boilerplate "robot" optimizations throughout the vast code base.

Jaguar pushes through the huge majority of the old C/C++ code hand optimized for PPC without breaking a sweat. You can actually remove some of the old optimizations and make it even faster. Obviously, in vector processing loops you need to port the VMX128 intrinsics to AVX (they wouldn't even compile otherwise), but that's less than 1% of the code base. It's not that hard to port really, since the AVX instruction set is more robust (mostly it's a 1:1 mapping, and sometimes a single AVX instruction replaces two VMX128 instructions).

You asked me about the FP execution units. All I can say is that I am very happy that the Jaguar FP/SIMD execution units have super low latency. Most of the important instructions have just one or two cycles of latency. That's awesome compared to those old CPUs that had 12+ cycles of latency for most of the SIMD ALU operations. If you are interested in Jaguar, the AMD Family 16h Optimization Guide is freely available (download it from the AMD website). It includes an Excel sheet that lists all instruction latencies/throughputs. It's a good read if you are interested in comparing Jaguar's low level SIMD performance to other architectures.
     
  13. DSoup

    DSoup meh
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    12,494
    Likes Received:
    7,746
    Location:
    London, UK
    Nice explanation. Please do a YouTube channel - with sock puppets.
     
  14. bkilian

    Veteran

    Joined:
    Apr 22, 2006
    Messages:
    1,539
    Likes Received:
    3
There are a couple of gotchas to the Jaguar SIMD/FP pipeline. It doesn't have an FMADD, which can increase the instruction count for some workloads, and while it has more than double the number of units, they run at half the clock speed. The theoretical peak FP performance of an 8-core Jaguar is about the same as the theoretical peak FP performance of the XCPU. Generally, for most workloads the Jaguar is going to win handily, but if you had a beautiful, hand optimized audio filter on the XCPU, you're going to have a tough time replicating it on the Jaguar with the same performance.
     
  15. Lalaland

    Regular

    Joined:
    Feb 24, 2013
    Messages:
    596
    Likes Received:
    266
    Perhaps it's just me but I imagined you staring at said beautiful filter in Visual Studio and releasing a long, wistful sigh as I read this.
     
  16. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,805
    Likes Received:
    1,093
    Location:
    Guess...
    As I understand it, the peak SIMD performance of Xenon is 76.8 GFLOPS while the 6 Jaguar cores of the XB1 roll in at 88.8 GFLOPS.

    Not that that means an awful lot without knowing what the real world utilization is but it should give a decent idea of the relative potential.
     
  17. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,279
    Likes Received:
    3,527
Amazing, I didn't know x86 could save that much programmer time and effort; coding for those PPC CPUs must have been a real chore. Thanks for the enlightening answer.
     
  18. HTupolev

    Regular

    Joined:
    Dec 8, 2012
    Messages:
    936
    Likes Received:
    564
x86 has relatively little to do with it. It's an ISA, so it exerts substantial influence on the available registers and operations, but it doesn't have any strict control over the pipeline structure.

    You could design a Xenon-like x86 CPU that has a very deep pipeline and very little logic to compensate for it, if you wanted to. You could also design an x86 CPU that uses single-cycle execution a la small embedded microcontrollers, and which has very predictable and consistent (albeit also very low) performance as a result.
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,363
    Likes Received:
    3,944
    Location:
    Well within 3d
I would characterize it as the benefits of having a non-sucky CPU. There are PPC cores more capable than Jaguar, and nothing that Jaguar does much better than Xenon is inherent to x86.

    What's something of an outlier this generation is the fact that the design partner was willing to use so much of its crown-jewel IP, with a new CPU core and its best GPU tech.
    It's not a Bulldozer-derived core, but that might have been on the table, and skipping it looks to have been a good call.

    The original Xbox had an Intel Coppermine-derived CPU, which was a notch below its best but not bad.
That gap was still far smaller than the vast gulf in quality between Xenon and POWER5.
     
  20. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
Agreed. If you have a super optimized FMA-heavy vector crunching loop (heavily unrolled, of course, to utilize all 128 VMX registers) you will reach similar throughput on the XCPU (the whole CPU). In general, however, it's very hard to even reach an FMA utilization rate of 50% (pure linear algebra does, of course). The XCPU had a vector unit that was way better than any x86 CPU released during the last decade (and the VMX128 instruction set was awesome, except that it lacked int mul). But SSE3 -> AVX is a huge jump.

And Jaguar's unit in particular is nice, because the latencies are so low (and even the int mul is fast). On the old PPC cores you had to move data between vector<->scalar registers through memory (LHS stalls everywhere); on modern PC CPUs you have direct instructions for this (1 cycle vs ~40 cycles). This, combined with the low latency vector pipeline, allows you to use vector instructions pretty much everywhere. On the XCPU you had to separate your vector code into long unrolled loops and be extra careful that all instructions touching the data inside the loop were vector instructions (or pay the heavy LHS stall costs). That pretty much limited vector instructions to special cases.

Microsoft had a good presentation about AVX2 / FMA integration in Visual Studio. On the first try their FMA support reduced average performance, because FMA has higher latency than mul. For example, if you do two adds and a mul based on the two add results, the total latency is add + mul (as the two adds execute simultaneously). If you replace this with add + FMA, the latency will be add + FMA (since the FMA requires the add result before it can start, the two can't execute simultaneously). This is a general problem for instructions that require more inputs: the more inputs you need, the harder it is to execute other instructions concurrently.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.