Gamefest 2008 console edition

Discussion in 'Console Technology' started by liolio, Jul 23, 2008.

  1. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    194
    Location:
    Stateless
    I think it would be interesting to discuss the parts of this event that are related to the 360.

    Gamefest 2008 program

    Here are some parts that sound interesting. I hope some of the devs here will be able to comment after the show (or before ;) )

    Xbox 360 Compiler and PgoLite Update:
    The Xbox 360 compiler has changed dramatically in the last year, which changes the rules for how to write efficient Xbox 360 code. Many of the improvements automatically make your code faster, but others require you to change your code in order to reap the benefits. PgoLite has also improved and should be used differently, to get even better results. This talk summarizes the past year's developments, and gives simple rules for how to get maximum benefit from the changes.

    New CPU Features in the XDK:
    The tools in the XDK continue to improve. New profiling features make it easier to find out where time is being wasted in your code, and lock-free primitives make efficient multithreaded code easier to write. The talk covers enhancements to trace analysis, CPU performance counters, system monitor, and timing captures, and discusses how to use LockFreePipe and XMCore.

    Microsoft Directions in Parallel Computing and Some Short Term Help:
    This talk focuses on the native task scheduler being announced by the Parallel Computing Platform group in Microsoft this spring and offerings that are available in the XDK. The scheduling of tasks within games can improve resource utilization, load balancing, and performance. For games targeting the current generation of PCs and the Xbox 360 console, we discuss an interim solution. Previous talks given on this topic laid the foundation for using tasks to move work required by the engine from an over-utilized hardware core to an underutilized core. A progression of task and scheduler designs is presented that start with simple implementations and move to more complex designs that have low-overhead. The simple implementations are satisfactory for a small number of tasks but impose a prohibitive amount of overhead when the number of tasks grows. Finally, we present the work-stealing algorithm that pulls work from one core to another in the low-overhead scheduler.
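    A minimal sketch of the work-stealing idea this session describes (my own illustration, not Microsoft's scheduler): each worker pops tasks LIFO from its own deque and, when it runs dry, steals FIFO from another worker's deque. Mutexes stand in for the lock-free deques a genuinely low-overhead scheduler would use, and tasks are assumed to all be queued up front and not spawn new tasks.

```cpp
#include <atomic>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

struct Worker {
    std::deque<std::function<void()>> tasks;
    std::mutex m;  // a real scheduler would use a lock-free deque instead
};

void run_all(std::vector<Worker>& workers) {
    std::vector<std::thread> threads;
    for (size_t self = 0; self < workers.size(); ++self) {
        threads.emplace_back([&, self] {
            for (;;) {
                std::function<void()> task;
                {   // pop from the back of our own deque (LIFO: cache-warm)
                    std::lock_guard<std::mutex> g(workers[self].m);
                    if (!workers[self].tasks.empty()) {
                        task = std::move(workers[self].tasks.back());
                        workers[self].tasks.pop_back();
                    }
                }
                if (!task) {  // own deque dry: steal from a victim's front
                    for (size_t v = 0; v < workers.size() && !task; ++v) {
                        if (v == self) continue;
                        std::lock_guard<std::mutex> g(workers[v].m);
                        if (!workers[v].tasks.empty()) {
                            task = std::move(workers[v].tasks.front());
                            workers[v].tasks.pop_front();
                        }
                    }
                }
                if (!task) return;  // every deque drained: we are done
                task();
            }
        });
    }
    for (auto& t : threads) t.join();
}
```

    Queueing all the tasks on one worker forces the others to steal; the LIFO/FIFO split keeps owners on recently pushed (cache-warm) tasks while thieves take the oldest ones.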

    Multi-Threaded Rendering for Games:
    One core is just not enough for graphics anymore—rendering tasks often have to run in parallel to hit the target framerate and hide latent operations. This talk includes best practices, pitfalls to avoid, and a range of design patterns for implementing multithreaded rendering on today’s platforms, including Direct3D 10 and Xbox 360. We cover everything from batch submission to resource management and discuss future plans for greater flexibility and higher performance when rendering on multiple threads.
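    The batch-submission pattern mentioned here can be sketched API-agnostically (illustrative names only, no real Direct3D calls): worker threads record commands into private lists in parallel, then a single thread replays them in a fixed order, which is roughly the shape deferred contexts / command buffers give you.

```cpp
#include <string>
#include <thread>
#include <vector>

using Command = std::string;  // stand-in for a real GPU command

// Recording is per-thread: no shared state, so no locks are needed here.
std::vector<Command> record_batch(int first, int count) {
    std::vector<Command> list;
    for (int i = first; i < first + count; ++i)
        list.push_back("draw:" + std::to_string(i));
    return list;
}

std::vector<Command> render_frame(int batches_per_thread, int num_threads) {
    std::vector<std::vector<Command>> lists(num_threads);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)  // record in parallel
        threads.emplace_back([&, t] {
            lists[t] = record_batch(t * batches_per_thread, batches_per_thread);
        });
    for (auto& th : threads) th.join();
    std::vector<Command> submitted;        // submit in deterministic order
    for (auto& l : lists)
        submitted.insert(submitted.end(), l.begin(), l.end());
    return submitted;
}
```

    The key property is that submission order is independent of thread scheduling, so the frame is reproducible no matter how recording interleaves.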

    Direct3D Samples Update:
    Come see the latest developments from the DirectX and Xbox 360 samples team. This presentation is a deep-dive into the inner workings of recent and upcoming graphics samples in the DirectX SDK and Xbox 360 XDK. Techniques discussed include deferred particle rendering, Xbox 360 geometry instancing, edge-based antialiasing, and more.


     
  2. liolio

    liolio Aquoiboniste
    Legend

    I'll add a link to this discussion in the 3D Technology and Algorithms section so at least people can look for it ;)

    http://forum.beyond3d.com/showthread.php?t=49164

    By the way, if some devs want to discuss some 360-specific stuff, they are welcome.

    For example, if one can comment on this:
    Finally, we present the work-stealing algorithm that pulls work from one core to another in the low-overhead scheduler.

    It would be a great departure from the way some engines work (the Capcom Framework engine comes to mind, for example).
     
  3. NRP

    NRP
    Veteran

    Joined:
    Aug 26, 2004
    Messages:
    2,712
    Likes Received:
    293
    I wonder if this is similar to the "job queue" model that PS3 devs (DeanoC, et al) have talked about?
     
  4. silhouette

    Regular

    Joined:
    Mar 13, 2003
    Messages:
    524
    Likes Received:
    3
    This is pretty interesting, but given that the presentations from similar sessions never showed up on the DirectX website, I doubt we will see this one either. However, the bolded part is pretty interesting. It means we haven't seen what the 360 can do in the gfx department yet :)

     
  5. liolio

    liolio Aquoiboniste
    Legend

    EDIT: I'm answering NRP.

    I'm not sure; it's a step forward from the task-based scheduling that we find in the Framework engine, for sure (and I guess in others too; does somebody know how EU III schedules its different tasks?).
    From my understanding, the "job queue" model is kind of task-based. Jobs are assigned to SPEs; can an SPE steal jobs from the queue of another SPE?
    Anyway, it could be something in that vein, whether we use the word job, task, or block.
    It hints that different workloads may no longer be handled through threads but through something more refined.

    I wonder if MS could be moving to something close to Intel's research.
    Rapso spoke about Intel Threading Building Blocks and gave links in the "next-generation..." thread.
    Here is the link:
    http://softwarecommunity.intel.com/isn/downloads/intel_tbb_ch01_for_promo.pdf
    It's just the intro but worth a read ;)
    Especially the last pages (5 & 6, where they speak about work-stealing algorithms, task splitting, etc.).

    I also think that this give us another hint:
    Xbox 360 Compiler and PgoLite Update:
    The Xbox 360 compiler has changed dramatically in the last year, which changes the rules for how to write efficient Xbox 360 code. Many of the improvements automatically make your code faster, but others require you to change your code in order to reap the benefits. PgoLite has also improved and should be used differently, to get even better results. This talk summarizes the past year's developments, and gives simple rules for how to get maximum benefit from the changes.

    OK, MS improved the compiler, but they hint at new "best practices" for coders.

    I really wonder if all of this is related.

    But I'm an impostor, my knowledge is thin ;) so devs, give us some help :)
     
    #5 liolio, Jul 24, 2008
    Last edited by a moderator: Jul 24, 2008
  6. liolio

    liolio Aquoiboniste
    Legend

    I'm really starting to wonder if my last post is stupid or too far off from reality? :oops:
    Come on, somebody back me up... I feel really alone right now... :lol:
     
  7. liolio

    liolio Aquoiboniste
    Legend

    Yes, this is pretty interesting too; supposedly MS wanted a CPU just good enough to saturate the graphics subsystem.
    So far it seems it doesn't completely deliver on its promises.

    And that's also why I would like to know more about the presentations I quoted in my other posts.
     
  8. chachi

    Newcomer

    Joined:
    Sep 15, 2004
    Messages:
    120
    Likes Received:
    3
    I wouldn't say it like that; you don't know what "CPU limited" encompasses until you know why any particular task is CPU limited. What if they had to spend a lot of time dealing with compression: is that a CPU problem or a RAM/DVD one? Not to mention the well-publicized issues both "next gen" CPUs have running code that isn't optimized for them. That said, there's no such thing as a processor that's too fast. :)

    It would be interesting to hear about the presentations but they're probably all under NDA.
     
  9. liolio

    liolio Aquoiboniste
    Legend

    #9 liolio, Aug 29, 2008
    Last edited by a moderator: Nov 18, 2008
  10. liolio

    liolio Aquoiboniste
    Legend

    I found this part too funny (about blocking on thread sync, from the "at least we aren't..." presentation):
    if (thisProject.NearEndOfProject(cEpsilon))
    {
        thisProject.ToOptimizeElsewhere();
        thisTeam.CrossFingers();
        engineTeam.QueueTask(RedesignMulticoreSystem);
    }

    :lol:
    Other funny bit:
    All threading models are equal
    - Well... yes and no
    - Some are much more equal than others
    :lol:
     
    #10 liolio, Nov 17, 2008
    Last edited by a moderator: Nov 17, 2008
  11. liolio

    liolio Aquoiboniste
    Legend

    From this presentation:
    "at least we aren't doing that"

    How do they get that games typically use 0.2 instructions per cycle?

    I'm not questioning the validity; I'm just curious.
    According to this, a game running at 30 FPS has about 20 million instructions per frame to play with.
    EDIT: This is even more interesting if you consider the statements made about the % of CPU used; basically, even at 100% occupation you are still aiming at 100% of 20%?
    (I mean making the most of the average 0.2 IPC the CPU will push.)
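    The back-of-the-envelope figure above can be checked (the assumption that the 0.2 IPC applies to a single 3.2 GHz Xenon core is mine):

```cpp
// Sanity-checking the "20 million instructions per frame" figure.
// Assumption (mine): the 0.2 IPC applies per 3.2 GHz core.
long long instructions_per_frame(double clock_hz, double ipc, double fps) {
    return static_cast<long long>(clock_hz * ipc / fps);
}
// 3.2e9 * 0.2 / 30 comes to about 21.3 million instructions per frame
// per core, i.e. roughly the 20 million quoted; with three cores,
// roughly 64 million per frame for the whole CPU.
```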

    Other than that, the presentations have been public for a while but we still haven't read any dev reactions; are the NDAs strict on the subject?

    I (and others, I guess) would be interested in reading reactions/prospects regarding MS's latest efforts.
     
    #11 liolio, Nov 17, 2008
    Last edited by a moderator: Nov 18, 2008
  12. assen

    Veteran

    Joined:
    May 21, 2003
    Messages:
    1,377
    Likes Received:
    19
    Location:
    Skirts of Vitosha
    Statements about CPU usage are pure bullshit, plain and simple.

    There's no clear line of demarcation between a game utilizing 100% of the CPU for a good reason, and a game calculating digits of pi 99% of the frame, and running at 1 fps.
     
  13. liolio

    liolio Aquoiboniste
    Legend

    In fact I think the guy explained where the value comes from in the audio presentation. Sadly I wasn't able to understand this part for some reason (English issues).
    I found the statement way less PR-oriented than the usual percentages.
    Percentages mean nothing, I agree; a Cell used @100% could mean 14 instructions issued per cycle... which is optimistic :lol:
    But I don't think the guy is speaking about what we could call CPU utilisation as in the Windows task manager, but rather about the average IPC.
    In fact, I think his value could be pretty relevant. Maybe the usual combination of workloads and their nature means it's not possible to extract more than 0.2 instructions per cycle (on average) from typical game code?

    On top of that, what you do with this number of "useful" instructions per frame is another story; as you pointed out, you can push useless or useful code, and either way instructions are issued at the average rate of 0.2 per cycle.


    I will try to listen again and catch what the guy says at the end (while answering a question in this regard).
    ----------------------------------

    I found an interesting link about performance issues on Xenon (in fact mentioned in the MS guy's presentation):
    http://www.gamasutra.com/view/feature/3687/sponsored_feature_common_.php
     
  14. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    21,757
    Likes Received:
    7,441
    Location:
    ಠ_ಠ
    Thanks for the link lio :)
     
  15. cullenskink

    Newcomer

    Joined:
    Jun 21, 2007
    Messages:
    3
    Likes Received:
    0
    > how do they get that games usually use 0.2 instruction per cycle?

    It's not that games use 0.2 IPC; it's that game titles running on the in-order processor of the Xbox 360 *typically* achieve that rate. Forget the notion that people are "wasting" 1.8 IPC or something.

    The absolute max is 2.0 IPC - 2 hardware threads going full pelt. But for this to happen, you have to have two instruction streams that:
    - never contend with each other for execution hardware
    - never have any data or structural hazards internal to the instruction stream
    - never hit any other kind of pipeline penalty

    For various reasons, the in-order CPU will cause you problems. Use the fsqrt function? No problem, there's a pipeline stall. Load a value you very recently stored? Er... ok 30 cycle stall while I wait on L1 being consistent.

    And memory.... load some data not in L2 cache? 610 cycles wait.
    So this instruction:
    lwz r11, 0(r31)

    Can have an IPC of 1/610, or about 0.00164 (obviously, measuring IPC on one instruction is kind of silly). 0.2 seems quite likely all of a sudden.
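    A toy model makes this arithmetic concrete (the 610-cycle miss cost is from above; the 1.0 base IPC and the 0.5% miss rate are illustrative assumptions, not measurements):

```cpp
// Average cycles-per-instruction when some fraction of instructions
// stalls: CPI = 1/base_ipc + miss_rate * penalty_cycles.
double effective_ipc(double base_ipc, double miss_rate, double penalty) {
    double cpi = 1.0 / base_ipc + miss_rate * penalty;
    return 1.0 / cpi;
}
// With base IPC 1.0 and just 0.5% of instructions missing L2 at 610
// cycles: CPI = 1 + 0.005 * 610 = 4.05, i.e. effective IPC ~= 0.247.
```

    Even a tiny miss rate dominates the average, which is why 0.2 stops looking surprising.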

    > Statements about CPU usage are pure bullshit, plain and simple.

    Here comes the science: in this case, the 0.2 IPC figure comes from a set of empirical data gathered from real games on 360 that the XDC team has. So this is not bullshit; it's an actual measure of the typical figure games get. For some types of tasks you'll go higher, for some way lower. Ditto if you're paying attention vs. not paying attention. Who cares about IPC in the menus? But in your frustum cull... oh yeah.

    However, your point that any given CPU usage figure says nothing about what the CPU is actually doing in terms of useful work is a good one. If someone says "oh yeah, I max out the CPU", that doesn't mean it's necessarily doing anything useful or doing anything remotely efficiently. But that's not where the 0.2 IPC thing is aiming.

    Allan
     
  16. NRP

    NRP
    Veteran

    That X360 CPU programming article was interesting. It seems really easy to write code that triggers the load-hit-store penalty. I guess that's why Barbarian was so pissed off about the x360 CPU in that old thread.

    I wonder if the load-hit-store latency problem can be "fixed" in the nextbox (assuming they go PowerPC again), or if it's just a characteristic of the PowerPC architecture?
     
  17. Barbarian

    Regular

    Joined:
    Jun 27, 2005
    Messages:
    289
    Likes Received:
    15
    Location:
    California, USA
    You bet they can fix it in hardware. Intel does it all the time; it's called store queue snooping if I'm not mistaken. It costs some transistors, obviously, but without it the CPU is severely crippled, especially a 20-year-old design like PPC, with all its different register files and its constant need to move things via memory.
     
  18. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    You can easily saturate the graphics subsystem with just one CPU core (2 SMT threads + vector unit). The game's CPU bottleneck is rarely in the graphics rendering department; game logic, physics, sound, data streaming/decompression, etc. take a lot of CPU time. When looking at larger-scale studies like this, it's important to notice that only a small minority of releases are super-high-budget AAA releases. Small teams do not have time to properly optimize all their game code, as the few game logic / menu programmers are very much occupied with the game feature set alone. Often only the graphics engine is fully optimized, as graphics engine developers tend to have good hardware knowledge and optimization skills. This is often why the bottleneck is not on the GPU side.

    All idle GPU time should be considered wasted resources on Xbox 360. With a little bit of extra work you can port many of the massively parallel CPU algorithms to the GPU. With memexport/UMA the GPU can act as an ultra-wide vector processor. Things like particle animation, particle sorting, particle geometry creation, object frustum culling (viewports/shadowmaps), object depth sorting, etc. can all be ported to the GPU and actually run much faster than on the CPU, as GPUs are much better suited for massively parallel vector processing. It's just a matter of balancing the engine properly across all the CPU cores + the GPU. With the PS3 you have to load-balance all the processing units properly to get any performance out of it. With the Xbox 360 you will get acceptable performance easily, but getting the best performance requires lots of work, just like with other consoles. It's just not a PC with DirectX 9. It will take some time to get the best out of both current-generation consoles. We are definitely not there yet.
     
  19. liolio

    liolio Aquoiboniste
    Legend

    Thanks for your insights :)

    About the huge LHS penalty on the PPU and Xenon: Barbarian hints that it would be avoidable in hardware, but don't you think it would be better at some point to move to a clean design with a compiler-friendly ISA? Or would the years of work on PowerPC/AltiVec compilers, profiling tools, etc. make the effort of moving to something new moot / not worthwhile?

    Sebbbi, I get your point, but I'm starting to wonder if it will ever happen for the 360... I hoped that the launch of Windows 7, and thus of DirectX 11, would push developers to use Xenos's extra functionality (in comparison to RSX), but as some eminent members hinted, it's not likely to happen, as devs will only focus on DirectX 11 when a significant part of the market consists of DirectX 11-compliant GPUs. It will take time... and could be worse depending on how Windows 7 takes off.
     
    #19 liolio, May 16, 2009
    Last edited by a moderator: May 20, 2009
  20. cullenskink

    Newcomer

    Load hit store is of course avoidable in hardware. Remember the PowerPC core on 360 and PS3 is a cut-down version of the full architecture.

    So you get no instruction re-ordering - which is one method an out-of-order processor uses to dodge or reduce LHS, by moving the instructions causing LHS apart, if possible; and you get no store forwarding, which is the other way of dodging some LHS (specifically, LHS on moving data from one register set to another, eg GPR to FPR, or GPR to vector register, etc - which in C parlance is effectively a cast).
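    A hedged illustration of the cast case (plain portable C++; the stall only exists on hardware without store forwarding, so running this elsewhere shows the code pattern, not the penalty):

```cpp
// On Xenon/Cell's in-order PPC core, an int-to-float conversion moves the
// value between register files via memory: store from a GPR, then load it
// back into the FP pipeline. That load hits the in-flight store, and with
// no store forwarding the pipeline stalls -- a classic load-hit-store.
float scale(int raw) {
    float f = static_cast<float>(raw);  // the GPR -> memory -> FPR round trip
    return f * 0.5f;
}
// Typical mitigations: hoist conversions out of hot loops, or keep data
// in one register file (e.g. stay in float) across the whole loop body.
```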
     