Asynchronous Compute : what are the benefits?

Discussion in 'Console Technology' started by onQ, Sep 19, 2013.

  1. pMax

    Regular

    Joined:
    May 14, 2013
    Messages:
    327
    Likes Received:
    22
    Location:
    out of the games
    not on an APU - the memory access should be the same for CPU and GPU (well, more or less).
    You're not supposed to go through PCIe from a discrete card to memory, after all.

    Right, the GPU is not suited to many small, non-uniform jobs.
    But if there aren't many of them, and there are other jobs that hide that bad latency, you could cover it, I guess.
    Of course the GPU is not the best place to run those kinds of threads, but at least you don't need to DMA its small memory buffer in and out every time before it finishes working with it...
     
  2. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    18,762
    Likes Received:
    2,639
    Location:
    Maastricht, The Netherlands
  3. Bumpyride

    Newcomer

    Joined:
    May 19, 2005
    Messages:
    55
    Likes Received:
    2
    Location:
    MS, USA
    I think the interesting aspect of this isn't that some problems are better for CPU and some better on GPU. The shared memory space makes sharing work between the two more efficient but it also makes approaches to programming possible that weren't before.

    I've only done two big GPGPU projects in my life, and one of them would have been implemented entirely differently on an APU. It was an adaptive beamforming project that involved a large multiplication to calculate a covariance matrix, followed by a good bit of linear algebra on the resulting matrix. The covariance matrix was much smaller than the input, so the GPU was used to brute-force the multiplication and the result was then passed back to the CPU for everything else. It would have been better to do the more irregular sorts and searches on the CPU while doing all of the coarse multiplication work on the GPU - especially if it meant not having to copy anything between device and host memory.

    My point is, within one algorithm, there were jobs that were better for the GPU and jobs that were much better for the CPU. Having both work efficiently at the same time would have structured things a lot differently than just splitting the algorithm in two and doing more work on the CPU than you would want because it would take too long to copy everything back.

    I have no idea how much faster this would have been in the end, but I don't doubt that, per unit of GPU and CPU resources (obviously the APUs are weaker than other available CPU-and-GPU pairs), it would have been faster.

    I can't wait to see whether more of this kind of thing crops up as people optimize their software for the next-gen systems.
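    The split described in that post can be sketched roughly as follows. This is a toy illustration only: all names, shapes, and the weight-solve step are assumptions, not details from the actual project.

```python
import numpy as np

# Hypothetical sketch of the adaptive-beamforming split described above.
# All names and sizes are illustrative, not from the actual project.
rng = np.random.default_rng(0)
channels, samples = 8, 4096
snapshots = rng.standard_normal((channels, samples))  # sensor data, channels x samples

# Stage 1 -- one large, regular matrix multiply: the GPU-friendly part.
# It reduces the big snapshot matrix to a small covariance matrix R.
R = snapshots @ snapshots.T / samples                 # (channels x channels)

# Stage 2 -- irregular linear algebra on the small result: the CPU-friendly part.
# On a discrete GPU, R has to be copied device -> host between the stages;
# on a shared-memory APU both stages could work on the same buffer.
steering = np.ones(channels)                          # toy steering vector
w = np.linalg.solve(R, steering)                      # MVDR-style weight solve
w /= steering @ w                                     # normalize so steering @ w == 1
```

    The point is the boundary between the stages: on a discrete card the small matrix crosses PCIe, on an APU it doesn't.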
     
  4. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    Ok. Then what?

    That's a totally different discussion. On why declarative job systems (D3D/OGL) suck so much for modern GPUs. Not because GPUs are "slow".

    That's side-tracking anyways.
    Typical game workload that needs performance is graphics and everything around it: collisions, simulation, animation, particles etc.
    All other things can run anywhere, their impact is less than 5% (if you coded the game right).

    Mostly, bad code. People in non-gamedev world usually do not optimize anything.

    And they perform equally well if your reads are hand-picked to be in cache at the exact time.
    Very similar to what the SPU guys tried to teach the masses.

    I can understand when you've tried and didn't succeed.
    The problem is that people usually hear this argument and then don't even try.
    But if you look into the future, the number of hw threads/ports/jobs per core only increases.
    There's no way to stay with the "old CPU" paradigm any longer anyway.

    Draw-call overhead exists only because of the "peculiar" D3D design.
    You can draw things on Orbis without any overhead, just by assembling the contexts yourself.
    Theoretically you could even draw primitive-by-primitive without any performance penalty whatsoever.

    If you have a small task, that does not need bandwidth, do it on CPU. What's the problem?

    You have 6 SPU "tasks" on PS3, each with 256k of storage.
    And then you have 64 independent "thread pipelines" on Orbis with ****k of cache per task.
    What exactly is "better" here?
     
  5. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    Last time I checked, people at ND and Sony SM shared this worldview.
    So maybe it's insulting, but not to the whole industry anyway. :)
     
  6. taisui

    Regular

    Joined:
    Aug 29, 2013
    Messages:
    674
    Likes Received:
    0
    If writing better code were the solution, we'd have stuck with in-order processors; why even bother with micro-op optimization? Oh wait, that's because most devs write bad code.

    It's all so obvious now, nevermind the hundreds of thousands of pioneers in the CS field, because all you need is just to write better code.
     
  7. function

    function None functional
    Legend

    Joined:
    Mar 27, 2003
    Messages:
    5,854
    Likes Received:
    4,411
    Location:
    Wrong thread
    You've never spoken to them about their "worldview".
     
  8. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    Sturgeon's Law.
    Does it mean you should not strive for perfection?
    And besides that, modern CPUs have a lot of legacy code to support, and it's not so easy when memory access is 150x slower per clock than it was in the '80s.
    And still they try: Netburst, or Cell...
     
  9. tuna

    Veteran

    Joined:
    Mar 10, 2002
    Messages:
    3,550
    Likes Received:
    590
    We could have better compilers instead.
     
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Probably recognize the strengths or weaknesses of the hardware and design the software accordingly.

    Can you clarify what the consoles are using, at least for the first half of the upcoming gen?

    What is the speedup for rigid body physics running on the GPU versus CPU?
    How about data management for the streaming system, or latency-sensitive input processing, or the high-speed management of the whole virtual memory subsystem the GPU relies on?

    So, if for whatever reason things don't hit this 5% figure, it must be bad code.
    Is there an example of a well-coded game you can cite?

    That's only the case if bad code is defined as any code that doesn't saturate the memory bus.
    There are good algorithms that don't hit main memory for the majority of their accesses, and bad ones that do.
    The cost for off-die access is so high that for many reasonable or practical data sets it is preferable to go for an algorithm that may be asymptotically inferior to a more parallel cache-thrasher, because it is not necessary or reasonable to bloat the working set enough to scale past the inflection point.
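    As a toy illustration of that trade-off (in Python, so it only shows the constant-factor side; the cache and branch-predictor advantages apply to native code):

```python
import bisect
import timeit

# A linear scan is asymptotically inferior to binary search, but over a
# small, contiguous working set its constant factors are tiny, and in
# native code it is far kinder to the cache and branch predictor.
# Which one wins here depends on n and the runtime.
data = sorted(range(0, 64, 2))          # small, cache-resident sorted array

def linear_contains(xs, v):
    for x in xs:                        # O(n), predictable sequential access
        if x >= v:
            return x == v
    return False

def binary_contains(xs, v):
    i = bisect.bisect_left(xs, v)       # O(log n), scattered accesses
    return i < len(xs) and xs[i] == v

t_lin = timeit.timeit(lambda: linear_contains(data, 42), number=50_000)
t_bin = timeit.timeit(lambda: binary_contains(data, 42), number=50_000)
```

    Past some working-set size the asymptotics take over, which is exactly the inflection point the paragraph above argues many practical data sets never reach.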

    I'm not sure why it's a good idea for an interactive system with millisecond time budgets to saturate anything to that extent, since that either leaves no room for demand spikes or has a baseline that is way too high.


    Previously, I went into how the cache hierarchy gives maybe a dozen bytes of cache storage per work item.
    I want to see the optimizations that can reduce everything down to that.

    As nice as that may be, the designers of the hardware in question do not agree, so the platform in question does not do what you want.


    Where are the contexts assembled?
    Are you really sure there's never overhead iterating through every single primitive instead of utilizing an instruction or command sequence that leverages a whole hardware pipeline optimized for it?

    You see, that's the old way of thinking. The future is thousands of threads.

    It's 256kB per SPE, which is an independent front end and execution pipeline. Within those bounds, it has a straight-line speed quadruple what a CU can physically perform, before noting that the CU cannot issue sequentially faster than once every four slow cycles.
    For the GPU, it's 64 front-end command pipelines that do not possess resources of their own and have not been disclosed as having any autonomy beyond taking what commands the CPU runtime gives them and using those to arbitrate with the scheduler and CU status hardware in the GPU. The CUs then perform the work.
    I've already noted that it's **Bytes per task with Orbis.
     
  11. Esrever

    Regular

    Joined:
    Feb 6, 2013
    Messages:
    846
    Likes Received:
    647
    or better hardware instead.
     
  12. taisui

    Regular

    Joined:
    Aug 29, 2013
    Messages:
    674
    Likes Received:
    0
    or just write better code.
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    ...

    Oh noes, it keeps going deeper!
    Which, by the way, GPUs are not good at.
     
  14. gurgi

    Regular

    Joined:
    Jul 7, 2003
    Messages:
    605
    Likes Received:
    1
    write better posts? :p
     
  15. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    22,146
    Likes Received:
    8,533
    Location:
    ಠ_ಠ
    But can you express your ideas?
     
  16. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    I'm cool with that.

    Dunno about Xone, but the PS4 lets you go quite low-level when writing code. No "constants" or "buffers", for example.

    Rigid body physics can indeed be too simple for the GPU, but spring-based simulations of soft bodies, for example, look like a good contender.

    Most of the code revolves around drawing things anyway. It's the most performance hungry part of any game I know.

    I agree. But games are developing in another direction: tighter lighting simulations -> bigger datasets, tighter collision/force simulations -> bigger datasets, and so on.

    You have the luxury of saturating things and getting away with it on consoles (at least for now). It's not a "general computing" machine.

    That thread count was inflated too much.
    Each CU has 4 SIMD units, each 16 "work items" wide, and there are 18 CUs. Each memory controller has 128k of L2.
    So basically we get a granularity of 128 * 4 / (18 * 4) = ~7k per "thread".
    On a modern Haswell CPU it's 4 SIMD/FMA ports for each 256k of L2 = 64k per "thread".
    Yes, CPUs are much better, but only approx. one order of magnitude better.

    Why do you need to "iterate"? CUs work just like regular CPU cores: you pass a number of SIMD instructions, they get executed, you write the result to memory.
    No need for any "context" or "buffer" anyway. The AMD driver does all this "context" mumbo-jumbo on PC just to keep compatibility with the D3D architecture. Inside, it's just a compiler that spits out imperative code.

    For tasks that need performance.
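    The granularity arithmetic in that post checks out at back-of-the-envelope precision (all counts are the ones quoted in the post; real L2 partitioning is of course more complex than an even split):

```python
# Back-of-the-envelope check of the cache-per-"thread" figures quoted above.
KB = 1024

# GPU side: 4 memory controllers x 128 KB of L2, shared by 18 CUs x 4 SIMDs.
gpu_l2_total = 4 * 128 * KB
gpu_simds = 18 * 4
gpu_kb_per_thread = gpu_l2_total / gpu_simds / KB   # ~7.1 KB per SIMD

# CPU side: 256 KB of private L2 per core, split across the post's 4 ports.
cpu_kb_per_thread = 256 * KB / 4 / KB               # 64 KB

ratio = cpu_kb_per_thread / gpu_kb_per_thread       # ~9x: one order of magnitude
```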
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The SIMD units are 16-wide, but operate on a 4-cycle vector issue. The per-instruction width of the wavefronts is 64-wide.
    The CUs themselves can host 40 wavefronts each.

    For comparison, Haswell with HT has two threads and AVX-256 gives a per-instruction width of 8.
    The 256KB of L2 is backed up by an L3 of 8MB. Even if just dividing it up per-thread on die, it's 1MB.
    Since HT is optional, it can be 2MB per-thread.
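    Spelling out the width and cache figures in that post as arithmetic (numbers taken at face value from the post; the quad-core configuration is the one its 8 MB / 1 MB figures imply):

```python
# Arithmetic behind the figures in the post above.
MB = 1024 * 1024

# GCN: each SIMD is 16 lanes wide but issues a vector op over 4 cycles,
# so a wavefront is 16 * 4 = 64 work items per instruction.
wavefront_width = 16 * 4
resident_items_per_cu = wavefront_width * 40   # up to 40 wavefronts per CU

# Haswell: AVX-256 is 8 fp32 lanes per instruction, 2 threads per core via HT.
# A quad-core's 8 MB L3 split evenly per thread:
l3_per_thread_ht = 8 * MB // (4 * 2)           # 1 MB with HT on
l3_per_thread_st = 8 * MB // 4                 # 2 MB with HT off
```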
     
  18. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    948
    Likes Received:
    417
    Some problems are excessively performance hungry and totally unparallelizable. What you're conjecturing is: "you can break every performance-problematic algorithm into trivial-to-calculate pieces".
     
  19. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    That's akin to Haswell's 60-entry scheduler, per core.

    Caches beyond the L2 are not effective in games. This can be tested in the real world. :)
    Haswell has 3 AVX-256 ports per core that can run in parallel.
     
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    If you want to pull that, then the per-wavefront instruction queues in each CU can readily expand the GPU's number of buffered instructions.
    The wavefronts possess wholly private context and their own instruction pointer. They're as close a match to the threads in Haswell as there can be.

    Haswell can drop down to one thread without a problem.
    The CUs absolutely cannot drop below 4 wavefronts without idling fractions of their units.

    The Haswell L3 is maybe 30-40 cycles away at over 3 GHz.
    The global L2 of GCN may get under that in cycle terms, but not by much, and that is not in wall-clock time.
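    Converting those latencies to wall-clock time makes the point concrete. The GPU clock used here is an assumption (typical of GCN-era parts), not a figure from the post:

```python
# Cycle counts -> wall-clock time for the latency comparison above.
# The 0.8 GHz GPU clock is an assumed GCN-era figure, not from the post.
cpu_ghz = 3.0
l3_cycles = 35                        # mid-point of the quoted 30-40 range
l3_ns = l3_cycles / cpu_ghz           # ~11.7 ns for Haswell's L3

gpu_ghz = 0.8
gpu_l2_cycles = 30                    # hypothetical "fewer cycles" case
gpu_l2_ns = gpu_l2_cycles / gpu_ghz   # 37.5 ns: slower in wall-clock terms
```

    Even with a lower cycle count, the slower clock leaves the GPU cache behind in absolute time.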
     


  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.