Asynchronous Compute: what are the benefits?

Discussion in 'Console Technology' started by onQ, Sep 19, 2013.

  1. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    605
    Likes Received:
    56
    If the algorithm is "bad" it will run equally badly on any modern CPU too.
    But if your algorithm runs faster on the CPU than on the GPU, that's your problem: you did not optimize it, and the CPU's execution hardware did it for you.
     
  2. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    605
    Likes Received:
    56
    Modern games have 8-10 digit budgets. They can accommodate that cost.
    And you can always write games for Android/iOS or in HTML5/JavaScript, much cheaper.
    There's not even a need to optimize that hard... :)
     
  3. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    605
    Likes Received:
    56
    It's totally fair to compare it.
    Why does nobody use non-linear radiosity algorithms?
    Why does nobody use shadow volumes?
    Etc.
    In interactive graphics, everybody is fine with "no single-threaded or dependent algorithms".
    It's time to accept that as the new worldview.
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Current GPUs are optimized for very coarse workloads that have a very significant amount of math being done per memory access, and a very limited amount of divergence in the execution of each data element.
    There are workloads for which the more limited SIMD width of CPUs, or SPEs, is sufficient to capture the bulk of the data parallelism that can be extracted per-cycle before divergence or just plain lack of extra work to do kicks in.

    If, on a GPU, you can find extra arithmetic work or leverage specialized hardware, or if the cache miss rates for the CPU would be equivalently high, it's a GPU strong point.
    If the cache hit rate is very high, or the granularity of the workload is very fine due to limited arithmetic density or complex control flow, it's not a GPU strong point with the architectures in question.

    The CPU has an execution pipeline that can handle far more arbitrary code flow, and in terms of maintaining coherence and memory ordering it is far superior. The HSA programming model has a bunch of restrictions on how programs should be structured, in order to create execution points where synchronization and write-back can be initiated by specialized writes or reads that permit global visibility and ordering. There are various gotchas that either the code writer or the compiler will need to watch out for.
    CPU code, to a point, just does it.
    The execution model for the GPU encourages the smaller, simpler program approach, and it requires extra hoops that CPUs do not. A CPU isn't required to know ahead of time how many registers the code it's running will need, and its register allocation doesn't bloat if one of the branch paths blows up the register requirements.
    CPUs just run the code.

    On a CPU, there is much more cache per thread. Since external memory accesses still incur a bandwidth+latency+power penalty, there are workloads where opting for fewer threads that stay on-chip will get the job done.

    Thousands of threads that branch in thousands of different directions?
    Try that with the Orbis GPU.
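    As a minimal sketch of that divergence cost (CUDA standing in here for GCN compute, with invented kernel and variable names; a warp is 32 lanes where a GCN wavefront is 64, but the behaviour is the same): every branch path taken by any lane in the group is executed for the whole group with the other lanes masked off, so heavy per-thread branching multiplies the ALU work instead of parallelising it.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-entity update: each thread picks one of several code paths
// based on its data. Within a 32-thread warp (64-wide wavefront on GCN), the
// hardware executes every path that any lane takes, masking off the others,
// so N distinct paths cost roughly N times the ALU work of a uniform branch.
__global__ void divergent_update(const int* state, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = out[i];
    switch (state[i] & 3) {            // 4 possible paths per lane
    case 0:  v = v * 1.5f + 1.0f; break;
    case 1:  v = sqrtf(fabsf(v)); break;
    case 2:  v = v * v - 2.0f;    break;
    default: v = -v;              break;
    }
    out[i] = v;                        // lanes re-converge here
}

int main()
{
    const int n = 1 << 20;
    int* state; float* out;
    cudaMallocManaged(&state, n * sizeof(int));
    cudaMallocManaged(&out,   n * sizeof(float));
    for (int i = 0; i < n; ++i) { state[i] = i; out[i] = 1.0f; }

    divergent_update<<<(n + 255) / 256, 256>>>(state, out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);

    cudaFree(state); cudaFree(out);
    return 0;
}
```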

    Unless your tomorrow's world has dispensed with physical reality, it doesn't negate the implementation details of the specific chip in question. They aren't pretending that Orbis can do what you claim, and whatever your vision is for the future, it can't transplant itself into the PS4's hardware platform.
     
  5. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    605
    Likes Received:
    56
    That's cool, data locality and stuff.

    Workloads do not descend from the heavens. The application developer creates them.
    If the data locality is bad, or the code branches like crazy, you wrote bad code.
    Yes, it will suck less on a CPU than on a GPU, but it will still suck.

    If the cache hit rate is very high it will run equally well on the GPU. The only case here is if the hit rate is high only at specific cache sizes > the GPU cache size. Looks too artificial to me.
    If you have low arithmetic density: do not use a CPU at all. Use a calculator. :)
    And complex control flow = bad code. Or it's infrastructure code = no need for speed.

    Cost. I've heard that already. We don't want to optimize, we want the hardware to do the job, and to hire fewer (or less expensive) developers.

    CPU is good for bad/old/legacy code. I know that.

    Bottom line:
    CPU and GPU try to solve the same problem: how to keep the caches full in such a fashion that memory bandwidth is saturated all the time.
    And IMHO GPUs do it much better.
     
  6. onQ

    onQ
    Veteran

    Joined:
    Mar 4, 2010
    Messages:
    1,540
    Likes Received:
    55
    The PS4 GPU has been modified for fine-grained compute.

    [Embedded video; start around 41:40]
     
  7. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    605
    Likes Received:
    56
    P.S. Any problem that needs high performance will run better on the GPU.
    P.P.S. "High performance" = high arithmetic demand + high memory bandwidth requirements; ask the HPC guys.
     
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    A workload is the code and data set the hardware is tasked to run on.
    Those are informed by the constraints of the problem the programmer is trying to solve.

    It's not artificial when cache sizes are so much larger per thread on the CPU. There are 4 MB of L2 cache for up to 8 CPU threads in Orbis.
    Assuming equivalent usage, it's 512 KB of cache to play with.

    There is 3/4 MB for up to 720 wavefronts, which is 46,080 threads--if you buy into the marketing.
    That's about 1K per wavefront, and 17 bytes per "thread".
    How many orders of magnitude are necessary before the example becomes not artificial?

    edit:
    My apologies, I was mentally using a larger GPU.
    Orbis has 512KB, so cut the per-thread cache allocation as necessary.
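    For reference, a quick re-run of that back-of-the-envelope arithmetic with the corrected 512 KB figure (assuming the 18 CUs at 40 wavefronts each implied by the 720 above):

```cpp
#include <cstdio>

int main()
{
    // CPU side: 4 MB of L2 shared by up to 8 hardware threads (as above).
    const double cpu_l2_bytes = 4.0 * 1024 * 1024;
    const int    cpu_threads  = 8;

    // GPU side: 512 KB of L2 shared by 18 CUs, each able to hold up to
    // 40 wavefronts of 64 work-items (the marketing "thread" count).
    const double gpu_l2_bytes = 512.0 * 1024;
    const int    wavefronts   = 18 * 40;         // 720
    const int    work_items   = wavefronts * 64; // 46,080

    printf("CPU L2 per thread:    %.0f KB\n",    cpu_l2_bytes / cpu_threads / 1024);
    printf("GPU L2 per wavefront: %.0f bytes\n", gpu_l2_bytes / wavefronts);
    printf("GPU L2 per work-item: %.1f bytes\n", gpu_l2_bytes / work_items);
    return 0;
}
```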


    I don't normally point out garbage argumentation.

    Graphics drivers/compilers.
    GPU compute run-time managers.
    Or are you asserting Orbis might not need those?

    They are also good at code that requires fine-grained synchronization, and there are simply problems that include that.
    There are data sets that fall below the minimum the GPU needs for utilization. This is still the case for GCN.
    Reduction operations are common, and it follows that if the GPU does that enough times, eventually the data it works on falls below the minimum.
    See how AMD is trying to sell HSA for image recognition. The GPU is faster for the initial broad sweeps, but it falls on its face as the number of tiles drops.
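    A minimal CUDA sketch of that reduction behaviour (illustrative sizes and names, not AMD's code): each pass collapses the data by the block size, so within a couple of passes the launch is down to a handful of blocks, far below what a GPU of this class needs to stay occupied.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// One pass of a sum reduction: each 256-thread block collapses 256 inputs
// into a single partial sum. Every pass divides the element count by 256,
// so after two passes a 1M-element problem is down to 16 elements: nowhere
// near enough work to keep hundreds of wavefronts busy.
__global__ void reduce_pass(const float* in, float* out, int n)
{
    __shared__ float s[256];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}

int main()
{
    int n = 1 << 20;                       // 1M elements
    float *a, *b;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    for (int i = 0; i < n; ++i) a[i] = 1.0f;

    while (n > 1) {                        // keep reducing until one value remains
        int blocks = (n + 255) / 256;
        printf("pass: %7d elements -> %4d blocks\n", n, blocks);
        reduce_pass<<<blocks, 256>>>(a, b, n);
        cudaDeviceSynchronize();
        float* t = a; a = b; b = t;        // output of this pass feeds the next
        n = blocks;
    }
    printf("sum = %f\n", a[0]);
    cudaFree(a); cudaFree(b);
    return 0;
}
```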

    That is not what CPUs try to do; they can't fully schedule around a miss that goes off-chip.
    Under much of their operating range, CPUs do their best to prevent off-die access.
    GPUs start from a pessimistic case where they assume off-die access is extremely routine.


    Fine-grained relative to what?
    For previous-gen GPUs, sure.
     
    #48 3dilettante, Sep 22, 2013
    Last edited by a moderator: Sep 22, 2013
  9. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    605
    Likes Received:
    56
    This one somehow assumes that the workload was not developed with the hardware in mind. Or am I seeing things?

    I will address this below.

    I assume these tasks do not need the performance.
    They can run on a slow CPU without a problem.

    What prevents you from feeding it more tiles as the data you work on is reduced in size?
    Or are you speaking about the thread-granularity issue?
    Still, I think we are getting side-tracked here; I'd like game-development examples. I cannot think of a good reduction task here.

    The working set of a modern game is several gigabytes in size = "off-die access is extremely routine".
    To address the point above: cache misses are routine for games. You can even see it for yourself in any modern PC game (just analyze it with a profiler or other specialized tools).
     
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Some HPC people think turnaround times for computations measured in days or weeks are great.


    This is backwards.
    The hardware was developed with a workload in mind.

    I think the system that allows jobs to be farmed out to the GPU has a performance impact.
    GPU drivers on desktop systems can themselves become a performance limiter, and that's with cores much more powerful than Jaguar.

    The GPU was tasked with analyzing an image. The performance criterion is how fast the analysis can complete.

    It's a multimedia example, and for user-facing functionality the latency factor weighs heavily.
    Sony also expects the GPU to perform the work for image recognition for its camera.


    What's the rate for CPUs?
    Why does their utilization rate of mere tens of GB/s of memory bandwidth rarely peak outside of benchmarks?
    GPUs almost assume that every wavefront memory operation could take a full trip to memory, and they can do so without affecting arithmetic throughput if enough math operations are available.
     
  11. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Let me add my two cents.

    Let's assume a programmer has a certain problem to solve. Usually there is more than one algorithm that can get the job done. Depending on the problem, it may be very easy to roll out an efficient massively parallel algorithm, or it may not be. For quite a few problems the "natural" algorithm presents itself with only modest parallelism, complicated control flow and all sorts of things GPUs don't like very much. But it may run with very high performance on CPUs.

    As I said in the beginning, there is often more than just one algorithm to solve a problem. That could be a reason to explore them. One may find an alternative which exposes more parallelism and therefore scales much better to a high number of threads, and which also avoids the common performance pitfalls of GPUs. Thus, this algorithm is better suited to GPUs than to CPUs, and running the same algorithm on CPU and GPU gives the GPU a performance advantage. But that doesn't mean at all that it is faster than the original algorithm running on a CPU. It could be an inherently worse algorithm in the sense that it has huge upfront computational or memory costs, that it scales much worse with the problem size, or whatever. It simply means that the problem in question may not be the best one to run on a GPU. And not everyone has the time to do years of research into finding yet another parallel implementation which may or may not be faster than the original one. Maybe, in a few years, when GPUs have further evolved, someone will find a better fit. But that doesn't help you now.

    To sum it up, sometimes it isn't as easy as saying that devs just have to write better code. There may be fundamental restrictions on what one can do.
     
  12. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    The CPU beats the GPU in all problems that cannot be sliced into tens of thousands of independent threads. If you try to run a large number of smaller problems on the GPU, you will stall the GPU (and the CPU, because of draw/dispatch call setup overhead). It's much better just to crunch these small problems using AVX (preferably inside the CPU's L1 cache).
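    As an illustration of the kind of small batch that is better kept on the CPU (a sketch with invented names and numbers, not from any shipping engine): a few dozen AI agents' distance checks done 8-wide with AVX. The whole working set is a few hundred bytes and never leaves L1, while dispatching the same work to the GPU would mostly pay setup overhead.

```cpp
#include <cstdio>
#include <immintrin.h>   // AVX intrinsics (host-side CPU code)

// Hypothetical game-logic query: squared distance from every active AI agent
// (a few dozen of them) to the player, stored SoA-style. AVX processes 8
// agents per iteration, and the data easily fits in L1.
void agent_dist_sq(const float* xs, const float* ys, const float* zs,
                   float px, float py, float pz, float* out, int count)
{
    const __m256 qx = _mm256_set1_ps(px);
    const __m256 qy = _mm256_set1_ps(py);
    const __m256 qz = _mm256_set1_ps(pz);

    int i = 0;
    for (; i + 8 <= count; i += 8) {
        __m256 dx = _mm256_sub_ps(_mm256_loadu_ps(xs + i), qx);
        __m256 dy = _mm256_sub_ps(_mm256_loadu_ps(ys + i), qy);
        __m256 dz = _mm256_sub_ps(_mm256_loadu_ps(zs + i), qz);
        __m256 d2 = _mm256_add_ps(_mm256_mul_ps(dx, dx),
                    _mm256_add_ps(_mm256_mul_ps(dy, dy),
                                  _mm256_mul_ps(dz, dz)));
        _mm256_storeu_ps(out + i, d2);
    }
    for (; i < count; ++i) {   // scalar tail for the last few agents
        float dx = xs[i] - px, dy = ys[i] - py, dz = zs[i] - pz;
        out[i] = dx * dx + dy * dy + dz * dz;
    }
}

int main()
{
    const int n = 48;          // "<50 active AI enemies"
    float xs[n], ys[n], zs[n], out[n];
    for (int i = 0; i < n; ++i) { xs[i] = (float)i; ys[i] = 2.0f; zs[i] = -1.0f; }

    agent_dist_sq(xs, ys, zs, 10.0f, 0.0f, 0.0f, out, n);
    printf("agent 0 dist^2 = %f\n", out[0]);
    return 0;
}
```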

    Games have lots of different types of entities that do not come in the required 10k+ numbers (needed for efficient GPU execution). Examples: currently active AI enemies (<50), ray casts needed for a single game logic frame (<100), path finding requests per frame (<50), active physics objects (game logic rigid bodies, not particle debris physics, <500), script objects running (<100), triggers running (<100), etc. Of course there are specific game types that have huge armies of enemies or are purely based on physics simulation, and those types of games might benefit more from GPU processing (assuming the branching is coherent and the dependencies are simple).

    Algorithmic complexity usually rises when you move from a sequential algorithm to a parallel one. An O(log n) sequential algorithm can, for example, become O(n) when it is parallelized, and it is quite common for O(n) sequential algorithms to become O(n log n) parallel algorithms. So you pay some overhead in algorithmic complexity. Sometimes it's just better to run the algorithm on the CPU, because you don't want to waste 2x-5x the GPU flops to run it on the GPU, even if it would finish faster there. The GPU is a resource, and you would rather spend it doing something very efficient, such as graphics rendering (or simple batch processing).
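    A concrete instance of that complexity shift, as a sketch: an inclusive prefix sum is O(n) additions sequentially, while the simple Hillis-Steele data-parallel formulation of the same scan performs O(n log n) additions (work-efficient parallel scans exist, but they add extra passes and synchronisation of their own). Counting the additions makes the overhead visible:

```cpp
#include <cstdio>
#include <vector>

// Sequential inclusive prefix sum: exactly n-1 additions, one dependent chain.
static long scan_sequential(std::vector<int>& a)
{
    long adds = 0;
    for (size_t i = 1; i < a.size(); ++i) { a[i] += a[i - 1]; ++adds; }
    return adds;
}

// Hillis-Steele scan, the simple data-parallel formulation: log2(n) passes,
// each doing up to n additions, so O(n log n) total work even though every
// pass is trivially parallel (which is what maps nicely onto wide SIMD/GPUs).
static long scan_hillis_steele(std::vector<int>& a)
{
    long adds = 0;
    const size_t n = a.size();
    std::vector<int> tmp(n);
    for (size_t stride = 1; stride < n; stride *= 2) {
        for (size_t i = 0; i < n; ++i) {            // each pass is parallel-safe
            tmp[i] = (i >= stride) ? a[i] + a[i - stride] : a[i];
            if (i >= stride) ++adds;
        }
        a.swap(tmp);
    }
    return adds;
}

int main()
{
    const size_t n = 1 << 16;
    std::vector<int> a(n, 1), b(n, 1);
    printf("sequential adds:    %ld\n", scan_sequential(a));
    printf("parallel-form adds: %ld\n", scan_hillis_steele(b));
    printf("results match: %s\n", a == b ? "yes" : "no");
    return 0;
}
```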
     
    #52 sebbbi, Sep 23, 2013
    Last edited by a moderator: Sep 23, 2013
  13. pMax

    Regular Newcomer

    Joined:
    May 14, 2013
    Messages:
    327
    Likes Received:
    22
    Location:
    out of the games
    The question is interesting. However, supposing you have a beefy front end, what prevents you from running these small jobs with a relative performance penalty, but issuing many of them into the GPU's schedule?
    In the end, you MAY have other jobs running on your GPU (like 3D ones) that help hide the higher latencies of those little extra jobs.
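    That is essentially the asynchronous-compute pitch. A hedged CUDA-streams sketch of the idea (streams standing in here for the extra compute queues a GCN GPU exposes; kernel names and sizes are invented): small jobs are issued on a second queue and soak up whatever the long "graphics-like" workload leaves idle, instead of each small dispatch being serialised behind it.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the frame's big graphics-like workload.
__global__ void big_job(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { float v = d[i]; for (int k = 0; k < 256; ++k) v = v * 1.0001f + 0.5f; d[i] = v; }
}

// Stand-in for a small game-logic job (a few thousand elements at most).
__global__ void small_job(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main()
{
    const int big_n = 1 << 22, small_n = 4096;
    float *big_buf, *small_buf;
    cudaMalloc(&big_buf,   big_n   * sizeof(float));
    cudaMalloc(&small_buf, small_n * sizeof(float));
    cudaMemset(big_buf,   0, big_n   * sizeof(float));
    cudaMemset(small_buf, 0, small_n * sizeof(float));

    cudaStream_t gfx_like, compute;        // two independent queues
    cudaStreamCreate(&gfx_like);
    cudaStreamCreate(&compute);

    // The long-running job goes on one queue...
    big_job<<<(big_n + 255) / 256, 256, 0, gfx_like>>>(big_buf, big_n);

    // ...while many small jobs are issued on the other; the hardware can slot
    // them into idle units instead of serializing them behind the big launch.
    for (int j = 0; j < 32; ++j)
        small_job<<<(small_n + 255) / 256, 256, 0, compute>>>(small_buf, small_n);

    cudaDeviceSynchronize();               // wait for both queues
    printf("done\n");

    cudaStreamDestroy(gfx_like);
    cudaStreamDestroy(compute);
    cudaFree(big_buf); cudaFree(small_buf);
    return 0;
}
```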
     
  14. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    17,682
    Likes Received:
    1,200
    Location:
    Maastricht, The Netherlands
    My (admittedly layman) impression is that this is precisely what Sony has worked with AMD on: to prevent smaller jobs from stalling the GPU, to reduce overhead on the CPU side, and to be more bandwidth-efficient.

    It seems a given that there are algorithms that the CPU is better at - otherwise, why even have one (or more than one or two cores of one)? But if you have GPU cycles to spare (which Sony has suggested is typically the case), and you don't have CPU cycles to spare, then it still makes sense (assuming it's not too much work to refactor the algorithm).

    I fully agree that psorcerer's blanket statements are not doing his argument much service, but in general I would say there is some truth in it, in that data-focussed setups can be more efficient. Hasn't this been discussed endlessly in the light of using the SPEs to their advantage, and doesn't this hold for CUs to a large extent as well?

    I don't feel we currently have much material for comparing how the CUs in the next-gen consoles can be used by modern game engines, but we do see Sony suggesting that the way SPEs have been used to farm off jobs is made possible by the current configuration of CUs, scheduling and prioritisation, datapaths and memory access.
     
  15. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,298
    Likes Received:
    396
    Location:
    Australia
    This is my layman understanding.


    It's not just the "front end", it's resources per ALU. A GPU hides latency by doing something else, but it doesn't have anywhere near the local storage or hardware-based performance helpers (prefetch, branch prediction, instruction cache, etc.) per ALU that a CPU does. To do something else, you have to have that data and instruction at the execution unit, and that takes a very long time on a GPU. So if you have lots of small, complex jobs waiting on complex (non-contiguous) memory accesses, you can end up with all your ALUs waiting for data, and you stall the GPU. A GPU wants lots of simple tasks with simple data structures.
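    A rough back-of-the-envelope of that latency-hiding trade-off, using illustrative round figures rather than vendor specifications: covering an off-chip miss needs many wavefronts resident per SIMD, but each resident wavefront needs its own slice of the register file, so register-hungry, fetch-heavy code runs out of things to switch to.

```cpp
#include <cstdio>

int main()
{
    // Illustrative round figures for a GCN-style SIMD, not exact vendor specs.
    const int miss_cycles         = 400;        // rough cost of going off-chip
    const int issue_cycles        = 4;          // one wavefront instruction issues over 4 cycles
    const int alu_per_mem_op      = 4;          // independent ALU instructions between memory ops
    const int vgpr_file_bytes     = 64 * 1024;  // vector register file per SIMD
    const int bytes_per_vgpr      = 64 * 4;     // 64 lanes * 4 bytes
    const int vgprs_per_wavefront = 32;         // what this hypothetical kernel happens to need

    // Each resident wavefront can keep the ALUs busy for roughly this long
    // before it, too, is waiting on memory:
    int busy_cycles_per_wave = alu_per_mem_op * issue_cycles;

    // So hiding a full miss needs about this many wavefronts in flight...
    int waves_needed = miss_cycles / busy_cycles_per_wave;

    // ...but each one needs its own register allocation, which caps residency:
    int waves_that_fit = vgpr_file_bytes / (vgprs_per_wavefront * bytes_per_vgpr);

    printf("wavefronts needed to hide the miss: ~%d\n", waves_needed);   // ~25
    printf("wavefronts the register file holds:  %d\n", waves_that_fit); // 8
    printf("the shortfall means the SIMD stalls unless there is more ALU work per fetch\n");
    return 0;
}
```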

    When you look at the evolution of GPUs, you're really looking at the evolution of the data structures and tasks the GPU can effectively support.
     
  16. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,734
    Likes Received:
    11,207
    Location:
    Under my bridge
    SPE code can effectively operate as standard multicore, running multiple independent jobs simultaneously even using the same basic algorithms as CPU code (though with data optimisations), or running the same task in parallel across cores when it's an easy fit. GPGPU requires a completely different approach, running the same task in parallel across many, many processors. It's not always easy to take a fundamentally linear task, or a task with data dependencies, and break it into tiny units each processed in parallel. Sometimes the algorithm that can deal with a task broken down like that is far less efficient than a linear algorithm run on a CPU core, such that it makes no sense to move everything to GPGPU.
     
  17. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    17,682
    Likes Received:
    1,200
    Location:
    Maastricht, The Netherlands
    But even then, which processes on the CPU require so many CPU resources? Surely you will typically have a few really hungry ones that process a lot of data, which then make sense to move to the CUs? Obviously this is not something commonly done on PC, because the communication between the CUs and the CPU isn't optimal, they don't have access to the same memory pool, there are caching issues, etc.

    More importantly, I'm getting the impression that a lot of GPGPU talk today is informed by the state of technology of 2006 rather than modern architecture, interfaces and SDKs.

    To take a quote from NVidia's PR material:

    That future is here this November? ;)
     
  18. Jay

    Jay
    Veteran Regular

    Joined:
    Aug 3, 2013
    Messages:
    1,939
    Likes Received:
    1,087
    I'm confused: how is that any different from people saying that some work fits CPUs better?

    I've not seen anyone say that some tasks cannot run better on GPUs, or deny that there is the possibility of moving tasks currently done on the CPU to GPGPU.
     
  19. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,734
    Likes Received:
    11,207
    Location:
    Under my bridge
    It might well be. I just take considerable umbrage at psorcerer effectively saying all devs are stupid/lazy for not writing their code as efficiently parallelised GPU compute. In the areas where we are seeing GPGPU acceleration, it has come after a lot of time and research. The notion that devs should just write their games differently, easy as that, is insulting to the industry.
     
  20. almighty

    Banned

    Joined:
    Dec 17, 2006
    Messages:
    2,469
    Likes Received:
    5
    Would GPGPU really be that much faster on PS4 compared to PC?

    Running TressFX on Tomb Raider causes a decent enough performance drop... Would it be less of a hit on console?
     