GPGPU, ACEs, etc.?

Discussion in 'GPGPU Technology & Programming' started by Ron Burgundy, Aug 20, 2014.

  1. Ron Burgundy

    Newcomer

    Joined:
    Aug 18, 2014
    Messages:
    25
    Likes Received:
    1
    Hello. I want to know how GPGPU would be utilized and what benefits it can bring. I also want to know what ACEs are for in GCN GPUs.

    Thanks.
     
  2. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    ACEs are the hardware job managers of the GPU, interfacing with the application. In short, the application writes a dispatch command to the ACE, which describes the problem size and contains a code pointer for the job. Recall that GPUs use the SPMD/SIMT programming model: you assign work in bulk to the GPU and it breaks the work down into hardware threads on its own, instead of you manually spawning threads as you would when programming a multi-core CPU. Check the HSA queuing model if you are interested.

    The ACE does process the command. However, the wavefronts (the hardware contexts of the CUs) are not directly controlled by the ACEs, but by the dispatch controllers, which arbitrate and schedule kernel dispatches from the different ACEs and also from the graphics pipeline.
     
  3. Ron Burgundy

    Newcomer

    Joined:
    Aug 18, 2014
    Messages:
    25
    Likes Received:
    1
    Thank you very much for explaining this. So can those ACEs handle GPGPU tasks like physics, etc.? Or what other real-world tasks are ACEs useful for, such as in real-time graphics?
     
  4. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    The ACEs are not responsible for the computation; the CUs are. An ACE manages the compute pipeline in accordance with the command packets written by the application, and generates work in batches for the CUs (via a dispatch controller arbitrating among all the ACEs and the graphics pipeline). So you can always tell the ACEs (via abstractions like HSA's AQL mechanics) to generate kernels for any of the above and push them to the CUs, if you think the work fits the GPU.
     
    #4 pTmdfx, Aug 21, 2014
    Last edited by a moderator: Aug 21, 2014
  5. Ron Burgundy

    Newcomer

    Joined:
    Aug 18, 2014
    Messages:
    25
    Likes Received:
    1
    I see. So GPGPU only works with the power of the GPU itself? And more ACEs don't mean more physics, etc. if the GPU is too weak?

    The PS4 is said to have 8 ACEs, more than the 7970. But since the 7970 is a more powerful GPU, won't its GPGPU performance be much better despite having fewer ACEs?
     
  6. homerdog

    homerdog donator of the year
    Legend Veteran Subscriber

    Joined:
    Jul 25, 2008
    Messages:
    6,153
    Likes Received:
    928
    Location:
    still camping with a mauler
    More ACEs allow the GPU to better utilize its available resources. The 7970 is far more capable than the ~7850 in the PS4, and more ACEs won't make it perform like a 7970.
     
  7. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    The ACEs are on the control path. They oversee things, but they don't contribute to the computational work. If the GPU lacks compute power, it will always lack it, no matter how many ACEs there are. ACEs just improve the efficiency of running different kernels from the same or even different application processes, especially kernels whose problem size is too small to fill the GPU individually. In more abstract terms: the factory's (GPU's) throughput is still determined by the scale and performance of the manufacturing pipeline (the Compute Units), unless the job requires aggressive administration and coordination from the managers (the ACEs), or the factory is underutilized.
     
  8. Ron Burgundy

    Newcomer

    Joined:
    Aug 18, 2014
    Messages:
    25
    Likes Received:
    1
    Yeah, I know that. The ACEs still have to work with 1.84 TF.
     
  9. Ron Burgundy

    Newcomer

    Joined:
    Aug 18, 2014
    Messages:
    25
    Likes Received:
    1
    Right, I see now. So more ACEs = using the power that's already there more efficiently. Thank you very much.
     
  10. liquidboy

    Regular Newcomer

    Joined:
    Jan 16, 2013
    Messages:
    416
    Likes Received:
    77
    As the Xbox One architects like to describe the ACEs:

    Andrew Goossen: The number of asynchronous compute queues provided by the ACEs doesn't affect the amount of bandwidth or number of effective FLOPs or any other performance metrics of the GPU. Rather, it dictates the number of simultaneous hardware "contexts" that the GPU's hardware scheduler can operate on any one time. You can think of these as analogous to CPU software threads - they are logical threads of execution that share the GPU hardware. Having more of them doesn't necessarily improve the actual throughput of the system - indeed, just like a program running on the CPU, too many concurrent threads can make aggregate effective performance worse due to thrashing. We believe that the 16 queues afforded by our two ACEs are quite sufficient.

    And as the PS4 architect describes theirs:

    Cerny: The original AMD GCN architecture allowed for one source of graphics commands, and two sources of compute commands. For PS4, we’ve worked with AMD to increase the limit to 64 sources of compute commands — the idea is if you have some asynchronous compute you want to perform, you put commands in one of these 64 queues, and then there are multiple levels of arbitration in the hardware to determine what runs, how it runs, and when it runs, alongside the graphics that’s in the system. The PS4 has 8 ACEs.


    Both the XB1 and PS4 seem to be using ACEs with 8 queues each.
     
    #10 liquidboy, Aug 23, 2014
    Last edited by a moderator: Aug 23, 2014
  11. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    More ACEs give more software design flexibility. For example each middleware using asynchronous compute could use a single dedicated ACE. This is similar to middlewares having their own worker threads on CPUs.

    I wouldn't say that more simultaneous work threads (command lists / ACEs) bring better utilization. Running too many separate tasks simultaneously would reduce the cache efficiency of any architecture (CPU or GPU). Also, the utilization gains from running more tasks simultaneously (on a shared execution core) drop heavily after two or three independent instruction streams. If you want to compare this to CPUs, you'll notice that 2-way hyperthreading gives quite nice gains, but 4-way (Xeon Phi) doesn't bring as much additional gain, unless of course there are lots of stalls in each of the threads.

    On a GPU, the rendering pipeline has stalls (holes in execution), because there's a limited amount of fixed-function hardware inside the GPU (vertex setup, ROPs / fill rate, texture filtering units, etc.), and different graphics shaders hit different bottlenecks. This leaves bandwidth and execution units (ALUs) free for asynchronous compute tasks. Compute tasks, on the other hand, are not as bottlenecked by fixed-function hardware. This means that even a single (long) compute task running simultaneously with the graphics rendering pipeline can bring GPU utilization near 100%. The improvements in GPU utilization can be quite big.

    As we all know, most current games do not utilize the GPU anywhere near 100%. If they did, they would heat the GPU as much as Furmark does. It would be fun to see games like that soon :)
     
  12. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    7,905
    Likes Received:
    6,188
    lol what exactly is Furmark doing that causes the GPU to heat up so rapidly? Like, I see this donut shape moving about, and temps start rising everywhere lol
     
  13. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,436
    Likes Received:
    264
    Furmark hit upon a mix of operations that happens to keep the GPU well utilized, so it draws a lot of power. It's probably very ALU-heavy.
     
  14. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    7,905
    Likes Received:
    6,188
    Interesting. From what I see online, the benchmark can cause overheating on cards that aren't sufficiently cooled (the 85+°C range). Now I am curious how hot cards normally run during standard gameplay.
     
  15. Ryan Smith

    Regular

    Joined:
    Mar 26, 2010
    Messages:
    611
    Likes Received:
    1,052
    Location:
    PCIe x16_1
    ALU and ROP. Don't underestimate the ROPs.

    You'd be shocked by the power usage on modern GPUs just rendering a simple backdrop when they're not ALU-bound (EVE Online's login screen used to be a great example of this).
     
  16. pMax

    Regular Newcomer

    Joined:
    May 14, 2013
    Messages:
    327
    Likes Received:
    22
    Location:
    out of the games
    Hmm... you might need to pull constants for the GPU as well, and that should be done in parallel with your (preferably userland) command buffer.

    So it's likely you need two parallel buffer reads just for one job.

    Also, I'd say that more compute command-buffer execution units with more specialization might 'speed up' the per-buffer front-end processing time as well, which supposedly never hurts...
     
    #16 pMax, Aug 27, 2014
    Last edited by a moderator: Aug 27, 2014

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.