More GPU cores or more GPU clock speeds?

Discussion in 'Architecture and Products' started by gongo, Jul 9, 2016.

  1. gongo

    Regular

    Joined:
    Jan 26, 2008
    Messages:
    582
    Likes Received:
    12
    As an armchair observer, I ask: are games today taking better advantage of higher GPU clock speeds rather than more cores?

    I don't understand how AMD's GCN architecture can fall off after the 290's Hawaii. Then I noticed Nvidia GPUs have been increasingly relying on high clocks. With high clocks, you can push things like fillrate and tessellation higher with fewer units, no? And with fewer compute units, you can push for higher clocks... and you need fewer units because game engines are still not massively parallel / are limited by consoles... does that make sense?
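    As a back-of-the-envelope illustration of the units-times-clock trade-off the question hints at (the ROP counts and clocks below are made up for illustration, not real GPU specs):

```python
# Peak pixel fillrate scales roughly as (number of ROPs) x (clock speed),
# so fewer units at a higher clock can match a wider, slower chip.
# Both configurations below are hypothetical.
wide_rops, wide_clock_hz = 64, 1.0e9      # wide design: 64 ROPs @ 1.0 GHz
narrow_rops, narrow_clock_hz = 32, 2.0e9  # narrow design: 32 ROPs @ 2.0 GHz

wide_fillrate = wide_rops * wide_clock_hz      # pixels per second
narrow_fillrate = narrow_rops * narrow_clock_hz

print(wide_fillrate == narrow_fillrate)  # True: identical peak fillrate
```

    Of course, as the replies below point out, peak throughput is only part of the story.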
     
  2. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Yes.

    No, they're not really linked. Being able to run higher clocks is primarily determined by the microarchitecture.

    No, that doesn't make a lot of sense. Rendering is inherently a massively parallel operation, but that parallelism is, for the most part, hidden behind an API.
     
    Razor1 and Alexko like this.
  3. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    You probably want a better understanding of the 3D rendering pipeline before digging into hardware design details like this.

    The parallelism on the CPU end and the parallelism among compute units are two completely different contexts. One is about preparing and pushing an embarrassingly large block of work to the GPU, while the other is about picking up that work, preparing it, and breaking it down into hardware threads for execution. Game engines are within the former context, while the latter context depends on what combination of work is being submitted by the game engine.

    Like silent_guy said, rendering is an embarrassingly parallel problem: for pixel shaders alone you have at least hundreds of millions of pixels per second at 60 fps 1080p, not even considering multi-sampling. Then before pixel shaders you have vertices and geometry. The problem size is orders of magnitude larger than the best-case hardware maximum thread count.
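    The arithmetic behind that claim, with an illustrative resident-thread figure (the 64 CUs x 40 waves x 64 lanes numbers are GCN-like assumptions, not any specific chip's spec):

```python
# Problem size for pixel shading at 1080p @ 60 fps, no multi-sampling.
pixels_per_frame = 1920 * 1080
pixels_per_second = pixels_per_frame * 60  # 124,416,000 pixels/s

# Hypothetical maximum resident threads on a wide GPU:
# 64 CUs x 40 waves per CU x 64 lanes per wave (illustrative assumption).
resident_threads = 64 * 40 * 64  # 163,840 threads in flight

print(pixels_per_second)                     # 124416000
print(pixels_per_second / resident_threads)  # ~759 pixels per resident thread
```

    Even this best case leaves each hardware thread slot hundreds of pixels to chew through every second, which is what "orders of magnitude larger" means here.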
     
    #3 pTmdfx, Jul 9, 2016
    Last edited: Jul 9, 2016
    Razor1, Grall and Alexko like this.
  4. PlanarChaos

    Newcomer

    Joined:
    May 30, 2016
    Messages:
    30
    Likes Received:
    1
    From the review Ryan Smith wrote on the Fury X, diving into Fiji:
    source: http://www.anandtech.com/show/9390/the-amd-radeon-r9-fury-x-review/4


    In short, parts of GCN couldn't be scaled further than they were with Fiji compared to Hawaii.
     
  5. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    Because of Amdahl's law. With higher clocks, the less-than-optimally parallelizable parts of any given task finish sooner, so they won't keep the parallel parts of the GPU from doing useful stuff for as long as they would at lower clocks.

    Yes and no. It's always the product of clocks times number of units.

    If you have, for example, a 2 GHz clock but only one rasterizer or tessellator, you're going to be slower most of the time than with four raster/tessellation engines running at 1 GHz, because those tasks are highly parallel.

    It's a delicate balance between power, size/transistors invested, # of units, IPC and clocks.

    I suspect that in order to reach these high clocks on Pascal within the given power budget, Nvidia actually invested a fair amount of die space. Maybe they have clock islands for each GPC or even each SM, giving smaller areas over which to distribute clock signals while at the same time being able to intercept load/power spikes in single units much better. Just a guess.
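    A minimal Amdahl's-law sketch of the trade-off in this post: extra units only speed up the parallel portion of a task, while a higher clock speeds up everything, serial parts included. The 95% parallel fraction is an assumed number for illustration.

```python
# Relative execution time under Amdahl's law, scaled by clock speed.
def relative_time(parallel_fraction, units, clock_ghz):
    serial = 1.0 - parallel_fraction
    return (serial + parallel_fraction / units) / clock_ghz

base = relative_time(0.95, units=1, clock_ghz=1.0)  # baseline: 1 unit @ 1 GHz
wide = relative_time(0.95, units=4, clock_ghz=1.0)  # 4 units @ 1 GHz
fast = relative_time(0.95, units=1, clock_ghz=2.0)  # 1 unit @ 2 GHz

print(round(base / wide, 2))  # 3.48: width helps, capped by the serial part
print(round(base / fast, 2))  # 2.0: clock helps serial and parallel alike
```

    So 4x the units buys less than 4x the speed, while 2x the clock buys exactly 2x, which is why the balance CarstenS describes is delicate.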
     
    pharma and Razor1 like this.
  6. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    582
    Likes Received:
    285
    I want bandwidth.
     
  7. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    More of everything :) also good!
     
  8. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    … and then some! :D
     
  9. milk

    Veteran Regular

    Joined:
    Jun 6, 2012
    Messages:
    2,995
    Likes Received:
    2,563
    The question is: would it be better to have a bit more of everything and then some, or just a lot more of everything without any more some on top of that?
     
  10. AlBran

    AlBran Ferro-Fibrous
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    20,731
    Likes Received:
    5,822
    Location:
    ಠ_ಠ
    Are we talking about a larger bacon-wrapped cake with more bacon or just a proportionately larger cake wrapped in bacon?
     
  11. BRiT

    BRiT (╯°□°)╯
    Moderator Legend Alpha Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    12,518
    Likes Received:
    8,726
    Location:
    Cleveland
    The only way it's good is if it's American Bacon and not Beaver Bacon.
     
  12. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    In most cases yes, but there are steps in the rendering pipeline that are at least partially serial. In modern rendering pipelines you need to calculate some reductions (for example generating a mip chain, calculating the scene's average luminance, generating a histogram, generating a depth pyramid, sorting data, etc.), generally anything that requires a global prefix sum. Also, resource transitions from write to read need a sync point (wait until idle). A narrower GPU with higher clocks goes idle sooner (fewer threads to wait for) and returns from idle to full occupation sooner.

    The GPU pipeline is surprisingly deep, and warm-up hurts. Caches are cold at first, so the first vertex shader invocations will cache miss, meaning the GPU needs to wait until there are any vertices ready, meaning there is no pixel shader work. There is only a very limited amount of on-chip vertex attribute storage available, so the GPU cannot just fill itself with vertex work either. If those first triangles happen to be small (= not much pixel shader work), backfacing, or outside the screen, a wide GPU is just screwed (= mostly idling). You have similar problems when rendering background geometry (lots of draw calls of different meshes + tiny triangles). LOD and combining meshes together helps, but has its own implications regarding memory consumption and content development effort.

    Shadow map rendering doesn't even have pixel shaders, so it's very easy to underutilize a wide GPU when rendering shadow maps. Similarly, when you are rendering low-resolution buffers, such as occlusion buffers or anything with conservative rasterization, you likely underutilize a wide GPU. On AMD's GPUs you can avoid underutilization with asynchronous compute, but it's not a magic bullet: you still need to find parallel work. It's surprising how linear modern rendering pipelines are (you really want to process steps in a certain order; the next step often requires the previous step's output). You can of course compromise on latency (mixing previous- and current-frame tasks) to find more parallel work, or do some kind of split-scene rendering (but that might again hurt wider GPUs in some cases). It's not trivial to write code that is best for both wide and narrow GPUs.
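    The reductions sebbbi lists (mip chain, average luminance, depth pyramid) share the same shape: each level is parallel internally, but the levels themselves are serialized, which is exactly where a narrow, fast GPU loses less time. A pure-Python sketch of that dependency chain, averaging a 1D "image" down to a single value (assumes a power-of-two input length):

```python
# Build a mip-style reduction chain: each level halves the previous one,
# and each level can only start after the previous level has finished.
def build_mip_chain(pixels):
    levels = [pixels]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Average adjacent pairs; this is one "sync point" per level.
        levels.append([(prev[i] + prev[i + 1]) / 2.0
                       for i in range(0, len(prev), 2)])
    return levels

chain = build_mip_chain([1.0, 3.0, 5.0, 7.0])
print(chain[-1])   # [4.0]: the "average luminance" of the whole image
print(len(chain))  # 3 levels, each one waiting on the last
```

    Within a level all the pair-averages could run in parallel, but no amount of width collapses the level-to-level dependency; only the clock (or a restructured algorithm) shortens it.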
     
    dogen, trinibwoy, Alexko and 6 others like this.
  13. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,496
    Likes Received:
    910
    Do you think that as GPUs get wider, the ratio of, say, ALU throughput to geometry throughput might become more homogeneous across GPUs and make things easier? Right now there are threshold effects due to the very small number of geometry engines, and the correspondingly coarse granularity, but do you see that improving in the future?

    I mean, for AMD in particular, GPUs can have anywhere between 6 and 64 CUs, but only 1 to 4 primitives per clock. I don't know if my question is clear or whether it even makes sense, but I hope so.
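    The coarse granularity Alexko describes can be made concrete with the ranges from the post (6 to 64 CUs, 1 to 4 primitives per clock); the intermediate pairings below are hypothetical, chosen only to show how bumpy the resulting ratio is:

```python
# (CUs, primitives per clock) for some illustrative GCN-like configurations.
configs = [(6, 1), (16, 1), (36, 2), (64, 4)]
ratios = [cus / prims for cus, prims in configs]
print(ratios)  # [6.0, 16.0, 18.0, 16.0] CUs per primitive/clock
```

    Because the geometry front end only steps in whole engines, the ALU-to-geometry ratio jumps around rather than scaling smoothly with width, which is the threshold effect in question.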
     
  14. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    At least these 4 don't necessarily need to be done on the data of the frame currently being rendered, do they? There's actually a lot of work that can be done in parallel if you are willing to reuse past frames' results. Sure, that comes with slightly decreased visual quality (or a slight performance impact) immediately following big scene transitions, but mostly it shouldn't be noticeable at all.
     