More GPU cores or more GPU clock speeds?

gongo

Regular
As an armchair observer, I ask: are games today taking better advantage of higher GPU clock speeds rather than moar cores?

I don't understand how the AMD GCN architecture fell off after the 290 Hawaii. Then I noticed Nvidia GPUs have been increasingly relying on high clocks. With high clocks, you can push things like fillrate and tessellation higher with fewer units, no? And with fewer compute units, you can push for higher clocks... and you need fewer units because game engines are still not as massively parallel / are limited by consoles... does that make sense?
 
With high clocks, you can push things like fillrate and tessellation higher with fewer units, no?
Yes.

And with fewer compute units, you can push for higher clocks...
No, they're not really linked. Being able to run higher clocks is primarily determined by the microarchitecture.

and you need fewer units because game engines are still not as massively parallel / are limited by consoles... does that make sense?
No, that doesn't make a lot of sense. Rendering is inherently a massively parallel operation, but that parallelism is, for the most part, hidden behind an API.
 
with fewer compute units, you can push for higher clocks... and you need fewer units because game engines are still not as massively parallel / are limited by consoles... does that make sense?
You'd probably want a better understanding of the 3D rendering pipeline before digging into hardware design details like this.

The parallelism on the CPU end and the parallelism among the compute units are two completely different contexts. One is about preparing and pushing an embarrassingly large block of work to the GPU, while the other is about picking up that work, preparing it, and breaking it down into hardware threads for execution. Game engines live in the former context, while the latter context depends on what combination of work is being submitted by the game engine.

Like silent_guy said, rendering is an embarrassingly parallel problem: for pixel shaders alone you have at least hundreds of millions of pixels per second at 60 fps 1080p, not even considering multi-sampling or overdraw. Then before pixel shaders you have vertices and geometry. The problem size is orders of magnitude larger than the best-case hardware maximum thread count.
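To put a rough number on that, here is a quick back-of-the-envelope Python sketch; the in-flight thread count is just a made-up ballpark, not any specific GPU:

# Rough pixel shader workload at 1080p / 60 fps (ignoring overdraw and MSAA).
pixels_per_frame = 1920 * 1080                  # ~2.07 million
invocations_per_second = pixels_per_frame * 60  # ~124 million pixel shader invocations/s

# A wide GPU keeps on the order of tens of thousands of threads in flight
# at once (the exact number depends on the chip and on occupancy);
# 40,000 here is a made-up ballpark, not a real spec.
threads_in_flight = 40_000
print(invocations_per_second)                      # 124416000
print(invocations_per_second / threads_in_flight)  # ~3110 "refills" of the machine per second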
 
I don't understand how the AMD GCN architecture fell off after the 290 Hawaii. Then I noticed Nvidia GPUs have been increasingly relying on high clocks. With high clocks, you can push things like fillrate and tessellation higher with fewer units, no? And with fewer compute units, you can push for higher clocks... and you need fewer units because game engines are still not as massively parallel / are limited by consoles... does that make sense?
from the review Ryan Smith wrote on Fury X diving into Fiji:
Looking at the broader picture, what AMD has done relative to Hawaii is to increase the number of CUs per shader engine, but not changing the number of shader engines themselves or the number of other resources available for each shader engine. At the time of the Hawaii launch AMD told us that the GCN 1.1 architecture had a maximum scalability of 4 shader engines, and Fiji’s implementation is consistent with that. While I don’t expect AMD will never go beyond 4 shader engines – there are always changes that can be made to increase scalability – given what we know of GCN 1.1’s limitations, it looks like AMD has not attempted to increase their limits with GCN 1.2. What this means is that Fiji is likely the largest possible implementation of GCN 1.2, with as many resources as the architecture can scale out to without more radical changes under the hood to support more scalability.
<picture snipped>
Along those lines, while shading performance is greatly increased over Hawaii, the rest of the front-end is very similar from a raw, theoretical point of view. The geometry processors, as we mentioned before, are organized 1 per shader engine, just as was the case with Hawaii. With a 1 poly/clock limit here, Fiji has the same theoretical triangle throughput as Hawaii did, with real-world clockspeeds driving things up just a bit over the R9 290X. However, as we discussed in our look at the GCN 1.2 architecture, AMD has made some significant under-the-hood changes to the geometry processor design for GCN 1.2/Fiji in order to boost their geometry efficiency, making Fiji's geometry front-end faster and more efficient than Hawaii's. As a result the theoretical performance may be unchanged, but in the real world Fiji is going to offer better geometry performance than Hawaii does.
source: http://www.anandtech.com/show/9390/the-amd-radeon-r9-fury-x-review/4


In short, parts of GCN couldn't be scaled further than they were with Fiji compared to Hawaii.
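To put rough numbers on that, here is a quick sketch; I'm assuming roughly 1000 MHz for the 290X and 1050 MHz for the Fury X and the commonly quoted unit counts, so treat the figures as approximate:

# Back-of-the-envelope theoretical peaks for Hawaii vs. Fiji.
def tri_rate_gtris(geometry_engines, clock_ghz):
    return geometry_engines * 1 * clock_ghz            # 1 primitive per engine per clock

def shader_tflops(stream_processors, clock_ghz):
    return stream_processors * 2 * clock_ghz / 1000.0  # FMA counted as 2 ops

print(tri_rate_gtris(4, 1.00), shader_tflops(2816, 1.00))  # Hawaii (290X):  4.0 Gtris/s, ~5.6 TFLOPS
print(tri_rate_gtris(4, 1.05), shader_tflops(4096, 1.05))  # Fiji (Fury X):  4.2 Gtris/s, ~8.6 TFLOPS
# Shader throughput grows by roughly half, triangle throughput only by the clock bump.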
 
As an armchair observer, I ask: are games today taking better advantage of higher GPU clock speeds rather than moar cores?
Because of Amdahl's law. Higher clocks mean the poorly parallelizable parts of any given task won't keep the parallel parts of the GPU from doing useful work for as long as they would at lower clocks.
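A toy Amdahl's law sketch of why that matters; the 10% serial fraction is made up purely for illustration:

# Amdahl's law: speedup from n-way parallel hardware when a fraction s of
# the frame is effectively serial.
def speedup(n_units, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_units)

print(speedup(2, 0.10))   # ~1.82x from doubling the parallel units
print(speedup(4, 0.10))   # ~3.08x from quadrupling them
# Doubling the clock instead speeds up the serial AND parallel parts, i.e. a
# clean 2x (power permitting), which is why narrow-and-fast can beat
# wide-and-slow when the serial fraction isn't negligible.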

With high clocks, you can push things like fillrate and tessellation higher with fewer units, no?
Yes and no. It's always the product of clocks times number of units.

If you have, for example, a 2 GHz clock but only one rasterizer or tessellator, you're gonna be slower most of the time than with four raster/tessellation engines running at 1 GHz, because those tasks are highly parallel.
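Quick numbers for that example, peak rates only, assuming 1 primitive per engine per clock:

# Peak rate = number of units * rate per unit per clock * clock.
narrow_fast = 1 * 1 * 2.0e9   # one engine at 2 GHz   -> 2 billion prims/s peak
wide_slow   = 4 * 1 * 1.0e9   # four engines at 1 GHz -> 4 billion prims/s peak
print(narrow_fast, wide_slow)
# The wide part wins on paper; the narrow part only catches up on workloads
# that can't actually feed all four engines every clock.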

It's a delicate balance between power, size/transistors invested, # of units, IPC and clocks.

I suspect that in order to reach these high clocks on Pascal within the given power budget, Nvidia actually invested a fair amount of die space. Maybe they have clock islands for each GPC or even each SM, giving them smaller areas over which to distribute clock signals and at the same time letting them handle load/power spikes in individual units much better. Just a guess.
 
The question is: would it be better to have a bit more of everything and then some, or just a lot more of everything without any "then some" on top of that?
 
Are we talking about a larger bacon-wrapped cake with more bacon or just a proportionately larger cake wrapped in bacon?
 
Are we talking about a larger bacon-wrapped cake with more bacon or just a proportionately larger cake wrapped in bacon?
The only way it's good is if it's American Bacon and not Beaver Bacon.
 
Like silent_guy said, rendering is an embarrassingly parallel problem: for pixel shaders alone you have at least hundreds of millions of pixels per second at 60 fps 1080p, not even considering multi-sampling or overdraw. Then before pixel shaders you have vertices and geometry. The problem size is orders of magnitude larger than the best-case hardware maximum thread count.
In most cases yes, but there are steps in the rendering pipeline that are at least partially serial. In modern rendering pipelines you need to calculate some reductions (for example generate a mip chain, calculate scene average luminance, generate a histogram, generate a depth pyramid, sort data, etc.): generally anything that requires a global prefix sum. Also, resource transitions from write to read need a sync point (wait until idle). A narrower GPU with higher clocks goes idle sooner (fewer threads to wait for) and returns from idle to full occupation sooner.

The GPU pipeline is surprisingly deep, and warm-up hurts. Caches are cold at first, so the first vertex shader invocations will cache miss, meaning the GPU has to wait until there are any vertices ready, meaning there is no pixel shader work. There is only a very limited amount of on-chip vertex attribute storage available, so the GPU cannot simply fill itself with vertex work either. If those first triangles happen to be small (= not much pixel shader work), backfacing, or outside the screen, a wide GPU is just screwed (= mostly idling). You have similar problems when rendering background geometry (lots of draw calls of different meshes + tiny triangles). LOD and combining meshes together help, but have their own implications for memory consumption and content development effort.
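As a rough illustration of why those reductions serialize, here is a CPU-side Python/NumPy stand-in for a chain of dependent compute dispatches; it is not real GPU code, just the dependency structure:

import numpy as np

# Toy average-luminance reduction done the way a GPU would: a chain of
# dependent passes (each pass would be one compute dispatch), with a sync
# point between passes because pass N reads what pass N-1 wrote.
luminance = np.random.rand(1080 * 1920)   # stand-in for a 1080p luminance buffer
passes = 0
while luminance.size > 1:
    if luminance.size % 2:                             # pad odd sizes with a zero
        luminance = np.append(luminance, 0.0)
    luminance = luminance[0::2] + luminance[1::2]      # pairwise sums: parallel *within* a pass
    passes += 1

print(passes, luminance[0] / (1080 * 1920))            # 21 dependent passes, average ~0.5

A real implementation reduces by a much larger factor per dispatch (one workgroup handles hundreds or thousands of elements), so there are only a handful of passes, but the dependent-pass structure and the barriers between passes are the same.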

Shadow map rendering doesn't even have pixel shaders, so it's very easy to underutilize a wide GPU when rendering shadow maps. Similarly, when you are rendering low-resolution buffers, such as occlusion buffers or anything with conservative rasterization, you likely underutilize a wide GPU. On AMD's GPUs you can avoid underutilization with asynchronous compute, but it's not a magic bullet: you still need to find parallel work. It's surprising how linear modern rendering pipelines are (you really want to process steps in a certain order; the next step often requires the previous step's output). You can of course trade latency for parallelism (mixing previous- and current-frame tasks) to find more parallel work, or do some kind of split scene rendering (but that might again hurt wider GPUs in some cases). It's not trivial to write code that is optimal for both wide and narrow GPUs.
 
Do you think that as GPUs get wider, the ratio of, say, ALU throughput to geometry throughput might become more homogeneous across GPUs and make things easier? Right now there are threshold effects due to the very small number of geometry engines, and the correspondingly coarse granularity, but do you see that improving in the future?

I mean, for AMD in particular, GPUs can have anywhere between 6 and 64 CUs, but only 1 to 4 primitives per clock. I don't know if my question is clear or whether it even makes sense, but I hope so.
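To put rough numbers on what I mean (the endpoints below are the figures above; the middle entry is just a made-up illustration, not a specific SKU):

# CUs per geometry engine across the range mentioned above.
for cus, prims_per_clock in [(6, 1), (36, 4), (64, 4)]:
    print(cus, prims_per_clock, cus / prims_per_clock)   # 6.0, 9.0, 16.0
# The ALU-to-geometry ratio spans almost 3x across the lineup, which is the
# coarse granularity I'm talking about.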
 
calculate scene average luminance, generate a histogram, generate a depth pyramid, sort data
At least these four don't necessarily need to be done on the data of the frame currently being rendered, do they? There's actually a lot of work to be done in parallel if you are willing to re-use past frames' results. Sure, that comes at slightly decreased visual quality (or a slight performance cost) immediately following huge scene transitions, but mostly it shouldn't be noticeable at all.
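A minimal sketch of that latency trade, assuming a simple auto-exposure driven by the previous frame's average luminance (NumPy stand-in, placeholder math):

import numpy as np

# Hypothetical frame loop: exposure for frame N is driven by the average
# luminance reduced from frame N-1, so the reduction never sits in the
# middle of the current frame's critical path. All numbers are placeholders.
prev_avg = 0.18   # seed for the very first frame

def render_frame(rng):
    global prev_avg
    exposure = 0.18 / max(prev_avg, 1e-4)      # simple auto-exposure from last frame's result
    hdr = rng.random((1080, 1920)) * 2.0       # stand-in for the shaded HDR buffer
    ldr = np.clip(hdr * exposure, 0.0, 1.0)    # tone-map with the one-frame-old exposure
    prev_avg = float(hdr.mean())               # reduction result consumed next frame
    return ldr

rng = np.random.default_rng(0)
for _ in range(3):
    frame = render_frame(rng)
# Worst case: exposure lags one frame behind a sudden scene change.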
 