More GPU cores or more GPU clock speeds?

Discussion in 'Architecture and Products' started by gongo, Jul 9, 2016.

  1. gongo

    Regular

    Joined:
    Jan 26, 2008
    Messages:
    582
    Likes Received:
    12
    As an armchair observer, I ask: are games today taking better advantage of higher GPU clock speeds rather than more cores?

    I don't understand how AMD's GCN architecture can fall off after the 290's Hawaii. Then I noticed Nvidia GPUs have been increasingly relying on high clocks. With high clocks, you can push things like fillrate and tessellation higher with fewer units, no? And with fewer compute units, you can push for higher clocks... and you need fewer units because game engines are still not massively parallel / are limited by consoles... does that make sense?
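    As a back-of-the-envelope illustration of the units-times-clock trade-off the question hints at (the ROP counts and clocks below are made up for illustration, not real GPU specs):

```python
# Peak pixel fillrate scales roughly as (number of ROPs) x (clock speed),
# so fewer units at a higher clock can match a wider, slower chip.
# Both configurations below are hypothetical.
wide_rops, wide_clock_hz = 64, 1.0e9      # wide design: 64 ROPs @ 1.0 GHz
narrow_rops, narrow_clock_hz = 32, 2.0e9  # narrow design: 32 ROPs @ 2.0 GHz

wide_fillrate = wide_rops * wide_clock_hz      # pixels per second
narrow_fillrate = narrow_rops * narrow_clock_hz

print(wide_fillrate == narrow_fillrate)  # True: identical peak fillrate
```

    Of course, as the replies below point out, peak throughput is only part of the story.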
     
  2. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Yes.

    No, they're not really linked. Being able to run higher clocks is primarily determined by the microarchitecture.

    No, that doesn't make a lot of sense. Rendering is inherently a massively parallel operation, but that parallelism is, for the most part, hidden behind an API.
     
    Razor1 and Alexko like this.
  3. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    You probably want a better understanding of the 3D rendering pipeline before digging into hardware design details like this.

    The parallelism on the CPU end and the parallelism among compute units are two completely different contexts. One is about preparing and pushing an embarrassingly large block of work to the GPU, while the other is about picking up that work, preparing it, and breaking it down into hardware threads for execution. Game engines are within the former context, while the latter context depends on what combination of work is being submitted by the game engine.

    Like silent_guy said, rendering is an embarrassingly parallel problem: for pixel shaders alone you have at least hundreds of millions of pixels per second at 60 fps 1080p, not even considering multi-sampling. Then before pixel shaders you have vertices and geometry. The problem size is orders of magnitude larger than the best-case hardware maximum thread count.
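    The arithmetic behind that claim, with an illustrative resident-thread figure (the 64 CUs x 40 waves x 64 lanes numbers are GCN-like assumptions, not any specific chip's spec):

```python
# Problem size for pixel shading at 1080p @ 60 fps, no multi-sampling.
pixels_per_frame = 1920 * 1080
pixels_per_second = pixels_per_frame * 60  # 124,416,000 pixels/s

# Hypothetical maximum resident threads on a wide GPU:
# 64 CUs x 40 waves per CU x 64 lanes per wave (illustrative assumption).
resident_threads = 64 * 40 * 64  # 163,840 threads in flight

print(pixels_per_second)                     # 124416000
print(pixels_per_second / resident_threads)  # ~759 pixels per resident thread
```

    Even this best case leaves each hardware thread slot hundreds of pixels to chew through every second, which is what "orders of magnitude larger" means here.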
     
    #3 pTmdfx, Jul 9, 2016
    Last edited: Jul 9, 2016
    Razor1, Grall and Alexko like this.
  4. PlanarChaos

    Newcomer

    Joined:
    May 30, 2016
    Messages:
    30
    Likes Received:
    1
    From the review Ryan Smith wrote on the Fury X, diving into Fiji:
    source: http://www.anandtech.com/show/9390/the-amd-radeon-r9-fury-x-review/4


    In short, parts of GCN couldn't be scaled further than they were with Fiji compared to Hawaii.
     
  5. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    Because of Amdahl's law. With higher clocks, the less-than-optimally parallelizable parts of any given task finish sooner, so they won't keep the parallel parts of the GPU from doing useful stuff for as long as they would at lower clocks.

    Yes and no. It's always the product of clocks times number of units.

    If you have, for example, a 2 GHz clock but only one rasterizer or tessellator, you're going to be slower most of the time than with four raster/tessellation engines running at 1 GHz, because those tasks are highly parallel.

    It's a delicate balance between power, size/transistors invested, # of units, IPC and clocks.

    I suspect that in order to reach these high clocks on Pascal within the given power budget, Nvidia actually invested a fair amount of die space. Maybe they have clock islands for each GPC or even each SM, giving smaller areas over which to distribute clock signals while at the same time being able to intercept load/power spikes in single units much better. Just a guess.
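    A minimal Amdahl's-law sketch of the trade-off in this post: extra units only speed up the parallel portion of a task, while a higher clock speeds up everything, serial parts included. The 95% parallel fraction is an assumed number for illustration.

```python
# Relative execution time under Amdahl's law, scaled by clock speed.
def relative_time(parallel_fraction, units, clock_ghz):
    serial = 1.0 - parallel_fraction
    return (serial + parallel_fraction / units) / clock_ghz

base = relative_time(0.95, units=1, clock_ghz=1.0)  # baseline: 1 unit @ 1 GHz
wide = relative_time(0.95, units=4, clock_ghz=1.0)  # 4 units @ 1 GHz
fast = relative_time(0.95, units=1, clock_ghz=2.0)  # 1 unit @ 2 GHz

print(round(base / wide, 2))  # 3.48: width helps, capped by the serial part
print(round(base / fast, 2))  # 2.0: clock helps serial and parallel alike
```

    So 4x the units buys less than 4x the speed, while 2x the clock buys exactly 2x, which is why the balance CarstenS describes is delicate.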
     
    pharma and Razor1 like this.
  6. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    582
    Likes Received:
    285
    I want bandwidth.
     
  7. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    More of everything :) also good!
     
  8. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    … and then some! :D
     
  9. milk

    Veteran Regular

    Joined:
    Jun 6, 2012
    Messages:
    2,995
    Likes Received:
    2,563
    The question is: would it be better to have a bit more of everything and then some, or just a lot more of everything without any more some on top of that?
     
  10. AlBran

    AlBran Ferro-Fibrous
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    20,731
    Likes Received:
    5,822
    Location:
    ಠ_ಠ
    Are we talking about a larger bacon-wrapped cake with more bacon or just a proportionately larger cake wrapped in bacon?
     
  11. BRiT

    BRiT (╯°□°)╯
    Moderator Legend Alpha Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    12,518
    Likes Received:
    8,726
    Location:
    Cleveland
    The only way it's good is if it's American Bacon and not Beaver Bacon.
     
  12. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    In most cases yes, but there are steps in the rendering pipeline that are at least partially serial. In modern rendering pipelines you need to calculate some reductions (for example generating a mip chain, calculating the scene's average luminance, generating a histogram, generating a depth pyramid, sorting data, etc.), generally anything that requires a global prefix sum. Also, resource transitions from write to read need a sync point (wait until idle). A narrower GPU with higher clocks goes idle sooner (fewer threads to wait for) and returns from idle to full occupation sooner.

    The GPU pipeline is surprisingly deep, and warm-up hurts. Caches are cold at first, so the first vertex shader invocations will cache miss, meaning the GPU needs to wait until there are any vertices ready, meaning there is no pixel shader work. There is only a very limited amount of on-chip vertex attribute storage available, so the GPU cannot just fill itself with vertex work either. If those first triangles happen to be small (= not much pixel shader work), backfacing, or outside the screen, a wide GPU is just screwed (= mostly idling). You have similar problems when rendering background geometry (lots of draw calls of different meshes + tiny triangles). LOD and combining meshes together helps, but has its own implications regarding memory consumption and content development effort.

    Shadow map rendering doesn't even have pixel shaders, so it's very easy to underutilize a wide GPU when rendering shadow maps. Similarly, when you are rendering low-resolution buffers, such as occlusion buffers or anything with conservative rasterization, you likely underutilize a wide GPU. On AMD's GPUs you can avoid underutilization with asynchronous compute, but it's not a magic bullet: you still need to find parallel work. It's surprising how linear modern rendering pipelines are (you really want to process steps in a certain order; the next step often requires the previous step's output). You can of course compromise on latency (mixing previous- and current-frame tasks) to find more parallel work, or do some kind of split-scene rendering (but that might again hurt wider GPUs in some cases). It's not trivial to write code that is best for both wide and narrow GPUs.
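    The reductions sebbbi lists (mip chain, average luminance, depth pyramid) share the same shape: each level is parallel internally, but the levels themselves are serialized, which is exactly where a narrow, fast GPU loses less time. A pure-Python sketch of that dependency chain, averaging a 1D "image" down to a single value (assumes a power-of-two input length):

```python
# Build a mip-style reduction chain: each level halves the previous one,
# and each level can only start after the previous level has finished.
def build_mip_chain(pixels):
    levels = [pixels]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Average adjacent pairs; this is one "sync point" per level.
        levels.append([(prev[i] + prev[i + 1]) / 2.0
                       for i in range(0, len(prev), 2)])
    return levels

chain = build_mip_chain([1.0, 3.0, 5.0, 7.0])
print(chain[-1])   # [4.0]: the "average luminance" of the whole image
print(len(chain))  # 3 levels, each one waiting on the last
```

    Within a level all the pair-averages could run in parallel, but no amount of width collapses the level-to-level dependency; only the clock (or a restructured algorithm) shortens it.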
     
    dogen, trinibwoy, Alexko and 6 others like this.
  13. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,496
    Likes Received:
    910
    Do you think that as GPUs get wider, the ratio of, say, ALU throughput to geometry throughput might become more homogeneous across GPUs and make things easier? Right now there are threshold effects due to the very small number of geometry engines, and the correspondingly coarse granularity, but do you see that improving in the future?

    I mean, for AMD in particular, GPUs can have anywhere between 6 and 64 CUs, but only 1 to 4 primitives per clock. I don't know if my question is clear or whether it even makes sense, but I hope so.
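    The coarse granularity Alexko describes can be made concrete with the ranges from the post (6 to 64 CUs, 1 to 4 primitives per clock); the intermediate pairings below are hypothetical, chosen only to show how bumpy the resulting ratio is:

```python
# (CUs, primitives per clock) for some illustrative GCN-like configurations.
configs = [(6, 1), (16, 1), (36, 2), (64, 4)]
ratios = [cus / prims for cus, prims in configs]
print(ratios)  # [6.0, 16.0, 18.0, 16.0] CUs per primitive/clock
```

    Because the geometry front end only steps in whole engines, the ALU-to-geometry ratio jumps around rather than scaling smoothly with width, which is the threshold effect in question.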
     
  14. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    337
    Likes Received:
    294
    At least these 4 don't necessarily need to be done on the data of the frame currently being rendered, do they? There's actually a lot of work that can be done in parallel if you are willing to reuse past frames' results. Sure, that comes with slightly decreased visual quality (or a slight performance impact) immediately following big scene transitions, but mostly it shouldn't be noticeable at all.
     