Peak of (CPU) Bottleneck

agent_x007 · May 23, 2018

Hello

This is NOT another "Is GPU X bottlenecked by CPU Y ?" type of topic.
I know my CPU does (trust me on this), and please don't even try to tell me "CPU bottlenecks don't exist".

My question is simple :
Can CPU performance decrease GFLOPs generated by the GPU ?
I mean, GPU driver must do something on CPU for GPU to do anything (right ?).
So, is there a point at which the CPU isn't fast enough to get the GPU's full processing speed out of it, because driver execution is so slow ?

Short history lesson (for those that don't remember how old stuff works) :
Before BCLK, there was this thing called the FSB (front-side bus).
It was a good bus at the beginning, but after a while its flaw became clear :
By design, it was responsible for ALL transfers from and to CPU.
Be it from RAM, HDD or PCI-e/AGP devices - all data that went to CPU had to go through it.
At one point, FSB wasn't getting fast enough to cope with data from new CPUs and other hardware.
To check whether a slow CPU and bus can hold back the GPU, I did something no human has ever done :
I put a GTX 1080 FE in a PGA 478 motherboard with a Celeron 2.0A (Northwood-128 core) in it, and downclocked that to 1,9GHz (for good measure).

Result in AIDA64 :

^I added the PCI-e bandwidth table to show how slow both the RAM and PCI-e subsystems are at this point.
CPU-z valid : https://valid.x86.fr/2sve04

To better showcase what I did :
Because I decreased the FSB frequency to 380MHz (effective) and 95MHz (real), the bandwidth available to everything was reduced to 3080MB/s (theoretical).
That's NOT enough to saturate a PCI-e 1.x x16 slot, let alone DDR2 memory or anything hard disk related.
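
For anyone who wants to check the arithmetic, here is a rough sketch. It assumes the usual figures for this platform: a quad-pumped, 64-bit (8-byte) front-side bus, and 250 MB/s per lane per direction for PCI-e 1.x. The function name is made up for illustration. With these assumptions the result is 3040 MB/s, in the same ballpark as the figure quoted above and well under what a PCI-e 1.x x16 slot can carry.

```python
# Theoretical peak bandwidth of a quad-pumped, 64-bit front-side bus.
# Assumptions (not from the post): 4 transfers per clock, 8 bytes per transfer.
def fsb_bandwidth_mb_s(real_clock_mhz, transfers_per_clock=4, bus_width_bytes=8):
    return real_clock_mhz * transfers_per_clock * bus_width_bytes

fsb = fsb_bandwidth_mb_s(95)   # 95 MHz real clock = 380 MT/s effective
pcie1_x16 = 16 * 250           # PCI-e 1.x: 250 MB/s per lane, per direction

print("FSB           :", fsb, "MB/s")
print("PCI-e 1.x x16 :", pcie1_x16, "MB/s")
print("FSB slower than the slot:", fsb < pcie1_x16)
```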

Now, in the GPGPU benchmark there clearly is a slowdown.
However, it's in GPU Copy (which should be 250 000 MB/s+), and not so much in the 32-bit Single Precision test.
Can someone explain to me why single precision FLOPs didn't go down so much (20% from theoretical), when the driver clearly is limited by CPU processing speed (over 20% drop in internal memory copy speed) ?

Thank you.

MDolenc · May 24, 2018

Why would there be a slowdown? Compute kernels have a very simple state. There's not much for the CPU/driver to do. It has to compile the kernel to a more hardware-friendly form, for example, but that should be done before any benchmark timing comes into play. There is no reason for the GPU and CPU to talk after the kernel is launched. Even the GPU to GPU memory copy shouldn't go down significantly. Also note that all the timing in these cases is probably done CPU side, which means the GPU also has to communicate to the CPU that it's done.
If you submit 20ms of work to the GPU and CPU takes 10ms to hand off that work then you can keep GPU busy for as long as you like (say in a loop). If you submit 20ms of work to the GPU and CPU takes 30ms to hand it off then yeah, GPU will idle.
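
The two cases above can be put into a tiny model. This is only a sketch: it assumes the CPU prepares the next batch while the GPU is still running the current one, and it ignores everything else (transfers, sync points, driver overhead spikes). The function name is hypothetical.

```python
def gpu_utilisation(gpu_work_ms, cpu_submit_ms):
    """Steady-state fraction of time the GPU is busy, assuming the CPU
    prepares the next batch while the GPU runs the current one."""
    # One batch completes every max(gpu, cpu) milliseconds:
    # whichever side is slower sets the pace.
    period_ms = max(gpu_work_ms, cpu_submit_ms)
    return gpu_work_ms / period_ms

# 20 ms of GPU work, 10 ms CPU hand-off: the GPU never waits.
print(gpu_utilisation(20, 10))
# 20 ms of GPU work, 30 ms CPU hand-off: the GPU idles a third of the time.
print(gpu_utilisation(20, 30))
```

As long as the hand-off is shorter than the work it hands off, the CPU speed drops out of the result entirely.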

agent_x007 · May 24, 2018

Well I thought pairing a 10TFLOP GPU with CPU that is about as fast as Pentium III, would make some difference...

Here's a screenshot with data from my main PC (and proper CPU) :

In summary (biggest difference) :
Memory Copy : 225 722 MB/s vs. 206 146 MB/s (- 9,5%)
Single Precision Julia : 1945 FPS vs. 1783 FPS (- 9%)

Is ~10% performance loss significant ?
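
A quick way to check those percentages, as a sketch with made-up helper names. The only inputs are the four numbers quoted above. Note that the result depends on which value is used as the baseline: measured against the higher value the gap is a bit under 9%, measured against the lower value it is a bit over 9%.

```python
def drop_vs_higher(higher, lower):
    """Difference as a percentage of the higher value."""
    return (higher - lower) / higher * 100

def gain_vs_lower(higher, lower):
    """Difference as a percentage of the lower value."""
    return (higher - lower) / lower * 100

results = {
    "Memory Copy (MB/s)": (225722, 206146),
    "Single Precision Julia (FPS)": (1945, 1783),
}

for name, (higher, lower) in results.items():
    print(f"{name}: {drop_vs_higher(higher, lower):.1f}% of the higher value, "
          f"{gain_vs_lower(higher, lower):.1f}% of the lower value")
```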

MDolenc · May 24, 2018

I fail to see the reason for the amazement or the question? CPU and GPU are 2 independent systems. It's kinda like asking/thinking if your local PC will bottleneck Amazon cloud? Depends... How much do the two need to talk?
In synthetic test like this... Not much.

agent_x007 · May 24, 2018

Independent, yet working together

GPU requires driver to work and driver requires x86 CPU to operate.

I wanted to check how far you can go in limiting CPU performance and what that does to GPU performance.
Limiting VRAM bandwidth by any amount simply by using too slow a CPU isn't something I expected.
It's quite interesting...

To have an example of how much this CPU is limiting performance, here's 3DMark Fire Strike score :
113 points for Celeron : https://www.3dmark.com/fs/15587676
and
18 178 points Core i7 4960X @ 4,5GHz : https://www.3dmark.com/3dm/26765556

Basically : Card thinks it's idling while doing 3D test...
Q : Can Low power 3D (or 2D) clocks be the reason for lower performance in those two tests ?

Infinisearch · May 25, 2018

agent_x007 said:
Independent, yet working together
GPU requires driver to work and driver requires x86 CPU to operate.

I wanted to check how far you can go in limiting CPU performance and what that does to GPU performance.
Limiting VRAM bandwidth by any amount simply by using too slow a CPU isn't something I expected.
It's quite interesting...

To have an example of how much this CPU is limiting performance, here's 3DMark Fire Strike score :
113 points for Celeron : https://www.3dmark.com/fs/15587676
and
18 178 points Core i7 4960X @ 4,5GHz : https://www.3dmark.com/3dm/26765556

Basically : Card thinks it's idling while doing 3D test...
Q : Can Low power 3D (or 2D) clocks be the reason for lower performance in those two tests ?

Basically it depends on what exactly you're doing. If you have a compute shader that does a lot of work per invocation, then your GPU (FLOPS) won't really be that affected by a slow CPU. On the other hand, there is something like your Fire Strike scores, where the physics part of the score is performed on the CPU. And since you can't submit something to be drawn to the GPU until you know its position and orientation, which is the output of physics, you're gonna be CPU limited. IIRC Fire Strike is DX11; if Fire Strike is batching draw calls (which can make sense for DX11), then nothing can be submitted until all physics is complete. As to whether or not lower clocks could be the reason, I would think in this case it is possible (I'm not sure) that the GPU/driver detects that it can lower its clocks because it's not being fed fast enough to justify higher clocks.

agent_x007 · May 25, 2018

I guess we don't know exactly how (under what circumstances) the 2D / low-power 3D / 3D clocks are used.
Because, if the GPU/VRAM clock depends solely on GPU usage, using a very slow CPU will make the card not run as fast under full CPU load - regardless of task.
Also, I think it can work the opposite way too.
Assuming the GPU does anything that pushes its utilisation beyond a certain point, when the CPU is strong enough and the card is slow enough, the GPU may not use 2D clocks at all (even for web browsing).
This can drive idle power and temperatures on such a machine to very high levels.

Infinisearch · May 25, 2018

agent_x007 said:
Because, if the GPU/VRAM clock depends solely on GPU usage, using a very slow CPU will make the card not run as fast under full CPU load - regardless of task.

No, not regardless of task... if the CPU/driver has to process one command (let's say that takes 300ns) and sends that off to the GPU, which then takes 1sec to complete it, the CPU has enough time to process about 3 million commands. This is of course oversimplified and I don't take into account what feeds the driver, but the point is: if the GPU has work to complete that requires NO CPU intervention, then it becomes pretty much impossible to be CPU limited in this case.

edit - I shouldn't say impossible just unlikely.
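
The "about 3 million" figure is easy to reproduce. This sketch uses only the two example numbers from the post above (300ns of CPU time per command, 1sec of GPU time per command), which are illustrative and not measurements. The variable names are made up.

```python
# Example figures from the post, expressed in nanoseconds.
cpu_cost_per_command_ns = 300            # CPU/driver time to process one command
gpu_time_per_command_ns = 1_000_000_000  # GPU time to execute it (1 second)

# How many commands the CPU could prepare while the GPU executes just one.
commands_prepared = gpu_time_per_command_ns // cpu_cost_per_command_ns
print(commands_prepared)
```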

agent_x007 · May 25, 2018

Based on my results, there are tests that are not affected by an extremely slow CPU, so there definitely is truth to what you all are saying. I get what you are saying, and if the CPU had only the GPU driver to worry about - sure, impossible to be CPU limited in some circumstances.

Still :
I run the PCI-e render test from GPU-z : 100% CPU load.
I watch a video on YT : 100% CPU load.
...
Just running Task Manager shows 10-15% CPU load.

Now I will try to check the max. OC of my card.

I'm wondering if artefacts will show up later, because GPU can't be utilised properly...
