Amdahl's Law

I made some other calculations for code that is 99.9% parallelizable:

Tahiti @ 1 GHz = 672x
GK110 2688 cores @ 0.8 GHz = 612x (Tahiti should be faster)
4096 cores @ 1 GHz = 804x (20% more than Tahiti)

Either I'm missing other variables or 3D graphics is more than 99.9% parallelizable; either way, improving core counts alone won't cut it for long.
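For reference, here is the arithmetic as a minimal sketch, assuming the plain Amdahl formula S(N) = 1 / ((1 - p) + p/N) against a single core at the same clock (the post doesn't spell out its baseline, and the 0.8 GHz GK110 figure involves a clock adjustment this sketch doesn't attempt):

```cuda
// Minimal sketch of Amdahl's law: speedup over one core is
//   S(N) = 1 / ((1 - p) + p / N)
// The swept core counts are illustrative.
#include <cstdio>

double amdahl_speedup(double p, double cores) {
    return 1.0 / ((1.0 - p) + p / cores);
}

int main() {
    const double p = 0.999;                  // 99.9% parallelizable
    const double ceiling = 1.0 / (1.0 - p);  // limit as cores -> infinity
    for (int n = 1024; n <= 16384; n *= 2)
        printf("%5d cores: %6.0fx (ceiling %.0fx)\n",
               n, amdahl_speedup(p, n), ceiling);
    return 0;
}
```

The 2048- and 4096-core rows land on the 672x and 804x figures above, and each doubling of cores buys noticeably less than 2x, all of it capped by the 1/(1 - p) ceiling.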
 
Thread Necro Lord reply

Take a simple but crazy example. Let's have one CU (or whatever it's called this week) per pixel. Now run a simple blur, where each pixel grabs its surrounding pixels and runs a simple matrix kernel on them. It's perfectly parallel, as there is no serial communication. So 2-million-CU GPUs would be awesome ;)
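As a concrete sketch of that per-pixel blur, here it is written as a CUDA kernel (the kernel name and the 3x3 box filter are just illustrative choices):

```cuda
// One thread per output pixel; each thread only reads its input neighbourhood
// and writes its own output pixel, so there is no inter-thread communication.
__global__ void box_blur_3x3(const float* in, float* out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    int count = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
                sum += in[ny * width + nx];
                ++count;
            }
        }
    }
    out[y * width + x] = sum / count;
}

// Launch example (d_in/d_out are device buffers, one thread per pixel):
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   box_blur_3x3<<<grid, block>>>(d_in, d_out, width, height);
```

The input is read-only and every output pixel is owned by exactly one thread, so the work spreads across as many units as there are pixels with no communication between them.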

Of course most graphics operations are nowhere near that parallel and want to get data from other parts of an image/mesh etc. that other CUs are potentially processing, and that's where the serial part comes in.

So Amdahl's law kicks in and increasing CUs gets you less and less as you add more. However, there is another 'get out', or complexity if you wish. You can simply add more completely separate jobs, i.e. whilst running a filter that is serially limited, have additional CUs work on a mesh that has nothing to do with it except in 1 second's time (or any significant length of time).

Now you still can't escape Amdahl's law if they ever had to interact at all (if they load off the same disk, end up being used in the same image later on, or share a memory manager).

Which is why the schedulers of modern GPUs have been advancing fast. If they can kick off multiple jobs that are only loosely connected, then adding CUs is still a win.

But it's hard, very hard. Hence why CPUs haven't just shipped with 128 cores, and GPUs add a lot of other things rather than just more CUs.

And why nobody knows if/where the failure point is.
 
In the analogue computer running the RL simulation, the serial part of the computation is 0... It's often computationally interesting to do some things serially, but the real world proves that for its simulation it's by no means necessary. That's what most computationally relevant parts of games are: simulations (slightly simplified) of the real world.
 
The big flaw with Amdahl's law is that it assumes that the serial portion and the parallel portion have the same asymptotic (big-O) cost. This is often not true.

Many algorithms have something like O(k) for serial and O(n) for the parallel, which means that the serial percentage approaches 0% as the problem size becomes larger.
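A quick numerical sketch of that point, with a made-up fixed serial cost and O(n) parallel work:

```cuda
// With serial work fixed at k and parallel work growing as O(n), the serial
// *fraction* of the total shrinks as the problem grows. Constants are made up.
#include <cstdio>

int main() {
    const double k = 1000.0;                     // fixed serial cost (arbitrary units)
    for (double n = 1e3; n <= 1e9; n *= 1000.0)  // parallel cost ~ n
        printf("n = %.0e: serial fraction = %.4f%%\n", n, 100.0 * k / (k + n));
    return 0;
}
```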
 
Is this why Tahiti is so inefficient compared to GK104, but also to Pitcairn and Cape Verde?
2048 cores are already the virtual limit for 95% parallel code: the ceiling there is a 20x speed-up, and 2048 cores already deliver about 19.8x.
But let's say graphics is 99% parallelizable; we would have:

Tahiti @ 1 GHz giving a 95.4 times speed-up.

GK110 @ 1 GHz giving a 96.4 times speed-up.

In real-world performance, Tahiti is less efficient because of Direct3D and driver latency, but even if they put out a 4096-core part @ 1 GHz, the most they would get is a 97.6 times speed-up.
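For what it's worth, those three figures fall straight out of the plain Amdahl formula with a 99% parallel fraction (the clocks are equal, so they cancel out of the ratio):

```cuda
// Plugging the post's core counts into S(N) = 1 / ((1 - p) + p / N) with p = 0.99.
#include <cstdio>

int main() {
    const double p = 0.99;
    const int cores[] = {2048, 2688, 4096};  // Tahiti, GK110, hypothetical part
    for (int n : cores)
        printf("%4d cores: %.1fx\n", n, 1.0 / ((1.0 - p) + p / n));
    return 0;
}
```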

Is it time for AMD to just make die-shrinks and increase clock frequency?
If so, eventually a 2048-core GPU will get embedded in the CPU and discrete cards will be history.
Then vector extensions will reach the same width and replace GPUs completely.

I think the answers to your questions should be mainly NOs.
It would probably be great if someone explained in detail how a potential shrink of a given chip or architecture would turn out and what benefits it would bring, because, as we already know, increasing the frequency doesn't scale performance linearly...
 
So Amdahl's law kicks in and increasing CUs gets you less and less as you add more. However, there is another 'get out', or complexity if you wish. You can simply add more completely separate jobs, i.e. whilst running a filter that is serially limited, have additional CUs work on a mesh that has nothing to do with it except in 1 second's time (or any significant length of time).
GCN's reliance on having many independent jobs has been shown to be high.
Tahiti has been profiled to have utilization problems in compute if the problem size drops too low. The floor is higher than it is for Nvidia, where a weaker GPU can keep pace with GCN until GCN can bring its resources to bear.

Now you still can't escape Amdahl's law if they ever had to interact at all (if they load off the same disk, end up being used in the same image later on, or share a memory manager).
Or, to make the more general observation about diminishing returns: the same memory subsystem, issue logic, or DRAM controller.
There could be no logical dependence between the various wavefronts, but that doesn't magic away the physical hazards. We're not seeing the number of memory channels or the capacity of storage increase arbitrarily.


The big flaw with Amdahl's law is that it assumes that the serial portion and the parallel portion have the same asymptotic (big-O) cost. This is often not true.

Many algorithms have something like O(k) for serial and O(n) for the parallel, which means that the serial percentage approaches 0% as the problem size becomes larger.
It's not a flaw; Amdahl's law makes an observation about speedup for a fixed problem size.
 
GCN's reliance on having many independent jobs has been shown to be high.
Tahiti has been profiled to have utilization problems in compute if the problem size drops too low. The floor is higher than it is for Nvidia, where a weaker GPU can keep pace with GCN until GCN can bring its resources to bear.
On GCN, you can overlap independent kernel dispatches even within the same command queue. This allows for some pretty large performance benefits as you can achieve scaling even with smaller workloads as long as you have other work to schedule.
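As an analogous illustration (CUDA streams rather than GCN's in-queue overlap, so the mechanism differs but the scheduling idea is the same), two unrelated kernels launched without any dependency between them are free to run concurrently when either one would leave units idle:

```cuda
// Illustrative only: two kernels with no dependency between them, launched on
// separate streams so the hardware may overlap them. Data is left
// uninitialized because the sketch is about scheduling, not results.
#include <cuda_runtime.h>

__global__ void small_job(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void other_job(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 16;  // deliberately small workloads
    float *a, *b;
    cudaMalloc((void**)&a, n * sizeof(float));
    cudaMalloc((void**)&b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Neither launch waits on the other, so the GPU is free to run them
    // concurrently instead of leaving compute units idle.
    small_job<<<(n + 255) / 256, 256, 0, s0>>>(a, n);
    other_job<<<(n + 255) / 256, 256, 0, s1>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```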
 
This allows for some pretty large performance benefits as you can achieve scaling even with smaller workloads as long as you have other work to schedule.

This last sentence is the other side of the point I was making. There are benefits as long as there is other work to schedule, or, more likely with the software/drivers in question, as long as there are tools capable of reliably extracting said information.
If the aggregate size of the smaller workloads is still small, GCN doesn't maintain its lead.

There's a wider range of values of N where the overheads or underutilization are too large to hide, relative to other architectures.

The general assumption that there is always more or bigger work to do is the underpinning of Gustafson's law, until other constraints of the physical machine or the desired workload are brought to bear and cap the amount of parallel work that can be supported.
We see this in benchmarks with GPUs set to resolutions so high they give somewhat palatable numbers, but they don't look as imposing against the competition once the resolutions are lowered and the frame rates are closer to what gamers are comfortable with.
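For contrast, a small sketch of the two views side by side, with an illustrative 1% serial fraction: Amdahl's fixed-size speedup flattens toward 1/s, while Gustafson's scaled speedup, S(N) = N - s*(N - 1), keeps climbing until the constraints mentioned above bite:

```cuda
// Illustrative comparison (serial fraction s is made up): Amdahl assumes a
// fixed problem, Gustafson assumes the parallel part grows with the machine.
#include <cstdio>

int main() {
    const double s = 0.01;  // assumed serial fraction
    for (int n = 1024; n <= 16384; n *= 2)
        printf("%5d cores: Amdahl %7.1fx vs Gustafson %8.1fx\n",
               n, 1.0 / (s + (1.0 - s) / n), n - s * (n - 1));
    return 0;
}
```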
 
Systems tend to become CPU limited the lower the resolutions go, but there isn't a solid line within a title or between titles.
Some manufacturers tend to hit a ceiling more quickly than others.
 
I thought at lower resolutions faster GPUs were limited by CPUs.

It simply depends on your shaders, and what they do.

Cerny, for example, was suggesting using both GPGPU and rendering shaders to ensure that the GCN cores always have enough kernels to process at any given time.
 