Amdahl's Law

Deano Calver

As multi-processing is the hot topic, I thought I'd mention Amdahl's Law...

A clever man once said
"Overhead alone would then place an upper limit on throughput of five to seven times the sequential processing rate, even if the housekeeping were done in a separate processor"

Sounds very relevant to Cell, no? That man was Gene Amdahl, and he wrote that in 1967, nearly 40 years ago!

Amdahl's Law
Code:
         N
S = -------------
    (B*N) + (1-B)

Where N is the number of processors, B is the strictly serial fraction of an algorithm (the percentages below) and S is the speed-up factor.

B determines how much a parallel architecture will benefit an algorithm. To see why Amdahl's Law is so important, try B=0.5 (50%) with N=2:
Code:
            2
S = -----------------
    (0.5*2) + (1-0.5)

S = 1.33333
Now try it with N=8:
Code:
            8
S = -----------------
    (0.5*8) + (1-0.5)

S = 1.77777
Which means 2 processors are 1.33 times faster, but 8 processors will only make the algorithm 1.78 times faster.

Fundamentally, the problem of multi-processor programming is to find algorithms with as low a B as possible. This is extremely hard for most problems; luckily for us, though, some problems are naturally parallel. The obvious one is 3D rendering, where the strictly serial percentage is tiny and you get a near-linear speed-up as processors are added. Let's say B=10% for rendering with 8 processors:
Code:
            8
S = -----------------
    (0.1*8) + (1-0.1)

S = 4.7
Better than S=1.78, but still plenty of reason to reduce B even further. GPU shaders are essentially hardware restrictions that keep B as low as possible.
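If you want to play with the numbers yourself, here's a minimal C++ sketch of the formula (purely illustrative; the values from the examples above are hard-coded):

Code:
#include <cstdio>

// Amdahl's Law: S = N / (B*N + (1 - B))
// B = strictly serial fraction, N = number of processors.
double amdahl_speedup(double B, int N)
{
    return N / (B * N + (1.0 - B));
}

int main()
{
    printf("B=0.5, N=2 -> S=%.2f\n", amdahl_speedup(0.5, 2)); // ~1.33
    printf("B=0.5, N=8 -> S=%.2f\n", amdahl_speedup(0.5, 8)); // ~1.78
    printf("B=0.1, N=8 -> S=%.2f\n", amdahl_speedup(0.1, 8)); // ~4.7
    return 0;
}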

Check out this link for more details:
http://home.wlu.edu/~whaleyt/classes/parallel/topics/amdahl.html
 
More interesting stuff about Amdahl, from valoh this time. These papers claim that Amdahl's Law is too pessimistic.

http://www.scl.ameslab.gov/Publications/Gus/AmdahlsLaw/Amdahls.html
http://joda.cis.temple.edu/~shi/docs/amdahl/amdahl.html

My personal view is that they're a bit optimistic rather than Amdahl's Law being too pessimistic, but then that's glass-half-full-or-empty syndrome :)

Some problems are inherently parallel; others can be made fairly parallel with some brain power applied. To me the essence of Amdahl's Law is just that: you can get good parallel performance, but not necessarily easily.

One of the reasons I like GPUs is the honesty with which they hit this problem. They just don't allow stuff that's bad from an Amdahl's Law point of view, so they parallelise well.

Nelg points out that Linus (of Linux fame) mentioned Amdahl's Law and games (with regard to Cell)
here


How much of a game (and which bits of it) is inherently parallel is a valid question. Some parts (graphics) are easy; others (AI and physics) seem possible, but much work is needed to figure it all out.
 
Just saw this today. This is the first time I've browsed the forums, and although I don't know much about tech, this really pulled my interest and brought to mind a question which I think is also interesting.

Cell has one PPU and 7 SPEs for processing, right?

While Xenos has 3 PPUs, right?

So does that mean the Cell has N=8 and Xenos N=3? And how does B differ for each CPU? What does that mean for each console?
 
Nesh said:
So does that mean the Cell has N=8 and Xenos N=3? And how does B differ for each CPU? What does that mean for each console?
Not being a developer or anything, I'd just like to point out that Cell isn't your average run-of-the-mill multicore processor, as it's possible (and actually intended) for one SPU to do a partial calculation and then pass on the semi-finished work to another SPU for more processing via the internal bus (which has what, 300GB/s peak bandwidth?), which could then pass it on to another and another and another.

Amdahl's law might not even apply at all really, depending on how the code is set up to work.
 
But isn't it a bit different if it's not strictly _one_ algorithm, but something multi-threaded?
 
Guden Oden said:
Not being a developer or anything, I'd just like to point out that Cell isn't your average run-of-the-mill multicore processor, as it's possible (and actually intended) for one SPU to do a partial calculation and then pass on the semi-finished work to another SPU for more processing via the internal bus (which has what, 300GB/s peak bandwidth?), which could then pass it on to another and another and another.
Passing results to a different SPE, one after the other, sounds very sequential...

The internal bus still has a certain amount of latency associated with it. So long as there are enough elements to work on in the meantime, this can be hidden. However, the less uniform and predictable the workload, the more often this will not be the case.

Work would always be faster if the SPE could avoid passing data over the bus. The fact that it might be recommended to do otherwise indicates a limitation (probably justifiable, but still a limitation) of the design.

Amdahl's law might not even apply at all really, depending on how the code is set up to work.

Amdahl's law always applies to a parallel workload. It just might be hidden if the software is so badly designed that it intentionally creates a penalty worse than a task's inherently serial portion.
 
Deano Calver said:
"Overhead alone would then place an upper limit on throughput of five to seven times the sequential processing rate, even if the housekeeping were done in a separate processor"

The beauty of CELL (and similar approaches) is that a considerable amount of the housekeeping is done *at a separate point in time* i.e. beforehand, some of it during the design of a program and much of it by Sony's / IBM's compilers. So at runtime, programs on CELL *can* be very fast.
It'll be interesting to see how well-suited programs are and how well the compilers will do, a) at release time and b) over the coming years.
 
_xxx_ said:
But isn't it a bit different if it's not strictly _one_ algorithm, but something multi-threaded?
I wonder about that myself; the "serial" part of a program in Amdahl's model would seem to imply a choke-point, where all the threads are forced to wait on a section of code in a single thread. If you have a great multitude of non-serially-dependent tasks, you only really need a choke-point when you are starting up the system. And AFAICS, CPU scheduling is NOT one of the tasks that fundamentally require choke-points; running local, per-CPU schedulers isn't that much harder than running a global gang scheduler.

So the question that I am left with becomes: are there any obvious necessary choke-points at all in a (well-written) game other than at game/level load time?
 
I don't think Amdahl's Law is that specific about defining serial execution as being a choke point in the code.

In addition to choke points, there is some fraction of execution that simply cannot be sped up, including overhead involved in spawning parallel execution, combining results, maintaining correctness, or even the bare minimum time the slowest task takes to execute.

Multiple threads must be spawned, which takes a fixed and unavoidable amount of time. If they ever happen to talk to one another or to a common target, they will have to serialize for a little while.

Post-rendering effects, for example, can be run largely in parallel with each other, but that doesn't change the inherent serialization of needing a previously rendered result. You can't parallelize a hard data dependency.

Correct memory behavior is also serializing. Cell doesn't maintain coherency within the SPE local stores, but there is a software mechanism for doing so. Eventually, an SPE's results are going to affect the behavior of the program at large, and unless timing or correctness really aren't all that important, the SPE or its target are going to have to synchronize or wait to make sure weird stuff doesn't happen.

Regardless, the main point of the law is that if you speed up parallel execution, inescapable set-up time and communication overhead will increase or remain the same while the parallel section's time shrinks. Once the parallel execution time approaches zero, additional effort in parallel execution is wasted.
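To put a rough number on that set-up cost, here is a small, purely illustrative sketch (modern C++ with std::thread; the four-thread count and the work sizes are arbitrary choices of mine, not anything from the discussion). The spawn/join overhead stays roughly fixed, so as the per-thread work shrinks, that overhead becomes a larger share of the total time:

Code:
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Spawn T threads, have each do 'work_items' iterations of dummy work,
// then join them, and time the whole thing.  The spawn/join cost is a
// fixed overhead that does not shrink along with the parallel work.
static double run(int T, long work_items)
{
    auto start = std::chrono::steady_clock::now();

    std::atomic<long> sink{0};   // stops the compiler discarding the work
    std::vector<std::thread> threads;
    for (int i = 0; i < T; ++i)
        threads.emplace_back([&sink, work_items] {
            long local = 0;
            for (long j = 0; j < work_items; ++j)
                local += j;
            sink += local;
        });
    for (auto& t : threads)
        t.join();

    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

int main()
{
    for (long work : {100000000L, 1000000L, 10000L})
        printf("%9ld items per thread -> %8.3f ms total\n", work, run(4, work));
    return 0;
}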
 
3dilettante said:
I don't think Amdahl's Law is that specific about defining serial execution as being a choke point in the code.

In addition to choke points, there is some fraction of execution that simply cannot be sped up, including overhead involved in spawning parallel execution, combining results, maintaining correctness, or even the bare minimum time the slowest task takes to execute.
Spawning parallel execution in the first place is something that you should only need to do at program startup time...? For a game this would seem to be a few microseconds out of a potential running time of several hours. Result combining within a game also only really forces serialization when the result is fed into a loopback (AFAICS, this would primarily be the case for game decision logic and partially AI/physics).
Multiple threads must be spawned, which takes a fixed and unavoidable amount of time. If they ever happen to talk to one another or to a common target, they will have to serialize for a little while.

Post-rendering effects, for example, can be run largely in parallel with each other, but that doesn't change the inherent serialization of needing a previously rendered result. You can't parallelize a hard data dependency.
Given unrestricted parallelism, you can easily apply post-rendering effects for frame N while rendering frame N+1.
Correct memory behavior is also serializing. Cell doesn't maintain coherency within the SPE local stores, but there is a software mechanism for doing so. Eventually, an SPE's results are going to affect the behavior of the program at large, and unless timing or correctness really aren't all that important, the SPE or its target are going to have to synchronize or wait to make sure weird stuff doesn't happen.
Its target will have to wait, but there is no real reason why the SPE cannot begin on a second task before the target has received the result of the 1st task. If you have a bunch of producer/consumer threads with queues between them, then coherency only needs to be maintained within each queue and each thread, not the system as a whole.
Regardless, the main point of the law is that if you speed up parallel execution, inescapable set-up time and communication overhead will increase or remain the same while the parallel section's time shrinks. Once the parallel execution time approaches zero, additional effort in parallel execution is wasted.
The 'B' in Amdahl's law is tremendously algorithm-dependent; the algorithms that e.g. John Gustafson has been using apparently have B in the range of ~0.0003%, whereas Amdahl himself appears to have been seeing B in the ~15-20% range.
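For context on why those two camps see such different numbers, here's a small illustrative sketch comparing Amdahl's fixed-size speed-up with Gustafson's scaled speed-up (the idea behind the Ames Lab page linked earlier), where the problem is assumed to grow with the processor count. Note the two laws define their serial fractions against different baselines, so this is only a rough side-by-side; the 1024-processor count and the sample fractions are just my picks:

Code:
#include <cstdio>

// Amdahl: fixed problem size, B = serial fraction of the sequential run.
double amdahl(double B, int N)    { return N / (B * N + (1.0 - B)); }

// Gustafson: problem scales with N, s = serial fraction of the parallel run.
double gustafson(double s, int N) { return N + (1.0 - N) * s; }

int main()
{
    const int N = 1024;
    for (double f : {0.2, 0.05, 0.000003})   // ~Amdahl-era vs ~Gustafson-era fractions
        printf("fraction=%-9g  Amdahl S=%8.2f   Gustafson S=%8.2f\n",
               f, amdahl(f, N), gustafson(f, N));
    return 0;
}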
 
der said:
The beauty of CELL (and similar approaches) is that a considerable amount of the housekeeping is done *at a separate point in time* i.e. beforehand, some of it during the design of a program and much of it by Sony's / IBM's compilers. So at runtime, programs on CELL *can* be very fast.
I think you're misinterpreting Deano's comments. He's talking about runtime "housekeeping", e.g. the master threads that try to keep the SPUs all fed. The point of this topic is that there is some code that can't be parallelized. If you have a huge array of processors, then this code will be your limiting step. The hope is that this rate-limiting step will be minimal for games.

Now regarding your interpretation of "housekeeping", there have been some intelligent posts by forum members here at B3D. They have the experience to know that IBM and Sony not only haven't done this housekeeping, but will also have a very hard time doing it. Here's a good post. It's very tough for compilers to figure out how to optimize for this type of architecture, whether CELL or Xenon. Multithreading is a whole other issue.

I know it's easy to get caught up in the sheer horsepower of CELL, but consider for example the IBM benchmarks for solving a system of linear equations. Even with 10 times the horsepower, your matrix can only be about 2.2 times larger and still finish in the same time (that's just the O(n^3) cost of a dense solve: 2.2^3 ≈ 10). Much smarter is to use a sparse matrix for things like physics, and then your code becomes branch-heavy and traverses linked lists, doing small groups of calculations at a time. This is far harder to optimize than the streaming repetitiveness of a brute-force LU factorization. Dependencies make it hard to parallelize completely. That's when Amdahl's law kicks in.
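To make the contrast concrete, here's a tiny illustrative sketch (not IBM's benchmark code; the 3x3 matrix and the compressed-row layout are just for demonstration). The dense version streams over contiguous data in long, predictable loops; the sparse version stores only the non-zero entries and has to chase column indices, which is branchier and jumps around memory:

Code:
#include <cstdio>
#include <vector>

// Dense matrix * vector: long, predictable, streaming inner loops -- the
// kind of work that vectorizes and parallelizes almost perfectly.
std::vector<double> dense_matvec(const std::vector<double>& A,
                                 const std::vector<double>& x, int n)
{
    std::vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];
    return y;
}

// Sparse (compressed-row) matrix * vector: short rows and indirect loads
// through the column-index array -- much harder to keep a wide machine busy.
std::vector<double> sparse_matvec(const std::vector<int>&    row_start,
                                  const std::vector<int>&    col,
                                  const std::vector<double>& val,
                                  const std::vector<double>& x)
{
    std::vector<double> y(row_start.size() - 1, 0.0);
    for (size_t i = 0; i + 1 < row_start.size(); ++i)
        for (int k = row_start[i]; k < row_start[i + 1]; ++k)
            y[i] += val[k] * x[col[k]];
    return y;
}

int main()
{
    // The same 3x3 matrix in dense and sparse form.
    std::vector<double> A = { 2, 0, 0,
                              0, 3, 1,
                              0, 0, 4 };
    std::vector<double> x = { 1, 1, 1 };
    std::vector<int>    row_start = { 0, 1, 3, 4 };
    std::vector<int>    col       = { 0, 1, 2, 2 };
    std::vector<double> val       = { 2, 3, 1, 4 };

    auto yd = dense_matvec(A, x, 3);
    auto ys = sparse_matvec(row_start, col, val, x);
    printf("dense : %g %g %g\n", yd[0], yd[1], yd[2]);   // 2 4 4
    printf("sparse: %g %g %g\n", ys[0], ys[1], ys[2]);   // 2 4 4
    return 0;
}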
 
arjan de lumens said:
Spawning parallel execution in the first place is something that you should only need to do at program startup time...? For a game this would seem to be a few microseconds out of a potential running time of several hours. Result combining within a game also only really forces serialization when the result is fed into a loopback (AFAICS, this would primarily be the case for game decision logic and partially AI/physics).

That would work fine as long as the simulated environment or problem doesn't change.
Threads don't necessarily live forever, and having a million threads sitting around when you only need three is also not helpful. Spawning a new enemy AI or modeling the behavior of something when it breaks into a million pieces (or a million pieces merge) are good examples of how dynamic behavior will not fit a statically defined software environment.

If a high-poly model is occluded after the camera moves into another room, there's no point in having the threads involved in setting it up keep on running. Even if they are suspended, there will still be unavoidable cycles used to wake them if the item comes back into play.

In collision detection, it is possible that threads dealing with the motion of several separate objects will compute that, within a given timestep, an object will collide with two or more entities on its initial travel vector, though we know that once the collision effect is calculated it would probably only hit one. There is no way of knowing what to do unless the various threads get together and hash this out.

Decision making is not a very parallel task. AI programmers discovered this the hard way. Any amount of resources could be devoted to measuring and calculating utility functions, but it still required the machine to stop and consider each and every value at least once relative to one or more values. Even at its fastest, there is a chain of decisions that cannot be reduced.

Given unrestricted parallelism, you can easily apply post-rendering effects for frame N while rendering frame N+1.

But you can't apply the post-rendering for frame N+1 until frame N+1 has been rendered. No matter how much the rendering is executed in parallel, the fact that there is a frame N and a frame N+1 indicates that some of the inherently serial portion discussed by Amdahl exists.

Granted, good pipelining of the effects can hide most of this latency, and oftentimes it isn't noticeable by the user. However, just because the user doesn't see it, it doesn't mean the serial element doesn't exist.

Its target will have to wait, but there is no real reason why the SPE cannot begin on a second task before the target has received the result of the 1st task. If you have a bunch of producer/consumer threads with queues between them, then coherency only needs to be maintained within each queue and each thread, not the system as a whole.

There can be many reasons why the SPE cannot begin the second task prior to sending its results off, probably having to do with the limited size of its local store and the bandwidth available from a heavily loaded internal bus. It's a wide bus, but it isn't infinite. The DMA engine used for memory access is sizeable, but it is not infinite.

What happens if the target is currently unable to accept the result, perhaps as a result of poor synchronization?

Graphics work is already embarrassingly parallel, so issues of synchronization don't come up as much. In anything not embarrassingly parallel, queues may not be an improvement. The queue has to be stored and monitored by something, and it has methods for adding or removing entries. Whenever a queue has to change data at a memory location, something must be done to make sure an unexpected event like multiple writes can't lead to a nonsensical result.

The 'B' in Amdahl's law is tremendously algorithm-dependent; the algorithms that e.g. John Gustafson has been using apparently have B in the range of ~0.0003%, whereas Amdahl himself appears to have been seeing B in the ~15-20% range.

Amdahl may have been pessimistic about certain elements of performance, but that does not change the fact that the law has yet to be broken.

In fact, the pages referenced would reinforce some initial skepticism about Cell's abilities. Part of the underpinning of that argument is that the number of processors can be scaled to fit the size of the parallel workload, thus making an end-run around Amdahl, which is discussed in terms of fixed workloads run on different numbers of processors.

However, commercial examples of Cell are the exemplar of Amdahl's argument. The PS3 is not going to spawn new SPEs when a problem set is increased. It will either trudge through it, or the programmer will reduce the problem set.

Users can't use the "scale the problem all you want" argument. Even if cost were not a factor, there are also constraints with interactive software concerning response time. A massively parallel machine might be capable of processing ten times the number of work units per second, but it will not be useful for interactive tasks if its response time is even slightly longer.
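As an aside on the queue point above: even a simple producer/consumer queue has to serialize around its lock for every push and pop, and that little critical section is exactly the kind of small but unavoidable serial portion Amdahl's Law is about. A minimal, purely illustrative sketch in modern C++ (the single producer/consumer pair and the 1000-item count are arbitrary):

Code:
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

// A minimal thread-safe queue: every push and pop takes the same lock, so
// producer and consumer serialize briefly on each transfer.
class WorkQueue {
    std::queue<int>         items_;
    std::mutex              mtx_;
    std::condition_variable cv_;
public:
    void push(int v) {
        { std::lock_guard<std::mutex> lock(mtx_); items_.push(v); }
        cv_.notify_one();
    }
    int pop() {   // blocks until an item is available
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [this] { return !items_.empty(); });
        int v = items_.front();
        items_.pop();
        return v;
    }
};

int main()
{
    WorkQueue q;
    long long sum = 0;
    std::thread consumer([&] {
        for (int i = 0; i < 1000; ++i)
            sum += q.pop();
    });
    for (int i = 0; i < 1000; ++i)   // producer
        q.push(i);
    consumer.join();
    printf("sum = %lld\n", sum);     // 499500
    return 0;
}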
 
3dilettante said:
Amdahl may have been pessimistic about certain elements of performance, but that does not change the fact that the law has yet to be broken.
I'm not sure that Amdahl's law can be "broken" per se. It simply states a mathematical relation, and the question mark is basically confined to what really constitutes "sequential" code.
From a practical standpoint, his now almost 40 year old statement still applies, it has just taken personal computing this long to discover its relevance. :)

In terms of games programming, it would seem that some mental adjustment is going to be needed in different places. While it is possible to parallelize a given algorithm, I'll submit that it is often fruitful to go another step back and look at the problem you want to solve and see if you can't formulate it in a different way, modifying the algorithm or coming up with an alternative formulation altogether. Parallelizing existing code may of course be the quickest way to get a speed-up or a product to market. OTOH, I've seen cases you'd think would parallelize nicely that refused to yield more than a factor-of-three speed-up no matter how you massaged the original code base. Rewriting was the only option to access the inherent parallelism of the problem.

I'll also submit that what I've seen some developers speak of (spawning a separate thread for sound, another for rendering, another for...) seems very coarse-grained to me, and a simplistic way to get some rudimentary parallel execution out of an existing code base.

It will be very interesting to hear what developers have to say about how the new tools have affected their own progress as algorithm designers and coders in the coming years.
 
I thought it could be interesting to come back to this topic a little, now that we have seen some people at work with the Cell some more. From the looks of it, Cell's SPEs are generally not used to tackle one algorithm together; instead each SPE tackles one algorithm all by itself, like the cloth algorithm used in Heavenly Sword (the respective devs are welcome to correct me ;) ). Let's call this the specialist approach.

On the other hand, something like the Crowd simulator (and the related school-of-fish demonstration at GDC 06), whose PowerPoint presentation we have seen, seems to be a textbook Amdahl case, being one algorithm that is tackled as forcefully as possible by the whole power of the Cell. Let's call this the collective approach.

Looking at both examples, and the amount of effort put into the Crowd simulation, I suspect that the specialist approach is first of all easier to implement, and second of all easier to implement effectively, so long as you have a multitude of tasks that need to be done in your application.

I'm just a high-level programmer at the moment, and the only time I use threads (the closest thing to parallel that I do) is because I want a user to be able to do something else while a longer process is running, so I'm definitely not an expert, and I am just trying to see if I understand this correctly.
 
The problem with the "specialist" approach is that not all work requires equal computation power. For example, suppose that you use one SPE for AI, another for sound processing, another for physics simulation, etc. Some SPEs will end up finishing their work earlier than others, and they'll have to sit idle waiting for the other SPEs. In interactive environments (such as games) this is even more serious because you can't use a large queue to "smooth out" the differences.

One way to handle this problem is to break all the work into small units and just use all available SPEs to compute them. This way you need a scheduler and incur some overhead, but it can bring better efficiency. However, depending on the amount of data you need to move around, it may or may not be a good approach for CELL. A rough sketch of the idea follows.
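Here's a minimal, purely illustrative version of that "small work units" idea (ordinary C++ threads with an atomic counter standing in for a real SPE scheduler; the job counts and costs are made up). Workers pull the next job whenever they finish one, so nobody sits idle while work remains:

Code:
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Toy work pool: jobs are just indices.  Workers claim the next index with
// an atomic counter, so fast workers automatically take on more jobs.
int main()
{
    const int kJobs    = 64;
    const int kWorkers = 4;

    std::atomic<int> next_job{0};
    std::vector<int> jobs_done(kWorkers, 0);
    std::vector<std::thread> workers;

    for (int w = 0; w < kWorkers; ++w)
        workers.emplace_back([&, w] {
            for (;;) {
                int job = next_job.fetch_add(1);
                if (job >= kJobs)
                    break;
                // Simulate uneven job cost: odd-numbered jobs are "heavier".
                volatile long sink = 0;
                long cost = (job % 2) ? 2000000 : 200000;
                for (long i = 0; i < cost; ++i)
                    sink += i;
                ++jobs_done[w];
            }
        });

    for (auto& t : workers)
        t.join();
    for (int w = 0; w < kWorkers; ++w)
        printf("worker %d completed %d jobs\n", w, jobs_done[w]);
    return 0;
}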
 
Deano Calver in Post #2 said:
...

Amdahl paper in the comp.parallel FAQ

At the end is Amdahl's seminal paper itself.

Thanks Gubbi :)
And thanks Deano! :)
I have taken these excerpts, as I believe them to be very germane.

If a woman can have a baby in 9 months, then
clearly 9 women can have a baby in one month.

Generalize this to parallel computation.
Umm, yes.

%A J. Voelcker
%T Sandia Trio Wins Awards at Compcon
%J The Institute
%I IEEE
%V 12
%N 5
%D May 1988
%P 8
%K Karp/Bell parallel award for achieving >= 200x computing speedup
%X Winners got near linear speedup: 1020 speedup on 1024 processors in the
best case. They did it on Ncube 10-D hypercube with 1024 processors. They
developed mathematical methods and algorithms to minimize communications
between processors, and to overlap communication and computation. Also
used a logarithmic system-level algorithm to load a program into the
hypercube in 11 steps, rather than inputing the same program serially into
each of the 2^10 processors. The problems were wave motion through
deflectors, vortex motion in a fluid, and strain analysis of a cantilevered
beam.
%X Above refer entry does not even mention names of the people who
did the work. Search for Gary Montry or John Gustafson
The above is just an article. Not a paper. See Compcon proc.
Now I am off to read the original paper. ;)
 
 
Is this why Tahiti is so inefficient compared to GK104, but also to Pitcairn and Cape Verde?
2048 cores are already at the practical limit for 95% parallel code.
But let's say graphics is 99% parallelizable (B = 0.01); we would have:

Tahiti @ 1 GHz giving a 95.4 times speed-up.

GK110 @ 1 GHz giving a 96.4 times speed-up.

In real-world performance, Tahiti is less efficient because of Direct3D and driver latency, but even if they put out 4096 cores @ 1 GHz, the most they would get is a 97.6 times speed-up.
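For what it's worth, those figures fall straight out of Amdahl's formula with B = 0.01 if you assume 2048 shader cores for Tahiti, roughly 2688 for a GK110 part and 4096 for the hypothetical chip (the core counts, like the 1 GHz clock, are just the assumptions behind the numbers above):

Code:
#include <cstdio>

// S = 1 / (B + (1 - B) / N), with B = 0.01 (99% parallel).
double speedup(double B, int N) { return 1.0 / (B + (1.0 - B) / N); }

int main()
{
    const double B = 0.01;
    printf("N=2048 -> S=%.1f\n", speedup(B, 2048));  // ~95.4
    printf("N=2688 -> S=%.1f\n", speedup(B, 2688));  // ~96.4
    printf("N=4096 -> S=%.1f\n", speedup(B, 4096));  // ~97.6
    return 0;
}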

Is it time for AMD to just make die-shrinks and increase clock frequency?
If so, eventually a 2048-core GPU will get embedded in the CPU and discrete cards will be history.
Then vector extensions will reach the same width and replace GPUs completely.
 