Is multiple core technology really needed for next-gen?

That's because in SMP systems the processors share the same resources (memory, bus, I/O), which are already slow for a single CPU. Global synchronization is also extremely costly. But there's never been any doubt in my mind that the future is asynchronous parallelism in both hardware and software. There will always be a need to serialize and synchronize, of course; without that, reality may not make sense. But the idea is to restrict this to a minimum.

Is this possible? Yes, absolutely. We just need to think a bit differently. After all, the art of parallel programming is in its infancy, while the hardware barely exists.

What fascinates me most about 3D is how, more than anything else, it is contributing to the development of mainstream low-cost parallel architectures. Though currently most of that parallelism is hidden, it is still a step in the right direction. After all, without 3D, it could have taken decades more before we saw the likes of what GPUs are offering us today. It's not that parallelism, asynchrony, or de-serialization is anything new, but it's just like gasoline: why change what works, even if it's primitive and limited? This is why I hope "Cell" will work well enough to inspire the rest of the industry to follow in its footsteps and push parallelism forward. But no matter what, one day everybody will go that way.

The biggest limitation toward that goal is the hardware, not the software. That's not easy to overcome, and we'll need much more than a few billion transistors to achieve the "thinking" and "living" computers that are our ultimate goal... no? :devilish: But by then we'll probably need 3D photonics to replace all that crappy electronics... ;)
 
Fafalada said:
ERP said:
Gflops aren't everything, good integer performance is still very important to your average game.
Very true, but in this case it's worth noting that the APUs themselves are supposed to be rated for equal integer and float throughput. The other question is how well that throughput can be fed, but then both ratings would be equally affected by that too.

Strictly speaking, the current VU architecture has the same integer and float instruction throughput; that doesn't mean it's a good processor to run general code on.

A lot of game data structures still require random access to memory, and while game programmers in general try to model structures as streams (at least the good ones do), sometimes it's either impossible or just impractical.
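To illustrate the streams point, here is a minimal sketch (the data layout and names are hypothetical, not from any real engine): the first version chases pointers through scattered nodes, which is exactly the random-access pattern that hurts, while the second walks contiguous arrays that a prefetcher or DMA engine can stream.

Code:
#include <vector>
#include <cstddef>

// Pointer-chasing version: each update may touch a node anywhere in memory.
struct EnemyNode {
    float health;
    float damagePerTick;
    EnemyNode* next;   // scattered allocations -> random access pattern
};

void updateLinked(EnemyNode* head) {
    for (EnemyNode* e = head; e != nullptr; e = e->next)
        e->health -= e->damagePerTick;
}

// Stream version: the same data packed contiguously, friendly to prefetch/DMA.
struct EnemyStream {
    std::vector<float> health;
    std::vector<float> damagePerTick;
};

void updateStream(EnemyStream& s) {
    for (std::size_t i = 0; i < s.health.size(); ++i)
        s.health[i] -= s.damagePerTick[i];
}

Of course, as soon as the access pattern depends on the data itself (spatial queries, graph traversal and so on), forcing it into the second form can be impossible or just not worth it, which is the point above.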
 
Fully async logic has quite a bit of overhead; with the extremely short pipeline stages of high-performance processors, I'm not sure it would actually gain you anything.

I think keeping clocks synchronous won't get much more difficult ... if we start using switched networks instead of long wires going everywhere, the domain in which the clock needs to be synchronous will get smaller with shrinks.

As for how much parallelism we can use ... consider our brain, and its clock speed. The really interesting applications have more than enough parallelism to go around.
 
MfA said:
As for how much parallelism we can use ... consider our brain, and its clock speed. The really interesting applications have more than enough parallelism to go around.

True, but there is just a little bit of an architectural difference between a connectionist system and your traditional von Neumann architecture. ;)

Although that doesn't stop some concepts from transferring over, such as the thieves who stole AntiAliasing.
 
ERP said:
Strictly speaking, the current VU architecture has the same integer and float instruction throughput; that doesn't mean it's a good processor to run general code on.
Strictly speaking, no, it's still 1.16 : 1 ;)
But you know I was talking about actual arithmetic throughput, data width, and completeness of the instruction sets, none of which the current VU delivers. And I can tell you that I've run into situations where every single one of these would have been good to have, even with the "simple" programs we usually write for VUs.

A lot of game data structures still require random access to memory, and while game programmers in general try to model structures as streams (at least the good ones do), sometimes it's either impossible or just impractical.
Oh, I agree, but then we all figure that's what we have that other processor for, next to the APUs, right?
You can also do a whole lot more with non-streaming-friendly problems if the APUs can indeed start in/out DMAs from their side.
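For what it's worth, here's roughly the kind of double-buffered loop you'd expect if an APU can really kick off its own DMAs. The dma_get/dma_wait primitives and the flat-offset addressing are hypothetical placeholders for the sake of the sketch, not anything from an actual SDK.

Code:
#include <cstddef>

// Hypothetical DMA primitives -- placeholders, not a real API.
// Addresses are treated as plain byte offsets into shared DRAM (an assumption).
void dma_get(void* localDst, std::size_t remoteSrc, std::size_t bytes, int tag);
void dma_wait(int tag);

constexpr std::size_t kChunk = 4096;  // bytes per transfer
alignas(16) float localBuf[2][kChunk / sizeof(float)];  // double buffer in local storage

// Placeholder for whatever work the APU actually does on a chunk.
void process(float* buf, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i)
        buf[i] *= 2.0f;
}

// Stream totalBytes of data from shared DRAM through local storage,
// overlapping the fetch of chunk i+1 with the processing of chunk i.
void streamFromDram(std::size_t dramBase, std::size_t totalBytes) {
    const std::size_t nChunks = totalBytes / kChunk;
    if (nChunks == 0) return;

    dma_get(localBuf[0], dramBase, kChunk, /*tag=*/0);
    for (std::size_t i = 0; i < nChunks; ++i) {
        const int cur = static_cast<int>(i & 1);
        const int nxt = cur ^ 1;
        if (i + 1 < nChunks)  // start fetching the next chunk before we need it
            dma_get(localBuf[nxt], dramBase + (i + 1) * kChunk, kChunk, /*tag=*/nxt);
        dma_wait(cur);        // wait only for the chunk we are about to use
        process(localBuf[cur], kChunk / sizeof(float));
    }
}

The win over a plain processor stalling on cache misses is that the fetch of the next block is overlapped with useful work on the current one, which is exactly what makes streaming-friendly problems so cheap on this kind of design.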
 
Hehe... the current VUs have a single 16-bit ALSU with 16x16-bit registers: I would not call that integer processing power equal to their floating-point processing power.

The APU, even going by IBM's own patents, promises to be different:

1.) It can DMA from and to Shared DRAM.

2.) It can do some I/O work with external I/O devices (through clever use of the DRAM's busy bits and other flags, if they are implemented): you could, for example, have the I/O device send data to a location in the Shared DRAM, and that data would be automatically forwarded to a selected portion of the interested APU's Local Storage. The same can be done to send data from the APU's Local Storage to the I/O devices.

3.) The 128x128-bit register file can be used for float and integer vectors as well as scalar values.

4.) Each cycle, the APU can issue one of the following:

a.) 1 FP Vector Instruction.

or

b.) 1 FX Vector Instruction ( Integer )

or

c.) 1 FP Scalar Instruction.

or

d.) 1 FX Scalar Instruction.


5.) When executing scalar instructions, the APU is limited to a peak of 2 FP/FX ops/cycle through the use of instructions like FP/FX MADD (see the rough arithmetic below).
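To put rough numbers on points 3-5, the snippet below assumes the 128-bit registers are treated as four 32-bit lanes; the lane width is my assumption, not something the patent text above spells out.

Code:
#include <cstdio>

int main() {
    // Hypothetical back-of-envelope numbers: 128-bit registers split into
    // four 32-bit lanes (the lane width is an assumption, not stated above).
    const int lanes      = 4;  // 128 bits / 32-bit elements
    const int opsPerMadd = 2;  // a multiply-add counts as two ops

    std::printf("vector MADD peak: %d ops/cycle\n", lanes * opsPerMadd);  // 8
    std::printf("scalar MADD peak: %d ops/cycle\n", 1 * opsPerMadd);      // 2, matching point 5
    return 0;
}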
 
SMarth said:
That's because in SMP systems the processors share the same resources (memory, bus, I/O), which are already slow for a single CPU. Global synchronization is also extremely costly. But there's never been any doubt in my mind that the future is asynchronous parallelism in both hardware and software. There will always be a need to serialize and synchronize, of course; without that, reality may not make sense. But the idea is to restrict this to a minimum.

Is this possible? Yes, absolutely. We just need to think a bit differently. After all, the art of parallel programming is in its infancy, while the hardware barely exists.

What fascinates me most about 3D is how, more than anything else, it is contributing to the development of mainstream low-cost parallel architectures. Though currently most of that parallelism is hidden, it is still a step in the right direction. After all, without 3D, it could have taken decades more before we saw the likes of what GPUs are offering us today. It's not that parallelism, asynchrony, or de-serialization is anything new, but it's just like gasoline: why change what works, even if it's primitive and limited? This is why I hope "Cell" will work well enough to inspire the rest of the industry to follow in its footsteps and push parallelism forward. But no matter what, one day everybody will go that way.

The biggest limitation toward that goal is the hardware, not the software. That's not easy to overcome, and we'll need much more than a few billion transistors to achieve the "thinking" and "living" computers that are our ultimate goal... no? :devilish: But by then we'll probably need 3D photonics to replace all that crappy electronics... ;)

Great post. :)
 
Processor designers debate faster clocks

By Stephan Ohr

EE Times
February 18, 2004 (5:16 p.m. ET)

SAN FRANCISCO — An evening panel session held at ISSCC here Tuesday (Feb. 17) debated some of the consequences of clocking processors at multi-GHz rates. Engineers from Sun, Intel, AMD and Fujitsu complained about power consumption, programming issues, and costs.
Philip Emma, manager of systems technology at IBM (Yorktown Heights, N.Y.), proclaimed himself the "whipping boy" for high clock speeds. If we examine interconnect flight time, being careful to avoid resonance in the wire lengths, we could obtain clock speeds as high as 60 GHz, he theorized. But would we need to terminate on-chip wires just to avoid those resonances? Such a device would be a consummate power eater, he concluded. The cost of computing machinery would be a function of the power it consumed, he said, not its clock rate.

Consumer demand for "big iron" would not likely track the rise in clock speeds, advised Alisa Sherer, a technology fellow with Advanced Micro Devices (AMD, Sunnyvale). This meant that marketing dollars would need to track or exceed engineering dollars to ensure that consumers would be enticed by fast-clocking PCs, she said.

Multithreading could multiply the amount of work performed by the microprocessor with each clock cycle, said Marc Tremblay, a technology fellow with Sun Microsystems. In principle, a 256-thread machine could achieve terahertz clock rates, with each thread running a GHz race through the machine.

"If threading takes over, the GHz required is much lower," agreed Doug Carmean, a principal architect with Intel Corp. in Hillsboro (Oregon).

"But no one knows how to program a machine with more than two threads," protested an audience questioner from MIT. And the compiler technology would likely not keep up with the requirements of multi-threading, Sun's Tremblay conceded.

But none of the panelists doubted that there would be fabrication processes in place that would support multi-GHz processor designs. (An aggressive roadmap shown at the Intel Developers' Forum, IDF, across the street from ISSCC here, suggested putting technology shifts on a two-year cycle, culminating with a 25-nm manufacturing process in 2009-2010.) Clock estimates given in response to an audience challenge ranged from 7 GHz to 10 GHz.

Power consumption, a function of gate fan-out loading within the individual processor's design, would be the limiting factor, reminded Alisa Sherer. Fan-out loading on the order of 20 to 22 gates would consume much more power at high clock rates than designs loaded by 16 or 18 gates, she said. Complexity would inevitably favor higher fan-out loading.

"We could do 10 GHz," Sherer postulated. "But a 5 GHz clock was a much better target." Even then, the processor is likely to consume a couple of 100 watts — and there will be few PC applications likely to support that, she concluded.


ISSCC panel debates processor speed advantages

By Ron Wilson

EE Times
February 19, 2004 (1:37 a.m. ET)

SAN FRANCISCO — A perennial topic at the International Solid State Circuits Conference is the debate between what have been termed the architects and the speed demons. These two camps, one emphasizing simple logic circuitry to increase clock frequency and the other pushing complex circuitry to increase the amount of work done in a clock cycle, have been fighting since the days of the first RISC CPUs over what was the best approach to increasing processor performance.
This year's incarnation of the debate held here at an evening session on Tuesday (Feb. 17) took the title "Processors and Performance: When do GHz Hurt?"

With so few viable processor design teams remaining in the industry, there is always the risk that such a panel will devolve into a round of Intel-bashing, and that's pretty much what happened this time.

Panel chair Shannon Morton, staff engineer at Icera Semiconductor, observed that the increase in clock frequency that has led us to 3 GHz CPUs had not come just from process scaling. There has also been a consistent and aggressive reduction in the number of fan-out-of-four gate delays per logic stage. This has contributed a significant portion of the reduction in cycle times.

Morton then framed the discussion with two broad questions. First, what is the end customer's perception of performance? Second, should we continue to focus on clock frequency in our pursuit of the customer's dollars?

Marc Tremblay, VP and fellow at Sun Microsystems, pointed out that clock frequency is only a primary lever for setting performance if the application is entirely (or nearly) cache-resident. Failing that, memory latency and bandwidth become the limiting factors. Tremblay said that the applications that primarily concern Sun in the server world are never cache-resident.

Instead, there is a growing move to make them behave as if they were by intelligent, software-controlled prefetching. Tremblay hinted that the most promising development he had seen in some time was the blending of software-based prefetching with multithreading technology.

Philip Emma, manager of systems technology and microarchitecture at IBM, agreed that clock frequency was losing its traction as a performance driver, and that memory was getting far more critical. He added that the cost, in terms of complexity and power consumption, of pursuing higher frequencies was getting too great anyway.

Emma predicted that the next big performance boosts would come from the likes of 3D packaging and optical interconnect, not from faster clocks.

The Intel senior principal architect Douglas Carmean introduced himself as the panel's token speed enthusiast. He dismissed skepticism by offering performance data that simply concluded that on 3D rendering tasks, higher clock frequency translated directly into faster completion times. He then generalized these results by arguing that many of the tasks real users cared about were still single-threaded, unparallelized and big, and that a CPU designer could not turn his back on them, even though it really did hurt.

Showing a chart of steadily decreasing fan-out-of-four gate delays per stage over time, Carmean said "at about 10 gate delays per stage, let's say it gets tingly. Below 10, no doubt about it. It hurts."

AMD fellow Alisa Scherer replied in metaphor: "Pain is not necessary for performance." She said not only was higher clock frequency too costly to the architect, but that end users were no longer willing to pay the price in increased power consumption either. "No one wants something that sounds like a hydrofoil under their desk," she stated.

Scherer claimed that consumers could, like enterprise buyers, be educated about the difference between clock frequency and throughput, and that increasingly the PC industry would focus on I/O bandwidth, multiprocessing/multithreading and even non-performance features such as portability and security.

Hisashige Ando, CTO of the server systems group at Fujitsu, took a different tack. Ando claimed that while GHz may matter to the marketing department, the real issue was the three "ps": performance, power and price. He then demonstrated the theoretical advantage of an array of small, slower processors over a single large, faster processor, assuming the application was sufficiently suitable for parallel execution.

Finally, Mark Horowitz, Yahoo professor of electrical engineering and computer science at Stanford, observed that the last speaker on a panel finds all the good points already taken. He suggested the pursuit of performance was a global optimization problem, in which one seeks the point where the incremental increases in performance divided by the incremental costs in other quantities are approximately equal.

This had led him to predict some time ago that the decrease in gate-delays per stage would soon stop. He then stated his objective for the evening was to convince the "crazies at Intel to give up before they proved me wrong."

Hmm, they seem to be having a debate over multi-GHz versus multiple cores.
 
"The Intel senior principal architect Douglas Carmean introduced himself as the panel's token speed enthusiast. He dismissed skepticism by offering performance data that simply concluded that on 3D rendering tasks, higher clock frequency translated directly into faster completion times. He then generalized these results by arguing that many of the tasks real users cared about were still single-threaded, unparallelized and big"

You would think that if that were the case he could have picked an example which most users aren't already executing on a parallel, low-clocked processor ... hell, if you extrapolate from the past, then in one or two more generations real users won't even be running 3D rendering tasks on his processors anymore.
 
Amdahl's Law


Amdahl's Law is a law governing the speedup of using parallel processors on a problem, versus using only one serial processor. Before we examine Amdahl's Law, we should gain a better understanding of what is meant by speedup.

Speedup:
The speed of a program is the time it takes the program to execute. This could be measured in any increment of time. Speedup is defined as the time it takes a program to execute in serial (with one processor) divided by the time it takes to execute in parallel (with many processors). The formula for speedup is:


S = T(1) / T(j)


where T(j) is the time it takes to execute the program when using j processors. Efficiency is the speedup divided by the number of processors used. This is an important factor to consider: given the cost of multiprocessor supercomputers, a company wants to get the most bang for its dollar.

To explore speedup further, we shall do a bit of analysis. If there are N workers working on a project, we may assume that they would be able to do the job in 1/N the time of one worker working alone. Now, if we assume the strictly serial part of the program is performed in B*T(1) time, then the strictly parallel part is performed in ((1-B)*T(1)) / N time, so T(N) = B*T(1) + ((1-B)*T(1)) / N. Substituting this into S = T(1)/T(N) gives the formula for speedup as:


S = N / (B*N + (1 - B))


N = number of processors, B = fraction of the algorithm that is strictly serial

This formula is known as Amdahl's Law. The following is a quote from Gene Amdahl in 1967:

For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit co-operative solution...The nature of this overhead (in parallelism) appears to be sequential so that it is unlikely to be amenable to parallel processing techniques. Overhead alone would then place an upper limit on throughput of five to seven times the sequential processing rate, even if the housekeeping were done in a separate processor...At any point in time it is difficult to foresee how the previous bottlenecks in a sequential computer will be effectively overcome.

Let us investigate speedup curves:
Now that we have defined speedup and efficiency, let us use this information to make sense of Amdahl's Law. We will refer to a speedup curve to do this. A speedup curve is simply a graph with the number of processors on the X-axis plotted against speedup on the Y-axis. The best speedup we could hope for, S = N, would yield a straight 45-degree line: if there were ten processors, we would see a tenfold speedup. A speedup below 1, on the other hand, would mean that the program ran faster on a single processor than in parallel, which would make it a poor candidate for parallel computing. When B is held constant (recall that B is the strictly serial fraction of the program), Amdahl's Law yields a speedup curve that flattens out and stays below the line S = N, approaching an asymptote of 1/B. This shows that it is the algorithm, and not the number of processors, that limits the speedup. Also note that as the curve flattens out, efficiency drops drastically.

[amdahl.gif: speedup curve from Amdahl's Law for a fixed serial fraction B]
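To make the curve concrete, here's a minimal sketch that tabulates speedup and efficiency straight from the formula. The 10% serial fraction is only an example value, chosen because it lands near the "6x from 16 processors" figure discussed below.

Code:
#include <cstdio>

// Amdahl's Law: S(N) = N / (B*N + (1 - B)), with B = strictly serial fraction.
double speedup(double B, int N) {
    return N / (B * N + (1.0 - B));
}

int main() {
    const double B = 0.10;  // assumed 10% serial fraction -- an example value only
    std::printf("  N   speedup   efficiency\n");
    for (int N = 1; N <= 32; N *= 2) {
        const double S = speedup(B, N);
        std::printf("%3d   %7.2f   %10.2f\n", N, S, S / N);
    }
    // As N grows, S approaches 1/B (10x here) no matter how many processors you add.
    return 0;
}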
 
THANKS!
For digging out the quote. I've been referring to it for ages, but I haven't kept it around in a quotable form.

To what degree it will raise its head in the context of small-scale parallel processing in a games console is debatable. (Also, as was alluded to above, it gets increasingly complex/costly to wring out relatively small performance benefits from problems that aren't amenable to parallel coding.)
 
One should note that that diagram is for a particular value of B. It is not always true that '16 processors' means '6x rather than 16x performance' (i.e. Your Mileage May Vary).

It should also be noted that the same problem affects systems with non-identical but parallel processing units (e.g. VPUs) as well. Because VPU systems are pipelined, we do our internal analysis mostly on 'where are the bottlenecks', which is easier conceptually.
 
Don't use Amdahl's law unless you are willing to discuss the input parameters ... it is a little like all the drones using Drake's equation to "support" their pre-existing opinion on extra-terrestrial life.
 
I dunno. He kinda/sorta has a point about most people not currently running (massively) parallel code. But then most people don't really have massively parallel architectures to use, and so why would developers target such things in the first place? On a sequential architecture, parallel code will run slower...

If architectures head down the parallel route to get performance, new algorithms will come to the fore to take advantage. If that happens the game flips around and the platforms without good parallelism become the ones that run like dogs.

Speaking as a programmer, I think if I had an infinitely fast machine, I'd rather it was serial than parallel, because on the whole that's easier to deal with. But with the tangible limitations of the real world my primary motivation is speed, and so I'll take whichever is ultimately faster, even though I might have to jump through some hoops to take advantage. It looks like many architects think parallel is the way to go right now to get more bang per buck, so I guess we'll deal with that. Plenty of the code I write ought to be quite happy running in parallel; it's just not necessarily structured like that yet.

Also, as a footnote to this, I suspect that if it was as easy to clock a chip up by a factor of 16 (or even 6) as it ought to be to connect that many slower cores together, then chip makers would do exactly that. And shoving things together in parallel doesn't prevent you from taking advantage of speed-ups possible in less parallel architectures either, so it's not like the parallel chips are going to be running at drastically lower clock rates.

If the clocks on serial devices could be sped up 16x, but 16x the processors only gets you a 6x improvement, then we'd only need to crank the clocks on those parallel processors by about 3x to come out ahead (6 x 3 = 18 > 16).

The graph is only flat at its extremity, and is based on a particular task. It also doesn't seem to take into account the speedup from doing *several different tasks at once*. If I have 3 independent tasks, and 3 processors to run them on, it ought to be pretty clear that I can do those tasks 3x faster than with one processor, barring an outside constraint such as memory bandwidth. The graph also shows good improvements for lower numbers of parallel processors, so why not choose to do a moderate amount of work in parallel and *also* boost the clock?
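For that "3 independent tasks, 3 processors" case, here's a minimal sketch (the task names and bodies are made-up placeholders). Since the tasks share no state, there's essentially no serial portion for Amdahl's law to bite on beyond thread startup and any shared memory bandwidth.

Code:
#include <thread>

// Hypothetical per-frame tasks; names and bodies are placeholders.
void updatePhysics() { /* ... integrate rigid bodies ... */ }
void updateAI()      { /* ... run behaviour logic ... */ }
void mixAudio()      { /* ... fill the next audio buffer ... */ }

int main() {
    // Three independent tasks on three threads: close to 3x over running them
    // back to back, barring shared limits such as memory bandwidth.
    std::thread physics(updatePhysics);
    std::thread ai(updateAI);
    std::thread audio(mixAudio);
    physics.join();
    ai.join();
    audio.join();
    return 0;
}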

I'm rambling, so I'll shut up now.
 
It's true, the extremity at which separating threads becomes disadvantageous will not be reached any time soon. Even the "BE" form of Cell is basically only a 4-processor system, although presumably it could run a separate thread on each APU. It would probably be best to isolate the parallel tasks into their own threads, and for example have each APU work on a separate vertex or a separate pixel.

I'm considering things here from a long-term perspective, even beyond PS3 and into the next decade. The truth is there most likely will not be a reliance on multi-threading, clock speed, or thread-level performance alone. All three will be milked for everything they can give, and all three will eventually run out of steam.
 
Dio said:
One should note that that diagram is for a particular value of B. It is not always true that '16 processors' means '6x rather than 16x performance' (i.e. Your Mileage May Vary).

MfA said:
Don't use Amdahl's law unless you are willing to discuss the input parameters ... it is a little like all the drones using Drake's equation to "support" their pre-existing opinion on extra-terrestrial life.

The figures themselves are not as significant as the shape of the curve. Adjusting for the granularity of the grid, the curve follows the same trend whether B is 10% or 0.1%.
 
A straight line is part of the equation too; it is irrelevant to the question at hand if you don't pick parameters ... pick a computationally relevant part of a game engine which you think won't scale, and then we will see if we can make something meaningful out of Amdahl's law.
 
Nobie, my previous objection to your invoking of Amdahl's still makes me wonder. Now, I'm not sure, but explain this to me.

It was my understanding that this law holds when you try to accelerate one entity, whatever it may be, with N processors, and describes the diminishing returns you'll experience.

What happens when you don't accelerate one entity with N processors, but rather N entities with N processors concurrently? How is this affected?

Why work on one entity with a plurality of processors, as if you're trying to emulate a serial pipeline, when you can just compute in parallel en masse? I have to be missing something; I'd like to know what.
 
Vince said:
Nobie, my previous objection to your invoking of Amdahl's still makes me wonder. Now, I'm not sure, but explain this to me.

It was my understanding that this law holds when you try to accelerate one entity, whatever it may be, with N processors, and describes the diminishing returns you'll experience.

What happens when you don't accelerate one entity with N processors, but rather N entities with N processors concurrently? How is this affected?

Why work on one entity with a plurality of processors, as if you're trying to emulate a serial pipeline, when you can just compute in parallel en masse? I have to be missing something; I'd like to know what.

I'm not sure I understand what you mean by one entity. If a program is divided up into threads, and each processor is running a different thread, wouldn't this be N entities on N processors? This is what I'm referring to. Obviously, running a single-threaded application on multiple processors won't do you any good.

MfA said:
A straight line is part of the equation too; it is irrelevant to the question at hand if you don't pick parameters ... pick a computationally relevant part of a game engine which you think won't scale, and then we will see if we can make something meaningful out of Amdahl's law.

Alright, I made an Excel spreadsheet out of the formula. The figures I plugged in are from the graph I posted, but you can play around with it and see for yourself.
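If anyone would rather skip Excel, here is a small sketch that tabulates the same formula across a few serial fractions (the particular B and N values are just examples):

Code:
#include <cstdio>

// Amdahl's Law again: S = N / (B*N + (1 - B)).
int main() {
    const double serialFractions[] = {0.10, 0.01, 0.001};      // example B values
    const int    processorCounts[] = {1, 2, 4, 8, 16, 32, 64}; // example N values

    for (double B : serialFractions) {
        std::printf("B = %g%%\n", B * 100.0);
        for (int N : processorCounts) {
            const double S = N / (B * N + (1.0 - B));
            std::printf("  N = %2d  ->  speedup %.2f\n", N, S);
        }
    }
    return 0;
}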
 