Why go multithreading/multiple cores?

Maybe you guys can answer this. My (limited) understanding is that we've reached a threshold in single core performance. So to reach the next level of performance we need to move to multi-core designs.


But now it seems that the CPU on the latest consoles may actually be less powerful than a single core high end CPU. Wouldn't this totally defeat the purpose?

So I'm confused, what exactly is the reason for going multi?
 
Maybe there's a better performance:cost ratio to be had via multi-core than single-core processors are able to deliver these days.
 
Well, even if each core of the X360 is slower than a single P4, the three cores together will be much faster and will be able to handle more tasks at once than a single-core chip could.
 
seismologist said:
Maybe you guys can answer this. My (limited) understanding is that we've reached a threshold in single core performance. So to reach the next level of performance we need to move to multi-core designs.


But now it seems that the CPU on the latest consoles may actually be less powerful than a single core high end CPU. Wouldn't this totally defeat the purpose?

So I'm confused, what exactly is the reason for going multi?
We've reached a point where single-core, single-CPU machines are not getting faster very quickly. It's far easier for CPU designers to get more power by including more cores or more CPUs. This is actually where GPU designers are at, too.

As for the numbers given, I wouldn't put a whole lot of weight to them.
 
seismologist said:
Maybe you guys can answer this. My (limited) understanding is that we've reached a threshold in single core performance. So to reach the next level of performance we need to move to multi-core designs.


But now it seems that the CPU on the latest consoles may actually be less powerful than a single core high end CPU. Wouldn't this totally defeat the purpose?

So I'm confused, what exactly is the reason for going multi?

I found this to be an extremely enlightening read: http://www.gotw.ca/publications/concurrency-ddj.htm

It basically says that we've reached a speed limit (around 4 GHz). To get more computing done in the same amount of time, instead of making our processors faster, we're switching to having more processors. The reason a performance improvement is possible with more processors is that a lot of code is parallelizable (meaning different parts of it can be processed at the same time).

Example of Parallelizable code:
1) R1*R2 -> R7
2) R3*R4 -> R8
3) R5*R6 -> R9

Because R1 through R6 have no dependencies between each other, we can execute lines 1, 2, and 3 all at the same time. If we zoom out and look at longer orchestrations, sections of code which are not dependent on each other can be put in separate threads and executed at the same time.
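
To make that concrete, here's a minimal C++ sketch of the same idea (the std::async usage and the made-up input values are mine, just for illustration): the three multiplies have no dependencies, so they can be launched concurrently.

#include <future>
#include <iostream>

int main() {
    // Made-up input values standing in for R1..R6.
    int r1 = 2, r2 = 3, r3 = 4, r4 = 5, r5 = 6, r6 = 7;

    // No result feeds into another line, so all three multiplies
    // can run concurrently on separate cores.
    auto f7 = std::async(std::launch::async, [=] { return r1 * r2; });
    auto f8 = std::async(std::launch::async, [=] { return r3 * r4; });
    auto f9 = std::async(std::launch::async, [=] { return r5 * r6; });

    std::cout << f7.get() << " " << f8.get() << " " << f9.get() << "\n";
}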

Example of non-parallelizable code:
1) R1*R2 -> R3
2) R3*R4 -> R5
3) R5*R6 -> R7

Here we see that every computed result gets used in the next line of computation. This prevents parallel execution of the lines, because line i+1 is waiting on line i to finish executing. This code sequence MUST be executed sequentially and in order; it is essentially not parallelizable.

I don't know too much of the theory of concurrent programming and optimal threading, but it's what I'm going to dive into here pretty quick. I'm of the mind that if it can be done in a separate thread, it should be done in a separate thread (minus the cost of spawning and syncing parallel threads).
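
That spawn/sync cost is easy to make visible. Here's a rough C++ sketch (just an illustration, not a benchmark methodology, and the numbers will be noisy): the same trivial operation done inline versus in a freshly spawned thread.

#include <chrono>
#include <iostream>
#include <thread>

int main() {
    using clock = std::chrono::steady_clock;
    volatile int sink = 0;

    // The trivial "work" done inline.
    auto t0 = clock::now();
    sink = sink + 1;
    auto t1 = clock::now();

    // The same work, but paying for thread creation and a join.
    auto t2 = clock::now();
    std::thread t([&] { sink = sink + 1; });
    t.join();
    auto t3 = clock::now();

    auto ns = [](auto d) {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(d).count();
    };
    std::cout << "inline: " << ns(t1 - t0) << " ns, "
              << "spawned thread: " << ns(t3 - t2) << " ns\n";
}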

Most programs are single-threaded when they could be implemented in a multithreaded way. The reason we haven't implemented them in a multithreaded way is that we've only had one processor, which could only process one thread at a time.

To continue getting performance increases, we have to start writing multithreaded code. "What can be parallelized, should be parallelized." This is how a single app can take advantage of a multiple-processor machine. If we don't switch to multithreaded applications, the app will be single-threaded and only use one processor at a time.

To sum up: the reason we are switching to multicore is that we are hitting a speed ceiling on processors. To take advantage of multicore, however, we must exploit parallelism in the execution of code wherever it is possible and efficient. The reason we can gain an advantage in a multicore environment is that most code contains parallelism which is not currently being exploited and cannot be exploited on single-processor architectures.
 
Everything is well covered by others; I'd just like to add that it has been obvious for some time that the number of transistors available is still increasing with each new generation, but the clock rate is not.

There were a few great whitepapers on IBM's website a couple of years ago, written by people working on the "Blue Gene" project, kinda like the precursor to the whole Cell chip architecture. The tendency was very clear already by then: more calculation units, more local cache, sophisticated internal buses for connecting the calculation units, lower clock rates.
 
The simple answer is "because we have to." Otherwise, why are both Intel and AMD moving in that direction as well?

...which is why many current complaints are funny in their blindness, as we're at the transition point now. Software-wise, we will be at our weakest state right now, but if both chips and software don't make the move now, we'll only be in the same state later on. Honestly, people are going to have to compare architectural design decisions not on current code or current thinking, but by really looking back on things a number of years down the line. Even "64-bit programming" hasn't really caught on or done much yet, despite years on the market, but people think we can make direct comparisons of all the processor archetypes NOW?
 
AlgebraicRing said:
Example of non-parallelizable code:
1) R1*R2 -> R3
2) R3*R4 -> R5
3) R5*R6 -> R7

Here we see that every computed result gets used in the next line of computation. This prevents parallel execution of the lines, because line i+1 is waiting on line i to finish executing. This code sequence MUST be executed sequentially and in order; it is essentially not parallelizable.

Huh?

It seems to me that barring potential rounding differences I can simply reorder to -

R7 = (R1 * R2) * (R4 * R6) ?

So potentially it would seem that I can parallelise (R1*R2) and (R4*R6), or did I miss something?
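
In code, that reordering might look something like this (a rough C++ sketch with made-up inputs, ignoring the rounding caveat above): the two inner products run concurrently and only the final combining multiply is sequential.

#include <future>
#include <iostream>

int main() {
    double r1 = 1.5, r2 = 2.5, r4 = 3.5, r6 = 4.5;  // made-up inputs

    // (R1*R2) and (R4*R6) have no dependency on each other,
    // so they can run concurrently...
    auto left  = std::async(std::launch::async, [=] { return r1 * r2; });
    auto right = std::async(std::launch::async, [=] { return r4 * r6; });

    // ...and only the final combining multiply is sequential.
    double r7 = left.get() * right.get();
    std::cout << "R7 = " << r7 << "\n";
}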

- Andy.
 
andypski said:
AlgebraicRing said:
Example of non-parallelizable code:
1) R1*R2 -> R3
2) R3*R4 -> R5
3) R5*R6 -> R7

Here we see that every computed result gets used in the next line of computation. This prevents parallel execution of the lines, because line i+1 is waiting on line i to finish executing. This code sequence MUST be executed sequentially and in order; it is essentially not parallelizable.

Huh?

It seems to me that barring potential rounding differences I can simply reorder to -

R7 = (R1 * R2) * (R4 * R6) ?

So potentially it would seem that I can parallelise (R1*R2) and (R4*R6), or did I miss something?

- Andy.

You sort of summed up the different thinking that is required when trying to spin everything into different threads.

That is exactly what is difficult about parallelizing things: you sometimes need to rethink them. Sometimes you can parallelize things that don't originally seem parallelizable (as you just showed).

Bleh, that was a lot of 'parallelizes'.
 
andypski said:
Huh?

It seems to me that barring potential rounding differences I can simply reorder to -

R7 = (R1 * R2) * (R4 * R6) ?

So potentially it would seem that I can parallelise (R1*R2) and (R4*R6), or did I miss something?

- Andy.
Maybe that wasn't a good example, using multiply every time (I wonder if the * was just meant to represent an arbitrary operation or function?). Change the example to:

1) R3 = fa(R1, R2)
2) R4 = fb(R3, R1)
3) R5 = fc(R4, R1)

...and the dependencies are unavoidable. That's where fast, single-threaded performance is necessary. Of course, you can describe R4 and R5 in terms of R1 and R2 (R5 = fc(fb(fa(R1,R2),R1),R1)) and calculate R3, R4 and R5 in parallel, but you'd be doing a lot of extra work, and that's not always possible.
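
To make the extra work visible, here's a minimal C++ sketch of that expansion (fa/fb/fc get made-up bodies, purely for illustration): the three tasks are independent, but only because the later ones recompute the earlier results.

#include <future>
#include <iostream>

// Made-up stand-ins for fa, fb and fc from the example above.
int fa(int a, int b) { return a + b; }
int fb(int a, int b) { return a * b; }
int fc(int a, int b) { return a - b; }

int main() {
    int r1 = 3, r2 = 4;  // made-up inputs

    // The three tasks are now independent, but only because the
    // later ones redo earlier work: fa runs three times in total,
    // and fb twice.
    auto t3 = std::async(std::launch::async, [=] { return fa(r1, r2); });
    auto t4 = std::async(std::launch::async, [=] { return fb(fa(r1, r2), r1); });
    auto t5 = std::async(std::launch::async, [=] { return fc(fb(fa(r1, r2), r1), r1); });

    std::cout << t3.get() << " " << t4.get() << " " << t5.get() << "\n";
}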
 
It's just perf/transistor scaling, really. A single modern Sandy Bridge core is probably close to as powerful as the X360's Xenon, but a single SB core is probably around 200M transistors and Xenon is 165M.

You can have more perf/core, but it costs more to get perf/core than to add more cores. Look at IBM's latest: the PowerA2 with 16 simple cores at 428 mm^2 vs. the Power7 with 8 big cores at 567 mm^2. Most of the time the PowerA2 matches or beats the Power7, with close to half the power.
 
The problem with all the simple "this is how you make code parallel" examples is that they ignore all of the real problems.

Lots of things add to the cost of running something in parallel: cache coherency with multiple threads, moving data or code in order to run it in parallel, additional data structures or operations required to construct results from parallel operations. Moving data is by far the slowest operation on any modern processor.

It has to happen, for the obvious physical constraints, but it's naive to think it's easy on the software end or even an obvious win in all cases.

Cray uses very large numbers of hardware threads on a relatively slow processor to hide latency in their recent supercomputing clusters, but that's only applicable to a small number of highly parallel, large-data problems.

It is really easy to write parallel code that demonstrates negative scaling; I've even seen cases where seemingly parallel code performs worse than the serial version.
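
One classic way to reproduce that negative scaling is false sharing, where "independent" per-thread counters land on the same cache line and the coherency traffic makes the threaded version slower than a serial loop. A minimal C++ sketch (an illustration, not a proper benchmark; the volatile is there so the compiler can't collapse the loops):

#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    constexpr int kThreads = 4;
    constexpr long kIters = 10'000'000;

    // All four counters share one 64-byte cache line, so every
    // increment bounces the line between cores (false sharing).
    // Pad each counter to its own cache line and compare.
    volatile long counters[kThreads] = {};

    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int i = 0; i < kThreads; ++i)
        workers.emplace_back([&counters, i] {
            for (long n = 0; n < kIters; ++n)
                counters[i] = counters[i] + 1;
        });
    for (auto& w : workers)
        w.join();
    auto t1 = std::chrono::steady_clock::now();

    std::cout << "elapsed: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms\n";
}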

I'm not complaining, it's clear where the future is, and it has been since I did my degree in the 80's.
 
Big irons like Cray supercomputers are optimized for ginormous datasets. Their architectures may not be "efficient" for running a small program.

The number of cores alone does not determine scaling performance. Aside from the inherent parallelism, the efficiency of the algorithm is often more important; get that right, and adding more cores brings you close to linear speedup. Sometimes, rare memory conditions (e.g., an unexpectedly high cache hit rate for a certain problem size, because the working set now fits in the combined caches) even make superlinear speedup possible.

More independent cores may be better for real-time computing and rich interactivity if utilization is good. While one core is tied up, you have other cores working at the same time. So more tasks can potentially meet their tight schedules.

At the end of the day, the programmer's talent and dedication still count a lot.
 
I would have really liked to see the concept of CELL take off more than it did with the PS3. The main core could maybe be a cut-down Power7, with 8 SPEs alongside it. Make it dual-core and we'd have 16 SPEs, and newer, updated SPEs at that, with maybe 512K of LS.

Oh and btw I love the 6 year old thread resurrection.
 
Yeah, seeing Shifty respond to a 6.5 year old post was a real "WTF?" moment. :LOL:
Hmmmm. Didn't check the date. Don't go looking in my fridge... :mrgreen:

To be fair, it's quite possible that Flux wanted to ask a question, was suggested some existing threads, and asked it there. The recommendation system doesn't care about thread age.
 
I wonder if Intel's current hyperthreading CPUs can slot in instructions from different threads to unused execution units in the same clock cycle, or if it's just one thread at a time.

If it's just one thread per clock, how complicated would it be to allow simultaneous execution? After all, the CPU already has duplicate registers and whatnot for the two hardware threads... It would buff performance in multithreaded scenarios, so it might be a worthwhile goal to strive towards.
 
I wonder if Intel's current hyperthreading CPUs can slot in instructions from different threads to unused execution units in the same clock cycle, or if it's just one thread at a time.

Intel processors can execute uops from both threads within a single clock, but they can only issue and rename from a single thread at a time.



After all, the CPU already has duplicate registers and whatnot for the two hardware threads...

It actually doesn't really have duplicate registers for the threads. Specifically, SNB has 160 physical (rename) registers, which aren't in any way earmarked for separate threads. When instructions are issued, they get passed through a rename phase where the logic looks up the correct names for their operand registers (as in: at this point in execution, which physical register means this architectural register?) and allocates a new physical register for the result. After this, the instruction is passed to the 54-entry unified scheduler, and the CPU no longer makes any distinction between instructions issued from the two threads (except for flags and flushing instructions after missed branches).
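
As a toy illustration of the renaming idea (purely conceptual, nothing to do with how SNB actually implements it): every write allocates a fresh physical register, and reads look up the current mapping.

#include <array>
#include <iostream>

// Toy rename table: each architectural register (R0..R7) maps to
// whichever physical register currently holds its latest value.
struct Renamer {
    std::array<int, 8> map;    // architectural -> physical
    int next_phys = 8;         // p0..p7 hold the initial values

    Renamer() { for (int i = 0; i < 8; ++i) map[i] = i; }

    // Rename "Rdst = op(Rsrc_a, Rsrc_b)": read the current mappings,
    // then point Rdst at a brand-new physical register.
    void rename(int dst, int src_a, int src_b) {
        std::cout << "R" << dst << " reads p" << map[src_a]
                  << ", p" << map[src_b];
        map[dst] = next_phys++;
        std::cout << ", writes p" << map[dst] << "\n";
    }
};

int main() {
    Renamer r;
    r.rename(3, 1, 2);  // R3 = R1 op R2
    r.rename(5, 3, 4);  // R5 = R3 op R4: reads the renamed R3 (p8)
    r.rename(3, 5, 6);  // writing R3 again gets yet another physical reg
}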
 