R520 and G70 32 pipes ?, G70 dual core with turbo cache ?

Chalnoth -

You could just execute nops for all pixels that didn't take a branch (isn't this basically what happens now?). More ideally, I think you'd have to dispatch instructions from very small, quad-local caches. I imagine misses would be serviced by a global cache that broadcasts results to all its clients (and throws away duplicate requests).
 
I'm not sure what you're saying, but, then again, I'm quite tired right now. So maybe I'll re-read it in the morning.
 
CPUs are not going multi-threaded (executing multiple command lists in parallel) because it's a good thing.
It's a very inefficient solution.
They do it because they have no other way to significantly improve performance using the excess of transistors available on new processes.

As long as GPU commands can be broken into a lot of parts that can execute in parallel (vertices, pixels, etc.), executing multiple commands at the same time will always be worse than the time-sliced solution.
That said, support for efficient context switching will be a very important feature in the future...
 
Yep, Hyp-X has hit the nail on the head. If adding extra cores was the most efficient and effective way to scale CPU performance, it would have happened way before now(*). It didn't because it isn't. Multi-core in CPUs isn't the best thing since sliced bread, it's the *worst* thing since sliced bread but it's the only thing left!

(*) How many '486 cores could you fit in 250M transistors, for example?
 
Perhaps typical PC usage has also changed with time to make multi-core CPUs more desirable than single-cores? Agreed that CPU makers seem forced into dual-core, but it may turn out to be quite useful to have anti-virus and other background apps run on one core, and your main task on another.

As for adding extra cores being efficient, I thought the math chip on the early x86s was popular enough to be integrated into the main chip. I don't see why it's that big of a stretch for multi-cores to become integral to the PC.
 
Typical PC usage will only see the advantage of dual core, and frankly the most economic approach would be for the CPUs to act asynchronously, with the second core being a cheap, very slow chip (say 10% of the speed of the main processor). I definitely don't think current PC usage would benefit from more than two cores. However, now that dual cores have arrived, there will be a bigger focus on load-balancing software across multiple threads, which will shift average PC usage towards something that would benefit from multiple processors.
 
Hyp-X said:
CPUs are not going multi-threaded (executing multiple command lists in parallel) because it's a good thing.
It's a very inefficient solution.
They do it because they have no other way to significantly improve performance using the excess of transistors available on new processes.
No, that's not true. They're going multi-threaded because you can get far more performance out of the same number of transistors this way.

Going multi-core is the most effective and efficient way to increase CPU performance. But it's also the hardest on software developers. It hasn't happened before now because CPU manufacturers were able to increase performance relatively easily by increasing IPC and/or clock rate, and a move to multicore wouldn't translate to a performance improvement in most applications.

But, now that it's begun, software developers are going to find ways to make use of it, and multicore is going to accelerate CPU performance at a rate not seen in recent years.
 
Chalnoth said:
No, that's not true. They're going multi-threaded because you can get far more performance out of the same number of transistors this way.

Going multi-core is the most effective and efficient way to increase CPU performance.

That is true at this point in time; it's not a global truth. Regardless of the software development cost, multi-threaded execution in all but the most trivially parallel applications incurs overheads in synchronisation and other inter-thread communication that aren't present in single-threaded applications.

Measured in CPU seconds, parallel applications in almost all cases consume more compute power than their serial equivalents. What gets reduced is wall time (which is what most people care about), but in terms of the MIPS/FLOPS expended to get a given piece of work done, multi-threaded and parallel applications are less efficient.
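
To put rough, made-up numbers on it: a job that needs 100 CPU-seconds serially might finish in 30 seconds of wall time on four cores, but it has then burned 120 CPU-seconds once synchronisation and duplicated work are counted. Faster for the user, less efficient per cycle.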

I guess it depends how you want to define efficiency :)

But, now that it's begun, software developers are going to find ways to make use of it, and multicore is going to accelerate CPU performance at a rate not seen in recent years.

...for some algorithms and some applications.
 
You know what? I'd really like to see an example of a computationally-intensive application that can't make good use of parallelization.

Edit:
Anyway, even though it may often take more total computational resources to parallelize code, there is much more computational power available in a parallel architecture.
 
nutball said:
I guess it depends how you want to define efficiency :)

Well, in terms of high-end performance requirements, speed-up comes first. As long as it can finish the job faster, it doesn't really matter if you need to break the efficiency of the algorithms to make it run faster on multiple cores.

Efficiency in the service of speed-up, not efficiency just for its own sake.
 
V3 said:
Well, in terms of high-end performance requirements, speed-up comes first. As long as it can finish the job faster, it doesn't really matter if you need to break the efficiency of the algorithms to make it run faster on multiple cores.

Efficiency in the service of speed-up, not efficiency just for its own sake.

Absolutely, but my reply was to Chalnoth's assertion that multiple cores were the most efficient use of transistors (at this point). I agree with that assertion, however that is a different definition of "efficient" than the example I came up with. It's efficiency judged on the basis of extra performance per extra transistor, not extra performance per extra thread.
 
Chalnoth said:
You know what? I'd really like to see an example of a computationally-intensive application that can't make good use of parallelization.

Well, it depends how far you want to scale, but sorts and FFTs are two that spring immediately to mind. Recursive bisection algorithms, basically; they need to be re-thought.

Two threads will be fine, but if you want to talk quad-core + SMT (which is what ... 18-24 months away?) then it becomes much tougher to get those to scale for sensibly sized computational payloads.

Any problem that requires a large number of iterations on a small set of data will also have problems scaling (where the synchronisation cost outweighs the compute payload per thread per iteration).

You get into interesting territory too... say some given algorithm is 10% quicker on 8 threads compared to 4; do you as a developer choose to use the 8 and get that 10%? On a dedicated machine that might make sense, on a desktop it's a lot less clear. Do you allow for dynamic adjustment of the number of threads? Who (or what) arbitrates all of this?

Anyway, this is getting kinda off the topic of GPUs :)
 
Simply put: hardware and software have to work side by side on this. Multi-core CPUs/GPUs are of no use if the software is unable to take advantage of them. It's a chicken-and-egg situation, really... I mean, if you look back at CPU history, dual-CPU setups have been around for quite some time, but they never took off due to the lack of support and software that could really make the platform take off. Now that dual-core CPUs will become the norm within the next year, that will change for upcoming apps. The same will happen when dual-core GPUs become mainstream.
 
nutball said:
Well, it depends how far you want to scale, but sorts and FFTs are two that spring immediately to mind. Recursive bisection algorithms, basically; they need to be re-thought.
Well, sorts are pretty easy to parallelize, for one. You'd just split your list into a number of different lists and sort each separately. The recombination would then be of order N*M operations (N = total objects sorted, M = number of lists), which should typically be shorter than sorting the single long list, so while the recombination won't be parallelizable, the bulk of the processing will be.

And the total number of operations shouldn't increase by a large amount. For example, if you have an efficient sorting algorithm that typically finishes in order N*log2(N) operations, the sorting of each block should now finish in (N/M)*log2(N/M) operations. The total number of operations needed to sort all of the blocks, then, will be N*log2(N/M). Then you add in the recombination: N*log2(N/M) + N*M = N*log2(N) + N*(M - log2(M)). If we take some sample numbers, say N = 1e6 and M = 4, the total number of executed operations increases by about 10%. Completion time with four processors, then, would be roughly 25% + 10% = 35% of the single-processor time. Of course, if you go to too many blocks this will actually end up being slower, but then you'd have to parallelize at a higher level.
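
For what it's worth, here's a minimal sketch of that split-sort-merge idea in C++ (assuming std::thread and a plain pairwise merge for the serial recombination; the block count M and the round-robin split are just illustrative choices, not a recommendation):

```cpp
// Sketch: sort M blocks in parallel, then recombine serially.
// The serial merge loop at the end is the order-N*M recombination step
// described above; real code would likely use a thread/task pool.
#include <algorithm>
#include <iterator>
#include <thread>
#include <vector>

std::vector<int> parallel_sort(std::vector<int> data, unsigned M = 4) {
    std::vector<std::vector<int>> blocks(M);
    for (std::size_t i = 0; i < data.size(); ++i)   // split into M sub-lists
        blocks[i % M].push_back(data[i]);

    std::vector<std::thread> workers;
    for (auto& b : blocks)                           // sort each block in parallel
        workers.emplace_back([&b] { std::sort(b.begin(), b.end()); });
    for (auto& t : workers)
        t.join();

    std::vector<int> result;                         // serial recombination, ~N*M work
    for (const auto& b : blocks) {
        std::vector<int> merged;
        std::merge(result.begin(), result.end(), b.begin(), b.end(),
                   std::back_inserter(merged));
        result.swap(merged);
    }
    return result;
}
```

(Calling it on a shuffled vector and checking std::is_sorted on the result is enough to see it work.)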

Now, as for FFTs, well, I don't know enough about the algorithm to be sure, but from what I understand it is based on the idea of separating the Fourier transform into two parts, so I don't see why you couldn't get at least a level of parallelism of two to four.

Two threads will be fine, but if you want to talk quad-core + SMT (which is what ... 18-24 months away?) then it becomes much tougher to get those to scale for sensibly sized computational payloads.
Yes, but in any sensible situation you'd probably want to apply the above algorithms many times. For most situations, FFTs and sorts are so fast on modern computers that they don't really become performance-intensive until you have to do a whole lot of them. So once you get a parallelism of 2-4 in the base algorithm, expanding this to really improve efficiency is relatively easy, as long as you're doing more than one thing at a time.

Any problem that requires a large number of iterations on a small set of data will also have problems scaling (where the synchronisation cost outweighs the compute payload per thread per iteration).
Sure, but in most cases I've run into where this happens, you can again just parallelize the code by running multiple instances. For example, one piece of code I'm currently making use of is a Markov Chain Monte Carlo that only takes up about 1 kB of memory, and I have to run the program for a couple of hours to get decent results. But this would be trivial to parallelize, because if I wanted to, I could just run two (or a hundred) parallel chains and combine the results at the end. I'd have to be a bit careful about the statistics, but provided I could be sure each chain was completely uncorrelated with the others, this wouldn't be a problem at all.
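
A rough sketch of that "just run several independent chains" idea, in the same C++/std::thread style as the sorting sketch above (run_chain is a made-up stand-in for the actual Monte Carlo update; the essential point is only that each chain gets its own, distinct RNG seed):

```cpp
// Sketch: run K independent chains in parallel and pool their estimates.
// run_chain() is a placeholder for the real sampler; in practice you would
// also want properly decorrelated seeds (e.g. a seed sequence), not seed+k.
#include <random>
#include <thread>
#include <vector>

double run_chain(unsigned seed, long iterations) {
    std::mt19937 rng(seed);                           // per-chain RNG state
    std::uniform_real_distribution<double> u(0.0, 1.0);
    double accum = 0.0;
    for (long i = 0; i < iterations; ++i)
        accum += u(rng);                              // stand-in for the MCMC update
    return accum / iterations;                        // this chain's estimate
}

int main() {
    const unsigned K = 4;                             // one chain per core
    const long iters = 10000000;
    std::vector<double> results(K);
    std::vector<std::thread> chains;
    for (unsigned k = 0; k < K; ++k)
        chains.emplace_back([&results, k, iters] {
            results[k] = run_chain(1234u + k, iters); // distinct seed per chain
        });
    for (auto& t : chains)
        t.join();

    double pooled = 0.0;                              // combine the chains at the end
    for (double r : results)
        pooled += r;
    pooled /= K;
    return 0;
}
```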
 
nutball said:
Absolutely, but my reply was to Chalnoth's assertion that multiple cores were the most efficient use of transistors (at this point). I agree with that assertion, however that is a different definition of "efficient" than the example I came up with. It's efficiency judged on the basis of extra performance per extra transistor, not extra performance per extra thread.
Allow me to add efficiency judged by price, if not dictated by physics. Apparently it simply isn't feasible to clock the current single cores higher, so dual-/multi-core is the only resort. And combining two smaller cores may allow for greater yields than making one big one.

The same argument would apply to GPUs, too, just to keep things remotely on-topic. I'm guessing die size is increasing along with transistor count (despite smaller manufacturing processes), so splitting up the core may be inevitable, too, no?
 
Only if the extra packaging cost involved in combining two cores plus an I/O interface into one package is less than the cost of the silicon wasted in failed chips. But then it'd simply be a travesty to call it multi-core, since it's inherently very different from the CPU multi-core situation.
 
Chalnoth said:
Well, sorts are pretty easy to parallelize, for one. You'd just split your list into a number of different lists and sort each separately. The recombination would then be of order N*M operations (N = total objects sorted, M = number of lists), which should typically be shorter than sorting the single long list, so while the recombination won't be parallelizable, the bulk of the processing will be.

What about quicksort? It seems trivial to parallelize, since it's all about splitting the list in two in the first place (and recursively sorting the two parts). If you have, for instance, two threads, would you maybe spawn a new one when a recursion terminates? (Or reuse the one in which the termination happened; maybe you can't afford killing and creating threads all the time.)

I don't know jack shit about parallel programming, just asking some basic questions. Is recursion in general suited to parallelism? (At least in those cases where you split the data, every time or only sometimes, and work on the pieces.)
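
Something like this is what I have in mind; a toy sketch (spawning a thread only above a depth cut-off, since creating one per recursion would cost more than it saves, and the names and cut-off are purely illustrative):

```cpp
// Toy sketch: quicksort that hands one half of each partition to a new
// thread, but only above a depth cut-off so we don't create thousands of
// threads. A serious version would reuse a pool of worker threads.
// Call as: quicksort(data, 0, (long)data.size() - 1, 2);
#include <algorithm>
#include <thread>
#include <vector>

void quicksort(std::vector<int>& v, long lo, long hi, int par_depth) {
    if (lo >= hi) return;
    const int pivot = v[lo + (hi - lo) / 2];
    long i = lo, j = hi;
    while (i <= j) {                              // Hoare-style partition
        while (v[i] < pivot) ++i;
        while (v[j] > pivot) --j;
        if (i <= j) std::swap(v[i++], v[j--]);
    }
    if (par_depth > 0) {                          // parallelise the top levels only
        std::thread left([&v, lo, j, par_depth] { quicksort(v, lo, j, par_depth - 1); });
        quicksort(v, i, hi, par_depth - 1);       // this thread takes the other half
        left.join();
    } else {                                      // below the cut-off: plain recursion
        quicksort(v, lo, j, 0);
        quicksort(v, i, hi, 0);
    }
}
```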
 
I think this dual-core GPU thing was just made up, but maybe it has some basis in reality. Two distinct GPUs on the same package? (This seems insane, but why not.)
Integrated FlexIO/whatever for better multi-GPU? (On the same board, or why not an insane ribbon cable made of silver or fibre optics between the cards; that would be a great way to make money :LOL: )
I don't like current SLI; it's better than the Rage Fury MAXX or the Volari Duo, but it's still a bit bad.
 
Chalnoth said:
And the total number of operations shouldn't increase by a large amount. For example, if you have an efficient sorting algorithm that typically finishes in order N*log2(N) operations, the sorting of each block should now finish in (N/M)*log2(N/M) operations. The total number of operations needed to sort all of the blocks, then, will be N*log2(N/M). Then you add in the recombination: N*log2(N/M) + N*M = N*log2(N) + N*(M - log2(M)). If we take some sample numbers, say N = 1e6 and M = 4, the total number of executed operations increases by about 10%. Completion time with four processors, then, would be roughly 25% + 10% = 35% of the single-processor time. Of course, if you go to too many blocks this will actually end up being slower, but then you'd have to parallelize at a higher level.

Your numbers ignore the cost of synchronisation, and they assume conditions on the machine such that each thread takes exactly the same elapsed time to complete.

Blazkowicz_ said:
What about quicksort? It seems trivial to parallelize, since it's all about splitting the list in two in the first place (and recursively sorting the two parts). If you have, for instance, two threads, would you maybe spawn a new one when a recursion terminates? (Or reuse the one in which the termination happened; maybe you can't afford killing and creating threads all the time.)

You can parallelise the sorting of the sub-lists, and to some extent you can parallelise the merging of the sub-lists into larger sub-lists; the problem is the final step -- you end up with two lists of N/2 elements that you can't avoid merging serially. So you have one thread working hard and M-1 threads sitting around doing nothing. That is a fundamental limit on your scalability (ask Mr Amdahl). To make good use of the idle compute resources you have to start scheduling other work ... if there is other work that *can* be done before the sort is complete.
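
To put illustrative numbers on the Amdahl point (purely hypothetical): if that final serial merge accounts for, say, 10% of the single-threaded runtime, the overall speed-up can never exceed 1 / (0.1 + 0.9/M); with M = 8 threads that's about 4.7x, no matter how perfectly the sub-sorts themselves scale.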

All this debate hinges on what you are looking for. Hypothetical arguments butter no parsnips.

Chalnoth said:
Sure, but in most cases I've run into where this happens, you can again just parallelize the code by running multiple instances. For example, one piece of code I'm currently making use of is a Markov Chain Monte Carlo that only takes up about 1 kB of memory, and I have to run the program for a couple of hours to get decent results.

Monte Carlo techniques are notorious for being embarrassingly parallel. What do you do if you want to run your single chain for 10 or 100 times more iterations?

There's the famous example ... it takes a woman nine months to produce a baby; does that mean that nine women will take one month? No, but they can produce nine babies in nine months. So whether nine women are useful depends on how many babies you want.
 
nutball said:
Your numbers ignore the cost of synchronisation, and they assume conditions on the machine such that each thread takes exactly the same elapsed time to complete.
Sure, but for very long sorts the statistics will typically mean that each thread takes a pretty similar amount of time to complete. If they don't, then you should do some more thinking about how you're doing your sorting (i.e. if the threads frequently don't finish at the same time, there's a pattern to the disorder, and you should modify your algorithm to take advantage of that pattern), or you might consider splitting up the data differently.

Monte Carlo techniques are notorious for being embarrassingly parallel. What do you do if you want to run your single chain for 10 or 100 times more iterations?
As long as you can be certain that your multiple chains are completely independent, and each chain is much longer than the coherency length of the chain, there is no reason to make a single chain very long. Multiple chains will be just fine.
 