Followup: GPUs as general processors

3dcgi said:
I read the entire article at the time and I got the feeling that Jen-Hsun's ultimate goal is as the title states.

My feeling is that the author probably interviewed him, severely misunderstood what he said, and interpreted it as meaning that the GPU will eventually make CPUs unneeded. But that simply won't happen.


Anyway, GPUs may make rendering of Maya scenes on the CPU obsolete, but they won't ever take over tasks like running Word. There are good reasons why CPUs are designed one way and GPUs another. If the GPU model were transferable to CPUs, then CPUs would already be designed that way.
 
Humus said:
3dcgi said:
Anyway, GPUs may make rendering of Maya scenes on the CPU obsolete, but they won't ever take over tasks like running Word. There are good reasons why CPUs are designed one way and GPUs another. If the GPU model were transferable to CPUs, then CPUs would already be designed that way.

That one is designed for SPEC benches and the other for 3DMark? Sorry, couldn't resist. ;)
 
JF_Aidan_Pryde said:
Yes, the CPU is much faster for casual computing, but how much faster do you want your 'casual' computing to go? The key to building any faster system is accelerating the worst bottleneck, not making the faster bits even faster. I don't see any advantage in running Winword or IE on a 5 GHz Prescott CPU. But I do see how GPUs can accelerate all those interesting tasks like MP3, DivX, DVD, SETI etc.

Put another way: let's suppose there's an NV50 with full branching and unlimited loops. It also has a basic branch predictor, and virtualised memory is implemented by the OS and GPU. Suppose it's clocked around 1 GHz. Couple this with a low-end CPU that hands it all the tasks to do. Would this not make a much more compelling machine for the consumer than a 6 GHz Pentium6 with an integrated 'EXTREME' graphics chip?

Does the consumer care more about whether Winword starts in 0.01 seconds instead of 1 second, or about whether sight and sound are many times more engrossing?

Why would Winword stay this way forever? Eventually you'll want more functions. How about a good translator? How about a better grammar checker? There are still many ways to use computing power. If you go back to the era of the 33 MHz 386, would you say, "A 386DX/33 is enough for Lotus 1-2-3, so why buy a better CPU?"
 
I agree that GPU architectures will likely never be good for running Word. I think the point (possibly in jest) was that GPUs are more likely to swallow the CPU than the other way around, because the transistor count in GPUs is growing faster than that of CPUs. Heck, Intel already tried to swallow the GPU with Timna and changed their minds. Regardless, I think Nvidia would consider this an idealistic dream; they are not focusing on it as an actual goal.
 
Hmmm ... GPUs evolving into massively parallel processors ... and word processing? For text processing, it should be possible to split the text up into paragraphs, and at lower levels into sentences and words. Many of the things you want to do with text can be parallelized to a great extent: spell check every word in parallel, grammar check/translate every sentence in parallel, perform layout on every paragraph in parallel, draw every letter on-screen in parallel, etc. Even in such a traditional CPU task as word processing, there are massive opportunities for parallelism that traditional CPUs just can't exploit. (This is just intended as an example of how massive, fine-grained parallelism can help performance for tasks that initially may look as if they can't benefit from it.)
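As a toy sketch of the "spell check every word in parallel" case (written as CPU-side C++ purely to illustrate the data parallelism; the dictionary and word list are invented for illustration):

```cpp
// A minimal sketch of data-parallel spell checking: every word is an
// independent task, so the whole list can be checked in parallel.
// The dictionary contents and word list here are invented for illustration.
#include <algorithm>
#include <execution>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

int main() {
    const std::unordered_set<std::string> dictionary = {
        "the", "quick", "brown", "fox", "jumps"};
    const std::vector<std::string> words = {"the", "quik", "brown", "focks"};

    std::vector<int> misspelled(words.size());
    // Each word is looked up independently; no word depends on another,
    // which is exactly the property that makes the task parallelizable.
    std::transform(std::execution::par, words.begin(), words.end(),
                   misspelled.begin(),
                   [&](const std::string& w) { return dictionary.count(w) == 0; });

    for (std::size_t i = 0; i < words.size(); ++i)
        if (misspelled[i]) std::cout << "misspelled: " << words[i] << '\n';
}
```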

Of course, a lot of programmability has to be added to GPUs for them to be able to do this kind of parallel operation on text or other kinds of non-graphics-oriented data; but when (rather than 'if') GPUs reach that level, the classical, single-instruction-stream CPU will be on its way to obsolescence. At least, that's what I (and apparently Nvidia) think will happen. (And as for Intel: they too are already moving in the direction of running multiple instruction streams per CPU - we'll see that more and more over the next few years.)
 
No, I'd say it is not that simple. Parallelizing a certain task is not always easy or efficient. For example, to break a text down into paragraphs and make them parallelizable, you introduce the overhead of the breakdown itself. How do you do that effectively? Not so easy.

Of course, we won't be able to keep making CPUs run faster and faster, so we'll have to build parallel processors. However, they are still different. Currently the work a GPU does is mostly independent: one pixel does not interfere with another pixel. Maybe with adjacent pixels, but not with remote ones. CPUs, however, are designed to handle those "dependent" jobs better. So a CPU runs faster (higher clock rate) but with fewer functional units, and a GPU runs slower but with more functional units.
 
pcchen said:
No, I'd say it is not that simple. Parallelizing a certain task is not always easy or efficient. For example, to break a text down into paragraphs and make them parallelizable, you introduce the overhead of the breakdown itself. How do you do that effectively? Not so easy.
Not that hard: if you begin with a solid block of text that hasn't been preprocessed or anything, you can start out by dividing it into equal-size chunks, feeding each chunk to a processing unit, letting it scan for paragraph breaks and make a list of them, and then joining its list with those of the previous and next processing units. This can give you a linked list of paragraphs very quickly. While a linked list doesn't sound nice for parallel operation, there are actually very efficient parallel algorithms for converting the linked list into an array. At this point, you have obtained an array of pointers to each paragraph, which is all you need to parallelize per-paragraph operations. For N processing units on a text of length K, you will have spent about O(K/N) time total so far.
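A rough sketch of that chunk-and-scan step, with some simplifying assumptions: a plain '\n' stands in for a paragraph break, the worker count is fixed, and CPU threads stand in for the processing units.

```cpp
// Sketch of the chunk-and-scan approach: split the text into equal-size
// chunks, let each worker scan its chunk for paragraph breaks, then join
// the per-chunk lists in order. '\n' standing in for a paragraph break and
// the fixed worker count are assumptions made for illustration.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

int main() {
    const std::string text = "First paragraph\nSecond one\nThird, longer paragraph\n";
    const std::size_t workers = 4;
    const std::size_t chunk = (text.size() + workers - 1) / workers;

    std::vector<std::vector<std::size_t>> breaks(workers);
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            const std::size_t begin = w * chunk;
            const std::size_t end = std::min(text.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)        // scan own chunk only
                if (text[i] == '\n') breaks[w].push_back(i); // record break position
        });
    }
    for (auto& t : pool) t.join();

    // Joining the per-chunk lists in chunk order gives the global layout:
    // each paragraph starts one character after the previous break.
    std::vector<std::size_t> starts = {0};
    for (const auto& b : breaks)
        for (std::size_t pos : b) starts.push_back(pos + 1);

    for (std::size_t i = 0; i + 1 < starts.size(); ++i)
        std::cout << "paragraph " << i << " starts at offset " << starts[i] << '\n';
}
```

Joining the per-chunk lists in chunk order is what yields the global paragraph list; since every character belongs to exactly one chunk, each break is found exactly once.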
Of course, we won't be able to keep making CPUs run faster and faster, so we'll have to build parallel processors. However, they are still different. Currently the work a GPU does is mostly independent: one pixel does not interfere with another pixel. Maybe with adjacent pixels, but not with remote ones. CPUs, however, are designed to handle those "dependent" jobs better. So a CPU runs faster (higher clock rate) but with fewer functional units, and a GPU runs slower but with more functional units.
But I want to run faster with more functional units :). For truly dependent jobs, you can always use barriers or other thread-synchronization primitives (barriers should be easy to implement). The way I see it, the problem with present-day CPUs is that the serial instruction stream model forces artificial execution dependencies onto tasks that are not naturally dependent on each other - it's a bit amazing to see what contortions CPU designers go through to extract even small amounts of parallelism from the serial instruction stream. Multithreading and parallel processing units are a fairly obvious answer, and the result looks to me like it will naturally converge with ever-more-programmable GPUs at some point.
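For what it's worth, here is roughly what that barrier idea looks like in code (using C++20's std::barrier purely as an illustration of the synchronization primitive; the "work" is a placeholder):

```cpp
// Minimal illustration of using a barrier to order a "dependent" step after
// a parallel one: every worker finishes phase 1 before any worker starts
// phase 2. The work itself is a placeholder.
#include <barrier>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const int workers = 4;
    std::vector<int> partial(workers, 0);
    std::barrier<> sync(workers);

    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            partial[w] = w * w;        // phase 1: independent work
            sync.arrive_and_wait();    // wait until everyone has finished phase 1
            if (w == 0) {              // phase 2: a step that depends on all of phase 1
                int total = 0;
                for (int v : partial) total += v;
                std::cout << "sum of partial results: " << total << '\n';
            }
        });
    }
    for (auto& t : pool) t.join();
}
```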
 
arjan de lumens said:
Not that hard: if you begin with a solid block of text that hasn't been preprocessed or anything, you can start out by dividing it into equal-size chunks, feeding each chunk to a processing unit, letting it scan for paragraph breaks and make a list of them, and then joining its list with those of the previous and next processing units. This can give you a linked list of paragraphs very quickly. While a linked list doesn't sound nice for parallel operation, there are actually very efficient parallel algorithms for converting the linked list into an array. At this point, you have obtained an array of pointers to each paragraph, which is all you need to parallelize per-paragraph operations. For N processing units on a text of length K, you will have spent about O(K/N) time total so far.

No way you get O(K/N). Remember there are long and short paragraphs, so some blocks may contain more paragraphs and some fewer. This makes the splitting work unbalanced. Furthermore, I was not talking about 2 or 4 CPUs. If you want GPU-class parallelism, you need hundreds or more parallel tasks to be efficient. We all know that changing pixel shaders for small triangles is not a good idea.

But I want to run faster with more functional units :). For truly dependent jobs, you can always use barriers or other thread-synchronization primitives (barriers should be easy to implement). The way I see it, the problem with present-day CPUs is that the serial instruction stream model forces artificial execution dependencies onto tasks that are not naturally dependent on each other - it's a bit amazing to see what contortions CPU designers go through to extract even small amounts of parallelism from the serial instruction stream. Multithreading and parallel processing units are a fairly obvious answer, and the result looks to me like it will naturally converge with ever-more-programmable GPUs at some point.

Maybe at some very distant point, but still not in the near future. I already mentioned that current GPUs are like a massively parallel, multithreaded architecture, and CPUs are also going this way. However, there is still an assumption: each pixel is independent. This is very important, since it means you don't have to implement locks on them. On the other hand, when designing CPUs you'll want good branch predictors for less parallelizable tasks, and a good locking mechanism. The recently disclosed Prescott instructions MONITOR and MWAIT are examples of this. You don't want to spend many transistors on branch predictors and locking mechanisms in a GPU (yet).
 
There is merit to what you both, pcchen and arjans, say.

It might be instructive to look at the directions that high performance computing has taken. From the Cray 1 onward, we have seen HPC move from fast scalar+vector processors to primarily massively parallel architectures, either loosely coupled (as in Beowulf clusters) or more tightly integrated. These systems typically run a few very well-suited problems.
But there is another trend, and that is the migration of scientific computing to "PC"-level hardware. And in spite of my field being one that actually does crunch supercomputer cycles, I would have to say that this is the dominant trend. There are two reasons for this, one of which is obviously cost. But the other reason is that for non-parallelizable code, you just can't get scalar processors that are much faster than current PC hardware. (Memory sizes are limiting, though.)

Scientific codes have the advantage of not being very cost-pressured - if a code is heavily used, you can count on professors having their graduate students check out possibilities for making it run faster. :)
The conclusion I would draw is that a lot of these codes simply do not parallelize well, not even to the point of benefiting from small-scale SMP. I have dealt with such code myself.
So people parallelize physically, and buy multiple boxes/racks so that they can at least get away from queueing.

Algorithms and tools are certainly part of the problem - perhaps the single most hopeful development for parallel processing is the upcoming PS3. I'm totally uninterested in consoles per se, but I'll buy myself one of those if I can get development tools reasonably easily, because the architecture is so promising. In PC space the Hammer is interesting, but since we are dealing with physically distinct chips (and AMD's need to find a lucrative niche for their Opterons), I have difficulty seeing their topological flexibility ever having an impact in the mainstream. I'd predict that x86 will follow its current path for the foreseeable future.

So far, for a task to be productively migrated to a GPU, it has to be both parallelizable and vectorizable, can't require high precision, and must fit within the local memory of a gfx card. (If it could be productively chopped up into smaller blocks that could be transferred over the AGP bus and processed locally, odds are that it could be partitioned into even smaller parts that would fit inside CPU caches, which would be faster still.)

I just can't see GPUs taking over the tasks of the CPU in any general sense. Particularly not on a platform controlled by Intel.

Entropy
 
pcchen said:
No way you get O(K/N). Remember there are long and short paragraphs, so some blocks may contain more paragraphs and some fewer. This makes the splitting work unbalanced.
You will obviously need O(K/N) time to scan for paragraph breaks. You will also need O(K/N) time to copy data around if you wish to make physical copies of each paragraph - for each block, you copy data from the block to the appropriate portion of the resulting paragraph - this takes O(1) time per character regardless of paragraph length. In addition, there is the per-paragraph overhead with arrays/pointers/lists, which to me at least looks as if it can be done in O(1) time per paragraph. Given that each block has size K/N, there can't be more than O(K/N) paragraphs per block. So it still looks to me as if O(K/N) is perfectly attainable for the text-to-paragraph splitting. You will be hit with the paragraph length imbalance when you do per-paragraph processing later, though.
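One way to picture the list-to-array conversion mentioned above: an exclusive prefix sum over the per-chunk paragraph counts tells each chunk where its entries go in the flat array, so the final write is O(1) per paragraph. The counts below are invented for illustration, and this is just a serial sketch of the idea rather than the parallel scan a real implementation would use.

```cpp
// Sketch of turning per-chunk paragraph lists into one flat array of offsets:
// an exclusive prefix sum over the per-chunk counts tells each chunk where to
// write its entries, so the final scatter is O(1) per paragraph and can itself
// be parallelized. The counts below are invented for illustration.
#include <cstddef>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // Number of paragraph breaks found by each of four hypothetical chunks.
    const std::vector<std::size_t> counts = {3, 0, 5, 2};

    // Exclusive prefix sum: write position of the first entry from each chunk.
    std::vector<std::size_t> offsets(counts.size(), 0);
    std::exclusive_scan(counts.begin(), counts.end(), offsets.begin(), std::size_t{0});

    for (std::size_t c = 0; c < counts.size(); ++c)
        std::cout << "chunk " << c << " writes its " << counts[c]
                  << " entries starting at index " << offsets[c] << '\n';
}
```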
Furthermore, I was not talking about 2 or 4 CPUs. If you want GPU-class parallelism, you need hundreds or more parallel tasks to be efficient. We all know that changing pixel shaders for small triangles is not a good idea.
Given, say, 8 pipelines and, say, 25 clocks of latency for standard operations (like a filtered texture lookup, which may well take this much time), you need 200 pixels or other parallel tasks in flight to reach peak performance. The number can be reduced somewhat with memory prefetching and branch prediction (when pixel shaders get branching), but still - OK, I see that it's more than just a bit much for general processing. (For my word-processing example: 200 paragraphs is probably excessive, 200 sentences is doable, and 200 words is common.)
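The back-of-the-envelope behind that 200 figure, spelled out (the numbers are the ones quoted above):

```cpp
// With P pipelines each hiding L cycles of latency, roughly P * L independent
// tasks must be in flight to keep every pipeline busy.
#include <iostream>

int main() {
    const int pipelines = 8;
    const int latency_cycles = 25;   // e.g. a filtered texture lookup
    std::cout << "independent tasks needed: " << pipelines * latency_cycles << '\n';
}
```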
Maybe at some very distant point, but still not in the near future. I already mentioned that current GPUs are like a massively parallel, multithreaded architecture, and CPUs are also going this way. However, there is still an assumption: each pixel is independent. This is very important, since it means you don't have to implement locks on them. On the other hand, when designing CPUs you'll want good branch predictors for less parallelizable tasks, and a good locking mechanism. The recently disclosed Prescott instructions MONITOR and MWAIT are examples of this. You don't want to spend many transistors on branch predictors and locking mechanisms in a GPU (yet).
Each pixel is independent of every other pixel in the framebuffer; but two fragments (in the OpenGL sense of the word) that affect the same pixel are dependent on each other, and GPUs already now need to implement some sort of per-pixel locking scheme to handle this case correctly.
 
Each pixel is independent of every other pixel in the framebuffer; but two fragments (in the OpenGL sense of the word) that affect the same pixel are dependent on each other, and GPUs already now need to implement some sort of per-pixel locking scheme to handle this case correctly.

I think very few GPUs do per-pixel locking; that would be too expensive. Normally locking is primitive-based. The dsy/dsx instructions in DX9 pixel shader 2_x need block-based locking, though.
 
pcchen said:
I think very few GPUs do per-pixel locking; that would be too expensive. Normally locking is primitive-based. The dsy/dsx instructions in DX9 pixel shader 2_x need block-based locking, though.

The locking I'm talking about is needed when two primitives overlap in the framebuffer, in which case you need to guarantee that the primitives update the overlapping pixels in the correct order. Locking on a per-primitive basis to handle this problem will require one of the following:
  • A triangle setup engine capable of checking, for each primitive, whether it overlaps any primitive currently present in the pixel shader pipelines. If overlap is detected, it holds back the primitive until the ones it overlaps with have been safely written back to the framebuffer.
  • A full pixel shader pipeline flush per primitive. This will cause unnecessary performance loss much of the time.
Locking on a per-pixel basis requires the scanline converter to check every pixel it wishes to output against every pixel currently in flight in the pipelines - if there is a hit, the scanline converter must stall (something like the scoreboard sketched below). It sounds rather complex or slow in either case.
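A crude software sketch of that per-pixel scoreboard check, just to make the bookkeeping concrete - everything here (the hash set, the structure and function names) is invented for illustration, and real hardware would use something far cheaper:

```cpp
// Crude sketch of per-pixel locking via an "in flight" scoreboard: before a
// fragment for pixel (x, y) is issued into the shader pipelines, the
// rasterizer checks whether another fragment for the same pixel is still in
// flight and stalls if so. All structures here are invented for illustration.
#include <cstdint>
#include <iostream>
#include <unordered_set>

struct Scoreboard {
    std::unordered_set<std::uint64_t> in_flight;

    static std::uint64_t key(std::uint32_t x, std::uint32_t y) {
        return (static_cast<std::uint64_t>(y) << 32) | x;
    }
    // Returns false if the fragment must stall because the pixel is busy.
    bool try_issue(std::uint32_t x, std::uint32_t y) {
        return in_flight.insert(key(x, y)).second;
    }
    // Called when the fragment has been written back to the framebuffer.
    void retire(std::uint32_t x, std::uint32_t y) {
        in_flight.erase(key(x, y));
    }
};

int main() {
    Scoreboard sb;
    std::cout << sb.try_issue(10, 20) << '\n'; // 1: pixel free, fragment issued
    std::cout << sb.try_issue(10, 20) << '\n'; // 0: overlapping fragment must stall
    sb.retire(10, 20);
    std::cout << sb.try_issue(10, 20) << '\n'; // 1: free again after retirement
}
```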

dsx/dsy, as implemented in R300/NV30, requires that pixels are issued and processed in groups of 4 (2x2 pixel blocks), rather than individually.
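As an aside on why the 2x2 grouping helps: once four neighbouring pixels travel together, a screen-space derivative of any shaded value can be approximated by simple differences within the quad. The sketch below is a software illustration of that idea, not a description of how any particular chip implements it:

```cpp
// Sketch of why dsx/dsy want 2x2 quads: with four pixels processed together,
// the screen-space derivative of any shaded value can be approximated by
// simple differences within the quad.
#include <iostream>

struct Quad {           // values of some shaded quantity at the four quad pixels
    float top_left, top_right, bottom_left, bottom_right;
};

float dsx(const Quad& q) { return q.top_right - q.top_left; }     // horizontal difference
float dsy(const Quad& q) { return q.bottom_left - q.top_left; }   // vertical difference

int main() {
    const Quad q{1.0f, 1.5f, 2.0f, 2.5f};
    std::cout << "dsx = " << dsx(q) << ", dsy = " << dsy(q) << '\n';
}
```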
 
Yes, yes, I can see it now.

Jen-Hsun at Microsoft headquarters:
-"Well, the NV30 didn't get quite the reception we hoped but me and the lads have this brilliant idea! Couldn't you do Office for GeForce? Man, that spellchecking would fly! We have this new language "C for Geforces" or CG for short, which is perfectly suited for the task. Well, almost. But with the combined talent at our two companies, we are going to Stun the World, eh Bill?"

The vision is fading....
Damn.
I must have been smoking something hallucinogenic.

Entropy
 