The Problem(s) with GPGPU..
Anwar Ghuloum (Principal Engineer with Intel’s Microprocessor Technology Lab) is not particularly fond (http://blogs.intel.com/research/2007/10/the_problem_with_gpgpu.html) of the current GPGPU programming model (and I kind of agree with him..)
An excerpt:
Let me be frank here…I like what I’m hearing from the GPGPU programming model side of things, but I don’t love it yet because it’s not dealing with reality. Reality means dealing with I/O, messy boundary conditions, irregular control flow and data access, funny shapes of thing
Well, he does have a point: the hardware architecture of current GPUs is fundamentally flawed for many workloads. Larrabee doesn't have the same problems, although I don't think it's perfect either.
Larrabee is basically a bunch of slow CPU cores with huge vector units. What that means is that it is not just pure SIMD like a modern GPU; it's also MIMD at the same time. The control logic should be able to run on the *same* core with basically near-zero cycle latency between the scalar and the vector registers.
That's pretty much huge. Having MIMD and SIMD close to each other is also the reason why I'm not the biggest fan of unified architectures for GPGPU, because having some MIMD processing before the SIMD part can come in handy (although what you really want is to be able to send data back to the MIMD part without going through memory or the CPU!)
Sadly, I don't know the details of Larrabee's architecture. It has 4 threads per core, but I don't know if you can be running two threads at once; i.e. can the vector unit be running one thread while the scalar unit(s) is being used by another thread? That could come in handy to balance the two, although with only 4 threads, I'm skeptical it's enough to do both that and hide latency efficiently (or even just the latter).
The key is to make switching between SIMD and MIMD incredibly cheap and as transparent to the programmer as possible. Right now, with CUDA or CTM, that means going back to the CPU. Ugh! Ct might be better, but it's still far from perfect from the little I've seen, and I don't have enough experience with it to really judge their approach.
mhouston
20-Oct-2007, 04:09
I'll add my 2 cents here. I think the issue is getting conflated. Data parallel techniques are the only proven successful way to manage massive parallelism. Traditional threads have HUGE issues when scaling, which is one of the reasons that even on large shared memory machines like Altixes, people program using distributed system techniques. I think the programming models are evolving in the right direction actually. Just look at the success of Brook/RapidMind/CUDA in getting real apps and not just toys working at good speed on GPUs. RapidMind and Brook are interesting examples of things working on GPUs *and* multicore. RapidMind even goes a step further and you can run successfully on Cell with pretty good performance as well. Ct and TBB in and of themselves do not solve some pretty fundamental problems in parallel computation and scaling. Ct has some neat stuff, but it's not perfect either. (You can crap on any language, especially parallel languages! Google "Why I hate your programming language" for examples in sequential programming.)
GPU hardware does have tremendous limits, and if the CPU vendors could really match the *achievable* performance of GPUs without all the restrictions, then everyone would switch. The problem is, you can't build traditional cores and scale like that, hence the LARGE gap in performance. So maybe you strap some honking vector units on a chip; good luck getting most programmers to use that. Something somewhere has to get cut. That's not to say you can't have a more flexible architecture than we have now, but where do you think the GPU vendors are heading along with new graphics APIs? Continued gains in flexibility...
Now, I'm also probably one of the most jaded GPGPU folks since I've been doing this for a LONG time. But, I've seen where we've come from and some of the upcoming stuff, and I think the area is evolving well. I also think that the line between GPU and CPU will start to get blurry pretty quickly. GPUs are getting more flexible and programmable, and CPUs are backing off from huge OoO designs and large monolithic cores and going to lots of small cores. GPUs are going from tiny cores to progressively larger cores, but still tiny compared to a NetBurst-era P4. The main difference at the moment is that GPUs are massively threaded (100s-1000s of threads) compared to CPU designs (1-10s of threads). Since GPUs and CPUs are converging in designs, eventually there won't be GPGPU and it's just a question of what you call the programming techniques. Streaming and data parallel languages have lots of nice traits I hope to see continue. Often writing your code in one of these languages and back porting to a CPU using the same style results in faster code overall. ;-) Now, not everything fits into data parallel formulations. And then we can get into transactional memory (TM), hierarchical streaming, etc.
But, I think the more fundamental challenge in parallel programming is our understanding of data structures. The data structures most people are taught do not behave well in parallel, at least during updates. I have yet to see many solutions to this problem. TM helps here, but has scaling issues and problems in the larger world dealing with large object-oriented code bases when nesting of transactions occurs.
In the end, I think it's an overly defensive posture to attack GPGPU. It really is a threat to traditional high-margin HPC platforms, especially with GPUs moving to 64-bit support soon. I wouldn't be surprised to see some bids or final designs for GPU based supercomputers that make the Top100 list in the next year or two.
Another random comment. Raytracing is really painful on a GPU, but guess what, GPUs dominate for primary rays since you can rasterize and can still do secondary effects competitive with current multi-core CPUs. But, as CPUs scale up and if GPUs don't get more flexible, handling secondary effects is likely going to be better on CPUs. But, conversely, I think it's going to be really hard for a more traditional CPU design to compete at raytracing.
I could go on and on, so if there are specific questions for me let me know and I'll try to be more coherent. ;-) (note that I can't talk about unreleased hardware from anyone so I can't say much directly about upcoming stuff)
Tim Murray
20-Oct-2007, 12:09
Three separate things:
A. Mike, at this point, do you see GPGPU as more or less a domain-specific processor?
B. I'm still wondering why we haven't seen more special-purpose libraries that use Brook or CUDA (although in both cases it might be an issue of maturity). For example, if there were an image processing or computer vision library that could use a GPU and showed a >10x speedup for some tasks, people would take notice--everyone uses OpenCV, and there's no parallelism in that. It could certainly help companies sell chips, and I think there would be a serious secondary effect of increasing people's awareness of GPGPU languages.
C. I've not done any supercomputer programming, so forgive me for being dumb... but doesn't cluster computing have several of the same problems as the basic GPGPU programming model once you're dealing with a few hundred/thousand CPUs? E.g., cost of synchronization, infeasibility of shared data structures, etc. Sure, the cluster will be more flexible and you can use whatever data structures you want within a machine, but it seems like you're still going to have the same types of problems. Basically, I'm wondering if the blog post really is GPGPU specific, or if many of his complaints are more about the difficulty of programming for massively parallel machines.
mhouston
20-Oct-2007, 12:52
A: Not domain specific. GPGPU can successfully handle lots of different algorithms, but limited to those that can be made data parallel. But there are solutions to physics, simulation, rendering (other than traditional raster graphics), AI, etc. GPGPU is basically accelerator offloading. It's not all that different than when FPUs were external chips sitting to the side. It's just that the interconnect speed compared to processing speed is low. But, *any* accelerator will have this issue. Moving to direct connect like Hypertransport or "QuickPath" can help to address that. But, some people are already hinting at combining CPU and GPU architectures on the same package or even the same die.
B: Stability and time. GPGPU was still lunatic fringe until ~1 year ago. We still have stability issues with large apps. The other issue is that there are few options that are vendor neutral or that have good confidence in forward compatibility. Also, GPUs increase in flexibility so quickly that it's hard to keep up. It's easy if you are shipping a box with a GPU in it as an appliance. Hence the success of GPGPU in medical imaging, but having users wanting to constantly update drivers that may break your code is a problem. The other issue is time. All of OpenCV is huge, but there are parts of it that *do* run on GPUs now. In fact, there are some DARPA Grand Challenge vehicles using GPUs for computer vision in the next race.
Folding@Home uses Brook, and NAMD/VMD use CUDA. There are people with specific science apps as well. Part of the issue is money. What is the killer consumer GPGPU app? Video encode/decode/transcode? (See ATI's AVIVO) Physics? (Havok-FX). This is likely the same question as "what's the killer app for > 4 cores".
In the appliance realm, there are already virus scanners, spam filtering, some limited database exploration, lots of image processing, and a few hard core simulation applications.
Standardization is also a killer here. Everyone is doing their own thing. One vendor needs to propose something neutral and back it, two of the three need to agree and force the other to do it, or Microsoft will/needs to come over the top and dictate it. I'm not sure which of those is for the best.
C: You're thinking in the right direction. There are differences in the scales of communication cost, but the ideas are roughly the same. Basically with GPGPU, programmers have had to come up with ways to program massively parallel machines that are more approachable than MPI and better match the massively multithreaded hardware and memory systems. The relation to cluster computing is one of the reasons that many GPGPU apps go after the HPC space, not to mention this is where research funding is coming from.
In the end, I think it's an overly defensive posture to attack GPGPU. It really is a threat to traditional high-margin HPC platforms, especially with GPUs moving to 64-bit support soon. I wouldn't be surprised to see some bids or final designs for GPU based supercomputers that make the Top100 list in the next year or two.
Oh yeah, definitely: GPGPU is awesome for a substantial part of the HPC market. From this POV, I have no doubt that it will be a financial success before Intel even gets to the party with Larrabee.
What I was posting was more about the kind of paradigm shift we'd need to get GPGPU to be more omnipresent in any kind of performance-sensitive programming. Right now, a game using CUDA (even if it also worked on AMD GPUs) would be stupidly hard to handle. You might be able to easily create a particle effect system based on it, but beyond that...
There are many places where large SIMD vectors can have an advantage, but where you also need MIMD logic to orchestrate the whole. GPGPU is, nearly by definition, unable to cater to these markets today. However, very large parts of HPC don't need that too much AFAICT (which is why they can use huge clusters in the first place, as Baron implied), and NVIDIA/AMD might be perfectly happy to only cater to that market for a few years.
AFAICT, NVIDIA is basically trying to have a revenue contribution from Tesla in Q4. So we should know more about their financial expectations there in February 2008, when they announce their Q4 results.
Demirug
20-Oct-2007, 16:12
Standardization is also a killer here. Everyone is doing their own thing. One vendor needs to propose something neutral and back it, two of the three need to agree and force the other to do it, or Microsoft will/needs to come over the top and dictate it. I'm not sure which of those is for the best.
From past experience I believe that only Microsoft has the power to force anyone to agree to a single specification. Maybe the Khronos Group can get something on the table too. But when I look at the current OpenGL 3.0 “disaster” I am not that sure anymore.
From the game developer point of view we need Microsoft anyway. As we need the GPU for graphics we need some API support to share it with other processing.
ShaidarHaran
20-Oct-2007, 17:59
C: You're thinking in the right direction. There are differences in the scales of communication cost, but the ideas are roughly the same. Basically with GPGPU, programmers have had to come up with ways to program massively parallel machines that are more approachable than MPI and better match the massively multithreaded hardware and memory systems. The relation to cluster computing is one of the reasons that many GPGPU apps go after the HPC space, not to mention this is where research funding is coming from.
In response to this, at least in the case of Folding@Home and specifically the new(ish) SMP folding clients I can add that MPI is working quite well for them in the *nix arena (where it is native) but porting it to Windows is providing huge headaches. Massive performance drops compared to Linux, and forget about stability. Where it is stable on Linux, crashes are a regular happening in the Windows world.
My only direct experience with GPGPU is running the GPU folding client on my X1650 XT. 10 minute frame completion times (100 frame WUs) add up to a single WU taking 65% of a day to complete. By comparison the average 100 frame WU I get for CPU folding takes about 2 days to complete, and generates only 2/3 the points. Unfortunately, these are not the same WUs so a direct comparison cannot be made.
I'll add my 2 cents here. I think the issue is getting conflated. Data parallel techniques are the only proven successful way to manage massive parallelism.
On the other hand, non adaptive algorithms have run out of steam for simulation purposes. Data parallelism isn't the same as regularity and both are needed for something to run well on a GPU.
For the moment of course, GPUs are going to get more general (and most research into GPGPU is going to get trashcanned :)).
ShootMyMonkey
20-Oct-2007, 20:45
A: Not domain specific. GPGPU can successfully handle lots of different algorithms, but limited to those that can be made data parallel. But there are solutions to physics, simulation, rendering (other than traditional raster graphics), AI, etc. GPGPU is basically accelerator offloading.
My big problem with this is that people often couch so-called GPGPU as if making algorithms data-parallel is a universally solvable problem, and that computational power and thread availability is the one and only barrier. That's the one point in the blog entry that I tend to agree with most. Beyond just having a confined dataflow architecture, it's also designed not to be able to do some fairly important things which forces you to do things in ways that totally defeat the purpose of moving to the GPU in the first place. This is partly why I don't subscribe to the idea of designating it "GP"GPU... more like "SMT"GPU ("Somewhat More Than"...)
It's not all that different than when FPUs were external chips sitting to the side. It's just that the interconnect speed compared to processing speed is low. But, *any* accelerator will have this issue. Moving to direct connect like Hypertransport or "QuickPath" can help to address that. But, some people are already hinting at combining CPU and GPU architectures on the same package or even the same die.
It certainly wouldn't hurt. There are too many tasks where, because the GPU can't do the job in one shot, you end up having that acceleration in sequential pieces. That tends to make the end result perform worse overall than doing it all on the CPU. And I get all the more peeved by the fact that the spiel from GPU manufacturers leaves out that little detail. It's always "Oh, we can do all these computations SOO FAAST! What happens outside of that isn't important..."
So much tends to be the domain of academic exercises. I don't necessarily mean this in the pejorative sense, but that if extracting TLP is a concern, it's a nice (and most importantly widely-available) testbed to pick up some lessons, even if it is one that demands jumping through a lot of hoops.
In the appliance realm, there are already virus scanners, spam filtering, some limited database exploration, lots of image processing, and a few hard core simulation applications.
While I'm not entirely surprised, some of those just seem immeasurably ludicrous to me to be putting on a GPU. Virus scanners?!? What do you get out of it when you're dealing with something that is utterly disk/data transfer-limited?
mhouston
20-Oct-2007, 20:46
Just like most academic research. But, the research is helping us understand the limits. You'll notice that almost every GPGPU paper has a section that basically says "If we only had X functionality, we could do Y". In most cases, it's a small step over what is already there. But, issues with GPGPU limitations are forcing people to rediscover or invent new algorithmic techniques. Parallel prefix scan is a good example of rediscovering previous techniques and exploring the limits of applying them.
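To make the parallel prefix scan mentioned above concrete, here is a minimal sketch of the classic Hillis-Steele formulation, written in plain Python rather than any GPU language. The function name `inclusive_scan` is made up for illustration, and the list comprehension merely stands in for what would be one parallel pass over all elements on a GPU.

```python
def inclusive_scan(values):
    """Hillis-Steele inclusive prefix sum; each loop iteration is one data-parallel pass."""
    data = list(values)
    offset = 1
    while offset < len(data):
        # Every output of this pass reads only the previous pass's array,
        # so on parallel hardware all positions could be computed at once.
        data = [
            data[i] + data[i - offset] if i >= offset else data[i]
            for i in range(len(data))
        ]
        offset *= 2
    return data


print(inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [3, 4, 11, 11, 15, 16, 22, 25]
```

The number of passes grows with log2(n) instead of n, at the cost of more total additions than a sequential loop, which is the kind of trade-off that maps well onto wide data-parallel hardware.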
People are doing adaptive simulation on GPUs. Things like AMR work fine on the new GPUs with scatter support. It's tricky to handle, but doable. The main issue you run into with adaptive algorithms is load imbalance, but you have that on almost all parallel systems.
One of the other issues that I think people also forget is that shared memory techniques just don't scale because you can't build hardware that scales. This is one of the reasons that people use MPI and other explicit data management techniques on large SMP machines (Origin 2K/3K, Altix, Sun E15Ks). GPGPU style programming allows you to deal with large scale parallelism without some of the nastiness of MPI. And, you can see via CUDA that we are now starting to get synchronization primitives to handle limited forms of communication. Allowing general communication is difficult in the hardware and has huge implications for scalability. We are going to see this in the future, but if you use it, your system is going to have scaling problems.
MIMD, or more precisely MPMD vs. SPMD, is an issue. Efficient MPMD cores will look different from efficient SPMD cores. This is one of the arguments for heterogeneous designs. We shall see how things like Fusion evolve and how they help here. It's possible to tack a good wide SIMD unit onto an MPMD core, but how much time is the SIMD unit sitting idle? But, missing MPMD performance is going to tend to be an issue unless you have huge datasets and processing requirements so that the MPMD portion accounts for a small fraction.
mhouston
20-Oct-2007, 21:01
My big problem with this is that people often couch so-called GPGPU as if making algorithms data-parallel is a universally solvable problem, and that computational power and thread availability is the one and only barrier. That's the one point in the blog entry that I tend to agree with most. Beyond just having a confined dataflow architecture, it's also designed not to be able to do some fairly important things which forces you to do things in ways that totally defeat the purpose of moving to the GPU in the first place. This is partly why I don't subscribe to the idea of designating it "GP"GPU... more like "SMT"GPU ("Somewhat More Than"...)
Agreed. I was never a fan of the term. Other academics get blamed for it. ;-)
Not everything can be made data parallel as you state, but many problems can be recast that way. The next step past that is task parallel, and we can almost do that on GPUs now. Each task still needs some amount of parallelism, but we are getting closer to this. In the end, it's still programming 100s of physical processors with 1000s of threads. Dealing with thousands of threads is tricky.
It certainly wouldn't hurt. There are too many tasks where, because the GPU can't do the job in one shot, you end up having that acceleration in sequential pieces. That tends to make the end result perform worse overall than doing it all on the CPU. And I get all the more peeved by the fact that the spiel from GPU manufacturers leaves out that little detail. It's always "Oh, we can do all these computations SOO FAAST! What happens outside of that isn't important..."
This is why most accepted papers now only deal with end to end performance and not just the performance of a single kernel. But, most HPC type stuff has huge data sets so the CPU interaction time is small. And yet, eventually you get screwed by required data transfer and interaction with the CPU.
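The gap between kernel speedup and end-to-end speedup can be put into numbers with a small back-of-the-envelope model. This is only a sketch: the function `end_to_end_speedup` and every timing below are hypothetical, chosen to show the shape of the effect rather than to describe any real application.

```python
def end_to_end_speedup(cpu_kernel_s, cpu_other_s, kernel_speedup, transfer_s):
    """Whole-application speedup when only the kernel part is offloaded.

    cpu_kernel_s   -- time the kernel takes on the CPU (seconds)
    cpu_other_s    -- time for everything that stays on the CPU
    kernel_speedup -- how much faster the kernel alone runs on the GPU
    transfer_s     -- extra time spent moving data to and from the GPU
    """
    cpu_total = cpu_kernel_s + cpu_other_s
    gpu_total = cpu_kernel_s / kernel_speedup + cpu_other_s + transfer_s
    return cpu_total / gpu_total


# Large data set: the kernel dominates, so a 20x kernel still gives about 5x overall.
big = end_to_end_speedup(cpu_kernel_s=9.0, cpu_other_s=1.0,
                         kernel_speedup=20.0, transfer_s=0.5)

# Small data set: transfer and CPU work dominate, and the result is below 1x,
# i.e. slower than doing it all on the CPU.
small = end_to_end_speedup(cpu_kernel_s=0.1, cpu_other_s=1.0,
                           kernel_speedup=20.0, transfer_s=0.5)

print(round(big, 2), round(small, 2))
```

With these made-up numbers the same 20x kernel is a clear win in the first case and a net loss in the second, which is why reporting only kernel time can be misleading.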
So much tends to be the domain of academic exercises. I don't necessarily mean this in the pejorative sense, but that if extracting TLP is a concern, it's a nice (and most importantly widely-available) testbed to pick up some lessons, even if it is one that demands jumping through a lot of hoops.
This is valid. Academics rarely do anything real. If it was real, they would leave and do startups. And there are several of these that have happened. If it's really worth money, then people don't publish and start a company, but then you don't hear about it since some of this stuff is pretty niche market. Like virus scanning, or mask verification, or radiation transport, etc.
While I'm not entirely surprised, some of those just seem immeasurably ludicrous to me to be putting on a GPU. Virus scanners?!? What do you get out of it when you're dealing with something that is utterly disk/data transfer-limited?
Not from disk, from high speed network lines. ;-) On the fly detection and handling for large companies. This is not consumer stuff.
One of the issues for consumer stuff is "what does your mother need 1TFlop of processing with a memory speed of >200GB/s for?". Other than gaming, that level of power is a tough sell. If you are just running Word, then I'd argue you would likely be fine with a 5 year old machine. But remember that much of GPGPU stuff is going after parts of the gaming market. These include graphics outside of traditional raster (HDR, tone-mapping, lens effects, adaptive sampling, ambient occlusion, better shadowing, etc) as well as physics, AI, randomization, etc.
But, if I had a clear application for GPGPU, or consumer massive parallelism outside of gaming, I'd be doing a startup and running with it. :twisted:
Half life is damn short though ... and though the research helps you understand the limit, I think the lack of focus on the actual costs of certain processor features makes its relevance to hardware design a bit of a stretch.
Sure, allowing general communication is going to be difficult ... but the cost is going to be determined almost solely by the bandwidth and caching you dedicate to it. You have general purpose GHz processors with core areas on the same scale as G80 SPs. It's not the core circuitry which makes MPMD with arbitrary data access expensive.
If you provide very little bandwidth and caching, then treating the GPU any differently than you do now without negatively impacting performance will be very difficult, but given the choice I'd still prefer having more rope. There's always the option of just ignoring it.
PS. I don't think CUDA really improves on MPI's nastiness.
mhouston
20-Oct-2007, 22:16
I'll agree about the half-life, although physics simulation is doing reasonably well (Havok-FX, Folding@Home, NAMD, VMD). And it's true that few academics have spent time in industry or have worked closely enough with architects to understand the cost of things. 1% in silicon is actually quite expensive from a business point of view unless you get it all back in new market success.
But scaling up bandwidth is one of the hardest things to do. This is why we are moving to NUMA and NUCA systems. But then you still have to be very careful about managing locality to run well. Don't underestimate the cost of adding these things. For example, if your communication requirements lead to cache coherence, then you get screwed on scalability and extraneous memory traffic. GPUs devote an amazing amount of die area to effectively maximize bandwidth resources. But the tricks used to do that are often at odds with some more general features like cache coherence or strict consistency when scatter is involved.
It's a tough call between ramping up the floating point power and getting the next game faster and winning benchmarks and making things more general. But, I think there are ways to build a traditional GPU, a throughput compute device, and a traditional CPU for each market and even combine the different ideas together. The question is what is the right thing to do. I don't think I've heard a "perfect" solution from anyone just yet.
Graphics can already do MPMD via vertex and geometry shading, it's just that few people use that in GPGPU and none of the languages/systems export it in the model.
I agree that giving the programmer more rope to hang themselves is fine, but what is the silicon and overall performance cost of adding the features? And, more to the point, how do you justify the added production cost if there is not a clear money making market for it? We shall see how AMD and Nvidia do in the HPC market and what Intel has up their sleeves over the next year or so.
But GPGPU (or whatever you want to call it) is enough of a threat that architecture folks are trying to figure out what to do about it, and there is enough performance gain there, even with all the pain of getting to it, that there are lots of companies looking at how to use it. But, that gets us back to what the killer app is...
But, as I've said, I don't think "GPGPU" itself has much left in it since we aren't using graphics systems for this anymore. Things are now "streaming" or "data parallel" or whatever the current marketing speak is for this. It's just an approach to parallel programming for large parallel systems that happens to match the capabilities of evolving GPUs really well. But, also remember that if you add all the generality of a CPU, you will get the performance of a CPU. So the question is what can be removed that you can live with if you get more performance? And how much faster does it need to be for you to deal with the pain?
ShootMyMonkey
20-Oct-2007, 22:20
Agreed. I was never a fan of the term. Other academics get blamed for it. ;-)
Time to form a lynch mob! ;-)
Not from disk, from high speed network lines. ;-) On the fly detection and handling for large companies. This is not consumer stuff.
Even so, what kind of network setup is fast enough that GFLOPS becomes a limiting factor? Stuff that's spec'ed to Tbit bandwidths never actually reaches anything measurably close to it, and network hardware that fast is only bought when you're serving large numbers of people, which makes for latencies enormous enough that it's no faster than disk access.
One of the issues for consumer stuff is "what does your mother need 1TFlop of processing with a memory speed of >200GB/s for?". Other than gaming, that level of power is a tough sell. If you are just running Word, then I'd argue you would likely be fine with a 5 year old machine.
Hmmm... only 5 years old? Damn. That kicks me out of the market for Word -- not that I'm in the market for a DX9 or 10 (or... um... 7) GPU, anyway. Unless of course somebody is selling a VLB/PCI G80/R600 card :razz:.
Tim Murray
20-Oct-2007, 22:35
Can we please come up with a better term than GPGPU? I've done some private screaming about how meaningless it is, so I think we could do something there. :D
PS. I don't think CUDA really improves on MPI's nastiness.
Could you elaborate on this? I've done a fair bit of CUDA but no MPI programming; however, the code samples I've seen for MPI have just been terrifying.
mhouston
21-Oct-2007, 01:11
It's totally not general purpose as you aren't running an OS on the GPU. A better term would be streaming computation on GPUs or data parallel computation on GPUs. It's not purely scientific computing on GPUs since we are doing video/AI/graphics as well now, but that's a pretty good fit as it's restrictive, but probably too much so. Task parallel computation on GPUs might be the best match, but that's maybe too general, although it does encompass the data parallel nature as well as the task nature of the machine. Bulk Synchronous Parallelism on GPUs maybe? BSP is a little too restricted though... Maybe just MPGPU, massively parallel computation on GPUs?
mhouston
21-Oct-2007, 01:17
As for network performance, there are systems that do need to deal with 100s of MB/s (I do mean big B), such as the big OC lines going into universities and big companies. Virus scanning at that level is incredibly intensive and you can burn up lots of chip bandwidth during compute to keep up. The main goal is to be able to do realtime scanning of the network streams. GPUGems 3 has an interesting article on this. As I said, this isn't something that people outside of large companies, ISPs, telcos, etc really need. Doing AI analysis of attack vectors on network streams is another interesting possibility for GPUs to crunch through since it turns out that GPUs are pretty good at neural nets and the like.
Shrug, it's just syntax ... the whole fork-like "am I root or not" thing is a bit counterintuitive and the calling conventions are a bit wordy, but if you just stick to an SPMD program where you have a bunch of workers all doing the same thing à la CUDA, I don't see what's so nasty about it (http://web.abo.fi/~Mats.Aspnas/PP2006/examples/send-recv5.c).
AFAICS both with CUDA and MPI the nastiness comes in because they are so low level ... once you are done with all the minutia of communicating across thread blocks, bank conflicts, load balancing and distributing work across multiple GPUs do you really end up with anything less nasty than a good MPI implementation to solve the same problem on a cluster?
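For readers who have not seen the pattern, the "am I root or not" structure of an SPMD program can be sketched without MPI at all. The following is a stand-in built from Python threads and queues, not the MPI API: `worker`, `run_spmd`, and the per-rank inboxes are hypothetical names that only mimic the roles played by a rank number and by calls such as `MPI_Send` and `MPI_Recv`.

```python
import queue
import threading


def worker(rank, size, inboxes, results):
    """Every rank runs this same function; behavior branches on the rank."""
    if rank == 0:
        # Root: send one chunk of work to every other rank ...
        for dest in range(1, size):
            inboxes[dest].put(list(range(dest * 10, dest * 10 + 5)))
        # ... then collect one partial sum from each of them.
        total = 0
        for _ in range(1, size):
            total += inboxes[0].get()
        results["total"] = total
    else:
        # Non-root: receive a chunk, reduce it locally, send the result back.
        chunk = inboxes[rank].get()
        inboxes[0].put(sum(chunk))


def run_spmd(size):
    """Start `size` copies of the same program and return the root's result."""
    inboxes = [queue.Queue() for _ in range(size)]
    results = {}
    threads = [
        threading.Thread(target=worker, args=(rank, size, inboxes, results))
        for rank in range(size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results["total"]


print(run_spmd(4))  # 330
```

Even in this toy form, the distribution of work, the matching of sends to receives, and the gathering of results are all written by hand, which is the low-level bookkeeping the post is describing.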
Virus scanners?!? What do you get out of it when you're dealing with something that is utterly disk/data transfer-limited?
I couldn't find the live link so this is from Google's cache.
GPGPU drastically accelerates anti-virus software (http://209.85.165.104/search?q=cache:dBl9GwG0lV4J:www.theinquirer.net/en/inquirer/news/2007/09/12/gpgpu-drastically-accelerates-anti-virus-software+gpgpu+antivirus&hl=en&ct=clnk&cd=1&gl=us&client=firefox-a)
Graphics can already do MPMD via vertex and geometry shading, it's just that few people use that in GPGPU and none of the languages/systems export it in the model.
This is one thing I had not thought of before. The current model for CUDA is that the CPU sets up everything, tells the GPU to run a fixed number of threads, and waits for a response. However there clearly is support for setting up and creating threads on-GPU (Vertex->Geometry, obviously the Rasterizer). The limitation on this is that the program has to be known ahead of time (the GS can't change what PS is bound), but that doesn't seem to be too large of a problem.
I haven't completely thought this out yet, but I'm thinking something along the lines of allowing unbalanced tree algorithms to run more naturally on the GPU. You start 1 CUDA block with a number of tree roots, and then each root can set up and spawn off a number of blocks for its children, and then you can recurse. Obviously you will destroy performance if you're spawning off too many tiny blocks (just like if you have too many tiny triangles), but I think this could be a quite useful feature.
Anyone looked at anything like this before and is it feasible to implement or useful?
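One way to picture the idea is a level-by-level traversal in which each pass produces the work list for the next pass. The sketch below is plain Python and purely illustrative: the outer loop plays the role that the CPU plays today (and that the post would like the GPU to take over), and `process_tree`, `max_block`, and the example tree are all hypothetical.

```python
def process_tree(children, roots, max_block=4):
    """Visit an unbalanced tree level by level.

    children  -- dict mapping a node to the list of its child nodes
    roots     -- nodes handled by the first "block"
    max_block -- how many nodes one hypothetical block handles
    Returns the visit order and the number of blocks that were launched.
    """
    frontier = list(roots)
    visited = []
    launches = 0
    while frontier:
        next_frontier = []
        # Split the current frontier into blocks of at most max_block nodes.
        for start in range(0, len(frontier), max_block):
            block = frontier[start:start + max_block]
            launches += 1
            for node in block:
                visited.append(node)
                # Each node decides how much work the next pass will contain.
                next_frontier.extend(children.get(node, []))
        frontier = next_frontier
    return visited, launches


tree = {"a": ["b", "c"], "b": ["d", "e", "f"], "d": ["g"]}
order, launches = process_tree(tree, ["a"])
print(order, launches)
```

On this small tree most blocks are nearly empty (four launches for seven nodes), which is the "too many tiny blocks" cost in miniature.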
armchair_architect
31-Oct-2007, 06:43
Stream output and DrawAuto give you something like this in DX10. The GS in the first pass gets to determine how many VS (and thus GS) threads there are in the second pass, without a roundtrip through the CPU. This is different from the cases you mentioned because those all involve one pipe stage determining how much work is done by something further down the (virtual) pipeline; DrawAuto and your CUDA example both involve one pipe stage choosing how much work is done in the future by that same pipe stage.
The streamout counters are very special-purpose, and this still requires the CPU to issue draw calls for anything to happen. But I see it as a first baby step away from the traditional feed-forward model towards self-issuing GPU programs.