Old 06-Nov-2007, 17:59   #1
fearsomepirate
Dinosaur Hunter
 
Join Date: Sep 2005
Location: Kentucky
Posts: 2,564
Poisson Solver on a GPU

Hey all. As part of my grad research, I'm going to eventually need to write a Poisson solver. Now, I've done this a million times for a regular old Pentium in C, but I got to thinking...we mostly just use SLOR (successive line over-relaxation), which is a bunch of (mathematically) simple vector operations, so it seems like this is something you could program a GPU to do. However, I know exactly jack diddly squat about how to tell a GPU to do anything, especially since I use g++ in Linux. I have Visual Fortran on my Windows box, but I'm not sure I could use DirectX.

So the question:

a) Is this feasible?

b) If yes, can you point me to where I can learn how to program a GPU? Any such resource would preferably include a lot of mathematical explanation.
__________________
Don't vote; it just encourages them.
Old 06-Nov-2007, 18:15   #2
ShaidarHaran
hardware monkey
 
Join Date: Mar 2007
Posts: 3,904

Perhaps you should look into Cg or CUDA or CTM.
Old 06-Nov-2007, 18:18   #3
Tim Murray
chaos dunk
 
Join Date: May 2003
Location: Mountain View, CA
Posts: 3,274

Not familiar with the algorithm--could you post some links on the topic?

Some basic links to start: the CUDA homepage, the UIUC course (the best place to learn about CUDA), BrookGPU, and GPGPU.org, which could have some helpful links or info in its forums.

Basically, you want to avoid DX/OGL because then you're potentially screwed with every new driver revision. CUDA lets you write for G8x/G9x without touching 3D APIs. Brook is the progenitor of CUDA (I think that's fair to say), and it supports general D3D9 codepaths as well as AMD's CTM (which lets you skip 3D APIs as well) for R5x0 (don't know if it supports R6x0 yet).

There's RapidMind as well, but I don't know if that's really necessary here (not free, for one) but it would let you target Cell.

The rumor goes that AMD is announcing something at Supercomputing 07 that will let you target multi-core CPUs as well as GPUs, but who knows when that will come out, how good it will be initially, what it will actually support, etc. At the moment, Brook is your best choice for AMD chips, and CUDA is your best bet for NVIDIA cards.

(okay token CTM explanation: write your app in HLSL, compile using the CTM compiler, and then you can call that without going through 3D APIs. and then if you want anything outside of the D3D9 spec--which is a lot, in terms of GPGPU--it's time to monkey with assembly. this is why nobody talks about CTM anymore.)

Okay, now let's talk about where things can go terribly, terribly wrong. First, expressing your problem in the first place! The basic idea is that instead of writing your problem sequentially (obviously) or in terms of the usual task-oriented parallel model you see (CPU 1, go run function X, CPU 2, run function Y, etc), you need to have a data-parallel formulation for your algorithm. Essentially, you need to have a relatively-branch-free way of running your algorithm on a single piece of your data set at a time. This can be really tricky and just doesn't work for a lot of things. This is why I linked the UIUC course above--it makes a serious effort to explain how to do that.
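
As a rough illustration of what a data-parallel formulation can look like for this kind of solver, here is a minimal sketch using a red-black (checkerboard) ordering. That ordering is an assumption swapped in for illustration: it is not the SLOR scheme mentioned in the first post, and nobody in the thread proposes it. Plain Python stands in for a GPU kernel, and the function name `redblack_sor` and its parameters are hypothetical.

```python
import math

def redblack_sor(f, h, omega=1.5, sweeps=200):
    """Approximate -laplace(u) = f on a square grid with u = 0 on the boundary.

    f is an n x n list of lists (its boundary entries are ignored) and h is
    the grid spacing. Red-black ordering is assumed purely for illustration.
    """
    n = len(f)
    u = [[0.0] * n for _ in range(n)]
    for _ in range(sweeps):
        for colour in (0, 1):
            # Points of one colour depend only on points of the other colour,
            # so each colour pass could be a single kernel launch with one
            # thread per point and no ordering between threads.
            for i in range(1, n - 1):
                for j in range(1, n - 1):
                    if (i + j) % 2 != colour:
                        continue
                    gs = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                 + u[i][j - 1] + u[i][j + 1]
                                 + h * h * f[i][j])
                    u[i][j] += omega * (gs - u[i][j])
    return u

# Demo with a known answer: u = sin(pi x) sin(pi y) satisfies the equation
# when f = 2 pi^2 sin(pi x) sin(pi y).
n = 17
h = 1.0 / (n - 1)
exact = [[math.sin(math.pi * i * h) * math.sin(math.pi * j * h)
          for j in range(n)] for i in range(n)]
f = [[2.0 * math.pi ** 2 * exact[i][j] for j in range(n)] for i in range(n)]
u = redblack_sor(f, h)
max_err = max(abs(u[i][j] - exact[i][j]) for i in range(n) for j in range(n))
print("max error vs analytic solution:", max_err)
```

The point of the colouring is that the inner double loop has no dependencies inside one colour pass, which is the "one piece of the data set at a time" property described above. The `if` inside the loop is only an artefact of the serial sketch; a real kernel would more likely index the points of one colour directly.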

Second, branching is bad. If you care about performance, don't branch except at a very coarse level (you'll see that if you look at the CUDA stuff).

Third, copying from the CPU to the GPU is expensive. If you constantly need to synchronize with the CPU in order to run other stuff, things could get ugly.

Fourth, and kind of the fundamental assumption: you need to have enough arithmetic intensity to hide latency from memory accesses. This isn't like a CPU, where you might lose twenty cycles or so waiting on a memory access. You lose hundreds, and if you're constantly waiting on memory accesses, your performance will be worse than a recent CPU.
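
To put a rough number on arithmetic intensity for a relaxation sweep, here is a back-of-envelope sketch. Every count in it is an assumption: a 5-point stencil with an SOR-style update, single-precision values, and two idealised extremes for data reuse. The helper `arithmetic_intensity` is hypothetical, not part of any library.

```python
def arithmetic_intensity(flops, values_read, values_written, bytes_per_value=4):
    """Floating-point operations per byte moved to or from memory."""
    return flops / ((values_read + values_written) * bytes_per_value)

# Assumed cost of one 5-point SOR update in single precision:
#   gs = 0.25 * (n + s + e + w + h2 * f)  -> 4 adds, 2 multiplies
#   u += omega * (gs - u)                 -> 1 subtract, 1 multiply, 1 add
FLOPS_PER_POINT = 9

# Worst case: nothing is reused, so the 4 neighbours, the centre value and f
# are all fetched from memory and the result is written back.
no_reuse = arithmetic_intensity(FLOPS_PER_POINT, values_read=6, values_written=1)

# Best case: perfect on-chip reuse, so each u and f value is read only once.
full_reuse = arithmetic_intensity(FLOPS_PER_POINT, values_read=2, values_written=1)

print(f"no reuse:   {no_reuse:.2f} flops/byte")
print(f"full reuse: {full_reuse:.2f} flops/byte")
```

Under these assumed counts both cases come out below one flop per byte, which suggests a plain stencil sweep leans on memory bandwidth far more than on arithmetic. Treat the figures as illustrative rather than as measurements of any particular card.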

Finally, and I am dumb for forgetting to mention this beforehand, you might just be able to use the CUBLAS library if you're really just doing simple matrix/vector ops with large matrices.

(five bucks that mhouston will show up and call me dumb--that's okay, I'd appreciate it!)
Old 06-Nov-2007, 20:34   #4
Rufus
Member
 
Join Date: Oct 2006
Posts: 214

At first glance it seems that the Poisson equations are somewhat similar to the Navier-Stokes equations (both being complex PDEs). If that's the case this chapter from GPU Gems 3 has a complete writeup of how to solve it.
Old 06-Nov-2007, 22:08   #5
fearsomepirate
Dinosaur Hunter
 
Join Date: Sep 2005
Location: Kentucky
Posts: 2,564

Although I would question the formal accuracy of the methods they're using in the above presentation (one can get qualitatively good-looking solutions w/o actually being all that accurate), they're computationally doing the same general sort of stuff. And if that's all being done on the GPU, there's no reason we need to do anything on the CPU, either.

We don't do branching, just lots and lots of for loops on matrices and vectors.
__________________
Don't vote; it just encourages them.

Last edited by fearsomepirate; 06-Nov-2007 at 22:41.
Old 07-Nov-2007, 05:49   #6
silent_guy
Senior Member
 
Join Date: Mar 2006
Posts: 1,687

Quote:
Originally Posted by fearsomepirate View Post
We don't do branching, just lots and lots of for loops on matrices and vectors.
In that case, CUDA should be fine, if you're willing to restrict yourself to Nvidia GF8 cards. I can highly recommend the Nvidia CUDA forums, as well as the examples in the CUDA SDK.
Old 13-Nov-2007, 04:09   #7
Rufus
Member
 
Join Date: Oct 2006
Posts: 214

From the CUDA talk at SC07 over the weekend, in this presentation: http://www.gpgpu.org/sc2007/SC07_CUDA_3_Libraries.pdf
Quote:
In this example, we want to solve a Poisson equation on a rectangular domain with periodic boundary conditions using a Fourier-spectral method.
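
For readers who haven't met the Fourier-spectral approach quoted above, here is a minimal one-dimensional sketch of the idea. It is not the code from the presentation: the 1D setting, the naive O(n^2) transform, and the names `dft` and `poisson_periodic_1d` are all assumptions made to keep the example self-contained. A real solver would hand the transforms to an FFT library.

```python
import cmath
import math

def dft(x, sign):
    """Naive O(n^2) discrete Fourier transform (sign=-1 forward, +1 inverse, unscaled)."""
    n = len(x)
    return [sum(x[p] * cmath.exp(sign * 2j * math.pi * p * k / n) for p in range(n))
            for k in range(n)]

def poisson_periodic_1d(f, length=2.0 * math.pi):
    """Solve u'' = f on a periodic interval and return the zero-mean solution."""
    n = len(f)
    f_hat = dft(f, -1)
    u_hat = [0j] * n
    for k in range(1, n):
        # Map the DFT index to a signed wavenumber; indices above n/2 are
        # the negative frequencies.
        m = k if k <= n // 2 else k - n
        kappa = 2.0 * math.pi * m / length
        # Differentiating twice multiplies mode k by -kappa^2, so divide it out.
        u_hat[k] = -f_hat[k] / (kappa * kappa)
    # The k = 0 mode stays at zero, which fixes the arbitrary constant.
    return [v.real / n for v in dft(u_hat, +1)]

# Demo: u = sin(x) + sin(2x) has u'' = -sin(x) - 4 sin(2x).
n_pts = 16
xs = [2.0 * math.pi * p / n_pts for p in range(n_pts)]
rhs = [-math.sin(x) - 4.0 * math.sin(2.0 * x) for x in xs]
u_spec = poisson_periodic_1d(rhs)
spec_err = max(abs(u_spec[p] - (math.sin(xs[p]) + math.sin(2.0 * xs[p])))
               for p in range(n_pts))
print("max error:", spec_err)
```

The solve itself is one independent division per Fourier mode, so once the transforms are done the remaining work has no dependencies between modes.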
Old 30-Apr-2008, 15:54   #8
PeterT
Member
 
Join Date: May 2002
Location: Austria
Posts: 699

I wrote a GPU-based multigrid Poisson solver using OpenGL as part of my master's thesis (slightly over a year ago). Except for some (expected) inefficiencies at coarse grid levels, multigrid methods are very well suited to GPUs.
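
To make the multigrid idea concrete, here is a minimal one-dimensional V-cycle sketch. It is not the OpenGL solver described above: the 1D problem, the weighted-Jacobi smoother, full-weighting restriction, linear interpolation and all function names are assumptions chosen for a short, self-contained example.

```python
import math

def smooth(u, f, h, sweeps, omega=2.0 / 3.0):
    """Weighted Jacobi: each point is updated from the previous iterate only."""
    for _ in range(sweeps):
        old = u[:]
        for i in range(1, len(u) - 1):
            jac = 0.5 * (old[i - 1] + old[i + 1] + h * h * f[i])
            u[i] = old[i] + omega * (jac - old[i])

def residual(u, f, h):
    """r = f - A u for the standard 3-point discretisation of -u''."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r

def v_cycle(u, f, h):
    """One V-cycle for -u'' = f, updating u in place."""
    n = len(u) - 1  # number of intervals, assumed to be a power of two
    if n == 2:
        # Coarsest grid: a single unknown, solved exactly (and serially).
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])
        return
    smooth(u, f, h, 2)
    r = residual(u, f, h)
    nc = n // 2
    # Full-weighting restriction of the residual to the coarse grid.
    rc = [0.0] * (nc + 1)
    for i in range(1, nc):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    ec = [0.0] * (nc + 1)
    v_cycle(ec, rc, 2.0 * h)
    # Linear interpolation of the coarse-grid correction back to the fine grid.
    for i in range(1, nc):
        u[2 * i] += ec[i]
    for i in range(nc):
        u[2 * i + 1] += 0.5 * (ec[i] + ec[i + 1])
    smooth(u, f, h, 2)

# Demo: -u'' = pi^2 sin(pi x) on [0, 1] with u(0) = u(1) = 0, so u = sin(pi x).
n_mg = 64
h_mg = 1.0 / n_mg
f_mg = [math.pi ** 2 * math.sin(math.pi * i * h_mg) for i in range(n_mg + 1)]
u_mg = [0.0] * (n_mg + 1)
for _ in range(10):
    v_cycle(u_mg, f_mg, h_mg)
mg_err = max(abs(u_mg[i] - math.sin(math.pi * i * h_mg)) for i in range(n_mg + 1))
print("max error after 10 V-cycles:", mg_err)
```

The smoother, restriction and interpolation loops are all independent per point, which fits a GPU well. The coarsest levels, though, have only a handful of unknowns, which is one plausible reading of the coarse-grid inefficiency mentioned above.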

(Edit: sorry for the very late bump, I just now took a look at the dates of the previous replies. I have recently posted mostly on high traffic forums so I didn't expect a thread on the first page to be many months old)
Old 30-Apr-2008, 21:01   #9
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 9,809

Quote:
Originally Posted by PeterT View Post
I have recently posted mostly on high traffic forums so I didn't expect a thread on the first page to be many months old
Welcome to the GPGPU forum at B3D
__________________
What the deuce!?
Old 30-Apr-2008, 23:21   #10
suryad
Senior Member
 
Join Date: Aug 2004
Posts: 2,454

I am not as knowledgeable as you guys...but if you had an SLI setup...could your code if it can be parallelized using CUDA take advantage of basically what amounts to 2 processors since that is what SLI technically allows?
Old 01-May-2008, 05:11   #11
Tim Murray
chaos dunk
 
Join Date: May 2003
Location: Mountain View, CA
Posts: 3,274

Quote:
Originally Posted by suryad View Post
I am not as knowledgeable as you guys...but if you had an SLI setup...could your code if it can be parallelized using CUDA take advantage of basically what amounts to 2 processors since that is what SLI technically allows?
An SLI device appears to the system as one GPU, so running a CUDA app on an SLI setup will only use one chip. AFR and SFR don't make sense in the context of CUDA either. However, depending on your algorithm, it is often possible to write code that scales to multiple GPUs--you just can't have them in an SLI setup.
Old 01-May-2008, 18:12   #12
suryad
Senior Member
 
Join Date: Aug 2004
Posts: 2,454

Quote:
Originally Posted by Tim Murray View Post
An SLI device appears to the system as one GPU, so running a CUDA app on an SLI setup will only use one chip. AFR and SFR don't make sense in the context of CUDA either. However, depending on your algorithm, it is often possible to write code that scales to multiple GPUs--you just can't have them in an SLI setup.
Thanks for the explanation. I was of course not thinking about AFR and SFR in this sense but more like how you can imagine a system with 2 CPUs running a multithreaded app. So if you have 2 cards in your system but not have them in SLI, and you have a multithreaded CUDA app, then the app can leverage both the GPUs you are saying right?
Old 01-May-2008, 19:37   #13
Tim Murray
chaos dunk
 
Join Date: May 2003
Location: Mountain View, CA
Posts: 3,274

Quote:
Originally Posted by suryad View Post
Thanks for the explanation. I was of course not thinking about AFR and SFR in this sense but more like how you can imagine a system with 2 CPUs running a multithreaded app. So if you have 2 cards in your system but not have them in SLI, and you have a multithreaded CUDA app, then the app can leverage both the GPUs you are saying right?
Yeah, basically. Not multithreaded exactly, but takes advantage of multiple CUDA devices--you can enumerate the list of CUDA devices easily and select which device will run a kernel. So, you can take the Tesla D870 (2 G80s in a deskside unit connected via PCIe 2) or the Tesla S870 (4 G80s in a 1U rack connected via PCIe 2), connect that to a machine, and given an appropriate algorithm, you can get a linear (or nearly linear) speedup over a single GPU. But, this is a different model than your normal multicore system, as you don't have a single shared memory space--each card has its own memory and can't easily access that of others. This is more like the message-passing model that is used in supercomputing clusters and such.
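
As a sketch of what the separate-memory model implies for a solver like this one, the example below splits a 1D Jacobi iteration across two hypothetical devices. Two Python lists stand in for the two cards' memories, no CUDA API is called, and the decomposition with one "ghost" value per side is an assumption made for illustration. All function names are hypothetical.

```python
def jacobi_sweep(u, f, h):
    """One Jacobi sweep over the interior of u; the two end entries are held fixed."""
    return ([u[0]]
            + [0.5 * (u[i - 1] + u[i + 1] + h * h * f[i]) for i in range(1, len(u) - 1)]
            + [u[-1]])

def solve_single(f, h, sweeps):
    """Reference: the whole grid lives in one memory space."""
    u = [0.0] * len(f)
    for _ in range(sweeps):
        u = jacobi_sweep(u, f, h)
    return u

def solve_split(f, h, sweeps):
    """Same iteration with the grid split across two separate 'memories'."""
    n = len(f)
    mid = n // 2
    # Each half carries one extra ghost entry mirroring its neighbour's edge value.
    left, f_left = [0.0] * (mid + 1), f[:mid + 1]          # owns 0..mid-1, ghost at mid
    right, f_right = [0.0] * (n - mid + 1), f[mid - 1:]    # ghost at mid-1, owns mid..n-1
    for _ in range(sweeps):
        left = jacobi_sweep(left, f_left, h)     # would run on device 0
        right = jacobi_sweep(right, f_right, h)  # would run on device 1
        # Halo exchange: the only data that has to cross between the memories.
        left[-1] = right[1]
        right[0] = left[-2]
    return left[:-1] + right[1:]

# Demo: both versions should agree, since the split one only relocates data.
f_demo = [1.0] * 9
h_demo = 1.0 / 8.0
u_one = solve_single(f_demo, h_demo, 50)
u_two = solve_split(f_demo, h_demo, 50)
halo_diff = max(abs(a - b) for a, b in zip(u_one, u_two))
print("max difference between single and split runs:", halo_diff)
```

Each sweep moves only two values between the halves while updating every interior point, which is why an algorithm of this shape can scale well across cards despite the lack of shared memory. How the exchange would actually be carried out between two real devices is left out here.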