Using GPUs as number crunchers?

NickK

Newcomer
Hello, well it's my first post :D

I'm interested in a suggestion from a mate that it should be possible to use a GPU as a number cruncher, so it's early days on this idea.
The background is a large amount of number crunching. The data runs to 8+ GB for each model frame, and these frames are split into 3D groups of whatever size is needed to improve the memory cache hit ratio.

Basically, what I would like to do is upload data and processing scripts to the graphics card, then let the GPU perform the number crunching (if the required operations can make use of the parallel pipelines, so much the better). Once processing is complete, the results can be read back to normal memory and the process repeats.
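Something like this is the sort of batching I have in mind. It's purely a CPU-side NumPy sketch, with a made-up frame size, block size and placeholder operation, just to show the split/process/write-back loop; on the card the per-block step would obviously be a shader pass rather than an array expression:

Code:
import numpy as np

# Illustrative only: carve a 3D dataset into fixed-size blocks, process each block
# independently (stand-in for a GPU pass), then write the results back in place.
data = np.random.rand(128, 128, 128).astype(np.float32)  # one "model frame" (made-up size)
result = np.empty_like(data)
block = 32                                                # made-up block edge length

for x in range(0, data.shape[0], block):
    for y in range(0, data.shape[1], block):
        for z in range(0, data.shape[2], block):
            chunk = data[x:x + block, y:y + block, z:z + block]          # "upload" one 3D group
            result[x:x + block, y:y + block, z:z + block] = chunk * 2.0  # placeholder crunching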

I understand that loading/offloading data via AGP will be slow, but with PCI Express on the horizon this may change.

What's displayed on the screen is unimportant during the processing, as the machine would be controlled over a net connection.

Has anyone attempted this with the current crop of graphics cards?
 
There were several papers on this topic at last year's Siggraph and Graphics Hardware conferences. I believe there is also a website collating these papers but I can't recall the site name at the moment.
 
Thanks guys, it looks very promising and it sounds like the idea will fit my current thinking :D

All I need to do now is read up on the abilities and operations available to me from nVidia and ATi, then work out the best way to batch-process the data.

Incidentally, does anyone know whether the graphics card manufacturers and PCI Express will support multiple graphics cards on the same bus? I'm thinking one possible way to get a dense processing farm is to load up a PC with a set of graphics cards, depending on the bus bandwidth and the power drawn by the cards.
 
PCI Express is a point-to-point design. Current chipset designs call for only one PCI-E x16 bus between the Northbridge and the graphics card. The rest of the PCI-E buses hang off the Southbridge (and would most likely be x8 or narrower). Even if one of those is also x16, there could be a bandwidth bottleneck between the Northbridge and Southbridge if there's one card at each end. Unless somebody comes up with a design that allows dedicated routing of data traffic between two or more PCI-E x16 graphics cards and the Northbridge, multiple cards working in parallel (e.g. SLI-style) on PC hardware is unlikely. There could be other issues as well, but the above is the one that occurred to me immediately. IMO, it's unlikely to appear at the consumer level anytime soon. Dunno about the higher end.

The closest thing to multiple processors right now would be 3DLabs' Realizm, with its VSU plus dual VPUs on a single PCI-E x16 card. Maybe someone could confirm or correct me...
 
The issue has even turned up on Slashdot.
"After seeing the press releases from both Nvidia and ATI announcing their next generation video card offerings, it got me to thinking about what else could be done with that raw processing power. These new cards weigh in with transistor counts of 220 and 160 million (respectively) with the P4 EE core at a count of 29 million. What could my video card be doing for me while I am not playing the latest 3d games? A quick search brought me to some preliminary work done at the University of Washington with a GeForce4 TI 4600 pitted against a 1.5GHz P4. My favorite excerpt from the paper: 'For a 1500x1500 matrix, the GPU outperforms the CPU by a factor of 3.2.' A PDF of the paper is available here."
- http://developers.slashdot.org/arti...39204&mode=thread&tid=152&tid=185
 
That's pretty exciting, really. Using the two vector processors on the GeForce4, they achieved greater throughput than a CPU. Imagine what an NV40 could do with all 16 of its pixel pipelines and two shader units per pipeline.
 
I am *HIGHLY* interested in this for use in molecular modeling (calculations use a ton of matrix algebra).

Do any of you guys know of any on-going work in this area? The paper from Slashdot mentions molecular calculations for biology (i.e. folding, I'm guessing), but I'm much more interested in molecular systems, especially ones containing transition metals.

COOL stuff. :D
 
accidentalsuccess said:
I am *HIGHLY* interested in this for use in molecular modeling (calculations use a ton of matrix algebra).
I'm definitely interested in attempting to write a matrix diagonalization routine for the NV40. Tons of physics calculations require diagonalization. The main problem would be keeping the diagonalization numerically stable with only 32-bit floats to work with.

But if it were possible to do this for large matrices, I bet I could get in the range of 20x the performance of a single high-end CPU. (If I can't keep it numerically stable, of course, it would be pointless: small matrices can be diagonalized on a CPU very quickly.)
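To make the target concrete, here is the kind of routine I mean, the classical cyclic Jacobi scheme sketched in single-precision NumPy on the CPU (matrix size, sweep count and threshold are arbitrary). The open question is whether it stays this close to the FP64 reference once the matrices get large and it's mapped onto fragment programs:

Code:
import numpy as np

def jacobi_eigenvalues(a, sweeps=10):
    """Cyclic Jacobi diagonalization of a symmetric matrix, kept in float32 throughout."""
    a = a.astype(np.float32)
    n = a.shape[0]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p, q]) < np.float32(1e-12):
                    continue
                # Rotation angle that zeroes the (p, q) off-diagonal element.
                theta = np.float32(0.5) * np.arctan2(2 * a[p, q], a[q, q] - a[p, p])
                c, s = np.cos(theta), np.sin(theta)
                rot = np.eye(n, dtype=np.float32)
                rot[p, p] = c; rot[q, q] = c
                rot[p, q] = s; rot[q, p] = -s
                a = rot.T @ a @ rot
    return np.sort(np.diag(a))

m = np.random.rand(8, 8).astype(np.float32)
m = (m + m.T) * np.float32(0.5)                  # symmetrize
print(jacobi_eigenvalues(m))                     # FP32 Jacobi result
print(np.linalg.eigvalsh(m.astype(np.float64)))  # FP64 reference, ascending order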
 
I see that ATi have already produced a fluid dynamics model using their card, with a good improvement in speed. Hopefully nVidia, Intel and ATi will cotton on to the fact that if they produced chipsets supporting multiple PCI-E x16 slots, they could push into the supercomputing market with their existing product ranges.

This feels like the days when you bought an external FPU... now it's a GPU/VPU that's there to help with the 3D maths for games :D

There is now a basic beginners' "hello world" demonstration in the www.gpgpu.org developers section. The demonstration loads data from main memory into GPU memory, processes it, and then shifts the results back from the GPU into main memory.
 
Before we all get too excited I should say that for many applications FP32 is not sufficient; many require FP64. Real-time fluid dynamics has been demonstrated on both NVIDIA and ATI hardware AFAIK, but the algorithms used were specifically chosen for their numeric robustness, not for their general and widespread applicability.

So for some applications GPUs are a very exciting prospect, particularly if we can develop algorithms for which FP32 is sufficient. But they're not a panacea, and I really can't see FP64 coming to pixel pipelines in the foreseeable future! :)
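To put a rough number on the gap (a quick NumPy illustration, nothing GPU-specific):

Code:
import numpy as np

# FP32 carries roughly 7 decimal digits, FP64 roughly 16.
print(np.finfo(np.float32).eps)   # ~1.19e-07
print(np.finfo(np.float64).eps)   # ~2.22e-16

# A typical casualty: subtracting two nearly equal numbers.
print(np.float32(1.0002) - np.float32(1.0001))  # only ~3 good digits of the true 1e-4 survive
print(np.float64(1.0002) - np.float64(1.0001))  # still ~12 good digits in FP64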
 
nutball said:
Before we all get too excited I should say that for many applications FP32 is not sufficient; many require FP64. Real-time fluid dynamics has been demonstrated on both NVIDIA and ATI hardware AFAIK, but the algorithms used were specifically chosen for their numeric robustness, not for their general and widespread applicability.

Good point, although I'd suggest perhaps a rethink of the algorithms and approaches used.
 
accidentalsuccess said:
I am *HIGHLY* interested in this for use in molecular modeling (calculations use a ton of matrix algebra).

Do any of you guys know of any on-going work in this area? The paper from Slashdot mentions molecular calculations for biology (i.e. folding, I'm guessing), but I'm much more interested in molecular systems, especially ones containing transition metals.

COOL stuff. :D

I'm interested in this as well. Is there some way to use two FP32 floats together as one 64 bit float?
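One known trick is the "double-single" (or float-float) representation: carry a value as an unevaluated sum of two FP32 numbers, a high part and the rounding error left over, and do the arithmetic with error-free transforms like Knuth's two-sum. Here's a quick NumPy sketch of the idea as a CPU illustration; whether the error-free steps survive the GPU's actual rounding behaviour is a separate question:

Code:
import numpy as np

def split(x):
    """Represent a float64 as an unevaluated sum of two float32 values (hi + lo)."""
    hi = np.float32(x)                   # leading ~24 bits of mantissa
    lo = np.float32(x - np.float64(hi))  # the rounding error, stored separately
    return hi, lo

def two_sum(a, b):
    """Knuth's error-free float32 addition: returns (s, e) with s + e == a + b exactly."""
    s = np.float32(a + b)
    bb = np.float32(s - a)
    e = np.float32((a - (s - bb)) + (b - bb))
    return s, e

x = 1.0 + 2.0 ** -30                 # needs more mantissa bits than FP32 has
hi, lo = split(x)
print(float(hi) == x)                # False: one float32 has already lost the tail
print(float(hi) + float(lo) == x)    # True: the pair of float32s still holds it

s, e = two_sum(np.float32(1.0), np.float32(2.0 ** -24))
print(float(s) + float(e))           # 1.0000000596046448, i.e. 1 + 2**-24 recovered exactly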

On the other hand, FP32 shouldn't be too bad if one is doing molecular dynamics with a velocity Verlet algorithm, since you'll still get total energy conservation. To handle different forces one might need to do some clever scaling and renormalization if one really wants good results, but it's probably doable. I don't know how bad the lower precision would be for DFT-type codes, though.
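As a rough check of the energy point, here is a velocity Verlet integration of a plain harmonic oscillator done entirely in float32 (unit mass and spring constant, arbitrary step size and step count, nothing molecular about it):

Code:
import numpy as np

# 1D harmonic oscillator, F = -k*x, integrated with velocity Verlet in float32 only.
dt = np.float32(0.01)
k = np.float32(1.0)
m = np.float32(1.0)
x = np.float32(1.0)
v = np.float32(0.0)

def energy(x, v):
    return np.float32(0.5) * m * v * v + np.float32(0.5) * k * x * x

a = -k * x / m
e0 = energy(x, v)
for _ in range(10_000):
    x = x + v * dt + np.float32(0.5) * a * dt * dt
    a_new = -k * x / m
    v = v + np.float32(0.5) * (a + a_new) * dt
    a = a_new

# Relative energy error: expect it to stay at the dt**2 / FP32-roundoff level
# rather than growing steadily, since velocity Verlet is symplectic.
print(float(abs(energy(x, v) - e0) / e0))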

Another interesting project would be using PCI-X GPUs for sensory processing and vehicle guidance in the DARPA Grand Challenge. I hear many of the algorithms used there are directly portable to GPUs.
 
<nitpick> PCI-X is an extension of PCI that is 64 bits wide and can run at 66 MHz, not PCI Express. </nitpick>
 
boobs said:
Does anybody know who I should contact over at Nvidia and Ati to get some test hardware?
If you want to be able to program for the new architecture, you'll typically be given a program that enables software emulation on older hardware. This was previously available publicly, but is now only available to registered developers. You can become a registered developer by filling out an application on nVidia's website.
 
Chalnoth said:
boobs said:
Does anybody know who I should contact over at Nvidia and Ati to get some test hardware?
If you want to be able to program for the new architecture, you'll typically be given a program that enables software emulation on older hardware. This was previously available publicly, but is now only available to registered developers. You can become a registered developer by filling out an application on nVidia's website.

Yeah but if the whole point of the exercise is speed, then running things on emulation is kinda pointless.
 
boobs said:
Chalnoth said:
boobs said:
Does anybody know who I should contact over at Nvidia and Ati to get some test hardware?
If you want to be able to program for the new architecture, you'll typically be given a program that enables software emulation on older hardware. This was previously available publicly, but is now only available to registered developers. You can become a registered developer by filling out an application on nVidia's website.

Yeah but if the whole point of the exercise is speed, then running things on emulation is kinda pointless.

Unfortunately nVidia aren't going to give all the developers free hardware. At least a quickly downloaded emulator allows you to progress your development while your order for the hardware goes through.

It would be good if there were a form of Prime95 for GPUs. 3DMark is only good if you're looking at the visuals and happen to notice something. Unless I've missed the part where graphics benchies validate the output they get.
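Something along these lines is what I mean: run the same workload through the fast path and a trusted double-precision reference, then compare numerically instead of by eye. Sketched here with an FP32 matrix multiply standing in for the GPU kernel and an arbitrary pass/fail tolerance:

Code:
import numpy as np

def workload(a, b):
    return a @ b                      # placeholder kernel; substitute the real computation

# Reference inputs in double precision.
a64 = np.random.rand(256, 256)
b64 = np.random.rand(256, 256)

fast = workload(a64.astype(np.float32), b64.astype(np.float32))  # stand-in for the GPU path
ref = workload(a64, b64)                                         # trusted CPU reference

rel_err = np.max(np.abs(fast - ref) / np.abs(ref))
print("max relative error:", rel_err)
print("PASS" if rel_err < 1e-4 else "FAIL")       # tolerance is arbitrary for this sketch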
 
Does anybody know of fps figures for doing convolutions with 15x15 FP24/32 masks on 512x512 unsigned char images on an X800 and a 6800U, respectively?
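For reference, this is the CPU-side NumPy version of the workload I'd be measuring against (random image and mask, correlation rather than a flipped-kernel convolution, and the frame rate is obviously machine- and implementation-dependent):

Code:
import numpy as np, time

img = np.random.randint(0, 256, (512, 512)).astype(np.float32)  # 8-bit image promoted to FP32
mask = np.random.rand(15, 15).astype(np.float32)

def filter15(img, mask):
    ph, pw = mask.shape[0] // 2, mask.shape[1] // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode='constant')  # zero padding at the borders
    out = np.zeros_like(img)
    for i in range(mask.shape[0]):
        for j in range(mask.shape[1]):
            out += mask[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

frames = 10
t0 = time.time()
for _ in range(frames):
    filter15(img, mask)
print(frames / (time.time() - t0), "frames per second on this CPU")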
 