Evotistic
16-Feb-2007, 05:11
Check it out guys, the public beta of the CUDA toolkit and SDK is out:
http://developer.nvidia.com/object/cuda.html
Oooh, nice. Linux and Win drivers, FFT and BLAS.
Anyone with a G80 able to benchmark the provided libraries against some standard open libraries to see just how fast this thing is?
Edit: looking through the docs some more it seems the SDK has an entire G80 emulator in it. Specifically:
The emurelease and emudebug configurations run in device emulation mode, and therefore do not require a G8X-based GPU to run properly.
So that means those of us not lucky enough to have a G80 can still play around with CUDA, though probably with abysmal performance.
And Beyond3D's article (http://www.beyond3d.com/articles/cuda-intro/) on it is live, too... :)
Too bad we didn't have the time to do practical tests, and that I had to write that stuff within about 24 hours, but heh!
mhouston
16-Feb-2007, 23:44
Minor article correction: gather has always been supported in Brook, just not scatter (although it's emulated on the CPU). ;-)
INKster
17-Feb-2007, 00:24
Humm... ?
However, before CUDA and graphics will be able to enter the supercomputing space, there are still some hurdles the company has to take. Compared, for example, to traditional floating point accelerators such as Clearspeed's CSX600 boards, graphics cards are running on 32-bit and not 64-bit, providing only single-precision data capability - instead of the required double-precision capability. This limitation is also present in Intel's recently unveiled teraflop processor project. Keane told TG Daily that Nvidia graphics cards scheduled to launch later this year will go 64-bit and offer the double-precision feature.
http://www.tgdaily.com/2007/02/16/nvidia_cuda/
CouldntResist
17-Feb-2007, 21:31
Individual GPU program launches are limited to a run time
of less than 5 seconds on the device. Exceeding this time
limit usually causes a launch failure reported through the
CUDA driver or the CUDA runtime, but in some cases hangs
the entire machine, requiring a hard reset. Microsoft Windows
has a "watchdog" timer that causes programs using the primary
graphics adapter to time out if they run longer than the
maximum allowed time. For this reason it is recommended that
CUDA is run on a G80 that is NOT attached to a display and
does not have the Windows desktop extended onto it. In this
case, the system must contain at least one NVIDIA GPU that
serves as the primary graphics adapter.
Am I the only one scared by this?
Go buy another card, just to keep Windows from committing suicide? Welcome to a hardware solution for a software problem. Isn't the OS, hmm, overreacting here a bit? :roll:
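For what it's worth, the usual way around a per-launch time limit like the one quoted above is to split the work into several short launches instead of one long one. Here is a minimal host-side sketch in plain C of that idea. Note that `launch_chunk` and `run_in_chunks` are hypothetical names made up for illustration: `launch_chunk` is only a CPU stand-in for a single kernel launch, not a CUDA API call, and the chunk size that stays under the limit would have to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one short GPU launch: processes the
   half-open range [begin, end). In a real program this would be a
   kernel launch; here it just doubles each element on the CPU. */
static void launch_chunk(float *data, size_t begin, size_t end)
{
    for (size_t i = begin; i < end; ++i)
        data[i] *= 2.0f;
}

/* Process n elements in pieces of at most `chunk` elements, so that
   no single launch has to cover the whole data set.
   Returns the number of launches issued. */
static size_t run_in_chunks(float *data, size_t n, size_t chunk)
{
    size_t launches = 0;

    assert(chunk > 0); /* a zero chunk size would never make progress */

    for (size_t begin = 0; begin < n; begin += chunk) {
        /* Clamp the last chunk to the end of the data. */
        size_t end = (n - begin < chunk) ? n : begin + chunk;
        launch_chunk(data, begin, end);
        ++launches;
    }
    return launches;
}
```

With 10 elements and a chunk size of 4 this issues three launches covering 4, 4 and 2 elements. It does nothing about the watchdog itself; it only keeps each individual launch short.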
nutball
17-Feb-2007, 22:20
Go buy another card, just to keep Windows from committing suicide?
Or install Linux.
Humm... ?
Keane told TG Daily that Nvidia graphics cards scheduled to launch later this year will go 64-bit and offer the double-precision feature.
I find it particularly interesting that some samples from the CUDA SDK, running today on a G80, output some not-too-common assembly code:
SimpleTexture.ptx
...
.reg .f64 $fd1,$fd2,$fd3,$fd4,$fd5,$fd6,$fd7,$fd8,$fd9,
$fd10,$fd11,$fd12;
...
# 49
# 50 // transform coordinates
# 51 u -= 0.5;
# 52 v -= 0.5;
# 53 float tu = u*cos(theta) - v*sin(theta) + 0.5;
cvt.f64.f32 $fd1, $f8; #
mov.f64 $fd2, 0dbfe0000000000000;# -0.5
add.f64 $fd3, $fd1, $fd2; #
div.f32 $f103, $f1, $f3; #
cvt.f64.f32 $fd4, $f103; #
mov.f64 $fd5, 0dbfe0000000000000;# -0.5
add.f64 $fd6, $fd4, $fd5; #
cvt.rn.f32.f64 $f104, $fd3; #
cvt.rn.f32.f64 $f105, $fd6; #
mul.f32 $f106, $f67, $f105; #
mul.f32 $f107, $f17, $f104; #
sub.f32 $f108, $f107, $f106; #
cvt.f64.f32 $fd7, $f108; #
mov.f64 $fd8, 0d3fe0000000000000;# 0.5
add.f64 $fd9, $fd7, $fd8; #
cvt.rn.f32.f64 $f109, $fd9; #
# .loc 14 1374 0
...
dwtHaar1D.ptx
...
# 132 // detail coefficient, not further referenced so directly store in
# 133 // global memory
# 134 od[tid_global + slength_step_half] = (data0 - data1) * INV_SQRT_2;
ld.param.u32 $r22, %parm_od; # id:109 %parm_od+0x0
sub.f32 $f5, $f3, $f4; #
cvt.f64.f32 $fd1, $f5; #
mov.f64 $fd2, 0d3fe6a09e667f3bcd;# 0.707107
mul.f64 $fd3, $fd1, $fd2; #
cvt.rn.f32.f64 $f6, $fd3; #
ld.param.u32 $r23, %parm_slength_step_half;# id:110 %parm_slength_step_half+0x0
add.s32 $r24, $r4, $r6; #
add.u32 $r25, $r23, $r24; #
mul.lo.u32 $r26, $r25, 4; #
add.u32 $r27, $r22, $r26; #
st.global.f32 [$r27+0], $f6; # id:111
...
There's also some f64 in SimpleGL.ptx.
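The `0d...` immediates in the listings above are the raw bit patterns of double-precision values, written as 16 hex digits. As a quick sanity check, a small C sketch can reinterpret those bits on the host. `bits_to_double` is a helper name invented here, and the sketch assumes the host `double` is IEEE-754 binary64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double.
   Assumption: the host double is IEEE-754 binary64. */
static double bits_to_double(uint64_t bits)
{
    double d;

    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d); /* well-defined way to copy the bits */
    return d;
}
```

Feeding it 0xbfe0000000000000 gives -0.5 and 0x3fe0000000000000 gives 0.5, matching the comments in the SimpleTexture listing. 0x3fe6a09e667f3bcd comes out as roughly 0.707107, i.e. the double nearest to 1/sqrt(2), matching the dwtHaar1D one.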
Double registers, conversions to and from double, moves for doubles, double adds, double muls, all that in assembly language... Hum...
It would be quite interesting to know if all these f64 go unchanged to the hardware, or if some goblins intercept them on their way.
Do we have any elements regarding this?
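As for where the f64 comes from in the first place: in C an unsuffixed constant such as `0.5` has type double, so `u -= 0.5;` on a float widens `u`, subtracts in double and narrows the result back. That matches the cvt.f64.f32 / add.f64 / cvt.rn.f32.f64 pattern in the SimpleTexture listing. The mul.f64 in dwtHaar1D presumably has the same cause, if INV_SQRT_2 is defined without an `f` suffix, but that is a guess since the definition isn't shown here. A small host-side C sketch of the difference (function names are made up for illustration):

```c
#include <assert.h>

/* Mirrors source line 51 in the listing: the unsuffixed 0.5 is a
   double, so the subtraction is carried out in double precision
   and the result is narrowed back to float on assignment. */
static float shift_double_literal(float u)
{
    u -= 0.5;
    return u;
}

/* With the f suffix the constant is a float and the whole
   expression can stay in single precision. */
static float shift_float_literal(float u)
{
    u -= 0.5f;
    return u;
}

/* Returns 1 if the two literal forms have the types described above. */
static int literal_types_as_expected(void)
{
    return sizeof(0.5) == sizeof(double) && sizeof(0.5f) == sizeof(float);
}
```

Both functions return the same value for an input like 1.0f; the difference is only in which instructions the compiler has to emit. Whether suffixing the constants in the SDK samples would make the f64 instructions disappear from the .ptx is something I'd expect but have not checked.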
EDIT: I have just built a quick test case, and the system does indeed behave as the docs state: those doubles are mere floats on the G80. I was expecting the double->float transition to happen, if it happened at all, at a higher level, not after the assembly code. Oh well...
Better luck next year !
BTW: To get the .ptx file, just add the "-keep" option to nvcc's invocation in the custom build step of the .cu files.