PDA

View Full Version : CUDA 2.1 beta


Tim Murray
20-Nov-2008, 19:23
and it includes a GPU-based debugger (for 32-bit Linux only right now, but seriously give it a try)

http://forums.nvidia.com/index.php?showtopic=82290

Betanumerical
20-Nov-2008, 19:39
Oh awesome, I will grab this later today, thanks.

willardjuice
21-Nov-2008, 00:10
for 32-bit Linux

No love for us x64 users, I see how it is. :razz:

rpg.314
21-Nov-2008, 07:22
I heard that fortran compiler was dropped from 2.1. What happened to the nvcc --multicore option. Is it there in 2.1 beta? if not, will it make it to the 2.1 final. If it is ther in beta, can you please post links to the updated programming guide.

Thanks

Panajev2001a
28-Nov-2008, 09:11
The most important bit is Visual Studio 2008 support :D.

Panajev2001a
06-Jan-2009, 12:04
Hello everyone (hello Tim, please do not be sick and boread about my posts here :(, not trying to annoy anyone)...

The application I am working on is the same I have been posting about in this thread:

http://forum.beyond3d.com/showthread.php?t=51430

Now, let's recap my system specs and the CUDA issues I am running into... :(...

OS: Fedora 10, using KDE 4.1.3 (latest stable for Fedora) and KWIN's compositing effects enabled.

NVIDIA driver: NVIDIA-Linux-x86-180.06-pkg1.run (2.1 Beta enabled)

CUDA Toolkit: 2.1 Beta

CUDA SDK: 2.1 Beta

GNU compiler: gcc (GCC) 4.3.2 20081105 (Red Hat 4.3.2-7)

CPU: Core 2 Duo 1.67 GHz

RAM: 3 GB DDR-2

GPU: nVIDIA GeForce 8400M GS

(HP Pavilion dv6775us)

Problems:

1.) no error is given when kernel is launched with incorrect parameters:

Code snippet:
CUDA_SAFE_CALL( cudaThreadSynchronize() );

dim3 dimBlock (4096);
dim3 dimGrid (ROWS/TBLOCK); //1 Thread Block per ogni BLOCK_SIZE^2 colonne di A

PRINT_N (dimBlock.x);
PRINT_N (dimGrid.x);

MatTest<<<dimGrid, dimBlock>>>(d_C, d_A, d_B); //Result: :)

CUT_CHECK_ERROR("MatTest() execution failed\n");

CUDA_SAFE_CALL( cudaThreadSynchronize() );

Output (complete):

Initializing data...
...allocating CPU memory.
Matrix is 4096x4096
Vector is 4096x1


Using device 0: GeForce 8400M GS
Exec time only on CPU: 26.816000 (ms)

...allocating GPU memory.
...copying input data to GPU mem.
Data init done.
Executing GPU kernel...


"Using Shared Memory..."


---4096 ---

---16 ---
Reading back GPU result...



Transfer + Exec + Readback time on GPU with CUDA: 35.998001 (ms)

Execution time on GPU with CUDA: 0.076999 (ms)

Transfer to GPU with CUDA: 35.476002 (ms)

Transfer from GPU with CUDA: 0.445000 (ms)
Risultati CPU (C/C++):

C_CPU.x= 2.000000 C_CPU.y= 1.000000 C_CPU.z= 1.000000 C_CPU.w= 1.000000
Risultati GPU (CUDA):

C_GPU.x= 4.000000 C_GPU.y= 2.000000 C_GPU.z= 2.000000 C_GPU.w= 2.000000

Index: 0
a[0]: 2.000000 , b[0]: 4.000000

h_C_CPU != h_C_GPU ... :(.

Shutting down...

(disregard the output for now,set wrong on purpose)

On Windows (CUDA 2.0) this causes an error to be thrown (even in "Release" configuration so we are comparing both on the same footing).


2.) device memory is not correctly initialize/set/freed...

Let's say I run my Matrix * Vector operation using the a working code path in my application (C-preprocessor #if #endif block to specify codepaths) and thus I have the output matrix set (whether with correct results or not)... the application shuts down freeing the device memory too...

CUDA_SAFE_CALL( cudaFree(d_C) );
CUDA_SAFE_CALL( cudaFree(d_B) );
CUDA_SAFE_CALL( cudaFree(d_A) );

free(h_C_GPU);

If I execute the application again without even calling the kernel:

#if SHARED_MEM == 1

printf ("\n\n\"Using Shared Memory...\"\n\n");

#endif

#if SHARED_MEM == 0

printf ("\n\n\"Not using Shared Memory...\"\n\n");

#endif


CUDA_SAFE_CALL( cudaThreadSynchronize() );

dim3 dimBlock (4096);
dim3 dimGrid (ROWS/TBLOCK); //1 Thread Block per ogni BLOCK_SIZE^2 colonne di A

PRINT_N (dimBlock.x);
PRINT_N (dimGrid.x);

//MatTest<<<dimGrid, dimBlock>>>(d_C, d_A, d_B); //Result: :)

CUT_CHECK_ERROR("MatTest() execution failed\n");

CUDA_SAFE_CALL( cudaThreadSynchronize() );

//fromGPU
start_timer(&timer_toRAM);

printf("Reading back GPU result...\n\n");

CUDA_SAFE_CALL( cudaMemcpy(h_C_GPU, d_C, DATA_V, cudaMemcpyDeviceToHost) );

stop_timer(timer_toRAM, &t_toRAM_ms);
//data transfered

stop_timer(timer1, &timer1_ms);//Timer stopped

but still trying to allocate and initialize the data (all of the following is of course run before the code block I just posted a few lines above):

void init_test1_data_CUDA (float** h_C_GPU,
float * &d_A, float * &d_B, float * &d_C)
{

*h_C_GPU = (float *)calloc(N_EL, sizeof(float));

for(int i = 0; i < ROWS; i++){

(*h_C_GPU)[i] = 0.0f;

}


printf("...allocating GPU memory.\n");
CUDA_SAFE_CALL( cudaMalloc((void **)&d_A, DATA_SZ) ); //input matrix
CUDA_SAFE_CALL( cudaMalloc((void **)&d_B, DATA_V) ); //input vector
CUDA_SAFE_CALL( cudaMalloc((void **)&d_C, DATA_V) ); //result vector

CUDA_SAFE_CALL(cudaMemset((void *)d_A, 0, DATA_SZ));
CUDA_SAFE_CALL(cudaMemset((void *)d_B, 0, DATA_V));
CUDA_SAFE_CALL(cudaMemset((void *)d_C, 0, DATA_V));


return;

}

And then I retrieve the output like so:

CUDA_SAFE_CALL( cudaMemcpy(h_C_GPU, d_C, DATA_V, cudaMemcpyDeviceToHost) );

The h_C_GPU matrix contains the same value as with the previous kernel invocation as if VRAM had never been freed, re-allocated, and memset-ed to 0 in the pass in which the application (with the kernel invocation commented out) was run... but NO ERROR is thrown (and CUDA_SAFE_CALL, going by cutil.h, should catch an error by either cudaMalloc or cudaMemset if thrown...)


Edit:

One mistake, re-running this on windows showed this again (I thought I took it out already, but I guess I forgot...)... I'll also re-upload the .zip file...

CUDA_SAFE_CALL(cudaMemset((void **)&d_A, 0, ROWS*COLS));
CUDA_SAFE_CALL(cudaMemset((void **)&d_B, 0, ROWS));
CUDA_SAFE_CALL(cudaMemset((void **)&d_C, 0, ROWS));

-->

CUDA_SAFE_CALL(cudaMemset((void *)d_A, 0, DATA_SZ));
CUDA_SAFE_CALL(cudaMemset((void *)d_B, 0, DATA_V));
CUDA_SAFE_CALL(cudaMemset((void *)d_C, 0, DATA_V));


(I'll boot back into Linux and see if that changes anything... still again the CUDA runtime does not give me any error there...)

Panajev2001a
06-Jan-2009, 14:28
Still on windows, another behaviour of the cuda run-time (on windows I use CUDA 2.0 with the provided CUDA enabled driver) I cannot understand right now:

where DATA_V = N_EL * sizeof(float) for example

N_EL = ROWS

ROWS = TBLOCK * TBLOCK * AMP

TBLOCK = 16

AMP = 16


//MatTest<<<dimGrid, dimBlock>>>(d_C, d_A, d_B); //Result: :)

CUDA_SAFE_CALL(cudaMemset((void *)d_C, 5, DATA_V));

CUT_CHECK_ERROR("MatTest() execution failed\n");

CUDA_SAFE_CALL( cudaThreadSynchronize() );

//fromGPU
start_timer(&timer_toRAM);

printf("Reading back GPU result...\n\n");

CUDA_SAFE_CALL( cudaMemcpy(h_C_GPU, d_C, DATA_V, cudaMemcpyDeviceToHost) );

How come the h_C_GPU array is filled with 0's? The cudaMemset (running on Windows still at the moment) produces no error and should fill the d_C array on the device with 5's, but as you can see the output is all 0's...

Small aside:


h_C_CPU = where the result calculated by the C/C++ code is stored.

h_C_GPU = where the result calculated by the CUDA or CUBLAS code (depending on the codepath) is stored.

matA = source matrix in host memory

vecB = source vector in host memory

d_A = source matrix in device memory (matA --> d_A)

d_B = source vector in device memory

d_C (or d_C1 in the CUBLAS code) = destination vector in device memory.

Initializing data...
...allocating CPU memory.
Matrix is 4096x4096
Vector is 4096x1


Exec time only on CPU: 103.758018 (ms)

Using device 0: GeForce 8400M GS
...allocating GPU memory.
...copying input data to GPU mem.
Data init done.


"Using Shared Memory..."


---256 ---

---16 ---
Reading back GPU result...



Transfer + Exec + Readback time on GPU with CUDA: 38.185299 (ms)

Execution time on GPU with CUDA: 0.375187 (ms)

Transfer to GPU with CUDA: 37.595490 (ms)

Transfer from GPU with CUDA: 0.214622 (ms)
Risultati CPU (C/C++):

C_CPU.x= 2.000000 C_CPU.y= 1.000000 C_CPU.z= 1.000000 C_CPU.w= 1.000000
Risultati GPU (CUDA):

C_GPU.x= 0.000000 C_GPU.y= 0.000000 C_GPU.z= 0.000000 C_GPU.w= 0.000000

Index: 0
a[0]: 2.000000 , b[0]: 0.000000

h_C_CPU != h_C_GPU ... :(.

Shutting down...

Press ENTER to exit...


The output is the way it is because it is treated like this after the data is copied back...

CUDA_SAFE_CALL( cudaMemcpy(h_C_GPU, d_C, DATA_V, cudaMemcpyDeviceToHost) );

stop_timer(timer_toRAM, &t_toRAM_ms);
//data transfered

stop_timer(timer1, &timer1_ms);//Timer stopped

float exec_time_CUDA = (timer1_ms - t_toGDDR_ms - t_toRAM_ms);
////
printf ("\n\nTransfer + Exec + Readback time on GPU with CUDA: %f (ms)\n", timer1_ms);

printf ("\nExecution time on GPU with CUDA: %f (ms)\n", exec_time_CUDA);
printf ("\nTransfer to GPU with CUDA: %f (ms)\n", t_toGDDR_ms);
printf ("\nTransfer from GPU with CUDA: %f (ms)\n", t_toRAM_ms);

//Y + x/100 * Y = Z ... x = ((Z-Y)/Y) * 100

printf ("Risultati CPU (C/C++):\n");
printf ("\nC_CPU.x= %f C_CPU.y= %f C_CPU.z= %f C_CPU.w= %f\n", h_C_CPU[0], h_C_CPU[1],
h_C_CPU[2], h_C_CPU[3]);

printf ("Risultati GPU (CUDA):\n");
printf ("\nC_GPU.x= %f C_GPU.y= %f C_GPU.z= %f C_GPU.w= %f\n", h_C_GPU[0], h_C_GPU[1],
h_C_GPU[2], h_C_GPU[3]);


if (!vectorEQ(h_C_CPU, h_C_GPU, COLS)) printf("\nh_C_CPU != h_C_GPU ... :(.\n");
else printf("\nh_C_CPU == h_C_GPU... :).\n");

printf("\nShutting down...\n");
CUDA_SAFE_CALL( cudaFree(d_C) );
CUDA_SAFE_CALL( cudaFree(d_B) );
CUDA_SAFE_CALL( cudaFree(d_A) );

free(h_C_GPU);


#endif

Panajev2001a
06-Jan-2009, 14:37
Ok,

http://forums.nvidia.com/index.php?showtopic=72209&hl=cudaMemset

got why cudaMemset was not doing what I wanted it to be doing (if that post is accurate only 0 can be set with cudaMemset if you are filling an array of floats...)

That still does not explain why the Linux code seems not to output any error whatsoever...

Tim Murray
06-Jan-2009, 16:18
1. You really shouldn't be using the CUTIL macros; they're just a part of the SDK samples as opposed to CUDA proper and subject to change. They specifically don't do anything in release mode (ever). Just check for an error from cudaThreadSynchronize and never use CUDA_SAFE_CALL/CUT_CHECK_ERROR. (I think those macros went away and were replaced by inline functions in 2.1 anyway, which may be why they stopped working)

2. I don't even know what you're asking?

bowman
06-Jan-2009, 17:25
I have a question not directly related to 2.1, but CUDA 2 related anyway.

Since Cuda 2.0 has been made available in the middle of August this year, the project provides applications that are compiled with the Cuda 2.0 SDK.
Unfortunately Cuda 2.0 only supports Cards that have the compute capability of 1.1 and higher.
Cards that offer only compute capabilities at 1.0 level are lacking a hardware features (support for Atomic functions) that Cuda 2.0 relies on, hence Cuda 2.0 compiled apps won't run.
http://www.gpugrid.net/forum_thread.php?id=316

Why is there no backwards compatibility? Surely it can't be that hard to modify the compiler to make older cards use a slower workaround, or something? Or are 'atomic operations' really so central to these applications that they just can't run? Folding@Home seems to be doing fine with G80 chips.

Tim Murray
06-Jan-2009, 18:08
I have a question not directly related to 2.1, but CUDA 2 related anyway.


http://www.gpugrid.net/forum_thread.php?id=316

Why is there no backwards compatibility? Surely it can't be that hard to modify the compiler to make older cards use a slower workaround, or something? Or are 'atomic operations' really so central to these applications that they just can't run? Folding@Home seems to be doing fine with G80 chips.
I never understood that quote. The way it works is like this:

When you compile a kernel in CUDA, you have (currently) one of four architectural targets--sm_10 through sm_13. They map roughly as follows:

sm_10: G80, baseline functionality
sm_11: G84/86, G9x: global atomic operations, stream support
sm_12: GT200: improved memory coalescing, more registers, shared memory atomics, warp voting intrinsics
sm_13: GT200: sm_12 plus double precision

CUDA 2.0 and later can still compile to sm_10 just fine. sm_10 remains the default if you do not specify an arch target during compilation (which throws a lot of people who try to use doubles without reading any documentation). GPUGrid is dependent on atomic ops so it requires an sm_11 capable card (so not G80), but that has absolutely nothing to do with whatever CUDA 2.0 supports.

Panajev2001a
06-Jan-2009, 18:08
1. You really shouldn't be using the CUTIL macros; they're just a part of the SDK samples as opposed to CUDA proper and subject to change. They specifically don't do anything in release mode (ever). Just check for an error from cudaThreadSynchronize and never use CUDA_SAFE_CALL/CUT_CHECK_ERROR. (I think those macros went away and were replaced by inline functions in 2.1 anyway, which may be why they stopped working)

2. I don't even know what you're asking?


Ok thanks a lot, so never use CUDA_SAFE_CALL and CUT/CUTIL macros again, that explains why I was not catching errors on Linux where I am using the 2.1 Beta SDK...

In release mode, do the inline functions introduced by cutil_inline.h do nothing, but they do still work in debug mode, have I understood it correctly?

The CUTIL (inline or not) macros are very useful though, but for error checking which is critical I better do it the way you suggest.

The second part of my question is basically pointless given your answer to the first question... the program was just giving errors I was not catching and SEEMED to perform correctly.

Basically the situation was this...

I could fill source and result arrays with correct values and it seemed like they would stick around, not being properly freed so that I could perform a readback from device memory and get the right output vector even though I thought I was allocating it and setting it to 0 correctly, but as I said... the application was not working correctly and I wasted your time and B3D's real estate :(...

To make you forgive me humble B3D poster :) I worked a bit more on the CUDA powered version of my M*V code and got it only 2 ms behind the CUBLAS code (I might still approaching it naively so it's not like my job's done there :)):

http://forums.nvidia.com/index.php?s=&showtopic=85895&view=findpost&p=487032

Panajev2001a
07-Jan-2009, 09:23
1. You really shouldn't be using the CUTIL macros; they're just a part of the SDK samples as opposed to CUDA proper and subject to change. They specifically don't do anything in release mode (ever). Just check for an error from cudaThreadSynchronize and never use CUDA_SAFE_CALL/CUT_CHECK_ERROR. (I think those macros went away and were replaced by inline functions in 2.1 anyway, which may be why they stopped working)

2. I don't even know what you're asking?

I have one question Tim, I am looking at the cutil_inline.h inline functions...

(looking at these two for example)

inline void __cudaSafeThreadSync( const char *file, const int line )
{
do {
cudaError err = cudaThreadSynchronize();
if ( cudaSuccess != err) {
fprintf(stderr, "cudaThreadSynchronize() Driver API error in file '%s' in line %i : %s.\n",
file, line, cudaGetErrorString( err) );
exit(-1);
}
} while (0);
}


inline void __cutilCheckMsg( const char *errorMessage, const char *file, const int line )
{
do {
cudaError_t err = cudaGetLastError();
if( cudaSuccess != err) {
fprintf(stderr, "cutilCheckMsg() CUTIL CUDA error: %s in file <%s>, line %i : %s.\n",
errorMessage, file, line, cudaGetErrorString( err) );
exit(-1);
}
err = cudaThreadSynchronize();
if( cudaSuccess != err) {
fprintf(stderr, "cutilCheckMsg cudaThreadSynchronize error: %s in file <%s>, line %i : %s.\n",
errorMessage, file, line, cudaGetErrorString( err) );
exit(-1);
}
} while (0);
}

inline void __cutilSafeMalloc( void *pointer, const char *file, const int line )
{
do {
if( !(pointer)) {
fprintf(stderr, "cutilSafeMalloc host malloc failure in file <%s>, line %i\n",
file, line);
exit(-1);
}
} while (0);
}

...they seem like they should work in debug and release mode, but should I stay away from these anyways?

Tim Murray
07-Jan-2009, 17:02
They're not guaranteed to remain that way between versions (and for all I know there's an #ifdef DEBUG above that), so no, I would not use them.

Panajev2001a
07-Jan-2009, 18:20
They're not guaranteed to remain that way between versions (and for all I know there's an #ifdef DEBUG above that), so no, I would not use them.

Thanks for you answer Tim :).

I checked the cutil_inline.h file and the ifdef you mention is in the file, but covers a different section of the code:

#ifdef _DEBUG
#if __DEVICE_EMULATION__
#define cutilBankChecker(array, idx) (__cutilBankChecker (threadIdx.x, threadIdx.y, threadIdx.z, \
blockDim.x, blockDim.y, blockDim.z, \
#array, idx, __FILE__, __LINE__), \
array[idx])

#else
#define cutilBankChecker(array, idx) array[idx]
#endif
#else
#define cutilBankChecker(array, idx) array[idx]
#endif

It seems to me that the #ifdef _DEBUG only covers this section of code...

Still, I'll use the functions' bodies as inspiration and roll my own version of them paying close attention to developments in cutil in future SDK releases and changes in functions like cudaMalloc, cudaThreadSynchronize, etc... as far as expected outputs are concerned... after all, how many ways are there to check if a kernel invocation failed/launched an error than doing:

cudaError_t err = cudaGetLastError();
if( cudaSuccess != err) {
fprintf(stderr, "cutilCheckMsg() CUTIL CUDA error: %s in file <%s>, line %i : %s.\n",
errorMessage, file, line, cudaGetErrorString( err) );
exit(-1);
}
err = cudaThreadSynchronize();
if( cudaSuccess != err) {
fprintf(stderr, "cutilCheckMsg cudaThreadSynchronize error: %s in file <%s>, line %i : %s.\n",
errorMessage, file, line, cudaGetErrorString( err) );
exit(-1);
}

?

Do you think it's bad?

Tim Murray
07-Jan-2009, 18:30
I do not like cutil and want much finer-grained control over error handling. Maybe I should leave it at that. If it works for you and you want to roll your own thing based on what cutil does, that's probably okay. Just pick a version, stick with it, and distribute it with your app. Every time I get a source dump that requires me to type -I ~/NVIDIA_CUDA_SDK/common/include in order to compile it and then whoops I have some goofy SDK version installed that doesn't have the cutil.h that the app developer used, I kill a kitten Arun.

Panajev2001a
07-Jan-2009, 23:40
I do not like cutil and want much finer-grained control over error handling. Maybe I should leave it at that. If it works for you and you want to roll your own thing based on what cutil does, that's probably okay. Just pick a version, stick with it, and distribute it with your app. Every time I get a source dump that requires me to type -I ~/NVIDIA_CUDA_SDK/common/include in order to compile it and then whoops I have some goofy SDK version installed that doesn't have the cutil.h that the app developer used, I kill a kitten Arun.

Thanks for your advices :).