Panajev2001a
26-Dec-2008, 11:29
In order to do some work for my thesis, I am doing some small tests to learn and understand how to use GPU's for general computation and one of the tests I am doing is a large dense square matrix * vector multiply and comparing it with straight C/C++ code.
A speed-up of only 15-16% considering I am only using one CPU core without special GCC optimizations and custom written SSE/SSE2 code (in the program's output I still mention SSE2 as I had enabled Visual Studio's SSE2 code generation optimizations, but I am not doing it in GCC and still I do not think such a flag magically buys a lot of performance) and CUBLAS on the other side (so extremely optimized functions... and I am using pinned memory for faster PCI-Express transfers).
My CPU is a common C2D 1.67 GHz and the GPU is a GeForce 8400M GS with 256 MB of memory (400 MHz clock, 800 MHz shader clock, 2 Streaming Multiprocessors --> 16 Scalar Processors).
I am trying to run a 5K x 5K * 5K matrix-vector multiply both with CUBLAS and C/C++ code and for each I can loop the computation with a defined macro called STEPS (STEPS == 1 --> no loop basically).
STEPS == 50 (Linux, nVIDIA 180.06 driver beta, CUDA 2.1 Beta)
Initializing data...
...allocating CPU memory.
Matrix is 5376x5376
Vector is 5376x1
Exec time only on CPU: 2420.373047 (ms)
simpleCUBLAS test running..
Transfer + Exec + Readback time on GPU with CUBLAS: 2097.648926 (ms)
Execution time on GPU with CUBLAS: 2032.922924 (ms)
Transfer to GPU with CUBLAS: 63.268002 (ms)
Transfer from GPU with CUBLAS: 1.458000 (ms)
GPU CUBLAS code is 1.153850x faster than the CPU code (VS SSE2)...
STEPS == 1 (same)
Initializing data...
...allocating CPU memory.
Matrix is 5376x5376
Vector is 5376x1
Exec time only on CPU: 47.665001 (ms)
simpleCUBLAS test running..
Transfer + Exec + Readback time on GPU with CUBLAS: 99.781998 (ms)
Execution time on GPU with CUBLAS: 40.709996 (ms)
Transfer to GPU with CUBLAS: 58.983002 (ms)
Transfer from GPU with CUBLAS: 0.089000 (ms)
GPU CUBLAS code is 0.477691x faster than the CPU code (VS SSE2)...
As far as source code is concerned, this is the code that calls the CUBLAS routine:
unsigned int timer_toGDDR = 0;
float t_toGDDR_ms = 0.0f;
unsigned int timer_toRAM = 0;
float t_toRAM_ms = 0.0f;
init_test1_data_CUBLAS (&h_C_CUBLAS);
init_test1_CUBLAS_alloc (d_A, d_B, d_C1, status);
CUDA_SAFE_CALL( cudaThreadSynchronize() );
unsigned int timer2 = 0;
float timer2_ms = 0.0f;
start_timer(&timer2);
//toGPU
start_timer(&timer_toGDDR);
init_test1_CUBLAS_transfer_to_GPU (matA, vecB, d_A, d_B, argc, argv, status);
stop_timer(timer_toGDDR, &t_toGDDR_ms);
//data transfered
////exec
CUDA_SAFE_CALL( cudaThreadSynchronize() );
for (int n = 0; n < STEPS; n++) {
//init_test1_CUBLAS_transferVec_to_GPU (vecB, d_B,argc, argv, status);
cublasSgemv('n', ROWS, COLS, 1, d_A, ROWS, d_B, 1, 0, d_C1, 1);
}
CUDA_SAFE_CALL( cudaThreadSynchronize() );
//fromGPU
start_timer(&timer_toRAM);
init_test1_CUBLAS_readback (h_C_CUBLAS, d_C1, argc, argv, status);
stop_timer(timer_toRAM, &t_toRAM_ms);
//data transfered
stop_timer(timer2, &timer2_ms);//Timer stopped
////
printf ("\n\nTransfer + Exec + Readback time on GPU with CUBLAS: %f (ms)\n", timer2_ms);
printf ("\nExecution time on GPU with CUBLAS: %f (ms)\n", (timer2_ms - t_toGDDR_ms - t_toRAM_ms));
printf ("\nTransfer to GPU with CUBLAS: %f (ms)\n", t_toGDDR_ms);
printf ("\nTransfer from GPU with CUBLAS: %f (ms)\n", t_toRAM_ms);
printf ("\nGPU CUBLAS code is %fx faster than the CPU code (VS SSE2)...\n\n",
(timer / timer2_ms));
This is the C/C++ code that does the same operation (matrix*vector):
void mat_vec ( float *a, const int R, const int C, float *b, const int SIZE, float *c) {
//A is a column major ordered matrix
float temp = 0;
if ( null == a || null == b || null == c ) return;
for (int j = 0; j < C; j++) {
//for (int i = 0; i < R; i++) {
temp = b[j];
for (int i = 0; i < R; i++) {
//for (int j = 0; j < C; j++) {
if ( j == 0) c[i] = 0;
c[i] += MATC(a,i,j,R) * temp;
//c[i] += MATC(a,i,j,R) * b[i];
}
}
No special GCC optimizations, nothing special in the Makefile (standard Makefile used in other CUDA SDK projects).
http://forums.nvidia.com/index.php?showtopic=84441
(in that thread I post the source code of this project of mine... including all the helper functions such as the ones used for the timers... which basically are wrappers...
void start_timer (unsigned int* t) {
//CUT_SAFE_CALL(cutCreateTimer(&timer_t1));
CUT_SAFE_CALL(cutCreateTimer(t));
CUT_SAFE_CALL(cutStartTimer((*t)));
return;
}
void stop_timer (unsigned int t, float * t_ms) {
CUT_SAFE_CALL(cutStopTimer(t));
(*t_ms) = cutGetTimerValue(t);
CUT_SAFE_CALL(cutDeleteTimer(t));
return;
}
The CPU executed code is only about 15-16% slower than the CUBLAS code (counting execution time alone... please do tell me if I am timing things the wrong way... :()... it kinda seems weird to me... I expected much more...
Counting one MADD unit per SP we should get a peak of 12.8-25.6 GFLOPS (depending if we do a single op per cycle or two as we can with a MADD operation) for the whole chip.
The C2D at 1.67 GHz should get at most 6.68-13.6 GFLOPS per core... 4-8 ops per cycle per core (depending on the kind of SSE operations you execute).
That is using SSE instructions however... I am not...
This is the disassembled portion of the C/C++ code that performs the matrix*vector operation.
00000000 <_Z7mat_vecPfiiS_iS_>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 57 push %edi
4: 56 push %esi
5: 53 push %ebx
6: 83 ec 04 sub $0x4,%esp
9: 8b 45 08 mov 0x8(%ebp),%eax
c: 8b 75 0c mov 0xc(%ebp),%esi
f: 8b 4d 1c mov 0x1c(%ebp),%ecx
12: 85 c0 test %eax,%eax
14: 74 65 je 7b <_Z7mat_vecPfiiS_iS_+0x7b>
16: 8b 5d 14 mov 0x14(%ebp),%ebx
19: 85 db test %ebx,%ebx
1b: 74 5e je 7b <_Z7mat_vecPfiiS_iS_+0x7b>
1d: 85 c9 test %ecx,%ecx
1f: 74 5a je 7b <_Z7mat_vecPfiiS_iS_+0x7b>
21: 8b 55 10 mov 0x10(%ebp),%edx
24: 85 d2 test %edx,%edx
26: 7e 53 jle 7b <_Z7mat_vecPfiiS_iS_+0x7b>
28: 8d 14 b5 00 00 00 00 lea 0x0(,%esi,4),%edx
2f: 89 c7 mov %eax,%edi
31: 89 55 f0 mov %edx,-0x10(%ebp)
34: 31 db xor %ebx,%ebx
36: 66 90 xchg %ax,%ax
38: 8b 45 14 mov 0x14(%ebp),%eax
3b: 85 f6 test %esi,%esi
3d: d9 04 98 flds (%eax,%ebx,4)
40: 7e 26 jle 68 <_Z7mat_vecPfiiS_iS_+0x68>
42: 31 c0 xor %eax,%eax
44: 85 db test %ebx,%ebx
46: 89 fa mov %edi,%edx
48: 74 3e je 88 <_Z7mat_vecPfiiS_iS_+0x88>
4a: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
50: d9 02 flds (%edx)
52: 83 c2 04 add $0x4,%edx
55: d8 c9 fmul %st(1),%st
57: d8 04 81 fadds (%ecx,%eax,4)
5a: d9 1c 81 fstps (%ecx,%eax,4)
5d: 83 c0 01 add $0x1,%eax
60: 39 c6 cmp %eax,%esi
62: 7f ec jg 50 <_Z7mat_vecPfiiS_iS_+0x50>
64: dd d8 fstp %st(0)
66: eb 08 jmp 70 <_Z7mat_vecPfiiS_iS_+0x70>
68: dd d8 fstp %st(0)
6a: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
70: 83 c3 01 add $0x1,%ebx
73: 03 7d f0 add -0x10(%ebp),%edi
76: 39 5d 10 cmp %ebx,0x10(%ebp)
79: 7f bd jg 38 <_Z7mat_vecPfiiS_iS_+0x38>
7b: 83 c4 04 add $0x4,%esp
7e: 5b pop %ebx
7f: 5e pop %esi
80: 5f pop %edi
81: 5d pop %ebp
82: c3 ret
83: 90 nop
84: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi
88: d9 ee fldz
8a: d9 14 81 fsts (%ecx,%eax,4)
8d: d9 02 flds (%edx)
8f: 83 c2 04 add $0x4,%edx
92: d8 ca fmul %st(2),%st
94: de c1 faddp %st,%st(1)
96: d9 1c 81 fstps (%ecx,%eax,4)
99: 83 c0 01 add $0x1,%eax
9c: 39 c6 cmp %eax,%esi
9e: 7f e8 jg 88 <_Z7mat_vecPfiiS_iS_+0x88>
a0: dd d8 fstp %st(0)
a2: eb cc jmp 70 <_Z7mat_vecPfiiS_iS_+0x70>
a4: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
aa: 8d bf 00 00 00 00 lea 0x0(%edi),%edi
No xmm registers used... no SSE ops... so straight x87 FPU code...
How can the CUBLAS function with so much more theoretical FLOPS available and a normal Matrix*Vector operation perform only 16% better than that C/C++ code?
I am puzzled... :(.
A speed-up of only 15-16% considering I am only using one CPU core without special GCC optimizations and custom written SSE/SSE2 code (in the program's output I still mention SSE2 as I had enabled Visual Studio's SSE2 code generation optimizations, but I am not doing it in GCC and still I do not think such a flag magically buys a lot of performance) and CUBLAS on the other side (so extremely optimized functions... and I am using pinned memory for faster PCI-Express transfers).
My CPU is a common C2D 1.67 GHz and the GPU is a GeForce 8400M GS with 256 MB of memory (400 MHz clock, 800 MHz shader clock, 2 Streaming Multiprocessors --> 16 Scalar Processors).
I am trying to run a 5K x 5K * 5K matrix-vector multiply both with CUBLAS and C/C++ code and for each I can loop the computation with a defined macro called STEPS (STEPS == 1 --> no loop basically).
STEPS == 50 (Linux, nVIDIA 180.06 driver beta, CUDA 2.1 Beta)
Initializing data...
...allocating CPU memory.
Matrix is 5376x5376
Vector is 5376x1
Exec time only on CPU: 2420.373047 (ms)
simpleCUBLAS test running..
Transfer + Exec + Readback time on GPU with CUBLAS: 2097.648926 (ms)
Execution time on GPU with CUBLAS: 2032.922924 (ms)
Transfer to GPU with CUBLAS: 63.268002 (ms)
Transfer from GPU with CUBLAS: 1.458000 (ms)
GPU CUBLAS code is 1.153850x faster than the CPU code (VS SSE2)...
STEPS == 1 (same)
Initializing data...
...allocating CPU memory.
Matrix is 5376x5376
Vector is 5376x1
Exec time only on CPU: 47.665001 (ms)
simpleCUBLAS test running..
Transfer + Exec + Readback time on GPU with CUBLAS: 99.781998 (ms)
Execution time on GPU with CUBLAS: 40.709996 (ms)
Transfer to GPU with CUBLAS: 58.983002 (ms)
Transfer from GPU with CUBLAS: 0.089000 (ms)
GPU CUBLAS code is 0.477691x faster than the CPU code (VS SSE2)...
As far as source code is concerned, this is the code that calls the CUBLAS routine:
unsigned int timer_toGDDR = 0;
float t_toGDDR_ms = 0.0f;
unsigned int timer_toRAM = 0;
float t_toRAM_ms = 0.0f;
init_test1_data_CUBLAS (&h_C_CUBLAS);
init_test1_CUBLAS_alloc (d_A, d_B, d_C1, status);
CUDA_SAFE_CALL( cudaThreadSynchronize() );
unsigned int timer2 = 0;
float timer2_ms = 0.0f;
start_timer(&timer2);
//toGPU
start_timer(&timer_toGDDR);
init_test1_CUBLAS_transfer_to_GPU (matA, vecB, d_A, d_B, argc, argv, status);
stop_timer(timer_toGDDR, &t_toGDDR_ms);
//data transfered
////exec
CUDA_SAFE_CALL( cudaThreadSynchronize() );
for (int n = 0; n < STEPS; n++) {
//init_test1_CUBLAS_transferVec_to_GPU (vecB, d_B,argc, argv, status);
cublasSgemv('n', ROWS, COLS, 1, d_A, ROWS, d_B, 1, 0, d_C1, 1);
}
CUDA_SAFE_CALL( cudaThreadSynchronize() );
//fromGPU
start_timer(&timer_toRAM);
init_test1_CUBLAS_readback (h_C_CUBLAS, d_C1, argc, argv, status);
stop_timer(timer_toRAM, &t_toRAM_ms);
//data transfered
stop_timer(timer2, &timer2_ms);//Timer stopped
////
printf ("\n\nTransfer + Exec + Readback time on GPU with CUBLAS: %f (ms)\n", timer2_ms);
printf ("\nExecution time on GPU with CUBLAS: %f (ms)\n", (timer2_ms - t_toGDDR_ms - t_toRAM_ms));
printf ("\nTransfer to GPU with CUBLAS: %f (ms)\n", t_toGDDR_ms);
printf ("\nTransfer from GPU with CUBLAS: %f (ms)\n", t_toRAM_ms);
printf ("\nGPU CUBLAS code is %fx faster than the CPU code (VS SSE2)...\n\n",
(timer / timer2_ms));
This is the C/C++ code that does the same operation (matrix*vector):
void mat_vec ( float *a, const int R, const int C, float *b, const int SIZE, float *c) {
//A is a column major ordered matrix
float temp = 0;
if ( null == a || null == b || null == c ) return;
for (int j = 0; j < C; j++) {
//for (int i = 0; i < R; i++) {
temp = b[j];
for (int i = 0; i < R; i++) {
//for (int j = 0; j < C; j++) {
if ( j == 0) c[i] = 0;
c[i] += MATC(a,i,j,R) * temp;
//c[i] += MATC(a,i,j,R) * b[i];
}
}
No special GCC optimizations, nothing special in the Makefile (standard Makefile used in other CUDA SDK projects).
http://forums.nvidia.com/index.php?showtopic=84441
(in that thread I post the source code of this project of mine... including all the helper functions such as the ones used for the timers... which basically are wrappers...
void start_timer (unsigned int* t) {
//CUT_SAFE_CALL(cutCreateTimer(&timer_t1));
CUT_SAFE_CALL(cutCreateTimer(t));
CUT_SAFE_CALL(cutStartTimer((*t)));
return;
}
void stop_timer (unsigned int t, float * t_ms) {
CUT_SAFE_CALL(cutStopTimer(t));
(*t_ms) = cutGetTimerValue(t);
CUT_SAFE_CALL(cutDeleteTimer(t));
return;
}
The CPU executed code is only about 15-16% slower than the CUBLAS code (counting execution time alone... please do tell me if I am timing things the wrong way... :()... it kinda seems weird to me... I expected much more...
Counting one MADD unit per SP we should get a peak of 12.8-25.6 GFLOPS (depending if we do a single op per cycle or two as we can with a MADD operation) for the whole chip.
The C2D at 1.67 GHz should get at most 6.68-13.6 GFLOPS per core... 4-8 ops per cycle per core (depending on the kind of SSE operations you execute).
That is using SSE instructions however... I am not...
This is the disassembled portion of the C/C++ code that performs the matrix*vector operation.
00000000 <_Z7mat_vecPfiiS_iS_>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 57 push %edi
4: 56 push %esi
5: 53 push %ebx
6: 83 ec 04 sub $0x4,%esp
9: 8b 45 08 mov 0x8(%ebp),%eax
c: 8b 75 0c mov 0xc(%ebp),%esi
f: 8b 4d 1c mov 0x1c(%ebp),%ecx
12: 85 c0 test %eax,%eax
14: 74 65 je 7b <_Z7mat_vecPfiiS_iS_+0x7b>
16: 8b 5d 14 mov 0x14(%ebp),%ebx
19: 85 db test %ebx,%ebx
1b: 74 5e je 7b <_Z7mat_vecPfiiS_iS_+0x7b>
1d: 85 c9 test %ecx,%ecx
1f: 74 5a je 7b <_Z7mat_vecPfiiS_iS_+0x7b>
21: 8b 55 10 mov 0x10(%ebp),%edx
24: 85 d2 test %edx,%edx
26: 7e 53 jle 7b <_Z7mat_vecPfiiS_iS_+0x7b>
28: 8d 14 b5 00 00 00 00 lea 0x0(,%esi,4),%edx
2f: 89 c7 mov %eax,%edi
31: 89 55 f0 mov %edx,-0x10(%ebp)
34: 31 db xor %ebx,%ebx
36: 66 90 xchg %ax,%ax
38: 8b 45 14 mov 0x14(%ebp),%eax
3b: 85 f6 test %esi,%esi
3d: d9 04 98 flds (%eax,%ebx,4)
40: 7e 26 jle 68 <_Z7mat_vecPfiiS_iS_+0x68>
42: 31 c0 xor %eax,%eax
44: 85 db test %ebx,%ebx
46: 89 fa mov %edi,%edx
48: 74 3e je 88 <_Z7mat_vecPfiiS_iS_+0x88>
4a: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
50: d9 02 flds (%edx)
52: 83 c2 04 add $0x4,%edx
55: d8 c9 fmul %st(1),%st
57: d8 04 81 fadds (%ecx,%eax,4)
5a: d9 1c 81 fstps (%ecx,%eax,4)
5d: 83 c0 01 add $0x1,%eax
60: 39 c6 cmp %eax,%esi
62: 7f ec jg 50 <_Z7mat_vecPfiiS_iS_+0x50>
64: dd d8 fstp %st(0)
66: eb 08 jmp 70 <_Z7mat_vecPfiiS_iS_+0x70>
68: dd d8 fstp %st(0)
6a: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
70: 83 c3 01 add $0x1,%ebx
73: 03 7d f0 add -0x10(%ebp),%edi
76: 39 5d 10 cmp %ebx,0x10(%ebp)
79: 7f bd jg 38 <_Z7mat_vecPfiiS_iS_+0x38>
7b: 83 c4 04 add $0x4,%esp
7e: 5b pop %ebx
7f: 5e pop %esi
80: 5f pop %edi
81: 5d pop %ebp
82: c3 ret
83: 90 nop
84: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi
88: d9 ee fldz
8a: d9 14 81 fsts (%ecx,%eax,4)
8d: d9 02 flds (%edx)
8f: 83 c2 04 add $0x4,%edx
92: d8 ca fmul %st(2),%st
94: de c1 faddp %st,%st(1)
96: d9 1c 81 fstps (%ecx,%eax,4)
99: 83 c0 01 add $0x1,%eax
9c: 39 c6 cmp %eax,%esi
9e: 7f e8 jg 88 <_Z7mat_vecPfiiS_iS_+0x88>
a0: dd d8 fstp %st(0)
a2: eb cc jmp 70 <_Z7mat_vecPfiiS_iS_+0x70>
a4: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
aa: 8d bf 00 00 00 00 lea 0x0(%edi),%edi
No xmm registers used... no SSE ops... so straight x87 FPU code...
How can the CUBLAS function with so much more theoretical FLOPS available and a normal Matrix*Vector operation perform only 16% better than that C/C++ code?
I am puzzled... :(.