CUBLAS vs C/C++ unoptimized code (performance and doubts :()

Discussion in 'GPGPU Technology & Programming' started by Panajev2001a, Dec 26, 2008.

  1. Panajev2001a

    Veteran

    For my thesis I am running some small tests to learn and understand how to use GPUs for general computation, and one of them is a large dense square matrix * vector multiply, which I am comparing with straight C/C++ code.

    The speed-up is only 15-16%, even though on one side I am using just one CPU core, no special GCC optimizations and no custom-written SSE/SSE2 code (the program's output still mentions SSE2 because I had enabled Visual Studio's SSE2 code generation at one point, but I am not doing that with GCC, and I do not think such a flag magically buys a lot of performance anyway), and on the other side CUBLAS (so extremely optimized functions... and I am using pinned memory for faster PCI-Express transfers).

    My CPU is a common C2D 1.67 GHz and the GPU is a GeForce 8400M GS with 256 MB of memory (400 MHz clock, 800 MHz shader clock, 2 Streaming Multiprocessors --> 16 Scalar Processors).

    I am running a roughly 5K x 5K matrix times 5K-element vector multiply, both with CUBLAS and with C/C++ code, and for each I can repeat the computation via a macro called STEPS (STEPS == 1 --> no loop, basically).


    STEPS == 50 (Linux, nVIDIA 180.06 driver beta, CUDA 2.1 Beta)
    Code:
    Initializing data...
    ...allocating CPU memory.
    Matrix is 5376x5376
    Vector is 5376x1
    
    
    Exec time only on CPU: 2420.373047 (ms)
    
    simpleCUBLAS test running..
    
    
    Transfer + Exec + Readback time on GPU with CUBLAS: 2097.648926 (ms)
    
    Execution time on GPU with CUBLAS: 2032.922924 (ms)
    
    Transfer to GPU with CUBLAS: 63.268002 (ms)
    
    Transfer from GPU with CUBLAS: 1.458000 (ms)
    
    GPU CUBLAS code is 1.153850x faster than the CPU code (VS SSE2)...
    STEPS == 1 (same)

    Code:
    Initializing data...
    ...allocating CPU memory.
    Matrix is 5376x5376
    Vector is 5376x1
    
    
    Exec time only on CPU: 47.665001 (ms)
    
    simpleCUBLAS test running..
    
    
    Transfer + Exec + Readback time on GPU with CUBLAS: 99.781998 (ms)
    
    Execution time on GPU with CUBLAS: 40.709996 (ms)
    
    Transfer to GPU with CUBLAS: 58.983002 (ms)
    
    Transfer from GPU with CUBLAS: 0.089000 (ms)
    
    GPU CUBLAS code is 0.477691x faster than the CPU code (VS SSE2)...

    As far as source code is concerned, this is the code that calls the CUBLAS routine:

    Code:
    	unsigned int timer_toGDDR = 0;
    	float t_toGDDR_ms = 0.0f;
    
    	unsigned int timer_toRAM = 0;
    	float t_toRAM_ms = 0.0f;
    
    	init_test1_data_CUBLAS (&h_C_CUBLAS);
    	init_test1_CUBLAS_alloc (d_A, d_B, d_C1, status);
    
    	CUDA_SAFE_CALL( cudaThreadSynchronize() );
    
    	unsigned int timer2 = 0;
    	float timer2_ms = 0.0f;
    
    	start_timer(&timer2);
    
    	//toGPU
    
    
    	start_timer(&timer_toGDDR);
    
    	init_test1_CUBLAS_transfer_to_GPU (matA, vecB, d_A, d_B, argc, argv, status);
    
    	stop_timer(timer_toGDDR, &t_toGDDR_ms);
    	//data transfered
    
    	////exec
    
    	CUDA_SAFE_CALL( cudaThreadSynchronize() );
    
    	for (int n = 0; n < STEPS; n++) {
    
    		//init_test1_CUBLAS_transferVec_to_GPU (vecB, d_B,argc, argv, status);
    
    		cublasSgemv('n', ROWS, COLS, 1, d_A, ROWS, d_B, 1, 0, d_C1, 1);
    
    	}
    
    	CUDA_SAFE_CALL( cudaThreadSynchronize() );
    
    	//fromGPU
    	start_timer(&timer_toRAM);
    	init_test1_CUBLAS_readback (h_C_CUBLAS, d_C1, argc, argv, status);
    	stop_timer(timer_toRAM, &t_toRAM_ms);
    	//data transfered
    
    	stop_timer(timer2, &timer2_ms);//Timer stopped
    	////
    	printf ("\n\nTransfer + Exec + Readback time on GPU with CUBLAS: %f (ms)\n", timer2_ms);
    
    	printf ("\nExecution time on GPU with CUBLAS: %f (ms)\n", (timer2_ms - t_toGDDR_ms - t_toRAM_ms));
    	printf ("\nTransfer to GPU with CUBLAS: %f (ms)\n", t_toGDDR_ms);
    	printf ("\nTransfer from GPU with CUBLAS: %f (ms)\n", t_toRAM_ms);
    
    
    	printf ("\nGPU CUBLAS code is %fx faster than the CPU code (VS SSE2)...\n\n",
    		(timer / timer2_ms));
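    (For the final printf, timer appears to hold the CPU-only elapsed time in milliseconds measured earlier, which is not shown in this excerpt.)

    As an alternative way of timing just the GPU work, CUDA events could be used instead of the CUT timers; a minimal sketch, reusing the same d_A, d_B, d_C1 and ROWS/COLS/STEPS as above:

    Code:
    	// Time only the STEPS calls to cublasSgemv with CUDA events.
    	cudaEvent_t evStart, evStop;
    	float exec_ms = 0.0f;
    
    	cudaEventCreate(&evStart);
    	cudaEventCreate(&evStop);
    
    	cudaEventRecord(evStart, 0);
    	for (int n = 0; n < STEPS; n++)
    		cublasSgemv('n', ROWS, COLS, 1.0f, d_A, ROWS, d_B, 1, 0.0f, d_C1, 1);
    	cudaEventRecord(evStop, 0);
    	cudaEventSynchronize(evStop);
    
    	cudaEventElapsedTime(&exec_ms, evStart, evStop);
    
    	cudaEventDestroy(evStart);
    	cudaEventDestroy(evStop);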

    This is the C/C++ code that does the same operation (matrix*vector):

    Code:
    void mat_vec ( float *a, const int R, const int C, float *b, const int SIZE, float *c) {
    
    	//A is a column major ordered matrix (MATC does the column-major indexing)
    
    	float temp = 0;
    
    	if ( NULL == a || NULL == b || NULL == c ) return;
    
    	for (int j = 0; j < C; j++) {
    
    		temp = b[j];
    
    		for (int i = 0; i < R; i++) {
    
    			if ( j == 0) c[i] = 0;
    
    			c[i] += MATC(a,i,j,R) * temp;
    
    		}
    
    	}
    }
    No special GCC optimizations, nothing special in the Makefile (standard Makefile used in other CUDA SDK projects).
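    MATC itself is not shown in this excerpt; presumably it is a column-major indexing helper along these lines (an assumption on my part, the real macro is in the full source linked below):

    Code:
    	// Assumed definition (hypothetical): element (i,j) of a column-major
    	// R x C matrix stored with leading dimension R.
    	#define MATC(a,i,j,R) ((a)[(j) * (R) + (i)])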

    http://forums.nvidia.com/index.php?showtopic=84441

    In that thread I posted the full source code of this project, including all the helper functions, such as the timer routines, which are basically thin wrappers:

    Code:
    void start_timer (unsigned int* t) {
    
        //CUT_SAFE_CALL(cutCreateTimer(&timer_t1));
    
        CUT_SAFE_CALL(cutCreateTimer(t));
        CUT_SAFE_CALL(cutStartTimer((*t)));
    
        return;
    }
    void stop_timer (unsigned int t, float * t_ms) {
    
        CUT_SAFE_CALL(cutStopTimer(t));
        (*t_ms) = cutGetTimerValue(t);
        CUT_SAFE_CALL(cutDeleteTimer(t));
    
        return;
    }
    The CPU-executed code is only about 15-16% slower than the CUBLAS code (counting execution time alone... please do tell me if I am timing things the wrong way :()... it seems weird to me... I expected a much bigger gap...

    Counting one MADD unit per SP, we should get a peak of 12.8-25.6 GFLOPS for the whole chip (depending on whether we count one op per cycle or two, as a MADD allows).

    The C2D at 1.67 GHz should get at most about 6.7-13.4 GFLOPS per core... 4-8 ops per cycle per core (depending on the kind of SSE operations you execute).

    That is using SSE instructions however... I am not...
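    For reference, the peak arithmetic above boils down to clock x units x ops-per-cycle; a tiny stand-alone sketch (the per-cycle op counts are the assumptions stated in this post, not measured figures):

    Code:
    	#include <stdio.h>
    
    	int main(void) {
    		const float gpu_sps = 16.0f, gpu_shader_ghz = 0.8f;
    		const float cpu_ghz = 1.67f;
    
    		printf("GPU peak: %.1f-%.1f GFLOPS\n",
    			gpu_sps * gpu_shader_ghz * 1.0f,   /* 1 op per SP per cycle  */
    			gpu_sps * gpu_shader_ghz * 2.0f);  /* MADD = 2 ops per cycle */
    		printf("CPU peak (1 core): %.1f-%.1f GFLOPS\n",
    			cpu_ghz * 4.0f, cpu_ghz * 8.0f);   /* SSE: 4-8 ops per cycle */
    
    		return 0;
    	}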

    This is the disassembled portion of the C/C++ code that performs the matrix*vector operation.

    Code:
    00000000 <_Z7mat_vecPfiiS_iS_>:
       0:	55                   	push   %ebp
       1:	89 e5                	mov    %esp,%ebp
       3:	57                   	push   %edi
       4:	56                   	push   %esi
       5:	53                   	push   %ebx
       6:	83 ec 04             	sub    $0x4,%esp
       9:	8b 45 08             	mov    0x8(%ebp),%eax
       c:	8b 75 0c             	mov    0xc(%ebp),%esi
       f:	8b 4d 1c             	mov    0x1c(%ebp),%ecx
      12:	85 c0                	test   %eax,%eax
      14:	74 65                	je     7b <_Z7mat_vecPfiiS_iS_+0x7b>
      16:	8b 5d 14             	mov    0x14(%ebp),%ebx
      19:	85 db                	test   %ebx,%ebx
      1b:	74 5e                	je     7b <_Z7mat_vecPfiiS_iS_+0x7b>
      1d:	85 c9                	test   %ecx,%ecx
      1f:	74 5a                	je     7b <_Z7mat_vecPfiiS_iS_+0x7b>
      21:	8b 55 10             	mov    0x10(%ebp),%edx
      24:	85 d2                	test   %edx,%edx
      26:	7e 53                	jle    7b <_Z7mat_vecPfiiS_iS_+0x7b>
      28:	8d 14 b5 00 00 00 00 	lea    0x0(,%esi,4),%edx
      2f:	89 c7                	mov    %eax,%edi
      31:	89 55 f0             	mov    %edx,-0x10(%ebp)
      34:	31 db                	xor    %ebx,%ebx
      36:	66 90                	xchg   %ax,%ax
      38:	8b 45 14             	mov    0x14(%ebp),%eax
      3b:	85 f6                	test   %esi,%esi
      3d:	d9 04 98             	flds   (%eax,%ebx,4)
      40:	7e 26                	jle    68 <_Z7mat_vecPfiiS_iS_+0x68>
      42:	31 c0                	xor    %eax,%eax
      44:	85 db                	test   %ebx,%ebx
      46:	89 fa                	mov    %edi,%edx
      48:	74 3e                	je     88 <_Z7mat_vecPfiiS_iS_+0x88>
      4a:	8d b6 00 00 00 00    	lea    0x0(%esi),%esi
      50:	d9 02                	flds   (%edx)
      52:	83 c2 04             	add    $0x4,%edx
      55:	d8 c9                	fmul   %st(1),%st
      57:	d8 04 81             	fadds  (%ecx,%eax,4)
      5a:	d9 1c 81             	fstps  (%ecx,%eax,4)
      5d:	83 c0 01             	add    $0x1,%eax
      60:	39 c6                	cmp    %eax,%esi
      62:	7f ec                	jg     50 <_Z7mat_vecPfiiS_iS_+0x50>
      64:	dd d8                	fstp   %st(0)
      66:	eb 08                	jmp    70 <_Z7mat_vecPfiiS_iS_+0x70>
      68:	dd d8                	fstp   %st(0)
      6a:	8d b6 00 00 00 00    	lea    0x0(%esi),%esi
      70:	83 c3 01             	add    $0x1,%ebx
      73:	03 7d f0             	add    -0x10(%ebp),%edi
      76:	39 5d 10             	cmp    %ebx,0x10(%ebp)
      79:	7f bd                	jg     38 <_Z7mat_vecPfiiS_iS_+0x38>
      7b:	83 c4 04             	add    $0x4,%esp
      7e:	5b                   	pop    %ebx
      7f:	5e                   	pop    %esi
      80:	5f                   	pop    %edi
      81:	5d                   	pop    %ebp
      82:	c3                   	ret    
      83:	90                   	nop    
      84:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
      88:	d9 ee                	fldz   
      8a:	d9 14 81             	fsts   (%ecx,%eax,4)
      8d:	d9 02                	flds   (%edx)
      8f:	83 c2 04             	add    $0x4,%edx
      92:	d8 ca                	fmul   %st(2),%st
      94:	de c1                	faddp  %st,%st(1)
      96:	d9 1c 81             	fstps  (%ecx,%eax,4)
      99:	83 c0 01             	add    $0x1,%eax
      9c:	39 c6                	cmp    %eax,%esi
      9e:	7f e8                	jg     88 <_Z7mat_vecPfiiS_iS_+0x88>
      a0:	dd d8                	fstp   %st(0)
      a2:	eb cc                	jmp    70 <_Z7mat_vecPfiiS_iS_+0x70>
      a4:	8d b6 00 00 00 00    	lea    0x0(%esi),%esi
      aa:	8d bf 00 00 00 00    	lea    0x0(%edi),%edi
    No xmm registers used... no SSE ops... so straight x87 FPU code...

    How can the CUBLAS routine, with so many more theoretical FLOPS available, perform a plain matrix*vector operation only about 16% faster than that C/C++ code?

    I am puzzled... :(.
     
  2. silent_guy

    Veteran Subscriber

    Matrix multiplication tests usually stress the memory system, not the ALU's.

    Assuming CUBLAS uses a similarly naive algorithm, it would need to move 5000 * 5000 * 2 * 4 bytes = 200 MB of data.

    If your GPU has memory running at 600 MHz DDR on a 64-bit bus, then your theoretical max bandwidth is 9600 MB/s, so just fetching the data will cost you 20.8 ms. (More if you take DRAM inefficiencies into account.)

    If a G80-class GPU can't co-issue an ALU operation and an external memory read in the same instruction (is this the case?), and your total calculation time is 40 ms per matrix, then your ALUs are basically sitting idle 50% of the time, waiting for data to arrive.

    Now maybe CUBLAS is smarter than this and preloads chunks of data into shared memory so they can be reused multiple times, in which case my numbers would be too pessimistic, but still, memory bandwidth will be a major factor.

    To reach theoretical FLOPS numbers on a GPU, you need a problem that has many calculations per byte of data that's fetched from external memory. A MADD is not enough.
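    As a back-of-the-envelope check of the reasoning above (a sketch only; the 2x data factor and the 9600 MB/s figure are this post's assumptions, and the flop/byte numbers use the usual GEMV/GEMM operation counts):

    Code:
    	#include <stdio.h>
    
    	int main(void) {
    		const double n     = 5000.0;
    		const double bytes = n * n * 2.0 * 4.0;   /* ~200 MB, as assumed above */
    		const double bw    = 9.6e9;               /* 9600 MB/s                 */
    
    		/* Time just to stream the data through the memory bus. */
    		printf("fetch time: %.1f ms\n", 1e3 * bytes / bw);
    
    		/* Arithmetic intensity: GEMV gets no reuse of the matrix, while */
    		/* GEMM (with ideal reuse) does O(n) work per byte touched.      */
    		printf("SGEMV: %.2f flop/byte, SGEMM (ideal): %.0f flop/byte\n",
    			(2.0 * n * n) / (4.0 * n * n),
    			(2.0 * n * n * n) / (3.0 * 4.0 * n * n));
    
    		return 0;
    	}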
     
  3. Panajev2001a

    Veteran

    Ok, that would explain it... it is still weird that the CPU is able to keep up like that, and that with well-optimized code (I am only using one of its two cores, so half its peak FLOPS rate) it could pull much farther ahead, while there is not much I can do about it on the CUBLAS side, so that even with very, very large matrices the CPU could come out in front.

    Bandwidth to main RAM is not even close to what the GPU has to its global memory, and at this matrix size the data will not fit in the cache either, so the CPU spills to main RAM quite a bit.

    My laptop has 3 GB of RAM (DDR2-667):

    The caching hierarchy is the difference maker here... and what a difference it makes...

    Perhaps the caching hierarchy of the Core 2 Duo is just THAT good: even with one arm tied behind its back (only one core working on the problem), without incredible bandwidth to memory, and blindfolded (only x87 FP ops), it lags only a little behind a much more powerful monster that is horribly bandwidth starved...

    I understand the GPU can get bandwidth starved, but I thought CUBLAS would be smart about using registers and especially shared memory to minimize the pressure on external memory (see the sketch below). Admittedly that is easier in 3D graphics processing, where you always keep, say, the 4x4 matrix in shared memory and stream in the vertices to be processed (to make a simple example), working on them in batches while new vertices stream in as you process others.
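    To make the shared-memory idea concrete, here is a hypothetical tiled SGEMV kernel (an illustration of the idea, not CUBLAS's actual implementation): each block stages a chunk of the vector in shared memory and every thread in the block reuses it, so the vector is read from global memory only once per block. It assumes a column-major matrix and a launch with blockDim.x == TILE.

    Code:
    	#define TILE 256
    
    	// Launch as: sgemv_tiled<<<(rows + TILE - 1) / TILE, TILE>>>(d_A, d_B, d_C1, rows, cols);
    	__global__ void sgemv_tiled(const float *A, const float *x, float *y,
    	                            int rows, int cols)
    	{
    		__shared__ float xs[TILE];                 // chunk of x, shared by the block
    		int row = blockIdx.x * blockDim.x + threadIdx.x;
    		float acc = 0.0f;
    
    		for (int base = 0; base < cols; base += TILE) {
    			if (base + threadIdx.x < cols)
    				xs[threadIdx.x] = x[base + threadIdx.x];   // one coalesced load
    			__syncthreads();
    
    			int lim = min(TILE, cols - base);
    			if (row < rows)
    				for (int j = 0; j < lim; j++)              // every thread reuses xs[]
    					acc += A[(base + j) * rows + row] * xs[j];
    			__syncthreads();
    		}
    
    		if (row < rows)
    			y[row] = acc;
    	}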

    An SM is kinda starved though... compared to the Emotion Engine's VU1, which had 16 KB of local data memory (and 32x128-bit registers) feeding a single 4-way SIMD unit, here you are feeding twice the number of execution units with the same amount of local memory.

    It is impressive to see it all play out like this, though. Thanks for your comments :).
     
    #3 Panajev2001a, Dec 26, 2008
    Last edited by a moderator: Dec 26, 2008
  4. rpg.314

    Veteran

    Matrix-vector multiplication is a memory-bound algorithm. Do not compare the compute speed of these two codes.

    What surprises me more is why x87 FPU instructions are generated and not SSE. I guess you need the -O3 switch to turn on autovectorization.
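    For what it's worth, on 32-bit x86 GCC emits x87 code for scalar floating point unless SSE is requested explicitly, so the flags matter as much as -O3. A possible compile line (hypothetical, not the SDK Makefile used in the project):

    Code:
    	# -O3 turns on GCC's autovectorizer; -msse2 -mfpmath=sse make GCC use
    	# SSE instead of x87 for scalar floating point on 32-bit x86.
    	g++ -O3 -msse2 -mfpmath=sse -c mat_vec.cpp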
     
  5. Panajev2001a

    Veteran

    That I will try to fix, but I do understand that in this case the GPU is heavily memory starved and that the caching hierarchy of the Core 2 Duo is just very, very well realized... the CPU is even more bandwidth constrained if you look at how many MB/s its RAM can feed it, but in the end even that does not seem to matter one iota... even with such an unoptimized approach the CPU is still only barely slower than the GPU.

    Still, CUDA/CUBLAS is hugely useful even in a case like this... it is still fast and you free the CPU up to do other things.
     
  6. trinibwoy

    trinibwoy Meh
    Legend

    Very interesting. Is there any indication of a move to large, generalized chip-wide caches on future GPUs? Doesn't RV770 already have something similar with its global data store?
     
  7. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Since the GeForce 8400M GS has limited memory bandwidth (9.6 GB/s, as silent_guy already mentioned), it does not have much of an advantage over a Core 2 Duo. Even a single-channel 667 MHz FSB Core 2 Duo has more than 5 GB/s of bandwidth. You'll get a much better result from a GeForce 8800 GT, for example. My own matrix multiplication code, which uses Kahan's summation formula to reduce error, can reach 48 GFLOPS on a GeForce 8800 GT.
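    For reference, the compensated (Kahan) summation pcchen mentions looks roughly like this in scalar C (a generic sketch, not his actual GPU kernel; note that -ffast-math would optimize the compensation away):

    Code:
    	/* Kahan summation: carry the rounding error of each addition in a     */
    	/* separate compensation term so it can be fed back into the next one. */
    	float kahan_sum(const float *x, int n) {
    		float sum = 0.0f, comp = 0.0f;
    
    		for (int i = 0; i < n; i++) {
    			float y = x[i] - comp;   /* apply the carried-over correction */
    			float t = sum + y;       /* low-order bits of y are lost here */
    			comp = (t - sum) - y;    /* recover what was just lost        */
    			sum = t;
    		}
    		return sum;
    	}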
     
  8. Jawed

    Legend

    It seems so. It's tiny, though: 16 KB. It seems to me it's more of a "goods yard" for data on its way from one place to another... I suspect it's no good thinking of it as a "cache".

    Based on my understanding of D3D11 Compute Shader and OpenCL, I think AMD has more work to do on the "shared memory" architecture (for moving/sharing data amongst elements in a computation). I'm not sure if the current global data share + local data share organisation is up to the job of the generalised sharing that these compute models require...

    ---

    Panajev, have you read the NVidia forum threads on dense matrix multiplication? They're worth seeking out, as they provide interesting architectural insights as well as putting the difficulty of maximising performance into perspective (you'll have to go back into the mists of time). DGEMM is now heavily optimised and integrated within CUBLAS as far as I know.

    Jawed
     
  9. randomhack

    Newcomer

    (Jawed: he's doing SGEMV, not SGEMM.)

    Theoretical flops: the theoretical peak of your card is 16 * 0.8 * 2 = 25.6 GFLOPS in single precision (your GPU can't do double precision).

    For your CPU it's 1.67 * 8 * 2 = 26.7 GFLOPS (8 ops per cycle per core, two cores).

    So in your case your CPU has the better GFLOPS rating.

    edit: Though for a matrix-vector multiply neither GFLOPS rating matters. You will, as others already said, be constrained by memory.
     
  10. Panajev2001a

    Veteran

    1.) I am using x87 FPU ops and not SSE

    2.) I am not using multi-threading for the M*V operation, so I only have one core working on the problem...


    So in practice the GPU has a MUCH bigger GFLOPS advantage, but memory bandwidth and data caching are still the main problems we are dealing with here.
     
  11. Tim Murray

    Tim Murray the Windom Earle of mobile SOCs
    Veteran

    Compare with 4096x4096 or 5120x5120. I think CUBLAS has an easier time with power-of-two sizes than non-power-of-two.
     
  12. Panajev2001a

    Veteran

    It shows:

    STEPS == 1 and 4096x4096 matrix:

    Code:
    Initializing data...
    ...allocating CPU memory.
    Matrix is 4096x4096
    Vector is 4096x1
    
    
    Exec time only on CPU: 27.841999 (ms)
    
    simpleCUBLAS test running..
    
    
    Transfer + Exec + Readback time on GPU with CUBLAS: 59.033001 (ms)
    
    Execution time on GPU with CUBLAS: 21.761999 (ms)
    
    Transfer to GPU with CUBLAS: 35.747002 (ms)
    
    Transfer from GPU with CUBLAS: 1.524000 (ms)
    
    GPU CUBLAS code is 0.471634x faster than the CPU code...
    
    
    GPU CUBLAS code is 21.837512% faster than the CPU code (execution time)...
    
    
    (it fluctuates between 21.8% and 18.9%)

    STEPS == 1 and 4352x4352 Matrix:

    Code:
    Initializing data...
    ...allocating CPU memory.
    Matrix is 4352x4352
    Vector is 4352x1
    
    
    Exec time only on CPU: 31.350000 (ms)
    
    simpleCUBLAS test running..
    
    
    Transfer + Exec + Readback time on GPU with CUBLAS: 66.989998 (ms)
    
    Execution time on GPU with CUBLAS: 27.226999 (ms)
    
    Transfer to GPU with CUBLAS: 39.667999 (ms)
    
    Transfer from GPU with CUBLAS: 0.095000 (ms)
    
    GPU CUBLAS code is 0.467980x faster than the CPU code...
    
    
    GPU CUBLAS code is 13.151518% faster than the CPU code (execution time)...
    (it fluctuates between 13% and 10%)


    Still, I installed and ran memtest (separate and nice grub entry :)) and this is the result of its benchmark:

    I guess the people who say the 3 GB configuration of this laptop sucks as far as memory bandwidth is concerned do speak the truth... this is about half of what it should transfer at even in a single-channel configuration (ok, effective vs. theoretical, but still...).
     
  13. Tim Murray

    Tim Murray the Windom Earle of mobile SOCs
    Veteran

    So you're basically testing worst-case CUBLAS performance--8400M, terrible bandwidth (both on host and from host to device), etc.
     
  14. Panajev2001a

    Veteran

    Shh... shhh.... laptop I bought more than a year ago for about $700... he was not trying to be mean, your bandwidth sucks... shhh shhh.... it's ok, it's ok... ;).

    Ok, bad jokes aside (thanks for your posts), the terrible bandwidth on the host side should be hurting the CPU here, since I am trying to compare execution times only, leaving the transfer and readback of data aside.

    Still, before putting it all on the GTX 280 in the test computer, they want me to get things up and running well on my own system (we will see if we can avoid using doubles, as even on the GTX 280 having only 1/8th of the available peak power does not seem incredibly appealing... the 1 GB of device memory and the HUGE bandwidth to it are appealing, because even a suitable combination of CPU cores pushing the same amount of DP FP ops would not have nearly the same RAM bandwidth to match the GTX 280 in scenarios such as this one).
     
  15. Tim Murray

    Tim Murray the Windom Earle of mobile SOCs
    Veteran

    Just so you know, GTX 280 gets about 95% of peak perf on DGEMM (think it's getting about 82 GFlops). Then it gets like 350, 400 GFlops in SGEMM? Something like that.

    Regardless, I don't think these results are indicative of anything surprising. There's an initial cost to using CUDA because of PCIe bandwidth--uh, yeah? How's that strange or unreasonable or anything like that? The transfer to the GPU alone is more than the cost of the matrix operation on the host, so if you do transfer + short operation + transfer you don't get any sort of benefit. If you want a more reasonable test, try SGEMM instead of SGEMV.
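    For reference, the SGEMM comparison Tim suggests would boil down to a call like the one below with the legacy CUBLAS API (a sketch assuming square N x N column-major matrices d_A, d_B, d_C already resident on the device; the names are placeholders):

    Code:
    	/* C = 1.0 * A * B + 0.0 * C, all matrices N x N, column major. */
    	cublasSgemm('n', 'n', N, N, N,
    	            1.0f, d_A, N,    /* alpha, A, lda */
    	                  d_B, N,    /*        B, ldb */
    	            0.0f, d_C, N);   /* beta,  C, ldc */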
     