10-Jul-2010, 00:16   #117
Chalnoth
 

Just for fun, I thought I'd throw together a trivial test to take a look at SSE performance on my AMD Phenom II. The test code is extremely simple: fill two identical 4x4 matrices with some arbitrary values, then multiply the first by the second, in place, ten million times. Here is the exact source:

Code:
void mul_mat(double **mat1, double **mat2, int sz)
{
  int ii, jj, kk;
  for (ii = 0; ii < sz; ii++)
  {
    double temp[sz];
    for (jj = 0; jj < sz; jj++)
    {
      temp[jj] = 0.;
      for (kk = 0; kk < sz; kk++)
        temp[jj] += mat1[ii][kk]*mat2[kk][jj];
    }
    for (jj = 0; jj < sz; jj++)
      mat1[ii][jj] = temp[jj];
  }
}

int main()
{
  const int sz = 4;
  const int num = 10000000;
  int ii, jj;

  double **mat1 = new double*[sz];
  for (ii = 0; ii < sz; ii++)
    mat1[ii] = new double[sz];
  double **mat2 = new double*[sz];
  for (ii = 0; ii < sz; ii++)
    mat2[ii] = new double[sz];

  for (ii = 0; ii < sz; ii++)
    for (jj = 0; jj < sz; jj++)
    {
      mat1[ii][jj] = double(ii)/sz*double(jj)/sz;
      mat2[ii][jj] = double(ii)/sz*double(jj)/sz;
    }

  for (ii = 0; ii < num; ii++)
    mul_mat(mat1, mat2, sz);
}
No inputs, no outputs, no library files. Just some very simple math. Now, if I compile the code with the following command:

icc test-perf.cpp -o test-perf -O3 -msse2 -m32

...the code completes in 0.38 seconds. If, instead, I compile with this command:

icc test-perf.cpp -o test-perf -O3 -mno-sse -mno-sse2 -m32

...the code takes a whole 11.4 seconds to finish. That's a speedup of 30 times! Now, this is a very artificial scenario, but it just highlights how horrible the x87 floating point unit can be. If, by contrast, I compile with the "-m64" option to produce a 64-bit binary, none of the SSE options make any noticeable difference (all builds complete in about 0.38-0.39 seconds), presumably because SSE2 is part of the baseline x86-64 instruction set.

My suspicion is that, at this matrix size, the extra register space that comes with the SSE2 instructions really makes a big impact: a 4x4 matrix is just large enough to overflow the x87 register stack, but it still fits once the SSE2 registers are available. If I pick a 3x3 or 2x2 matrix, by contrast, the performance benefit drops to 2x. If I increase to an 8x8 matrix, the performance benefit disappears.
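
Separately from the register question, it is easy to check what the numbers themselves do over ten million iterations. With the initial values in the source above, every entry is ii*jj/sz^2, so each multiplication rescales the whole matrix by a constant factor (14/16 in the 4x4 case), and the entries shrink on every pass. The following is a minimal sketch that replays the same arithmetic on contiguous arrays and reports whether the largest entry has dropped below the smallest normal double. The names largest_entry_after and below_normal_range are hypothetical helpers invented for this illustration, not part of the test program, and whether any of this has a bearing on the timings above is not something the measurements in this post establish.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical helper, not from the original post: builds the same sz x sz
// starting matrices as the benchmark, applies mat1 = mat1 * mat2 `iters`
// times, and returns the largest entry of mat1 afterwards.
double largest_entry_after(int sz, int iters)
{
  assert(sz > 0 && iters >= 0);
  std::vector<double> m1(sz * sz), m2(sz * sz), row(sz);
  for (int i = 0; i < sz; i++)
    for (int j = 0; j < sz; j++)
      m1[i*sz + j] = m2[i*sz + j] = double(i)/sz*double(j)/sz;

  for (int n = 0; n < iters; n++)
    for (int i = 0; i < sz; i++)
    {
      // Same row-at-a-time update as mul_mat in the post.
      for (int j = 0; j < sz; j++)
      {
        row[j] = 0.;
        for (int k = 0; k < sz; k++)
          row[j] += m1[i*sz + k]*m2[k*sz + j];
      }
      for (int j = 0; j < sz; j++)
        m1[i*sz + j] = row[j];
    }

  double best = 0.;
  for (double v : m1)  // entries start non-negative and stay that way
    if (v > best) best = v;
  return best;
}

// True when x is below the smallest normal double, i.e. subnormal or zero.
bool below_normal_range(double x)
{
  return x < std::numeric_limits<double>::min();
}
```

For example, largest_entry_after(4, 1) is 0.4921875 (that is, 0.5625 * 14/16), and below_normal_range(largest_entry_after(4, 10000)) is true, so nearly all of the ten million iterations in the 4x4 test operate on values that are subnormal or zero.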

My strong suspicion is that if one were to properly optimize this simple matrix multiplication code for SSE, the full performance benefit, even for larger matrices, would be closer to 2x. I'm not really sure how to do that, though.
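
For reference, a hand-vectorized 4x4 multiply is not much code. The following is a minimal sketch, not a tuned implementation, and it makes assumptions the original test does not: the matrices are stored contiguously in row-major order as 16 doubles (instead of double**), the size is fixed at 4x4, and the result is written to a separate output array. The name mul_mat4_sse2 is hypothetical; the intrinsics are the standard SSE2 ones from <emmintrin.h>. No timing is claimed for it.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics (x86 only)

// Hypothetical helper, not from the original post: c = a * b for 4x4
// row-major matrices stored as 16 contiguous doubles.
// Assumption: c does not alias a or b.
void mul_mat4_sse2(const double *a, const double *b, double *c)
{
  assert(c != a && c != b);
  for (int i = 0; i < 4; i++)
  {
    __m128d lo = _mm_setzero_pd();  // accumulates c[i][0] and c[i][1]
    __m128d hi = _mm_setzero_pd();  // accumulates c[i][2] and c[i][3]
    for (int k = 0; k < 4; k++)
    {
      // Broadcast a[i][k] into both lanes, then multiply it against
      // row k of b, two columns at a time.
      __m128d aik = _mm_set1_pd(a[4*i + k]);
      lo = _mm_add_pd(lo, _mm_mul_pd(aik, _mm_loadu_pd(b + 4*k)));
      hi = _mm_add_pd(hi, _mm_mul_pd(aik, _mm_loadu_pd(b + 4*k + 2)));
    }
    _mm_storeu_pd(c + 4*i, lo);
    _mm_storeu_pd(c + 4*i + 2, hi);
  }
}
```

Each SSE2 register holds two doubles, so one row of the result takes two accumulators, and the inner loop does two multiplies and two adds per step instead of four of each. That is where a "closer to 2x" expectation would come from. On a 32-bit target the intrinsics need SSE2 enabled at compile time, as with the -msse2 flag in the first command above.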