OpenCL DIA Sparse Matrix Vector Multiply


RecessionCone
13-May-2010, 17:46
Some of you might be interested in this article (http://developer.amd.com/documentation/articles/Pages/OpenCL-Optimization-Case-Study.aspx) from AMD about DIA Sparse Matrix Vector Multiply Optimization. The article is intended for people learning OpenCL, and takes you through several steps in optimizing the DIA SpMV kernel for the Radeon 5870 and also for the Phenom II X4. I wrote the article, and am in the process of writing another one for AMD. Let me know what you think!

pcchen
13-May-2010, 18:19
Nice article. :)
The performance jump from using image is a bit surprising though.

RecessionCone
13-May-2010, 18:53
It is a pretty nice jump, but it makes sense. Without images, we have to load 8 bytes (4 for the matrix element, 4 for the vector element) for every 2 floating point operations. If we assume we're getting 100% of peak memory bandwidth, that gives us a bound of (153.6 GB/s)/(8 bytes) * 2 flops = 38.4 GFLOP/s. We achieved 32.9 GFLOP/s without images, which is 86% of the bound - pretty close to the theoretical limit.

With images, we only have to load 4 bytes for every 2 floating point operations, if we assume perfect caching of the vector. This raises the bound by a factor of 2, so with a perfect texture cache we would expect 2x the performance. We got 1.76x better performance. So it all makes sense.
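
The bound arithmetic above can be checked with a few lines of Python. This is only a sketch of the reasoning in the post: the function name is made up for illustration, and the 153.6 GB/s, 32.9 GFLOP/s and 1.76x figures are the ones quoted above.

```python
# Bandwidth-bound estimate for SpMV: every multiply-add (2 flops) needs a
# fixed number of bytes from memory. All figures are the ones quoted in the post.

def bandwidth_bound_gflops(peak_gb_per_s, bytes_per_fma, flops_per_fma=2):
    """Upper bound on GFLOP/s when memory bandwidth is the only limit."""
    return peak_gb_per_s / bytes_per_fma * flops_per_fma

PEAK_BW = 153.6  # GB/s, peak memory bandwidth quoted for the Radeon 5870

bound_no_image = bandwidth_bound_gflops(PEAK_BW, 8)    # matrix + vector element
bound_with_image = bandwidth_bound_gflops(PEAK_BW, 4)  # vector assumed fully cached

achieved_no_image = 32.9   # GFLOP/s without images, from the post
measured_speedup = 1.76    # speedup from images, from the post
ideal_speedup = bound_with_image / bound_no_image

print(f"bound without images: {bound_no_image:.1f} GFLOP/s")
print(f"bound with images:    {bound_with_image:.1f} GFLOP/s")
print(f"fraction of bound reached without images: {achieved_no_image / bound_no_image:.0%}")
print(f"fraction of ideal speedup reached: {measured_speedup / ideal_speedup:.0%}")
```

The same function works for any kernel where the bytes moved per flop are known, which makes it a handy sanity check while optimizing.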

Jawed
13-May-2010, 19:09
Looks interesting... I like the fact you've identified architecture traits as bounds on performance and used them to keep an eye on where you're going. Seems surprisingly rare that people do this.

Do you use SKA when you're working on OpenCL kernels?

"For each element of the result, we simply sum the element-wise product of the corresponding column from the matrix with the given vector."
Should be "row".

"In our case, the ATI Radeon 5870 GPU supports 2-D images up to 8192x8192 pixels,[...]"
The chip supports 16384x16384 - though obviously there's not enough memory for the format you're using (but there would be if you used a scalar format).
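
To put rough numbers on the memory point, here is a small sketch that computes the footprint of a square image for a 4-channel float format versus a single-channel one. The helper name is hypothetical; whether a given size fits depends on the card's memory, which is not stated here.

```python
# Footprint of a 2-D image: width * height * channels * bytes per channel.

GIB = 1024 ** 3

def image_bytes(width, height, channels, bytes_per_channel=4):
    """Size in bytes of a 2-D image with 32-bit channels by default."""
    return width * height * channels * bytes_per_channel

for side in (8192, 16384):
    rgba = image_bytes(side, side, channels=4)    # CL_RGBA, 32-bit channels
    scalar = image_bytes(side, side, channels=1)  # hypothetical scalar format
    print(f"{side}x{side}: RGBA {rgba / GIB:.2f} GiB, scalar {scalar / GIB:.2f} GiB")
```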

Jawed

RecessionCone
13-May-2010, 19:50
Thanks for the row/column typo. :oops: I'll see if I can get AMD to fix that typo.
About the texture format size - the chip may support 16384x16384 textures, but when I queried AMD's OpenCL runtime, it told me the limits the runtime supports are 8192x8192. Additionally, AMD's OpenCL runtime does not currently support scalar texture formats. You have to use the RGBA format. (My code would have been much simpler if I could have used a scalar format, but alas...)

I used SKA a little bit when I was working on this article, but not much, since I strongly prefer development in Linux, and SKA is Windows only.

Jawed
13-May-2010, 20:11
"About the texture format size - the chip may support 16384x16384 textures, but when I queried AMD's OpenCL runtime, it told me the limits the runtime supports are 8192x8192."
CAL reports the chip capability. Seems strange that AMD would have it report the wrong size.

"Additionally, AMD's OpenCL runtime does not currently support scalar texture formats. You have to use the RGBA format. (My code would have been much simpler if I could have used a scalar format, but alas...)"
Ha. I suppose if one was really desperate one could use as_typen() to reinterpret from an image format of uchar4 :razz: - EDIT - hmm, component count gets in the way - would have to go via uchar4->int conversion first, then do as_typen().

Jawed

Dade
13-May-2010, 22:30
It doesn't work. It is something I have tried, but from a CL_FLOAT/CL_RGBA image you can only read valid floating point numbers; otherwise as_typen() returns 0.f. You can find an explanation of why it doesn't work here: http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=133069&enterthread=y

Quoting from Micah Villmow (AMD): "The reason is that our GPU's don't support denorm's and thus would flush many values that are represented as normal int's to zero when using the floating point path."
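
A quick way to see why this bites: any 32-bit integer below 2^23 has a zero exponent field when its bits are viewed as an IEEE-754 single, so it is a denormal. The sketch below does the reinterpretation on the CPU with Python's struct module; the helper names are made up, and the flush-to-zero behaviour itself is the one described in the quote, not something this code reproduces.

```python
import struct

def bits_to_float(u):
    """Reinterpret a 32-bit unsigned integer as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", u))[0]

def is_denormal_bits(u):
    """True when the exponent field is zero and the mantissa is non-zero."""
    exponent = (u >> 23) & 0xFF
    mantissa = u & 0x7FFFFF
    return exponent == 0 and mantissa != 0

# Small integers (typical indices or counters) all land in the denormal range.
for value in (1, 1000, 0x007FFFFF, 0x00800000, 0x3F800000):
    print(f"{value:#010x} -> {bits_to_float(value)!r}, denormal: {is_denormal_bits(value)}")
```

So an int field such as a child index or a triangle count would come back as 0.f on hardware that flushes denormals in the float path.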

Jawed
14-May-2010, 01:40
The key thing is his next sentence, "The correct way to do this is to read as unsigned integers and bitcast to floats, as that guarantee's that no formatting is done on the integer types."

What he's saying is that the texturing hardware can be forced to fetch bits without any manipulation, by defining the image as uint. So on the host define the image as uint but stuff it with any data you like (floats, doubles, structs etc.). Then in the kernel extract bits according to alignment/sizing of the struct you've packed into the image, then use as_typen() to convert those extracted things.

As far as performance goes it pays to remember that an ATI GPU can't fetch less than 128 bits through the texturing hardware. If you ask for 8 bits, it'll fetch 128 and discard 120. So you're best-off doing 128-bit fetches and extracting bits/data from that, if you are image-data bandwidth bottlenecked.

Given the CL_RGBA restriction, CL_UNSIGNED_INT8 can be packed with scalar floats.
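
For illustration, here is a host-side sketch of the packing idea in Python. It assumes little-endian byte order and that the image is created as CL_RGBA with unsigned integer channels; the function names are hypothetical, and the unpack helpers only mimic on the CPU what as_typen() would do in the kernel.

```python
import struct

def pack_floats_as_uint32_pixels(values):
    """Pack floats into RGBA pixels of four 32-bit unsigned channels (128 bits)."""
    padded = list(values) + [0.0] * (-len(values) % 4)   # pad to whole pixels
    raw = struct.pack(f"<{len(padded)}f", *padded)
    words = struct.unpack(f"<{len(padded)}I", raw)       # same bits, viewed as uint32
    return [words[i:i + 4] for i in range(0, len(words), 4)]

def unpack_uint32_pixel(pixel):
    """Reinterpret the four uint32 channels of one pixel as four floats."""
    return struct.unpack("<4f", struct.pack("<4I", *pixel))

def float_to_rgba8(value):
    """Store one float in a single RGBA pixel with four 8-bit unsigned channels."""
    return tuple(struct.pack("<f", value))

def rgba8_to_float(pixel):
    """Reassemble the float from its four 8-bit channels."""
    return struct.unpack("<f", bytes(pixel))[0]

pixels = pack_floats_as_uint32_pixels([1.0, -2.5, 0.5, 3.0, 4.0])
print(pixels)
print(unpack_uint32_pixel(pixels[0]))
print(float_to_rgba8(1.0), rgba8_to_float(float_to_rgba8(1.0)))
```

The 32-bit-channel variant moves 128 bits per fetch, which matches the fetch granularity mentioned above; the 8-bit variant gives one scalar float per pixel.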

Jawed

Dade
14-May-2010, 09:05
It is a good idea to reverse the problem: store as int32 and read as float, instead of storing as float and reading as int32. My solution was just to store all float fields in one image and all int32 fields in another.
Thanks, I'm going to check whether this idea works better.

Jawed
14-May-2010, 10:25
Make sure to use uint, otherwise the sign will get you.
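
One plausible reading of that warning (an assumption, the thread does not spell it out) is sign extension when fields are extracted with shifts. The sketch below simulates 32-bit signed and unsigned views of the same bit pattern in Python; to_signed32 is a made-up helper.

```python
def to_signed32(u):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    return u - (1 << 32) if u & 0x80000000 else u

bits = 0xBF800000  # bit pattern of -1.0f, top bit set

top_byte_unsigned = bits >> 24             # logical shift: zeros come in from the left
top_byte_signed = to_signed32(bits) >> 24  # arithmetic shift: the sign bit is replicated

print(hex(top_byte_unsigned))  # the byte that was actually stored
print(top_byte_signed)         # negative, needs masking before use
print(hex(top_byte_signed & 0xFF))
```

With unsigned values the extracted field can be used directly; with signed values every extraction needs an extra mask.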

Also:

"In addition, some other extensions to the C language designed to support particular vector ISA (e.g. AltiVec™, CELL Broadband Engine™ Architecture) use such conversions in conjunction with swizzle operators to achieve type unconversion. So as to support legacy code of this type, as_typen() allows conversions between vectors of the same size but different numbers of elements, even though the behavior of this sort of conversion is not likely to be portable except to other OpenCL implementations for the same hardware architecture."
implies that you should be able to do:

float f = as_float(convert_uchar4(read_imageui(packed, samplerPacked, coordPacked)));

without having to mess about unpacking from CL_RGBA into a 32-bit uint. The remaining question is whether this is portable across ATI and NVidia... - well, it would be bizarre if it wasn't.

Anyway, surely AMD plans to implement CL_R. I'm aghast it's not in there already.

Jawed

Dade
15-May-2010, 11:39
I'm now getting decent results out of this storing-inside-image trick. I see about +25% on my 5870/5850/5770 in a benchmark and about 10-15% in a real-world application.

I was really surprised when, testing the code on an NVIDIA 240GT, I noticed a +40% speedup. Maybe my result is biased by the low-end NVIDIA GPU, but apparently both ATI and NVIDIA GPUs benefit from this trick.

Jawed
15-May-2010, 12:55
Ooh that's pretty nice, glad it worked. Are you able to use multiples of 128-bits per read? Or are you doing 32-bit reads?

Fermi may prefer conventional memory for this data.

Jawed

Dade
15-May-2010, 17:04
Yup, all the fields in my structures are float4 or int4 (i.e. 7 pixels to store each node of a QBVH tree and 10 to store each 4xTriangles leaf).
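
As a rough sketch of how records like these could be laid out in a 2-D image: the mapping below (row-major, records packed back to back) is an assumption for illustration, not necessarily the layout used here, and all function names are hypothetical. Only the 7 and 10 pixel record sizes come from the post.

```python
NODE_PIXELS = 7   # float4/int4 pixels per QBVH node, from the post
LEAF_PIXELS = 10  # pixels per 4-triangle leaf, from the post

def pixel_coord(pixel_index, image_width):
    """Map a linear pixel index to (x, y) in a row-major 2-D image."""
    return pixel_index % image_width, pixel_index // image_width

def record_pixel_coords(record_index, image_width, pixels_per_record):
    """Coordinates of every pixel belonging to one tightly packed record."""
    first = record_index * pixels_per_record
    return [pixel_coord(first + i, image_width) for i in range(pixels_per_record)]

def max_records(image_width, image_height, pixels_per_record):
    """How many whole records fit in an image of the given size."""
    return (image_width * image_height) // pixels_per_record

print(record_pixel_coords(1, 8192, NODE_PIXELS))
print(max_records(8192, 8192, NODE_PIXELS), max_records(8192, 8192, LEAF_PIXELS))
```

With an 8192x8192 limit this caps the tree at a few million nodes or leaves per image, which is where the image size limit starts to matter.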

Jawed
16-May-2010, 01:32
Those numbers make me wonder if you might see a gain with padding the data to multiples of 8, i.e. 8 for a node and 16 for leaves, which would limit the number of cache lines a node or leaf-set straddles - the article says that alignment with 128-byte cache lines is the target on ATI GPUs (have to admit, I thought it was 64). No idea what NVidia prefers...

Theoretically with your current code a node fetch is going to cause two cache lines to be fetched 86% of the time on average. Alignment would stop that, making every fetch access a single cache line. Not only does this save on the count of lines fetched, but it also significantly increases the effective cache capacity, by not filling it with superfluous cache lines 86% of the time. Though it could be argued that cache capacity is irrelevant when all the fetches are random.
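
The straddling estimate can be sanity-checked with a simplified model: records packed back to back in a linear address space, 16-byte pixels and 128-byte cache lines. Those are assumptions (real texture caches tile 2-D data, so actual figures may differ), and the function names are made up.

```python
def lines_touched(start_pixel, record_pixels, line_pixels):
    """Number of cache lines covered by a record starting at start_pixel."""
    first_line = start_pixel // line_pixels
    last_line = (start_pixel + record_pixels - 1) // line_pixels
    return last_line - first_line + 1

def straddle_fraction(record_pixels, stride_pixels, line_bytes=128, pixel_bytes=16):
    """Fraction of records that touch more cache lines than strictly needed."""
    line_pixels = line_bytes // pixel_bytes
    min_lines = -(-record_pixels // line_pixels)  # ceiling division
    # Start offsets repeat after at most line_pixels records, so one period is enough.
    starts = [i * stride_pixels for i in range(line_pixels)]
    extra = sum(lines_touched(s, record_pixels, line_pixels) > min_lines for s in starts)
    return extra / line_pixels

print("7-pixel nodes, packed:      ", straddle_fraction(7, 7))
print("7-pixel nodes, padded to 8: ", straddle_fraction(7, 8))
```

In this model 6 of the 8 possible start offsets make a tightly packed 7-pixel node cross a line boundary, in the same ballpark as the estimate above, while padding to 8 pixels removes the crossings entirely at the cost of one extra pixel per node.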

Dade
16-May-2010, 21:04
Padding was something I was thinking about. However, GPU RAM is such a limited resource, and this trick of storing data inside an image imposes even more limits (i.e. the max image width/height supported). I'm a bit against "wasting" precious memory on padding.

Anyway I can give it a try just to check what I could gain.

Jawed
17-May-2010, 00:44
I suppose some kind of compression might help:

http://gv2.cs.tcd.ie/egirl09/papers/01.pdf

but you have to balance compression and cache line usage, presumably, to see a performance gain.

Jawed