Help! How to implement scatter and gather using G80?


dragonyzl
28-Mar-2007, 11:28
Hi,
This is my first post to the forum.
I have generated two arrays in a fragment shader, named amp and DistanceIndex. Now I want to implement the following function (described in C):

for (int i = 0; i < Width * Height; i++)
    mySignal[DistanceIndex[i]] += amp[i];

From what I have read on the web, this calls for scatter and gather techniques. The usual solutions for scatter are: 1. reading back to the CPU, 2. using vertex textures. I don't think either suits me because of efficiency: after reading back to main memory, the loop above takes about 8 minutes for Width=Height=1024. I have not tested the vertex texture method; has anyone tried it? From the documentation for the NVIDIA G80 GPU and Shader Model 4.0, it seems that the function above can be implemented, but I don't know how. I am hoping that someone on the forum can help or point me to more information.

Thanks!

mhouston
28-Mar-2007, 17:44
This is easy with CUDA. With SM4, you will have to do this as a vertex program that gathers from DistanceIndex and then generates a vertex at the position corresponding to mySignal[ DistanceIndex[i] ]. To add amp[i], you will have to bind a fragment program, passing the value of i as a texture coordinate, which takes that vertex, reads the current value from the framebuffer, looks up amp[i], and adds it. This will be extremely inefficient.
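To make the two stages described above concrete, here is a rough CPU-side model in plain C. It is only a sketch of the data flow, not shader code: the names ScatterVertex, vertex_stage, fragment_stage, scatter_add_points and the signalLen parameter are made up for illustration, and the loop runs sequentially, so it says nothing about how the hardware handles several elements landing on the same output position at once.

```c
#include <stddef.h>

/* Hypothetical record emitted by the "vertex stage": where to write,
   and which source element the "fragment stage" should fetch. */
typedef struct {
    int target;  /* output position, i.e. DistanceIndex[i] */
    int source;  /* i, carried along like a texture coordinate */
} ScatterVertex;

/* "Vertex stage": gather the destination address for element i. */
static ScatterVertex vertex_stage(const int *DistanceIndex, int i)
{
    ScatterVertex v = { DistanceIndex[i], i };
    return v;
}

/* "Fragment stage": read the current output value, look up amp[i], add. */
static void fragment_stage(float *framebuffer, const float *amp,
                           ScatterVertex v)
{
    framebuffer[v.target] = framebuffer[v.target] + amp[v.source];
}

/* Runs both stages for n elements. Elements whose target falls outside
   [0, signalLen) are skipped; the return value is how many were skipped. */
int scatter_add_points(float *mySignal, int signalLen,
                       const float *amp, const int *DistanceIndex, int n)
{
    int skipped = 0;
    for (int i = 0; i < n; i++) {
        ScatterVertex v = vertex_stage(DistanceIndex, i);
        if (v.target < 0 || v.target >= signalLen) {
            skipped++;
            continue;
        }
        fragment_stage(mySignal, amp, v);
    }
    return skipped;
}
```

Calling scatter_add_points(mySignal, signalLen, amp, DistanceIndex, Width * Height) on a zero-initialized mySignal gives the same sums as the original loop for all in-range indices.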

dragonyzl
29-Mar-2007, 12:11
Thank you for your constructive reply!
I have not tried CUDA yet, but I will try the methods you mentioned.
From some documents I have read on parallel computing, this kind of operation is really difficult because of the address translation.
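One common way around the address-translation problem, not something proposed in this thread, is to turn the scatter into a gather: group the source elements by their destination first, then let each output position sum up only its own group. The C sketch below does the grouping with a counting sort. The function name scatter_as_gather and the signalLen parameter are assumptions made for illustration, and this is a sequential reference, not GPU code.

```c
#include <stdlib.h>

/* Adds amp[i] into mySignal[DistanceIndex[i]] for i in [0, n), phrased as
   a gather. Returns 0 on success, -1 on allocation failure or if any index
   is outside [0, signalLen); mySignal is untouched in the failure case. */
int scatter_as_gather(float *mySignal, int signalLen,
                      const float *amp, const int *DistanceIndex, int n)
{
    if (signalLen < 0 || n < 0)
        return -1;

    int *start = calloc((size_t)signalLen + 1, sizeof *start);
    int *order = malloc(((size_t)n + 1) * sizeof *order);
    if (!start || !order) {
        free(start);
        free(order);
        return -1;
    }

    /* Pass 1: count how many sources target each output position. */
    for (int i = 0; i < n; i++) {
        int d = DistanceIndex[i];
        if (d < 0 || d >= signalLen) {
            free(start);
            free(order);
            return -1;
        }
        start[d + 1]++;
    }

    /* Pass 2: prefix sum, so start[j] is the first slot of bucket j. */
    for (int j = 0; j < signalLen; j++)
        start[j + 1] += start[j];

    /* Pass 3: list source indices grouped by target. start[d] is used as a
       moving cursor, so afterwards start[j] holds the END of bucket j. */
    for (int i = 0; i < n; i++)
        order[start[DistanceIndex[i]]++] = i;

    /* Pass 4: the gather. Each output reads only its own bucket, so no two
       outputs ever write to the same address. */
    for (int j = 0; j < signalLen; j++) {
        int begin = (j == 0) ? 0 : start[j - 1];
        int end = start[j];
        float sum = 0.0f;
        for (int k = begin; k < end; k++)
            sum += amp[order[k]];
        mySignal[j] += sum;
    }

    free(start);
    free(order);
    return 0;
}
```

The last pass is the part that maps naturally onto one thread or fragment per output element, since every output is written exactly once. Note that each bucket is summed before being added to mySignal[j], so floating-point rounding can differ slightly from the original loop when mySignal does not start at zero.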