LRB - ditching x86?

On the GPU, I decided to try another method: each core has 8 sets of bins, and within a warp each thread accesses its own set. That is, thread 0 only uses bin set 0, thread 1 uses bin set 1, ..., thread 8 uses bin set 0, and so on. This way, there will be no bank conflicts between threads.

Wouldn't 8 sets of bins have 2-way bank conflicts, according to the CUDA docs which say that shared memory is banked 16 ways?
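
For reference, a rough CUDA sketch of that per-thread bin-set layout, assuming 256 bins and 32-bit counts (names and sizes are illustrative, not the actual kernel). Since threads 0 and 8 of a warp share a set, the increment is guarded with a shared-memory atomic (available on GT200-class parts) so coincident hits aren't lost:

#define NUM_BINS 256
#define NUM_SETS 8

__global__ void hist256_bin_sets(const unsigned char *data, int n, unsigned int *out)
{
    // NUM_SETS interleaved copies of the histogram; set s owns smem[bin * NUM_SETS + s]
    __shared__ unsigned int smem[NUM_BINS * NUM_SETS];
    for (int i = threadIdx.x; i < NUM_BINS * NUM_SETS; i += blockDim.x)
        smem[i] = 0;
    __syncthreads();

    int set = threadIdx.x % NUM_SETS;        // thread 0 -> set 0, thread 8 -> set 0 again, ...
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&smem[data[i] * NUM_SETS + set], 1u);   // threads 8 lanes apart can hit the same counter
    __syncthreads();

    // fold the NUM_SETS copies together and accumulate into the global histogram
    for (int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x) {
        unsigned int sum = 0;
        for (int s = 0; s < NUM_SETS; s++)
            sum += smem[bin * NUM_SETS + s];
        atomicAdd(&out[bin], sum);
    }
}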
 
Maybe some type of binary sort to cut down on the ridiculous redundancy ... really though, this just shows that some types of problems can only be solved by better memory subsystems rather than by more computation.
That's nowhere near as hilarious as doing the same evaluation 240 times with the knowledge that it will succeed only once.

Ideally we'd have caches with 32-bit banks, one port per bank, and a small FIFO for colliding writes (Larrabee should have had this).

How about an accumulating cache?
Stick an adder on the write queue in the cache controller, make all hits to the queue into adds, and make all writes to the cache a read/modify/write.
The neat thing about this is that if ECC is on the cache, a fair amount of work is already done, since the cache line needs to be read back, modified, and the ECC bits recalculated before storing back.
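
A toy software model of what that accumulating write queue might do (entirely illustrative; the slot count and names are made up): adds that hit a pending entry just coalesce in the queue, and retirement performs the read/modify/write against the backing storage.

#include <cstdint>

struct AccumWriteQueue {
    static const int SLOTS = 8;        // small FIFO of pending adds
    uint32_t addr[SLOTS];
    int32_t  delta[SLOTS];
    int      used = 0;
    int32_t *backing;                  // the storage behind the "cache"

    void add(uint32_t a, int32_t d) {
        for (int i = 0; i < used; i++)
            if (addr[i] == a) { delta[i] += d; return; }   // hit in the queue: accumulate in place
        if (used == SLOTS) retire_oldest();                // full: drain one entry first
        addr[used] = a; delta[used] = d; used++;
    }
    void retire_oldest() {             // the read/modify/write step
        backing[addr[0]] += delta[0];
        for (int i = 1; i < used; i++) { addr[i-1] = addr[i]; delta[i-1] = delta[i]; }
        used--;
    }
    void flush() { while (used) retire_oldest(); }
};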
 
I did a simple experiment with a 16-bit histogram test (i.e. 65536 bins). The data set is 128M 16-bit numbers; the image is random data.

CPU: Core i7 920
CPU (1 thread): ~800MB/s.
CPU (2 threads): ~1150MB/s.
CPU (4 threads): ~1650MB/s.
CPU (8 threads): ~1750MB/s.

GPU: GeForce GTX 285
First algorithm: multiple histograms, no atomics
GPU (256 threads): ~ 500MB/s
GPU (512 threads): ~ 650MB/s
GPU (1024 threads): ~ 700MB/s

Second algorithm: single histogram, with global atomics
GPU (256 threads): ~ 500MB/s
GPU (512 threads): ~ 850MB/s
GPU (1024 threads): ~ 900MB/s
GPU (16384 threads): ~ 900MB/s

I'm using 64-bit Windows 7 and the profiler isn't working for some mysterious reason, so I can't make additional optimizations.

Interestingly, even with the first algorithm using 1024 threads (which needs 256MB for storing histograms!) the reduction is extremely quick, since the data access pattern is completely linear.
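
For reference, the second algorithm boils down to something like this (a minimal sketch, assuming 16-bit input and one global atomic per sample; not the exact kernel):

// one 65536-entry histogram in global memory, updated with global atomics
__global__ void hist64k_atomic(const unsigned short *data, int n, unsigned int *bins)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&bins[data[i]], 1u);     // one global atomic per sample
}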


I was curious how well this would work on a Quad Core 2 at 3.2GHz.
With 4 threads I'm getting just over 3 GB/s.

I added an optimization to have two histograms per thread, one with 1-byte bins and one with 4-byte int bins. I use the 1-byte one, and when it overflows I increment the int bin. This gains 50% speed, due to better cache usage. Maybe a 4-bit histogram would be even better, as it would fit in the first-level cache.
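
The core of the trick, for anyone who wants to try it (a minimal illustration, not the full program, which is posted further down):

unsigned char h8 [1 << 16];   // hot 1-byte bins: 64KB, mostly cache-resident
unsigned int  h32[1 << 16];   // cold 4-byte bins: touched once per 256 hits

inline void bump(unsigned short v)
{
    if (++h8[v] == 0)         // byte counter wrapped 255 -> 0
        h32[v]++;             // carry into the wide counter
}
// final count for bin v is (h32[v] << 8) + h8[v]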
 
That's close to what I was suggesting, though it might not have to go that high (though there is no reason it couldn't for other reasons I haven't thought of).

At its most basic, it would be dividing the histogram into at least 30 blocks, with each of the 8 lanes of the SIMD running a block getting 1/8 of the bins in that 1/30.

So it would be a grid of 30 blocks, and every block would wind up evaluating every sample, exploiting the GPU's high emphasis on read bandwidth to broadcast a single sample to the entire chip (or just split the blocks up further and add more threads to hide additional latency).

That way, all 240 physical lanes will evaluate a pixel (possibly not in the same cycle, but close to it so they can reuse the sample), and at most one lane will ever write to its unique subset of the SIMD's shared memory, and the shared memory itself would be unique to the SIMD.

After the initial startup period, there would be an average throughput of 1 histogram update per hot cycle, and it's been guaranteed by how the bins are partitioned that there won't be any collisions, since only one lane can update the bin in question.
But the domain of execution is a power of 2 - so the 256<->240 mismatch is going to skew the lane<->bin mapping, isn't it?

Writeback to memory would be straightforward, as there would be no overlap between the blocks.
Yeah, that's the one fast bit of this whole crazy scheme.
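
To make it concrete, a rough CUDA approximation of the scheme (purely illustrative, threads standing in for the physical lanes, using the 65536-bin / 30-block numbers): every block walks the entire input, and each thread only ever touches its own disjoint slice of that block's bins.

#define TOTAL_BINS     65536
#define NUM_BLOCKS     30
#define BINS_PER_BLOCK (TOTAL_BINS / NUM_BLOCKS + 1)    // 2185

// launched as <<<NUM_BLOCKS, 64>>>
__global__ void hist_broadcast(const unsigned short *data, int n, unsigned int *out)
{
    __shared__ unsigned int myBins[BINS_PER_BLOCK];     // ~8.5KB of the 16KB shared memory
    const int blockLo       = blockIdx.x * BINS_PER_BLOCK;
    const int binsPerThread = BINS_PER_BLOCK / blockDim.x + 1;
    const int threadLo      = blockLo + threadIdx.x * binsPerThread;

    for (int i = threadIdx.x; i < BINS_PER_BLOCK; i += blockDim.x)
        myBins[i] = 0;
    __syncthreads();

    for (int i = 0; i < n; i++) {                       // every sample is broadcast to every block
        int b = data[i];
        if (b >= threadLo && b < threadLo + binsPerThread && b < blockLo + BINS_PER_BLOCK)
            myBins[b - blockLo]++;                      // at most one thread ever takes a given bin
    }
    __syncthreads();

    for (int i = threadIdx.x; i < BINS_PER_BLOCK; i += blockDim.x)
        if (blockLo + i < TOTAL_BINS)
            out[blockLo + i] = myBins[i];               // writeback: no overlap between blocks
}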

I was going to post another idea I had along those lines, leveraging the GTX's register file, since it's 4x the size of the shared memory.
You mentioned a multi-pass scheme for splitting up the bins, and the register file in theory could avoid spilling to main memory between passes.
At the end of a pass, the contents of the shared memory would be arrayed properly and could be stored to statically indexed registers, as opposed to the data-dependent accesses needed during the pass.

The reg file would fill completely after 4 passes, so that scheme could only stack things three deep before needing to write back.
I suppose you could go further and use the register file to hold the bins and use shared memory as a communication channel (well, a web of channels). i.e. a thread has a sample that doesn't fit into any of its own bins, but it does fit into a bin belonging to one of that thread's partners in the warp, or somewhere else in the block.

This algorithm entirely removes atomics from bin increments. But using shared memory as a mesh of communication channels amongst pairs of threads (or merely one queue per recipient thread) does raise the question of banking issues in shared memory.
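
Sketching the "one queue per recipient" variant very roughly (one posting slot per sender, everyone scans the posts each round; sizes are made up, and it only covers a 1024-bin slice so the per-thread bins stay small). Note that the indexed per-thread array is exactly the sort of thing the compiler may demote out of registers:

#define THREADS 64            // launched as <<<numBlocks, THREADS>>>
#define BINS_PER_THREAD 16    // each thread privately owns this many consecutive bins

__global__ void hist_mailbox(const unsigned short *data, int n, unsigned int *out)
{
    __shared__ int post[THREADS];            // one mailbox slot per sending thread
    unsigned int mine[BINS_PER_THREAD];      // per-thread bins (may spill to local memory)
    for (int i = 0; i < BINS_PER_THREAD; i++) mine[i] = 0;

    int lo = threadIdx.x * BINS_PER_THREAD;  // my bin range: [lo, lo + BINS_PER_THREAD)

    for (int base = blockIdx.x * THREADS; base < n; base += gridDim.x * THREADS) {
        int i = base + threadIdx.x;
        post[threadIdx.x] = (i < n) ? (data[i] & (THREADS * BINS_PER_THREAD - 1)) : -1;
        __syncthreads();
        for (int s = 0; s < THREADS; s++) {  // drain: apply the posts aimed at my bins
            int b = post[s];
            if (b >= lo && b < lo + BINS_PER_THREAD)
                mine[b - lo]++;
        }
        __syncthreads();
    }
    for (int i = 0; i < BINS_PER_THREAD; i++)
        atomicAdd(&out[lo + i], mine[i]);    // fold this block's partial counts into the result
}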

This would all be so much easier if threads could move data to anywhere else on the GPU. As it is anything like this has to be done via global memory. Who knows, with the severe latencies we're still seeing, maybe going via global would be a win, since so much bandwidth (erm, nearly all of it) is being left on the table.

It would also be easier if more than one kernel could run at the same time (e.g. data manager + bin incrementor) rather than having to use control flow based on thread ID (e.g. making thread 0 the master for all threads in a block).

Jawed
 
But the domain of execution is a power of 2 - so the 256<->240 mismatch is going to skew the lane<->bin mapping, isn't it?
The math would be less clean and require extra latency hiding, but the bins could be allocated almost equally between all the lanes.

A block ID check could predicate the threads in the two extra blocks in a 32 block grid as being always off. It's a kludge, but it fits with the sloppy brute-force approach I was going for.

I didn't posit an upper bound on the number of blocks, just that there be enough to distribute across the chip.

I suppose you could go further and use the register file to hold the bins and use shared memory as a communication channel (well, a web of channels). i.e. a thread has a sample that doesn't fit into any of its own bins, but it does fit into a bin belonging to one of that thread's partners in the warp, or somewhere else in the block.
What kind of scheme would be used to direct register updates based on what the data values are, besides a lot of branching? I was working from an assumption that shared memory addresses could be generated in a more flexible fashion than register identifiers can be.
 
I just don't see any way to do this particularly efficiently on ATI hardware in the near future (I assume the limits on LDS writes will still be there next generation); the means for inter-thread communication are just too limited.
 
How about an accumulating cache?
Stick an adder on the write queue in the cache controller, make all hits to the queue into adds, and make all writes to the cache a read/modify/write.
You'd really want it to do all of the atomic reduction operations though - (int) add, (float/int) min, max, etc. and then you're encroaching quickly on ROP territory.
 
What kind of scheme would be used to direct register updates based on what the data values are, besides a lot of branching? I was working from an assumption that shared memory addresses could be generated in a more flexible fashion than register identifiers can be.
CUDA allows arrays.

Although I'm now wondering about the divergence issue I mentioned before with SRs on ATI. Hmm.

Jawed
 
I was curious how well this would work on a Quad Core 2 at 3.2GHz.
With 4 threads I'm getting just over 3 GB/s.

I added an optimization to have two histograms per thread, one with 1-byte bins and one with 4-byte int bins. I use the 1-byte one, and when it overflows I increment the int bin. This gains 50% speed, due to better cache usage. Maybe a 4-bit histogram would be even better, as it would fit in the first-level cache.
Nice work. So GTX285 is now chasing 4.5GB/s. That's about half the available DDR bandwidth on your processor, isn't it?

On NVidia a single pass for 65536 bins and nibble-sized in-register histogram bins won't fit as far as I can tell - the register file per core is just too small.

But it would work over, say, 4 passes, e.g. 60 blocks and 64 threads per block; with 2 blocks per core that requires 4096 registers per core to let each thread hold 256 nibble-sized bins in its private 32 registers. That is, if indexed registers aren't a complete disaster zone, performance-wise.
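
For what it's worth, the nibble bookkeeping itself is cheap: 256 four-bit counters pack into 32 words per thread, with a carry out to a wider counter every 16 hits. Something like this (illustrative only; and the indexed nib[] array is exactly the kind of thing that can get demoted to local memory, as discussed further down):

__device__ void nibble_inc(unsigned int nib[32], int bin, unsigned int *overflow)
{
    int word  = bin >> 3;                  // 8 nibbles per 32-bit word, 32 words = 256 bins
    int shift = (bin & 7) * 4;
    unsigned int v = (nib[word] >> shift) & 0xF;
    if (v == 0xF) {                        // nibble full: clear it and carry 16 counts upward
        nib[word] &= ~(0xFu << shift);
        atomicAdd(&overflow[bin], 16u);
    } else {
        nib[word] += 1u << shift;
    }
}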

Jawed
 
You'd really want it to do all of the atomic reduction operations though - (int) add, (float/int) min, max, etc. and then you're encroaching quickly on ROP territory.
IMO a single test-and-set bit on each 32-bit word in the cache is enough for atomics ... everything else can be handled higher up.
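
In software terms, "handled higher up" looks something like this toy model, where a richer atomic (an add, here) is built on nothing more than a per-word test-and-set bit (the flag array stands in for the hardware bit; names and sizes are illustrative):

#include <atomic>

static std::atomic<int> lockbit[1 << 16];   // stands in for the one bit per 32-bit word
static int bins[1 << 16];

int add_via_test_and_set(int bin, int value)
{
    while (lockbit[bin].exchange(1, std::memory_order_acquire) != 0)
        ;                                   // spin: test-and-set until we own the word
    int old = bins[bin];                    // ordinary read/modify/write under the bit
    bins[bin] = old + value;
    lockbit[bin].store(0, std::memory_order_release);
    return old;
}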
 
Would those be mapped to the register file if accesses to them depend on a variable?
5.1.2.2 Local Memory
Like the global memory space, the local memory space is not cached, so accesses to local memory are as expensive as accesses to global memory. Local memory accesses are always coalesced though since they are per-thread by definition.

Local memory accesses only occur for some automatic variables as mentioned in Section B.2.4. Inspection of the PTX assembly code (obtained by compiling with the -ptx or -keep option) will tell if a variable has been placed in local memory during the first compilation phases as it will be declared using the .local mnemonic and accessed using the ld.local and st.local mnemonics. If it has not, subsequent compilation phases might still decide otherwise though if they find it consumes too much register space for the targeted architecture. There is no way to check this for a particular variable, but the compiler reports total local memory usage per kernel (lmem) when compiling with the --ptxas-options=-v option.

Automatic variables that are likely to be placed in local memory are large structures or arrays that would consume too much register space, and arrays for which the compiler cannot determine that they are indexed with constant quantities.
Shame the arrays don't have a defined behaviour for out of bounds.

Indexed registers in ATI return the value in GPR0 for out of bounds accesses. Out of bounds on ATI though means out of bounds of the per thread allocated registers. This section, 4.6.3, doesn't mention out of bounds SR indexing, so I'm not sure about that.

Jawed
 
Would those be mapped to the register file if accesses to them depend on a variable?

In CUDA, registers can't be indexed, so arrays have to be in local memory (that is, in the video memory). Therefore, if the array is small enough, it's generally better to put the array into shared memory even if it's not shared at all.
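
A minimal illustration of that: a tiny per-thread array kept in shared memory rather than spilling to local memory, interleaved by threadIdx so each thread's elements sit in their own bank column (sizes and names are made up):

#define PER_THREAD 16   // each thread's private "array" of counters

// launched with blockDim.x * PER_THREAD * sizeof(unsigned int) bytes of dynamic shared memory
__global__ void private_array_in_smem(const unsigned char *in, unsigned int *out, int n)
{
    extern __shared__ unsigned int scratch[];
    // element i of this thread's private array lives at scratch[i * blockDim.x + threadIdx.x]
    for (int i = 0; i < PER_THREAD; i++)
        scratch[i * blockDim.x + threadIdx.x] = 0;

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x) {
        int idx = in[i] & (PER_THREAD - 1);     // data-dependent index: fine in shared memory
        scratch[idx * blockDim.x + threadIdx.x]++;
    }

    for (int i = 0; i < PER_THREAD; i++)        // each thread folds only its own column back out
        atomicAdd(&out[i], scratch[i * blockDim.x + threadIdx.x]);
}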
 
If you increase the block count you have to decrease the number of bins per block to make them all fit simultaneously.

When 2185 bins were originally derived it was effectively done by dividing the full bin count by the number of blocks. So, 1093 bins per block for 60 blocks.

Jawed

I was thinking about the 256-bin case. Reducing the "lanes" from 8 to 4 is probably not good because it increases the chance of bank conflicts.

However, for the 65536-bin case it's probably worth a try. I didn't try this because I wasn't sure whether the memory controller was smart enough not to load the same data for each core. However, now it seems to be smart enough, so I think this is probably worth trying, as on GTX 285 a core can have at most 1024 active threads, which is more than the maximum number of threads allowed per block (512). This may be able to hide more memory latency.

Unfortunately, now I'm at home so I don't have access to the GTX 285 I was using. So it has to wait until next Monday :)
 
Nice work. So GTX285 is now chasing 4.5GB/s. That's about half the available DDR bandwidth on your processor, isn't it?

On NVidia a single pass for 65536 bins and nibble-sized in-register histogram bins won't fit as far as I can tell - the register file per core is just too small.

But it would work over say 4 passes, e.g. 60 blocks and 64-threads per block, with 2 blocks per core requires 4096 registers to enable each thread to hold 256 nibble-sized bins in its private 32 registers. That is if indexed registers aren't a complete disaster zone, performance-wise.

Jawed

Actually the 3 GB/s already included the 50% optimization. With some more tuning it now does 3.3GB/s. Memory reading now seems to be the bottleneck, as even with the data set to all zeros it does not go much faster. A Core i7 should do better.
Here's the source if anyone would like to try it:

#include "windows.h"
#include "stdio.h"
#include <conio.h>
#include <ctype.h>

int nCores = 4; // change according to number of threads

static const int MAX_CORES = 16;
HANDLE histogramThread[MAX_CORES];

#define SIZE (128*1024*1024)
unsigned int histogram_total[1<<16];
unsigned int histogram32c[MAX_CORES][1<<16];
unsigned char histogram8c[MAX_CORES][1<<16];
unsigned short dat[SIZE];

DWORD WINAPI HistogramThread(void *arg)
{
    int core = (int)(size_t)arg;
    unsigned int *histogram32 = histogram32c[core];
    unsigned char *histogram8 = histogram8c[core];

    for (int i = core*SIZE/nCores; i < (core+1)*SIZE/nCores; i++)
    {
        unsigned int d = dat[i];

        // bump the 1-byte bin; carry into the 4-byte bin when it wraps
        if (++histogram8[d] == 0)
            histogram32[d]++;
    }
    return 0;
}


void main()
{
    for (int i = 0; i < SIZE; i++)
        dat[i] = 2*rand() + (rand()&1);   // 16 random bits per element
    for (int core = 0; core < nCores; core++)
        for (int i = 0; i < 1<<16; i++)
        {
            histogram_total[i] = 0;
            histogram8c[core][i] = 0;
            histogram32c[core][i] = 0;
        }

    // for timing
    LARGE_INTEGER tm0, tm1;
    LARGE_INTEGER fq;
    QueryPerformanceFrequency(&fq);
    QueryPerformanceCounter(&tm0);

    // create and start threads
    for (int core = 0; core < nCores; core++)
        histogramThread[core] = CreateThread(NULL, 0, HistogramThread, (void*)(size_t)core, 0, NULL);

    // wait till all threads have finished
    WaitForMultipleObjects(nCores, histogramThread, true, INFINITE);

    // calc resulting histogram
    int checksum = 0;
    for (int i = 0; i < 1<<16; i++)
    {
        for (int core = 0; core < nCores; core++)
        {
            histogram_total[i] += (histogram32c[core][i] << 8) + histogram8c[core][i];
        }
        checksum += histogram_total[i];
    }

    QueryPerformanceCounter(&tm1);
    float tm = (float)((tm1.QuadPart-tm0.QuadPart)/(float)fq.QuadPart)*1000;
    printf("Time %f ms, %f MB/s, checksum %d ", tm, 256*1000/tm, checksum);
    int i = _getche();

    for (int core = 0; core < nCores; core++)
        CloseHandle(histogramThread[core]);
}
 
I tried your code on my A64 X2 at 2GHz with 2GB 400MHz DDR on XP 32-bit, and got the following results:
  • 1 thread - 480MB/s
  • 2 threads - 1060MB/s
I'm not sure how much the two x264 encode jobs (idle priority) that I've got currently running will interfere, but that seems pretty decent for this old girl.

Jawed
 
I tried Voxilla's program on my Core i7 920 (2.66GHz); the result is about 3.5GB/s. The 65536-bin histogram with Voxilla's smaller-histogram optimization now runs much faster, at ~3.9GB/s compared to ~1.7GB/s before. I think Voxilla's idea is even more important for Core i7 because its L2 cache is rather small at 256KB, compared to Core 2's much larger L2.

On the GPU front, it's quite grim though. The idea of making each block's share of the bins smaller in order to run more blocks at the same time does not fare very well. Using 1093 bins per block and running 60 blocks instead of 30, it performs worse, at about 1GB/s (instead of ~1.6GB/s).

Unfortunately it's difficult to incorporate Voxilla's idea into the GPU version. 16KB of shared memory is just too small. Furthermore, the atomic operations only support 32-bit integers, so they won't work on smaller data types.

For the 256-bin histogram on the CPU, since I suspect it's now limited by the CPU's arithmetic performance, I decided to make some optimizations: first, I unroll the loop by 8. Second, I use a separate histogram for each unrolled "stage", i.e. 8 histograms. Since a histogram only takes 1KB of memory, this doesn't put too much pressure on the cache. The reason for this is to reduce false dependencies (the CPU is good at removing false dependencies on registers, not so good on memory). This way, I can push the performance up to ~4.4GB/s, which is basically the same as the GPU reading directly from main memory.
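
Roughly what the unrolled inner loop looks like (a sketch of the idea, not the exact code behind the numbers above):

void hist256_unroll8(const unsigned char *data, int n, unsigned int *out)
{
    unsigned int h[8][256] = {};            // one 1KB histogram per unrolled "stage"
    int i = 0;
    for (; i + 8 <= n; i += 8) {            // the 8 increments hit 8 independent arrays,
        h[0][data[i + 0]]++;                // so none has to wait on the previous one
        h[1][data[i + 1]]++;
        h[2][data[i + 2]]++;
        h[3][data[i + 3]]++;
        h[4][data[i + 4]]++;
        h[5][data[i + 5]]++;
        h[6][data[i + 6]]++;
        h[7][data[i + 7]]++;
    }
    for (; i < n; i++)                      // leftover tail
        h[0][data[i]]++;
    for (int b = 0; b < 256; b++) {         // merge the 8 partial histograms
        unsigned int s = 0;
        for (int k = 0; k < 8; k++) s += h[k][b];
        out[b] = s;
    }
}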

This optimization is not suitable for 65536 bins because 8 histograms would be too large and completely defeat the purpose of Voxilla's optimization.
 