And Jawed, why would you need 1024 threads to hide latency?
I think the mixture of increments against intensely random addresses and a moderate collision rate makes histogramming heavily dependent on the memory system, which means parallel speed-up can only come if you can hide the latency.
Maybe you can hide all the latency with just a few threads per core and a heavy-weight hashing function?
Jawed
However, from the results of the histogram sample in the CUDA SDK, it can only reach about 10GB/s, which is less than 10% of the theoretical bandwidth. The major issue is that when using shared memory atomic operations, serialization becomes a serious problem. In the worst case, if all eight threads are accessing the same bank, you need 8 clocks to complete all the operations instead of one.
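As a rough illustration of the shared-memory-atomics pattern being described here (a sketch of the general technique, not the SDK sample itself; kernel and variable names are made up):

```
// Per-block 256-bin histogram using shared-memory atomics
// (needs compute capability 1.2+ for atomicAdd on shared memory).
// Illustrative sketch only - not the CUDA SDK sample.
__global__ void histogram256_kernel(const unsigned char *data,
                                    unsigned int *partial_hist,
                                    int n)
{
    __shared__ unsigned int s_hist[256];

    // Zero this block's histogram.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        s_hist[i] = 0;
    __syncthreads();

    // Each thread strides over the input; threads that hit the same
    // bin (or the same bank) serialize on the atomicAdd.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        atomicAdd(&s_hist[data[i]], 1u);
    __syncthreads();

    // Write the per-block result; a separate pass merges the blocks.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        partial_hist[blockIdx.x * 256 + i] = s_hist[i];
}
```

The atomicAdd on s_hist is exactly where the serialization described above happens.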
Histogram is a problem which looks easily parallelizable but is actually very hard to parallelize over multiple cores, as the major bottleneck is the memory subsystem, not the ALU.
Which is why having lots of latency hiding is important - it's the life-blood of GPUs. Maybe do extra work to re-arrange the data so that it's less painful in the bin-increment phase.
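One way to "re-arrange the data" along those lines is to sort the keys first and then read the histogram off the sorted sequence, so there is no scattered bin-increment phase at all. A minimal sketch using Thrust (my illustration, assuming Thrust is available; not something posted in this thread):

```
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include <thrust/iterator/counting_iterator.h>

// Histogram by sorting: after the sort, the cumulative count of bin b is the
// position where value b stops appearing, so a vectorized upper_bound gives
// the cumulative histogram and adjacent_difference yields per-bin counts.
void histogram_by_sort(thrust::device_vector<unsigned short> &keys,
                       thrust::device_vector<unsigned int> &hist) // e.g. 65536 bins
{
    thrust::sort(keys.begin(), keys.end());

    thrust::counting_iterator<unsigned int> bins(0);
    thrust::upper_bound(keys.begin(), keys.end(),
                        bins, bins + hist.size(),
                        hist.begin());

    thrust::adjacent_difference(hist.begin(), hist.end(), hist.begin());
}
```

The sort does extra work, but its memory traffic is far more regular than random bin increments, which is exactly the trade being suggested.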
I did a simple experiment with a 16-bit histogram test (i.e. 65536 bins). Data size is 128M 16-bit numbers. The input image is random data.
CPU: Core i7 920
CPU (1 thread): ~800MB/s.
CPU (2 threads): ~1150MB/s.
CPU (4 threads): ~1650MB/s.
CPU (8 threads): ~1750MB/s.
GPU: GeForce GTX 285
First algorithm: multiple histograms, no atomics
GPU (256 threads): ~ 500MB/s
GPU (512 threads): ~ 650MB/s
GPU (1024 threads): ~ 700MB/s
Second algorithm: single histogram, with global atomics
GPU (256 threads): ~ 500MB/s
GPU (512 threads): ~ 850MB/s
GPU (1024 threads): ~ 900MB/s
GPU (16384 threads): ~ 900MB/s
I used 64-bit Windows 7, and the profiler is not working for some mysterious reason, so I can't do further optimization.
Interestingly, even with the first algorithm using 1024 threads (which needs 256MB for storing histograms!) the reduction is extremely quick, since the data access pattern is completely linear.
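For concreteness, the two variants described above presumably look something like the following sketches (my reconstruction from the description, not the code used for the numbers above):

```
// Variant 2: a single 65536-entry histogram in global memory; every thread
// strides over the input and does one global atomicAdd per element.
__global__ void hist16_global_atomics(const unsigned short *data,
                                      unsigned int *hist,   // 65536 bins
                                      int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        atomicAdd(&hist[data[i]], 1u);
}

// Reduction step for variant 1 ("multiple histograms, no atomics"): with the
// per-thread histograms laid out as T rows of 65536 bins, one thread per bin
// sums its column. Consecutive threads read consecutive addresses, so the
// accesses are fully coalesced - which is why this step is so quick.
__global__ void reduce_histograms(const unsigned int *per_thread_hist, // T x 65536
                                  unsigned int *hist, int T)
{
    int bin = blockIdx.x * blockDim.x + threadIdx.x;   // 0..65535
    if (bin >= 65536) return;
    unsigned int sum = 0;
    for (int t = 0; t < T; ++t)
        sum += per_thread_hist[t * 65536 + bin];
    hist[bin] = sum;
}
```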
Interesting results, although that's even worse than I expected for "random" data. I think at the very least this data demonstrates that this is one place that GPUs - and I'd argue somewhat due to the programming model - are doing way worse than they ought to be relative to CPUs.
Ooh, very interesting.
On the CPU did you give each thread a local histogram and then do a reduction at the end?
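For reference, the usual shape of that approach as a host-side C++ sketch (illustrative only, not the code behind the CPU numbers above; 65536 bins as in the experiment):

```
#include <cstdint>
#include <thread>
#include <vector>

// Each worker counts into its own private 65536-bin histogram (no sharing,
// no atomics); the main thread then sums the per-thread histograms bin by bin.
void cpu_histogram(const uint16_t *data, size_t n,
                   uint32_t *hist, int num_threads)   // hist has 65536 bins
{
    std::vector<std::vector<uint32_t> > local(num_threads,
                                              std::vector<uint32_t>(65536, 0));
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t)
        workers.emplace_back([&, t] {
            for (size_t i = t; i < n; i += num_threads)
                ++local[t][data[i]];
        });
    for (auto &w : workers) w.join();

    // Reduction: sum the per-thread histograms bin by bin.
    for (int b = 0; b < 65536; ++b) {
        uint32_t sum = 0;
        for (int t = 0; t < num_threads; ++t) sum += local[t][b];
        hist[b] = sum;
    }
}
```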
So atomics are faster once the GPU gets to 512 threads, but it seems like 1024 threads is the limit for scaling, which is ~1 warp per core, which is pretty tragic.
Doing this with Dickens would require indexing/hashing of the words, which should add a lot of computational work. Would this sway things more in favour of the GPU?
I wonder how this would work on ATI, where there are only 10 cores fighting over memory?
Jawed mentioned before that it had 8 64-bit MCs (makes sense that global atomics have 64-bit operations). Which would be 64 bytes wide, yet they can do a 32-byte wide segment. So is it just two groups of 4 MCs always fetching a 32-byte wide segment?
This is my understanding: each MC is connected to a pair of memory chips. Each chip has a 32-bit interface. The minimum burst length is 4, which gives a 16-byte burst per chip, or a 32B burst per MC.
GDDR5 has a minimum burst length of 8, so if you want 32B bursts (8 beats x 32 bits = 32 bytes) then you want to have 32 bits of IO per MC. Otherwise you end up with 64B bursts.
Jawed
I am no CUDA guru, but I think that this CPU vs GPU comparison is an unfair match-up (for the GPU). IIRC, there have been histogram generation apps demoed on the CUDA Zone which show a real speed-up. Maybe considering them would be a good idea.
64 and 256 bin examples in the SDK, as far as I can tell.
Seems right to me. For the 8 MCs, would global memory be banked 8-way with 32B per way?
In comparison with shared memory being banked 16-way with 4B per way?
So then you need some more-elaborate scheme. I can't help thinking that you could just simply multi-pass the samples, with 4096 bins per core. e.g. with 4 cores the first pass does bins 1-16384, second pass does 16385-32768 etc.
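A sketch of what one pass of that could look like, with each block ("core") owning a 4096-bin window (4096 x 4B = 16KB, i.e. one core's worth of shared memory). This is my illustration of the idea, not tested code:

```
// One pass of the multi-pass scheme: block b of this launch owns the 4096-bin
// window starting at (first_window + b) * 4096, scans the whole input and only
// counts keys that land in its window. With W blocks a pass covers W*4096 bins,
// so a 65536-bin histogram needs 65536 / (W*4096) passes. Note: on GT200 the
// full 16KB static allocation may have to shrink slightly, since kernel
// arguments also live in shared memory there.
#define WINDOW_BINS 4096

__global__ void hist16_window_pass(const unsigned short *data, int n,
                                   unsigned int *hist,        // full 65536 bins
                                   unsigned int first_window) // window of block 0
{
    __shared__ unsigned int s_hist[WINDOW_BINS];
    const unsigned int bin_base = (first_window + blockIdx.x) * WINDOW_BINS;

    for (int i = threadIdx.x; i < WINDOW_BINS; i += blockDim.x)
        s_hist[i] = 0;
    __syncthreads();

    // Every block re-reads the entire input and keeps only its own window -
    // this is the read-bandwidth "waste" discussed below.
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        unsigned int key = data[i];
        if (key - bin_base < WINDOW_BINS)   // unsigned trick: in-window test
            atomicAdd(&s_hist[key - bin_base], 1u);
    }
    __syncthreads();

    // Windows are disjoint, so no atomics are needed for the final write.
    for (int i = threadIdx.x; i < WINDOW_BINS; i += blockDim.x)
        hist[bin_base + i] = s_hist[i];
}
```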
To avoid repeated reading from the same addresses, maybe the texture cache can be used. However, I'm not sure whether the texture cache is shared among the cores. On the other hand, it's really not limited by read bandwidth right now, so even a 30-times waste in reads is maybe acceptable.