LRB - ditching x86?

Only the block size is limited to 512 threads. Multiple blocks can execute per core simultaneously if registers and shared memory allow. On GT200, blocks_per_core * threads_per_block <= 1024.

And Jawed, why would you need 1024 threads to hide latency?
 
And Jawed, why would you need 1024 threads to hide latency?
I think the mixture of increments against intensely random addresses and a moderate collision rate makes histogramming heavily dependent on the memory system, which means parallel speed-up can only come if you can hide the latency.

Maybe you can hide all the latency with just a few threads per core and a heavy-weight hashing function?

Jawed
 
I think the mixture of increments against intensely random addresses and a moderate collision rate makes histogramming heavily dependent on the memory system, which means parallel speed-up can only come if you can hide the latency.

Maybe you can hide all the latency with just a few threads per core and a heavy-weight hashing function?

Jawed

I think the problem, beyond a certain number of threads, is no longer about hiding the latency. The problem is more about the inefficiency of random access. In the worst case, you may have to read, say, 32 bytes in order to use only 4 bytes. Therefore, even if you can hide the latency completely, you still lose memory bandwidth.

However, judging from the results of the histogram sample in the CUDA SDK, it can only reach about 10GB/s, which is less than 10% of the theoretical bandwidth. The major problem is that when using atomic shared memory operations, serialization becomes a serious issue. In the worst case, if all eight threads are accessing the same bank, you need 8 clocks to complete all the operations instead of one.

Histogram is a problem which looks easily parallelizable but is actually very hard to parallelize over multiple cores, as the major bottleneck is the memory subsystem, not the ALU.
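For reference, the kind of kernel I'm talking about looks roughly like this (just an untested sketch with made-up names, not the actual SDK code; it needs compute 1.2+ for shared-memory atomics) - a per-block sub-histogram in shared memory, updated with shared atomics and merged into global memory at the end:

Code:
__global__ void hist256_smem(const unsigned char *data, int n,
                             unsigned int *g_bins)
{
    __shared__ unsigned int s_bins[256];

    // clear the per-block sub-histogram
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        s_bins[i] = 0;
    __syncthreads();

    // grid-stride loop over the input; the atomicAdd serializes
    // whenever several threads of a half-warp hit the same bank
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        atomicAdd(&s_bins[data[i]], 1);
    __syncthreads();

    // merge the per-block sub-histogram into the global histogram
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&g_bins[i], s_bins[i]);
}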
 
However, judging from the results of the histogram sample in the CUDA SDK, it can only reach about 10GB/s, which is less than 10% of the theoretical bandwidth. The major problem is that when using atomic shared memory operations, serialization becomes a serious issue. In the worst case, if all eight threads are accessing the same bank, you need 8 clocks to complete all the operations instead of one.

Isn't the worst case rather 16 threads of a half-warp having a 16-way shared memory bank conflict on the current architecture?
 
Histogram is a problem which looks easily parallelizable but is actually very hard to parallelize over multiple cores, as the major bottleneck is the memory subsystem, not the ALU.
Which is why having lots of latency hiding is important - it's the life-blood of GPUs. Maybe do extra work to re-arrange the data so that it's less painful in the bin-increment phase.
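For example, something along these lines with Thrust (a completely untested, hand-wavy sketch; the function name is made up) - sort the samples first, so the bin counts fall out of a couple of streaming passes over sorted data instead of scattered increments:

Code:
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include <thrust/iterator/counting_iterator.h>

// Untested sketch: histogram by sorting. The bin-increment phase becomes
// linear passes, at the cost of the (considerable) extra work of the sort.
void hist_by_sort(thrust::device_vector<unsigned short> &data,
                  thrust::device_vector<unsigned int> &bins)  // 65536 entries
{
    thrust::sort(data.begin(), data.end());

    // for each bin b, count of samples <= b (cumulative histogram)
    thrust::counting_iterator<unsigned int> search(0);
    thrust::upper_bound(data.begin(), data.end(),
                        search, search + bins.size(),
                        bins.begin());

    // per-bin counts = adjacent differences of the cumulative counts
    thrust::adjacent_difference(bins.begin(), bins.end(), bins.begin());
}

No idea whether the sort pays for itself here, but at least the memory traffic becomes nicely streaming.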

Jawed
 
I did a simple experiment with a 16-bit histogram test (i.e. number of bins = 65536). Data size is 128M 16-bit numbers. The input data is random.

CPU: Core i7 920
CPU (1 thread): ~800MB/s.
CPU (2 threads): ~1150MB/s.
CPU (4 threads): ~1650MB/s.
CPU (8 threads): ~1750MB/s.

GPU: GeForce GTX 285
First algorithm: multiple histograms, no atomics
GPU (256 threads): ~ 500MB/s
GPU (512 threads): ~ 650MB/s
GPU (1024 threads): ~ 700MB/s

Second algorithm: single histogram, with global atomics
GPU (256 threads): ~ 500MB/s
GPU (512 threads): ~ 850MB/s
GPU (1024 threads): ~ 900MB/s
GPU (16384 threads): ~ 900MB/s

I used Windows 7 64-bit, and the profiler is not working for some mysterious reason, so I can't make additional optimizations.

Interestingly, even with the first algorithm using 1024 threads (which needs 256MB for storing histograms!) the reduction is extremely quick, since the data access pattern is completely linear.
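For the curious, the global atomics variant is basically nothing more than this (a rough sketch from memory with made-up names, not my exact code):

Code:
// 65536 bins of 32-bit counts live directly in global memory
__global__ void hist16_global_atomics(const unsigned short *data, int n,
                                      unsigned int *g_bins)
{
    // grid-stride loop; every increment is a random global access + atomic
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        atomicAdd(&g_bins[data[i]], 1);
}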
 
I did a simple experiment with a 16-bit histogram test (i.e. number of bins = 65536). Data size is 128M 16-bit numbers. The input data is random.
Interesting results, although that's even worse than I expected for "random" data. I think at the very least this data demonstrates that this is one place that GPUs - and I'd argue somewhat due to the programming model - are doing way worse than they ought to be relative to CPUs.
 
I did a simple experiment with a 16-bit histogram test (i.e. number of bins = 65536). Data size is 128M 16-bit numbers. The input data is random.
Ooh, very interesting.

On the CPU did you give each thread a local histogram and then do a reduction at the end?

So atomics are faster once the GPU gets to 512 threads, but it seems like 1024 threads is the limit for scaling, which is ~1 warp per core, which is pretty tragic.

Doing this with Dickens would require indexing/hashing of the words, which should add a lot of computational work. Would this sway things more in favour of the GPU?

I wonder how this would work on ATI where there's only 10 cores fighting over memory?

Jawed
 
Ooh, very interesting.

On the CPU did you give each thread a local histogram and then do a reduction at the end?

Yes, there is no need to let 4 puny cores fight over the same histogram :)
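i.e. roughly this shape (sketch only, thread creation left out and names made up): each thread fills a private 64K-entry array over its slice of the data, and the partials are summed at the end:

Code:
// each worker thread histograms its own slice into a private array
void cpu_hist_worker(const unsigned short *slice, size_t count,
                     unsigned int *local_bins)   // 65536 entries, zeroed
{
    for (size_t i = 0; i < count; ++i)
        ++local_bins[slice[i]];                  // no sharing, no atomics
}

// afterwards, sum the per-thread partial histograms
void cpu_hist_reduce(unsigned int *const *local_bins, int num_threads,
                     unsigned int *bins)         // 65536 entries, zeroed
{
    for (int b = 0; b < 65536; ++b)
        for (int t = 0; t < num_threads; ++t)
            bins[b] += local_bins[t][b];
}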

So atomics are faster once the GPU gets to 512 threads, but it seems like 1024 threads is the limit for scaling, which is ~1 warp per core, which is pretty tragic.

I think the problem is still the memory. Unfortunately, since the profiler is not working, I can't get more detailed information on why it's so slow (GT200 does have a quite impressive array of profiling registers). I'll try it under Windows XP if I have some free time.

Doing this with Dickens would require indexing/hashing of the words, which should add a lot of computational work. Would this sway things more in favour of the GPU?

Possible, but it's also more complex. Since word lengths are not equal, the memory access pattern is going to be more unpredictable. Also, while the GPU does have more compute power, even the CPU has a lot of unused compute power when computing a normal histogram, which may already be enough for the indexing/hashing work.

I wonder how this would work on ATI where there's only 10 cores fighting over memory?

I'm not sure about this, but the problem is still mostly about how the memory controller copes with random access.
 
I did a simple experiment with a 16-bit histogram test (i.e. number of bins = 65536). Data size is 128M 16-bit numbers. The input data is random.

Results are interesting.

128M 16-bit shorts as input into 65536 32-bit int bins. Did I read that correctly?

And what algorithm did you use for this test?

With regards to memory performance, does anyone know (or want to speculate) how the MCs carve up global memory based on address on the GT200 series? Jawed mentioned before that it has 8 64-bit MCs (which makes sense, given that global atomics have 64-bit operations). That would be 64 bytes wide, yet they can do a 32-byte wide segment. So is it just two groups of 4 MCs, each always fetching a 32-byte wide segment?

Reason I'm asking is to get an idea of how global atomics work in parallel close to the MCs. Seems like they should be able to do at least 8 global atomics in parallel assuming they hit 2 32-byte segments without address collision. I wonder if the MCs can do 2 parallel 32-bit atomics or just one 32-bit or 64-bit atomic.

A little OT, but I wonder if future GPU workloads would see a benefit from fewer shared address lines and a smaller minimum segment size, given the extra area and pin cost?
 
900MB/s is 900 / 2 = 450M samples per second. GTX285 is theoretically capable of 1476 * 240 = 354,240M instructions per second, so that's ~787 cycles per sample.

While huge bin sets are unable to fit in a core's shared memory, you could partition bins across the cores' shared memories - so bin 21574 only exists in core 9's shared memory (65536 / 30 = 2185 bins per core). In this scenario every core evaluates every sample to see if it's owned by that core and does a shared memory atomic update if so.

Would this be faster than 787 / 30 cores = 26 cycles per sample?

But this approach doesn't scale - e.g. a GPU with only 4 cores can't do this.

So then you need some more-elaborate scheme. I can't help thinking that you could just simply multi-pass the samples, with 4096 bins per core. e.g. with 4 cores the first pass does bins 1-16384, second pass does 16385-32768 etc.
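Something like this, perhaps (a completely untested sketch with invented names; I've used 2048 bins per pass rather than 4096 so the sub-histogram comfortably fits within the usable part of the 16KB of shared memory):

Code:
#define BINS_PER_PASS 2048   // 2048 * 4B = 8KB of shared memory

__global__ void hist_pass(const unsigned short *data, int n,
                          unsigned int *g_bins, int bin_base)
{
    __shared__ unsigned int s_bins[BINS_PER_PASS];

    for (int i = threadIdx.x; i < BINS_PER_PASS; i += blockDim.x)
        s_bins[i] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        int bin = (int)data[i] - bin_base;
        if (bin >= 0 && bin < BINS_PER_PASS)   // other passes handle the rest
            atomicAdd(&s_bins[bin], 1);
    }
    __syncthreads();

    for (int i = threadIdx.x; i < BINS_PER_PASS; i += blockDim.x)
        atomicAdd(&g_bins[bin_base + i], s_bins[i]);
}

// host side: 65536 / 2048 = 32 passes over the samples
// for (int base = 0; base < 65536; base += BINS_PER_PASS)
//     hist_pass<<<blocks, threads>>>(d_data, n, d_bins, base);

The obvious cost is re-reading the full 256MB of samples once per pass.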

Jawed
 
Jawed mentioned before that it has 8 64-bit MCs (which makes sense, given that global atomics have 64-bit operations). That would be 64 bytes wide, yet they can do a 32-byte wide segment. So is it just two groups of 4 MCs, each always fetching a 32-byte wide segment?
This is my understanding: each MC is connected to a pair of memory chips. Each chip has a 32-bit interface. The minimum burst length is 4, which gives a 16-byte burst per chip, or a 32B burst per MC.

GDDR5 has a minimum burst length of 8, so if you want 32B bursts then you want to have 32-bits of IO per MC. Otherwise you end up with 64B bursts.

Jawed
 
This is my understanding: each MC is connected to a pair of memory chips. Each chip has a 32-bit interface. The minimum burst length is 4, which gives a 16-byte burst per chip, or a 32B burst per MC.

GDDR5 has a minimum burst length of 8, so if you want 32B bursts then you want to have 32-bits of IO per MC. Otherwise you end up with 64B bursts.

Jawed

So am I understanding this correctly now then,

For the 8 MCs, would global memory be banked 8-way with 32B per way?
In comparison with shared memory being banked 16-way with 4B per way?

It's too bad I'm not at home, I could just test CUDA write performance to figure this stuff out...
 
I did a simple experiment with a 16-bit histogram test (i.e. number of bins = 65536). Data size is 128M 16-bit numbers. The input data is random.

I am no CUDA guru, but I think that this is an unfair match-up (for the GPU). 64K bins, each 4 bytes = 256KB, which fits in the private L2 cache of each core on an i7. The input data is streamed through.

IMHO, a better way would be to make the number of blocks proportional to the amount of data along the x axis and have 16 blocks in the y direction. Along the y axis, split up the bins so that each block row along y tackles only 4K bins (to keep them in shared memory) and use shared atomics. Global memory latency is known to be 500-700 cycles, so Jawed's estimate of ~787 cycles per sample suggests that in this case you are experiencing the full blast of the entire memory latency. Using shared memory would be much better, IMHO.
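In launch-configuration terms I mean something like this (a hand-wavy sketch; the kernel and variable names are made up):

Code:
// x scales with the amount of data, y picks which 4K-bin slice a block owns
dim3 block(256);
dim3 grid(num_data_chunks, 16);               // 16 * 4096 = 65536 bins
hist16_sliced<<<grid, block>>>(d_data, n, d_bins);

// inside hist16_sliced: bin_base = blockIdx.y * 4096; keep that slice of
// bins in shared memory (in practice slightly fewer than 4K, since a bit
// of shared memory is reserved), atomicAdd into it, and flush the slice
// to the global histogram at the end of the block.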

IIRC, there have been histogram generation apps demoed on CUDA Zone which show a real speed-up. Maybe considering them would be a good idea.
 
For the 8 MCs, would global memory be banked 8-way with 32B per way?
In comparison with shared memory being banked 16-way with 4B per way?
Seems right to me.

The next question is how memory addresses are tiled/striped across MCs... What happens on GPUs with non-power-of-2 MC count?

Also, since banking is a feature of the internal architecture of memory chips, it can get confusing to talk about banks in DDR, so just bear that potential for confusion in mind.

Jawed
 
So then you need some more-elaborate scheme. I can't help thinking that you could just simply multi-pass the samples, with 4096 bins per core. e.g. with 4 cores the first pass does bins 1-16384, second pass does 16385-32768 etc.

I was thinking about this, and I have an idea. Although the shared memory is 16KB, it can't be used completely, because some of it is reserved for other purposes. Since a GeForce GTX 285 has 30 multiprocessors, each core needs to store 2185 bins.

To avoid repeated reading from the same addresses, maybe the texture cache can be used. However, I'm not sure whether the texture cache is shared among the cores. On the other hand, it's really not limited by read bandwidth right now, so even a 30-times waste is, maybe, acceptable :)
 
To avoid repeated reading from the same addresses, maybe the texture cache can be used. However, I'm not sure whether the texture cache is shared among the cores. On the other hand, it's really not limited by read bandwidth right now, so even a 30-times waste is, maybe, acceptable :)

Isn't TEX L1 shared across 3 cores, and TEX L2 shared across all?
 