Quote:
While huge bin sets are unable to fit in a core's shared memory, you could partition the bins across the cores' shared memories - so bin 21574 only exists in core 9's shared memory (65536 / 30 ≈ 2185 bins per core). In this scenario every core evaluates every sample to see whether it owns that sample's bin and, if so, does a shared-memory atomic update.

Depends on the ratio of data to bins, and on the distribution of course. For the problems I'm thinking of it wouldn't work particularly well, because there is a lot of overlap between which bins are touched by different cores, to the point that you could easily have a large percentage of the cores doing nothing because they never see any data for the bins they are responsible for.
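For concreteness, here is a minimal CUDA sketch of the bin-partitioning idea quoted above, assuming 16-bit sample values indexing a 65536-bin histogram and one block per core; the names and launch parameters are illustrative, not the actual benchmark code:

```cuda
#define NUM_BINS 65536

// One block per core owns a contiguous slice of the bins. Every block reads
// the entire sample stream and counts only the bins in its own slice, using
// shared-memory atomics (compute capability 1.2+, fine on GT200).
__global__ void partitionedHistogram(const unsigned short *samples, int n,
                                     unsigned int *bins)
{
    int binsPerBlock = (NUM_BINS + gridDim.x - 1) / gridDim.x;
    int binStart = blockIdx.x * binsPerBlock;
    int binEnd   = min(binStart + binsPerBlock, NUM_BINS);

    // Per-block slice of counters in dynamic shared memory.
    extern __shared__ unsigned int localBins[];
    for (int i = threadIdx.x; i < binEnd - binStart; i += blockDim.x)
        localBins[i] = 0;
    __syncthreads();

    // Every block walks every sample; only the bins it owns are counted.
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        int b = samples[i];
        if (b >= binStart && b < binEnd)
            atomicAdd(&localBins[b - binStart], 1);
    }
    __syncthreads();

    // Slices are disjoint, so each block can store its counters directly.
    for (int i = threadIdx.x; i < binEnd - binStart; i += blockDim.x)
        bins[binStart + i] = localBins[i];
}
```

With 30 blocks that is ceil(65536 / 30) = 2185 bins, i.e. about 8.5KB of dynamic shared memory per block: partitionedHistogram<<<30, 512, 2185 * sizeof(unsigned int)>>>(d_samples, n, d_bins).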
Quote:
Really though, this is what a cache buys you very efficiently. If you're only touching a small subset of bins (on each core), then only those get pulled in and the rest never leave global memory. And even if you end up touching more in a few cases, caches give you "soft edges" on performance vs. any static partitioning scheme.

Most of the time, yes ... though of course with the example used here it's not the case. LRU is not optimal for essentially random access when the dataset doesn't fit, so you'd still need some software management of the caching (locking or non-temporal hints).
It's not an unfair match-up if that's the real-world problem you happen to have to solve.
Quote:
65536 32-bit entries?

Yeah.
Quote:
Are we going for efficiency, or can we wastefully go for the top performance possible?

Any performance would be a good start!
Quote:
I ran the same program under Windows XP (32-bit), and for some reason the CPU performance is much worse, only about 1 GB/s. It probably has something to do with worse SMT management in Windows XP?

Wow, that's grim. Linux for the win?
Quote:
I also tried the idea in my previous post based on Jawed's idea, using 30 cores with 2185 bins per core.

Hmm, that was what I was trying to say; I guess I didn't describe it well.
Quote:
The performance is currently the best, with a peak of about 1.8 GB/s (under Windows XP). Using a texture does not help at all and actually reduces performance. I guess GT200's memory controller is smart enough to understand that those cores are actually reading from the same memory addresses.

Nice, so about 400 cycles per sample. In the texturing version are you using linearly ordered coordinates (e.g. reading in row-major order)? Texturing wants rasterisation order, so I'm guessing the caching breaks down.
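For reference, here is a texture-path variant of the earlier partitionedHistogram sketch (same hypothetical names). It is only a guess at what "using texture" looked like in the benchmark: the linear sample buffer is bound to a 1D texture reference (the era-appropriate API, since removed from modern CUDA) so sample reads go through the texture cache.

```cuda
#define NUM_BINS 65536  // as before

// Samples are fetched through a 1D texture bound to the linear buffer.
texture<unsigned short, 1, cudaReadModeElementType> texSamples;

__global__ void partitionedHistogramTex(int n, unsigned int *bins)
{
    int binsPerBlock = (NUM_BINS + gridDim.x - 1) / gridDim.x;
    int binStart = blockIdx.x * binsPerBlock;
    int binEnd   = min(binStart + binsPerBlock, NUM_BINS);

    extern __shared__ unsigned int localBins[];
    for (int i = threadIdx.x; i < binEnd - binStart; i += blockDim.x)
        localBins[i] = 0;
    __syncthreads();

    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        int b = tex1Dfetch(texSamples, i);          // read through the texture cache
        if (b >= binStart && b < binEnd)
            atomicAdd(&localBins[b - binStart], 1);
    }
    __syncthreads();

    for (int i = threadIdx.x; i < binEnd - binStart; i += blockDim.x)
        bins[binStart + i] = localBins[i];
}

// Host side, before the launch:
//   cudaBindTexture(0, texSamples, d_samples, n * sizeof(unsigned short));
```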
Does performance vary much with the number of threads?
Quote:
Yes, it's about 1000 MB/s when only 256 threads are used. Unfortunately it can't be more than 512 threads, otherwise I think it could be even better.

I'm puzzled why you can't have multiple blocks per core. Is shared memory bound by blockID? If so, why not just divide the bin space by 60 or 120, etc.?
Quote:
I'm puzzled why you can't have multiple blocks per core. Is shared memory bound by blockID? If so, why not just divide the bin space by 60 or 120, etc.?

Hmm, that's what 3dilettante's suggesting... It's kinda comical to run a sample through 240 blocks, but if it turns out faster, who cares?
Quote:
How would you handle colliding writes, though, with global shared registers?

Threads within a wavefront can't collide with each other, as SRs are indexed by threadID in the hardware.
Quote:
Still by only doing a subset per SIMD so you can do a local reduction first? A bit wasteful.

Need to work out how exactly the histogram mentioned here works, first.

Jawed said (section 2.1.6.1):
This pool of global GPRs can be used to provide many powerful features, including:
- atomic reduction variables per lane (the number depends on the number of GPRs), such as:
  - max, min, small histogram per lane;
- software-based barriers or synchronization primitives.

PS. Hmm, maybe by writing the absolute thread ID to the bottom 10 bits you could do some kind of serialization ...
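To illustrate the serialization idea in that PS: the analogous trick in CUDA shared memory (used by old pre-atomics histogram kernels) stamps a thread ID into the spare bits of the counter and retries until the tagged write sticks. It only resolves collisions between the lock-stepped threads of one warp - the cross-wavefront case asked about above is harder - and the names and bit layout here are purely illustrative:

```cuda
// Tag-and-retry increment for counters several threads may hit at once when
// no atomic is available: the low 27 bits hold the count, the high bits hold
// the ID of the last writer. A thread rewrites until its own tagged value is
// the one that stuck, so each warp needs its own copy of the counters.
__device__ void addToBin(volatile unsigned int *warpBins, int bin,
                         unsigned int threadTag)
{
    unsigned int count;
    do {
        count = warpBins[bin] & 0x07FFFFFF;   // strip the previous writer's tag
        count = threadTag | (count + 1);      // increment and stamp our own tag
        warpBins[bin] = count;                // colliding lanes: one write wins
    } while (warpBins[bin] != count);         // ours didn't stick? try again
}

// threadTag would be e.g. ((unsigned int)(threadIdx.x & 31)) << 27, leaving
// 27 bits of count; the forum idea above is the same mechanism with the ID
// written into the bottom bits instead.
```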
Quote:
Because if you have multiple blocks per core, they have to share the same amount of shared memory, otherwise they can't be active at the same time. For example, suppose a block uses 9KB of shared memory: then two blocks can't actually share one core, because a core only has 16KB of shared memory. That is, if you assign 60 blocks to run on a GTX 285 (which has 30 cores), and each block uses more than 8KB of shared memory, the scheduler will most likely run 30 blocks first and run the remaining 30 after they complete. Therefore, the additional 30 blocks can't help hide any latency.

If you increase the block count you have to decrease the number of bins per block to make them all fit simultaneously.
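Putting numbers on that trade-off, reusing the hypothetical partitionedHistogram sketch from earlier (4-byte counters, 65536 bins, 16KB of shared memory and a 1024-thread limit per core on GT200):

```cuda
//   blocks   bins/block   smem/block   blocks per core by smem alone
//     30        2185        ~8.5 KB        1
//     60        1093        ~4.3 KB        3
//    120         547        ~2.1 KB        7
// (Thread count still caps residency: two 512-thread blocks already hit
// GT200's 1024-threads-per-core limit, so smaller blocks would be needed
// to benefit from more than two resident blocks.)
int    binsPerBlock = (65536 + numBlocks - 1) / numBlocks;   // ceiling division
size_t smemBytes    = binsPerBlock * sizeof(unsigned int);   // dynamic smem per block
partitionedHistogram<<<numBlocks, 512, smemBytes>>>(d_samples, n, d_bins);
```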
Quote:
Hmm, that's what 3dilettante's suggesting... It's kinda comical to run a sample through 240 blocks, but if it turns out faster, who cares?

That's close to what I was suggesting, though it might not have to go that high (though there's no reason it couldn't, for other reasons I haven't thought of).
On ATI the register file can have a portion set aside as "global shared registers" - global meaning global to all wavefronts on the SIMD. The threadID (within the wavefront) implicitly causes a numbered register to be shared by the same "lane" within all wavefronts. I think you can allocate up to 128*64 of these shared registers, i.e. 8192 vec4s or 32768 scalars, i.e. half the register file. But this is only available with IL, which is usually grief. In theory, you could hold 327680 bins in registers (32768 scalars per SIMD across 10 SIMDs).