LRB - ditching x86?

900 MB/s is too slow, but doing essentially random read/modify/writes of 32-bit words is never going to be terribly fast ... that's going to give you what, 1/16th of peak bandwidth at best? If it fits in cache on the i7 but doesn't fit "cache" on the GPU, then I'm not surprised it doesn't make for a very good application for the GPU.

If NVIDIA is really serious about GPGPU they should just put a banked multiported read/write cache on every memory controller in their next architecture (banked in the same way as shared memory is; it's the most economical way of implementing a multiported cache).
 
While huge bin sets won't fit in a single core's shared memory, you could partition the bins across the cores' shared memories - so bin 21574 only exists in core 9's shared memory (65536 / 30 ≈ 2185 bins per core). In this scenario every core evaluates every sample to see if the bin is owned by that core and does a shared memory atomic update if so.
Depends on the ratio of data to bins, and the distribution of course. For the problems that I'm thinking of, it wouldn't work particularly well because there is a lot of overlap between which bins are touched by different cores, to the point that you could easily have a large percentage of the cores just doing nothing because they didn't happen to see any data for the bins that they were responsible for.

Really though, this is what a cache buys you very efficiently. If you're only touching a small subset of bins (on each core), then only those get pulled in and the rest never leave global memory. And even if you end up touching more in a few cases, caches give you "soft edges" on performance vs. any static partitioning scheme.

That's my point - with histogram generation as an example - and I think pcchen's results sort of bear it out. Even if there are better ways to optimize that example, it would be hard to get near the peak theoretical rate for distributions like the one that I described (which is a real world problem, and not a particularly uncommon one) without using a cache in either software or hardware, and I doubt a software-managed cache in CUDA would come close to the performance of a hardware solution. But again, I'd love to be proven wrong!
 
Really though, this is what a cache buys you very efficiently. If you're only touching a small subset of bins (on each core), then only those get pulled in and the rest never leave global memory. And even if you end up touching more in a few cases, caches give you "soft edges" on performance vs. any static partitioning scheme.
Most of the time yes ... of course with the example used here it's not the case. LRU is not optimal for essentially random access when the dataset doesn't fit. So you'd still need to do some software management of the caching (locking or non-temporal hints).

The naive implementation of histogram calculation works perfectly well with completely software-managed caches: you simply cache only the most frequent bins. It's the optimal solution.
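
Something like this in CUDA terms (just a sketch - it assumes a pre-pass, or prior knowledge of the distribution, has already picked the HOT_BINS most frequent bin ids and built a bin-to-slot lookup table; all the names and sizes here are made up for illustration). Hot bins get shared memory atomics, everything else falls through to global memory atomics:

#define HOT_BINS 2048   // 2048 x 4 bytes = 8KB of shared memory per core

__global__ void histo_hot_bins(const unsigned short *data, int n,
                               const int *hotBinId,        // hotBinId[s] = bin id held in hot slot s
                               const short *binToHotSlot,  // 65536 entries: hot slot index, or -1 if cold
                               unsigned int *globalHist)   // 65536 entries, zeroed beforehand
{
    __shared__ unsigned int hotCount[HOT_BINS];
    for (int i = threadIdx.x; i < HOT_BINS; i += blockDim.x)
        hotCount[i] = 0;
    __syncthreads();

    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        int slot = binToHotSlot[data[i]];
        if (slot >= 0)
            atomicAdd(&hotCount[slot], 1u);       // frequent bin: cached in shared memory
        else
            atomicAdd(&globalHist[data[i]], 1u);  // rare bin: straight to DRAM
    }
    __syncthreads();

    // Flush the cached counts back to the global histogram.
    for (int s = threadIdx.x; s < HOT_BINS; s += blockDim.x)
        atomicAdd(&globalHist[hotBinId[s]], hotCount[s]);
}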
 
Just to check my fuzzy thinking, but what exactly would be the histogram's size?

65536 32-bit entries?

Are we going for efficiency, or can we wastefully go for the top performance possible?
 
I ran the same program under Windows XP (32-bit), and for some reason the CPU performance is much worse, only about 1GB/s. It probably has something to do with worse SMT management in Windows XP?

I also tried the idea in my previous post based on Jawed's idea, using 30 cores with 2185 bins per core. The performance is currently the best, with a peak of about 1.8GB/s (under Windows XP). Using textures does not help at all and actually reduces performance. I guess GT200's memory controller is smart enough to understand that those cores are actually reading from the same memory address.
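
Roughly, the scheme looks like this (just a sketch of the idea, not the actual code) - each block owns a contiguous slice of the 65536-bin space and walks the whole image, counting only the samples whose bin falls in its slice:

__global__ void histo_partitioned(const unsigned short *data, int n,
                                  unsigned int *globalHist)      // 65536 entries
{
    const int NUM_BINS = 65536;
    int binsPerBlock = (NUM_BINS + gridDim.x - 1) / gridDim.x;   // 2185 for 30 blocks
    int firstBin = blockIdx.x * binsPerBlock;

    extern __shared__ unsigned int myBins[];   // binsPerBlock counters, sized at launch
    for (int i = threadIdx.x; i < binsPerBlock; i += blockDim.x)
        myBins[i] = 0;
    __syncthreads();

    // Every block reads the entire image; only "owned" bins are updated.
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        int bin = (int)data[i] - firstBin;
        if (bin >= 0 && bin < binsPerBlock)
            atomicAdd(&myBins[bin], 1u);
    }
    __syncthreads();

    // Slices don't overlap, so a plain store is enough.
    for (int i = threadIdx.x; i < binsPerBlock; i += blockDim.x)
        if (firstBin + i < NUM_BINS)
            globalHist[firstBin + i] = myBins[i];
}

// launch: histo_partitioned<<<30, 512, 2185 * sizeof(unsigned int)>>>(d_data, n, d_hist);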

Today a new driver (186.16) was released and I'm going to try it under Windows 7 64-bit. Maybe it'll have a working profiler :)
 
Unfortunately the profiler is still not working under Windows 7 64-bit with the new driver. However, the performance of the new algorithm under Windows 7 is still pretty good at about 1700MB/s (30 blocks with 512 threads each). Using textures reduces its performance to about 1100MB/s.

The drawback of the new algorithm is that it requires a GT200 with a sufficient number of cores (so that the combined shared memory is large enough to fit all the bins). Otherwise it will have to read the image multiple times.
 
I ran the same program under Windows XP (32-bit), and for some reason the CPU performance is much worse, only about 1GB/s. It probably has something to do with worse SMT management in Windows XP?
Wow, that's grim. Linux for the win?

I also tried the idea in my previous post based on Jawed's idea, using 30 cores with 2185 bins per core.
Hmm, that was what I was trying to say, guess I didn't describe it well.

The performance is currently the best, with a peak of about 1.8GB/s (under Windows XP). Using textures does not help at all and actually reduces performance. I guess GT200's memory controller is smart enough to understand that those cores are actually reading from the same memory address.
Nice, so about 400 cycles per sample. In the texturing version are you using linearly ordered coordinates (e.g. reading in row-major order)? Texturing wants rasterisation order, so I'm guessing the caching breaks down.

Does performance vary much with the number of threads?

Jawed
 
Does performance vary much with the number of threads?

Yes, it's about 1000MB/s when only 256 threads are used. Unfortunately, it can't be more than 512 threads per block, otherwise I think it could be even better.
 
Divide the bin space by 240, give each lane its own section of the bin space.
Evaluate the same pixel 240 times, with at most one lane ever adding to its unique section of the distributed histogram in its local shared memory, based on block and thread ID.

Repeat 128M times, then write back the histogram. ;)
 
I decided to write my own 8-bit histogram program (256 bins). Data size is 256MB.

Running on the same CPU (Windows 7):
CPU (1 thread): ~1250MB/s
CPU (2 threads): ~750MB/s (due to setup costs)
CPU (4 threads): ~1250MB/s
CPU (8 threads): ~1500MB/s

On the GPU, I decided to try another method: each core has 8 sets of bins, and within each warp every thread accesses its own set. That is, thread 0 only uses set 0, thread 1 uses set 1, ... thread 8 uses set 0 again, etc. This way, there will be no bank conflicts between threads. Atomic shared memory operations are also used. The number of threads is 512 (again the maximum value) and the number of blocks is 30 (the number of actual multiprocessors in a GTX 285).
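
Roughly like this (not the exact code, and the interleaved layout of the sub-histograms shown here is just one way to keep threads with different set indices in different banks for the same bin):

#define NUM_BINS 256
#define SUB_HISTS 8   // sub-histograms (bin sets) per block

__global__ void histo256_subhists(const unsigned char *data, int n,
                                  unsigned int *globalHist)   // 256 entries, zeroed beforehand
{
    // Interleaved layout: the counter for bin b in set s lives at sub[b * SUB_HISTS + s],
    // so for the same bin, threads with different set indices hit different banks.
    __shared__ unsigned int sub[NUM_BINS * SUB_HISTS];   // 8KB

    for (int i = threadIdx.x; i < NUM_BINS * SUB_HISTS; i += blockDim.x)
        sub[i] = 0;
    __syncthreads();

    int s = threadIdx.x % SUB_HISTS;   // which bin set this thread updates
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&sub[data[i] * SUB_HISTS + s], 1u);
    __syncthreads();

    // Reduce the 8 bin sets and accumulate into the global histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        unsigned int sum = 0;
        for (int k = 0; k < SUB_HISTS; ++k)
            sum += sub[b * SUB_HISTS + k];
        atomicAdd(&globalHist[b], sum);
    }
}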

The result is actually similar to the SDK sample, at about ~11000MB/s. Of course it takes far fewer threads. I also modified the program to use the "zero copy" feature, which allows the GPU to directly access a pinned main memory area through PCIe, so the image does not have to be copied to video memory. The result is ~4300MB/s, still faster than the CPU.
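
For reference, the zero-copy path is just mapped pinned memory - a host-side sketch (CUDA 2.2 style; d_hist is the 256-entry device histogram and the kernel is the sketch above):

// Allocate mapped pinned memory, get a device pointer to it, and let the
// kernel read the image straight over PCIe instead of copying it first.
cudaSetDeviceFlags(cudaDeviceMapHost);   // must be called before the context is created

unsigned char *h_data;                   // pinned, mapped host buffer
int n = 256 * 1024 * 1024;               // 256MB of 8-bit samples
cudaHostAlloc((void **)&h_data, n, cudaHostAllocMapped);
// ... fill h_data with the image ...

unsigned char *d_data;                   // device alias of the same memory
cudaHostGetDevicePointer((void **)&d_data, h_data, 0);

histo256_subhists<<<30, 512>>>(d_data, n, d_hist);   // no cudaMemcpy of the image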

So I guess the conclusion here is that if your GPU has enough internal memory for it, then yes, you are going to get pretty good performance. On the other hand, if your GPU doesn't and you need a really random access pattern, then you are boned >.<
 
Yes, it's about 1000MB/s when only 256 threads are used. Unfortunately, it can't be more than 512 threads per block, otherwise I think it could be even better.
I'm puzzled why you can't have multiple blocks per core. Is shared memory bound by blockID? If so, why not just divide the bin space by 60 or 120 etc.?

Hmm, that's what 3dilettante's suggesting... It's kinda comical to run a sample through 240 blocks, but if it turns out faster who cares?

---

On ATI the register file can have a portion set aside as "global shared registers" - global meaning global to all wavefronts on the SIMD. The threadID (within the wavefront) implicitly causes a numbered register to be shared by the same "lane" within all wavefronts. I think you can allocate up to 128*64 of these shared registers, i.e. 8192 vec4s or 32768 scalars, i.e. half the register file. But this is only available with IL, which is usually grief. In theory, you could hold 327680 bins in registers (32768 per SIMD across 10 SIMDs).

Jawed
 
How would you handle colliding writes though with global shared registers? Still by only doing a subset per SIMD so you can do a local reduction first? A bit wasteful.

PS. hmm, maybe by writing the absolute thread ID to the bottom 10 bits you could do some kind of serialization ...
 
I'm puzzled why you can't have multiple blocks per core. Is shared memory bound by blockID? If so, why not just divide the bin space by 60 or 120 etc.?

Hmm, that's what 3dilettante's suggesting... It's kinda comical to run a sample through 240 blocks, but if it turns out faster who cares?

Because if you have multiple blocks per core, they have to share that core's shared memory, otherwise they can't be active at the same time. For example, suppose a block uses 9KB of shared memory; then two blocks can't actually share one core, because a core only has 16KB of shared memory. That is, if you assign 60 blocks to run on a GTX 285 (which has 30 cores) and each block uses more than 8KB of shared memory, then the scheduler will most likely run 30 blocks first and run the remaining 30 blocks after those have completed. Therefore, the additional 30 blocks can't help hide any latency.
 
How would you handle colliding writes though with global shared registers?
Threads within a wavefront can't collide with each other, as SRs are indexed by threadID in the hardware.

2.6.1.2 of the R700 ISA explains that a stall arises for one of the pair of wavefronts that's trying to simultaneously update an SR. The stall lasts one cycle.

I'm unclear on the granularity of addressing though. If every one of the 64 threads in a wavefront tries to update a different SR (e.g. thread0 updates SR0, thread1 updates SR1) then wouldn't this take 64 cycles? I believe SRs can be indexed, i.e. instead of a literal register:

ADD SR0, SR0, 1

I think you can do:

MOV AR.x, R0.x ; put bin index (0-127) in address register
ADD SR[AR.x], SR[AR.x], 1

Need to see what else I can find.

Still by only doing a subset per SIMD so you can do a local reduction first? A bit wasteful.

PS. hmm, maybe by writing the absolute thread ID to the bottom 10 bits you could do some kind of serialization ...
Need to work out how exactly the histogram mentioned here works, first:

2.1.6.1 said:
This pool of global GPRs can be used to provide many powerful features, including:
  • Atomic reduction variables per lane (the number depends on the number of GPRs), such as:
    • max, min, small histogram per lane.
  • Software-based barriers or synchronization primitives.
Jawed
 
Because if you have multiple blocks per core, they have to share that core's shared memory, otherwise they can't be active at the same time. For example, suppose a block uses 9KB of shared memory; then two blocks can't actually share one core, because a core only has 16KB of shared memory. That is, if you assign 60 blocks to run on a GTX 285 (which has 30 cores) and each block uses more than 8KB of shared memory, then the scheduler will most likely run 30 blocks first and run the remaining 30 blocks after those have completed. Therefore, the additional 30 blocks can't help hide any latency.
If you increase the block count you have to decrease the number of bins per block to make them all fit simultaneously.

When 2185 bins were originally derived it was effectively done by dividing the full bin count by the number of blocks. So, 1093 bins per block for 60 blocks.
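
i.e. the kernel can derive its slice from the grid size, and the launch just shrinks the dynamic shared memory allocation to match - something like this, reusing the partitioned-kernel sketch from earlier in the thread:

int numBlocks    = 60;
int binsPerBlock = (65536 + numBlocks - 1) / numBlocks;   // 1093 for 60 blocks, 547 for 120
histo_partitioned<<<numBlocks, 512, binsPerBlock * sizeof(unsigned int)>>>(d_data, n, d_hist);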

Jawed
 
Hmm, that's what 3dilettante's suggesting... It's kinda comical to run a sample through 240 blocks, but if it turns out faster who cares?
That's close to what I was suggesting, though it might not have to go that high (although there's no reason it couldn't, for other reasons I haven't thought of).

At its most basic, it would be dividing the histogram into at least 30 blocks, with each of the 8 lanes of each SIMD running a block getting 1/8 of the bins in the 1/30.

So it would be a grid of 30 blocks, and every block would wind up evaluating every sample, exploiting the GPU's high emphasis on read bandwidth to broadcast a single sample to the entire chip (or just split the blocks up further and add more threads to hide additional latency).

That way, all 240 physical lanes will evaluate a pixel (possibly not in the same cycle, but close to it so they can reuse the sample), and at most one lane will ever write to its unique subset of the SIMD's shared memory, and the shared memory itself would be unique to the SIMD.

After the initial startup period, there would be an average throughput of 1 histogram update per hot cycle, and it's been guaranteed by how the bins are partitioned that there won't be any collisions, since only one lane can update the bin in question.

Writeback to memory would be straightforward, as there would be no overlap between the blocks.
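
Translated into CUDA terms just to make it concrete (the real thing would be 8 ATI lanes and global shared registers; this is only a sketch, and deliberately redundant - every thread of every block looks at every sample):

#define NUM_BINS 65536

__global__ void histo_broadcast(const unsigned short *data, int n,
                                unsigned int *globalHist)        // NUM_BINS entries
{
    int binsPerBlock = (NUM_BINS + gridDim.x - 1) / gridDim.x;
    int firstBin = blockIdx.x * binsPerBlock;

    extern __shared__ unsigned int myBins[];   // binsPerBlock counters, sized at launch
    for (int i = threadIdx.x; i < binsPerBlock; i += blockDim.x)
        myBins[i] = 0;
    __syncthreads();

    // The whole block reads the same sample at once (same address, so the hope is
    // the read is broadcast rather than replicated), and the single thread that
    // owns the sample's bin bumps its private counter - no collisions, no atomics.
    for (int i = 0; i < n; ++i) {
        int bin = (int)data[i] - firstBin;
        if (bin >= 0 && bin < binsPerBlock && (bin % blockDim.x) == threadIdx.x)
            myBins[bin]++;
    }
    __syncthreads();

    // Each block owns a disjoint slice of the histogram, so plain stores suffice.
    for (int i = threadIdx.x; i < binsPerBlock; i += blockDim.x)
        if (firstBin + i < NUM_BINS)
            globalHist[firstBin + i] = myBins[i];
}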

On ATI the register file can have a portion set aside as "global shared registers" - global meaning global to all wavefronts on the SIMD. The threadID (within the wavefront) implicitly causes a numbered register to be shared by the same "lane" within all wavefronts. I think you can allocate up to 128*64 of these shared registers, i.e. 8192 vec4s or 32768 scalars, i.e. half the register file. But this is only available with IL, which is usually grief. In theory, you could hold 327680 bins in registers (32768 per SIMD across 10 SIMDs).

I was going to post another idea I had of leveraging the GTX's register file (since it's 4x the size of the shared memory), which would have been along those lines.
You mentioned a multi-pass scheme for splitting up the bins, and the register file in theory could avoid spilling to main memory between passes.
At the end of a pass, the contents of the shared memory would be arrayed properly and could be stored to statically indexed registers, as opposed to the data-dependent accesses needed during the pass.

The reg file would fill completely after 4 passes, so that scheme could only stack things three deep before needing to write back.
 
Maybe some type of binary sort to cut down on the ridiculous redundancy ... really though, this just shows that some types of problems can only be solved by better memory subsystems rather than computation.

Ideally we'd have caches with 32-bit banks, one port per bank, and a small FIFO for colliding writes per port (Larrabee should have had this).
 