That's close to what I was suggesting, though it probably doesn't have to go that high (not that there's any reason it couldn't, for other reasons I haven't thought of).
At its most basic, it would divide the histogram into at least 30 blocks, with each of the 8 lanes of the SIMD running a block getting 1/8 of that block's share of the bins.
So it would be a grid of 30 blocks, and every block would wind up evaluating every sample, exploiting the GPU's heavy emphasis on read bandwidth to broadcast a single sample to the entire chip (or you could split the blocks up further and add more threads to hide additional latency).
That way, all 240 physical lanes evaluate each pixel (possibly not in the same cycle, but close enough that they can reuse the sample), at most one lane ever writes to its own unique subset of the SIMD's shared memory, and that shared memory is itself private to the SIMD.
After the initial startup period, average throughput would be one histogram update per hot cycle, and the way the bins are partitioned guarantees there can't be any collisions, since only one lane is ever allowed to update a given bin.