LRB - ditching x86?

I think the salient part of the article is:

We contacted Intel for comment in regards to the above information. Intel denied that any of the above is true.

Note that it wasn't one of those "Intel said it doesn't comment on rumors" or "Intel said it can't disclose information about future products" or anything.

The whole article seems weirdly out there to me. From the odd choice of comparing it to the GT200 on a different process (how big would any future chip be on an older process?), to the claim that we won't see it until summer 2011 (they have a wafer to show us but it's still two years away?), to the way they buried the lede about Intel flat out denying ANY of it is true.
 
So what I was suggesting (in the prior case of a 4-way colliding address, without needing the atomic return value) was a parallel non-atomic AND reduction followed by one single atomicAnd. Easy to do in software if you know in advance which addresses are going to collide. Seems really messy to do this in general in hardware when the software doesn't know which addresses are going to collide.
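To make that concrete, here's a minimal CUDA sketch of the software version, assuming the four colliding lanes are known up front (all names here are mine, purely for illustration):

```cuda
// Sketch: 4 threads are known in advance to collide on the same word with
// atomicAnd, and nobody needs the return value. Combine their values with a
// non-atomic tree reduction in shared memory, then issue ONE atomicAnd.
__global__ void combinedAtomicAnd(unsigned int *addr, const unsigned int *in)
{
    __shared__ unsigned int vals[4];
    int t = threadIdx.x;              // sketch assumes a single 4-thread block

    vals[t] = in[t];
    __syncthreads();

    // log2(4) = 2 non-atomic reduction steps
    for (int stride = 2; stride > 0; stride >>= 1) {
        if (t < stride)
            vals[t] &= vals[t + stride];
        __syncthreads();
    }

    if (t == 0)
        atomicAnd(addr, vals[0]);     // one atomic instead of four
}
```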

I think it's possible to do so in a SIMD vector, i.e. if multiple elements in a vector want to do an atomic add on the same address, the hardware can detect this and add all of them at the same time.
 
I think it's possible to do so in a SIMD vector, i.e. if multiple elements in a vector want to do an atomic add on the same address, the hardware can detect this and add all of them at the same time.
That's some pretty non-trivial hardware to dedicate to that task - are you sure GT200 or similar actually implements this? My guess is it simply sees bank conflicts and serializes the atomics across all SIMD lanes.
 
That's some pretty non-trivial hardware to dedicate to that task - are you sure GT200 or similar actually implements this? My guess is it simply sees bank conflicts and serializes the atomics across all SIMD lanes.

No. I'm just saying it's possible to do so. I don't think GT200 is doing it this way (though I'm not sure).
 
What do we mean by SIMD when doing this?
SIMD as in single SIMD op, or SIMD as the hardware unit supporting multiple threads?

Atomic operations are supposed to be able to achieve their results in a manner consistent with them having exclusive and uninterrupted access and modification of the location. If this cannot be met, the atomic operation is supposed to fail.
The individual thread that issued a failed atomic operation will have to handle this, but should not go on with business as usual.

Having multiple simultaneous atomic adds politely cooperate and source one another injects instruction behavior that would break anything not expecting this spontaneous access-sharing. Sharing means this is some kind of communal exclusivity, and it would not allow atomic ops to fail within a SIMD.

Since having multiple lanes forward to each other is not instantaneous, that's at most a very fast serialization through a specialized path if this is desired program behavior.
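For reference, that fail-and-handle-it model is expressible today via compare-and-swap. A rough CUDA sketch of an add built on atomicCAS, where the issuing thread has to notice the failure and retry:

```cuda
// Sketch: an "atomic add" built from atomicCAS. The CAS can fail (another
// thread modified the location first); the issuing thread detects this and
// retries rather than carrying on with business as usual.
__device__ void casAdd(int *addr, int val)
{
    int old = *addr;
    int assumed;
    do {
        assumed = old;
        // Succeeds only if *addr still holds 'assumed'; returns the old value.
        old = atomicCAS(addr, assumed, assumed + val);
    } while (old != assumed);  // failure: the location changed under us
}
```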
 
What do we mean by SIMD when doing this?
SIMD as in single SIMD op, or SIMD as the hardware unit supporting multiple threads?
Hardware unit with different lanes (can we not use "threads" ever again in this context? It's extremely confusing).

We're talking about atomic reduction (i.e. add, min, max, etc) operations with per-lane addresses. What happens when there is a collision? I don't believe these are specified to fail in that case in OpenCL/ComputeShader, so the question is how hardware handles them. It's "correct" per se to do them in any order, but they all need to be done. The point here is that doing them one by one is less efficient than the ideal (log n reduction), although I doubt any current hardware can efficiently implement the latter case even if all the addresses are the same, for instance.
 
Hardware unit with different lanes (can we not use "threads" ever again in this context? It's extremely confusing).
I asked because atomicity as defined elsewhere has some very specific behaviors that would differentiate between a SIMD instruction belonging to a single thread accessing a single memory location, and a SIMD unit that abstractly maintains multiple lanes that each behave as a single thread accessing a single memory location.

We're talking about atomic reduction (i.e. add, min, max, etc) operations with per-lane addresses. What happens when there is a collision? I don't believe these are specified to fail in that case in OpenCL/ComputeShader, so the question is how hardware handles them. It's "correct" per se to do them in any order, but they all need to be done. The point here is that doing them one by one is less efficient than the ideal (log n reduction), although I doubt any current hardware can efficiently implement the latter case even if all the addresses are the same, for instance.

A proper atomic operation shouldn't allow other operations to tread on the memory location it is working on, as it enforces exclusivity of a location.

If lane 0 and lane 1 perform an atomic add on a shared location, what do we want to have happen?
If they are serialized, one of the lanes will go first, read the initial state, add, then write back.
Then the other lane will proceed using the new value as a basis.

If they do not serialize, what happens?
Lane 0 and 1 read the exact same value, and write back whatever they want in some unknown order.
 
If lane 0 and lane 1 perform an atomic add on a shared location, what do we want to have happen?
If they are serialized, one of the lanes will go first, read the initial state, add, then write back.
Then the other lane will proceed using the new value as a basis.
Yes, that's conceptually how it will happen, with no order guarantees on which lane goes "first" (and there are no ordering guarantees with contention across the rest of the machine, so this isn't unreasonable). The thing is, many of these operations are associative and commutative, and thus the ordering is irrelevant (int add, min/max, etc), but they all need to happen. Thus the hardware/software can get clever and do it hierarchically instead of linearly, reducing the asymptotic complexity of the operation significantly. In heavily contested cases, this can be a huge win.

If they do not serialize, what happens?
Lane 0 and 1 read the exact same value, and write back whatever they want in some unknown order.
No, see above. "Reduction"-style operations without return values can go in any order - even hierarchically - with the user being none the wiser, and the result being identical.
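To illustrate the hierarchical version, here's a sketch using warp shuffles (a far newer CUDA feature than the GT200-class hardware being discussed): 32 lanes adding to the same address combine in log2(32) = 5 steps, then one lane issues a single atomic. The final memory contents are identical to 32 serialized atomicAdds.

```cuda
// Sketch: log(n) warp reduction followed by one atomicAdd. Assumes all 32
// lanes are active and target the SAME address; the result is identical to
// 32 serialized atomic adds, since integer add is associative and commutative.
__device__ void warpAggregatedAdd(int *addr, int val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);

    if ((threadIdx.x & 31) == 0)     // lane 0 now holds the whole warp's sum
        atomicAdd(addr, val);
}
```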
 
Yes, that's conceptually how it will happen, with no order guarantees on which lane goes "first" (and there are no ordering guarantees with contention across the rest of the machine, so this isn't unreasonable). The thing is, many of these operations are associative and commutative, and thus the ordering is irrelevant (int add, min/max, etc), but they all need to happen. Thus the hardware/software can get clever and do it hierarchically instead of linearly, reducing the asymptotic complexity of the operation significantly. In heavily contested cases, this can be a huge win.
If I'm interpreting this correctly now, I missed the point earlier that the non-atomic reductions are being performed on register-side operands of the various lanes, once it is known that they collide.

The physical process of skimming operands like that from other lanes would need to be worked out.
For one thing, doing this on the fly might conflict with how operands are scoreboarded for readiness, as the value at the contended address would normally prevent instruction issue for the rest of the lanes.

The operand collector's job may potentially inject contention, since it would have to route the needed operands.

Another check would be one on the instruction pointer (IP).
It would be an artifact of the current SIMD setup that the exact same atomic operation in the same spot of the program would result in collisions.
Who is to say that in the future we can assume a contending atomic access to a location will be from the same atomic instruction?
 
For one thing, doing this on the fly might conflict with how operands are scoreboarded for readiness, as the value at the contended address would normally prevent instruction issue for the rest of the lanes.
I'm not sure how the different IHV GPUs work at that level, but isn't the entire point of the SIMD level to issue the entire instruction at once? i.e. it's the job of specialized hardware further down the pipeline (past instruction issue) to handle scatter/atomic collisions or memory bank conflicts?

It would be an artifact of the current SIMD setup that the exact same atomic operation in the same spot of the program would result in collisions.
Who is to say that in the future we can assume a contending atomic access to a location will be from the same atomic instruction?
Not sure what you're getting at here... it's not important that it happen in exactly the same hierarchy on every piece of hardware, or even every time the code is run. The key is just to get it done as efficiently as possible. i.e. in the case where I do an "atomic add" to the same address from N different lanes, that should ideally take O(log n), not O(n). But yes, I agree that hardware would need to be quite sophisticated to make that happen in practice. That said, it's an area where software programmability and a few horizontal operations are key to staying efficient, which incidentally brings us back to why we got onto this topic :)
 
This is a lot of discussion on a minute feature ... just for reference, what common use case would cause a significant amount of single address collisions apart from append buffers?
 
I'm not sure how the different IHV GPUs work at that level, but isn't the entire point of the SIMD level to issue the entire instruction at once? i.e. it's the job of specialized hardware further down the pipeline (past instruction issue) to handle scatter/atomic collisions or memory bank conflicts?
Instructions won't issue if their operands aren't ready, which for the colliding memory accesses means only one instruction will be able to perform the read.
For something like a shared memory atomic, the most that can normally be hoped for is that it will serialize and issue the SIMD instruction with all but one lane predicated off.
If we want to pull off this more efficient reduction, the scheduling hardware will have to detect the colliding atomic ops and override the standard serializing behavior and then create an entirely new series of operations that synthesizes the desired outcome.

Not sure what you're getting at here... it's not important that it happen in exactly the same hierarchy on every piece of hardware, or even every time the code is run. The key is just to get it done as efficiently as possible. i.e. in the case where I do an "atomic add" to the same address from N different lanes, that should ideally take O(log n), not O(n). But yes, I agree that hardware would need to be quite sophisticated to make that happen in practice. That said, it's an area where software programmability and a few horizontal operations are key to staying efficient, which incidentally brings us back to why we got onto this topic :)

The case where we can expect simultaneous reduction operations across a SIMD's lanes is partly an artifact of current hardware implementations. Some ATI stream documentation a while back indicated that at least that IHV did not want to commit to the current lock-step style of execution for all eternity.

If there is more variation, the probability of being able to combine atomics goes down.
If things go to some wildly free MIMD setup, there isn't any guarantee that every lane is running the same instruction or even the same instruction stream, which might inject subtle inter-thread behaviors programmers wouldn't expect if program A and program B have their own atomic adds that happen to overlap some of the time.
 
This is a lot of discussion on a minute feature ... just for reference, what common use case would cause a significant amount of single address collisions apart from append buffers?
Histogram generation. Been a problem on GPUs for ages. And wait, isn't a lot of discussion on "minute" features the point of the B3D forums? :)
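(As an aside: NVIDIA hardware did eventually grow something very close to this collision detection, long after the hardware discussed here. On Volta and later, __match_any_sync finds the lanes of a warp whose bin index collides, so a histogram increment can be aggregated in software into one atomic per unique bin. A sketch:)

```cuda
// Sketch (Volta+/CUDA 9+ only): warp-aggregated histogram increment.
// __match_any_sync returns the mask of lanes holding the same 'bin' value;
// the lowest such lane adds the group's population count in ONE atomic.
__device__ void histIncrement(unsigned int *bins, int bin)
{
    unsigned int peers  = __match_any_sync(0xffffffffu, bin);
    int          leader = __ffs(peers) - 1;        // lowest colliding lane
    unsigned int count  = __popc(peers);           // how many lanes collided

    if ((int)(threadIdx.x & 31) == leader)
        atomicAdd(&bins[bin], count);              // one atomic per unique bin
}
```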

If there is more variation, the probability of being able to combine atomics goes down.
If things go to some wildly free MIMD setup, there isn't any guarantee that every lane is running the same instruction or even the same instruction stream, which might inject subtle inter-thread behaviors programmers wouldn't expect if program A and program B have their own atomic adds that happen to overlap some of the time.
As I described it, this would merely be a performance optimization and would never produce different results than serializing. Also during a dense pass like a histogram generation, you're going to get *tons* of these "conflicts", even with local copies of histograms to reduce contention. There are lots of common workloads in graphics that exhibit this highly-contended histogram reduction style access pattern.

To reiterate, my point is not that hardware *should* magically do any of this, but rather that there's a clear benefit to a more general programming model and hardware execution model for these cases. OpenCL/ComputeShader/CUDA do not allow one to (easily - prove me wrong!) express the most efficient parallel algorithm for this case - atomic operations from "inside your double nested loop" are simply insufficient without a ridiculous amount of hardware sophistication to reverse-engineer the memory access pattern on the fly.
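To spell out the "local copies" trick mentioned above, here's a rough CUDA sketch for a bin count that actually fits in shared memory (256 here; the interesting cases with huge bin counts are exactly where this falls apart):

```cuda
// Sketch: per-block privatized histogram. Contention is confined to each
// block's shared-memory copy; only the final merge touches global memory.
__global__ void hist256(const unsigned char *data, int n, unsigned int *gBins)
{
    __shared__ unsigned int sBins[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        sBins[i] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&sBins[data[i]], 1u);   // shared-memory atomic

    __syncthreads();
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&gBins[i], sBins[i]);   // one global atomic per bin per block
}
```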
 
To reiterate, my point is not that hardware *should* magically do any of this, but rather that there's a clear benefit to a more general programming model and hardware execution model for these cases. OpenCL/ComputeShader/CUDA do not allow one to (easily - prove me wrong!) express the most efficient parallel algorithm for this case - atomic operations from "inside your double nested loop" are simply insufficient without a ridiculous amount of hardware sophistication to reverse-engineer the memory access pattern on the fly.

OpenCL at least might need to special-case such behavior to some kind of explicit coordinated atomic extension, since its intended targets span so many disparate architectures.

Atomic operations have low-level implications outside of the arithmetic behaviors, and varying architectures may have implemented various forms of atomic operations that may fail almost immediately upon trying this.
 
I really think this is the basis of an excellent benchmark - e.g. histogramming the word-count of the novels of Dickens. None of this namby-pamby histogramming of a render target.

Jawed
Except it won't have nearly as many collisions.
 
It wasn't so much atomic collisions as the general awkwardness - though a quick rummage appears to indicate that the top 10 words will produce about 20% of the entire word count. Zipf's law:

http://en.wikipedia.org/wiki/Zipf's_law

gives the top 10 words combined about 3x the count of the most popular word alone, and the most common word, "the", is ~7% of English.
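(My back-of-the-envelope arithmetic, assuming a pure 1/k Zipf distribution over word ranks:)

```latex
% Top-10 share relative to the most popular word, under f(k) \propto 1/k:
\sum_{k=1}^{10} \frac{1}{k} = H_{10} \approx 2.93
% With "the" at ~7% of all words:
0.07 \times 2.93 \approx 0.21 \approx 20\%
```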

Histogram atomic-adds are fire and forget anyway; the incrementing thread doesn't care what the count is, so it's a "fast" atomic even when it collides.

Anyway, this is more about not having a fixed number of bins and the hoops that need to be jumped through to make it parallel. e.g. how to cope with 50,000 bins and ~20 bits of precision per count? Shared memory isn't big enough under CUDA or OpenCL to hold an entire histogram per core.

Jawed
 
Anyway, this is more about not having a fixed number of bins and the hoops that need to be jumped through to make it parallel. e.g. how to cope with 50,000 bins and ~20 bits of precision per count? Shared memory isn't big enough under CUDA or OpenCL to hold an entire histogram per core.

If you can have a large enough number of threads, which shouldn't be too hard when computing a histogram, you can actually use global memory for the bins, and you don't really need atomics because you can probably afford to have private bins for each thread. Of course, that would require a reduction at the end, but if your image is large enough the cost of the reduction should be quite small.
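A rough CUDA sketch of that approach (names mine; bins assumed zeroed with cudaMemset beforehand, and the bins-times-threads memory cost is of course the catch):

```cuda
// Sketch: thread t owns column t of a [NUM_BINS x numThreads] array in global
// memory, so counting needs no atomics at all; a second kernel reduces across
// threads. Total threads launched must equal numThreads.
#define NUM_BINS 256   // illustrative; a 50,000-bin histogram is the hard case

__global__ void countPrivate(const unsigned char *data, int n,
                             unsigned int *bins, int numThreads)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = t; i < n; i += numThreads)
        bins[data[i] * numThreads + t] += 1;       // plain add, no atomic
}

__global__ void reduceBins(const unsigned int *bins, int numThreads,
                           unsigned int *out)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x; // one thread per bin
    if (b < NUM_BINS) {
        unsigned int sum = 0;
        for (int t = 0; t < numThreads; ++t)
            sum += bins[b * numThreads + t];
        out[b] = sum;
    }
}
```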
 
If you can have a large enough number of threads, which shouldn't be too hard when computing a histogram, you can actually use global memory for the bins, and you don't really need atomics because you can probably afford to have private bins for each thread. Of course, that would require a reduction at the end, but if your image is large enough the cost of the reduction should be quite small.
If you want decent latency hiding then each of the 30 cores in GT200 needs 1024 threads, at 4 bytes per bin that's 122,880 bytes for 1 bin per thread. 50,000 bins require 5.7GB. Whoops, so you'd be limited to 192 threads per core - but that is over 1GB. And it would be pretty poor at hiding latency.

On NVidia the "fast atomic" should be faster than giving each thread a private bin. Apart from anything else, using less memory (e.g. one histogram per core) will be friendlier to the memory system.

Whereas on Larrabee the entire histogram stays in cache - but now you've apparently got the slow, cache-line colliding, style of x86 atomics to deal with. If each core has a local histogram, that's 50,000 bins * 4 bytes * 32 cores = 6.1MB (or 195KB per core). Obviously borderline, as 50,000 bins is conservative.

Jawed
 
If you want decent latency hiding then each of the 30 cores in GT200 needs 1024 threads, at 4 bytes per bin that's 122,880 bytes for 1 bin per thread. 50,000 bins require 5.7GB. Whoops, so you'd be limited to 192 threads per core - but that is over 1GB. And it would be pretty poor at hiding latency.

You can't have 1024 threads per core, because it's limited to 512 threads. Furthermore, you don't really need that many threads, since what you want is to hide memory latency, not computation latency. Computing a histogram is not exactly computationally intensive work. So in theory, if you need only 512 threads to hide memory latency, you don't even need to use all 30 cores.
 