Because it's not only visible to a single schedulable entity. If I create a workgroup of size 1024 on SI that'll be multiple different scheduled entities that are looking at it. Otherwise I wouldn't need barriers.
Aren't we supposed to make no assumptions about the hardware and to look at it from the API perspective (that was your suggestion)? The API does not make any assumption about the size of the individually scheduled entities. It could be a SIMD architecture with a vector size of 1024 for that matter. In that sense, the barriers are just there to ensure consistency for all possible implementations. If you write non-portable code and use workgroups that match the vector size of the underlying hardware, you don't have to use them. From the API side, the workgroups are the smallest schedulable entities you can rely on, and OpenCL just supports execution on all kinds of hardware (with different supported sizes).
That's fundamentally a different programming model than one in which my single serial thread of execution (even if it uses SIMD) is accessing a consistent view of some resource (registers).
As stated several times already, I disagree. There is just a different kind of memory defined, which is supposed to be private to a workgroup. By its basic definition, this is much closer to registers (which are private to a work item) than to global memory (which is accessible by all work items from all work groups and also from other kernels [of the same process]). And btw., as I said, OpenCL also doesn't specify anything about OOB private accesses, nor does it state that private memory has to sit in registers. So where is the difference?
You didn't come up with convincing arguments, sorry.
DirectX specifies that OOB accesses for local memory can only affect other local memory (but on the whole machine). OCL leaves it open to any sort of undefined behavior or arbitrary global memory corruption.
Yes, as I said, DirectX gives specs for global memory and basically just uses different words for the local/shared memory. They say the content of the shared memory (all of it) is undefined after OOB writes, i.e. it is unspecified which operations are actually carried out, same as with OpenCL. That they restrict corruption to the shared memory simply means that you are not allowed to run shaders using shared memory without having it in hardware as a separate array (or providing equivalent means of doing so).
Anyways I'm sad you didn't focus more on my summary of the two remaining discussion points, as those questions are what's actually interesting at this point I think.
I seem to have missed that.
Generally, if I didn't reply to something, I probably didn't have an urgent issue with it (or just gave up because I have more pressing things to do). And I don't want to start nitpicking about the ideal solution for OOB accesses. You say it would be to crash and tell you where it happened. With GCN 1.1, a memory access violation exception is raised, and a trap handler can gracefully quit the program and provide you with a meaningful error message. Or you can use this functionality in a debugger. That doesn't sound overengineered to me, more like a useful design. As I said, it's the better solution.
Edit:
You probably meant this (I was thinking about answering, but it was late here):
So I think we're mostly in agreement here, with the only disagreements being around...
1) You think the additional isolation of workgroups is useful; I don't. Obviously it's fine and spec compliant either way.
2) You think the bounds checking hardware is required because of indirect RF accesses. I claim those indirect accesses are an implementation choice, and alternate choices would not require that bounds checking hardware. But in any case, you are correct that it doesn't need a full 32-bit compare in general as high address bits can just be dropped (with proper sign handling) on a given architecture with a fixed maximum shared memory size.
1):
It's the cleaner way (again, take the example of a possible register access to another thread of the same process, even if you don't like that example; it would be considered a serious flaw on CPUs). And as explained, it takes the same effort as ensuring just process isolation.
It's for this sort of arbitrary concept of "workgroup resource boundaries"
It's the same arbitrary concept as having registers private to a thread.
2):
The bounds checking for the reg file is very likely done by different hardware. The example was just meant to clarify that bounds checking can be neither expensive nor a rigorous distinction between registers and local/shared memory.
And to your claim that it is an implementation choice: Of course it is! Everything is one. You can decide to implement everything on a single scalar ALU without any data in registers if you want. Does it make sense? No!
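To make point 2) a bit more concrete, here is a minimal C sketch of the two cheap options we have been talking about: dropping the high address bits so that an OOB access can only land in local memory, and a compare against the workgroup's own allocation that flags a violation. Everything in it is an assumption made up for illustration (the 64 KiB size, the names lds_wrap and lds_translate, the base/size layout); it is not how any particular GPU implements it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical local memory size per compute unit (assumption): 64 KiB.
   It has to be a power of two for the masking trick to work. */
#define LDS_SIZE_BYTES 65536u

/* Option A: drop the high address bits. Any 32-bit address is folded into
   the local memory array, so an OOB access can corrupt other local memory
   (possibly of another workgroup) but nothing outside of it. */
static inline uint32_t lds_wrap(uint32_t addr)
{
    assert((LDS_SIZE_BYTES & (LDS_SIZE_BYTES - 1u)) == 0u);
    return addr & (LDS_SIZE_BYTES - 1u);
}

/* Option B: compare the offset against the workgroup's own allocation.
   The unsigned compare also catches negative offsets, because they show
   up as very large values. On a violation, *fault is set and 0 is
   returned; a trap handler could take over from there. */
static inline uint32_t lds_translate(uint32_t base, uint32_t alloc_size,
                                     uint32_t offset, bool *fault)
{
    *fault = offset >= alloc_size;
    return *fault ? 0u : base + offset;
}
```

With a workgroup allocation of 256 bytes at base 1024, lds_translate(1024, 256, 16, &fault) yields address 1040 without a fault, while an offset of (uint32_t)-4 sets fault. Option A costs an AND, option B a compare that only needs as many bits as the maximum local memory size requires.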