No, you're lumping stuff in here which isn't the same thing. On CPUs there is no memory safety inside a single process. Two threads can race/corrupt memory, etc.
As is possible on GPUs, too.
There's never any expectation that my code has to be safe from... uhh... itself. Bad code is completely allowed to corrupt any of my own memory space, and that's totally fine.
I just don't regard local memory as the same kind of universally usable memory space as global memory. From my perspective it's much closer to the registers, because local memory is a resource defined as private to a warp/wavefront/workgroup, just like registers. Even within the same process, you can't access the registers of another thread on a CPU.
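To illustrate what I mean, here is a minimal OpenCL kernel sketch (the kernel and all names are my own invention): the __local array exists once per workgroup, just like registers exist once per thread, and no other workgroup has any defined way to address it.

    // Minimal sketch: "tile" lives in local memory, one instance per
    // workgroup -- conceptually as private to the group as registers
    // are to a thread. Assumes a workgroup size of at most 256.
    __kernel void scale(__global float* data)
    {
        __local float tile[256];
        size_t lid = get_local_id(0);
        size_t gid = get_global_id(0);
        tile[lid] = data[gid];            // each work-item fills its slot
        barrier(CLK_LOCAL_MEM_FENCE);     // visibility only within the group
        data[gid] = tile[lid] * 2.0f;
    }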
Incidentally, dropping accesses to out of bound memory doesn't solve any problems with bad code whatsoever. It's clearly an error in the application and needs to be fixed. The only bit you don't want is for it to spill over into separate *processes*, i.e. contexts/DMA buffers for GPUs, not workgroups.
And that's exactly what this solves. How do you want to ensure isolation between the local memory of two different kernels running on the same CU? You would have to do the same amount of range checking anyway. So your argument that it is overly complicated doesn't hold water, in my view.
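To illustrate (a rough C sketch of my own, not actual GCN logic): every LDS access already has to be translated through the workgroup's allocation window, so dropping out-of-window accesses falls out of the address check the hardware needs anyway. If I remember the docs right, GCN returns zero for suppressed reads and silently drops suppressed writes.

    /* Hypothetical model of the per-access LDS range check. */
    typedef struct {
        unsigned base;   /* workgroup's LDS allocation offset */
        unsigned size;   /* workgroup's LDS allocation size   */
    } lds_window;

    unsigned lds_read(const lds_window* w, const unsigned* lds, unsigned off)
    {
        if (off >= w->size)
            return 0;                  /* OOB read: suppressed, returns 0 */
        return lds[w->base + off];     /* in-bounds: normal access        */
    }

    void lds_write(const lds_window* w, unsigned* lds, unsigned off, unsigned v)
    {
        if (off < w->size)             /* OOB write: silently dropped     */
            lds[w->base + off] = v;
    }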
Sure, but shared memory addresses are defined *dynamically*. Buggy code can address outside the range of any array, and you can't statically prove anything about that in advance.
As you can do with registers on GPUs (which get bounds checked in these cases too, as mentioned already). Your point is?
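And to make the "dynamic address" case concrete (hypothetical kernel, names mine): the index below comes out of memory at runtime, so nothing about it can be proven statically. Per the spec an out-of-range value makes the access illegal; on GCN it simply gets suppressed, elsewhere it's undefined.

    __kernel void gather(__global const float* in,
                         __global const int* idx,
                         __global float* out)
    {
        __local float tile[64];        // assumes workgroup size <= 64
        size_t gid = get_global_id(0);
        tile[get_local_id(0)] = in[gid];
        barrier(CLK_LOCAL_MEM_FENCE);
        // idx[gid] is only known at runtime: anything outside 0..63
        // makes this an illegal (out of bounds) local memory access.
        out[gid] = tile[idx[gid]];
    }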
It's the way *one piece* of hardware works, i.e. relevant if I'm only targeting that hardware and irrelevant if I'm writing portable code (which must be written to the API specifications). Try writing that code and running it on an OpenCL CPU implementation. You're either going to get badness or your code is going to be much slower than it needs to be.
I thought we were talking about GPUs here. And the spec says out of bound accesses are illegal. If you write code to spec, it runs everywhere anyway. We are talking about how these illegal cases are handled (which was left out of the OpenCL spec probably because of fears that specifying it would make it harder to implement on some devices). It doesn't change that local memory is defined as private to a workgroup and the programmer is supposed to adhere to that. GCN enforces this in hardware, i.e. it has a defined behaviour for these cases and will raise a memory access violation exception with the next hardware iteration. That behaviour is completely within spec. And I can't see anything bad there.
No, that's the fundamental mistake you guys are making in logic here. You're assuming that "this memory is well-defined to be shared between these invocations" (i.e. shares a common base pointer) means that out-of-bounds reads/writes are suppressed.
Look at how the discussion started! You were asking specifically about the implementation on AMD GPUs, where these OOB accesses get suppressed. I was specifically talking about that. It only later developed into a discussion of how much sense this implementation makes. So we did not make a fundamental mistake when we explained how it is implemented. All APIs basically state that out of bound accesses to local memory are illegal. It therefore doesn't hurt to suppress them. And it solves the process isolation issue with the same effort you would have to spend anyway.
The key difference here is that it takes dynamic addresses and thus can have out of bounds accesses. That makes it semantically more similar to memory than to registers.
As stated already, that's actually not as strict a distinction as you think. It doesn't have to be static; you can also index dynamically into the regfile on GPUs, and you can indeed have out of bounds indices into it. By the way, that is an implementation specific detail which is not covered by the OpenCL spec, isn't it?
AFAIK the OpenCL spec also doesn't say what happens if you index out of bounds into private memory. It also does not specify that private memory has to sit in registers; it actually specifically mentions other possibilities (which get heavily used in CPU implementations). It's therefore all implementation dependent.
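A hypothetical example of that: priv below is __private memory with a dynamic index. Whether the compiler keeps it in the register file, spills it to scratch, or (on a CPU implementation) puts it on the stack is left entirely to the implementation, and so is what happens if i ever ends up out of range.

    __kernel void hist16(__global const uchar* in,
                         volatile __global uint* out)
    {
        uint priv[16] = {0};           // __private by default
        size_t gid = get_global_id(0);
        uint i = in[gid] >> 4;         // runtime value, 0..15 for sane input
        priv[i] += 1;                  // dynamic index into private memory
        for (int b = 0; b < 16; ++b)
            if (priv[b])
                atomic_add(&out[b], priv[b]);
    }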
It's fine, but as I've said, it doesn't actually help matters. First, since it's not guaranteed, I can't rely on it, and second, that's not a good way to write code anyway.
RLY? Come on, you started the whole thing by doubting that GCN also isolates processes within local memory. You got the answer that it does and the explanation of how it is done, and now you say you can't rely on it?
You can rely on proper process isolation if you use their hardware. That AMD GPUs prevent local as well as global memory corruption by suppressing out of bound accesses (and will soon raise exceptions for them) has some potentially useful applications (besides the obvious debugging help, think of MegaTexture/PRT or paging). Should you write illegal OpenCL code with undefined behaviour because of it? Of course not! That would indeed be bad practice. But wait for some future DX, OpenCL, or OpenGL version/extension and we may see this functionality exposed with a proper API to use (or wait until you get your hands on one of the next consoles).
Btw., I don't know what Intel GPUs do, but frankly, I somewhat doubt you can corrupt another process' memory on them. And on nVidia it's also not possible (though they may crash; at least that was the case last time I read about it, it could have changed since).