AMD: Sea Islands R1100 (8*** series) Speculation/Rumour Thread

There's a *big* difference between resources that are only visible to a single invocation/"thread of execution" and stuff that is visible to multiple different scheduled entities. Registers are the former, while LDS is the latter.
I always thought that OpenCL does not make any assumptions about the underlying hardware (or as few as possible). This is also the reason why designations like "threads" were avoided. Work items are not necessarily individually scheduled. The smallest really individual entities OpenCL knows of are the workgroups (how the work items are handled is up to the implementation and not specified; some types of workflow are explicitly not guaranteed to work within a workgroup, to allow execution in SIMD fashion). My characterization of the local memory as private to the workgroups therefore exactly fits the definition of a resource only visible to a single invocation/"thread of execution" you just came up with. ;)
I understand where you're coming from from a hardware perspective, but I think the API perspective is the more relevant one when considering parallel semantics.
And as you just wrote, the OpenCL API leaves huge holes, so we actually can't consider it to deliver a rigorous perspective.
As I said, you can't corrupt other processes' (shared local) memory on *any* GPUs right now because you can't have multiple DMA buffers from different processes in flight at the same time. This is all an academic discussion :)
You can't have multiple CS kernels from different processes in flight at the same time? What is nVidia's HyperQ doing, btw.? I guess AMD's ACEs (you have two), especially with the extensions coming with GCN 1.1 can basically achieve something similar. And there is also the device fission OpenCL extension. Hmm.
 
Work items are not necessarily individually scheduled.
True, but they *can* be. Indeed even single work items can be scheduled individually subject to barrier operations. Or an entire workgroup can be scheduled together. Code is not allowed to make any assumptions either way.

My characterization of the local memory as private to the workgroups therefore exactly fits the definition of a resource only visible to a single invocation/"thread of execution" you just came up with. ;)
Nah, but that only works if I could write code as if the workgroups were indeed the smallest schedulable entity, and I'd wreck GPUs if I wrote all my code with 1 item/group ;) Honestly, at this stage I don't buy that the additional abstraction of the work item is particularly useful compared to something like ISPC, where I can just write my code with a parameterized SIMD size (which enables more efficient expressions of algorithms), but that's the way it is unfortunately...

And as you just wrote, the OpenCL API leaves huge holes, so we actually can't consider it to deliver a rigorous perspective.
Sure, but DX is pretty good in that regard, which is why I tend to use it in examples. I only switched to CL because OpenGL Guy rejected the relevance of DX (for self-evident reasons I guess? :)).

You can't have multiple CS kernels from different processes in flight at the same time? What is nVidia's HyperQ doing, btw.? I guess AMD's ACEs (you have two), especially with the extensions coming with GCN 1.1 can basically achieve something similar. And there is also the device fission OpenCL extension. Hmm.
It's coming, but it's not all here yet on consumer solutions, and as you make clear, it's all extensions and not required by any specs yet. And yeah I definitely agree that process isolation is a requirement (which is what started the entire subthread here), I just don't think going beyond that is particularly useful. An ideal implementation would crash and tell me where if I access outside of bounds within my single process.
 
Please,
Gipsel or someone else, could you upload the new ISA doc somewhere? The given link is dead:
http://developer.amd.com/wordpress/media/2013/02/AMD_Sea_Islands_Instruction_Set_Architecture.pdf
They probably took it down to do the changes Dave suggested. Btw., you should be aware that this appears to be a bit preliminary and contains a few more errors than usual.

Sea Islands ISA Manual

Edit:
a mirror, just in case

PS: Does anybody know a good and free filehoster?
 
True, but they *can* be. Indeed even single work items can be scheduled individually subject to barrier operations. Or an entire workgroup can be scheduled together. Code is not allowed to make any assumptions either way.
Yes, you are not allowed to make assumptions, which makes the workgroup the smallest individually scheduled entity you can rely on.
Nah, but that only works if I could write code as if the workgroups were indeed the smallest schedulable entity, and I'd wreck GPUs if I wrote all my code with 1 item/group ;)
The smallest individually scheduled items on GPUs are the wavefronts/warps which almost coincide with the workgroups. Where is the problem?
Sure, but DX is pretty good in that regard, which is why I tend to use it in examples. I only switched to CL because OpenGL Guy rejected the relevance of DX (for self-evident reasons I guess? :)).
Does the description of the group shared memory of DX deviate in a significant way from the local memory of OpenCL? Doesn't DX specify what should happen for out-of-bounds accesses to textures or buffers (but leaves it undefined for shared memory)? I don't get the self-evident reasons part.
It's coming, but it's not all here yet on consumer solutions, and as you make clear, it's all extensions and not required by any specs yet.
Okay, now you backtracked from *any* GPU to available consumer solutions, and to it being an extension and not required by any spec. :LOL:
I guess if one looks through nVidia's docs, one can dig out the spec for the already available GK110 solution. And as said, on a smaller scale it should also be possible on all GCN GPUs, which have been on sale for more than a year already.
 
The smallest individually scheduled items on GPUs are the wavefronts/warps which almost coincide with the workgroups. Where is the problem?
Because it's not only visible to a single schedulable entity. If I create a workgroup of size 1024 on SI that'll be multiple different scheduled entities that are looking at it. Otherwise I wouldn't need barriers. That's fundamentally a different programming model than one in which my single serial thread of execution (even if it uses SIMD) is accessing a consistent view of some resource (registers).
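To make that concrete, here's a minimal OpenCL sketch (hypothetical kernel, illustration only): with a work-group size of 1024 on a 64-wide machine, 16 wavefronts share the scratch array below, and the barriers are exactly what keeps their views of it consistent.

Code:
    __kernel void reduce(__global const float* in,
                         __global float* out,
                         __local float* scratch)
    {
        size_t lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);  // other wavefronts may not have written yet

        for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);  // make all partial sums visible
        }
        if (lid == 0)
            out[get_group_id(0)] = scratch[0];
    }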

Does the description of the group shared memory of DX deviate in a significant way from the local memory of OpenCL? Doesn't DX specify what should happen for out-of-bounds accesses to textures or buffers (but leaves it undefined for shared memory)?
DirectX specifies that OOB accesses for local memory can only affect other local memory (but on the whole machine). OCL leaves it open to any sort of undefined behavior or arbitrary global memory corruption.
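As a hedged illustration of the difference (hypothetical kernel; the work-group size of 64 is an assumption):

Code:
    __kernel void oob_demo(__global float* out)
    {
        __local float tile[64];
        // With a work-group size of 64, this write is out of bounds:
        tile[get_local_id(0) + 64] = 1.0f;
        // OpenCL: undefined behavior; in principle arbitrary memory may be
        // corrupted. DX: the corruption is confined to local memory.
        out[get_global_id(0)] = 0.0f;
    }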

I don't get the self-evident reasons part.
I was just joking about his name being "OpenGL Guy" :)

Okay, now you backtracked from *any* GPU to available consumer solutions, and to it being an extension and not required by any spec. :LOL:
My statement was in the context of graphics APIs, sorry I didn't make that clear. Pure compute stuff kind of lives in its own (somewhat simpler) world.

Anyways I'm sad you didn't focus more on my summary of the two remaining discussion points, as those questions are what's actually interesting at this point I think. On the rest I think we pretty much already agree...
 
1) You think the additional isolation of workgroups is useful; I don't. Obviously it's fine and spec compliant either way.
I think it is quite useful. Enough with the shared mutable state already. The hw checks are quite cheap. The hw can provide signals for oob access if you do oob checking in the first place, making debugging easier. The spec should mandate it. Speaking of Khronos specs, sigh...
 
I think it is quite useful. Enough with the shared mutable state already. The hw checks are quite cheap. The hw can provide signals for oob access if you do oob checking in the first place, making debugging easier. The spec should mandate it. Speaking of Khronos specs, sigh...
Agreed for general array indexing, but this is less than that. It's for this sort of arbitrary concept of "workgroup resource boundaries"... i.e. I can still stomp all over other arrays that my workgroup uses. Definitely I won't complain about any additional debugging signals I can get, although suppressing the access is less useful. Frankly though, for debugging, compiler-inserted bounds checking is fine too.

The handle predates 7+ years of Direct3D driver work and 3 years of OpenCL ;)
I know, I was just teasing because that sentence came out sounding funny :)
 
Because it's not only visible to a single schedulable entity. If I create a workgroup of size 1024 on SI that'll be multiple different scheduled entities that are looking at it. Otherwise I wouldn't need barriers.
Aren't we supposed not to make assumptions about the hardware and to look at it from the API perspective (that was your suggestion)? The API does not make any assumption about the size of the individually scheduled entities. It could be a SIMD architecture with a vector size of 1024, for that matter. In that sense, the barriers are just there to ensure consistency across all possible implementations. If you write non-portable code and use workgroups the size of the vector size of the underlying hardware, you don't have to use them. From the API side, the workgroups are the smallest schedulable entities you can rely on, and OpenCL just supports execution on all kinds of hardware (with different supported sizes).
That's fundamentally a different programming model than one in which my single serial thread of execution (even if it uses SIMD) is accessing a consistent view of some resource (registers).
As stated several times already, I disagree. There is just a different kind of memory defined which is supposed to be private to a workgroup. By its basic definition this is much closer to registers (private to a work item; btw., as said, OpenCL also doesn't specify anything about OOB private accesses and also does not state that private memory has to sit in registers, so where is the difference?) than to global memory (which is accessible by all work items from all work groups and also from other kernels [of the same process]).
You didn't come up with convincing arguments, sorry.
DirectX specifies that OOB accesses for local memory can only affect other local memory (but on the whole machine). OCL leaves it open to any sort of undefined behavior or arbitrary global memory corruption.
Yes, as I said, DirectX gives specs for global memory and basically just uses different words for the local/shared memory: they say the content of the shared memory (all of it) is undefined after OOB writes, i.e. it is unspecified what operations are exactly carried out, same as with OpenCL. That they restrict corruption to the shared memory has the simple effect that you are not allowed to run shaders using shared memory without having it in hardware as a separate array (or providing equivalent means of doing so).
Anyways I'm sad you didn't focus more on my summary of the two remaining discussion points, as those questions are what's actually interesting at this point I think.
I seem to have missed that.
Generally, if I didn't reply to something, I probably didn't have an urgent issue with it (or just gave up because I have more pressing things to do). And I don't want to start nitpicking about the ideal solution for OOB accesses. You say it would be to crash and tell you where it was. With GCN 1.1 a memory access violation exception is raised, and a trap handler can gracefully quit the program and provide you with a meaningful error message. Or you can use this functionality in a debugger. Doesn't sound that overengineered to me, more like a useful design. As I said, it's the better solution. ;)

Edit:
You probably meant this (I was thinking about answering, but it was late here):
So I think we're mostly in agreement here, with the only disagreements being around...

1) You think the additional isolation of workgroups is useful; I don't. Obviously it's fine and spec compliant either way.

2) You think the bounds checking hardware is required because of indirect RF accesses. I claim those indirect accesses are an implementation choice, and alternate choices would not require that bounds checking hardware. But in any case, you are correct that it doesn't need a full 32-bit compare in general as high address bits can just be dropped (with proper sign handling) on a given architecture with a fixed maximum shared memory size.
1):
It's the cleaner way (again, take the example of a possible register access to another thread of the same process, even if you don't like that example; it would be considered a serious flaw on CPUs). And as explained, it's the same effort as ensuring just process isolation.
It's for this sort of arbitrary concept of "workgroup resource boundaries"
It's the same arbitrary concept as having registers private to a thread. :rolleyes:

2):
The bounds checking for the reg file is very likely done by different hardware. It was just meant to clarify that such checking can't be expensive, and that there is no rigorous distinction between registers and local/shared memory.
And to your claim that it is an implementation choice: Of course it is! Everything is one. You can decide to implement everything on a single scalar ALU without any data in registers if you want. Does it make sense? No!
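For illustration, the cheap check from 2) boils down to something like this OpenCL-C sketch (the 64 KiB LDS size is an assumption):

Code:
    // Wrapping an LDS byte address into a fixed 64 KiB array: a mask
    // replaces a full 32-bit compare, since the high address bits can
    // simply be dropped.
    inline uint lds_wrap(uint byte_addr)
    {
        return byte_addr & (64u * 1024u - 1u);
    }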
 
As stated several times already, I disagree. There is just a different kind of memory defined which is supposed to be private to a workgroup.
We're going in circles here. This entire line of questioning came out of the assertion that SLM is like registers, which makes sense from the perspective of GCN/hardware. But that's obviously not true from the programming language perspective. Registers have a consistent view from a single work item... they can't suddenly change unrelated to the work item's execution. SLM *can*. It *might* not depending on the implementation, but the fact that it can means it's a totally different beast from the programmer's perspective.

I'm not sure if we're just coming at this from two different perspectives (hardware/software) but we're just repeating ourselves now and not making any progress on this point, so maybe it's best if we just drop it.

You didn't come up with convincing arguments, sorry.
I'm not even sure of what we're arguing about or being unconvinced about at this point, so that's unsurprising :)

With GCN 1.1 a memory access violation exception is raised, and a trap handler can gracefully quit the program and provide you with a meaningful error message. Or you can use this functionality in a debugger. Doesn't sound that overengineered to me, more like a useful design.
Sure, being able to debug it is great! But as I said above, it's really most useful if it's bounds checking on *arrays*, not on SLM in general, and that's something nobody is likely to bother putting in hardware. Whether or not I declare a second SLM array shouldn't affect whether or not OOB access to the first one is detected... Given that I want to fix these bugs before shipping something, I'm also fine with compiler-inserted bounds checking, but whatever.
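To spell out the array-vs-SLM distinction, a sketch (hypothetical kernel; the error reporting is stand-in instrumentation):

Code:
    __kernel void two_arrays(__global float* out, int i, float x)
    {
        __local float a[64];
        __local float b[64];        // sits next to a in this group's SLM
        b[get_local_id(0) & 63] = x;
        // An OOB index like i = 70 into a[] stays inside the work-group's
        // SLM allocation (it lands in b), so a per-workgroup hardware
        // check never fires, even though it is an application bug. A
        // per-array, compiler-inserted check would catch it:
        if (i < 0 || i >= 64) {
            out[0] = -1.0f;         // stand-in for a trap/report, hypothetical
            return;
        }
        a[i] = x;
        out[get_global_id(0)] = a[i];
    }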

It's the same arbitrary concept as having registers private to a thread. :rolleyes:
We're going to have to agree to disagree on this. I understand your point of view, but the two have widely different semantics in the API in terms of how they relate to the execution model. But as above, we appear to be just going in circles on that. Sigh.

And to your claim that it is an implementation choice: Of course it is! Everything is one. You can decide to implement everything on a single scalar ALU without any data in registers if you want. Does it make sense? No!
Are you seriously arguing that the GCN register file/LDS design is the only one that makes sense?
 
We're going in circles here.
I would say, we are not moving at all. ;)
This entire line of questioning came out of the assertion that SLM is like registers, which makes sense from the perspective of GCN/hardware. But that's obviously not true from the programming language perspective.
You mix things up here. APIs define the local/shared memory as private to a workgroup the same way as the private memory of OpenCL is defined as private to a work item. That makes it fundamentally different from global memory, which by definition is accessible by every work item of every workgroup and in principle even by other kernels. There is no restriction defined; it is not owned by or private to anything (besides maybe the process).
I find the introduction of an intermediate hierarchy step not too hard to grasp and completely understandable if one wants to accommodate typical throughput architectures. It makes sense also from an engineering point of view.
Registers have a consistent view from a single work item...
OpenCL has no concept of registers. We have to decide if we want to talk in API terms or from the hardware point of view. You suggested the API. And there is the private and the local memory in OpenCL, with defined ownership. One should just accept this concept. Maybe one can keep in mind that on GPUs private memory maps to registers and local memory to the local/shared memory arrays. In other implementations, for instance on CPUs, the data of the private memory is also mapped to usual memory locations somewhere (basically relying on the caches to keep latencies down). Is it then the task of the compiler to ensure that the private memory stays consistent for each work item? I asked you already what happens with out-of-bounds indexing in private memory. If it's not suppressed, aren't you then corrupting the private memory of other work items the same way as OOB accesses to the local memory corrupt the memory of other workgroups? Where is the fundamental difference?
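Sketched out, the private-memory case I'm asking about looks like this (hypothetical kernel, illustration only):

Code:
    __kernel void priv_oob(__global float* out, int i)
    {
        float priv[8];  // __private memory, one instance per work item
        for (int k = 0; k < 8; ++k)
            priv[k] = 0.0f;
        // An out-of-bounds priv[i] (i >= 8) has no defined behaviour in
        // OpenCL either: on a CPU implementation the private data of
        // several work items can live in one stack frame, so the write
        // could land in a neighbour's private memory, exactly analogous
        // to the local-memory case.
        out[get_global_id(0)] = priv[i & 7];
    }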
they can't suddenly change unrelated to the work item's execution. SLM *can*.
It can't change unrelated to a workgroup's execution, which is the scope this whole damn thing is defined on. It's very simple. And I said this before already: the API doesn't define the execution of a single work item. It states that you can't rely on an individual execution of a work item. The only thing you can rely on is the execution of an individual workgroup. For the rest, see above.
We're going to have to agree to disagree on this. I understand your point of view, but the two have widely different semantics in the API in terms of how they relate to the execution model.
The concept of workgroup-private local/shared memory is actually the same from DX through OpenCL to CUDA. It differs quite a bit from global memory. But good that you understand my point of view.
Are you seriously arguing that the GCN register file/LDS design is the only one that makes sense?
Did I say so? I was just trying to express that one always has choices for the implementation; some are good and have their distinctive set of pros and cons, others have far more cons.

I guess we had enough good arguments for the design of GCN's LDS in the thread already. If I were to boil down your stance, I would say that you are arguing against a clean partition of the memory spaces defined in all relevant APIs as disjoint memory spaces, for no apparent reason, and don't accept that it actually makes sense to enforce this partition to get rid of undefined behaviours (and get guaranteed process isolation of private and shared memory along the way).
I would say the enforced partitioning is the better solution from both the API and the ISA perspective, compared to leaving huge holes with undefined behaviours in the specs. It doesn't apply so much to CPUs, as they come from a different side of the fence. Investing there in some additional hardware partition makes no sense (but registers, the only private resource they have in hardware, are safe from interference by other threads, too; why do you want to settle for less on GPUs?). But it does for GPUs, where you have dedicated memory arrays for the private and local memory anyway, so you can basically directly map the layout of the API to the hardware (or the other way around). And as said, the effort is the same as you would have to spend for process isolation anyway (assuming you don't want to compromise flexibility or performance in exchange).

But as I understood you, we can agree that there are only upsides to a strict partitioning and no obvious downsides?
 
If I were to boil down your stance, I would say that you are arguing against a clean partition of the memory spaces defined in all relevant APIs as disjoint memory spaces, for no apparent reason, and don't accept that it actually makes sense to enforce this partition to get rid of undefined behaviours (and get guaranteed process isolation of private and shared memory along the way).
I'm not arguing *against* that design, I'm just saying I don't really care, because accessing out of bounds is an application error. I'm also completely fine with not running multiple processes on the same CU simultaneously... I doubt you'd even see a measurable performance difference outside adversarial cases.

I would say the enforced partitioning is the better solution from both the API and the ISA perspective, compared to leaving huge holes with undefined behaviours in the specs.
I agree undefined behavior is undesirable, but conversely as I've said, I reject the notion that a shipping application should ever rely on the behavior of out-of-range accesses (other than texture sampling obviously, which has specific address translation baked into the semantics). Our only disagreement here is how much it matters.

But as I understood you, we can agree that there are only upsides to a strict partitioning and no obvious downsides?
From a programmer's perspective, yes. From a hardware perspective we'd need a lot more analysis to prove that a fairly different design that didn't require bounds checking at all wouldn't end up being better, but I'm willing to buy that with the small number of local memory address bits it's not the end of the world either way.

I think ultimately having the separate memory spaces at all is questionable. In practice, with a full cache hierarchy, shared memory is mostly useful as a gather/scatter scratchpad, due to its highly banked nature. In GCN you get some niceness with shared memory atomics as well. But beyond that, most uses of shared memory are simply implementing software prefetching/caches, and then only with predictable memory access patterns; i.e. ultimately scratchpads are a poor substitute for read/write caches, but they can capture some of the benefit with simpler hardware.

That's another discussion and unrelated to GCN, but there's nothing holy about shared memory in the APIs... it's mostly a compromise designed around GPU hardware evolution.
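For reference, the "software prefetch with a predictable access pattern" case reads something like this in OpenCL (hypothetical kernel; halo handling simplified):

Code:
    #define WG 64  // assumed work-group size

    __kernel void stencil3(__global const float* in, __global float* out)
    {
        __local float tile[WG + 2];
        size_t lid = get_local_id(0);
        size_t gid = get_global_id(0);

        // Stage a tile of global memory in the scratchpad...
        tile[lid + 1] = in[gid];
        if (lid == 0)      tile[0]      = in[gid];  // halo simplified
        if (lid == WG - 1) tile[WG + 1] = in[gid];  // halo simplified
        barrier(CLK_LOCAL_MEM_FENCE);

        // ...then let neighbouring work items reuse it: the job a
        // read/write cache would otherwise do automatically.
        out[gid] = 0.25f * tile[lid] + 0.5f * tile[lid + 1]
                 + 0.25f * tile[lid + 2];
    }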
 
Agreed for general array indexing, but this is less than that. It's for this sort of arbitrary concept of "workgroup resource boundaries"...
I don't find it arbitrary. Workgroups are supposed to be independent. Limiting sharing across workgroups is one way of exposing the notion of locality in the programming model. Enforcing it in hw is desirable. Workgroups are the natural units of isolation in ocl, since fine grained communication is possible only within a workgroup and inter workgroup communication/synchronization is expensive, if allowed at all.
 
I think ultimately having the separate memory spaces at all is questionable. In practice, with a full cache hierarchy, shared memory is mostly useful as a gather/scatter scratchpad, due to its highly banked nature. In GCN you get some niceness with shared memory atomics as well. But beyond that, most uses of shared memory are simply implementing software prefetching/caches, and then only with predictable memory access patterns; i.e. ultimately scratchpads are a poor substitute for read/write caches, but they can capture some of the benefit with simpler hardware.
I agree to a large extent. But I think having separate memory spaces in the *programming model* is desirable. It's better to expose locality to code and let it use as much as it wants. For codes that don't care, just use all the SRAM as cache. Kepler is quite a bit like that; KC is obviously fully there.
 
I agree to a large extent. But I think having separate memory spaces in the *programming model* is desirable. It's better to expose locality to code and let it use as much as it wants. For codes that don't care, just use all the SRAM as cache. Kepler is quite a bit like that; KC is obviously fully there.

With shared memory, you don't have to worry about coherency. This is a big win for many-core systems where the overhead of enforcing coherency adds up. Also, you don't have to take a cache miss when you first read from it, which an ordinary cache would require, even though it's actually being used as scratch memory. Indeed, many CPUs also have special instructions that operate on the cache as an incoherent scratch pad, though they still have to deal with cache misses since the scratch memory could be evicted at any time. Having shared memory in the programming model allows the programmer to actually take advantage of this without resorting to an ungodly mess of intrinsics.
 
With shared memory, you don't have to worry about coherency.

Surely you jest. The span of interesting things that can afford not to worry about coherency is low, and whether or not an explicitly managed scratchpad is in place or not is mostly orthogonal to whether or not coherency is desirable or achievable. I'm partisan to software managed scratchpads being available, but let's not make stuff up about what they are and what they are useful for.
 
Surely you jest. The span of interesting things that can afford not to worry about coherency is low, and whether or not an explicitly managed scratchpad is in place or not is mostly orthogonal to whether or not coherency is desirable or achievable. I'm partisan to software managed scratchpads being available, but let's not make stuff up about what they are and what they are useful for.

The whole point of a scratchpad is to allow for incoherence (and no, your compiler isn't likely to be able to do this automatically). Otherwise, what's the point?

Cache coherency is a nasty problem for massively parallel computing. Current CPU cache snooping protocols don't scale past 10's of cores, and the various interconnect and directory protocols that *do* greatly increase latency on cache misses.
 
The whole point of a scratchpad is to allow for incoherence (and no, your compiler isn't likely to be able to do this automatically). Otherwise, what's the point?
I suspect you mean that each CU's LDS is independent of the other CUs. If so, that's correct. With GCN, memory operations (local and global) are handled asynchronously from program execution. So when you issue a read from local memory, say, there's no stall until you need the data (you issue an explicit wait). This was a big change from EG/NI where global memory operations were handled in clauses and local memory operations would cause stalls. Now the compiler can better schedule instructions to hide latency.
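At the OpenCL level, the closest analogue of that issue-then-wait model is the async work-group copy (hypothetical kernel, illustration only):

Code:
    __kernel void staged(__global const float* src, __global float* dst,
                         __local float* tile)
    {
        size_t n = get_local_size(0);
        // Issue the copy into local memory; no stall here.
        event_t e = async_work_group_copy(tile, src + get_group_id(0) * n, n, 0);
        // Independent work could be issued here to hide the latency.
        wait_group_events(1, &e);  // the explicit wait
        dst[get_global_id(0)] = tile[get_local_id(0)] * 2.0f;
    }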
 
I suspect you mean that each CU's LDS is independent of the other CUs. If so, that's correct. With GCN, memory operations (local and global) are handled asynchronously from program execution. So when you issue a read from local memory, say, there's no stall until you need the data (you issue an explicit wait). This was a big change from EG/NI where global memory operations were handled in clauses and local memory operations would cause stalls. Now the compiler can better schedule instructions to hide latency.

But even in the best case, the load ties up a register, and takes up extra bandwidth besides. Also, effective prefetching is hard, especially when you have to cover 100s of cycles latency. Why eat the cost if you don't actually need the coherency, which is often the case for inner loops? With a true L1 cache, writes eventually have to be flushed to main memory, and reads have to initially come from main memory, which is quite a waste if it's really a local scratchpad.
 