AMD: Sea Islands R1100 (8*** series) Speculation/Rumour Thread

So if it is the same IP level as the other Southern Islands GPUs, how does it qualify for a new GPU family? For all intents and purposes it should be the smallest member of the Southern Islands family.

As for mobile:
Why not at least call these GPUs Southern Islands too, instead of "London"? Why have Sea Islands AND Solar Systems? It is unnecessary since they are the same GPUs, just differently clocked or partially deactivated SKUs. And then we had "Chelsea, Heathrow, Wimbledon, Barts, Cayman, Juniper"... my head hurts ;)

What's wrong with sticking to one family and one family only per generation? Cape Verde, Pitcairn, Tahiti and that's it for the first 28nm gen. You can still call Pitcairn XT mobile "7970M" if OEMs want that, but at least the codename system would be lean and easy to follow.
 
We're talking about memory with addresses here (how would you even access registers OOB on the CPU? heh), so while it may be organized more like registers in GCN, that's not how it looks/works in the API. And yeah, any CPU core can read/write the same memory as any other obviously...
No, it can't. Hyperthreaded processes cannot touch registers in use by other hyperthreaded processes.
Andrew Lauritzen said:
It's about out-of-bounds accesses... the only point where it becomes relevant what they do is where they start to interfere with process isolation. Not running two processes at the same time on the same CU seems like a pretty good solution to the problem to me, especially since you're not going to have multiple DMA buffers running simultaneously with current OSes anyways.
This is not a suitable solution at all. If you are worried about one process interfering with another, why wouldn't you worry about one workgroup trashing another and causing problems? The same solution fixes both cases trivially. Limiting a CU to a single wavefront/workgroup would not be acceptable for performance as it would severely limit the amount of latency hiding you could have.
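To make the hazard concrete, here is a minimal OpenCL C sketch (kernel name and sizes invented for illustration): without a range check somewhere, the dynamically computed index below could land in LDS belonging to a different workgroup resident on the same CU.

```c
__kernel void oob_demo(__global const int* indices, __global float* out)
{
    __local float tile[64];               // this workgroup's local memory
    int lid = get_local_id(0);
    int bad = indices[get_global_id(0)];  // runtime value, possibly >= 64

    tile[bad] = 1.0f;  // out of bounds if bad >= 64; what happens next is
                       // exactly the question being debated here
    barrier(CLK_LOCAL_MEM_FENCE);
    out[get_global_id(0)] = tile[lid];
}
```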
 
GPU sales have been going up before, so why stop now?

Also, I don't think they are selling better than expected! Either that, or their expectations are abnormally low! They had to compromise several times, once with a price reduction and another time with crazy, unprecedented game packages... that is not a sign of a healthy corporate economy.
Their game bundle has been doing very well.
 
So if it is the same IP level as the other Southern Islands GPUs, how does it qualify for a new GPU family? For all intents and purposes it should be the smallest member of the Southern Islands family.
Who said that all Sea Islands GPUs will have the exact same IP level?

Remember Northern Islands. Cayman was new VLIW4-IP, the rest was basically still Juniper-level VLIW5-IP, apart from very minor tweaks (AF bugfix and slightly improved tessellation performance, basically).
 
No, it can't. Hyperthreaded processes cannot touch registers in use by other hyperthreaded processes.
No kidding. Am I speaking the same language here? That was my entire point - registers are *statically addressed* (encoded into the instructions). They are not memory, so there is no "out of bounds access". LDS is used more like memory than registers are, even if GCN conflates the two in hardware somewhat.
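For illustration, the distinction can be put in a few lines of OpenCL C (a sketch, names mine): register operands are fixed when the compiler encodes an instruction, while an LDS index is an arbitrary runtime value.

```c
__kernel void regs_vs_lds(__global float* out, int i)
{
    float r = 2.0f;      // compiles to a register operand: the register
    float s = r * r;     // number is baked into the instruction, so an
                         // "out of bounds" register access is not even
                         // expressible in the program

    __local float lds[16];
    lds[i] = s;          // the LDS index i is computed at runtime, so an
                         // out-of-bounds access is perfectly expressible
    barrier(CLK_LOCAL_MEM_FENCE);
    out[get_global_id(0)] = lds[0];
}
```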

If you are worried about one process interfering with another, why wouldn't you worry about one workgroup trashing another and causing problems?
I feel like this is not a complicated concept to understand... memory from different *processes* should not be able to stomp on one another, as they are in different address spaces and there is no sharing. Memory in the same process is in the same address space, and thus is fully able to cause whatever data races and corruption within that memory you want. Applications can easily crash due to this, but it shouldn't bring down the whole operating system (or other processes).

There is no reason to expect any sort of safety "within" the same process. You can claim that LDS memory spaces are unique per workgroup - and that's fine - but there's no reason it has to be that way. Ultimately if I write bad code and access stuff out of bounds I can screw up anything in my own application, and no amount of trying to baby my LDS access is going to prevent that.

And I'll reiterate again, regardless of what GCN does with LDS, it's not safe to assume the same when writing portable code. The DX spec says any out of bounds writes to shared memory can corrupt any other shared memory on the whole machine (but not global memory). So the debatable safety (from my own code?) actually gives no benefits in DirectX.

Limiting a CU to a single wavefront/workgroup would not be acceptable for performance as it would severely limit the amount of latency hiding you could have.
Only for small workgroups. And as I've said now twice, you don't need to limit a CU to a single workgroup, you only need to limit it to a single *process*. And it already is limited that way because of the OS, so that's not going to change anything.

Like I said, it's fine to have, but it's not necessary to provide memory isolation within the same process. And LDS *is* memory, even if it's in a separate address space than device global memory. Workgroups/wavefronts are like *threads*, not like processes.

Anyways it was a simple question and I got an answer.
 
Who said that all Sea Islands GPUs will have the exact same IP level?

Remember Northern Islands. Cayman was new VLIW4-IP, the rest was basically still Juniper-level VLIW5-IP, apart from very minor tweaks (AF bugfix and slightly improved tessellation performance, basically).

That would make it better? Don't think so.
 
Better than what? Nvidia have been doing this for years and years. AMD as well depending on the generation. Some generations they don't, some they do.

Regards,
SB

Better than not doing it of course. I find it silly to resort to stuff like "but Nvidia does it too!"
Doesn't solve the problem at hand.
 
No kidding. Am I speaking the same language here? That was my entire point - registers are *statically addressed* (encoded into the instructions). They are not memory, so there is no "out of bounds access". LDS is used more like memory than registers are, even if GCN conflates the two in hardware somewhat.
Do you understand the OpenCL specification and what it says about "local" memory? Workgroups don't share local data. End of discussion.
Andrew Lauritzen said:
I feel like this is not a complicated concept to understand... memory from different *processes* should not be able to stomp on one another, as they are in different address spaces and there is no sharing. Memory in the same process is in the same address space, and thus is fully able to cause whatever data races and corruption within that memory you want. Applications can easily crash due to this, but it shouldn't bring down the whole operating system (or other processes).

There is no reason to expect any sort of safety "within" the same process. You can claim that LDS memory spaces are unique per workgroup - and that's fine - but there's no reason it has to be that way. Ultimately if I write bad code and access stuff out of bounds I can screw up anything in my own application, and no amount of trying to baby my LDS access is going to prevent that.

And I'll reiterate again, regardless of what GCN does with LDS, it's not safe to assume the same when writing portable code. The DX spec says any out of bounds writes to shared memory can corrupt any other shared memory on the whole machine (but not global memory). So the debatable safety (from my own code?) actually gives no benefits in DirectX.
Yes, that's what the DX spec says, so what? OpenCL does not say that and it's more robust because of it. Also, there are advantages to not having to worry about going off the end of your local memory allocation, just like there are advantages to not worrying about where you sample a texture.
Andrew Lauritzen said:
Only for small workgroups. And as I've said now twice, you don't need to limit a CU to a single workgroup, you only need to limit it to a single *process*. And it already is limited that way because of the OS, so that's not going to change anything.

Like I said, it's fine to have, but it's not necessary to provide memory isolation within the same process. And LDS *is* memory, even if it's in a separate address space than device global memory. Workgroups/wavefronts are like *threads*, not like processes.
Define small? In OpenCL, we allow for workgroups of up to 256 threads. This allows for the best performance and helps avoid spilling. DirectCompute requires that you expose a minimum maximum workgroup size of 1024 threads. That's still only 16 wavefronts per CU, which is still not very much if you are latency bound. Why should we restrict ourselves to a fraction of our occupancy?
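For reference, the arithmetic behind those numbers (a quick C sketch; the wavefront width of 64 is GCN, while the per-CU limit of 40 wavefronts, 10 per SIMD across 4 SIMDs, is an assumption on my part):

```c
#include <stdio.h>

int main(void)
{
    const int wave_size    = 64;    /* GCN wavefront width                */
    const int max_waves_cu = 40;    /* assumed per-CU limit (10/SIMD x 4) */

    int waves_1024 = 1024 / wave_size;  /* DirectCompute max workgroup    */
    int waves_256  =  256 / wave_size;  /* typical OpenCL workgroup       */

    printf("1024-thread group = %d wavefronts (%d%% of one CU)\n",
           waves_1024, 100 * waves_1024 / max_waves_cu);
    printf(" 256-thread group = %d wavefronts (%d%% of one CU)\n",
           waves_256, 100 * waves_256 / max_waves_cu);
    return 0;
}
```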
 
Better than not doing it of course. I find it silly to resort to stuff like "but Nvidia does it too!"
Doesn't solve the problem at hand.

Here's the thing. If everyone is doing it, what incentive does a company have for not doing it?

When AMD chose not to do it, hardly anyone said anything to praise them for it. In that same generation Nvidia continued to do it, and got praised for it.

So again, what incentive is there for AMD to not do it?

And to be clear, I was one of the ones applauding AMD for attempting to not reuse and repurpose (at least on the desktop 4xxx, 5xxx and I believe 6xxx series), while blasting Nvidia for continuing to do it. But the majority didn't care and were fine with Nvidia reusing some GPUs/GPU IP levels for 3-4 generations. I then blasted AMD for doing it in the 7xxx generation, but again most people thought it was fine.

So, if it gains AMD no increased brand awareness and brings them no increased profits, why should they bother? Why shouldn't they just do it like everyone else is doing it?

Regards,
SB
 
Do you understand the OpenCL specification and what it says about "local" memory? Workgroups don't share local data. End of discussion.
You'll find my name in the OpenCL specification... suffice it to say I'm well aware of how this works in all the relevant APIs (not that I think CL is particularly relevant in practice at the moment, but still). And "workgroups don't share local data" says nothing about out-of-bounds read/write behavior...

OpenCL does not say that and it's more robust because of it.
OpenCL doesn't specify what happens with out of bounds access *at all* last I checked (probably "undefined" like practically everything else in CL). In fact CL specifically states that valid implementations can map local memory into global memory, so it doesn't even have the guarantee of out of bounds accesses not corrupting device global memory that DX has (otherwise how do you think it'd work on a unified memory device like a CPU?). That's significantly *less* robust.
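A sketch of why that is (my own illustration, not from the spec): a CPU implementation will typically back "local" memory with an ordinary heap allocation, so an out-of-bounds store lands in whatever happens to sit next to it in the process address space.

```c
#include <stdlib.h>

/* a CPU implementation's "__local" arena for one workgroup (sketch) */
typedef struct {
    float  *mem;   /* plain heap memory          */
    size_t  size;  /* allocation size, in floats */
} local_arena;

static local_arena alloc_local(size_t floats)
{
    local_arena a = { malloc(floats * sizeof(float)), floats };
    return a;
}

static void kernel_store(local_arena *a, size_t index, float v)
{
    /* no hardware range check on a CPU: if index >= a->size, this
       write corrupts neighbouring heap data, i.e. "global" memory */
    a->mem[index] = v;
}
```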

Also, there are advantages to not having to worry about going off the end of your local memory allocation, just like there are advantages to not worrying about where you sample a texture.
Reads and writes are completely different beasts in this context. The point is, convenience aside, it's never safe to write that code unless you know you're running only on GCN or architectures with similar guarantees.

Why should we restrict ourselves to a fraction of our occupancy?
Again... only per-process, not per-workgroup. And again, the whole machine is already limited to this by the driver model, let alone per-CU.

Anyways, let's take this to another thread maybe if we want to continue the discussion, although like I said, I got my answer so I'm happy. Exceeding the spec is fine, but GCN's take is clearly not required by any of the specs.
 
Doesn't solve the problem at hand.
Which problem? The OEMs probably don't care much what is under the hood, as long as it's cheap to buy, dissipates as little heat and draws as little power as possible, has some checkbox features for marketing and sells well. The vast majority of consumers don't care either, they wouldn't be able to tell the difference anyway. The competition is actually supposed to be kept in the dark as long as possible, although that's a moot point currently thanks to Mr. Feldstein and his fellas.


And using more straightforward codenames just for the convenience of the information-greedy press and enthusiasts? Quite frankly, if I were in AMD's position, I wouldn't do that either. I'd rather be sitting in front of a screen reading through threads like these while grinning broadly ;)
 
We're talking about memory with addresses here (how would you even access registers OOB on the CPU? heh),
Exactly! OOB register accesses on an HTed CPU would be accessing the registers of the other thread. That is indeed impossible, just like it is on GPUs. And for the exact same reasons, btw. Accessing the private resources of other threads/wavefronts/workgroups would also be a terrible idea. This isolation is necessary. Everything else would create an absolute mess of an environment where you could do all sorts of stupid things.

And what does "memory with addresses" mean? The LDS is basically addressed by quite small indices, like the reg file. The physical register file for each vALU slot is 4kB in size. That's just about one order of magnitude away from the LDS size. Both get partitioned in roughly the same way: each wavefront (workgroup in case of the LDS) gets a base index and a size. For each access the base is added to the index used in the kernel, and the latter is checked against the allocated size. At this stage the hardware is very likely just working with the relatively small indices, not with full-blown 32/64-bit addresses. Of course one can put some kind of address translation in front of the LDS to enable the flat memory model (which is a very recently added feature of C.I. or GCN 1.1, not found in any commercially available AMD GPU so far; on nV's GPUs it is basically a free feature, as local memory is accessed through the L/S units also handling the usual memory accesses), but that doesn't change anything about the actual checks.
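A rough model of that check, as I read the description above (a sketch in C, not AMD documentation; sizes and names are mine):

```c
#include <stdint.h>

#define LDS_WORDS (64 * 1024 / 4)        /* 64 KiB of dwords per CU */
static uint32_t lds[LDS_WORDS];

typedef struct { uint32_t base, size; } lds_alloc;  /* per workgroup */

static uint32_t lds_read(lds_alloc a, uint32_t index)
{
    if (index >= a.size)     /* OOB read: suppressed, returns 0 */
        return 0;
    return lds[a.base + index];
}

static void lds_write(lds_alloc a, uint32_t index, uint32_t value)
{
    if (index >= a.size)     /* OOB write: silently dropped */
        return;
    lds[a.base + index] = value;
}
```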
so while it may be organized more like registers in GCN, that's not how it looks/works in the API.
But it's the way the hardware works.
And which API? As far as I know, the HLSL used to write DX compute shaders doesn't support pointers at all. In this sense one always works with indices and not addresses there.
And I guess almost every API defines the local/shared memory owned by a workgroup/threadgroup as private to that group and inaccessible by other groups. OpenCL does it, CUDA does it and DirectX does it too. Bounds checks for the local/shared memory accesses are basically implied (edit: but obviously not strictly required) by the APIs I know of.
And yeah, any CPU core can read/write the same memory as any other obviously...
But the local/shared memory is a private resource like registers (just for the workgroup), not some global memory.
But hey, cool if there's that extra level of safety, just seems unnecessary compared to the simpler solutions.
It is the better solution. ;)
 
Accessing the private resources of other threads/wavefronts/workgroups would also be a terrible idea.
No, you're lumping stuff in here which isn't the same thing. On CPUs there is no memory safety inside a single process. Two threads can race/corrupt memory, etc. There's never any expectation that my code has to be safe from... uhh... itself. Bad code is completely allowed to corrupt any of my own memory space, and that's totally fine. Incidentally, dropping accesses to out of bounds memory doesn't solve any problems with bad code whatsoever. It's clearly an error in the application and needs to be fixed. The only bit you don't want is for it to spill over into separate *processes*, i.e. contexts/DMA buffers for GPUs, not workgroups.

And what does "memory with addresses" mean? The LDS is basically addressed by quite small indices, like the reg file.
Sure, but shared memory addresses are defined *dynamically*. Buggy code can address outside the range of any array, and you can't statically prove anything about that in advance. You don't need pointers for this... I can do a[237462374] on a 16-long array just as easily in DX as in CL. I can feed 32-bit indices into it at the API level, but I suppose you can drop the high bits since shared memory has (static) size limitations for now so that's cheaper than 64-bit pointers at least.

But it's the way the hardware works.
It's the way *one piece* of hardware works, i.e. relevant if I'm only targeting that hardware and irrelevant if I'm writing portable code (which must be written to the API specifications). Try writing that code and running it on an OpenCL CPU implementation. You're either going to get badness or your code is going to be much slower than it needs to be.

And I guess almost every API defines the local/shared memory owned by a workgroup/threadgroup as private to that group and inaccessible by other groups. OpenCL does it, CUDA does it and DirectX does it too. Bounds checks for the local/shared memory accesses are basically required by the APIs I know of.
No, that's the fundamental mistake you guys are making in logic here. You're assuming that "this memory is well-defined to be shared between these invocations" (i.e. shares a common base pointer) means that out-of-bounds reads/writes are suppressed. That's not the case, and those are two separate constraints. Read the specs. While they all define it that way for global memory, CL is completely silent on the entire issue (and in fact since it cribs from the C spec the only reasonable assumption is that these out-of-bounds reads/writes are undefined... i.e. can corrupt memory, cause crashes, whatever) and DX only defines that OOB shared memory writes do not corrupt global memory.

But the local/shared memory is a private resource like registers (just for the workgroup), not some global memory.
The key difference here is that it takes dynamic addresses and thus can have out of bounds accesses. It's more similar to memory in semantics than registers. Separate out your thinking about specific hardware from API semantics for a second and it should be pretty clear how the two cases differ.

It is the better solution. ;)
It's fine, but as I've said it doesn't actually help matters (for a single process; it's good for process isolation, just overkill). First, since it's not guaranteed I can't rely on it, and second, that's not a good way to write code anyways. Even in languages with bounds checking on arrays it's for debugging purposes (i.e. it throws an exception), not something you're supposed to intentionally trigger in production code. Silently failing (returning zero, not writing) is arguably even less useful than that.
 
I just read the past 4 pages and there are lots of inconsistent posts about this, so I don't know what is taken as fact, or at least as the "most persistent" rumour.

Anandtech says there won't be any HD8000 on the desktop for 2013, with the current HD7000 chips being used throughout the rest of the year. The mobile parts will go through a renaming scheme in order to better fit the new Oland chip with 384 ALUs, which competes with GK107 and GF116.

Is this correct?
 
I just read the past 4 pages and there are lots of inconsistent posts about this, so I don't know what is taken as fact, or at least as the "most persistent" rumour.

Anandtech says there won't be any HD8000 on the desktop for 2013, with the current HD7000 chips being used throughout the rest of the year. The mobile parts will go through a renaming scheme in order to better fit the new Oland chip with 384 ALUs, which competes with GK107 and GF116.

Is this correct?

There will be new chips. Just how "new" they will be and how much better they will be compared to what AMD has now is unclear, however.
 
No, you're lumping stuff in here which isn't the same thing. On CPUs there is no memory safety inside a single process. Two threads can race/corrupt memory, etc.
As is possible on GPUs.
There's never any expectation that my code has to be safe from... uhh... itself. Bad code is completely allowed to corrupt any of my own memory space, and that's totally fine.
I just don't regard the local memory as being as universally usable a memory space as global memory. From my perspective it's much closer to the registers, because local memory is a resource defined as private to a warp/wavefront/workgroup, just like registers. Even within the same process, you can't access the registers of another thread on a CPU.
Incidentally, dropping accesses to out of bounds memory doesn't solve any problems with bad code whatsoever. It's clearly an error in the application and needs to be fixed. The only bit you don't want is for it to spill over into separate *processes*, i.e. contexts/DMA buffers for GPUs, not workgroups.
That's exactly what this solves. How do you want to ensure isolation between the local memory of two different kernels running on the same CU? You would have to do the same amount of range checks anyway. So your argument that it is overly complicated doesn't hold water in my view.
Sure, but shared memory addresses are defined *dynamically*. Buggy code can address outside the range of any array, and you can't statically prove anything about that in advance.
As you can do with registers on GPUs (which get bounds-checked in these cases too, as mentioned already). Your point is?
It's the way *one piece* of hardware works, i.e. relevant if I'm only targeting that hardware and irrelevant if I'm writing portable code (which must be written to the API specifications). Try writing that code and running it on an OpenCL CPU implementation. You're either going to get badness or your code is going to be much slower than it needs to be.
I thought we were talking about GPUs here. And the spec says out of bounds accesses are illegal. If you write code to spec, it runs everywhere anyway. We are talking about how these illegal cases are handled (which was probably left out of the OpenCL spec because of fears that specifying it would make it harder to implement on some devices). It doesn't change that the local memory is defined as private to a workgroup and the programmer is supposed to adhere to that. GCN enforces this in hardware, i.e. it has a defined behaviour for these cases and will raise a memory access violation exception with the next iteration. That behaviour is completely within spec. And I can't see anything bad there.
No, that's the fundamental mistake you guys are making in logic here. You're assuming that "this memory is well-defined to be shared between these invocations" (i.e. shares a common base pointer) means that out-of-bounds reads/writes are suppressed.
Look at how the discussion started! You were asking specifically about the implementation on AMD GPUs, where these OOB accesses get suppressed. I was specifically talking about that. It only later developed into a discussion of how much sense this implementation makes. So we are not making a fundamental mistake when we explained how it is implemented. All APIs basically state that out of bounds accesses on the local memory are illegal. It therefore doesn't hurt to do it. And it solves the process isolation issue with the same effort you would have to spend anyway.
The key difference here is that it takes dynamic addresses and thus can have out of bounds accesses. It's more similar to memory in semantics than registers.
As stated already, that's actually not as strict a distinction as you think. It doesn't have to be static; you can also index dynamically into the reg file on GPUs. You can indeed have out of bounds indices into the regfile. By the way, that is an implementation-specific detail which is not covered by the OpenCL spec, isn't it? :LOL:
AFAIK the OpenCL spec also doesn't say what happens if you index out of bounds into private memory. It also does not specify that private memory has to sit in registers; actually it specifically mentions other possibilities (which get heavily used in CPU implementations). It's therefore all implementation dependent.
It's fine, but as I've said it doesn't actually help matters. First, since it's not guaranteed I can't rely on it, and second, that's not a good way to write code anyways.
RLY? Come on, you started the whole thing by doubting that GCN isolates processes within the local memory as well. You got the answer that it does and the explanation of how it is done, and now you say you can't rely on it?
You can rely on proper process isolation if you use their hardware. That AMD GPUs prevent local as well as global memory corruption by suppressing out of bounds accesses (and will raise exceptions shortly) has some potentially useful applications (besides the obvious debugger, think of MegaTexture/PRT, paging). Should you write illegal OpenCL code with undefined behaviour because of it? Of course not! That would indeed be bad practice. But wait for some future DX, OpenCL, or OpenGL version/extension and we may see this functionality exposed with a proper API to use (or wait until you get your hands on one of the next consoles ;)).

Btw., I don't know what Intel GPUs do, but frankly, I somewhat doubt you can corrupt other processes' memory on them. And on nVidia it's also not possible (though they may crash, at least the last time I read something about it; that could have changed).
 
(which was probably left out of the OpenCL spec because of fears that specifying it would make it harder to implement on some devices).
... and because the CL spec is a mess that leaves a million basic things undefined (or worse, simply unmentioned) while spending 300 pages on trivial API explanations ;)

I just don't regard the local memory as being as universally usable a memory space as global memory. From my perspective it's much closer to the registers, because local memory is a resource defined as private to a warp/wavefront/workgroup, just like registers. Even within the same process, you can't access the registers of another thread on a CPU.
There's a *big* difference between resources that are only visible to a single invocation/"thread of execution" and stuff that is visible to multiple different scheduled entities. Registers are the former, while LDS is the latter. I understand where you're coming from on the hardware side, but I think the API perspective is the more relevant one when considering parallel semantics.

Look at how the discussion started! You were asking specifically about the implementation on AMD GPUs, where these OOB accesses get suppressed. I was specifically talking about that. It only later developed into a discussion of how much sense this implementation makes.
Well the quote that I was responding to there was long after that initial discussion had been settled, but I'm willing to chalk that one up to a misunderstanding :)

RLY? Come on, you started the whole thing by doubting that GCN isolates processes within the local memory as well. You got the answer that it does and the explanation of how it is done, and now you say you can't rely on it?
I think you misunderstood me... yes obviously you can rely on it for process isolation. But the claimed additional safety that it brings to "workgroup isolation" or whatever is not useful in practice as it is not portable. Also, like I said, that's not a good way to write code in general. You seem to agree with me here:

Should you write illegal OpenCL code with undefined behaviour because of it? Of course not! That would indeed be bad practice. But wait for some future DX, OpenCL, or OpenGL version/extension and we may see this functionality exposed with a proper API to use (or wait until you get your hands on one of the next consoles ;)).
It seems we're on the same page. However I will still note that my care level about suppressing OOB accesses to shared memory is quite low... it's not exactly difficult to accomplish that myself in the very few cases where I actually want those semantics :) If it could throw exceptions (or otherwise signal+terminate) then it would be useful for debugging though of course.
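Something like the following is all that would take in portable kernel code (a sketch in OpenCL C; names and sizes are mine):

```c
#define TILE 64

__kernel void guarded_lds(__global const int* idx, __global float* out)
{
    __local float tile[TILE];
    int lid = get_local_id(0);
    int i   = idx[get_global_id(0)];   /* possibly out of range */

    if (i >= 0 && i < TILE)            /* drop OOB writes by hand */
        tile[i] = (float)lid;
    barrier(CLK_LOCAL_MEM_FENCE);

    int j = clamp(lid, 0, TILE - 1);   /* force reads into range */
    out[get_global_id(0)] = tile[j];
}
```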

You can rely on proper process isolation if you use their hardware
...
Btw., I don't know what Intel GPUs do, but frankly, I somewhat doubt you can corrupt other processes' memory on them. And on nVidia it's also not possible (though they may crash, at least the last time I read something about it; that could have changed).
As I said, you can't corrupt other processes' (shared local) memory on *any* GPUs right now because you can't have multiple DMA buffers from different processes in flight at the same time. This is all an academic discussion :) And obviously all APIs require not corrupting any global memory.

So I think we're mostly in agreement here, with the only disagreements being around...

1) You think the additional isolation of workgroups is useful; I don't. Obviously it's fine and spec compliant either way.

2) You think the bounds checking hardware is required because of indirect RF accesses. I claim those indirect accesses are an implementation choice, and alternate choices would not require that bounds checking hardware. But in any case, you are correct that it doesn't need a full 32-bit compare in general as high address bits can just be dropped (with proper sign handling) on a given architecture with a fixed maximum shared memory size.
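Roughly, that narrower check could look like this (my interpretation of the point, not a hardware description): with a 64 KiB LDS only the low ~14 index bits matter, and any set high bit, including the sign bit, marks the access out of bounds outright.

```c
#define LDS_MAX_WORDS (1u << 14)   /* 16K dwords = 64 KiB, power of two */

/* returns nonzero if the access falls inside this workgroup's allocation */
static inline int lds_in_bounds(int index, unsigned alloc_words)
{
    unsigned u = (unsigned)index;  /* negative indices become huge, so the */
    if (u >= LDS_MAX_WORDS)        /* high-bit test handles the sign too   */
        return 0;
    return u < alloc_words;        /* narrow compare against the size      */
}
```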
 