AMD: Sea Islands R1100 (8*** series) Speculation/ Rumour Thread

Discussion in 'Architecture and Products' started by Shtal, Dec 31, 2011.

  1. boxleitnerb

    Regular

    Joined:
    Aug 27, 2004
    Messages:
    407
    Likes Received:
    0
    So if it is the same IP level as the other Southern Islands GPUs, how does it qualify for a new GPU family? For all intents and purposes it should be the smallest member of the Southern Islands family.

    As for mobile:
    Why not at least call these GPUs Southern Islands, too, instead of "London"? Why have Sea Islands AND Solar Systems? It is unnecessary since they are the same GPUs, just differently clocked or partially deactivated SKUs. Or remember what we had before: "Chelsea, Heathrow, Wimbledon, Barts, Cayman, Juniper"... my head hurts ;)

    What's wrong with sticking to one family and one family only per generation? Cape Verde, Pitcairn, Tahiti and that's it for the first 28nm gen. You can still call Pitcairn XT mobile "7970M" if OEMs want that, but at least the codename system would be lean and easy to follow.
     
  2. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    No, it can't. Hyperthreaded processes can not touch registers in use by other hyperthreaded processes.
    This is not a suitable solution at all. If you are worried about one process interfering with another, why wouldn't you worry about one workgroup trashing another and causing problems? The same solution fixes both cases trivially. Limiting a CU to a single wavefront/workgroup would not be acceptable for performance as it would severely limit the amount of latency hiding you could have.
     
  3. Homeles

    Newcomer

    Joined:
    May 25, 2012
    Messages:
    234
    Likes Received:
    0
    Their game bundle has been doing very well.
     
  4. A1xLLcqAgt0qc2RyMz0y

    Veteran Regular

    Joined:
    Feb 6, 2010
    Messages:
    1,589
    Likes Received:
    1,490
    Are they selling it at a loss?

    AMD hardly made any money in discrete GPUs last quarter. Giving away game bundles can't help the bottom line.
     
  5. TKK

    TKK
    Newcomer

    Joined:
    Jan 12, 2010
    Messages:
    148
    Likes Received:
    0
    Who said that all Sea Islands GPUs will have the exact same IP level?

    Remember Northern Islands. Cayman was new VLIW4 IP; the rest was basically still Juniper-level VLIW5 IP, apart from very minor tweaks (an AF bugfix and slightly improved tessellation performance, basically).
     
  6. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,532
    Likes Received:
    957
    Unless it increases sales more than it costs them on each one. Which it probably does, since they're doing it and they're not retarded.
     
  7. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,559
    Likes Received:
    670
    Location:
    British Columbia, Canada
    No kidding. Am I speaking the same language here? That was my entire point - registers are *statically addressed* (encoded into the instructions). They are not memory, so there is no "out of bounds access". In use, LDS is more like memory than like registers, even if GCN conflates the two in hardware somewhat.

    I feel like this is not a complicated concept to understand... memory from different *processes* should not be able to stomp on one another, as they are in different address spaces and there is no sharing. Memory in the same process is in the same address space, and thus is fully able to cause whatever data races and corruption within that memory you want. Applications can easily crash due to this, but it shouldn't bring down the whole operating system (or other processes).

    There is no reason to expect any sort of safety "within" the same process. You can claim that LDS memory spaces are unique per workgroup - and that's fine - but there's no reason it has to be that way. Ultimately if I write bad code and access stuff out of bounds I can screw up anything in my own application, and no amount of trying to baby my LDS accesses is going to prevent that.

    And I'll reiterate again, regardless of what GCN does with LDS, it's not safe to assume the same when writing portable code. The DX spec says any out of bounds writes to shared memory can corrupt any other shared memory on the whole machine (but not global memory). So the debatable safety (from my own code?) actually gives no benefits in DirectX.

    Only for small workgroups. And as I've said now twice, you don't need to limit a CU to a single workgroup, you only need to limit it to a single *process*. And it already is limited that way because of the OS, so that's not going to change anything.

    Like I said, it's fine to have, but it's not necessary to provide memory isolation within the same process. And LDS *is* memory, even if it's in a separate address space than device global memory. Workgroups/wavefronts are like *threads*, not like processes.

    Anyways it was a simple question and I got an answer.
     
    #1127 Andrew Lauritzen, Feb 18, 2013
    Last edited by a moderator: Feb 18, 2013
  8. boxleitnerb

    Regular

    Joined:
    Aug 27, 2004
    Messages:
    407
    Likes Received:
    0
    That would make it better? Don't think so.
     
  9. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    18,365
    Likes Received:
    8,789
    Better than what? Nvidia have been doing this for years and years. AMD as well depending on the generation. Some generations they don't, some they do.

    Regards,
    SB
     
  10. boxleitnerb

    Regular

    Joined:
    Aug 27, 2004
    Messages:
    407
    Likes Received:
    0
    Better than not doing it of course. I find it silly to resort to stuff like "but Nvidia does it too!"
    Doesn't solve the problem at hand.
     
  11. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    Do you understand the OpenCL specification and what it says about "local" memory? Workgroups don't share local data. End of discussion.
    Yes, that's what the DX spec says, so what? OpenCL does not say that and it's more robust because of it. Also, there are advantages to not having to worry about going off the end of your local memory allocation, just like there are advantages to not worrying about where you sample a texture.
    Define small? In OpenCL, we allow for workgroups of up to 256 threads. This allows for the best performance and helps avoid spilling. DirectCompute requires that you expose a minimum maximum workgroup size of 1024 threads. That's still only 16 wavefronts per CU, which is still not very much if you are latency bound. Why should we restrict ourselves to a fraction of our occupancy?
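    (For reference, both of those limits are queryable at runtime. A minimal host-side sketch using the standard OpenCL API; the `dev` and `kernel` handles are assumed to exist already, and the helper name is made up:)

        #include <stdio.h>
        #include <CL/cl.h>

        /* Print the device-wide and per-kernel work-group size limits.
         * 1024 work items on GCN would be 16 wavefronts of 64 lanes. */
        static void print_wg_limits(cl_device_id dev, cl_kernel kernel)
        {
            size_t dev_max = 0, krn_max = 0;

            /* Upper bound for any kernel on this device. */
            clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                            sizeof(dev_max), &dev_max, NULL);

            /* Tighter bound for this particular kernel
             * (depends on its register/LDS usage). */
            clGetKernelWorkGroupInfo(kernel, dev, CL_KERNEL_WORK_GROUP_SIZE,
                                     sizeof(krn_max), &krn_max, NULL);

            printf("device max work-group size: %zu\n", dev_max);
            printf("kernel max work-group size: %zu\n", krn_max);
        }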
     
  12. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    18,365
    Likes Received:
    8,789
    Here's the thing. If everyone is doing it, what incentive does a company have for not doing it?

    When AMD chose not to do it, hardly anyone said anything to praise them for it. In that same generation Nvidia continued to do it, and got praised for it.

    So again, what incentive is there for AMD to not do it?

    And to be clear, I was one of the ones applauding AMD for attempting not to reuse and repurpose (at least on the desktop 4xxx, 5xxx and I believe 6xxx series), while blasting Nvidia for continuing to do it. But the majority didn't care and were fine with Nvidia reusing some GPUs/GPU IP levels for 3-4 generations. I then blasted AMD for doing it in the 7xxx generation, but again most people thought it was fine.

    So, if not doing it gains AMD no increased brand awareness and brings them no increased profits, why should they bother? Why shouldn't they just do it like everyone else does?

    Regards,
    SB
     
  13. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,559
    Likes Received:
    670
    Location:
    British Columbia, Canada
    You'll find my name in the OpenCL specification... suffice it to say I'm well aware of how this works in all the relevant APIs (not that I think CL is particularly relevant in practice at the moment, but still). And "workgroups don't share local data" says nothing about out-of-bounds read/write behavior...

    OpenCL doesn't specify what happens with out of bounds access *at all* last I checked (probably "undefined" like practically everything else in CL). In fact CL specifically states that valid implementations can map local memory into global memory, so it doesn't even have the guarantee of out of bounds accesses not corrupting device global memory that DX has (otherwise how do you think it'd work on a unified memory device like a CPU?). That's significantly *less* robust.
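    (Incidentally, whether "local" memory is real on-chip storage or just a slice of global memory is something a CL implementation reports; a small host-side sketch, with the helper name made up:)

        #include <stdio.h>
        #include <CL/cl.h>

        /* CL_DEVICE_LOCAL_MEM_TYPE is CL_LOCAL for dedicated on-chip storage
         * (e.g. LDS/shared memory) and CL_GLOBAL when local memory is simply
         * carved out of global memory, as on typical CPU devices. */
        static void print_local_mem_type(cl_device_id dev)
        {
            cl_device_local_mem_type type;
            cl_ulong size = 0;

            clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_TYPE, sizeof(type), &type, NULL);
            clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(size), &size, NULL);

            printf("local memory: %s, %llu bytes\n",
                   type == CL_LOCAL ? "dedicated" : "mapped onto global memory",
                   (unsigned long long)size);
        }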

    Reads and writes are completely different beasts in this context. The point is, convenience aside, it's never safe to write that code unless you know you're running only on GCN or architectures with similar guarantees.

    Again... only per-process, not per-workgroup. And again, the whole machine is already limited to this by the driver model, let alone per-CU.

    Anyways, let's take this to another thread maybe if we want to continue the discussion, although like I said, I got my answer so I'm happy. Exceeding the spec is fine, but GCN's take is clearly not required by any of the specs.
     
    #1133 Andrew Lauritzen, Feb 18, 2013
    Last edited by a moderator: Feb 18, 2013
  14. TKK

    TKK
    Newcomer

    Joined:
    Jan 12, 2010
    Messages:
    148
    Likes Received:
    0
    Which problem? The OEMs probably don't care much what is under the hood, as long as it's cheap to buy, dissipates as little heat and draws as little power as possible, has some checkbox features for marketing and sells well. The vast majority of consumers don't care either, they wouldn't be able to tell the difference anyway. The competition is actually supposed to be kept in the dark as long as possible, although that's a moot point currently thanks to Mr. Feldstein and his fellas.


    And using more straightforward codenames just for the convenience of the information-hungry press and enthusiasts? Quite frankly, if I were in AMD's position, I wouldn't do that either. I'd rather be sitting in front of a screen reading through threads like these while grinning broadly :wink:
     
  15. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Exactly! OOB register accesses on a HTed CPU would be accessing the registers of the other thread. That is indeed impossible, like it is on GPUs. And for the exact same reasons, btw. Accessing the private resources of other threads/wavefronts/workgroups would also be a terrible idea. This isolation is necessary. Everything else would create an absolute mess of an environment where you could do all sorts of stupid things.

    And what counts as memory with addresses? The LDS is basically addressed by quite small indices, like the reg file. The physical register file for each vALU slot is 4kB in size; that's only about one order of magnitude away from the LDS size. Both get partitioned in roughly the same way: each wavefront (workgroup in the case of the LDS) gets a base index and a size. For each access the base is added to the index used in the kernel, and the latter is checked against the allocated size. At this stage the hardware is very likely just working with the relatively small indices, not with full-blown 32/64-bit addresses. Of course one can put some kind of address translation in front of the LDS to enable the flat memory model (which is a very recently added feature of C.I. or GCN 1.1, not found in any commercially available AMD GPU so far; in nV's GPUs it is basically a free feature, as local memory is accessed through the L/S units that also handle the usual memory accesses), but that doesn't change anything about the actual checks.
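    (A rough C model of that allocation scheme, purely illustrative - the names are made up and this is not AMD's actual hardware logic:)

        #include <stdbool.h>
        #include <stdint.h>

        /* Per-workgroup LDS window: a base index plus a size, in the same
         * spirit as a wavefront's register-file allocation. */
        typedef struct {
            uint32_t lds_base;  /* start of this workgroup's LDS window */
            uint32_t lds_size;  /* size of the window, in bytes */
        } wg_alloc_t;

        /* Conceptual LDS access: the kernel only ever supplies a small offset;
         * the hardware adds the base and range-checks the offset. An
         * out-of-range access is simply suppressed (reads would return 0). */
        static bool lds_translate(const wg_alloc_t *wg, uint32_t offset,
                                  uint32_t *physical_index)
        {
            if (offset >= wg->lds_size)
                return false;                  /* OOB: drop the access */
            *physical_index = wg->lds_base + offset;
            return true;
        }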
    But it's the way the hardware works.
    And which API? As far as I know, HLSL, which is used to write DX compute shaders, doesn't support pointers at all. In this sense one always works with indices and not addresses there.
    And I guess almost every API defines the local/shared memory owned by a workgroup/threadgroup as private to that group and inaccessible by other groups. OpenCL does it, CUDA does it and DirectX does it too. Bounds checks for the local/shared memory accesses are basically implied (edit: but obviously not strictly required) by the APIs I know of.
    But the local/shared memory is a private resource like registers (just for the workgroup), not some global memory.
    It is the better solution. ;)
     
    #1135 Gipsel, Feb 18, 2013
    Last edited by a moderator: Feb 18, 2013
  16. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,559
    Likes Received:
    670
    Location:
    British Columbia, Canada
    On CPUs there is no memory safety inside a single process. Two threads can race/corrupt memory, etc. There's never any expectation that my code has to be safe from... uhh... itself. Bad code is completely allowed to corrupt any of my own memory space, and that's totally fine. Incidentally, dropping accesses to out-of-bounds memory doesn't solve any problems with bad code whatsoever. It's clearly an error in the application and needs to be fixed. The only bit you don't want is for it to spill over into separate *processes*, i.e. contexts/DMA buffers for GPUs, not workgroups.

    Sure, but shared memory addresses are defined *dynamically*. Buggy code can address outside the range of any array, and you can't statically prove anything about that in advance. You don't need pointers for this... I can do a[237462374] on a 16-long array just as easily in DX as in CL. I can feed 32-bit indices into it at the API level, but I suppose you can drop the high bits since shared memory has (static) size limitations for now so that's cheaper than 64-bit pointers at least.
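    (To make that concrete, a tiny OpenCL C sketch - the kernel and names are invented; whether the out-of-range store gets dropped, corrupts other local memory, or worse is exactly the implementation-defined behaviour under discussion:)

        __kernel void oob_demo(__global const int *indices, __global int *out)
        {
            __local int scratch[16];            /* the 16-long array */
            int gid = (int)get_global_id(0);
            int lid = (int)get_local_id(0);

            scratch[lid & 15] = gid;
            barrier(CLK_LOCAL_MEM_FENCE);

            int i = indices[gid];               /* dynamic index, could be 237462374 */
            scratch[i] = 42;                    /* nothing provable in advance: GCN quietly
                                                   drops an OOB write, the CL spec says
                                                   nothing, DX only promises that global
                                                   memory stays intact */
            barrier(CLK_LOCAL_MEM_FENCE);
            out[gid] = scratch[lid & 15];
        }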

    It's the way *one piece* of hardware works, i.e. relevant if I'm only targeting that hardware and irrelevant if I'm writing portable code (which must be written to the API specifications). Try writing that code and running it on an OpenCL CPU implementation. You're either going to get badness or your code is going to be much slower than it needs to be.

    No, that's the fundamental mistake you guys are making in logic here. You're assuming that "this memory is well-defined to be shared between these invocations" (i.e. shares a common base pointer) means that out-of-bounds reads/writes are suppressed. That's not the case, and those are two separate constraints. Read the specs. While they all define it that way for global memory, CL is completely silent on the entire issue (and in fact since it cribs from the C spec the only reasonable assumption is that these out-of-bounds reads/writes are undefined... i.e. can corrupt memory, cause crashes, whatever) and DX only defines that OOB shared memory writes do not corrupt global memory.

    The key difference here is that it takes dynamic addresses and thus can have out of bounds accesses. It's more similar to memory in semantics than registers. Separate out your thinking about specific hardware from API semantics for a second and it should be pretty clear how the two cases differ.

    It's fine, but as I've said it doesn't actually help matters (for a single process, although it's good for process isolation, just overkill). First since it's not guaranteed I can't rely on it and second that's not a good way to write code anyways. Even in languages with bounds checking on arrays it's for debugging purposes (i.e. throws an exception) not something you're supposed to intentionally do in production code. Silently failing (returning zero, not writing) is arguably even less useful than that.
     
    #1136 Andrew Lauritzen, Feb 18, 2013
    Last edited by a moderator: Feb 18, 2013
  17. ToTTenTranz

    Legend Veteran

    Joined:
    Jul 7, 2008
    Messages:
    12,752
    Likes Received:
    7,763
    I just read the past 4 pages and there are lots of inconsistent posts about this, so I don't know what is taken as fact, or at least as the "most persistent" rumour.

    Anandtech says there won't be any HD8000 in the desktop for 2013, with the current HD7000 chips being used throughout the rest of the year. The mobile parts will go through a renaming scheme in order to better fit in the new Oland chip with 384 ALUs, which competes with GK107 and GF116.

    Is this correct?
     
  18. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,532
    Likes Received:
    957
    There will be new chips. Just how "new" they will be and how much better they will be compared to what AMD has now is unclear, however.
     
  19. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    As possible on GPUs.
    I just don't regard the local memory as such a universally usable memory space as global memory. From my perspective it's much closer to the registers because local memory is a resource defined as private to a warp/wavefront/workgroup just like registers. Even within the same process, you can't access the registers of another thread on a CPU.
    That is solved with exactly this. How do you want to ensure isolation between the local memory of two different kernels running on the same CU? You would have to do the same amount of range checking anyway. So your argument that it is overly complicated doesn't hold water, in my view.
    As you can do with registers on GPUs (which get bounds-checked in these cases too, as mentioned already). Your point is?
    I thought we were talking about GPUs here. And the spec says out-of-bounds accesses are illegal. If you write code to spec, it runs everywhere anyway. We are talking about how these illegal cases are handled (which was left out of the OpenCL spec probably because of fears that specifying it would make it harder to implement on some devices). It doesn't change the fact that local memory is defined as private to a workgroup and the programmer is supposed to adhere to that. GCN enforces this in hardware, i.e. it has defined behaviour for these cases and will raise a memory access violation exception with the next iteration. That behaviour is completely within spec. And I can't see anything bad there.
    Look at how the discussion started! You were asking specifically about the implementation on AMD GPUs, where these OOB accesses get suppressed. I was specifically talking about that. It only later developed into a discussion of how much sense this implementation makes. So we were not making a fundamental mistake when we explained how it is implemented. All APIs basically state that out-of-bounds accesses to local memory are illegal. It therefore doesn't hurt to do it. And it solves the process isolation issue with the same effort you would have to expend anyway.
    As stated already, that's actually not as strict a distinction as you think. It doesn't have to be static; you can also index dynamically into the reg file on GPUs. You can indeed have out-of-bounds indices into the regfile. By the way, that is an implementation-specific detail which is not covered by the OpenCL spec, isn't it? :lol:
    AFAIK the OpenCL spec also doesn't say what happens if you index out of bounds into private memory. It also does not specify that private memory has to sit in registers; actually, it specifically mentions other possibilities (which get heavily used in CPU implementations). It's therefore all implementation-dependent.
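    (As an illustration in OpenCL C - the kernel is invented: a dynamically indexed private array may land in registers, in scratch, or in global memory depending on the implementation, and the spec says nothing about indexing it out of bounds:)

        __kernel void private_demo(__global const int *in, __global int *out)
        {
            int tmp[8];                         /* "private" memory: the spec doesn't say
                                                   whether this lives in registers,
                                                   scratch or global memory */
            int gid = (int)get_global_id(0);

            for (int k = 0; k < 8; ++k)
                tmp[k] = in[gid * 8 + k];

            int i = in[gid] & 15;               /* dynamic index that can reach 8..15 */
            out[gid] = tmp[i];                  /* OOB for i >= 8: undefined by the spec;
                                                   what actually happens depends on where
                                                   the implementation placed tmp[] */
        }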
    RLY? Come on, you started the whole thing by doubting that GCN isolates processes within the local memory as well. You got the answer that it does, plus the explanation of how it is done, and now you say you can't rely on it?
    You can rely on proper process isolation if you use their hardware. That AMD GPUs prevent local as well as global memory corruption by suppressing out-of-bounds accesses (and will raise exceptions soon) has some potentially useful applications (besides the obvious debugger, think of MegaTexture/PRT and paging). Should you write illegal OpenCL code with undefined behaviour because of it? Of course not! That would indeed be bad practice. But wait for some future DX, OpenCL, or OpenGL version/extension and we may see this functionality exposed with a proper API to use (or wait until you get your hands on one of the next consoles ;)).

    Btw., I don't know what Intel GPUs do, but frankly, I rather doubt you can corrupt other processes' memory on them. And on nVidia it's also not possible (though they may crash; at least that's how it was the last time I read about it, it could have changed).
     
    #1139 Gipsel, Feb 18, 2013
    Last edited by a moderator: Feb 19, 2013
  20. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,559
    Likes Received:
    670
    Location:
    British Columbia, Canada
    ... and because the CL spec is a mess that leaves a million basic things undefined (or worse, simply unmentioned) while spending 300 pages on trivial API explanations ;)

    There's a *big* difference between resources that are only visible to a single invocation/"thread of execution" and stuff that is visible to multiple different scheduled entities. Registers are the former, while LDS is the latter. I understand where you're coming from from a hardware perspective, but I think the API perspective is the more relevant one when considering parallel semantics.

    Well the quote that I was responding to there was long after that initial discussion had been settled, but I'm willing to chalk that one up to a misunderstanding :)

    I think you misunderstood me... yes obviously you can rely on it for process isolation. But the claimed additional safety that it brings to "workgroup isolation" or whatever is not useful in practice as it is not portable. Also, like I said, that's not a good way to write code in general. You seem to agree with me here:

    It seems we're on the same page. However I will still note that my care level about suppressing OOB accesses to shared memory is quite low... it's not exactly difficult to accomplish that myself in the very few cases where I actually want those semantics :) If it could throw exceptions (or otherwise signal+terminate) then it would be useful for debugging though of course.
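    (In OpenCL C terms, "accomplishing that myself" is about one line - a hedged sketch with invented names; clamping, or masking off the high bits of the index, gives defined behaviour on any implementation:)

        #define LDS_WORDS 256

        __kernel void clamped_scatter(__global const int *indices, __global int *out)
        {
            __local int scratch[LDS_WORDS];
            int gid = (int)get_global_id(0);

            /* Clamp (or mask) the dynamic index so the access is in-bounds by
               construction; portable, and cheap next to the LDS access itself. */
            int i = clamp(indices[gid], 0, LDS_WORDS - 1);

            scratch[i] = gid;
            barrier(CLK_LOCAL_MEM_FENCE);
            out[gid] = scratch[i];
        }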

    As I said, you can't corrupt other processes' (shared local) memory on *any* GPUs right now because you can't have multiple DMA buffers from different processes in flight at the same time. This is all an academic discussion :) And obviously all APIs require not corrupting any global memory.

    So I think we're mostly in agreement here, with the only disagreements being around...

    1) You think the additional isolation of workgroups is useful; I don't. Obviously it's fine and spec compliant either way.

    2) You think the bounds checking hardware is required because of indirect RF accesses. I claim those indirect accesses are an implementation choice, and alternate choices would not require that bounds checking hardware. But in any case, you are correct that it doesn't need a full 32-bit compare in general as high address bits can just be dropped (with proper sign handling) on a given architecture with a fixed maximum shared memory size.
     
    #1140 Andrew Lauritzen, Feb 19, 2013
    Last edited by a moderator: Feb 19, 2013