AMD: Sea Islands R1100 (8*** series) Speculation/ Rumour Thread

OpenGL guy · Feb 17, 2013

Andrew Lauritzen said:
Right, i.e. it can't preempt. Can GCN actually handle multiple simultaneous applications and maintain process and memory isolation though?

Yes, we have memory virtualization that isolates processes.

IIRC DX Spec allows some weird stuff like out of bounds local memory writes to corrupt *any* local memory on the chip. Now I doubt with GCN OOB local memory writes can corrupt another "core"'s local memory, but it would definitely have to ensure things like separate processes can't be scheduled on the same "core", and technically would have to wipe the local memory in between context switches.

Not possible on our chips. First, under DirectCompute, the hardware can do bounds checking, thus you won't even corrupt your own memory. Under OpenCL it is possible to corrupt your own memory, but you can never access another process's resources.

Incidentally these sorts of things is why WebGL is sort of hilarious to me... people really think there's suitable security sand-boxing all through the GPU software/hardware stack? Heh

Bigger problems are DoS attacks IMO.

Gipsel · Feb 17, 2013

I really wonder why nobody has posted the link to the Sea Islands ISA manual so far.

And even in there, Sea Islands is abbreviated C.I.

Important differences between S.I. and C.I. GPUs

•Multi queue compute
Lets multiple user-level queues of compute workloads be bound to the device and processed simultaneously. Hardware supports up to eight compute pipelines with up to eight queues bound to each pipeline.
•System unified addressing
Allows GPU access to process coherent address space.
•Device unified addressing
Lets a kernel view LDS and video memory as a single addressable memory. It also adds shader instructions, which provide access to “flat” memory space.
•Memory address watch
Lets a shader determine if a region of memory has been accessed.
•Conditional debug
Adds the ability to execute or skip a section of code based on state bits under control of debugger software. This feature adds two bits of state to each wavefront; these bits are initialized by the state register values set by the debugger, and they can be used in conditional branch instructions to skip or execute debug-only code in the kernel.
•Support for unaligned memory accesses
•Detection and reporting of violations in memory accesses
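The "conditional debug" item above can be illustrated with a small model. This is only a sketch: the field names `cdbg_sys` and `cdbg_user` are made up for illustration, and the assumption that each branch is taken when its named condition holds is inferred from the S_CBRANCH_CDBG* instruction names, not confirmed by the quoted text.

```python
from dataclasses import dataclass


@dataclass
class WaveDebugState:
    """Two per-wavefront debug bits, initialized by debugger software."""
    cdbg_sys: bool = False   # hypothetical name for the "system" debug bit
    cdbg_user: bool = False  # hypothetical name for the "user" debug bit


# Assumed semantics: the branch is taken when the named condition is true.
BRANCH_CONDITIONS = {
    "S_CBRANCH_CDBGSYS": lambda w: w.cdbg_sys,
    "S_CBRANCH_CDBGUSER": lambda w: w.cdbg_user,
    "S_CBRANCH_CDBGSYS_OR_USER": lambda w: w.cdbg_sys or w.cdbg_user,
    "S_CBRANCH_CDBGSYS_AND_USER": lambda w: w.cdbg_sys and w.cdbg_user,
}


def branch_taken(opcode: str, wave: WaveDebugState) -> bool:
    """Return True if the conditional debug branch would be taken."""
    return BRANCH_CONDITIONS[opcode](wave)
```

With only the system bit set, the SYS and SYS_OR_USER branches are taken in this model, while the USER and SYS_AND_USER branches fall through, which is how debug-only code could be skipped or executed per wavefront.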

Summary of kernel instruction change from S.I. to C.I.

New instruction formats:
•FLAT

New instructions:
•FLAT_* (entire family of operations)
•S_CBRANCH_CDBGUSER
•S_CBRANCH_CDBGSYS
•S_CBRANCH_CDBGSYS_OR_USER
•S_CBRANCH_CDBGSYS_AND_USER
•S_DCACHE_INV_VOL
•V_TRUNC_F64
•V_CEIL_F64
•V_FLOOR_F64
•V_RNDNE_F64
•V_QSAD_PK_U16_U8
•V_MQSAD_U16_U8
•V_MQSAD_U32_U8
•V_MAD_U64_U32
•V_MAD_I64_I32
•V_EXP_LEGACY_F32
•V_LOG_LEGACY_F32
•DS_NOP
•DS_GWS_SEMA_RELEASE_ALL
•DS_WRAP_RTN_B32
•DS_CNDXCHG32_RTN_B64
•DS_WRITE_B96
•DS_WRITE_B128
•DS_CONDXCHG32_RTN_B128
•DS_READ_B96
•DS_READ_B128
•BUFFER_LOAD_DWORDX3
•BUFFER_STORE_DWORDX3

Removed instructions:
•V_QSAD_U8
•V_MQSAD_U8
•BUFFER_ATOMIC_RSUB, _X2
•IMAGE_ATOMIC_RSUB, _X2

Found by Locuza from 3DC.

Edit:
I just saw that this is being discussed in the HD7000 thread for some reason. Dave Baumann said, that the currently available "Sea Islands" GPUs (Oland) are actually not C.I., i.e. they are still SI and don't have the new features.

caveman-jim · Feb 17, 2013

gongo said:
I always wondered....with so much available Tflops and the hype of Direct Compute (lol seems so long)...is it so hard to do those fancy sparks on Tahiti? What happened to AMD graphics physics department?

The guy that was leading it jumped ship to Intel (6 mo. before his boss left for Qualcomm).

caveman-jim · Feb 17, 2013

RedVi said:
UHD builds on what they already had - although maybe a little too subtle of a change. Also, they won't let older hardware already supporting UHD resolutions stand in the way of a new buzzword! The original Radeon 7000 from 2000 supported 1080p yet somehow adding a HD suffix was valid for the 2000 series from 2007. I think adding a UHD suffix makes even more sense this time around.

A look ahead to 2013…

While 2012 has been mega, 2013 is shaping up to be just as hectic! The next 12 months is poised to see AMD remain at the heart of the biggest and best industry gaming events, working with the top publishers to get the best out of upcoming titles and, of course, launching the next generation of graphics cards, the AMD Radeon HD 8000 Series graphics (codename: “Sea Islands”) – from which you can expect a hefty performance boost.

Peter Ross is Manager, AMD Gaming Evolved Marketing at AMD.

http://blogs.amd.com/play/2012/12/21/2012-a-memorable-year-in-pc-gaming/

December 21st, 2012 AMD said Sea Islands was the AMD Radeon HD 8000 series.

caveman-jim · Feb 17, 2013

Gipsel said:
Dave Baumann said, that the currently available "Sea Islands" GPUs (Oland) are actually not C.I., i.e. they are still SI and don't have the new features.

That's interesting because when the 8000M series was previewed, it was stated that the new GPUs had new compute features. Apparently that meant the 8700/8800/8900M ASICs and not the 8500/8600M ones shipping now, but that was not made clear in the call.

Dave Baumann · Feb 17, 2013

Gipsel said:
Dave Baumann said, that the currently available "Sea Islands" GPUs (Oland) are actually not C.I., i.e. they are still SI and don't have the new features.

Actually, what I said was using a roadmap descriptor is not accurate to denote feature level!

boxleitnerb · Feb 17, 2013

Am I the only one who thinks all those roadmaps and codenames are utterly confusing and without obvious logic or structure?

Alexko · Feb 17, 2013

boxleitnerb said:
Am I the only one who thinks all those roadmaps and codenames are utterly confusing and without obvious logic or structure?

No, you're not.

Gipsel · Feb 17, 2013

Dave Baumann said:
Unfortunately the document above is not described correctly because the only "Sea Islands" part that is available right now is Oland which uses the same IP set as Tahiti/Pitcairn/Verde.

Dave Baumann said:
Actually, what I said was using a roadmap descriptor is not accurate to denote feature level!

Actually, I don't care what names you use to distinguish the different feature levels, as long as there is some distinction possible. There was already a similar mix during the Northern Islands generation with the VLIW5/4 GPUs.
Maybe you should introduce a system somewhat resembling nVidia's compute capabilities. As someone else (edit: Alexko was it, he linked it even in the post above!) suggested already, name CapeVerde/Pitcairn/Tahiti/Oland GCN 1.0 (Tahiti maybe 1.01 because of the higher DP and int32 multiplication performance) and the C.I. from the new ISA manual GCN 1.1 or whatever you like.
Maybe that even simplifies the handling of the manuals, as you need to roll out a new one only in case of major changes and the smaller revisions can be kept together. It will make copy/paste errors less likely (the description of C.I. in chapter 2 is such a carryover from the VLIW architectures and most probably simply wrong; someone should have copied chapter 2 from the S.I. manual instead). Furthermore, it probably makes it simpler to see the differences, especially if just some details differ. It's simpler to read a single manual than to compare two or even more.

Andrew Lauritzen · Feb 17, 2013

OpenGL guy said:
Yes, we have memory virtualization that isolates processes.

Even for local memory accesses? Really, you're going to add a compare or two to *every* local memory op? Seems unlikely/over-engineered.

OpenGL guy said:
Bigger problems are DoS attacks IMO.

Sure, but that's just a symptom of the underlying problem that the requisite robustness *isn't* there yet, even in the hardware, let alone the software.

Dave Baumann · Feb 17, 2013

Gipsel said:
Maybe you should introduce a system somewhat resembling nVidia's compute capabilities. As someone else (edit: Alexko was it, he linked it even in the post above!) suggested already, name CapeVerde/Pitcairn/Tahiti/Oland GCN 1.0 (Tahiti maybe 1.01 because of the higher DP and int32 multiplication performance) and the C.I. from the new ISA manual GCN 1.1 or whatever you like.

I thought the rest of my comment already covered that...

I've contacted the publisher of the doc and suggested that it would be more accurate to describe the document by Graphics IP level.

Gipsel · Feb 17, 2013

Andrew Lauritzen said:
Even for local memory accesses? Really, you're going to add a compare or two to *every* local memory op? Seems unlikely/over-engineered.

I guess that's not much more effort than adding a wavefront-specific offset to the register number for each register file access to get the right physical location in the register file. The LDS is partitioned similarly: each workgroup gets an LDS base address and a size it can access. The base address is added and the kernel-supplied LDS address is checked against the allocated size. That doesn't appear to be too much effort to me. Or as the SI ISA manual puts it:

SI ISA manual said:
M0[15:0] provides the size in bytes for this access. The size sent to LDS is MIN(M0, LDS_SIZE), where LDS_SIZE is the amount of LDS space allocated by the shader processor interpolator, SPI, at the time the wavefront was created. The address comes from VGPR, and both ADDR and InstrOffset are byte addresses. At the time of wavefront creation, LDS_BASE is assigned to the physical LDS region owned by this wavefront or work-group.

Out-of-range can occur through GPR-indexing or bad programming. It is illegal to index from one register type into another (for example: SGPRs into trap registers or inline constants). It is also illegal to index within inline constants. The following describe the out-of-range behavior for various storage types.
• SGPRs
– Source or destination out-of-range = (sgpr < 0 || (sgpr >= sgpr_size)).
– Source out-of-range: returns the value of SGPR0 (not the value 0).
– Destination out-of-range: instruction writes no SGPR result.
• VGPRs
– Similar to SGPRs. It is illegal to index from SGPRs into VGPRs, or vice versa.
– Out-of-range = (vgpr < 0 || (vgpr >= vgpr_size))
– If a source VGPR is out of range, VGPR0 is used.
– If a destination VGPR is out-of-range, the instruction is ignored (treated as an NOP).
• LDS
– If the LDS-ADDRESS is out-of-range (addr < 0 or > MIN(lds_size, m0)):
◊ Writes out-of-range are discarded; it is undefined if SIZE is not a multiple of write-data-size.
◊ Reads return the value zero.
– If any source-VGPR is out-of-range, the VGPR0 value is used.
– If the dest-VGPR is out of range, nullify the instruction (issue with exec=0)
• Memory, LDS, and GDS: Reads and atomics with returns.
– If any source VGPR or SGPR is out-of-range, the data value is undefined.
– If any destination VGPR is out-of-range, the operation is nullified by issuing the instruction as if the EXEC mask were cleared to 0.
◊ This out-of-range check must check all VGPRs that can be returned (for example: VDST to VDST+3 for a BUFFER_LOAD_DWORDx4).
◊ This check must also include the extra PRT (partially resident texture) VGPR and nullify the fetch if this VGPR is out-of-range, no matter whether the texture system actually returns this value or not.
◊ Atomic operations with out-of-range destination VGPRs are nullified: issued, but with exec mask of zero.
Instructions with multiple destinations (for example: V_ADDC): if any destination is out-of-range, no results are written.
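The LDS rules quoted above can be sketched as a toy model. It is a rough illustration, not the hardware's actual logic: the 64 KB storage size, the class and method names, and byte-wide accesses are assumptions made for this example, and the limit MIN(M0, LDS_SIZE) is treated here as an exclusive upper bound.

```python
from dataclasses import dataclass, field

LDS_BYTES = 64 * 1024  # assumed total LDS size for this toy model


@dataclass
class LdsModel:
    """Toy byte-addressed LDS shared by several workgroups."""
    storage: bytearray = field(default_factory=lambda: bytearray(LDS_BYTES))

    @staticmethod
    def _in_range(addr: int, lds_size: int, m0: int) -> bool:
        # Accessible size is MIN(M0, LDS_SIZE), per the quoted manual text.
        # Using it as an exclusive limit is an assumption of this sketch.
        return 0 <= addr < min(m0, lds_size)

    def write_byte(self, lds_base: int, lds_size: int, m0: int,
                   addr: int, value: int) -> bool:
        """Write one byte; out-of-range writes are discarded (returns False)."""
        if not self._in_range(addr, lds_size, m0):
            return False
        # LDS_BASE relocates the kernel-supplied address into the owned region.
        self.storage[lds_base + addr] = value & 0xFF
        return True

    def read_byte(self, lds_base: int, lds_size: int, m0: int,
                  addr: int) -> int:
        """Read one byte; out-of-range reads return zero."""
        if not self._in_range(addr, lds_size, m0):
            return 0
        return self.storage[lds_base + addr]
```

In this model a workgroup with base 0 and size 256 that writes to address 300 has the write dropped, so a neighbouring workgroup at base 256 never sees it; lowering M0 below the allocated size shrinks the accessible window further.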

M0 provides a means to further restrict the access to an even smaller region. But it is impossible to read from an LDS region not owned by that workgroup. You don't need some fancy additional LDS isolation. Different workgroups are basically by definition isolated from each other, even for the same kernel. With C.I./GCN 1.1, exception support for these out-of-bounds accesses has also been added.

C.I. ISA manual said:
A Memory Violation is reported from:
•LDS access out of range: 0 < addr < lds_size. This can occur for indexed and direct access.
•LDS alignment error.
•Memory read/write/atomic out-of-range.
•Memory read/write/atomic alignment error.
•Flat access where the address is invalid (does not fall in any aperture).
•Write to a read-only surface.
•GDS alignment or address range error.
•GWS operation aborted (semaphore or barrier not executed).
[..]
When a memory access is in violation, the appropriate memory (LDS or TC) returns MEM_VIOL to the wave. This is stored in the wave’s TRAPSTS.mem_viol bit. This bit is sticky, so once set to 1, it remains at 1 until the user clears it. There is a corresponding exception enable bit (EXCP_EN.mem_viol). If this bit is set when the memory returns with a violation, the wave jumps to the trap handler.
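The sticky-bit behaviour described in the C.I. quote can be sketched as follows. This is a minimal model under stated assumptions: the class, field, and method names are invented for illustration, and only the TRAPSTS.mem_viol and EXCP_EN.mem_viol bits named in the quote are represented.

```python
from dataclasses import dataclass


@dataclass
class WaveTrapState:
    """Models only the two mem_viol bits mentioned in the quoted text."""
    trapsts_mem_viol: bool = False  # stands in for TRAPSTS.mem_viol (sticky)
    excp_en_mem_viol: bool = False  # stands in for EXCP_EN.mem_viol

    def on_memory_return(self, violation: bool) -> bool:
        """Record a memory result; return True if the wave jumps to the trap handler."""
        if not violation:
            # A clean access does not clear the sticky bit.
            return False
        self.trapsts_mem_viol = True
        # The jump happens only if the exception enable bit is set.
        return self.excp_en_mem_viol

    def clear_mem_viol(self) -> None:
        """The bit stays at 1 until the user clears it explicitly."""
        self.trapsts_mem_viol = False
```

A wave with the enable bit clear just accumulates the violation flag, which a kernel or debugger could poll later; with the enable bit set, the same violation would divert the wave to the trap handler.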

OpenGL guy · Feb 18, 2013

Andrew Lauritzen said:
Even for local memory accesses? Really, you're going to add a compare or two to *every* local memory op? Seems unlikely/over-engineered.

You need to be more clear. Do you mean "local frame buffer memory" or "local shared memory"? In either case, it is impossible for one process to access another process' data. For LDS (local data share) it's even impossible to go outside of a workgroup's allocated space.

Textures have boundaries and every access has to be checked against those bounds to get the correct result, so why do you feel it's so different for buffers?

Frame buffer virtualization offers many advantages, one of which is memory protection, but if you want to read about some others then read the AMD APP SDK Programmer's Guide.

hoom · Feb 18, 2013

And now they are doing 8000 series officially 2H? :nope:

http://pcworld.co.nz/pcworld/pcw.nsf/news/amd-to-release-new-radeon-hd-8000-graphics-cards-in-2013

I've not really been keeping up with performance lately due to my 6950 being adequate for 1920*1200 but I did recently change to a 30" 2560*1600 so had been thinking 1H might be a good time for an upgrade.
Checking out some reviews & current local prices (7970ghz for under NZ$600 vs $755 min for GTX680) I can see why they would not be in any rush to bring out new cards unless NV has a new card that will be both cheaper & faster.

Not too keen on the power/noise factor, a lower power respin with same clocks & price would be extremely attractive though.

Andrew Lauritzen · Feb 18, 2013

OpenGL guy said:
You need to be more clear. Do you mean "local frame buffer memory" or "local shared memory"? In either case, it is impossible for one process to access another process' data. For LDS (local data share) it's even impossible to go outside of a workgroup's allocated space.

Shared local memory - i.e. "LDS" on GCN. I guess to pop up one level further... can wavefronts from separate workgroups run in parallel on the same core simultaneously? If so, they can be accessing the same LDS (for that core) at the same time, and the hardware will ensure that every address in any LDS G/S does not stomp outside of the range of the relevant workgroup? If so, neat, but that does seem over-engineered as it requires a compare per lane of the wavefront on every LDS access... LDS is supposed to be "like registers" in terms of speed, and throwing an ALU op in there doesn't seem to fit that criteria. Obviously can be hidden in the pipeline, but adding instruction latency isn't free either.

OpenGL guy said:
Textures have boundaries and every access has to be checked against those bounds to get the correct result, so why do you feel it's so different for buffers?

I'm not talking about DRAM/GDDR. That's fairly easy and required by the DX spec so all GPUs do it.

OpenGL guy · Feb 18, 2013

Andrew Lauritzen said:
Shared local memory - i.e. "LDS" on GCN. I guess to pop up one level further... can wavefronts from separate workgroups run in parallel on the same core simultaneously? If so, they can be accessing the same LDS (for that core) at the same time, and the hardware will ensure that every address in any LDS G/S does not stomp outside of the range of the relevant workgroup? If so, neat, but that does seem over-engineered as it requires a compare per lane of the wavefront on every LDS access... LDS is supposed to be "like registers" in terms of speed, and throwing an ALU op in there doesn't seem to fit that criteria. Obviously can be hidden in the pipeline, but adding instruction latency isn't free either.

It doesn't cost any instructions at all, go look at an ISA dump sometime. There are 32 banks in LDS and 32 LDS requests in a single clock. Given the granularity of LDS allocations (this is documented in the AMD APP SDK Programmer's Guide as well), it's pretty simple for the hardware to do bounds checking. I.e. we are talking a few bits per comparison so it's quite cheap.
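The "few bits per comparison" point can be made concrete with a little arithmetic. The numbers here are assumptions for illustration only: a 64 KB LDS and a 256-byte allocation granule are plugged in as example values, since the post only says the granularity is documented in the AMD APP SDK Programmer's Guide. The function names are hypothetical.

```python
def granule_index_bits(lds_bytes: int, granule_bytes: int) -> int:
    """Bits needed to name one allocation granule within the LDS."""
    granules = lds_bytes // granule_bytes
    return (granules - 1).bit_length()


def in_bounds_bytewise(addr: int, alloc_bytes: int) -> bool:
    """Reference check comparing the full byte address."""
    return 0 <= addr < alloc_bytes


def in_bounds_by_granule(addr: int, alloc_bytes: int, granule_bytes: int) -> bool:
    """Same check using only the granule index (the upper address bits).

    Valid when alloc_bytes is a multiple of granule_bytes, which is what
    granule-aligned allocation implies.
    """
    return addr >= 0 and addr // granule_bytes < alloc_bytes // granule_bytes
```

Under these assumed sizes the comparison covers 8 bits rather than a full address, and for granule-aligned allocations the narrow check gives the same answer as the byte-wise one.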

3dcgi · Feb 18, 2013

Andrew Lauritzen said:
LDS is supposed to be "like registers" in terms of speed

No, it's not. LDS is supposed to allow for sharing data between work items. There's no implied performance requirement in the feature.

Squilliam · Feb 18, 2013

I was forced to purchase a new graphics card (HD 7870) due to my previous one breaking... I'm hoping selfishly that there isn't a new revision coming out soon due to the fear of missing out on cool new AMD technology. I guess I'll have to wait until the 20nm revision, any ideas when I might be waiting for that???

lanek · Feb 18, 2013

Squilliam said:
I was forced to purchase a new graphics card (HD 7870) due to my previous one breaking... I'm hoping selfishly that there isn't a new revision coming out soon due to the fear of missing out on cool new AMD technology. I guess I'll have to wait until the 20nm revision, any ideas when I might be waiting for that???

Currently 2014.

Gipsel · Feb 18, 2013

Andrew Lauritzen said:
Shared local memory - i.e. "LDS" on GCN. I guess to pop up one level further... can wavefronts from separate workgroups run in parallel on the same core simultaneously?

Yes.

Andrew Lauritzen said:
If so, they can be accessing the same LDS (for that core) at the same time, and the hardware will ensure that every address in any LDS G/S does not stomp outside of the range of the relevant workgroup?

Yes.
As cited above, bounds checks are even done for register accesses. It can't be expensive.

Andrew Lauritzen said:
If so, neat, but that does seem over-engineered as it requires a compare per lane of the wavefront on every LDS access...

No, not in my opinion. It would be a complete mess if one workgroup or wavefront could write into the LDS region or even the registers of another one. Why else would you have defined the workgroups in the first place? What happens if two workgroups running on the same CU do vastly different things or belong to different kernels? As said, it would be a mess.
Or can one thread on an Intel CPU with hyperthreading read the registers of the other thread?

Andrew Lauritzen said:
LDS is supposed to be "like registers" in terms of speed,

They never were, also not on nV hardware. That was some notion coming up with nV GPUs in the G80/G92 days, but it was far from the truth.

Andrew Lauritzen said:
and throwing an ALU op in there doesn't seem to fit that criteria. Obviously can be hidden in the pipeline, but adding instruction latency isn't free either.

As said already, this is not done by ALU ops.
