AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

I'm finding it hard to come up with much that's concrete beyond pretty pictures stating that first-generation HBM has ~half the latency of DDR3.
I'm not sure which PowerPoint had that particular value. I'm aware of the leaked Hynix presentation, which put one particular latency parameter, column-to-column access, at half that of DDR3, but that would be a minor contributor.

I'm wondering if the striping algorithms will be different, e.g. to the extent that striping is abandoned for most of the small texture sizes.
A 2013 presentation about HSA and the GPU had memory addresses distributed as (Address / 256) % N, where N was the channel count. 256 bytes has some nice synergies with the write throughput of a vector instruction, as well as with a 64-pixel block, although saying the striping exists because the SIMD architecture likes 256 bytes may be looking at things backwards. It may be that both the vector and ROP systems are structured to appeal to a certain range of strides that the DRAM arrays like to work with.
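
For clarity, here is a minimal sketch of that mapping in C, assuming 256-byte stripes distributed round-robin across the channels; the channel count of 8 and the function name are illustrative, not from any AMD documentation:

Code:
  #include <stdint.h>
  #include <stdio.h>

  #define STRIPE_BYTES 256u

  /* (Address / 256) % N, as described in the HSA presentation. */
  static unsigned channel_for_address(uint64_t addr, unsigned num_channels) {
      return (unsigned)((addr / STRIPE_BYTES) % num_channels);
  }

  int main(void) {
      const unsigned channels = 8;   /* illustrative channel count         */
      const uint64_t base = 0x1000;  /* a 256-byte-aligned base address    */
      /* A 64-lane wavefront storing 4 bytes per lane covers exactly one
         256-byte stripe, so every lane lands on the same channel.         */
      for (unsigned lane = 0; lane < 64; ++lane) {
          unsigned ch = channel_for_address(base + 4ull * lane, channels);
          if (lane == 0 || lane == 63)
              printf("lane %2u -> channel %u\n", lane, ch);
      }
      return 0;
  }

That alignment between a wavefront's 256-byte store footprint and the stripe size is the "synergy" mentioned above.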

4 Gbps GDDR5 has a prefetch of 8 on a 32 bit channel, so it delivers 128 bits in 1ns. Faster than that, and banks are subdivided into four groups that require interleaving to avoid exceeding the speed of the arrays.
HBM Gen1 at 1 Gbps has a prefetch of 2 on a 128 bit channel, so it delivers 128 bits in 1ns.
There is apparently some margin to push the arrays faster, going by the possible 1.2 Gbps leaked device, but it takes a likely change to DDR signaling with Gen2 to get near 2 Gbps.
GDDR5 has 16 banks per channel if we're talking a single device and not clamshell. Clamshell is physically 32, but the chips behave as if they were a double-capacity single channel.
HBM has 8 banks per channel, though the banks are actually subdivided into 2 sub-banks each. There are also 2 channels per slice, so 16 banks or 32 sub-banks per layer in HBM.
Striding through all the banks in GDDR5 seems to only go up to 128 bytes, but it apparently helps to have overlapping accesses to hide command and refresh overhead.
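
As a quick back-of-the-envelope check of those per-channel figures (the numbers are the ones quoted above, not from a spec sheet):

Code:
  #include <stdio.h>

  /* 1 Gbps per pin is 1 bit per pin per nanosecond. */
  static double bits_per_ns(double gbps_per_pin, unsigned channel_width_bits) {
      return gbps_per_pin * channel_width_bits;
  }

  int main(void) {
      double gddr5 = bits_per_ns(4.0, 32);   /* 4 Gbps pins, 32-bit channel  */
      double hbm1  = bits_per_ns(1.0, 128);  /* 1 Gbps pins, 128-bit channel */

      printf("GDDR5 channel: %.0f bits/ns, burst of %u bits (prefetch 8 x 32)\n",
             gddr5, 8u * 32u);
      printf("HBM1  channel: %.0f bits/ns, burst of %u bits (prefetch 2 x 128)\n",
             hbm1, 2u * 128u);
      return 0;
  }

Both come out to 128 bits/ns and a 256-bit burst, which is the point: the interfaces differ, but the array-side behavior they have to accommodate looks very similar.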

Given my limited understanding of DRAM, I'm fumbling through what amounts to numerology to describe the base constraint: HBM and GDDR5 interfaces are pulling data from very similar, ploddingly slow DRAM arrays.
It seems like AMD and Hynix kept an eye on what existing hardware is optimized for when designing HBM, so if one standard likes a granularity of 256 bytes per channel to keep the arrays happy, then I think it's possible a similar granularity is helpful for the other.

Looking more closely at the ISA manual, these bits can be read or written by an SALU instruction, e.g. s_setreg_b32. So "abandoned" is not the case.
IB_STS is hardware register 7 and is listed as read-only in the ISA doc. I don't know how the hardware is supposed to react to an s_setreg on a read-only register; it might simply be ignored. If the write did go through, it would seem likely to cause a serious problem should a program overwrite one of the counters that are known to be in use. One simplifying assumption for the hardware would be to not make the internal adder fully-featured enough to cope with a nonsense scenario where loads are launched and VMCNT is set back to zero before the decrements can kick in.
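
Just to make that failure mode concrete, here is a toy C model (not actual hardware behavior; the 4-bit width is only the size of the architectural field) of what a wrapping counter would do if software could zero it while loads were still outstanding:

Code:
  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
      uint8_t vmcnt = 0;              /* outstanding vector-memory loads (4-bit field) */

      vmcnt += 2;                     /* two loads issued: hardware increments         */
      vmcnt = 0;                      /* hypothetical s_setreg writing the field to 0  */
      vmcnt = (vmcnt - 1) & 0xF;      /* first load returns: the decrement wraps       */

      printf("vmcnt = %u\n", (unsigned)vmcnt);  /* 15 -- a later s_waitcnt vmcnt(0)
                                                   would wait on loads that no longer
                                                   exist                               */
      return 0;
  }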
 
As a matter of interest, I found a while loop that does the grunt work in SALU, using a set of 64-bit lane-execution masks (s[22:23], s[24:25] and s[20:21]):

Code:
  v_add_i32       v67, vcc, 4, v67          // loop increment
  v_add_i32       v0, vcc, s0, v0           // not relevant
  v_cmp_gt_i32    s[22:23], v10, v67        // per lane: end index (v10) still ahead of the counter?
  v_add_i32       v41, vcc, 64, v41         // not relevant
  s_mov_b64       s[24:25], exec            // save the current exec mask
  s_andn2_b64     exec, s[24:25], s[22:23]  // exec = lanes that have finished
  s_andn2_b64     s[20:21], s[20:21], exec  // drop finished lanes from the loop mask (sets SCC)
  s_cbranch_scc0  label_0415                // exit loop (no lanes left)
  s_and_b64       exec, s[24:25], s[20:21]  // restore exec to the lanes still looping
  s_branch        label_005A                // back to start of loop
label_0415:

This while loop is interesting because every work item has its own start and stop indices in the While. But all work items run for the same count of iterations (there is no divergence).
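
Here's what that mask juggling amounts to in C, using 64-bit integers to stand in for exec and the SGPR pairs. The s_andn2_b64 semantics (dst = a & ~b, SCC set when the result is non-zero) are per the ISA doc; the overall reading of the loop, and the names, are mine:

Code:
  #include <stdint.h>
  #include <stdbool.h>

  typedef struct {
      uint64_t exec;    /* current execution mask            */
      uint64_t s20_21;  /* lanes still inside the while loop */
  } wave_state;

  /* One pass through the loop-bottom sequence above.
     cmp_gt is the v_cmp_gt_i32 result (s[22:23]): lanes with iterations left.
     Returns true if the loop runs another iteration.                         */
  static bool loop_bottom(wave_state *w, uint64_t cmp_gt) {
      uint64_t s24_25 = w->exec;             /* s_mov_b64   s[24:25], exec           */
      w->exec   = s24_25 & ~cmp_gt;          /* s_andn2_b64 exec, s[24:25], s[22:23] */
      w->s20_21 = w->s20_21 & ~w->exec;      /* s_andn2_b64 s[20:21], s[20:21], exec */
      if (w->s20_21 == 0)                    /* s_cbranch_scc0 -> exit loop          */
          return false;
      w->exec = s24_25 & w->s20_21;          /* s_and_b64   exec, s[24:25], s[20:21] */
      return true;                           /* s_branch back to the loop body       */
  }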
 
Uuh, GPU assembly is always hard for me, but it is interesting.
A question: what's held in s[20:21]? It seems to represent some additional condition used to mask off the loop execution (maybe some custom condition to end the loop early for a thread?).
 
I believe it is a side effect of detecting whether a const parameter of the kernel is 0, and if so, entirely skipping the loop. This is the start of the kernel:

Code:
  s_mov_b32     m0, 0x00008000               // m0 setup (LDS limit?)
  s_buffer_load_dwordx2  s[0:1], s[4:7], 0x18 // global offsets?
  s_buffer_load_dword  s4, s[8:11], 0x00      // kernel arguments from the constant buffer
  s_buffer_load_dword  s5, s[8:11], 0x04
  s_buffer_load_dword  s6, s[8:11], 0x08      // constant kernel parameter?
  s_buffer_load_dword  s7, s[8:11], 0x0c      // pitch?
  s_buffer_load_dword  s14, s[8:11], 0x10
  s_buffer_load_dword  s8, s[8:11], 0x14
  v_lshlrev_b32  v2, 2, v0                    // v2 = v0 << 2
  v_lshlrev_b32  v3, 5, v1                    // v3 = v1 << 5
  s_lshl_b32    s9, s12, 3                    // s9 = workgroup id x * 8?
  s_lshl_b32    s10, s13, 3                   // s10 = workgroup id y * 8?
  s_waitcnt     lgkmcnt(0)                    // wait for the scalar loads above
  s_add_u32     s0, s9, s0
  v_add_i32     v4, vcc, s0, v0               // v4 = global x index?
  s_add_u32     s0, s10, s1
  v_add_i32     v5, vcc, s0, v1               // v5 = global y index?
  v_lshlrev_b32  v5, 3, v5                    // v5 <<= 3
  v_mul_lo_i32  v6, v5, s7                    // v6 = v5 * s7
  v_add_i32     v6, vcc, v4, v6               // v6 = linear index?
  s_cmp_ge_i32  s6, 1                         // test kernel param
  s_cbranch_scc0  label_042B                  // skip way beyond the end of loop, to code that writes 0 to the output
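
Reading that back into C, assuming the usual GCN compute conventions (v0/v1 as local ids, s12/s13 as workgroup ids, s[0:1] as a global offset pulled from the constant buffer, s7 as a row pitch) -- this is my interpretation of the disassembly, not anything documented:

Code:
  #include <stdint.h>

  /* Rough reconstruction of the prologue's index math. Parameter names map
     to the registers I think they correspond to; all of them are guesses.  */
  static uint32_t linear_index(uint32_t tid_x, uint32_t tid_y,  /* v0, v1         */
                               uint32_t wg_x,  uint32_t wg_y,   /* s12, s13       */
                               uint32_t off_x, uint32_t off_y,  /* s[0:1], loaded */
                               uint32_t pitch)                  /* s7             */
  {
      uint32_t gx = (wg_x << 3) + off_x + tid_x;  /* s_lshl + s_add + v_add -> v4   */
      uint32_t gy = (wg_y << 3) + off_y + tid_y;  /* -> v5, before the shift        */
      gy <<= 3;                                   /* v_lshlrev_b32 v5, 3, v5        */
      return gy * pitch + gx;                     /* v_mul_lo_i32 + v_add_i32 -> v6 */
  }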

then a bit later (still before the loop), the exec mask that's in effect after the s_cmp test is copied, twice:

Code:
  s_mov_b64     s[10:11], exec  // kept for after the loop
  s_mov_b64     s[20:21], exec  // used within loop

The thing is, if this parameter is 0, then all work items skip the loop; otherwise they all enter it. So it's completely pointless to track which of them entered the loop (storing that in s[20:21]), since they will all have entered it. The exec mask upon entering the loop is all that needs to be known for each successive iteration.

I've discovered that when the IFs inside the loop are all commented out, 2 of these 3 pairs of scalars disappear (now we just have s[4:5]). Here's the test at the end of the loop in that alternative code:

Code:
  v_cmp_gt_i32  s[4:5], v10, v162  // v162 was incremented by 4 much earlier in loop
  v_add_i32  v41, vcc, 64, v41        // irrelevant
  s_and_b64  exec, exec, s[4:5]     // what I would expect to see in the original code I posted
  s_cbranch_execnz  label_0055   // back to start of loop

The IFs do not affect the loop's iteration count. So it just seems that I confused the compiler by having IFs inside the While, wasting those 3 pairs of scalar registers and the cycles spent juggling them.

This variant still does the test of s6 (loaded with s_buffer_load_dword s6, s[8:11], 0x08) and still makes a copy of exec into s[10:11]. That copy is required for the instructions that execute after the loop, since their execution is not conditional upon what occurred during the loop.
 
I take that back: the IFs inside the While affect exec, so exec needs to use s[20:21] at the time the While condition is being evaluated.
 
Original source: http://www.chiphell.com/thread-1196441-1-1.html (but it just keeps loading, loading and loading)
Graphs available here too: http://www.forum-3dcenter.org/vbulle...7#post10455327

These are claimed tests from ChipHell, a site known for both legit and fake leaks, but without knowing whether the tests are actually ChipHell's own or just a random forum post, it's hard to say one way or the other whether they're more likely fake or real.

According to the first graph, averaging performance over 20 games, the Fiji XT engineering sample is somewhere around 10-15% faster than the GTX 980, while GM200's cut-down version is some 2-5% faster than the Fiji XT ES.
They claim that in BF4 multiplayer the Fiji XT ES draws some 15-20% more power than the GTX 980, while the cut-down GM200 draws roughly 5%, give or take a few percent, more than the Fiji XT ES.

The second graph has numbers for BF4 MP, CoD: AW, DA: Inquisition, Ryse and Watch Dogs; it also includes a "full fat" GM200 and a Bermuda XT ES in addition to the Fiji XT ES and the cut-down GM200.
In it, the Bermuda XT ES is faster than the full-fat GM200 in all games, with the difference ranging from just a few percent to well over 10%. The Fiji XT, meanwhile, is slower than the cut-down GM200 in most games, but slightly faster in Dragon Age and Ryse.
 
Looks like another fake, probably by the same person as the last one. Any leak of an ES playing multiple games on a graph is fake in my book.
 
While I don't doubt there is fake information in there, I'm pretty sure there is some accurate info sprinkled in.
 
Just thinking about this: if it were true (380X around 1.3x a 290X, 390X around 30% more than that), that could mean ~290-level performance for a 370X (assuming a 40-45% jump from 370X to 380X). That seems too good to be true, surely, though granted it is the first node shrink in a long time, and the claimed/"leaked" perf/watt gains are huge (about 1.8x), which would roughly fit 290 performance for a ~150W part. Hopefully we'll have official info sooner rather than later.
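
Spelling out the arithmetic behind that guess (the ratios are the rumored ones from this thread, not measurements):

Code:
  #include <stdio.h>

  int main(void) {
      double r290x = 1.00;              /* baseline                                 */
      double r380x = 1.30 * r290x;      /* rumored ~1.3x a 290X                     */
      double r390x = 1.30 * r380x;      /* rumored ~30% on top of that              */
      double r370x = r380x / 1.425;     /* assuming a 40-45% step from 370X to 380X */

      printf("380X %.2f  390X %.2f  370X %.2f (vs 290X = 1.00)\n",
             r380x, r390x, r370x);      /* 370X lands around 0.9x a 290X,
                                           i.e. roughly 290-class                   */
      return 0;
  }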
 
Probably not, AMD will be focusing on launching Carrizo at CES. I assume nothing on the GPU side until February because of that.
 
Not sure, but it would make sense, because it was the last thing they were talking about before the holidays. The only other thing they were talking about was FreeSync.
 
I can't find any keynote schedules on the CES website.
Does anyone have any idea of when AMD is going to present Carrizo?
 