AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

Discussion in 'Architecture and Products' started by iMacmatician, Apr 10, 2014.

Tags:
  1. 3dilettante

    3dilettante Legend Alpha

    I guess I'm not sure which powerpoint had that particular value. I'm aware of the leaked Hynix presentation, which had one particular latency parameter for column to column access at half of DDR3, but that would be a minor contributor.

    A 2013 presentation about HSA and GPU had memory addresses distributed as Address/256%N where N was the channel count. 256 bytes has some nice synergies with the write throughput of a vector instruction, as well as a 64-pixel block, although saying the striping is due to how well the SIMD architecture likes 256 bytes may be looking at things backwards. It may be that the architecture for both the vector and ROP systems is structured to appeal to a certain range of strides that DRAM arrays like to work with.
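    That striping formula is simple enough to sketch in Python (my reading of the presentation's Address/256%N scheme, not a confirmed hardware formula — the stripe size and channel count are the only inputs):

```python
# Round-robin channel interleaving in 256-byte granules, as described
# in the HSA presentation: channel = (address / 256) % num_channels.

STRIPE = 256  # bytes per channel before moving to the next one

def channel_of(address: int, num_channels: int) -> int:
    """Which memory channel a byte address maps to."""
    return (address // STRIPE) % num_channels

# With 8 channels, addresses 0..255 hit channel 0, 256..511 hit
# channel 1, and the pattern wraps every 8 * 256 = 2048 bytes.
assert channel_of(0, 8) == 0
assert channel_of(256, 8) == 1
assert channel_of(2048, 8) == 0
```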

    4 Gbps GDDR5 has a prefetch of 8 on a 32 bit channel, so it delivers 128 bits in 1ns. Faster than that, and banks are subdivided into four groups that require interleaving to avoid exceeding the speed of the arrays.
    HBM Gen1 at 1 Gbps has a prefetch of 2 on a 128 bit channel, so it delivers 128 bits in 1ns.
    There is apparently some margin to push the arrays faster, going by the possible 1.2 Gbps leaked device, but it takes a likely change to DDR signaling with Gen2 to get near 2 Gbps.
    GDDR5 has 16 banks per channel if we're talking a single device and not clamshell. Clamshell is physically 32, but the chips behave as if they were a double-capacity single channel.
    HBM has 8 banks per channel--but the banks are actually subdivided into 2 sub-banks each. There are also 2 channels per slice, so 16 banks or 32 sub-banks per layer in HBM.
    Striding through all the banks in GDDR5 seems to only go up to 128 bytes, but it seems like it helps to have overlapping accesses to hide command and refresh overhead.
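    Making the arithmetic behind those two configurations explicit (just the figures quoted above, worked out — 1 Gb/s per pin is exactly 1 bit per ns per pin):

```python
# Per-channel burst size and delivery rate for the two configurations
# quoted above. Same burst size, same rate: the DRAM arrays behind
# both interfaces see very similar demand.

def bits_per_ns(channel_width_bits: int, pin_rate_gbps: float) -> float:
    # 1 Gb/s per pin == 1 bit per ns per pin
    return channel_width_bits * pin_rate_gbps

def burst_bits(channel_width_bits: int, prefetch: int) -> int:
    return channel_width_bits * prefetch

# GDDR5: 32-bit channel, 8n prefetch, 4 Gbps pins
assert bits_per_ns(32, 4) == 128   # "delivers 128 bits in 1ns"
assert burst_bits(32, 8) == 256    # one burst = 256 bits over 2 ns

# HBM gen1: 128-bit channel, 2n prefetch, 1 Gbps pins
assert bits_per_ns(128, 1) == 128  # same rate
assert burst_bits(128, 2) == 256   # same burst size, same 2 ns
```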

    I'm fumbling through what amounts to numerology, due to my limited understanding of DRAM, to describe the base constraint: that HBM and GDDR interfaces are pulling data from very similar, ploddingly slow DRAM arrays.
    It seems like AMD and Hynix kept an eye on what existing hardware is optimized for when designing HBM, so if one standard likes a granularity of 256 bytes per channel to keep the arrays happy, then I think it's possible a similar granularity is helpful for the other.

    IB_STS is hardware register 7 and is listed as being read-only in the ISA doc. I don't know how the hardware is supposed to react to a setreg on a read-only register. It might be ignored. It would seem likely to cause a serious problem if it did allow the operation to execute, should a program overwrite one of the other counters that are known to be used. One simplifying assumption for the hardware would be to not make the internal adder fully-featured enough to worry about a nonsense scenario where loads are launched and VMCNT is set back to zero before the decrement can kick in.
     
    Lightman likes this.
  2. pMax

    pMax Regular

    ...one never stops learning, no way around it. Thank you.
     
  3. Jawed

    Jawed Legend

    Doh :oops:
     
  4. Jawed

    Jawed Legend

    As a matter of interest, I found a while loop that does the grunt work in SALU, using a set of 64-bit lane-execution masks (22:23, 24:25 and 20:21):

    Code:
      v_add_i32       v67, vcc, 4, v67       // loop increment
      v_add_i32       v0, vcc, s0, v0         // not relevant
      v_cmp_gt_i32     s[22:23], v10, v67
      v_add_i32       v41, vcc, 64, v41       // not relevant
      s_mov_b64       s[24:25], exec
      s_andn2_b64     exec, s[24:25], s[22:23]
      s_andn2_b64     s[20:21], s[20:21], exec
      s_cbranch_scc0   label_0415           // exit loop
      s_and_b64       exec, s[24:25], s[20:21]
      s_branch       label_005A           // back to start of loop
    label_0415:
    
    This while loop is interesting because every work item has its own start and stop indices in the While. But all work items run for the same count of iterations (there is no divergence).
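    A rough Python model of the mask bookkeeping at the bottom of that loop (my interpretation of the disassembly, treating the 64-bit masks as plain integers; `cmp_mask` stands for the `v_cmp_gt_i32` result in s[22:23]):

```python
# One trip through the loop-bottom mask juggling from the listing above.
MASK64 = (1 << 64) - 1

def loop_step(exec_mask, loop_mask, cmp_mask):
    """exec_mask : active lanes entering the test (copied to s[24:25])
    loop_mask : lanes still inside the while      (s[20:21])
    cmp_mask  : lanes whose condition held        (v_cmp into s[22:23])
    Returns (new_exec, new_loop_mask, keep_looping)."""
    saved = exec_mask                    # s_mov_b64   s[24:25], exec
    done = saved & ~cmp_mask & MASK64    # s_andn2_b64 exec, s[24:25], s[22:23]
    loop_mask &= ~done & MASK64          # s_andn2_b64 s[20:21], s[20:21], exec
    if loop_mask == 0:                   # s_cbranch_scc0: no lanes left, exit
        return saved, loop_mask, False
    new_exec = saved & loop_mask         # s_and_b64   exec, s[24:25], s[20:21]
    return new_exec, loop_mask, True     # s_branch back to loop start

# 4 lanes active, all still satisfy the condition: nothing drops out.
e, m, go = loop_step(0b1111, 0b1111, 0b1111)
assert (e, m, go) == (0b1111, 0b1111, True)

# Same lanes, none satisfy the condition: the loop exits.
e, m, go = loop_step(0b1111, 0b1111, 0b0000)
assert (m, go) == (0, False)
```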
     
  5. pMax

    pMax Regular

    Uuh, GPU assembly is always hard for me, but it is interesting.
    A question: what's held in s20:21? It seems to represent some additional condition used to mask out the loop execution (maybe some custom condition to end the loop early for the thread?)
     
  6. Jawed

    Jawed Legend

    I believe it is a side effect of detecting whether a const parameter of the kernel is 0, and if so, entirely skipping the loop. This is the start of the kernel:

    Code:
      s_mov_b32     m0, 0x00008000              
      s_buffer_load_dwordx2  s[0:1], s[4:7], 0x18
      s_buffer_load_dword  s4, s[8:11], 0x00    
      s_buffer_load_dword  s5, s[8:11], 0x04    
      s_buffer_load_dword  s6, s[8:11], 0x08    // constant kernel parameter?
      s_buffer_load_dword  s7, s[8:11], 0x0c    
      s_buffer_load_dword  s14, s[8:11], 0x10   
      s_buffer_load_dword  s8, s[8:11], 0x14    
      v_lshlrev_b32  v2, 2, v0                  
      v_lshlrev_b32  v3, 5, v1                  
      s_lshl_b32    s9, s12, 3                  
      s_lshl_b32    s10, s13, 3                 
      s_waitcnt     lgkmcnt(0)                  
      s_add_u32     s0, s9, s0                  
      v_add_i32     v4, vcc, s0, v0             
      s_add_u32     s0, s10, s1                 
      v_add_i32     v5, vcc, s0, v1             
      v_lshlrev_b32  v5, 3, v5                  
      v_mul_lo_i32  v6, v5, s7                  
      v_add_i32     v6, vcc, v4, v6             
      s_cmp_ge_i32  s6, 1                                            //  test kernel param
      s_cbranch_scc0  label_042B            // skip way beyond the end of loop, to code that writes 0 to the output
    
    then a bit later (this is before the loop), the execute mask resulting from s_cmp is copied, twice:

    Code:
      s_mov_b64     s[10:11], exec  // kept for after the loop
      s_mov_b64     s[20:21], exec  // used within loop
    The thing is, if this parameter is 0, then all work items will skip the loop. In which case it's completely pointless caring which of them entered the loop (storing that in s[20:21]), since either all of them or none of them will have entered it. exec, upon entering the loop, is all that needs to be known for each successive iteration.
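    Put another way: because `s_cmp_ge_i32` tests a scalar register, the branch is uniform across the wave, so the outcome is all-or-nothing (a sketch, with hypothetical names):

```python
# A scalar branch cannot diverge: every active lane takes the same path.
WAVE = (1 << 64) - 1  # full 64-lane exec mask

def lanes_entering_loop(param: int, exec_mask: int = WAVE) -> int:
    """Model of the s_cmp_ge_i32 s6, 1 / s_cbranch_scc0 pair."""
    if param >= 1:            # scalar compare: one result for the wave
        return exec_mask      # whole wave enters the loop
    return 0                  # whole wave skips to label_042B

assert lanes_entering_loop(8) == WAVE  # all lanes in
assert lanes_entering_loop(0) == 0     # no lanes in -- so a copy of
                                       # exec in s[20:21] carries no
                                       # extra information at this point
```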

    I've discovered that IFs inside the loop, when all are commented-out, cause 2 of these 3 pairs of scalars to disappear (now we just have 4:5). Here's the test at the end of the loop in that alternative code:

    Code:
      v_cmp_gt_i32  s[4:5], v10, v162  // v162 was incremented by 4 much earlier in loop
      v_add_i32  v41, vcc, 64, v41        // irrelevant
      s_and_b64  exec, exec, s[4:5]     // what I would expect to see in the original code I posted
      s_cbranch_execnz  label_0055   // back to start of loop
    
    The IFs do not affect the loop's iteration count. So it just seems that I confused the compiler by having IFs inside the While, wasting cycles on those 3 pairs of scalar registers.

    This variant still does the test of s6, loaded with s_buffer_load_dword s6, s[8:11], 0x08 and still makes a copy of exec into s10:11. This is required for the instructions that execute after the loop, since their execution is not conditional upon what occurred during the loop.
     
  7. Jawed

    Jawed Legend

    I take that back, the IFs inside the While affect exec, so exec needs to use 20:21 at the time the While condition is being evaluated.
     
  8. DieH@rd

    DieH@rd Legend

  9. Kaotik

    Kaotik Drunk Member Legend

    Original source: http://www.chiphell.com/thread-1196441-1-1.html (but it just keeps loading, loading and loading)
    Graphs are available here too: http://www.forum-3dcenter.org/vbulle...7#post10455327

    Claimed tests from ChipHell, a site known for both legit and fake leaks; without getting to see whether those tests are actually ChipHell's own or just a random forum post, it's hard to say one way or the other whether they're more likely fake or real.

    According to the first graph, averaging performance over 20 games, the Fiji XT Engineering Sample is somewhere around 10-15% faster than the GTX 980, while GM200's cut-down version is some 2-5% faster than the Fiji XT ES.
    They claim that in BF4 multiplayer, the Fiji XT ES uses some 15-20% more power than the GTX 980, while the cut-down GM200 uses some 5%, give or take a few percent, more than the Fiji XT ES.

    The second graph has numbers for BF4 MP, CoD AW, DA:Inquisition, Ryse and Watch Dogs; it also includes "full fat GM200" and a Bermuda XT ES in addition to the Fiji XT ES and cut-down GM200.
    In it, the Bermuda XT ES is faster than the full-fat GM200 in all games, the difference ranging from just a few percent to well over 10%. Fiji XT meanwhile is slower than the cut-down GM200 in most games, but slightly faster in Dragon Age and Ryse.
     
  10. Esrever

    Esrever Regular

    Looks like another fake, probably by the same person as the last one. Any leak of an ES playing multiple games on a graph is fake in my book.
     
  11. LordEC911

    LordEC911 Regular

    While I don't doubt there is fake information in there, I'm pretty sure there is some accurate info sprinkled in.
     
  12. homerdog

    homerdog donator of the year Legend Subscriber

    And that pretty much describes all fakes :)
     
  13. Newguy

    Newguy Regular

    Just thinking about this: if it were true (380X around 1.3x 290X, 390X around 30% more than that), that could mean ~290 performance for a 370X (about a 40-45% jump from 370X to 380X). That seems too good to be true, surely; granted, it is the first node shrink in a long time, and the claimed/"leaked" perf/watt gains are huge (about 1.8x), which would roughly fit 290 performance for a ~150W part. Hopefully we'll have official info sooner rather than later.
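    Chaining those rumored multipliers explicitly (pure arithmetic on the claimed numbers; the 290-vs-290X ratio of ~0.93 is my assumption):

```python
# Work the rumored ratios through, normalizing the 290X to 1.0.
r290x = 1.00
r380x = r290x * 1.30      # "380X around 1.3x 290X"
r390x = r380x * 1.30      # "390X around 30% more than that" -> ~1.69

# If a 370X landed near plain-290 performance (assumed ~0.93x a 290X),
# the implied 370X -> 380X step is the ~40% jump mentioned above.
r370x = 0.93
step = r380x / r370x      # roughly 1.40
assert 1.35 < step < 1.45
assert abs(r390x - 1.69) < 1e-9
```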
     
  14. Wynix

    Wynix Veteran

    Any chance of a CES reveal?
     
  15. Esrever

    Esrever Regular

    Probably not; AMD will be focusing on launching Carrizo at CES. I assume nothing on the GPU side until February because of that.
     
  16. Alexko

    Alexko Veteran Subscriber

    Is it a given that Carrizo will be released at CES? AMD has been very quiet lately.
     
  17. Esrever

    Esrever Regular

    Not sure, but it would make sense because it was the last thing they were talking about before the holidays. The only other thing they were talking about was FreeSync.
     
  18. moozoo

    moozoo Newcomer

    I can't find any keynote schedules on the CES website.
    Does anyone have any idea of when AMD is going to present Carrizo?
     
  19. lanek

    lanek Veteran
