AMD: R9xx Speculation

Discussion in 'Architecture and Products' started by Lukfi, Oct 5, 2009.

  1. jaredpace

    Newcomer

    Joined:
    Sep 28, 2009
    Messages:
    157
    Likes Received:
    0
    Cayman:

    [images]

    Barts:

    [images]
     
    #1721 jaredpace, Sep 7, 2010
    Last edited by a moderator: Sep 10, 2010
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    You're referring to the limitation of there being a maximum of 3 operands from each of .x, .y, .z or .w? So if 12 operands were all ".x" that would lead to severely constrained throughput (e.g. 1 MUL per cycle).

    Assuming there is ILP then in theory it can be faster. Commonly the pipeline registers (PV.xyzw and PS) are used as operands. Also T registers' 4 components can be freely used.

    In theory these two things will "unlock" the single-bank restriction, but that depends on the compiler.

    I just wrote Brook+, OpenCL and DirectCompute test shaders. Brook+ and OpenCL worked moderately well, using PV and T registers and achieving ~2 MULs per cycle (instead of 1): 11 MULs in 7 cycles, to be exact.

    DirectCompute "cheated" by loading operands from memory into .x and .y components, though it still ended up with ~2 MULs per cycle.

    Increasing ILP results in more MULs per cycle.

    Generally the compiler allocates registers for intermediate results with packing, so it'll create .y, .z and .w for anything reasonably complex. Though the packing varies in quality in my experience.

    Overall the question of ILP seems to be dominant, not banking.
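
    To make the banking restriction concrete, here's a toy greedy scheduler (entirely my own sketch of the rule described above, not AMD's compiler or hardware): at most 5 ops issue per cycle (one per ALU lane), and at most 3 GPR operands may be read from any one component bank (.x, .y, .z, .w) per cycle.

    ```python
    from collections import Counter

    def schedule(muls, reads_per_bank=3, lanes=5):
        """muls: list of (src_a_bank, src_b_bank) pairs, e.g. ('x', 'x').
        Greedily packs independent MULs into cycles, respecting the per-cycle
        lane limit and the per-bank operand-read limit; returns cycle count."""
        cycles = 0
        pending = list(muls)
        while pending:
            bank_reads = Counter()
            issued = 0
            remaining = []
            for a, b in pending:
                need = Counter([a, b])
                if issued < lanes and all(
                    bank_reads[k] + need[k] <= reads_per_bank for k in need
                ):
                    bank_reads.update(need)
                    issued += 1
                else:
                    remaining.append((a, b))
            pending = remaining
            cycles += 1
        return cycles

    # 12 independent MULs whose operands all live in .x: bank-limited to 1/cycle.
    print(schedule([('x', 'x')] * 12))                    # 12 cycles
    # The same MULs with operands spread across banks: limited only by lanes.
    print(schedule([('x', 'y')] * 6 + [('z', 'w')] * 6))  # 3 cycles
    ```

    The point of the toy model is just that the "severely constrained throughput" case depends on where operands live, not on how much ILP exists.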
     
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    That was what I was referencing in the portion of my statement with regard to packing. For clarity, that is separate from the portion concerning the exposure of an ugly low-level implementation detail.

    This assumes there is ILP that does not run afoul of bank conflicts, since ILP is not defined relative to banking issues.
    In a perverse sense, it is the opposite of ILP that hides some of the problem, since the pipeline registers are only populated when the current cycle depends on the cycle prior.

    The pipeline registers are also a low-level implementation detail that is exposed to software, though typically this would be fine since most architectures have bypass networks that do this automatically.

    Are these test shaders coded specifically to test banking conflicts?

    That sounds like the compiler in that case at least knew there would be a problem.
     
  4. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Code:
    __kernel void test2(
    global float* A,
    global float* B, 
    global float* C,
    global float* D,
    global float* E,
    global float* F,
    global float* G,
    global float* H,
    global float* I,
    global float* J,
    global float* K,
    global float* L,
    global float* Z) {
     int x = get_global_id(0);
     float Z1 = A[x] * B[x] * C[x] * D[x];
     float Z2 = E[x] * F[x] * G[x] * H[x];
     float Z3 = I[x] * J[x] * K[x] * L[x];
     Z[x] = Z1 * Z2 * Z3;
    }
    Produces:

    Code:
    02 ALU: ADDR(68) CNT(13) 
         23  x: LSHR        R0.x,  R5.y,  (0x00000002, 2.802596929e-45f).x      
             w: MUL_e       ____,  R0.x,  R4.x      
         24  x: MUL_e       ____,  R5.x,  R6.x      
             w: MUL_e       T0.w,  PV23.w,  R1.x      VEC_021 
         25  x: MUL_e       ____,  PV24.x,  R8.x      
             y: MUL_e       ____,  R7.x,  R2.x      VEC_021 
         26  x: MUL_e       ____,  T0.w,  R10.x      
             y: MUL_e       ____,  PV25.x,  R9.x      VEC_021 
             z: MUL_e       ____,  PV25.y,  R11.x      VEC_102 
         27  z: MUL_e       ____,  PV26.z,  R3.x      
             w: MUL_e       ____,  PV26.y,  PV26.x      
         28  x: MUL_e       R3.x,  PV27.w,  PV27.z  
    I've excluded all the addressing instruction crap and the memory operations.

    Obviously I created ILP here. Purely serial scalar runs at 1 instruction per clock:

    Code:
    __kernel void test3(
    global float* A,
    global float* Z) {
     int x = get_global_id(0);
     Z[x] = A[x] * A[x + 1] * A[x + 2] * A[x + 3] * A[x + 4] * A[x + 5] * A[x + 6] * A[x + 7] * A[x + 8] * A[x + 9] * A[x + 10] * A[x + 11];
    }
    Code:
    02 ALU: ADDR(73) CNT(11) 
         23  x: MUL_e       ____,  R3.x,  R4.x      
         24  y: MUL_e       ____,  PV23.x,  R5.x      
         25  y: MUL_e       ____,  PV24.y,  R0.x      
         26  z: MUL_e       ____,  PV25.y,  R6.x      
         27  z: MUL_e       ____,  PV26.z,  R1.x      
         28  w: MUL_e       ____,  PV27.z,  R2.x      
         29  w: MUL_e       ____,  PV28.w,  R7.x      
         30  w: MUL_e       ____,  PV29.w,  R8.x      
         31  x: MUL_e       ____,  PV30.w,  R9.x      
         32  y: MUL_e       ____,  PV31.x,  R10.x      
         33  x: MUL_e       R11.x,  PV32.y,  R11.x  
    Evergreen is supposed to be able to do 2 serially dependent MULs per clock, so AMD still hasn't got that working. EDIT: I have to take that back, as the instruction ordering here prevents the possibility of issuing two per cycle.
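
    The contrast between the two listings is just dependency-chain depth. A back-of-envelope sketch (my own, not anything AMD documents): a left-to-right product of n factors has a critical path of n-1 dependent MULs, while balanced pairwise grouping cuts the path to ceil(log2 n), which is why the ILP-friendly kernel finishes in far fewer ALU cycles than the serial one.

    ```python
    import math

    def serial_depth(n):
        # a*b*c*... evaluated strictly left to right:
        # every MUL waits on the previous result.
        return n - 1

    def tree_depth(n):
        # Balanced pairwise products: depth grows logarithmically.
        return math.ceil(math.log2(n))

    print(serial_depth(12))  # 11 dependent MULs, matching the serial listing
    print(tree_depth(12))    # 4, the best case with full reassociation
    ```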
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Actually that ordering shouldn't prevent the dual-issue of serially dependent ops.

    Experimenting some more, I can't get the compiler to do this at all, which is pretty crap for a chip that's been around for over a year.
     
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Is the compiler being conservative because these are floating point values? Would ints be handled differently?
     
  8. EduardoS

    Newcomer

    Joined:
    Nov 8, 2008
    Messages:
    131
    Likes Received:
    0
    That's not all: a * b * c * d isn't truly dependent. If I tell the compiler that I don't care about precision, it should be able to reorder it as (a * b) * (c * d).

    Also, I bet 99% of programmers who write something like a * b * c * d in their code wouldn't be able to say whether it produces more correct results than (a * b) * (c * d), but they would be fine with either result.
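
    The precision point is easy to demonstrate. A Python sketch (values chosen by me to force the difference via underflow/overflow; even without such extremes, rounding alone can make the two groupings differ in the last bit):

    ```python
    import math

    a, b, c, d = 1e-308, 1e-308, 1e308, 1e308

    # Left to right, as strict evaluation order requires:
    # a*b underflows to 0.0, and 0.0 times anything finite stays 0.0.
    left_to_right = ((a * b) * c) * d

    # Reassociated for ILP as (a*b)*(c*d):
    # a*b underflows to 0.0, c*d overflows to inf, and 0.0 * inf is nan.
    paired = (a * b) * (c * d)

    print(left_to_right)        # 0.0
    print(math.isnan(paired))   # True
    ```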
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I don't think this scheduling is supported for normal integer math (the bit-count and SAD instructions support it, though).

    As for MUL, no change even with brackets.

    a * b * c * d is conventionally serially dependent left-to-right.
     
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    I can understand that there is a serialization when using a*b*c*d, though it seems odd to me that brackets wouldn't change anything.
    There must be something I haven't considered for that situation.
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    By "no change" I mean dual-issue doesn't occur. The compiled sequence is affected by brackets, as you'd expect.
     
  12. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    960
    Likes Received:
    853
  13. Kaldskryke

    Newcomer

    Joined:
    May 9, 2008
    Messages:
    16
    Likes Received:
    0
    Are we supposedly looking at Antilles here, or just two Caymans in CrossFire? If it's Antilles, I doubt 900 MHz is possible without exceeding a 300 W TDP.
     
  14. UniversalTruth

    Veteran

    Joined:
    Sep 5, 2010
    Messages:
    1,747
    Likes Received:
    22
    What a fake. :roll: Why a 128-bit memory interface? :roll:
     
  15. jaredpace

    Newcomer

    Joined:
    Sep 28, 2009
    Messages:
    157
    Likes Received:
    0
    The memory on the AMD Caicos board is Hynix H5TQ2G63BFR-12C-024A GDDR3, and the memory on ATI Cedar (HD5450) is Samsung K4W1G1646E-HC12. Both are GDDR3 spec'd at 800 MHz, but the Hynix on the new board has a density of 2G / 8K Ref / 64ms, while the Cedar board's is 1G / 8K / 64ms. Memory bandwidth has increased from 12.8 GB/s on Cedar to 25.6 GB/s on Caicos. Is this because of the density change from 1G to 2G?

    Talking Juniper -> Barts: would increasing the bus width from 128-bit to 256-bit effectively double the board's memory bandwidth while keeping the same Samsung GDDR5 memory modules from Juniper boards?

    Also, does anyone know how to read the fab codes on the dies? I know the first four digits are year/work week and "eng" = sample; what's the other stuff, a wafer batch code?

    [images]
     
  16. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    No, memory bandwidth will stay the same (but memory size will double, of course). Who said it increased?
    Yes. Note that Juniper used 8 chips in clamshell mode, which effectively halves the per-chip data width from 32-bit to 16-bit; presumably Barts would use 8 chips in "normal" mode.
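
    The clamshell arithmetic, as a sketch (the function is mine, just restating the post above):

    ```python
    # Total bus width = chips x per-chip I/O width. In clamshell mode two
    # chips share one 32-bit channel, so each chip effectively drives 16 bits.
    def bus_width_bits(chips, chip_io_bits=32, clamshell=False):
        per_chip = chip_io_bits // 2 if clamshell else chip_io_bits
        return chips * per_chip

    print(bus_width_bits(8, clamshell=True))  # Juniper: 8 chips, 128-bit bus
    print(bus_width_bits(8))                  # "normal" mode: 256-bit bus
    ```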
     
  17. jaredpace

    Newcomer

    Joined:
    Sep 28, 2009
    Messages:
    157
    Likes Received:
    0
    Well, I checked a press-deck slide for Cedar to get the 11.8 GB/s figure, and there was a Caicos GPU-Z screenshot showing 25.6 GB/s. Could just be an error in GPU-Z, I suppose.
     
  18. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    297
    Likes Received:
    38
    Location:
    Herwood, Tampere, Finland
    AFAIK GPU-Z cannot detect the memory bus width of the chip; it just "guesses" it from the chip's PCI ID and memory size, using a database maintained by the makers of GPU-Z. For new, unreleased chips there are errors in that database, as the makers have had to guess some values without any official info.

    GPU-Z then detects the clock speed, calculates the bandwidth from the detected clock and the guessed bus width, and gets the wrong result.
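
    The arithmetic described above is simple to sketch (my own toy function, not GPU-Z's actual code): bandwidth = memory clock x transfers per clock x bus width in bytes, so a wrong bus-width guess scales the reported figure directly.

    ```python
    def bandwidth_gb_s(mem_clock_mhz, bus_bits, transfers_per_clock=2):
        # GB/s = clock (Hz) * transfers per clock * bus width (bytes) / 1e9
        return mem_clock_mhz * 1e6 * transfers_per_clock * (bus_bits / 8) / 1e9

    # Cedar: 800 MHz double-data-rate GDDR3 on a 64-bit bus:
    print(bandwidth_gb_s(800, 64))    # 12.8 GB/s
    # The same memory with the bus width misread as 128-bit:
    print(bandwidth_gb_s(800, 128))   # 25.6 GB/s
    ```

    Doubling the guessed bus width is exactly enough to turn Cedar's real figure into the 25.6 GB/s shown in the Caicos screenshot.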
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    The compiler could be being conservative if the serial MULs are paired in such a way that the first MUL's result is not rounded and normalised (as would happen in a dot product, I presume). Is that what you were thinking of?

    I tried this in both OpenCL and DirectCompute; neither produces paired serial MULs. The rounding/normalisation issue is not documented in the ISA, and since there are "IEEE" versions of the serial instructions (MUL and MULADD), one would be forgiven for thinking that a pairing of correctly rounded/normalised MULs is possible.

    I haven't tried a pixel (or other graphics) shader, that's the last resort I guess.

    If it only ever occurs in a pixel shader then it's pretty much useless, since the default is high levels of ILP anyway.
     
  20. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I think the OpenCL compiler will maintain strict ordering unless you opt in to relaxed FP ordering. I forget the exact name of the compiler option, but did you pass anything like fast-math or relaxed-math?

    Giving no options should strictly maintain program order.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.