AMD GPU OpenCL Beta ATI Stream SDK

Discussion in 'GPGPU Technology & Programming' started by Dave Baumann, Oct 13, 2009.

  1. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,003
    Location:
    O Canada!
    AMD GPU OpenCL ATI Stream SDK

    First beta release of ATI Stream SDK with OpenCL™ GPU support is now posted

    Supported GPU's in the initial release:

    ATI Radeon™ HD 5870
    ATI Radeon™ HD 5850
    ATI Radeon™ HD 5770
    ATI Radeon™ HD 5750
    ATI Radeon™ HD 4890
    ATI Radeon™ HD 4870 X2
    ATI Radeon™ HD 4870
    ATI Radeon™ HD 4850 X2
    ATI Radeon™ HD 4850
    ATI Radeon™ HD 4830
    ATI Radeon™ HD 4770
    ATI Radeon™ HD 4670
    ATI Radeon™ HD 4650
    ATI Radeon™ HD 4550
    ATI Radeon™ HD 4350

    ATI FirePro™ V8750
    ATI FirePro™ V8700
    ATI FirePro™ V7750
    ATI FirePro™ V5700
    ATI FirePro™ V3750

    AMD FireStream™ 9270
    AMD FireStream™ 9250

    ATI Mobility Radeon™ HD 4870
    ATI Mobility Radeon™ HD 4860
    ATI Mobility Radeon™ HD 4850 X2
    ATI Mobility Radeon™ HD 4850
    ATI Mobility Radeon™ HD 4830
    ATI Mobility Radeon™ HD 4670
    ATI Mobility Radeon™ HD 4650
    ATI Mobility Radeon™ HD 4500 Series
    ATI Mobility Radeon™ HD 4300 Series

    ATI Mobility FirePro™ M7740

    ATI Radeon™ Embedded E4690 Discrete GPU

    Update 21-Dec-2009

     
  2. Broken Hope

    Regular

    Joined:
    Jul 13, 2004
    Messages:
    483
    Location:
    England
    XP driver is corrupt. Can somewhat repair it with Winrar though but don't feel comfortable installing it.
     
  3. pcchen

    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,645
    Location:
    Taiwan
    Great! Now I only have to port my NLM denoise program to OpenCL then... :)
     
  4. Arnold Beckenbauer

    Veteran

    Joined:
    Oct 11, 2006
    Messages:
    1,099
    Location:
    Germany
    Mandelbrot and another sample asked me for Glut32 and Glew32, but they aren't part of Beta 4 (Win7 x32).
    But they are part of the Beta 3, so you can copy and paste them in C:\Program Files\ATI Stream\bin\x86.
     
  5. CNCAddict

    Regular

    Joined:
    Aug 14, 2005
    Messages:
    288
    They put them in the ATI Stream -> bin folder. Just copy the dlls over to the sample folder and everything works fine.

    It would be nice if the Stream SDK was included with some sort of software environment IDE, is that something planned for the future?
     
  6. willardjuice

    willardjuice super willyjuice
    Moderator Veteran Alpha Subscriber

    Joined:
    May 14, 2005
    Messages:
    1,306
    Location:
    NY
    For Windows just use Visual Studio.
     
  7. CNCAddict

    Regular

    Joined:
    Aug 14, 2005
    Messages:
    288
  8. CNCAddict

    Regular

    Joined:
    Aug 14, 2005
    Messages:
    288
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,644
    Location:
    London
    IL for OpenCL kernels

    http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=120683

    That's interesting. From the IL it's but a short step to assembly code, using SKA. It's this latter step where optimisation occurs, generally, so no need to worry about the IL being a mess.

    Oh, and that looks like some variant of llvm is being used, does it not?

    Jawed
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    He hasn;t gotten llvm's optimizers running so far. That's why IL is not looking good.

    LLVM is so cool and so meant for something like this that you will be crazy to not use it.
     
  11. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    BTW, does this release have some kind of timed lockout like the previous version? Or will I have to keep updating this thing?
     
  12. Karoshi

    Newcomer

    Joined:
    Aug 31, 2005
    Messages:
    181
    Location:
    Mars
    I've been wondering if AMD was considering releasing some of it's llvm code. Apple obviously uses llvm on osx. It seems AMD uses llvm for x86, and possibly gpu.
    It'd be nice if they released their opencl frontend back to the community. And possibly their gpu backend, to help ppl writing r7/r8 drivers. I mean, they are giving them the docs, why not help them a bit more by providing them with the backend to generate ps/vs/cs. It makes sense to me.
    But maybe AMD got some code from apple and are not free to release it.
    Anybody has heard rumours on llvm code release from amd?
     
  13. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    Nah, no code will be coming out of amd. That's what they said when they announced their opensource driver strategy. Considering that their driver team is smaller than nv, I guessestimate that their jit compilers are prolly developed by a 3rd party.
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,644
    Location:
    London
    The IL produced by Brook+ often looks horrible. Usually the assembly is fine. Though there are odd problems (superfluous MOVs, VLIW packing opportunities missed, register components going unused in vec4 registers...).

    I doubt there's much, if any, value in using IL to judge optimality.

    There's still the proviso of irreducible control flow hanging over the heads of the compilers for the graphics chips.

    Jawed
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,644
    Location:
    London
    Yeah, end of February next year.

    Jawed
     
  16. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    I think you have a point here, converting IL to LLVM IR, and then running a mem2reg pass should produce very good IR, and then your usual llvm transformations. IL, I think here serves only as a glorified bytecode.

    I don't think you can help it. The gpu code usually is reducible, and handling irreducible control flow is any way optional for ocl drivers. If you can write your code with just while and for loops, you should not get irreducible control flows, I think.
     
  17. Forrest

    Newcomer

    Joined:
    Jul 22, 2008
    Messages:
    39
    Current SKA doesn't support OpenCL.

    Here is the IL generated from this kernel :

    Code:
    __kernel void templateKernel(__global  float * output,
                                 __global  float * input,
                                 const unsigned int multiplier)
    {
    	int i = get_global_id(0);
    
    	output[i] = input[i];
    	
    }
    The generated IL :

    Code:
    il_cs_2_0
    dcl_raw_uav_id(0)
    dcl_cb cb0[9] ; Constant buffer that holds ABI data
    dcl_literal l0, 4, 1, 2, 3
    dcl_literal l1, 0x00FFFFFF, -1, -2, -3
    dcl_literal l2, 0x0000FFFF, 0xFFFFFFFE,0x000000FF,0xFFFFFFFC
    dcl_literal l3, 24, 16, 8, 0xFFFFFFFF
    dcl_literal l4, 0xFFFFFF00, 0xFFFF0000, 0xFF00FFFF, 0xFFFF00FF
    dcl_literal l5, 0, 4, 8, 12
    mov r769, cb0[8].x
    call 1174;$
    endmain
    func 1174 ; __OpenCL_templateKernel_kernel
    mov r770, l1.0
    dcl_literal l6, 0x00000000, 0x00000000, 0x00000000, 0x00000000; int: 0
    dcl_literal l7, 0x00000002, 0x00000002, 0x00000002, 0x00000002; int: 2
    dcl_num_thread_per_group 256, 1, 1              
    imul r0.w, cb0[2].x, cb0[2].y
    mov r0.z, vThreadGrpIdFlat.x
    mov r1022.xyz0, vTidInGrp.xyz
    umod r1023.x, r0.z, cb0[2].x
    udiv r1023.y, r0.z, cb0[2].x
    umod r1023.y, r1023.y, cb0[2].y
    udiv r1023.z, r0.z, r0.w
    imad r1021.xyz0, r1023.xyz0, cb0[1].xyz0, r1022.xyz0
    iadd r1021.xyz0, r1021.xyz0, cb0[6].xyz0
    iadd r1023.xyz0, r1023.xyz0, cb0[7].xyz0
    mov r1023.w, r0.z
    imad r772.x, r1023.w, cb0[4].y, cb0[4].x
    ishl r1023.w, r1023.w, l0.z
    imad r771.x, vAbsTidFlat.x, cb0[3].y, cb0[3].x
    dcl_cb cb1[3]
    ; Kernel arg setup: output
    mov r1, cb1[0]
    ; Kernel arg setup: input
    mov r2, cb1[1]
    ; Kernel arg setup: multiplier
    mov r113, cb1[2]
    call 1176 ; templateKernel
    ret
    endfunc ; __OpenCL_templateKernel_kernel
    ;ARGSTART:__OpenCL_templateKernel_kernel
    ;version:1:2:35
    ;uniqueid:1174
    ;memory:private:0
    ;memory:local:0
    ;pointer:output:float:1:1:0:uav:0
    ;pointer:input:float:1:1:16:uav:0
    ;value:multiplier:i32:1:1:32
    ;function:1:1176
    ;intrinsic:0
    ;ARGEND:__OpenCL_templateKernel_kernel
    func 1176 ; templateKernel
    	mov r1026.x___, r113.xxxx
    	mov r1025.x___, r2.xxxx
    	mov r1024.x___, r1.xxxx
    	mov r1027.x___, l6
    	mov r1.x___, r1027.xxxx
    	call 1027 ; get_global_id
    	mov r1028.x___, r1.xxxx
    	mov r1029.x___, l7
    	ishl r1030.x___, r1028.xxxx, r1029.xxxx
    	iadd r1031.x___, r1025.xxxx, r1030.xxxx
    	mov r1.x___, r1031.xxxx
    	call 1083 ; get32BitLoadUAV
    	mov r1032.x___, r1.xxxx
    	add r1033.x___, r1032.xxxx, r1032.xxxx
    	iadd r1034.x___, r1024.xxxx, r1030.xxxx
    	mov r2.x___, r1033.xxxx
    	mov r1.x___, r1034.xxxx
    	call 1078 ; get32BitStoreUAV
    	ret
    endfunc ; templateKernel
    ;ARGSTART:templateKernel
    ;uniqueid:1176
    ;memory:private:0
    ;memory:local:0
    ;function:3:1027:1078:1083
    ;intrinsic:0
    ;ARGEND:templateKernel
    func 1027 ; get_global_id
    iadd r1020, r1.xxxx, l1.0yzw
    ieq r1020, r1020, l0.0000
    cmov_logical r1.x, r1020.x, r1021.x, r1021.0
    cmov_logical r1.x, r1020.y, r1021.y, r1.x
    cmov_logical r1.x, r1020.z, r1021.z, r1.x
    ret
    endfunc ; get_global_id
    func 1078 ; Store32BitsUAV
    uav_raw_store_id(0) mem0.x___, r1.x, r2
    ret
    endfunc ; Store32BitsUAV
    func 1083 ; Load32BitsUAV
    uav_raw_load_id(0) r1.x___, r1.x
    ret
    endfunc ; Load32BitsUAV
    
    end
     
    #17 Forrest, Oct 24, 2009
    Last edited by a moderator: Oct 24, 2009
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,644
    Location:
    London
    Hopefully it will, soon.

    Euh, that's a shedload of addressing computation. It results in 31 cycles of integer operations for addressing, which means it's amazingly slow (anything under 10 would be free on HD4870/5870 - so this is more than 3x slower).

    The "logically equivalent" Brook+ ("compute shader", using gather and scatter rather than streams):

    Code:
    Attribute[GroupSize(256,1,1)]
    kernel void templateKernel(out float output[], float input[])
    {
     int4 i = instance();
     output[i.x] = input[i.x];
    }
    generates:

    Code:
    il_cs_2_0
    ; l0 = (0.0f, 0.0f, 0.0f, 0.0f, )
    dcl_literal l0, 0x00000000, 0x00000000, 0x00000000, 0x00000000
    ; l1 = (1.401298464e-45f, 1.401298464e-45f, 1.401298464e-45f, 1.401298464e-45f, )
    dcl_literal l1, 0x00000001, 0x00000001, 0x00000001, 0x00000001
    ; l2 = (-1.#QNANf, -1.#QNANf, -1.#QNANf, -1.#QNANf, )
    dcl_literal l2, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF
    ; l3 = (1.#QNANf, 1.#QNANf, 1.#QNANf, 1.#QNANf, )
    dcl_literal l3, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF
    ; l4 = (1.#INFf, 1.#INFf, 1.#INFf, 1.#INFf, )
    dcl_literal l4, 0x7F800000, 0x7F800000, 0x7F800000, 0x7F800000
    ; l5 = (0.0f, 0.0f, 0.0f, 0.0f, )
    dcl_literal l5, 0x80000000, 0x80000000, 0x80000000, 0x80000000
    ; l6 = (0.30103001f, 0.30103001f, 0.30103001f, 0.30103001f, )
    dcl_literal l6, 0x3E9A209B, 0x3E9A209B, 0x3E9A209B, 0x3E9A209B
    ; l7 = (0.6931471825f, 0.6931471825f, 0.6931471825f, 0.6931471825f, )
    dcl_literal l7, 0x3F317218, 0x3F317218, 0x3F317218, 0x3F317218
    ; l8 = (3.141592741f, 3.141592741f, 3.141592741f, 3.141592741f, )
    dcl_literal l8, 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB
    ; l9 = (1.570796371f, 1.570796371f, 1.570796371f, 1.570796371f, )
    dcl_literal l9, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB
    ; l10 = (4.203895393e-45f, 4.203895393e-45f, 4.203895393e-45f, 4.203895393e-45f, )
    dcl_literal l10, 0x00000003, 0x00000003, 0x00000003, 0x00000003
    ; l11 = (2.802596929e-45f, 2.802596929e-45f, 2.802596929e-45f, 2.802596929e-45f, )
    dcl_literal l11, 0x00000002, 0x00000002, 0x00000002, 0x00000002
    ;global (g) declared, size = 4096
    ; l12 = (1.401298464e-45f, 1.401298464e-45f, 1.401298464e-45f, 1.401298464e-45f, )
    dcl_literal l12, 0x00000001, 0x00000001, 0x00000001, 0x00000001
    dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
    dcl_num_thread_per_group 256 
    mov r275.x___, vTidInGrpFlat0
    mov r276.x___, vAbsTidFlat0
    mov r277.x___, vThreadGrpIdFlat0
    call 38
    call 0
    endmain
    func 0
    ret
    func 2
        ieq r0.x___, r17.x000, l0.x000
        if_logicalnz r0.x000
            sample_resource(0)_sampler(0) r19, r18.xy00
        endif
        mov r16.x___, r19.x000
        ret_dyn
    ret
    func 37
        mov r271, r270
        mov r272.x___, r271.x000
        umul r273.x___, l12.x000, r272.x000
        iadd r273.x___, r273.x000, l0.x000
        itof r280.x___, r271.x000
        mov r274.x___, r280.x000
        mov r274._y__, l0.0x00
        mov r17.x___, r269.x000
        mov r18.xy__, r274.xy00
        call 2
        mov r281.x___, r16.x000
        mov g[r273.x+0].x___, r281.x000
    ret
    func 38
        itof r282.x___, r276.x000
        mov r279.x___, r282.x000
        mov r279._y__, l0.0x00
        mov r283, r279.xy00
        mov r278, r283
        mov r269.x___, l0.x000
        ftoi r284, r278
        mov r270, r284
        call 37
    ret
    end
    This Brook+ version only has 3 ALU cycles of addressing (though that's got junk code in it).

    A pure stream Brook+ kernel doesn't have any addressing. But this is constrained by the hardware's texture dimension limits, so the Brook+ kernel would result in instances of the kernel being called for each whole/part of 8192 addresses in the domain (double that on D3D11 hardware, in theory). Which would slow it down a fair bit.

    The addressing of the Brook+ compute shader uses vAbsTidFlat0, which is a vertex attribute for the "absolute thread ID". The OpenCL code has access to such an attribute, too, vAbsTidFlat - but never uses it (despite the use of the get_global_id function). Instead the address is computed from vThreadGrpIdFlat and vTidInGrp.

    Technically, the Brook+ isn't functionally identical. The OpenCL version has to generically address two blocks of read/write memory (input and output, which could overlap, or even be identical pointers), whereas the Brook+ version is working with two explicitly distinct blocks of memory, one bound for gather and the second for scatter. A truly generic version is something like:

    Code:
    Attribute[GroupSize(256,1,1)]
    kernel void generic(out float memory[], uint input, uint output)
    {
     int4 i = instance();
     memory[output + i.x] = memory[input + i.x];
    }
    
    which results in 4 cycles of addressing.

    It would be interesting to compare the OpenCL with Direct Compute, since the abstraction is pretty much the same as far as I can tell.

    Jawed
     
  19. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,353
    Is this the actual OpenCL kernel used to generate the IL you posted? I ask because of the following line:
    Code:
    	add r1033.x___, r1032.xxxx, r1032.xxxx
    
    This causes the output to be multiplied by 2, which is not what the OpenCL kernel is doing. This is also clear in the HW code:
    Code:
    03 TEX: ADDR(82) CNT(1) 
         25  VFETCH R0.x___, R1.z, fc160  MEGA(4) 
             FETCH_TYPE(NO_INDEX_OFFSET) 
    04 ALU: ADDR(68) CNT(1) 
         26  x: MOV*2       R0.x,  R0.x      
    05 MEM_RAT_CACHELESS_STORE_RAW: RAT(0)[R1], R0,  MARK  VPM
    Note the "MOV*2" instruction.
     
  20. Forrest

    Newcomer

    Joined:
    Jul 22, 2008
    Messages:
    39
    My bad, this is the actual kernel :

    Code:
    __kernel void templateKernel(__global  float * output,
                                 __global  float * input,
                                 const unsigned int multiplier)
    {
    	int i = get_global_id(0);
    	output[i] = input[i] * 2;
    	
    }
    The ISA you posted is for 8xx right?
     

Share This Page

Loading...