If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.
![]() |
|
|
#1 | ||
|
Gamerscore Wh...
Join Date: Jan 2002
Posts: 12,989
|
First beta release of ATI Stream SDK with OpenCL™ GPU support is now posted
Quote:
ATI Radeon™ HD 5870 ATI Radeon™ HD 5850 ATI Radeon™ HD 5770 ATI Radeon™ HD 5750 ATI Radeon™ HD 4890 ATI Radeon™ HD 4870 X2 ATI Radeon™ HD 4870 ATI Radeon™ HD 4850 X2 ATI Radeon™ HD 4850 ATI Radeon™ HD 4830 ATI Radeon™ HD 4770 ATI Radeon™ HD 4670 ATI Radeon™ HD 4650 ATI Radeon™ HD 4550 ATI Radeon™ HD 4350 ATI FirePro™ V8750 ATI FirePro™ V8700 ATI FirePro™ V7750 ATI FirePro™ V5700 ATI FirePro™ V3750 AMD FireStream™ 9270 AMD FireStream™ 9250 ATI Mobility Radeon™ HD 4870 ATI Mobility Radeon™ HD 4860 ATI Mobility Radeon™ HD 4850 X2 ATI Mobility Radeon™ HD 4850 ATI Mobility Radeon™ HD 4830 ATI Mobility Radeon™ HD 4670 ATI Mobility Radeon™ HD 4650 ATI Mobility Radeon™ HD 4500 Series ATI Mobility Radeon™ HD 4300 Series ATI Mobility FirePro™ M7740 ATI Radeon™ Embedded E4690 Discrete GPU Update 21-Dec-2009 Quote:
|
||
|
|
|
|
|
#2 |
|
Member
Join Date: Jul 2004
Location: England
Posts: 452
|
XP driver is corrupt. Can somewhat repair it with Winrar though but don't feel comfortable installing it.
|
|
|
|
|
|
#3 |
|
Moderator
Join Date: Feb 2002
Location: Taiwan
Posts: 2,358
|
Great! Now I only have to port my NLM denoise program to OpenCL then...
|
|
|
|
|
|
#4 |
|
Senior Member
Join Date: Oct 2006
Location: Germany
Posts: 1,003
|
Mandelbrot and another sample asked me for Glut32 and Glew32, but they aren't part of Beta 4 (Win7 x32).
But they are part of the Beta 3, so you can copy and paste them in C:\Program Files\ATI Stream\bin\x86.
__________________
Hail Brothers and Sisters! Coranon Silaria, Ozoo Mahoke Eta Kooram Nah Smech! Find Chuck Norris. |
|
|
|
|
|
#5 |
|
Member
Join Date: Aug 2005
Posts: 278
|
They put them in the ATI Stream -> bin folder. Just copy the dlls over to the sample folder and everything works fine.
It would be nice if the Stream SDK was included with some sort of software environment IDE, is that something planned for the future? |
|
|
|
|
|
#6 |
|
super willyjuice
Join Date: May 2005
Location: Astoria, NY
Posts: 998
|
For Windows just use Visual Studio.
|
|
|
|
|
|
#7 |
|
Member
Join Date: Aug 2005
Posts: 278
|
Does this mean we can do some neato bullet physics now??
http://www.bulletphysics.com/Bullet/...hp?f=18&t=4067 |
|
|
|
|
|
#8 |
|
Member
Join Date: Aug 2005
Posts: 278
|
FYI many of the OpenCL demos from nVidia's SDK actually work on the Radeon!! Try it yourself.
http://developer.nvidia.com/object/get-opencl.html |
|
|
|
|
|
#9 |
|
Regular
|
http://forums.amd.com/devforum/messa...hreadid=120683
That's interesting. From the IL it's but a short step to assembly code, using SKA. It's this latter step where optimisation occurs, generally, so no need to worry about the IL being a mess. Oh, and that looks like some variant of llvm is being used, does it not? Jawed
__________________
Can it play WoW? |
|
|
|
|
|
#10 | |
|
Senior Member
|
Quote:
LLVM is so cool and so meant for something like this that you will be crazy to not use it. |
|
|
|
|
|
|
#11 |
|
Senior Member
|
BTW, does this release have some kind of timed lockout like the previous version? Or will I have to keep updating this thing?
|
|
|
|
|
|
#12 | |
|
Member
Join Date: Aug 2005
Location: Mars
Posts: 181
|
Quote:
It'd be nice if they released their opencl frontend back to the community. And possibly their gpu backend, to help ppl writing r7/r8 drivers. I mean, they are giving them the docs, why not help them a bit more by providing them with the backend to generate ps/vs/cs. It makes sense to me. But maybe AMD got some code from apple and are not free to release it. Anybody has heard rumours on llvm code release from amd? |
|
|
|
|
|
|
#13 |
|
Senior Member
|
Nah, no code will be coming out of amd. That's what they said when they announced their opensource driver strategy. Considering that their driver team is smaller than nv, I guessestimate that their jit compilers are prolly developed by a 3rd party.
|
|
|
|
|
|
#14 | ||
|
Regular
|
Quote:
I doubt there's much, if any, value in using IL to judge optimality. Quote:
Jawed
__________________
Can it play WoW? |
||
|
|
|
|
|
#15 | |
|
Regular
|
Quote:
Jawed
__________________
Can it play WoW? |
|
|
|
|
|
|
#16 | ||
|
Senior Member
|
Quote:
Quote:
|
||
|
|
|
|
|
#17 |
|
Junior Member
Join Date: Jul 2008
Posts: 36
|
Current SKA doesn't support OpenCL.
Here is the IL generated from this kernel : Code:
__kernel void templateKernel(__global float * output,
__global float * input,
const unsigned int multiplier)
{
int i = get_global_id(0);
output[i] = input[i];
}
Code:
il_cs_2_0 dcl_raw_uav_id(0) dcl_cb cb0[9] ; Constant buffer that holds ABI data dcl_literal l0, 4, 1, 2, 3 dcl_literal l1, 0x00FFFFFF, -1, -2, -3 dcl_literal l2, 0x0000FFFF, 0xFFFFFFFE,0x000000FF,0xFFFFFFFC dcl_literal l3, 24, 16, 8, 0xFFFFFFFF dcl_literal l4, 0xFFFFFF00, 0xFFFF0000, 0xFF00FFFF, 0xFFFF00FF dcl_literal l5, 0, 4, 8, 12 mov r769, cb0[8].x call 1174;$ endmain func 1174 ; __OpenCL_templateKernel_kernel mov r770, l1.0 dcl_literal l6, 0x00000000, 0x00000000, 0x00000000, 0x00000000; int: 0 dcl_literal l7, 0x00000002, 0x00000002, 0x00000002, 0x00000002; int: 2 dcl_num_thread_per_group 256, 1, 1 imul r0.w, cb0[2].x, cb0[2].y mov r0.z, vThreadGrpIdFlat.x mov r1022.xyz0, vTidInGrp.xyz umod r1023.x, r0.z, cb0[2].x udiv r1023.y, r0.z, cb0[2].x umod r1023.y, r1023.y, cb0[2].y udiv r1023.z, r0.z, r0.w imad r1021.xyz0, r1023.xyz0, cb0[1].xyz0, r1022.xyz0 iadd r1021.xyz0, r1021.xyz0, cb0[6].xyz0 iadd r1023.xyz0, r1023.xyz0, cb0[7].xyz0 mov r1023.w, r0.z imad r772.x, r1023.w, cb0[4].y, cb0[4].x ishl r1023.w, r1023.w, l0.z imad r771.x, vAbsTidFlat.x, cb0[3].y, cb0[3].x dcl_cb cb1[3] ; Kernel arg setup: output mov r1, cb1[0] ; Kernel arg setup: input mov r2, cb1[1] ; Kernel arg setup: multiplier mov r113, cb1[2] call 1176 ; templateKernel ret endfunc ; __OpenCL_templateKernel_kernel ;ARGSTART:__OpenCL_templateKernel_kernel ;version:1:2:35 ;uniqueid:1174 ;memory:private:0 ;memory:local:0 ;pointer:output:float:1:1:0:uav:0 ;pointer:input:float:1:1:16:uav:0 ;value:multiplier:i32:1:1:32 ;function:1:1176 ;intrinsic:0 ;ARGEND:__OpenCL_templateKernel_kernel func 1176 ; templateKernel mov r1026.x___, r113.xxxx mov r1025.x___, r2.xxxx mov r1024.x___, r1.xxxx mov r1027.x___, l6 mov r1.x___, r1027.xxxx call 1027 ; get_global_id mov r1028.x___, r1.xxxx mov r1029.x___, l7 ishl r1030.x___, r1028.xxxx, r1029.xxxx iadd r1031.x___, r1025.xxxx, r1030.xxxx mov r1.x___, r1031.xxxx call 1083 ; get32BitLoadUAV mov r1032.x___, r1.xxxx add r1033.x___, r1032.xxxx, r1032.xxxx iadd r1034.x___, r1024.xxxx, r1030.xxxx mov r2.x___, r1033.xxxx mov r1.x___, r1034.xxxx call 1078 ; get32BitStoreUAV ret endfunc ; templateKernel ;ARGSTART:templateKernel ;uniqueid:1176 ;memory:private:0 ;memory:local:0 ;function:3:1027:1078:1083 ;intrinsic:0 ;ARGEND:templateKernel func 1027 ; get_global_id iadd r1020, r1.xxxx, l1.0yzw ieq r1020, r1020, l0.0000 cmov_logical r1.x, r1020.x, r1021.x, r1021.0 cmov_logical r1.x, r1020.y, r1021.y, r1.x cmov_logical r1.x, r1020.z, r1021.z, r1.x ret endfunc ; get_global_id func 1078 ; Store32BitsUAV uav_raw_store_id(0) mem0.x___, r1.x, r2 ret endfunc ; Store32BitsUAV func 1083 ; Load32BitsUAV uav_raw_load_id(0) r1.x___, r1.x ret endfunc ; Load32BitsUAV end Last edited by Forrest; 24-Oct-2009 at 04:47. Reason: clarity |
|
|
|
|
|
#18 | |
|
Regular
|
Hopefully it will, soon.
Quote:
The "logically equivalent" Brook+ ("compute shader", using gather and scatter rather than streams): Code:
Attribute[GroupSize(256,1,1)]
kernel void templateKernel(out float output[], float input[])
{
int4 i = instance();
output[i.x] = input[i.x];
}
Code:
il_cs_2_0
; l0 = (0.0f, 0.0f, 0.0f, 0.0f, )
dcl_literal l0, 0x00000000, 0x00000000, 0x00000000, 0x00000000
; l1 = (1.401298464e-45f, 1.401298464e-45f, 1.401298464e-45f, 1.401298464e-45f, )
dcl_literal l1, 0x00000001, 0x00000001, 0x00000001, 0x00000001
; l2 = (-1.#QNANf, -1.#QNANf, -1.#QNANf, -1.#QNANf, )
dcl_literal l2, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF
; l3 = (1.#QNANf, 1.#QNANf, 1.#QNANf, 1.#QNANf, )
dcl_literal l3, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF
; l4 = (1.#INFf, 1.#INFf, 1.#INFf, 1.#INFf, )
dcl_literal l4, 0x7F800000, 0x7F800000, 0x7F800000, 0x7F800000
; l5 = (0.0f, 0.0f, 0.0f, 0.0f, )
dcl_literal l5, 0x80000000, 0x80000000, 0x80000000, 0x80000000
; l6 = (0.30103001f, 0.30103001f, 0.30103001f, 0.30103001f, )
dcl_literal l6, 0x3E9A209B, 0x3E9A209B, 0x3E9A209B, 0x3E9A209B
; l7 = (0.6931471825f, 0.6931471825f, 0.6931471825f, 0.6931471825f, )
dcl_literal l7, 0x3F317218, 0x3F317218, 0x3F317218, 0x3F317218
; l8 = (3.141592741f, 3.141592741f, 3.141592741f, 3.141592741f, )
dcl_literal l8, 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB
; l9 = (1.570796371f, 1.570796371f, 1.570796371f, 1.570796371f, )
dcl_literal l9, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB
; l10 = (4.203895393e-45f, 4.203895393e-45f, 4.203895393e-45f, 4.203895393e-45f, )
dcl_literal l10, 0x00000003, 0x00000003, 0x00000003, 0x00000003
; l11 = (2.802596929e-45f, 2.802596929e-45f, 2.802596929e-45f, 2.802596929e-45f, )
dcl_literal l11, 0x00000002, 0x00000002, 0x00000002, 0x00000002
;global (g) declared, size = 4096
; l12 = (1.401298464e-45f, 1.401298464e-45f, 1.401298464e-45f, 1.401298464e-45f, )
dcl_literal l12, 0x00000001, 0x00000001, 0x00000001, 0x00000001
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_num_thread_per_group 256
mov r275.x___, vTidInGrpFlat0
mov r276.x___, vAbsTidFlat0
mov r277.x___, vThreadGrpIdFlat0
call 38
call 0
endmain
func 0
ret
func 2
ieq r0.x___, r17.x000, l0.x000
if_logicalnz r0.x000
sample_resource(0)_sampler(0) r19, r18.xy00
endif
mov r16.x___, r19.x000
ret_dyn
ret
func 37
mov r271, r270
mov r272.x___, r271.x000
umul r273.x___, l12.x000, r272.x000
iadd r273.x___, r273.x000, l0.x000
itof r280.x___, r271.x000
mov r274.x___, r280.x000
mov r274._y__, l0.0x00
mov r17.x___, r269.x000
mov r18.xy__, r274.xy00
call 2
mov r281.x___, r16.x000
mov g[r273.x+0].x___, r281.x000
ret
func 38
itof r282.x___, r276.x000
mov r279.x___, r282.x000
mov r279._y__, l0.0x00
mov r283, r279.xy00
mov r278, r283
mov r269.x___, l0.x000
ftoi r284, r278
mov r270, r284
call 37
ret
end
A pure stream Brook+ kernel doesn't have any addressing. But this is constrained by the hardware's texture dimension limits, so the Brook+ kernel would result in instances of the kernel being called for each whole/part of 8192 addresses in the domain (double that on D3D11 hardware, in theory). Which would slow it down a fair bit. The addressing of the Brook+ compute shader uses vAbsTidFlat0, which is a vertex attribute for the "absolute thread ID". The OpenCL code has access to such an attribute, too, vAbsTidFlat - but never uses it (despite the use of the get_global_id function). Instead the address is computed from vThreadGrpIdFlat and vTidInGrp. Technically, the Brook+ isn't functionally identical. The OpenCL version has to generically address two blocks of read/write memory (input and output, which could overlap, or even be identical pointers), whereas the Brook+ version is working with two explicitly distinct blocks of memory, one bound for gather and the second for scatter. A truly generic version is something like: Code:
Attribute[GroupSize(256,1,1)]
kernel void generic(out float memory[], uint input, uint output)
{
int4 i = instance();
memory[output + i.x] = memory[input + i.x];
}
It would be interesting to compare the OpenCL with Direct Compute, since the abstraction is pretty much the same as far as I can tell. Jawed
__________________
Can it play WoW? |
|
|
|
|
|
|
#19 | |
|
Senior Member
|
Quote:
Code:
add r1033.x___, r1032.xxxx, r1032.xxxx Code:
03 TEX: ADDR(82) CNT(1)
25 VFETCH R0.x___, R1.z, fc160 MEGA(4)
FETCH_TYPE(NO_INDEX_OFFSET)
04 ALU: ADDR(68) CNT(1)
26 x: MOV*2 R0.x, R0.x
05 MEM_RAT_CACHELESS_STORE_RAW: RAT(0)[R1], R0, MARK VPM
__________________
I speak only for myself. |
|
|
|
|
|
|
#20 | |
|
Junior Member
Join Date: Jul 2008
Posts: 36
|
Quote:
Code:
__kernel void templateKernel(__global float * output,
__global float * input,
const unsigned int multiplier)
{
int i = get_global_id(0);
output[i] = input[i] * 2;
}
|
|
|
|
|
|
|
#21 | |
|
Senior Member
|
Quote:
Yes, the ISA was from 5870. Note that your address calculation only works if your domain is 1-dimensional since get_global_id(0) only returns the x-coordinate.
__________________
I speak only for myself. |
|
|
|
|
|
|
#22 |
|
Regular
|
These two threads make it pretty clear:
http://forums.amd.com/devforum/messa...hreadid=121273 http://forums.amd.com/devforum/messa...hreadid=121298 I now realise the basis of my misunderstanding of the absolute addressing mode for LDS in R700 - a mode that I thought meant that writes aren't owner-private. So, global memory emulation is being used, and presumably has terrible performance. RV670 is still locked-out, I guess, because it doesn't have the compute shader mode. Jawed
__________________
Can it play WoW? |
|
|
|
|
|
#23 |
|
Member
Join Date: Sep 2003
Location: Zwijndrecht/Rotterdam, Netherlands and Phobos
Posts: 847
|
Hmm, my screen gets a distortion (out of focus) with these when selecting 60 hz, with 59 everything is fine?
HD5870 CFX Windows7 64 bits
__________________
Schieten op de beesten. |
|
|
|
|
|
#24 |
|
Member
Join Date: Jan 2009
Posts: 229
|
Has anyone done any tests with regards to local memory bandwidth on RV8xx? I don't have one at hand but am quite interested in the RV8xx implementation. The implementation on RV7xx is restrictive in terms of programming and unimpressive in terms of bandwidth.
|
|
|
|
|
|
#25 |
|
Senior Member
|
So, R700 is not exactly OCL 1.0 compliant, hardware wise at least?!
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
![]() |
| Thread Tools | |
| Display Modes | |
|
|