Welcome, Unregistered.

If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

Reply
Old 13-Oct-2009, 19:24   #1
Dave Baumann
Gamerscore Wh...
 
Join Date: Jan 2002
Posts: 13,304
AMD AMD GPU OpenCL ATI Stream SDK

First beta release of ATI Stream SDK with OpenCL™ GPU support is now posted

Quote:
The ATI Stream Team is proud to make available the fourth beta release of ATI Stream SDK v2.0 that provides the first complete OpenCL™ development platform. This release is certified fully compliant with OpenCL 1.0 by the Khronos Group and is supported on a wide range of AMD GPUs (see http://developer.amd.com/streambeta for the full support list) as well as any x86 multi-core CPU supporting SSE3. AMD offers the market both high-performance CPU and GPU technology, and as such we are delivering on this unique ability to provide an OpenCL platform that enables developers to create applications that run the way they were meant to be run, on all the available processors in the system! The beta is available for immediate download as part of our ATI Stream SDK beta program and we encourage you to take a look.
Supported GPU's in the initial release:

ATI Radeon™ HD 5870
ATI Radeon™ HD 5850
ATI Radeon™ HD 5770
ATI Radeon™ HD 5750
ATI Radeon™ HD 4890
ATI Radeon™ HD 4870 X2
ATI Radeon™ HD 4870
ATI Radeon™ HD 4850 X2
ATI Radeon™ HD 4850
ATI Radeon™ HD 4830
ATI Radeon™ HD 4770
ATI Radeon™ HD 4670
ATI Radeon™ HD 4650
ATI Radeon™ HD 4550
ATI Radeon™ HD 4350

ATI FirePro™ V8750
ATI FirePro™ V8700
ATI FirePro™ V7750
ATI FirePro™ V5700
ATI FirePro™ V3750

AMD FireStream™ 9270
AMD FireStream™ 9250

ATI Mobility Radeon™ HD 4870
ATI Mobility Radeon™ HD 4860
ATI Mobility Radeon™ HD 4850 X2
ATI Mobility Radeon™ HD 4850
ATI Mobility Radeon™ HD 4830
ATI Mobility Radeon™ HD 4670
ATI Mobility Radeon™ HD 4650
ATI Mobility Radeon™ HD 4500 Series
ATI Mobility Radeon™ HD 4300 Series

ATI Mobility FirePro™ M7740

ATI Radeon™ Embedded E4690 Discrete GPU

Update 21-Dec-2009

Quote:
Originally Posted by Dave Baumann View Post
FYI - The ATI Stream SDK 2.0 is now out of Beta and into a production release:

http://developer.amd.com/stream

What’s New in v2.0
  • First production release of ATI Stream SDK with OpenCL™ 1.0 support.
  • New: Support for OpenCL™ ICD (Installable Client Driver).
  • New: Support for atomic functions for 32-bit integers.
  • New: Microsoft® Visual Studio® 2008-integrated ATI Stream Profiler performance analysis tool.
  • Preview: Support for OpenCL™ / OpenGL® interoperability.
  • Preview: Support for OpenCL™ / Microsoft® DirectX® 10 interoperability.
  • Preview: Support for double-precision floating point basic arithmetic in OpenCL™ C kernels.
  • Updated OpenCL™ runtime to conditionally load ATI CAL runtime libraries to allow execution on compatible CPUs without ATI Catalyst™ installed.
  • Updated OpenCL™ runtime to allow simultaneous use of OpenCL™ and ATI CAL APIs in a single user application.
  • Updated cl.hpp from the Khronos OpenCL working group release.
  • Various OpenCL™ compiler and runtime fixes and enhancements (see developer release notes for more details).
__________________
Radeon is Gaming
Tweet Tweet!
Dave Baumann is offline   Reply With Quote
Old 13-Oct-2009, 21:04   #2
Broken Hope
Member
 
Join Date: Jul 2004
Location: England
Posts: 476
Default

XP driver is corrupt. Can somewhat repair it with Winrar though but don't feel comfortable installing it.
Broken Hope is offline   Reply With Quote
Old 13-Oct-2009, 21:05   #3
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,467
Default

Great! Now I only have to port my NLM denoise program to OpenCL then...
pcchen is offline   Reply With Quote
Old 13-Oct-2009, 23:02   #4
Arnold Beckenbauer
Senior Member
 
Join Date: Oct 2006
Location: Germany
Posts: 1,004
Default

Mandelbrot and another sample asked me for Glut32 and Glew32, but they aren't part of Beta 4 (Win7 x32).
But they are part of the Beta 3, so you can copy and paste them in C:\Program Files\ATI Stream\bin\x86.
__________________
Hail Brothers and Sisters! Coranon Silaria, Ozoo Mahoke
Eta Kooram Nah Smech!

Find Chuck Norris.
Arnold Beckenbauer is offline   Reply With Quote
Old 14-Oct-2009, 00:07   #5
CNCAddict
Member
 
Join Date: Aug 2005
Posts: 283
Default

They put them in the ATI Stream -> bin folder. Just copy the dlls over to the sample folder and everything works fine.

It would be nice if the Stream SDK was included with some sort of software environment IDE, is that something planned for the future?
CNCAddict is offline   Reply With Quote
Old 14-Oct-2009, 01:08   #6
willardjuice
super willyjuice
 
Join Date: May 2005
Location: Astoria, NY
Posts: 1,109
Default

Quote:
Originally Posted by CNCAddict View Post
They put them in the ATI Stream -> bin folder. Just copy the dlls over to the sample folder and everything works fine.

It would be nice if the Stream SDK was included with some sort of software environment IDE, is that something planned for the future?
For Windows just use Visual Studio.
willardjuice is offline   Reply With Quote
Old 14-Oct-2009, 01:31   #7
CNCAddict
Member
 
Join Date: Aug 2005
Posts: 283
Default

Does this mean we can do some neato bullet physics now??

http://www.bulletphysics.com/Bullet/...hp?f=18&t=4067
CNCAddict is offline   Reply With Quote
Old 16-Oct-2009, 22:49   #8
CNCAddict
Member
 
Join Date: Aug 2005
Posts: 283
Default

FYI many of the OpenCL demos from nVidia's SDK actually work on the Radeon!! Try it yourself.

http://developer.nvidia.com/object/get-opencl.html
CNCAddict is offline   Reply With Quote
Old 23-Oct-2009, 15:56   #9
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,948
Send a message via Skype™ to Jawed
Default IL for OpenCL kernels

http://forums.amd.com/devforum/messa...hreadid=120683

That's interesting. From the IL it's but a short step to assembly code, using SKA. It's this latter step where optimisation occurs, generally, so no need to worry about the IL being a mess.

Oh, and that looks like some variant of llvm is being used, does it not?

Jawed
__________________
Can it play WoW?
Jawed is offline   Reply With Quote
Old 23-Oct-2009, 16:13   #10
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,218
Send a message via Skype™ to rpg.314
Default

Quote:
The bitcode that you are passing into llc is not the same bitcode that the runtime passes to llc.
He hasn;t gotten llvm's optimizers running so far. That's why IL is not looking good.

LLVM is so cool and so meant for something like this that you will be crazy to not use it.
rpg.314 is offline   Reply With Quote
Old 23-Oct-2009, 16:14   #11
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,218
Send a message via Skype™ to rpg.314
Default

BTW, does this release have some kind of timed lockout like the previous version? Or will I have to keep updating this thing?
rpg.314 is offline   Reply With Quote
Old 23-Oct-2009, 16:20   #12
Karoshi
Member
 
Join Date: Aug 2005
Location: Mars
Posts: 181
Default

Quote:
Originally Posted by Jawed View Post
http://forums.amd.com/devforum/messa...hreadid=120683

That's interesting. From the IL it's but a short step to assembly code, using SKA. It's this latter step where optimisation occurs, generally, so no need to worry about the IL being a mess.

Oh, and that looks like some variant of llvm is being used, does it not?

Jawed
I've been wondering if AMD was considering releasing some of it's llvm code. Apple obviously uses llvm on osx. It seems AMD uses llvm for x86, and possibly gpu.
It'd be nice if they released their opencl frontend back to the community. And possibly their gpu backend, to help ppl writing r7/r8 drivers. I mean, they are giving them the docs, why not help them a bit more by providing them with the backend to generate ps/vs/cs. It makes sense to me.
But maybe AMD got some code from apple and are not free to release it.
Anybody has heard rumours on llvm code release from amd?
Karoshi is offline   Reply With Quote
Old 23-Oct-2009, 16:42   #13
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,218
Send a message via Skype™ to rpg.314
Default

Nah, no code will be coming out of amd. That's what they said when they announced their opensource driver strategy. Considering that their driver team is smaller than nv, I guessestimate that their jit compilers are prolly developed by a 3rd party.
rpg.314 is offline   Reply With Quote
Old 23-Oct-2009, 16:57   #14
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,948
Send a message via Skype™ to Jawed
Default

Quote:
Originally Posted by rpg.314 View Post
He hasn;t gotten llvm's optimizers running so far. That's why IL is not looking good.
The IL produced by Brook+ often looks horrible. Usually the assembly is fine. Though there are odd problems (superfluous MOVs, VLIW packing opportunities missed, register components going unused in vec4 registers...).

I doubt there's much, if any, value in using IL to judge optimality.

Quote:
LLVM is so cool and so meant for something like this that you will be crazy to not use it.
There's still the proviso of irreducible control flow hanging over the heads of the compilers for the graphics chips.

Jawed
__________________
Can it play WoW?
Jawed is offline   Reply With Quote
Old 23-Oct-2009, 17:01   #15
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,948
Send a message via Skype™ to Jawed
Default

Quote:
Originally Posted by rpg.314 View Post
BTW, does this release have some kind of timed lockout like the previous version? Or will I have to keep updating this thing?
Yeah, end of February next year.

Jawed
__________________
Can it play WoW?
Jawed is offline   Reply With Quote
Old 23-Oct-2009, 17:58   #16
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,218
Send a message via Skype™ to rpg.314
Default

Quote:
Originally Posted by Jawed View Post
I doubt there's much, if any, value in using IL to judge optimality.
I think you have a point here, converting IL to LLVM IR, and then running a mem2reg pass should produce very good IR, and then your usual llvm transformations. IL, I think here serves only as a glorified bytecode.

Quote:
There's still the proviso of irreducible control flow hanging over the heads of the compilers for the graphics chips.
I don't think you can help it. The gpu code usually is reducible, and handling irreducible control flow is any way optional for ocl drivers. If you can write your code with just while and for loops, you should not get irreducible control flows, I think.
rpg.314 is offline   Reply With Quote
Old 24-Oct-2009, 04:45   #17
Forrest
Junior Member
 
Join Date: Jul 2008
Posts: 36
Default

Quote:
Originally Posted by Jawed View Post
From the IL it's but a short step to assembly code, using SKA.
Current SKA doesn't support OpenCL.

Here is the IL generated from this kernel :

Code:
__kernel void templateKernel(__global  float * output,
                             __global  float * input,
                             const unsigned int multiplier)
{
	int i = get_global_id(0);

	output[i] = input[i];
	
}
The generated IL :

Code:
il_cs_2_0
dcl_raw_uav_id(0)
dcl_cb cb0[9] ; Constant buffer that holds ABI data
dcl_literal l0, 4, 1, 2, 3
dcl_literal l1, 0x00FFFFFF, -1, -2, -3
dcl_literal l2, 0x0000FFFF, 0xFFFFFFFE,0x000000FF,0xFFFFFFFC
dcl_literal l3, 24, 16, 8, 0xFFFFFFFF
dcl_literal l4, 0xFFFFFF00, 0xFFFF0000, 0xFF00FFFF, 0xFFFF00FF
dcl_literal l5, 0, 4, 8, 12
mov r769, cb0[8].x
call 1174;$
endmain
func 1174 ; __OpenCL_templateKernel_kernel
mov r770, l1.0
dcl_literal l6, 0x00000000, 0x00000000, 0x00000000, 0x00000000; int: 0
dcl_literal l7, 0x00000002, 0x00000002, 0x00000002, 0x00000002; int: 2
dcl_num_thread_per_group 256, 1, 1              
imul r0.w, cb0[2].x, cb0[2].y
mov r0.z, vThreadGrpIdFlat.x
mov r1022.xyz0, vTidInGrp.xyz
umod r1023.x, r0.z, cb0[2].x
udiv r1023.y, r0.z, cb0[2].x
umod r1023.y, r1023.y, cb0[2].y
udiv r1023.z, r0.z, r0.w
imad r1021.xyz0, r1023.xyz0, cb0[1].xyz0, r1022.xyz0
iadd r1021.xyz0, r1021.xyz0, cb0[6].xyz0
iadd r1023.xyz0, r1023.xyz0, cb0[7].xyz0
mov r1023.w, r0.z
imad r772.x, r1023.w, cb0[4].y, cb0[4].x
ishl r1023.w, r1023.w, l0.z
imad r771.x, vAbsTidFlat.x, cb0[3].y, cb0[3].x
dcl_cb cb1[3]
; Kernel arg setup: output
mov r1, cb1[0]
; Kernel arg setup: input
mov r2, cb1[1]
; Kernel arg setup: multiplier
mov r113, cb1[2]
call 1176 ; templateKernel
ret
endfunc ; __OpenCL_templateKernel_kernel
;ARGSTART:__OpenCL_templateKernel_kernel
;version:1:2:35
;uniqueid:1174
;memory:private:0
;memory:local:0
;pointer:output:float:1:1:0:uav:0
;pointer:input:float:1:1:16:uav:0
;value:multiplier:i32:1:1:32
;function:1:1176
;intrinsic:0
;ARGEND:__OpenCL_templateKernel_kernel
func 1176 ; templateKernel
	mov r1026.x___, r113.xxxx
	mov r1025.x___, r2.xxxx
	mov r1024.x___, r1.xxxx
	mov r1027.x___, l6
	mov r1.x___, r1027.xxxx
	call 1027 ; get_global_id
	mov r1028.x___, r1.xxxx
	mov r1029.x___, l7
	ishl r1030.x___, r1028.xxxx, r1029.xxxx
	iadd r1031.x___, r1025.xxxx, r1030.xxxx
	mov r1.x___, r1031.xxxx
	call 1083 ; get32BitLoadUAV
	mov r1032.x___, r1.xxxx
	add r1033.x___, r1032.xxxx, r1032.xxxx
	iadd r1034.x___, r1024.xxxx, r1030.xxxx
	mov r2.x___, r1033.xxxx
	mov r1.x___, r1034.xxxx
	call 1078 ; get32BitStoreUAV
	ret
endfunc ; templateKernel
;ARGSTART:templateKernel
;uniqueid:1176
;memory:private:0
;memory:local:0
;function:3:1027:1078:1083
;intrinsic:0
;ARGEND:templateKernel
func 1027 ; get_global_id
iadd r1020, r1.xxxx, l1.0yzw
ieq r1020, r1020, l0.0000
cmov_logical r1.x, r1020.x, r1021.x, r1021.0
cmov_logical r1.x, r1020.y, r1021.y, r1.x
cmov_logical r1.x, r1020.z, r1021.z, r1.x
ret
endfunc ; get_global_id
func 1078 ; Store32BitsUAV
uav_raw_store_id(0) mem0.x___, r1.x, r2
ret
endfunc ; Store32BitsUAV
func 1083 ; Load32BitsUAV
uav_raw_load_id(0) r1.x___, r1.x
ret
endfunc ; Load32BitsUAV

end

Last edited by Forrest; 24-Oct-2009 at 04:47. Reason: clarity
Forrest is offline   Reply With Quote
Old 24-Oct-2009, 12:22   #18
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,948
Send a message via Skype™ to Jawed
Default

Quote:
Originally Posted by Forrest View Post
Current SKA doesn't support OpenCL.
Hopefully it will, soon.

Quote:
Here is the IL generated from this kernel :
Euh, that's a shedload of addressing computation. It results in 31 cycles of integer operations for addressing, which means it's amazingly slow (anything under 10 would be free on HD4870/5870 - so this is more than 3x slower).

The "logically equivalent" Brook+ ("compute shader", using gather and scatter rather than streams):

Code:
Attribute[GroupSize(256,1,1)]
kernel void templateKernel(out float output[], float input[])
{
 int4 i = instance();
 output[i.x] = input[i.x];
}
generates:

Code:
il_cs_2_0
; l0 = (0.0f, 0.0f, 0.0f, 0.0f, )
dcl_literal l0, 0x00000000, 0x00000000, 0x00000000, 0x00000000
; l1 = (1.401298464e-45f, 1.401298464e-45f, 1.401298464e-45f, 1.401298464e-45f, )
dcl_literal l1, 0x00000001, 0x00000001, 0x00000001, 0x00000001
; l2 = (-1.#QNANf, -1.#QNANf, -1.#QNANf, -1.#QNANf, )
dcl_literal l2, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF
; l3 = (1.#QNANf, 1.#QNANf, 1.#QNANf, 1.#QNANf, )
dcl_literal l3, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF
; l4 = (1.#INFf, 1.#INFf, 1.#INFf, 1.#INFf, )
dcl_literal l4, 0x7F800000, 0x7F800000, 0x7F800000, 0x7F800000
; l5 = (0.0f, 0.0f, 0.0f, 0.0f, )
dcl_literal l5, 0x80000000, 0x80000000, 0x80000000, 0x80000000
; l6 = (0.30103001f, 0.30103001f, 0.30103001f, 0.30103001f, )
dcl_literal l6, 0x3E9A209B, 0x3E9A209B, 0x3E9A209B, 0x3E9A209B
; l7 = (0.6931471825f, 0.6931471825f, 0.6931471825f, 0.6931471825f, )
dcl_literal l7, 0x3F317218, 0x3F317218, 0x3F317218, 0x3F317218
; l8 = (3.141592741f, 3.141592741f, 3.141592741f, 3.141592741f, )
dcl_literal l8, 0x40490FDB, 0x40490FDB, 0x40490FDB, 0x40490FDB
; l9 = (1.570796371f, 1.570796371f, 1.570796371f, 1.570796371f, )
dcl_literal l9, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB, 0x3FC90FDB
; l10 = (4.203895393e-45f, 4.203895393e-45f, 4.203895393e-45f, 4.203895393e-45f, )
dcl_literal l10, 0x00000003, 0x00000003, 0x00000003, 0x00000003
; l11 = (2.802596929e-45f, 2.802596929e-45f, 2.802596929e-45f, 2.802596929e-45f, )
dcl_literal l11, 0x00000002, 0x00000002, 0x00000002, 0x00000002
;global (g) declared, size = 4096
; l12 = (1.401298464e-45f, 1.401298464e-45f, 1.401298464e-45f, 1.401298464e-45f, )
dcl_literal l12, 0x00000001, 0x00000001, 0x00000001, 0x00000001
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_num_thread_per_group 256 
mov r275.x___, vTidInGrpFlat0
mov r276.x___, vAbsTidFlat0
mov r277.x___, vThreadGrpIdFlat0
call 38
call 0
endmain
func 0
ret
func 2
    ieq r0.x___, r17.x000, l0.x000
    if_logicalnz r0.x000
        sample_resource(0)_sampler(0) r19, r18.xy00
    endif
    mov r16.x___, r19.x000
    ret_dyn
ret
func 37
    mov r271, r270
    mov r272.x___, r271.x000
    umul r273.x___, l12.x000, r272.x000
    iadd r273.x___, r273.x000, l0.x000
    itof r280.x___, r271.x000
    mov r274.x___, r280.x000
    mov r274._y__, l0.0x00
    mov r17.x___, r269.x000
    mov r18.xy__, r274.xy00
    call 2
    mov r281.x___, r16.x000
    mov g[r273.x+0].x___, r281.x000
ret
func 38
    itof r282.x___, r276.x000
    mov r279.x___, r282.x000
    mov r279._y__, l0.0x00
    mov r283, r279.xy00
    mov r278, r283
    mov r269.x___, l0.x000
    ftoi r284, r278
    mov r270, r284
    call 37
ret
end
This Brook+ version only has 3 ALU cycles of addressing (though that's got junk code in it).

A pure stream Brook+ kernel doesn't have any addressing. But this is constrained by the hardware's texture dimension limits, so the Brook+ kernel would result in instances of the kernel being called for each whole/part of 8192 addresses in the domain (double that on D3D11 hardware, in theory). Which would slow it down a fair bit.

The addressing of the Brook+ compute shader uses vAbsTidFlat0, which is a vertex attribute for the "absolute thread ID". The OpenCL code has access to such an attribute, too, vAbsTidFlat - but never uses it (despite the use of the get_global_id function). Instead the address is computed from vThreadGrpIdFlat and vTidInGrp.

Technically, the Brook+ isn't functionally identical. The OpenCL version has to generically address two blocks of read/write memory (input and output, which could overlap, or even be identical pointers), whereas the Brook+ version is working with two explicitly distinct blocks of memory, one bound for gather and the second for scatter. A truly generic version is something like:

Code:
Attribute[GroupSize(256,1,1)]
kernel void generic(out float memory[], uint input, uint output)
{
 int4 i = instance();
 memory[output + i.x] = memory[input + i.x];
}
which results in 4 cycles of addressing.

It would be interesting to compare the OpenCL with Direct Compute, since the abstraction is pretty much the same as far as I can tell.

Jawed
__________________
Can it play WoW?
Jawed is offline   Reply With Quote
Old 25-Oct-2009, 01:04   #19
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,327
Send a message via ICQ to OpenGL guy
Default

Quote:
Originally Posted by Forrest View Post
Here is the IL generated from this kernel :

Code:
__kernel void templateKernel(__global  float * output,
                             __global  float * input,
                             const unsigned int multiplier)
{
	int i = get_global_id(0);

	output[i] = input[i];
	
}
Is this the actual OpenCL kernel used to generate the IL you posted? I ask because of the following line:
Code:
	add r1033.x___, r1032.xxxx, r1032.xxxx
This causes the output to be multiplied by 2, which is not what the OpenCL kernel is doing. This is also clear in the HW code:
Code:
03 TEX: ADDR(82) CNT(1) 
     25  VFETCH R0.x___, R1.z, fc160  MEGA(4) 
         FETCH_TYPE(NO_INDEX_OFFSET) 
04 ALU: ADDR(68) CNT(1) 
     26  x: MOV*2       R0.x,  R0.x      
05 MEM_RAT_CACHELESS_STORE_RAW: RAT(0)[R1], R0,  MARK  VPM
Note the "MOV*2" instruction.
__________________
I speak only for myself.
OpenGL guy is offline   Reply With Quote
Old 26-Oct-2009, 07:26   #20
Forrest
Junior Member
 
Join Date: Jul 2008
Posts: 36
Default

Quote:
Originally Posted by OpenGL guy View Post
Is this the actual OpenCL kernel used to generate the IL you posted?
My bad, this is the actual kernel :

Code:
__kernel void templateKernel(__global  float * output,
                             __global  float * input,
                             const unsigned int multiplier)
{
	int i = get_global_id(0);
	output[i] = input[i] * 2;
	
}
The ISA you posted is for 8xx right?
Forrest is offline   Reply With Quote
Old 27-Oct-2009, 21:56   #21
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,327
Send a message via ICQ to OpenGL guy
Default

Quote:
Originally Posted by Forrest View Post
My bad, this is the actual kernel :
Code:
__kernel void templateKernel(__global  float * output,
                             __global  float * input,
                             const unsigned int multiplier)
{
	int i = get_global_id(0);
	output[i] = input[i] * 2;
	
}
The ISA you posted is for 8xx right?
Thanks for posting the corrected kernel, I was getting a little worried

Yes, the ISA was from 5870.

Note that your address calculation only works if your domain is 1-dimensional since get_global_id(0) only returns the x-coordinate.
__________________
I speak only for myself.
OpenGL guy is offline   Reply With Quote
Old 01-Nov-2009, 14:03   #22
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,948
Send a message via Skype™ to Jawed
Default Local memory is a disaster zone on R700

These two threads make it pretty clear:

http://forums.amd.com/devforum/messa...hreadid=121273

http://forums.amd.com/devforum/messa...hreadid=121298

I now realise the basis of my misunderstanding of the absolute addressing mode for LDS in R700 - a mode that I thought meant that writes aren't owner-private.

So, global memory emulation is being used, and presumably has terrible performance. RV670 is still locked-out, I guess, because it doesn't have the compute shader mode.

Jawed
__________________
Can it play WoW?
Jawed is offline   Reply With Quote
Old 03-Nov-2009, 22:44   #23
Skinner
Member
 
Join Date: Sep 2003
Location: Zwijndrecht/Rotterdam, Netherlands and Phobos
Posts: 863
Default

Hmm, my screen gets a distortion (out of focus) with these when selecting 60 hz, with 59 everything is fine?
HD5870 CFX Windows7 64 bits
__________________
Schieten op de beesten.
Skinner is offline   Reply With Quote
Old 04-Nov-2009, 05:13   #24
codedivine
Member
 
Join Date: Jan 2009
Posts: 257
Default

Has anyone done any tests with regards to local memory bandwidth on RV8xx? I don't have one at hand but am quite interested in the RV8xx implementation. The implementation on RV7xx is restrictive in terms of programming and unimpressive in terms of bandwidth.
codedivine is offline   Reply With Quote
Old 04-Nov-2009, 09:11   #25
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,989
Send a message via Skype™ to fellix
Default

So, R700 is not exactly OCL 1.0 compliant, hardware wise at least?!
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
fellix is offline   Reply With Quote

Reply

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump


All times are GMT +1. The time now is 08:06.


Powered by vBulletin® Version 3.8.6
Copyright ©2000 - 2014, Jelsoft Enterprises Ltd.