Rv770 LDS in CAL : Peak bandwidth?


codedivine
09-Mar-2009, 10:17
I was wondering what the peak bandwidth of a single LDS is?

As I understand, the lds_read_vec still first requires a texture address calculation? Is this correct?

If so, then the LDS on a SIMD still cannot provide more bandwidth than 64 bytes/clock * 750e6 clocks/second = 48 GB/s per SIMD (16 fp32/clock/SIMD), or 480 GB/s total on a Radeon 4870?

edit : To be more clear, if the LDS access first requires a texture address calculation then you are screwed by the TU's peak and the peak b/w is pretty much the same as b/w from L1 cache in such a case. In an ideal shared memory implementation, the access to LDS should be as fast as accessing registers. The b/w from register bank is about 10-12 times more than b/w to L1.
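
The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch: the inputs (16 fp32 values per clock per SIMD, a 750 MHz clock, and 10 SIMDs, the last implied by the 48 vs. 480 figures) are the numbers assumed in the post above, not measurements.

```python
# Assumed inputs, taken from the post above (not measured values).
BYTES_PER_FP32 = 4
FP32_PER_CLOCK_PER_SIMD = 16
CORE_CLOCK_HZ = 750_000_000
NUM_SIMDS = 10  # implied by 480 GB/s total / 48 GB/s per SIMD


def peak_bytes_per_second(fp32_per_clock, clock_hz, simds=1):
    """Peak bandwidth in bytes/second for a given fetch rate and clock."""
    return fp32_per_clock * BYTES_PER_FP32 * clock_hz * simds


per_simd = peak_bytes_per_second(FP32_PER_CLOCK_PER_SIMD, CORE_CLOCK_HZ)
total = peak_bytes_per_second(FP32_PER_CLOCK_PER_SIMD, CORE_CLOCK_HZ, NUM_SIMDS)

print(f"per SIMD: {per_simd / 1e9:.0f} GB/s")
print(f"total:    {total / 1e9:.0f} GB/s")
```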

Jawed
09-Mar-2009, 11:33
The bandwidth of lds_read_vec or lds_write_vec is 1 scalar or a vector up to size 4. It's like "mov" only specific to LDS.

LDS has nothing to do with TUs as far as I can tell. But hey, I don't write IL, just observing from the sidelines.

Looking at the IL Specification document, page 6-30 (76), I don't see anything related specifically to "texturing". The address of the foreign thread has to be calculated, but it's not "texture address calculation" that you're doing at this point.

In bandwidth terms it's nothing special, yes, definitely. It's not a bandwidth saving technique, but a way to share computation results across multiple threads (obviously that can indirectly save bandwidth).

And unlike CUDA, an operand cannot come directly from LDS. It has to be brought into a thread's own register space first.

Jawed

codedivine
09-Mar-2009, 11:52
See, the reason I think it's going through the TU is from looking at the disassembly of the CAL IL.
For 4 lds_read_vec in my code I got the following disassembly:

10 TEX: ADDR(118) CNT(4)
      22  LOCAL_DS_READ R1, R1.xy WATERFALL
      23  LOCAL_DS_READ R2, R2.xy WATERFALL
      24  LOCAL_DS_READ R3, R3.zy WATERFALL
      25  LOCAL_DS_READ R4, R4.xy WATERFALL

The instructions are being processed through TEX? The "lds_read_vec dst, src.xy" is basically being handed over to the TEX unit. This means that if the TU had been the bottleneck in your computation, it still remains the bottleneck if using the LDS since your sample instructions are now replaced with lds_read_vec but still handled by TU.

codedivine
09-Mar-2009, 12:00
It isn't the b/w per se which I am worried about... it's that the ratio of ALU:TEX is so ridiculously high (coming from a classical CPU HPC perspective) that in many computations the TEX is becoming a bottleneck in my codes.

Then I guess the only way in CAL to reduce TEX instructions is to bring as much stuff as possible into registers by taking advantage of data locality.
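
One rough way to put a number on the ALU:TEX balance of a compiled kernel is to sum the CNT fields of the control-flow clause headers in its disassembly. The Python below is a minimal sketch, assuming headers of the form `NN ALU: ADDR(..) CNT(..)` as printed in the listings in this thread; `clause_counts` is a made-up helper name, and the sample headers are only there to exercise it. The CNT values are summed exactly as printed, and for ALU clauses they need not equal the number of individual operations listed under the header.

```python
import re
from collections import Counter

# Matches clause headers such as "00 ALU: ADDR(32) CNT(19) KCACHE0(CB0:0-15)".
CLAUSE_RE = re.compile(r"^\s*\d+\s+(ALU|TEX)\w*:\s+ADDR\(\d+\)\s+CNT\((\d+)\)")


def clause_counts(listing):
    """Sum the CNT field of every ALU and TEX clause header in a listing."""
    counts = Counter()
    for line in listing.splitlines():
        match = CLAUSE_RE.match(line)
        if match:
            counts[match.group(1)] += int(match.group(2))
    return counts


# Sample clause headers in the same format as the listings in this thread.
SAMPLE = """\
00 ALU: ADDR(32) CNT(19) KCACHE0(CB0:0-15)
01 TEX: ADDR(64) CNT(1)
02 ALU: ADDR(51) CNT(4)
03 TEX: ADDR(66) CNT(1)
"""

counts = clause_counts(SAMPLE)
print("ALU slots:", counts["ALU"], "TEX slots:", counts["TEX"])
```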

Jawed
09-Mar-2009, 13:33
Hmm, well that's the first assembled LDS code I've seen. I haven't got an R700 ISA, so I've resorted to taking the lds_read_CS.cpp sample file and, ahem, hand compiling/running it to generate the source code it generates :lol:

So this is the IL:

il_cs_2_0
dcl_cb cb0[1]
dcl_num_thread_per_group 64
dcl_lds_size_per_thread 4
dcl_lds_sharing_mode _wavefrontRel
dcl_literal l0, 64, 64, 64, 4
iadd r0, vTid0.x0, cb0[0].x0
mov r2, r2.0000
iadd r0.x, r0.x, cb0[0].y
iadd r0.y, r0.y, l0.w
and r0.x, r0.x, l0.x
lds_read_vec r1, r0.xy
fence_lds_memory
add r2, r2, r1
lds_write_vec mem, r2
end ;

It reads a single LDS and writes a single LDS. When I compile it using Stream KernelAnalyzer, I get:

00 ALU: ADDR(32) CNT(19) KCACHE0(CB0:0-15)
      0  z: AND_INT ____, R0.x, (0x0000003F, 8.828180325e-44f).x
         w: LSHR ____, R0.x, (0x00000006, 8.407790786e-45f).y
      1  y: ADD_INT ____, PV0.z, KC0[0].x
         z: AND_INT ____, PV0.w, (0x0000000F, 2.101947696e-44f).x
      2  w: ADD_INT ____, KC0[0].y, PV1.y
         t: MULLO_INT T0.x, PV1.z, (0x00000040, 8.968310172e-44f).x
      3  z: AND_INT ____, PV2.w, (0x00000040, 8.968310172e-44f).x
      4  x: LSHR ____, PV3.z, (0x00000002, 2.802596929e-45f).x
         z: AND_INT R0.z, PV3.z, (0x00000003, 4.203895393e-45f).y
      5  t: MULLO_INT ____, PV4.x, (0x00000004, 5.605193857e-45f).x
      6  w: ADD_INT ____, PS5, (0x00000004, 5.605193857e-45f).x
      7  y: ADD_INT R0.y, PV6.w, T0.x
01 TEX: ADDR(64) CNT(1)
      8  LOCAL_DS_READ R0, R0.zy WATERFALL
02 ALU: ADDR(51) CNT(4)
      9  x: ADD R0.x, R0.x, 0.0f
         y: ADD R0.y, R0.y, 0.0f
         z: ADD R0.z, R0.z, 0.0f
         w: ADD R0.w, R0.w, 0.0f
03 TEX: ADDR(66) CNT(1)
      10  LOCAL_DS_WRITE (0) R0, STRIDE(4) SIMD_REL
END_OF_PROGRAM

Notice the final CF instruction? It's a TEX instruction. But the TUs can't "write".

My interpretation here is that "TEX" is the CF instruction being given, but it's actually just signifying that non-ALU operations are being performed. Bearing in mind that GDS is on the outside of the clusters, next to L1 cache, it seems to me that what we're seeing is the GPU's register access hardware being instructed to deal with data that isn't to/from the ALUs.

It's almost like an Export CF instruction in a pixel shader.
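
Going back to ALU clause 00 in the listing: the integer arithmetic that produces the LOCAL_DS_READ address can be traced by hand. The Python below is a sketch that transcribes instruction groups 0 to 7 one operation at a time. It rests on a few assumptions: that R0.x holds the thread ID on entry, that KC0[0].x and KC0[0].y are the two constant-buffer values, and that PVn/PSn mean "result of the previous vector/scalar slot in group n". The function name `lds_read_address` is made up for illustration.

```python
MASK32 = 0xFFFFFFFF  # keep every result in 32 bits, like the hardware registers


def lds_read_address(r0_x, kc0_x, kc0_y):
    """Transcribe ALU groups 0-7; returns (R0.z, R0.y) as fed to LOCAL_DS_READ."""
    pv0_z = r0_x & 0x3F                  # 0 z: AND_INT
    pv0_w = (r0_x & MASK32) >> 6         # 0 w: LSHR
    pv1_y = (pv0_z + kc0_x) & MASK32     # 1 y: ADD_INT
    pv1_z = pv0_w & 0xF                  # 1 z: AND_INT
    pv2_w = (kc0_y + pv1_y) & MASK32     # 2 w: ADD_INT
    t0_x = (pv1_z * 0x40) & MASK32       # 2 t: MULLO_INT
    pv3_z = pv2_w & 0x40                 # 3 z: AND_INT
    pv4_x = pv3_z >> 2                   # 4 x: LSHR
    r0_z = pv3_z & 0x3                   # 4 z: AND_INT
    ps5 = (pv4_x * 0x4) & MASK32         # 5 t: MULLO_INT
    pv6_w = (ps5 + 0x4) & MASK32         # 6 w: ADD_INT
    r0_y = (pv6_w + t0_x) & MASK32       # 7 y: ADD_INT
    return r0_z, r0_y


# Print the address pair for a few thread IDs with both constants set to 0.
for tid in (0, 1, 63, 64, 128):
    print(tid, lds_read_address(tid, 0, 0))
```

As transcribed, R0.z always comes out as 0, because the value masked with 0x40 in group 3 is masked again with 0x3 in group 4.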

Strictly speaking the throughput of the TUs for fp32 data is 1/4 the LDS throughput, since the TUs are single cycle for 8-bit results.

It's worth noting that under CUDA shared memory suffers a significantly longer latency than registers, so while the bandwidth is reasonable the trebled latency (I think it's treble) has a serious impact on the scheduler's ability to hide any latency incurred in off-die memory reads/writes.

Jawed

Jawed
09-Mar-2009, 13:36
codedivine wrote: "Then I guess the only way in CAL to reduce TEX instructions is to bring as much stuff as possible into registers by taking advantage of data locality."
I've just noticed that ACML for GPUs has arrived:

http://developer.amd.com/gpu/acmlgpu/Pages/default.aspx

I don't know whether you can get in there to see if the matmult code uses CS and if so, how they've optimised memory/LDS usage.

I can't install it because it's 64-bit only :cry:

Jawed

codedivine
09-Mar-2009, 13:50
ACML-GPU is binary .. no source code (except examples of usage).
I was not able to run it because it requires an older version of GCC.
One interesting thing in the ACML docs is that it refers to Stream SDK 1.4 as an optional requirement so I guess 1.4 is coming soon.

About the TEX instruction being kind of a placeholder and not a real TEX instruction... well, that kind of makes sense. I will try and do some experiments later and maybe also post some numbers from the ldsread sample here to try and make sense of what's going on.

It would be nice though if AMD releases some more information about LDS.

rpg.314
09-Mar-2009, 15:47
AMD Core Math Library for Graphic Processors (ACML-GPU) provides an ATI Stream-accelerated version of ACML. ACML-GPU accelerates certain routines in ACML, such as SGEMM and DGEMM, by off-loading the computation to the compatible GPUs in the system. The library dynamically decides, based on the parameters passed to the routines, whether to run the computation on the CPU or GPU, depending on which processor will yield the best performance.

ACML-GPU automatically scales its computation across multiple GPUs, if available and can take advantage of the double precision floating point hardware in the GPU on products that contain hardware DPFP support.



Automatic multi GPU scaling. That's nice. :) Something for nvidia to chase in gpgpu space, for the first time.

CarstenS
09-Mar-2009, 21:24
Really? Multi-GPU acceleration doesn't work with Nvidia GPUs? It looked quite natural when I used BarsWF on a 9800 GX2 and GTX 295 - just started the program and I was good to go.

entity279
09-Mar-2009, 21:34
CarstenS wrote: "Really? Multi-GPU acceleration doesn't work with Nvidia GPUs? It looked quite natural when I used BarsWF on a 9800 GX2 and GTX 295 - just started the program and I was good to go."

It doesn't work on CUBLAS AFAIK

rpg.314
10-Mar-2009, 04:38
No it doesn't. Which makes the extra time taken by AMD time well spent. :)

DegustatoR
17-Mar-2009, 01:15
rpg.314 wrote: "Automatic multi GPU scaling. That's nice. :) Something for nvidia to chase in gpgpu space, for the first time."
Note that AMD needs that more than NVIDIA due to the absence of top-end GPUs in their line-up.

Jawed
17-Mar-2009, 04:10
Eh? In both single precision and double precision a single RV770 is faster than a single GT200.

Jawed

DegustatoR
17-Mar-2009, 04:16
Theoretically yes. I've yet to see any real test where RV770 would be faster in GP calculations than GT200.

(Well, I've seen one -- the one about password cracking, but in all other applications GT200 seems to be faster than RV770; even G92 turns out to be faster in many benchmarks.)

Jawed
17-Mar-2009, 04:45
As far as GPGPU apps go, not benchmarks, feel free to provide a list. Any app that's using shared memory on GT200 for inter-thread sharing, but not using LDS on RV770 likewise, won't make for compelling support for your point of view.

Jawed

pcchen
17-Mar-2009, 05:36
But until recently you couldn't use shared memory in Brook+. It was much easier to use shared memory in CUDA than in IL, and one should take this into account too.

Jawed
17-Mar-2009, 05:45
:lol: Until recently AMD didn't even have shared memory. I dare say this is the big mistake made with R600 in terms of GPGPU.

Well, there was the memory read/write cache, but that appears to provide no acceleration for inter-thread sharing.

Jawed

pcchen
17-Mar-2009, 12:51
Oh, also it seems to me that R600's registers can be indexed, but G80/G200 can't. So if you want to use a fast indexed array (I use one for my n-queen solver to simulate a stack) you have to use shared memory in G80/G200. Since now Brook+ seems to support VS2008 properly, I'll try to port my n-queen solver to Brook+ sometime later. It'd be interesting to see the performance between them :)
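
For readers unfamiliar with the idea of simulating a stack with an indexed array, here is a minimal Python sketch of a non-recursive n-queens counter. It is not the solver mentioned above, just one common way to do it: fixed-size arrays indexed by the current row take the place of the call stack, which is the role indexed registers or shared memory would play on a GPU. The function name `count_n_queens` and the bitmask layout are choices made for this example only.

```python
def count_n_queens(n):
    """Count n-queens solutions for n >= 1 without recursion."""
    full = (1 << n) - 1
    # One slot per board row; 'row' acts as the stack pointer.
    cols = [0] * n     # columns already occupied when entering each row
    diag_l = [0] * n   # squares attacked along one diagonal direction
    diag_r = [0] * n   # squares attacked along the other diagonal direction
    avail = [0] * n    # columns still untried in each row
    avail[0] = full
    row = 0
    solutions = 0
    while row >= 0:
        if avail[row] == 0:
            row -= 1                      # pop: nothing left to try in this row
            continue
        bit = avail[row] & -avail[row]    # lowest untried free column
        avail[row] ^= bit                 # mark it as tried
        if row == n - 1:
            solutions += 1                # a queen was placed in the last row
            continue
        # push: compute the attack masks for the next row
        c = cols[row] | bit
        l = ((diag_l[row] | bit) << 1) & full
        r = (diag_r[row] | bit) >> 1
        row += 1
        cols[row], diag_l[row], diag_r[row] = c, l, r
        avail[row] = full & ~(c | l | r)
    return solutions


print(count_n_queens(8))
```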

Jawed
17-Mar-2009, 13:20
If you're really adventurous you can also try some recursion while you're at it. Though not in Brook+. 32-deep call stack in IL, though.

Jawed