R600 GPGPU


Rufus
15-May-2007, 17:49
So there's no doubt that R600 will be a GPGPU beast, assuming there are no hidden flaws waiting to be found. Ars Technica's launch writeup (http://arstechnica.com/news.ars/post/20070514-amd-launches-the-hd-2000-series.html) has a bit of new information that I've never heard before about ATI doing some sort of Just-In-Time compiler from C/C++ to x86/CTM (the article says CUDA...wtf?!). However, the article gets very convoluted and starts talking about various abstraction layers and how this is hurting performance or something.

Does anyone know of a writeup about this that makes more sense?

mhouston
15-May-2007, 18:45
I don't believe there is a writeup, but I'll give it a go.

"HAL" (Hardware Abstraction Layer) is the CTM that most people think about. This is the extremely low level access to the hardware including the assembly for the stream processors. You also have full control over memory layout and allocation. This is what many of the third party vendors use if they have their own nifty compiler technology.

"CAL" (Compute Abstraction Layer (?)) provides compilers, memory managers, and a huge stack of utilities to help. It's still not as high level as CUDA/Brook, but you do get a compiler that can take in ps3/4 code and can deal with HLSL using Microsoft's compilers. This is where the JIT'ing can occur and support for multi-core can be added. Since CTM is a data parallel abstract machine, it's "easy" to drive lots of cores, be it on a GPU or a CPU, as long as you can compile the kernels to the machine's ISA.
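To make the "data parallel abstract machine" point concrete, here is a minimal Python sketch. It is not CTM or CAL code; `saxpy_kernel` and `run_kernel` are hypothetical names invented for this illustration. The kernel sees only one element index, so the same kernel can be driven by any number of workers without changing the result.

```python
from concurrent.futures import ThreadPoolExecutor

def saxpy_kernel(i, a, x, y):
    """Hypothetical per-element kernel: its output depends only on index i."""
    return a * x[i] + y[i]

def run_kernel(kernel, n, workers, *args):
    """Apply kernel to indices 0..n-1, splitting the range across workers."""
    chunk = max(1, (n + workers - 1) // workers)  # elements per worker

    def run_chunk(start):
        stop = min(start + chunk, n)
        return [kernel(i, *args) for i in range(start, stop)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns chunk results in submission order
        parts = list(pool.map(run_chunk, range(0, n, chunk)))
    return [value for part in parts for value in part]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0]
result = run_kernel(saxpy_kernel, len(x), 2, 2.0, x, y)
```

Changing the `workers` argument changes only how the index range is split, which is the property that lets one abstraction target either GPU or CPU cores.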

On top of that, AMD says they will start to support ACML on GPUs soon, as well as a few other libraries I believe. AMD is also now directly supporting Brook development via the CTM backend.

The Brook CTM backend uses features from both HAL and CAL. We do all the layout and allocation currently ourselves, but use the CTM compilers to convert ps3/4 code into assembly. You can see a bunch of the helper library stuff that makes up part of CAL on the sourceforge site (http://sourceforge.net/projects/amdctm).

I should also add that I don't believe the GL or DirectX driver is built atop CTM. CTM is really designed as a GPGPU interface.

Tim Murray
15-May-2007, 20:34
"CAL" (Compute Abstraction Layer (?)) provides compilers, memory managers, and a huge stack of utilities to help. It's still not as high level as CUDA/Brook, but you do get a compiler that can take in ps3/4 code and can deal with HLSL using Microsoft's compilers. This is where the JIT'ing can occur and support for multi-core can be added. Since CTM is a data parallel abstract machine, it's "easy" to drive lots of cores, be it on a GPU or a CPU, as long as you can compile the kernels to the machine's ISA.
Is there any information about this online? I can't find any reference to the CAL besides one AMD presentation from a year ago, where they mention that CAL will sit on top of DirectX (which seems like nonsense).

AMD is also now directly supporting Brook development via the CTM backend.
Am I reading this correctly by believing that AMD is providing developers to work specifically on Brook's CTM backend?

I should also add that I don't believe the GL or DirectX driver is built atop CTM. CTM is really designed as a GPGPU interface.
I've heard something interesting in this regard, but I'll need some more time to see what the deal is with that before I say anything else.

mhouston
16-May-2007, 06:10
Is there any information about this online? I can't find any reference to the CAL besides one AMD presentation from a year ago, where they mention that CAL will sit on top of DirectX (which seems like nonsense).


There were slides on this during the launch, but I don't think anyone in the press has posted them. CAL is just an abstraction, so in theory you could run it on any driver base. In fact, AMD mentioned plans to support stream programming on multi-core CPUs through this mechanism (as well as others).


Am I reading this correctly by believing that AMD is providing developers to work specifically on Brook's CTM backend?


We are getting some time from AMD/ATI engineers on this. I don't know if anyone is assigned to only do this, but we are getting patch sets and some of the engineers now have direct access to the tree on Sourceforge.


I've heard something interesting in this regard, but I'll need some more time to see what the deal is with that before I say anything else.

I'm curious what you hear back. In theory, you could implement a bunch of a driver atop CTM, but for graphics stuff, you will want access to all the fixed function units and graphics specific hardware (blend units, texture filtering, etc).

RacingPHT
16-May-2007, 15:37
Is there going to be something like shared memory? I've seen a slide that said R600 supports inter-thread communication.

mhouston
17-May-2007, 05:41
There is a read/write cache that is 8K, but not a whole lot is known about it outside AMD, including how it will be exported. I'd like access through CTM. ;-)

RacingPHT
17-May-2007, 06:41
Thanks mhouston.
I'm also very interested in how your k-D Tree GPU Raytracing performs on R600. Hope there is some news.:smile:

mhouston
17-May-2007, 06:56
We haven't had enough time with it yet. Branch granularity is slightly worse on R6XX, 64 vs 48 pixels, and we were divergent bound before, but the bandwidths and flop rates are roughly double, depending on the use of the preadders on R580. The main issue at the moment is the ability of the compiler to schedule our shaders well and the general optimization level of the DX driver. Patching things up in CTM, like we did with R580, should help, but it's a new ISA to learn to get the same level of tuning.
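As a rough illustration of why coarser branch granularity hurts divergent code, here is a toy Python model. The batch sizes 48 and 64 come from the post above; the branch pattern and the function name `divergent_batches` are invented for the example and say nothing about real hardware scheduling. A batch counts as divergent when its elements do not all take the same branch.

```python
def divergent_batches(taken, batch_size):
    """Count batches whose elements do not all take the same branch."""
    count = 0
    for start in range(0, len(taken), batch_size):
        batch = taken[start:start + batch_size]
        if any(batch) and not all(batch):
            count += 1
    return count

# Contrived pattern: runs of 96 elements alternate between taking and
# not taking the branch. Runs of 96 line up with 48-wide batches but
# straddle 64-wide ones.
taken = [(i // 96) % 2 == 0 for i in range(384)]

divergent_48 = divergent_batches(taken, 48)
divergent_64 = divergent_batches(taken, 64)
```

The pattern is deliberately chosen so that the coherent runs fit 48-wide batches exactly; with other patterns the gap between the two sizes can be smaller or reversed.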

RacingPHT
17-May-2007, 09:51
Just an idea, not tested yet...

Is it possible to sort/compact the work units in a large group (say, 256-1024 elements), so that each batch in the group will not diverge? It may not be very practical on R5xx, but with shared memory, it seems to be possible.
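The idea can be sketched on the CPU as follows. This is only an illustration in Python, not GPU code: `regroup` and `count_divergent` are hypothetical helpers, the per-element branch key is made up, and the group size of 256 and batch size of 64 are just example values taken from the figures mentioned in this thread.

```python
def count_divergent(keys, batch_size):
    """Count batches that contain more than one distinct branch key."""
    return sum(
        1
        for start in range(0, len(keys), batch_size)
        if len(set(keys[start:start + batch_size])) > 1
    )

def regroup(keys, group_size):
    """Return the element order after a stable sort by key inside each group."""
    order = []
    for start in range(0, len(keys), group_size):
        indices = range(start, min(start + group_size, len(keys)))
        order.extend(sorted(indices, key=lambda i: keys[i]))
    return order

# Worst case for batching: neighbouring elements alternate branches.
keys = [i % 2 for i in range(1024)]

order = regroup(keys, 256)               # permutation of work units
sorted_keys = [keys[i] for i in order]

before = count_divergent(keys, 64)
after = count_divergent(sorted_keys, 64)
```

Here each group of 256 ends up as 128 elements of one key followed by 128 of the other, so every 64-wide batch is coherent. The sketch ignores the cost of the sort itself and the loss of memory coherence from permuting the work units.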

mhouston
17-May-2007, 16:06
For this to work, you would have to finish your kernel (shader) and then take a sort pass. On both R6XX and G8X, there is not enough shared memory to do the full sort. Although sorting is fast, it is not very efficient on GPUs. However, if your rate of divergence is slow, then it might make sense to do this every so often. But for things like raytracing, you will end up blowing your caches and destroying the memory coherence, and the divergence occurs so rapidly that I don't think sorting will help.