CUDA or CTM for Bioinformatics?

Discussion in 'GPGPU Technology & Programming' started by cbone, May 4, 2007.

  1. cbone

    Newcomer

    Joined:
    Aug 13, 2002
    Messages:
    14
    Likes Received:
    0
    Location:
    Columbia, MO
    Primarily for sequence alignments and processing search algorithms, which would be better?

    Which scales better with Multi-GPU setups, and do you need to explicitly code for Multi-GPUs?

    I should add that I would be doing the programming myself and I don't have a super strong background in programming, but I have a passing familiarity with C, Perl, Java, and Assembly.

    Any bit of help is appreciated!
     
  2. Nite_Hawk

    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    1,202
    Likes Received:
    35
    Location:
    Minneapolis, MN
    Well, it depends on a number of factors, and I know enough to tell you I don't know enough to give you a good answer. ;) Seriously though, CUDA/CTM aren't for the faint of heart. You'll have some hard development ahead, and to get good performance out of these architectures you will need to become fairly intimate with the hardware. CUDA should be easier to deal with than CTM (at least on the surface). I have done a bit of programming on the Cell, and from what I've heard it's easier than either. The question to ask yourself is whether or not it's worth the extra work you'd need to put in to make it work. Is a 5-10x speedup in run time worth the 5-10x increase in development time?

    As far as multi-GPU setups go, are you interested in SLI or something more exotic (like GPU clusters)? You may need to use MPI or some other scheme to span processing across multiple systems. I was actually thinking that writing GPU-based processing services as part of a grid or SOA would be an interesting project.
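
    As a rough illustration of the MPI angle (just a sketch; the alignment kernel is a placeholder of mine, not real code), the usual pattern is one MPI rank per GPU:

    Code:
      // One MPI rank per GPU: each rank binds to a device and processes its
      // slice of the sequences. launch_alignment_kernel() is hypothetical.
      #include <mpi.h>
      #include <cuda_runtime.h>

      int main(int argc, char** argv)
      {
          MPI_Init(&argc, &argv);

          int rank, nranks;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &nranks);

          int ngpus = 0;
          cudaGetDeviceCount(&ngpus);
          cudaSetDevice(rank % ngpus);   // bind this rank to one of the local GPUs

          // ... copy this rank's share of the sequences to its GPU, run the
          //     alignment kernel, then gather the results over MPI ...
          // launch_alignment_kernel(rank, nranks);

          MPI_Finalize();
          return 0;
      }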

    Nite_Hawk
     
  3. cbone

    Newcomer

    Joined:
    Aug 13, 2002
    Messages:
    14
    Likes Received:
    0
    Location:
    Columbia, MO
    I'm not looking to make a commercial application; this is for algorithm and alignment research, so the development time shouldn't be as long as if it were for novel applications where I would be in uncharted waters. One of the nice things is that, whether it works or not, I can still get some mileage out of the process, as learning the languages can only help.

    Currently, the alignments are sent through the internet and results are emailed or available on a website when completed, so the ability to run alignments and searches at any time locally would be worth it.

    For Multi-GPU, I was thinking SLI, Quad-SLI, or Crossfire. My budget is limited, so I'm trying to use off-the-shelf parts.
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
  5. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    AFAIK, CTM is low level (a la assembly language) and ATI hardware specific; CUDA presents a C-like high level interface and is NVidia hardware specific. Then there's Brook, which is high level with multiple low level backends.

    Thus, your decision is whether you want to code at a low level or a high level, and which HW you will be buying, ATI or NVidia.

    IMHO, having experimented with implementing both Smith-Waterman and hidden Markov model search/alignment algorithms, CUDA would be easier and more productive. But if you don't want to tie yourself to a particular vendor, and don't mind trading off some performance for portability, take a look at Brook.
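
    To give a flavour of the CUDA side (a minimal sketch of the standard anti-diagonal trick, with my own names and scoring constants, not the code I actually wrote): cells on the same anti-diagonal of the Smith-Waterman matrix depend only on the two previous anti-diagonals, so each kernel launch can score one whole diagonal in parallel.

    Code:
      // Scores one anti-diagonal (i + j == k) of the Smith-Waterman matrix H,
      // sized (m+1) x (n+1) and zero-initialised. The host loops k from 2 to
      // m+n, launching this kernel once per diagonal; launches on the same
      // stream are ordered, so earlier diagonals are already complete.
      __global__ void sw_diagonal(const char* a, const char* b, int m, int n,
                                  int k, int* H, int match, int mismatch, int gap)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x + 1;   // 1..m
          int j = k - i;
          if (i > m || j < 1 || j > n) return;

          int w = n + 1;                                       // row stride
          int diag = H[(i - 1) * w + (j - 1)] +
                     (a[i - 1] == b[j - 1] ? match : mismatch);
          int up   = H[(i - 1) * w + j] - gap;
          int left = H[i * w + (j - 1)] - gap;

          int best = diag > up ? diag : up;
          best = left > best ? left : best;
          H[i * w + j] = best > 0 ? best : 0;                  // local alignment floor
      }

    The real effort goes into tiling so the working set stays in fast memory and into reducing for the maximum score, which is where you start to feel the hardware.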
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    You can write HLSL (high level shader language) and have CTM compile it into low level code.

    Jawed
     
  7. cbone

    Newcomer

    Joined:
    Aug 13, 2002
    Messages:
    14
    Likes Received:
    0
    Location:
    Columbia, MO
    Thanks, guys!

    I think that I'll give CUDA a try and take a gander at Brook.


    CBONE
     
  8. Tim Murray

    Tim Murray the Windom Earle of mobile SOCs
    Veteran

    Joined:
    May 25, 2003
    Messages:
    3,278
    Likes Received:
    66
    Location:
    Mountain View, CA
    At this point, you basically have three options: CUDA, BrookGPU, or PeakStream.

    Brook has backends for CTM, D3D9, and a generic CPU backend, so it'll run on pretty much anything. However, there's not a huge amount of documentation (compared to the other two options), although Mike Houston seems happy to answer any questions you'd have if you post here.

    CUDA is G8x-only and is similar to Brook--it was primarily designed by Ian Buck, who was also one of the principal authors of Brook (Mike, correct me if I'm wrong!). It's kind of a superset of Brook, as it exposes more functionality than can be exposed in Brook in a cross-platform way, and has very similar syntax. It has some libraries included, like an FFT and a BLAS implementation, if that matters. I'm not a huge fan of its memory model; it seems like CUDA tries to hide GPU implementation details except when it doesn't (warps, three different memory regions, etc.), which is kind of confusing.
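
    For example (a toy kernel of my own, not something from the SDK), even trivial code ends up touching all three regions, and you still have to keep the 32-thread warps in the back of your mind:

    Code:
      __constant__ float weights[16];    // constant memory: small, read-only, cached

      // Assumes 256-thread blocks; threads execute in warps of 32.
      __global__ void weighted_copy(const float* in, float* out, int n)
      {
          __shared__ float tile[256];    // shared memory: fast, per-block scratch
          int idx = blockIdx.x * blockDim.x + threadIdx.x;

          tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;   // global memory read
          __syncthreads();               // every thread in the block must reach this

          if (idx < n)
              out[idx] = tile[threadIdx.x] * weights[threadIdx.x % 16];
      }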

    PS is relatively new and is commercial, but it's the only way (besides writing D3D9 HLSL and compiling it) to write CTM in a high-level language. I'm personally a bigger fan of its syntax model than I am of CUDA's or Brook's. It reminds me of (the very little I know of) OpenMP, where particular code regions are marked to be executed on the GPU. Dispatching programs to the GPU is usually handled automatically, which is nice, and memory management seems much more straightforward than in CUDA. Like CUDA, it has FFT and BLAS libraries included. Right now, it only compiles to CTM or to a CPU backend, but I'm certainly not convinced that this will be the case in the future. Even though it is commercial, I'm hoping that PS is going to be the cross-platform API that brings GPGPU wider acceptance, since it does seem to be pretty sensibly designed and doesn't require programmers to learn totally new programming paradigms. It's also got good docs. :D

    I haven't had enough time to do anything meaningful with any of these languages (summer is coming up, though!), so I'm certainly curious to hear others' impressions of the languages as well. If you find one to be significantly more suitable for your needs than the others, make sure to let us know.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Presumably both CUDA and Peakstream allow you to evaluate in a CPU-only environment (that is, try the language/environment without worrying over whether your GPU is suitable). Is that right?

    Jawed
     
  10. Tim Murray

    Tim Murray the Windom Earle of mobile SOCs
    Veteran

    Joined:
    May 25, 2003
    Messages:
    3,278
    Likes Received:
    66
    Location:
    Mountain View, CA
    Peakstream does, CUDA doesn't really--the only CPU environment it has is the debugger, which emulates an entire G80. I guess if you're only concerned with trying it out, and you realize it won't actually take a gig of RAM to run, it's a reasonable way to try it.
     
  11. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    According to the Nvidia reps in the CUDA forum, the CPU-only development environment doesn't emulate any of the characteristics of the memory and threading architecture of the GPU. (Not surprising, really.) It's perfectly possible to develop something that works in the CPU-only environment but is entirely non-functional on the GPU.

    Good enough if you already know what you're doing (say, cranking out code during a long flight without a GPU at hand), but probably not suitable to evaluate and get to know the style of programming.

    So it doesn't even really emulate. I think CUDA-specific pragmas are more or less ignored or something, and a flat address space is used.
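
    For example (my own toy case, not anything from the NVidia forums): emulation mode runs the threads more or less serially, so a block reduction that forgets a __syncthreads() after the shared-memory load can pass every test on the CPU and then race on real hardware.

    Code:
      __global__ void block_sum(const float* in, float* out)
      {
          __shared__ float buf[256];
          buf[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
          // __syncthreads();   // forgetting this is harmless under emulation,
                                // but on the GPU another warp may read stale buf[]
          for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
              if (threadIdx.x < stride)
                  buf[threadIdx.x] += buf[threadIdx.x + stride];
              __syncthreads();
          }
          if (threadIdx.x == 0)
              out[blockIdx.x] = buf[0];
      }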
     
  12. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Question: does CTM have provisions to make code portable across GPU generations? Are the instructions an exact representation of the instructions that will run on the GPU or is there still some kind of mapping and optimization phase?
    With the latter, I assume it would still be possible to run the code later on an R600.
     
  13. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    The compiled assembly that is generated will run only on similar generations of cards. If you coded everything in HLSL then it should be fairly straightforward moving between different generations of video cards, if not different implementations. Basically anything that runs a DX9 backend could probably share code with CTM so long as you don't use any features made available only through CTM.

    1900 Assembly
    Code:
      0 alu 00 rgb:  out0.rgb =           mad(r00.0.0rg, c00.0.0rg, r01.1.0rg) sem_wait
             alpha:  out0.a   =           mad(r00.b, c00.b, r01.b)  last
    I'd assume the instructions will be the same so long as each card is using a Vec3+1 setup. It might still run on R600 but I'd doubt it.
     
  14. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    But of what benefit is this, besides replacing FXC HLSL->PS intermediate->CTM with HLSL->CTM? Sure, it allows the optimizer to do a better job; I've been arguing for years for OpenGL's GLSL model of compilation as opposed to MS's 'one compiler to rule them all with profiles'. But if you're writing HLSL and compiling to CTM, it seems you would face double the limitations: first, you'd be stuck with a low level 'shader' model of programming without the high level abstractions afforded by stacks like Brook, PeakStream, and CUDA (e.g. hiding some of the details of kernel execution and memory), and second, pure HLSL would not expose any ATI HW-specific extensions, so now you have to hope and pray the compiler knows the right magic to turn HLSL into optimal CTM.

    My own guess (I've never read the HLSL->CTM stuff, only the CTM instruction set docs) is that they'd work around the issue by using compiler intrinsic functions.

    IMHO, pretty much all of the GPGPU programming models out there are wrong. It's pretty clear to me, at least, that stream processing is a much more natural fit for functional languages.

    To add to that, IMHO, GPGPU platforms should meet two goals:

    1) Programmer productivity: enable terse, elegant, efficient representation of algorithms that are a good fit for the GPGPU streaming model (e.g. don't strive to make arbitrary C code or graph pointer-chasing code automagically compile and run well).

    2) Full hardware utilization: allow programmers to fully exploit the HW they are using, especially when the 'least common denominator' APIs don't expose everything. That is, whether by optimizer, intrinsics, annotations, or pragmas, the platform should let you custom-tweak for the platform of your choice.

    And yes, #1 and #2 sometimes conflict: it's high level elegance vs. down-and-dirty 'I want to interfere with the compiler'. But IMHO, good platforms permit both.

    Hell, even Haskell permits unsafe code and foreign function calls/direct memory manipulation.
     
  15. Tim Murray

    Tim Murray the Windom Earle of mobile SOCs
    Veteran

    Joined:
    May 25, 2003
    Messages:
    3,278
    Likes Received:
    66
    Location:
    Mountain View, CA
    Don't know at this time. I believe there are revisions of CTM, but I'm going to need to disassemble the CTM DLL that comes with Peakstream (based on the PS release notes where they said "a new and unnamed ATI card is now functional with CTM") to know how different they are.
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    CTM provides for a clean execution environment theoretically devoid of the hoop-jumping that OGL or D3D enforce. It's not just about "optimal code".

    Brook and CUDA both seem to enable a soft entry into GPGPU, but the impression I get is that the high level abstractions flounder on the machine's threading/memory model. And they provide the illusion that one's code will translate to newer GPUs while retaining "optimality".

    The "compiler" you refer to is, in fact, CTM-specific as far as I'm aware. So one would hope that it has a concept of the "magic for GPU X".

    And optimality isn't a problem that HLSL->CTM uniquely faces; Brook and CUDA also have fundamental issues there. "High level" programming is still mired in trial and error for basic algorithmic concepts as far as I can tell.

    Take all that for what it's worth, as I'm speaking from the sidelines.

    Jawed
     
  17. mhouston

    mhouston A little of this and that
    Regular

    Joined:
    Oct 7, 2005
    Messages:
    344
    Likes Received:
    38
    Location:
    Cupertino
    The advice and opinions here so far are spot on. Brook is perhaps the oldest, but it is still largely an academic research project: although AMD currently supports the CTM backend for Brook, the compiler and language side of Brook is still academic. That said, Brook drives some really big applications (Folding@Home, CFD, some finance apps).

    PeakStream is really slick, mainly because they have an entire development platform with compiler+profiler+debugger, the latter being a HUGE deal for real development. With Brook/CUDA/CTM, you effectively end up doing "printf" debugging. However, it's unclear how they currently handle looping constructs within a shader. But, being an array-based language, it should be an easy switch for Matlab/Fortran users.

    CTM's main advantage/disadvantage is that it's pretty low level; however, it matches most of the GL/D3D abstractions for GPGPU the community talks about. It can be useful because a driver update won't potentially break your code the way it can, and often does, with Brook/CUDA, since those rely on the shader compiler in the driver. Because you have raw access to the ISA if you want, you can also tune things deeply and/or fix compiler silliness (GPU vendors still seem to struggle with compiler correctness/performance/stability!).

    CUDA, as was said, is similar to Brook but exposes, and forces you to program knowing, a little more about the actual architecture. The claim is that CUDA is also portable, but we won't know until we see a new architecture. My guess is code will continue to run since they can recompile, but it may need to be retuned if the sizes of the memories and/or warps change in the future. I should also mention that to get good performance from Brook, you need to understand a fair amount about the architecture of the GPUs you are targeting.

    In the end, they all have the same basic data-parallel model. The differences are really in syntax and final mapping to the hardware. The "hard" part of GPGPU is getting your application converted to data-parallel/streaming form. Once you do that, you can run efficiently on many more systems than just GPUs (Cell, clusters, multi-core/SMP, out-of-core).
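
    To make that concrete with a deliberately simple example (the names and the toy scoring are mine): the conversion is mostly about finding the independent work items; once you have those, the kernel itself is almost mechanical.

    Code:
      // Serial form: score each query against the database, one at a time.
      //   for (int q = 0; q < num_queries; ++q)
      //       scores[q] = score(db, db_len, &queries[q * q_len], q_len);
      //
      // Streaming form: each independent iteration becomes one thread.
      __global__ void score_all(const char* db, int db_len,
                                const char* queries, int q_len,
                                float* scores, int num_queries)
      {
          int q = blockIdx.x * blockDim.x + threadIdx.x;
          if (q >= num_queries) return;

          // placeholder per-pair "score": count of matching positions
          float s = 0.0f;
          for (int i = 0; i < q_len && i < db_len; ++i)
              s += (queries[q * q_len + i] == db[i]) ? 1.0f : 0.0f;
          scores[q] = s;
      }

    Once the work is cut up that way, the same decomposition maps onto Cell SPEs, cluster nodes, or SMP cores without much drama.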
     
  18. elroy

    Regular

    Joined:
    Jan 29, 2003
    Messages:
    269
    Likes Received:
    1
    Mike, just wondering if you can clarify this point for me. Do you mean that, once your app is optimised for parallel processing, you are better off running it on, for example, a cluster rather than a GPU? Or do you mean that, once it is optimised for parallel processing, it can be applied to a variety of systems more efficiently/easily, including GPUs?
     
  19. mhouston

    mhouston A little of this and that
    Regular

    Joined:
    Oct 7, 2005
    Messages:
    344
    Likes Received:
    38
    Location:
    Cupertino
    Once you go through the effort to run it on a GPU, i.e. converting it to data-parallel/streaming form, you can run on lots of other systems easily. GPUs still tend to dominate for many of these apps, but PS3s are also interesting since they are reasonably cheap and easy to get. But much of what we do on GPUs with Brook can quickly be converted to Sequoia and retargeted to lots of other machines.
     
  20. elroy

    Regular

    Joined:
    Jan 29, 2003
    Messages:
    269
    Likes Received:
    1
    Thanks Mike, that clears things up nicely :)
     