CUDA x86

Discussion in 'GPGPU Technology & Programming' started by thop, Sep 21, 2010.

  1. thop

    thop Great Member
    Veteran

    Joined:
    Feb 23, 2003
    Messages:
    1,286
    Likes Received:
    0
    http://venturebeat.com/2010/09/21/n...mputer-even-those-without-its-graphics-chips/

    Nvidia announced today that it can run its graphics-based programming technology on any computer regardless of whether it uses an Nvidia graphics chip or not. In short, Nvidia has worked with software company PGI to create a new compiler to take code for its graphics chips and run it on machines without its graphics chips.
     
  2. flynn

    Regular

    Joined:
    Jan 8, 2009
    Messages:
    400
    Likes Received:
    0
  3. DarthShader

    Regular

    Joined:
    Jul 18, 2010
    Messages:
    350
    Likes Received:
    0
    Location:
    Land of Mu
    Hmm... the article mentions "Intel-compatible" CPUs only... will AMD ones be gimped? :???:

    This is a very good move for nVidia. I was considering getting an nVidia GPU to learn some CUDA programming; now I don't have to. :grin:

    Seriously though, it is a good move, provided the compiler is free, provides autovectorization like Intel's OpenCL compiler, and works on AMD CPUs too. And most importantly, isn't slower than OpenCL.
     
  4. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    504
    Likes Received:
    187
    Yes, it will be interesting to see how well the compiler works in practice. Nvidia has no reason to gimp things for AMD; they're certainly not trying to give Intel a leg up here. In fact, PGI announced specifically that the CUDA x86 compiler will work with AMD processors.
    http://www.prnewswire.com/news-rele...architecture-for-x86-platforms-103457159.html

    One of the most interesting questions about how CUDA-x86 will work revolves around the warp. Although the warp is defined as part of the CUDA programming model, it's always had a somewhat secondary status, and official descriptions usually try to hide it. Since x86 SIMD is narrower than and somewhat different from the CUDA warp, emulating the warp in a high-performance way is going to be tricky. They may just define the warp in some future version of CUDA in a more platform-independent way, and declare CUDA programs which rely on the warp size being 32 as invalid. I think CUDA already allows you to query the runtime to discover the warp size, which could be helpful. But still, there's a lot of CUDA code which won't execute correctly if the warp isn't faithfully emulated, and doing so on x86 is problematic.
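
    For concreteness, here's a minimal sketch (illustrative only, along the lines of the CUDA SDK reduction sample; kernel and variable names are made up) of the kind of warp-synchronous code that silently assumes a 32-wide warp, plus the runtime query mentioned above:

    Code:
    #include <cstdio>

    // Classic "warp-synchronous" final reduction: no __syncthreads() inside the
    // if-block, because the 32 threads of a warp are assumed to run in lockstep.
    // An x86 back end mapping a warp onto 4-wide SSE (or 8-wide AVX) registers
    // has to preserve exactly this behaviour for such code to stay correct.
    __global__ void block_sum(const float *in, float *out)
    {
        __shared__ volatile float sdata[64];     // volatile: no caching in registers
        unsigned tid = threadIdx.x;              // launched with blockDim.x == 64 (assumed)
        sdata[tid] = in[blockIdx.x * 64 + tid];
        __syncthreads();

        if (tid < 32) {                          // hard-coded warp size of 32
            sdata[tid] += sdata[tid + 32];
            sdata[tid] += sdata[tid + 16];
            sdata[tid] += sdata[tid + 8];
            sdata[tid] += sdata[tid + 4];
            sdata[tid] += sdata[tid + 2];
            sdata[tid] += sdata[tid + 1];
        }
        if (tid == 0) out[blockIdx.x] = sdata[0];
    }

    int main()
    {
        // The runtime query mentioned above; device code can also read the
        // built-in warpSize variable instead of hard-coding 32.
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("warpSize = %d\n", prop.warpSize);
        return 0;
    }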

    As far as vectorization is concerned, I think it is likely that the PGI compiler will vectorize across SSE/AVX lanes, similarly to the Intel OpenCL compiler. I think this will happen because using float4 indiscriminately (as AMD's CPU OpenCL runtime requires to get SSE vectorization, and as its OpenCL runtime for ATI GPUs requires to get good performance) is actually harmful on Nvidia GPUs, due to the dramatically increased register file pressure it creates. So, if the PGI CUDA compiler doesn't vectorize, CUDA will find itself becoming fragmented much as OpenCL is fragmenting, since CUDA programs for x86 will need to be written with manual vectorization. But I could be too optimistic here.
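
    To make the float4 point concrete, a deliberately trivial saxpy sketch (purely illustrative, not from the article or PGI): the scalar form is the natural CUDA style, while the hand-vectorized float4 form is what CPU-oriented runtimes tend to reward, at the cost of roughly four times the per-thread live state on an NVIDIA GPU.

    Code:
    // Scalar, one element per thread: the natural way to write CUDA.
    __global__ void saxpy_scalar(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Hand-vectorized with float4, four elements per thread: the style a
    // non-vectorizing CPU back end would push people towards. Each thread now
    // keeps ~4x as many values live, which costs registers (and occupancy) on a GPU.
    __global__ void saxpy_float4(int n4, float a, const float4 *x, float4 *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) {                            // n4 = n / 4, n assumed divisible by 4
            float4 xv = x[i], yv = y[i];
            yv.x = a * xv.x + yv.x;
            yv.y = a * xv.y + yv.y;
            yv.z = a * xv.z + yv.z;
            yv.w = a * xv.w + yv.w;
            y[i] = yv;
        }
    }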
     
  5. DarthShader

    Regular

    Joined:
    Jul 18, 2010
    Messages:
    350
    Likes Received:
    0
    Location:
    Land of Mu
    You make a good point about the warps. But from what I understood from your link (thanks!), this is going to be a normal compiler, not an interpreter or emulator. So there are going to be optimizations towards current CPU architectures, most likely dropping the warps or rearranging them into something more fitting.

    What they seem to want to achieve is that you take correct, optimized GPU CUDA code and, without any fuss, compile it to run on a CPU. So it will indeed require vectorization to help them achieve performance. With a little porting, OpenCL code might compile too.

    Maybe it will be possible to have CUDA code that wouldn't run fast (or at all) on a GPU compile successfully and run fast on a CPU. But I doubt they'll allow that. So my plan will not work; I will still have to buy a GPU to see if it works as intended. :p
     
  6. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,001
    Likes Received:
    553
    Location:
    Taiwan
    I'm more curious about how they handle some features designed specifically for GPUs. Warp vote functions would be the obvious ones. Imaging functions (i.e. textures) can also be very troublesome. Another potential problem is, since NVIDIA is still actively adding new features to CUDA, how closely will CUDA-x86 track those new features?
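
    For reference, a minimal illustrative kernel (made up for this post, not from any real codebase) that uses both of the features mentioned: a warp vote and a texture fetch, each of which an x86 back end would have to emulate in software. The texture here is assumed to be bound on the host with cudaBindTexture(0, tex, d_in, n * sizeof(float)) to a hypothetical device buffer d_in before launch.

    Code:
    // 1D texture reference bound to a linear device buffer (the CUDA 3.x-era API).
    texture<float, 1, cudaReadModeElementType> tex;

    __global__ void vote_and_fetch(float *out, int n, float threshold)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = (i < n) ? tex1Dfetch(tex, i) : 0.0f;   // goes through the texture unit on a GPU

        // Warp vote (compute capability 1.2+): true if any of the 32 lanes of this
        // thread's warp saw a value above the threshold. Its meaning is tied
        // directly to the 32-wide warp that an x86 target doesn't natively have.
        if (__any(v > threshold) && i < n)
            out[i] = v;
    }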
     
  7. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
  8. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,569
    Likes Received:
    756
    Location:
    British Columbia, Canada
    That shouldn't be that tough... SSE already has more horizontal operations than it needs :) You can always scalarize that stuff too, as it isn't particularly common.

    The biggest problems are predication (induced by use of control flow) and gather/scatter. Neither is well supported in SSE, and they represent a large fraction of the operations in a typical kernel-style program. It'll be interesting to see how hard NVIDIA tries to do things like collapse gather/scatter into linear loads/stores where possible in the compiler, or whether they just bail and scalarize all gather/scatter and complex control flow.

    Sure, these are going to be slow compared to GPUs, no doubt about it. Texture fetches are one of the places where GPUs are legitimately way faster, for obvious reasons :) Quite possible to do in software, but very slow.
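
    As a rough illustration of the gather problem (a sketch of what any compiler would have to emit, not PGI's actual code generation; the function names are made up): pre-AVX x86 has no gather instruction, so an indexed per-lane load either gets scalarized or, if the indices can be proven consecutive, collapsed into one vector load.

    Code:
    #include <xmmintrin.h>   // SSE

    // Four CUDA "threads" packed into one SSE register each reading a[idx[lane]]:
    // four scalar loads plus a pack, since SSE has no gather instruction.
    static inline __m128 gather4(const float *base, const int idx[4])
    {
        return _mm_set_ps(base[idx[3]], base[idx[2]], base[idx[1]], base[idx[0]]);
    }

    // The "collapse into a linear load" case: if the compiler can prove the four
    // indices are i, i+1, i+2, i+3, the whole thing is a single unaligned load.
    static inline __m128 load4_linear(const float *base, int i)
    {
        return _mm_loadu_ps(base + i);
    }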
     
  9. EduardoS

    Newcomer

    Joined:
    Nov 8, 2008
    Messages:
    131
    Likes Received:
    0
    This is easy; while not ideal, it has been reasonable since the first SSE iteration.

    This is hard; until SSE4 it was "pathetic", and after SSE4 it becomes merely "bad".
     
  10. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,569
    Likes Received:
    756
    Location:
    British Columbia, Canada
    What do you mean, just conditionally moving everything? Not exactly ideal performance-wise.
     
  11. Arun

    Arun Unknown.
    Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    I was very enthusiastic about this... two years ago! http://www.beyond3d.com/content/articles/106/4
    AFAIK, the original CUDA x86 project was entirely in-house R&D with a few people being very excited about it, but it ran into technical difficulties (mostly performance in some scenarios iirc but I'm not sure) and there wasn't enough support behind it to keep going. Presumably they cancelled it completely and this is an independent effort by PGI.

    I still think it's a good idea, but it would have been a lot more disruptive in 2008.
     
  12. EduardoS

    Newcomer

    Joined:
    Nov 8, 2008
    Messages:
    131
    Likes Received:
    0
    I mean, doing exactly what AMD and nVidia do on their GPUs, but with masks instead of predication and using a few more instructions.
     
  13. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I think it would have been pretty useful even if they could not hit their performance targets. They might have had issues with vectorization, but we would have still seen scaling across cores. Nothing to sniff at, IMO.
     
  14. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    AMD and nv implement masks in hw, transparent to the programmer or even the compiler. So you can't just "do what AMD/nv do on GPUs". The entire predication stack has to be implemented in sw. Worst case, there is a three-instruction overhead for every instruction in a dense forest of branches.

    Life would be so much better if Intel just introduced scatter/gather and predication. :yep2:

    That would have been better than AVX, IMHO. Maybe Ivy Bridge/Haswell, then. :???:
     
  15. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,569
    Likes Received:
    756
    Location:
    British Columbia, Canada
    Right, but SSE does not have free masking on its instructions, so every masked operation has to be followed by a conditional move, which doubles the instruction count. This is unlike GPUs (and Larrabee), which have "free" masks on all instructions.
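
    A minimal sketch of what that software predication looks like for four CUDA lanes packed into one SSE register (illustrative only; the kernel branch here is a made-up "if (x > 0) x = x * 2 + 1;"):

    Code:
    #include <smmintrin.h>   // SSE4.1 for _mm_blendv_ps

    static inline __m128 masked_update(__m128 x)
    {
        __m128 mask  = _mm_cmpgt_ps(x, _mm_setzero_ps());               // per-lane predicate
        __m128 taken = _mm_add_ps(_mm_add_ps(x, x), _mm_set1_ps(1.0f)); // the "then" path: x*2 + 1
        // The extra conditional move: keep the old value in the lanes where the
        // predicate is false. On SSE4.1 that's one blend per masked result;
        // before SSE4.1 it's the and/andnot/or idiom, i.e. roughly the
        // three-instruction overhead mentioned a few posts up:
        //   _mm_or_ps(_mm_and_ps(mask, taken), _mm_andnot_ps(mask, x));
        return _mm_blendv_ps(x, taken, mask);
    }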
     
  16. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,001
    Likes Received:
    553
    Location:
    Taiwan
    Of course, in theory it's possible to do the conditional move only after a conditional block, so it's not that bad, but it's not good either.
     
  17. EduardoS

    Newcomer

    Joined:
    Nov 8, 2008
    Messages:
    131
    Likes Received:
    0
    Not every masked instruction, just every block. AMD GPUs also have some penalty for switching blocks, and nVidia may too. Also, if the code is that branchy, it may not be worth the effort to vectorize it.
     
  18. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    AFAIK, nv GPUs don't have a warp switch penalty.

    Also, the code is vectorized in hw anyway. Take it or leave it.
     
  19. MrGaribaldi

    Regular

    Joined:
    Nov 23, 2002
    Messages:
    611
    Likes Received:
    0
    Location:
    In transit
    Sorry for going off on a bit of a tangent here, but I'm still having a bit of trouble seeing what problem CUDA x86 solves, so if anyone could enlighten me I'd appreciate it.

    As I see it, to get good performance we'd be writing one kernel for the GPU and one for the CPU, playing to the different architectures' strengths, and launching them with different thread dimensions to utilise each chip fully without drowning it in threads.

    It "solves" the problem of thread interdependence, since in CUDA there is no easy mechanism for this, so we would have to rethink how the problem is solved. That could be a very good thing, but it's something that is still very much possible in other languages.

    Comparing it with OpenCL, I guess one could say that CUDA is better since it hasn't made allowances for making use of the resources of different platforms. But on the other hand, that makes it harder to get the full performance of the CPU.

    I guess it could be argued that it opens up testing and development of CUDA programs to a larger group, but unless that's done along with heavy use of the occupancy calculator and profiling/register analysis, how will it teach you what code runs well on a GPU? And if you're willing to do that, why not just pop in a cheap CUDA-capable card instead of running on the CPU?

    The only thing I can see this being useful for would be Intel's SCC chip, Larrabee or similar architectures. But I guess I am being pessimistic/negative here, and I look forward to hearing the upsides of CUDA x86.
     
  20. Andrew Lauritzen

    Andrew Lauritzen Moderator
    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,569
    Likes Received:
    756
    Location:
    British Columbia, Canada
    Depends... if you do it that way you drive up the register pressure, since you have to duplicate any live registers inside a masked (control flow) block because the values of the masked-off lanes must be maintained. Thus you may end up adding register-copy or even spill/fill code as well.
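
    In the blend-at-the-end-of-the-block scheme from the last few posts, that cost shows up as one saved copy plus one blend per value the block writes; a hypothetical sketch (block body and names made up):

    Code:
    #include <smmintrin.h>   // SSE4.1

    // Masked block that updates two values: both need their pre-block copies kept
    // live until the blends at the end, which is extra register pressure (and
    // potentially spill/fill code) on top of the blend instructions themselves.
    static inline void masked_block(__m128 mask, __m128 &a, __m128 &b)
    {
        __m128 a_old = a, b_old = b;          // duplicated live values
        a = _mm_add_ps(a, b);                 // block body, executed for all lanes
        b = _mm_mul_ps(b, b);
        a = _mm_blendv_ps(a_old, a, mask);    // one blend per value written in the block
        b = _mm_blendv_ps(b_old, b, mask);
    }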
     