OpenCL (Open Compute Library)
NocturnDragon
10-Jun-2008, 15:08
As Jen-Hsun Huang pre-announced the other day:
"Apple knows a lot about CUDA," Huang said, implying the company might be ready to formally embrace Nvidia's technology to make it easier to exploit graphics chips inside Macs. Apple's implementation "won't be called CUDA, but it will be called something else,"
Apple yesterday announced its own GPGPU solution called OpenCL (not to be confused with OpenCL the cryptography library, now called Botan).
So far there are only a few details available:
It will be released on Apple platforms a year from now with OS X 10.6 Snow Leopard.
Apple proposed it as an open standard. (through Khronos?)
It will have a C-based syntax (but what language today doesn't?)
On the Apple site you can read:
http://www.apple.com/macosx/snowleopard/
OpenCL
Another powerful Snow Leopard technology, OpenCL (Open Compute Library), makes it possible for developers to efficiently tap the vast gigaflops of computing power currently locked up in the graphics processing unit (GPU). With GPUs approaching processing speeds of a trillion operations per second, they’re capable of considerably more than just drawing pictures. OpenCL takes that power and redirects it for general-purpose computing.
Press Release:
http://www.apple.com/pr/library/2008/06/09snowleopard.html
Is there any other information about it?
Steve's reality distortion field, or is there something that is missing?
Mr. Jobs described a new processing standard that Apple is proposing called OpenCL (Open Computing Language) which is intended to refocus graphics processors on standard computing functions.
“Basically it lets you use graphics processors to do computation,” he said. “It’s way beyond what Nvidia or anyone else has, and it’s really simple.”
OpenCL is based on LLVM (http://en.wikipedia.org/wiki/LLVM) and Clang (http://en.wikipedia.org/wiki/Clang).
In fact, Grand Central Dispatch is also very interesting.
NocturnDragon
12-Jun-2008, 08:12
OpenCL is based on LLVM (http://en.wikipedia.org/wiki/LLVM) and Clang (http://en.wikipedia.org/wiki/Clang).
In fact, Grand Central Dispatch is also very interesting.
Yeah I was guessing LLVM would have been involved...
Do you have any link about that tho?
I think the Wiki entry for OpenCL says as much.
I am trying to find more info about Grand Central Dispatch from Apple's developer site (Though it is not GPGPU per se). Like the Cell processor, GCD reminds me of supercomputing concept(s) from more than a decade ago (From the Cray, Thinking Machine, PVM supercomputing era).
jimmyjames123
13-Jun-2008, 02:06
I thought this was very interesting news, especially the tidbit by Steve Jobs that this was "way beyond" what NVIDIA and others have. Logically I don't think that makes sense (since CUDA has already proven its worth in various real-world scenarios), unless he meant that it is way beyond in the sense that it is not specific to only one set or type of graphics cards.
Does anyone know what are the real differences between CUDA and OpenCL? They sure sound similar in function on the surface.
One thing I don't quite follow is that Wikipedia's CUDA page says that OpenCL is a similar technology to CUDA and that Apple has a partnership with NVIDIA and others in promoting this standard.
So maybe NVIDIA realizes that CUDA has to become universally used for it to be successful, and is therefore supporting OpenCL as an alternative to Intel's upcoming GPGPU software platform?
It probably means that nVidia's CUDA implementation can be folded under LLVM's compiler framework on Mac OS X. LLVM (and hence, OpenCL) can support other language run-times.
NocturnDragon
16-Jun-2008, 10:50
http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~126593,00.html (http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543%7E126593,00.html)
In keeping with its open systems philosophy, AMD has also joined the Khronos Compute Working Group. This working group’s goals include developing industry standards for data parallel programming and working with proposed specifications like OpenCL. The OpenCL specification can help provide developers with an easy path to development across multiple platforms.
“An open industry standard programming specification will help drive broad-based support for stream computing technology in mainstream applications,” said Rick Bergman, senior vice president and general manager, Graphics Product Group, AMD. “We believe that OpenCL is a step in the right direction and we fully support this effort. AMD intends to ensure that the AMD Stream SDK rapidly evolves to comply with open industry standards as they emerge.”
Simon F
17-Jun-2008, 10:53
Khronos Press Release on the related topic (http://www.khronos.org/news/press/releases/khronos_launches_heterogeneous_computing_initiative/)
NocturnDragon
17-Jun-2008, 11:03
Khronos Press Release on the related topic (http://www.khronos.org/news/press/releases/khronos_launches_heterogeneous_computing_initiative/)
Too bad there is not much information in it.
“Significantly, this initiative is aimed at both desktop and embedded devices – the day when you will be able to hold a supercomputer in the palm of your hand is perhaps not so far away.”
Future embedded devices, or current ones (OpenGL ES 2.0 generation)?
Well, it has seemed obvious for a while now that the main roadblock to wide (or even wider) GPGPU adoption is the lack of a platform and vendor agnostic, open, standardized programming interface. If this OpenCL initiative is actually serious and capable of producing something usable in a reasonable timeframe it will become very significant. I for one am quite excited.
NocturnDragon
17-Jun-2008, 11:48
Well, it has seemed obvious for a while now that the main roadblock to wide (or even wider) GPGPU adoption is the lack of a platform and vendor agnostic, open, standardized programming interface. If this OpenCL initiative is actually serious and capable of producing something usable in a reasonable timeframe it will become very significant. I for one am quite excited.
I sure share your feelings, but I am also a bit scared of the timeframe; as the OpenGL 3 fiasco showed, Khronos can take its time to standardize things.
Hopefully the push Apple is giving will be enough to overcome that.
I sure share your feelings, but I am also a bit scared of the timeframe; as the OpenGL 3 fiasco showed, Khronos can take its time to standardize things.
Thinking of the OpenGL 3 "process" is what made me put the qualifier "in a reasonable timeframe" there ;)
NocturnDragon
17-Jun-2008, 13:36
Thinking of the OpenGL 3 "process" is what made me put the qualifier "in a reasonable timeframe" there ;)
Well my impression was that the 10.6 version they gave to WWDC attendees already has an early implementation... But I might be way wrong.
Karoshi
17-Jun-2008, 16:21
Don't remember if DX11 had a compute shader planned. If it didn't, we can look forward to a rushed one. No way MS will let anything multiplatform go ahead unchallenged by a proprietary alternative.
DX11 compute shaders are similar to CUDA..
darkblu
17-Jun-2008, 17:21
so when you try to do a software rasterizer on opencl will the universe implode into a violent singularity?
I think they should focus on getting OpenGL3 or even OpenGL 3.1 done first. Then OpenCL.
And maybe, when they have time, fix the OpenAL problems with a rewrite.
randomhack
18-Jun-2008, 06:11
Maybe they are including OpenCL or something related in OpenGL 3, and maybe that's why it's taking so much time?
NocturnDragon
18-Jun-2008, 08:05
I think they should focus on getting OpenGL3 or even OpenGL 3.1 done first. Then OpenCL.
And maybe, when they have time, fix the OpenAL problems with a rewrite.
They who?
I don't think the OpenGL and OpenCL groups are the same.
And btw OpenAL is not a Khronos standard.
Just wondering what are the chances of using OpenCL on Intel X4500?
randomhack
24-Jun-2008, 22:00
I have to wonder whether Intel is supporting OpenCL on Larrabee. If Intel's claims that Larrabee is much more programmable than current cores are correct, they shouldn't have much of a problem supporting any standard.
I am also wondering if Apple will be using Larrabee? Are Snow Leopard and Larrabee scheduled in the same time frame?
trinibwoy
24-Jun-2008, 22:01
DX11 compute shaders are similar to CUDA..
In what respect? Do the structures map well to the Grid -> Block -> Warp -> Thread CUDA hierarchy?
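For readers who have not used CUDA, the Grid -> Block -> Warp -> Thread hierarchy asked about above mostly boils down to index arithmetic. Here is a minimal Python sketch, assuming a 1D grid and a warp width of 32. The names `locate_thread` and `blocks_needed` are made up for illustration and are not part of CUDA or of any compute shader API:

```python
from typing import NamedTuple

WARP_SIZE = 32  # assumed warp width; matches the NVIDIA parts discussed in this thread


class ThreadCoords(NamedTuple):
    global_id: int      # index of the thread within the whole grid
    warp_in_block: int  # which warp of its block the thread belongs to
    lane: int           # position of the thread inside its warp


def locate_thread(block_idx: int, thread_idx: int, block_dim: int) -> ThreadCoords:
    """Map a (block, thread) pair in a 1D grid to its flat position."""
    if not 0 <= thread_idx < block_dim:
        raise ValueError("thread_idx must lie inside the block")
    global_id = block_idx * block_dim + thread_idx
    return ThreadCoords(global_id, thread_idx // WARP_SIZE, thread_idx % WARP_SIZE)


def blocks_needed(n_items: int, block_dim: int) -> int:
    """Number of blocks a 1D grid needs so that every item gets one thread."""
    return (n_items + block_dim - 1) // block_dim
```

Whether DX11 compute shaders expose the same levels is exactly the open question here; the sketch only shows what a mapping would have to preserve.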
TimothyFarrar
24-Jun-2008, 23:35
In what respect? Do the structures map well to the Grid -> Block -> Warp -> Thread CUDA hierarchy?
Or a better question is how exactly they are making the functionality portable without hindering performance. CUDA performance is all about fetching aligned full memory bus granularity blocks into "shared memory" and then allowing the SIMD units to do swizzled fetches from that shared pool. Kind of hard to see this porting well to AMD chips..
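To make the point about aligned, full-granularity fetches concrete, here is a deliberately simplified Python model. It assumes 64-byte aligned memory segments and counts one transaction per segment touched by a group of 16 threads. Both numbers and both function names are assumptions for illustration, and real hardware rules are stricter and differ between GPU generations:

```python
def strided_addresses(base, n_threads, stride_bytes):
    """Byte address read by each thread when thread i reads base + i * stride."""
    return [base + i * stride_bytes for i in range(n_threads)]


def count_transactions(addresses, segment_bytes=64):
    """Count the distinct aligned segments touched by a group of byte addresses.

    Simplified model: every touched segment costs one memory transaction.
    """
    return len({addr // segment_bytes for addr in addresses})


# 16 threads reading consecutive 4-byte words from an aligned base: one segment.
aligned = count_transactions(strided_addresses(0, 16, 4))
# Same access pattern shifted by one word: it now straddles two segments.
shifted = count_transactions(strided_addresses(4, 16, 4))
# Each thread jumps a whole segment ahead: one transaction per thread.
scattered = count_transactions(strided_addresses(0, 16, 64))
```

Under this toy model the three patterns cost 1, 2 and 16 transactions, which is why data layout dominates tuning and why the right layout may not carry over to a chip with a different memory system.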
In what respect? Do the structures map well to the Grid -> Block -> Warp -> Thread CUDA hierarchy?
That's an abstraction that would hardly work on a wide range of architectures.
Compute shaders should be an opportunity to refine CUDA and get it a bit more right (who cares about number of warps/blocks/grids/wavefront/whatever..)
Uhm, that kind of stuff is pretty easy for the HPC community. Certainly for a game developer that's probably intimidating and compute shaders in DX11 should abstract at least some of it, but I think for their initial target market that was not really a problem. Their guide is a strange and rather suboptimal way to teach the paradigm & language though; they told me they're working on revamping it completely, but we'll see what happens...
It really doesn't matter if your 'average' HPC engineer can get all his/her CUDA numbers right; the first company that can deliver a language and/or framework that lets you focus on the really important things wins. Many of us (me included) get all excited about hardware architectures, but if I have learnt one thing after a relatively long time working on next-gen console CPUs, it is that whoever gets the software architecture 'right' (whatever that might mean) will take the crown.
CUDA is nice and everything, but I refuse to believe that 2 or 3 years from now we are still going to use it as it is now; it will evolve or it will eventually lose its leadership.
In such a small and new field it is very easy to go from being first in your class to falling into oblivion.
CUDA is certainly going to evolve, if only because of changes in their DX11 hardware architecture and the fact that individual developers & consumer apps will become even more important in the future (and those have much lower complexity tolerance). However, I don't think it's really necessary to completely hide all of those implementation details; just a layered API with a higher level of abstraction would do the trick. Hitting the right sweet spot for it may be difficult, however.
(who cares about number of warps/blocks/grids/wavefront/whatever..)
Anyone using local/shared storage?
Andrew Lauritzen
25-Jun-2008, 15:54
The problem with CUDA IMHO is that it's a little too specific to the G80/G92/GT200 architecture. It doesn't map naturally to other architectures with different memory hierarchies and although it can be "made to work", something a bit more abstract is needed for a standard that is meant to be targeted to a wide range of parallel processors with varying memory hierarchies.
The other problem with CUDA is that it's just too damn hard to make it fast/optimal ;) This is more a problem with the complexity of the underlying hardware than the language itself, but the point remains that the language does nothing to prevent you from seriously shooting yourself in the foot, which is never a good thing. As it stands, even simple problems require highly non-linear optimization and machine-learning style optimization algorithms (http://www.crhc.uiuc.edu/IMPACT/ftp/conference/cgo-08-ryoo.pdf) to even approach 50% of peak performance. There are just too many variables that affect performance in highly non-linear ways for us mere mortals to get right ;)
Now the above is just a tough problem with parallel programming and complex architectures in general, but it begs the question as to whether we need to be specifying algorithms in something a bit more general and tunable than CUDA, and then the backend/compilers can handle the heavy-lifting as far as optimization and targeting to a specific memory model go.
Anyways there are certainly many interesting topics moving forward, and it will be fascinating to see what falls out of OpenCL and similar initiatives (DX compute shaders, etc).
TimothyFarrar
25-Jun-2008, 16:46
Great paper BTW. Interesting that the difference between worst and peak is only 235%.
As for peak performance, I'm assuming you are referring to ALU utilization? How many graphics programs ever reach peak ALU performance?
The point being that it is always tough to reach peak performance under any platform, and in all cases you have to have intimate hardware knowledge to tune (or engineer the algorithm in the first place). I think a great example of this is the potential of floating point performance on the xbox 360 or cell/ps3. In both cases you need to vectorize. On 360 you have to stay in cache and aligned, and have a huge amount of work going in parallel to hide really long instruction latencies... ie you really have to program as you do on a GPU to get anywhere close to peak ALU performance. Most developers either will not or cannot do this for anything but a small amount of code.
Andrew Lauritzen
25-Jun-2008, 17:00
As for peak performance, I'm assuming you are referring to ALU utilization?
Well not just ALU utilization... I'm also considering things like how cleverly you touch memory, avoid cache misses, etc. Basically everything that makes your algorithm as fast as it can theoretically be on a given set of hardware. I realize this is largely hand-wavy, but I'm just trying to make the distinction between - say - the naive vs. hand-tuned vs. autotuned versions of algorithms.
And yes, it's definitely tough to reach any sort of peak performance, but I'm concerned that on G8x and similarly complex architectures it has gone beyond "tough" into the realm of automated empirical optimization (as the paper that I referenced does). This process can potentially be "guided" or hinted or pruned by the user in the majority of cases, but with all of the factors that come into play when making something fast on G8x/CUDA, it is simply infeasible for even a ninja programmer to find a globally optimal configuration of tuning parameters except in the simplest of cases. The best we can do is a sort of orthogonal gradient ascent (in each dimension) which can be quite suboptimal in the case of something like G8x.
Anyways my only real point here is that CUDA is pretty tied to a specific architecture, and pretty complex in terms of extracting excellent performance out of that architecture. I submit that these are characteristics of a low-level, relatively non-portable language which is great in its own right, but not suitable as-is for something like OpenCL or DX compute shaders.
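The "orthogonal gradient ascent (in each dimension)" mentioned above can be sketched as a greedy per-parameter search. The Python below is a toy, not the method of the referenced paper: the score table, the parameter names `block_size` and `unroll`, and the helper names are all invented to show how tuning one parameter at a time can stall when parameters interact:

```python
import itertools

# Made-up throughput scores for (block_size, unroll) pairs; purely illustrative.
TOY_SCORES = {
    (64, 1): 5, (64, 2): 4, (64, 4): 3,
    (128, 1): 4, (128, 2): 3, (128, 4): 2,
    (256, 1): 3, (256, 2): 2, (256, 4): 9,
}
AXES = [(64, 128, 256), (1, 2, 4)]


def toy_score(block_size, unroll):
    return TOY_SCORES[(block_size, unroll)]


def coordinate_ascent(score, axes, start):
    """Tune one parameter at a time, keeping the others fixed, until nothing improves."""
    current = list(start)
    improved = True
    while improved:
        improved = False
        for dim, choices in enumerate(axes):
            def with_value(value, dim=dim):
                trial = list(current)
                trial[dim] = value
                return tuple(trial)

            best = max(choices, key=lambda v: score(*with_value(v)))
            if score(*with_value(best)) > score(*current):
                current[dim] = best
                improved = True
    return tuple(current)


def exhaustive(score, axes):
    """Try every combination; only feasible for tiny search spaces."""
    return max(itertools.product(*axes), key=lambda cfg: score(*cfg))


greedy_result = coordinate_ascent(toy_score, AXES, start=(64, 1))
global_result = exhaustive(toy_score, AXES)
```

Starting from `(64, 1)`, the greedy search never moves, because every single-parameter change lowers the score, while the exhaustive search finds `(256, 4)`. With more than a handful of interacting parameters the exhaustive option stops being affordable, which is the argument for automated empirical tuning.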
I agree that CUDA is too tied to a specific architecture, which makes it very hard to "generalize." However, the problem of "hard to optimize" is very difficult to solve. Even CPUs have the same problem. For example, a matrix multiplication algorithm, even written in C/C++, without considering SIMD, will not have optimal performance if the cache size is not considered.
Of course, the beautiful thing about a CPU (especially an x86 CPU) is that even a "normal" program (not specifically optimized for a certain architecture) may perform relatively well. The same can't be said for GPUs, or any other more "exotic" architectures, including CELL.
IMHO, it's almost impossible to hide all architecture details while maintaining high performance. To do so would require a lot of "helper" hardware, which sort of defeats the idea of GPGPU. Therefore, the most important problem right now is probably to figure out what the "best" architecture for GPGPU is, one which all major vendors can accept and which is also useful for most application developers.
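The matrix multiplication remark is usually addressed with loop tiling (cache blocking). The Python sketch below only illustrates the loop structure; the `tile` parameter is an assumed stand-in for "whatever fits in cache", and pure Python will not show the speed-up that the same restructuring can give in C/C++:

```python
def matmul_naive(a, b):
    """Textbook triple loop: walks a whole column of b for every output element."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same arithmetic, but processed in tile x tile blocks to reuse loaded data."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Work on one small block that is meant to stay resident in cache.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

The best `tile` depends on the cache size of the machine, which is the same kind of hardware-specific knob that CUDA's block size is on a GPU.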
Andrew Lauritzen
25-Jun-2008, 18:35
However, the problem of "hard to optimize" is very difficult to solve.
Oh no doubt! I didn't mean to imply that optimizing for G8x is in any way hindered by CUDA... just that writing optimal CUDA code is sufficiently tied to the G8x platform that I consider it a fairly "low-level" language. Clearly CUDA is the best (and only) language for targeting G8x hardware "to the metal", but I remain unconvinced that it provides a good general-purpose, portable programming model.
Anyways I don't want to come off as anti-CUDA - quite the contrary! I just don't think it makes sense for something like CUDA to be the programming model of choice for writing code to target stuff like AMD GPUs, multicore CPUs, Larrabee and Cell.
I just don't think it makes sense for something like CUDA to be the programming model of choice for writing code to target stuff like AMD GPUs, multicore CPUs, Larrabee and Cell.
Or for whatever NVIDIA will unleash in the next 18/24 months..
Dave Baumann
17-Jul-2008, 16:54
http://www.guardian.co.uk/technology/2008/jul/17/news.computing
Worthwhile reading.
Good read indeed... :)
However I have one major problem with it: the whole handheld thing is patently absurd. Mostly visionaries who fail IMO to understand the difference between theory and practice... There is no use case for a FP32-centric device in this field, and there are massively better architectures *on the market today* for every single application you could ever imagine. These solutions already are orders of magnitude more efficient than x86 CPUs which GPUs compare favorably to.
Might be useful for non-graphics tasks in games, especially because proprietary hardware won't be often exposed let alone standardized, but beyond that I'm very very skeptical. I also laughed at this sentence: "such as being able to point the phone's camera at a building and then process the image so that it can tell you which building it is." - right, because GPS and location-aware services (showing nearby buildings) could *never* do that for a billionth the cost and the power while delivering a better user experience... right?
I'm sorry for being a bit mean here, but I'm not a big fan of random predictions that contradict the fundamental dynamics of computer architecture and system design just because they'd benefit you strategically. And I thought Intel had patented that intellectual process, anyway?
Arnold Beckenbauer
14-Nov-2008, 11:32
GPGPU Revolution: OpenCL to launch next week (http://theovalich.wordpress.com/2008/11/14/gpgpu-revolution-opencl-to-launch-next-week/)
:smile:
OpenCL on the Fast Track (http://www.hpcwire.com/blogs/OpenCL_On_the_Fast_Track_33608199.html)
TimothyFarrar
14-Nov-2008, 15:24
"Wearing his NVIDIA hat, Trevett says his company is fully supportive of the OpenCL effort and they're going to be careful not to set up CUDA as an OpenCL competitor."
So perhaps CUDA remains as the low level interface OpenCL uses to access the hardware on NVidia's cards?
There's been stuff on-line about OpenCL for quite some time, URL doesn't seem to be widely spread though.
There's a whole load of other interesting stuff as well. Enjoy:
http://s08.idav.ucdavis.edu/
Pressure
19-Nov-2008, 16:40
OpenCL is now completed! A record breaking 6 months because of Apple's tight schedule for Snow Leopard (Mac OSX 10.6).
Read about it at Macworld (http://www.macworld.com/article/136921/2008/11/opencl.html).
rpg.314
19-Nov-2008, 16:43
It'll be a while before they release the specs. Small tidbits were shown at sc08 though.
Tim Murray
19-Nov-2008, 17:42
I was at the OpenCL meeting at SC2008, and I don't think there's anything that was shown here that's new compared to what you saw at SIGGRAPH.
rpg.314
19-Nov-2008, 19:11
I saw the siggraph slides. Can we take it then there is not much extra in the spec compared to what was shown at sc08?
mhouston
19-Nov-2008, 21:02
Only a snippet of what is in the spec was shown at SC08 to give people a simple example. Things have changed in the spec since SIGGRAPH, but the talks at SIGGRAPH and SC08 were very similar.
I'm an OpenCL. And, I'm a CUDA.
rpg.314
21-Nov-2008, 20:19
I'm an OpenCL. And, I'm a CUDA.
I am sorry, but what's your point?
digitalwanderer
21-Nov-2008, 21:53
I'm an OpenCL. And, I'm a CUDA.
I am a rock, I am an island. :|
NocturnDragon
21-Nov-2008, 23:49
I'm an OpenCL. And, I'm a CUDA.
Are you also a bee? :wink:
Well he was obviously referring to the "I'm a Mac. And I'm a PC" ads ;)
Freak'n Big Panda
24-Nov-2008, 17:16
Pardon the ignorant question but how exactly does OpenCL work? How does it target all these various hardware platforms? Will CUDA and CTM act as a layer between OpenCL and the hardware? Interpreting the OpenCL calls and dispatching them like the drivers do with OpenGL and Direct 3D? How does the picture change when OpenCL is used to code for cell, larry, or others?
Dave Baumann
24-Nov-2008, 19:13
There is no CTM...
Tim Murray
24-Nov-2008, 19:48
As far as I can tell, CTM went away and was replaced by CAL, which seems to encompass both a hardware-independent intermediate assembly format (PTX in the CUDA vernacular) and the actual assembly that the hardware executes (which was basically the old CTM). I assume there are also host-side APIs for controlling the GPU that differ between the two.
vjPiedPiper
24-Nov-2008, 21:18
Does anyone know how OpenCL will work with OpenGL..
eg. my application currently uses OpenGL for displaying the GUI, and video/images - will it be possible to simply add in some OpenCL compute stuff in the background?
or will it require a considerable re-design to access both OpenGL and CL functionality
( hoping to use CL on a high end GeForce/Quadro card)
Cheers,
Vj PiedPiper
Arnold Beckenbauer
24-Nov-2008, 21:50
Pardon the ignorant question but how exactly does OpenCL work? How does it target all these various hardware platforms? Will CUDA and CTM act as a layer between OpenCL and the hardware? Interpreting the OpenCL calls and dispatching them like the drivers do with OpenGL and Direct 3D? How does the picture change when OpenCL is used to code for cell, larry, or others?
...
I don't think it was that press release, rather a conference held before HD 4870 X2's launch - there was an element of sensationalism in the headline. In reality CTM evolved into CAL some time ago, and CAL will remain as the enabler for our Stream compute ecosystem by being the interface to the hardware - OpenCL, Cobra, Brook+, 3rd party toolsets will layer on top of this.
...
Mr. Baumann is lazy. :lol:
Freak'n Big Panda
25-Nov-2008, 13:44
Well thanks Arnold! So my assumption was right then, CUDA and CAL will act as a layer between OpenCL and the hardware. So if IHVs want to support OpenCL, they have to be prepared to code that layer.
I hope NV decides to support OpenCL over CUDA but I doubt that will happen, it'll probably be the other way around... CUDA getting the majority of the support with OpenCL a mere afterthought in comparison. Companies always prefer to promote proprietary technologies at the expense of consumers. Sad but true.
This seems very simple to me. I don't see how anyone can think Nvidia won't support OpenCL. If they don't, they won't get any newer Apple contracts.
I think the fact that there's heat-gpu issues with the nVidia chips has really shot them in the foot for intensive GPU operations via OpenCL with Apple..
LLVM/Clang is a pseudo instruction set - this is then mapped to the GPU/CPU however it does support SIMD in the form of vector result = operation <vector input>,<vector input>.
It doesn't support GPU-based SPMD (i.e. apply all of this fragment program to this vector input) with gather & scatter, from what I can see. If this is the case then I think Apple's direction is Intel Larrabee rather than GPU-based.
If you've not seen the patent application and information:
http://forums.macrumors.com/showthread.php?t=588206
It's just a compiler framework, since OpenCL is unlikely to be much higher level than CUDA the SPMD angle isn't all that relevant to the actual compilation.
GLSmurf
08-Dec-2008, 17:53
The spec is now out: http://www.khronos.org/registry/cl/
NocturnDragon
08-Dec-2008, 18:56
Specs are out:
http://www.khronos.org/registry/cl/
The OpenCL API registry contains specifications of the core API; specifications of Khronos- and vendor-approved OpenCL extensions; header files corresponding to the specifications; and other related documentation.
OpenCL Core API Specification, Headers, and Documentation
OpenCL 1.0 Specification (http://www.khronos.org/registry/cl/specs/opencl-1.0.29.pdf) (December 5, 2008).
cl.h (http://www.khronos.org/registry/cl/api/1.0/cl.h) - OpenCL 1.0 Header File.
cl_gl.h (http://www.khronos.org/registry/cl/api/1.0/cl_gl.h) - OpenCL 1.0 OpenGL Integration Header File.
cl_platform.h (http://www.khronos.org/registry/cl/api/1.0/cl_platform.h) - OpenCL 1.0 Platform-Dependent Macros.
rpg.314
08-Dec-2008, 20:52
thanks for that. Will go thru these.
I just skimmed over it for a few minutes, and I think it looks quite nice. In general, it's not very different from CUDA, but it looks more "tight," which is not a bad thing. It also supports SIMD, which should be quite good for CPU, CELL, and R600 I presume. Support for 16-bit floating point numbers seems to be useful mainly for embedded/mobile devices, though.
I also like the data parallel model support, where a program won't have to worry about work group details. CUDA does not seem to support this as programs still have to decide a "block size" even in data parallel programs. Of course, this is a small problem for CUDA as in general a large enough block size can cover most cases efficiently.
I'm looking forward to a nice OpenCL implementation, either on CPU or GPU, which would be very good at least for testing purposes.
Yeah, it looks a *lot* like CUDA, especially 2.0. Not sure what you mean by "tight", but I must say I appreciate the looser execution order, which helps a lot on Cell. CUDA's channels are more friendly, on the other hand.
I wouldn't expect too much magic from the automatic data-parallelism. You are still limited by local-memory (i.e. shared in CUDA) size, so you still have to optimize block size. Basically, you only get automatic register-based block-size detection, if I understand correctly.
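The block-size limit described above can be written down as a small calculation. This Python sketch assumes G80-like budgets (16 KB of local/shared memory, 8192 registers, at most 512 work-items per group, scheduling in units of 32); those defaults and the function name `max_work_group_size` are assumptions for illustration, not values taken from the OpenCL spec:

```python
def max_work_group_size(local_bytes_per_item, regs_per_item,
                        local_mem_bytes=16 * 1024, register_file=8192,
                        hw_limit=512, granularity=32):
    """Largest work-group size that fits the assumed per-multiprocessor budgets."""
    if regs_per_item <= 0:
        raise ValueError("a kernel uses at least one register per work-item")
    by_regs = register_file // regs_per_item
    # A kernel that uses no local memory is not constrained by it at all.
    by_local = local_mem_bytes // local_bytes_per_item if local_bytes_per_item else hw_limit
    size = min(hw_limit, by_regs, by_local)
    if size < granularity:
        return size
    # Round down to a whole number of scheduling units (warps on NVIDIA parts).
    return (size // granularity) * granularity
```

For example, a kernel needing 64 bytes of local memory and 10 registers per work-item is capped at 256 by local memory, which a register-only heuristic would miss (it would allow 512).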
What I want to see is how (or if) Nvidia deals with the task-parallel model. This seems to be mostly aimed at SIMD architectures which can shuffle elements in the vectors. Now I am not sure if Tesla can do that, as it appears to me that they basically transform all vector-code into scalar code and then run 8 instances of that in their 8-way SIMD-processors (*). This would basically mean that there is no difference between the models.
If, however, they really export float8 (or float16?), that would be fun. :)
FP16 is actually a really useful format. A 10bit mantissa is quite ok for a lot of image data, even in scientific computing.
(* Wild guess based on CUDA behaviour. Probably ridiculously wrong.)
Oh, by "tight" I mean it's more well documented in some ways. For example, the behavior of type casts is extensively specified, compared to the rather loose CUDA document, in which many behaviors are "implied."
Regarding the automatic data parallelism, I think it's possible to some extent, as the size of local memory required by a kernel is known at compile time, so a compiler can decide whether the shared memory is big enough to be used as local memory (that's mostly because registers in G80/G200 can't be indexed). However, bank conflicts can be serious trouble, which sometimes can't be easily solved by a compiler.
The SIMD part is a bit more complicated though. Internally G80/G200 is grouped with two 4D SIMD processors, but they tend to work together, like an 8D SIMD processor, with some relaxation especially in branch divergence (the two 4D processors can go different ways in a conditional branch). However, due to latency concerns, they tend to work on 4 "threads" to hide latency, so in some ways they behave like two 16D SIMD processors. Of course, if you have enough threads, it should be possible to present them as scalar processors (like what CUDA does), 4D vector processors, or 16D vector processors (if you have fewer threads).
Will OpenCL run on any current cards with a driver update (like OpenGL 3 does) or will it require new hardware like D3D10?
If it does run, what cards can be expected to run it? Are those of us with 32-bit only cards out of luck?
rpg.314
09-Dec-2008, 21:26
It has native float16 support requirement. Wonder if Intel's LRB dreams induced it in the spec.....
16-bit floating point is only required as a storage format. Computation ability is specified in an extension. So I guess even if a device does not support 16-bit floating point, it can still support OpenCL as long as it can load/store 16-bit floating point numbers, which can be done in "software" if not supported in hardware. Most GPUs should have support for loading/storing 16-bit floating point numbers at least in texture samplers, though, so it shouldn't be a problem.
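"Storage only" means values are narrowed to 16 bits when written and widened again when read, with all arithmetic done at higher precision. As a rough illustration, Python's standard `struct` module can do this software load/store with its `e` (IEEE 754 binary16) format; the helper names below are made up:

```python
import struct


def store_half(value: float) -> bytes:
    """Narrow a float to 2 bytes of IEEE 754 binary16 (little-endian)."""
    return struct.pack("<e", value)


def load_half(raw: bytes) -> float:
    """Widen 2 bytes of binary16 back to a regular float."""
    return struct.unpack("<e", raw)[0]


def round_trip(value: float) -> float:
    """What a value looks like after being stored as a half and loaded again."""
    return load_half(store_half(value))
```

Values such as 1.0 or 0.5 survive the round trip exactly, while 0.1 comes back slightly off and 2049.0 can no longer be represented, which is the practical effect of the 10-bit mantissa mentioned earlier in the thread.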
As for support of current devices... I don't see any part of OpenCL which can't be supported by G8X/G200. The "minimum" size of some attributes (such as local memory size) seems to be the same as G8X. I'm not sure about R600 though, but I think it should be ok.
rpg.314
09-Dec-2008, 21:49
IIRC IBM is in this group. The spec specifically mentions CELL/BE in at least one place. It would be interesting to know how they implement stuff like image read and write on Cell's SPEs, which support only explicit DMAs. On GPUs, it's a fairly simple texture fetch.
I guess that they will cache the image in LS but of course it depends on how big an image can you have. I don't remember them having power of two or size requirements on image sizes though.
Implementing texture sampling on CELL is relatively easy; making it fast is another matter.
CouldntResist
10-Dec-2008, 01:38
Surprised with:
+ Broad scope, encompassing both CUDA's scalar grid model and explicit SIMD
+ Extension mechanism copied from GL
Disappointed with:
- Missing types: 3D vectors, matrices
- Lots of missing texturing capabilities: mipmapping, anisotropy, 1D, Cubemap. I can only forgive lack of 1D/2D/Cubemap/Array, it's ok to leave them for an extension.
I'm afraid DX11 will have all of these.
Disappointments continued:
- Too primitive language. This domain begs for "C with templates" at the very least (and it's not a big problem for a C-dialect compiler, believe me)
- Why do all GPGPU API designers keep pretending ROPs don't exist? Why?
OpenCL will basically run on any card that can do CUDA, which is a lot of chips -- pretty much any 8-series GPU with 256MB of RAM or above.
http://www.engadget.com/2008/12/09/nvidia-dishes-about-opencl/
Great stuff.
Andrew Lauritzen
10-Dec-2008, 04:54
- Why do all GPGPU API designers keep pretending ROPs don't exist? Why?
Because it's not meant to be just a GPGPU language. It's meant to target multi-core CPUs, Cell and other future parallel hardware as well.
Tim Murray
10-Dec-2008, 05:13
Because it's not meant to be just a GPGPU language. It's meant to target multi-core CPUs, Cell and other future parallel hardware as well.
yeah, I don't think that ROPs really get you that much...
*ducks as all the people who have been complaining about atomic floating point operations throw things at him*
It's a language with extensions, so other than committee pressure, there's nothing stopping vendor exposure of that hardware in OpenCL.
I don't think that ROPs really get you that much...
Yeah, you better duck :razz:
CouldntResist
10-Dec-2008, 12:15
Because it's not meant to be just a GPGPU language. It's meant to target multi-core CPUs, Cell and other future parallel hardware as well.
Well, Cell doesn't have texturing hardware, yet this unit is exposed in OpenCL (albeit in a half-assed way, but that's a whole other problem). There are other examples of optional functionality in the spec, so I don't see a reason to ignore ROP units.
atomic_add currently works with int32/64. Extend it to handle float/unorm vectors, and you have additive blending... Actually, I'd prefer a more complete solution with an equivalent of sampler_t, which would be used for texture writing functions.
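For what it's worth, the usual way to get a float add out of integer-only atomics is a compare-and-swap retry loop on the raw bit pattern. Below is a minimal single-threaded Python sketch of that idea, not anything taken from the spec: `compare_and_swap` is a hypothetical stand-in for an integer atomic compare-exchange, and `mem` is a plain list standing in for a buffer of 32-bit words.

```python
import struct

def float_to_bits(x):
    """Reinterpret a float32 value as its 32-bit unsigned integer pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b):
    """Reinterpret a 32-bit unsigned integer pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def compare_and_swap(mem, index, expected, new):
    """Hypothetical stand-in for an integer atomic compare-exchange.
    Stores `new` only if the word still equals `expected`; returns the old word."""
    old = mem[index]
    if old == expected:
        mem[index] = new
    return old

def atomic_add_float(mem, index, value):
    """Add `value` to the float stored as raw bits in mem[index].
    Retries until no other writer changed the word in between; returns the previous float."""
    while True:
        old_bits = mem[index]
        new_bits = float_to_bits(bits_to_float(old_bits) + value)
        if compare_and_swap(mem, index, old_bits, new_bits) == old_bits:
            return bits_to_float(old_bits)

# Additive "blend" of 0.25 into a pixel that currently holds 0.5.
framebuffer = [float_to_bits(0.5)]
previous = atomic_add_float(framebuffer, 0, 0.25)
blended = bits_to_float(framebuffer[0])
```

On real hardware the loop can spin under contention, which is part of why a dedicated blending path would be attractive.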
Well, Cell doesn't have texturing hardware, yet this unit is exposed in OpenCL (albeit in a half-assed way, but that's a whole other problem). There are other examples of optional functionality in the spec, so I don't see a reason to ignore ROP units.
There is basically only one use for having full-blown support for texturing units: Fast rendering. Now adding something to the standard for just one type of application that will probably only run on hardware that already supports this application more directly isn't really a good idea. Especially if people might use it, thinking it's cheap.
ROPs are a bit of a different beast, as I'm pretty sure they are implemented in a nicely inflexible way. I would be very surprised if the hardware supported "uncoalescing" writes.
Oh, and Andy, cool to see your name in the spec. ;)
CouldntResist
10-Dec-2008, 13:59
There is basically only one use for having full-blown support for texturing units: Fast rendering.
If full-blown texturing is unnecessary, then why bother with addressing modes, normalized coordinates and a limited choice of pixel formats? You could have just a plain C-style array, marked as read_only and cachable.
I'm seeing this in an application-agnostic way. If someone finds texturing useful for anything at all (that is, he sees an advantage over a plain array), then it is highly probable he would find those missing features useful too. Seriously, I don't see a reason to expose them crippled. (Just for clarity, I'm fine with the optional way)
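To make concrete what a sampler adds over a plain array, here is a rough Python sketch of a nearest-texel lookup with normalized coordinates and two addressing modes. The mode names are made up for illustration and only loosely model what CLK_ADDRESS_CLAMP_TO_EDGE and CLK_ADDRESS_REPEAT do; it is an assumption-laden sketch, not the spec's exact rounding rules.

```python
import math

def sample_nearest(image, u, v, address_mode="clamp_to_edge"):
    """Nearest-texel lookup. `image` is a list of rows; u and v are normalized,
    so [0, 1) spans the whole image regardless of its size."""
    height = len(image)
    width = len(image[0])
    if address_mode == "repeat":
        # Keep only the fractional part, so the image tiles infinitely.
        u = u - math.floor(u)
        v = v - math.floor(v)
    x = int(math.floor(u * width))
    y = int(math.floor(v * height))
    # Out-of-range coordinates are clamped to the nearest edge texel.
    x = min(max(x, 0), width - 1)
    y = min(max(y, 0), height - 1)
    return image[y][x]

image = [[1, 2],
         [3, 4]]
inside = sample_nearest(image, 0.25, 0.25)              # top-left texel
clamped = sample_nearest(image, 1.25, 0.25)             # past the right edge
wrapped = sample_nearest(image, 1.25, 0.25, "repeat")   # wraps back to the left
```

With a plain read-only array all of this index arithmetic would have to be written out in every kernel.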
Now adding something to the standard for just one type of application that will probably only run on hardware that already supports this application more directly isn't really a good idea. Especially if people might use it, thinking it's cheap.
As I said, the concept of exposing optional functionality already covers that argument.
ROPs are a bit of a different beast, as I'm pretty sure they are implemented in a nicely inflexible way. I would be very surprised if the hardware supported "uncoalescing" writes.
It didn't stop CUDA & OpenCL from exposing atomic ops.
If full-blown texturing is unnecessary, then why bother with addressing modes, normalized coordinates and a limited choice of pixel formats? You could have just a plain C-style array, marked as read_only and cachable.
I have been wondering about this, to be honest. It makes sense to expose it for CUDA, as it's GPU only. Being able to filter fast is a good thing in many applications. On the other hand, they don't expose any functionality that needs derivatives, which is probably a typical Nvidia "We don't want to tell you how this works"-thing, more than a technical limitation.
Now having it in CL is a bit... strange. You could argue that it's a convenience function, or that you use it to allow the use of texture caches on GPUs, or whatever. But quite frankly, I think it should have been an extension.
(Just for clarity, I'm fine with optional way)
Then go and hassle AMD and Nvidia about it. ;)
It didn't stop CUDA & OpenCL from exposing atomic ops.
Not sure how that relates. I haven't used CUDA's atomics yet, but I understood them to be a relatively low performance option. From the documentation, they don't even seem to support coalescing, so you'll get sequential access, even if none of the threads in your half-warp cause a collision.
I guess the reason you want ROPs is because they are fast. But then again, I'm not sure it will do you much good. ROPs are not really atomic, they simply exploit that a rasterizer will never generate two fragments at the same position for the same triangle. To be able to even process more than two triangles at a time, you basically have to make sure they don't use the same pixels. Translated into CL, this basically requires you to do the ordering yourself and then use the ROPs for simple blending. Now given enough threads, this is only a handful of cycles...
I have been wondering about this, to be honest. It makes sense to expose it for CUDA, as it's GPU only. Being able to filter fast is a good thing in many applications. On the other hand, they don't expose any functionality that needs derivatives, which is probably a typical Nvidia "We don't want to tell you how this works"-thing, more than a technical limitation.
To my understanding, CUDA's texture operations provide two essential functions, the first one is automatic type conversion for many common pixel types (e.g. from 32 bits RGBA to a 4D floating point vector), the second one is that reading from texture is spatially cached, so it can be helpful in some special cases.
In OpenCL, the second function is probably downplayed a little. So I guess it's mostly about the first one.
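The first function is easy to picture in software. A minimal Python sketch of the conversion, assuming 8-bit unsigned-normalized channels with R in the lowest byte (the byte order is my assumption, real formats vary):

```python
def unpack_rgba8_unorm(pixel):
    """Split a 32-bit packed RGBA value into four floats in [0, 1].
    Assumes R occupies the low byte and A the high byte."""
    r = pixel & 0xFF
    g = (pixel >> 8) & 0xFF
    b = (pixel >> 16) & 0xFF
    a = (pixel >> 24) & 0xFF
    # Unsigned-normalized: 0 maps to 0.0 and 255 maps to 1.0.
    return (r / 255.0, g / 255.0, b / 255.0, a / 255.0)

opaque_red = unpack_rgba8_unorm(0xFF0000FF)
```

The texture unit does this per fetch for free, whereas a kernel reading a raw buffer has to spend ALU instructions on the shifts, masks and divides.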
I read most of the specs and as expected the data-parallel execution model and the memory model are very similar to CUDA, the latter basically being CUDA-with-collapsible-memory-spaces. I like the addition of 8 and 16 value vector types for more explicit SIMD. And I much prefer the way kernel invocations work -- CUDA's <<< >>> extension is cute but makes parsing the code with standardized tools needlessly complicated.
Is there any information/speculation about who (if anyone) will provide optimized multicore CPU implementations of OpenCL? I guess Apple will do so for macs, but what about other OSs? Will Intel and/or AMD step forward?
Also, a few months ago I heard somewhere that Apple uses llvm/clang in their implementation. I don't expect anyone here to be able to confirm/deny this, but it doesn't hurt to ask. (I'm currently evaluating clang for a mostly unrelated project)
There is basically only one use for having full-blown support for texturing units: Fast rendering.
That's not 100% true. I used texture filtering to speed up a multigrid solver that has nothing at all to do with rendering. I admit that's a slightly far-fetched example though, and I'd be fine with an extension.
Tim Murray
11-Dec-2008, 19:16
And I much prefer the way kernel invocations work -- CUDA's <<< >>> extension is cute but makes parsing the code with standardized tools needlessly complicated.
you realize that the OpenCL host-side API is pretty much just the CUDA driver API, right, which works from any C compiler you want?
mhouston
11-Dec-2008, 23:01
Many on the working group would disagree with that statement I think. ;-) OpenCL can do stuff CUDA can't, like the queuing model, event tracking, host pointer usage, full async support, etc.
Tim Murray
11-Dec-2008, 23:37
What does OpenCL do in terms of the asynchronous model that CUDA doesn't? CUDA certainly can do queuing (command streams), async, event tracking--the only thing I'm not sure about is host pointer usage, and that's just because I'm not sure how OpenCL uses host pointers.
mhouston
11-Dec-2008, 23:58
Take a long look at the OpenCL queue model. I'm not saying you couldn't contort things to do something similar in CUDA, and I expect Nvidia will do a runtime that maps things to the CUDA driver API, but the CUDA API doesn't present the same abstractions or level of control. There are also many other subtle differences.
There are similarities between OpenCL and CUDA, but there are also similarities to the Cell SDK, TBB, CAL, and a plethora of other parallel programming languages and APIs. There were a lot of people involved in designing OpenCL from many companies and with different backgrounds, and this is reflected in the specification.
That's not 100% true. I used texture filtering to speed up a multigrid solver that has nothing at all to do with rendering. I admit that's a slightly far-fetched example though, and I'd be fine with an extension.
That's interesting. :)
Did you just use mipmaps to select sampling frequency, or did you actually find use for AF?
CouldntResist
12-Dec-2008, 20:58
I guess the reason you want ROPs is because they are fast. But then again, I'm not sure it will do you much good. ROPs are not really atomic, they simply exploit that a rasterizer will never generate two fragments at the same position for the same triangle. To be able to even process more than two triangles at a time, you basically have to make sure they don't use the same pixels. Translated into CL, this basically requires you to do the ordering yourself and then use the ROPs for simple blending. Now given enough threads, this is only a handful of cycles...
Surely, it is possible to code your way around missing access to ROPs. If I understand you right, you are arguing that it wouldn't lead to big losses. But I think it can get more complicated than that. In OpenCL you're not allowed to do a read-modify-write operation on image2d_t data, because it must be declared either read_only or write_only (thanks to Timothy for pointing that out). Therefore, results of a read-modify-write must go to an OpenCL buffer object. This has several negative effects:
a) storage of results in Buffer object means no sampling for you in subsequent commands. If you needed to do so, you'd have to enqueue manual copy from buffer to image, and then sample data from there.
b) it defies "data is shared, not copied" principle of integration with GL.
c) it adds to your frustration, as you're constantly aware that in your GPU there is nice piece of hardware specialised in efficient read-modify-write ops, and it's staying all idle ;)
TimothyFarrar
13-Dec-2008, 00:55
Really what is needed in terms of general programming from the ROP unit is an atomic coherent data cache which is vector scatter/gather friendly. It isn't as if this cache needs to be huge, it just needs to be a good bit more bandwidth efficient than going directly to global memory. I'd also argue that latency isn't a huge concern with atomic operations either (given the massively parallel model of GPUs). So perhaps that is the approach which future hardware is going to steer towards, and eventually the fixed function ROP goes away.
OpenCL does support this model through atomic operations on buffer objects, however currently atomic operations are all extensions to my knowledge.
Two other things lost with the missing ROP interface are free format conversion and address translation for 2D cache locality. Personally I'd rather have the possibly greater ALU capacity of the above model (without a dedicated ROP unit) than a dedicated ROP which duplicates the format conversion hardware in the ALU units.
However one thing I would have liked to have seen done differently in OpenCL is dealing with address translation for 2D locality. What I'd like from the hardware, with API access to it, is a standard way to control address translation of both buffers and images in a generic way (and the same way). So all buffers and images would be linear program side, but during buffer/image creation there would be a per object flag to turn on address bit reordering for better 2D locality of data. Then also allow buffer and image aliasing. So you could write to a buffer using a linear address which is also aliased as a texture, and it ends up as your responsibility to deal with coherency issues of a non-coherent read-only texture cache (think of this responsibility as being similar to the restrict keyword in C99). So basically think generic "tiling" or "Z-ordering" or "swizzling" depending on what term you use to describe this, but happening at the memory controller (say on a machine with virtual memory this would be a per page flag) instead of address translation done in the texture unit.
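For anyone unfamiliar with the "Z-ordering" term, the address bit reordering can be sketched as interleaving the bits of x and y (a Morton code). This Python sketch only illustrates the general technique; the function names are made up, and actual GPU tiling layouts are vendor-specific and not described in this thread.

```python
def part1by1(n):
    """Spread the low 16 bits of n so that a zero bit sits between each original bit."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def morton_index(x, y):
    """Interleave the bits of x (even positions) and y (odd positions)."""
    return part1by1(x) | (part1by1(y) << 1)

def linear_index(x, y, width):
    """Plain row-major addressing, for comparison."""
    return y * width + x

# The four texels of a 2x2 block end up adjacent in memory under Z-order...
z_block = [morton_index(x, y) for y in (0, 1) for x in (0, 1)]
# ...while row-major addressing puts the second row a full `width` away.
linear_block = [linear_index(x, y, 1024) for y in (0, 1) for x in (0, 1)]
```

The program still computes (x, y) as if the data were linear; only the final address bits get shuffled, which is why doing it at the memory controller would work for buffers and images alike.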
you realize that the OpenCL host-side API is pretty much just the CUDA driver API, right, which works from any C compiler you want?
Sure, the problem is that most "real" CUDA apps I have come across use the runtime and not just the driver API. Which uses the nonstandard C syntax extensions. (this is a problem for me since I'd like to use standard parsing tools for generating an AST and subsequent analysis of those programs. Of course it's possible to work around this)
That's interesting. :)
Did you just use mipmaps to select sampling frequency, or did you actually find use for AF?
No, no AF. In fact in the end I didn't use that particular implementation at all since the low accuracy of the hardware filtering ops reduced my algorithm's convergence more than the added speed helped. As I said, it's a far-fetched example.
Andrew Lauritzen
15-Dec-2008, 17:31
No, no AF. In fact in the end I didn't use that particular implementation at all since the low accuracy of the hardware filtering ops reduced my algorithm's convergence more than the added speed helped. As I said, it's a far-fetched example.
Oh yeah, abusing hardware filtering for general purpose computation is a regular pastime of mine :) I've used aniso hardware for (approximate) integrations through volumes, and bilinear/trilinear is useful everywhere; in particular, I've used it for b-spline patch evaluation, although I did also run into the low precision of the fractional part of the coordinates :(
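The precision issue is easy to reproduce in software. Here is a small Python sketch of a bilinear lookup where the interpolation weights can optionally be rounded to a fixed number of fractional bits. The choice of 8 bits is an assumption for illustration only; nothing in this thread states the actual hardware precision.

```python
import math

def bilinear(image, x, y, frac_bits=None):
    """Bilinear lookup at unnormalized (x, y), with texel centers at integer coordinates.
    Assumes 0 <= x <= width - 1 and 0 <= y <= height - 1.
    If frac_bits is given, the weights are rounded to that many fractional bits."""
    x0 = int(math.floor(x))
    y0 = int(math.floor(y))
    fx = x - x0
    fy = y - y0
    if frac_bits is not None:
        scale = 1 << frac_bits
        fx = round(fx * scale) / scale
        fy = round(fy * scale) / scale
    # Clamp the neighbouring texel indices at the right and bottom edges.
    x1 = min(x0 + 1, len(image[0]) - 1)
    y1 = min(y0 + 1, len(image) - 1)
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

ramp = [[0.0, 1.0],
        [0.0, 1.0]]
exact = bilinear(ramp, 0.3, 0.0)
quantized = bilinear(ramp, 0.3, 0.0, frac_bits=8)
error = abs(quantized - exact)
```

On this ramp the quantized result lands on the nearest multiple of 1/256 rather than on 0.3, and an error of that size per lookup is the kind of thing that can hurt an iterative solver's convergence.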
Really what is needed in terms of general programming from the ROP unit is an atomic coherent data cache which is vector scatter/gather friendly. It isn't as if this cache needs to be huge, it just needs to be a good bit more bandwidth efficient than going directly to global memory. I'd also argue that latency isn't a huge concern with atomic operations either (given the massively parallel model of GPUs). So perhaps that is the approach which future hardware is going to steer towards, and eventually the fixed function ROP goes away.
Of course latency, small size, and lots of parallel threads are at odds with each other. I understand you mean latency isn't important from a software standpoint, but from a hardware standpoint a lot of parallel threads are required to hide high latencies and in order to have a lot of parallel threads you need a lot of memory.
frogblast
17-Dec-2008, 01:57
Surely, it is possible to code your way around missing access to ROPs. If I understand you right, you are arguing that it wouldn't lead to big losses. But I think it can get more complicated than that. In OpenCL you're not allowed to do a read-modify-write operation on image2d_t data, because it must be declared either read_only or write_only (thanks to Timothy for pointing that out). Therefore, results of a read-modify-write must go to an OpenCL buffer object. This has several negative effects:
a) storage of results in Buffer object means no sampling for you in subsequent commands. If you needed to do so, you'd have to enqueue manual copy from buffer to image, and then sample data from there.
b) it defies "data is shared, not copied" principle of integration with GL.
c) it adds to your frustration, as you're constantly aware that in your GPU there is nice piece of hardware specialised in efficient read-modify-write ops, and it's staying all idle ;)
GPUs have dedicated hardware for pulling data from textures, and have dedicated hardware for fixed function read-modify-write, but it isn't the same hardware! On a graphics-first/gpgpu-second device (which GPUs from both NV and ATI are, no matter what their marketing claims), why would you pay the performance cost for making these operations coherent?
Sure, OpenCL could present a much more uniform memory model, if they were unconcerned with it actually running on current-gen GPUs.
CouldntResist
27-Dec-2008, 21:54
The latest DX SDK includes the DX 11 Preview. The documentation is still Swiss cheese, but I learned a few things:
They do have textures that can be read from and written to in Compute programs. And these textures can have mipmaps (and array slices) accessed by integer coordinates.
I think it is highly probable that Compute programs will be able to access fully featured read-only texture objects, with the only exception being sampling with implicit derivatives (obviously).
A consequence of the availability of read-write images is that it will be possible to reimplement the traditional ping-pong usage pattern so that a render target switch incurs only the cost of a memory barrier, instead of issuing another program invocation (as is done with graphics APIs).
By the way, a question: do the contents of local memory survive between kernel calls?
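For readers who haven't met the ping-pong pattern: two buffers alternate between the "read" and "write" roles on each pass. The Python sketch below is only a host-side simulation with made-up names; the comment marks where a GPU version would need its synchronization point.

```python
def blur_pass(src, dst):
    """One pass: each output element is the average of itself and its two
    neighbours in `src` (indices clamped at the edges)."""
    n = len(src)
    for i in range(n):
        left = src[max(i - 1, 0)]
        right = src[min(i + 1, n - 1)]
        dst[i] = (left + src[i] + right) / 3.0

def run_ping_pong(data, passes):
    """Run `passes` blur passes, swapping the roles of the two buffers each time."""
    read_buf = list(data)
    write_buf = [0.0] * len(data)
    for _ in range(passes):
        blur_pass(read_buf, write_buf)
        # Swap roles. On a GPU this is where either a new kernel launch or
        # (with read-write images) just a memory barrier would go.
        read_buf, write_buf = write_buf, read_buf
    return read_buf

smoothed = run_ping_pong([0.0, 3.0, 0.0], 1)
```

Each pass only ever reads one buffer and writes the other, which is exactly the restriction the read_only/write_only image qualifiers enforce per kernel.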
CouldntResist
27-Dec-2008, 21:59
Then also allow buffer and image aliasing. So you could write to a buffer using a linear address which is also aliased as a texture
If aliasing is the solution, then I wonder if the API would tolerate a little abuse, if the user tried to bind a single image object to two kernel params: one read_only, the other write_only?
By the way, a question: do the contents of local memory survive between kernel calls?
No, it doesn't. I would have to retry, but IIRC shared-mem is all zeros when a block starts (in CUDA). It wouldn't be too much use anyway, as you don't know which blocks get scheduled to which MP.
Tim Murray
05-Jan-2009, 09:11
No, it doesn't. I would have to retry, but IIRC shared-mem is all zeros when a block starts (in CUDA). It wouldn't be too much use anyway, as you don't know which blocks get scheduled to which MP.
Shared (or local, if you're in OpenCL-land) memory is undefined at the beginning of a kernel invocation.
Arnold Beckenbauer
07-Feb-2009, 00:15
AMD OpenCL parallel computing demo from Siggraph Asia 2008 (http://fireuser.com/blog/amd_opencl_parallel_computing_demo_from_siggraph_asia_2008/)
willardjuice
07-Feb-2009, 00:39
Weird, they used Phenom II to display OpenCL's power? As someone who is looking to do GPGPU work on a R770, this is not an encouraging sign.
rpg.314
07-Feb-2009, 05:13
Good, so they got their cpu backend up and running.
Tim Murray
25-Feb-2009, 19:08
Linux OpenCL on a Quadro FX 570M:
http://www.youtube.com/watch?v=dXy_ssSGuy0
This is the demo we showed at SIGGRAPH Asia back in December. It's the CUDA SDK nbody sample ported to OCL (so no native kernels or anything like that, it's all 100% OCL).
trinibwoy
25-Feb-2009, 21:12
Any word on when OpenCL support will appear in public drivers?
willardjuice
25-Feb-2009, 21:24
Err scratch that
(edited to remove unsubstantiated information)
trinibwoy
25-Feb-2009, 21:43
Why is it up to them? I thought that version 1.0 of the spec was done and dusted.
rpg.314
26-Feb-2009, 06:39
Linux OpenCL on a Quadro FX 570M:
http://www.youtube.com/watch?v=dXy_ssSGuy0
This is the demo we showed at SIGGRAPH Asia back in December. It's the CUDA SDK nbody sample ported to OCL (so no native kernels or anything like that, it's all 100% OCL).
Cool
Arnold Beckenbauer
20-Mar-2009, 00:34
Damien's article is finally in English: OpenCL: democracy for GPU computing? (http://www.behardware.com/articles/744-1/opencl-democracy-for-gpu-computing.html)
Here are some new creations from the ATi Stream SDK 1.4:
http://www.abload.de/img/streamt1cz.jpg
Ailuros
20-Mar-2009, 08:41
Granted Damien is most likely reporting what NV tells him, but in my mind the following doesn't render in those strict absolutes:
Indeed Neil Trevett, formerly of 3D Labs and currently VP of Embedded Content at NVIDIA, heads up the OpenCL work group. Given the similarity between C for CUDA and OpenCL and the fact that, officially, Apple initiated the idea and put it forward for discussion (after having decided to equip its new Macs with NVIDIA products…), NVIDIA can in reality be seen as a joint instigator.
Neil Trevett isn't only the "head" of the OpenCL work group but of the entire Khronos Group. The fact that he's VP for Embedded Content at NV doesn't mean anything to me in the grander scheme of things. The Khronos Group is also concentrating heavily on embedded markets, and there Apple not only has a multi-year/multi-license agreement with Imagination and shares in the latter, but I also don't see any Tegra in Apple's latest iPhone. Nor does NV have the same market penetration as IMG currently has in those markets when it comes to 3D.
If Apple's OpenCL initiative were concentrated only on mainstream and upwards devices, the picture would naturally change. But since Apple obviously has a wider range of markets in mind, I wouldn't jump to such preliminary conclusions. In the past I read the patent Apple intended to file for its heterogeneous computing API before it became OpenCL, and to the layman here it sounded like a pure Apple initiative.
CNCAddict
26-Mar-2009, 21:00
This may be a bit off topic, but will OpenCL work in a Windows environment? I'm hearing all about this great OpenCL development at GDC, but is it Mac OS and Linux only? Surely M$ has to support OpenCL eventually, no?
mhouston
26-Mar-2009, 23:01
AMD and Nvidia showed demos of OpenCL running on Windows.
trinibwoy
27-Mar-2009, 00:36
Does Microsoft even have to get involved to support something like OpenCL?
CNCAddict
27-Mar-2009, 02:58
THANKS Mike, that clears up a lot. Next 12 months are gonna be awesome :cool:
Dave Baumann
28-Mar-2009, 02:53
Does Microsoft even have to get involved to support something like OpenCL?
Well, think of it like this - Vista was only going to support OpenGL up to 1.4, until the IHVs provided a full OpenGL driver stack...
http://www.nvidia.com/object/io_1240224603372.html
SANTA CLARA, CA—APRIL 20, 2009—NVIDIA Corporation, the inventor of the GPU, today announced the release of its OpenCL driver and software development kit (SDK) to developers participating in its OpenCL Early Access Program. NVIDIA is providing this release to solicit early feedback in advance of a beta release which will be made available to all GPU Computing Registered Developers in the coming months.
“The OpenCL standard was developed on NVIDIA GPUs and NVIDIA was the first company to demonstrate OpenCL code running on a GPU,” said Tony Tamasi, senior vice president of technology and content at NVIDIA. “Being the first to release an OpenCL driver to developers cements NVIDIA’s leadership in GPU Computing and is another key milestone in our ongoing strategy to make the GPU the soul of the modern PC.”
At the core of NVIDIA®’s GPU Computing strategy is the massively parallel CUDA™ architecture that NVIDIA pioneered and has been shipping since 2006. Accessible today through familiar industry standard programming environments such as C, Java, Fortran and Python, the CUDA architecture supports all manner of computational interfaces and, as such, is a perfect complement to OpenCL. Enabled on over 100 million NVIDIA GPUs, the CUDA architecture is enabling developers to innovate with the GPU and unleash never before seen performance across a wide range of applications.
Developers can apply to become a GPU Computing Registered Developer at: www.nvidia.com/opencl (http://www.nvidia.com/opencl)
Go get em!
Jawed
trinibwoy
20-Apr-2009, 18:40
Well, think of it like this - Vista was only going to support OpenGL up to 1.4, until the IHVs provided a full OpenGL driver stack...
Hmmm, is that a no?
Go get em!
Nice, I expect CUDA vs OpenCL benchmarks within the week!
Go get em!
http://www.nvidia.com/opencl (http://www.nvidia.com/object/io_1240224603372.html) => "Page not found"
trinibwoy
20-Apr-2009, 19:52
Link works for me - it redirects to http://www.nvidia.com/object/cuda_opencl.html
Tim Murray
20-Apr-2009, 20:01
Nice, I expect CUDA vs OpenCL benchmarks within the week!
Why would they be any different until you get into the different capabilities of the languages (OCL's event model versus CUDA's memory model)?
trinibwoy
20-Apr-2009, 21:20
Well they shouldn't but I'm sure a lot of folks have been chomping at the bit to port stuff over to OpenCL. And presumably the first thing they'll do is compare performance :)
Tim Murray
13-May-2009, 06:55
NVIDIA OCL 1.0 conformance candidate drivers are up on the registered developer site (for anyone who's registered for the CUDA/GPU computing registered developer program, too). :)
willardjuice
13-May-2009, 07:53
NVIDIA OCL 1.0 conformance candidate drivers are up on the registered developer site (for anyone who's registered for the CUDA/GPU computing registered developer program, too). :)
So close, yet so far... :razz:
Any news on when ATi will offer OpenCL drivers to developers?
Arnold Beckenbauer
14-May-2009, 13:23
Any news on when ATi will offer OpenCL drivers to developers?
http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=112397&enterthread=y
We are working with strategic partners currently and providing them with a preview release of OpenCL. A public release of OpenCL will be available in the second half of 2009.
According to slide 7's timeline:
http://developer.download.nvidia.com/presentations/2009/GDC/OpenCL_Overview_GDC_Mar09.pdf
the conformance tests are due around now.
Jawed
According to slide 7's timeline:
http://developer.download.nvidia.com/presentations/2009/GDC/OpenCL_Overview_GDC_Mar09.pdf
the conformance tests are due around now.
Yea, the drivers that nVidia made available to registered developers are the "conformance candidates".
spacemonkey
28-Jun-2009, 01:36
Where does OpenCL stand in relation to the MS Compute Shader in the software stack? Will both be running on top of CUDA? ... or am I comparing apples and oranges?
Where does OpenCL stand in relation to the MS Compute Shader in the software stack? Will both be running on top of CUDA? ... or am I comparing apples and oranges?
Well, looking at this diagram here, it's all running on top of Cuda, but in a slightly different way:
http://developer.download.nvidia.com/compute/cuda/docs/CUDA_Architecture_Overview.pdf
Apparently there's an extra usermode driver layer between OpenCL and Cuda, where DirectX Compute runs directly on a lower level of the driver (makes sense, since it's part of the DX API and driver model, where OpenCL is only an API specification, with the driver implementation depending on the OS and hardware).
By the way, a few days ago the nVidia OpenCL 1.0 candidate drivers passed conformance testing. So they're officially OpenCL 1.0-conformant now.
Not sure if AMD has submitted any drivers for conformance-testing yet.
By the way, a few days ago the nVidia OpenCL 1.0 candidate drivers passed conformance testing. So they're officially OpenCL 1.0-conformant now.
I don't understand why the conformance tests haven't been officially announced if it's possible to have passed them. They're supposed to have been published in June. Not long...
Not sure if AMD has submitted any drivers for conformance-testing yet.
I wonder how AMD's going to treat CPU and GPU. Since OpenCL is meant to run across both, but a single "driver" has to encompass all variations, it seems to me AMD could have a bit of a tussle. CPU-only variant + GPU-only variant + CPU/GPU variant? Hmm.
Jawed
I don't understand why the conformance tests haven't been officially announced if it's possible to have passed them. They're supposed to have been published in June. Not long...
Not sure what you mean.
But the nVidia drivers are still beta drivers available to developers only. Not sure when they'll be released to the public. Perhaps when the 190 series goes WHQL?
I wonder how AMD's going to treat CPU and GPU. Since OpenCL is meant to run across both, but a single "driver" has to encompass all variations, it seems to me AMD could have a bit of a tussle. CPU-only variant + GPU-only variant + CPU/GPU variant? Hmm.
I think AMD actually has the least of the worries. They have both CPU and GPU drivers in-house.
What will happen if you want to have Intel OpenCL support for your quadcore, but you also have an AMD or nVidia GPU which comes with OpenCL drivers?
What if you have both an AMD and an nVidia GPU in your system? Etc. I hope there's going to be a good way to manage these drivers together.
rpg.314
28-Jun-2009, 16:44
They could get 2 drivers qualified now and merge them later into one.
Not sure what you mean.
If the conformance tests are final then why aren't they published?
But the nVidia drivers are still beta drivers available to developers only. Not sure when they'll be released to the public. Perhaps when the 190 series goes WHQL?
Not sure what would connect OpenCL and WHQL. Beta status for drivers is "normal" for NVidia, I don't think it's a meaningful qualification.
I think AMD actually has the least of the worries. They have both CPU and GPU drivers in-house.
What will happen if you want to have Intel OpenCL support for your quadcore, but you also have an AMD or nVidia GPU which comes with OpenCL drivers?
What if you have both an AMD and an nVidia GPU in your system? Etc. I hope there's going to be a good way to manage these drivers together.
Yes, AMD has a good chance to produce a comprehensive environment (and Intel will have CPU+Larrabee in theory at some point). But it's a lot more work than just producing a GPU-only environment.
With both Brook+ (though not all of Stream, i.e. excluding apps coded with IL) and CUDA providing some kind of CPU runtime capability, there's a kind of overlap. e.g. NVidia may in fact produce a driver that fully supports CPU execution - not merely for debugging (which seems to be the usual reason) but for full performance. But if NVidia intends to go beyond debugging I'd guess it's a low priority.
Maybe OpenCL 1.1 will sort out the interoperability. Maybe it'll take much longer.
Jawed
If the conformance tests are final then why aren't they published?
How do you publish a conformance test? I don't understand what you mean here?
Not sure what would connect OpenCL and WHQL. Beta status for drivers is "normal" for NVidia, I don't think it's a meaningful qualification.
Not sure what you're driving at.
The fact that nVidia's drivers passed OpenCL conformance simply means that they conform to OpenCL 1.0 specifications. It doesn't mean that nVidia considers the entire driver stable enough for release yet (as in, downloadable for end-users as officially supported WHQL drivers, from their main website).
I know end-users can get their hands on *some* of the beta drivers through the website as well, but afaik not the ones that include OpenCL support. The only way to get those is through 'leaks' from registered developers.
How do you publish a conformance test? I don't understand what you mean here?
The inner circle develops the conformance test for anyone else to then use, when they pay the appropriate fee. Those third parties can't use the OpenCL logo until the test is available. Clearly the specification is already public, so third parties can get their product underway.
http://www.khronos.org/adopters/
Not sure what you're driving at.
The fact that nVidia's drivers passed OpenCL conformance simply means that they conform to OpenCL 1.0 specifications. It doesn't mean that nVidia considers the entire driver stable enough for release yet (as in, downloadable for end-users as officially supported WHQL drivers, from their main website).
I know end-users can get their hands on *some* of the beta drivers through the website as well, but afaik not the ones that include OpenCL support. The only way to get those is through 'leaks' from registered developers.
All I'm saying is that the tag "beta" means nothing about the quality of NVidia's drivers. e.g. the driver that has OpenCL support may be unleashed, unchanged, when OpenCL goes fully official.
WHQL, per se, has nothing to do with Khronos. Does NVidia normally tie Khronos and Windows drivers together?
Jawed
The inner circle develops the conformance test for anyone else to then use, when they pay the appropriate fee. Those third parties can't use the OpenCL logo until the test is available. Clearly the specification is already public, so third parties can get their product underway.
Well, I'm not entirely sure how that goes. Perhaps the conformance tests are available, just not through their website or anything, so we can't see them.
All I know is that on nVidia's site there is a statement that nVidia has sent OpenCL conformance candidate drivers to Khronos:
http://news.developer.nvidia.com/2009/05/nvidia-submits-opencl-10-driver-to-khronos-for-conformance-certification-for-windows-and-linux-.html
While they haven't updated the news yet, various other sites have reported that the drivers came back from Khronos a few days ago, and were fully conformant:
http://www.tcmagazine.com/comments.php?id=27287&catid=3
All I'm saying is that the tag "beta" means nothing about the quality of NVidia's drivers. e.g. the driver that has OpenCL support may be unleashed, unchanged, when OpenCL goes fully official.
WHQL, per se, has nothing to do with Khronos. Does NVidia normally tie Khronos and Windows drivers together?
Why do you say that as if I somehow claimed otherwise?
I'm just saying that these particular OpenCL drivers and OpenCL SDK, although fully conformant now, haven't been released to the general public officially yet. Currently, only registered developers have access.
WHQL may not have anything to do with OpenCL/Khronos per se, but you must realize that OpenCL is just one part of their driver package. nVidia only supports WHQL drivers for regular end-users, and generally they do keep version numbers and releases of Windows and other OSes close together. So in that sense, yes, WHQL has something to do with it. But you already knew that, so I don't see why you're being so argumentative?
Well, I'm not entirely sure how that goes.
Read the link then.
I'm just saying that these particular OpenCL drivers and OpenCL SDK, although fully conformant now, haven't been released to the general public officially yet. Currently, only registered developers have access.
But the nVidia drivers are still beta drivers available to developers only. Not sure when they'll be released to the public. Perhaps when the 190 series goes WHQL?
I don't see any reason for WHQL to get in the way, since NVidia releases beta drivers for public consumption. So with a green light, it could be quite quick.
It's not a big deal. I think OpenCL could be in public hands very rapidly, because WHQL is an un-related log-jam (something like 3 weeks' lead-time, supposedly).
Jawed
Read the link then.
You'll have to excuse me, at work I don't always have time to read every link, nor do I have access to every site.
But yea, that site answers your question: the tests aren't public, you can only get them when you pay the fee and sign the legal agreement.
After that you still need to send your drivers to Khronos to let them test conformance of the implementation.
Well, nVidia apparently has gone through all that now.
I don't see any reason for WHQL to get in the way, since NVidia releases beta drivers for public consumption. So with a green light, it could be quite quick.
Well no, it's just that I never considered any beta drivers to be 'officially released to the public'. Matter of perspective I guess.
If you really want to get your hands on it, there's a torrent of the drivers and SDK on pirate bay. But you didn't get that from me :)
willardjuice
29-Jun-2009, 19:37
You guys have no idea how ridiculous you sound right now. :razz:
rpg.314
18-Jul-2009, 19:24
Does the opencl language support goto/switch/case/break keywords? I have looked all over the spec and even googled it for a while, but couldn't find anything explicit. The spec does say that it is based on C99's subset with extensions. And they have pointed out in places parts of spec they don't support, such as recursion, standard c99 headers etc. So I think that it is supported, but could we have a word from those who know more please?
I am particularly interested in knowing the goto's fate.
Panajev2001a
19-Jul-2009, 09:46
Does the opencl language support goto/switch/case/break keywords? I have looked all over the spec and even googled it for a while, but couldn't find anything explicit. The spec does say that it is based on C99's subset with extensions. And they have pointed out in places parts of spec they don't support, such as recursion, standard c99 headers etc. So I think that it is supported, but could we have a word from those who know more please?
I am particularly interested in knowing the goto's fate.
Are you really? Even knowing what can happen?
http://imgs.xkcd.com/comics/goto.png
Neal Stephenson thinks it's cute to name his labels 'dengo'
rpg.314
19-Jul-2009, 13:01
:razz::razz::razz:
Thanks for the link.
Well, actually I wasn't intending to write even a single goto myself. I hate it with a passion. It's just that I was thinking of a Python-based meta OpenCL kernel generator. The script would generate LLVM IR internally, and then I was planning to use its C backend to finally feed the beast. The C backend generates raw gotos.
Note that the generated C code will be very low level (all loops are lowered to gotos, etc) and not very pretty (comments are stripped, original source formatting is totally lost, variables are renamed, expressions are regrouped), so this may not be what you're looking for.
rpg.314
19-Jul-2009, 13:28
Is there any one around here with access to beta drivers from nv that is willing to try?
mhouston
20-Jul-2009, 00:16
LLVM can do "interesting" things with nicely structured control flow...
As per the original question, yes goto/switch/case/break are supported in OpenCL. However, support for irreducible control flow, the main issue with arbitrary gotos, is implementation defined. See Section 6.8 of the OpenCL 1.0 spec.
And yes, LLVM can generate irreducible control flow.
rpg.314
20-Jul-2009, 02:01
What is irreducible control flow?
LLVM can do "interesting" things with nicely structured control flow...
Care to elaborate?
Are you saying that if I compile a C code containing no goto's to LLVM IR, and then use LLVM's C backend to regenerate C code, I can get what you are calling "irreducible control flow"?
mhouston
20-Jul-2009, 06:23
What is irreducible control flow?
Care to elaborate?
Are you saying that if I compile a C code containing no goto's to LLVM IR, and then use LLVM's C backend to regenerate C code, I can get what you are calling "irreducible control flow"?
Irreducible control flow has to do with the structure of the control flow graph. "Normal" loops, branches and switch statements generally fall in the realm of structured, and therefore reducible, graphs. Gotos can create unstructured control flow, which can get you to an irreducible graph. An example of an irreducible graph would be a goto with a target inside the body of a loop.
We have seen LLVM generate unstructured control flow. I don't have an easy example of something that will generate irreducible control flow, but yes, it does happen.
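For anyone who wants to poke at the definition: a rough sketch in Python of the textbook T1/T2 reducibility test (T1 removes a self-loop, T2 merges a node into its only predecessor). A graph is reducible if it collapses to a single node. The graph encoding and the names is_reducible, structured_loop and goto_into_loop are made up for this illustration; this is not how LLVM or any OpenCL compiler checks it.

```python
def is_reducible(cfg, entry):
    """Return True if the CFG collapses to one node under T1/T2.

    cfg maps each node name to an iterable of successor names.
    """
    # Keep only the nodes reachable from the entry.
    reachable, stack = set(), [entry]
    while stack:
        n = stack.pop()
        if n not in reachable:
            reachable.add(n)
            stack.extend(cfg.get(n, ()))
    succ = {n: set(cfg.get(n, ())) for n in reachable}

    changed = True
    while changed and len(succ) > 1:
        changed = False
        # T1: remove self-loops.
        for n in succ:
            if n in succ[n]:
                succ[n].discard(n)
                changed = True
        # T2: merge a node that has exactly one predecessor into it.
        for n in list(succ):
            if n == entry:
                continue
            preds = [p for p in succ if n in succ[p]]
            if len(preds) == 1:
                p = preds[0]
                succ[p].discard(n)
                succ[p] |= succ.pop(n)
                changed = True
                break
    return len(succ) == 1


# An ordinary while loop: the only way into the body is through the header.
structured_loop = {
    "entry": ["head"],
    "head": ["body", "exit"],
    "body": ["head"],
    "exit": [],
}

# Same loop, plus a goto from the entry straight into the loop body,
# so the loop now has two entry points.
goto_into_loop = {
    "entry": ["head", "body"],
    "head": ["body", "exit"],
    "body": ["head"],
    "exit": [],
}

print(is_reducible(structured_loop, "entry"))
print(is_reducible(goto_into_loop, "entry"))
```

The second graph gets stuck with entry, head and body all still present, because head and body each keep two predecessors. That is the "goto with a target inside the body of a loop" case.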
As far as I can tell Larrabee and NVidia both have named predicate registers (16 and 4, I think), whereas ATI has a 32-level (I think) predicate stack. With named predicates, unstructured gotos work fine, but with stacked predicates, it all comes undone, since it's not possible to balance either one or two pushes with a statically compiled pop.
Is that what's causing the grief?
Jawed
mhouston
20-Jul-2009, 15:47
It's not quite that simple. In general, if you have a vector/SIMD architecture, under control flow you need to know at which point you can converge again. If you have heavily unstructured or, in the worst case, irreducible flow, you are going to run into a problem. How your predicate (mask) registers are set up will affect what your options are. Most x86 compilers just give up on SSE/MMX under unhappy control flow.
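To make the "where can you converge again" point concrete, here is a toy Python model of a structured if/else on a SIMD machine. Every lane computes a predicate, both sides run under a mask, and all lanes are active again at the join. The function name simd_if_else and the lane values are invented for the example; real hardware differs in plenty of ways, so treat this as a sketch of the idea only.

```python
def simd_if_else(values, cond, then_op, else_op):
    """Run `if cond: then_op else: else_op` over all lanes using a mask."""
    mask = [cond(v) for v in values]       # per-lane predicate
    out = list(values)
    passes = 0
    if any(mask):                          # at least one lane takes 'then'
        out = [then_op(v) if m else v for v, m in zip(out, mask)]
        passes += 1
    if not all(mask):                      # at least one lane takes 'else'
        out = [v if m else else_op(v) for v, m in zip(out, mask)]
        passes += 1
    # Join point: the mask is dropped and every lane is active again.
    return out, passes


lanes = [1, 2, 3, 4]
divergent = simd_if_else(lanes, lambda v: v % 2 == 0,
                         lambda v: v * 10, lambda v: v + 1)
uniform = simd_if_else(lanes, lambda v: v > 0,
                       lambda v: v * 10, lambda v: v + 1)
print(divergent)
print(uniform)
```

With structured code the join point is known statically, which is why this works. With a jump into the middle of a loop there is no single obvious place to drop the mask.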
mhouston
21-Jul-2009, 18:25
By the way, there is a whole Khronos forum about OpenCL that gets much more coverage from the different folks involved with OpenCL:
http://www.khronos.org/message_boards/viewforum.php?f=28
Serious question - why go through LLVM IR at all? Why not generate C directly?
Serious question - why go through LLVM IR at all? Why not generate C directly?
Perhaps because you might want to take advantage of the many-ready-to-use LLVM code transformations?
Wouldn't those or similar transforms also be run by the C-compiler of the OpenCL implementation? It's non-obvious to me that running through these kinds of optimization passes, emitting C, and then running that through a compiler again is something that necessarily nets you a performance gain.
Andrew Lauritzen
22-Jul-2009, 06:00
OpenCL/Compute Shader/HLSL and similar languages offer WAAAAAAAAY more opportunity for aggressive optimization than C. C really is a terrible language to compile efficient code for, so generating naive C from OpenCL and then optimizing the C code is much more difficult and will produce worse results than optimizing under the much tighter constraints imposed by the higher level languages.
Right - but it sounded to me like rpg.314 wanted to produce LLVM IR from some sort of meta-programming framework (perhaps something like http://documen.tician.de/codepy/, not necessarily native OpenCL) and then use the LLVM C backend to write out OpenCL-compatible C code (which is what I should have said in place of just generic C in my previous post; sorry for the confusion). That design was startling to me since I was assuming that a typical OpenCL compiler would be quite aggressive already. If I've misunderstood rpg.314, my apologies and please ignore my posts :)
Isn't part of the deal that LLVM is tried and tested by Apple, as the key to its OpenGL implementation? Since it works well there, unifying graphics hardware and software implementations of OpenGL functionality, it's a good foundation for OpenCL.
I'm also kinda curious about Mike's point earlier about irreducible control flow. If OpenGL is already implemented through LLVM, why isn't irreducible control flow causing grief in OpenGL shaders? Maybe it is, but it's a corner case in graphics at the moment?
(I'm such an OpenGL noob...)
Jawed
I'm also kinda curious about Mike's point earlier about irreducible control flow. If OpenGL is already implemented through LLVM, why isn't irreducible control flow causing grief in OpenGL shaders. Maybe it is, but it's a corner case in graphics at the moment?
Control flow in the OpenGL Shading Language is quite limited. Basically, only structured control flow is supported, so the control flow graph should always be reducible.
Hmm, I got the impression that Mike's talking about irreducible control flow being produced by LLVM even though LLVM's been told the target can't accept it :???:
Jawed
LLVM is a rather fluid entity and the development lead works at Apple, if it needs fixing he'll fix it. That said, I don't think Apple will start writing OpenCL/OpenGL drivers for the graphics card manufacturers and neither do I see either AMD or NVIDIA embracing LLVM for their drivers.
mhouston
23-Jul-2009, 02:33
There isn't a way to tell LLVM the device can't handle irreducible control flow. A "structured control flow" pass which would convert irreducible control flow is in the future work realm, but I'll bet people are working on it. The LLVM guys are great and knock down issues and add features at an amazing pace.
rpg.314
23-Jul-2009, 07:27
Right - but it sounded to me like rpg.314 wanted to produce LLVM IR from some sort of meta-programming framework (perhaps something like http://documen.tician.de/codepy/, not necessarily native OpenCL) and then use the LLVM C backend to write out OpenCL-compatible C code (which is what I should have said in place of just generic C in my previous post; sorry for the confusion). That design was startling to me since I was assuming that a typical OpenCL compiler would be quite aggressive already. If I've misunderstood rpg.314, my apologies and please ignore my posts :)
You understood correctly. I have looked at codepy and it looks fugly to me. My goal is to get real Python (well, almost...., some extra annotations will be there, but very minimal) to generate LLVM IR, perform some operations there (like loop unrolling etc. ), and then use the c backend to feed the beast. I intend to do no optimizations. That I intend to leave to the driver vendor's compiler. LLVM is there to perform code transformations at a much higher level.
The only control flow statements I intend to support are those in Python, i.e. if, if-else, nested if-else, while and for. So the IR generated should have reducible control flow. I am asking this question because I want to know whether, after doing C codegen (with some LLVM xforms, if requested), I'll end up with something that has irreducible control flow.
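For comparison with the "why not generate C directly?" question raised earlier, a minimal sketch of what a structured-only emitter could look like in Python. It skips LLVM entirely, so it is not the pipeline described above. The tuple-based tree, the functions emit and emit_kernel, and the kernel clamp_scale are all hypothetical. Since the emitter only knows plain statements, if/else and while, the output contains no gotos by construction.

```python
def emit(node, indent=1):
    """Turn one structured node into a list of OpenCL C source lines."""
    pad = "    " * indent
    kind = node[0]
    if kind == "stmt":                     # ("stmt", "text without ;")
        return [pad + node[1] + ";"]
    if kind == "if":                       # ("if", cond, then_body, else_body)
        _, cond, then_body, else_body = node
        lines = [pad + "if (" + cond + ") {"]
        for child in then_body:
            lines += emit(child, indent + 1)
        if else_body:
            lines.append(pad + "} else {")
            for child in else_body:
                lines += emit(child, indent + 1)
        lines.append(pad + "}")
        return lines
    if kind == "while":                    # ("while", cond, body)
        _, cond, body = node
        lines = [pad + "while (" + cond + ") {"]
        for child in body:
            lines += emit(child, indent + 1)
        lines.append(pad + "}")
        return lines
    raise ValueError("unsupported node: " + kind)


def emit_kernel(name, params, body):
    """Wrap a list of structured nodes in a __kernel function."""
    lines = ["__kernel void " + name + "(" + ", ".join(params) + ") {"]
    for node in body:
        lines += emit(node)
    lines.append("}")
    return "\n".join(lines)


source = emit_kernel(
    "clamp_scale",
    ["__global float *data", "const float limit"],
    [
        ("stmt", "int i = get_global_id(0)"),
        ("if", "data[i] > limit",
         [("stmt", "data[i] = limit")],
         [("stmt", "data[i] = data[i] * 2.0f")]),
    ],
)
print(source)
```

The resulting string is what would be handed to the vendor's OpenCL compiler at runtime. Whether a round trip through LLVM's C backend preserves that structure is exactly the open question in this part of the thread.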
Isn't part of the deal that LLVM is tried and tested by Apple, the key to its OpenGL implementation. Since it works well there, unifying graphics hardware and software implementations of OpenGL functionality, then it's a good foundation for OpenCL.
As far as I know, within Apple OpenGL, LLVM is only used to compile shaders which do not fit in hardware constraints, so they can run on the CPU side. This happens a lot on the Intel GPU's for example. So it is a mechanism to degrade gracefully when confronted with over-complicated shaders or shaders which exceed hardware limits. I can't recall if this is only on the vertex side or if they bothered to do it for the fragment side too.
There are also some API's to detect when the software fallback has happened, after a draw call, so a developer can try not to trip over this possible performance pothole.
Shaders that are actually running on the GPU as they should be, have code generated for them by compilers provided by the vendors as part of the driver package.
On Apple OpenCL, LLVM is AFAIK being used for all the x86 codegen for CPU side, and vendor compilers again enter the picture for GPU side codegen.
rpg.314
03-Aug-2009, 07:52
I thought Apple (pretty much) wrote their own drivers for the gpu's they supported. When you have an LLVM frontend for parsing the code to LLVM IR, your own optimization passes (common for all drivers), what is left is perhaps only the final codegen backend for the gpu's. And I am sure that apple will do it themselves instead of letting AMD/nv/intel screw the mac experience for their users.
I thought Apple (pretty much) wrote their own drivers for the gpu's they supported. When you have an LLVM frontend for parsing the code to LLVM IR, your own optimization passes (common for all drivers), what is left is perhaps only the final codegen backend for the gpu's. And I am sure that apple will do it themselves instead of letting AMD/nv/intel screw the mac experience for their users.
The final codegen happens inside code that the GPU vendors own, which is proprietary. The driver dev process for AMD and NV GPUs on the Mac is by no means 100% Apple. There are dozens of people involved at the IHVs at the lower levels of the driver stack.
http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=118126
The implementation of OpenCL that is part of Snow Leopard is provided by Apple. As such, we cannot comment on the features that are being exposed by that implementation.
I'll get my Snow Leopard next Monday and I'll see how it works on my Mac mini (GF 9400M) :)
I've installed Snow Leopard and written a small OpenCL program for testing. It works in both x86_64 mode and i386 mode. In the directory of the OpenCL framework, there are several dynamically linked libraries which seem to indicate that there are currently three different implementations: CPU, AMD IL, and PTX.
http://www.apple.com/macosx/specs.html
GeForce 9400M
GeForce 9600M GT
GeForce 8600M GT
GeForce GT 120
GeForce GT 130
GeForce GTX 285
GeForce 8800 GT
GeForce 8800 GS
Quadro FX4800
Quadro FX5600
Radeon 4850
Radeon 4870
http://arstechnica.com/apple/news/2009/08/opencl-gets-tires-kicked-run-around-the-block.ars
http://forums.macrumors.com/showpost.php?p=8385862&postcount=6
The LLVM compiler can't compile the second benchmark to run on ATI.
More stuff that doesn't like ATI:
http://netkas.org/?p=164
---
Apple introduction and tutorial stuff:
http://developer.apple.com/mac/library/documentation/Performance/Conceptual/OpenCL_MacProgGuide/Introduction/Introduction.html
http://www.macresearch.org/opencl
Jawed
spacemonkey
01-Sep-2009, 01:03
Why would they omit the GTX 275 & 280?
spacemonkey
01-Sep-2009, 01:22
I've installed Snow Leopard and wrote a small OpenCL program for testing. It works on both x86_64 mode and i386 mode. In the directory of the OpenCL framework, there are several dynamic linked libraries which seem to indicate that there are currently three different implementations, CPU, AMD IL, and PTX.
Is it simple enough to rewrite in CUDA? Could you share some comparative benchmark results?
Tim Murray
01-Sep-2009, 01:52
Why would they omit the GTX 275 & 280?
there are no Mac versions of either
digitalwanderer
01-Sep-2009, 03:59
You mean nVidia hates Apple too? :shock:
Is it simple enough to rewrite in CUDA? Could you share some comparative benchmark results?
OpenCL runtime works much like CUDA's driver API. That is, it's more "wordy" than CUDA runtime API, but it's not too bad. Most of my time spent yesterday was actually on Xcode because I didn't know how to "link" to OpenCL libraries (and later I found out that it's a "framework" and I have to add the OpenCL framework into my project).
I can try to port some CUDA programs to OpenCL, but I found that if I write the compiled "binary" to a file when using GPU OpenCL, it's actually PTX (the device is a GF 9400M), so I guess if there is any performance disparity it must be related to the compilers.
You mean nVidia hates Apple too? :shock:
Nah they love being in Apples, but seeing as it is a bit of a pita stocking devices with the special firmware, you typically see far fewer SKUs.
I wrote a simple device query program which basically lists every queryable piece of information in OpenCL and displays it, for all devices. The result of running it on my Mac mini looks like this:
Device: GeForce 9400
Vendor: NVIDIA
Driver version: CLH 1.0
Device version: OpenCL 1.0
Device profile: FULL_PROFILE
Supported extensions: cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_APPLE_gl_sharing cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions
Vendor ID: 1022600
Max compute units: 16
Device work item sizes: 512x512x64
Max work group size: 512
Best vector width for char: 1
Best vector width for short: 1
Best vector width for int: 1
Best vector width for long: 1
Best vector width for float: 1
Best vector width for double: 0
Device clock: 1100 MHz
Memory address bits: 32
Max memory allocation size: 134217728 bytes
Support image: Yes
Max image objects can be read: 128
Max image objects can be wrote: 8
Max 2D image width: 8192
Max 2D image height: 8192
Max 3D image width: 2048
Max 3D image height: 2048
Max 3D image depth: 2048
Max number of samplers: 16
Max parameter size: 4352 bytes
Memory base alignment: 1024 bits
Data alignment: 128 bytes
Single precision FP cap: CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
Global cache type: None
Global cache line size: 0 bytes
Global cache size: 0 bytes
Global memory size: 268435456 bytes
Constant buffer size: 65536 bytes
Max constant arguments: 9
Local memory type: Dedicated
Local memory size: 16384 bytes
Support error correction: No
Profiler timer resolution: 1000 ns
Little endian: Yes
Device available: Yes
Device compiler available: Yes
Device execution cap: CL_EXEC_KERNEL
Device command queue properties: CL_QUEUE_PROFILING_ENABLE
Device: Intel(R) Core(TM)2 Duo CPU P7350 @ 2.00GHz
Vendor: Intel
Driver version: 1.0
Device version: OpenCL 1.0
Device profile: FULL_PROFILE
Supported extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_APPLE_gl_sharing cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions
Vendor ID: 1020400
Max compute units: 2
Device work item sizes: 1x1x1
Max work group size: 1
Best vector width for char: 16
Best vector width for short: 8
Best vector width for int: 4
Best vector width for long: 2
Best vector width for float: 4
Best vector width for double: 2
Device clock: 2000 MHz
Memory address bits: 64
Max memory allocation size: 1073741824 bytes
Support image: Yes
Max image objects can be read: 128
Max image objects can be wrote: 8
Max 2D image width: 8192
Max 2D image height: 8192
Max 3D image width: 2048
Max 3D image height: 2048
Max 3D image depth: 2048
Max number of samplers: 16
Max parameter size: 4096 bytes
Memory base alignment: 1024 bits
Data alignment: 128 bytes
Single precision FP cap: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
Global cache type: Read-write
Global cache line size: 64 bytes
Global cache size: 3145728 bytes
Global memory size: 3221225472 bytes
Constant buffer size: 65536 bytes
Max constant arguments: 8
Local memory type: Global
Local memory size: 16384 bytes
Support error correction: No
Profiler timer resolution: 1 ns
Little endian: Yes
Device available: Yes
Device compiler available: Yes
Device execution cap: CL_EXEC_KERNEL
Device command queue properties: CL_QUEUE_PROFILING_ENABLE
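The raw byte counts in these listings are easier to compare once converted. A small Python sketch that parses a few "Name: value" lines (copied from the GeForce 9400 listing) and prints them in KiB/MiB. The helper name parse_bytes is made up for this example.

```python
# A few lines copied verbatim from the GeForce 9400 device listing.
dump = """\
Max memory allocation size: 134217728 bytes
Global memory size: 268435456 bytes
Constant buffer size: 65536 bytes
Local memory size: 16384 bytes
"""


def parse_bytes(text):
    """Collect every 'Name: N bytes' line into a {name: N} dict."""
    sizes = {}
    for line in text.splitlines():
        name, _, value = line.partition(": ")
        if value.endswith(" bytes"):
            sizes[name] = int(value[:-len(" bytes")])
    return sizes


sizes = parse_bytes(dump)
for name, n in sizes.items():
    print(f"{name}: {n // 1024} KiB ({n / 2**20:g} MiB)")
```

So on this GPU the largest single allocation (128 MiB) is half of the reported global memory (256 MiB), and local memory is 16 KiB.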
madyasiwi
01-Sep-2009, 23:48
...
http://netkas.org/?p=164
Name: Radeon HD 4870
Vendor: AMD
Type: GPU
Device Version: OpenCL 1.0
Driver Version: 1.0
Compute Units: 4
Work Group Size: 1024
Clock: 750 MHz
Global Memory: 128 MB
Local Memory: 16 KB
Cache Size: 0 KB
Cache Line Size: 128 Bytes
Only 4 compute units available?
But those are Über compute units...
madyasiwi
02-Sep-2009, 00:05
Isn't it supposed to be like 10?
Yeah, that number is not very well defined. It's described loosely in the specification as "The number of parallel compute cores on the OpenCL device." What a "compute core" means is apparently up to each vendor's interpretation.
Yeah, that number is not very well defined. It's described loosely in the specification as "The number of parallel compute cores on the OpenCL device." What a "compute core" means is apparently up to each vendor's interpretation.
Then again, AMD seems to say that the OpenCL implementation in Snow Leopard is Apple's responsibility.
And while Apple seems to call every shader processor on Nvidia chips a 'compute core', the 800 stream processors on RV770 (which is the only supported/tested ATI chip so far) only yield 4, and the performance is completely out of line with both X86 CPUs and Nvidia chips.
So what are there 4 of in RV770? 4 FP MADD/DP ALUs on the one shader core that is actually being used?
A reason could be the maturity of AMD IL's driver on Mac. NVIDIA has had CUDA on Mac for quite a long time, and it's generally up to date with other platforms (Windows and Linux). On the other hand, AMD Stream SDK doesn't seem to have a Mac version.
Also, I believe that the performance characteristics of NVIDIA's GPUs and AMD's GPUs are quite different. That is, an OpenCL kernel tuned for NVIDIA's GPU may not run very well on AMD's GPU, and vice versa. Unfortunately, I don't have access to a Mac with an AMD GPU, so I have no experience in this matter.
Unknown Soldier
10-Sep-2009, 16:54
Getting good performance out of general-purpose apps that run on the GPU could get a little bit easier thanks to Nvidia. The company has introduced the OpenCL Visual Profiler, a software tool that gives developers "insight into performance bottlenecks and opportunities for optimization." The OpenCL Visual Profiler brings the following perks:
# Profiling of actual hardware signals, kernel efficiency, and instruction issue rate
# Timing of memory copies between system memory and GPU dedicated memory
# Customizable graphs to help developers focus in on problem areas
# Basic auto-analysis to reveal warp serialization problems
# Easy import/export to CSV for custom analysis
You may have to jump through a few hoops to get going, though: downloading the software involves signing up as part of Nvidia's Registered Developer program (http://developer.nvidia.com/page/registered_developer_program.html).
Nvidia does however offer an OpenCL Best Practices Guide (PDF) to anyone (http://www.nvidia.com/content/cudazone/CUDABrowser/downloads/papers/NVIDIA_OpenCL_BestPracticesGuide.pdf) without mandatory registration. The 49-page guide largely focuses on performance optimization, as well.
News Source: http://www.techreport.com/discussions.x/17562
rendezvous
10-Sep-2009, 17:44
This is not Nexus which trinibwoy posted a thread about two weeks ago.
http://forum.beyond3d.com/showthread.php?t=55076&highlight=nexus
(YouTube video linked in trinibwoy's post; I would recommend watching it if you missed it)
edit: I made an incorrect assumption, now corrected.
Tim Murray
10-Sep-2009, 18:07
I think they are referring to Nexus which trinibwoy posted a thread about two weeks ago.
http://forum.beyond3d.com/showthread.php?t=55076&highlight=nexus
(YouTube video linked in trinibwoy's post; I would recommend watching it if you missed it)
nope, this is the OpenCL version of the CUDA Visual Profiler, not Nexus.
Andrew Lauritzen
10-Sep-2009, 18:59
Nvidia does however offer an OpenCL Best Practices Guide (PDF) to anyone without mandatory registration. The 49-page guide largely focuses on performance optimization, as well.
Interesting, although too bad they didn't use the proper OpenCL terminology in their document. Looks like it was more of a quick port from a CUDA guide rather than a properly written one for OpenCL. It's surprising and a bit ridiculous how much marketing there is in there too (mostly in the introduction), but I guess that's life.
Yea, I got spammed today, and downloaded the goodies from nVidia. Microsoft also released a public beta for the D3D11 RTM runtimes for Vista. Now I have OpenCL *and* DirectCompute to play with :)
trinibwoy
10-Sep-2009, 22:12
It's surprising and a bit ridiculous how much marketing there is in there too (mostly in the introduction), but I guess that's life.
Like this? Thousands! :lol:
While NVIDIA devices are primarily associated with rendering graphics, they also are powerful arithmetic engines capable of running thousands of lightweight threads in parallel. This capability makes them well suited to computations that can leverage parallel execution well.
Andrew Lauritzen
11-Sep-2009, 00:38
Like this? Thousands! :lol:
Yeah... the whole section on "threads" is odd since the entire reason that it's not called a "thread" in OpenCL is because it's not the same as the concept in other domains (i.e. CPUs) and thus calling "work items" that and "comparing" them to CPU OS threads is pure marketing.
But hey we all know that CPUs are slow and even the best can only do 16 things "in parallel" while GPUs can do THIRTY THOUSAND!!!! I'm as much of a GPU computing guy as anyone else but come on... leave this BS out of my developer guides ;)
(PS: No one tell them that they could start counting the bits operated on by their ALUs to get another 32x more "threads" running in parallel :D)
Arnold Beckenbauer
15-Sep-2009, 13:31
Compilers and More: OpenCL Promises and Potential (http://www.hpcwire.com/features/Compilers-and-More-OpenCL-Promises-and-Potential-58625442.html?page=1)
So let's accept and even celebrate OpenCL for what it is, and not try to make it what it can't be. There's danger in raising expectations too high, or claiming too much (a la Bernie Madoff); OpenCL can be influential and succeed without replacing other parallel languages. To correct Steve Jobs's quote: "While OpenCL is very similar in many respects to NVIDIA's CUDA, it adds features to take advantage of other targets; and though it's quite complex, it has the potential to deliver very high performance, and is much easier than trying to map your computation into OpenGL or graphics primitives." Hype I can agree with; but then, I'm not the Apple CEO.
Opinions? Is he (too) pessimistic or realistic?
My expectations on OpenCL have dropped considerably once I actually got hold of a working development environment myself... and dropped even further when I got a taste of DirectCompute.
How much longer are nVidia and AMD going to keep OpenCL in beta?
I think OpenCL's only hope is that XP remains popular... Because if Windows 7 takes off, the road is clear for DirectCompute to take over. Sure, OpenCL might still exist on Apple and linux... but it will probably become as much a niche product as OpenGL is for graphics.
rpg.314
15-Sep-2009, 16:14
How much longer are nVidia and AMD going to keep OpenCL in beta?
Amd will keep it under wraps till 22nd sep. (just my guess)
nv will keep it under wraps till amd keeps it under wraps to lock in as many people into the dead end that is CUDA. :evil:
I think OpenCL's only hope is that XP remains popular... Because if Windows 7 takes off, the road is clear for DirectCompute to take over. Sure, OpenCL might still exist on Apple and linux... but it will probably become as much a niche product as OpenGL is for graphics.
For graphics, obviously opencl is a no go unless you are an opengl person.
But do you expect gpu-lapack, F@H, MATLAB, the entire HPC domain (it includes not just big iron, but J. Random Scientist too) to use Compute shader over opencl?
If your answer is no, then sure as hell opencl ain't gonna be a niche product in the foreseeable future.
But if hw trends change, those with crappy sw will prolly nuke the development of ocl. :oops:
The best bet for ocl to stay relevant is for the sw (apple) and hw folks (ati/intel/nv) to keep iterating ocl quickly and not let it stagnate like ogl did in the 90's.
Amd will keep it under wraps till 22nd sep. (just my guess)
And this is... because AMD's current hardware isn't up to the task (Sep 22 being the introduction of their new GPU)?
nv will keep it under wraps till amd keeps it under wraps to lock in as many people into the dead end that is CUDA. :evil:
Perhaps... or well, they could release what they have now, but its not exactly a competitor to Cuda... Probably no coincidence there either :)
For graphics, obviously opencl is a no go unless you are an opengl person.
Why is that obvious though?
As far as I know, OpenCL can be used with OpenGL, Direct3D (9, 10 or 11), or just without any graphics library at all.
On XP with only Direct3D 9, it'd be an alternative to Cuda. DirectCompute won't run there.
But do you expect gpu-lapack, F@H, MATLAB, the entire HPC domain (it includes not just big iron, but J. Random Scientist too) to use Compute shader over opencl?
I don't understand the question. They aren't currently using OpenCL... they are using Cuda, if anything. So OpenCL has its work cut out. It needs to deliver in both performance and ease of development, and currently it's good at neither.
DirectCompute is better in those respects, imho, although not quite as sophisticated as Cuda.
The best bet for ocl to stay relevant is for the sw (apple) and hw folks (ati/intel/nv) to keep iterating ocl quickly and not let it stagnate like ogl did in the 90's.
Well, so far they're doing a lousy job. They've not even managed to get a working initial version out of the door before DirectCompute. It almost looks like it's going to stagnate before it ever took off.
nutball
15-Sep-2009, 16:26
But if hw trends change, those with crappy sw will prolly nuke the development of ocl. :oops:
For "those with crappy sw" substitute "those who have better things to do than recode their apps from scratch every couple of years".
The best bet for ocl to stay relevant is for the sw (apple) and hw folks (ati/intel/nv) to keep iterating ocl quickly and not let it stagnate like ogl did in the 90's.
Not if HPC apps are to be coded directly to OpenCL it isn't - rapid evolution is the best way to ensure that it doesn't get touched with a barge pole in that arena.
rpg.314
15-Sep-2009, 16:28
Opinions? Is he (too) pessimistic or realistic?
My opinion: it's likely to be more useful as a target language for higher level programming languages, tools, and environments, or as a language to implement optimized libraries, than as a language for a more general programming community.
I agree with this quite a lot. Broadly speaking, he's on a generally right track, but yeah, he's a bit pessimistic.
As regards performance portability, I think the notion that you can take a piece of code and run it on any hardware at A-grade speed without doing any hardware-specific optimizations is, at the very minimum, ridiculous.
You can likely get good performance portability if your architectures are similar enough, but seriously, it's gonna degrade the farther they are in design space.
Serious performance needs coding to the metal, and there are no two ways about it. Period.
rpg.314
15-Sep-2009, 16:35
And this is... because AMD's current hardware isn't up to the task (Sep 22 being the introduction of their new GPU)?
It is definitely up to the job. I think they just want to make a bigger bang on the 22nd. :smile:
Perhaps... or well, they could release what they have now, but its not exactly a competitor to Cuda
It is a competitor to cuda, which will in all probability go the way of Cg?
I am not very well versed in the history of programmable graphics, but which came out first? Cg or GLSL? By how much?
As far as I know, OpenCL can be used with OpenGL, Direct3D (9, 10 or 11), or just without any graphics library at all.
On XP with only Direct3D 9, it'd be an alternative to Cuda. DirectCompute won't run there.
Does opencl allow for interoperability with DX? Does microsoft allow it?
To the best of my knowledge, the answer is yes and no. But if you know better, care to elaborate?
I don't understand the question. They aren't currently using OpenCL... they are using Cuda, if anything. So OpenCL has its work cut out. It needs to deliver in both performance and ease of development, and currently it's good at neither.
I meant their future versions, say 2 years from now (should have made that explicit). You expect Matlab, f@h etc. to be written in CS over ocl 3 yrs from now?
Well, so far they're doing a lousy job. They've not even managed to get a working initial version out of the door before DirectCompute. It almost looks like it's going to stagnate before it ever took off.
OCL's competition aint with DXCS. It is with whatever MS will put out as their windows only api for accessing gpu's.
rpg.314
15-Sep-2009, 16:41
Not if HPC apps are to be coded directly to OpenCL it isn't - rapid evolution is the best way to ensure that it doesn't get touched with a barge pole in that arena.
By "rapid evolution", I mean
a) not letting a mass of conflicting vendor extensions proliferate
b) keeping the api up to date with new hw features
c) keeping b/c compatibility as far as possible and when the need arises (it certainly will) gracefully deprecating stuff like is being done with ogl3.x. I am pissed at the lack of new object based api even in 3.2, but their deprecation model is certainly noteworthy.
It is a competitor to cuda, which will in all probability go the way of Cg?
I am not very well versed in the history of programmable graphics, but which came out first? Cg or GLSL? By how much?
It's NOT a competitor. Neither performance nor ease-of-use of OpenCL are close to what C for Cuda offers... Not to mention the Cuda libraries that you can get for Matlab...
Cg was no longer required because GLSL offered the same. OpenCL currently doesn't offer the same.
Does opencl allow for interoperability with DX? Does microsoft allow it?
To the best of my knowledge, the answer is yes and no. But if you know better, care to elaborate?
I think the answers are yes and yes.
I meant their future versions, say 2 years from now (should have made that explicit). You expect Matlab, f@h etc. to be written in CS over ocl 3 yrs from now?
I don't see why not. If OpenCL doesn't get its act together, then I don't think it's going to take off.
OCL's competition aint with DXCS. It is with whatever MS will put out as their windows only api for accessing gpu's.
And how is that not DXCS?
rpg.314
15-Sep-2009, 17:24
It's NOT a competitor. Neither performance nor ease-of-use of OpenCL are close to what C for Cuda offers... Not to mention the Cuda libraries that you can get for Matlab...
ocl has something cuda will never have, vendor neutrality. And that is more than enough to compensate CUDA's ease of use in the short term.
Cg was no longer required because GLSL offered the same. OpenCL currently doesn't offer the same.
Ease of use perhaps, but like I said, it is intended to be the assembly/bytecode of GPU programming. The goal should be for middleware developers to abstract out the command queues, contexts, buffers etc.
I think the answers are yes and yes.
OK.
I don't see why not. If OpenCL doesn't get its act together, then I don't think it's going to take off.
Well, I think it has begun well and is doing well for now. Let's let time and market forces decide whether it will fade out or not.
And how is that not DXCS?
Because it is tied permanently to graphics, about which a typical scientist knows less than zero. :lol:
rpg.314
15-Sep-2009, 17:27
For "those with crappy sw" substitute "those who have better things to do than recode their apps from scratch every couple of years".
That is an unfortunate fact of life for many decades now, certainly more so in HPC. :sad:
What is the ease of use advantage of CUDA over OpenCL?
Jawed
ocl has something cuda will never have, vendor neutrality. And that is more than enough to compensate CUDA's ease of use in the short term.
Why would that compensate anything? I don't think people care about that in the first place.
In fact, my company has a strict policy of only supporting nVidia hardware. We don't want the hassle of supporting more than one vendor.
Secondly, since people in the HPC market are already using nVidia Tesla/Cuda solutions... why would they bother to move to OpenCL? It's going to be slower, and porting the code over to OpenCL is probably going to be harder than it was to write it in the first place. Why bother?
Well, I think it has begun well and is doing well for now. Let's let time and market forces decide whether it will fade out or not.
Well, I find the performance of OpenCL to be quite poor currently, on nVidia hardware. Both DirectCompute and C for Cuda seem to be considerably faster (think factors 3-4 in some cases).
Because it is tied permanently to graphics, about which a typical scientist knows less than zero. :lol:
Did you ever try to actually use it? Setting up an OpenCL context is actually harder than setting up a DX11 device. After that, the APIs are pretty similar in compiling a kernel, setting up buffers, and running the actual code. The fact that a DX11 device can also do graphics operations aside from compute shaders, is not going to bother a scientist. The August 2009 DirectX SDK actually shows some non-graphics examples (some commandline-based calculation samples). They can just copy-paste the code.
What is the ease of use advantage of CUDA over OpenCL?
Jawed
Runtime API.
nutball
15-Sep-2009, 17:57
That is an unfortunate fact of life for many decades now, certainly more so in HPC. :sad:
In the UK the gross cost of paying a halfway decent scientific programmer is about £80k/yr. That sum will buy you an x86 cluster of ~500-600 cores and ~1TB memory, if you catch the sales rep in a good mood.
So if you have a well proven, scalable MPI-parallel code, and a budget of £80k/yr do you
a) buy a £1k graphics card and spend £79k paying someone to port your code to OpenCL and verify it, then face the prospect of having to prove at peer review that the results are numerically correct. Then next year buy the new architecture £1k graphics card and pay the same person another £79k to port and verify it again, rinse and repeat.
b) buy 500-600 x86 cores and 1TB memory and run your code now, and next year buy another (500-600)*1.6 x86 cores and 1.6TB memory to add to what you've already got, and run your code even faster
?
At the end of this you'll be judged on the amount of science you've produced, not on how wonderfully elegant your code is, by the way.
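To put rough numbers on option b), here is a minimal Python sketch. It assumes the low end of the quoted range (500 cores in the first year) and that the 1.6x per-year improvement for the same budget keeps holding, which is an extrapolation beyond the two years mentioned above. The function name and its defaults are made up for illustration.

```python
def cluster_cores(years, first_year_cores=500, yearly_growth=1.6):
    """Total x86 cores owned after `years` annual purchases on a fixed budget.

    Assumes each year's budget buys `yearly_growth` times as many cores as
    the previous year's did (a hypothetical extrapolation of the
    "(500-600)*1.6" figure).
    """
    total = 0.0
    for year in range(years):
        total += first_year_cores * yearly_growth ** year
    return total


if __name__ == "__main__":
    for years in (1, 2, 3):
        print(years, "year(s):", round(cluster_cores(years)), "cores")
```

Option a) has no equivalent curve in this sketch: the porting cost recurs every year and the hardware count stays at one card.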
rpg.314
15-Sep-2009, 18:42
In fact, my company has a strict policy of only supporting nVidia hardware. We don't want the hassle of supporting more than one vendor.
You may be in a minority. Others may know better about this.
Secondly, since people in the HPC market are already using nVidia Tesla/Cuda solutions... why would they bother to move to OpenCL? It's going to be slower, and porting the code over to OpenCL is probably going to be harder than it was to write it in the first place. Why bother?
I'd say vendor neutrality. When they see the benches of the 5870 (possibly lrb1 too) beating the crap out of a tesla, they'll want to.
Well, I find the performance of OpenCL to be quite poor currently, on nVidia hardware. Both DirectCompute and C for Cuda seem to be considerably faster (think factors 3-4 in some cases).
Yeah, the drivers are beta-ish now. I guess they'll invest more effort if AMD can do a half decent ocl drivers.
Did you ever try to actually use it? Setting up an OpenCL context is actually harder than setting up a DX11 device. After that, the APIs are pretty similar in compiling a kernel, setting up buffers, and running the actual code. The fact that a DX11 device can also do graphics operations aside from compute shaders, is not going to bother a scientist. The August 2009 DirectX SDK actually shows some non-graphics examples (some commandline-based calculation samples). They can just copy-paste the code.
Maybe you are right. But I am sceptical for now. Many strange things have happened in the past. Let's see what happens in future.
You may be in a minority. Others may know better about this.
I don't think I am. Most organizations have long-term service contracts with suppliers/IHVs anyway.
I'd say vendor neutrality. When they see the benches of the 5870 (possibly lrb1 too) beating the crap out of a tesla, they'll want to.
The irony of the success of a vendor neutral solution depending on the performance of particular vendors...
Yeah, the drivers are beta-ish now. I guess they'll invest more effort if AMD can do a half decent ocl drivers.
Yea, but the irony is that AMD doesn't have ANY GPU-accelerated drivers available, not even in beta-state (okay, maybe if you're in the 'inner circle', but not in the standard beta program anyway, you only get a CPU implementation... wow).
rpg.314
15-Sep-2009, 18:55
At the end of this you'll be judged on the amount of science you've produced, not on how wonderfully elegant your code is, by the way.
yet ppl have written a lot of code for cuda, even ported some mpi stuff :)
Maybe not every year. But yeah, even mature codes are under constant revision and are hardly static.
And if you develop libraries, the cycles are much shorter.
Yea, nVidia offers some nice standard cublas/fft libraries and all that. Haven't seen any libs for OpenCL yet.
nutball
15-Sep-2009, 19:12
yet ppl have written a lot of code for cuda, even ported some mpi stuff :)
For production, that they intend to support for the next decade come what may? I don't think so.
Maybe not every year. But yeah, even mature codes are under constant revision and are hardly static.
Functionality is under revision. The devs can add functionality because they're not spending their time chasing the latest flavour-of-the-month version of the API/language extensions required to keep it running on the flavour-of-the-month hardware.
And if you develop libraries, the cycles are much shorter.
Well maybe. A lot of HPC happens outside libraries.
Did you ever try to actually use it? Setting up an OpenCL context is actually harder than setting up a DX11 device. After that, the APIs are pretty similar in compiling a kernel, setting up buffers, and running the actual code.
Really? I think setting up an OpenCL context is pretty straight forward, while setting up a DX11 device is unnecessarily complex. Of course, if you go from the "create a platform and enumerate" way you'll find it very complex. However, most programs should just use "create an OpenCL context from device type" which is pretty straight forward. Just one function call, actually.
Furthermore, DirectCompute does not support anything other than a GPU, while OpenCL does not have this restriction.
My biggest problem with DirectCompute is with its syntax. It's derived from HLSL so general programmers who have no experience in 3D programming are not going to be very familiar with it. OpenCL and CUDA are much more similar with standard C syntax.
Comparing CUDA with OpenCL, of course CUDA is much more mature right now. Also it has some functions that OpenCL currently does not have. However, OpenCL has the advantage of vendor neutrality (which many people find useful) and it still has the ability to introduce extensions if necessary. So IMHO OpenCL isn't going anywhere, just like OpenGL. Of course, most games in Windows do not use OpenGL now, but the world is not just Windows, you know.
Really? I think setting up an OpenCL context is pretty straight forward, while setting up a DX11 device is unnecessarily complex. Of course, if you go from the "create a platform and enumerate" way you'll find it very complex. However, most programs should just use "create an OpenCL context from device type" which is pretty straight forward. Just one function call, actually.
At the end of the day, this debate is about as pointless as it was with setup code of OpenGL vs Direct3D. You write that code only once, probably on the first day of your project. Only someone with an agenda would try to make a big deal out of it. Not an actual developer.
Furthermore, DirectCompute does not support anything other than a GPU, while OpenCL does not have this restriction.
I honestly see absolutely no value in this.
OpenCL may run on CPUs, but the code is going to be nowhere near optimal for CPU architectures.
Seeing as the main point of writing massively parallel code is performance, I don't see myself ever using OpenCL on anything that is not a massively parallel architecture. So at this point I see no use for OpenCL outside GPUs.
rpg.314
16-Sep-2009, 08:27
I honestly see absolutely no value in this.
I see the value of printf() debugging there. :lol:
Panajev2001a
16-Sep-2009, 09:01
I honestly see absolutely no value in this.
OpenCL may run on CPUs, but the code is going to be nowhere near optimal for CPU architectures.
Seeing as the main point of writing massively parallel code is performance, I don't see myself ever using OpenCL on anything that is not a massively parallel architecture. So at this point I see no use for OpenCL outside GPUs.
Portion of your code with good deal of data dependencies, branching, and lower degree of data level parallelism?
There seems something to being able to keep under the same code-base different "problems" and get the best device available for such task handle it (I assume you would be able to dispatch work to the CPU or the GPU without just letting the system decide for you).
Portion of your code with good deal of data dependencies, branching, and lower degree of data level parallelism?
There seems something to being able to keep under the same code-base different "problems" and get the best device available for such task handle it (I assume you would be able to dispatch work to the CPU or the GPU without just letting the system decide for you).
Well, before I even THINK about using something like Cuda, OpenCL or DirectCompute, I ask myself the following questions:
1) Does performance matter?
2) Is the problem parallelizable?
So if performance didn't matter, I wouldn't be using OpenCL in the first place. Since performance matters by default in my OpenCL code, I don't see myself running it on CPUs.
At the end of the day, this debate is about as pointless as it was with setup code of OpenGL vs Direct3D. You write that code only once, probably on the first day of your project. Only someone with an agenda would try to make a big deal out of it. Not an actual developer.
I'm just saying that's my experience w.r.t. OpenCL vs DX11 because what you said is completely different from my experiences.
I honestly see absolutely no value in this.
OpenCL may run on CPUs, but the code is going to be nowhere near optimal for CPU architectures.
Seeing as the main point of writing massively parallel code is performance, I don't see myself ever using OpenCL on anything that is not a massively parallel architecture. So at this point I see no use for OpenCL outside GPUs.
Maybe from your perspective, but other people have other types of workloads which may be able to use CPU or other non-GPU devices (such as DSP or CELL). Having a common platform is definitely a plus. In a presentation at Hot Chips 2009, Intel said they can achieve 95% of the performance of hand-written SSEx with multithreading in OpenCL. I think that'd be quite useful.
In a presentation at Hot Chips 2009, Intel said they can achieve 95% of the performance of hand-written SSEx with multithreading in OpenCL. I think that'd be quite useful.
Well, even if they reach that 95% (which I doubt they will, except for isolated cases perhaps), that's still 95% of CPU performance, with GPUs having about a factor 10-20 more performance.
Thing is, in the real world, we already HAVE CPU-optimized code. Why bother recoding it in OpenCL if it's not going to be faster? I'd invest in a GPU and THEN rewrite it, because then I'd have significant performance gains. And I'd suggest everyone to invest in a GPU, because the price/performance can't be beat by CPUs, not by a longshot.
But for CPUs, no. Just stick to the code we already have. Any new code will be prototyped in C/C++ anyway, before going the GPGPU-route. So CPU-code will always be there, OpenCL would just be an extra, to leverage GPUs.
Karoshi
16-Sep-2009, 10:13
JIT. Sandy Bridge comes out, havok runs twice as fast. Same binaries.
rpg.314
16-Sep-2009, 10:16
Well, before I even THINK about using something like Cuda, OpenCL or DirectCompute, I ask myself the following questions:
1) Does performance matter?
2) Is the problem parallelizable?
I'd use it on cpu's for prototyping, debugging and for developing on machines (temporarily, of course as I move around) where igp's are worse off than cpu's.
rpg.314
16-Sep-2009, 10:17
JIT. Sandy Bridge comes out, havok runs twice as fast. Same binaries.
A very good point there.
JIT. Sandy Bridge comes out, havok runs twice as fast. Same binaries.
I'll believe it when I see it :)
Well, even if they reach that 95% (which I doubt they will, except for isolated cases perhaps), that's still 95% of CPU performance, with GPUs having about a factor 10-20 more performance.
Thing is, in the real world, we already HAVE CPU-optimized code. Why bother recoding it in OpenCL if it's not going to be faster? I'd invest in a GPU and THEN rewrite it, because then I'd have significant performance gains. And I'd suggest everyone to invest in a GPU, because the price/performance can't be beat by CPUs, not by a longshot.
But for CPUs, no. Just stick to the code we already have. Any new code will be prototyped in C/C++ anyway, before going the GPGPU-route. So CPU-code will always be there, OpenCL would just be an extra, to leverage GPUs.
They said they have 95% performance from OpenCL. Of course, that's just one case (their OpenCL program is also heavily optimized).
Furthermore, in the real world, we don't always have CPU-optimized code. There may be new algorithms that need implementing, or new applications. We constantly need to write something in SSE/SSE2 to achieve better performance, and that takes time. If OpenCL can provide a consistent way to write for CPU, GPU, DSP, so I don't have to write in SSE, Altivec, VFPlite, or some obscure vector instructions, that's a plus for me.
Furthermore, in the real world, we don't always have CPU-optimized code. There may be new algorithms that need implementing, or new applications. We constantly need to write something in SSE/SSE2 to achieve better performance, and that takes time.
But OpenCL and SSE aren't the same thing. Nor are DSPs the same as CPUs or GPUs.
OpenCL is designed for massive parallelism, while DSPs and SSE/AltiVec can also accelerate problems which are mostly sequential in nature.
I really think you're WAY over-simplifying the issue here. Look at point 2) I mentioned earlier.
OpenCL isn't going to solve all your SSE/DSP needs, not by a long shot. If the problem is well-suited for parallelism in the first place, then GPUs are the processors to run it on, even low-end ones.
But OpenCL and SSE aren't the same thing. Nor are DSPs the same as CPUs or GPUs.
OpenCL is designed for massive parallelism, while DSPs and SSE/AltiVec can also accelerate problems which are mostly sequential in nature.
I really think you're WAY over-simplifying the issue here. Look at point 2) I mentioned earlier.
OpenCL isn't going to solve all your SSE/DSP needs, not by a long shot. If the problem is well-suited for parallelism in the first place, then GPUs are the processors to run it on, even low-end ones.
I don't think so. In which way does one have to use OpenCL in a massively parallel way? I mean, what prevents me from writing an OpenCL kernel which works with only two threads?
Of course, OpenCL is largely similar to CUDA, which is designed for a massively parallel platform. However, that does not necessarily mean that it can only do massively parallel workloads. OpenCL does need some better tools, of course, such as a step debugger, and better profilers, but that's not a problem in OpenCL itself.
I agree that current OpenCL implementations are immature (even the Snow Leopard ones) and have some strange quirks, but I do think OpenCL has a future here. Compared to CUDA, it's vendor neutral, and it's similar enough that it has the potential to reach the same level as CUDA. Compared to DX11 compute shader, it's platform neutral, so it's not confined to a specific operating system. It also supports more devices than DX11, which supports only GPUs. For example, it's not hard to imagine an OpenCL implementation for CELL, but it's almost impossible to do so for DX11 compute shader.
I mean, what prevents me from writing an OpenCL kernel which works with only two threads?
Hopefully, common sense.
rpg.314
16-Sep-2009, 11:38
OpenCL isn't going to solve all your SSE/DSP needs, not by a long shot.
Which problem amenable to SSEx opencl will have trouble with? Can I have some examples?
Which problem amenable to SSEx opencl will have trouble with? Can I have some examples?
Many types of realtime audio processing for example. Lots of effects are of a time-linear and sequential nature. SSE will be useful because of its special signal-processing features (multiply-and-accumulate, saturated operations etc, just like what many DSPs are designed for), and you may be able to exploit some minor parallelism, but throwing a lot of threads at the problem isn't really going to get you far.
As I say, common sense should prevent you from using OpenCL on (SSE-friendly) problems that aren't going to benefit from parallelism.
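A rough Python sketch of the kind of effect meant here, assuming a one-pole low-pass filter as the stand-in "time-linear" effect and 16-bit samples for the saturated add. Both are illustrative choices, not taken from any particular audio library. Each output sample needs the previous output, so the filter loop cannot be split across thousands of threads the way a per-sample gain can.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturating_add(a, b):
    """Add two 16-bit samples, clamping at the limits instead of wrapping."""
    return max(INT16_MIN, min(INT16_MAX, a + b))


def one_pole_lowpass(samples, alpha=0.5):
    """y[n] = alpha * x[n] + (1 - alpha) * y[n-1]

    The y[n-1] term is a loop-carried dependency: sample n cannot be
    computed before sample n-1 is finished.
    """
    out = []
    prev = 0.0
    for x in samples:
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out


def gain(samples, factor):
    """Per-sample gain: every output is independent of the others."""
    return [factor * x for x in samples]
```

Only `gain` is the embarrassingly parallel shape that maps onto many threads; `one_pole_lowpass` is serial along the time axis, although several independent channels could still be filtered side by side.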
Many types of realtime audio processing for example. Lots of effects are of a time-linear and sequential nature. SSE will be useful because of its special signal-processing features (multiply-and-accumulate, saturated operations etc, just like what many DSPs are designed for), and you may be able to exploit some minor parallelism, but throwing a lot of threads at the problem isn't really going to get you far.
As I say, common sense should prevent you from using OpenCL on (SSE-friendly) problems that aren't going to benefit from parallelism.
And you can't use OpenCL with just even one thread on these problems?
Of course, OpenCL does not have some DSP oriented vector operations right now, but that does not mean a vendor can't provide some extensions for it.
To me, OpenCL has the potential to abstract underlying architectural details including multi-threading, vector instructions, etc. It does not have to be a CUDA-clone, although it looks like one right now. And obviously DX11 compute shader is even farther away from this goal.
And you can't use OpenCL with just even one thread on these problems?
I said common sense should prevent you from doing such things. Please try to keep up.
I said common sense should prevent you from doing such things. Please try to keep up.
<sigh> So you mean I shouldn't write my FFT kernel with OpenCL for running on a CPU so I don't have to bother with special instruction sets, just because of your common sense. Good.
<sigh> So you mean I shouldn't write my FFT kernel with OpenCL for running on a CPU so I don't have to bother with special instruction sets, just because of your common sense. Good.
No I didn't mean that at all, but you understand that I'm going to give up on this discussion now.
entity279
16-Sep-2009, 17:41
I don't think so. In which way does one have to use OpenCL in a massively parallel way? I mean, what prevents me from writing an OpenCL kernel which works with only two threads?
Especially since OpenCL also supports task-parallelism. So at least on CPU, this will have a use.
Well, I find the performance of OpenCL to be quite poor currently, on nVidia hardware. Both DirectCompute and C for Cuda seem to be considerably faster (think factors 3-4 in some cases).
Why do you think this is?
Karoshi
16-Sep-2009, 20:51
Why do you think this is?
Obviously because CUDA is the better solution. A $129 GTS250 running CUDA can beat a $500 GTX295 running OCL!!!11!!!1!!
Why do you think this is?
I can only guess at what the status of the current OpenCL runtime is.
Perhaps the runtime is a debug build, full of excess overhead, rather than a streamlined release build.
All I'm saying is that in the current form, I don't think it's ready for release.
aaronspink
16-Sep-2009, 22:43
I honestly see absolutely no value in this.
OpenCL may run on CPUs, but the code is going to be nowhere near optimal for CPU architectures.
Seeing as the main point of writing massively parallel code is performance, I don't see myself ever using OpenCL on anything that is not a massively parallel architecture. So at this point I see no use for OpenCL outside GPUs.
being able to do functional test on just about any hardware config is kind of a nice thing quite honestly.
You also assume GPUs are the only devices which exist which couldn't be further from the truth...
You also assume GPUs are the only devices which exist which couldn't be further from the truth...
I have made no such assumption anywhere.
I have only said that I currently don't see any common devices other than GPUs that really benefit from the parallel programming model that OpenCL uses, in terms of performance.
Reading comprehension?
To explain it in terms that a 6-year old can understand: What good does it do me that my OpenCL code could theoretically run on an iPhone or whatever, when I'm going for HPC?
I mean, I know I could dust off my old Pentium and theoretically do some x264 HD encoding on it. The software will run on it. It would probably take the machine years to actually finish a movie though, so I just pick a fast system instead.
aaronspink
16-Sep-2009, 22:51
But OpenCL and SSE aren't the same thing. Nor are DSPs the same as CPUs or GPUs.
OpenCL is designed for massive parallelism, while DSPs and SSE/AltiVec can also accelerate problems which are mostly sequential in nature.
support for seamless parallelism is but one facet of OpenCL. You do know what CL stands for right?
The point was to provide a language/interface which could easily JIT/recompile to a range of devices (and yes, even 2 different GPUs are a range of devices) to enable a write once, run anywhere system. One way to think of CL is as effectively the Java-like system designed for computational workloads. Yes it can do everything CUDA and CS can do, but it also does a lot more.
And really, if SSE/AV accelerate a problem, then it likely isn't mostly serial in nature.
I really think you're WAY over-simplifying the issue here. Look at point 2) I mentioned earlier.
OpenCL isn't going to solve all your SSE/DSP needs, not by a long shot. If the problem is well-suited for parallelism in the first place, then GPUs are the processors to run it on, even low-end ones.
There are various degrees of parallelism in the world scali. In addition, you shouldn't write off portability so quickly. For all we know, nvidia won't exist in 2 years... What then?
And really, if SSE/AV accelerate a problem, then it likely isn't mostly serial in nature.
Instructions with 4 or 8 scalar operations in parallel is not exactly the type of parallelism that Cuda, OpenCL or DirectCompute are aiming at.
SSE benefits me with small levels of parallelism, eg a sequential set of operations on 4d or 8d data types.
Doesn't mean the parallelism can be extended beyond the implicit parallelism in the datatypes used. So it's still mostly serial in nature, which is actually more likely than that it is massively parallel.
There are various degrees of parallelism in the world scali. In addition, you shouldn't write off portability so quickly. For all we know, nvidia won't exist in 2 years... What then?
For your information, I actually joined the registered GPU developer program to get my hands on OpenCL, because I wanted to move away from Cuda.
Really, try not assuming so much about people, you only make a fool out of yourself.
In fact, I don't see why you mentioned portability in the first place, especially relating to that quote? You make no sense.
Andrew Lauritzen
17-Sep-2009, 04:33
I have only said that I currently don't see any common devices other than GPUs that really benefit from the parallel programming model that OpenCL uses, in terms of performance.
I don't think this is true. The restricted programming model of OpenCL and similar languages provides huge benefits for compiling performance code, such as aliasing guarantees, straightforward SIMD lane packing and predication, etc. As it turns out, C is really an awful language for performance work, and often a language like OpenCL, HLSL, etc. can be compiled to run better on the CPU than the equivalent C code. RapidMind and others have demonstrated this many times on lots of different HPC kernels. Furthermore the benefit of compiling to the 8+ HW threads with excellent scaling on modern CPUs shouldn't be understated.
Really, a more restricted programming model is better for all performance programming moving forward, CPU, GPU or otherwise. OpenCL/ComputeShader/CUDA clearly aren't the end goals here, but they're steps in the right direction. While they have some constructs that currently map better to some architectures than others, the general concepts of a safer language with stricter data locality and aliasing guarantees is totally required moving forward, and already pays dividends even on CPUs.
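A small illustration of the aliasing point, written in Python purely to show the semantics (it says nothing about actual performance). The two functions are hypothetical: one updates a buffer element by element in plain C loop order, the other loads all inputs before storing anything, as a SIMD version would. They agree only when the inputs and outputs do not overlap, which is the sort of thing a C compiler usually cannot prove on its own.

```python
def add_prev_scalar(buf):
    """buf[i + 1] += buf[i], one element at a time (plain C loop order)."""
    for i in range(len(buf) - 1):
        buf[i + 1] += buf[i]
    return buf


def add_prev_vectorised(buf):
    """Same statement, but every load happens before any store (SIMD-style)."""
    loaded = buf[:-1]  # snapshot of all the inputs
    for i, value in enumerate(loaded):
        buf[i + 1] += value
    return buf


if __name__ == "__main__":
    print(add_prev_scalar([1, 1, 1, 1, 1]))
    print(add_prev_vectorised([1, 1, 1, 1, 1]))
```

Here the output overlaps the input, so the scalar order gives [1, 2, 3, 4, 5] and the load-then-store order gives [1, 2, 2, 2, 2]. With a guarantee that the two never overlap, a compiler is free to pick the second form.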
aaronspink
17-Sep-2009, 05:28
I have made no such assumption anywhere.
I have only said that I currently don't see any common devices other than GPUs that really benefit from the parallel programming model that OpenCL uses, in terms of performance.
Reading comprehension?
Hmm, I'm pretty sure OCL provides a nice way to do things like program multi-core CPUs with SIMD...
Tim Murray
17-Sep-2009, 05:57
Yes, OCL is a much friendlier way of writing SSE code than writing, you know, SSE. Especially since with a decent compiler you won't end up rewriting big chunks of code for AVX--it'll just work.
(Congratulations, you made me agree with Aaron. I hope you're happy :evil:)
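Roughly what "it'll just work" means, as a toy Python model rather than real OpenCL: the kernel is written for a single element, and a stand-in runtime (the hypothetical `run_kernel` below) decides how many lanes to process per step. Going from 4 lanes (SSE-like) to 8 lanes (AVX-like) changes nothing in the kernel or in the results.

```python
def saxpy_kernel(a, x, y):
    """Per-element work, written once with no vector width in sight."""
    return a * x + y


def run_kernel(kernel, a, xs, ys, lanes):
    """Hypothetical runtime: walks the data `lanes` elements per step."""
    out = []
    for start in range(0, len(xs), lanes):
        chunk = zip(xs[start:start + lanes], ys[start:start + lanes])
        out.extend(kernel(a, x, y) for x, y in chunk)
    return out


if __name__ == "__main__":
    xs = list(range(10))
    ys = [1] * 10
    sse_like = run_kernel(saxpy_kernel, 2, xs, ys, lanes=4)
    avx_like = run_kernel(saxpy_kernel, 2, xs, ys, lanes=8)
    print(sse_like == avx_like)
```

With hand-written intrinsics the lane count is baked into the source, which is where the rewrite for a wider instruction set comes from.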
Hmm, I'm pretty sure OCL provides a nice way to do things like program multi-core CPUs with SIMD...
Yea it does, but as I was saying earlier, what's the point? It won't be faster than existing CPU code (pretty much all major parallel problems already have multicore SIMD optimized libraries -> no benefit), and it will be nowhere as fast as running the same code on a GPU (no performance).
I never said you *can't* do it, I just said it doesn't make sense to me, from a performance point-of-view. Again, reading comprehension.
Yea it does, but as I was saying earlier, what's the point? It won't be faster than existing CPU code (pretty much all major parallel problems already have multicore SIMD optimized libraries -> no benefit), and it will be nowhere near as fast as running the same code on a GPU (no performance).
Not all problems have multi-core SIMD optimized libraries. And not all customers want or have a GPU in their computers. It's as simple as that. Not even all problems suitable for a multi-core SIMD CPU are as suitable for a GPU, unless your GPU is called Larrabee.
I don't think this is true. The restricted programming model of OpenCL and similar languages provides huge benefits for compiling performance code, such as aliasing guarantees, straightforward SIMD lane packing and predication, etc. As it turns out, C is really an awful language for performance work, and often a language like OpenCL, HLSL, etc. can be compiled to run better on the CPU than the equivalent C code. RapidMind and others have demonstrated this many times on lots of different HPC kernels. Furthermore the benefit of compiling to the 8+ HW threads with excellent scaling on modern CPUs shouldn't be understated.
Really, a more restricted programming model is better for all performance programming moving forward, CPU, GPU or otherwise. OpenCL/ComputeShader/CUDA clearly aren't the end goals here, but they're steps in the right direction. While they have some constructs that currently map better to some architectures than others, the general concept of a safer language with stricter data locality and aliasing guarantees is totally required moving forward, and already pays dividends even on CPUs.
I'm sorry, but it sounds like you didn't understand anything of what I said. I agree with everything you say here (it's blatantly obvious to anyone who understands the basics of OpenCL and C/C++), but it has nothing to do with any of the arguments I gave.
You focus only on massively parallel problems, which will always run better on GPUs than on CPUs.
Yea it does, but as I was saying earlier, what's the point? It won't be faster than existing CPU code (pretty much all major parallel problems already have multicore SIMD optimized libraries -> no benefit), and it will be nowhere near as fast as running the same code on a GPU (no performance).
Portability, scalability, easier code iteration/shortened/simplified development, etc..
I wish I had something like OpenCL when I was a (multiplatform) console programmer, it would have made my life so much easier..
I'm sorry, but it sounds like you didn't understand anything of what I said. I agree with everything you say here (it's blatantly obvious to anyone who understands the basics of OpenCL and C/C++), but it has nothing to do with any of the arguments I gave.
You focus only on massively parallel problems, which will always run better on GPUs than on CPUs.
Strongly disagree, not all massively parallel problems map well to GPUs. And CPUs aren't immutable either..
Not all problems have multi-core SIMD optimized libraries. And not all customers want or have a GPU in their computers. It's as simple as that. Not even all problems suitable for a multi-core SIMD CPU are as suitable for a GPU, unless your GPU is called Larrabee.
I think you mean to say: not all problems suitable for a multi-core SIMD CPU are as suitable for OpenCL.
Portability, scalability, easier code iteration/shortened/simplified development, etc..
Not if I have already written the code.
Strongly disagree, not all massively parallel problems map well to GPUs. And CPUs aren't immutable either..
I never said CPUs are immutable, I clearly stated 'at this time'. Please, read more carefully.
Besides, the point isn't so much about mapping problems to GPUs as it is about mapping problems to OpenCL.
Not if I have already written the code.
Written how? Will it scale to more cores? To wider SIMD ISAs? Is it portable, and to what degree? etc.. There's more to life than 30-year-old mainframe code :)
I think you mean to say: not all problems suitable for a multi-core SIMD CPU are as suitable for OpenCL.
No. I used to write some search problems for GPU (in CUDA). It works, actually works pretty well, but it certainly can use some help from the CPU. In some cases the CPU runs better than the GPU.
These problems are in some ways "massively parallel" but they are branchy inside the kernel. There wouldn't be too much trouble writing these kernels with OpenCL, but GPUs certainly do not like this branchy code.
Written how? Will it scale to more cores? To wider SIMD ISAs? Is it portable, and to what degree? etc.. There's more to life than 30-year-old mainframe code :)
I think you just managed to contradict yourself :)
Firstly, what I meant with "already written the code" is when I've already ported it to other architectures etc. So maybe it wasn't exactly write once, run anywhere, but if the work is already done, I'm not going to re-do it in OpenCL just for kicks. I may have to redo it for other architectures in the future, but we'll cross that bridge when we get there.
Secondly, your argument about 30-year-old mainframe code also applies to OpenCL.
I think people are looking at OpenCL from a far too theoretical point-of-view. Yes, it's portable, and will run on different types of processors... but although the differences aren't as obvious as with native code, you will want to rewrite your OpenCL code for different architectures if you want to get good performance. Things like branching, cache sizes, total parallelism, serial execution speed, general overhead of executing kernels and extracting results and all that...
These things are all susceptible to change on different architectures. So I doubt that in 30 years you'd be running the same OpenCL code you wrote today.
New architectures will arrive, new revisions of OpenCL with extra features for these architectures etc... You might be looking at at least the same speedup from rewriting your code as you would now when going from SSE to AVX.
The same arguments also apply to bog-standard C/C++ (not including any SSE extensions or whatever). Sure, it's portable, can be compiled for many types of processors... but even then, there's plenty of room for architecture-specific optimizations. I can think of plenty examples of code that would run very well on one architecture, but have appalling performance on another. It's hard to know the 'middle ground' for architectures that haven't even been developed yet.
No. I used to write some search problems for GPU (in CUDA). It works, actually works pretty well, but it certainly can use some help from the CPU. In some cases the CPU runs better than the GPU.
Are you talking about Cuda running on CPU vs Cuda on GPU? Or about Cuda vs regular CPU code?
These problems are in some ways "massively parallel" but they are branchy inside the kernel. There wouldn't be too much trouble writing these kernels with OpenCL, but GPUs certainly do not like this branchy code.
This at least proves my point that the code isn't as write once, run anywhere as some people make it out to be. Even with OpenCL you still have to pay attention to the architecture you're using.
However, I'm not sure I see your point... 'branching' in OpenCL on a CPU is done with masking on the SSE registers, much like on a GPU, as far as I know. So it wouldn't be able to take advantage of the branching hardware in the CPU.
I think you just managed to contradict yourself :)
Firstly, what I meant with "already written the code" is when I've already ported it to other architectures etc. So maybe it wasn't exactly write once, run anywhere, but if the work is already done, I'm not going to re-do it in OpenCL just for kicks. I may have to redo it for other architectures in the future, but we'll cross that bridge when we get there.
Secondly, your argument about 30-year-old mainframe code also applies to OpenCL.
I think people are looking at OpenCL from a far too theoretical point-of-view. Yes, it's portable, and will run on different types of processors... but although the differences aren't as obvious as with native code, you will want to rewrite your OpenCL code for different architectures if you want to get good performance. Things like branching, cache sizes, total parallelism, serial execution speed, general overhead of executing kernels and extracting results and all that...
These things are all susceptible to change on different architectures. So I doubt that in 30 years you'd be running the same OpenCL code you wrote today.
New architectures will arrive, new revisions of OpenCL with extra features for these architectures etc... You might be looking at at least the same speedup from rewriting your code as you would now when going from SSE to AVX.
The same arguments also apply to bog-standard C/C++ (not including any SSE extensions or whatever). Sure, it's portable, can be compiled for many types of processors... but even then, there's plenty of room for architecture-specific optimizations. I can think of plenty examples of code that would run very well on one architecture, but have appalling performance on another. It's hard to know the 'middle ground' for architectures that haven't even been developed yet.
No one is saying that you won't need to update your OpenCL code or tweak it to run more efficiently on this or that architecture. No one is claiming that OpenCL won't evolve, or perhaps even be replaced by something better (for instance, I believe its memory model is seriously flawed). On the other hand, even when your code runs well on architecture A but not so well on architecture B, *it will still run*.
Optimized C/C++ code that makes use of vector extensions/intrinsics/inline asm and/or auto-vectorizing compilers is basically not portable, doesn't naturally scale with more cores or wider (or even narrower!) SIMD vectors, and it's a freaking mess to maintain. There's a dramatic difference between an application that doesn't run efficiently and an application that doesn't run at all.
However, I'm not sure I see your point... 'branching' in OpenCL on a CPU is done with masking on the SSE registers, much like on a GPU, as far as I know. So it wouldn't be able to take advantage of the branching hardware in the CPU.
Branching is simply an implementation detail: you can implement it with predication (not just masking, which can have serious side effects) or without, on CPUs or GPUs. A smart compiler would generate a proper branch if that lets it (dynamically) skip hundreds of instructions instead of executing them. Think about early-z rejection in a pixel shader.
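The two shapes being argued about can be sketched in plain C. This is an illustration only: `expensive` is a made-up stand-in for "hundreds of instructions", the other names are hypothetical too, and a real compiler is free to turn either form into the other. The first function takes a real branch and skips the work for elements that fail the test. The second always does the work and masks the result away, which is roughly what lane masking amounts to.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a long, costly code path. */
static uint32_t expensive(uint32_t x)
{
    for (int k = 0; k < 100; ++k)
        x = x * 1664525u + 1013904223u;
    return x;
}

/* Real branch: even inputs skip the expensive path entirely. */
void run_branch(const uint32_t *in, uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (in[i] & 1u)
            out[i] = expensive(in[i]);
        else
            out[i] = 0u;
    }
}

/* Compute-and-mask: every element pays for expensive(), and the result is
   then kept or zeroed. Only safe because expensive() has no side effects
   and cannot fault on the inputs that end up masked off. */
void run_masked(const uint32_t *in, uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        uint32_t mask = 0u - (in[i] & 1u);  /* all ones if odd, zero if even */
        out[i] = expensive(in[i]) & mask;
    }
}
```

Both produce identical output. The difference is cost when most elements fail the test, and safety when the skipped path would store to memory or could fault.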