NVIDIA GF100 & Friends speculation

Isn't CUDA a bit faster? At least that's what I've seen from the few benchmarks I've stumbled upon.

Edit: after reading Rys's post, I guess it may just be due to it being used better.
Not entirely certain. The person I spoke to mostly used CUDA for spherical harmonic transformations (typically for ell values up to a few thousand), which require large numbers of FFTs, so maybe that's a special case. But he claimed that there really wasn't any performance reason for choosing CUDA over OpenCL.
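For anyone unfamiliar with that kind of workload: on the CUDA side it mostly boils down to calls into cuFFT. A minimal sketch of a batched 1D transform - the sizes and the lack of error checking are purely illustrative, not his actual code - looks something like this:

Code:
// Minimal cuFFT sketch: a batch of 1D complex-to-complex FFTs, the kind of
// building block a spherical harmonic transform calls many times over.
// N, BATCH and the missing error checks are illustrative only.
#include <cuda_runtime.h>
#include <cufft.h>

int main()
{
    const int N = 4096;      // length of each transform (hypothetical)
    const int BATCH = 512;   // number of transforms done in one call

    cufftComplex* d_data = 0;
    cudaMalloc((void**)&d_data, sizeof(cufftComplex) * N * BATCH);
    // ... copy input data into d_data with cudaMemcpy ...

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, BATCH);             // plan BATCH transforms of length N
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   // in-place forward FFTs
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}

An OpenCL version would need an equivalent FFT implementation on top of the runtime, so for this sort of work library availability arguably matters as much as any compiler difference.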
 
Not entirely certain. The person I spoke to mostly used CUDA for spherical harmonic transformations (typically for ell values up to a few thousand), which require large numbers of FFTs, so maybe that's a special case. But he claimed that there really wasn't any performance reason for choosing CUDA over OpenCL.

These are the only benchmarks I've seen so far, and I've mostly just skimmed through the paper, but CUDA does seem to have a consistent advantage.

It's a shame they didn't benchmark a couple of AMD cards, though.
 
These are the only benchmarks I've seen so far, and I've mostly just skimmed through the paper, but CUDA does seem to have a consistent advantage.

It's a shame they didn't benchmark a couple of AMD cards, though.
If this is representative of the situation today, it won't make much difference to most users doing large-scale computation. In those situations, an improvement of 10% just means waiting 20 hours for your computation to finish instead of 22. The improvements that really matter there are order-of-magnitude improvements, not ones of tens of percent.

So I can understand this guy saying that to me, even if there is still a small performance benefit for CUDA. I just don't think it will matter for anything but realtime calculations (e.g. games). For large-scale computation, a performance difference of a few tens of percent is most definitely not worth developing for a separate compute interface on each vendor's hardware.
 
Do you have a source for that? I always thought PTX translation happened with a non-LLVM front end.

EDIT: never mind.

Anyway, once ocl is compiled to ptx, there is not going to be much difference there.

Besides, at this point cuda vs ocl is essentially open64 vs llvm. And IMHO, llvm is developing at a better rate than open64.
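For the curious: you can see the two paths converging yourself, because NVidia's OpenCL implementation has (at least historically) handed back PTX text when you ask the runtime for the program "binary". A rough sketch, single GPU device assumed and error checking omitted:

Code:
// Sketch: build a trivial OpenCL kernel and dump the "binary" the driver
// returns. On NVidia's implementation this has been PTX text, i.e. the same
// intermediate language CUDA compiles to. The kernel is a made-up example.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main()
{
    const char* src =
        "__kernel void scale(__global float* x, float k) {"
        "    x[get_global_id(0)] *= k;"
        "}";

    cl_platform_id platform;  clGetPlatformIDs(1, &platform, 0);
    cl_device_id   device;    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, 0);
    cl_context ctx = clCreateContext(0, 1, &device, 0, 0, 0);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, 0, 0);
    clBuildProgram(prog, 1, &device, "", 0, 0);

    size_t size = 0;
    clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, 0);
    if (size == 0) return 1;                      // no binary available

    std::vector<unsigned char> binary(size);
    unsigned char* ptr = &binary[0];
    clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr, 0);

    fwrite(&binary[0], 1, size, stdout);          // PTX text on NVidia
    return 0;
}

Other implementations return their own container formats there, of course, so don't expect PTX from an AMD device.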
 
Besides, at this point cuda vs ocl is essentially open64 vs llvm.
That's essentially irrelevant. NVidia is in control of compilation from the CUDA API down to binary, i.e. it "owns" all the code; it's a compilation pipeline that's entirely in-house.

Apple "owns" LLVM.

This is the same as how Microsoft "owns" the HLSL compiler. So when working on DirectCompute it's quite possible for the HLSL compiler to generate code that defeats the optimisers in the IHVs' driver compilers. See the recent discussion of NLM image denoising for a perfect example of a third-party compiler totally screwing with compilation quality and performance.
 
I think earlier marketing/PR had it as 460 (or even 360) no?


oh, and grapevine time! first batch 465s will be sellable on the fact that they're 470s that didn't make the cut, so BIOS flash at your own risk.

/prepares glee for the idiot gamers who don't appreciate the brevity of their new card's life after a borked flash
 
That's essentially irrelevant. NVidia is in control of compilation from the CUDA API down to binary, i.e. it "owns" all the code; it's a compilation pipeline that's entirely in-house.

They own the ocl to ptx pipeline too. Nobody's forcing them to use llvm there. They could very well parse OCL to WHIRL and then use their open64 bits to compile it down to ptx if they choose to. And frankly, maintaining two separate compilation paths long term is as dumb as it gets.

They are using LLVM for ocl out of their own choice, not because of anyone else.

Apple "owns" LLVM.
Not really. Look at the people involved here. http://llvm.org/devmtg/2009-10/

This is the same as how Microsoft "owns" the HLSL compiler. So when working on DirectCompute it's quite possible for the HLSL compiler to generate code that defeats the optimisers in the IHVs' driver compilers. See the recent discussion of NLM image denoising for a perfect example of a third-party compiler totally screwing with compilation quality and performance.

Unlike MS's HLSL compiler, LLVM is not a monolith. It is actually a vast (really vast) library used for writing compilers. You just pick and choose whichever bits of LLVM you like most for your own compiler. And using it is hardly mandated by OCL. If it gets too troublesome, they can just get rid of it, unlike the HLSL compiler.
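To make the "library, not monolith" point concrete, here's roughly what using just two pieces of LLVM looks like through the stable C bindings: build the IR for a trivial function, then run a single optimisation pass over it. Exact headers and entry points shift a little between LLVM releases, so treat this as a sketch:

Code:
// Sketch: use two pieces of LLVM on their own, via the C bindings - build
// the IR for "int add(int a, int b)" and run one optimisation pass over it.
// No code generation, no JIT; only the bits explicitly pulled in.
#include <llvm-c/Core.h>
#include <llvm-c/Transforms/Scalar.h>

int main()
{
    LLVMModuleRef mod = LLVMModuleCreateWithName("demo");

    LLVMTypeRef params[2] = { LLVMInt32Type(), LLVMInt32Type() };
    LLVMTypeRef fnty = LLVMFunctionType(LLVMInt32Type(), params, 2, 0);
    LLVMValueRef fn = LLVMAddFunction(mod, "add", fnty);

    LLVMBuilderRef b = LLVMCreateBuilder();
    LLVMPositionBuilderAtEnd(b, LLVMAppendBasicBlock(fn, "entry"));
    LLVMValueRef sum = LLVMBuildAdd(b, LLVMGetParam(fn, 0), LLVMGetParam(fn, 1), "sum");
    LLVMBuildRet(b, sum);

    // Pick one transformation pass and run it - nothing else comes along for the ride.
    LLVMPassManagerRef pm = LLVMCreatePassManager();
    LLVMAddInstructionCombiningPass(pm);
    LLVMRunPassManager(pm, mod);

    LLVMDumpModule(mod);   // print the resulting IR

    LLVMDisposePassManager(pm);
    LLVMDisposeBuilder(b);
    LLVMDisposeModule(mod);
    return 0;
}

No code generator, no JIT, no driver involved - just the pieces that were chosen.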
 
Nobody's forcing them to use llvm there.
My understanding is that OpenCL requires use of LLVM.

It would be interesting if LLVM directly targeted the ISA of each GPU - but that's unlikely to happen because the IHVs have enough work to support/tune their own intermediate languages (IL and PTX). I suppose it could work for Larrabee though.
 
My understanding is that OpenCL requires use of LLVM.
WHY?:oops:

People use LLVM because it is, frankly, awesome.

It would be interesting if LLVM directly targeted the ISA of each GPU - but that's unlikely to happen because the IHVs have enough work to support/tune their own intermediate languages (IL and PTX). I suppose it could work for Larrabee though.

I think AMD compiles IL to ISA using an LLVM based optimizer and code generator.

Besides, not exposing the ISA gives the IHVs a lot of flexibility.
 
Because OpenCL is owned by Apple.

People use LLVM because it is, frankly, awesome.
But as used in OpenCL it's miles away from knowing about GPUs.

I think AMD compiles IL to ISA using an LLVM based optimizer and code generator.
I expect you'll come to your senses on that.

Besides, not exposing the ISA gives the IHVs a lot of flexibility.
At least each family of ATI GPUs will require tailored compilation - within R700 family you'll see tailored compilations for different GPUs. Regardless of what compiler is used inside the driver (AMD's own or some imaginary LLVM-based chain), it has to be tailored. In other words the flexibility is a function of the ethos of GPU implementation, not something nice to have. It's not optional.

Additionally, the "ISA code" you see generated by GPUSA or SKA is not enough to actually run the kernel on the GPU. There are registers inherited by the kernel and of course there's a pile of state and pipeline configuration that also has to be done. That's all stuff the driver does, and another variable in performance.
 
Because OpenCL is owned by Apple.
Oh dear....

Seriously, then who owns OGL? Khronos? CAD companies? MS by proxy?:LOL:

But as used in OpenCL it's miles away from knowing about GPUs.
LLVM doesn't know about CPUs either. It's a library containing a bunch of IR optimization/transformation passes. Along with some code for generating IR and JITing it. All of it is separate. You use whatever pieces of the pie you like. You bend it. You mend it.

Here's one way of using LLVM.

http://donsbot.wordpress.com/2010/03/01/evolving-faster-haskell-programs-now-with-llvm/

People have built static bug finders with LLVM too. Does it mean that LLVM understands CPUs?

I expect you'll come to your senses on that.
Fine, you tell us your theory.
At least each family of ATI GPUs will require tailored compilation - within R700 family you'll see tailored compilations for different GPUs. Regardless of what compiler is used inside the driver (AMD's own or some imaginary LLVM-based chain), it has to be tailored. In other words the flexibility is a function of the ethos of GPU implementation, not something nice to have. It's not optional.

Big deal. Maybe you should look into the compilers for uCs (microcontrollers). They also tailor their code for each member of the family. It's been a done deal for what, decades now? ;)

Besides, the TableGen DSL LLVM has is pretty nicely suited for this sort of per-GPU codegen.

Additionally, the "ISA code" you see generated by GPUSA or SKA is not enough to actually run the kernel on the GPU. There are registers inherited by the kernel and of course there's a pile of state and pipeline configuration that also has to be done. That's all stuff the driver does, and another variable in performance.
All those variables apply to CUDA too.
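To make that concrete: with the CUDA driver API the same pile of state is explicit in the host code - context, module, function handle, launch configuration. The PTX file and kernel name below ("kernel.ptx", "my_kernel") are made up for illustration:

Code:
// Sketch of the state a CUDA driver-API launch drags along: context, module,
// function handle, grid/block shape, argument list. Names are hypothetical,
// error checking omitted.
#include <cuda.h>

int main()
{
    cuInit(0);

    CUdevice  dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);           // per-thread context state

    CUmodule   mod; cuModuleLoad(&mod, "kernel.ptx");     // PTX gets JITed here
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "my_kernel");

    CUdeviceptr buf; cuMemAlloc(&buf, 1024 * sizeof(float));
    int n = 1024;
    void* args[] = { &buf, &n };

    cuLaunchKernel(fn,
                   4, 1, 1,      // grid dimensions
                   256, 1, 1,    // block dimensions
                   0, 0,         // shared memory bytes, stream
                   args, 0);
    cuCtxSynchronize();

    cuMemFree(buf);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}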
 
Oh dear....
After your earlier brainfart, "AFAIK, ptx is parsed into LLVM IR and then some optimizations and code gen happens." : :rolleyes:

Seriously, then who owns OGL? Khronos? CAD companies? MS by proxy?:LOL:
SGI owns the trademark. The difference, as far as I can tell, is that Apple makes use of LLVM mandatory for implementors of OpenCL - seems like a good idea in the long run, to be honest. Open source goodness all the way, eventually.

Fact is, there are problems in OpenCL due to LLVM producing code that the IHVs' compilers can't work with. Dunno how much longer this basic stuff is going to continue to throw up deal-breaking errors.

Obviously we can't tell how much of that is the fault of the users of LLVM (AMD, NVidia) as opposed to LLVM itself - clearly there have been language conformance problems in both AMD's and NVidia's implementations of OpenCL.
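When those errors do turn up, about the only visibility either implementation gives you is the build log coming back from the IHV's compiler. A sketch of fishing it out (single device, names illustrative; it slots in after clCreateProgramWithSource, as in the earlier sketch):

Code:
// Sketch: when clBuildProgram fails, ask the IHV's compiler for its build log.
// This is where the "front end emitted something my back end can't digest"
// class of error surfaces. Single device; other error checks omitted.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

void print_build_log(cl_program prog, cl_device_id dev)
{
    if (clBuildProgram(prog, 1, &dev, "", 0, 0) != CL_SUCCESS) {
        size_t len = 0;
        clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, 0, 0, &len);
        std::vector<char> log(len + 1, 0);
        clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, len, &log[0], 0);
        std::fprintf(stderr, "build failed:\n%s\n", &log[0]);
    }
}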

Ditching Brook+ in favour of OpenCL wasn't much of a problem for AMD, because AMD didn't really have a huge investment there. Brook+ was originally a front-end for DX HLSL (evolved from BrookGPU, as distinct from Brook, a stream language), and started off solely generating pixel shaders (with a supporting vertex shader to kick them into life).

By contrast NVidia has spent many years heavily investing in its own languages - going back to Cg thence to C for CUDA - designed specifically for its own hardware. So, with OpenCL, NVidia loses some control. Theoretically, once LLVM's working right in this environment, there'll be gains to be had.

You might argue that NVidia will ditch its internal technologies and adopt an LLVM-based chain entirely across all its platforms (including D3D graphics). This doesn't address the basic fact that OpenCL is a generalised API, aimed at including chips that aren't GPUs. It's not the closest-fit for NVidia's hardware and the variant of C in OpenCL is theoretically immature compared with what NVidia is doing (or has planned) for C for CUDA.

Even with NVidia all-LLVM, OpenCL is still "relatively distant", an abstraction in the middle that serves no purpose in a pure NVidia environment.

LLVM doesn't know about CPUs either. It's a library containing a bunch of IR optimization/transformation passes. Along with some code for generating IR and JITing it. All of it is separate. You use whatever pieces of the pie you like. You bend it. You mend it.
That's my point - no-one's done LLVM from a C-like language (HLSL-like or OpenCL-C-like) to GPU ISA yet. If it's work in progress then maybe we'll get to hear about it some time.

Just in Time ... for lunch :LOL:

People have built static bug finders with LLVM too. Does it mean that LLVM understands CPUs?
LLVM and its environment is a major work in progress, in case you haven't noticed. It's why we see OpenCL dependent upon fixes to LLVM, it's quite immature in this application. Go have a look at AMD's and NVidia's developer forums.

Fine, you tell us your theory.
I suggest you read the first page of the IL Specification. You'll see that AMD's IL is based upon D3D10's assembly language, which is an evolution of DX9 assembly. IL was designed as a thin layer twixt D3D assembly and ISA upon the introduction of R600 (i.e. D3D10). AMD (ATI) has been doing DX-style assembly->ISA since before LLVM came into existence.

And, by the way, I'm not suggesting that LLVM isn't usable in doing OpenCL->ISA without intermediate steps. I'm saying that the LLVM ecosystem isn't mature enough for AMD and NVidia to entirely junk their driver-level JIT technology. And that OpenCL is an abstraction at further remove from NVidia's hardware than CUDA.

Big deal. Maybe you should look into the compilers for uCs (microcontrollers). They also tailor their code for each member of the family. It's been a done deal for what, decades now? ;)
GPUs are the daddies of microcontrollers.

Besides, the TableGen DSL LLVM has is pretty nicely suited for this sort of per-GPU codegen.
Lovely.

All those variables apply to CUDA too.
I'm sure you'll let us know of the first GPU doing graphics whose JIT is LLVM based.

Jawed
 
After your earlier brainfart, "AFAIK, ptx is parsed into LLVM IR and then some optimizations and code gen happens." : :rolleyes:
Sorry about that.

The difference, as far as I can tell, is that Apple makes use of LLVM mandatory for implementors of OpenCL.

I don't see that. Maybe you know something we don't.

Fact is, there are problems in OpenCL due to LLVM producing code that the IHVs' compilers can't work with. Dunno how much longer this basic stuff is going to continue to throw up deal-breaking errors.

Obviously we can't tell how much of that is the fault of the users of LLVM (AMD, NVidia) as opposed to LLVM itself - clearly there have been language conformance problems in both AMD's and NVidia's implementations of OpenCL.

This is pretty common. CUDA had similar beginnings. Friends of mine who worked on it in 2007 have lots of horror stories from that era. It seems tools need ~1-1.5 years out of beta to mature. No compiler is perfect.

Here's a recent one (HINT: look at the date :smile:)

http://gpuray.blogspot.com/2009/08/as-fast-as-glsltm.html

It turns out that not all is well on the CUDA compiler front. I ran into multiple bugs while implementing this last optimization.
Firstly, I was unable to pass a particular float argument from the cpu code through the kernel, and into 2 device functions. It simply kept mucking up each and every time. I finally gave up and hard-coded that value into the kernel. Later I found that the bug was gone in CUDA 2.3. Not really sure what the bug was, but it may have been related to this next issue.
And this one's pretty good too.
For example, I had defined a pure function which had a few floating point constants inside it. A single invocation of this function, per iteration in the kernel, would work fine. But 2 or 3 invocations would result in an abrupt return from the kernel, so no processing would be done at all.

CUDA is ahead, no doubt. But blaming LLVM is an oversimplification.
By contrast NVidia has spent many years heavily investing in its own languages - going back to Cg thence to C for CUDA - designed specifically for its own hardware. So, with OpenCL, NVidia loses some control. Theoretically, once LLVM's working right in this environment, there'll be gains to be had.

You might argue that NVidia will ditch its internal technologies and adopt an LLVM-based chain entirely across all its platforms (including D3D graphics). This doesn't address the basic fact that OpenCL is a generalised API, aimed at including chips that aren't GPUs. It's not the closest-fit for NVidia's hardware and the variant of C in OpenCL is theoretically immature compared with what NVidia is doing (or has planned) for C for CUDA.

Even with NVidia all-LLVM, OpenCL is still "relatively distant", an abstraction in the middle that serves no purpose in a pure NVidia environment.


That's my point - no-one's done LLVM from a C-like language (HLSL-like or OpenCL-C-like) to GPU ISA yet. If it's work in progress then maybe we'll get to hear about it some time.


LLVM and its environment is a major work in progress, in case you haven't noticed. It's why we see OpenCL dependent upon fixes to LLVM, it's quite immature in this application. Go have a look at AMD's and NVidia's developer forums.


I suggest you read the first page of the IL Specification. You'll see that AMD's IL is based upon D3D10's assembly language, which is an evolution of DX9 assembly. IL was designed as a thin layer twixt D3D assembly and ISA upon the introduction of R600 (i.e. D3D10). AMD (ATI) has been doing DX-style assembly->ISA since before LLVM came into existence.

And, by the way, I'm not suggesting that LLVM isn't usable in doing OpenCL->ISA without intermediate steps. I'm saying that the LLVM ecosystem isn't mature enough for AMD and NVidia to entirely junk their driver-level JIT technology. And that OpenCL is an abstraction at further remove from NVidia's hardware than CUDA.

OCL drivers are still maturing, sure. But this is no reason to diss LLVM. People have simply invested more in their own compilers so far. And no, I am not buying that OCL is an abstraction so far removed from your hw that it is somehow difficult to optimize. It might be more generic, but if nv chose to generate WHIRL from OCL kernels and then use their own optimizers, it would hardly be any less performant. Once the frontend transforms the code into IR, all HLSL/GLSL/OCL/CUDA differences are washed away, and the optimizations happen at the IR level.

Perhaps, Apple had other plans. ;)

I'm sure you'll let us know of the first GPU doing graphics whose JIT is LLVM based.
The open-source 3D drivers are supposed to have LLVM-based optimizers. To be sure, they are lagging, but building high-performance drivers takes time and effort.
 
The open-source 3D drivers are supposed to have LLVM-based optimizers.
They don't for any of the GPU paths at the moment. There's llvmpipe that generates the entire rasterization pipeline to (CPU) machine code using llvm (and is quite awesome). There is also some handwaving going on about using llvm in the upper layers of the stack to aid in GLSL and OpenCL compilation, but nothing concrete and nothing targeting GPUs directly as far as I'm aware.
 
They don't for any of the GPU paths at the moment. There's llvmpipe that generates the entire rasterization pipeline to (CPU) machine code using llvm (and is quite awesome). There is also some handwaving going on about using llvm in the upper layers of the stack to aid in GLSL and OpenCL compilation, but nothing concrete and nothing targeting GPUs directly as far as I'm aware.

I thought that the new Gallium-based drivers were using LLVM to JIT the shaders. I know they are not widespread yet, but Gallium is the future of FOSS drivers for sure.
 
I thought that the new Gallium-based drivers were using LLVM to JIT the shaders.
Only vertex and geometry shaders, for hardware without hardware T&L that uses the Gallium draw module (i915 and some IGP Radeons, possibly some GeForces). Nothing that actually ends up as GPU instructions goes through LLVM at this point, as far as I'm aware.
 