NVIDIA GF100 & Friends speculation

Only vertex and geometry shaders for hardware without hardware T&L, which uses the Gallium draw module (i915 and some IGP Radeons, possibly some GeForces). Nothing that actually ends up as GPU instructions goes through LLVM at this point, as far as I'm aware.

What does nouveau use then?
 
I don't see that. Maybe you know something we don't know.
I can't find the posts again. And Google Desktop on my PC has a very selective approach to my web history. I don't know of any explicit statement to this effect, merely hints that it's a condition of OpenCL - which is why I don't state it as a fact, but an understanding/impression/belief.

Coming back to the original statement I made: 'I can't see OpenCL ever achieving "performance parity" with CUDA because LLVM is in the way.' I think I should modify that with a caveat "as long as LLVM isn't modelling all of the GPU's relevant architectural parameters." (And this is apart from the general looseness of OpenCL as compared with CUDA.)

e.g. on ATI clause temporary registers have a maximum lifetime of 32 cycles, or ~128 scalar operations (with a maximum of 5 packed into a cycle). How these are used can have a major effect on both cycle count and register allocation, i.e. major performance differences, not 5 or 10%.

Or, how does one get LLVM to model memory latencies? Are there heuristics or flags for burstiness? Modelling of hardware-thread-count versus latency-hiding?

This is pretty common. CUDA had similar beginnings. Friends of mine who worked on it in 2007 have lots of horror stories from that era. Tools need ~1-1.5 years out of beta to mature it seems. No compiler is perfect.
I agree. There are broken things in AMD's toolchain that have been broken for years - and that's at the IL level.

NVidia's advantage with a pure CUDA chain is that it can throw its own staff at the problem, because it's internal technology. OpenCL depends on third parties - though I presume issues in LLVM, say, can be attacked by NVidia as a contributor, given LLVM's status as an open source project.

Here's a recent one (HINT: look at the date :smile:)

http://gpuray.blogspot.com/2009/08/as-fast-as-glsltm.html


And this one's pretty good too.
I'm not asserting CUDA is faultless, merely that NVidia has tighter reins.

I have to say that person says some pretty worrying things. Confusing float and double: as one of my maths teachers used to say, "that's a schoolboy error" (for a class of 16-18 year-olds).

I'm bemused by this sentence, "It turns out that CUDA gpus do not do 4 wide vector operations (according to a reasonably reliable source)." Sigh.

CUDA is ahead, no doubt. But blaming LLVM is an oversimplification.
It's only one factor. And I don't think it's all bad - I'm sure I've seen the odd case where the OpenCL is faster than the CUDA. This is similar to the way in which some people are writing kernels in PTX instead of C for CUDA. And then there are people who are modifying the binary for a kernel...

OCL drivers are still maturing, sure. But this is no reason to diss LLVM. People have simply invested more in their own compilers so far. And no, I am not buying that OCL is an abstraction far removed from your hw so it is somehow difficult to optimize. It might be more generic, but if nv chose to generate WHIRL from OCL kernels and then use their own optimizers, it would hardly be any less performant. Once the frontend transforms the code into IR, all HLSL/GLSL/OCL/CUDA differences are washed away. And the optimizations happen at the IR level.
CUDA/OpenCL programming is more than just a sequence of opcodes (or a question of how many hours it takes to JIT it) - it's a memory-hierarchy programming paradigm. I won't say it's intractable for LLVM, merely that the work needs to be done. And NVidia has full control over CUDA, but can only act as a contributor for LLVM.

It doesn't have to descend into something like the OpenGL extension wars, but NVidia might want to keep some of its technologies out of the public domain. Especially if they reveal particularities of the internal workings of the GPUs.

Perhaps, Apple had other plans. ;)

The open-source 3D drivers are supposed to have LLVM-based optimizers.
http://llvm.org/devmtg/2007-05/10-Lattner-OpenGL.pdf

I'm unclear on the way things work when an ATI or NVidia GPU is running the OpenGL app, e.g. what happens when OpenGL 3.2 is used on a system with an OpenGL 2.0 graphics card.

To be sure, they are lagging, but building high performance drivers takes time and effort.
Yes, and having the entire toolchain in-house for your own custom hardware (and its next generation, as you try to design the next architecture as a match for the problem space while also delivering a revised toolchain on time) beats being dependent on third parties. Sure, there's marketing in play when NVidia says it's a software company, but it's inescapable given the sheer complexity of current GPUs and the current state of parallel programming.

Intel didn't abandon at least one iteration of Larrabee for no good reason, after all.

Jawed
 
I can't find the posts again. And Google Desktop on my PC has a very selective approach to my web history. I don't know of any explicit statement to this effect, merely hints that it's a condition of OpenCL - which is why I don't state it as a fact, but an understanding/impression/belief.
Well, Apple owns the ocl trademark. "Plays dirty" is actually their middle name. So maybe there's something fishy going on. Perhaps, IBM's or Intel's ocl drivers will tell us more.

e.g. on ATI clause temporary registers have a maximum lifetime of 32 cycles, or ~128 scalar operations (with a maximum of 5 packed into a cycle). How these are used can have a major effect on both cycle count and register allocation, i.e. major performance differences, not 5 or 10%.

Or, how does one get LLVM to model memory latencies? Are there heuristics or flags for burstiness? Modelling of hardware-thread-count versus latency-hiding?

ATI's hw is pretty exotic, so their tools are going to be equally exotic. For more normal stuff like nv, it should be doable without much fuss.

NVidia's advantage with a pure CUDA chain is that it can throw its own staff at the problem, because it's internal technology. OpenCL depends on third parties - though I presume issues in LLVM, say, can be attacked by NVidia as a contributor, given LLVM's status as an open source project.

Both nv and AMD (well, most of the corps) are pretty hostile to contributing code to FOSS projects. Even when they do, like Google, they are often bad citizens. NV could fix some of the things, but they likely won't push them into a public tree.

I'm bemused by this sentence, "It turns out that CUDA gpus do not do 4 wide vector operations (according to a reasonably reliable source)." Sigh.
What's the big deal? He just didn't know nv is scalar.
CUDA/OpenCL programming is more than just a sequence of opcodes (or a question of how many hours it takes to JIT it) - it's a memory-hierarchy programming paradigm. I won't say it's intractable for LLVM, merely that the work needs to be done. And NVidia has full control over CUDA, but can only act as a contributor for LLVM.
From what I recall from an old nv presentation on open64, they didn't do much (if any) memory hierarchy optimizations. The optimizations were much more old school-ish. Honestly, prior to fermi, which memory-hierarchy dependent change would you do to the code?

And even with fermi, I am ambivalent as to what can be done in the compiler.

It doesn't have to descend into something like the OpenGL extension wars, but NVidia might want to keep some of its technologies out of the public domain. Especially if they reveal particularities of the internal workings of the GPUs.
CUDA compiler is GPLed.

http://llvm.org/devmtg/2007-05/10-Lattner-OpenGL.pdf

I'm unclear on the way things work when an ATI or NVidia GPU is running the OpenGL app, e.g. what happens when OpenGL 3.2 is used on a system with an OpenGL 2.0 graphics card.
CPU FTW. :yep2:

I guess the bottom line is that LLVM-based ocl today is somewhere around 2007-era cuda. And I don't think compilers do any memory-hierarchy related optimizations either. At the very least, I haven't come across this sort of stuff.
 
Well, Apple owns the ocl trademark. "Plays dirty" is actually their middle name. So maybe there's something fishy going on. Perhaps, IBM's or Intel's ocl drivers will tell us more.
I'm pretty glad Apple forced everyone's hand and got OpenCL on the road - even if something like this was inevitable. That plus leadership in LLVM all seems like good stuff to me.

ATI's hw is pretty exotic, so their tools are going to be equally exotic. For more normal stuff like nv, it should be doable without much fuss.
I think that's naive. e.g. the effect of code size on instruction cache behaviour, when there are hundreds of hardware threads in flight.

What's the big deal? He just didn't know nv is scalar.
The programming model supports vectors. He appears not to realise that the CUDA code he's running is performing well for exactly the same reason that his high-performance GLSL code performs well. The fact that a CUDA kernel is executed by a "scalar ALU" (which it isn't, sigh) is irrelevant in this particular case - he has merely translated a pixel shader into a kernel.
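To make that concrete, here's a toy sketch of my own (names and arithmetic invented, nothing to do with his code): one thread per pixel, written as plain scalar code, exactly as you'd port a fragment shader - no 4-wide vectors in sight, and nothing lost for their absence in a case like this.

Code:
// Toy example: a "pixel shader" ported to CUDA. Each thread shades one
// pixel, just as each fragment shader invocation shades one fragment, so
// the lack of explicit 4-wide vector operations costs nothing here.
__global__ void shade(const float4 *in, float4 *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    int idx = y * width + x;
    float4 c = in[idx];
    // Trivial per-pixel work, standing in for whatever the shader computed.
    out[idx] = make_float4(0.5f * c.x, 0.5f * c.y, 0.5f * c.z, 1.0f);
}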

From what I recall from an old nv presentation on open64, they didn't do much (if any) memory hierarchy optimizations. The optimizations were much more old school-ish. Honestly, prior to fermi, which memory-hierarchy dependent change would you do to the code?
I'm referring to the way CUDA was designed on top of a memory model with shared memory for a "block of threads", and how this affects device-specific low level compilation/resource-allocation, e.g. the mapping of workgroups to SIMDs versus per work item register allocation.
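A minimal sketch of what I mean by the paradigm (my own illustrative kernel, assuming a launch with blockDim.x == TILE): the block stages a tile in shared memory, synchronises, then each work item reads its neighbours on-chip instead of going back to DRAM. The shared memory per block and the registers per thread that fall out of code like this are exactly the resources that decide how many workgroups a SIMD can hold.

Code:
#define TILE 256  // assumes the kernel is launched with blockDim.x == TILE

// Illustrative kernel: stage a tile of the input in shared memory, sync,
// then read neighbouring values from on-chip storage.
__global__ void smooth(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // One coalesced load per thread into on-chip storage.
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();  // every thread in the block must reach this

    if (gid >= n)
        return;

    float c = tile[threadIdx.x];
    float l = (threadIdx.x > 0)              ? tile[threadIdx.x - 1] : c;
    float r = (threadIdx.x < blockDim.x - 1) ? tile[threadIdx.x + 1] : c;
    out[gid] = 0.25f * l + 0.5f * c + 0.25f * r;
}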

And even with fermi, I am ambivalent as to what can be done in the compiler.
My perspective always includes the lowest-level.

CUDA compiler is GPLed.
How's that relevant to the driver compiler, which consumes PTX?

If the PTX is "badly formed" then it'll flummox the driver compiler - which is what we saw in the recent discussion of the NLM kernel. Saying "LLVM is awesome" doesn't address differences such as architectural alacrity for loop unrolling (register pressure).
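As a toy illustration of that trade-off (my own sketch, arbitrary names and unroll factor):

Code:
// Unrolling lets the compiler keep more independent loads in flight, but
// every extra in-flight value is another live register per thread, which
// caps how many warps the SM can keep resident.
__global__ void sum_rows(const float *a, float *row_sum, int rows, int cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows)
        return;

    const float *p = a + (size_t)row * cols;
    float acc = 0.0f;

    #pragma unroll 4  // a larger factor tends to mean more live registers
    for (int j = 0; j < cols; ++j)
        acc += p[j];

    row_sum[row] = acc;
}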

Though I do wonder what kind of optimisation options OpenCL will end up exposing. Currently there aren't any performance-specific options, just math-implementation options (though those do have an effect on performance).

CPU FTW. :yep2:

I guess the bottom line is that LLVM-based ocl today is somewhere around 2007-era cuda. And I don't think compilers do any memory-hierarchy related optimizations either.
You can see some stuff here:

http://www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc

At the very least, I haven't come across this sort of stuff.
Burstiness and coalescing are key issues in GPUs. The ISA interaction is quite strong here - very strong in ATI where clause-by-clause execution is a primary optimisation for controlling memory accesses. NVidia's memory controllers seem to be better at re-ordering buffer accesses (i.e. not texture nor render target accesses) but that's a pretty fuzzy topic on ATI. In theory NVidia is less bothered here.
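For anyone following along, a simple sketch of the distinction (illustrative kernels of my own, not from any real code):

Code:
// Two kernels doing the same arithmetic. In the first, adjacent threads in
// a warp touch adjacent addresses, so the loads coalesce into a few bursts;
// in the second they are strided apart and scatter across many lines.
__global__ void scale_coalesced(const float *in, float *out, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * in[i];
}

__global__ void scale_strided(const float *in, float *out, int n, float s, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = s * in[i];
}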

We're still waiting to see what people find as they explore the cache hierarchy in Fermi - early days yet.

Jawed
 
I think that's naive. e.g. the effect of code size on instruction cache behaviour, when there are hundreds of hardware threads in flight.
I$ size should be less of an issue since gpu's tend to have much more straight-line code. Also, within a workgroup the warps tend to stay relatively close to each other in terms of PC. That should also help.

SW scheduling of warps would stress it badly though.

I don't see much beyond making (at least trying to make) everything a register, and r/w combining.

Neither should be a big deal. My guess would be that the former is rather simpler.

On the whole, considering the inline-everything, promote-everything-to-a-register mentality in CUDA 2.3 and earlier, I doubt I$ pressure is seen as a big issue.

I'm referring to the way CUDA was designed on top of a memory model with shared memory for a "block of threads", and how this affects device-specific low level compilation/resource-allocation, e.g. the mapping of workgroups to SIMDs versus per work item register allocation.

The register allocation should be done at the PTX level, right? And surely the compiler has to set aside whatever amount of local memory is requested in the code. So those two allocations should be independent of the API. AFAIK, workgroups per SIMD should be determined by these two factors, so no difference there either.
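Back-of-the-envelope version of what I mean (host-side sketch with made-up per-SM limits, not any particular chip): whichever of registers or local/shared memory runs out first caps the resident workgroups.

Code:
#include <stdio.h>

// Resident blocks per SM fall out of whichever resource - registers or
// local/shared memory - runs out first. The per-SM limits below are
// placeholders, not any particular chip.
static int blocks_per_sm(int regs_per_thread, int threads_per_block,
                         int smem_per_block,  int regs_per_sm,
                         int smem_per_sm,     int max_blocks_per_sm)
{
    int by_regs = regs_per_sm / (regs_per_thread * threads_per_block);
    int by_smem = smem_per_block ? smem_per_sm / smem_per_block
                                 : max_blocks_per_sm;
    int n = (by_regs < by_smem) ? by_regs : by_smem;
    return (n < max_blocks_per_sm) ? n : max_blocks_per_sm;
}

int main(void)
{
    // e.g. 20 regs/thread and 4KB of local memory for a 256-thread block,
    // against a hypothetical 16K-register / 48KB-local-memory SM.
    printf("%d blocks per SM\n",
           blocks_per_sm(20, 256, 4096, 16384, 48 * 1024, 8));
    return 0;
}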

Bad PTX is another issue.

How's that relevant to the driver compiler, which consumes PTX?
I meant that the cuda compiler doesn't hide any proprietary bits above the PTX level.
 
Both nv and AMD (well, most of the corps) are pretty hostile to contributing code to FOSS projects.
"Hostile" is a bit too harsh when it comes to AMD I think. "Ambivalent" is more like it. :)

For example they currently employ at least 2 or 3 engineers to work on the Open Source Radeon drivers and associated documentation. Granted, it's a project that mostly benefits them (and not the competition) but they do contribute.
 
From what I recall from an old nv presentation on open64, they didn't do much (if any) memory hierarchy optimizations. The optimizations were much more old school-ish. Honestly, prior to fermi, which memory-hierarchy dependent change would you do to the code?

And even with fermi, I am ambivalent as to what can be done in the compiler.

Loop cache blocking to use caches effectively is a standard optimization. open64 does it extensively. It even has a fancy cache model for it. Take a look at osprey/be/lno/cache_model.*. There is also some work on this going on in LLVM, by porting the gcc polyhedral library, although it is not quite as sophisticated as what osprey does.
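For reference, the textbook shape of the transformation (a plain host-side sketch with arbitrary sizes, not the osprey code): the loops are tiled so each BLOCK x BLOCK tile stays in cache while it's being worked on.

Code:
#define BLOCK 64  // tile edge, chosen so a tile stays resident in cache

// Classic loop blocking: traverse the matrix one BLOCK x BLOCK tile at a
// time so the tile of b being written stays in cache until it is finished.
void transpose_blocked(const float *a, float *b, int n)
{
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int jj = 0; jj < n; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK && i < n; ++i)
                for (int j = jj; j < jj + BLOCK && j < n; ++j)
                    b[(size_t)j * n + i] = a[(size_t)i * n + j];
}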

I guess the bottom line is that LLVM-based ocl today is somewhere around 2007-era cuda. And I don't think compilers do any memory-hierarchy related optimizations either. At the very least, I haven't come across this sort of stuff.

You have not looked very closely then. Cache blocking is one of the standard loop optimizations these days.

D
 
"Hostile" is a bit too harsh when it comes to AMD I think. "Ambivalent" is more like it. :)

For example they currently employ at least 2 or 3 engineers to work on the Open Source Radeon drivers and associated documentation. Granted, it's a project that mostly benefits them (and not the competition) but they do contribute.

That's why I said hostile to contributing code, not docs. ;)
 
Loop cache blocking to use caches effectively is a standard optimization. open64 does it extensively. It even has a fancy cache model for it. Take a look at osprey/be/lno/cache_model.*. There is also some work on this going on in LLVM, by porting the gcc polyhedral library, although it is not quite as sophisticated as what osprey does.

You have not looked very closely then. Cache blocking is one of the standard loop optimizations these days.

NV's fork of open64 does not appear to do that.

And you seem to forget that gpu's (mostly) don't have r/w caches.

Even Fermi has pretty small caches, and they seem targeted towards register-spill/tree-oriented use rather than loop-blocking-oriented use. You have 48KB of L1 per SM; for 256 cuda threads, that is ~200 bytes per thread. Pretty small, IMHO.
 
I$ size should be less of an issue since gpu's tend to have much more straight-line code. Also, within a workgroup the warps tend to stay relatively close to each other in terms of PC. That should also help.
Yeah those tend to mitigate things.

The register allocation should be done at the PTX level, right? And surely the compiler has to set aside whatever amount of local memory is requested in the code. So those two allocations should be independent of the API. AFAIK, workgroups per SIMD should be determined by these two factors, so no difference there either.
Not entirely - because of the mix that comes with register spill, i.e. registers per work group, and therefore work groups per SIMD, become more dynamic once register spill is performant.

The driver compiler then has an optimisation space in which register spill looks like "friendly" memory accesses (coherent, bursty) versus other memory accesses which are more awkward. i.e. it becomes beneficial to favour improved coherency in gather/scatter at the expense of spilled registers.

Things like weather simulation physics:

http://forum.beyond3d.com/showthread.php?t=49266

have a huge notional register allocation.

Anyway, I think if NVidia becomes an active contributor to LLVM, it could entirely obviate the issues of "bad PTX" and timeliness (matching up with iterations of GPU design). So in that sense there's no need for LLVM to remain a stumbling block. I think that's the basis of your point of view and on those terms I think it's quite reasonable to see LLVM as awesome.

Jawed
 
GTX460 still GF100?

According to vga.zol.cn, GTX460 is not GF104 based, but still GF100.

Translation by Bing:

Earlier reports said the core of the GTX 460 was code-named GF104, but it now appears it will still be a GF100 core, just with stream processors and memory cut down further. It's not yet clear exactly how many stream processors it will have (estimated 256-320), or what clock speeds it will be set at.

Link: http://www.microsofttranslator.com/BV.aspx?ref=IE8Activity&a=http://vga.zol.com.cn/179/1793262.html
 
Well, I think performance is pretty much what you'd expect, assuming it mostly scales with the number of shaders (and shader frequency) compared to the rest of the GF100 lineup.
I think it would have been interesting to see a HD5830 in that comparison - granted, that's an underwhelming product too, but since the GTX 465 appears to be cheaper than the HD5850 and more expensive than the HD5830 (I guess we'll see how these prices turn out), it only really has to beat the HD5830 to represent at least some value.
 
I doubt anything about the GTX 465 will surprise people. The only real question is whether or not it will be priced at a reasonable level for its performance. nVidia doesn't really have a chance to fix some of the shortcomings of the GF100 architecture until the next refresh.
 
As a cheap GF100, I'm sort of interested - something to tide me over with a low financial hit until GF104, especially since I have little time to code for fun or profit on my own dime. It might do as an experimenter's board for those looking to code against the arch.
 
At $279 US it hardly seems cheap, but in comparison to the 470 or 480 I guess it isn't too bad.
 
Too bad no power consumption figures.

Indeed.

Rumor has it that this is a 215W TDP card, so I've been wondering: if this is true, we will have a next-gen 40nm product that is both slower and more power-hungry than the previous-gen 55nm product (GTX 285 @ 183W TDP). What is this telling us?
 