Ageia, CUDA and 8800xx?

As far as medical imaging goes, it's one area where the use of GPUs for the computational parts actually makes me nervous, since I'm not as convinced about the quality control and numerical stability of GPUs vs CPUs.

This is probably one of the stupidest and lamest things I have ever read on these message boards. Talk about not knowing when to admit someone else might have an iota of a point.

Are you aware of the vast advances in medical imaging over the past few years that have been made because of the processing power of GPUs? They are truly revolutionary and game-changing. There is so much data to process that the potential for a small error is nothing compared with the vastly higher amount of data that gets processed. You are talking about thousands of voxels now, and where is the limit? The data is only going to increase. So you can go back to a perfectly computed but rudimentary image that has far less diagnostic relevance, or have an image that is far more useful but might contain an error. Choose the former if you want, but you would be a fool.

And GPUs are only increasing in accuracy. So they are clearly better for medical imaging now and clearly will be in the future.
 
We're talking about how CUDA will compete. This isn't graphics.

Given that they are planning to integrate CUDA (along with PhysX) into their GPU drivers, I'd dare say that it is complementary to graphics rendering (after all, that is still the most widespread parallel computing application, is it not?), not the other way around.

If CUDA needs a proper way to become mainstream, what better strategy than to bring it into the company's main cash cow (discrete consumer GPUs) to generate the kind of buzz that Tesla can't?
Something as radical as GPGPU takes years (if not decades) to become a "natural" in software vendors' minds.

Intel has the x86 ecosystem, yes, but that hasn't stopped companies like IBM and Sun from consistently leading in the HPC performance and density fields for years. Even Itanium is still far from breaking into that little club with any significant share.
I wonder why...?
 
What do you mean "planning to integrate Cuda"?
The XP version of the driver has had it integrated since 160.something as far as I know (I just have a standard driver installed anyway, not a special Cuda development driver anymore).
Since Cuda 1.1, XP x64 is also supported, with the standard release driver.
So they've already integrated it for XP and XP x64.
We're just waiting for Vista support now, I suppose.
 
It definitely does, but I was really thinking of task parallelism, not data parallelism. Obviously a very good solution for the former is harder to come up with than for the latter... :)
Ah okay, that makes more sense. Still, I'm not entirely convinced that we can afford to do task parallelism when we get thousands of cores... certainly you can do it at a coarse granularity, but the scheduling overhead of MIMD at the fine level with that many processes is going to hit serious Amdahl's law problems.
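To put rough numbers on the Amdahl's law point (purely illustrative figures): with speedup = 1 / (s + (1 - s) / N), even 1% of serial scheduling overhead (s = 0.01) caps a 1000-core machine (N = 1000) at roughly 91x, and going to 10,000 cores only raises that ceiling to about 99x.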

I'm also not convinced that writing task-parallel code at that level (1000s of tasks) makes a lot of sense either. Data parallelism, together with control flow, scatter/gather and perhaps some efficient primitives for segmented scan, etc. seems to be a pretty powerful model that scales really well.
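A minimal sketch of what that model looks like in CUDA terms (the names and launch sizes here are made up for illustration): a data-parallel kernel combining a gather through an index table with per-element control flow, and no explicit task scheduling anywhere.

```
// Hypothetical example: gather values through an index table and
// clamp negative results to zero. One thread per output element.
__global__ void gather_and_clamp(const float *src, const int *indices,
                                 float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                      // per-element control flow
        float v = src[indices[i]];    // gather
        dst[i] = (v < 0.0f) ? 0.0f : v;
    }
}

// Host side: launch enough blocks to cover n elements, e.g.
// gather_and_clamp<<<(n + 255) / 256, 256>>>(d_src, d_idx, d_dst, n);
```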
 
If CUDA needs a proper way to become mainstream, what better strategy than to bring it into the company's main cash cow (discrete consumer GPUs) to generate the kind of buzz that Tesla can't?
Something as radical as GPGPU takes years (if not decades) to become a "natural" in software vendors' minds.

I think that's what NVIDIA is trying to do. To my understanding, every NVIDIA GPU from the G80 onwards supports CUDA, and all newer drivers support CUDA directly. Of course, Vista support is yet to come (it's coming in CUDA 2.0), and that's probably the biggest complaint right now.
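For what it's worth, checking whether the installed part actually supports CUDA is already just a runtime-API call away. A minimal sketch (error handling omitted):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);      // 0 if no CUDA-capable GPU/driver is present
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // Compute capability 1.0 corresponds to G80-class hardware.
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```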

Of course, it's very unlikely CUDA will be "the one." We'll need some easy-to-use, standard stream programming language that everyone (Intel, NVIDIA, AMD) supports. But even with such a standard language available, programmers will still have to optimize for different architectures, just as they optimize for different CPUs right now (for best performance, of course).
 
Given that they are planning to integrate CUDA (along with PhysX) into their GPU drivers, I'd dare say that it is complementary to graphics rendering (after all, that is still the most widespread parallel computing application, is it not?), not the other way around.
The complementary bit of CUDA is the use of graphics hardware for something that isn't graphics.

If CUDA needs a proper way to become mainstream, what better strategy than to bring it into the company's main cash cow (discrete consumer GPUs) to generate the kind of buzz that Tesla can't?
Something as radical as GPGPU takes years (if not decades) to become a "natural" in software vendors' minds.
D3D will do this job much better for the IHVs as a whole - but D3D doesn't have NVidia's branding.

Intel has the x86 ecosystem, yes, but that hasn't stopped companies like IBM and Sun from consistently leading in the HPC performance and density fields for years. Even Itanium is still far from breaking into that little club with any significant share.
I wonder why...?
http://www.top500.org/static/lists/2007/11/TOP500_200711_Poster.png

Jawed
 
Well, you make a lot of bold statements and trash a lot of hardware, but in reality it's x86 CPUs that are behind the curve right now. Whether or not they can "slap a few vector units on and catch up" remains to be seen.

Oh I certainly agree that x86 is currently behind the ball here. The question is how quickly will it catch up?


I'm not sure I agree that we necessarily want cache coherency... when you really start to code for massively data-parallel systems, you know where you want your memory to be and at what time anyway. The majority of the work in these situations is indeed optimizing memory transfers, and hardware cache coherency takes away any power that you have to make this fast. Typical caches trash the efficiency of stuff like scatter/gather, particularly when the coherency is enforced across multiple "cores" that may be doing entirely unrelated tasks.
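To make the "you know where you want your memory to be" point concrete, this is roughly what explicitly software-managed on-chip memory looks like in CUDA (an illustrative sketch, trivial on purpose): the program stages data into shared memory and synchronizes explicitly, rather than relying on a coherent cache to guess.

```
// Each block stages a tile of the input into on-chip shared memory,
// operates on it, and writes it back. No hardware cache coherency is
// involved; placement and timing are entirely under program control.
// Launch with blockDim.x == TILE.
#define TILE 256

__global__ void scale_tile(const float *in, float *out, int n, float k)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];    // explicit load into the local store
    __syncthreads();                  // explicit synchronization point

    if (i < n)
        out[i] = k * tile[threadIdx.x];
}
```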

The benefit of good cache coherency is that it gives the programmer the freedom not to micro-manage everything. Right now the available cache-coherent multi-core processors aren't that good at core-to-core transfers, but I don't believe that's the way it has to be. Given sufficient associativity and non-temporal loads, a cache functions pretty much like a control store.
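For concreteness, non-temporal access hints on current x86 look roughly like this (an illustrative sketch; assumes SSE4.1 for the streaming load, and 16-byte-aligned buffers):

```
#include <emmintrin.h>   // _mm_stream_si128
#include <smmintrin.h>   // _mm_stream_load_si128 (SSE4.1)
#include <xmmintrin.h>   // _mm_sfence

// Copy data with non-temporal hints, i.e. asking the hardware not to
// keep this stream resident in the cache hierarchy.
void stream_copy(__m128i *src, __m128i *dst, int n_vectors)
{
    for (int i = 0; i < n_vectors; ++i) {
        __m128i v = _mm_stream_load_si128(&src[i]);   // non-temporal load hint
        _mm_stream_si128(&dst[i], v);                 // non-temporal store
    }
    _mm_sfence();   // order the streaming stores before later stores
}
```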

Time will tell, certainly, but I remain unconvinced that the assumptions about spatial and temporal locality that underlie current CPU-caching mechanisms necessarily remain as valid in the massively data-parallel world.

People were skeptical when Cray went to lower memory bandwidth with caches, but the results proved themselves across a large variety of vector workloads, including the subset that GPGPU designs target.

aaron spink
speaking for myself inc.
 
If more of the money in a system goes to Larrabee, then less will go to Nehalem/Sandy Bridge.

As long as that money stays within the Intel ecosystem, it's fine, despite the lower margins.

You don't want to let that cash flow to another company, do you? :)
 
Fair enough, although there are a lot of Linux people doing HPC stuff (probably more than Windows). Are they left out in the cold then, or do we expect OpenGL equivalents, or DX ports?

I've been wondering about this myself. The Khronos Group has been awfully silent about GL3; I can only assume that some vendor disagreements about the API are going on behind the scenes. However, I would at least expect an NVidia GL driver at the DX11 feature level.
 
nVidia has done some nice work with Cg in the past, bridging the gap between D3D and OGL shaders and paving the way for GLSL... Cuda might be doing the same for GPGPU, who knows?
At least Cuda works with both D3D and OpenGL, and on both Windows and Linux. There is a good chance that DX11 and Cuda will meet halfway (e.g. Cuda compiling to DX11-compatible code that runs on any DX11 hardware)...
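On the "works with both D3D and OpenGL" point, the GL interop path looks roughly like this today (a sketch using the current CUDA runtime GL interop calls; GL setup, error handling, and the kernel itself are assumed to exist elsewhere):

```
#include <cuda_gl_interop.h>   // CUDA <-> OpenGL interop (runtime API)

// 'fill_buffer' is a placeholder CUDA kernel assumed to be defined elsewhere.
__global__ void fill_buffer(float *out, int n);

// Let CUDA write directly into an existing OpenGL vertex buffer object.
void run_cuda_on_gl_buffer(GLuint vbo, int n)
{
    cudaGLRegisterBufferObject(vbo);               // make the VBO visible to CUDA

    float *d_ptr = 0;
    cudaGLMapBufferObject((void **)&d_ptr, vbo);   // get a device pointer to it
    fill_buffer<<<(n + 255) / 256, 256>>>(d_ptr, n);
    cudaGLUnmapBufferObject(vbo);                  // hand the buffer back to GL

    cudaGLUnregisterBufferObject(vbo);
}
```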
 
(e.g. Cuda compiling to DX11-compatible code that runs on any DX11 hardware)...
I'd kind of expect it the other way around. Namely, I doubt that a "compute shader" in DX11 will expose local memories, etc. as CUDA does. CUDA being the "closer-to-hardware" language, it'll probably be what stuff gets compiled to by the NVIDIA driver (or rather, PTX).
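For anyone who hasn't poked at that layer: the PTX intermediate form is already visible today, and the driver API is what consumes it at runtime. A minimal sketch (file and kernel names are made up for illustration):

```
#include <cuda.h>   // CUDA driver API

// "mykernel.ptx" would be produced ahead of time with something like:
//   nvcc -ptx mykernel.cu -o mykernel.ptx
void load_ptx_kernel()
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "mykernel.ptx");           // load (and JIT) the PTX module
    cuModuleGetFunction(&fn, mod, "my_kernel");   // fetch a kernel handle by name
}
```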
 
I'm late to the party, but I think the latency discussion between Scali and Aaron is missing the point. GPUs don't have high memory latency because GDDR has high latency. GDDR has "high" latency because GPUs are tolerant.

Because of the massively parallel workload that is graphics, GPU memory controllers are designed to achieve high bandwidth at the expense of latency. The actual memory used is only one part of a GPU's memory latency.
 
I'm late to the party, but I think the latency discussion between Scali and Aaron is missing the point. GPUs don't have high memory latency because GDDR has high latency. GDDR has "high" latency because GPUs are tolerant.

But that's just the issue: GDDR latencies are basically the same as DDR latencies, because fundamentally they use the same DRAM array design.

In absolute terms GPUs have the same latencies as CPUs, and in clock-cycle terms CPUs generally have a factor of two or higher latency than GPUs.
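To put rough, assumed numbers on that (illustrative, not measured): if a full miss to DRAM costs on the order of 200 ns on both machines, a 3 GHz CPU core sees that as roughly 600 clock cycles, while a 600 MHz GPU core sees it as only about 120 cycles. Same absolute latency, very different cycle counts.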

Basically Scali has it backwards.

Because of the massively parallel workload that is graphics, GPU memory controllers are designed to achieve high bandwidth at the expense of latency. The actual memory used is only one part of a GPU's memory latency.

Meh, not really. Both CPU and GPU memory controllers do the basic things like blocking reads and writes, etc. In general the GPU memory controller should have an easier time, because it is largely dealing with a handful of linear access streams, whereas the CPU memory controller is generally dealing with more randomized access streams.

At the end of the day they are both generally limited by the DRAMs themselves and the number of banks they support. The main difference you are likely to see is that the GPU memory controller is more optimized for CAS access while the CPU memory controller is more optimized for RAS access.

I don't know where this "GPUs have higher latency" myth came from or who started it, but it's pretty much flat-out inaccurate.

Aaron Spink
speaking for myself inc.
 
I'm late to the party, but I think the latency discussion between Scali and Aaron is missing the point. GPUs don't have high memory latency because GDDR has high latency. GDDR has "high" latency because GPUs are tolerant.

Because of the massively parallel workload that is graphics, GPU memory controllers are designed to achieve high bandwidth at the expense of latency. The actual memory used is only one part of a GPU's memory latency.

Well, I did say "GPU memory systems"; it was Aaron Spink who started ignoring the memory controller and looking only at the memory chips themselves. I meant the entire system of GPU memory controller + GDDR memory.
The bottom line is that the memory systems of a GPU and a CPU have vastly different properties (in a nutshell: high latency, high bandwidth vs low latency, low bandwidth), which cannot easily be integrated into a single system that exhibits the best of both worlds (in a nutshell: low latency, high bandwidth, and of course at reasonable cost... as I said, getting 4+ GB of the type of memory that you find on a fast GeForce or Radeon is considerably more costly than using standard DDR2 modules. The rules of supply and demand may shift the balance somewhat, but at present it's a chicken-and-egg problem).
Since a solution like Fusion will share a single memory controller + chips between the CPU and GPU units, compromises will be made. And that was the point I was making, which has little to do with what Aaron Spink was talking about really. It seems I just accidentally stumbled upon one of his many pet peeves.
 
I don't know where this "GPUs have higher latency" myth came from or who started it, but it's pretty much flat-out inaccurate.
I looked up the first CPU memory latency figure I could find, for Itanium: ~210 cycles. So yes, CPUs and GPUs are in the same ballpark, but modern GPUs are still a bit higher. Though I don't know if GPU latencies are officially published.

http://softwarecommunity.intel.com/articles/eng/3512.htm

IMO, the main difference between the memory architectures is that GPUs are designed to maintain peak bandwidth by absorbing the memory latency, whereas CPU workloads tend to stall because they can't swap in another thread.

Though I wonder if a rasterizer like SwiftShader is designed to prefetch to mitigate this difference.
 
IMO, the main difference between the memory architectures is that GPUs are designed to maintain peak bandwidth by absorbing the memory latency, whereas CPU workloads tend to stall because they can't swap in another thread.

Though I wonder if a rasterizer like SwiftShader is designed to prefetch to mitigate this difference.

I'm sure Nick will have a few words here... but IMO software prefetch is a mixed bag with a tremendous number of issues: results are not portable across x86 architectures, it conflicts with hardware prefetch, and there are minor issues such as prefetches being ignored due to an address-translation cache miss or too many outstanding cache misses, etc.
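For concreteness, this is the kind of software prefetch being discussed (a minimal host-side sketch with SSE intrinsics; the prefetch distance of 64 elements is an arbitrary illustrative choice, and as noted above the right value is not portable):

```
#include <xmmintrin.h>   // _mm_prefetch

// Sum an array while hinting the hardware to start fetching data a
// fixed distance ahead of the current position.
float sum_with_prefetch(const float *data, int n)
{
    const int dist = 64;   // prefetch distance in elements (tuning-dependent)
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (i + dist < n)
            _mm_prefetch((const char *)&data[i + dist], _MM_HINT_T0);
        acc += data[i];
    }
    return acc;
}
```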
 