Ageia, CUDA and 8800xx?

Hello All,

<much rambling ahead>

(sorry, my 6600GT is getting a bit long in the tooth, the number of games I own but can't play keeps increasing, I like the idea of PhysX, and I am cheap) =)

I was wondering if there was any info on how Nvidia is going to implement Ageia PhysX via CUDA. It looks like you will at least be able to run PhysX on a 2nd SLI card, but I wonder if you will be able to use a portion of your graphics card for physics? If that were true, then maybe the extra shaders in the 8800GTS could be put to good PhysX use.

I also wonder if you will have to have an SLI motherboard to use 1 graphics and 1 physics card. (I have the NForce 4 Ultra-D)

I also wonder, if you get a 2nd card for PhysX, does it need to be the same model? It would be nice to snag an 8800GTS now and pick up a 9600GT later, once a few more PhysX games are out and if CUDA PhysX works well.

Does anyone have any idea when they are supposed to release more Ageia info?

Thanks for listening,
Dr. Ffreeze

PS. The 8800GTS looks to be my best bet now, but I would love to find out if a 9800GT(S) is coming out very soon. I thought the rumor was that it was, but now I am not so sure. The value of the 8800GTS is A M A Z I N G.
 
PhysX won't be run via CUDA, as far as I'm aware, and if you're running it on a 2nd GPU, that GPU doesn't need to be the same as the first. It'll also execute on a single GPU, alongside your game's rendering, if that's all you have.

An SLI mainboard shouldn't be a requirement.
 
PhysX won't be run via CUDA, as far as I'm aware
Are you sure? I was under the impression that they're looking at implementing it in CUDA... there doesn't seem to be a lot of point in reinventing the wheel here. Of course they may add a bit of supporting hardware functionality, but I wouldn't be surprised if whatever functionality they add will be exposed in CUDA as well, as long as it is suitably general enough.

I have no special knowledge of this, but I was kind of under the impression that they were planning to go the CUDA route, just from chatting with a few people at GDC. Could be 100% wrong impression though...
 
Aye.. I keep hearing nVidia say they're putting PhysX on CUDA and everyone else say they're not despite knowing that nVidia keeps saying it. Kinda makes me say, hmm?
 
It's about as likely for them to be able to REALLY accelerate PhysX via G8x/G9x CUDA as it is for them to be able to encode an H.264 High Profile 1080p stream in real-time in the shader core. In other words, I don't believe it, but I wouldn't exclude transcoding... (or in PhysX's case, acceleration of only some specific features or a new CUDA path) - this is specifically for G8x/G9x/GT2xx though as I said, all bets are off for their DX11 chip. And who knows, maybe they'll defy the laws of computation and surprise me positively! (only way they could do that AFAICT is if they use the Tensilica DSP for video decoding in the physics process - then things might get interesting and completely nuts, hmm)

And I know pretty much everyone thinks I'm crazy for claiming they're basically lying about CUDA/PhysX, but unless someone explains to me how they can even theoretically do some of this stuff efficiently on a SIMD machine...
 
this is specifically for G8x/G9x/GT2xx though as I said, all bets are off for their DX11 chip
You think the architecture will be significantly different? I thought NVIDIA had pretty clearly committed to some stuff by their design and future promises of support for CUDA. Sure they can add stuff, but the basic model of warps/threads/local shared memory is here to stay for a while AFAIK.

And I know pretty much everyone thinks I'm crazy for claiming they're basically lying about CUDA/PhysX, but unless someone explains to me how they can even theoretically do some of this stuff efficiently on a SIMD machine...
I know nothing about h.264 encoding, but doing some basic physics stuff on the GPU isn't really that hard. I've written code to do collision detection and response on Cell, and while Cell is certainly a slightly better target, the basic algorithms generally map pretty well to the GPU. Particularly with scatter/global writes in CUDA it's not too hard to implement. As you mention, capturing SIMD coherence can be a bit of an issue in some of the phases, but in other phases the parallelism is very obvious. I certainly wouldn't count acceleration of at least some performance-critical chunks on CUDA out myself. I'm no physics expert, but I know the basics.
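
To give a concrete (if entirely hypothetical) flavour of what I mean by scatter/global writes making this tractable, here's a minimal CUDA sketch of a brute-force pair test that appends contacts to a global list. The struct and kernel names are mine, and the atomic append assumes compute capability 1.1+ hardware, so treat it as an illustration rather than anything resembling how PhysX would actually be implemented:

[code]
// Hypothetical sketch: one thread per candidate pair from a broadphase,
// scattering hits into a global contact list via an atomic counter.
struct Sphere  { float x, y, z, r; };
struct Contact { int a, b; };

__global__ void narrowphaseSpheres(const Sphere* spheres,
                                   const int2*   pairs,        // candidate pairs from the broadphase
                                   int           numPairs,
                                   Contact*      contacts,     // output contact list
                                   int*          contactCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPairs) return;

    Sphere a = spheres[pairs[i].x];
    Sphere b = spheres[pairs[i].y];

    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    float rsum = a.r + b.r;

    if (dx*dx + dy*dy + dz*dz < rsum*rsum) {
        int slot = atomicAdd(contactCount, 1);   // global scatter write
        contacts[slot].a = pairs[i].x;
        contacts[slot].b = pairs[i].y;
    }
}
[/code]

The divergence cost of the branch is low since each thread does so little work either way; the interesting part is simply that the contact list can be built entirely on the GPU.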

PS: You're really anti-SIMD lately Arun ;)
 
You think the architecture will be significantly different? I thought NVIDIA had pretty clearly committed to some stuff by their design and future promises of support for CUDA. Sure they can add stuff, but the basic model of warps/threads/local shared memory is here to stay for a while AFAIK.
It just requires compatibility with that model - the DX11 chip is clearly a clean slate afaik/afaict.

I know nothing about h.264 encoding, but doing some basic physics stuff on the GPU isn't really that hard. I've written code to do collision detection and response on Cell
Uhh CELL isn't massively parallel and the SIMD width is only 4 in the SPEs, plus MIMD is fairly efficient.

I certainly wouldn't count acceleration of at least some performance-critical chunks on CUDA out myself. I'm no physics expert, but I know the basics.
I wouldn't count it out either, depending on what 'some' means... However, I can't see how you'll do broadphase efficiently on a GPU, and I can't see how you could do narrowphase for triangle meshes. Convex hulls, errr, maybe. Also fluid/cloth stuff is relatively easy and there are opportunities for CUDA there. But the real *core* of the PhysX API, well, I don't see how you could accelerate that. With some help from the Tensilica core and a lot of effort, though, maybe. But that does pose scaling problems between the low-end and high-end if that core is a bottleneck.

PS: You're really anti-SIMD lately Arun ;)
I'm not anti-SIMD, I just want throughput-oriented MIMD on the same chip (optionally in the same core, à la Larrabee, but it doesn't have to be). I don't care if you have two sets of cores or just one or a billion different kinds of cores that transcode from one ISA to another all the time - that's none of my business as long as there is no obvious, crippling bottleneck. And no, CPU-GPU integration doesn't cut it, because an OoOE CPU has awful per-mm² throughput. And shock horror, that is exactly the viewpoint Neoptica proposed, just in a less direct and straightforward way, in their whitepaper: MIMD-SIMD integration is absolutely necessary for certain workloads and may actually improve efficiency of mostly-SIMD workloads.

If I had some very clear reassurance that there would be no MIMD-SIMD transfer bottleneck on Larrabee, that might make me more excited about the chip - certainly much of my optimism is based around that. If it wasn't for that, or if it was so laughably slow as to be useless, all you've got is effectively an inefficient GPU with poor latency hiding and so forth. So stop complaining, since obviously my sudden interest in MIMD is potentially positive for your new employer... ;) (or possibly not - as I said, who knows)
 
It just requires compatibility with that model - the DX11 chip is clearly a clean slate afaik/afaict.
True, but if CUDA is no longer an efficient and powerful way to use DX11 processors, I think NVIDIA has really shot itself in the foot. There are a lot of smart people there, though, so I doubt they'd do that :)

Uhh CELL isn't massively parallel and the SIMD width is only 4 in the SPEs, plus MIMD is fairly efficient.
True, but my example was more of an "I have an understanding of the underlying algorithms" than a specific "if it runs on Cell it'll run on the GPU!" All I'm trying to say is that, from my experience, there's nothing crippling about the way some of these physics algorithms operate with respect to GPU architectures, although again I'm no expert.

However, I can't see how you'll do broadphase efficiently on a GPU
Scatter/binning approaches seem to be fairly efficient on GPU... check out Simon Green's SPH stuff.
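
Roughly, the binning step boils down to something like this - a simplified sketch in the spirit of those samples rather than actual SDK code (the grid parameters, hash constants and names here are my own): hash each particle into a grid cell, then sort the (index, hash) pairs by hash so that each cell's occupants end up contiguous in memory.

[code]
// Sketch of the binning pass for a hashed uniform grid (not Simon Green's actual code).
__device__ unsigned int cellHash(int3 c, unsigned int numBuckets)
{
    // Cheap spatial hash so an effectively infinite grid folds into a finite table.
    const unsigned int p1 = 73856093u, p2 = 19349663u, p3 = 83492791u;
    return ((unsigned int)c.x * p1 ^ (unsigned int)c.y * p2 ^ (unsigned int)c.z * p3) % numBuckets;
}

__global__ void computeCellHashes(const float4* positions,   // particle positions
                                  unsigned int* hashes,      // out: cell hash per particle
                                  unsigned int* indices,     // out: particle index, sorted alongside the hash later
                                  int numParticles,
                                  float cellSize,
                                  unsigned int numBuckets)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    float4 p = positions[i];
    int3 cell = make_int3((int)floorf(p.x / cellSize),
                          (int)floorf(p.y / cellSize),
                          (int)floorf(p.z / cellSize));

    hashes[i]  = cellHash(cell, numBuckets);
    indices[i] = (unsigned int)i;
}
[/code]

After sorting by hash (e.g. with a GPU radix sort), a neighbour query is just a lookup of the start/end offsets of a particle's own cell plus its adjacent cells.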

and I can't see how you could do narrowphase for triangle meshes. Convex hulls, errr, maybe.
I don't see too much trouble with convex hulls, although again you're going to run into a few coherence issues depending on the sizes/regularity of your meshes. Really if we're trying to accelerate this stuff on GPU though, you've got to expect there to be thousands if not tens of thousands of objects, which opens up more possibilities for capturing coherence.

MIMD-SIMD integration is absolutely necessary for certain workloads and may actually improve efficiency of mostly-SIMD workloads.
I don't disagree with you, but the bottom line is that the MIMD part has to be nearing the efficiency of the SIMD part. Certainly 16x slower MIMD wouldn't buy you much on any current cores ;) While it's also none of my business how this works at the hardware level, the current thinking seems to be that we can pack enough SIMD into hardware to justify some slowdown in the more-MIMD cases. That may be totally wrong, but I'd love to see someone prove that in hardware :)

My biggest concern with MIMD architectures is scheduling. As we move forward into thousands of "cores" (or whatever you want to call them), I'm concerned about scheduling presenting an Amdahl's-law-like bottleneck. At these levels, FIFO queues and dynamic load balancing may no longer cut it, as the overhead will be more than the actual computation being done. SIMD can largely mitigate this problem through static scheduling, or at least scheduling at a much coarser granularity.

So stop complaining, since obviously my sudden interest in MIMD is potentially positive for your new employer... ;) (or possibly not - as I said, who knows)
Hehe, I was just teasing you Arun ;). I didn't mean any offense. Perhaps we should start a separate thread about this though as I'd love to hear your thoughts, and more about alternative architectures.
 
Damn - I might just be wrong on this one! Looking again at hash-based infinite grids as explained in Simon Green's latest presentation, I can see how that could be applicable to the broadphase of arbitrarily shaped objects. Not incredibly efficient compared to the alternatives perhaps, but usable given the sheer horsepower involved. As for narrowphase, bruteforce should indeed work for convex hulls and I *think* the feedback latency might be low enough to do rudimentary hierarchical culling in the narrowphase.

So yeah, unless I'm missing something AGAIN, I was actually wrong and am openly admitting it now - although none of this seems like a very efficient way to handle the problem, a high-end GPU could likely still be substantially faster than modern CPUs. On the other hand, I don't see how this could compete with Larrabee unless MIMD<->SIMD transfers are really shit (which would be embarrassing).

As for CUDA, I didn't say that, I just said I doubt the API will remain static and I doubt programming with a G8x/G9x mindset a few generations from now will be the most efficient way to optimize your algorithm... That should surprise no one, really. As for SIMD vs MIMD, yeah, I should start a thread on that one of these days.
 
Not incredibly efficient compared to the alternatives perhaps, but usable given the sheer horsepower involved.
Yes indeed, it's not necessarily as "efficient" as other techniques from an operation-counting point of view, but that's where you have to start considering the cost of SIMD vs MIMD hardware... something with which I really don't have any experience.

On the other hand, I don't see how this could compete with Larrabee unless MIMD<->SIMD transfers are really shit (which would be embarrassing).
I tend to agree, although it will be interesting to see PhysX on GPU vs Havok on Larrabee in the coming few years if the predictions are to be believed.

As for CUDA, I didn't say that, I just said I doubt the API will remain static and I doubt programming with a G8x/G9x mindset a few generations from now will be the most efficient way to optimize your algorithm...
I totally agree with this, and I don't feel that CUDA is the "be all, end all" parallel programming model that NVIDIA wants us to believe it is. It's really quite hardware-specific, despite what they would have people believe. I do have a bit of bias (experience?) as a current employee at RapidMind though, hence my interest in hearing your opinion on that :) As a way to target GPUs, it's great for us. However, it doesn't strike me as a programming model that is going to remain unchanged beyond current NVIDIA GPUs. I'm sure features of it will remain, but it does seem too hardware-specific to expect CUDA applications written now to run optimally on hardware - say - 10 years down the road.
 
I think part of the point for MIMD vs SIMD for Physics is that you don't need that much MIMD to improve your overall efficiency. Certainly both in terms of ops/second and die size, it could be a fairly negligible part of the chip and still help a lot. Heck, just put one of these next to every [strike]multiprocessor[/strike] cluster on a G8x/G9x and give it access to shared memory and I'd already be a happy camper: http://www.tensilica.com/diamond/di_106micro.htm - even without FP32 I think you could do some interesting stuff that way. And part of what makes SGX exciting to me is that it's pure MIMD yet seems very efficient to say the least, which also means the cost of Hybrid MIMD-SIMD vs. SIMD could be quite negligible (although I certainly wouldn't be opposed to pure MIMD becoming mainstream either!)

As for CUDA, I agree it is rather hardware specific, but what I think is important to realize is that future hardware from NV will likely evolve to become more flexible in every way, not more flexible in some ways and less flexible in others. That means that unless you are doing some hairy batch-size-specific stuff, I don't think your 'chip utilization efficiency' will crash down to hell in future generations (although it might drop); on the other hand, the specific selection of algorithms you made might no longer be the optimal way to extract aggregate performance as your flexibility increases.
 
Which kind of leaves ATI as the odd man out this go-round, it seems. With Nvidia in charge of PhysX and unlikely to license it to ATI, and with Intel in charge of Havok and showing no signs of continuing serious development of physics on the GPU...

It would seem that unless Microsoft comes to the rescue with Direct Physics, ATI might be in a bind if Nvidia successfully pulls off integration of PhysX with their GPUs.

Regards,
SB
 
[Attached slide image from the webcast]

Anyone actually listen to the audio stream this far in to hear what JHH says about it? The webcast is 7 hours long, and this is towards the end...

http://investor.shareholder.com/med...e=1&mediaKey=E0D7BE337DF28EF4467A86CE5AD3B1D2

edit: you can get all the slides by just changing the last number from 1 to 104 here:
http://apps.shareholder.com/slides/...A-B746686EBDBE&width=510&height=286&slideid=1
 
Well this news item will interest you then:
http://www.tgdaily.com/content/view/36915/135

"While Intel's Nehalem demo had 50,000-60,000 particles and ran at 15-20 fps (without a GPU), the particle demo on a GeForce 9800 card resulted in 300 fps. If the very likely event that Nvidia’s next-gen parts (G100: GT100/200) will double their shader units, this number could top 600 fps, meaning that Nehalem at 2.53 GHz is lagging 20-40x behind 2006/2007/2008 high-end GPU hardware."

"There was also a demonstration of cloth: A quad-core Intel Core 2 Extreme processor was working in 12 fps, while a GeForce 8800 GTS board resulted came in at 200 fps. Former Ageia employees did not compare it to Ageia's own PhysX card, but if we remember correctly, that demo ran at 150-180 fps on an Ageia card."

So if that's true, byebye Intel and any illusions of replacing physics/graphics units with their multicore architectures. They will really need Larrabee to remain in the race.
Also this shows the true power of the GF8-architecture and Cuda: it can very efficiently share data between stream processors with its special caching system (also the reason why Cuda is only available on GF8 and up).
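
To illustrate what I mean by that caching system: in Cuda it's exposed as per-block __shared__ memory, and the threads within a block can stage data there and cooperate on it without round-tripping through device memory. A trivial, generic sketch of my own (nothing physics-specific):

[code]
// Threads of a block cooperate on a sum entirely in on-chip shared memory,
// producing one partial result per block.
__global__ void blockSum(const float* in, float* blockSums, int n)
{
    extern __shared__ float cache[];            // the per-block on-chip storage

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    cache[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                            // all of the block's values are now on-chip

    // Tree reduction done entirely in shared memory - no device-memory traffic.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = cache[0];       // one global write per block
}

// Launch sketch: blockSum<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_blockSums, n);
[/code]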
 
They're not running the same demo. Besides, we know what computing loads GPUs are good at vs current (& near future) SMP CPUs. Larrabee & Sandy Bridge as well as AMD's Fusion projects will certainly be interesting. JHH was certainly entertaining, though.
 
They're not running the same demo.

From what I understood, the cloth demo was the same demo running on a Core2 Extreme and an 8800GTS.

Besides, we know what computing loads GPUs are good at vs current (& near future) SMP CPUs.

What do you mean by that?
I don't know what Cuda is capable of exactly, because the whole architecture is a break from previous GPUs and is suitable for far more generic processing, such as physics (physics requires a lot of data shuffling to and fro, whereas graphics and linear algebra are generally just very straightforward parallel stream operations).

AMD's Fusion projects will certainly be interesting.

As far as I know, Fusion is nothing more than a regular K10-ish core coupled with a regular R600-ish IGP. So it's just like a current AMD CPU + GPU setup, only more bandwidth-impaired. On top of that, AMD doesn't have an architecture with features like GF8+Cuda, nor does it have a nice programming environment like Cuda. So AMD is waaaaay behind nVidia, and possibly also behind Larrabee when it is introduced (Intel has a big compiler/software team, unlike AMD, and they should have a nice SDK ready when the time comes).
 
So if that's true, byebye Intel and any illusions of replacing physics/graphics units with their multicore architectures.
I don't think a typical OOO CPU is going to ever win at those sorts of tasks, and if people think that they're going to be replacing high-throughput devices for graphics, particle simulation, and arguably a lot of physics, then they're probably fooling themselves. One can of course make the argument that CPUs will be "good enough" at certain tasks, but such arguments have rarely held much water in the past in - for instance - AAA games... generally you want more and bigger.

That said, even if the two were running the same demo, the CPU being "only" 20-40x slower than a GPU in particle and cloth simulation is pretty impressive actually :D

Seriously though, I don't think too many people see high-throughput designs like GPU and Cell going anywhere anytime soon, and Intel seems to agree since they seem to want to get into the space :)
 
Scali, I'd suggest you might want to take a look at AMD/ATI CTM.

I did; it feels more like programming in assembly. It's nothing like Cuda. Did you ever look at Cuda?
Also, are you aware of the architecture of the GF8 and its 'Parallel Data Cache'? As far as I know, ATi has nothing like that.
So yes, I've looked at it. Have you?
 
I don't think a typical OOO CPU is going to ever win at those sorts of tasks, and if people think that they're going to be replacing high-throughput devices for graphics, particle simulation, and arguably a lot of physics, then they're probably fooling themselves. One can of course make the argument that CPUs will be "good enough" at certain tasks, but such arguments have rarely held much water in the past in - for instance - AAA games... generally you want more and bigger.

That said, even if the two were running the same demo, the CPU being "only" 20-40x slower than a GPU in particle and cloth simulation is pretty impressive actually :D

Seriously though, I don't think too many people see high-throughput designs like GPU and Cell going anywhere anytime soon, and Intel seems to agree since they seem to want to get into the space :)

Well I think it's mostly that one of Intel's spokespersons put his foot in his mouth recently, with that particle demonstration on Nehalem. Intel had already distanced itself from his claims that GPUs or even IGPs could be replaced, etc.
But it's nice that nVidia decided to shove that foot in his mouth even deeper :)
It's nice to have such a 'rogue' company; it reminds me a bit of Apple and their piss-takes on the competition.
It's also good that nVidia is starting to flex those Cuda muscles. The technology and its capabilities aren't well known because they haven't been shown off much so far.
 