Physics on Cell

pjbliverpool said:
If the news about the SPE's offloading 50% of the PPE's workload is true then we can say that the physics capabilities of Cell in that specific instance at least are twice that of a single PPE.

That's precisely the conclusion you cannot come to. Again, the load reduction on the PPE tells us nothing about how much (more) the SPEs are doing in terms of the work offloaded from the PPE. The amount the PPE has to do remains fixed, but the rest could scale up across the SPEs, well beyond what was being done on the PPE before (and likely is).

A crude, illustrative example might be doing scene traversal and as much particle simulation as you can on the PPE, and then moving the particle systems to the SPU(s) - the load reduces by 50%, but suddenly you can push your particle systems a lot more with the SPUs than the PPE could handle (while also doing the rest). Amount of work PPE+SPUs are doing != Amount of work PPE was doing alone before (or 2x that)
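To make it a bit more concrete, here's a minimal sketch in plain C (hypothetical names, assuming a simple Euler integrator) of the kind of per-particle work that moves cleanly off the PPE; the batches just have to fit in an SPE's local store and be streamed in and out:

/* Hypothetical particle batch - in practice you'd size it to fit in an
   SPE's 256KB local store and double-buffer the transfers. */
typedef struct {
    float px, py, pz;   /* position */
    float vx, vy, vz;   /* velocity */
} particle_t;

/* Simple Euler step. The same code runs on the PPE or an SPU; the win
   comes from running many such batches on the SPUs in parallel. */
void update_particles(particle_t *p, int count, float dt, float gravity)
{
    for (int i = 0; i < count; ++i) {
        p[i].vy += gravity * dt;   /* accelerate */
        p[i].px += p[i].vx * dt;   /* integrate position */
        p[i].py += p[i].vy * dt;
        p[i].pz += p[i].vz * dt;
    }
}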
 
Titanio said:
That's precisely the conclusion you cannot come to. Again, the load reduction on the PPE tells us nothing about how much (more) the SPEs are doing in terms of the work offloaded from the PPE. The amount the PPE has to do remains fixed, but the rest could scale up across the SPEs, well beyond what was being done on the PPE before (and likely is).

A crude, illustrative example might be doing scene traversal and as much particle simulation as you can on the PPE, and then moving the particle systems to the SPU(s) - the load reduces by 50%, but suddenly you can push your particle systems a lot more with the SPUs than the PPE could handle (while also doing the rest). Amount of work PPE+SPUs are doing != Amount of work PPE was doing alone before (or 2x that)

Agreed. Just because they can now reduce the PPE's workload by 50%, that doesn't mean the SPUs will only handle that "rest" of 50%; rather, they will be able to do much, much more with it. Otherwise I wouldn't see the point of Cell...
 
Titanio said:
That's precisely the conclusion you cannot come to. Again, the load reduction on the PPE tells us nothing about how much (more) the SPEs are doing in terms of the work offloaded from the PPE. The amount the PPE has to do remains fixed, but the rest could scale up across the SPEs, well beyond what was being done on the PPE before (and likely is).

A crude, illustrative example might be doing scene traversal and as much particle simulation as you can on the PPE, and then moving the particle systems to the SPU(s) - the load reduces by 50%, but suddenly you can push your particle systems a lot more with the SPUs than the PPE could handle (while also doing the rest). Amount of work PPE+SPUs are doing != Amount of work PPE was doing alone before (or 2x that)

Yes, that's fair enough. I should have paid more attention to what I was reading.
 
http://www.watch.impress.co.jp/game/docs/20060514/ageia.htm
This new report by Zenji Nishikawa covers the Ageia booth at E3.
According to it, the current PhysX PPU has the following specs:

GDDR3 RAM: 128MB
PPU-GDDR3 bandwidth: 12GB/sec
convex-based collision simulation: 533,000 objects max
sphere-based collision simulation: 530 million objects max
floating-point processing performance: 58 GFLOPS
operation execution rate: 22 Gops/sec
PCI bus: 133MB/sec

Though the GFLOPS number is not great, how about those collision sim numbers?
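A rough back-of-envelope on that last figure (my own arithmetic, not from the report): 133MB/sec divided by 60 frames/sec is only about 2.2MB per frame, which puts a fairly hard cap on how much simulation data can move between the card and the rest of the system each frame.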
 
I remember in an interview with Gabe Newell, he said Valve would never use the PPU or do physics on the GPU, because there would be too much latency to make it interactive, so it wouldn't be good for anything but eye candy.

Anyone know how true this is? If that's the case, I would guess physics on the PS3 would allow the player to interact with objects, since latency shouldn't be a problem.
 
pegisys said:
I remember in an interview with Gabe Newell, he said Valve would never use the PPU or do physics on the GPU, because there would be too much latency to make it interactive, so it wouldn't be good for anything but eye candy.

Anyone know how true this is? If that's the case, I would guess physics on the PS3 would allow the player to interact with objects, since latency shouldn't be a problem.

I think Cell Factor pretty much proves it's not true.
 
pjbliverpool said:
I think Cell Factor pretty much proves it's not true.

Meh, I didn't think Cell Factor looked that impressive; it's what I'd expect a dual-core CPU to be able to pull off. Of course, having more hardware means they can now do something else with that second core, just as having a sound card frees up quite a bit of processing power even if the work is still within the CPU's capabilities. I expect all next-gen physics engines to be developed with a dual-core CPU assumed to be in place, and the PhysX accelerator will merely allow the burden to be shifted somewhere else.

Then again, it took about two years before games started requiring 3D accelerators, another year after that before any major games did, another year before they became a mainstream requirement, and another year before games really started demanding any level of performance or features out of them. It could be that, like the early 3D cards, the initial use of PhysX accelerators will just be to offload the burden from the CPU rather than really do anything the CPU can't.
 
We still don't know much about the Ageia PPU. For that reason I'd recommend the otherwise unthinkable: compare theoretical FLOPs. By that measure it looks more competent than a single SPE, but Cell as a whole has more processing resources. It also looks much more powerful than any general-purpose core, x86 or otherwise. (edit: just speaking of code that lends itself well to massively parallel SIMD execution, such as physics, unsurprisingly)
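(For the record, the back-of-envelope behind that comparison: an SPE can issue one 4-wide single-precision FMA per cycle, so at 3.2GHz that's 4 x 2 x 3.2 = 25.6 GFLOPS peak per SPE, versus the 58 GFLOPS quoted for the PPU above.)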

IMO this is an optimistic viewpoint for Ageia. We have no confirmation about the scope of programmability of the Ageia chip. If it is as free-form programmable as an SPE, the flops comparison is somewhat valid. If there is a more restricted and hard-wired execution model going on, the Ageia chip loses some of the merit implied by the FLOPs comparison.

Anyway, to go on with the assumption, the Ageia PPU is clocked lower but is much wider than an SPE. This implies that its performance will drop off earlier on workloads that aren't as SIMD-friendly. OTOH it has more memory, or at least easier/lower-latency access to those 128MiB of GDDR3.
There is probably a cache on the chip, which I'd compare to the SPE's local store, if only I knew how large/fast it was.
 
zeckensack said:
We still don't know much about the Ageia PPU. For that reason I'd recommend the otherwise unthinkable: compare theoretical FLOPs. By that measure it looks more competent than a single SPE, but Cell as a whole has more processing resources. It also looks much more powerful than any general-purpose core, x86 or otherwise. (edit: just speaking of code that lends itself well to massively parallel SIMD execution, such as physics, unsurprisingly)

We've established on this board that physics isn't particularly FP-heavy and is more a pointer-chasing type of app than anything. However, it does have inherent parallelism and as such fits CELL pretty well; one SPU might be worse than one x86 CPU, but you have a whole bunch of SPUs.

IMO, it's a pretty narrow window of opportunity for both PPUs and CELL; once normal CPUs reach the 4-8 cores-on-a-die mark, it's over.

Cheers
 
pjbliverpool said:
I think Cell Factor pretty much proves it's not true.

That is part of the reason I brought it up. While there was a lot going on, some parts seemed not to react fast enough, or were a bit off pace.

Could just be the video I saw, though.
 
Gubbi said:
We've established on this board that physics isn't particularly FP-heavy and is more a pointer-chasing type of app than anything. However, it does have inherent parallelism and as such fits CELL pretty well; one SPU might be worse than one x86 CPU, but you have a whole bunch of SPUs.

IMO, it's a pretty narrow window of opportunity for both PPUs and CELL; once normal CPUs reach the 4-8 cores-on-a-die mark, it's over.

Cheers

Well, to be fair to PS3, it only ever really needs to keep an edge in physics against 360 and Wii to have won its physics match-up. For Cell in general, at the same point that 'normal' CPUs can/will hit 8 cores per die, there could just as well be multi 'core' Cells as well; roadmaps indicate plans for as much. Now, obviously a lot of that will depend on Cell uptake in the server/workstation market between now and then, but if there is a perceived demand, certainly STI could keep Cell out in front of other, more conventional processors for quite some time. When we start heading towards 2010 and the supposed co-processor-supported architectures in the 'mainstream' space, things get murkier.
 
I have a question:

Since Ageia's PPU has a configuration very similar to Cell's but less powerful, and at the same time powerful GPUs like the ATI X1900 are a bottleneck for it, I want to know whether the same applies to Cell+RSX, or whether the FlexIO bus solves that problem and, thanks to it, the PS3 is several times more powerful than any PC at real-time physics calculations.

Am I wrong?
 
Gubbi said:
We've established on this board that physics isn't particularly FP-heavy and is more a pointer-chasing type of app than anything. However, it does have inherent parallelism and as such fits CELL pretty well; one SPU might be worse than one x86 CPU, but you have a whole bunch of SPUs.

IMO, it's a pretty narrow window of opportunity for both PPUs and CELL; once normal CPUs reach the 4-8 cores-on-a-die mark, it's over.

Cheers

I would think the pointer chasing has a lot to do with how the data structures commonly used to store physics-related data have been constructed to date, and with just how that data is accessed. How physics has most commonly been handled on CPUs so far does not, to me, necessarily dictate how it could or should be handled on Cell and Ageia's PPU. (Not that I want to get into an extended discussion about it, mind you.)
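A tiny sketch of what I mean, in C (a made-up layout, not any actual engine): the same body data as a pointer-linked list of heap objects versus flat, contiguous arrays that can be batched and streamed.

/* Traditional layout: each body is a separate heap allocation, and you
   walk a linked list (or tree) of pointers to touch them all - every
   step is a dependent load. */
struct body_node {
    float pos[3], vel[3];
    struct body_node *next;
};

/* Stream-friendly layout: the same data packed into contiguous arrays,
   indexed arithmetically - nothing to chase, easy to batch or DMA. */
struct body_soa {
    float *pos_x, *pos_y, *pos_z;
    float *vel_x, *vel_y, *vel_z;
    int    count;
};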
 
xbdestroya said:
For Cell in general, at the same point that 'normal' CPUs can/will hit 8 cores per die, there could just as well be multi 'core' Cells as well;

You can have 2-3 SPUs in the same die area as an x86 core (K8). So when you have 8 x86 cores on a die, you could have ~20-25 SPUs on a die. I'm betting that 8 general-purpose cores with SIMD will be faster than 20 SPUs on just about anything.

Cheers
 
Gubbi said:
You can have 2-3 SPUs in the same die area as an x86 core (K8). So when you have 8 x86 cores on a die, you could have ~20-25 SPUs on a die. I'm betting that 8 general-purpose cores with SIMD will be faster than 20 SPUs on just about anything.

Cheers

It's going to depend on memory architecture more than anything.
 
Gubbi said:
You can have 2-3 SPUs in the same die area as an x86 core (K8). So when you have 8 x86 cores on a die, you could have ~20-25 SPUs on a die. I'm betting that 8 general-purpose cores with SIMD will be faster than 20 SPUs on just about anything.

Cheers

Are you counting L2 caches? 8 cores sharing a single L2 cache won't be anywhere near as efficient as 20-25 SPUs with local memory.

Also, I don't believe for a second that physics is inherently pointer-chasing (sparse-matrix processing is just the traditional approach, so there is an assumption that this is the best you can do). That is a result of current physics engine design tailored for serial processors. Whether or not you have to chase pointers depends entirely on your data structure, and which data structure you use can be dictated by memory architecture. Unlike many other problems in CS, there is no inherent proof of problem difficulty for physics.

Edit: let me qualify before there is some confusion. I'd say that something is inherently pointer-chasing if the number of pointers chased scales 1:1 with the number of elements processed, asymptotically. That is, say, if performing physics on O(N) vertices requires O(N) pointers to be chased. If only O(log N) pointers need to be chased, I would not say it is inherently pointer-chasing.
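As an example of the kind of alternative I have in mind (a rough C sketch, made-up names): a uniform-grid broadphase stored as two flat arrays - object ids sorted by cell, plus per-cell start offsets - so finding everything in a cell is index arithmetic, not one pointer chase per object.

/* Flat uniform-grid broadphase: cell_start[c] is the offset into
   sorted_ids[] where cell c's objects begin; cell_start[c+1] ends it.
   Rebuilt each frame with a counting sort over cell indices - O(N)
   work, no per-object pointer chases during queries. */
typedef struct {
    int  num_cells;
    int *cell_start;   /* num_cells + 1 offsets into sorted_ids */
    int *sorted_ids;   /* object ids grouped by cell */
} grid_t;

/* Visit every object in cell c - pure index math. */
static inline void for_each_in_cell(const grid_t *g, int c,
                                    void (*visit)(int id, void *ctx),
                                    void *ctx)
{
    for (int i = g->cell_start[c]; i < g->cell_start[c + 1]; ++i)
        visit(g->sorted_ids[i], ctx);
}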
 
DemoCoder said:
Are you counting L2 caches? 8 cores sharing a single L2 cache won't be anywhere near as efficient as 20-25 SPUs with local memory.

The core of a K8 (with 2x 64KB L1 caches) is 33% of 193mm^2 in 0.13um (64mm^2), and approximately half that in 0.09um, so I was actually being conservative. It's less than 2 SPUs for each current K8 core. Beef the K8 up with full-width SSE (K8L) and it'll be around my initial estimate.

As for a shared L2 (or L3) cache not being as effective as distributed local stores: contrary to you, I believe it will be vastly more effective. There's no need to store redundant copies of data and code in each local store (the L2 cache gets constructive interference), and smaller cache lines vs. larger DMA-friendly blocks give more optimal packing of data structures due to less internal fragmentation of data.

With good control over loading and storing non-temporal data, a cache hierarchy will outperform local stores in anything.
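To sketch what I mean by that control (x86/SSE, with _mm_stream_ps as the non-temporal store; the loop itself is just an illustration): results you won't re-read soon can be written past the cache so they don't evict the working set.

#include <xmmintrin.h>

/* Write results with non-temporal (streaming) stores so they bypass the
   caches instead of evicting hot data. src and dst must be 16-byte
   aligned and n a multiple of 4 for this simple version. */
void store_results_nt(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(&src[i]);   /* normal, cached load */
        _mm_stream_ps(&dst[i], v);         /* non-temporal store  */
    }
    _mm_sfence();                          /* make the streamed stores visible */
}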

DemoCoder said:
Edit: let me qualify before there is some confusion. I'd say that something is inherently pointer-chasing if the number of pointers chased scales 1:1 with the number of elements processed, asymptotically. That is, say, if performing physics on O(N) vertices requires O(N) pointers to be chased. If only O(log N) pointers need to be chased, I would not say it is inherently pointer-chasing.

Well, if that's your definition then your point might certainly be valid.

I'd say a task is inherently pointer-chasing if a significant part of the total work per unit were down to data dependencies rather than arithmetic work, regardless of time complexity.

I'd expect physics load to scale with O(N log N) for the general case and of course O(N^2) for the worst case (all hulls collide).

Cheers
 
Gubbi said:
We've established on this board that physics isn't particularly FP-heavy and is more a pointer-chasing type of app than anything. However, it does have inherent parallelism and as such fits CELL pretty well; one SPU might be worse than one x86 CPU, but you have a whole bunch of SPUs.

IMO, it's a pretty narrow window of opportunity for both PPUs and CELL; once normal CPUs reach the 4-8 cores-on-a-die mark, it's over.

Cheers
I see your point but I don't think it's future-proof ;)

Just think about what exact type of physics you'll end up with if you want to scale it up, which is what this is all about: 10,000 particles spawned from an explosion, stuff like that.

The vast majority of the physics actors will need just simple Newtonian adjustments of position and velocity.
Objects moving against friction are similar: while you'd model friction in C with some kind of if-then-else construct, to account for the minimum threshold of force that allows movement, every SIMD architecture since the Pentium MMX can crunch through that kind of thing without any actual branches.
An actor bouncing off another actor (or a static surface) is a similar case of branch-conceptually-but-not-really.
That's very, very SIMD-friendly IMO.
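Something like this is what I mean by branch-conceptually-but-not-really (an SSE sketch of a crude static-friction threshold; the names and the model are made up): the if-then-else becomes a compare that produces a mask, and the mask zeroes the velocity of anything below the threshold.

#include <xmmintrin.h>

/* Crude static-friction threshold, four objects at a time: where the
   applied force doesn't exceed 'threshold', the object doesn't move.
   Expressed as a compare mask plus an AND instead of per-object branches.
   vel and force must be 16-byte aligned, n a multiple of 4. */
void apply_friction_threshold(float *vel, const float *force,
                              int n, float threshold)
{
    __m128 thr = _mm_set1_ps(threshold);
    for (int i = 0; i < n; i += 4) {
        __m128 f    = _mm_load_ps(&force[i]);
        __m128 v    = _mm_load_ps(&vel[i]);
        __m128 mask = _mm_cmpgt_ps(f, thr);  /* all-ones lanes where f > threshold */
        v = _mm_and_ps(v, mask);             /* zero velocity where f <= threshold */
        _mm_store_ps(&vel[i], v);
    }
}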

The pointer chasing comes in when you determine which other surfaces you want to check your cloud of 10,000 particles against. And then you'll be happy to have a SIMD monster to perform the actual collision checks and position/velocity updates.

Some parts of a comprehensive physics package will need to chase pointers, yes, and some other parts will be well served with a general-purpose core that's good at branching and has a strong cache. It's just that I think that the majority of physics actors, especially in "gimmicky" situations, statistically don't have an "interesting" interaction. They will only need lots of "stupid" processing power to mow them down.
 
Gubbi said:
The core of a K8 (with 2x 64KB L1 caches) is 33% of 193mm^2 in 0.13um (64mm^2), and approximately half that in 0.09um, so I was actually being conservative. It's less than 2 SPUs for each current K8 core. Beef the K8 up with full-width SSE (K8L) and it'll be around my initial estimate.

You're going to have to do more to convince me. I don't think you're comparing apples to apples and you're making pretty bold statements.

As for a shared L2 (or L3) cache not being as effective as distributed local stores: contrary to you, I believe it will be vastly more effective. There's no need to store redundant copies of data and code in each local store (the L2 cache gets constructive interference), and smaller cache lines vs. larger DMA-friendly blocks give more optimal packing of data structures due to less internal fragmentation of data.

With good control over loading and storing non-temporal data, a cache hierarchy will outperform local stores in anything.

I disagree. SMP doesn't scale well, as eventually the shared L2 cache ends up bus-saturated (cache coherency only makes it worse). SMP was abandoned for NUMA in enterprise systems precisely because of these limitations. I don't see what cache line size has to do with it, since with a local store one isn't likely to DMA in a few bytes of data via gather operations, but rather to stream in large batches of pre-packed data. And memory fragmentation will only be a problem if one thinks that SPUlets will be maintaining general-purpose heaps, which I do not believe will be the programming model.
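As a sketch of the streaming model I have in mind (using the Cell SDK's mfc_get/mfc_write_tag_mask/mfc_read_tag_status_all from spu_mfcio.h; batch size and names are made up): you DMA the next pre-packed batch into local store while the current one is being processed, rather than gathering individual elements.

#include <stdint.h>
#include <spu_mfcio.h>

#define BATCH_BYTES 16384   /* 16KB batches, two local-store buffers */
static char buf[2][BATCH_BYTES] __attribute__((aligned(128)));

/* Double-buffered streaming: kick off the DMA for batch i+1, then process
   batch i while the transfer is in flight. process_batch() stands in for
   whatever physics kernel runs on the packed data. */
void stream_batches(uint64_t ea, int num_batches,
                    void (*process_batch)(void *data, int bytes))
{
    int cur = 0;
    mfc_get(buf[cur], ea, BATCH_BYTES, cur, 0, 0);            /* first batch  */
    for (int i = 0; i < num_batches; ++i) {
        int next = cur ^ 1;
        if (i + 1 < num_batches)                              /* prefetch next */
            mfc_get(buf[next], ea + (uint64_t)(i + 1) * BATCH_BYTES,
                    BATCH_BYTES, next, 0, 0);
        mfc_write_tag_mask(1 << cur);                         /* wait for current */
        mfc_read_tag_status_all();
        process_batch(buf[cur], BATCH_BYTES);
        cur = next;
    }
}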

I'd say a task is inherently pointer-chasing if a significant part of the total work per unit were down to data dependencies rather than arithmetic work, regardless of time complexity.

But your definition isn't precise. What's significant? 10%? 20%? 50%? Secondly, define pointer chasing. There's pointer chasing, and there's pointer chasing. If a chased pointer is already in local store, it's a huge difference compared to a chased pointer in external memory.

I'd expect physics load to scale with O(N log N) for the general case and of course O(N^2) for the worst case (all hulls collide).

That may be the case, but it is beside the point of how many pointers need to be chased. The question is how collision detection is parallelized (which data structure is used) and how expensive it is to update that data structure. Those are separate questions to me, and they determine how much pointer chasing you need to do and how you will limit scatter writes.

I just think your statement that a centralized L2 cache and a bunch of SMP cores are going to win all the time, no matter what the workload, is unsupportable without further explanation. (e.g. how can it beat a stream processor on stream-processor tasks?)
 