Physics on Cell

Gubbi · May 17, 2006

My argument boils down to this:

1. Because memory coherence is software based in CELL inter process and inter thread communication is expensive. That is counter productive (that's being nice) when scaling your architecture to a massively multicored implementation.

2. Local stores have no automatic way of sharing data. When you want to parallize a task that has a large shared dataset, you end up with individual copies in each SPE, in other words, each SPE only ever has fast access to (less than) 256KB. So even though you have 8 SPEs with a total of 2MB, you only have 256KB effectively per SPE, going forward to future implementations with say 24 SPEs, each SPE still only sees 256KB. This in contrast to a cache memory hiearchy processor, say you have a dual core with 2 MB (L2, shared) cache, going forward and doubling you get 4MB that all processors might take advantage of (constructive interference).

The alternative is to spatially divide the dataset among the SPEs. But like Patsu mentions if you have a hot region with many agents, one SPE is doing all the work while the others are just very expensive (unused) SRAM.

Cheers
Gubbi

patsu · May 18, 2006

Gubbi said:
My argument boils down to this:

1. Because memory coherence is software based in CELL inter process and inter thread communication is expensive. That is counter productive (that's being nice) when scaling your architecture to a massively multicored implementation.

It's true in general. Fine-grained parallelism is hard if we want to maintain decent speedup (even for SMP systems !) because FP/integer calculation is so fast compared to other overhead these days).

About the memory coherence problem, I definitely agree that a general shared memory systems will have a great advantage here. But a lot depends on the actual implementation and relative performance of the system.

+ Is "global (DMA) write" to shared memory in Cell prohibitively expensive vis-a-vis Xenon's shared memory writes with cache coherence ? How much absolute time are we talking about here ? (I don't have the numbers)

+ In real programming terms, shared memory + memory coherence still require concurrent control mechanisms such as semaphores, locks. Does Xenon provide any hardware lock primitives ? If a parallel algorithm can partition the problem such that locking is not needed, then an SPU should be able to exploit the situation. Otherwise, an SPU can issue a DMA lock-and-write command at the hardware level (processed immediately, not queued in DMA queue) to the shared memory. In the end, what is the relative differences between both approaches ? (I don't have the answer)

Gubbi said:
2. Local stores have no automatic way of sharing data. When you want to parallize a task that has a large shared dataset, you end up with individual copies in each SPE, in other words, each SPE only ever has fast access to (less than) 256KB. So even though you have 8 SPEs with a total of 2MB, you only have 256KB effectively per SPE, going forward to future implementations with say 24 SPEs, each SPE still only sees 256KB. This in contrast to a cache memory hiearchy processor, say you have a dual core with 2 MB (L2, shared) cache, going forward and doubling you get 4MB that all processors might take advantage of (constructive interference).

To be a fair comparison, I think in the future, SPE would have larger than 256K local memory (and relatively larger shared memory), still minus the cache logic. Not sure if it's cheaper than a comparable shared memory + cache coherent implementation. The 256K limit is determined by profiling existing target applications (according to some IBM slides).

If we take a step back...

+ How big a problem are we talking about. Many problem only show speed up if it's large enough. Realistically, an approach like Xenon can offer close to twice the speed up of a serialized program (assuming 2 out of 3 cores are used to parallelize the program). An inefficient 2-SPU implementation may offer say 1.3 times the speed up. Unless the specific problem size is huge and this parallel segment is executed frequently, 2x and 1.3x of a small number may be negligible. If it's an important problem, can we throw in more SPUs to get more speed up ?

+ What exactly are we trying to get out of multi-core ?
* More realistic cloud, water, fire, cloth, ... ?
* More interactive elements on screen at the same time ?
* Smarter elements ?
* Humongous deformable world ?
* Multitasking ? (e.g., recording TV, serving out video and playing games at the same time)
* All of the above ?
The simplistic approach in Cell allows for more cores. IMHO, it may be hard to argue for the absolute superiority of one technical approach over another until we pin down the specific problem to solve.

Gubbi said:
The alternative is to spatially divide the dataset among the SPEs. But like Patsu mentions if you have a hot region with many agents, one SPE is doing all the work while the others are just very expensive (unused) SRAM.

Yeah in this case, it's a problem with the problem break down. The busier/more crowded space should be broken down further before/when an SPU realizes that it's too big.

one · May 18, 2006

DarkRage said:
As Ageia physX has got some similarities with Cell for the task in this thread, it can be interesting to have a quick look at new results on that card with updated drivers:

http://www.anandtech.com/video/showdoc.aspx?i=2759

Still dissapointing.

... So you still don't get why GRAW and City of Villains are not appropriate softwares to test PhysX and game softwares developed for PS3 are not similar to them?

anandtech said:
As with any MMOG, City of Villains has been difficult to test. The fact that PhysX hardware is only supported in a very specific area of gameplay makes it even more complicated.

DemoCoder · May 18, 2006

Gubbi said:
My argument boils down to this:
1. Because memory coherence is software based in CELL inter process and inter thread communication is expensive. That is counter productive (that's being nice) when scaling your architecture to a massively multicored implementation.

I don't believe this is the problem for CELL. There are lots of highly scalable architectures for which interprocess communication are expensive (# of cycles), but this doesn't inhibit their ability to achieve high sustained throughput. Even on supercomputers, IPC is still a very high latency operation. Avoiding synchronization is a must. Look at map/reduce style algorithms.

2. Local stores have no automatic way of sharing data. When you want to parallize a task that has a large shared dataset, you end up with individual copies in each SPE,

The copies don't seem to be a problem either. The EIB isn't even close to being saturated in CELL.

My argument in favor of assymmetry is that you can take your K8 cores, and attach ALOT of additional stream processors very cheaply, which will execute many graphics related workloads as fast or faster. I'd also argue that TLP is more important than oooe and caches, and that if you have a sufficient number of threads and sufficient amount of memory, you can hide memory access latency, so bandwidth becomes more important.

I'd take a chip with 2-4 cores and enough functional units to dispatch 100 concurrent threads, over an 8-core unit.

patsu · May 18, 2006

DemoCoder said:
I don't believe this is the problem for CELL. There are lots of highly scalable architectures for which interprocess communication are expensive (# of cycles), but this doesn't inhibit their ability to achieve high sustained throughput. Even on supercomputers, IPC is still a very high latency operation. Avoiding synchronization is a must. Look at map/reduce style algorithms.

Woah ! You don't think their high sustained throughput is problem-dependent ?

EDIT: If the objective is to speed up a set of physics calculations, we probably have to rely primarily on the SIMD engines in the consoles to achieve worthwhile multiplier effect. If we can only rely on multiple cores to solve scalar equations and chase pointers, it will be bounded by the number of available cores, with or without cache coherence (Xenon has 3-6, Cell has 9, PPU and GPU have a good number of specialized ones but less powerful, finally supercomputers have hundreds/thousands to make it worthwhile).

DemoCoder · May 18, 2006

patsu said:
Woah ! You don't think their high sustained throughput is problem-dependent ?

Where did I say it wasn't?

EDIT: If the objective is to speed up a set of physics calculations, we probably have to rely primarily on the SIMD engines in the consoles to achieve worthwhile multiplier effect. If we can only rely on multiple cores to solve scalar equations and chase pointers, it will be bounded by the number of available cores, with or without cache coherence (Xenon has 3-6, Cell has 9, PPU and GPU have a good number of specialized ones but less powerful, finally supercomputers have hundreds/thousands to make it worthwhile).

Well, exactly my point. Large scale supercomputers more than anything else, have been used to solve physics workloads. Computational Fluid Dynamics, climate models, protein folding, nuclear detonation, compressed matter, plasma dynamics, QCD, solar models, cosmology, etc.

These computers have large IPC costs (even at 10-gigabit or even 100-gigabit interconnects), but the programming model is not shared memory or synchronization, it's message passing.

Look at the performance these guys got on commodity Pentium HW on a commodity network infrastructure using hashed parallel oct-trees

The paper on the technique is here

Programming for CELL IMHO requires a rethink, and many of the "traditional n-way SMP core" arguments boil down to "well, I don't want to take the time to learn how to do this the right way for this architecture"

Everytime someone tells me something won't work or is impossible, I am always suprised by the ingenuity out there. I once thought something like unbreakable Broadcast Encryption (as used in AACS) to be impossible, and that's after much time working at IBM in security, implementing new number theoretic protocols, as well as being a former cracker from the 8-bit era. The idea of revokable encryption from billions of clients seemed to be a non-starter. (if you can't trust the client, how can you trust it to disable a key? and how can you avoid individually encrypting a billion datastreams each with a unique key, and putting them all on a single media!) Then I read the original broadcast encryption paper and a door opened in my mind. It was right there in front of me. It was so elegant. So natural.

Often times in arguing in these threads, someone will say "no, no, this problem is inherently serial" because all current described methods in papers and middleware use the serial algorithm, and sites like Gamasutra and others cover only the basic techniques. Then you delve into citeseer, and sometimes you run across a crazy guy who didn't get the message, and did something beautiful.

rounin · May 18, 2006

DemoCoder said:
Often times in arguing in these threads, someone will say "no, no, this problem is inherently serial" because all current described methods in papers and middleware use the serial algorithm, and sites like Gamasutra and others cover only the basic techniques. Then you delve into citeseer, and sometimes you run across a crazy guy who didn't get the message, and did something beautiful.

Found my new signature

EDIT: Didn't know there was a sig char limit

patsu · May 18, 2006

DemoCoder said:
Where did I say it wasn't?

I inferred it from here: "There are lots of highly scalable architectures for which interprocess communication are expensive (# of cycles), but this doesn't inhibit their ability to achieve high sustained throughput."

Sorry if I misunderstood you but I wanted to say: high IPC cost does affect throughput, especially for messaging passing computers. The throughput could be higher with more efficient sharing mechanism. The other factor is of course the type of problem.

Also the supercomputing numbers are skewed because their problem size is huge (and can hide overhead). In gaming physics, the relatively higher overhead (compared to problem size) can kill performance.

DemoCoder · May 18, 2006

patsu said:
I inferred it from here: "There are lots of highly scalable architectures for which interprocess communication are expensive (# of cycles), but this doesn't inhibit their ability to achieve high sustained throughput."

Sorry if I misunderstood you but I wanted to say: high IPC cost does affect throughput, especially for messaging passing computers. The throughput could be higher with more efficient sharing mechanism. The other factor is of course the type of problem.

I didn't say they always achieve high throughput, just that they can achieve it. Of course, lower message passing cost could increase throughput *in problems where message passing cost dominates*, but the same is true of SMP architectures. Lock contention is still an issue even in SMP designs and people doing SMP programming still strive to minimize locks because of this, and a lock is the most basic and trivially IPC primitive there is (was, CAS is more primitive)

I can describe to you an algorithm right now in which messaging passing overhead plays *NO* part in overall throughput whatsoever. Key cracking, be is DES, IDEA, or finding hash collisions. Be it linear or differential cryptanalysis. The problem can be perfectly partitioned with almost no communication required between any nodes, except for a final broadcast or request for work. And it's no surprise that this was the first problem attempted for at-home screensaver-distributed-internet-computing.

Also the supercomputing numbers are skewed because their problem size is huge (and can hide overhead). In gaming physics, the relatively higher overhead (compared to problem size) can kill performance.

"Can" being the operative word. The problem utilized in the paper I referenced was a system containing 8million particles, not beyond the realm of possibility for games perhaps 10 years from now. Their total time spent on messaging passing was just 2.5%. It is unclear to me that if you scale down to say, 80,000 particles (1/100th the size), that this message passing time will remain constant and come to dominate, wiping out throughput gains. For one, it depends on the speed of the node vs the speed of the interconnect, and second, it depends on how the computational part of the algorithm scales. The 8million particle simulation took 164 seconds @ ~6GFlops. (it's a approximation technique, not exact. But still, it's running on an interconnect that is like 3 orders of magnitude slower than EIB, and the amount of memory per node is only like 8MB)

Anyway, here is what they say about memory hierarchy:

Treecodes place heavy demands on the memory subsystems
of modern computers. The quasi-random access to widely
separated memory locations during the tree traversal receives
little benefit from a small on-chip cache, and can
in fact overwhelm the translation look-a-side buffer (TLB)
on microprocessors similar to the i860. This results in
very poor performance from algorithms which have been
designed without consideration of memory bandwidth and
the limitations inherent in memory hierarchies which are
intended to function with DRAM significantly slower than
the processor cycle time.

They then go on to show how they encode tree accesses to reduce random access.

I think there's alot of handwaving going on in this thread on all sides. It makes little sense to throw out assertions that something "may" or "may not" work without choosing a particular problem and actually doing the calculations for various architectures.

patsu · May 18, 2006

DemoCoder said:
I didn't say they always achieve high throughput, just that they can achieve it. Of course, lower message passing cost could increase throughput *in problems where message passing cost dominates*, but the same is true of SMP architectures. Lock contention is still an issue even in SMP designs and people doing SMP programming still strive to minimize locks because of this, and a lock is the most basic and trivially IPC primitive there is (was, CAS is more primitive)

I can describe to you an algorithm right now in which messaging passing overhead plays *NO* part in overall throughput whatsoever. Key cracking, be is DES, IDEA, or finding hash collisions. Be it linear or differential cryptanalysis. The problem can be perfectly partitioned with almost no communication required between any nodes, except for a final broadcast or request for work. And it's no surprise that this was the first problem attempted for at-home screensaver-distributed-internet-computing.

That's why I said: "The other factor is of course the type of problem."

DemoCoder said:
I think there's alot of handwaving going on in this thread on all sides. It makes little sense to throw out assertions that something "may" or "may not" work without choosing a particular problem and actually doing the calculations for various architectures.

Amen.

Crossbar · May 18, 2006

gubbi said:
1. Because memory coherence is software based in CELL inter process and inter thread communication is expensive.

I don't think inter process communication and inter thread communication on the Cell in general need to be more expensive than any other multi-core solution, there are plenty of fast signal routes between the SPEs. All communication does not need to involve lagre chunks of memory and when it does the DMA transfer between different LS is very fast.

I really can't see the big benefits of using an cache instead, anyway you do it you need to pay the price for accessing the main memory at some point, either you do it in a controlled way or in a random way. If you don't bother to organise your data you will lose some performance in the SPEs that's all.

patsu said:
To be a fair comparison, I think in the future, SPE would have larger than 256K local memory (and relatively larger shared memory), still minus the cache logic. Not sure if it's cheaper than a comparable shared memory + cache coherent implementation. The 256K limit is determined by profiling existing target applications (according to some IBM slides).

I agree that larger local SPE memory is very likely in future Cell versions. The size of the SPE memory can actually be read from a register within the SPE, which means that you can write applications today that will be able to take advantage of that memory the day it changes, by allocating larger dynamic buffers etc.

Could also be helpful if you will be running an application on a heterogenous Cell cluster in the future.

BTW: This is the best poublic document describing the Cell so far, don't know if it's been posted before, anyway enjoy!!!

http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F

chris1515 · May 18, 2006

BTW: This is the best poublic document describing the Cell so far, don't know if it's been posted before, anyway enjoy!!!
http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F

Crossbar, the document deserves it own thread. It is not been posted before.

Titanio · May 18, 2006

Yeah, can't believe I missed that doc! Must have been distracted with pre-E3 shenanigans

That's a lot awaited document - IIRC IBM said they'd talk more specifically about the PPE in it, for example. Thanks for the link, Crossbar!

Physics on Cell

Gubbi

patsu

one

Unruly Member

DemoCoder

patsu

DemoCoder

rounin

patsu

DemoCoder

patsu

Crossbar

chris1515

Titanio

Similar threads