HotChips 17 - More info upcoming on CELL & PS3

Maybe this has something to do with the Microsoft argument?

"The truth is, you'll never really utilize theoretical performance. The important thing is how much can game developers actually leverage out of that theoretical performance.

We basically developed two different architectures [for Xbox 360] in parallel with the game designers, and then we sided with the one that was going to give us the best overall performance when we calculated not just the theoretical performance, but things like cost and flexibility with partners. I think we’ve come as close to realizing the theoretical performance of a machine that you can design in 2005 for game development as anyone else on the Earth could." – J. Allard
 
standing ovation said:
Maybe this has something to do with the Microsoft argument?

Mr. Allard assumed that just because the tools and technology in 2005 are the same for everybody, the obtained performance has to be equal.

And that is flat wrong, but a nice excuse for the non-tech guys to believe.
 
Oh ... :cry:

I thought he was telling us why Microsoft made 360 (which would include the chips) the way that it did: lower cost, higher flexibility, and it is the design that allows game makers to squeeze out more of its theoretical performance.
 
dskneo said:
Defects are gonna hit the 200 mm².
What matters is that if a defect hits a small space of the 3-core CPU, 1/3 of the CPU goes to garbage, because one tiny defect on one core disables the FULL core. 2 cores remain.
Anyway, if we are allowed to talk about the reality, there's no concept of redundancy in the Xbox 360 CPU, unlike the Cell processor or memory cells, so if a defect hits a small space of the 3-core CPU, all of it goes to garbage. Also, it's a highly customised CPU that cut AltiVec compatibility. Even if it supported redundancy, there'd be no customers who would buy an Xbox 360 CPU with fewer cores.
 
dskneo said:
Mr. Allard assumed that just because the tools and technology in 2005 are the same for everybody, the obtained performance has to be equal.

And that is flat wrong, but a nice excuse for the non-tech guys to believe.

Agreed. Sony worked with the same guys that he did, so that should help Sony, right? I mean, don't Sony, IBM, and Toshiba all have a hand in the CELL being able to reach as close as possible to its theoretical peaks?

I mean, the CELL will be in more machines than just the PS3. The CELL chip could help the STI group make billions and billions of dollars if it is successful in the military, hospitals, TVs, etc. My point is, shouldn't we also believe that the CELL will have software good enough to enable it to reach high programmable peaks as well?
 
mckmas8808 said:
Agreed. Sony worked with the same guys that he did, so that should help Sony, right? I mean, don't Sony, IBM, and Toshiba all have a hand in the CELL being able to reach as close as possible to its theoretical peaks?

I mean, the CELL will be in more machines than just the PS3. The CELL chip could help the STI group make billions and billions of dollars if it is successful in the military, hospitals, TVs, etc. My point is, shouldn't we also believe that the CELL will have software good enough to enable it to reach high programmable peaks as well?

In a word? No.

Apparently that MPEG decode demo took roughly a year to get working.

CELL presents a lot of unique challenges. Research challenges.

Sure, you will have some houses using it, but they will mostly be using a variant of macro assembler to offload some functions onto the SPEs for acceleration, and this will form only a fraction of the run time of the engines.

Except for trivial cases, CELL will only realize a fraction of its performance. And even in the cases where it realizes moderate performance, it will be primarily for things like geometry offload or partial physics acceleration, things that would be better off done in fixed function.

In addition, it is highly doubtful that CELL will be in much except for PS3. If CELL sells a couple million parts outside of PS3, I'll be surprised.

The programming model used for the X360 CPU is much more mature and understood, and while the infrastructure is somewhat lacking, it is years if not decades ahead of the infrastructure and knowledge for something like CELL.

Don't get me wrong: from a technical perspective, CELL is interesting, but then again, so was the Transputer or the myriad of other pseudo-special-purpose chips over the years.

Aaron Spink
speaking for myself inc.
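
For reference, the "offload some functions onto the SPEs" pattern described above usually looks something like the following minimal C sketch. spe_run, scale_kernel, and the chunking constant are hypothetical stand-ins for illustration, not the real Cell toolchain API:

#include <stdio.h>

#define CHUNK 4096   /* elements per work unit (illustrative) */

/* Hypothetical stand-in for handing a kernel to one SPE. A real
 * runtime would DMA the buffer into the SPE's local store, run the
 * kernel there, and DMA the results back; here we just call it. */
static void spe_run(int spe, void (*kernel)(float *, int),
                    float *buf, int n)
{
    (void)spe;
    kernel(buf, n);
}

/* The hot leaf function being offloaded for acceleration. */
static void scale_kernel(float *buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] *= 2.0f;
}

/* Host (PPE) side: carve the array into chunks and farm them out
 * round-robin. In the picture painted above, code like this covers
 * a few hot functions while the bulk of the engine stays on the PPE. */
static void offload_scale(float *data, int n, int num_spes)
{
    int spe = 0;
    for (int c = 0; c < n; c += CHUNK) {
        int len = (n - c < CHUNK) ? (n - c) : CHUNK;
        spe_run(spe, scale_kernel, data + c, len);
        spe = (spe + 1) % num_spes;
    }
}

int main(void)
{
    static float data[10000];
    for (int i = 0; i < 10000; i++) data[i] = 1.0f;
    offload_scale(data, 10000, 7);
    printf("%f\n", data[9999]);   /* 2.0 expected */
    return 0;
}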
 
aaronspink said:
Sure, you will have some houses using it, but they will mostly be using a variant of macro assembler to offload some functions onto the SPEs for acceleration
Not likely.
http://www.research.ibm.com/cellcompiler/compiler.htm
http://www.research.ibm.com/cellcompiler/compiler-perf.htm

aaronspink said:
Apparently that MPEG decode demo took roughly a year to get working.
The MPEG demo, or the development of the underlying software foundation libraries, which is the main theme of the paper? I think it's the latter that took time, rather than the specific app development.

aaronspink said:
In addition, it is highly doubtful that CELL will be in much except for PS3. If CELL sells a couple million parts outside of PS3, I'll be surprised.

The programming model used for the X360 CPU is much more mature and understood, and while the infrastructure is somewhat lacking, it is years if not decades ahead of the infrastructure and knowledge for something like CELL.

Don't get me wrong: from a technical perspective, CELL is interesting, but then again, so was the Transputer or the myriad of other pseudo-special-purpose chips over the years.
Off-topic, but is the Transputer good at Monte Carlo simulation, or FFT? I think there are markets where the gained performance matters. If it's available cheap, then even better.
http://www.research.ibm.com/people/a/ashwini/E3 2005 Cell Blade reports/CPBS whitepaper - final.pdf
Applications:

The Cell architecture was initially conceived for acceleration of digital content creation. Because this is an extremely computationally intense problem, engines optimized for this have utility in many other areas.

First, let's distinguish between classes of graphics used in games and movies. Game systems use polygon based, real time, lower quality graphics. While Cell technology might be used for the geometry engine, the rasterization will likely be done by dedicated graphics chips.

The second type of graphics is photo realistic, done offline. Studios today use large clusters of rack dense servers, including blade servers, for this purpose. These images are created entirely in software and may take minutes to hours per frame. Cell Processor Based Blade Servers can improve the performance of this compute intensive task by an order of magnitude.

Games represent the holy grail of computational algorithms in a sense, and techniques used in games often have applicability in other scientific areas. A classic pathfinding algorithm, called A* (pronounced A-star), is used extensively in games and military simulation. It is very CPU intense, and becomes exponentially more so with the size of the terrain. Cell processors are particularly effective at A* because of the SPE's register structure and parallel execution. Interestingly enough, Monte Carlo hedge fund simulation is a similar problem and should benefit from the Cell architecture as well.
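
As an illustrative aside (not from the whitepaper): the reason Monte Carlo work maps so well is that every sample is independent, so a loop like this minimal C sketch can be carved up across SPEs with no communication between them:

#include <stdio.h>
#include <stdlib.h>

/* Minimal Monte Carlo example (pi estimation). Each sample is
 * independent, so the loop splits trivially across SPEs/threads. */
int main(void)
{
    const long N = 1000000;
    long hits = 0;
    srand(12345);
    for (long i = 0; i < N; i++) {
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0)   /* inside the unit quarter-circle? */
            hits++;
    }
    printf("pi ~= %f\n", 4.0 * hits / N);
    return 0;
}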

SPEs have instructions specifically for scatter/gather operations. Coupled with the ability to have multiple I/O operations outstanding, any algorithms heavily dependent on scatter/gather should demonstrate significant speedup. Life sciences applications like BLAST may expect significant speedups from this architecture.
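
Again as an illustrative aside (not from the whitepaper), the gather/compute/scatter pattern looks roughly like this in C, with the indexed copies standing in for the DMA list transfers an SPE would keep in flight while computing:

/* Gather scattered elements into a dense staging buffer, operate,
 * then scatter the results back. Assumes n <= 256. */
void gather_scale_scatter(float *data, const int *idx, int n, float k)
{
    float dense[256];                    /* staging buffer ("local store") */
    for (int i = 0; i < n; i++)          /* gather */
        dense[i] = data[idx[i]];
    for (int i = 0; i < n; i++)          /* compute on dense data */
        dense[i] *= k;
    for (int i = 0; i < n; i++)          /* scatter back */
        data[idx[i]] = dense[i];
}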

The very robust communication structure between SPEs is optimal for algorithms where computation is performed on a node, the results passed to a nearest neighbor, more computation is performed, etc. Any application that uses Fast Fourier Transforms successively is a natural for the Cell architecture. Signal processing (like seismic) and computational fluid dynamics applications should benefit significantly from the Cell.

Less traditional applications, like video surveillance, are a perfect fit for Cell processors. The video encoding capabilities, coupled with an Artificial Intelligence engine, might recognize problems and generate appropriate alerts.

The applications list above is not exhaustive, but is meant to give a feel for the breadth of applications that will perform better on Cell architecture systems.
 
Defects happen per unit of die area. It's correct that the smaller the die, the fewer the chances of damage. But fewer cores means BIGGER cores to fill the space, and the bigger the cores are, the bigger the waste if a tiny defect hits that die space.

Sure, but if you have more cores, 2 or MORE defects are not likely to hit the SAME core. With more cores you need more redundancy, because 2 or MORE defects are likely to destroy more cores than in a design with fewer cores. With fewer cores you don't NEED redundancy; you simply throw away the chip.

So in summary my point still stands: bigger die area AND more cores means you will need more redundancy. ;-)
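
For anyone who wants numbers instead of smilies: the standard first-order yield model is yield = e^(-D*A), for defect density D and die area A. A small C sketch (all constants are assumptions, purely illustrative) of how a spare core changes the picture:

#include <math.h>
#include <stdio.h>

/* First-order Poisson yield model: P(zero defects in area A) = exp(-D*A),
 * with D in defects/cm^2 and A in cm^2. All numbers are illustrative. */
int main(void)
{
    double D = 0.5;              /* assumed defect density, defects/cm^2 */
    double A = 2.0;              /* a ~200 mm^2 die */
    double core_frac = 0.5;      /* assume half the die is core area */
    int cores = 8;               /* e.g. 8 SPE-sized cores, 1 disposable */

    double perfect = exp(-D * A);                    /* whole die clean */
    double core_A = core_frac * A / cores;           /* area of one core */
    double p_core = exp(-D * core_A);                /* one core clean */
    double uncore = exp(-D * (1.0 - core_frac) * A); /* shared logic clean */

    /* With one spare, the chip is good if at most one core is bad. */
    double all_good = pow(p_core, cores);
    double one_bad = cores * (1.0 - p_core) * pow(p_core, cores - 1);

    printf("no redundancy: %4.1f%% yield\n", 100.0 * perfect);
    printf("one spare core: %4.1f%% yield\n",
           100.0 * uncore * (all_good + one_bad));
    return 0;
}

With these assumed numbers it prints roughly 37% versus 56%, which is the CELL-style 7-of-8-SPE argument in miniature.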

As to the former, Grall, I'm sure the engineers at Sun must be scratching their heads over why their Niagara processor, which supports 32 threads, doesn't have 5MB of L2... in fact it has less than 4MB. In other words, per thread that's only 96KB. ;)
 
PC-Engine said:
What does 4 cores sharing 1MB of L2 have anything to do with getting a less performing CPU? That's right nothing...nada..zilch.

Where is this fantasy drop in performance of a 4 core CPU?
What an absolutely ridiculous thing to say; it truly goes to show you lack even the most fundamental aspects of deduction.

Again, the more threads you have competing for the same cache space, the more they're going to keep ejecting each other's cache lines, leading to processing stalls. You know, the stuff that crippled PS2 EE so badly.

That's not the same as saying there is some sort of "fantasy drop" in performance just like that; what I'm talking about is cache performance specifically, do you understand? No, probably not...

Why even bring PC CPUs and 2MB caches into this discussion?
I said 1MB, not 2. And it's called "making a comparison". You know what comparisons are?

Anyway, no point in further commenting on your silly made-up statements... Up you go on my ignore list. Impressive, you're the first; not even Kruno is this annoying.
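
As an aside, the eviction effect being argued about here is easy to demonstrate on any machine: time the same number of accesses against a working set that fits in cache and one that doesn't. A rough C sketch (buffer sizes and stride are assumptions; tune them for the CPU at hand):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Walk a buffer with a large odd stride so successive accesses
 * keep landing on different cache lines (poor spatial locality). */
static long walk(int *buf, long n, long accesses)
{
    long i = 0, sum = 0;
    for (long a = 0; a < accesses; a++) {
        sum += buf[i];
        i = (i + 4099) % n;
    }
    return sum;
}

static void bench(const char *label, long bytes, long accesses)
{
    long n = bytes / (long)sizeof(int);
    int *buf = malloc(bytes);
    for (long i = 0; i < n; i++) buf[i] = (int)(i & 7);
    clock_t t0 = clock();
    long sum = walk(buf, n, accesses);
    printf("%s: %.3fs (sum=%ld)\n", label,
           (double)(clock() - t0) / CLOCKS_PER_SEC, sum);
    free(buf);
}

int main(void)
{
    long accesses = 50 * 1000 * 1000L;
    bench("fits in cache  ", 64L * 1024, accesses);         /* ~64 KB */
    bench("blows the cache", 64L * 1024 * 1024, accesses);  /* ~64 MB */
    return 0;
}

On typical hardware the second line comes out several times slower, despite executing exactly the same number of additions.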
 
Again, the more threads you have competing for the same cache space, the more they're going to keep ejecting each other's cache lines, leading to processing stalls. You know, the stuff that crippled PS2 EE so badly.

That's not the same as saying there is some sort of "fantasy drop" in performance just like that, what I'm talking about is cache performance specifically, do you understand? No, probably not....

Pretty much moot when the bare minimum usable cache size isn't 1MB for 8 threads. Then factor in the additional full PPE core with VMX; it doesn't take a genius to figure out you're getting a buttload more processing power with only a small tradeoff in cache efficiency.

I said 1MB, not 2. And it's called "making a comparison". You know what comparisons are?

Anyway, no point in further commenting on your silly made-up statements... Up you go on my ignore list. Impressive, you're the first; not even Kruno is this annoying.

Doesn't matter if it's 1MB or 2MB; comparing it to PC processors as some form of benchmark is stupid. Did PC processors always have 1MB of cache? No. Did PC processors always need 1MB of cache? No.

NIAGARA

8 cores

32 threads

3MB L2
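
For reference, the per-thread arithmetic from the published figures (assuming 6 hardware threads for the XCPU and 2 for the CELL PPE):

Niagara:  3 MB L2 / 32 threads = 96 KB per thread
XCPU:     1 MB L2 /  6 threads ≈ 171 KB per thread
CELL PPE: 512 KB L2 / 2 threads = 256 KB per thread

Raw per-thread numbers, of course, say nothing about clock speed or access patterns, which is what the rest of this argument is about.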
 
Each of the three Xenon CPU cores is more like a cut-down version of the PowerPC 970 (though with many optimizations), and the PowerPC 970 is itself like a cut-down version of the Power4. I believe the same is true of the PPE in CELL.

So neither console CPU is based on Power5.


Unless I am totally mistaken.


A Power4 CPU has 2 full complex cores; each core can do 2 threads, 4 in total.
A Power5 CPU has 2 full complex cores; each core can do 4 threads, 8 in total.


Xenon has 3 simple, streamlined cores; each core can do 2 threads, 6 in total.



My guess for the Xbox3 (the 360's successor) CPU is a single die with 12 decent cores (beefier than PPEs, but not too beefy) capable of 4 threads each - 48 threads total - with 4-8 MB of L2 cache.
 
POWER4 has two cores, but neither core has SMT... that is, each core can operate on only a single thread at a time.

POWER5 adds SMT for each core in the processor.
 
Megadrive,
Except of course that the CPU cores in the XCPU are nothing at all like the 970... :) That chip is super wide issue (I think like 8 instructions/clock or something like that) with full out-of-order execution capability. It's also single-threaded. The XCPU core is dual-issue, in-order. Massive difference.

This confusion with Power4/5 most likely comes from all the early speculation driven by overly enthusiastic people such as Deadmeat and a few others, who hoped POWER would be the base on which these chips were built. That never was the case...
 
Have you forgotten the clock speed difference as well? Niagara being in the 1-1.5 GHz range vs. 3.2 GHz for Xenon... The cost of a cache miss is a lot more cycles when the clock is that much higher, which in turn means that the cost of what you consider to be a "small tradeoff in cache efficiency" has that much bigger an impact on throughput. It's not a clear-cut ratio of # of threads to cache size. Look at Itanium -- that 6 MB of L2 actually has impact because it's such a high-throughput CPU. PC processors have 1 MB because they can get some serious throughput on single-threaded apps -- far, far more than XeCPU or CELL will ever be able to achieve. High throughput on small data sets is fine, but not realistic. It's not common to use 100 values to do 1 billion calculations.
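
To make the clock-speed point concrete, assume a ~100 ns round trip to main memory (a typical figure for the period):

100 ns at 1.2 GHz = 120 cycles lost per miss
100 ns at 3.2 GHz = 320 cycles lost per miss

The same miss rate costs the faster-clocked chip nearly three times as many cycles of potential work, which is why cache efficiency matters more as clocks rise.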
 
ShootMyMonkey said:
Have you forgotten the clock speed difference as well? Niagara being in the 1-1.5 GHz range vs. 3.2 GHz for Xenon... The cost of a cache miss is a lot more cycles when the clock is that much higher, which in turn means that the cost of what you consider to be a "small tradeoff in cache efficiency" has that much bigger an impact on throughput. It's not a clear-cut ratio of # of threads to cache size. Look at Itanium -- that 6 MB of L2 actually has impact because it's such a high-throughput CPU. PC processors have 1 MB because they can get some serious throughput on single-threaded apps -- far, far more than XeCPU or CELL will ever be able to achieve. High throughput on small data sets is fine, but not realistic. It's not common to use 100 values to do 1 billion calculations.

Actually, Niagara was designed to clock up to 2GHz, since Sun found that going higher doesn't significantly improve performance. Niagara needs less L2 cache simply because it has more threads to mask latency. You can't just look at desktop CPUs with 1MB and 2MB of L2 and make blanket statements about XCPU cache size requirements. That's why it's a fallacy to say more threads in XCPU = more L2 required.
 
Actually, Niagara was designed to clock up to 2GHz, since Sun found that going higher doesn't significantly improve performance. Niagara needs less L2 cache simply because it has more threads to mask latency. You can't just look at desktop CPUs with 1MB and 2MB of L2 and make blanket statements about XCPU cache size requirements. That's why it's a fallacy to say more threads in XCPU = more L2 required.
Adjectives fail me... I don't see how you can make a counterargument to your own point one sentence prior and treat it as proof of your point. ALL increases in execution throughput can increase the demand for cache. Did you stop to think even for a moment why Niagara doesn't show a major performance increase beyond 2 GHz? It's because other system components wouldn't be able to keep up with the throughput, and the relative cycle cost of latencies would cancel out the benefits of the higher clock.

Increases in execution resources ALWAYS see benefits from an increase in cache relative to some base point. This is not something that applies specifically to PC CPUs or server CPUs... This is just a fundamental effect of the fact that memory will never run as fast as CPUs as long as the Earth revolves around the sun. 1 MB of L2 cache to a single XCPU core will run X fast; adding a second core without increasing the L2 WILL slow down per-core performance, because you're increasing the probability of cache misses by having two cores contend for the same cache lines. Adding a third core WILL slow it down further, though the impact will be smaller than the second. Adding a 4th, 5th, 6th: all the same story.

If you were capable of reading, the point I made was not about how much cache PC CPUs need (since when is Itanium a PC CPU?), but about how the quantity of cache relates to the throughput that a CPU is capable of. Did you ever wonder why the POWER MCMs (given that you say POWER is so similar) have 144 MB of cache for their measly 16 hardware threads?
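
(For scale, taking those figures at face value: 144 MB / 16 threads = 9 MB per thread, versus the 96 KB per thread that the Niagara numbers above work out to.)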
 