New ITAGAKI interview touches on 360 and PS3 comparison

aaronspink said:
Nor did I say it was x86. The actual instruction set being used is a secondary issue to the overarching programming model. The whole of the mainstream of the computing industry is moving towards a model that is roughly the same as the x360 which will reap significant rewards. The network processor industry is the closest thing to cell and has been riddled with frustrations, bugs, and performance issues due to the complex programming models.

The engineers from SCE and Toshiba sat down with IBM during the early stages of Cell's development. They were well aware of multi-core architectures, since IBM had presented its own models for them to choose from. It was only after scrutinizing all the available options that they decided to go with the master-slave approach. So saying that Cell has been riddled with bugs is simply wrong; it was a deliberate decision aimed at the best performance/ease-of-development ratio.
 
ihamoitc2005 said:
The more processing cores you have working on small bits of data and wanting direct access, the more the brain-damaged guy turns out to be the smartest one, no? :)

Nope. The DMA engines are either coherent or they aren't. In the Cell design they are coherent, which implies that they sit a little ahead of the L2 controller in the pipeline and snoop the L2 before queuing in the memory controller.

If they are incoherent, the programmer takes on the added complexity of making sure that anything which needs to be DMA'd has been evicted from the caches first.

The primary issue is maintaining coherence for accesses to main memory from the SPEs, and that complexity really doesn't change whether you are using a DMA copy engine or direct access.
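To illustrate the incoherent case: a minimal sketch (mine, assuming a PowerPC-style host core and a hypothetical dma_start() driver call) of the housekeeping the programmer inherits before handing a buffer to a DMA engine that doesn't snoop the caches.

Code:
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 128  /* PPE cache lines are 128 bytes */

/* Hypothetical driver call that kicks off a non-coherent DMA transfer. */
extern void dma_start(const void *src, uint64_t dst_ea, size_t len);

/* Flush every cache line covering buf back to memory (PowerPC dcbf),
 * then make sure the flushes have completed before the engine reads it. */
static void flush_for_dma(const void *buf, size_t len)
{
    const char *p   = (const char *)((uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1));
    const char *end = (const char *)buf + len;

    for (; p < end; p += CACHE_LINE)
        __asm__ volatile("dcbf 0,%0" : : "r"(p) : "memory");

    __asm__ volatile("sync" ::: "memory");  /* order the flushes before the DMA kick */
}

void send_buffer(const void *buf, uint64_t dst_ea, size_t len)
{
    flush_for_dma(buf, len);     /* only needed because the engine doesn't snoop */
    dma_start(buf, dst_ea, len);
}

With a coherent engine, as in Cell, none of flush_for_dma() exists and the hardware snoops the L2 for you.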

Aaron Spink
speaking for myself inc.
 
aaronspink said:
To my knowledge, neither Intel nor AMD have published any roadmaps or intentions to develop anything like cell. From their published roadmaps, both Intel and AMD appear to be going down the path of multiple homogeneous processors on a die.

I don't see a tri-core CPU with identical cores sharing a single pool of L2 cache coming to market anytime soon. Cell's master-slave approach is still the best one for processors with more than two cores.
 
aaronspink said:
Given the choice between an architecture with DMA movement engines or DMA movement engines along with direct access, 9 out of 10 good programmers will prefer the latter. The 10th was just in a car accident and suffered massive brain damage.

Aaron Spink
speaking for myself inc.

Given the choice between co-processors with lockable cache and simple control over DMA movement (you really ought to play around with low-level PlayStation 2 DMAC programming to enjoy it :)) and full-blown INDEPENDENT processors with ample DMA control and flexibility... the decision is split ;).

There are many developers out there who loved the PS2's VUs, did not like the PSP's VFPU (yeah Faf, I know... a crime ;)) because of its simple, dependent co-processor nature, and are in love with the SPEs because they see them mostly as an evolution of the VUs (with some minor drawbacks, of course ;)).

Also, what programmers would really like is a single-core, low-latency, 5 GHz processor with a bad-ass branch predictor, 2-way SMT, 4-way issue, OOOe, 256 KB of L1 cache, 2 MB of L2 cache, 16 MB of L3 cache and the best FPU in the world (the FPU+VFPU combo the PSP uses :)), along with an uber-optimizing compiler that converts all their single-threaded code into perfectly optimized multi-threaded code :).

Seriously, even if the 10th guy prefers the former approach, if his code runs faster in the end... well :p.
 
Nemo80 said:
well, somebody calling the SPEs "DSPs" shouldn't be taken seriously anyway.

The SPEs have a significant similarity to various DSPs that are available, and the overall programming model is also very similar to some DSP plug-in boards that are available. IMNSHO, it is actually the most apt description of the SPEs available.

I'll put my knowledge of computer architecture, system design, semiconductor physics, and VLSI design against yours any day of the week.

Aaron Spink
speaking for myself inc.
 
inefficient said:
In the Cell programming model you would set up a DMA on the SPE and then let it execute/read/write in its own private area.

Please elaborate. How does the SPE know to look for jobs? How does the SPE actually receive the job? How does the PPE know the job is complete? Where does the PPE retrieve the results from?

Saying "you would set up a DMA" is not enough detail. You don't know how long the DMA will take. You don't know how long the SPE will take to execute the job.

How do you synchronize the execution of the SPE with the PPE?
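Presumably the answer is something like the sketch below (the Cell SDK's spu_mfcio.h intrinsics, with a made-up job layout): the PPE mails the SPE the address of a job descriptor, the SPE DMAs the job in, works on it in local store, DMAs the result back out and mails back a "done" flag. But "presumably" is exactly my point -- every one of those synchronization steps is explicit and on the programmer.

Code:
/* SPE-side job loop (sketch). The PPE mails in the 64-bit effective
 * address of a job descriptor, then waits on our outbound mailbox. */
#include <spu_mfcio.h>
#include <stdint.h>

#define TAG 3

typedef struct {              /* made-up job descriptor, padded to 128 bytes */
    uint64_t in_ea;           /* effective address of input data             */
    uint64_t out_ea;          /* effective address of result buffer          */
    uint32_t size;            /* bytes, multiple of 16                       */
    uint8_t  pad[108];
} job_t;

static job_t job __attribute__((aligned(128)));
static char  buf[16384] __attribute__((aligned(128)));

static void dma_wait(void)
{
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();          /* block until tagged DMAs complete */
}

int main(void)
{
    for (;;) {
        /* 1. Block until the PPE mails us the job descriptor's address
         *    (two 32-bit mailbox entries: high word, then low word).   */
        uint64_t hi = spu_read_in_mbox();
        uint64_t lo = spu_read_in_mbox();
        uint64_t job_ea = (hi << 32) | lo;

        /* 2. Pull the descriptor, then the input data, into local store. */
        mfc_get(&job, job_ea, sizeof(job), TAG, 0, 0);
        dma_wait();
        mfc_get(buf, job.in_ea, job.size, TAG, 0, 0);
        dma_wait();

        /* 3. ...do the actual work on buf, entirely in local store... */

        /* 4. Push the result back out to main memory. */
        mfc_put(buf, job.out_ea, job.size, TAG, 0, 0);
        dma_wait();

        /* 5. Tell the PPE this job is finished. */
        spu_write_out_mbox(1);
    }
}

On the PPE side you have the matching mailbox write and then a poll (or interrupt) on the SPE's outbound mailbox before you touch the results, and you still don't know how long any of it will take.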
 
aaronspink said:
I'll put my knowledge of computer architecture, system design, semiconductor physics, and VLSI design against yours any day of the week.

Aaron Spink
speaking for myself inc.

No sorry, I usually don't argue with college boys. :)
 
darkblu said:
why is it that i get the feeling you equate 'correct' with 'easy'? i.e. why does the fact that it's relatively easy to write smp multithreaded code (which, btw, many gamedevs i've worked with would outright disagree with) somehow translate to writing correct code?

I do not equate "correct" with "easy"; I only say that there tends to be a relationship between the two.

If it's harder to write correct code, then it tends to be harder to write fast correct code, since all fast code must be correct code to be useful.
If it's easier to write correct code, it also tends to be easier to write fast correct code.

The point is, the less time you need to ensure your code is correct, the more time you can spend on optimization.

first, you step on the premise that correct smp multithreading is the easiest, most natural and, i get the feeling, magically efficient form of concurrency.

I never said it was the most efficient, merely the easiest and most natural. This is simply because it is the most general form of multithreading -- you have N threads, they execute the same kind of opcodes, they share your entire address space, memory ordering is strict, and memory coherency between the threads is enforced.

Every multithreaded design pattern can be constructed from this basic foundation, which is why it's taught in schools and why it's the dominant form on general purpose CPUs.

Each special purpose optimization you add to this (weakly ordered memory model, NUMA, dropping coherency, asymmetry of the threads, private thread address spaces, etc) can improve peak performance, but at the cost of adding things to worry about.
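To make that concrete (my sketch, plain C and pthreads, nothing exotic): this is the whole mental model the general form asks of you -- one shared address space, identical threads, and a lock.

Code:
/* Minimal SMP-style multithreading: N identical threads, one shared
 * address space, coherence and ordering handled by the hardware plus
 * a mutex. Build with: cc -pthread counter.c */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    1000000

static long counter = 0;                       /* shared, visible to every thread */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);             /* correctness is just "take the lock" */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    printf("counter = %ld\n", counter);        /* NTHREADS * ITERS */
    return 0;
}

Every one of the specializations listed above -- weak ordering, NUMA, dropped coherency, private address spaces -- takes something away from this picture in exchange for performance, and that's exactly where the extra things to worry about come from.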
 
Nemo80 said:
well, somebody calling the SPEs "DSPs" shouldn't be taken seriously anyway.

For shits and giggles read this from 2001, and notice how the structure with one host processor and a bunch of DSP cores with their own SRAM and shared memory resembles the structure of CELL.

As for

Nemo80 said:
aaronspink said:
I'll put my knowledge of computer architecture, system design, semiconductor physics, and VLSI design against yours any day of the week.
No sorry, I usually don't argue with college boys.
You're pure comedy gold, do you know that?

Involuntarily so, but pure gold nevertheless.

Cheers
Gubbi
 
aaronspink said:
The primary issue is maintaining coherence for accesses to main memory from the SPEs, and that complexity really doesn't change whether you are using a DMA copy engine or direct access.

No, the primary issue is limiting the negative effect of memory latency on real-world performance. The more processing units there are, the greater the marginal cost of direct access.

It's no surprise that there's 1.75MB of local store across the SPEs and another 512KB of L2 for the dual-threaded PPE. That works out to 256KB per "thread". Plus it's hooked up to very low-latency XDR.

XeCPU has 6 threads sharing just 1MB of L2, an average of about 170KB/thread, in front of higher-latency GDDR3, which sounds like cache misses waiting to happen. I would be very surprised if anyone got decent performance using all the cores.
 
ihamoitc2005 said:
It's no surprise that there's 1.75MB of local store across the SPEs and another 512KB of L2 for the dual-threaded PPE. That works out to 256KB per "thread". Plus it's hooked up to very low-latency XDR.

XeCPU has 6 threads sharing just 1MB of L2, an average of about 170KB/thread, in front of higher-latency GDDR3, which sounds like cache misses waiting to happen. I would be very surprised if anyone got decent performance using all the cores.

True, if each SPE is doing something completely independent.

However, since the L2 is shared among all three cores, for large, mostly read-only data structures you only need one copy, in the L2, whereas for the SPEs you'd need a copy in each local store.
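To be concrete (my sketch; the chunk size and table address are made up): on an SPE, "using" a big read-only table that lives in main memory means DMAing your own working copy of it, piece by piece, into your 256KB local store -- and every SPE that needs the table does the same.

Code:
/* SPE side: stream a large read-only table from main memory into local
 * store in fixed-size chunks. Every SPE using the table does this
 * independently, so each holds its own copy of whatever chunk it needs. */
#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK  (16 * 1024)              /* bytes per DMA, multiple of 16 */
#define TAG    5

static char chunk_ls[CHUNK] __attribute__((aligned(128)));

/* Fetch chunk 'index' of the table starting at effective address
 * 'table_ea' into local store and return a pointer to it. */
const char *fetch_chunk(uint64_t table_ea, unsigned index)
{
    mfc_get(chunk_ls, table_ea + (uint64_t)index * CHUNK, CHUNK, TAG, 0, 0);
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();           /* wait for the transfer to land */
    return chunk_ls;
}

On Xenon, by contrast, all six hardware threads simply hit the one copy sitting in the shared L2.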

Cheers
Gubbi
 
ihamoitc2005 said:
XeCPU has 6 threads sharing just 1MB of L2, an average of about 170KB/thread, in front of higher-latency GDDR3, which sounds like cache misses waiting to happen. I would be very surprised if anyone got decent performance using all the cores.

Each thread won't necessarily be the same size. You will have small threads and large threads. E.g. with an SPE you get 256K. That is it. So if you have two "applets" -- one 128K and the other 384K -- you are in trouble: the 384K applet does not fit and the 128K applet leaves memory unused.

With Xenon you have more flexibility in this regard. Further, cores can easily share and work on the same data. There is also a danger in equating threads with cores; some developers have said that the second thread on a core will frequently be doing work of the same nature as the first.

The two philosophies are different. What I have noticed is that there are clear lines drawn... e.g. you turn your nose up at the idea of Xenon getting decent performance, yet the same thing has been said about CELL.

As a developer told me, multithreading is hard work, and each processor offers its own twist on how to attack that problem. There will be areas where each excels and areas where each falls down, and it won't be easy on either one.

The search feature returns some good information (although far too much to read in even a week, and it is cluttered with a lot of trash). Needless to say, some devs have spoken up, and the CELL model does pose some hurdles and problems (especially around which data it works well with; and if the PPE is running your OS and delegating to the SPEs, that does not leave it a lot of extra room to work with).

I think this thread shows that there are different opinions, and even more that each architecture will suit certain chores and will favor different programmers. CELL is very nice for a PS2 dev who has had to work hard with the EE. Xenon is similar to the model PCs have moved toward, has more research behind it, and will appeal to PC developers.

Ultimately it will come down to tools. Your AAA dev houses have the money, time, people and tools to crack the case. The real issue is that 98% of your developers are NOT AAA guys. They are good--and they make great games!--but they don't have the advantages of the big guys, with their 250-400 member teams and 5 development studios sharing information, code, and resources. Developers need tools to help them get the most out of BOTH platforms.

So whoever makes their platform most approachable to the most developers, and allows them to get the most performance out of their work, wins a major victory. OBVIOUSLY that answer won't be the same for every dev or every title. The work/power tradeoff may pay off in one title but not in another.

So tools and architecture are both important factors, more so than "peak" performance. Which reminds me of a similar scenario on the PC: similar, but not identical, architectures can return results that contradict the theoretical performance of a chip. The quote below is a good example of this:

Another way to look at this comparison of flops is to look at integer add latencies on the Pentium 4 vs. the Athlon 64. The Pentium 4 has two double pumped ALUs, each capable of performing two add operations per clock, that's a total of 4 add operations per clock; so we could say that a 3.8GHz Pentium 4 can perform 15.2 billion operations per second. The Athlon 64 has three ALUs each capable of executing an add every clock; so a 2.8GHz Athlon 64 can perform 8.4 billion operations per second. By this silly console marketing logic, the Pentium 4 would be almost twice as fast as the Athlon 64, and a multi-core Pentium 4 would be faster than a multi-core Athlon 64. Any AnandTech reader should know that's hardly the case. No code is composed entirely of add instructions, and even if it were, eventually the Pentium 4 and Athlon 64 will have to go out to main memory for data, and when they do, the Athlon 64 has a much lower latency access to memory than the P4. In the end, despite what these horribly concocted numbers may lead you to believe, they say absolutely nothing about performance. The exact same situation exists with the CPUs of the next-generation consoles; don't fall for it.
 
Gubbi said:
True, if each SPE is doing something completely independent.

However, since the L2 is shared among all three cores, for large, mostly read-only, datastructures you only have one copy, - in the L2. Whereas for the SPEs you'd need a copy in each local store.

Cheers
Gubbi

Except that all the SPEs can work in parallel or in sequence and can read from each other's local stores, so redundant data isn't just unnecessary, it's silly.
 
Shifty Geezer said:
Dunno. Large programs are broken into smaller procedures or code segments that make up your engine, and these get pieced together to make the whole program. 256KB of LS for data and code means your program isn't going to be totally massive, and I would guess much smaller than 256KB. Heck, 200KB of assembler isn't a pretty thought! You can achieve a lot in 32KB (whole 8-bit games even. Imagine how fast the original Elite could run when written for an SPE :oops: ) and I'd expect a process could be broken into manageable and efficient chunks. It seems more a matter of good design being needed than mystical programming powers. And note SPEs don't need assembler, so the point's moot anyway. Unless you're still developing for PS2!
(I'm only debating this point based on your original statement about fine-tuned control over memory access and use of assembler.)

Of course large programs are broken into smaller chunks--this has been true for how many decades now? The problem is that given a large code base, no one wants to be writing assembler level code for all the little bits. Instead, you code first, then you profile, then you optimize, and it's during the optimization stage that you may prefer to drop down into assembler/C/machine hacks and want fine-grained control over memory.

When you code, you code for correctness first. Things that make that difficult end up jeopardizing both your profiling and optimization stages.

.Sis
 
flexibility

Acert93 said:
Each thread won't necessarily be the same size. You will have small threads and large threads. E.g. with an SPE you get 256K. That is it. So if you have two "applets" -- one 128K and the other 384K -- you are in trouble: the 384K applet does not fit and the 128K applet leaves memory unused.

Using your method of comparison, if one thread consumed 384K on XeCPU, you have just 5 threads left with 128K/thread.

On the other hand, if one thread consumed 384K on CELL (out of the PPE's 512KB L2), there is still one PPE thread with 128K and 7 SPE threads with 256K each.
 
Acert93 said:
Each thread won't necessarily be the same size. You will have small threads and large threads. E.g. with an SPE you get 256K. That is it. So if you have two "applets" -- one 128K and the other 384K -- you are in trouble: the 384K applet does not fit and the 128K applet leaves memory unused.

With Xenon you have more flexibility in this regard. Further, cores can easily share and work on the same data. There is also a danger in equating threads with cores; some developers have said that the second thread on a core will frequently be doing work of the same nature as the first.

The two philosophies are different. What I have noticed is that there are clear lines drawn... e.g. you turn your nose up at the idea of Xenon getting decent performance, yet the same thing has been said about CELL.

Glad to see someone looking for the reasons why it will work instead of the reasons why it won't. Does everyone really think that MS and IBM couldn't have figured out whether XeCPU would be a 'cache mess' before they went to final hardware (or even finished the initial design)?

From what I've read on this forum, the whole 360 design seems pretty elegant and well thought out. All the parts (and their connections) seem to be "just right" for each other.

J
 
expletive said:
From what I've read on this forum, the whole 360 design seems pretty elegant and well thought out. All the parts (and their connections) seem to be "just right" for each other.

J

All consoles are designed to have as little spare capacity as possible, because of cost. But because a console is a closed box, developers don't need as much breathing room as on PCs. On the whole, consoles are therefore much more elegant in design than PCs.
 
expletive said:
From what I've read on this forum, the whole 360 design seems pretty elegant and well thought out. All the parts (and their connections) seem to be "just right" for each other.

J
Couldn't agree more--but I'd extend it to cover the CELL design as well. Different philosophy in implementation, but still very elegant IMO, just in a different way. I've never been terribly excited by console architecture design and this last generation seemed particularly boring. The next offerings by Sony and MS are very interesting.

.Sis
 
ihamoitc2005 said:
Except that all the SPEs can work in parallel or in sequence and can read from each other's local stores, so redundant data isn't just unnecessary, it's silly.

They can DMA from other SPEs' local stores into their own. That is still a copy, and hence still redundant.
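For illustration (a sketch; the PPE would have to hand over the other SPE's local-store effective address at thread setup): "reading" another SPE's local store is still just an mfc_get into your own, which is to say a copy.

Code:
/* SPE side: pull 'size' bytes out of another SPE's local store, whose
 * base effective address (neighbor_ls_ea) the PPE gave us beforehand.
 * The data ends up duplicated in our own local store. */
#include <spu_mfcio.h>
#include <stdint.h>

#define TAG 7

static char mine[4096] __attribute__((aligned(128)));

void copy_from_neighbor(uint64_t neighbor_ls_ea, uint32_t offset, uint32_t size)
{
    mfc_get(mine, neighbor_ls_ea + offset, size, TAG, 0, 0);
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();   /* the "read" is done; so is the copy */
}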

Cheers
Gubbi
 
cache-mess

expletive said:
Does everyone really think that MS and IBM couldn't have figured out whether XeCPU would be a 'cache mess' before they went to final hardware (or even finished the initial design)?

Yes, this can happen, because cost estimates can be wrong and many design decisions aren't made only by engineers, but also by accountants and marketers. A last-minute push to reduce cost can therefore compromise an otherwise balanced design. Think of the missing HD: originally standard, now a $100 option. It is unfortunate, but that's how it works.
 