ISSCC 2005

Alejux said:
I'm very curious to see what they'll come up with, and I confess I'll be very disappointed if their programming model is just some APU-spawning libraries with no automatic protection against the problems discussed in this thread.

what problems? all we've been talking about so far is personal architectural preferences towards or against this or that paradigm. until people start getting their hands on the actual thingie (and its sdk), and actually accumulating some experience, all its "weaknesses", or "strengths" for that matter, are no more than pure imagination. i mean, the kid has not quite been born yet and some guys have already declared it a poor university student!

ps: just for the record, i personally don't find a single con about cell from all the tidbits of info i know about it - does that mean it will be the second coming of christ in silicon?
 
darkblu said:
i personally don't find a single con about cell from all the tidbits of info i know about it - does that mean it will be the second coming of christ in silicon?
What was the first coming of christ in silicon? The EE? ;)

The aid Sony and IBM can give to developers is a big question mark right now. Odds are, IMO, the aid will be enough for launch but leave a lot of room for improvement.

On a personal level, I'm kinda bummed that they didn't really do anything about alleviating memory latency problems. All I've seen so far indicates the architecture punishes you for doing anything not cache-friendly. That is not a solution; that is a presumption that everyone will find their own solutions. But maybe my perception is off.
 
Inane_Dork said:
darkblu said:
i personally don't find a single con about cell from all the tidbits of info i know about it - does that mean it will be the second coming of christ in silicon?
What was the first coming of christ in silicon? The EE? ;)

z80

The aid Sony and IBM can give to developers is a big question mark right now. Odds are, IMO, the aid will be enough for launch but leave a lot of room for improvement.

quite possibly.

On a personal level, I'm kinda bummed that they didn't really do anything about alleviating memory latency problems. All I've seen so far indicates the architecture punishes you for doing anything not cache-friendly. That is not a solution; that is a presumption that everyone will find their own solutions. But maybe my perception is off.

are you saying that you actually know of a solution to the fundamental memory latency problem and you're disappointed by its absence in STI's design?

cache is not much of a solution to the pu-ram gap either. if it were, we wouldn't be having cell today. practically speaking, cell gives you the best available solution known to mankind: you have a traditional non-streaming pu (with a traditional cache) and a streaming pu* tract, all fitted well together. the former part handles the non-streaming class of tasks, the latter handles anything** of streaming nature.

again, if you know of a better solution to the widening chasm between pu and memory, please share it with the rest of us.

* which by definition would not benefit from any cache but from good access-prediction heuristics; for the time being, the best predictor of the access pattern of any given streaming algorithm is the pu on this side of the keyboard.

** 'anything' here being very much vector-algebra centric.
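
to illustrate that last footnote: a minimal sketch in plain c++, nothing cell-specific about it - the hardware cannot guess the access pattern of a streaming loop, but the programmer knows it exactly, so he can schedule the fetches himself (gcc's __builtin_prefetch used here purely as an example of such a programmer-issued hint):

Code:
/* the access pattern is known to the programmer, not to the cache, so he
   schedules the fetches a fixed distance ahead of the computation */
void scale(float *dst, const float *src, int n, float k)
{
    const int ahead = 64;                        /* prefetch distance, in elements */
    for (int i = 0; i < n; ++i) {
        if (i + ahead < n)
            __builtin_prefetch(&src[i + ahead]); /* hide part of the memory latency */
        dst[i] = src[i] * k;
    }
}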
 
Well here's my crazy and uneducated idea about managing the Cell's SPUs in a multi-tasking environment:

----------------------------

The issuing of apulets to SPUs can be managed by a bookkeeping system in the OS, whereby you reserve processing time on the SPUs with the OS, coupled with an estimate of how long your processing will take. This estimate is determined by detailed benchmarking and profiling of your code, plus perhaps a bit more time as a safety net. At the appropriate time, your processing is issued; if it takes too long, it's stopped, the next apulet in line is issued, and your processing is automatically re-reserved in the bookkeeping system. Perhaps the OS could send your process a warning, a request for a new time estimate, or something else to help re-reserve the apulet more efficiently.

More complex arrangements of SPUs, such as pipelines or SPUs that interact with the GPU and other external devices, would also be indicated to the OS so it can handle them appropriately (e.g. treat multiple SPUs as a unified block). All together, with this information the OS should be able to effectively balance the processing time of apulets with the need to multi-task, and thus minimise the SPU context-switching problem.

The most difficult part is probably estimating processing time, because you will have to create a formula that factors in your benchmarking numbers, plus the clock speed of the SPU and any slight hardware differences of that SPU. Totally different/custom SPUs are not an issue, of course, because you wouldn't run on them anyway; they're designed for other types of processing and probably have a different ISA too.
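
To make it more concrete, here's a very rough C++ sketch of the bookkeeping part (every name in it is made up by me, it's not any real Cell or OS interface):

Code:
// Rough sketch of the reservation bookkeeping described above.
// All names are invented for illustration -- not a real Cell/OS API.
#include <cstdint>
#include <deque>
#include <optional>

struct Reservation {
    int      apulet_id;        // which apulet wants SPU time
    uint64_t estimate_cycles;  // profiled estimate plus a safety margin
    uint64_t started_at;       // cycle count when it was actually issued
};

class SpuBookkeeper {
public:
    void reserve(int apulet_id, uint64_t estimate_cycles) {
        pending_.push_back({apulet_id, estimate_cycles, 0});
    }

    void tick(uint64_t now) {
        // if the running apulet blew past its own estimate, stop it and
        // automatically re-reserve it with a fatter estimate
        if (running_ && now - running_->started_at > running_->estimate_cycles) {
            Reservation overrun = *running_;
            overrun.estimate_cycles += overrun.estimate_cycles / 2;  // grow the estimate
            pending_.push_back(overrun);
            running_.reset();
        }
        // issue the next apulet in line to the now-free SPU
        if (!running_ && !pending_.empty()) {
            running_ = pending_.front();
            pending_.pop_front();
            running_->started_at = now;
            // issue_to_spu(running_->apulet_id);  // hypothetical hook into the OS
        }
    }

private:
    std::deque<Reservation>    pending_;
    std::optional<Reservation> running_;
};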

----------------------------

So am I an idiot, genius or neither?
 
darkblu said:
are you sayin that you actually know of a solution to the fundamental mem latencies problem and you re disappointed by its absence in STI's design?
Other than building better RAM, not off-hand. I just recall reading that they had addressed it, but it's nowhere near addressed yet. We're just trying to perfect avoiding it. Long-term, we'll hopefully meet the problem more head-on.

cache is not much of a solution to the pu-ram gap either.
It's better than introducing a handful of really fast processors that only aggravate what is already the problem.

Just teasing. ;)

practically speaking, cell gives you the best available solution known to mankind: you have a traditional non-streaming pu (with a traditional cache) and a streaming pu* tract, all fitted well together. the former part handles the non-streaming class of tasks, the latter handles anything** of streaming nature.
Well, that's kind of my point. The main addition or innovation more or less presumes you won't even try to implement anything already suffering from memory latency. It's not that cut and dried, of course.

I should mention that software is part of the solution every way I look at it. The issue is building hardware to aid that development as much as possible. And I don't see that Cell really takes a step forward in that with respect to the power it introduces.
 
Look at the EE: two Vector Units (each with 32x128-bit registers... a whopping 40 KB of Local Storage for instructions and data if you add both VUs) and a SIMD-enhanced, single-threaded RISC core (the R5900i) with 16 KB of Instruction Cache, 8 KB of Data Cache and 16 KB of SPRAM.

Look at the CELL chip presented at ISSCC: 8 independent Vector Processors (each with a TLB-enabled DMA engine, 128x128-bit registers and 256 KB of Local Storage for instructions and data) and a Multi-Threaded (2-way SMT) core with dedicated Vector Processing extensions (VMX, for Integer and Floating-Point processing), 32 KB of L1 Instruction Cache, 32 KB of L1 Data Cache and 512 KB of L2 cache. Quite likely the SPEs also have some access to the PPE's L2 cache.

Also, from what we have seen, XDR should have lower latency than Direct RDRAM (in which data, addresses and control were all multiplexed on the same shared bus): XDR was not chosen just for the higher bandwidth it provides, but also for its lower latency compared to its predecessor (an improvement on two fronts).

I think they are coming towards developers in lots of ways: they saw one of the biggest shortcomings of the EE (its RISC core) and observed how it pulled the system down. They made the Vector Processors self-feeding and gave them a MUCH larger Local Storage (we do not have situations like VU0, which offered only 4 KB of Instruction Memory and 4 KB of Data Memory, and whose VIF0 did not even support double buffering like VIF1 did). This way they do not need to wait for the central "managing" processor as much, and the central processor does not have to waste tons of cycles feeding each Vector Processor every time one needs data or when they are sharing data with each other.
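
To show what I mean by "self-feeding", here is a rough sketch of a double-buffered loop running on one of the Vector Processors. The dma_get()/dma_wait() calls are placeholder names I invented (we have not seen the real SDK's interface yet), so treat this as pseudo-C++:

Code:
#include <cstdint>

// Placeholder prototypes -- invented names, not the real SDK:
void dma_get(void *ls_dst, uint64_t ea_src, int size, int tag); // main RAM -> Local Storage
void dma_wait(int tag);                                         // block until that transfer completes
void compute(char *data, int size);                             // your vector kernel

const int CHUNK = 16 * 1024;             // bytes per transfer into Local Storage
alignas(128) char buf[2][CHUNK];         // two LS buffers: compute on one, fetch into the other

void process_stream(uint64_t src_ea, int chunks)
{
    int cur = 0;
    dma_get(buf[cur], src_ea, CHUNK, cur);                        // prime the first buffer
    for (int i = 0; i < chunks; ++i) {
        int next = cur ^ 1;
        if (i + 1 < chunks)                                       // kick off the next fetch early...
            dma_get(buf[next], src_ea + (uint64_t)(i + 1) * CHUNK, CHUNK, next);
        dma_wait(cur);                                            // ...then wait only for the current one
        compute(buf[cur], CHUNK);                                 // computation overlaps the pending DMA
        cur = next;
    }
}

The point is that no central "managing" processor appears anywhere in that loop: the Vector Processor fetches its own data and just keeps going.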

When they looked at the Vector Processors and the idea for their role in CELL, they realized that programmers would be helped by a good compiler that would do loop-unrolling for them and do it well: VCL taught some lessons there... they saw that the register file was so tight that VCL often had to increase the length of the loops and still could not remove all the stalls, because there were not enough registers to take the existing VU code and unroll all its loops efficiently, so programmers had to do that by hand too. Thus the register file was increased in size by a factor of 4x.
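
In plain C++ terms (not VU assembly), the unrolling point looks like this: each unrolled copy of the loop body needs its own set of registers to stay independent, which is exactly what a 4x larger register file buys you:

Code:
// Four independent accumulators = four dependency chains in flight at once,
// so the latency of one multiply-add overlaps the others. The cost is that
// four times as many values are live in registers at the same time.
float dot(const float *a, const float *b, int n)   // n assumed to be a multiple of 4
{
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    for (int i = 0; i < n; i += 4) {               // 4x unrolled loop
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}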

What I am hearing developers say is that they see the PPE as one of CELL's saving graces, while the RISC core in the EE was seen as one of the worst parts of the architecture, the one with basically few or no redeeming qualities. Getting a compiler to do a decent job with the resources the R5900i offers is not easy... getting GCC 2.95 to optimize things well there... well... the result is that someone has to pick up tons of C/C++ code and manually convert it to optimized ASM if you want the EE to go decently fast and not pull everything down.

As far as Data Cache misses and latency for random memory accesses go, the R5900i is clearly on another planet: 8 KB of L1 D-Cache vs 32 KB of L1 D-Cache + 512 KB of L2 Cache... the winner here is quite clear IMHO.

What can this allow? It allows a C/C++ compiler to be optimized better for the PPE and lets that compiler do a good job (it has access to enough CPU resources to do so: a compiler and a CPU core should not be developed separately, but together, so they can complement each other), which in turn allows the programmer to spend less time re-writing code in PPE-optimized ASM. How many developers pursue the same strategy on Xbox as PlayStation developers do? How many trust ICC 8.x for the XCPU only as much as GCC is trusted for the RISC core in the EE?

DMA engines on the CELL processor now understand Virtual addresses, which is another help they have given to developers dealing with more complex OS's that have Virtual Memory support (Inane_Dork, get a Linux kit, install SPS2 and have fun managing DMA transfers... oh yes, you can do it, no doubt, it is not an impossible task that takes the courage of a heroic genius to solve... is it effortless and painless? I do not think so ;))
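
Roughly, the difference looks like this (the function names are invented just to show the contrast; the first case is the kind of bookkeeping an SPS2-style setup leaves to you):

Code:
#include <cstdint>
#include <cstddef>

// Invented placeholder prototypes, only to show the contrast:
uint32_t virt_to_phys(const void *p);                  // lookup you have to do yourself
void     dma_send_phys(uint32_t phys, size_t len);     // physically addressed DMA engine
void     dma_send_virt(const void *virt, size_t len);  // TLB-enabled DMA engine

const size_t PAGE = 4096;

// Physically addressed DMA: contiguous virtual memory is not contiguous in
// physical memory, so every transfer has to be translated and split at page
// boundaries (and the pages had better be pinned, too).
void send_old_way(const char *buf, size_t len)
{
    while (len > 0) {
        size_t in_page = PAGE - ((uintptr_t)buf & (PAGE - 1));
        size_t chunk   = len < in_page ? len : in_page;
        dma_send_phys(virt_to_phys(buf), chunk);
        buf += chunk;
        len -= chunk;
    }
}

// Virtually addressed (TLB-enabled) DMA: hand over the pointer and the length.
void send_cell_way(const char *buf, size_t len)
{
    dma_send_virt(buf, len);
}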

After the POWER4+ to POWER5 transition (everyone expected minor changes... yeah... add SMT, tweak things here and there, etc., but instead we got an incredible jump forward that exceeded people's expectations, because IBM was able to spot the shortcomings of the POWER4+ core and work around them, making seemingly minor changes but in all the right spots), I'd have even more faith in IBM's R&D labs being able to assist SCE and Toshiba in developing the PlayStation 3 SDK and the related tool-set.
 
Panajev, do you think my theory is correct: that if the CELL shown at ISSCC (or a newer revision of CELL going into PS3) were scaled down to 6.2 GFLOPS of performance, it would still kick the Emotion Engine's ass because of the increased efficiency, better design, lower-latency memory, etc.?
 
DudeMiester said:
Well here's my crazy and uneducated idea about managing the Cell's SPUs in a multi-tasking environment:

----------------------------

The issuing of apulets to SPUs can be managed by a bookkeeping system in the OS, whereby you reserve processing time on the SPUs with the OS, coupled with an estimate of how long your processing will take. This estimate is determined by detailed benchmarking and profiling of your code, plus perhaps a bit more time as a safety net. At the appropriate time, your processing is issued; if it takes too long, it's stopped, the next apulet in line is issued, and your processing is automatically re-reserved in the bookkeeping system. Perhaps the OS could send your process a warning, a request for a new time estimate, or something else to help re-reserve the apulet more efficiently.

More complex arrangements of SPUs, such as pipelines or SPUs that interact with the GPU and other external devices, would also be indicated to the OS so it can handle them appropriately (e.g. treat multiple SPUs as a unified block). All together, with this information the OS should be able to effectively balance the processing time of apulets with the need to multi-task, and thus minimise the SPU context-switching problem.

The most difficult part is probably estimating processing time, because you will have to create a formula that factors in your benchmarking numbers, plus the clock speed of the SPU and any slight hardware differences of that SPU. Totally different/custom SPUs are not an issue, of course, because you wouldn't run on them anyway; they're designed for other types of processing and probably have a different ISA too.

----------------------------

So am I an idiot, genius or neither?

dunno whether you're a genius but you're not an idiot ;)
 
Megadrive1988 said:
Panajev, do you think my theory is correct: that if the CELL shown at ISSCC (or a newer revision of CELL going into PS3) were scaled down to 6.2 GFLOPS of performance, it would still kick the Emotion Engine's ass because of the increased efficiency, better design, lower-latency memory, etc.?

With bigger memory/cache and better bandwidth -- how is that a fair comparison? Of course it'll kick ass. :D
 
Inane_Dork said:
So, in summary, you have completely and utterly proven that the PS3 is easier to tap than the PS2.

Yeah, yeah... save the sarcasm buddy ;).

I don't see how that debunks anything I've said, though.

Oh really?

The issue is building hardware to aid that development as much as possible. And I don't see that Cell really takes a step forward in that with respect to the power it introduces.

Uhm... we disagree here ;).
 
Panajev2001a said:
Inane_Dork said:
The issue is building hardware to aid that development as much as possible. And I don't see that Cell really takes a step forward in that with respect to the power it introduces.

Uhm... we disagree here ;).
I could have told you we disagreed there two weeks ago. I really don't see that the cores on a Cell chip will be much better utilized, considering the work they're doing. I am by no means the final word on the topic, but given what it will likely do in a game console, compared to what did those jobs before Cell, I don't see a true step forward in utilization.

Of course, it's not bad. It's pretty much as good as it's going to get with current limitations.
 