Cell Programming Examples presentation webcast going on now

I'm new to this, so please go easy on me.

What I'm getting:
The Cell's peak Gflop (218) performance (19 per SPE) would only (possibly) be achievable if the dev used all hand-written assembly code? Most devs don't do this, right?

Would the 9 Gflops per SPE be what you could expect out of compiled code, say from C++, or whatever devs normally use to code games?
 
Knoel said:
I'm new to this, so please go easy on me.

What I'm getting:
The Cell's peak Gflop (218) performance (19 per SPE) would only (possibly) be achievable if the dev used all hand-written assembly code? Most devs don't do this, right?

Would the 9 Gflops per SPE be what you could expect out of compiled code, say from C++, or whatever devs normally use to code games?

This test is very math-heavy; it's a job typically done by DSPs rather than general-purpose processors.

In real code, outside of trivial cases like vertex transforms, you don't get anywhere near peak flop ratings; there are far too many dependencies between instructions, and latencies you simply cannot hide even when you handcraft the code.
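
A toy illustration of what I mean, in plain C rather than anything Cell-specific (hypothetical code, just for the example):

/* With a single accumulator every multiply-add depends on the previous
   result, so the pipeline stalls for most of each iteration. */
float dot_serial(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];          /* each iteration waits on the last */
    return sum;
}

/* Independent accumulators give the pipeline unrelated work to issue
   while earlier multiply-adds are still in flight. */
float dot_unrolled(const float *a, const float *b, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)               /* leftover elements */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}

And that's the easy case; most real code doesn't decompose anywhere near this cleanly.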
 
ShootMyMonkey said:
I mean that the compiler will not produce good assembly; you'll have to do it yourself. I.e. take compiled C code for anything vs. hand-coded ASM, and the ASM will always outperform the compiled code (no matter what people want to believe)... and it seems that in the case of CELL, it's not a small percentage, either.
In other words, the new SPE compiler is called "ERP" or "nAo" or "Deano" etc., whichever the case may be.

Edit: wrong quote :oops:
 
ERP said:
In real code, outside of trivial cases like vertex transforms, you don't get anywhere near peak flop ratings; there are far too many dependencies between instructions, and latencies you simply cannot hide even when you handcraft the code.
I partially agree with you.
Since I've not put my hands on a PS3 devkit, I'm just 'extending' my working experience with the EE vector units, but I believe SPEs should be good (at achieving near-peak-rate performance figures) at much more than just vertex transforms.
What limits me most on the PS2 VUs in complex programs (with a lot of high-latency operations) is the lack of registers, as I try to pack more work into each (sometimes unrolled) loop in order to fill all the free instruction slots I have when I process a single computation instance per loop.
SPEs have higher-latency ops than the VUs, but they also have 4x more registers and 8x more memory.
My take on this subject is:
a) if you can stream it (and if you can't, you should try to rethink your routines in a streaming fashion; I found it can be done in a lot of cases)
b) if you can process more instances per loop
c) if you can avoid branches in your inner loop (via predication), or if you have enough instructions to schedule before a hint and a branch..

you're probably going to achieve very good performance (if not limited by other means such as memory bandwidth..); a rough sketch of (b) and (c) follows.
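
(A hypothetical sketch in plain scalar C, nothing Cell-specific; the real thing would use the SPE's SIMD compare and select plus far more aggressive unrolling, and I haven't touched a devkit yet.)

/* Two instances per iteration (b) and the inner-loop branch replaced by
   a select-style conditional (c). On an SPE the ?: should compile to a
   compare + select rather than a branch. Assumes n is even. */
void clamp_and_scale(float *dst, const float *src, float scale, float limit, int n)
{
    for (int i = 0; i < n; i += 2) {
        float x0 = src[i + 0] * scale;
        float x1 = src[i + 1] * scale;
        dst[i + 0] = (x0 > limit) ? limit : x0;   /* branch-free clamp */
        dst[i + 1] = (x1 > limit) ? limit : x1;
    }
}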

ciao,
Marco
 
Knoel said:
The Cell's peak Gflop (218) performance (19 per SPE)

A small correction: the 218 total number and the 19 per SPE in this task are not related. The former is a paper peak; the latter is achieved performance in this task.

Performance achieved, and how close you'll get to peaks, depends on the task, as with any chip.
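
For reference, the usual back-of-envelope behind a paper peak looks like this (my assumptions: a 3.2 GHz clock, 4-wide single-precision SIMD, and a fused multiply-add counted as two flops per lane; how the quoted 218 total is actually built up, e.g. whether and how the PPE is counted, may differ):

#include <stdio.h>

int main(void)
{
    /* Assumed figures, not an official breakdown of the 218 number. */
    const double clock_hz       = 3.2e9; /* clock speed              */
    const double simd_width     = 4.0;   /* single-precision lanes   */
    const double flops_per_lane = 2.0;   /* multiply + add per cycle */

    double per_spe = clock_hz * simd_width * flops_per_lane;
    printf("per SPE: %.1f GFLOPS, 8 SPEs: %.1f GFLOPS\n",
           per_spe / 1e9, 8.0 * per_spe / 1e9);   /* 25.6 and 204.8 */
    return 0;
}

That per-SPE figure is the paper number; the 19 per SPE in this task is what actually came out the other end.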
 
Knoel said:
What I'm getting:
The Cell's peak Gflop (218) performance (19 per SPE) would only (possibly) be achievable if the dev used all hand-written assembly code? Most devs don't do this, right?

Would the 9 Gflops per SPE be what you could expect out of compiled code, say from C++, or whatever devs normally use to code games?
No. You will never get the full peak 218 GFlops.

As an analogy, consider an engine. It weighs 100 kg and generates 100 HP. From that you could calculate that it can travel at a maximum of (forgetting my physics education and grabbing a random number!) 200 MPH. Only it can't travel at any speed, because it hasn't any wheels.

To make the engine able to move, you need to add wheels and a chassis, and make them strong enough to hold a 100 kg engine. So at the very least you might need to add another 100 kg of structure to the engine to make it able to move, and thus you decrease its top speed.

So whatever the peak performance of an engine, that's as a standalone part. To make it actually useful you have to make it part of a system, which adds restrictions to its performance.

Let's say we have two vehicles. One has an engine that generates 150 HP. Another generates 800 HP. Which is the faster vehicle? We've no way of telling. We know which has the more powerful engine, but for all we know the 150 HP motor is in a tiny sports car and the 800 HP motor is in a heavy tank or a large boat.

Peak figures are used to compare relative performance. If IBM produce one CPU with a peak of 100 GFlops and another with a peak of 200 GFlops, at a rough estimate you'd say the latter was the faster processor. But it still depends on where the processor is used, what it's optimised for, and so forth, so you can't categorically announce that the 200 GFlop processor will result in a faster machine than the 100 GFlop processor.

The other measure is real-world benchmarks, not theoretical figures. Just as PC labs test machines and actually record how well they perform in real tasks, to understand the performance of Cell or the XeCPU we really need actual software running.

This is something the mainstream press doesn't seem to understand, a fact exploited by marketeers who bandy around big numbers to give an impression of powerful systems, when those in the know appreciate that they're just numbers for individual parts, not the whole system.
 
Shifty Geezer said:
If IBM produce one CPU with a peak of 100 GFlops and another with a peak of 200 GFlops

This is a little close to a certain real world set of chips of interest today.. :devilish: ;)

I can already see people misinterpreting your comments.. heck, we've already had people asking here if that means Cell would be slower than the X360. I'm sure that's not the impression you're trying to create..

For the workload under discussion here, I don't expect the on-paper difference would be borne out so thoroughly, if at all.
 
Hopefully if they read that, they'll also read the following statement that explains that it's proof of nothing on its own!
 
The archived webcasts are up, including the Cell one, at the same link. It may be useful for clarifying points, or if you missed it live, etc.
 
nAo said:
ERP said:
In real code, outside of trivial cases like vertex transforms, you don't get anywhere near peak flop ratings; there are far too many dependencies between instructions, and latencies you simply cannot hide even when you handcraft the code.
I partially agree with you.
Since I've not put my hands on a PS3 devkit, I'm just 'extending' my working experience with the EE vector units, but I believe SPEs should be good (at achieving near-peak-rate performance figures) at much more than just vertex transforms.
What limits me most on the PS2 VUs in complex programs (with a lot of high-latency operations) is the lack of registers, as I try to pack more work into each (sometimes unrolled) loop in order to fill all the free instruction slots I have when I process a single computation instance per loop.
SPEs have higher-latency ops than the VUs, but they also have 4x more registers and 8x more memory.
My take on this subject is:
a) if you can stream it (and if you can't, you should try to rethink your routines in a streaming fashion; I found it can be done in a lot of cases)
b) if you can process more instances per loop
c) if you can avoid branches in your inner loop (via predication), or if you have enough instructions to schedule before a hint and a branch..

you're probably going to achieve very good performance (if not limited by other means such as memory bandwidth..)

ciao,
Marco


OK, I'll extend my definition: if you could do it on a VU, then getting good parallelism out of an SPE shouldn't be much of an issue. Although I could make a fair argument that most PS2 games don't manage to keep the VUs busy anywhere near peak because of wasted cycles in DMA transfer.

I think the difference is the types of task we're going to want to do on SPEs. In general the majority of them are going to be much more general-purpose than those done on the VUs (if this isn't the case, I'll speculate they'll be sitting idle a lot of the time), and the challenges are going to be more about partitioning data and moving that data around effectively than about performing ops of any sort on the data.
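
The shape I have in mind is the usual double-buffered streaming loop. A rough sketch (generic C; dma_get_async/dma_wait are hypothetical stand-ins for the real asynchronous transfer primitives, faked here with a plain memcpy so the sketch is self-contained):

#include <string.h>

enum { BLOCK = 1024 };              /* floats per block; a tunable knob */

/* Hypothetical stand-ins: a real implementation would kick the transfer
   off here asynchronously and only block inside dma_wait(). */
static void dma_get_async(float *local, const float *remote, int nfloats)
{
    memcpy(local, remote, nfloats * sizeof *local);
}
static void dma_wait(int buffer_id) { (void)buffer_id; }

/* Placeholder kernel: whatever maths actually runs per block. */
static void compute(float *block, int nfloats)
{
    for (int i = 0; i < nfloats; i++)
        block[i] *= 2.0f;
}

void process_stream(const float *src, int nblocks)
{
    static float buf[2][BLOCK];
    int cur = 0;

    if (nblocks <= 0)
        return;
    dma_get_async(buf[cur], src, BLOCK);              /* prefetch the first block */
    for (int b = 0; b < nblocks; b++) {
        int nxt = cur ^ 1;
        if (b + 1 < nblocks)                          /* start fetching the next block... */
            dma_get_async(buf[nxt], src + (size_t)(b + 1) * BLOCK, BLOCK);
        dma_wait(cur);                                /* ...make sure the current one has arrived */
        compute(buf[cur], BLOCK);                     /* with real async DMA this overlaps the transfer */
        cur = nxt;
    }
}

If the per-block compute is shorter than the per-block transfer, the SPE waits on the DMA engine no matter how tight the inner loop is, which is why I think the data movement is the harder problem.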

In the end it's all speculation until much wider development starts to happen and products get further along.
 
thank you for screenshots

Thank you for the screenshots; on Firefox I can't watch the movies.
I think I need an mplayer plugin or something (on Linux).
 
About assembler: while it might be much better for small and simple tasks if you spend enough time optimizing it, it scales really badly. As the program grows, a compiler starts to generate increasingly comparable code, until it eventually surpasses the hand-written assembly. And it takes much less time to write.

When you combine that with a stream architecture, you could probably get the largest speed increase by fiddling with the block size, the mix of static vs. dynamic structures, and the number of passes used, and then have the compiler do the rest.
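
A hypothetical sketch of what I mean (the names and the 256 are just illustrative): expose the block size as a compile-time constant, sweep a few values, and let the compiler do the unrolling and scheduling.

#define BLOCK_SIZE 256   /* the knob to fiddle with: try 64, 128, 256, 512... */

/* A fixed trip count plus restrict gives the compiler the freedom to
   unroll and schedule this itself, so retuning is a recompile rather
   than a rewrite. */
void scale_block(float *restrict dst, const float *restrict src, float k)
{
    for (int i = 0; i < BLOCK_SIZE; i++)
        dst[i] = src[i] * k;
}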
 