Cell Programming Examples presentation webcast going on now

What do you mean by handcoded? Any code obviously has to be written by the programmer.
I mean that the compiler will not produce good assembly; you'll have to do it yourself. Take compiled C code for anything vs. handcoded ASM: the ASM will always outperform the compiled code (no matter what people want to believe)... and in the case of CELL, it seems the margin isn't small, either.
 
ShootMyMonkey said:
What do you mean by handcoded? Any code obviously has to be written by the programmer.
I mean that the compiler will not produce good assembly; you'll have to do it yourself. Take compiled C code for anything vs. handcoded ASM: the ASM will always outperform the compiled code (no matter what people want to believe)... and in the case of CELL, it seems the margin isn't small, either.

Having listened to the presentation, this certainly was not mooted. If anything the opposite was: there's a PPE compiler and an SPE compiler, each spitting out different code. The only issue is that the compiler doesn't see all the cores as one big CPU; you have to manually work with the multiplicity of cores available. As in pretty much any parallel system, really.

You don't need to use assembly to tap the SPEs, not at all. I don't know how you got that from that slide either :?
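To make that concrete, here's roughly what "manually working with the cores" looks like in plain C on the PPE side, using the libspe2 interface. Just a sketch: the program handle name is made up and error checking is omitted.

Code:
#include <libspe2.h>
#include <pthread.h>

extern spe_program_handle_t fft_spu;   /* hypothetical embedded SPE ELF image */

static void *run_spe(void *arg)
{
    spe_context_ptr_t ctx = arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    /* Runs the loaded SPE program to completion on one SPE. */
    spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);
    return NULL;
}

int main(void)
{
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    spe_program_load(ctx, &fft_spu);

    /* One PPE-side thread per SPE context -- plain C, no assembly. */
    pthread_t tid;
    pthread_create(&tid, NULL, run_spe, ctx);
    pthread_join(tid, NULL);

    spe_context_destroy(ctx);
    return 0;
}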
 
Titanio said:
You don't need to use assembly to tap the SPEs, not at all. I don't know how you got that from that slide either :?
Though I've not listened to the webcast, according to this pic the 19 GFLOPS for an SPE is when you unroll all loops and assign variables to registers by hand, no? Well, even with a compiler it's still impressive: (1 SPE + XLC compiler) = (1 Pentium 4 (Xeon) + vectorizing compiler and newer MKL).
slide118al.jpg
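(In C terms, "unroll and assign registers by hand" means something like the following; a made-up dot product just for illustration:)

Code:
/* Straight loop: the compiler picks the schedule and register use. */
float dot(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled 4x with independent accumulators, so four multiplies can
   be in flight at once (assumes n is a multiple of 4). */
float dot_unrolled(const float *a, const float *b, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}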
 
I imagine that the Pentium 4 they are comparing themselves against has both cores running at 3.2 GHz? I haven't listened to the audio (due to the streaming problems on my end already mentioned), but that seems to be what I'm seeing there; am I mistaken?
 
one said:
Titanio said:
You don't need to use assembly to tap the SPEs, not at all. I don't know how you got that from that slide either :?
Though I've not listened to the webcast, according to this pic the 19 GFLOPS for an SPE is when you unroll all loops and assign variables to registers by hand, no? Well, even with a compiler it's still impressive: (1 SPE + XLC compiler) = (1 Pentium 4 (Xeon) + vectorizing compiler and newer MKL).
slide118al.jpg

Sorry, I see that now. They're comparing assembly code performance... not to code generated from a compiler, but to code using a general FFT library. There's no direct comparison there that gives us any insight into assembly vs compiler-generated code (though obviously I'd expect assembly to be better, I should hope compiler code is within decent range of it, as is usually the case).

The second bar in that graph, IIRC, is unoptimised code using a library, not compiler-generated SPE code without a library.

xbdestroya said:
I imagine that the Pentium 4 they are comparing themselves against has both cores running at 3.2 GHz? I haven't listened to the audio (due to the streaming problems on my end already mentioned), but that seems to be what I'm seeing there; am I mistaken?

The actual figures were from a 2.8GHz machine, but they scaled them up linearly to where a 3.2GHz machine should/would/might be.
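(For reference, linear scaling is just a clock-ratio multiply, which implicitly assumes the workload is entirely compute-bound:

Code:
scale = 3.2 GHz / 2.8 GHz ≈ 1.143
e.g. a 16.6 GFLOPS measurement at 2.8GHz would be quoted
     as ~19 GFLOPS at 3.2GHz

so if memory is the bottleneck anywhere, the scaled numbers will be optimistic.)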
 
They're comparing assembly code performance... not to code generated from a compiler, but to code using a general FFT library. There's no direct comparison there that gives us any insight into assembly vs compiler-generated code (though obviously I'd expect assembly to be better, I should hope compiler code is within decent range of it, as is usually the case).

The second bar in that graph, IIRC, is unoptimised code using a library, not compiler-generated SPE code without a library.
Did you... look at the chart and not the text above it or something? It says right there: an optimized compile of a general FFT library using the xlc compiler gets 9 GFLOPS vs. 19 for straight assembly. And the second and third bars show exactly that.

Well, like I was saying, it pretty much shows better than a 2:1 improvement using direct ASM over the compiled code, and that is not an insignificant fraction. With any vector ISA, no compiler comes even within 65% of ASM. Compilers that supposedly vectorize for you are really nothing more than a fantasy. There's no compiler that can do that in general, and I wouldn't be surprised if there never is.
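As a trivial illustration of why (hypothetical function, nothing to do with the FFT itself): the compiler often can't even prove vectorization is legal, let alone do it well.

Code:
/* As written, the compiler must assume dst and src might overlap,
   so it generally can't vectorize this loop safely. */
void scale(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f;
}

/* C99 restrict promises no overlap, which at least makes
   vectorization legal -- whether the compiler then emits *good*
   vector code is another question entirely. */
void scale_r(float * restrict dst, const float * restrict src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f;
}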
 
ShootMyMonkey said:
They're comparing assembly code performance... not to code generated from a compiler, but to code using a general FFT library. There's no direct comparison there that gives us any insight into assembly vs compiler-generated code (though obviously I'd expect assembly to be better, I should hope compiler code is within decent range of it, as is usually the case).

The second bar in that graph, IIRC, is unoptimised code using a library, not compiler-generated SPE code without a library.
Did you... look at the chart and not the text above it or something? It says right there: an optimized compile of a general FFT library using the xlc compiler gets 9 GFLOPS vs. 19 for straight assembly. And the second and third bars show exactly that.

Apologies, again; I was reading it as them taking a general FFT library, putting it on Cell, and comparing that to their own optimised code. I see where you're coming from now, though...

ShootMyMonkey said:
Well, like I was saying, it pretty much shows better than a 2:1 improvement using direct ASM over the compiled code, and that is not an insignificant fraction.

...But on this point, I think it may depend on what you're doing, no? I'm not sure this represents a general case that can be applied to all code. The compiler may come closer on some workloads than others. And of course there's always room for improvement as the compilers mature...
 
Titanio said:
ShootMyMonkey said:
They're comparing assembly code performance... not to code generated from a compiler, but to code using a general FFT library. There's no direct comparison there that gives us any insight into assembly vs compiler-generated code (though obviously I'd expect assembly to be better, I should hope compiler code is within decent range of it, as is usually the case).

The second bar in that graph, IIRC, is unoptimised code using a library, not compiler-generated SPE code without a library.
Did you... look at the chart and not the text above it or something? It says right there: an optimized compile of a general FFT library using the xlc compiler gets 9 GFLOPS vs. 19 for straight assembly. And the second and third bars show exactly that.

Apologies, again; I was reading it as them taking a general FFT library, putting it on Cell, and comparing that to their own optimised code. I see where you're coming from now, though...

ShootMyMonkey said:
Well, like I was saying, it pretty much shows better than a 2:1 improvement using direct ASM over the compiled code, and that is not an insignificant fraction.

...But on this point, I think it may depend on what you're doing, no? I'm not sure this represents a general case that can be applied to all code. The compiler may come closer on some workloads than others. And of course there's always room for improvement as the compilers mature...

The reason that compilers get close to hand-coded assembler performance on general code has more to do with the fact that cache misses generally dominate execution time.

Where you have local memories I would always expect assembler to pay off (although 2x is more than I would generally expect).

This sort of algorithm is really very specialised; it's very math-heavy and probably skews the potential improvements relative to the more common cases.

I don't imagine very many devs writing hundreds of K of assembler for all the tasks they want to run on SPEs.
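For anyone who hasn't seen the SPE memory model: the SPU only ever touches its 256K local store, and you pull data in explicitly with DMA. An SPU-side sketch, with made-up names and sizes, might look like this:

Code:
#include <spu_mfcio.h>

#define CHUNK 4096

/* 128-byte alignment for efficient DMA into the local store. */
static float buf[CHUNK / sizeof(float)] __attribute__((aligned(128)));

void fetch_and_process(unsigned long long ea)  /* ea = main-memory address */
{
    const unsigned int tag = 0;

    mfc_get(buf, ea, CHUNK, tag, 0, 0);  /* start DMA into local store */
    mfc_write_tag_mask(1 << tag);        /* select which tag to wait on */
    mfc_read_tag_status_all();           /* block until transfer completes */

    /* From here on, every access to buf is a fixed-latency local
       store access -- there is no cache, so no cache misses to hide. */
}

That's exactly why the usual "cache misses swamp everything" argument doesn't apply, and careful hand scheduling can actually show through.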
 
two said:
Noob question:

SPE = 19GFlops
7 SPE = 133GFlops

PS3 Cell = 218GFlops

You're comparing peak figures with measured performance on this workload.

ERP said:
The reason that compilers get close to hand-coded assembler performance on general code has more to do with the fact that cache misses generally dominate execution time.

Where you have local memories I would always expect assembler to pay off (although 2x is more than I would generally expect).

Very interesting insight, thanks.

ERP said:
I don't imagine very many devs writing hundreds of K of assembler for all the tasks they want to run on SPEs.

Agreed.
 
two said:
Noob question:

SPE = 19GFlops
7 SPE = 133GFlops

PS3 Cell = 218GFlops

So PPE = 85GFlops?
218 GFLOPS is the ultimate peak, if everything's being used to the full.
The listed numbers are results from processing this particular task, where the peak isn't reached because of memory access limitations etc.

These are real-world numbers from a particular task. Different tasks will have different impacts on how efficiently the Cell processor can number-crunch.

This is why the biggest CPU or GPU isn't necessarily the best, as the system is only as strong as its weakest part. A super-fast processor capable of a gazillion operations per second stuck with super-slow RAM won't be as fast as a CPU capable of billions of operations per second on super-fast RAM that can supply data quickly enough for the CPU to work on.

And this is why, to understand the relative power of the next-gen consoles, you must understand what the numbers the marketeers quote actually mean and how they might be restricted by other aspects of the console they haven't mentioned. Just looking at numbers on their own is virtually pointless.
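To put some arithmetic on that gap, using the commonly quoted single-precision peak for an SPE at 3.2GHz (so treat these as approximate):

Code:
per-SPE peak: 4 SIMD lanes x 2 flops (fused multiply-add) x 3.2 GHz
            = 25.6 GFLOPS
7 SPEs:       7 x 25.6 = 179.2 GFLOPS peak
vs. measured: 7 x 19   = 133 GFLOPS on this FFT

So even this heavily optimised code reaches roughly three quarters of what the SPEs can theoretically do, before the PPE is counted at all.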
 
Vennt said:
He stated that the documentation would be available "Soon" - when pushed further he clarified this as "Over the Summer".

V.

Wait, so going back in time to this post - the full open-sourcing will not take place today, but rather this summer? Man, what a disappointment! :?
 
two said:
Noob question:

SPE = 19GFlops
7 SPE = 133GFlops

PS3 Cell = 218GFlops

So PPE = 85GFlops?

To put it very simply: peak performance vs real-world performance.

(as others have already stated in more detail)
 
Well... peak performance of the 360 is claimed to be 120 GFLOPS (I don't know exactly)... but will the real performance of the 360 be less than the PS3's real performance?
 
...But on this point, I think it may depend on what you're doing, no? I'm not sure this represents a general case that can be applied to all code. The compiler may come closer on some workloads than others. And of course there's always room for improvement as the compilers mature...
Yeah, it's probably true that something more general-purpose, as opposed to something math-heavy like an FFT, wouldn't see that much improvement going to straight ASM... that is to say, it would perform poorly either way.

Still, 2:1 is pretty typical considering you're talking about a pipe that does its best work on vector code, and no compiler will ever do that very well (short of mutating the language to one specifically meant for vector operations). It does at least suggest that the code produced is pretty status quo as compilers go.
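(For what it's worth, the SPE C intrinsics are about as close as anything gets to that "mutated language": they map more or less one-to-one onto the vector instructions, while still letting the compiler handle registers and scheduling. A rough sketch, with a made-up helper name:)

Code:
#include <spu_intrinsics.h>

/* acc[i] += a[i] * b[i], four floats at a time; each spu_madd maps
   directly to a single fused multiply-add instruction.  n4 is the
   number of 4-float vectors. */
void vmadd(vector float *acc, const vector float *a,
           const vector float *b, int n4)
{
    for (int i = 0; i < n4; i++)
        acc[i] = spu_madd(a[i], b[i], acc[i]);
}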
 
danteye said:
Well... peak performance of the 360 is claimed to be 120 GFLOPS (I don't know exactly)... but will the real performance of the 360 be less than the PS3's real performance?

Of course.
 
danteye said:
Well... peak performance of the 360 is claimed to be 120 GFLOPS (I don't know exactly)... but will the real performance of the 360 be less than the PS3's real performance?

Comparing floating point performance, in the workload being discussed, yes. The X360 is obviously also exposed to the concept of reduced real-world performance ;) It may be easier to leverage its power, but for stuff like this, optimised on each platform, I wouldn't expect the gap between their peak theoretical figures to be betrayed too much in the real world (both would fall short of their peaks, but I'd still expect PS3's CPU to be better for it, perhaps even a couple of times better or more, much as the peak figures would lead you to believe).
 