Presentation from a summit on software and algorithms for the Cell Processor

Does anyone else think Mike Acton's presentation is really ridiculous and pathetic?

I can't help laughing through the entire thing when I read it.
 
It seems a good thing to restate this, and it's good to collect these links, so I'll add Markos's papers, which give you proofs and working code to try:
http://www.freevec.org/whitepapers.php
"Vectorization of hashing algorithm Adler32 (includes mathematical proof)
Vectorization of a family hashing algorithm (includes mathematical proof)
Vectorization of the Insertion Sort Algorithm (also includes presentation of an N-way Merge Sort algorithm, scalar for now)"

Markos and the other Power devs/interested parties ;) have talked about expanding the work already done in the Altivec sections and elsewhere. That may (or may not) lead to another open tech section at PowerDeveloper geared to providing a solid codebase of best-fit samples to mix and match from; we shall see... :smile:

Also check out the Altivec section for working code/papers and advice:
http://www.powerdeveloper.org/forums/viewforum.php?f=23

Oh, and just a reminder: if you have G4/G5/Cell vectorised patches for the GCC patches list, get them in ASAP and read the relevant threads:
http://gcc.gnu.org/ml/gcc-patches/2006-12/msg00149.html
 
Cal said:
Does anyone else think Mike Acton's presentation is really ridiculous and pathetic? I can't help laughing through the entire thing when I read it.

Not really, but that's the thing with presentations: they're all well and good, but linking in fully working freeware code for others to learn from is far better...

For making presentations, taking a little time to learn and use 'REBOL/View' scripting would go a very long way for these types of presentation, and a whole lot more. ;)

'REBOL/View' even has http://antigrain.com/ built in; if you're into 'coding in the small', use it from a .r script, with no messing about linking external stuff.

Notice how this first URL is a perfect model for distributing open code to learn from, and not just REBOL scripts:
http://www.rebol.org/cgi-bin/cgiwrap/rebol/search.r?find=agg&form=yes
http://www.rebol.org/cgi-bin/cgiwrap/rebol/search.r?find=presentation&form=yes

http://www.inetdevgrp.org/20010417/outlinee.html
http://box.lebeda.ws/~hmm/rswf/
http://www.rebolforces.com/
http://www.rebol.net/
 
Cal said:
Does anyone else think Mike Acton's presentation is really ridiculous and pathetic? I can't help laughing through the entire thing when I read it.

Well, I hope not. :)

I know it's difficult to interpret these slides; there just isn't much content in them (beyond the main ideas). Usually when I give a presentation I try not to rely on the slides too much, since I assume everyone in the audience can read (I personally hate it when a presenter just reads his slides). Especially for this one, most of the content was in what I said, not what I wrote.

That particular summit was actually really cool. There were maybe 30 of us (I don't know exactly), and all the presentations were full discussions - we interrupted each other, argued a bit, disagreed some, and generally had a good time of it.


Patsu said:
Cal, Mike Acton's slides are helpful to me. They help me understand to what extent I need to recode/rethink Cell programming... sort of an initiation to the club. For specific questions, you're welcome to visit cellperformance.com to talk to him

As Patsu pointed out, I do try to give more detail on my website (http://www.cellperformance.com), but naturally there's only so much time I can spare, so I haven't had the chance to write in detail about everything I'd like yet.

Gubbi said:
I think Cal's cynicism is based on the fact that Mike Acton's answer to your "to what extent" question above seems to be "all of it"

To this point, what I am saying is that on the Cell you target the SPUs, not the PPU, and if you're used to writing a giant monolithic code base that doesn't concern itself with either the instruction or data caches, you're going to be in a world of hurt. :)

One of the things I did try to point out though was that if your code base is already fast, which means it surely is both cache-friendly and (probably) using SIMD of some kind, then the port to the Cell will be much easier and you'll probably see the benefits right away.
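(To make that concrete: a minimal sketch, mine rather than anything from the slides, of the local-store-centric shape SPU code tends toward - double-buffered DMA, streaming the next block in while the current one is processed. The sizes and the do_work kernel are made-up placeholders.)

Code:
#include <spu_mfcio.h>

#define CHUNK 4096   /* bytes per DMA block; illustrative size only */

extern void do_work(void *ls, unsigned int n);   /* hypothetical compute kernel */

static char buf[2][CHUNK] __attribute__((aligned(128)));

/* Stream 'total' bytes (assumed a multiple of CHUNK) from effective
   address 'ea', overlapping DMA with computation via two buffers. */
void process_stream(unsigned long long ea, unsigned int total)
{
    unsigned int cur = 0;

    /* Kick off the first transfer on tag 0. */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (unsigned int off = CHUNK; off < total; off += CHUNK) {
        unsigned int nxt = cur ^ 1;

        /* Start fetching the next block on the other tag... */
        mfc_get(buf[nxt], ea + off, CHUNK, nxt, 0, 0);

        /* ...then wait only for the current block, and work on it. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        do_work(buf[cur], CHUNK);

        cur = nxt;
    }

    /* Drain and process the final block. */
    mfc_write_tag_mask(1 << cur);
    mfc_read_tag_status_all();
    do_work(buf[cur], CHUNK);
}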


Back to Cal: has your experience programming the Cell been very different from mine? I would be interested to know what, specifically, you disagree with.

Mike.
 
Sorry for the delay on this appearing, Mike, and for the trouble of doing it twice (dup deleted). You ran afoul of our automated spam-catcher, which casts a suspicious eye on new members' first few posts and holds many for mod approval. Should be okay for you going forward, though.
 

So now Mike can drop the act and advertise his competitively priced range of pr0n. Excellent news!
 
To this point, what I am saying is that on the Cell you target the SPUs, not the PPU, and if you're used to writing a giant monolithic code base that doesn't concern itself with either the instruction or data caches, you're going to be in a world of hurt. :)

Well, you'd already be in a world of hurt on "normal" CPUs :)

Exploiting (and maximizing) locality is really the number one commandment of optimization.
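(A throwaway illustration of that commandment, mine rather than anything from this thread: summing the same array in two different orders can differ enormously in speed purely because of cache behaviour.)

Code:
#define N 1024

/* Row-major traversal: memory is touched sequentially, so each cache
   line fetched is fully used before the next one is needed. */
long sum_rows(const int a[N][N])
{
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal of the same data: each access strides
   N*sizeof(int) bytes, so nearly every access misses the cache. */
long sum_cols(const int a[N][N])
{
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}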

One of the things I did try to point out though was that if your code base is already fast, which means it surely is both cache-friendly and (probably) using SIMD of some kind, then the port to the Cell will be much easier and you'll probably see the benefits right away.
It'll certainly be a lot easier, I'll grant you that. There still seem to be a fair number of gotchas, though (thanks to Patsu for finding this).

Then there's the work that needs to be done to avoid branches: inlining functions, unrolling, and converting conditionals into predicates; the latter would of course also be a win for most SIMD code on traditional CPUs.
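(A minimal sketch of the conditional-to-predicate idea, with made-up names: the branch becomes an all-ones/all-zeros mask plus bitwise selection.)

Code:
/* Branchy form: */
unsigned int select_branch(int cond, unsigned int a, unsigned int b)
{
    if (cond)
        return a;
    return b;
}

/* Predicated form: build a mask that is all ones when cond is true
   and all zeros otherwise, then blend the two values with it. */
unsigned int select_mask(int cond, unsigned int a, unsigned int b)
{
    unsigned int mask = (unsigned int)-(cond != 0);  /* 0xFFFFFFFF or 0 */
    return (a & mask) | (b & ~mask);
}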

Btw. thanks for taking the time to get involved in our discussions.

Cheers
 
Then there's the work that needs to be done to avoid branches: inlining functions, unrolling, and converting conditionals into predicates; the latter would of course also be a win for most SIMD code on traditional CPUs.

The whole 'avoid branching on the SPU' thing is really a bit of a red herring. Yes, a mispredicted branch is about the worst thing you can do on an SPU, but it's really not that bad compared to most in-order CPU architectures with variable memory access patterns. When you consider that a cache miss can take 50+ cycles (and sometimes much, much more than 50!), suddenly ~20 cycles for a branch miss doesn't seem so bad.

Of course you can do a LOT of maths in the space of a single branch miss, but each SPU can still do 160+ million branch misses per second (at 3.2 GHz, 3.2e9 cycles / ~20 cycles per miss is about 160 million), which isn't exactly rubbish.
 
Found another interesting paper on how to mix SP and DP on the Cell (or other processors) to obtain both speed-up and accuracy:
http://www.c3.lanl.gov/lacsi-wpp/Slides/02_Dongarra_MultiCore.pdf
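(The gist, as a minimal sketch of my own, using a diagonal system a[i]*x[i] = b[i] so the 'solve' step is one line: do the bulk solve in fast single precision, then compute the residual and correction in double precision until the result reaches double accuracy.)

Code:
/* Mixed-precision iterative refinement, sketched for the trivial case
   of a diagonal system. The paper's point is that the expensive solve
   runs in fast single precision, while cheap double-precision
   residual/correction passes recover full double accuracy. */
void refine_solve(const double *a, const double *b, double *x, int n)
{
    /* Initial solve in single precision. */
    for (int i = 0; i < n; i++)
        x[i] = (float)b[i] / (float)a[i];

    /* A few refinement passes; residual in double, correction solved in single. */
    for (int iter = 0; iter < 3; iter++) {
        for (int i = 0; i < n; i++) {
            double r = b[i] - a[i] * x[i];
            x[i] += (float)r / (float)a[i];
        }
    }
}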

EDIT:
This one talks about a "parallel" language that is aware of the memory hierarchy; the Cell is used as an example:
http://sc06.supercomputing.org/schedule/pdf/pap225.pdf

We present Sequoia, a programming language designed to facilitate the development of memory hierarchy aware parallel programs that remain portable across modern machines featuring different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communication vertically through the machine and to localize computation to particular memory locations within it. We have implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed memory clusters, and demonstrate efficient performance running Sequoia programs on both of these platforms.

By the way, the entire Dongarra directory in the OP is gone now :(
 
Regarding branch-free code: I was trying Mike's advice on using masks to eliminate branches. But guess which code is faster on the PPU.

Hint: You will guess wrong!

With conditionals and branches:
Code:
inline void test_assign_stop2(unsigned int *point, unsigned int *test, unsigned int i)
{
    if (!*point && *test == -1) {
        *point = i;
    }
}

Without conditionals or branches:
Code:
inline void test_assign_stop1(unsigned int *point, unsigned int *test, unsigned int i)
{
    /* Arithmetic right shift smears the sign bit of *test across the
       word: all ones if the top bit of *test is set, all zeros otherwise. */
    unsigned int mask_comb = (unsigned int)(((signed int)*test)>>31);
    *point = (i & mask_comb) | (*point & ~mask_comb);
}
 
They're not actually performing the same check, are they?

The second one is missing the test for *point == 0.

In a simple test with this (1000 loops or something like that) the first version may perform better, as the conditions are all 'exits', so they can subsume the loop branch.

For example, if I compile with -O3 the function looks like this:

Code:
test_assign_stop2:
        lwz 0,0(3)
        cmpwi 7,0,0
        bnelr 7
        lwz 0,0(4)
        cmpwi 7,0,-1
        bnelr 7
        stw 5,0(3)
        blr

There are 3 branches: two from the if, and one for the function return. But only 1 branch of the three is ever executed, and the second load and the store can be avoided. In the inline version, 2 loads and one store are always executed.

If you test with true/false/true/false sequences in a loop, branch prediction should drag down the performance of the first routine; otherwise, prediction will make it the better performer.
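(A hypothetical harness for that experiment, assuming the test_assign_stop2 definition above is in scope; N, the data pattern, and the lack of real timing code are all placeholders.)

Code:
#include <stdio.h>

#define N 1000000

int main(void)
{
    static unsigned int point[N], test[N];

    /* Alternating taken/not-taken pattern to defeat branch prediction. */
    for (unsigned int i = 0; i < N; i++) {
        point[i] = 0;
        test[i]  = (i & 1) ? -1u : 0;
    }

    /* Time this loop once per version of the routine. */
    for (unsigned int i = 0; i < N; i++)
        test_assign_stop2(&point[i], &test[i], i);

    printf("%u\n", point[N - 1]);   /* keep the work observable */
    return 0;
}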
 
Damn, you are right. My logic was wrong. Fixed, and verified that they are now functionally equivalent.

Anyway, look at what the compiler is doing at -O3.

With comparisons and branches
mandelbrot (PPU) 37.50s
mandelbrot (SPU) 16.69s

Code:
inline void test_assign_stop2(unsigned int *point, unsigned int *test, unsigned int i)
{
    if (!*point && *test == -1) {
        *point = i;
    }
}

PPU
Code:
test_assign_stop2:
	lwz %r0,0(%r3)
	cmpwi %cr7,%r0,0
	bnelr %cr7
	lwz %r0,0(%r4)
	cmpwi %cr7,%r0,-1
	bnelr %cr7
	stw %r5,0(%r3)
	blr

SPU
Code:
test_assign_stop2:
	lqd	$2,0($3)
	cwd	$7,0($3)
	rotqby	$2,$2,$3
	binz	$2,$lr
	lqd	$2,0($4)
	hbr	.L16,$lr
	lqd	$6,0($3)
	nop	$127
	nop	$127
	rotqby	$2,$2,$4
	ceqi	$2,$2,-1
	and	$2,$2,$5
	shufb	$6,$2,$6,$7
	stqd	$6,0($3)
.L16:
	bi	$lr


Without comparisons or branches
mandelbrot (PPU) 37.25s
mandelbrot (SPU) 14.18s


Code:
inline void test_assign_stop1(unsigned int *point, unsigned int *test, unsigned int i)
{
    /* Smear the sign bits of -(*point) and *test across the word:
       mask0 is meant as "point is nonzero", mask1 as "test is all ones"
       (see the caveats pointed out below). */
    unsigned int mask0 = (unsigned int)(((signed int)-(*point))>>31);
    unsigned int mask1 = (unsigned int)(((signed int)*test)>>31);
    *point = ((i & mask1) & (~mask0)) | (*point & mask0);
}

PPU
Code:
test_assign_stop1:
	lwz %r11,0(%r3)
	lwz %r9,0(%r4)
	neg %r0,%r11
	srawi %r9,%r9,31
	srawi %r0,%r0,31
	and %r9,%r9,%r5
	rldicl %r0,%r0,0,32
	and %r11,%r11,%r0
	andc %r0,%r9,%r0
	or %r0,%r0,%r11
	stw %r0,0(%r3)
	blr

SPU
Code:
test_assign_stop1:
	ori	$7,$3,0
	lqd	$3,0($4)
	hbr	.L9,$lr
	lqd	$2,0($7)
	cwd	$9,0($7)
	rotqby	$4,$3,$4
	ori	$8,$2,0
	rotqby	$6,$2,$7
	rotmai	$3,$4,-31
	sfi	$2,$6,0
	rotmai	$4,$2,-31
	and	$3,$3,$5
	nor	$2,$4,$4
	and	$6,$6,$4
	and	$3,$3,$2
	or	$3,$3,$6
	shufb	$8,$3,$8,$9
	stqd	$8,0($7)
.L9:
	bi	$lr
 
Seems to confirm DeanoC's comment that branches don't necessarily cause that much of a performance hit on the SPEs...

Also, people seem to easily forget that the SPE isn't completely without branch prediction either: it has software branch hints (the hbr instructions visible in the compiler output above).
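(For what it's worth, a hedged sketch: with spu-gcc, GCC's __builtin_expect is the usual way to tell the compiler which way a branch is likely to go so it can place its hbr hint accordingly; the function and values here are made up.)

Code:
/* Tell the compiler the error path is unlikely, so the fall-through
   path (and any branch hint it emits) favours the common case. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int status)
{
    if (unlikely(status < 0))
        return -1;          /* rare error path */
    return status * 2;      /* common case */
}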
 
Your code still isn't functionally equivalent in both cases. The non-branch version will only work in a limited set of circumstances: if *point is a very large integer that has the highest bit set it will fail, and if *test has the highest bit set but isn't all 1s it can also fail. Perhaps in the case of the mandelbrot code it has the same result, but for general-purpose use they are not the same. I had to be pedantic about this just so it was known.

Now, comparison opcodes shouldn't cause a performance hit (they aren't branches, after all). This code should be the same and may have different performance characteristics for you:

Code:
inline void test_assign_stop1(unsigned int *point, unsigned int *test, unsigned int i)
{
    unsigned int p = *point; // May have some performance improvement depending on the optimizations used by the compiler. 
    unsigned int mask = (p!=0 && *test==-1) * -1; // Note this is a boolean operation, not a conditional operation
    *point = (p * ~mask) | (i * mask);
}
 
This code should be the same and may have different performance characteristics for you.

Ran it on an SPU (fixed some bugs first); it is actually slower than the original. It's longer by 4 instructions, and it introduces the following instructions into the pipeline: 2x ceqi (compare equal) and 2x selb (select bits).
mandelbrot (SPU) 0.188470


A couple of little changes (the fixes relative to the previous version are marked in the comments):
Code:
inline void test_assign_stop(unsigned int *point, unsigned int *test, unsigned int i)
{
    unsigned int p = *point; // May have some performance improvement depending on the optimizations used by the compiler.
    unsigned int mask = (p==0 && *test==-1) * -1; // changed: p==0 (was p!=0); note this is a boolean operation, not a conditional one
    *point = (*point & ~mask) | (i & mask); // changed: bitwise & (was *)
}

Code:
test_assign_stop:
	il	$14,-1
	lqd	$22,0($4)
	il	$10,0
	lqd	$21,0($3)
	ori	$7,$3,0
	hbr	.L18,$lr
	cwd	$11,0($3)
	rotqby	$20,$22,$4
	rotqby	$13,$21,$3
	ceqi	$19,$20,-1
	ceqi	$15,$13,0
	andi	$18,$19,1
	sfi	$17,$18,0
	nor	$16,$17,$17
	and	$5,$5,$17
	selb	$4,$14,$16,$15
	selb	$6,$10,$5,$15
	and	$3,$13,$4
	or	$2,$3,$6
	shufb	$8,$2,$21,$11
	stqd	$8,0($7)
.L18:
	bi	$lr
 