CELL from GDC

one · Mar 10, 2005

This is reminiscent of BlueGene/L, as its single compute node has a mini-kernel (CNK) running on it.

Fafalada · Mar 10, 2005

nAo said:
Faf..I wouldn't be so worried cause you would work that way (with at least 4 elements at time) anyway if you like efficiency

Well no - forcing this type of data rearranging effectively eliminates any real chance of writting readable code, especially if it's in C++.
And I don't know how much you've dealt with writting non-graphics code for vector units, but this kind of limitations will make it insanely difficult to write efficient physics code. Heck you even run into issues with VUs with a far more capable ISA.

Let's hope they gives us GOOD SPEs compiler, that's the real concern to me.

Well, for C/C++ code, it's not like the compiler could rotate vector memory patterns for me, so what code I want to write in straight C will be all manual labour to optimize anyhow. Though I at least expect/hope for decent loop optimizers.
And I suppose Cg could probably help making more readable code possible, though I sinceredly doubt efficiency if you start mixing and matching many single vector operations with matrix transforms and expect it to be handled transparently without doing the above mentioned manual rearrangements.

nAo · Mar 10, 2005

Look at this slide:

I LOVE the last line.. is it a sincere compliment..or a subliminal threat? 8)

passerby · Mar 10, 2005

IMO that presentation is almost like saying things... but also not saying things at all. Duh.

Give an actual code example... how it is designed, how it actually runs, what works, what doesn't, etc.[/i]

Fafalada · Mar 10, 2005

Guden Oden said:
What, you thought being a games programmer was supposed to be easy, fun, making lots of money and driving around in a yellow topless ferrari?

Well yeah, but that was when I was 16.

Thing is people whine about managing multiple processors and this and that, but according to this, writting fast code for just 1 SPE will be obfuscated mess that would make even raw VU assembly code look prettier in some ways.

nAo · Mar 10, 2005

Fafalada said:
Well no - forcing this type of data rearranging effectively eliminates any real chance of writting readable code, especially if it's in C++.
And I don't know how much you've dealt with writting non-graphics code for vector units, but this kind of limitations will make it insanely difficult to write efficient physics code. Heck you even run into issues with VUs with a far more capable ISA.

I share some of your fears, but if they give us a good compiler and I'll just start to use matrices of 4 vectors instead of a single vectors in my loops.
You're right about potentially fucking up code readibility but if you want huge flops figures (we're talking about being 2/3x powerful than their direct competing company hw..) you have to sacrifice things. Nothing is free..
..my non graphics code library on the VUs is almost nil, nada, zero.. so you're right on this too

Well, for C/C++ code, it's not like the compiler could rotate vector memory patterns for me, so what code I want to write in straight C will be all manual labour to optimize anyhow. Though I at least expect/hope for decent loop optimizers.

Umh..imho it's just a matter to change your way to look how data are stored.
I'm arleady working that way on the PS2. Too bad I realized too late in dev process I should have taken that route from the start, so I had some headache when data stored in different ways have to work together someway..(see my comments about lighting code I wrote here some weeks ago..)

I sinceredly doubt efficiency if you start mixing and matching many single vector operations with matrix transforms and expect it to be handled transparently without doing the above mentioned manual rearrangements.

We don't know SPEs ISA so we can't know if they are supporting this kind of rearrangements trough special instructions, let's hope that though.
Faf..life is wonderful

rabidrabbit · Mar 10, 2005

I must say I understood maybe two or three lines from those presentations, so I won't comment if those things are good for the programmability of cell or not.

Did Microsoft give similar info of the insides of their xbox next? Or are they relying XNA solves everything?

That was a game developer's conference after all, right? Not just some pr session for Live end user features

version · Mar 10, 2005

one · Mar 10, 2005

rabidrabbit said:
Did Microsoft give similar info of the insides of their xbox next? Or are they relying XNA solves everything?

That was a game developer's conference after all, right? Not just some pr session for Live end user features

AFAIK they are going to lecture on the usage of OpenMP, so it looks like just another SMP configuration where no such special briefing is required except for the obvious focus on multithread programming.

Fafalada · Mar 10, 2005

nAo said:
I share some of your fears, but if they give us a good compiler and I'll just start to use matrices of 4 vectors instead of a single vectors in my loops.

Actually that's what I was just wondering about - why bother even giving us vector instructions, just stick a 4x4Matrix multiplier in there and be done with it, it'll be about as usable anyhow. :?

You're right about potentially fucking up code readibility but if you want huge flops figures (we're talking about being 2/3x powerful than their direct competing company hw..) you have to sacrifice things. Nothing is free..

Question is, is the performance worth it if ISA might limit you to something far lower in everything but trivial code? If all they wanted was something that can transform vertices fast I am sure they could have gotten comparable performance cheaper.

Umh..imho it's just a matter to change your way to look how data are stored.

Well, it still forces using you to use bizarro classes such as vector4x4. Basically it boils down to obfuscating your algorithms into quadruples or whatever the larger atomic unit you will define.
And I doubt anyone would prototype any new code in that manner, so we'll be back to writting nonoptimized and optimized versions of code - feels like 1999 all over again.

I'm already working that way on the PS2.

I actually wanted to ask, what prompted you to do this eventually? I know writting all code unrolledx4 can help with VCL optimizer among other things, but I doubt that would be the only reason. And I don't suppose you store anything but the normals like that?
Personally I've considered the idea early on but rejected it eventually. I still have all my loops written unrolledx2, but that's cuz of the way UVs are stored (saving space and memory accesses).

Megadrive1988 · Mar 10, 2005

I have a stupid, stupid dumb idiotic question

are these 'Jobs' and 'Job Queue' anything like Cell apulets ?

j^aws · Mar 10, 2005

one said:
This is reminiscent of BlueGene/L, as its single compute node has a mini-kernel (CNK) running on it.

Each SPE runs a mini- kernal

As I predicted they would run some sort of nano-kernel ala TAOS,

TAOS: http://www.byte.com/art/9407/sec6/art1.htm

Jaws said:
...
They get around the uniform ISA problem in a heterogeneous multi-processor environment by compiling to a 'virtual' processor with no overheads in translation due to a very efficient nano-kernel running on each processor. I know the CELL press releases mentions that the CELL processors can run multiple operating systems. If this means multiple nano-kernels or equivalent on each core, then they could be borrowing many ideas from TAOS for CELL.
...

http://www.beyond3d.com/forum/viewtopic.php?p=438987#438987

Hannibal in the article below mentions that SPEs can have various scaler and vector configurations,

http://arstechnica.com/news.ars/post/20050308-4685.html)

Hmm...So we're still on for a CELL Virtual Machine, i.e. CELL VM ISA?

Panajev2001a · Mar 10, 2005

Fafalada said:
nAo said:

IIRC we already discussed that. No broadcast my friend..AFAIK

Click to expand...

Well there was nothing really conclusive said in that discussion though

Anyway what do they expect me to do with no broadcast and no dotproduct, replicating scalar value into entire vector for every dotproduct during matrix and vector transforms?
Don't even get me started on compiler trying to work in such ways, it took GCC like a decade just to support madds.

GCC == best compiler in the world... there is ntohign better than GCC 2.95.x noooo sir.

nAo · Mar 10, 2005

Fafalada said:
Actually that's what I was just wondering about - why bother even giving us vector instructions, just stick a 4x4Matrix multiplier in there and be done with it, it'll be about as usable anyhow. :?

That's true..but we still don't know if there is any other kind of support for this stuff in the ISA. a 4x4 matrix multiplier would have been even a bigger a waste with scalar ops

Question is, is the performance worth it if ISA might limit you to something far lower in everything but trivial code?

I don't know where the sweet spots is.
Do we like challenges? Hell..yes! [masochist mode off]

If all they wanted was something that can transform vertices fast I am sure they could have gotten comparable performance cheaper.

100% agreed. But it's clear SPEs will do much more than transforming vertices. Maybe the GPU is going to do most of the vertex shading stuff, who knows :?

Basically it boils down to obfuscating your algorithms into quadruples or whatever the larger atomic unit you will define.

yeah..it like an obfuscating C++ code compo

(I don't know if I should cry or if I should laugh..)

I actually wanted to ask, what prompted you to do this eventually? I know writting all code unrolledx4 can help with VCL optimizer among other things, but I doubt that would be the only reason. And I don't suppose you store anything but the normals like that?

The idea prompted when I wrote for the first time some VU code for a classic point light. I realized all the renormalization stuff would have taken tons of clock cycles, then I re-discovered (it happened when I workd about doing per triangle mip mapping and I thought about a fast log2 implementantion on the VUs) the ftoi/itof trick (doing calculation in logartmic space..).
That trick takes 4 cycles and work on a single vector, so I could renormalize 4 vectors at the same time (one pow per clock cycle, that' cool

)!
I needed to have data in that way..and the better way to do it, imho, needed to work on 4 vertices a the same time and transpose all my matrices. Moreover this accelerated all my dot3 calculations (0.75 cycles per dot3).
I use lower pipeline to transpose non-transposed stuff (veritces position) and it's free cause I had a lot of free slots. I could have used it to just compose a vector full of scalar to be invsquared but VCL worked better with the transposed configuration, dunno why.
In the end it worked and it's faster than my non transposed implementation, even if I've no doubt someone could come up with something even faster.

ciao,
Marco

Megadrive1988 · Mar 10, 2005

up and down, up and down, up and down. sigh.

it seems like, with every new Cell thread, things shift from good to bad, back to good, then back to bad again. ugh, at least as far as the current thoughts of various people who actually work on games, programming, graphics, etc.

Is Cell looking good or bad at this moment in time ?

nAo · Mar 10, 2005

Megadrive1988 said:
Is Cell looking good or bad at this moment in time ?

It's up to yourself decide

j^aws · Mar 10, 2005

nAo said:
Megadrive1988 said:

Is Cell looking good or bad at this moment in time ?

Click to expand...

It's up to yourself decide

Yeah...I like the VM JIT shit!

...It has many possibilities as I mentioned above with each SPE running it's own micro-kernel.

I believe they are just teasing us with these slides at the mo! 8)

London Geezer · Mar 10, 2005

Well i think it's easy for the gamer to say "YEAH I LOVE IT!", but they're not the ones having to code the bloody thing...

Oh well...

j^aws · Mar 10, 2005

london-boy said:
Well i think it's easy for the gamer to say "YEAH I LOVE IT!", but they're not the ones having to code the bloody thing...

Oh well...

They're just moaning without knowing all the facts. I beleive Faf has 294.912 moan threads con-currently running in his head.

We don't know the CELL ISA yet...and coding to a consistent CELL VM ISA would make life easier...

London Geezer · Mar 10, 2005

Jaws said:
They're just moaning without knowing all the facts. I beleive Faf has 294.912 moan threads con-currently running in his head.

We don't know the CELL ISA yet...and coding to a consistent CELL VM ISA would make life easier...

Well personally i don't care, i'm gonna buy ZOE4 and go WWWWWWWWOOOOOOOOOOOWWWWWWWWWWW.... Who cares if the devs died trying to make the PS3 output 19032trillion particles all moving realistically. One by one. By hand.

CELL from GDC

one

Unruly Member

Fafalada

nAo

Nutella Nutellae

passerby

Fafalada

nAo

Nutella Nutellae

rabidrabbit

A Reformed Member

version

one

Unruly Member

Fafalada

Megadrive1988

j^aws

Panajev2001a

nAo

Nutella Nutellae

Megadrive1988

nAo

Nutella Nutellae

j^aws

London Geezer

j^aws

London Geezer

Similar threads