CELL from GDC

Fafalada said:
Npl said:
Offtopic: Can someone enlighten me about the difference between "arbitrary swizzle" and "permute"?
Former is part of execution pipeline.

In other words:
Swizzle is something you can do inside of an instruction.
Permute is a whole instruction by itself.

It's the difference between

Swizzle (shader style)
add r0, r1.wxyz, r2.zxyw

vs

Permute (SSE style)
permute<wxyz> r3, r1
permute<zxyw> r4, r2
add r0, r3, r4
 
You waste more registers this way. Arbitrary swizzles are preferred, although they surely come at a logic or performance hit (permutes are preferable in this regard).
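
To make the difference concrete, here is a rough Python model of the two listings above. It is only a sketch: registers are 4-element lists, and `swizzle` and `add` are hypothetical helpers, not any real instruction set.

```python
# Toy model of 4-wide vector registers; not a real ISA.
LANES = {"x": 0, "y": 1, "z": 2, "w": 3}

def swizzle(reg, pattern):
    """Return a copy of reg with lanes reordered, e.g. pattern 'wxyz'."""
    return [reg[LANES[c]] for c in pattern]

def add(a, b):
    """Lane-wise add of two 4-wide registers."""
    return [p + q for p, q in zip(a, b)]

r1 = [1.0, 2.0, 3.0, 4.0]        # lanes x, y, z, w
r2 = [10.0, 20.0, 30.0, 40.0]

# Swizzle style: the reorder is folded into the add's operands.
# One instruction, no temporaries.
r0_swizzle = add(swizzle(r1, "wxyz"), swizzle(r2, "zxyw"))

# Permute style: each reorder is its own instruction with its own
# destination register. Three instructions, two extra registers.
r3 = swizzle(r1, "wxyz")
r4 = swizzle(r2, "zxyw")
r0_permute = add(r3, r4)

print(r0_swizzle)  # [34.0, 11.0, 22.0, 43.0]
```

Both paths give the same result; the only difference in this model is how many instructions and temporary registers are spent getting there.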
 
darkblu said:
hypothetically, you can send as many ops as you have ready (i.e. decoded) to a unit given the unit has wide enough entry path to accept all those. what the unit does with them afterwards is, well, its business. if ops were all nops, then, it could just as well retire them all ; )

um, as I said before, No.

and then we have that info (as reliable as it is) that each thread in there sees its own VMX context, or at least, a register file. so regardless of how many ops per cycle that VMX unit can handle, it should at least keep a face to each of those couple of SMT threads out there. so how do you explain that? aside from discarding it as non-credible, that is.

A thread is a hardware context. Each thread better damn well have an individual architectural data state. This is normal. It isn't anything special and it doesn't require a VMX execution unit that is at all different from a scalar VMX unit.
 
corysama said:
Fafalada said:
Npl said:
Offtopic: Can someone enlighten me about the difference between "arbitrary swizzle" and "permute"?
Former is part of execution pipeline.

In other words:
Swizzle is something you can do inside of an instruction.
Permute is a whole instruction by itself.

Thanks. No doubt AoS & swizzles is the more "natural approach". At least in the programming languages I know.
 
aaron - what are your thoughts on permute instructions versus swizzles as part of each instruction, from the HW design POV?

I can imagine some potential cons, but I'm not an EE so have no way of evaluating them...

1.) larger instructions
2.) more complex I-decode
3.) increase in all instruction latency - extra pipestage(s) for operand swizzling. Presumably though, the increase would be less than the latency of separate permute op(s) followed by an ALU op
4.) since ALU ops take up to 3 inputs, supporting swizzles completely means you'd have 3 permute execution units (instead of 1), possibly idle much of the time.

It looks to me like going with SoA reduces the need for swizzle operations - can someone with SIMD assembly experience confirm or deny this?
 
psurge said:
It looks to me like going with SoA reduces the need for swizzle operations - can someone with SIMD assembly experience confirm or deny this?

I've been told that horizontal ops (dot products, swizzles and broadcasts) make the internal latency of a SIMD unit much higher. If you have no horizontal communication then each bit of the SIMD unit works on its element in complete isolation (except for register work) from its neighbours. As soon as you start wanting to move data across from one bit of the SIMD unit to the next it adds extra steps. As you get to higher clock speeds, this gets a lot harder. But INAHE and Aaron can definitely explain much better than I.

* INAHE = I'm Not A Hardware Engineer ;-)

SoA removes much of the swizzling: you work as if it were scalar but 4 items pop out the end. However, SoA can increase the amount of pre- and post-permutation; if your data is in AoS then you need lots of permutes to 'rotate' it into SoA format.
When I was working on SSE (actually I did it when it was still called KNI) we found that storing our vertex data in SoA was a lot faster; we then just had the cost of post-swizzling into the graphics card (this was back in the days of CPU TnL).

I expect that a similar approach will work for the SPU, storing things like physics data and any graphics data in SoA. Lots of programmers seem to have a bit of a mental block accessing data as SoA. I'm quite lucky and find it very natural; I'm known to use SoA even when I don't have to (much to other programmers' dismay).
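
A small Python sketch of the pre-permute, SoA work, post-permute pattern described above. The function names and the sample vertices are made up for illustration, and plain lists stand in for 4-wide registers.

```python
def aos_to_soa(vertices):
    """Pre-permute: [(x, y, z, w), ...] -> (xs, ys, zs, ws)."""
    xs, ys, zs, ws = (list(component) for component in zip(*vertices))
    return xs, ys, zs, ws

def soa_to_aos(xs, ys, zs, ws):
    """Post-permute: (xs, ys, zs, ws) -> [(x, y, z, w), ...]."""
    return list(zip(xs, ys, zs, ws))

def translate_soa(xs, ys, zs, tx, ty, tz):
    """Reads like scalar code; each line stands in for one 4-wide vertical add."""
    out_x = [x + tx for x in xs]
    out_y = [y + ty for y in ys]
    out_z = [z + tz for z in zs]
    return out_x, out_y, out_z

# Four vertices stored the "natural" AoS way.
aos = [(1, 2, 3, 1), (4, 5, 6, 1), (7, 8, 9, 1), (10, 11, 12, 1)]

xs, ys, zs, ws = aos_to_soa(aos)                  # cost paid once on the way in
nx, ny, nz = translate_soa(xs, ys, zs, 1, 0, -1)  # no swizzles needed here
moved = soa_to_aos(nx, ny, nz, ws)                # cost paid once on the way out

print(moved)  # [(2, 2, 2, 1), (5, 5, 5, 1), (8, 8, 8, 1), (11, 11, 11, 1)]
```

If the data is stored as SoA to begin with, the first conversion disappears and only the post-swizzle remains, which matches the vertex-data setup described in the post.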
 
DeanoC, thanks for the reply. Perhaps more vector like load store instructions could be provided which allow for strided access, or even for loading say 4 AoS 4-vectors into 4 SoA registers...

I wonder if GPUs process in SoA format internally (they are working on 2x2 pixel quads and multiple vertices simultaneously after all)?
 
DeanoC said:
psurge said:
It looks to me like going with SoA reduces the need for swizzle operations - can someone with SIMD assembly experience confirm or deny this?

I've been told that horizontal ops (dot products, swizzles and broadcasts) make the internal latency of a SIMD unit much higher. If you have no horizontal communication then each bit of the SIMD unit works on its element in complete isolation (except for register work) from its neighbours. As soon as you start wanting to move data across from one bit of the SIMD unit to the next it adds extra steps. As you get to higher clock speeds, this gets a lot harder. But INAHE and Aaron can definitely explain much better than I.

* INAHE = I'm Not A Hardware Engineer ;-)

SoA removes much of the swizzling: you work as if it were scalar but 4 items pop out the end. However, SoA can increase the amount of pre- and post-permutation; if your data is in AoS then you need lots of permutes to 'rotate' it into SoA format.
When I was working on SSE (actually I did it when it was still called KNI) we found that storing our vertex data in SoA was a lot faster; we then just had the cost of post-swizzling into the graphics card (this was back in the days of CPU TnL).

I expect that a similar approach will work for the SPU, storing things like physics data and any graphics data in SoA. Lots of programmers seem to have a bit of a mental block accessing data as SoA. I'm quite lucky and find it very natural; I'm known to use SoA even when I don't have to (much to other programmers' dismay).

You are good like that Deano... unfortunately it is just so much more natural to think of a Vector variable as storing XYZW components, and to think of the registers as being divided into XYZW fields.

I think I'd go the permute way at least at the beginning: this way the compiler might even be able to help... when transferring from AoS to SoA the compiler would be doing very cheap loop unrolling, which is a very nice benefit of SoA besides giving you cheaper dot product calculations and more efficient Dot3 operations (in an AoS VU you would have the problem of the W component FMAC sitting idle).
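
For what it's worth, the Dot3 point can be sketched in Python. This is a toy model with made-up helper names and sample vectors: in the AoS form one lane out of four does no useful work and a horizontal add is needed, while the SoA form uses only vertical multiply-adds and yields four dot products at once.

```python
def dot3_aos(a, b):
    """One dot product from two (x, y, z, w) registers."""
    # 4-wide multiply: the W lane is computed but never used.
    products = [a[i] * b[i] for i in range(4)]
    # Horizontal step: lanes of the same register must be summed together.
    return products[0] + products[1] + products[2]

def dot3_soa(ax, ay, az, bx, by, bz):
    """Four dot products from component registers, vertical ops only."""
    return [ax[i] * bx[i] + ay[i] * by[i] + az[i] * bz[i] for i in range(len(ax))]

a = [(1, 0, 0, 0), (0, 1, 0, 0), (1, 2, 3, 0), (2, 2, 2, 0)]
b = [(1, 0, 0, 0), (1, 0, 0, 0), (4, 5, 6, 0), (1, 1, 1, 0)]

# AoS: four separate calls, each with an idle lane and a horizontal add.
aos_results = [dot3_aos(p, q) for p, q in zip(a, b)]

# SoA: transpose once, then every lane does useful work.
ax, ay, az, _ = (list(c) for c in zip(*a))
bx, by, bz, _ = (list(c) for c in zip(*b))
soa_results = dot3_soa(ax, ay, az, bx, by, bz)

print(aos_results, soa_results)  # [1, 0, 32, 6] [1, 0, 32, 6]
```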
 
marconelly!:
PSP uses OGL ES if I remember correctly. I doubt that would be a "straight" port though :p Perhaps some of the code can be reused?
PSP games aren't sharing much in the way of assets with just PS2 titles.

Advancement in battery power capacity won't ease the constraints on mobile graphics growth fast enough to let them keep pace with the march of desktop/set-top technology. An MBX 2 will be coming, though, which has had its R&D fed off the trickle-down development of 'Series 5' from the Sega Sammy project.
 
So OpenGL ES it is. Fantastic! I hope the PS3 ports take a good byte out of MircoShaft's Direct3D.

That's wrong but amusing, nevertheless.

Inane_Dork said:
Wait, so is the PS3's API going to be GL ES or a customized variant of GL ES? I mean, I see no reason to keep compatibility with GL ES. It's not like anyone will want to do a straight port from PS3 to a mobile platform.

Inane, this is why I don't like assuming too many things before I see them for myself. The very same way you were looking at OpenGL/ES before is most likely what they had already been weighing before they made the final decision to choose it as their API.

In response to your question, I honestly don't know what they'll do or what they're even planning on doing next. I just hope that whatever comes of it, it will be the right decision.
 
Another article from the session, this time by Zenji Nishikawa @ Game Watch

http://www.watch.impress.co.jp/game/docs/20050316/ps3.htm

Nishikawa met Neil Trevett of the Khronos Group, and Trevett confirmed that the PS3 adopts OpenGL/ES 2.0. OpenGL/ES 2.0 will be released at SIGGRAPH 2005 this summer.
 
FWIW, Cool Tips 8 IEEE Symposium, April 20 - 22, 2005, has some sessions about the CELL processor.

9:30-10:20 Keynote Presentation 2
The Cell Processor and the Future
Masakazu Suzuoki (Sony Computer Entertainment Inc.)

10:20-11:00 Invited Presentation 2
All about the Cell Processor
Peter Hofstee (IBM Austin)

17:10-17:35 The Low Power Synergistic Processor Element of a CELL Processor
O. Takahashi, T. Asano, R. Cook, S. Cottier, O. Delgado, S. H. Dhong, B. Flachs,
K. Hirairi, A. Kawasumi, J. Leenstra, T. Machida, B. Michael, H. Murakami. H.
Murakami, T. Nakazato, H. Nishikawa, H. Noro, H. Oh, S. Onishi, J. Pille, J. Qi,
J. Silberman, N. Yano, S. Yong, D. Wendel, M. White (IBM)

17:35-18:00 Low Power Design in 11FO4 Embedded SRAM for the Synergistic Processor
Element of a CELL Processor
Toru Asano, Takaaki Nakazato, Sang H. Dhong, Atsushi Kawasumi, Joel
Silberman, Osamu Takahashi, Michael White, Hiroshi Yoshihara, Scott Cottier
(IBM)

18:00-18:25 A CELL Software Platform for Digital Media Application
Seiji Maeda, Shigehiro Asano, Tomofumi Shimada, Koichi Awazu, Haruyuki
Tago (Toshiba)
 
AnandTech has just put their article on Cell on the site
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2379

Edit: Just finished reading it. While I have no idea how correct it was, I think it was well worth reading, at least for a person like me who previously had little idea what all those "in order" and "out of order" terms meant.

Now I feel so much smarter, I almost feel like I could start learning to program :)
Maybe now I'd also understand the last few pages of this thread too.
 
rabidrabbit said:
AnandTech has just put their article on Cell on the site
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2379

Edit: Just finished reading it. While I have no idea how correct it was, I think it was well worth reading, at least for a person like me who previously had little idea what all those "in order" and "out of order" terms meant.

Now I feel so much smarter, I almost feel like I could start learning to program :)
Maybe now I'd also understand the last few pages of this thread too.

Yeah, this was a good article, different from the others which is nice. I like how they touched rather specifically on its suitability as a games processor too.

one said:
FWIW, Cool Tips 8 IEEE Symposium, April 20 - 22, 2005, has some sessions about the CELL processor.

9:30-10:20 Keynote Presentation 2
The Cell Processor and the Future
Masakazu Suzuoki (Sony Computer Entertainment Inc.)

10:20-11:00 Invited Presentation 2
All about the Cell Processor
Peter Hofstee (IBM Austin)

17:10-17:35 The Low Power Synergistic Processor Element of a CELL Processor
O. Takahashi, T. Asano, R. Cook, S. Cottier, O. Delgado, S. H. Dhong, B. Flachs,
K. Hirairi, A. Kawasumi, J. Leenstra, T. Machida, B. Michael, H. Murakami. H.
Murakami, T. Nakazato, H. Nishikawa, H. Noro, H. Oh, S. Onishi, J. Pille, J. Qi,
J. Silberman, N. Yano, S. Yong, D. Wendel, M. White (IBM)

17:35-18:00 Low Power Design in 11FO4 Embedded SRAM for the Synergistic Processor
Element of a CELL Processor
Toru Asano, Takaaki Nakazato, Sang H. Dhong, Atsushi Kawasumi, Joel
Silberman, Osamu Takahashi, Michael White, Hiroshi Yoshihara, Scott Cottier
(IBM)

18:00-18:25 A CELL Software Platform for Digital Media Application
Seiji Maeda, Shigehiro Asano, Tomofumi Shimada, Koichi Awazu, Haruyuki
Tago (Toshiba)

Thanks! These look quite interesting. The SCEI presentation "The Cell Processor and the Future" sounds quite promising, perhaps, for some PS3 hints..
 
That article talked of cross-platform being a problem, but how much so? Cross platform between XB and PS2 is as far from easy as you can get, the two platforms sharing almost no architecture, but it's still managed.
 
Shifty Geezer said:
That article talked of cross-platform being a problem, but how much so? Cross platform between XB and PS2 is as far from easy as you can get, the two platforms sharing almost no architecture, but it's still managed.

Yes, I have the feeling that next gen porting will be somewhat easier than this generation. Maybe because the GPUs will be kind of similar (ATI and NVIDIA offerings are very similar, although I'm not sure what exactly will end up in the next gen consoles).

If people could port games in this generation, where PS2 was just completely different from the other two, I'm sure they'll be fine next year.
 
Just for fun: today is 18th March Japan time, which gamesindustry.biz reported as the day the PS3, or some form(s) of it, will be shown to a select group of people in a very private event...
 
one said:
IBM has digests of 4 Cell technical papers for the ISSCC 2005. Search "cell processor" from here.

Thanks. There's also an article there I hadn't seen before (if you search for Cell, it's called "Cell moves into the limelight", it was from mpronline.com).

The article seems to have been written with insight from IBM. A quote from it:

"In designing the BPA, IBM looked at different workloads in areas
of cryptography, graphics transform and lighting, physics,
fast-Fourier transforms, matrix math, and other more scientific
workloads."

So..vertex processing, right? Yeah yeah, I know, nothing's certain...just vertex processing on Cell with pixel processing on the GPU is a favoured PS3 configuration for me ;)

Do we know enough now to work out the theoretical peak vertex performance per sec of one SPE @ 4 GHz? What transformation is used when calculating such peaks?

Looking at the SPEs' local memory, they could each hold ~16,000 vertices at a time...right (?) A new vertex can come in over the EIB every 2 cycles (it's effectively 8 bytes per cycle) + the latency of the original load instruction, or...? Sorry, this isn't my forte, but I'd like to learn ;)

edit - doh, mixing up bits and bytes :rolleyes:
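
The back-of-the-envelope numbers in the post above can be written out as a few lines of Python. This is only a sketch: it assumes a position-only vertex of four 32-bit floats, gives the whole 256 KB local store to vertex data (ignoring code and buffers), and takes the 8 bytes per cycle and 4 GHz figures from the post as given rather than as confirmed specs.

```python
LOCAL_STORE_BYTES = 256 * 1024    # SPE local store size
VERTEX_BYTES = 4 * 4              # assumption: x, y, z, w as 32-bit floats
BYTES_PER_CYCLE = 8               # figure quoted in the post, taken as given
CLOCK_HZ = 4_000_000_000          # the 4 GHz figure asked about in the post

# How many such vertices fit if nothing else lives in local store.
vertices_in_local_store = LOCAL_STORE_BYTES // VERTEX_BYTES

# Cycles to bring one vertex in at the quoted transfer rate.
cycles_per_vertex = VERTEX_BYTES // BYTES_PER_CYCLE

# Upper bound set by transfer alone; says nothing about transform cost.
max_vertices_per_second = CLOCK_HZ // cycles_per_vertex

print(vertices_in_local_store)   # 16384
print(cycles_per_vertex)         # 2
print(max_vertices_per_second)   # 2000000000
```

Under those assumptions the ~16,000 figure and the one-vertex-every-2-cycles figure both check out; a fatter vertex format or space reserved for code would lower both.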
 