Does Cell Have Any Other Advantages Over XCPU Other Than FLOPS?

scooby_dooby said:
Do you have some documentation/studies of the effect of SMT on an In-Order PowerPC core?

Too funny, I ask for proof, and don't get any, but you expect me to provide proof. I'm not the one claiming 50 percent.
 
Edge said:
Too funny, I ask for proof, and don't get any, but you expect me to provide proof. I'm not the one claiming 50 percent.

You're the one claiming it can't be any higher than 10-20%, based on completely different CPU architectures, with a different implementation of SMT, not to mention software that was not designed specifically for the CPU.

Carmack said he got a 50% increase by splitting his game engine; is that not proof enough that it's possible? And 50% is an arbitrary number anyway. The point is that SMT has the potential to be much more effective in these CPUs than in the P4s etc. of the PC world. They are in-order cores for one thing, more prone to stalls, and the code will be highly optimized for the specific CPU, something that doesn't happen in the open-box PC world.
 
scooby_dooby said:
Carmack said he got a 50% increase by splitting his game engine; is that not proof enough that it's possible? And 50% is an arbitrary number anyway. The point is that SMT has the potential to be much more effective in these CPUs than in the P4s etc. of the PC world. They are in-order cores for one thing, more prone to stalls, and the code will be highly optimized for the specific CPU, something that doesn't happen in the open-box PC world.

Love the optimism, but can the same thing be said for CELL? Hard to program, terrible at general purpose code, nothing more than streaming processors, etc, etc, etc. Nothing but pessimistic crap.

Maybe try to be a bit more open-minded.
 
Thread Pruned

I pruned the thread of the non-contributive blah-blah and the twelve-year-old remarks.
I won't merely prune a thread of such remarks next time... Insults, borderline insults and aggressive remarks have become too numerous lately; it has to stop. And it will stop.
 
Guys guys, let's be cool.

I want some of the questions from the previous page to possibly be addressed before this thread gets locked, y'know?

(Thanks for pruning and not locking Vysez!)
 
Just when I hit Submit reply...

There are five other OT bickering posts to delete!

...scooby_dooby, Edge, I won't just be pruning the next OT messages.
 
scificube said:
If you have some time, DeanoC, maybe you could take a look at this.

I'm just trying to understand a little better why you said this.

I would think Xenon's VMX units would have enough register space to handle two threads, and I believe that is documented to be the case. However, I doubt the VMX units have 128 128-bit registers like an SPE. Wouldn't that make the SPE a bit more capable of handling the XLC scheme in question? I mean, even if the register space in an SPE were cut in half per thread, I would imagine there would still be more registers per thread than what resides in the VMX units.

Also, I thought the VSUs provided for OoOe at their level. So why would the VMX128 in a Xenon core need a large register space to hide the chip's 'in-orderness'? I thought the large register space was mostly there so the VMX units could serve two threads simultaneously and avoid stalls there. No one has yet explained just what the PPE's VMX unit is like or how it operates. (Look Ma! I'm fishin'!) Unless the VSUs do not provide OoOe at their level, I'm guessing you meant 'in-orderness' was being hidden at a higher level.

I get lost though, because I don't know why you mentioned Xenon's VMX unit and not the general purpose registers in a core.

As for what the XLC scheme is trying to pull off, I was under the impression that it was not meant to maximize raw throughput so much as to cope with being memory bound due to a large number of perhaps unpredictable DMA requests. So if the flops suffer for doing this, I would imagine it is relative to a more ideal situation. What I mean is: in a situation where DMA requests are stalling you, this scheme seems to be a way of getting around those stalls, and thus by performing flops in between DMA requests, flops performance should improve (relatively). It would still be less than in an ideal situation where DMA requests aren't hanging you out to dry.

I'm trying to understand what it is you really said. Do you mean that having to unroll loops would break this 'trick'? Or are you saying this 'trick' should not be a first option, as it would adversely affect performance due to loop unrolling? The latter makes sense to me (well, maybe not completely) because this 'trick' seems to be for a specific case. The former I don't understand on my own, so I'm asking for help. I'm also curious whether the scheme has no value at all anywhere else.

I'm lost again (just point and laugh...everyone else does) on loop unrolling in itself... wouldn't this be done at compile time? And wouldn't unrolled loops affect the size of your code, and thus how much space is consumed in an LS or in cache, rather than the space in the core's general purpose registers?
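Here's how I picture the trade-off, as a tiny C sketch (my own toy example, not from any doc): unrolling cuts the per-iteration branch overhead, but it multiplies the loop body's footprint in LS and the number of values the compiler wants to keep live in registers at once.

/* Rolled: small code footprint, one branch per element. */
float sum(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled 4x: roughly 4x the loop body in LS, four independent
   accumulators live in registers, one branch per four elements.
   Assumes n is a multiple of 4 to keep the sketch short. */
float sum_unrolled(const float *a, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}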

Lost I am. Saving I will need. Hides the truth the dark side does...clouds my judgment...or is that just my pills...nope....I probably don't know what I'm talking about.

---------------------------------------------------------------------------------

Separate questions for anybody:

Why would an SPE's iop performance be less than its flops performance? (Is version's Gints number wrong in post #70... or is this again a special-case kind of thing, like Cell being able to handle 64 threads?)

Why can't a flop be exchanged for an iop? (3D games use flops more than iops anyway, more often than not, no?)


OK, I have a suggestion: someone who downloaded the IBM Cell Simulator could try building a variety of SPU code fragments in both XLC and GCC and compare code size, performance and compilation time (XLC is REALLY SLOOOOOW).

I think you'll find really fast that XLC is no magic bullet.

Solving parallelism with compilers is just not going to happen any time soon.

AFAICS we have a long way to go with compilers that do a good job of instruction scheduling on these in-order cores with large instruction latencies and large register files.
 
xbdestroya said:
Well, he didn't work on Cell per se, he worked on the XLC project overall - a substantial difference to take into account, though related in some senses. Not to say that he's not well versed in the architectural differences, of course. And Asher, if I'm selling your 'proximity' to Cell short, do feel free to correct me.
I'm a low-level software guy. I worked with the XLC backend; the majority of my time was spent optimizing for the PowerPC 440, but I've also done code reviews and sat in on many presentations regarding SPE/PPE/Xenon core optimizations being implemented.

I'm by no means a Cell or Xenon expert, but I know a thing or two.
 
Edge said:
Love the optimism, but can the same thing be said for CELL? Hard to program, terrible at general purpose code, nothing more than streaming processors, etc, etc, etc. Nothing but pessimistic crap.

Maybe try to be a bit more opened minded.
The same thing can be said for Cell, you're right.

I'd venture to say it's much easier to tune for SMT than to get great integer/logic performance out of the SPEs, which don't even support 32-bit integers.

The advantage of consoles is the fixed platform: developers can get down and dirty with the code and do things that the engineers who designed the chip may not have expected.

Both Cell and Xenon are vastly different chips from what people are used to, and their performance will increase dramatically as developer familiarity with the chips increases. But it's disingenuous to say Cell will always outperform Xenon. They both have theoretical strengths over the other, but in practice it depends on a lot more than that.

As Deano has alluded to in his blog in the past (I believe it was him, anyway), the secret to power this generation won't be in the hardware, but in the development tools and support in unlocking the power that's already there in both systems.
 
one said:
Where did you read that?

It's well known, but here is the link:


Graphics are a large part of the Xbox 360’s allure. What special techniques and innovations has EA developed for their long-running pigskin title?

Jeremy Strauser: This is a brand new graphical engine for Madden on the Xbox 360, so the list of what is new is amazingly long. Things like full head and eye tracking, facial animation, how we did player faces/heads, game animations, lighting...almost everything is new here.


Is this an all new game engine or an advanced version of what we’ve seen for Madden in the current-gen?

Jeremy Strauser: This is an all-new game engine; new rendering engine, new animation system, new player models, new stadium models, etc. We were able to share some key data like plays and player ratings with current gen, but just about everything else is brand new.



http://interviews.teamxbox.com/xbox/1359/Madden-NFL-06-Interview/p1/
 
scooby_dooby said:
They are in-order cores for one thing, more prone to stalls, and the code will be highly optimized for the specific CPU, something that doesn't happen in the open-box PC world.
One technique that looks to be crucial to XB360 programming is to pre-fetch data for hardware threads whenever possible. When a pre-fetch causes an L1 miss, the thread is flushed (not strictly true, but I'm not gonna waffle about this) <-EDIT: damn don't you just hate it when you put the caveat on the wrong paragraph - fixed now.

The flush means the other thread sharing the core can continue running. If, instead of flushing, the first thread just stalled, then the second thread would also stall. As it happens the flush is automatic - so it's up to the developer to put the pre-fetches in, in order to give the CPU a chance to flush, if an L1 miss occurs. Without the pre-fetch both threads will stall, if an L1 miss occurs.
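Roughly, the pattern looks like this - a sketch only, using GCC's generic __builtin_prefetch as a stand-in for whatever dcbt-style touch intrinsic the 360 toolchain actually exposes, and a hypothetical do_work():

void do_work(const float *block, int n);  /* placeholder for real work */

/* Touch the next block before working on the current one, so an L1
   miss surfaces on the prefetch (this thread gets flushed and the
   sibling hardware thread keeps the core busy) rather than on the
   load itself (where both threads would stall). */
void process_blocks(const float *data, int nblocks, int block) {
    for (int i = 0; i < nblocks; i++) {
        if (i + 1 < nblocks)
            __builtin_prefetch(&data[(i + 1) * block]);  /* touch ahead */
        do_work(&data[i * block], block);
    }
}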

That's just an example of how, ahem, tediously complicated programming XB360 is going to be. Though, having said that (and never having programmed an SMT core) I imagine these concepts are routine for SMT programming.

Interestingly enough, the "XB360 CPU caches" documentation suggests that 1-byte per thread per core per clock is about the average read data rate.

---

In SPE programming there's no concept of an L1 miss - the programmer has to continually "manage prefetched data" (and write back, too).

There've been long and detailed discussions about SPE LS and how it can be used differently from (or alternatively, like) a cache - and arguments over whether LS should have been implemented as a cache, instead.

I've described a triple-buffered programming method for SPE. It's not rocket science.

http://www.beyond3d.com/forum/showpost.php?p=560159&postcount=60
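For a flavour of it, here's a minimal double-buffered read loop (the linked post extends the idea to triple buffering), sketched with the Cell SDK's spu_mfcio.h intrinsics - the chunk size, tag usage and compute() are just placeholders:

#include <spu_mfcio.h>

#define CHUNK 4096
static char buf[2][CHUNK] __attribute__((aligned(128)));

void compute(char *data, unsigned size);  /* placeholder for real work */

/* Pull a stream from main memory: kick off the DMA for chunk i+1,
   then wait on and process chunk i, so transfer overlaps compute. */
void stream_in(unsigned long long ea, unsigned nchunks) {
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);       /* prime first chunk */
    for (unsigned i = 0; i < nchunks; i++) {
        int next = cur ^ 1;
        if (i + 1 < nchunks)                        /* start next transfer */
            mfc_get(buf[next], ea + (i + 1) * CHUNK, CHUNK, next, 0, 0);
        mfc_write_tag_mask(1 << cur);               /* wait for current */
        mfc_read_tag_status_all();
        compute(buf[cur], CHUNK);
        cur = next;
    }
}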

Jawed
 
^^^good stuff!

ERP said:
OK, I have a suggestion: someone who downloaded the IBM Cell Simulator could try building a variety of SPU code fragments in both XLC and GCC and compare code size, performance and compilation time (XLC is REALLY SLOOOOOW).

I think you'll find really fast that XLC is no magic bullet.

Solving parallelism with compilers is just not going to happen any time soon.

AFAICS we have a long way to go with compilers that do a good job of instruction scheduling on these in-order cores with large instruction latencies and large register files.

Actually, I intend to do just that! I need a new HD because I don't have a free partition available for the Fedora OS the SDK requires. Next month I'll get a new HD and go to town, so to speak... probably just end up burning the town down.

I don't think I was trying to say XLC was a magic bullet in that post, though. I was asking for understanding more than anything else. I certainly wasn't challenging the big guy... me!... now that's funny :)

As for solving parallelism in the compiler, I tend to agree that this is some ways off; even with my limited understanding, I expect the compiler can only go so far in getting the job done. Proper programming principles with respect to parallelism are where I think the real solutions will lie. A paradigm shift in the approach to programming itself is the only viable answer in my eyes, as I do not see how the programmer can escape responsibility for writing efficient, parallelized and, most importantly, reusable code. There may be some nice wins here and there, and we should be thankful for them if they come to pass, but I certainly do not expect a compiler to transform my serialized code into magnificent parallelized code. I'm hopeful there is some degree of help; I'm not wishing for things too good to be true.

Conversely, though, I do tend to agree with DeanoC etc. that there could still be valuable help found in proper tools and languages for programming with parallelism in mind. I'm a bit naive for sure, but this much does not escape me.
 
A lot of this rhetoric is like PS2 vs. Dreamcast all over again. PS2 = not enough memory, impossible to program.
Dreamcast = easy to program, free AA, yadda yadda...

Thankfully only a few more months of this... The first sight of amazing realtime PS3 physics engines should help silence the naysayers.

I'd pay money to see what Naughty Dog or Factor 5 are doing with Cell :D
 
iknowall said:
It's well known, but here is the link:

Graphics are a large part of the Xbox 360’s allure. What special techniques and innovations has EA developed for their long-running pigskin title?

Jeremy Strauser: This is a brand new graphical engine for Madden on the Xbox 360, so the list of what is new is amazingly long. Things like full head and eye tracking, facial animation, how we did player faces/heads, game animations, lighting...almost everything is new here.

Is this an all new game engine or an advanced version of what we’ve seen for Madden in the current-gen?

Jeremy Strauser: This is an all-new game engine; new rendering engine, new animation system, new player models, new stadium models, etc. We were able to share some key data like plays and player ratings with current gen, but just about everything else is brand new.

http://interviews.teamxbox.com/xbox/1359/Madden-NFL-06-Interview/p1/
Thanks for the link, but it doesn't suggest they didn't use Renderware. They own Renderware anyway and can overhaul it however they want.
 
On-topic post: I vaguely remember reading that Cell's integer processing follows the same lines as its float processing, working in batches of four (vectors). Can anyone confirm this? If so, working with individual ints (or floats, for that matter) limits peak performance to a quarter, and you need vectorised data to attain higher rates.

How does XeCPU compare to that? I'm guessing it's similar.
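If that's right, the scalar-vs-vector gap would look something like this minimal sketch with the SDK's spu_intrinsics.h (purely illustrative): the same issue slot either does one useful 32-bit add or four.

#include <spu_intrinsics.h>

/* Scalar use: one useful 32-bit add per issue -
   three of the four lanes carry padding. */
int add_one(int a, int b) {
    return a + b;
}

/* Vectorised use: the same spu_add issue does four 32-bit adds. */
vector signed int add_four(vector signed int a, vector signed int b) {
    return spu_add(a, b);
}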
 
ERP said:
AFAICS we have a long ways to go with compilers that do a good job of instruction scheduling on these in order cores with large instruction latencies and large register files.
I would be happy to have something like VCL 2.0 embedded in some compiler... am I asking too much? I don't think so
 
Asher said:
I'd venture to say it's much easier to tune for SMT than to get great integer/logic performance out of the SPEs, which don't even support 32-bit integers.
We know the SPEs lack a 32-bit integer multiply instruction, but to say SPEs don't support 32-bit integers is a bit of a stretch IMHO.
What's next? SPEs can't do a division? :)
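For what it's worth, the missing 32-bit multiply is easy to compose from the 16x16-bit products the SPU does have. A plain-C sketch of the decomposition (the hardware sequence would use the mpy-family instructions):

#include <stdint.h>

/* 32x32 -> low 32 bits, built from 16x16 -> 32 partial products.
   The ahi*bhi product shifts entirely out of the low 32 bits,
   so three partial products suffice. */
uint32_t mul32(uint32_t a, uint32_t b) {
    uint32_t alo = a & 0xFFFFu, ahi = a >> 16;
    uint32_t blo = b & 0xFFFFu, bhi = b >> 16;
    return alo * blo + ((alo * bhi + ahi * blo) << 16);
}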
 
I've read that Cell with integer and double precision takes a hit of around 90% in performance...

my 5 cents:oops:


RealWorldTechnologies said:
The estimate given by IBM at ISSCC 2005 was that the DP FP computation in the SPE has an approximate 10:1 disadvantage in terms of throughput compared to SP FP computation.
Given this estimate, the peak DP FP throughput of an 8 SPE CELL processor is approximately 25~30 GFlops
http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318

For the efficiency of the SPEs:
The SPEs do not possess branch predictors (they rely on software) and have 256KB of memory available to each of them. This is different from cache, as it is not shared between the SPEs. If an SPE wants to access the local memory of another SPE, it has to go via the PPE. The SPEs are not provided with dedicated caches.

Since the SPEs do not possess any branch predictor, they function by removing the need for a loop altogether. A loop is a command that you can execute when you need to repeat an instruction multiple times. For example, let's take a loop from C++:

Because of the massive number of lines of code some loops can run into, the processor requires a lot of registers to implement this technique successfully, and this is one of the prime reasons why each SPE has been bestowed with 128 registers.
http://www.cooltechzone.com/index.php?option=content&task=view&id=1660&Itemid=0&limit=1&limitstart=3

Why The SPEs Lack Cache?
To understand the reasoning behind the lack of SPE cache, we must comprehend the concept of in-order and out-of-order processing techniques.
In-Order Processing
An in-order core is a processor core that processes the instructions in the same order as they are received.
For instance, let's say you have four variables A, B, C and D. You give a command to the PC saying add A and B and store the value in C. The next command you give is add C and D and store the value in another variable, E. In BASIC, the commands would look something like:
10 C=A+B
20 E=C+D
The problem here is that statement 20 cannot be executed before the value of A+B is stored in C. This is known as a Read After Write (RAW) dependency. The problem these dependencies create is that even though (at least theoretically) the Cell can process both lines at once, it still has to wait for the RAW dependency to resolve before moving to line two. This means that one execution unit of the Cell sits idle.

http://www.cooltechzone.com/index.php?option=content&task=view&id=1660&Itemid=0&limit=1&limitstart=4

 
SynapticSignal said:
I've read that Cell with integer and double precision takes a hit of around 90% in performance...

my 5 cents:oops:



http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318

DP is of marginal interest in a discussion of Cell as a games processor. And DP on competing chips is...?

As for integer, I haven't yet seen any figures or guesses as to its performance, other than that it is of secondary importance to the SPUs. If you've seen a figure, let us know.

SynapticSignal said:
For the efficiency of the SPEs:

The SPEs do not possess branch predictors (they rely on software) and have 256KB of memory available to each of them. This is different from cache, as it is not shared between the SPEs. If an SPE wants to access the local memory of another SPE, it has to go via the PPE. The SPEs are not provided with dedicated caches.

Since the SPEs do not possess any branch predictor, they function by removing the need for a loop altogether. A loop is a command that you can execute when you need to repeat an instruction multiple times. For example, let's take a loop from C++:

Because of the massive number of lines of code some loops can run into, the processor requires a lot of registers to implement this technique successfully, and this is one of the prime reasons why each SPE has been bestowed with 128 registers.

http://www.cooltechzone.com/index.php?option=content&task=view&id=1660&Itemid=0&limit=1&limitstart=3


This doesn't really make any direct points about efficiency, I don't think? It just describes some of the architecture.

On SPUs accessing other LS - can anyone confirm the PPE's role here, if any? Can't one SPU put something on the EIB, and another pick it up?

Re. looping/branching - I wasn't aware looping was not available in SPU code. In fact I thought it was - you just don't have any branch prediction; the hardware will always assume a branch is not taken (falls through) unless you hint otherwise. There are branch hints, though, and of course ways to avoid branching and looping - loop unrolling being one way for the latter, as he describes. Using that doesn't mean you couldn't use a loop if you wanted to, and were confident of the behaviour of the loop. Assuming I'm not mistaken about the simple availability of loops in SPU code? I need to spend more time with that simulator ;)
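And compare-and-select is the usual way to sidestep branches entirely on the SPU - a minimal sketch with the SDK's spu_cmpgt/spu_sel intrinsics (my own illustration, not from any of the linked docs):

#include <spu_intrinsics.h>

/* Branch-free per-element max: the compare produces an all-ones /
   all-zeros mask per lane, and spu_sel picks b where the mask is set.
   No branch, so nothing to predict or hint. */
vector float vec_max(vector float a, vector float b) {
    return spu_sel(a, b, spu_cmpgt(b, a));
}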

SynapticSignal said:

That doesn't actually tell us why the SPUs don't use cache - at all. The in-order processing issues it relates aren't unique to Cell (Xenon is in-order too, for example), and they can be overcome in some (if not many) cases with an intelligent approach/more work. To take his example, he should be using E = A + B + D ;) Or if he wanted to store A+B separately, then he should just follow E = A+B+D with C = A+B. A trivial, silly example, I know, and not all dependencies can be resolved so easily, but still.
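In code, that dodge is just reassociation - a trivial plain-C illustration of my own:

/* RAW-dependent: the second add must wait for the first. */
void dependent(int a, int b, int d, int *c, int *e) {
    *c = a + b;       /* the add below can't issue until this completes */
    *e = *c + d;
}

/* Reassociated: both results come from independent expressions, so an
   in-order core can overlap them; the compiler will still compute
   a+b only once. */
void independent(int a, int b, int d, int *c, int *e) {
    *e = a + b + d;
    *c = a + b;
}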
 