scooby_dooby said:Do you have some documentation/studies of the effect of SMT on an In-Order PowerPC core?
Edge said:Too funny, I ask for proof, and don't get any, but you expect me to provide proof. I'm not the one claiming 50 percent.
scooby_dooby said:Carmack said he got a 50% increase by splitting his game engine; is that not proof enough that it's possible? And 50% is an arbitrary number; the point is that SMT has the potential to be much more effective in these CPUs than in the P4s etc. in the PC world. They are in-order cores for one thing, more prone to stalls, and the code will be highly optimized for the specific CPU, something that doesn't happen in the open-box PC world.
scificube said:If you have some time DeanoC maybe you could take a look at this.
I'm just trying to understand why you said this a little better.
I would think Xenon's VMX units would have enough register space to handle two threads, and I believe that is documented to be the case. However, it's doubtful to me that the VMX units would have 128 128-bit registers like an SPE. This would make the SPE a bit more capable of handling the XLC scheme in question, no? I mean, even if the register space in an SPE were cut in half, I would imagine there would still be more registers per thread than what resides in the VMX units.
Also, I thought the VSUs provided for OoOe at their level. So why would the VMX128 in a Xenon core need a large register space to hide the chip's 'in-orderness'? I thought the register space was largely due to the VMX units being made capable of serving two threads simultaneously, so as to avoid stalls there. No one has yet explained just what the PPE's VMX unit is like or how it operates. (Look Ma! I'm fishin'!) Unless the VSUs do not provide OoOe at their level, I'm guessing you meant 'in-orderness' was being hidden at a higher level.
I get lost though because I don't know why you mentioned Xenon's VMX unit and not a core's general-purpose registers.
As far as what the XLC scheme is trying to pull off, I was under the impression that it was not meant to maximize throughput per se, but to deal with being memory bound due to a large number of perhaps unpredictable DMA requests. So if the flops suffer for doing this, I would imagine it would be relative to a more ideal situation than this. What I mean to say is that in a situation where DMA requests are stalling you, this scheme would seem to be a way of getting around those stalls, and thus by performing flops in between DMA requests, flops performance should improve (relatively). Flops performance would still be less than in a more ideal situation where DMA requests aren't hanging you out to dry.
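For what it's worth, the overlap-compute-with-DMA idea described above is usually done with double buffering. Here is a minimal plain-C sketch of the pattern; `fake_dma_get` stands in for an asynchronous transfer (on a real SPE this would be `mfc_get` plus a tag-group wait), and all names here are mine, not from any SDK:

```c
#include <assert.h>
#include <string.h>

#define CHUNK 4

/* Stand-in for an async DMA fetch; a plain memcpy so the sketch runs
   anywhere. A real SPE would kick off an mfc_get here and wait on the
   tag group just before touching the buffer. */
static void fake_dma_get(float *dst, const float *src, int n) {
    memcpy(dst, src, n * sizeof(float));
}

/* Process one chunk: double every element (hypothetical workload). */
static void process(float *buf, int n) {
    for (int i = 0; i < n; i++)
        buf[i] *= 2.0f;
}

/* Double-buffered loop: while chunk k is being processed, chunk k+1 is
   already in flight, so the compute hides the transfer latency. */
void run(const float *in, float *out, int nchunks) {
    float buf[2][CHUNK];
    int cur = 0;
    fake_dma_get(buf[cur], in, CHUNK);            /* prime buffer 0 */
    for (int k = 0; k < nchunks; k++) {
        int nxt = cur ^ 1;
        if (k + 1 < nchunks)                      /* start the next fetch */
            fake_dma_get(buf[nxt], in + (k + 1) * CHUNK, CHUNK);
        process(buf[cur], CHUNK);                 /* overlaps the "DMA" */
        memcpy(out + k * CHUNK, buf[cur], CHUNK * sizeof(float));
        cur = nxt;
    }
}
```

The point of the ping-pong `cur ^ 1` swap is exactly the relative-flops argument above: the stall time that would otherwise be spent waiting on a fetch is filled with useful work.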
I'm trying to understand what it is you really said. Do you mean that having to unroll loops would break this 'trick'? Or are you saying this 'trick' should not be a first option, as it would adversely affect performance due to loop unrolling? The latter makes sense to me (well, maybe not completely) because this 'trick' seems to be for a specific case. The former I don't understand on my own, so I'm asking for help. I'm also curious whether the scheme has any value anywhere else.
I'm lost again (just point and laugh...everyone else does) on loop unrolling itself....wouldn't this be done at compile time? So then wouldn't unrolled loops affect the size of your code, and thus how much space is consumed in an LS or in cache, rather than the space in the core's general-purpose registers?
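To illustrate the question being asked: unrolling does grow code size (so it costs LS/cache space), but it also costs registers, because the usual payoff comes from keeping several independent partial results live at once. A generic C sketch, with function names of my own invention:

```c
#include <assert.h>

/* Straightforward loop: one accumulator, so every add depends on the
   previous one and an in-order core stalls for the full add latency. */
float sum_rolled(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4 with independent partial sums: more code bytes AND four
   live accumulator registers instead of one, but the four adds per
   iteration don't depend on each other, so the pipeline stays busy. */
float sum_unrolled(const float *a, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    float s = s0 + s1 + s2 + s3;
    for (; i < n; i++)   /* leftover elements */
        s += a[i];
    return s;
}
```

So the answer is "both": the unrolled body is bigger (code size) and holds more values in flight (register pressure), which is why a 128-entry register file makes aggressive unrolling practical.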
Lost I am. Saving I will need. Hides the truth the dark side does...clouds my judgment...or is that just my pills...nope....I probably don't know what I'm talking about.
---------------------------------------------------------------------------------
Separate questions for anybody:
Why would an SPE's iop performance be less than its flops performance? (Is version's Gints number in post #70 wrong...or is this again a special-case kind of thing, like Cell being able to handle 64 threads?)
Why can't a flop be exchanged for an iop? (3D games use flops, not iops, more often than not, no?)
I'm a low-level software guy. I worked with the XLC backend; the majority of my time was spent optimizing for the PowerPC 440, but I've also done code reviews and sat in on many presentations regarding SPE/PPE/Xenon core optimizations being implemented.xbdestroya said:Well he didn't work on Cell per se, he worked on the XLC project overall - so a substantial difference to take into account, though related in some senses. Not to say that he's not well versed in architectural differences of course. And Asher if I'm selling your 'proximity' to Cell short, do feel free to correct me.
The same thing can be said for Cell, you're right.Edge said:Love the optimism, but can the same thing be said for CELL? Hard to program, terrible at general purpose code, nothing more than streaming processors, etc, etc, etc. Nothing but pessimistic crap.
Maybe try to be a bit more open-minded.
one said:Where did you read that?
One technique that looks to be crucial to XB360 programming is to pre-fetch data for hardware threads whenever possible. When a pre-fetch causes an L1 miss, the thread is flushed (not strictly true, but I'm not gonna waffle about this) <-EDIT: damn don't you just hate it when you put the caveat on the wrong paragraph - fixed now.scooby_dooby said:They are In Order cores for one thing, more prone to stalls, also the code will be highly optimized for the specific CPU, something that doesn't happen in the open-boxed PC world.
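The pre-fetch technique described above can be sketched in plain C. `__builtin_prefetch` is a GCC/Clang-specific intrinsic (the Xbox 360 toolchain had its own equivalents, e.g. cache-touch intrinsics mapping to PowerPC `dcbt`); the distance of 16 elements is an arbitrary illustrative choice, not a tuned value:

```c
#include <assert.h>

/* Issue the fetch for data well before it is needed, so the load at the
   top of a later iteration finds the line already in cache instead of
   stalling the hardware thread. */
long sum_with_prefetch(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1); /* read, low temporal locality */
        s += a[i];
    }
    return s;
}
```

On an in-order core with no hardware prefetcher doing this for you, scheduling these hints by hand (or via the compiler) is exactly the kind of CPU-specific tuning scooby_dooby's quote is talking about.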
ERP said:OK, I have a suggestion: someone who downloaded the IBM Cell Simulator could try building a variety of SPU code fragments in both XLC and GCC and compare code size, performance and compilation time (XLC is REALLY SLOOOOOW).
I think you'll find really fast that XLC is no magic bullet.
Solving parallelism with compilers is just not going to happen any time soon.
AFAICS we have a long way to go with compilers that do a good job of instruction scheduling on these in-order cores with large instruction latencies and large register files.
Thanks for the link, but it doesn't suggest they didn't use Renderware. They own Renderware anyway and can overhaul it however they want.iknowall said:It's well known, but here is the link:
Graphics are a large part of the Xbox 360’s allure. What special techniques and innovations has EA developed for their long-running pigskin title?
Jeremy Strauser: This is a brand new graphical engine for Madden on the Xbox 360, so the list of what is new is amazingly long. Things like full head and eye tracking, facial animation, how we did player faces/heads, game animations, lighting...almost everything is new here.
Is this an all new game engine or an advanced version of what we’ve seen for Madden in the current-gen?
Jeremy Strauser: This is an all-new game engine; new rendering engine, new animation system, new player models, new stadium models, etc. We were able to share some key data like plays and player ratings with current gen, but just about everything else is brand new.
http://interviews.teamxbox.com/xbox/1359/Madden-NFL-06-Interview/p1/
I would be happy to have something like VCL 2.0 embedded in some compiler...am I asking too much? I don't think so.ERP said:AFAICS we have a long way to go with compilers that do a good job of instruction scheduling on these in-order cores with large instruction latencies and large register files.
We know the SPEs lack a 32-bit integer multiply instruction, but to say SPEs don't support 32-bit integers is a bit of a stretch IMHO.Asher said:I'd venture to say it's much easier to tune for SMT than to get great integer/logic performance out of the SPEs, which don't even support 32-bit integers.
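To make the distinction concrete: the missing instruction is a full 32x32 multiply, and the standard workaround is to build it from 16-bit partial products (which the SPU's multiply instructions do provide). This is a plain-C sketch of the idea, not actual SPU intrinsic code, and the function name is mine:

```c
#include <assert.h>
#include <stdint.h>

/* 32x32 -> 32 multiply from 16-bit halves:
   a*b mod 2^32 = alo*blo + ((alo*bhi + ahi*blo) << 16).
   The ahi*bhi partial product shifts entirely past bit 31, so it is
   dropped; unsigned overflow wraps mod 2^32, which is exactly what a
   truncating 32-bit multiply wants. */
uint32_t mul32_from_16(uint32_t a, uint32_t b) {
    uint32_t alo = a & 0xFFFFu, ahi = a >> 16;
    uint32_t blo = b & 0xFFFFu, bhi = b >> 16;
    return alo * blo + ((alo * bhi + ahi * blo) << 16);
}
```

So 32-bit integer arithmetic works fine on an SPE; multiplies just cost a few instructions instead of one, which is part of why iop throughput trails flop throughput.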
http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318RealWorldTechnologies said:The estimate given by IBM at ISSCC 2005 was that the DP FP computation in the SPE has an approximate 10:1 disadvantage in terms of throughput compared to SP FP computation.
Given this estimate, the peak DP FP throughput of an 8 SPE CELL processor is approximately 25~30 GFlops
http://www.cooltechzone.com/index.php?option=content&task=view&id=1660&Itemid=0&limit=1&limitstart=3
The SPEs do not possess branch predictors (they rely on software) and have 256KB of memory available to each of them. This is different from cache, as it is not shared between the SPEs. If an SPE wants to access the local memory of another SPE, it has to go via the PPE. The SPEs are not provided with dedicated caches.
Since the SPEs do not possess any branch predictor, they work by removing the need for a loop altogether. A loop is a construct you use when you need to repeat an instruction multiple times. For example, let's take a loop from C++:
Because of the massive number of instructions some loops can expand into, the processor requires a lot of registers to implement this technique successfully, and this is one of the prime reasons why each SPE has been bestowed with 128 registers.
Why Do the SPEs Lack Cache?
To understand the reasoning behind the lack of SPE cache, we must comprehend the concepts of in-order and out-of-order processing.
In-Order Processing
An in-order core is one that processes instructions in the same order as they are received.
For instance, let's say you have four variables A, B, C and D. You give a command to the PC saying add A and B and store the value in C. The next command you give is add C and D and store the value in another variable, E. In BASIC, the commands would look something like:
10 C=A+B
20 E=C+D
The problem here is that statement 20 cannot be executed before the value of A+B is stored in C. This is known as a Read After Write (RAW) dependency. The problem these dependencies create is that even though (at least theoretically) the Cell can process both lines at once, it still has to wait for the RAW dependency to resolve before moving to line two. This means an execution unit of the Cell sits idle in the meantime.
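The BASIC fragment above translates directly into C; a hypothetical version that also shows the contrast with an independent statement (function name and the F output are mine, for illustration):

```c
#include <assert.h>

/* E = C + D cannot issue until C = A + B has written C (a
   read-after-write hazard), so an in-order core serializes them.
   An unrelated add like F = A + D has no such dependency and could
   execute alongside statement "10" on a second execution unit. */
int raw_demo(int A, int B, int D, int *F) {
    int C = A + B;   /* statement "10" */
    int E = C + D;   /* statement "20": must wait for C (RAW) */
    *F = A + D;      /* independent: no hazard with statement "10" */
    return E;
}
```

A compiler scheduling for an in-order core would try to hoist independent work like `*F = A + D` between the two dependent adds; an out-of-order core does the equivalent reordering in hardware.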
SynapticSignal said:I've read that Cell takes around a 90% performance hit with integer and double-precision work...
my 5 cents
http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318
SynapticSignal said:For the efficiency of the SPEs:
The SPEs do not possess branch predictors (they rely on software) and have 256KB of memory available to each of them. This is different from cache, as it is not shared between the SPEs. If an SPE wants to access the local memory of another SPE, it has to go via the PPE. The SPEs are not provided with dedicated caches.
Since the SPEs do not possess any branch predictor, they work by removing the need for a loop altogether. A loop is a construct you use when you need to repeat an instruction multiple times. For example, let's take a loop from C++:
Because of the massive number of instructions some loops can expand into, the processor requires a lot of registers to implement this technique successfully, and this is one of the prime reasons why each SPE has been bestowed with 128 registers.
http://www.cooltechzone.com/index.php?option=content&task=view&id=1660&Itemid=0&limit=1&limitstart=3
SynapticSignal said:Why Do the SPEs Lack Cache?
To understand the reasoning behind the lack of SPE cache, we must comprehend the concepts of in-order and out-of-order processing.
In-Order Processing
An in-order core is one that processes instructions in the same order as they are received.
For instance, let's say you have four variables A, B, C and D. You give a command to the PC saying add A and B and store the value in C. The next command you give is add C and D and store the value in another variable, E. In BASIC, the commands would look something like:
10 C=A+B
20 E=C+D
The problem here is that statement 20 cannot be executed before the value of A+B is stored in C. This is known as a Read After Write (RAW) dependency. The problem these dependencies create is that even though (at least theoretically) the Cell can process both lines at once, it still has to wait for the RAW dependency to resolve before moving to line two. This means an execution unit of the Cell sits idle in the meantime.