Does Cell Have Any Other Advantages Over XCPU Other Than FLOPS?

expletive said:
My point was that if there is in fact such a disparity...the features that would come from that probably wouldn't be things that 95% of the general public would notice or care about.
You might well be right, but that goes for all advances in tech. As the tech level increases, the detectable difference decreases. I'm sure a GOW could be rendered with all the models at 70% of the mesh density without most people noticing. I'm sure the difference between 1000 animated boulders and 2000 animated boulders will largely go unnoticed too. There needs to be a performance disparity of an order of magnitude for people to look and see one is better than the other.

Also, where the performance is applied makes a difference. If you were to have one game with 10,000 rocks and rigid facial animations, versus another with 200 rocks and fluid facial animations, even if the former is technically more demanding, the latter will likely be thought of as the better game. I imagine if you were to take PGR3 and render it on SDTV with no antialiasing, and next to it render GT4 with 16x AA, most people would rate the souped-up GT4 as the more impressive due to IQ and an absence of jaggies. So if Cell were to offer some amazing fluid dynamics, it might go unnoticed by the pundits.
 
Yep, and I think it's unfortunate, because a lot of evolutionary gameplay elements may go unrealized because they won't be viewed as being able to sell units. If you read all these non-industry reviews of the 360 (i.e. USA Today, Ziff-Davis PC-guy, etc.) they ALL FOCUS ON GRAPHICS. I think we're at a point in this gen where there may not be enough of 'everything else' to draw attention away from graphics just yet. Maybe near the end of this gen, when they've really been able to master multithreading, they can get people excited about *gasp* physics and AI for the PS4 and Xbox 720. :)
 
expletive said:
Yep, and I think it's unfortunate, because a lot of evolutionary gameplay elements may go unrealized because they won't be viewed as being able to sell units. If you read all these non-industry reviews of the 360 (i.e. USA Today, Ziff-Davis PC-guy, etc.) they ALL FOCUS ON GRAPHICS. I think we're at a point in this gen where there may not be enough of 'everything else' to draw attention away from graphics just yet. Maybe near the end of this gen, when they've really been able to master multithreading, they can get people excited about *gasp* physics and AI for the PS4 and Xbox 720. :)

I doubt it will take 5 years for that to happen.
 
expletive said:
Yep, and I think it's unfortunate, because a lot of evolutionary gameplay elements may go unrealized because they won't be viewed as being able to sell units. If you read all these non-industry reviews of the 360 (i.e. USA Today, Ziff-Davis PC-guy, etc.) they ALL FOCUS ON GRAPHICS. I think we're at a point in this gen where there may not be enough of 'everything else' to draw attention away from graphics just yet. Maybe near the end of this gen, when they've really been able to master multithreading, they can get people excited about *gasp* physics and AI for the PS4 and Xbox 720. :)

"Things the CPU does" can indirectly and directly improve the looks of a game, particularly in motion. There is the potential to realise detail on a scale not seen before in games - yes, because of CPUs - that if anything the mainstream may be more appreciative of, given that such detail was previously missing in games and so evident in other media (movies etc.). One of the biggest technical "wow" moments for me sofar with the next-gen games was seeing the building explosion in the MGS4, and watching the fine rubble rain down on Snake, and bounce off him very convincingly. Now, we can argue till the cows come home as to whether that was pre-scripted or simulated, but assuming the latter (and I think it's absolutely possible for that to be done), that kind of granularity, we've never seen before - and you could have all the rendering power in the world, but without power to simulate those collisions and that physics etc. you'd never be able to have such detail. Same story with the larger debris, if not also the finer dirt that's flying around in that demo in the wind outside - asides from the actual wind simulation itself, which is rather convincing when you see how stuff flips and flicks around, those pieces of paper/plastic/whatever actually collide with the tanks and so forth, you can see one piece get stuck to the side of a tank for example. Although you may have to look to point these things out, if such detail wasn't there I think even a casual eye would discern a difference (because, for example, if you didn't have collision detection for the stuff flying around in the wind, you wouldn't have that stuff probably at all - and things would hence look less "gritty", the wind would not be tangible etc.). They're just a couple of examples of many I hope and think we'll see going forward (one of many in that trailer alone in fact).

Pure rendering power can only go so far without enough simulation power to realise detail as mentioned above, and make it possible to even render in the first place.
 
mckmas8808 said:
I doubt it will take 5 years for that to happen.

To change the paradigm where the public focuses on things like physics and AI instead of graphics? I think 5 years is actually a generous estimate.
 
Edge said:
That seems highly unlikely. The dual threading of the Xbox 360 cores uses separate register banks to allow fast context switches. That's great, but I can't imagine a 50 percent increase. Can you provide any evidence of this?
That's not how SMT works in Xenon. Both threads are "in flight" at the same time, but cannot access the same components of the chip at the same time. If these threads are utilizing different components of the chip -- one using the VMX-128 unit and the other the ALU, for instance -- it is extremely effective at increasing ILP.

Most of the "10-20%" performance increases you've seen are probably from the Pentium 4 w/ HT running multi-threaded applications originally designed for SMP, not SMT.
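
As a concrete illustration of the "different components" point, here's a minimal hypothetical sketch (plain C with pthreads, nothing Xenon-specific; the function names and constants are mine). One thread's hot loop is floating-point work, the other's is integer/logic work, so on an SMT core the two can occupy different execution units in the same cycles - that overlap is where the throughput gain comes from. (A real test would also pin both threads to the same physical core, which this sketch omits.)

```c
#include <pthread.h>
#include <stdio.h>

#define N 100000000L

/* Hot loop of floating-point work: keeps the FP unit busy. */
static void *float_worker(void *arg)
{
    (void)arg;
    volatile float acc = 0.0f;
    for (long i = 0; i < N; i++)
        acc = acc * 1.000001f + 0.5f;
    return NULL;
}

/* Hot loop of integer/logic work: keeps the integer ALU busy. */
static void *int_worker(void *arg)
{
    (void)arg;
    volatile long acc = 0;
    for (long i = 0; i < N; i++)
        acc = (acc + i) ^ (acc >> 3);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, float_worker, NULL);
    pthread_create(&t2, NULL, int_worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    puts("done");
    return 0;
}
```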
 
Edge said:
What do you have to provide here?
Nothing that hasn't been said hundreds of times before here and on other sites.

The SPEs are not at all effective at integer and logic operations compared to the PPE. There are many reasons for this; the most obvious way to explain it is that the SPUs were originally called Stream Processing Units and are designed for massive FLOPS at the expense of MIPS and logic operations.
 
version said:
Barry_Minor :

IBM's SPE XLC compiler is adding the function to compile to register ranges which would enable the threading model I talked about. We have coded examples of this in SPE asm to validate the concept.
I worked on the xlc compilers for two years (until September when I left to go back for a Masters), and the simplest way to debunk this is to say it's nonsense.
 
Asher said:
the SPUs were originally called Stream Processing Units

Do you have proof of that? I've been following CELL's news since the beginning and I have never heard that.

A pure streaming processor passes its results from one processor to the next, but EACH SPE has full access to external memory and is in no way forced to pass results from one SPE to the next. The SPEs are far from being streaming processors ONLY, as they are general purpose processors, but with 300 GB/sec of internal bandwidth and 256 KB of localized SRAM per SPE, they can be used very well for streaming algorithms, and if used as such, would blow the Xbox 360 CPU out of the water.

The compiler for the SPE fully supports integer data types. Integer performance for the PPE is superior to an SPE, but don't forget there are SEVEN SPEs and one PPE, versus the Xbox 360 CPU's three PPEs.

So as you can see, the SPEs have the flexibility of a general purpose processor, having full access to main memory and being able to work on integer data types through a compiler, all the while supporting streaming algorithms. This FLEXIBILITY is a STRENGTH and not a weakness.

We already have an example where CELL is SUPERIOR to the Xbox 360 CPU, and that is in Ageia's physics API. The first of many examples.
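
To make the streaming point concrete, here's a minimal hypothetical sketch of the classic double-buffered DMA pattern on an SPE, using the SPU-side MFC intrinsics from the Cell SDK's spu_mfcio.h (process_chunk() is a placeholder of mine, and total is assumed to be a multiple of CHUNK). The SPE computes on one local-store buffer while the MFC pulls the next chunk in from main memory, which is exactly how a streaming algorithm keeps the 256 KB local store fed:

```c
#include <spu_mfcio.h>

#define CHUNK 16384  /* bytes per DMA; 16 KB is the MFC's per-transfer limit */

/* Two ping-pong buffers in local store, aligned for efficient DMA. */
static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process_chunk(char *data, unsigned size);  /* placeholder */

void stream(unsigned long long ea, unsigned long long total)
{
    int cur = 0;
    unsigned long long done = 0;

    /* Prime the pipeline: start fetching the first chunk on tag 0. */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    while (done < total) {
        int next = cur ^ 1;

        /* Kick off the DMA for the following chunk on the other tag... */
        if (done + CHUNK < total)
            mfc_get(buf[next], ea + done + CHUNK, CHUNK, next, 0, 0);

        /* ...then block only until the *current* chunk has landed. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        process_chunk(buf[cur], CHUNK);  /* compute overlaps the next DMA */

        done += CHUNK;
        cur = next;
    }
}
```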
 
Asher said:
If these threads are utilizing different components of the chip -- one using the VMX-128 unit and the other the ALU, for instance -- it is extremely effective at increasing ILP.

Well, each SPE is dual-pipelined, so by your logic each SPE can support dual threads. I don't consider that dual threading, and your example is incorrect as well: dual threading would have to mean being able to run dual threads on the ALU, which it cannot; while one thread is running, the other has to stop. Dual threading on the PPEs is only for faster context switches, and thus supports a 10 to 20 percent increase at most.

Dual threading does not mean dual execution!!!

Your example would also be limited to algorithms that exercise both DIFFERENT types of execution at once, which I would think would be a RARE occurrence.
 
Asher said:
I worked on the xlc compilers for two years (until September when I left to go back for a Masters), and the simplest way to debunk this is to say it's nonsense.

Asher, no offense, but I just don't see Barry Minor flat-out lying about this. Maybe there's some shade-of-grey thing going on, but I think you're going to have to offer up more than your resume, since Barry's is more impressive. Now that you've given us the 'simple' way of debunking it, I think we're ready for the next step up in difficulty.
 
SPEs have a large register context because they need lots of loop unrolling to get decent speed (same for XeCPU, hence VMX128). If the compiler halves the register file, it will seriously limit the extent to which the compiler can hide the chip's in-orderness, and the FLOPS rating will be reduced.
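
To put that in code: here's a hypothetical sketch using the SPU C intrinsics (the function, the 4x unroll factor, and the assumption that n is a multiple of 4 are all mine). Each accumulator ties up registers for the whole loop, and a production compiler would unroll further and software-pipeline the loads, eating registers faster still; halve the register file and you halve how much of this latency hiding is possible.

```c
#include <spu_intrinsics.h>

/* 4-wide SIMD multiply-accumulate, unrolled 4x. The four independent
   accumulator chains let the in-order SPU keep issuing spu_madd every
   cycle instead of stalling on the multi-cycle FP latency.
   Assumes n is a multiple of 4. */
vec_float4 dot4(const vec_float4 *a, const vec_float4 *b, int n)
{
    vec_float4 acc0 = spu_splats(0.0f);
    vec_float4 acc1 = spu_splats(0.0f);
    vec_float4 acc2 = spu_splats(0.0f);
    vec_float4 acc3 = spu_splats(0.0f);

    for (int i = 0; i < n; i += 4) {
        acc0 = spu_madd(a[i+0], b[i+0], acc0);  /* independent chains,  */
        acc1 = spu_madd(a[i+1], b[i+1], acc1);  /* each holding its own */
        acc2 = spu_madd(a[i+2], b[i+2], acc2);  /* registers live       */
        acc3 = spu_madd(a[i+3], b[i+3], acc3);
    }
    return spu_add(spu_add(acc0, acc1), spu_add(acc2, acc3));
}
```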
 
DeanoC said:
SPEs have a large register context because they need lots of loop unrolling to get decent speed (same for XeCPU, hence VMX128). If the compiler halves the register file, it will seriously limit the extent to which the compiler can hide the chip's in-orderness, and the FLOPS rating will be reduced.

Deano while you're out and about, I'm wondering if you might expound on your (indirect) thoughts in post #113. :)

I felt that one of the root goals with the Cell architecture was to get the programming to a point where eventually the programmer would simply write code for either the PPE or the SPEs - with an emphasis on the SPEs - without worrying about the actual number of available hardware resources, the thought being that the chip itself (or a network thereof) would divvy up the tasks. Indeed, even as it stands now, the chip can handle SPE threads in numbers beyond what it has available in hardware at any one time, and can dynamically schedule for that. Now obviously, though, it can't in fact execute beyond its immediate resources...

Anyway, I forget where I read it, but STI is supposedly working on just such a compiler, to take code and recompile it for the SPEs while at the same time threading it to a certain extent. Not sure how far along that project is at this point, but would that address your concerns in a sense? Obviously no one's going to try and talk down the state of Microsoft's compiler efforts, but just wondering (since the thread is Cell-focused) if all available information was taken into account with that blog entry, and if so (or if not) what your thoughts on it all might be.
 
xbdestroya said:
Anyway, I forget where I read it, but STI is supposedly working on just such a compiler, to take code and recompile it for the SPEs while at the same time threading it to a certain extent. Not sure how far along that project is at this point, but would that address your concerns in a sense? Obviously no one's going to try and talk down the state of Microsoft's compiler efforts, but just wondering (since the thread is Cell-focused) if all available information was taken into account with that blog entry, and if so (or if not) what your thoughts on it all might be.
AFAIK xlc (IBM) is the experimental compiler (that's always been the way), so you might get an insight into some possible strategies from that, but currently I've seen nothing in the same league as the MS research stuff.

Currently I'd say that Sony is working on normal compiler tech (after all, just generating optimised single-threaded code from C isn't trivial); IBM are trying mad ideas in a desperate attempt to make Cell easy, but IMHO they really seem to be missing the point... and MS are actually trying to solve the issue the right way (accept that there is no magic bullet, so change the language to help coder and compiler).

I have even less faith in auto-parallelisation than in auto-vectorisation (and I have NO faith in auto-vectorisation).
 
DeanoC said:
AFAIK xlc (IBM) is the experimental compiler (that's always been the way), so you might get an insight into some possible strategies from that, but currently I've seen nothing in the same league as the MS research stuff.

Currently I'd say that Sony is working on normal compiler tech (after all, just generating optimised single-threaded code from C isn't trivial); IBM are trying mad ideas in a desperate attempt to make Cell easy, but IMHO they really seem to be missing the point... and MS are actually trying to solve the issue the right way (accept that there is no magic bullet, so change the language to help coder and compiler).

I have even less faith in auto-parallelisation than in auto-vectorisation (and I have NO faith in auto-vectorisation).

Well, can't deny the value of honesty - thanks for the input Deano.

It was indeed on XLC that they were looking to pursue this, so I guess I'll just hope for the best in terms of their efforts!
 
I've got this document here called xbox_cpu_pipelines from Gamefest 2005.
  • each of the hardware threads in each core is "equal" - there's no "primary" and "secondary"
  • a stall in one hardware thread (during decode and issue) stalls the other hardware thread - e.g. if one thread dual-issues integer ADD and integer ADD, then the second ADD will stall (one cycle), until the first ADD has been issued. This stall only impacts the instruction queue, it doesn't impact the execution pipelines
  • if the two ADDs are dependent, then the second ADD will have to wait 2 cycles - again it only affects the instruction queue, but of course causes the other hardware thread to stall another cycle
  • the integer instruction queue ("instruction decode and dependency checking block") only checks integer instructions (vector load/stores count as integer instructions while addresses are being generated) though vector instructions are pipelined through it, too
  • the Vector Instruction Queue follows on from the integer instruction queue. The VIQ handles structural (dual-issuable?) and data issues, and will cause stalls to resolve them, just like the earlier queue. A VIQ stall can also stall the integer instruction queue, if a vector instruction is at the end of it, ready to move into the VIQ
  • data dependency between two consecutive instructions in the vector pipeline will cause up to a 12 cycle stall in VIQ (14 if an estimating instruction)
etc. It's nice getting this stuff from the horse's mouth.
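
Translating the ADD bullets into C for illustration (the functions and variable names are mine, and a real compiler would schedule around this; read the comments as describing the instruction patterns the document refers to):

```c
/* Two integer ADDs fetched together dual-issue, but the second waits
   one cycle in the instruction queue behind the first. */
int independent(int w, int x, int y, int z)
{
    int a = w + x;   /* ADD #1 issues */
    int b = y + z;   /* ADD #2: no dependency, 1-cycle queue stall */
    return a + b;
}

/* If the second ADD reads the first's result, it waits 2 cycles in the
   queue instead - and stalls the other hardware thread an extra cycle. */
int dependent(int w, int x, int y)
{
    int a = w + x;   /* ADD #1 issues */
    int b = a + y;   /* ADD #2: depends on 'a', 2-cycle wait */
    return b;
}
```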

Jawed
 
DeanoC said:
and MS are actually trying to solve the issue the right way (accept that there is no magic bullet, so change the language to help coder and compiler).
I would take changed hardware too, to help me :p
Ok, ok, I promise I'll stop. But I'm curious: what kind of new language development is MS working on?

Gubbi said:
Oh and by the way, the same mechanism can be used on both the PPE and the XCPU, and probably will be (vertical threading anyone?)
I 100% agree it's a pissing contest, but it is true that the SPE has a bit of an advantage over the PPE/PPX in this area: a single 128-entry GPR file, as opposed to three separate register files of only 32 entries each.
Coupled with the predictability of the local store, it simplifies some things.
 
Jawed said:
  • each of the hardware threads in each core is "equal" - there's no "primary" and "secondary"


Whatever - my whole point in using the term "secondary thread" was just to point out that having dual registers to support dual threading does not mean dual execution. Dual threading supports at most a 10 to 20 percent performance advantage.

I'm just trying to point out that dual threading does not give you a 100 percent increase per core, as I'm sure some people think it does.

Sun's UltraSPARC T1 (CoolThreads) does FOUR threads per core! That's much better!
 
Edge said:
Do you have proof of that? I've been following CELL's news since the beginning and I have never heard that.
I'm sure it's been on the internet before. They were referred to as Stream Processing Units and the Primary Processing Unit internally by the engineers; before it went public, the marketing and PR people decided to rename them to something more impressive-sounding.

A pure streaming processor passes its results from one processor to the next
I'm not going to play with semantics. There's no point.

That was the original intention of the SPEs. Their high-level design comes from Toshiba, and they are designed mostly to process streams of multimedia, much like HDTV feeds. They were never designed to be good at "general processing", in other words integer and logic operations. If you look at the design of the SPEs and the fact that they're all vectorized with minimal branching support and a limited ALU, it becomes even more obvious.

The compiler for the SPE fully supports integer data types. Integer performance for the PPE is superior to an SPE, but don't forget there are SEVEN SPEs and one PPE, versus the Xbox 360 CPU's three PPEs.
This is a common misconception (predominantly on ArsTechnica): while the Xbox 360's CPU cores look like PPEs from a high level, they're not. They support different instructions and have very different vector units, among other things. Both the PPE and the Xenon cores were derived from an identical "base", but one-on-one a Xenon core will outperform a PPE core.

Aside from that, I understand very well that xlc supports integer data types for the SPEs. I'm also very well aware that it is not meant for any serious work where integer performance matters; I'm well aware of the optimizations (both IPA and in the backend) that are disabled for SPEs, and I'm also well aware of the performance penalties incurred to process ALU-style instructions on the SPEs.

So as you can see, the SPEs have the flexibility of a general purpose processor, having full access to main memory and being able to work on integer data types through a compiler, all the while supporting streaming algorithms. This FLEXIBILITY is a STRENGTH and not a weakness.
You're slightly changing the argument. I have never said SPEs cannot perform integer and logic instructions. They can, but it's a bit like doing a "cone challenge" with a semi-truck. It works, you just gotta go real slow. ;)
 