About next-gen consoles' CPUs...

Good interview.

IBM is one of the top tech companies in the world. They must be given credit for looking at a problem and tackling it head on. Obviously IBM has smart people who aim to do the best they can, and they believe in their product. It will be interesting to see if Intel and AMD take similar approaches down the road. As the interviewer notes, CELL is best at media applications; the question is: when will Intel and AMD get serious about high-end acceleration on a level similar to CELL? It may be another 5 years before they can fit a P4/AMD64 chip with SPE-like units, but it will be interesting. Obviously IBM has a head start.
 
I think all three companies should be given credit. As one of the translated articles explained, there were different ideas and compromises: IBM wanted a multi-core PPC, Toshiba an SPE-only device, and Sony negotiated a compromise that seems to bring the best of both worlds. From most talk it sounds like IBM practically did everything and Sony+Toshiba just fronted money or something. Though IBM's engineers, outnumbering the other companies', obviously had the main input in circuit design, I think the whole architecture is a team effort that no one company would have developed if they had gone it alone. I think this is important to appreciate - people can manage better working together and sharing resources and ideas rather than competing.
 
Shifty Geezer said:
I think all three companies should be given credit. As one of the translated articles explained, there were different ideas and compromises: IBM wanted a multi-core PPC, Toshiba an SPE-only device, and Sony negotiated a compromise that seems to bring the best of both worlds. From most talk it sounds like IBM practically did everything and Sony+Toshiba just fronted money or something. Though IBM's engineers, outnumbering the other companies', obviously had the main input in circuit design, I think the whole architecture is a team effort that no one company would have developed if they had gone it alone. I think this is important to appreciate - people can manage better working together and sharing resources and ideas rather than competing.

STI all the way baby!!! Sorry, I had to say that :) I think you're right. I read all the time where people say that IBM did everything, which like you said is incorrect. I really hope that the CELL processor becomes the next big thing. I don't think it will overtake Intel and AMD as far as computers go, but the CELL does have its place.
 
Acert93 said:
IBM is one of the top tech companies in the world.

yep.

They must be given credit for looking at a problem and tackling it head on.

they have already been given credit on this board. but if you think they're not being given enough credit, please feel free to start a dedicated thread on that - this one here is about other things.

Obviously IBM has smart people who aim to do the best they can and they believe in their product.

the same can be said about all the other cpu vendors.

It will be interesting to see if Intel and AMD take similar approaches down the road.

aha.

As the interviewer notes, CELL is best at media applications; the question is: when will Intel and AMD get serious about high-end acceleration on a level similar to CELL?

last time i checked their cpus had a _completely_ different target to aim for. unless they change their target they don't have a reason to come up with a cell-like approach. i hope you don't believe cell would be good at running pc apps.

It may be another 5 years before they can fit a P4/AMD64 chip with SPE like units, but it will be interesting. Obviously IBM has a headstart.

again, if they don't shift their paradigm they have zero incentive to create cells for the pc market.


ps: could a kind moderator, please, delete all the posts on this page starting from after nAo's last post up to and including this post of mine, as an originally good thread is about to be turned into mindless gibberish. thank you.
 
darkblu said:
ps: could a kind moderator, please, delete all the posts on this page starting from after nAo's last post up to and including this post of mine, as an originally good thread is about to be turned into mindless gibberish. thank you.

I have to agree. I mean this was the lowest point:

The Cell processor itself has in excess of 200 gigaflops, which is 200 billion operations per second. So, imagine if you had 5 billion people in the world doing 40 calculations a second—that would be pretty amazing
:???:

Major clean up needed, including this post.
 
darkblu said:
i hope you don't believe cell would be good at running pc apps.

As far as I am concerned it does not need to have awesome performance at running typical PC apps, I am happy with decent performance there ;).

I mean, PS2 Linux manages to be usable (slow and annoying, but I have used worse things), and what does it run on? A system with 32 MB of RAM (4 MB of VRAM which the CPU cannot directly read, besides using the DMA to dump it all into main RAM - not exactly the fastest thing on earth, but good enough to take screenshots), a slow HDD, and an in-order 2-way super-scalar 300 MHz CPU with 8 KB of L1 D-cache, 16 KB of L1 I-cache, and NO L2 cache.

Both Xbox 360 and PLAYSTATION 3 seem more than powerful enough to run a modern OS, multi-tasking your instances of Firefox, Open Office, etc... without too much hassle if the OS is well designed/optimized (see the benchmarks of the ex BeOS now ZetaOS 1.0 on even slow-ass PC's).

Different approaches for different target markets, but each still able to do a decent job in the other's area of expertise, so to speak... or will you tell me that a dual-core Athlon 64 is uber-slow in 3D processing (fixed-function T&L and Vertex Shading)?
 
Panajev2001a said:
As far as I am concerned it does not need to have awesome performance at running typical PC apps, I am happy with decent performance there ;)

well, yes, pana, you can crack nuts open with a juicemaker too : )

I mean, PS2 Linux manages to be usable (slow and annoying, but I have used worse things), and what does it run on? A system with 32 MB of RAM (4 MB of VRAM which the CPU cannot directly read, besides using the DMA to dump it all into main RAM - not exactly the fastest thing on earth, but good enough to take screenshots), a slow HDD, and an in-order 2-way super-scalar 300 MHz CPU with 8 KB of L1 D-cache, 16 KB of L1 I-cache, and NO L2 cache.

yes, and no one ever claimed ps2 was unable to run linux, or any other desktop os one would care to port to it for that matter. but how many people do you know who used their ps2s as desktops under linux? it was there for enthusiasts to tinker with the hw, not for desktop users to do whatever they do with desktops. know why? because a 300MHz pentium2 platform was better at that in price/performance terms, and the people behind ps2linux were well aware of that.

Both Xbox 360 and PLAYSTATION 3 seem more than powerful enough to run a modern OS, multi-tasking your instances of Firefox, Open Office, etc... without too much hassle if the OS is well designed/optimized (see the benchmarks of the ex BeOS now ZetaOS 1.0 on even slow-ass PC's).

yes, the next generations of consoles will be 'powerful enough' to run modern apps. question is, will they be more powerful at it when compared to the next generation of x86s/g5s? i doubt it very much. (no, i don't need to see the zeta benchmarks, beos 5 has been my home platform for the past 4 years, i'm aware of what difference the os design/architecture makes.) so, would i rather see zeta running on a dual athlon64 mp than on a xcpu/cell - bloody right yes.

Different approaches for different target markets, but each still being able to do some kind of good job in the other's area of expertise so to say... or you tell me that a dual-core Athlon 64 is uber-slow in 3D processing (fixed-function T&L and Vertex Shading) ?

no, an athlon would not be slow at T&L at all, and yet a cell would wipe the floor with it at that task. now take a cell-targeted app and translate that to an athlon64 platform - you think that would go smooth and rosy?
 
No, of course it would not.

My point is: what do I want to do with it?

Would I rather have an architecture screamingly fast at general-purpose computing but not as good at multimedia processing, or one that is screamingly fast at multimedia processing but not as good at general-purpose computing?

The answer is both ;)... One machine running Windows XP and one running PS3 Linux (please SCE, release it).

Chances are that this time I will use PS3 Linux more as a Desktop rather than only as a remote compiling/processing machine (since I started tinkering more and more with PS2 development with SPS2, I used the PC to code in... Visual Studio.NET is not something I want to part with ;)).

yes, the next generations of consoles will be 'powerful enough' to run modern apps. question is, will they be more powerful at it when compared to the next generation of x86s/g5s? i doubt it very much. (no, i don't need to see the zeta benchmarks, beos 5 has been my home platform for the past 4 years, i'm aware of what difference the os design/architecture makes.) so, would i rather see zeta running on a dual athlon64 mp than on a xcpu/cell - bloody right yes.

Sorry, I was not trying to insult your computer expertise, just making a point.

Most applications, especially because few people would try to optimize them for PS3 at all (they would just port them quickly), would run much faster on your Athlon 64 platform. But if you do not want to dual boot with Linux on your PC (I like Windows XP on it just fine), then the next generation of consoles becomes much more appealing as a potential Linux solution, for homebrew/learning purposes but also as a worthy Linux Desktop (I doubt the 3.2 GHz PPE is much slower than the 1.5 GHz Celeron II that is running this laptop right now... certainly with much slower RAM to boot ;)).
 
Panajev2001a said:
I mean, PS2 Linux manages to be usable (slow and annoying, but I have used worse things), and what does it run on? A system with 32 MB of RAM (4 MB of VRAM which the CPU cannot directly read, besides using the DMA to dump it all into main RAM - not exactly the fastest thing on earth, but good enough to take screenshots), a slow HDD, and an in-order 2-way super-scalar 300 MHz CPU with 8 KB of L1 D-cache, 16 KB of L1 I-cache, and NO L2 cache.

Off-topic, but not to be outdone: Dreamcast runs Linux on a 200 MHz CPU with only 16 MB of RAM and no HDD... now that's slow.
 
How is Cell going to run Linux well? I mean it's not like the OS is designed to run on like 8 cores. At least I don't know that it is. LOL. Cell may run it terribly too. :)
 
swaaye said:
How is Cell going to run Linux well? I mean it's not like the OS is designed to run on like 8 cores. At least I don't know that it is. LOL. Cell may run it terribly too. :)

I doubt it; I have first-hand experience running Solaris and not ultra-light applications on an UltraSPARC II processor, which was in-order, super-scalar, and clocked at way less than 3.2 GHz. I do not think the PPE is slower than that USII I was using.

IBM Germany, which is finishing the Linux kernel port for CELL, is working on a way for developers and applications to access the SPEs (SPUfs), and is likely doing some optimization work, even if only at the GCC level, making sure code is well optimized for the PPE. Nothing bars SCE from optimizing the Linux source code further when and if they release PS3 Linux (I hope they do release it :)).
 
Oh, I don't doubt they can get it working on Cell, or to be more precise the PPE, but finding a way to utilize those SPEs without a lot of hand-optimized code is a whole other matter...
 
Late to the party...just got off a plane...so...yeah, anyways...

Enjoyed your post, Nao. If I'm hearing you correctly, aren't you just telling programmers to vectorize their code? Someone is still going to have to go through and do the hard work of writing the trace(vect_data1, vect_data2, ...) function and making it work on multi-core CPUs.

MATLAB works much like this - all its functions are designed to take n-dim matrix inputs, and those MATLAB functions are optimized for whatever system you're running on (x86, Alpha, UltraSparc...). I'm pretty sure MathWorks has also written multi-core MATLAB libraries, since MATLAB gets decent speed-up on multi-CPU systems, and I'm sure-as-hell not writing parallelized code for a lowly MATLAB script.

I think parallelized libraries are the way to go - only a few smart people need to do the hard work of writing parallelized functions, and everyone else can use their libraries (for a fee, of course... ;)). Middleware libraries?
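To make that concrete, here's a minimal C++ sketch of the idea, assuming a hypothetical library function `vec_scale` (the name, signature, and fixed worker count are mine, not from MATLAB or any real middleware): the caller writes plain vector code, and the library decides how to split the loop across threads.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical "parallelized library" routine: the caller never writes any
// threading code; the library splits the index range across worker threads.
void vec_scale(std::vector<float>& data, float k, unsigned workers = 2)
{
    std::vector<std::thread> pool;
    const std::size_t chunk = (data.size() + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = w * chunk;
        const std::size_t end = std::min(data.size(), begin + chunk);
        pool.emplace_back([&data, k, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                data[i] *= k;  // each worker owns a disjoint slice, no locking needed
        });
    }
    for (auto& t : pool) t.join();
}
```

The caller's code stays sequential-looking (`vec_scale(v, 2.0f)`), which is exactly the appeal: the parallelism lives behind the library boundary.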
 
To Panajev:

Just read your first post on this thread - great read. But it got me thinking: by vectorizing, you might lose some of the data locality inherent in the game scene. To take Nao's example:

// Nao's vectorized code///////////////////////////////////////////////////////////////
common_particle_update( IN vec3 particlePosition, IN vec3 particleSpeed, UNIFORM float deltaTime, UNIFORM polysoup roomDatabase....)
{
    // compute next frame particle position
    vec3 nextPosition = particlePosition + particleSpeed*deltaTime

    // check if this particle is going to hit something..
    intersection check = trace(particlePosition, nextPosition, roomDatabase)

    // create a new explosion particle if there is a hit, otherwise update particle position
    if (check.outcoming == true) then create(check.intersection_data, particle_explosion)
    else particlePosition = nextPosition
}
///////////////////////////////////////////////////////////////////////////////////////////

Might be slower than code that "recognizes" data locality (the important changes are the fetchLocalGeometry call and the use of localDatabase in trace):

///////////////////////////////////////////////////////////////////////////////////////////
for (each particle cloud)
{
    cloud_particle_update( IN vec3 cloudParticlePosition, IN vec3 cloudParticleSpeed, UNIFORM float deltaTime, UNIFORM polysoup roomDatabase....)
    {
        // compute next frame particle position
        vec3 nextCloudPosition = cloudParticlePosition + cloudParticleSpeed*deltaTime

        // get local geometry around particle cloud
        polysoup localDatabase = fetchLocalGeometry(cloudParticlePosition, nextCloudPosition, roomDatabase)

        // check if this particle is going to hit something..
        intersection check = trace(cloudParticlePosition, nextCloudPosition, localDatabase)

        // create a new explosion particle if there is a hit, otherwise update particle position
        if (check.outcoming == true) then create(check.intersection_data, particle_explosion)
        else cloudParticlePosition = nextCloudPosition
    }
}
///////////////////////////////////////////////////////////////////////////////////////////

This is sort of a half-way compromise between the two code samples Nao gave. It will increase per-particle special processing, but my point is, there must be a sweet spot between the two examples. By making the code explicitly aware that each localized particle cloud depends only on the local geometry (and not the geometry on the other side of the room), I think this code can be made to run faster.

I guess a well-written trace(...) function would also solve this problem (at least in the example I give), but I'm sure there are other situations where leaving it up to the function to find data locality is not good enough.
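For the curious, here's a rough C++ sketch of what a fetchLocalGeometry-style helper could look like, kept 1D for brevity (all names are illustrative, not from any real engine; a real version would use a 3D grid, BVH, or octree): triangle indices are bucketed into a uniform grid, and a moving cloud gathers only the buckets its swept range overlaps instead of the whole room database.

```cpp
#include <cmath>
#include <vector>

// Hypothetical uniform grid over one axis of the scene. Each cell stores the
// indices of the triangles that fall inside it.
struct UniformGrid {
    float origin = 0.0f;
    float cellSize = 1.0f;
    std::vector<std::vector<int>> cells;  // triangle indices per cell

    int cellOf(float x) const {
        int c = static_cast<int>(std::floor((x - origin) / cellSize));
        int last = static_cast<int>(cells.size()) - 1;
        return c < 0 ? 0 : (c > last ? last : c);  // clamp to grid bounds
    }

    // Gather candidate triangles along the swept range [from, to]:
    // this is the "local database" handed to trace() instead of the room.
    std::vector<int> fetchLocal(float from, float to) const {
        int lo = cellOf(std::fmin(from, to));
        int hi = cellOf(std::fmax(from, to));
        std::vector<int> local;
        for (int c = lo; c <= hi; ++c)
            local.insert(local.end(), cells[c].begin(), cells[c].end());
        return local;
    }
};
```

The point is the same as in the pseudocode above: the candidate set shrinks from the whole polysoup to a few cells' worth of triangles, which is also a much friendlier working set for a small local store.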

BTW, nice to see you around Panajev...:D
 
nondescript said:
Enjoyed your post Nao. If I'm hearing you correctly, aren't you just telling programmers to vectorize their code?
No, I'm not. In fact the first code example can be vectorized just like the second one.
What I want to do is overcome some common problems one would have working on the next-gen consoles' processors.
Since we don't have out-of-order execution to help us hide memory latency, we should take special care to make sure processing cores don't waste clock cycles waiting for data to process.
On processors like the SPEs we don't have branch prediction, so we should also try to avoid branchy code.
Moreover, all these new high-frequency SIMD units have a relatively high latency associated with floating point math ops, and this is a problem if we are waiting for a result and have no other work to do in the meantime.
One solution for this is fine-grained multithreading; another solution, for processors that don't have any kind of fine-grained multithreading support, is trying to work on more than one 'object' at the same time.
SPEs are single-threaded processors, but they have so many vector registers (128) because they are designed to efficiently work on many objects at the same time.
XeCPU's cores have customized AltiVec units, also with 128 registers, because a couple of threads per core are in many cases not enough to hide floating point op latency.
With the second example I tried to address some of these problems:
  • branches -> a two-way lookup table reference and write-out mem ops (results written to memory can be read thousands of clock cycles later, so mem latency can be completely hidden in this case)
  • floating point math op latency -> in the ray-triangle intersection test more than one test can easily be computed at the same time: going horizontal (one ray per vector component, performing 4 ray tests at the same time) and going vertical (multiple 4-ray test blocks per inner loop)
  • mem latency -> processing data in batches makes data prefetch efficient (fewer mem ops on big data structures instead of many mem ops on small data structures) and automatically exploitable without any explicit programmer intervention, at least in my simple example ;)
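A tiny C++ sketch of the first bullet, as one way to read it (the Particle type, names, and the scalar test are illustrative, not the real SPE code): the per-particle if/else is replaced by a 0/1 test outcome used to index a table of output streams, so results are just written out to memory and consumed later.

```cpp
#include <vector>

// Hypothetical "two-way lookup table + write-out" illustration: the branch
// becomes an index into a table of destination buffers. (std::vector's
// push_back still branches internally; on an SPE you would append to
// preallocated DMA-able buffers instead.)
struct Particle { float pos, vel; };

void update_batch(const std::vector<Particle>& in, float dt, float wall,
                  std::vector<Particle>& alive, std::vector<Particle>& exploded)
{
    std::vector<Particle>* out[2] = { &alive, &exploded };
    for (const Particle& p : in) {
        Particle next{ p.pos + p.vel * dt, p.vel };
        int hit = next.pos >= wall ? 1 : 0;  // a comparison result, not control flow
        out[hit]->push_back(next);           // table lookup picks the output stream
    }
}
```

The downstream code then processes the 'alive' and 'exploded' batches separately, which is also what makes the prefetch-friendly batching in the third bullet fall out for free.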

ciao,
Marco
 