xenon and PS3: general purpose performance....

aaaaa00 · Jan 24, 2005

If the PS3 turns out to be just 1 or 2 CELLs...

Isn't it possible that xenon (with a triple dual threaded core CPU) might end up with higher general purpose code performance than PS3 with its single or dual PU and 8/16 APUs?

How much of the code do you reckon in a next generation title is gonna be branchy and integer-centric?

Just something to throw out there... debate away!

DemoCoder · Jan 24, 2005

How many FP vector units does the purported Xenon CPU have? If it's 1-2, then the Xenon ends up with 3-6 "APUs" while the PS3 has 8, each asynchronous with its own memory. The only unknown factor is whether CELL can keep all its units fed.

Fafalada · Jan 24, 2005

Going on currently available public info, a completely random answer has about the same chance of being accurate as a guess that makes about a hundred assumptions along the way.

But since this is what we do on this forum so often (assumption making) here you go:
Assuming that my comic remarks from that other thread were correct, and PU and Xenon core are indeed architecturally closely related - we will assume the same IPC per core.
The current clock speed assumptions are ~3.5GHz and 4-4.5GHz, and your topic question puts the number of PUs at 1 or 2.

At the high end estimate (2*4.5) we get a very similar IPC index for both console configurations (though still a couple of % points higher for Xenon); at the low end (1*4) we get the PU rating at around 40% of XCPU.

This is also assuming that XCPU is performing 100% nothing but general purpose code, leaving all those nice vector coprocessors idle.
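The arithmetic behind those two estimates can be sketched as cores times clock, since equal IPC per core is already assumed. This is only a rough illustration: the post does not say which clock belongs to which chip, so the sketch assumes the ~3.5GHz figure is Xenon's (three cores, per the opening post) and the 4-4.5GHz range is the PU's. The function name `relative_rating` is made up for the example.

```python
# Rough cores * clock index, assuming equal IPC per core (as in the post).
# ASSUMPTION: Xenon = 3 cores at 3.5 GHz; PU = 1 or 2 cores at 4.0-4.5 GHz.

def relative_rating(cores, clock_ghz, ref_cores=3, ref_clock_ghz=3.5):
    """Return the cores*clock index relative to the reference (Xenon) config."""
    return (cores * clock_ghz) / (ref_cores * ref_clock_ghz)

high = relative_rating(2, 4.5)  # two PUs at the top clock estimate
low = relative_rating(1, 4.0)   # one PU at the bottom clock estimate

print(f"high end: {high:.2f} of Xenon")
print(f"low end:  {low:.2f} of Xenon")
```

Under these assumptions the low end lands near the 40% mark quoted above, and the high end stays somewhat below Xenon.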

There you go, I only made 5+2 assumptions (all of them based on no real evidence 8) ) to give you two estimates. And it could all collapse on the first one - if the PU core happened to be something lower or higher end than what I assumed.

DemoCoder said:
How many FP vector units does the purported Xenon CPU have?

According to the leaks - 1, but that's really going off topic regarding the general purpose performance.

Admittedly S|APUs are capable of doing at least certain general purpose tasks by themselves, but that opens up another major set of assumptions...

cthellis42 · Jan 24, 2005

...and you know what happens when you assume...

AlNom · Jan 24, 2005

you get formulae for ideal situations in engineering courses.

cthellis42 · Jan 24, 2005

You got i...!

HEY!

DeanoC · Jan 24, 2005

The other key point is that a multi-threaded system will not increase the theoretical performance but will likely increase the real world performance.

As such, if we were to get a situation where each Xenon core is multi-threaded but PUs aren't, then even if we get equal theoretical figures Xenon will have better real world figures.

Of course equally true the other way...

And as regards FP performance, the Xe rumour is one vector unit per real core.

Guden Oden · Jan 24, 2005

DeanoC said:
As such, if we were to get a situation where each Xenon core is multi-threaded but PUs aren't, then even if we get equal theoretical figures Xenon will have better real world figures.

That's not necessarily the case, as witnessed with hyperthreaded P4s vs. Athlon 64 chips. If we have a case where a multithreaded core doesn't beat a singlethreaded core when both use the same ISA, imagine predicting how the outcome would be when the cores use different ISAs!

Besides... The more threads you have competing over the same cache, the bigger the risk they're going to be pushing each other out of it, and six threads and 1MB cache (of which some could be reserved for vertex buffers for the GPU) could create quite a mess. Even at best it's less cache per thread than what the GC offers now I might add.
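The per-thread comparison is simple division. A quick sketch, assuming the rumoured 1MB is shared evenly by all six threads with nothing reserved for the GPU, and taking the GameCube's Gekko L2 as 256KB serving a single thread:

```python
# Even-split cache per hardware thread (best case: nothing reserved for the GPU).
XENON_L2_KB = 1024    # rumoured figure from the post
XENON_THREADS = 6     # 3 cores x 2 threads
GEKKO_L2_KB = 256     # GameCube CPU L2, one thread

per_thread_kb = XENON_L2_KB / XENON_THREADS
print(f"Xenon: {per_thread_kb:.1f} KB per thread vs GC: {GEKKO_L2_KB} KB")
```

An even split is only a convenience for the comparison; a shared cache does not actually divide itself that way, and any portion set aside for vertex buffers would lower the figure further.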

Gubbi · Jan 24, 2005

Guden Oden said:
DeanoC said:
As such, if we were to get a situation where each Xenon core is multi-threaded but PUs aren't, then even if we get equal theoretical figures Xenon will have better real world figures.

That's not necessarily the case, as witnessed with hyperthreaded P4s vs. Athlon 64 chips. If we have a case where a multithreaded core doesn't beat a singlethreaded core when both use the same ISA, imagine predicting how the outcome would be when the cores use different ISAs!

Which was partly true with the Northwood core, but less so with Prescott.

The P4 isn't particularly elegant in the way it statically splits resources when Hyper-Threading is enabled. It divides its global scheduling window into two equally sized chunks, and most queues are also split in two, like the memory instruction queue, the general instruction queue etc.

Weird things like write-combine buffers etc. are also split.

Prescott increases the size of some of the buffers, but not the size of some very important resources like the trace cache and the global scheduler.

What this means is that instead of getting better throughput you can end up in situations where each of the two threads ends up stalling, having run out of resources, instead of just stalling one thread, with the other chugging along.

For example, by splitting the global scheduler, you have two threads with half the latency tolerance of the single thread you replaced it with. Another problem is that the two hyperthreads take turns fetching three uops from the trace cache on alternating cycles, so even if you have two perfectly mixed threads, one with a high number of memory dependencies and one which is uop throughput limited, the P4 can't take advantage of that.
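To make the alternating-fetch point concrete, here is a toy model of the behaviour as described in this post, not of the real P4 front end. Everything in it is an assumption for illustration: thread A is always ready to fetch (uop throughput limited), thread B is only ready one cycle in four (waiting on memory), and the `steal_idle_slots` switch stands in for a hypothetical more flexible scheme.

```python
FETCH_WIDTH = 3  # uops fetched per cycle, per the post

def simulate(cycles, b_ready, steal_idle_slots):
    """Count uops fetched per thread when fetch slots alternate A, B, A, B..."""
    fetched = {"A": 0, "B": 0}
    for cycle in range(cycles):
        owner = "A" if cycle % 2 == 0 else "B"
        other = "B" if owner == "A" else "A"
        ready = {"A": True, "B": b_ready(cycle)}  # A always has work
        if ready[owner]:
            fetched[owner] += FETCH_WIDTH
        elif steal_idle_slots and ready[other]:
            # Hypothetical flexible policy: hand the wasted slot to the other thread.
            fetched[other] += FETCH_WIDTH
    return fetched

def b_ready(cycle):
    """Toy memory-bound thread: ready only on one cycle out of every four."""
    return cycle % 4 == 1

strict = simulate(8, b_ready, steal_idle_slots=False)
flexible = simulate(8, b_ready, steal_idle_slots=True)
print("strict alternation:", strict)
print("slot stealing:     ", flexible)
```

In this toy setup strict alternation leaves B's unused slots empty, so A is capped at half the fetch bandwidth no matter how often B stalls.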

Guden Oden said:
Besides... The more threads you have competing over the same cache, the bigger the risk they're going to be pushing each other out of it, and six threads and 1MB cache (of which some could be reserved for vertex buffers for the GPU) could create quite a mess. Even at best it's less cache per thread than what the GC offers now I might add.

Right, this is particularly a problem with the P4's tiny 12K uop trace cache; thrashing is probably not uncommon.

However, IBM's POWER4/5 CPU multithreading implementations seem to be much more robust, as witnessed by their not-so-stellar single thread performance (SPECint and SPECfp) but crazy throughput (SPECrates).

First, their resource division scheme seems to be much more flexible, and second, Xenon CPUs are looking to have lots of cache.

Cheers
Gubbi

Inane_Dork · Jan 24, 2005

Guden Oden said:
DeanoC said:
As such, if we were to get a situation where each Xenon core is multi-threaded but PUs aren't, then even if we get equal theoretical figures Xenon will have better real world figures.

That's not necessarily the case, as witnessed with hyperthreaded P4s vs. Athlon 64 chips.

I bet if you wrote your program knowing each core was multi-threaded, you would get better performance. It's true HT doesn't make every program faster, but that's expected. What would be unexpected is HT failing to speed up a program that's designed for it.
