What if x360 used a dual core AMD or Intel proc. instead...

Guilty Bystander · Jan 29, 2006

It would be trivial to create a situation where Cell and Xenon would perform 50x slower than a P4 or A64.. one example would be running Windows XP.

I don't think they would perform that badly; just look at how Apple's handled themselves running x86 code. Mac OS X, anyone?
Also, let's then run Linux for the Cell on an A64 or P4, shall we, if you wanna do silly comparisons.

j^aws · Jan 29, 2006

Guilty Bystander said:
...
For Xenon there weren't any peak measurements; I only read they got 83GFlop/s, but I think they could reach something like 100GFlop/s in peak performance.
...

I'd like a link, please. Given the recent leak, XeCPU peak is 8 flops/cycle per core, which nets 77 Gflops @ 3.2 GHz. Though PR claims 115 Gflops, and the recent leak claims 77-90 Gflops peak...

TurnDragoZeroV2G · Jan 29, 2006

Maybe he's confusing Xenon with the results of one of the GPGPU results (IIRC) for the R520, yielding about 83GFLOPS in the pixel shaders? (I haven't heard 83 referred to anywhere else with regard to Xenon)

*Or perhaps I'm hallucinating. Ignore me*

pjbliverpool · Jan 29, 2006

Guilty Bystander said:
The P4 you're describing can pull off 12-15GFlop/s while the Xenon can pull off 83GFlop/s and the Cell even an astounding 155GFlop/s.
So how on Earth would you call this fairly similar?
Bear in mind these are sustained max results by both Cell and Xenon and not peak results.
Cell's peak was even 199-200GFlop/s, which could only be done under certain conditions, hence peak.
For Xenon there weren't any peak measurements; I only read they got 83GFlop/s, but I think they could reach something like 100GFlop/s in peak performance.

Thinking the Xenon and especially Cell are not much better than their PC and PowerPC counterparts is really dumb.
Xenon is so powerful it calculates normal gaming worlds with complex AI, physics, 5.1 sound (on its own) and on top of that assists the GPU.
Just do 5.1 sound on your dual core PC and see how performance comes crashing down.
Or just let your CPU run some terrain demos like the CPU demos in 3Dmark2006 and see your CPU crash down to 1fps or even less.
Now let's go on to Cell.
There are videos floating out there of Cell processing a medical image (I forgot the name of the company but frequent visitors know what I'm talking about) against a supercomputer, and while Cell processes the image very quickly the supercomputer takes a very long time.
Also, IBM has benchmarks of 1 SPE against a PowerPC 970 at 2.7GHz where the SPE is always much faster and in some things is even 50 times faster.
Cell or Xenon worse than an AMD 64 or P4?
I don't think so.

You're so full of rubbish I'm not going to waste any more time on you.

Blazkowicz · Jan 29, 2006

a little bit about the gigaflops crap..
I have a XP2400+, 2ghz. I benched it on Sandra once (was still using sdram at the time), it got a gigaflops rating quite close to an Athlon 64 at the same frequency. not surprising, these are actually quite close architectures, with similar FPU (except for SSE2 support on the latter)

but real world gaming benchs show Athlon 64 at 2ghz almost twice as fast as Athlon XP 2ghz. same story for SuperPi (which is a number crunching benchmark isn't it?)

in the end I care only about real world performance on multiple scenarios. that seems more important than stupid theoretical numbers or useless code specifically written to get the max possible number of flops.

j^aws · Jan 29, 2006

Blazkowicz_ said:
...
in the end I care only about real world performance on multiple scenarios. that seems more important than stupid theoretical numbers or useless code specifically written to get the max possible number of flops.

Any sane person would care about realworld performance. However, theoretical peaks, synthetic benches *and* an understanding of the underlying architecture are useful in explaining 'why' realworld performance would suck or excel relatively...

Your Athlon example could be explained by more efficient IPC, better compilers, memory architecture etc... This has already been touched upon earlier in this thread with the tradeoff of OOOe with IOe etc..

pc999 · Jan 29, 2006

Also, we should consider that next-gen games may use different properties than current ones do; using physics, animations, and complex calculations for AI (see the KZ1 AI PowerPoint) may change what is needed from a gaming CPU (just like a few years ago fillrate used to be king in GPUs).

ShootMyMonkey · Jan 29, 2006

I have a XP2400+, 2ghz. I benched it on Sandra once (was still using sdram at the time), it got a gigaflops rating quite close to an Athlon 64 at the same frequency. not surprising, these are actually quite close architectures, with similar FPU (except for SSE2 support on the latter)

but real world gaming benchs show Athlon 64 at 2ghz almost twice as fast as Athlon XP 2ghz. same story for SuperPi (which is a number crunching benchmark isn't it?)

Yep, that's the power of things like latency and having extra registers. Sandra's FLOPS benchmark is probably designed (intentionally) to fit completely within cache, so between your 2 GHz AXP and 2 GHz A64, there should be no extra latencies or differences in performance due to the motherboard. Once you start introducing external memory and device access, it's a different ballgame. The AXP is probably running on a slower memory platform, and likely a slower AGP bus, compared to the A64 running with an integrated controller and most likely PCIe hardware.

Bohdy · Jan 29, 2006

You are pretty far off in your peak FP numbers, Guilty Bystander.

IIRC,

Xenon = 12 (vmx128) * 3.2 (ghz) * 3 (cores) = 115.2 gflops
PS3 Cell = 8 (spe simd) * 3.2 (ghz) * 7 (spe's) + 12 (ppe vmx) * 3.2 (ghz) = 217.6 gflops
Pentium XE 955 = 4 (SSE2) * 3.46 (ghz) * 2 (cores) = 27.68 gflops

Yay for peak numbers.
Of course, the more knowledgeable posters here know that peak numbers are far less important than the greater architecture, and thereby the real-world performance on useful tasks. Obviously you are not one of those.
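
For anyone who wants to check the arithmetic, here is a minimal sketch of the peak calculation used above: flops per cycle per unit, times clock in GHz, times number of units. The per-cycle figures and clocks are the ones quoted in this post; the function and variable names are made up for illustration, and none of this says anything about sustained performance.

```python
# Peak GFLOPS = flops per cycle per unit * clock (GHz) * number of units.
# Inputs are the figures quoted in the post; names are hypothetical.

def peak_gflops(flops_per_cycle, clock_ghz, units=1):
    """Theoretical peak in GFLOPS for `units` identical execution units."""
    return flops_per_cycle * clock_ghz * units

# Xbox 360 CPU: 3 cores, 12 flops/cycle each via VMX128, 3.2 GHz
xbox360_cpu = peak_gflops(12, 3.2, units=3)

# PS3 Cell: 7 SPEs at 8 flops/cycle plus one PPE at 12 flops/cycle, 3.2 GHz
ps3_cell = peak_gflops(8, 3.2, units=7) + peak_gflops(12, 3.2)

# Pentium XE 955: 2 cores, 4 flops/cycle each via SSE2, 3.46 GHz
pentium_xe_955 = peak_gflops(4, 3.46, units=2)

for name, value in [
    ("Xbox 360 CPU", xbox360_cpu),
    ("PS3 Cell", ps3_cell),
    ("Pentium XE 955", pentium_xe_955),
]:
    print(f"{name}: {value:.2f} GFLOPS peak")
```

Plugging in 8 flops/cycle instead of 12 for the 360 cores gives the roughly 77 GFLOPS figure mentioned earlier in the thread, which shows how sensitive these paper numbers are to what you count as a flop.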

Tahir2 · Jan 29, 2006

ShootMyMonkey said:
Yep, that's the power of things like latency and having extra registers. Sandra's FLOPS benchmark is probably designed (intentionally) to fit completely within cache, so between your 2 GHz AXP and 2 GHz A64, there should be no extra latencies or differences in performance due to the motherboard. Once you start introducing external memory and device access, it's a different ballgame. The AXP is probably running on a slower memory platform, and likely a slower AGP bus, compared to the A64 running with an integrated controller and most likely PCIe hardware.

Whilst I agree in general with what you are saying, internally the Athlon64 is very different to the AthlonXP. The Athlon64 core is not just an integrated memory controller and 64bit registers. There have been numerous improvements to the various parts of logic within the CPU which give rise to the increased IPC and help the processor get closer to its theoretical performance figures.

thekey · Jan 29, 2006

Guilty Bystander said:
The P4 you're describing can pull off 12-15GFlop/s while the Xenon can pull off 83GFlop/s and the Cell even an astounding 155GFlop/s.
So how on Earth would you call this fairly similar?
Bear in mind these are sustained max results by both Cell and Xenon and not peak results.
Cell's peak was even 199-200GFlop/s, which could only be done under certain conditions, hence peak.
For Xenon there weren't any peak measurements; I only read they got 83GFlop/s, but I think they could reach something like 100GFlop/s in peak performance.

Thinking the Xenon and especially Cell are not much better than their PC and PowerPC counterparts is really dumb.
Xenon is so powerful it calculates normal gaming worlds with complex AI, physics, 5.1 sound (on its own) and on top of that assists the GPU.
Just do 5.1 sound on your dual core PC and see how performance comes crashing down.
Or just let your CPU run some terrain demos like the CPU demos in 3Dmark2006 and see your CPU crash down to 1fps or even less.
Now let's go on to Cell.
There are videos floating out there of Cell processing a medical image (I forgot the name of the company but frequent visitors know what I'm talking about) against a supercomputer, and while Cell processes the image very quickly the supercomputer takes a very long time.
Also, IBM has benchmarks of 1 SPE against a PowerPC 970 at 2.7GHz where the SPE is always much faster and in some things is even 50 times faster.
Cell or Xenon worse than an AMD 64 or P4?
I don't think so.

Also, did you see the explosions demo? The whole thing is rendered with Cell; ask an A64 or P4 to do something like that.........

Gubbi · Jan 29, 2006

Tahir2 said:
Whilst I agree in general with what you are saying, internally the Athlon64 is very different to the AthlonXP. The Athlon64 core is not just an integrated memory controller and 64bit registers. There have been numerous improvements to the various parts of logic within the CPU which give rise to the increased IPC and help the processor get closer to its theoretical performance figures.

Not really. The caches, the schedulers, the exec units (with load/store units) are basically the same (64bit capability notwithstanding). The differences are in the decoder, the addition of SSE2 capability and the integrated memory controller.

BY FAR the biggest contributor to the superior performance of the A64 is the integrated memory controller. The primary parameter that has been improved upon with the memory controller is main memory latency which isn't measured in any synthetic benchmarks that I have seen on a regular basis (No, constant stride memory access does not measure memory latency, but rather the performance of the prefetcher and hence is a bandwidth benchmark.)
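
To make the stride-versus-latency distinction concrete, here is a rough sketch of the two access patterns. A constant-stride walk is predictable, so a hardware prefetcher can stay ahead of it; a pointer chase through a random single-cycle permutation (built here with Sattolo's algorithm) makes every load depend on the previous one, which is what actually exposes latency. All names are made up for this sketch, and in CPython the interpreter overhead will swamp most of the hardware effect, so treat it as an illustration of the structure rather than a usable benchmark; the same loops in C would be the real test.

```python
import random
import time


def strided_chain(n, stride):
    """next[i] = (i + stride) % n: a regular pattern a prefetcher can follow."""
    return [(i + stride) % n for i in range(n)]


def random_cycle(n, seed=0):
    """Sattolo's algorithm: a random permutation made of one n-long cycle,
    so following next[i] visits every slot in an unpredictable order."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i is what guarantees a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps):
    """Each lookup depends on the previous result, so lookups cannot overlap."""
    i = 0
    for _ in range(steps):
        i = nxt[i]
    return i


def ns_per_step(nxt, steps):
    """Wall-clock nanoseconds per dependent lookup."""
    start = time.perf_counter()
    chase(nxt, steps)
    return (time.perf_counter() - start) / steps * 1e9


if __name__ == "__main__":
    n = 1 << 18
    print("constant stride:", ns_per_step(strided_chain(n, 16), n), "ns/step")
    print("random chase:   ", ns_per_step(random_cycle(n), n), "ns/step")
```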

Synthetic benchmarks are mostly useless; taken out of context, completely useless.

Same goes with paper benchmarks.

Cheers

thekey · Jan 29, 2006

ban25 said:
> a) Isn't what I mean obvious? He said that PowerPC was "just an instruction set". It is a completely different architecture.

I don't think you understand the terminology here. Xenon is an implementation of the PowerPC Instruction Set Architecture (i.e. the part visible to the programmer).

> b)"The cores in the 360 are far less complex than the PPC 970...2-issue, no OOOE, weak branch predictor, etc. Their IPC is significantly lower, and I think if you look, you'll find plenty of benchmarks to support this" I would like to see some evidence for this.

Take an EE class or at least inform yourself of the Xenon architecture.

> d) According to the ScienceMark test I just ran, single precision would be around 5.9 (double precision 1.8) gigaflops, but even if it were 7.8 that doesn't compare to over 30 gigaflops of EACH core.

I said *theoretical* performance. Your ScienceMark test (GEMM, I suppose) is measuring actual performance with all the constraints that includes (like memory bandwidth).

a) "I don't think you understand the terminology here. Xenon is an implementation of the PowerPC Instruction Set Architecture (i.e. the part visible to the programmer)."
PowerPCs also come with a different and more powerful internal organization. It's not only the instruction set that changes.

b)"Take an EE class or at least inform yourself of the Xenon architecture."

So that is your evidence? I'm pretty sure there is all the cache and branch prediction capabilities the cores need. Microsoft remarked several times that the processor of the 360 was better than Cell because of general purpose capabilities (and here cache and branch prediction come into play).

c) I never said I didn't believe the 7.8 gigaflops claim. It could be that 7.8 gigaflops are reached with the use of 64 bit registers (my test was 32 bits). Either way my point remains: the 360 has over 30 gigaflops on each core. They are not even comparable.

thekey · Jan 29, 2006

Bohdy said:
You are pretty far off in your peak FP numbers, Guilty Bystander.

IIRC,

Xenon = 12 (vmx128) * 3.2 (ghz) * 3 (cores) = 115.2 gflops
PS3 Cell = 8 (spe simd) * 3.2 (ghz) * 7 (spe's) + 12 (ppe vmx) * 3.2 (ghz) = 217.6 gflops
Pentium XE 955 = 4 (SSE2) * 3.46 (ghz) * 2 (cores) = 27.68 gflops

Yay for peak numbers.
Of course, the more knowledgeable posters here know that peak numbers are far less important than the greater architecture, and thereby the real-world performance on useful tasks. Obviously you are not one of those.

No, he was not "too far off", and the true numbers make his point stronger than he actually did.

"Of course, the more knowledgeable posters here know that peak numbers are far less important than the greater architecture, and thereby the real-world performance on useful tasks"

I'm pretty sure he knows that fairly well. And I'm also sure that these posters know far less than the engineers who designed the processor. Do you really believe they are going to invest millions in processors that can reach those numbers if they aren't of real use?
I think not.

AlphaWolf · Jan 29, 2006

thekey said:
Do you really believe they are going to invest millions in processors that can reach those numbers if they aren't of real use?

That's sigworthy.

ERP · Jan 29, 2006

I'm pretty sure he knows that fairly well. And I'm also sure that these posters know far less than the engineers who designed the processor. Do you really believe they are going to invest millions in processors that can reach those numbers if they aren't of real use?

Having worked with many Hardware engineers over the years, most of them don't give a shit what the software problems are.

Many years ago, in a different field, I was delivered a piece of hardware that would run none of our "emulated" code. After a month of both sides arguing about whose fault it was, it turned out that the Hardware engineers hadn't bothered wiring in the low address line on the CPU, and didn't bother telling the software guys. The hardware "worked" as long as you never read or wrote a byte from an odd location.
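
A toy model of that kind of fault, for illustration only: this assumes the unwired low address line simply reads as 0, so every odd byte address aliases the even byte below it. The post gives no detail on how the real board behaved, and the class and method names here are invented.

```python
class MissingA0Bus:
    """Byte-addressable memory whose lowest address line (A0) is not connected.

    Assumption for this sketch: the memory always sees A0 = 0.
    """

    def __init__(self, size):
        self.cells = bytearray(size)

    def _wired(self, addr):
        # A0 never reaches the memory, so odd addresses land on the even byte below.
        return addr & ~1

    def write(self, addr, value):
        self.cells[self._wired(addr)] = value & 0xFF

    def read(self, addr):
        return self.cells[self._wired(addr)]


bus = MissingA0Bus(16)
bus.write(4, 0xAA)
bus.write(5, 0xBB)  # silently lands on address 4 and clobbers 0xAA
print(hex(bus.read(4)), hex(bus.read(5)))
```

Code that only ever touches even byte addresses never notices, which matches the "worked as long as you never touched an odd location" symptom.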

IME Hardware is rarely designed with any real regard for the software that will run on it.

Clearly the Cell/X360 CPU designers believed that Floating point execution resources were the primary limiting factor in game performance.....

IME this is not accurate. It's true for some portions of a game, but they aren't just speeding up these portions of a game; they are doing it at the expense of penalising every other line of code in the application.

Tahir2 · Jan 29, 2006

Gubbi said:
Not really. The caches, the schedulers, the exec units (with load/store units) are basically the same (64bit capability notwithstanding). The differences are in the decoder, the addition of SSE2 capability and the integrated memory controller.

BY FAR the biggest contributor to the superior performance of the A64 is the integrated memory controller. The primary parameter that has been improved upon with the memory controller is main memory latency which isn't measured in any synthetic benchmarks that I have seen on a regular basis (No, constant stride memory access does not measure memory latency, but rather the performance of the prefetcher and hence is a bandwidth benchmark.)

Synthetic benchmarks are mostly useless; taken out of context, completely useless.

Same goes with paper benchmarks.

Cheers

Athlon64 reaches a figure of 0.97 issues per cycle at the moment
The branch predictor is probably the same
L1 and L2 caches are streets ahead of the AthlonXP architecture even if they are the same sizes in some cases
There are many improvements to the Athlon64 core over the AthlonXP, we don't hear about them.

Source: Penstar Systems

I have heard similar things from the now defunct JC-News and Aces Hardware.

ADEX · Jan 30, 2006

BY FAR the biggest contributor to the superior performance of the A64 is the integrated memory controller. The primary parameter that has been improved upon with the memory controller is main memory latency

The memory controller not only reduces memory latency but also increases bandwidth. That alone will increase performance by a decent amount simply because the cache can fill faster and the processor will thus stall less. All modern processors spend a lot of time stalled (probably more than working), so anything which helps that will boost performance.
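
A back-of-the-envelope way to see this is the usual stall model: effective cycles per instruction = base CPI + misses per instruction * miss penalty in cycles. The sketch below uses that model with purely hypothetical numbers (they are not measurements of any Athlon), just to show how a shorter trip to memory shrinks the stalled fraction.

```python
def effective_cpi(base_cpi, misses_per_instr, miss_penalty_cycles):
    """Average cycles per instruction once memory stalls are included."""
    return base_cpi + misses_per_instr * miss_penalty_cycles


def stall_fraction(base_cpi, misses_per_instr, miss_penalty_cycles):
    """Share of all cycles spent waiting on memory."""
    total = effective_cpi(base_cpi, misses_per_instr, miss_penalty_cycles)
    return (total - base_cpi) / total


# Hypothetical inputs: 1.0 base CPI, 2 cache misses per 100 instructions.
BASE_CPI = 1.0
MISSES_PER_INSTR = 0.02

slow = effective_cpi(BASE_CPI, MISSES_PER_INSTR, 200)  # off-chip controller (made-up penalty)
fast = effective_cpi(BASE_CPI, MISSES_PER_INSTR, 120)  # on-die controller (made-up penalty)

print(f"stalled share, slow memory: {stall_fraction(BASE_CPI, MISSES_PER_INSTR, 200):.0%}")
print(f"stalled share, fast memory: {stall_fraction(BASE_CPI, MISSES_PER_INSTR, 120):.0%}")
print(f"speedup at the same clock:  {slow / fast:.2f}x")
```

With these made-up inputs the core is stalled for most of its cycles in both cases, and trimming the miss penalty alone gives a sizeable speedup with no change to the execution units.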

Another big difference is the addition of architectural registers in 64 bit mode.

If you notice, both Xenon and Cell have large high bandwidth busses and large numbers of registers in the vector units; for vector processing they don't need OOO.

zidane1strife · Jan 30, 2006

pjbliverpool said:
Evidence of the architectural details or the performance? Because the architectural details are common knowledge so you really should find that yourself. And with knowledge of those details certain conclusions do seem obvious.

p.s. you should be comparing the Pentium 955EE in SP SSE3 FLOPs to Cell or Xenon. Then you should consider what Carmack said about them not approaching their theoretical peaks in anything but trivial benchmarks. I think the results should end up fairly similar, and that's in the area where Cell/Xenon excel. But after all, common sense could have predicted that, since how on earth could the trailing horse suddenly take a commanding lead when faced with the challenges of low power/heat output, low cost and huge mass production with a brand new architecture?

The EE got pretty high and quite close to its theoretical high in the same environment the Cell is going to be dealing with. I'd imagine if you can deal with the hassles of SPU memory management effectively, then you should have the SPUs singing and you'd have your data virtually always sitting right next to the processor, doing away with most of the idleness/memory issues of conventional CPUs.

Unless there are some fundamental issues with the design, I just can't see plain simple h/w (caches) toppling sheer mind-power (LS) in terms of achieving the best mem management outcome, given the apt lvl of effort, intellect, and dedication. My guess is this is not done frequently because it usually ain't worth the effort with constantly changing h/w, but for static h/w it should be the best design.

Gubbi · Jan 30, 2006

Tahir2 said:
L1 and L2 caches are streets ahead of the AthlonXP architecture even if they are the same sizes in some cases
Penstar Systems

The level1 cache is basically the same: 2-way set associative pseudo dual-ported (8-banked) 64 bit access data cache, it has been like this since the very first K7.

The L2 cache is reworked to use ECC; if data is evicted from the instruction cache, the instruction boundary information is stored in the ECC bits.

So no, not streets ahead.

Cheers
