Simple Question on Next Generation Consoles


gosh

Which is more useful for the gaming experience and/or the output of the game: the total integer calculations per second, or the total teraflop performance of the system?

I am curious because I will eventually buy all 3 consoles, but I'm looking at the PS3, which has the higher TFLOP rating, and the Xbox 360, which has the higher integer-calculation rating.
 
gosh said:
Which is more useful for the gaming experience and/or the output of the game: the total integer calculations per second, or the total teraflop performance of the system?

I am curious because I will eventually buy all 3 consoles, but I'm looking at the PS3, which has the higher TFLOP rating, and the Xbox 360, which has the higher integer-calculation rating.

Where did you get the information that the Xbox 360 has higher integer-calculation performance?
 
Which is more useful in gaming experience and/or output of the game
Number of MuuMuus per week. Now, the exact mating rituals of MuuMuus are too gruesome to go into detail on a PG-approved forum like this, but Sony has been working with them since the PS2, so they have a slight advantage in the process.
 
gosh said:
http://www.major-nelson.com/blogcast/mnr-5-26-05-mp3.mp3

An audio blog interview where the Xbox advanced hardware engineers did all the talking and said the PS3 might have a higher TFLOP score, but the Xbox 360 has a higher integer-calculation score.

Major Nelson is your source? :LOL:
 
No, the guys who designed the Xbox hardware. Listen to the audio; it's only the two engineers talking.
 
gosh said:
Which is more useful for the gaming experience and/or the output of the game: the total integer calculations per second, or the total teraflop performance of the system?

The usefulness depends on whatever the developer's needs are for a particular game...

gosh said:
I am curious because I will eventually buy all 3 consoles, but I'm looking at the PS3, which has the higher TFLOP rating, and the Xbox 360, which has the higher integer-calculation rating.

The X360 doesn't have higher integer performance. Perhaps you're confusing branching performance with integer performance, and also confusing integer performance with general-purpose performance.

System-wide metrics

http://www.beyond3d.com/forum/viewtopic.php?p=528929#528929

CPU integer metrics

http://www.beyond3d.com/forum/viewtopic.php?p=533374#533374

CPU instruction metrics

http://www.beyond3d.com/forum/viewtopic.php?p=535138#535138

These are peak, raw metrics that need further parameters and definition. The raw metrics are as useful as what the developers make of them...

And again, to reiterate the point,

*SPUs ARE UNIFIED VECTOR/SCALAR/FP/INTEGER UNITS*

To those Major Nelson engineers and co...go do some friggin research! :rolleyes:
 
Jaws said:
...

*SPUs ARE UNIFIED VECTOR/SCALAR/FP/INTEGER UNITS*

Don't play smart-ass with me. I know what SPUs are. SPUs are sub-core units, helper units, simpler CPUs with no access to memory and only access to the core. If you don't have access to memory, you are lying about the fact that it's a unified vector/scalar/FP/integer unit. It's a sub-core issue which has nothing to do with GPU scaling/vectoring performance.
 
Synergistic Processing Elements
Each SPE or SPU is a SIMD 128-bit vector processor with 256 KB of local high speed memory, which is also visible to the PE to be loaded with data and programs as needed. The SPE's memory is private but data can be sent to other SPEs via DMA. In general use the system will load the SPEs with small programs, chaining the SPEs together to handle each step in a complex operation. For instance, a set-top box could load up programs for reading a DVD, video and audio decoding, and display, and the data would be passed off from SPE to SPE until finally ending up on the TV. At 4 GHz, each SPE gives 32 GFLOPS of performance, thereby giving the SPEs 256 GFLOPS of performance. Performance of the PE's VMX unit is unclear, but should be around 32 GFLOPS in addition to the SPEs.
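(A quick sanity check on those figures, using my own arithmetic rather than anything from the quoted text: assuming each SPE issues one 4-wide single-precision fused multiply-add per cycle, and counting a multiply-add as two flops,

4 GHz × 4 lanes × 2 flops per madd = 32 GFLOPS per SPE
8 SPEs × 32 GFLOPS = 256 GFLOPS

which matches the per-SPE and aggregate numbers above.)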

(Initial speculation of TeraFLOPS performance was largely based on claims of a 65nm SOI process. Though IBM, Sony and Toshiba were following this agenda in the beginning, Intel and AMD's renewed concern for multi-core processing and Sony wanting first-mover's advantage on next generation gaming consoles may have forced them to go with a 90nm SOI process very much similar to the Intel Prescott core manufacturing process. However the 'Broadband Engine' integrated into the Cell helps it attain enough bandwidth for theoretical 1 TeraFLOPS performance, though real-world models may rarely rise to such a figure.)

Due to the nature of its applications, Cell is optimized towards single precision floating point computation. "This design decision is based on the real-time nature of game workloads and other media applications: most often, saturation is mathematically the right solution."
 
gosh said:
...
Don't play smart-ass with me. I know what SPUs are. SPUs are sub-core units, helper units, simpler CPUs with no access to memory and only access to the core. If you don't have access to memory, you are lying about the fact that it's a unified vector/scalar/FP/integer unit. It's a sub-core issue which has nothing to do with GPU scaling/vectoring performance.



Michael Gschwind

The CELL project at IBM Research 

  Project description 


The CELL Architecture

The CELL Architecture grew from a challenge posed by Sony and Toshiba to provide power-efficient and cost-effective high-performance processing for a wide range of applications, including the most demanding consumer appliance: game consoles. CELL - also known as the Broadband Processor Architecture (BPA) - is an innovative solution whose design was based on the analysis of a broad range of workloads in areas such as cryptography, graphics transform and lighting, physics, fast-Fourier transforms (FFT), matrix operations, and scientific workloads. As an example of innovation that ensures the clients' success, a team from IBM Research joined people from IBM Systems Technology Group, Sony and Toshiba, to lead the development of a novel architecture that represents a breakthrough in performance for consumer applications. IBM Research participated throughout the entire development of the architecture, its implementation and its software enablement, ensuring the timely and efficient application of novel ideas and technology into a product that solves real challenges.

CELL is a heterogeneous chip multiprocessor consisting of a 64-bit Power core, augmented with 8 specialized co-processors based on a novel single-instruction multiple-data (SIMD) architecture called SPU (Synergistic Processor Unit), for data intensive processing as is found in cryptography, media and scientific applications. The system is integrated by a coherent on-chip bus.

Based on the analysis of available die area, cost and power budgets, and achievable performance, the best approach to achieving the performance target was the exploitation of parallelism through a high number of nodes on a chip multiprocessor. To further reduce power, the team opted for a heterogeneous configuration with a novel SIMD-centered architecture. This configuration combines the flexibility of an IBM Power core with the functionality and performance-optimized SPU SIMD cores.

Cell Block Diagram

In this organization, the SPU accelerators operate from a local storage which contains instruction and data for a single SPU. This local storage is the only memory directly addressable by the SPU.

The SPU architecture was built with the goals to
• provide a large register file,
• simplify code generation,
• reduce the size and power consumption by unifying resources, and
• simplify decode and dispatch.

These goals were achieved by architecting a novel SIMD-based architecture with 32-bit-wide instructions encoding a 3-operand instruction format. Designing a new instruction set architecture (ISA) allowed us to streamline the instruction side, and provide 7-bit register operand specifiers to directly address 128 registers from all instructions using a single pervasive SIMD computation approach for both scalar and vector data. In this approach, a unified 128-entry, 128-bit SIMD register file provides scalar, condition and address operands, such as for conditional operations, branches, and memory accesses.
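(Not from the IBM text — a minimal sketch of what "scalar and vector share one register file" looks like in practice, assuming the SPU C language extensions that later shipped with the Cell SDK and compile with spu-gcc:)

#include <spu_intrinsics.h>

/* On the SPU a scalar is just the "preferred slot" (element 0) of a
   128-bit vector register; scalar and 4-wide code use the same datapath
   and the same 128-entry register file. */
float scale_and_sum(vector float v, float s)
{
    vector float vs = spu_splats(s);      /* broadcast the scalar to all 4 lanes */
    vector float scaled = spu_mul(v, vs); /* 4-wide single-precision multiply */

    /* "Scalar" code is vector code that only cares about one slot: */
    return spu_extract(scaled, 0) + spu_extract(scaled, 1)
         + spu_extract(scaled, 2) + spu_extract(scaled, 3);
}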

While the SPU ISA is a novel architecture, the operations selected for the SPU are closely aligned with the functionality of the Power VMX unit. This facilitates and simplifies code portability between the Power main processor and the SPU SIMD-based co-processors. However, the range of data types supported in the SPU has been reduced for most computation formats. While VMX supports a number of densely packed saturating integer data types, these data types lead to a loss of dynamic range which typically degrades computation results. The preferred computation approach is to widen integer data types for intermediate operations and perform them without saturation. Unpack and saturating pack operations allow memory bandwidth and memory footprint to be reduced while maintaining high data integrity.
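(Again my own illustration, not the article's: the widen-then-saturate-once approach in portable C. The function name and the Q15 fixed-point format are hypothetical.)

#include <stdint.h>

/* Widen 16-bit samples to 32 bits, accumulate at full precision with no
   per-step saturation, and saturate only once when packing the result
   back down to 16 bits. */
int16_t weighted_sum_q15(const int16_t *x, const int16_t *w, int n)
{
    int32_t acc = 0;                          /* widened intermediate */
    for (int i = 0; i < n; i++)
        acc += ((int32_t)x[i] * w[i]) >> 15;  /* full-precision products */

    /* the single saturating "pack" at the end */
    if (acc > INT16_MAX) acc = INT16_MAX;
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;
}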

Floating point data types automatically support a wide dynamic data range and saturation, so no additional data conditioning is required. To reduce area and power requirements, floating point arithmetic is restricted to the most common and useful modes. As a result, denormalized numbers are automatically flushed to 0 when presented as input, and when a denormalized result is generated. Also, a single rounding mode is supported.
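(A tiny illustration of the flush-to-zero behaviour described above — my example, not IBM's. On an IEEE-compliant core the product below is a small nonzero number; under the SPU rules both the denormal input and any denormal result are treated as zero.)

#include <stdio.h>

int main(void)
{
    float tiny = 1e-40f;   /* denormal in single precision (below ~1.18e-38) */
    float r = tiny * 2.0f;
    printf("%g\n", r);     /* IEEE core: 2e-40; flush-to-zero SPU: 0 */
    return 0;
}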


Single precision floating point computation is geared for throughput of media and 3D graphics objects. In this vein, the decision to support only a subset of IEEE floating point arithmetic and sacrifice full IEEE compliance was driven by the target applications. Thus, multiple rounding modes and IEEE-compliant exceptions are typically unimportant for these workloads, and are not supported. This design decision is based on the real-time nature of game workloads and other media applications: most often, saturation is mathematically the right solution. Also, occasional small display glitches caused by saturation in a display frame are tolerable. On the other hand, incomplete rendering of a display frame, missing objects or tearing video due to long exception handling is objectionable.

Memory access is performed via a DMA-based interface using copy-in/copy-out semantics, and data transfers can be initiated by either the Power processor or an SPU. The DMA-based interface uses the Power page protection model, giving a consistent interface to the system storage map for all processor structures despite its heterogeneous instruction set architecture structure. A high-performance on-chip bus connects the SPU and Power computing elements.
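(A sketch of that copy-in/copy-out pattern — mine, not the article's — assuming the SPU-side MFC intrinsics from the later Cell SDK (spu_mfcio.h); the buffer size, tag, and alignment choices are illustrative.)

#include <spu_mfcio.h>

#define TAG 3
/* Local-store buffer; individual DMA transfers must be aligned and at most 16 KB. */
static char buf[16384] __attribute__((aligned(128)));

void process_block(unsigned long long ea, unsigned int size)
{
    mfc_get(buf, ea, size, TAG, 0, 0);   /* copy-in: main memory -> local store */
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();           /* block until the transfer completes */

    /* ... compute on buf, entirely in local store ... */

    mfc_put(buf, ea, size, TAG, 0, 0);   /* copy-out: local store -> main memory */
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();
}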

The SPU is an in-order dual-issue statically scheduled architecture. Two SIMD instructions can be issued per cycle: one compute instruction and one memory operation. The SPU branch architecture does not include dynamic branch prediction, but instead relies on compiler-generated branch prediction using "prepare-to-branch" instructions to redirect instruction prefetch to branch targets.
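(To make "compiler-generated branch prediction" concrete — my sketch: with no dynamic predictor the SPU falls through by default, and compilers can turn a hint such as GCC's __builtin_expect into the ISA's prepare-to-branch instructions.)

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

int clamp(int v, int limit)
{
    if (UNLIKELY(v > limit))  /* rare path: keep prefetch on the fall-through */
        return limit;
    return v;
}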

The SPU was designed with a compiled code focus from the beginning, and early availability of SIMD-optimized compilers allowed development of high-performance graphics and media libraries for the Broadband Architecture entirely in the C programming language.

Based on these decisions to share compute semantics, data types, and virtual memory model, the SPUs synergistically exploit and amplify the advantages when combined with the Power architecture to form the Broadband Processor Architecture.

The IBM Research division grew its partnership in the development of the Broadband Processor Architecture beyond the initial definition of the architecture. During the course of this partnership with the STI Design Center, members of the original CELL team developed the first SPU compiler which was a guiding force for the definition of the SPU architecture and the SPU programming environment, and sample code to exploit the strengths of the Broadband Processor Architecture. The extended partnership led to further contributions by IBM Research, including the development of an advanced parallelizing compiler with auto-SIMDization features based on IBM XL compiler technology, the design of the high-frequency Power core at the center of the CELL architecture, and a full-system simulation infrastructure.

CELL is not limited to game systems. IBM has announced a CELL-based "blade" leveraging the investment into the high-performance CELL architecture. Other future uses include HDTV sets, home servers, game servers, and supercomputers. Also, CELL is not limited to a single chip, but is a scalable system. The number of attached SPUs can be varied, to achieve different power/performance and price/performance points. Also, the CELL architecture was conceived as a modular, extendible system where multiple CELL subsystems each with a Power core and attached SPUs, can form a symmetric multiprocessor system.

Cell Prototype Die

Some CELL statistics:
• Observed clock speed: > 4 GHz
• Peak performance (single precision): > 256 GFlops
• Peak performance (double precision): >26 GFlops
• Local storage size per SPU: 256KB
• Area: 221 mm²
• Technology: 90nm SOI
• Total number of transistors: 234M

CELL received the 2004 Microprocessor Report Analysts' Choice Award for Best Technology.

  Patents 

SIMD datapath coupled to scalar/vector/address/conditional data register file with selective subpath scalar processing mode
M. Gschwind, P. Hofstee, M. Hopkins
1/4/2005 Issued as US patent 6839828



Token-based DMA
P. Hofstee, R. Nair, J. Wellman
11/16/2004 Issued as US patent 6820142


Symmetric multi-processing system with attached processing units being able to access a shared memory without being structurally configured with an address translation mechanism
E. Altman, P. Capek, M. Gschwind, P. Hofstee, J. Kahle, R. Nair, S. Sathaye, J. Wellman, M. Suzuoki, T. Yamazaki
8/17/2004 Issued as US patent 6779049


Pipeline control for high-frequency pipelined designs
M.K. Gschwind
02/20/2000 Issued as US patent 6192466

  Publications 

A 1.0-GHz single-issue 64-bit PowerPC integer processor
Journal of Solid State Circuits, Vol. 33, No. 11, Nov 1998. (J. Silberman et al.)

Exploring realtime multimedia content creation in video games
6th Workshop on Media and Streaming Processors in conjunction with MICRO 36, December 2004. (B. Matthews, J.D. Wellman, M. Gschwind)

The Design and Implementation of a First-Generation CELL Processor
ISSCC 2005, February 2005. (D. Pham et al.)


The Microarchitecture of the Streaming Processor for a CELL Processor
ISSCC 2005, February 2005. (B. Flachs et al.)


Power Efficient Architecture and the Cell Processor
HPCA-11, February 2005. (P. Hofstee)

http://www.research.ibm.com/cell/

*SPUs ARE UNIFIED VECTOR/SCALAR/FP/INTEGER UNITS*
:rolleyes:
 
Jaws said:
...

*SPUs ARE UNIFIED VECTOR/SCALAR/FP/INTEGER UNITS*
:rolleyes:

Again: if you don't have access to main memory and only have a smaller private memory, the SPEs only act as helper units within their own private domain with the main core. They don't interact with anything outside the core. The core itself uses the SPEs to perform internally.

So my question again is: if the Xbox 360 has higher integer performance and the PS3 has higher teraflop performance, which is better for the output of games?
 
gosh said:
Don't play smart-ass with me.

Jeez, relax, he was throwing his eyes up at the MS comments, not you!

gosh said:
with no access to memory

Yes, they do. Every CPU has to work with a memory hierarchy. Every CPU can only directly work with the data in its registers. Beyond that, you've got local memory in the case of the SPEs, or cache in the case of other CPUs. And beyond that you have main memory, whose access is expensive no matter what chip you're talking about. The SPEs can access main memory themselves.

gosh said:
If you don't have access to memory, you are lying about the fact that it's a unified vector/scalar/FP/integer unit.

Memory access has little to do with the execution hardware on a core, and the SPEs have floating point, scalar and integer computational capability. And they do have access to main memory anyway (and access to other SPEs' local memory, and access to the PPE's cache, for that matter too).
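(Since main-memory access is expensive everywhere, as noted above, the usual trick on the SPEs is to hide it behind computation. A double-buffering sketch of my own, again assuming the Cell SDK MFC intrinsics; the chunk size and tag assignments are illustrative.)

#include <spu_mfcio.h>

#define CHUNK 4096
static char buf[2][CHUNK] __attribute__((aligned(128)));

/* While the SPU computes on one local-store buffer, the MFC streams the
   next chunk in from main memory, hiding the DMA latency. */
void stream(unsigned long long ea, int nchunks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);      /* kick off the first chunk */

    for (int i = 0; i < nchunks; i++) {
        int next = cur ^ 1;
        if (i + 1 < nchunks)                      /* prefetch the next chunk */
            mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, next, 0, 0);

        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();                /* wait for the current chunk */
        /* ... compute on buf[cur] ... */
        cur = next;
    }
}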
 
gosh said:
Don't play smart-ass with me.
You shouldn't take that attitude with Jaws, he's a pretty knowledgeable guy... :)

I know what SPUs are. SPUs are sub-core units, helper units, simpler CPUs with no access to memory and only access to the core.
That is incorrect.

Cell SPUs are most definitely stand-alone CPUs in their own right. They have access to main memory via MMU/DMA operations, and peak integer ops is the same as peak flops. Hence, Cell out-ops the X360 CPU where *peak* figures are concerned.

Real-world performance will vary in both cases, as always.

It's a sub-core issue which has nothing to do with GPU scaling/vectoring performance
Au contraire, mon capitain!

Even if Cell did NOT have access to main memory as you claim (which it actually HAS, as already stated), that would not affect either flop or iop peak figures. Cell would still score higher on both counts. :)
 
Guden Oden said:
...

Like the Playstation 3's Cell processor, the Xbox 360's Xenon processor represents a fundamentally different approach to performance than that which characterized the previous generation of consoles—and the previous generation of PCs for that matter. The Xbox 360 will rely on multithreading and procedural synthesis to make visual environments that are more immersive than anything that's possible on the present generation of either game consoles or PCs. Still, with all that pixel-pushing power at its disposal, there are a few probable flies in the Xbox 360 ointment.

Rumors and some game developer comments (on the record and off the record) have Xenon's performance on branch-intensive game control, AI, and physics code as ranging from mediocre to downright bad. Xenon will be a streaming media monster, but the parts of the game engine that have to do with making the game fun to play (and not just pretty to look at) are probably going to suffer. Even if the PPE's branch prediction is significantly better than I think it is, the relatively meager 1MB L2 cache that the game control, AI, and physics code will have to share with procedural synthesis and other graphics code will ensure that programmers have a hard time getting good performance out of non-graphics parts of the game.

Furthermore, the Xenon may be capable of running six threads at once, but the three types of branch-intensive code listed above are not as amenable to high levels of thread-level parallelization as graphics code. On the other hand, these types of code do benefit greatly from out-of-order execution, which Xenon lacks completely, a decent amount of execution core width, which Xenon also lacks; branch prediction hardware, which Xenon is probably short on; and large caches, which Xenon is definitely short on. The end result is a recipe for a console that provides developers with a wealth of graphics resources but that asks them to do more with less on the non-graphical side of gaming.

Still, there is some hope on that front. In the PC market where there are multiple processors to support, developers can't fine-tune games for a specific CPU. This heterogeneity of hardware especially hurts with platform-sensitive optimizations like branch hints, which is one reason they don't get used much. In contrast, with the Xenon, the hardware will be fixed, which means that programmers can go all-out in profiling and optimizing branchy game control, AI, and physics code using every trick in the book. Furthermore, console coders can also take heavy advantage of prefetching to overcome the Xenon's cache size limitations. So it's quite possible that as time goes on developers will find ways to get much better branch-intensive code performance out of the hardware. Just don't count on it in the first generation of games, though.
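(To illustrate the prefetching point — my sketch, not the article's: on fixed console hardware the cache-line size and memory latency are known constants, so the prefetch distance can be tuned by hand. Shown with the GCC-style builtin; the console toolchains expose equivalents such as the PowerPC dcbt hint.)

/* Both Xenon's and Cell's PPC cores use 128-byte cache lines, so
   prefetching a fixed distance ahead keeps the stream in cache. */
float sum(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 256], 0, 0);  /* ~8 lines (1 KB) ahead */
        s += a[i];
    }
    return s;
}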

At any rate, Playstation 3 fanboys shouldn't get all flush over the idea that the Xenon will struggle on non-graphics code. However bad off Xenon will be in that department, the PS3's Cell will probably be worse. The Cell has only one PPE to the Xenon's three, which means that developers will have to cram all their game control, AI, and physics code into at most two threads that are sharing a very narrow execution core with no instruction window. (Don't bother suggesting that the PS3 can use its SPEs for branch-intensive code, because the SPEs lack branch prediction entirely.) Furthermore, the PS3's L2 is only 512K, which is half the size of the Xenon's L2. So the PS3 doesn't get much help with branches in the cache department. In short, the PS3 may fare a bit worse than the Xenon on non-graphics code, but on the upside it will probably fare a bit better on graphics code because of the seven SPEs.

In sum, the Xenon will certainly make the Xbox 360 a 3D graphics powerhouse. Though history suggests that the Xbox 360's games will probably never attain the level of graphical realism promised by Microsoft's pre-launch hype and portrayed in the pre-rendered "game demos" that were shown off at E3, gamers can nonetheless expect a significant advance in levels of graphical realism and visual immersiveness.

Again, we are talking about the performance of the gaming AI and its output. Which will perform better: the teraflop argument or the integer-calculations argument?
 
I'm not talking about what the television output from the Cell or the Xbox 360 will be like; I'm talking about the gaming AI, the strength of the gaming AI. Which will perform better in that regard?
 
gosh said:
...

Again, we are talking about the performance of the gaming AI and its output. Which will perform better: the teraflop argument or the integer-calculations argument?

If you're going to paste entire articles out in response to someone's post, you should at least highlight the pertinent points. Few people are going to read all that...

...I did, and I fail to see how any of it connects directly to the points Guden was making (?) Perhaps you can elaborate.

As a general comment on the Ars Technica article, we've all read it, and I think most agreed that Hannibal was being a little strict about what would and wouldn't work on the cores in the X360, the PPE in Cell, and the SPEs. Clever coding and "non-traditional" thought will be required to get the best performance out of both... with most algorithms you'll have to rethink how they map to the hardware.
 
gosh said:
...
Again: if you don't have access to main memory and only have a smaller private memory, the SPEs only act as helper units within their own private domain with the main core. They don't interact with anything outside the core. The core itself uses the SPEs to perform internally.

So my question again is: if the Xbox 360 has higher integer performance and the PS3 has higher teraflop performance, which is better for the output of games?

The SPUs prefetch data from system RAM via DMA engines. If you ignore the SPUs, then you are ignoring what the CELL processor is about. The CELL processor gets its FP and integer performance from its SPUs, NOT the Power PPE core.

Frankly, if you're comparing one CELL PPE core with 3 XeCPU cores, then you're ignoring ~160 million transistors of CELL and pretty much the whole philosophy behind the CELL processor.

Can you define what your integers and FPs are and where they are coming from?
 
Wow. OK, you guys are stuck in graphical-calculation limbo. I am simply asking which architecture, the Cell or the Xbox 360, will perform better calculations in terms of gaming AI. Leave aside all issues of graphics and game structure. I'm specifically talking about how well the game will perform in terms of AI, and performance in general.

If you take two equally talented developers, which system will get the AI to perform better, faster, and more efficiently?
 
gosh said:
Wow. OK, you guys are stuck in graphical-calculation limbo. I am simply asking which architecture, the Cell or the Xbox 360, will perform better calculations in terms of gaming AI. Leave aside all issues of graphics and game structure. I'm specifically talking about how well the game will perform in terms of AI, and performance in general.

If you take two equally talented developers, which system will get the AI to perform better, faster, and more efficiently?

http://www.beyond3d.com/forum/viewtopic.php?t=20329

http://www.beyond3d.com/forum/viewtopic.php?t=22725

Read these threads and draw your own conclusions, as there are differing views on the subject. Pay attention to MfA, ERP, nAo, Fafalada, DeanoC etc., as they are game developers. And there are plenty of other threads on the topic, so I suggest you use the search function, as this topic has been discussed to death...
 