ATI - PS3 is Unrefined

dcforest said:
Given Barry counts the PPE at 25.6 GFLOPS, it is likely that Xenon comes in at 77 GFLOPS, unless there is some special scheduling/execution magic in the Xenon VMX/FPU units that no one has publicly talked about.

The VMX units in the PPE in the Cell and the VMX units in the PPEs in the Xenon are not the same, so his math can't be applied to the Xenon without some tweaking of the numbers.
 
Dr. Nick said:
The VMX units in the PPE in the Cell and the VMX units in the PPEs in the Xenon are not the same, so his math can't be applied to the Xenon without some tweaking of the numbers.

Extra registers and a dot product instruction don't change the number of flops it can do per cycle, which I believe is 8, the same as the Cell PPE and other VMX implementations.
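For what it's worth, the arithmetic behind both figures checks out if you take the commonly quoted numbers at face value. A minimal sanity check, assuming a 4-wide VMX multiply-add (4 muls + 4 adds = 8 flops/cycle) and the 3.2 GHz clock both chips are quoted at:

```c
/* Sanity check of the quoted GFLOPS figures, assuming a 4-wide VMX
 * multiply-add (8 flops/cycle) and a 3.2 GHz clock on both chips. */
#include <stdio.h>

int main(void)
{
    const double clock_ghz       = 3.2;  /* PPE / Xenon core clock    */
    const double flops_per_cycle = 8.0;  /* 4-wide fused multiply-add */
    const int    xenon_cores     = 3;

    double per_core = clock_ghz * flops_per_cycle;        /* 25.6 GFLOPS */
    printf("Per core:        %.1f GFLOPS\n", per_core);
    printf("Xenon (3 cores): %.1f GFLOPS\n",
           per_core * xenon_cores);                       /* 76.8 ~ 77   */
    return 0;
}
```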
 
Jaws said:
No, I think you're getting mixed up. We are CLEARLY referring to the parent die and its shader ALUs. If you've followed any of the derivations, then you'd realise that we are using 48 ALUs from the shader array and not referring to the fixed-function logic in the daughter die...

That's what I'm saying, Jaws... I believe the 240 number comes from MS including fixed-function ops on the daughter die. The 216 number is purely derived from the 48 ALUs on the parent die.

I think they included those extra fixed-function ops in their number because programmable ops on a traditional GPU would be responsible for those types of post-rendering functions (anti-aliasing, blurring effects, etc.).
 
Entropy said:
I'm not a games programmer, and thus I can't really say what one of those would find difficult. It would depend on their background and their personality, I guess.
But judging from my own experience, I'd guess that the PS3 takes a little more in the way of rearranging your gears if you come from a typical PC-like background, whereas the 360 will start giving you headaches as you try to exploit its three cores more fully.

Both consoles can be programmed straightforwardly along PC lines: you simply use the PPE on the PS3, or one of the cores on the 360, and then talk to the GPU as per normal. Both consoles allow themselves to be used like that, and it doesn't limit the GPU much from what I can see; the limitations are mostly on what you can achieve on the CPUs. So that would produce nice-looking pixels on the screen for both consoles. If your game requires more in the way of CPU performance, for physics, game logic or graphical processing reasons, then, but only then, will you have to dig in deeper.

Looking at the designs from the outside implies that slightly different challenges will present themselves.
The 360 is architecturally very similar to, say, the XBox or an integrated-graphics PC. It has advantages over both, though, in substantially higher bandwidth from CPU to GPU and from GPU to memory (as well as the bandwidth-saving feature of the intelligent buffer memory on the GPU). It also has three cores, operating in a traditional symmetric multiprocessing/uniform memory architecture. The problem with this layout is contention for memory. The three cores have relatively small private L1 caches, they share the L2, and they share main memory with the GPU. So while it presents a rather straightforward programming model, actually getting the CPU to perform well is going to require dealing with three cores thrashing each other's caches, and stepping on each other's feet trying to access the same memory pool as the worst memory hog of them all, the GPU. Additionally, the internal data traffic between the CPU and the GPU will also load the CPU memory path. As a programmer, this situation is typically really nasty, because you don't really have much in the way of tools to control/synchronize the different threads and the GPU. (Lockable cache areas can help. A little bit.) These issues can basically only be alleviated by making the constrained resource really ample, but doing that with the memory path is expensive. So while the 360 is better off than a typical integrated-chipset PC, you can still see that bandwidth and memory contention is going to be a significant problem, and a difficult one to manage at that. The very design principles that make the transition to multiprocessing easy, both from a programming and from a hardware point of view, come back to bite you and make actually extracting high utilization from the additional resources difficult.
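To make the "stepping on each other's feet" point concrete, here is a generic sketch, deliberately not 360-specific: two threads bumping counters that share a cache line force that line to ping-pong between cores, while padded counters don't. Plain pthreads and the 128-byte pad are my choices for illustration only.

```c
/* Generic illustration (not 360-specific) of cores contending: two
 * threads incrementing counters in the SAME cache line make the line
 * ping-pong between caches; padding the counters onto separate lines
 * removes the contention. The 128-byte pad is an assumed line size. */
#include <pthread.h>
#include <stddef.h>

#define ITERS (100 * 1000 * 1000L)

/* Contended layout: both counters live in one cache line. */
static volatile long contended[2];

/* Padded layout: each counter gets a line of its own. */
static struct { volatile long n; char pad[128]; } padded[2];

static void *hammer_contended(void *arg)
{
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        contended[id]++;
    return NULL;
}

static void *hammer_padded(void *arg)
{
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        padded[id].n++;
    return NULL;
}

int main(void)
{
    pthread_t t[2];

    /* Time each phase externally (e.g. with `time`); on most SMP
     * machines the padded phase runs several times faster. */
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, hammer_contended, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, hammer_padded, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    return 0;
}
```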


The PS3 requires you to take a step back from typical PC procedure and take a broader look at what you want to achieve. (One reason is that you might want to utilize the CPU for some graphics-related task, shifting bandwidth and processing capabilities around for optimum yield. I won't go there, as I'm not qualified to comment.) Not only do you want to partition your problem into blocks that can be farmed out to the SPEs, but you'd also want to adapt your in-thread algorithms to be partitioned and distributed to the SPEs. The Cell processor offers additional flexibility in that the SPEs can also pass/pipe tasks between themselves, and basically you have a bunch of options there that are new to a PC programmer, and thus both a bit difficult and, hopefully, exciting. What is really good about the PS3 compared to the 360 is the resources that have been dedicated to managing memory and communication. The SPEs have 256 KB of local memory each, which they can access without any risk of having their data flushed, or needing to cache-snoop, or any such. There are fast data paths within the chip to transfer data to and from the PPE/SPEs, and between them. The CPU also has its own dedicated path to memory and a completely separate, very high bandwidth connection to the GPU, which in turn has its own dedicated path to graphics memory. And not only does the PS3 sidestep the nastiest contention issues by providing separate data paths, these separate data paths also provide higher bandwidth individually than the shared resources of the 360. For someone with a background in scientific computing like me, the data-flow model of the PS3 looks much better. I can't speak for games programmers.
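As an illustration of that local-store model, here is a sketch of the classic double-buffered DMA loop an SPE program tends to use. The call names follow the Cell SDK's MFC interface (mfc_get plus tag-based completion), but treat the exact signatures, the 16 KB chunk size and the alignment here as my assumptions rather than gospel:

```c
/* Sketch of the double-buffered DMA pattern an SPE program typically
 * uses: pull the next block into one half of local store while
 * computing on the other half, so transfer and compute overlap.
 * Call names follow the Cell SDK's MFC interface; exact signatures,
 * chunk size and alignment are assumptions for illustration. */
#include <spu_mfcio.h>

#define CHUNK 16384  /* bytes per DMA transfer (16 KB assumed max) */

static float buf[2][CHUNK / sizeof(float)] __attribute__((aligned(128)));

static void process(float *data, int n)
{
    (void)data; (void)n;  /* ...compute on the block in place... */
}

void stream_from_main_memory(unsigned long long ea, int nblocks)
{
    int cur = 0;

    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);      /* prime buffer 0    */

    for (int i = 0; i < nblocks; i++) {
        int nxt = cur ^ 1;

        if (i + 1 < nblocks)                      /* start next fetch  */
            mfc_get(buf[nxt], ea + (i + 1) * (unsigned long long)CHUNK,
                    CHUNK, nxt, 0, 0);

        mfc_write_tag_mask(1 << cur);             /* wait for current  */
        mfc_read_tag_status_all();

        process(buf[cur], CHUNK / sizeof(float)); /* overlaps with DMA */
        cur = nxt;
    }
}
```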

So which console offers the steeper learning curve? I'd say that depends on where on the curve you are. Not all games require cutting-edge utilization, and at that point I'd say both should actually be fairly easy to deal with. If you want to squeeze more out of the respective consoles, the PS3 departs more significantly in its architecture and possibilities from a PC, and thus most programmers would need to study the architecture, their algorithms and the available tools carefully in order to build an application that is well suited to the console. In contrast, the vanilla SMP/UMA of the 360 is really simple conceptually and doesn't suffer much of a learning curve at all, apart from managing a few threads. In that respect the 360 is much simpler. Its memory and communication limitations will range from being non-issues to presenting insurmountable problems depending on what you want to achieve, but wringing really good performance from the 360 will require you to balance the different processes that need to access the memory paths very, very well, because you want maximum utilization out of this limited resource. And that will definitely not be easy.

The steeper learning curve in all probability belongs to the PS3, but in no way, shape or form does that imply that the 360 will cause its programmers fewer gray hairs.

Again, all of the above is from someone without a games-development background, but with practical experience of non-PC-type architectures. YMMV.


It seems some of the most well written and intelligent posts around here get no responses. Let me be the first to say, very nice post!

One of the more balanced views I have seen on the respective architectures. Thank you for that.
 
I agree with you, Edge. I just finished reading it and then tried to see if there was any reply... but none. The most important thing, besides it not being just a ****** post, is that it doesn't mess with tons of technical stuff. It's straight to the point. Well done, Entropy!
 
Clear

Edge said:
It seems some of the most well written and intelligent posts around here get no responses. Let me be the first to say, very nice post!

One of the more balanced views I have seen on the respective architectures. Thank you for that.

The writer has a clear mind.
 
Edge said:
It seems some of the most well written and intelligent posts around here get no responses. Let me be the first to say, very nice post!

One of the more balanced views I have seen on the respective architectures. Thank you for that.
It's unfortunate that it was in response to a fishing expedition by dukmahsik, which may explain the lack of further comments.

.Sis
 
Edge said:
It seems some of the most well written and intelligent posts around here get no responses. Let me be the first to say, very nice post!

One of the more balanced views I have seen on the respective architectures. Thank you for that.

Thank you and the others who chimed in.
So I'll try to output a little more, and try to explain why there are limits to how firm a statement you can make about these systems from my (or indeed any one man's) perspective. It's one thing to point a finger at an issue, quite another to judge its relevance. That kind of information is both situational and, unfortunately, quite rare.

When I look at architectures, I do it in a manner grounded in scientific computing, which for a long time now has focussed on data flow rather than FLOPS. The bottleneck (typically) isn't the time it takes to perform the operations in the arithmetic units, but lies in getting the necessary data to and from the logic fast enough. (Communication between processors is another big issue, but not as relevant for these console designs.)

Weighing the 360 and PS3 CPU designs against each other is difficult. A lot of relevant, system-specific technical minutiae isn't available (latencies, overhead, turnaround times, et cetera), which makes a really good armchair analysis hard. But more importantly, and where I can't contribute, it requires knowledge of the particular application.


I'll explain what I mean, starting with what I perceive as the largest problem with the 360 design: the CPU-to-GPU interface. All memory traffic of the CPUs has to pass through it, as well as all CPU-to-GPU internal communication. Additionally, some of the memory traffic will be congested due to the memory bus being busy processing GPU memory transfers. Add contention between the cores.

The theoretical throughput of the CPU-to-GPU channel is 10 GB/s. Real throughput will be lower for many reasons, among them latencies, both electrical and protocol. For the sake of argument, estimate that the effective bandwidth is halved to 5 GB/s. Now, we've seen tons of FLOPS analysis on these boards. But while some may find this interesting in its own right, the numbers produced should be contrasted with the data paths available. The memory channel of the 360 CPU can, if we are generous, sustain roughly 5 GB/s bidirectional, or 2.5 GB/s in either direction assuming a symmetric load in and out and negligible CPU-to-GPU internal traffic, corresponding to roughly 0.6 billion single-precision floating-point numbers per second.
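The arithmetic spelled out, under exactly those assumptions (half the theoretical rate survives, and the load is symmetric in and out):

```c
/* The bandwidth estimate above, spelled out. Assumptions: half of the
 * 10 GB/s theoretical rate survives protocol/latency losses, and the
 * load splits evenly between the two directions. */
#include <stdio.h>

int main(void)
{
    double peak_gbs      = 10.0;                /* theoretical CPU<->GPU */
    double effective_gbs = peak_gbs / 2.0;      /* assumed real-world    */
    double per_dir_gbs   = effective_gbs / 2.0; /* 2.5 GB/s each way     */
    double gfloats       = per_dir_gbs / 4.0;   /* 4 bytes per float     */

    printf("%.3f billion single-precision floats/s per direction\n",
           gfloats);                            /* ~0.6 billion */
    return 0;
}
```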

The question then becomes: what, given typical applications, is likely to limit the computational throughput of the 360 CPU? The theoretical FLOPS of its cores, or the memory subsystem? It would take extreme data locality, and imply algorithms that do a huge amount of work per data point, for the memory channel not to be limiting. To what extent does this describe game code? I don't know. Only someone who is actually tangling with such code could know, and even then, some games or even parts of games will have different requirements than others. I will go out on a limb though, because it looks pretty damn sturdy, and say that the memory subsystem of the 360 CPU is likely to be the greatest impediment to getting high performance out of the cores, and that trying to avoid loading the CPU-to-GPU channel is going to cause many bald patches on game programmers' heads. If, that is, the overall performance constraints are on the CPU side at all, and not due to the GPU.

An additional reason to believe that this is going to be a difficult spot for the 360 CPU is the much greater resources that are spent on the PS3 for the same problem: the aggregate theoretical bandwidth for CPU-to-memory and CPU-to-GPU traffic is several times higher, and the data paths are separate, avoiding contention at the cost of dynamic resource use. Both consoles target the same application space, and the differences in their abilities in this respect are considerable.

The primary problem of harnessing the processing capabilities of the Cell is generally held to be partitioning your problems into SPE-friendly chunks and having these chunks processed reasonably synchronized (so as not to have the rest of the chip twiddling its thumbs too long waiting for a single SPE to finish up). This is non-trivial for sure, tool-dependent, and pretty much impossible for someone not doing the actual job to judge the difficulty of. Some parts of the game code are likely to be relatively easy to parallelize; other parts definitely are not, or may even be impossible. Again, we are probably not going to be limited by theoretical FLOPS. If you are able to partition your job over all the SPEs, and get uniform and high utilization from them, well, then you are still likely to face the main memory subsystem's throughput limit, with all the latency, protocol, granularity et cetera issues involved.
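A generic fork-join sketch of that partitioning problem, with plain pthreads standing in for SPE dispatch (the chunk split and the barrier at the end are illustrative assumptions): the frame is only done when the slowest chunk is done, which is where the thumb-twiddling comes from.

```c
/* Generic fork-join sketch of partitioning a frame's work into
 * worker-sized chunks. Plain pthreads stand in for SPE dispatch; the
 * join at the end is the sync point where one straggler can leave
 * the rest of the chip idle. */
#include <pthread.h>
#include <stddef.h>

#define NWORKERS 6  /* e.g. the SPEs available to a PS3 game */

struct chunk { float *data; int count; };

static void *worker(void *arg)
{
    struct chunk *c = (struct chunk *)arg;
    for (int i = 0; i < c->count; i++)
        c->data[i] *= 2.0f;          /* placeholder per-element work */
    return NULL;
}

void run_frame(float *data, int n)
{
    pthread_t t[NWORKERS];
    struct chunk c[NWORKERS];
    int per = n / NWORKERS;

    for (int i = 0; i < NWORKERS; i++) {
        c[i].data  = data + i * per;
        c[i].count = (i == NWORKERS - 1) ? n - i * per : per;
        pthread_create(&t[i], NULL, worker, &c[i]);
    }
    /* Barrier: the frame is done only when the LAST chunk is done. */
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);
}
```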

How difficult is it to parallelize the time-critical portions of the game code? Is the Cell CPU even limiting at all, or are the limitations firmly on the GPU side of things? Even developers are only starting to get the answers to those questions, and their experience is likely to differ a bit from person to person. It would be interesting to hear some initial impressions nevertheless, as well as to see whether their opinions change over time.

Armchair expertise only goes so far.
I'll submit that I feel the discussions here are a bit myopic - typically focussing on details (preferably unknown ;)) of the GPU, or FLOPS counting, or some such. How you tie it all together - data paths and interprocess(or) communication - is getting comparatively little attention, probably because it takes a bit more background knowledge and is more complex.

But even then, the analysis is completely hardware centric. The practical balance aspects are completely application dependent, and ultimately it is the programmers who determine how well the hardware will be utilized. Creative programmers may well be able to circumvent what would otherwise appear to be decisive shortcomings of a console. Or, at the other end of the scale, leave capabilities untapped due to constraints in time, talent or tools.

Theorizing from whitepapers can only go so far. Beware anyone drawing strong conclusions from them.
Unfortunately my time in the armchair this New Year is all spent. :)
Happy New Year to ye all, particularly to those of you who wrestle with the new consoles.
 
Entropy said:
I'll explain what I mean, starting with what I perceive as the largest problem with the 360 design: the CPU-to-GPU interface. All memory traffic of the CPUs has to pass through it, as well as all CPU-to-GPU internal communication. Additionally, some of the memory traffic will be congested due to the memory bus being busy processing GPU memory transfers. Add contention between the cores.

The theoretical throughput of the CPU-to-GPU channel is 10 GB/s. Real throughput will be lower for many reasons, among them latencies, both electrical and protocol. For the sake of argument, estimate that the effective bandwidth is halved to 5 GB/s. Now, we've seen tons of FLOPS analysis on these boards. But while some may find this interesting in its own right, the numbers produced should be contrasted with the data paths available. The memory channel of the 360 CPU can, if we are generous, sustain roughly 5 GB/s bidirectional, or 2.5 GB/s in either direction assuming a symmetric load in and out and negligible CPU-to-GPU internal traffic, corresponding to roughly 0.6 billion single-precision floating-point numbers per second.

Thanks for some really great posts, as they've been interesting/informative to read. One of the things that has stood out about the 360's design is, in fact, how balanced it is throughout. And by balanced I mean exactly what you're saying here: processors and datapaths that are designed to support their functionality and not exceed it needlessly just to win a war on paper. That said, I personally can't see MS designing all this technology into the 360 only to miscalculate and have something like Xenon or Xenos only be able to perform at 70% (or less) of its potential because of a datapath. If that were the case, then they probably would have designed a cheaper Xenos or Xenon with fewer pipes/cores (or less GHz, seeing how hot and low-yield the initial Xenons are) rather than create this very advanced/powerful GPU and have lots of its power sit idle. Given the design decisions they've made that ARE fully understood, this bandwidth-starved design doesn't really make any sense to me.

EDIT: I'd be interested to know what your thoughts are if you halved the PS3's bandwidth as you've done here. How would that impact its effectiveness?
 
The XBOX360 will not always be bandwidth-starved; in certain pixel-rich conditions (as in lots of intensive shaders onscreen) you will find the bottleneck switches to the GPU.

Entropy did mention this in his two posts. The act of fine-tuning and being perfectly balanced on all sides is not easy to achieve, especially since memory speeds have increased a lot more slowly than processor and GPU power over the years.

MS chose a UMA-type architecture with the eDRAM for AA, and Sony took a different path (some of those decisions may have been made to ease RSX development, e.g. sticking with DDR memory for the GPU).
 
^^^

For X360,

CPU-GPU is 10.8 GB/sec read and 10.8 GB/sec write from the CPU's L2 cache. However, when accessing GDDR3 at 22.4 GB/sec, the peak CPU-GPU rate is limited to 10.8 GB/sec read (from the leaked 2004 diagram).

Also, the recent 'leak' puts this at only ~6.5 GB/s read/write through the L2 cache due to limitations...
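For reference, here is where the headline figures decompose. The GDDR3 derivation assumes the widely reported 700 MHz / 128-bit configuration; the FSB number is taken directly from the leaked diagrams rather than derived, so treat the breakdown as my assumption:

```c
/* Where the headline Xbox 360 bandwidth figures come from. The GDDR3
 * derivation assumes the widely reported 700 MHz / 128-bit setup;
 * the FSB figure is quoted from the leaked diagrams, not derived. */
#include <stdio.h>

int main(void)
{
    /* GDDR3: 700 MHz clock, double data rate, 128-bit (16-byte) bus */
    double gddr3_gbs = 700e6 * 2 * 16 / 1e9;
    printf("GDDR3 memory:    %.1f GB/s\n", gddr3_gbs);        /* 22.4 */

    /* CPU<->GPU FSB: 10.8 GB/s each way per the leaked figures */
    double fsb_each_way = 10.8;
    printf("FSB (aggregate): %.1f GB/s\n", fsb_each_way * 2); /* 21.6 */
    return 0;
}
```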

Edit: This was a reply to Expletive/Entropy if it wasn't clear...
 
ROG27 said:
That's what I'm saying, Jaws... I believe the 240 number comes from MS including fixed-function ops on the daughter die. The 216 number is purely derived from the 48 ALUs on the parent die.

I think they included those extra fixed-function ops in their number because programmable ops on a traditional GPU would be responsible for those types of post-rendering functions (anti-aliasing, blurring effects, etc.).

Nah, they calculate 240 GF from the 48 shader ALUs:

10 flops/ALU x 48 ALUs x 0.5 GHz ~ 240 GF
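A sketch of the arithmetic behind both figures in the dispute. Whether an ALU is credited with 10 flops/cycle (e.g. a vec4 MADD plus a scalar MADD) or 9 is exactly the point of contention; this only shows that the numbers fit, not which counting is right:

```c
/* The two GFLOPS figures in dispute. Crediting each ALU with 10
 * flops/cycle gives MS's 240; crediting 9 gives the 216 figure.
 * Which ops belong in the count is the actual argument. */
#include <stdio.h>

int main(void)
{
    const int    alus      = 48;
    const double clock_ghz = 0.5;

    printf("10 flops/ALU: %.0f GFLOPS\n", 10 * alus * clock_ghz); /* 240 */
    printf(" 9 flops/ALU: %.0f GFLOPS\n",  9 * alus * clock_ghz); /* 216 */
    return 0;
}
```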
 
Edit: For clarity, I, like Expletive, want more info in reaction to Entropy's very interesting comments on the memory architecture of both systems.

Does somebody know an average value for the CPU/RAM bandwidth of current PCs (I know the P4 and Athlon are somewhat different)?
I've read on X86-secret a value of 6.4 GB/s for the K8 using DDR330 and an 800 MHz HyperTransport bus, but I don't know if that is read only or read/write (in which case it would be 12.8 GB/s read+write).
The Pentium IV seems to have more bandwidth to RAM, but the memory controller included in the Athlon's core seems better where latencies are concerned (and the result is obviously in AMD's favor).
Are PC CPUs limited by CPU/RAM bandwidth issues in game-type applications?
In that case the Xbox 360 seems to have two weak points vs Cell:
no on-chip memory controller (AMD has proven that it is a good design choice)
less bandwidth to RAM.
Where Sony has taken the best of the two current PC CPUs (the on-chip memory controller à la AMD, plus lots of bandwidth to RAM à la the P4's quad-pumped bus with DDR2), MS seems to have chosen what is weak in these very current CPUs.
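Since the question of PC figures came up, here is the generic peak-bandwidth arithmetic. The module speeds are common examples from the era, chosen as assumptions, not necessarily the exact parts in the article the poster read:

```c
/* Generic peak-bandwidth arithmetic for PC parts of the era.
 * peak GB/s = transfers per second x bus width in bytes x channels.
 * The specific module speeds below are assumed common examples. */
#include <stdio.h>

static double peak_gbs(double mtransfers, int bus_bytes, int channels)
{
    return mtransfers * 1e6 * bus_bytes * channels / 1e9;
}

int main(void)
{
    /* K8 with dual-channel DDR400 (PC3200), 64-bit per channel: */
    printf("Dual-channel DDR400: %.1f GB/s\n", peak_gbs(400, 8, 2)); /* 6.4 */

    /* Pentium 4 with a quad-pumped 200 MHz (800 MT/s) 64-bit FSB: */
    printf("P4 800 MT/s FSB:     %.1f GB/s\n", peak_gbs(800, 8, 1)); /* 6.4 */
    return 0;
}
```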
 
I must have been refreshed by my holiday, since I managed to read through this whole thread :)

But sadly no new information.

Happy New Year Everyone :)
 
expletive said:
Who designed the entire system layout of the 360? ATI, IBM, or 'other'?

I would think Microsoft did; from what I have heard, they have their own engineering team. They probably came up with the broad design and then went searching for parts that would fit into their design without too many changes.

And I think in an ATI interview they said Microsoft had a big hand in designing the GPU (might have been the article that started this thread).
 
XCPU is very bandwidth-starved. PS3 alpha-blends all procedurals on Cell.
Xenos is an unrefined, lower-clocked R580 with eDRAM. :D
 