ATI - PS3 is Unrefined

Lysander said:
Jeffrey Brown's IBM article on the 360 CPU

All that tells us is that the scalar DP ops have a 10-cycle latency. The given figures for Xenon's floating point performance (115, or 90, or 75.6) are SP figures.
 
115 DP GFlops per chip would see all the talk being of XeCPU finding its way into supercomputers rather than Cell!
 
Titanio said:
I think you mean parallelism. And I think you'll find that Xenon is a parallel architecture also - without parallelism on a core level, it'd be sitting there with ~25.6-30 Gflops (assuming everything else about it stayed the same). Parallelism is being generally adopted as the route to more power now, that's just the way it is.

edit - actually, I'm not really sure what you're referring to now. SPEs run one thread at a time. In that sense, it compels you to ensure that your single thread is using as much power as possible and does not block. The SPEs, on their own, are single threaded (potential software solutions for multi-threading on one SPE aside).

Your edit is pretty much spot on with what I meant. I see it as a good thing that a single SPE has such high theoretical FP without having to resort to switching between two threads.
I wonder if at some point in development they thought of changing the size of the LS to add more logic instead. In that Cell summary that was posted a while ago they had apparently been considering a smaller LS (128?), but I don't remember whether that was for die size or to add something else. Anyway, I think they made a good decision going with so much memory on the chip. Hope coders feel the same once they've understood how to get the most out of it.
 
dukmahsik said:
i wonder which has the steeper learning curve, xenos or cell

I would hazard a guess and say Cell, by miles. Outside of building an engine that can tile, Xenos isn't all that different from a normal graphics card in how you handle it, I imagine (especially with APIs covering up the metal). Finding uses for Memexport (and that is sort of moot, since you don't need to use it) and tiling will probably be the only real differences -- the shaders that run in PC games and Xbox will probably run fine on Xenos without any changes needed (since you just pass them into a fancy compiler and it makes them all pretty for the GPU anyway... the underlying architecture of shaders and how the code is executed is pretty much the same as on any other GPU). The unified shaders should be completely transparent to the user/programmer.
 
Titanio said:
I've heard the FPU - the other 4 flops - cannot execute unless the VMX is executing a load/store or logical operation? I don't know if the missing 4 flops can be accounted for in other ways also, though.

That sounds about right when two threads are active because the core can only dual issue...

Those missing 4 flops would likely be from the FPU (i.e. a MADD-capable, 2-way SIMD unit), but they are aggregating those flops to present a number (115 GF) that wouldn't even be an achievable theoretical peak...
 
ROG27 said:
I think where we are getting mixed up here, people, is the shader operations which take place on the main die vs. the daughter die. Approximately 216 programmable shader ops take place on the main die... the other 26 take place on the daughter die and have solely to do with post-rendering effects (not fully programmable).

No, I think you're getting mixed up. We are CLEARLY referring to the parent die and its shader ALUs. If you've followed any of the derivations, then you'd realise that we are using 48 ALUs from the shader array and not referring to the fixed-function logic in the daughter die...
 
Dave Baumann said:
The leaks are not particularly accurate in all places - for instance the overview states the shader array is 24G instructions/s, which is wrong since it's 48G instructions/s.

Yeah, I've got the overview doc now and it does state that. However, if you're suggesting that the 216 figure is a typo, the odds are against it for the following reasons:

- introducing a 2nd typo in a technical doc will lower the odds

- the 24/48 figure can easily be a typo; slips by factors of 2, 4 or 10 usually are. The 216 number isn't a plausible slip for 240, nor is it near adjacent numbers on the keyboard/keypad...

- you can cross-reference the 216 number because it's involved in a calculation, i.e. 216 (Xenos) / 88 (R420) ~ 2.455 ~ 2.5

- The 216 number is non-obvious and its derivation was in the Watch article...
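The cross-reference in the point above can be reproduced with a few lines of arithmetic. This is a sketch under the thread's working assumptions (48 shader ALUs, 9 flops/clock per ALU, 500 MHz), not a confirmed spec:

```python
# Cross-checking the 216 GFLOPS figure for Xenos under the thread's working
# assumptions: 48 shader ALUs at 500 MHz, each counted as 9 flops/clock
# (e.g. a Vec4 MADD = 8 flops plus a non-MADD scalar = 1 flop).
alus = 48
flops_per_clock = 9
clock_ghz = 0.5

xenos_gflops = alus * flops_per_clock * clock_ghz
print(xenos_gflops)                            # 216.0

# The ratio against the doc's R420 figure reproduces the ~2.5x claim:
r420_gflops = 88
print(round(xenos_gflops / r420_gflops, 3))    # 2.455
```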

Dave Baumann said:
They also highlight no capability differences between the Vector and Scalar portions.

This can still be valid for 216. E.g. for Vec3+scalar, the scalar unit could be MADD-capable, but for Vec4+scalar it may not be... the same rules may not apply when scheduling these instructions across very wide issue SIMD engines...

So can we now agree it's still 216?
 
aaronspink said:
The problem is we don't know what is actually correct, the "technical" leak doc or the various other MS/ATI documents. Nor do we know whether the technical leak doc is taking into account certain scheduling restrictions and, if so, whether the numbers quoted by IBM/Sony/Nvidia are taking into account similar scheduling restrictions.

For Xenon, I get 76.8 GFLOPS if the core can't dual issue VMX and FPU, 96 GFLOPS if it can (but it's kinda pointless really since you'll at least need to do some loads and stores).

Yeah, I agree with those. The doc clearly states 8 flops/cycle per core as peak, and the derivation would be 77 GF @ 3.2 GHz, but the doc doesn't state this number.

The derivation for 96 GF @ 3.2 GHz would be 10 flops/cycle, which would include 2 flops from the FPU. Since the doc states 8 flops/cycle per core as peak, we can infer that dual issue across FPU+VMX isn't possible.

The doc actually states 84 GF @ 3 GHz, which extrapolates to ~90 GF @ 3.2 GHz. My guess is that they are including some D3D 'free' instructions with that, to get the increase from the expected 8 flops/cycle, 77 GF number...
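All the competing Xenon numbers above follow from the same formula; a sketch where the flops/cycle values are the assumptions being debated, not confirmed figures:

```python
# Xenon (3 cores @ 3.2 GHz) peak GFLOPS under the per-core issue assumptions
# compared in this thread; the flops/cycle values are the competing readings.
def xenon_gflops(flops_per_cycle, cores=3, clock_ghz=3.2):
    return round(cores * flops_per_cycle * clock_ghz, 1)

print(xenon_gflops(8))     # 76.8  - VMX only (8 flops/cycle), no FPU dual issue
print(xenon_gflops(10))    # 96.0  - VMX plus a 2-flop FPU dual issued
print(xenon_gflops(12))    # 115.2 - VMX plus a 4-flop (2-way SIMD MADD) FPU

# The doc's 84 GF @ 3 GHz scales to the ~90 GF figure at 3.2 GHz:
print(round(84 * 3.2 / 3, 1))   # 89.6
```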

aaronspink said:
The 115 GFLOPs number seems to be counting 12 flops per cycle which I still don't understand how they get.

It's likely the 4 flops are from a MADD-capable, 2-way SIMD FPU unit...

aaronspink said:
Likewise, I only get 204 GFLOPs for CELL.

I get that too...
 
Jaws said:
aaronspink said:
Likewise, I only get 204 GFLOPs for CELL.
I get that too...

At UIUC, Dr. Peter Hofstee gave a presentation about Cell; it can be found at the following URL:
http://www.acm.uiuc.edu/conference/webcast.php
(Credit goes to phed for finding it.)
Between 40:18 and 40:31 in the video a block diagram of the PPE is shown. It seems the PPE is able to issue 2 instructions to the VMX/FPU units per clock, which would raise the peak GFLOPS to above 204.8 for a 1+7 configuration.
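Under that reading, the arithmetic would look like this; a sketch assuming 8 flops/cycle per SPE and per PPE VMX issue, plus a 2-flop FPU issue, at 3.2 GHz (all thread assumptions, not confirmed specs):

```python
# Peak Cell GFLOPS for a 1 PPE + 7 SPE configuration at 3.2 GHz, with and
# without the claimed PPE dual issue to VMX + FPU.
clock_ghz = 3.2
spe_gflops = 8 * clock_ghz        # 25.6 per SPE (8 flops/cycle)
ppe_single = 8 * clock_ghz        # 25.6 - VMX issue only
ppe_dual = (8 + 2) * clock_ghz    # 32.0 - VMX plus a 2-flop FPU per clock

print(round(ppe_single + 7 * spe_gflops, 1))   # 204.8 - commonly quoted figure
print(round(ppe_dual + 7 * spe_gflops, 1))     # 211.2 - if dual issue holds
```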
 
Jawed said:
They didn't leave any room on 110nm - 7800GTX-512 isn't a viable product, just a few thousands boards for the purposes of marketing.

Sorry for the late reply.....I'm on vacation and just decided to pop in :)

Considering most GTXs hit ~490 MHz on stock cooling and voltage, they had lots of room. Note that I never said they could have come with the GTX 512 from the start; both you and AlphaWolf jumped to that conclusion for some reason. All I said was that they had lots of headroom. If you think they couldn't have gotten good yields with the original GTX at 460-500 MHz with slightly higher voltage and X1800 XT-class cooling, you're mistaken IMHO.
 
Note, Barry Minor of IBM makes his own FLOPS calculation for CELL and also arrives at a 204 GFLOPS number.

See (scroll down to a Sept. 1st response comment from Barry) : http://gametomorrow.com/blog/index.php/2005/07/26/beyond-polygons/

Given Barry counts the PPE at 25.6 GFLOPS, it is likely that Xenon comes in at 77 GFLOPS, unless there is some special scheduling/execution magic in the Xenon VMX/FPU units that no one has publicly talked about.
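Barry's count and the ~77 GFLOPS inference for Xenon can be laid out side by side; a sketch assuming every core (PPE, SPE, or Xenon core) peaks at 8 flops/cycle at 3.2 GHz:

```python
# Reproducing the 204.8 GFLOPS Cell count (1 PPE + 7 SPEs) and the
# corresponding ~77 GFLOPS Xenon figure, assuming each core peaks at
# 8 flops/cycle at 3.2 GHz (matching Barry Minor's 25.6 GFLOPS PPE count).
clock_ghz = 3.2
per_core = 8 * clock_ghz                  # 25.6 GFLOPS for PPE, SPE, or Xenon core

cell_gflops = per_core + 7 * per_core     # 1 PPE + 7 SPEs
xenon_gflops = 3 * per_core               # 3 Xenon cores

print(round(cell_gflops, 1))    # 204.8
print(round(xenon_gflops, 1))   # 76.8, i.e. the ~77 GFLOPS figure
```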
 
dukmahsik said:
i wonder which has the steeper learning curve, xenos or cell
I'm not a games programmer, and thus I can't really say what one of those would find difficult. It would depend on their background and their personality, I guess.
But judging from my own experience, I'd guess that the PS3 takes a little more in the way of switching gears if you come from a typical PC-like background, whereas the 360 will start giving you headaches as you try to exploit its three cores more fully.

Both consoles can be programmed straight along PC patterns: you simply use the PPE on the PS3 or one of the cores on the 360, and then talk to the GPU as per normal. Both consoles allow themselves to be used like that, and it doesn't limit the GPU much from what I can see; the limitations are mostly on what you can achieve on the CPUs. So that would produce nice-looking pixels on the screen for both consoles. If your game requires more in the way of CPU performance, for physics, game logic or graphical processing reasons, then, but only then, will you have to dig in deeper.

Looking at the designs from the outside implies that slightly different challenges will present themselves.
The 360 is architecturally very similar to, say, the Xbox or an integrated-graphics PC. It has advantages over both, though, in substantially higher CPU-to-GPU and GPU-to-memory bandwidth (as well as the bandwidth-saving feature of the intelligent buffer memory on the GPU). It also has three cores, operating in a traditional symmetric multiprocessing/uniform memory architecture.

The problem with this layout is contention for memory. The three cores have relatively small private L1 caches, they share the L2, and they share the main memory with the GPU. So while it presents a rather straightforward programming model, actually getting the CPU to perform well is going to require dealing with three cores thrashing each other's caches, and stepping on each other's feet in trying to access the same memory pool as the worst memory hog of them all, the GPU. Additionally, the internal data traffic between the CPU and the GPU will also load the CPU memory path. As a programmer this situation is typically really nasty, because you don't really have much in the way of tools to control/synchronize the different threads and the GPU. (Lockable cache areas can help. A little bit.)

These issues can basically only be alleviated by making the constrained resource really ample. But doing that with the memory path is expensive. So while the 360 is better off than a typical integrated-chipset PC, you can still see that bandwidth and memory contention are going to be a significant problem, and a difficult one to manage at that. The very design principles that make the transition to multiprocessing easy, both from a programming and from a hardware point of view, come back to bite you and make actually extracting high utilization rates from the additional resources difficult.


The PS3 requires you to take a step back from typical PC procedure, and take a broader look at what you want to achieve. (One reason is that you might want to utilize the CPU for some graphics-related task, shifting bandwidth and processing capabilities around for optimum yield. I won't go there, as I'm not qualified to comment.) Not only do you want to partition your problem into blocks that can be farmed out to the SPEs, but you'd also want to adapt your in-thread algorithms to be partitioned and distributed to the SPEs. The Cell processor offers additional flexibility in that the SPEs can also pass/pipe tasks between themselves, and basically you have a bunch of options there that to a PC programmer are new, and thus both a bit difficult and hopefully exciting.

What is really good about the PS3 compared to the 360 is the resources that have been dedicated to managing memory and communication. The SPEs have 256 KBytes of local memory, which they can access without any risk of having their data flushed, or needing to cache snoop, or any such. There are fast data paths within the chip to transfer data to and from the PPE/SPEs, and between them. The CPU also has its own dedicated path to memory and a completely separate, very high bandwidth connection to the GPU, which in turn has its own dedicated path to graphics memory. And not only does the PS3 sidestep the nastiest contention issues by providing separate datapaths, these separate datapaths also provide higher bandwidth individually than the shared resources of the 360. For someone with a background in scientific computing like me, the data flow model of the PS3 looks much better. I can't speak for games programmers.

So which console offers the steeper learning curve? I'd say that depends on where on the curve you are. Not all games require cutting-edge utilization, and at that point I'd say both should actually be fairly easy to deal with. If you want to squeeze more out of the respective consoles, the PS3 departs more significantly in its architecture and possibilities from a PC, and thus most programmers would need to study the architecture, their algorithms and the available tools carefully in order to build an application that is well suited to the console. In contrast, the vanilla SMP/UMA of the 360 is really simple conceptually and doesn't suffer much of a learning curve at all apart from managing a few threads. In that respect the 360 is much simpler. Its memory and communication limitations will range from being non-issues to presenting insurmountable problems depending on what you want to achieve, but wringing really good performance from the 360 will require you to balance the different processes that need to access the memory paths very, very well, because you want to wring maximum utilization out of this limited resource. And that will definitely not be easy.

The steeper learning curve is in all probability the PS3's, but in no way, shape or form does that imply that the 360 will cause its programmers fewer gray hairs.

Again, all of the above from someone without a games developing background, but with practical experience of non-PC type architectures. YMMV.
 
Jaws said:
Dave Baumann said:
The leaks are not particularly accurate in all places - for instance the overview states the shader array is 24G instructions/s, which is wrong since it's 48G instructions/s.

Yeah, I've got the overview doc now and it does state that. However, if you're suggesting that the 216 figure is a typo, the odds are against it for the following reasons:

- introducing a 2nd typo in a technical doc will lower the odds

- the 24/48 figure can easily be a typo; slips by factors of 2, 4 or 10 usually are. The 216 number isn't a plausible slip for 240, nor is it near adjacent numbers on the keyboard/keypad...

- you can cross-reference the 216 number because it's involved in a calculation, i.e. 216 (Xenos) / 88 (R420) ~ 2.455 ~ 2.5

- The 216 number is non-obvious and its derivation was in the Watch article...

Also wanted to add that I re-read the 24/48 Ginst figure for Xenos in the doc, and it doesn't look like a typo in the context it's presented in, with its comparison to Xbox 1's discrete vertex/pixel shader units. They basically split it in the comparison.

So do we agree? (Third time of asking!)

rendezvous said:
At UIUC, Dr. Peter Hofstee gave a presentation about Cell; it can be found at the following URL:
http://www.acm.uiuc.edu/conference/webcast.php
(Credit goes to phed for finding it.)
Between 40:18 and 40:31 in the video a block diagram of the PPE is shown. It seems the PPE is able to issue 2 instructions to the VMX/FPU units per clock, which would raise the peak GFLOPS to above 204.8 for a 1+7 configuration.

Thanks for the link. I had a look, but unless I'm missing something, I can't see it in the same cycle...?
 
Jaws said:
Thanks for the link. I had a look, but unless I'm missing something, I can't see it in the same cycle...?

I figured that the numbers next to the internal buses were the widths of the buses in number of instructions. The bus going from the VMX FPU Issue (Queue) stage has width 2, just like the general issue stage. There may however be limitations on which instructions you can pair in the FPU/VMX units, which would affect the performance numbers.
 
rendezvous said:
I figured that the numbers next to the internal buses were the widths of the buses in number of instructions. The bus going from the VMX FPU Issue (Queue) stage has width 2, just like the general issue stage. There may however be limitations on which instructions you can pair in the FPU/VMX units, which would affect the performance numbers.

Okay,

...................| 2 VMX FPU Issue Queue |...................
| 1 VMX Load/Store | 1 VMX Logic/Arith | 1 FPU Load/Store | 1 FPU Logic/Arith |


The queue branches with width 2 on either side, I'm not sure what combination/permutation to take there in a given cycle...?
 
We don't know much about PPE. From dev hints here, it's undergone a few changes. Assumptions that it's very like a Xenon core seem to be pure speculation.
 
My opinion doesn't mean squat, but I'm going to say that the 216 is "somewhat" agreeable, but that the Vec4+scalar being MADD-incapable (as in, not all 5 of the components being MADD-capable) is not. You can arrive at 216 if you assume that pixel work will always be 4 components, vertex work 5 components, and the work is split evenly (if we are to assume that, as you too noticed in that one chart, they just split work out evenly or otherwise ignore the scalar, then it's quite plausible); or they figure the scalar is only used 50% of the time, or something along those lines. But the scalar not being MADD-capable just boggles my mind: the additional complexity in hardware/compiler to suit that, plus a loss in usable ALU time, plus very minimal savings in hardware, transistor-wise. I don't buy that. An average of 4 and 5 components yielding 9 flops/clock per pipe I will buy, though.

I can't say I entirely trust the person(s) responsible for the document. 88 GFLOPS for the R420, from 16 pixel pipes and 6 vertex pipes running at 500 MHz: I only found one reasonable match for that (not that I spent very long on it, of course), with 4 components and 4 components. Maybe they don't find the scalar useful for vertex operations. Maybe there's something I missed about R420. I don't know. Except that I don't believe a MADD Vec4 + mul/add/whatever scalar. Take my pennies or leave 'em.
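For what it's worth, the one match mentioned can be written out; a sketch that counts all 22 pipes as Vec4 MADD units at 500 MHz and ignores the scalar, which is an assumption made to match the document's figure, not a documented R420 spec:

```python
# One way to arrive at the document's 88 GFLOPS for R420: 16 pixel pipes
# plus 6 vertex pipes, each counted as a Vec4 MADD (4 components x 2 flops)
# at 500 MHz, with the scalar unit not counted.
pixel_pipes = 16
vertex_pipes = 6
flops_per_clock = 4 * 2      # Vec4 MADD; scalar ignored
clock_ghz = 0.5

r420_gflops = (pixel_pipes + vertex_pipes) * flops_per_clock * clock_ghz
print(r420_gflops)           # 88.0
```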
 