Was Cell any good? *spawn

Status
Not open for further replies.
Yup, prolly not enough cpu grunt to go around. They scale havok across all spu's which is great, but you need a couple of spu + ppu to match the other consoles baseline cpu power, and you need more spus for graphics work, etc, so not a whole lot of spu's left for havok alas.

An SPU can be faster than the PPU VMX when the programmer uses the Local Store and DMA effectively. You don't need a couple of SPUs to match a PPU. You need a couple of SPUs to match the GPU at its game though. e.g., Can you get a PPU VMX to do MLAA and blend the result with the GPU on-time ?

They can make great tech demos though that use all 6 spu's, but that's not indicative of the real games market alas.

Games are already using all the SPUs and PPU. ^_^
The question is whether they can use them effectively.
 
An SPU can be faster than the PPU VMX when the programmer uses the Local Store and DMA effectively. You don't need a couple of SPUs to match a PPU.



Games are already using all the SPUs and PPU. ^_^
The question is whether they can use them effectively.

I love how you try to educate the guy who's actually worked on the platform. :LOL:
 
Yup, prolly not enough cpu grunt to go around. They scale havok across all spu's which is great, but you need a couple of spu + ppu to match the other consoles baseline cpu power, and you need more spus for graphics work, etc, so not a whole lot of spu's left for havok alas. They can make great tech demos though that use all 6 spu's, but that's not indicative of the real games market.

Isn't Havok style computation exactly what the SPUs excel at? I do not think Havok needs many SPUs to beat the 360 cpu. Does anyone have any benchmarks?
 
An SPU can be faster than the PPU VMX when the programmer uses the Local Store and DMA effectively. You don't need a couple of SPUs to match a PPU.

It depends on the code and the game. Effective use of local store and dma is largely moot, you always use local store one way or another, and dma can be absorbed. What does matter is what you are trying to do and how it translates to code. If what you are doing is such that it can very effectively use dual issue to where the spu profiler tells you 0.5 cycles per instruction, then yeah a single vmx unit will not be able to keep pace. That doesn't always happen though. Early vmx code likewise didn't keep up with spu code due to stalls, etc but people are pretty good at coding vmx now. Also the nature of spu design means you can sometimes use a more efficient algorithm on the vmx version. In any case I gave approximate numbers, it's impossible to say exactly how many spu's are needed to replace a ppu, but the point was that some have to be used.


You need a couple of SPUs to match the GPU at its game though. e.g., Can you get a PPU VMX to do MLAA and blend the result with the GPU on-time ?

Definitely tough to say how many spu's replace gpu, it will vary game to game. I threw a number out there just to imply that spu's aren't all free to run havok, but definitely don't use 2 spu as a rule of thumb, it will vary wildly across games! Mlaa as I understand it is pretty involved, I can't imagine ppu vmx would be used for that but then again mlaa came into use after my time in games so someone else can correct me there. Seems like all post aa solutions are either run on gpu or spu.


Games are already using all the SPUs and PPU. ^_^
The question is whether they can use them effectively.

For sure...I'm saying I would not expect a system split across the two in anything more than ppu being traffic cop and spu handling the main work, or the main work all done on vmx and not split off to spu. Like havok, I can't imagine their physics resolver would work simultaneously across spu+ppu, I figure they have an all spu solution for ps3, and vmx version on 360.

Do any ps3 games really show off physics better than the 360? I haven't seen any but I mostly just look at the aaa games. Psn games though could be good candidates to show off cell physics, they have much less going on graphically so they could use all 6 spu's to party with.


Isn't Havok style computation exactly what the SPUs excel at? I do not think Havok needs many SPUs to beat the 360 cpu. Does anyone have any benchmarks?

Don't have benchmarks but I asked around and the typical answer is "2" when asked how many spu's are given to havok. Spu's are busy little beasts in today's games, not much of their time to spare.
 
You also need to minimize Blu-ray head seeks; that's a big killer of performance. So in practice both end up being compressed as heavily as possible.

Yes, more compression is helpful. And if it's already implemented on one, developers would use it for the other. The developers can also duplicate the data on Blu-ray to minimize seek time.

They would not share a given load over all 8 cores because the spu cores would be optimized differently than the ppu cores. The ppu cores all but have to be used in vmx at this point, would be madness otherwise. The spu cores need to take advantage of dual issue whenever possible. The two may even have different data structures/algorithms, each optimized for vmx or spu+dma. Most everything needs to be shifted to spu now and optimized for that, whatever is left over will need to be optimized for ppu and left entirely there, I wouldn't expect the two to cross. There are definitely optimization similarities between the two, but you have to treat them separately to get the most out of them.

In some cases, the PPU can be the scheduler for distributing SPU workload though. At the same time, the VMX may be occupied doing similar/preparation tasks too.
 
It depends on the code and the game. Effective use of local store and dma is largely moot, you always use local store one way or another, and dma can be absorbed. What does matter is what you are trying to do and how it translates to code. If what you are doing is such that it can very effectively use dual issue to where the spu profiler tells you 0.5 cycles per instruction, then yeah a single vmx unit will not be able to keep pace. That doesn't always happen though. Early vmx code likewise didn't keep up with spu code due to stalls, etc but people are pretty good at coding vmx now. Also the nature of spu design means you can sometimes use a more efficient algorithm on the vmx version. In any case I gave approximate numbers, it's impossible to say exactly how many spu's are needed to replace a ppu, but the point was that some have to be used.

...

The Local Store is not moot. It shortens memory access for both code and data. But you need to fit your code into the small size.

People are getting better in both VMX and SPU programming today. You don't really have to rely on approximate numbers to know which is more effective for real time computing. While handling other tasks, the SPUs can calculate MLAA and blend it to keep pace with the GPU. You need both computational power, enough cores, plus quick and high memory throughput to achieve that.



Don't have benchmarks but I asked around and the typical answer is "2" when asked how many spu's are given to havok. Spu's are busy little beasts in today's games, not much of their time to spare.

Allocating one full SPU to Havok may be a waste if the game doesn't employ physics heavily. They may spread the load to 2 or more part time SPUs so they don't have to worry about it. It doesn't measure relative performance at all.
 
The Local Store is not moot. It shortens memory access. But you need to fit your code into the small size.

Local store in and of itself doesn't shorten memory access. You have to manage your dma right to reduce memory access. As long as your data is dma'd into local store before spu code needs it then yes, it does eliminate memory access latency. The same though applies to vmx code. The difference between the two is that cache handles it for you with vmx, and you more or less manually implement the cache yourself on spu versions. You can still get the same result on vmx, if your instruction latency exceeds the latency to have the vmx registers filled in time then the net result is the same.
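
To make the "manually implement the cache yourself" point concrete, here is a minimal double-buffering sketch in plain C. It is only an illustration under assumptions: `fake_dma_get` is a hypothetical stand-in built on `memcpy` (a real SPU build would start an asynchronous MFC transfer and wait on it later), and `CHUNK`, `process_chunk` and `sum_streamed` are made-up names, not anything from an SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 256  /* floats per buffer; hypothetical size, tuned per job in practice */

/* Hypothetical stand-in for an asynchronous DMA "get" into local store.
 * Here it is a synchronous memcpy, so nothing actually overlaps. */
static void fake_dma_get(float *local, const float *main_mem, size_t count)
{
    memcpy(local, main_mem, count * sizeof(float));
}

/* The work done on data that is already resident in the local buffer. */
static float process_chunk(const float *local, size_t count)
{
    float sum = 0.0f;
    for (size_t i = 0; i < count; ++i)
        sum += local[i];
    return sum;
}

/* Sum n floats from main memory, requesting chunk k+1 before working on chunk k. */
float sum_streamed(const float *main_mem, size_t n)
{
    static float buf[2][CHUNK];  /* two local buffers: one in use, one being filled */
    float total = 0.0f;
    size_t done = 0;
    int cur = 0;

    assert(main_mem != NULL || n == 0);
    if (n == 0)
        return 0.0f;

    size_t cur_count = n < CHUNK ? n : CHUNK;
    fake_dma_get(buf[cur], main_mem, cur_count);  /* prime the first buffer */

    while (done < n) {
        size_t next_off = done + cur_count;
        size_t next_count = 0;
        if (next_off < n) {
            next_count = (n - next_off) < CHUNK ? (n - next_off) : CHUNK;
            /* Request the next chunk before touching the current one. */
            fake_dma_get(buf[cur ^ 1], main_mem + next_off, next_count);
        }
        total += process_chunk(buf[cur], cur_count);  /* compute on resident data */
        done = next_off;
        cur ^= 1;
        cur_count = next_count;
    }
    return total;
}
```

With a real asynchronous transfer the request for the next chunk would be in flight while `process_chunk` runs, which is the part a hardware cache and prefetcher would otherwise do for you on the vmx side.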


Allocating one full SPU to Havok may be a waste if the game doesn't employ physics heavily. They may spread the load to 2 or more part time SPUs so they don't have to worry about it. It doesn't measure relative performance at all.

Remember my original point was to show that all 6 spu's are likely not available just for physics, which is why physics still remains relatively basic in games even though havok does scale to all 6 spu's.
 
Local store in and of itself doesn't shorten memory access. You have to manage your dma right to reduce memory access. As long as your data is dma'd into local store before spu code needs it then yes, it does eliminate memory access latency. The same though applies to vmx code. The difference between the two is that cache handles it for you with vmx, and you more or less manually implement the cache yourself on spu versions. You can still get the same result on vmx, if your instruction latency exceeds the latency to have the vmx registers filled in time then the net result is the same.

Yes and no. In the self-contained SPU world, even instructions reside in the LocalStore. Everything is simple/shortcut'ed, and runs at that speed.

While you can control your data latency in VMX, the latency is longer, and the entire architecture is subjected to the full memory overhead.

Try writing a post processing effect on VMX to compete with the GPU (while also using the VMX for other tasks within one frame time).

Remember my original point was to show that all 6 spu's are likely not available just for physics, which is why physics still remains relatively basic in games even though havok does scale to all 6 spu's.

tuna was asking for benchmark though. Your previous response of 2 SPUs for 1 VMX is misleading, 'specially if the SPUs are part time.

Physics remains basic because not many games need it. It has nothing to do with the SPUs or VMXs. If physics is important enough for a game, we will see the developers allocating enough budget to implement it well.
 
^_^ You can post too if you have something to contribute !
Whether joker454 feels educated or not, it's not my intention at all.

As Shifty pointed out before, this thread is intended to fuel silly arguments, so I'm just here to see the silliness unfold.

I think you contribute enough in this silly thread for the both of us, thanks. ;)

I don't get what you mean by the "whether joker454 feels educated or not" bit. He worked on the PS3, so I'm sure he knows more than many people here.
 
Yeah, it'll likely be just a part of those two cores. Remember those Killzone SPU job overview graphs? There's a crazy amount of jobs going on at any one time spread over all seven cores.

Also patsu, spu's can actually be scheduled more efficiently by other spu's rather than the ppu.

That said, Motorstorm 2 and 3 have some pretty involved physics going on at very high speeds, and where that differs from an also quite proficient game like Split Second is that the cars break into a lot of different pieces that all have collision detection with each other as well as with a sometimes constantly changing environment (like Split Second).

Super Stardust also handles a completely crazy number of objects that collide, are navigated by AI, etc. There is nothing quite like it. Flower's grass is another interesting example, though recently a 360 game (was it Kinectimals?) also did something like that quite impressively, I thought.
 
Yes and no. In the self-contained SPU world, even instructions reside in the LocalStore. Everything is simple/shortcut'ed, and runs at that speed.

Code execution is similar on both when written right (excluding dual issue). On the vmx side it keeps I think 6 instructions in the pipeline, so properly written vmx code will be single cycle execute. The compiler helps with scheduling but you can do it by hand as well. On spu you have to have your code dma'd to local store before it can execute. Both have their issues when it comes to having code ready to run, but both are solvable.

Memory latency is also similar on both, in that instructions only run at full speed if the vmx registers are ready and if the data you need is ready in local store. Again similar on both, the data needs to be there otherwise you stall. Both have their various methods to resolve that.

If you exclude dual issue, both will run at the same speeds more or less, vmx vs spu. Dual issue is a factor but realistically you can't always get a perfect mix of float/int instructions to hit the magical 0.5 cycles per instruction.
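
A rough way to see why the ideal dual-issue figure (two instructions per cycle, i.e. 0.5 cycles per instruction) is hard to hit: the sketch below assumes the SPU pairs at most one even-pipeline and one odd-pipeline instruction per cycle and ignores dependencies, alignment and stalls entirely. The function names are hypothetical and the result is a best-case bound, not something a profiler would report.

```c
#include <assert.h>

/* Best-case cycles for a block with no stalls: each cycle can issue at most
 * one even-pipe and one odd-pipe instruction, so the longer queue sets the time. */
unsigned ideal_cycles(unsigned even_count, unsigned odd_count)
{
    return even_count > odd_count ? even_count : odd_count;
}

/* Best-case cycles per instruction for the given instruction mix. */
double ideal_cpi(unsigned even_count, unsigned odd_count)
{
    unsigned total = even_count + odd_count;
    assert(total > 0);
    return (double)ideal_cycles(even_count, odd_count) / (double)total;
}
```

A perfectly balanced block, `ideal_cpi(50, 50)`, gives 0.5. A block that is 80% even-pipe work, `ideal_cpi(80, 20)`, cannot do better than 0.8 even before any stalls are counted.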


tuna was asking for benchmark though. Your previous response of 2 SPUs for 1 VMX is misleading, 'specially if the SPUs are part time.

Where did I say 2 spu replaces 1 vmx? That's definitely not the case.


Physics remains basic because not many games need it. It has nothing to do with the SPUs or VMXs. If physics is important enough for a game, we will see the developers allocating enough budget to implement it well.

I disagree. The tools are there to have lots of stuff interact with good physics. There just isn't enough cpu power to do it in these consoles. That's why I asked about psn games, they are definitely less graphics intensive so I was curious if any pushed the physics boundaries with all 6 spu's.
 
Don't have benchmarks but I asked around and the typical answer is "2" when asked how many spu's are given to havok. Spu's are busy little beasts in today's games, not much of their time to spare.

I'd assume all SPUs are being used by Havok.
I.e. MLAA processing and Havok physics calculations would run on the same SPU, but not at the same time, since that would overload that SPU, so the job system would choose the most idle SPU to run the next calculation.
If you just let, say, 2 SPUs do one thing and 4 other SPUs do a different thing, it wouldn't be very optimized, would it?
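
The "pick the most idle SPU" idea can be sketched as a greedy least-loaded assignment. This is purely illustrative: the names are hypothetical, the load numbers are abstract cost units, and nothing here reflects how Havok or any actual PS3 job manager decides where work goes.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_WORKERS 6  /* the six SPUs discussed in the thread */

/* Return the index of the worker with the least queued work
 * (ties go to the lowest index). */
size_t pick_least_loaded(const unsigned load[NUM_WORKERS])
{
    size_t best = 0;
    assert(load != NULL);
    for (size_t i = 1; i < NUM_WORKERS; ++i)
        if (load[i] < load[best])
            best = i;
    return best;
}

/* Assign one job of the given cost; returns the worker that received it. */
size_t submit_job(unsigned load[NUM_WORKERS], unsigned cost)
{
    size_t w = pick_least_loaded(load);
    load[w] += cost;
    return w;
}
```

Under this scheme a physics job and an MLAA job can land on the same worker at different times, and no worker is reserved for one kind of task.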
 
As Shifty pointed out before, this thread is intended to fuel silly arguments, so I'm just here to see the silliness unfold.

I think you contribute enough in this silly thread for the both of us, thanks. ;)

I don't get what you mean by the "whether joker454 feels educated or not" bit. He worked on the PS3, so I'm sure he knows more than many people here.

Of course, but, forgive my arrogance, every developer has a different mind and a different way of working on a platform, and that absolutely doesn't mean we can't discuss things with him or try to share some knowledge. And if we get something wrong, he can easily put us to shame.
 
Don't have benchmarks but I asked around and the typical answer is "2" when asked how many spu's are given to havok. Spu's are busy little beasts in today's games, not much of their time to spare.

Is that two SPUs statically set to run nothing else than Havok, or do they run other stuff as well? If so, how much of their time is Havok time?
 
Is that two SPUs statically set to run nothing else than Havok, or do they run other stuff as well? If so, how much of their time is Havok time?

I never used havok directly (not a physics guy) but from what I recall you could use the spu's that havok grabbed for other tasks when havok was idling.
 
Of course, but, forgive my arrogance, every developer has a different mind and a different way of working on a platform, and that absolutely doesn't mean we can't discuss things with him or try to share some knowledge. And if we get something wrong, he can easily put us to shame.

I agree that every developer has different views and methods on how to approach a given piece of hardware. However I'm also sure in the years of development, a typical developer would be familiar and likely have tried many different methods to use said hardware. On top of this, unless I'm mistaken, before Joker stopped developing to pursue other opportunities, he was primarily a PS3 developer.

Sorry but I highly doubt anyone in the PS3 brigade is in any position to tell a former developer what is and is not an error.
 
Sorry but I highly doubt anyone in the PS3 brigade is in any position to tell a former developer what is and is not an error.

We're not on Joker's blog here, it's a discussion-forum.
Sure, Joker's contributions are always interesting, but I'd be very surprised if anyone here thought that anyone here never got something wrong.
If you choose to believe only one person has all the answers, that's up to you.
 
Yes and no. In the self-contained SPU world, even instructions reside in the LocalStore. Everything is simple/shortcut'ed, and runs at that speed.

While you can control your data latency in VMX, the latency is longer, and the entire architecture is subjected to the full memory overhead.

Try writing a post processing effect on VMX to compete with the GPU (while also using the VMX for other tasks within one frame time).

We can assume a tight code loop with very few branches when we do post processing on any CPU (SPU or PPU). The PPU instruction cache prefetches, and since the code is relatively small and is not branchy, there will be pretty much no instruction cache stalls on the PPU. Most instructions are read directly from L1i.

The data access pattern depends on the nature of the post process effect. The processed back buffer is just a single large linear memory array. Both writes and reads to it are very cache friendly. If the post process only reads & writes the back buffer, with manual cache control prefetch instructions you can be almost sure that the accessed data is in L1d when the CPU needs it. L1d access is faster than local store access.
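
As a rough sketch of what manual prefetching over a linear buffer looks like, here is a portable C loop using the GCC/Clang `__builtin_prefetch` hint rather than the PPU's own cache-touch instructions. The function name `brighten`, the assumed 128-byte line size and the 256-byte look-ahead distance are all illustrative assumptions; real values would come from profiling on the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES     128  /* assumed cache-line size */
#define PREFETCH_AHEAD 256  /* bytes to look ahead; hypothetical, tune by profiling */

/* Add `amount` to every byte of a linear buffer, saturating at 255. */
void brighten(uint8_t *pixels, size_t n, uint8_t amount)
{
    assert(pixels != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        /* Hint only, once per line: request data we will touch soon.
         * rw = 1 (we will write it), locality = 0 (streamed through once). */
        if ((i % LINE_BYTES) == 0 && i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&pixels[i + PREFETCH_AHEAD], 1, 0);
#endif
        unsigned v = (unsigned)pixels[i] + amount;
        pixels[i] = (uint8_t)(v > 255u ? 255u : v);
    }
}
```

The prefetch is only a hint and does not change the result. The point is that the access pattern is known in advance, so the request can be issued well before the data is used.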

For MLAA and deferred lighting (with no shadow map sampling) the accesses should be very cache friendly. For shadow mapped deferred lighting, the access pattern is more random. But since shadowmaps are too large to fit in local store, it will be even slower to sample them in the SPUs (a manual local store cache system needs to be implemented). Of course SPUs are smaller than PPUs, so you can have more of them on the same die. The GPU is the best bet for random memory access patterns, since it incorporates such efficient latency hiding (by excessive parallelism).
 
<delurk>
OK, let's look at some actual numbers.
Grab your copy of the CBEA Programming Handbook (Version 1.11), if you want to read along.

Section 6.1.2 gives us an overview of the PPE caches. There's a separate L1 I$ and D$, 32KB each. If you look at figure 6-1, you can see the entire setup, but the important part to notice is that the two SMT threads share all caches. What this effectively means is that for the purpose of this comparison, one PPE should be assumed to execute one thread. SMT doesn't help when you're going for peak performance. Both SMT threads can share the data in the caches, so if both threads run the same code, at least the I$ contention can be minimized. But again, not that interesting.

If you look at 6.1.3.1, you'll notice that the L1 I$ contents do not need to be in L2, so this gives the PPE the ability to use the entire L2 as a D$, as long as your code fits into the I$. This is actually really nice.

Of course, caches suffer from aliasing, which you can look up in sections 6.1.3.5 and 6.1.3.10. I'll be ignoring this.

The L2 is 512KB.

So how fast are these? Sadly, there is no official information out there that I can find. You can google around a bit and find some ball-park numbers, it's not really super important, as you'll see in a bit.

Also, if you enjoy these things, look at section 6.2, so the next time someone tells you that SPE doesn't have caches, you can make them angry at you.

A.3.2. gives us the latencies and throughput of the VMX32 instructions. The latency of an instruction depends on what unit processes it, so we have VXU load/store (memory access, 2 cycles), permute (4 cycles), simple (integer stuff, 4 cycles), FPU (single-precision floating point, 12 cycles), estimate (single precision floating point estimator for rcp and rsqrt, 14 cycles) and complex (integer multiplication, 9 cycles). All of these are well behaved and have a throughput of 1 instruction per cycle.
The register file is 32 128b entries large, or 512B.
Also note that VXU load/store is coupled with the LSU, which has implications for dual issue.

As it turns out, the PPE actually does support dual issue, as defined in A.5. But it's complicated. There is this great figure A-1, which explains the rules. So if you have a VSU Type 1 instruction, which is all the math stuff, in slot 0, you can get a Type 2 in slot 1. Cool stuff. You can even dual-issue scalar integer stuff with VMX, which the SPU can't. No dual-issue of FPU and VMX, however.

Now let's compare that to the SPU.

The Synergistic Processing Element has 256KB of Local Store, an SPU with 128x128b registers (that's 2KB) and 4 execution units (+ change) and a Memory Flow Controller, which is the funky DMA unit.
Figure 3-2 gives you a nice idea how the SPU is set up. The important part is that every cycle, you can execute an instruction in both the even and odd pipelines (or EVEN and ODD, as they are usually referred to). If this reminds you of the PPU earlier, you wouldn't be too far off the mark.

I'd also recommend looking at section 3.2.4.2 about DMA lists, which should make clear that the MFC is actually quite powerful and can take a lot of work off the SPU's shoulders. SPU beginners often fail to exploit the MFC fully, as it's not really something you have in a regular CPU. You can often lay out your data in a way that the MFC will be significantly more efficient than any prefetcher could ever be, simply because you have fine-level control. There will never be speculative fetching that wastes bandwidth and LS, for example.
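
To give a feel for what a DMA list buys you, here is a plain C sketch of a gather: a list of (address, size) pieces in main memory is packed back-to-back into one local buffer. The struct `gather_elem` and the function `gather_list` are hypothetical and do not match the real MFC list-element layout; `memcpy` stands in for the transfers the MFC would perform on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list element: where a piece lives in main memory and how big it is.
 * This is not the real MFC list-element format. */
typedef struct {
    const unsigned char *src;
    size_t size;
} gather_elem;

/* Copy every listed piece back-to-back into `local`. Returns the number of
 * bytes written, or 0 if the pieces would not fit in `capacity`. */
size_t gather_list(unsigned char *local, size_t capacity,
                   const gather_elem *list, size_t count)
{
    size_t total = 0;
    assert(local != NULL && list != NULL);

    for (size_t i = 0; i < count; ++i) {  /* first pass: size check */
        if (list[i].size > capacity - total)
            return 0;
        total += list[i].size;
    }
    size_t off = 0;
    for (size_t i = 0; i < count; ++i) {  /* second pass: the "transfers" */
        memcpy(local + off, list[i].src, list[i].size);
        off += list[i].size;
    }
    return total;
}
```

Only the bytes named in the list are moved, which is the contrast with a speculative prefetcher that may pull in lines you never use.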

But well, let's jump to table B-2, because this is getting ridiculously long.
Look at those latencies and stalls. Notice something? Most of the instructions that are “simple” on the PPE are 2 cycles now instead of 4 and all that “float” stuff is mostly 6 cycles instead of 12. Estimates are a lot faster. The lq* and stq* instructions are listed as 6 cycles, which is literally how long it takes to get the data out of local store.

Now that we've seen some numbers, what does that mean?
Starting from the top: Why do you care about instruction latencies and not just throughput? In terms of throughput the two PEs are pretty much the same, after all.
This comes down to the dependencies and critical path of your computation. You need to wait a number of cycles defined by the instruction latency, before you can use the result of that instruction. If you don't want to stall the chip, you will need to have other work to do in that time. And this basically means having more than one computation in flight at a time and interleaving those. Of course, to do that you need to be able to store the data for all those computations, which first and foremost means you need more register space the higher your latencies are. So the SPUs have 4 times the registers and half the latency. That's a pretty major advantage. Just imagine the SPU only had 16 registers, which would be equivalent to the VMX.
You could make an argument that for a lot of code the SPU's register file is a bit overkill, but I'd disagree with that for the general case.
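
The register-versus-latency argument can be put into a few lines of C. This is a back-of-the-envelope sketch under assumptions: throughput is one instruction per cycle, so you need roughly `latency` independent operations in flight, and each independent computation is assumed to need a fixed number of registers (`regs_per_chain`, a hypothetical parameter). The function names are made up for illustration.

```c
#include <assert.h>

/* Independent operations that must be in flight so one can issue every cycle
 * while earlier results are still pending (assumes throughput of 1 per cycle). */
unsigned chains_needed(unsigned latency_cycles)
{
    assert(latency_cycles > 0);
    return latency_cycles;
}

/* How many independent chains the register file can hold at once. */
unsigned chains_that_fit(unsigned num_registers, unsigned regs_per_chain)
{
    assert(regs_per_chain > 0);
    return num_registers / regs_per_chain;
}

/* 1 if the register file can hold enough chains to cover the latency, else 0. */
int can_hide_latency(unsigned num_registers, unsigned regs_per_chain,
                     unsigned latency_cycles)
{
    return chains_that_fit(num_registers, regs_per_chain)
           >= chains_needed(latency_cycles);
}
```

With an assumed 4 registers per computation, `can_hide_latency(32, 4, 12)` (the VMX float case) is 0, while `can_hide_latency(128, 4, 6)` (the SPU float case) is 1. The imagined 16-register SPU, `can_hide_latency(16, 4, 6)`, is 0 as well, and it covers the same fraction of its latency as the VMX case (4 of 6 chains versus 8 of 12).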

Then how do local store and the cache hierarchy compare? As we have less register space, we have fewer opportunities to hide memory accesses with computation. Let's assume that the 2 cycle latency we saw for VXU load/store is the actual guaranteed time it takes to load data into the VMX registers from L1 D$. That would mean we have 32KB which are 3 times faster than the 256KB of LS. To illustrate what that means, let's normalize those numbers. 32KB/2cycles is 16KB/cycle and 256KB/6cycles is 42.7KB/cycle. We learned earlier that the L1 I$ is independent, and the SPU instructions reside in local store, so we either pretend we have twice the L1 cache on the PPU (giving 32KB/cycle for PPU) or that we only have 224KB of LS (37.3KB/cycle for SPU). And that's assuming you actually use all 32KB of I$. In either case, the SPU wins here before we even factor in register set size.
You can do the same comparison with L2 if you have the numbers, but let's just say that L2 is twice the size of LS and more than twice the latency.
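
For anyone who wants to plug in their own numbers, the KB/cycle normalization is a one-liner. The helper name `kb_per_cycle` is made up; the sizes and latencies in the note below are the ones quoted in this post, including the assumption that the 2-cycle VXU load/store figure is the real L1 D$ load latency.

```c
#include <assert.h>

/* Capacity divided by access latency: the KB/cycle figure of merit used above. */
double kb_per_cycle(double size_kb, double latency_cycles)
{
    assert(latency_cycles > 0.0);
    return size_kb / latency_cycles;
}
```

`kb_per_cycle(32, 2)` gives 16 for the L1 D$ alone and `kb_per_cycle(64, 2)` gives 32 if you count the I$ as well. `kb_per_cycle(256, 6)` gives about 42.7 for the full local store and `kb_per_cycle(224, 6)` about 37.3 if 32KB of it is set aside for code.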

So what good is the bytes/cycle metric? It's basically an estimator of how many simultaneous data instances you can have in the system at any given level. If it takes twice as long for new data to arrive, I'll need twice as much data readily available to prevent stalls.
This of course means that having a higher value further down the memory hierarchy is only useful if you can support that many instances at higher levels. This is the entire point of a cache hierarchy.
Using this metric, the SPUs shine with their low-latency data pipes and large register file.

When doing post-processing, a data instance can be either a pixel or a scanline or a tile. If it's a pixel, chances are a PPE can compete with an SPE (on a one-to-one basis, not one PPE vs. 6 SPEs). If it's a scanline, this becomes a lot harder. The reason for this is simple: If I'm tight on memory, the SPE will need less than half the memory that the PPE needs to stay saturated. So the SPE has a much better chance of not needing to hit main memory more than once per scanline. Once you need to loop the data through main memory, you are consuming precious main memory bandwidth (of which the PS3 has plenty, but not nearly as much as aggregate LS bandwidth), which is bad for many reasons.
</delurk>
 