Was Cell any good? *spawn

Status
Not open for further replies.
We're not on Joker's blog here, it's a discussion forum.
Sure, Joker's contributions are always interesting, but I'd be very surprised if anyone here thought he never got anything wrong.
If you choose to believe only one person has all the answers, that's up to you.

I don't believe only one person has all the answers, but I do believe I'll take the word of a developer over a forum poster with an agenda. If other devs wish to discuss such matters with Joker, great I'd love to read it, but that isn't what's going on here. ;)
 
Starting from the top: Why do you care about instruction latencies and not just throughput? In terms of throughput the two PEs are pretty much the same, after all.
This comes down to the dependencies and critical path of your computation. You need to wait a number of cycles defined by the instruction latency, before you can use the result of that instruction. If you don't want to stall the chip, you will need to have other work to do in that time. And this basically means having more than one computation in flight at a time and interleaving those. Of course, to do that you need to be able to store the data for all those computations, which first and foremost means you need more register space the higher your latencies are. So the SPUs have 4 times the registers and half the latency. That's a pretty major advantage. Just imagine the SPU only had 16 registers, which would be equivalent to the VMX.
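The interleaving trick described above can be sketched in plain C. This is a minimal illustration (the function name `sum_interleaved` is hypothetical, not from any SDK): with a single accumulator, every add must wait out the full instruction latency of the previous add; with four independent accumulators, four additions can be in flight at once, which is exactly why more registers let you hide more latency.

```c
#include <stddef.h>

/* Illustrative sketch: breaking a dependency chain with multiple
 * accumulators, the same trick used to hide instruction latency on
 * both the SPU and VMX. Each accumulator forms an independent chain,
 * so the pipeline doesn't stall waiting on the previous add. */
float sum_interleaved(const float *x, size_t n)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i + 0];   /* four independent chains in flight at once */
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    for (; i < n; ++i)    /* remainder loop */
        a0 += x[i];
    return (a0 + a1) + (a2 + a3);
}
```

The register-space point falls out directly: four chains need four times the live registers, so a 128-entry register file supports far deeper interleaving than a 32-entry one.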

I agree that you need registers to help hide latency on VMX... but are you sure the above is correct? The 360's implementation of VMX is slightly modified so that it's improved over the PS3 version. From what I recall they upped the register count to 128 as well as adding additional instructions.
 
Yes, that's true, but the original comparison was just PPU VMX, and comparisons with other processors will be with smaller register sets, unless they're all sporting 128-register vector units these days! It's actually quite an important consideration, what Xenon's large VMX units bring to the table, if anything. Sadly we don't get in-depth feedback on those, so we can't compare massive register sets on a future processor.
 
I agree that you need registers to help hide latency on VMX... but are you sure the above is correct? The 360's implementation of VMX is slightly modified so that it's improved over the PS3 version. From what I recall they upped the register count to 128 as well as adding additional instructions.

I was comparing the CBEA components, as the docs are openly available. Xenon is a lot less publicly documented.
However, IBM did talk about it a bit here, so yes, 128 registers with some restrictions and additions to the base VMX. They added a few AoS instructions and D3D format conversions.

The L2 is also increased to 1MB, but it's now shared over the three cores, giving you effectively less L2 per unit. You win some, you lose some.

All in all this makes VMX128 more capable than VMX32 in small-dataset floating point loops, but it's still not in the same league as an SPE, primarily due to the higher instruction latency.
 
I was comparing the CBEA components, as the docs are openly available. Xenon is a lot less publicly documented.
However, IBM did talk about it a bit here, so yes, 128 registers with some restrictions and additions to the base VMX. They added a few AoS instructions and D3D format conversions.

The L2 is also increased to 1MB, but it's now shared over the three cores, giving you effectively less L2 per unit. You win some, you lose some.

All in all this makes VMX128 more capable than VMX32 in small-dataset floating point loops, but it's still not in the same league as an SPE, primarily due to the higher instruction latency.

1MB of L2 shared three ways sounds like a hit, but comparatively you only realistically get around ~110k or so of usable local store on each SPU after you double buffer data and account for code + stack.

Personally I'd say VMX128 is much more capable than VMX32, because VMX is very sensitive to latency without a large register set, and because it's more subject to The Intern Effect (tm) than SPUs are. So going from 32 to 128 is a huge improvement! Also, generally speaking, you are always dealing with small data sets on both machines because of the local store size on the SPU and the register count on VMX. Well, OK, more like large data sets processed in really small chunks, but the net result is basically the same: latency becomes much more manageable on VMX with the way data is churned through on both consoles. For the stuff sebbbi mentions, large data sets with random access, you shift those to the GPU.


I agree that every developer has different views and methods on how to approach a given piece of hardware. However I'm also sure in the years of development, a typical developer would be familiar and likely have tried many different methods to use said hardware. On top of this, unless I'm mistaken, before Joker stopped developing to pursue other opportunities, he was primarily a PS3 developer.

Don't sweat it daddio, come January it will be 3 years since I touched console code so my memory is starting to get a bit hazy on some stuff. Feel free to treat me like one of the guys at this point :)
 
1MB of L2 shared three ways sounds like a hit, but comparatively you only realistically get around ~110k or so of usable local store on each SPU after you double buffer data and account for code + stack.

I'd disagree with this on two levels. First, the equivalent to LS on a VMX in my mind is L1, not L2. L2 is far, far away in terms of latency. If all your data fits into L1 and is nicely laid out, I'd argue that VMX128 and the SPE are on more or less equal footing, assuming there is no second thread interfering with your VMX code and polluting its cache. If you're sufficiently hardcore, an SPE will still win in many cases due to funky things you can do with the MFC, the IMO more powerful ODD vs Type 2 and the less restrictive ISA, but that's a level of engineering you'll rarely ever see.
Of course, the assumption of dropping SMT and fitting everything into 32KB means that we're probably talking about significantly more engineering effort on the VMX than on the SPE, paradoxically. Does anyone run only one thread on a core to maximize performance? I don't think I've ever seen that.

The second part I disagree with is the 110k. Let's talk actual numbers again. Edge MLAA is less than 40k of code all in all and - depending on the exact configuration - north of 200k worth of data. Some of that data will be in flight at any given time and some of it will be actively processed. This is not different from a cache, which also has some data in flight and some actively useable. Actually, you can use the MFC to do a really tight DMA loop where only the minimum amount of data is in flight at any given time. This is much, much harder to do with a prefetcher, since the control is much more indirect.
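The "tight DMA loop" idea above — keeping only the minimum amount of data in flight — is the classic double-buffered streaming pattern. Here's a sketch in plain C, with the DMA transfers stood in for by `memcpy`; on a real SPU these would be `mfc_get`/`mfc_put` plus tag-group waits, and the chunk size would be tuned to the transfer. Everything here (`stream_double_buffered`, `CHUNK`, the doubling "kernel") is illustrative, not actual SDK API.

```c
#include <string.h>

#define CHUNK 4   /* elements per "DMA" transfer; tiny for illustration */

/* The compute kernel: a trivial stand-in that doubles each element. */
static void process_chunk(int *buf, int count)
{
    for (int i = 0; i < count; ++i)
        buf[i] *= 2;
}

/* Double-buffered streaming: while chunk N is processed out of one
 * local buffer, chunk N+1 is already being fetched into the other. */
void stream_double_buffered(const int *src, int *dst, int n)
{
    int ls[2][CHUNK];                              /* two "local store" buffers */
    int in_flight = n < CHUNK ? n : CHUNK;
    memcpy(ls[0], src, in_flight * sizeof(int));   /* prime buffer 0 ("DMA get") */

    int cur = 0;
    for (int base = 0; base < n; base += CHUNK) {
        int count = in_flight;
        int next_base = base + CHUNK;
        if (next_base < n) {                       /* start fetch of next chunk */
            in_flight = (n - next_base) < CHUNK ? (n - next_base) : CHUNK;
            memcpy(ls[cur ^ 1], src + next_base, in_flight * sizeof(int));
        }
        process_chunk(ls[cur], count);             /* compute on current chunk */
        memcpy(dst + base, ls[cur], count * sizeof(int));  /* "DMA put" */
        cur ^= 1;                                  /* swap buffers */
    }
}
```

With real asynchronous DMA, the fetch of chunk N+1 overlaps the compute on chunk N, so only CHUNK-sized slices of the data set are ever resident — which is the sense in which control over in-flight data is much more direct than with a hardware prefetcher.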

Personally I'd say VMX128 is much more capable than VMX32, because VMX is very sensitive to latency without a large register set, and because it's more subject to The Intern Effect (tm) than SPUs are. So going from 32 to 128 is a huge improvement!

Absolutely. My point was not that VMX128 is weak. It isn't. But it's not as powerful as the SPE. And quite frankly, if IBM could not have engineered a new specialized core (the SPE) that beats an extended version of one of their old cores at a very specialized job, they would not be some of the best processor designers in the business.
Looking at it from the other side, some of the VMX128 extensions allow it to beat SPEs at some very specialized tasks by a fair margin as well. But I'd say those cases are rarer.

Also, generally speaking, you are always dealing with small data sets on both machines because of the local store size on the SPU and the register count on VMX. Well, OK, more like large data sets processed in really small chunks, but the net result is basically the same: latency becomes much more manageable on VMX with the way data is churned through on both consoles. For the stuff sebbbi mentions, large data sets with random access, you shift those to the GPU.

The comparison to GPUs is an interesting one, because GPUs are a whole different class of processors. VMX and SPE are sort of designed for the same thing, with pretty different parameters, but they are still very comparable. If you don't care about 25% or even 100% difference core-for-core, then yes, VMX and SPE look a lot alike. They are both 4 wide dual issue SIMD units attached to a bit of fast memory and clocked at 3.2GHz. Within that class however, they do differ in the details and those details are exploitable, if you are willing to spend the engineering effort, which is not always a sound investment for all teams.
 
When doing post-processing, a data instance can be either a pixel or a scanline or a tile. If it's a pixel, chances are a PPE can compete with an SPE (on a one-to-one basis, not one PPE vs. 6 SPEs). If it's a scanline, this becomes a lot harder. The reason for this is simple: if I'm tight on memory, the SPE will need less than half the memory that the PPE needs to stay saturated. So the SPE has a much better chance of not needing to hit main memory more than once per scanline. Once you need to loop the data through main memory, you are consuming precious main memory bandwidth (of which the PS3 has plenty, but not nearly as much as aggregate LS bandwidth), which is bad for many reasons.
(I have to admit, I was comparing SPU to VMX128 instead of the less powerful VMX32)

A well programmed loop would never access main memory more than once for reading each pixel and once for writing each pixel. If you have fat pixels, you might need more than L1d and 128 VMX registers to hide pipeline latency, so your algorithm might sometimes hit the 1MB L2. But if the post process algorithm requires more than 1 MB of memory to hide the pipeline latency, there's something badly wrong in the code (as the data access pattern is very cache friendly).

Both the SPU and the VMX do exactly the same number of loads and stores to main memory. Each pixel is read once and written once. You do not even need to add any manual cache control instructions to reach this on VMX. However, if you add manual cache control instructions, you are pretty much guaranteed to always hit L1d (assuming 8888 format pixels, you can have 8192 of them simultaneously in the 32KB L1d), since the post process loop doesn't have branches (it's very easy to predict how long it's going to execute on an in-order CPU, so you can put the cache prefetch instructions in ideal places).
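A minimal sketch of placing prefetches in a branch-free pixel loop. `__builtin_prefetch` is the GCC/Clang portable stand-in for the PPC `dcbt` instruction a 360 programmer would actually use; the effect (`brighten`), the halving "kernel", and the prefetch distance are all illustrative choices, not from any real codebase.

```c
#include <stdint.h>

/* One 128-byte cache line holds 32 8888 pixels, so prefetching 32
 * pixels ahead touches the next line while the current one is used.
 * The ideal distance depends on memory latency vs. per-pixel cost. */
#define PREFETCH_AHEAD 32

void brighten(const uint32_t *src, uint32_t *dst, int n)
{
    for (int i = 0; i < n; ++i) {
        if (i + PREFETCH_AHEAD < n)
            /* read-only (0), no temporal locality needed after use (0) */
            __builtin_prefetch(&src[i + PREFETCH_AHEAD], 0, 0);
        uint32_t p = src[i];
        /* trivial stand-in "effect": halve each 8-bit channel */
        dst[i] = (p >> 1) & 0x7F7F7F7Fu;
    }
}
```

Because the trip count and stride are fully predictable with no branches in the body, the prefetch lands exactly where it's needed every iteration — which is the in-order-CPU advantage the post above describes.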

Also VMX128 includes 3d/4d dot products. This helps if the input/output data is in AoS layout. Without it, you need to either have the pixel data interleaved in SoA layout (difficult, as the data comes from the GPU) or transform it to SoA layout (more instructions). Also, when calculating dot products in SoA layout, you do four at a time, and thus need more registers. AoS dot products can relieve the register pressure. VMX128 also has fast (low latency) float16->float32->float16 conversion/packing (and conversion/packing to other pixel formats as well), so it's pretty well capable of processing pixel data in all of the currently used LDR and HDR formats. Many other vector processing algorithms also benefit from fast loading/storing of values as 16 bit floats in memory (halving the cache/memory footprint compared to 32 bit float vectors). In many cases you do not need more precision.
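The AoS-vs-SoA distinction above can be made concrete in scalar C (the function names here are illustrative). In AoS, one vector's x/y/z/w are adjacent, so a dedicated dot-product instruction can consume a single packed register directly; in SoA, you process four (or more) dot products side by side out of separate component streams, which keeps every SIMD lane busy but needs roughly 3-4x the live registers.

```c
/* AoS: components of one vector are adjacent in memory. */
typedef struct { float x, y, z, w; } vec4;

float dot_aos(const vec4 *a, const vec4 *b)
{
    /* maps directly onto a single packed dot-product instruction */
    return a->x * b->x + a->y * b->y + a->z * b->z + a->w * b->w;
}

/* SoA: each component lives in its own stream; n dot products are
 * computed in parallel, one per "lane" of the loop. */
void dot_soa(const float *ax, const float *ay, const float *az, const float *aw,
             const float *bx, const float *by, const float *bz, const float *bw,
             float *out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = ax[i] * bx[i] + ay[i] * by[i]
               + az[i] * bz[i] + aw[i] * bw[i];
}
```

The trade-off in the post follows: without an AoS dot-product instruction you either pay a transpose to SoA, or burn extra registers running several SoA dot products in flight.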

Naughty Dog's lighting stuff:
http://www.naughtydog.com/docs/gdc2010/intro-spu-optimizations-part-2.pdf
On SPU: AoS = 10.75 cycles, SoA = 7.75 cycles. Not having AoS dot products hurts the SPUs a bit when you have to process AoS data. Working in SoA layout is often the preferred way, but that's not always possible.

With longer VMX instruction latencies, getting the pipelines 100% utilized is of course a harder task, especially when doing it by hand like Naughty Dog does in their SPU lighting code. But it's not impossible; you basically just need to manually interleave the processing of a few pixels. The positive thing however is that code hot spots are often self contained, and optimizing the short inner loops is often enough to get good performance. With a Cell-like architecture, the whole game program needs to be adapted and optimized to suit the system, or the performance will be really poor. With a more traditional cache based UMA system, you only need to optimize the hotspots (= less than 1% of the whole code); the CPU automatically runs other code well enough.
 
How about chaining? E.g. if I understood it correctly (I did read most of the manual, even if I never really worked with it beyond running sample code back when the PS3 still had Linux), you can chain SPUs with no additional delay to the pipeline. E.g. you could assign the task to one SPU, which does some basic work on the data, divides up the work and passes it to two other SPUs. Can you do something similar with the three VMX128s in the 360's CPU?
 
Seems as if we agree for the most part. :)

A well programmed loop would never access main memory more than once for reading each pixel and once for writing each pixel. If you have fat pixels, you might need more than L1d and 128 VMX registers to hide pipeline latency, so your algorithm might sometimes hit the 1MB L2. But if the post process algorithm requires more than 1 MB of memory to hide the pipeline latency, there's something badly wrong in the code (as the data access pattern is very cache friendly).

Consider a scatter algorithm, like IIR gaussian approximation. You'll need to do a forward and a backward pass over each row and column and the intermediate values need to be float precision if you want extreme blurs. 1280*3*sizeof(float) = 15360B, and that's assuming you somehow got rid of the alpha channel.
Even if everything is gather based, you can have more data to gather in LS than in L1.
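For reference, the scatter pattern being described is a one-pole recursive (IIR) filter run forward and then backward over each row, the core of IIR Gaussian approximations. A toy version in C (the coefficient here is arbitrary, not a fitted Gaussian coefficient; `iir_blur_row` is an illustrative name): the key property is that every output depends on the previous output, in both directions, so the whole row's float intermediates must stay resident.

```c
/* Toy one-pole IIR smoothing pass over one row, forward then backward.
 * 'a' in (0,1) controls smoothing strength; real IIR Gaussian
 * approximations use several fitted poles instead of one. */
void iir_blur_row(float *row, int n, float a)
{
    /* forward pass: row[i] depends on the just-written row[i-1] */
    for (int i = 1; i < n; ++i)
        row[i] = row[i] + a * (row[i - 1] - row[i]);
    /* backward pass: row[i] depends on the just-written row[i+1] */
    for (int i = n - 2; i >= 0; --i)
        row[i] = row[i] + a * (row[i + 1] - row[i]);
}
```

The memory math in the post follows from this: a 1280-wide row of 3-channel float intermediates is 1280*3*4 = 15360 bytes, which already fits poorly in a 32KB L1d next to everything else, but comfortably in a 256KB local store.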

If your effect has nicely independent pixels, then of course, as I stated earlier, it's a pretty even playing field. And if you do, say, a tonemapping, VMX128 should be a good chunk faster. Again, I'm not saying VMX128 is bad (or even VMX32 is bad) or that it doesn't have cases where it can be faster than an SPE.

This is more a case of SPEs being able to efficiently run a wider class of algorithms at high utilization. I'll need to think about if there is an interesting class of algorithms at which the VMX will be significantly faster for architectural reasons. The L2 cache lines are 128B, so that's a pretty DMA-able size...

With a Cell-like architecture, the whole game program needs to be adapted and optimized to suit the system, or the performance will be really poor. With a more traditional cache based UMA system, you only need to optimize the hotspots (= less than 1% of the whole code); the CPU automatically runs other code well enough.

If this has been your experience with writing PS3 games, then kudos to you guys for going all the way. This is not usually how it works. :)
I really don't think a lot of games have significantly more than 1% of their codebase on the SPUs, but that's just a gut feeling.
In any case, I can't really talk too much about ease of development, since I've not done a whole lot of VMX128 coding. So I'll stick to commenting on chip design, where I actually might know what I'm talking about. :)
 
Seems as if we agree for the most part. :)



Consider a scatter algorithm, like IIR gaussian approximation. You'll need to do a forward and a backward pass over each row and column and the intermediate values need to be float precision if you want extreme blurs. 1280*3*sizeof(float) = 15360B, and that's assuming you somehow got rid of the alpha channel.
Even if everything is gather based, you can have more data to gather in LS than in L1.

If your effect has nicely independent pixels, then of course, as I stated earlier, it's a pretty even playing field. And if you do, say, a tonemapping, VMX128 should be a good chunk faster. Again, I'm not saying VMX128 is bad (or even VMX32 is bad) or that it doesn't have cases where it can be faster than an SPE.

This is more a case of SPEs being able to efficiently run a wider class of algorithms at high utilization. I'll need to think about if there is an interesting class of algorithms at which the VMX will be significantly faster for architectural reasons. The L2 cache lines are 128B, so that's a pretty DMA-able size...



If this has been your experience with writing PS3 games, then kudos to you guys for going all the way. This is not usually how it works. :)
I really don't think a lot of games have significantly more than 1% of their codebase on the SPUs, but that's just a gut feeling.
In any case, I can't really talk too much about ease of development, since I've not done a whole lot of VMX128 coding. So I'll stick to commenting on chip design, where I actually might know what I'm talking about. :)

Do you work for IBM? If so, give us a hint... Will we see VMX units in the next gen? Improved SPUs?
 
I really don't think a lot of games have significantly more than 1% of their codebase on the SPUs, but that's just a gut feeling.

Going by ELF sizes, around 20% of my game is SPU code. So my estimate would be... probably a lot higher than 1% for the typical game. I definitely know of titles that used 0% SPU, but they didn't have much going on, and that was at the start of the console gen.

I don't have internal knowledge of more than 3 or 4 titles that shipped in the last 2 years, but I feel like I have a pretty good grasp on what can be done with PPU alone... I'm going to disagree with both your take on it and sebbi's.
 
Do you work for IBM? If so, give us a hint... Will we see VMX units in the next gen? Improved SPUs?

T.B. is one of the (2?) devs who wrote the God of War 3 MLAA. The module is written such that developers can plonk in the MLAA code easily (if they have spare SPU cycles).
 
T.B. is one of the (2?) devs who wrote the God of War 3 MLAA.

Well, then he could answer those questions as well ;).

Now seriously, I have a real, answerable question for him or anyone else in the know. What could be done to the SPUs to make them more flexible and easier to program in an improved Cell version?

Add dynamic branching? OoO capabilities? Add integer units? Increase the local store, or replace it with a cache?
 
If your effect has nicely independent pixels, then of course, as I stated earlier, it's a pretty even playing field. And if you do, say, a tonemapping, VMX128 should be a good chunk faster. Again, I'm not saying VMX128 is bad (or even VMX32 is bad) or that it doesn't have cases where it can be faster than an SPE.

This is more a case of SPEs being able to efficiently run a wider class of algorithms at high utilization. I'll need to think about if there is an interesting class of algorithms at which the VMX will be significantly faster for architectural reasons. The L2 cache lines are 128B, so that's a pretty DMA-able size...
Agreed. SPUs should perform better when the data set required for each pixel is larger than L1d (32 KB), but smaller than the local store size (or half of it minus code = ~128 KB, since you need double buffering to load the next data while processing the old). But once the data set is larger than the local store, the VMX128 processing would simply start using the 1 MB L2 cache instead of the faster L1d, while the SPU code would need to frantically swap data to/from main memory (slowing it down to a crawl). But all this is pretty much academic debate, since you only have one PPC core on PS3, and six SPUs. Using the only general purpose CPU core for post process pixel processing would be quite an inefficient approach :)

If this has been your experience with writing PS3 games, then kudos to you guys for going all the way. This is not usually how it works. :)
I really don't think a lot of games have significantly more than 1% of their codebase on the SPUs, but that's just a gut feeling.
I haven't written code for PS3, just for Xbox 360 (and older Sony consoles). But I have of course followed PS3 game development quite closely. SPU programming articles (like that Naughty Dog one) are very interesting read for me, since I do most of our low level vector optimizations (and all our GPGPU stuff). It's interesting to compare different vector architectures.
 
Agreed. SPUs should perform better when the data set required for each pixel is larger than L1d (32 KB), but smaller than the local store size (or half of it minus code = ~128 KB, since you need double buffering to load the next data while processing the old). But once the data set is larger than the local store, the VMX128 processing would simply start using the 1 MB L2 cache instead of the faster L1d, while the SPU code would need to frantically swap data to/from main memory (slowing it down to a crawl).


I haven't written code for PS3, just for Xbox 360 (and older Sony consoles). But I have of course followed PS3 game development quite closely. SPU programming articles (like that Naughty Dog one) are very interesting read for me, since I do most of our low level vector optimizations (and all our GPGPU stuff). It's interesting to compare different vector architectures.

What would you modify in the SPUs to make them easier to program, and less of a pain, while maintaining their capabilities? Could modifications be made that would remove the need for a PPU in a new design?
 
Agreed. SPUs should perform better when the data set required for each pixel is larger than L1d (32 KB), but smaller than the local store size (or half of it minus code = ~128 KB, since you need double buffering to load the next data while processing the old). But once the data set is larger than the local store, the VMX128 processing would simply start using the 1 MB L2 cache instead of the faster L1d, while the SPU code would need to frantically swap data to/from main memory (slowing it down to a crawl). But all this is pretty much academic debate, since you only have one PPC core on PS3, and six SPUs. Using the only general purpose CPU core for post process pixel processing would be quite an inefficient approach :)

Hmm... it should be common for SPUs to tear through data sets larger than 128K by streaming/staggering the data via DMA. The main issue is random access data, or data with too many dependencies (can't fetch early enough or in parallel). Most graphics jobs are highly parallelizable.

Developers can also combine similar jobs together (both code and data) to make a good/bigger batch size.

Once they are satisfied with a single SPU implementation, they will have more cores to distribute the workload in a predictable way.
 
T.B. is one of the (2?) devs who wrote the God of War 3 MLAA. The module is written such that developers can plonk in the MLAA code easily (if they have spare SPU cycles).
He was also involved in Sacred 2 development before moving to Sony, or am I confused?
I know there are two members from the gaming industry who use two-letter pseudonyms; I tend to confuse them from time to time :???:
 
T.B. worked on Sacred 2 and moved to Sony's ATG in Cambridge. So he'll be working at a lower level on SPUs than most, being in the luxurious position of just having to develop technologies without product deadlines to worry about, but perhaps hasn't the same experience with something like VMX128 (I don't think he programmed the XB360 during Sacred 2). Whereas sebbbi is all XB360 and no SPU experience. :D
 
What would you modify in the SPUs to make them easier to program, and less of a pain, while maintaining their capabilities? Could modifications be made that would remove the need for a PPU in a new design?

If I were to loosely summarise what I have heard from some developers (and I could be very wrong) they would ideally have the local stores be a unified cache addressable in main address space. I imagine that you could then still lock parts of that memory space so that it works similar to local store in terms of predictability, but you have more flexibility, and make it easier to use for those who do not code at low level.
 
T.B. is one of the (2?) devs who wrote the God of War 3 MLAA. The module is written such that developers can plonk in the MLAA code easily (if they have spare SPU cycles).

Oh my... he is a god to me :oops: Pretty curious to know whether FXAA would be possible on the PS3 through the SPUs, and how it would work.
 